
Fix check_nvidia to support running multiple single GPU training / inference at the same time #856

Open · wants to merge 2 commits into main
Conversation

@grll commented Aug 1, 2024

I know multi-GPU support is a work in progress, but in the meantime we could let people run single-GPU jobs multiple times in separate Python processes by specifying, via CUDA_VISIBLE_DEVICES, the GPU each process should run on.

This patch fixes check_nvidia to allow that use case.
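
To illustrate the intended use case, here is a minimal launcher sketch (not part of this PR; `train.py` and the GPU count are placeholders for your own single-GPU script and machine):

```python
# Sketch: launch one single-GPU training process per GPU by restricting
# CUDA_VISIBLE_DEVICES for each child process. "train.py" is a placeholder
# for any single-GPU training / inference script.
import os
import subprocess

NUM_GPUS = 2  # adjust to the number of GPUs on your machine

procs = []
for gpu_id in range(NUM_GPUS):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # this child only sees one GPU
    procs.append(subprocess.Popen(["python", "train.py"], env=env))

# Wait for all training processes to finish
for p in procs:
    p.wait()
```

From each child's point of view this is still a plain single-GPU run; the idea is that check_nvidia should tolerate this setup rather than reject it.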

@danielhanchen (Contributor) commented:
Interesting! Will check this out! Thanks!

@Sehyo commented Aug 30, 2024

Hi! It seems that this fix simply ignores the exceptions and does not patch the trainer the way the original code does.
I have published a new pull request that fixes the problem properly: #974 (review)
@danielhanchen
