Existing cache files lead to permanent DataLoader hang #398

Open · lilavocado opened this issue Oct 16, 2024 · 3 comments
Labels: bug, help wanted

@lilavocado commented Oct 16, 2024
🐛 Bug

I'm currently using litdata completely locally: I first convert the dataset with `optimize`, then use `StreamingDataset` to stream the records from a local directory to train my model. I want to train multiple models (using the same dataset) in parallel, but the cache files created by previous runs end up blocking the `StreamingDataset` of later runs (probably due to locking?). It took me quite a while to figure out that the freeze was caused by the cache files.
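
Roughly, the setup looks like this (the paths, sample function, and batch/worker counts below are illustrative placeholders, not the actual code):

```python
from litdata import StreamingDataset, StreamingDataLoader, optimize

def to_sample(index):
    # Placeholder preprocessing: returns one record per input item.
    return {"index": index}

if __name__ == "__main__":
    # One-time conversion of the raw data into litdata's optimized format.
    optimize(
        fn=to_sample,
        inputs=list(range(1_000)),
        output_dir="/data/optimized",  # placeholder local directory
        chunk_bytes="64MB",
        num_workers=4,
    )

    # Each training run then streams from the same local directory.
    # Launching several of these scripts in parallel is what triggers the hang.
    dataset = StreamingDataset(input_dir="/data/optimized")
    loader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)
    for batch in loader:
        ...  # training step
```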

My workaround for now is to create a new cache directory for each run, following the documentation, using the `Dir` class from `resolver.py`. This was a bit confusing at first because `Dir` takes the arguments `url` and `path`, which makes it seem like it only works when your data is in the cloud (`url`). It would have made more sense if the arguments were something like `path` (either a URL or a local directory) and `cache_dir` (the directory in which to store the cache).
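
Concretely, the workaround looks roughly like this (this assumes `Dir.url` points at the optimized data and `Dir.path` at a run-specific cache directory; paths below are placeholders):

```python
import uuid
from litdata import StreamingDataset
from litdata.streaming.resolver import Dir

# Give every training run its own cache directory so concurrent runs
# don't step on each other's cache/lock files.
run_cache = f"/tmp/litdata_cache/{uuid.uuid4()}"

dataset = StreamingDataset(
    input_dir=Dir(
        path=run_cache,          # where this run stores its cache
        url="/data/optimized",   # where the optimized dataset lives (placeholder)
    ),
)
```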

The question I had was: why does it have to cache data when all the data is already available locally?

It would be great if `StreamingDataset` directly took an argument like `cache_dir`.
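
For illustration, the requested API might look something like this (`cache_dir` is hypothetical here, not an existing `StreamingDataset` argument):

```python
import uuid
from litdata import StreamingDataset

dataset = StreamingDataset(
    input_dir="/data/optimized",                     # placeholder local path
    cache_dir=f"/tmp/litdata_cache/{uuid.uuid4()}",  # proposed per-run cache argument
)
```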

Thanks.

lilavocado added the `bug` and `help wanted` labels on Oct 16, 2024

@tchaton (Collaborator) commented Oct 16, 2024

Hey @lilavocado. This library is designed to work with cloud data first, and I haven't faced this issue before.

Could you provide a simple reproducible script?

@emileclastres commented

Hi, I have experienced a similar problem when using a `StreamingDataLoader` on two concurrent Studios with multiple workers on the Lightning platform. After a while, the multi-process dataloader would simply stop iterating batches without throwing an error, and GPU utilization would drop to 0 on one or a few GPUs. It looks to me like the `/cache/` folder is shared between the runs and is the root of the issue (maybe hash collisions?).

Simply exposing a `cache_dir` argument to `StreamingDataset` could be a convenient fix.
