Existing cache files lead to permanent DataLoader hang #398

Open · lilavocado opened this issue Oct 16, 2024 · 3 comments
Labels: bug, help wanted

@lilavocado commented Oct 16, 2024
🐛 Bug

I'm currently using litdata completely locally: I first convert the dataset with `optimize`, then use `StreamingDataset` to stream the records from a local directory to train my model. I want to train multiple models (using the same dataset) in parallel, but the cache files created by previous runs end up blocking the `StreamingDataset` of later runs (probably due to locking?). It took me quite a while to figure out that the freeze was caused by the cache files.
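
Roughly, the setup looks like this (the paths, sample function, and batch/worker counts below are illustrative placeholders, not the actual code):

```python
from litdata import StreamingDataset, StreamingDataLoader, optimize

def to_sample(index):
    # Placeholder preprocessing: returns one record per input item.
    return {"index": index}

if __name__ == "__main__":
    # One-time conversion of the raw data into litdata's optimized format.
    optimize(
        fn=to_sample,
        inputs=list(range(1_000)),
        output_dir="/data/optimized",  # placeholder local directory
        chunk_bytes="64MB",
        num_workers=4,
    )

    # Each training run then streams from the same local directory.
    # Launching several of these scripts in parallel is what triggers the hang.
    dataset = StreamingDataset(input_dir="/data/optimized")
    loader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)
    for batch in loader:
        ...  # training step
```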

My workaround for now is to create a new cache directory for each run, following the documentation, using the `Dir` class from `resolver.py`. This was a bit confusing at first because `Dir` takes the arguments `url` and `path`, which makes it seem like it only works when your data is in the cloud (`url`). It would have made more sense if the arguments were something like `path` (either a URL or a local directory) and `cache_dir` (the directory in which to store the cache).
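
Concretely, the workaround looks roughly like this (this assumes `Dir.url` points at the optimized data and `Dir.path` at a run-specific cache directory; paths below are placeholders):

```python
import uuid
from litdata import StreamingDataset
from litdata.streaming.resolver import Dir

# Give every training run its own cache directory so concurrent runs
# don't step on each other's cache/lock files.
run_cache = f"/tmp/litdata_cache/{uuid.uuid4()}"

dataset = StreamingDataset(
    input_dir=Dir(
        path=run_cache,          # where this run stores its cache
        url="/data/optimized",   # where the optimized dataset lives (placeholder)
    ),
)
```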

The question I had was: why does it have to cache data when all the data is already available locally?

It would be great if `StreamingDataset` directly took an argument like `cache_dir`.
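
For illustration, the requested API might look something like this (`cache_dir` is hypothetical here, not an existing `StreamingDataset` argument):

```python
import uuid
from litdata import StreamingDataset

dataset = StreamingDataset(
    input_dir="/data/optimized",                     # placeholder local path
    cache_dir=f"/tmp/litdata_cache/{uuid.uuid4()}",  # proposed per-run cache argument
)
```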

Thanks.

lilavocado added the `bug` and `help wanted` labels on Oct 16, 2024

@tchaton (Collaborator) commented Oct 16, 2024

Hey @lilavocado. This library is designed to work with cloud data first, and I haven't faced this issue before.

Could you provide a simple reproducible script?

@emileclastres commented

Hi, I have experienced a similar problem when using a `StreamingDataLoader` on two concurrent Studios with multiple workers on the Lightning platform. After a while, the multi-process dataloader would simply stop iterating batches without throwing an error, and GPU utilization would drop to 0 on one or a few GPUs. It looks to me like the `/cache/` folder is shared between the runs and is the root of the issue (maybe hash collisions?).

Simply exposing a `cache_dir` argument to `StreamingDataset` could be a convenient fix.
