shared cache: address O(P) scalability of SCR metadata files #493

Open
adammoody opened this issue Mar 5, 2022 · 0 comments

adammoody (Contributor) commented Mar 5, 2022

When using a global file system as cache, we need to examine the number of metadata files that SCR creates for each dataset. SCR stores a filemap for each process, plus a number of files produced by the redundancy encoding, even when using the SINGLE scheme. In all, SCR writes 9 metadata files per process per dataset.

```
ls -ltr /dev/shm/$USER/scr.${SLURM_JOBID}/scr.dataset.12/.scr
-rw------- 1  520 Mar  5 14:35 filemap_0

-rw------- 1   44 Mar  5 14:35 reddescmap.er.er
-rw------- 1  312 Mar  5 14:35 reddescmap.er.shuffile
-rw------- 1  178 Mar  5 14:35 reddescmap.er.0.redset
-rw------- 1  548 Mar  5 14:35 reddescmap.er.0.single.grp_1_of_2.mem_1_of_1.redset

-rw------- 1   44 Mar  5 14:35 reddesc.er.er
-rw------- 1  303 Mar  5 14:35 reddesc.er.shuffile
-rw------- 1  178 Mar  5 14:35 reddesc.er.0.redset
-rw------- 1  548 Mar  5 14:35 reddesc.er.0.single.grp_1_of_2.mem_1_of_1.redset
```

When cache is node-local storage, these files are distributed across the compute nodes: each node holds only a small subset, and they are written in parallel. When cache is a global file system, however, they all pile into a single scr.dataset.<id>/.scr directory, and the number of files written to that one directory scales as O(9*P), where P is the number of processes.

That feels extreme, especially since the application itself may write only a single shared file in the dataset. For a large-scale run with P = 16,000, SCR would produce 144,000 metadata files!
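
As a quick way to see this in practice, here is a small Python sketch (illustrative only, not part of SCR) that counts the entries in a dataset's .scr directory and compares the count against the 9-files-per-rank figure above; the directory path and rank count are passed on the command line:

```python
# Count metadata files in a dataset's .scr directory. The directory layout
# and the 9-files-per-rank figure come from the listing and discussion above;
# everything else here is illustrative.
import os
import sys

FILES_PER_RANK = 9  # observed per-process count with SINGLE

scr_meta_dir = sys.argv[1]    # e.g. <cache>/scr.<jobid>/scr.dataset.12/.scr
num_procs = int(sys.argv[2])  # number of MPI ranks in the run

entries = os.listdir(scr_meta_dir)
print(f"{len(entries)} entries in {scr_meta_dir}")
print(f"expected about {FILES_PER_RANK * num_procs} for {num_procs} ranks "
      "when cache is a shared (global) file system")
```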

We have a few options:

  1. Modify SCR to keep these metadata files in node-local storage even when cache is a global file system.
  2. Modify er/shuffile/redset to avoid creating (so many) redundancy metadata files when using SINGLE.
  3. Modify scr/er/shuffile/redset to merge data into fewer physical files, combining data from multiple processes and compute nodes (see the sketch after this list).
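
For option 3, here is a minimal sketch of the kind of aggregation that could shrink the file count (Python with mpi4py, purely illustrative: the function name, group size, JSON format, and filemap_group_<n> file names are assumptions, not part of the scr/er/shuffile/redset APIs). Ranks are split into fixed-size writer groups, each group gathers its per-rank filemap data to one writer, and that writer emits a single combined file:

```python
# Illustrative only: gather per-rank filemap data into one file per group of
# ranks, so the .scr directory holds O(P / RANKS_PER_WRITER) files instead of
# O(P). None of these names come from SCR itself.
import json
import os

from mpi4py import MPI

RANKS_PER_WRITER = 128  # tuning knob: ranks whose metadata share one file


def write_combined_filemap(scr_dir, my_filemap):
    world = MPI.COMM_WORLD
    rank = world.Get_rank()

    # Put every RANKS_PER_WRITER consecutive ranks into their own group;
    # a group may span multiple compute nodes.
    group_id = rank // RANKS_PER_WRITER
    group = world.Split(color=group_id, key=rank)

    # Gather (rank, filemap) pairs to the lowest rank in the group.
    entries = group.gather((rank, my_filemap), root=0)

    if group.Get_rank() == 0:
        # One combined file per group instead of one file per rank.
        path = os.path.join(scr_dir, "filemap_group_%d" % group_id)
        with open(path, "w") as f:
            json.dump({str(r): fm for r, fm in entries}, f)

    group.Free()
```

With RANKS_PER_WRITER = 128, a P = 16,000 run would leave 125 combined filemaps in the directory instead of 16,000 per-rank filemaps; the same idea could be applied to the er/shuffile/redset files.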
adammoody changed the title from "Address O(P) scalability of SCR cache metadata files" to "shared cache: address O(P) scalability of SCR cache metadata files" on Mar 5, 2022
adammoody changed the title from "shared cache: address O(P) scalability of SCR cache metadata files" to "shared cache: address O(P) scalability of SCR metadata files" on Mar 5, 2022