lock files shared between library and scripts #336

Open: wants to merge 1 commit into base: develop


adammoody
Contributor

There are a number of files shared between the library and the run scripts, including flush.scr, halt.scr, nodes.scr, index.scr, and the summary and rank2file maps. In most cases the library does not access these files at the same time as the scripts, but this PR adds read/write locks around the shared files to help avoid NFS caching problems.
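A minimal sketch of the writer side, assuming POSIX `fcntl()` advisory locks. The PR text does not say which locking call it uses, and the helper names here (`lock_whole_file`, `unlock_whole_file`, `write_shared_file`) are hypothetical, not SCR functions. The file is truncated and rewritten only while an exclusive lock is held, and `fsync()` pushes the data to the server before the lock is released.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: block until a POSIX advisory lock of the given
 * type (F_RDLCK for shared, F_WRLCK for exclusive) covers all of fd. */
static int lock_whole_file(int fd, short type)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET; /* l_start = 0, l_len = 0: whole file */
    while (fcntl(fd, F_SETLKW, &fl) == -1) {
        if (errno != EINTR) {
            return -1;
        }
    }
    return 0;
}

/* Hypothetical helper: release the whole-file lock held on fd. */
static int unlock_whole_file(int fd)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    return fcntl(fd, F_SETLK, &fl);
}

/* Hypothetical: replace the contents of a shared file (e.g. flush.scr)
 * while holding an exclusive lock. Returns 0 on success, -1 on error. */
int write_shared_file(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        return -1;
    }

    int rc = -1;
    if (lock_whole_file(fd, F_WRLCK) == 0) {
        /* Truncate only once the lock is held, so a reader that also takes
         * a shared lock never sees a partially written file. */
        if (ftruncate(fd, 0) == 0) {
            const char *p = buf;
            size_t left = len;
            while (left > 0) {
                ssize_t n = write(fd, p, left);
                if (n < 0) {
                    if (errno == EINTR) {
                        continue;
                    }
                    break;
                }
                p += n;
                left -= (size_t)n;
            }
            /* Push the data to the server before dropping the lock. */
            if (left == 0 && fsync(fd) == 0) {
                rc = 0;
            }
        }
        unlock_whole_file(fd);
    }
    close(fd);
    return rc;
}
```

`fcntl()` byte-range locks are used here rather than `flock()` because they are the locks NFS has traditionally forwarded to the server; per the flock(2) man page, older Linux kernels treated `flock()` on NFS as purely local.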

If the library updates the flush file from a compute node and then stops, and the job script then reads the flush file from another node, the script may see a stale copy due to NFS client caching. The hope is that adding read/write locks will help mitigate these caching problems on at least some NFS implementations.
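The reader side of the scenario above can be sketched the same way, again assuming `fcntl()` locks and a hypothetical helper name (`read_shared_file`). It is shown in C for consistency; the actual job scripts may take the lock through a different tool. The idea being relied on is that some NFS clients (Linux among them, per the nfs(5) man page) treat acquiring a POSIX lock as a point at which cached file data is revalidated against the server, though that is not guaranteed on every implementation.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: read up to cap-1 bytes of path into buf while holding a
 * shared (read) lock, then NUL-terminate. Returns bytes read or -1. */
ssize_t read_shared_file(const char *path, char *buf, size_t cap)
{
    if (cap == 0) {
        return -1;
    }
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }

    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_RDLCK;    /* shared: many readers, no concurrent writer */
    fl.l_whence = SEEK_SET; /* l_start = 0, l_len = 0: whole file */

    /* Wait for any writer holding F_WRLCK to finish; on some NFS clients
     * this is also where stale cached pages get revalidated. */
    while (fcntl(fd, F_SETLKW, &fl) == -1) {
        if (errno != EINTR) {
            close(fd);
            return -1;
        }
    }

    size_t total = 0;
    int failed = 0;
    while (total < cap - 1) {
        ssize_t n = read(fd, buf + total, cap - 1 - total);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            failed = 1;
            break;
        }
        if (n == 0) {
            break; /* end of file */
        }
        total += (size_t)n;
    }

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);
    close(fd);

    if (failed) {
        return -1;
    }
    buf[total] = '\0';
    return (ssize_t)total;
}
```

Because these locks are advisory, they only help if both the library and the scripts take them.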

@adammoody
Contributor Author

Well, I gave this a shot, and unfortunately the extra lock/unlock calls don't seem to help on one system where I can reproduce the problem. Bummer!

This may still help on some NFS file systems, though the added lock/unlock calls will presumably slow things down a bit. Sigh.
