
Environment variable option to disable mpibind #24

Open
SteVwonder opened this issue Sep 23, 2021 · 3 comments

@SteVwonder
Member

When Flux is launched under Slurm, we don't want the Flux processes themselves to be bound by mpibind. Instead, mpibind should only be activated once Flux has started. Parts of Flux's documentation currently recommend that users launch Flux under Slurm with srun --mpibind=off flux start, but this only applies to LLNL clusters (open issue).

An alternative would be to suggest that users run MPIBIND=off srun flux start, or set MPIBIND=off in an Lmod file, but this requires mpibind to accept options via environment variables. Once Flux has started, we could unset the MPIBIND environment variable when launching jobs (reactivating mpibind via the Flux mpibind job shell plugin is also an option).
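To make the proposal concrete, here is a minimal sketch in C of the two steps described above: a check a plugin could perform to see whether MPIBIND=off was set, and the later step of unsetting the variable before jobs are launched. Both function names are hypothetical and are not part of mpibind, Slurm, or Flux; accepting only the exact value "off" is also an assumption.

```c
#define _POSIX_C_SOURCE 200809L /* exposes POSIX setenv/unsetenv */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of mpibind's API: returns 1 when the
 * named environment variable is set to exactly "off", otherwise 0. */
int mpibind_env_disabled(const char *name)
{
    assert(name != NULL);
    const char *value = getenv(name);
    return value != NULL && strcmp(value, "off") == 0;
}

/* Hypothetical step for the launcher side: remove the variable so that
 * processes started afterwards no longer inherit the "off" setting.
 * Returns 0 on success and -1 on error, as unsetenv() does. */
int mpibind_env_clear(const char *name)
{
    assert(name != NULL);
    return unsetenv(name);
}
```

With MPIBIND=off in the environment, mpibind_env_disabled("MPIBIND") returns 1; after mpibind_env_clear("MPIBIND") it returns 0, so child processes would be bound again.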

@grondo
Collaborator

grondo commented Sep 23, 2021

All instructions to launch Flux under Slurm run a single broker per node. Does the Slurm mpibind plugin do anything when there is a single process per node? If mpibind does not do any binding by default in that case, then maybe we do not have to give different instructions in Flux for Slurm clusters with or without the mpibind plugin?

@eleon
Collaborator

eleon commented Sep 23, 2021

mpibind does map the process to the hardware when there is a single process per node. One reason is that you may have a node with multiple NUMA domains (sockets). If the caller uses the greedy option of mpibind, the process will get the full node, but if it does not, it will get a single NUMA domain.

We should be able to control this with the Slurm plugin, which is in development.

Two options come to mind:

  1. Follow what @SteVwonder proposes: The Slurm plugin would read MPIBIND=off to disable mpibind.
  2. Change the default value of greedy from off to on in the mpibind C algorithm. With this change, any time there is a single process, it will have access to the full node regardless of the number of NUMA domains. Actually, we could make this change in the Slurm plugin: call mpibind with greedy on.

I need to think more about (2), but if it is indeed a reasonable default, then Flux would not have to change anything.
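The effect of option (2) on the single-process case can be illustrated with a deliberately simplified model. This is not mpibind's algorithm; the function name and its parameters are hypothetical, and it only encodes the behavior described above for one process per node.

```c
#include <assert.h>

/* Simplified model of the single-process case only (hypothetical, not
 * mpibind code): with greedy on the process gets every NUMA domain on
 * the node; with greedy off it gets exactly one. */
int single_task_numa_domains(int num_numa, int greedy)
{
    assert(num_numa >= 1);
    return greedy ? num_numa : 1;
}
```

On a two-socket node, a single Flux broker would be confined to one NUMA domain with greedy off and would see both with greedy on, which is why changing the default could remove the need for a Flux-side workaround.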

@grondo
Collaborator

grondo commented Sep 23, 2021

One idea for (1) would be to use a variable named SLURM_MPIBIND, set as SLURM_MPIBIND=off. That way Flux would not have to unset an mpibind environment variable, since this one is specific to the Slurm mpibind plugin.
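A rough sketch of why a plugin-specific name avoids the unset step: if the Slurm plugin consults only SLURM_MPIBIND, a consumer that looks at a generic MPIBIND variable is unaffected. The function names are hypothetical, and the assumption that some other consumer would read a generic MPIBIND variable is for illustration only.

```c
#define _POSIX_C_SOURCE 200809L /* exposes POSIX setenv/unsetenv for callers */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 when the named variable is set to exactly "off", else 0. */
static int env_is_off(const char *name)
{
    assert(name != NULL);
    const char *value = getenv(name);
    return value != NULL && strcmp(value, "off") == 0;
}

/* Hypothetical check for the Slurm plugin: only its own variable counts. */
int slurm_plugin_skips_binding(void)
{
    return env_is_off("SLURM_MPIBIND");
}

/* Hypothetical check for a consumer of a generic variable (assumption:
 * such a consumer exists and reads MPIBIND). */
int generic_consumer_skips_binding(void)
{
    return env_is_off("MPIBIND");
}
```

With only SLURM_MPIBIND=off exported, the first check is true and the second is false, so nothing has to be removed from the environment before jobs are launched inside Flux.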
