
WBM filtering fails assert on entries_old_corr #139

Open
jackwebersdgr opened this issue Sep 9, 2024 · 3 comments
Labels
help Q&A support issues

Comments

@jackwebersdgr

I'm attempting to compile the filtered WBM dataset in order to test a new model, but ran into this assert:

assert len(entries_old_corr) == 76_390, f"{len(entries_old_corr)=}, expected 76,390"
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: len(entries_old_corr)=256963, expected 76,390

Is the fully filtered WBM dataset stored anywhere else or must it be computed on the fly using this script? Thanks!

@janosh
Owner

janosh commented Sep 9, 2024

strange, i reran that whole script just last month following #121 and it was back to a working state after #122. either way, no need for you to run that file yourself. the data files it creates are listed here, also up on figshare and will be auto-downloaded if you access the corresponding DataFiles attributes. e.g.

import pandas as pd

from matbench_discovery.data import DataFiles

df_summary = pd.read_csv(DataFiles.wbm_summary.path)
df_wbm_init_structs = pd.read_json(DataFiles.wbm_cses_plus_init_structs.path)
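As an aside, the auto-download-on-attribute-access behavior described above can be sketched with a lazy `path` property. This is a minimal, stdlib-only illustration; `LazyFile`, its fields, and the placeholder "download" are made up for demonstration and are not the real `matbench_discovery.data.DataFiles` implementation:

```python
import os
import tempfile


class LazyFile:
    """Sketch of a lazily-fetched data file: the file is only fetched
    (here faked by writing a placeholder) on first access of .path.
    Illustrative only, not the real DataFiles API."""

    def __init__(self, filename: str, url: str) -> None:
        self.filename = filename
        self.url = url  # where the real code would download from

    @property
    def path(self) -> str:
        cache_path = os.path.join(tempfile.gettempdir(), self.filename)
        if not os.path.isfile(cache_path):
            # real code would download self.url to cache_path here
            with open(cache_path, mode="w") as file:
                file.write("placeholder")
        return cache_path


wbm_summary = LazyFile("wbm-summary.csv", "https://example.com/wbm-summary.csv")
print(os.path.isfile(wbm_summary.path))  # True: file exists after first access
```

The upshot is that callers never download anything explicitly; reading the attribute is enough.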

@janosh janosh added the help Q&A support issues label Sep 9, 2024
@jackwebersdgr
Author

Ah, I see. I was under the impression that this script would perform the filtration based on matching structure prototypes in MP, but it seems like this also results in ~257k datapoints. Is there a simple way to obtain the filtered 215.5k set, perhaps via some set of material_ids?

Additionally, it seems the documentation on the site is out of date and could be updated with the above code: https://matbench-discovery.materialsproject.org/contribute#--direct-download

@janosh
Owner

janosh commented Sep 10, 2024

have a look at the subset kwarg in load_df_wbm_with_preds

def load_df_wbm_with_preds(
    *,
    models: Sequence[str] = (),
    pbar: bool = True,
    id_col: str = Key.mat_id,
    subset: pd.Index | Sequence[str] | Literal["uniq_protos"] | None = None,
    max_error_threshold: float | None = 5.0,
    **kwargs: Any,
) -> pd.DataFrame:
    """Load WBM summary dataframe with model predictions from disk.

    Args:
        models (Sequence[str], optional): Model names must be keys of
            matbench_discovery.data.Model. Defaults to all models.
        pbar (bool, optional): Whether to show progress bar. Defaults to True.
        id_col (str, optional): Column to set as df.index. Defaults to "material_id".
        subset (pd.Index | Sequence[str] | 'uniq_protos' | None, optional):
            Subset of material IDs to keep. Defaults to None, which loads all
            materials. 'uniq_protos' drops WBM structures with matching prototype
            in MP training set and duplicate prototypes in WBM test set (keeping
            only the most stable structure per prototype). This increases the
            'OOD-ness' of WBM.
        max_error_threshold (float, optional): Maximum absolute error between predicted

if subset == "uniq_protos":
    df_out = df_out.query(Key.uniq_proto)
elif subset is not None:
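To make the branch above concrete, here is a toy, self-contained sketch of the same filtering logic on a fake dataframe. The column name "uniq_proto" stands in for `Key.uniq_proto` and all IDs and flags are invented for illustration:

```python
import pandas as pd

# Fake WBM summary with a boolean unique-prototype flag (values made up).
df_out = pd.DataFrame(
    {
        "material_id": ["wbm-1-1", "wbm-1-2", "wbm-1-3"],
        "uniq_proto": [True, False, True],
    }
).set_index("material_id")

subset = "uniq_protos"
if subset == "uniq_protos":
    # keep only rows flagged as unique structure prototypes
    df_out = df_out.query("uniq_proto")
elif subset is not None:
    # otherwise treat subset as an explicit collection of material IDs
    df_out = df_out.loc[subset]

print(list(df_out.index))  # ['wbm-1-1', 'wbm-1-3']
```

So `subset="uniq_protos"` is a named filter, while any other non-None value is interpreted as a list of material IDs to keep.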

and

@unique
class TestSubset(LabelEnum):
    """Which subset of the test data to use for evaluation."""

    uniq_protos = "uniq_protos", "Unique Structure Prototypes"
    ten_k_most_stable = "10k_most_stable", "10k Most Stable"
    full = "full", "Full Test Set"
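The `(value, label)` pairs above pair a machine-readable string with a human-readable label. A stdlib-only approximation of that pattern (not the repo's actual `LabelEnum`, whose internals aren't shown here) could look like:

```python
from enum import Enum, unique


@unique
class TestSubset(str, Enum):
    """Stdlib sketch: member value is the machine-readable string;
    a human-readable label lives in a parallel lookup. Illustrative
    only, not the real LabelEnum implementation."""

    uniq_protos = "uniq_protos"
    ten_k_most_stable = "10k_most_stable"
    full = "full"

    @property
    def label(self) -> str:
        return {
            "uniq_protos": "Unique Structure Prototypes",
            "10k_most_stable": "10k Most Stable",
            "full": "Full Test Set",
        }[self.value]


print(TestSubset("uniq_protos").label)  # Unique Structure Prototypes
```

Because the members are plain strings, `subset=TestSubset.uniq_protos` and `subset="uniq_protos"` compare equal, so either form works in the `load_df_wbm_with_preds` call.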
