Datasets
The structure libraries from which these models descend.
The lineage of an MLIP is shaped at least as much by what it was trained on as by its architecture. Two equivariant transformers with nearly identical layer counts can produce wildly different relaxation paths if one was raised on QM9 and the other on OMat24. The clusters on the similarity map track the training dataset more reliably than they track model family, and on most days more reliably than the architecture diagrams in the original papers.
This section is a working reference, not a comprehensive review of materials and molecular chemistry datasets. The entries cover the libraries actually referenced by the catalog.
| Name | Domain | Size | Level of theory | Year |
|---|---|---|---|---|
| QM9 | molecules | ~134,000 small organic molecules | B3LYP / 6-31G(2df,p) | 2014 |
| MD17 / rMD17 | molecules | ~3.6M MD frames across 10 small organic molecules | PBE+TS / CCSD(T) for select molecules | 2017 |
| ANI-1x / ANI-2x | molecules | ~5M (ANI-1x) and ~9M (ANI-2x) conformations of small organics | ωB97X / 6-31G* | 2018 |
| MP-2018+ | materials | ~150k DFT-relaxed inorganic crystals (MP-2018 snapshot, growing) | PBE; PBE+U for selected TM oxides; VASP | 2018 |
| OC20 | catalysis | ~1.3M relaxations, ~265M structure-energy pairs along the relaxation paths | RPBE; VASP | 2020 |
| OC22 | catalysis | ~62k relaxations on oxide surfaces | PBE+U; VASP | 2022 |
| MPTrj | materials | ~1.6M VASP relaxation frames extracted from Materials Project | PBE; PBE+U for select TMs; VASP | 2023 |
| SPICE | molecules | ~1.1M conformations | ωB97M-D3(BJ) / def2-TZVPPD | 2023 |
| OMat24 | materials | ~118M structures | PBE; PBE+U for 3d transition metals | 2024 |
| sAlex | materials | ~4.2M inorganic structures, subsampled from the Alexandria database | PBE; some PBE+U; VASP | 2024 |
| MAD | mixed | ~95M frames spanning crystals, surfaces, molecules, and disordered configurations | PBE / r²SCAN (mixed-fidelity) | 2025 |
| OMol25 | molecules | ~110M molecular frames | ωB97M-V / def2-TZVPD | 2025 |
Sorted oldest first to read as a timeline of what models had to learn from.
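For readers who want to work with the table programmatically, here is a minimal sketch of the same entries as plain records, with the oldest-first ordering and a per-domain grouping. The `Dataset` class and the `timeline` and `by_domain` helpers are hypothetical names chosen for illustration, not part of the catalog. Only name, domain, and year are carried over from the table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    # Hypothetical record type; fields mirror three columns of the table.
    name: str
    domain: str
    year: int


# Names, domains, and years copied from the table above.
# Size and level of theory are omitted to keep the sketch short.
DATASETS = [
    Dataset("QM9", "molecules", 2014),
    Dataset("MD17 / rMD17", "molecules", 2017),
    Dataset("ANI-1x / ANI-2x", "molecules", 2018),
    Dataset("MP-2018+", "materials", 2018),
    Dataset("OC20", "catalysis", 2020),
    Dataset("OC22", "catalysis", 2022),
    Dataset("MPTrj", "materials", 2023),
    Dataset("SPICE", "molecules", 2023),
    Dataset("OMat24", "materials", 2024),
    Dataset("sAlex", "materials", 2024),
    Dataset("MAD", "mixed", 2025),
    Dataset("OMol25", "molecules", 2025),
]


def timeline(datasets):
    """Return datasets oldest first; sorted() is stable, so ties keep input order."""
    return sorted(datasets, key=lambda d: d.year)


def by_domain(datasets):
    """Group dataset names by domain, each group in timeline order."""
    groups = defaultdict(list)
    for d in timeline(datasets):
        groups[d.domain].append(d.name)
    return dict(groups)


if __name__ == "__main__":
    for domain, names in by_domain(DATASETS).items():
        print(f"{domain}: {', '.join(names)}")
```

Grouping by domain after sorting by year keeps each domain's list readable as its own small timeline, which is usually the view that matters when asking what a given model could have seen at training time.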