Datasets and Dataset Collections
Below is a collection of free and open datasets, and lists of datasets, at the interface of chemistry, materials science, and machine learning.
Foundations
-
Ecosystem of pretrained models, datasets, and Python libraries for transformers, diffusion models, and other modern ML architectures.
License: -
transformers, LLMs, model hub
Chemistry
Materials
-
Curated list of datasets for machine learning with materials, including links to data resources and related projects.
License: MIT
datasets, materials, curated list
-
Digital Materials Foundry – Experimental Materials Data Library (Henry Royce Institute)
Library of experimental materials data repositories curated by the Henry Royce Institute. Covers device-performance, stress–strain, thermoelectric, and optical-property databases, among others, making it a useful starting point for ML on experimental materials data.
License: Varies per dataset (MIT, CC-BY-4.0 etc.)
experimental materials data, materials discovery, data-library
-
Open-access computational materials database providing predicted and known properties of inorganic materials (e.g., formation energy, band gap, structure), computed with DFT in high-throughput workflows. Widely used in ML for materials.
License: CC-BY-4.0
inorganic materials dataset, DFT computed properties, materials ML
-
MatBench – Benchmark Datasets for Materials Property Prediction
Benchmark dataset suite curated by the Materials Project for ML-based materials property prediction. Tasks span electronic, thermal, and mechanical properties; the suite includes APIs and a leaderboard.
License: MIT
benchmark datasets, materials ML, property prediction
-
Porous Material AI Gym: Open Datasets for Machine Learning on Porous Materials
Collection of open datasets for machine learning on porous materials (MOFs, COFs, zeolites). Includes thousands of labelled examples (adsorption, band gaps, charges) for supervised learning, ready to use in ML workflows.
License: -
porous materials, machine learning dataset, MOFs/COFs
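Since several entries above carry a license that varies per dataset or is not stated, it can help to keep the list as structured records and flag those entries before using them in an ML workflow. The following is a minimal sketch, not part of any listed resource: the `Entry` class, `CATALOG`, and both helper functions are hypothetical names introduced here for illustration, and only the titled materials entries from this page are included.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    """One card from the list above (hypothetical structure)."""
    title: str
    license: Optional[str]  # None when the page shows "License: -"
    tags: Tuple[str, ...]


# Titles, licenses, and tags copied from the entries on this page.
CATALOG: List[Entry] = [
    Entry(
        "Digital Materials Foundry",
        "Varies per dataset",
        ("experimental materials data", "materials discovery", "data-library"),
    ),
    Entry(
        "MatBench",
        "MIT",
        ("benchmark datasets", "materials ML", "property prediction"),
    ),
    Entry(
        "Porous Material AI Gym",
        None,
        ("porous materials", "machine learning dataset", "MOFs/COFs"),
    ),
]


def needs_license_check(entry: Entry) -> bool:
    """True when the license is missing or differs per dataset."""
    return entry.license is None or entry.license.lower().startswith("varies")


def with_tag(catalog: List[Entry], tag: str) -> List[Entry]:
    """Return the entries that carry the exact tag string."""
    return [e for e in catalog if tag in e.tags]


if __name__ == "__main__":
    for entry in CATALOG:
        status = "check license" if needs_license_check(entry) else entry.license
        print(f"{entry.title}: {status}")
```

Entries flagged by `needs_license_check` are the ones where the terms of each individual dataset should be read before reuse; `with_tag` performs an exact match on the tag strings shown on the cards.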