Open Source Data Sources for Machine Learning
A curated list of open data sources for machine learning research and projects.
Popular Open Data Repositories
- OpenML — collaborative platform with thousands of datasets and benchmarks
- Kaggle — competitions and community-contributed datasets
- Papers With Code — datasets linked directly to research papers and leaderboards
- UC Irvine ML Repository — classic benchmark datasets
- AWS Open Datasets — large-scale datasets hosted on S3
- TensorFlow Datasets — ready-to-use datasets via tfds
- Hugging Face Datasets — the largest hub for ML datasets, strongly integrated with transformers
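
As a rough illustration of how three of the repositories above are typically accessed from Python, here is a minimal sketch. It assumes the `tensorflow-datasets`, `datasets`, and `scikit-learn` packages are installed and that network access is available. The dataset names (`mnist`, `stanfordnlp/imdb`, `iris`) and the `LOADERS` table are illustrative choices, not part of the original list.

```python
# Each loader imports its library lazily, so defining these functions
# does not require all three packages to be installed.

def load_with_tfds(name="mnist", split="train"):
    """Load a dataset through TensorFlow Datasets as a tf.data.Dataset."""
    import tensorflow_datasets as tfds
    # as_supervised=True yields (input, label) pairs for datasets that define them.
    return tfds.load(name, split=split, as_supervised=True)


def load_with_hf(name="stanfordnlp/imdb", split="train"):
    """Load a dataset from the Hugging Face Hub."""
    import datasets
    return datasets.load_dataset(name, split=split)


def load_with_openml(name="iris", version=1):
    """Fetch an OpenML dataset through scikit-learn as pandas objects."""
    from sklearn.datasets import fetch_openml
    return fetch_openml(name=name, version=version, as_frame=True)


# Hypothetical lookup table so calling code can pick a source by key.
LOADERS = {
    "tfds": load_with_tfds,
    "huggingface": load_with_hf,
    "openml": load_with_openml,
}
```

Calling `LOADERS["openml"]()` downloads the data on first use and caches it locally. The other two loaders behave similarly, each with its own cache directory.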
Meta Portals
- DataPortals.org — directory of open data portals worldwide
- OpenDataMonitor — European open data catalogue
Other Listings
NLP
- The Pile — 825GB diverse English text corpus by EleutherAI
- Common Crawl — petabyte-scale web crawl data, a common source for LLM pretraining corpora
- C4 — cleaned Common Crawl, used to train T5
- RedPajama — 1.2T-token open reproduction of the LLaMA training data
- ROOTS — multilingual corpus used to train BLOOM
- OpenSubtitles — multilingual subtitle corpus, useful for dialogue and translation
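
Corpora of this size are rarely downloaded in full just to inspect them. The sketch below shows one way to look at a few records by streaming, assuming the Hugging Face `datasets` package is installed and that C4 is published on the Hub as `allenai/c4` with an `en` configuration. The helper names `peek` and `stream_c4` are hypothetical.

```python
from itertools import islice


def peek(examples, n=3):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(examples, n))


def stream_c4(split="train"):
    """Open the English C4 split as a stream instead of downloading it."""
    # Lazy import: the library is only needed when the stream is opened.
    from datasets import load_dataset
    # streaming=True returns an iterable that fetches records on demand.
    return load_dataset("allenai/c4", "en", split=split, streaming=True)
```

With network access, `peek(stream_c4(), 2)` should return two records as dictionaries, and the document body should sit under the `text` key. The same `peek` helper works on any iterable, so it can be reused for other streamed corpora.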
Computer Vision
- ImageNet — the standard benchmark for image classification
- COCO — object detection, segmentation, and captioning
- Open Images — 9M images with labels and bounding boxes by Google
Tabular / Structured Data
- UCI Repository — classic tabular benchmarks
- Google Dataset Search — search engine for datasets across the web
- World Bank Open Data — economic and development indicators
This post is licensed under CC BY 4.0 by the author.