Open Source Data Sources for Machine Learning

Posted Apr 2, 2026 Updated Apr 2, 2026

By Naresh Modina 1 min read

A curated list of open data sources for machine learning research and projects.

Popular Open Data Repositories

OpenML — collaborative platform with thousands of datasets and benchmarks
Kaggle — competitions and community-contributed datasets
Papers With Code — datasets linked directly to research papers and leaderboards
UC Irvine ML Repository — classic benchmark datasets
Registry of Open Data on AWS — large-scale datasets hosted on S3
TensorFlow Datasets — ready-to-use datasets via tfds
Hugging Face Datasets — one of the largest hubs for ML datasets, tightly integrated with the transformers library
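
Several of the repositories above can be read programmatically. As a minimal sketch, assuming the Hugging Face `datasets` package is installed, the snippet below streams a split from the hub instead of downloading it in full and then peeks at the first few records. The helper names `load_hub_split` and `peek` are hypothetical, and the dataset name is left to the caller because this post does not prescribe one.

```python
from itertools import islice


def load_hub_split(name, split="train", streaming=True):
    """Open one split of a Hugging Face hub dataset (hypothetical helper).

    With streaming=True the records are fetched lazily, which is useful for
    corpora that are too large to download up front.
    """
    # Imported lazily so that `peek` below works even without the library.
    from datasets import load_dataset

    return load_dataset(name, split=split, streaming=streaming)


def peek(records, n=3):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(records, n))


# Usage (needs network access; replace the placeholder with a real hub name):
# stream = load_hub_split("<dataset-name>", split="train")
# for row in peek(stream, 2):
#     print(row)
```

Streaming is chosen here as the default because many of the listed corpora are far larger than a typical local disk; for small benchmark datasets, passing `streaming=False` downloads and caches the split instead.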

Meta Portals

DataPortals.org — directory of open data portals worldwide
OpenDataMonitor — European open data catalogue

Other Listings

NLP

The Pile — 825 GiB diverse English text corpus by EleutherAI
Common Crawl — petabyte-scale web crawl data, a source for the training data of many LLMs
C4 — cleaned Common Crawl, used to train T5
RedPajama — 1.2T-token open reproduction of the LLaMA training data
ROOTS — multilingual corpus used to train BLOOM
OpenSubtitles — multilingual subtitle corpus, useful for dialogue and translation
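
To make "cleaned Common Crawl" more concrete, here is a simplified sketch of the kind of line-level and page-level heuristics used to build corpora such as C4. It is not the actual C4 pipeline: the function name `clean_page` and the threshold values are illustrative assumptions, and real pipelines add steps such as language identification and deduplication.

```python
# Characters that a "natural language" line is expected to end with.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def clean_page(text, min_words_per_line=3, min_lines=1):
    """Return the cleaned page text, or None if the page should be dropped.

    Thresholds are illustrative defaults, not the values of any real pipeline.
    """
    # Page-level filters: placeholder text or code-like content.
    if "lorem ipsum" in text.lower() or "{" in text:
        return None

    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that look like complete sentences.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        # Drop very short lines (menus, buttons, stray fragments).
        if len(line.split()) < min_words_per_line:
            continue
        # Drop boilerplate warnings about enabling JavaScript.
        if "javascript" in line.lower():
            continue
        kept.append(line)

    if len(kept) < min_lines:
        return None
    return "\n".join(kept)


page = (
    "Home | About | Contact\n"
    "Open data makes research reproducible.\n"
    "Please enable JavaScript to continue.\n"
    "Ok.\n"
)
print(clean_page(page))
```

In this example only the second line survives: the navigation bar lacks terminal punctuation, the JavaScript notice is filtered as boilerplate, and "Ok." is too short.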

Computer Vision

ImageNet — the standard benchmark for image classification
COCO — object detection, segmentation, and captioning
Open Images — 9M images with labels and bounding boxes by Google

Tabular / Structured Data

UC Irvine ML Repository — classic tabular benchmarks (also listed above)
Google Dataset Search — search engine for datasets across the web
World Bank Open Data — economic and development indicators

Tags: datasource, data, datasets, nlp, opendata

This post is licensed under CC BY 4.0 by the author.
