Open Source Data Sources for Machine Learning

A curated list of open data sources for machine learning research and projects.

Meta Portals

Other Listings

NLP

  • The Pile — 825GB diverse English text corpus by EleutherAI
  • Common Crawl — petabyte-scale web crawl data, a common source of pretraining text for LLMs
  • C4 — cleaned Common Crawl, used to train T5
  • RedPajama — 1.2T-token open reproduction of the LLaMA training dataset
  • ROOTS — multilingual corpus used to train BLOOM
  • OpenSubtitles — multilingual subtitle corpus, useful for dialogue and translation

Computer Vision

  • ImageNet — the standard benchmark for image classification
  • COCO — object detection, segmentation, and captioning
  • Open Images — ~9M images from Google with image-level labels and bounding boxes

Tabular / Structured Data

This post is licensed under CC BY 4.0 by the author.