AI Training Datasets

Open datasets used to train the world's most powerful language models, vision systems, and multimodal AI.

Common Crawl (CC-2024-18)

A 3.4-petabyte web crawl covering 3.6 billion web pages, distributed as WARC, WAT, and WET files. The foundational pretraining source behind GPT, Claude, Gemini, and most other major LLMs.

Size: 3.4 PB · Pages: 3.6B · License: Open · Formats: WARC / WET

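The raw crawl ships as gzipped WARC archives. Below is a minimal sketch of iterating the response records in one downloaded segment with the warcio library; the local filename is a hypothetical placeholder.

```python
# Sketch: iterate response records in a locally downloaded Common Crawl WARC file.
# Requires: pip install warcio. The filename below is a placeholder.
from warcio.archiveiterator import ArchiveIterator

with open("CC-segment-00000.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()  # raw HTTP response payload
            print(url, len(body))
```

WET files follow the same record layout but carry extracted plain text instead of the full HTTP response.
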
The Pile v2 — Multilingual

An 825 GB diverse English corpus curated by EleutherAI, including Wikipedia, GitHub, ArXiv, PubMed, Books3, and Common Crawl subsets. Extended with multilingual data in 30+ languages.

Size: 1.3 TB · Languages: 30+ · License: MIT · Format: JSONL

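The corpus is shipped as JSON Lines, one document per line. A minimal sketch of streaming a shard follows; the shard filename is a placeholder and the "text" field is an assumption about the record schema.

```python
# Sketch: stream a JSONL shard one document at a time without loading it all into memory.
# The filename is a placeholder; the "text" field is an assumed part of the record schema.
import json

with open("pile_shard_00.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        print(doc["text"][:80])  # preview the first 80 characters of each document
```
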
OpenAssistant RLHF Dataset

161,000 messages in 35 languages across 10,000 conversation trees. Human-labeled preference data for RLHF training, including assistant responses ranked by quality.

Size: 4.2 GB · Messages: 161K · License: Apache 2.0 · Format: JSON

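A minimal sketch of loading the conversations with the Hugging Face datasets library; the dataset id "OpenAssistant/oasst1" and the field names are assumptions based on the commonly published release.

```python
# Sketch: load OpenAssistant conversations via the Hugging Face datasets library.
# Requires: pip install datasets. Dataset id and field names are assumptions.
from datasets import load_dataset

ds = load_dataset("OpenAssistant/oasst1", split="train")
msg = ds[0]
# Each row is a single message; conversation trees are reconstructed by following parent_id links.
print(msg["message_id"], msg["role"], msg["lang"], msg["text"][:80])
```
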
LAION-5B Image-Text Pairs

5.85 billion image-text pairs scraped from Common Crawl, used to train open CLIP reproductions, Stable Diffusion, and multimodal LLMs. Filtered English and multilingual subsets are available.

Pairs: 5.85B · Size: ~240 TB · License: CC-BY 4.0 · Format: Parquet

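The distributed artifact is metadata (image URLs plus captions) in Parquet shards; the images themselves are fetched separately. A minimal sketch of inspecting one shard with pandas, where the shard filename and the uppercase URL/TEXT column names are assumptions about the released layout:

```python
# Sketch: inspect a LAION-5B metadata shard stored as Parquet.
# Requires: pip install pandas pyarrow. Filename and column names are assumptions.
import pandas as pd

df = pd.read_parquet("laion5b_metadata_part_00000.parquet")
print(df.columns.tolist())          # list which metadata fields the shard actually carries
print(df[["URL", "TEXT"]].head())   # image URL and its paired caption
```
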
FineWeb — High Quality Web Data

15 trillion tokens of deduplicated, filtered English web text derived from Common Crawl. Hugging Face's flagship pretraining dataset — reportedly used in several recent open models.

Tokens: 15T · Size: 44 TB · License: ODC-By · Format: Parquet
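
At 44 TB on disk, FineWeb is usually consumed in streaming mode rather than downloaded outright. A minimal sketch using the Hugging Face datasets library; the dataset id "HuggingFaceFW/fineweb" and the "text" field are assumptions.

```python
# Sketch: stream FineWeb from the Hugging Face Hub without downloading the full corpus.
# Requires: pip install datasets. Dataset id and field name are assumptions.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
for i, doc in enumerate(fw):
    print(doc["text"][:80])  # preview each document's opening text
    if i == 2:               # stop after a few records
        break
```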