AI Training Datasets
Open datasets used to train the world's most powerful language models, vision systems, and multimodal AI.
Common Crawl (CC-2024-18)
3.4 petabyte web crawl covering 3.6 billion web pages. A foundational pretraining source for the GPT series and many other major LLMs. Distributed as WARC, WAT, and WET files.
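As a rough illustration of working with the WARC format, the sketch below iterates over HTTP response records in a single crawl segment using the warcio library. The local file path is a placeholder; it assumes you have already downloaded a segment from data.commoncrawl.org.

```python
# Minimal sketch: iterate HTML response records in one Common Crawl WARC file.
# Assumes `pip install warcio`; the path below is a hypothetical local segment,
# not an actual file name from CC-2024-18.
from warcio.archiveiterator import ArchiveIterator

warc_path = "CC-MAIN-sample.warc.gz"  # placeholder for a downloaded segment

with open(warc_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request and metadata records
        url = record.rec_headers.get_header("WARC-Target-URI")
        content_type = record.http_headers.get_header("Content-Type", "")
        if "text/html" in content_type:
            html = record.content_stream().read()
            print(url, len(html), "bytes")
```

WAT and WET files follow the same record structure, so the same iterator works for extracted metadata and plain text.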
The Pile v2 — Multilingual
Builds on EleutherAI's original 825 GB English corpus, which spans Wikipedia, GitHub, ArXiv, PubMed, Books3, and Common Crawl subsets, and extends it with multilingual data in 30+ languages.
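A minimal way to sample Pile-style documents is to stream them with the Hugging Face datasets library, as sketched below. The hub ID is an assumption; hosting of The Pile has changed over time, so substitute whichever mirror you actually use.

```python
# Minimal sketch: stream Pile-style documents without downloading 825 GB.
# The hub ID "EleutherAI/pile" is an assumption -- swap in the mirror you use.
from datasets import load_dataset

ds = load_dataset("EleutherAI/pile", split="train", streaming=True)

for i, doc in enumerate(ds):
    # Each record carries the raw text plus metadata naming its source subset.
    print(doc.get("meta"), doc["text"][:80])
    if i >= 2:
        break
```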
OpenAssistant RLHF Dataset
Roughly 161,000 messages in 35 languages, organized into conversation trees, over 10,000 of which are fully annotated. Human-labeled preference data for RLHF training, with assistant responses ranked by quality.
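Since the corpus ships as a flat message table, a common first step is rebuilding the conversation trees from parent pointers. The sketch below follows the published OASST1 schema (message_id, parent_id, role, text); the split name reflects the dataset card and may differ in later releases.

```python
# Minimal sketch: rebuild OpenAssistant conversation trees from flat messages.
# Field names follow the published OASST1 schema.
from collections import defaultdict

from datasets import load_dataset

msgs = load_dataset("OpenAssistant/oasst1", split="train")

children = defaultdict(list)
roots = []
for m in msgs:
    if m["parent_id"] is None:
        roots.append(m)  # prompts with no parent start each tree
    else:
        children[m["parent_id"]].append(m)

def walk(msg, depth=0):
    """Print one conversation tree depth-first, indented by reply depth."""
    print("  " * depth + f'{msg["role"]}: {msg["text"][:60]!r}')
    for child in children[msg["message_id"]]:
        walk(child, depth + 1)

walk(roots[0])
```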
LAION-5B Image-Text Pairs
5.85 billion image-text pairs scraped from Common Crawl, used to train OpenCLIP, Stable Diffusion, and multimodal LLMs. Filtered subsets are available in both English-only and multilingual variants.
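LAION-5B is distributed as URL/caption metadata rather than images, so a typical workflow filters the metadata shards first. The sketch below assumes a locally downloaded parquet shard; the file name, threshold, and exact column names (URL, TEXT, WIDTH, HEIGHT, similarity) are taken from the published shards and may differ in a given release.

```python
# Minimal sketch: filter one LAION-style metadata shard by CLIP similarity
# and minimum resolution. Column names are assumptions based on the published
# shard layout; verify against the shard you actually have.
import pandas as pd

shard = pd.read_parquet("laion_shard_00000.parquet")  # hypothetical local shard

keep = shard[(shard["similarity"] >= 0.30)   # image-text alignment score
             & (shard["WIDTH"] >= 256)
             & (shard["HEIGHT"] >= 256)]

print(f"kept {len(keep)}/{len(shard)} pairs")
keep[["URL", "TEXT"]].to_parquet("filtered_shard.parquet")
```

The images themselves are then usually fetched from the surviving URLs with a bulk downloader such as img2dataset.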
FineWeb — High Quality Web Data
15 trillion tokens of deduplicated, filtered English web text derived from Common Crawl. Hugging Face's flagship pretraining dataset — reportedly used in several recent open models.
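Because the full corpus is far too large to download casually, the dataset card documents smaller sampled configs that can be streamed, as sketched below. The hub ID and the "sample-10BT" config name follow the card at the time of writing; treat both as assumptions to verify against the current card.

```python
# Minimal sketch: stream a FineWeb sample instead of downloading 15T tokens.
# Hub ID and config name are assumptions taken from the dataset card.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

for i, row in enumerate(fw):
    # Each record includes the cleaned text plus provenance fields like url.
    print(row["url"], row.get("token_count"), row["text"][:60])
    if i >= 2:
        break
```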