AI Training Datasets
Open datasets used to train the world's most powerful language models, vision systems, and multimodal AI.
Common Crawl (CC-2024-18)
3.4 petabyte web crawl covering 3.6 billion web pages. A foundational pretraining source for the GPT series and many other major LLMs. Distributed as WARC, WAT, and WET files.
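As a rough illustration of working with the WARC format, the sketch below iterates over HTTP response records in a single crawl segment using the warcio library. The local file path is a placeholder; it assumes you have already downloaded a segment from data.commoncrawl.org.

```python
# Minimal sketch: iterate HTML response records in one Common Crawl WARC file.
# Assumes `pip install warcio`; the path below is a hypothetical local segment,
# not an actual file name from CC-2024-18.
from warcio.archiveiterator import ArchiveIterator

warc_path = "CC-MAIN-sample.warc.gz"  # placeholder for a downloaded segment

with open(warc_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request and metadata records
        url = record.rec_headers.get_header("WARC-Target-URI")
        content_type = record.http_headers.get_header("Content-Type", "")
        if "text/html" in content_type:
            html = record.content_stream().read()
            print(url, len(html), "bytes")
```

WAT and WET files follow the same record structure, so the same iterator works for extracted metadata and plain text.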
The Pile v2 — Multilingual
Builds on EleutherAI's original 825 GB English corpus, which spans Wikipedia, GitHub, ArXiv, PubMed, Books3, and Common Crawl subsets, and extends it with multilingual data in 30+ languages.
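A minimal way to sample Pile-style documents is to stream them with the Hugging Face datasets library, as sketched below. The hub ID is an assumption; hosting of The Pile has changed over time, so substitute whichever mirror you actually use.

```python
# Minimal sketch: stream Pile-style documents without downloading 825 GB.
# The hub ID "EleutherAI/pile" is an assumption -- swap in the mirror you use.
from datasets import load_dataset

ds = load_dataset("EleutherAI/pile", split="train", streaming=True)

for i, doc in enumerate(ds):
    # Each record carries the raw text plus metadata naming its source subset.
    print(doc.get("meta"), doc["text"][:80])
    if i >= 2:
        break
```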
OpenAssistant RLHF Dataset
Roughly 161,000 messages in 35 languages, organized into conversation trees, over 10,000 of which are fully annotated. Human-labeled preference data for RLHF training, with assistant responses ranked by quality.
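Since the corpus ships as a flat message table, a common first step is rebuilding the conversation trees from parent pointers. The sketch below follows the published OASST1 schema (message_id, parent_id, role, text); the split name reflects the dataset card and may differ in later releases.

```python
# Minimal sketch: rebuild OpenAssistant conversation trees from flat messages.
# Field names follow the published OASST1 schema.
from collections import defaultdict

from datasets import load_dataset

msgs = load_dataset("OpenAssistant/oasst1", split="train")

children = defaultdict(list)
roots = []
for m in msgs:
    if m["parent_id"] is None:
        roots.append(m)  # prompts with no parent start each tree
    else:
        children[m["parent_id"]].append(m)

def walk(msg, depth=0):
    """Print one conversation tree depth-first, indented by reply depth."""
    print("  " * depth + f'{msg["role"]}: {msg["text"][:60]!r}')
    for child in children[msg["message_id"]]:
        walk(child, depth + 1)

walk(roots[0])
```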
LAION-5B Image-Text Pairs
5.85 billion image-text pairs scraped from Common Crawl, used to train OpenCLIP, Stable Diffusion, and multimodal LLMs. Filtered subsets are available in both English-only and multilingual variants.
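LAION-5B is distributed as URL/caption metadata rather than images, so a typical workflow filters the metadata shards first. The sketch below assumes a locally downloaded parquet shard; the file name, threshold, and exact column names (URL, TEXT, WIDTH, HEIGHT, similarity) are taken from the published shards and may differ in a given release.

```python
# Minimal sketch: filter one LAION-style metadata shard by CLIP similarity
# and minimum resolution. Column names are assumptions based on the published
# shard layout; verify against the shard you actually have.
import pandas as pd

shard = pd.read_parquet("laion_shard_00000.parquet")  # hypothetical local shard

keep = shard[(shard["similarity"] >= 0.30)   # image-text alignment score
             & (shard["WIDTH"] >= 256)
             & (shard["HEIGHT"] >= 256)]

print(f"kept {len(keep)}/{len(shard)} pairs")
keep[["URL", "TEXT"]].to_parquet("filtered_shard.parquet")
```

The images themselves are then usually fetched from the surviving URLs with a bulk downloader such as img2dataset.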
FineWeb — High Quality Web Data
15 trillion tokens of deduplicated, filtered English web text derived from Common Crawl. Hugging Face's flagship pretraining dataset — reportedly used in several recent open models.
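Because the full corpus is far too large to download casually, the dataset card documents smaller sampled configs that can be streamed, as sketched below. The hub ID and the "sample-10BT" config name follow the card at the time of writing; treat both as assumptions to verify against the current card.

```python
# Minimal sketch: stream a FineWeb sample instead of downloading 15T tokens.
# Hub ID and config name are assumptions taken from the dataset card.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

for i, row in enumerate(fw):
    # Each record includes the cleaned text plus provenance fields like url.
    print(row["url"], row.get("token_count"), row["text"][:60])
    if i >= 2:
        break
```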