Raw Data Access

Direct downloads of the raw training data, tokenizer artifacts, evaluation benchmarks, and embedding vectors used in frontier AI model research.

Web Crawl Archives

📦 cc-2024-18-en-wet.gz            812 GB   WET
📦 cc-2024-18-multilingual.gz      2.1 TB   WARC
📄 cc-2024-18-index.cdx.gz         48 GB    CDX
📦 fineweb-2024-deduped.parquet    44 TB    Parquet

Tokenizer Vocabularies

🔤 tiktoken-cl100k_base.tiktoken   1.8 MB   TikToken
🔤 llama3-tokenizer.model          2.0 MB   SentPiece
🔤 gemma-tokenizer-256k.json       4.3 MB   JSON

Evaluation Benchmarks

📊 mmlu-test-all-subjects.jsonl    180 MB   JSONL
📊 humaneval-problems.jsonl        540 KB   JSONL
📊 math-500-test.jsonl             4.2 MB   JSONL
📊 hellaswag-validation.arrow      92 MB    Arrow

Embedding Vectors

🧮 text-embedding-3-large-wiki.npy   22 GB   NumPy
🧮 bge-m3-multilingual.faiss         14 GB   FAISS