Raw Data Access
Direct downloads of the raw training data, tokenizer artifacts, evaluation benchmarks, and embedding vectors used in frontier AI model research.
Web Crawl Archives
File                          Size    Format
cc-2024-18-en-wet.gz          812 GB  WET
cc-2024-18-multilingual.gz    2.1 TB  WARC
cc-2024-18-index.cdx.gz       48 GB   CDX
fineweb-2024-deduped.parquet  44 TB   Parquet
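
Archives of this size are usually streamed one record at a time, not loaded whole. The sketch below assumes the WET file is a gzipped stream of standard WARC-format records (a `WARC/1.0` version line, `Key: Value` headers, a blank line, then `Content-Length` bytes of payload), which is how Common Crawl publishes WET files; the listing itself does not confirm the layout. The function names `iter_warc_records` and `iter_wet_text` are illustrative, not part of any library.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC/WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers: Dict[str, str] = {}
        while True:
            raw = stream.readline()
            if raw in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            key, _, value = raw.decode("utf-8", errors="replace").partition(":")
            headers[key.strip()] = value.strip()
        # Content-Length counts payload bytes, so read exactly that many.
        length = int(headers.get("Content-Length", "0"))
        yield headers, stream.read(length)


def iter_wet_text(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) for each 'conversion' record in a gzipped WET file."""
    with gzip.open(path, "rb") as fh:
        for headers, payload in iter_warc_records(fh):
            if headers.get("WARC-Type") == "conversion":
                url = headers.get("WARC-Target-URI", "")
                yield url, payload.decode("utf-8", errors="replace")
```

Under those assumptions, `for url, text in iter_wet_text("cc-2024-18-en-wet.gz"): ...` would walk the extracted page text while keeping only one record in memory at a time.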
Tokenizer Vocabularies
File                           Size    Format
tiktoken-cl100k_base.tiktoken  1.8 MB  TikToken
llama3-tokenizer.model         2.0 MB  SentencePiece
gemma-tokenizer-256k.json      4.3 MB  JSON
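
The three vocabularies use different on-disk formats, so each needs its own loader. As a minimal sketch for the first one: a `.tiktoken` file is plain text with one entry per line, each a base64-encoded token followed by its integer rank. The helper name `parse_tiktoken_ranks` is hypothetical; the `tiktoken` library has its own loader, and the other two files are not covered here.

```python
import base64
from typing import Dict, Iterable


def parse_tiktoken_ranks(lines: Iterable[bytes]) -> Dict[bytes, int]:
    """Map raw token bytes to merge rank from the lines of a .tiktoken file."""
    ranks: Dict[bytes, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks
```

Because a binary file object iterates line by line, the helper can be called as `parse_tiktoken_ranks(open(path, "rb"))`, with the file closed by the caller.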
Evaluation Benchmarks
File                          Size    Format
mmlu-test-all-subjects.jsonl  180 MB  JSONL
humaneval-problems.jsonl      540 KB  JSONL
math-500-test.jsonl           4.2 MB  JSONL
hellaswag-validation.arrow    92 MB   Arrow
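
The JSONL benchmarks can be read with the standard library alone, since each non-empty line is one JSON document. The sketch below makes no assumption about field names, which the listing does not specify; `iter_jsonl` is an illustrative helper. The Arrow file needs a separate reader and is not covered here.

```python
import json
from typing import Any, Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report the line number so a corrupt download is easy to locate.
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
```

A text-mode file object can be passed directly, for example `with open("humaneval-problems.jsonl", encoding="utf-8") as fh: problems = list(iter_jsonl(fh))`.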
Embedding Vectors
File                             Size   Format
text-embedding-3-large-wiki.npy  22 GB  NumPy
bge-m3-multilingual.faiss        14 GB  FAISS
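
Before committing memory to a multi-gigabyte array, it can help to check its shape and dtype first. The sketch below reads only the header of a `.npy` file using the standard library, following the published NPY layout: a magic string, a two-byte version, a little-endian header length, then a Python dict literal. `read_npy_header` is a hypothetical helper; in practice `numpy.load(path, mmap_mode="r")` gives the same information plus lazy access to the data. The FAISS index uses its own format and is not covered here.

```python
import ast
import struct
from typing import Any, BinaryIO, Dict


def read_npy_header(stream: BinaryIO) -> Dict[str, Any]:
    """Return the header dict ('descr', 'fortran_order', 'shape') of a .npy stream."""
    if stream.read(6) != b"\x93NUMPY":
        raise ValueError("not a .npy file")
    major, _minor = stream.read(2)
    if major == 1:
        (header_len,) = struct.unpack("<H", stream.read(2))  # 2-byte length
    elif major in (2, 3):
        (header_len,) = struct.unpack("<I", stream.read(4))  # 4-byte length
    else:
        raise ValueError(f"unsupported .npy version {major}")
    encoding = "utf-8" if major == 3 else "latin1"
    text = stream.read(header_len).decode(encoding)
    # The header is a Python dict literal; literal_eval parses it without executing code.
    return ast.literal_eval(text.strip())
```

For example, `read_npy_header(open(path, "rb"))["shape"]` would report the array dimensions after reading only the first few hundred bytes of the file.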