
THE PILE

An 800GB Dataset of Diverse Text for Language Modeling



WHAT IS THE PILE?

The Pile is an 825 GiB diverse, open-source language modelling dataset that
consists of 22 smaller, high-quality datasets combined.

Pile Paper (arXiv)


DOWNLOAD

The Pile is hosted by the Eye.

Download Pile

The format of the Pile is jsonlines data compressed using zstandard.
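As a minimal sketch of how one might stream documents from a downloaded shard
in Python: the shard filename is hypothetical, the meta.pile_set_name field
reflects the released files' layout, and the zstandard package is assumed to
be installed (pip install zstandard).

import io
import json

import zstandard as zstd

def read_pile_shard(path):
    """Yield one JSON document per line from a .jsonl.zst Pile shard."""
    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)

# Example usage (shard name is illustrative):
for doc in read_pile_shard("00.jsonl.zst"):
    print(doc["meta"]["pile_set_name"], len(doc["text"]))
    break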

Have a model that uses or evaluates on the Pile? Let us know!


WHY IS THE PILE A GOOD TRAINING SET?

Recent work has shown that, especially for large models, diversity in data
sources improves a model's general cross-domain knowledge as well as its
downstream generalization capability. In our evaluations, models trained on
the Pile not only show moderate improvements on traditional language modeling
benchmarks, they also show significant improvements on Pile BPB.


WHY IS THE PILE A GOOD BENCHMARK?

To score well on Pile BPB (bits per byte), a model must be able to understand
many disparate domains, including books, GitHub repositories, webpages, chat
logs, and medical, physics, math, computer science, and philosophy papers. Pile
BPB is a measure of world knowledge and reasoning ability across these domains,
making it a robust benchmark of general, cross-domain text modeling ability for
large language models.
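To make the metric concrete: bits per byte is the model's total cross-entropy
over the test set, converted from nats to bits and normalized by the test
set's size in bytes, which makes scores comparable across tokenizers. A
minimal Python sketch of the conversion (the numbers in the example are made
up, not actual leaderboard inputs):

import math

def bits_per_byte(total_loss_nats, num_bytes):
    """Convert a summed cross-entropy in nats over the whole test set
    into bits per byte: divide by ln(2) to get bits, then by byte count."""
    return total_loss_nats / (num_bytes * math.log(2))

# Hypothetical example: an average loss of 2.0 nats/token at 4 bytes/token
# gives 2.0 / (4 * ln 2) ≈ 0.72 bits per byte.
print(bits_per_byte(total_loss_nats=2.0e9, num_bytes=4.0e9))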


CITING

If you use the Pile or any of the components, please cite us!



@article{pile,
  title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}
                




LEADERBOARD

* indicates potential test-set overlap. Zero-shot indicates that not all
components of the Pile were present in the model's training data.

Rank  Date         Model                Organization  Test BPB
1     Jan 1, 2021  GPT-3 (Zero-Shot)*   OpenAI        0.7177
2     Jan 1, 2021  GPT-2 (Zero-Shot)*   OpenAI        1.2253

Evaluation code

EleutherAI 2021