Open Data for Language Models

I am co-leading Data Research for OLMo. We’ve released:

  • Dolma, the largest open dataset for language model pretraining to date. Dolma won a best paper award at ACL 2024 🏆!
  • peS2o, a transformation of S2ORC optimized for pretraining language models of science.

Prior to this, I co-led the curation of:

  • S2ORC, the largest machine-readable collection of open-access full-text papers to date. Request API access 🔑 here!
  • CORD-19, the most comprehensive, continually updated set of COVID-19 literature at the time, and