Open Data for Language Models
I am co-leading Data Research for OLMo. We’ve released:
- Dolma, the largest open dataset for language model pretraining to date. Dolma won a best paper award at ACL 2024 🏆!
- peS2o, a transformation of S2ORC optimized for pretraining language models on scientific text.
Prior to this, I co-led the curation of: