Open Data for Language Models

I am co-leading the data effort for OLMo with Luca Soldaini. We’ve released:

  • Dolma, the largest open dataset for language model pretraining to date, and
  • peS2o, a transformation of S2ORC optimized for pretraining language models of science.

Prior to this, I co-led the data curation efforts behind:

  • S2ORC, the largest machine-readable collection of open-access full-text papers to date. Request API access 🔑 here!
  • CORD-19, the most comprehensive, continually-updated set of COVID-19 literature at the time.

Adapting Language Models to Specialized Texts

I was a developer of SciBERT, one of the first pretrained language models for scientific text. Our follow-on work on domain adaptation via continued pretraining won an honorable mention for best paper at ACL 2020 🏆.

I also develop methods for infusing language models with visual layout. I packaged these models into PaperMage, an open-source Python library that won best paper at ACL 2023 System Demos 🏆.

Since 2023, I’ve been working on out-of-domain generalization via parameter-efficient training and data augmentation.

Standards and Best Practices in NLP Evaluation

I design rigorous evaluation guidelines for NLP, including guidelines for few-shot learning and for long-form summarization, the latter of which won an outstanding paper award at EACL 2023 🏆.

I’ve organized community shared tasks to evaluate NLP systems for biomedical literature retrieval and understanding, including TREC-COVID and SCIVER.

I’ve also worked on standardized benchmark development for domain fit and efficiency of language models.

NLP for Sensemaking over Large Collections

I’ve published some of the largest gold-standard datasets for training and evaluating language models on scientific literature understanding tasks.

Recently I’ve been interested in helping humans re-find documents they’ve seen before, even when they don’t remember identifying details.

AI-Powered Reading Assistance

I’m a founding member and a lead contributor to the Semantic Reader project, which combines HCI and AI research to design intelligent reading interfaces for scientists. I am also an active developer of the Semantic Reader Open Research Platform, where we share all our code, models, and data publicly so others can build their own reading interfaces.

Through this project, I’ve developed novel systems including ScholarPhi which explains math notation, Scim which highlights salient passages, and PaperPlain which simplifies difficult passages and provides navigational guidance. One such system called CiteSee, which visually augments and personalizes inline citations while reading, won a best paper award at CHI 2023 🏆.

Limitations in Language Models

I’ve identified key limitations in today’s language models:

  • Language models produce worse multi-document summaries when provided with retrieved documents (EMNLP Findings 2023),
  • Standard prompting strategies result in coherence errors when summarizing novels (arXiv 2023),
  • Layout-infused language models can be worse than text-only language models at visual layout parsing (ACL Findings 2023),
  • Language models can decontextualize snippets from Wikipedia but not scientific articles (EMNLP 2023).