Research

Open Data for Language Models

I am co-leading Data Research for OLMo. We’ve released:

  • Dolma, the largest open dataset for language model pretraining to date. Dolma won a best paper award at ACL 2024 🏆!
  • peS2o, a transformation of S2ORC optimized for pretraining language models of science.

Prior to this, I co-led the curation of:

  • S2ORC, the largest machine-readable collection of open-access full-text papers to date. Request API access 🔑 here!
  • CORD-19, the most comprehensive, continually updated set of COVID-19 literature at the time.

Adapting Language Models to Specialized Texts

I was a developer of SciBERT, one of the first pretrained language models for scientific text. Our follow-on work on domain adaptation via continued pretraining won an honorable mention for best paper at ACL 2020 🏆.

I also develop methods for incorporating visual layout into language models. I packaged these models into PaperMage, an open-source Python library that won best paper at EMNLP 2023 System Demos 🏆.
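
As a quick illustration, here is a minimal sketch of what building on PaperMage looks like, based on its documented CoreRecipe API; the PDF path below is a placeholder, not a real file:

    from papermage.recipes import CoreRecipe

    # Parse a PDF into a structured Document (sketch; the path is a placeholder).
    recipe = CoreRecipe()
    doc = recipe.run("example-paper.pdf")

    # The resulting Document exposes layered annotations, e.g. sentences.
    for sentence in doc.sentences:
        print(sentence.text)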

Since 2023, I’ve been working on out-of-domain generalization via parameter-efficient training and data augmentation.

Standards and Best Practices in NLP Evaluation

I design rigorous evaluation guidelines for NLP, including for few-shot learning and long-form summarization; the latter won an outstanding paper award at EACL 2023 🏆.

I’ve organized community shared tasks to evaluate NLP systems for biomedical literature retrieval and understanding, including TREC-COVID and SCIVER.

I’ve also worked on standardized benchmark development for domain fit and efficiency of language models.

NLP for Sensemaking over Large Collections

I’ve published some of the largest gold-standard datasets for training and evaluating language models on scientific literature understanding tasks.

Recently I’ve been interested in helping humans re-find documents they’ve seen before, even when they don’t remember identifying details.

AI-Powered Reading Assistance

I’m a founding member and a lead contributor to the Semantic Reader project, which combines HCI and AI research to design intelligent reading interfaces for scientists. I am also an active developer of the Semantic Reader Open Research Platform, where we share all our code, models and data publicly so others can build their own reading interfaces.

Through this project, I’ve developed novel systems, including ScholarPhi, which explains math notation; Scim, which highlights salient passages; and PaperPlain, which simplifies difficult passages and provides navigational guidance. Another of these systems, CiteSee, which visually augments and personalizes inline citations while reading, won a best paper award at CHI 2023 🏆.