publications
2024
- RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models. Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, and Kyle Lo. ArXiv, Sep 2024.
Information retrieval methods often rely on a single embedding model trained on large, general-domain datasets like MSMARCO. While this approach can produce a retriever with reasonable overall performance, models trained on domain-specific data often yield better results within their respective domains. While prior work in information retrieval has tackled this through multi-task training, the topic of combining multiple domain-specific expert retrievers remains unexplored, despite its popularity in language model generation. In this work, we introduce RouterRetriever, a retrieval model that leverages multiple domain-specific experts along with a routing mechanism to select the most appropriate expert for each query. It is lightweight and allows easy addition or removal of experts without additional training. Evaluation on the BEIR benchmark demonstrates that RouterRetriever outperforms both MSMARCO-trained (+2.1 absolute nDCG@10) and multi-task trained (+3.2) models. This is achieved by employing our routing mechanism, which surpasses other routing techniques (+1.8 on average) commonly used in language modeling. Furthermore, the benefit generalizes well to other datasets, even in the absence of a specific expert on the dataset. To our knowledge, RouterRetriever is the first work to demonstrate the advantages of using multiple domain-specific expert embedding models with effective routing over a single, general-purpose embedding model in retrieval tasks.
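As a rough illustration of the routing idea, the sketch below embeds the query once with a shared base encoder and hands it to the expert whose domain centroid is most similar; the centroid rule, the encoder callables, and all names are assumptions for illustration, not the paper's exact mechanism.

```python
import numpy as np

# Hypothetical stand-ins: a shared base encoder plus per-domain expert encoders,
# each exposing encode(text) -> np.ndarray. Routing here is cosine similarity to
# a per-expert centroid, an assumption made for illustration only.

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def route_and_encode(query: str, experts: dict, centroids: dict, base_encoder):
    """Pick the expert whose centroid is most similar to the query embedding,
    then encode the query with that expert."""
    q = base_encoder(query)                      # shared embedding used for routing
    best = max(centroids, key=lambda name: cosine(q, centroids[name]))
    return best, experts[best](query)            # encode with the chosen expert
```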
- OLMoE: Open Mixture-of-Experts Language Models. Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Daniel Morrison, Sewon Min, Weijia Shi, and 17 more authors. ArXiv, Sep 2024.
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
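To make the "7B total, 1B active" idea concrete, here is a minimal sparse Mixture-of-Experts feed-forward layer in PyTorch in which each token passes through only top_k of num_experts expert MLPs; the dimensions and routing details are illustrative and not taken from OLMoE's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparse MoE feed-forward block: every token is processed by only
    top_k of num_experts expert MLPs, so the active parameters per token are a
    fraction of the total. Sizes are illustrative, not OLMoE's configuration."""

    def __init__(self, d_model=512, d_ff=1024, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # dense loop for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```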
- The Semantic Reader Project: Augmenting Scholarly Documents Through AI-Powered Interactive Reading Interfaces. Kyle Lo, Joseph Chee Chang, Andrew Head, Jonathan Bragg, Amy X. Zhang, Cassidy Trier, Chloe Anastasiades, and 48 more authors. Communications of the ACM, Sep 2024.
Scholarly publications are key to the transfer of knowledge from scholars to others. However, research papers are information-dense, and as the volume of the scientific literature grows, so does the need for new technology to support scholars. In contrast to the process of finding papers, which has been transformed by Internet technology, the experience of reading research papers has changed little in decades. For instance, the PDF format for sharing papers remains widely used due to its portability but has significant downsides, inter alia, static content and poor accessibility for low-vision readers. This paper explores the question “Can recent advances in AI and HCI power intelligent, interactive, and accessible reading interfaces—even for legacy PDFs?” We describe the Semantic Reader Project, a collaborative effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers. Through this project, we’ve developed a collection of novel reading interfaces and evaluated them with study participants and real-world users to show improved reading experiences for scholars. We’ve also released a production research paper reading interface that will continuously incorporate novel features from our research as they mature. We structure this paper around five key opportunities for AI assistance in scholarly reading—discovery, efficiency, comprehension, synthesis, and accessibility—and present an overview of our progress and discuss remaining open challenges.
- MathFish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula. Li Lucy, Tal August, Rose E. Wang, Luca Soldaini, Courtney Allison, and Kyle Lo. ArXiv, Aug 2024.
To ensure that math curriculum is grade-appropriate and aligns with critical skills or concepts in accordance with educational standards, pedagogical experts can spend months carefully reviewing published math problems. Drawing inspiration from this process, our work presents a novel angle for evaluating language models’ (LMs) mathematical abilities, by investigating whether they can discern skills and concepts enabled by math content. We contribute two datasets: one consisting of 385 fine-grained descriptions of K-12 math skills and concepts, or standards, from Achieve the Core (ATC), and another of 9.9K math problems labeled with these standards (MathFish). We develop two tasks for evaluating LMs’ abilities to assess math problems: (1) verifying whether a problem aligns with a given standard, and (2) tagging a problem with all aligned standards. Working with experienced teachers, we find that LMs struggle to tag and verify standards linked to problems, and instead predict labels that are close to ground truth, but differ in subtle ways. We also show that LMs often generate problems that do not fully align with standards described in prompts, suggesting the need for careful scrutiny on use cases involving LMs for generating curricular materials. Finally, we categorize problems in GSM8k using math standards, allowing us to better understand why some problems are more difficult to solve for models than others.
- KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions. Fangyuan Xu, Kyle Lo, Luca Soldaini, Bailey Kuehl, Eunsol Choi, and David Wadden. In Findings of the Association for Computational Linguistics: ACL 2024, Aug 2024.
Large language models (LLMs) adapted to follow user instructions are now widely deployed as conversational agents. In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer. To evaluate the capabilities of current LLMs on this task, we construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain. Given a research question, an initial model-generated answer and a set of relevant papers, an expert annotator iteratively issues instructions for the model to revise and improve its answer. We collect 1,260 interaction turns from 234 interaction sessions with three state-of-the-art LLMs. Each turn includes a user instruction, a model response, and a human evaluation of the model response. Through a detailed analysis of the collected responses, we find that all models struggle to incorporate new information into an existing answer, and to perform precise and unambiguous edits. Further, we find that models struggle to judge whether their outputs successfully followed user instructions, with accuracy at least 10 points short of human agreement. Our findings indicate that KIWI will be a valuable resource to measure progress and improve LLMs’ instruction-following capabilities for knowledge intensive writing tasks.
- OLMo: Accelerating the Science of Language Models. Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, and 36 more authors. In ACL, Aug 2024.
Best Paper Award
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code. We hope this release will empower the open research community and inspire a new wave of innovation.
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, and 29 more authors. In ACL, Aug 2024.
Best Paper Award
Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.
- One Thousand and One Pairs: A "novel" challenge for long-context language models. Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, and Mohit Iyyer. ArXiv, Jun 2024.
Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building. The methodology proposed in NoCha allows for the evolution of the benchmark dataset and the easy analysis of future models.
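Since NoCha is built from minimally different true/false claim pairs, one natural scoring rule (assumed here for illustration, not quoted from the paper) grants credit only when both claims in a pair are verified correctly, which puts independent random guessing at 25%:

```python
def pair_accuracy(pair_predictions):
    """pair_predictions: list of (verdict_on_true_claim, verdict_on_false_claim),
    each verdict being "TRUE" or "FALSE". A pair counts only if both verdicts
    are correct; independent random guessing scores 0.25 under this rule."""
    hits = sum(v_t == "TRUE" and v_f == "FALSE" for v_t, v_f in pair_predictions)
    return hits / len(pair_predictions)

# Two pairs, only the first fully correct -> 0.5
print(pair_accuracy([("TRUE", "FALSE"), ("TRUE", "TRUE")]))
```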
- DataComp-LM: In search of the next generation of training sets for language models. Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, and 52 more authors. ArXiv, Jun 2024.
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
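The takeaway that model-based filtering matters can be pictured as a generic quality-filtering pass: score every document with a trained classifier and keep the highest-scoring fraction. The scorer and the keep fraction below are placeholders rather than the DCLM-Baseline recipe.

```python
from typing import Callable, Iterable

def quality_filter(docs: Iterable[str],
                   score_fn: Callable[[str], float],
                   keep_fraction: float = 0.1) -> list[str]:
    """Keep the top `keep_fraction` of documents by classifier score.
    `score_fn` stands in for any trained quality model (e.g., a linear
    classifier over n-grams); the cutoff is illustrative only."""
    scored = sorted(docs, key=score_fn, reverse=True)
    cutoff = max(1, int(len(scored) * keep_fraction))
    return scored[:cutoff]

# Toy usage with a stand-in scorer that prefers longer documents.
corpus = ["short", "a somewhat longer document", "the longest document of all three"]
print(quality_filter(corpus, score_fn=len, keep_fraction=0.34))
```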
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature. David Wadden, Kejian Shi, Jacob Daniel Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, and 6 more authors. ArXiv, Jun 2024.
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF demonstrations are notable for their long input contexts, detailed task specifications, and complex structured outputs. While instruction-following resources are available in specific domains such as clinical medicine and chemistry, SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields. To demonstrate the utility of SciRIFF, we develop a sample-efficient strategy to adapt a general instruction-following model for science by performing additional finetuning on a mix of general-domain and SciRIFF demonstrations. In evaluations on nine held-out scientific tasks, our model – called SciTulu – improves over a strong LLM baseline by 28.1% and 6.5% at the 7B and 70B scales respectively, while maintaining general instruction-following performance within 2% of the baseline. We are optimistic that SciRIFF will facilitate the development and evaluation of LLMs to help researchers navigate the ever-growing body of scientific literature. We release our dataset, model checkpoints, and data processing and evaluation code to enable further research.
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, and 1 more author. ArXiv, May 2024.
Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, we study the use of instructions in IR systems. First, we introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR repurposes detailed instructions – also known as narratives – developed for professional assessors to evaluate retrieval systems. In particular, we build our benchmark from three collections curated for shared tasks at the Text REtrieval Conference (TREC). These collections contain hundreds to thousands of labeled documents per query, making them suitable for our exploration. Through this process, we can measure how well IR models follow instructions, through a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to correctly use instructions, using them for basic keywords and struggling to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model has significant improvements after fine-tuning on our training set.
- Know Your Audience: The benefits and pitfalls of generating plain language summaries beyond the "general" audience. Tal August, Kyle Lo, Noah A. Smith, and Katharina Reinecke. In CHI, Honolulu, HI, USA, May 2024.
Language models (LMs) show promise as tools for communicating science to the general public by simplifying and summarizing complex language. Because models can be prompted to generate text for a specific audience (e.g., college-educated adults), LMs might be used to create multiple versions of plain language summaries for people with different levels of familiarity with scientific topics. However, it is not clear what the benefits and pitfalls of adaptive plain language are. When is simplifying necessary, what are the costs in doing so, and do these costs differ for readers with different background knowledge? Through three within-subjects studies in which we surface summaries for different envisioned audiences to participants of different backgrounds, we found that while simpler text led to the best reading experience for readers with little to no familiarity with a topic, high-familiarity readers tended to ignore certain details in overly plain summaries (e.g., study limitations). Our work provides methods and guidance on ways of adapting plain language summaries beyond the single “general” audience.
- FABLES: Evaluating faithfulness and content selection in book-length summarization. Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, and 1 more author. ArXiv, Apr 2024.
While long-context large language models (LLMs) can technically summarize book-length documents (>100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study mitigates the issue of data contamination by focusing on summaries of books published in 2023 or 2024, and we hire annotators who have fully read each book prior to the annotation task to minimize cost and cognitive burden. We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD, which allows us to rank LLM summarizers based on faithfulness: Claude-3-Opus significantly outperforms all closed-source LLMs, while the open-source Mixtral is on par with GPT-3.5-Turbo. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate. While LLM-based auto-raters have proven reliable for factuality and coherence in other settings, we implement several LLM raters of faithfulness and find that none correlates strongly with human annotations, especially with regard to detecting unfaithful claims. Our experiments suggest that detecting unfaithful claims is an important future direction not only for summarization evaluation but also as a testbed for long-context understanding. Finally, we move beyond faithfulness by exploring content selection errors in book-length summarization: we develop a typology of omission errors related to crucial narrative elements and also identify a systematic over-emphasis on events occurring towards the end of the book.
- Accelerating Scientific Paper Skimming with Augmented Intelligence Through Customizable Faceted Highlights. Raymond Fok, Luca Soldaini, Cassidy Trier, Erin Bransom, Kelsey MacMillan, Evie Cheng, Hita Kambhamettu, and 5 more authors. ACM Transactions on Interactive Intelligent Systems, Mar 2024.
Scholars need to keep up with an exponentially increasing flood of scientific papers. To aid this challenge, we introduce Scim, a novel intelligent interface that helps scholars skim papers to rapidly review and gain a cursory understanding of their contents. Scim supports the skimming process by highlighting salient content within a paper, directing a scholar’s attention. These automatically-extracted highlights are faceted by content type, evenly distributed across a paper, and have a density configurable by scholars. We evaluate Scim with an in-lab usability study and a longitudinal diary study, revealing how its highlights facilitate the more efficient construction of a conceptualization of a paper. Finally, we describe the process of scaling highlights from their conception within Scim, a research prototype, to production on over 521,000 papers within the Semantic Reader, a publicly-available augmented reading interface for scientific papers. We conclude by discussing design considerations and tensions for the design of future skimming tools with augmented intelligence.
- InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification. Jan Trienes, Sebastian Antony Joseph, Jorg Schlotterer, Christin Seifert, Kyle Lo, Wei Xu, Byron C. Wallace, and 1 more author. ArXiv, Jan 2024.
Text simplification aims to make technical texts more accessible to laypeople but often results in deletion of information and vagueness. This work proposes InfoLossQA, a framework to characterize and recover simplification-induced information loss in the form of question-and-answer (QA) pairs. Building on the theory of Question Under Discussion, the QA pairs are designed to help readers deepen their knowledge of a text. We conduct a range of experiments with this framework. First, we collect a dataset of 1,000 linguist-curated QA pairs derived from 104 LLM simplifications of scientific abstracts of medical studies. Our analyses of this data reveal that information loss occurs frequently, and that the QA pairs give a high-level overview of what information was lost. Second, we devise two methods for this task: end-to-end prompting of open-source and commercial language models, and a natural language inference pipeline. With a novel evaluation framework considering the correctness of QA pairs and their linguistic suitability, our expert evaluation reveals that models struggle to reliably identify information loss and to apply standards similar to those of humans for what constitutes information loss.
2023
- Paloma: A Benchmark for Evaluating Language Model Fit. Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, A. Jha, Oyvind Tafjord, Dustin Schwenk, and 9 more authors. ArXiv, Dec 2023.
Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains—varying distributions of language. Rather than assuming perplexity on one distribution extrapolates to others, Perplexity Analysis for Language Model Assessment (Paloma) measures LM fit to 585 text domains, ranging from nytimes.com to r/depression on Reddit. We invite submissions to our benchmark and organize results by comparability based on compliance with guidelines such as removal of benchmark contamination from pretraining. Submissions can also record parameter and training token count to make comparisons of Pareto efficiency for performance as a function of these measures of cost. We populate our benchmark with results from 6 baselines pretrained on popular corpora. In case studies, we demonstrate analyses that are possible with Paloma, such as finding that pretraining without data beyond Common Crawl leads to inconsistent fit to many domains.
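The underlying measurement is ordinary perplexity, just aggregated separately per domain. A minimal sketch of that aggregation, assuming per-document negative log-likelihoods have already been computed by some language model:

```python
import math
from collections import defaultdict

def perplexity_by_domain(records):
    """records: iterable of (domain, total_nll, num_tokens) per document, where
    total_nll is the summed negative log-likelihood the LM assigns to the
    document's tokens. Returns {domain: perplexity}."""
    nll = defaultdict(float)
    toks = defaultdict(int)
    for domain, doc_nll, n in records:
        nll[domain] += doc_nll
        toks[domain] += n
    return {d: math.exp(nll[d] / toks[d]) for d in nll}

# Toy example with two domains.
print(perplexity_by_domain([("news", 120.0, 50), ("reddit", 300.0, 80)]))
```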
- PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents. Kyle Lo, Zejiang Shen, Benjamin Newman, Joseph Chang, Russell Authur, Erin Bransom, Stefan Candra, and 10 more authors. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Dec 2023.
Best Paper Award
Despite growing interest in applying natural language processing (NLP) and computer vision (CV) models to the scholarly domain, scientific documents remain challenging to work with. They’re often in difficult-to-use PDF formats, and the ecosystem of models to process them is fragmented and incomplete. We introduce PaperMage, an open-source Python toolkit for analyzing and processing visually-rich, structured scientific documents. PaperMage offers clean and intuitive abstractions for seamlessly representing and manipulating both textual and visual document elements. PaperMage achieves this by integrating disparate state-of-the-art NLP and CV models into a unified framework, and provides turn-key recipes for common scientific document processing use-cases. PaperMage has powered multiple research prototypes of AI applications over scientific documents, along with Semantic Scholar’s large-scale production system for processing millions of PDFs. GitHub: https://github.com/allenai/papermage
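For a sense of the toolkit's turn-key recipes, the repository documents a recipe-style entry point along the lines of the sketch below; the class and attribute names follow the project README and should be treated as assumptions, since the API may differ across versions.

```python
# Sketch of papermage's recipe-style usage, following the project README
# (https://github.com/allenai/papermage); names are assumed and may differ
# across versions.
from papermage.recipes import CoreRecipe

recipe = CoreRecipe()
doc = recipe.run("example-paper.pdf")   # parses text, layout, and visual elements

# The resulting Document exposes layered annotations over the same text.
for sentence in doc.sentences[:3]:
    print(sentence.text)
```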
- Decomposing Complex Queries for Tip-of-the-tongue Retrieval. Kevin Lin, Kyle Lo, Joseph Gonzalez, and Dan Klein. In Findings of EMNLP, Dec 2023.
When re-finding items, users who forget or are uncertain about identifying details often rely on creative strategies for expressing their information needs—complex queries that describe content elements (e.g., book characters or events), information beyond the document text (e.g., descriptions of book covers), or personal context (e.g., when they read a book). Standard retrieval models that rely on lexical or semantic overlap between query and document text are challenged in such retrieval settings, known as tip-of-the-tongue (TOT) retrieval. We introduce a simple but effective framework for handling such complex queries by decomposing the query with an LLM into individual clues, routing those as subqueries to specialized retrievers, and ensembling the results. Our approach takes advantage of off-the-shelf retrievers (e.g., CLIP for retrieving images of book covers) or incorporates retriever-specific logic (e.g., date constraints). We show that our framework incorporating query decomposition into retrievers improves gold book recall by up to 6% absolute (Recall@5) on a new collection of 14,441 real-world query-book pairs from an online community for resolving TOT inquiries.
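A minimal sketch of the decompose-route-ensemble framework is shown below; the LLM decomposer, the per-clue retrievers, and the reciprocal-rank ensembling are all stand-ins chosen for illustration rather than the paper's exact components.

```python
from collections import defaultdict

def tot_search(query, decompose, retrievers, top_k=5):
    """decompose(query) -> list of (clue_type, clue_text); retrievers maps a
    clue_type (e.g., "text", "book_cover", "date") to a function returning a
    ranked list of candidate book ids. Candidates are ensembled by summing
    reciprocal ranks across clues (one simple choice among many)."""
    scores = defaultdict(float)
    for clue_type, clue_text in decompose(query):
        retriever = retrievers.get(clue_type)
        if retriever is None:          # e.g., no image retriever configured
            continue
        for rank, book_id in enumerate(retriever(clue_text), start=1):
            scores[book_id] += 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```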
- A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents. Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, and Kyle Lo. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023.
Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context from the original document. In this work, we use language models to rewrite snippets from scientific documents to be read on their own. First, we define the requirements and challenges for this user-facing decontextualization task, such as clarifying where edits occur and handling references to other documents. Second, we propose a framework that decomposes the task into three stages: question generation, question answering, and rewriting. Using this framework, we collect gold decontextualizations from experienced scientific article readers. We then conduct a range of experiments across state-of-the-art commercial and open-source language models to identify how to best provide missing-but-relevant information to models for our task. Finally, we develop QaDecontext, a simple prompting strategy inspired by our framework that improves over end-to-end prompting. We conclude with analysis that finds, while rewriting is easy, question generation and answering remain challenging for today’s models.
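The three-stage framework can be sketched as a short prompting pipeline; llm below is a placeholder for any completion call, and the prompts are loose paraphrases rather than the paper's actual QaDecontext templates.

```python
def qa_decontextualize(snippet, source_doc, llm):
    """Three stages: (1) generate clarification questions about the snippet,
    (2) answer them from the source document, (3) rewrite the snippet so it
    stands alone. `llm(prompt) -> str` is a stand-in for any LLM call."""
    questions = llm(
        "List questions a reader would need answered to understand this "
        f"snippet on its own:\n{snippet}"
    )
    answers = llm(
        f"Answer each question using only this document:\n{source_doc}\n\n"
        f"Questions:\n{questions}"
    )
    return llm(
        "Rewrite the snippet so it is understandable without the source "
        "document, marking any edits, using these answers.\n"
        f"Snippet: {snippet}\nQ&A:\n{questions}\n{answers}"
    )
```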
- Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval. John Giorgi, Luca Soldaini, Bo Wang, Gary Bader, Kyle Lo, Lucy Wang, and Arman Cohan. In Findings of EMNLP, Dec 2023.
Multi-document summarization (MDS) assumes a set of topic-related documents are provided as input. In practice, this document set is not always available; it would need to be retrieved given an information need, i.e. a question or topic statement, a setting we dub “open-domain” MDS. We study this more challenging setting by formalizing the task and bootstrapping it using existing datasets, retrievers and summarizers. Via extensive automatic and human evaluation, we determine: (1) state-of-the-art summarizers suffer large reductions in performance when applied to open-domain MDS, (2) additional training in the open-domain setting can reduce this sensitivity to imperfect retrieval, and (3) summarizers are insensitive to the retrieval of duplicate documents and the order of retrieved documents, but highly sensitive to other errors, like the retrieval of irrelevant documents. Based on our results, we provide practical guidelines to enable future work on open-domain MDS, e.g. how to choose the number of retrieved documents to summarize. Our results suggest that new retrieval and summarization methods and annotated resources for training and evaluation are necessary for further progress in the open-domain setting.
- Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders. Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, and Kyle Lo. ArXiv, Nov 2023.
Prevailing research practice today often relies on training dense retrievers on existing large datasets such as MSMARCO and then experimenting with ways to improve zero-shot generalization capabilities to unseen domains. While prior work has tackled this challenge through resource-intensive steps such as data augmentation, architectural modifications, increasing model size, or even further base model pretraining, comparatively little investigation has examined whether the training procedures themselves can be improved to yield better generalization capabilities in the resulting models. In this work, we recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives. We validate these recommendations using the BEIR benchmark and find results are persistent across choice of dense encoder and base model size and are complementary to other resource-intensive strategies for out-of-domain generalization such as architectural modifications or additional pretraining. We hope that this thorough and impartial study around various training techniques, which augments other resource-intensive methods, offers practical insights for developing a dense retrieval model that effectively generalizes, even when trained on a single dataset.
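The in-batch negatives part of the recipe corresponds to the standard contrastive objective in which every other passage in the batch serves as a negative for a given query; a minimal PyTorch version, with temperature and shapes chosen for illustration:

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(q_emb, p_emb, temperature=0.05):
    """q_emb, p_emb: (batch, dim) embeddings of queries and their positive
    passages. Each query's positive is the same-index row of p_emb; all other
    rows in the batch act as negatives."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / temperature                    # (batch, batch) similarities
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings.
loss = in_batch_negatives_loss(torch.randn(8, 128), torch.randn(8, 128))
```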
- BooookScore: A systematic exploration of book-length summarization in the era of LLMs. Yapei Chang, Kyle Lo, Tanya Goyal, and Mohit Iyyer. ArXiv, Oct 2023.
Summarizing book-length documents (>100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries. Despite the complexity and importance of this task, it has yet to be meaningfully studied due to the challenges of evaluation: existing book-length summarization datasets (e.g., BookSum) are in the pretraining data of most public LLMs, and existing evaluation methods struggle to capture errors made by modern LLM summarizers. In this paper, we present the first study of the coherence of LLM-based book-length summarizers implemented via two prompting workflows: (1) hierarchically merging chunk-level summaries, and (2) incrementally updating a running summary. We obtain 1193 fine-grained human annotations on GPT-4 generated summaries of 100 recently-published books and identify eight common types of coherence errors made by LLMs. Because human evaluation is expensive and time-consuming, we develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. BooookScore has high agreement with human annotations and allows us to systematically evaluate the impact of many other critical parameters (e.g., chunk size, base LLM) while saving $15K and 500 hours in human evaluation costs. We find that closed-source LLMs such as GPT-4 and Claude 2 produce summaries with higher BooookScore than the oft-repetitive ones generated by LLaMA 2. Incremental updating yields lower BooookScore but higher level of detail than hierarchical merging, a trade-off sometimes preferred by human annotators. We release code and annotations after blind review to spur more principled research on book-length summarization.
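The metric itself is simple to state: the fraction of summary sentences that carry none of the identified coherence-error types. A sketch under the assumption that sentence-level error annotations (human or model-produced) are already available:

```python
def booookscore(sentence_errors):
    """sentence_errors: list with one entry per summary sentence, each entry a
    (possibly empty) list of coherence-error labels attached to that sentence.
    Returns the proportion of sentences with no errors."""
    clean = sum(1 for errors in sentence_errors if not errors)
    return clean / len(sentence_errors)

# Four sentences, one flagged with an error -> 0.75
print(booookscore([[], ["entity omission"], [], []]))
```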
- The Rise of Open Science: Tracking the Evolution and Perceived Value of Data and Methods Link-Sharing Practices. Hancheng Cao, Jesse Dodge, Kyle Lo, Daniel A. McFarland, and Lucy Lu Wang. ArXiv, Oct 2023.
In recent years, funding agencies and journals increasingly advocate for open science practices (e.g. data and method sharing) to improve the transparency, access, and reproducibility of science. However, quantifying these practices at scale has proven difficult. In this work, we leverage a large-scale dataset of 1.1M papers from arXiv that are representative of the fields of physics, math, and computer science to analyze the adoption of data and method link-sharing practices over time and their impact on article reception. To identify links to data and methods, we train a neural text classification model to automatically classify URL types based on contextual mentions in papers. We find evidence that the practice of link-sharing to methods and data is spreading as more papers include such URLs over time. Reproducibility efforts may also be spreading because the same links are being increasingly reused across papers (especially in computer science); and these links are increasingly concentrated within fewer web domains (e.g. Github) over time. Lastly, articles that share data and method links receive increased recognition in terms of citation count, with a stronger effect when the shared links are active (rather than defunct). Together, these findings demonstrate the increased spread and perceived value of data and method sharing practices in open science.
- When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets. Orion Weller, Kyle Lo, David Wadden, Dawn J Lawrie, Benjamin Van Durme, Arman Cohan, and Luca Soldaini. ArXiv, Sep 2023.
Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find that there exists a strong negative correlation between retriever performance and gains from expansion: expansion improves scores for weaker models, but generally harms stronger models. We show this trend holds across a set of eleven expansion techniques, twelve datasets with diverse distribution shifts, and twenty-four retrieval models. Through qualitative error analysis, we hypothesize that although expansions provide extra information (potentially improving recall), they add additional noise that makes it difficult to discern between the top relevant documents (thus introducing false positives). Our results suggest the following recipe: use expansions for weaker models or when the target dataset significantly differs from the training corpus in format; otherwise, avoid expansions to keep the relevance signal clear.
- Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation. Hao Peng, Qingqing Cao, Jesse Dodge, Matthew E. Peters, Jared Fernandez, Tom Sherborne, Kyle Lo, and 7 more authors. ArXiv, Jul 2023.
Rising computational demands of modern natural language processing (NLP) systems have increased the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet, progress on model efficiency has been impeded by practical challenges in model evaluation and comparison. For example, hardware is challenging to control due to disparate levels of accessibility across different institutions. Moreover, improvements in metrics such as FLOPs often fail to translate to progress in real-world applications. In response, we introduce Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency. Pentathlon focuses on inference, which accounts for a majority of the compute in a model’s lifecycle. It offers a strictly-controlled hardware platform, and is designed to mirror real-world application scenarios. It incorporates a suite of metrics that target different aspects of efficiency, including latency, throughput, memory overhead, and energy consumption. Pentathlon also comes with a software library that can be seamlessly integrated into any codebase and enables evaluation. As a standardized and centralized evaluation platform, Pentathlon can drastically reduce the workload to make fair and reproducible efficiency comparisons. While initially focused on natural language processing (NLP) models, Pentathlon is designed to allow flexible extension to other fields. We envision Pentathlon will stimulate algorithmic innovations in building efficient models, and foster an increased awareness of the social and environmental implications in the development of future-generation NLP models.
- Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents. Catherine Chen, Zejiang Shen, Dan Klein, Gabriel Stanovsky, Doug Downey, and Kyle Lo. In Findings of ACL, Jul 2023.
Recent work has shown that infusing layout features into language models (LMs) improves processing of visually-rich documents such as scientific papers. Layout-infused LMs are often evaluated on documents with familiar layout features (e.g., papers from the same publisher), but in practice models encounter documents with unfamiliar distributions of layout features, such as new combinations of text sizes and styles, or new spatial configurations of textual elements. In this work we test whether layout-infused LMs are robust to layout distribution shifts. As a case study we use the task of scientific document structure recovery, segmenting a scientific paper into its structural categories (e.g., “title”, “caption”, “reference”). To emulate distribution shifts that occur in practice we re-partition the GROTOAP2 dataset. We find that under layout distribution shifts model performance degrades by up to 20 F1. Simple training strategies, such as increasing training diversity, can reduce this degradation by over 35% relative F1; however, models fail to reach in-distribution performance in any tested out-of-distribution conditions. This work highlights the need to consider layout distribution shifts during model evaluation, and presents a methodology for conducting such evaluations.
- LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization. Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, and Kyle Lo. In EACL, May 2023.
Outstanding Paper Award
While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units correlate highly with scores from a full annotation workload (0.89 Kendall’s tau using 50% judgements). We release our human judgments, annotation templates, and software as a Python library for future research.
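The reported 0.89 Kendall's tau compares system-level faithfulness scores from a 50% partial annotation against scores from the full annotation; the comparison itself is a one-liner with SciPy (the system scores below are made up for illustration):

```python
from scipy.stats import kendalltau

# Hypothetical faithfulness scores for five systems under full vs. 50% partial
# fine-grained annotation; the numbers are illustrative only.
full_scores = [0.82, 0.74, 0.69, 0.91, 0.63]
partial_scores = [0.80, 0.75, 0.66, 0.90, 0.65]

tau, p_value = kendalltau(full_scores, partial_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```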
- Beyond Summarization: Designing AI Support for Real-World Expository Writing Tasks. Zejiang Shen, Tal August, Pao Siangliulue, Kyle Lo, Jonathan Bragg, Jeff Hammerbacher, Doug Downey, and 2 more authors. In Intelligent and Interactive Writing Assistants (In2Writing) Workshop, Apr 2023.
Large language models have introduced exciting new opportunities and challenges in designing and developing new AI-assisted writing support tools. Recent work has shown that leveraging this new technology can transform writing in many scenarios such as ideation during creative writing, editing support, and summarization. However, AI-supported expository writing, including real-world tasks like scholars writing literature reviews or doctors writing progress notes, is relatively understudied. In this position paper, we argue that developing AI supports for expository writing has unique and exciting research challenges and can lead to high real-world impacts. We characterize expository writing as evidence-based and knowledge-generating: it contains summaries of external documents as well as new information or knowledge. It can be seen as the product of authors’ sensemaking process over a set of source documents, and the interplay between reading, reflection, and writing opens up new opportunities for designing AI support. We sketch three components for AI support design and discuss considerations for future research.
- CiteSee: Augmenting Citations in Scientific Papers with Persistent and Personalized Historical Context. Joseph Chee Chang, Amy X. Zhang, Jonathan Bragg, Andrew Head, Kyle Lo, Doug Downey, and Daniel S. Weld. In CHI, Hamburg, Germany, Apr 2023.
Best Paper Award
When reading a scholarly article, inline citations help researchers contextualize the current article and discover relevant prior work. However, it can be challenging to prioritize and make sense of the hundreds of citations encountered during literature reviews. This paper introduces CiteSee, a paper reading tool that leverages a user’s publishing, reading, and saving activities to provide personalized visual augmentations and context around citations. First, CiteSee connects the current paper to familiar contexts by surfacing known citations a user had cited or opened. Second, CiteSee helps users prioritize their exploration by highlighting relevant but unknown citations based on saving and reading history. We conducted a lab study that suggests CiteSee is significantly more effective for paper discovery than three baselines. A field deployment study shows CiteSee helps participants keep track of their explorations and leads to better situational awareness and increased paper discovery via inline citation when conducting real-world literature reviews.
- Paper Plain: Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing. Tal August, Lucy Lu Wang, Jonathan Bragg, Marti A. Hearst, Andrew Head, and Kyle Lo. ACM Transactions on Computer-Human Interaction (TOCHI), Apr 2023.
When seeking information not covered in patient-friendly documents, healthcare consumers may turn to the research literature. Reading medical papers, however, can be a challenging experience. To improve access to medical papers, we explore four features enabled by natural language processing: definitions of unfamiliar terms, in-situ plain language section summaries, a collection of key questions that guides readers to answering passages, and plain language summaries of those passages. We embody these features into a prototype system, Paper Plain. We evaluate Paper Plain, finding that participants who used the prototype system had an easier time reading research papers without a loss in paper comprehension compared to those who used a typical PDF reader. Altogether, the study results suggest that guiding readers to relevant passages and providing plain language summaries alongside the original paper content can make reading medical papers easier and give readers more confidence to approach these papers.
- Scim: Intelligent Skimming Support for Scientific Papers. Raymond Fok, Hita Kambhamettu, Luca Soldaini, Jonathan Bragg, Kyle Lo, Marti Hearst, Andrew Head, and 1 more author. In IUI, Sydney, NSW, Australia, Mar 2023.
Scholars need to keep up with an exponentially increasing flood of scientific papers. To aid this challenge, we introduce Scim, a novel intelligent interface that helps experienced researchers skim – or rapidly review – a paper to attain a cursory understanding of its contents. Scim supports the skimming process by highlighting salient paper contents in order to direct a reader’s attention. The system’s highlights are faceted by content type, evenly distributed across a paper, and have a density configurable by readers at both the global and local level. We evaluate Scim with both an in-lab usability study and a longitudinal diary study, revealing how its highlights facilitate the more efficient construction of a conceptualization of a paper. We conclude by discussing design considerations and tensions for the design of future intelligent skimming tools.
- LIMEADE: From AI Explanations to Advice Taking. Benjamin Charles Germain Lee, Doug Downey, Kyle Lo, and Daniel S. Weld. ACM Transactions on Interactive Intelligent Systems, Mar 2023.
Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow an AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA2Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on seventy real-world models across two broad domains: image classification and text recommendation. We show our method improves accuracy compared to a rigorous baseline on the image classification domains. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.
- The Semantic Scholar Open Data Platform. Rodney Michael Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, and 41 more authors. ArXiv, Jan 2023.
The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to date, with 200M+ papers, 80M+ authors, 550M+ paper-authorship edges, and 2.4B+ citation edges. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings. In this paper, we describe the components of the S2 data processing pipeline and the associated APIs offered by the platform. We will update this living document to reflect changes as we add new data offerings and improve existing services.
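The platform's public Graph API can be queried directly over HTTP. The sketch below uses the paper-search endpoint as publicly documented; field names and rate limits may change, and sustained use typically requires an API key.

```python
import requests

# Search the Semantic Scholar Academic Graph for papers matching a query.
# Endpoint and field names follow the public API docs at the time of writing.
resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={"query": "scholarly document processing",
            "fields": "title,year,citationCount",
            "limit": 5},
    timeout=30,
)
resp.raise_for_status()
for paper in resp.json().get("data", []):
    print(paper.get("year"), paper.get("title"), paper.get("citationCount"))
```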
2022
- SciFact-Open: Towards open-domain scientific claim verification. David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Iz Beltagy, Lucy Lu Wang, and Hannaneh Hajishirzi. In Findings of EMNLP, Dec 2022.
While research on scientific claim verification has led to the development of powerful systems that appear to approach human performance, these approaches have yet to be tested in a realistic setting against large corpora of scientific literature. Moving to this open-domain evaluation setting, however, poses unique challenges; in particular, it is infeasible to exhaustively annotate all evidence documents. In this work, we present SciFact-Open, a new test collection designed to evaluate the performance of scientific claim verification systems on a corpus of 500K research abstracts. Drawing upon pooling techniques from information retrieval, we collect evidence for scientific claims by pooling and annotating the top predictions of four state-of-the-art scientific claim verification models. We find that systems developed on smaller corpora struggle to generalize to SciFact-Open, exhibiting performance drops of at least 15 F1. In addition, analysis of the evidence in SciFact-Open reveals interesting phenomena likely to appear when claim verification systems are deployed in practice, e.g., cases where the evidence supports only a special case of the claim. Our dataset is available at https://github.com/dwadden/scifact-open.
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. Teven Le Scao, Angela Fan, Christopher Akiki, Elizabeth-Jane Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, and 383 more authors. ArXiv, Nov 2022.
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
- Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities. Zejiang Shen, Kyle Lo, Lauren Yu, Nathan Dahlberg, Margo Schlanger, and Doug Downey. In NeurIPS (Datasets and Benchmarks), Nov 2022.
With the advent of large language models, methods for abstractive summarization have made great strides, creating potential for use in applications to aid knowledge workers processing unwieldy document collections. One such setting is the Civil Rights Litigation Clearinghouse (CRLC, https://clearinghouse.net), which posts information about large-scale civil rights lawsuits, serving lawyers, scholars, and the general public. Today, summarization in the CRLC requires extensive training of lawyers and law students who spend hours per case understanding multiple relevant documents in order to produce high-quality summaries of key events and outcomes. Motivated by this ongoing real-world summarization effort, we introduce Multi-LexSum, a collection of 9,280 expert-authored summaries drawn from ongoing CRLC writing. Multi-LexSum presents a challenging multi-document summarization task given the length of the source documents, often exceeding two hundred pages per case. Furthermore, Multi-LexSum is distinct from other datasets in its multiple target summaries, each at a different granularity (ranging from one-sentence "extreme" summaries to multi-paragraph narrations of over five hundred words). We present extensive analysis demonstrating that despite the high-quality summaries in the training data (adhering to strict content and style guidelines), state-of-the-art summarization models perform poorly on this task. We release Multi-LexSum for further summarization research and to facilitate the development of applications to assist in the CRLC’s mission at https://multilexsum.github.io.
- Overview of the Third Workshop on Scholarly Document Processing. Arman Cohan, Guy Feigenblat, Dayne Freitag, Tirthankar Ghosal, Drahomira Herrmannova, Petr Knoth, Kyle Lo, and 4 more authors. In Scholarly Document Processing (SDP) Workshop, Oct 2022.
With the ever-increasing pace of research and high volume of scholarly communication, scholars face a daunting task. Not only must they keep up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. These needs have motivated an increasing focus on computational methods for enhancing search, summarization, and analysis of scholarly documents. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts in this area, and enable shared access to published research, we held the 3rd Workshop on Scholarly Document Processing (SDP) at COLING as a hybrid event (https://sdproc.org/2022/). The SDP workshop consisted of a research track, three invited talks and five Shared Tasks: 1) MSLR22: Multi-Document Summarization for Literature Reviews, 2) DAGPap22: Detecting automatically generated scientific papers, 3) SV-Ident 2022: Survey Variable Identification in Social Science Publications, 4) SKGG: Scholarly Knowledge Graph Generation, 5) MuP 2022: Multi Perspective Scientific Document Summarization. The program was geared towards NLP, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.
- MultiVerS: Improving scientific claim verification with weak supervision and full-document context. David Wadden, Kyle Lo, Lucy Lu Wang, Arman Cohan, Iz Beltagy, and Hannaneh Hajishirzi. In Findings of NAACL, Jul 2022.
The scientific claim verification task requires an NLP system to label scientific documents which Support or Refute an input claim, and to select evidentiary sentences (or rationales) justifying each predicted label. In this work, we present MultiVerS, which predicts a fact-checking label and identifies rationales in a multitask fashion based on a shared encoding of the claim and full document context. This approach accomplishes two key modeling goals. First, it ensures that all relevant contextual information is incorporated into each labeling decision. Second, it enables the model to learn from instances annotated with a document-level fact-checking label, but lacking sentence-level rationales. This allows MultiVerS to perform weakly-supervised domain adaptation by training on scientific documents labeled using high-precision heuristics. Our approach outperforms two competitive baselines on three scientific claim verification datasets, with particularly strong performance in zero / few-shot domain adaptation experiments. Our code and data are available at https://github.com/dwadden/multivers.
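The multitask design, in which one shared encoding of the claim and full document feeds both a fact-checking label head and a per-sentence rationale head, can be sketched as follows; the encoder and dimensions are placeholders rather than the paper's actual setup.

```python
import torch
import torch.nn as nn

class MultitaskVerifier(nn.Module):
    """Shared document encoding feeding (1) a claim-level label head
    (e.g., SUPPORT / REFUTE / NOT-ENOUGH-INFO) and (2) a per-sentence
    rationale head. The encoder is a placeholder; any long-context
    encoder could slot in."""

    def __init__(self, encoder, hidden=768, num_labels=3):
        super().__init__()
        self.encoder = encoder                 # maps inputs -> (num_sentences, hidden)
        self.label_head = nn.Linear(hidden, num_labels)
        self.rationale_head = nn.Linear(hidden, 1)

    def forward(self, claim_doc_inputs):
        sent_reprs = self.encoder(claim_doc_inputs)       # (num_sentences, hidden)
        doc_repr = sent_reprs.mean(dim=0)                 # pooled document vector
        label_logits = self.label_head(doc_repr)          # (num_labels,)
        rationale_logits = self.rationale_head(sent_reprs).squeeze(-1)  # (num_sentences,)
        return label_logits, rationale_logits
```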
- MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting. Anne Lauscher, Brandon Ko, Bailey Kuehl, Sophie Johnson, Arman Cohan, David Jurgens, and Kyle Lo. In NAACL, Jul 2022.
Citation context analysis (CCA) is an important task in natural language processing that studies how and why scholars discuss each other’s work. Despite decades of study, computational methods for CCA have largely relied on overly simplistic assumptions of how authors cite, which ignore several important phenomena. For instance, scholarly papers often contain rich discussions of cited work that span multiple sentences and express multiple intents concurrently. Yet, recent work in CCA is often approached as a single-sentence, single-label classification task, and thus many datasets used to develop modern computational approaches fail to capture this interesting discourse. To address this research gap, we highlight three understudied phenomena for CCA and release MULTICITE, a new dataset of 12.6K citation contexts from 1.2K computational linguistics papers that fully models these phenomena. Not only is it the largest collection of expert-annotated citation contexts to date, but MULTICITE also contains multi-sentence, multi-label citation contexts annotated throughout entire full paper texts. We demonstrate how MULTICITE can enable the development of new computational methods on three important CCA tasks. We release our code and dataset at https://github.com/allenai/multicite.
- Automatic question answering for multiple stakeholders, the epidemic question answering datasetTravis R. Goodwin, Dina Demner-Fushman, Kyle Lo, Lucy Lu Wang, Hoa T. Dang, and Ian M. SoboroffScientific Data, Jul 2022
One of the effects of the COVID-19 pandemic is a rapidly growing and changing stream of publications to inform clinicians, researchers, policy makers, and patients about the health, socio-economic, and cultural consequences of the pandemic. Managing this information stream manually is not feasible. Automatic Question Answering can quickly bring the most salient points to the user’s attention. Leveraging a collection of scientific articles, government websites, relevant news articles, curated social media posts, and questions asked by researchers, clinicians, and the general public, we developed a dataset to explore automatic Question Answering for multiple stakeholders. Analysis of questions asked by various stakeholders shows that while the information needs of experts and the public may overlap, satisfactory answers to these questions often originate from different information sources or benefit from different approaches to answer generation. We believe that this dataset has the potential to support the development of question answering systems not only for epidemic questions, but also for other domains with varying expertise, such as law or finance.
- Data Governance in the Age of Large-Scale Data-Driven Language TechnologyYacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, and 13 more authorsIn FAccT, Seoul, Republic of Korea, Jun 2022
The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual DatasetHugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova Moral, Teven Le Scao, Leandro Von Werra, and 47 more authorsIn NeurIPS (Datasets and Benchmarks), May 2022
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
- Generating Scientific Claims for Zero-Shot Scientific Fact CheckingDustin Wright, David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Isabelle Augenstein, and Lucy Lu WangIn ACL, May 2022
Automated scientific fact checking is difficult due to the complexity of scientific language and a lack of significant amounts of training data, as annotation requires domain expertise. To address this challenge, we propose scientific claim generation, the task of generating one or more atomic and verifiable claims from scientific sentences, and demonstrate its usefulness in zero-shot fact checking for biomedical claims. We propose CLAIMGEN-BART, a new supervised method for generating claims supported by the literature, as well as KBIN, a novel method for generating claim negations. Additionally, we adapt an existing unsupervised entity-centric method of claim generation to biomedical claims, which we call CLAIMGEN-ENTITY. Experiments on zero-shot fact checking demonstrate that both CLAIMGEN-ENTITY and CLAIMGEN-BART, coupled with KBIN, achieve up to 90% of the performance of fully supervised models trained on manually annotated claims and evidence. A rigorous evaluation study demonstrates significant improvements in generated claim and negation quality over existing baselines.
- VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout GroupsZejiang Shen, Kyle Lo, Lucy Lu Wang, Bailey Kuehl, Daniel S. Weld, and Doug DowneyTransactions of ACL (TACL), May 2022
Accurately extracting structured content from PDFs is a critical first step for NLP over scientific papers. Recent work has improved extraction accuracy by incorporating elementary layout information, for example, each token’s 2D position on the page, into language model pretraining. We introduce new methods that explicitly model VIsual LAyout (VILA) groups, that is, text lines or text blocks, to further improve performance. In our I-VILA approach, we show that simply inserting special tokens denoting layout group boundaries into model inputs can lead to a 1.9% Macro F1 improvement in token classification. In the H-VILA approach, we show that hierarchical encoding of layout-groups can result in up to 47% inference time reduction with less than 0.8% Macro F1 loss. Unlike prior layout-aware approaches, our methods do not require expensive additional pretraining, only fine-tuning, which we show can reduce training cost by up to 95%. Experiments are conducted on a newly curated evaluation suite, S2-VLUE, that unifies existing automatically labeled datasets and includes a new dataset of manual annotations covering diverse papers from 19 scientific disciplines. Pre-trained weights, benchmark datasets, and source code are available at https://github.com/allenai/VILA.
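The I-VILA idea above — inserting special tokens at visual layout group boundaries before token classification — can be illustrated with a small preprocessing sketch. The boundary token string and the group format here are illustrative assumptions; the actual models and data are in the linked repository.

```python
# Sketch of the "insert a boundary token between layout groups" step.
# The token string "[BLK]" and the group format are illustrative assumptions,
# not the released implementation.

def insert_group_boundaries(layout_groups, boundary_token="[BLK]"):
    """layout_groups: list of token lists, one list per text block or line."""
    flattened = []
    for i, group in enumerate(layout_groups):
        if i > 0:
            flattened.append(boundary_token)  # mark the visual boundary
        flattened.extend(group)
    return flattened

groups = [
    ["Deep", "Residual", "Learning"],                               # title block
    ["Kaiming", "He", "et", "al."],                                 # author block
    ["Deeper", "networks", "are", "harder", "to", "train", "."],    # body block
]
print(insert_group_boundaries(groups))
# ['Deep', 'Residual', 'Learning', '[BLK]', 'Kaiming', 'He', ...]
```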
- ACCoRD: A Multi-Document Approach to Generating Diverse Descriptions of Scientific ConceptsSonia K. Murthy, Kyle Lo, Daniel King, Chandra Bhagavatula, Bailey Kuehl, Sophie Johnson, Jon Borchardt, and 3 more authorsArXiv, May 2022
Systems that can automatically define unfamiliar terms hold the promise of improving the accessibility of scientific texts, especially for readers who may lack prerequisite background knowledge. However, current systems assume a single "best" description per concept, which fails to account for the many potentially useful ways a concept can be described. We present ACCoRD, an end-to-end system tackling the novel task of generating sets of descriptions of scientific concepts. Our system takes advantage of the myriad ways a concept is mentioned across the scientific literature to produce distinct, diverse descriptions of target scientific concepts in terms of different reference concepts. To support research on the task, we release an expert-annotated resource, the ACCoRD corpus, which includes 1,275 labeled contexts and 1,787 hand-authored concept descriptions. We conduct a user study demonstrating that (1) users prefer descriptions produced by our end-to-end system, and (2) users prefer multiple descriptions to a single "best" description.
- Exploring the Role of Local and Global Explanations in Recommender SystemsMarissa Radensky, Doug Downey, Kyle Lo, Zoran Popovic, and Daniel S WeldIn CHI (Extended Abstracts), New Orleans, LA, USA, Apr 2022
Explanations are well-known to improve recommender systems’ transparency. These explanations may be local, explaining individual recommendations, or global, explaining the recommender model overall. Despite their widespread use, there has been little investigation into the relative benefits of the two explanation approaches. We conducted a 30-participant exploratory study and a 30-participant controlled user study with a research-paper recommender to analyze how providing local, global, or both explanations influences user understanding of system behavior. Our results provide evidence suggesting that both are more helpful than either alone for explaining how to improve recommendations, yet both appeared less helpful than global alone for efficiently identifying false positive and negative recommendations. However, we note that the two explanation approaches may be better compared in a higher-stakes or more opaque domain.
- Infrastructure for rapid open knowledge network developmentMichael Cafarella, Michael Anderson, Iz Beltagy, Arie Cattan, Sarah Chasins, Ido Dagan, Doug Downey, and 19 more authorsAI Magazine, Mar 2022
The past decade has witnessed a growth in the use of knowledge graph technologies for advanced data search, data integration, and query-answering applications. The leading example of a public, general-purpose open knowledge network (aka knowledge graph) is Wikidata, which has demonstrated remarkable advances in quality and coverage over this time. Proprietary knowledge graphs drive some of the leading applications of the day including, for example, Google Search, Alexa, Siri, and Cortana. Open Knowledge Networks are exciting: they promise the power of structured database-like queries with the potential for the wide coverage that is today only provided by the Web. With the current state of the art, building, using, and scaling large knowledge networks can still be frustratingly slow. This article describes a National Science Foundation Convergence Accelerator project to build a set of Knowledge Network Programming Infrastructure systems to address this issue.
2021
- FLEX: Unifying Evaluation for Few-Shot NLPJonathan Bragg, Arman Cohan, Kyle Lo, and Iz BeltagyIn NeurIPS, Dec 2021
Few-shot NLP research is highly active, yet conducted in disjoint research threads with evaluation suites that lack challenging-yet-realistic testing setups and fail to employ careful experimental design. Consequently, the community does not know which techniques perform best or even if they outperform simple baselines. In response, we formulate the FLEX Principles, a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation. These principles include Sample Size Design, a novel approach to benchmark design that optimizes statistical accuracy and precision while keeping evaluation costs manageable. Following the principles, we release the FLEX benchmark, which includes four few-shot transfer settings, zero-shot evaluation, and a public leaderboard that covers diverse NLP tasks. In addition, we present UniFew, a prompt-based model for few-shot learning that unifies pretraining and finetuning prompt formats, eschewing complex machinery of recent prompt-based approaches in adapting downstream task formats to language model pretraining objectives. We demonstrate that despite simplicity, UniFew achieves results competitive with both popular meta-learning and prompt-based approaches.
- Explaining Relationships Between Scientific DocumentsKelvin Luu, Xinyi Wu, Rik Koncel-Kedziorski, Kyle Lo, Isabel Cachola, and Noah A. SmithIn ACL, Aug 2021
We address the task of explaining relationships between two scientific documents using natural language text. This task requires modeling the complex content of long technical documents, deducing a relationship between these documents, and expressing the details of that relationship in text. In addition to the theoretical interest of this task, successful solutions can help improve researcher efficiency in search and review. In this paper we establish a dataset of 622K examples from 154K documents. We pretrain a large language model to serve as the foundation for autoregressive approaches to the task. We explore the impact of taking different views on the two documents, including the use of dense representations extracted with scientific IE systems. We provide extensive automatic and human evaluations which show the promise of such models, but make clear challenges for future work.
- A Dataset of Information-Seeking Questions and Answers Anchored in Research PapersPradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt GardnerIn NAACL, Jun 2021
Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present Qasper, a dataset of 5049 questions over 1585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.
- Overview and Insights from the SCIVER shared task on Scientific Claim VerificationDavid Wadden, and Kyle LoIn Scholarly Document Processing (SDP) Workshop, Jun 2021
We present an overview of the SCIVER shared task, presented at the 2nd Scholarly Document Processing (SDP) workshop at NAACL 2021. In this shared task, systems were provided a scientific claim and a corpus of research abstracts, and asked to identify which articles Support or Refute the claim as well as provide evidentiary sentences justifying those labels. 11 teams made a total of 14 submissions to the shared task leaderboard, leading to an improvement of more than +23 F1 on the primary task evaluation metric. In addition to surveying the participating systems, we provide several insights into modeling approaches to support continued progress and future research on the important and challenging task of scientific claim verification.
- Overview of the Second Workshop on Scholarly Document ProcessingIz Beltagy, Arman Cohan, Guy Feigenblat, Dayne Freitag, Tirthankar Ghosal, Keith Hall, Drahomira Herrmannova, and 8 more authorsIn Scholarly Document Processing (SDP) Workshop, Jun 2021
With the ever-increasing pace of research and high volume of scholarly communication, scholars face a daunting task. Not only must they keep up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. These needs have motivated an increasing focus on computational methods for enhancing search, summarization, and analysis of scholarly documents. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts in this area, and enable shared access to published research, we held the 2nd Workshop on Scholarly Document Processing (SDP) at NAACL 2021 as a virtual event (https://sdproc.org/2021/). The SDP workshop consisted of a research track, three invited talks, and three Shared Tasks (LongSumm 2021, SCIVER, and 3C). The program was geared towards the application of NLP, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.
- Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and SymbolsAndrew Head, Kyle Lo, Dongyeop Kang, Raymond Fok, Sam Skjonsberg, Daniel S. Weld, and Marti A. HearstIn CHI, Yokohama, Japan, May 2021
Despite the central importance of research papers to scientific progress, they can be difficult to read. Comprehension is often stymied when the information needed to understand a passage resides somewhere else—in another section, or in another paper. In this work, we envision how interfaces can bring definitions of technical terms and symbols to readers when and where they need them most. We introduce ScholarPhi, an augmented reading interface with four novel features: (1) tooltips that surface position-sensitive definitions from elsewhere in a paper, (2) a filter over the paper that “declutters” it to reveal how the term or symbol is used across the paper, (3) automatic equation diagrams that expose multiple definitions in parallel, and (4) an automatically generated glossary of important terms and symbols. A usability study showed that the tool helps researchers of all experience levels read papers. Furthermore, researchers were eager to have ScholarPhi’s definitions available to support their everyday reading.
- Discourse Understanding and Factual Consistency in Abstractive SummarizationSaadia Gabriel, Antoine Bosselut, Jeff Da, Ari Holtzman, Jan Buys, Kyle Lo, Asli Celikyilmaz, and 1 more authorIn EACL, Apr 2021
We introduce a general framework for abstractive summarization with factual consistency and distinct modeling of the narrative flow in an output summary. Our work addresses current limitations of models for abstractive summarization that often hallucinate information or generate summaries with coherence issues. To generate abstractive summaries with factual consistency and narrative flow, we propose Cooperative Generator-Discriminator Networks (Co-opNet), a novel transformer-based framework where the generator works with a discriminator architecture to compose coherent long-form summaries. We explore four different discriminator objectives which each capture a different aspect of coherence, including whether salient spans of generated abstracts are hallucinated or appear in the input context, and the likelihood of sentence adjacency in generated abstracts. We measure the ability of Co-opNet to learn these objectives with arXiv scientific papers, using the abstracts as a proxy for gold long-form scientific article summaries. Empirical results from automatic and human evaluations demonstrate that Co-opNet learns to summarize with considerably improved global coherence compared to competitive baselines.
- Searching for scientific evidence in a pandemic: An overview of TREC-COVIDKirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen Voorhees, and 2 more authorsJournal of Biomedical Informatics, Apr 2021
We present an overview of the TREC-COVID Challenge, an information retrieval (IR) shared task to evaluate search on scientific literature related to COVID-19. The goals of TREC-COVID include the construction of a pandemic search test collection and the evaluation of IR methods for COVID-19. The challenge was conducted over five rounds from April to July 2020, with participation from 92 unique teams and 556 individual submissions. A total of 50 topics (sets of related queries) were used in the evaluation, starting at 30 topics for Round 1 and adding 5 new topics per round to target topics emerging at that stage of the still-evolving pandemic. This paper provides a comprehensive overview of the structure and results of TREC-COVID. Specifically, the paper provides details on the background, task structure, topic structure, corpus, participation, pooling, assessment, judgments, results, top-performing systems, lessons learned, and benchmark datasets.
- TREC-COVID: Constructing a Pandemic Information Retrieval Test CollectionEllen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, and 2 more authorsSIGIR Forum, Feb 2021
TREC-COVID is a community evaluation designed to build a test collection that captures the information needs of biomedical researchers using the scientific literature during a pandemic. One of the key characteristics of pandemic search is the accelerated rate of change: the topics of interest evolve as the pandemic progresses and the scientific literature in the area explodes. The COVID-19 pandemic provides an opportunity to capture this progression as it happens. TREC-COVID, in creating a test collection around COVID-19 literature, is building infrastructure to support new research and technologies in pandemic search.
2020
- Text mining approaches for dealing with the rapidly expanding literature on COVID-19Lucy Lu Wang, and Kyle LoBriefings in Bioinformatics, Dec 2020
More than 50 000 papers have been published about COVID-19 since the beginning of 2020 and several hundred new papers continue to be published every day. This incredible rate of scientific productivity leads to information overload, making it difficult for researchers, clinicians and public health officials to keep up with the latest findings. Automated text mining techniques for searching, reading and summarizing papers are helpful for addressing information overload. In this review, we describe the many resources that have been introduced to support text mining applications over the COVID-19 literature; specifically, we discuss the corpora, modeling resources, systems and shared tasks that have been introduced for COVID-19. We compile a list of 39 systems that provide functionality such as search, discovery, visualization and summarization over the COVID-19 literature. For each system, we provide a qualitative description and assessment of the system’s performance, unique data or user interface features and modeling decisions. Many systems focus on search and discovery, though several systems provide novel features, such as the ability to summarize findings over multiple documents or linking between scientific articles and clinical trials. We also describe the public corpora, models and shared tasks that have been introduced to help reduce repeated effort among community members; some of these resources (especially shared tasks) can provide a basis for comparing the performance of different systems. Finally, we summarize promising results and open challenges for text mining the COVID-19 literature.
- Fact or Fiction: Verifying Scientific ClaimsDavid Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine Zuylen, Arman Cohan, and Hannaneh HajishirziIn EMNLP, Nov 2020
We introduce scientific claim verification, a new task to select abstracts from the research literature containing evidence that SUPPORTS or REFUTES a given scientific claim, and to identify rationales justifying each decision. To study this task, we construct SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales. We develop baseline models for SciFact, and demonstrate that simple domain adaptation techniques substantially improve performance compared to models trained on Wikipedia or political news. We show that our system is able to verify claims related to COVID-19 by identifying evidence from the CORD-19 corpus. Our experiments indicate that SciFact will provide a challenging testbed for the development of new systems designed to retrieve and reason over corpora containing specialized domain knowledge. Data and code for this new task are publicly available at https://github.com/allenai/scifact. A leaderboard and COVID-19 fact-checking demo are available at https://scifact.apps.allenai.org.
- Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future DirectionsDongyeop Kang, Andrew Head, Risham Sidhu, Kyle Lo, Daniel Weld, and Marti A. HearstIn Scholarly Document Processing (SDP) Workshop, Nov 2020
The task of definition detection is important for scholarly papers, because papers often make use of technical terminology that may be unfamiliar to readers. Despite prior work on definition detection, current approaches are far from being accurate enough to use in real-world applications. In this paper, we first perform an in-depth error analysis of the current best-performing definition detection system and discover major causes of errors. Based on this analysis, we develop a new definition detection system, HEDDEx, which utilizes syntactic features, transformer encoders, and heuristic filters, and evaluate it on a standard sentence-level benchmark. Because current benchmarks evaluate randomly sampled sentences, we propose an alternative evaluation that assesses every sentence within a document. This allows for evaluating recall in addition to precision. HEDDEx outperforms the leading system on both the sentence-level and the document-level tasks, by 12.7 F1 points and 14.4 F1 points, respectively. We note that performance on the high-recall document-level task is much lower than in the standard evaluation approach, due to the need to incorporate document structure as features. We discuss remaining challenges in document-level definition detection, ideas for improvements, and potential issues for the development of reading aid applications.
- TLDR: Extreme Summarization of Scientific DocumentsIsabel Cachola, Kyle Lo, Arman Cohan, and Daniel WeldIn Findings of EMNLP, Nov 2020
We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language. To facilitate study on this task, we introduce SCITLDR, a new multi-target dataset of 5.4K TLDRs over 3.2K papers. SCITLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden. We propose CATTS, a simple yet effective learning strategy for generating TLDRs that exploits titles as an auxiliary training signal. CATTS improves upon strong baselines under both automated metrics and human evaluations. Data and code are publicly available at https://github.com/allenai/scitldr.
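CATTS, as described above, treats title generation as an auxiliary training signal for TLDR generation. One common way to realize that idea is to mix the two example types and append a control token telling the model which target to produce; the sketch below uses hypothetical control tokens and field names, not the released data format.

```python
# Sketch of mixing TLDR and title targets with control codes for one
# sequence-to-sequence model. "<|TLDR|>" and "<|TITLE|>" are hypothetical
# control tokens; the released code may format examples differently.

def build_training_pairs(papers):
    pairs = []
    for paper in papers:
        source = paper["abstract"]
        # Auxiliary task: generate the title from the abstract.
        pairs.append((source + " <|TITLE|>", paper["title"]))
        # Main task: generate the TLDR, when one is available.
        if paper.get("tldr"):
            pairs.append((source + " <|TLDR|>", paper["tldr"]))
    return pairs

papers = [{
    "title": "A Study of Widgets",
    "abstract": "We study widgets and find they are useful.",
    "tldr": "Widgets are useful.",
}]
for src, tgt in build_training_pairs(papers):
    print(src, "->", tgt)
```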
- Mitigating Biases in CORD-19 for Analyzing COVID-19 LiteratureAnshul Kanakia, Kuansan Wang, Yuxiao Dong, Boya Xie, Kyle Lo, Zhihong Shen, Lucy Lu Wang, and 4 more authorsFrontiers in Research Metrics and Analytics, Nov 2020
At the behest of the Office of Science and Technology Policy in the White House, six institutions, including ours, have created an open research dataset called COVID-19 Research Dataset (CORD-19) to facilitate the development of question-answering systems that can assist researchers in finding relevant research on COVID-19. As of May 27, 2020, CORD-19 includes more than 100,000 open access publications from major publishers and PubMed as well as preprint articles deposited into medRxiv, bioRxiv, and arXiv. Recent years, however, have also seen question-answering and other machine learning systems exhibit behaviors harmful to humans due to biases in their training data. It is imperative, and only ethical, for modern scientists to be vigilant in inspecting any dataset they work with and to be prepared to mitigate its potential biases. This article describes a framework to examine biases in scientific document collections like CORD-19 by comparing their properties with those derived from the citation behaviors of the entire scientific community. In total, three expanded sets are created for the analyses: 1) the enclosure set CORD-19E composed of CORD-19 articles and their references and citations, mirroring the methodology used in the renowned “A Century of Physics” analysis; 2) the full closure graph CORD-19C that recursively includes references starting with CORD-19; and 3) the inflection closure CORD-19I, that is, a much smaller subset of CORD-19C but already appropriate for statistical analysis based on the theory of the scale-free nature of the citation network. Taken together, all these expanded datasets show much smoother trends when used to analyze global COVID-19 research. The results suggest that while CORD-19 exhibits a strong tilt toward recent and topically focused articles, the knowledge being explored to attack the pandemic encompasses a much longer time span and is very interdisciplinary. A question-answering system with such expanded scope of knowledge may perform better in understanding the literature and answering related questions. However, while CORD-19 appears to have topical coverage biases compared to the expanded sets, the collaboration patterns, especially in terms of team sizes and geographical distributions, are captured very well already in CORD-19 as the raw statistics and trends agree with those from larger datasets.
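The "full closure graph" described above — recursively adding references starting from a seed set of papers — is a standard graph traversal. A minimal sketch, assuming a simple dict-based citation graph rather than the actual CORD-19 data:

```python
# Sketch of computing a reference closure: start from a seed set of papers
# and recursively add everything they reference. The toy graph below stands
# in for real citation data.
from collections import deque

def reference_closure(seed_ids, references):
    """references: dict mapping paper id -> list of referenced paper ids."""
    closure = set(seed_ids)
    frontier = deque(seed_ids)
    while frontier:
        paper = frontier.popleft()
        for ref in references.get(paper, []):
            if ref not in closure:
                closure.add(ref)
                frontier.append(ref)
    return closure

refs = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": ["F"]}
print(sorted(reference_closure({"A"}, refs)))  # ['A', 'B', 'C', 'D', 'E', 'F']
```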
- TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen Voorhees, and 2 more authorsJournal of the American Medical Informatics Association, Jul 2020
TREC-COVID is an information retrieval (IR) shared task initiated to support clinicians and clinical research during the COVID-19 pandemic. IR for pandemics breaks many normal assumptions, which can be seen by examining 9 important basic IR research questions related to pandemic situations. TREC-COVID differs from traditional IR shared task evaluations with special considerations for the expected users, IR modality considerations, topic development, participant requirements, assessment process, relevance criteria, evaluation metrics, iteration process, projected timeline, and the implications of data use as a post-task test collection. This article describes how all these were addressed for the particular requirements of developing IR systems under a pandemic situation. Finally, initial participation numbers are also provided, which demonstrate the tremendous interest the IR community has in this effort.
- CORD-19: The COVID-19 Open Research DatasetLucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, and 21 more authorsIn NLP for COVID-19 Workshop, Jul 2020
The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on COVID-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers. Since its release, CORD-19 has been downloaded over 200K times and has served as the basis of many COVID-19 text mining and discovery systems. In this article, we describe the mechanics of dataset construction, highlighting challenges and key design decisions, provide an overview of how CORD-19 has been used, and describe several shared tasks built around the dataset. We hope this resource will continue to bring together the computing community, biomedical experts, and policy makers in the search for effective treatments and management policies for COVID-19.
- S2ORC: The Semantic Scholar Open Research CorpusKyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel WeldIn ACL, Jul 2020
We introduce S2ORC, a large corpus of 81.1M English-language academic papers spanning many academic disciplines. The corpus consists of rich metadata, paper abstracts, resolved bibliographic references, as well as structured full text for 8.1M open access papers. Full text is annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. In S2ORC, we aggregate papers from hundreds of academic publishers and digital archives into a unified source, and create the largest publicly-available collection of machine-readable academic text to date. We hope this resource will facilitate research and development of tools and tasks for text mining over academic text.
- Don’t Stop Pretraining: Adapt Language Models to Domains and TasksSuchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. SmithIn ACL, Jul 2020
Honorable Mention for Best Paper
Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.
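The domain-adaptive pretraining (DAPT) described above amounts to continuing masked-language-model pretraining on unlabeled in-domain text before fine-tuning on the end task. Below is a minimal sketch with Hugging Face transformers; the base checkpoint, toy corpus, and hyperparameters are placeholders rather than the paper's setup.

```python
# Sketch of domain-adaptive pretraining: continue MLM training on unlabeled
# in-domain text. Checkpoint, corpus, and hyperparameters are placeholders.
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

domain_texts = [
    "The patient presented with acute myocardial infarction.",
    "We report a randomized controlled trial of the new therapy.",
]  # stand-in for a large unlabeled domain corpus

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

class TextDataset(torch.utils.data.Dataset):
    def __init__(self, texts):
        self.enc = tokenizer(texts, truncation=True, max_length=128)
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="dapt-out", num_train_epochs=1,
                         per_device_train_batch_size=2, learning_rate=1e-4)

Trainer(model=model, args=args, train_dataset=TextDataset(domain_texts),
        data_collator=collator).train()
# The adapted checkpoint is then fine-tuned on the labeled end task as usual.
```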
2019
- SciBERT: A Pretrained Language Model for Scientific TextIz Beltagy, Kyle Lo, and Arman CohanIn EMNLP, Nov 2019
Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018), to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.
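The released SciBERT checkpoints can be loaded through the Hugging Face hub. A brief usage sketch follows; the model identifier matches the public release, while the mean-pooling step is a generic choice for illustration.

```python
# Sketch of loading SciBERT and encoding a scientific sentence.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

sentence = "The BERT encoder was pretrained on a multi-domain scientific corpus."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into a single sentence vector.
sentence_vector = outputs.last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
```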
- Quantifying Sex Bias in Clinical Studies at Scale With Automated Data ExtractionSergey Feldman, Waleed Ammar, Kyle Lo, Elly Trepman, Madeleine Zuylen, and Oren EtzioniJAMA Network Open, Jul 2019
Analyses of female representation in clinical studies have been limited in scope and scale. This study performed a large-scale analysis of global enrollment sex bias in clinical studies. In this cross-sectional study, clinical studies from published articles from PubMed from 1966 to 2018 and records from Aggregate Analysis of ClinicalTrials.gov from 1999 to 2018 were identified. Global disease prevalence was determined for male and female patients in 11 disease categories from the Global Burden of Disease database: cardiovascular, diabetes, digestive, hepatitis (types A, B, C, and E), HIV/AIDS, kidney (chronic), mental, musculoskeletal, neoplasms, neurological, and respiratory (chronic). Machine reading algorithms were developed that extracted sex data from tables in articles and records on December 31, 2018, at an artificial intelligence research institute. Male and female participants in 43 135 articles (792 004 915 participants) and 13 165 records (12 977 103 participants) were included. Sex bias was defined as the fraction of female study participants minus the prevalence fraction of female patients for each disease category. A total of 1000 bootstrap estimates of sex bias were computed by resampling individual studies with replacement. Sex bias was reported as mean and 95% bootstrap confidence intervals from articles and records in each disease category over time (before or during 1993 to 2018), with studies or participants as the measurement unit. There were 792 004 915 participants, including 390 470 834 female participants (49%), in articles and 12 977 103 participants, including 6 351 619 female participants (49%), in records. With studies as the measurement unit, substantial female underrepresentation (sex bias ≤ −0.05) was observed in 7 of 11 disease categories, especially HIV/AIDS (mean for articles, −0.17 [95% CI, −0.18 to −0.16]), chronic kidney diseases (mean, −0.17 [95% CI, −0.17 to −0.16]), and cardiovascular diseases (mean, −0.14 [95% CI, −0.14 to −0.13]). Sex bias in articles for all categories combined was unchanged over time with studies as the measurement unit (range, −0.15 [95% CI, −0.16 to −0.13] to −0.10 [95% CI, −0.14 to −0.06]), but improved from before or during 1993 (mean, −0.11 [95% CI, −0.16 to −0.05]) to 2014 to 2018 (mean, −0.05 [95% CI, −0.09 to −0.02]) with participants as the measurement unit. Larger study size was associated with greater female representation. Automated extraction of the number of participants in clinical reports provides an effective alternative to manual analysis of demographic bias. Despite legal and policy initiatives to increase female representation, sex bias against female participants in clinical studies persists. Studies with more participants have greater female representation. Differences between sex bias estimates with studies vs participants as the measurement unit, and between articles vs records, suggest that sex bias with both measures and data sources should be reported.
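The sex-bias measure above is the fraction of female participants in a study minus the disease-prevalence fraction of female patients, with confidence intervals from bootstrap resampling of studies. A small numerical sketch with made-up numbers, not the study's data:

```python
# Sketch of the sex-bias computation and a study-level bootstrap CI.
# The numbers below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Per-study fraction of female participants for one disease category,
# and the prevalence fraction of female patients for that category.
study_female_fractions = np.array([0.30, 0.42, 0.35, 0.28, 0.45, 0.33])
prevalence_female_fraction = 0.48

def sex_bias(fractions):
    # Studies as the measurement unit: mean study fraction minus prevalence.
    return fractions.mean() - prevalence_female_fraction

boot = np.array([
    sex_bias(rng.choice(study_female_fractions, size=len(study_female_fractions), replace=True))
    for _ in range(1000)
])
low, high = np.percentile(boot, [2.5, 97.5])
print(f"sex bias = {sex_bias(study_female_fractions):.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```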
- Combining Distant and Direct Supervision for Neural Relation ExtractionIz Beltagy, Kyle Lo, and Waleed AmmarIn NAACL, Jun 2019
In relation extraction with distant supervision, noisy labels make it difficult to train quality models. Previous neural models addressed this problem using an attention mechanism that attends to sentences that are likely to express the relations. We improve such models by combining the distant supervision data with additional directly supervised data, which we use as supervision for the attention weights. We find that joint training on both types of supervision leads to a better model because it improves the model’s ability to identify noisy sentences. In addition, we find that sigmoidal attention weights with max pooling achieve better performance than the commonly used weighted-average attention in this setup. Our proposed method achieves a new state-of-the-art result on the widely used FB-NYT dataset.
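The aggregation contrast above — sigmoidal per-sentence attention with max pooling versus the usual softmax-weighted average over a bag of sentences — can be sketched in a few lines of PyTorch. Dimensions and the scoring layer are placeholders, not the paper's exact architecture.

```python
# Sketch contrasting softmax-weighted averaging with sigmoidal attention
# plus max pooling over a bag of sentence representations. Shapes and the
# scoring layer are illustrative placeholders.
import torch

num_sentences, dim = 5, 64
sentence_reprs = torch.randn(num_sentences, dim)   # encoded sentences in one entity-pair bag
scorer = torch.nn.Linear(dim, 1)                   # stand-in attention scorer
scores = scorer(sentence_reprs).squeeze(-1)        # (num_sentences,)

# Common baseline: softmax attention, weighted average of sentences.
softmax_weights = torch.softmax(scores, dim=0)
avg_bag = (softmax_weights.unsqueeze(-1) * sentence_reprs).sum(dim=0)

# Variant favored in the abstract: independent sigmoid weights, then
# element-wise max pooling across the bag.
sigmoid_weights = torch.sigmoid(scores)
max_bag = (sigmoid_weights.unsqueeze(-1) * sentence_reprs).max(dim=0).values

print(avg_bag.shape, max_bag.shape)  # torch.Size([64]) torch.Size([64])
```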
2018
- Ontology alignment in the biomedical domain using entity definitions and contextLucy Lu Wang, Chandra Bhagavatula, Mark Neumann, Kyle Lo, Chris Wilhelm, and Waleed AmmarIn BioNLP Workshop, Jul 2018
Ontology alignment is the task of identifying semantically equivalent entities from two given ontologies. Different ontologies have different representations of the same entity, resulting in a need to de-duplicate entities when merging ontologies. We propose a method for enriching entities in an ontology with external definition and context information, and use this additional information for ontology alignment. We develop a neural architecture capable of encoding the additional information when available, and show that the addition of external data results in an F1-score of 0.69 on the Ontology Alignment Evaluation Initiative (OAEI) largebio SNOMED-NCI subtask, comparable with the entity-level matchers in a state-of-the-art system.
- Construction of the Literature Graph in Semantic ScholarWaleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, and 16 more authorsIn NAACL, Jun 2018
We describe a deployed scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes, representing papers, authors, entities and various interactions between them (e.g., authorships, citations, entity mentions). We reduce literature graph construction into familiar NLP tasks (e.g., entity extraction and linking), point out research challenges due to differences from standard formulations of these tasks, and report empirical results for each task. The methods described in this paper are used to enable semantic features in www.semanticscholar.org.