Ian Tenney

Authored Publications
    LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
    Michael Xieyang Liu
    Krystal Kallarackal
    Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM (2024)
    Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at Google. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.
    Retrieval-guided Counterfactual Generation for QA
    Bhargavi Paranjape
    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics (2022), pp. 1670-1686 (to appear)
    Deep NLP models have been shown to be brittle to input perturbations. Recent work has shown that data augmentation using counterfactuals (i.e., minimally perturbed inputs) can help ameliorate this weakness. We focus on the task of creating counterfactuals for question answering, which presents unique challenges related to world knowledge, semantic diversity, and answerability. To address these challenges, we develop a Retrieve-Generate-Filter (RGF) technique to create counterfactual evaluation and training data with minimal human supervision. Using an open-domain QA framework and question generation model trained on original task data, we create counterfactuals that are fluent, semantically diverse, and automatically labeled. Data augmentation with RGF counterfactuals improves performance on out-of-domain and challenging evaluation sets over and above existing methods, in both the reading comprehension and open-domain QA settings. Moreover, we find that RGF data leads to significant improvements in a model’s robustness to local perturbations.
    Experiments with pretrained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure (which includes the model architecture, training data, initialization scheme, and loss function). Recent work has shown that re-running pretraining can lead to substantially different conclusions about performance, suggesting that alternative evaluations are needed to make principled statements about procedures. To address this question, we introduce MultiBERTs: a set of 25 BERT-base checkpoints, trained with similar hyper-parameters as the original BERT model but differing in random initialization and data shuffling. The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures. The full release includes 25 fully trained checkpoints, as well as statistical guidelines and a code library implementing our recommended hypothesis testing methods. Finally, for five of these models we release a set of 28 intermediate checkpoints in order to support research on learning dynamics.
    We show that embedding-based language models capture a significant amount of information about the scalar magnitudes of objects but are short of the capability required for general common-sense reasoning. We identify ambiguity and numeracy as the key factors limiting their performance, and show that a simple reversible transformation of the pre-training corpus can have a significant effect on the results. We identify the best models and metrics to use when doing zero-shot transfer across tasks in this domain.
    The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models
    Andy Coenen
    Sebastian Gehrmann
    Ellen Jiang
    Carey Radebaugh
    Ann Yuan
    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics (to appear)
    We present the Language Interpretability Tool (LIT), an open-source platform for visualization and understanding of NLP models. We focus on core questions about model behavior: Why did my model make this prediction? When does it perform poorly? What happens under a controlled change in the input? LIT integrates local explanations, aggregate analysis, and counterfactual generation into a streamlined, browser-based interface to enable rapid exploration and error analysis. We include case studies for a diverse set of workflows, including exploring counterfactuals for sentiment analysis, measuring gender bias in coreference systems, and exploring local behavior in text generation. LIT supports a wide range of models--including classification, seq2seq, and structured prediction--and is highly extensible through a declarative, framework-agnostic API. LIT is under active development, with code and full documentation available at https://github.com/pair-code/lit.
    Large pre-trained models have revolutionized natural language understanding. However, researchers have found they can encode correlations undesired in many applications, like "surgeon" being associated more with "he" than "she". We explore such gendered correlations as a case study, to learn how we can configure and train models to mitigate the risk of encoding unintended associations. We find that it is important to define correlation metrics, since they can reveal differences among models with similar accuracy. Large models have more capacity to encode gendered correlations, but this can be mitigated with general dropout regularization. Counterfactual data augmentation is also effective, and can even reduce correlations not explicitly targeted for mitigation, potentially making it useful beyond gender too. Both techniques yield models with comparable accuracy to unmitigated analogues, and still resist re-learning correlations in fine-tuning.
    Asking without Telling: Exploring Latent Ontologies in Contextual Representations
    Julian Michael
    Jan A. Botha
    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (to appear)
    The success of pretrained contextual encoders, such as ELMo and BERT, has brought a great deal of interest in what these models learn: do they, without explicit supervision, learn to encode meaningful notions of linguistic structure? If so, how is this structure encoded? To investigate this, we introduce latent subclass learning (LSL): a modification to existing classifier-based probing methods that induces a latent categorization (or ontology) of the probe's inputs. Without access to fine-grained gold labels, LSL extracts emergent structure from input representations in an interpretable and quantifiable form. In experiments, we find strong evidence of familiar categories, such as a notion of personhood in ELMo, as well as novel ontological distinctions, such as a preference for fine-grained semantic roles on core arguments. Our results provide unique new evidence of emergent structure in pretrained encoders, including departures from existing annotations which are inaccessible to earlier methods.
    What Happens To BERT Embeddings During Fine-tuning?
    Amil Merchant
    Elahe Rahimtoroghi
    Proceedings of the 2020 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics (to appear)
    While there has been much recent work studying how linguistic information is encoded in pre-trained sentence representations, comparatively little is understood about how these models change when adapted to solve downstream tasks. Using a suite of analysis techniques (probing classifiers, Representational Similarity Analysis, and model ablations), we investigate how fine-tuning affects the representations of the BERT model. We find that while fine-tuning necessarily makes significant changes, it does not lead to catastrophic forgetting of linguistic phenomena. We instead find that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks. In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing. Finally, we also find that fine-tuning has a weaker effect on representations of out-of-domain sentences, suggesting room for improvement in model generalization.
    What do you learn from context? Probing for sentence structure in contextualized word representations
    Patrick Xia
    Berlin Chen
    Alex Wang
    Adam Poliak
    R. Thomas McCoy
    Najoung Kim
    Benjamin Van Durme
    Samuel R. Bowman
    International Conference on Learning Representations (2019)
    Contextualized representation models such as CoVe (McCann et al., 2017) and ELMo (Peters et al., 2018a) have recently achieved state-of-the-art results on a broad suite of downstream NLP tasks. Building on recent token-level probing work (Peters et al., 2018a; Blevins et al., 2018; Belinkov et al., 2017b; Shi et al., 2016), we introduce a broad suite of sub-sentence probing tasks derived from the traditional structured-prediction pipeline, including parsing, semantic role labeling, and coreference, and covering a range of syntactic, semantic, local, and long-range phenomena. We use these tasks to examine the word-level contextual representations and investigate how they encode information about the structure of the sentence in which they appear. We probe three recently-released contextual encoder models, and find that ELMo better encodes linguistic structure at the word level than do other comparable models. We find that the existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but only offer small improvements on semantic tasks over a non-contextual baseline.
    BERT Rediscovers the Classical NLP Pipeline
    Association for Computational Linguistics (2019) (to appear)
    Pre-trained sentence encoders such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have rapidly advanced the state-of-the-art on many NLP tasks, and have been shown to encode contextual information that can resolve many aspects of language structure. We extend the edge probing suite of Tenney et al. (2019) to explore the computation performed at each layer of the BERT model, and find that tasks derived from the traditional NLP pipeline appear in a natural progression: part-of-speech tags are processed earliest, followed by constituents, dependencies, semantic roles, and coreference. We trace individual examples through the encoder and find that while this order holds on average, the encoder occasionally inverts the order, revising low-level decisions after deciding higher-level contextual relations.
    We release a corpus of atomic insertion edits: instances in which a human editor has inserted a single contiguous span of text into an existing sentence. Our corpus is derived from Wikipedia edit history and contains 43 million sentences across 8 different languages. We argue that the signal contained in these edits is valuable for research in semantics and discourse, and that such signal differs from that found in conventional language modeling corpora. We provide experimental evidence from both a corpus linguistics and a language modeling perspective to support these claims.