Been Kim
Been is a research scientist at Google Brain. Her research focuses on improving interpretability in machine learning, both by building interpretability methods for already-trained models and by building inherently interpretable models. She has MS and PhD degrees from MIT. Been has given tutorials on interpretability at ICML 2017, at the Deep Learning Summer School at the University of Toronto and the Vector Institute in 2018, and at CVPR 2018. She is one of the executive board members of Women in Machine Learning (WiML) and helps with various ML conferences as a workshop chair, an area chair, a steering committee member, and a program chair.
Authored Publications
    DISSECT: Disentangled Simultaneous Explanations via Concept Traversals
    Chun-Liang Li
    Brian Eoff
    Rosalind Picard
    International Conference on Learning Representations (2022)
Explaining deep learning model inferences is a promising avenue for scientific understanding, improving safety, uncovering hidden biases, evaluating fairness, and beyond, as argued by many scholars. One of the principal benefits of counterfactual explanations is allowing users to explore "what-if" scenarios through what does not and cannot exist in the data, a capability that many other forms of explanation, such as heatmaps and influence functions, inherently lack. However, most previous work on generative explainability cannot disentangle important concepts effectively, produces unrealistic examples, or fails to retain relevant information. We propose a novel approach, DISSECT, that jointly trains a generator, a discriminator, and a concept disentangler to overcome such challenges using little supervision. DISSECT generates Concept Traversals (CTs), defined as a sequence of generated examples with increasing degrees of concepts that influence a classifier's decision. By training a generative model from a classifier's signal, DISSECT offers a way to discover a classifier's inherent "notion" of distinct concepts automatically rather than rely on user-predefined concepts. We show that DISSECT produces CTs that (1) disentangle several concepts, (2) are influential to a classifier's decision and are coupled to its reasoning due to joint training, (3) are realistic, (4) preserve relevant information, and (5) are stable across similar inputs. We validate DISSECT on several challenging synthetic and realistic datasets where previous methods fall short of satisfying desirable criteria for interpretability and show that it performs consistently well. Finally, we present experiments showing applications of DISSECT for detecting potential biases of a classifier and identifying spurious artifacts that impact predictions.
    Beyond Rewards: a Hierarchical Perspective on Offline Multiagent Behavioral Analysis
    Shayegan Omidshafiei
    Yannick Assogba
    Advances in Neural Information Processing Systems (NeurIPS) (2022) (to appear)
Each year, expert-level performance is attained in increasingly complex multiagent domains, notable examples including Go, Poker, and StarCraft II. This rapid progression is accompanied by a commensurate need to better understand how such agents attain this performance, to enable their safe deployment, identify limitations, and reveal potential means of improving them. In this paper we take a step back from performance-focused multiagent learning and instead turn our attention towards agent behavior analysis. We introduce a model-agnostic method for discovery of behavior clusters in multiagent domains, using variational inference to learn a hierarchy of behaviors at the joint and local agent levels. Our framework makes no assumption about agents' underlying learning algorithms, does not require access to their latent states or policies, and is trained using only offline observational data. We illustrate the effectiveness of our method for enabling coupled understanding of behaviors at the joint and local agent levels, detecting behavior changepoints throughout training, and discovering core behavioral concepts; we also demonstrate the approach's scalability to a high-dimensional multiagent MuJoCo control domain and show that it can disentangle previously-trained policies in OpenAI's hide-and-seek domain.
Interpretability techniques aim to provide the rationale behind a model's decision, typically by explaining either an individual prediction (local explanation, e.g. "why is this patient diagnosed with this condition") or a class of predictions (global explanation, e.g. "why is this set of patients diagnosed with this condition in general"). While there are many methods focused on either one, few frameworks can provide both local and global explanations in a consistent manner. In this work, we combine two powerful existing techniques, one local (Integrated Gradients, IG) and one global (Testing with Concept Activation Vectors, TCAV), to provide local and global concept-based explanations. We first sanity check our idea using two synthetic datasets with a known ground truth, and further demonstrate with a benchmark natural image dataset. We test our method with various concepts, target classes, model architectures and IG parameters (e.g. baselines). We show that our method improves global explanations over vanilla TCAV when compared to ground truth, and provides useful local insights. Finally, a user study demonstrates the usefulness of the method compared to no or global explanations only. We hope our work provides a step towards building bridges between many existing local and global methods to get the best of both worlds.
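As an illustration of the local half of this combination, below is a minimal numpy sketch of Integrated Gradients, assuming a hypothetical `grad_fn` that returns the gradient of the target-class logit with respect to its input and a caller-chosen `baseline`; how the result would be projected onto a concept direction is only indicated in a closing comment and is one plausible way to combine the two techniques, not necessarily the paper's exact formulation.

```python
# Minimal sketch of Integrated Gradients (the local method in this combination).
# `grad_fn(x)` is an assumed helper returning d(target-class logit)/dx for input x.
import numpy as np

def integrated_gradients(x, baseline, grad_fn, steps=50):
    """Approximate IG_i = (x_i - x'_i) * mean_k grad_i(x' + k/steps * (x - x'))."""
    alphas = np.linspace(0.0, 1.0, steps + 1)
    # Gradients evaluated along the straight-line path from baseline to input.
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    avg_grad = grads.mean(axis=0)        # Riemann approximation of the path integral
    return (x - baseline) * avg_grad     # attribution per input feature

# In a concept-based variant, IG could be computed with respect to a hidden layer's
# activations and projected onto a concept activation vector (hypothetical names):
# local_concept_score = integrated_gradients(act, act_baseline, grad_fn_wrt_layer) @ cav
```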
Concept-based explanations can be a key direction to understand how DNNs make decisions. In this paper, we study concept-based explainability in a systematic framework. First, we define the notion of completeness, which quantifies how sufficient a particular set of concepts is in explaining the model's behavior. Based on performance and variability motivations, we propose two definitions to quantify completeness. We show that they yield the commonly-used PCA method under certain assumptions. Next, we study two additional constraints to ensure the interpretability of discovered concepts, based on sparsity principles. Through systematic experiments on a specifically-designed synthetic dataset and real-world text and image datasets, we demonstrate the superiority of our framework in finding concepts that are complete (in explaining the decision) and interpretable.
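A hedged sketch of one way such a completeness score could be computed (not necessarily the paper's exact definition): project a layer's activations onto the span of the discovered concept vectors, rerun the rest of the network, and see how much of the model's accuracy survives. `acts`, `concept_vectors`, `head_fn` (the part of the network above the chosen layer), and `base_acc` are assumed inputs.

```python
# Hedged sketch of a concept "completeness" measure under the assumptions above.
import numpy as np

def completeness_score(acts, labels, concept_vectors, head_fn, base_acc):
    """acts: (n, d) layer activations; concept_vectors: (k, d) concept directions."""
    C = np.asarray(concept_vectors)            # (k, d)
    Q, _ = np.linalg.qr(C.T)                   # orthonormal basis of the concept subspace
    projected = acts @ Q @ Q.T                 # keep only the part explained by the concepts
    preds = head_fn(projected).argmax(axis=1)  # rerun the rest of the network
    proj_acc = (preds == labels).mean()
    # Ratio near 1.0 suggests the concept set is (nearly) complete for this model.
    return proj_acc / base_acc
```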
    Concept Bottleneck Models
    Pang Wei Koh
    Thao Nguyen
    Yew Siang Tang
    Stephen Mussmann
    Emma Pierson
    Percy Liang
ICML (2020) (to appear)
We seek to learn models that support interventions on high-level concepts: e.g., would the model have predicted severe arthritis if it didn't think that there was a bone spur in the x-ray? However, state-of-the-art neural networks are trained end-to-end from raw input (e.g., pixels) to output (e.g., arthritis severity), and do not admit manipulation of high-level concepts like "the existence of bone spurs". In this paper, we revisit the classic idea of learning concept bottleneck models that first predict concepts (provided at training time) from the raw input, and then predict the final label from these concepts. By construction, we can intervene on the predicted concepts at test time and propagate these changes to the final prediction. On an x-ray dataset and a bird species recognition dataset, concept bottleneck models achieve competitive predictive accuracy with standard end-to-end models, while allowing us to explain predictions in terms of high-level clinical concepts ("bone spurs") and bird attributes ("wing color"). Moreover, concept bottleneck models allow for richer human-model interaction: model accuracy improves significantly if we can correct model mistakes on concepts at test time.
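A minimal sketch of an independently-trained concept bottleneck, with scikit-learn models standing in for the neural sub-networks and binary concept annotations assumed; the `interventions` argument illustrates the test-time concept correction described above. The paper also studies other training schemes; this sketch covers only the simplest case.

```python
# Hedged sketch of a concept bottleneck: x -> concepts -> label, with test-time intervention.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_bottleneck(X_train, C_train, y_train):
    # Stage 1: one concept predictor per annotated concept (x -> c).
    concept_models = [LogisticRegression(max_iter=1000).fit(X_train, C_train[:, j])
                      for j in range(C_train.shape[1])]
    # Stage 2: label predictor that only sees the concepts (c -> y).
    label_model = LogisticRegression(max_iter=1000).fit(C_train, y_train)
    return concept_models, label_model

def predict(X, concept_models, label_model, interventions=None):
    c_hat = np.column_stack([m.predict_proba(X)[:, 1] for m in concept_models])
    if interventions:                              # test-time intervention: overwrite a
        for j, value in interventions.items():     # predicted concept with its true value
            c_hat[:, j] = value
    return label_model.predict(c_hat)              # the change propagates to the label
```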
Machine learning (ML) is increasingly being used in image retrieval systems for medical decision making. One application of ML is to retrieve visually similar medical images from past patients (e.g. tissue from biopsies) to reference when making a medical decision with a new patient. However, no algorithm can perfectly capture an expert's ideal notion of similarity for every case: an image that is algorithmically determined to be similar may not be medically relevant to a doctor's specific diagnostic needs. In this paper, we identified the needs of pathologists when searching for similar images retrieved using a deep learning algorithm, and developed tools that empower users to cope with the search algorithm on-the-fly, communicating what types of similarity are most important at different moments in time. In two evaluations with pathologists, we found that these refinement tools increased the diagnostic utility of images found and increased user trust in the algorithm. The tools were preferred over a traditional interface, without a loss in diagnostic accuracy. We also observed that users adopted new strategies when using refinement tools, re-purposing them to test and understand the underlying algorithm and to disambiguate ML errors from their own errors. Taken together, these findings inform future human-ML collaborative systems for expert decision-making.
Interpretability has become an important topic of research as more machine learning (ML) models are deployed and widely used to make important decisions. Most current explanation methods provide explanations through feature importance scores, which identify features that are salient for each individual input. However, how to systematically summarize and interpret such per-sample feature importance scores is itself challenging. In this work, we propose principles and desiderata for concept-based explanation, which goes beyond per-sample features to identify higher-level human-understandable concepts that apply across the entire dataset. We develop a new algorithm, ACE, to automatically extract visual concepts. Our systematic experiments demonstrate that ACE discovers concepts that are human-meaningful, coherent, and salient for the neural network's predictions.
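A hedged sketch of the concept-discovery portion of a pipeline like the one described above: superpixel segmentation, embedding each segment with the network, and clustering the embeddings into candidate concepts. `embed_fn` (which would resize a masked patch and return its activations at a chosen layer) is an assumption, and the resulting clusters would still need to be scored for importance, e.g. with TCAV.

```python
# Hedged sketch of automatic visual concept discovery via segmentation + clustering.
import numpy as np
from skimage.segmentation import slic
from sklearn.cluster import KMeans

def discover_concepts(images, embed_fn, n_segments=15, n_concepts=25):
    patches, embeddings = [], []
    for img in images:
        segments = slic(img, n_segments=n_segments, compactness=10)
        for s in np.unique(segments):
            patch = img * (segments == s)[..., None]   # keep one superpixel, mask the rest
            patches.append(patch)
            embeddings.append(embed_fn(patch))         # assumed: layer activations of the patch
    embeddings = np.stack(embeddings)
    labels = KMeans(n_clusters=n_concepts, n_init=10).fit_predict(embeddings)
    # Each cluster of similar patches is a candidate concept to be scored for importance.
    return patches, labels
```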
DeConvNet, Guided BackProp, and LRP were invented to better understand deep neural networks. We show that these methods do not produce the theoretically correct explanation for a linear model. Yet they are used on multi-layer networks with millions of parameters. This is a cause for concern since linear models are simple neural networks. We argue that explanation methods for neural nets should work reliably in the limit of simplicity, the linear models. Based on our analysis of linear models we propose a generalization that yields two explanation techniques (PatternNet and PatternAttribution) that are theoretically sound for linear models and produce improved explanations for deep networks.
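A small numpy sketch of the linear-model analysis behind this work, under the assumption of a scalar linear model y = w·x: the signal direction (the "pattern") is proportional to cov(x, y), whereas the plain gradient simply returns w, which also reflects distractor directions.

```python
# Hedged sketch: estimate the "pattern" (signal direction) for a linear model y = X @ w.
import numpy as np

def linear_pattern(X, w):
    """X: (n, d) inputs, w: (d,) weights of the linear model."""
    y = X @ w
    cov_xy = (X - X.mean(0)).T @ (y - y.mean()) / (len(y) - 1)   # cov(x, y), shape (d,)
    a = cov_xy / (w @ cov_xy)       # pattern normalized so that w.T @ a = 1
    # The gradient-based "explanation" would just be w; an attribution in the spirit of
    # PatternAttribution would instead weight the input by w * a elementwise.
    return a
```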
Explaining the output of a complicated machine learning model like a deep neural network (DNN) is a central challenge in machine learning. Increasingly, explanations are required for debugging models, building trust prior to model deployment, and potentially identifying unwanted effects like model bias. Several methods have been proposed to address this issue. Local explanation methods provide explanations of the output of a model on a single input. Given the importance of these explanations to the use and deployment of these models, we ask: can we trust local explanations for DNNs created using current methods? In particular, we seek to assess how specific local explanations are to the parameter values of DNNs. We compare explanations generated using fully trained DNNs to explanations of DNNs with some or all parameters replaced by random values. Somewhat surprisingly, we find that, for several local explanation methods, explanations derived from networks with randomized weights and trained weights are both visually and quantitatively similar; in some cases, virtually indistinguishable. By randomizing different portions of the network, we find that local explanations are significantly reliant on lower-level features of the DNN.
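A hedged sketch of the kind of quantitative comparison described above, assuming hypothetical helpers `saliency_fn(model, x)` and `randomize(model)`; rank correlation is an illustrative choice of similarity metric, not necessarily the one used in the paper.

```python
# Hedged sketch: compare explanations from a trained model and a weight-randomized copy.
import numpy as np
from scipy.stats import spearmanr

def explanation_similarity(model, x, saliency_fn, randomize):
    s_trained = np.abs(saliency_fn(model, x)).ravel()
    s_random = np.abs(saliency_fn(randomize(model), x)).ravel()
    rank_corr, _ = spearmanr(s_trained, s_random)
    # A value near 1.0 means the explanation barely depends on the learned parameters.
    return rank_corr
```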
    To Trust Or Not To Trust A Classifier
    Heinrich Jiang
    Melody Guan
    Maya Gupta
    NeurIPS (2018)
Knowing when a classifier's prediction can be trusted is useful in many applications and critical for safely using AI. While the bulk of the effort in machine learning research has been towards improving classifier performance, understanding when a classifier's predictions should and should not be trusted has received far less attention. The standard approach is to use the classifier's discriminant or confidence score; however, we show there exists an alternative that is more effective in many situations. We propose a new score, called the trust score, which measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. We show empirically that high (low) trust scores produce surprisingly high precision at identifying correctly (incorrectly) classified examples, consistently outperforming the classifier's confidence score as well as many other baselines. Further, under some mild distributional assumptions, we show that if the trust score for an example is high (low), the classifier will likely agree (disagree) with the Bayes-optimal classifier. Our guarantees consist of non-asymptotic rates of statistical consistency under various nonparametric settings and build on recent developments in topological data analysis.
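A minimal sketch of a trust-score-style computation: the ratio of the distance to the nearest example of any class other than the predicted one over the distance to the nearest example of the predicted class. The paper's additional step of first filtering the training set to a high-density subset is omitted here for brevity.

```python
# Hedged sketch of trust scores via per-class nearest neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def trust_scores(X_train, y_train, X_test, y_pred):
    classes = np.unique(y_train)
    nns = {c: NearestNeighbors(n_neighbors=1).fit(X_train[y_train == c]) for c in classes}
    dists = {c: nns[c].kneighbors(X_test)[0][:, 0] for c in classes}  # distance to each class
    d_pred = np.array([dists[c][i] for i, c in enumerate(y_pred)])
    d_other = np.array([min(dists[c][i] for c in classes if c != y_pred[i])
                        for i in range(len(y_pred))])
    # Higher score = the predicted class is much closer than any other class.
    return d_other / (d_pred + 1e-12)
```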
Estimating the influence of a given feature on a model prediction is challenging. We introduce ROAR (RemOve And Retrain), a benchmark to evaluate the accuracy of interpretability methods that estimate input feature importance in deep neural networks. We remove a fraction of the input features deemed most important according to each estimator and measure the change to the model accuracy upon retraining. The most accurate estimator will identify as important those inputs whose removal causes the most damage to model performance relative to all other estimators. This evaluation produces thought-provoking results: we find that several estimators are less accurate than a random assignment of feature importance. However, averaging a set of squared noisy estimators (a variant of a technique proposed by Smilkov et al. (2017)) leads to significant gains in accuracy for each method considered and far outperforms such a random guess.
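A hedged sketch of the ROAR loop, assuming per-example importance matrices for the train and test splits and a `train_fn` that retrains a fresh model; replacing removed features with the per-feature mean is one simple choice of uninformative value.

```python
# Hedged sketch of a ROAR-style accuracy curve for one importance estimator.
import numpy as np

def roar_curve(X_train, y_train, X_test, y_test, imp_train, imp_test, train_fn,
               fractions=(0.1, 0.3, 0.5, 0.7, 0.9)):
    fill = X_train.mean(axis=0)                         # uninformative replacement value

    def degrade(X, imp, k):
        Xd = X.copy()
        top = np.argsort(-imp, axis=1)[:, :k]           # top-k most important features per example
        for i, cols in enumerate(top):
            Xd[i, cols] = fill[cols]
        return Xd

    accs = []
    for t in fractions:
        k = int(t * X_train.shape[1])
        model = train_fn(degrade(X_train, imp_train, k), y_train)   # retrain from scratch
        accs.append((model.predict(degrade(X_test, imp_test, k)) == y_test).mean())
    # A better estimator produces a steeper accuracy drop as more features are removed.
    return accs
```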
The interpretation of deep learning models is a challenge due to their size, complexity, and often opaque internal state. In addition, many systems, such as image classifiers, operate on low-level features rather than high-level concepts. To address these challenges, we introduce Concept Activation Vectors (CAVs), which provide an interpretation of a neural net's internal state in terms of human-friendly concepts. The key idea is to view the high-dimensional internal state of a neural net as an aid, not an obstacle. We show how to use CAVs as part of a technique, Testing with CAVs (TCAV), that uses directional derivatives to quantify the degree to which a user-defined concept is important to a classification result; for example, how sensitive a prediction of "zebra" is to the presence of stripes. Using the domain of image classification as a testing ground, we describe how CAVs may be used to explore hypotheses and generate insights for a standard image classification network as well as a medical application.
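A minimal sketch of the TCAV computation: fit a linear classifier separating concept activations from random activations at a chosen layer, take its weight vector as the CAV, and report the fraction of class examples whose class logit increases along that direction. `grad_logit_wrt_layer` (the gradient of the class logit with respect to the layer's activations) is an assumed autodiff helper.

```python
# Hedged sketch of CAV training and a TCAV score.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cav(concept_acts, random_acts):
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    cav = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]   # direction of the concept
    return cav / np.linalg.norm(cav)

def tcav_score(class_inputs, cav, grad_logit_wrt_layer):
    # Directional derivative of the class logit along the concept direction.
    dirs = np.array([grad_logit_wrt_layer(x) @ cav for x in class_inputs])
    return (dirs > 0).mean()   # e.g. how often "striped-ness" raises the "zebra" logit
```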
    Human-in-the-Loop Interpretability Prior
    Isaac Lage
    Andrew Ross
    Samuel J. Gershman
    Finale Doshi-Velez
    NeurIPS (Spotlight) (2018)
We often desire our models to be interpretable as well as accurate. Prior work on optimizing models for interpretability has relied on easy-to-quantify proxies for interpretability, such as sparsity or the number of operations required. In this work, we optimize for interpretability by directly including humans in the optimization loop. We develop an algorithm that minimizes the number of user studies to find models that are both predictive and interpretable and demonstrate our approach on several data sets. Our human subjects results show trends towards different proxy notions of interpretability on different datasets, which suggests that different proxies are preferred on different tasks.
    Sanity Checks for Saliency Maps
    Julius Adebayo
    Justin Gilmer
    Michael Christoph Muelly
    Ian Goodfellow
    Moritz Hardt
    NeurIPS (Spotlight) (2018)
Saliency methods have emerged as a popular tool to highlight features in an input deemed relevant for the prediction of a learned model. Several saliency methods have been proposed, often guided by visual appeal on image data. In this work, we propose an actionable methodology to evaluate what kinds of explanations a given method can and cannot provide. We find that reliance solely on visual assessment can be misleading. Through extensive experiments we show that some existing saliency methods are independent both of the model and of the data generating process. Consequently, methods that fail the proposed tests are inadequate for tasks that are sensitive to either data or model, such as finding outliers in the data, explaining the relationship between inputs and outputs that the model learned, and debugging the model. We interpret our findings through an analogy with edge detection in images, a technique that requires neither training data nor a model. Theory in the case of a linear model and a single-layer convolutional neural network supports our experimental findings.
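A hedged sketch of the model-parameter randomization test described above, with `saliency_fn`, `copy_model`, and `reinit_layer` as assumed helpers and Spearman rank correlation as one of several possible similarity measures.

```python
# Hedged sketch of cascading parameter randomization as a sanity check for saliency maps.
import numpy as np
from scipy.stats import spearmanr

def cascading_randomization(model, x, layer_names, saliency_fn, copy_model, reinit_layer):
    reference = np.abs(saliency_fn(model, x)).ravel()
    randomized = copy_model(model)
    correlations = []
    for name in layer_names:              # from the top (logits) down to the first layer
        reinit_layer(randomized, name)    # destroy what this layer has learned
        s = np.abs(saliency_fn(randomized, x)).ravel()
        corr, _ = spearmanr(reference, s)
        correlations.append((name, corr))
    # A method that passes the test should see similarity fall as layers are randomized;
    # a flat curve near 1.0 means the "explanation" is insensitive to the learned model.
    return correlations
```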
As machine learning systems become ubiquitous, there has been a surge of interest in interpretable machine learning: systems that provide explanations for their outputs. These explanations are often used to qualitatively assess other criteria such as safety or non-discrimination. However, despite the interest in interpretability, there is very little consensus on what interpretable machine learning is and how it should be measured. In this position paper, we first define interpretability and describe when interpretability is needed (and when it is not). Next, we suggest a taxonomy for rigorous evaluation and expose open questions towards a more rigorous science of interpretable machine learning.
    The (Un)reliability of Saliency methods
    Sara Hooker
    Julius Adebayo
    Maximilian Alber
    Kristof T. Schütt
    Sven Dähne
    NIPS Workshop (2017)
Saliency methods aim to explain the predictions of deep neural networks. These methods lack reliability when the explanation is sensitive to factors that do not contribute to the model prediction. We use a simple and common pre-processing step ---adding a constant shift to the input data--- to show that a transformation with no effect on the model can cause numerous methods to incorrectly attribute. In order to guarantee reliability, we posit that methods should fulfill input invariance, the requirement that a saliency method mirror the sensitivity of the model with respect to transformations of the input. We show, through several examples, that saliency methods that do not satisfy input invariance result in misleading attribution.
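A hedged sketch of an input-invariance check in the spirit of this paper, assuming a helper `make_shift_compensated_model` that folds the constant shift into the network (e.g. by adjusting the first layer's bias) so that predictions are unchanged, and an `attribution_fn` for the saliency method under test.

```python
# Hedged sketch: does the attribution change under a transformation the model ignores?
import numpy as np

def input_invariance_gap(model, x, shift, attribution_fn, make_shift_compensated_model):
    model2 = make_shift_compensated_model(model, shift)  # identical predictions on x + shift
    a1 = attribution_fn(model, x)
    a2 = attribution_fn(model2, x + shift)
    # An input-invariant method gives (near-)identical attributions; a large gap means the
    # method responds to a transformation that has no effect on the model's output.
    return np.max(np.abs(a1 - a2))
```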