Thomas Mensink

I am a research scientist working on Computer Vision and Deep Learning.

Other research interests include: (learning) image representations, dense prediction tasks, zero-shot learning, metric learning, and structured prediction, all applied to image classification and retrieval tasks. My work has been recognised with, among others, the ECCV Koenderink Prize (2020), an NWO VENI Grant (2015), the ACM Multimedia Best Paper Award (2014), and the ACM ICMR Best Paper Award (2016).

For a full list of (pre-Google) publications, see Google Scholar or my personal website.
Authored Publications
    The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
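    A minimal illustration of the linear-probe evaluation mentioned above: only a lightweight linear classifier is trained on top of frozen features. The random arrays stand in for embeddings produced by a frozen backbone; the shapes and the scikit-learn classifier are illustrative assumptions, not the paper's setup.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        # Stand-ins for features extracted once by a frozen vision backbone.
        train_feats = rng.normal(size=(1000, 1024))       # N x D frozen embeddings
        train_labels = rng.integers(0, 10, size=1000)
        test_feats = rng.normal(size=(200, 1024))
        test_labels = rng.integers(0, 10, size=200)

        # Only the lightweight linear head is fit; the backbone weights never change.
        probe = LogisticRegression(max_iter=1000)
        probe.fit(train_feats, train_labels)
        print("linear-probe accuracy:", probe.score(test_feats, test_labels))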
    Mixup is a widely adopted strategy for training deep networks, where additional samples are augmented through a linear interpolation of input pairs and their corresponding labels. Mixup has been shown to improve classification performance, network calibration, and out-of-distribution generalization. While effective, a cornerstone of Mixup, namely that networks learn linear behavior patterns between classes, is only indirectly enforced, since the output interpolation is performed at the probability level. This paper seeks to address this limitation by instead mixing the classifiers of the labels directly for each mixed input pair. We propose to define the target of each augmented sample as a uniquely new classifier, whose parameters are given as a linear interpolation of the classifier vectors of the input sample pair. The space of all possible classifiers is continuous and spans all interpolations between classifier pairs. To perform tractable optimization, we propose a dual-contrastive Infinite Class Mixup loss, where we contrast the unique classifier of a single pair to both the mixed classifiers and the predicted outputs of all other pairs in a batch. Infinite Class Mixup is generic in nature and applies to any variant of Mixup. Empirically, we show that our formulation outperforms standard Mixup and variants such as RegMixup and Remix on balanced and long-tailed recognition benchmarks, both at large scale and in data-constrained settings, highlighting the broad applicability of our approach.
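    A minimal PyTorch sketch (not the paper's implementation) of the core idea: instead of interpolating label probabilities, the classifier vectors of the two labels are interpolated, and each mixed input is contrasted against the mixed classifiers of all other pairs in the batch. The tensor shapes and the single contrastive term are simplifying assumptions.

        import torch
        import torch.nn.functional as F

        def mixed_classifier_loss(features, classifiers, labels_a, labels_b, lam):
            """features: B x D embeddings of mixed inputs; classifiers: C x D weight vectors."""
            w_a = classifiers[labels_a]                 # B x D classifier of the first label
            w_b = classifiers[labels_b]                 # B x D classifier of the second label
            w_mix = lam * w_a + (1.0 - lam) * w_b       # B x D interpolated classifiers
            # Contrast each mixed feature against every mixed classifier in the batch;
            # the matching pair sits on the diagonal.
            logits = features @ w_mix.t()               # B x B
            targets = torch.arange(features.size(0))
            return F.cross_entropy(logits, targets)

        B, C, D = 8, 10, 32
        loss = mixed_classifier_loss(torch.randn(B, D), torch.randn(C, D),
                                     torch.randint(0, C, (B,)), torch.randint(0, C, (B,)),
                                     lam=0.3)
        print(loss.item())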
    We propose Encyclopedic-VQA, a large-scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs, each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evidence that supports each answer. Empirically, we show that our dataset poses a hard challenge for large vision+language models, as they perform poorly on it: PaLI [14] is state-of-the-art on OK-VQA [37], yet it only achieves 13.0% accuracy on our dataset. Moreover, we experimentally show that progress on answering our encyclopedic questions can be achieved by augmenting large models with a mechanism that retrieves relevant information from the knowledge base. An oracle experiment with perfect retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and an automatic retrieval-augmented prototype yields 48.8%. We believe that our dataset enables future research on retrieval-augmented vision+language models. It is available at https://github.com/google-research/google-research/tree/master/encyclopedic_vqa.
    How (not) to ensemble LVLMs for VQA
    Lisa Alazraki
    Lluis Castrejon
    Fantine Huot
    "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops
    This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively, these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it?
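    The oracle numbers quoted above count an example as solved if any ensemble member answers it correctly. A tiny illustration of that upper bound, on toy data rather than the paper's models:

        def oracle_accuracy(per_model_correct):
            """per_model_correct: one boolean list per model, aligned over the same examples."""
            num_examples = len(per_model_correct[0])
            hits = sum(any(model[i] for model in per_model_correct) for i in range(num_examples))
            return hits / num_examples

        # Two toy models that are right on different examples.
        print(oracle_accuracy([[True, False, False, True],
                               [False, True, False, True]]))  # 0.75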
    Computer vision is driven by the many datasets available for training or evaluating novel methods. However, each dataset has a different set of class labels, visual definitions of classes, images following a specific distribution, annotation protocols, etc. In this paper we explore the automatic discovery of visual-semantic relations between labels across datasets. We aim to understand how instances of a certain class in one dataset relate to the instances of another class in another dataset. Are they in an identity, parent/child, or overlap relation? Or is there no link between them at all? To find relations between labels across datasets, we propose methods based on language, on vision, and on their combination. We show that we can effectively discover label relations across datasets, as well as their type. We apply our method to four applications: understanding label relations, identifying missing aspects, increasing label specificity, and predicting transfer learning gains. We conclude that label relations cannot be established by looking at the names of classes alone, as they depend strongly on how each of the datasets was constructed.
    We address the problem of ensemble selection in transfer learning: given a large pool of source models, we want to select an ensemble of models which, after fine-tuning on the target training set, yields the best performance on the target test set. Since fine-tuning all possible ensembles is computationally prohibitive, we aim at predicting performance on the target dataset using a computationally efficient transferability metric. We propose several new transferability metrics designed for this task and evaluate them in a challenging and realistic transfer learning setup for semantic segmentation: we create a large and diverse pool of source models by considering 17 source datasets covering a wide variety of image domains, two different architectures, and two pre-training schemes. Given this pool, we then automatically select a subset to form an ensemble performing well on a given target dataset. We compare the ensemble selected by our method to two baselines which select a single source model, either (1) from the same pool as our method; or (2) from a pool containing large source models, each with similar capacity as an ensemble. Averaged over 17 target datasets, we outperform these baselines by 6.0% and 2.5% relative mean IoU, respectively.
    Transfer learning has become a popular method for leveraging pre-trained models in computer vision. However, without performing computationally expensive fine-tuning, it is difficult to quantify which pre-trained source models are suitable for a specific target task, or, conversely, to which tasks a pre-trained source model can easily be adapted. In this work, we propose the Gaussian Bhattacharyya Coefficient (GBC), a novel method for quantifying transferability between a source model and a target dataset. In a first step we embed all target images in the feature space defined by the source model, and represent them with per-class Gaussians. Then, we estimate their pairwise class separability using the Bhattacharyya coefficient, yielding a simple and effective measure of how well the source model transfers to the target task. We evaluate GBC on image classification tasks in the context of dataset and architecture selection. Further, we also perform experiments on the more complex semantic segmentation transferability estimation task. We demonstrate that GBC outperforms state-of-the-art transferability metrics on most evaluation criteria in the semantic segmentation setting, matches the performance of top methods for dataset transferability in image classification, and performs best on architecture selection problems for image classification.
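    An illustrative sketch of the GBC idea, assuming diagonal per-class Gaussians in the source model's feature space; the exact formulation in the paper may differ, and the random features below are placeholders.

        import numpy as np

        def bhattacharyya_coefficient(mu1, var1, mu2, var2):
            """Bhattacharyya coefficient between two diagonal Gaussians."""
            var = 0.5 * (var1 + var2)
            dist = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
            dist += 0.5 * np.sum(np.log(var) - 0.5 * (np.log(var1) + np.log(var2)))
            return np.exp(-dist)

        def gbc(features, labels):
            """Target features embedded by a source model; better class separability
            (smaller pairwise coefficients) suggests easier transfer."""
            classes = np.unique(labels)
            stats = {c: (features[labels == c].mean(0),
                         features[labels == c].var(0) + 1e-6) for c in classes}
            score = 0.0
            for i, ci in enumerate(classes):
                for cj in classes[i + 1:]:
                    score -= bhattacharyya_coefficient(*stats[ci], *stats[cj])
            return score  # closer to 0 means better separated classes

        rng = np.random.default_rng(0)
        feats = rng.normal(size=(300, 16))
        labs = rng.integers(0, 5, size=300)
        print(gbc(feats, labs))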
    Transferability metrics form a maturing field with increasing interest, aiming to provide heuristics for selecting the most suitable source models to transfer to a given target dataset without fine-tuning them all. However, existing works rely on custom experimental setups which differ across papers, leading to inconsistent conclusions about which transferability metrics work best. In this paper we conduct a large-scale study by systematically constructing a broad range of 715k experimental setup variations. We discover that even small variations to an experimental setup lead to different conclusions about the superiority of one transferability metric over another. We then propose better evaluations by aggregating across many experiments, enabling more stable conclusions to be reached. As a result, we reveal the superiority of LogME at selecting good source datasets to transfer from in a semantic segmentation scenario, and NLEEP at selecting good source architectures in an image classification scenario. However, no single transferability metric works best in all scenarios.
    EDEN: Multimodal Synthetic Dataset of Enclosed Garden Scenes
    Hoang-An Le
    Partha Das
    Sezer Karaoglu
    Theo Gevers
    Winter Conference on Applications of Computer Vision (WACV) (2021)
    Multimodal large-scale datasets for outdoor scenes are mostly designed for urban driving problems. The scenes are highly structured and semantically different from scenarios seen in nature-centered scenes such as gardens or parks. To promote machine learning methods for nature-oriented applications, such as agriculture and gardening, we propose the multimodal synthetic dataset for Enclosed garDEN scenes (EDEN). The dataset features more than 300K images captured from more than 100 garden models. Each image is annotated with various low/high-level vision modalities, including semantic segmentation, depth, surface normals, intrinsic colors, and optical flow. Experimental results on state-of-the-art methods for semantic segmentation and monocular depth prediction, two important tasks in computer vision, show the positive impact of pre-training deep networks on our dataset for unstructured natural scenes. The dataset and related materials will be available at https://lhoangan.github.io/eden.
    Multi-Loss Weighting with Coefficient of Variations
    Rick Groenendijk
    Sezer Karaoglu
    Theo Gevers
    Winter Conference on Applications of Computer Vision (WACV) (2021)
    Many interesting tasks in machine learning and computer vision are learned by optimising an objective function defined as a weighted linear combination of multiple losses. The final performance is sensitive to choosing the correct (relative) weights for these losses. Finding a good set of weights is often done by adopting them into the set of hyper-parameters, which are set using an extensive grid search. This is computationally expensive. In this paper, the weights are defined based on properties observed while training the model, including the specific batch loss, the average loss, and the variance for each of the losses. An additional advantage is that the defined weights evolve during training, instead of using static loss weights. In the literature, loss weighting is mostly used in a multi-task learning setting, where the different tasks obtain different weights. However, there is a plethora of single-task multi-loss problems that can benefit from automatic loss weighting. In this paper, it is shown that these multi-task approaches do not work on single tasks. Instead, a method is proposed that automatically and dynamically tunes loss weights throughout training, specifically for single-task multi-loss problems. The method incorporates a measure of uncertainty to balance the losses. The validity of the approach is shown empirically for different tasks on multiple datasets.
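    A simplified sketch of weighting each loss by its coefficient of variation (standard deviation over mean of the loss values observed so far), so that the weights adapt during training; the statistics the paper actually uses may differ in detail.

        import numpy as np

        class CoVWeighting:
            def __init__(self, num_losses):
                self.history = [[] for _ in range(num_losses)]

            def weights(self, current_losses):
                for h, l in zip(self.history, current_losses):
                    h.append(float(l))
                # Coefficient of variation per loss; higher relative variability gets more weight.
                cov = np.array([np.std(h) / (np.mean(h) + 1e-12) for h in self.history])
                if cov.sum() == 0:
                    return np.full(len(current_losses), 1.0 / len(current_losses))
                return cov / cov.sum()

        weighter = CoVWeighting(num_losses=2)
        for step_losses in ([1.0, 10.0], [0.9, 10.1], [0.5, 9.9]):
            w = weighter.weights(step_losses)
            print(w, float(np.dot(w, step_losses)))   # weights and the combined loss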
    Transfer learning enables re-using knowledge learned on a source task to help learn a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models, i.e. pre-training a model for image classification on the ILSVRC dataset, and then fine-tuning it on any target task. However, previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood. In this paper we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured output task types relevant to modern computer vision applications. In total we carry out over 2000 transfer learning experiments, including many where the source and target come from different image domains, task types, or both. We systematically analyze these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations: (1) for most tasks there exists a source which significantly outperforms ILSVRC'12 pre-training; (2) the image domain is the most important factor for achieving positive transfer; (3) the source dataset should include the image domain of the target dataset to achieve best results; (4) at the same time, we observe only small negative effects when the image domain of the source task is much broader than that of the target; (5) transfer across task types can be beneficial, but its success is heavily dependent on both the source and target task types.
    Calibration of Neural Networks using Splines
    Kartik Gupta
    Amir Rahimi
    Thalaiyasingam Ajanthan
    Richard Ian Hartley
    International Conference on Learning Representations (ICLR) (2021)
    Calibrating neural networks is of utmost importance when employing them in safety-critical applications where the downstream decision making depends on the predicted probabilities. Measuring calibration error amounts to comparing two empirical distributions. In this work, we introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test, in which the main idea is to compare the respective cumulative probability distributions. From this, by approximating the empirical cumulative distribution using a differentiable function via splines, we obtain a recalibration function, which maps the network outputs to actual (calibrated) class assignment probabilities. The spline-fitting is performed using a held-out calibration set and the obtained recalibration function is evaluated on an unseen test set. We tested our method against existing calibration approaches on various image classification datasets, and our spline-based recalibration approach consistently outperforms existing methods on KS error as well as other commonly used calibration measures.
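    A sketch of a binning-free, Kolmogorov-Smirnov-style calibration error: sort predictions by confidence and take the largest gap between the cumulative predicted probability and the cumulative observed accuracy. This is an illustration of the idea, not the paper's code, and the data below is synthetic.

        import numpy as np

        def ks_calibration_error(confidences, correct):
            order = np.argsort(confidences)
            conf = np.asarray(confidences, dtype=float)[order]
            hits = np.asarray(correct, dtype=float)[order]
            n = len(conf)
            # Cumulative distributions of predicted confidence and empirical accuracy.
            cum_conf = np.cumsum(conf) / n
            cum_hits = np.cumsum(hits) / n
            return np.max(np.abs(cum_conf - cum_hits))

        rng = np.random.default_rng(0)
        conf = rng.uniform(0.5, 1.0, size=1000)
        correct = rng.uniform(size=1000) < conf        # roughly calibrated predictions
        print(ks_calibration_error(conf, correct))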
    Automatic generation of dense non-rigid optical flow
    Hoang-An Le
    Anil Baslamisli
    Tushar Nimbhorkar
    Sezer Karaoglu
    Theo Gevers
    Computer Vision and Image Understanding (CVIU) (2021)
    As of today there hardly exist any large-scale datasets with dense optical flow of non-rigid motion from real-world imagery. The reason lies mainly in the setup required to derive ground-truth optical flow: a series of images with known camera poses along its trajectory, and an accurate 3D model of a textured scene. Human annotation is not only too tedious for large databases, it also can hardly yield accurate optical flow. To circumvent the need for manual annotation, we propose a framework to automatically generate optical flow from real-world videos. The method extracts and matches objects from video frames to compute initial constraints, and applies a deformation over the objects of interest to obtain dense optical flow fields. We propose several ways to augment the optical flow variations. Extensive experimental results show that training on our automatically generated optical flow outperforms methods that are trained on rigid synthetic data, using FlowNet-S, LiteFlowNet, PWC-Net, and RAFT.
    Neural Feature Matching in Implicit 3D Representations
    Yunlu Chen
    Basura Fernando
    Hakan Bilen
    Efstratios Gavves
    International Conference on Machine Learning (ICML) (2021)
    Recently, neural implicit functions have achieved impressive results for encoding 3D shapes. Conditioning on low-dimensional latent codes generalises a single implicit function to learn a shared representation space for a variety of shapes, with the advantage of smooth interpolation. While the benefits of the global latent space do not extend to explicit point correspondences at the local level, we propose to track continuous point trajectories by matching implicit features as the latent code interpolates between shapes. From this we corroborate the hierarchical functionality of deep implicit functions, where early layers map the latent code to fitting the coarse shape structure, and deeper layers further refine the shape details. Furthermore, the structured representation space of implicit functions makes it possible to apply feature matching for shape deformation, with the benefit of handling topological and semantic inconsistencies, such as from an armchair to a chair with no arms, without explicit flow functions or manual annotations.
    In this paper, the argument is made that for true novel view synthesis of objects, where the object can be synthesized from any viewpoint, an explicit 3D shape representation is desired. For this, point clouds are estimated, which can be freely rotated into the desired view and then projected into a new image. This novel view, however, is sparse by nature, and hence this coarse view is used as the input of an image completion network to obtain the dense image. In order to acquire the point cloud, without resorting to special acquisition hardware or multi-view approaches, the pixel-wise depth map is estimated from a single RGB input image. Combined with the camera intrinsics, this results in a partial point cloud. By using forward warping and backward warping between the input view and the target view, the network can be trained end-to-end without supervision on depth. Experimentally, the benefit of using point clouds as an explicit 3D shape representation for novel view synthesis is validated on the 3D ShapeNet benchmark.
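    A minimal sketch of the geometric step described above: a per-pixel depth map plus camera intrinsics gives a partial point cloud, which can be rotated into a new viewpoint before being re-projected. The intrinsics, depth values, and rotation below are made-up numbers for illustration.

        import numpy as np

        def depth_to_point_cloud(depth, fx, fy, cx, cy):
            h, w = depth.shape
            u, v = np.meshgrid(np.arange(w), np.arange(h))
            z = depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            return np.stack([x, y, z], axis=-1).reshape(-1, 3)

        depth = np.full((4, 4), 2.0)                  # toy depth map, 2 m everywhere
        points = depth_to_point_cloud(depth, fx=500, fy=500, cx=2.0, cy=2.0)

        # Rotate the cloud by 30 degrees around the vertical axis to form a novel view.
        theta = np.radians(30)
        rot_y = np.array([[np.cos(theta), 0, np.sin(theta)],
                          [0, 1, 0],
                          [-np.sin(theta), 0, np.cos(theta)]])
        rotated = points @ rot_y.T
        print(rotated.shape)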
    PointMixup: Data Augmentation for Point Clouds
    Yunlu Chen
    Vincent Tao Hu
    Efstratios Gavves
    Pascal Mettes
    Pengwan Yang
    Cees Snoek
    ECCV (2020)
    This paper introduces a data augmentation for point clouds by interpolation between examples. Data augmentation by interpolation has been shown to be a simple and effective approach in the image domain. Such a mixup is however not directly transferable to point clouds, as we do not have a one-to-one correspondence between the points of two different objects. In this paper, we introduce optimal assignment mixup to enable data augmentation by interpolation of point clouds. Our proposed mixup generates new point cloud examples as a linear interpolation along the shortest path between two point clouds. This shortest path is given by the optimal bijection as specified by the Earth Mover's Distance. We prove that our optimal assignment mixup abides by the shortest path property, linearity of the interpolation, and assignment invariance. Experimentally, we show the potential of the mixup for point cloud classification, especially when examples are scarce.
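    A sketch of interpolating two point clouds along an optimal one-to-one assignment, here computed with the Hungarian algorithm on pairwise squared distances as a stand-in for the Earth Mover's Distance bijection; SciPy utilities are used purely for illustration.

        import numpy as np
        from scipy.optimize import linear_sum_assignment
        from scipy.spatial.distance import cdist

        def point_mixup(cloud_a, cloud_b, lam):
            """cloud_a, cloud_b: N x 3 arrays with the same number of points."""
            cost = cdist(cloud_a, cloud_b, metric="sqeuclidean")
            rows, cols = linear_sum_assignment(cost)     # optimal bijection; rows come back sorted
            matched_b = cloud_b[cols]                    # reorder B to align with A
            return lam * cloud_a + (1.0 - lam) * matched_b

        rng = np.random.default_rng(0)
        a = rng.normal(size=(128, 3))
        b = rng.normal(size=(128, 3))
        print(point_mixup(a, b, lam=0.5).shape)          # (128, 3)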
    This paper presents a novel 3D object detection framework that processes LiDAR data directly on a representation of the sensor's native range images. When operating in the range image view, one faces learning challenges, including occlusion and considerable scale variation, limiting the obtainable accuracy. To address these challenges, a range-conditioned dilated block (RCD) is proposed to dynamically adjust a continuous dilation rate as a function of the measured range, achieving scale invariance. Furthermore, soft range gating helps mitigate the effect of occlusion. An end-to-end trained box-refinement network brings additional performance improvements in occluded areas and produces more accurate bounding box predictions. On the Waymo Open Dataset, currently the largest and most diverse publicly released autonomous driving dataset, our improved range-based detector outperforms the state of the art at long-range detection. Our framework is superior to prior multi-view, voxel-based methods over all ranges, setting a new baseline for range-based 3D detection on this large-scale public dataset.
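    An illustrative sketch of range-conditioned dilation: the dilation rate shrinks with measured range so that a fixed kernel covers a roughly constant physical extent. The angular resolution and target extent are assumed constants for illustration, not values from the paper.

        def dilation_for_range(range_m, target_extent_m=2.0, angular_res_rad=0.003, kernel_size=3):
            """Continuous dilation so that kernel_size samples span ~target_extent_m at this range."""
            metres_per_pixel = range_m * angular_res_rad       # lateral size of one range-image pixel
            pixels_needed = target_extent_m / metres_per_pixel
            return max(1.0, pixels_needed / (kernel_size - 1))

        for r in (5.0, 20.0, 60.0):
            print(f"{r:5.1f} m -> dilation {dilation_for_range(r):5.1f}")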
    Interactive Exploration of Journalistic Video Footage through Multimodal Semantic Matching
    Sarah Ibrahimi
    Shuo Chen
    Devanshu Arya
    Arthur Camara
    Yunlu Chen
    Tanja Crijns
    Maurits van der Goes
    Emiel van Miltenburg
    Daan Odijk
    William Thong
    Jiaojiao Zhao
    Pascal Mettes
    ACM Multimedia (2019)
    This demo presents a system for journalists to explore video footage for broadcasts. Daily news broadcasts contain multiple news items that consist of many video shots, and searching for relevant footage is a labor-intensive task. Without the need for annotated video shots, our system extracts semantics from footage and automatically matches these semantics to query terms from the journalist. The journalist can then indicate which aspects of the query term need to be emphasized, e.g. the title or its thematic meaning. The goal of this system is to support journalists in their search process by encouraging interaction with the system.
    IterGANs: Iterative GANs to Learn and Control 3D Object Transformation
    Ysbrand Galama
    Computer Vision and Image Understanding (2019)
    We are interested in learning visual representations which allow for 3D manipulations of visual objects based on a single 2D image. We cast this into an image-to-image transformation task, and propose Iterative Generative Adversarial Networks (IterGANs), which iteratively transform an input image into an output image. Our models learn a visual representation that can be used for objects seen in training, but also for never-seen objects. Since object manipulation requires a full understanding of the geometry and appearance of the object, our IterGANs learn an implicit 3D model and a full appearance model of the object, both inferred from a single (test) image. Two advantages of IterGANs are that the intermediate generated images can be used as an additional supervision signal, even in an unsupervised fashion, and that the number of iterations can be used as a control signal to steer the transformation. Experiments on rotated objects and scenes show how IterGANs help with the generation process.
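    An illustrative PyTorch sketch of the iterative idea: a generator for a small transformation is applied repeatedly, the intermediate images are kept for possible extra supervision, and the number of iterations acts as a control signal. The tiny convolutional generator below is an untrained placeholder, not an IterGAN.

        import torch
        import torch.nn as nn

        # Untrained stand-in for a learned image-to-image generator of one small rotation step.
        generator = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid())

        def iterative_transform(image, num_iterations):
            intermediates = []
            for _ in range(num_iterations):        # each pass applies one small transformation
                image = generator(image)
                intermediates.append(image)        # usable as an additional supervision signal
            return image, intermediates

        x = torch.rand(1, 3, 64, 64)
        out, steps = iterative_transform(x, num_iterations=4)
        print(out.shape, len(steps))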
    A key challenge for RGB-D segmentation is how to effectively incorporate 3D geometric information from the depth channel into 2D appearance features. We propose to model the effective receptive field of the 2D convolution based on the scale and locality of the 3D neighborhood. Standard convolutions are local in the image space (u, v), often with a fixed receptive field of 3x3 pixels. We propose to define convolutions that are local with respect to the corresponding point in the 3D real-world space (x, y, z), where the depth channel is used to adapt the receptive field of the convolution, which makes the resulting filters invariant to scale and focused on a certain range of depth. We introduce the 3D Neighborhood Convolution (3DN-Conv), a convolutional operator over 3D neighborhoods. Further, by using estimated depth, our RGB-D based semantic segmentation model can also be applied to RGB input. Experimental results validate that our proposed 3DN-Conv operator improves semantic segmentation, using either ground-truth depth (RGB-D) or estimated depth (RGB).
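    A small sketch of the depth-adaptive receptive field: for a fixed metric neighbourhood, the window size in pixels shrinks as depth grows. The focal length and radius are hypothetical; this is not the 3DN-Conv implementation itself.

        import numpy as np

        def receptive_field_pixels(depth_m, radius_m=0.5, focal_px=500.0):
            """Half-width in pixels of a window spanning radius_m metres at each depth."""
            return np.maximum(1, np.round(focal_px * radius_m / depth_m)).astype(int)

        depth_map = np.array([[1.0, 2.0, 4.0],
                              [8.0, 16.0, 32.0]])
        print(receptive_field_pixels(depth_map))   # nearer pixels get a larger pixel window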
    On the Benefit of Adding an Adversarial Loss to Depth Prediction
    Rick Groenendijk
    Sezer Karaoglu
    Theo Gevers
    Computer Vision and Image Understanding (CVIU) (2019)
    Adversarial learning is one of the most promising novel learning paradigms in computer vision. Using Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs), many image manipulation tasks have been addressed, from generating images from text, to transferring images from one domain to another, to translating sketches to images (and vice versa). In this paper we address the benefit of adding adversarial training to the task of monocular depth estimation when trained from stereo pairs of images. For this depth estimation task many losses have been proposed, such as L1 and SSIM image reconstruction losses, left-right consistency, and occlusion losses. We evaluate three flavours of adversarial models (vanilla GANs, LSGANs, and Wasserstein GANs) combined with our model, using different numbers of image reconstruction losses. Based on extensive experimental evaluation, we conclude that adding a GAN is useful when the reconstruction loss is not too constrained, while a constrained reconstruction loss (using a combination of 5 different losses) outperforms, or is on par with, any method trained with a GAN.
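    A minimal sketch of how an adversarial term can be added to a reconstruction objective, using a standard non-saturating GAN loss plus L1 reconstruction; the weighting and the random tensors are illustrative assumptions, not the paper's exact losses.

        import torch
        import torch.nn.functional as F

        def generator_loss(pred_image, target_image, disc_logit_on_pred, adv_weight=0.01):
            """L1 reconstruction plus a non-saturating adversarial term on the generator side."""
            recon = F.l1_loss(pred_image, target_image)
            adversarial = F.binary_cross_entropy_with_logits(
                disc_logit_on_pred, torch.ones_like(disc_logit_on_pred))
            return recon + adv_weight * adversarial

        pred = torch.rand(1, 3, 64, 64)
        target = torch.rand(1, 3, 64, 64)
        disc_logit = torch.randn(1, 1)             # stand-in for the discriminator's output
        print(generator_loss(pred, target, disc_logit).item())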