Jump to Content

Mingyang Zhang

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Facilitated by large language models (LLMs), personalized text generation has become a rapidly growing research direction. Most existing studies focus on designing specialized models for a particular domain, or they require fine-tuning the LLMs to generate personalized text. We consider a typical scenario in which the large language model, which generates personalized output, is frozen and can only be accessed through APIs. Under this constraint, all one can do is to improve the input text (i.e., text prompts) sent to the LLM, a procedure that is usually done manually. In this paper, we propose a novel method to automatically revise prompts for personalized text generation. The proposed method takes the initial prompts generated by a state-of-the-art, multistage framework for personalized generation and rewrites a few critical components that summarize and synthesize the personal context. The prompt rewriter employs a training paradigm that chains together supervised learning (SL) and reinforcement learning (RL), where SL reduces the search space of RL and RL facilitates end-to-end training of the rewriter. Using datasets from three representative domains, we demonstrate that the rewritten prompts outperform both the original prompts and the prompts optimized via supervised learning or reinforcement learning alone. In-depth analysis of the rewritten prompts shows that they are not only human readable, but also able to guide manual revision of prompts when there is limited resource to employ reinforcement learning to train the prompt rewriter, or when it is costly to deploy an automatic prompt rewriter for inference. View details
    Creator Context for Tweet Recommendation
    Matt Colen
    Sergey Levi
    Vladimir Ofitserov
    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
    Preview abstract When discussing a tweet, people usually not only refer to the content it delivers, but also to the person behind the tweet. In other words, grounding the interpretation of the tweet in the context of its creator plays an important role in deciphering the true intent and the importance of the tweet. In this paper, we attempt to answer the question of how creator context should be used to advance tweet understanding. Specifically, we investigate the usefulness of different types of creator context, and examine different model structures for incorporating creator context in tweet modeling. We evaluate our tweet understanding models on a practical use case -- recommending relevant tweets to news articles. This use case already exists in popular news apps, and can also serve as a useful assistive tool for journalists. We discover that creator context is essential for tweet understanding, and can improve application metrics by a large margin. However, we also observe that not all creator contexts are equal. Creator context can be time sensitive and noisy. Careful creator context selection and deliberate model structure design play an important role in creator context effectiveness. View details
    End-to-End Query Term Weighting
    Karan Samel
    Swaraj Khadanga
    Wensong Xu
    Xingyu Wang
    Kashyap Kolipaka
    Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '23) (2023)
    Preview abstract Bag-of-words based lexical retrieval systems are still the most commonly used methods for real-world search applications. Recently deep learning methods have shown promising results to improve this retrieval performance but are expensive to run in an online fashion, non-trivial to integrate into existing production systems, and might not generalize well in out-of-domain retrieval scenarios. Instead, we build on top of lexical retrievers by proposing a Term Weighting BERT (TW-BERT) model. TW-BERT learns to predict the weight for individual n-gram (e.g., uni-grams and bi-grams) query input terms. These inferred weights and terms can be used directly by a retrieval system to perform a query search. To optimize these term weights, TW-BERT incorporates the scoring function used by the search engine, such as BM25, to score query-document pairs. Given sample query-document pairs we can compute a ranking loss over these matching scores, optimizing the learned query term weights in an end-to-end fashion. Aligning TW-BERT with search engine scorers minimizes the changes needed to integrate it into existing production applications, whereas existing deep learning based search methods would require further infrastructure optimization and hardware requirements. The learned weights can be easily utilized by standard lexical retrievers and by other retrieval techniques such as query expansion. We show that TW-BERT improves retrieval performance over strong term weighting baselines within MSMARCO and in out-of-domain retrieval on TREC datasets. View details
    Job Type Extraction for Service Businesses
    Yaping Qi
    Hayk Zakaryan
    Yonghua Wu
    Companion Proceedings of the ACM Web Conference 2023
    Preview abstract Google My Business (GMB) is a platform that allows business owners to manage their business profiles, which will be displayed when a user issues a relevant query on Google Search or Maps. Many GMB businesses provide diverse services from home cleaning and plumbing to legal services and education. However the exact service content, which we call job types, is often missing in their profiles. This leaves the burden of finding such content to the users, either by the tedious work of scanning through business websites or time-consuming calling of the owners. In the present paper, we describe how we build a pipeline to automatically extract the job types from websites of business owners and how we solve scalability issues for deployment. Rather than focusing on developing novel and sophisticated machine learning models, we share various challenges we have faced and practical experiences of building such a pipeline, including the cold start problem of dataset collection with limited human annotation resource, scalability, reaching a launch bar of high precision, and building a general pipeline with reasonable coverage of any free-text web pages without relying on the Document Object Model (DOM) structure. With these challenges, standard approaches for information extraction do not directly apply or are not scalable to be served. In this paper, we show how we address these challenges in different stages of the extraction pipeline, including: (1) utilizing structured content like tables and lists to tackle the cold start problem of dataset collection; (2) exploitation of various context information to improve model performance without hurting scalability; and (3) formulating the extraction problem as a retrieval task to improve generalizability, efficiency as well as coverage. The pipeline has been successfully deployed, and is scalable enough to be refreshed every few days to extract the latest online information. The extracted job types are serving millions of users of Google Search and Google Maps with at least three use cases: (1) job types of a place are directly displayed on mobile devices; (2) job types provide explanation as to why a place shows up given a query; (3) job types are used as a signal to rank business places. According to a user survey, the displayed job types has greatly enhanced the probability of a user hiring a service provider. View details
    SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval
    Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23), ACM (2023) (to appear)
    Preview abstract In dense retrieval, prior work has largely improved retrieval effectiveness using multi-vector dense representations, exemplified by ColBERT. In sparse retrieval, more recent work, such as SPLADE, demonstrated that one can also learn sparse lexical representations to achieve comparable effectiveness while enjoying better interpretability. In this work, we combine the strengths of both the sparse and dense representations for first-stage retrieval. Specifically, we propose SparseEmbed – a novel retrieval model that learns sparse lexical representations with contextual embeddings. Compared with SPLADE, our model leverages the contextual embeddings to improve model expressiveness. Compared with ColBERT, our sparse representations are trained end-to-end to optimize both efficiency and effectiveness. View details
    Learning Sparse Lexical Representations Over Expanded Vocabularies for Retrieval
    Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23) (2023)
    Preview abstract A recent line of work in first-stage Neural Information Retrieval has focused on learning sparse lexical representations instead of dense embeddings. One such work is SPLADE, which has been shown to lead to state-of-the-art results in both the in-domain and zero-shot settings, can leverage inverted indices for efficient retrieval, and offers enhanced interpretability. However, existing SPLADE models are fundamentally limited to learning a sparse representation based on the native BERT WordPiece vocabulary. In this work, we extend SPLADE to support learning sparse representations over arbitrary sets of tokens to improve flexibility and aid integration with existing retrieval systems. As an illustrative example, we focus on learning a sparse representation over a large (300k) set of unigrams. We add an unsupervised pretraining task on C4 to learn internal representations for new tokens. Our experiments show that our Expanded-SPLADE model maintains the performance of WordPiece-SPLADE on both in-domain and zero-shot retrieval while allowing for custom output vocabularies. View details
    Multi-Aspect Dense Retrieval
    Swaraj Khadanga
    Wensong Xu
    Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ACM (2022)
    Preview abstract Prior work in Dense Retrieval usually encodes queries and documents using single-vector representations (also called embeddings) and performs retrieval in the embedding space using approximate nearest neighbor search. This paradigm enables efficient semantic retrieval. However, the single-vector representations can be ineffective at capturing different aspects of the queries and documents in relevance matching, especially for some vertical domains. For example, in e-commerce search, these aspects could be category, brand and color. Given a query "white nike socks", a Dense Retrieval model may mistakenly retrieve some "white adidas socks" while missing out the intended brand. We propose to explicitly represent multiple aspects using one embedding per aspect. We introduce an aspect prediction task to teach the model to capture aspect information with particular aspect embeddings. We design a lightweight network to fuse the aspect embeddings for representing queries and documents. Our evaluation using an e-commerce dataset shows impressive improvements over strong Dense Retrieval baselines. We also discover that the proposed aspect embeddings can enhance the interpretability of Dense Retrieval models as a byproduct. View details
    Preview abstract The pre-trained language model (eg, BERT) based deep retrieval models achieved superior performance over lexical retrieval models (eg, BM25) in many passage retrieval tasks. However, limited work has been done to generalize a deep retrieval model to other tasks and domains. In this work, we carefully select five datasets, including two in-domain datasets and three out-of-domain datasets with different levels of domain shift, and study the generalization of a deep model in a zero-shot setting. Our findings show that the performance of a deep retrieval model is significantly deteriorated when the target domain is very different from the source domain that the model was trained on. On the contrary, lexical models are more robust across domains. We thus propose a simple yet effective framework to integrate lexical and deep retrieval models. Our experiments demonstrate that these two models are complementary, even when the deep model is weaker in the out-of-domain setting. The combined model obtains an average of 20.4% relative gain over the deep retrieval model, and an average of 9.54% over the lexical model in three out-of-domain datasets. View details
    Preview abstract The content on the web is in a constant state of flux. New entities, issues, and ideas continuously emerge, while the semantics of the existing conversation topics gradually shift. In recent years, pretrained language models like BERT greatly improved the state-of-the-art for a large spectrum of content understanding tasks. Therefore, in this paper, we aim to study how these language models can be adapted to better handle continuously evolving web content. In our study, we first analyze the evolution of 2013 – 2019 Twitter data, and unequivocally confirm that a BERT model trained on past tweets would heavily deteriorate when directly applied to data from later years. Then, we investigate two possible sources of the deterioration: the semantic shift of existing tokens and the sub-optimal or failed understanding of new tokens. To this end, we both explore two different vocabulary composition methods, as well as propose three sampling methods which help in efficient incremental training for BERT-like models. Compared to a new model trained from scratch offline, our incremental training (a) reduces the training costs, (b) achieves better performance on evolving content, and (c) is suitable for online deployment. The superiority of our methods is validated using two downstream tasks. We demonstrate significant improvements when incrementally evolving the model from a particular base year, on the task of Country Hashtag Prediction, as well as on the OffensEval 2019 task. View details
    Natural Language Understanding with Privacy-Preserving BERT
    Proceedings of the 30th ACM International Conference on Information and Knowledge Management, ACM (2021)
    Preview abstract Privacy preservation remains a key challenge in data mining and Natural Language Understanding (NLU). Previous research shows that the input text or even text embeddings can leak private information. This concern motivates our research on effective privacy preservation approaches for pretrained Language Models (LMs). We investigate the privacy and utility implications of applying dχ-privacy, a variant of Local Differential Privacy, to BERT fine-tuning in NLU applications. More importantly, we further propose privacy-adaptive LM pretraining methods and show that our approach can boost the utility of BERT dramatically while retaining the same level of privacy protection. We also quantify the level of privacy preservation and provide guidance on privacy configuration. Our experiments and findings lay the groundwork for future explorations of privacy-preserving NLU with pretrained LMs. View details
    Preview abstract Document layout comprises both structural and visual (\eg font size) information that are vital but often ignored by machine learning models. The few existing models which do use layout information only consider \textit{textual} contents, and overlook the existence of contents in other modalities such as images. Additionally, spatial interactions of presented contents in a layout was never fully exploited. On the other hand, a series of document understanding tasks are calling out for layout information. One example is given a position in a document, which image is the best to fit in. To address current models' limitations and tackle layout-aware document understanding tasks, we first parse a document into blocks whose content can be textual, tabular, or multimedia (\eg images) using a proprietary tool. We then propose a novel hierarchical framework, LAMPreT, to encode the blocks. Our LAMPreT model encodes each block with a multimodal transformer in the lower-level, and aggregates the block-level representations and connections utilizing a specifically designed transformer at the higher-level. We design hierarchical pre-training objectives where the lower-level model is trained with the standard masked language modeling (MLM) loss and the multimodal alignment loss, and the higher-level model is trained with three layout-aware objectives: (1) block-order predictions, (2) masked block predictions, and (3) image fitting predictions. We test the proposed model on two layout-aware tasks -- image suggestions and text block filling, and show the effectiveness of our proposed hierarchical architecture as well as pre-training techniques. View details
    Preview abstract Search engines often follow a 2-phase paradigm where in the first step an initial set of documents is retrieved (the \emp{retrieval} step) and in the second step the documents are ranked so as to obtain the final result list (the \emp{re-ranking} step). The focus of this paper is on improving the \emph{retrieval} step (measured mainly by recall) using deep neural network-based approaches. While deep neural networks were shown to improve the performance of the re-ranking step, there is little literature about using deep neural networks to improve the retrieval step. Previous works on deep neural networks for IR usually apply a simple lexical retrieval model for the retrieval step (e.g., BM25) and emphasize on the re-ranking step. In this paper, we propose and study a hybrid retrieval approach, which leverages both semantic (deep neural network based) and lexical (keyword matching based like BM25) matching techniques. The main idea is to perform semantic and lexical retrieval in parallel, and then to combine the result lists to generate the initial result set for re-ranking. An empirical evaluation, using a public TREC collection, shows that semantic retrieval model generated result lists often contain a substantial number of relevant documents not covered by the lexical-based generated lists. Further analysis of these relevant documents shows that they often also exhibit different characteristics than the lexical-based documents, attesting to the complementary nature of the two approaches. Finally, the experiments show that by combining the two result lists, the recall of the result list can increase significantly, the retrieval step can be greatly improved and these improvements are highly robust. View details
    Preview abstract Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has been largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications like news recommendation, related article recommendation and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input. We propose a transformer based hierarchical encoder to capture the document structure information. In order to better capture sentence level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models including hierarchical attention, multi-depth attention-based hierarchical recurrent neural network, and BERT. Comparing to BERT based baselines, our model is able to increase maximum input text length from 512 to 2048. We will open source a Wikipedia based benchmark dataset, code and a pre-trained checkpoint to accelerate future research on long-form document matching. View details
    Multi-view Embedding-based Synonyms for Personal Search
    Hongbo Deng
    Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19) (2019), pp. 575-584
    Preview abstract Synonym expansion is a technique that adds related words to search queries, which may lead to more relevant documents being retrieved, thus improving recall. There is extensive prior work on synonym expansion for web search, however very few studies have tackled its application for email search. Synonym expansion for private corpora like emails poses several unique research challenges. First, the emails are not shared across users, which precludes us from directly employing query-document bipartite graphs, which are standard in web search synonym expansion. Second, user search queries are of personal nature, and may not be generalizable across users. Third, the size of the underlying corpora from which the synonyms may be mined is relatively small (i.e., user's private email inbox) compared to the size of the web corpus. Therefore, in this paper, we propose a solution tailored to the challenges of synonym expansion for email search. We formulate it as a multi-view learning problem, and propose a novel embedding-based model that joins information from multiple sources to obtain the optimal synonym candidates. To demonstrate the effectiveness of the proposed technique, we evaluate our model using both explicit human ratings as well as a live experiment using the Gmail Search service, one of the world's largest email search engines. View details
    Preview abstract Semantic text matching is one of the most important research problems in many domains, including, but not limited to, information retrieval, question answering, and recommendation. Among the different types of semantic text matching, long-document-to-long-document text matching has many applications, but has rarely been studied. Most existing approaches for semantic text matching have limited success in this setting, due to their inability to capture and distill the main ideas and topics from long-form text. In this paper, we propose a novel Siamese multi-depth attention-based hierarchical recurrent neural network (SMASH RNN) that learns the long-form semantics, and enables long-form document based semantic text matching. In addition to word information, SMASH RNN is using the document structure to improve the representation of long-form documents. Specifically, SMASH RNN synthesizes information from different document structure levels, including paragraphs, sentences, and words. An attention-based hierarchical RNN derives a representation for each document structure level. Then, the representations learned from the different levels are aggregated to learn a more comprehensive semantic representation of the entire document. For semantic text matching, a Siamese structure couples the representations of a pair of documents, and infers a probabilistic score as their similarity. We conduct an extensive empirical evaluation of SMASH RNN with three practical applications, including email attachment suggestion, related article recommendation, and citation recommendation. Experimental results on public data sets demonstrate that SMASH RNN significantly outperforms competitive baseline methods across various classification and ranking scenarios in the context of semantic matching of long-form documents. View details
    Semantic Location in Email Query Suggestion
    John Foley
    Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (2018), pp. 977-980
    Preview abstract Mobile devices are pervasive, which means that users have access to web content and their personal documents at all locations, not just their home or office. Existing work has studied how locations can influence information needs, focusing on web queries. We explore whether or not location information can be helpful to users who are searching their own personal documents. We wish to study whether a users’ location can predict their queries over their own personal data, so we focus on the task of query suggestion. While we find that using location directly can be helpful, it does not generalize well to novel locations. To improve this situation, we explore using semantic location: that is, rather than memorizing location-query associations, we generalize our location information to names of the closest point of interest. By using short, semantic descriptions of locations, we find that we can more robustly improve query completion and observe that users are already using locations to extend their own queries in this domain. We present a simple but effective model that can use location to predict queries for a user even before they type anything into a search box, and which learns effectively even when not all queries have location information. View details
    Preview abstract Modern search engines leverage a variety of sources, beyond the traditional query-document content similarity, to improve their ranking performance. Among them, query context has attracted attention in prior work. Previously, query context was mainly modeled by user search history, either long-term or short-term, to help the ranking of future queries. In this paper, we focus on situational context, i.e., the contextual features of the current search request that are independent from both query content and user history. As an example, situational context can depend on search request time and location. We propose two context-aware ranking models based on neural networks. The first model learns a low-dimensional deep representation from the combination of contextual features. The second model extends the first model by leveraging binarized contextual features in addition to the high-level abstractions learned from a deep network. The existing context-aware ranking models are mainly based on search history, especially click data that can be gathered from the search engine logs. Although context-aware models have been widely explored in web search, their influence on search scenarios where click data is highly sparse is relatively unstudied. The focus of this paper, personal search (e.g., email search or on-device search) is one of such scenarios. We evaluate our models using the click data collected from one of the world's largest personal search engines. The experiments demonstrate that the proposed models significantly outperform the baselines which do not take context into account. These results indicate the importance of situational context for personal search, and open up a venue for further exploration of situational context in other search scenarios. View details
    No Results Found