Mohammed Attia

Authored Publications
    Effective Multi-Dialectal Arabic POS Tagging
    Kareem Darwish
    Hamdy Mubarak
    Younes Samih
    Ahmed Abdelali
    Lluís Màrquez
    Mohamed Eldesouki
    Laura Kallmeyer
    Natural Language Engineering (NLE) (2020)
    This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated dataset of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses Conditional Random Fields (CRF), while the second combines word and character-based representations in a Deep Neural Network with stacked layers of convolutional and recurrent networks with a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. Also, we develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging between 90.2% and 95.4%. We obtained the results using a train/dev/test split of 70/10/20 for a dataset of 350 tweets per dialect.
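A minimal sketch of the first (CRF) approach described in the abstract, using the third-party sklearn-crfsuite package. The feature set and the toy tweet are illustrative stand-ins for the paper's features (such as Brown clusters and stem templates) and annotated data.

```python
import sklearn_crfsuite

def word_features(sentence, i):
    """Per-token features; the paper's Brown-cluster and stem-template
    features would be added to this dictionary in the same way."""
    word = sentence[i]
    feats = {
        "word": word,
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "is_digit": word.isdigit(),
    }
    if i > 0:
        feats["prev_word"] = sentence[i - 1]
    else:
        feats["BOS"] = True
    if i < len(sentence) - 1:
        feats["next_word"] = sentence[i + 1]
    else:
        feats["EOS"] = True
    return feats

# Toy training data: one tokenized Egyptian Arabic tweet with POS tags.
sentences = [["انا", "رايح", "البيت"]]   # "I am going home"
labels = [["PRON", "V", "NOUN"]]

X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))
```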
    Segmentation for Domain Adaptation in Arabic
    Ali Elkahky
    Workshop on Arabic Natural Language Processing -- ACL 2019, Florence, Italy (2019)
    Segmentation serves as an integral part in many NLP applications, including Machine Translation, Parsing, and Information Retrieval. When a model trained on the standard language is applied to dialects, the accuracy drops dramatically. However, there are more lexical items shared by the standard language and dialects than can be found by mere surface word matching. This shared lexicon is obscured by a lot of cliticization, gemination, and character repetition. In this paper, we show that segmentation and base normalization of dialects can help in domain adaptation by reducing data sparseness. Segmentation improves a system's performance by reducing the number of OOVs, helps isolate the differences, and allows better utilization of the commonalities. We show that adding a small amount of dialectal segmentation training data reduces OOVs by 5% and markedly improves POS tagging for dialects by 7.37% F-score, even though no dialect-specific POS training data is included.
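A small sketch of the OOV-reduction argument made in the abstract: segmenting clitics off dialectal words exposes shared base forms, so fewer test tokens fall outside the training vocabulary. The toy segmenter, which strips two hypothetical clitics, and the tiny Buckwalter-style corpora are illustrative assumptions.

```python
def naive_segment(token):
    """Strip a hypothetical conjunction proclitic 'w+' and a pronominal
    enclitic '+h' (Buckwalter transliteration)."""
    parts = []
    if token.startswith("w") and len(token) > 3:
        parts.append("w+")
        token = token[1:]
    if token.endswith("h") and len(token) > 3:
        parts.append(token[:-1])
        parts.append("+h")
    else:
        parts.append(token)
    return parts

def oov_rate(train_tokens, test_tokens, segment=lambda t: [t]):
    """Fraction of (possibly segmented) test units unseen in the
    (identically segmented) training vocabulary."""
    vocab = {unit for tok in train_tokens for unit in segment(tok)}
    units = [unit for tok in test_tokens for unit in segment(tok)]
    return sum(unit not in vocab for unit in units) / len(units)

train = ["wqAlh", "ktb", "byt"]   # tiny "training" vocabulary
test = ["wktbh", "byth", "qAl"]   # tiny "test" set

print("OOV rate without segmentation:", oov_rate(train, test))                  # 1.0
print("OOV rate with segmentation:   ", oov_rate(train, test, naive_segment))   # 0.0
```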
    QC-GO Submission for MADAR Shared Task: Arabic Fine-Grained Dialect Identification
    Ahmed Abdelali
    Hamdy Mubarak
    Kareem Darwish
    Mohamed Eldesouki
    Younes Samih
    MADAR Shared Task on Dialect Identification -- ACL 2019 (2019)
    This paper describes the QC-GO team submission to the MADAR Shared Task Subtask 1 (travel domain dialect identification) and Subtask 2 (Twitter user location identification). In our participation in both subtasks, we explored a number of approaches and system combinations to obtain the best performance for both tasks. These include deep neural nets and heuristics. Since individual approaches suffer from various shortcomings, the combination of different approaches was able to fill some of these gaps. Our system achieves F1-scores of 66.1% and 67.0% on the development sets for Subtasks 1 and 2, respectively.
    POS Tagging for Improving Code-Switching Identification in Arabic
    Ahmed Abdelali
    Ali Elkahky
    Hamdy Mubarak
    Kareem Darwish
    Younes Samih
    Workshop on Arabic Natural Language Processing -- ACL 2019, Florence, Italy (2019)
    When speakers code-switch between their native language and a second language or language variant, they follow a syntactic pattern where words and phrases from the embedded language are inserted into the matrix language. This paper explores the possibility of utilizing this pattern to improve code-switching identification between Modern Standard Arabic (MSA) and Egyptian Arabic (EA). We try to answer the question of how strong the POS signal is in word-level code-switching identification. We build a deep learning model enriched with linguistic features (including POS tags) that outperforms the state-of-the-art results by 1.9% on the development set and 1.0% on the test set. We also show that in intra-sentential code-switching, the selection of lexical items is constrained by POS categories, where function words tend to come more often from the dialectal language while the majority of content words come from the standard language.
    Multilingual Multi-class Sentiment Classification Using Convolutional Neural Networks
    Younes Samih
    Ali Elkahky
    Laura Kallmeyer
    Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018), pp. 635-640
    This paper describes a language-independent model for multi-class sentiment analysis using a simple neural network architecture of five layers (Embedding, Conv1D, GlobalMaxPooling, and two Fully-Connected layers). The advantage of the proposed model is that it does not rely on language-specific features such as ontologies, dictionaries, or morphological or syntactic pre-processing. Equally important, our system does not use pre-trained word2vec embeddings, which can be costly to obtain and train for some languages. In this research, we also demonstrate that oversampling can be an effective approach for correcting class imbalance in the data. We evaluate our methods on three publicly available datasets for English, German and Arabic, and the results show that our system’s performance is comparable to, or even better than, the state of the art for these datasets. We make our source code publicly available.
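A minimal Keras sketch of the five-layer architecture named in the abstract (Embedding, Conv1D, GlobalMaxPooling, and two fully-connected layers). The vocabulary size, sequence length, filter count, number of classes, and the random toy data are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 100, 5

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),                      # embeddings learned from scratch
    layers.Conv1D(filters=256, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),        # multi-class sentiment output
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random toy data, only to show the expected input/output shapes.
X = np.random.randint(0, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, NUM_CLASSES, size=(32,))
model.fit(X, y, epochs=1, verbose=0)
```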
    GHH at SemEval-2018 Task 10: Discovering Discriminative Attributes in Distributional Semantics
    Younes Samih
    Wolfgang Maier
    SemEval 2018 Task 10 on Capturing Discriminative Attributes (2018)
    This paper describes our system submission to the SemEval 2018 Task 10 on Capturing Discriminative Attributes. Given two concepts and an attribute, the task is to determine whether the attribute is semantically related to one concept and not the other. In this work we assume that discriminative attributes can be detected by discovering the association (or lack of association) between a pair of words. The hypothesis we test in this contribution is whether the semantic difference between two pairs of concepts can be treated in terms of measuring the distance between words in a vector space, or can simply be obtained as a by-product of word co-occurrence counts.
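A toy sketch of the vector-space idea described in the abstract: an attribute counts as discriminative for the first concept when it is noticeably closer to that concept than to the second one in embedding space. The 3-dimensional vectors and the 0.1 margin are made-up illustrations, not the system's embeddings or threshold.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-d word vectors (a real system would load pre-trained embeddings).
vectors = {
    "banana": np.array([0.9, 0.1, 0.0]),
    "apple":  np.array([0.3, 0.8, 0.2]),
    "yellow": np.array([0.8, 0.0, 0.1]),
}

def is_discriminative(concept1, concept2, attribute, margin=0.1):
    """True when the attribute is clearly closer to concept1 than to concept2."""
    s1 = cosine(vectors[attribute], vectors[concept1])
    s2 = cosine(vectors[attribute], vectors[concept2])
    return (s1 - s2) > margin

print(is_discriminative("banana", "apple", "yellow"))  # True for these toy vectors
```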
    GHHT at CALCS 2018: Named Entity Recognition for Dialectal Arabic Using Neural Networks
    Younes Samih
    Wolfgang Maier
    Third Workshop on Computational Approaches to Linguistic Code-switching in ACL 2018 (2018)
    This paper describes our system submission to the CALCS 2018 shared task on named entity recognition on code-switched data for the language variant pair of Modern Standard Arabic and Egyptian dialectal Arabic. We build a Deep Neural Network that combines word and character-based representations in convolutional and recurrent networks with a CRF layer. The model is augmented with stacked layers of enriched information such as pre-trained embeddings, Brown clusters and named entity gazetteers. Our system is ranked second among the shared task participants, achieving an average FB1 score of 70.09%.
    Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach
    Kareem Darwish
    Ahmed Abdelali
    Hamdy Mubarak
    Younes Samih
    The 3rd Workshop on Open-Source Arabic Corpora and Processing Tools in the Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018)
    Arabic is written as a sequence of consonants and long vowels, with short vowels normally omitted. Diacritization attempts to recover short vowels and is an essential step for Text-to-Speech (TTS) systems. Though automatic diacritization of Modern Standard Arabic (MSA) has received significant attention, limited research has been conducted on dialectal Arabic (DA) diacritization. Phonemic patterns of DA vary greatly from MSA and even from one another, which accounts for the noted difficulty with mutual intelligibility between dialects. With the recent advent of spoken dialog systems (or intelligent personal assistants), dialect vowel restoration is crucial to allow systems to speak back to the users in their own language variant. In this paper we present our research and benchmark results on the automatic diacritization of Tunisian and Moroccan Arabic using linear Conditional Random Fields.
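A minimal character-level CRF sketch in the spirit of the approach above, using the third-party sklearn-crfsuite package: each consonant is labeled with the short vowel (or absence of one) that follows it. The toy words, Buckwalter-style labels, and feature set are illustrative assumptions, not the paper's setup.

```python
import sklearn_crfsuite

def char_features(word, i):
    """Simple character window features."""
    return {
        "char": word[i],
        "prev": word[i - 1] if i > 0 else "<s>",
        "next": word[i + 1] if i < len(word) - 1 else "</s>",
        "position": i,
    }

# Two toy undiacritized words with per-consonant vowel labels in Buckwalter
# notation ('a' = fatha, 'o' = no vowel): ktb read as "kataba", qlm as "qalam".
words = [list("ktb"), list("qlm")]
labels = [["a", "a", "a"], ["a", "a", "o"]]

X = [[char_features(w, i) for i in range(len(w))] for w in words]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```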
    The Morpho-syntactic Annotation of Animacy for a Dependency Parser
    Ali Elkahky
    Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018), pp. 2607-2615
    In this paper we present the annotation scheme and parser results of the animacy feature in Russian and Arabic, two morphologically rich languages, in the spirit of the universal dependency framework (McDonald et al., 2013; de Marneffe et al., 2014). We explain the animacy hierarchies in both languages and make the case for the existence of five animacy types. We train a morphological analyzer on the annotated data and the results show a prediction f-measure for animacy of 95.39% for Russian and 92.71% for Arabic. We also use animacy along with other morphological tags as features to train a dependency parser, and the results show a slight improvement gained from animacy. We compare the impact of animacy on improving the dependency parser to other features found in nouns, namely ‘gender’, ‘number’, and ‘case’. To our knowledge this is the first contrastive study of the impact of morphological features on the accuracy of a transition parser. A portion of our data (1,000 sentences each for Arabic and Russian, along with other languages) annotated according to the scheme described in this paper is made publicly available (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1983) as part of the CoNLL 2017 Shared Task on Multilingual Parsing (Zeman et al., 2017).
    Multi-Dialect Arabic POS Tagging: A CRF Approach
    Kareem Darwish
    Hamdy Mubarak
    Ahmed Abdelali
    Mohamed Eldesouki
    Younes Samih
    Randah Alharbi
    Walid Magdy
    Laura Kallmeyer
    Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018), pp. 93-98
    This paper introduces a new dataset of POS-tagged Arabic tweets in four major dialects along with tagging guidelines. The data, which we are releasing publicly, includes tweets in Egyptian, Levantine, Gulf, and Maghrebi, with 350 tweets for each dialect and appropriate train/test/development splits for 5-fold cross-validation. We use a Conditional Random Fields (CRF) sequence labeler to train POS taggers for each dialect, examine the effect of cross and joint dialect training, and give benchmark results for the datasets. Using clitic n-grams, clitic metatypes, and stem templates as features, we were able to train a joint model that can correctly tag four different dialects with an average accuracy of 89.3%.
    CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
    Daniel Zeman
    Martin Popel
    Milan Straka
    Jan Hajic
    Joakim Nivre
    Filip Ginter
    Juhani Luotolahti
    Sampo Pyysalo
    Martin Potthast
    Francis Tyers
    Elena Badmaeva
    Memduh Gokirmak
    Anna Nedoluzhko
    Silvie Cinkova
    Jan Hajic jr.
    Jaroslava Hlavacova
    Václava Kettnerová
    Zdenka Uresova
    Jenna Kanerva
    Stina Ojala
    Anna Missilä
    Christopher D. Manning
    Sebastian Schuster
    Siva Reddy
    Dima Taji
    Nizar Habash
    Herman Leung
    Marie-Catherine de Marneffe
    Manuela Sanguinetti
    Maria Simi
    Hiroshi Kanayama
    Valeria de Paiva
    Kira Droganova
    Héctor Martínez Alonso
    Çagrı Çöltekin
    Umut Sulubacak
    Hans Uszkoreit
    Vivien Macketanz
    Aljoscha Burchardt
    Kim Harris
    Katrin Marheinecke
    Georg Rehm
    Tolga Kayadelen
    Ali Elkahky
    Zhuoran Yu
    Emily Pitler
    Saran Lertpradit
    Michael Mandl
    Jesse Kirchner
    Hector Fernandez Alcalde
    Esha Banerjee
    Antonio Stella
    Atsuko Shimada
    Sookyoung Kwak
    Gustavo Mendonca
    Tatiana Lando
    Rattima Nitisaroj
    Josie Li
    Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (2017)
    Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM
    Mohamed Eldesouki
    Younes Samih
    Ahmed Abdelali
    Hamdy Mubarak
    Kareem Darwish
    Laura Kallmeyer
    arXiv (2017)
    Arabic word segmentation is essential for a variety of NLP applications such as machine translation and information retrieval. Segmentation entails breaking words into their constituent stems, affixes and clitics. In this paper, we compare two approaches for segmenting four major Arabic dialects using only several thousand training examples for each dialect. The two approaches involve posing the problem as a ranking problem, where an SVM ranker picks the best segmentation, and as a sequence labeling problem, where a bi-LSTM RNN coupled with a CRF determines where best to segment words. We are able to achieve solid segmentation results for all dialects using rather limited training data. We also show that employing Modern Standard Arabic data for domain adaptation and assuming context independence improve overall results.
    The aim of this document is to provide a list of dependency tags that are to be used for the Arabic dependency annotation task, with examples provided for each tag. The dependency representation is a simple description of the grammatical relationships in a sentence. It represents all sentence relations as uniformly typed dependency relations. The dependencies are all binary relations between a governor (also known as the head) and a dependent (any complement of or modifier to the head).
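A minimal sketch of the uniform representation described above, where every grammatical relation is a typed binary link from a governor (head) to a dependent. The English example sentence, relation names, and indexing scheme are illustrative only.

```python
from collections import namedtuple

Dependency = namedtuple("Dependency", ["relation", "head", "dependent"])

# "The boy read a book" with 1-based token indices; 0 is the artificial root.
tokens = ["The", "boy", "read", "a", "book"]
tree = [
    Dependency("det",   head=2, dependent=1),   # The  <- boy
    Dependency("nsubj", head=3, dependent=2),   # boy  <- read
    Dependency("root",  head=0, dependent=3),   # read <- ROOT
    Dependency("det",   head=5, dependent=4),   # a    <- book
    Dependency("obj",   head=3, dependent=5),   # book <- read
]

for dep in tree:
    head_word = "ROOT" if dep.head == 0 else tokens[dep.head - 1]
    print(f"{dep.relation}({head_word}, {tokens[dep.dependent - 1]})")
```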
    Learning from Relatives: Unified Dialectal Arabic Segmentation
    Younes Samih
    Mohamed Eldesouki
    Ahmed Abdelali
    Hamdy Mubarak
    Kareem Darwish
    Laura Kallmeyer
    CoNLL, Vancouver, Canada (2017)
    Arabic dialects do not just share a common koine, but there are shared pan-dialectal linguistic phenomena that allow computational models for dialects to learn from each other. In this paper we build a unified segmentation model where the training data for different dialects are combined and a single model is trained. The model yields higher accuracies than dialect-specific models, eliminating the need for dialect identification before segmentation. We also measure the degree of relatedness between four major Arabic dialects by testing how a segmentation model trained on one dialect performs on the other dialects. We found that linguistic relatedness is closely tied to geographical proximity. In our experiments we use SVM-based ranking and bi-LSTM-CRF sequence labeling.
    A Neural Architecture for Dialectal Arabic Segmentation
    Younes Samih
    Mohamed Eldesouki
    Hamdy Mubarak
    Ahmed Abdelali
    Laura Kallmeyer
    Kareem Darwish
    The Third Arabic Natural Language Processing Workshop (WANLP), Valencia, Spain (2017), pp. 46-54
    The automated processing of Arabic dialects is challenging due to the lack of spelling standards and the scarcity of annotated data and resources in general. Segmentation of words into their constituent tokens is an important processing step for natural language processing. In this paper, we show how a segmenter can be trained on only 350 annotated tweets using neural networks without any normalization or reliance on lexical features or linguistic resources. We deal with segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that heavily depend on additional resources.
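A minimal Keras sketch of character-level sequence labeling for segmentation in the spirit of the approach above. The toy example, the B/I tag scheme (B starts a segment, I continues it), the layer sizes, and the softmax output used here as a simplification are illustrative assumptions rather than the paper's exact model.

```python
import numpy as np
from tensorflow.keras import layers, models

# Toy "annotated" example: the word "wktAbhm" ("and their book", Buckwalter
# transliteration) segmented as w+ ktAb +hm.
chars = "wktAbhm"
tags = ["B", "B", "I", "I", "I", "B", "I"]

char_vocab = {c: i + 1 for i, c in enumerate(sorted(set(chars)))}  # 0 = padding
tag_vocab = {"B": 0, "I": 1}

X = np.array([[char_vocab[c] for c in chars]])   # shape (1, seq_len)
y = np.array([[tag_vocab[t] for t in tags]])     # shape (1, seq_len)

model = models.Sequential([
    layers.Embedding(len(char_vocab) + 1, 32, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dense(len(tag_vocab), activation="softmax"),  # softmax used here instead of a CRF layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=5, verbose=0)

pred = model.predict(X, verbose=0).argmax(-1)[0]
print(list(zip(chars, pred)))
```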
    The Power of Language Music: Arabic Lemmatization through Patterns
    Ayah Zirikly
    Mona Diab
    Proceedings of the Workshop on Cognitive Aspects of the Lexicon, Osaka, Japan (2016), pp. 40-50
    Patterns play a pivotal role in Arabic morphological processing, whether related to derivation or inflection. These patterns have not yet been adequately and fully utilized in computational processing of the language. The novel contribution of this paper is performing lemmatization (a high-level lexical processing task) without relying on a lookup dictionary. We use a machine learning classifier to predict the lemma pattern for a given stem, and use mapping rules to convert stems to their respective lemmas.
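A toy sketch of the pattern-mapping idea described in the abstract: a classifier (replaced here by a hard-coded lookup) predicts the lemma pattern for a stem, and a mapping rule interdigitates the stem's root consonants into that pattern to produce the lemma. The forms are in Buckwalter transliteration and the patterns are simplified illustrations, not the paper's rules.

```python
def extract_root(stem, stem_pattern):
    """Pull out the root consonants: the characters of the stem that align
    with 'C' slots in the stem pattern (other slots are literal characters)."""
    return [ch for ch, slot in zip(stem, stem_pattern) if slot == "C"]

def apply_pattern(root, lemma_pattern):
    """Fill the 'C' slots of the lemma pattern with the root consonants."""
    root = iter(root)
    return "".join(next(root) if slot == "C" else slot for slot in lemma_pattern)

# Stand-in for the machine-learned classifier: stem -> (stem pattern, predicted lemma pattern).
pattern_lookup = {
    "ktb": ("CCC", "CiCAC"),   # e.g. map the undiacritized stem 'ktb' to the lemma 'kitAb' ("book")
}

stem = "ktb"
stem_pat, lemma_pat = pattern_lookup[stem]
print(apply_pattern(extract_root(stem, stem_pat), lemma_pat))  # kitAb
```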
    CogALex-V Shared Task: GHHH - Detecting Semantic Relations via Word Embeddings
    Suraj Maharjan
    Younes Samih
    Laura Kallmeyer
    Thamar Solorio
    CogALex-2016 Shared Task on the Corpus-Based Identification of Semantic Relations, Osaka, Japan (2016), pp. 86-91
    This paper describes our system submitted to the CogALex-2016 Shared Task on the Corpus-Based Identification of Semantic Relations. The evaluation results of our system on the test set are 88.1% (79.0% for TRUE only) f-measure for Task-1 on detecting semantic similarity, and 76.0% (42.3% when excluding RANDOM) for Task-2 on identifying finer-grained semantic relations. In our experiments, we try word analogy, linear regression, and multi-task Convolutional Neural Networks (CNN) with word embeddings from publicly available word vectors. We found that linear regression performs better in binary classification (Task-1), while CNN has better performance in multi-class semantic classification (Task-2). We assume that word analogy is more suited for deterministic answers rather than handling the ambiguity of one-to-many and many-to-many relationships. We also show that classifier performance could benefit from balancing the frequency of labels in the training data.
    Multilingual Code-switching Identification via LSTM Recurrent Neural Networks
    Younes Samih
    Suraj Maharjan
    Laura Kallmeyer
    Thamar Solorio
    Proceedings of the Second Workshop on Computational Approaches to Code Switching, Austin, TX (2016), pp. 50-59
    This paper describes the HHU-UH-G system submitted to the EMNLP 2016 Second Workshop on Computational Approaches to Code Switching. Our system ranked first place for Arabic (MSA-Egyptian) with an F1-score of 0.83 and second place for Spanish-English with an F1-score of 0.90. The HHU-UH-G system introduces a novel unified neural network architecture for language identification in code-switched tweets for both Spanish-English and the MSA-Egyptian dialect pair. The system makes use of word and character level representations to identify code-switching. For the MSA-Egyptian dialect pair, the system does not rely on any kind of language-specific knowledge or linguistic resources such as Part-of-Speech (POS) taggers, morphological analyzers, gazetteers or word lists to obtain state-of-the-art performance.
    Idafa in traditional Arabic grammar is an umbrella construction that covers several phenomena including what is expressed in English as noun-noun compounds and Saxon and Norman genitives. Additionally, Idafa participates in some other constructions, such as quantifiers, quasi-prepositions, and adjectives. Identifying the various types of the Idafa construction (IC) is of importance to Natural Language Processing (NLP) applications. Noun-noun compounds exhibit special behaviour in most languages impacting their semantic interpretation. Hence distinguishing them could have an impact on downstream NLP applications. The most comprehensive computational syntactic representation of the Arabic language is found in the LDC Arabic Treebank (ATB). Despite its coverage, ICs are not explicitly labeled in the ATB and, furthermore, there is no clear distinction between ICs of noun-noun relations and other traditional ICs. Hence, we devise a detailed syntactic and semantic typification process of the IC phenomenon in Arabic. We target the ATB as a platform for this classification. We render the ATB annotated with explicit IC labels in addition to further semantic characterization, which is useful for syntactic, semantic and cross-language processing. Our typification of IC comprises three main syntactic IC types: False Idafas (FIC), Grammatical Idafas (GIC), and True Idafas (TIC), which are further divided into 10 syntactic subclasses. The TIC group is further classified into semantic relations. We devise a method for automatic IC labeling and compare its yield against the CATiB Treebank. Our evaluation shows that we achieve the same level of accuracy, but with the additional fine-grained classification into the various syntactic and semantic types.