From data
To meaning

The annual meeting of the Association for Computational Linguistics (ACL 2019) has gathered more than 2000 researchers from all around the world in Florence, Italy. In the enormous halls, the words everybody is repeating are the ones everybody would expect: deep learning, neural networks and word embeddings, especially when applied in multilingual contexts. Google’s language model BERT, for instance, is the main topic of many scientific papers and NLP systems presented at the conference.

During the ACL award ceremony, winner Ronald Kaplan starts his lecture on Computational Psycholinguistics on an unexpected note. He opens with “Aspects of the Theory of Syntax” by Noam Chomsky (1965), focusing on the topic of competence and linguistic performance. He traces back some of the most important steps in the history of computational linguistics: Augmented Transition Networks, hierarchical attribute-value matrices, functional structures and feature structures. These and other techniques used in NLP in the past are not so far from present technologies, which are mainly based on machine learning. At a time when computational linguistics is reaching its peak, the main goal remains to model competence and linguistic performance, i.e. to create a “reasonable model of language use”.

The 2019 ACL conference gathers academic and industry researchers working in the Natural Language Processing field all over the world. Big companies such as Baidu, Tencent, Facebook, Microsoft, Apple, IBM, Naver, Bloomberg, Salesforce, Bosch, Amazon, Samsung and more are taking part in the event, as well as smaller companies that have been doing research in this branch for many years. Researchers from the most important universities in the world are meeting here, coming from Europe, the USA, Canada, South America, Russia, China, Korea and other countries in south-eastern Asia, as well as African countries, India and Australia. This 57th edition is the best-attended edition ever. It is the first time the conference is being held in Italy, which was accomplished also with the help of the Italian Association for Computational Linguistics (AILC), of which CELI is a member.

What are the most relevant fields of investigation this year? Dialogue and Interactive Systems, Sentence-level Semantics, Machine Translation, Information Extraction and Text Mining, Sentiment Analysis, Multilinguality, Question Answering, etc.

A considerable part of the sessions focuses on Dialogue: how to properly interpret user queries, how to produce answers in natural language that mirror the tone of the conversation, how to chit-chat but also how to deliver information so as to help users obtain what they want. In a nutshell, the aim is to understand and model human dialogue efficiently, so as to reproduce it through machines. In this field, Chinese research hubs (Beijing University, Huazhong University, Chinese Academy of Sciences, etc.) seem to be the ones obtaining the best results compared to baselines from previous works. Their success is also due to the large quantity of data they can collect: Tencent and Baidu are big communication companies as well as strong partners in most of the relevant papers.

The important role of the Chinese community doesn’t negatively affect the major Western companies. Google AI and Microsoft keep investing in research, and they are reaping the results from commercial products that have already been on the market for some time now. For instance, the Microsoft team in Hyderabad (India) analyses conversations held with “Ruuh”, an open-ended dialogue system. Their goal is to give back to the scientific and industrial community the best practices and lessons learned while building a chatbot.

Low-resource languages also had their fair share, with the Fourth Arabic Computational Linguistics Workshop co-located with ACL 2019. Researchers from all over the Arab world met to discuss the challenges they are facing in NLP tasks for Arabic, share some insights into how they approached them and explore possible solutions. Some of the issues discussed include the scarcity of data, especially in some of the many Arabic dialects, which do not have standardized writing conventions, and also code switching between more than one variety of Arabic or even between more than one language (between Arabic and English in some Arab countries and between Arabic and French in others). The American University of Beirut presented the language model “hULMonA” or Our Dream (The Universal Language Model in Arabic). “hULMonA” is a pre-trained model that uses a huge corpus of Modern Standard Arabic. The model also uses MADAMIRA (a morphological analysis and disambiguation tool specific to Arabic) to tokenize the input sentences instead of the pre-trained multilingual WordPiece tokenizer used in BERT, which might explain why “hULMonA” outperformed multilingual BERT in some classification tasks.

Written by Andrea Bolioli, Francesca Alloatti and Milad Botros