3 October 2022

Sławomir Dadas (National Information Processing Institute)

Our experience with training neural sentence encoders for the Polish language

Representing sentences or short texts as dense vectors with a fixed number of dimensions is a common technique in tasks such as information retrieval, question answering, text clustering or plagiarism detection. A simple method to construct such a representation is to aggregate vectors generated by a language model or extracted from word embeddings. However, higher quality representations can be obtained by fine-tuning a language model on a dataset of semantically similar sentence pairs. In this presentation, we will introduce methods for learning sentence encoders based on the Transformer architecture as well as our experiences with training such models for the Polish language. In addition, we will discuss approaches for building large datasets of paraphrases using publicly available corpora. We will also show a practical application of sentence encoders in a system developed for finding abusive clauses in consumer agreements.

14 November 2022

Łukasz Augustyniak, Kamil Tagowski, Albert Sawczyn, Denis Janiak, Roman Bartusiak, Adrian Dominik Szymczak, Arkadiusz Janz, Piotr Szymański, Marcin Wątroba, Mikołaj Morzy, Tomasz Jan Kajdanowicz, Maciej Piasecki (Wrocław University of Science and Technology)

This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish

The availability of compute and data to train larger and larger language models increases the demand for robust methods of benchmarking the true progress of LM training. Recent years witnessed significant progress in standardized benchmarking for English. Benchmarks such as GLUE, SuperGLUE, or KILT have become de facto standard tools to compare large language models. Following the trend to replicate GLUE for other languages, the KLEJ benchmark (klej is the word for glue in Polish) has been released for Polish. In this paper, we evaluate the progress in benchmarking for low-resourced languages. We note that only a handful of languages have such comprehensive benchmarks. We also note the gap in the number of tasks being evaluated by benchmarks for resource-rich English/Chinese and the rest of the world. In this paper, we introduce LEPISZCZE (lepiszcze is the Polish word for glew, the Middle English predecessor of glue), a new, comprehensive benchmark for Polish NLP with a large variety of tasks and high-quality operationalization of the benchmark. We design LEPISZCZE with flexibility in mind. Including new models, datasets, and tasks is as simple as possible while still offering data versioning and model tracking. In the first run of the benchmark, we test 13 experiments (task and dataset pairs) based on the five most recent LMs for Polish. We use five datasets from the Polish benchmark and add eight novel datasets. As the paper's main contribution, apart from LEPISZCZE, we provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.

28 November 2022

Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)

Lying in Polish: language analysis and methods of automated detection

Lying is an integral part of daily communication in both written and oral form. In this presentation, we will present the results obtained on a collection of nearly 1,500 true and false statements, half of which are transcripts and the other half are written statements, from probably the largest study on lying in the Polish language. In the first part of the presentation, we will examine the differences between true and false statements: we will check whether they differ in terms of complexity and sentiment, as well as characteristics such as length, concreteness and distribution of parts of speech. In the second part of the presentation, we will discuss models that automatically distinguish true from false statements. We will cover simple approaches, such as models trained on dictionary features, as well as more complex, pre-trained transformer neural networks. We will also talk about an attempt to detect lying with the use of automated fact-checking and present the preliminary results of work on the interpretability (explanations) of lie detection models.

19 December 2022

Wojciech Kryściński (Salesforce Research)

Long Story Short: A Talk about Text Summarization

Automatic Text Summarization is a challenging task within Natural Language Processing that requires advanced language understanding and generation capabilities. In recent years substantial progress has been made in developing neural models for the task thanks to the efforts of the research community and advancements in the broader field of NLP. Despite this progress, text summarization remains a challenging task that is far from being solved. In this talk, we will first discuss the early approaches and the current state of the field. Next, we will critically evaluate key ingredients of the existing research setup: datasets, evaluation metrics, and models. Finally, we will focus on emerging research directions and consider the future of text summarization.

9 January 2023

Marzena Karpińska (University of Massachusetts Amherst)

Challenges in Evaluation of Machine Generated Text

The recent progress in natural language generation (NLG) has made it difficult for researchers to effectively evaluate the output of their models. Traditional metrics, such as BLEU and ROUGE, are no longer sufficient to distinguish between high-quality and low-quality outputs, especially in open-ended tasks like story and poetry generation, or at the paragraph level. As a result, many researchers rely on crowdsourced human evaluations of text quality, using platforms like Amazon Mechanical Turk (AMT) to collect ratings of coherence or grammaticality. In this talk, I will first present a series of experiments highlighting the challenges and pitfalls of such approaches, showing that even experts may struggle to accurately evaluate model-generated text using Likert-style scales, especially in the story generation task. Next, I will address similar issues in automatic evaluation of machine translation in the literary domain, and outline some unique difficulties inherent in the translation task itself.

6 February 2023

Agnieszka Mikołajczyk-Bareła (VoiceLab / Politechnika Gdańska / HearAI)

HearAI: Towards Deep learning-based Sign Language Recognition

Deaf and hearing-impaired people have a huge communication barrier. Different nationalities use different sign languages, and there is no universal one, as they are natural human languages with their own grammatical rules and lexicons. Deep learning-based methods for sign language translation need a lot of adequately labeled training data to perform well. In the HearAI non-profit project, we addressed this problem and investigated different multilingual open sign language corpora labeled by linguists in the language-agnostic Hamburg Notation System (HamNoSys). First, we simplified the difficult-to-understand structure of the HamNoSys without significant loss of gloss meaning by introducing numerical multilabels. Second, we utilized estimated pose landmarks and selected video keyframes' image-level features to recognize isolated glosses. We separately analyzed possibilities of dominant hand location, its position and shape, and overall movement symmetry, which allowed us to deeply explore the usefulness of HamNoSys for gloss recognition.

13 February 2023

Artur Nowakowski, Gabriela Pałka, Kamil Guttmann, Mikołaj Pokrywka (Adam Mickiewicz University in Poznań)

AMU at WMT 2022: state-of-the-art machine translation methods

The majority of machine translation systems are trained at the sentence level. However, today, the expectation is that the translation system will take into account the context of the entire document. To meet this expectation, the organizers of the WMT 2022 conference created the General MT Task, which involves translating texts from different domains: news articles, social media content, conversations, and e-commerce texts. The presentation will discuss the task faced during the WMT 2022 conference in the Czech-Ukrainian and Ukrainian-Czech translation directions. Problems encountered along the way, such as correct translation of named entities, consideration of document context, and proper handling of rarely used characters like emojis, will be discussed. Additionally, methods for selecting the best translation among the translations generated by the system using automatic translation quality assessment models will be presented. The primary goal of the presentation is to showcase the components of the system that contributed to achieving the best results among all shared task participants.

27 February 2023

Sebastian Vincent (University of Sheffield)

MTCue: Learning Zero-Shot Control of Extra-Textual Attributes by Leveraging Unstructured Context in Neural Machine Translation

Efficient use of both intra- and extra-textual context is one of the critical gaps between human and neural machine translation. Research so far has mostly focused on individual, well-defined types of context, such as the surrounding text or discrete external variables such as the gender of the speaker. This work introduces MTCue, a novel neural machine translation framework which rewrites all context as text and learns an abstract representation of context, enabling transfer across different data settings and leveraging similar attributes in low-resource settings. Focusing on the domain of dialogue with access to document and metadata context, we evaluate multiple variants of MTCue, with four choices for context-source combination and several context vectorisation functions. Our experiments across six language pairs show gains in translation quality over a non-contextual baseline. Further analysis shows that the context encoder of MTCue learns a context space representation which is organised w.r.t. specific attributes such as formality, effectively enabling their zero-shot control. Pre-training on context embeddings also lets MTCue learn new control codes with less data than a tagging baseline.

27 March 2023

Julian Zubek, Joanna Rączaszek-Leonardi (Faculty of Psychology, University of Warsaw)

Agent-based models of symbol emergence in communication inspired by processes of language development

Influenced by computer science, we come to understand symbols as discrete elements of an abstract structure on which formal operations are performed. Semiotically, symbols are a particular type of sign that functions within a system of interdependencies and whose interpretation requires knowledge of the rules governing this system. From the perspective of language evolution and development, the emergence of symbolic structures and the ability to use them pose a number of basic questions. In our research program, we focus on how abstract symbols emerge together with the ability to perform physical actions in the world and how symbols can control these actions. To illustrate these relations, we use computer simulations in which agents coordinate their actions using a communication protocol that emerges bottom-up in a reinforcement learning scheme. We point out the assumptions underlying these types of models and the existing difficulties in modeling multiple sources of pressure shaping the structure of language. We present the results of our own simulations, illustrating a) the influence of interaction history on the structure of language, b) the relations between context availability and communication protocol ambiguity, c) the role of dialogue in the coordination and structuring of actions in a dynamic environment. The results show the complex nature of symbols, which requires complementary description at the level of formal structure and at the level of system dynamics. This complexity should also be reflected in the design and evaluation of artificial intelligence algorithms intended for interaction with humans.

24 April 2023

Mateusz Krubiński (Charles University in Prague)

A picture is worth a thousand words – on Multimodal Summarization

Automatic summarization is one of the basic tasks both in Natural Language Processing – text summarization – and in Computer Vision – video summarization. Multimodal summarization connects those two fields by creating a summary based on information from different modalities. To motivate such research, it’s enough to visit any news portal: the most popular multimedia news formats are now multimodal – the reader is often presented not only with a textual article but also with a short, vivid video. To draw the attention of the reader, such video-based articles are usually presented as a short textual summary paired with an image thumbnail.

In this talk, I will present a brief history of text-centric Multimodal Summarization – a formulation in which we require the textual modality to be present both in the input and in the output. I will show how the task evolved over the years and highlight what I believe to be the major challenges. In the second part, I will talk about my own experiments, focusing on pre-training and evaluation methodologies. I will also share my experience with creating a dataset based on information automatically collected from internet webpages, which shows that sometimes aiming lower may lead to a great outcome.

25 May 2023

Agata Savary (Université Paris-Saclay)

We thought the eyes of coreference were shut to multiword expressions and they mostly are

Multiword expressions are combinations of words which exhibit peculiar semantic properties such as different degrees of non-compositionality, decomposability, transparency and figuration. Long-standing linguistic debates suggest that such semantic idiosyncrasy conditions the morpho-syntactic configurations in which a given multiword expression can occur. This paper extends this argumentation to nominal coreference. Namely, we hypothesise that internal components of a multiword expression are unlikely to occur in coreference chains. While previous work noticed the rareness of coreference-related phenomena in the presence of multiword expressions, this observation has never been quantified, to the best of our knowledge. We bridge this gap by performing an automated corpus-based study of the intersections between verbal multiword expressions and nominal coreference in French. The results largely corroborate our hypothesis but also display various tendencies depending on the types of multiword expressions and of the corpus genre. The analysis of the corpus examples highlights interesting properties of coreference, notably in speech.

Natural Language Processing Seminar 2022–2023