Natural Language Processing Seminar 2022–2023

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube.

3 October 2022

Sławomir Dadas (National Information Processing Institute)

https://www.youtube.com/watch?v=TGwLeE1Y5X4 Our experience with training neural sentence encoders for the Polish language  Talk delivered in Polish.

Representing sentences or short texts as dense vectors with a fixed number of dimensions is a common technique in tasks such as information retrieval, question answering, text clustering or plagiarism detection. A simple method to construct such representation is to aggregate vectors generated by a language model or extracted from word embeddings. However, higher quality representations can be obtained by fine-tuning a language model on a dataset of semantically similar sentence pairs. In this presentation, we will introduce methods for learning sentence encoders based on the Transformer architecture as well as our experiences with training such models for the Polish language. In addition, we will discuss approaches for building large datasets of paraphrases using publicly available corpora. We will also show a practical application of sentence encoders in a system developed for finding abusive clauses in consumer agreements.
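The "simple method" the abstract mentions, aggregating word vectors into one fixed-size sentence vector, can be sketched in a few lines. This is an illustrative toy, not the presenters' code: the 3-dimensional embeddings and Polish mini-vocabulary below are invented for the example (real models use hundreds of dimensions).

```python
import math

# Hypothetical 3-dimensional word embeddings; purely illustrative values.
EMBEDDINGS = {
    "kot":    [0.9, 0.1, 0.0],  # "cat"
    "pies":   [0.8, 0.2, 0.1],  # "dog"
    "siedzi": [0.1, 0.9, 0.2],  # "sits"
    "biega":  [0.2, 0.8, 0.3],  # "runs"
}

def sentence_embedding(tokens):
    """Mean-pool token vectors into one fixed-size sentence vector."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(3)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

s1 = sentence_embedding(["kot", "siedzi"])   # "the cat sits"
s2 = sentence_embedding(["pies", "biega"])   # "the dog runs"
print(round(cosine(s1, s2), 3))  # semantically close sentences -> cosine near 1
```

Fine-tuned Transformer encoders, as discussed in the talk, replace the static lookup table with contextual token vectors but typically keep the same mean-pooling step.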

14 November 2022

Łukasz Augustyniak, Kamil Tagowski, Albert Sawczyn, Denis Janiak, Roman Bartusiak, Adrian Dominik Szymczak, Arkadiusz Janz, Piotr Szymański, Marcin Wątroba, Mikołaj Morzy, Tomasz Jan Kajdanowicz, Maciej Piasecki (Wrocław University of Science and Technology)

https://pwr-edu.zoom.us/j/96657909989?pwd=VXFmcEc5blNyM0M3ekxvNGc3Q2Rsdz09 This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish  Talk delivered in Polish. Slides in English.

The availability of compute and data to train larger and larger language models increases the demand for robust methods of benchmarking the true progress of LM training. Recent years witnessed significant progress in standardized benchmarking for English. Benchmarks such as GLUE, SuperGLUE, or KILT have become de facto standard tools to compare large language models. Following the trend to replicate GLUE for other languages, the KLEJ benchmark (klej is the word for glue in Polish) has been released for Polish. In this paper, we evaluate the progress in benchmarking for low-resourced languages. We note that only a handful of languages have such comprehensive benchmarks. We also note the gap in the number of tasks being evaluated by benchmarks for resource-rich English/Chinese and the rest of the world. In this paper, we introduce LEPISZCZE (lepiszcze is the Polish word for glew, the Middle English predecessor of glue), a new, comprehensive benchmark for Polish NLP with a large variety of tasks and high-quality operationalization of the benchmark. We design LEPISZCZE with flexibility in mind. Including new models, datasets, and tasks is as simple as possible while still offering data versioning and model tracking. In the first run of the benchmark, we test 13 experiments (task and dataset pairs) based on the five most recent LMs for Polish. We use five datasets from the Polish benchmark and add eight novel datasets. As the paper's main contribution, apart from LEPISZCZE, we provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
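The design goal quoted above, that adding new models, datasets, and tasks should be as simple as possible, can be sketched as a small task registry where one function call adds a benchmark entry. The names and the toy sentiment task below are hypothetical, not the actual LEPISZCZE API:

```python
# A benchmark as a registry: task name -> (dataset, metric).
TASKS = {}

def register_task(name, dataset, metric):
    """Adding a task is a single call; dataset is a list of (text, label) pairs."""
    TASKS[name] = {"dataset": dataset, "metric": metric}

def evaluate(model, task_name):
    """Run `model` (a callable text -> label) over a task and score it."""
    task = TASKS[task_name]
    predictions = [model(text) for text, _ in task["dataset"]]
    gold = [label for _, label in task["dataset"]]
    return task["metric"](predictions, gold)

def accuracy(predictions, gold):
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Toy two-example sentiment task, for illustration only.
register_task("toy-sentiment",
              dataset=[("świetny film", "pos"), ("słaby film", "neg")],
              metric=accuracy)

always_pos = lambda text: "pos"
print(evaluate(always_pos, "toy-sentiment"))  # 0.5
```

A real benchmark adds data versioning and experiment tracking on top of such a registry, which is exactly the extra machinery the abstract highlights.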

28 November 2022

Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)

https://www.youtube.com/watch?v=zVbQ7gmbqvA Lying in Polish: language analysis and methods of automated detection  Talk delivered in Polish.

Lying is an integral part of daily communication in both written and oral form. In this presentation, we will present the results obtained on a collection of nearly 1,500 true and false statements, half of which are transcripts and the other half are written statements, from probably the largest study on lying in the Polish language. In the first part of the presentation, we will examine the differences between true and false statements: we will check whether they differ in terms of complexity and sentiment, as well as characteristics such as length, concreteness and distribution of parts of speech. In the second part of the presentation, we will discuss models that automatically distinguish true from false statements. We will cover simple approaches, such as models trained on dictionary features, as well as more complex, pre-trained transformer neural networks. We will also talk about an attempt to detect lying with the use of automated fact-checking and present the preliminary results of work on the interpretability (explanations) of lie detection models.
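The "models trained on dictionary features" mentioned above can be sketched as follows: each statement is scored by the frequency of words from hand-built lexicons, and a linear rule makes the call. The lexicons, weights, and predictions below are toy values for illustration and do not reflect the study's actual findings:

```python
# Toy lexicons of Polish function words; illustrative only.
SELF_REFERENCE = {"ja", "mnie", "mój"}            # first-person markers
CERTAINTY = {"na", "pewno", "zawsze", "nigdy"}    # certainty markers

def features(tokens):
    """Dictionary features: per-token rate of each lexicon's words."""
    n = len(tokens)
    return {
        "self_ref": sum(t in SELF_REFERENCE for t in tokens) / n,
        "certainty": sum(t in CERTAINTY for t in tokens) / n,
    }

def predict(tokens, w_self=1.0, w_cert=-1.0, threshold=0.0):
    """Classify a statement from a weighted sum of dictionary features."""
    f = features(tokens)
    score = w_self * f["self_ref"] + w_cert * f["certainty"]
    return "true" if score > threshold else "false"

print(predict("ja mnie widziałem tam".split()))   # 'true'
print(predict("zawsze nigdy na pewno".split()))   # 'false'
```

In practice the weights would be learned (e.g. by logistic regression) rather than set by hand, and the talk contrasts this whole approach with pre-trained Transformer classifiers.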

19 December 2022

Wojciech Kryściński (Salesforce Research)

https://www.youtube.com/watch?v=54qidiBmiok Long Story Short: A Talk about Text Summarization  Talk and slides in English.

Automatic Text Summarization is a challenging task within Natural Language Processing that requires advanced language understanding and generation capabilities. In recent years substantial progress has been made in developing neural models for the task thanks to the efforts of the research community and advancements in the broader field of NLP. Despite this progress, text summarization remains a challenging task that is far from being solved. In this talk, we will first discuss the early approaches and the current state of the field. Next, we will critically evaluate key ingredients of the existing research setup: datasets, evaluation metrics, and models. Finally, we will focus on emerging research directions and consider the future of text summarization.

9 January 2023

Marzena Karpińska (University of Massachusetts Amherst)

http://zil.ipipan.waw.pl/seminarium-online Challenges in Evaluation of Machine Generated Text  Talk delivered in Polish.

The recent progress in natural language generation (NLG) has made it difficult for researchers to effectively evaluate the output of their models. Traditional metrics, such as BLEU and ROUGE, are no longer sufficient to distinguish between high quality and low quality outputs, especially in open-ended tasks like story and poetry generation, or at the paragraph level. As a result, many researchers rely on crowdsourced human evaluations of text quality, using platforms like Amazon Mechanical Turk (AMT) to collect ratings of coherence or grammaticality. In this talk, I will first present a series of experiments highlighting the challenges and pitfalls of such approaches showing that even experts may struggle to accurately evaluate model-generated text using Likert-style scales, especially in the story generation task. Next, I will address similar issues in automatic evaluation of machine translation of the literary domain, and outline some unique difficulties inherent in the translation task itself.
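One pitfall of Likert-style crowdsourced ratings discussed above is easy to demonstrate numerically: two texts can receive the same mean score while annotators agree perfectly on one and disagree sharply on the other, so a mean alone hides the disagreement. The ratings below are invented toy data:

```python
from statistics import mean, stdev

# Five annotators rate the coherence of two texts on a 1-5 Likert scale.
ratings_a = [4, 4, 4, 4, 4]  # annotators fully agree
ratings_b = [2, 5, 3, 5, 5]  # same mean, strong disagreement

print(mean(ratings_a), round(stdev(ratings_a), 2))  # 4 0.0
print(mean(ratings_b), round(stdev(ratings_b), 2))  # 4 1.41
```

Reporting rating dispersion (or proper inter-annotator agreement statistics) alongside the mean is one way evaluation setups try to surface such cases.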

6 February 2023

Agnieszka Mikołajczyk (VoiceLab / Politechnika Gdańska / hear.ai)

HearAI: Towards Deep learning-based Sign Language Recognition  Talk delivered in Polish.

Deaf and hearing-impaired people have a huge communication barrier. Different nationalities use different sign languages, and there is no universal one, as they are natural human languages with their own grammatical rules and lexicons. Deep learning-based methods for sign language translation need a lot of adequately labeled training data to perform well. In the HearAI non-profit project, we addressed this problem and investigated different multilingual open sign language corpora labeled by linguists in the language-agnostic Hamburg Notation System (HamNoSys). First, we simplified the difficult-to-understand structure of the HamNoSys without significant loss of gloss meaning by introducing numerical multilabels. Second, we utilized estimated pose landmarks and selected video keyframes' image-level features to recognize isolated glosses. We separately analyzed the dominant hand's location, position, and shape, as well as overall movement symmetry, which allowed us to explore in depth the usefulness of HamNoSys for gloss recognition.
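The "numerical multilabels" simplification described above can be sketched as mapping each annotated HamNoSys dimension (handshape, location, symmetry, ...) to a small integer class index, so one gloss annotation becomes a vector of label ids that a classifier can predict per dimension. The dimension names and vocabularies below are illustrative, not the project's actual label set:

```python
# Hypothetical per-dimension vocabularies; each dimension becomes one
# classification head in a multilabel recognition model.
DIMENSIONS = {
    "handshape": ["flat", "fist", "index"],
    "location": ["head", "chest", "neutral"],
    "symmetry": ["none", "mirrored"],
}

def to_multilabel(annotation):
    """Map a dict of symbolic HamNoSys-style values to one integer per dimension."""
    return [DIMENSIONS[dim].index(annotation[dim]) for dim in DIMENSIONS]

labels = to_multilabel({"handshape": "fist", "location": "chest", "symmetry": "none"})
print(labels)  # [1, 1, 0]
```

Predicting each dimension separately is what lets the analysis above isolate, for example, how informative hand location is on its own.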

13 February 2023

Artur Nowakowski, Gabriela Pałka, Kamil Guttmann, Mikołaj Pokrywka (Adam Mickiewicz University in Poznań)

AMU at WMT 2022: state-of-the-art machine translation methods  Talk delivered in Polish.

The majority of machine translation systems are trained at the sentence level. However, the expectation today is that a translation system will take into account the context of the entire document. To meet this expectation, the organizers of the WMT 2022 conference created the General MT Task, which involves translating texts from different domains: news articles, social media content, conversations, and e-commerce texts. The presentation will discuss the task faced during the WMT 2022 conference in the Czech-Ukrainian and Ukrainian-Czech translation directions. Problems encountered along the way, such as the correct translation of named entities, consideration of document context, and proper handling of rarely used characters like emojis, will be discussed. Additionally, methods for selecting the best translation among the candidates generated by the system, using automatic translation quality assessment models, will be presented. The primary goal of the presentation is to showcase the components of the system that contributed to achieving the best results among all shared task participants.
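The candidate-selection step mentioned above, choosing the best translation among several system outputs with an automatic quality assessment model, reduces to reranking by a score function. The sketch below uses a hypothetical length-based proxy in place of a real learned QE model:

```python
def toy_qe_score(source, candidate):
    """Hypothetical QE proxy: prefer candidates whose length matches the source.
    A real system would call a learned quality-estimation model here."""
    return -abs(len(source.split()) - len(candidate.split()))

def rerank(source, candidates, score_fn=toy_qe_score):
    """Return the highest-scoring candidate translation."""
    return max(candidates, key=lambda c: score_fn(source, c))

source = "dobry wieczór wszystkim"
candidates = [
    "good evening",
    "good evening everyone",
    "a very good evening to all of you",
]
print(rerank(source, candidates))  # 'good evening everyone'
```

Swapping `toy_qe_score` for a trained quality-estimation model gives the reranking setup the abstract describes, without changing the selection logic.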

27 February 2023

Sebastian Vincent (University of Sheffield)

MTCue: Learning Zero-Shot Control of Extra-Textual Attributes by Leveraging Unstructured Context in Neural Machine Translation  Talk delivered in Polish.

Efficient use of both intra- and extra-textual context is one of the critical gaps between human and neural machine translation. Research so far has mostly focused on individual, well-defined types of context, such as the surrounding text or discrete external variables such as the gender of the speaker. This work introduces MTCue, a novel neural machine translation framework which rewrites all context as text and learns an abstract representation of context, enabling transfer across different data settings and the exploitation of similar attributes in low-resource settings. Focusing on the domain of dialogue with access to document and metadata context, we evaluate multiple variants of MTCue, with four choices for context-source combination and several context vectorisation functions. Our experiments across six language pairs show gains in translation quality over a non-contextual baseline. Further analysis shows that the context encoder of MTCue learns a context space representation which is organised with respect to specific attributes such as formality, effectively enabling their zero-shot control. Pre-training on context embeddings also lets MTCue learn new control codes with less data than a tagging baseline.
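MTCue's central move, rewriting heterogeneous context as plain text, can be illustrated with a small serialisation step: metadata fields and preceding document sentences are flattened into one string that a context encoder would consume. The field names and separator below are illustrative, not the framework's actual format:

```python
def serialise_context(metadata, preceding_sentences):
    """Flatten metadata (a dict) and prior sentences into one context string.
    Sorting the keys keeps the serialisation deterministic."""
    parts = [f"{key}: {value}" for key, value in sorted(metadata.items())]
    parts += preceding_sentences
    return " | ".join(parts)

ctx = serialise_context(
    {"speaker_gender": "female", "formality": "informal"},
    ["Cześć!", "Co słychać?"],
)
print(ctx)  # formality: informal | speaker_gender: female | Cześć! | Co słychać?
```

Because every attribute is just text after this step, a single context encoder can embed the gender of the speaker, formality, and surrounding sentences in one shared space, which is what enables the zero-shot attribute control described above.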

Please see also the talks given in 2000–2015 and 2015–2020.