Natural Language Processing Seminar 2020–2021

5 October 2020

Piotr Rybak, Robert Mroczkowski, Janusz Tracz (ML Research at Allegro.pl), Ireneusz Gawlik (ML Research at Allegro.pl & AGH University of Science and Technology)

https://www.youtube.com/watch?v=LkR-i2Z1RwM Review of BERT-based Models for Polish Language  Talk delivered in Polish.

In recent years, a series of BERT-based models has improved the performance of many natural language processing systems. During this talk, we will briefly introduce the BERT model as well as some of its variants. Next, we will focus on the available BERT-based models for the Polish language and their results on the KLEJ benchmark. Finally, we will dive into the details of the new model developed in cooperation between ICS PAS and Allegro.

2 November 2020

Inez Okulska (NASK National Research Institute)

https://www.youtube.com/watch?v=B7Y9fK2CDWw Concise, robust, sparse? Algebraic transformations of word2vec embeddings versus precision of classification  Talk delivered in Polish.

The introduction of vector representations of words, which contain the weights of context and central words computed from giant corpora of a given language rather than encoding manually selected linguistic features, proved to be a breakthrough for NLP research. After the initial delight came revision and a search for improvements, primarily to broaden the context, handle homonyms, etc. Nevertheless, classic embeddings still apply to many tasks, such as content classification, and in many cases their performance is still good enough. What do they encode? Do they contain redundant elements? If transformed or reduced, will they retain information in a way that still preserves the original "meaning"? What is the meaning here? How far can these vectors be deformed and how does it relate to encryption methods? In my talk I will present a reflection on this subject, illustrated by the results of various "tortures" of the embeddings (word2vec and GloVe) and their precision in the task of classifying texts whose content must remain masked for human users.

16 November 2020

Agnieszka Chmiel (Adam Mickiewicz University, Poznań), Danijel Korzinek (Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=MxbgQL316DQ PINC (Polish Interpreting Corpus): how a corpus can help study the process of simultaneous interpreting  Talk delivered in Polish.

PINC is the first Polish simultaneous interpreting corpus based on Polish-English and English-Polish interpretations from the European Parliament. Using naturalistic data makes it possible to answer many questions about the process of simultaneous interpreting. By analysing the ear-voice span, or the delay between the source text and the target text, mechanisms of activation and inhibition can be investigated in the interpreter’s lexical processing. Fluency and pause data help us examine the cognitive load. This talk will focus on how we process data in the corpus (such as interpreter voice identification) and what challenges we face in relation to linguistic analysis, dependency parsing and bilingual alignment. We will show how specific data can be applied to help us understand what interpreting involves or even what happens in the interpreter’s mind.
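As an illustration of the ear-voice span described above, here is a minimal sketch in Python. It assumes the corpus alignment yields pairs of onset times (in seconds) for a source word and its interpreted counterpart; the function name `ear_voice_spans` and the sample timestamps are hypothetical and do not come from PINC.

```python
from statistics import mean


def ear_voice_spans(aligned_onsets):
    """Return the delay (in seconds) for each aligned source/target pair.

    aligned_onsets: iterable of (source_onset_s, target_onset_s) tuples,
    where each tuple links a source word to its interpreted counterpart.
    This input format is an assumption made for the sketch.
    """
    return [target - source for source, target in aligned_onsets]


# Hypothetical onset times, invented purely for illustration.
sample_pairs = [(0.0, 2.5), (1.0, 3.0), (4.0, 7.5)]
spans = ear_voice_spans(sample_pairs)
average_span = mean(spans)
```

A larger average span means the interpreter lags further behind the speaker; relating that lag to cognitive load is the kind of analysis the talk addresses, and this sketch does not attempt it.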

30 November 2020

Findings of ACL: EMNLP 2020: Polish session

Łukasz Borchmann et al. (Applica.ai)

https://www.youtube.com/watch?v=THe1URk40Nk Contract Discovery: Dataset and a Few-Shot Semantic Retrieval Challenge with Competitive Baselines  Talk delivered in Polish. Slides in English.

Contract Discovery deals with tasks such as ensuring the inclusion of relevant legal clauses or retrieving them for further analysis (e.g., risk assessment). Because there was no publicly available benchmark for span identification from legal texts, we proposed one along with hard-to-beat baselines. Systems are expected to process unstructured text, as in most real-world usage scenarios; that is, no segmentation of legal documents into a hierarchy of distinct (sub)sections is given in advance. What is more, it is assumed that a searched passage can be any part of the document and not necessarily a complete paragraph, subparagraph, or clause. Instead, the process should be considered a few-shot span identification task. In this particular setting, pretrained universal encoders fail to provide satisfactory results. In contrast, solutions based on language models perform well, especially when unsupervised fine-tuning is applied.

Piotr Szymański (Wrocław Technical University), Piotr Żelasko (Johns Hopkins University)

https://www.youtube.com/watch?v=TXSDhCtTRpw WER we are and WER we think we are  Talk delivered in Polish. Slides in English.

Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and the HUB'05 public benchmark. We show that WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high-quality annotations for training and testing of robust ASR systems.
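For readers unfamiliar with the metric, WER is commonly defined as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The sketch below follows that standard definition; it assumes whitespace tokenization and no text normalization, and it is not the evaluation code used by the authors. The function name `word_error_rate` is chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference
    # prefix processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row[j] = min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            )
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. In practice, normalization choices such as casing, punctuation and the treatment of disfluencies can noticeably change the reported figure.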

17 December 2020

Piotr Przybyła (Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=newobY5cBJo Multi-Word Lexical Simplification  Talk delivered in Polish.

The presentation will cover the task of multi-word lexical simplification, in which a sentence in natural language is made easier to understand by replacing a fragment of it with a simpler alternative, both of which can consist of many words. In order to explore this new direction, a corpus (MWLS1) including 1462 sentences in English from various sources with 7059 simplifications was prepared through crowdsourcing. Additionally, an automatic solution (Plainifier) for the problem, based on a purpose-trained neural language model, will be discussed along with its evaluation, including a comparison with human and resource-based baselines. The results of the presented study were also published at the COLING 2020 conference in an article of the same title.

18 January 2021

Norbert Ryciak, Maciej Chrabąszcz, Maciej Bartoszuk (Sages)

https://www.youtube.com/watch?v=L8RRx9KVhJs Classification of patent applications  Talk delivered in Polish. Slides in English.

During our presentation we will discuss a solution to the patent application classification task, which was one of the GovTech competition problems. We will describe the characteristics of the problem and the proposed solution, especially the original method of representing documents as “clouds of word embeddings”.

1 February 2021

Adam Jatowt (University of Innsbruck)

https://www.youtube.com/watch?v=e7NblngMe6A Question Answering & Finding Temporal Analogs in News Archives  Talk delivered mostly in English (introduction in Polish).

News archives offer immense value to our society, helping users to learn details of events that occurred in the past. Currently, access to such collections is difficult for average users because of their large size and the need for expertise in history. We propose a large-scale open-domain question answering model designed for long-term news article collections, with a dedicated module for re-ranking articles by using temporal information. In the second part of the talk we will discuss methods for finding and explaining temporal analogs – entities in the past which are analogical to the entities in the present (e.g., walkman as a temporal analog of iPad).

15 February 2021

Aleksandra Nabożny (Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=Rd0nHiVuSZk Methods of optimizing the work of experts during the annotation of non-credible medical texts  Talk delivered in Polish.

Automatic credibility assessment of medical content is an extremely difficult task. This is because expert assessment is burdened with a large interpretive bias, which depends on the individual clinical experience of a given doctor. Moreover, a simple factual assessment turns out to be insufficient to determine the credibility of this type of content. During the seminar, I will present the results of my and my team's efforts to optimize the annotation process. We proposed a sentence ordering method where non-credible sentences are more likely to be placed at the beginning of the queue for evaluation. I will also present our proposals for extending the annotator protocol to increase the consistency of assessments. Finally, I invite you to a discussion on potential research directions to detect harmful narratives in the so-called medical fake news.

9 March 2021

Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences), Izabela Chojnicka (Faculty of Psychology, University of Warsaw), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)

https://www.youtube.com/watch?v=ja04r8WW4Nk Machine learning in detecting schizophrenia and autism from textual utterances  Talk delivered in Polish.

Detection of mental disorders from textual input is an emerging field for applied machine and deep learning methods. In our talk, we will explore the limits of automated detection of autism spectrum disorder and schizophrenia. We will analyse both disorders and describe two diagnostic tools: TLC and ADOS-2, along with the characteristics of the collected data. We will compare the performance of: (1) TLC and ADOS-2, (2) machine learning and deep learning methods applied to the data gathered by these tools, and (3) psychiatrists. We will discuss the effectiveness of several baseline approaches such as bag-of-words and dictionary-based methods, including sentiment and language abstraction. We will then introduce the newest approaches using deep learning for text representation and inference. Owing to the related nature of both disorders, we will describe experiments with transfer and zero-shot learning techniques. Finally, we will explore few-shot methods dedicated to low data size scenarios, which is a typical problem for the clinical setting. Psychiatry is one of the few medical fields in which the diagnosis of most disorders is based on the subjective assessment of a psychiatrist. Therefore, the introduction of objective tools supporting diagnostics seems to be pivotal. This work is a step in this direction.

15 March 2021

Filip Graliński, Agnieszka Kaliska (Applica.ai / Adam Mickiewicz University), Tomasz Stanisławek, Anna Wróblewska (Applica.ai / Warsaw University of Technology), Dawid Lipiński, Bartosz Topolski (Applica.ai), Paulina Rosalska (Applica.ai / Nicolaus Copernicus University), Przemysław Biecek (Warsaw University of Technology / Samsung R&D Institute Poland)

https://www.youtube.com/watch?v=uDBaqxmzppk Key Information Extraction from documents: Kleister NDA/Charity challenges  Talk delivered in Polish. Slides in English.

This presentation will showcase two new datasets (Kleister NDA and Kleister Charity) for Key Information Extraction. They involve a mix of born-digital and scanned long formal documents in English. In these datasets, an NLP system is expected to find or infer various types of entities by utilizing both textual and structural layout features.

12 April 2021

Marek Kubis (Adam Mickiewicz University)

https://www.youtube.com/watch?v=37d0br2axyQ Quantitative analysis of character networks in Polish 19th- and 20th-century novels  Talk delivered in Polish.

I will present a study on the induction and quantitative analysis of character networks inferred from Polish novels. The corpus compiled for this study includes both 19th- and 20th-century literary works obtained from publicly available sources. I will discuss the development of the corpus and the network extraction procedure. The structural properties observed for the networks induced from Polish novels will be compared with the results observed for English novels. Furthermore, I will compare the networks induced from 19th-century novels to the 20th-century networks.

7 June 2021

Maciej Ogrodniczuk, Michał Rudolf (Institute of Computer Science, Polish Academy of Sciences)

ParlaMint: Towards Comparable Parliamentary Corpora  The first part of the slides in Polish.

Marta Kołczyńska (Institute of Political Studies, Polish Academy of Sciences)

Parliamentary debates in COVID times  The second part of the slides in English.

In the first part of the talk we will present ParlaMint, a CLARIN-ERIC-funded project which aims to create a multilingual comparable corpus of parliamentary data based on national corpora of sitting transcripts. The second part of the talk will focus on the work of a research group that used the ParlaMint corpus data in the parliamentary debate analysis task during the Helsinki Digital Humanities Hackathon #DHH21.