
11 October 2021

Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)

Polyadic Quantifiers in Heterofunctional Coordination

The aim of this talk is to provide a semantic analysis of Heterofunctional Coordination, a construction typical of Slavic and some neighbouring languages in which expressions bearing different grammatical functions may be conjoined. I will propose an analysis based on the concept of generalized quantifiers (Mostowski; Lindström; Barwise and Cooper), and more specifically on polyadic quantifiers (van Benthem; Keenan; Westerståhl). Some familiarity with the language of predicate logic should suffice to fully understand the talk; all linguistic concepts (including "coordination" and "grammatical functions") and logical concepts (including "generalized quantifiers" and "polyadic quantifiers") will be explained in the talk.

18 October 2021

Przemysław Kazienko, Jan Kocoń (Wrocław University of Technology)

Personalized NLP

Many natural language processing tasks, such as classifying offensive, toxic, or emotional texts, are inherently subjective in nature. This is a major challenge, especially with regard to the annotation process. Humans tend to perceive textual content in their own individual way. Most current annotation procedures aim to achieve a high level of agreement in order to generate a high quality reference source. Existing machine learning methods commonly rely on agreed output values that are the same for all annotators. However, annotation guidelines for subjective content can limit annotators' decision-making freedom. Motivated by moderate annotation agreement on offensive and emotional content datasets, we hypothesize that a personalized approach should be introduced for such subjective tasks. We propose new deep learning architectures that take into account not only the content but also the characteristics of the individual. We propose different approaches for learning the representation and processing of data about text readers. Experiments were conducted on four datasets: Wikipedia discussion texts labeled with attack, aggression, and toxicity, and opinions annotated with ten numerical emotional categories. All of our models based on human biases and their representations significantly improve prediction quality in subjective tasks evaluated from an individual's perspective. Additionally, we have developed requirements for annotation, personalization, and content processing procedures to make our solutions human-centric.

8 November 2021

Ryszard Tuora, Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

Dependency Trees in Automatic Inflection of Multi Word Expressions in Polish

Natural language generation for morphologically rich languages can benefit from automatic inflection systems. This work presents such a system, which can tackle inflection, with particular emphasis on Multi Word Expressions (MWEs). This is done using rules induced automatically from a dependency treebank. The system is evaluated on a dictionary of Polish MWEs. Additionally, a similar algorithm can be utilized for lemmatization of MWEs. In principle, the system can also be applied to other languages with similar morphological mechanisms. To prove that, we will present a simple solution for Russian.

29 November 2021

Piotr Przybyła (Institute of Computer Science, Polish Academy of Sciences)

When classification accuracy is not enough: Explaining news credibility assessment and measuring users' reaction

Automatic assessment of text credibility has recently become a very popular task in NLP, with many solutions proposed and evaluated through accuracy-based measures. However, little attention has been given to deployment scenarios in which such models would actually reduce the spread of misinformation, as intended. In the study presented here, two credibility assessment techniques were implemented in a browser extension, which was then used in a user study, making it possible to answer questions in three areas. Firstly, how can resource-intensive NLP models be compressed to work in a constrained environment? Secondly, what interpretability and visualisation techniques are most effective in human-computer cooperation? Thirdly, are users relying on such automated tools really more effective in spotting fake news?

6 December 2021

Joanna Byszuk (Institute of Polish Language, Polish Academy of Sciences)

Towards multimodal stylometry – possibilities and challenges of new approach to film and TV series analysis

This talk will present a proposal for a novel approach to the quantitative analysis of multimodal works, using the example of a corpus of the Doctor Who television series. The approach draws on stylometry and the multimodal theory of film analysis. Stylometric methods have long been popular in the analysis of literary texts. They usually involve comparison of texts based on the frequencies of use of selected features which create "stylometric fingerprints", i.e. patterns characteristic of authors, genres and other factors. They are, however, rarely applied to data other than text, with a few new approaches applying stylometry to the study of dance movements (works by Miguel Escobar Varela) or music (Backer and Kranenburg). The multimodal theory of film analysis is in turn a relatively new approach (developed primarily by John Bateman and Janina Wildfeuer), emphasizing the importance of examining information from various image, language and sound modalities for a more comprehensive interpretation. The presented approach uses the stylometric method of comparison but takes multiple types of features from various film modalities, i.e. features of image and sound as well as the content of the spoken dialogues. The talk will discuss the benefits and challenges of such an approach and of quantitative film media analysis in general.

20 December 2021

Piotr Pęzik (University of Łódź / VoiceLab), Agnieszka Mikołajczyk, Adam Wawrzyński (VoiceLab), Bartłomiej Nitoń, Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

Keyword Extraction with a Text-to-text Transfer Transformer (T5)

The talk will explore the relevance of the Text-To-Text Transfer Transformer language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the newly released Polish Open Science Metadata Corpus (POSMAC), which is currently a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project. We compare the results obtained by four different methods, i.e. plT5, extremeText, TermoPL and KeyBERT, and conclude that the T5 model yields particularly promising results for sparsely represented keywords. Furthermore, a plT5 keyword generation model trained on the POSMAC also seems to produce highly useful results in cross-domain text labelling scenarios. We discuss the performance of the model on news stories and phone-based dialog transcripts, which represent text genres and domains extrinsic to the dataset of scientific abstracts. Finally, we also attempt to characterize the challenges of evaluating a text-to-text model on both intrinsic and extrinsic keyword extraction.

31 January 2022

Tomasz Limisiewicz (Charles University in Prague)

Interpreting and Controlling Linguistic Features in Neural Networks’ Representations

Neural networks have achieved state-of-the-art results in a variety of tasks in natural language processing. Nevertheless, neural models are black boxes; we do not understand the mechanisms behind their successes. I will present the tools and methodologies used to interpret black box models. The talk will primarily focus on the representations of Transformer-based language models and on our novel method, the orthogonal probe, which offers good insight into the network's hidden states. The results show that specific linguistic signals are encoded distinctly in the Transformer, so we can effectively separate their representations. Additionally, we demonstrate that our findings generalize to multiple diverse languages. Identifying specific information encoded in the network makes it possible to remove unwanted biases from the representation. Such an intervention increases system reliability for high-stakes applications.

28 February 2022

Maciej Chrabąszcz (Sages)

Natural Language Generation

The seminar focuses on the problem of generating image descriptions. The models that will be presented were tested as part of building a solution for automatic photo annotation. They include, among others, models with attention and models that use pre-trained vision and text-generation models.

28 March 2022

Tomasz Stanisławek (Applica)

Information extraction from documents with complex layout

The rapid development of NLP in recent years, and particularly the introduction of new language models (BERT, RoBERTa, T5, GPT-3), has popularised the use of information extraction techniques to automate business processes. Unfortunately, most business documents contain not only plain text but also various types of graphical structures (for example: tables, lists, bold text, forms) that prevent correct processing with the currently existing methods, which read text as a sequence of tokens. During the presentation, I will discuss: a) problems with the existing methods used in the Information Extraction domain, b) Kleister, new datasets created for the purpose of testing new models, c) LAMBERT, a new language model that injects information about the position of tokens, d) further directions of development of the field.

11 April 2022

Daniel Ziembicki (University of Warsaw), Anna Wróblewska, Karolina Seweryn (Warsaw University of Technology)

Polish natural language inference and factivity — an expert-based dataset and benchmarks

The presentation will focus on four themes: (1) the phenomenon of factivity in contemporary Polish, (2) the prediction of entailment, contradiction, and neutrality relations in text, (3) the linguistic dataset we built, centered on the factivity-nonfactivity opposition, and (4) a discussion of the results of ML models trained on the dataset from (3) to predict the semantic relations from (2).

16 May 2022

Inez Okulska, Anna Zawadzka, Michał Szczyszek, Anna Kołos, Zofia Cieślińska (NASK)

Style effect(iveness): How and why to encode morphosyntactic features of entire documents

What if we could represent a text of any length with a single, fixed-length, and fully interpretable vector? No corpus to train on, no dictionary of pretrained embeddings, one document at a time, to be analyzed by humans or classifiers? Why not! StyloMetrix vectors are a combination of linguistic metrics that build on the richness of the spaCy library. This approach, of course, misses the semantics of individual words or phrases; thus, in theory, it does not allow for the detection of specific topics. Unless semantics is also carried by style. And in fact, previous experiments and the results of philological research show that these areas are strongly intertwined. It turns out that, for example, content inappropriate for children or young people is marked not only by an obvious set of forbidden keywords but also by a combination of characteristic morphosyntactic indicators of the text. These are so clear and distinctive that, using only the StyloMetrix representation, one can achieve a precision of 90% in a multi-class classification task. Moreover, since each vector value is a normalized indicator of a particular grammatical feature of a document, one can also learn something about the linguistic determinants of a given style. This construction of metrics is also a step toward the interpretability of algebraic feature selection methods. All the experiments presented in the talk will be based on content published on the Internet.
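
The idea of a vector in which each value is a normalized indicator of a grammatical feature can be illustrated with a minimal sketch. This is not the StyloMetrix API: the feature list, the `style_vector` function, and the pre-tagged input are all hypothetical, and the part-of-speech tags are assumed to come from some external tagger such as spaCy.

```python
from collections import Counter

# Hypothetical feature set: one vector position per part-of-speech tag.
FEATURES = ["NOUN", "VERB", "ADJ", "ADV", "PRON"]


def style_vector(tagged_tokens):
    """Return the share of tokens carrying each tag in FEATURES.

    tagged_tokens is a list of (word, tag) pairs, assumed to be
    produced by an external tagger. The result always has
    len(FEATURES) values, whatever the document length.
    """
    total = len(tagged_tokens)
    if total == 0:
        return [0.0] * len(FEATURES)
    counts = Counter(tag for _, tag in tagged_tokens)
    # Normalizing by document length keeps every value in [0, 1].
    return [counts[tag] / total for tag in FEATURES]


doc = [
    ("She", "PRON"), ("quickly", "ADV"), ("read", "VERB"),
    ("the", "DET"), ("long", "ADJ"), ("book", "NOUN"),
]
vec = style_vector(doc)
```

Because each position corresponds to a named feature, a classifier weight on a given position can be read directly as the influence of that grammatical indicator.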

23 May 2022

Karolina Stańczak (Copenhagen University)

A Latent-Variable Model for Intrinsic Probing

The success of pre-trained contextualized representations has prompted researchers to analyze them for the presence of linguistic information. Indeed, it is natural to assume that these pre-trained representations do encode some level of linguistic knowledge as they have brought about large empirical improvements on a wide variety of NLP tasks, which suggests they are learning true linguistic generalization. In this work, we focus on intrinsic probing, an analysis technique where the goal is not only to identify whether a representation encodes a linguistic attribute, but also to pinpoint where this attribute is encoded. We propose a novel latent-variable formulation for constructing intrinsic probes and derive a tractable variational approximation to the log-likelihood. Our results show that our model is versatile and yields tighter mutual information estimates than two intrinsic probes previously proposed in the literature. Finally, we find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.

6 June 2022

Cezary Klamra, Grzegorz Wojdyga (Institute of Computer Science, Polish Academy of Sciences), Sebastian Żurowski (Nicolaus Copernicus University in Toruń), Paulina Rosalska (Nicolaus Copernicus University in Toruń / Applica.ai), Matylda Kozłowska (Oracle Poland), Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

Devulgarization of Polish Texts Using Pre-trained Language Models

We will propose a text style transfer method for replacing vulgar expressions in Polish utterances with their non-vulgar equivalents while preserving the main characteristics of the text. After fine-tuning three pre-trained language models (GPT-2, GPT-3 and T5) on a newly created parallel corpus of vulgar/non-vulgar sentence pairs, we evaluate their style transfer accuracy, content preservation and language quality. To the best of our knowledge, the proposed solution is the first of its kind for Polish. The paper presenting the solution was accepted to ICCS 2022.

13 June 2022

Michał Ulewicz

Semantic Role Labeling – data and models

Semantic Role Labeling (SRL) represents the meaning of a sentence in the form of predicate-argument structures (so-called frames). This approach makes it possible to divide the sentence into meaningful parts and, for each part, to answer precisely the questions: who did what to whom, when, where, and how. SRL consists of two steps: i) predicate identification and sense disambiguation, ii) argument identification and classification. High quality training data in the form of propbanks is crucial for building accurate SRL models. Such datasets are available for English; unfortunately, most languages simply do not have corresponding propbanks due to the high effort and cost of constructing such resources. In my presentation, I will describe how SRL can help in precise text processing. I will present attempts to automatically generate datasets for various languages, including Polish, using the annotation projection technique, and the identified problems specific to projection from English into Polish. I will also describe the SRL models that I built based on the Transformer architecture.
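
A predicate-argument frame of the kind described above can be sketched as a small data structure. The class names, the example sentence, and the `answer` helper are hypothetical and serve only as an illustration; the role labels follow the PropBank convention (ARG0, ARG1, ARG2, ARGM-TMP), and the frame is written by hand rather than produced by a model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Argument:
    role: str  # PropBank-style label, e.g. "ARG0" or "ARGM-TMP"
    text: str  # the span of the sentence filling that role


@dataclass
class Frame:
    predicate: str  # step i: the identified predicate
    sense: str      # step i: its disambiguated sense
    arguments: List[Argument] = field(default_factory=list)  # step ii

    def answer(self, role: str) -> Optional[str]:
        """Return the text of the first argument with the given role."""
        for arg in self.arguments:
            if arg.role == role:
                return arg.text
        return None


# Hand-written frame for "Mary gave a book to John yesterday."
frame = Frame(
    predicate="gave",
    sense="give.01",
    arguments=[
        Argument("ARG0", "Mary"),           # who
        Argument("ARG1", "a book"),         # what
        Argument("ARG2", "to John"),        # to whom
        Argument("ARGM-TMP", "yesterday"),  # when
    ],
)
```

Looking up a role, for instance `frame.answer("ARG0")`, then corresponds to answering one of the questions "who did what to whom, when, where, and how".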

Natural Language Processing Seminar 2021–2022