7 October 2024

Janusz S. Bień (University of Warsaw, professor emeritus)

Identifying glyphs in some 16th century fonts: a case study

Some glyphs from 16th century fonts, described in the monumental work “Polonia Typographica Saeculi Sedecimi”, can be more or less easily identified with the Unicode standard characters. Some glyphs don't have Unicode codepoints, but can be printed with appropriate OpenType/TrueType fonts using typographic features. For some glyphs, the encoding remains an open question. Some examples will be discussed.

14 October 2024

Alexander Rosen (Charles University in Prague)

Lexical and syntactic variability of languages and text genres. A corpus-based study

This study examines metrics of syntactic complexity (SC) and lexical diversity (LD) as tools for analyzing linguistic variation within and across languages. Using quantifiable measures based on cross-linguistically consistent (morpho)syntactic annotation (Universal Dependencies), the research utilizes parallel texts from a large multilingual corpus (InterCorp). Six SC and two LD metrics – covering the length and embedding levels of nominal and clausal constituents, mean dependency distance (MDD), and sentence length – are applied as metadata for sentences and texts.

The presentation will address how these metrics can be visualized and incorporated into corpus queries, how they reflect structural differences across languages and text types, and whether SC and LD vary more across languages or text types. It will also consider the impact of language-specific annotation nuances and correlations among the measures. The analysis includes comparative examples from Polish, Czech, and other languages.

Preliminary findings indicate higher SC in non-fiction compared to fiction across languages, with nominal and clausal metrics being dominant factors. The results suggest distinct patterns for MDD and sentence length, highlighting the impact of structural differences (e.g., analytic vs. synthetic morphology, dominant word-order patterns) and the influence of source text type and style.
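Mean dependency distance, one of the metrics named in this abstract, can be illustrated with a minimal sketch. It assumes the common definition (the average absolute linear distance between each token and its head, with the root excluded); the exact variant used in the study, for example its treatment of punctuation, is not stated in the abstract. The function name and the input format are hypothetical.

```python
def mean_dependency_distance(heads):
    """Return the mean dependency distance of one sentence.

    heads[i] is the 1-based position of the head of token i+1,
    as in the HEAD column of a CoNLL-U file; 0 marks the root.
    Assumption: the root token is excluded from the average.
    """
    distances = [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]
    return sum(distances) / len(distances) if distances else 0.0


# "The dog chased the cat": The->dog, dog->chased, chased=root,
# the->cat, cat->chased
example_heads = [2, 3, 0, 5, 3]
mdd = mean_dependency_distance(example_heads)
```

For the example sentence the distances are 1, 1, 1 and 2, so the mean is 1.25.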

28 October 2024

Rafał Jaworski (Adam Mickiewicz University in Poznań)

Framework for aligning and storing of multilingual word embeddings for the needs of translation probability computation

The presentation will cover my research in the field of natural language processing for computer-aided translation. In particular, I will present the Inter-language Vector Space algorithm set for aligning sentences at the word and phrase level using multilingual word embeddings.

The first function of the set is used to generate vector representations of words. They are generated using an auto-encoder neural network based on text data – a text corpus. In this way vector dictionaries for individual languages are created. The vector representations of words in these dictionaries constitute vector spaces that differ between languages.

To solve this problem and obtain vector representations of words that are comparable between languages, the second function of the Inter-language Vector Space set is used. It aligns vector spaces between languages by means of transformation matrices computed with the singular value decomposition method. Each matrix is calculated based on homonyms, i.e. words written identically in the language of space X and Y. Additionally, a bilingual dictionary is used to improve the results. The transformation matrix calculated in this way allows for adjusting space X in such a way that it overlaps space Y to the maximum possible extent.
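The alignment step can be sketched with the orthogonal Procrustes solution, a standard SVD-based way of mapping one embedding space onto another from anchor pairs. Whether the Inter-language Vector Space set uses exactly this formulation is an assumption; the abstract only says that the matrices are computed with singular value decomposition. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def procrustes_alignment(X, Y):
    """Return the orthogonal matrix W minimising ||X @ W - Y||.

    X and Y hold the vectors of anchor word pairs, one pair per row
    (e.g. identically written words or bilingual dictionary entries).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Synthetic check: space X is a rotated copy of space Y.
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 4))                  # anchor vectors in space Y
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
X = Y @ Q.T                                   # same anchors in space X

W = procrustes_alignment(X, Y)
aligned = X @ W                               # X mapped onto space Y
```

In this toy setting the recovered matrix undoes the rotation, so `aligned` coincides with `Y` up to floating-point error. With real embeddings the fit is only approximate.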

The last function of the set is responsible for creating a multilingual vector space. The vector space for the English language is first added to this space in its entirety and without modification. Then, for each other vector space, the transformation matrix of this space to the English space is first calculated. The vectors of the new space are multiplied by this matrix and thus become comparable to the vectors representing English words.

The Inter-language Vector Space algorithm set is used in translation support systems, for example in the author's algorithm for automatic transfer of untranslated tags from the source sentence to the target one.

4 November 2024

Jakub Kozakoszczak (Deutsche Telekom)

ZIML: A Markup Language for Regex-Friendly Linguistic Annotation

Attempts at building regex patterns that match information annotated in the text with embedded markup lead to prohibitively unmanageable patterns. Regex and markup combine even worse when the pattern must use distances as a matching condition because tags disrupt the text format. On the other hand, fully externalized markup preserves text format but leaves regex patterns without reference points.

I introduce the Zero Insertion Markup Language (ZIML), where every combination of characters and labels in the annotated text is represented by a unique "allocharacter". Regex patterns also translate to appropriate patterns with allocharacters, preserving text span matches in standard regex engines. As the main result, ZIML extends regex semantics to include label referencing by matching allocharacters that represent them.
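The idea of representing character and label combinations by single characters can be illustrated with a toy sketch. It is not the ZIML specification: the allocation scheme (one Private Use Area codepoint per labelled character), the class `AlloTable` and its methods are all hypothetical, and pattern translation is reduced to building a character class for one label.

```python
import re


class AlloTable:
    """Toy table: one private-use codepoint per (character, labels) pair."""

    def __init__(self):
        self._codes = {}      # (char, frozenset of labels) -> allocharacter
        self._next = 0xE000   # start of the Unicode Private Use Area

    def allo(self, char, labels):
        if not labels:        # assumption: unlabelled characters stay as-is
            return char
        key = (char, frozenset(labels))
        if key not in self._codes:
            self._codes[key] = chr(self._next)
            self._next += 1
        return self._codes[key]

    def encode(self, text, spans):
        """spans: (start, end, label) triples; output has the same length."""
        labels_at = [set() for _ in text]
        for start, end, label in spans:
            for i in range(start, end):
                labels_at[i].add(label)
        return "".join(self.allo(c, ls) for c, ls in zip(text, labels_at))

    def label_class(self, label):
        """Regex fragment matching any allocated character with `label`."""
        members = [a for (_, ls), a in self._codes.items() if label in ls]
        return "[" + "".join(members) + "]" if members else "(?!)"


table = AlloTable()
text = "Anna met Bob"
spans = [(0, 4, "PER"), (9, 12, "PER")]
encoded = table.encode(text, spans)

per = table.label_class("PER")
m = re.search(f"{per}+ met {per}+", encoded)
```

Because no characters are inserted, the encoded string has the same length as the original, and the offsets of the match found in `encoded` can be applied directly to `text`.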

I will give a proof of correctness for ZIML translation and demonstrate its implementation, including a user-facing pattern language that integrates labels into regex syntax. I hope to discuss potential applications of ZIML in linguistics and natural language processing. A basic understanding of model theory and regex functionality is recommended.

21 November 2024

Christian Chiarcos (University of Augsburg)

Aspects of Knowledge Representation for Discourse Relation Annotation

Semantic technologies comprise a broad set of standards and technologies including aspects of knowledge representation, information management and computational inference. In this lecture, I will describe the application of knowledge representation standards to the realm of computational discourse, and especially, the annotation of discourse relations. In particular, this includes the formal modelling of discourse relations of different theoretical frameworks by means of modular, interlinked ontologies, the machine-readable edition of discourse marker inventories with OntoLex and techniques for the induction of discourse marker inventories.

2 December 2024
Participants of PolEval 2024
Presentation of the Shared Task results
Welcome to PolEval 2024 (Łukasz Kobyliński, Maciej Ogrodniczuk, Filip Graliński, Ryszard Staruch, Karol Saputa)
PolEval 2024 Task 1: Reading Comprehension (Ryszard Tuora / Aleksandra Zwierzchowska)
Optimizing LLMs for Polish Reading Comprehension: A Comparative Study of Ensemble and Unified Approaches (Krzysztof Wróbel)
PolEval 2024 Task 2: Emotion and Sentiment Recognition (Jan Kocoń, Bartłomiej Koptyra)
Emotion and Sentiment Recognition in Polish Texts Using Large Language Models: A Winning Approach to PolEval 2024 (Krzysztof Wróbel)
Ensemble as a Variance Reduction Method for Emotion and Sentiment Recognition (Tomasz Warzecha)
Emotion and Sentiment Recognition Using Ensemble Models (Jakub Kosterna)
Zero-shot Approach Using Bielik LLM for Emotion Recognition in Polish (Paweł Cyrta)
PolEval 2024 Task 3: Polish Automatic Speech Recognition Challenge (Michał Junczyk, Iwona Christop, Piotr Pęzik)
Augmenting Polish Automatic Speech Recognition System with Synthetic Data (Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz)
Exploration of training Zipformer and E-Branchformer models with Polish language BIGOS dataset (Paweł Cyrta)

19 December 2024

Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)

Adaptive Attacks on Misinformation Detection Using Reinforcement Learning

The presentation will cover XARELLO: a generator of adversarial examples for testing the robustness of text classifiers based on reinforcement learning. This solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. It reflects the behaviour of persistent and experienced attackers, who are common in the misinformation-spreading environment. We will cover the evaluation of the approach using several victim classifiers and credibility-assessment tasks, showing it generates better-quality examples with fewer queries, and is especially effective against modern LLMs.

17 February 2025

Alicja Martinek (NASK National Research Institute, AGH University of Kraków), Ewelina Bartuzi-Trokielewicz (NASK National Research Institute, Warsaw University of Technology)

Detecting deepfakes and false ads through analysis of text and social engineering techniques

Existing deepfake detection algorithms frequently fail to successfully identify fabricated materials. These algorithms primarily focus on technical analysis of video and audio, often neglecting the meaning of content itself. In this paper, we introduce a novel approach that emphasizes the analysis of text-based transcripts, particularly those from AI-generated deepfake advertisements, placing the text content at the center of attention. Our method combines linguistic features, evaluation of grammatical mistakes, and the identification of social engineering techniques commonly used in fraudulent content. By examining stylistic inconsistencies and manipulative language patterns, we enhance the accuracy of distinguishing between real and deepfake materials. To ensure interpretability, we employed classical machine learning models, allowing us to provide explainable insights into decision-making processes. Additionally, zero-shot evaluations were conducted using three large language model based solutions to assess their performance in detecting deepfake content. The experimental results show that these factors yield a 90% accuracy in distinguishing between deepfake-based fraudulent advertisements and real ones. This demonstrates the effectiveness of incorporating content-based analysis into deepfake detection, offering a complementary layer to existing audio-visual techniques.

24 March 2025

Maciej Rapacz, Aleksander Smywiński-Pohl (AGH University of Krakow)

Interlinear Translation of Ancient Greek Texts: How Morphological Tags Enhance Machine Translation Quality

Interlinear translation prioritizes preserving the original syntactic structure by placing target language words directly below their source text counterparts, maintaining the original word order rather than natural fluency. Although interlinear translations often deviate from the linguistic norms of the target language, they serve as a valuable tool for those wishing to deeply understand texts in their original form, especially in the case of sacred and ancient texts.

In our research, we conducted the first attempt to apply machine translation to generate interlinear translations from Ancient Greek to Polish and English. We compared the performance of specialized models (GreTa, PhilTa) pretrained on Ancient Greek texts with a general-purpose multilingual model (mT5). We examined 144 different model configurations, manipulating the base model, morphological tag encoding method, tag set, and text normalization approach, using the Greek New Testament texts as our corpus.

During the presentation, we will describe our research methodology and discuss the results. The best results were achieved by models in which we implemented new dedicated embedding layers for encoding morphological information, which yielded results up to 35-38% better (BLEU) compared to the baseline scenario. Additional detailed study showed that PhilTa performs better than mT5, particularly in scenarios with limited data availability. PhilTa achieved the highest results in translation to English (60.40 BLEU), while mT5-large performed best with Polish (59.33 BLEU).

14 April 2025

Ryszard Staruch, Filip Graliński (Adam Mickiewicz University in Poznań)

Leveraging Large Language Models for the Grammatical Error Correction Task

Large Language Models (LLMs) currently represent the state-of-the-art in many natural language processing tasks. However, their effectiveness in correcting language errors in texts written in Polish remains unclear. To address this gap, a dedicated dataset for Polish text correction has been developed. During the talk, this dataset will be presented along with the evaluation results of selected LLM-based solutions. In the second part of the seminar, new techniques for adapting LLMs to the task of minimal-edit text correction will be discussed, focusing on texts written by language learners — using English as a case study.

28 April 2025

Manfred Stede (Universität Potsdam)

Discourse structure in the Potsdam Commentary Corpus: Human annotation, human disagreement, and automatic parsing

The talk gives a brief introduction to Rhetorical Structure Theory (RST, Mann/Thompson 1988) and then explains the design decisions for the Potsdam Commentary Corpus (PCC), which brings together RST, coreference, and other annotation layers on 175 German news editorials. After illustrating cross-layer queries on the corpus in the ANNIS linguistic database, we turn to the intricacies of manual RST annotation. I will give an overview of the annotation guidelines and their motivations, and present results from an (ongoing) study on annotator disagreements, from which we derive ideas for redesigning the annotation scheme (and potentially the underlying theory), with a comparison to the recent proposal of "eRST" by Zeldes et al. (2025). In the last part of the talk, I outline our results on automatic parsing using the system by Ji and Eisenstein (2014).

19 May 2025

Anna Wileczek (Jan Kochanowski University of Kielce)

Young words and (un)new trends. Some remarks on the development directions of contemporary juvenile speech

This presentation will analyse the linguistic and cultural phenomenon of contemporary youth speak (youth slang), considering it not only as a social variety of language, but also as an expressive and expansive communicative style. Due to the specificity of its vocabulary, the preferences of its users, its communicative strategies and its interpretation of the image of reality, the linguistic-semantic resource of youth slang is studied by many sub-disciplines of linguistics. Examples will be sourced from the dictionary available on The Observatory of Youth Language and Culture website and the PWN Youth Word of the Year database.

26 May 2025

Deniz Zeyrek (Middle East Technical University)

Building monolingual and multilingual discourse banks and implications for discourse structure

In this talk, I will give an overview of the Turkish Discourse Bank (TDB) and the TED-MDB (TED Multilingual Discourse Bank), both annotated at the discourse level by native speakers. The TDB is a resource of over 3800 implicitly or explicitly conveyed discourse relations built over a multi-genre corpus of 40,000 words. The TED-MDB is a multilingual corpus of six English TED talks with translations into five languages (Turkish, Polish, European Portuguese, Russian, and German, recently extended to a sixth language, Lithuanian) with about 600 relation annotations per language. While both corpora follow the rules and principles of the Penn Discourse Treebank (PDTB), they also consider the language-specific characteristics of individual languages. I will summarize the characteristics of both corpora and the work of our research team where these corpora are exploited, discussing implications for discourse structure.

2 June 2025

Maciej Ogrodniczuk, Aleksandra Tomaszewska, Bartosz Żuk, Alina Wróblewska (Institute of Computer Science, Polish Academy of Sciences)

ICS PAS contribution to the development of Polish Large Language Model

During the seminar, we will present the PLLuM family of language models developed on behalf of the Ministry of Digital Affairs by Polish research centres with a view to the specific nature of Polish, the local cultural context and the needs of the national public administration. During the presentation, we will discuss the main assumptions and objectives of the PLLuM project, as well as the process of creating models. We will pay particular attention to the work carried out by the Institute of Computer Science of the Polish Academy of Sciences, including the acquisition and processing of text data, the construction of instruction and preference corpora, the training of models and their evaluation, including the evaluation of generative capabilities and the security mechanisms used.

9 June 2025

Adam Majczyk (Warsaw University of Technology)

The discourse of propaganda: language strategies analyzed with NLP methods

This presentation details a master’s thesis analyzing propaganda techniques within Polish discourse on migration over the past decade. It employed fine-tuned Large Language Models (LLMs) via LoRA for span identification and classification of 18 propaganda types, using an English training dataset. A core part of the methodology involved applying these models to diverse Polish texts—news articles, parliamentary speeches, and TikTok transcripts—related to migration, after translating these texts to English. The developed models demonstrated competitive performance. The research further provides an analysis of the collected Polish data, highlighting evolving trends in propaganda usage and variations across different communication platforms and time periods.

16 June 2025

Agata Savary (Université Paris-Saclay)

Diversity quantification in natural language processing

The concept of diversity has received increased consideration in Natural Language Processing (NLP) in recent years. This is due to various motivations like promoting equity and inclusion, approximating human linguistic behavior, and increasing systems’ performance. Diversity has however often been addressed in an ad hoc manner in NLP, and with few explicit links to other domains where this notion is better theorized. We survey articles in the ACL Anthology from the past 6 years, with "diversity" or "diverse" in their title. We find a wide range of settings in which diversity is quantified, often highly specialized and using inconsistent terminology. We put forward a unified taxonomy of why, what on, where, and how diversity is measured in NLP. Diversity measures are cast upon a unified framework from ecology and economy (Stirling, 2007) with 3 dimensions of diversity: variety, balance and disparity. We discuss the trends which emerge due to this systematized approach. We believe that this study paves the way towards a better formalization of diversity in NLP, which should bring a better understanding of this notion and a better comparability between various approaches.
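The three dimensions named above (variety, balance and disparity) can be illustrated with a minimal sketch. The concrete formulas are assumptions chosen for illustration, not necessarily those of the survey: variety as the number of non-empty categories, balance as Shannon evenness, and disparity as the mean pairwise distance between the categories present. The function names, the category labels and the distance values are hypothetical.

```python
import math
from itertools import combinations


def variety(counts):
    """Number of categories that actually occur."""
    return sum(1 for n in counts.values() if n > 0)


def balance(counts):
    """Shannon evenness: entropy divided by its maximum, in [0, 1]."""
    present = [n for n in counts.values() if n > 0]
    if len(present) < 2:
        return 1.0  # convention assumed here for a single category
    total = sum(present)
    entropy = -sum((n / total) * math.log(n / total) for n in present)
    return entropy / math.log(len(present))


def disparity(counts, distance):
    """Mean pairwise distance between the categories that occur."""
    present = [c for c, n in counts.items() if n > 0]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Toy data: item counts per text genre and made-up genre distances.
counts = {"news": 4, "fiction": 4, "legal": 0}
toy_distances = {
    frozenset({"news", "fiction"}): 0.8,
    frozenset({"news", "legal"}): 0.3,
    frozenset({"fiction", "legal"}): 0.9,
}


def toy_distance(a, b):
    return toy_distances[frozenset({a, b})]


v = variety(counts)
b = balance(counts)
d = disparity(counts, toy_distance)
```

In this toy case two genres occur with equal counts, so variety is 2, balance is at its maximum, and disparity is simply the distance between the two genres present.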

This is joint work with Louis Estève (Université Paris-Saclay, France), Marie-Catherine de Marneffe (Université Catholique de Louvain, Belgium), Nurit Melnik (The Open University of Israel) and Olha Kanishcheva (Jena University in Germany and the STEP University in Ukraine), within the framework of the UniDive COST Action on Universality, Diversity and Idiosyncrasy in Language Technology.

23 June 2025

Aleksandra Tomaszewska, Bartosz Żuk, Dariusz Czerski, Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

NeoN – a new tool for the detection and preliminary analysis of lexical neologisms

During the webinar, we will present NeoN, a tool developed by the Linguistic Engineering Group at the Institute of Computer Science of the Polish Academy of Sciences, that enables the detection and preliminary analysis of the most recent words in Polish. NeoN uses corpus and dictionary resources, frequency analysis, and normalization rules to filter data from Internet sources. For form categorization, contextual lemmatization, and automatic definition generation, it relies on a language model. We will also review other applications for detecting lexical innovations and outline future directions for NeoN’s development, including extending it to support additional languages.

Natural Language Processing Seminar 2024–2025