Natural Language Processing Seminar 2025–2026

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It will restart in October and will take place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube.

15 September 2025

Louis Esteve (Université Paris-Saclay)

Diversity and dataset size – a quantitative perspective

The field of Natural Language Processing (NLP) studies the abilities of computer systems to process and generate natural language, and has received increasing attention from the general population since the democratisation of generative and conversational models. However, behind the scenes, state-of-the-art NLP models are trained on ever-larger datasets, reaching trillions of tokens. It may be argued that the creation and use of such immense datasets is motivated by the idea that 'the larger the dataset, the more diverse it is', and that in turn 'if the training set is more diverse, it shall yield better models'. However, these statements thus far remain intuitions and need to be properly tested. To this end, this presentation will cover methods and caveats of formal diversity quantification, including limitations of the literature, and offer a preliminary discussion of the link between diversity and dataset size, as well as their impact on downstream applications.
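As a rough illustration of why 'larger' need not mean 'more diverse', the sketch below computes two common textbook measures, Shannon entropy and type-token ratio, on a toy corpus and on the same corpus duplicated. The choice of measures, the function names and the toy data are assumptions made for illustration; the abstract does not say which measures the talk uses.

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

# Toy corpus (illustrative only): duplicating it doubles the size
# but adds no new token types.
small = "the cat sat on the mat".split()
large = small * 2

for name, corpus in [("small", small), ("large", large)]:
    print(name, len(corpus),
          round(shannon_entropy(corpus), 3),
          round(type_token_ratio(corpus), 3))
```

In this toy case the entropy is unchanged by duplication while the type-token ratio halves, which shows that different measures can react differently to dataset size.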

6 October 2025

Stan Matwin (Dalhousie University / Institute of Computer Science, Polish Academy of Sciences)

Deep, multi-faceted learning of diagnosing mental disorders from clinical interview records

The key characteristics of mental illnesses are reflected in audio recordings of clinical interviews with patients and their families. We have developed a deep learning method that automatically extracts the relevant features necessary for the diagnosis of mental illnesses (ADHD, depression, bipolar disorder and schizophrenia) from such interviews. We use a variety of pre-trained models to extract representations from both the audio segments of these interviews and their text versions. We use several modern representation techniques (embeddings). We apply a Big Data approach by exploring existing audio and text corpora annotated with emotional labels. We address the problem of annotated data scarcity by using parameter-efficient fine-tuning (PEFT). All these representations are then combined into a single multimodal form. To diagnose the above mental disorders, we use contrastive learning and model synthesis using a committee of experts (Mixture of Experts). The results show that through multimodal analysis of clinical interviews, mental disorders can be diagnosed with satisfactory accuracy (project conducted in collaboration with H. Naderi and R. Uher).

20 October 2025

Arkadiusz Modzelewski (University of Padua / Polish-Japanese Academy of Information Technology)

The Why and How of Disinformation: Datasets, Methods and Language Models Evaluation

What language tools do disinformation agents employ? Can incorporating persuasion and intent knowledge enhance the ability of large language models to detect disinformation? And how effective are LLMs at identifying disinformation in Polish and English? In this talk, I will present findings from my PhD research on disinformation, persuasion, and the intent behind misleading information. I will introduce one of the largest Polish disinformation datasets, alongside a novel English dataset, both designed to capture manipulative techniques and intent of disinformation agents. Drawing on these and other resources, I will discuss how well current LLMs perform in detecting disinformation, persuasion, and intent, and highlight promising directions for improving their effectiveness in disinformation detection.

3 November 2025

Gražina Korvel (Vilnius University)

Developing Speech Corpora for Low-Resource Languages

Developing diverse, well-annotated speech corpora is essential for training modern machine learning models. This presentation discusses the principles and methodologies involved in creating large-scale speech corpora, with a focus on the Lithuanian language as a case study. It presents the Great Lithuanian Speech Corpus (LIEPA-3) project, outlining strategies for collecting, annotating, and ensuring the quality of data, as well as ensuring balanced representation across dialects, genders, and age groups. The talk also addresses challenges related to ethical data collection and corpus standardization.

24 November 2025

Jan Eliasz, Mikołaj Langner, Jan Kocoń (Wrocław University of Science and Technology)

Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs

We investigate two complementary strategies for improving the reliability of Large Language Models in classification settings. First, we show that decomposing multi-label classification into a set of independent binary decisions offers clear practical advantages over structured output formulations: it substantially reduces parsing errors, works seamlessly with decoder-only architectures, and delivers faster inference when combined with prefix caching, without requiring any model retraining.

Divide, Cache, Conquer. Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification

Second, we demonstrate that reasoning-enabled LLMs are markedly better at tasks requiring contextual sensitivity, such as offensive-language annotation. When prompted to adopt a specific role, reasoning models maintain that role more consistently and make more accurate, fine-grained judgments than their non-reasoning counterparts. Viewed together, these findings highlight a unifying principle: LLMs become both more efficient and more context-aware when their decision process is made more structured, whether through task decomposition or through explicit reasoning.
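The first strategy described above, splitting a multi-label task into independent yes/no questions, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the label set, the prompt wording, and the `ask_model` callable are hypothetical, and `stub_model` is a keyword-matching stand-in for a real LLM call.

```python
from typing import Callable, Dict, List

LABELS = ["offensive", "political", "sarcastic"]  # hypothetical label set

def build_prompts(text: str, labels: List[str]) -> Dict[str, str]:
    """Build one yes/no prompt per label; all prompts share the same prefix."""
    prefix = f"Text: {text}\nAnswer with 'yes' or 'no'.\n"
    return {label: prefix + f"Is the text {label}?" for label in labels}

def classify(text: str, labels: List[str],
             ask_model: Callable[[str], str]) -> Dict[str, bool]:
    """Ask one binary question per label and parse each short answer."""
    result = {}
    for label, prompt in build_prompts(text, labels).items():
        answer = ask_model(prompt).strip().lower()
        # Only 'yes' or 'no' has to be parsed, not a structured list of labels.
        result[label] = answer.startswith("yes")
    return result

def stub_model(prompt: str) -> str:
    """Stand-in for an LLM: says 'yes' if the label word occurs in the text."""
    text_line, _, question = prompt.partition("\nAnswer")
    label = question.rsplit(" ", 1)[-1].rstrip("?")
    return "yes" if label in text_line.lower() else "no"

print(classify("This political joke is sarcastic", LABELS, stub_model))
```

Because every prompt starts with the same prefix, a serving stack that supports prefix caching could reuse the computation for that shared part across the per-label calls; the sketch itself only constructs the prompts and does not implement caching.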

1 December 2025

Filip Kucia, Anna Wróblewska, Bartosz Grabek, Szymon Trochimiak (Warsaw University of Technology)

How to Make Museums More Interactive? Case Study of the “Artistic Chatbot”

This presentation examines the challenges of deploying large language model (LLM)-powered chatbots in public cultural spaces, based on our experience with Artistic Chatbot – a voice-based conversational agent used during a month-long art exhibition at the Warsaw Academy of Fine Arts. We focus on two intertwined issues: how to make a system answer questions about a multilingual artistic collection, and how to evaluate the quality of those answers. On the technical side, we discuss strategies for building a retrieval-augmented knowledge base from heterogeneous, multilingual exhibition materials and the trade-offs between native-language models and pivot-language approaches based on translation. From the perspective of interaction design, we outline a fully voice-based setup in a gallery space, in which visitors walk up to a ceiling-mounted microphone and address the system through spoken trigger expressions, without screens or keyboards. The core of the talk is a post-hoc evaluation. We analyse interaction logs and conduct a human annotation study to compare different modelling and retrieval configurations along dimensions such as factual precision, coherence and relevance to the exhibition domain. Using this case study, we ask how to define and measure a “good” answer in conversational AI for cultural heritage, and how choices about language, translation and voice interaction should influence future deployments in museums and galleries.

19 January 2026

Matteo Gioele Collu (University of Padova)

Do you trust your LLM? An introduction to Indirect Prompt Injection

In this talk, I will introduce the vulnerabilities that enable indirect prompt injection attacks, where malicious instructions are hidden in external content and unknowingly executed by large language models. To illustrate the risks, I will present two case studies: the LLMail-Inject competition, which demonstrated creative adversarial attacks, and an injection scenario targeting the peer review process.

23 February 2026

Grzegorz Chodak (Wrocław University of Science and Technology), Dariusz Tworzydło (University of Warsaw)

Can image crises in organisations be predicted using artificial intelligence? Results from the Crisis Detector research project

Image crises can lead to the bankruptcy of companies or the end of politicians' careers. The presentation will show the results of research on the possibility of detecting image crises in media content using large language models (LLMs). We will discuss results showing that LLMs are highly effective at recognizing crisis signals and classifying crises. Practical possibilities for building early warning systems for image crises will also be presented.

13 April 2026

Iwona Christop, Marek Kubis (Adam Mickiewicz University in Poznań)

ART: Benchmark for Evaluating Audio Reasoning Capabilities in Multimodal Large Language Models

Large language models integrate more and more information from different modalities, including audio signals. However, existing benchmarks for evaluating audio processing capabilities mainly focus on single tasks, such as transcription or classification. Consequently, they provide little insight into models' ability to combine different types of information for reasoning. During the presentation, we will introduce the Audio Reasoning Tasks (ART), a benchmark designed to evaluate the reasoning abilities of multimodal language models based on audio signals. The ART dataset contains tasks that require the integration of information from various aspects of a recording. We will discuss how the benchmark was designed and share the results of experiments showing that current models have limited capabilities when it comes to audio-based reasoning.

27 April 2026

Łucja Biel, Katarzyna Wasilewska, Dariusz Koźbiał (Institute of Applied Linguistics, University of Warsaw)

An MDA analysis of the Polish Eurolect and the national variety: Dimensions of variation across institutional legal and administrative registers

This study applies full Multidimensional Analysis (MDA) to examine linguistic variation in the Polish Eurolect – a hybrid variety shaped by translation and institutional constraints within the European Union – by comparing it to the national variety. Using a corpus of key institutional registers (legal acts, judgments, administrative reports, and citizen-oriented websites), we identify four dimensions of variation: Argumentative vs Informational, Engaged Instruction vs Distanced Authority, Prescriptive vs Narrative, and Lexical Richness. The findings reveal notable differences between how supranational and national institutions communicate. EU legal acts and judgments show greater prescriptiveness, legal referencing, and argumentative structuring compared to their Polish counterparts. EU websites use fewer engagement and explanatory strategies, while EU reports favour a less distanced style. The findings map variation and group institutional registers, thereby visualizing similarities and differences between supranational and national institutional communication.

18 May 2026

Maciej Ogrodniczuk, Anna Latusek, Alina Wróblewska, Bartosz Żuk (Institute of Computer Science, Polish Academy of Sciences)

Universal Discourse: Towards a Multilingual Model of Discourse Relations

During the presentation, we will outline the objectives of the Universal Discourse project, which aims to create a universal, multilingual model for describing discourse relations. The theoretical basis for the work is the ISO 24617-8 standard, which serves as a starting point for the harmonisation of existing corpus resources. In the first part, we will discuss the issue of text segmentation into discourse units. We will present a comparative analysis of various formalisms (such as RST or PDTB) and our own rule-based heuristic, which demonstrates high consistency in determining the boundaries of units at the clause level. We will then present the proposed multi-layered model of relations. We will focus on our own decision tree, which allows for the systematic classification of semantic links (including causal, conditional and temporal links). Finally, we will present the current status of work on the development of discourse parsers and the results of our first experiments.

25 May 2026

Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)

Exploring morphology-aware tokenization: A case study on Spanish language modeling

In the presentation we will explore to what extent the integration of morphological information can improve subword tokenization and thus also language modeling performance. We will focus on Spanish, a language with fusional morphology, where subword segmentation can benefit from linguistic structure. Instead of relying on purely data-driven strategies like Byte Pair Encoding (BPE), we will demonstrate a linguistically grounded approach: training a tokenizer on morphologically segmented data. This is possible thanks to developing a semi-supervised segmentation model for Spanish and building gold-standard datasets to guide and evaluate it. This tokenizer can be used to pre-train a masked language model and assess its performance on several downstream tasks. Our results show improvements over a baseline with a standard tokenizer, supporting our hypothesis that morphology-aware tokenization offers a viable and principled alternative for improving language modeling.

8 June 2026

Maciej Rapacz, Aleksander Smywiński-Pohl (AGH University)

How Much of the Translator Is in a Translation? The Targum Corpus and Measuring the Translator's Intervention

This talk introduces the Targum Corpus and explores whether a translator's personal imprint on their work can be measured. We begin with the corpus itself: 651 New Testament translations in five languages – Polish, English, French, Italian, and Spanish. Rather than maximizing the number of languages, the corpus deliberately prioritizes historical depth, tracing New Testament translation within each language from the sixteenth century to the present day. We walk through how the corpus is built – its texts and metadata – and what kinds of quantitative questions about translation it now makes it possible to ask. The second half turns to one of those questions: a way of measuring how much a translator shapes a text, using an interlinear translation as a neutral baseline. We define the translator's intervention as the difference between the vector representations of a literary translation and its interlinear counterpart, and present early results suggesting that the method sorts translations along a spectrum – from the highly literal, through dynamic renderings, to outright paraphrase.
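The idea of measuring intervention as a difference between vector representations can be illustrated with a very small sketch. Everything specific here is an assumption: the abstract does not say which embedding model or distance is used, so the code substitutes toy bag-of-words count vectors and the Euclidean norm of their difference, and the example sentences are invented for illustration rather than taken from the Targum Corpus.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy 'embedding': word counts over a fixed vocabulary (an assumption)."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def intervention(translation, interlinear):
    """Euclidean norm of the difference between the two text vectors."""
    vocab = sorted(set(translation.lower().split())
                   | set(interlinear.lower().split()))
    u = embed(translation, vocab)
    v = embed(interlinear, vocab)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Invented example strings, used only to exercise the functions.
interlinear = "in beginning was the word"
literal = "in the beginning was the word"
paraphrase = "before anything else existed there was the word"

print(intervention(literal, interlinear))
print(intervention(paraphrase, interlinear))
```

With these toy inputs the literal rendering lies closer to the interlinear baseline than the paraphrase does, which mirrors the kind of literal-to-paraphrase ordering the abstract describes; a real study would replace the count vectors with learned sentence or document embeddings.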

Please see also the talks given in 2000–2015 and 2015–2025.