
Diff for "seminar"

Differences between revisions 381 and 767 (spanning 386 versions)
Revision 381 as of 2021-01-19 10:24:04
Size: 14931
Comment:
Revision 767 as of 2025-11-24 11:58:39
Size: 13032
Comment:
Deletions are marked like this. Additions are marked like this.
Line 3: Line 3:
= Natural Language Processing Seminar 2020–2021 = = Natural Language Processing Seminar 2025–2026 =
Line 5: Line 5:
||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, currently online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It will restart in October and will take place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]||
Line 7: Line 7:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''5 October 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Rybak''', '''Robert Mroczkowski''', '''Janusz Tracz''' (ML Research at Allegro.pl), '''Ireneusz Gawlik''' (ML Research at Allegro.pl & AGH University of Science and Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=LkR-i2Z1RwM|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-10-05.pdf|Review of BERT-based Models for Polish Language]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">In recent years, a series of BERT-based models improved the performance of many natural language processing systems. During this talk, we will briefly introduce the BERT model as well as some of its variants. Next, we will focus on the available BERT-based models for Polish language and their results on the KLEJ benchmark. Finally, we will dive into the details of the new model developed in cooperation between ICS PAS and Allegro.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''15 September 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Louis Esteve''' (Université Paris-Saclay) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2025-09-15.pdf|Diversity and dataset size – a quantitative perspective]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The field of Natural Language Processing (NLP) studies the abilities of computer systems to process and generate natural language, and has received increasing attention from the general population since the democratisation of generative and conversational models. However, behind the scenes, state-of-the-art NLP models are trained on ever-larger datasets, reaching trillions of tokens. It may be argued that the creation and use of such immense datasets is motivated by the idea that 'the larger the dataset, the more diverse it is', and that in turn 'if the training set is more diverse, it shall yield better models'. However, these statements thus far remain intuitions and need to be properly tested. To this end, this presentation will tackle methods and caveats of formal diversity quantification including limitations of the literature, a preliminary discussion on the link between diversity and dataset size, as well as their impact on downstream applications.||
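Two of the simplest formal diversity measures for a token stream of the kind the talk surveys are the type-token ratio and unigram entropy. The sketch below is purely illustrative and is not taken from the speaker's work:

```python
import math
from collections import Counter


def type_token_ratio(tokens: list[str]) -> float:
    """Fraction of distinct tokens: a crude lexical diversity measure."""
    return len(set(tokens)) / len(tokens)


def token_entropy(tokens: list[str]) -> float:
    """Shannon entropy (bits) of the unigram distribution over the stream."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Note that both measures are sensitive to sample size (entropy grows with vocabulary coverage), which is one reason formal diversity quantification across datasets of very different sizes is nontrivial.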
Line 12: Line 12:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 November 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Inez Okulska''' (NASK National Research Institute)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=B7Y9fK2CDWw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-11-02.pdf|Concise, robust, sparse? Algebraic transformations of word2vec embeddings versus precision of classification]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The introduction of vector representations of words – containing the weights of context and central words, computed by mapping giant corpora of a given language rather than encoding manually selected linguistic features – proved to be a breakthrough for NLP research. After the initial enthusiasm came revision and a search for improvements, primarily to broaden the context, handle homonyms, and so on. Nevertheless, classic embeddings still apply to many tasks – for example, content classification – and in many cases their performance is still good enough. What do they encode? Do they contain redundant elements? If transformed or reduced, will they retain the information in a way that still preserves the original "meaning"? What is meaning here? How far can these vectors be deformed, and how does this relate to encryption methods? In my talk I will present reflections on this subject, illustrated by the results of various "tortures" of the embeddings (word2vec and GloVe) and their precision in the task of classifying texts whose content must remain masked for human users.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''6 October 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Stan Matwin''' (Dalhousie University / Institute of Computer Science, Polish Academy of Sciences) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=hwBs4D7clls|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-10-06.pdf|Deep, multi-faceted learning of diagnosing mental disorders from clinical interview records]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides partially in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The key characteristics of mental illnesses are reflected in audio recordings of clinical interviews with patients and their families. We have developed a deep learning method that automatically extracts the relevant features necessary for the diagnosis of mental illnesses (ADHD, depression, bipolar disorder and schizophrenia) from such interviews. We use a variety of pre-trained models to extract representations from both the audio segments of these interviews and their text versions. We use several modern representation techniques (embeddings). We apply a Big Data approach by exploring existing audio and text corpora annotated with emotional labels. We address the problem of annotated data scarcity by using parameter-efficient fine-tuning (PEFT). All these representations are then combined into a single multimodal form. To diagnose the above mental disorders, we use contrastive learning and model synthesis using a committee of experts (Mixture of Experts). The results show that through multimodal analysis of clinical interviews, mental disorders can be diagnosed with satisfactory accuracy (project conducted in collaboration with H. Naderi and R. Uher).||
Line 17: Line 17:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''16 November 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Chmiel''' (Adam Mickiewicz University, Poznań), '''Danijel Korzinek''' (Polish-Japanese Academy of Information Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=MxbgQL316DQ|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-11-16.pdf|PINC (Polish Interpreting Corpus): how a corpus can help study the process of simultaneous interpreting]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">PINC is the first Polish simultaneous interpreting corpus based on Polish-English and English-Polish interpretations from the European Parliament. Using naturalistic data makes it possible to answer many questions about the process of simultaneous interpreting. By analysing the ear-voice span, or the delay between the source text and the target text, mechanisms of activation and inhibition can be investigated in the interpreter’s lexical processing. Fluency and pause data help us examine the cognitive load. This talk will focus on how we process data in the corpus (such as interpreter voice identification) and what challenges we face in relation to linguistic analysis, dependency parsing and bilingual alignment. We will show how specific data can be applied to help us understand what interpreting involves or even what happens in the interpreter’s mind.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''20 October 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Arkadiusz Modzelewski''' (University of Padua / Polish-Japanese Academy of Information Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=KNxm8Vt_wfw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-10-20.pdf|The Why and How of Disinformation: Datasets, Methods and Language Models Evaluation]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">What language tools do disinformation agents employ? Can incorporating persuasion and intent knowledge enhance the ability of large language models to detect disinformation? And how effective are LLMs at identifying disinformation in Polish and English? In this talk, I will present findings from my PhD research on disinformation, persuasion, and the intent behind misleading information. I will introduce one of the largest Polish disinformation datasets, alongside a novel English dataset, both designed to capture manipulative techniques and intent of disinformation agents. Drawing on these and other resources, I will discuss how well current LLMs perform in detecting disinformation, persuasion, and intent, and highlight promising directions for improving their effectiveness in disinformation detection.||
Line 22: Line 22:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''30 November 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Findings of ACL: EMNLP 2020''': Polish session||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Łukasz Borchmann''' et al. (Applica.ai)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=THe1URk40Nk|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-11-30a.pdf|Contract Discovery: Dataset and a Few-Shot Semantic Retrieval Challenge with Competitive Baselines]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:10px">Contract Discovery deals with tasks such as ensuring the inclusion of relevant legal clauses or retrieving them for further analysis (e.g., risk assessment). Because there was no publicly available benchmark for span identification from legal texts, we proposed one along with hard-to-beat baselines. Systems are expected to process unstructured text, as in most real-world usage scenarios; that is, no segmentation of legal documents into a hierarchy of distinct (sub)sections is given in advance. Moreover, a searched passage can be any part of the document, not necessarily a complete paragraph, subparagraph, or clause. Instead, the process should be considered a few-shot span identification task. In this particular setting, pretrained universal encoders fail to provide satisfactory results. In contrast, solutions based on language models perform well, especially when unsupervised fine-tuning is applied.||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Szymański''' (Wrocław Technical University), '''Piotr Żelasko''' (Johns Hopkins University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=TXSDhCtTRpw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-11-30b.pdf|WER we are and WER we think we are]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and HUB'05 public benchmark. We show that WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high quality annotations for training and testing of robust ASR systems.||
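The Word Error Rate at the centre of this talk is the word-level edit distance between a reference transcript and the ASR hypothesis, normalised by reference length. A minimal sketch of the computation (an illustration, not the authors' evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Because WER counts insertions and deletions as well as substitutions, it can exceed 1.0, and it is highly sensitive to text normalisation (casing, punctuation, disfluencies), which is one source of the benchmark discrepancies the talk discusses.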
||<style="border:0;padding-top:5px;padding-bottom:5px">'''3 November 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Gražina Korvel''' (Vilnius University) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2025-11-03.pdf|Developing Speech Corpora for Low-Resource Languages]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Developing diverse, well-annotated speech corpora is essential for training modern machine learning models. This presentation discusses the principles and methodologies involved in creating large-scale speech corpora, with a focus on the Lithuanian language as a case study. It presents the Great Lithuanian Speech Corpus (LIEPA-3) project, outlining strategies for collecting, annotating, and ensuring the quality of data, as well as ensuring balanced representation across dialects, genders, and age groups. The talk also addresses challenges related to ethical data collection and corpus standardization.||
Line 31: Line 27:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''17 December 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Przybyła''' (Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=newobY5cBJo|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2020-12-17.pdf|Multi-Word Lexical Simplification]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will cover the task of multi-word lexical simplification, in which a sentence in natural language is made easier to understand by replacing its fragment with a simpler alternative, both of which can consist of many words. In order to explore this new direction, a corpus (MWLS1) including 1462 sentences in English from various sources with 7059 simplifications was prepared through crowdsourcing. Additionally, an automatic solution (Plainifier) for the problem, based on a purpose-trained neural language model, will be discussed along with the evaluation, comparing to human and resource-based baselines. The results of the presented study were also published at the COLING 2020 conference in [[https://www.aclweb.org/anthology/2020.coling-main.123.pdf|an article of the same title]].||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''24 November 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jan Eliasz''', '''Mikołaj Langner''', '''Jan Kocoń''' (Wrocław University of Science and Technology) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2025-11-24-2.pdf|Divide, Cache, Conquer. Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:10px">We investigate two complementary strategies for improving the reliability of Large Language Models in classification settings. First, we show that decomposing multi-label classification into a set of independent binary decisions offers clear practical advantages over structured output formulations: it substantially reduces parsing errors, works seamlessly with decoder-only architectures, and delivers faster inference when combined with prefix caching, without requiring any model retraining.||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2025-11-24-1.pdf|Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Second, we demonstrate that reasoning-enabled LLMs are markedly better at tasks requiring contextual sensitivity, such as offensive-language annotation. When prompted to adopt a specific role, reasoning models maintain that role more consistently and make more accurate, fine-grained judgments than their non-reasoning counterparts. Viewed together, these findings highlight a unifying principle: LLMs become both more efficient and more context-aware when their decision process is made more structured, whether through task decomposition or through explicit reasoning.||
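The dichotomic decomposition described in this session's abstract, independent yes/no queries sharing a cacheable prompt prefix, can be sketched as follows. The label set and the `ask_llm` callable are illustrative assumptions, not the authors' code:

```python
from typing import Callable

# Example label set for illustration only; not taken from the talk.
LABELS = ["offensive", "political", "ironic"]


def classify_multilabel(text: str, ask_llm: Callable[[str], str]) -> dict[str, bool]:
    """Decompose multi-label classification into independent binary decisions.

    Every query shares the same prompt prefix, so inference engines with
    prefix caching recompute only the short per-label suffix, and each
    yes/no answer is trivially parseable (no structured-output parsing).
    """
    prefix = f"Text: {text}\n"  # identical prefix across all label queries
    results = {}
    for label in LABELS:
        answer = ask_llm(prefix + f"Is this text {label}? Answer yes or no.")
        results[label] = answer.strip().lower().startswith("yes")
    return results
```

In this formulation a parsing failure can affect at most one label, whereas a single malformed structured output (e.g. broken JSON) invalidates all labels at once.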
Line 36: Line 34:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''18 January 2021'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Norbert Ryciak''', '''Maciej Chrabąszcz''', '''Maciej Bartoszuk''' (Sages)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://teams.microsoft.com/l/meetup-join/19%3a2a54bf781d2a466da1e9adec3c87e6c2%40thread.tacv2/1608302845411?context=%7b%22Tid%22%3a%220425f1d9-16b2-41e3-a01a-0c02a63d13d6%22%2c%22Oid%22%3a%22f5f2c910-5438-48a7-b9dd-683a5c3daf1e%22%7d|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2021-01-18.pdf|Classification of patent applications]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">During our presentation we will discuss a solution to the patent application classification task, one of the !GovTech competition problems. We will describe the characteristics of the problem and the proposed solution, in particular the original method of representing documents as “clouds of word embeddings”.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''1 December 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Filip Kucia''', '''Anna Wróblewska''', '''Bartosz Grabek''', '''Szymon Trochimiak''' (Warsaw University of Technology) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''How to Make Museums More Interactive? Case Study of the “Artistic Chatbot”''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">This presentation examines the challenges of deploying large language model (LLM)-powered chatbots in public cultural spaces, based on our experience with Artistic Chatbot – a voice-based conversational agent used during a month-long art exhibition at the Warsaw Academy of Fine Arts. We focus on two intertwined issues: how to make a system answer questions about a multilingual artistic collection, and how to evaluate the quality of those answers. On the technical side, we discuss strategies for building a retrieval-augmented knowledge base from heterogeneous, multilingual exhibition materials and the trade-offs between native-language models and pivot-language approaches based on translation. From the perspective of interaction design, we outline a fully voice-based setup in a gallery space, in which visitors walk up to a ceiling-mounted microphone and address the system through spoken trigger expressions, without screens or keyboards. The core of the talk is a post-hoc evaluation. We analyse interaction logs and conduct a human annotation study to compare different modelling and retrieval configurations along dimensions such as factual precision, coherence and relevance to the exhibition domain. Using this case study, we ask how to define and measure a “good” answer in conversational AI for cultural heritage, and how choices about language, translation and voice interaction should influence future deployments in museums and galleries.||
Line 41: Line 39:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''1 February 2021'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Adam Jatowt''' (University of Innsbruck)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://teams.microsoft.com/l/meetup-join/19%3ameeting_YTM3ZWZlYjUtMzJkNC00NGRkLWE3ZWItMWEyYmJhOGFjMmYz%40thread.v2/0?context=%7b%22Tid%22%3a%220425f1d9-16b2-41e3-a01a-0c02a63d13d6%22%2c%22Oid%22%3a%22f5f2c910-5438-48a7-b9dd-683a5c3daf1e%22%7d|{{attachment:seminarium-archiwum/teams.png}}]] '''Question Answering & Finding Temporal Analogs in News Archives''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">News archives offer immense value to our society, helping users to learn details of events that occurred in the past. Currently, the access to such collections is difficult for average users due to large sizes and the need for expertise in history. We propose a large-scale open-domain question answering model designed for long-term news article collections, with a dedicated module for re-ranking articles by using temporal information. In the second part of the talk we will discuss methods for finding and explaining temporal analogs – entities in the past which are analogical to the entities in the present (e.g., walkman as a temporal analog of iPad).||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''15 February 2021'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Aleksandra Nabożny''' (Polish-Japanese Academy of Information Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Talk title will be available shortly.''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will appear very soon.||
||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2025]].||
Line 53: Line 43:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 April 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Stan Matwin''' (Dalhousie University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Efficient training of word embeddings with a focus on negative examples''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">This presentation is based on our [[https://pdfs.semanticscholar.org/1f50/db5786913b43f9668f997fc4c97d9cd18730.pdf|AAAI 2018]] and [[https://aaai.org/ojs/index.php/AAAI/article/view/4683|AAAI 2019]] papers on English word embeddings. In particular, we examine the notion of “negative examples”, the unobserved or insignificant word-context co-occurrences, in spectral methods. We provide a new formulation for the word embedding problem by proposing a new intuitive objective function that perfectly justifies the use of negative examples. With the goal of efficient learning of embeddings, we propose a kernel similarity measure for the latent space that can effectively calculate the similarities in high dimensions. Moreover, we propose an approximate alternative to our algorithm using a modified Vantage Point tree and reduce the computational complexity of the algorithm with respect to the number of words in the vocabulary. We have trained various word embedding algorithms on Wikipedia articles comprising 2.3 billion tokens and show that our method outperforms the state of the art in most word similarity tasks by a good margin. We will round off our discussion with some general thoughts about the use of embeddings in modern NLP.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''17 November 2025''' '''(NOTE: the seminar will start at 16:00)'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marzena Karpińska''' (Microsoft) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''!OneRuler: testing multilingual language models on long contexts''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will examine how well language models extract information from texts of up to 128,000 tokens (approximately 100,000 words) in 26 languages, including Polish. The experiments show that as context length increases, the gap between high-resource and low-resource languages widens. Surprisingly, even minimal changes to the prompt (allowing for the possibility that the information is not present) cause a significant drop in accuracy, especially for longer texts.||


||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 March 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Mateusz Krubiński''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Talk title will be given shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available soon.||
Line 58: Line 54:

||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2020]].||

Natural Language Processing Seminar 2025–2026

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It will restart in October and will take place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube.

seminarium

15 September 2025

Louis Esteve (Universite Paris-Saclay)

Diversity and dataset size – a quantitative perspective  Talk in English.

The field of Natural Language Processing (NLP) studies the abilities of computer systems to process and generate natural language, and has received increasing attention from the general population since the democratisation of generative and conversational models. However, behind the scenes, state-of-the-art NLP models are trained on ever-larger datasets, reaching trillions of tokens. It may be argued that the creation and use of such immense datasets is motivated by the idea that 'the larger the dataset, the more diverse it is', and that in turn 'if the training set is more diverse, it shall yield better models'. However, these statements thus far remain intuitions and need to be properly tested. To this end, this presentation will tackle methods and caveats of formal diversity quantification including limitations of the literature, a preliminary discussion on the link between diversity and dataset size, as well as their impact on downstream applications.

6 October 2025

Stan Matwin (Dalhousie University / Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=hwBs4D7clls Deep, multi-faceted learning of diagnosing mental disorders from clinical interview records  Talk in Polish. Slides partially in English.

The key characteristics of mental illnesses are reflected in audio recordings of clinical interviews with patients and their families. We have developed a deep learning method that automatically extracts the relevant features necessary for the diagnosis of mental illnesses (ADHD, depression, bipolar disorder and schizophrenia) from such interviews. We use a variety of pre-trained models to extract representations from both the audio segments of these interviews and their text versions. We use several modern representation techniques (embeddings). We apply a Big Data approach by exploring existing audio and text corpora annotated with emotional labels. We address the problem of annotated data scarcity by using parametric model fine-tuning (Parameter Efficient Fine-Tuning). All these representations are then combined into a single multimodal form. To diagnose the above mental disorders, we use contrastive learning and model synthesis using a committee of experts (Mixture of Experts). The results show that through multimodal analysis of clinical interviews, mental disorders can be diagnosed with satisfactory accuracy (project conducted in collaboration with H. Naderi and R. Uher).

20 October 2025

Arkadiusz Modzelewski (University of Padua / Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=KNxm8Vt_wfw The Why and How of Disinformation: Datasets, Methods and Language Models Evaluation  Talk in English.

What language tools do disinformation agents employ? Can incorporating persuasion and intent knowledge enhance the ability of large language models to detect disinformation? And how effective are LLMs at identifying disinformation in Polish and English? In this talk, I will present findings from my PhD research on disinformation, persuasion, and the intent behind misleading information. I will introduce one of the largest Polish disinformation datasets, alongside a novel English dataset, both designed to capture manipulative techniques and intent of disinformation agents. Drawing on these and other resources, I will discuss how well current LLMs perform in detecting disinformation, persuasion, and intent, and highlight promising directions for improving their effectiveness in disinformation detection.

3 November 2025

Gražina Korvel (Vilnius University)

Developing Speech Corpora for Low-Resource Languages  Talk in Polish. Slides in English.

Developing diverse, well-annotated speech corpora is essential for training modern machine learning models. This presentation discusses the principles and methodologies involved in creating large-scale speech corpora, with a focus on the Lithuanian language as a case study. It presents the Great Lithuanian Speech Corpus (LIEPA-3) project, outlining strategies for data collection, annotation and quality assurance, as well as for balanced representation across dialects, genders, and age groups. The talk also addresses challenges related to ethical data collection and corpus standardization.

24 November 2025

Jan Eliasz, Mikołaj Langner, Jan Kocoń (Wrocław University of Science and Technology)

http://zil.ipipan.waw.pl/seminarium-online Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs  Talk in English.

We investigate two complementary strategies for improving the reliability of Large Language Models in classification settings. First, we demonstrate that reasoning-enabled LLMs are markedly better at tasks requiring contextual sensitivity, such as offensive-language annotation. When prompted to adopt a specific role, reasoning models maintain that role more consistently and make more accurate, fine-grained judgments than their non-reasoning counterparts.

http://zil.ipipan.waw.pl/seminarium-online Divide, Cache, Conquer. Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification  Talk in English.

Second, we show that decomposing multi-label classification into a set of independent binary decisions offers clear practical advantages over structured output formulations: it substantially reduces parsing errors, works seamlessly with decoder-only architectures, and delivers faster inference when combined with prefix caching, without requiring any model retraining. Viewed together, these findings highlight a unifying principle: LLMs become both more efficient and more context-aware when their decision process is made more structured, whether through task decomposition or through explicit reasoning.

1 December 2025

Filip Kucia, Anna Wróblewska, Bartosz Grabek, Szymon Trochimiak (Warsaw University of Technology)

http://zil.ipipan.waw.pl/seminarium-online How to Make Museums More Interactive? Case Study of the “Artistic Chatbot”  Talk in Polish.

This presentation examines the challenges of deploying large language model (LLM)-powered chatbots in public cultural spaces, based on our experience with Artistic Chatbot – a voice-based conversational agent used during a month-long art exhibition at the Warsaw Academy of Fine Arts. We focus on two intertwined issues: how to make a system answer questions about a multilingual artistic collection, and how to evaluate the quality of those answers. On the technical side, we discuss strategies for building a retrieval-augmented knowledge base from heterogeneous, multilingual exhibition materials and the trade-offs between native-language models and pivot-language approaches based on translation. From the perspective of interaction design, we outline a fully voice-based setup in a gallery space, in which visitors walk up to a ceiling-mounted microphone and address the system through spoken trigger expressions, without screens or keyboards. The core of the talk is a post-hoc evaluation. We analyse interaction logs and conduct a human annotation study to compare different modelling and retrieval configurations along dimensions such as factual precision, coherence and relevance to the exhibition domain. Using this case study, we ask how to define and measure a “good” answer in conversational AI for cultural heritage, and how choices about language, translation and voice interaction should influence future deployments in museums and galleries.
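The retrieval-augmented setup described above can be sketched minimally: score candidate passages against the visitor's question, then build a prompt from the best match. This toy version uses bag-of-words cosine similarity in place of the embedding-based retriever a real deployment would use; the passages and question are invented for the example.

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    num = sum(a[t] * b[t] for t in a if t in b)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(question, passages, k=1):
    # Rank passages by similarity to the question; return the top k.
    q = Counter(question.lower().split())
    ranked = sorted(passages,
                    key=lambda p: cosine(q, Counter(p.lower().split())),
                    reverse=True)
    return ranked[:k]

passages = [
    "The sculpture hall shows diploma works from the sculpture faculty.",
    "The graphics room presents lithographs by first-year students.",
]
question = "who made the sculpture works"
context = retrieve(question, passages)[0]
prompt = f"Answer using only this context:\n{context}\nQuestion: {question}"
```

Grounding the answer in retrieved exhibition material, rather than the model's own knowledge, is what keeps responses anchored to the actual collection; the evaluation question is then how often the retrieved context is in fact relevant.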

Please see also the talks given in 2000–2015 and 2015–2025.