| Size: 24780 | Size: 12998 |
| Deletions are marked like this. | Additions are marked like this. |
| Line 3: | Line 3: |
| = Natural Language Processing Seminar 2017–2018 = | = Natural Language Processing Seminar 2025–2026 = |
| Line 5: | Line 5: |
| ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available [[https://www.youtube.com/channel/UC5PEPpMqjAr7Pgdvq0wRn0w|on YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| | ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It will restart in October and will take place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| |
| Line 7: | Line 7: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 October 2017'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Paweł Rutkowski''' (University of Warsaw)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=Acfdv6kUe5I|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2017-10-02.pdf|Polish Sign Language from the perspective of corpus linguistics]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Polish Sign Language (polski język migowy, PJM) is a full-fledged visual-spatial language used by the Polish Deaf community. It started to evolve in the second decade of the nineteenth century, with the foundation of the first school for the deaf in Poland. Until recently, PJM attracted very little attention from the linguistic community in Poland. The aim of this talk is to present a large-scale research project aimed at creating an extensive and representative corpus of PJM. The corpus is currently being compiled at the University of Warsaw. It is a collection of video clips showing Deaf people using PJM in a variety of different communication contexts. The videos are richly annotated: they are segmented, lemmatized, translated into Polish, tagged for various grammatical features and transcribed with !HamNoSys symbols. The Corpus of PJM is currently one of the two largest sets of annotated sign language data in the world. Special attention will be paid to the issue of lexical frequency in PJM. Studies of this type are available for a handful of sign languages only, including American Sign Language, New Zealand Sign Language, British Sign Language, Australian Sign Language and Slovene Sign Language. Their empirical basis ranged from 100,000 tokens (NZSL) to as little as 4,000 tokens (ASL). The present talk contributes to our understanding of lexical frequency in sign languages by analyzing a much larger set of relevant data from PJM.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''15 September 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Louis Esteve''' (Université Paris-Saclay) || ||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2025-09-15.pdf|Diversity and dataset size – a quantitative perspective]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The field of Natural Language Processing (NLP) studies the abilities of computer systems to process and generate natural language, and has received increasing attention from the general population since the democratisation of generative and conversational models. However, behind the scenes, state-of-the-art NLP models are trained on ever-larger datasets, reaching trillions of tokens. It may be argued that the creation and use of such immense datasets are motivated by the idea that 'the larger the dataset, the more diverse it is', and that in turn 'if the training set is more diverse, it shall yield better models'. However, these statements thus far remain intuitions and need to be properly tested. To this end, this presentation will tackle methods and caveats of formal diversity quantification, including limitations of the literature, a preliminary discussion on the link between diversity and dataset size, as well as their impact on downstream applications.|| |
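As a minimal illustration of what formal diversity quantification can mean in practice, the sketch below computes two classic lexical-diversity measures on growing samples of a toy corpus. The measures and the corpus are illustrative assumptions, not material from the talk.

{{{#!python
# Two simple lexical-diversity measures – type-token ratio and Shannon
# entropy of the unigram distribution – computed on growing prefixes of a
# toy corpus. Illustrative only; not the metrics studied in the talk.
import math
from collections import Counter

def type_token_ratio(tokens):
    """Share of distinct tokens in the sample."""
    return len(set(tokens)) / len(tokens)

def unigram_entropy(tokens):
    """Shannon entropy (in bits) of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

corpus = ("the cat sat on the mat and the dog sat on the rug "
          "a cat and a dog met on a mat").split()

for n in (8, 16, len(corpus)):  # growing "dataset sizes"
    sample = corpus[:n]
    print(f"n={n:2d}  TTR={type_token_ratio(sample):.2f}  "
          f"H={unigram_entropy(sample):.2f} bits")
}}}

Note that the type-token ratio systematically falls as the sample grows, which is one of the caveats that make naive diversity-versus-size comparisons misleading.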
| Line 12: | Line 12: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''23 October 2017'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Katarzyna Krasnowska-Kieraś''', '''Piotr Rybak''', '''Alina Wróblewska''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=8qzqn69nCmg|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2017-10-23.pdf|Towards the evaluation of feature embedding models of the fusional languages in the context of morphosyntactic disambiguation and dependency parsing]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Neural networks are recently very successful in various natural language processing tasks. An important component of a neural network approach is a dense vector representation of features, i.e. feature embedding. Various feature types are possible, e.g. words, part-of-speech tags. In our talk we are going to present results of an analysis showing what should be used as features in estimating embedding models of the fusional languages – tokens or lemmata. Furthermore, we are going to discuss the methodological question whether the results of the intrinsic evaluation of embeddings are informative for downstream applications, or the embedding models should be evaluated extrinsically. The accompanying experiments were conducted on Polish – a fusional Slavic language with a relatively free word order. The mentioned research has inspired us to implement a morphosyntactic disambiguator – Toygger (Krasnowska-Kieraś, 2017). The tool won the shared task 1 (A) in [[http://poleval.pl|PolEval 2017]] competition and will be presented in our talk.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''6 October 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Stan Matwin''' (Dalhousie University / Institute of Computer Science, Polish Academy of Sciences) || ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=hwBs4D7clls|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-10-06.pdf|Deep, multi-faceted learning of diagnosing mental disorders from clinical interview records]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides partially in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The key characteristics of mental illnesses are reflected in audio recordings of clinical interviews with patients and their families. We have developed a deep learning method that automatically extracts the relevant features necessary for the diagnosis of mental illnesses (ADHD, depression, bipolar disorder and schizophrenia) from such interviews. We use a variety of pre-trained models to extract representations from both the audio segments of these interviews and their text versions. We use several modern representation techniques (embeddings). We apply a Big Data approach by exploring existing audio and text corpora annotated with emotional labels. We address the problem of annotated data scarcity by using parameter-efficient fine-tuning (PEFT). All these representations are then combined into a single multimodal form. To diagnose the above mental disorders, we use contrastive learning and a Mixture-of-Experts model combination. The results show that through multimodal analysis of clinical interviews, mental disorders can be diagnosed with satisfactory accuracy (project conducted in collaboration with H. Naderi and R. Uher).|| |
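The sketch below shows the general shape of such a pipeline – late fusion of precomputed audio and text embeddings followed by a small Mixture-of-Experts classification head. All dimensions, the four-way label set and the code itself are illustrative assumptions, not the authors' implementation.

{{{#!python
# Late fusion of audio and text embeddings with a tiny Mixture-of-Experts
# head: a generic sketch of the architecture family described above.
import torch
import torch.nn as nn

class FusionMoE(nn.Module):
    def __init__(self, d_audio=256, d_text=768, d_hidden=128,
                 n_experts=4, n_classes=4):
        super().__init__()
        self.proj = nn.Linear(d_audio + d_text, d_hidden)
        self.gate = nn.Linear(d_hidden, n_experts)  # soft expert weighting
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, n_classes))
            for _ in range(n_experts)])

    def forward(self, audio_emb, text_emb):
        h = torch.relu(self.proj(torch.cat([audio_emb, text_emb], dim=-1)))
        w = torch.softmax(self.gate(h), dim=-1)          # (batch, n_experts)
        out = torch.stack([e(h) for e in self.experts], dim=1)
        return (w.unsqueeze(-1) * out).sum(dim=1)        # mix expert outputs

model = FusionMoE()
logits = model(torch.randn(2, 256), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 4]) – one score per diagnostic class
}}}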
| Line 17: | Line 17: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''6 November 2017'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Szymon Łęski''' (Samsung R&D Poland)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=266ftzwmKeU|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2017-11-06.pdf|Deep neural networks in language models]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">In my talk I will first give introduction to language models: traditional, n-gram based, and new, based on recurrent networks. Then, based on recent papers, I will discuss the most interesting extensions and modifications to RNN-based language models, such as modifying word representations or models with output not limited to a pre-defined vocabulary.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''20 October 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Arkadiusz Modzelewski''' (University of Padua / Polish-Japanese Academy of Information Technology)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=KNxm8Vt_wfw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-10-20.pdf|The Why and How of Disinformation: Datasets, Methods and Language Models Evaluation]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">What language tools do disinformation agents employ? Can incorporating persuasion and intent knowledge enhance the ability of large language models to detect disinformation? And how effective are LLMs at identifying disinformation in Polish and English? In this talk, I will present findings from my PhD research on disinformation, persuasion, and the intent behind misleading information. I will introduce one of the largest Polish disinformation datasets, alongside a novel English dataset, both designed to capture the manipulative techniques and intent of disinformation agents. Drawing on these and other resources, I will discuss how well current LLMs perform in detecting disinformation, persuasion, and intent, and highlight promising directions for improving their effectiveness in disinformation detection.|| |
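As a sketch of the kind of comparison the talk asks about – does adding persuasion and intent context help? – the snippet below builds the two prompt variants side by side. The prompt wording and the `ask_llm` callable are illustrative assumptions, not the setup used in the research.

{{{#!python
# Classifying a text with and without explicit persuasion/intent context.
# `ask_llm` is any str -> str wrapper around a chat model (stubbed here).
BASELINE = ("Label the following text as DISINFORMATION or CREDIBLE.\n"
            "Text: {text}\nLabel:")

WITH_CONTEXT = ("Persuasion techniques present: {techniques}.\n"
                "Presumed intent of the author: {intent}.\n"
                "Given this context, label the text as DISINFORMATION "
                "or CREDIBLE.\nText: {text}\nLabel:")

def classify(text, ask_llm, techniques=None, intent=None):
    if techniques is None:
        prompt = BASELINE.format(text=text)
    else:
        prompt = WITH_CONTEXT.format(text=text, intent=intent,
                                     techniques=", ".join(techniques))
    return ask_llm(prompt).strip().upper()

stub = lambda prompt: "DISINFORMATION"  # replace with a real LLM call
print(classify("Example claim...", stub,
               techniques=["appeal to fear"], intent="discredit opponent"))
}}}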
| Line 22: | Line 22: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''20 November 2017'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Michał Ptaszyński''' (Kitami Institute of Technology, Japan)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=hUtI5lCyUew|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2017-11-20.pdf|Capturing Emotions in Context as a way towards Computational Phronesis]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Research on emotions within Artificial Intelligence and related fields has flourished rapidly through recent years. Unfortunately, in most research emotions are analyzed without their context. I will argue, that recognizing emotions without recognizing their context is incomplete and cannot be sufficient for real-world applications. I will also describe some consequences of disregarding the context of emotions. Finally, I will present one approach, in which the context of emotions is considered and briefly describe some of the first experiments performed in this matter.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''3 November 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Gražina Korvel''' (Vilnius University) || ||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2025-11-03.pdf|Developing Speech Corpora for Low-Resource Languages]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Developing diverse, well-annotated speech corpora is essential for training modern machine learning models. This presentation discusses the principles and methodologies involved in creating large-scale speech corpora, with a focus on the Lithuanian language as a case study. It presents the Great Lithuanian Speech Corpus (LIEPA-3) project, outlining strategies for collecting and annotating data, ensuring its quality, and balancing representation across dialects, genders, and age groups. The talk also addresses challenges related to ethical data collection and corpus standardization.|| |
| Line 27: | Line 27: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''27 November 2017'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Ogrodniczuk''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2017-11-27.pdf|Automated coreference resolution in Polish]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The talk presents the description of nominal referential constructs in Polish (i.e. textual fragments referencing the same discourse entities) and the computational-linguistic methods implemented for their decoding. The algorithms are corpus-based with manual annotation of coreferential constructs and are evaluated using standard metrics.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''24 November 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jan Eliasz''', '''Mikołaj Langner''', '''Jan Kocoń''' (Wrocław University of Science and Technology) || ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=4inBbYUbFvA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-11-24-1.pdf|Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:5px">We investigate two complementary strategies for improving the reliability of Large Language Models in classification settings. First, we demonstrate that reasoning-enabled LLMs are markedly better at tasks requiring contextual sensitivity, such as offensive-language annotation. When prompted to adopt a specific role, reasoning models maintain that role more consistently and make more accurate, fine-grained judgments than their non-reasoning counterparts.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=DjIhTMfbfHM|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-11-24-2.pdf|Divide, Cache, Conquer. Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Second, we show that decomposing multi-label classification into a set of independent binary decisions offers clear practical advantages over structured output formulations: it substantially reduces parsing errors, works seamlessly with decoder-only architectures, and delivers faster inference when combined with prefix caching, without requiring any model retraining. Viewed together, these findings highlight a unifying principle: LLMs become both more efficient and more context-aware when their decision process is made more structured, whether through task decomposition or through explicit reasoning.|| |
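A minimal sketch of the dichotomic-prompting idea from the second talk: one independent yes/no query per label, all sharing an identical prefix so that a serving stack with prefix caching processes the long document only once. The label set and the `ask_llm` wrapper are illustrative assumptions, not the authors' exact setup.

{{{#!python
# Multi-label classification decomposed into independent binary decisions.
LABELS = ["offensive", "sarcastic", "political", "threatening"]

def dichotomic_labels(document, ask_llm):
    shared_prefix = (f"Document:\n{document}\n\n"
                     "Answer the question with exactly YES or NO.\n")
    positive = []
    for label in LABELS:
        # Only this short suffix differs between calls -> cache-friendly.
        prompt = shared_prefix + f"Question: is the document {label}?\nAnswer:"
        if ask_llm(prompt).strip().upper().startswith("YES"):
            positive.append(label)
    return positive

# Parsing is trivial (YES/NO per call), unlike asking for one structured
# JSON list of labels – which is where the reduction in parsing errors
# described above comes from.
stub = lambda p: "YES" if "offensive" in p else "NO"
print(dichotomic_labels("Some tweet text...", stub))  # ['offensive']
}}}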
| Line 32: | Line 34: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''4 December 2017'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Adam Dobaczewski''', '''Piotr Sobotka''', '''Sebastian Żurowski''' (Nicolaus Copernicus University in Toruń)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=az06czLflMw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2017-12-04.pdf|Dictionary of Polish reduplications and repetitions]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">In our talk we will present a dictionary prepared by the team from the Institute of Polish Language of the Nicolaus Copernicus University in Toruń (grant NPRH 11H 13 0265 82). We document In the dictionary expressions of the Polish language in which the presence of reduplication or repetition of forms of the same lexemes can be observed. We distinguish the units of language according to the Bogusławski's operational grammar framework and divide them into two basic groups: (i) lexical units consisting of two such segments or forms of the same lexeme (Pol. ''całkiem całkiem''; ''fakt faktem''); operational units based on some pattern of repetition of words belonging to a certain class predicted by this scheme (Pol. ''N[nom] N[inst] ale _'', where N stands for any noun, e.g. ''sąd sądem, ale _''; ''miłość miłością, ale _''). We have prepared a dictionary in traditional (printed) form due to the relatively small number of registered units. Its material base is the resources of the NKJP, which were searched using dedicated search engine of repetitions in the NKJP. This tool was specially prepared for this project at the LEG ICS PAS.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''1 December 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Filip Kucia''', '''Anna Wróblewska''', '''Bartosz Grabek''', '''Szymon Trochimiak''' (Warsaw University of Technology) || ||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2025-12-01.pdf|How to Make Museums More Interactive? Case Study of the “Artistic Chatbot”]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">This presentation examines the challenges of deploying large language model (LLM)-powered chatbots in public cultural spaces, based on our experience with Artistic Chatbot – a voice-based conversational agent used during a month-long art exhibition at the Warsaw Academy of Fine Arts. We focus on two intertwined issues: how to make a system answer questions about a multilingual artistic collection, and how to evaluate the quality of those answers. On the technical side, we discuss strategies for building a retrieval-augmented knowledge base from heterogeneous, multilingual exhibition materials and the trade-offs between native-language models and pivot-language approaches based on translation. From the perspective of interaction design, we outline a fully voice-based setup in a gallery space, in which visitors walk up to a ceiling-mounted microphone and address the system through spoken trigger expressions, without screens or keyboards. The core of the talk is a post-hoc evaluation. We analyse interaction logs and conduct a human annotation study to compare different modelling and retrieval configurations along dimensions such as factual precision, coherence and relevance to the exhibition domain. Using this case study, we ask how to define and measure a “good” answer in conversational AI for cultural heritage, and how choices about language, translation and voice interaction should influence future deployments in museums and galleries.|| |
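A minimal sketch of the retrieval-augmented question-answering loop that such a deployment typically relies on, with stub embedding and LLM callables standing in for real models. The prompt, the toy documents and both callables are illustrative assumptions, not the Artistic Chatbot code.

{{{#!python
# Retrieval-augmented generation over a small in-memory document index.
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k most cosine-similar documents."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return np.argsort(-sims)[:k]

def answer(question, docs, embed, ask_llm):
    doc_vecs = np.stack([embed(d) for d in docs])
    context = "\n---\n".join(docs[i] for i in top_k(embed(question), doc_vecs))
    prompt = ("Answer using only the exhibition materials below.\n"
              f"{context}\n\nQuestion: {question}\nAnswer:")
    return ask_llm(prompt)

# Stubs: a bag-of-letters "embedding" and a canned model. Replace with a
# multilingual sentence embedder and a real LLM in any actual deployment.
embed = lambda t: np.array([t.lower().count(c) for c in "aeinorst"], float)
docs = ["The mural in Room 2 was painted in 2024.",
        "The sound installation reacts to visitors' voices.",
        "Opening hours: 10:00-18:00 daily."]
print(answer("Who painted the mural?", docs, embed, lambda p: "(stub answer)"))
}}}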
| Line 37: | Line 39: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''29 January 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Roman Grundkiewicz''' (Adam Mickiewicz University in Poznań/University of Edinburgh)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=dj9rTwzDCdA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-01-29.pdf|Automatic Grammatical Error Correction using Machine Translation]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">In my presentation I will be talking about the task of automated grammatical error correction (GEC) in texts written by non-native English speakers. I will present our experiments on the application of the phrase-based statistical machine translation (SMT), and our GEC system, which achieved new state-of-the-art results. The importance of the parameter optimization towards the task-specific evaluation metric and new GEC-adapted dense and sparse features will be discussed. I will also briefly describe the results of further research using neural machine translation (NMT).|| |
||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2025]].|| |
| Line 42: | Line 41: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''12 February 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Mykowiecka''', '''Aleksander Wawer''', '''Małgorzata Marciniak''', '''Piotr Rychlik''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=9QPldbRyIzU|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-02-12.pdf|Recognition of metaphorical noun phrases in Polish with distributional semantics]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|The talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Our talk addresses the use of vector models for Polish based on lemmas and forms. We compare the results for two typical tasks solved with the help of distributional semantics, i.e. synonymy and analogy recognition. Then we apply vector models to detect metaphorical and literal meaning of adjective-noun (AN) phrases. We show the results of our method for isolated phrases and compare them to other known methods. Finally, we discuss the problem of recognition of metaphorical/literal meaning of AN phrases in sentences.|| |
{{{#!wiki comment |
| Line 47: | Line 43: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''26 February 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Celina Heliasz''' (University of Warsaw)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2018-02-26.pdf|To create or to contribute? On the search for synergy between computer scientists and linguists]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|The talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The main topic of my presentation are the methods of conducting research in the field of corpus linguistics, which is currently being addressed by both computer scientists and linguists. In my speech, I will present the attempts to recognize and visualize semantic relations in the text undertaken by computer scientists as part of the two projects: RST (Rhetorical Structure Theory) and PDTB (Penn Discourse Treebank). Then, I contrast RST and PDTB with analogous attempts made by computer scientists and linguists at IPI PAN as part of the CLARIN-PL venture. The aim of the presentation is to show the determinants of effective linguistic analysis, which must be taken into account when designing IT tools, if these tools are to conduct research on text and derive strong foundations of linguistic theories from them, and not only to implement existing theories in this field.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''17 November 2025''' '''(NOTE: the seminar will start at 16:00)'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marzena Karpińska''' (Microsoft) || ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''!OneRuler: testing multilingual language models on long contexts'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">In this presentation, I will look at how well language models extract information from texts of up to 128,000 tokens (approximately 100,000 words) in 26 languages, including Polish. The experimental results show that as the context length increases, the gap between high-resource and low-resource languages widens. Surprisingly, even minimal changes to the prompt (such as allowing for the possibility that the requested information is absent) cause a significant drop in accuracy, especially for longer texts.|| |
| Line 52: | Line 48: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''9 April 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jan Kocoń''' (Wrocław University of Technology)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=XgSyuWEHWhU|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-04-09.pdf|Recognition of temporal expressions and events in Polish text documents]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|The talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">A temporal expression is a sequence of words that informs you about when, how often an event occurs or how long it lasts. Event descriptions are words which indicate a change of state in the description of reality (and also some states). These issues fall within the scope of information extraction. They are well defined and described for English and partly for other languages. The TimeML specification, whose temporal information description language has been accepted as an ISO standard, has been officially adapted for six languages and the temporal expressions description section is defined for eleven languages. The result of the work carried out within CLARIN-PL is the adaptation of TimeML guidelines for Polish language. The motivation for this topic was the fact that temporal information is used by various natural language processing tasks, including methods for question answering, automatic text summarisation, semantic relations extraction and many others. These methods allow researchers in the domain of Digital Humanities and Social Sciences to work with a very large collection of texts whose analysis, without these methods, would be very time-consuming, if possible at all. In addition to the adaptation of the temporal information description language itself, the quality and efficiency of methods is a key aspect for temporal expressions and events recognition. The presentation will discuss both the analysis of the quality of data prepared by domain experts (including annotation agreement analysis) and the results of research aimed at reducing the complexity of the computational problem while preserving the quality of methods.|| |
| Line 57: | Line 49: |
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''23 April 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Włodzimierz Gruszczyński, Dorota Adamiec, Renata Bronikowska''' (Institute of the Polish Language, Polish Academy of Sciences), '''Witold Kieraś, Dorota Komosińska, Marcin Woliński''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=APvZdALq6ZU|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-04-23.pdf|Historical corpus – problems of transliteration, transcription and annotation on the example of the Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772)]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|The talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">During the seminar, the process of creating the Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772), also called the Baroque Corpus, will be discussed. The particular emphasis will be placed on those tasks and problems that are specific to historical corpora, in contrast to corpora of contemporary texts, e.g. the National Corpus of Polish. We will also show the tools that were created for the needs of the project or adapted to these needs. After the general presentation of the project (assumptions, financing, team, current status, corpus's purpose) we will discuss particular problems in the order in which they appeared during the creation of the corpus: the selecting of texts, gathering them and incorporating them into a database, the necessity of their transcription into modern spelling (resulting from a huge spelling differentiation of old prints and manuscripts), issues of morphological analysis, morphosyntactic annotation (manual and automatic) and corpus searching.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 May 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Łukasz Kobyliński, Michał Wasiluk, Zbigniew Gawłowicz''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=QpmLVzqQfcM|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-05-14.pdf|MTAS corpus search engine and its implementation for Polish language corpora]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|The talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">During the seminar we will discuss our experiences with the MTAS search engine in the context of Polish language corpora. We will present several implementations of MTAS in such corpus-related projects as KORBA (the corpus of Polish language of the XVII and XVIII century), the XIX century language corpus, as well as National Corpus of Polish. We will also discuss preliminary experiments with implementing MTAS in Korpusomat - a tool that allows users to create their own corpora. During the presentation we will share our solutions to the problems encountered during the adaptation of MTAS to Polish and preliminary efficiency test results. We will also discuss the search capabilities of the engine and our plans for enhancing MTAS.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 May 2018''' (IPI PAN seminar presentation, 13:00)|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Borkowski''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=o2FFtfrqh3I|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-05-21.pdf|Semantic methods of categorization in the tasks of text document analysis]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|The talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">In my PhD thesis entitled `Semantic methods of categorization in the tasks of text document analysis', a new algorithm of semantic categorization of documents was proposed and examined. On its basis, a new algorithm for category aggregation was developed, a family of semantic algorithms of classifiers, as well as a heterogeneous classifier committee (which combines the algorithm of semantic categorization and previously known classifiers). In my talk I will briefly present their concepts and the results of their effectiveness studies.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 May 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Krzysztof Wołk''' (Polish-Japanese Academy of Information Technology)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FyeVRSXbBOg|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-05-28.pdf|Exploration and usage of comparable corpora in machine translation]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|The talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The problem that will be presented in the seminar is how to improve machine speech translation between Polish and English. The most popular methodologies and tools are not well-suited for the Polish language and therefore require adaptation. Polish language resources are lacking in parallel and monolingual data. Therefore, the main objective of the study was to develop an automatic toolkit for textual resources preparation by mining comparable corpora and quasi comparable corpora. Experiments were conducted mostly on casual human speech, consisting of lectures, movie subtitles, European Parliament proceedings, and European Medicines Agency texts. The aims were to rigorously analyze the problems and to improve the quality of baseline systems, i.e., adaptation of techniques and training parameters to increase the Bilingual Evaluation Understudy (BLEU) score for maximum performance. A further aim was to create additional bilingual and monolingual data resources by using available online data and by obtaining and mining comparable corpora for parallel sentence pairs. For this task, a methodology employing a Support Vector Machine and the Needleman-Wunsch algorithm was used, along with a chain of specialized tools.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''4 June 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Przybyła''' (University of Manchester)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Supporting document screening for systematic reviews using machine learning and text mining'''  {{attachment:seminarium-archiwum/icon-pl.gif|The talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Systematic reviews, aiming to aggregate and analyse all the literature for a given research question, are a crucial tool in medical research. Their most laborious stage is screening, i.e. manual selection of dozens of relevant articles from thousands returned by search engines. Formulating the problem as a text classification task and using appropriate unsupervised text mining tools could lead to significant work saved. The presentation will cover adaptation of machine learning algorithms to the problem, tools for extracting and visualising terms and topics in collections, system deployment and evaluation at NICE (National Institute for Health and Care Excellence), a UK agency publishing health technology guidelines.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 June 2018'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Danijel Korzinek''' (Polish-Japanese Academy of Information Technology)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Preparing a speech corpus using the recordings of the Polish Film Chronicle'''  {{attachment:seminarium-archiwum/icon-pl.gif|The talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will describe how a speech corpus based on the Polish Film Chronicle, a collection of short historical news segments, was created during the CLARIN-PL project. This resource is an extremely useful tool for linguistic research, specifically in the context of historical speech and language. The years 1945–1960 were chosen for this purpose. The presentation will discuss various topics: from the legal issues of acquiring the resources, to more the more technical aspects of dealing with the adaptation of speech analysis tools to this, rather uncommon domain.|| ||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2017]].|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 March 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Mateusz Krubiński''' (Charles University in Prague)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Talk title will be given shortly'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available soon.|| }}} |