
Diff for "seminar"

Differences between revisions 618 and 711 (spanning 93 versions)
Revision 618 as of 2024-04-03 13:32:44
Size: 24779
Comment:
Revision 711 as of 2025-05-05 09:17:05
Size: 28600
Comment:
Deletions are marked like this. Additions are marked like this.
Line 3: Line 3:
= Natural Language Processing Seminar 2023–2024 = = Natural Language Processing Seminar 2024–2025 =
Line 7: Line 7:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''9 October 2023'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Mikołajczyk-Bareła''', '''Wojciech Janowski''' (!VoiceLab), '''Piotr Pęzik''' (University of Łódź / !VoiceLab), '''Filip Żarnecki''', '''Alicja Golisowicz''' (!VoiceLab)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-10-09.pdf|TRURL.AI: Fine-tuning large language models on multilingual instruction datasets]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">This talk will summarize our recent work on fine-tuning a large generative language model on bilingual instruction datasets, which resulted in the release of an open version of Trurl (trurl.ai). The motivation behind creating this model was to improve the performance of the original Llama 2 7B- and 13B-parameter models (Touvron et al. 2023), from which it was derived, in a number of areas such as information extraction from customer-agent interactions and data labeling, with a special focus on processing texts and instructions written in Polish. We discuss the process of optimizing the instruction datasets and the effect of the fine-tuning process on a number of selected downstream tasks.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''7 October 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Janusz S. Bień''' (University of Warsaw, professor emeritus) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=2mLYixXC_Hw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-07.pdf|Identifying glyphs in some 16th century fonts: a case study]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Some glyphs from 16th century fonts, described in the monumental work “[[https://crispa.uw.edu.pl/object/files/754258/display/Default|Polonia Typographica Saeculi Sedecimi]]”, can be more or less easily identified with the Unicode standard characters. Some glyphs don't have Unicode codepoints, but can be printed with appropriate !OpenType/TrueType fonts using typographic features. For some of them their encoding remains an open question. Some examples will be discussed.||
Line 12: Line 12:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''16 October 2023'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Konrad Wojtasik''', '''Vadim Shishkin''', '''Kacper Wołowiec''', '''Arkadiusz Janz''', '''Maciej Piasecki''' (Wrocław University of Science and Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Evaluation of information retrieval models in zero-shot settings on different document domains''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Information Retrieval over large collections of documents is an extremely important research direction in the field of natural language processing. It is a key component in question-answering systems, where the answering model often relies on information contained in a database with up-to-date knowledge. This not only allows for updating the knowledge upon which the system responds to user queries but also limits its hallucinations. Currently, information retrieval models are neural networks and require significant training resources. For many years, lexical matching methods like BM25 outperformed trained neural models in the open-domain setting, but current architectures and extensive datasets allow surpassing lexical solutions. In the presentation, I will introduce available datasets for the evaluation and training of modern information retrieval architectures in document collections from various domains, as well as future development directions.||
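For readers unfamiliar with the lexical baseline mentioned above, the following is a minimal sketch of Okapi BM25 scoring; the toy corpus and the parameter values `k1`/`b` are illustrative assumptions, not part of the talk:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenised document for a tokenised query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N     # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(term in d for d in docs)     # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed idf
        f = tf[term]                          # term frequency in this doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["neural", "retrieval", "models"],
        ["lexical", "matching", "bm25"],
        ["question", "answering", "systems"]]
scores = [bm25_score(["bm25", "matching"], d, docs) for d in docs]
print(scores.index(max(scores)))  # → 1
```

Only the second document shares terms with the query, so it is the one retrieved; neural retrievers aim to beat exactly this kind of exact-match baseline on semantically related but lexically different documents.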
||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 October 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alexander Rosen''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=E2ujmqt7Q2E|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-14.pdf|Lexical and syntactic variability of languages and text genres. A corpus-based study]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">This study examines metrics of syntactic complexity (SC) and lexical diversity (LD) as tools for analyzing linguistic variation within and across languages. Using quantifiable measures based on cross-linguistically consistent (morpho)syntactic annotation ([[https://universaldependencies.org/|Universal Dependencies]]), the research utilizes parallel texts from a large multilingual corpus ([[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze16ud|InterCorp]]). Six SC and two LD metrics – covering the length and embedding levels of nominal and clausal constituents, mean dependency distance (MDD), and sentence length – are applied as metadata for sentences and texts.||
||<style="border:0;padding-left:30px;padding-bottom:5px">The presentation will address how these metrics can be visualized and incorporated into corpus queries, how they reflect structural differences across languages and text types, and whether SC and LD vary more across languages or text types. It will also consider the impact of language-specific annotation nuances and correlations among the measures. The analysis includes comparative examples from Polish, Czech, and other languages.||
||<style="border:0;padding-left:30px;padding-bottom:15px">Preliminary findings indicate higher SC in non-fiction compared to fiction across languages, with nominal and clausal metrics being dominant factors. The results suggest distinct patterns for MDD and sentence length, highlighting the impact of structural differences (e.g., analytic vs. synthetic morphology, dominant word-order patterns) and the influence of source text type and style.||
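For readers unfamiliar with mean dependency distance (MDD), one of the SC metrics above, a minimal sketch of the usual definition (average absolute distance between each dependent and its head); the example sentence and the 1-based head encoding are illustrative assumptions:

```python
def mean_dependency_distance(heads):
    """MDD: mean absolute distance between each dependent and its head.
    heads[i] is the 1-based position of the head of token i+1; 0 marks the root."""
    dists = [abs(h - (i + 1)) for i, h in enumerate(heads) if h != 0]
    return sum(dists) / len(dists)

# "She gave him a book": She->gave, him->gave, a->book, book->gave
print(mean_dependency_distance([2, 0, 2, 5, 2]))  # → 1.5
```

Longer distances indicate more non-adjacent head–dependent pairs, which is why MDD is sensitive to word-order patterns and embedding depth.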
Line 17: Line 19:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''30 October 2023'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Faleńska''' (University of Stuttgart)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-10-30.pdf|Steps towards Bias-Aware NLP Systems]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">For many, Natural Language Processing (NLP) systems have become everyday necessities, with applications ranging from automatic document translation to voice-controlled personal assistants. Recently, the increasing influence of these AI tools on human lives has raised significant concerns about the possible harm these tools can cause.||
||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will start by showing a few examples of such harmful behaviors and discussing their potential origins. I will argue that biases in NLP models should be addressed by advancing our understanding of their linguistic sources. Then, the talk will zoom into three compelling case studies that shed light on inequalities in commonly used training data sources: Wikipedia, instructional texts, and discussion forums. Through these case studies, I will show that regardless of the perspective on the particular demographic group (speaking about, speaking to, and speaking as), subtle biases are present in all these datasets and can perpetuate harmful outcomes of NLP models.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 October 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Rafał Jaworski''' (Adam Mickiewicz University in Poznań)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=52LZ976imBA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-28.pdf|Framework for aligning and storing multilingual word embeddings for the needs of translation probability computation]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">The presentation will cover my research in the field of natural language processing for computer-aided translation. In particular, I will present the Inter-language Vector Space algorithm set for aligning sentences at the word and phrase level using multilingual word embeddings.||
||<style="border:0;padding-left:30px;padding-bottom:5px">The first function of the set is used to generate vector representations of words. They are generated using an auto-encoder neural network based on text data – a text corpus. In this way, vector dictionaries for individual languages are created. The vector representations of words in these dictionaries constitute vector spaces that differ between languages.||
||<style="border:0;padding-left:30px;padding-bottom:5px">To solve this problem and obtain vector representations of words that are comparable between languages, the second function of the Inter-language Vector Space set is used. It aligns vector spaces between languages by means of transformation matrices computed with singular value decomposition. Such a matrix is calculated based on homonyms, i.e. words written identically in the languages of spaces X and Y. Additionally, a bilingual dictionary is used to improve the results. The transformation matrix calculated in this way allows for adjusting space X in such a way that it overlaps space Y to the maximum possible extent.||
||<style="border:0;padding-left:30px;padding-bottom:5px">The last function of the set is responsible for creating a multilingual vector space. The vector space for the English language is first added to this space in its entirety and without modification. Then, for every other vector space, the transformation matrix from that space to the English space is calculated first. The vectors of the new space are multiplied by this matrix and thus become comparable to the vectors representing English words.||
||<style="border:0;padding-left:30px;padding-bottom:15px">The Inter-language Vector Space algorithm set is used in translation support systems, for example in the author's algorithm for automatic transfer of untranslated tags from the source sentence to the target one.||
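The SVD-based alignment step described above corresponds to the classic orthogonal Procrustes problem; a minimal numpy sketch of that step only (the toy "anchor word" data and the rotation are assumptions for illustration, not the author's implementation):

```python
import numpy as np

def align_spaces(X, Y):
    """Orthogonal matrix W minimising ||XW - Y||_F (Procrustes via SVD),
    mapping anchor vectors of space X onto their counterparts in space Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# toy anchors shared by both spaces; space Y is space X rotated in the xy-plane
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
Y = X @ R
W = align_spaces(X, Y)
print(np.allclose(X @ W, Y))  # → True
```

The recovered `W` maps X onto Y exactly here because Y is a pure rotation of X; with real embedding spaces the overlap is only approximate, which is why a bilingual dictionary helps refine the anchors.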
Line 23: Line 28:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''13 November 2023'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Rybak''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-11-13.pdf|Advancing Polish Question Answering: Datasets and Models]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Although question answering (QA) is one of the most popular topics in natural language processing, until recently it was virtually absent in the Polish scientific community. However, the last few years have seen a significant increase in work related to this topic. In this talk, I will discuss what question answering is, how current QA systems work, and what datasets and models are available for Polish QA. In particular, I will discuss the resources created at IPI PAN, namely the [[https://huggingface.co/datasets/ipipan/polqa|PolQA]] and [[https://huggingface.co/datasets/ipipan/maupqa|MAUPQA]] datasets and the [[https://huggingface.co/ipipan/silver-retriever-base-v1|Silver Retriever]] model. Finally, I will point out further directions of work that are still open when it comes to Polish question answering.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''4 November 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Kozakoszczak''' (Deutsche Telekom)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-11-04.pdf|ZIML: A Markup Language for Regex-Friendly Linguistic Annotation]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">Attempts at building regex patterns that match information annotated in the text with embedded markup lead to prohibitively unmanageable patterns. Regex and markup combine even worse when the pattern must use distances as a matching condition because tags disrupt the text format. On the other hand, fully externalized markup preserves text format but leaves regex patterns without reference points.||
||<style="border:0;padding-left:30px;padding-bottom:5px">I introduce the Zero Insertion Markup Language (ZIML), where every combination of characters and labels in the annotated text is represented by a unique "allocharacter". Regex patterns also translate to appropriate patterns with allocharacters, preserving text span matches in standard regex engines. As the main result, ZIML extends regex semantics to include label referencing by matching allocharacters that represent them.||
||<style="border:0;padding-left:30px;padding-bottom:15px">I will give a proof of correctness for ZIML translation and demonstrate its implementation, including a user-facing pattern language that integrates labels into regex syntax. I hope to discuss potential applications of ZIML in linguistics and natural language processing. A basic understanding of model theory and regex functionality is recommended.||
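The abstract's core idea can be sketched in a toy form. This is not ZIML itself, only an assumed illustration of the allocharacter principle: each distinct combination of a character and its label set is encoded as one unique character, so text spans survive and labels become matchable by plain regex character classes.

```python
import re
from itertools import count

def zimlify(tokens):
    """Encode (char, labels) pairs so that each distinct combination of a
    character and its label set becomes one unique 'allocharacter'."""
    alloc, fresh = {}, count(0xE000)  # Unicode private-use area codepoints
    out = []
    for ch, labels in tokens:
        key = (ch, frozenset(labels))
        if key not in alloc:
            alloc[key] = chr(next(fresh))
        out.append(alloc[key])
    return "".join(out), alloc

def label_class(alloc, label):
    # regex character class matching every allocharacter carrying `label`
    chars = "".join(a for (ch, labels), a in alloc.items() if label in labels)
    return "[" + re.escape(chars) + "]"

# annotate "cat" as NOUN inside "cat sat"; spans are preserved because the
# encoding keeps exactly one character per original character
tokens = [(c, {"NOUN"}) for c in "cat"] + [(c, set()) for c in " sat"]
text, alloc = zimlify(tokens)
match = re.search(label_class(alloc, "NOUN") + "+", text)
print(match.span())  # → (0, 3)
```

A standard regex engine finds the NOUN span at positions (0, 3) without any embedded tags disturbing distances, which is the property the talk's zero-insertion design is after.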
Line 28: Line 35:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 December 2023''' (a series of short invited talks by Coventry University researchers)||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Xiaorui Jiang''', '''Opeoluwa Akinseloyin''', '''Vasile Palade''' (Coventry University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-1.pdf|Towards More Human-Effortless Systematic Review Automation]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:10px">Systematic literature review (SLR) is the standard tool for synthesising medical and clinical evidence from the ocean of publications. SLR is extremely expensive. AI can play a significant role in automating the SLR process, such as citation screening, i.e., the selection of primary studies based on title and abstract. [[http://systematicreviewtools.com/|Some tools exist]], but they suffer from tremendous obstacles, including lack of trust. In addition, a specific characteristic of systematic reviews, namely that each systematic review is a unique dataset and starts with no annotation, makes the problem even more challenging. In this study, we present some seminal but initial efforts on utilising the transfer learning and zero-shot learning capabilities of pretrained language models and large language models to solve or alleviate this challenge. Preliminary results are to be reported.||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Kacper Sówka''' (Coventry University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-2.pdf|Attack Tree Generation Using Machine Learning]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:10px">My research focuses on applying machine learning and NLP to the problem of cybersecurity attack modelling. This is done by generating "attack tree" models using public cybersecurity datasets (CVE) and training a siamese neural network to predict the relationship between individual cybersecurity vulnerabilities using a DistilBERT encoder fine-tuned using Masked Language Modelling.||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Xiaorui Jiang''' (Coventry University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-3.pdf|Towards Semantic Science Citation Index]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:10px">It is a difficult task to understand and summarise the development of scientific research areas. This task is especially cognitively demanding for postgraduate students and early-career researchers, whose main job is to identify such developments by reading a large amount of literature. Will AI help? We believe so. This short talk summarises some recent initial work on extracting the semantic backbone of a scientific area through the synergy of natural language processing and network analysis, which we believe can serve as a basis for discourse models for summarisation (in future work). As a small step from it, the second part of the talk introduces how comparison citations are utilised to improve multi-document summarisation of scientific papers.||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Xiaorui Jiang''', '''Alireza Daneshkhah''' (Coventry University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2023-12-11-4.pdf|Natural Language Processing for Automated Triaging at NHS]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">In the face of a post-COVID global economic slowdown and an ageing society, the primary care units in the National Health Service (NHS) are under increasing pressure, resulting in delays and errors in healthcare and patient management. AI can play a significant role in alleviating this investment-requirement discrepancy, especially in primary care settings. A large portion of clinical diagnosis and management can be assisted with AI tools to automate processes and reduce delays. This short presentation reports on initial studies conducted with an NHS partner on developing NLP-based solutions for the automation of clinical intention classification (to save more time for better patient treatment and management) and an early-alert application for gout flare prediction from chief complaints (to avoid delays in patient treatment and management).||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 November 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Christian Chiarcos''' (University of Augsburg)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FxiOM5zAKo8|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-11-21.pdf|Aspects of Knowledge Representation for Discourse Relation Annotation]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Semantic technologies comprise a broad set of standards and technologies including aspects of knowledge representation, information management and computational inference. In this lecture, I will describe the application of knowledge representation standards to the realm of computational discourse, and especially, the annotation of discourse relations. In particular, this includes the formal modelling of discourse relations of different theoretical frameworks by means of modular, interlinked ontologies, the machine-readable edition of discourse marker inventories with !OntoLex and techniques for the induction of discourse marker inventories.||
Line 42: Line 40:
||<style="border:0;padding-top:15px;padding-bottom:5px">'''8 January 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Danijel Korzinek''' (Polish-Japanese Academy of Information Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-01-08.pdf|ParlaSpeech – Developing Large-Scale Speech Corpora in the ParlaMint project]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The purpose of this sub-project was to develop tools and methodologies that would allow the linking of the textual corpora developed within the [[https://www.clarin.eu/parlamint|ParlaMint]] project with their corresponding audio and video footage available online. The task was naturally more involved than it may seem intuitively, and it hinged mostly on the proper alignment of very long audio (up to a full working day of parliamentary sessions) with its corresponding transcripts, while accounting for many mistakes and inaccuracies in the matching and order between the two modalities. The project was developed using fully open-source models and tools, which are available online for use in other projects of similar scope. So far, it has been used to fully prepare corpora for two languages (Polish and Croatian), but more are currently being developed.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 December 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Participants of !PolEval 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Presentation of the Shared Task results''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=cwu8YfqtnTs|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-01.pdf|Welcome to PolEval 2024]]''' (Łukasz Kobyliński, Maciej Ogrodniczuk, Filip Graliński, Ryszard Staruch, Karol Saputa) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=OnxkmpGmxP4|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-02.pdf|PolEval 2024 Task 1: Reading Comprehension]]''' (Ryszard Tuora / Aleksandra Zwierzchowska) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=9FDTOx55WMI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-03.pdf|Optimizing LLMs for Polish Reading Comprehension: A Comparative Study of Ensemble and Unified Approaches]]''' (Krzysztof Wróbel) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=_Ur9kzZ3ols|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-04.pdf|PolEval 2024 Task 2: Emotion and Sentiment Recognition]]''' (Jan Kocoń, Bartłomiej Koptyra) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=V3_z2KiVgco|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-05.pdf|Emotion and Sentiment Recognition in Polish Texts Using Large Language Models: A Winning Approach to PolEval 2024]]''' (Krzysztof Wróbel) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=59Xkzoi3TDY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-06.pdf|Ensemble as a Variance Reduction Method for Emotion and Sentiment Recognition]]''' (Tomasz Warzecha) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=ESNbPIwjfvw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-07.pdf|Emotion and Sentiment Recognition Using Ensemble Models]]''' (Jakub Kosterna) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=Ds8BkUTpcm8|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-08.pdf|Zero-shot Approach Using Bielik LLM for Emotion Recognition in Polish]]''' (Paweł Cyrta) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=lmRZn7254MY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-08.pdf|PolEval 2024 Task 3: Polish Automatic Speech Recognition Challenge]]''' (Michał Junczyk, Iwona Christop, Piotr Pęzik) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=G35l9xJWqA0|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-10.pdf|Augmenting Polish Automatic Speech Recognition System with Synthetic Data]]''' (Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz) ||
||<style="border:0;padding-left:30px;padding-bottom:15px">[[https://www.youtube.com/watch?v=uIDfc6c1TtA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-11.pdf|Exploration of training Zipformer and E-Branchformer models with Polish language BIGOS dataset]]''' (Paweł Cyrta) ||
Line 47: Line 55:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''12 February 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Tsimur Hadeliya''', '''Dariusz Kajtoch''' (Allegro ML Research)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-02-12.pdf|Evaluation and analysis of in-context learning for Polish classification tasks]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">With the advent of language models such as ChatGPT, we are witnessing a paradigm shift in the way we approach natural language processing tasks. Instead of training a model from scratch, we can now solve tasks by designing appropriate prompts and choosing suitable demonstrations as input to a generative model. This approach, known as in-context learning (ICL), has shown remarkable capabilities for classification tasks in the English language. In this presentation, we will investigate how different language models perform on Polish classification tasks using the ICL approach. We will explore the effectiveness of various models, including multilingual and large-scale models, and compare their results with existing solutions. Through a comprehensive evaluation and analysis, we aim to gain insights into the strengths and limitations of this approach for Polish classification tasks. Our findings will shed light on the potential of ICL for the Polish language. We will discuss challenges and opportunities, and propose directions for future work.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''19 December 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Przybyła''' (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=xqDkbiF4izI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-12-19.pdf|Adaptive Attacks on Misinformation Detection Using Reinforcement Learning]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will cover XARELLO: a generator of adversarial examples for testing the robustness of text classifiers, based on reinforcement learning. This solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. It reflects the behaviour of a persistent and experienced attacker, as is common in the misinformation-spreading environment. We will cover the evaluation of the approach using several victim classifiers and credibility-assessment tasks, showing that it generates better-quality examples with fewer queries and is especially effective against modern LLMs.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''17 February 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alicja Martinek''' (NASK National Research Institute, AGH University of Kraków), '''Ewelina Bartuzi-Trokielewicz''' (NASK National Research Institute, Warsaw University of Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=rCzTBQYkooI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-02-17.pdf|Detecting deepfakes and false ads through analysis of text and social engineering techniques]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Existing deepfake detection algorithms frequently fail to successfully identify fabricated materials. These algorithms primarily focus on technical analysis of video and audio, often neglecting the meaning of the content itself. In this paper, we introduce a novel approach that emphasizes the analysis of text-based transcripts, particularly those from AI-generated deepfake advertisements, placing the text content at the center of attention. Our method combines linguistic features, evaluation of grammatical mistakes, and the identification of social engineering techniques commonly used in fraudulent content. By examining stylistic inconsistencies and manipulative language patterns, we enhance the accuracy of distinguishing between real and deepfake materials. To ensure interpretability, we employed classical machine learning models, allowing us to provide explainable insights into decision-making processes. Additionally, zero-shot evaluations were conducted using three large language model-based solutions to assess their performance in detecting deepfake content. The experimental results show that these factors yield 90% accuracy in distinguishing between deepfake-based fraudulent advertisements and real ones. This demonstrates the effectiveness of incorporating content-based analysis into deepfake detection, offering a complementary layer to existing audio-visual techniques.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''24 March 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Rapacz''', '''Aleksander Smywiński-Pohl''' (AGH University of Krakow) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FZzPMTa2cYA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-03-24.pdf|Interlinear Translation of Ancient Greek Texts: How Morphological Tags Enhance Machine Translation Quality]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">Interlinear translation prioritizes preserving the original syntactic structure by placing target language words directly below their source text counterparts, maintaining the original word order rather than natural fluency. Although interlinear translations often deviate from the linguistic norms of the target language, they serve as a valuable tool for those wishing to deeply understand texts in their original form, especially in the case of sacred and ancient texts.||
||<style="border:0;padding-left:30px;padding-bottom:5px">In our research, we conducted the first attempt to apply machine translation to generate interlinear translations from Ancient Greek to Polish and English. We compared the performance of specialized models (!GreTa, !PhilTa) pretrained on Ancient Greek texts with a general-purpose multilingual model (mT5). We examined 144 different model configurations, manipulating the base model, morphological tag encoding method, tag set, and text normalization approach, using the Greek New Testament texts as our corpus.||
||<style="border:0;padding-left:30px;padding-bottom:15px">During the presentation, we will describe our research methodology and discuss the results. The best results were achieved by models in which we implemented new dedicated embedding layers for encoding morphological information, which yielded results up to 35–38% better (BLEU) compared to the baseline scenario. An additional detailed study showed that !PhilTa performs better than mT5, particularly in scenarios with limited data availability. !PhilTa achieved the highest results in translation to English (60.40 BLEU), while mT5-large performed best with Polish (59.33 BLEU).||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 April 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Ryszard Staruch''', '''Filip Graliński''' (Adam Mickiewicz University in Poznań)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=xRDXmKoEiOQ|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-04-14.pdf|Leveraging Large Language Models for the Grammatical Error Correction Task]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Large Language Models (LLMs) currently represent the state-of-the-art in many natural language processing tasks. However, their effectiveness in correcting language errors in texts written in Polish remains unclear. To address this gap, a dedicated dataset for Polish text correction has been developed. During the talk, this dataset will be presented along with the evaluation results of selected LLM-based solutions. In the second part of the seminar, new techniques for adapting LLMs to the task of minimal-edit text correction will be discussed, focusing on texts written by language learners — using English as a case study.||
Line 53: Line 78:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''29 February 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Post-workshop seminar on the analysis of parliamentary data''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|All talks in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Introduction'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">12:00–12:10: '''Welcome'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">12:10–12:40: ||||'''Maciej Ogrodniczuk''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:10px"> ||||'''[[attachment:seminarium-archiwum/2024-02-29-1.pdf|Polish Parliamentary Corpus and ParlaMint corpus]]'''||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Competition Entries'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">12:40–13:00: '''[[attachment:seminarium-archiwum/2024-02-29-2.pdf|Application to analyse the sentiment of utterances of Polish MPs]]''' (Bartłomiej Klimowski)||
||<style="border:0;padding-left:30px;padding-bottom:0px">13:00–13:20: '''[[attachment:seminarium-archiwum/2024-02-29-3.pdf|Analysis of the dynamics of emotions in parliamentary debates about the war in Ukraine]]''' (Konrad Kiljan and Ewelina Gajewska)||
||<style="border:0;padding-left:30px;padding-bottom:10px">13:20–13:40: '''[[attachment:seminarium-archiwum/2024-02-29-4.pdf|Gender-fair language in the Polish parliament: a corpus-based study of parliamentary debates in the ParlaMint corpus]]''' (Aleksandra Tomaszewska and Anna Jamka)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Invited Talk'''||
||<style="border:0;padding-left:30px;padding-bottom:10px">14:00–15:00: '''[[attachment:seminarium-archiwum/2024-02-29-5.pdf|Changes in the Polish language of the last hundred years in the mirror of parliamentary debates]]''' (Marek Łaziński)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Panel Discussion'''||
||<style="border:0;padding-left:30px;padding-bottom:10px">15:00–15:45: '''Parliamentary data processing: what next?''' (Members of the Competition Committee)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Conclusion'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">15:45–15:50: '''Diploma presentation'''||
||<style="border:0;padding-left:30px;padding-bottom:15px">15:50–16:00: '''Summary of the workshop'''||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 April 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Manfred Stede''' (Universität Potsdam)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FNJIyX6GmCY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-04-28.pdf|Discourse structure in the Potsdam Commentary Corpus: Human annotation, human disagreement, and automatic parsing]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The talk gives a brief introduction to Rhetorical Structure Theory (RST, [[https://www.sfu.ca/rst/05bibliographies/bibs/Mann_Thompson_1988.pdf|Mann/Thompson 1988]]) and then explains the design decisions for the Potsdam Commentary Corpus (PCC), which brings together RST, coreference, and other annotation layers on 175 German news editorials. After illustrating cross-layer queries on the corpus in the ANNIS linguistic database, we turn to the intricacies of manual RST annotation. I will give an overview of the annotation guidelines and their motivations, and present results from an (ongoing) study on annotator disagreements, from which we derive ideas for redesigning the annotation scheme (and potentially the underlying theory), with a comparison to the recent proposal of "eRST" by [[https://direct.mit.edu/coli/article/51/1/23/124464/eRST-A-Signaled-Graph-Theory-of-Discourse|Zeldes et al. (2025)]]. In the last part of the talk, I outline our results on automatic parsing using the system by [[https://aclanthology.org/P14-1002/|Ji and Eisenstein (2014)]].||
Line 71: Line 83:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''25 March 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Przybyła''' (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-03-25.pdf|Are text credibility classifiers robust to adversarial actions?]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Automatic text classifiers are widely used to help moderate content on platforms hosting user-generated text, especially social networks. They can be employed to filter out unfriendly, misinforming, manipulative or simply illegal information. However, we have to remember that the authors of such texts often have a strong motivation to spread them and might try to modify the original content until they find a reformulation that gets through automatic filters. Such modified variants of the original data, called adversarial examples, play a crucial role in analyzing the robustness of ML models to attacks by motivated actors. The presentation will be devoted to a systematic analysis of this problem in the context of detecting misinformation. I am going to show concrete examples where the replacement of trivial words causes a change in a classifier's decision, as well as the BODEGA framework for robustness analysis, used in the InCrediblAE shared task at the [[https://checkthat.gitlab.io/clef2024/task6/|CheckThat! evaluation lab at CLEF 2024]].||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''26 May 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Deniz Zeyrek''' (Middle East Technical University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Building monolingual and multilingual discourse banks and implications for discourse structure''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will give an overview of the Turkish Discourse Bank (TDB) and the TED-MDB (TED Multilingual Discourse Bank), both annotated at the discourse level by native speakers. The TDB is a resource of over 3800 implicitly or explicitly conveyed discourse relations built over a multi-genre corpus of 40,000 words. The TED-MDB is a multilingual corpus of six English TED talks with translations into five languages (Turkish, Polish, European Portuguese, Russian, and German, recently extended to a sixth language, Lithuanian), with about 600 relation annotations per language. While both corpora follow the rules and principles of the Penn Discourse Treebank (PDTB), they also consider the language-specific characteristics of individual languages. I will summarize the characteristics of both corpora and the work of our research team in which these corpora are exploited, discussing implications for discourse structure.||
Line 76: Line 88:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 March 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Krzysztof Węcel''' (Poznań University of Economics and Business)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-03-28.pdf|Credibility of information in the context of fact-checking process]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will focus on the topics of the !OpenFact project, which is a response to the problem of fake news. As part of the project, we develop methods that allow us to verify the credibility of information. In order to ensure methodological correctness, we rely on the process used by fact-checking agencies. These activities are based on complex data sets obtained, among others, from !ClaimReview and Common Crawl, or by monitoring social media and extracting statements from texts. It is also important to evaluate information in terms of its checkworthiness and the credibility of sources, whose reputation may result from publications indexed in !OpenAlex or Crossref. Stylometric analysis allows us to determine authorship, and the comparison of human and machine writing opens up new possibilities in detecting the use of artificial intelligence. We use local small language models as well as remote LLMs in various scenarios. We have built large sets of statements that can be used to verify new texts by examining semantic similarity. They are described with additional, constantly expanded metadata allowing for the implementation of various use cases.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 June 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Ogrodniczuk''', '''Aleksandra Tomaszewska''', '''Bartosz Żuk''', '''Alina Wróblewska''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''The title of the talk (on the Polish Large Language Model) will be given shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The summary of the talk will be given shortly.||
Line 81: Line 93:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''25 April 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Seminar summarising the work on the Corpus of Modern Polish (Decade 2011-2020)'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">11:30–12:00: '''The Corpus of Modern Polish, Decade 2011-2020 — introduction''' (Marek Łaziński)||
||<style="border:0;padding-left:30px;padding-bottom:0px">12:00–12:30: '''Hybrid representation of syntactic information in KWJP''' (Marcin Woliński)||
||<style="border:0;padding-left:30px;padding-bottom:0px">12:30–13:00: Coffee break||
||<style="border:0;padding-left:30px;padding-bottom:0px">13:00–13:30: '''How to search effectively for information in the KWJP'''||
||<style="border:0;padding-left:30px;padding-bottom:15px">13:30–14:00: '''Discussion on the future of corpora'''||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''23 June 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Aleksandra Tomaszewska''', '''Bartosz Żuk''', '''Dariusz Czerski''', '''Maciej Ogrodniczuk''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''The title of the talk (on the NeoN tool for detecting lexical innovations) will be given shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The summary of the talk will be given shortly.||
Line 89: Line 98:
||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2023]].|| ||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2024]].||

Natural Language Processing Seminar 2024–2025

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube.

7 October 2024

Janusz S. Bień (University of Warsaw, professor emeritus)

https://www.youtube.com/watch?v=2mLYixXC_Hw Identifying glyphs in some 16th century fonts: a case study  Talk in Polish.

Some glyphs from 16th century fonts, described in the monumental work “Polonia Typographica Saeculi Sedecimi”, can be more or less easily identified with Unicode standard characters. Some glyphs don't have Unicode codepoints but can be printed with appropriate OpenType/TrueType fonts using typographic features. For some of them, the encoding remains an open question. Some examples will be discussed.

14 October 2024

Alexander Rosen (Charles University in Prague)

https://www.youtube.com/watch?v=E2ujmqt7Q2E Lexical and syntactic variability of languages and text genres. A corpus-based study  Talk in English.

This study examines metrics of syntactic complexity (SC) and lexical diversity (LD) as tools for analyzing linguistic variation within and across languages. Using quantifiable measures based on cross-linguistically consistent (morpho)syntactic annotation (Universal Dependencies), the research utilizes parallel texts from a large multilingual corpus (InterCorp). Six SC and two LD metrics – covering the length and embedding levels of nominal and clausal constituents, mean dependency distance (MDD), and sentence length – are applied as metadata for sentences and texts.
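Of the metrics listed, mean dependency distance is the most mechanical to compute. A minimal sketch, assuming the CoNLL-U convention of Universal Dependencies (1-based head indices, 0 for the root) rather than anything specific to this study:

```python
def mean_dependency_distance(heads):
    """MDD over one sentence: heads[i] is the 1-based index of the
    head of token i+1; the root (head 0) is skipped."""
    dists = [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]
    return sum(dists) / len(dists)

# "She reads books" with 'reads' as root: She->reads, books->reads
print(mean_dependency_distance([2, 0, 2]))  # 1.0
```

Corpus-level variants differ in whether they average per sentence or over all dependencies; the talk's exact aggregation is not specified here.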

The presentation will address how these metrics can be visualized and incorporated into corpus queries, how they reflect structural differences across languages and text types, and whether SC and LD vary more across languages or text types. It will also consider the impact of language-specific annotation nuances and correlations among the measures. The analysis includes comparative examples from Polish, Czech, and other languages.

Preliminary findings indicate higher SC in non-fiction compared to fiction across languages, with nominal and clausal metrics being dominant factors. The results suggest distinct patterns for MDD and sentence length, highlighting the impact of structural differences (e.g., analytic vs. synthetic morphology, dominant word-order patterns) and the influence of source text type and style.

28 October 2024

Rafał Jaworski (Adam Mickiewicz University in Poznań)

https://www.youtube.com/watch?v=52LZ976imBA Framework for aligning and storing of multilingual word embeddings for the needs of translation probability computation  Talk in Polish.

The presentation will cover my research in the field of natural language processing for computer-aided translation. In particular, I will present the Inter-language Vector Space algorithm set for aligning sentences at the word and phrase level using multilingual word embeddings.

The first function of the set is used to generate vector representations of words. They are generated using an auto-encoder neural network based on text data – a text corpus. In this way vector dictionaries for individual languages are created. The vector representations of words in these dictionaries constitute vector spaces that differ between languages.

To solve this problem and obtain vector representations of words that are comparable between languages, the second function of the Inter-language Vector Space set is used. It aligns the vector spaces of two languages using a transformation matrix calculated with the singular value decomposition method. This matrix is calculated based on homonyms, i.e. words written identically in the languages of spaces X and Y. Additionally, a bilingual dictionary is used to improve the results. The transformation matrix calculated in this way allows for adjusting space X so that it overlaps space Y to the maximum possible extent.
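The SVD-based alignment described above is, in essence, the orthogonal Procrustes problem. A minimal numpy sketch of that single step (the anchor-word setup and all names are illustrative assumptions, not the author's code):

```python
import numpy as np

def alignment_matrix(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F (Procrustes via SVD).
    Rows of X and Y are embeddings of the same anchor words
    (homonyms / dictionary pairs) in the two spaces."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 3))                   # target-language anchors
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # hidden rotation
X = Y @ R.T                                   # source space = rotated target
W = alignment_matrix(X, Y)
# X @ W now overlaps Y, so vectors are comparable across the spaces
```

In the toy example the source space is an exact rotation of the target, so the recovered W maps it back perfectly; with real embeddings the overlap is only approximate.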

The last function of the set is responsible for creating a multilingual vector space. The vector space for the English language is first added to this space in its entirety and without modification. Then, for each remaining vector space, the transformation matrix mapping it to the English space is calculated. The vectors of the new space are multiplied by this matrix and thus become comparable to the vectors representing English words.

The Inter-language Vector Space algorithm set is used in translation support systems, for example in the author's algorithm for automatic transfer of untranslated tags from the source sentence to the target one.

4 November 2024

Jakub Kozakoszczak (Deutsche Telekom)

http://zil.ipipan.waw.pl/seminarium-online ZIML: A Markup Language for Regex-Friendly Linguistic Annotation  Talk in English.

Attempts at building regex patterns that match information annotated in the text with embedded markup lead to prohibitively unmanageable patterns. Regex and markup combine even worse when the pattern must use distances as a matching condition because tags disrupt the text format. On the other hand, fully externalized markup preserves text format but leaves regex patterns without reference points.

I introduce the Zero Insertion Markup Language (ZIML), where every combination of characters and labels in the annotated text is represented by a unique "allocharacter". Regex patterns also translate to appropriate patterns with allocharacters, preserving text span matches in standard regex engines. As the main result, ZIML extends regex semantics to include label referencing by matching allocharacters that represent them.

I will give a proof of correctness for ZIML translation and demonstrate its implementation, including a user-facing pattern language that integrates labels into regex syntax. I hope to discuss potential applications of ZIML in linguistics and natural language processing. A basic understanding of model theory and regex functionality is recommended.
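The allocharacter idea from the abstract can be caricatured in a few lines: map each (character, label set) combination to a unique private-use codepoint, so a standard regex engine can match on labels. This is a toy illustration only; none of the names below come from ZIML itself:

```python
import re

PUA = 0xE000   # private-use area holds the allocharacter alphabet
table = {}     # (char, labels) -> allocharacter

def allo(char, labels):
    key = (char, frozenset(labels))
    if key not in table:
        table[key] = chr(PUA + len(table))
    return table[key]

# annotated text: each character carries its set of labels
annotated = [("J", {"PER"}), ("o", {"PER"}), (" ", set()),
             ("r", set()), ("u", set()), ("n", set()), ("s", set())]
encoded = "".join(allo(c, ls) for c, ls in annotated)

# a "label-aware" pattern: one or more characters annotated as PER
per = "".join(v for (c, ls), v in table.items() if "PER" in ls)
match = re.search("[" + re.escape(per) + "]+", encoded)
# match covers positions 0-2, the "Jo" span labelled PER
```

Because the encoded string has one allocharacter per original character, span offsets from the regex engine map directly back to the annotated text, which is the property the talk builds on.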

21 November 2024

Christian Chiarcos (University of Augsburg)

https://www.youtube.com/watch?v=FxiOM5zAKo8 Aspects of Knowledge Representation for Discourse Relation Annotation  Talk in English.

Semantic technologies comprise a broad set of standards and technologies including aspects of knowledge representation, information management and computational inference. In this lecture, I will describe the application of knowledge representation standards to the realm of computational discourse, and especially, the annotation of discourse relations. In particular, this includes the formal modelling of discourse relations of different theoretical frameworks by means of modular, interlinked ontologies, the machine-readable edition of discourse marker inventories with OntoLex and techniques for the induction of discourse marker inventories.

2 December 2024

Participants of PolEval 2024

Presentation of the Shared Task results  Talk in Polish. Slides in English.

https://www.youtube.com/watch?v=cwu8YfqtnTs Welcome to PolEval 2024 (Łukasz Kobyliński, Maciej Ogrodniczuk, Filip Graliński, Ryszard Staruch, Karol Saputa)

https://www.youtube.com/watch?v=OnxkmpGmxP4 PolEval 2024 Task 1: Reading Comprehension (Ryszard Tuora / Aleksandra Zwierzchowska)

https://www.youtube.com/watch?v=9FDTOx55WMI Optimizing LLMs for Polish Reading Comprehension: A Comparative Study of Ensemble and Unified Approaches (Krzysztof Wróbel)

https://www.youtube.com/watch?v=_Ur9kzZ3ols PolEval 2024 Task 2: Emotion and Sentiment Recognition (Jan Kocoń, Bartłomiej Koptyra)

https://www.youtube.com/watch?v=V3_z2KiVgco Emotion and Sentiment Recognition in Polish Texts Using Large Language Models: A Winning Approach to PolEval 2024 (Krzysztof Wróbel)

https://www.youtube.com/watch?v=59Xkzoi3TDY Ensemble as a Variance Reduction Method for Emotion and Sentiment Recognition (Tomasz Warzecha)

https://www.youtube.com/watch?v=ESNbPIwjfvw Emotion and Sentiment Recognition Using Ensemble Models (Jakub Kosterna)

https://www.youtube.com/watch?v=Ds8BkUTpcm8 Zero-shot Approach Using Bielik LLM for Emotion Recognition in Polish (Paweł Cyrta)

https://www.youtube.com/watch?v=lmRZn7254MY PolEval 2024 Task 3: Polish Automatic Speech Recognition Challenge (Michał Junczyk, Iwona Christop, Piotr Pęzik)

https://www.youtube.com/watch?v=G35l9xJWqA0 Augmenting Polish Automatic Speech Recognition System with Synthetic Data (Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz)

https://www.youtube.com/watch?v=uIDfc6c1TtA Exploration of training Zipformer and E-Branchformer models with Polish language BIGOS dataset (Paweł Cyrta)

19 December 2024

Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=xqDkbiF4izI Adaptive Attacks on Misinformation Detection Using Reinforcement Learning  Talk in English.

The presentation will cover XARELLO: a generator of adversarial examples for testing the robustness of text classifiers, based on reinforcement learning. The solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. It reflects the behaviour of a persistent and experienced attacker, of the kind common in the misinformation-spreading environment. We will cover the evaluation of the approach using several victim classifiers and credibility-assessment tasks, showing that it generates better-quality examples with fewer queries and is especially effective against modern LLMs.

17 February 2025

Alicja Martinek (NASK National Research Institute, AGH University of Kraków), Ewelina Bartuzi-Trokielewicz (NASK National Research Institute, Warsaw University of Technology)

https://www.youtube.com/watch?v=rCzTBQYkooI Detecting deepfakes and false ads through analysis of text and social engineering techniques  Talk in Polish.
