| Size: 3129 Comment:  | Size: 21454 Comment:  | 
| Deletions are marked like this. | Additions are marked like this. | 
| Line 3: | Line 3: | 
| = Natural Language Processing Seminar 2016–2017 = | = Natural Language Processing Seminar 2024–2025 = | 
| Line 5: | Line 5: | 
| ||<style="border:0;padding:0">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| | ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pjl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| | 
| Line 7: | Line 7: | 
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''10 October 2016'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Katarzyna Pakulska''', '''Barbara Rychalska''', '''Krystyna Chodorowska''', '''Wojciech Walczak''', '''Piotr Andruszkiewicz''' (Samsung)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Paraphrase Detection Ensemble – !SemEval 2016 winner'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">This seminar describes the winning solution designed for a core track within the !SemEval 2016 English Semantic Textual Similarity (STS) task. The goal of the competition was to measure semantic similarity between two given sentences on a scale from 0 to 5. At the same time the solution should replicate human language understanding. The presented model is a novel hybrid of recursive auto-encoders from deep learning (RAE) and a !WordNet award-penalty system, enriched with a number of other similarity models and features used as input for Linear Support Vector Regression.|| | ||<style="border:0;padding-top:5px;padding-bottom:5px">'''7 October 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Janusz S. Bień''' (University of Warsaw, profesor emeritus) || ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=2mLYixXC_Hw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-07.pdf|Identifying glyphs in some 16th century fonts: a case study]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Some glyphs from 16th century fonts, described in the monumental work “[[https://crispa.uw.edu.pl/object/files/754258/display/Default|Polonia Typographica Saeculi Sedecimi]]”, can be more or less easily identified with the Unicode standard characters. Some glyphs don't have Unicode codepoints, but can be printed with an appropriate !OpenType/TrueType fonts using typographic features. For some of them their encoding remains an open question. Some examples will be discussed.|| | 
| Line 12: | Line 12: | 
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''24 October 2016'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Adam Przepiórkowski, Jakub Kozakoszczak, Jan Winkowski, Daniel Ziembicki, Tadeusz Teleżyński''' (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Corpus of formalized steps of textual entailment'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Description will be available shortly.|| | ||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 October 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alexander Rosen''' (Charles University in Prague)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=E2ujmqt7Q2E|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-14.pdf|Lexical and syntactic variability of languages and text genres. A corpus-based study]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:5px">This study examines metrics of syntactic complexity (SC) and lexical diversity (LD) as tools for analyzing linguistic variation within and across languages. Using quantifiable measures based on cross-linguistically consistent (morpho)syntactic annotation ([[https://universaldependencies.org/|Universal Dependencies]]), the research utilizes parallel texts from a large multilingual corpus ([[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze16ud|InterCorp]]). Six SC and two LD metrics – covering the length and embedding levels of nominal and clausal constituents, mean dependency distance (MDD), and sentence length – are applied as metadata for sentences and texts.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">The presentation will address how these metrics can be visualized and incorporated into corpus queries, how they reflect structural differences across languages and text types, and whether SC and LD vary more across languages or text types. It will also consider the impact of language-specific annotation nuances and correlations among the measures. The analysis includes comparative examples from Polish, Czech, and other languages.|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Preliminary findings indicate higher SC in non-fiction compared to fiction across languages, with nominal and clausal metrics being dominant factors. The results suggest distinct patterns for MDD and sentence length, highlighting the impact of structural differences (e.g., analytic vs. synthetic morphology, dominant word-order patterns) and the influence of source text type and style.|| | 
| Line 17: | Line 19: | 
| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''7 November 2016'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Norbert Ryciak, Aleksander Wawer''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Using recursive deep neural networks and syntax to compute phrase semantics'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Description will be available shortly.|| | ||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 October 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Rafał Jaworski''' (Adam Mickiewicz University in Poznań)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=52LZ976imBA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-28.pdf|Framework for aligning and storing of multilingual word embeddings for the needs of translation probability computation]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:5px">The presentation will cover my research in the field of natural language processing for computer-aided translation. In particular, I will present the Inter-language Vector Space algorithm set for aligning sentences at the word and phrase level using multilingual word embeddings.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">The first function of the set is used to generate vector representations of words. They are generated using an auto-encoder neural network based on text data – a text corpus. In this way vector dictionaries for individual languages are created. The vector representations of words in these dictionaries constitute vector spaces that differ between languages.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">To solve this problem and obtain vector representations of words that are comparable between languages, the second function of the Inter-language Vector Space set is used. It is used to align vector spaces between languages using transformation matrices calculated using the singular value decomposition method. This matrix is calculated based on homonyms, i.e. words written identically in the language of space X and Y. Additionally, a bilingual dictionary is used to improve the results. The transformation matrix calculated in this way allows for adjusting space X in such a way that it overlaps space Y to the maximum possible extent.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">The last function of the set is responsible for creating a multilingual vector space. The vector space for the English language is first added to this space in its entirety and without modification. Then, for each other vector space, the transformation matrix of this space to the English space is first calculated. The vectors of the new space are multiplied by this matrix and thus become comparable to the vectors representing English words.|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The Inter-language Vector Space algorithm set is used in translation support systems, for example in the author's algorithm for automatic transfer of untranslated tags from the source sentence to the target one.|| | 
| Line 22: | Line 28: | 
| ||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given between 2000 and 2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015-16]].|| | ||<style="border:0;padding-top:5px;padding-bottom:5px">'''4 November 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Kozakoszczak''' (Deutsche Telekom)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-11-04.pdf|ZIML: A Markup Language for Regex-Friendly Linguistic Annotation]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:5px">Attempts at building regex patterns that match information annotated in the text with embedded markup lead to prohibitively unmanageable patterns. Regex and markup combine even worse when the pattern must use distances as a matching condition because tags disrupt the text format. On the other hand, fully externalized markup preserves text format but leaves regex patterns without reference points.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">I introduce the Zero Insertion Markup Language (ZIML), where every combination of characters and labels in the annotated text is represented by a unique "allocharacter". Regex patterns also translate to appropriate patterns with allocharacters, preserving text span matches in standard regex engines. As the main result, ZIML extends regex semantics to include label referencing by matching allocharacters that represent them.|| ||<style="border:0;padding-left:30px;padding-bottom:15px">I will give a proof of correctness for ZIML translation and demonstrate its implementation, including a user-facing pattern language that integrates labels into regex syntax. I hope to discuss potential applications of ZIML in linguistics and natural language processing. A basic understanding of model theory and regex functionality is recommended.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 November 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Christian Chiarcos''' (University of Augsburg)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FxiOM5zAKo8|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-11-21.pdf|Aspects of Knowledge Representation for Discourse Relation Annotation]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Semantic technologies comprise a broad set of standards and technologies including aspects of knowledge representation, information management and computational inference. In this lecture, I will describe the application of knowledge representation standards to the realm of computational discourse, and especially, the annotation of discourse relations. In particular, this includes the formal modelling of discourse relations of different theoretical frameworks by means of modular, interlinked ontologies, the machine-readable edition of discourse marker inventories with !OntoLex and techniques for the induction of discourse marker inventories.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 December 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Participants of !PolEval 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Presentation of the Shared Task results'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|||| ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=cwu8YfqtnTs|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-01.pdf|Welcome to PolEval 2024]]''' (Łukasz Kobyliński, Maciej Ogrodniczuk, Filip Graliński, Ryszard Staruch, Karol Saputa) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=OnxkmpGmxP4|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-02.pdf|PolEval 2024 Task 1: Reading Comprehension]]''' (Ryszard Tuora / Aleksandra Zwierzchowska) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=9FDTOx55WMI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-03.pdf|Optimizing LLMs for Polish Reading Comprehension: A Comparative Study of Ensemble and Unified Approaches]]''' (Krzysztof Wróbel) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=_Ur9kzZ3ols|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-04.pdf|PolEval 2024 Task 2: Emotion and Sentiment Recognition]]''' (Jan Kocoń, Bartłomiej Koptyra) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=V3_z2KiVgco|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-05.pdf|Emotion and Sentiment Recognition in Polish Texts Using Large Language Models: A Winning Approach to PolEval 2024]]''' (Krzysztof Wróbel) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=59Xkzoi3TDY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-06.pdf|Ensemble as a Variance Reduction Method for Emotion and Sentiment Recognition]]''' (Tomasz Warzecha) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=ESNbPIwjfvw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-07.pdf|Emotion and Sentiment Recognition Using Ensemble Models]]''' (Jakub Kosterna) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=Ds8BkUTpcm8|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-08.pdf|Zero-shot Approach Using Bielik LLM for Emotion Recognition in Polish]]''' (Paweł Cyrta) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=lmRZn7254MY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-08.pdf|PolEval 2024 Task 3: Polish Automatic Speech Recognition Challenge]]''' (Michał Junczyk, Iwona Christop, Piotr Pęzik) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=G35l9xJWqA0|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-10.pdf|Augmenting Polish Automatic Speech Recognition System with Synthetic Data]]''' (Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz) || ||<style="border:0;padding-left:30px;padding-bottom:15px">[[https://www.youtube.com/watch?v=uIDfc6c1TtA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-11.pdf|Exploration of training Zipformer and E-Branchformer models with Polish language BIGOS dataset]]''' (Paweł Cyrta) || ||<style="border:0;padding-top:5px;padding-bottom:5px">'''19 December 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Przybyła''' (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=xqDkbiF4izI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-12-19.pdf|Adaptive Attacks on Misinformation Detection Using Reinforcement Learning]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will cover XARELLO: a generator of adversarial examples for testing the robustness of text classifiers based on reinforcement learning. This solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. It reflects the behaviour of a persistent and experienced attacker, which are common in the misinformation-spreading environment. We will cover the evaluation of the approach using several victim classifiers and credibility-assessment tasks, showing it generates better-quality examples with less queries, and is especially effective against the modern LLMs.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''17 February 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alicja Martinek''' (NASK National Research Institute, AGH University of Kraków), '''Ewelina Bartuzi-Trokielewicz''' (NASK National Research Institute, Warsaw University of Technology)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Detecting deepfakes and false ads through analysis of text and social engineering techniques'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Existing deepfake detection algorithm frequently fail to successfully identify fabricated materials. These algorithms primarily focus on technical analysis of video and audio, often neglecting the meaning of content itself. In this paper, we introduce a novel approach that emphasizes the analysis of text-based transcripts, particularly those from AI-generated deepfake advertisements, placing the text content at the center of attention. Our method combines linguistic features, evaluation of grammatical mistakes, and the identification of social engineering techniques commonly used in fraudulent content. By examining stylistic inconsistencies and manipulative language patterns, we enhance the accuracy of distinguishing between real and deepfake materials. To ensure interpretability, we employed classical machine learning models, allowing us to provide explainable insights into decision-making processes. Additionally, zero-shot evaluations were conducted using three large language model based solutions to assess their performance in detecting deepfake content. The experimental results show that these factors yield a 90\% accuracy in distinguishing between deepfake-based fraudulent advertisements and real ones. This demonstrates the effectiveness of incorporating content-based analysis into deepfake detection, offering a complementary layer to existing audio-visual techniques.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''23 March 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Rapacz''', '''Aleksander Smywiński-Pohl''' (AGH University in Cracow) || ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Interlinear Translation of Ancient Greek Texts: How Morphological Tags Enhance Machine Translation Quality'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available soon.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 April 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Ryszard Staruch''', '''Filip Graliński''' (Adam Mickiewicz University in Poznań)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Talk title will be given shortly'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available soon.|| ||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2024]].|| {{{#!wiki comment ||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 March 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Mateusz Krubiński''' (Charles University in Prague)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Talk title will be given shortly'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available soon.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 April 2020'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Stan Matwin''' (Dalhousie University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Efficient training of word embeddings with a focus on negative examples'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">This presentation is based on our [[https://pdfs.semanticscholar.org/1f50/db5786913b43f9668f997fc4c97d9cd18730.pdf|AAAI 2018]] and [[https://aaai.org/ojs/index.php/AAAI/article/view/4683|AAAI 2019]] papers on English word embeddings. In particular, we examine the notion of “negative examples”, the unobserved or insignificant word-context co-occurrences, in spectral methods. we provide a new formulation for the word embedding problem by proposing a new intuitive objective function that perfectly justifies the use of negative examples. With the goal of efficient learning of embeddings, we propose a kernel similarity measure for the latent space that can effectively calculate the similarities in high dimensions. Moreover, we propose an approximate alternative to our algorithm using a modified Vantage Point tree and reduce the computational complexity of the algorithm with respect to the number of words in the vocabulary. We have trained various word embedding algorithms on articles of Wikipedia with 2.3 billion tokens and show that our method outperforms the state-of-the-art in most word similarity tasks by a good margin. We will round up our discussion with some general thought s about the use of embeddings in modern NLP.|| }}} | 
Natural Language Processing Seminar 2024–2025
| The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube. | 
| 7 October 2024 | 
| Janusz S. Bień (University of Warsaw, profesor emeritus) | 
| Some glyphs from 16th century fonts, described in the monumental work “Polonia Typographica Saeculi Sedecimi”, can be more or less easily identified with the Unicode standard characters. Some glyphs don't have Unicode codepoints, but can be printed with an appropriate OpenType/TrueType fonts using typographic features. For some of them their encoding remains an open question. Some examples will be discussed. | 
| 14 October 2024 | 
| Alexander Rosen (Charles University in Prague) | 
| 
 | 
| This study examines metrics of syntactic complexity (SC) and lexical diversity (LD) as tools for analyzing linguistic variation within and across languages. Using quantifiable measures based on cross-linguistically consistent (morpho)syntactic annotation (Universal Dependencies), the research utilizes parallel texts from a large multilingual corpus (InterCorp). Six SC and two LD metrics – covering the length and embedding levels of nominal and clausal constituents, mean dependency distance (MDD), and sentence length – are applied as metadata for sentences and texts. | 
| The presentation will address how these metrics can be visualized and incorporated into corpus queries, how they reflect structural differences across languages and text types, and whether SC and LD vary more across languages or text types. It will also consider the impact of language-specific annotation nuances and correlations among the measures. The analysis includes comparative examples from Polish, Czech, and other languages. | 
| Preliminary findings indicate higher SC in non-fiction compared to fiction across languages, with nominal and clausal metrics being dominant factors. The results suggest distinct patterns for MDD and sentence length, highlighting the impact of structural differences (e.g., analytic vs. synthetic morphology, dominant word-order patterns) and the influence of source text type and style. | 
| 28 October 2024 | 
| Rafał Jaworski (Adam Mickiewicz University in Poznań) | 
| 
 | 
| The presentation will cover my research in the field of natural language processing for computer-aided translation. In particular, I will present the Inter-language Vector Space algorithm set for aligning sentences at the word and phrase level using multilingual word embeddings. | 
| The first function of the set is used to generate vector representations of words. They are generated using an auto-encoder neural network based on text data – a text corpus. In this way vector dictionaries for individual languages are created. The vector representations of words in these dictionaries constitute vector spaces that differ between languages. | 
| To solve this problem and obtain vector representations of words that are comparable between languages, the second function of the Inter-language Vector Space set is used. It is used to align vector spaces between languages using transformation matrices calculated using the singular value decomposition method. This matrix is calculated based on homonyms, i.e. words written identically in the language of space X and Y. Additionally, a bilingual dictionary is used to improve the results. The transformation matrix calculated in this way allows for adjusting space X in such a way that it overlaps space Y to the maximum possible extent. | 
| The last function of the set is responsible for creating a multilingual vector space. The vector space for the English language is first added to this space in its entirety and without modification. Then, for each other vector space, the transformation matrix of this space to the English space is first calculated. The vectors of the new space are multiplied by this matrix and thus become comparable to the vectors representing English words. | 
| The Inter-language Vector Space algorithm set is used in translation support systems, for example in the author's algorithm for automatic transfer of untranslated tags from the source sentence to the target one. | 
| 4 November 2024 | 
| Jakub Kozakoszczak (Deutsche Telekom) | 
| 
 | 
| Attempts at building regex patterns that match information annotated in the text with embedded markup lead to prohibitively unmanageable patterns. Regex and markup combine even worse when the pattern must use distances as a matching condition because tags disrupt the text format. On the other hand, fully externalized markup preserves text format but leaves regex patterns without reference points. | 
| I introduce the Zero Insertion Markup Language (ZIML), where every combination of characters and labels in the annotated text is represented by a unique "allocharacter". Regex patterns also translate to appropriate patterns with allocharacters, preserving text span matches in standard regex engines. As the main result, ZIML extends regex semantics to include label referencing by matching allocharacters that represent them. | 
| I will give a proof of correctness for ZIML translation and demonstrate its implementation, including a user-facing pattern language that integrates labels into regex syntax. I hope to discuss potential applications of ZIML in linguistics and natural language processing. A basic understanding of model theory and regex functionality is recommended. | 
| 21 November 2024 | 
| Christian Chiarcos (University of Augsburg) | 
| 
 | 
| Semantic technologies comprise a broad set of standards and technologies including aspects of knowledge representation, information management and computational inference. In this lecture, I will describe the application of knowledge representation standards to the realm of computational discourse, and especially, the annotation of discourse relations. In particular, this includes the formal modelling of discourse relations of different theoretical frameworks by means of modular, interlinked ontologies, the machine-readable edition of discourse marker inventories with OntoLex and techniques for the induction of discourse marker inventories. | 
| 19 December 2024 | 
| Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences) | 
| 
 | 
| The presentation will cover XARELLO: a generator of adversarial examples for testing the robustness of text classifiers based on reinforcement learning. This solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. It reflects the behaviour of a persistent and experienced attacker, which are common in the misinformation-spreading environment. We will cover the evaluation of the approach using several victim classifiers and credibility-assessment tasks, showing it generates better-quality examples with less queries, and is especially effective against the modern LLMs. | 
| 14 April 2025 | 
| Ryszard Staruch, Filip Graliński (Adam Mickiewicz University in Poznań) | 
| Talk summary will be made available soon. | 
| Please see also the talks given in 2000–2015 and 2015–2024. | 





