= Natural Language Processing Seminar 2024–2025 =
||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available [[https://www.youtube.com/channel/UC5PEPpMqjAr7Pgdvq0wRn0w|on YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| | ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pjl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''7 October 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Janusz S. Bień''' (University of Warsaw, profesor emeritus) || ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=2mLYixXC_Hw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-07.pdf|Identifying glyphs in some 16th century fonts: a case study]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Some glyphs from 16th century fonts, described in the monumental work “[[https://crispa.uw.edu.pl/object/files/754258/display/Default|Polonia Typographica Saeculi Sedecimi]]”, can be more or less easily identified with the Unicode standard characters. Some glyphs don't have Unicode codepoints, but can be printed with an appropriate !OpenType/TrueType fonts using typographic features. For some of them their encoding remains an open question. Some examples will be discussed.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 October 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alexander Rosen''' (Charles University in Prague)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=E2ujmqt7Q2E|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-14.pdf|Lexical and syntactic variability of languages and text genres. A corpus-based study]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:5px">This study examines metrics of syntactic complexity (SC) and lexical diversity (LD) as tools for analyzing linguistic variation within and across languages. Using quantifiable measures based on cross-linguistically consistent (morpho)syntactic annotation ([[https://universaldependencies.org/|Universal Dependencies]]), the research utilizes parallel texts from a large multilingual corpus ([[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze16ud|InterCorp]]). Six SC and two LD metrics – covering the length and embedding levels of nominal and clausal constituents, mean dependency distance (MDD), and sentence length – are applied as metadata for sentences and texts.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">The presentation will address how these metrics can be visualized and incorporated into corpus queries, how they reflect structural differences across languages and text types, and whether SC and LD vary more across languages or text types. It will also consider the impact of language-specific annotation nuances and correlations among the measures. The analysis includes comparative examples from Polish, Czech, and other languages.|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Preliminary findings indicate higher SC in non-fiction compared to fiction across languages, with nominal and clausal metrics being dominant factors. The results suggest distinct patterns for MDD and sentence length, highlighting the impact of structural differences (e.g., analytic vs. synthetic morphology, dominant word-order patterns) and the influence of source text type and style.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''7 November 2016'''|| | ||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 October 2024'''|| |
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=52LZ976imBA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-28.pdf|Framework for aligning and storing of multilingual word embeddings for the needs of translation probability computation]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:5px">The presentation will cover my research in the field of natural language processing for computer-aided translation. In particular, I will present the Inter-language Vector Space algorithm set for aligning sentences at the word and phrase level using multilingual word embeddings.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">The first function of the set is used to generate vector representations of words. They are generated using an auto-encoder neural network based on text data – a text corpus. In this way vector dictionaries for individual languages are created. The vector representations of words in these dictionaries constitute vector spaces that differ between languages.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">To solve this problem and obtain vector representations of words that are comparable between languages, the second function of the Inter-language Vector Space set is used. It is used to align vector spaces between languages using transformation matrices calculated using the singular value decomposition method. This matrix is calculated based on homonyms, i.e. words written identically in the language of space X and Y. Additionally, a bilingual dictionary is used to improve the results. The transformation matrix calculated in this way allows for adjusting space X in such a way that it overlaps space Y to the maximum possible extent.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">The last function of the set is responsible for creating a multilingual vector space. The vector space for the English language is first added to this space in its entirety and without modification. Then, for each other vector space, the transformation matrix of this space to the English space is first calculated. The vectors of the new space are multiplied by this matrix and thus become comparable to the vectors representing English words.|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The Inter-language Vector Space algorithm set is used in translation support systems, for example in the author's algorithm for automatic transfer of untranslated tags from the source sentence to the target one.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''4 November 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Kozakoszczak''' (Deutsche Telekom)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-11-04.pdf|ZIML: A Markup Language for Regex-Friendly Linguistic Annotation]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:5px">Attempts at building regex patterns that match information annotated in the text with embedded markup lead to prohibitively unmanageable patterns. Regex and markup combine even worse when the pattern must use distances as a matching condition because tags disrupt the text format. On the other hand, fully externalized markup preserves text format but leaves regex patterns without reference points.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">I introduce the Zero Insertion Markup Language (ZIML), where every combination of characters and labels in the annotated text is represented by a unique "allocharacter". Regex patterns also translate to appropriate patterns with allocharacters, preserving text span matches in standard regex engines. As the main result, ZIML extends regex semantics to include label referencing by matching allocharacters that represent them.|| ||<style="border:0;padding-left:30px;padding-bottom:15px">I will give a proof of correctness for ZIML translation and demonstrate its implementation, including a user-facing pattern language that integrates labels into regex syntax. I hope to discuss potential applications of ZIML in linguistics and natural language processing. A basic understanding of model theory and regex functionality is recommended.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 November 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Christian Chiarcos''' (University of Augsburg)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FxiOM5zAKo8|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-11-21.pdf|Aspects of Knowledge Representation for Discourse Relation Annotation]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Semantic technologies comprise a broad set of standards and technologies including aspects of knowledge representation, information management and computational inference. In this lecture, I will describe the application of knowledge representation standards to the realm of computational discourse, and especially, the annotation of discourse relations. In particular, this includes the formal modelling of discourse relations of different theoretical frameworks by means of modular, interlinked ontologies, the machine-readable edition of discourse marker inventories with !OntoLex and techniques for the induction of discourse marker inventories.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 December 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Participants of !PolEval 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Presentation of the Shared Task results'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|||| ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=cwu8YfqtnTs|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-01.pdf|Welcome to PolEval 2024]]''' (Łukasz Kobyliński, Maciej Ogrodniczuk, Filip Graliński, Ryszard Staruch, Karol Saputa) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=OnxkmpGmxP4|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-02.pdf|PolEval 2024 Task 1: Reading Comprehension]]''' (Ryszard Tuora / Aleksandra Zwierzchowska) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=9FDTOx55WMI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-03.pdf|Optimizing LLMs for Polish Reading Comprehension: A Comparative Study of Ensemble and Unified Approaches]]''' (Krzysztof Wróbel) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=_Ur9kzZ3ols|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-04.pdf|PolEval 2024 Task 2: Emotion and Sentiment Recognition]]''' (Jan Kocoń, Bartłomiej Koptyra) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=V3_z2KiVgco|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-05.pdf|Emotion and Sentiment Recognition in Polish Texts Using Large Language Models: A Winning Approach to PolEval 2024]]''' (Krzysztof Wróbel) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=59Xkzoi3TDY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-06.pdf|Ensemble as a Variance Reduction Method for Emotion and Sentiment Recognition]]''' (Tomasz Warzecha) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=ESNbPIwjfvw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-07.pdf|Emotion and Sentiment Recognition Using Ensemble Models]]''' (Jakub Kosterna) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=Ds8BkUTpcm8|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-08.pdf|Zero-shot Approach Using Bielik LLM for Emotion Recognition in Polish]]''' (Paweł Cyrta) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=lmRZn7254MY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-08.pdf|PolEval 2024 Task 3: Polish Automatic Speech Recognition Challenge]]''' (Michał Junczyk, Iwona Christop, Piotr Pęzik) || ||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=G35l9xJWqA0|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-10.pdf|Augmenting Polish Automatic Speech Recognition System with Synthetic Data]]''' (Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz) || 
||<style="border:0;padding-left:30px;padding-bottom:15px">[[https://www.youtube.com/watch?v=uIDfc6c1TtA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-11.pdf|Exploration of training Zipformer and E-Branchformer models with Polish language BIGOS dataset]]''' (Paweł Cyrta) || |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''19 December 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Przybyła''' (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=xqDkbiF4izI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-12-19.pdf|Adaptive Attacks on Misinformation Detection Using Reinforcement Learning]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will cover XARELLO: a generator of adversarial examples for testing the robustness of text classifiers based on reinforcement learning. This solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. It reflects the behaviour of a persistent and experienced attacker, which are common in the misinformation-spreading environment. We will cover the evaluation of the approach using several victim classifiers and credibility-assessment tasks, showing it generates better-quality examples with less queries, and is especially effective against the modern LLMs.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''17 February 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alicja Martinek''' (NASK National Research Institute, AGH University of Kraków), '''Ewelina Bartuzi-Trokielewicz''' (NASK National Research Institute, Warsaw University of Technology)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=rCzTBQYkooI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-02-17.pdf|Detecting deepfakes and false ads through analysis of text and social engineering techniques]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Existing deepfake detection algorithm frequently fail to successfully identify fabricated materials. These algorithms primarily focus on technical analysis of video and audio, often neglecting the meaning of content itself. In this paper, we introduce a novel approach that emphasizes the analysis of text-based transcripts, particularly those from AI-generated deepfake advertisements, placing the text content at the center of attention. Our method combines linguistic features, evaluation of grammatical mistakes, and the identification of social engineering techniques commonly used in fraudulent content. By examining stylistic inconsistencies and manipulative language patterns, we enhance the accuracy of distinguishing between real and deepfake materials. To ensure interpretability, we employed classical machine learning models, allowing us to provide explainable insights into decision-making processes. Additionally, zero-shot evaluations were conducted using three large language model based solutions to assess their performance in detecting deepfake content. The experimental results show that these factors yield a 90\% accuracy in distinguishing between deepfake-based fraudulent advertisements and real ones. This demonstrates the effectiveness of incorporating content-based analysis into deepfake detection, offering a complementary layer to existing audio-visual techniques.|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''24 March 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Rapacz''', '''Aleksander Smywiński-Pohl''' (AGH University of Krakow) || ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FZzPMTa2cYA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-03-24.pdf|Interlinear Translation of Ancient Greek Texts: How Morphological Tags Enhance Machine Translation Quality]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:5px">Interlinear translation prioritizes preserving the original syntactic structure by placing target language words directly below their source text counterparts, maintaining the original word order rather than natural fluency. Although interlinear translations often deviate from the linguistic norms of the target language, they serve as a valuable tool for those wishing to deeply understand texts in their original form, especially in the case of sacred and ancient texts.|| ||<style="border:0;padding-left:30px;padding-bottom:5px">In our research, we conducted the first attempt to apply machine translation to generate interlinear translations from Ancient Greek to Polish and English. We compared the performance of specialized models (!GreTa, !PhilTa) pretrained on Ancient Greek texts with a general-purpose multilingual model (mT5). We examined 144 different model configurations, manipulating the base model, morphological tag encoding method, tag set, and text normalization approach, using the Greek New Testament texts as our corpus.|| ||<style="border:0;padding-left:30px;padding-bottom:15px">During the presentation, we will describe our research methodology and discuss the results. The best results were achieved by models in which we implemented new dedicated embedding layers for encoding morphological information, which yielded results up to 35-38% better (BLEU) compared to the baseline scenario. Additional detailed study showed that !PhilTa performs better than mT5, particularly in scenarios with limited data availability. !PhilTa achieved the highest results in translation to English (60.40 BLEU), while mT5-large performed best with Polish (59.33 BLEU).|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 April 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Ryszard Staruch''', '''Filip Graliński''' (Adam Mickiewicz University in Poznań)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=xRDXmKoEiOQ|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-04-14.pdf|Leveraging Large Language Models for the Grammatical Error Correction Task]]'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Large Language Models (LLMs) currently represent the state-of-the-art in many natural language processing tasks. However, their effectiveness in correcting language errors in texts written in Polish remains unclear. To address this gap, a dedicated dataset for Polish text correction has been developed. During the talk, this dataset will be presented along with the evaluation results of selected LLM-based solutions. In the second part of the seminar, new techniques for adapting LLMs to the task of minimal-edit text correction will be discussed, focusing on texts written by language learners — using English as a case study.|| |
||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given between 2000 and 2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–16]].|| | ||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 April 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Manfred Stede''' (Universität Potsdam)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FNJIyX6GmCY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-04-28.pdf|Discourse structure in the Potsdam Commentary Corpus: Human annotation, human disagreement, and automatic parsing]]'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The talk gives a brief introduction to Rhetorical Structure Theory (RST, [[https://www.sfu.ca/rst/05bibliographies/bibs/Mann_Thompson_1988.pdf|Mann/Thompson 1988]]) and then explains the design decisions for the Potsdam Commentary Corpus (PCC), which brings together RST, coreference, and other annotation layers on 175 German news editorials. After illustrating cross-layer queries on the corpus in the ANNIS linguistic database, we turn to the intricacies of manual RST annotation. I will give an overview of the annotation guidelines and their motivations, and present results from an (ongoing) study on annotator disagreements, from which we derive ideas for redesigning the annotation scheme (and potentially the underlying theory), with a comparison to the recent proposal of "eRST" by [[https://direct.mit.edu/coli/article/51/1/23/124464/eRST-A-Signaled-Graph-Theory-of-Discourse|Zeldes et al. (2025)]]. In the last part of the talk, I outline our results on automatic parsing using the system by [[https://aclanthology.org/P14-1002/|Ji and Eisenstein (2014)]].|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''26 May 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Deniz Zeyrek''' (Middle East Technical University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Building monolingual and multilingual discourse banks and implications for discourse structure'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will overview the Turkish Discourse Bank (TDB), and the TED-MDB (TED Multilingual Discourse Bank), both annotated at the discourse level by native speakers. The TDB is a resource of over 3800 implicitly or explicitly conveyed discourse relations built over a multi-genre corpus of 40.000 words. The TED-MDB is a multilingual corpus of six English TED talks with translations into five languages (Turkish, Polish, European Portuguese, Russian, and German, recently extended to a sixth language, Lithuanian) with about 600 relation annotations per language. While both corpora follow the rules and principles of the Penn Discourse Treebank (PDTB), they also consider the language-specific characteristics of individual languages. 
I will summarize the characteristics of both corpora and the work of our research team where these corpora are exploited, discussing implications on discourse structure.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 June 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Ogrodniczuk''', '''Aleksandra Tomaszewska''', '''Bartosz Żuk''', '''Alina Wróblewska''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''The title of the talk (on the Polish Large Language Model) will be given shortly'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The summary of the talk will be given shortly.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''23 June 2025'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Aleksandra Tomaszewska''', '''Bartosz Żuk''', '''Dariusz Czerski''', '''Maciej Ogrodniczuk''' (Institute of Computer Science, Polish Academy of Sciences)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''The title of the talk (on the NeoN tool for detecting lexical innovations) will be given shortly'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">The summary of the talk will be given shortly.|| ||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2024]].|| {{{#!wiki comment ||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 March 2024'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Mateusz Krubiński''' (Charles University in Prague)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Talk title will be given shortly'''  {{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available soon.|| ||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 April 2020'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Stan Matwin''' (Dalhousie University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Efficient training of word embeddings with a focus on negative examples'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">This presentation is based on our [[https://pdfs.semanticscholar.org/1f50/db5786913b43f9668f997fc4c97d9cd18730.pdf|AAAI 2018]] and [[https://aaai.org/ojs/index.php/AAAI/article/view/4683|AAAI 2019]] papers on English word embeddings. In particular, we examine the notion of “negative examples”, the unobserved or insignificant word-context co-occurrences, in spectral methods. we provide a new formulation for the word embedding problem by proposing a new intuitive objective function that perfectly justifies the use of negative examples. 
With the goal of efficient learning of embeddings, we propose a kernel similarity measure for the latent space that can effectively calculate the similarities in high dimensions. Moreover, we propose an approximate alternative to our algorithm using a modified Vantage Point tree and reduce the computational complexity of the algorithm with respect to the number of words in the vocabulary. We have trained various word embedding algorithms on articles of Wikipedia with 2.3 billion tokens and show that our method outperforms the state-of-the-art in most word similarity tasks by a good margin. We will round up our discussion with some general thought s about the use of embeddings in modern NLP.|| }}} |