
Diff for "seminar"

Differences between revisions 254 and 711 (spanning 457 versions)
Revision 254 as of 2019-05-16 14:54:45
Size: 28358
Comment:
Revision 711 as of 2025-05-05 09:17:05
Size: 28600
Comment:
Deletions are marked like this. Additions are marked like this.
Line 3: Line 3:
= Natural Language Processing Seminar 2018–2019 = = Natural Language Processing Seminar 2024–2025 =
Line 5: Line 5:
||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available [[https://www.youtube.com/channel/UC5PEPpMqjAr7Pgdvq0wRn0w|on YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]|| ||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]||
Line 7: Line 7:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''1 October 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Janusz S. Bień''' (University of Warsaw – prof. emeritus)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=mOYzwpjTAf4|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-10-01.pdf|Electronic indexes to lexicographical resources]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">We will focus on the indexes to lexicographical resources available online in !DjVu format. Such indexes can be browsed, searched, modified and created with the djview4poliqarp open source program; the origins and the history of the program will be briefly presented. Originally the index support was added to the program to handle the list of entries in the 19th century Linde's dictionary, but can be used conveniently also for other resources, as will be demonstrated on selected examples. In particular some new features, introduced to the program in the last months, will be presented publicly for the first time.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''7 October 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Janusz S. Bień''' (University of Warsaw, professor emeritus)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=2mLYixXC_Hw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-07.pdf|Identifying glyphs in some 16th century fonts: a case study]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Some glyphs from 16th century fonts, described in the monumental work “[[https://crispa.uw.edu.pl/object/files/754258/display/Default|Polonia Typographica Saeculi Sedecimi]]”, can be more or less easily identified with the Unicode standard characters. Some glyphs don't have Unicode codepoints, but can be printed with appropriate !OpenType/TrueType fonts using typographic features. For some of them their encoding remains an open question. Some examples will be discussed.||
Line 12: Line 12:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''15 October 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Wojciech Jaworski, Szymon Rutkowski''' (University of Warsaw)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=SbPAdmRmW08|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-10-15.pdf|A multilayer rule based model of Polish inflection]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will be devoted to the multilayer model of Polish inflection. The model has been developed on the basis of the Grammatical Dictionary of Polish; it does not use the concept of an inflexion paradigm. The model consists of three layers of hand-made rules: an "orthographic-phonetic layer" converting a segment to a representation reflecting morphological patterns of the language, an "analytic layer" generating the lemma and determining the affix, and an "interpretation layer" giving a morphosyntactic interpretation based on detected affixes. The model provides knowledge about the language to a morphological analyzer supplemented with the function of guessing lemmas and morphosyntactic interpretations for non-dictionary forms (guesser). The second use of the model is generation of word forms based on a lemma and morphosyntactic interpretation. The presentation will also cover the issue of disambiguation of the results provided by the morphological analyzer. The demo version of the program is available on the Internet.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 October 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alexander Rosen''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=E2ujmqt7Q2E|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-14.pdf|Lexical and syntactic variability of languages and text genres. A corpus-based study]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">This study examines metrics of syntactic complexity (SC) and lexical diversity (LD) as tools for analyzing linguistic variation within and across languages. Using quantifiable measures based on cross-linguistically consistent (morpho)syntactic annotation ([[https://universaldependencies.org/|Universal Dependencies]]), the research utilizes parallel texts from a large multilingual corpus ([[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze16ud|InterCorp]]). Six SC and two LD metrics – covering the length and embedding levels of nominal and clausal constituents, mean dependency distance (MDD), and sentence length – are applied as metadata for sentences and texts.||
||<style="border:0;padding-left:30px;padding-bottom:5px">The presentation will address how these metrics can be visualized and incorporated into corpus queries, how they reflect structural differences across languages and text types, and whether SC and LD vary more across languages or text types. It will also consider the impact of language-specific annotation nuances and correlations among the measures. The analysis includes comparative examples from Polish, Czech, and other languages.||
||<style="border:0;padding-left:30px;padding-bottom:15px">Preliminary findings indicate higher SC in non-fiction compared to fiction across languages, with nominal and clausal metrics being dominant factors. The results suggest distinct patterns for MDD and sentence length, highlighting the impact of structural differences (e.g., analytic vs. synthetic morphology, dominant word-order patterns) and the influence of source text type and style.||
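As a concrete reading of one of the measures above, the sketch below computes mean dependency distance (MDD) from UD-style 1-based head indices. This is a generic illustration of the standard formula, not the speaker's code.

{{{#!python
# Mean dependency distance (MDD): the average absolute distance between
# each dependent and its head, from 1-based head indices (0 = root, skipped).
def mdd(heads):
    dists = [abs(h - i) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(dists) / len(dists)

# "The dog barked": The->dog (head 2), dog->barked (head 3), barked->root (0)
print(mdd([2, 3, 0]))   # (|2-1| + |3-2|) / 2 = 1.0
}}}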
Line 17: Line 19:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''29 October 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Waszczuk''' (Heinrich-Heine-Universität Düsseldorf)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=zjGQRG2PNu0|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-10-29.pdf|From morphosyntactic tagging to identification of verbal multiword expressions: a discriminative approach]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The first part of the talk was dedicated to Concraft-pl 2.0, the new version of a morphosyntactic tagger for Polish based on conditional random fields. Concraft-pl 2.0 performs morphosyntactic segmentation as a by-product of disambiguation, which makes it possible to use it directly on the segmentation graphs provided by the analyser Morfeusz. This is in contrast with other existing taggers for Polish, which either neglect the problem of segmentation or rely on heuristics to perform it in a pre-processing stage. During the second part, an approach to identifying verbal multiword expressions (VMWEs) based on dependency parsing results was presented. In this approach, VMWE identification is reduced to the problem of dependency tree labeling, where one of two labels (MWE or not-MWE) must be predicted for each node in the dependency tree. The underlying labeling model can be seen as conditional random fields (as used in Concraft) adapted to tree structures. A system based on this approach ranked 1st in the closed track of the PARSEME shared task 2018.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 October 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Rafał Jaworski''' (Adam Mickiewicz University in Poznań)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=52LZ976imBA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-28.pdf|Framework for aligning and storing of multilingual word embeddings for the needs of translation probability computation]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">The presentation will cover my research in the field of natural language processing for computer-aided translation. In particular, I will present the Inter-language Vector Space algorithm set for aligning sentences at the word and phrase level using multilingual word embeddings.||
||<style="border:0;padding-left:30px;padding-bottom:5px">The first function of the set is used to generate vector representations of words. They are generated using an auto-encoder neural network based on text data – a text corpus. In this way vector dictionaries for individual languages are created. The vector representations of words in these dictionaries constitute vector spaces that differ between languages.||
||<style="border:0;padding-left:30px;padding-bottom:5px">To solve this problem and obtain vector representations of words that are comparable between languages, the second function of the Inter-language Vector Space set is used. It aligns vector spaces between languages using transformation matrices calculated with the singular value decomposition method. Each matrix is calculated from homonyms, i.e. words written identically in the languages of spaces X and Y; additionally, a bilingual dictionary is used to improve the results. The transformation matrix calculated in this way allows space X to be adjusted so that it overlaps space Y to the maximum possible extent (see the sketch below).||
||<style="border:0;padding-left:30px;padding-bottom:5px">The last function of the set is responsible for creating a multilingual vector space. The vector space for the English language is first added to this space in its entirety and without modification. Then, for each other vector space, the transformation matrix of this space to the English space is first calculated. The vectors of the new space are multiplied by this matrix and thus become comparable to the vectors representing English words.||
||<style="border:0;padding-left:30px;padding-bottom:15px">The Inter-language Vector Space algorithm set is used in translation support systems, for example in the author's algorithm for automatic transfer of untranslated tags from the source sentence to the target one.||
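A minimal sketch of the kind of orthogonal alignment described above, using the standard SVD (Procrustes) construction; this is not the author's implementation. Assumed inputs: `emb_x` and `emb_y` map words to vectors, and `pairs` lists anchor word pairs such as identically spelled words plus bilingual-dictionary entries.

{{{#!python
import numpy as np

def align(emb_x, emb_y, pairs):
    anchors = [(a, b) for a, b in pairs if a in emb_x and b in emb_y]
    X = np.stack([emb_x[a] for a, b in anchors])
    Y = np.stack([emb_y[b] for a, b in anchors])
    # SVD of the cross-covariance yields the rotation minimizing ||XW - Y||
    U, _, Vt = np.linalg.svd(X.T @ Y)
    W = U @ Vt
    return {w: v @ W for w, v in emb_x.items()}   # X mapped into Y's space
}}}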
Line 22: Line 28:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''5 November 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Kozakoszczak''' (Faculty of Modern Languages, University of Warsaw / Heinrich-Heine-Universität Düsseldorf)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=sz7dGmf8p3k|{{attachment:seminarium-archiwum/youtube.png}}]] '''Mornings to Wednesdays — semantics and normalization of Polish quasi-periodical temporal expressions''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The standard interpretations of expressions like “Januarys” and “Fridays” in temporal representation and reasoning are slices of collections of 2nd order, e.g. all the sixth elements of day sequences of cardinality 7 aligned with calendar weeks. I will present results of the work on normalizing the most frequent Polish quasi-periodical temporal expressions for online booking systems. On the linguistic side I will argue against synonymy of the kind “Fridays” = “sixth days of the weeks” and give semantic tests for rudimentary classification of quasi-periodicity. In the formal part I will propose an extension to existing formalisms covering intensional quasi-periodical operators “from”, “to”, “before” and “after” restricted to monotonic domains. In the implementation part I will demonstrate an algorithm for lazy generation of the generalized intersection of collections (a generic sketch follows below).||
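A generic sketch of lazily intersecting sorted streams, one plausible reading of the "lazy generation of generalized intersection" mentioned above; this is not the algorithm presented in the talk. Streams are assumed ascending (here, infinite).

{{{#!python
import itertools

def lazy_intersection(*streams):
    iters = [iter(s) for s in streams]
    heads = [next(it) for it in iters]
    while True:
        top = max(heads)
        for i, it in enumerate(iters):
            while heads[i] < top:        # advance to the current maximum
                heads[i] = next(it)
        if all(h == top for h in heads):
            yield top
            heads = [next(it) for it in iters]

fridays = itertools.count(4, 7)   # day numbers falling on Fridays
evens   = itertools.count(0, 2)   # even day numbers
print(list(itertools.islice(lazy_intersection(fridays, evens), 3)))  # [4, 18, 32]
}}}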
||<style="border:0;padding-top:5px;padding-bottom:5px">'''4 November 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Kozakoszczak''' (Deutsche Telekom)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-11-04.pdf|ZIML: A Markup Language for Regex-Friendly Linguistic Annotation]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">Attempts at building regex patterns that match information annotated in the text with embedded markup lead to prohibitively unmanageable patterns. Regex and markup combine even more poorly when the pattern must use distances as a matching condition, because tags disrupt the text format. On the other hand, fully externalized markup preserves the text format but leaves regex patterns without reference points.||
||<style="border:0;padding-left:30px;padding-bottom:5px">I introduce the Zero Insertion Markup Language (ZIML), where every combination of characters and labels in the annotated text is represented by a unique "allocharacter". Regex patterns also translate to appropriate patterns with allocharacters, preserving text span matches in standard regex engines. As the main result, ZIML extends regex semantics to include label referencing by matching allocharacters that represent them.||
||<style="border:0;padding-left:30px;padding-bottom:15px">I will give a proof of correctness for ZIML translation and demonstrate its implementation, including a user-facing pattern language that integrates labels into regex syntax. I hope to discuss potential applications of ZIML in linguistics and natural language processing. A basic understanding of model theory and regex functionality is recommended.||
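A toy sketch of the zero-insertion idea as described above (my reading, not the ZIML implementation): each combination of a character and its labels becomes a single private-use "allocharacter", so the encoded text keeps its length and a standard regex engine returns original-text offsets.

{{{#!python
import re

PUA = 0xE000                      # private-use block for allocharacters
alloc, table = {}, {}             # (char, labels) -> allochar, and inverse

def allochar(ch, labels=frozenset()):
    key = (ch, labels)
    if key not in alloc:
        code = chr(PUA + len(alloc))
        alloc[key] = code
        table[code] = key
    return alloc[key]

def encode(text, spans):
    # spans: (start, end, label) annotations; the encoded string keeps the
    # original length, so regex match offsets remain valid text offsets
    out = []
    for i, ch in enumerate(text):
        labels = frozenset(l for s, e, l in spans if s <= i < e)
        out.append(allochar(ch, labels))
    return "".join(out)

def labelled(label):
    # a regex character class matching every allocharacter carrying `label`
    chars = [c for c, (_, ls) in table.items() if label in ls]
    return "[" + "".join(map(re.escape, chars)) + "]"

z = encode("ala ma kota", [(7, 11, "NOUN")])
print(re.search(labelled("NOUN") + "+", z).span())   # (7, 11)
}}}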
Line 27: Line 35:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''19 November 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Daniel Zeman''' (Institute of Formal and Applied Linguistics, Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=xUmZ8Mxcmg0|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-11-19.pdf|Universal Dependencies and the Slavic Languages]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">I will present Universal Dependencies, a worldwide community effort aimed at providing multilingual corpora, annotated at the morphological and syntactic levels following unified annotation guidelines. I will discuss the concept of core arguments, one of the cornerstones of the UD framework. In the second part of the talk I will focus on some interesting problems and challenges of applying Universal Dependencies to the Slavic languages. I will discuss examples from 12 Slavic languages that are currently represented in UD and show that cross-linguistic consistency can still be improved.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 November 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Christian Chiarcos''' (University of Augsburg)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FxiOM5zAKo8|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-11-21.pdf|Aspects of Knowledge Representation for Discourse Relation Annotation]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Semantic technologies comprise a broad set of standards and technologies including aspects of knowledge representation, information management and computational inference. In this lecture, I will describe the application of knowledge representation standards to the realm of computational discourse, and especially, the annotation of discourse relations. In particular, this includes the formal modelling of discourse relations of different theoretical frameworks by means of modular, interlinked ontologies, the machine-readable edition of discourse marker inventories with !OntoLex and techniques for the induction of discourse marker inventories.||
Line 32: Line 40:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''3 December 2018'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Ekaterina Lapshinova-Koltunski''' (Saarland University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=UQ_6dDNEw8E|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2018-12-03.pdf|Analysis and Annotation of Coreference for Contrastive Linguistics and Translation Studies]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will report on the ongoing work on coreference analysis in a multilingual context. I will present two approaches in the analysis of coreference and coreference-related phenomena: (1) top-down or theory-driven: here we start from some linguistic knowledge derived from the existing frameworks, define linguistic categories to analyse and create an annotated corpus that can be used either for further linguistic analysis or as training data for NLP applications; (2) bottom-up or data-driven: in this case, we start from a set of shallow features that we believe are discourse-related. We extract these structures from a huge amount of data and analyse them from a linguistic point of view trying to describe and explain the observed phenomena from the point of view of existing theories and grammars.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 December 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Participants of !PolEval 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Presentation of the Shared Task results''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=cwu8YfqtnTs|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-01.pdf|Welcome to PolEval 2024]]''' (Łukasz Kobyliński, Maciej Ogrodniczuk, Filip Graliński, Ryszard Staruch, Karol Saputa) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=OnxkmpGmxP4|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-02.pdf|PolEval 2024 Task 1: Reading Comprehension]]''' (Ryszard Tuora / Aleksandra Zwierzchowska) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=9FDTOx55WMI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-03.pdf|Optimizing LLMs for Polish Reading Comprehension: A Comparative Study of Ensemble and Unified Approaches]]''' (Krzysztof Wróbel) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=_Ur9kzZ3ols|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-04.pdf|PolEval 2024 Task 2: Emotion and Sentiment Recognition]]''' (Jan Kocoń, Bartłomiej Koptyra) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=V3_z2KiVgco|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-05.pdf|Emotion and Sentiment Recognition in Polish Texts Using Large Language Models: A Winning Approach to PolEval 2024]]''' (Krzysztof Wróbel) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=59Xkzoi3TDY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-06.pdf|Ensemble as a Variance Reduction Method for Emotion and Sentiment Recognition]]''' (Tomasz Warzecha) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=ESNbPIwjfvw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-07.pdf|Emotion and Sentiment Recognition Using Ensemble Models]]''' (Jakub Kosterna) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=Ds8BkUTpcm8|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-08.pdf|Zero-shot Approach Using Bielik LLM for Emotion Recognition in Polish]]''' (Paweł Cyrta) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=lmRZn7254MY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-08.pdf|PolEval 2024 Task 3: Polish Automatic Speech Recognition Challenge]]''' (Michał Junczyk, Iwona Christop, Piotr Pęzik) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=G35l9xJWqA0|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-10.pdf|Augmenting Polish Automatic Speech Recognition System with Synthetic Data]]''' (Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz) ||
||<style="border:0;padding-left:30px;padding-bottom:15px">[[https://www.youtube.com/watch?v=uIDfc6c1TtA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-11.pdf|Exploration of training Zipformer and E-Branchformer models with Polish language BIGOS dataset]]''' (Paweł Cyrta) ||
Line 37: Line 55:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''7 January 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Adam Przepiórkowski''' (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw), '''Agnieszka Patejuk''' (Institute of Computer Science, Polish Academy of Sciences / University of Oxford)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2019-01-07.pdf|Enhanced Universal Dependencies]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The aim of this talk is to present the two threads of our recent work on Universal Dependencies (UD), a standard for syntactically annotated corpora (http://universaldependencies.org/). The first thread is concerned with the development of a new UD treebank of Polish, one that makes extensive use of the enhanced level of representation made available in the current UD standard. The treebank is the result of conversion from an earlier ‘treebank’ of Polish, one that was annotated with constituency and functional structures as they are understood in Lexical Functional Grammar. We will outline the conversion procedure and present the resulting UD treebank of Polish. The second thread is concerned with various inconsistencies and deficiencies of UD that we identified in the process of developing the UD treebank of Polish. We will concentrate on two particularly problematic areas in UD, namely, on the core/oblique distinction, which aims to – but does not really – replace the infamous argument/adjunct dichotomy, and on coordination, a phenomenon problematic for all dependency approaches.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''19 December 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Przybyła''' (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=xqDkbiF4izI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-12-19.pdf|Adaptive Attacks on Misinformation Detection Using Reinforcement Learning]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will cover XARELLO: a generator of adversarial examples for testing the robustness of text classifiers based on reinforcement learning. This solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. It reflects the behaviour of persistent and experienced attackers, who are common in the misinformation-spreading environment. We will cover the evaluation of the approach using several victim classifiers and credibility-assessment tasks, showing that it generates better-quality examples with fewer queries and is especially effective against modern LLMs.||
Line 42: Line 60:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 January 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agata Savary''' (François Rabelais University Tours)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2019-01-14.pdf|Literal occurrences of multiword expressions: quantitative and qualitative analyses]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Multiword expressions (MWEs) such as “to pull strings” (to use one's influence), “to take part” or “to do in” (to kill) are word combinations which exhibit lexical, syntactic, and especially semantic idiosyncrasies. They pose special challenges to linguistic modeling and computational linguistics due to their non-compositional semantics, i.e. the fact that their meaning cannot be deduced from the meanings of their components, and from their syntactic structure, in a way deemed regular for the given language. Additionally, MWEs can have both idiomatic and literal occurrences. For instance “pulling strings” can be understood either as making use of one’s influence, or literally. Even if this phenomenon has been largely addressed in psycholinguistics, linguistics and natural language processing, the notion of a literal reading has rarely been formally defined or subject to quantitative analyses. I will propose a syntax-based definition of a literal reading of a MWE. I will also present the results of a quantitative and qualitative analysis of this phenomenon in Polish, as well as in 4 typologically distinct languages: Basque, German, Greek and Portuguese. This study, performed in a multilingual annotated corpus of the [[http://www.parseme.eu|PARSEME network]], shows that literal readings constitute a rare phenomenon. We also identify some properties that may distinguish them from their idiomatic counterparts.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''17 February 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alicja Martinek''' (NASK National Research Institute, AGH University of Kraków), '''Ewelina Bartuzi-Trokielewicz''' (NASK National Research Institute, Warsaw University of Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=rCzTBQYkooI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-02-17.pdf|Detecting deepfakes and false ads through analysis of text and social engineering techniques]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Existing deepfake detection algorithms frequently fail to successfully identify fabricated materials. These algorithms primarily focus on technical analysis of video and audio, often neglecting the meaning of the content itself. In this work, we introduce a novel approach that emphasizes the analysis of text-based transcripts, particularly those from AI-generated deepfake advertisements, placing the text content at the center of attention. Our method combines linguistic features, evaluation of grammatical mistakes, and the identification of social engineering techniques commonly used in fraudulent content. By examining stylistic inconsistencies and manipulative language patterns, we enhance the accuracy of distinguishing between real and deepfake materials. To ensure interpretability, we employed classical machine learning models, allowing us to provide explainable insights into decision-making processes (a schematic sketch follows below). Additionally, zero-shot evaluations were conducted using three large language model based solutions to assess their performance in detecting deepfake content. The experimental results show that these factors yield 90% accuracy in distinguishing between deepfake-based fraudulent advertisements and real ones. This demonstrates the effectiveness of incorporating content-based analysis into deepfake detection, offering a complementary layer to existing audio-visual techniques.||
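A hypothetical, minimal sketch of an interpretable classical pipeline of the kind the abstract alludes to; TF-IDF n-grams stand in for the linguistic and social-engineering features, and the toy data and labels are made up.

{{{#!python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Invest now, celebrity-endorsed, guaranteed 10x returns!",
         "Our quarterly report discusses revenue and research plans."]
labels = [1, 0]   # 1 = fraudulent/deepfake ad, 0 = genuine content

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
# coefficients are inspectable, which is what makes such models explainable
print(clf.predict(["Guaranteed returns, act now!"]))
}}}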
Line 47: Line 65:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 January 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marek Łaziński''' (University of Warsaw), '''Michał Woźniak''' (Jagiellonian University) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2019-01-21.pdf|Aspect in dictionaries and corpora. What for and how aspect pairs should be tagged in corpora?]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Corpora are generally tagged for grammatical categories, including verbal aspect value. They all choose between pf and ipf; some of them add a third value: bi-aspectual (not present in the National Corpus of Polish). However, no Slavic corpus tags the aspect value of a verb form in reference to an aspect partner. If we can mark aspect pairs in dictionaries, it should also be possible in corpora, provided that we extrapolate aspect features of a lexeme to specific verb forms in specific uses. Retaining the existing morphological tagging including aspect value, two more aspect tags have been added: 1) morphological markers of aspect and 2) reference to a superlemma. Every verb form in the corpus thus has three parts: 1) the existing grammatical characteristics (TAKIPI), 2) the repeated or corrected aspect value (including bi-aspectual) and morphological markers, 3) the reference to the aspect pair (superlemma). A corpus tagged for aspect pairs, even with alternative references for every lexeme, opens new perspectives for research. The possibilities are especially rich in a parallel corpus with one Slavic and one aspectless language, such as the Mainz-Warsaw Corpus. In order to check the usefulness of our aspect pair tagging, a series of queries will be built which allow comparison of the grammatical profiles of suffixal and prefixal pf and ipf aspect partners.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''24 March 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Rapacz''', '''Aleksander Smywiński-Pohl''' (AGH University of Krakow) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FZzPMTa2cYA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-03-24.pdf|Interlinear Translation of Ancient Greek Texts: How Morphological Tags Enhance Machine Translation Quality]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">Interlinear translation prioritizes preserving the original syntactic structure by placing target language words directly below their source text counterparts, maintaining the original word order rather than natural fluency. Although interlinear translations often deviate from the linguistic norms of the target language, they serve as a valuable tool for those wishing to deeply understand texts in their original form, especially in the case of sacred and ancient texts.||
||<style="border:0;padding-left:30px;padding-bottom:5px">In our research, we conducted the first attempt to apply machine translation to generate interlinear translations from Ancient Greek to Polish and English. We compared the performance of specialized models (!GreTa, !PhilTa) pretrained on Ancient Greek texts with a general-purpose multilingual model (mT5). We examined 144 different model configurations, manipulating the base model, morphological tag encoding method, tag set, and text normalization approach, using the Greek New Testament texts as our corpus.||
||<style="border:0;padding-left:30px;padding-bottom:15px">During the presentation, we will describe our research methodology and discuss the results. The best results were achieved by models in which we implemented new dedicated embedding layers for encoding morphological information (cf. the schematic sketch below), which yielded results up to 35-38% better (BLEU) compared to the baseline scenario. An additional detailed study showed that !PhilTa performs better than mT5, particularly in scenarios with limited data availability. !PhilTa achieved the highest results in translation to English (60.40 BLEU), while mT5-large performed best with Polish (59.33 BLEU).||
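A schematic sketch of a dedicated embedding layer for morphological tags, the general idea referenced above; the actual !GreTa/!PhilTa/mT5 integration differs, and all names and sizes here are assumptions.

{{{#!python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size, tagset_size = 8, 100, 20
tok_emb = rng.normal(size=(vocab_size, d))    # ordinary input embeddings
tag_emb = rng.normal(size=(tagset_size, d))   # dedicated tag embeddings

def embed(token_ids, tag_ids):
    # one morphological tag per token; both lookups share dimension d,
    # so the tag information is added directly to each token vector
    return tok_emb[token_ids] + tag_emb[tag_ids]

print(embed(np.array([5, 17]), np.array([3, 3])).shape)   # (2, 8)
}}}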
Line 52: Line 72:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 February 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Anna Wróblewska''' (Applica / Warsaw University of Technology), '''Filip Graliński''' (Applica / Adam Mickiewicz University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=tZ_rkR7XqRY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2019-02-11.pdf|Text-based machine learning processes and their interpretability]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">How do we tackle text modeling challenges in business applications? We will present a prototype architecture for the automation of processes in text-based work, together with a few use cases of machine learning models, covering emotion detection, abusive language recognition and more. We will also show our tool for explaining suspicious findings in datasets and in model behaviour.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 February 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Dutkiewicz''' (Poznan University of Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=Ap2zn8-RfWI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2019-02-28.pdf|Empirical research on medical information retrieval]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">We discuss the results and evaluation procedures of the bioCADDIE 2016 challenge on the search of precision medical data. Our good results are due to word embedding query expansion with appropriate weights. Information Retrieval (IR) evaluation is demanding because of the considerable effort required to judge over 10,000 documents. A simple sampling method was proposed over 10 years ago for the estimation of Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG) in spite of incomplete judgments. For this method to work, the number of judged documents has to be relatively large. Such conditions were not fulfilled in the bioCADDIE 2016 challenge or in TREC PM 2017 and 2018. The specificity of the bioCADDIE evaluation makes the post-challenge results incompatible with those judged during the contest. In bioCADDIE, for some questions there was no judged relevant document at all. The results are strongly dependent on the cut-off rank. As a result, in the bioCADDIE challenge infAP is weakly correlated with infNDCG, and the error could be up to 0.15–0.20 in absolute value. We believe that the deviation of evaluation measures may override the primary role of the measure in such a case. We corroborate this claim by simulating synthetic results. We propose a simulated environment with properties that mirror the real systems. We implement a number of evaluation measures within the simulation and discuss their usefulness with a partially annotated collection of documents in regard to the collection size, the number of annotated documents and the proportion between the numbers of relevant and irrelevant documents. In particular, we focus on the behaviour of the aforementioned AP and NDCG and their inferred versions (the plain NDCG computation is sketched below). Other studies suggest that infNDCG correlates weakly with other measures and therefore should not be selected as the most important measure.||
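For reference, a minimal sketch of the plain (N)DCG computation underlying the inferred variants discussed above; this is the standard formula, not the authors' evaluation code.

{{{#!python
import math

def dcg(rels, k):
    # rels: graded relevance judgments of returned documents, in rank order
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg(rels, k):
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

print(round(ndcg([3, 0, 2, 1], k=4), 3))   # 0.93
}}}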
||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 April 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Ryszard Staruch''', '''Filip Graliński''' (Adam Mickiewicz University in Poznań)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=xRDXmKoEiOQ|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-04-14.pdf|Leveraging Large Language Models for the Grammatical Error Correction Task]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Large Language Models (LLMs) currently represent the state-of-the-art in many natural language processing tasks. However, their effectiveness in correcting language errors in texts written in Polish remains unclear. To address this gap, a dedicated dataset for Polish text correction has been developed. During the talk, this dataset will be presented along with the evaluation results of selected LLM-based solutions. In the second part of the seminar, new techniques for adapting LLMs to the task of minimal-edit text correction will be discussed, focusing on texts written by language learners — using English as a case study.||
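As a small illustration of what "minimal-edit" correction means in practice, the sketch below extracts token-level edits between a learner sentence and its correction; this is a generic `difflib` illustration, not the presented system.

{{{#!python
import difflib

def minimal_edits(source, corrected):
    src, cor = source.split(), corrected.split()
    sm = difflib.SequenceMatcher(a=src, b=cor)
    # keep only the opcodes that actually change the text
    return [(op, src[i1:i2], cor[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

print(minimal_edits("He go to school yesterday", "He went to school yesterday"))
# [('replace', ['go'], ['went'])]
}}}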
Line 63: Line 78:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 March 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Grzegorz Wojdyga''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2019-03-21.pdf|Size optimisation of language models]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">During the seminar, the results of work on reducing the size of language models will be discussed. The author will review the literature on the size reduction of recurrent neural networks (in terms of language models). Then, the author's own implementations will be presented along with evaluation results on different Polish and English corpora.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 April 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Manfred Stede''' (Universität Potsdam)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FNJIyX6GmCY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-04-28.pdf|Discourse structure in the Potsdam Commentary Corpus: Human annotation, human disagreement, and automatic parsing]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The talk gives a brief introduction to Rhetorical Structure Theory (RST, [[https://www.sfu.ca/rst/05bibliographies/bibs/Mann_Thompson_1988.pdf|Mann/Thompson 1988]]) and then explains the design decisions for the Potsdam Commentary Corpus (PCC), which brings together RST, coreference, and other annotation layers on 175 German news editorials. After illustrating cross-layer queries on the corpus in the ANNIS linguistic database, we turn to the intricacies of manual RST annotation. I will give an overview of the annotation guidelines and their motivations, and present results from an (ongoing) study on annotator disagreements, from which we derive ideas for redesigning the annotation scheme (and potentially the underlying theory), with a comparison to the recent proposal of "eRST" by [[https://direct.mit.edu/coli/article/51/1/23/124464/eRST-A-Signaled-Graph-Theory-of-Discourse|Zeldes et al. (2025)]]. In the last part of the talk, I outline our results on automatic parsing using the system by [[https://aclanthology.org/P14-1002/|Ji and Eisenstein (2014)]].||
Line 68: Line 83:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''25 March 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Łukasz Dębowski''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=gIoI-A00Y7M|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2019-03-25.pdf|GPT-2 – Some remarks of an observer]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">GPT-2 is the latest neural statistical language model by the OpenAI team. A statistical language model is a probability distribution over texts that can be used for automatic text generation. In essence, GPT-2 turned out to be a surprisingly good generator of semantically coherent texts of the length of several paragraphs, pushing the boundaries of what has seemed possible technically so far. Anticipating the use of GPT-2 to generate fake news, the OpenAI team decided to publish only a ten times reduced version of the model. In my talk, I will share some remarks about GPT-2.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''26 May 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Deniz Zeyrek''' (Middle East Technical University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Building monolingual and multilingual discourse banks and implications for discourse structure''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will overview the Turkish Discourse Bank (TDB), and the TED-MDB (TED Multilingual Discourse Bank), both annotated at the discourse level by native speakers. The TDB is a resource of over 3800 implicitly or explicitly conveyed discourse relations built over a multi-genre corpus of 40,000 words. The TED-MDB is a multilingual corpus of six English TED talks with translations into five languages (Turkish, Polish, European Portuguese, Russian, and German, recently extended to a sixth language, Lithuanian) with about 600 relation annotations per language. While both corpora follow the rules and principles of the Penn Discourse Treebank (PDTB), they also consider the language-specific characteristics of individual languages. I will summarize the characteristics of both corpora and the work of our research team where these corpora are exploited, discussing implications for discourse structure.||
Line 73: Line 88:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''8 April 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Wołk''' (Polish-Japanese Academy of Information Technology and Institute of Literary Research, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=QVrY4rRzMOI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2019-04-08.pdf|Language collocations in quantitative research]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">This presentation is aimed at aiding the enormous effort required to analyze phraseological writing competence by developing an automatic evaluation tool for texts. An attempt is made to measure both second language (L2) writing proficiency and text quality. We used the !CollGram technique, which searches a reference corpus to determine the frequency of each pair of tokens (n-grams) and calculates the t-score and related information (the t-score computation is sketched below). We used the Level 3 Corpus of Contemporary American English as the reference corpus. Our solution performed well in writing evaluation and is freely available as a web service or as source code for other researchers. We also show how it can be used for early depression detection and stylometry.||
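A minimal sketch of the bigram t-score that !CollGram-style tools compute from reference-corpus counts; this is the standard association-measure formula, and the counts below are made-up illustrative numbers.

{{{#!python
import math

def t_score(bigram_count, w1_count, w2_count, corpus_size):
    # observed minus expected co-occurrence, scaled by sqrt of the observed
    expected = w1_count * w2_count / corpus_size
    return (bigram_count - expected) / math.sqrt(bigram_count)

print(t_score(bigram_count=120, w1_count=4000, w2_count=900,
              corpus_size=1_000_000))   # ~10.6
}}}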
||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 June 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Ogrodniczuk''', '''Aleksandra Tomaszewska''', '''Bartosz Żuk''', '''Alina Wróblewska''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''The title of the talk (on the Polish Large Language Model) will be given shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The summary of the talk will be given shortly.||
Line 78: Line 93:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''15 April 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alina Wróblewska''', '''Piotr Rybak''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=p-VldtRqvmg|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2019-04-15.pdf|Dependency parsing of Polish]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Dependency parsing is a crucial issue in various NLP tasks. The predicate-argument structure transparently encoded in dependency-based syntactic representations may support machine translation, question answering, sentiment analysis, etc. In the talk, we will present PDB – the largest dependency treebank for Polish, and COMBO – a language-independent neural system for part-of-speech tagging, morphological analysis, lemmatisation and dependency parsing.||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''23 June 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Aleksandra Tomaszewska''', '''Bartosz Żuk''', '''Dariusz Czerski''', '''Maciej Ogrodniczuk''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''The title of the talk (on the NeoN tool for detecting lexical innovations) will be given shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The summary of the talk will be given shortly.||
Line 83: Line 98:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''13 May 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Niewiński''', '''Maria Pszona''', '''Alessandro Seganti''', '''Helena Sobol''' (Samsung R&D Poland), Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Samsung R&D Poland in !SemEval 2019 competition''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:0px">The talk presents Samsung R&D Poland solutions that participated in the !SemEval 2019 competition. Both were ranked second in two different tasks of the competition.||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''1. [[attachment:seminarium-archiwum/2019-05-13a.pdf|Fact Checking in Community Question Answering Forums]]'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">We present our submission to !SemEval 2019 Task 8 on Fact-Checking in Community Forums. The aim was to classify questions from !QatarLiving forum as OPINION, FACTUAL or SOCIALIZING. We will present [[attachment:seminarium-archiwum/2019-05-13a-opis.pdf|our primary solution]]: Deeply Regularized Residual Neural Network (DRR NN) with Universal Sentence Encoder embeddings, which was ranked second in the official evaluation phase. Moreover, we will compare this solution with two contrastive models based on ensemble methods.||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''2. [[attachment:seminarium-archiwum/2019-05-13b.pdf|Linguistically enhanced deep learning offensive sentence classifier]]'''||
||<style="border:0;padding-left:30px;padding-bottom:15px">How do we define offensive content? What is a bad word? In our presentation we will discuss the problem of recognizing what is offensive and what is not in social media (Twitter etc.). Furthermore, we present [[attachment:seminarium-archiwum/2019-05-13b-opis.pdf|the system that we implemented]] to participate in !SemEval 2019 Task 5 and Task 6 (where we took 2nd place in Task 6 Subtask C) and compare our results to other state-of-the-art approaches. We will see that our approach outperformed other models thanks to adding linguistically based observations to the model features.||
||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2024]].||
Line 92: Line 100:
||<style="border:0;padding-top:5px;padding-bottom:5px">'''27 May 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Magdalena Zawisławska''' (University of Warsaw)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Synamet — Polish Corpus of Synesthetic Metaphors''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The aim of the paper is to discuss the procedure of the identification of synesthetic metaphors and the annotation of metaphoric units (MUs) in the Synamet corpus, which was created within the frame of an NCN grant (UMO-2014/15/B/HS2/00182). The theoretical basis for the description of metaphors was the Conceptual Metaphor Theory (CMT) by Lakoff and Johnson combined with Fillmore's frame semantics. Lakoff and Johnson define a metaphor as a conceptual mapping from the source domain to the target domain, e.g. LOVE IS A JOURNEY. Because the concept of a domain is unclear, it has been replaced by a frame which, unlike a conceptual domain, links the semantic and linguistic levels (frames are activated by lexical units). The synesthetic metaphor in a narrower sense is defined as a mapping from one perceptual modality to a different perceptual modality, e.g. a bright sound (VISION → HEARING), and in a broader sense as the description of non-perceptual phenomena with expressions referring primarily to sensory perceptions, e.g. rough character (TOUCH → PERSON). The Synamet project uses an even wider definition of synesthetic metaphor as any expression in which two different frames are activated and one of them is perceptual. Texts in the Synamet corpus come from blogs devoted to perfumes, wine, beer, music, or coffee, in which, due to the topics, the chance of finding synesthetic metaphors was the greatest. The paper presents the basic statistics of the corpus and atypical metaphorical units that required modification of the annotation procedure.||
{{{#!wiki comment
Line 97: Line 102:
||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2018]].||
||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 March 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Mateusz Krubiński''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Talk title will be given shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available soon.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 April 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Stan Matwin''' (Dalhousie University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Efficient training of word embeddings with a focus on negative examples''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">This presentation is based on our [[https://pdfs.semanticscholar.org/1f50/db5786913b43f9668f997fc4c97d9cd18730.pdf|AAAI 2018]] and [[https://aaai.org/ojs/index.php/AAAI/article/view/4683|AAAI 2019]] papers on English word embeddings. In particular, we examine the notion of “negative examples”, the unobserved or insignificant word-context co-occurrences, in spectral methods. We provide a new formulation for the word embedding problem by proposing a new intuitive objective function that perfectly justifies the use of negative examples. With the goal of efficient learning of embeddings, we propose a kernel similarity measure for the latent space that can effectively calculate the similarities in high dimensions. Moreover, we propose an approximate alternative to our algorithm using a modified Vantage Point tree and reduce the computational complexity of the algorithm with respect to the number of words in the vocabulary. We have trained various word embedding algorithms on articles of Wikipedia with 2.3 billion tokens and show that our method outperforms the state-of-the-art in most word similarity tasks by a good margin. We will round up our discussion with some general thoughts about the use of embeddings in modern NLP.||
}}}

Natural Language Processing Seminar 2024–2025

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube.


7 October 2024

Janusz S. Bień (University of Warsaw, professor emeritus)

https://www.youtube.com/watch?v=2mLYixXC_Hw Identifying glyphs in some 16th century fonts: a case study  Talk in Polish.

Some glyphs from 16th century fonts, described in the monumental work “Polonia Typographica Saeculi Sedecimi”, can be more or less easily identified with Unicode standard characters. Some glyphs don't have Unicode codepoints, but can be printed with an appropriate OpenType/TrueType font using typographic features. For some of them their encoding remains an open question. Some examples will be discussed.

14 October 2024

Alexander Rosen (Charles University in Prague)

https://www.youtube.com/watch?v=E2ujmqt7Q2E Lexical and syntactic variability of languages and text genres. A corpus-based study  Talk in English.

This study examines metrics of syntactic complexity (SC) and lexical diversity (LD) as tools for analyzing linguistic variation within and across languages. Using quantifiable measures based on cross-linguistically consistent (morpho)syntactic annotation (Universal Dependencies), the research utilizes parallel texts from a large multilingual corpus (InterCorp). Six SC and two LD metrics – covering the length and embedding levels of nominal and clausal constituents, mean dependency distance (MDD), and sentence length – are applied as metadata for sentences and texts.
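
To make one of these metrics concrete, here is a minimal sketch (my illustration, not part of the study) of computing mean dependency distance over a toy Universal Dependencies-style parse, where each token carries a 1-based index and the index of its head, with 0 marking the root, following the CoNLL-U convention:

{{{#!python
# Toy sketch: mean dependency distance (MDD) over a UD-style parse.
# Each token is (index, head_index); head 0 marks the root, which is
# conventionally excluded from the MDD computation.

def mean_dependency_distance(tokens):
    """tokens: list of (index, head) pairs, 1-based, head=0 for root."""
    distances = [abs(idx - head) for idx, head in tokens if head != 0]
    return sum(distances) / len(distances) if distances else 0.0

# "The cat sat on the mat", with heads as in a typical UD analysis.
sentence = [(1, 2), (2, 3), (3, 0), (4, 6), (5, 6), (6, 3)]
print(mean_dependency_distance(sentence))  # 1.6
}}}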

The presentation will address how these metrics can be visualized and incorporated into corpus queries, how they reflect structural differences across languages and text types, and whether SC and LD vary more across languages or text types. It will also consider the impact of language-specific annotation nuances and correlations among the measures. The analysis includes comparative examples from Polish, Czech, and other languages.

Preliminary findings indicate higher SC in non-fiction compared to fiction across languages, with nominal and clausal metrics being dominant factors. The results suggest distinct patterns for MDD and sentence length, highlighting the impact of structural differences (e.g., analytic vs. synthetic morphology, dominant word-order patterns) and the influence of source text type and style.

28 October 2024

Rafał Jaworski (Adam Mickiewicz University in Poznań)

https://www.youtube.com/watch?v=52LZ976imBA Framework for aligning and storing multilingual word embeddings for the needs of translation probability computation  Talk in Polish.

The presentation will cover my research in the field of natural language processing for computer-aided translation. In particular, I will present the Inter-language Vector Space algorithm set for aligning sentences at the word and phrase level using multilingual word embeddings.

The first function of the set generates vector representations of words. They are produced by an auto-encoder neural network trained on text data, i.e. a text corpus. In this way, vector dictionaries for individual languages are created. The vector representations of words in these dictionaries constitute vector spaces that differ between languages.

To solve this problem and obtain vector representations of words that are comparable across languages, the second function of the Inter-language Vector Space set is used. It aligns vector spaces between languages using transformation matrices computed with singular value decomposition (SVD). Such a matrix is calculated from homonyms, i.e. words written identically in the languages of spaces X and Y; additionally, a bilingual dictionary is used to improve the results. The transformation matrix calculated in this way adjusts space X so that it overlaps space Y to the maximum possible extent.
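
This kind of SVD-based alignment is closely related to the classic orthogonal Procrustes problem. Below is a minimal numpy sketch (my illustration, not the speaker's code), assuming the rows of X and Y are the embeddings of the same seed words in the two spaces:

{{{#!python
import numpy as np

# Orthogonal Procrustes sketch: rows of X and Y are embeddings of the
# same seed words (e.g. identically written forms) in two spaces; find
# the orthogonal map W minimizing ||X @ W - Y|| in the Frobenius norm.
def align(X, Y):
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 50))                    # target space
rot, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # hidden rotation
X = Y @ rot.T                                     # rotated source space
W = align(X, Y)
print(np.allclose(X @ W, Y))                      # True: rotation recovered
}}}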

The last function of the set is responsible for creating a multilingual vector space. The vector space for English is first added to this space in its entirety and without modification. Then, for each remaining vector space, the transformation matrix mapping it to the English space is calculated, and its vectors are multiplied by that matrix, making them comparable to the vectors representing English words.

The Inter-language Vector Space algorithm set is used in translation support systems, for example in the author's algorithm for automatic transfer of untranslated tags from the source sentence to the target one.

4 November 2024

Jakub Kozakoszczak (Deutsche Telekom)

http://zil.ipipan.waw.pl/seminarium-online ZIML: A Markup Language for Regex-Friendly Linguistic Annotation  Talk in English.

Attempts at building regex patterns that match information annotated in the text with embedded markup lead to prohibitively unmanageable patterns. Regex and markup combine even worse when the pattern must use distances as a matching condition because tags disrupt the text format. On the other hand, fully externalized markup preserves text format but leaves regex patterns without reference points.

I introduce the Zero Insertion Markup Language (ZIML), where every combination of characters and labels in the annotated text is represented by a unique "allocharacter". Regex patterns also translate to appropriate patterns with allocharacters, preserving text span matches in standard regex engines. As the main result, ZIML extends regex semantics to include label referencing by matching allocharacters that represent them.
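
As a toy reconstruction of the allocharacter idea (my sketch, not the actual ZIML implementation), one can assign each combination of a character and its label set a fresh code point from a Unicode Private Use Area, so that a standard regex engine matches text and labels at once:

{{{#!python
# Toy reconstruction of the allocharacter idea: encode each
# (character, set-of-labels) pair as a unique Private Use Area code
# point so that a standard regex engine can match over it.
import re

PUA_START = 0xE000
table = {}  # (char, labels) -> allocharacter

def allochar(char, labels=frozenset()):
    key = (char, frozenset(labels))
    if key not in table:
        table[key] = chr(PUA_START + len(table))
    return table[key]

# "cat" where the whole word carries the label NOUN, then a bare "s".
encoded = "".join(allochar(c, {"NOUN"}) for c in "cat") + allochar("s")

# A pattern for: the character 'a' carrying the NOUN label.
pattern = re.escape(allochar("a", {"NOUN"}))
match = re.search(pattern, encoded)
print(match.start())  # 1: the 'a' inside the NOUN-labelled span
}}}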

I will give a proof of correctness for ZIML translation and demonstrate its implementation, including a user-facing pattern language that integrates labels into regex syntax. I hope to discuss potential applications of ZIML in linguistics and natural language processing. A basic understanding of model theory and regex functionality is recommended.

21 November 2024

Christian Chiarcos (University of Augsburg)

https://www.youtube.com/watch?v=FxiOM5zAKo8 Aspects of Knowledge Representation for Discourse Relation Annotation  Talk in English.

Semantic technologies comprise a broad set of standards and technologies including aspects of knowledge representation, information management and computational inference. In this lecture, I will describe the application of knowledge representation standards to the realm of computational discourse, and especially, the annotation of discourse relations. In particular, this includes the formal modelling of discourse relations of different theoretical frameworks by means of modular, interlinked ontologies, the machine-readable edition of discourse marker inventories with OntoLex and techniques for the induction of discourse marker inventories.

2 December 2024

Participants of PolEval 2024

Presentation of the Shared Task results  Talk in Polish. Slides in English.

https://www.youtube.com/watch?v=cwu8YfqtnTs Welcome to PolEval 2024 (Łukasz Kobyliński, Maciej Ogrodniczuk, Filip Graliński, Ryszard Staruch, Karol Saputa)

https://www.youtube.com/watch?v=OnxkmpGmxP4 PolEval 2024 Task 1: Reading Comprehension (Ryszard Tuora / Aleksandra Zwierzchowska)

https://www.youtube.com/watch?v=9FDTOx55WMI Optimizing LLMs for Polish Reading Comprehension: A Comparative Study of Ensemble and Unified Approaches (Krzysztof Wróbel)

https://www.youtube.com/watch?v=_Ur9kzZ3ols PolEval 2024 Task 2: Emotion and Sentiment Recognition (Jan Kocoń, Bartłomiej Koptyra)

https://www.youtube.com/watch?v=V3_z2KiVgco Emotion and Sentiment Recognition in Polish Texts Using Large Language Models: A Winning Approach to PolEval 2024 (Krzysztof Wróbel)

https://www.youtube.com/watch?v=59Xkzoi3TDY Ensemble as a Variance Reduction Method for Emotion and Sentiment Recognition (Tomasz Warzecha)

https://www.youtube.com/watch?v=ESNbPIwjfvw Emotion and Sentiment Recognition Using Ensemble Models (Jakub Kosterna)

https://www.youtube.com/watch?v=Ds8BkUTpcm8 Zero-shot Approach Using Bielik LLM for Emotion Recognition in Polish (Paweł Cyrta)

https://www.youtube.com/watch?v=lmRZn7254MY PolEval 2024 Task 3: Polish Automatic Speech Recognition Challenge (Michał Junczyk, Iwona Christop, Piotr Pęzik)

https://www.youtube.com/watch?v=G35l9xJWqA0 Augmenting Polish Automatic Speech Recognition System with Synthetic Data (Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz)

https://www.youtube.com/watch?v=uIDfc6c1TtA Exploration of training Zipformer and E-Branchformer models with Polish language BIGOS dataset (Paweł Cyrta)

19 December 2024

Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=xqDkbiF4izI Adaptive Attacks on Misinformation Detection Using Reinforcement Learning  Talk in English.

The presentation will cover XARELLO: a generator of adversarial examples for testing the robustness of text classifiers, based on reinforcement learning. This solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model, reflecting the behaviour of the persistent and experienced attackers common in the misinformation-spreading environment. We will cover the evaluation of the approach using several victim classifiers and credibility-assessment tasks, showing that it generates better-quality examples with fewer queries and is especially effective against modern LLMs.
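
The general shape of such an adaptive attack loop might look as follows; this is a schematic sketch under my own assumptions (the victim classifier and edit proposer are hypothetical callables), not XARELLO's actual algorithm:

{{{#!python
import random

def attack(text, victim_predict, propose_edits, episodes=100):
    """Schematic adaptive attack loop (illustrative only).

    victim_predict(text) -> label; propose_edits(text) -> list[str].
    """
    original = victim_predict(text)
    q = {}  # running value of edit "signatures" (first 20 chars)
    for _ in range(episodes):
        candidates = propose_edits(text)
        if random.random() < 0.8:  # exploit edits that worked before
            candidates.sort(key=lambda c: q.get(c[:20], 0.0), reverse=True)
        candidate = candidates[0]
        reward = 1.0 if victim_predict(candidate) != original else -0.1
        sig = candidate[:20]
        q[sig] = q.get(sig, 0.0) + 0.5 * (reward - q.get(sig, 0.0))
        if reward > 0:
            return candidate  # misclassification achieved
        text = candidate      # keep editing from the latest attempt
    return None
}}}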

17 February 2025

Alicja Martinek (NASK National Research Institute, AGH University of Kraków), Ewelina Bartuzi-Trokielewicz (NASK National Research Institute, Warsaw University of Technology)

https://www.youtube.com/watch?v=rCzTBQYkooI Detecting deepfakes and false ads through analysis of text and social engineering techniques  Talk in Polish.

Existing deepfake detection algorithms frequently fail to successfully identify fabricated materials. These algorithms primarily focus on technical analysis of video and audio, often neglecting the meaning of the content itself. In this paper, we introduce a novel approach that emphasizes the analysis of text-based transcripts, particularly those from AI-generated deepfake advertisements, placing the text content at the center of attention. Our method combines linguistic features, evaluation of grammatical mistakes, and the identification of social engineering techniques commonly used in fraudulent content. By examining stylistic inconsistencies and manipulative language patterns, we enhance the accuracy of distinguishing between real and deepfake materials. To ensure interpretability, we employed classical machine learning models, allowing us to provide explainable insights into decision-making processes. Additionally, zero-shot evaluations were conducted using three large language model based solutions to assess their performance in detecting deepfake content. The experimental results show that these factors yield 90% accuracy in distinguishing between deepfake-based fraudulent advertisements and real ones. This demonstrates the effectiveness of incorporating content-based analysis into deepfake detection, offering a complementary layer to existing audio-visual techniques.
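
A minimal sketch of such a content-based pipeline, with an interpretable linear model over hand-crafted cues (the feature set and training data below are my assumptions, not the authors' exact setup):

{{{#!python
# Minimal sketch of a content-based fraud/deepfake-ad classifier:
# hand-crafted linguistic cues fed to an interpretable linear model.
from sklearn.linear_model import LogisticRegression

URGENCY_CUES = ["act now", "limited offer", "guaranteed", "risk-free"]

def features(transcript):
    t = transcript.lower()
    return [
        sum(cue in t for cue in URGENCY_CUES),  # social-engineering cues
        t.count("!"),                            # exclamatory style
        len(t.split()),                          # transcript length
    ]

texts = ["Act now! Guaranteed profit, risk-free investment!",
         "In today's episode we review three affordable espresso grinders."]
labels = [1, 0]  # 1 = fraudulent deepfake ad, 0 = genuine content

model = LogisticRegression().fit([features(t) for t in texts], labels)
print(model.predict([features("Limited offer! Act now!")]))  # likely [1]
}}}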

24 March 2025

Maciej Rapacz, Aleksander Smywiński-Pohl (AGH University of Krakow)

https://www.youtube.com/watch?v=FZzPMTa2cYA Interlinear Translation of Ancient Greek Texts: How Morphological Tags Enhance Machine Translation Quality  Talk in Polish. Slides in English.

Interlinear translation prioritizes preserving the original syntactic structure by placing target language words directly below their source text counterparts, maintaining the original word order rather than natural fluency. Although interlinear translations often deviate from the linguistic norms of the target language, they serve as a valuable tool for those wishing to deeply understand texts in their original form, especially in the case of sacred and ancient texts.

In our research, we conducted the first attempt to apply machine translation to generate interlinear translations from Ancient Greek to Polish and English. We compared the performance of specialized models (GreTa, PhilTa) pretrained on Ancient Greek texts with a general-purpose multilingual model (mT5). We examined 144 different model configurations, manipulating the base model, morphological tag encoding method, tag set, and text normalization approach, using the Greek New Testament texts as our corpus.

During the presentation, we will describe our research methodology and discuss the results. The best results were achieved by models in which we implemented new dedicated embedding layers for encoding morphological information, which yielded results up to 35-38% better (BLEU) compared to the baseline scenario. Additional detailed study showed that PhilTa performs better than mT5, particularly in scenarios with limited data availability. PhilTa achieved the highest results in translation to English (60.40 BLEU), while mT5-large performed best with Polish (59.33 BLEU).
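
One way to picture a dedicated embedding layer for morphological information is to sum tag embeddings with token embeddings before the encoder; the following is a schematic sketch under my own assumptions, not the authors' actual architecture:

{{{#!python
import torch
import torch.nn as nn

# Schematic: token embeddings enriched with a dedicated embedding layer
# for morphological tags, summed before entering the encoder.
class MorphEnrichedEmbedding(nn.Module):
    def __init__(self, vocab_size, tag_vocab_size, dim):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.tag = nn.Embedding(tag_vocab_size, dim)

    def forward(self, token_ids, tag_ids):
        return self.tok(token_ids) + self.tag(tag_ids)

emb = MorphEnrichedEmbedding(vocab_size=32000, tag_vocab_size=200, dim=512)
tokens = torch.tensor([[101, 452, 9]])  # toy subword ids
tags = torch.tensor([[3, 17, 17]])      # toy morphological tag ids
print(emb(tokens, tags).shape)          # torch.Size([1, 3, 512])
}}}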

14 April 2025

Ryszard Staruch, Filip Graliński (Adam Mickiewicz University in Poznań)

https://www.youtube.com/watch?v=xRDXmKoEiOQ Leveraging Large Language Models for the Grammatical Error Correction Task  Talk in Polish.

Large Language Models (LLMs) currently represent the state-of-the-art in many natural language processing tasks. However, their effectiveness in correcting language errors in texts written in Polish remains unclear. To address this gap, a dedicated dataset for Polish text correction has been developed. During the talk, this dataset will be presented along with the evaluation results of selected LLM-based solutions. In the second part of the seminar, new techniques for adapting LLMs to the task of minimal-edit text correction will be discussed, focusing on texts written by language learners — using English as a case study.
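
To illustrate what minimal-edit correction means in practice, the edits between a learner sentence and its correction can be extracted with Python's difflib (my example, not part of the talk):

{{{#!python
import difflib

# Extract the minimal word-level edits between a learner sentence and
# its corrected version; minimal-edit GEC aims to change nothing else.
def edits(source, corrected):
    src, tgt = source.split(), corrected.split()
    sm = difflib.SequenceMatcher(a=src, b=tgt)
    return [(op, src[i1:i2], tgt[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

print(edits("He go to school yesterday", "He went to school yesterday"))
# [('replace', ['go'], ['went'])]
}}}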

28 April 2025

Manfred Stede (Universität Potsdam)

https://www.youtube.com/watch?v=FNJIyX6GmCY Discourse structure in the Potsdam Commentary Corpus: Human annotation, human disagreement, and automatic parsing  Talk in English.

The talk gives a brief introduction to Rhetorical Structure Theory (RST, Mann/Thompson 1988) and then explains the design decisions for the Potsdam Commentary Corpus (PCC), which brings together RST, coreference, and other annotation layers on 175 German news editorials. After illustrating cross-layer queries on the corpus in the ANNIS linguistic database, we turn to the intricacies of manual RST annotation. I will give an overview of the annotation guidelines and their motivations, and present results from an (ongoing) study on annotator disagreements, from which we derive ideas for redesigning the annotation scheme (and potentially the underlying theory), with a comparison to the recent proposal of "eRST" by Zeldes et al. (2025). In the last part of the talk, I outline our results on automatic parsing using the system by Ji and Eisenstein (2014).

26 May 2025

Deniz Zeyrek (Middle East Technical University)

http://zil.ipipan.waw.pl/seminarium-online Building monolingual and multilingual discourse banks and implications for discourse structure  Talk in English.

In this talk, I will overview the Turkish Discourse Bank (TDB) and the TED-MDB (TED Multilingual Discourse Bank), both annotated at the discourse level by native speakers. The TDB is a resource of over 3800 implicitly or explicitly conveyed discourse relations built over a multi-genre corpus of 40,000 words. The TED-MDB is a multilingual corpus of six English TED talks with translations into five languages (Turkish, Polish, European Portuguese, Russian, and German, recently extended to a sixth language, Lithuanian), with about 600 relation annotations per language. While both corpora follow the rules and principles of the Penn Discourse Treebank (PDTB), they also consider the language-specific characteristics of the individual languages. I will summarize the characteristics of both corpora and the work of our research team where these corpora are exploited, discussing implications for discourse structure.

2 June 2025

Maciej Ogrodniczuk, Aleksandra Tomaszewska, Bartosz Żuk, Alina Wróblewska (Institute of Computer Science, Polish Academy of Sciences)

http://zil.ipipan.waw.pl/seminarium-online The title of the talk (on the Polish Large Language Model) will be given shortly  Talk in Polish.

The summary of the talk will be given shortly.

23 June 2025

Aleksandra Tomaszewska, Bartosz Żuk, Dariusz Czerski, Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

http://zil.ipipan.waw.pl/seminarium-online The title of the talk (on the NeoN tool for detecting lexical innovations) will be given shortly  Talk in Polish.

The summary of the talk will be given shortly.

Please see also the talks given in 2000–2015 and 2015–2024.