Differences between revisions 276 and 711 (spanning 435 versions)

Natural Language Processing Seminar 2024–2025

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube.

7 October 2024

Janusz S. Bień (University of Warsaw, profesor emeritus)

Identifying glyphs in some 16th century fonts: a case study

Some glyphs from 16th century fonts, described in the monumental work “Polonia Typographica Saeculi Sedecimi”, can be more or less easily identified with the Unicode standard characters. Some glyphs don't have Unicode codepoints, but can be printed with an appropriate OpenType/TrueType fonts using typographic features. For some of them their encoding remains an open question. Some examples will be discussed.

14 October 2024

Alexander Rosen (Charles University in Prague)

Lexical and syntactic variability of languages and text genres. A corpus-based study

This study examines metrics of syntactic complexity (SC) and lexical diversity (LD) as tools for analyzing linguistic variation within and across languages. Using quantifiable measures based on cross-linguistically consistent (morpho)syntactic annotation (Universal Dependencies), the research utilizes parallel texts from a large multilingual corpus (InterCorp). Six SC and two LD metrics – covering the length and embedding levels of nominal and clausal constituents, mean dependency distance (MDD), and sentence length – are applied as metadata for sentences and texts.

The presentation will address how these metrics can be visualized and incorporated into corpus queries, how they reflect structural differences across languages and text types, and whether SC and LD vary more across languages or text types. It will also consider the impact of language-specific annotation nuances and correlations among the measures. The analysis includes comparative examples from Polish, Czech, and other languages.

Preliminary findings indicate higher SC in non-fiction compared to fiction across languages, with nominal and clausal metrics being dominant factors. The results suggest distinct patterns for MDD and sentence length, highlighting the impact of structural differences (e.g., analytic vs. synthetic morphology, dominant word-order patterns) and the influence of source text type and style.

28 October 2024

Rafał Jaworski (Adam Mickiewicz University in Poznań)

Framework for aligning and storing of multilingual word embeddings for the needs of translation probability computation

The presentation will cover my research in the field of natural language processing for computer-aided translation. In particular, I will present the Inter-language Vector Space algorithm set for aligning sentences at the word and phrase level using multilingual word embeddings.

The first function of the set is used to generate vector representations of words. They are generated using an auto-encoder neural network based on text data – a text corpus. In this way vector dictionaries for individual languages are created. The vector representations of words in these dictionaries constitute vector spaces that differ between languages.

To solve this problem and obtain vector representations of words that are comparable between languages, the second function of the Inter-language Vector Space set is used. It is used to align vector spaces between languages using transformation matrices calculated using the singular value decomposition method. This matrix is calculated based on homonyms, i.e. words written identically in the language of space X and Y. Additionally, a bilingual dictionary is used to improve the results. The transformation matrix calculated in this way allows for adjusting space X in such a way that it overlaps space Y to the maximum possible extent.

The last function of the set is responsible for creating a multilingual vector space. The vector space for the English language is first added to this space in its entirety and without modification. Then, for each other vector space, the transformation matrix of this space to the English space is first calculated. The vectors of the new space are multiplied by this matrix and thus become comparable to the vectors representing English words.

The Inter-language Vector Space algorithm set is used in translation support systems, for example in the author's algorithm for automatic transfer of untranslated tags from the source sentence to the target one.

4 November 2024

Jakub Kozakoszczak (Deutsche Telekom)

ZIML: A Markup Language for Regex-Friendly Linguistic Annotation

Attempts at building regex patterns that match information annotated in the text with embedded markup lead to prohibitively unmanageable patterns. Regex and markup combine even worse when the pattern must use distances as a matching condition because tags disrupt the text format. On the other hand, fully externalized markup preserves text format but leaves regex patterns without reference points.

I introduce the Zero Insertion Markup Language (ZIML), where every combination of characters and labels in the annotated text is represented by a unique "allocharacter". Regex patterns also translate to appropriate patterns with allocharacters, preserving text span matches in standard regex engines. As the main result, ZIML extends regex semantics to include label referencing by matching allocharacters that represent them.

I will give a proof of correctness for ZIML translation and demonstrate its implementation, including a user-facing pattern language that integrates labels into regex syntax. I hope to discuss potential applications of ZIML in linguistics and natural language processing. A basic understanding of model theory and regex functionality is recommended.

21 November 2024

Christian Chiarcos (University of Augsburg)

Aspects of Knowledge Representation for Discourse Relation Annotation

Semantic technologies comprise a broad set of standards and technologies including aspects of knowledge representation, information management and computational inference. In this lecture, I will describe the application of knowledge representation standards to the realm of computational discourse, and especially, the annotation of discourse relations. In particular, this includes the formal modelling of discourse relations of different theoretical frameworks by means of modular, interlinked ontologies, the machine-readable edition of discourse marker inventories with OntoLex and techniques for the induction of discourse marker inventories.

2 December 2024
Participants of PolEval 2024
Presentation of the Shared Task results
Welcome to PolEval 2024 (Łukasz Kobyliński, Maciej Ogrodniczuk, Filip Graliński, Ryszard Staruch, Karol Saputa)
PolEval 2024 Task 1: Reading Comprehension (Ryszard Tuora / Aleksandra Zwierzchowska)
Optimizing LLMs for Polish Reading Comprehension: A Comparative Study of Ensemble and Unified Approaches (Krzysztof Wróbel)
PolEval 2024 Task 2: Emotion and Sentiment Recognition (Jan Kocoń, Bartłomiej Koptyra)
Emotion and Sentiment Recognition in Polish Texts Using Large Language Models: A Winning Approach to PolEval 2024 (Krzysztof Wróbel)
Ensemble as a Variance Reduction Method for Emotion and Sentiment Recognition (Tomasz Warzecha)
Emotion and Sentiment Recognition Using Ensemble Models (Jakub Kosterna)
Zero-shot Approach Using Bielik LLM for Emotion Recognition in Polish (Paweł Cyrta)
PolEval 2024 Task 3: Polish Automatic Speech Recognition Challenge (Michał Junczyk, Iwona Christop, Piotr Pęzik)
Augmenting Polish Automatic Speech Recognition System with Synthetic Data (Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz)
Exploration of training Zipformer and E-Branchformer models with Polish language BIGOS dataset (Paweł Cyrta)

19 December 2024

Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)

Adaptive Attacks on Misinformation Detection Using Reinforcement Learning

The presentation will cover XARELLO: a generator of adversarial examples for testing the robustness of text classifiers based on reinforcement learning. This solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. It reflects the behaviour of a persistent and experienced attacker, which are common in the misinformation-spreading environment. We will cover the evaluation of the approach using several victim classifiers and credibility-assessment tasks, showing it generates better-quality examples with less queries, and is especially effective against the modern LLMs.

17 February 2025

Alicja Martinek (NASK National Research Institute, AGH University of Kraków), Ewelina Bartuzi-Trokielewicz (NASK National Research Institute, Warsaw University of Technology)

Detecting deepfakes and false ads through analysis of text and social engineering techniques

Existing deepfake detection algorithm frequently fail to successfully identify fabricated materials. These algorithms primarily focus on technical analysis of video and audio, often neglecting the meaning of content itself. In this paper, we introduce a novel approach that emphasizes the analysis of text-based transcripts, particularly those from AI-generated deepfake advertisements, placing the text content at the center of attention. Our method combines linguistic features, evaluation of grammatical mistakes, and the identification of social engineering techniques commonly used in fraudulent content. By examining stylistic inconsistencies and manipulative language patterns, we enhance the accuracy of distinguishing between real and deepfake materials. To ensure interpretability, we employed classical machine learning models, allowing us to provide explainable insights into decision-making processes. Additionally, zero-shot evaluations were conducted using three large language model based solutions to assess their performance in detecting deepfake content. The experimental results show that these factors yield a 90\% accuracy in distinguishing between deepfake-based fraudulent advertisements and real ones. This demonstrates the effectiveness of incorporating content-based analysis into deepfake detection, offering a complementary layer to existing audio-visual techniques.

24 March 2025

Maciej Rapacz, Aleksander Smywiński-Pohl (AGH University of Krakow)

Interlinear Translation of Ancient Greek Texts: How Morphological Tags Enhance Machine Translation Quality

Interlinear translation prioritizes preserving the original syntactic structure by placing target language words directly below their source text counterparts, maintaining the original word order rather than natural fluency. Although interlinear translations often deviate from the linguistic norms of the target language, they serve as a valuable tool for those wishing to deeply understand texts in their original form, especially in the case of sacred and ancient texts.

In our research, we conducted the first attempt to apply machine translation to generate interlinear translations from Ancient Greek to Polish and English. We compared the performance of specialized models (GreTa, PhilTa) pretrained on Ancient Greek texts with a general-purpose multilingual model (mT5). We examined 144 different model configurations, manipulating the base model, morphological tag encoding method, tag set, and text normalization approach, using the Greek New Testament texts as our corpus.

During the presentation, we will describe our research methodology and discuss the results. The best results were achieved by models in which we implemented new dedicated embedding layers for encoding morphological information, which yielded results up to 35-38% better (BLEU) compared to the baseline scenario. Additional detailed study showed that PhilTa performs better than mT5, particularly in scenarios with limited data availability. PhilTa achieved the highest results in translation to English (60.40 BLEU), while mT5-large performed best with Polish (59.33 BLEU).

14 April 2025

Ryszard Staruch, Filip Graliński (Adam Mickiewicz University in Poznań)

Leveraging Large Language Models for the Grammatical Error Correction Task

Large Language Models (LLMs) currently represent the state-of-the-art in many natural language processing tasks. However, their effectiveness in correcting language errors in texts written in Polish remains unclear. To address this gap, a dedicated dataset for Polish text correction has been developed. During the talk, this dataset will be presented along with the evaluation results of selected LLM-based solutions. In the second part of the seminar, new techniques for adapting LLMs to the task of minimal-edit text correction will be discussed, focusing on texts written by language learners — using English as a case study.

28 April 2025

Manfred Stede (Universität Potsdam)

Discourse structure in the Potsdam Commentary Corpus: Human annotation, human disagreement, and automatic parsing

The talk gives a brief introduction to Rhetorical Structure Theory (RST, Mann/Thompson 1988) and then explains the design decisions for the Potsdam Commentary Corpus (PCC), which brings together RST, coreference, and other annotation layers on 175 German news editorials. After illustrating cross-layer queries on the corpus in the ANNIS linguistic database, we turn to the intricacies of manual RST annotation. I will give an overview of the annotation guidelines and their motivations, and present results from an (ongoing) study on annotator disagreements, from which we derive ideas for redesigning the annotation scheme (and potentially the underlying theory), with a comparison to the recent proposal of "eRST" by Zeldes et al. (2025). In the last part of the talk, I outline our results on automatic parsing using the system by Ji and Eisenstein (2014).

26 May 2025

Deniz Zeyrek (Middle East Technical University)

Building monolingual and multilingual discourse banks and implications for discourse structure

In this talk, I will overview the Turkish Discourse Bank (TDB), and the TED-MDB (TED Multilingual Discourse Bank), both annotated at the discourse level by native speakers. The TDB is a resource of over 3800 implicitly or explicitly conveyed discourse relations built over a multi-genre corpus of 40.000 words. The TED-MDB is a multilingual corpus of six English TED talks with translations into five languages (Turkish, Polish, European Portuguese, Russian, and German, recently extended to a sixth language, Lithuanian) with about 600 relation annotations per language. While both corpora follow the rules and principles of the Penn Discourse Treebank (PDTB), they also consider the language-specific characteristics of individual languages. I will summarize the characteristics of both corpora and the work of our research team where these corpora are exploited, discussing implications on discourse structure.

2 June 2025

Maciej Ogrodniczuk, Aleksandra Tomaszewska, Bartosz Żuk, Alina Wróblewska (Institute of Computer Science, Polish Academy of Sciences)

The title of the talk (on the Polish Large Language Model) will be given shortly

The summary of the talk will be given shortly.

23 June 2025

Aleksandra Tomaszewska, Bartosz Żuk, Dariusz Czerski, Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

The title of the talk (on the NeoN tool for detecting lexical innovations) will be given shortly

The summary of the talk will be given shortly.

Please see also the talks given in 2000–2015 and 2015–2024.

-  ⇤ ← Revision 276 as of 2019-10-28 10:11:26 → 
  Size: 12268
  Editor: MaciejOgrodniczuk
  Comment:
+   ← Revision 711 as of 2025-05-05 09:17:05 → ⇥
  Size: 28600
  Editor: MaciejOgrodniczuk
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-= Natural Language Processing Seminar 2019–2020 =
+= Natural Language Processing Seminar 2024–2025 =
 Line 5:
-||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available [[https://www.youtube.com/channel/UC5PEPpMqjAr7Pgdvq0wRn0w|on YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]||
+||<style="border:0;padding-bottom:10px">The NLP Seminar is organised by the [[http://nlp.ipipan.waw.pjl/|Linguistic Engineering Group]] at the [[http://www.ipipan.waw.pl/en/|Institute of Computer Science]], [[http://www.pan.pl/index.php?newlang=english|Polish Academy of Sciences]] (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on [[https://www.youtube.com/ipipan|YouTube]]. ||<style="border:0;padding-left:30px">[[seminarium|{{attachment:seminar-archive/pl.png}}]]||
 Line 7:
-||<style="border:0;padding-top:5px;padding-bottom:5px">'''23 September 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Igor Boguslavsky''' (Institute for Information Transmission Problems, Russian Academy of Sciences / Universidad Politécnica de Madrid)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''[[attachment:seminarium-archiwum/2019-09-23.pdf|Semantic analysis based on inference]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">I will present a semantic analyzer SemETAP, which is a module of a linguistic processor ETAP designed to perform analysis and generation of NL texts. We proceed from the assumption that the depth of understanding is determined by the number and quality of inferences we can draw from the text. Extensive use of background knowledge and inferences permits to extract implicit information.||
||<style="border:0;padding-left:30px;padding-bottom:0px">Salient features of SemETAP include: ||
||<style="border:0;padding-left:30px;padding-bottom:0px">— knowledge base contains both linguistic and background knowledge;||
||<style="border:0;padding-left:30px;padding-bottom:0px">— inference types include strict entailments and plausible expectations; ||
||<style="border:0;padding-left:30px;padding-bottom:0px">— words and concepts of the ontology may be supplied with explicit decompositions for inference purposes; ||
||<style="border:0;padding-left:30px;padding-bottom:0px">— two levels of semantic structure are distinguished. Basic semantic structure (BSemS) interprets the text in terms of ontological elements. Enhanced semantic structure (EnSemS) extends BSemS by means of a series of inferences; ||
||<style="border:0;padding-left:30px;padding-bottom:15px">— a new logical formalism Etalog is developed in which all inference rules are written.||
+||<style="border:0;padding-top:5px;padding-bottom:5px">'''7 October 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Janusz S. Bień''' (University of Warsaw, profesor emeritus) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=2mLYixXC_Hw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-07.pdf|Identifying glyphs in some 16th century fonts: a case study]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Some glyphs from 16th century fonts, described in the monumental work “[[https://crispa.uw.edu.pl/object/files/754258/display/Default|Polonia Typographica Saeculi Sedecimi]]”, can be more or less easily identified with the Unicode standard characters. Some glyphs don't have Unicode codepoints, but can be printed with an appropriate !OpenType/TrueType fonts using typographic features. For some of them their encoding remains an open question. Some examples will be discussed.||
-Line 18:
+Line 12:
-||<style="border:0;padding-top:5px;padding-bottom:5px">'''7 October 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Tomasz Stanisz''' (Institute of Nuclear Physics, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=sRreAjtf2Jo|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2019-10-07.pdf|What can a complex network say about a text?]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Complex networks, which have found application in the quantitative description of many different phenomena, have proven to be useful in research on natural language. The network formalism allows to study language from various points of view - a complex network may represent, for example, distances between given words in a text, semantic similarities, or grammatical relationships. One of the types of linguistic networks are word-adjacency networks, which describe mutual co-occurrences of words in texts. Although simple in construction, word-adjacency networks have a number of properties allowing for their practical use. The structure of such networks, expressed by appropriately defined quantities, reflects selected characteristics of language; applying machine learning methods to collections of those quantities may be used, for example, for authorship attribution.||
+||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 October 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alexander Rosen''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=E2ujmqt7Q2E|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-14.pdf|Lexical and syntactic variability of languages and text genres. A corpus-based study]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">This study examines metrics of syntactic complexity (SC) and lexical diversity (LD) as tools for analyzing linguistic variation within and across languages. Using quantifiable measures based on cross-linguistically consistent (morpho)syntactic annotation ([[https://universaldependencies.org/|Universal Dependencies]]), the research utilizes parallel texts from a large multilingual corpus ([[https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze16ud|InterCorp]]). Six SC and two LD metrics – covering the length and embedding levels of nominal and clausal constituents, mean dependency distance (MDD), and sentence length – are applied as metadata for sentences and texts.||
||<style="border:0;padding-left:30px;padding-bottom:5px">The presentation will address how these metrics can be visualized and incorporated into corpus queries, how they reflect structural differences across languages and text types, and whether SC and LD vary more across languages or text types. It will also consider the impact of language-specific annotation nuances and correlations among the measures. The analysis includes comparative examples from Polish, Czech, and other languages.||
||<style="border:0;padding-left:30px;padding-bottom:15px">Preliminary findings indicate higher SC in non-fiction compared to fiction across languages, with nominal and clausal metrics being dominant factors. The results suggest distinct patterns for MDD and sentence length, highlighting the impact of structural differences (e.g., analytic vs. synthetic morphology, dominant word-order patterns) and the influence of source text type and style.||
-Line 23:
+Line 19:
-||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 October 2019''' (NOTE: The seminar will start at 12:30!)||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Agnieszka Patejuk''' (Institute of Computer Science, Polish Academy of Sciences / University of Oxford), '''Adam Przepiórkowski''' (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Coordination in the Universal Dependencies standard''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">''Universal Dependencies'' (UD; [[https://universaldependencies.org/]]) is a widespread syntactic annotation scheme employed by many parsers of multiple languages.  However, the scheme does not adequately represent coordination, i.e., structures involving conjunctions.  In this talk, we propose representations of two aspects of coordination which have not so far been properly represented either in UD or in dependency grammars: coordination of unlike grammatical functions and nested coordination.||
+||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 October 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Rafał Jaworski''' (Adam Mickiewicz University in Poznań)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=52LZ976imBA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-10-28.pdf|Framework for aligning and storing of multilingual word embeddings for the needs of translation probability computation]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">The presentation will cover my research in the field of natural language processing for computer-aided translation. In particular, I will present the Inter-language Vector Space algorithm set for aligning sentences at the word and phrase level using multilingual word embeddings.||
||<style="border:0;padding-left:30px;padding-bottom:5px">The first function of the set is used to generate vector representations of words. They are generated using an auto-encoder neural network based on text data – a text corpus. In this way vector dictionaries for individual languages are created. The vector representations of words in these dictionaries constitute vector spaces that differ between languages.||
||<style="border:0;padding-left:30px;padding-bottom:5px">To solve this problem and obtain vector representations of words that are comparable between languages, the second function of the Inter-language Vector Space set is used. It is used to align vector spaces between languages using transformation matrices calculated using the singular value decomposition method. This matrix is calculated based on homonyms, i.e. words written identically in the language of space X and Y. Additionally, a bilingual dictionary is used to improve the results. The transformation matrix calculated in this way allows for adjusting space X in such a way that it overlaps space Y to the maximum possible extent.||
||<style="border:0;padding-left:30px;padding-bottom:5px">The last function of the set is responsible for creating a multilingual vector space. The vector space for the English language is first added to this space in its entirety and without modification. Then, for each other vector space, the transformation matrix of this space to the English space is first calculated. The vectors of the new space are multiplied by this matrix and thus become comparable to the vectors representing English words.||
||<style="border:0;padding-left:30px;padding-bottom:15px">The Inter-language Vector Space algorithm set is used in translation support systems, for example in the author's algorithm for automatic transfer of untranslated tags from the source sentence to the target one.||
 Line 28:
-||<style="border:0;padding-top:5px;padding-bottom:5px">'''4 November 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marcin Będkowski''' (Educational Research Institute), '''Michał Marcińczuk''' (Wrocław University of Science and Technology), '''Łukasz Kobyliński''', '''Grzegorz Wojdyga''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Similarity of descriptions of qualifications contained in the Integrated Qualifications Register''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">In the talk we will discuss the problem of comparing documents contained in the Integrated Qualifications Register in terms of their content similarity.||
||<style="border:0;padding-left:30px;padding-bottom:5px"In the first part, we characterize the background of the issue, including the structure of the description of learning outcomes in qualifications and sentences describing learning outcomes. According to the definition in the Act on the Integrated Qualifications System, the learning effect is knowledge, skills and social competences acquired in the learning process, and the qualification is a set of learning effects, the achievement of which is confirmed by an appropriate document (e.g. diploma, certificate). Sentences whose referents are learning outcomes have a stable structure and consist essentially of so-called an operational verb (describing an activity constituting a learning effect) and a nominal phrase that complements it (naming the object that is the subject of this activity, in short: the object of skill). For example: "Determines vision defects and how to correct them based on eye refraction measurement" or "The student reads technical drawings."||
||<style="border:0;padding-left:30px;padding-bottom:5px">In the second part, we outline the approach that allows determining the degree of similarity between qualifications and their grouping, along with its assumptions and the intuitions behind them. We will define the accepted understanding of content similarity, namely we outline the approach to determine the similarity of texts in a variant that allows automatic text processing using computer tools. We will present a simple representation model, the so-called bag of words, in two versions.||
||<style="border:0;padding-left:30px;padding-bottom:5px"The first of them assumes the full atomization of learning outcomes (including nominal phrases, skill objects) and their presentation as sets of single plata-mathematical nouns representing skills objects. The second is based on n-grams, taking into account the TFIDF measure (i.e. weighing by frequency of terms - inverse frequency in documents), allowing the extraction of key words and phrases from texts.||
||<style="border:0;padding-left:30px;padding-bottom:5px">The first approach can be described as "wasteful", while the second - "frugal". The first allows for presenting many similar qualifications for each qualification (although the degree of similarity may be low). On the other hand, the second allows a situation in which there will be no similar for a given qualification.||
||<style="border:0;padding-left:30px;padding-bottom:5px">In the third part, we describe sample groupings and ranking lists based on both approaches, based on multidimensional scaling and the k-average algorithm, as well as hierarchical grouping. We will also present a case study that will illustrate the advantages and disadvantages of both approaches.||
||<style="border:0;padding-left:30px;padding-bottom:5px">In the fourth part we will present the conclusions on grouping qualifications, but also general conclusions related to the definition of key words. In particular, we will present conclusions on the use of the indicated methods for comparing texts of varying length, as well as partially overlapping (containing common fragments).||
||<style="border:0;padding-left:30px;padding-bottom:15px">The talk was prepared in cooperation with the authors of the expertise on automatic analysis and comparison of qualifications for the purpose of grouping them prepared under the project "Keeping and developing the Integrated Qualifications Register", POWR.02.11.00-00-0001/17.||
+||<style="border:0;padding-top:5px;padding-bottom:5px">'''4 November 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Jakub Kozakoszczak''' (Deutsche Telekom)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''[[attachment:seminarium-archiwum/2024-11-04.pdf|ZIML: A Markup Language for Regex-Friendly Linguistic Annotation]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">Attempts at building regex patterns that match information annotated in the text with embedded markup lead to prohibitively unmanageable patterns. Regex and markup combine even worse when the pattern must use distances as a matching condition because tags disrupt the text format. On the other hand, fully externalized markup preserves text format but leaves regex patterns without reference points.||
||<style="border:0;padding-left:30px;padding-bottom:5px">I introduce the Zero Insertion Markup Language (ZIML), where every combination of characters and labels in the annotated text is represented by a unique "allocharacter". Regex patterns also translate to appropriate patterns with allocharacters, preserving text span matches in standard regex engines. As the main result, ZIML extends regex semantics to include label referencing by matching allocharacters that represent them.||
||<style="border:0;padding-left:30px;padding-bottom:15px">I will give a proof of correctness for ZIML translation and demonstrate its implementation, including a user-facing pattern language that integrates labels into regex syntax. I hope to discuss potential applications of ZIML in linguistics and natural language processing. A basic understanding of model theory and regex functionality is recommended.||
-Line 40:
+Line 35:
-||<style="border:0;padding-top:5px;padding-bottom:5px">'''18 November 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alexander Rosen''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''The !InterCorp multilingual parallel corpus: representation of grammatical categories''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">!InterCorp, a multilingual parallel component of the Czech National Corpus, has been on-line since 2008, growing steadily to its present size of 1.7 billion words in 40 languages. A substantial share of fiction is complemented by legal and journalistic texts, parliament proceedings, film subtitles and the Bible. The texts are sentence-aligned and – in most languages – tagged and lemmatized. We will focus on the issue of morphosyntactic annotation, currently using language-specific tagsets and tokenization rules, and explore various solutions, including those based on the guidelines, data and tools developed in the Universal Dependencies project.||
+||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 November 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Christian Chiarcos''' (University of Augsburg)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FxiOM5zAKo8|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-11-21.pdf|Aspects of Knowledge Representation for Discourse Relation Annotation]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Semantic technologies comprise a broad set of standards and technologies including aspects of knowledge representation, information management and computational inference. In this lecture, I will describe the application of knowledge representation standards to the realm of computational discourse, and especially, the annotation of discourse relations. In particular, this includes the formal modelling of discourse relations of different theoretical frameworks by means of modular, interlinked ontologies, the machine-readable edition of discourse marker inventories with !OntoLex and techniques for the induction of discourse marker inventories.||
-Line 45:
+Line 40:
-||<style="border:0;padding-top:5px;padding-bottom:5px">'''21 November 2019'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alexander Rosen''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''A learner corpus of Czech''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk delivered in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Texts produced by language learners (native or non-native) include all sorts of non-canonical phenomena, complicating the task of linguistic annotation while requiring an explicit markup of deviations from the standard. Although a number of English learner corpora exist and other languages have been catching up recently, a commonly accepted approach to designing an error taxonomy and annotation scheme has not emerged yet. For !CzeSL, the corpus of Czech as a Second Language, several such approaches were designed and tested, later extended also to texts produced by Czech schoolchildren. I will show various pros and cons of these approaches, especially with a view of Czech as a highly inflectional language with free word order.||
+||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 December 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Participants of !PolEval 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Presentation of the Shared Task results''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=cwu8YfqtnTs|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-01.pdf|Welcome to PolEval 2024]]''' (Łukasz Kobyliński, Maciej Ogrodniczuk, Filip Graliński, Ryszard Staruch, Karol Saputa)                                                                                          ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=OnxkmpGmxP4|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-02.pdf|PolEval 2024 Task 1: Reading Comprehension]]''' (Ryszard Tuora / Aleksandra Zwierzchowska)                                                ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=9FDTOx55WMI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-03.pdf|Optimizing LLMs for Polish Reading Comprehension: A Comparative Study of Ensemble and Unified Approaches]]''' (Krzysztof Wróbel)          ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=_Ur9kzZ3ols|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-04.pdf|PolEval 2024 Task 2: Emotion and Sentiment Recognition]]''' (Jan Kocoń, Bartłomiej Koptyra)                                               ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=V3_z2KiVgco|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-05.pdf|Emotion and Sentiment Recognition in Polish Texts Using Large Language Models: A Winning Approach to PolEval 2024]]''' (Krzysztof Wróbel) ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=59Xkzoi3TDY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-06.pdf|Ensemble as a Variance Reduction Method for Emotion and Sentiment Recognition]]''' (Tomasz Warzecha)                                      ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=ESNbPIwjfvw|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-07.pdf|Emotion and Sentiment Recognition Using Ensemble Models]]''' (Jakub Kosterna)                                                             ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=Ds8BkUTpcm8|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-08.pdf|Zero-shot Approach Using Bielik LLM for Emotion Recognition in Polish]]''' (Paweł Cyrta)                                                  ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=lmRZn7254MY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-08.pdf|PolEval 2024 Task 3: Polish Automatic Speech Recognition Challenge]]''' (Michał Junczyk, Iwona Christop, Piotr Pęzik)                     ||
||<style="border:0;padding-left:30px;padding-bottom:0px">[[https://www.youtube.com/watch?v=G35l9xJWqA0|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-10.pdf|Augmenting Polish Automatic Speech Recognition System with Synthetic Data]]''' (Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz)     ||
||<style="border:0;padding-left:30px;padding-bottom:15px">[[https://www.youtube.com/watch?v=uIDfc6c1TtA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[http://poleval.pl/files/2024-11.pdf|Exploration of training Zipformer and E-Branchformer models with Polish language BIGOS dataset]]''' (Paweł Cyrta)                         ||
-Line 50:
+Line 55:
-||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2019]].||
+||<style="border:0;padding-top:5px;padding-bottom:5px">'''19 December 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Piotr Przybyła''' (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=xqDkbiF4izI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2024-12-19.pdf|Adaptive Attacks on Misinformation Detection Using Reinforcement Learning]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The presentation will cover XARELLO: a generator of adversarial examples for testing the robustness of text classifiers based on reinforcement learning. This solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. It reflects the behaviour of a persistent and experienced attacker, which are common in the misinformation-spreading environment. We will cover the evaluation of the approach using several victim classifiers and credibility-assessment tasks, showing it generates better-quality examples with less queries, and is especially effective against the modern LLMs.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''17 February 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Alicja Martinek''' (NASK National Research Institute, AGH University of Kraków), '''Ewelina Bartuzi-Trokielewicz''' (NASK National Research Institute, Warsaw University of Technology)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=rCzTBQYkooI|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-02-17.pdf|Detecting deepfakes and false ads through analysis of text and social engineering techniques]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Existing deepfake detection algorithm frequently fail to successfully identify fabricated materials. These algorithms primarily focus on technical analysis of video and audio, often neglecting the meaning of content itself. In this paper, we introduce a novel approach that emphasizes the analysis of text-based transcripts, particularly those from AI-generated deepfake advertisements, placing the text content at the center of attention. Our method combines linguistic features, evaluation of grammatical mistakes, and the identification of social engineering techniques commonly used in fraudulent content. By examining stylistic inconsistencies and manipulative language patterns, we enhance the accuracy of distinguishing between real and deepfake materials. To ensure interpretability, we employed classical machine learning models, allowing us to provide explainable insights into decision-making processes. Additionally, zero-shot evaluations were conducted using three large language model based solutions to assess their performance in detecting deepfake content. The experimental results show that these factors yield a 90\% accuracy in distinguishing between deepfake-based fraudulent advertisements and real ones. This demonstrates the effectiveness of incorporating content-based analysis into deepfake detection, offering a complementary layer to existing audio-visual techniques.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''24 March 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Rapacz''', '''Aleksander Smywiński-Pohl''' (AGH University of Krakow) ||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FZzPMTa2cYA|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-03-24.pdf|Interlinear Translation of Ancient Greek Texts: How Morphological Tags Enhance Machine Translation Quality]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}&#160;{{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:5px">Interlinear translation prioritizes preserving the original syntactic structure by placing target language words directly below their source text counterparts, maintaining the original word order rather than natural fluency. Although interlinear translations often deviate from the linguistic norms of the target language, they serve as a valuable tool for those wishing to deeply understand texts in their original form, especially in the case of sacred and ancient texts.||
||<style="border:0;padding-left:30px;padding-bottom:5px">In our research, we conducted the first attempt to apply machine translation to generate interlinear translations from Ancient Greek to Polish and English. We compared the performance of specialized models (!GreTa, !PhilTa) pretrained on Ancient Greek texts with a general-purpose multilingual model (mT5). We examined 144 different model configurations, manipulating the base model, morphological tag encoding method, tag set, and text normalization approach, using the Greek New Testament texts as our corpus.||
||<style="border:0;padding-left:30px;padding-bottom:15px">During the presentation, we will describe our research methodology and discuss the results. The best results were achieved by models in which we implemented new dedicated embedding layers for encoding morphological information, which yielded results up to 35-38% better (BLEU) compared to the baseline scenario. Additional detailed study showed that !PhilTa performs better than mT5, particularly in scenarios with limited data availability. !PhilTa achieved the highest results in translation to English (60.40 BLEU), while mT5-large performed best with Polish (59.33 BLEU).||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''14 April 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Ryszard Staruch''', '''Filip Graliński''' (Adam Mickiewicz University in Poznań)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=xRDXmKoEiOQ|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-04-14.pdf|Leveraging Large Language Models for the Grammatical Error Correction Task]]''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Large Language Models (LLMs) currently represent the state-of-the-art in many natural language processing tasks. However, their effectiveness in correcting language errors in texts written in Polish remains unclear. To address this gap, a dedicated dataset for Polish text correction has been developed. During the talk, this dataset will be presented along with the evaluation results of selected LLM-based solutions. In the second part of the seminar, new techniques for adapting LLMs to the task of minimal-edit text correction will be discussed, focusing on texts written by language learners — using English as a case study.||


||<style="border:0;padding-top:5px;padding-bottom:5px">'''28 April 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Manfred Stede''' (Universität Potsdam)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://www.youtube.com/watch?v=FNJIyX6GmCY|{{attachment:seminarium-archiwum/youtube.png}}]] '''[[attachment:seminarium-archiwum/2025-04-28.pdf|Discourse structure in the Potsdam Commentary Corpus: Human annotation, human disagreement, and automatic parsing]]''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The talk gives a brief introduction to Rhetorical Structure Theory (RST, [[https://www.sfu.ca/rst/05bibliographies/bibs/Mann_Thompson_1988.pdf|Mann/Thompson 1988]]) and then explains the design decisions for the Potsdam Commentary Corpus (PCC), which brings together RST, coreference, and other annotation layers on 175 German news editorials. After illustrating cross-layer queries on the corpus in the ANNIS linguistic database, we turn to the intricacies of manual RST annotation. I will give an overview of the annotation guidelines and their motivations, and present results from an (ongoing) study on annotator disagreements, from which we derive ideas for redesigning the annotation scheme (and potentially the underlying theory), with a comparison to the recent proposal of "eRST" by [[https://direct.mit.edu/coli/article/51/1/23/124464/eRST-A-Signaled-Graph-Theory-of-Discourse|Zeldes et al. (2025)]]. In the last part of the talk, I outline our results on automatic parsing using the system by [[https://aclanthology.org/P14-1002/|Ji and Eisenstein (2014)]].||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''26 May 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Deniz Zeyrek''' (Middle East Technical University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Building monolingual and multilingual discourse banks and implications for discourse structure''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">In this talk, I will overview the Turkish Discourse Bank (TDB), and the TED-MDB (TED Multilingual Discourse Bank), both annotated at the discourse level by native speakers. The TDB is a resource of over 3800 implicitly or explicitly conveyed discourse relations built over a multi-genre corpus of 40.000 words. The TED-MDB is a multilingual corpus of six English TED talks with translations into five languages (Turkish, Polish, European Portuguese, Russian, and German, recently extended to a sixth language, Lithuanian) with about 600 relation annotations per language. While both corpora follow the rules and principles of the Penn Discourse Treebank (PDTB), they also consider the language-specific characteristics of individual languages.  I will summarize the characteristics of both corpora and the work of our research team where these corpora are exploited, discussing implications on discourse structure.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 June 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Maciej Ogrodniczuk''', '''Aleksandra Tomaszewska''', '''Bartosz Żuk''', '''Alina Wróblewska''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''The title of the talk (on the Polish Large Language Model) will be given shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The summary of the talk will be given shortly.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''23 June 2025'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Aleksandra Tomaszewska''', '''Bartosz Żuk''', '''Dariusz Czerski''', '''Maciej Ogrodniczuk''' (Institute of Computer Science, Polish Academy of Sciences)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''The title of the talk (on the NeoN tool for detecting lexical innovations) will be given shortly''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">The summary of the talk will be given shortly.||

||<style="border:0;padding-top:10px">Please see also [[http://nlp.ipipan.waw.pl/NLP-SEMINAR/previous-e.html|the talks given in 2000–2015]] and [[http://zil.ipipan.waw.pl/seminar-archive|2015–2024]].||

{{{#!wiki comment


||<style="border:0;padding-top:5px;padding-bottom:5px">'''11 March 2024'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Mateusz Krubiński''' (Charles University in Prague)||
||<style="border:0;padding-left:30px;padding-bottom:5px">[[http://zil.ipipan.waw.pl/seminarium-online|{{attachment:seminarium-archiwum/teams.png}}]] '''Talk title will be given shortly''' &#160;{{attachment:seminarium-archiwum/icon-en.gif|Talk in Polish.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available soon.||

||<style="border:0;padding-top:5px;padding-bottom:5px">'''2 April 2020'''||
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Stan Matwin''' (Dalhousie University)||
||<style="border:0;padding-left:30px;padding-bottom:5px">'''Efficient training of word embeddings with a focus on negative examples''' &#160;{{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}} {{attachment:seminarium-archiwum/icon-en.gif|Slides in English.}}||
||<style="border:0;padding-left:30px;padding-bottom:15px">This presentation is based on our [[https://pdfs.semanticscholar.org/1f50/db5786913b43f9668f997fc4c97d9cd18730.pdf|AAAI 2018]] and [[https://aaai.org/ojs/index.php/AAAI/article/view/4683|AAAI 2019]] papers on English word embeddings. In particular, we examine the notion of “negative examples”, the unobserved or insignificant word-context co-occurrences, in spectral methods. we provide a new formulation for the word embedding problem by proposing a new intuitive objective function that perfectly justifies the use of negative examples. With the goal of efficient learning of embeddings, we propose a kernel similarity measure for the latent space that can effectively calculate the similarities in high dimensions. Moreover, we propose an approximate alternative to our algorithm using a modified Vantage Point tree and reduce the computational complexity of the algorithm with respect to the number of words in the vocabulary. We have trained various word embedding algorithms on articles of Wikipedia with 2.3 billion tokens and show that our method outperforms the state-of-the-art in most word similarity tasks by a good margin. We will round up our discussion with some general thought s about the use of embeddings in modern NLP.||
}}}

Diff for "seminar"

Menu

Natural Language Processing Seminar 2024–2025