Natural Language Processing Seminar 2024–2025

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube.

7 October 2024

Janusz S. Bień (University of Warsaw, profesor emeritus)

Identifying glyphs in some 16th century fonts: a case study

Some glyphs from 16th century fonts, described in the monumental work “Polonia Typographica Saeculi Sedecimi”, can be more or less easily identified with the Unicode standard characters. Some glyphs don't have Unicode codepoints, but can be printed with an appropriate OpenType/TrueType fonts using typographic features. For some of them their encoding remains an open question. Some examples will be discussed.

14 October 2024

Alexander Rosen (Charles University in Prague)

Lexical and syntactic variability of languages and text genres. A corpus-based study

This study examines metrics of syntactic complexity (SC) and lexical diversity (LD) as tools for analyzing linguistic variation within and across languages. Using quantifiable measures based on cross-linguistically consistent (morpho)syntactic annotation (Universal Dependencies), the research utilizes parallel texts from a large multilingual corpus (InterCorp). Six SC and two LD metrics – covering the length and embedding levels of nominal and clausal constituents, mean dependency distance (MDD), and sentence length – are applied as metadata for sentences and texts.

The presentation will address how these metrics can be visualized and incorporated into corpus queries, how they reflect structural differences across languages and text types, and whether SC and LD vary more across languages or text types. It will also consider the impact of language-specific annotation nuances and correlations among the measures. The analysis includes comparative examples from Polish, Czech, and other languages.

Preliminary findings indicate higher SC in non-fiction compared to fiction across languages, with nominal and clausal metrics being dominant factors. The results suggest distinct patterns for MDD and sentence length, highlighting the impact of structural differences (e.g., analytic vs. synthetic morphology, dominant word-order patterns) and the influence of source text type and style.

28 October 2024 (Note: the talk will take place at 12:00 pm)

Rafał Jaworski (Adam Mickiewicz University in Poznań)

Framework for aligning and storing of multilingual word embeddings for the needs of translation probability computation

The presentation will cover my research in the field of natural language processing for computer-aided translation. In particular, I will present the Inter-language Vector Space algorithm set for aligning sentences at the word and phrase level using multilingual word embeddings.

The first function of the set is used to generate vector representations of words. They are generated using an auto-encoder neural network based on text data – a text corpus. In this way vector dictionaries for individual languages are created. The vector representations of words in these dictionaries constitute vector spaces that differ between languages.

To solve this problem and obtain vector representations of words that are comparable between languages, the second function of the Inter-language Vector Space set is used. It is used to align vector spaces between languages using transformation matrices calculated using the singular value decomposition method. This matrix is calculated based on homonyms, i.e. words written identically in the language of space X and Y. Additionally, a bilingual dictionary is used to improve the results. The transformation matrix calculated in this way allows for adjusting space X in such a way that it overlaps space Y to the maximum possible extent.

The last function of the set is responsible for creating a multilingual vector space. The vector space for the English language is first added to this space in its entirety and without modification. Then, for each other vector space, the transformation matrix of this space to the English space is first calculated. The vectors of the new space are multiplied by this matrix and thus become comparable to the vectors representing English words.

The Inter-language Vector Space algorithm set is used in translation support systems, for example in the author's algorithm for automatic transfer of untranslated tags from the source sentence to the target one.

4 November 2024

Jakub Kozakoszczak (Deutsche Telekom)

ZIML: A Markup Language for Regex-Friendly Linguistic Annotation

The summary of the talk will be made available shortly.

21 November 2024

Christian Chiarcos (University of Augsburg)

Aspects of Knowledge Representation for Discourse Relation Annotation

The summary of the talk will be made available shortly.

2 December 2024

Participants of PolEval 2024

Presentation of the workshop results

The program will be made available after the contest ends.

19 December 2024

Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)

Adaptive Attacks on Misinformation Detection Using Reinforcement Learning

The summary of the talk will be made available shortly.

Please see also the talks given in 2000–2015 and 2015–2023.

seminar

Menu

Natural Language Processing Seminar 2024–2025