
Natural Language Processing Seminar 2020–2021

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, currently online – please use the link next to the presentation title. All recorded talks are available on YouTube.


5 October 2020

Piotr Rybak, Robert Mroczkowski, Janusz Tracz (ML Research at Allegro.pl), Ireneusz Gawlik (ML Research at Allegro.pl & AGH University of Science and Technology)

https://www.youtube.com/watch?v=LkR-i2Z1RwM Review of BERT-based Models for Polish Language  Talk delivered in Polish.

In recent years, a series of BERT-based models have improved the performance of many natural language processing systems. During this talk, we will briefly introduce the BERT model as well as some of its variants. Next, we will focus on the available BERT-based models for the Polish language and their results on the KLEJ benchmark. Finally, we will dive into the details of the new model developed in cooperation between ICS PAS and Allegro.
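
A minimal sketch of how such models are used in practice (an editorial illustration, not material from the talk; the HuggingFace transformers API is assumed, and the checkpoint name allegro/herbert-base-cased is an assumption about the published model):

{{{
# Loading a Polish BERT-style model via HuggingFace transformers.
# The model id "allegro/herbert-base-cased" is an assumed checkpoint name.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

inputs = tokenizer("Seminarium odbywa się w poniedziałki.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)
}}}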

2 November 2020

Inez Okulska (NASK National Research Institute)

https://www.youtube.com/watch?v=B7Y9fK2CDWw Concise, robust, sparse? Algebraic transformations of word2vec embeddings versus precision of classification  Talk delivered in Polish.

The introduction of vector representations of words, built from the weights of context and central words computed over giant corpora of a given language rather than from manually selected linguistic features, proved to be a breakthrough for NLP research. After the initial enthusiasm came revision and a search for improvements, primarily to broaden the context, to handle homonyms, and so on. Nevertheless, the classic embeddings still apply to many tasks, such as content classification, and in many cases their performance is still good enough. What do they encode? Do they contain redundant elements? If transformed or reduced, will they retain the information in a way that still preserves the original "meaning"? And what does "meaning" mean here? How far can these vectors be deformed, and how does this relate to encryption methods? In my talk I will present reflections on these questions, illustrated by the results of various "tortures" of the embeddings (word2vec and GloVe) and their precision in the task of classifying texts whose content must remain masked for human users.
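
One such "torture" can be illustrated with a hedged sketch (an editorial example, not the speaker's code): a random orthogonal transformation renders the vectors unreadable to a human inspector while preserving exactly the dot products and distances that a classifier relies on.

{{{
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 300))   # stand-in for word2vec embeddings

# Random orthogonal matrix via QR decomposition
q, _ = np.linalg.qr(rng.normal(size=(300, 300)))
masked = emb @ q                     # "masked" embeddings

# Cosine similarity (and hence nearest-neighbour and linear-classifier
# behaviour) is invariant under an orthogonal map
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(cos(emb[0], emb[1]), cos(masked[0], masked[1])))  # True
}}}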

16 November 2020

Agnieszka Chmiel (Adam Mickiewicz University, Poznań), Danijel Korzinek (Polish-Japanese Academy of Information Technology)

https://www.youtube.com/watch?v=MxbgQL316DQ PINC (Polish Interpreting Corpus): how a corpus can help study the process of simultaneous interpreting  Talk delivered in Polish.

PINC is the first Polish simultaneous interpreting corpus, based on Polish-English and English-Polish interpretations from the European Parliament. Using naturalistic data makes it possible to answer many questions about the process of simultaneous interpreting. By analysing the ear-voice span, that is, the delay between the source text and the target text, we can investigate mechanisms of activation and inhibition in the interpreter's lexical processing. Fluency and pause data help us examine the cognitive load. This talk will focus on how we process data in the corpus (such as interpreter voice identification) and what challenges we face in relation to linguistic analysis, dependency parsing and bilingual alignment. We will show how specific data can be applied to help us understand what interpreting involves or even what happens in the interpreter's mind.
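
For illustration, the ear-voice span reduces to an onset difference once word-level timestamps and a word alignment are available (a minimal sketch assuming a simplified data layout, not the actual PINC tooling):

{{{
# Ear-voice span (EVS) as the onset difference between aligned source and
# target words; timestamps in seconds, data layout assumed for illustration.
source = [("climate", 12.4), ("policy", 12.9)]        # (word, onset)
target = [("polityka", 14.1), ("klimatyczna", 14.6)]
alignment = [(0, 1), (1, 0)]                          # (source idx, target idx)

for s_idx, t_idx in alignment:
    evs = target[t_idx][1] - source[s_idx][1]
    print(f"{source[s_idx][0]} -> {target[t_idx][0]}: EVS = {evs:.1f} s")
}}}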

30 November 2020

Findings of ACL: EMNLP 2020 – Polish session

Łukasz Borchmann et al. (Applica.ai)

Contract Discovery: Dataset and a Few-Shot Semantic Retrieval Challenge with Competitive Baselines  Talk delivered in Polish. Slides in English.

Contract Discovery deals with tasks such as ensuring the inclusion of relevant legal clauses or retrieving them for further analysis (e.g., risk assessment). Because there was no publicly available benchmark for span identification in legal texts, we proposed one, along with hard-to-beat baselines. The task requires processing unstructured text, as in most real-world usage scenarios; that is, no segmentation of legal documents into a hierarchy of distinct (sub)sections is given in advance. Moreover, it is assumed that the passage sought can be any part of the document, not necessarily a complete paragraph, subparagraph, or clause. The process should thus be considered a few-shot span identification task. In this particular setting, pretrained universal encoders fail to provide satisfactory results. In contrast, solutions based on language models perform well, especially when unsupervised fine-tuning is applied.
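
The retrieval setting can be sketched as follows (an editorial illustration of the general idea, not the Applica.ai system; the sentence-transformers library and the all-MiniLM-L6-v2 encoder are assumed generic stand-ins): embed a few example clauses, then rank candidate spans by cosine similarity to their centroid.

{{{
# Few-shot span retrieval sketch: rank candidate spans by similarity to
# the centroid of a handful of example clauses. The encoder is an assumed
# generic stand-in, not the purpose-trained model from the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

examples = ["This Agreement shall be governed by the laws of England."]
candidates = [
    "This Agreement shall be governed by the laws of the State of New York.",
    "The parties agree to keep the terms of this Agreement confidential.",
]

ex_emb = model.encode(examples, convert_to_tensor=True).mean(dim=0)
cand_emb = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(ex_emb, cand_emb)[0]
best = int(scores.argmax())
print(candidates[best], float(scores[best]))
}}}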

Piotr Szymański (Wrocław University of Science and Technology), Piotr Żelasko (Johns Hopkins University)

WER we are and WER we think we are  Talk delivered in Polish. Slides in English.

Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and on the HUB'05 public benchmark. We show that the WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high-quality annotations for training and testing of robust ASR systems.
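
For reference, WER is the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length; a minimal self-contained implementation:

{{{
# Word Error Rate via dynamic-programming edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("we show that wers are higher", "we show wers are much higher"))
# 2 edits / 6 reference words = 0.333...
}}}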

17 December 2020

Piotr Przybyła (Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences)

https://teams.microsoft.com/l/meetup-join/19%3ameeting_YzM3MDg5YzItMmYyOC00NDM5LWE1MWYtYzliODA4MThiN2Zl%40thread.v2/0?context=%7b%22Tid%22%3a%220425f1d9-16b2-41e3-a01a-0c02a63d13d6%22%2c%22Oid%22%3a%22f5f2c910-5438-48a7-b9dd-683a5c3daf1e%22%7d Multi-Word Lexical Simplification  Talk delivered in Polish.

The presentation will cover the task of multi-word lexical simplification, in which a sentence in natural language is made easier to understand by replacing one of its fragments with a simpler alternative, where both the fragment and its replacement can consist of many words. To explore this new direction, a corpus (MWLS1) of 1462 English sentences from various sources with 7059 simplifications was prepared through crowdsourcing. Additionally, an automatic solution to the problem (Plainifier), based on a purpose-trained neural language model, will be discussed, along with its evaluation against human and resource-based baselines. The results of the presented study were also published at the COLING 2020 conference in an article of the same title.
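
The underlying idea can be sketched with a generic masked language model (an editorial illustration of the principle only: Plainifier uses a purpose-trained model and handles multi-word fragments, whereas the single-token fill-mask pipeline below does not):

{{{
# Candidate generation for lexical simplification with a masked LM.
# bert-base-uncased is a generic stand-in, not the Plainifier model.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
sentence = "The committee decided to [MASK] the proposal."
for cand in fill(sentence, top_k=3):
    print(cand["token_str"], round(cand["score"], 3))
}}}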

See also the talks given in 2000–2015 and 2015–2020.