Natural Language Processing Seminar 2020–2021
The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, currently online – please use the link next to the presentation title. All recorded talks are available on YouTube.
5 October 2020
Piotr Rybak, Robert Mroczkowski, Janusz Tracz (ML Research at Allegro.pl), Ireneusz Gawlik (ML Research at Allegro.pl & AGH University of Science and Technology)
In recent years, a series of BERT-based models has improved the performance of many natural language processing systems. During this talk, we will briefly introduce the BERT model as well as some of its variants. Next, we will focus on the available BERT-based models for the Polish language and their results on the KLEJ benchmark. Finally, we will dive into the details of the new model developed in cooperation between ICS PAS and Allegro.
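As a hedged illustration of how such a model can be used downstream, the sketch below loads a publicly available Polish BERT checkpoint with the HuggingFace transformers library and extracts a sentence representation. The checkpoint name allegro/herbert-base-cased and the pooling choice are assumptions made for this example, not a description of the KLEJ submission itself.
{{{
# Minimal sketch: encode a Polish sentence with a BERT-based model.
# Assumes the HuggingFace `transformers` library and the publicly
# released `allegro/herbert-base-cased` checkpoint; any other Polish
# BERT variant could be substituted.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

sentence = "Seminarium odbywa się w poniedziałki."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] vector as a simple sentence representation,
# e.g. as input features for a KLEJ-style classification task.
cls_vector = outputs.last_hidden_state[:, 0, :]
print(cls_vector.shape)  # (1, hidden_size)
}}}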
2 November 2020
Inez Okulska (NASK National Research Institute)
The introduction of vector representations of words, containing the weights of context and central words, computed from giant corpora of a given language rather than encoding manually selected linguistic features of words, proved to be a breakthrough for NLP research. After the first delight came revision and a search for improvements – above all to broaden the context, to handle homonyms, and so on. Nevertheless, the classic embeddings still apply to many tasks – for example, content classification – and in many cases their performance is still good enough. What do they encode? Do they contain redundant elements? If transformed or reduced, will they preserve the information in a way that still retains the original "meaning"? And what is meaning here? How far can these vectors be deformed, and how does this relate to encryption methods? In my talk I will present a reflection on this subject, illustrated by the results of various "tortures" of the embeddings (word2vec and GloVe) and their precision in the task of classifying texts whose content must remain masked for human users.
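To make the kind of experiment described above concrete, here is a small, purely illustrative sketch: random vectors stand in for pre-trained word2vec/GloVe embeddings, an orthogonal rotation stands in for one of the "tortures", and the result feeds a simple scikit-learn text classifier. The data and the transformation are assumptions; the talk's actual experiments are not reproduced here.
{{{
# Illustrative sketch only: how a (possibly transformed) word-embedding
# representation can feed a text classifier. Random vectors stand in
# for word2vec/GloVe; the random rotation stands in for an arbitrary
# information-preserving "deformation" of the embedding space.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = {"good": 0, "bad": 1, "great": 2, "awful": 3}
embeddings = rng.normal(size=(len(vocab), 50))          # stand-in for word2vec/GloVe
rotation = np.linalg.qr(rng.normal(size=(50, 50)))[0]   # distance-preserving transform
deformed = embeddings @ rotation                         # "tortured" embeddings

def doc_vector(tokens, vectors):
    # Average the vectors of known tokens to get a document representation.
    idx = [vocab[t] for t in tokens if t in vocab]
    return vectors[idx].mean(axis=0)

docs = [["good", "great"], ["bad", "awful"], ["great", "good"], ["awful", "bad"]]
labels = [1, 0, 1, 0]
X = np.stack([doc_vector(d, deformed) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
}}}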
16 November 2020
Agnieszka Chmiel (Adam Mickiewicz University, Poznań), Danijel Korzinek (Polish-Japanese Academy of Information Technology)
PINC is the first Polish simultaneous interpreting corpus based on Polish-English and English-Polish interpretations from the European Parliament. Using naturalistic data makes it possible to answer many questions about the process of simultaneous interpreting. By analysing the ear-voice span, or the delay between the source text and the target text, mechanisms of activation and inhibition can be investigated in the interpreter's lexical processing. Fluency and pause data help us examine the cognitive load. This talk will focus on how we process data in the corpus (such as interpreter voice identification) and what challenges we face in relation to linguistic analysis, dependency parsing and bilingual alignment. We will show how specific data can be applied to help us understand what interpreting involves or even what happens in the interpreter's mind.
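For readers unfamiliar with the measure, the sketch below shows one straightforward way to compute the ear-voice span from time-aligned data: the difference between the onset of a source-language unit and the onset of its interpreted counterpart. The alignment format and the example numbers are assumptions for illustration, not the PINC corpus format.
{{{
# Illustrative only: ear-voice span (EVS) as the delay between the onset
# of a source unit and the onset of its aligned target-language rendering.
# The alignment format below is hypothetical, not the PINC corpus format.
aligned_units = [
    # (source_onset_seconds, target_onset_seconds)
    (12.4, 14.9),
    (15.1, 18.0),
    (19.7, 21.3),
]

evs = [target - source for source, target in aligned_units]
mean_evs = sum(evs) / len(evs)
print(f"EVS per unit: {evs}, mean EVS: {mean_evs:.2f} s")
}}}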
30 November 2020
Findings of ACL: EMNLP 2020: Polish session
Łukasz Borchmann et al. (Applica.ai)
Contract Discovery deals with tasks such as ensuring the inclusion of relevant legal clauses or retrieving them for further analysis (e.g., risk assessment). Because there was no publicly available benchmark for span identification in legal texts, we proposed one along with hard-to-beat baselines. The system is expected to process unstructured text, as in most real-world usage scenarios; that is, no segmentation of legal documents into a hierarchy of distinct (sub)sections is given in advance. What is more, it is assumed that a searched passage can be any part of the document and not necessarily a complete paragraph, subparagraph, or clause. The process should instead be considered a few-shot span identification task. In this particular setting, pretrained universal encoders fail to provide satisfactory results, whereas solutions based on Language Models perform well, especially when unsupervised fine-tuning is applied.
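As a rough illustration of the few-shot retrieval setting (not the Language-Model-based solution described in the talk), the sketch below scores the sentences of an unsegmented document against a handful of example clauses using TF-IDF cosine similarity; scikit-learn and the toy texts are assumptions.
{{{
# Rough illustration of few-shot span retrieval: given a few example
# clauses, rank candidate sentences from an unsegmented document by
# TF-IDF cosine similarity. This is a baseline-style sketch, not the
# LM-based approach described in the talk.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

example_clauses = [
    "This agreement shall be governed by the laws of the State of New York.",
    "The contract is governed by and construed in accordance with English law.",
]
document_sentences = [
    "The parties agree to keep all terms confidential.",
    "This agreement shall be governed by the laws of Poland.",
    "Payment is due within thirty days of the invoice date.",
]

vectorizer = TfidfVectorizer().fit(example_clauses + document_sentences)
query = vectorizer.transform(example_clauses)
candidates = vectorizer.transform(document_sentences)

# Score each candidate sentence by its best similarity to any example clause.
scores = cosine_similarity(candidates, query).max(axis=1)
best = max(range(len(document_sentences)), key=lambda i: scores[i])
print(document_sentences[best], scores[best])
}}}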
Piotr Szymański (Wrocław Technical University), Piotr Żelasko (Johns Hopkins University)
Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and the HUB'05 public benchmark. We show that WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high-quality annotations for training and testing of robust ASR systems.
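For reference, Word Error Rate is the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. A minimal, self-contained sketch (the example sentences are made up):
{{{
# Minimal Word Error Rate (WER) computation: word-level Levenshtein
# distance (substitutions + insertions + deletions) divided by the
# number of reference words. Example strings are made up.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("we had a real conversation", "we had the real conversation"))  # 0.2
}}}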
17 December 2020
Piotr Przybyła (Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences)
The presentation will cover the task of multi-word lexical simplification, in which a sentence in natural language is made easier to understand by replacing one of its fragments with a simpler alternative, both of which can consist of many words. In order to explore this new direction, a corpus (MWLS1) of 1462 English sentences from various sources with 7059 simplifications was prepared through crowdsourcing. Additionally, an automatic solution to the problem (Plainifier), based on a purpose-trained neural language model, will be discussed along with its evaluation against human and resource-based baselines. The results of the presented study were also published at the COLING 2020 conference in an article of the same title.
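The sketch below illustrates the general idea of language-model-driven lexical simplification on a single-token scale, using the HuggingFace fill-mask pipeline; it is not the Plainifier model, and multi-word replacement is deliberately left out to keep the example short. The checkpoint bert-base-uncased and the example sentence are assumptions.
{{{
# Illustration of LM-based lexical simplification for a single masked
# token (the multi-word case handled by Plainifier is more involved).
# Assumes the HuggingFace `transformers` library and a BERT checkpoint.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
sentence = "The committee decided to [MASK] the proposal."
for candidate in fill(sentence, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
}}}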
18 January 2021
Norbert Ryciak, Maciej Chrabąszcz, Maciej Bartoszuk (Sages)
During our presentation we will discuss a solution to the patent-application classification task, one of the GovTech competition problems. We will describe the characteristics of the problem and the proposed solution, in particular the original method of representing documents as "clouds of word embeddings".
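A hedged sketch of one way such a "cloud of word embeddings" representation could work: each document is kept as the set of its word vectors, and documents are compared cloud-to-cloud. The random vectors, toy vocabulary and similarity function are assumptions for illustration, not the authors' competition solution.
{{{
# Illustrative "cloud of word embeddings": a document is kept as the set
# of its word vectors (rather than a single averaged vector), and two
# documents are compared cloud-to-cloud. Random vectors stand in for
# real embeddings; this is not the competition solution itself.
import numpy as np

rng = np.random.default_rng(42)
vocab = ["engine", "turbine", "fuel", "antibody", "protein", "vaccine"]
vectors = {w: rng.normal(size=16) for w in vocab}

def cloud(tokens):
    return np.stack([vectors[t] for t in tokens if t in vectors])

def cloud_similarity(a, b):
    # Mean of the best cosine match for every vector in cloud `a`.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1).mean()

doc_mech = cloud(["engine", "turbine", "fuel"])
doc_bio = cloud(["antibody", "protein", "vaccine"])
query = cloud(["fuel", "engine"])
print(cloud_similarity(query, doc_mech), cloud_similarity(query, doc_bio))
}}}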
1 February 2021
Adam Jatowt (University of Innsbruck)
News archives offer immense value to our society, helping users to learn the details of events that occurred in the past. Currently, access to such collections is difficult for average users due to their large size and the expertise in history they require. We propose a large-scale open-domain question answering model designed for long-term news article collections, with a dedicated module for re-ranking articles using temporal information. In the second part of the talk we will discuss methods for finding and explaining temporal analogs – entities in the past which are analogous to entities in the present (e.g., the walkman as a temporal analog of the iPad).
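As a hedged illustration of temporal re-ranking (not the proposed model), the sketch below combines a textual relevance score with a score favouring articles whose publication year is close to the year the question is assumed to be about; the weights, dates and scoring function are assumptions.
{{{
# Illustration of temporal re-ranking for archive QA: combine a textual
# relevance score with closeness of the article's publication year to
# the year the question is (assumed to be) about. Scores, weights and
# the example data are made up for illustration.
from dataclasses import dataclass

@dataclass
class Article:
    title: str
    year: int
    text_score: float  # relevance from a first-stage retriever

def temporal_score(article: Article, target_year: int, scale: float = 5.0) -> float:
    return 1.0 / (1.0 + abs(article.year - target_year) / scale)

def rerank(articles, target_year, alpha=0.7):
    # alpha weights textual relevance against temporal proximity.
    key = lambda a: alpha * a.text_score + (1 - alpha) * temporal_score(a, target_year)
    return sorted(articles, key=key, reverse=True)

articles = [
    Article("Portable cassette players boom", 1981, 0.6),
    Article("Tablet computers reviewed", 2011, 0.8),
]
for a in rerank(articles, target_year=1982):
    print(a.title)
}}}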
15 February 2021
Aleksandra Nabożny (Polish-Japanese Academy of Information Technology)
Methods of optimizing the work of experts during the annotation of non-credible medical texts |
Automatic credibility assessment of medical content is an extremely difficult task, because expert assessment is burdened with a large interpretive bias that depends on the individual clinical experience of a given doctor. Moreover, a simple factual assessment turns out to be insufficient to determine the credibility of this type of content. During the seminar, I will present the results of my team's and my efforts to optimize the annotation process. We proposed a sentence-ordering method in which sentences likely to be non-credible are placed nearer the beginning of the evaluation queue. I will also present our proposals for extending the annotation protocol to increase the consistency of assessments. Finally, I invite you to a discussion on potential research directions for detecting harmful narratives in so-called medical fake news.
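The sketch below shows the general idea of priority ordering for annotation: sentences are sorted so that those a model considers more likely to be non-credible reach the expert first. The scores are dummy placeholders standing in for a real classifier, not the method proposed in the talk.
{{{
# General idea of annotation prioritisation: put sentences that a model
# scores as likely non-credible at the front of the expert's queue.
# The scores below are dummy values standing in for a real classifier.
sentences = [
    ("Vitamin C cures all viral infections within a day.", 0.92),
    ("The study enrolled 120 adult participants.", 0.08),
    ("Vaccines contain microchips for tracking.", 0.97),
    ("Side effects were mild and transient.", 0.15),
]

# Higher estimated probability of being non-credible -> annotated earlier.
annotation_queue = sorted(sentences, key=lambda item: item[1], reverse=True)
for text, p_noncredible in annotation_queue:
    print(f"{p_noncredible:.2f}  {text}")
}}}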
9 March 2021 (NOTE: the seminar will start at 12:00)
Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences), Izabela Chojnicka (Faculty of Psychology, University of Warsaw), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)
Machine learning in detecting schizophrenia and autism from textual utterances
Detection of mental disorders from textual input is an emerging field for applied machine learning and deep learning methods. In our talk, we will explore the limits of automated detection of autism spectrum disorder and schizophrenia. We will analyse both disorders and describe two diagnostic tools, TLC and ADOS-2, along with the characteristics of the collected data. We will compare the performance of (1) TLC and ADOS-2, (2) machine learning and deep learning methods applied to the data gathered by these tools, and (3) psychiatrists. We will discuss the effectiveness of several baseline approaches such as bag-of-words and dictionary-based methods, including sentiment and language abstraction. We will then introduce the newest approaches using deep learning for text representation and inference. Owing to the related nature of both disorders, we will describe experiments with transfer and zero-shot learning techniques. Finally, we will explore few-shot methods dedicated to low-data scenarios, a typical problem in the clinical setting. Psychiatry is one of the few medical fields in which the diagnosis of most disorders is based on the subjective assessment of a psychiatrist, so the introduction of objective tools supporting diagnostics seems pivotal. This work is a step in that direction.
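To make the baseline concrete, below is a minimal bag-of-words classification sketch with scikit-learn: TF-IDF features fed to logistic regression. The toy sentences and labels are arbitrary placeholders, since clinical data cannot be reproduced here.
{{{
# Minimal bag-of-words baseline of the kind mentioned above: TF-IDF
# features + logistic regression. The toy sentences and labels are
# arbitrary placeholders, not clinical data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I went to the shop and then came back home.",
    "The weather was nice so we walked in the park.",
    "Colours melt when the clock speaks backwards.",
    "Numbers whisper under the green idea of sleep.",
]
labels = [0, 0, 1, 1]  # arbitrary binary labels for the sketch

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["We took a walk and then went home."]))
}}}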
Please see also the talks given in 2000–2015 and 2015–2020. |