Natural Language Processing Seminar 2020–2021
The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, currently online – please use the link next to the presentation title. All recorded talks are available on YouTube.
5 October 2020
Piotr Rybak, Robert Mroczkowski, Janusz Tracz (ML Research at Allegro.pl), Ireneusz Gawlik (ML Research at Allegro.pl & AGH University of Science and Technology)
In recent years, a series of BERT-based models has improved the performance of many natural language processing systems. During this talk, we will briefly introduce the BERT model as well as some of its variants. Next, we will focus on the available BERT-based models for the Polish language and their results on the KLEJ benchmark. Finally, we will dive into the details of the new model developed in cooperation between ICS PAS and Allegro.
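As a hedged illustration of how such a model can be used downstream, the sketch below loads a publicly available Polish BERT checkpoint with the HuggingFace transformers library and extracts a sentence representation. The checkpoint name allegro/herbert-base-cased and the pooling choice are assumptions made for this example, not a description of the KLEJ submission itself.
{{{
# Minimal sketch: encode a Polish sentence with a BERT-based model.
# Assumes the HuggingFace `transformers` library and the publicly
# released `allegro/herbert-base-cased` checkpoint; any other Polish
# BERT variant could be substituted.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

sentence = "Seminarium odbywa się w poniedziałki."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] vector as a simple sentence representation,
# e.g. as input features for a KLEJ-style classification task.
cls_vector = outputs.last_hidden_state[:, 0, :]
print(cls_vector.shape)  # (1, hidden_size)
}}}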
2 November 2020
Inez Okulska (NASK National Research Institute)
The introduction of vector representations of words, containing the weights of context and central words, computed from giant corpora of a given language rather than encoding manually selected linguistic features of words, proved to be a breakthrough for NLP research. After the first delight came revision and a search for improvements – above all to broaden the context, to handle homonyms, and so on. Nevertheless, the classic embeddings still apply to many tasks – for example, content classification – and in many cases their performance is still good enough. What do they encode? Do they contain redundant elements? If transformed or reduced, will they preserve the information in a way that still retains the original "meaning"? And what is meaning here? How far can these vectors be deformed, and how does this relate to encryption methods? In my talk I will present a reflection on this subject, illustrated by the results of various "tortures" of the embeddings (word2vec and GloVe) and their precision in the task of classifying texts whose content must remain masked for human users.
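To make the kind of experiment described above concrete, here is a small, purely illustrative sketch: random vectors stand in for pre-trained word2vec/GloVe embeddings, an orthogonal rotation stands in for one of the "tortures", and the result feeds a simple scikit-learn text classifier. The data and the transformation are assumptions; the talk's actual experiments are not reproduced here.
{{{
# Illustrative sketch only: how a (possibly transformed) word-embedding
# representation can feed a text classifier. Random vectors stand in
# for word2vec/GloVe; the random rotation stands in for an arbitrary
# information-preserving "deformation" of the embedding space.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = {"good": 0, "bad": 1, "great": 2, "awful": 3}
embeddings = rng.normal(size=(len(vocab), 50))          # stand-in for word2vec/GloVe
rotation = np.linalg.qr(rng.normal(size=(50, 50)))[0]   # distance-preserving transform
deformed = embeddings @ rotation                         # "tortured" embeddings

def doc_vector(tokens, vectors):
    # Average the vectors of known tokens to get a document representation.
    idx = [vocab[t] for t in tokens if t in vocab]
    return vectors[idx].mean(axis=0)

docs = [["good", "great"], ["bad", "awful"], ["great", "good"], ["awful", "bad"]]
labels = [1, 0, 1, 0]
X = np.stack([doc_vector(d, deformed) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
}}}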
16 November 2020
Agnieszka Chmiel (Adam Mickiewicz University, Poznań), Danijel Korzinek (Polish-Japanese Academy of Information Technology)
PINC is the first Polish simultaneous interpreting corpus based on Polish-English and English-Polish interpretations from the European Parliament. Using naturalistic data makes it possible to answer many questions about the process of simultaneous interpreting. By analysing the ear-voice span, or the delay between the source text and the target text, mechanisms of activation and inhibition can be investigated in the interpreter's lexical processing. Fluency and pause data help us examine the cognitive load. This talk will focus on how we process data in the corpus (such as interpreter voice identification) and what challenges we face in relation to linguistic analysis, dependency parsing and bilingual alignment. We will show how specific data can be applied to help us understand what interpreting involves or even what happens in the interpreter's mind.
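For readers unfamiliar with the measure, the sketch below shows one straightforward way to compute the ear-voice span from time-aligned data: the difference between the onset of a source-language unit and the onset of its interpreted counterpart. The alignment format and the example numbers are assumptions for illustration, not the PINC corpus format.
{{{
# Illustrative only: ear-voice span (EVS) as the delay between the onset
# of a source unit and the onset of its aligned target-language rendering.
# The alignment format below is hypothetical, not the PINC corpus format.
aligned_units = [
    # (source_onset_seconds, target_onset_seconds)
    (12.4, 14.9),
    (15.1, 18.0),
    (19.7, 21.3),
]

evs = [target - source for source, target in aligned_units]
mean_evs = sum(evs) / len(evs)
print(f"EVS per unit: {evs}, mean EVS: {mean_evs:.2f} s")
}}}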
30 November 2020
Findings of ACL: EMNLP 2020: Polish session
Łukasz Borchmann et al. (Applica.ai)
Contract Discovery deals with tasks such as ensuring the inclusion of relevant legal clauses or retrieving them for further analysis (e.g., risk assessment). Because there was no publicly available benchmark for span identification in legal texts, we proposed one along with hard-to-beat baselines. The system is expected to process unstructured text, as in most real-world usage scenarios; that is, no segmentation of legal documents into a hierarchy of distinct (sub)sections is given in advance. What is more, it is assumed that a searched passage can be any part of the document and not necessarily a complete paragraph, subparagraph, or clause. The process should instead be considered a few-shot span identification task. In this particular setting, pretrained universal encoders fail to provide satisfactory results, whereas solutions based on Language Models perform well, especially when unsupervised fine-tuning is applied.
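As a rough illustration of the few-shot retrieval setting (not the Language-Model-based solution described in the talk), the sketch below scores the sentences of an unsegmented document against a handful of example clauses using TF-IDF cosine similarity; scikit-learn and the toy texts are assumptions.
{{{
# Rough illustration of few-shot span retrieval: given a few example
# clauses, rank candidate sentences from an unsegmented document by
# TF-IDF cosine similarity. This is a baseline-style sketch, not the
# LM-based approach described in the talk.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

example_clauses = [
    "This agreement shall be governed by the laws of the State of New York.",
    "The contract is governed by and construed in accordance with English law.",
]
document_sentences = [
    "The parties agree to keep all terms confidential.",
    "This agreement shall be governed by the laws of Poland.",
    "Payment is due within thirty days of the invoice date.",
]

vectorizer = TfidfVectorizer().fit(example_clauses + document_sentences)
query = vectorizer.transform(example_clauses)
candidates = vectorizer.transform(document_sentences)

# Score each candidate sentence by its best similarity to any example clause.
scores = cosine_similarity(candidates, query).max(axis=1)
best = max(range(len(document_sentences)), key=lambda i: scores[i])
print(document_sentences[best], scores[best])
}}}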
Piotr Szymański (Wrocław Technical University), Piotr Żelasko (Johns Hopkins University)
Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and the HUB'05 public benchmark. We show that WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high-quality annotations for training and testing of robust ASR systems.
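For reference, Word Error Rate is the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. A minimal, self-contained sketch (the example sentences are made up):
{{{
# Minimal Word Error Rate (WER) computation: word-level Levenshtein
# distance (substitutions + insertions + deletions) divided by the
# number of reference words. Example strings are made up.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("we had a real conversation", "we had the real conversation"))  # 0.2
}}}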
17 December 2020
Piotr Przybyła (Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences)
The presentation will cover the task of multi-word lexical simplification, in which a sentence in natural language is made easier to understand by replacing one of its fragments with a simpler alternative, both of which can consist of many words. In order to explore this new direction, a corpus (MWLS1) of 1462 English sentences from various sources with 7059 simplifications was prepared through crowdsourcing. Additionally, an automatic solution to the problem (Plainifier), based on a purpose-trained neural language model, will be discussed along with its evaluation against human and resource-based baselines. The results of the presented study were also published at the COLING 2020 conference in an article of the same title.
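The sketch below illustrates the general idea of language-model-driven lexical simplification on a single-token scale, using the HuggingFace fill-mask pipeline; it is not the Plainifier model, and multi-word replacement is deliberately left out to keep the example short. The checkpoint bert-base-uncased and the example sentence are assumptions.
{{{
# Illustration of LM-based lexical simplification for a single masked
# token (the multi-word case handled by Plainifier is more involved).
# Assumes the HuggingFace `transformers` library and a BERT checkpoint.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
sentence = "The committee decided to [MASK] the proposal."
for candidate in fill(sentence, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
}}}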
18 January 2021
Norbert Ryciak, Maciej Chrabąszcz, Maciej Bartoszuk (Sages)
During our presentation we will discuss a solution to the patent-application classification task, one of the GovTech competition problems. We will describe the characteristics of the problem and the proposed solution, in particular the original method of representing documents as "clouds of word embeddings".
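A hedged sketch of one way such a "cloud of word embeddings" representation could work: each document is kept as the set of its word vectors, and documents are compared cloud-to-cloud. The random vectors, toy vocabulary and similarity function are assumptions for illustration, not the authors' competition solution.
{{{
# Illustrative "cloud of word embeddings": a document is kept as the set
# of its word vectors (rather than a single averaged vector), and two
# documents are compared cloud-to-cloud. Random vectors stand in for
# real embeddings; this is not the competition solution itself.
import numpy as np

rng = np.random.default_rng(42)
vocab = ["engine", "turbine", "fuel", "antibody", "protein", "vaccine"]
vectors = {w: rng.normal(size=16) for w in vocab}

def cloud(tokens):
    return np.stack([vectors[t] for t in tokens if t in vectors])

def cloud_similarity(a, b):
    # Mean of the best cosine match for every vector in cloud `a`.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1).mean()

doc_mech = cloud(["engine", "turbine", "fuel"])
doc_bio = cloud(["antibody", "protein", "vaccine"])
query = cloud(["fuel", "engine"])
print(cloud_similarity(query, doc_mech), cloud_similarity(query, doc_bio))
}}}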
1 February 2021
Adam Jatowt (University of Innsbruck)
News archives offer immense value to our society, helping users to learn the details of events that occurred in the past. Currently, access to such collections is difficult for average users due to their large size and the expertise in history they require. We propose a large-scale open-domain question answering model designed for long-term news article collections, with a dedicated module for re-ranking articles using temporal information. In the second part of the talk we will discuss methods for finding and explaining temporal analogs – entities in the past which are analogous to entities in the present (e.g., the walkman as a temporal analog of the iPad).
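As a hedged illustration of temporal re-ranking (not the proposed model), the sketch below combines a textual relevance score with a score favouring articles whose publication year is close to the year the question is assumed to be about; the weights, dates and scoring function are assumptions.
{{{
# Illustration of temporal re-ranking for archive QA: combine a textual
# relevance score with closeness of the article's publication year to
# the year the question is (assumed to be) about. Scores, weights and
# the example data are made up for illustration.
from dataclasses import dataclass

@dataclass
class Article:
    title: str
    year: int
    text_score: float  # relevance from a first-stage retriever

def temporal_score(article: Article, target_year: int, scale: float = 5.0) -> float:
    return 1.0 / (1.0 + abs(article.year - target_year) / scale)

def rerank(articles, target_year, alpha=0.7):
    # alpha weights textual relevance against temporal proximity.
    key = lambda a: alpha * a.text_score + (1 - alpha) * temporal_score(a, target_year)
    return sorted(articles, key=key, reverse=True)

articles = [
    Article("Portable cassette players boom", 1981, 0.6),
    Article("Tablet computers reviewed", 2011, 0.8),
]
for a in rerank(articles, target_year=1982):
    print(a.title)
}}}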
15 February 2021
Aleksandra Nabożny (Polish-Japanese Academy of Information Technology)
Methods of optimizing the work of experts during the annotation of non-credible medical texts |
Automatic credibility assessment of medical content is an extremely difficult task, because expert assessment is burdened with a large interpretive bias that depends on the individual clinical experience of a given doctor. Moreover, a simple factual assessment turns out to be insufficient to determine the credibility of this type of content. During the seminar, I will present the results of my team's and my efforts to optimize the annotation process. We proposed a sentence-ordering method in which sentences likely to be non-credible are placed nearer the beginning of the evaluation queue. I will also present our proposals for extending the annotation protocol to increase the consistency of assessments. Finally, I invite you to a discussion on potential research directions for detecting harmful narratives in so-called medical fake news.
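The sketch below shows the general idea of priority ordering for annotation: sentences are sorted so that those a model considers more likely to be non-credible reach the expert first. The scores are dummy placeholders standing in for a real classifier, not the method proposed in the talk.
{{{
# General idea of annotation prioritisation: put sentences that a model
# scores as likely non-credible at the front of the expert's queue.
# The scores below are dummy values standing in for a real classifier.
sentences = [
    ("Vitamin C cures all viral infections within a day.", 0.92),
    ("The study enrolled 120 adult participants.", 0.08),
    ("Vaccines contain microchips for tracking.", 0.97),
    ("Side effects were mild and transient.", 0.15),
]

# Higher estimated probability of being non-credible -> annotated earlier.
annotation_queue = sorted(sentences, key=lambda item: item[1], reverse=True)
for text, p_noncredible in annotation_queue:
    print(f"{p_noncredible:.2f}  {text}")
}}}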
9 March 2021 (NOTE: the seminar will start at 12:00)
Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences), Izabela Chojnicka (Faculty of Psychology, University of Warsaw), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)
Machine learning in detecting schizophrenia and autism from textual utterances
Detection of mental disorders from textual input is an emerging field for applied machine learning and deep learning methods. In our talk, we will explore the limits of automated detection of autism spectrum disorder and schizophrenia. We will analyse both disorders and describe two diagnostic tools, TLC and ADOS-2, along with the characteristics of the collected data. We will compare the performance of (1) TLC and ADOS-2, (2) machine learning and deep learning methods applied to the data gathered by these tools, and (3) psychiatrists. We will discuss the effectiveness of several baseline approaches such as bag-of-words and dictionary-based methods, including sentiment and language abstraction. We will then introduce the newest approaches using deep learning for text representation and inference. Owing to the related nature of both disorders, we will describe experiments with transfer and zero-shot learning techniques. Finally, we will explore few-shot methods dedicated to low-data scenarios, a typical problem in the clinical setting. Psychiatry is one of the few medical fields in which the diagnosis of most disorders is based on the subjective assessment of a psychiatrist, so the introduction of objective tools supporting diagnostics seems pivotal. This work is a step in that direction.
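To make the baseline concrete, below is a minimal bag-of-words classification sketch with scikit-learn: TF-IDF features fed to logistic regression. The toy sentences and labels are arbitrary placeholders, since clinical data cannot be reproduced here.
{{{
# Minimal bag-of-words baseline of the kind mentioned above: TF-IDF
# features + logistic regression. The toy sentences and labels are
# arbitrary placeholders, not clinical data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I went to the shop and then came back home.",
    "The weather was nice so we walked in the park.",
    "Colours melt when the clock speaks backwards.",
    "Numbers whisper under the green idea of sleep.",
]
labels = [0, 0, 1, 1]  # arbitrary binary labels for the sketch

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["We took a walk and then went home."]))
}}}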
Please see also the talks given in 2000–2015 and 2015–2020. |