Size: 19410
Comment:
|
Size: 20797
Comment:
|
Deletions are marked like this. | Additions are marked like this. |
Line 57: | Line 57: |
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Filip Graliński''' (Applica.ai / Adam Mickiewicz University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Kleister''' (full title will be given shortly)  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available shortly.|| |
||<style="border:0;padding-left:30px;padding-bottom:0px">'''Filip Graliński''', '''Agnieszka Kaliska''' (Applica.ai / Adam Mickiewicz University), '''Tomasz Stanisławek''', '''Anna Wróblewska''' (Applica.ai / Warsaw University of Technology), '''Dawid Lipiński''', '''Bartosz Topolski''' (Applica.ai), '''Paulina Rosalska''' (Applica.ai / Nicolaus Copernicus University), '''Przemysław Biecek''' (Warsaw University of Technology / Samsung R&D Institute Poland)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://teams.microsoft.com/l/meetup-join/19%3ameeting_ZTlmNGQ1M2ItNDJiNS00NTgwLThiMDMtZTQzZDVkNzhmZWRi%40thread.v2/0?context=%7b%22Tid%22%3a%220425f1d9-16b2-41e3-a01a-0c02a63d13d6%22%2c%22Oid%22%3a%22f5f2c910-5438-48a7-b9dd-683a5c3daf1e%22%7d|{{attachment:seminarium-archiwum/teams.png}}]] '''Key Information Extraction from documents: Kleister NDA/Charity challenges'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">This presentation will showcase two new datasets (Kleister NDA and Kleister Charity) for Key Information Extraction. They involve a mix of born-digital and scanned long formal documents in English. In these datasets, an NLP system is expected to find or infer various types of entities by utilizing both textual and structural layout features.|| |
Line 61: | Line 61: |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''29 March 2021'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marek Kubis''' (Applica.ai / Adam Mickiewicz University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">'''Quantitative analysis of character networks in Polish 19th- and 20th-century novels''' (draft title)  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available shortly. Until then you can read [[https://academic.oup.com/dsh/advance-article-abstract/doi/10.1093/llc/fqab012/6151748|the Digital Scholarhip in the Humanities article]] and [[https://dev.clariah.nl/files/dh2019/boa/0843.html|its abstract from Digital Humanities 2019]].|| |
||<style="border:0;padding-top:5px;padding-bottom:5px">'''12 April 2021'''|| ||<style="border:0;padding-left:30px;padding-bottom:0px">'''Marek Kubis''' (Adam Mickiewicz University)|| ||<style="border:0;padding-left:30px;padding-bottom:5px">[[https://teams.microsoft.com/l/meetup-join/19%3ameeting_NTQxMTRjOTctNWE2ZS00OGU5LTgzMDAtYTk2N2FjMmJhYWJk%40thread.v2/0?context=%7b%22Tid%22%3a%220425f1d9-16b2-41e3-a01a-0c02a63d13d6%22%2c%22Oid%22%3a%22f5f2c910-5438-48a7-b9dd-683a5c3daf1e%22%7d|{{attachment:seminarium-archiwum/teams.png}}]] '''Quantitative analysis of character networks in Polish 19th- and 20th-century novels'''  {{attachment:seminarium-archiwum/icon-pl.gif|Talk delivered in Polish.}}|| ||<style="border:0;padding-left:30px;padding-bottom:15px">I will present a study on induction and quantitative analysis of character networks inferred from Polish novels. The corpus compiled for this study includes both 19th- and 20th-century literary works obtained from publicly available sources. I will discuss the development of the corpus and the network extraction procedure. The structural properties observed for the networks induced from Polish novels will be compared with the results observed for English novels. Furthermore, I will compare the networks induced from 19th-century novels to the 20th-century networks.|| |
Line 66: | Line 66: |
||<style="border:0;padding-left:30px;padding-bottom:15px">Talk summary will be made available shortly. || |
Natural Language Processing Seminar 2020–2021
The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, currently online – please use the link next to the presentation title. All recorded talks are available on YouTube. |
5 October 2020 |
Piotr Rybak, Robert Mroczkowski, Janusz Tracz (ML Research at Allegro.pl), Ireneusz Gawlik (ML Research at Allegro.pl & AGH University of Science and Technology) |
In recent years, a series of BERT-based models improved the performance of many natural language processing systems. During this talk, we will briefly introduce the BERT model as well as some of its variants. Next, we will focus on the available BERT-based models for the Polish language and their results on the KLEJ benchmark. Finally, we will dive into the details of the new model developed in cooperation between ICS PAS and Allegro. |
2 November 2020 |
Inez Okulska (NASK National Research Institute) |
The introduction of vector representations of words, which contain the weights of context and central words computed from giant corpora of a given language rather than encoding manually selected linguistic features, proved to be a breakthrough for NLP research. After the initial enthusiasm came revision and a search for improvements, primarily in order to broaden the context, to handle homonyms, etc. Nevertheless, the classic embeddings still apply to many tasks, for example content classification, and in many cases their performance is still good enough. What do they encode? Do they contain redundant elements? If transformed or reduced, will they maintain the information in a way that still preserves the original "meaning"? What is the meaning here? How far can these vectors be deformed and how does it relate to encryption methods? In my talk I will present a reflection on this subject, illustrated by the results of various "tortures" of the embeddings (word2vec and GloVe) and their precision in the task of classifying texts whose content must remain masked for human users. |
16 November 2020 |
Agnieszka Chmiel (Adam Mickiewicz University, Poznań), Danijel Korzinek (Polish-Japanese Academy of Information Technology) |
PINC is the first Polish simultaneous interpreting corpus based on Polish-English and English-Polish interpretations from the European Parliament. Using naturalistic data makes it possible to answer many questions about the process of simultaneous interpreting. By analysing the ear-voice span, or the delay between the source text and the target text, mechanisms of activation and inhibition can be investigated in the interpreter’s lexical processing. Fluency and pause data help us examine the cognitive load. This talk will focus on how we process data in the corpus (such as interpreter voice identification) and what challenges we face in relation to linguistic analysis, dependency parsing and bilingual alignment. We will show how specific data can be applied to help us understand what interpreting involves or even what happens in the interpreter’s mind. |
30 November 2020 |
Findings of ACL: EMNLP 2020: Polish session |
Łukasz Borchmann et al. (Applica.ai) |
Contract Discovery deals with tasks such as ensuring the inclusion of relevant legal clauses or their retrieval for further analysis (e.g., risk assessment). Because there was no publicly available benchmark for span identification from legal texts, we proposed one along with hard-to-beat baselines. Systems are expected to process unstructured text, as in most real-world usage scenarios; that is, no segmentation of legal documents into a hierarchy of distinct (sub)sections is given in advance. What is more, it is assumed that a searched passage can be any part of the document and not necessarily a complete paragraph, subparagraph, or clause. Instead, the process should be considered as a few-shot span identification task. In this particular setting, pretrained, universal encoders fail to provide satisfactory results. In contrast, solutions based on the Language Models perform well, especially when unsupervised fine-tuning is applied. |
Piotr Szymański (Wrocław Technical University), Piotr Żelasko (Johns Hopkins University) |
Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and on the HUB'05 public benchmark. We show that WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high-quality annotations for training and testing of robust ASR systems. |
17 December 2020 |
Piotr Przybyła (Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences) |
The presentation will cover the task of multi-word lexical simplification, in which a sentence in natural language is made easier to understand by replacing its fragment with a simpler alternative, both of which can consist of many words. In order to explore this new direction, a corpus (MWLS1) including 1462 sentences in English from various sources with 7059 simplifications was prepared through crowdsourcing. Additionally, an automatic solution (Plainifier) for the problem, based on a purpose-trained neural language model, will be discussed along with its evaluation against human and resource-based baselines. The results of the presented study were also published at the COLING 2020 conference in an article of the same title. |
18 January 2021 |
Norbert Ryciak, Maciej Chrabąszcz, Maciej Bartoszuk (Sages) |
During our presentation we will discuss our solution to the patent application classification task, which was one of the problems in the GovTech competition. We will describe the characteristics of the problem and the proposed solution, especially the original method of representing documents as “clouds of word embeddings”. |
1 February 2021 |
Adam Jatowt (University of Innsbruck) |
News archives offer immense value to our society, helping users to learn details of events that occurred in the past. Currently, access to such collections is difficult for average users due to their large size and the need for expertise in history. We propose a large-scale open-domain question answering model designed for long-term news article collections, with a dedicated module for re-ranking articles by using temporal information. In the second part of the talk we will discuss methods for finding and explaining temporal analogs – entities in the past which are analogous to entities in the present (e.g., Walkman as a temporal analog of iPad). |
15 February 2021 |
Aleksandra Nabożny (Polish-Japanese Academy of Information Technology) |
Automatic credibility assessment of medical content is an extremely difficult task. This is because expert assessment is burdened with a large interpretive bias, which depends on the individual clinical experience of a given doctor. Moreover, a simple factual assessment turns out to be insufficient to determine the credibility of this type of content. During the seminar, I will present the results of the efforts my team and I have made to optimize the annotation process. We proposed a sentence ordering method where non-credible sentences are more likely to be placed at the beginning of the queue for evaluation. I will also present our proposals for extending the annotator protocol to increase the consistency of assessments. Finally, I invite you to a discussion on potential research directions to detect harmful narratives in so-called medical fake news. |
Please see also the talks given in 2000–2015 and 2015–2020. |