Natural Language Processing Seminar 2021–2022
The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, currently online – please use the link next to the presentation title. All recorded talks are available on YouTube.
11 October 2021
Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)
The aim of this talk is to provide a semantic analysis of a construction – Heterofunctional Coordination – which is typical of Slavic and some neighbouring languages. In this construction, expressions bearing different grammatical functions may be conjoined. In this talk, I will propose a semantic analysis of such constructions based on the concept of generalized quantifiers (Mostowski; Lindström; Barwise and Cooper), and more specifically – polyadic quantifiers (van Benthem; Keenan; Westerståhl). Some familiarity with the language of predicate logic should suffice to fully understand the talk; all linguistic concepts (including "coordination" and "grammatical functions") and logical concepts (including "generalized quantifiers" and "polyadic quantifiers") will be explained in the talk.
18 October 2021
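As a brief illustration of the two logical notions mentioned above (the notation below is a standard textbook rendering, not taken from the talk itself): a generalized quantifier of type ⟨1,1⟩ denotes a relation between subsets of the domain, while a polyadic quantifier quantifies over tuples – e.g. the resumption of a quantifier applies it to binary relations over the domain:

```latex
% Monadic generalized quantifier of type <1,1>: a relation between sets
\mathit{every}_E(A, B) \iff A \subseteq B, \qquad A, B \subseteq E
% Polyadic (resumptive) lift: the same quantifier applied to pairs,
% i.e. to binary relations R, S over the domain E
\mathrm{Res}^2(\mathit{every})_E(R, S) \iff R \subseteq S, \qquad R, S \subseteq E \times E
```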
Przemysław Kazienko, Jan Kocoń (Wrocław University of Science and Technology)
Many natural language processing tasks, such as classifying offensive, toxic, or emotional texts, are inherently subjective in nature. This is a major challenge, especially with regard to the annotation process. Humans tend to perceive textual content in their own individual way. Most current annotation procedures aim to achieve a high level of agreement in order to generate a high quality reference source. Existing machine learning methods commonly rely on agreed output values that are the same for all annotators. However, annotation guidelines for subjective content can limit annotators' decision-making freedom. Motivated by moderate annotation agreement on offensive and emotional content datasets, we hypothesize that a personalized approach should be introduced for such subjective tasks. We propose new deep learning architectures that take into account not only the content but also the characteristics of the individual. We propose different approaches for learning the representation and processing of data about text readers. Experiments were conducted on four datasets: Wikipedia discussion texts labeled with attack, aggression, and toxicity, and opinions annotated with ten numerical emotional categories. All of our models based on human biases and their representations significantly improve prediction quality in subjective tasks evaluated from an individual's perspective. Additionally, we have developed requirements for annotation, personalization, and content processing procedures to make our solutions human-centric.
20 December 2021
Piotr Pęzik, Agnieszka Mikołajczyk, Adam Wawrzyński (University of Łódź / VoiceLab), Bartłomiej Nitoń, Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)
Keyword Extraction with a Text-to-Text Transfer Transformer (T5)
The talk will explore the relevance of the Text-to-Text Transfer Transformer language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the newly released Polish Open Science Metadata Corpus (POSMC), which is currently a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project. We compare the results obtained by four different methods, i.e. plT5, extremeText, TermoPL and KeyBERT, and conclude that the T5 model yields particularly promising results for sparsely represented keywords. Furthermore, a plT5 keyword generation model trained on the POSMC also seems to produce highly useful results in cross-domain text labelling scenarios. We discuss the performance of the model on news stories and phone-based dialogue transcripts, which represent text genres and domains extrinsic to the dataset of scientific abstracts. Finally, we also attempt to characterize the challenges of evaluating a text-to-text model on both intrinsic and extrinsic keyword extraction.
31 January 2022
Tomasz Limisiewicz (Charles University in Prague)
Interpreting and Controlling Linguistic Features in Neural Networks' Representations
The talk summary will be made available shortly.
Please see also the talks given in 2000–2015 and 2015–2020. |