

Natural Language Processing Seminar 2021–2022

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, currently online – please use the link next to the presentation title. All recorded talks are available on YouTube.


11 October 2021

Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)

Polyadic Quantifiers in Heterofunctional Coordination  Talk delivered in Polish.

The aim of this talk is to provide a semantic analysis of a construction – Heterofunctional Coordination – which is typical of Slavic and some neighbouring languages. In this construction, expressions bearing different grammatical functions may be conjoined. In this talk, I will propose a semantic analysis of such constructions based on the concept of generalized quantifiers (Mostowski; Lindström; Barwise and Cooper), and more specifically – polyadic quantifiers (van Benthem; Keenan; Westerståhl). Some familiarity with the language of predicate logic should suffice to fully understand the talk; all linguistic concepts (including "coordination", "grammatical functions") and logical concepts (including "generalized quantifiers" and "polyadic quantifiers") will be explained in the talk.

18 October 2021

Przemysław Kazienko, Jan Kocoń (Wrocław University of Technology)

https://www.youtube.com/watch?v=mvjO4R1r6gM Personalized NLP  Talk delivered in English.

Many natural language processing tasks, such as classifying offensive, toxic, or emotional texts, are inherently subjective in nature. This is a major challenge, especially with regard to the annotation process. Humans tend to perceive textual content in their own individual way. Most current annotation procedures aim to achieve a high level of agreement in order to generate a high quality reference source. Existing machine learning methods commonly rely on agreed output values that are the same for all annotators. However, annotation guidelines for subjective content can limit annotators' decision-making freedom. Motivated by moderate annotation agreement on offensive and emotional content datasets, we hypothesize that a personalized approach should be introduced for such subjective tasks. We propose new deep learning architectures that take into account not only the content but also the characteristics of the individual. We propose different approaches for learning the representation and processing of data about text readers. Experiments were conducted on four datasets: Wikipedia discussion texts labeled with attack, aggression, and toxicity, and opinions annotated with ten numerical emotional categories. All of our models based on human biases and their representations significantly improve prediction quality in subjective tasks evaluated from an individual's perspective. Additionally, we have developed requirements for annotation, personalization, and content processing procedures to make our solutions human-centric.

8 November 2021

Ryszard Tuora, Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

https://www.youtube.com/watch?v=KeeVWXXQlw8 Dependency Trees in Automatic Inflection of Multi Word Expressions in Polish  Talk delivered in Polish.

Natural language generation for morphologically rich languages can benefit from automatic inflection systems. This work presents such a system, which can tackle inflection, with particular emphasis on Multi Word Expressions (MWEs). This is done using rules induced automatically from a dependency treebank. The system is evaluated on a dictionary of Polish MWEs. Additionally, a similar algorithm can be utilized for lemmatization of MWEs. In principle, the system can also be applied to other languages with similar morphological mechanisms. To prove that, we will present a simple solution for Russian.

29 November 2021

Piotr Przybyła (Institute of Computer Science, Polish Academy of Sciences)

https://teams.microsoft.com/l/meetup-join/19%3a06de5a6d7ed840f0a53c26bf62c9ec18%40thread.tacv2/1637587495615?context=%7b%22Tid%22%3a%220425f1d9-16b2-41e3-a01a-0c02a63d13d6%22%2c%22Oid%22%3a%2256c98727-58a9-4bc2-a706-2e47ff6ae312%22%7d When classification accuracy is not enough: Explaining news credibility assessment and measuring users' reaction  Talk delivered in Polish.

Automatic assessment of text credibility has recently become a very popular task in NLP, with many solutions proposed and evaluated through accuracy-based measures. However, little attention has been given to deployment scenarios in which such models would actually reduce the spread of misinformation, as intended. In the study presented here, two credibility assessment techniques were implemented in a browser extension, which was then used in a user study, allowing us to address questions in three areas. Firstly, how can resource-intensive NLP models be compressed to work in a constrained environment? Secondly, what interpretability and visualisation techniques are most effective in human-computer cooperation? Thirdly, are users relying on such automated tools really more effective in spotting fake news?

6 December 2021

Joanna Byszuk (Institute of Polish Language, Polish Academy of Sciences)

https://teams.microsoft.com/l/meetup-join/19%3a2a54bf781d2a466da1e9adec3c87e6c2%40thread.tacv2/1638180705225?context=%7b%22Tid%22%3a%220425f1d9-16b2-41e3-a01a-0c02a63d13d6%22%2c%22Oid%22%3a%22f5f2c910-5438-48a7-b9dd-683a5c3daf1e%22%7d Towards multimodal stylometry – possibilities and challenges of new approach to film and TV series analysis  Talk delivered in Polish.

This talk will present a proposal for a novel approach to the quantitative analysis of multimodal works, using a corpus of the Doctor Who television series as an example. The approach draws on stylometry and the multimodal theory of film analysis. Stylometric methods have long been popular in the analysis of literary texts. They usually involve comparing texts based on the frequencies of selected features, which create "stylometric fingerprints", i.e. patterns characteristic of authors, genres and other factors. They are, however, rarely applied to data other than text, with a few new approaches applying stylometry to the study of dance movements (works by Miguel Escobar Varela) or music (Backer and Kranenburg). The multimodal theory of film analysis is in turn a relatively new approach (developed primarily by John Bateman and Janina Wildfeuer), emphasizing the importance of examining information from various image, language and sound modalities for a more comprehensive interpretation. The presented approach uses the stylometric method of comparison but takes multiple types of features from various film modalities, i.e. features of image and sound as well as the content of the spoken dialogues. The talk will discuss the benefits and challenges of such an approach and of quantitative film media analysis in general.

20 December 2021

Piotr Pęzik, Agnieszka Mikołajczyk, Adam Wawrzyński (University of Łódź / VoiceLab), Bartłomiej Nitoń, Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

Keyword Extraction with a Text-to-text Transfer Transformer (T5)  Talk delivered in Polish.

The talk will explore the relevance of the Text-To-Text Transfer Transformer language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the newly released Polish Open Science Metadata Corpus (POSMC), which is currently a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project (https://curlicat.eu/). We compare the results obtained by four different methods, namely plT5, extremeText, TermoPL, and KeyBert, and conclude that the T5 model yields particularly promising results for sparsely represented keywords. Furthermore, a plT5 keyword generation model trained on the POSMC also seems to produce highly useful results in cross-domain text labelling scenarios. We discuss the performance of the model on news stories and phone-based dialog transcripts, which represent text genres and domains extrinsic to the dataset of scientific abstracts. Finally, we also attempt to characterize the challenges of evaluating a text-to-text model on both intrinsic and extrinsic keyword extraction.

31 January 2022

Tomasz Limisiewicz (Charles University in Prague)

Interpreting and Controlling Linguistic Features in Neural Networks’ Representations  Talk delivered in English.

The talk summary will be made available shortly.

Please see also the talks given in 2000–2015 and 2015–2020.