Natural Language Processing Seminar 2022–2023
The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, usually at 10:15 am, often online – please use the link next to the presentation title. All recorded talks are available on YouTube.
3 October 2022
Sławomir Dadas (National Information Processing Institute)

Representing sentences or short texts as dense vectors with a fixed number of dimensions is a common technique in tasks such as information retrieval, question answering, text clustering, or plagiarism detection. A simple method of constructing such representations is to aggregate vectors generated by a language model or extracted from word embeddings. However, higher-quality representations can be obtained by fine-tuning a language model on a dataset of semantically similar sentence pairs. In this presentation, we will introduce methods for learning sentence encoders based on the Transformer architecture, as well as our experiences with training such models for the Polish language. In addition, we will discuss approaches to building large datasets of paraphrases from publicly available corpora. We will also show a practical application of sentence encoders in a system developed for finding abusive clauses in consumer agreements.
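As a point of reference for the aggregation baseline mentioned in the abstract, the sketch below mean-pools token vectors from a pretrained language model into a fixed-size sentence vector. It is a minimal illustration only: the model name, pooling choice, and example sentences are assumptions for demonstration, not the speaker's actual setup.

```python
# Minimal sketch: sentence vectors by mean-pooling token embeddings
# from a pretrained language model (the simple baseline the abstract
# contrasts with fine-tuned sentence encoders).
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "bert-base-multilingual-cased"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        token_vecs = model(**batch).last_hidden_state   # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)        # ignore padding
    return (token_vecs * mask).sum(1) / mask.sum(1)     # mean pooling

# Two Polish paraphrases should land close together in the vector space.
a, b = embed(["Kot siedzi na macie.", "Na macie siedzi kot."])
print(torch.cosine_similarity(a, b, dim=0).item())
```

A fine-tuned sentence encoder of the kind discussed in the talk would typically replace this frozen model with one trained on semantically similar sentence pairs under a contrastive or cosine-similarity objective.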
28 November 2022
Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)
Talk title will be made available shortly
Talk summary will be made available shortly.
19 December 2022
Wojciech Kryściński (Salesforce Research)
Current state, challenges, and approaches to text summarization
Neural text summarization is a challenging task within natural language processing that requires advanced language understanding and generation capabilities. In recent years, substantial progress has been made in developing neural models for the task, thanks to the efforts of the research community and advancements in the broader field of NLP. Despite this progress, text summarization remains a challenging task that is far from solved. In this talk, we will first discuss the early approaches and the current state of the field. Next, we will critically evaluate key ingredients of the existing research setup: datasets, evaluation metrics, and models. Finally, we will focus on emerging research directions and consider the future of text summarization.
9 January 2023
Marzena Karpińska (University of Massachusetts Amherst)
Talk title will be made available shortly
Talk summary will be made available shortly.
23 January 2023
Agnieszka Mikołajczyk (VoiceLab / Gdańsk University of Technology / hear.ai)
Talk title will be made available shortly
Talk summary will be made available shortly.
6 February 2023
Artur Nowakowski, Gabriela Pałka, Kamil Guttmann, Mikołaj Pokrywka (Adam Mickiewicz University in Poznań)
Talk title will be made available shortly
Talk summary will be made available shortly.
Please see also the talks given in 2000–2015 and 2015–2020.
2 April 2020
Stan Matwin (Dalhousie University)
Efficient training of word embeddings with a focus on negative examples

This presentation is based on our AAAI 2018 and AAAI 2019 papers on English word embeddings. In particular, we examine the notion of “negative examples”, the unobserved or insignificant word-context co-occurrences, in spectral methods. We provide a new formulation of the word embedding problem by proposing a new, intuitive objective function that justifies the use of negative examples. With the goal of learning embeddings efficiently, we propose a kernel similarity measure for the latent space that can effectively calculate similarities in high dimensions. Moreover, we propose an approximate alternative to our algorithm using a modified vantage-point tree, reducing the computational complexity of the algorithm with respect to the number of words in the vocabulary. We have trained various word embedding algorithms on Wikipedia articles comprising 2.3 billion tokens and show that our method outperforms the state of the art in most word similarity tasks by a good margin. We will round off the discussion with some general thoughts about the use of embeddings in modern NLP.
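For readers unfamiliar with the role negative examples play in embedding training, the toy sketch below implements the classical skip-gram-with-negative-sampling update in NumPy. It shows only the baseline idea the abstract builds on, not the spectral formulation, kernel similarity measure, or vantage-point-tree approximation presented in the talk; all sizes and hyperparameters are illustrative.

```python
# Toy skip-gram with negative sampling (SGNS): one SGD step pulls an
# observed (word, context) pair together and pushes k randomly drawn
# "negative" contexts away. Baseline illustration only.
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 50                            # vocabulary size, embedding dim
W = rng.normal(scale=0.1, size=(V, D))     # word vectors
C = rng.normal(scale=0.1, size=(V, D))     # context vectors

def sgns_step(word, context, k=5, lr=0.025):
    """Update vectors for one observed pair plus k negative samples."""
    negatives = rng.integers(0, V, size=k)
    pairs = [(context, 1.0)] + [(int(n), 0.0) for n in negatives]
    for ctx, label in pairs:
        score = 1.0 / (1.0 + np.exp(-W[word] @ C[ctx]))  # sigmoid of dot product
        grad = score - label                             # dLoss/dscore
        g_word = grad * C[ctx]                           # compute both gradients
        g_ctx = grad * W[word]                           # before updating either side
        W[word] -= lr * g_word
        C[ctx] -= lr * g_ctx

sgns_step(word=3, context=17)
```

Note that the negatives here are drawn uniformly at random; the unobserved or insignificant co-occurrences discussed in the talk play this pushing-apart role within a spectral, rather than sampling-based, formulation.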