seminar-archive

23 September 2019

Igor Boguslavsky (Institute for Information Transmission Problems, Russian Academy of Sciences / Universidad Politécnica de Madrid)

Semantic analysis based on inference

I will present a semantic analyzer SemETAP, which is a module of a linguistic processor ETAP designed to perform analysis and generation of NL texts. We proceed from the assumption that the depth of understanding is determined by the number and quality of inferences we can draw from the text. Extensive use of background knowledge and inferences permits to extract implicit information.

Salient features of SemETAP include:

— knowledge base contains both linguistic and background knowledge;

— inference types include strict entailments and plausible expectations;

— words and concepts of the ontology may be supplied with explicit decompositions for inference purposes;

— two levels of semantic structure are distinguished. Basic semantic structure (BSemS) interprets the text in terms of ontological elements. Enhanced semantic structure (EnSemS) extends BSemS by means of a series of inferences;

— a new logical formalism Etalog is developed in which all inference rules are written.

7 October 2019

Tomasz Stanisz (Institute of Nuclear Physics, Polish Academy of Sciences)

What can a complex network say about a text?

Complex networks, which have found application in the quantitative description of many different phenomena, have proven to be useful in research on natural language. The network formalism allows to study language from various points of view - a complex network may represent, for example, distances between given words in a text, semantic similarities, or grammatical relationships. One of the types of linguistic networks are word-adjacency networks, which describe mutual co-occurrences of words in texts. Although simple in construction, word-adjacency networks have a number of properties allowing for their practical use. The structure of such networks, expressed by appropriately defined quantities, reflects selected characteristics of language; applying machine learning methods to collections of those quantities may be used, for example, for authorship attribution.

21 October 2019

Agnieszka Patejuk (Institute of Computer Science, Polish Academy of Sciences / University of Oxford), Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)

Coordination in the Universal Dependencies standard

Universal Dependencies (UD; https://universaldependencies.org/) is a widespread syntactic annotation scheme employed by many parsers of multiple languages. However, the scheme does not adequately represent coordination, i.e., structures involving conjunctions. In this talk, we propose representations of two aspects of coordination which have not so far been properly represented either in UD or in dependency grammars: coordination of unlike grammatical functions and nested coordination.

4 November 2019

Marcin Będkowski (University of Warsaw / Educational Research Institute), Wojciech Stęchły, Leopold Będkowski, Joanna Rabiega-Wiśniewska (Educational Research Institute), Michał Marcińczuk (Wrocław University of Science and Technology), Grzegorz Wojdyga, Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

Analysis of existing solutions for grouping of qualifications

In the talk we will discuss the problem of comparing documents contained in the Integrated Qualifications Register in terms of their content similarity.

In the first part, we characterize the background of the issue, including the structure of the description of learning outcomes in qualifications and sentences describing learning outcomes. According to the definition in the Act on the Integrated Qualifications System, the learning effect is knowledge, skills and social competences acquired in the learning process, and the qualification is a set of learning effects, the achievement of which is confirmed by an appropriate document (e.g. diploma, certificate). Sentences whose referents are learning outcomes have a stable structure and consist essentially of so-called an operational verb (describing an activity constituting a learning effect) and a nominal phrase that complements it (naming the object that is the subject of this activity, in short: the object of skill). For example: "Determines vision defects and how to correct them based on eye refraction measurement" or "The student reads technical drawings."

In the second part, we outline the approach that allows determining the degree of similarity between qualifications and their grouping, along with its assumptions and the intuitions behind them. We will define the accepted understanding of content similarity, namely we outline the approach to determine the similarity of texts in a variant that allows automatic text processing using computer tools. We will present a simple representation model, the so-called bag of words, in two versions.

The first of them assumes the full atomization of learning outcomes (including nominal phrases, skill objects) and their presentation as sets of single plata-mathematical nouns representing skills objects. The second is based on n-grams, taking into account the TFIDF measure (i.e. weighing by frequency of terms - inverse frequency in documents), allowing the extraction of key words and phrases from texts.

The first approach can be described as "wasteful", while the second – "frugal". The first allows for presenting many similar qualifications for each qualification (although the degree of similarity may be low). On the other hand, the second allows a situation in which there will be no similar for a given qualification.

In the third part, we describe sample groupings and ranking lists based on both approaches, based on multidimensional scaling and the k-average algorithm, as well as hierarchical grouping. We will also present a case study that will illustrate the advantages and disadvantages of both approaches.

In the fourth part we will present the conclusions on grouping qualifications, but also general conclusions related to the definition of key words. In particular, we will present conclusions on the use of the indicated methods for comparing texts of varying length, as well as partially overlapping (containing common fragments).

The talk was prepared in cooperation with the authors of the expertise on automatic analysis and comparison of qualifications for the purpose of grouping them prepared under the project "Keeping and developing the Integrated Qualifications Register", POWR.02.11.00-00-0001/17.

18 November 2019

Alexander Rosen (Charles University in Prague)

The InterCorp multilingual parallel corpus: representation of grammatical categories

InterCorp, a multilingual parallel component of the Czech National Corpus, has been on-line since 2008, growing steadily to its present size of 1.7 billion words in 40 languages. A substantial share of fiction is complemented by legal and journalistic texts, parliament proceedings, film subtitles and the Bible. The texts are sentence-aligned and – in most languages – tagged and lemmatized. We will focus on the issue of morphosyntactic annotation, currently using language-specific tagsets and tokenization rules, and explore various solutions, including those based on the guidelines, data and tools developed in the Universal Dependencies project.

21 November 2019

Alexander Rosen (Charles University in Prague)

A learner corpus of Czech

Texts produced by language learners (native or non-native) include all sorts of non-canonical phenomena, complicating the task of linguistic annotation while requiring an explicit markup of deviations from the standard. Although a number of English learner corpora exist and other languages have been catching up recently, a commonly accepted approach to designing an error taxonomy and annotation scheme has not emerged yet. For CzeSL, the corpus of Czech as a Second Language, several such approaches were designed and tested, later extended also to texts produced by Czech schoolchildren. I will show various pros and cons of these approaches, especially with a view of Czech as a highly inflectional language with free word order.

12 December 2019

Aleksandra Tomaszewska (Institute of Applied Linguistics, University of Warsaw)

Cross-Genre Analysis of EU Borrowings in Polish — the Need for Research Automation

During this presentation, the project ”EU Borrowings —formation mechanisms, functions, evolution, and assimilation in the Polish language” will be presented, funded by a Diamond Grant from the Polish Ministry of Science and Higher Education. The project aims to analyze and categorize EU borrowings, that is the effects of language contacts that occur in the European Union.

First, the author will discuss the theoretical background of the phenomenon, the aims of the research project; and present a compiled corpus of EU Polish language genres composed of three sub-corpora: transcriptions of interviews with MEPs, EU law (regulations and directives), and press releases of EU institutions. In the next part of the presentation, various methods and tools used in this research will be presented, including the methods of conducting analyses on the collected research material. Based on these specific examples, the need for automation of research on the latest borrowings in Polish will also be signaled.

13 January 2020

Ryszard Tuora, Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

Integrating Polish Language Tools and Resources in spaCy

In our project we aim to fill a niche between the robust tools, which have been developed during research work, and are dedicated to particular NLP tasks in Polish, and users looking for, and expecting an easily accessible resources. spaCy is one of the leading NLP frameworks, which is open-source, but has no official support for Polish. In our talk we will present the model for spaCy that we have been working on. It currently allows for segmentation, lemmatization, morphosyntactic analysis, dependency parsing and named entity recognition. We will discuss the tools which we have integrated, the results of evaluation, a real-world case in which it was used, and some possible paths for further development.

27 January 2020

Alina Wróblewska, Katarzyna Krasnowska-Kieraś (Institute of Computer Science, Polish Academy of Sciences)

Empirical Linguistic Study of Sentence Embeddings

The results of empirical linguistic study on retention of linguistic information in sentence embeddings will be presented. The research methods are based on universal probing tasks and downstream tasks. The results of experiments on English and Polish indicate that different types of sentence embeddings encode linguistic information to varying degrees. The research was published in the article Empirical Linguistic Study of Sentence Embeddings in the proceedings of ACL 2019.

24 February 2020

Piotr Niewiński (Samsung R&D Polska), Aleksander Wawer, Grzegorz Wojdyga (Institute of Computer Science, Polish Academy of Sciences)

Fact-checking in FEVER competition

Aleksander Wawer, Grzegorz Wojdyga (Institute of Computer Science, Polish Academy of Sciences), Justyna Sarzyńska-Wawer (Institute of Psychology, Polish Academy of Sciences)

Fact Checking or Psycholinguistics: How to Distinguish Fake and True Claims?

Piotr Niewiński, Maria Pszona, Maria Janicka (Samsung R&D Polska)

Generative Enhanced Model (extended, redesigned & fine-tuned GPT language model) for adversarial attacks

During seminar we will present our works for the FEVER (Fact Extraction and Verification) competition. "Fake news" has become a dangerous phenomenon in the modern information circulation. There are many approaches to the problem of recognizing fake messages – in FEVER competition, having certain text, the task is to find specific evidence from certain sources for verification. During the presentation, we will show the most interesting ideas submitted by the participants of previous editions, we will discuss our article, that compares facts verification approaches with psycholinguistic analysis, and we will also present a winning model to cheat facts verification systems.

9 March 2020

Piotr Przybyła (Institute of Computer Science, Polish Academy of Sciences)

Assessing a document credibility based on style

The presentation will cover my work on automatically detecting documents of low credibility, such as fake news, based on their stylistic properties. During the study, a new corpus of 103,219 documents from 229 sources was gathered and used to evaluate general-purpose text classifiers. Given their unsatisfactory performance, new methods were implemented based on stylometric features and neural architectures. It was also verified whether the proposed classifiers indeed pay attention to the vocabulary known to by typical for fake news. The results of the presented research were published at the AAAI 2020 conference in an article entitled Capturing the Style of Fake News.

Natural Language Processing Seminar 2019–2020