Natural Language Processing Seminar 2019–2020

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa). All recorded talks are available on YouTube.

23 September 2019

Igor Boguslavsky (Institute for Information Transmission Problems, Russian Academy of Sciences / Universidad Politécnica de Madrid)

Semantic analysis based on inference

I will present SemETAP, a semantic analyzer that is a module of ETAP, a linguistic processor designed to analyze and generate natural-language texts. We proceed from the assumption that the depth of understanding is determined by the number and quality of inferences we can draw from the text. Extensive use of background knowledge and inferences makes it possible to extract implicit information.

Salient features of SemETAP include:

— knowledge base contains both linguistic and background knowledge;

— inference types include strict entailments and plausible expectations;

— words and concepts of the ontology may be supplied with explicit decompositions for inference purposes;

— two levels of semantic structure are distinguished. Basic semantic structure (BSemS) interprets the text in terms of ontological elements. Enhanced semantic structure (EnSemS) extends BSemS by means of a series of inferences;

— a new logical formalism Etalog is developed in which all inference rules are written.

7 October 2019

Tomasz Stanisz (Institute of Nuclear Physics, Polish Academy of Sciences)

What can a complex network say about a text?

Complex networks, which have found application in the quantitative description of many different phenomena, have proven useful in research on natural language. The network formalism makes it possible to study language from various points of view: a complex network may represent, for example, distances between given words in a text, semantic similarities, or grammatical relationships. One type of linguistic network is the word-adjacency network, which describes the co-occurrence of neighbouring words in texts. Although simple in construction, word-adjacency networks have a number of properties that make them practically useful. The structure of such networks, expressed by appropriately defined quantities, reflects selected characteristics of language; machine learning methods applied to collections of those quantities can be used, for example, for authorship attribution.
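As a rough illustration of the word-adjacency idea described above, the following sketch builds such a network from a short text and computes one simple structural quantity, the node degree. It is a minimal sketch under stated assumptions, not the speaker's method: the tokenization rule, the undirected weighted edges, and the names `word_adjacency_network`, `degrees` and `sample` are all hypothetical choices made for this example.

```python
import re
from collections import defaultdict


def word_adjacency_network(text):
    """Build an undirected, weighted word-adjacency network.

    Nodes are lower-cased word forms; an edge links two words that occur
    next to each other, and its weight counts how often that happens.
    (Assumption: punctuation is ignored and sentence boundaries are not
    treated specially.)
    """
    tokens = re.findall(r"\w+", text.lower())
    weights = defaultdict(int)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops such as "very very"
            weights[frozenset((left, right))] += 1
    return weights


def degrees(weights):
    """Return the number of distinct neighbours of each word."""
    neighbours = defaultdict(set)
    for edge in weights:
        a, b = tuple(edge)
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {word: len(adjacent) for word, adjacent in neighbours.items()}


sample = "the cat sat on the mat and the dog sat on the rug"
network = word_adjacency_network(sample)
degree = degrees(network)
print(degree["the"])                       # 6
print(network[frozenset(("sat", "on"))])   # 2
```

Quantities such as degrees, edge weights or clustering coefficients, collected per text, could then serve as feature vectors for a classifier in a task like authorship attribution.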

21 October 2019

Agnieszka Patejuk (Institute of Computer Science, Polish Academy of Sciences / University of Oxford), Adam Przepiórkowski (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)

Coordination in the Universal Dependencies standard

Universal Dependencies (UD; https://universaldependencies.org/) is a widespread syntactic annotation scheme employed by many parsers of multiple languages. However, the scheme does not adequately represent coordination, i.e., structures involving conjunctions. In this talk, we propose representations of two aspects of coordination which have not so far been properly represented either in UD or in dependency grammars: coordination of unlike grammatical functions and nested coordination.

4 November 2019

Marcin Będkowski (University of Warsaw / Educational Research Institute), Wojciech Stęchły, Leopold Będkowski, Joanna Rabiega-Wiśniewska (Educational Research Institute), Michał Marcińczuk (Wrocław University of Science and Technology), Grzegorz Wojdyga, Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

Analysis of existing solutions for grouping of qualifications

In the talk we will discuss the problem of comparing documents contained in the Integrated Qualifications Register in terms of their content similarity.

In the first part, we characterize the background of the issue, including the structure of the descriptions of learning outcomes in qualifications and of the sentences describing learning outcomes. According to the definition in the Act on the Integrated Qualifications System, learning outcomes are the knowledge, skills and social competences acquired in the learning process, and a qualification is a set of learning outcomes whose achievement is confirmed by an appropriate document (e.g. a diploma or certificate). Sentences that refer to learning outcomes have a stable structure and consist essentially of a so-called operational verb (describing the activity that constitutes a learning outcome) and a nominal phrase that complements it (naming the object of this activity, in short: the object of the skill). For example: "Determines vision defects and how to correct them based on eye refraction measurement" or "The student reads technical drawings."

In the second part, we outline an approach that makes it possible to determine the degree of similarity between qualifications and to group them, along with its assumptions and the intuitions behind them. We will define the adopted understanding of content similarity, that is, text similarity in a variant that can be computed automatically. We will present a simple representation model, the so-called bag of words, in two versions.

The first of them assumes the full atomization of learning outcomes (including nominal phrases, i.e. skill objects) and their presentation as sets of single plata-mathematical nouns representing skill objects. The second is based on n-grams and takes into account the TF-IDF measure (term frequency weighted by inverse document frequency), which allows key words and phrases to be extracted from texts.

The first approach can be described as "wasteful" and the second as "frugal". The first returns many similar qualifications for each qualification (although the degree of similarity may be low). The second, on the other hand, may find no similar qualifications at all for a given qualification.
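The n-gram variant with TF-IDF weighting can be sketched as follows. This is only an illustration under assumptions, not the authors' implementation: the unsmoothed weighting `count * log(N / df)`, the use of unigrams and bigrams, cosine similarity as the comparison measure, and the three sample descriptions in `docs` are all choices made for this example.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max as strings."""
    tokens = re.findall(r"\w+", text.lower())
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(documents, n_max=2):
    """Map each document to a sparse {n-gram: tf * idf} vector."""
    bags = [Counter(ngrams(doc, n_max)) for doc in documents]
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(term for bag in bags for term in bag)
    total = len(documents)
    return [
        {term: count * math.log(total / doc_freq[term])
         for term, count in bag.items()}
        for bag in bags
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical learning-outcome descriptions, invented for illustration.
docs = [
    "reads technical drawings",
    "prepares technical drawings",
    "determines vision defects",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # positive: shared n-grams
print(cosine(vectors[0], vectors[2]))  # 0.0: no n-gram in common
```

The second comparison shows how a weighted n-gram representation can leave a description with no similar counterpart at all, which matches the "frugal" behaviour described above.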

In the third part, we describe sample groupings and ranking lists obtained with both approaches, using multidimensional scaling and the k-means algorithm, as well as hierarchical clustering. We will also present a case study that illustrates the advantages and disadvantages of both approaches.
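The centroid-based clustering algorithm mentioned above, commonly known as k-means, alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The sketch below is a minimal pure-Python version for illustration only. It assumes that each qualification has already been mapped to a point in a low-dimensional space (for example by multidimensional scaling); the coordinates in `points` are invented, and initialising the centroids with the first `k` points is a simplification.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(axis) / len(members)
                                for axis in zip(*members)]
    return labels, centroids


# Hypothetical 2-D coordinates of six qualifications.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0),
          (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

In practice the result depends on the initial centroids, so library implementations usually repeat the procedure with several random initialisations.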

In the fourth part, we will present conclusions on grouping qualifications, as well as general conclusions related to determining keywords. In particular, we will present conclusions on using the indicated methods to compare texts of varying length, as well as texts that partially overlap (contain common fragments).

The talk was prepared in cooperation with the authors of the expert report on automatic analysis and comparison of qualifications for the purpose of grouping them, prepared under the project "Keeping and developing the Integrated Qualifications Register", POWR.02.11.00-00-0001/17.

18 November 2019

Alexander Rosen (Charles University in Prague)

The InterCorp multilingual parallel corpus: representation of grammatical categories

InterCorp, a multilingual parallel component of the Czech National Corpus, has been on-line since 2008, growing steadily to its present size of 1.7 billion words in 40 languages. A substantial share of fiction is complemented by legal and journalistic texts, parliamentary proceedings, film subtitles and the Bible. The texts are sentence-aligned and, in most languages, tagged and lemmatized. We will focus on the issue of morphosyntactic annotation, which currently uses language-specific tagsets and tokenization rules, and explore various solutions, including those based on the guidelines, data and tools developed in the Universal Dependencies project.

21 November 2019

Alexander Rosen (Charles University in Prague)

A learner corpus of Czech

Texts produced by language learners (native or non-native) include all sorts of non-canonical phenomena, complicating the task of linguistic annotation while requiring an explicit markup of deviations from the standard. Although a number of English learner corpora exist and other languages have been catching up recently, a commonly accepted approach to designing an error taxonomy and annotation scheme has not emerged yet. For CzeSL, the corpus of Czech as a Second Language, several such approaches were designed and tested, and later extended to texts produced by Czech schoolchildren. I will show various pros and cons of these approaches, especially given that Czech is a highly inflectional language with free word order.

Please see also the talks given in 2000–2015 and 2015–2019.

