Natural Language Processing Seminar 2015–2016

12 October 2015

Vincent Ng (University of Texas at Dallas)

Recent years have seen considerable progress on the notoriously difficult task of coreference resolution owing in part to the availability of coreference-annotated corpora such as MUC, ACE, and OntoNotes. Coreference, however, is more than MUC/ACE/OntoNotes coreference: it encompasses many interesting cases of anaphora that are not covered in the extensively investigated MUC/ACE/OntoNotes entity coreference task. This talk examined several comparatively less-studied coreference tasks that were arguably no less challenging than the MUC/ACE/OntoNotes entity coreference task, including the Winograd Schema Challenge, zero anaphora resolution, and event coreference resolution.

26 October 2015

Wojciech Jaworski (University of Warsaw)

Syntactic-semantic parser for Polish

The author presented the parser being developed within the CLARIN-PL project, its morphological pre-processing, the categorial grammar of Polish integrated with a valency dictionary that the parser uses, and the semantic graph formalism used for meaning representation. He also discussed the algorithms used by the parser and its optimization strategies, which concern both performance and the concise representation of ambiguous syntactic and semantic parse trees.

16 November 2015

Izabela Gatkowska (Jagiellonian University in Kraków)

The Empirical Network of Lexical Links

The empirical network of lexical links is the result of an experiment that uses the human associative mechanism: the subject says the first word that comes to mind after understanding the stimulus word. The study was conducted cyclically, i.e. response words obtained in the first cycle were used as stimuli in the second cycle. This made it possible to build a semantic network that differs both from networks derived from text corpora, e.g. WORTSCHATZ, and from networks constructed by hand, e.g. WordNet. The empirically obtained connections between words in the network have a direction and a strength. The set of incoming and outgoing connections of a specific word forms a lexical node (a subnetwork). The way the network characterizes meaning is shown using feedback connections, a specific kind of dependency between two words within a lexical node. A qualitative analysis based on the lexical-semantic relations known in linguistics, and employed for example in WordNet, permits an interpretation of only about 25% of the feedback links. The remaining links may be interpreted by referring to the model of meaning description proposed in FrameNet. A qualitative interpretation of all the links found in a lexical node may permit comparative studies of lexical network nodes constructed experimentally for different natural languages, and may also make it possible to distinguish the empirical semantic models expressed by the same set of links between nodes in a given network.

30 November 2015

Dora Montagna (Universidad Autónoma de Madrid)

Semantic representation of a polysemous verb in Spanish

The author presented a theoretical model of representation of meaning, based on Pustejovsky's theory of the Generative Lexicon. The proposal is intended as a base for automatic disambiguation, but also as a new model of lexicographic description. The model will be applied to a highly productive verb in Spanish, assuming the hypothesis of verbal underspecification in order to establish patterns of semantic behaviors.

7 December 2015

Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences), Witold Kieraś (University of Warsaw)

Morphosyntactic tagging of Polish – state of the art and future perspectives

During the presentation, the state of the art in automatic morphosyntactic tagging of Polish text was discussed, with a particular focus on the performance of publicly available tools that can be used in real applications. A qualitative and quantitative analysis of the errors made by the taggers was conducted, along with a discussion of the possible causes of these problems and their solutions. Tagging results for Polish were compared and contrasted with the results for other European languages.

8 December 2015

Salvador Pons Bordería (Universitat de València)

Discourse Markers from a pragmatic perspective: The role of discourse units in defining functions

One of the most disregarded aspects in the description of discourse markers is position. Notions such as "initial position" or "final position" are meaningless unless it can be specified with regard to what a DM is "initial" or "final". The presentation defended the idea that, for this question to be answered, appeal must be made to the notion of "discourse unit". Provided with a set of a) discourse units, and b) discourse positions, determining the function of a given DM is quasi-automatic.

11 January 2016

Małgorzata Marciniak, Agnieszka Mykowiecka, Piotr Rychlik (Institute of Computer Science, Polish Academy of Sciences)

Terminology extraction from Polish data – program TermoPL

The presentation addressed the problems of terminology extraction from Polish domain corpora. The authors described the C-value method, which ranks term candidates based on a frequency measure and the number of term contexts. The method takes into account nested terms that may not appear by themselves in the data. Using this method, several nested grammatical subphrases are obtained which are syntactically correct but semantically odd, like 'USG jamy' ('USG of cavity'). The recognition of nested terms is supported by word connection strength, which makes it possible to eliminate truncated phrases from the top part of the term list. The talk concluded with a demo of the TermoPL tool.
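The ranking idea can be illustrated with a minimal sketch of the classic C-value formula, in which a candidate's frequency is reduced by the average frequency of the longer candidates that contain it and then weighted by the base-2 logarithm of its length. This is only an illustration: TermoPL may use a modified variant, the helper names `contains` and `c_value` are hypothetical, and the counts are invented toy data, not results from the talk.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freq):
    """Classic C-value for multi-word candidates.

    `freq` maps each candidate (a tuple of words) to its corpus frequency,
    counting nested occurrences as well.
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates in which `term` is nested.
        nesting = [g for other, g in freq.items()
                   if len(other) > len(term) and contains(other, term)]
        weight = math.log2(len(term))
        if nesting:
            scores[term] = weight * (f - sum(nesting) / len(nesting))
        else:
            scores[term] = weight * f
    return scores


# Invented toy counts (an assumption for illustration only).
freq = {
    ("usg", "jamy", "brzusznej"): 5,
    ("usg", "jamy", "ustnej"): 3,
    ("usg", "jamy"): 8,
}
scores = c_value(freq)
```

In this toy data the truncated phrase ('usg', 'jamy') never occurs on its own, yet it still receives a positive score because it is nested in two different longer candidates. This is the kind of case for which an additional filter, such as the word connection strength mentioned in the abstract, is useful.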

25 January 2016

Wojciech Jaworski (University of Warsaw)

Syntactic-semantic parser for Polish: integration with lexical resources, parsing

During the lecture the author presented the integration of the syntactic-semantic parser with SGJP, Polimorf, Słowosieć and Walenty, as well as preliminary observations concerning the impact that checking semantic preferences has on parsing. He also described the categorial formalism used for parsing and briefly presented how the parser works.

22 February 2016

Witold Dyrka (Wrocław University of Technology)

Language(s) of proteins? – premises, contributions and perspectives

In his talk the author presented arguments in favour of treating protein sequences, or higher protein structures, as sentences in some language(s). He then showed several interesting results (his own and others') of applying quantitative methods of text analysis and formal linguistic tools (such as probabilistic context-free grammars) to the analysis of proteins. Finally, he presented plans for his further work on "protein linguistics", which, he hoped, would inspire an interesting discussion.

22 February 2016

Linguistic Engineering Group (Institute of Computer Science, Polish Academy of Sciences)

Extended seminar

12:00–12:15: People, projects, tools

12:15–12:45: Morfeusz 2: analyzer and inflectional synthesizer for Polish

12:45–13:15: Toposław: Creating MWU lexicons

13:15–13:45: Lunch break

13:45–14:15: TermoPL: Terminology extraction from Polish data

14:15–14:45: Walenty: Valency dictionary of Polish

14:45–15:15: POLFIE: LFG grammar for Polish

7 March 2016

Zbigniew Bronk (Grammatical Dictionary of Polish team member)

JOD – a markup language for Polish declension

JOD, a markup language for Polish declension, was constructed in order to precisely describe inflectional rules and schemes for nouns and adjectives in Polish. Its first application was the description of the inflection of surnames, taking into account the sex of the person or persons using the given surname. This model has been the basis for the "Automaton of declension of Polish surnames". The author presented the general idea of the language and the implementation of its interpreter, as well as the JOD editor and the website "Automaton of declension of Polish surnames".

21 March 2016

Bartosz Zaborowski, Aleksander Zabłocki (Institute of Computer Science, Polish Academy of Sciences)

Poliqarp2 on the home straight

In this talk the authors present the linguistic data search engine Poliqarp 2, on which they have been working for the last three years. They describe both technical aspects and features that are interesting from the user's point of view. They briefly recall the data model supported by the engine, the structure of the language supported by the new query engine, its expressive power, and the differences compared to the previous version. In particular, they focus on elements added or modified during the development of the project (support for the Składnica and LFG data models, post-processing, syntactic sugar). On the technical side, they briefly present the software architecture and some details of the implementation of indexes. They also describe nontrivial decisions related to the processing of input data (the National Corpus of Polish in particular). They end the talk by presenting the results of preliminary efficiency measurements.

4 April 2016

Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences)

Identification of opinion targets in Polish

The seminar summarised the results of a National Science Centre (NCN) grant completed in January 2016. It presented three resources with labelled sentiments and opinion targets developed within the project: a bank of dependency trees created from a corpus of product reviews, a subset of the Składnica dependency treebank, and a collection of tweets. The seminar included a discussion of experiments on the automated recognition of opinion targets. These involved two parsing methods, dependency and shallow, and a hybrid method in which the results of syntactic analysis were used by statistical models (e.g. CRF).

21 April 2016 (Thursday)

Magdalena Derwojedowa (University of Warsaw)

“Tem lepiej, ale jest to interes miljonowy i traktujemy go poważnie” – A thousand words a thousand times in 5 parts

The talk presented the 1M corpus of the project “Automatic morphological analysis of Polish texts from the 1830–1918 period with respect to the evolution of inflection and spelling” (DEC-2012/07/B/HS2/00570), the structure of the corpus, its stylistic, temporal and regional diversity, as well as the inflectional characteristics of the resource in comparison with the features described in Bajerowa's works.

9 May 2016

Daniel Janus (Rebased.pl)

From unstructured data to searchable metadata-rich corpus: Skyscraper, P4, Smyrna

The presentation described tools facilitating the construction of custom datasets, in particular corpora of texts. The author presented Skyscraper, a library for scraping structured data out of whole WWW sites, and Smyrna, a concordancer for Polish texts enriched with metadata. In addition, a dataset built using these tools was presented: the Polish Parliamentary Proceedings Processor (PPPP, or P4), which includes, inter alia, a continuously updated corpus of speeches in the Polish parliament. The presentation largely focused on the technical solutions used in the tools shown.

19 May 2016 (Thursday)

Kamil Kędzia, Konrad Krulikowski (University of Warsaw)

Generating paraphrases' templates for Polish using parallel corpus

Software for generating paraphrases in Polish was prepared within the CLARIN-PL project. The developers will demonstrate how it works on chosen examples. They will also explain the method of Ganitkevitch et al. (2013), which allowed its authors to create the openly available Paraphrase Database (PPDB). Furthermore, they will discuss its enhancements and their approach to the challenges specific to the Polish language. Additionally, they will demonstrate a way of measuring the quality of paraphrases.

23 May 2016

Damir Ćavar (Indiana University)

The Free Linguistic Environment

The Free Linguistic Environment (FLE) started as a project to develop an open and free platform for white-box modeling and grammar engineering, i.e. development of natural language morphologies, prosody, syntax, and semantic processing components that are, for example, based on theoretical frameworks like two-level morphology, Lexical Functional Grammar (LFG), Glue Semantics, and similar. FLE provides a platform that makes use of some classical algorithms and also new approaches based on Weighted Finite State Transducer models to enable probabilistic modeling and parsing at all linguistic levels. Currently its focus is to provide a platform that is compatible with LFG and an extended version of it, one that we call Probabilistic Lexical Functional Grammar (PLFG). This probabilistic modeling can apply to the c(onstituent)-structure component, i.e. a Context Free Grammar (CFG) backbone can be extended to a Probabilistic Context Free Grammar (PCFG). Probabilities in PLFG can also be associated with structural representations and corresponding f(unctional feature)-structures or semantic properties, i.e. structural and functional properties and their relations can be modeled using weights that can represent probabilities or other forms of complex scores or metrics. In addition to these extensions of the LFG framework, FLE also provides an open platform for experimenting with algorithms for semantic processing or analyses based on (probabilistic) lexical analyses, c- and f-structures, or similar such representations. Its architecture is extensible to cope with different frameworks, e.g. dependency grammar, optimality theory based approaches, and many more.

6 June 2016

Karol Opara (Systems Research Institute of the Polish Academy of Sciences)

Grammatical rhymes in Polish poetry – a quantitative analysis

Polish is a highly inflected language and parts of speech in the same morphological form have common endings. This allows one to easily find a multitude of rhyming words known as grammatical rhymes. Their overuse is strongly discouraged in the contemporary Polish literary canon due to their alleged banality. The talk presented the results of computer-aided investigations into poets’ technical mastery based on estimating the share of grammatical rhymes in their verses. A method of automatic rhyme detection was discussed, as well as the extraction of statistical information from texts and a new “literary” criterion for choosing the sample size for statistical tests. Finally, a ranking of the technical mastery of various Polish poets was presented.

Natural Language Processing Seminar 2016–2017

10 October 2016

Katarzyna Pakulska, Barbara Rychalska, Krystyna Chodorowska, Wojciech Walczak, Piotr Andruszkiewicz (Samsung)

Paraphrase Detection Ensemble – SemEval 2016 winner

This seminar describes the winning solution designed for a core track within the SemEval 2016 English Semantic Textual Similarity task. The goal of the competition was to measure semantic similarity between two given sentences on a scale from 0 to 5. At the same time the solution should replicate human language understanding. The presented model is a novel hybrid of recursive auto-encoders from deep learning (RAE) and a WordNet award-penalty system, enriched with a number of other similarity models and features used as input for Linear Support Vector Regression.

24 October 2016

Adam Przepiórkowski, Jakub Kozakoszczak, Jan Winkowski, Daniel Ziembicki, Tadeusz Teleżyński (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)

Corpus of formalized textual entailment steps

The authors present resources created within the CLARIN project that aim to help with the qualitative evaluation of RTE systems: two corpora of textual derivations and a corpus of textual entailment rules. A textual derivation is a series of atomic steps which connects the Text with the Hypothesis in a textual entailment pair. The original pairs are taken from the FraCaS corpus and a Polish translation of the RTE3 corpus. A textual entailment rule sanctions the textual entailment relation between the input and the output of a step, using syntactic patterns written in the UD standard and some other semantic, logical and contextual constraints expressed in FOL.

7 November 2016

Rafał Jaworski (Adam Mickiewicz University in Poznań)

Concordia – translation memory search algorithm

The talk covers the Concordia algorithm which is used to maximize the productivity of a human translator. The algorithm combines the features of standard fuzzy translation memory searching with a concordancer. As the key non-functional requirement of computer-aided translation mechanisms is performance, Concordia incorporates upgraded versions of standard approximate searching techniques, aiming at reducing the computational complexity.
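For orientation, standard fuzzy translation memory search can be sketched as a linear scan that scores every stored source segment by word-level edit distance. This is a naive baseline written under our own assumptions, not the Concordia algorithm, whose point is to avoid this cost. The function names and the two-entry memory are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete x
                            curr[j - 1] + 1,             # insert y
                            prev[j - 1] + (x != y)))     # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_score(query, source):
    """Similarity in [0, 1]; 1.0 means the token sequences are identical."""
    q, s = query.lower().split(), source.lower().split()
    longest = max(len(q), len(s))
    return 1.0 if longest == 0 else 1.0 - edit_distance(q, s) / longest


def best_match(query, memory):
    """Return (score, source, target) for the closest entry.

    `memory` is a non-empty list of (source, target) pairs.
    """
    return max((fuzzy_score(query, src), src, tgt) for src, tgt in memory)


# Hypothetical two-entry translation memory, for illustration only.
memory = [
    ("the cat sat on the mat", "kot siedział na macie"),
    ("the dog barked", "pies szczekał"),
]
score, source, target = best_match("the cat sat on a mat", memory)
```

Each query here costs time proportional to the total size of the memory, which is why performance-oriented approaches replace the scan with indexed approximate search.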

21 November 2016

Norbert Ryciak, Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences)

Using recursive deep neural networks and syntax to compute phrase semantics

The seminar presents initial experiments on recursive phrase-level sentiment computation using dependency syntax and deep learning. We discuss neural network architectures and implementations created within Clarin 2 and present results on English language resources. The seminar also covers ongoing work on Polish language resources.

5 December 2016

Dominika Rogozińska, Marcin Woliński (Institute of Computer Science, Polish Academy of Sciences)

Methods of syntax disambiguation for constituent parse trees in Polish as post–processing phase of the Świgra parser

The presentation shows methods of syntax disambiguation for Polish utterances parsed by the Świgra parser. The presented methods include probabilistic context-free grammars and maximum entropy models. The best of the described models achieves an efficiency measure of 96.2%. The outcome of our experiments is a module for post-processing Świgra's parses.

9 January 2017

Agnieszka Pluwak (Institute of Slavic Studies, Polish Academy of Sciences)

Building a domain-specific knowledge representation using an extended method of frame semantics on a corpus of Polish, English and German lease agreements

The FrameNet project is defined by its authors as a lexical base with some ontological features (not an ontology sensu stricto, however, due to a selective approach towards the description of frames and lexical units, as well as frame-to-frame relations). Ontologies, as knowledge representations in the field of NLP, should be applicable to specific domains and texts; however, in the FrameNet bibliography published before January 2016 I have not found a single knowledge representation based entirely on frames or on an extensive structure of frame-to-frame relations. I did find a few examples of domain-specific knowledge representations using selected FrameNet frames, such as BioFrameNet or Legal FrameNet, where frames were applied to connect data from different sources. Therefore, in my dissertation, I decided to conduct an experiment and build a knowledge representation of frame-to-frame relations for the domain of lease agreements. The aim of my study was to describe frames useful for building a possible data extraction system for lease agreements, that is, frames containing answers to the questions a professional analyst asks while reading lease agreements. In my work I asked several questions, e.g.: would I be able to use FrameNet frames for this purpose, or would I have to build my own frames? Would the analysis of Polish cause language-specific problems? How would professional language affect the use of frames in context?

23 January 2017

Marek Rogalski (Lodz University of Technology)

Automatic paraphrasing

Paraphrasing is conveying the essential meaning of a message using different words. The ability to paraphrase is a measure of understanding. A teacher who asks a student "could you please tell us using your own words ..." tests whether the student has understood the topic. In this presentation we will discuss the task of automatic paraphrasing. We will differentiate between syntax-level paraphrases and essential-meaning-level paraphrases. We will bring up several techniques from seemingly unrelated fields that can be applied in automatic paraphrasing. We will also show results that we have been able to produce with those techniques.

6 February 2017

Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

Korpusomat – a tool for creation of searchable own corpora

Korpusomat is a web tool facilitating the unassisted creation of corpora for linguistic studies. Once a set of text files has been uploaded, the files are automatically morphologically analysed and lemmatised using Morfeusz and disambiguated using the Concraft tagger. The resulting corpus can then be downloaded and analysed offline using the Poliqarp search engine to query for information related to text segmentation, base forms, inflectional interpretations and (dis)ambiguities. Poliqarp is also capable of calculating frequencies and applying basic statistical measures necessary for quantitative analysis. Apart from plain text files, Korpusomat can also process more complex textual formats such as the popular EPUB, download source data from the Internet, strip unnecessary information and extract document metadata.

20 February 2017 (invited talk at the Institute seminar)

Elżbieta Hajnicz (Institute of Computer Science, Polish Academy of Sciences)

Representation language of the valency dictionary Walenty

The Polish Valence Dictionary (Walenty) is intended to be used by natural language processing tools, particularly parsers, and thus it offers a formalized representation of valency information. The talk presented the notion of valency and its representation in the dictionary, along with examples illustrating how particular syntactic and semantic language phenomena are modelled.

2 March 2017

Wojciech Jaworski (University of Warsaw)

Integration of dependency parser with a categorial parser

In the talk I will describe how texts are divided into sentences and how the execution of each parser is controlled within the hybrid parser emerging in the Clarin-bis project. I will describe the adopted method of converting dependency structures to make them compatible with the structures of the categorial parser. The conversion has two aspects: changing the attributes of each node and changing the links between nodes. I will show how the method can be extended to convert the compressed forests generated by the Świgra parser. At the end I will talk about the plans and goals of the reimplementation of the MateParser algorithm.

13 March 2017

Marek Kozłowski, Szymon Roziewski (National Information Processing Institute)

Internet model of Polish and semantic text processing

The presentation shows how BabelNet (the multilingual encyclopaedia and semantic network based on publicly available data sources such as Wikipedia and WordNet) can be used for grouping short texts, sentiment analysis, or emotional profiling of movies based on their subtitles. The second part presents work based on CommonCrawl, a publicly available, petabyte-size open repository of multilingual Web pages. CommonCrawl was used to build two models of Polish: n-gram-based and semantic distribution-based.

20 March 2017

Jakub Szymanik (University of Amsterdam)

Semantic Complexity Influences Quantifier Distribution in Corpora

In this joint paper with Camilo Thorne, we study whether semantic complexity influences the distribution of generalized quantifiers in a large English corpus derived from Wikipedia. We consider the minimal computational device recognizing a generalized quantifier as the core measure of its semantic complexity. We regard quantifiers that belong to three increasingly more complex classes: Aristotelian (recognizable by 2-state acyclic finite automata), counting (k+2-state finite automata), and proportional quantifiers (pushdown automata). Using regression analysis we show that semantic complexity is a statistically significant factor explaining 27.29% of frequency variation. We compare this impact to that of other known sources of complexity, both semantic (quantifier monotonicity and the comparative/superlative distinction) and superficial (e.g., the length of quantifier surface forms). In general, we observe that the more complex a quantifier, the less frequent it is.
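The complexity classes named above can be made concrete with a small sketch of semantic automata. It assumes the usual encoding in which a model for a quantifier Q(A, B) is a string over {0, 1}, with 1 for each element of A that is in B and 0 for each element of A that is not. The function names are hypothetical, and the state counts of these simplified recognisers need not match the paper's exact constructions.

```python
def every(word):
    """Aristotelian 'every': a two-state automaton that rejects on the first 0."""
    state = "accept"
    for symbol in word:
        if symbol == "0":
            state = "reject"  # absorbing state
    return state == "accept"


def at_least(k):
    """Counting quantifier 'at least k': a finite automaton that counts 1s up to k."""
    def recognise(word):
        count = 0  # the finite state; it saturates at k
        for symbol in word:
            if symbol == "1" and count < k:
                count += 1
        return count == k
    return recognise


def most(word):
    """Proportional 'most': needs unbounded memory, here a single counter.

    The counter stands in for the stack of a pushdown automaton.
    """
    balance = 0
    for symbol in word:
        balance += 1 if symbol == "1" else -1
    return balance > 0
```

For example, `every("111")` holds while `every("101")` does not, and `most("1101")` holds while `most("1100")` does not. The number of states needed for `at_least(k)` grows with k, whereas no finite number of states suffices for `most`.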

27 March 2017 (invited talk at the Institute seminar)

Paweł Morawiecki (Institute of Computer Science, Polish Academy of Sciences)

Introduction to deep neural networks

In the last few years, Deep Neural Networks (DNN) have become a tool that provides the best solutions to many problems in image and speech recognition. In natural language processing, too, DNNs have revolutionized the way translation, word representation and many other problems are approached. This presentation aims to provide good intuitions about DNNs, their core architectures and how they operate. I will discuss and suggest tools and source materials that can help in further exploration of the topic and in independent experiments.

3 April 2017

Katarzyna Budzynska, Chris Reed (Institute of Philosophy and Sociology, Polish Academy of Sciences / University of Dundee)

Argument Corpora, Argument Mining and Argument Analytics (part I)

Argumentation, the most prominent way people communicate, has been attracting a lot of attention since the very beginning of scientific reflection. The Centre for Argument Technology has been developing the infrastructure for studying argument structures for almost two decades. Our approach demonstrates several characteristics. First, we build upon the graph-based standard for argument representation, Argument Interchange Format AIF (Rahwan et al., 2007); and Inference Anchoring Theory IAT (Budzynska and Reed, 2011) which allows us to capture the dialogic context of argumentation. Second, we focus on a variety of aspects of argument structures such as argumentation schemes (Lawrence and Reed, 2016); illocutionary intentions speakers associate with arguments (Budzynska et al., 2014a); ethos of arguments' authors (Duthie et al., 2016); rephrase relation which paraphrases parts of argument structures (Konat et al., 2016); and protocols of argumentative dialogue games (Yaskorska and Budzynska, forthcoming).

10 April 2017

Paweł Morawiecki (Institute of Computer Science, Polish Academy of Sciences)

Neural nets for natural language processing – selected architectures and problems

For the last few years more and more problems in NLP have been successfully tackled with neural nets, particularly with deep architectures. These include sentiment analysis, topic classification, coreference, word representations and image labelling. In this talk I will give some details on the most promising architectures used in NLP, including recurrent and convolutional nets. The presented solutions will be given in the context of a concrete problem, namely coreference resolution in Polish.

15 May 2017

Katarzyna Budzynska, Chris Reed (Institute of Philosophy and Sociology, Polish Academy of Sciences / University of Dundee)

Argument Corpora, Argument Mining and Argument Analytics (part II)

In the second part of our presentation we will describe characteristics of argument structures using examples from our AIF corpora of annotated argument structures in various domains and genres (see also OVA+ annotation tool) including moral radio debates (Budzynska et al., 2014b); Hansard records of the UK parliamentary debates (Duthie et al., 2016); e-participation (Konat et al., 2016; Lawrence et al., forthcoming); and the US 2016 presidential debates (Visser et al., forthcoming). Finally, we will show how such complex argument structures, which on the one hand make the annotation process more time-consuming and less reliable, can on the other hand result in automatic extraction of a variety of valuable information when applying technologies for argument mining (Budzynska and Villata, 2017; Lawrence and Reed, forthcoming) and argument analytics (Reed et al., forthcoming).

12 June 2017 (invited talk at the Institute seminar)

Adam Pawłowski (University of Wrocław)

Sequential structures in texts

The subject of my lecture is the phenomenon of sequentiality in linguistics. Sequentiality is defined here as a characteristic feature of a text or of a collection of texts, which expresses the sequential relationship between units of the same type, ordered along the axis of time or according to a different variable (e.g. the sequence of reading or publishing). In order to model sequentiality thus understood, we can use, among others, time series, spectral analysis, the theory of stochastic processes, information theory or some tools of acoustics. Referring to both my own research and the existing literature, I will present sequential structures and selected models thereof in continuous texts, as well as models used in relation to sequences of several texts (known as chronologies of works). I will also mention glottochronology, a branch of quantitative linguistics that aims at mathematical modeling of the development of language over long periods of time. Finally, I will refer to philosophical attempts to elucidate sequentiality (the notion of the text’s ‘memory’, the result chain, Pythagoreanism, Platonism).

Natural Language Processing Seminar 2017–2018

2 October 2017

Paweł Rutkowski (University of Warsaw)

Polish Sign Language from the perspective of corpus linguistics

Polish Sign Language (polski język migowy, PJM) is a full-fledged visual-spatial language used by the Polish Deaf community. It started to evolve in the second decade of the nineteenth century, with the foundation of the first school for the deaf in Poland. Until recently, PJM attracted very little attention from the linguistic community in Poland. The aim of this talk is to present a large-scale research project aimed at creating an extensive and representative corpus of PJM. The corpus is currently being compiled at the University of Warsaw. It is a collection of video clips showing Deaf people using PJM in a variety of different communication contexts. The videos are richly annotated: they are segmented, lemmatized, translated into Polish, tagged for various grammatical features and transcribed with HamNoSys symbols. The Corpus of PJM is currently one of the two largest sets of annotated sign language data in the world. Special attention will be paid to the issue of lexical frequency in PJM. Studies of this type are available for a handful of sign languages only, including American Sign Language, New Zealand Sign Language, British Sign Language, Australian Sign Language and Slovene Sign Language. Their empirical basis ranged from 100,000 tokens (NZSL) to as little as 4,000 tokens (ASL). The present talk contributes to our understanding of lexical frequency in sign languages by analyzing a much larger set of relevant data from PJM.

23 October 2017

Katarzyna Krasnowska-Kieraś, Piotr Rybak, Alina Wróblewska (Institute of Computer Science, Polish Academy of Sciences)

Towards the evaluation of feature embedding models of the fusional languages in the context of morphosyntactic disambiguation and dependency parsing

Neural networks have recently been very successful in various natural language processing tasks. An important component of a neural network approach is a dense vector representation of features, i.e. feature embedding. Various feature types are possible, e.g. words or part-of-speech tags. In our talk we are going to present the results of an analysis showing what should be used as features in estimating embedding models of fusional languages: tokens or lemmata. Furthermore, we are going to discuss the methodological question of whether the results of the intrinsic evaluation of embeddings are informative for downstream applications, or whether the embedding models should be evaluated extrinsically. The accompanying experiments were conducted on Polish, a fusional Slavic language with a relatively free word order. This research has inspired us to implement a morphosyntactic disambiguator, Toygger (Krasnowska-Kieraś, 2017). The tool won shared task 1 (A) in the PolEval 2017 competition and will be presented in our talk.

6 November 2017

Szymon Łęski (Samsung R&D Poland)

Deep neural networks in language models

In my talk I will first give an introduction to language models: traditional, n-gram based ones, and new ones based on recurrent networks. Then, based on recent papers, I will discuss the most interesting extensions and modifications to RNN-based language models, such as modifying word representations or models with output not limited to a pre-defined vocabulary.
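
As a point of reference for the traditional n-gram approach mentioned above, the following is a minimal sketch of a bigram language model with add-one (Laplace) smoothing. It is not taken from the talk; the function names, the boundary markers `<s>` and `</s>`, and the choice of smoothing are illustrative assumptions.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and bigrams in whitespace-tokenized sentences.

    The boundary markers <s> and </s> are an assumption of this sketch.
    """
    contexts = Counter()
    bigrams = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        # Every token except the last one serves as a context for the next token.
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab):
    """Estimate P(word | prev) with add-one smoothing over the vocabulary."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
```

With add-one smoothing, an unseen word pair still receives a small non-zero probability, and the estimates for any fixed context sum to one over the vocabulary. A fixed vocabulary of this kind is exactly the limitation that open-vocabulary models aim to remove.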

20 November 2017

Michał Ptaszyński (Kitami Institute of Technology, Japan)

Capturing Emotions in Context as a way towards Computational Phronesis

Research on emotions within Artificial Intelligence and related fields has flourished in recent years. Unfortunately, in most research emotions are analyzed without their context. I will argue that recognizing emotions without recognizing their context is incomplete and cannot be sufficient for real-world applications. I will also describe some consequences of disregarding the context of emotions. Finally, I will present one approach in which the context of emotions is considered, and briefly describe some of the first experiments performed in this area.

27 November 2017

Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences)

Automated coreference resolution in Polish

The talk presents the description of nominal referential constructs in Polish (i.e. textual fragments referencing the same discourse entities) and the computational-linguistic methods implemented for their decoding. The algorithms are corpus-based with manual annotation of coreferential constructs and are evaluated using standard metrics.

4 December 2017

Adam Dobaczewski, Piotr Sobotka, Sebastian Żurowski (Nicolaus Copernicus University in Toruń)

Dictionary of Polish reduplications and repetitions

In our talk we will present a dictionary prepared by the team from the Institute of Polish Language of the Nicolaus Copernicus University in Toruń (grant NPRH 11H 13 0265 82). In the dictionary we document expressions of the Polish language in which the presence of reduplication or repetition of forms of the same lexemes can be observed. We distinguish the units of language according to Bogusławski's operational grammar framework and divide them into two basic groups: (i) lexical units consisting of two such segments or forms of the same lexeme (Pol. całkiem całkiem; fakt faktem); (ii) operational units based on some pattern of repetition of words belonging to a certain class predicted by this scheme (Pol. N[nom] N[inst] ale _, where N stands for any noun, e.g. sąd sądem, ale _; miłość miłością, ale _). We have prepared the dictionary in traditional (printed) form due to the relatively small number of registered units. Its material base is the resources of the NKJP, which were searched using a dedicated search engine for repetitions in the NKJP. This tool was specially prepared for this project at the LEG ICS PAS.

29 January 2018

Roman Grundkiewicz (Adam Mickiewicz University in Poznań/University of Edinburgh)

Automatic Grammatical Error Correction using Machine Translation

In my presentation I will talk about the task of automated grammatical error correction (GEC) in texts written by non-native English speakers. I will present our experiments on the application of phrase-based statistical machine translation (SMT), and our GEC system, which achieved new state-of-the-art results. The importance of parameter optimization towards the task-specific evaluation metric and of new GEC-adapted dense and sparse features will be discussed. I will also briefly describe the results of further research using neural machine translation (NMT).

12 February 2018

Agnieszka Mykowiecka, Aleksander Wawer, Małgorzata Marciniak, Piotr Rychlik (Institute of Computer Science, Polish Academy of Sciences)

Recognition of metaphorical noun phrases in Polish with distributional semantics

Our talk addresses the use of vector models for Polish based on lemmas and forms. We compare the results for two typical tasks solved with the help of distributional semantics, i.e. synonymy and analogy recognition. Then we apply vector models to detect metaphorical and literal meaning of adjective-noun (AN) phrases. We show the results of our method for isolated phrases and compare them to other known methods. Finally, we discuss the problem of recognition of metaphorical/literal meaning of AN phrases in sentences.

26 February 2018

Celina Heliasz (University of Warsaw)

To create or to contribute? On the search for synergy between computer scientists and linguists

The main topic of my presentation is the methods of conducting research in the field of corpus linguistics, which is currently being addressed by both computer scientists and linguists. In my talk, I will present the attempts to recognize and visualize semantic relations in text undertaken by computer scientists as part of two projects: RST (Rhetorical Structure Theory) and PDTB (Penn Discourse Treebank). Then I will contrast RST and PDTB with analogous attempts made by computer scientists and linguists at IPI PAN as part of the CLARIN-PL venture. The aim of the presentation is to show the determinants of effective linguistic analysis, which must be taken into account when designing IT tools if these tools are to support research on text and derive strong foundations for linguistic theories from it, and not only to implement existing theories in this field.

9 April 2018

Jan Kocoń (Wrocław University of Technology)

Recognition of temporal expressions and events in Polish text documents

A temporal expression is a sequence of words that indicates when or how often an event occurs, or how long it lasts. Event descriptions are words which indicate a change of state in the description of reality (and also some states). These issues fall within the scope of information extraction. They are well defined and described for English and partly for other languages. The TimeML specification, whose temporal information description language has been accepted as an ISO standard, has been officially adapted for six languages, and the temporal expressions description section is defined for eleven languages. The result of the work carried out within CLARIN-PL is the adaptation of the TimeML guidelines for the Polish language. The motivation for this topic was the fact that temporal information is used by various natural language processing tasks, including methods for question answering, automatic text summarisation, semantic relations extraction and many others. These methods allow researchers in the domain of Digital Humanities and Social Sciences to work with very large collections of texts whose analysis, without these methods, would be very time-consuming, if possible at all. In addition to the adaptation of the temporal information description language itself, the quality and efficiency of methods is a key aspect of temporal expression and event recognition. The presentation will discuss both the analysis of the quality of data prepared by domain experts (including annotation agreement analysis) and the results of research aimed at reducing the complexity of the computational problem while preserving the quality of methods.

23 April 2018

Włodzimierz Gruszczyński, Dorota Adamiec, Renata Bronikowska (Institute of the Polish Language, Polish Academy of Sciences), Witold Kieraś, Dorota Komosińska, Marcin Woliński (Institute of Computer Science, Polish Academy of Sciences)

Historical corpus – problems of transliteration, transcription and annotation on the example of the Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772)

During the seminar, the process of creating the Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772), also called the Baroque Corpus, will be discussed. Particular emphasis will be placed on those tasks and problems that are specific to historical corpora, in contrast to corpora of contemporary texts, e.g. the National Corpus of Polish. We will also show the tools that were created for the needs of the project or adapted to these needs. After the general presentation of the project (assumptions, financing, team, current status, the corpus's purpose) we will discuss particular problems in the order in which they appeared during the creation of the corpus: the selection of texts, gathering them and incorporating them into a database, the necessity of their transcription into modern spelling (resulting from the huge spelling variation in old prints and manuscripts), issues of morphological analysis, morphosyntactic annotation (manual and automatic) and corpus searching.

14 May 2018

Łukasz Kobyliński, Michał Wasiluk, Zbigniew Gawłowicz (Institute of Computer Science, Polish Academy of Sciences)

MTAS corpus search engine and its implementation for Polish language corpora

During the seminar we will discuss our experiences with the MTAS search engine in the context of Polish language corpora. We will present several implementations of MTAS in such corpus-related projects as KORBA (the corpus of Polish language of the XVII and XVIII century), the XIX century language corpus, as well as the National Corpus of Polish. We will also discuss preliminary experiments with implementing MTAS in Korpusomat - a tool that allows users to create their own corpora. During the presentation we will share our solutions to the problems encountered during the adaptation of MTAS to Polish and preliminary efficiency test results. We will also discuss the search capabilities of the engine and our plans for enhancing MTAS.

21 May 2018 (IPI PAN seminar presentation, 13:00)

Piotr Borkowski (Institute of Computer Science, Polish Academy of Sciences)

Semantic methods of categorization in the tasks of text document analysis

In my PhD thesis entitled "Semantic methods of categorization in the tasks of text document analysis", a new algorithm for the semantic categorization of documents was proposed and examined. On its basis, a new algorithm for category aggregation, a family of semantic classification algorithms, and a heterogeneous classifier committee (which combines the semantic categorization algorithm with previously known classifiers) were developed. In my talk I will briefly present their concepts and the results of studies of their effectiveness.

28 May 2018

Krzysztof Wołk (Polish-Japanese Academy of Information Technology)

Exploration and usage of comparable corpora in machine translation

The problem that will be presented in the seminar is how to improve machine speech translation between Polish and English. The most popular methodologies and tools are not well-suited for the Polish language and therefore require adaptation. Polish language resources are lacking in parallel and monolingual data. Therefore, the main objective of the study was to develop an automatic toolkit for textual resources preparation by mining comparable corpora and quasi comparable corpora. Experiments were conducted mostly on casual human speech, consisting of lectures, movie subtitles, European Parliament proceedings, and European Medicines Agency texts. The aims were to rigorously analyze the problems and to improve the quality of baseline systems, i.e., adaptation of techniques and training parameters to increase the Bilingual Evaluation Understudy (BLEU) score for maximum performance. A further aim was to create additional bilingual and monolingual data resources by using available online data and by obtaining and mining comparable corpora for parallel sentence pairs. For this task, a methodology employing a Support Vector Machine and the Needleman-Wunsch algorithm was used, along with a chain of specialized tools.
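
For readers unfamiliar with the Needleman-Wunsch algorithm named above, the following is a minimal sketch of global sequence alignment scoring by dynamic programming. It only illustrates the general algorithm: the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions, and the sketch does not reproduce how the algorithm was combined with the Support Vector Machine in the study.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    a and b may be strings or lists of tokens. The scoring parameters are
    illustrative assumptions, not values taken from the study.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[n][m]
```

The function returns only the alignment score; recovering the alignment itself would require a traceback through the score table, which is omitted here for brevity.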

4 June 2018

Piotr Przybyła (University of Manchester)

Supporting document screening for systematic reviews using machine learning and text mining

Systematic reviews, aiming to aggregate and analyse all the literature for a given research question, are a crucial tool in medical research. Their most laborious stage is screening, i.e. manual selection of dozens of relevant articles from thousands returned by search engines. Formulating the problem as a text classification task and using appropriate unsupervised text mining tools could save a significant amount of work. The presentation will cover the adaptation of machine learning algorithms to the problem, tools for extracting and visualising terms and topics in collections, and system deployment and evaluation at NICE (National Institute for Health and Care Excellence), a UK agency publishing health technology guidelines.

11 June 2018

Danijel Korzinek (Polish-Japanese Academy of Information Technology)

Preparing a speech corpus using the recordings of the Polish Film Chronicle

The presentation will describe how a speech corpus based on the Polish Film Chronicle, a collection of short historical news segments, was created during the CLARIN-PL project. This resource is an extremely useful tool for linguistic research, specifically in the context of historical speech and language. The years 1945–1960 were chosen for this purpose. The presentation will discuss various topics: from the legal issues of acquiring the resources to the more technical aspects of adapting speech analysis tools to this rather uncommon domain.

See the talks given between 2000 and 2015 and the current schedule.
