Natural Language Processing Seminar 2015–2016

The NLP Seminar is organised by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences (ICS PAS). It takes place on (some) Mondays, normally at 10:15 am, in the seminar room of the ICS PAS (ul. Jana Kazimierza 5, Warszawa).

12 October 2015

Vincent Ng (University of Texas at Dallas)

Beyond OntoNotes Coreference

Recent years have seen considerable progress on the notoriously difficult task of coreference resolution owing in part to the availability of coreference-annotated corpora such as MUC, ACE, and OntoNotes. Coreference, however, is more than MUC/ACE/OntoNotes coreference: it encompasses many interesting cases of anaphora that are not covered in the extensively investigated MUC/ACE/OntoNotes entity coreference task. This talk examined several comparatively less-studied coreference tasks that were arguably no less challenging than the MUC/ACE/OntoNotes entity coreference task, including the Winograd Schema Challenge, zero anaphora resolution, and event coreference resolution.

26 October 2015

Wojciech Jaworski (University of Warsaw)

Syntactic-semantic parser for Polish

The author presented the parser being developed within the CLARIN-PL project, its morphological pre-processing, the categorial grammar of Polish used by the parser and integrated with a valency dictionary, and the semantic graph formalism used for meaning representation. He also discussed the algorithms used by the parser and optimization strategies, related both to performance and to the concise representation of ambiguous syntactic and semantic parse trees.

16 November 2015

Izabela Gatkowska (Jagiellonian University in Kraków)

The Empirical Network of Lexical Links

The empirical network of lexical links is the result of an experiment that uses the human associative mechanism: the subject says the first word that comes to mind after understanding the stimulus word. The study was conducted in a cyclical manner, i.e. response words obtained in the first cycle were used as stimuli in the second cycle, which enabled the creation of a semantic network. This network differs both from networks built from text corpora, e.g. WORTSCHATZ, and from networks constructed by hand, e.g. WordNet. The empirically obtained connections between words in the network have a direction and a strength. The set of incoming and outgoing connections of a specific expression forms a lexical node of the network (a subnet). The manner in which the network characterizes meaning is shown on the example of feedback connections, a specific kind of dependency between two words in a lexical node. A qualitative analysis based on the semantic lexical relations known in linguistics, and employed for example in the WordNet dictionary, permits an interpretation of only approximately 25% of the feedback links. The remaining links may be interpreted by referring to the model of meaning description proposed in the FrameNet dictionary. A qualitative interpretation of all the links found in a lexical node may permit a comparative study of lexical network nodes experimentally constructed for different natural languages, and may also allow a separation of the empirical semantic models employed by the same set of links found between nodes in a given network.

30 November 2015

Dora Montagna (Universidad Autónoma de Madrid)

Semantic representation of a polysemous verb in Spanish

The author presented a theoretical model of representation of meaning, based on Pustejovsky's theory of the Generative Lexicon. The proposal is intended as a base for automatic disambiguation, but also as a new model of lexicographic description. The model will be applied to a highly productive verb in Spanish, assuming the hypothesis of verbal underspecification in order to establish patterns of semantic behaviors.

7 December 2015

Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences), Witold Kieraś (University of Warsaw)

Morphosyntactic tagging of Polish – state of the art and future perspectives

During the presentation, the state of the art in automatic morphosyntactic tagging of Polish text was discussed, with a particular focus on the performance of publicly available tools that can be used in real applications. A qualitative and quantitative analysis of the errors made by the taggers was conducted, along with a discussion of the possible causes of and solutions to these problems. Tagging results for Polish were compared and contrasted with the results for other European languages.

8 December 2015

Salvador Pons Bordería (Universitat de València)

Discourse Markers from a pragmatic perspective: The role of discourse units in defining functions

One of the most disregarded aspects in the description of discourse markers is position. Notions such as "initial position" or "final position" are meaningless unless it can be specified with regard to what a DM is "initial" or "final". The presentation defended the idea that, for this question to be answered, appeal must be made to the notion of "discourse unit". Provided with a set of a) discourse units, and b) discourse positions, determining the function of a given DM is quasi-automatic.

11 January 2016

Małgorzata Marciniak, Agnieszka Mykowiecka, Piotr Rychlik (Institute of Computer Science, Polish Academy of Sciences)

Terminology extraction from Polish data – program TermoPL

The presentation addressed the problems of terminology extraction from Polish domain corpora. The authors described the C-value method, which ranks term candidates based on their frequency and the number of contexts in which they occur. The method takes into account nested terms that may not appear by themselves in the data. Using this method, several nested grammatical subphrases are obtained which are syntactically correct but semantically odd, like 'USG jamy' ('USG of cavity'). The recognition of nested terms is supported by a word connection strength measure, which makes it possible to eliminate truncated phrases from the top part of the term list. The talk concluded with a demo of the TermoPL tool.
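
As a rough illustration of the ranking idea, the sketch below computes the classic C-value score as commonly formulated in the term-extraction literature. It is not TermoPL's code, and the variant used in the tool may differ. The names `is_nested` and `c_value` and the toy frequencies are assumptions made for this example only. Note that in this textbook form single-word candidates always score 0, because log2(1) = 0.

```python
import math


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(freq):
    """Score candidate terms with the textbook C-value formula.

    `freq` maps a candidate (a tuple of words) to its total corpus
    frequency, including occurrences nested inside longer candidates.
    """
    scores = {}
    for a, f_a in freq.items():
        # Longer candidates that contain `a` (its "contexts").
        containers = [b for b in freq if len(b) > len(a) and is_nested(a, b)]
        weight = math.log2(len(a))
        if containers:
            # Discount by the mean frequency of the containing candidates.
            mean_container_freq = sum(freq[b] for b in containers) / len(containers)
            scores[a] = weight * (f_a - mean_container_freq)
        else:
            scores[a] = weight * f_a
    return scores


# Hypothetical toy counts, not taken from any real corpus.
FREQ = {("contact", "lens"): 5, ("soft", "contact", "lens"): 3}
SCORES = c_value(FREQ)
```

Here the nested candidate `("contact", "lens")` is penalised for the three occurrences that are explained by the longer phrase, while the longer phrase is rewarded for its length.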

25 January 2016

Wojciech Jaworski (University of Warsaw)

Syntactic-semantic parser for Polish: integration with lexical resources, parsing

During the lecture the author presented the integration of the syntactic-semantic parser with SGJP, Polimorf, Słowosieć and Walenty, as well as preliminary observations concerning the impact that checking semantic preferences has on parsing. He also described the categorial formalism used for parsing and briefly presented how the parser works.

22 February 2016

Witold Dyrka (Wrocław University of Technology)

Language(s) of proteins? – premises, contributions and perspectives

In his talk the author presented arguments in favour of treating protein sequences, or higher protein structures, as sentences in some language(s). He then showed several interesting results (his own and others') of applying quantitative methods of text analysis and formal linguistic tools (such as probabilistic context-free grammars) to the analysis of proteins. Finally, he presented plans for his further work on "protein linguistics", which, as he hoped, would inspire an interesting discussion.

22 February 2016

Linguistic Engineering Group (Institute of Computer Science, Polish Academy of Sciences)

Extended seminar

12:00–12:15: People, projects, tools

12:15–12:45: Morfeusz 2: analyzer and inflectional synthesizer for Polish

12:45–13:15: Toposław: Creating MWU lexicons

13:15–13:45: Lunch break

13:45–14:15: TermoPL: Terminology extraction from Polish data

14:15–14:45: Walenty: Valency dictionary of Polish

14:45–15:15: POLFIE: LFG grammar for Polish

7 March 2016

Zbigniew Bronk (Grammatical Dictionary of Polish team member)

JOD – a markup language for Polish declension

JOD, a markup language for Polish declension, was constructed in order to precisely describe inflectional rules and schemes for nouns and adjectives in Polish. Its first application was the description of the inflection of surnames, taking into account the sex of the person or persons using the given surname. This model has been the basis for the "Automaton of declension of Polish surnames". The author presented the general idea of the language and the implementation of its interpreter, as well as the JOD editor and the website "Automaton of declension of Polish surnames".

21 March 2016

Bartosz Zaborowski, Aleksander Zabłocki (Institute of Computer Science, Polish Academy of Sciences)

Poliqarp2 on the home straight

In this talk the authors present the linguistic data search engine Poliqarp 2, on which they have been working for the last three years. They describe both technical aspects and interesting features from the user's point of view. They briefly recall the data model supported by the engine, the structure of the language supported by the new query engine, its expressive power, and the differences compared to the previous version. In particular, they focus on elements added or modified during the development of the project (support for the Składnica and LFG data models, post-processing, syntactic sugar). On the technical side, they briefly present the software architecture and some details of the implementation of indexes. They also describe nontrivial decisions related to input data processing (the National Corpus of Polish in particular). They end the talk by presenting the results of preliminary efficiency measurements.

4 April 2016

Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences)

Identification of opinion targets in Polish

The seminar concluded and summarised the results of a National Science Centre (NCN) grant that finished in January 2016. It presented three resources with labelled sentiments and opinion targets, developed within the project: a bank of dependency trees created from a corpus of product reviews, a subset of the Składnica dependency treebank, and a collection of tweets. The seminar included a discussion of experiments on automated recognition of opinion targets. These involved two parsing methods, dependency and shallow, and a hybrid method in which the results of syntactic analysis were used by statistical models (e.g. CRF).

21 April 2016 (Thursday)

Magdalena Derwojedowa (University of Warsaw)

“Tem lepiej, ale jest to interes miljonowy i traktujemy go poważnie” – A thousand words a thousand times in 5 parts

The talk presented the 1M corpus of the project "Automatic morphological analysis of Polish texts from the 1830–1918 period with respect to the evolution of inflection and spelling" (DEC-2012/07/B/HS2/00570): the structure of the corpus, its stylistic, temporal and regional diversity, as well as the inflectional characteristics of the resource in comparison with the features described in Bajerowa's works.

9 May 2016

Daniel Janus (Rebased.pl)

From unstructured data to searchable metadata-rich corpus: Skyscraper, P4, Smyrna

The presentation described tools facilitating the construction of custom datasets, in particular corpora of texts. The author presented Skyscraper, a library for scraping structured data out of whole WWW sites, and Smyrna, a concordancer for Polish texts enriched with metadata. In addition, a dataset built using these tools was presented: the Polish Parliamentary Proceedings Processor (PPPP, or P4), including, inter alia, a continuously updated corpus of speeches in the Polish parliament. The presentation largely focused on the technical solutions used in the tools shown.

19 May 2016 (Thursday)

Kamil Kędzia, Konrad Krulikowski (University of Warsaw)

Generating paraphrases' templates for Polish using parallel corpus

Software for generating paraphrases in Polish was prepared within the CLARIN-PL project. The developers will demonstrate how it works on chosen examples. They will also explain the method of Ganitkevitch et al. (2013), which allowed its authors to create the openly available Paraphrase Database (PPDB). Furthermore, they will discuss its enhancements and the approach to challenges specific to the Polish language. Additionally, they will demonstrate a way of measuring the quality of paraphrases.

23 May 2016

Damir Ćavar (Indiana University)

The Free Linguistic Environment

The Free Linguistic Environment (FLE) started as a project to develop an open and free platform for white-box modeling and grammar engineering, i.e. development of natural language morphologies, prosody, syntax, and semantic processing components that are for example based on theoretical frameworks like two-level morphology, Lexical Functional Grammar (LFG), Glue Semantics, and similar. FLE provides a platform that makes use of some classical algorithms and also new approaches based on Weighted Finite State Transducer models to enable probabilistic modeling and parsing at all linguistic levels. Currently its focus is to provide a platform that is compatible with LFG and an extended version of it, one that we call Probabilistic Lexical Functional Grammar (PLFG). This probabilistic modeling can apply to the c(onstituent)-structure component, i.e. a Context Free Grammar (CFG) backbone can be extended to a Probabilistic Context Free Grammar (PCFG). Probabilities in PLFG can also be associated with structural representations and corresponding f(unctional feature)-structures or semantic properties, i.e. structural and functional properties and their relations can be modeled using weights that can represent probabilities or other forms of complex scores or metrics. In addition to these extensions of the LFG framework, FLE also provides an open platform for experimenting with algorithms for semantic processing or analyses based on (probabilistic) lexical analyses, c- and f-structures, or similar such representations. Its architecture is extensible to cope with different frameworks, e.g. dependency grammar, optimality theory based approaches, and many more.

6 June 2016

Karol Opara (Systems Research Institute of the Polish Academy of Sciences)

Grammatical rhymes in Polish poetry – a quantitative analysis

Polish is a highly inflected language and parts of speech in the same morphological form have common endings. This allows one to easily find a multitude of rhyming words known as grammatical rhymes. Their overuse is strongly discouraged in the contemporary Polish literary canon due to their alleged banality. The talk presented the results of computer-aided investigations into poets' technical mastery based on estimating the share of grammatical rhymes in their verses. A method of automatic rhyme detection was discussed, as well as the extraction of statistical information from texts and a new "literary" criterion for choosing the sample size for statistical tests. Finally, a ranking of the technical mastery of various Polish poets was presented.

Natural Language Processing Seminar 2016–2017

10 October 2016

Katarzyna Pakulska, Barbara Rychalska, Krystyna Chodorowska, Wojciech Walczak, Piotr Andruszkiewicz (Samsung)

Paraphrase Detection Ensemble – SemEval 2016 winner

This seminar describes the winning solution designed for a core track within the SemEval 2016 English Semantic Textual Similarity task. The goal of the competition was to measure semantic similarity between two given sentences on a scale from 0 to 5. At the same time the solution should replicate human language understanding. The presented model is a novel hybrid of recursive auto-encoders from deep learning (RAE) and a WordNet award-penalty system, enriched with a number of other similarity models and features used as input for Linear Support Vector Regression.

24 October 2016

Adam Przepiórkowski, Jakub Kozakoszczak, Jan Winkowski, Daniel Ziembicki, Tadeusz Teleżyński (Institute of Computer Science, Polish Academy of Sciences / University of Warsaw)

Corpus of formalized textual entailment steps

The authors present resources created within the CLARIN project that aim to help with qualitative evaluation of RTE systems: two corpora of textual derivations and a corpus of textual entailment rules. A textual derivation is a series of atomic steps which connects the Text with the Hypothesis in a textual entailment pair. The original pairs are taken from the FraCaS corpus and a Polish translation of the RTE3 corpus. A textual entailment rule sanctions the textual entailment relation between the input and the output of a step, using syntactic patterns written in the UD standard and some other semantic, logical and contextual constraints expressed in FOL.

7 November 2016

Rafał Jaworski (Adam Mickiewicz University in Poznań)

Concordia – translation memory search algorithm

The talk covers the Concordia algorithm which is used to maximize the productivity of a human translator. The algorithm combines the features of standard fuzzy translation memory searching with a concordancer. As the key non-functional requirement of computer-aided translation mechanisms is performance, Concordia incorporates upgraded versions of standard approximate searching techniques, aiming at reducing the computational complexity.

21 November 2016

Norbert Ryciak, Aleksander Wawer (Institute of Computer Science, Polish Academy of Sciences)

Using recursive deep neural networks and syntax to compute phrase semantics

The seminar presents initial experiments on recursive phrase-level sentiment computation using dependency syntax and deep learning. We discuss neural network architectures and implementations created within Clarin 2 and present results on English language resources. The seminar also covers ongoing work on Polish language resources.

5 December 2016

Dominika Rogozińska, Marcin Woliński (Institute of Computer Science, Polish Academy of Sciences)

Methods of syntax disambiguation for constituent parse trees in Polish as post–processing phase of the Świgra parser

The presentation shows methods of syntactic disambiguation for parses of Polish utterances produced by the Świgra parser. The presented methods include probabilistic context-free grammars and maximum entropy models. The best of the described models achieves an efficiency measure of 96.2%. The outcome of our experiments is a module for post-processing Świgra's parses.
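
To make the PCFG-based variant concrete, the sketch below picks the most probable tree from a set of candidate parses by multiplying rule probabilities. It is an illustrative toy, not the module described in the talk. The tree encoding, the rule table `RULES`, and the function names are all assumptions, and the English example with its probabilities is invented.

```python
import math


def tree_log_prob(tree, rule_probs):
    """Log-probability of a tree under a PCFG.

    A tree is a tuple (label, child, ...); a preterminal is (label, word).
    `rule_probs` maps (lhs, rhs_tuple) to the probability of that rule.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        # Lexical rule: preterminal -> word.
        return math.log(rule_probs[(label, (children[0],))])
    rhs = tuple(child[0] for child in children)
    return math.log(rule_probs[(label, rhs)]) + sum(
        tree_log_prob(child, rule_probs) for child in children
    )


def disambiguate(trees, rule_probs):
    """Return the candidate parse with the highest PCFG probability."""
    return max(trees, key=lambda t: tree_log_prob(t, rule_probs))


# Hypothetical toy grammar for a prepositional-phrase attachment ambiguity.
RULES = {
    ("VP", ("V", "NP", "PP")): 0.4,
    ("VP", ("V", "NP")): 0.6,
    ("NP", ("N",)): 0.7,
    ("NP", ("N", "PP")): 0.3,
    ("PP", ("P", "NP")): 1.0,
    ("V", ("sees",)): 1.0,
    ("P", ("with",)): 1.0,
    ("N", ("dog",)): 0.5,
    ("N", ("telescope",)): 0.5,
}

# Two competing parses of "sees dog with telescope".
T_VERB_ATTACH = ("VP", ("V", "sees"), ("NP", ("N", "dog")),
                 ("PP", ("P", "with"), ("NP", ("N", "telescope"))))
T_NOUN_ATTACH = ("VP", ("V", "sees"),
                 ("NP", ("N", "dog"),
                  ("PP", ("P", "with"), ("NP", ("N", "telescope")))))

BEST = disambiguate([T_VERB_ATTACH, T_NOUN_ATTACH], RULES)
```

With these invented probabilities the verb-attachment reading wins. In a realistic setting the rule probabilities would be estimated from a treebank rather than written by hand.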

9 January 2017

Agnieszka Pluwak (Institute of Slavic Studies, Polish Academy of Sciences)

Building a domain-specific knowledge representation using an extended method of frame semantics on a corpus of Polish, English and German lease agreements

The FrameNet project is defined by its authors as a lexical base with some ontological features (not an ontology sensu stricto, however, due to a selective approach towards the description of frames and lexical units, as well as frame-to-frame relations). Ontologies, as knowledge representations in the field of NLP, should be applicable to specific domains and texts. However, in the FrameNet bibliography published before January 2016 I haven't found a single knowledge representation based entirely on frames or on an extensive structure of frame-to-frame relations. I did find a few examples of domain-specific knowledge representations that use selected FrameNet frames, such as BioFrameNet or Legal FrameNet, where frames were applied to connect data from different sources. Therefore, in my dissertation, I decided to conduct an experiment and build a knowledge representation of frame-to-frame relations for the domain of lease agreements. The aim of my study was to describe frames useful for building a possible data extraction system for lease agreements, that is, frames containing answers to questions asked by a professional analyst while reading lease agreements. In my work I have asked several questions, e.g. would I be able to use FrameNet frames for this purpose or would I have to build my own frames? Will the analysis of Polish cause language-specific problems? How will the professional language affect the use of frames in context?

23 January 2017

Marek Rogalski (Lodz University of Technology)

Automatic paraphrasing

Paraphrasing is conveying the essential meaning of a message using different words. The ability to paraphrase is a measure of understanding. A teacher asking a student "could you please tell us using your own words ..." tests whether the student has understood the topic. In this presentation we will discuss the task of automatic paraphrasing. We will differentiate between syntax-level paraphrases and essential-meaning-level paraphrases. We will bring up several techniques from seemingly unrelated fields that can be applied in automatic paraphrasing. We will also show results that we've been able to produce with those techniques.

6 February 2017

Łukasz Kobyliński (Institute of Computer Science, Polish Academy of Sciences)

Korpusomat – a tool for creation of searchable own corpora

Korpusomat is a web tool facilitating unassisted creation of corpora for linguistic studies. Text files sent to the tool are automatically morphologically analysed and lemmatised using Morfeusz and disambiguated using the Concraft tagger. The resulting corpus can then be downloaded and analysed offline using the Poliqarp search engine to query for information related to text segmentation, base forms, inflectional interpretations and (dis)ambiguities. Poliqarp is also capable of calculating frequencies and applying basic statistical measures necessary for quantitative analysis. Apart from plain text files, Korpusomat can also process more complex textual formats such as the popular EPUB, download source data from the Internet, strip unnecessary information and extract document metadata.

20 February 2017 (invited talk at the Institute seminar)

Elżbieta Hajnicz (Institute of Computer Science, Polish Academy of Sciences)

Representation language of the valency dictionary Walenty

The Polish Valence Dictionary (Walenty) is intended to be used by natural language processing tools, particularly parsers, and thus it offers formalized representation of the valency information. The talk presented the notion of valency and its representation in the dictionary along with examples illustrating how particular syntactic and semantic language phenomena are modelled.

2 March 2017

Wojciech Jaworski (University of Warsaw)

Integration of dependency parser with a categorial parser

As part of the talk I will describe the division of texts into sentences and the control of the execution of each parser within the hybrid parser emerging in the Clarin-bis project. I will describe the adopted method of converting dependency structures, aimed at making them compatible with the structures of the categorial parser. The conversion will have two aspects: changing the attributes of each node and changing the links between nodes. I will show how the method used can be extended to convert compressed forests generated by the Świgra parser. At the end I will talk about the plans and goals of the reimplementation of the MateParser algorithm.

13 March 2017

Marek Kozłowski, Szymon Roziewski (National Information Processing Institute)

Internet model of Polish and semantic text processing

The presentation shows how BabelNet (the multilingual encyclopaedia and semantic network based on publicly available data sources such as Wikipedia and WordNet) can be used in tasks such as grouping short texts, sentiment analysis and emotional profiling of movies based on their subtitles. The second part presents work based on CommonCrawl, a publicly available, petabyte-size open repository of multilingual Web pages. CommonCrawl was used to build two models of Polish: n-gram-based and semantic distribution-based.

20 March 2017

Jakub Szymanik (University of Amsterdam)

Semantic Complexity Influences Quantifier Distribution in Corpora

In this joint paper with Camilo Thorne, we study whether semantic complexity influences the distribution of generalized quantifiers in a large English corpus derived from Wikipedia. We consider the minimal computational device recognizing a generalized quantifier as the core measure of its semantic complexity. We regard quantifiers that belong to three increasingly more complex classes: Aristotelian (recognizable by 2-state acyclic finite automata), counting (k+2-state finite automata), and proportional quantifiers (pushdown automata). Using regression analysis we show that semantic complexity is a statistically significant factor explaining 27.29% of frequency variation. We compare this impact to that of other known sources of complexity, both semantic (quantifier monotonicity and the comparative/superlative distinction) and superficial (e.g., the length of quantifier surface forms). In general, we observe that the more complex a quantifier, the less frequent it is.
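
The three complexity classes can be illustrated with a small sketch in the style of semantic automata. It is not the authors' code, and the encoding is an assumption of this example: a situation is a string over `1` and `0`, where `1` stands for an A that is B and `0` for an A that is not B. All names (`run_dfa`, `EVERY`, `exactly`, `most`) are hypothetical.

```python
def run_dfa(transitions, start, accepting, word):
    """Run a deterministic finite automaton and report acceptance."""
    state = start
    for symbol in word:
        state = transitions[(state, symbol)]
    return state in accepting


# Aristotelian "every A is B": two states, no cycles apart from self-loops.
EVERY = (
    {("ok", "1"): "ok", ("ok", "0"): "fail",
     ("fail", "1"): "fail", ("fail", "0"): "fail"},
    "ok",
    {"ok"},
)


def exactly(k):
    """Counting quantifier "exactly k A are B": states 0..k plus one sink."""
    transitions = {}
    for state in range(k + 2):
        transitions[(state, "0")] = state
        transitions[(state, "1")] = min(state + 1, k + 1)
    return transitions, 0, {k}


def most(word):
    """Proportional "most A are B": needs a stack, not just finite states.

    The stack holds the current surplus of one symbol over the other.
    """
    stack = []
    for symbol in word:
        if stack and stack[-1] != symbol:
            stack.pop()  # cancel one 1 against one 0
        else:
            stack.append(symbol)
    return bool(stack) and stack[-1] == "1"
```

For instance, `run_dfa(*EVERY, "111")` accepts, `run_dfa(*exactly(2), "0110")` accepts, and `most("1100")` rejects because the ones do not outnumber the zeros. The finite automata need a fixed number of states, whereas the stack in `most` can grow with the input, which is the intuition behind the complexity ordering used in the paper.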

27 March 2017 (invited talk at the institute seminar)

Paweł Morawiecki (Institute of Computer Science, Polish Academy of Sciences)

Introduction to deep neural networks

In the last few years, Deep Neural Networks (DNNs) have become a tool that provides the best solutions to many problems in image and speech recognition. In natural language processing, too, DNNs have revolutionized the way translation, word representation and many other problems are handled. This presentation aims to provide good intuitions about DNNs, their core architectures and how they operate. I will discuss and suggest tools and source materials that can help in further exploration of the topic and in independent experiments.

3 April 2017

Katarzyna Budzynska, Chris Reed (Institute of Philosophy and Sociology, Polish Academy of Sciences / University of Dundee)

Argument Corpora, Argument Mining and Argument Analytics (part I)

Argumentation, the most prominent way people communicate, has been attracting a lot of attention since the very beginning of scientific reflection. The Centre for Argument Technology has been developing the infrastructure for studying argument structures for almost two decades. Our approach demonstrates several characteristics. First, we build upon the graph-based standard for argument representation, Argument Interchange Format AIF (Rahwan et al., 2007); and Inference Anchoring Theory IAT (Budzynska and Reed, 2011), which allows us to capture the dialogic context of argumentation. Second, we focus on a variety of aspects of argument structures such as argumentation schemes (Lawrence and Reed, 2016); illocutionary intentions speakers associate with arguments (Budzynska et al., 2014a); ethos of arguments' authors (Duthie et al., 2016); rephrase relation which paraphrases parts of argument structures (Konat et al., 2016); and protocols of argumentative dialogue games (Yaskorska and Budzynska, forthcoming).

10 April 2017

Paweł Morawiecki (Institute of Computer Science, Polish Academy of Sciences)

Neural nets for natural language processing – selected architectures and problems

For the last few years more and more problems in NLP have been successfully tackled with neural nets, particularly with deep architectures. These include problems such as sentiment analysis, topic classification, coreference, word representations and image labelling. In this talk I will give some details on the most promising architectures used in NLP, including recurrent and convolutional nets. The presented solutions will be discussed in the context of a concrete problem, namely coreference resolution in Polish.

15 May 2017

Katarzyna Budzynska, Chris Reed (Institute of Philosophy and Sociology, Polish Academy of Sciences / University of Dundee)

Argument Corpora, Argument Mining and Argument Analytics (part II)

In the second part of our presentation we will describe characteristics of argument structures using examples from our AIF corpora of annotated argument structures in various domains and genres (see also OVA+ annotation tool) including moral radio debates (Budzynska et al., 2014b); Hansard records of the UK parliamentary debates (Duthie et al., 2016); e-participation (Konat et al., 2016; Lawrence et al., forthcoming); and the US 2016 presidential debates (Visser et al., forthcoming). Finally, we will show how such complex argument structures, which on the one hand make the annotation process more time-consuming and less reliable, can on the other hand result in automatic extraction of a variety of valuable information when applying technologies for argument mining (Budzynska and Villata, 2017; Lawrence and Reed, forthcoming) and argument analytics (Reed et al., forthcoming).

12 June 2017 (invited talk at the Institute seminar)

Adam Pawłowski (University of Wrocław)

Sequential structures in texts

The subject of my lecture is the phenomenon of sequentiality in linguistics. Sequentiality is defined here as a characteristic feature of a text or of a collection of texts, which expresses the sequential relationship between units of the same type, ordered along the axis of time or according to a different variable (e.g. the sequence of reading or publishing). In order to model sequentiality thus understood, we can use, among others, time series, spectral analysis, the theory of stochastic processes, information theory or some tools of acoustics. Referring to both my own research and existing literature, in my lecture I will present sequential structures and selected models thereof in continuous texts, as well as models used in relation to sequences of several texts (known as chronologies of works). I will also mention glottochronology, a branch of quantitative linguistics that aims at mathematical modeling of the development of language over long periods of time. Finally, I will refer to philosophical attempts to elucidate sequentiality (the notion of the text's 'memory', the result chain, Pythagoreanism, Platonism).

See the talks given between 2000 and 2015 and the current schedule.
