This archive, PNET (Polish Named Entity Triggers), contains a list of external and internal evidences (trigger words) for Polish named entities (NEs).

An external or internal NE evidence is a word or word sequence that frequently appears near (external) or inside (internal) named entities and is a good indicator of their type.
For instance, "aktor" ('actor') is an external evidence for person names (as in "aktor [Zbigniew Buczkowski]"), while "von" is an internal evidence for the same type ("[John von Neumann]"). Many words can be both external and internal evidences, e.g. "jezioro" ('lake') is an external evidence in "jezioro [Mamry]" ('[Mamry] lake') and an internal evidence in "[Jezioro Białe]" ('[White Lake]').

External and internal NE evidences can be used in automatic NE recognition via grammar-based or machine-learning methods.   
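
As a rough illustration of the grammar-based use, the Python sketch below marks a run of capitalized tokens as a candidate person name when it directly follows an external trigger. The trigger set, the function name and the sample sentence are made up for this example; a real system would load the triggers from this resource and rely on proper tokenization and disambiguation.

```python
# Hypothetical sample of external person-name triggers (inflected forms);
# a real system would load them from the PNET file.
EXTERNAL_PERS_TRIGGERS = {"aktor", "aktora", "aktorowi"}


def candidate_person_names(tokens):
    """Return runs of capitalized tokens that directly follow an external trigger."""
    candidates = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in EXTERNAL_PERS_TRIGGERS:
            j = i + 1
            # Collect the run of capitalized tokens after the trigger.
            while j < len(tokens) and tokens[j][:1].isupper():
                j += 1
            if j > i + 1:
                candidates.append(" ".join(tokens[i + 1:j]))
            i = j
        else:
            i += 1
    return candidates


# Made-up sentence, pre-tokenized on whitespace.
tokens = "W filmie wystąpił aktor Zbigniew Buczkowski .".split()
print(candidate_person_names(tokens))  # ['Zbigniew Buczkowski']
```

Internal evidences could be handled in a similar way by testing the tokens inside a candidate span rather than the token before it.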

===================
Production method:

The list has been created semi-automatically:
- A list of over 260,000 proper names has been extracted from Polish Wikipedia (on the basis of Wikipedia infobox types and categories). 
- Their types have been manually mapped onto the Prolexbase ontology and then onto the NKJP typology.
- Words and word sequences that appear in at least 5 multi-word names have been automatically extracted for the following Prolexbase types: Way, Organization, Association, Ensemble, Firm, and Institution. For type City the frequency threshold was 10 instead of 5.
- The resulting words and word sequences have been manually validated. 
- Trigger words for the remaining types have been collected manually from Prolexbase and Wikipedia entries.
- All single words have been automatically matched against the PoliMorf (version 0.6.1) lemmas. For each lemma known to PoliMorf, the list of all its inflected forms has been included. Unknown evidence words and multi-word evidences were left unchanged.
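
The frequency-based extraction step above can be sketched in Python as follows. This is only an approximation under stated assumptions, not the authors' actual procedure: names are lowercased and split on whitespace, sequences are limited to `max_len` words, and each sequence is counted once per name. The function name, its parameters and the toy input are hypothetical.

```python
from collections import Counter


def frequent_sequences(names, min_names=5, max_len=2):
    """Return word sequences occurring in at least `min_names` multi-word names."""
    counts = Counter()
    for name in names:
        words = name.lower().split()
        if len(words) < 2:  # only multi-word names are considered
            continue
        seen = set()
        # Sequences shorter than the name itself, up to max_len words.
        for n in range(1, min(max_len, len(words) - 1) + 1):
            for start in range(len(words) - n + 1):
                seen.add(" ".join(words[start:start + n]))
        counts.update(seen)  # count each sequence once per name
    return {seq: c for seq, c in counts.items() if c >= min_names}


# Toy input; the threshold is lowered to 3 so that the example stays small.
names = ["Jezioro Białe", "Jezioro Ochrydzkie", "Jezioro Czarne", "Mamry"]
print(frequent_sequences(names, min_names=3))  # {'jezioro': 3}
```

The candidates produced this way would still need the manual validation described above.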

===================
Contents:

---------------------------------------------------------------------
               | Internal evid. |  External evid. |      Total      |
   NE type     |----------------|-----------------|-----------------|
               | lemmas | forms | lemmas |  forms | lemmas |  forms |
---------------|----------------|-----------------|-----------------|
   persName    |      2 |    35 |    846 | 13,198 |    848 | 13,233 |
---------------|----------------|-----------------|-----------------|
   orgName     |    124 | 1,913 |     77 |  1,100 |    124 |  3,013 |
---------------|----------------|-----------------|-----------------|
   geogName    |    187 | 2,689 |    179 |  2,601 |    211 |  5,290 |
---------------|----------------|-----------------|-----------------|
 p | district  |      4 |    57 |     10 |    144 |     12 |    201 |
 l |-----------|----------------|-----------------|-----------------|
 a | settlement|    201 | 3,805 |     18 |    266 |    211 |  4,071 |
 c |-----------|----------------|-----------------|-----------------|
 e | region    |     28 |   427 |     49 |    768 |     58 |  1,195 |
 N |-----------|----------------|-----------------|-----------------|
 a | country   |     30 |   454 |     22 |    350 |     30 |    804 |
 m |-----------|----------------|-----------------|-----------------|
 e | bloc      |     21 |   306 |      0 |      0 |     21 |    306 |
---------------|----------------|-----------------|-----------------|
     ALL       |    602 | 9,672 |  1,173 | 18,413 |  1,503 | 28,085 |
---------------------------------------------------------------------
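
In the table, the total number of lemmas for a type can be smaller than the sum of its internal and external lemmas, because a lemma used in both roles is counted once in the total. The Python sketch below shows one way such lemma counts could be computed from parsed rows. How the authors actually counted is not documented here; in particular, falling back to the inflected form when a row has no lemma is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict


def count_lemmas(rows):
    """Count distinct lemmas per (ne_type, evidence_type) and per (ne_type, 'total').

    rows: iterable of (inflected, lemma, tag, evidence_type, ne_type, example).
    """
    lemmas = defaultdict(set)
    for inflected, lemma, _tag, ev_type, ne_type, _example in rows:
        key = lemma or inflected  # assumption: use the form if there is no lemma
        lemmas[(ne_type, ev_type)].add(key)
        lemmas[(ne_type, "total")].add(key)
    return {k: len(v) for k, v in lemmas.items()}


rows = [
    ("jeziorze", "jezioro", "subst:sg:loc:n2", "ext", "geog", ""),
    ("Jeziorze", "jezioro", "subst:sg:loc:n2", "int", "geog", "Jezioro Ochrydzkie"),
]
counts = count_lemmas(rows)
# "jezioro" counts once as external, once as internal, and once in the total.
print(counts[("geog", "ext")], counts[("geog", "int")], counts[("geog", "total")])  # 1 1 1
```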

===================
Typology:

The NE typology used here is compatible with the NE annotation in the National Corpus of Polish (NKJP). The following types and subtypes are used:
- bloc,
- country,
- district,
- geogName,
- orgName,
- persName,
- region,
- settlement.

===================
Authors:
Małgorzata Baron
Leszek Manicki
Agata Savary

===================
License:
The resource is available under the 2-clause BSD license. 

===================
Last modification:
October 25, 2012

===================
File format:

The file contains one evidence word per line. Each line has 6 TAB-separated fields.
<inflected>	<lemma>	<tag>	<evidence-type>	<ne-type>	<example>

<inflected> - inflected form of the (possibly multi-word or foreign) evidence word, e.g. "Jeziorze"
<lemma> - lemma of the evidence word (if any), e.g. "jezioro"
<tag> - morphosyntactic tag (if any), e.g. "subst:sg:loc:n2"
<evidence-type> - type of the evidence word: 'ext' (external) or 'int' (internal)
<ne-type> - type of the named entity signalled by the evidence word, e.g. 'geogName'
<example> - sample named entity (if any) containing the evidence word (for internal evidences only), e.g. "Jezioro Ochrydzkie"

Examples:
aktorowi	aktor	subst:sg:dat:m1	ext	pers	
jeziorze	jezioro	subst:sg:loc:n2	ext	geog	
Jeziorze	jezioro	subst:sg:loc:n2	int	geog	Jezioro Ochrydzkie
n.	nad	prep:inst:nwok	int	settlement	Babice n. Sanem
sur-Marne			int	settlement	Nogent-sur-Marne
von			int	pers	John von Neumann
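
A minimal Python sketch for reading lines in this format is given below. The `Evidence` type and the `parse_pnet` function are hypothetical names introduced for illustration. The sketch also pads short lines to 6 fields, on the assumption that trailing empty fields may have been stripped by an editor or another tool.

```python
from typing import Iterable, Iterator, NamedTuple


class Evidence(NamedTuple):
    inflected: str
    lemma: str          # empty if unknown
    tag: str            # empty if unknown
    evidence_type: str  # 'ext' or 'int'
    ne_type: str
    example: str        # empty if none


def parse_pnet(lines: Iterable[str]) -> Iterator[Evidence]:
    """Yield one Evidence per non-empty line of TAB-separated text."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        # Pad in case trailing empty fields are missing (an assumption).
        fields += [""] * (6 - len(fields))
        yield Evidence(*fields[:6])


sample = [
    "aktorowi\taktor\tsubst:sg:dat:m1\text\tpers\t\n",
    "von\t\t\tint\tpers\tJohn von Neumann\n",
]
for entry in parse_pnet(sample):
    print(entry.inflected, entry.evidence_type, entry.ne_type)
# aktorowi ext pers
# von int pers
```

To read the actual file, an open file object can be passed instead of the sample list, e.g. `parse_pnet(open(path, encoding="utf-8"))`, where `path` and the UTF-8 encoding are assumptions.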

===================
Links:
PoliMorf - http://zil.ipipan.waw.pl/PoliMorf
NKJP - http://nkjp.pl
Prolexbase - http://www.cnrtl.fr/lexiques/prolex/


