Datasets and Formats
Five different corpora will be used in the task. Each distribution includes a general README file as well as separate info.txt and (optionally) tagsets.pdf files with information specific to each data set (source, license, automatic tools, tagsets, etc.).
Catalan, Spanish: The data sets come from the AnCora-CO corpora (Recasens and Martí, 2009). AnCora-ES (the Spanish part) contains 75k words from the Lexesp balanced 6-million-word corpus, 225k words from the EFE news agency, and 200k words from the Spanish version of the 'El Periódico' newspaper. AnCora-CA (the Catalan part) consists of 75k words from the EFE news agency, 225k words from the ACN news agency, and 200k words from the Catalan version of the 'El Periódico' newspaper. The 200k-word subset from 'El Periódico' covers the same news stories in Catalan and Spanish, spanning January to December 2000. The corpora are hand-annotated with constituents, functions, thematic roles, semantic verb classes, named entities, WordNet nominal senses, and coreference. Training: 300-330k words. Test: 50k words. Freely available for research purposes.
Dutch: The data set comes from the KNACK-2002 corpus, which contains texts from the Flemish weekly magazine Knack. It is hand-annotated with coreference. Training: 168 documents.
English: The data set is an excerpt of news from the OntoNotes Corpus Release 2.0. The OntoNotes project is a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute to annotate a one-million-word English corpus by hand. The corpus is hand-annotated with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology, NEs, and coreference). Training: 100k words. Test: 24k words. The OntoNotes corpus is distributed by LDC. LDC will distribute the task training and test data sets to SemEval-2010 participants after they sign and submit a license agreement for the data. The license agreement requires the data to be returned or destroyed at the end of the task. The website will be announced in due time.
German: The data set comes from the TüBa-D/Z Treebank (Hinrichs et al., 2005), a German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz). It is hand-annotated with inflectional morphology, constituent structure, grammatical functions, and anaphoric and coreference relations. Training: 415k words. 'die tageszeitung', the owner of the copyright to the original texts, grants all participants of the SemEval task a temporary license for the duration of the task.
Italian: The data set comes from the LiveMemories corpus, an Italian corpus under construction as part of the LiveMemories project. The corpus includes texts from Wikipedia, blogs, and news articles. The excerpt for SemEval-2010 consists of texts from the Italian Wikipedia. Training: 100k. Test: 50k. The data are distributed under the Wikipedia distribution rules.
DATA FORMATTING
Each document opens with a header line of the form:

#begin document CESS-CAT-AAP/95694_20030723.tbf.xml
Inside a document, each sentence is laid out vertically, with one word per line. Each word is described by several fields (columns) representing different layers of linguistic annotation. Columns are separated by TAB characters. Sentences are separated by a blank line.

ID TOKEN LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL NE PNE PRED PPRED APREDs PAPREDs COREF

Column 1 (ID)
Columns 2--8: words and morphosyntactic information
Columns 9--12: syntactic dependency tree
Columns 13--14 (NE, PNE)
Columns 15--16+N+M: semantic role labeling
Last column: output to be predicted

For more details on the corpora and the annotation, see the README file included in the distribution (Downloads).
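The layout described above (one word per line, TAB-separated columns, blank line between sentences) can be read with a few lines of Python. This is only a minimal sketch, not an official reader: the function name `read_sentences` is hypothetical, and treating every line that starts with "#" as a document marker is an assumption based on the "#begin document" header.

```python
from typing import Dict, Iterable, Iterator, List

# Names of the first 16 columns, in the order listed in the task description.
# The variable-width APREDs/PAPREDs columns sit between PPRED and COREF,
# so COREF is always taken from the last field instead of a fixed index.
FIXED_COLUMNS = [
    "ID", "TOKEN", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
    "HEAD", "PHEAD", "DEPREL", "PDEPREL", "NE", "PNE", "PRED", "PPRED",
]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of {column name: value} rows."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Assumption: lines starting with "#" are document markers,
        # and blank lines end the current sentence.
        if line.startswith("#") or not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        row = dict(zip(FIXED_COLUMNS, fields))
        row["COREF"] = fields[-1]  # last column: output to be predicted
        sentence.append(row)
    if sentence:
        yield sentence
```

A typical call would be `for sentence in read_sentences(open(path, encoding="utf-8")): ...`, where `path` points to one of the distributed data files.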
DETAILS ON THE COREFERENCE ANNOTATION
The coreference annotation appears in the last column in a numerical-bracketed format. Every entity has an ID number, and every mention is marked with the ID of the entity it refers to. An opening parenthesis (before the entity ID) marks the beginning of the mention (first token), and a closing parenthesis (after the entity ID) marks the end of the mention (last token). The following examples are taken from this Catalan passage (AnCora-CA):

[La remodelada plaça del [Mercat]_2]_1 es va inaugurar ahir amb actes d'homenatge a [Josep_Roura_i_Estrada]_3 (1787-1860), conegut per la introducció de l'enllumenat públic de gas a Espanya. A la casa natal de [Roura]_3, a [la plaça]_1, s'[hi]_1 va instal·lar un fanal antic de gas.

Using the open-close notation from the task datasets:

la [...] (1
plaça [...] 1)

Mentions consisting of a single token show the entity ID within parentheses:

Roura [...] (3)

Tokens belonging to more than one mention separate the respective entity annotations with "|":

La [...] (1
remodelada [...]
plaça [...]
del [...]
Mercat [...] (2)|1)
Since the two mentions "la plaça" and "hi" corefer with "La remodelada plaça del Mercat", the last column shows the same entity ID for both of them.

la [...] (1
plaça [...] 1)
[...]
hi [...] (1)
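To make the bracket notation concrete, the sketch below turns the last-column values of one sentence into mention spans grouped by entity ID. It is an illustration under stated assumptions, not an official tool: the name `decode_coref` is hypothetical, "_" stands in for tokens outside any mention (the examples above do not show how such tokens are marked), and same-entity mentions are assumed to nest properly.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# One annotation piece: optional "(", the entity ID, optional ")".
PART = re.compile(r"^(\()?(\d+)(\))?$")


def decode_coref(column: List[str]) -> Dict[int, List[Tuple[int, int]]]:
    """Map each entity ID to its (first_token, last_token) index pairs."""
    open_starts: Dict[int, List[int]] = defaultdict(list)
    mentions: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for index, cell in enumerate(column):
        # "|" separates the annotations of a token that is in several mentions.
        for part in cell.split("|"):
            match = PART.match(part)
            if match is None:
                continue  # assumption: anything else (e.g. "_") is no mention
            opens, entity, closes = match.group(1), int(match.group(2)), match.group(3)
            if opens and closes:      # "(3)": single-token mention
                mentions[entity].append((index, index))
            elif opens:               # "(1": mention starts at this token
                open_starts[entity].append(index)
            elif closes:              # "1)": mention ends at this token
                # Raises IndexError if the closing has no matching opening.
                start = open_starts[entity].pop()
                mentions[entity].append((start, index))
    return dict(mentions)


# Last-column values for "La remodelada plaça del Mercat" from the example.
column = ["(1", "_", "_", "_", "(2)|1)"]
print(sorted(decode_coref(column).items()))
# [(1, [(0, 4)]), (2, [(4, 4)])]
```

Token indices are zero-based within the sentence, so entity 1 spans all five tokens and entity 2 covers only "Mercat".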
For any queries, comments or feedback regarding the data sets, please post in the forum.