This repository contains the data and supplementary materials for Task 2 of the
shared task co-located with KONVENS 2023.
The Shared Task competition is run on CodaLab.
- February 2023 - Trial data release
- April 1, 2023 - Training and development data release
- June 8, 2023 - Evaluation enabled on the dev set
- June 15, 2023 - Test data release (blind)
- July 1, 2023 - Submissions open
- July 31, 2023 - Submissions close
- August 14, 2023 - System descriptions due
- September 7, 2023 - Camera-ready system paper deadline
- September 18-22, 2023 - Workshop at KONVENS 2023
The data is available as JSON files, where each document (news article) is an individual JSON file.
Each JSON file has the following format:
{
    "Annotations": [
        {
            "Addr": [],
            "Cue": ["0:1"],
            "Form": "Indirect",
            "Frame": ["0:0", "0:1"],
            "HasNested": false,
            "IsNested": false,
            "Message": ["0:2", "0:3", "0:4", "0:5"],
            "STWR": "Speech",
            "Source": ["0:0"]
        }
    ],
    "DocumentName": "Title of the document",
    "Sentences": [
        {
            "SentenceId": 0,
            "Tokens": ["They", "said", "this", "is", "a", "sentence", "."]
        }
    ]
}
where

- Addr, Cue, Frame, Message and Source are arrays of SentenceID:TokenID offsets.
- Form describes the form of the message (what kind of quote it is); one of Direct|Indirect|FreeIndirect|IndirectFreeIndirect|Reported.
- STWR describes how the message was uttered; one of Speech|Thought|Writing|ST|SW|TW. ST is Speech+Thought, SW is Speech+Writing, TW is Thought+Writing.
- HasNested is a boolean indicating whether another annotation is nested within this annotation.
- IsNested is a boolean indicating whether this annotation is nested within some other annotation. Subtask 2 (simplified) only considers annotations with "IsNested": false.
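
The field descriptions above can be illustrated with a minimal Python sketch that resolves SentenceID:TokenID offsets to token text and keeps only non-nested annotations. The helper names (`parse_offset`, `resolve_span`, `top_level_annotations`) are hypothetical and are not taken from the starter kit. The sketch also assumes that TokenID is a zero-based index into the sentence's Tokens array, which matches the example document above but is not stated explicitly.

```python
from typing import List, Tuple

# Annotation roles that hold lists of "SentenceID:TokenID" offsets.
ROLES = ("Addr", "Cue", "Frame", "Message", "Source")


def parse_offset(offset: str) -> Tuple[int, int]:
    """Split a "SentenceID:TokenID" string into two integers."""
    sentence_id, token_id = offset.split(":")
    return int(sentence_id), int(token_id)


def resolve_span(document: dict, offsets: List[str]) -> List[str]:
    """Return the token text for each offset of one role."""
    # Index sentences by SentenceId instead of relying on list position.
    tokens_by_sentence = {s["SentenceId"]: s["Tokens"] for s in document["Sentences"]}
    resolved = []
    for offset in offsets:
        sentence_id, token_id = parse_offset(offset)
        # Assumption: TokenID is a zero-based index into "Tokens".
        resolved.append(tokens_by_sentence[sentence_id][token_id])
    return resolved


def top_level_annotations(document: dict) -> List[dict]:
    """Keep only annotations with "IsNested": false."""
    return [a for a in document["Annotations"] if not a["IsNested"]]


# The example document from the format description.
example = {
    "Annotations": [
        {
            "Addr": [],
            "Cue": ["0:1"],
            "Form": "Indirect",
            "Frame": ["0:0", "0:1"],
            "HasNested": False,
            "IsNested": False,
            "Message": ["0:2", "0:3", "0:4", "0:5"],
            "STWR": "Speech",
            "Source": ["0:0"],
        }
    ],
    "DocumentName": "Title of the document",
    "Sentences": [
        {"SentenceId": 0, "Tokens": ["They", "said", "this", "is", "a", "sentence", "."]}
    ],
}

for annotation in top_level_annotations(example):
    for role in ROLES:
        print(role, resolve_span(example, annotation[role]))
```

For the example document, the Cue offset resolves to the token "said" and the Source offset to "They".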
Trial data is available under data/trial in this repository.
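
As a rough sketch, all documents in a data folder could be loaded in Python as follows. The function names `load_documents` and `count_tokens` are hypothetical, and the sketch assumes that each document sits directly inside the folder as a file ending in .json; the starter kit may handle this differently.

```python
import json
from pathlib import Path
from typing import Dict


def load_documents(data_dir: str) -> Dict[str, dict]:
    """Read every *.json file directly inside data_dir, keyed by file name."""
    documents = {}
    # glob("*.json") is not recursive, so files in subfolders are skipped.
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            documents[path.name] = json.load(handle)
    return documents


def count_tokens(documents: Dict[str, dict]) -> int:
    """Sum the number of tokens over all sentences of all documents."""
    return sum(
        len(sentence["Tokens"])
        for document in documents.values()
        for sentence in document["Sentences"]
    )
```

With a local checkout, `load_documents("data/trial")` would return one dictionary per trial document, and `count_tokens` gives a quick sanity check on the amount of loaded text.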
Besides the machine-readable JSON, we also provide pretty-printed files (ending with .pretty.json in the folder data/trial/pretty) which contain additional information to make them easily human-readable (namely the text of each annotation in addition to the sentence/token offsets).
The data includes news articles from the German WIKINEWS, extracted from the articles XML dump. The entire dump from April 2022 consists of 13,001 published articles, from which we sampled 1,000 articles to annotate. These annotated articles contain almost 250,000 tokens.
The data is provided under the CC BY-NC-SA 4.0 license.
A code skeleton to read and write the data format can be found in the starter_kit directory.
The annotation guidelines (in German) can be found in the doc directory.