Chapter 2 Data Overview

2.1 Linguistic and Structural Annotation

2.1.1 Linguistic Annotation

The linguistic annotation described earlier is part of the corpus as so-called positional attributes (p-attributes). The following table provides short explanations of the p-attributes in the MigParl corpus.

In the so-called token stream the linguistic annotation looks like this:

2.1.2 Structural Annotation (Metadata)

In the XML/TEI data format, all passages of uninterrupted speech are tagged with metadata, the so-called structural attributes (s-attributes). This structure facilitates the creation of subcorpora. For instance, parliamentary speeches are often interrupted by interjections; the corpus records whether an utterance is an interjection or an actual speech. The legislative period, session, date, name of the speaker and the speaker's party are included, among others. The structural annotation is the basis for all kinds of diachronic or synchronic comparisons users may want to perform.
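How s-attributes support the creation of subcorpora can be sketched in a few lines of Python. This is a minimal illustration only, not the tooling used for MigParl: the Speech record, the subcorpus helper and the attribute name interjection are hypothetical, and the sample values are invented placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Speech:
    # Hypothetical record: one passage of uninterrupted speech with its s-attributes.
    speaker: str
    party: str
    date: str              # ISO date string, e.g. "2015-03-12"
    interjection: bool     # assumed attribute name; True if the utterance is an interjection
    tokens: Tuple[str, ...]


def subcorpus(speeches, **conditions):
    """Return the speeches whose s-attributes match all given conditions."""
    return [
        s for s in speeches
        if all(getattr(s, key) == value for key, value in conditions.items())
    ]


# Invented placeholder data, not taken from the corpus.
speeches = [
    Speech("Speaker A", "Party X", "2015-03-12", False, ("Sehr", "geehrte", "Damen")),
    Speech("Speaker B", "Party Y", "2015-03-12", True, ("Hört", "hört")),
    Speech("Speaker A", "Party X", "2016-01-20", False, ("Vielen", "Dank")),
]

# Keep actual speeches only, dropping interjections.
actual_speeches = subcorpus(speeches, interjection=False)

# Combine several s-attributes to narrow the selection further.
party_x_2016 = subcorpus(speeches, party="Party X", date="2016-01-20")
```

Each keyword argument acts as one filter on an s-attribute, so conditions can be combined freely to define the subcorpus needed for a comparison.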

The following table provides short explanations of the s-attributes which are present in the MigParl corpus.

2.2 Data Report

In the following, we report on further attributes of the corpus.

2.2.1 Size and Time

The size of the entire corpus is about 51.47 million tokens. The corpus covers the time between 2000-01-19 and 2018-12-20. The date of each speech can be addressed with the structural attribute date, which has no missing values. We also provide the attribute year.

The following table and visualization show the temporal distribution of tokens over the corpus.

To provide comparability with other resources of the MigTex project (in particular MigPress), we added a structural attribute, calendar_week, which derives calendar weeks (following ISO 8601) from dates. This is visualized in the following graph.
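Deriving an ISO 8601 calendar week from a date can be sketched as follows. The function name and the "YYYY-Www" output format are assumptions made for illustration; the actual format of the calendar_week values in the corpus may differ.

```python
from datetime import date


def calendar_week(iso_date: str) -> str:
    """Derive an ISO 8601 week label from a date string such as "2000-01-19".

    The "YYYY-Www" output format is an assumption for this sketch.
    """
    d = date.fromisoformat(iso_date)
    # isocalendar() yields (ISO year, ISO week number, ISO weekday).
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


first_week = calendar_week("2000-01-19")   # first date covered by the corpus
last_week = calendar_week("2018-12-20")    # last date covered by the corpus
```

Note that the ISO year can differ from the calendar year around the turn of the year: 2018-12-31, for example, falls into week 1 of ISO year 2019. Pairing the week number with the ISO year rather than with the attribute year avoids mislabelled weeks.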

2.2.2 Parties

There are 31 parties in the entire corpus. They can be selected via the structural attribute party. This includes 48 speakers with the party assignment “NA”; these are governmental actors for whom no party information was available. The following table shows which parties occur in which regional state. When multiple parties are separated by a vertical bar (“|”), the speaker changed party membership within a legislative period.
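When working with the party attribute, values containing a vertical bar may need to be split into the individual parties. A minimal sketch, assuming the value is a plain string; the function name and the placeholder party names are hypothetical.

```python
from typing import List


def split_parties(party_value: str) -> List[str]:
    """Split a party value on the vertical bar and drop empty parts."""
    return [part.strip() for part in party_value.split("|") if part.strip()]


# Placeholder names, not actual values from the corpus.
changed = split_parties("Party A|Party B")   # speaker changed party membership
single = split_parties("Party A")            # speaker with one party
unknown = split_parties("NA")                # party assignment "NA" is kept as-is
```

Whether a speaker with several parties should count towards each party or be treated separately depends on the analysis; the sketch only makes the individual memberships accessible.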

The following visualization shows the absolute number of tokens uttered by each party for parties with more than 100,000 tokens.
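The aggregation behind such an overview can be sketched as follows: token counts are summed per party, and only parties above the threshold of 100,000 tokens are kept. The input format (pairs of party and token count per speech) and the function name are assumptions, and the numbers are invented placeholders.

```python
from collections import Counter


def tokens_per_party(speeches, min_tokens=100_000):
    """Sum token counts per party and keep parties with more than min_tokens."""
    counts = Counter()
    for party, n_tokens in speeches:
        counts[party] += n_tokens
    # most_common() sorts by descending count, which suits a bar chart.
    return {party: n for party, n in counts.most_common() if n > min_tokens}


# Invented placeholder data: (party, number of tokens in one speech).
sample = [
    ("Party A", 80_000),
    ("Party B", 120_000),
    ("Party A", 30_000),
    ("Party C", 5_000),
]

large_parties = tokens_per_party(sample)
```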

2.2.3 Role

There are three different roles a speaker can occupy: governmental actor (role = “government”), member of parliament (role = “mp”) and presidential speaker (role = “presidency”). The following table provides information about the distribution of roles in the corpus.

2.2.4 Speaker

The structural attribute speaker describes the individual speaker.

2.2.5 Regional State

The attribute regional_state indicates in which parliament the speech was given. States are identified by the two-letter codes of the ISO 3166-2 standard.

2.2.6 Sample Source of the Speeches

Speeches are sampled with two different approaches, which are described in detail in the Selection Strategy part of the MigParl documentation. Here, the consequences of these sampling techniques are illustrated in a single graph: the year overview shown earlier is modified to account for the origin of each speech.