3 Corpus Preparation

The general preparation workflow of the corpus has been described in Blätte, Rakers, and Leonhardt (2022). It follows the established triad of “Preprocessing”, “XMLification” and “Consolidation”, which was also used for the previous version of the corpus (Blätte and Blessing 2018). These three steps are described in more detail in the following sections.

3.1 Data Collection

For the period of time not already covered by the existing GermaParl data, raw data was downloaded from the Open Data website of the German Bundestag. For all these legislative periods, the data was downloaded on 2023-02-24 and 2023-05-16 in the XML format provided by the German Bundestag. For a list of the resources used, including the precise download source, see the section on data sources.

Until the 18th legislative period, the entire text of one protocol is encoded in a single text node. Metadata is added as a rudimentary XML header containing some document-level metadata such as the date, legislative period and session number of the meeting. This means that the initial two-column layout of the PDF version of the raw data is already resolved and margin columns are already removed. However, header lines (containing the page number, among other things) are still part of the text and must be removed. Crucially, these XML files offer no structural annotation apart from the very limited document-level metadata.

For the period of time covered by the previous version of GermaParl, the same raw data as in its initial preparation is used. More precisely, this comprises txt files from the 86th session of the 13th legislative period until the 210th session of the 18th legislative period which have been downloaded from the archives of the German Bundestag along with the remainder of the 18th legislative period. As already mentioned in Blätte and Blessing (2018), these txt files were not available for all sessions in this period. To fill a gap between the end of the 16th and the beginning of the 17th legislative period, PDF files were retrieved from the Bundestag.

Beginning with the 19th legislative period, the Bundestag offers a thoroughly structurally annotated XML format. This necessitates a different data preparation approach, which is described in section 3.5. The data preparation for the earlier legislative periods is presented in the following.

3.2 Preprocessing

For each legislative period, the first step is to extract the actual content from the documents. This involves removing both elements of the retrieved text which are not actually part of the parliamentary proceedings, such as the table of contents or appendices, and other layout elements which do not represent speeches.

In addition, some XML files actually contain more than one session. This can occur when two sessions take place on the same day, e.g. when a session is interrupted and then continued later on the same day. For example, the name of the XML file “01020.xml” provided by the German Bundestag suggests that it contains the 20th session of the first legislative period. This is also the information provided in the metadata stored in the header of the XML. However, on closer inspection, the file also contains the 21st session of the first legislative period, which was held on the same day. Unfortunately, the next file provided by the Bundestag, “01021.xml”, also contains both the 20th and the 21st session of the first legislative period, while the header indicates the 21st session only. Integrating both files as provided by the German Bundestag and using only the metadata to distinguish between different sessions would thus lead to duplicate sessions in the final corpus. To address this, such occurrences were identified manually. Then, the first file of each pair was split to separate the two sessions contained in the file. This has been done for the following XML files provided by the German Bundestag:

  • “01020.xml”
  • “01041.xml”
  • “01075.xml”
  • “01149.xml”
  • “01223.xml”
  • “01280.xml”
  • “02219.xml”
  • “02225.xml”
  • “03097.xml”
  • “04066.xml”
  • “04087.xml”
  • “04112.xml”
  • “05047.xml”
  • “05232.xml”
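The splitting step can be sketched as follows. This is a minimal Python illustration and not the project's own code: it assumes, hypothetically, that every session in the raw text starts with a heading line such as "20. Sitzung". The pattern, the function name and the sample text are invented for illustration, and the files listed above were identified manually.

```python
import re

# Assumption: a session begins with a heading line such as "20. Sitzung".
SESSION_HEADING = re.compile(r"^[ \t]*(\d+)\.[ \t]+Sitzung\b", re.MULTILINE)

def split_sessions(text):
    """Split one protocol text into a dict {session_number: text}."""
    starts = [(m.start(), int(m.group(1))) for m in SESSION_HEADING.finditer(text)]
    chunks = {}
    for i, (pos, number) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else len(text)
        # Keep only the first chunk found for each session number.
        chunks.setdefault(number, text[pos:end].strip())
    return chunks

# Invented toy input standing in for a file that holds two sessions.
sample = "20. Sitzung\nText A\n21. Sitzung\nText B\n"
sessions = split_sessions(sample)
```

Each chunk could then be written to its own file so that every session occurs only once in the corpus.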

After this split, the table of contents, appendices and header lines are removed from each protocol where possible. In addition, expressions in parentheses are moved to a separate line to make the identification of interjections and other nodes easier. Words split by line breaks are joined where possible by means of heuristics. Finally, some minor adjustments were made to correct typos or apparent OCR errors which would otherwise hinder the XMLification. Some more comprehensive modifications were made automatically and on a file-by-file basis to account for speeches occurring twice in the raw data.
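Two of these steps, joining words split by line breaks and moving parenthesised expressions to their own line, can be sketched in Python as follows. The heuristics and function names are assumptions made for illustration and are not the rules used in the actual pipeline. The sketch also drops empty lines and does not protect speaker calls, which a real workflow would have to handle.

```python
import re

def join_hyphenated(text):
    """Join words split across a line break, e.g. 'rich-' + 'tig' -> 'richtig'.

    Heuristic (assumption): only join when a lowercase letter follows the
    break, so that compounds such as 'CDU-' + 'Fraktion' keep their hyphen.
    Cases like 'Bundes-' + 'und' would still be joined incorrectly.
    """
    return re.sub(r"(\w)-\n([a-zäöüß])", r"\1\2", text)

def isolate_parentheses(text):
    """Move every parenthesised expression to a line of its own."""
    text = re.sub(r"[ \t]*(\([^()\n]*\))[ \t]*", r"\n\1\n", text)
    # Drop the empty lines the substitution may have produced.
    return "\n".join(line for line in text.splitlines() if line.strip())

raw = "Das ist rich-\ntig. (Beifall bei der SPD) Weiter im Text."
cleaned = isolate_parentheses(join_hyphenated(raw))
```

Interjections such as "(Beifall bei der SPD)" then always start at the beginning of a line, which simplifies the patterns needed later.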

Aside from the fact that they had been downloaded before, the decision to use plain text files for the period covered by GermaParl v1 was informed by data quality: The txt files used there are of good quality aside from a minor issue which is discussed below. In contrast to the XML files offered by the German Bundestag for the same period, these txt files also retain the paragraph structure of the PDF files more often, which makes the detection of interjections in particular easier. The txt files were taken, without any major cleaning, from the repository in which they had been collected as raw data during the previous preparation process. The aforementioned minor issue concerned a number of different encoding issues which were examined carefully. Since txt files are not available for the earlier periods, they were not an option there.

For the gap in the txt files described earlier, PDF files were used. This is quite feasible for these more recent periods because, compared with older PDF files, the layout of these newer files is very consistent. One remaining challenge is the two-column layout of the files, which has to be resolved. This is done using the trickypdf R package, which has been developed in the context of the PolMine project. It must be noted that trickypdf depends on the poppler utilities to extract data from the PDF files, so different versions of trickypdf and poppler influence the outcome of this extraction. Such variations were observed especially regarding whitespace. Consequently, the preprocessing of those protocols based on PDF also comprises the removal of additional whitespace and line breaks.
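The whitespace clean-up can be illustrated with a short Python sketch. The concrete rules below (collapsing runs of spaces, trimming spaces around line breaks, allowing at most one blank line) are assumptions and not necessarily the rules applied in the actual workflow.

```python
import re

def normalize_whitespace(text):
    """Remove surplus whitespace and line breaks from extracted PDF text."""
    text = text.replace("\u00a0", " ")        # non-breaking space -> space
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)      # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)    # keep at most one blank line
    return text.strip()

example = normalize_whitespace("Herr  Präsident! \n\n\n\n Meine   Damen")
```

Normalising in this way makes the result less dependent on the installed poppler version.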

The protocols of the 19th legislative period followed a separate processing pipeline and were not preprocessed in the same sense as the raw data which does not contain any structural annotations. The preparation of the most recent protocols is discussed in section 3.5.

3.3 XMLification

During XMLification, the plain text is turned into a structured XML format. Using a battery of regular expressions, the beginning of speeches as well as interjections are identified. This is based on the assumption that plenary protocols by design exhibit regular patterns such as

“Name (Parliamentary Group):”

which can be used for these purposes. From these speaker calls, the role and – if applicable – the affiliation of a speaker to a parliamentary group can be extracted. Metadata such as the date, the year, the legislative period as well as the session number can be directly retrieved from the original XML files. For those protocols which are prepared based on txt and PDF files, the metadata is also retrieved from these XML files.
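A single speaker-call pattern of this kind can be sketched in Python as follows. The regular expression is a simplified assumption and covers only the "Name (Parliamentary Group):" form; the actual battery of expressions is considerably larger.

```python
import re

# Simplified assumption: a speaker call is a line of the form
# "Name (Parliamentary Group):".
SPEAKER_CALL = re.compile(r"^(?P<name>[^:\n]+?)[ \t]+\((?P<group>[^()]+)\):\s*$")

def parse_speaker_call(line):
    """Return name and parliamentary group, or None if the line is no call."""
    match = SPEAKER_CALL.match(line.strip())
    return match.groupdict() if match else None

call = parse_speaker_call("Dr. Adenauer (CDU/CSU):")
```

A local reference stays part of the name in this sketch, so "Müller (Berlin) (SPD):" yields the name "Müller (Berlin)".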

Given the long period of time covered by the corpus, the workflow has to take temporal variations and changing conventions into account. For example, the full name of a speaker is only used in the proceedings from the 12th legislative period onwards. As a consequence, the regular expressions inevitably produce some mismatches. This is addressed by a list of known mismatches which are excluded from the list of matches the regular expressions generate. This process is used for speakers as well as interjections.
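The exclusion of known mismatches can be sketched as a simple filter. The pattern and the entry in the mismatch list are invented for illustration; the actual list is curated manually.

```python
import re

CANDIDATE = re.compile(
    r"^(?P<who>[^:\n]+?)[ \t]+\((?P<group>[^()]+)\):", re.MULTILINE
)

# Hypothetical curated list of strings that only look like speaker calls.
KNOWN_MISMATCHES = {"Ich zitiere (wörtlich):"}

def find_speaker_calls(text, mismatches=KNOWN_MISMATCHES):
    """Return (who, group) pairs, skipping matches listed as mismatches."""
    calls = []
    for match in CANDIDATE.finditer(text):
        if match.group(0) in mismatches:
            continue  # known false positive
        calls.append((match.group("who"), match.group("group")))
    return calls

text = "Schmidt (SPD): Herr Präsident!\nIch zitiere (wörtlich): Das geht nicht.\n"
found = find_speaker_calls(text)
```

Keeping the mismatches in a separate list leaves the patterns themselves simple and makes each exclusion traceable.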

Both the battery of initial regular expressions to detect speakers and interjections and the list of mismatches are developed in an iterative fashion by repeatedly checking the results on a sample of the data.

3.4 Consolidation

After the XMLification step, the data is available in a structurally annotated form on an elementary level. However, up to this point, only information which is already part of the protocols has been used. Aside from the metadata of the protocols (legislative period, session number, date), each speaker is annotated with family name (legislative periods 1 to 11), first name and family name (legislative periods 12 to 19), possibly a local specification (in case of ambiguity of a family name, in legislative periods 1 to 19) and, if applicable, affiliation to a parliamentary group (legislative periods 1 to 19). The role of a speaker can be derived from the speaker call itself (member of parliament, governmental actor, presidential actor or other).

In consequence, the information about the full name of a speaker is missing for legislative periods 1 to 11. Furthermore, the information about the party affiliation of a speaker is missing in the protocols themselves for all legislative periods. To add this information, external resources must be used.

However, before speakers can be consolidated and information can be added, some flaws or deviating information in the original data must be remedied. These mainly comprise typos or apparent OCR errors such as additional white space or punctuation marks. On some occasions, the parliamentary group of a speaker is obviously mislabeled. This is corrected. Also, in some instances, the local reference of a speaker is only added to the speaker name during the legislative period, for example if a speaker with the same family name joins the Bundestag over the course of a legislative period. In other cases, one speaker would remain without a local specification and every additional speaker with the same family name would be disambiguated with a local reference. In some instances, the best way to guarantee that the correct speaker is matched and the correct information is added to the correct speaker is by adding local references. These interventions were realized by using manually curated replacement lists and scripts to make the process repeatable.

During the preparation process, these interventions are documented in a CSV file. The most crucial intervention is a change to the who attribute, i.e. the element of the speaker call in the original protocol that was used to identify who is actually speaking. For this reason, it is also recorded in the final XML: the original value of who, as it was before any modification, is stored in a separate attribute called who_original.
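The handling of who and who_original can be sketched as follows. The element name speaker, the replacement entry and the log format are assumptions made for illustration; only the attribute names who and who_original are taken from the text. The sketch stores who_original for every speaker node, which may differ from the actual implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical replacement list: raw "who" value -> corrected value.
WHO_REPLACEMENTS = {"Dr. Adenauer,": "Dr. Adenauer"}

def apply_who_replacements(root, replacements=WHO_REPLACEMENTS):
    """Correct the who attribute and keep the raw value in who_original."""
    log = []
    for node in root.iter("speaker"):
        raw = node.get("who")
        if raw is None:
            continue
        node.set("who_original", raw)
        fixed = replacements.get(raw, raw)
        if fixed != raw:
            node.set("who", fixed)
            log.append({"who_original": raw, "who": fixed})
    return log

root = ET.fromstring('<doc><speaker who="Dr. Adenauer,"/><speaker who="Schmidt"/></doc>')
log = apply_who_replacements(root)
```

The returned log could be written to a CSV file to document the interventions.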

Figure 3.1 illustrates the changes made to the raw data before the consolidation of speaker attributes is performed.

Figure 3.1: Pre-Consolidation adjustments

3.4.1 External Data Sources

To add information about speakers to the corpus, the PolMine project relies on Wikipedia. For this use case, information for most actors is available on Wikipedia. However, for members of parliament, the so-called Stammdaten file of the German Bundestag can be used as well. This file offers a wide variety of additional information for each member of parliament, such as the full name and party affiliation. The data is provided as XML and is available on the website of the German Bundestag.

This data comes with three caveats: The first is the aforementioned limitation in coverage. Only members of parliament are part of the data. Governmental actors who are not also members of parliament are missing from the data. The second limitation is that some parts of the “biographical data” are static. This concerns aspects which might change, such as the party affiliation or the gender of a speaker. Third, there are some rather specific deviations between the information annotated in the initial protocols and the external data sources.

These three caveats are addressed as follows: The data of the Stammdaten is used for members of parliament, while other speakers are covered with Wikipedia data. To allow for the assignment of time-specific party affiliations, lists from Wikipedia (see for example the list of members of parliament in the first legislative period of the German Bundestag on Wikipedia) are used to add one party affiliation per legislative period to each speaker. While this does not account for speakers leaving or switching parties within legislative periods, it seems to be the finest granularity achievable with data available in a structured format. The third limitation is addressed by allowing for a less strict, or fuzzier, matching of attributes. For example, some speakers who are annotated as being affiliated to the parliamentary group of DP in the parliamentary proceedings are annotated as belonging to the DP/FVP parliamentary group in the external data. In these cases, it is also possible that a speaker changed the parliamentary group over the course of a legislative period. Here, the proceedings often document the more recent information when compared to the static external information.
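One conceivable form of such a relaxed comparison is sketched below in Python. The rule (two labels are compatible if they share at least one component separated by a slash) is an assumption for illustration; the text does not specify how the less strict matching is implemented.

```python
def groups_compatible(protocol_group, external_group):
    """Treat two parliamentary group labels as compatible if they share a part.

    Assumption: composite labels use "/" as separator, as in "DP/FVP".
    """
    a = {part.strip().lower() for part in protocol_group.split("/")}
    b = {part.strip().lower() for part in external_group.split("/")}
    return bool(a & b)

dp_matches = groups_compatible("DP", "DP/FVP")
```

A rule this permissive will also accept pairs that should stay distinct, so it would only be applied to candidates that already agree on name and legislative period.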

For speakers who are not members of parliament, this data is mainly gathered from Wikipedia. Governmental actors, for example, can be enriched via individual government or cabinet pages on Wikipedia. For some speakers, the biographical lexicon Munzinger was a useful resource to identify full names or party affiliations.

Table 3.1 lists the source of each attribute. Attributes marked with ‘protocol metadata’ are directly derived from the XML version of the initial protocols. ‘Derived from protocol (regex)’ indicates that the information has been extracted from the protocol’s text using regular expressions. As described earlier, the party affiliation is for the most part taken from Wikipedia pages specific to individual legislative periods.

Table 3.1: Sources of structural attributes of the GermaParl corpus

s-attribute          source
protocol_lp          protocol metadata
protocol_no          protocol metadata
protocol_date        protocol metadata
protocol_year        protocol metadata
protocol_url         recorded during download
protocol_filetype    recorded during download
speaker_who          derived from protocol (regex)
speaker_name         based on “who”, consolidated with Stammdaten and/or Wikipedia or Munzinger
speaker_parlgroup    derived from protocol (regex), possibly consolidated
speaker_party        Wikipedia or Munzinger
speaker_role         derived from protocol (regex)
ne                   linguistically annotated
ne_type              linguistically annotated
p                    derived from protocol (heuristics)
p_type               derived from protocol (regex), based on p
s                    linguistically annotated

This process comes with the caveat that the information about a speaker’s party affiliation is not always recorded at the same point in time. In the Wikipedia data, the party affiliation of a speaker at the start of a legislative period is documented in some cases, while in other cases the party affiliation at the end of the legislative period is used. The respective tables on Wikipedia often contain additional information about members leaving or joining parties in the last column. However, this information is not yet retrieved automatically. This could be implemented later on.

3.4.2 Matching Speaker Nodes and External Data

Both the XMLified protocol data and the external data can be perceived as tabular data. The consolidation and enrichment stage of the preparation pipeline thus constitutes a matching between the speaker nodes (initially containing the information about the ‘who’, i.e. the raw, non-consolidated speaker name, potentially a speaker’s affiliation to a parliamentary group, the speaker’s role and, taken from the document-level metadata, the legislative period) and the external data (which contains the full name, potentially the parliamentary group, the legislative period and, derived from the way it was collected, the role of the speaker as well as the information that should be added).
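Viewed as tabular data, the matching amounts to a join on the shared columns. The following Python sketch uses invented column names and fictional toy records (they are not taken from the corpus or the external data) and enriches a speaker only if exactly one external row matches.

```python
def match_speaker(speaker, external_rows, keys=("family_name", "lp", "role")):
    """Return all external rows that agree with the speaker node on `keys`."""
    return [row for row in external_rows
            if all(row.get(k) == speaker.get(k) for k in keys)]

def enrich(speaker, external_rows):
    """Add full name and party if exactly one external row matches."""
    candidates = match_speaker(speaker, external_rows)
    if len(candidates) != 1:
        return speaker  # unmatched or ambiguous: needs further disambiguation
    match = candidates[0]
    return {**speaker, "name": match["name"], "party": match["party"]}

# Fictional toy records, for illustration only.
external = [
    {"family_name": "Mustermann", "lp": 8, "role": "mp",
     "name": "Erika Mustermann", "party": "SPD"},
    {"family_name": "Beispiel", "lp": 8, "role": "mp",
     "name": "Max Beispiel", "party": "CDU"},
    {"family_name": "Beispiel", "lp": 8, "role": "mp",
     "name": "Moritz Beispiel", "party": "FDP"},
]
unique = enrich({"family_name": "Mustermann", "lp": 8, "role": "mp"}, external)
ambiguous = enrich({"family_name": "Beispiel", "lp": 8, "role": "mp"}, external)
```

Speakers with several candidates are deliberately left unchanged so that they can be resolved with additional attributes.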

There are scenarios in which this information is still not sufficient to unequivocally disambiguate speakers. In these scenarios, more than one entry in the external data is a reasonable match for a speaker in the protocol data and thus a candidate to draw additional information from. In these cases, three additional sources of distinct speaker-level information were retrieved from the protocols: The named gender of the speaker (identified by testing whether the term “Frau” was found at the beginning of the speaker call), a potentially indicated local reference after the speaker name (such as “Müller (Berlin)”) and the position of governmental actors. Both the gender of a speaker and the local reference (mostly indicating the electoral district) can also be found in the Stammdaten. In some instances, the latter does not fully correspond to the information found in the protocols and was adjusted or added in the Stammdaten file. For a single governmental speaker, it was necessary to compare the position within the cabinet with the same variable in the external data to sufficiently disambiguate the speaker from a colleague in the same cabinet.
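Extracting the gender marker and the local reference from a speaker call, and using them to narrow a list of candidates, can be sketched as follows. The pattern, the field names and the candidate records are assumptions for illustration. The sketch expects the name part of the call without the parliamentary group.

```python
import re

WHO = re.compile(
    r"^(?P<frau>Frau\s+)?"             # optional "Frau" at the beginning
    r"(?P<name>[^()]+?)"               # speaker name
    r"(?:\s+\((?P<place>[^()]+)\))?$"  # optional local reference
)

def describe_who(who):
    """Extract gender marker and local reference, e.g. from 'Müller (Berlin)'."""
    match = WHO.match(who.strip())
    if match is None:
        return None
    return {
        "female_marker": bool(match.group("frau")),
        "name": match.group("name"),
        "local_reference": match.group("place"),
    }

def narrow(candidates, info):
    """Keep only candidates consistent with the extracted information."""
    if info["local_reference"]:
        candidates = [c for c in candidates if c.get("place") == info["local_reference"]]
    if info["female_marker"]:
        candidates = [c for c in candidates if c.get("gender") == "female"]
    return candidates

# Fictional candidate rows, for illustration only.
rows = [{"id": 1, "place": "Berlin", "gender": "male"},
        {"id": 2, "place": "Bonn", "gender": "male"}]
remaining = narrow(rows, describe_who("Müller (Berlin)"))
```

The absence of "Frau" is not used as evidence in this sketch, since it does not reliably indicate a male speaker.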

With this apparatus in place to compare both the protocol data and the external data, the speakers were consolidated – i.e. names were harmonized – and enriched – i.e. information was added.

3.4.3 Annotation of Agenda Items

While the previous version of GermaParl used regular expressions to identify agenda items, the updated version does not yet implement this approach for the period of time which is not already covered by the previous release. This is due to the rather high amount of manual intervention necessary to address false positives and false negatives for the additional 13 legislative periods added with GermaParl v2. An alternative strategy was used to add agenda items based on the similarity of sentences in utterances of presidential speakers to known agenda items of the 19th legislative period. For this, quanteda and quanteda.textstats (Benoit et al. 2018) were used to implement the similarity measure. This is a rather experimental approach and thus not used for the CWB version of the corpus.
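The idea behind the similarity-based strategy can be illustrated with a small Python sketch. It uses cosine similarity over word counts, which is an assumption: the text does not name the measure that was implemented with quanteda. The threshold and the example sentences are likewise invented.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    """Count lowercased word tokens."""
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity of two token count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_agenda_match(sentence, known_items, threshold=0.5):
    """Return the most similar known agenda sentence, or None below threshold."""
    vec = bag_of_words(sentence)
    score, item = max((cosine(vec, bag_of_words(k)), k) for k in known_items)
    return (item, score) if score >= threshold else (None, score)

known = ["Ich rufe den Tagesordnungspunkt 3 auf", "Ich eröffne die Aussprache"]
hit = best_agenda_match("Ich rufe Tagesordnungspunkt 5 auf", known)
miss = best_agenda_match("Vielen Dank, Herr Kollege", known)
```

Sentences of presidential speakers that exceed the threshold would be treated as the start of a new agenda item.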

3.4.4 Additional Remarks concerning the enhancement of speakers

There are some special speakers who occur rarely and need specific handling. For one, this concerns former presidents of Germany, or Bundespräsidenten, who occasionally speak after leaving office, in particular when their successor is sworn in. These speakers are annotated as Bundespräsidenten although they technically are not in office anymore. In addition, the names of speakers which contain the letter “ß” are sometimes written with “ss” in the protocols. For GermaParl v2, an attempt was made to harmonize these different spellings, although remaining variations are likely.
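Harmonising the “ß” and “ss” spellings can be sketched as a lookup with a normalised comparison key. The function names and the idea of taking the canonical spelling from a reference list are assumptions for illustration.

```python
def spelling_key(name):
    """Comparison key that ignores the difference between 'ß' and 'ss'."""
    return name.replace("ß", "ss").lower()

def harmonize(name, canonical_names):
    """Return the canonical spelling if one matches, else the name unchanged."""
    lookup = {spelling_key(c): c for c in canonical_names}
    return lookup.get(spelling_key(name), name)

harmonized = harmonize("Strauss", ["Strauß"])
```

Names without a counterpart in the reference list are left untouched, which is one reason why some variation can remain.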

3.5 The 19th Legislative Period as a Special Case

The data of the 19th legislative period is already thoroughly structurally annotated. Accordingly, the task was not to identify speakers but to transform the data provided by the Bundestag into the target format. This mainly comprised separating procedural utterances of presidential speakers, which are often annotated as part of an actual speech by another actor. It is also necessary to restructure debates based on agenda items. For these purposes, an XML parser was written in R.
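The separation of presidential utterances can be illustrated with a Python sketch. It is not the R parser mentioned above, and it works on a strongly simplified, assumed structure: a rede element whose name children mark a presiding officer taking the floor and whose p children with klasse="redner" mark the speaker. The sample input is fictional.

```python
import xml.etree.ElementTree as ET

def split_speech(rede):
    """Split one <rede> element into (speaker, paragraphs) segments."""
    segments = []
    for child in rede:
        text = "".join(child.itertext()).strip()
        if child.tag == "name" or child.get("klasse") == "redner":
            # A new speaker call starts a new segment.
            segments.append((text.rstrip(":"), []))
        elif child.tag == "p" and segments:
            segments[-1][1].append(text)
    return segments

# Fictional, simplified input for illustration.
rede = ET.fromstring(
    '<rede>'
    '<p klasse="redner">Erika Mustermann (SPD):</p>'
    '<p>Sehr geehrte Damen und Herren!</p>'
    '<name>Präsident Max Beispiel:</name>'
    '<p>Kommen Sie bitte zum Schluss.</p>'
    '<p klasse="redner">Erika Mustermann (SPD):</p>'
    '<p>Ich komme zum Schluss.</p>'
    '</rede>'
)
segments = split_speech(rede)
```

Each segment can then be written as a separate speaker node in the target format.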

In addition, occasional errors in the data were remedied. This, for instance, concerned XML nodes which were left empty unintentionally in the original files. In one instance, a qualitative check revealed that the affiliation of a speaker to party and parliamentary group did not correspond to the information found on Wikipedia. This was adjusted accordingly.[1]

The consolidation steps follow the previous approach, albeit with fewer concerns about the quality of the raw data than in previous legislative periods.

3.6 Linguistic Annotation

After the structural annotation, the text is linguistically annotated. The main tool for this step is Stanford CoreNLP (Manning et al. 2014; version 4.5.0 for the current beta version of the corpus). To make use of the full potential of parallelization of the Java version of Stanford CoreNLP from within R, the R package bignlp (Blätte and Leonhardt 2022) was written in the context of the PolMine project. The default German model was used to tokenize the text, split the tokens into sentences, add UD Part-of-Speech tags and identify Named Entities. As the German model for Stanford CoreNLP does not offer lemmatization or language-specific Part-of-Speech tags such as the Stuttgart-Tübingen-Tagset (STTS), we use TreeTagger to add these annotation layers to the final CWB corpus (Schmid 1994).

3.7 CWBification

The output of the linguistic annotation is a vertical XML format which can be imported into the Corpus Workbench using the R package cwbtools (Blätte 2022a). cwbtools has been developed in the context of the PolMine project to manage and index corpora. During this import, some additional harmonization of the names of parliamentary groups and parties is performed. This concerns variations of the same parliamentary group such as “PDS”, “Gruppe der PDS” and “PDS/Linke Liste” which could be considered the same organization. In the TEI/XML version, this is only done with regard to spelling variations. This difference might be resolved in future releases.
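Such a harmonization can be sketched as a lookup table. The choice of “PDS” as the target label is an assumption for illustration, since the text does not state which label is used in the corpus, and the sketch is written in Python rather than with cwbtools.

```python
# Assumption: variants are mapped to the label "PDS".
PARLGROUP_ALIASES = {
    "Gruppe der PDS": "PDS",
    "PDS/Linke Liste": "PDS",
}

def harmonize_parlgroup(label, aliases=PARLGROUP_ALIASES):
    """Map known variants of a parliamentary group name to one label."""
    label = " ".join(label.split())  # normalise internal whitespace first
    return aliases.get(label, label)

labels = [harmonize_parlgroup(x) for x in ("PDS", "Gruppe der PDS", "PDS/Linke Liste")]
```

Labels that are not listed pass through unchanged, so the table only needs to contain the known variants.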


  1. This was observed for Uwe Kamann (https://de.wikipedia.org/w/index.php?title=Uwe_Kamann&oldid=215956829) (last accessed on 2023-05-22).