1 The GermaParl Corpus - An Overview
The availability and quality of data are crucial aspects of research. Available, high-quality data enables researchers to answer substantive research questions without having to invest a large amount of time in data collection. As the study of parliamentary debates becomes more common and increasingly advanced, the need for quality data is also becoming apparent in this field.
With GermaParl, the PolMine project strives to provide a high-quality, multi-purpose and evolving resource for the research of German parliamentary debate, aiming to contribute to the rich dataverse of parliamentary corpora. The corpus follows established international resources such as DutchParl (Marx and Schuth 2010) and EuroParl (Tiedemann 2012), two well-known resources which also served as a model for the name of the GermaParl corpus. The established version of GermaParl was developed by Andreas Blätte (2020) and was updated for this new release.
Covering about 72 years of parliamentary debate and comprising 4345 individual parliamentary sessions, GermaParl enables researchers to study the entire parliamentary discourse in the German Bundestag from 1949 to 2021. Containing 270 million tokens in total, the corpus is not only extensive in volume but also comprehensively annotated. Structural annotation layers facilitate the analysis of meaningful subsets of the corpus, allowing for comparisons between speakers, parties or legislative periods, to name just a few possibilities of synchronic and diachronic analysis. Linguistic annotation layers enable users of the data to create complex queries and to treat text as linguistic data without having to install additional NLP tools. With these features, we believe that GermaParl provides a useful contribution to the existing realm of prepared and machine-readable parliamentary data.
The remainder of this section provides a brief overview of the context in which GermaParl is developed and maintained as well as some additional introductory remarks about the resource. Other sections of this documentation contain an in-depth report of the data (Section 2), a presentation of the data preparation workflow (Section 3) as well as some more technical and future aspects of corpus development.
Note: This documentation will be provided as both a website and a PDF document, realized with the bookdown R package (Xie 2016). The PDF version of this documentation is currently under construction.
1.1 GermaParl in the Context of the PolMine Project
GermaParl is developed in the context of the PolMine project. The established version of the corpus, which covers the years 1996 to 2016, has been described by Blätte and Blessing (2018). A beta version of GermaParl v2 was presented in Blätte, Rakers, and Leonhardt (2022), in which broader aspects of the development philosophy of the resource were also discussed. In contrast, the documentation presented here provides an in-depth overview of existing attributes and sheds light on the more technical aspects of both the development and the structure of the resource.
1.2 Dissemination
The data format of a resource is often a first criterion for its usability. Currently, the final corpus is disseminated in two formats. Firstly, the corpus is provided as TEI/XML, a sustainable and interoperable structured format. To this end, the raw, mainly unstructured text data downloaded as PDF, XML and TXT from the website of the German Bundestag is turned into an XML format inspired by the standards of the Text Encoding Initiative (TEI). This process is facilitated by a reproducible workflow. The TEI/XML files structure the content of a protocol, providing information about speakers, parliamentary groups and what is said by which person in which session. The TEI/XML files are provided in the GermaParlTEI repository on GitHub.
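To illustrate how speaker information and speech text can be read from such a structure, the following is a minimal sketch in R using the xml2 package. The XML fragment is invented for illustration: `<sp>`, `<speaker>` and `<p>` are regular TEI elements, but the attribute names `who` and `parliamentary_group` as well as all values are assumptions and may differ from the actual GermaParl files (see Section 4 for the actual structure).

```r
library(xml2)

# Hypothetical, heavily simplified protocol fragment (not taken from GermaParl).
# The attribute names are assumptions made for this illustration only.
tei_snippet <- '<TEI>
  <text>
    <body>
      <sp who="Speaker A" parliamentary_group="Group X">
        <speaker>Speaker A (Group X):</speaker>
        <p>First paragraph of the first speech.</p>
        <p>Second paragraph of the first speech.</p>
      </sp>
      <sp who="Speaker B" parliamentary_group="Group Y">
        <speaker>Speaker B (Group Y):</speaker>
        <p>A short reply.</p>
      </sp>
    </body>
  </text>
</TEI>'

doc <- read_xml(tei_snippet)

# Each <sp> element holds one speech
speeches <- xml_find_all(doc, "//sp")

# Collect speaker, parliamentary group and the concatenated paragraphs per speech
speech_table <- data.frame(
  speaker = xml_attr(speeches, "who"),
  parliamentary_group = xml_attr(speeches, "parliamentary_group"),
  text = vapply(
    speeches,
    function(sp) paste(xml_text(xml_find_all(sp, "./p")), collapse = " "),
    character(1)
  ),
  stringsAsFactors = FALSE
)

print(speech_table)
```

The fragment above declares no XML namespace. If a file declares the TEI namespace, the XPath expressions need a namespace prefix, or the namespace can be removed first with `xml_ns_strip()`.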
The data is also provided in a linguistically annotated format which has been imported into the Corpus Workbench (CWB) (Evert and Hardie 2011). While the TEI/XML format can be used as an exchange format by more experienced users, the CWB version is potentially more accessible and thus an appropriate starting point for users who are not familiar with XML-based pipelines. This is also the version of the resource which includes additional linguistic annotation layers. When adding linguistic annotation to the structured text, some basic NLP tasks such as tokenization, part-of-speech tagging, lemmatization and named entity recognition are performed. The linguistically annotated data is then indexed and imported into the Corpus Workbench. During this process, some additional harmonization steps are performed to further consolidate the names of parliamentary groups and parties in order to increase the usability of the CWB corpus. This version of the corpus is distributed as a compressed archive of the binary corpus data, a so-called tarball, which is stored in the open online repository Zenodo. From there, it can be downloaded manually to be used, for example, in a compatible environment such as CQPweb.
In the PolMine project, the analysis environment polmineR is developed (Blätte 2022b). Implemented in the statistical programming language R, it provides a purpose-built solution for the analysis of large, CWB-indexed corpora. polmineR is designed to lower barriers for the analysis of large-scale, linguistically annotated corpora in a reproducible fashion. To further increase the ease of use, the corpora to be used with polmineR can be downloaded from within R using the package cwbtools. If not noted otherwise, the following descriptions refer to the CWB version of the corpus.
While the PolMine project has some experience with the preparation and dissemination of corpora in these two output formats, new developments should be considered. In particular, the effort to standardize parliamentary data from different countries and languages is a promising avenue of development in the field. The ParlaMint corpora (Erjavec et al. 2022) are a great showcase of the potential a shared encoding standard for parliamentary data can provide. As ParlaMint is a specification of the Parla-CLARIN TEI standard, an output in this specific XML format should be provided in the near future.
1.3 Installation - Using GermaParl with polmineR
As mentioned earlier, it is possible to retrieve the corpus manually and use it like any other Corpus Workbench resource, for example via the CWB’s own command line interface or graphical user interfaces like CQPweb. As a central use case, GermaParl was designed to work with the polmineR R package. To reduce barriers, only three lines of R code are needed to install the required packages and to download the corpus.
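The three lines themselves are not reproduced in this text. A minimal sketch of what they could look like, assuming that polmineR and cwbtools are installed from CRAN and that `cwbtools::corpus_install()` is pointed at the Zenodo DOI of GermaParl v2.0.1 given in Section 1.6 (the DOI of a different release may be needed):

```r
# Install the analysis environment and the download helper from CRAN
install.packages("polmineR")
install.packages("cwbtools")

# Download the CWB tarball from Zenodo and install the corpus locally.
# Assumption: the DOI is the one cited for v2.0.1 in Section 1.6.
cwbtools::corpus_install(doi = "10.5281/zenodo.10416536")
```

Once the corpus is installed, a first look at it could be taken as follows. The corpus ID `GERMAPARL2` is an assumption; calling `corpus()` without arguments lists the IDs of the corpora that are actually available. The query term is an arbitrary example.

```r
library(polmineR)

corpus()                       # list the CWB corpora polmineR can see

gparl <- corpus("GERMAPARL2")  # assumption: ID of the installed corpus
size(gparl)                    # number of tokens
s_attributes(gparl)            # available structural annotation layers
p_attributes(gparl)            # available linguistic annotation layers

count(gparl, query = "Migration")  # frequency of a single token
```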
1.4 First Steps and Digging Deeper
Making resources accessible is at the heart of the PolMine project. To this end, a number of training and teaching resources were developed within the project. In addition, valuable training material was created by other scholars of the community.
The “UCSSR” (Using Corpora in Social Science Research) series of online slides makes extensive use of GermaParl and introduces some analytic approaches to parliamentary debates. Thus, the slides are a great starting point to explore the data. They were designed in a way that facilitates the independent acquisition of skills and knowledge and aspire to be sufficiently thorough to serve as a point of reference for substantial analysis.
1.4.1 Video Tutorials for GermaParl
For a class on Parliamentary Analysis in R, Christoph Nguyen has crafted video tutorials on the previous version of GermaParl, which are available on YouTube. Four tutorials give a hands-on introduction to analyzing GermaParl in combination with the polmineR package.
Christoph’s tutorials are in German.
It must be noted that the newest version of GermaParl comes with some specific design decisions that result in some differences in the internal structure of the corpus data. For details in this regard, please consult the release notes of GermaParl v2 Release Candidate 3 for now. In consequence, the current set of polmineR commands (polmineR version 0.8.9.9001) is not entirely backwards compatible, and thus not all commands shown in the videos will work with the new resource. For an in-depth explanation of the internal structure, see Section 4 about “XML Structure”.
1.4.2 Cookin’ with GermaParl Webinar Series
After the initial release of GermaParl v2.0.0, we introduced the “Cookin’ with GermaParl” webinar series in which the GermaParl team presents recipes for common approaches on a regular basis. Upcoming sessions are announced via the GermaParl mailing list. Earlier sessions of the webinar series will be made available on YouTube. The presented recipes are made available as R Markdown documents in an online “Cookbook” provided as a GitHub repository.
1.5 License
The license of the GermaParl corpus is the Creative Commons Attribution ShareAlike 4.0 License (CC BY-SA 4.0). That means:
BY - Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
SA - ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
See the CC Attribution-ShareAlike 4.0 License for further explanations.
1.6 Quotation
To ensure the reproducibility of your research, it is important to refer to and specify the corpus (including version and DOI) you used.
Blaette, Andreas, & Leonhardt, Christoph (2023). GermaParl Corpus of Plenary Protocols (v2.0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10416536.
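For users who keep their references in R or BibTeX, the citation above can be recorded with base R. This is a minimal sketch: the entry type `Misc` and the use of the `note` field for the resource type are choices made here for illustration, not a prescribed format.

```r
# Record the corpus citation (version and DOI included) as a bibentry object
germaparl_citation <- utils::bibentry(
  bibtype = "Misc",
  author = c(
    utils::person("Andreas", "Blaette"),
    utils::person("Christoph", "Leonhardt")
  ),
  year = "2023",
  title = "GermaParl Corpus of Plenary Protocols (v2.0.1)",
  note = "Data set",
  publisher = "Zenodo",
  doi = "10.5281/zenodo.10416536"
)

# Plain-text rendering and BibTeX export
print(germaparl_citation, style = "text")
bibtex_lines <- utils::toBibtex(germaparl_citation)
print(bibtex_lines)
```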
1.7 Acknowledgements
We gratefully acknowledge funding from the German National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur / NFDI). Funding from KonsortSWD has advanced the data preparation tool set to facilitate the robust addition of further annotation layers (such as Named Entities) to large corpora. This is instrumental for linking parliamentary data with other data. Funding from the Text+ consortium is instrumental for updates of the corpus, quality control and keeping data formats in line with current and future developments.
The data quality of GermaParl we are able to offer at this stage has benefitted significantly from a cooperation with the SOLDISK project at the University of Hildesheim, and comprehensive manual quality control of the data carried out by the SOLDISK team. A very special thanks goes to Hannes Schammann, Max Kisselew, Franziska Ziegler, Carina Böker, Jennifer Elsner and Carolin McCrea.
1.8 Quality Control and Issue Tracking
While we provide a thoroughly checked language resource which has undergone a number of iterations and a closed beta phase, the possibility of remaining errors and flaws in the data cannot be ruled out. We conceptualized GermaParl as an evolving resource, meaning that the preparation pipeline is designed in a way that allows for the incorporation of user feedback and feature requests. The most effective way to collect feedback is to use GitHub Issues.
The documentation of the corpus is stored on GitHub. We use the same location to collect feedback. The repository can be found here: https://github.com/PolMine/GermaParl2.
1.9 Structure of this Documentation
The new version of GermaParl is not only a temporal update of the established corpus. While it shares many qualities of the previous version, some processing steps are fine-tuned, certain attributes are updated and a number of additional features are provided. The purpose of the remainder of this documentation is thus threefold:
- The corpus is comprehensively annotated, which yields great potential for developing precise and deliberate individual workflows for analysis. To facilitate this, the data is structured in a specific data format and provided in a way which might seem less familiar than, for example, a data frame representation. Therefore, central features of the resource in general and its data format in particular are presented in some depth (Section 2).
- GermaParl should be understood as an evolving resource (Blätte, Rakers, and Leonhardt 2022). Consequently, the data preparation process is made transparent to allow feedback to be incorporated effectively into new versions of the corpus and to increase the trustworthiness of the data (Section 3).
- The data should be accessible not only in terms of availability but also with regard to its usability. In consequence, information about how to get started is provided (see Section 1.4 and the code examples in the data report in Section 2).