by Christoph Leonhardt, Andreas Blaette
GermaParl2 – Constitution Day 2025 Release
We are pleased to take Germany’s 2025 Constitution Day as an opportunity to release the newest version of GermaParl (v2.3.0-rc1). The new version provides incremental quality improvements and covers all sessions of the German Bundestag up to March 18, 2025. The corpus includes 291 million tokens in 4559 protocols, spanning the first 20 legislative periods in their entirety. The new GermaParl2 version is the up-to-date resource for researchers eager to analyse parliamentary debates during Germany’s “Ampel” government.
We offer a beta today that is available for registered users. Prospective users can find more information about access to the data and data installation on Zenodo.
GermaParl2 is developed and maintained at the intersection of our work in the NFDI consortia Text+ (updates, quality improvements, standardisation) and KonsortSWD (linked data). Aside from presenting current updates, we also use this occasion to take stock of past developments and look ahead to the future of GermaParl.
Towards a High-Quality, Full Coverage Resource
The first version of GermaParl2 was released on Constitution Day (May 23) 2023 and substantially extended the coverage of the initial version of GermaParl. GermaParl v2.0.0 provides access to all debates between September 1949 and September 2021 in two formats, which comprise rich structural and linguistic annotation and facilitate many kinds of analyses in social science research and beyond. The corpus was released on Zenodo and accompanied by comprehensive documentation and continuous community outreach.
Following up on this initial release, we provided four releases of GermaParl via Zenodo, either to the broader public or as release candidates for which access can be granted upon request. GermaParl v2.0.1, released in December 2023, provided some incremental quality improvements over the initial release. After about half a year in closed beta as GermaParl v2.1.0-rc2, GermaParl v2.1.0 was made available as a public release in July 2024. GermaParl v2.1.0 extended the temporal coverage of GermaParl2 to July 2023 and introduced date-specific assignments of party affiliations for the 20th legislative period. In earlier versions of the corpus, the limited availability of structured, date-specific data on speakers’ party affiliations resulted in rather coarse party affiliation assignments: we were not able to represent changes in a speaker’s party affiliation within a legislative period, so each speaker was assigned to the same party throughout a legislative period. This situation is changing, as more detailed information on party affiliation is becoming available in more structured formats. As a start, we added date-specific party affiliation information for Members of Parliament in the 20th legislative period (using detailed albeit mostly unstructured party affiliation data offered by the corresponding Wikipedia overview page). Emerging new resources may facilitate more nuanced annotations of party affiliations throughout the corpus in the future. This will be an important next step in the development of GermaParl.
GermaParl as Linked Data
With the release of GermaParl v2.2.0-rc1 as a release candidate in July 2024, the corpus was extended to cover all sessions until June 2024. More importantly from a technical point of view, a novel feature was introduced: the inclusion of Uniform Resource Identifiers (URIs) of the DBpedia Knowledge Graph for persons, organizations and locations in continuous text. Adding URIs to these entities in the Corpus Workbench (CWB) version of the corpus via the DBpedia Spotlight Entity Linking tool is the first step toward GermaParl as a linked textual resource. Using the toolset we developed in the project “Linking Textual Data” (as part of KonsortSWD within the National Research Data Infrastructure/NFDI), in particular the R package dbpedia, we were able to add these URIs as a structural attribute to the CWB version of GermaParl.
The addition of URIs is also part of the newest release of the corpus, GermaParl v2.3.0-rc1, which we announce today as a closed-beta release. While the inclusion of entity-specific URIs is still experimental and its quality has not yet been checked, URIs offer manifold potential for substantive research, for example by allowing the disambiguation or enrichment of entities. By making this new annotation layer available to the GermaParl user community at an early stage, we want to foster discussion and advance the development of tools and data in a community-driven fashion.
Using the new annotation layer should be easy: Aside from the new release candidate of GermaParl, all you need to get started is the following CQP query syntax, which lets you look for regions described by a specific URI in the new structural attribute “dbpedia_uri”:
'/region[dbpedia_uri,a]::a.dbpedia_uri="http://de.dbpedia.org/resource/Europa"'
This query syntax can be used wherever CQP queries are accepted, for example in the count() or kwic() methods of the polmineR R package. This is useful when the same concept is expressed in a number of different ways or when the same words refer to different concepts.
For a first impression, we can look at the word sequences that the query above matches:
library(polmineR)  # provides the count() and kwic() methods

# Count the word sequences annotated with the DBpedia URI for "Europa"
count("GERMAPARL2",
      query = '/region[dbpedia_uri,a]::a.dbpedia_uri="http://de.dbpedia.org/resource/Europa"',
      cqp = TRUE,
      breakdown = TRUE)
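The query pattern shown above can be reused for other entities. The following is a minimal sketch, not part of polmineR or the GermaParl tooling: the helper names dbpedia_region_query() and uri_kwic() are hypothetical and introduced here only for illustration. The sketch assumes that polmineR and the GERMAPARL2 corpus are installed when uri_kwic() is actually called.

```r
# Hypothetical helper (not part of polmineR): build the CQP region query
# for one DBpedia URI on the structural attribute "dbpedia_uri".
dbpedia_region_query <- function(uri, s_attribute = "dbpedia_uri") {
  stopifnot(is.character(uri), length(uri) == 1L, nzchar(uri))
  sprintf('/region[%s,a]::a.%s="%s"', s_attribute, s_attribute, uri)
}

# Reproduces the query string used in the text.
europa_query <- dbpedia_region_query("http://de.dbpedia.org/resource/Europa")

# Hypothetical wrapper: concordances for all regions linked to a URI.
# Only works if polmineR and the corpus are available at call time.
uri_kwic <- function(uri, corpus_id = "GERMAPARL2") {
  polmineR::kwic(corpus_id, query = dbpedia_region_query(uri), cqp = TRUE)
}
```

With the corpus installed, uri_kwic("http://de.dbpedia.org/resource/Europa") should display the matches in context. Since CQP generally treats attribute values as regular expressions, URIs containing characters such as parentheses may need escaping; this is worth checking against the CQP documentation.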
For a few more details on the annotation process, associated potentials and challenges as well as future steps, we refer to the “Cookin’ with GermaParl” webinar series.
GermaParl v2.3.0-rc1: High-Quality Coverage, Entire 20th Legislative Period
After GermaParl v2.2.0-rc1 extended the temporal coverage of the corpus to June 2024, the most recent update, GermaParl v2.3.0-rc1, which we release today, adds all remaining protocols of the 20th legislative period until March 2025. It makes use of the date-specific assignments of party affiliations introduced with GermaParl v2.1.0. In more recent debates, this makes it possible in particular to distinguish, by date, between members of “DIE LINKE” and members of the “Bündnis Sahra Wagenknecht” (“BSW”). Please note that the assignment of date-specific party affiliations is currently still limited to the 20th legislative period. Relatedly, the difference between “DIE LINKE” and “Die Linke” as a parliamentary group is deliberate: it indicates the difference between a parliamentary group (“Fraktion”) and a group with group status (“Gruppe”) that existed after the parliamentary group split up.
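The idea behind date-specific assignments can be illustrated with a small lookup table of affiliation intervals. This is a conceptual sketch only, not the procedure used to build GermaParl: the speakers, dates and the function party_on() are placeholders invented for illustration.

```r
# Illustrative records only: speakers and dates are placeholders,
# not values taken from GermaParl or any Bundestag data source.
affiliations <- data.frame(
  speaker = c("Speaker A", "Speaker A", "Speaker B"),
  party   = c("DIE LINKE", "BSW", "DIE LINKE"),
  from    = as.Date(c("2021-10-26", "2024-01-01", "2021-10-26")),
  to      = as.Date(c("2023-12-31", "2025-03-18", "2025-03-18")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the party a speaker belonged to on a given day,
# or NA if no interval in the table covers that day.
party_on <- function(speaker, date, table = affiliations) {
  date <- as.Date(date)
  hit <- table$speaker == speaker & table$from <= date & date <= table$to
  if (!any(hit)) return(NA_character_)
  table$party[which(hit)[1]]
}

party_on("Speaker A", "2023-06-15")  # falls into the first interval
party_on("Speaker A", "2024-06-15")  # falls into the second interval
```

A coarse, period-level assignment would correspond to a table with exactly one row per speaker and legislative period, and could not express the change between the two intervals for the same speaker.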
Users familiar with GermaParl v2.2.0-rc1 might notice some additional changes. To further improve the quality of the data, we consolidated the full names and party affiliations we assigned to speakers extracted from the protocols. While consistency was always an important goal, in some instances the same speaker appeared under different variants of the same name (e.g., including or omitting middle initials) in different legislative periods. In other instances, a speaker was assigned to a party in one parliamentary role but not in another, because different resources were used to enrich different speaker roles. GermaParl v2.3.0-rc1 makes an effort to remedy the most obvious instances of contradictory metadata assignments. Other changes include the recoding of the abbreviation for the “Zentrumspartei” from “Z” (which is used in the original protocols to indicate the respective parliamentary group) to “DZP” (for “Deutsche Zentrumspartei”), which is the preferred abbreviation of the party in the “Stammdaten” file, a collection of metadata on Members of Parliament provided by the German Bundestag. We also consolidated the party affiliation of “Ludwig Erhard”. Earlier versions indicated that his party affiliation ultimately remains unclear. We now follow the general assumption expressed in both the Wikipedia overview pages and other resources such as the “Parliaments Day-by-Day” database by Turner-Zwinkels and colleagues (2022) and assign “CDU” throughout.
Where We Stand, Looking Ahead
After the release of GermaParl v2.2.0-rc1 constituted an extensive update of the resource, GermaParl v2.3.0-rc1 provides further incremental improvements and extends the coverage of the corpus to March 2025. GermaParl now comprises the first 20 legislative periods in their entirety. As before, the development of GermaParl relies on community involvement: given the experimental nature of the new features, user feedback is invaluable for the future development of the resource. While we are convinced that the addition of Uniform Resource Identifiers greatly enhances the usefulness of GermaParl, specific conceptual, methodological and technical decisions should be made with usability and accessibility in mind. In addition, the quality of these new annotations has not yet been systematically evaluated. By publishing this version as a release candidate, we want to enable the community to contribute to the development of the resource by engaging in the discussion about the specific implementation and potential use cases. Extensive feedback is an important part of our strategy to develop useful tools and data. Please reach out to us via email (dennis.schuele@uni-due.de) to report issues and suggestions.
GermaParl v2.3.0-rc1 is also a transitional release. In preparing this version, we reevaluated the way we represent speaker metadata, but the current release does not yet implement the new approach to its full extent. Among other things, changes of speaker names between or within legislative periods, for example due to marriage, are not yet properly represented. A better solution, with fine-grained, date-specific annotations of names and party affiliations for more speakers, is on the horizon. To this end, the next major milestone is the provision of GermaParl in the XML encoding standard of the ParlaMint project. This development will not only make it easier to represent additional, fine-grained and extensively documented speaker-level metadata but, crucially, will also strengthen the interoperability of the resource, making comparative research more accessible and the integration of workflows and tools more seamless. By making use of more consolidated representations of metadata, GermaParl v2.3.0-rc1 is a first step toward this goal.
Acknowledgements
We gratefully acknowledge funding by KonsortSWD and Text+ within the German National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur/NFDI) to prepare, enrich and maintain the corpus and associated resources. We also want to thank the Institute of Contemporary History in Ljubljana, Slovenia, for the opportunity to work toward the ParlaMint version of GermaParl during a Visiting Fellowship in October 2024.
Technical Note
The release of GermaParl v2.3.0-rc1 comprises three objects, all available on Zenodo: GermaParl v2.3.0-rc1 (CWB), GermaParl v2.3.0-rc1 (XML) and GermaParl v2.3.0-rc1.1 (XML). GermaParl v2.3.0-rc1 (XML) and GermaParl v2.3.0-rc1.1 (XML) are functionally identical, except that the latter includes slight improvements and the harmonization of agenda item annotations in legislative periods 19 and 20. Since we do not include agenda items in the CWB version of the corpus due to the limited robustness of this annotation, both sets of XML files would result in the same CWB corpus. Consequently, we do not provide a separate CWB version for GermaParl v2.3.0-rc1.1. Users interested in the CWB version (for example for use with the polmineR R package) can use the GermaParl v2.3.0-rc1 CWB version, which contains the described improvements and coverage. Users interested in the XML version are advised to use GermaParl v2.3.0-rc1.1. Because the two XML versions are substantively almost identical, there is little reason to use GermaParl v2.3.0-rc1 in its XML representation. We include GermaParl v2.3.0-rc1 (XML) for reasons of research data management and transparent versioning, as it is this precise set of XML files that was used to prepare the CWB corpus.