by Christoph Leonhardt
Update for GermaParl2 – Improving Corpus Quality and a Look Ahead
We always envisioned GermaParl as an evolving resource. Since some issues only become apparent during productive work, the continuous provision of releases aimed at improving the corpus was always part of our roadmap. Accordingly, over the last months, we put the corpus to work in substantive analyses, comprehensive quality checks and educational outreach. This enabled us to spot remaining flaws. The same is true for users of the resource who might encounter bugs and missing features – some of which were brought to our attention via our issue tracker on GitHub.
As a first step towards even better data quality, we are pleased to announce the first patch release for GermaParl2 today. Like the previous version, the corpus is available via Zenodo. GermaParl v2.0.1 builds on the initial release of GermaParl2, with the same features and the same period of coverage, but it addresses many of the recently identified issues and improves the general quality of the resource. We want to highlight updates in three areas that are especially noteworthy:
Removed appendices
We noticed that for quite a large number of sessions, the processed protocols contained not only speeches delivered on the plenary floor but also appendices (see issue #1 on GitHub). These comprise different elements, in particular speeches which were only added to the minutes of the session. Their accidental inclusion was caused by a greater-than-expected variation in end-of-speech expressions. These appendices have now been removed from the corpus.
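To illustrate why variation in closing expressions matters, here is a minimal sketch of cutting a protocol at its closing marker. It is not the GermaParl implementation: the function name `strip_appendices`, the pattern `END_OF_SESSION` and the marker variants it covers are assumptions made for this example only.

```python
import re

# Assumed closing-marker variants, e.g. "(Schluss: 17.02 Uhr)" or
# "(Schluß der Sitzung: 9.30 Uhr.)". The expressions actually handled in the
# GermaParl workflow are not listed in this post.
END_OF_SESSION = re.compile(
    r"\(\s*Schlu(?:ss|ß)(?: der Sitzung)?\s*:?\s*\d{1,2}[.:]\d{2}\s*Uhr\.?\s*\)"
)


def strip_appendices(protocol: str) -> str:
    """Return the protocol text up to and including the first closing marker.

    If no marker is found, the text is returned unchanged, which is exactly
    the situation in which appendices would slip into the corpus.
    """
    match = END_OF_SESSION.search(protocol)
    if match is None:
        return protocol
    return protocol[: match.end()]
```

A marker phrased in a way the pattern does not anticipate leaves the appendix in place, so every newly observed variant has to be added to the pattern.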
Improved speaker recognition
Most speakers are correctly identified but, despite our best efforts, not every speaker is properly recognized. This was also noticed by members of the GermaParl community (see issue #2 on GitHub). At least for presidential speakers, we were able to improve the identification by introducing additional line breaks before presidential speakers start to speak. This enables our regex-based approach to recognize speaker calls properly even if line breaks are missing in the original data. In addition, the identification of speakers of the Federal Council was improved by adding and tuning regular expressions for this group of speakers.
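The idea of inserting line breaks before presidential speaker calls can be sketched as follows. This is a simplified illustration under stated assumptions, not the regular expressions used for GermaParl: the patterns, the list of titles and the function names `insert_breaks` and `find_speaker_calls` are hypothetical.

```python
import re

# Hypothetical line-anchored pattern for presidential speaker calls such as
# "Vizepräsidentin Petra Pau:". It only matches at the start of a line.
SPEAKER_CALL = re.compile(
    r"^(?P<role>Vizepräsident(?:in)?|Präsident(?:in)?) (?P<name>[^:\n]+):",
    re.MULTILINE,
)

# Hypothetical repair pattern: spaces that follow sentence-final punctuation
# or a closing parenthesis and directly precede a presidential speaker call.
MISSING_BREAK = re.compile(
    r"(?<=[.!?)]) +(?=(?:Vizepräsident|Präsident)(?:in)? [A-ZÄÖÜ][^:\n]{0,60}:)"
)


def insert_breaks(text: str) -> str:
    """Insert a line break where one appears to be missing before a speaker call."""
    return MISSING_BREAK.sub("\n", text)


def find_speaker_calls(text: str) -> list[tuple[str, str]]:
    """Return (role, name) pairs for all line-initial presidential speaker calls."""
    return [(m.group("role"), m.group("name")) for m in SPEAKER_CALL.finditer(text)]


raw = "Ich danke Ihnen. (Beifall) Vizepräsidentin Petra Pau: Vielen Dank."
print(find_speaker_calls(raw))                 # no line-initial call in the raw text
print(find_speaker_calls(insert_breaks(raw)))  # call is found after the repair
```

The repair step is a heuristic: a sentence that merely mentions a president and is followed by a colon could be split by mistake, so patterns of this kind need to be tuned against the actual data.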
Large Paragraphs in LPs 13 to 18
From earlier iterations of the corpus preparation, we already knew that the reconstruction of paragraphs and the concatenation of stage expressions – i.e., elements interrupting a speaker’s utterance – can be limited by noise which is mostly introduced by issues in the raw data. In some cases, this results in unexpected behavior such as the unintended concatenation of multiple lines, which can obscure valid speaker calls. While this had been addressed for earlier legislative periods in the initial release of GermaParl, the changing nature of the raw data unfortunately led to overly large paragraphs remaining in legislative periods 13 to 18. This has now been improved, and additional speakers are identified as a result. Protocols in legislative periods 15 and 16 benefit in particular. Furthermore, several minor but meaningful improvements are included in this release. See the change log in the documentation for all changes.
GermaParl v2.1.0 Release Candidate 2
Aside from this patch release, we want to use this opportunity to announce the closed-beta release of the next version of GermaParl2: GermaParl v2.1.0-rc2. This release candidate includes all improvements of GermaParl v2.0.1 described above plus the first 116 protocols of the 20th legislative period. So, while GermaParl v2.0.1 (like the initial GermaParl2 release) covers parliamentary debates between September 1949 and September 2021, the new beta release extends this period to July 2023. Unlike the public release of GermaParl v2.0.1, this update is provided via Zenodo with restricted access. While we are confident that the corpus preparation workflow allows us to create high-quality and reliable versions of the corpus, the final quality checks which were performed for the previous corpus versions are still pending for the recently added protocols of the 20th legislative period. Releasing the upcoming version of GermaParl2 in this restricted way allows us to share the more recent debates at a level of quality which should be sufficient for many purposes but which has not yet been fully checked for remaining flaws. Accordingly, interested users should be aware of the preliminary status of the resource. In particular, the restricted release makes it possible to include the community in this corpus curation effort: we invite interested users to apply for access on Zenodo and to help us further improve the resource before the final open release by reporting bugs via our issue tracker. How to request access is described on the corresponding Zenodo page.
Next Steps
GermaParl v2.0.1 addresses many issues, which will be marked as "closed" in the issue tracker on GitHub. However, you will find that this update does not address all reported issues. In addition, new issues will certainly become apparent as people continue to use the resource. Please note that open issues might indicate current limitations of the resource for your particular use case. Where a fix turns out to be feasible, we will try to include it in an upcoming release. This particularly applies to the beta release of GermaParl v2.1.0-rc2, which presents a good opportunity to address additional issues and potential feature requests.
Finally, we want to reiterate our suggestion to share your experiences (bugs, flaws, uncertainties) with us via GitHub issues. With GermaParl v2.1.0 already on the horizon, there will be further releases, and we greatly benefit from your feedback to further improve the resource.