Corpora

The PolMine Project prepares and shares textual data in two related areas:

Corpora of plenary protocols;
Corpora and further language resources for migration and integration research.

The project is rooted in political science, but we recognise that we work with large-scale linguistic data. We therefore use tools and approaches for processing and managing data from corpus linguistics, computational linguistics and information science.

Based on an initial “XMLification” of raw data (pdf, html, txt documents), we prepare linguistically annotated versions of the corpora that are imported into the Corpus Workbench (CWB). Whenever possible, we deposit tarballs with the indexed versions of corpora with Zenodo, the open science data repository of our choice.

To make restricted-access corpora available, we host them on an OpenCPU server. At this stage, this applies to the MigPress corpus of migration- and integration-related newspaper reports.

For publicly available corpora, R users can use a convenient installation mechanism included in the cwbtools package, which is available on CRAN. We recommend using the polmineR R package for working with CWB-indexed corpora. The polmineR package also includes functionality for remote access to restricted-access corpora.

GermaParl

GermaParl is a consolidated high-quality corpus of plenary debates in the German Bundestag that meets standards for social science research. At this stage, the corpus includes all plenary protocols that were published by the German Bundestag between 1996 and 2016. The name GermaParl is inspired by the EuroParl and DutchParl corpora. The GermaParl corpus can be retrieved as follows:

The GermaParlTEI repository at GitHub offers XML/TEI versions of the corpus.
A tarball with the linguistically annotated and CWB-indexed version of the corpus is available via Zenodo.
The GermaParl R package published on CRAN includes a small subset of GermaParl for demonstration purposes and convenience functionality to download the full corpus from Zenodo.

Using the GermaParl R package, it just takes two lines of R code to install the corpus:

install.packages("GermaParl")
GermaParl::germaparl_download_corpus()
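Once the corpus is installed, a first look at it with polmineR might resemble the following sketch. It assumes that the full corpus is registered under the ID "GERMAPARL" and that polmineR is installed. Both the ID and the query term are illustrative assumptions, not taken from this page. Depending on the version of the GermaParl package, it may be necessary to call use("GermaParl") first so that polmineR finds the corpus.

```r
# Assumed corpus ID and an arbitrary example query (both illustrative).
corpus_id <- "GERMAPARL"
query <- "Integration"

if (requireNamespace("polmineR", quietly = TRUE)) {
  library(polmineR)

  # corpus() without arguments lists the corpora polmineR can see.
  available <- corpus()$corpus

  if (corpus_id %in% available) {
    print(size(corpus_id))          # number of tokens in the corpus
    print(s_attributes(corpus_id))  # available structural annotation
    concordances <- kwic(corpus_id, query = query)  # keyword in context
    print(concordances)
  } else {
    message("Corpus ", corpus_id, " not found. Available: ",
            paste(available, collapse = ", "))
  }
} else {
  message("polmineR is not installed: install.packages(\"polmineR\")")
}
```

The guards only make the sketch safe to run on a machine without polmineR or without the corpus. In an interactive session the three calls in the middle are all that is needed.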

UNGA

The corpus of the verbatim meeting records of the United Nations General Assembly (UNGA) is the language resource with the most global outlook we have prepared. Our primary intention when preparing the corpus was to have a resource for training purposes that is genuinely international and will not convey any preference for any nation on earth.

The UNGA convenes delegates appointed by states, which is a significant difference from elected representatives in parliaments. The website of the Dag Hammarskjöld Library describes the practices and systematics of record keeping in some detail. Due to the poor quality of optical character recognition (OCR) of documents issued before 1994, only documents starting from 1994 (the 49th session) entered the UNGA corpus. At this stage, documents up to the 79th meeting of the 72nd session (20 March 2018) are covered. In total, 2585 PDF files were processed to build the UNGA corpus.

The UNGA corpus is deposited with Zenodo. To install the corpus, proceed as follows:

cwbtools::corpus_install(doi = "10.5281/zenodo.3831472")

MigParl

MigParl is an indexed and linguistically annotated corpus of speeches on migration and integration affairs in Germany’s regional parliaments (“Landtage”). The corpus has been prepared in the MigTex Project, which was part of a larger joint project to establish the research community of the German Centre for Migration and Integration Affairs (DeZIM / Deutsches Zentrum für Integrations- und Migrationsforschung). Funding awarded by Germany’s Federal Ministry for Family Affairs, Senior Citizens, Women and Youth (BMFSFJ / Bundesministerium für Familie, Senioren, Frauen und Jugend) is gratefully acknowledged.

Consult the MigParl Website to learn more about the corpus, its preparation and usage.

The corpus is stored on Zenodo and can be downloaded and installed as follows:

cwbtools::corpus_install(doi = "10.5281/zenodo.3872263")

ParisParl

The ParisParl Corpus comprises all protocols of plenary sessions in the French Assemblée nationale between 1996 and 2019. The corpus is built based on pdf documents issued by the Assemblée nationale. The Framework for Parsing Plenary Protocols (R package frappp) has been used to extract structural information from the original text and to prepare an XML version of the corpus (preliminary TEI format). The structural annotation comprises speaker, party affiliation, parliamentary group affiliation, role, legislative period, session, date, interjections, year and agenda item.

As part of the corpus preparation pipeline, the data has been linguistically annotated (using the TreeTagger and StanfordNLP) and imported into the Corpus Workbench (CWB). The linguistic annotation comprises POS-tagging and lemmatization.

This language resource is still very much in development and comes without any guarantees.

The corpus is stored on Zenodo and can be downloaded and installed as follows:

cwbtools::corpus_install(doi = "10.5281/zenodo.3819374")

AustroParl

The AustroParl Corpus of Parliamentary Debates comprises all protocols of plenary sessions in the Austrian Nationalrat between 1996 and 2019. The corpus is built based on pdf documents issued by the Nationalrat. The R package frappp has been used to extract structural information from the original text and to prepare an XML version of the corpus (preliminary TEI format). The structural annotation comprises speaker, party affiliation, parliamentary group affiliation, role, legislative period, session, date, interjections, year and agenda item.

This language resource is still very much in development and comes without any guarantees.

The corpus is stored on Zenodo and can be downloaded and installed as follows:

cwbtools::corpus_install(doi = "10.5281/zenodo.3819505")
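The Zenodo-hosted corpora that this page installs by DOI (UNGA, MigParl, ParisParl and AustroParl) can also be installed in one go. The following is a minimal sketch: the DOIs are copied from the sections above, while the helper install_polmine_corpora() is a hypothetical convenience function and not part of cwbtools.

```r
# DOIs exactly as given in the sections of this page.
zenodo_dois <- c(
  UNGA       = "10.5281/zenodo.3831472",
  MigParl    = "10.5281/zenodo.3872263",
  ParisParl  = "10.5281/zenodo.3819374",
  AustroParl = "10.5281/zenodo.3819505"
)

# Hypothetical helper (not part of cwbtools): install the selected corpora
# one after another using cwbtools::corpus_install().
install_polmine_corpora <- function(corpora = names(zenodo_dois)) {
  unknown <- setdiff(corpora, names(zenodo_dois))
  if (length(unknown) > 0L) {
    stop("No DOI known for: ", paste(unknown, collapse = ", "))
  }
  if (!requireNamespace("cwbtools", quietly = TRUE)) {
    stop("Please install cwbtools first: install.packages(\"cwbtools\")")
  }
  for (nm in corpora) {
    message("Installing ", nm, " (doi: ", zenodo_dois[[nm]], ")")
    cwbtools::corpus_install(doi = zenodo_dois[[nm]])
  }
  invisible(zenodo_dois[corpora])
}

# Usage (triggers large downloads, so it is left commented out):
# install_polmine_corpora(c("UNGA", "MigParl"))
```

Keeping the DOIs in a named vector means that a new corpus version only requires changing one entry.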

Corpora with Restricted Access

MigPress

In the MigTex Project, we prepared MigPress, a corpus of the migration- and integration-related reports that the Frankfurter Allgemeine Zeitung (FAZ) and the Süddeutsche Zeitung (SZ) published between 2000 and 2018. As the material is licensed and the corpus cannot be made freely available, we host the resource on an OpenCPU server. The polmineR package can be used for remote access.

The MigTex Project was part of a larger project to establish the research community of the German Centre for Migration and Integration Affairs (DeZIM / Deutsches Zentrum für Integrations- und Migrationsforschung). Funding awarded by Germany’s Federal Ministry for Family Affairs, Senior Citizens, Women and Youth (BMFSFJ / Bundesministerium für Familie, Senioren, Frauen und Jugend) is gratefully acknowledged.

Consult the MigPress Website to learn more about the corpus. Please get in touch with us for further information on data access.

Access to development versions for beta users

We use Amazon S3 as a cloud storage solution for development versions of corpora that have not yet been published. In this case, data access is restricted to certified beta users and credentials will be required for downloading data. A gist explains the usage of credentials.