XML/TEI versions of the corpora are publicly available under Creative Commons licenses for non-commercial users, either on GitHub or, for cooperation partners, through our GitLab server. GitHub and GitLab offer tools for tracking differences between versions and come with an issue tracking system that is very useful for quality management.

Linguistically annotated versions of corpora that have been indexed and imported into the Corpus Workbench are shipped as R data packages. In combination with the polmineR package, large corpora will work (more or less) out of the box on ordinary, affordable computers. To keep the initial installation size of the data packages modest, they include only small sample data and corpus-specific functionality at the outset. Full corpora can then be downloaded from a dedicated webspace. The following corpora are available in this way.

GermaParl

GermaParl, a corpus of debates in the German Bundestag, is the flagship corpus of the PolMine Project. At this stage, the corpus includes all plenary protocols published by the German Bundestag between 1996 and 2016. Plain text documents issued by the German Bundestag were considered the best raw data format for corpus preparation and were used whenever they were available. For a period between 2008 and 2010, txt files are not available throughout. To fill the gap, pdf documents were processed. The name GermaParl is inspired by the EuroParl and DutchParl corpora. An XML/TEI version of the corpus is available at GitHub. Consult the GitHub Pages with the package and corpus documentation to learn more.

To install GermaParl, proceed as follows.
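
A minimal sketch of the two-step installation described above (install the small data package, then fetch the full corpus), not necessarily the exact commands the project documents. It assumes that both polmineR and the GermaParl data package can be installed from a repository configured in your R session, and that the data package provides a download helper named `germaparl_download_corpus()`; check the package documentation if either assumption does not hold.

```r
# Sketch only: assumes both packages are available from a configured repository.
install.packages("polmineR")   # analysis package used to work with the corpus
install.packages("GermaParl")  # data package; ships only small sample data

# Assumption: the data package exposes a helper with this name that downloads
# the full corpus from the project's webspace.
GermaParl::germaparl_download_corpus()
```

The download step is kept separate from the package installation so that the initial installation stays small.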


To check whether GermaParl has been installed correctly, run the following code.

library(polmineR) # provides corpus()
corpus() # you should see GERMAPARL in the output table


UNGA

The data included in the UNGA corpus are the verbatim meeting records of the United Nations General Assembly. At http://research.un.org/c.php?g=98268&p=636540, the UN describes its record-keeping practices and system in some detail. After working around technical restrictions of the UN’s document database (http://unbisnet.un.org), most of the verbatim records (about 7000) were downloaded. A qualitative evaluation revealed that the accuracy of the optical character recognition (OCR) was limited for meeting records from 1993 or before. Hence, we decided to restrict ourselves to documents starting from 1994 (the 49th session). At the time of writing, the most recent document processed is that of the 79th meeting of the 72nd session (20 March 2018). All in all, we currently work with 2585 pdf files.

To install UNGA, proceed as follows.
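
A hedged sketch of how the data package might be installed. It assumes that UNGA is distributed through a drat repository under the `polmine` account, which is an assumption and is not stated in this text. Downloading the full corpus afterwards depends on a helper function in the UNGA package whose name is not given here, so that step is left to the package documentation.

```r
# Sketch only. Assumption: the UNGA data package is served from a drat
# repository registered under the account name "polmine".
install.packages("drat")    # helper package for adding drat repositories
drat::addRepo("polmine")    # register the (assumed) repository for this session
install.packages("UNGA")    # data package; ships only small sample data
```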


To check whether UNGA has been installed correctly, proceed as follows.

library(polmineR) # provides corpus()
corpus() # you should see UNGA in the output table