XML/TEI versions of the corpora are publicly available under Creative Commons licenses for non-commercial users, either on GitHub or, for cooperation partners, through our GitLab server. GitHub and GitLab offer tools for tracking differences between versions and come with an issue tracking system that is very useful for quality management.
Linguistically annotated versions of corpora that have been indexed and imported into the Corpus Workbench are shipped as R data packages. In combination with the polmineR package, large corpora will work (more or less) out of the box on average computers that everybody can afford. To keep the initial installation size of the data packages modest, they include only small sample data and corpus-specific functionality at the outset. The full corpora can then be downloaded from a dedicated webspace. The following corpora are available in this way.
To install GermaParl, proceed as follows.
install.packages("drat")
drat::addRepo("polmine")
install.packages("GermaParl")
library(GermaParl)
germaparl_download_corpus()
To check whether GermaParl has been installed correctly, run the following code.
install.packages("polmineR")
library(polmineR)
use("GermaParl")
corpus() # you should see GERMAPARL in the output table
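The visual check of the output table can also be done programmatically. The following is a minimal sketch, not part of polmineR or GermaParl: the helper `corpus_is_registered()` is a hypothetical name introduced here for illustration, and it assumes that `corpus()` returns a table with a column named `corpus` holding the corpus IDs.

```r
# Hypothetical helper (not part of polmineR or GermaParl): checks whether a
# corpus ID appears among the registered corpora. IDs are compared in upper
# case, as corpus IDs such as GERMAPARL are written in capitals.
# Assumption: corpus() returns a table with a column named "corpus".
corpus_is_registered <- function(corpus_id,
                                 registered = polmineR::corpus()[["corpus"]]) {
  toupper(corpus_id) %in% toupper(registered)
}

# Run the check only if both packages are installed; otherwise do nothing.
if (requireNamespace("polmineR", quietly = TRUE) &&
    requireNamespace("GermaParl", quietly = TRUE)) {
  polmineR::use("GermaParl")
  if (corpus_is_registered("GERMAPARL")) {
    message("GERMAPARL is available.")
  } else {
    message("GERMAPARL not found; consider running germaparl_download_corpus() again.")
  }
}
```

Because the list of registered corpora is passed as an argument with a default, the helper can be tried out without any corpus installed, for example `corpus_is_registered("GERMAPARL", registered = c("GERMAPARL", "REUTERS"))`.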
The data included in the UNGA corpus are the verbatim meeting records of the United Nations General Assembly. At http://research.un.org/c.php?g=98268&p=636540, the UN describes the practices and systematics of record keeping in some detail. After handling technical restrictions of the UN’s document database (http://unbisnet.un.org), most of the verbatim records (about 7000) were downloaded. A qualitative evaluation revealed that the accuracy of the optical character recognition (OCR) was limited for meeting records from 1993 or before. Hence, we decided to restrict ourselves to documents starting from 1994 (the 49th session). At the time of writing, the most recent document processed is the record of the 79th meeting of the 72nd session (20 March 2018). All in all, at this point we work with 2585 PDF files.
To install UNGA, proceed as follows.
install.packages("drat")
drat::addRepo("polmine")
install.packages("UNGA")
library(UNGA)
unga_download_corpus()
To check whether UNGA has been installed correctly, proceed as follows.
install.packages("polmineR")
library(polmineR)
use("UNGA")
corpus() # you should see UNGA in the output table