The codebase of the PolMine Project is developed in the statistical programming language R. Why do we opt for R? Perl, Java and Python are viable alternatives for text processing and text mining. Yet R offers a broad range of packages for statistical analysis and great visualisation capabilities, and it is widely used by social scientists, the primary target group of the PolMine Project.
The packages polmineR, RcppCWB, cwbtools and GermaParl are available via the Comprehensive R Archive Network (CRAN). Several additional packages, some of them more experimental, are developed as open source and are available from the GitHub presence of the PolMine Project. The following packages are the core of the PolMine codebase.
The driving idea behind polmineR is to treat text as linguistic data. The package supports working with linguistically and structurally annotated data and integrates the qualitative and quantitative steps of corpus analysis seamlessly, which facilitates validation. In its design, polmineR follows the paradigm of object-oriented programming and implements a three-tier software architecture. The Open Corpus Workbench (CWB), a classic indexing and query engine, serves as an efficient backend. Using the CWB also makes the powerful query language of the Corpus Query Processor (CQP) available for querying large corpora.
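A minimal sketch of how the quantitative and the qualitative step can be combined, assuming polmineR is installed and the REUTERS sample corpus shipped with the package is used. The query terms and the context window are chosen purely for illustration.

```r
library(polmineR)

# Activate the sample corpora shipped with the polmineR package
use("polmineR")

# Quantitative step: count the occurrences of a single token
oil_count <- count("REUTERS", query = "oil")

# The same kind of count using CQP syntax (here: a regular expression on tokens)
barrel_count <- count("REUTERS", query = '"barrel.*"', cqp = TRUE)

# Qualitative step: inspect the concordances (keyword in context) behind the count
oil_kwic <- kwic("REUTERS", query = "oil", left = 5, right = 5)
```

Inspecting `oil_kwic` next to `oil_count` is one way to check whether the counted matches actually mean what the analysis assumes.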
The RcppCWB package is a follow-up to the rcqp package (now in the CRAN archives), which pioneered wrapping the CWB in an R package. The primary purpose of RcppCWB is to reimplement a wrapper library for the CWB with a design based on Rcpp that makes cross-platform portability easier to achieve. RcppCWB can be installed on Windows, macOS and Linux. Even though RcppCWB may be used directly in an analytical workflow, it is intended as an interface for packages that offer higher-level functionality for working with large, linguistically and structurally annotated corpora. This is how polmineR uses RcppCWB to query CWB indexed corpora.
The cwbtools package includes a set of tools to conveniently create, manipulate and manage CWB indexed corpora from within R. It complements packages that use the CWB as a backend for text mining and corpus analysis with R, namely the packages ‘rcqp’ and ‘RcppCWB’. The core class of the package is the CorpusData class, which manages the token stream of a corpus (including linguistic annotation) and the corpus metadata. Methods for coercing and importing data from other data structures provide the basis for a workflow that moves from other established tools and approaches for corpus analysis (such as tm, quanteda, or tidytext) to importing data into the CWB for use with polmineR.
Quite often, liberating text from a PDF prison is the first step towards processing it further. To get from PDF to machine-readable text, R users can choose from a couple of packages, such as Rpoppler or pdftools. However, when dealing with documents that have a more complex layout, the potentially arduous postprocessing work only starts after the initial text extraction. Getting rid of unwanted features that result from the document layout may require manual cleaning, batteries of regular expressions and further programming workarounds before the wanted text is clean. The idea of the trickypdf package is to deal with the layout of a document proactively, to extract text only from defined boxes from the outset, and so to keep nerve-racking postprocessing to a minimum. In the context of the PolMine Project, trickypdf is used to deal with the two-column layout often used by parliaments for publishing plenary protocols.
CRAN offers a few packages for Natural Language Processing (NLP). These packages are not “pure R” NLP tools, but offer interfaces to standard NLP tools implemented in other programming languages, such as OpenNLP, coreNLP, udpipe, or spacyr. The cleanNLP package combines external tools in one coherent framework. So why yet another NLP R package? Existing packages are not particularly good at dealing with large corpora. The thrust of the bignlp package is to use a standard tool (Stanford CoreNLP) in parallel mode. To be parsimonious with the available memory, it implements line-by-line processing, so that annotated data is not kept in memory.
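The memory argument behind line-by-line processing can be illustrated with a minimal sketch in base R. This is not the bignlp implementation: `annotate_line()` is a hypothetical stand-in for a call to an annotator such as Stanford CoreNLP, `process_line_by_line()` is an illustrative helper, and parallelisation is left out.

```r
# Hypothetical placeholder for a real annotator: splits a line into tokens
# at whitespace and returns them as one tab-separated string
annotate_line <- function(line) {
  tokens <- strsplit(line, "\\s+")[[1]]
  paste(tokens, collapse = "\t")
}

# Read one line at a time, annotate it and write the result to disk
# immediately, so that only a single line is held in memory at any time
process_line_by_line <- function(infile, outfile) {
  input <- file(infile, open = "r")
  output <- file(outfile, open = "w")
  on.exit({
    close(input)
    close(output)
  })
  n <- 0L
  repeat {
    line <- readLines(input, n = 1L, warn = FALSE)
    if (length(line) == 0L) break  # end of file reached
    writeLines(annotate_line(line), con = output)
    n <- n + 1L
  }
  invisible(n)  # number of lines processed
}

# Usage with temporary files
infile <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".tsv")
writeLines(c("Hello world", "second line here"), infile)
process_line_by_line(infile, outfile)
```

Because each annotated line is written out straight away, memory use stays roughly constant however large the input file is. Splitting the input into several files and processing them in separate workers would be one way to add the parallel mode mentioned above.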