by Andreas Blätte
The polmineR package uses the rcqp package to access CWB-indexed corpora. At the core of the rcqp package are R-style C functions wrapping functions of the ‘corpus library’ (CL), a core C library of the CWB. This provides speed. Still, in my feverish dreams, I wondered whether I could access the corpus library directly. As I have installed the CWB, that C library is on my system anyway. Wouldn’t it be nice to be able to write my own C routines to call functions of the corpus library?
It does not make sense at all to reimplement the well-developed functionality of the rcqp package. But there are some core bottlenecks in the polmineR package slowing things down. At times, I need to go back and forth between R and the C functions of the corpus library as exposed by rcqp. For these bottlenecks, I have finally managed to implement that old idea of accessing the corpus library directly. In other words, I have written a set of C++ functions for performance-critical tasks.
Getting and loading packages
The new C++ code is included in a package I call ‘polmineR.Rcpp’. You may have a look at the package on GitHub and install it from there using devtools. If you want to reproduce things, make sure that you have the latest development version of the polmineR package.
We load these two packages, and ggplot2 for visualisation.
Sufficiently large data
Unsurprisingly, the corpus I use to run some tests is a corpus of protocols of the German Bundestag (“PLPRBT”). A release and an explanation of how to use the corpus are planned to supplement the publication of an article I co-authored with Andreas Wüst.
It is a 100M corpus. That may be sufficiently large to learn something about performance.
Counting the number of tokens in a corpus
Let us begin with a first standard task, counting the number of tokens in the overall corpus. There are three ways to get there:
- decoding all tokens and counting them;
- obtaining counts by calling the ‘cwb-lexdecode’ utility of the CWB; and
- accessing a C function of the corpus library called cqi_id2freq that is not exposed through rcqp but that I can access using Rcpp.
So let us put things together and look at it.
It is fair to say that the cwb-lexdecode utility essentially does the same as the Rcpp function I have written. Time is lost because R has to capture the output of the system call and parse it again into a data.table. Be that as it may: my Rcpp function is the winner.
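To make the difference between the approaches more tangible, here is a minimal sketch in plain C++ of what counting amounts to. It does not call the corpus library. The vector `ids` stands in for the decoded token stream, and the names `tally_ids` and `corpus_size` are hypothetical, chosen for illustration only. The slow route has to build the tally itself, whereas a lexicon that already stores one frequency per token id only needs to be read.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Tally how often each token id occurs in a stream of token ids.
// Assumption: every id lies in [0, lexicon_size); `ids` is a stand-in
// for the decoded corpus, not an actual corpus library structure.
std::vector<long> tally_ids(const std::vector<int>& ids, std::size_t lexicon_size) {
    std::vector<long> freq(lexicon_size, 0);
    for (int id : ids) {
        freq[static_cast<std::size_t>(id)] += 1;
    }
    return freq;
}

// The corpus size is the sum of all per-id frequencies.
long corpus_size(const std::vector<long>& freq) {
    return std::accumulate(freq.begin(), freq.end(), 0L);
}
```

With frequencies already stored per id, only the second step remains, which is why a direct lookup can be expected to beat decoding every token.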
A helper function
Before I proceed, I introduce a small helper function. To keep the code presented here readable, I decided to work with a wrapper that will execute all functions contained in a list of functions and return the number of seconds it has taken to do that.
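The helper used for the tests is presumably an R function. The same idea can be sketched in C++, to stay with the language of the new routines. The name `time_tasks` is hypothetical, and returning one value per function (rather than a single total) is an assumption of this sketch.

```cpp
#include <chrono>
#include <functional>
#include <vector>

// Run every function in `tasks` once and return the elapsed wall-clock
// time of each call in seconds, in the same order as the input.
std::vector<double> time_tasks(const std::vector<std::function<void()>>& tasks) {
    std::vector<double> seconds;
    seconds.reserve(tasks.size());
    for (const auto& task : tasks) {
        const auto start = std::chrono::steady_clock::now();
        task();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

A monotonic clock (`std::chrono::steady_clock`) is used so that adjustments of the system time cannot distort the measurements.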
Counting tokens in partitions
The next test concerns counts for partitions (i.e. subcorpora) rather than entire corpora. Results will depend on the size of the partition, and I want to compare partitions of three different kinds: one for an individual speaker (Angela Merkel), one for the speeches given in one year, and one for all speeches given in one legislative period. First, I generate these partitions.
Let us look at the sizes of these partitions.
So these partitions are as different in size as I wanted them to be. These are the functions that will perform counts for these partitions.
Now let us run these functions with and without Rcpp and collect the results.
The following bar chart visualises the results.
Admittedly, the Rcpp implementation is somewhat faster, but it is not a huge difference.
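For orientation, counting within a partition can be thought of as tallying token ids over a set of corpus regions. The following is a minimal sketch in plain C++ under that assumption. It is not the code of the package: `count_in_regions` is a hypothetical name, `ids` stands in for the token stream, and each region is taken to be a pair of inclusive corpus positions.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Count token ids inside a set of regions. Each region is a pair of
// corpus positions (first, last), both inclusive. Positions beyond the
// end of `ids` are skipped. All names are illustrative assumptions.
std::map<int, long> count_in_regions(
        const std::vector<int>& ids,
        const std::vector<std::pair<std::size_t, std::size_t>>& regions) {
    std::map<int, long> counts;
    for (const auto& region : regions) {
        for (std::size_t cpos = region.first;
             cpos <= region.second && cpos < ids.size(); ++cpos) {
            counts[ids[cpos]] += 1;
        }
    }
    return counts;
}
```

The work grows with the number of tokens in the partition, which fits the observation that results depend on partition size.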
Creating partitions (without count)
The next scenario is to create a partition, without counts included at first. We do not always need counts. Here are the functions …
Let us execute the functions (with and without Rcpp functions).
And here is the bar chart.
So with Rcpp, creating partitions (without counts) is much, much faster.
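Creating a partition without counts essentially means selecting the corpus regions whose metadata matches a given value, for instance a speaker name. A minimal sketch in plain C++, assuming the regions and their metadata values are available as two parallel vectors (the name `select_regions` and this data layout are hypothetical, not taken from the package):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A region is a pair of corpus positions (first, last), both inclusive.
using Region = std::pair<std::size_t, std::size_t>;

// Keep the regions whose metadata value equals `wanted`.
// Assumption: values[i] is the metadata value of regions[i].
std::vector<Region> select_regions(const std::vector<Region>& regions,
                                   const std::vector<std::string>& values,
                                   const std::string& wanted) {
    std::vector<Region> selected;
    const std::size_t n =
        regions.size() < values.size() ? regions.size() : values.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (values[i] == wanted) {
            selected.push_back(regions[i]);
        }
    }
    return selected;
}
```

A loop of this kind touches each region only once and never looks at the tokens themselves, so it is the sort of task where moving from R to compiled code plausibly pays off most.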
Faster creation of partitions with counts
The final scenario is the creation of partitions that will include counts. That will be triggered by providing the parameter ‘pAttribute’. We start again with the list of functions…
… run the tests …
And we plot the results.
Conclusion
I assume the results are straightforward and do not really need further discussion. There is a lot of talk about the performance gains that can be achieved using Rcpp. Here is the evidence for the polmineR package. Some core functions will run considerably faster.
However, these successes will not induce me to rewrite everything. There are a few other performance boosters in the package. Using the data.table package has made a huge difference. Still, I am really happy that I managed to write some Rcpp routines that speed up performance-critical tasks considerably. It should make using the package more enjoyable!