RcppCWB v0.2.6 at CRAN: Towards publishing cwbtools

This week, a new version of RcppCWB made it to CRAN. The package exposes the functionality of the Corpus Worbench, adding a limited number of performance-critical functions written in Rcpp/C++. Packages such as polmineR are built on top of RcppCWB and serve as a much more comfortable “front-end”.

The new RcppCWB version exposes two new functions:

  • The cl_charset_name()-function is about performance and efficiency. It gives access to the information on the encoding of a corpus kept in memory after the registry file describing the corpus has been parsed. Now, to check the encoding, the registry files needs not be read and parsed again and again, resulting in a performance gain. The new function is a model for further similar functions that will replace inefficient functionality to read and parse registry files in the polmineR package.

  • The cl_delete_corpus() is at least as important. It is all about flexibility. With this function, the parsed information from a registry file can be removed from memory, so that the registry file is parsed afresh upon using a corpus the next time. The scenario for this function is that we want changes we made for a corpus to become effective. Particularly when we add new structural attributes to a corpus, we now do not have to restart our R session to make the changed attribute available. This is really important to see indexed corpora not only as efficient data, but as reasonably flexible data too.

The cl_delete_corpus() function is part of a development to make further use of RcppCWB as a backend for the cwbtools package, available only at GitHub at this stage. The package is not yet mature, but is intended to supplement Rcpp and polmineR to have a seamless interface between R and the Corpus Workbench, and to be move beyond a static view of indexed corpora, i.e. to turn them into a much more flexbile resource than they are so far. Adding usability and flexibility to efficiency will be the purpose of cwbtools.