RcppCWB v0.4.4 Released

A new RcppCWB version “Jaberwocky” (v0.4.4) has just made it to CRAN. Initially, this was meant to be a minor maintenance release addressing a warning about paths in an example of the cwbtools package. The exercise went beyond that: a broader set of issues and bug reports has been addressed to make RcppCWB an efficient, robust and trustworthy basis for processing text as linguistic data.

RcppCWB has become much more consistent in handling paths, a potential source of confusing messages and errors. The basic issue is that the Corpus Workbench (CWB) parses the registry file describing a corpus only once. After a corpus has been loaded, an internal C representation keeps the information on it. But when the corpus is modified in any way (e.g. by adding an s-attribute with a new annotation layer), this internal C representation may be outdated. RcppCWB previously did not handle this situation appropriately, and the functionality it offered for it was prone to crash.

The cl_delete_corpus() function to unload a corpus is now robust. A new utility function, corpus_is_loaded(), checks whether a corpus is currently loaded. Various functions have been updated to handle paths more consistently across platforms (Linux, macOS and Windows) and to avoid confusing error messages.
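The stale-representation problem and the unload-and-check remedy can be illustrated with a toy model in plain R. This is only a sketch of the general "parse once, then cache" pattern. It does not call RcppCWB or the CWB, the fake registry file is just a list of attribute names, and every name in it (cache, parse_registry, load_corpus, is_loaded, unload_corpus) is hypothetical.

```r
# Toy model of a "parse once, then cache" registry loader.
# Nothing here calls RcppCWB or the CWB; all names are hypothetical.

cache <- new.env()  # stands in for the internal C representation

# Parse a (fake) registry file: one s-attribute name per line.
parse_registry <- function(path) readLines(path)

# Return the cached representation, parsing the file only on first use.
load_corpus <- function(id, path) {
  if (!exists(id, envir = cache, inherits = FALSE)) {
    assign(id, parse_registry(path), envir = cache)
  }
  get(id, envir = cache, inherits = FALSE)
}

# Check whether a corpus is currently held in the cache.
is_loaded <- function(id) exists(id, envir = cache, inherits = FALSE)

# Drop the cached representation so the next load re-parses the file.
unload_corpus <- function(id) {
  if (is_loaded(id)) rm(list = id, envir = cache)
  invisible(!is_loaded(id))
}

registry_file <- tempfile()
writeLines(c("text", "text_id"), registry_file)

before <- load_corpus("demo", registry_file)

# Modify the corpus on disk: add a new s-attribute.
writeLines(c("text", "text_id", "text_date"), registry_file)

# The cache is not refreshed, so this is still the old representation.
stale <- load_corpus("demo", registry_file)

# Unloading forces the next load to parse the registry file again.
unload_corpus("demo")
fresh <- load_corpus("demo", registry_file)
```

In this model, stale still lists only the two original attributes, while fresh includes the newly added one. The point of a reliable unload function plus a loaded-check is that code modifying a corpus can drop the outdated representation deliberately instead of working with stale information.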

The release also includes an extended test suite and is an intermediate step towards full Windows compatibility of the functionality to build corpora. This is still to be achieved for the functions cwb_encode(), cwb_makeall(), cwb_compress_rdx() and cwb_huffcode(). Note that this limitation affects only building corpora with RcppCWB on Windows. There are no known limitations of the functionality for analyzing corpus data.

Further work on Windows compatibility is a milestone envisaged for RcppCWB v0.6.0. For RcppCWB v0.5.0, the goal is to re-align the CWB code included in RcppCWB with the latest version of the CWB.