  • Rcpp wrappers for Corpus Library (CL) functions are exposed directly and can be used in C++ functions imported using Rcpp::sourceCpp() or Rcpp::cppFunction().
  • Dependency PCRE has been updated to PCRE2 #68.
  • The README suggested installing the development version of RcppCWB using the snippet devtools::install_github("PolMine/RcppCWB"). The missing ref = "dev" has been inserted.
  • cwb_encode() crashed if arguments data_dir and vrt_dir included a tilde. Tilde expansion is now applied to these arguments to avoid this (#73).
  • A new vignette explains how to write C++ inline functions.
  • Fixed the package configuration, which prevented the compiler from being used for compiling the CWB C code as intended (#66).
  • Adding ‘-luuid’ to PKG_FLAGS in Makevars solves linker issue FOLDERID_ #67.
  • GitHub Actions now working for Windows #47.
  • Fixed a bug in the region_matrix_corpus() C++ code that would not show any context at all if s_attribute expansion transgressed start or end of corpus.
  • Fixed a bug in the region_matrix_corpus() C++ code that resulted from not considering that query matches may cover more than one struc of a structural attribute.
  • corpus_info_file() does not crash if INFO is not defined in the registry file (#62).
  • Implicit processing of arguments sAttribute and pAttribute as s_attribute or p_attribute respectively is now accompanied by a warning that these arguments are deprecated.
  • The check_corpus() function distinguishes between whether a corpus is loaded in the CL and/or CQP context.
  • cwb_huffcode() and cwb_compress_rdx() have argument delete to trigger deleting redundant files after compression (#60).
  • cqp_load_corpus() will internally convert the corpus ID to upper case, as required in the CQP context (#64).
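
The upper-casing of corpus IDs mentioned in the last item can be pictured with a minimal standalone C++ sketch. The helper name to_cqp_corpus_id() is hypothetical and not part of RcppCWB; the sketch only illustrates the conversion, not how the package implements it.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of RcppCWB): return an upper-case copy
// of a corpus ID, the form required in the CQP context.
std::string to_cqp_corpus_id(const std::string& corpus_id) {
    std::string out = corpus_id;
    for (char& ch : out) {
        // Cast to unsigned char first: std::toupper has undefined
        // behaviour for negative char values.
        ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
    }
    return out;
}
```

With this helper, a user-supplied ID such as "reuters" and the upper-case form "REUTERS" resolve to the same name.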

New Features

  • New auxiliary function cwb_charsets() reports the charsets supported by CWB.
  • New functions cl_load_corpus() and cqp_load_corpus() do what the function names suggest.
  • New function cl_list_corpora() complements existing function cqp_list_corpora() for the CL context.
  • New arguments skip_blank_lines, strip_whitespace and xml of cwb_encode() expose configuration options that were previously hard-coded as the equivalent of the command-line option “-xsB” (#38).
  • Unexported functions .cpos_to_id(), .cl_find_corpus() and .cl_new_attribute() are an entry to passing around pointers, rather than re-creating objects whenever switching from R to C.
  • Functions .s_attr() and .p_attr() return pointers for an s- or p-attribute.
  • Functions cl_* are now available with pointer as input (e.g. cpos_to_id()).
  • The CORPUS_REGISTRY environment variable is not set to the temporary registry, to avoid often confusing behavior and collisions when loading RcppCWB and polmineR at the same time (#13).
  • The cqp_drop_subcorpus() function that had been disabled temporarily is usable again (#34).
  • cqp_query() is now able to process subcorpora.
  • RcppCWB:::.cqp_subcorpus() will construct a subcorpus from a region matrix.
  • check_corpus() does not reset the registry directory any more, but tries to load the checked corpus if it has not yet been loaded.
  • A new function s_attr_relationship() will detect whether two s-attributes are siblings, or in a descendent or ancestor relationship.
  • Functions cwb_encode(), cwb_huffcode(), cwb_makeall() and cwb_compress_rdx() now have an argument quietly to control display of output messages. cwb_encode() has an argument verbose to control whether a counter of the number of tokens processed is displayed.
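
What it means for two s-attributes to be siblings or in a descendent/ancestor relationship can be sketched in standalone C++. The sketch rests on assumptions that are not taken from the package: siblings are taken to define identical regions, and x counts as a descendent of y if every region of x lies within a region of y. All names are hypothetical; this is not the implementation of s_attr_relationship().

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Region = std::pair<int, int>;  // inclusive [left, right] corpus positions

enum class Relationship { Sibling, Descendent, Ancestor, Unrelated };

// True if every region in `inner` lies within some region in `outer`.
// Both vectors are assumed to be sorted and non-overlapping.
bool all_contained(const std::vector<Region>& inner,
                   const std::vector<Region>& outer) {
    std::size_t j = 0;
    for (const Region& r : inner) {
        // Skip outer regions that end before the inner region starts.
        while (j < outer.size() && outer[j].second < r.first) ++j;
        if (j == outer.size()) return false;
        if (outer[j].first > r.first || outer[j].second < r.second) return false;
    }
    return true;
}

// Relationship of x to y under the assumptions stated above.
Relationship relationship(const std::vector<Region>& x,
                          const std::vector<Region>& y) {
    if (x == y) return Relationship::Sibling;
    if (all_contained(x, y)) return Relationship::Descendent;
    if (all_contained(y, x)) return Relationship::Ancestor;
    return Relationship::Unrelated;
}
```

For example, sentence regions {0-4, 5-9} would be reported as a descendent of a paragraph region {0-9}, and the paragraph as an ancestor of the sentences.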

Minor improvements

  • Difficulties of cwb_encode() to digest variations of path statements between macOS and Windows are addressed using a reliable normalization of paths with fs::path() (#48).
  • Argument encoding is checked for the validity of the encoding passed in (#34).
  • A patch introducing a sanity check avoids the ‘stringop-overflow’ compiler warning thrown by file cl/cdaccess.c on Windows (#45).
  • An update of Xcode command line developer tools includes flex 2.6.4 Apple(flex-34), and this is the version used now, resulting in extensive code changes in cl/lex.creg.c and cqp/lex.yy.c, yet without causing new errors or changing the functionality.
  • check_cpos() issues a warning if argument cpos is NULL (#21).
  • Functions cl_cpos2id(), cl_cpos2lbound(), cl_cpos2rbound(), cl_cpos2str() and cl_cpos2struc() will return an empty, zero-length integer vector if argument cpos is NULL (#21).
  • Warnings issued by check_corpus() (used internally by many functions) resulted from slightly differing representations of otherwise identical paths. Using fs::path() for path normalization internally avoids these misleading warning messages.
  • cqp_get_registry() will now return a fs::path object, as a safeguard for a consistent normalization of paths.
  • Function cl_delete_corpus() will now (visibly) return a logical value.
  • The check for the availability of ncurses is omitted in the configure file and the editline subdirectory of src/cwb is included in .Rbuildignore to minimize the size of the tarball. The ncurses library is a dependency of editline, but editline is not built in the context of this package (#26).
  • cqp_load_corpus() will return FALSE if corpus has not been loaded successfully.
  • Disaggregated wrappers.cpp into cl.cpp, cqp.cpp and utils.cpp, so that the code is organized more coherently corresponding to the different logics.
  • Function check_cqp_query() renamed to check_query() to avoid a conflict with a function defined in the polmineR package.
  • cqp_list_subcorpora() returns a character vector. Previously, we just had obscure printed messages.
  • s_attribute_decode() will not break if s-attribute has no values (#54).
  • Input to functions cl_struc2str() and cl_struc2cpos() may now include negative values; the vectors returned will have NA values at the respective positions. The check against negative values in check_strucs is dropped accordingly.
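
The NA handling for negative strucs described in the last item can be sketched in standalone C++. The function struc_to_cpos() and the Bounds type are hypothetical stand-ins, not the package's code. The sentinel follows R's convention of representing NA_integer_ as the smallest int value; the additional out-of-range guard is an assumption of the sketch.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// R represents NA_integer_ as the smallest int value; the sketch reuses
// that convention as a sentinel.
const int NA_INT = std::numeric_limits<int>::min();

struct Bounds { int left; int right; };

// Hypothetical stand-in for a struc-to-cpos lookup: regions[i] holds the
// boundaries of struc i. Negative strucs yield NA bounds instead of an
// error; strucs beyond the table are treated the same way (an extra
// guard added for this sketch).
std::vector<Bounds> struc_to_cpos(const std::vector<Bounds>& regions,
                                  const std::vector<int>& strucs) {
    std::vector<Bounds> out;
    out.reserve(strucs.size());
    for (int s : strucs) {
        if (s < 0 || static_cast<std::size_t>(s) >= regions.size()) {
            out.push_back({NA_INT, NA_INT});
        } else {
            out.push_back(regions[static_cast<std::size_t>(s)]);
        }
    }
    return out;
}
```

The output keeps the length and order of the input, so NA entries line up with the negative values that caused them.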

Bug fixes

  • The cwb_encode() function did not declare structural attributes in the registry and mistakenly channeled output for the file to the terminal (#49). Fixed.
  • Re-running cwb_encode() did not reset global variables, which resulted in a set of errors. Solved. (#51)
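
Why stale global variables break a second run can be illustrated with a generic C++ sketch. A command-line tool starts every run with fresh globals, whereas code called repeatedly from a library keeps them between calls unless they are reset explicitly. All names below are hypothetical and the sketch is not the cwb_encode() code.

```cpp
#include <string>
#include <vector>

// File-scope state, standing in for the globals of a C command-line tool.
int g_token_count = 0;
std::vector<std::string> g_declared_attributes;

// Restore every global to its start-up value.
void reset_globals() {
    g_token_count = 0;
    g_declared_attributes.clear();
}

// Hypothetical entry point: records one attribute and counts non-empty
// tokens. Calling reset_globals() first makes repeated calls independent
// of each other.
int encode(const std::vector<std::string>& tokens) {
    reset_globals();
    g_declared_attributes.push_back("word");
    for (const std::string& token : tokens) {
        if (!token.empty()) ++g_token_count;
    }
    return g_token_count;
}
```

Without the call to reset_globals(), the second call would report the tokens of both runs and declare the attribute twice.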

New Features

  • The CWB code is updated to v3.4.33 / r1690 (#29). Automated patches that have been developed are a safeguard that it will be painless in the future to align RcppCWB with upstream CWB development.
  • The C code in the files cwb-huffcode.c, cwb-compress-rdx.c and cwb-makeall.c was not in line with the CWB version of the rest of the code (v3.4.14 / SVN revision 1069) but rather v2.2.b99 or v3.0.0. All code changes up to v3.4.14 were reconstructed and implemented (#35). Note that cwb-encode.c was at CWB v3.4.14, as the encoding functionality was exposed at a later stage.
  • A new function cwb_version() will report the version of the CWB source code.
  • The cwb_encode() function now has a previously missing argument encoding to state the encoding of the corpus to be indexed.
  • Reduced number of example *.vrt-files to one to keep package size below 5 MB.

Minor Improvements

  • Encoding a corpus using cwb_encode() now assumes implicitly that input files are XML files and removes blank lines and leading and trailing whitespace. This is equivalent to the option “-xsB” of the command line utility cwb-encode.
  • The C++ code of cwb_encode() is now a patch of the main() function of cwb-encode.c, so that code in the *.cpp file can be limited to a slim wrapper, limiting the risk that the code in RcppCWB loses touch with CWB upstream development.
  • Header files _eval.h, _globalvars.h and _cl.h in the ./src directory are autogenerated files now, not to be edited by hand.
  • The C++ code of the cqp_drop_subcorpus() function is temporarily disabled to ensure that the package can be built (#34).
  • Fixed a mishandling of paths on Windows in check_corpus() that would trigger resetting the registry unintentionally and potentially falsely.
  • To avoid a compiler warning (unused variable) that originated in Rcpp and was solved by Rcpp v1.0.7, this version of Rcpp is now required (#22).
  • In use_tmp_dir(), normalizePath() is applied on the tempdir() result to avoid confusion with symbolic links on macOS.
  • New unit test for cwb_encode() (not yet run on Windows).
  • A C-level inconsistency in cqp_get_registry() that would sometimes result in a wrong return value (i.e. registry path) has been fixed (#14).
  • To avoid an unintended behavior of cwb_makeall(), an internal check is performed whether the corpus has been loaded already and whether the home directory of the loaded corpus and the one defined in the registry file are identical (#31).
  • The link to the TXM project has been removed from the documentation to avoid the error ‘SSL certificate problem: unable to get local issuer certificate’ (#32).
  • The cl_delete_corpus() function crashed when trying to delete a corpus that has not been loaded (#33). The function now aborts gracefully returning 0 when trying to delete a corpus that has not been loaded.
  • A new function corpus_is_loaded() can be used to check whether a corpus is loaded.
  • Unused file ‘_options.h’ removed from src/cwb/cl/cqp.
  • Targets ‘lex.creg.c’, ‘registry.tab.c’ and ‘registry.tab.h’ removed from cl/Makefile to avoid an unwanted call of flex which is not necessarily present (#30).
  • Windows builds will be linked with a fresh and fully reproducible cross-compilation of CWB static libraries, see the PolMine/libcl repository. The consolidation of the workflow to prepare cross-compiled static libraries is a preparatory step to enable UCRT builds on Windows.
  • The Range struct in the code for util functionality (encode and more, files utils.h, utils.cpp and _cwb_encode.c) has been renamed as SAttrEncoder to avoid a C++ One Definition Rule warning resulting from a struct with the same name in the CL context (#28).
  • A shortcoming when passing in variables into the format string to construct the PKG_LIBS variable resulted in a faulty call of the linker on Solaris and a compilation error. Fixed (#25).
  • A hacky and no longer necessary LDFLAG “-Wl,--allow-multiple-definition” on Solaris has been dropped.
  • Usage and evaluation of the pcretest utility is now in line with POSIX requirements, avoiding an error on Solaris. A statement on the availability of the tool provides information whether it is available at all (#24).
  • The message on the findability of ncurses is more telling now, avoiding a “mission critical”-style alarm when ncurses may be present but is not findable by pkg-config (#26).
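
The graceful handling of a request to delete a corpus that has not been loaded, described above for cl_delete_corpus(), can be pictured with a toy model in standalone C++. The set loaded_corpora and both functions are hypothetical; the real function operates on the CL's internal corpus representation, which the sketch does not attempt to reproduce.

```cpp
#include <set>
#include <string>

// Toy model of the corpora currently held in memory.
std::set<std::string> loaded_corpora;

bool is_loaded(const std::string& corpus) {
    return loaded_corpora.count(corpus) > 0;
}

// Returns 1 if the corpus was dropped and 0 if it had not been loaded,
// rather than operating on a corpus that does not exist.
int delete_corpus(const std::string& corpus) {
    if (!is_loaded(corpus)) return 0;
    loaded_corpora.erase(corpus);
    return 1;
}
```

Checking membership before touching the corpus is the point of the sketch: the second of two consecutive delete requests simply returns 0.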

New Features

  • Encode XML (vrt file format) with new function cwb_encode() that exposes functionality of cwb-encode CWB utility.
  • Functions cl_cpos2lbound() and cl_cpos2rbound() will now accept an integer vector with length > 1 as argument cpos and return a vector with the same length. Useful to speed up iterated queries for left and right boundaries of regions (#19).
  • A new function cl_struc_values() exposes the corresponding C function of the Corpus Library (CL). The previous implicit assumption that all structural attributes have values can thus be tested. Intended to work with annotations of sentences and paragraphs, i.e. common structural attributes that usually do not have values.
  • A new function corpus_data_dir() will derive the data directory from the internal C representation of a corpus.
  • New function s_attr_regions() will derive regions defined by a structural attribute from the *.rng file. Fastest option for large corpora.
  • New functions s_attr_is_sibling() and s_attr_is_descendent() test the sibling/descendent relationship of structural attributes.
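
The benefit of passing a whole vector of corpus positions to a boundary lookup, as described above for cl_cpos2lbound() and cl_cpos2rbound(), can be sketched in standalone C++: one call handles all positions, each resolved by binary search over sorted regions. The function cpos_to_lbound(), the Region type and the use of an NA sentinel for positions outside any region are assumptions of the sketch, not the behaviour of the CL functions.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

const int NA_INT = std::numeric_limits<int>::min();  // mirrors R's NA_integer_

struct Region { int left; int right; };  // inclusive boundaries

// For each cpos, return the left boundary of the region containing it,
// or NA_INT if the position lies outside all regions. `regions` must be
// sorted by `left` and non-overlapping.
std::vector<int> cpos_to_lbound(const std::vector<Region>& regions,
                                const std::vector<int>& cpos) {
    std::vector<int> out;
    out.reserve(cpos.size());
    for (int pos : cpos) {
        // First region whose left boundary is greater than pos.
        auto it = std::upper_bound(
            regions.begin(), regions.end(), pos,
            [](int value, const Region& r) { return value < r.left; });
        if (it == regions.begin()) {
            out.push_back(NA_INT);  // pos precedes the first region
            continue;
        }
        --it;  // candidate: last region starting at or before pos
        out.push_back(pos <= it->right ? it->left : NA_INT);
    }
    return out;
}
```

A right-boundary variant would return it->right instead of it->left in the last step.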

Minor Improvements

  • Function check_corpus() now includes checks whether the registry provided (argument registry) is identical with the registry defined internally by CQP. The registry is reset if directories are not identical.
  • Minor adjustments of configure script for aarch64, adding -fPIC to CFLAGS so that this flag will be used when Linux default configuration is used as fallback.
  • The implementation of the s_attribute_decode() method was incomplete for method “Rcpp”. This alternative to the “pure R” approach is now implemented (#2).
  • The unused file ‘setpaths.R’ has been removed from the tools directory (#10).
  • The argument method previously setting “wininet” in ./tools/winlibs.R is omitted to avoid the warning “the ‘wininet’ method is deprecated for http:// and https:// URLs” on Windows.
  • The configure script will print the libdirs derived using pcre-config and link against libintl on macOS by default.
  • If RcppCWB is compiled on macOS, the package configure script checks the architecture of the machine and ensures that (if glib-2.0 is not yet present) a version of glib-2.0 compiled for Apple Silicon/the M1 chip is loaded in case an arm64 architecture is detected.
  • The package configure script now uses pcre-config to locate header files of PCRE.
  • The configure script checks whether pcre has been compiled with Unicode properties support. If not, a warning is issued that also explains the recommended solution to use ‘--enable-unicode-properties’ when calling configure.
  • To avoid warnings when running R CMD check, the URL http://pcre.org is used rather than https://pcre.org in the DESCRIPTION and the README file.
  • To overcome a somewhat dirty solution for multiple symbol definitions, adding the ‘fcommon’ flag to the CFLAGS in the configure script has been removed. The C code has been modified such that multiple symbol definitions are omitted.
  • The macOS image used for test on Travis CI is now ‘xcode9.4’
  • On Solaris, the configure script would define the flag “-Wl,--allow-multiple-definition” to be passed to the linker flags. The rework of the CWB includes and the inclusion of the header file ‘env.h’ makes it possible to drop this flag. It was defined at a confusing place anyway.
  • The compiler desired by the user (as defined in the Makeconf or Makevars file) is now used on all operating systems.
  • If pkg-config is not present on macOS, a warning is issued; the user gets the advice to use the brew package manager to install pkg-config.
  • There is an explicit check in the configure script whether the dependencies ncurses, pcre and glib-2.0 are present. If not, a telling error with installation instructions is displayed.
  • When unloading the package, the dynamic library RcppCWB.so is unloaded.
  • When loading the package, CQP is initialized by default (call cqp_initialize())
  • Starting with GCC 10, the compiler defaults to -fno-common, resulting in error messages during the linker stage, see the change log of the GCC compiler. To address this issue, the -fcommon option is now used by default when compiling the CWB C files on Linux 64bit systems. The CWB code includes header files multiple times, causing multiple definitions.
  • On Linux systems, the hard-coded definition of the preferred C compiler in the CWB configuration scripts will be replaced by what the CC variable defines (in ~/.R/Makevars or the Makeconf file, the result returned by R CMD config CC).
  • Remaining bashisms have been removed from the cleanup file. The shebang line of the cleanup and the configure file is now #!/bin/sh, to avoid any reliance on bash.
  • There have been (minor) modifications of the C code of the CWB so that compilation succeeds on Solaris.
  • Using the ‘-C’ flag in the CWB Makefiles has been replaced by ‘cd cl’ / ‘cd cqp’ to avoid dependence on GNU make. GNU make is still required, because of ‘include’ statements in the Makefiles.
  • Removed an action on ‘depend.mk’ from ‘cleanup’ script to avoid error messages that depend.mk is not present when Makefiles are first loaded.
  • Dummy depend.mk files will satisfy include statement in Makefiles when running ‘make clean’ (depend.mk files are created only when running depend.mk)
  • For creating the index of static archives (libcl, libcqp, libcwb), a call to ‘ranlib’ has been replaced by an equivalent ‘ar -s’ in the Makefiles, but commented out.
  • In the platform-specific config files of the CWB, the ‘-march’-option has been taken out, to safeguard portability.
  • To meet the requirements of the upcoming changes in the CRAN check process to use staged installs, the procedure to reset the paths in the test data within the package has been replaced throughout by using a temporary registry directory. The get_tmp_registry() will return the whereabouts of this directory.
  • If glib-2.0 is not present on macOS, binaries of the static library and header files are downloaded from a GitHub repo. This prepares to get RcppCWB pass macOS checking on CRAN machines.
  • A slight modification of the C code will now prevent previous crashes resulting from a faulty CQP syntax. The solution will not yet be effective for Windows systems until we have recompiled the libcqp static library that is downloaded during the installation process.
  • A new C++-level function ‘check_corpus’ checks whether a given corpus is available and is used by the check_corpus()-function. Problems with the previous implementation that relied on files in the registry directory to ensure the presence of a corpus hopefully no longer occur.
  • Calling the ‘find_readline.perl’ utility script is omitted on macOS, so previous warning messages when running the makefile do not show up any more.
  • Function cl_charset_name() is exposed, it will return the charset of a corpus. Faster than parsing the registry file again and again.
  • A new cl_delete_corpus()-function can remove loaded corpora from memory.
  • In Makevars.win, libiconv is explicitly linked, to make RcppCWB compatible with new release of Rtools.
  • regex in check_s_attribute() for parsing registry file improved so that it does not produce an error if ‘# [attribute]’ follows after declaration of s_attribute
  • for linux and macOS, CWB 3.4.14 included, so that UTF-8 support is realized
  • bug removed in check_cqp_query that would prevent special characters from working in CQP queries
  • check_strucs, check_cpos and check_id are checking for NAs now to avoid crashes
  • cwb command line tools cwb-makeall, cwb-huffcode and cwb-compress-rdx exposed as cwb_makeall, cwb_huffcode and cwb_compress_rdx
  • when loading the package, a check is performed to make sure that paths in the registry files point to the data files of the sample data (issues may occur when installing binaries)
  • auxiliary functions to check whether input to Rcpp-wrappers/C functions is valid are now exported and documented
  • more consistent validity checks of input to functions for structural attributes
  • Compiling RcppCWB on unix-like systems (macOS, Linux) will work now without the presence of glib (on Windows, the dependency persists).
  • The presence of the bison parser is not required any more. The package includes the C source generated by the bison parser along with the original input files.
  • Functionality to generate CWB-indexed corpora and to generate and manipulate the registry file describing a corpus has been moved to a new package ‘cwbtools’ (see https://www.github.com/PolMine/cwbtools) in order to maintain a clearly defined scope of RcppCWB to expose functionality of the C code of the CWB.
  • Minor intervention in function ‘valid_subcorpus_name’ to omit a -Wtautological-pointer-compare warning leading to a WARNING when checking package for R 3.5.0 with option --as-cran
  • In previous versions the drive of the working directory and of the registry/data directory had to be identical on Windows; this limitation does not persist;
  • Some utility functions could be removed that were necessary to check the identity of the drives of the working directory and the data.
  • In addition to low-level functionality of the corpus library (CL), functions of the Corpus Query Processor (CQP) are exposed, building on C wrappers in the rcqp package;
  • The authors of the rcqp package (Bernard Desgraupes and Sylvain Loiseau) are mentioned as package authors and as authors of functions using CQP, as the code used to expose CQP functionality is a modified version of rcqp code;
  • Extended package description explaining the rationale for developing the RcppCWB package;
  • Documentation of functions has been rearranged, many examples have been included;
  • Renaming of exposed functions of corpus library from cwb_… to cl_…;
  • sanity checks in R wrappers for Rcpp functions.
  • CWB source code included in package to be GPL compliant
  • template to adjust HOME and INFO in registry file used (tools/setpaths.R)
  • using VignetteBuilder has been removed
  • definition of Rprintf in cwb/cl/macros.c
  • now using configure/configure.win script in combination with setpaths.R
  • vignette included that explains cross-compiling CWB for Windows
  • check in struc2str to ensure that structure has attributes
  • Windows compatibility (potentially still limited)
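
The NA checks that several items above attribute to check_strucs, check_cpos and check_id can be illustrated with a standalone C++ sketch of such a validator. The signature, the corpus_size parameter and the error messages are hypothetical; the sketch only shows the idea of rejecting NA or out-of-range input before it reaches C code that would crash on it.

```cpp
#include <limits>
#include <stdexcept>
#include <vector>

const int NA_INT = std::numeric_limits<int>::min();  // mirrors R's NA_integer_

// Hypothetical validator: throws if any corpus position is NA, negative,
// or not below corpus_size; returns true otherwise.
bool check_cpos(const std::vector<int>& cpos, int corpus_size) {
    for (int pos : cpos) {
        if (pos == NA_INT) {
            throw std::invalid_argument("cpos contains NA");
        }
        if (pos < 0 || pos >= corpus_size) {
            throw std::out_of_range("cpos out of range");
        }
    }
    return true;
}
```

Testing for NA before the range check keeps the two failure causes apart, since the NA sentinel would otherwise be reported as a plain negative position.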