cwbtools 0.3.1 Unreleased


  • The (weak) dependency on the polmineR package (it was in the ‘Suggests:’ section of the DESCRIPTION file) has been removed. Changes are purely internal (higher-level polmineR functions have been replaced by lower-level RcppCWB functions, some tests were re-written). Dropping the dependency has the advantage that there is a much clearer structure of dependencies now (RcppCWB -> cwbtools -> polmineR).


  • A remaining CLI formatting issue has been removed from the user dialogue for modifying the .Renviron file.
  • Unit tests used a test download of the United Nations General Assembly (UNGA) corpus from Zenodo. To reduce the time required for testing the package, a test download of the (much smaller) GermaParlSample copus is performed.


  • The corpus_install() function tried to ask for user feedback when not in an interactive session. The function now checks whether it is possible to ask for user feedback.
  • Part of the output of the cwbtools::create_cwb_directories() function did show if verbose was FALSE. Fixed.

cwbtools 0.3.0 2020-07-09


  • The corpus_install() gives much better and nicer reports on steps performed during corpus downloads. User dialogues have been reworked thoroughly to provide better user guidance.
  • The use_corpus_registry_envvar() function is called by corpus_install() and will amend the .Renviron file as appropriate if the user so desires.
  • To resolve a DOI, the ‘zen4R’ package is used, to extract information on the whereabouts of a corpus tarball efficiently from the Zenodo API.
  • A corpus_testload() has been implemented to check whether a (newly installed) corpus is accessible.


  • Extracting the version number from the corpus tarball is somewhat more forgiving if the version number does not start with “v”.
  • The registry file for a newly downloaded corpus is refreshed only if a temporary registry directory is used.
  • To remedy the fairly common error that the path to the info file is not stated correctly in the registry file, a fallback mechanism will look up potential alternatives to an info file stated wrongly.


  • The json string returned from Zenodo may include newline strings that are escaped such that they cannot be processed by jsonlite::fromJSON(). The auxiliary function to get and process information from Zenodo now ensures that newline characters are escaped such that they can be processed.
  • The corpus_copy() function did not set the path to the info file to the new data directory - corrected.
  • The corpus_install() function failed when the registry_dir got a NULL value from the default call to cwbtools::cwb_registry_dir(). But if the directories are created, the registry directory is there. Fixed.
  • Removed a bug (faulty assignment) that would prevent that the path of a registry file is handled correctly (i.e. wrapped in quotation marks) by registry_file_compose() when the path includes any whitespace characters.


cwbtools 0.2.0 2020-04-14


  • The install_corpus() function has been reworked thoroughly. Using system directories for the registry and the corpus directory is now supported. This is a prerequisite that corpora can be installed outside of R packages Installing corpora within corpora is not allowed by CRAN.
  • A set of new auxiliary functions (cwb_directories(), cwb_registry_dir(), cwb_corpus_dir()) will get the whereabouts of the registry directory and the corpus directory. In particular, they consider that the polmineR package may have generated a temporary corpus registry, resetting the CORPUS_REGISTRY environment variable.
  • The install_corpus() function accepts an argument doi to provide a Document Object Identifier (DOI). At this stage, the DOI is assumed to be awarded by Zenodo. Information available at the Zenodo site will be resolved to get the URL of a corpus tarball that can be downloaded. Upon installing a corpus from Zenodo, the DOI and the version number will be written as corpus properties into the registry file.
  • To avoid removing corpora accidentally, the corpus_install() function will ask the user for feedback if a corpus would be installed that is already present and that would be deleted or overwritten.
  • New auxiliary functions create_cwb_directories and use_corpus_registry_envvar() will assist users to create the required directory structure for CWB indexed corpora.


  • The default value of the argument “repo” that defines the repository for packaged corpora is now the drat repository of the PolMine GitHub account (“”).


  • New R6 Roxygen documentation used for documenting the CorpusData class.
  • A (preliminary) vignette has been added that explains how to add a sentence annotation can be added to an existing indexed corpus.

cwbtools 0.1.2 2019-12-17


  • Trying to remove the entire temporary session directory at the end of the package vignettes caused problems to build the package documentation. A more limited approach to clean up temporary files after build the vignettes will omit this problem.

cwbtools 0.1.1 2019-12-09


  • The pkg_add_corpus() function will now create the cwb directories (registry and data directory) if necessary. Previously, these directories were required to exist before moving a corpus into a package, making it necessary to put dummy files into packages to keep R CMD build from issuing warnings and git from dropping these directories. Creating the directories on demand is a precondition for a CRAN release of data packages (#11).


  • In the upcoming R version 4.0, the matrix class will inherit from class array. The new package version now takes into account that length(class(matrix(1:4,2,2))) will return the value 2.


  • The NEWS file now follows the styleguide such that pkgdown::build_site() will generate a proper changelog page.

cwbtools 0.1.0 2019-10-21

  • updated vignette so that annex explains installation of CoreNLP v3.9.2 (2018-10-05)
  • New functions s_atttribute_get_regions() and s_attribute_get_values().
  • In corpus_install(), using download.file() replaces curl::curl_download() for Windows because curl apparently is not able to process target filenames that include special characters.
  • For Windows machines, there is a check for non-ASCII characters in the file path. If TRUE, a path generated by a call to shortPathName() is used.
  • In the vignette, the registry is reset after creating the new corpora, to make the new corpus available.
  • A (preliminary) decode()-method will turn a partition into an Annotation object from the NLP package.
  • A new conll_get_regions()-function will turn an CoNLL-style annotated token stream into a table with regions that can be encoded using s_attribute_encode().
  • A new function s_attribute_merge() will merge two data.table objects defining s-attributes, checking for overlaps.

cwbtools 0.0.11 Unreleased

cwbtools 0.0.10 Unreleased

  • Missing documentation written for fields of class CorpusData.
  • New fields ‘sentences’ and ‘named_entities’ added to class CorpusData, as a basis for encoding annotation of sentences and named entities.

cwbtools 0.0.9 Unreleased

  • issue with parsing path correctly in registry_file_path when path is in inverted commas solved (adjusted regex)
  • issue with ALTREP vector for corpus positions resolved
  • layout of progress bars consistently using pbapply package
  • sanity checks for s_attribute_encode, ensure that region_matrix is integer matrix
  • s_attribute_encode when called with method = “R” will now add s_attribute to registry
  • s_attribute_encode will add structural attribute to registry when using R implementation, too
  • corpus_as_tarball-function added
  • install_corpus able to install from tarball
  • progress option for CorpusData$import_xml()-method
  • Minimal rework of progress bar in CorpusData$add_corpus_positions() (helper function .fn)
  • Three dots (…) are passed into download.file() by install_corpus(), if argument tarball is specified. This is a precondition for passing arguments to download password-protected corpora.

cwbtools 0.0.8 Unreleased

  • major bug removed when writing regions to disk (s_attribute_encode) with R
  • when creating/removing files in p_attribute_encode, only basenames of filenames are outputted
  • for CorpusData$encode(), an already existing corpus will be removed

cwbtools 0.0.7 Unreleased

  • bug removed in function pkg_create_cwb_dirs causing error when a directory already exists
  • new vignette ‘europarl’: sample workflow for putting indexed corpus into package
  • for $tokenize()-method of CorpusData: stricter requirement that chunkdata is data.table
  • progress bar for $tokenize()-method, when tokenizers package is used
  • tilde expansion for paths that are passed into p_attribute_encode
  • stri_detect_regex replacing grepl to speed things up in p_attribute_encode
  • awful workaround for coping with latin1 removed in p_attribute_encode
  • stip_punct = FALSE for $tokenize() method of CorpusData
  • purging the data for the CWB has been moved away from p_attribute_encode to a $purge()-method of CorpusData (to be performed on chunkdata) as a matter of efficiency.
  • continuous removal of objects and garbage collection in p_attribute_encode to be as parsimonious with memory as possible
  • checking of encoding in p_attribute_encode has been moved to $check_encoding() method in CorpusData-class to keep necessity to copy around vectors (potentially exceeding memory) to a minimum.
  • additional parameters passed into tokenizers::tokenize_words by …
  • writing hex for content of s_attributes to cope with encoding issues
  • values coerced to character

cwbtools 0.0.6 Unreleased

  • DataPackage class turned into pkg_*-functions
  • first version that passes all tests

cwbtools 0.0.5 Unreleased

  • undocumented

cwbtools 0.0.4 Unreleased

  • askYesNo function has been replaced by readlines(), to ensure compatibility with R versions < 3.5