Utility functions to assist the installation and management of indexed CWB corpora.

corpus_install(
  pkg = NULL,
  repo = "https://PolMine.github.io/drat/",
  tarball = NULL,
  doi = NULL,
  checksum = NULL,
  lib = .libPaths()[1],
  registry_dir,
  corpus_dir,
  ask = interactive(),
  load = TRUE,
  verbose = TRUE,
  user = NULL,
  password = NULL,
  ...
)

corpus_packages()

corpus_rename(
  old,
  new,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  verbose = TRUE
)

corpus_remove(corpus, registry_dir, ask = interactive(), verbose = TRUE)

corpus_as_tarball(
  corpus,
  registry_dir,
  data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
  tarfile,
  verbose = TRUE
)

corpus_copy(
  corpus,
  registry_dir,
  data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
  registry_dir_new = fs::path(tempdir(), "cwb", "registry"),
  data_dir_new = fs::path(tempdir(), "cwb", "indexed_corpora", tolower(corpus)),
  remove = FALSE,
  verbose = interactive(),
  progress = TRUE
)

corpus_recode(
  corpus,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
  skip = character(),
  to = c("latin1", "UTF-8"),
  verbose = TRUE
)

corpus_testload(
  corpus,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  verbose = TRUE
)

corpus_get_version(corpus, registry_dir = Sys.getenv("CORPUS_REGISTRY"))

corpus_reload(corpus, registry_dir, verbose = TRUE)

Arguments

pkg

Name of a package (length-one character vector).

repo

URL of the repository.

tarball

URL, S3-URI or local filename of a tarball with a CWB indexed corpus. If NULL (default) and argument doi is stated, the whereabouts of a corpus tarball will be derived from DOI.

doi

The DOI (Digital Object Identifier) of a corpus deposited at Zenodo (e.g. "10.5281/zenodo.3748858").

checksum

A length-one character vector with an MD5 checksum used to check the integrity of a downloaded tarball. If the tarball is downloaded from Zenodo by stating a DOI (argument doi), the checksum included in the metadata for the record is used for the check.

lib

Directory for R packages, defaults to .libPaths()[1].

registry_dir

The corpus registry directory. If missing, the result of cwb_registry_dir().

corpus_dir

The directory that contains the data directories of indexed corpora. If missing, the value of cwb_corpus_dir() will be used.

ask

A logical value, whether to ask user for confirmation before removing a corpus.

load

A logical value, whether to load corpus after installation.

verbose

Logical, whether to be verbose.

user

A user name that can be specified to download a corpus from a password protected site.

password

A password that can be specified to download a corpus from a password protected site.

...

Further parameters that will be passed into download.file(), if tarball is specified.

old

Name of the (old) corpus.

new

Name of the (new) corpus.

corpus

The ID of a CWB indexed corpus (in upper case).

data_dir

The data directory where the files of the CWB corpus live.

tarfile

Filename of tarball.

registry_dir_new

Target directory for (new) registry files.

data_dir_new

Target directory for corpus files.

remove

A logical value, whether to remove original files after having created the copy.

progress

Logical, whether to show a progress bar.

skip

A character vector with s_attributes to skip.

to

Character string describing the target encoding of the corpus.

Value

Logical value TRUE if installation has been successful, or FALSE if not.

Details

A CWB corpus consists of a set of binary files with corpus data kept together in a data directory, and a registry file, which is a plain text file that details the corpus id, corpus properties, structural and positional attributes. The registry file also specifies the path to the corpus data directory. Typically, the registry directory and a corpus directory with the data directories for individual corpora are within one parent folder (which might be called "cwb" by default). See the following stylized directory structure.


  .
  |- registry/
  |  |- corpus1
  |  +- corpus2
  |
  +- indexed_corpora/
    |- corpus1/
    |  |- file1
    |  |- file2
    |  +- file3
    |
    +- corpus2/
       |- file1
       |- file2
       +- file3
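
A minimal sketch in base R of how a layout like the one above could be created under a temporary directory. The parent folder name "cwb_demo" and the corpus names corpus1 and corpus2 are placeholders for illustration only, and no registry files or corpus data are written.

```r
# Hypothetical parent folder; "cwb_demo" is an illustrative name only
cwb_dir <- file.path(tempdir(), "cwb_demo")
registry_dir <- file.path(cwb_dir, "registry")
corpus_dir <- file.path(cwb_dir, "indexed_corpora")

# One registry directory, and one data directory per (placeholder) corpus
dir.create(registry_dir, recursive = TRUE, showWarnings = FALSE)
for (id in c("corpus1", "corpus2")) {
  dir.create(file.path(corpus_dir, id), recursive = TRUE, showWarnings = FALSE)
}

# Inspect the resulting directory tree
list.dirs(cwb_dir, full.names = FALSE, recursive = TRUE)
```

Paths built this way correspond to what the registry_dir and corpus_dir arguments describe: the registry directory, and the directory that contains the data directories of the individual corpora.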

The corpus_install() function will assist the installation of a corpus. The following scenarios are offered:

  • If argument tarball is a local tarball, the tarball will be extracted and files will be moved.

  • If tarball is a URL, the tarball will be downloaded from the online location. It is possible to state user credentials using the arguments user and password. Then the aforementioned installation (scenario 1) is executed. If argument pkg is the name of an installed package, corpus files will be moved into this package.

  • If argument doi is a Digital Object Identifier (DOI), the URL from which a corpus tarball can be downloaded is derived from the information available at that location. The tarball is downloaded and the corpus installed. If argument pkg is defined, files will be moved into an R package; otherwise the system registry and corpus directories are used. Note that at this stage, it is assumed that the DOI has been awarded by Zenodo.

  • If argument pkg is provided and tarball is NULL, corpora included in the package will be installed as system corpora, using the storage location specified by registry_dir.
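
For the download scenarios above, the checksum argument is described as an MD5 checksum used to verify the downloaded tarball. The following sketch shows what such a comparison could look like with tools::md5sum() from base R. The helper checksum_matches() is hypothetical and not part of the package, and the temporary file merely stands in for a real tarball.

```r
# Hypothetical helper, not part of the package: compare the MD5 sum of a
# file against an expected value (case-insensitive)
checksum_matches <- function(path, expected) {
  observed <- unname(tools::md5sum(path))  # NA if the file does not exist
  !is.na(observed) && identical(tolower(observed), tolower(expected))
}

# Stand-in for a downloaded tarball
tarball <- tempfile(fileext = ".tar.gz")
writeLines("placeholder content", tarball)

# In practice the expected value would come from the corpus provider
expected <- unname(tools::md5sum(tarball))

checksum_matches(tarball, expected)          # matching checksum
checksum_matches(tarball, "not-a-checksum")  # mismatching checksum
```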

If the corpus to be installed is already available, a dialogue will ask the user whether an existing corpus shall be deleted and installed anew, if argument ask is TRUE.

corpus_packages() will detect the packages that include CWB corpora. Note that the directory structure of all installed packages is evaluated, which may be slow on network-mounted file systems.

corpus_rename() will rename a corpus, affecting the name of the registry file, the corpus id, and the name of the directory where data files reside.

corpus_remove() can be used to delete a corpus.

corpus_as_tarball() will create a tarball (.tar.gz-file) with two subdirectories. The 'registry' subdirectory will host the registry file for the tarred corpus. The data files will be put in a subdirectory with the corpus name in the 'indexed_corpora' subdirectory.
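
To illustrate the tarball layout described above, the following base R sketch builds a .tar.gz file with a 'registry' and an 'indexed_corpora' subdirectory and lists its contents. It does not call corpus_as_tarball(). The corpus name "mycorpus" and the placeholder files are assumptions made for the example.

```r
# Build a stand-in directory tree with placeholder files
root <- file.path(tempdir(), "tarball_demo")
dir.create(file.path(root, "registry"), recursive = TRUE, showWarnings = FALSE)
dir.create(
  file.path(root, "indexed_corpora", "mycorpus"),
  recursive = TRUE, showWarnings = FALSE
)
writeLines("## placeholder registry file", file.path(root, "registry", "mycorpus"))
writeLines("placeholder", file.path(root, "indexed_corpora", "mycorpus", "file1"))

# Create the tarball from within 'root' so that archive paths are relative
tarfile <- file.path(tempdir(), "mycorpus.tar.gz")
old_wd <- setwd(root)
utils::tar(
  tarfile,
  files = c("registry", "indexed_corpora"),
  compression = "gzip",
  tar = "internal"
)
setwd(old_wd)

# List the archive contents without extracting
contents <- utils::untar(tarfile, list = TRUE, tar = "internal")
contents
```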

corpus_copy() will create a copy of a corpus (useful for experimental modifications, for instance).

corpus_get_version() parses the registry file and derives the corpus version number from the corpus properties. The return value is a numeric_version class object. The corpus version is expected to follow semantic versioning (three dot-separated numbers, e.g. '0.8.1'). If the corpus version has another format or if it is not available, the return value is NA.
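
Because the return value is a numeric_version object, it can be compared against a minimum version. The sketch below uses only base R to show how such objects behave, including the NA case. It does not reproduce how the function itself parses the registry file, and the version strings are made up for illustration.

```r
# Parse version strings leniently: invalid input becomes NA instead of an error
v_ok <- numeric_version("0.8.1", strict = FALSE)
v_bad <- numeric_version("v2020-final", strict = FALSE)

is.na(v_ok)
is.na(v_bad)

# numeric_version objects are compared component by component
if (!is.na(v_ok) && v_ok >= "0.8.0") {
  message("corpus version is at least 0.8.0")
}
```

Checking is.na() before comparing avoids acting on a corpus whose version is missing or does not follow the expected format.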

corpus_reload() will unload a corpus if necessary and reload it. Useful to make new features of a corpus available after modification. Returns logical value TRUE if successful, FALSE if not.

See also

For managing registry files, including switching to a packaged corpus, see registry_file_parse.

Examples

registry_file_new <- fs::path(tempdir(), "cwb", "registry", "reuters")
if (file.exists(registry_file_new)) file.remove(registry_file_new)
corpus_copy(
  corpus = "REUTERS",
  registry_dir = system.file(package = "RcppCWB", "extdata", "cwb", "registry"),
  data_dir = system.file(
    package = "RcppCWB",
    "extdata", "cwb", "indexed_corpora", "reuters"
  )
)
unlink(fs::path(tempdir(), "cwb"), recursive = TRUE)

# Copy the REUTERS corpus to a temporary location, recode it to UTF-8 and query it
corpus <- "REUTERS"
pkg <- "RcppCWB"
s_attr <- "places"
Q <- '"oil"'

registry_dir_src <- system.file(package = pkg, "extdata", "cwb", "registry")
data_dir_src <- system.file(package = pkg, "extdata", "cwb", "indexed_corpora", tolower(corpus))

registry_dir_tmp <- fs::path(tempdir(), "cwb", "registry")
registry_file_tmp <- fs::path(registry_dir_tmp, tolower(corpus))
data_dir_tmp <- fs::path(tempdir(), "cwb", "indexed_corpora", tolower(corpus))

if (file.exists(registry_file_tmp)) file.remove(registry_file_tmp)
if (!dir.exists(data_dir_tmp)) {
  dir.create(data_dir_tmp, recursive = TRUE)
} else {
  if (length(list.files(data_dir_tmp)) > 0L)
    file.remove(list.files(data_dir_tmp, full.names = TRUE))
}

corpus_copy(
  corpus = corpus,
  registry_dir = registry_dir_src,
  data_dir = data_dir_src,
  registry_dir_new = registry_dir_tmp,
  data_dir_new = data_dir_tmp
)

RcppCWB::cl_charset_name(corpus = corpus, registry = registry_dir_tmp)
#> [1] "latin1"

corpus_recode(
  corpus = corpus,
  registry_dir = registry_dir_tmp,
  data_dir = data_dir_tmp,
  to = "UTF-8"
)
#> Recoding s-attribute: id
#> Recoding s-attribute: topics_cat
#> Recoding s-attribute: places
#> Recoding s-attribute: language
#> Recoding p-attribute: word

RcppCWB::cl_delete_corpus(corpus = corpus, registry = registry_dir_tmp)
#> [1] TRUE
RcppCWB::cqp_initialize(registry_dir_tmp)
#> Warning: CQP has already been initialized. Re-initialization is not possible. Only resetting registry.
#> [1] TRUE
RcppCWB::cl_charset_name(corpus = corpus, registry = registry_dir_tmp)
#> [1] "utf8"

n_strucs <- RcppCWB::cl_attribute_size(
  corpus = corpus, attribute = s_attr, attribute_type = "s", registry = registry_dir_tmp
)
strucs <- 0L:(n_strucs - 1L)
struc_values <- RcppCWB::cl_struc2str(
  corpus = corpus, s_attribute = s_attr, struc = strucs, registry = registry_dir_tmp
)
places <- unique(struc_values)

Sys.setenv("CORPUS_REGISTRY" = registry_dir_tmp)
if (RcppCWB::cqp_is_initialized()) RcppCWB::cqp_reset_registry() else RcppCWB::cqp_initialize()
#> [1] TRUE
RcppCWB::cqp_query(corpus = corpus, query = Q)
#> <pointer: 0x55b538d70b60>
cpos <- RcppCWB::cqp_dump_subcorpus(corpus = corpus)
ids <- RcppCWB::cl_cpos2id(
  corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, cpos = cpos
)
str <- RcppCWB::cl_id2str(
  corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, id = ids
)
unique(str)
#> [1] "oil"

unlink(fs::path(tempdir(), "cwb"), recursive = TRUE)