Utility functions to assist the installation and management of indexed CWB corpora.
corpus_install(
  pkg = NULL,
  repo = "https://PolMine.github.io/drat/",
  tarball = NULL,
  doi = NULL,
  checksum = NULL,
  lib = .libPaths()[1],
  registry_dir,
  corpus_dir,
  ask = interactive(),
  load = TRUE,
  verbose = TRUE,
  user = NULL,
  password = NULL,
  ...
)

corpus_packages()

corpus_rename(
  old,
  new,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  verbose = TRUE
)

corpus_remove(corpus, registry_dir, ask = interactive(), verbose = TRUE)

corpus_as_tarball(
  corpus,
  registry_dir,
  data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
  tarfile,
  verbose = TRUE
)

corpus_copy(
  corpus,
  registry_dir,
  data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
  registry_dir_new = fs::path(tempdir(), "cwb", "registry"),
  data_dir_new = fs::path(tempdir(), "cwb", "indexed_corpora", tolower(corpus)),
  remove = FALSE,
  verbose = interactive(),
  progress = TRUE
)

corpus_recode(
  corpus,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
  skip = character(),
  to = c("latin1", "UTF-8"),
  verbose = TRUE
)

corpus_testload(
  corpus,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  verbose = TRUE
)

corpus_get_version(corpus, registry_dir = Sys.getenv("CORPUS_REGISTRY"))

corpus_reload(corpus, registry_dir, verbose = TRUE)
pkg: Name of a package (length-one character vector).
repo: URL of the repository.
tarball: URL, S3-URI or local filename of a tarball with a CWB indexed corpus. If NULL (default) and argument doi is stated, the location of the corpus tarball will be derived from the DOI.
doi: The DOI (Digital Object Identifier) of a corpus deposited at Zenodo (e.g. "10.5281/zenodo.3748858").
checksum: A length-one character vector with an MD5 checksum to check the integrity of a downloaded tarball. If the tarball is downloaded from Zenodo by stating a DOI (argument doi), the checksum included in the metadata for the record is used for the check.
lib: Directory for R packages, defaults to .libPaths()[1].
registry_dir: The corpus registry directory. If missing, the result of cwb_registry_dir() is used.
corpus_dir: The directory that contains the data directories of indexed corpora. If missing, the value of cwb_corpus_dir() will be used.
ask: A logical value, whether to ask the user for confirmation before removing a corpus.
load: A logical value, whether to load the corpus after installation.
verbose: Logical, whether to be verbose.
user: A user name that can be specified to download a corpus from a password-protected site.
password: A password that can be specified to download a corpus from a password-protected site.
...: Further parameters that will be passed into download.file(), if tarball is specified.
old: Name of the (old) corpus.
new: Name of the (new) corpus.
corpus: The ID of a CWB indexed corpus (in upper case).
data_dir: The data directory where the files of the CWB corpus live.
tarfile: Filename of tarball.
registry_dir_new: Target directory for (new) registry files.
data_dir_new: Target directory for corpus files.
remove: A logical value, whether to remove the original files after having created the copy.
progress: Logical, whether to show a progress bar.
skip: A character vector with s_attributes to skip.
to: Character string describing the target encoding of the corpus.
Logical value TRUE if installation has been successful, or FALSE if not.
A CWB corpus consists of a set of binary files with corpus data kept together in a data directory, and a registry file, which is a plain text file that details the corpus id, corpus properties, and structural and positional attributes. The registry file also specifies the path to the corpus data directory. Typically, the registry directory and a corpus directory with the data directories for individual corpora are within one parent folder (which might be called "cwb" by default). See the following stylized directory structure.
.
|- registry/
|  |- corpus1
|  +- corpus2
|
+- indexed_corpora/
   |- corpus1/
   |  |- file1
   |  |- file2
   |  +- file3
   |
   +- corpus2/
      |- file1
      |- file2
      +- file3
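The layout above can be reproduced with a few lines of base R. This is only a sketch with placeholder files, not a real CWB corpus: the names corpus1, file1 etc. are taken from the stylized tree, and the parent folder name "cwb_layout_demo" is an assumption made for illustration.

```r
# Build the stylized layout in a temporary directory (placeholder files only).
cwb_dir <- file.path(tempdir(), "cwb_layout_demo")  # hypothetical parent folder
unlink(cwb_dir, recursive = TRUE)

registry_dir <- file.path(cwb_dir, "registry")
corpus_dir <- file.path(cwb_dir, "indexed_corpora")
dir.create(registry_dir, recursive = TRUE)

for (corpus in c("corpus1", "corpus2")) {
  # One data directory per corpus ...
  data_dir <- file.path(corpus_dir, corpus)
  dir.create(data_dir, recursive = TRUE)
  file.create(file.path(data_dir, c("file1", "file2", "file3")))
  # ... and one registry file per corpus, named after the corpus.
  file.create(file.path(registry_dir, corpus))
}

# Relative paths of all files below the parent folder
layout <- list.files(cwb_dir, recursive = TRUE)
print(layout)
```

The registry directory and the directory with the data directories sit side by side, which is the arrangement the arguments registry_dir and corpus_dir refer to.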
The corpus_install() function will assist the installation of a corpus. The following scenarios are offered:
1. If argument tarball is a local tarball, the tarball will be extracted and files will be moved.
2. If tarball is a URL, the tarball will be downloaded from the online location. It is possible to state user credentials using the arguments user and password. Then the aforementioned installation (scenario 1) is executed. If argument pkg is the name of an installed package, corpus files will be moved into this package.
3. If argument doi is a Digital Object Identifier (DOI), the URL from which a corpus tarball can be downloaded is derived from the information available at that location. The tarball is downloaded and the corpus installed. If argument pkg is defined, files will be moved into an R package; otherwise, the system registry and corpus directories are used. Note that at this stage, it is assumed that the DOI has been awarded by Zenodo.
4. If argument pkg is provided and tarball is NULL, corpora included in the package will be installed as system corpora, using the storage location specified by registry_dir.
If the corpus to be installed is already available and argument ask is TRUE, a dialogue will ask the user whether the existing corpus shall be deleted and installed anew.
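The checksum argument of corpus_install() is described as an MD5 checksum used to check the integrity of a downloaded tarball. How such a check can work is sketched below with base R only. The helper verify_md5() is hypothetical and not part of the package, and the "tarball" is a small stand-in file rather than a real corpus archive.

```r
# Compare the MD5 checksum of a file against an expected value.
# `verify_md5()` is a hypothetical helper, not a function of the package.
verify_md5 <- function(path, expected) {
  actual <- unname(tools::md5sum(path))
  identical(tolower(actual), tolower(expected))
}

# Stand-in for a downloaded tarball: a file holding the bytes "hello world".
tarball <- tempfile(fileext = ".tar.gz")
writeBin(charToRaw("hello world"), tarball)

ok <- verify_md5(tarball, "5eb63bbbe01eeed093cb22bb8f5acdc3")   # matching checksum
bad <- verify_md5(tarball, "00000000000000000000000000000000")  # mismatch
print(c(ok = ok, bad = bad))
```

A mismatch indicates that the file is not the one the checksum was computed for, for instance because a download was truncated.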
corpus_packages() will detect the packages that include CWB corpora. Note that the directory structure of all installed packages is evaluated, which may be slow on network-mounted file systems.
corpus_rename() will rename a corpus, affecting the name of the registry file, the corpus id, and the name of the directory where data files reside.
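To make the three changes concrete, here is a sketch on a mock layout using base R. It is not the implementation of corpus_rename(): the helper rename_mock() is hypothetical, the two-line registry file is heavily simplified, and updating the HOME line is an assumption that follows from the registry file specifying the path to the data directory.

```r
# Mock layout in a temporary directory (not a real CWB corpus).
cwb_dir <- file.path(tempdir(), "cwb_rename_demo")
unlink(cwb_dir, recursive = TRUE)
registry_dir <- file.path(cwb_dir, "registry")
data_root <- file.path(cwb_dir, "indexed_corpora")
dir.create(registry_dir, recursive = TRUE)
dir.create(file.path(data_root, "old"), recursive = TRUE)

# Simplified two-line registry file; real registry files contain more fields.
writeLines(
  c("ID old", paste("HOME", file.path(data_root, "old"))),
  file.path(registry_dir, "old")
)

# Hypothetical helper mimicking the three changes on the mock layout.
rename_mock <- function(old, new, registry_dir, data_root) {
  registry <- readLines(file.path(registry_dir, old))
  # 1. Update the corpus id and the path to the data directory.
  registry[startsWith(registry, "ID ")] <- paste("ID", new)
  registry[startsWith(registry, "HOME ")] <- paste("HOME", file.path(data_root, new))
  # 2. Write the registry file under its new name and drop the old one.
  writeLines(registry, file.path(registry_dir, new))
  file.remove(file.path(registry_dir, old))
  # 3. Rename the data directory.
  file.rename(file.path(data_root, old), file.path(data_root, new))
}

rename_mock("old", "new", registry_dir, data_root)
registry_new <- readLines(file.path(registry_dir, "new"))
print(registry_new)
```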
corpus_remove() can be used to delete a corpus.
corpus_as_tarball() will create a tarball (.tar.gz-file) with two subdirectories. The 'registry' subdirectory will host the registry file for the tarred corpus. The data files will be put in a subdirectory with the corpus name in the 'indexed_corpora' subdirectory.
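The tarball layout described here can be illustrated with base R's utils::tar() and utils::untar(). The sketch below archives a mock corpus called "demo" (a made-up name) that consists of placeholder files. It does not call corpus_as_tarball() and is not meant to mirror its implementation.

```r
# Mock corpus "demo" in a temporary directory (placeholder files only).
src_dir <- file.path(tempdir(), "cwb_tar_demo")
unlink(src_dir, recursive = TRUE)
dir.create(file.path(src_dir, "registry"), recursive = TRUE)
dir.create(file.path(src_dir, "indexed_corpora", "demo"), recursive = TRUE)
writeLines("placeholder", file.path(src_dir, "registry", "demo"))
writeLines("placeholder", file.path(src_dir, "indexed_corpora", "demo", "file1"))

tarfile <- file.path(tempdir(), "demo.tar.gz")

# Archive relative to src_dir so that paths inside the tarball start with
# 'registry/' and 'indexed_corpora/'.
old_wd <- setwd(src_dir)
utils::tar(
  tarfile,
  files = c("registry", "indexed_corpora"),
  compression = "gzip",
  tar = "internal"
)
setwd(old_wd)

# List the archive contents without extracting them.
contents <- utils::untar(tarfile, list = TRUE, tar = "internal")
print(contents)
```

A tarball with this layout has the two top-level subdirectories that the local-tarball scenario of corpus_install() works with.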
corpus_copy() will create a copy of a corpus (useful for experimental modifications, for instance).
corpus_get_version() parses the registry file and derives the corpus version number from the corpus properties. The return value is a numeric_version class object. The corpus version is expected to follow semantic versioning (three digits, e.g. '0.8.1'). If the corpus version has another format or if it is not available, the return value is NA.
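The behaviour described for the version string can be sketched with base R's numeric_version(). The helper parse_corpus_version() below is hypothetical. It only illustrates the rule "three numeric components, otherwise NA" and is not the code used by corpus_get_version().

```r
# Hypothetical helper: turn a version string into a numeric_version object,
# returning NA when the string is not of the form 'major.minor.patch'.
parse_corpus_version <- function(x) {
  if (length(x) != 1L || is.na(x)) return(NA)
  if (!grepl("^[0-9]+\\.[0-9]+\\.[0-9]+$", x)) return(NA)
  numeric_version(x)
}

v <- parse_corpus_version("0.8.1")
class(v)
v >= "0.8.0"  # numeric_version objects can be compared

# A version in another format, or a missing version, yields NA
other <- parse_corpus_version("2020-04-01")
missing_version <- parse_corpus_version(NA_character_)
```

Returning a numeric_version object rather than a plain string means that versions can be compared component by component, so that '0.10.0' sorts after '0.9.0'.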
corpus_reload() will unload a corpus if necessary and reload it. Useful to make new features of a corpus available after modification. Returns logical value TRUE if successful, FALSE if not.
For managing registry files, see registry_file_parse().
registry_file_new <- fs::path(tempdir(), "cwb", "registry", "reuters")
if (file.exists(registry_file_new)) file.remove(registry_file_new)
corpus_copy(
  corpus = "REUTERS",
  registry_dir = system.file(package = "RcppCWB", "extdata", "cwb", "registry"),
  data_dir = system.file(
    package = "RcppCWB",
    "extdata", "cwb", "indexed_corpora", "reuters"
  )
)
unlink(fs::path(tempdir(), "cwb"), recursive = TRUE)
corpus <- "REUTERS"
pkg <- "RcppCWB"
s_attr <- "places"
Q <- '"oil"'
registry_dir_src <- system.file(package = pkg, "extdata", "cwb", "registry")
data_dir_src <- system.file(package = pkg, "extdata", "cwb", "indexed_corpora", tolower(corpus))
registry_dir_tmp <- fs::path(tempdir(), "cwb", "registry")
registry_file_tmp <- fs::path(registry_dir_tmp, tolower(corpus))
data_dir_tmp <- fs::path(tempdir(), "cwb", "indexed_corpora", tolower(corpus))
if (file.exists(registry_file_tmp)) file.remove(registry_file_tmp)
if (!dir.exists(data_dir_tmp)) {
  dir.create(data_dir_tmp, recursive = TRUE)
} else {
  if (length(list.files(data_dir_tmp)) > 0L)
    file.remove(list.files(data_dir_tmp, full.names = TRUE))
}
corpus_copy(
  corpus = corpus,
  registry_dir = registry_dir_src,
  data_dir = data_dir_src,
  registry_dir_new = registry_dir_tmp,
  data_dir_new = data_dir_tmp
)
RcppCWB::cl_charset_name(corpus = corpus, registry = registry_dir_tmp)
#> [1] "latin1"
corpus_recode(
  corpus = corpus,
  registry_dir = registry_dir_tmp,
  data_dir = data_dir_tmp,
  to = "UTF-8"
)
#> Recoding s-attribute: id
#> Recoding s-attribute: topics_cat
#> Recoding s-attribute: places
#> Recoding s-attribute: language
#> Recoding p-attribute: word
RcppCWB::cl_delete_corpus(corpus = corpus, registry = registry_dir_tmp)
#> [1] TRUE
RcppCWB::cqp_initialize(registry_dir_tmp)
#> Warning: CQP has already been initialized. Re-initialization is not possible. Only resetting registry.
#> [1] TRUE
RcppCWB::cl_charset_name(corpus = corpus, registry = registry_dir_tmp)
#> [1] "utf8"
n_strucs <- RcppCWB::cl_attribute_size(
  corpus = corpus, attribute = s_attr, attribute_type = "s", registry = registry_dir_tmp
)
strucs <- 0L:(n_strucs - 1L)
struc_values <- RcppCWB::cl_struc2str(
  corpus = corpus, s_attribute = s_attr, struc = strucs, registry = registry_dir_tmp
)
places <- unique(struc_values)
Sys.setenv("CORPUS_REGISTRY" = registry_dir_tmp)
if (RcppCWB::cqp_is_initialized()) RcppCWB::cqp_reset_registry() else RcppCWB::cqp_initialize()
#> [1] TRUE
RcppCWB::cqp_query(corpus = corpus, query = Q)
#> <pointer: 0x55b538d70b60>
cpos <- RcppCWB::cqp_dump_subcorpus(corpus = corpus)
ids <- RcppCWB::cl_cpos2id(
  corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, cpos = cpos
)
str <- RcppCWB::cl_id2str(
  corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, id = ids
)
unique(str)
#> [1] "oil"
unlink(fs::path(tempdir(), "cwb"), recursive = TRUE)