The Corpus Workbench (CWB) stores the binary files for structural and
positional attributes in an individual 'data directory' (referred to by
argument data_dir) for each corpus. The data directories will typically be
subdirectories of a parent directory called 'corpus directory' (argument
corpus_dir). Irrespective of the location of the data directories, all
corpora available on a machine are described by (plain text) registry files
stored in a 'registry directory' (referred to by argument registry_dir). The
functions to manage these directories serve as auxiliaries for the
higher-level functionality to download and install corpora.
cwb_corpus_dir(registry_dir, verbose = TRUE)
cwb_registry_dir(verbose = TRUE)
cwb_directories(registry_dir = NULL, corpus_dir = NULL, verbose = TRUE)
create_cwb_directories(prefix = "~/cwb", ask = interactive(), verbose = TRUE)
use_corpus_registry_envvar(registry_dir)
registry_dir: Path to the directory with registry files.
verbose: A logical value, whether to output status messages.
corpus_dir: Path to the directory with data directories for corpora.
prefix: The base path that will be prefixed
ask: A logical value, whether to prompt user before creating directories.
cwb_corpus_dir will make a plausible suggestion for a corpus directory where
data directories for corpora reside. The procedure requires that the
registry directory (argument registry_dir) is known. If the argument
registry_dir is missing, the registry directory will be guessed by calling
cwb_registry_dir. The heuristic to detect the corpus directory is as
follows: First, directories in the parent directory of the registry
directory that contain "corpus" or "corpora" are suggested. If this does not
yield a result, the data directories stated in the registry files are
evaluated. If there is one unique parent directory of data directories
(after removing temporary directories and directories within packages), this
unique directory is suggested. cwb_corpus_dir will return a length-one
character vector with the path of the suggested corpus directory, or NULL if
the heuristic does not yield a result.
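The first step of the heuristic can be illustrated with a minimal base R sketch. The helper suggest_corpus_dir_sketch is hypothetical and not part of cwbtools. It only covers the name-based lookup and omits the fallback that evaluates the registry files. Returning a result only when exactly one candidate matches is an assumption made for this sketch.

```r
# Hypothetical helper (not part of cwbtools): covers only the first,
# name-based step of the heuristic described above.
suggest_corpus_dir_sketch <- function(registry_dir) {
  parent <- dirname(normalizePath(registry_dir, mustWork = TRUE))
  candidates <- list.dirs(parent, full.names = TRUE, recursive = FALSE)
  hits <- candidates[grepl("corpus|corpora", basename(candidates))]
  # Assumption of this sketch: suggest a directory only if it is unambiguous
  if (length(hits) == 1L) hits else NULL
}

# Build a throwaway layout in the session's temporary directory
root <- file.path(tempdir(), "cwb_sketch")
dir.create(file.path(root, "registry"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "indexed_corpora"), recursive = TRUE, showWarnings = FALSE)

suggested <- suggest_corpus_dir_sketch(file.path(root, "registry"))
basename(suggested)
```

With this layout, the sibling directory 'indexed_corpora' matches the pattern, while 'registry' does not.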
cwb_registry_dir() will return the system registry directory. By default,
the environment variable CORPUS_REGISTRY defines the system registry
directory. If the polmineR package is loaded, a temporary registry directory
is used, replacing the system registry directory. In this case,
cwb_registry_dir() will retrieve the directory from the option
'polmineR.corpus_registry'. The return value is a length-one character
vector, or NULL if no registry directory can be detected.
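The lookup order described above might be sketched in base R as follows. The function registry_dir_sketch is hypothetical and simplifies the behaviour: it checks whether the option 'polmineR.corpus_registry' is set rather than whether polmineR is actually loaded, which is an assumption of this sketch.

```r
# Hypothetical helper (not part of cwbtools): option first, then the
# environment variable, otherwise NULL.
registry_dir_sketch <- function() {
  # Simplification: test for the option instead of the loaded package
  from_option <- getOption("polmineR.corpus_registry")
  if (!is.null(from_option)) return(from_option)

  from_env <- Sys.getenv("CORPUS_REGISTRY", unset = NA_character_)
  if (is.na(from_env) || !nzchar(from_env)) return(NULL)
  from_env
}

registry_dir_sketch()
```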
cwb_directories will return a named character vector with the registry
directory and the corpus directory.
create_cwb_directories will create a 'registry' and an 'indexed_corpora'
directory as subdirectories of the directory indicated by argument prefix.
Argument ask indicates whether the user is asked for confirmation before the
directories are created. The function returns a named character vector with
the registry and the corpus directory.
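The resulting layout can be reproduced with a small base R sketch. The helper create_layout_sketch is hypothetical and skips the user prompt. The element names 'registry_dir' and 'corpus_dir' are assumptions chosen for illustration and may differ from the names cwbtools uses.

```r
# Hypothetical helper (not part of cwbtools): creates the two
# subdirectories below 'prefix' without prompting.
create_layout_sketch <- function(prefix) {
  dirs <- c(
    registry_dir = file.path(prefix, "registry"),      # assumed element name
    corpus_dir = file.path(prefix, "indexed_corpora")  # assumed element name
  )
  for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
  dirs
}

# Use a temporary prefix so that nothing is written to the home directory
layout <- create_layout_sketch(file.path(tempdir(), "cwb_layout_sketch"))
layout
```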
use_corpus_registry_envvar() is a convenience function that will assist
users in defining the environment variable CORPUS_REGISTRY in the .Renviron
file, making it available across sessions. The function is intended to be
used in an interactive R session. An error is thrown if this is not the
case. The user will be asked whether the cwbtools package shall take care of
creating or modifying the .Renviron file. If not, the file will be opened
for manual modification, with some instructions shown in the terminal.
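For manual modification, the entry in the .Renviron file is a single NAME=value line. The following sketch shows the format using a temporary stand-in file, so that the real .Renviron file is left untouched. The registry path is a placeholder, and the code does not reproduce what use_corpus_registry_envvar() does internally. Note that it changes CORPUS_REGISTRY for the running session.

```r
# Stand-in for ~/.Renviron so that the real file is not modified
renviron_file <- tempfile(fileext = ".Renviron")

# Placeholder registry path, used for illustration only
registry_dir <- normalizePath(
  file.path(tempdir(), "registry"),
  winslash = "/", mustWork = FALSE
)

# One NAME=value entry per line is the format R expects in .Renviron
writeLines(paste0("CORPUS_REGISTRY=", registry_dir), con = renviron_file)

# R reads ~/.Renviron at startup; readRenviron() processes a file on demand
readRenviron(renviron_file)
Sys.getenv("CORPUS_REGISTRY")
```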