Functions to extract information from a registry file describing a corpus. Several of these operations could also be accomplished with the 'cwb-regedit' tool; the functions defined here ensure that manipulating the registry is possible without a full installation of the CWB.

registry_get_name(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))

registry_get_id(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))

registry_get_home(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))

registry_get_info(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))

registry_get_encoding(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))

registry_get_p_attributes(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))

registry_get_s_attributes(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))

registry_get_properties(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))

Arguments

corpus

name of the CWB corpus

registry

directory of the registry (defaults to the CORPUS_REGISTRY environment variable)

Details

An appendix to the 'Corpus Encoding Tutorial' (http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf) includes an explanation of the registry file format.

registry_get_encoding parses the registry file for a corpus and returns the encoding that is defined there (corpus property "charset"). If parsing the registry does not yield a result (corpus property "charset" not defined), the CWB standard encoding ("latin1") is returned to prevent errors. Note that RcppCWB::cl_charset_name is equivalent but faster, as it uses the internal C representation of a corpus rather than parsing the registry file.
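The fallback behaviour described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper sketch_get_charset is hypothetical, and the sketch assumes that a corpus property appears in the registry file on a line of the form ##:: charset = "utf8". The registry lines are made-up sample data.

```r
# Hypothetical helper, not part of the package: look up the "charset"
# corpus property in the lines of a registry file and fall back to a
# default if the property is not defined.
sketch_get_charset <- function(registry_lines, default = "latin1") {
  # Assumed format of a property line: ##:: charset = "utf8"
  pattern <- '^##::[[:space:]]*charset[[:space:]]*=[[:space:]]*"([^"]*)".*$'
  hits <- grep(pattern, registry_lines, value = TRUE)
  # No charset property found: return the CWB standard encoding
  if (length(hits) == 0L) return(default)
  # Extract the quoted value from the first matching line
  sub(pattern, "\\1", hits[1L])
}

# Made-up registry content, with and without a charset property
with_charset <- c('NAME "Example corpus"', 'ID example', '##:: charset = "utf8"')
without_charset <- c('NAME "Example corpus"', 'ID example')

sketch_get_charset(with_charset)
#> [1] "utf8"
sketch_get_charset(without_charset)
#> [1] "latin1"
```

Returning a default instead of NA or an error means downstream code that converts strings always receives a usable encoding name, at the cost of silently assuming "latin1" for registry files that omit the property.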

Examples

registry_get_encoding("REUTERS")
#> [1] "latin1"