Functions to extract information from a registry file describing a corpus. Several of these operations could also be accomplished with the 'cwb-regedit' tool; the functions defined here ensure that manipulating the registry is possible without a full installation of the CWB.
registry_get_name(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_id(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_home(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_info(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_encoding(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_p_attributes(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_s_attributes(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_properties(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
Argument | Description |
---|---|
corpus | name of the CWB corpus |
registry | directory of the registry (defaults to the CORPUS_REGISTRY environment variable) |
An appendix to the 'Corpus Encoding Tutorial' (http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf) includes an explanation of the registry file format.
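To make the format more concrete, the following is a minimal sketch in base R that writes a small registry file to a temporary directory and reads back the declared attributes. The file content, the HOME path and the helper `read_declarations()` are illustrative assumptions and not the package's implementation; the sketch only relies on the keyword lines (NAME, ID, HOME, ATTRIBUTE, STRUCTURE) and on registry files being named after the lowercase corpus ID.

```r
# Hypothetical registry file content, for illustration only.
registry_lines <- c(
  "NAME \"Reuters Sample Corpus\"",
  "ID   reuters",
  "HOME /tmp/cwb/indexed_corpora/reuters",
  "##:: charset = \"latin1\"",
  "ATTRIBUTE word",
  "STRUCTURE id",
  "STRUCTURE topic_cat"
)

# Write the file into a temporary registry directory.
registry_dir <- tempfile("registry")
dir.create(registry_dir)
writeLines(registry_lines, file.path(registry_dir, "reuters"))

# Hypothetical helper: return the values declared with a given keyword
# ("ATTRIBUTE" for p-attributes, "STRUCTURE" for s-attributes).
read_declarations <- function(corpus, registry, keyword) {
  # Registry files are named after the lowercase corpus ID.
  lines <- readLines(file.path(registry, tolower(corpus)))
  pattern <- paste0("^", keyword, "[[:space:]]+([^[:space:]]+).*$")
  hits <- grep(pattern, lines, value = TRUE)
  sub(pattern, "\\1", hits)
}

p_attrs <- read_declarations("REUTERS", registry_dir, "ATTRIBUTE")
s_attrs <- read_declarations("REUTERS", registry_dir, "STRUCTURE")
```

With the illustrative file above, `p_attrs` holds the single p-attribute "word" and `s_attrs` holds "id" and "topic_cat".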
registry_get_encoding parses the registry file for a corpus and returns the encoding that is defined (corpus property "charset"). If parsing the registry does not yield a result (corpus property "charset" not defined), the CWB standard encoding ("latin1") is assigned to prevent errors. Note that RcppCWB::cl_charset_name is equivalent but faster, as it uses the internal C representation of a corpus rather than parsing the registry file.
registry_get_encoding("REUTERS")
#> [1] "latin1"
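The fallback behaviour described for registry_get_encoding can be sketched in base R as follows. This is an assumption-laden illustration, not the package's actual code: the helper `charset_or_default()` and the two temporary files are hypothetical, and the sketch assumes that the charset is declared on a corpus property line of the form `##:: charset = "..."`.

```r
# Hypothetical helper, not the package's implementation: look for a
# corpus property line such as `##:: charset = "utf8"` and fall back
# to "latin1" when no such line is present.
charset_or_default <- function(registry_file, default = "latin1") {
  lines <- readLines(registry_file)
  pattern <- "^##::[[:space:]]*charset[[:space:]]*=[[:space:]]*\"([^\"]+)\".*$"
  hits <- grep(pattern, lines, value = TRUE)
  if (length(hits) == 0L) {
    return(default)
  }
  # Use the first declaration if there are several.
  sub(pattern, "\\1", hits[1])
}

# Two illustrative registry files in temporary locations.
with_charset <- tempfile()
writeLines(c("ID demo", "##:: charset = \"utf8\""), with_charset)
without_charset <- tempfile()
writeLines("ID demo", without_charset)

enc_declared <- charset_or_default(with_charset)
enc_fallback <- charset_or_default(without_charset)
```

Here `enc_declared` is "utf8", while `enc_fallback` is "latin1" because the second file declares no charset.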