Functions to extract information from a registry file describing a corpus. Several of these operations could also be accomplished with the 'cwb-regedit' tool. The functions defined here ensure that manipulating the registry is possible without a full installation of the CWB.
registry_get_name(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_id(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_home(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_info(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_encoding(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_p_attributes(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_s_attributes(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
registry_get_properties(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
| Argument | Description |
|---|---|
| corpus | name of the CWB corpus |
| registry | directory of the registry (defaults to the CORPUS_REGISTRY environment variable) |
An appendix to the 'Corpus Encoding Tutorial' (http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf) includes an explanation of the registry file format.
registry_get_encoding parses the registry file for a corpus and returns the
encoding defined there (corpus property "charset"). If parsing the registry
does not yield a result (corpus property "charset" not defined), the CWB
standard encoding ("latin1") is assigned to prevent errors. Note that
RcppCWB::cl_charset_name is equivalent but faster, as it uses the internal C
representation of a corpus rather than parsing the registry file.
registry_get_encoding("REUTERS")
#> [1] "latin1"
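
The fallback behaviour described for registry_get_encoding can be illustrated with a minimal sketch in base R. This is not the package's implementation: the helper name sketch_get_charset is hypothetical, and the code assumes that corpus properties appear in the registry file as lines of the form `##:: charset = "utf8"`.

```r
# Hypothetical helper (not part of RcppCWB): read the "charset" corpus
# property from a registry file and fall back to "latin1" if it is missing.
sketch_get_charset <- function(registry_file) {
  lines <- readLines(registry_file, warn = FALSE)
  # Assumed property syntax: ##:: charset = "<value>"
  pattern <- '^##::[[:space:]]*charset[[:space:]]*=[[:space:]]*"([^"]*)".*$'
  hits <- grep(pattern, lines, value = TRUE)
  if (length(hits) == 0L) {
    return("latin1")  # CWB standard encoding, used when no charset is declared
  }
  sub(pattern, "\\1", hits[1L])
}

# Two throwaway registry files: one declaring a charset, one without.
with_charset <- tempfile()
writeLines(c('NAME "Demo corpus"', 'ID demo', 'HOME /tmp/demo',
             '##:: charset = "utf8"', 'ATTRIBUTE word'), with_charset)

without_charset <- tempfile()
writeLines(c('NAME "Demo corpus"', 'ID demo', 'HOME /tmp/demo',
             'ATTRIBUTE word'), without_charset)

enc_declared <- sketch_get_charset(with_charset)
enc_fallback <- sketch_get_charset(without_charset)
```

Here enc_declared holds the declared value and enc_fallback holds "latin1". The sketch reads the file on every call, which is the cost that RcppCWB::cl_charset_name avoids by using the corpus representation already held in memory.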