Get a data.frame with the left and right corpus positions (cpos) of the regions of a structural attribute and, if the attribute has values, the value of each region.

s_attribute_decode(
  corpus,
  data_dir,
  s_attribute,
  encoding = NULL,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  method = c("R", "Rcpp")
)

Arguments

corpus

The ID of a CWB corpus (in upper case).

data_dir

The data directory where the binary files of the corpus are stored.

s_attribute

A structural attribute (length 1 character vector).

encoding

Encoding of the values ("latin-1" or "utf-8").

registry

The CWB registry directory.

method

A length-one character vector stating whether to use the "R" or the "Rcpp" implementation for decoding the structural attribute.

Value

A data.frame with three columns if the s-attribute has values, or two columns if it does not. Column cpos_left holds the start corpus positions of the annotated regions, column cpos_right the end corpus positions. Column value, where present, holds the value of the annotation.
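
As a rough illustration of working with the return value, the sketch below builds a small stand-in data.frame with the documented columns and derives region sizes from it. The corpus positions and values are made up for illustration and are not taken from a real corpus. The sketch also assumes that cpos_left and cpos_right are both inclusive, as is usual for CWB regions.

```r
# Hypothetical stand-in for the result of s_attribute_decode(); the numbers
# and values are invented for illustration.
regions <- data.frame(
  cpos_left = c(0L, 92L, 532L),
  cpos_right = c(91L, 531L, 622L),
  value = c("usa", "canada", "usa"),
  stringsAsFactors = FALSE
)

# Number of tokens per region, assuming both boundaries are inclusive.
regions$n_tokens <- regions$cpos_right - regions$cpos_left + 1L

# The value column is only present if the s-attribute has values,
# so check for it before aggregating by value.
if ("value" %in% colnames(regions)) {
  tokens_by_value <- tapply(regions$n_tokens, regions$value, sum)
}
```

Checking for the value column first keeps the same code usable for s-attributes without values, where only the two cpos columns are returned.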

Details

Two approaches are implemented. The pure R solution decodes the binary files directly in the directory specified by data_dir. The Rcpp implementation uses the registry file for corpus to locate the data directory.
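
To give an idea of what decoding the files directly can look like, here is a minimal base R sketch. It is not the package's actual code. It assumes that the region file of an s-attribute (named `<s_attribute>.rng` in the data directory) is a flat sequence of 4-byte big-endian integers that alternate between region start and region end. The sketch writes a small made-up file to a temporary directory and reads it back, so it does not need an indexed corpus.

```r
# Assumption: '<s_attribute>.rng' stores 4-byte big-endian integers,
# alternating between the start and the end of each region.
tmp_dir <- tempfile()
dir.create(tmp_dir)
rng_file <- file.path(tmp_dir, "places.rng")

# Write three invented regions (start, end, start, end, ...) for illustration.
cpos_pairs <- c(0L, 91L, 92L, 531L, 532L, 622L)
writeBin(cpos_pairs, con = rng_file, size = 4L, endian = "big")

# Read the file back: the number of integers is the file size divided by 4.
n_int <- file.info(rng_file)[["size"]] / 4L
rng <- readBin(rng_file, what = integer(), size = 4L, n = n_int, endian = "big")

# Reshape into one row per region, matching the documented column names.
m <- matrix(rng, ncol = 2L, byrow = TRUE)
decoded <- data.frame(cpos_left = m[, 1], cpos_right = m[, 2])

# Clean up the temporary directory.
unlink(tmp_dir, recursive = TRUE)
```

Reading values as well would require the companion files that hold the annotation strings, which this sketch leaves out.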

Examples

# pure R implementation (Rcpp implementation fails on Windows in vanilla mode)
b <- s_attribute_decode(
  corpus = "REUTERS",
  data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"),
  registry = get_tmp_registry(),
  s_attribute = "places", method = "R"
)

# Using Rcpp wrappers for CWB C code
b <- s_attribute_decode(
  corpus = "REUTERS",
  data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"),
  s_attribute = "places",
  method = "Rcpp",
  registry = get_tmp_registry()
)