Encode Positional Attribute(s). — p_attribute

Generate positional attribute from a character vector of tokens (the token stream).

p_attribute_encode(
  token_stream,
  p_attribute = "word",
  registry_dir,
  corpus,
  data_dir,
  method = c("R", "CWB"),
  verbose = TRUE,
  quietly = FALSE,
  encoding = get_encoding(token_stream),
  compress = FALSE,
  reload = TRUE
)

p_attribute_recode(
  data_dir,
  p_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)

p_attribute_rename(
  corpus,
  old,
  new,
  registry_dir,
  verbose = TRUE,
  dryrun = FALSE
)

Arguments

token_stream: A character vector with the tokens of the corpus. The maximum length is 2 147 483 647 (2^31 - 1); a warning is issued if this threshold is exceeded. See the CWB Encoding Tutorial for size limitations of corpora. May also be a file.
p_attribute: The positional attribute to create - a character vector containing only lowercase ASCII characters (a-z), digits (0-9), -, and _: No non-ASCII or uppercase letters allowed. If method is "R", only one positional attribute can be encoded at a time. If method is "CWB", more than one p-attribute allowed.
registry_dir: Registry directory.
corpus: ID of the CWB corpus to create.
data_dir: The data directory for the binary files of the corpus.
method: Either 'CWB' or 'R', defaults to 'R'. See section 'Details'.
verbose: A logical value, whether to output progress messages.
quietly: A logical value passed into RcppCWB::cwb_makeall(), RcppCWB::cwb_huffcode() and RcppCWB::cwb_compress_rdx to control verbosity of these functions.
encoding: Encoding as defined in the charset corpus property of the registry file for the corpus ('latin1' to 'latin9', and 'utf8').
compress: A logical value, whether to run RcppCWB::cwb_huffcode() and RcppCWB::cwb_compress_rdx() (method 'R'), or command line tools cwb-huffcode and cwb-compress-rdx (method 'CWB'). Defaults to FALSE as compression is not stable on Windows.
reload: A logical value that defaults to TRUE to ensure that all features are available.
from: Character string describing the current encoding of the attribute.
to: Character string describing the target encoding of the attribute.
old: A character vector with p-attributes to be renamed.
new: A character vector with new names of p-attributes. The vector needs to have the same length as vector old.
dryrun: A logical value, whether to suppress actual renaming operation for inspecting output messages

Value

TRUE is returned invisibly, if encoding has been successful. FALSE indicates an error has occurred.

Details

Four steps generate the binary CWB corpus data format for positional attributes: (1) Encode the token stream of the corpus, (2) create index files, (3) compress token stream and (4) compress index files. Whereas steps 1 and 2 are required to make a corpus work, steps 3 and 4 are optional yet useful to reduce disk usage and improve performance. See the CQP Corpus Encoding Tutorial (sections 2-4) for an explanation of the procedure.

p_attribute_encode() offers an R and a CWB implementation controlled by argument method. When choosing method 'R', the token stream is encoded in 'pure R', then the C implementation of CWB functionality as exposed to R via the RcppCWB package is used (functions RcppCWB::cwb_makeall() for indexing, RcppCWB::cwb_huffcode() and RcppCWB::cwb_compress_rdx() for compression). When choosing method 'CWB', the token stream is written to disk, then CWB command line utilities 'cwb-encode', cwb-makeall', 'cwb-huffcode' and 'cwb-compress-rdx' are called using system2(). The CWB-method requires an installation of the 'CWB'. The cwb_install() function will download and # install the CWB command line tools within the package. The 'CWB'-method is still supported as it is used in the test suite of the packaage. The 'R'-method is robust and is recommended.

p_attribute_recode() will recode the values in the avs-file and change the attribute value index in the avx file. The rng-file remains unchanged. The registry file remains unchanged, and it is highly recommended to consider s_attribute_recode() as a helper for corpus_recode() that will recode all s-attributes, all p-attributes, and will reset the encoding in the registry file.

Function p_attribute_rename() can be used to rename a positional attribute. Note that the corpus is not refreshed (unloaded, re-loaded), so it may be necessary to restart R for changes to become effective.

Author

Christoph Leonhardt, Andreas Blaette

Examples

# In this example, we follow a "pure R" approach. 
library(dplyr)

reu <- system.file(package = "RcppCWB", "extdata", "examples", "reuters.txt")
tokens <- readLines(reu)

# Create new (and empty) directory structure

registry_tmp <- fs::path(tempdir(), "registry")
data_dir_tmp <- fs::path(tempdir(), "data_dir", "reuters")

if (dir.exists(registry_tmp)) unlink(registry_tmp, recursive = TRUE)
if (dir.exists(data_dir_tmp)) unlink(data_dir_tmp, recursive = TRUE)

dir.create(registry_tmp)
dir.create(data_dir_tmp, recursive = TRUE)

# Encode token stream (without compression)

p_attribute_encode(
  corpus = "reuters",
  token_stream = tokens,
  p_attribute = "word",
  data_dir = data_dir_tmp,
  registry_dir = registry_tmp,
  method = "R",
  compress = FALSE,
  quietly = TRUE,
  encoding = "utf8"
)
#> ℹ creating indices (in memory)
#> ✔ creating indices (in memory) [194ms]
#> 
#> ℹ writing file: word.corpus
#> ✔ writing file: word.corpus [564ms]
#> 
#> ℹ writing file: word.lexicon
#> ✔ writing file: word.lexicon [19ms]
#> 
#> ℹ writing file: word.lexicon.idx
#> ✔ writing file: word.lexicon.idx [17ms]
#> 
#> ℹ creating new registry file: /tmp/RtmpeQEeXz/registry/reuters
#> ℹ run `Rcpp::cwb_makeall()`
#> ✔ run `Rcpp::cwb_makeall()` [7ms]
#> 
#> ✔ corpus reloaded: CL success / CQP success

# Augment registry file 

registry_file_parse(corpus = "REUTERS", registry_dir = registry_tmp) %>%
  registry_set_name("Reuters Sample Corpus") %>%
  registry_set_property("charset", "utf8") %>%
  registry_set_property("language", "en") %>%
  registry_set_property("build_date", as.character(Sys.Date())) %>%
  registry_file_write()

# Run query as a test

library(RcppCWB)

cqp_query(corpus = "REUTERS", query = '[]{3} "oil" []{3};')
#> <pointer: 0x55b53b76f390>
regions <- cqp_dump_subcorpus(corpus = "REUTERS")

kwic <- apply(
  regions, 1,
  function(region){
    ids <- cl_cpos2id(
      "REUTERS",
      p_attribute = "word",
      registry = registry_tmp,
      cpos = region[1]:region[2]
    )
    words <- cl_id2str(
      corpus = "REUTERS",
      p_attribute = "word",
      registry = registry_tmp,
      id = ids
    )
    paste0(words, collapse = " ")
  }
)
kwic[1:10]
#>  [1] "prices for crude oil by 1.50 dlrs"          
#>  [2] "light of falling oil product prices and"    
#>  [3] "a weak crude oil market a company"          
#>  [4] "line of U.S oil companies that have"        
#>  [5] "days citing weak oil markets Reuter OPEC"   
#>  [6] "current slide in oil prices oil industry"   
#>  [7] "in oil prices oil industry analysts said"   
#>  [8] "movement to higher oil prices was never"    
#>  [9] "CERA Analysts and oil industry sources said"
#> [10] "faces is excess oil supply in world"