Pure R implementation to generate a positional attribute from a character vector of tokens (the token stream).

p_attribute_encode(
  token_stream,
  p_attribute = "word",
  registry_dir,
  corpus,
  data_dir,
  method = c("R", "CWB"),
  verbose = TRUE,
  encoding = get_encoding(token_stream),
  compress = FALSE
)

p_attribute_recode(
  data_dir,
  p_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)

p_attribute_rename(
  corpus,
  old,
  new,
  registry_dir,
  verbose = TRUE,
  dryrun = FALSE
)

Arguments

token_stream

A character vector with the tokens of the corpus. The maximum length is 2 147 483 647 (2^31 - 1); a warning is issued if this threshold is exceeded. See the CWB Encoding Tutorial for size limitations of corpora.

p_attribute

The positional attribute.

registry_dir

Registry directory (needed by p_attribute_huffcode() and p_attribute_compress_rdx()).

corpus

The CWB corpus (needed by p_attribute_huffcode() and p_attribute_compress_rdx()).

data_dir

The data directory for the corpus with the binary files.

method

Either 'CWB' or 'R'.

verbose

A logical value.

encoding

Encoding as defined in the charset corpus property of the registry file for the corpus ('latin1' to 'latin9', and 'utf8').

compress

A logical value.

from

Character string describing the current encoding of the attribute.

to

Character string describing the target encoding of the attribute.

old

A character vector with p-attributes to be renamed.

new

A character vector with new names of p-attributes. The vector needs to have the same length as vector old.

dryrun

A logical value, whether to suppress the actual renaming operation so that output messages can be inspected.

Details

Four steps generate the binary CWB corpus data format for positional attributes: First, encode a character vector (the token stream) using p_attribute_encode(). Second, create the reverse index using p_attribute_makeall(). Third, compress the token stream using p_attribute_huffcode(). Fourth, compress the index files using p_attribute_compress_rdx().

The implementation for the first two steps (p_attribute_encode() and p_attribute_makeall()) is a pure R implementation (so far). These two steps are enough to use the CQP functionality. To run p_attribute_huffcode() and p_attribute_compress_rdx(), an installation of the CWB may be necessary.
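What the first two steps produce can be illustrated conceptually in base R. This is a minimal sketch on a toy token stream, not the package's implementation: the toy data and all object names are assumptions made for illustration, and the real functions write binary files instead of keeping the results in memory.

```r
# Toy token stream (hypothetical data, not taken from the package).
tokens <- c("oil", "prices", "rise", "as", "oil", "supply", "falls")

# Step 1, conceptually: build the lexicon (unique tokens in order of first
# occurrence) and map every token to a zero-based integer id.
lexicon <- unique(tokens)
ids <- match(tokens, lexicon) - 1L

# Step 2, conceptually: count frequencies per id and build a reverse index
# that maps each id to the zero-based corpus positions where it occurs.
freqs <- tabulate(ids + 1L, nbins = length(lexicon))
cpos <- seq_along(ids) - 1L
reverse_index <- split(cpos, factor(ids, levels = seq_along(lexicon) - 1L))

# Look up all corpus positions of "oil" via the reverse index.
oil_id <- match("oil", lexicon) - 1L
oil_cpos <- reverse_index[[as.character(oil_id)]]
```

The reverse index is what makes lookups by token cheap: positions are retrieved by id without scanning the whole token stream.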

See the CQP Corpus Encoding Tutorial (https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial.pdf) for an explanation of the procedure (section 3, "Indexing and compression without CWB/Perl").
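The kind of binary files involved can be sketched with base R's writeBin() and readBin(). The sketch assumes the uncompressed layout used by CWB (token ids as 4-byte big-endian integers, lexicon entries as null-terminated strings, and byte offsets into the lexicon as 4-byte big-endian integers). The file names follow the output shown in the Examples, but the code is a simplified illustration and not the package's implementation.

```r
# Toy token stream (hypothetical data).
tokens <- c("oil", "prices", "rise", "as", "oil", "supply", "falls")
lexicon <- unique(tokens)
ids <- match(tokens, lexicon) - 1L

# Write into a fresh temporary directory.
out_dir <- tempfile("pattr_sketch_")
dir.create(out_dir)
corpus_file <- file.path(out_dir, "word.corpus")
lexicon_file <- file.path(out_dir, "word.lexicon")
idx_file <- file.path(out_dir, "word.lexicon.idx")

# Token ids as 4-byte integers in big-endian (network) byte order.
writeBin(ids, con = corpus_file, size = 4L, endian = "big")

# Lexicon entries as null-terminated strings (ASCII-only toy data).
writeBin(lexicon, con = lexicon_file)

# Byte offset of each lexicon entry within the lexicon file.
entry_bytes <- nchar(lexicon, type = "bytes") + 1L  # + 1 for the terminator
offsets <- cumsum(c(0L, head(entry_bytes, -1L)))
writeBin(offsets, con = idx_file, size = 4L, endian = "big")

# Read the files back and decode the token stream.
ids_read <- readBin(
  corpus_file, what = integer(), n = length(ids), size = 4L, endian = "big"
)
lexicon_read <- readBin(lexicon_file, what = character(), n = length(lexicon))
decoded <- lexicon_read[ids_read + 1L]
```

Reading the files back and recovering the original tokens is a quick sanity check that ids and lexicon are consistent.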

p_attribute_recode will recode the values in the avs-file and change the attribute value index in the avx file. The rng-file remains unchanged. The registry file also remains unchanged, so it is highly recommended to treat p_attribute_recode as a helper for corpus_recode, which will recode all s-attributes and all p-attributes and reset the encoding in the registry file.
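Why recoding values also requires rewriting an index can be seen in a small base R sketch. It recodes a file of null-terminated strings from latin1 to UTF-8 with iconv(). The helper write_strings() and the sample words are hypothetical and only illustrate the general technique, not what p_attribute_recode does internally.

```r
# Hypothetical helper: convert strings to a target encoding and write them
# as null-terminated byte sequences. Returns the number of bytes written.
write_strings <- function(x, file, from, to) {
  raw_list <- iconv(x, from = from, to = to, toRaw = TRUE)
  bytes <- unlist(lapply(raw_list, function(r) c(r, as.raw(0))))
  writeBin(bytes, con = file)
  invisible(length(bytes))
}

# Sample values with non-ASCII characters (hypothetical data), stored as latin1.
values_utf8 <- c("\u00d6l", "Preise", "F\u00f6rderung")
values_file <- tempfile(fileext = ".lexicon")
size_latin1 <- write_strings(values_utf8, values_file, from = "UTF-8", to = "latin1")

# Recode the file: read the latin1 bytes and write them back as UTF-8.
old <- readBin(values_file, what = character(), n = length(values_utf8))
size_utf8 <- write_strings(old, values_file, from = "latin1", to = "UTF-8")

# Read the recoded values and declare their encoding.
recoded <- readBin(values_file, what = character(), n = length(values_utf8))
Encoding(recoded) <- "UTF-8"
```

The file grows because non-ASCII characters need more bytes in UTF-8 than in latin1. Any index that stores byte offsets into the file therefore has to be regenerated after recoding.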

Function p_attribute_rename can be used to rename a positional attribute. Note that the corpus is not refreshed (unloaded, re-loaded), so it may be necessary to restart R for changes to become effective.
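The dry-run pattern behind the dryrun argument can be sketched in base R. The helper rename_attribute_files() is hypothetical and not part of RcppCWB: it assumes that the files of a p-attribute share the attribute name as a prefix (as in the file names shown in the Examples) and it ignores the registry file, which an actual renaming would presumably also have to update.

```r
# Hypothetical helper: rename all files in data_dir that start with "<old>."
# so that they start with "<new>." instead. With dryrun = TRUE, only the
# planned operations are reported and nothing is changed on disk.
rename_attribute_files <- function(data_dir, old, new, dryrun = FALSE, verbose = TRUE) {
  stopifnot(length(old) == length(new))
  planned <- character()
  for (i in seq_along(old)) {
    files <- list.files(data_dir)
    files <- files[startsWith(files, paste0(old[i], "."))]
    targets <- paste0(new[i], substring(files, nchar(old[i]) + 1L))
    if (verbose && length(files) > 0L) {
      message(paste(sprintf("... renaming %s to %s", files, targets), collapse = "\n"))
    }
    if (!dryrun && length(files) > 0L) {
      file.rename(file.path(data_dir, files), file.path(data_dir, targets))
    }
    planned <- c(planned, targets)
  }
  invisible(planned)
}

# Set up a temporary data directory with empty placeholder files.
data_dir <- tempfile("rename_sketch_")
dir.create(data_dir)
file.create(file.path(data_dir, c("word.corpus", "word.lexicon", "word.lexicon.idx")))

# Dry run: messages are shown, but the files keep their names.
planned <- rename_attribute_files(data_dir, old = "word", new = "token", dryrun = TRUE)
after_dryrun <- list.files(data_dir)

# Actual renaming.
rename_attribute_files(data_dir, old = "word", new = "token", verbose = FALSE)
after_rename <- list.files(data_dir)
```

Running with dryrun = TRUE first makes it possible to check the planned operations before any file is touched.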

Author

Christoph Leonhardt, Andreas Blaette

Examples

library(RcppCWB)

# In this example, we pursue a "pure R" approach. To rely on the "CWB"
# method, you can use the cwb_install() function, which will download and
# install the CWB command line tools within the package.

tokens <- readLines(system.file(package = "RcppCWB", "extdata", "examples", "reuters.txt"))

# Create new (and empty) directory structure

tmpdir <- normalizePath(tempdir(), winslash = "/")
registry_tmp <- fs::path(tmpdir, "registry")
data_dir_tmp <- fs::path(tmpdir, "data_dir", "reuters")
if (file.exists(fs::path(data_dir_tmp, "word.corpus"))){
  file.remove(fs::path(data_dir_tmp, "word.corpus"))
}
if (dir.exists(registry_tmp)) unlink(registry_tmp, recursive = TRUE)
if (dir.exists(data_dir_tmp)) unlink(data_dir_tmp, recursive = TRUE)
dir.create(registry_tmp)
dir.create(data_dir_tmp, recursive = TRUE)

# Now encode token stream

p_attribute_encode(
  corpus = "reuters",
  token_stream = tokens, p_attribute = "word",
  data_dir = data_dir_tmp, method = "R",
  registry_dir = registry_tmp,
  compress = FALSE,
  encoding = "utf8"
  )
#> ... writing tokenstream to disk (directly from R, equivalent to cwb-encode)
#> ... creating indices (in memory)
#> ... writing file: word.corpus
#> ... writing file: word.lexicon
#> ... writing file: word.lexicon.idx
#> ... creating data for new registry file
#> ... writing registry file
#> === Makeall: processing corpus reuters ===
#> Registry directory: /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpVB8DzH/registry
#> ATTRIBUTE word
#>  + creating LEXSRT ... OK
#>  - lexicon      OK
#>  + creating FREQS ... OK
#>  - frequencies  OK
#>  - token stream OK
#>  + creating REVCIDX ... OK
#>  + creating REVCORP ... OK
#>  ? validating REVCORP ... OK
#>  - index        OK
#> ========================================

# Create minimal registry file

regdata <- registry_data(
  id = "REUTERS", name = "Reuters Sample Corpus", home = data_dir_tmp,
  properties = c(encoding = "utf-8", language = "en"), p_attributes = "word"
)

regfile <- registry_file_write(
  data = regdata, corpus = "REUTERS",
  registry_dir = registry_tmp, data_dir = data_dir_tmp
)

# Reload corpus and run query as a test

if (cqp_is_initialized()) cqp_reset_registry(registry_tmp) else cqp_initialize(registry_tmp)
#> [1] TRUE

cqp_query(corpus = "REUTERS", query = '[]{3} "oil" []{3};')
#> <pointer: 0x7fd606a743a0>
regions <- cqp_dump_subcorpus(corpus = "REUTERS")
kwic <- apply(
  regions, 1,
  function(region){
    ids <- cl_cpos2id(corpus = "REUTERS", p_attribute = "word", registry = registry_tmp, cpos = region[1]:region[2])
    words <- cl_id2str(corpus = "REUTERS", p_attribute = "word", registry = registry_tmp, id = ids)
    paste0(words, collapse = " ")
  }
)
kwic[1:10]
#>  [1] "prices for crude oil by 1.50 dlrs"          
#>  [2] "light of falling oil product prices and"    
#>  [3] "a weak crude oil market a company"          
#>  [4] "line of U.S oil companies that have"        
#>  [5] "days citing weak oil markets Reuter OPEC"   
#>  [6] "current slide in oil prices oil industry"   
#>  [7] "in oil prices oil industry analysts said"   
#>  [8] "movement to higher oil prices was never"    
#>  [9] "CERA Analysts and oil industry sources said"
#> [10] "faces is excess oil supply in world"