Pure R implementation to generate a positional attribute from a character vector of tokens (the token stream).
p_attribute_encode(
  token_stream,
  p_attribute = "word",
  registry_dir,
  corpus,
  data_dir,
  method = c("R", "CWB"),
  verbose = TRUE,
  encoding = get_encoding(token_stream),
  compress = FALSE
)
p_attribute_recode(
  data_dir,
  p_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)
p_attribute_rename(
  corpus,
  old,
  new,
  registry_dir,
  verbose = TRUE,
  dryrun = FALSE
)
token_stream: A character vector with the tokens of the corpus. The maximum length is 2 147 483 647 (2^31 - 1); a warning is issued if this threshold is exceeded. See the CWB Encoding Tutorial for size limitations of corpora.
p_attribute: The positional attribute.
registry_dir: Registry directory (needed by p_attribute_huffcode() and p_attribute_compress_rdx()).
corpus: The CWB corpus (needed by p_attribute_huffcode() and p_attribute_compress_rdx()).
data_dir: The data directory for the corpus with the binary files.
method: Either 'CWB' or 'R'.
verbose: A logical value.
encoding: Encoding as defined in the charset corpus property of the registry file for the corpus ('latin1' to 'latin9', and 'utf8').
compress: A logical value.
from: Character string describing the current encoding of the attribute.
to: Character string describing the target encoding of the attribute.
old: A character vector with p-attributes to be renamed.
new: A character vector with new names of p-attributes. The vector needs to have the same length as vector old.
dryrun: A logical value, whether to suppress the actual renaming operation in order to inspect the output messages.
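The size limit stated for the token stream equals the largest value of R's 32-bit integer type. As a minimal sketch of such a length check (the helper check_token_stream_length() is hypothetical and not part of RcppCWB; it only illustrates the documented threshold):

```r
# Hypothetical helper, not part of RcppCWB: warn if a token stream is longer
# than the documented maximum of 2^31 - 1 tokens.
max_tokens <- .Machine$integer.max

check_token_stream_length <- function(n_tokens, limit = max_tokens) {
  # The length is passed as a number so that the check can be tried out
  # without allocating a huge vector.
  if (n_tokens > limit) {
    warning("token stream exceeds the maximum corpus size")
    return(FALSE)
  }
  TRUE
}

check_token_stream_length(length(c("a", "short", "token", "stream")))
```

The function returns TRUE for any token stream within the limit and FALSE, with a warning, otherwise.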
Four steps generate the binary CWB corpus data format for positional attributes: First, encode a character vector (the token stream) using p_attribute_encode. Second, create reverse index using p_attribute_makeall. Third, compress token stream using p_attribute_huffcode. Fourth, compress index files using p_attribute_compress_rdx.
The first two steps (p_attribute_encode() and p_attribute_makeall()) are implemented in pure R (so far). These two steps are enough to use the CQP functionality. To run p_attribute_huffcode() and p_attribute_compress_rdx(), an installation of the CWB may be necessary.
See the CWB Corpus Encoding Tutorial (https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial.pdf) for an explanation of the procedure (section 3, "Indexing and compression without CWB/Perl").
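The idea behind the first two steps can be illustrated in base R. The following is a conceptual sketch only, not the RcppCWB implementation: it assumes that lexicon ids and corpus positions are zero-based and that the lexicon lists tokens in order of first occurrence, and it keeps everything in memory instead of writing binary files.

```r
# Conceptual sketch in base R, not the RcppCWB implementation.
tokens <- c("the", "oil", "price", "of", "the", "oil")

# Step 1 (encoding): build a lexicon of unique tokens in order of first
# occurrence and replace every token by its zero-based lexicon id
# (zero-based numbering is an assumption of this sketch).
lexicon <- unique(tokens)
ids <- match(tokens, lexicon) - 1L

# Step 2 (reverse index): for every lexicon id, collect the zero-based
# corpus positions at which the token occurs.
cpos <- seq_along(ids) - 1L
reverse_index <- split(cpos, factor(ids, levels = seq_along(lexicon) - 1L))
names(reverse_index) <- lexicon

ids
reverse_index[["oil"]]
```

The integer vector ids plays the role of the encoded token stream, and reverse_index allows all positions of a token to be looked up without scanning the whole stream.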
p_attribute_recode will recode the values in the avs-file and change the attribute value index in the avx file. The rng-file remains unchanged. The registry file also remains unchanged, so it is highly recommended to consider s_attribute_recode as a helper for corpus_recode, which will recode all s-attributes and all p-attributes and will reset the encoding in the registry file.
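What recoding between the two supported encodings means at the byte level can be shown with base R's iconv(). This is an illustration of the from/to idea only; it makes no claim about how the recoding is carried out internally.

```r
# A latin1-encoded string: "caf" followed by byte 0xE9 (e with acute accent)
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
Encoding(x) <- "latin1"

# Recode from latin1 to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")

charToRaw(x)  # 63 61 66 e9
charToRaw(y)  # 63 61 66 c3 a9
```

The accented character occupies one byte in latin1 and two bytes in UTF-8, so any byte offsets into a file of recoded strings change as well.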
Function p_attribute_rename can be used to rename a positional attribute. Note that the corpus is not refreshed (unloaded, re-loaded), so it may be necessary to restart R for changes to become effective.
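The effect of a dry run can be sketched with plain files. The helper rename_files_sketch() below is hypothetical and not part of RcppCWB; it assumes that the files of an attribute share the attribute name as a prefix (such as word.corpus and word.lexicon) and it does not touch any registry file.

```r
# Hypothetical helper illustrating a dry run; not part of RcppCWB.
rename_files_sketch <- function(data_dir, old, new, dryrun = FALSE) {
  stopifnot(length(old) == length(new))
  for (i in seq_along(old)) {
    # Files whose name starts with "<old>." are assumed to belong to the
    # attribute (assumption of this sketch).
    files <- list.files(data_dir, pattern = paste0("^", old[i], "\\."))
    for (f in files) {
      target <- sub(paste0("^", old[i]), new[i], f)
      message("... renaming ", f, " to ", target)
      if (!dryrun) file.rename(file.path(data_dir, f), file.path(data_dir, target))
    }
  }
  invisible(NULL)
}

# Try it on empty dummy files in a temporary directory
demo_dir <- file.path(tempdir(), "rename_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("word.corpus", "word.lexicon")))

# Dry run: messages are shown, files stay as they are
rename_files_sketch(demo_dir, old = "word", new = "token", dryrun = TRUE)
after_dryrun <- list.files(demo_dir)

# Actual renaming
rename_files_sketch(demo_dir, old = "word", new = "token")
after_rename <- list.files(demo_dir)
```

Comparing after_dryrun and after_rename shows that the dry run only reports the planned operations.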
library(RcppCWB)
# In this example, we pursue a "pure R" approach. To rely on the "CWB"
# method, you can use the cwb_install() function, which will download and
# install the CWB command line tools within the package.
tokens <- readLines(system.file(package = "RcppCWB", "extdata", "examples", "reuters.txt"))
# Create new (and empty) directory structure
tmpdir <- normalizePath(tempdir(), winslash = "/")
registry_tmp <- fs::path(tmpdir, "registry")
data_dir_tmp <- fs::path(tmpdir, "data_dir", "reuters")
if (file.exists(fs::path(data_dir_tmp, "word.corpus"))){
  file.remove(fs::path(data_dir_tmp, "word.corpus"))
}
if (dir.exists(registry_tmp)) unlink(registry_tmp, recursive = TRUE)
if (dir.exists(data_dir_tmp)) unlink(data_dir_tmp, recursive = TRUE)
dir.create(registry_tmp)
dir.create(data_dir_tmp, recursive = TRUE)
# Now encode token stream
p_attribute_encode(
  corpus = "reuters",
  token_stream = tokens, p_attribute = "word",
  data_dir = data_dir_tmp, method = "R",
  registry_dir = registry_tmp,
  compress = FALSE,
  encoding = "utf8"
)
#> ... writing tokenstream to disk (directly from R, equivalent to cwb-encode)
#> ... creating indices (in memory)
#> ... writing file: word.corpus
#> ... writing file: word.lexicon
#> ... writing file: word.lexicon.idx
#> ... creating data for new registry file
#> ... writing registry file
#> === Makeall: processing corpus reuters ===
#> Registry directory: /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpVB8DzH/registry
#> ATTRIBUTE word
#> + creating LEXSRT ... OK
#> - lexicon OK
#> + creating FREQS ... OK
#> - frequencies OK
#> - token stream OK
#> + creating REVCIDX ... OK
#> + creating REVCORP ... OK
#> ? validating REVCORP ... OK
#> - index OK
#> ========================================
# Create minimal registry file
regdata <- registry_data(
  id = "REUTERS", name = "Reuters Sample Corpus", home = data_dir_tmp,
  properties = c(encoding = "utf-8", language = "en"), p_attributes = "word"
)
regfile <- registry_file_write(
  data = regdata, corpus = "REUTERS",
  registry_dir = registry_tmp, data_dir = data_dir_tmp
)
# Reload corpus and run query as a test
if (cqp_is_initialized()) cqp_reset_registry(registry_tmp) else cqp_initialize(registry_tmp)
#> [1] TRUE
cqp_query(corpus = "REUTERS", query = '[]{3} "oil" []{3};')
#> <pointer: 0x7fd606a743a0>
regions <- cqp_dump_subcorpus(corpus = "REUTERS")
kwic <- apply(
  regions, 1,
  function(region){
    ids <- cl_cpos2id("REUTERS", "word", registry_tmp, cpos = region[1]:region[2])
    words <- cl_id2str(corpus = "REUTERS", p_attribute = "word", registry = registry_tmp, id = ids)
    paste0(words, collapse = " ")
  }
)
kwic[1:10]
#> [1] "prices for crude oil by 1.50 dlrs"
#> [2] "light of falling oil product prices and"
#> [3] "a weak crude oil market a company"
#> [4] "line of U.S oil companies that have"
#> [5] "days citing weak oil markets Reuter OPEC"
#> [6] "current slide in oil prices oil industry"
#> [7] "in oil prices oil industry analysts said"
#> [8] "movement to higher oil prices was never"
#> [9] "CERA Analysts and oil industry sources said"
#> [10] "faces is excess oil supply in world"