Generate positional attribute from a character vector of tokens (the token stream).
p_attribute_encode(
  token_stream,
  p_attribute = "word",
  registry_dir,
  corpus,
  data_dir,
  method = c("R", "CWB"),
  verbose = TRUE,
  quietly = FALSE,
  encoding = get_encoding(token_stream),
  compress = FALSE,
  reload = TRUE
)
p_attribute_recode(
  data_dir,
  p_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)
p_attribute_rename(
  corpus,
  old,
  new,
  registry_dir,
  verbose = TRUE,
  dryrun = FALSE
)
A character vector with the tokens of the corpus. The maximum length is 2 147 483 647 (2^31 - 1); a warning is issued if this threshold is exceeded. See the CWB Encoding Tutorial for size limitations of corpora. May also be a file.
The positional attribute to create: a character vector containing only lowercase ASCII characters (a-z), digits (0-9), -, and _. Non-ASCII and uppercase letters are not allowed. If method is "R", only one positional attribute can be encoded at a time. If method is "CWB", more than one p-attribute is allowed.
Registry directory.
ID of the CWB corpus to create.
The data directory for the binary files of the corpus.
Either 'CWB' or 'R', defaults to 'R'. See section 'Details'.
A logical value, whether to output progress messages.
A logical value passed into RcppCWB::cwb_makeall(), RcppCWB::cwb_huffcode() and RcppCWB::cwb_compress_rdx() to control the verbosity of these functions.
Encoding as defined in the charset corpus property of the registry file for the corpus ('latin1' to 'latin9', and 'utf8').
A logical value, whether to run RcppCWB::cwb_huffcode() and RcppCWB::cwb_compress_rdx() (method 'R'), or the command line tools cwb-huffcode and cwb-compress-rdx (method 'CWB'). Defaults to FALSE as compression is not stable on Windows.
A logical value that defaults to TRUE to ensure that all features are available.
Character string describing the current encoding of the attribute.
Character string describing the target encoding of the attribute.
A character vector with p-attributes to be renamed.
A character vector with new names of p-attributes. The vector needs to have the same length as vector old.
A logical value, whether to suppress the actual renaming operation in order to inspect output messages.
TRUE is returned invisibly if encoding has been successful. FALSE indicates an error has occurred.
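The naming rule for the p_attribute argument can be checked before an encoding run is started. The following is a minimal sketch in base R: is_valid_p_attribute() is a hypothetical helper written for illustration, not a function of the package, and the regular expression is just one reading of the rule stated in the argument description (lowercase a-z, digits, - and _).

```r
# Hypothetical helper (not part of the package): checks candidate
# p-attribute names against the rule from the argument description.
# perl = TRUE keeps the a-z range restricted to ASCII letters.
is_valid_p_attribute <- function(x) {
  is.character(x) &&
    length(x) >= 1L &&
    all(grepl("^[a-z0-9_-]+$", x, perl = TRUE))
}

is_valid_p_attribute("word")               # plain lowercase name
is_valid_p_attribute(c("pos", "lemma_2"))  # several names (method "CWB")
is_valid_p_attribute("Word")               # uppercase letter
is_valid_p_attribute("w\u00f6rter")        # non-ASCII letter
```

The first two calls should evaluate to TRUE and the last two to FALSE. Whether the package performs an equivalent check internally is not stated here, so a guard like this is best seen as an optional early sanity check.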
Four steps generate the binary CWB corpus data format for positional attributes: (1) Encode the token stream of the corpus, (2) create index files, (3) compress token stream and (4) compress index files. Whereas steps 1 and 2 are required to make a corpus work, steps 3 and 4 are optional yet useful to reduce disk usage and improve performance. See the CQP Corpus Encoding Tutorial (sections 2-4) for an explanation of the procedure.
p_attribute_encode() offers an R and a CWB implementation, controlled by the argument method. When choosing method 'R', the token stream is encoded in 'pure R', then the C implementation of CWB functionality as exposed to R via the RcppCWB package is used (functions RcppCWB::cwb_makeall() for indexing, RcppCWB::cwb_huffcode() and RcppCWB::cwb_compress_rdx() for compression). When choosing method 'CWB', the token stream is written to disk, then the CWB command line utilities 'cwb-encode', 'cwb-makeall', 'cwb-huffcode' and 'cwb-compress-rdx' are called using system2(). The 'CWB' method requires an installation of the CWB. The cwb_install() function will download and install the CWB command line tools within the package. The 'CWB' method is still supported as it is used in the test suite of the package. The 'R' method is robust and is recommended.
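To make step 1 (encoding the token stream) more concrete, the sketch below shows in base R how a token stream can be reduced to a lexicon of unique tokens plus a vector of zero-based integer ids. It illustrates the idea only and is not the package's implementation: the toy tokens and variable names are made up, and the on-disk layout of the binary files is deliberately left out.

```r
# Toy token stream (made up for illustration)
tokens <- c("oil", "prices", "fall", "as", "oil", "supply", "rises")

# Lexicon: unique tokens in order of first occurrence
lexicon <- unique(tokens)

# Token stream expressed as zero-based integer ids pointing into the lexicon
ids <- match(tokens, lexicon) - 1L

# Decoding the ids restores the original token stream
decoded <- lexicon[ids + 1L]
identical(decoded, tokens)
```

Because every repeated token is stored once in the lexicon and referenced by an integer, the representation stays compact and lends itself to the indexing and compression steps that follow.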
p_attribute_recode() will recode the values in the avs-file and change the attribute value index in the avx file. The rng-file remains unchanged. The registry file remains unchanged, and it is highly recommended to consider s_attribute_recode() as a helper for corpus_recode() that will recode all s-attributes, all p-attributes, and will reset the encoding in the registry file.
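What recoding means for a single attribute value can be illustrated with iconv() from base R. This is a sketch only: it works on one in-memory string built from raw bytes, does not touch any corpus files, and makes no claim about how p_attribute_recode() works internally.

```r
# "café" as latin1 bytes: the final byte 0xe9 is "é" in latin1
latin1_value <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Recode from latin1 to UTF-8, mirroring the from/to arguments
utf8_value <- iconv(latin1_value, from = "latin1", to = "UTF-8")

# In UTF-8, "é" takes two bytes (0xc3 0xa9), so the value grows by one byte
charToRaw(utf8_value)
```

The change in byte length is the reason why recoding has to update any index of byte offsets along with the values themselves.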
The function p_attribute_rename() can be used to rename a positional attribute. Note that the corpus is not refreshed (unloaded, re-loaded), so it may be necessary to restart R for changes to become effective.
# In this example, we follow a "pure R" approach.
library(dplyr)
reu <- system.file(package = "RcppCWB", "extdata", "examples", "reuters.txt")
tokens <- readLines(reu)
# Create new (and empty) directory structure
registry_tmp <- fs::path(tempdir(), "registry")
data_dir_tmp <- fs::path(tempdir(), "data_dir", "reuters")
if (dir.exists(registry_tmp)) unlink(registry_tmp, recursive = TRUE)
if (dir.exists(data_dir_tmp)) unlink(data_dir_tmp, recursive = TRUE)
dir.create(registry_tmp)
dir.create(data_dir_tmp, recursive = TRUE)
# Encode token stream (without compression)
p_attribute_encode(
  corpus = "reuters",
  token_stream = tokens,
  p_attribute = "word",
  data_dir = data_dir_tmp,
  registry_dir = registry_tmp,
  method = "R",
  compress = FALSE,
  quietly = TRUE,
  encoding = "utf8"
)
#> ℹ creating indices (in memory)
#> ✔ creating indices (in memory) [194ms]
#>
#> ℹ writing file: word.corpus
#> ✔ writing file: word.corpus [564ms]
#>
#> ℹ writing file: word.lexicon
#> ✔ writing file: word.lexicon [19ms]
#>
#> ℹ writing file: word.lexicon.idx
#> ✔ writing file: word.lexicon.idx [17ms]
#>
#> ℹ creating new registry file: /tmp/RtmpeQEeXz/registry/reuters
#> ℹ run `Rcpp::cwb_makeall()`
#> ✔ run `Rcpp::cwb_makeall()` [7ms]
#>
#> ✔ corpus reloaded: CL success / CQP success
# Augment registry file
registry_file_parse(corpus = "REUTERS", registry_dir = registry_tmp) %>%
  registry_set_name("Reuters Sample Corpus") %>%
  registry_set_property("charset", "utf8") %>%
  registry_set_property("language", "en") %>%
  registry_set_property("build_date", as.character(Sys.Date())) %>%
  registry_file_write()
# Run query as a test
library(RcppCWB)
cqp_query(corpus = "REUTERS", query = '[]{3} "oil" []{3};')
#> <pointer: 0x55b53b76f390>
regions <- cqp_dump_subcorpus(corpus = "REUTERS")
kwic <- apply(
  regions, 1,
  function(region){
    ids <- cl_cpos2id(
      "REUTERS",
      p_attribute = "word",
      registry = registry_tmp,
      cpos = region[1]:region[2]
    )
    words <- cl_id2str(
      corpus = "REUTERS",
      p_attribute = "word",
      registry = registry_tmp,
      id = ids
    )
    paste0(words, collapse = " ")
  }
)
kwic[1:10]
#> [1] "prices for crude oil by 1.50 dlrs"
#> [2] "light of falling oil product prices and"
#> [3] "a weak crude oil market a company"
#> [4] "line of U.S oil companies that have"
#> [5] "days citing weak oil markets Reuter OPEC"
#> [6] "current slide in oil prices oil industry"
#> [7] "in oil prices oil industry analysts said"
#> [8] "movement to higher oil prices was never"
#> [9] "CERA Analysts and oil industry sources said"
#> [10] "faces is excess oil supply in world"