Read, process and write data on structural attributes.
s_attribute_encode(
  values,
  data_dir,
  s_attribute,
  corpus,
  region_matrix,
  method = c("R", "CWB"),
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  encoding,
  delete = FALSE,
  verbose = TRUE
)
s_attribute_recode(
  data_dir,
  s_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)
s_attribute_files(s_attribute, data_dir)
s_attribute_get_values(s_attribute, data_dir)
s_attribute_get_regions(s_attribute, data_dir)
s_attribute_merge(x, y)
s_attribute_delete(corpus, s_attribute)
s_attribute_rename(corpus, old, new, registry_dir, verbose = TRUE)
values: A character vector with the values of the structural attribute.
data_dir: The data directory where to write the files.
s_attribute: Atomic character vector, the name of the structural attribute.
corpus: A CWB corpus.
region_matrix: A two-column matrix with corpus positions.
method: Either 'R' or 'CWB'.
registry_dir: Path name of the registry directory.
encoding: Encoding of the data.
delete: Logical, whether a call to RcppCWB::cl_delete_corpus is performed.
verbose: Logical.
from: Character string describing the current encoding of the attribute.
to: Character string describing the target encoding of the attribute.
x: Data defining a first s-attribute, a data.table (or an object coercible to a data.table) with three columns ("cpos_left", "cpos_right", "value").
y: Data defining a second s-attribute, a data.table (or an object coercible to a data.table) with three columns ("cpos_left", "cpos_right", "value").
old: A character vector with s-attributes to be renamed.
new: A character vector with new names of s-attributes. The vector needs to have the same length as vector old. The 1st, 2nd, 3rd ... nth attribute stated in vector old will get the new name at the 1st, 2nd, 3rd ... nth position of vector new.
In addition to using CWB functionality, the s_attribute_encode function includes a pure R implementation to add or modify structural attributes of an existing CWB corpus.
If the corpus has been loaded or used before, a new s-attribute may not be available until RcppCWB::cl_delete_corpus has been called. Set the argument delete to TRUE to have s_attribute_encode make this call.
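Before encoding, it can help to check that values and region_matrix fit together. The following is a minimal sketch in base R and is not part of cwbtools: the helper name check_s_attribute_input is hypothetical, and the rules it enforces (one value per region, left position not greater than right position, regions sorted and non-overlapping) are assumptions about well-formed input rather than documented requirements of s_attribute_encode.

```r
# Hypothetical helper: sanity checks on input intended for s_attribute_encode().
# The rules are assumptions about well-formed input, not documented behaviour.
check_s_attribute_input <- function(values, region_matrix) {
  stopifnot(
    is.character(values),
    is.matrix(region_matrix),
    ncol(region_matrix) == 2L,
    length(values) == nrow(region_matrix),          # one value per region
    all(region_matrix[, 1] <= region_matrix[, 2])   # left <= right
  )
  n <- nrow(region_matrix)
  if (n > 1L) {
    # every region starts after the preceding region has ended
    stopifnot(all(region_matrix[-1, 1] > region_matrix[-n, 2]))
  }
  invisible(TRUE)
}

# Made-up regions for illustration
region_matrix <- matrix(c(0L, 9L, 10L, 24L, 25L, 31L), ncol = 2, byrow = TRUE)
values <- c("1", "2", "3")
ok <- check_s_attribute_input(values, region_matrix)
```

If any check fails, stopifnot() raises an error that names the violated condition, so the problem surfaces before any file is written.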
s_attribute_recode will recode the values in the avs file and update the attribute value index in the avx file. The rng file remains unchanged. The registry file also remains unchanged, so it is highly recommended to treat s_attribute_recode as a helper for corpus_recode, which recodes all s-attributes and all p-attributes and resets the encoding in the registry file.
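Why recoding touches the index as well as the values can be illustrated in base R. The sketch below is not the cwbtools implementation. It assumes that the avs file stores the values as nul-terminated strings and that the avx file refers to them by byte offset; the helper byte_offsets and the sample values are made up for illustration.

```r
# Sample values, written with escapes so the script itself stays ASCII
values_utf8 <- c("Z\u00fcrich", "K\u00f6ln", "Bonn")
values_latin1 <- iconv(values_utf8, from = "UTF-8", to = "latin1")

# Recode from the current encoding ("from") to the target encoding ("to")
values_recoded <- iconv(values_latin1, from = "latin1", to = "UTF-8")

# Hypothetical helper: byte offset of each string when the strings are stored
# one after another, each followed by a terminating nul byte
byte_offsets <- function(x) {
  sizes <- nchar(x, type = "bytes") + 1L
  c(0L, head(cumsum(sizes), -1L))
}

offsets_latin1 <- byte_offsets(values_latin1)  # 0 7 12
offsets_utf8 <- byte_offsets(values_recoded)   # 0 8 14
```

Non-ASCII characters take one byte in latin1 and two bytes in UTF-8, so every offset after the first such character shifts. Under the stated assumption, this is why the index has to be rewritten together with the values.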
s_attribute_files will return a named character vector with the data files (extensions: "avs", "avx", "rng") in the directory indicated by data_dir for the structural attribute s_attribute.
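The shape of such a result can be sketched in base R. This is an illustration only: it assumes that the files are named after the s-attribute plus the extension and that the vector is named by extension, and the data directory is a hypothetical path.

```r
s_attribute <- "id"
# Hypothetical data directory, used only to build the paths
data_dir <- file.path(tempdir(), "indexed_corpora", "reuters")

extensions <- c("avs", "avx", "rng")
# Assumption: files are called <s_attribute>.<extension>
files <- setNames(
  file.path(data_dir, paste(s_attribute, extensions, sep = ".")),
  extensions
)

# Check which of the expected files are actually present
present <- file.exists(files)
```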
s_attribute_get_values is equivalent to performing the CL function cl_struc2str for all strucs of a structural attribute. It is a "pure R" operation that is faster than using CL, as it processes the entire files for the s-attribute directly. The return value is a character vector with all string values for the s-attribute.
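What processing the files directly can look like is sketched below in base R. This is not the cwbtools code. It rests on assumptions about the binary layout: the avs file holds the unique values as nul-terminated strings, and the avx file holds pairs of big-endian 4-byte integers (struc number, byte offset into the avs file). The demo files are written to a temporary directory so that the sketch runs on its own.

```r
tmp_dir <- tempfile()
dir.create(tmp_dir)
avs_file <- file.path(tmp_dir, "demo.avs")
avx_file <- file.path(tmp_dir, "demo.avx")

# Write demo data: three unique values, four strucs referring to them
writeBin(c("grain", "crude", "trade"), avs_file)  # strings are nul-terminated
avx_pairs <- rbind(0L:3L, c(0L, 6L, 0L, 12L))     # row 1: struc, row 2: offset
writeBin(as.vector(avx_pairs), avx_file, size = 4L, endian = "big")

# Read the index: one row per struc, second column is the byte offset
avx <- readBin(
  avx_file, what = integer(), size = 4L,
  n = file.size(avx_file) / 4, endian = "big"
)
avx_matrix <- matrix(avx, ncol = 2, byrow = TRUE)

# Read all strings at once and compute the byte offset at which each starts
strings <- readBin(avs_file, what = character(), n = file.size(avs_file))
string_offsets <- c(0L, head(cumsum(nchar(strings, type = "bytes") + 1L), -1L))

# Resolve the offset of every struc to its string value
values <- strings[match(avx_matrix[, 2], string_offsets)]
unlink(tmp_dir, recursive = TRUE)
```

Reading both files in one pass and resolving the offsets with a vectorised match() avoids one lookup call per struc, which is the kind of saving the description refers to.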
s_attribute_get_regions will return a two-column integer matrix with regions for the strucs of a given s-attribute. Left corpus positions are in the first column, right corpus positions in the second column. The result is equivalent to calling RcppCWB::get_region_matrix for all strucs of an s-attribute, but may be somewhat faster. It is a "pure R" function that is fast because it processes the files entirely and directly.
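Reading a region file in one pass can be sketched in base R as follows. The sketch is not the cwbtools code and assumes that the rng file is a plain sequence of big-endian 4-byte integers in which left and right corpus positions alternate. A demo file with made-up regions is written first so that the example runs on its own.

```r
rng_file <- tempfile(fileext = ".rng")

# Made-up regions: one row per struc, columns are left and right position
regions <- matrix(c(0L, 9L, 10L, 24L, 25L, 31L), ncol = 2, byrow = TRUE)

# Write the positions row by row (left, right, left, right, ...)
writeBin(as.vector(t(regions)), rng_file, size = 4L, endian = "big")

# Read the whole file at once; every integer takes four bytes
n_integers <- file.size(rng_file) / 4
raw_positions <- readBin(
  rng_file, what = integer(), size = 4L,
  n = n_integers, endian = "big"
)

# Fold the flat vector back into a two-column matrix
region_matrix <- matrix(raw_positions, ncol = 2, byrow = TRUE)
unlink(rng_file)
```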
s_attribute_merge combines two tables with regions for s-attributes, checking for intersections that may cause problems. The heuristic is to keep all non-intersecting annotations and those annotations that define the same region in object x and object y. Annotations of x and y that overlap uncleanly, i.e. without identical left and right corpus positions ("cpos_left" / "cpos_right"), are dropped. The intended scenario is to decode an s-attribute (using s_attribute_decode), mix in an additional annotation, and re-encode the enhanced s-attribute (using s_attribute_encode).
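The heuristic can be spelled out in a few lines of base R. The function merge_regions_sketch below is a hypothetical illustration written for this page, not the implementation of s_attribute_merge. In particular, how identical regions with differing values are handled is not stated above, and the sketch simply keeps both rows in that case.

```r
# Hypothetical re-implementation of the merge heuristic on data.frames
merge_regions_sketch <- function(x, y) {
  # TRUE for rows of a that intersect a region of b without being identical to it
  unclean <- function(a, b) {
    vapply(seq_len(nrow(a)), function(i) {
      overlaps <- a$cpos_left[i] <= b$cpos_right & b$cpos_left <= a$cpos_right[i]
      same <- a$cpos_left[i] == b$cpos_left & a$cpos_right[i] == b$cpos_right
      any(overlaps & !same)
    }, logical(1))
  }
  kept <- rbind(x[!unclean(x, y), ], y[!unclean(y, x), ])
  kept <- unique(kept)                 # identical annotations are kept once
  kept <- kept[order(kept$cpos_left), ]
  rownames(kept) <- NULL
  kept
}

x <- data.frame(
  cpos_left = c(0L, 3L, 6L), cpos_right = c(1L, 4L, 9L),
  value = c("A", "B", "C"), stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(3L, 8L, 12L), cpos_right = c(4L, 10L, 13L),
  value = c("B", "D", "E"), stringsAsFactors = FALSE
)
merged <- merge_regions_sketch(x, y)
```

Here the region 3 to 4 is identical in both tables and is kept once, the regions 0 to 1 and 12 to 13 do not intersect with anything and are kept, and the regions 6 to 9 and 8 to 10 overlap uncleanly and are both dropped.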
Function s_attribute_delete is not yet implemented.
Function s_attribute_rename can be used to rename a structural attribute.
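Because old and new are matched by position, it can be useful to check the pairing before renaming anything. The attribute names below are made up, and the sketch only builds the mapping in base R without calling s_attribute_rename.

```r
# Hypothetical attribute names, for illustration only
old <- c("id", "topic")
new <- c("doc_id", "doc_topic")

# Both vectors need to have the same length
stopifnot(length(old) == length(new))

# Positional pairing: the i-th element of old receives the i-th element of new
rename_map <- setNames(new, old)
new_name_for_topic <- rename_map[["topic"]]
```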
To decode a structural attribute, see s_attribute_decode.
library(RcppCWB)
library(cwbtools)

# Work on a temporary copy of the REUTERS sample corpus shipped with RcppCWB
registry_tmp <- fs::path(tempdir(), "cwb", "registry")
data_dir_tmp <- fs::path(tempdir(), "cwb", "indexed_corpora", "reuters")
corpus_copy(
  corpus = "REUTERS",
  registry_dir = system.file(package = "RcppCWB", "extdata", "cwb", "registry"),
  data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"),
  registry_dir_new = registry_tmp,
  data_dir_new = data_dir_tmp
)
# Number of regions (strucs) of the s-attribute "id"
no_strucs <- cl_attribute_size(
  corpus = "REUTERS",
  attribute = "id", attribute_type = "s",
  registry = registry_tmp
)
# Left and right corpus position of every region, one row per struc
cpos_list <- lapply(
  0L:(no_strucs - 1L),
  function(i)
    cl_struc2cpos(corpus = "REUTERS", struc = i, s_attribute = "id", registry = registry_tmp)
)
cpos_matrix <- do.call(rbind, cpos_list)
# Encode a new s-attribute "foo" with a running number as the value of each region
s_attribute_encode(
  values = as.character(seq_len(nrow(cpos_matrix))),
  data_dir = data_dir_tmp,
  s_attribute = "foo",
  corpus = "REUTERS",
  region_matrix = cpos_matrix,
  method = "R",
  registry_dir = registry_tmp,
  encoding = "latin1",
  verbose = TRUE,
  delete = TRUE
)
#> ... adding s-attribute 'foo' to registry
# Check that the values of the new s-attribute can be retrieved
cl_struc2str(
  "REUTERS", struc = 0L:(nrow(cpos_matrix) - 1L), s_attribute = "foo", registry = registry_tmp
)
#> [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
#> [16] "16" "17" "18" "19" "20"
unlink(registry_tmp, recursive = TRUE)
unlink(data_dir_tmp, recursive = TRUE)
# Read values and regions of the s-attribute "id" directly from the data files
avs <- s_attribute_get_values(
  s_attribute = "id",
  data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters")
)
rng <- s_attribute_get_regions(
  s_attribute = "id",
  data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters")
)
# Merge two sets of annotations; regions that overlap uncleanly are dropped
x <- data.frame(
  cpos_left = c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(5L, 11L, 20L, 25L, 30L),
  cpos_right = c(5L, 12L, 22L, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
s_attribute_merge(x, y)
#>   cpos_left cpos_right value
#> 1         1          2   ORG
#> 2         5          5   LOC
#> 3        25         27   ORG
#> 4        30         33   ORG