Read, process and write data on structural attributes.
s_attribute_encode(
  values,
  data_dir,
  s_attribute,
  corpus,
  region_matrix,
  method = c("R", "CWB"),
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  encoding,
  delete = FALSE,
  verbose = TRUE
)
s_attribute_recode(
  data_dir,
  s_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)
s_attribute_files(s_attribute, data_dir)
s_attribute_get_values(s_attribute, data_dir)
s_attribute_get_regions(s_attribute, data_dir)
s_attribute_merge(x, y)
s_attribute_delete(corpus, s_attribute)
s_attribute_rename(corpus, old, new, registry_dir, verbose = TRUE)
values: A character vector with the values of the structural attribute.
data_dir: The data directory where to write the files.
s_attribute: Name of the structural attribute, an atomic character vector
containing only lowercase ASCII letters (a-z), digits (0-9), hyphens (-),
and underscores (_). Non-ASCII and uppercase letters are not allowed.
corpus: A CWB corpus.
region_matrix: A two-column matrix with corpus positions.
method: Either 'R' or 'CWB'.
registry_dir: Path name of the registry directory.
encoding: Encoding of the data.
delete: Logical, whether to call RcppCWB::cl_delete_corpus().
verbose: Logical.
from: Character string describing the current encoding of the attribute.
to: Character string describing the target encoding of the attribute.
x: Data defining a first s-attribute, a data.table (or an object coercible
to a data.table) with three columns ("cpos_left", "cpos_right", "value").
y: Data defining a second s-attribute, a data.table (or an object coercible
to a data.table) with three columns ("cpos_left", "cpos_right", "value").
old: A character vector with s-attributes to be renamed.
new: A character vector with new names of s-attributes. The vector needs to
have the same length as vector old. The 1st, 2nd, 3rd ... nth attribute
stated in vector old will get the new name at the 1st, 2nd, 3rd ... nth
position of vector new.
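The naming rule for s_attribute can be checked before any files are written. The following is a minimal base R sketch; is_valid_s_attribute_name() is a hypothetical helper introduced for illustration and is not a cwbtools function.

```r
# Hypothetical helper (not part of cwbtools): TRUE for names made up only of
# lowercase ASCII letters, digits, hyphens and underscores.
is_valid_s_attribute_name <- function(x) {
  # perl = TRUE keeps the range a-z restricted to ASCII in every locale
  grepl("^[a-z0-9_-]+$", x, perl = TRUE)
}

is_valid_s_attribute_name(c("article_id", "Article_ID", "text-date"))
#> [1]  TRUE FALSE  TRUE
```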
s_attribute_encode() provides a 'pure R' implementation for adding or
modifying structural attributes of an existing CWB corpus. If the corpus
has been loaded or used before, a new s-attribute may not be available
until RcppCWB::cl_delete_corpus() has been called. Set the argument delete
to TRUE to have this function called.
s_attribute_recode() recodes the values in the avs file and updates the
attribute value index in the avx file. The rng file and the registry file
remain unchanged. It is therefore highly recommended to treat
s_attribute_recode() as a helper for corpus_recode(), which recodes all
s-attributes and all p-attributes and resets the encoding in the registry
file.
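Why recoding touches the index as well as the values can be illustrated in base R. The sketch below assumes that the avs file stores strings back to back, each followed by one NUL byte, and that the avx file refers to them by byte offset. This layout is an assumption made for illustration. byte_offsets() is a hypothetical helper, not a cwbtools function.

```r
# Three attribute values, once as UTF-8 and once as latin1.
values_utf8 <- enc2utf8(c("Z\u00fcrich", "K\u00f6ln", "Bonn"))
values_latin1 <- iconv(values_utf8, from = "UTF-8", to = "latin1")

# Hypothetical helper: byte offset at which each string would start if the
# strings were stored one after another, each terminated by a NUL byte
# (assumed layout of the avs file).
byte_offsets <- function(x) {
  n <- nchar(x, type = "bytes") + 1L
  c(0L, cumsum(n)[-length(n)])
}

byte_offsets(values_utf8)
#> [1]  0  8 14
byte_offsets(values_latin1)
#> [1]  0  7 12
```

Umlauts take two bytes in UTF-8 and one byte in latin1, so the offsets shift while the corpus positions of the regions stay the same.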
s_attribute_files() returns a named character vector with the data files
(extensions: "avs", "avx", "rng") in the directory indicated by data_dir
for the structural attribute s_attribute.
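The file names follow the pattern "<s_attribute>.<extension>". A minimal base R sketch of how such a vector could be assembled is given below. sketch_attribute_files() is a hypothetical helper, and naming the elements by extension is an assumption; the names used by s_attribute_files() itself may differ.

```r
# Hypothetical helper (not part of cwbtools): build the three expected file
# paths for an s-attribute and name them by extension (assumed naming).
sketch_attribute_files <- function(s_attribute, data_dir) {
  extensions <- c("avs", "avx", "rng")
  paths <- file.path(data_dir, paste(s_attribute, extensions, sep = "."))
  stats::setNames(paths, extensions)
}

files <- sketch_attribute_files("id", data_dir = file.path(tempdir(), "reuters"))
names(files)
#> [1] "avs" "avx" "rng"
```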
s_attribute_get_values() is equivalent to calling the CL function
cl_struc2str() for all strucs of a structural attribute. It is a "pure R"
operation that is faster than using CL, as it processes the entire files
for the s-attribute directly. The return value is a character vector with
all string values for the s-attribute.
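Reading values directly from the files can be sketched in base R with a toy attribute. The sketch assumes that the avs file holds NUL-terminated strings and that the avx file holds big-endian 32-bit integer pairs (struc number, byte offset into the avs file). Both points are assumptions about the file layout, and read_values() is a hypothetical helper rather than the package's code.

```r
# Write a toy attribute to a temporary directory.
tmp <- tempfile("sketch_")
dir.create(tmp)
avs_file <- file.path(tmp, "toy.avs")
avx_file <- file.path(tmp, "toy.avx")

# Two distinct strings; strucs 0 and 2 share the value stored at offset 0.
writeBin(c("LOC", "ORG"), con = avs_file)
writeBin(
  c(0L, 0L, 1L, 4L, 2L, 0L),  # (struc, offset) pairs, assumed layout
  con = avx_file, size = 4L, endian = "big"
)

# Hypothetical helper: resolve the offset of every struc to its string.
read_values <- function(avs_file, avx_file) {
  strings <- readBin(avs_file, what = character(), n = file.size(avs_file))
  avx <- readBin(
    avx_file, what = integer(), size = 4L,
    n = file.size(avx_file) / 4, endian = "big"
  )
  offsets <- avx[seq.int(from = 2L, to = length(avx), by = 2L)]
  # Byte offset at which each stored string starts
  starts <- c(0L, cumsum(nchar(strings, type = "bytes") + 1L)[-length(strings)])
  strings[match(offsets, starts)]
}

read_values(avs_file, avx_file)
#> [1] "LOC" "ORG" "LOC"
```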
s_attribute_get_regions() returns a two-column integer matrix with regions
for the strucs of a given s-attribute. Left corpus positions are in the
first column, right corpus positions in the second column. The result is
equivalent to calling RcppCWB::get_region_matrix() for all strucs of an
s-attribute, but may be somewhat faster: it is a "pure R" function that
reads the file in its entirety and directly.
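The idea of reading the whole file in one pass can be sketched in base R. The sketch assumes that the rng file is a plain sequence of big-endian 32-bit integers with the left and right corpus position of each struc stored in turn. read_regions() is a hypothetical helper used for illustration only.

```r
# Write a toy rng file: three regions, stored as left/right pairs.
rng_file <- tempfile(fileext = ".rng")
regions <- matrix(c(0L, 9L, 10L, 24L, 25L, 30L), ncol = 2, byrow = TRUE)
writeBin(as.vector(t(regions)), con = rng_file, size = 4L, endian = "big")

# Hypothetical helper: read all integers at once and reshape them so that
# each row is one region (left cpos, right cpos).
read_regions <- function(rng_file) {
  n_int <- file.size(rng_file) / 4
  cpos <- readBin(
    rng_file, what = integer(), size = 4L, n = n_int, endian = "big"
  )
  matrix(cpos, ncol = 2, byrow = TRUE)
}

identical(read_regions(rng_file), regions)
#> [1] TRUE
```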
s_attribute_merge() combines two tables with regions for s-attributes,
checking for intersections that may cause problems. The heuristic is to
keep all non-intersecting annotations and those annotations that define
the same region in object x and object y. Annotations of x and y that
overlap uncleanly, i.e. without identical left and right corpus positions
("cpos_left" / "cpos_right"), are dropped. The intended scenario is to
decode an s-attribute (using s_attribute_decode()), mix in an additional
annotation, and re-encode the enhanced s-attribute (using
s_attribute_encode()).
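The heuristic can be sketched in base R as follows. sketch_merge() is a hypothetical helper and not the package's implementation. In particular, keeping the row from x when both tables define the same region is an assumption made for this sketch.

```r
# Hypothetical helper: keep non-intersecting and identical regions, drop
# regions that overlap without being identical.
sketch_merge <- function(x, y) {
  i <- seq_len(nrow(x))
  j <- seq_len(nrow(y))
  # overlaps[i, j]: region i of x and region j of y share a corpus position
  overlaps <- outer(i, j, function(i, j) {
    x$cpos_left[i] <= y$cpos_right[j] & y$cpos_left[j] <= x$cpos_right[i]
  })
  # same[i, j]: both regions have identical left and right boundaries
  same <- outer(i, j, function(i, j) {
    x$cpos_left[i] == y$cpos_left[j] & x$cpos_right[i] == y$cpos_right[j]
  })
  unclean <- overlaps & !same
  keep_x <- rowSums(unclean) == 0
  # Rows of y identical to a row of x are skipped to avoid duplicates
  # (assumption: the value from x is kept).
  keep_y <- colSums(unclean) == 0 & colSums(same) == 0
  merged <- rbind(x[keep_x, ], y[keep_y, ])
  merged <- merged[order(merged$cpos_left), ]
  rownames(merged) <- NULL
  merged
}

x <- data.frame(
  cpos_left = c(1L, 5L, 10L), cpos_right = c(2L, 5L, 12L),
  value = c("ORG", "LOC", "ORG"), stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(5L, 11L, 30L), cpos_right = c(5L, 12L, 33L),
  value = c("LOC", "ORG", "PERS"), stringsAsFactors = FALSE
)
merged <- sketch_merge(x, y)
merged$cpos_left
#> [1]  1  5 30
```

The region 10-12 in x and the region 11-12 in y overlap without being identical, so both are dropped. The region 5-5 occurs in both tables and is kept once.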
Function s_attribute_delete() is not yet implemented.
Function s_attribute_rename() can be used to rename a structural attribute.
To decode a structural attribute, see s_attribute_decode().
library(RcppCWB)

# Copy the REUTERS corpus shipped with RcppCWB to a temporary location
registry_tmp <- fs::path(tempdir(), "cwb", "registry")
data_dir_tmp <- fs::path(tempdir(), "cwb", "indexed_corpora", "reuters")
cwb_dir_rcppcwb <- system.file(package = "RcppCWB", "extdata", "cwb")
registry_dir_rcppcwb <- fs::path(cwb_dir_rcppcwb, "registry")
data_dir_rcppcwb <- fs::path(cwb_dir_rcppcwb, "indexed_corpora", "reuters")
corpus_copy(
  corpus = "REUTERS",
  registry_dir = registry_dir_rcppcwb,
  data_dir = data_dir_rcppcwb,
  registry_dir_new = registry_tmp,
  data_dir_new = data_dir_tmp
)
# Get the regions of the existing s-attribute "id"
no_strucs <- cl_attribute_size(
  corpus = "REUTERS",
  attribute = "id",
  attribute_type = "s",
  registry = registry_tmp
)
cpos_matrix <- get_region_matrix(
  corpus = "REUTERS",
  struc = 0L:(no_strucs - 1L),
  s_attribute = "id",
  registry = registry_tmp
)
# Encode a new s-attribute "article_id" for the same regions
s_attribute_encode(
  values = seq_len(nrow(cpos_matrix)),
  data_dir = data_dir_tmp,
  s_attribute = "article_id",
  corpus = "REUTERS",
  region_matrix = cpos_matrix,
  method = "R",
  registry_dir = registry_tmp,
  encoding = "latin1",
  verbose = TRUE,
  delete = TRUE
)
#> ! class of input `values` is "integer"
#> ℹ unique values after coercion to `character` vector: 20
#> ℹ add s-attribute "article_id" to registry
# Check that the new s-attribute can be read
cl_struc2str(
  "REUTERS",
  struc = 0L:(nrow(cpos_matrix) - 1L),
  s_attribute = "article_id",
  registry = registry_tmp
)
#> [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
#> [16] "16" "17" "18" "19" "20"
unlink(registry_tmp, recursive = TRUE)
unlink(data_dir_tmp, recursive = TRUE)
# Read values and regions of s-attribute "id" directly from the data files
data_dir <- system.file(
  package = "RcppCWB",
  "extdata", "cwb", "indexed_corpora", "reuters"
)
avs <- s_attribute_get_values(s_attribute = "id", data_dir = data_dir)
rng <- s_attribute_get_regions(s_attribute = "id", data_dir = data_dir)
# Merge two annotation tables; uncleanly overlapping regions are dropped
x <- data.frame(
  cpos_left = c(1L, 5L, 10L, 20L, 25L),
  cpos_right = c(2L, 5L, 12L, 21L, 27L),
  value = c("ORG", "LOC", "ORG", "PERS", "ORG"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  cpos_left = c(5L, 11L, 20L, 25L, 30L),
  cpos_right = c(5L, 12L, 22L, 27L, 33L),
  value = c("LOC", "ORG", "ORG", "ORG", "ORG"),
  stringsAsFactors = FALSE
)
s_attribute_merge(x, y)
#> cpos_left cpos_right value
#> 1 1 2 ORG
#> 2 5 5 LOC
#> 3 25 27 ORG
#> 4 30 33 ORG