Decode a corpus or subcorpus and return an object of the class specified by the argument to.

decode(.Object, ...)

# S4 method for corpus
decode(
  .Object,
  to = c("data.table", "Annotation"),
  p_attributes = NULL,
  s_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)

# S4 method for character
decode(
  .Object,
  to = c("data.table", "Annotation"),
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE,
  ...
)

# S4 method for slice
decode(
  .Object,
  to = "data.table",
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)

# S4 method for partition
decode(
  .Object,
  to = "data.table",
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)

# S4 method for subcorpus
decode(
  .Object,
  to = "data.table",
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)

# S4 method for integer
decode(.Object, corpus, p_attributes, boost = NULL)

# S4 method for data.table
decode(.Object, corpus, p_attributes)

Arguments

.Object

The corpus or subcorpus to decode.

...

Further arguments.

to

The class of the returned object, stated as a length-one character vector.

p_attributes

The positional attributes to decode. If NULL (default), all positional attributes will be decoded.

s_attributes

The structural attributes to decode. If NULL (default), all structural attributes will be decoded.

decode

A logical value, whether to decode token ids and struc ids to character strings. If FALSE, the values of columns for p- and s-attributes will be integer vectors. If TRUE (default), the respective columns are character vectors.

verbose

A logical value, whether to output progress messages.

corpus

A CWB-indexed corpus, stated either as a length-one character vector or as a corpus object.

boost

A length-one logical value, whether to speed up decoding a long vector of token ids by reading in the lexicon file directly from the data directory of a corpus. If NULL (default), the internal decision rule is that boost will be TRUE if the corpus is larger than 10 000 000 tokens and more than 5 percent of the corpus is to be decoded.

Value

The return value will correspond to the class specified by argument to.

Details

The primary purpose of the method is type conversion. By obtaining the corpus or subcorpus in the format specified by the argument to, the data can be processed with tools that do not rely on the Corpus Workbench (CWB). Supported output formats are data.table (which can be converted to a data.frame or tibble easily) or an Annotation object as defined in the package NLP. Another purpose of decoding the corpus can be to rework it, and to re-import it into the CWB (e.g. using the cwbtools-package).
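As a minimal sketch of the conversion mentioned above: the table below is a hand-made stand-in for the output of decode() (its columns and values are hypothetical), so that the snippet runs without a CWB corpus. It only illustrates that a data.table can be turned into a plain data.frame or, if the tibble package happens to be installed, a tibble.

```r
library(data.table)

# Hypothetical stand-in for a decoded corpus; a real result of decode()
# would have one column per decoded p- and s-attribute.
dt <- data.table(
  cpos = 0:2,
  word = c("Liebe", "Kolleginnen", "!"),
  pos = c("ADJA", "NN", "$.")
)

# Conversion to a plain data.frame (creates a copy without the data.table class)
df <- as.data.frame(dt)

# Conversion to a tibble, only attempted if the tibble package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  tbl <- tibble::as_tibble(dt)
}
```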

An earlier version of the method included an option to decode a single s-attribute, which is no longer supported. See the s_attribute_decode function of the package RcppCWB.

If .Object is an integer vector, it is assumed to be a vector of integer ids of a p-attribute. The decode-method will translate token ids to string values as efficiently as possible. The approach taken depends on the corpus size and the share of the corpus that is to be decoded. To decode a large number of integer ids, it is more efficient to read the lexicon file from the data directory directly and to index the lexicon with the ids than to rely on RcppCWB::cl_id2str. The internal decision rule is to use the lexicon file when the corpus is larger than 10 000 000 tokens and more than 5 percent of the corpus is to be decoded. The encoding of the character vector that is returned will be the encoding of the locale (usually ISO-8859-1 on Windows, and UTF-8 on macOS and Linux machines).
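The decision rule and the lexicon lookup described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the function names use_lexicon_file and decode_ids are hypothetical, the lexicon is a toy vector, and the thresholds are simply those named in the text. The one detail worth noting is the index shift, because CWB token ids start at 0 whereas R vectors are 1-based.

```r
# Hypothetical lexicon: element i holds the string for token id i - 1,
# because CWB ids are zero-based and R vectors are one-based.
lexicon <- c("Liebe", "Kolleginnen", "und", "Kollegen", "!")

# Decision rule as described in the text (hypothetical helper):
# use the lexicon file for large corpora when a large share is decoded.
use_lexicon_file <- function(n_ids, corpus_size) {
  corpus_size > 1e7 && n_ids > 0.05 * corpus_size
}

# Vectorised lookup (hypothetical helper): shift ids by one to get R indices.
decode_ids <- function(ids, lexicon) {
  lexicon[ids + 1L]
}

ids <- c(0L, 2L, 0L, 4L)
tokens <- decode_ids(ids, lexicon)
```

Indexing a character vector in one vectorised step avoids a per-id function call, which is the reason the text gives for preferring the lexicon file when many ids are decoded.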

The decode-method for data.table objects will decode token ids (column '[p-attribute]_id'), adding the corresponding string as a new column. If a column "cpos" with corpus positions is present, ids are first derived for the corpus positions given. If the data.table has neither a column "cpos" nor columns with token ids (i.e. column names ending with "_id"), the input data.table is returned unchanged. Note that columns are added to the data.table in an in-place operation to use memory parsimoniously.
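The in-place behaviour described above can be illustrated with data.table alone. The sketch below is not the package's code: the lexicons list and its values are hypothetical toy data, and the loop merely mimics the idea of finding columns ending in "_id" and adding the decoded strings as new columns by reference.

```r
library(data.table)

# Hypothetical lexicons, one per p-attribute; id 0 comes first
lexicons <- list(
  word = c("Liebe", "Kolleginnen", "und"),
  pos = c("ADJA", "NN", "KON")
)

dt <- data.table(
  cpos = c(10L, 11L, 12L),
  word_id = c(0L, 1L, 2L),
  pos_id = c(0L, 1L, 2L)
)

addr_before <- address(dt)

# Find id columns and add the decoded strings by reference (no copy of dt)
id_cols <- grep("_id$", colnames(dt), value = TRUE)
for (col in id_cols) {
  p_attr <- sub("_id$", "", col)
  set(dt, j = p_attr, value = lexicons[[p_attr]][dt[[col]] + 1L])
}

# The object was modified in place, so its address should be unchanged
addr_after <- address(dt)
```

Because the table is modified by reference, any other variable pointing to the same data.table will see the new columns as well; use data.table::copy() beforehand if the input needs to stay untouched.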

See also

To decode a structural attribute, you can use the s_attributes-method, setting argument unique to FALSE, or the s_attribute_decode function. See as.VCorpus to decode a partition_bundle object, returning a VCorpus object.

Examples

use("polmineR")
#> ... activating corpus: GERMAPARLMINI (version: 0.0.1 | build date: 2019-02-23)
#> ... activating corpus: REUTERS
# Decode corpus as data.table
dt <- decode("GERMAPARLMINI", to = "data.table")
#> decoding p-attribute:word
#> decoding p-attribute:pos
#> decoding s-attribute:interjection
#> decoding s-attribute:date
#> decoding s-attribute:party
#> decoding s-attribute:speaker
#> assembling data.table
# Decode corpus selectively
dt <- decode("GERMAPARLMINI", to = "data.table", p_attributes = "word", s_attributes = "party")
#> decoding p-attribute:word
#> decoding s-attribute:party
#> assembling data.table
# Decode a subcorpus
sc <- subset(corpus("GERMAPARLMINI"), speaker == "Angela Dorothea Merkel")
dt <- decode(sc, to = "data.table")
#> ... decoding p_attribute word
#> ... decoding p_attribute pos
#> ... decoding s_attribute interjection
#> ... decoding s_attribute date
#> ... decoding s_attribute party
#> ... decoding s_attribute speaker
# Decode subcorpus selectively
dt <- decode(sc, to = "data.table", p_attributes = "word", s_attributes = "party")
#> ... decoding p_attribute word
#> ... decoding s_attribute party
# Decode partition
P <- partition("REUTERS", places = "kuwait", regex = TRUE)
#> ... get encoding: latin1
#> ... get cpos and strucs
dt <- decode(P)
#> ... decoding p_attribute word
#> ... decoding s_attribute id
#> ... decoding s_attribute topics_cat
#> ... decoding s_attribute places
#> ... decoding s_attribute language
# Previous versions of polmineR offered an option to decode a single
# s-attribute. This is how you could proceed to get a table with metadata.
dt <- decode(P, s_attribute = "id", decode = FALSE)
#> ... decoding p_attribute word
dt[, "word" := NULL]
#>      cpos id struc
#>   1:  753  5     5
#>   2:  754  5     5
#>   3:  755  5     5
#>   4:  756  5     5
#>   5:  757  5     5
#>  ---
#> 656: 3169 13    13
#> 657: 3170 13    13
#> 658: 3171 13    13
#> 659: 3172 13    13
#> 660: 3173 13    13
dt[,{list(cpos_left = min(.SD[["cpos"]]), cpos_right = max(.SD[["cpos"]]))}, by = "id"]
#>    id cpos_left cpos_right
#> 1:  5       753       1217
#> 2: 11      2874       2965
#> 3: 13      3071       3173
# Decode subcorpus as Annotation object
if (FALSE) {
if (requireNamespace("NLP")){
  library(NLP)
  p <- subset(corpus("GERMAPARLMINI"), date == "2009-11-10" & speaker == "Angela Dorothea Merkel")
  s <- as(p, "String")
  a <- as(p, "Annotation")

  # The beauty of having this NLP Annotation object is that you can now use
  # the different annotators of the openNLP package. Here, just a short scenario
  # how you can have a look at the tokenized words and the sentences.

  words <- s[a[a$type == "word"]]
  sentences <- s[a[a$type == "sentence"]] # does not yet work perfectly for plenary protocols
}
}

# decode vector of token ids
y <- decode(0:20, corpus = "GERMAPARLMINI", p_attributes = "word")
hits_dt <- hits("GERMAPARLMINI", query = "Liebe", progress = FALSE) %>% as.data.table()
dt <- data.table::data.table(cpos = hits_dt[["cpos_left"]])
decode(dt, corpus = "GERMAPARLMINI", p_attributes = c("word", "pos"))
#>       cpos word_id pos_id  word  pos
#>  1:     10      10      0 Liebe ADJA
#>  2:    568      10      0 Liebe ADJA
#>  3:   3492      10      1 Liebe   NN
#>  4:   3499      10      1 Liebe   NN
#>  5:   4822      10      0 Liebe ADJA
#>  6:   5323      10      0 Liebe ADJA
#>  7:   6077      10      0 Liebe ADJA
#>  8:   6719      10      0 Liebe ADJA
#>  9:   7470      10      0 Liebe ADJA
#> 10:   8502      10      0 Liebe ADJA
#> 11:   8662      10      0 Liebe ADJA
#> 12:   9119      10      0 Liebe ADJA
#> 13:   9931      10      1 Liebe   NN
#> 14:  10043      10      0 Liebe ADJA
#> 15:  10557      10      0 Liebe ADJA
#> 16:  10691      10      0 Liebe ADJA
#> 17:  21451      10      0 Liebe ADJA
#> 18:  25966      10      0 Liebe ADJA
#> 19:  45046      10      0 Liebe ADJA
#> 20:  46008      10      0 Liebe ADJA
#> 21:  47297      10      0 Liebe ADJA
#> 22:  49739      10      0 Liebe ADJA
#> 23:  56810      10      0 Liebe ADJA
#> 24:  59790      10      0 Liebe ADJA
#> 25:  59793      10      0 Liebe ADJA
#> 26:  66245      10      0 Liebe ADJA
#> 27:  69098      10      0 Liebe ADJA
#> 28:  69102      10      0 Liebe ADJA
#> 29:  70360      10      0 Liebe ADJA
#> 30:  70771      10      1 Liebe   NN
#> 31:  74678      10      0 Liebe ADJA
#> 32:  75805      10      0 Liebe ADJA
#> 33:  76002      10      0 Liebe ADJA
#> 34:  85838      10      0 Liebe ADJA
#> 35:  85841      10      0 Liebe ADJA
#> 36:  86522      10      0 Liebe ADJA
#> 37:  89354      10      0 Liebe ADJA
#> 38:  91483      10      0 Liebe ADJA
#> 39:  91704      10      0 Liebe ADJA
#> 40:  93459      10      0 Liebe ADJA
#> 41:  98038      10      0 Liebe ADJA
#> 42: 103072      10      0 Liebe ADJA
#> 43: 103934      10      0 Liebe ADJA
#> 44: 108337      10      0 Liebe ADJA
#> 45: 111273      10      0 Liebe ADJA
#> 46: 112594      10      0 Liebe ADJA
#> 47: 113368      10      0 Liebe ADJA
#> 48: 113831      10      0 Liebe ADJA
#> 49: 114772      10      0 Liebe ADJA
#> 50: 116114      10      0 Liebe ADJA
#> 51: 118275      10      0 Liebe ADJA
#> 52: 122273      10      0 Liebe ADJA
#> 53: 123463      10      0 Liebe ADJA
#> 54: 124268      10      0 Liebe ADJA
#> 55: 125486      10      0 Liebe ADJA
#> 56: 126850      10      0 Liebe ADJA
#> 57: 130690      10      0 Liebe ADJA
#> 58: 135910      10      0 Liebe ADJA
#> 59: 138099      10      0 Liebe ADJA
#> 60: 139004      10      0 Liebe ADJA
#> 61: 150552      10      0 Liebe ADJA
#> 62: 151702      10      0 Liebe ADJA
#> 63: 155655      10      0 Liebe ADJA
#> 64: 157311      10      0 Liebe ADJA
#> 65: 157918      10      0 Liebe ADJA
#> 66: 159346      10      0 Liebe ADJA
#> 67: 169744      10      0 Liebe ADJA
#> 68: 171657      10      0 Liebe ADJA
#> 69: 176041      10      0 Liebe ADJA
#> 70: 176447      10      0 Liebe ADJA
#> 71: 176824      10      0 Liebe ADJA
#> 72: 177589      10      0 Liebe ADJA
#> 73: 178112      10      0 Liebe ADJA
#> 74: 180526      10      0 Liebe ADJA
#> 75: 181253      10      0 Liebe ADJA
#> 76: 184704      10      0 Liebe ADJA
#> 77: 186167      10      0 Liebe ADJA
#> 78: 187917      10      0 Liebe ADJA
#> 79: 188731      10      0 Liebe ADJA
#> 80: 190545      10      1 Liebe   NN
#> 81: 192984      10      0 Liebe ADJA
#> 82: 194547      10      0 Liebe ADJA
#> 83: 196402      10      0 Liebe ADJA
#> 84: 196407      10      0 Liebe ADJA
#> 85: 199796      10      0 Liebe ADJA
#> 86: 202781      10      0 Liebe ADJA
#> 87: 202786      10      0 Liebe ADJA
#> 88: 203416      10      0 Liebe ADJA
#> 89: 204318      10      0 Liebe ADJA
#> 90: 207278      10      0 Liebe ADJA
#> 91: 212013      10      0 Liebe ADJA
#> 92: 216043      10      1 Liebe   NN
#> 93: 221108      10      0 Liebe ADJA
#>       cpos word_id pos_id  word  pos
y <- dt[, .N, by = c("word", "pos")]