Decode corpus or subcorpus.

Decode a corpus or subcorpus and return an object of the class specified by the argument `to`.

decode(.Object, ...)

# S4 method for corpus
decode(
  .Object,
  to = c("data.table", "Annotation"),
  p_attributes = NULL,
  s_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)

# S4 method for character
decode(
  .Object,
  to = c("data.table", "Annotation"),
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE,
  ...
)

# S4 method for slice
decode(
  .Object,
  to = "data.table",
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)

# S4 method for partition
decode(
  .Object,
  to = "data.table",
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)

# S4 method for subcorpus
decode(
  .Object,
  to = "data.table",
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)

# S4 method for integer
decode(.Object, corpus, p_attributes, boost = NULL)

# S4 method for data.table
decode(.Object, corpus, p_attributes)

Arguments

.Object	The `corpus` or `subcorpus` to decode.
...	Further arguments.
to	The class of the returned object, stated as a length-one `character` vector.
p_attributes	The positional attributes to decode. If `NULL` (default), all positional attributes will be decoded.
s_attributes	The structural attributes to decode. If `NULL` (default), all structural attributes will be decoded.
decode	A `logical` value, whether to decode token ids and struc ids to character strings. If `FALSE`, the values of columns for p- and s-attributes will be `integer` vectors. If `TRUE` (default), the respective columns are `character` vectors.
verbose	A `logical` value, whether to output progress messages.
corpus	A CWB indexed corpus, either a length-one `character` vector, or a `corpus` object.
boost	A length-one `logical` value, whether to speed up decoding a long vector of token ids by reading the lexicon file directly from the data directory of a corpus. If `NULL` (default), the internal decision rule is that `boost` will be `TRUE` if the corpus is larger than 10 million tokens and more than 5 percent of the corpus is to be decoded.

Value

The return value is an object of the class specified by the argument `to`.

Details

The primary purpose of the method is type conversion. By obtaining the corpus or subcorpus in the format specified by the argument `to`, the data can be processed with tools that do not rely on the Corpus Workbench (CWB). Supported output formats are data.table (which can easily be converted to a data.frame or tibble) or an Annotation object as defined in the package NLP. Another purpose of decoding the corpus can be to rework it and re-import it into the CWB (e.g. using the cwbtools package).
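
As a minimal sketch of the conversion mentioned above, the following uses a small hand-made data.table as a stand-in for decoded output (the column names and values are illustrative assumptions, not actual output of decode). The tibble conversion only runs if the tibble package happens to be installed.

```r
library(data.table)

# Hypothetical stand-in for a decoded corpus: one row per token.
dt <- data.table(
  cpos = 0:2,
  word = c("Liebe", "Kolleginnen", "und"),
  party = "CDU"
)

# Convert to a plain data.frame (this creates a copy).
df <- as.data.frame(dt)

# Optionally convert to a tibble, if that package is available.
if (requireNamespace("tibble", quietly = TRUE)) {
  tbl <- tibble::as_tibble(dt)
}
```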

An earlier version of the method included an option to decode a single s-attribute, which is not supported any more. See the s_attribute_decode function of the package RcppCWB.

If .Object is an integer vector, it is assumed to be a vector of integer ids of p-attributes. The decode-method will translate token ids to string values as efficiently as possible. The approach taken will depend on the corpus size and the share of the corpus that is to be decoded. To decode a large number of integer ids, it is more efficient to read the lexicon file from the data directory directly and to index the lexicon with the ids rather than relying on RcppCWB::cl_id2str. The internal decision rule is to use the lexicon file when the corpus is larger than 10 million tokens and more than 5 percent of the corpus is to be decoded. The encoding of the character vector that is returned will be the encoding of the locale (usually ISO-8859-1 on Windows, and UTF-8 on macOS and Linux machines).
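
The idea of indexing a lexicon with token ids can be illustrated with a toy example. This is only a sketch: the lexicon vector and the helpers decode_by_lexicon and use_lexicon are hypothetical names introduced here, not part of polmineR, and the size threshold is read as 10 million tokens. The one detail that matters in practice is that CWB token ids are zero-based, whereas R vectors are one-based.

```r
# Hypothetical toy lexicon: element i holds the string for token id i - 1.
lexicon <- c("the", "a", "corpus", "token", "decode")

# Decode many ids in one vectorised step by indexing the lexicon.
# The "+ 1L" shifts zero-based token ids to one-based R positions.
decode_by_lexicon <- function(ids, lexicon) lexicon[ids + 1L]

# Hypothetical restatement of the decision rule described above:
# use the lexicon file only for large corpora and large shares.
use_lexicon <- function(corpus_size, n_ids, min_size = 1e7, min_share = 0.05) {
  corpus_size > min_size && (n_ids / corpus_size) > min_share
}

ids <- c(2L, 0L, 4L, 2L)
decoded <- decode_by_lexicon(ids, lexicon)
```

Under these assumptions, use_lexicon(2e7, 2e6) is TRUE (large corpus, 10 percent to decode), while a corpus of 5 million tokens never qualifies regardless of the share.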

The decode-method for data.table objects will decode token ids (column '[p-attribute]_id'), adding the corresponding string as a new column. If a column "cpos" with corpus positions is present, ids are derived for the corpus positions given first. If the data.table has neither a column "cpos" nor columns with token ids (i.e. column names ending with "_id"), the input data.table is returned unchanged. Note that columns are added to the data.table in an in-place operation to handle memory parsimoniously.
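
The in-place behaviour described above can be sketched with data.table alone. The helper add_strings and the toy lexicon are hypothetical and only mimic the column handling; they do not access a CWB corpus. Because data.table::set() modifies the table by reference, the caller's object gains the new column without any reassignment.

```r
library(data.table)

# Hypothetical toy lexicon; token ids are zero-based.
lexicon <- c("Liebe", "Kolleginnen", "und", "Kollegen")

# Hypothetical helper: for every column ending in "_id", add a column
# with the decoded strings, modifying the table by reference.
add_strings <- function(x, lexicon) {
  id_cols <- grep("_id$", colnames(x), value = TRUE)
  for (col in id_cols) {
    new_col <- sub("_id$", "", col)
    set(x, j = new_col, value = lexicon[x[[col]] + 1L])
  }
  invisible(x)
}

dt <- data.table(cpos = 0:3, word_id = c(0L, 1L, 2L, 3L))
add_strings(dt, lexicon)   # dt now has a "word" column; no assignment needed

# A table without "_id" columns is left unchanged.
dt2 <- data.table(x = 1:2)
add_strings(dt2, lexicon)
```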

Examples

use("polmineR")
#> ... activating corpus: GERMAPARLMINI (version: 0.0.1 | build date: 2019-02-23)
#> ... activating corpus: REUTERS

# Decode corpus as data.table
dt <- decode("GERMAPARLMINI", to = "data.table")
#> decoding p-attribute:word
#> decoding p-attribute:pos
#> decoding s-attribute:interjection
#> decoding s-attribute:date
#> decoding s-attribute:party
#> decoding s-attribute:speaker
#> assembling data.table

# Decode corpus selectively
dt <- decode("GERMAPARLMINI", to = "data.table", p_attributes = "word", s_attributes = "party")
#> decoding p-attribute:word
#> decoding s-attribute:party
#> assembling data.table

# Decode a subcorpus
sc <- subset(corpus("GERMAPARLMINI"), speaker == "Angela Dorothea Merkel")
dt <- decode(sc, to = "data.table")
#> ... decoding p_attribute word
#> ... decoding p_attribute pos
#> ... decoding s_attribute interjection
#> ... decoding s_attribute date
#> ... decoding s_attribute party
#> ... decoding s_attribute speaker

# Decode subcorpus selectively
dt <- decode(sc, to = "data.table", p_attributes = "word", s_attributes = "party")
#> ... decoding p_attribute word
#> ... decoding s_attribute party

# Decode partition
P <- partition("REUTERS", places = "kuwait", regex = TRUE)
#> ... get encoding: latin1
#> ... get cpos and strucs
dt <- decode(P)
#> ... decoding p_attribute word
#> ... decoding s_attribute id
#> ... decoding s_attribute topics_cat
#> ... decoding s_attribute places
#> ... decoding s_attribute language

# Previous versions of polmineR offered an option to decode a single
# s-attribute. This is how you could proceed to get a table with metadata.
dt <- decode(P, s_attributes = "id", decode = FALSE)
#> ... decoding p_attribute word
dt[, "word" := NULL]
#>      cpos id struc
#>   1:  753  5     5
#>   2:  754  5     5
#>   3:  755  5     5
#>   4:  756  5     5
#>   5:  757  5     5
#>  ---              
#> 656: 3169 13    13
#> 657: 3170 13    13
#> 658: 3171 13    13
#> 659: 3172 13    13
#> 660: 3173 13    13
dt[,{list(cpos_left = min(.SD[["cpos"]]), cpos_right = max(.SD[["cpos"]]))}, by = "id"]
#>    id cpos_left cpos_right
#> 1:  5       753       1217
#> 2: 11      2874       2965
#> 3: 13      3071       3173

# Decode subcorpus as Annotation object
if (FALSE) {
if (requireNamespace("NLP")){
  library(NLP)
  p <- subset(corpus("GERMAPARLMINI"), date == "2009-11-10" & speaker == "Angela Dorothea Merkel")
  s <- as(p, "String")
  a <- as(p, "Annotation")
  
  # The beauty of having this NLP Annotation object is that you can now use 
  # the different annotators of the openNLP package. Here, just a short scenario
  # how you can have a look at the tokenized words and the sentences.

  words <- s[a[a$type == "word"]]
  sentences <- s[a[a$type == "sentence"]] # does not yet work perfectly for plenary protocols 
}
}
 
# decode vector of token ids
y <- decode(0:20, corpus = "GERMAPARLMINI", p_attributes = "word")
# Decode token ids for the corpus positions given in a data.table
hits_dt <- hits("GERMAPARLMINI", query = "Liebe", progress = FALSE) %>%
  as.data.table()
dt <- data.table::data.table(cpos = hits_dt[["cpos_left"]])
decode(dt, corpus = "GERMAPARLMINI", p_attributes = c("word", "pos"))
#>       cpos word_id pos_id  word  pos
#>  1:     10      10      0 Liebe ADJA
#>  2:    568      10      0 Liebe ADJA
#>  3:   3492      10      1 Liebe   NN
#>  4:   3499      10      1 Liebe   NN
#>  5:   4822      10      0 Liebe ADJA
#>  6:   5323      10      0 Liebe ADJA
#>  7:   6077      10      0 Liebe ADJA
#>  8:   6719      10      0 Liebe ADJA
#>  9:   7470      10      0 Liebe ADJA
#> 10:   8502      10      0 Liebe ADJA
#> 11:   8662      10      0 Liebe ADJA
#> 12:   9119      10      0 Liebe ADJA
#> 13:   9931      10      1 Liebe   NN
#> 14:  10043      10      0 Liebe ADJA
#> 15:  10557      10      0 Liebe ADJA
#> 16:  10691      10      0 Liebe ADJA
#> 17:  21451      10      0 Liebe ADJA
#> 18:  25966      10      0 Liebe ADJA
#> 19:  45046      10      0 Liebe ADJA
#> 20:  46008      10      0 Liebe ADJA
#> 21:  47297      10      0 Liebe ADJA
#> 22:  49739      10      0 Liebe ADJA
#> 23:  56810      10      0 Liebe ADJA
#> 24:  59790      10      0 Liebe ADJA
#> 25:  59793      10      0 Liebe ADJA
#> 26:  66245      10      0 Liebe ADJA
#> 27:  69098      10      0 Liebe ADJA
#> 28:  69102      10      0 Liebe ADJA
#> 29:  70360      10      0 Liebe ADJA
#> 30:  70771      10      1 Liebe   NN
#> 31:  74678      10      0 Liebe ADJA
#> 32:  75805      10      0 Liebe ADJA
#> 33:  76002      10      0 Liebe ADJA
#> 34:  85838      10      0 Liebe ADJA
#> 35:  85841      10      0 Liebe ADJA
#> 36:  86522      10      0 Liebe ADJA
#> 37:  89354      10      0 Liebe ADJA
#> 38:  91483      10      0 Liebe ADJA
#> 39:  91704      10      0 Liebe ADJA
#> 40:  93459      10      0 Liebe ADJA
#> 41:  98038      10      0 Liebe ADJA
#> 42: 103072      10      0 Liebe ADJA
#> 43: 103934      10      0 Liebe ADJA
#> 44: 108337      10      0 Liebe ADJA
#> 45: 111273      10      0 Liebe ADJA
#> 46: 112594      10      0 Liebe ADJA
#> 47: 113368      10      0 Liebe ADJA
#> 48: 113831      10      0 Liebe ADJA
#> 49: 114772      10      0 Liebe ADJA
#> 50: 116114      10      0 Liebe ADJA
#> 51: 118275      10      0 Liebe ADJA
#> 52: 122273      10      0 Liebe ADJA
#> 53: 123463      10      0 Liebe ADJA
#> 54: 124268      10      0 Liebe ADJA
#> 55: 125486      10      0 Liebe ADJA
#> 56: 126850      10      0 Liebe ADJA
#> 57: 130690      10      0 Liebe ADJA
#> 58: 135910      10      0 Liebe ADJA
#> 59: 138099      10      0 Liebe ADJA
#> 60: 139004      10      0 Liebe ADJA
#> 61: 150552      10      0 Liebe ADJA
#> 62: 151702      10      0 Liebe ADJA
#> 63: 155655      10      0 Liebe ADJA
#> 64: 157311      10      0 Liebe ADJA
#> 65: 157918      10      0 Liebe ADJA
#> 66: 159346      10      0 Liebe ADJA
#> 67: 169744      10      0 Liebe ADJA
#> 68: 171657      10      0 Liebe ADJA
#> 69: 176041      10      0 Liebe ADJA
#> 70: 176447      10      0 Liebe ADJA
#> 71: 176824      10      0 Liebe ADJA
#> 72: 177589      10      0 Liebe ADJA
#> 73: 178112      10      0 Liebe ADJA
#> 74: 180526      10      0 Liebe ADJA
#> 75: 181253      10      0 Liebe ADJA
#> 76: 184704      10      0 Liebe ADJA
#> 77: 186167      10      0 Liebe ADJA
#> 78: 187917      10      0 Liebe ADJA
#> 79: 188731      10      0 Liebe ADJA
#> 80: 190545      10      1 Liebe   NN
#> 81: 192984      10      0 Liebe ADJA
#> 82: 194547      10      0 Liebe ADJA
#> 83: 196402      10      0 Liebe ADJA
#> 84: 196407      10      0 Liebe ADJA
#> 85: 199796      10      0 Liebe ADJA
#> 86: 202781      10      0 Liebe ADJA
#> 87: 202786      10      0 Liebe ADJA
#> 88: 203416      10      0 Liebe ADJA
#> 89: 204318      10      0 Liebe ADJA
#> 90: 207278      10      0 Liebe ADJA
#> 91: 212013      10      0 Liebe ADJA
#> 92: 216043      10      1 Liebe   NN
#> 93: 221108      10      0 Liebe ADJA
#>       cpos word_id pos_id  word  pos
y <- dt[, .N, by = c("word", "pos")]
