Decode a corpus or subcorpus and return an object of the class specified by argument to.

    decode(.Object, ...)

    # S4 method for corpus
    decode(
      .Object,
      to = c("data.table", "Annotation"),
      p_attributes = NULL,
      s_attributes = NULL,
      decode = TRUE,
      verbose = TRUE
    )

    # S4 method for character
    decode(
      .Object,
      to = c("data.table", "Annotation"),
      s_attributes = NULL,
      p_attributes = NULL,
      decode = TRUE,
      verbose = TRUE,
      ...
    )

    # S4 method for slice
    decode(
      .Object,
      to = "data.table",
      s_attributes = NULL,
      p_attributes = NULL,
      decode = TRUE,
      verbose = TRUE
    )

    # S4 method for partition
    decode(
      .Object,
      to = "data.table",
      s_attributes = NULL,
      p_attributes = NULL,
      decode = TRUE,
      verbose = TRUE
    )

    # S4 method for subcorpus
    decode(
      .Object,
      to = "data.table",
      s_attributes = NULL,
      p_attributes = NULL,
      decode = TRUE,
      verbose = TRUE
    )

    # S4 method for integer
    decode(.Object, corpus, p_attributes, boost = NULL)

    # S4 method for data.table
    decode(.Object, corpus, p_attributes)

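Argument | Description |
---|---|
.Object | The corpus or subcorpus to decode. |
... | Further arguments. |
to | The class of the returned object, stated as a length-one character vector. |
p_attributes | The positional attributes to decode. If NULL (the default), all positional attributes are decoded. |
s_attributes | The structural attributes to decode. If NULL (the default), all structural attributes are decoded. |
decode | A logical value, whether to decode ids to string values. |
verbose | A logical value, whether to show progress messages. |
corpus | A CWB indexed corpus, either a length-one character vector or a corpus object. |
boost | A length-one logical value, whether to decode token ids by reading the lexicon file directly. If NULL (the default), the internal decision rule described below applies. |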
The return value will correspond to the class specified by argument to.
The primary purpose of the method is type conversion. By obtaining the corpus or subcorpus in the format specified by the argument to, the data can be processed with tools that do not rely on the Corpus Workbench (CWB). Supported output formats are data.table (which can be converted to a data.frame or tibble easily) or an Annotation object as defined in the package NLP. Another purpose of decoding the corpus can be to rework it and to re-import it into the CWB (e.g. using the cwbtools package).
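As a minimal sketch of the conversion step, the following uses a small hand-built data.table as a stand-in for the result of decode(..., to = "data.table"). The column names and values are illustrative assumptions, not actual output of the method.

```r
library(data.table)

# Illustrative stand-in for a decoded corpus; column names and values are
# assumptions made for this sketch, not real decode() output.
dt <- data.table(
  cpos = 0:4,
  word = c("Guten", "Morgen", "Liebe", "Kolleginnen", "!"),
  party = c("SPD", "SPD", "CDU", "CDU", "CDU")
)

# Convert to a plain data.frame so that tools unaware of data.table can use it.
df <- as.data.frame(dt)

# Count tokens per party with base R only.
tokens_by_party <- table(df$party)
tokens_by_party
```

If the tibble package is installed, tibble::as_tibble(dt) would work analogously.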
An earlier version of the method included an option to decode a single s-attribute, which is no longer supported. See the s_attribute_decode function of the package RcppCWB.
If .Object is an integer vector, it is assumed to be a vector of integer ids of p-attributes. The decode method will translate token ids to string values as efficiently as possible. The approach taken depends on the corpus size and the share of the corpus that is to be decoded. To decode a large number of integer ids, it is more efficient to read the lexicon file from the data directory directly and to index the lexicon with the ids than to rely on RcppCWB::cl_id2str. The internal decision rule is to use the lexicon file when the corpus is larger than 10 million tokens and more than 5 percent of the corpus is to be decoded. The encoding of the character vector that is returned will be the encoding of the locale (usually ISO-8859-1 on Windows, and UTF-8 on macOS and Linux machines).
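The decision rule and the lexicon lookup can be sketched in base R. This is not polmineR's internal code: the helper use_lexicon_file() and the toy lexicon are hypothetical, and the threshold is read as 10 million tokens, which is an assumption about the intended value.

```r
# Hypothetical helper, not part of polmineR: restates the decision rule,
# reading the size threshold as 10 million tokens (assumption).
use_lexicon_file <- function(corpus_size, n_ids) {
  corpus_size > 1e7 && (n_ids / corpus_size) > 0.05
}

# Toy lexicon standing in for the lexicon file of a p-attribute (assumption).
lexicon <- c("Guten", "Morgen", "Liebe", "Kolleginnen")

# CWB token ids are zero-based, R vectors are one-based, so shift by one.
ids <- c(2L, 0L, 3L, 2L)
tokens <- lexicon[ids + 1L]
tokens

use_lexicon_file(corpus_size = 2e7, n_ids = 2e6) # large corpus, 10 percent
use_lexicon_file(corpus_size = 2e7, n_ids = 1e5) # large corpus, 0.5 percent
use_lexicon_file(corpus_size = 5e6, n_ids = 1e6) # corpus below the threshold
```

Indexing a character vector with all ids at once is a single vectorised operation, which is why it scales better than looking up ids one by one once the lexicon is in memory.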
The decode method for data.table objects will decode token ids (column '[p-attribute]_id'), adding the corresponding string as a new column. If a column "cpos" with corpus positions is present, ids are first derived for the corpus positions given. If the data.table has neither a column "cpos" nor columns with token ids (i.e. column names ending with "_id"), the input data.table is returned unchanged. Note that columns are added to the data.table in an in-place operation to use memory sparingly.
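The in-place behaviour can be illustrated with data.table alone. The helper add_decoded_column() and the toy lexicon below are hypothetical and only mimic the documented behaviour for id columns. They are not the package's implementation and skip the "cpos" step.

```r
library(data.table)

# Toy lexicon for p-attribute "word"; ids are zero-based (illustrative values).
lexicon <- c("Guten", "Morgen", "Liebe")

# Hypothetical helper, not polmineR's implementation.
add_decoded_column <- function(dt, p_attribute, lexicon) {
  id_col <- paste0(p_attribute, "_id")
  # Without an id column there is nothing to decode: return the input unchanged.
  if (!id_col %in% colnames(dt)) return(dt)
  # set() adds the column by reference, so dt is not copied.
  set(dt, j = p_attribute, value = lexicon[dt[[id_col]] + 1L])
  dt
}

dt <- data.table(word_id = c(2L, 0L, 1L))
add_decoded_column(dt, "word", lexicon)
colnames(dt) # dt itself has gained the column, without reassignment

other <- data.table(x = 1:3)
unchanged <- add_decoded_column(other, "word", lexicon)
```

Because the column is added by reference, the caller's object changes even if the return value is discarded. Take a data.table::copy() first if the original table must stay untouched.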
To decode a structural attribute, you can use the s_attributes method, setting argument unique to FALSE, or the s_attribute_decode function. See as.VCorpus to decode a partition_bundle object, returning a VCorpus object.

    library(polmineR)
    use("polmineR") # activate the GERMAPARLMINI demo corpus shipped with polmineR

    # Decode corpus as data.table
    dt <- decode("GERMAPARLMINI", to = "data.table")

    # Decode corpus selectively
    dt <- decode("GERMAPARLMINI", to = "data.table", p_attributes = "word", s_attributes = "party")

    # Decode a subcorpus
    sc <- subset(corpus("GERMAPARLMINI"), speaker == "Angela Dorothea Merkel")
    dt <- decode(sc, to = "data.table")

    # Decode subcorpus selectively
    dt <- decode(sc, to = "data.table", p_attributes = "word", s_attributes = "party")

    # Decode a partition (P is assumed to be an existing partition object)
    dt <- decode(P)

    # Previous versions of polmineR offered an option to decode a single
    # s-attribute. This is how you could proceed to get a table with metadata.
    dt <- decode(P, s_attributes = "id", decode = FALSE)
    dt[, "word" := NULL]
    #>      cpos id struc
    #>   1:  753  5     5
    #>   2:  754  5     5
    #>   3:  755  5     5
    #>   4:  756  5     5
    #>   5:  757  5     5
    #>  ---
    #> 656: 3169 13    13
    #> 657: 3170 13    13
    #> 658: 3171 13    13
    #> 659: 3172 13    13
    #> 660: 3173 13    13

    #>    id cpos_left cpos_right
    #> 1:  5       753       1217
    #> 2: 11      2874       2965
    #> 3: 13      3071       3173

    # Decode subcorpus as Annotation object
    if (FALSE) {
      if (requireNamespace("NLP")) {
        library(NLP)
        p <- subset(corpus("GERMAPARLMINI"), date == "2009-11-10" & speaker == "Angela Dorothea Merkel")
        s <- as(p, "String")
        a <- as(p, "Annotation")

        # The beauty of having this NLP Annotation object is that you can now use
        # the different annotators of the openNLP package. Here, just a short scenario
        # how you can have a look at the tokenized words and the sentences.
        words <- s[a[a$type == "word"]]
        sentences <- s[a[a$type == "sentence"]] # does not yet work perfectly for plenary protocols
      }
    }

    # decode vector of token ids
    y <- decode(0:20, corpus = "GERMAPARLMINI", p_attributes = "word")

    # decode token ids for the corpus positions of query hits
    hits_dt <- hits("GERMAPARLMINI", query = "Liebe", progress = FALSE) %>% as.data.table()
    dt <- data.table::data.table(cpos = hits_dt[["cpos_left"]])
    decode(dt, corpus = "GERMAPARLMINI", p_attributes = c("word", "pos"))
    #>       cpos word_id pos_id  word  pos
    #>  1:     10      10      0 Liebe ADJA
    #>  2:    568      10      0 Liebe ADJA
    #>  3:   3492      10      1 Liebe   NN
    #>  4:   3499      10      1 Liebe   NN
    #>  5:   4822      10      0 Liebe ADJA
    #>  6:   5323      10      0 Liebe ADJA
    #>  7:   6077      10      0 Liebe ADJA
    #>  8:   6719      10      0 Liebe ADJA
    #>  9:   7470      10      0 Liebe ADJA
    #> 10:   8502      10      0 Liebe ADJA
    #> 11:   8662      10      0 Liebe ADJA
    #> 12:   9119      10      0 Liebe ADJA
    #> 13:   9931      10      1 Liebe   NN
    #> 14:  10043      10      0 Liebe ADJA
    #> 15:  10557      10      0 Liebe ADJA
    #> 16:  10691      10      0 Liebe ADJA
    #> 17:  21451      10      0 Liebe ADJA
    #> 18:  25966      10      0 Liebe ADJA
    #> 19:  45046      10      0 Liebe ADJA
    #> 20:  46008      10      0 Liebe ADJA
    #> 21:  47297      10      0 Liebe ADJA
    #> 22:  49739      10      0 Liebe ADJA
    #> 23:  56810      10      0 Liebe ADJA
    #> 24:  59790      10      0 Liebe ADJA
    #> 25:  59793      10      0 Liebe ADJA
    #> 26:  66245      10      0 Liebe ADJA
    #> 27:  69098      10      0 Liebe ADJA
    #> 28:  69102      10      0 Liebe ADJA
    #> 29:  70360      10      0 Liebe ADJA
    #> 30:  70771      10      1 Liebe   NN
    #> 31:  74678      10      0 Liebe ADJA
    #> 32:  75805      10      0 Liebe ADJA
    #> 33:  76002      10      0 Liebe ADJA
    #> 34:  85838      10      0 Liebe ADJA
    #> 35:  85841      10      0 Liebe ADJA
    #> 36:  86522      10      0 Liebe ADJA
    #> 37:  89354      10      0 Liebe ADJA
    #> 38:  91483      10      0 Liebe ADJA
    #> 39:  91704      10      0 Liebe ADJA
    #> 40:  93459      10      0 Liebe ADJA
    #> 41:  98038      10      0 Liebe ADJA
    #> 42: 103072      10      0 Liebe ADJA
    #> 43: 103934      10      0 Liebe ADJA
    #> 44: 108337      10      0 Liebe ADJA
    #> 45: 111273      10      0 Liebe ADJA
    #> 46: 112594      10      0 Liebe ADJA
    #> 47: 113368      10      0 Liebe ADJA
    #> 48: 113831      10      0 Liebe ADJA
    #> 49: 114772      10      0 Liebe ADJA
    #> 50: 116114      10      0 Liebe ADJA
    #> 51: 118275      10      0 Liebe ADJA
    #> 52: 122273      10      0 Liebe ADJA
    #> 53: 123463      10      0 Liebe ADJA
    #> 54: 124268      10      0 Liebe ADJA
    #> 55: 125486      10      0 Liebe ADJA
    #> 56: 126850      10      0 Liebe ADJA
    #> 57: 130690      10      0 Liebe ADJA
    #> 58: 135910      10      0 Liebe ADJA
    #> 59: 138099      10      0 Liebe ADJA
    #> 60: 139004      10      0 Liebe ADJA
    #> 61: 150552      10      0 Liebe ADJA
    #> 62: 151702      10      0 Liebe ADJA
    #> 63: 155655      10      0 Liebe ADJA
    #> 64: 157311      10      0 Liebe ADJA
    #> 65: 157918      10      0 Liebe ADJA
    #> 66: 159346      10      0 Liebe ADJA
    #> 67: 169744      10      0 Liebe ADJA
    #> 68: 171657      10      0 Liebe ADJA
    #> 69: 176041      10      0 Liebe ADJA
    #> 70: 176447      10      0 Liebe ADJA
    #> 71: 176824      10      0 Liebe ADJA
    #> 72: 177589      10      0 Liebe ADJA
    #> 73: 178112      10      0 Liebe ADJA
    #> 74: 180526      10      0 Liebe ADJA
    #> 75: 181253      10      0 Liebe ADJA
    #> 76: 184704      10      0 Liebe ADJA
    #> 77: 186167      10      0 Liebe ADJA
    #> 78: 187917      10      0 Liebe ADJA
    #> 79: 188731      10      0 Liebe ADJA
    #> 80: 190545      10      1 Liebe   NN
    #> 81: 192984      10      0 Liebe ADJA
    #> 82: 194547      10      0 Liebe ADJA
    #> 83: 196402      10      0 Liebe ADJA
    #> 84: 196407      10      0 Liebe ADJA
    #> 85: 199796      10      0 Liebe ADJA
    #> 86: 202781      10      0 Liebe ADJA
    #> 87: 202786      10      0 Liebe ADJA
    #> 88: 203416      10      0 Liebe ADJA
    #> 89: 204318      10      0 Liebe ADJA
    #> 90: 207278      10      0 Liebe ADJA
    #> 91: 212013      10      0 Liebe ADJA
    #> 92: 216043      10      1 Liebe   NN
    #> 93: 221108      10      0 Liebe ADJA
    #>       cpos word_id pos_id  word  pos