Auxiliary method to get the full text of a corpus, a subcorpus, etc. Can be used to export corpus data to other tools.
```r
get_token_stream(.Object, ...)

# S4 method for numeric
get_token_stream(
  .Object,
  corpus,
  p_attribute,
  subset = NULL,
  boost = NULL,
  encoding = NULL,
  collapse = NULL,
  beautify = TRUE,
  cpos = FALSE,
  cutoff = NULL,
  decode = TRUE,
  ...
)

# S4 method for matrix
get_token_stream(.Object, ...)

# S4 method for corpus
get_token_stream(.Object, left = NULL, right = NULL, ...)

# S4 method for character
get_token_stream(.Object, left = NULL, right = NULL, ...)

# S4 method for slice
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for partition
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for subcorpus
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for regions
get_token_stream(
  .Object,
  p_attribute = "word",
  collapse = NULL,
  cpos = FALSE,
  ...
)

# S4 method for partition_bundle
get_token_stream(
  .Object,
  p_attribute = "word",
  phrases = NULL,
  subset = NULL,
  collapse = NULL,
  cpos = FALSE,
  decode = TRUE,
  verbose = TRUE,
  progress = FALSE,
  mc = FALSE,
  ...
)
```
| Argument | Description |
|---|---|
| `.Object` | Input object. |
| `...` | Arguments that will be passed into the `get_token_stream` method for a `numeric` vector. |
| `corpus` | A CWB indexed corpus. |
| `p_attribute` | A length-one `character` vector, the p-attribute to decode. |
| `subset` | An expression applied on p-attributes, using non-standard evaluation. Note that symbols used in the expression may not be used internally (e.g. 'stopwords'). |
| `boost` | A length-one `logical` value, whether to speed up decoding the token stream. |
| `encoding` | If not `NULL`, the encoding to assign to the decoded token stream. |
| `collapse` | If not `NULL`, a length-one `character` string used to collapse the tokens into a single string. |
| `beautify` | A (length-one) `logical` value, whether to adjust whitespace around interpunctuation when tokens are collapsed. |
| `cpos` | A `logical` value, whether to return corpus positions as names of the tokens. |
| `cutoff` | Maximum number of tokens to be reconstructed. |
| `decode` | A (length-one) `logical` value, whether to decode token ids to character strings; if `FALSE`, an `integer` vector of token ids is returned. |
| `left` | Left corpus position. |
| `right` | Right corpus position. |
| `phrases` | A `phrases` object; tokens within the regions it defines are concatenated. |
| `verbose` | A length-one `logical` value, whether to output messages. |
| `progress` | A length-one `logical` value, whether to show a progress bar. |
| `mc` | Number of cores to use. If `FALSE` (default), no parallelization. |
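To illustrate how a few of these arguments interact, a minimal sketch (assuming the GERMAPARLMINI sample corpus shipped with polmineR is installed): `decode = FALSE` skips the id-to-string lookup, and `cpos = TRUE` attaches corpus positions as names.

```r
library(polmineR)

# Return raw token ids instead of decoded strings
ids <- get_token_stream(
  0:9, corpus = "GERMAPARLMINI", p_attribute = "word", decode = FALSE
)

# Return decoded tokens with their corpus positions as names
tokens <- get_token_stream(
  0:9, corpus = "GERMAPARLMINI", p_attribute = "word", cpos = TRUE
)
```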
CWB indexed corpora have a fixed order of tokens, called the token stream. Every token is assigned to a unique corpus position. Subsets of the (entire) token stream defined by a left and a right corpus position are called regions. The `get_token_stream()` method extracts the tokens (for regions) from a corpus.

The primary usage of this method is to return the token stream of a (sub-)corpus as defined by a `corpus`, `subcorpus` or `partition` object. The methods defined for a `numeric` vector or a (two-column) `matrix` defining regions (i.e. left and right corpus positions in the first and second column) are the actual workers for this operation.

The `get_token_stream()` method has been introduced to serve as a worker for higher-level methods such as `read()`, `html()`, and `as.markdown()`. It may, however, be useful for decoding a corpus so that it can be exported to other tools.
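Since the decoded token stream is a plain `character` vector, exporting it to another tool reduces to writing it out. A minimal sketch (assuming the GERMAPARLMINI sample corpus is installed; the file name is illustrative):

```r
library(polmineR)

# Decode the sample corpus into a single collapsed string and
# write it to a plain text file for use by external tools.
fulltext <- get_token_stream(
  "GERMAPARLMINI",
  p_attribute = "word",
  collapse = " ",
  beautify = TRUE
)
writeLines(fulltext, con = file.path(tempdir(), "germaparlmini.txt"))
```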
```r
# Decode first words of GERMAPARLMINI corpus (first sentence)
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word")
#>  [1] "Guten"     "Morgen"    ","         "meine"     "sehr"      "verehrten"
#>  [7] "Damen"     "und"       "Herren"    "!"

# Decode first sentence and collapse tokens into single string
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word", collapse = " ")
#> [1] "Guten Morgen, meine sehr verehrten Damen und Herren!"

# Decode regions defined by two-column matrix
region_matrix <- matrix(c(0, 9, 10, 25), ncol = 2, byrow = TRUE)
get_token_stream(
  region_matrix,
  corpus = "GERMAPARLMINI", p_attribute = "word", encoding = "latin1"
)
#>  [1] "Guten"            "Morgen"           ","                "meine"
#>  [5] "sehr"             "verehrten"        "Damen"            "und"
#>  [9] "Herren"           "!"                "Liebe"            "Kolleginnen"
#> [13] "und"              "Kollegen"         ","                "ich"
#> [17] "begrüße"          "Sie"              "zur"              "konstituierenden"
#> [21] "Sitzung"          "des"              "17."              "Deutschen"
#> [25] "Bundestags"       "."

# Use argument 'beautify' to remove surplus whitespace
get_token_stream(
  region_matrix,
  corpus = "GERMAPARLMINI",
  p_attribute = "word",
  encoding = "latin1",
  collapse = " ",
  beautify = TRUE
)
#> [1] "Guten Morgen, meine sehr verehrten Damen und Herren! Liebe Kolleginnen und Kollegen, ich begrüße Sie zur konstituierenden Sitzung des 17. Deutschen Bundestags."

# Decode entire corpus (corpus object / specified by corpus ID)
fulltext <- get_token_stream("GERMAPARLMINI", p_attribute = "word")
corpus("GERMAPARLMINI") %>%
  get_token_stream(p_attribute = "word") %>%
  head()
#> [1] "Guten"     "Morgen"    ","         "meine"     "sehr"      "verehrten"

# Decode subcorpus
corpus("REUTERS") %>%
  subset(id == "127") %>%
  get_token_stream(p_attribute = "word") %>%
  head()
#> [1] "Diamond"   "Shamrock"  "Corp"      "said"      "that"      "effective"

# Decode partition_bundle
pb_tokstr <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  get_token_stream(p_attribute = "word")

# Get token stream for partition_bundle
pb <- partition_bundle("REUTERS", s_attribute = "id")
ts_list <- get_token_stream(pb)

# Workflow to filter decoded subcorpus_bundle
if (FALSE) {
  sp <- corpus("GERMAPARLMINI") %>%
    as.speeches(s_attribute_name = "speaker", progress = FALSE)
  queries <- c(
    '"freiheitliche" "Grundordnung"',
    '"Bundesrepublik" "Deutschland"'
  )
  phr <- corpus("GERMAPARLMINI") %>%
    cpos(query = queries) %>%
    as.phrases(corpus = "GERMAPARLMINI")
  kill <- tm::stopwords("de")
  ts_phr <- get_token_stream(
    sp,
    p_attribute = c("word", "pos"),
    subset = {!word %in% kill & !grepl("(\\$.$|ART)", pos)},
    phrases = phr,
    progress = FALSE,
    verbose = FALSE
  )
}
```