Auxiliary method to get the full text of a corpus, subcorpus etc. Can be used to export corpus data to other tools.

get_token_stream(.Object, ...)

# S4 method for numeric
get_token_stream(
  .Object,
  corpus,
  p_attribute,
  subset = NULL,
  boost = NULL,
  encoding = NULL,
  collapse = NULL,
  beautify = TRUE,
  cpos = FALSE,
  cutoff = NULL,
  decode = TRUE,
  ...
)

# S4 method for matrix
get_token_stream(.Object, ...)

# S4 method for corpus
get_token_stream(.Object, left = NULL, right = NULL, ...)

# S4 method for character
get_token_stream(.Object, left = NULL, right = NULL, ...)

# S4 method for slice
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for partition
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for subcorpus
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for regions
get_token_stream(
  .Object,
  p_attribute = "word",
  collapse = NULL,
  cpos = FALSE,
  ...
)

# S4 method for partition_bundle
get_token_stream(
  .Object,
  p_attribute = "word",
  phrases = NULL,
  subset = NULL,
  collapse = NULL,
  cpos = FALSE,
  decode = TRUE,
  verbose = TRUE,
  progress = FALSE,
  mc = FALSE,
  ...
)

Arguments

.Object

Input object.

...

Arguments that will be passed into the get_token_stream-method for a numeric vector, which is the actual worker.

corpus

A CWB indexed corpus.

p_attribute

A length-one character vector, the p-attribute to decode.

subset

An expression applied on p-attributes, using non-standard evaluation. Note that symbols used in the expression must not collide with names used internally (e.g. 'stopwords').

boost

A length-one logical value, whether to speed up decoding a long vector of token ids by directly reading in the lexicon file from the data directory of a corpus. If NULL (default), the internal decision rule is that boost will be TRUE if the corpus has more than 10 000 000 tokens and more than 5 percent of the corpus is to be decoded.

encoding

If not NULL (default), a length-one character vector stating an encoding that will be assigned to the (decoded) token stream.

collapse

If not NULL (default), a length-one character string passed into paste to collapse the character vector into a single string.

beautify

A (length-one) logical value, whether to adjust whitespace before and after punctuation marks.

cpos

A logical value, whether to return corpus positions as names of the tokens.

cutoff

Maximum number of tokens to be reconstructed.

decode

A (length-one) logical value, whether to decode token ids to character strings. Defaults to TRUE; if FALSE, an integer vector with token ids is returned.

left

Left corpus position.

right

Right corpus position.

phrases

A phrases object. Defined phrases will be concatenated.

verbose

A length-one logical value, whether to show messages.

progress

A length-one logical value, whether to show progress bar.

mc

Number of cores to use. If FALSE (default), only one thread will be used.
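
The interplay of the cutoff and decode arguments can be illustrated with a minimal sketch. This is an assumption-laden example, not taken from the package documentation: it presumes the GERMAPARLMINI sample corpus used elsewhere on this page is installed.

```r
library(polmineR)

# Limit decoding to the first five tokens using 'cutoff'
get_token_stream(0:99, corpus = "GERMAPARLMINI", p_attribute = "word", cutoff = 5)

# With decode = FALSE, the raw integer token ids are returned
# rather than character strings
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word", decode = FALSE)
```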

Details

CWB indexed corpora have a fixed order of tokens which is called the token stream. Every token is assigned a unique corpus position. Subsets of the (entire) token stream defined by a left and a right corpus position are called regions. The get_token_stream-method will extract the tokens (for regions) from a corpus.

The primary usage of this method is to return the token stream of a (sub-)corpus as defined by a corpus, subcorpus or partition object. The methods defined for a numeric vector or a (two-column) matrix defining regions (i.e. left and right corpus positions in the first and second column) are the actual workers for this operation.

The get_token_stream-method has been introduced to serve as a worker for higher-level methods such as read, html, and as.markdown. It may however be useful for decoding a corpus so that it can be exported to other tools.
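
The export use case mentioned above can be sketched as follows. This is a hypothetical workflow, assuming the GERMAPARLMINI corpus is installed; the output file name is arbitrary.

```r
library(polmineR)

# Decode the full token stream and write it to a plain text file,
# one collapsed string, so it can be consumed by external tools
fulltext <- corpus("GERMAPARLMINI") %>%
  get_token_stream(p_attribute = "word", collapse = " ", beautify = TRUE)

writeLines(fulltext, con = "germaparlmini.txt")
```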

Examples

# Decode first words of GERMAPARLMINI corpus (first sentence)
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word")
#> [1] "Guten"     "Morgen"    ","         "meine"     "sehr"      "verehrten"
#> [7] "Damen"     "und"       "Herren"    "!"
# Decode first sentence and collapse tokens into single string
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word", collapse = " ")
#> [1] "Guten Morgen, meine sehr verehrten Damen und Herren!"
# Decode regions defined by two-column matrix
region_matrix <- matrix(c(0, 9, 10, 25), ncol = 2, byrow = TRUE)
get_token_stream(region_matrix, corpus = "GERMAPARLMINI", p_attribute = "word", encoding = "latin1")
#>  [1] "Guten"            "Morgen"           ","                "meine"
#>  [5] "sehr"             "verehrten"        "Damen"            "und"
#>  [9] "Herren"           "!"                "Liebe"            "Kolleginnen"
#> [13] "und"              "Kollegen"         ","                "ich"
#> [17] "begrüße"          "Sie"              "zur"              "konstituierenden"
#> [21] "Sitzung"          "des"              "17."              "Deutschen"
#> [25] "Bundestags"       "."
# Use argument 'beautify' to remove surplus whitespace
get_token_stream(
  region_matrix,
  corpus = "GERMAPARLMINI",
  p_attribute = "word",
  encoding = "latin1",
  collapse = " ",
  beautify = TRUE
)
#> [1] "Guten Morgen, meine sehr verehrten Damen und Herren! Liebe Kolleginnen und Kollegen, ich begrüße Sie zur konstituierenden Sitzung des 17. Deutschen Bundestags."
# Decode entire corpus (corpus object / specified by corpus ID)
fulltext <- get_token_stream("GERMAPARLMINI", p_attribute = "word")
corpus("GERMAPARLMINI") %>%
  get_token_stream(p_attribute = "word") %>%
  head()
#> [1] "Guten"     "Morgen"    ","         "meine"     "sehr"      "verehrten"
# Decode subcorpus
corpus("REUTERS") %>%
  subset(id == "127") %>%
  get_token_stream(p_attribute = "word") %>%
  head()
#> [1] "Diamond"   "Shamrock"  "Corp"      "said"      "that"      "effective"
# Decode partition_bundle
pb_tokstr <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  get_token_stream(p_attribute = "word")
#> ... creating vector of document ids
#> ... creating vector of corpus positions
#> ... decoding character vectors
#> ... generating list of character vectors
# Get token stream for partition_bundle
pb <- partition_bundle("REUTERS", s_attribute = "id")
ts_list <- get_token_stream(pb)
#> ... creating vector of document ids
#> ... creating vector of corpus positions
#> ... decoding character vectors
#> ... generating list of character vectors
# Workflow to filter decoded subcorpus_bundle
if (FALSE) {
  sp <- corpus("GERMAPARLMINI") %>%
    as.speeches(s_attribute_name = "speaker", progress = FALSE)

  queries <- c(
    '"freiheitliche" "Grundordnung"',
    '"Bundesrepublik" "Deutschland"'
  )
  phr <- corpus("GERMAPARLMINI") %>%
    cpos(query = queries) %>%
    as.phrases(corpus = "GERMAPARLMINI")

  kill <- tm::stopwords("de")

  ts_phr <- get_token_stream(
    sp,
    p_attribute = c("word", "pos"),
    subset = {!word %in% kill & !grepl("(\\$.$|ART)", pos)},
    phrases = phr,
    progress = FALSE,
    verbose = FALSE
  )
}