Auxiliary method to get the full text of a corpus, subcorpus etc. Can be used to export corpus data to other tools.

get_token_stream(.Object, ...)

# S4 method for numeric
get_token_stream(
  .Object,
  corpus,
  p_attribute,
  subset = NULL,
  boost = NULL,
  encoding = NULL,
  collapse = NULL,
  beautify = TRUE,
  cpos = FALSE,
  cutoff = NULL,
  decode = TRUE,
  ...
)

# S4 method for matrix
get_token_stream(.Object, ...)

# S4 method for corpus
get_token_stream(.Object, left = NULL, right = NULL, ...)

# S4 method for character
get_token_stream(.Object, left = NULL, right = NULL, ...)

# S4 method for slice
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for partition
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for subcorpus
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for regions
get_token_stream(
  .Object,
  p_attribute = "word",
  collapse = NULL,
  cpos = FALSE,
  ...
)

# S4 method for partition_bundle
get_token_stream(
  .Object,
  p_attribute = "word",
  phrases = NULL,
  subset = NULL,
  collapse = NULL,
  cpos = FALSE,
  decode = TRUE,
  verbose = TRUE,
  progress = FALSE,
  mc = FALSE,
  ...
)

Arguments

.Object

Input object.

...

Arguments that will be passed into the get_token_stream-method for a numeric vector, which is the actual worker.

corpus

A CWB indexed corpus.

p_attribute

A length-one character vector, the p-attribute to decode.

subset

An expression applied on p-attributes, using non-standard evaluation. Note that symbols used in the expression must not collide with names used internally (e.g. 'stopwords').

boost

A length-one logical value, whether to speed up decoding a long vector of token ids by directly reading in the lexicon file from the data directory of a corpus. If NULL (default), the internal decision rule is that boost will be TRUE if the corpus has more than 10 000 000 tokens and more than 5 percent of the corpus is to be decoded.

encoding

If not NULL (default), a length-one character vector stating an encoding that will be assigned to the (decoded) token stream.

collapse

If not NULL (default), a length-one character string passed into paste to collapse the character vector into a single string.

beautify

A (length-one) logical value, whether to adjust whitespace before and after punctuation marks.

cpos

A logical value, whether to return corpus positions as names of the tokens.

cutoff

Maximum number of tokens to be reconstructed.

decode

A (length-one) logical value, whether to decode token ids to character strings. Defaults to TRUE; if FALSE, an integer vector with token ids is returned.

left

Left corpus position.

right

Right corpus position.

phrases

A phrases object. Defined phrases will be concatenated.

verbose

A length-one logical value, whether to show messages.

progress

A length-one logical value, whether to show progress bar.

mc

Number of cores to use. If FALSE (default), only one thread will be used.
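
The interplay of the cutoff and decode arguments can be illustrated with a minimal sketch. This is an assumption-laden example, not taken from the package documentation: it presumes the GERMAPARLMINI sample corpus used elsewhere on this page is installed.

```r
library(polmineR)

# Limit decoding to the first five tokens using 'cutoff'
get_token_stream(0:99, corpus = "GERMAPARLMINI", p_attribute = "word", cutoff = 5)

# With decode = FALSE, the raw integer token ids are returned
# rather than character strings
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word", decode = FALSE)
```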

Details

CWB indexed corpora have a fixed order of tokens which is called the token stream. Every token is assigned a unique corpus position. Subsets of the (entire) token stream defined by a left and a right corpus position are called regions. The get_token_stream-method will extract the tokens (for regions) from a corpus.

The primary usage of this method is to return the token stream of a (sub-)corpus as defined by a corpus, subcorpus or partition object. The methods defined for a numeric vector or a (two-column) matrix defining regions (i.e. left and right corpus positions in the first and second column) are the actual workers for this operation.

The get_token_stream-method has been introduced to serve as a worker for higher-level methods such as read, html, and as.markdown. It may however be useful for decoding a corpus so that it can be exported to other tools.
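
The export use case mentioned above can be sketched as follows. This is a hypothetical workflow, assuming the GERMAPARLMINI corpus is installed; the output file name is arbitrary.

```r
library(polmineR)

# Decode the full token stream and write it to a plain text file,
# one collapsed string, so it can be consumed by external tools
fulltext <- corpus("GERMAPARLMINI") %>%
  get_token_stream(p_attribute = "word", collapse = " ", beautify = TRUE)

writeLines(fulltext, con = "germaparlmini.txt")
```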

Examples

# Decode first words of GERMAPARLMINI corpus (first sentence)
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word")
#> [1] "Guten"     "Morgen"    ","         "meine"     "sehr"      "verehrten"
#> [7] "Damen"     "und"       "Herren"    "!"
# Decode first sentence and collapse tokens into single string
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word", collapse = " ")
#> [1] "Guten Morgen, meine sehr verehrten Damen und Herren!"
# Decode regions defined by two-column matrix
region_matrix <- matrix(c(0, 9, 10, 25), ncol = 2, byrow = TRUE)
get_token_stream(region_matrix, corpus = "GERMAPARLMINI", p_attribute = "word", encoding = "latin1")
#>  [1] "Guten"            "Morgen"           ","                "meine"
#>  [5] "sehr"             "verehrten"        "Damen"            "und"
#>  [9] "Herren"           "!"                "Liebe"            "Kolleginnen"
#> [13] "und"              "Kollegen"         ","                "ich"
#> [17] "begrüße"          "Sie"              "zur"              "konstituierenden"
#> [21] "Sitzung"          "des"              "17."              "Deutschen"
#> [25] "Bundestags"       "."
# Use argument 'beautify' to remove surplus whitespace
get_token_stream(
  region_matrix,
  corpus = "GERMAPARLMINI",
  p_attribute = "word",
  encoding = "latin1",
  collapse = " ",
  beautify = TRUE
)
#> [1] "Guten Morgen, meine sehr verehrten Damen und Herren! Liebe Kolleginnen und Kollegen, ich begrüße Sie zur konstituierenden Sitzung des 17. Deutschen Bundestags."
# Decode entire corpus (corpus object / specified by corpus ID)
fulltext <- get_token_stream("GERMAPARLMINI", p_attribute = "word")
corpus("GERMAPARLMINI") %>%
  get_token_stream(p_attribute = "word") %>%
  head()
#> [1] "Guten"     "Morgen"    ","         "meine"     "sehr"      "verehrten"
# Decode subcorpus
corpus("REUTERS") %>%
  subset(id == "127") %>%
  get_token_stream(p_attribute = "word") %>%
  head()
#> [1] "Diamond"   "Shamrock"  "Corp"      "said"      "that"      "effective"
# Decode partition_bundle
pb_tokstr <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  get_token_stream(p_attribute = "word")
#> ... creating vector of document ids
#> ... creating vector of corpus positions
#> ... decoding character vectors
#> ... generating list of character vectors
# Get token stream for partition_bundle
pb <- partition_bundle("REUTERS", s_attribute = "id")
ts_list <- get_token_stream(pb)
#> ... creating vector of document ids
#> ... creating vector of corpus positions
#> ... decoding character vectors
#> ... generating list of character vectors
# Workflow to filter decoded subcorpus_bundle
if (FALSE) {
  sp <- corpus("GERMAPARLMINI") %>%
    as.speeches(s_attribute_name = "speaker", progress = FALSE)

  queries <- c(
    '"freiheitliche" "Grundordnung"',
    '"Bundesrepublik" "Deutschland"'
  )
  phr <- corpus("GERMAPARLMINI") %>%
    cpos(query = queries) %>%
    as.phrases(corpus = "GERMAPARLMINI")

  kill <- tm::stopwords("de")

  ts_phr <- get_token_stream(
    sp,
    p_attribute = c("word", "pos"),
    subset = {!word %in% kill & !grepl("(\\$.$|ART)", pos)},
    phrases = phr,
    progress = FALSE,
    verbose = FALSE
  )
}