The return value is an integer vector with one element per unique token (that is, per unique token id) of the positional attribute. The counts are ordered by token id: because ids are zero-based, element i of the vector holds the count for id i - 1.

get_count_vector(corpus, p_attribute, registry = Sys.getenv("CORPUS_REGISTRY"))

Arguments

corpus

the ID of a CWB corpus, as a character string (e.g. "REUTERS")

p_attribute

the name of a positional attribute, as a character string (e.g. "word")

registry

the registry directory; defaults to the value of the CORPUS_REGISTRY environment variable

Value

an integer vector of counts, ordered by token id
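
The layout of the returned vector can be illustrated without a corpus. The following is a minimal base R sketch, not the package's implementation: the vector `ids` is made-up data standing in for a corpus token stream, and `tabulate()` is used only to mimic the shape of the result.

```r
# Hypothetical token stream, given as zero-based token ids (illustrative data only)
ids <- c(0L, 1L, 2L, 0L, 3L, 0L, 1L)

# tabulate() counts positive integers, so shift the zero-based ids by one.
# The result has one element per unique id, ordered by id.
counts <- tabulate(ids + 1L, nbins = max(ids) + 1L)

# Because R vectors are one-based, the count for id k sits at position k + 1
count_for_id <- function(k) counts[k + 1L]

count_for_id(0L) # count for id 0
```

The same offset applies to the vector returned by get_count_vector(): to look up the count for a given token id, index the result at id + 1.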

Examples

# Counts for every token id of the "word" attribute
y <- get_count_vector(
  corpus = "REUTERS", p_attribute = "word",
  registry = get_tmp_registry()
)
# Token ids are zero-based, so they run from 0 to length(y) - 1
df <- data.frame(token_id = 0:(length(y) - 1), count = y)
# Decode the ids to their token strings
df[["token"]] <- cl_id2str(
  "REUTERS", p_attribute = "word",
  id = df[["token_id"]], registry = get_tmp_registry()
)
df <- df[, c("token", "token_id", "count")] # reorder columns
df <- df[order(df[["count"]], decreasing = TRUE), ] # most frequent tokens first
head(df)
#>    token token_id count
#> 32   the       31   206
#> 30    to       29   134
#> 38    of       37    97
#> 36    in       35    84
#> 16   oil       15    78
#> 41   and       40    77