Get counts.

Count all tokens, the number of occurrences of one or more queries (CQP syntax may be used), or the individual matches for a query.

count(.Object, ...)

# S4 method for partition
count(
  .Object,
  query = NULL,
  cqp = is.cqp,
  check = TRUE,
  breakdown = FALSE,
  decode = TRUE,
  p_attribute = getOption("polmineR.p_attribute"),
  mc = getOption("polmineR.cores"),
  verbose = TRUE,
  progress = FALSE,
  phrases = NULL,
  ...
)

# S4 method for subcorpus
count(
  .Object,
  query = NULL,
  cqp = is.cqp,
  check = TRUE,
  breakdown = FALSE,
  decode = TRUE,
  p_attribute = getOption("polmineR.p_attribute"),
  mc = getOption("polmineR.cores"),
  verbose = TRUE,
  progress = FALSE,
  phrases = NULL,
  ...
)

# S4 method for partition_bundle
count(
  .Object,
  query = NULL,
  cqp = FALSE,
  p_attribute = NULL,
  phrases = NULL,
  freq = FALSE,
  total = TRUE,
  mc = FALSE,
  progress = FALSE,
  verbose = FALSE,
  ...
)

# S4 method for subcorpus_bundle
count(
  .Object,
  query = NULL,
  cqp = FALSE,
  p_attribute = NULL,
  phrases = NULL,
  freq = FALSE,
  total = TRUE,
  mc = FALSE,
  progress = TRUE,
  verbose = FALSE,
  ...
)

# S4 method for corpus
count(
  .Object,
  query = NULL,
  cqp = is.cqp,
  check = TRUE,
  p_attribute = getOption("polmineR.p_attribute"),
  breakdown = FALSE,
  sort = FALSE,
  decode = TRUE,
  verbose = TRUE,
  ...
)

# S4 method for character
count(
  .Object,
  query = NULL,
  cqp = is.cqp,
  check = TRUE,
  p_attribute = getOption("polmineR.p_attribute"),
  breakdown = FALSE,
  sort = FALSE,
  decode = TRUE,
  verbose = TRUE,
  ...
)

# S4 method for vector
count(.Object, corpus, p_attribute, ...)

# S4 method for remote_corpus
count(.Object, ...)

# S4 method for remote_subcorpus
count(.Object, ...)

Arguments

.Object	A `partition` or `partition_bundle`, or a length-one character vector providing the name of a corpus.
...	Further arguments. If `.Object` is a `remote_corpus` object, the three dots (`...`) are used to pass arguments. Hence, it is necessary to state the names of all arguments to be passed explicitly.
query	A character vector (one or multiple terms), CQP syntax can be used.
cqp	Either logical (`TRUE` if query is a CQP query), or a function to check whether query is a CQP query or not (defaults to the `is.cqp` auxiliary function).
check	A `logical` value, whether to check validity of CQP query using `check_cqp_query`.
breakdown	Logical, whether to report number of occurrences for different matches for a query.
decode	Logical, whether to turn token ids into decoded strings (only if query is NULL).
p_attribute	The p-attribute(s) to use.
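mc	Whether to use multicore processing. Defaults to `getOption("polmineR.cores")` for `partition` and `subcorpus` objects, and to `FALSE` for `partition_bundle` and `subcorpus_bundle` objects.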
verbose	Logical, whether to be verbose.
progress	Logical, whether to show progress bar.
phrases	A `phrases` object. If provided, the denoted regions will be concatenated as phrases.
freq	Logical. If `FALSE`, only counts are reported; if `TRUE`, (relative) frequencies are added to the table.
total	Logical, defaults to `TRUE` (see usage). If `TRUE`, the total value of counts (column named 'TOTAL') will be appended to the `data.table` that is returned.
sort	Logical, whether to sort table with counts (in stat slot).
corpus	The name of a CWB corpus.

Value

A data.table if argument query is used. If query is NULL, a count-object is returned when .Object is a character vector (referring to a corpus) or a partition, and a count_bundle-object when .Object is a partition_bundle.

Details

If .Object is a partition_bundle, the data.table returned will have the queries in the columns, and as many rows as there are partitions in the partition_bundle.
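To visualise this layout, here is a minimal base R sketch that counts literal tokens in two made-up partitions. The partition names and token vectors are hypothetical, and a plain data.frame stands in for the data.table that the method returns; it is not the package's implementation.

```r
# Two hypothetical partitions, each represented by its token vector
partitions <- list(
  part_a = c("Arbeit", "Zukunft", "Arbeit"),
  part_b = c("Zukunft", "Migration")
)
queries <- c("Arbeit", "Migration", "Zukunft")

# Count each query in each partition, then transpose so that
# partitions are in the rows and queries are in the columns
counts <- t(sapply(partitions, function(tok) {
  vapply(queries, function(q) sum(tok == q), integer(1))
}))

result <- data.frame(partition = rownames(counts), counts, row.names = NULL)
result
```

The result has one row per partition and one column per query, next to a column identifying the partition.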

If .Object is a length-one character vector and query is NULL, the count is performed for the whole corpus.

If breakdown is TRUE and one query is supplied, the function returns a frequency breakdown of the results of the query. If several queries are supplied, frequencies for the individual queries are retrieved.
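As a rough sketch of what a breakdown reports, the following base R snippet tabulates the distinct tokens matched by a pattern together with their share of all matches. The token vector and the helper name `breakdown_sketch` are hypothetical, and `grepl()` with an anchored regular expression is only a stand-in for a single-token CQP query. The percentage rounding mirrors the share column shown in the examples, but this is not how the package computes it.

```r
# Hypothetical token vector, for illustration only
tokens <- c("Integration", "Integrationspolitik", "Integration",
            "Arbeit", "Integrationskurse", "Integration")

breakdown_sketch <- function(pattern, x) {
  # Keep only tokens that match the pattern as a whole
  hits <- x[grepl(paste0("^(", pattern, ")$"), x)]
  # Tabulate the distinct matches, most frequent first
  tab <- sort(table(hits), decreasing = TRUE)
  data.frame(
    match = names(tab),
    count = as.integer(tab),
    share = round(as.integer(tab) / length(hits) * 100, 2),
    stringsAsFactors = FALSE
  )
}

res <- breakdown_sketch("Integration.*", tokens)
res
```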

Multiple queries can be used for argument query. Some care may be necessary when summing up the counts for the individual queries. When the CQP syntax is used, different queries may yield the same match result, so that the sum of all individual query matches may overestimate the true number of unique matches. In the case of overlapping matches, a warning message is issued. Collapsing multiple CQP queries into a single query (separating the individual queries by "|" and wrapping everything in round brackets) solves this problem.
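The overestimation is easy to reproduce outside of CQP. The following base R sketch uses a made-up token vector (not the REUTERS corpus) and a hypothetical helper `count_matches`, with anchored regular expressions standing in for single-token CQP queries. It illustrates the arithmetic only and is not how polmineR evaluates queries.

```r
# Hypothetical token stream, for illustration only
tokens <- c("oil", "turmoil", "crude", "oil", "soil", "prices")

# Stand-in for a single-token query: the regex must match the whole token
count_matches <- function(pattern, x) {
  sum(grepl(paste0("^(", pattern, ")$"), x))
}

# Two separate queries whose matches overlap on "turmoil"
separate <- c(count_matches(".*oil", tokens), count_matches("turmoil", tokens))

# The same alternatives collapsed into one query
collapsed <- count_matches(".*oil|turmoil", tokens)

sum(separate)  # counts "turmoil" twice
collapsed      # counts each matching token once
```

Here the summed counts exceed the collapsed count by exactly the number of tokens that both queries match.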

References

Baker, Paul (2006): Using Corpora in Discourse Analysis. London: Continuum, pp. 47-69 (ch. 3).

Examples

use("polmineR")
#> ... activating corpus: GERMAPARLMINI (version: 0.0.1 | build date: 2019-02-23)
#> ... activating corpus: REUTERS
debates <- partition("GERMAPARLMINI", date = ".*", regex = TRUE)
#> ... get encoding: latin1
#> ... get cpos and strucs
count(debates, query = "Arbeit") # get frequencies for one token
#>     query  match count         freq
#> 1: Arbeit Arbeit   159 0.0007155683
count(debates, c("Arbeit", "Freizeit", "Zukunft")) # get frequencies for multiple tokens
#>       query    match count         freq
#> 1:   Arbeit   Arbeit   159 7.155683e-04
#> 2: Freizeit Freizeit     1 4.500430e-06
#> 3:  Zukunft  Zukunft   142 6.390610e-04
  
count("GERMAPARLMINI", query = c("Migration", "Integration"), p_attribute = "word")
#>          query count         freq
#> 1:   Migration     3 1.350129e-05
#> 2: Integration    23 1.035099e-04

debates <- partition_bundle(
  "GERMAPARLMINI", s_attribute = "date", values = NULL,
  mc = FALSE, verbose = FALSE
)
y <- count(debates, query = "Arbeit", p_attribute = "word")
y <- count(debates, query = c("Arbeit", "Migration", "Zukunft"), p_attribute = "word")
  
count("GERMAPARLMINI", '"Integration.*"', breakdown = TRUE)
#>               query                       match count share
#>  1: "Integration.*"                 Integration    23 53.49
#>  2: "Integration.*"         Integrationspolitik     9 20.93
#>  3: "Integration.*"           Integrationskurse     2  4.65
#>  4: "Integration.*"        Integrationsangebote     1  2.33
#>  5: "Integration.*"    Integrationsbereitschaft     1  2.33
#>  6: "Integration.*"         Integrationserfolge     1  2.33
#>  7: "Integration.*"     Integrationsfähigkeiten     1  2.33
#>  8: "Integration.*"       Integrationskarrieren     1  2.33
#>  9: "Integration.*" Integrationspartnerschaften     1  2.33
#> 10: "Integration.*"            Integrationsplan     1  2.33
#> 11: "Integration.*"      Integrationspolitik.Es     1  2.33
#> 12: "Integration.*"       Integrationsverträgen     1  2.33

P <- partition("GERMAPARLMINI", date = "2009-11-11")
#> ... get encoding: latin1
#> ... get cpos and strucs
count(P, '"Integration.*"', breakdown = TRUE)
#>              query                    match count share
#> 1: "Integration.*"              Integration    15 48.39
#> 2: "Integration.*"      Integrationspolitik     8 25.81
#> 3: "Integration.*"        Integrationskurse     2  6.45
#> 4: "Integration.*"     Integrationsangebote     1  3.23
#> 5: "Integration.*" Integrationsbereitschaft     1  3.23
#> 6: "Integration.*"      Integrationserfolge     1  3.23
#> 7: "Integration.*"  Integrationsfähigkeiten     1  3.23
#> 8: "Integration.*"    Integrationskarrieren     1  3.23
#> 9: "Integration.*"   Integrationspolitik.Es     1  3.23

sc <- corpus("GERMAPARLMINI") %>% subset(party == "SPD")
phr <- cpos(sc, query = '"Deutsche.*" "Bundestag.*"', cqp = TRUE) %>%
  as.phrases(corpus = "GERMAPARLMINI", enc = "latin1")
cnt <- count(sc, phrases = phr, p_attribute = "word")

# Multiple queries and overlapping query matches. The first count 
# operation will issue a warning that matches overlap, see the second 
# example for a solution.
corpus("REUTERS") %>%
  count(query = c('".*oil"', '"turmoil"'), cqp = TRUE)
#> Warning: The CQP queries processed result in at least one overlapping query. Summing up the counts for the individual query matches may result in an overestimation of the total number of hits. To avoid this, consider collapsing multiple CQP queries into one single query.
#>        query count         freq
#> 1:   ".*oil"    79 0.0195061728
#> 2: "turmoil"     1 0.0002469136
corpus("REUTERS") %>% 
  count(query = '"(.*oil|turmoil)"', cqp = TRUE)
#>                query count       freq
#> 1: "(.*oil|turmoil)"    79 0.01950617
