Count all tokens, or number of occurrences of a query (CQP syntax may be used), or matches for the query.
count(.Object, ...)

# S4 method for partition
count(
  .Object,
  query = NULL,
  cqp = is.cqp,
  check = TRUE,
  breakdown = FALSE,
  decode = TRUE,
  p_attribute = getOption("polmineR.p_attribute"),
  mc = getOption("polmineR.cores"),
  verbose = TRUE,
  progress = FALSE,
  phrases = NULL,
  ...
)

# S4 method for subcorpus
count(
  .Object,
  query = NULL,
  cqp = is.cqp,
  check = TRUE,
  breakdown = FALSE,
  decode = TRUE,
  p_attribute = getOption("polmineR.p_attribute"),
  mc = getOption("polmineR.cores"),
  verbose = TRUE,
  progress = FALSE,
  phrases = NULL,
  ...
)

# S4 method for partition_bundle
count(
  .Object,
  query = NULL,
  cqp = FALSE,
  p_attribute = NULL,
  phrases = NULL,
  freq = FALSE,
  total = TRUE,
  mc = FALSE,
  progress = FALSE,
  verbose = FALSE,
  ...
)

# S4 method for subcorpus_bundle
count(
  .Object,
  query = NULL,
  cqp = FALSE,
  p_attribute = NULL,
  phrases = NULL,
  freq = FALSE,
  total = TRUE,
  mc = FALSE,
  progress = TRUE,
  verbose = FALSE,
  ...
)

# S4 method for corpus
count(
  .Object,
  query = NULL,
  cqp = is.cqp,
  check = TRUE,
  p_attribute = getOption("polmineR.p_attribute"),
  breakdown = FALSE,
  sort = FALSE,
  decode = TRUE,
  verbose = TRUE,
  ...
)

# S4 method for character
count(
  .Object,
  query = NULL,
  cqp = is.cqp,
  check = TRUE,
  p_attribute = getOption("polmineR.p_attribute"),
  breakdown = FALSE,
  sort = FALSE,
  decode = TRUE,
  verbose = TRUE,
  ...
)

# S4 method for vector
count(.Object, corpus, p_attribute, ...)

# S4 method for remote_corpus
count(.Object, ...)

# S4 method for remote_subcorpus
count(.Object, ...)
.Object | A |
---|---|
... | Further arguments. If |
query | A character vector (one or multiple terms), CQP syntax can be used. |
cqp | Either logical ( |
check | A |
breakdown | Logical, whether to report number of occurrences for different matches for a query. |
decode | Logical, whether to turn token ids into decoded strings (only if query is NULL). |
p_attribute | The p-attribute(s) to use. |
mc | Logical, whether to use multicore (defaults to |
verbose | Logical, whether to be verbose. |
progress | Logical, whether to show progress bar. |
phrases | A |
freq | Logical, if |
total | Defaults to |
sort | Logical, whether to sort table with counts (in stat slot). |
corpus | The name of a CWB corpus. |
A data.table if argument query is used; a count object if query is NULL and .Object is a character vector (referring to a corpus) or a partition; a count_bundle object if .Object is a partition_bundle.
If .Object is a partition_bundle, the data.table returned will have the queries in the columns, and as many rows as there are partitions in the partition_bundle.
If .Object is a length-one character vector and query is NULL, the count is performed for the whole corpus.
If breakdown is TRUE and one query is supplied, the function returns a frequency breakdown of the results of the query. If several queries are supplied, frequencies for the individual queries are retrieved.
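The share column reported by a breakdown can be reproduced by hand. The sketch below uses only base R and the counts printed in the '"Integration.*"' breakdown for GERMAPARLMINI in the examples. That share is the percentage of all matches, rounded to two digits, is an inference from that output, not a statement about the internals of the count method.

```r
# Counts per distinct match, copied from the breakdown example
# (23, 9, 2, followed by nine matches that occur once each).
counts <- c(23, 9, 2, rep(1, 9))

# Assumption: share = percentage of all matches, rounded to two digits.
share <- round(counts / sum(counts) * 100, 2)

sum(counts)     # total number of matches for the query
head(share, 3)  # shares of the three most frequent matches
```

The first three values agree with the printed shares of 53.49, 20.93 and 4.65.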
Multiple queries can be used for argument query. Some care may be necessary when summing up the counts for the individual queries. When the CQP syntax is used, different queries may yield the same match result, so that the sum of all individual query matches may overestimate the true number of unique matches. In the case of overlapping matches, a warning message is issued. Collapsing multiple CQP queries into a single query (separating the individual queries by "|" and wrapping everything in round brackets) solves this problem.
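The overcounting and the collapsing step can be illustrated without a corpus. The following is a minimal sketch in base R: the token vector is made up, base R regular expressions stand in for CQP matching, and `collapse_cqp_queries()` is a hypothetical helper (not part of polmineR) that only handles single-token queries, each wrapped in one pair of double quotes.

```r
# Toy token stream (made up); "turmoil" is matched by both patterns below.
tokens <- c("oil", "turmoil", "gas")

# Counting each pattern separately and summing counts "turmoil" twice.
n_oil <- sum(grepl("^(.*oil)$", tokens))
n_turmoil <- sum(grepl("^(turmoil)$", tokens))
n_summed <- n_oil + n_turmoil

# One collapsed pattern counts every matching token exactly once.
n_collapsed <- sum(grepl("^(.*oil|turmoil)$", tokens))

# Hypothetical helper: strip the delimiting double quotes, join the
# queries with "|" and wrap the result in round brackets and quotes.
collapse_cqp_queries <- function(queries) {
  bare <- gsub('^"|"$', "", queries)
  sprintf('"(%s)"', paste(bare, collapse = "|"))
}

collapsed <- collapse_cqp_queries(c('".*oil"', '"turmoil"'))

# With polmineR and the REUTERS corpus available, the collapsed query
# could then be passed on, for example:
# corpus("REUTERS") %>% count(query = collapsed, cqp = TRUE)
```

Here `n_summed` is 3 while `n_collapsed` is 2, which is the kind of overestimation the warning points to.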
Baker, Paul (2006): Using Corpora in Discourse Analysis. London: Continuum, pp. 47-69 (ch. 3).
For a metadata-based breakdown of counts (i.e. tabulation by s-attributes), see dispersion. The hits method is the worker behind the dispersion method and offers similar, yet more low-level, functionality compared to the count method. Using the hits method may be useful to obtain the data required for flexible cross-tabulations.
count(debates, query = "Arbeit") # get frequencies for one token
#>     query  match count         freq
#> 1: Arbeit Arbeit   159 0.0007155683

#>       query    match count         freq
#> 1:   Arbeit   Arbeit   159 7.155683e-04
#> 2: Freizeit Freizeit     1 4.500430e-06
#> 3:  Zukunft  Zukunft   142 6.390610e-04

#>          query count         freq
#> 1:   Migration     3 1.350129e-05
#> 2: Integration    23 1.035099e-04

debates <- partition_bundle(
  "GERMAPARLMINI",
  s_attribute = "date",
  values = NULL,
  mc = FALSE,
  verbose = FALSE
)
y <- count(debates, query = "Arbeit", p_attribute = "word")
y <- count(debates, query = c("Arbeit", "Migration", "Zukunft"), p_attribute = "word")

count("GERMAPARLMINI", '"Integration.*"', breakdown = TRUE)
#>                query                       match count share
#>  1: "Integration.*"                 Integration    23 53.49
#>  2: "Integration.*"         Integrationspolitik     9 20.93
#>  3: "Integration.*"           Integrationskurse     2  4.65
#>  4: "Integration.*"        Integrationsangebote     1  2.33
#>  5: "Integration.*"    Integrationsbereitschaft     1  2.33
#>  6: "Integration.*"         Integrationserfolge     1  2.33
#>  7: "Integration.*"     Integrationsfähigkeiten     1  2.33
#>  8: "Integration.*"       Integrationskarrieren     1  2.33
#>  9: "Integration.*" Integrationspartnerschaften     1  2.33
#> 10: "Integration.*"            Integrationsplan     1  2.33
#> 11: "Integration.*"      Integrationspolitik.Es     1  2.33
#> 12: "Integration.*"       Integrationsverträgen     1  2.33

count(P, '"Integration.*"', breakdown = TRUE)
#>               query                    match count share
#> 1: "Integration.*"              Integration    15 48.39
#> 2: "Integration.*"      Integrationspolitik     8 25.81
#> 3: "Integration.*"        Integrationskurse     2  6.45
#> 4: "Integration.*"     Integrationsangebote     1  3.23
#> 5: "Integration.*" Integrationsbereitschaft     1  3.23
#> 6: "Integration.*"      Integrationserfolge     1  3.23
#> 7: "Integration.*"  Integrationsfähigkeiten     1  3.23
#> 8: "Integration.*"    Integrationskarrieren     1  3.23
#> 9: "Integration.*"   Integrationspolitik.Es     1  3.23

sc <- corpus("GERMAPARLMINI") %>% subset(party == "SPD")
phr <- cpos(sc, query = '"Deutsche.*" "Bundestag.*"', cqp = TRUE) %>%
  as.phrases(corpus = "GERMAPARLMINI", enc = "latin1")
cnt <- count(sc, phrases = phr, p_attribute = "word")

# Multiple queries and overlapping query matches. The first count
# operation will issue a warning that matches overlap, see the second
# example for a solution.
corpus("REUTERS") %>%
  count(query = c('".*oil"', '"turmoil"'), cqp = TRUE)
#> Warning: The CQP queries processed result in at least one overlapping query. Summing up the counts for the individual query matches may result in an overestimation of the total number of hits. To avoid this, consider collapsing multiple CQP queries into one single query.
#>        query count         freq
#> 1:   ".*oil"    79 0.0195061728
#> 2: "turmoil"     1 0.0002469136

#>                query count       freq
#> 1: "(.*oil|turmoil)"    79 0.01950617