September 30, 2019

The art of counting

  • Working with corpora is much more than counting. But counting words and lexical units is the basic operation for any more complex analysis, and may yield substantial results.

  • To count is to measure! Is my measurement valid? Are there sufficient safeguards that I actually measure what I intend to measure?

  • Statements about salience require taking the difference between absolute and relative frequencies seriously. Frequencies are the normalisation of counts: absolute counts are divided by the corpus/subcorpus size.

  • There is a wide variety of scenarios for counting: We will focus on time series analysis and dictionary-based analyses.

  • Basic methods for counting in the polmineR package are count(), dispersion() and as.TermDocumentMatrix(). These methods are applicable to corpus and subcorpus objects. For the following examples, we use the corpus of the verbatim records of the UN General Assembly.

library(polmineR)
use("UNGA")

Counting basics: The count()-method

  • The most basic usage of the count()-method is to look up the number of occurrences of a search term (query) in a corpus.
count("UNGA", query = "refugees")
##       query count         freq
## 1: refugees  6562 0.0001523474
  • The column count reports the absolute number of observations, the column freq the relative frequency. The frequency results from a simple division of the absolute count by the corpus size.
count("UNGA", query = "refugees")[["count"]] / size("UNGA")
## [1] 0.0001523474
  • We can use a character-vector with several search terms.
count("UNGA", query = c("refugees", "asylum"))
##       query count         freq
## 1: refugees  6562 1.523474e-04
## 2:   asylum   415 9.634893e-06

Using regular expressions and CQP

  • The count()-method accepts the syntax of the Corpus Query Processor (CQP) for the argument query. One implication is that we can use regular expressions. The query needs to be wrapped in single quotation marks, and the argument cqp is set to TRUE.
count("UNGA", query = "'refugee.*'", cqp = TRUE) # with CQP syntax
##          query count         freq
## 1: 'refugee.*'  7966 0.0001849435
  • We can get a breakdown of the matches we have generated by setting the argument breakdown to TRUE.
dt <- count("UNGA", query = "'refugee.*'", cqp = TRUE, breakdown = TRUE)
  • The CQP syntax can also be used to match multi-word lexical units.
dp <- count("UNGA", query = '"displaced" "persons"', cqp = TRUE)

Matches for our regular expression

Regular Expressions: Character classes

Sign  Description
.     wildcard / matches any character
\d    “digit” (0 to 9)
\w    word character
\s    whitespace
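Outside CQP, the same character classes work in base R regular expressions; note that the backslash must itself be escaped in R string literals:

```r
grepl("\\d", "Resolution 1325")  # TRUE: the string contains a digit
grepl("\\s", "two words")        # TRUE: whitespace between the words
grepl("r.fugee", "refugee")      # TRUE: . matches the "e"
grepl("\\w", "!!!")              # FALSE: no word character present
```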

Regular Expressions: Quantifiers

Sign       Description
?          Zero or one occurrence of the preceding element.
+          One or more occurrences of the preceding element.
*          Zero or more occurrences of the preceding element.
{n}        The preceding item is matched exactly n times.
{min,}     The preceding item is matched min or more times.
{min,max}  The preceding item is matched at least min times, but not more than max times.
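The quantifiers can likewise be tried out with base R, independent of a corpus:

```r
x <- c("refuge", "refugee", "refugees")
grepl("refugees?", x)  # FALSE TRUE TRUE: the "s" is optional, the second "e" is not
grepl("refuge.*", x)   # TRUE TRUE TRUE: .* matches any (possibly empty) tail
grepl("e{2}", x)       # FALSE TRUE TRUE: exactly two consecutive "e"
```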

Regular Expressions: Examples I

  • This is sufficient to formulate queries that cover a broad set of cases.
count("UNGA", query = '"refuge.*"', cqp = TRUE, breakdown = TRUE) %>% head(n = 3)
##         query    match count share
## 1: "refuge.*" refugees  6562 78.95
## 2: "refuge.*"  refugee  1398 16.82
## 3: "refuge.*"   refuge   339  4.08
  • Alternative characters can be put in square brackets …
count("UNGA", query = '"[rR]efuge.*"', cqp = TRUE, breakdown = TRUE) %>% head(n = 3)
##            query    match count share
## 1: "[rR]efuge.*" refugees  6562 64.50
## 2: "[rR]efuge.*" Refugees  1815 17.84
## 3: "[rR]efuge.*"  refugee  1398 13.74

Regular Expressions: Examples II

  • Alternative expressions are wrapped in parentheses and separated by the “|” sign (alternation).
count("UNGA", query = '"(imm|em)igration.*"', breakdown = TRUE) %>% head()
##                   query        match count share
## 1: "(imm|em)igration.*"  immigration   525 78.71
## 2: "(imm|em)igration.*"   emigration   141 21.14
## 3: "(imm|em)igration.*" immigrations     1  0.15

Querying the token stream

  • Square brackets serve as a placeholder for any token …
count("UNGA", query = '"United" "Nations" "Commission" "for" []', cqp = TRUE, breakdown = TRUE)
##                                       query
## 1: "United" "Nations" "Commission" "for" []
## 2: "United" "Nations" "Commission" "for" []
## 3: "United" "Nations" "Commission" "for" []
##                                   match count share
## 1: United Nations Commission for Social     3 50.00
## 2:  United Nations Commission for India     2 33.33
## 3: United Nations Commission for Ruanda     1 16.67
  • Curly brackets can be used as a quantifier …
count("UNGA", query = '"[Rr]efugee.*" []{0,5} "burden"', cqp = TRUE, breakdown = TRUE) %>%
  head(n = 3) %>% subset(select = c("match", "count", "share"))
##                                 match count share
## 1:                     refugee burden     4 22.22
## 2:    refugees places an added burden     2 11.11
## 3: refugee population is a big burden     1  5.56

Dispersion analysis

  • Diachronic change and synchronic variation play a fundamental role in corpus analysis.

  • The dispersion()-method serves as a tool to efficiently count over one or two dimensions.

dt <- dispersion("UNGA", query = '"refugees"', s_attribute = "year")
head(dt) # we only look at the beginning of the table
##         query year count
## 1: "refugees" 1993     0
## 2: "refugees" 1994   384
## 3: "refugees" 1995   342
## 4: "refugees" 1996   342
## 5: "refugees" 1997   268
## 6: "refugees" 1998   344
  • The dispersion()-method can process the CQP syntax just as the count()-method.
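A hedged sketch of combining both features (assuming the s-attributes year and state_organization are defined for the UNGA corpus, as in the kwic() examples): a CQP query dispersed over two dimensions at once.

```r
library(polmineR)
use("UNGA")

# CQP query, counted across two s-attributes simultaneously
dispersion(
  "UNGA",
  query = '"[Rr]efugee.*"',
  cqp = TRUE,
  s_attribute = c("year", "state_organization")
)
```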

Simple visualisation

  • This time, we retrieve relative frequencies right away.
dt <- dispersion("UNGA", query = '"refugee.*"', s_attribute = "year", freq = TRUE)
  • We can visualise the result of our numerical analysis using a bar plot.
barplot(
  height = dt[["freq"]] * 100000,
  names.arg = dt[["year"]],
  las = 2, ylab = "matches per 100,000 words"
  )

Refugees in the UN General Assembly

From numbers to words

  • To obtain valid judgements about the meaning of counts (e.g. in a time-series analysis), it is necessary to inspect the context of the matches obtained.

  • The kwic()-method (for keyword-in-context) serves this purpose. The basic usage is as follows.

kwic("UNGA", query = '"[Rr]efugee.*"', cqp = TRUE)
  • A more realistic scenario looks like this …
k <- corpus("UNGA") %>%
  subset(year == "2015") %>%
  kwic(
    query = '"[Rr]efugee.*"', cqp = TRUE,
    left = 10, right = 10,
    s_attributes = c("state_organization", "year")
    )

Concordances

Questions / concerns

  • Which metadata (i.e. display of s-attributes) do you need for your statements?

  • How much word context (left and right tokens) do you need to arrive at findings?

  • How do you think you could achieve intersubjectivity for your findings?

  • Which workflow would support your work?

Annotating KWIC results

  • First, we get the kwic lines …
refkwic <- corpus("UNGA") %>%
  subset(year == "1999") %>%
  kwic(query = '"refugee.*"', cqp = TRUE, left = 15, right = 15) %>%
  enrich(s_attribute = c("state_organization", "speaker"))
  • Then we add an annotation layer …
annotations(refkwic) <- list(name = "description", what = "")
  • We then edit this result!
edit(refkwic)
refkwic

Conclusions