Apply the log-likelihood statistic to detect cooccurrences or keywords.
```r
ll(.Object, ...)

# S4 method for features
ll(.Object)

# S4 method for context
ll(.Object)

# S4 method for cooccurrences
ll(.Object)

# S4 method for Cooccurrences
ll(.Object, verbose = TRUE)
```
`.Object` | An object of class `features`, `context`, `cooccurrences` or `Cooccurrences`.
---|---
`...` | Further arguments (such as `verbose`).
`verbose` | Logical, whether to output messages.
The log-likelihood test to detect cooccurrences is a standard approach to find collocations (Dunning 1993, Evert 2005, 2009).
(a) The basis for computing the log-likelihood statistic is a contingency table of observations, which is prepared for every single token in the corpus. It reports counts for the token under inspection and for all other tokens in a corpus of interest (coi) and a reference corpus (ref):
 | coi | ref | TOTAL
---|---|---|---
count token | \(o_{11}\) | \(o_{12}\) | \(r_{1}\)
other tokens | \(o_{21}\) | \(o_{22}\) | \(r_{2}\)
TOTAL | \(c_{1}\) | \(c_{2}\) | \(N\)
(b) Based on the contingency table(s) with observed counts, expected values are calculated for each cell, as the product of the respective row and column margin sums, divided by the overall number of tokens (see example and the formula below).
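In the notation of the contingency table above, the expected value for each cell is: $$E_{ij} = \frac{r_{i} \cdot c_{j}}{N}$$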
(c) The standard formula for calculating the log-likelihood test is as follows: $$G^{2} = 2 \sum_{i,j} o_{ij} \log\left(\frac{o_{ij}}{E_{ij}}\right)$$ Note: Before polmineR v0.7.11, a simplification of the formula was used (Rayson/Garside 2000), which omits the third and fourth terms of the sum: $$ll = 2 \left( o_{11} \log\left(\frac{o_{11}}{E_{11}}\right) + o_{12} \log\left(\frac{o_{12}}{E_{12}}\right) \right)$$ The simplified formula offers a (small) gain in computational efficiency, and its result is almost identical to that of the standard formula; see, however, the critical discussion by Ulrike Tabbert (2015: 84ff).
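As a quick illustration (a sketch with invented counts, not taken from a real corpus), both variants can be computed side by side; for counts of this order of magnitude the two values differ only marginally:

```r
# Toy contingency table with invented counts: columns are coi and ref,
# rows are the inspected token and all other tokens.
o <- matrix(c(10, 990, 20, 8980), nrow = 2)
e <- outer(rowSums(o), colSums(o)) / sum(o)              # expected values E_ij = r_i * c_j / N
g2_standard <- 2 * sum(o * log(o / e))                   # all four terms
ll_simplified <- 2 * sum(o[1, ] * log(o[1, ] / e[1, ]))  # first two terms only (Rayson/Garside)
```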
The implementation in the `ll`-method uses a vectorized approach to the computation, which is substantially faster than iterating over the rows of a table, generating individual contingency tables, etc. As using the standard formula is not significantly slower than relying on the simplified formula, polmineR has moved to the standard computation.
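The following sketch conveys the vectorized idea; it is an illustration only, not polmineR's actual internal code. Here, `o11` and `o12` are assumed to be vectors of observed counts (one element per candidate token), and `size_coi` and `size_ref` are the sizes of the corpus of interest and the reference corpus:

```r
# Minimal sketch of a vectorized log-likelihood computation. All cells,
# margins and expected values are derived as plain vectors, so the G2
# statistic for every token is obtained in a single pass, without
# constructing one contingency table per token.
ll_vectorized <- function(o11, o12, size_coi, size_ref) {
  o21 <- size_coi - o11       # other tokens in corpus of interest
  o22 <- size_ref - o12       # other tokens in reference corpus
  N <- size_coi + size_ref    # total number of tokens
  r1 <- o11 + o12             # row margin: token under inspection
  r2 <- o21 + o22             # row margin: all other tokens
  e11 <- r1 * size_coi / N    # expected values E_ij = r_i * c_j / N
  e12 <- r1 * size_ref / N
  e21 <- r2 * size_coi / N
  e22 <- r2 * size_ref / N
  2 * (o11 * log(o11 / e11) + o12 * log(o12 / e12) +
       o21 * log(o21 / e21) + o22 * log(o22 / e22))
}
```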
An inherent difficulty of the log-likelihood statistic is that the test value cannot be computed if the observed count in the reference corpus is 0, i.e. if a term occurs exclusively in the neighborhood of the node word. When rare words are filtered out of the result table, the respective `NA` values will usually disappear.
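A toy computation (invented numbers) shows the problem: if \(o_{12} = 0\), the term \(o_{12} \log(o_{12}/E_{12})\) involves \(0 \cdot \log(0)\), which R evaluates to NaN:

```r
# With a zero count in the reference corpus, 0 * log(0) is 0 * -Inf = NaN,
# so no test value can be reported for such a term.
2 * (3 * log(3 / 1.5) + 0 * log(0 / 1.5))
#> [1] NaN
```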
Dunning, Ted (1993): Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, Vol. 19, No. 1, pp. 61-74.
Rayson, Paul; Garside, Roger (2000): Comparing Corpora using Frequency Profiling. The Workshop on Comparing Corpora. https://www.aclweb.org/anthology/W00-0901/.
Evert, Stefan (2005): The Statistics of Word Cooccurrences. Word Pairs and Collocations. URN urn:nbn:de:bsz:93-opus-23714. https://elib.uni-stuttgart.de/bitstream/11682/2573/1/Evert2005phd.pdf
Evert, Stefan (2009): Corpora and Collocations. In: A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook. Mouton de Gruyter, Berlin, pp. 1212-1248 (ch. 58).
Tabbert, Ulrike (2015): Crime and Corpus. The Linguistic Representation of Crime in the Press. Amsterdam: Benjamins.
```r
# use ll-method explicitly
oil <- cooccurrences("REUTERS", query = "oil", method = NULL)
oil <- ll(oil)
oil_min <- subset(oil, count_coi >= 3)
if (interactive()) View(format(oil_min))
summary(oil)
#>        p critical_value N_ll
#> 1: 0.001          10.83    4
#> 2: 0.005           7.88    4
#> 3: 0.010           6.63    4
#> 4: 0.050           3.84   17

# use ll-method on 'Cooccurrences'-object
if (FALSE) {
  R <- Cooccurrences("REUTERS", left = 5L, right = 5L, p_attribute = "word")
  ll(R)
  decode(R)
  summary(R)
}

# use log-likelihood test for feature extraction
x <- partition(
  "GERMAPARLMINI",
  speaker = "Merkel",
  interjection = "speech",
  regex = TRUE,
  p_attribute = "word"
)
# compute the ll statistic right away (method = "ll") ...
f <- features(x, y = "GERMAPARLMINI", included = TRUE, method = "ll")
# ... or defer the computation and apply ll() explicitly
f <- features(x, y = "GERMAPARLMINI", included = TRUE, method = NULL)
f <- ll(f)
summary(f)
#>        p critical_value N_ll
#> 1: 0.001          10.83   63
#> 2: 0.005           7.88  114
#> 3: 0.010           6.63  143
#> 4: 0.050           3.84  397

if (FALSE) {
  # A sample do-it-yourself calculation for log-likelihood:
  # compute the ll-value for query "oil" and the word "prices"
  oil <- context("REUTERS", query = "oil", left = 5, right = 5)

  # (a) prepare matrix with observed values
  o <- matrix(data = rep(NA, 4), ncol = 2)
  o[1, 1] <- as(oil, "data.table")[word == "prices"][["count_coi"]]
  o[1, 2] <- count("REUTERS", query = "prices")[["count"]] - o[1, 1]
  o[2, 1] <- size(oil)[["coi"]] - o[1, 1]
  o[2, 2] <- size(oil)[["ref"]] - o[1, 2]

  # (b) prepare matrix with expected values, calculating margin sums first
  r <- rowSums(o)
  c <- colSums(o)
  N <- sum(o)
  e <- matrix(data = rep(NA, 4), ncol = 2) # matrix with expected values
  e[1, 1] <- r[1] * (c[1] / N)
  e[1, 2] <- r[1] * (c[2] / N)
  e[2, 1] <- r[2] * (c[1] / N)
  e[2, 2] <- r[2] * (c[2] / N)

  # (c) compute log-likelihood value
  ll_value <- 2 * (
    o[1, 1] * log(o[1, 1] / e[1, 1]) +
    o[1, 2] * log(o[1, 2] / e[1, 2]) +
    o[2, 1] * log(o[2, 1] / e[2, 1]) +
    o[2, 2] * log(o[2, 2] / e[2, 2])
  )

  # compare with the value computed by polmineR
  df <- as.data.frame(cooccurrences("REUTERS", query = "oil"))
  subset(df, word == "prices")[["ll"]]
}
```