Perform Chisquare-Test based on a table with counts

chisquare(.Object)

# S4 method for features
chisquare(.Object)

# S4 method for context
chisquare(.Object)

# S4 method for cooccurrences
chisquare(.Object)

Arguments

.Object

A features object, or an object inheriting from it (context, cooccurrences).

Value

Same class as input object, with enriched table in the stat-slot.

Details

The basis for computing the chi-square test is a contingency table of observations, which is prepared for every single token in the corpus. It reports counts for the token under inspection and for all other tokens, both in a corpus of interest (coi) and in a reference corpus (ref):

               coi          ref          TOTAL
count token    \(o_{11}\)   \(o_{12}\)   \(r_{1}\)
other tokens   \(o_{21}\)   \(o_{22}\)   \(r_{2}\)
TOTAL          \(c_{1}\)    \(c_{2}\)    N

Based on the contingency table, expected values are calculated for each cell as the product of the corresponding row and column margin sums, divided by the overall number of tokens (see examples). The chi-square statistic is then computed as follows: $$X^{2} = \sum_{i,j}{\frac{(O_{ij} - E_{ij})^2}{E_{ij}}}$$ Results from the chi-square test are only robust if there are at least 5 observed counts in the corpus of interest, so results usually need to be filtered accordingly (see examples).
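The computation described above can be sketched in base R without a corpus. The counts below are hypothetical and chosen only for illustration, and the variable names follow the notation of the contingency table. This is a minimal sketch of the arithmetic, not the package's internal implementation.

```r
# Hypothetical counts (not taken from any corpus), for illustration only
o <- matrix(
  c(20, 80,       # o11, o12: token in coi, token in ref
    980, 9920),   # o21, o22: other tokens in coi, other tokens in ref
  nrow = 2, byrow = TRUE,
  dimnames = list(c("token", "other"), c("coi", "ref"))
)

r <- rowSums(o)   # row margin sums r1, r2
cs <- colSums(o)  # column margin sums c1, c2
N <- sum(o)       # overall number of tokens

# Expected value per cell: row sum times column sum, divided by N
e <- outer(r, cs) / N

# Chi-square statistic: squared deviations relative to the expected values
chisquare_value <- sum((o - e)^2 / e)

# Cross-check against stats::chisq.test() without continuity correction
reference <- unname(chisq.test(o, correct = FALSE)$statistic)
stopifnot(isTRUE(all.equal(chisquare_value, reference)))
```

Dividing by the expected values (not the observed ones) is what makes the hand-computed value agree with `chisq.test(correct = FALSE)`.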

References

Manning, Christopher D.; Schuetze, Hinrich (1999): Foundations of Statistical Natural Language Processing. MIT Press: Cambridge, Mass., pp. 169-172.

Kilgarriff, A. and Rose, T. (1998): Measures for corpus similarity and homogeneity. Proc. 3rd Conf. on Empirical Methods in Natural Language Processing. Granada, Spain, pp. 46-52.

See also

Other statistical methods: ll(), pmi(), t_test()

Author

Andreas Blaette

Examples

use("polmineR")
#> ... activating corpus: GERMAPARLMINI (version: 0.0.1 | build date: 2019-02-23)
#> ... activating corpus: REUTERS
library(data.table)
m <- partition(
  "GERMAPARLMINI", speaker = "Merkel", interjection = "speech",
  regex = TRUE, p_attribute = "word"
)
#> ... get encoding: latin1
#> ... get cpos and strucs
#> ... getting counts for p-attribute(s): word
f <- features(m, "GERMAPARLMINI", included = TRUE)
f_min <- subset(f, count_coi >= 5)
summary(f_min)
#>        p critical_value N_chisquare
#> 1: 0.001          10.83          60
#> 2: 0.005           7.88          77
#> 3: 0.010           6.63          80
#> 4: 0.050           3.84         113
if (FALSE) {
# A sample do-it-yourself calculation for chisquare:

# (a) prepare matrix with observed values
o <- matrix(data = rep(NA, 4), ncol = 2)
o[1,1] <- as.data.table(m)[word == "Weg"][["count"]]
o[1,2] <- count("GERMAPARLMINI", query = "Weg")[["count"]] - o[1,1]
o[2,1] <- size(f)[["coi"]] - o[1,1]
o[2,2] <- size(f)[["ref"]] - o[1,2]

# prepare matrix with expected values, calculate margin sums first
r <- rowSums(o)
c <- colSums(o)
N <- sum(o)

e <- matrix(data = rep(NA, 4), ncol = 2)
e[1,1] <- r[1] * (c[1] / N)
e[1,2] <- r[1] * (c[2] / N)
e[2,1] <- r[2] * (c[1] / N)
e[2,2] <- r[2] * (c[2] / N)

# compute chisquare statistic
y <- matrix(rep(NA, 4), ncol = 2)
for (i in 1:2) for (j in 1:2) y[i,j] <- (o[i,j] - e[i,j])^2 / e[i,j]
chisquare_value <- sum(y)

as(f, "data.table")[word == "Weg"][["chisquare"]]
}
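The critical values shown in the summary() output above coincide with quantiles of the chi-square distribution with one degree of freedom, which is the reference distribution for a 2 x 2 contingency table. The following base R sketch reproduces them and converts a statistic into a p-value. The statistic `x2` is a hypothetical value, and whether the package derives its critical values in exactly this way is an assumption.

```r
# Significance levels as listed in the summary
p <- c(0.001, 0.005, 0.010, 0.050)

# Critical values: upper quantiles of the chi-square distribution, df = 1
critical_value <- round(qchisq(1 - p, df = 1), 2)
data.frame(p = p, critical_value = critical_value)

# p-value for a hypothetical chi-square statistic (illustrative value only)
x2 <- 12.5
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)
```

A token whose statistic exceeds the critical value for a given level is counted in the corresponding N_chisquare row of the summary.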