Calculate Pointwise Mutual Information as an information-theoretic approach to find collocations.

pmi(.Object, ...)

# S4 method for context
pmi(.Object)

# S4 method for Cooccurrences
pmi(.Object)

# S4 method for ngrams
pmi(.Object, observed, p_attribute = p_attributes(.Object)[1])

Arguments

.Object

An object.

...

Arguments that methods may require.

observed

A count-object with the numbers of the observed occurrences of the tokens in the input ngrams object.

p_attribute

The positional attribute which shall be considered. Relevant only if ngrams have been calculated for more than one p-attribute.

Details

Pointwise mutual information (PMI) is calculated as follows (see Manning/Schuetze 1999): $$I(x,y) = \log_2 \frac{p(x,y)}{p(x)\,p(y)}$$

The formula is based on maximum likelihood estimates: given the number of observations for token x, \(o_{x}\), the number of observations for token y, \(o_{y}\), and the size of the corpus N, the probabilities for the tokens x and y, and for the co-occurrence of x and y, are as follows: $$p(x) = \frac{o_{x}}{N}$$ $$p(y) = \frac{o_{y}}{N}$$

The term p(x,y) is estimated in the same way from the number of observed co-occurrences of x and y, \(o_{xy}\): $$p(x,y) = \frac{o_{xy}}{N}$$
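As a worked illustration, the estimates can be substituted into the PMI formula. The symbol \(o_{xy}\) for the observed co-occurrence count and the estimate \(p(x,y) = o_{xy}/N\) are assumptions made here, chosen to match the computation shown in the Examples; the counts are invented for illustration and not taken from any corpus. $$I(x,y) = \log_2 \frac{o_{xy}/N}{(o_{x}/N)(o_{y}/N)} = \log_2 \frac{o_{xy} \cdot N}{o_{x} \cdot o_{y}}$$

With \(o_{x} = 100\), \(o_{y} = 50\), \(o_{xy} = 20\) and \(N = 10000\), this gives $$I(x,y) = \log_2 \frac{20 \cdot 10000}{100 \cdot 50} = \log_2 40 \approx 5.32$$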

Note that the computation uses the logarithm to base 2, not the natural logarithm used in some examples elsewhere (e.g. https://en.wikipedia.org/wiki/Pointwise_mutual_information).
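The effect of the logarithm base can be checked with a minimal base R sketch. It does not call polmineR. The variable names and counts are hypothetical, and the joint probability is assumed to be the co-occurrence count divided by N, as in the Examples.

```r
# Toy counts (hypothetical, not taken from any corpus)
N    <- 10000          # corpus size
o_x  <- 100            # observations of token x
o_y  <- c(50, 400)     # observations of two candidate tokens y
o_xy <- c(20, 4)       # co-occurrences of x with each y (assumed joint counts)

# Ratio of the observed joint probability to the one expected under independence
ratio <- (o_xy / N) / ((o_x / N) * (o_y / N))

pmi_bits <- log2(ratio)  # base 2, as described above
pmi_nats <- log(ratio)   # natural logarithm, for comparison

# The two scales differ only by the constant factor log(2)
all.equal(pmi_bits, pmi_nats / log(2))
```

The first pair co-occurs 40 times more often than independence would predict, so its PMI is log2(40), roughly 5.32. The second pair co-occurs about as often as expected, so its PMI is approximately 0. Scores computed with the natural logarithm rank pairs identically but are smaller by the factor log(2).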

References

Manning, Christopher D.; Schuetze, Hinrich (1999): Foundations of Statistical Natural Language Processing. MIT Press: Cambridge, Mass., pp. 178-183.

See also

Other statistical methods: chisquare(), ll(), t_test()

Examples

y <- cooccurrences("REUTERS", query = "oil", method = "pmi")
N <- size(y)[["partition"]]
I <- log2((y[["count_coi"]]/N) / ((count(y) / N) * (y[["count_partition"]] / N)))
use("polmineR")
#> ... activating corpus: GERMAPARLMINI (version: 0.0.1 | build date: 2019-02-23)
#> ... activating corpus: REUTERS
dt <- decode(
  "REUTERS",
  p_attribute = "word",
  s_attribute = character(),
  to = "data.table",
  verbose = FALSE
)
#> assembling data.table
n <- ngrams(dt, n = 2L, p_attribute = "word")
obs <- count("REUTERS", p_attribute = "word")
phrases <- pmi(n, observed = obs)