Calculate Pointwise Mutual Information as an information-theoretic approach to find collocations.
```r
pmi(.Object, ...)

# S4 method for context
pmi(.Object)

# S4 method for Cooccurrences
pmi(.Object)

# S4 method for ngrams
pmi(.Object, observed, p_attribute = p_attributes(.Object)[1])
```
| Argument | Description |
|---|---|
| `.Object` | An object. |
| `...` | Arguments methods may require. |
| `observed` | A `count` object with the observed counts of the individual tokens, used to derive the marginal probabilities p(x) and p(y). |
| `p_attribute` | The positional attribute to be considered. Relevant only if ngrams have been calculated for more than one p-attribute. |
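As a sketch of direct S4 dispatch, the following applies `pmi()` to a `Cooccurrences` object. It assumes the REUTERS demo corpus shipped with polmineR and the `Cooccurrences()` constructor with `left`/`right` context windows; treat the argument details as illustrative rather than authoritative.

```r
library(polmineR)
use("polmineR")  # activate the REUTERS demo corpus

# Corpus-wide co-occurrence statistics (window of 5 tokens on each side);
# arguments are assumptions based on the Cooccurrences() signature
cooc <- Cooccurrences("REUTERS", p_attribute = "word", left = 5L, right = 5L)

# The Cooccurrences method of pmi() decorates the object with PMI scores
cooc_pmi <- pmi(cooc)
```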
Pointwise mutual information (PMI) is calculated as follows (see Manning/Schuetze 1999): $$I(x,y) = \log\frac{p(x,y)}{p(x)p(y)}$$
The formula is based on maximum likelihood estimates: When we know the number of observations for token x, \(o_{x}\), the number of observations for token y, \(o_{y}\), and the size of the corpus N, the probabilities for the tokens x and y are estimated as follows: $$p(x) = \frac{o_{x}}{N}$$ $$p(y) = \frac{o_{y}}{N}$$
The joint probability p(x,y) is estimated analogously from the number of observed co-occurrences of x and y, \(o_{x,y}\): $$p(x,y) = \frac{o_{x,y}}{N}$$
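To make the estimation concrete, here is a minimal sketch in R with made-up counts (the numbers are illustrative, not taken from any corpus):

```r
# Illustrative counts (not from a real corpus)
o_x  <- 100    # occurrences of token x
o_y  <- 50     # occurrences of token y
o_xy <- 20     # observed co-occurrences of x and y
N    <- 10000  # corpus size

# Maximum likelihood estimates of the probabilities
p_x  <- o_x / N
p_y  <- o_y / N
p_xy <- o_xy / N

# PMI with the base-2 logarithm
log2(p_xy / (p_x * p_y))  # 5.321928
```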
Note that the computation uses the base-2 logarithm, not the natural logarithm used in some presentations of PMI (e.g. https://en.wikipedia.org/wiki/Pointwise_mutual_information).
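If you need to compare against natural-log PMI values, dividing by `log(2)` rescales them to base 2:

```r
ratio <- 40          # p(x,y) / (p(x) * p(y)) from the sketch above
log(ratio) / log(2)  # natural-log value rescaled to base 2
log2(ratio)          # identical result
```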
Manning, Christopher D.; Schuetze, Hinrich (1999): Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, pp. 178-183.
```r
use("polmineR")  # activate the REUTERS demo corpus

# Compute PMI scores for co-occurrences of "oil"
y <- cooccurrences("REUTERS", query = "oil", method = "pmi")

# Recompute the PMI scores manually from the raw counts
N <- size(y)[["partition"]]
I <- log2((y[["count_coi"]] / N) / ((count(y) / N) * (y[["count_partition"]] / N)))

# Decode the corpus into a data.table as input for the ngrams method
dt <- decode(
  "REUTERS",
  p_attribute = "word",
  s_attribute = character(),
  to = "data.table",
  verbose = FALSE
)
```
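The decoded data.table is the natural input for the ngrams method. The following continuation is a sketch under the assumption that `ngrams()` accepts the decoded table and that corpus-wide token counts are passed via `observed`; it is not part of the original example.

```r
# Bigrams from the decoded corpus (hypothetical continuation)
n <- ngrams(dt, n = 2L, p_attribute = "word")

# Corpus-wide token counts supply the marginal probabilities p(x) and p(y)
obs <- count("REUTERS", p_attribute = "word")

# PMI scores for the bigrams
pmi(n, observed = obs)
```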