26 September 2019

Fundamentals of sentiment analysis

Sentiment analyses are very popular. Text mining blogs show the many possibilities of capturing the evaluative tone of texts with a numerical indicator and of analysing and displaying how it changes over time.

Which movies are rated particularly good or particularly bad? This can be examined using film reviews. What is the response of customers to a new product? Comments in social media can be examined for this purpose. There is certainly a range of useful application scenarios for sentiment analyses, especially beyond science.

What are the benefits of sentiment analyses in scientific work? Here, the validity and reliability of the method pose a challenge. What do we measure when we measure ‘sentiments’? The answer to this question determines when and how sentiment analyses can be used as a fruitful and methodologically sound research instrument. There is one major distinction within the approach:

  • Dictionary-based methods measure sentiment using lists of positive and negative vocabulary.
  • Machine-learning-based methods are trained on texts with known evaluations and use the resulting model to infer the evaluation of new, unseen texts.

In this manual we work with the much simpler dictionary-based method.

Required installations / packages

The following explanations use the polmineR package and the UNGA corpus. The installation is explained in a separate set of slides. In addition, we use the following packages:

  • zoo: A package for working with time series data;
  • magrittr: Tools for chaining R commands in a “pipe” one after the other (see below);
  • devtools: Developer tools; here we use it to install the development version of polmineR;
  • data.table: Efficient handling of tabular data; used below for the sentiment dictionaries.

The following code checks whether these packages are installed and installs them if necessary.

required_packages <- c("zoo", "magrittr", "devtools", "data.table")
for (pkg in required_packages){
  if (pkg %in% rownames(installed.packages()) == FALSE) install.packages(pkg)
}

Please note that the functionality for the following workflow is only available with polmineR version 0.7.9.9035. If required, install the current development version of polmineR.

if (packageVersion("polmineR") < as.package_version("0.7.9.9035"))
  devtools::install_github("PolMine/polmineR", ref = "dev")

Let’s go

The required packages are now loaded.

library(zoo, quietly = TRUE, warn.conflicts = FALSE)
library(devtools)
library(magrittr)
library(data.table)

We also load polmineR and activate the UNGA corpus, which is available through the UNGA package.

library(polmineR)
use("UNGA")

Which dictionary to use?

While for German we use the classic and more or less unrivalled “SentiWS” dictionary, for English there are several dictionaries that could be used for sentiment analysis. They all need some kind of preparation, which is performed in a separate script.

The dictionaries should provide a numeric weight rather than only a binary classification, and they should be released under some kind of open-source license.

Preferably, the dictionary would have some linguistic information such as Part-of-Speech tags which can improve the analysis. In addition, the more words the dictionary contains, the more informative the analysis should become.

Three dictionaries might be used here …

AFINN

  • The AFINN sentiment lexicon by Finn Årup Nielsen offers a feature set comparable to SentiWS. AFINN provides a word list in which terms are scored on a scale from -5 to +5 to indicate sentiment (rescaled to -1 to +1 for our purposes). However, it is rather limited in length (2477 terms) and does not provide linguistic annotation. It is available under the Open Database License (ODbL) v1.0.
source("./script/download_dictionaries.R")
afinn_df <- get_formatted_afinn()
tail(afinn_df, 5)
##       word weight
## 1:   yucky   -0.4
## 2:   yummy    0.6
## 3:  zealot   -0.4
## 4: zealots   -0.4
## 5: zealous    0.4

SentiWordNet

  • The SentiWordNet sentiment lexicon (see Baccianella et al. 2010, http://nmis.isti.cnr.it/sebastiani/Publications/LREC10.pdf) provides a list of over 56,000 terms, each annotated with three numeric values for positivity, negativity and objectivity (i.e. neutrality). The values range from 0 to 1 and sum to 1 for each term (see ibid.: 2200). SentiWordNet 3.0 is available under the CC BY-SA 4.0 License. It is comprehensive in length and does provide Part-of-Speech tags. However, it is built with a strong orientation towards compound words and n-grams, which makes the weight of some words rather obscure.
sentiwordnet_Df <- get_formatted_SentWordNet()
## Warning in mean.default(weight): argument is not numeric or logical:
## returning NA

## (the same warning is repeated many times; identical repetitions omitted)
tail(sentiwordnet_Df, 5)
## Empty data.table (0 rows and 3 cols): pos,word,weight

Syuzhet

  • The Syuzhet dictionary was created in the Nebraska Literary Lab by Matthew Jockers et al. It provides about 11,000 terms on a scale from -1 to +1. The package itself is licensed under GPL-3.
syuzhet_df <- syuzhet::get_sentiment_dictionary()
names(syuzhet_df) <- c("word", "weight")
syuzhet_df <- as.data.table(syuzhet_df)
tail(syuzhet_df, 5)
##       word weight
## 1:    zest   0.50
## 2:  zombie  -0.25
## 3: zombies  -0.25
## 4:   false  -0.60
## 5:    true   0.50

The Choice of Syuzhet

In general, SentiWordNet offers the most comprehensive feature set. However, the weights sometimes prove confusing, which might be due to the inclusion of compound words in the labelling process. For example, the word “international” is labelled as negative, while compound expressions like “international affairs” (which we omitted) can convey positive meaning as well. For its comprehensive size and its resolution (i.e. the number of different levels the weight can take), we will work with the Syuzhet dictionary for now.

Syuzhet: First look at the data

The syuzhet_df object is a data.table object. We use this instead of a classic data.frame, because it will later facilitate and accelerate the matching of the data (words in the corpus and words in the dictionary). To understand what we are working with, we take a quick look at the data.

head(syuzhet_df, 5)
##           word weight
## 1:     abandon  -0.75
## 2:   abandoned  -0.50
## 3:   abandoner  -0.25
## 4: abandonment  -0.25
## 5:    abandons  -1.00

In the last column (“weight”) you can see that words are assigned a weighting in the dictionary. This can be positive (for “positive” vocabulary) or negative (for “negative” vocabulary). We can check how many positive or negative words are in the table.

vocab <- c(positive = nrow(syuzhet_df[weight > 0]), negative = nrow(syuzhet_df[weight < 0]))
vocab
## positive negative 
##     3587     7161

Positive / negative vocabulary

We now examine the word environment of an interesting term. Because it is relevant in the context of debates in the United Nations we ask how the positive/negative connotations of ‘sanctions’ have developed over time.

A preliminary question is how large the left and right word contexts to be examined should be. In linguistic studies, a context of five words to the left and right is common. For political assignments of meaning, a larger window may be needed. We start with 10 words and set this for our R session as follows.

options("polmineR.left" = 10L)
options("polmineR.right" = 10L)

Via a “pipe” we now generate a data.frame (“df”) with the counts of the Syuzhet vocabulary in the word environment of “sanctions”. The pipe makes it possible to carry out the steps one after the other without saving intermediate results.

df <- context("UNGA", query = "sanctions", p_attribute = "word", verbose = FALSE) %>%
  partition_bundle(node = FALSE) %>%
  set_names(s_attributes(., s_attribute = "date")) %>%
  weigh(with = syuzhet_df) %>%
  summary()

The tabular data of the sentiment analysis

The df-data.frame lists the statistics of the word surroundings of each occurrence of “sanctions” in the corpus. To keep things simple, we do not initially work with the weightings, but only with the positive or negative words. We simplify the table accordingly and look at it.

df <- df[, c("name", "size", "positive_n", "negative_n")] 
head(df, n = 10)
##          name size positive_n negative_n
## 1  1999-12-16   20          2          3
## 2  1999-12-16   20          1          0
## 3  1999-12-16   20          2          0
## 4  1999-12-16   20          3          0
## 5  1999-12-16   20          1          0
## 6  1999-12-16   20          2          0
## 7  1999-12-16   20          2          1
## 8  1999-12-20   20          1          3
## 9  1999-12-20   20          1          1
## 10 1999-12-20   20          0          1

Aggregation

Above, we used the date of the occurrence of our search term as the name of each word context. This makes it possible to aggregate the counts upwards from the date to the level of the year.

df[["year"]] <- as.Date(df[["name"]]) %>% format("%Y-01-01")
df_year <- aggregate(df[,c("size", "positive_n", "negative_n")], list(df[["year"]]), sum)
colnames(df_year)[1] <- "year"

However, it does not make sense to work with the absolute frequencies. Therefore, we insert columns that indicate the proportion of negative or positive vocabulary.

df_year$negative_share <- df_year$negative_n / df_year$size
df_year$positive_share <- df_year$positive_n / df_year$size

We convert this into a proper time series object.

Z <- zoo(
  x = df_year[, c("positive_share", "negative_share")],
  order.by = as.Date(df_year[,"year"])
)

Visualisation I

plot(
  Z, ylab = "polarity", xlab = "year", main = "Word context of 'sanctions': Share of positive/negative vocabulary",
  cex = 0.8, cex.main = 0.8
)

Visualisation II

How good are the results?

But what is actually behind the numerical values of the computed sentiment scores? To investigate this, we use polmineR’s ability to filter a KWIC output according to a positive list (a vector of required words), to colour-code words, and to show further information via tooltips (here: word weights). So: move your mouse over the highlighted words!

words_positive <- syuzhet_df[weight > 0][["word"]]
words_negative <- syuzhet_df[weight < 0][["word"]]
Y <- kwic("UNGA", query = "sanctions", positivelist = c(words_positive, words_negative)) %>%
  highlight(lightgreen = words_positive, orange = words_negative) %>%
  tooltips(setNames(syuzhet_df[["word"]], syuzhet_df[["weight"]])) %>%
  as("htmlwidget")

The result stored in object Y (a ‘htmlwidget’) is shown on a separate slide.

Results

Discussion

  • How do you interpret the results of the time series analysis?
  • How valid are the results?
  • Is the transition to working with word weightings useful or necessary?

From dictionary approach to manual annotation

A glance at the KWIC viewer provides valuable insights into both the data itself and the way the sentiment analysis labelled it. But how would one perform a systematic review of the occurrences of our search query? Assuming that a human coder might decide differently about which use of ‘sanctions’ is positive and which is negative, how would one go about it? Let’s call the keyword-in-context analysis again, without using the highlighting functions.

Y2 <- kwic("UNGA", query = "sanctions", positivelist = c(words_positive, words_negative))

Having a large number of occurrences (nearly 4700) it does not seem feasible (or at least desirable) to manually evaluate all matches. But one could certainly label a fraction of these to get an impression of how the word ‘sanctions’ is used.

# check whether a previously annotated sample has already been saved (see the RDS file created below)
already_labelled <- file.exists("./data/sanctions_kwic_sample.rds")

Have a look!

# draw a random sample of 300 occurrences for manual annotation
sample_idx <- sort(sample(1:length(Y2), 300))
Y2_sample <- Y2[sample_idx]

Annotate it

We want to annotate this object systematically, labelling an occurrence of ‘sanctions’ as either positive, negative or neutral. To achieve this, we want to add a drop down menu to our Keyword-in-Context object. After this, using the edit method, we can actually annotate the data.

annotations(Y2_sample) <- list(name = "sentiment", what = factor("a", levels = c("positive", "negative", "neutral")))
edit(Y2_sample)

Using the built-in functionality of polmineR for annotating a variety of objects, among them the KWIC object above, we quickly annotate the 300 entries from our sample. The result is stored within the object.

Y2_sample <- readRDS("./data/sanctions_kwic_sample.rds")
sample_idx <- readRDS("./data/sanctions_labelled_indices.rds")
Y2_sample

Store it!

In order to work with it in another session of R, we can save the annotation as an RDS file. We also want to store the vector with the indices of the actually labelled occurrences.

saveRDS(Y2_sample, "./data/sanctions_kwic_sample.rds")
saveRDS(sample_idx, "./data/sanctions_labelled_indices.rds")

With the annotations saved, we can already do some analyses: we add the year as metadata and create a time series of our annotations, which we then visualise (see the sketch below). Unlike the dictionary-based approach earlier, here we simply calculate the yearly mean of a binary coded negative/positive connotation of ‘sanctions’.

Visualise it!

From manual annotation to machine learning

Our manually coded data is very limited in size, covering about 6% of all occurrences of ‘sanctions’. Here, a machine learning approach could be useful. Instead of labelling the remaining 4400 occurrences by hand, we use the labels we already have to guess the class of the unseen data.

To use our training data in a machine learning context, we transform the keyword-in-context output of all occurrences of “sanctions” into a Document-Term Matrix (DTM), in which each document - in this instance: each occurrence of ‘sanctions’ with its word context - is represented as a bag of words. There are about 6,000 unique terms in the matrix.

dtm <- as.DocumentTermMatrix(Y2, p_attribute = "word")
dim(dtm)
## [1] 4713 5991

LiblineaR for ML

We use the LiblineaR package for text classification. For the technicalities, we essentially follow Niekler and Wiedemann (https://tm4ss.github.io/docs/Tutorial_7_Klassifikation.html).

See the script liblinear_classification.Rmd for the optimization process. In this separate R Markdown file, we removed stop words as well as rare words and determined an optimal cost parameter C (a rough sketch of such a search follows after the next code chunk). Here, we load the preprocessed document-term matrix into our workspace and fit the final model. We also restricted the task to two classes, positive or negative, and omitted the ‘neutral’ class in order to obtain two distinct groups of occurrences.

dtm <- readRDS("./data/dtm_without_stopwords.rds")
annotatedLabels <- readRDS("./data/annotatedLabels.rds")
reduced_sample_idx <- readRDS("./data/new_sample_idx_without_neutral.rds")

source("./script/liblinear_utils.R")
# convert the labelled subset of the DTM into the sparse matrix format expected by LiblineaR
annotatedDTM <- dtm[reduced_sample_idx,] %>%
  t() %>%
  polmineR::as.sparseMatrix() %>%
  Matrix::t() %>%
  convertMatrixToSparseM()

Model fitting

library(LiblineaR)  # make sure LiblineaR is loaded for model fitting and prediction

best_C <- 0.03
final_model <- LiblineaR(annotatedDTM, annotatedLabels, cost = best_C)

# convert the full DTM to the sparse format so that labels can be predicted for all occurrences
dtm_sparse <- dtm %>%
  t() %>%
  polmineR::as.sparseMatrix() %>%
  Matrix::t() %>%
  convertMatrixToSparseM()

final_labels <- predict(final_model, dtm_sparse)$predictions
table(final_labels) / sum(table(final_labels))
## final_labels
##  negative  positive 
## 0.8724804 0.1275196

We want to get these labels back into our initial KWIC object. For this, we first add the annotation column and then overwrite it with the predicted labels.

annotations(Y2) <- list(name = "sentiment", what = factor("a", levels = c("positive", "negative", "neutral")))
Y2@stat$sentiment <- final_labels

Back to kwic

As before, we could enrich the KWIC object with more metadata.

Y2 <- enrich(Y2, s_attribute = "year")

We use this metadata to visualize the distribution of the sentiment connotations of the word ‘sanctions’ over time. It seems reasonable to expect the series to lean towards the negative, because we omitted the ‘neutral’ class in the model.

library(dplyr)  # needed for group_by() and summarize() below

ts_df_all <- Y2@stat[,c("year", "sentiment")]

# recode the sentiment labels to numeric values (+1 positive, -1 negative, 0 otherwise)
ts_df_all[,"sentiment"] <- ifelse(ts_df_all$sentiment == "positive", 1, ifelse(ts_df_all$sentiment == "negative", -1, 0))

# group by year and compute the yearly net sentiment
df_manually_annotated_all <- ts_df_all %>%
  group_by(year) %>%
  summarize(net_sent = mean(sentiment))

Visualize it again!