September 29, 2019


9:45 - 10:15 Getting Started

Who are we | background and expectations | outline of the workshop

10:15 - 10:30 Implementing validation as a technological frontier

The perils of quantification, existing software and the polmineR workflows

10:30 - 11:00 Coffee Break

11:00 - 13:00 Validating counts and co-occurrences

Scenarios and workflows: Validating counts, validating co-occurrences

13:00 - 14:00 Lunch Break

14:00 - 15:30 Validation Reloaded

Scenarios and workflows: Validating sentiment analysis, validating topic models

15:30 - 16:00 Where to go from here

Computers digesting text

  • Natural Language Processing (NLP)
  • Big Data
  • Data Mining | Text Mining
  • Machine Learning (ML)
  • Artificial Intelligence (AI)
  • eHumanities | digital humanities
  • Corpus linguistics and computational linguistics, information science, statistics, …

disciplinary and methodological variety!

The PolMine Project

  • Research
    On migration & integration policy: MigTex, MIDEM, PopParl
  • Data
    Corpora of plenary protocols, newspaper articles, …
  • Code
    open source R packages for text analysis, at CRAN & GitHub
  • Tutorials
    Using Corpora in Social Science Research / UCSSR
  • Centre
    CLARIN Centre category C, prospectively part of NFDI

Learn more:

The datafication of text

“The emergence of the computer has made it feasible for social and behavioural scientists to make a fresh start on content analysis. The vast potentialities of content analysis, though foreseen for some years, have been poorly realized, owing chiefly to the onerous task of scanning texts and processing data. The latter operation – data processing – has been successfully improved, and there is a promise of automatic scanners that, when appropriately joined with panels of human judges, will accelerate the turning of raw records into data.” (Stone et al. 1966)

From text to numbers

  • from computer-assisted content analysis to “text as data”
    scaling party positions as a driver (wordscore and wordfish)
  • joyful blasphemy against reading …
    “[…] because it treats words simply as data rather than requiring any knowledge of their meaning as used in the text, our word scoring method works irrespective of the language in which the texts are written. In other words, while our method is designed to analyse the content of a text, it is not necessary for an analyst using the technique to understand, or even read, the texts to which the technique is applied. The primary advantage of this feature is that the technique can be applied to texts in any language.” (Laver, Benoit & Garry 2003)
  • common methods and applications
    • sentiment analyses
    • topic modelling (unsupervised learning)
    • classification (cf. Comparative Agendas Project / CAP)
  • “Validate, validate, validate” (Grimmer et al. 2013)
    An (almost) unheard plea

The idea of “distant reading”

“[…] the trouble with close reading […] is that it necessarily depends on an extremely small canon. […] you invest so much in individual texts only if you think that very few of them really matter. […] if you want to look beyond the canon […], close reading will not do it. […] At bottom, it’s a theological exercise – very solemn treatment of very few texts taken very seriously – whereas what we really need is a little pact with the devil: we know how to read texts, so now let’s learn how not to read them. Distant reading: where distance, let me repeat, is a condition of knowledge. It allows you to focus on units that are much smaller or much larger than the text: devices, themes, types – or genres and systems. And if, between the very small and the very large, the text itself disappears, well, this is one of the cases where one can justifiably say, Less is more. If we want to understand the system in its entirety, we must accept losing something. We always pay a price for theoretical knowledge; concepts are abstract, are poor. But it’s precisely this poverty that makes it possible to handle them, and therefore to know. This is why less is actually more.” (Moretti [2000] 2013: 49)

Why and how text matters

  • The social sciences and the “linguistic turn”
    • An evolving theoretical movement
    • analysing discourse
    • analysing frames
    • analysing narratives
  • Methodological development
    • persistence of paper & pencil-analyses
    • computer-assisted qualitative analysis (QDA, see MAXQDA, Atlas.ti)
    • digital humanities / eHumanities
    • Visual analytics
  • Varieties of “distant reading” (Moretti 2000)
    • “blended reading” (Stulpe, Lemke 2015)
    • “scalable reading” (Weitin 2017)

Where we stand

A people’s corpus miner for quanlification

intersubjectivity! The requirement to integrate qualitative and quantitative inquiry (“quanlification”)

equality! Technical restrictions make “quanlification” the privilege of well-funded projects

fraternité! An open source library offering a basic vocabulary for quanlification, to overcome these technological restrictions

Code: A Suite of R packages

  • Corpus analysis using R and the Corpus Workbench (CWB)
    • polmineR: Basic vocabulary for corpus analysis (portability, performance, open source, documented, usability, theory-based)
    • RcppCWB: Wrappers for the C code of the Corpus Workbench
    • cwbtools: Tools for creating and managing CWB indexed corpora
  • Data preparation and dissemination
    • frappp: Framework for Parsing Plenary Protocols
    • GermaParl: R package for disseminating the GermaParl corpus
    • UNGA: United Nations General Assembly
  • Tools to combine quantity and quality
    • annolite: Lightweight annotation tool
    • gradget: Annotation of (three-dimensional) cooccurrence graphs
    • topicanalysis: Integration of quantitative/qualitative approaches with topic models

polmineR - a basic vocabulary

  • Corpora and subcorpora
    • corpus objects: corpus()
    • subsetting corpora: partition() / subset()
  • Quantification
    • counting: hits(), count(), dispersion() (and size())
    • cooccurrences: cooccurrences(), Cooccurrences()
    • feature extraction: features()
    • term-document-matrices: as.sparseMatrix(), as.TermDocumentMatrix()
  • Qualitative analysis
    • Keywords-in-context/concordances: kwic()
    • full text (of a subcorpus): get_token_stream(), as.markdown(), as.html(), read()
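
A minimal session illustrating this vocabulary might look as follows. This is a sketch assuming the REUTERS demo corpus shipped with the polmineR package for tutorial purposes; the s-attribute name "places" is taken from that demo corpus and may differ for other corpora:

```r
library(polmineR)
use("polmineR")  # activate the demo corpora included in the polmineR package

# Quantification: counting a query across the corpus
count("REUTERS", query = "oil")

# Qualitative analysis: keywords-in-context (concordances) for the same query
kwic("REUTERS", query = "oil")

# Subsetting: build a subcorpus based on a structural attribute
reuters <- corpus("REUTERS")
saudi <- subset(reuters, grepl("saudi-arabia", places))
size(saudi)
```

Moving between count(), kwic() and subset() on the same objects is exactly the quantitative–qualitative back-and-forth ("quanlification") the package is designed for.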


  • Prerequisites: Windows / macOS / Linux, installation of R and RStudio.

  • Install polmineR (and RcppCWB) from CRAN, get cwbtools from GitHub.

  • Easy installation of corpora that are disseminated via R packages.
  • Five lines of code, and you are ready to work with fairly large corpora.
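
Concretely, the setup described above might look like this. A sketch: the GitHub repository path "PolMine/cwbtools" and the corpus package name "GermaParl" are assumptions based on the project names mentioned in these slides:

```r
install.packages("polmineR")                 # pulls in RcppCWB from CRAN
install.packages("remotes")
remotes::install_github("PolMine/cwbtools")  # cwbtools from GitHub

library(polmineR)
use("GermaParl")  # activate a corpus disseminated via an R package
```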

RStudio Server and Shiny

To skip local installation, we use an RStudio Server hosting polmineR and the UNGA corpus.

… and to avoid having to write or look at code, there is a Shiny server hosting the Shiny GUI that is fairly well hidden in the polmineR package.