
Cooking with GermaParl

2023-12-14


Purpose and Motivation

  • GermaParl2 comprises rich structural annotation at the level of protocols (such as the date or the legislative period) and at the level of speakers (such as a speaker's name or parliamentary group)

  • as seen in previous cookbooks, these can be used to create meaningful subcorpora for substantive analysis

  • beyond that, the corpus contains annotation below the level of speeches, in the form of paragraph and sentence annotation

  • sentences can provide natural units of analysis with semantic meaning (for a comprehensive discussion see Däubler et al. 2012)

  • sentence annotations can be used for a variety of use cases


Encoding of Sentences

  • sentences are annotated using Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/)

  • sentences are encoded as the structural attribute s in GermaParl2

  • in contrast to other annotations in GermaParl2, the sentence annotation does not have values; it only describes regions in terms of the start and end positions of sentences

  • polmineR indicates the missing values when called with s_attributes():

s_attributes("GERMAPARL2", "s")
## ! s-attribute `s` does not have values, returning NA
## [1] NA
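
What a structural attribute without values amounts to can be sketched in base R. The token vector and the region matrix below are made-up toy data, not taken from GermaParl2, and positions are written 1-based to match R indexing (CWB corpus positions start at 0).

```r
# Toy token stream (hypothetical data, not from GermaParl2)
tokens <- c("Die", "Sitzung", "ist", "geschlossen", ".",
            "Ich", "danke", "Ihnen", ".")

# Sentence regions: one row per sentence, start and end position only.
# There is no value column, which is why there is nothing to decode.
regions <- matrix(c(1L, 5L,
                    6L, 9L),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("start", "end")))

# Recover the tokens of each sentence from its region
sentence_tokens <- lapply(seq_len(nrow(regions)), function(i) {
  tokens[regions[i, "start"]:regions[i, "end"]]
})

# Collapse each sentence into a single string
sentence_strings <- vapply(sentence_tokens, paste, character(1), collapse = " ")
```

The sentences are fully described by their boundaries, so any label attached to them would be redundant.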

Sentences and the tree structure

corpus("GERMAPARL2") %>% polmineR::tree_structure()
## protocol [lp│no│date│year│url│filetype]
## |
## └─ speaker [who│name│party│parlgroup│role]
## |
## └─ p [type]
## |
## └─ s
## |
## └─ ne [type]

Splitting Objects into Sentences

polmineR makes it easy to split a (sub)corpus into sentences

sentences <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-12-14") |>
  split(s_attribute = "s", values = FALSE)
  • setting values = FALSE in split() makes explicit that s carries no values; this is not strictly necessary

  • the output is a bundle of subcorpora, each containing a single sentence

  • splitting by sentences also works for entire corpora that have sentence annotation (caution: GermaParl2 is quite large)


Splitting Objects into Sentences

  • subcorpus bundles can be used as usual

  • one example would be to decode the sentences as strings in their word order for further analysis

  • this could be useful for word embeddings or classification tasks which rely on word order

sentences_ts <- get_token_stream(sentences)
  • the sentence annotation is not always perfect though; here the period of the ordinal "23." is mistaken for the end of a sentence:
sentences_ts[[693]]
## [1] "—" "Ich" "schließe" "die" "23" "."
sentences_ts[[694]]
## [1] "Sitzung" "des" "Deutschen" "Bundestags" "."
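
One way to find candidates for this kind of error is to flag sentences that end in a number followed by a period. The following is a minimal sketch on made-up toy sentences; the heuristic is an assumption for illustration, not part of polmineR, and it will also flag correct sentences that genuinely end in a number (such as a year), so hits need manual inspection.

```r
# Toy stand-in for the output of get_token_stream() (hypothetical data)
sentences_ts <- list(
  c("Ich", "beende", "die", "23", "."),
  c("Sitzung", "des", "Deutschen", "Bundestags", "."),
  c("Das", "Wort", "hat", "der", "Abgeordnete", ".")
)

# Flag sentences whose last two tokens are a number and a period
ends_in_number <- vapply(sentences_ts, function(x) {
  n <- length(x)
  n >= 2 && x[n] == "." && grepl("^[0-9]+$", x[n - 1])
}, logical(1))

# Indices of sentences worth checking by hand
suspicious <- which(ends_in_number)
```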

Sentence-Term-Matrices

  • the sentence bundle can also be used as input to create a Document-Term-Matrix (in this case a sentence-term-matrix)
  • potentially useful for machine learning approaches which rely on a Bag-of-Words representation of sentences
  • examples: Sentence Similarity, Weighting of Terms
dtm <- polmineR::as.DocumentTermMatrix(sentences, p_attribute = "word")
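
The two examples named above, weighting of terms and sentence similarity, can be sketched in base R on a small toy matrix. The counts are hypothetical, and the tf-idf variant shown (idf as the natural log of N over document frequency) is only one of several common definitions.

```r
# Toy sentence-term matrix: rows are sentences, columns are terms (hypothetical counts)
stm <- matrix(c(2, 1, 0, 0,
                1, 1, 0, 0,
                0, 0, 3, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(Docs = c("s1", "s2", "s3"),
                              Terms = c("Demokratie", "Freiheit",
                                        "Haushalt", "Steuer")))

# Weighting of terms: tf-idf with idf = log(N / document frequency)
doc_freq <- colSums(stm > 0)
idf <- log(nrow(stm) / doc_freq)
tfidf <- sweep(stm, 2, idf, `*`)

# Sentence similarity: cosine similarity between the rows of raw counts
row_norms <- sqrt(rowSums(stm^2))
cosine <- (stm %*% t(stm)) / (row_norms %o% row_norms)
```

For a real DocumentTermMatrix, tm::weightTfIdf() applies a related weighting directly; note that it uses a base-2 logarithm and normalizes by document length by default, so its values differ from this sketch.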

Sentence-Term-Matrices

tm::inspect(dtm)
## <<DocumentTermMatrix (documents: 695, terms: 3255)>>
## Non-/sparse entries: 13796/2248429
## Sparsity           : 99%
## Maximal term length: 32
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs     , . daß den der des die in und zu
##   36563 38 1   0   0   0   0   2  0   0  0
##   36589  8 1   1   1   4   0   2  3   2  1
##   36619  9 1   2   3  12   0   2  2   2  2
##   36643  7 1   0   0   1   1   2  1   0  1
##   36674  8 1   3   1   3   0   4  3   4  1
##   36695 10 1   4   0   3   1   4  1   3  0
##   36717  5 1   3   0   2   1   0  2   1  1
##   36733  5 1   0   1   2   2   2  2   3  0
##   36777  5 1   1   0   4   2   7  1   4  0
##   36787  6 1   1   1   4   0   0  3   2  1

Using Sentences as Context Windows

  • the boundaries of sentences can be used to define context windows of query terms

  • this can be useful to limit the analysis to relevant context words or to identify meaningful multi-word query terms

  • polmineR provides two ways to make use of the sentence annotation in these scenarios:

1) Sentence Annotation as a boundary:

  • the maximum number of tokens in the context window is set by left and right, but the context never extends across a sentence boundary
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie",
       boundary = "s",
       left = 20,
       right = 20)
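
The effect of a boundary can be illustrated in base R. This is a minimal sketch on made-up toy data; the function context_with_boundary() is a hypothetical helper for illustration and does not reproduce polmineR's implementation.

```r
# Toy data (hypothetical): token stream and a sentence id per token
tokens <- c("Wir", "brauchen", "mehr", "Demokratie", ".",
            "Das", "ist", "klar", ".")
sentence_id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)

# Window of up to `left`/`right` tokens around the hit,
# clipped to the sentence that contains the hit
context_with_boundary <- function(hit, left, right) {
  window <- seq(max(1, hit - left), min(length(tokens), hit + right))
  window[sentence_id[window] == sentence_id[hit]]
}

hit <- which(tokens == "Demokratie")

# A wide window is cut off at the end of the first sentence
wide <- tokens[context_with_boundary(hit, left = 20, right = 20)]

# A narrow window is unaffected by the boundary
narrow <- tokens[context_with_boundary(hit, left = 1, right = 1)]
```

The sketch returns the whole window including the query term, whereas kwic() reports left context, node and right context separately.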

Using Sentences as Context Windows

2) Sentence Annotation as Context

  • the context window is defined in units of the structural attribute passed as region (here s); left and right give the number of sentences added on each side
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie",
       region = "s",
       left = 0,
       right = 0)
  • the values of left and right determine the additional context, counted in sentences (s)

  • i.e. if left = 0 and right = 0, the context window comprises only the sentence that contains the query term
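
Counting context in sentences rather than tokens can be sketched in base R as follows. The data are made up and context_by_region() is a hypothetical helper for illustration, not polmineR's implementation.

```r
# Toy data (hypothetical): three sentences with a sentence id per token
tokens <- c("Das", "ist", "wichtig", ".",
            "Wir", "brauchen", "mehr", "Demokratie", ".",
            "Das", "ist", "klar", ".")
sentence_id <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)

# Positions of the sentence containing the hit,
# plus `left`/`right` additional sentences on each side
context_by_region <- function(hit, left, right) {
  s_hit <- sentence_id[hit]
  which(sentence_id >= s_hit - left & sentence_id <= s_hit + right)
}

hit <- which(tokens == "Demokratie")

# left = 0, right = 0: only the sentence with the query term
same_sentence <- tokens[context_by_region(hit, left = 0, right = 0)]

# left = 1, right = 1: one neighbouring sentence on each side
with_neighbours <- tokens[context_by_region(hit, left = 1, right = 1)]
```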


Using Sentences as Context Windows

  • changing the values of left and right to 1 adds one additional sentence as context
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie",
       region = "s",
       left = 1,
       right = 1)
  • this is equivalent to the following syntax:
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie",
       left = c("s" = 1),
       right = c("s" = 1))

Using Sentences as Context Windows

  • the sentence-based context window is also what parameters such as positivelist and stoplist are evaluated against:
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie",
       region = "s",
       left = 0,
       right = 0,
       positivelist = "Krise")
## ... filtering by positivelist
## ... number of hits dropped due to positivelist: 37002
## ... update count statistics for slot cpos

Note: Sentences that contain the query term more than once appear more than once in the output of kwic().


Using Sentences in CQP Queries

  • as noted in the CQP manual, "most linguistic queries should include the restriction within s to avoid crossing sentence boundaries" (https://cwb.sourceforge.io/files/CQP_Manual/4_2.html)

  • this can be achieved with the syntax used in the following query (results on the next slide)

sc <- corpus("GERMAPARL2") |> subset(protocol_lp == 15)
count(sc,
      query = '"Bundesministerium.*" []{1,5} [xpos = "NN"] within s',
      cqp = TRUE,
      breakdown = TRUE)
  • note that this can be computationally expensive and that, depending on the use case, the results differ only subtly from those of the unrestricted query
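
What the within s restriction does to a set of candidate matches can be illustrated in base R. The tokens and candidate spans below are made up, and the sketch only mimics the filtering logic; it is not how CQP evaluates queries internally.

```r
# Toy data (hypothetical): two sentences with a sentence id per token
tokens <- c("Bundesministerium", "der", "Finanzen", ".",
            "Haushalt", "und", "Steuern", ".")
sentence_id <- c(1, 1, 1, 1, 2, 2, 2, 2)

# Candidate matches as start/end positions (assumed for illustration):
# the first stays inside sentence 1, the second runs into sentence 2
candidates <- data.frame(start = c(1, 1), end = c(3, 5))

# `within s`: keep a match only if it starts and ends in the same sentence
within_s <- sentence_id[candidates$start] == sentence_id[candidates$end]
kept <- candidates[within_s, ]

# Decode the surviving matches as strings
kept_strings <- mapply(function(s, e) paste(tokens[s:e], collapse = " "),
                       kept$start, kept$end)
```

Without the restriction, the second candidate ("Bundesministerium der Finanzen . Haushalt") would count as a match even though it spans a sentence break.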

Using Sentences in CQP Queries

First Six Query Matches in GermaParl2, 15th Legislative Period
(query in all rows: "Bundesministerium.*" []{1,5} [xpos = "NN"] within s)

match                              count  share (%)
Bundesministeriums für Wirtschaft     84       6.78
Bundesministeriums für Verkehr        71       5.73
Bundesministeriums des Innern         67       5.41
Bundesministeriums der Finanzen       66       5.33
Bundesministeriums für Gesundheit     66       5.33
Bundesministerium des Innern          61       4.92

Sampling at the sentence level

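library(polmineR)
library(dplyr) # provides %>% and pull()

packageVersion("polmineR")
## [1] '0.8.9.9001'

# collect the s ids of all sentences that contain the query term
demsent_ids <- corpus("GERMAPARL2") %>%
  hits(query = "Demokratie", s_attribute = "s", decode = FALSE) %>%
  as.data.frame() %>%
  pull(s)

# subset to these sentences, split, and decode each one as a single string
demsents <- corpus("GERMAPARL2") %>%
  subset(s %in% !!demsent_ids) %>%
  split(s_attribute = "s") %>%
  get_token_stream(p_attribute = "word", collapse = " ")
  • write the result to disk and use it as input for ... whatsoever!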
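
Drawing a random sample and writing it to disk could look roughly like this. The sketch uses a made-up character vector as a stand-in for demsents; if the real object is a list of strings, it is assumed that unlist() is applied first. The seed, sample size and output path are arbitrary choices for illustration.

```r
# Stand-in for `demsents` (hypothetical toy sentences)
demsents <- c("Demokratie braucht Streit .",
              "Die Demokratie lebt vom Mitmachen .",
              "Ohne Demokratie keine Freiheit .",
              "Demokratie ist kein Zustand .")

# Draw a reproducible random sample of sentences
set.seed(42)
demsents_sample <- sample(demsents, size = 2)

# Write one sentence per line to a temporary file (assumed location)
outfile <- tempfile(fileext = ".txt")
writeLines(demsents_sample, con = outfile)

# Read the file back to check the round trip
round_trip_ok <- identical(readLines(outfile), demsents_sample)
```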

References

Däubler, T., Benoit, K., Mikhaylov, S., & Laver, M. (2012). Natural Sentences as Valid Units for Coded Political Texts. British Journal of Political Science, 42(4), 937–951. http://www.jstor.org/stable/23274173.
