GermaParl2 comprises rich structural annotation on the level of protocols (such as the date or the legislative period) and the level of speakers (such as a speaker's name or parliamentary group)
as seen in previous cookbooks, these can be used to create meaningful subcorpora for substantive analysis
but even beyond that, the corpus contains annotation below the level of speeches in the form of paragraph and sentence annotation
sentences can provide natural units of analysis with semantic meaning (for a comprehensive discussion see Däubler et al. 2012)
sentence annotations can be used for a variety of use cases
sentences are annotated using Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/)
sentences are encoded as the structural attribute s in GermaParl2
in contrast to other annotations in GermaParl, the sentence annotation does not have values; it only describes regions in terms of the start and end positions of sentences
polmineR indicates the missing values when called with s_attributes():
s_attributes("GERMAPARL2", "s")
## ! s-attribute `s` does not have values, returning NA
## [1] NA
corpus("GERMAPARL2") %>% polmineR::tree_structure()
## protocol [lp│no│date│year│url│filetype]
## |
## └─ speaker [who│name│party│parlgroup│role]
## |
## └─ p [type]
## |
## └─ s
## |
## └─ ne [type]
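What an annotation without values amounts to can be sketched with toy data in base R. This is a minimal illustration, not polmineR's internal representation: the token vector, the region table and its column names (`cpos_left`, `cpos_right`) are assumptions made up for the example. Each sentence is fully described by the corpus positions where it starts and ends.

```r
# Toy token stream; corpus positions (cpos) in CWB are zero-based
tokens_toy <- c("Die", "Sitzung", "ist", "geschlossen", ".", "Vielen", "Dank", ".")
cpos_toy <- seq_along(tokens_toy) - 1L

# Hypothetical region table: one row per sentence, no value attached
regions_toy <- data.frame(cpos_left = c(0L, 5L), cpos_right = c(4L, 7L))

# Recover each sentence from its start and end position alone
sentences_toy <- lapply(
  seq_len(nrow(regions_toy)),
  function(i) {
    in_region <- cpos_toy >= regions_toy$cpos_left[i] &
      cpos_toy <= regions_toy$cpos_right[i]
    tokens_toy[in_region]
  }
)

sentences_toy[[2]]
```

The regions carry no label such as a date or a name, which is why asking for the values of s returns NA.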
polmineR makes it easy to split a (sub)corpus into sentences
sentences <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-12-14") |>
  split(s_attribute = "s", values = FALSE)
setting the values argument of split() to FALSE makes the missing values explicit, but this is not strictly necessary
the output is a bundle of subcorpora, each containing a single sentence
splitting by sentences can also be done for entire corpora with sentence annotation (caution: GermaParl2 is quite large, so this takes time and memory)
subcorpus bundles can be used as usual
one example would be to decode the sentences as strings in their word order for further analysis
this could be useful for word embeddings or classification tasks which rely on word order
sentences_ts <- get_token_stream(sentences)
sentences_ts[[693]]
## [1] "—" "Ich" "schließe" "die" "23" "."
sentences_ts[[694]]
## [1] "Sitzung" "des" "Deutschen" "Bundestags" "."
dtm <- polmineR::as.DocumentTermMatrix(sentences, p_attribute = "word")
tm::inspect(dtm)
## <<DocumentTermMatrix (documents: 695, terms: 3255)>>
## Non-/sparse entries: 13796/2248429
## Sparsity           : 99%
## Maximal term length: 32
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs     ,  . daß den der des die in und zu
##   36563 38  1   0   0   0   0   2  0   0  0
##   36589  8  1   1   1   4   0   2  3   2  1
##   36619  9  1   2   3  12   0   2  2   2  2
##   36643  7  1   0   0   1   1   2  1   0  1
##   36674  8  1   3   1   3   0   4  3   4  1
##   36695 10  1   4   0   3   1   4  1   3  0
##   36717  5  1   3   0   2   1   0  2   1  1
##   36733  5  1   0   1   2   2   2  2   3  0
##   36777  5  1   1   0   4   2   7  1   4  0
##   36787  6  1   1   1   4   0   0  3   2  1
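The two uses shown above (sentences as ordered strings, sentences as documents in a document-term matrix) can be mimicked on toy data with base R only. This is a conceptual sketch under assumed inputs, not what polmineR or tm do internally: the tokens and the sentence ids are invented for the example.

```r
# Toy tokens with a sentence id per token (both made up for illustration)
tokens_dtm <- c("die", "Sitzung", "beginnt", ".",
                "die", "Demokratie", "braucht", "die", "Debatte", ".")
sentence_id_dtm <- rep(c(1L, 2L), times = c(4L, 6L))

# One element per sentence, tokens kept in their original word order
sentence_list_toy <- split(tokens_dtm, sentence_id_dtm)

# Collapse each sentence into a single string, e.g. as classifier input
strings_toy <- vapply(sentence_list_toy, paste, character(1), collapse = " ")

# Term counts per sentence: rows are sentences, columns are terms
dtm_toy <- table(doc = sentence_id_dtm, term = tokens_dtm)
dtm_toy["2", "die"]
```

The string representation preserves word order, while the count table discards it; which one is appropriate depends on the downstream task.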
the boundaries of sentences can be used to define context windows of query terms
this can be useful to limit the analysis to relevant context words or to identify meaningful multi-word query terms
polmineR provides two ways to make use of the sentence annotation in these scenarios:
boundary: left and right are given in tokens, but the context does not extend over the boundary of a sentence
corpus("GERMAPARL2") |> kwic(query = "Demokratie", boundary = "s", left = 20, right = 20)
region: the context is the s-attribute s, defined by region, plus a number of sentences given in left and right
corpus("GERMAPARL2") |> kwic(query = "Demokratie", region = "s", left = 0, right = 0)
the values of left and right determine the additional context in terms of sentences s
i.e. if both are 0, the context window comprises only the sentence that contains the query term
setting left and right to 1 adds one additional sentence of context on each side
corpus("GERMAPARL2") |> kwic(query = "Demokratie", region = "s", left = 1, right = 1)
corpus("GERMAPARL2") |> kwic(query = "Demokratie", left = c("s" = 1), right = c("s" = 1))
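The difference between the two kinds of context window can be sketched on toy data in base R. The helper functions `window_boundary()` and `window_region()` are hypothetical and only mimic the behaviour described above; they are not part of polmineR, and the tokens and sentence ids are invented.

```r
# Three toy sentences of four tokens each
tokens_kwic <- c("Wir", "feiern", "heute", ".",
                 "Die", "Demokratie", "lebt", ".",
                 "Das", "ist", "gut", ".")
sentence_id_kwic <- rep(1:3, each = 4)
hit_toy <- which(tokens_kwic == "Demokratie")

# boundary-style: a token window that is cut off at the sentence edges
window_boundary <- function(hit, left, right) {
  idx <- seq(max(1L, hit - left), min(length(tokens_kwic), hit + right))
  tokens_kwic[idx[sentence_id_kwic[idx] == sentence_id_kwic[hit]]]
}

# region-style: the whole sentence plus `left`/`right` neighbouring sentences
window_region <- function(hit, left, right) {
  s <- sentence_id_kwic[hit]
  tokens_kwic[sentence_id_kwic >= s - left & sentence_id_kwic <= s + right]
}

window_boundary(hit_toy, left = 3, right = 3) # stays inside sentence 2
window_region(hit_toy, left = 1, right = 1)   # sentences 1 to 3
```

In the first case the window size is counted in tokens and the sentence only acts as a limit; in the second case the sentence itself is the unit of counting.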
sentence contexts can be filtered with positivelist and stoplist:
corpus("GERMAPARL2") |> kwic(query = "Demokratie", region = "s", left = 0, right = 0, positivelist = "Krise")
## ... filtering by positivelist
## ... number of hits dropped due to positivelist: 37002
## ... update count statistics for slot cpos
Note: Sentences which contain a query term more than once show up in the output of kwic() more than once
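Why this happens, and how duplicates can be removed, is easy to see on toy data in base R. This is only an illustration with invented tokens and sentence ids: every match produces one line of output, so the sentence ids of the matches have to be reduced to unique values if each sentence should appear once.

```r
# Toy data: the first sentence contains the query term twice
tokens_dup <- c("Demokratie", "braucht", "Demokratie", ".",
                "Die", "Demokratie", "lebt", ".")
sentence_id_dup <- rep(c(1, 2), each = 4)

# One hit per occurrence of the query term
hits_toy <- which(tokens_dup == "Demokratie")

# Sentence of each hit: sentence 1 is listed twice
hit_sentences_toy <- sentence_id_dup[hits_toy]

# Reduce to the distinct sentences containing the term
unique_sentences_toy <- unique(hit_sentences_toy)
unique_sentences_toy
```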
as noted in the CQP manual, "most linguistic queries should include the restriction within s to avoid crossing sentence boundaries" (https://cwb.sourceforge.io/files/CQP_Manual/4_2.html)
this can be achieved with the syntax used in the following query (results on the next slide)
sc <- corpus("GERMAPARL2") |> subset(protocol_lp == 15)
count(sc, query = '"Bundesministerium.*" []{1,5} [xpos = "NN"] within s', cqp = TRUE, breakdown = TRUE)
query | match | count | share |
---|---|---|---|
"Bundesministerium.*" []{1,5} [xpos = "NN"] within s | Bundesministeriums für Wirtschaft | 84 | 6.78 |
"Bundesministerium.*" []{1,5} [xpos = "NN"] within s | Bundesministeriums für Verkehr | 71 | 5.73 |
"Bundesministerium.*" []{1,5} [xpos = "NN"] within s | Bundesministeriums des Innern | 67 | 5.41 |
"Bundesministerium.*" []{1,5} [xpos = "NN"] within s | Bundesministeriums der Finanzen | 66 | 5.33 |
"Bundesministerium.*" []{1,5} [xpos = "NN"] within s | Bundesministeriums für Gesundheit | 66 | 5.33 |
"Bundesministerium.*" []{1,5} [xpos = "NN"] within s | Bundesministerium des Innern | 61 | 4.92 |
packageVersion("polmineR")
## [1] '0.8.9.9001'
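library(dplyr) # provides %>% and pull()
# sentences (s regions) that contain the query term
demsent_ids <- corpus("GERMAPARL2") %>%
  hits(query = "Demokratie", s_attribute = "s", decode = FALSE) %>%
  as.data.frame() %>%
  pull(s)
# decode each of these sentences once, as a single string
demsents <- corpus("GERMAPARL2") %>%
  subset(s %in% !!demsent_ids) %>%
  split(s_attribute = "s") %>%
  get_token_stream(p_attribute = "word", collapse = " ")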
Däubler, T., Benoit, K., Mikhaylov, S., & Laver, M. (2012). Natural Sentences as Valid Units for Coded Political Texts. British Journal of Political Science, 42(4), 937–951. http://www.jstor.org/stable/23274173.