class: center, middle, inverse, title-slide

.title[
# Cooking with GermaParl
]
.date[
### 2023-12-14
]

---

# Purpose and Motivation

* GermaParl2 comprises **rich structural annotation** on the level of protocols (such as the date or the legislative period) and on the level of speakers (such as a speaker's name or parliamentary group)
* as seen in previous cookbooks, these can be used to **create meaningful subcorpora** for substantive analysis
* beyond that, the corpus also contains **annotation below the level of speeches** in the form of paragraph and sentence annotations
* sentences can provide natural **units of analysis** with semantic meaning (for a comprehensive discussion see Däubler et al. 2012)
* sentence annotations can be used for a variety of **use cases**

---

# Encoding of Sentences

* sentences are annotated using **Stanford CoreNLP** (https://stanfordnlp.github.io/CoreNLP/)
* sentences are encoded as the **structural attribute** `s` in GermaParl2
* in contrast to other annotations in GermaParl2, the sentence annotation **does not have values**; it describes regions in terms of start and end positions of sentences
* `polmineR` points out the missing values when called with `s_attributes()`:


```r
s_attributes("GERMAPARL2", "s")
```

```
## ! s-attribute `s` does not have values, returning NA
```

```
## [1] NA
```

---

# Sentences and the tree structure


```r
corpus("GERMAPARL2") %>%
  polmineR::tree_structure()
```

```
## protocol [lp│no│date│year│url│filetype]
## |
## └─ speaker [who│name│party│parlgroup│role]
##    |
##    └─ p [type]
##       |
##       └─ s
##          |
##          └─ ne [type]
```

---

# Splitting Objects into Sentences

`polmineR` makes it easy to split a (sub)corpus into sentences


```r
sentences <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-12-14") |>
  split(s_attribute = "s", values = FALSE)
```

* the `values` argument of `split()` makes the missing **values** explicit, but this is not strictly necessary
* the output is a bundle of subcorpora, each containing a single sentence
* splitting by sentences can also be done for entire corpora with sentence annotation (caution: GermaParl2 is quite large)

---

# Splitting Objects into Sentences

* subcorpus bundles can be used as usual
* one example would be to decode the sentences as strings in their word order for further analysis
* this could be useful for word embeddings or classification tasks which rely on word order


```r
sentences_ts <- get_token_stream(sentences)
```

* the sentence annotation is not always perfect though:


```r
sentences_ts[[693]]
```

```
## [1] "—"        "Ich"      "schließe" "die"      "23"       "."
```

```r
sentences_ts[[694]]
```

```
## [1] "Sitzung"    "des"        "Deutschen"  "Bundestags" "."
```

---

# Sentence-Term-Matrices

* the sentence bundle can also be used as input to create a **Document-Term-Matrix** (in this case a Sentence-Term-Matrix)
* potentially useful for **machine learning approaches** which rely on a **Bag-of-Words** representation of sentences
* examples: sentence similarity, weighting of terms


```r
dtm <- polmineR::as.DocumentTermMatrix(sentences, p_attribute = "word")
```

---

# Sentence-Term-Matrices


```r
tm::inspect(dtm)
```

```
## <<DocumentTermMatrix (documents: 695, terms: 3255)>>
## Non-/sparse entries: 13796/2248429
## Sparsity           : 99%
## Maximal term length: 32
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs     ,  . daß den der des die in und zu
##   36563 38  1   0   0   0   0   2  0   0  0
##   36589  8  1   1   1   4   0   2  3   2  1
##   36619  9  1   2   3  12   0   2  2   2  2
##   36643  7  1   0   0   1   1   2  1   0  1
##   36674  8  1   3   1   3   0   4  3   4  1
##   36695 10  1   4   0   3   1   4  1   3  0
##   36717  5  1   3   0   2   1   0  2   1  1
##   36733  5  1   0   1   2   2   2  2   3  0
##   36777  5  1   1   0   4   2   7  1   4  0
##   36787  6  1   1   1   4   0   0  3   2  1
```

---

# Using Sentences as Context Windows

* the boundaries of sentences can be used to define **context windows** around query terms
* this can be useful to limit the analysis to relevant context words or to identify meaningful multi-word query terms
* `polmineR` provides two ways to make use of the sentence annotation in these scenarios:

#### 1) Sentence Annotation as a `boundary`

* the maximum number of tokens in the context window is determined by the values of `left` and `right`, but the context does not extend beyond the boundaries of a sentence


```r
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie", boundary = "s", left = 20, right = 20)
```

---

# Using Sentences as Context Windows

#### 2) Sentence Annotation as Context

* the context window is defined by the **structural attribute** passed to `region` (here `s`) and a number of sentences passed to `left` and `right`


```r
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie", region = "s", left = 0, right = 0)
```

* what is passed to `left`
and `right` determines **additional context** in terms of sentences `s`
* i.e. with `left = 0` and `right = 0`, the context window consists of just the sentence containing the query match

---

# Using Sentences as Context Windows

* changing the values of `left` and `right` to 1 adds one additional sentence on each side as context


```r
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie", region = "s", left = 1, right = 1)
```

* this is equivalent to the following syntax:


```r
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie", left = c("s" = 1), right = c("s" = 1))
```

---

# Using Sentences as Context Windows

* this also applies to values passed to other parameters such as `positivelist` and `stoplist`:


```r
corpus("GERMAPARL2") |>
  kwic(
    query = "Demokratie",
    region = "s",
    left = 0,
    right = 0,
    positivelist = "Krise"
  )
```

```
## ... filtering by positivelist
```

```
## ... number of hits dropped due to positivelist: 37002
```

```
## ... update count statistics for slot cpos
```

**Note:** Sentences which contain a query term more than once show up in the output of `kwic()` more than once, i.e. once per match

---

# Using Sentences in CQP Queries

* as noted in the CQP manual, "most linguistic queries should include the restriction `within s` to avoid crossing sentence boundaries" (https://cwb.sourceforge.io/files/CQP_Manual/4_2.html)
* this can be achieved with the syntax used in the following query (results on the next slide)


```r
sc <- corpus("GERMAPARL2") |>
  subset(protocol_lp == 15)

count(sc,
  query = '"Bundesministerium.*" []{1,5} [xpos = "NN"] within s',
  cqp = TRUE,
  breakdown = TRUE
)
```

* note that this can be computationally expensive and, depending on the use case, the differences are subtle

---

# Using Sentences in CQP Queries

<table>
<caption>First Six Query Matches in GermaParl2, 15th Legislative Period</caption>
<thead>
<tr>
<th style="text-align:left;"> query </th>
<th style="text-align:left;"> match </th>
<th style="text-align:right;"> count </th>
<th style="text-align:right;"> share </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministeriums für Wirtschaft </td>
<td style="text-align:right;"> 84 </td>
<td style="text-align:right;"> 6.78 </td>
</tr>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministeriums für Verkehr </td>
<td style="text-align:right;"> 71 </td>
<td style="text-align:right;"> 5.73 </td>
</tr>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministeriums des Innern </td>
<td style="text-align:right;"> 67 </td>
<td style="text-align:right;"> 5.41 </td>
</tr>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministeriums der Finanzen </td>
<td style="text-align:right;"> 66 </td>
<td style="text-align:right;"> 5.33 </td>
</tr>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministeriums für Gesundheit </td>
<td style="text-align:right;"> 66 </td>
<td style="text-align:right;"> 5.33 </td>
</tr>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministerium des Innern </td>
<td style="text-align:right;"> 61 </td>
<td style="text-align:right;"> 4.92 </td>
</tr>
</tbody>
</table>

---

# Sampling at the Sentence Level


```r
packageVersion("polmineR")
```

```
## [1] '0.8.9.9001'
```

```r
demsent_ids <- corpus("GERMAPARL2") %>%
  hits(query = "Demokratie", s_attribute = "s", decode = FALSE) %>%
  as.data.frame() %>%
  pull(s)

demsents <- corpus("GERMAPARL2") %>%
  subset(s %in% !!demsent_ids) %>%
  split(s_attribute = "s") %>%
  get_token_stream(p_attribute = "word", collapse = " ")
```

* write it to disk and use it as input for ... whatever comes next!
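
Writing the sampled sentences to disk can be sketched with base R's `writeLines()`, one sentence per line. So that the sketch runs without the corpus, a small stand-in vector mimics the structure of `demsents` (a character vector of sentences, named by sentence id); the sentence contents and the file name are hypothetical.


```r
# stand-in for the `demsents` object created above:
# a character vector of sentences, named by sentence id (illustrative values)
demsents <- c(
  "36563" = "Die Demokratie ist keine Selbstverständlichkeit .",
  "36589" = "Ohne Demokratie gibt es keine Freiheit ."
)

# write one sentence per line to a plain text file (hypothetical file name)
outfile <- file.path(tempdir(), "demokratie_sentences.txt")
writeLines(demsents, con = outfile)

# read the file back in, e.g. as input for a downstream task
readLines(outfile)
```

From there, the file can serve as input for embedding training, annotation tools, or any other downstream pipeline.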
---

# References

Däubler, T., Benoit, K., Mikhaylov, S., & Laver, M. (2012). Natural Sentences as Valid Units for Coded Political Texts. British Journal of Political Science, 42(4), 937–951. http://www.jstor.org/stable/23274173