class: center, middle, inverse, title-slide

.title[
# Cooking with GermaParl
]
.date[
### 2023-12-14
]

---

# Purpose and Motivation

* GermaParl2 comprises **rich structural annotation** on the level of protocols (such as the date or the legislative period) and on the level of speakers (such as a speaker's name or parliamentary group)
* as seen in previous cookbooks, these can be used to **create meaningful subcorpora** for substantive analysis
* beyond that, the corpus also contains **annotation below the level of speeches** in the form of paragraph and sentence annotations
* sentences can provide natural **units of analysis** with semantic meaning (for a comprehensive discussion see Däubler et al. 2012)
* sentence annotations can be used for a variety of **use cases**

---

# Encoding of Sentences

* sentences are annotated using **Stanford CoreNLP** (https://stanfordnlp.github.io/CoreNLP/)
* sentences are encoded as the **structural attribute** `s` in GermaParl2
* in contrast to other annotations in GermaParl2, the sentence annotation **does not have values**; it describes regions in terms of start and end positions of sentences
* `polmineR` points out the missing values when called with `s_attributes()`:


```r
s_attributes("GERMAPARL2", "s")
```

```
## ! s-attribute `s` does not have values, returning NA
```

```
## [1] NA
```

---

# Sentences and the tree structure


```r
corpus("GERMAPARL2") %>%
  polmineR::tree_structure()
```

```
## protocol [lp│no│date│year│url│filetype]
## |
## └─ speaker [who│name│party│parlgroup│role]
##    |
##    └─ p [type]
##       |
##       └─ s
##          |
##          └─ ne [type]
```

---

# Splitting Objects into Sentences

`polmineR` makes it easy to split a (sub)corpus into sentences


```r
sentences <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-12-14") |>
  split(s_attribute = "s", values = FALSE)
```

* the `values` argument of `split()` makes the missing **values** explicit, but this is not strictly necessary
* the output is a bundle of subcorpora, each containing a single sentence
* splitting by sentences can also be done for entire corpora with sentence annotation (caution: GermaParl2 is quite large)

---

# Splitting Objects into Sentences

* subcorpus bundles can be used as usual
* one example would be to decode the sentences as strings in their word order for further analysis
* this could be useful for word embeddings or classification tasks which rely on word order


```r
sentences_ts <- get_token_stream(sentences)
```

* the sentence annotation is not always perfect though:


```r
sentences_ts[[693]]
```

```
## [1] "—"        "Ich"      "schließe" "die"      "23"       "."
```

```r
sentences_ts[[694]]
```

```
## [1] "Sitzung"    "des"        "Deutschen"  "Bundestags" "."
```

---

# Sentence-Term-Matrices

* the sentence bundle can also be used as input to create a **Document-Term-Matrix** (in this case a Sentence-Term-Matrix)
* potentially useful for **machine learning approaches** which rely on a **Bag-of-Words** representation of sentences
* examples: sentence similarity, weighting of terms


```r
dtm <- polmineR::as.DocumentTermMatrix(sentences, p_attribute = "word")
```

---

# Sentence-Term-Matrices


```r
tm::inspect(dtm)
```

```
## <<DocumentTermMatrix (documents: 695, terms: 3255)>>
## Non-/sparse entries: 13796/2248429
## Sparsity           : 99%
## Maximal term length: 32
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs     ,  . daß den der des die in und zu
##   36563 38  1   0   0   0   0   2  0   0  0
##   36589  8  1   1   1   4   0   2  3   2  1
##   36619  9  1   2   3  12   0   2  2   2  2
##   36643  7  1   0   0   1   1   2  1   0  1
##   36674  8  1   3   1   3   0   4  3   4  1
##   36695 10  1   4   0   3   1   4  1   3  0
##   36717  5  1   3   0   2   1   0  2   1  1
##   36733  5  1   0   1   2   2   2  2   3  0
##   36777  5  1   1   0   4   2   7  1   4  0
##   36787  6  1   1   1   4   0   0  3   2  1
```

---

# Using Sentences as Context Windows

* the boundaries of sentences can be used to define **context windows** around query terms
* this can be useful to limit the analysis to relevant context words or to identify meaningful multi-word query terms
* `polmineR` provides two ways to make use of the sentence annotation in these scenarios:

#### 1) Sentence Annotation as a `boundary`

* the maximum number of tokens in the context window is determined by the values of `left` and `right`, but the context does not extend beyond the boundaries of a sentence


```r
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie", boundary = "s", left = 20, right = 20)
```

---

# Using Sentences as Context Windows

#### 2) Sentence Annotation as Context

* the context window is defined by the **structural attribute** passed to `region` (here `s`) and a number of sentences passed to `left` and `right`


```r
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie", region = "s", left = 0, right = 0)
```

* what is passed to `left`
and `right` determines **additional context** in terms of sentences `s`
* i.e. with `left = 0` and `right = 0`, the context window consists of just the sentence containing the query match

---

# Using Sentences as Context Windows

* changing the values of `left` and `right` to 1 adds one additional sentence on each side as context


```r
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie", region = "s", left = 1, right = 1)
```

* this is equivalent to the following syntax:


```r
corpus("GERMAPARL2") |>
  kwic(query = "Demokratie", left = c("s" = 1), right = c("s" = 1))
```

---

# Using Sentences as Context Windows

* this also applies to values passed to other parameters such as `positivelist` and `stoplist`:


```r
corpus("GERMAPARL2") |>
  kwic(
    query = "Demokratie",
    region = "s",
    left = 0,
    right = 0,
    positivelist = "Krise"
  )
```

```
## ... filtering by positivelist
```

```
## ... number of hits dropped due to positivelist: 37002
```

```
## ... update count statistics for slot cpos
```

**Note:** Sentences which contain a query term more than once show up in the output of `kwic()` more than once, i.e. once per match

---

# Using Sentences in CQP Queries

* as noted in the CQP manual, "most linguistic queries should include the restriction `within s` to avoid crossing sentence boundaries" (https://cwb.sourceforge.io/files/CQP_Manual/4_2.html)
* this can be achieved with the syntax used in the following query (results on the next slide)


```r
sc <- corpus("GERMAPARL2") |>
  subset(protocol_lp == 15)

count(sc,
  query = '"Bundesministerium.*" []{1,5} [xpos = "NN"] within s',
  cqp = TRUE,
  breakdown = TRUE
)
```

* note that this can be computationally expensive and, depending on the use case, the differences are subtle

---

# Using Sentences in CQP Queries

<table>
<caption>First Six Query Matches in GermaParl2, 15th Legislative Period</caption>
<thead>
<tr>
<th style="text-align:left;"> query </th>
<th style="text-align:left;"> match </th>
<th style="text-align:right;"> count </th>
<th style="text-align:right;"> share </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministeriums für Wirtschaft </td>
<td style="text-align:right;"> 84 </td>
<td style="text-align:right;"> 6.78 </td>
</tr>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministeriums für Verkehr </td>
<td style="text-align:right;"> 71 </td>
<td style="text-align:right;"> 5.73 </td>
</tr>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministeriums des Innern </td>
<td style="text-align:right;"> 67 </td>
<td style="text-align:right;"> 5.41 </td>
</tr>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministeriums der Finanzen </td>
<td style="text-align:right;"> 66 </td>
<td style="text-align:right;"> 5.33 </td>
</tr>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministeriums für Gesundheit </td>
<td style="text-align:right;"> 66 </td>
<td style="text-align:right;"> 5.33 </td>
</tr>
<tr>
<td style="text-align:left;"> "Bundesministerium.*" []{1,5} [xpos = "NN"] within s </td>
<td style="text-align:left;"> Bundesministerium des Innern </td>
<td style="text-align:right;"> 61 </td>
<td style="text-align:right;"> 4.92 </td>
</tr>
</tbody>
</table>

---

# Sampling at the Sentence Level


```r
packageVersion("polmineR")
```

```
## [1] '0.8.9.9001'
```

```r
demsent_ids <- corpus("GERMAPARL2") %>%
  hits(query = "Demokratie", s_attribute = "s", decode = FALSE) %>%
  as.data.frame() %>%
  pull(s)

demsents <- corpus("GERMAPARL2") %>%
  subset(s %in% !!demsent_ids) %>%
  split(s_attribute = "s") %>%
  get_token_stream(p_attribute = "word", collapse = " ")
```

* write it to disk and use it as input for ... whatever comes next!
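
Writing the sampled sentences to disk can be sketched with base R's `writeLines()`, one sentence per line. So that the sketch runs without the corpus, a small stand-in vector mimics the structure of `demsents` (a character vector of sentences, named by sentence id); the sentence contents and the file name are hypothetical.


```r
# stand-in for the `demsents` object created above:
# a character vector of sentences, named by sentence id (illustrative values)
demsents <- c(
  "36563" = "Die Demokratie ist keine Selbstverständlichkeit .",
  "36589" = "Ohne Demokratie gibt es keine Freiheit ."
)

# write one sentence per line to a plain text file (hypothetical file name)
outfile <- file.path(tempdir(), "demokratie_sentences.txt")
writeLines(demsents, con = outfile)

# read the file back in, e.g. as input for a downstream task
readLines(outfile)
```

From there, the file can serve as input for embedding training, annotation tools, or any other downstream pipeline.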
---

# References

Däubler, T., Benoit, K., Mikhaylov, S., & Laver, M. (2012). Natural Sentences as Valid Units for Coded Political Texts. British Journal of Political Science, 42(4), 937–951. http://www.jstor.org/stable/23274173