The following chunks of code separate each corpus into speeches, add the five most relevant topics to each speech as an attribute, subset the speeches that are relevant to migration and integration (i.e. whose topic attribute contains one of the identified relevant topic numbers), and decode this sub-corpus so that the subset can be re-encoded as the MigParl corpus.
library(polmineR)
library(cwbtools)
library(data.table)
First, the s-attribute speech must be encoded into the existing corpus. In a second step, the five most relevant topics of each speech are added as a separate s-attribute. We source this functionality from a separate R script.
source("migparl_additional_annotation_tools.R")
topics_per_state <- c(BB = "3|80|85|161|236",
BE = "4|111|141|166|225",
BW = "107|164|215",
BY = "59|129|137|138|159|201",
HB = "54|66|77|89|219",
HE = "22|29|38|122",
HH = "63|77|146|217|250",
MV = "24|56|190",
NI = "82|98|124|138",
NW = "1|34|138",
RP = "10|37|209",
SH = "75|108|115|216",
SL = "108|111|207|213",
SN = "82|156|248",
ST = "48",
TH = "10|28|157|198|202")
# the lazy version of getting 16 regional state abbreviations
corpora <- unique(FederalActorsGermany::speakerData$regional_state)
package_to_use <- "PopParl"
suppressMessages(use(package_to_use, verbose = FALSE))
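The loop that follows selects speeches by matching the topic numbers of a state against the topics attribute of each speech. The helper topic_regexR() comes from the sourced script and is not shown here, so the following is only a minimal sketch of the idea under assumptions: the function name make_topic_regex() is hypothetical, and the topics attribute is assumed to be a pipe-delimited string of topic numbers. The point it illustrates is that the pattern has to match whole topic numbers, so that topic 1 does not accidentally match 138.

```r
# Hypothetical helper (not the sourced topic_regexR()): build a regex that
# matches a pipe-delimited topics string containing at least one of the
# given topic numbers as a complete entry.
make_topic_regex <- function(topics) {
  ids <- strsplit(topics, "|", fixed = TRUE)[[1]]
  # an entry must be preceded by the string start or "|" and
  # followed by "|" or the string end
  sprintf("^(.*\\|)?(%s)(\\|.*)?$", paste(ids, collapse = "|"))
}

rx <- make_topic_regex("1|34|138")

# Assumed format of the topics attribute: five topic numbers per speech
speech_topics <- c("12|138|7|90|201", "12|341|7|90|201", "1|5|6|7|8")
is_relevant <- grepl(rx, speech_topics)
```

Under these assumptions, the first and third strings are flagged as relevant, while the second is not, because 341 contains 34 and 1 only as substrings.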
i <- 1
lapply(
corpora,
function(corpus) {
message(paste("######", i, "-", corpus, "######"))
starttime <- Sys.time()
message(paste("Start time: ", starttime))
# read corpus' topic model to get topic list per speech ---------------------
model <- paste0("~/lab/gitlab/migparl_lda_models/models/", "lda_", corpus, "_250.Rdata")
lda <- readRDS(model)
# annotate speeches ---------------------------------------------------------
message("... annotate speeches")
migparl_add_s_attribute_speech(corpus, package = package_to_use)
suppressMessages(use(package_to_use, verbose = FALSE))
# annotate topics -----------------------------------------------------------
message("... annotate topics")
migparl_encode_lda_topics(corpus = corpus, model = lda, package = package_to_use)
suppressMessages(use(package_to_use, verbose = FALSE))
# get correct topic no for corpus -------------------------------------------
# partition by topic
print(as.character(topics_per_state[corpus]))
message("... subset by topic")
# creating topic regex
topic_regex <- topic_regexR(topics = as.character(topics_per_state[corpus]))
TopicPart <- partition(corpus, topics = topic_regex, regex = TRUE)
# metadata
message("... decoding s_attributes")
decode_stream <- polmineR::decode(TopicPart)
decode_stream[, cpos := as.character(as.integer(factor(cpos)) - 1)][, struc := as.character(as.integer(factor(struc)) - 1)][, id := paste0(corpus, decode_stream$id)]
tokenstream_dt <- data.table::copy(decode_stream)
metadata_dt <- decode_stream
metadata_dt[, c("word", "lemma", "pos") := NULL]
metadata_dt <- metadata_dt[,{list(cpos_left = min(as.integer(.SD[["cpos"]])), cpos_right = max(as.integer(.SD[["cpos"]])),
id = unique(.SD[["id"]]),
speaker = unique(.SD[["speaker"]]),
party = unique(.SD[["party"]]),
role = unique(.SD[["role"]]),
lp = unique(.SD[["lp"]]),
session = unique(.SD[["session"]]),
date = unique(.SD[["date"]]),
interjection = unique(.SD[["interjection"]]),
year = unique(.SD[["year"]]),
agenda_item = unique(.SD[["agenda_item"]]),
agenda_item_type = unique(.SD[["agenda_item_type"]]),
speech = unique(.SD[["speech"]]),
topics = unique(.SD[["topics"]]),
regional_state = unique(corpus)
)}, by = "struc"]
filename <- paste0(corpus, "_metadata.csv")
fwrite(metadata_dt, file = paste0("~/lab/tmp/encode/", filename))
rm(metadata_dt, decode_stream)
gc()
# tokenstream
message("... decoding p_attributes")
tokenstream_dt <- tokenstream_dt[, c("word", "pos", "lemma", "id", "cpos")]
filename <- paste0(corpus, "_tokenstream.csv")
fwrite(tokenstream_dt, file = paste0("~/lab/tmp/encode/", filename))
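# update the counter defined outside the function; a plain `<-` would only change a local copy and leave the counter at 1
i <<- i + 1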
rm(tokenstream_dt)
gc()
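# report the elapsed time in fixed units (minutes) so that runs are comparable
elapsed <- difftime(Sys.time(), starttime, units = "mins")
message("... time: ", format(elapsed))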
}
)
## ###### 1 - BW ######
## Start time: 2018-11-14 19:15:32
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): BW
## Corpus name: BW
## Number of loads before reset: 222514
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): BW
## Corpus name: BW
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "107|164|215"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 3.85478859742482
## ###### 1 - BY ######
## Start time: 2018-11-14 19:19:23
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): BY
## Corpus name: BY
## Number of loads before reset: 220852
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): BY
## Corpus name: BY
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "59|129|137|138|159|201"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 4.23554702997208
## ###### 1 - BE ######
## Start time: 2018-11-14 19:23:38
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): BE
## Corpus name: BE
## Number of loads before reset: 170812
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): BE
## Corpus name: BE
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "4|111|141|166|225"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 3.11925282080968
## ###### 1 - BB ######
## Start time: 2018-11-14 19:26:45
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): BB
## Corpus name: BB
## Number of loads before reset: 157034
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): BB
## Corpus name: BB
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "3|80|85|161|236"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 3.02531883716583
## ###### 1 - HB ######
## Start time: 2018-11-14 19:29:46
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): HB
## Corpus name: HB
## Number of loads before reset: 146182
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): HB
## Corpus name: HB
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "54|66|77|89|219"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 2.98795476357142
## ###### 1 - HH ######
## Start time: 2018-11-14 19:32:46
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): HH
## Corpus name: HH
## Number of loads before reset: 197076
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): HH
## Corpus name: HH
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "63|77|146|217|250"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 4.23985215028127
## ###### 1 - HE ######
## Start time: 2018-11-14 19:37:00
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): HE
## Corpus name: HE
## Number of loads before reset: 185906
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): HE
## Corpus name: HE
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "22|29|38|122"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 3.94227606455485
## ###### 1 - NW ######
## Start time: 2018-11-14 19:40:57
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): NW
## Corpus name: NW
## Number of loads before reset: 216384
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): NW
## Corpus name: NW
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "1|34|138"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 4.13380882342656
## ###### 1 - MV ######
## Start time: 2018-11-14 19:45:05
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): MV
## Corpus name: MV
## Number of loads before reset: 185616
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): MV
## Corpus name: MV
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "24|56|190"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 3.56816098690033
## ###### 1 - NI ######
## Start time: 2018-11-14 19:48:39
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): NI
## Corpus name: NI
## Number of loads before reset: 281920
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): NI
## Corpus name: NI
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "82|98|124|138"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 5.75901971260707
## ###### 1 - RP ######
## Start time: 2018-11-14 19:54:24
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): RP
## Corpus name: RP
## Number of loads before reset: 147002
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): RP
## Corpus name: RP
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "10|37|209"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 2.64773242870967
## ###### 1 - SL ######
## Start time: 2018-11-14 19:57:03
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): SL
## Corpus name: SL
## Number of loads before reset: 57690
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): SL
## Corpus name: SL
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "108|111|207|213"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 54.4889371395111
## ###### 1 - ST ######
## Start time: 2018-11-14 19:57:58
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): ST
## Corpus name: ST
## Number of loads before reset: 162754
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): ST
## Corpus name: ST
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "48"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 2.5270025173823
## ###### 1 - SN ######
## Start time: 2018-11-14 20:00:29
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): SN
## Corpus name: SN
## Number of loads before reset: 236798
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): SN
## Corpus name: SN
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "82|156|248"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 3.96782789627711
## ###### 1 - SH ######
## Start time: 2018-11-14 20:04:27
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): SH
## Corpus name: SH
## Number of loads before reset: 217404
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): SH
## Corpus name: SH
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "75|108|115|216"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 3.84502136707306
## ###### 1 - TH ######
## Start time: 2018-11-14 20:08:18
## ... annotate speeches
## ... generating partitions by date
## ... getting matrix with regions for s-attribute: date
## ... generating the partitions
## ... generating speeches
## ... generating names
## ... reordering partitions
## ... running 'cwb-s-encode' to add structural annotation for attribute 'speech'
## Corpus to delete (ID): TH
## Corpus name: TH
## Number of loads before reset: 245770
## Number of loads resetted: 1
## ... annotate topics
## ... getting topic matrix
## ... decoding s-attribute speech
## ... decoding s-attribute: speech
## ... running some sanity checks
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... encoding s-attribute 'topics'
## Corpus to delete (ID): TH
## Corpus name: TH
## Number of loads before reset: 16
## Number of loads resetted: 1
## [1] "10|28|157|198|202"
## ... subset by topic
## ... get encoding: UTF-8
## ... get cpos and strucs
## ... decoding s_attributes
## ... decoding p_attribute word
## ... decoding p_attribute pos
## ... decoding p_attribute lemma
## ... decoding p_attributes
## ... time: 4.91365651686986
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL
##
## [[10]]
## NULL
##
## [[11]]
## NULL
##
## [[12]]
## NULL
##
## [[13]]
## NULL
##
## [[14]]
## NULL
##
## [[15]]
## NULL
##
## [[16]]
## NULL
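The metadata step in the loop above collapses the token-level table into one row per region by grouping on struc and taking the first and last corpus position. A minimal sketch of that grouping on toy data (the table and its values are invented for illustration, and only one metadata column is kept):

```r
library(data.table)

# Toy token-level table: one row per token, struc identifies the region
tokens <- data.table(
  cpos = 0:5,
  struc = c(0L, 0L, 0L, 1L, 1L, 1L),
  word = c("Sehr", "geehrte", "Damen", "Vielen", "Dank", "."),
  speaker = c(rep("A", 3), rep("B", 3))
)

# One row per region: left and right corpus position plus region metadata
regions <- tokens[, .(
  cpos_left = min(cpos),
  cpos_right = max(cpos),
  speaker = unique(speaker)
), by = "struc"]
```

Note that unique() only yields one row per region if the attribute is constant within the region; if it returned several values, the grouped result would contain several rows for that struc.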
Now we have an individual token stream file and metadata file for each regional state. Unfortunately, memory gets scarce quickly, which is why we do not try to keep the token streams in RAM.
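One way to work with the files without holding all token streams in memory is to read them one at a time and keep only a small summary per state. The following is a sketch under assumptions: the temporary directory and the two tiny files are stand-ins created for illustration, not the files written above.

```r
library(data.table)

# Hypothetical stand-ins for the per-state token stream files
tmp_dir <- file.path(tempdir(), "encode_demo")
dir.create(tmp_dir, showWarnings = FALSE)
fwrite(data.table(word = c("a", "b", "c"), id = "BB1", cpos = 0:2),
       file.path(tmp_dir, "BB_tokenstream.csv"))
fwrite(data.table(word = c("d", "e"), id = "BE1", cpos = 0:1),
       file.path(tmp_dir, "BE_tokenstream.csv"))

files <- list.files(tmp_dir, pattern = "_tokenstream\\.csv$", full.names = TRUE)

# Read one file at a time; only the token count survives each iteration,
# the table itself goes out of scope when the function returns
token_counts <- vapply(files, function(f) {
  dt <- fread(f, header = TRUE)
  nrow(dt)
}, integer(1))
names(token_counts) <- sub("_tokenstream\\.csv$", "", basename(files))
```

The same pattern applies to any per-state processing step: as long as each iteration returns a small object, peak memory is bounded by the largest single file rather than by the sum of all files.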