A set of LDA topic models is part of the Zenodo release of GermaParl (k
between 100 and 450). These topic models can be downloaded using
germaparl_download_lda and loaded using germaparl_load_lda.
germaparl_download_lda(
  k = c(100L, 150L, 175L, 200L, 225L, 250L, 275L, 300L, 350L, 400L, 450L),
  doi = "10.5281/zenodo.3742113",
  data_dir,
  sample = FALSE,
  verbose = TRUE
)
germaparl_load_lda(
  k,
  registry_dir = cwbtools::cwb_registry_dir(),
  verbose = TRUE,
  sample = FALSE
)
k
A numeric or integer vector, the number of topics of the topic model. Multiple values can be provided to download several topic models at once.
doi
The DOI of GermaParl at Zenodo.
data_dir
The data directory with the binary files of the GERMAPARL corpus. If missing, the directory will be guessed using the function cwbtools::cwb_corpus_dir.
sample
A logical value, if TRUE, use the GERMAPARLSAMPLE corpus rather than GERMAPARL.
verbose
A logical value, whether to output progress messages.
registry_dir
The registry directory where the registry file for GERMAPARL is located.
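Because k accepts either numeric or integer values, it can help to normalise and check the argument before starting a download. The following is a minimal sketch, not part of GermaParl: the helper check_k is hypothetical, and it assumes that the default value of k shown in the usage section lists every model released for the full corpus.

```r
# Values of k taken from the default shown in the usage section.
# Assumption: these are the only models released for the full corpus.
available_k <- c(100L, 150L, 175L, 200L, 225L, 250L, 275L, 300L, 350L, 400L, 450L)

# Hypothetical helper (not part of GermaParl): coerce k to integer and
# stop early if a requested value is not among `available`.
check_k <- function(k, available = available_k) {
  stopifnot(is.numeric(k), length(k) >= 1L)
  k <- as.integer(k)
  unknown <- setdiff(k, available)
  if (length(unknown) > 0L) {
    stop("no topic model listed for k = ", paste(unknown, collapse = ", "))
  }
  k
}

# Numeric input is accepted and returned as an integer vector
k_checked <- check_k(c(200, 250))
```

For the sample corpus a different set of values applies (the example on this page uses k = 30), so `available` would have to be passed explicitly in that case.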
The function germaparl_download_lda will (invisibly) return TRUE if the
operation has been successful and FALSE if not.
The function germaparl_download_lda will download an rds file that will be
stored in the data directory of the GermaParl corpus.
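To check whether a model has already been stored, one can build the expected file path and test for its existence. This is a sketch under an assumption: the naming scheme "<corpus>_lda_<k>.rds" is inferred from the file name germaparlsample_lda_30.rds in the example output on this page and is not documented. The helper lda_rds_path is hypothetical.

```r
# Hypothetical helper (not part of GermaParl): expected path of a stored model.
# Assumption: files are named "<corpus>_lda_<k>.rds", as the file name
# "germaparlsample_lda_30.rds" in the example output suggests.
lda_rds_path <- function(data_dir, k, sample = FALSE) {
  corpus <- if (isTRUE(sample)) "germaparlsample" else "germaparl"
  file.path(data_dir, sprintf("%s_lda_%d.rds", corpus, as.integer(k)))
}

# A temporary directory stands in for the corpus data directory
path <- lda_rds_path(tempdir(), k = 30, sample = TRUE)

# TRUE only once the model has been downloaded to that directory
already_there <- file.exists(path)
```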
germaparl_load_lda will load a topic model into memory.
The function germaparl_load_lda will return an LDA_Gibbs topic model if the
topic model for k is present, and NULL if the topic model has not yet been
downloaded.
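Since loading yields NULL when a model is missing and the download reports success as TRUE or FALSE, the two steps can be combined into a fallback. The sketch below is one possible pattern, not something GermaParl provides: load_or_download is a hypothetical helper that receives the loading and downloading functions as arguments, so the logic can be tried out with stand-ins.

```r
# Hypothetical helper (not part of GermaParl): try to load a model and
# download it first if the loader returns NULL.
load_or_download <- function(k, load_fun, download_fun) {
  model <- load_fun(k)
  if (is.null(model)) {
    ok <- download_fun(k)
    if (!isTRUE(ok)) {
      stop("download of topic model for k = ", k, " failed")
    }
    model <- load_fun(k)
  }
  model
}

# Stand-ins that mimic the documented return values: NULL while the
# model is absent, invisible TRUE after a successful download.
store <- new.env()
fake_load <- function(k) store[[as.character(k)]]
fake_download <- function(k) {
  assign(as.character(k), "model placeholder", envir = store)
  invisible(TRUE)
}

model <- load_or_download(30L, fake_load, fake_download)
```

With the package installed, the stand-ins could be replaced by thin wrappers such as `function(k) germaparl_load_lda(k = k)` and `function(k) germaparl_download_lda(k = k)`, adding the data_dir, registry_dir and sample arguments as needed.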
# This example assumes that the directories used by the CWB do not yet exist, so
# temporary directories are created.
cwb_dirs <- cwbtools::create_cwb_directories(prefix = tempdir(), ask = FALSE)
#> ── Create CWB directories ──────────────────────────────────────────────────────
#> ℹ Using existing directory
#> /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD as parent directory
#> for registry directory and the corpus directory.
#> ℹ registry directory /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/registry already exists
#> ℹ corpus directory /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/indexed_corpora already exists
#> ✔ environment variable `CORPUS_REGISTRY` set as: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/registry
samplemode <- TRUE
corpus_id <- "GERMAPARLSAMPLE" # for full corpus: corpus_id <- "GERMAPARL"
dir.create(file.path(cwb_dirs[["corpus_dir"]], tolower(corpus_id)))
#> Warning: '/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/indexed_corpora/germaparlsample' already exists
# Download topic model
germaparl_download_lda(
  k = 30, # k = 250 recommended for full GERMAPARL corpus
  data_dir = file.path(cwb_dirs[["corpus_dir"]], tolower(corpus_id)),
  sample = samplemode
)
#> ℹ get Zenodo record for doi 10.5281/zenodo.3823245
#> ✔ get Zenodo record for doi 10.5281/zenodo.3823245 ... done
#>
#> ℹ starting to download LDA model
#> ℹ check md5 checksum for downloaded file germaparlsample_lda_30.rds
#> ✔ check md5 checksum for downloaded file germaparlsample_lda_30.rds ... done
#>
lda <- germaparl_load_lda(
  k = 30L,
  registry_dir = cwb_dirs[["registry_dir"]],
  sample = samplemode
)
#> ... loading topicmodel for k = 30
#> Loading required package: topicmodels
lda_terms <- topicmodels::terms(lda, 10)
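The result of topicmodels::terms(lda, 10) is a character matrix with one column per topic, which can be collapsed into a short label per topic. The sketch below uses a small toy matrix in place of the real result so that it runs without the corpus; the object toy_terms and the terms in it are made up for illustration.

```r
# Toy stand-in for the matrix returned by topicmodels::terms(lda, 3):
# one column per topic, one row per rank. The terms are invented.
toy_terms <- matrix(
  c("haushalt", "steuer", "euro",
    "schule", "bildung", "lehrer"),
  nrow = 3,
  dimnames = list(NULL, c("Topic 1", "Topic 2"))
)

# Collapse the top terms of each topic into one label per topic
topic_labels <- apply(toy_terms, 2, paste, collapse = ", ")
```

The same call to apply should work on lda_terms from the example above, giving one comma-separated string of top terms for each of the 30 topics.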