Use topicmodels prepared for GermaParl. — germaparl_download

A set of LDA topic models (with k between 100 and 450) is part of the Zenodo release of GermaParl. These topic models can be downloaded with germaparl_download_lda and loaded with germaparl_load_lda.

germaparl_download_lda(
  k = c(100L, 150L, 175L, 200L, 225L, 250L, 275L, 300L, 350L, 400L, 450L),
  doi = "10.5281/zenodo.3742113",
  data_dir,
  sample = FALSE,
  verbose = TRUE
)

germaparl_load_lda(
  k,
  registry_dir = cwbtools::cwb_registry_dir(),
  verbose = TRUE,
  sample = FALSE
)

Arguments

k: A numeric or integer vector, the number of topics of the topicmodel. Multiple values can be provided to download several topic models at once.
doi: The DOI of GermaParl at Zenodo.
data_dir: The data directory with the binary files of the GERMAPARL corpus. If missing, the directory will be guessed using the function cwbtools::cwb_corpus_dir.
sample: A logical value, if TRUE, use GERMAPARLSAMPLE corpus rather than GERMAPARL.
verbose: A logical value, whether to print progress messages.
registry_dir: The registry directory where the registry file for GERMAPARL is located.

Value

The function germaparl_download_lda will (invisibly) return TRUE if the operation has been successful and FALSE if not.

Details

The function germaparl_download_lda will download an rds-file that will be stored in the data directory of the GermaParl corpus.
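As a minimal sketch of where the file ends up, the helper below builds the expected path of a downloaded model and can be used to check whether it is already present. The function name lda_rds_path is hypothetical and not part of the package. The file name pattern is an assumption inferred from the file name germaparlsample_lda_30.rds reported in the example output; the name used for the full corpus is extrapolated from it and may differ.

```r
# Hypothetical helper (not part of the package): build the expected path of
# the rds-file for a topic model with k topics.
# ASSUMPTION: files are named "<corpus>_lda_<k>.rds", as suggested by the
# file name "germaparlsample_lda_30.rds" shown in the example output.
lda_rds_path <- function(k, data_dir, sample = FALSE) {
  corpus <- if (isTRUE(sample)) "germaparlsample" else "germaparl"
  file.path(data_dir, sprintf("%s_lda_%d.rds", corpus, as.integer(k)))
}

# Check whether a model for k = 30 is already in a (temporary) data directory.
path <- lda_rds_path(k = 30, data_dir = tempdir(), sample = TRUE)
file.exists(path)
```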

The function germaparl_load_lda will load a topic model into memory. It returns an LDA_Gibbs topic model if the model for k is present, and NULL if the model has not yet been downloaded.
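Because germaparl_load_lda returns NULL for a model that has not been downloaded, a caller may want to fall back to a download. The following is a sketch of that pattern, not part of the package: the wrapper load_or_download_lda and its arguments load_fn and download_fn are hypothetical, and the defaults assume the two functions are exported by a package named GermaParl. It relies on the default values of data_dir and registry_dir.

```r
# Hypothetical wrapper (not part of the package): load a topic model and
# download it first if it is not yet available locally.
# ASSUMPTION: both functions are exported by a package called GermaParl.
# The defaults are evaluated lazily, so they are only looked up when used.
load_or_download_lda <- function(k, sample = FALSE,
                                 load_fn = GermaParl::germaparl_load_lda,
                                 download_fn = GermaParl::germaparl_download_lda) {
  lda <- load_fn(k = k, sample = sample)
  if (is.null(lda)) {
    # The download function returns TRUE on success and FALSE otherwise.
    ok <- download_fn(k = k, sample = sample)
    if (!isTRUE(ok)) {
      stop("Download of topic model for k = ", k, " failed.")
    }
    lda <- load_fn(k = k, sample = sample)
  }
  lda
}
```

A call such as load_or_download_lda(k = 250L) would then return the model whether or not it had been downloaded before, provided the download succeeds.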

Examples

# This example assumes that the directories used by the CWB do not yet exist, so
# temporary directories are created.
cwb_dirs <- cwbtools::create_cwb_directories(prefix = tempdir(), ask = FALSE)
#> ── Create CWB directories ──────────────────────────────────────────────────────
#> ℹ Using existing directory
#> /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD as parent directory
#> for registry directory and the corpus directory.
#> ℹ registry directory /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/registry already exists
#> ℹ corpus directory /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/indexed_corpora already exists
#> ✔ environment variable `CORPUS_REGISTRY` set as: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/registry

samplemode <- TRUE
corpus_id <- "GERMAPARLSAMPLE" # for full corpus: corpus_id <- "GERMAPARL"

dir.create(file.path(cwb_dirs[["corpus_dir"]], tolower(corpus_id)))
#> Warning: '/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/indexed_corpora/germaparlsample' already exists

# Download topic model
germaparl_download_lda(
  k = 30, # k = 250 recommended for full GERMAPARL corpus
  data_dir = file.path(cwb_dirs[["corpus_dir"]], tolower(corpus_id)),
  sample = samplemode
)
#> ℹ get Zenodo record for doi 10.5281/zenodo.3823245
#> ✔ get Zenodo record for doi 10.5281/zenodo.3823245 ... done
#> 
#> ℹ starting to download LDA model
#> ℹ check md5 checksum for downloaded file germaparlsample_lda_30.rds
#> ✔ check md5 checksum for downloaded file germaparlsample_lda_30.rds ... done
#> 
lda <- germaparl_load_lda(
  k = 30L, registry_dir = cwb_dirs[["registry_dir"]],
  sample = samplemode
)
#> ... loading topicmodel for k = 30
#> Loading required package: topicmodels
lda_terms <- topicmodels::terms(lda, 10)