A set of LDA topic models (k between 100 and 450) is part of the Zenodo release of GermaParl. These topic models can be downloaded using germaparl_download_lda and loaded using germaparl_load_lda.

germaparl_download_lda(
  k = c(100L, 150L, 175L, 200L, 225L, 250L, 275L, 300L, 350L, 400L, 450L),
  doi = "10.5281/zenodo.3742113",
  data_dir,
  sample = FALSE,
  verbose = TRUE
)

germaparl_load_lda(
  k,
  registry_dir = cwbtools::cwb_registry_dir(),
  verbose = TRUE,
  sample = FALSE
)

Arguments

k

A numeric or integer vector giving the number of topics of the topic model. Multiple values can be provided to download several topic models at once.

doi

The DOI of GermaParl at Zenodo.

data_dir

The data directory with the binary files of the GERMAPARL corpus. If missing, the directory will be guessed using the function cwbtools::cwb_corpus_dir.

sample

A logical value; if TRUE, use the GERMAPARLSAMPLE corpus rather than GERMAPARL.

verbose

A logical value, whether to show progress messages.

registry_dir

The registry directory where the registry file for GERMAPARL is located.
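
As a rough illustration of the k argument, the following sketch checks a requested set of topic numbers against the default values shown in the usage above before anything is downloaded. The helper check_lda_k is hypothetical and not part of the package, and the sketch assumes that the default values of k are the only models released for the full GERMAPARL corpus.

```r
# Values of k taken from the default of germaparl_download_lda() shown above.
# Assumption: these are the only models released for the full GERMAPARL corpus.
released_k <- c(100L, 150L, 175L, 200L, 225L, 250L, 275L, 300L, 350L, 400L, 450L)

# Hypothetical helper (not part of the package): split the requested values
# into those that match a released model and those that do not.
check_lda_k <- function(k, available = released_k) {
  k <- as.integer(k)
  list(
    ok = k[k %in% available],
    unknown = k[!(k %in% available)]
  )
}

checked <- check_lda_k(c(200, 250, 260))
checked$ok       # the requested values with a matching model
checked$unknown  # the requested values without one
```

The values in checked$ok could then be passed as k to germaparl_download_lda.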

Value

The function germaparl_download_lda will (invisibly) return TRUE if the operation has been successful and FALSE if not.

Details

The function germaparl_download_lda will download an rds-file that will be stored in the data directory of the GermaParl corpus.

germaparl_load_lda will load a topic model into memory. The function will return an LDA_Gibbs topic model if the topic model for k is present, and NULL if it has not yet been downloaded.
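
Because a missing model is signalled by NULL and a failed download by FALSE, the two functions can be combined into a "load, and download first if necessary" step. The following is a minimal sketch under that reading. The helper load_or_download_lda is hypothetical and not part of the package, and the loader and downloader are passed in as functions so that the logic can be tried without network access.

```r
# Hypothetical helper, not part of the package. `load_fun` and `download_fun`
# are one-argument functions wrapping the loader and the downloader.
load_or_download_lda <- function(k, load_fun, download_fun) {
  model <- load_fun(k)
  if (is.null(model)) {
    # NULL signals that the model for k has not been downloaded yet.
    success <- download_fun(k)
    if (!isTRUE(success)) {
      stop("download of topic model for k = ", k, " failed")
    }
    model <- load_fun(k)
  }
  model
}
```

With the package attached, a call might look like load_or_download_lda(250L, load_fun = function(k) germaparl_load_lda(k = k), download_fun = function(k) germaparl_download_lda(k = k)).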

Examples

# This example assumes that the directories used by the CWB do not yet exist, so
# temporary directories are created.
cwb_dirs <- cwbtools::create_cwb_directories(prefix = tempdir(), ask = FALSE)
#> ── Create CWB directories ──────────────────────────────────────────────────────
#>  Using existing directory
#> /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD as parent directory
#> for registry directory and the corpus directory.
#>  registry directory /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/registry already exists
#>  corpus directory /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/indexed_corpora already exists
#>  environment variable `CORPUS_REGISTRY` set as: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/registry

samplemode <- TRUE
corpus_id <- "GERMAPARLSAMPLE" # for full corpus: corpus_id <- "GERMAPARL"

dir.create(file.path(cwb_dirs[["corpus_dir"]], tolower(corpus_id)))
#> Warning: '/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpcFtCDD/indexed_corpora/germaparlsample' already exists

# Download topic model
germaparl_download_lda(
  k = 30, # k = 250 recommended for full GERMAPARL corpus
  data_dir = file.path(cwb_dirs[["corpus_dir"]], tolower(corpus_id)),
  sample = samplemode
)
#>  get Zenodo record for doi 10.5281/zenodo.3823245
#>  get Zenodo record for doi 10.5281/zenodo.3823245 ... done
#> 
#>  starting to download LDA model
#>  check md5 checksum for downloaded file germaparlsample_lda_30.rds
#>  check md5 checksum for downloaded file germaparlsample_lda_30.rds ... done
#> 
lda <- germaparl_load_lda(
  k = 30L, registry_dir = cwb_dirs[["registry_dir"]],
  sample = samplemode
)
#> ... loading topicmodel for k = 30
#> Loading required package: topicmodels
lda_terms <- topicmodels::terms(lda, 10)
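
# The object lda_terms is expected to be a character matrix with one column
# per topic and one row per rank. The sketch below shows one way such a matrix
# could be searched for a term. The helper topics_with_term is hypothetical,
# and the toy matrix, including its column names, is an invented stand-in for
# real output.

```r
# Toy stand-in for a terms matrix with three terms per topic;
# the terms are invented for illustration only.
toy_terms <- matrix(
  c("haushalt", "steuer", "euro",
    "schule", "bildung", "lehrer",
    "euro", "bank", "krise"),
  nrow = 3,
  dimnames = list(NULL, c("Topic 1", "Topic 2", "Topic 3"))
)

# Hypothetical helper: names of the topics whose top terms include `term`.
topics_with_term <- function(terms_matrix, term) {
  hits <- apply(terms_matrix, 2, function(column) term %in% column)
  names(hits)[hits]
}

topics_with_term(toy_terms, "euro")
```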