The package includes a dataset with year-by-year information on the corpus. It is used in the data report of the package vignette.
germaparl_by_year
A data.frame with 22 rows and 6 variables with summary
statistics on the GermaParl corpus on a year-by-year basis.
year reported on in the row (integer value)
total number of protocols included in the corpus for the
respective year (integer value)
number of protocols prepared based on plain text versions of the
protocols (integer value)
number of protocols prepared based on pdf versions of the
protocols (integer value)
number of tokens in subcorpus for the respective year
(integer value)
share of words that could not be lemmatized and therefore carry
the #unknown# tag (numeric value)
A data.frame.
The table is based on v1.0.6 of the corpus. To prepare the table, the script available at data-raw/stats_for_vignette.R has been used.
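A minimal sketch of how the table might be loaded and inspected. It assumes the package is installed under the name "GermaParl", which is not stated above, and it inspects the column names with str() rather than assuming them.

```r
# Assumption: the package providing the dataset is installed as "GermaParl".
if (requireNamespace("GermaParl", quietly = TRUE)) {
  # Load the dataset into the current environment.
  data("germaparl_by_year", package = "GermaParl")

  # Show column names and types (one row per year).
  str(germaparl_by_year)

  # Dimensions; documented above as 22 rows and 6 variables.
  dim(germaparl_by_year)
}
```

The call is wrapped in requireNamespace() so that the snippet does nothing, rather than failing, when the package is not available.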