Download a corpus tarball from Zenodo. Both open access resources and resources with restricted access are supported.
zenodo_get_tarball(
  url,
  destfile = tempfile(fileext = ".tar.gz"),
  checksum = TRUE,
  verbose = TRUE,
  progress = TRUE
)
gparlsample_url_restricted
An object of class character of length 1.
url: Landing page at Zenodo for the resource. Can also be a URL for restricted access (with ?token= and a long key appended), or a DOI referencing objects deposited with Zenodo.
destfile: A character vector with the file path where the downloaded file is to be saved. Tilde-expansion is performed. Defaults to a temporary file.
checksum: A logical value, whether to check the md5 sum.
verbose: A logical value, whether to output progress messages.
progress: A logical value, whether to report progress during download.
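To illustrate the destfile argument, the following minimal sketch shows what the default value and tilde expansion look like in base R. It uses only tempfile() and path.expand(); the file name under ~/corpora is a hypothetical example, and nothing is downloaded.

```r
# Default: a fresh temporary file name with the extension of a gzipped tarball
default_destfile <- tempfile(fileext = ".tar.gz")
print(default_destfile)

# Hypothetical user-supplied path: base R replaces the leading tilde
# with the user's home directory
custom_destfile <- path.expand("~/corpora/germaparlsample.tar.gz")
print(custom_destfile)
```

A temporary file is convenient for a one-off installation. An explicit path may be preferable if the tarball is to be kept after the R session ends.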
The filename of the downloaded corpus tarball, designed to serve as input for corpus_install (as argument tarball). If the resource is not available, NULL is returned.
The path of the downloaded resource, or NULL if the operation has not been successful.
A sample subset of the GermaParl corpus is deposited at Zenodo for testing purposes. There are identical open access and restricted versions of GermaParlSample to test different flavours of downloading a resource from Zenodo. The URL for restricted access includes a very lengthy access token. This URL is included as a dataset in the package to avoid excessively long lines in sample code. Note that URLs that give access to restricted data should usually not be shared.
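Because URLs with an access token should not be shared, it can help to mask the token before a URL ends up in a log or bug report. The following is a minimal sketch in base R. The helpers has_token() and redact_token() are hypothetical and not part of the package, and the example URL is made up.

```r
# Hypothetical helper (not part of the package): does the URL carry a
# token query parameter?
has_token <- function(url) {
  grepl("[?&]token=", url)
}

# Hypothetical helper (not part of the package): mask the value of the
# token query parameter so the URL can be shown safely
redact_token <- function(url) {
  sub("([?&]token=)[^&#]*", "\\1<redacted>", url)
}

# Made-up URL for illustration only
example_url <- "https://zenodo.org/record/0000000?token=abc123"
has_token(example_url)
redact_token(example_url)
```

The unmodified URL is still what has to be passed to zenodo_get_tarball(). The redacted form is only meant for display.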
# \donttest{
# Temporary directory structure as a preparatory step
Sys.setenv(CORPUS_REGISTRY = "")
cwb_dirs <- create_cwb_directories(
  prefix = tempdir(),
  ask = FALSE,
  verbose = FALSE
)
Sys.setenv(CORPUS_REGISTRY = cwb_dirs[["registry_dir"]])
# Download and install open access resource
gparl_url_pub <- "https://doi.org/10.5281/zenodo.3823245"
tarball_tmp <- zenodo_get_tarball(url = gparl_url_pub)
#> ── Download resource from Zenodo ───────────────────────────────────────────────
#> ℹ get tarball URL from Zenodo record
#> ✔ get tarball URL from Zenodo record ... done
#>
#> ℹ tarball to download: germaparlsample_v0.1.0.tar.gz
#> ℹ check that md5 checksum is (cd542b95b3e6c80600f896c2a288c5d3)
#> ✔ check that md5 checksum is (cd542b95b3e6c80600f896c2a288c5d3) ... done
#>
if (!is.null(tarball_tmp)) corpus_install(tarball = tarball_tmp)
#> ── Get CWB directories ─────────────────────────────────────────────────────────
#> ℹ registry directory: /tmp/RtmpeQEeXz/registry
#> ℹ data directory: /tmp/RtmpeQEeXz/indexed_corpora
#> ── Install corpus ──────────────────────────────────────────────────────────────
#> ℹ extract tarball
#> ✔ extract tarball ... done
#>
#> ✔ copy corpus data to target directory ... done
#> ℹ parse registry file
#> ✔ parse registry file ... done
#>
#> ! no info file
#> ℹ update registry data and save registry file
#> ✔ update registry data and save registry file ... done
#>
#> ℹ load corpus
#> ✔ load corpus ... done
#>
# Download and install resource with restricted access
tarball_tmp <- zenodo_get_tarball(url = gparlsample_url_restricted)
#> ── Download restricted resource from Zenodo ────────────────────────────────────
#> ℹ get handle for restricted resource
#> ✔ get handle for restricted resource ... done
#>
#> ℹ tarball to download: <https://zenodo.org//records/6546810/files/germaparlsample_v0.1.1.tar.gz?download=1>
#> ℹ check that md5 checksum is (81d6131ac55b36f40442b7e320e6f51e)
#> ✔ check that md5 checksum is (81d6131ac55b36f40442b7e320e6f51e) ... done
#>
if (!is.null(tarball_tmp)) corpus_install(tarball = tarball_tmp)
#> ── Get CWB directories ─────────────────────────────────────────────────────────
#> ℹ registry directory: /tmp/RtmpeQEeXz/registry
#> ℹ data directory: /tmp/RtmpeQEeXz/indexed_corpora
#> ── Install corpus ──────────────────────────────────────────────────────────────
#> ℹ extract tarball
#> ✔ extract tarball ... done
#>
#> ℹ version extracted from registry file: v0.1.1
#> ── remove corpus GERMAPARLSAMPLE ───────────────────────────────────────────────
#> ℹ registry directory: /tmp/RtmpeQEeXz/registry
#> ℹ data directory: /tmp/RtmpeQEeXz/indexed_corpora/germaparlsample
#> ℹ remove files in data directory
#> ✔ remove files in data directory [6ms]
#>
#> ℹ remove data directory
#> ✔ remove data directory [11ms]
#>
#> ℹ remove registry file
#> ✔ remove registry file [11ms]
#>
#> ✔ corpus `GERMAPARLSAMPLE` has been removed
#> ✔ copy corpus data to target directory ... done
#> ℹ parse registry file
#> ✔ parse registry file ... done
#>
#> ! no info file
#> ℹ update registry data and save registry file
#> ✔ update registry data and save registry file ... done
#>
#> ℹ load corpus
#> ✔ load corpus ... done
#>
# }
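The md5 check reported in the output above can also be reproduced by hand, for instance when a tarball was obtained by other means. The following is a minimal sketch using tools::md5sum(). The helper md5_matches() is hypothetical and not part of the package; it does not claim to mirror how the function performs its check internally.

```r
# Hypothetical helper (not part of the package): compare the md5 sum of a
# local file with an expected value; returns FALSE if the file is missing
md5_matches <- function(path, expected) {
  actual <- unname(tools::md5sum(path))
  !is.na(actual) && identical(tolower(actual), tolower(expected))
}

# Self-contained demonstration with an empty temporary file, whose md5 sum
# is the well-known digest of empty input
empty_file <- tempfile()
file.create(empty_file)
md5_matches(empty_file, "d41d8cd98f00b204e9800998ecf8427e")
```

For the open access tarball downloaded above, the expected value would be the checksum shown in the output (cd542b95b3e6c80600f896c2a288c5d3).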