Download a corpus tarball from Zenodo. Both freely available data and data with restricted access can be downloaded.
zenodo_get_tarball(
  url,
  destfile = tempfile(fileext = ".tar.gz"),
  checksum = TRUE,
  verbose = TRUE,
  progress = TRUE
)
zenodo_get_tarballurl(url)
gparlsample_url_restricted
An object of class character of length 1.
url: Landing page at Zenodo for the resource. Can also be the URL for restricted access (?token= appended with a long key), or a DOI referencing objects deposited with Zenodo.
destfile: A character vector with the file path where the downloaded file is to be saved. Tilde-expansion is performed. Defaults to a temporary file.
checksum: A logical value, whether to check the md5 sum.
verbose: A logical value, whether to output progress messages.
progress: A logical value, whether to report progress during download.
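To make the checksum option more concrete, the following is a minimal sketch of what an md5 comparison can look like in base R, using tools::md5sum(). It is not the package's implementation; the helper name verify_md5 and the throwaway demo file are assumptions made for illustration.

```r
# Hypothetical helper (not part of the package): compare a file's md5 sum
# against an expected value. path.expand() performs tilde-expansion.
verify_md5 <- function(path, expected) {
  actual <- unname(tools::md5sum(path.expand(path)))
  identical(tolower(actual), tolower(expected))
}

# Demo on a throwaway file rather than a real tarball
demo_file <- tempfile(fileext = ".txt")
writeLines("sample content", demo_file)
expected_md5 <- unname(tools::md5sum(demo_file))

verify_md5(demo_file, expected_md5)     # sums match for the same file
verify_md5(demo_file, strrep("0", 32))  # a wrong sum is rejected
```

For a real download, the expected value would have to come from the Zenodo record rather than from the file itself.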
The filename of the downloaded corpus tarball, designed to serve as input for corpus_install (as argument tarball). If the resource is not available, NULL is returned.
The path of the downloaded resource, or NULL if the operation has not been successful.
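Because a NULL return value signals that the resource was not available, callers may want to retry before giving up. The sketch below shows one possible pattern; the helper retry_until_available and the stand-in fake_download are hypothetical and not part of the package.

```r
# Hypothetical helper: call `fetch` until it returns something other than
# NULL, waiting `wait` seconds between attempts. Returns NULL if all
# attempts fail.
retry_until_available <- function(fetch, attempts = 3L, wait = 1) {
  for (i in seq_len(attempts)) {
    result <- fetch()
    if (!is.null(result)) return(result)
    if (i < attempts) Sys.sleep(wait)
  }
  NULL
}

# Stand-in for a download that fails twice and then succeeds
calls <- 0L
fake_download <- function() {
  calls <<- calls + 1L
  if (calls < 3L) NULL else file.path(tempdir(), "corpus.tar.gz")
}

tarball <- retry_until_available(fake_download, attempts = 5L, wait = 0)
```

With the real function, fetch could be a closure such as function() zenodo_get_tarball(url = some_url), where some_url is whatever landing page or DOI is being used.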
zenodo_get_tarballurl is a (temporary) helper function that works around temporarily broken functionality of the zen4R package.
A sample subset of the GermaParl corpus is deposited at Zenodo for testing purposes. There are identical open access and restricted versions of GermaParlSample to test different flavours of downloading a resource from Zenodo. The URL for restricted access includes a very lengthy access token. This URL is included as a dataset in the package to avoid excessively long lines in sample code. Note that URLs that give access to restricted data should usually not be shared.
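Since restricted-access URLs carry their token in the query string, it can be useful to detect and mask the token before a URL ends up in logs or messages. The following is a small sketch in base R; the helper names and the example URL are made up for illustration and do not refer to a real record or token.

```r
# Hypothetical helpers for URLs of the form <landing page>?token=<key>
has_access_token <- function(url) grepl("[?&]token=", url)

mask_access_token <- function(url) {
  # Replace the token value, keeping the parameter name
  sub("([?&]token=)[^&#]+", "\\1<redacted>", url)
}

# Made-up URLs for illustration only
restricted_url <- "https://zenodo.org/record/0000000?token=abc123"
public_url <- "https://zenodo.org/record/0000000"

has_access_token(restricted_url)
has_access_token(public_url)
mask_access_token(restricted_url)
```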
# \donttest{
# Temporary directory structure as a preparatory step
Sys.setenv(CORPUS_REGISTRY = "")
cwb_dirs <- create_cwb_directories(
  prefix = tempdir(),
  ask = FALSE,
  verbose = FALSE
)
Sys.setenv(CORPUS_REGISTRY = cwb_dirs[["registry_dir"]])
# Download and install open access resource
gparl_url_pub <- "https://doi.org/10.5281/zenodo.3823245"
tarball_tmp <- zenodo_get_tarball(url = gparl_url_pub)
#> ── Download resource from Zenodo ───────────────────────────────────────────────
#> ℹ extract tarball URL from Zenodo website
#> Error in open.connection(x, "rb") : HTTP error 504.
#> Warning: Zenodo not available. Try again later.
#> ✔ extract tarball URL from Zenodo website ... done
#>
if (!is.null(tarball_tmp)) corpus_install(tarball = tarball_tmp)
# Download and install resource with restricted access
tarball_tmp <- zenodo_get_tarball(url = gparlsample_url_restricted)
#> ── Download restricted resource from Zenodo ────────────────────────────────────
#> ℹ get handle for restricted resource
#> ✔ get handle for restricted resource ... done
#>
#> ℹ extract tarball URL from Zenodo website
#> ✔ extract tarball URL from Zenodo website ... done
#>
#> ℹ tarball to download: germaparlsample_v0.1.1.tar.gz
#> ℹ checking whether md5 checksum meets expectation (81d6131ac55b36f40442b7e320e6…
#> ✔ checking whether md5 checksum meets expectation (81d6131ac55b36f40442b7e320e6…
#>
if (!is.null(tarball_tmp)) corpus_install(tarball = tarball_tmp)
#> ── Get CWB directories ─────────────────────────────────────────────────────────
#> ℹ registry directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpVB8DzH/registry
#> ℹ data directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpVB8DzH/indexed_corpora
#> ── Install corpus ──────────────────────────────────────────────────────────────
#> ℹ extract tarball
#> ✔ extract tarball ... done
#>
#> ✔ copy corpus data to target directory ... done
#> ℹ parse registry file
#> ✔ parse registry file ... done
#>
#> ! no info file
#> ℹ update registry data and save registry file
#> ✔ update registry data and save registry file ... done
#>
#> ℹ load corpus
#> ✔ load corpus ... done
#>
# }