Download corpus tarball from Zenodo — zenodo_get

Download a corpus tarball from Zenodo. Both freely available data and data with restricted access can be downloaded.

zenodo_get_tarball(
  url,
  destfile = tempfile(fileext = ".tar.gz"),
  checksum = TRUE,
  verbose = TRUE,
  progress = TRUE
)

gparlsample_url_restricted

Format

An object of class character of length 1.

Arguments

url: URL of the landing page at Zenodo for the resource. Can also be a URL for restricted access (with ?token= and a long key appended), or a DOI referencing objects deposited with Zenodo.
destfile: A character vector with the file path where the downloaded file is to be saved. Tilde-expansion is performed. Defaults to a temporary file.
checksum: A logical value, whether to verify the MD5 checksum of the downloaded file.
verbose: A logical value, whether to output progress messages.
progress: A logical value, whether to report progress during download.
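
To illustrate what an MD5 check on a downloaded file involves, here is a minimal sketch in base R. It is not the package's implementation: verify_md5() is a hypothetical helper introduced only for this illustration, and the file it checks is a small temporary text file rather than a corpus tarball.

```r
# Hypothetical helper (not part of the package): compare the MD5 sum of a
# local file with an expected value, ignoring letter case.
verify_md5 <- function(path, expected) {
  actual <- unname(tools::md5sum(path))
  identical(tolower(actual), tolower(expected))
}

# Stand-in for a downloaded file: a small temporary text file.
tmp <- tempfile(fileext = ".txt")
writeLines("hello", tmp)

# In practice, the expected value would come from the repository's metadata.
# Here it is computed locally so that the sketch is self-contained.
expected <- unname(tools::md5sum(tmp))

verify_md5(tmp, expected)         # TRUE: checksums match
verify_md5(tmp, strrep("0", 32))  # FALSE: checksums differ
```

A mismatch usually indicates an incomplete or corrupted download, so the file should not be passed on for installation in that case.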

Value

The path of the downloaded corpus tarball, designed to serve as input for corpus_install (as argument tarball). If the resource is not available or the download has not been successful, NULL is returned.

Details

A sample subset of the GermaParl corpus is deposited at Zenodo for testing purposes. There are identical open access and restricted versions of GermaParlSample to test different flavours of downloading a resource from Zenodo. The URL for restricted access includes a very lengthy access token. This URL is included as a dataset in the package to avoid excessively long lines in sample code. Note that URLs that give access to restricted data should usually not be shared.
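
Because URLs for restricted access carry a token that should not be shared, it can be useful to detect such URLs and mask the token before printing or logging them. The following base R sketch shows one way to do this. The helpers is_restricted_url() and redact_token() are hypothetical names for this illustration, not package functions, and the example URL is a made-up placeholder with a dummy token.

```r
# Hypothetical helpers (not part of the package).

# TRUE if the URL carries a token query parameter.
is_restricted_url <- function(url) grepl("[?&]token=", url)

# Replace the token value with a placeholder so the URL can be logged safely.
redact_token <- function(url) sub("([?&]token=)[^&#]+", "\\1<redacted>", url)

# Placeholder URL with a dummy token (assumption, not a real record).
example_url <- "https://zenodo.org/record/0000000?token=abc123"

is_restricted_url(example_url)
is_restricted_url("https://doi.org/10.5281/zenodo.3823245")
redact_token(example_url)
```

The first call returns TRUE, the second FALSE, and the third returns the URL with the token value replaced by <redacted>.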

Examples

# \donttest{
# Temporary directory structure as a preparatory step
Sys.setenv(CORPUS_REGISTRY = "")
cwb_dirs <- create_cwb_directories(
  prefix = tempdir(),
  ask = FALSE,
  verbose = FALSE
)
Sys.setenv(CORPUS_REGISTRY = cwb_dirs[["registry_dir"]])

# Download and install open access resource
gparl_url_pub <- "https://doi.org/10.5281/zenodo.3823245"
tarball_tmp <- zenodo_get_tarball(url = gparl_url_pub)
#> ── Download resource from Zenodo ───────────────────────────────────────────────
#> ℹ get tarball URL from Zenodo record
#> ✔ get tarball URL from Zenodo record ... done
#> 
#> ℹ tarball to download: germaparlsample_v0.1.0.tar.gz
#> ℹ check that md5 checksum is (cd542b95b3e6c80600f896c2a288c5d3)
#> ✔ check that md5 checksum is (cd542b95b3e6c80600f896c2a288c5d3) ... done
#> 
if (!is.null(tarball_tmp)) corpus_install(tarball = tarball_tmp)
#> ── Get CWB directories ─────────────────────────────────────────────────────────
#> ℹ registry directory: /tmp/RtmpeQEeXz/registry
#> ℹ data directory: /tmp/RtmpeQEeXz/indexed_corpora
#> ── Install corpus ──────────────────────────────────────────────────────────────
#> ℹ extract tarball
#> ✔ extract tarball ... done
#> 
#> ✔ copy corpus data to target directory ... done
#> ℹ parse registry file
#> ✔ parse registry file ... done
#> 
#> ! no info file
#> ℹ update registry data and save registry file
#> ✔ update registry data and save registry file ... done
#> 
#> ℹ load corpus
#> ✔ load corpus ... done
#> 

# Download and install resource with restricted access
tarball_tmp <- zenodo_get_tarball(url = gparlsample_url_restricted)
#> ── Download restricted resource from Zenodo ────────────────────────────────────
#> ℹ get handle for restricted resource
#> ✔ get handle for restricted resource ... done
#> 
#> ℹ tarball to download: <https://zenodo.org//records/6546810/files/germaparlsample_v0.1.1.tar.gz?download=1>
#> ℹ check that md5 checksum is (81d6131ac55b36f40442b7e320e6f51e)
#> ✔ check that md5 checksum is (81d6131ac55b36f40442b7e320e6f51e) ... done
#> 
if (!is.null(tarball_tmp)) corpus_install(tarball = tarball_tmp)
#> ── Get CWB directories ─────────────────────────────────────────────────────────
#> ℹ registry directory: /tmp/RtmpeQEeXz/registry
#> ℹ data directory: /tmp/RtmpeQEeXz/indexed_corpora
#> ── Install corpus ──────────────────────────────────────────────────────────────
#> ℹ extract tarball
#> ✔ extract tarball ... done
#> 
#> ℹ version extracted from registry file: v0.1.1
#> ── remove corpus GERMAPARLSAMPLE ───────────────────────────────────────────────
#> ℹ registry directory: /tmp/RtmpeQEeXz/registry
#> ℹ data directory: /tmp/RtmpeQEeXz/indexed_corpora/germaparlsample
#> ℹ remove files in data directory
#> ✔ remove files in data directory [6ms]
#> 
#> ℹ remove data directory
#> ✔ remove data directory [11ms]
#> 
#> ℹ remove registry file
#> ✔ remove registry file [11ms]
#> 
#> ✔ corpus `GERMAPARLSAMPLE` has been removed
#> ✔ copy corpus data to target directory ... done
#> ℹ parse registry file
#> ✔ parse registry file ... done
#> 
#> ! no info file
#> ℹ update registry data and save registry file
#> ✔ update registry data and save registry file ... done
#> 
#> ℹ load corpus
#> ✔ load corpus ... done
#> 
# }