Download a corpus tarball from Zenodo. Both open access resources and resources with restricted access are supported.

zenodo_get_tarball(
  url,
  destfile = tempfile(fileext = ".tar.gz"),
  checksum = TRUE,
  verbose = TRUE,
  progress = TRUE
)

gparlsample_url_restricted

Format

An object of class character of length 1.

Arguments

url

URL of the Zenodo landing page for the resource. Can also be a URL for restricted access (with ?token= and a long key appended), or a DOI referencing objects deposited with Zenodo.

destfile

A character vector with the file path where the downloaded file is to be saved. Tilde-expansion is performed. Defaults to a temporary file.

checksum

A logical value, whether to check the MD5 checksum of the downloaded tarball.

verbose

A logical value, whether to output progress messages.

progress

A logical value, whether to report progress during download.

Value

The path of the downloaded corpus tarball, designed to serve as input for corpus_install (as argument tarball). If the resource is not available or the operation has not been successful, NULL is returned.

Details

A sample subset of the GermaParl corpus is deposited at Zenodo for testing purposes. Identical open access and restricted versions of GermaParlSample are available to test the different ways of downloading a resource from Zenodo. The URL for restricted access includes a very long access token. This URL is included as a dataset in the package (gparlsample_url_restricted) to avoid excessively long lines in sample code. Note that URLs that give access to restricted data should usually not be shared.

Examples

# \donttest{
# Temporary directory structure as a preparatory step
Sys.setenv(CORPUS_REGISTRY = "")
cwb_dirs <- create_cwb_directories(
  prefix = tempdir(),
  ask = FALSE,
  verbose = FALSE
)
Sys.setenv(CORPUS_REGISTRY = cwb_dirs[["registry_dir"]])

# Download and install open access resource
gparl_url_pub <- "https://doi.org/10.5281/zenodo.3823245"
tarball_tmp <- zenodo_get_tarball(url = gparl_url_pub)
#> ── Download resource from Zenodo ───────────────────────────────────────────────
#>  get tarball URL from Zenodo record
#>  get tarball URL from Zenodo record ... done
#> 
#>  tarball to download: germaparlsample_v0.1.0.tar.gz
#>  check that md5 checksum is (cd542b95b3e6c80600f896c2a288c5d3)
#>  check that md5 checksum is (cd542b95b3e6c80600f896c2a288c5d3) ... done
#> 
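
# A minimal sketch, not part of the package: re-check the MD5 sum of a
# downloaded file by hand with tools::md5sum(). The helper md5_matches() is
# hypothetical; the expected value is copied from the log output above.
md5_matches <- function(path, expected) {
  # tools::md5sum() returns a named character vector (NA for missing files)
  identical(unname(tools::md5sum(path)), tolower(expected))
}
if (exists("tarball_tmp") && !is.null(tarball_tmp)) {
  md5_matches(tarball_tmp, "cd542b95b3e6c80600f896c2a288c5d3")
}
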
if (!is.null(tarball_tmp)) corpus_install(tarball = tarball_tmp)
#> ── Get CWB directories ─────────────────────────────────────────────────────────
#>  registry directory: /tmp/RtmpeQEeXz/registry
#>  data directory: /tmp/RtmpeQEeXz/indexed_corpora
#> ── Install corpus ──────────────────────────────────────────────────────────────
#>  extract tarball
#>  extract tarball ... done
#> 
#>  copy corpus data to target directory ... done
#>  parse registry file
#>  parse registry file ... done
#> 
#> ! no info file
#>  update registry data and save registry file
#>  update registry data and save registry file ... done
#> 
#>  load corpus
#>  load corpus ... done
#> 

# Download and install resource with restricted access
tarball_tmp <- zenodo_get_tarball(url = gparlsample_url_restricted)
#> ── Download restricted resource from Zenodo ────────────────────────────────────
#>  get handle for restricted resource
#>  get handle for restricted resource ... done
#> 
#>  tarball to download: <https://zenodo.org//records/6546810/files/germaparlsample_v0.1.1.tar.gz?download=1>
#>  check that md5 checksum is (81d6131ac55b36f40442b7e320e6f51e)
#>  check that md5 checksum is (81d6131ac55b36f40442b7e320e6f51e) ... done
#> 
if (!is.null(tarball_tmp)) corpus_install(tarball = tarball_tmp)
#> ── Get CWB directories ─────────────────────────────────────────────────────────
#>  registry directory: /tmp/RtmpeQEeXz/registry
#>  data directory: /tmp/RtmpeQEeXz/indexed_corpora
#> ── Install corpus ──────────────────────────────────────────────────────────────
#>  extract tarball
#>  extract tarball ... done
#> 
#>  version extracted from registry file: v0.1.1
#> ── remove corpus GERMAPARLSAMPLE ───────────────────────────────────────────────
#>  registry directory: /tmp/RtmpeQEeXz/registry
#>  data directory: /tmp/RtmpeQEeXz/indexed_corpora/germaparlsample
#>  remove files in data directory
#>  remove files in data directory [6ms]
#> 
#>  remove data directory
#>  remove data directory [11ms]
#> 
#>  remove registry file
#>  remove registry file [11ms]
#> 
#>  corpus `GERMAPARLSAMPLE` has been removed
#>  copy corpus data to target directory ... done
#>  parse registry file
#>  parse registry file ... done
#> 
#> ! no info file
#>  update registry data and save registry file
#>  update registry data and save registry file ... done
#> 
#>  load corpus
#>  load corpus ... done
#> 
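
# A minimal sketch, assuming the restricted-access URL is stored in an
# environment variable rather than written into scripts, since such URLs are
# usually not to be shared. The variable name ZENODO_RESTRICTED_URL and the
# helper restricted_url_from_env() are hypothetical.
restricted_url_from_env <- function(envvar = "ZENODO_RESTRICTED_URL") {
  url <- Sys.getenv(envvar, unset = "")
  # Treat an unset or empty variable as "no URL available"
  if (nzchar(url)) url else NULL
}
url_restricted <- restricted_url_from_env()
if (!is.null(url_restricted)) {
  tarball_tmp <- zenodo_get_tarball(url = url_restricted)
}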
# }