Download a corpus tarball from Zenodo. Both open-access resources and resources with restricted access are supported.

zenodo_get_tarball(
  url,
  destfile = tempfile(fileext = ".tar.gz"),
  checksum = TRUE,
  verbose = TRUE,
  progress = TRUE
)

zenodo_get_tarballurl(url)

gparlsample_url_restricted

Format

An object of class character of length 1.

Arguments

url

URL of the landing page at Zenodo for the resource. Can also be a URL for restricted access (with ?token= and a long key appended), or a DOI referencing an object deposited with Zenodo.

destfile

A character vector with the file path where the downloaded file is to be saved. Tilde-expansion is performed. Defaults to a temporary file.

checksum

A logical value, whether to check the MD5 checksum of the downloaded file.

verbose

A logical value, whether to output progress messages.

progress

A logical value, whether to report progress during download.
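
As an illustration of the kind of check the checksum argument refers to, the following minimal sketch compares the MD5 sum of a local file with an expected value using base R. It is not necessarily how zenodo_get_tarball performs the check internally; verify_md5 is a hypothetical helper, and the small demo file stands in for a downloaded tarball.

```r
# Sketch only: illustrates an MD5 check with base R, not the package's internals.
# verify_md5() is a hypothetical helper introduced for this example.
verify_md5 <- function(path, expected) {
  # tools::md5sum() returns a named character vector; drop the file name
  actual <- unname(tools::md5sum(path))
  identical(tolower(actual), tolower(expected))
}

# Stand-in for a downloaded tarball: a small file with known content.
# writeBin() avoids platform-specific line-ending conversion.
demo_file <- tempfile(fileext = ".txt")
writeBin(charToRaw("hello\n"), demo_file)

# MD5 sum of the six bytes "hello\n"
expected_md5 <- "b1946ac92492d2347c6235b8d2611184"

ok <- verify_md5(demo_file, expected_md5)        # matching checksum
bad <- verify_md5(demo_file, strrep("0", 32))    # deliberately wrong checksum
```

A mismatch would typically indicate an incomplete or corrupted download, in which case the file should not be passed on for installation.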

Value

The filename of the downloaded corpus tarball, designed to serve as input for corpus_install (as argument tarball). If the resource is not available, NULL is returned.

The path of the downloaded resource, or NULL if the operation has not been successful.
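
Because NULL signals that the resource could not be retrieved (for instance when Zenodo is temporarily unavailable), calling code may want to retry before giving up. The following is a minimal sketch of such a wrapper, not part of the package: retry_until_available is a hypothetical helper, and fake_download is a stand-in that avoids any network access.

```r
# Hypothetical helper: call `fetch` until it returns a non-NULL value
# or the number of attempts is exhausted.
retry_until_available <- function(fetch, attempts = 3L, wait = 0) {
  for (i in seq_len(attempts)) {
    result <- fetch()
    if (!is.null(result)) return(result)
    if (i < attempts) Sys.sleep(wait)  # pause between attempts
  }
  NULL  # still unavailable after all attempts
}

# Stand-in downloader for illustration: fails twice, then succeeds.
calls <- 0L
fake_download <- function() {
  calls <<- calls + 1L
  if (calls < 3L) NULL else file.path(tempdir(), "corpus.tar.gz")
}

tarball <- retry_until_available(fake_download)
```

With the package loaded, the same wrapper could presumably be used as retry_until_available(function() zenodo_get_tarball(url = url), wait = 30), keeping the is.null() check before calling corpus_install.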

Details

zenodo_get_tarballurl is a (temporary) helper function that works around functionality of the zen4R package that is temporarily broken.

A sample subset of the GermaParl corpus is deposited at Zenodo for testing purposes. There are identical open access and restricted versions of GermaParlSample to test different flavours of downloading a resource from Zenodo. The URL for restricted access includes a very lengthy access token. This URL is included as a dataset in the package to avoid excessively long lines in sample code. Note that URLs that give access to restricted data should usually not be shared.
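
Since restricted-access URLs carry the token as a query parameter and should not be shared, it can be useful to recognise them and to mask the token before printing or logging. The following is a minimal sketch with base R only; is_restricted_url and redact_token are hypothetical helpers that are not part of the package, and the restricted URL with its token is made up for illustration.

```r
# Hypothetical helpers, not part of the package.
# A URL counts as restricted here if it carries a `token` query parameter.
is_restricted_url <- function(url) grepl("[?&]token=", url)

# Replace the token value by a placeholder so the URL can be logged safely.
redact_token <- function(url) sub("([?&]token=)[^&#]+", "\\1<redacted>", url)

restricted <- "https://zenodo.org/record/0000000?token=abc123"  # made-up URL and token
open_doi <- "https://doi.org/10.5281/zenodo.3823245"            # open access DOI from the examples

is_restricted_url(restricted)
is_restricted_url(open_doi)
masked <- redact_token(restricted)
```

URLs without a token pass through redact_token unchanged, so the helper can be applied to any URL before it is written to a log.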

Examples

# \donttest{
# Temporary directory structure as a preparatory step
Sys.setenv(CORPUS_REGISTRY = "")
cwb_dirs <- create_cwb_directories(
  prefix = tempdir(),
  ask = FALSE,
  verbose = FALSE
)
Sys.setenv(CORPUS_REGISTRY = cwb_dirs[["registry_dir"]])

# Download and install open access resource
gparl_url_pub <- "https://doi.org/10.5281/zenodo.3823245"
tarball_tmp <- zenodo_get_tarball(url = gparl_url_pub)
#> ── Download resource from Zenodo ───────────────────────────────────────────────
#>  extract tarball URL from Zenodo website
#> Error in open.connection(x, "rb") : HTTP error 504.
#> Warning: Zenodo not available. Try again later.
#>  extract tarball URL from Zenodo website ... done
#> 
if (!is.null(tarball_tmp)) corpus_install(tarball = tarball_tmp)

# Download and install resource with restricted access
tarball_tmp <- zenodo_get_tarball(url = gparlsample_url_restricted)
#> ── Download restricted resource from Zenodo ────────────────────────────────────
#>  get handle for restricted resource
#>  get handle for restricted resource ... done
#> 
#>  extract tarball URL from Zenodo website
#>  extract tarball URL from Zenodo website ... done
#> 
#>  tarball to download: germaparlsample_v0.1.1.tar.gz
#>  checking whether md5 checksum meets expectation (81d6131ac55b36f40442b7e320e6
#>  checking whether md5 checksum meets expectation (81d6131ac55b36f40442b7e320e6
#> 
if (!is.null(tarball_tmp)) corpus_install(tarball = tarball_tmp)
#> ── Get CWB directories ─────────────────────────────────────────────────────────
#>  registry directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpVB8DzH/registry
#>  data directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpVB8DzH/indexed_corpora
#> ── Install corpus ──────────────────────────────────────────────────────────────
#>  extract tarball
#>  extract tarball ... done
#> 
#>  copy corpus data to target directory ... done
#>  parse registry file
#>  parse registry file ... done
#> 
#> ! no info file
#>  update registry data and save registry file
#>  update registry data and save registry file ... done
#> 
#>  load corpus
#>  load corpus ... done
#> 
# }