Putting CWB indexed corpora into R data packages is a convenient way to ship and share corpora, and to keep documentation and supplementary functionality with the data.

[Deprecated]

pkg_create_cwb_dirs(pkg = ".", verbose = TRUE)

pkg_add_corpus(
  pkg = ".",
  corpus,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  verbose = TRUE
)

pkg_add_configure_scripts(pkg = ".")

pkg_add_description(
  pkg = ".",
  package = NULL,
  version = "0.0.1",
  date = Sys.Date(),
  author,
  maintainer = NULL,
  description = "",
  license = "",
  verbose = TRUE
)

pkg_add_creativecommons_license(
  pkg = ".",
  license = "CC-BY-NC-SA",
  file = system.file(package = "cwbtools", "txt", "licenses", "CC_BY-NC-SA_3.0.txt")
)

pkg_add_gitattributes_file(pkg = ".")

Arguments

pkg

Path to directory of data package or package name.

verbose

A logical value, whether to be verbose.

corpus

Name of the CWB corpus to insert into the package.

registry

Registry directory.

package

The package name (character). It may not include special characters or underscores ('_').

version

The version number of the corpus (defaults to "0.0.1").

date

The date of creation, defaults to Sys.Date().

author

The author of the package, either character vector or object of class person.

maintainer

Maintainer, R package style, either character vector or person.

description

Description of the data package.

license

The license.

file

Path to file with fulltext of Creative Commons license.

Details

pkg_create_cwb_dirs will create the standard directory structure for storing registry files and indexed corpora within a package (./inst/extdata/cwb/registry and ./inst/extdata/cwb/indexed_corpora, respectively).
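To make the resulting layout concrete, the following base R sketch creates the same directories by hand. It is not the cwbtools implementation; the directory name demo_pkg is a hypothetical placeholder, and the list of directories is taken from the console output shown in the Examples section.

```r
# Sketch only: reproduce the directory layout with base R.
# "demo_pkg" is a hypothetical package directory used for illustration.
pkg_root <- file.path(tempdir(), "demo_pkg")

# Leaf directories; recursive = TRUE creates the intermediate ones
# (inst, inst/extdata, inst/extdata/cwb) along the way.
cwb_dirs <- file.path(
  pkg_root,
  c("R", "man", "inst/extdata/cwb/registry", "inst/extdata/cwb/indexed_corpora")
)

for (d in cwb_dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Logical vector, one entry per directory
created <- dir.exists(cwb_dirs)
```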

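pkg_add_corpus will add the corpus described in the registry directory to the package defined by pkg. It copies the registry file and the data files into the package and adjusts the paths in the registry file.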

pkg_add_configure_scripts will add standardized and tested configure scripts (configure for Linux and macOS, and configure.win for Windows) to the top level directory of the data package, and the file setpaths.R to the tools subdirectory. The configuration mechanism ensures that the data directory is specified correctly in the registry files during the installation of the data package.
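The idea behind the path adjustment can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the content of setpaths.R: the helper set_registry_home is hypothetical, and the toy registry file below contains only a handful of lines. It relies on the fact that a CWB registry file declares the data directory on a line starting with HOME.

```r
# Hypothetical helper (not part of cwbtools): point the HOME line of a
# CWB registry file at a new data directory.
set_registry_home <- function(registry_file, data_dir) {
  lines <- readLines(registry_file)
  home_line <- grepl("^HOME\\s+", lines)
  lines[home_line] <- paste("HOME", data_dir)
  writeLines(lines, registry_file)
  invisible(lines)
}

# Toy registry file, for illustration only
reg <- tempfile()
writeLines(
  c("NAME \"Toy corpus\"", "ID   toy", "HOME /old/path/toy", "ATTRIBUTE word"),
  reg
)

set_registry_home(reg, "/new/path/toy")
updated <- readLines(reg)
```

A configure script would run a step of this kind at installation time, once the final location of the package is known.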

pkg_add_description will add a DESCRIPTION file to the package.
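For orientation, a minimal DESCRIPTION file can also be written with base R. The sketch below is not what pkg_add_description does internally; the field values mirror the Examples section, and the name check is a simplified assumption based on the usual R package naming rules (letters, digits and dots, starting with a letter).

```r
# Sketch only: write a minimal DESCRIPTION file with base R.
desc <- list(
  Package = "reuters",
  Version = "0.0.1",
  Date = as.character(Sys.Date()),
  Author = "cwbtools",
  Description = "Reuters data package"
)

# Simplified validity check for the package name (assumption, see lead-in):
# no underscores or other special characters are allowed.
valid_name <- grepl("^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$", desc$Package)
stopifnot(valid_name)

desc_file <- file.path(tempdir(), "DESCRIPTION")
write.dcf(as.data.frame(desc, stringsAsFactors = FALSE), file = desc_file)

# Read the file back as a character matrix with one column per field
parsed <- read.dcf(desc_file)
```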

pkg_add_creativecommons_license will add license information to the DESCRIPTION file, and move the file LICENSE to the top level directory of the package.

pkg_add_gitattributes_file will add a file '.gitattributes' to the package. The file defines the types of files that will be tracked by Git LFS, i.e. they will not be under conventional version control. This is suitable for large binary files, which is the case for indexed corpus data.
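As a rough illustration of what such a file contains, the sketch below writes Git LFS tracking rules with base R. The line format is the standard one used by Git LFS; the file patterns are an assumption chosen for illustration and are not necessarily the ones cwbtools writes.

```r
# Assumed, illustrative patterns for binary corpus files;
# the patterns actually written by cwbtools may differ.
lfs_patterns <- c("*.corpus", "*.lexicon", "*.rng")

# One Git LFS tracking rule per pattern
lfs_lines <- sprintf("%s filter=lfs diff=lfs merge=lfs -text", lfs_patterns)

gitattributes <- file.path(tempdir(), ".gitattributes")
writeLines(lfs_lines, gitattributes)
```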

References

Blätte, Andreas (2018). "Using Data Packages to Ship Annotated Corpora of Parliamentary Protocols: The GermaParl R Package", ParlaCLARIN 2018 Workshop Proceedings.

Examples

pkgdir <- fs::path_temp()
pkg_create_cwb_dirs(pkg = pkgdir)
#> ... creating directory: /tmp/RtmpYerDdZ/R
#> ... creating directory: /tmp/RtmpYerDdZ/man
#> ... creating directory: /tmp/RtmpYerDdZ/inst
#> ... creating directory: /tmp/RtmpYerDdZ/inst/extdata
#> ... creating directory: /tmp/RtmpYerDdZ/inst/extdata/cwb
#> ... creating directory: /tmp/RtmpYerDdZ/inst/extdata/cwb/registry
#> ... creating directory: /tmp/RtmpYerDdZ/inst/extdata/cwb/indexed_corpora
pkg_add_description(
  pkg = pkgdir,
  package = "reuters",
  author = "cwbtools",
  description = "Reuters data package"
)
#> Warning: `pkg_add_description()` was deprecated in cwbtools 0.3.4.
#>  Downloading corpora from a repository (HTTP-Server, Zenodo, S3) using
#>   corpus_install() is recommended. Only small corpora should be put into
#>   packages as sample data.
#> ... creating DESCRIPTION file
pkg_add_corpus(
  pkg = pkgdir, corpus = "REUTERS",
  registry = system.file(package = "RcppCWB", "extdata", "cwb", "registry")
)
#> Warning: `pkg_add_corpus()` was deprecated in cwbtools 0.3.4.
#>  Downloading corpora from a repository (HTTP-Server, Zenodo, S3) using
#>   corpus_install() is recommended. Only small corpora should be put into
#>   packages as sample data.
#> ... directory for indexed corpus does not yet exist, creating: /tmp/RtmpYerDdZ/inst/extdata/cwb/indexed_corpora/reuters
#> ... copying registry file
#> ... copying data files
#> ... adjusting paths in registry file
pkg_add_gitattributes_file(pkg = pkgdir)
pkg_add_configure_scripts(pkg = pkgdir)
pkg_add_creativecommons_license(pkg = pkgdir)