Putting CWB indexed corpora into R data packages is a convenient way to ship and share corpora, and to keep documentation and supplementary functionality with the data.
pkg_create_cwb_dirs(pkg = ".", verbose = TRUE)

pkg_add_corpus(
  pkg = ".",
  corpus,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  verbose = TRUE
)

pkg_add_configure_scripts(pkg = ".")

pkg_add_description(
  pkg = ".",
  package = NULL,
  version = "0.0.1",
  date = Sys.Date(),
  author,
  maintainer = NULL,
  description = "",
  license = "",
  verbose = TRUE
)

pkg_add_creativecommons_license(
  pkg = ".",
  license = "CC-BY-NC-SA",
  file = system.file(package = "cwbtools", "txt", "licenses", "CC_BY-NC-SA_3.0.txt")
)

pkg_add_gitattributes_file(pkg = ".")
pkg: Path to directory of data package or package name.
verbose: A logical value, whether to be verbose.
corpus: Name of the CWB corpus to insert into the package.
registry: Registry directory.
package: The package name (character); may not include special characters or underscores ('_').
version: The version number of the corpus (defaults to "0.0.1").
date: The date of creation, defaults to Sys.Date().
author: The author of the package, either a character vector or an object of class person.
maintainer: Maintainer, R package style, either a character vector or a person object.
description: Description of the data package.
license: The license.
file: Path to file with the full text of the Creative Commons license.
pkg_create_cwb_dirs will create the standard directory structure for storing registry files and indexed corpora within a package (./inst/extdata/cwb/registry and ./inst/extdata/cwb/indexed_corpora, respectively).
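The layout described above can be sketched in base R. This is only an illustration of the resulting directory tree, not the cwbtools implementation: make_cwb_dirs is a hypothetical helper, and it covers only the two CWB directories named here, not other package directories the function may create.

```r
# Illustrative sketch only: recreate the CWB directory layout in base R.
# `make_cwb_dirs` is a hypothetical helper, not part of cwbtools.
make_cwb_dirs <- function(pkg) {
  cwb_dir <- file.path(pkg, "inst", "extdata", "cwb")
  dirs <- file.path(cwb_dir, c("registry", "indexed_corpora"))
  for (d in dirs) {
    # recursive = TRUE also creates inst/ and inst/extdata/ when missing
    if (!dir.exists(d)) dir.create(d, recursive = TRUE)
  }
  invisible(dirs)
}

# Use a throwaway directory so nothing is written to a real package
pkg <- file.path(tempdir(), "samplepkg")
created <- make_cwb_dirs(pkg)
print(basename(created))
```

Keeping registry files and indexed data in sibling directories under inst/extdata/cwb means both end up in the installed package, where the registry can point at the data.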
pkg_add_corpus will add the corpus described in the registry directory to the package defined by pkg.
pkg_add_configure_scripts will add standardized and tested configure scripts to the top-level directory of the data package (configure for Linux and macOS, configure.win for Windows), and the file setpaths.R to the tools subdirectory. The configuration mechanism ensures that the data directory is specified correctly in the registry files during the installation of the data package.
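To make the idea behind the configuration mechanism concrete, here is a minimal sketch of rewriting the data directory in a registry file. It assumes the usual CWB registry convention of a line starting with HOME that holds the path to the indexed data. The helper set_registry_home and the sample file content are hypothetical; the actual setpaths.R shipped by cwbtools is not reproduced here and may work differently (for example, it may also handle the INFO line or quote paths that contain spaces).

```r
# Hypothetical helper: point the HOME line of a registry file at `data_dir`.
set_registry_home <- function(registry_file, data_dir) {
  lines <- readLines(registry_file)
  is_home <- grepl("^HOME\\s", lines)
  lines[is_home] <- paste("HOME", data_dir)
  writeLines(lines, registry_file)
  invisible(lines)
}

# Minimal stand-in for a registry file (real files hold more declarations)
registry_file <- tempfile()
writeLines(
  c('NAME "Reuters sample"', "ID reuters", "HOME /build/machine/path/reuters"),
  registry_file
)

# After installation the data lives somewhere else, so update the path
new_home <- file.path(tempdir(), "indexed_corpora", "reuters")
updated <- set_registry_home(registry_file, new_home)
```

The path recorded when the package is built is generally not valid on the machine where it is installed, which is why an adjustment of this kind has to run at installation time.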
pkg_add_description will add a DESCRIPTION file to the package.
pkg_add_creativecommons_license will add license information to the DESCRIPTION file, and move file LICENSE to the top-level directory of the package.
pkg_add_gitattributes_file will add a file '.gitattributes' to the package. The file defines the types of files that will be tracked by Git LFS, i.e. they will not be under conventional version control. This is suitable for large binary files, which is the case for indexed corpus data.
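For orientation, a Git LFS rule in a '.gitattributes' file has the form "pattern filter=lfs diff=lfs merge=lfs -text". The sketch below writes two such rules to a temporary file. The file patterns are assumptions chosen for illustration; the patterns actually written by pkg_add_gitattributes_file may differ.

```r
# Hypothetical patterns for binary corpus files; the file written by
# pkg_add_gitattributes_file() may list different ones.
patterns <- c("*.corpus", "*.lexicon")

# One Git LFS rule per pattern
lfs_rules <- paste(patterns, "filter=lfs diff=lfs merge=lfs -text")

# Write to a temporary location rather than into a real repository
gitattributes <- file.path(tempdir(), ".gitattributes")
writeLines(lfs_rules, gitattributes)
```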
Blätte, Andreas (2018). "Using Data Packages to Ship Annotated Corpora of Parliamentary Protocols: The GermaParl R Package", ParlaCLARIN 2018 Workshop Proceedings.
pkgdir <- fs::path_temp()
pkg_create_cwb_dirs(pkg = pkgdir)
#> ... creating directory: /tmp/RtmpeQEeXz/R
#> ... creating directory: /tmp/RtmpeQEeXz/man
#> ... creating directory: /tmp/RtmpeQEeXz/inst
#> ... creating directory: /tmp/RtmpeQEeXz/inst/extdata
#> ... creating directory: /tmp/RtmpeQEeXz/inst/extdata/cwb
#> ... creating directory: /tmp/RtmpeQEeXz/inst/extdata/cwb/registry
#> ... creating directory: /tmp/RtmpeQEeXz/inst/extdata/cwb/indexed_corpora
pkg_add_description(
  pkg = pkgdir,
  package = "reuters",
  author = "cwbtools",
  description = "Reuters data package"
)
#> Warning: `pkg_add_description()` was deprecated in cwbtools 0.3.4.
#> ℹ Downloading corpora from a repository (HTTP-Server, Zenodo, S3) using
#> corpus_install() is recommended. Only small corpora should be put into
#> packages as sample data.
#> ... creating DESCRIPTION file
pkg_add_corpus(
  pkg = pkgdir, corpus = "REUTERS",
  registry = system.file(package = "RcppCWB", "extdata", "cwb", "registry")
)
#> Warning: `pkg_add_corpus()` was deprecated in cwbtools 0.3.4.
#> ℹ Downloading corpora from a repository (HTTP-Server, Zenodo, S3) using
#> corpus_install() is recommended. Only small corpora should be put into
#> packages as sample data.
#> ... directory for indexed corpus does not yet exist, creating: /tmp/RtmpeQEeXz/inst/extdata/cwb/indexed_corpora/reuters
#> ... copying registry file
#> ... copying data files
#> ... adjusting paths in registry file
pkg_add_gitattributes_file(pkg = pkgdir)
pkg_add_configure_scripts(pkg = pkgdir)
pkg_add_creativecommons_license(pkg = pkgdir)