install.packages() has been removed from the package. Using argument
corpus_install() will install corpora found in a package as system corpora defined in the default registry directory (#46).
https://github.com/PolMine/cookbook. Packages ‘NLP’ and ‘openNLP’ are no longer suggested and the
install.packages() call (though not evaluated) is omitted. Part of the fix for #46.
fs::path() function replaces base R
file.path() throughout to make the generation of paths more robust and to improve the readability of the code.
p_attribute_encode() checks that the
token_stream does not exceed the CWB corpus size limit (2^31 - 1) (#40).
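A size check of this kind can be sketched in a few lines of base R. This is only an illustration of the limit named above: the helper `token_stream_fits()` is hypothetical and is not the check implemented in cwbtools.

```r
# Largest corpus size the CWB can index, as stated above: 2^31 - 1 tokens.
cwb_max_tokens <- 2^31 - 1

# Hypothetical helper (not part of cwbtools): TRUE if the token stream fits.
token_stream_fits <- function(token_stream) {
  length(token_stream) <= cwb_max_tokens
}

token_stream <- c("This", "is", "a", "tiny", "corpus", ".")
token_stream_fits(token_stream)

# The limit coincides with the largest value of R's 32-bit integer type.
cwb_max_tokens == .Machine$integer.max
```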
gparlsample_url_restricted has been updated to replace a URL that has become defunct.
zenodo_get_tarballurl() steps in for functionality of the zen4R package temporarily not working (#42). It is used internally by the
zenodo_get_tarball() fails gracefully if Zenodo is temporarily not available.
p_attribute_rename(), corresponding to
p_attribute_encode() will remove the [p_attr].corpus file as suggested by cwb-makeall (if
fs package for a consistent handling of paths (such as
fs::path()) is used more widely (#36).
zenodo_get_tarball() for downloading corpus tarballs from Zenodo. Restricted access can be handled too (personalized URL with token).
corpus_install() has new argument
load to control whether corpus is loaded after installation.
pkg_add_description() is declared deprecated. To alert users, functionality of the lifecycle package is used (#1).
as.vrt() will generate valid *.vrt files from
cwb_corpus_dir(), the function would falsely yield
NA results if the CWB directory contained more than two directories.
verbose can be used to suppress this output.
corpus_install() function will abort with a FALSE return value if the requested tarball is not available (#34).
s_attribute_rename() can be used to rename s-attributes.
corpus_get_version() will derive the corpus version number from the registry file and return a
numeric_version object (#16).
writeBin() to write long integer vectors has been overcome with R v4.0.0. A warning and a preliminary workaround to address this limitation when using
p_attribute_encode() for corpora with more than 536870911 tokens can therefore be dropped. For large corpora, the function will check the R version and issue the recommendation to install R v4.0.0 or higher, if the size limitation (536870911) is relevant (#28).
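The version check described in this entry might look roughly like the sketch below. The helper `needs_r_upgrade()` and the name `writebin_limit` are hypothetical and serve only to illustrate the logic; the threshold is the one given above, which equals 2^29 - 1.

```r
# Threshold named in the changelog entry above (2^29 - 1).
writebin_limit <- 536870911

# Hypothetical helper (not part of cwbtools): should the user be advised
# to upgrade to R v4.0.0 or higher before encoding a corpus of this size?
needs_r_upgrade <- function(n_tokens, r_version = getRversion()) {
  n_tokens > writebin_limit && r_version < "4.0.0"
}

needs_r_upgrade(1000)                                       # small corpus
needs_r_upgrade(6e8, r_version = numeric_version("3.6.3"))  # large corpus, old R
```

Comparing against `getRversion()` rather than parsing `R.version.string` keeps the check robust, because `numeric_version` objects compare component by component.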
cwb_get_url() will return the MD5 checksum of the compressed file as attribute ‘md5’.
cwb_install() function will fail gracefully if downloading the CWB fails (returning
NULL). A new argument md5 will trigger checking the MD5 sum of the downloaded file (if provided). The default value of
cwb_dir is now a temporary directory.
cwb_install() is skipped on Solaris to ensure that Solaris CRAN tests will not fail: a CWB binary is not available for Solaris.
corpus_install() function introduces functionality to check the integrity of a downloaded corpus tarball. If the tarball is downloaded from Zenodo (by stating a DOI using argument
doi), the MD5 checksum included in the record’s metadata is extracted internally and used for checking.
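An integrity check of this kind can be approximated with `tools::md5sum()` from base R. The sketch below is not the code used by `corpus_install()`: the helper `md5_matches()` is hypothetical, and a temporary file stands in for a downloaded tarball.

```r
# Write a small temporary file standing in for a downloaded tarball.
tarball <- tempfile(fileext = ".tar.gz")
writeLines("placeholder content", tarball)

# Hypothetical helper (not part of cwbtools): compare the MD5 sum of a
# file with an expected checksum.
md5_matches <- function(file, expected) {
  unname(tools::md5sum(file)) == expected
}

# Stand-in for the checksum that would come from the record's metadata.
expected <- unname(tools::md5sum(tarball))

md5_matches(tarball, expected)
md5_matches(tarball, "00000000000000000000000000000000")
```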
corpus_copy() will accept a new argument
TRUE (the default value is
FALSE), files that have been copied will be removed. Removing files is reasonable to handle disk space parsimoniously if the source corpus is at a temporary location where nobody will miss it.
corpus_install() function will abort with a warning and return value
FALSE rather than an error if the DOI is not offered by Zenodo.
corpus_install() is used to install a corpus from a tarball present locally, a somewhat confusing message suggested that the tarball was downloaded. This message is not shown any more.
cwb_install() now replaces an internally hardcoded argument
cwb_dir with an argument
cwb_dir; the function returns the directory where the CWB is installed rather than
cwb_get_bindir() now introduces an argument
p_attribute_encode() now has default value
p_attribute_encode() have been adapted so that the GitHub Action unit test passes on Windows.
RCurl::url.exists(), this function has been replaced by
corpus_install() function still showed some progress messages even when
verbose was set to
FALSE (argument not passed to
get_encoding() method would return
localeToCharset() fails to infer charset from locale. In this case, UTF-8 is assumed.
corpus_install() function tried to ask for user feedback when not in an interactive session. The function now checks whether it is possible to ask for user feedback.
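The usual way to guard a prompt in R is a call to `interactive()`. The following is a minimal sketch of that pattern, assuming a hypothetical helper `confirm_overwrite()`; it does not reproduce the dialogue used by `corpus_install()`.

```r
# Hypothetical helper (not part of cwbtools): ask for confirmation only
# when a user is present to answer; otherwise fall back to a default.
confirm_overwrite <- function(default = FALSE) {
  if (!interactive()) return(default)
  answer <- readline("Corpus already exists. Overwrite? (yes/no) ")
  tolower(trimws(answer)) %in% c("y", "yes")
}

# In a batch run (e.g. Rscript or R CMD check) this returns the default
# without blocking on input.
confirm_overwrite()
```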
cwbtools::create_cwb_directories() function did show if
corpus_install() gives much better and nicer reports on steps performed during corpus downloads. User dialogues have been reworked thoroughly to provide better user guidance.
use_corpus_registry_envvar() function is called by
corpus_install() and will amend the .Renviron file as appropriate if the user so desires.
corpus_testload() has been implemented to check whether a (newly installed) corpus is accessible.
jsonlite::fromJSON(). The auxiliary function to get and process information from Zenodo now ensures that newline characters are escaped such that they can be processed.
corpus_copy() function did not set the path to the info file to the new data directory - corrected.
corpus_install() function failed when the
NULL value from the default call to
cwbtools::cwb_registry_dir(). But if the directories are created, the registry directory is there. Fixed.
registry_file_compose() when the path includes any whitespace characters.
cwbtools that may arise when
devtools::install_github() is used is addressed in an extended explanation in the README.md file on how to install the development version of
install_corpus() function has been reworked thoroughly. Using system directories for the registry and the corpus directory is now supported. This is a prerequisite for installing corpora outside of R packages. Installing corpora within corpora is not allowed by CRAN.
cwb_corpus_dir()) will get the whereabouts of the registry directory and the corpus directory. In particular, they consider that the polmineR package may have generated a temporary corpus registry, resetting the CORPUS_REGISTRY environment variable.
install_corpus() function accepts an argument
doi to provide a Document Object Identifier (DOI). At this stage, the DOI is assumed to be awarded by Zenodo. Information available at the Zenodo site will be resolved to get the URL of a corpus tarball that can be downloaded. Upon installing a corpus from Zenodo, the DOI and the version number will be written as corpus properties into the registry file.
corpus_install() function will ask the user for feedback if a corpus would be installed that is already present and that would be deleted or overwritten.
use_corpus_registry_envvar() will assist users to create the required directory structure for CWB indexed corpora.
pkg_add_corpus() function will now create the cwb directories (registry and data directory) if necessary. Previously, these directories were required to exist before moving a corpus into a package, making it necessary to put dummy files into packages to keep R CMD build from issuing warnings and git from dropping these directories. Creating the directories on demand is a precondition for a CRAN release of data packages (#11).
matrix class will inherit from class
array. The new package version now takes into account that
length(class(matrix(1:4,2,2))) will return the value 2.
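The practical consequence of this change is that comparing `class(x)` to a single string no longer yields a length-one condition. The sketch below shows the version-independent idiom; it illustrates the general pattern, not the specific lines changed in the package.

```r
m <- matrix(1:4, 2, 2)

# Fragile: under R >= 4.0.0 class(m) has two elements, so this comparison
# yields a logical vector of length 2 rather than a single TRUE/FALSE.
# if (class(m) == "matrix") ...

# Robust across R versions: test class membership instead.
inherits(m, "matrix")
is.matrix(m)
```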
pkgdown::build_site() will generate a proper changelog page.
curl::curl_download() for Windows because curl apparently is not able to process target filenames that include special characters.
decode()-method will turn a
Annotation object from the NLP package.
conll_get_regions()-function will turn a CoNLL-style annotated token stream into a table with regions that can be encoded using
s_attribute_merge() will merge two
data.table objects defining s-attributes, checking for overlaps.
s_attribute_recode(), and supplementary
tempdir() is now wrapped as
normalizePath(tempdir(), winslash = "/") to avoid problems on Windows, where different file separators may be used.
file.path(), the argument
fsep is “/” to prevent confusion of file separators.
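Taken together, the two path-related changes amount to the following base R pattern. The variable names are illustrative and the subdirectory name "registry" is just an example.

```r
# Normalize the session's temporary directory, forcing forward slashes
# as separators on Windows.
tmp <- normalizePath(tempdir(), winslash = "/")

# Build a path below it with an explicit forward-slash separator.
registry_dir <- file.path(tmp, "registry", fsep = "/")

# Check that no backslash separator has crept into the result.
grepl("\\\\", registry_dir)
```

On Unix-alikes `winslash` has no effect, so the same code can be used on all platforms.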
corpus_copy() is available to create a copy of a corpus.
cl_delete_corpus() from RcppCWB is added to
s_attribute_encode(), so that newly added s-attributes can be used without restarting the R session.
corpus_copy() was defined (and documented) twice in a confusing manner. This is cleaned up.
installed.packages() were replaced to follow advice from the CRAN team during the submission process.
CorpusData$add_corpus_positions() (helper function .fn)
install_corpus(), if argument tarball is specified. This is a precondition for passing arguments to download password-protected corpora.