NEWS.md
`zenodo_get_tarball()` fails gracefully if Zenodo is not available #72.
`encode()` to prospectively supersede the `CorpusData` class. Includes argument `properties` #13.
`corpus_reload()` for convenient unloading/reloading of corpora #68.
`registry_set_name()` #13.
`cwb_get_url()` will get CWB v3.5 installation files #63.
`corpus_remove()` returns `FALSE` (rather than failing with an error) when the corpus does not exist. More telling messages.
`p_attribute_encode()`
has a new argument `quietly`, which is passed into the RcppCWB functions `cwb_compress()`, `cwb_huffcode()` and `cwb_compress_rdx()` to control verbosity.
`$encode()` of the `CorpusData` class has a new argument `quietly`, which is passed into `p_attribute_encode()`.
`$encode()` has a new argument `reload` to trigger unloading and reloading the corpus, to make s-attributes available #57.
`CorpusData$encode()`
method uses messages from the cli package #59.
Documentation of `p_attribute_encode()` rewritten, including explanation of argument `compress` and simplification of sample code #61.
`s_attribute_encode()` coerces input values to `character` (rather than failing) #62.
In `s_attribute_encode()`, `p_attribute_encode()` and `CorpusData$encode()`, using a new (internal) function, a telling message is issued if non-ASCII or uppercase characters are used. The documentation has been augmented accordingly #48.
`p_attribute_encode()` checks whether files for the encoded p-attribute exist and fails gracefully with a telling error message if so #4.
Argument `compress` defaults to `FALSE`, as corpus compression is not stable on Windows #3.
`corpus_as_tarball()`
and `corpus_copy()` now have `registry_file_parse(corpus, registry_dir)[["home"]]` as default value, so that values are more consistent across `corpus_*` functions #18.
`cwb_get_bindir()` tries to find the `cwb-config` system utility, if it is on the path.
`s_attribute_encode()` issues a warning on Windows when using s-attribute ‘id’ #69.
`normalizePath()` replaced by `fs::path()` in `p_attribute_encode()` #65.
`p_attribute_encode()` accepts multiple p-attributes if method is “CWB”.
`registry_set_property()` for setting corpus properties in a pipe.
`read_registry_file()` will keep ‘registry_dir’ and ‘corpus’.
`registry_set_info()` as new auxiliary function to set the path to the info file in a `registry_data` object.
`corpus_install()`
reverts to package zen4R to get links of files at Zenodo #42.
`curl::curl_download()` replaces `download.file()` in `corpus_install()` if argument `user` is `NULL` (to avoid corrupted downloads from Zenodo) #53.
`install.packages()` has been removed from the package. Using argument `pkg` of `corpus_install()` will install corpora found in a package as system corpora defined in the default registry directory #46.
https://github.com/PolMine/cookbook. Packages ‘NLP’ and ‘openNLP’ are no longer suggested and the `install.packages()` call (though not evaluated) is omitted. Part of the fix for #46.
`fs::path()` function replaces base R `file.path()` throughout to solidify the generation of paths and to improve the readability of the code.
`p_attribute_encode()`
checks that the `character` vector `token_stream` does not exceed the CWB corpus size limit (2^31 - 1) #40.
`zenodo_get_tarballurl()` is removed from the package again (temporarily used when the zen4R package did not work).
`gparlsample_url_restricted` has been updated to replace a URL that has become defunct.
`zenodo_get_tarballurl()` steps in for functionality of the zen4R package temporarily not working #42. It is used internally by the `corpus_install()` function.
`zenodo_get_tarball()` fails gracefully if Zenodo is temporarily not available.
`p_attribute_rename()`, corresponding to `s_attribute_rename()`.
`p_attribute_encode()` will remove the [p_attr].corpus file as suggested by cwb-makeall (if `compress` is `TRUE`).
`fs` package for a consistent handling of paths (such as `fs::path()`) is used more widely (#36).
`zenodo_get_tarball()`
for downloading corpus tarballs from Zenodo. Restricted access can be handled too (personalized URL with token).
`corpus_install()` has a new argument `load` to control whether the corpus is loaded after installation.
`pkg_add_description()` is declared deprecated. To alert users, functionality of the lifecycle package is used (#1).
`as.vrt()` will generate valid *.vrt files from `xml_document` input.
`cwb_corpus_dir()`, the function would falsely yield `NA` results if the CWB directory contained more than two directories.
`cwb_corpus_dir()` and `cwb_registry_dir()`. Argument `verbose` can be used to suppress this output.
`corpus_install()` function will abort with a `FALSE` return value if the requested tarball is not available (#34).
`s_attribute_rename()` can be used to rename s-attributes.
`corpus_get_version()` will derive the corpus version number from the registry file and return a numeric version object (#16).
The limitation of `writeBin()` to write long integer vectors has been overcome with R v4.0.0. A warning and a preliminary workaround to address this limitation when using `p_attribute_encode()` for corpora with more than 536870911 tokens can therefore be dropped. For large corpora, the function will check the R version and issue the recommendation to install R v4.0.0 or higher, if the size limitation (536870911) is relevant (#28).
`cwb_get_url()`
will return the MD5 checksum of the compressed file as attribute ‘md5’.
`cwb_install()` function will fail gracefully if downloading the CWB fails (returning `NULL`). A new argument `md5` will trigger checking the MD5 sum of the downloaded file (if provided). The default value of `cwb_dir` is now a temporary directory.
`cwb_install()` is skipped on Solaris to ensure that Solaris CRAN tests will not fail: a CWB binary is not available for Solaris.
`tarball` of `corpus_install()`.
Argument `checksum` for the `corpus_install()` function introduces functionality to check the integrity of a downloaded corpus tarball. If the tarball is downloaded from Zenodo (by stating a DOI using argument `doi`), the MD5 checksum included in the record’s metadata is extracted internally and used for checking.
`corpus_copy()` will accept a new argument `remove`. If `TRUE` (the default value is `FALSE`), files that have been copied will be removed. Removing files is a reasonable way to use disk space parsimoniously if the source corpus is at a temporary location where nobody will miss it.
`corpus_install()`
function will abort with a warning and return value `FALSE` rather than an error if the DOI is not offered by Zenodo.
When `corpus_install()` is used to install a corpus from a tarball present locally, a somewhat confusing message suggested that the tarball was downloaded. This message is not shown any more.
`cwb_install()` now replaces an internally hardcoded argument `cwb_dir` with an argument `cwb_dir`; the function returns the directory where the CWB is installed rather than a `NULL` value.
`cwb_get_bindir()` now introduces an argument `bindir`.
Argument `compress` of `p_attribute_encode()` now has default value `FALSE` (#29).
`p_attribute_encode()` have been adapted so that the GitHub Action unit test passes on Windows.
`FALSE` (#25).
`RCurl::url.exists()`, this function has been replaced by `httr::http_error()` (#31).
`corpus_install()`
function still showed some progress messages even when `verbose` was set to `FALSE` (argument not passed to `corpus_copy()`). Fixed.
`get_encoding()` method would return `NA` if `localeToCharset()` fails to infer the charset from the locale. In this case, UTF-8 is assumed.
`corpus_install()` function tried to ask for user feedback when not in an interactive session. The function now checks whether it is possible to ask for user feedback.
`cwbtools::create_cwb_directories()` function did show if `verbose` was `FALSE`. Fixed.
`corpus_install()` gives clearer reports on the steps performed during corpus downloads. User dialogues have been reworked thoroughly to provide better user guidance.
`use_corpus_registry_envvar()` function is called by `corpus_install()` and will amend the .Renviron file as appropriate if the user so desires.
`corpus_testload()` has been implemented to check whether a (newly installed) corpus is accessible.
`jsonlite::fromJSON()`. The auxiliary function to get and process information from Zenodo now ensures that newline characters are escaped such that they can be processed.
`corpus_copy()`
function did not set the path to the info file to the new data directory. Corrected.
`corpus_install()` function failed when `registry_dir` got a `NULL` value from the default call to `cwbtools::cwb_registry_dir()`. But if the directories are created, the registry directory is there. Fixed.
`registry_file_compose()` when the path includes any whitespace characters.
`curl` dependency of `cwbtools` that may arise when `devtools::install_github()` is used is addressed in an extended explanation in the README.md file on how to install the development version of `cwbtools` using `remotes::install_github()` (#21).
`install_corpus()` function has been reworked thoroughly. Using system directories for the registry and the corpus directory is now supported. This is a prerequisite for installing corpora outside of R packages. Installing corpora within packages is not allowed by CRAN.
`cwb_directories()`, `cwb_registry_dir()` and `cwb_corpus_dir()` will get the whereabouts of the registry directory and the corpus directory. In particular, they consider that the polmineR package may have generated a temporary corpus registry, resetting the CORPUS_REGISTRY environment variable.
`install_corpus()`
function accepts an argument `doi` to provide a Digital Object Identifier (DOI). At this stage, the DOI is assumed to be awarded by Zenodo. Information available at the Zenodo site will be resolved to get the URL of a corpus tarball that can be downloaded. Upon installing a corpus from Zenodo, the DOI and the version number will be written as corpus properties into the registry file.
`corpus_install()` function will ask the user for feedback if a corpus would be installed that is already present and that would be deleted or overwritten.
`create_cwb_directories()` and `use_corpus_registry_envvar()` will assist users to create the required directory structure for CWB indexed corpora.
`pkg_add_corpus()` function will now create the CWB directories (registry and data directory) if necessary. Previously, these directories were required to exist before moving a corpus into a package, making it necessary to put dummy files into packages to keep R CMD build from issuing warnings and git from dropping these directories. Creating the directories on demand is a precondition for a CRAN release of data packages (#11).
`matrix` class will inherit from class `array`. The new package version now takes into account that `length(class(matrix(1:4,2,2)))` will return the value 2.
`pkgdown::build_site()`
will generate a proper changelog page.
`s_attribute_get_regions()` and `s_attribute_get_values()`.
In `corpus_install()`, using `download.file()` replaces `curl::curl_download()` on Windows because curl apparently is not able to process target filenames that include special characters.
`shortPathName()` is used.
`decode()` method will turn a `partition` into an `Annotation` object from the NLP package.
`conll_get_regions()` function will turn a CoNLL-style annotated token stream into a table with regions that can be encoded using `s_attribute_encode()`.
`s_attribute_merge()` will merge two `data.table` objects defining s-attributes, checking for overlaps.
`p_attribute_recode()`, `s_attribute_recode()`, and supplementary `s_attributed_files()`, and `corpus_recode()`.
`tempdir()`
is now wrapped as `normalizePath(tempdir(), winslash = "/")` to avoid problems on Windows, where different file separators may be used.
`file.path()`, the argument `fsep` is “/” to prevent confusion of file separators.
`corpus_copy()` is available to create a copy of a corpus.
`s_attribute_encode()`.
`cl_delete_corpus()` from RcppCWB is added to `s_attribute_encode()`, so that newly added s-attributes can be used without restarting the R session.
`corpus_copy()` was defined (and documented) twice in a confusing manner. This is cleaned up.
`installed.packages()` were replaced to follow advice from the CRAN team in the submission process.
`CorpusData$import_xml()` method
`CorpusData$add_corpus_positions()` (helper function .fn)
`download.file()` by `install_corpus()`, if argument `tarball` is specified. This is a precondition for passing arguments to download password-protected corpora.
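
The last entry concerns arguments that reach `download.file()` when a tarball is given. As a minimal sketch of that kind of pass-through, assuming it is done with R's `...` mechanism (an assumption, the entry does not say so), the hypothetical wrapper `fetch_tarball()` below forwards extra arguments unchanged. `fetch_tarball()`, `record_call()`, the URL and the token are illustrative only and not part of cwbtools. A stand-in downloader is used so that the sketch runs offline.

```r
# Hypothetical wrapper, not the cwbtools implementation: any extra arguments
# captured in `...` are forwarded unchanged to the downloading function.
fetch_tarball <- function(tarball, destfile, ...,
                          downloader = utils::download.file) {
  downloader(url = tarball, destfile = destfile, ...)
  invisible(destfile)
}

# Stand-in downloader (assumption, for offline use): it records the arguments
# it receives instead of contacting a server.
received <- NULL
record_call <- function(url, destfile, ...) {
  received <<- list(url = url, destfile = destfile, extra = list(...))
  0L
}

fetch_tarball(
  tarball = "https://example.org/corpus.tar.gz",  # placeholder URL
  destfile = file.path(tempdir(), "corpus.tar.gz"),
  quiet = TRUE,                                   # forwarded via `...`
  headers = c(Authorization = "Bearer <token>"),  # forwarded via `...`
  downloader = record_call
)

# Inspect what arrived at the downloader through `...`.
str(received$extra)
```

With the default `downloader`, the same call would hand `quiet` and `headers` to `utils::download.file()`, which accepts both arguments. Whether a given server accepts credentials in this form is outside the scope of the sketch.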