NEWS.md
- `Rcpp::sourceCpp()` or `Rcpp::cppFunction()`.
- `devtools::install_github("PolMine/RcppCWB")`. The missing `ref = "dev"` has been inserted.
- `cwb_encode()` crashed if arguments `data_dir` and `vrt_dir` included a tilde. Tilde expansion is now applied to these arguments to avoid this (#73).
- `sprintf()` with `snprintf()` to address security issue.
- `sprintf()` (#70).
- `corpus_properties()` and `corpus_property()` do not crash any more if the corpus is not loaded or not present (#69).
- `p_attr_default()` to programmatically extract the default p-attribute (#63).
- `region_matrix_corpus()` C++ code that would not show any context at all if s_attribute expansion extended beyond the start or end of the corpus.
- `region_matrix_corpus()` C++ code that would result from not considering that query matches may cover more than one struc of a structural attribute.
- `corpus_info_file()` does not crash if INFO is not defined in the registry file (#62).
- `sAttribute` and `pAttribute` as `s_attribute` or `p_attribute` respectively is now accompanied by a warning that the arguments are deprecated.
- `check_corpus()`
function distinguishes whether a corpus is loaded in the CL and/or CQP context.
- `cwb_huffcode()` and `cwb_compress_rdx()` have argument `delete` to trigger deleting redundant files after compression (#60).
- `cqp_load_corpus` will internally convert the corpus ID to upper case, as required in the CQP context (#64).
- `corpus_data_dir()` did not work as intended without explicitly setting the `registry` argument. Fixed.
- `corpus_info_file()`, `corpus_full_name()`, `corpus_p_attributes()`, `corpus_s_attributes()`, `corpus_properties()` and `corpus_property()` to retrieve registry file data.
- `corpus_registry_dir()`.
- `cwb_charsets()` reports the charsets supported by CWB.
- `cl_load_corpus()` and `cqp_load_corpus()` do what the function names suggest.
- `cl_list_corpora()` complements the existing function `cqp_list_corpora()` for the CL context.
- `skip_blank_lines`, `strip_whitespace` and `xml` of `cwb_encode()` open up configuration options, overcoming the previously hard-coded equivalent of the command-line option “-xsB” (#38).
- `cpos_to_id()`, `.cl_find_corpus()` and `.cl_new_attribute()` are an entry point to passing around pointers, rather than re-creating objects whenever switching from R to C.
- `.s_attr()` and `.p_attr()` return pointers for an s- or p-attribute.
- `cl_*` are now available with a pointer as input (e.g. `cpos_to_id()`).
- `cqp_drop_subcorpus()`
function that had been disabled temporarily is usable again (#34).
- `cqp_query()` is now able to process subcorpora.
- `RcppCWB:::.cqp_subcropus()` will construct a subcorpus from a region matrix.
- `check_corpus()` does not reset the registry directory any more, but tries to load the checked corpus if it has not yet been loaded.
- `s_attr_relationship()` will detect whether two s-attributes are siblings, or in a descendent or ancestor relationship.
- `cwb_encode()`, `cwb_huffcode()`, `cwb_makeall()` and `cwb_compress_rdx()` now have an argument `quietly` to control the display of output messages. `cwb_encode()` has an argument `verbose` to control whether a counter of the number of tokens processed is displayed.
- `cwb_encode()`
to digest variations of path statements between macOS and Windows are addressed using a reliable normalization of paths with `fs::path()` (#48).
- `encoding` is checked for the validity of the encoding passed in (#34).
- `check_cpos()` issues a warning if argument `cpos` is `NULL` (#21).
- `cl_cpos2id()`, `cl_cpos2lbound()`, `cl_cpos2rbound()`, `cl_cpos2str()` and `cl_cpos2struc()` will return an empty, zero-length integer vector if argument `cpos` is `NULL` (#21).
- `check_corpus()` (used internally by many functions) resulted from slightly differing representations of otherwise identical paths. Using `fs::path()` for path normalization internally will avoid misleading warning messages.
- `cqp_get_registry()` will now return a `fs::path` object, as a safeguard for a consistent normalization of paths.
- `cl_delete_corpus()` will now (visibly) return a `logical` value.
- `cqp_load_corpus()` will return `FALSE` if the corpus has not been loaded successfully.
- `wrappers.cpp` into `cl.cpp`, `cqp.cpp` and `utils.cpp`, so that the code is organized more coherently, corresponding to the different logics.
- `check_cqp_query()` renamed to `check_query()` to avoid a conflict with a function defined in the polmineR package.
- `cqp_list_subcorpora()` returns a `character` vector. Previously, we just had obscure printed messages.
- `s_attribute_decode()` will not break if the s-attribute has no values (#54).
- `cl_struc2str()` and `cl_struc2cpos()` may now include negative values; the vectors returned will have `NA` values at the respective positions. The check against negative values in `check_strucs` is dropped accordingly.
- `cwb_encode()`
function did not declare structural attributes in the registry and mistakenly channeled output for the file to the terminal (#49). Fixed.
- `cwb_encode()` did not reset global variables, which resulted in a set of errors. Solved (#51).
- `cwb-huffcode.c`, `cwb-compress-rdx.c` and `cwb-makeall.c` was not in line with the CWB version of the rest of the code (v3.4.14 / SVN revision 1069) but rather v2.2.b99 or v3.0.0. All code changes up to v3.4.14 were reconstructed and implemented (#35). Note that `cwb-encode.c` was at CWB v3.4.14, as the encoding functionality was exposed at a later stage.
- `cwb_version()` will report the version of the CWB source code.
- `cwb_encode()` function now has a previously missing argument `encoding` to state the encoding of the corpus to be indexed.
- `cwb_encode()` now assumes implicitly that input files are XML files and removes blank lines and leading and trailing whitespace. This is equivalent to the option “-xsB” of the command line utility `cwb-encode`.
- `cwb_encode()` is now a patch of the `main()` function of `cwb-encode.c`, so that code in the *.cpp file can be limited to a slim wrapper, limiting the risk that the code in RcppCWB loses touch with CWB upstream development.
- `_eval.h`, `_globalvars.h` and `_cl.h` in the `./src` directory are autogenerated files now, not to be edited by hand.
- `cqp_drop_subcorpus()` function is temporarily disabled to ensure that the package can be built (#34).
- `check_corpus()` that would trigger resetting the registry unintentionally and potentially incorrectly.
- `use_tmp_dir()`, `normalizePath()` is applied on the `tempdir()` result to avoid confusion with symbolic links on macOS.
- `cwb_encode()`
(not yet run on Windows).
- `cqp_get_registry()` that would sometimes result in a wrong return value (i.e. registry path) has been fixed (#14).
- `cwb_makeall()`, an internal check is performed whether the corpus has been loaded already and whether the home directory of the loaded corpus and the one defined in the registry file are identical (#31).
- `cl_delete_corpus()` function crashed when trying to delete a corpus that has not been loaded (#33). The function now aborts gracefully, returning 0, when trying to delete a corpus that has not been loaded.
- `corpus_is_loaded()` can be used to check whether a corpus is loaded.
- `cwb_encode()` that exposes functionality of the cwb-encode CWB utility.
- `cl_cpos2lbound()` and `cl_cpos2rbound()` will now accept an integer vector with length > 1 as argument `cpos` and return a vector with the same length. Useful to speed up iterated queries for left and right boundaries of regions (#19).
- `cl_struc_values()` exposes the corresponding C function of the Corpus Library (CL). The previous implicit assumption that all structural attributes have values can thus be tested. Intended to work with annotations of sentences and paragraphs, i.e. common structural attributes that usually do not have values.
- `corpus_data_dir()` will derive the data directory from the internal C representation of a corpus.
- `s_attr_regions()` will derive regions defined by a structural attribute from the *.rng file. Fastest option for large corpora.
- `s_attr_is_sibling()` and `s_attr_is_descendent()` test the sibling/descendent relationship of structural attributes.
- `check_corpus()` now includes checks whether the registry provided (argument `registry`) is identical with the registry defined internally by CQP. The registry is reset if the directories are not identical.
- `s_attribute_decode()` method was incomplete for method “Rcpp”. This alternative to the “pure R” approach is now implemented (#2).
- `method` previously setting “wininet” in ./tools/winlibs.R is omitted to avoid the warning “the ‘wininet’ method is deprecated for http:// and https:// URLs” on Windows.
- `pcre-config` to locate header files of PCRE.
- `cqp_initialize()`)
- `get_tmp_registry()` will return the whereabouts of this directory.
- `check_corpus()` function. Problems with the previous implementation that relied on files in the registry directory to ensure the presence of a corpus hopefully no longer occur.
- `cl_charset_name()` is exposed; it will return the charset of a corpus. Faster than parsing the registry file again and again.
- `cl_delete_corpus()` function can remove loaded corpora from memory.