Files resulting from tagging/annotation may violate the requirements of the Corpus Workbench (CWB). This method consolidates the corrections for the known issues that such vrt files may cause.

as.vrt(x, replacements = list())

Arguments

x

a character vector providing a directory with vrt files

replacements

a list of character vectors of length 2, each pairing a regular expression with its replacement
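
As an illustration of the expected shape of this argument, the sketch below builds a list of pattern/replacement pairs and applies them in order with gsub(). The helper apply_replacements() is hypothetical and not part of the package; how as.vrt() applies the pairs internally is an assumption here.

```r
# Each element is c(regular expression, replacement).
replacements <- list(
  # Escape ampersands that are not already part of a predefined XML entity.
  c("&(?!amp;|lt;|gt;|quot;|apos;)", "&amp;"),
  # Drop trailing spaces at the end of a line.
  c(" +$", "")
)

# Hypothetical helper: apply every pair, in order, to a character vector.
apply_replacements <- function(lines, replacements) {
  for (pair in replacements) {
    lines <- gsub(pair[1], pair[2], lines, perl = TRUE)
  }
  lines
}

apply_replacements(c("Tom & Jerry\tNP  ", "A &amp; B"), replacements)
```

A list of this shape matches the signature above and could be passed as as.vrt(x, replacements = replacements).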

Details

Known issues resulting from annotating files (with the TreeTagger in particular) include whitespace characters that are invalid for XML, XML elements placed at the end of a line rather than on a separate line, and characters that are invalid for XML (such as ampersands), inter alia.
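
Two of these issues can be sketched in base R as follows. The function names are hypothetical, and the regular expressions are assumptions about what a correction might look like rather than the ones the method uses.

```r
# Remove control characters that XML 1.0 does not allow
# (tab, line feed and carriage return are kept).
strip_invalid_whitespace <- function(lines) {
  gsub("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F]", "", lines, perl = TRUE)
}

# Move an XML tag that trails other content onto a line of its own.
# Note that strsplit() also drops lines that are entirely empty.
split_trailing_tags <- function(lines) {
  fixed <- gsub("([^\n>])(</?[A-Za-z][^<>]*>)\\s*$", "\\1\n\\2", lines, perl = TRUE)
  unlist(strsplit(fixed, "\n", fixed = TRUE))
}

lines <- c("<s>", "Das\tART\tdie", "Ha\vus\tNN\tHaus</s>")
split_trailing_tags(strip_invalid_whitespace(lines))
```

In this example the vertical tab inside the last token is removed and the closing </s> ends up on its own line, which is the one-item-per-line layout CWB expects.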

Before applying these corrections, the method tests whether the files contain any text at all. Empty files (files that contain nothing but XML tags) are dropped.
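
The emptiness test could look roughly like the sketch below. has_text() is a hypothetical helper, and treating "dropped" as "excluded from further processing" is an assumption; the method may handle such files differently.

```r
# A file has text if at least one non-blank line is not purely an XML tag.
has_text <- function(lines) {
  lines <- trimws(lines)
  is_tag <- grepl("^<[^<>]+>$", lines)
  any(nzchar(lines) & !is_tag)
}

# Demo directory with one tag-only file and one file with a token line.
dir <- file.path(tempdir(), "vrt_demo")
dir.create(dir, showWarnings = FALSE)
writeLines(c("<text>", "</text>"), file.path(dir, "empty.vrt"))
writeLines(c("<text>", "Haus\tNN\tHaus", "</text>"), file.path(dir, "full.vrt"))

# Keep only the files that contain text.
files <- list.files(dir, pattern = "\\.vrt$", full.names = TRUE)
keep <- vapply(files, function(f) has_text(readLines(f, warn = FALSE)), logical(1))
basename(files[keep])
```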