vignettes/vignette.Rmd
vignette.Rmd
The cwbtools package offers a toolset to create, modify and manage corpora to be used with the Corpus Workbench (CWB) from within R. It supports the transition from data formats established by well-known R packages such as tm, quanteda or tidytext to a CWB corpus, so that the CWB data format and CWB-related tools can be used.
Working with CWB indexed corpora is worth considering when working
with large corpora. The cwbtools package is designed to be able
to work with large data. The core tool of the cwbtools package is the
CorpusData
-Class. It offers a standard workflow to import
and process data, and to generate an indexed corpus. It is implemented
as a R6 class. The
advantage of using the reference semantics of the R6 class system is
that there is minimal copying of the data. Handling memory
parsimoniously is necessary when working with large corpora.
Further functions of the package support the creation, modification and management of structural and positional attributes of a corpus, and of registry files. See the manual to learn more about these functions. To learn more about the data structure of the CWB and the CWB jargon, the IMS Open Corpus Workbench (CWB) Corpus Encoding Tutorial is recommended as a point of entry.
The focus of this vignette is the creation of CWB indexed corpora. We
present three scenarios: Creating a corpus (a) from a
tibble
, (b) from XML and (c) from the VCorpus
used by the tm package.
We start with a few preliminary preparations.
The CWB stores each indexed corpora in an individual data directory. Registry files in a ‘registry directory’ (or ‘registry’ in short) describe the indexed corpora and include the path to the data directory.
For the following examples, we create a temporary directory structure for registry files, and indexed corpora, making sure that these directories exits and are empty.
registry_tmp <- fs::path(tempdir(), "registry")
data_dir_tmp <- fs::path(tempdir(), "data_dir")
if (!file.exists(registry_tmp)){
dir.create (registry_tmp)
} else {
file.remove(list.files(registry_tmp, full.names = TRUE))
}
if (!file.exists(data_dir_tmp)) dir.create(data_dir_tmp)
In addition to cwbtools, we rely extensively on the data.table package. It is a good companion to cwbtools because of its efficiency to handle large data. Apart from the processing speed, it supports in-place modifications of data. When corpora grow large, it is advisable to omit copying the data in memory if not absolutely necessary.
Core operations on structural and positional attributes are
implemented in “pure R”, but the CWB can be downloaded and stored within
the package using the cwb_install()
function. In the
following examples, we will rely on the “pure R” approach.
As a basic scenario, we create a corpus of Jane Austen’s books (included in the package janeaustenr). We first create an (empty) data directory …
austen_data_dir_tmp <- fs::path(data_dir_tmp, "austen")
if (!file.exists(austen_data_dir_tmp)) dir.create(austen_data_dir_tmp)
file.remove(list.files(austen_data_dir_tmp, full.names = TRUE))
… and instantiating a CorpusData
object.
Austen <- CorpusData$new()
The tidytext
package offers an efficient workflow to
create the token stream we can assign to the CorpusData
object.
tbl <- tidytext::unnest_tokens(
janeaustenr::austen_books(),
word, text, to_lower = FALSE
)
Austen$tokenstream <- as.data.table(tbl)
To demonstrate that we can use an additional positional attribute, we stem the tokens using the SnowballC package.
Austen$tokenstream[, stem := SnowballC::wordStem(Austen$tokenstream[["word"]], language = "english")]
In this case, (zero-based!) corpus positions need to be assigned explicitly.
Austen$tokenstream[, cpos := 0L:(nrow(tbl) - 1L)]
We can now create (and inspect) the table with the structural attributes.
cpos_max_min <- function(x)
list(cpos_left = min(x[["cpos"]]), cpos_right = max(x[["cpos"]]))
Austen$metadata <- Austen$tokenstream[, cpos_max_min(.SD), by = book]
Austen$metadata[, book := as.character(book)]
setcolorder(Austen$metadata, c("cpos_left", "cpos_right", "book"))
head(Austen$tokenstream)
## book word stem cpos
## <fctr> <char> <char> <int>
## 1: Sense & Sensibility SENSE SENSE 0
## 2: Sense & Sensibility AND AND 1
## 3: Sense & Sensibility SENSIBILITY SENSIBILITi 2
## 4: Sense & Sensibility by by 3
## 5: Sense & Sensibility Jane Jane 4
## 6: Sense & Sensibility Austen Austen 5
A few finishing touches and we have the token stream ready to be
encoded: We remove the metadata from the table in the
tokenstream
field and order the columns nicely.
Austen$tokenstream[, book := NULL]
setcolorder(Austen$tokenstream, c("cpos", "word", "stem"))
Austen$tokenstream
## cpos word stem
## <int> <char> <char>
## 1: 0 SENSE SENSE
## 2: 1 AND AND
## 3: 2 SENSIBILITY SENSIBILITi
## 4: 3 by by
## 5: 4 Jane Jane
## ---
## 725060: 725059 in in
## 725061: 725060 its it
## 725062: 725061 national nation
## 725063: 725062 importance import
## 725064: 725063 Finis Fini
Ready to encode the corpus!
Austen$encode(
corpus = "AUSTEN",
encoding = "utf8",
p_attributes = c("word", "stem"),
s_attributes = "book",
registry_dir = registry_tmp,
data_dir = austen_data_dir_tmp,
method = "R",
compress = FALSE,
reload = TRUE
)
This is a rudimentary check (using low-level RcppCWB functions) whether to corpus can be used. How often does the token “pride” occur?
ids <- RcppCWB::cl_str2id(
corpus = "AUSTEN",
p_attribute = "word",
str = "pride",
registry = registry_tmp
)
cpos <- RcppCWB::cl_id2cpos(
corpus = "AUSTEN",
p_attribute = "word",
id = ids,
registry = registry_tmp
)
length(cpos)
## [1] 113
The second example is to turn a small sample corpus of debates in the UN General Assembly into an indexed corpus. The package includes some XML files that follow a standard suggested by the Text Encoding Initiative (TEI).
teidir <- system.file(package = "cwbtools", "xml", "UNGA")
teifiles <- list.files(teidir, full.names = TRUE)
list.files(teidir)
## [1] "N9986497.xml" "N9986515.xml" "N9986521.xml" "N9986551.xml" "N9986557.xml"
## [6] "N9986569.xml" "N9986587.xml"
For our example, we need an (empty) data directory for the UNGA corpus.
unga_data_dir_tmp <- fs::path(data_dir_tmp, "unga")
if (!file.exists(unga_data_dir_tmp)) dir.create(unga_data_dir_tmp)
file.remove(list.files(unga_data_dir_tmp, full.names = TRUE))
The point of departure then is to create a CorpusData
object that will serve as a processor for these files and storage
facility for the corpus data. The central fields of the class are named
chunktable
, tokenstream
and
metadata
. When we inspect the new object at the outset, we
will see that these fields are not filled initially.
UNGA <- CorpusData$new()
UNGA
## chunktable: NULL
## tokenstream: NULL
## metadata: NULL
To turn the XML files into the
CorpusData object, we use the method
$import_xml()`. To be
able to add metadata from the header to the metadata table, the method
requires a named vector of XPath expressions used to find the metadata
within the XML document.
metadata <- c(
doc_lp = "//legislativePeriod",
doc_session = "//titleStmt/sessionNo",
doc_date = "//publicationStmt/date",
doc_url = "//sourceDesc/url",
doc_src = "//sourceDesc/filetype"
)
UNGA$import_xml(filenames = teifiles, meta = metadata)
UNGA
## chunktable: 2 columns / 1148 rows
## tokenstream: NULL
## metadata: 20 columns / 1148 rows
The input XML files are TEI files. Speaker information is included in the attributes of a tag named ‘sp’. To maintain the original content, there is a element ‘speaker’ in the document that includes the information on the speaker call without having parsed it. It is not text spoken in the debate, so we consider it as noise to be removed.
to_keep <- which(is.na(UNGA$metadata[["speaker"]]))
UNGA$chunktable <- UNGA$chunktable[to_keep]
UNGA$metadata <- UNGA$metadata[to_keep][, speaker := NULL]
The CWB requires a tokenstream as input to generate positional attributes. NLP tools will offer lemmatization, part-of-speech-recognition and more. To keep things simple, we perform a very simple tokenization that relies on the tokenizers package.
UNGA$tokenize(lowercase = FALSE, strip_punct = FALSE)
UNGA
Let us see how it looks like now …
UNGA$tokenstream
## id word cpos
## <int> <char> <int>
## 1: 2 I 0
## 2: 2 should 1
## 3: 2 like 2
## 4: 2 to 3
## 5: 2 inform 4
## ---
## 127096: 1148 1 127095
## 127097: 1148 p 127096
## 127098: 1148 . 127097
## 127099: 1148 m 127098
## 127100: 1148 . 127099
The CorpusData
object now includes a table with the
token stream, and we are ready to import the corpus into the CWB. We use
the $encode()
method for this purpose. Note that the
workers are the p_attribute_encode
function (to encode the
token stream), and the s_attribute_encode
function (to
encode the structural attributes / the metadata).
UNGA$encode(
registry_dir = registry_tmp,
data_dir = unga_data_dir_tmp,
corpus = "UNGA",
encoding = "utf8",
method = "R",
p_attributes = "word",
s_attributes = c(
"doc_lp", "doc_session", "doc_date",
"sp_who", "sp_state", "sp_role"
),
compress = FALSE
)
In this example, the logical parameter compress
is
FALSE
. If TRUE
, a so-called “huffcode””
compression if performed., which will reduce corpus size and speed up
queries. For big corpora, compression makes sense, but it can be
time-consuming. If you want to create a corpus experimentally and do not
yet need optimization, performing the compression can be postponed or
omitted.
The indexed corpus has now been prepared and can be used with CQP, CQPweb, or – if you like R – with a package such as polmineR. To see quickly whether everything has worked out as intended (and to keep installation requirements at a minimum), we use the low-level functionality of the RcppCWB package.
id_peace <- RcppCWB::cl_str2id(
corpus = "UNGA",
p_attribute = "word",
str = "peace",
registry = registry_tmp
)
cpos_peace <- RcppCWB::cl_id2cpos(
corpus = "UNGA",
p_attribute = "word",
id = id_peace,
registry = registry_tmp
)
tab <- data.frame(
i = unlist(lapply(1:length(cpos_peace), function(x) rep(x, times = 11))),
cpos = unlist(lapply(cpos_peace, function(x) (x - 5):(x + 5)))
)
tab[["id"]] <- RcppCWB::cl_cpos2id(
corpus = "UNGA", p_attribute = "word",
cpos = tab[["cpos"]], registry = registry_tmp
)
tab[["str"]] <- RcppCWB::cl_id2str(
corpus = "UNGA", p_attribute = "word",
id = tab[["id"]], registry = registry_tmp
)
peace_context <- split(tab[["str"]], as.factor(tab[["i"]]))
peace_context <- unname(sapply(peace_context, function(x) paste(x, collapse = " ")))
head(peace_context)
## [1] "the breast of a tranquil peace for all of the world"
## [2] "earth in the rapture of peace , justice and safety ."
## [3] "community are committed to achieving peace , security and prosperity for"
## [4] "components of post - conflict peace - building , given the"
## [5] "removal and the consolidation of peace and mutual trust between neighbouring"
## [6] "and the promotion of durable peace and sustainable development in Africa"
In the third scenario, we will make the transition from a
VCorpus
(tm package) to a CWB indexed corpus. We use the
Reuters corpus that is included as sample data in the tm package.
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
reuters.tm <- VCorpus(
DirSource(reut21578),
list(reader = readReut21578XMLasPlain)
)
A quick way to attain the data format required for the
CorpusData
class is to coerce the VCorpus
to a
tibble
, using respective functionality of the tidytext
package.
## # A tibble: 20 × 17
## author datetimestamp description heading id language origin topics
## <chr> <dttm> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NA 1987-02-26 17:00:56 "" DIAMON… 127 en Reute… YES
## 2 BY TED … 1987-02-26 17:34:11 "" OPEC M… 144 en Reute… YES
## 3 NA 1987-02-26 18:18:00 "" TEXACO… 191 en Reute… YES
## 4 NA 1987-02-26 18:21:01 "" MARATH… 194 en Reute… YES
## 5 NA 1987-02-26 19:00:57 "" HOUSTO… 211 en Reute… YES
## 6 NA 1987-03-01 03:25:46 "" KUWAIT… 236 en Reute… YES
## 7 By Jere… 1987-03-01 03:39:14 "" INDONE… 237 en Reute… YES
## 8 NA 1987-03-01 05:27:27 "" SAUDI … 242 en Reute… YES
## 9 NA 1987-03-01 08:22:30 "" QATAR … 246 en Reute… YES
## 10 NA 1987-03-01 18:31:44 "" SAUDI … 248 en Reute… YES
## 11 NA 1987-03-02 01:05:49 "" SAUDI … 273 en Reute… YES
## 12 NA 1987-03-02 07:39:23 "" GULF A… 349 en Reute… YES
## 13 NA 1987-03-02 07:43:22 "" SAUDI … 352 en Reute… YES
## 14 NA 1987-03-02 07:43:41 "" KUWAIT… 353 en Reute… YES
## 15 NA 1987-03-02 08:25:42 "" PHILAD… 368 en Reute… YES
## 16 NA 1987-03-02 11:20:05 "" STUDY … 489 en Reute… YES
## 17 NA 1987-03-02 11:28:26 "" STUDY … 502 en Reute… YES
## 18 NA 1987-03-02 12:13:46 "" UNOCAL… 543 en Reute… YES
## 19 By BERN… 1987-03-02 14:38:34 "" NYMEX … 704 en Reute… YES
## 20 NA 1987-03-02 14:49:06 "" ARGENT… 708 en Reute… YES
## # ℹ 9 more variables: lewissplit <chr>, cgisplit <chr>, oldid <chr>,
## # topics_cat <named list>, places <named list>, people <chr>, orgs <chr>,
## # exchanges <chr>, text <chr>
Columns with the metadata we may want to encode are still character vectors (length > 1). We turn the columns with the topic categorizations and the places into length-one character vectors.
topics <- sapply(reuters.tbl[["topics_cat"]], paste, collapse = "|")
places <- sapply(reuters.tbl[["places"]], paste, collapse = "|")
reuters.tbl[["topics"]] <- topics
reuters.tbl[["places"]] <- places
This is the input we need. We instantiate a CorpusData
object end make sure that a new data directory for the REUTERS corpus
exists (and is empty) …
Reuters <- CorpusData$new()
reuters_data_dir_tmp <- fs::path(data_dir_tmp, "reuters")
if (!file.exists(reuters_data_dir_tmp)) dir.create(reuters_data_dir_tmp)
file.remove(list.files(reuters_data_dir_tmp, full.names = TRUE))
We assign the chunktable and the metadata, coerced to a
data.table
, to the object …
Reuters$chunktable <- data.table(reuters.tbl[, c("id", "text")])
Reuters$metadata <- data.table(reuters.tbl[,c("id", "topics", "places")])
Reuters
## chunktable: 2 columns / 20 rows
## tokenstream: NULL
## metadata: 3 columns / 20 rows
… we tokenize the text to gain the token stream table …
Reuters$tokenize()
… we have a look at the token stream …
Reuters$tokenstream
## id word cpos
## <char> <char> <int>
## 1: 127 diamond 0
## 2: 127 shamrock 1
## 3: 127 corp 2
## 4: 127 said 3
## 5: 127 that 4
## ---
## 4057: 708 yacimientos 4056
## 4058: 708 petroliferos 4057
## 4059: 708 fiscales 4058
## 4060: 708 added 4059
## 4061: 708 reuter 4060
It seems that we can encode the corpus.
Reuters$encode(
corpus = "REUTERS",
encoding = "utf8",
p_attributes = "word",
s_attributes = c("topics", "places"),
registry_dir = registry_tmp,
data_dir = reuters_data_dir_tmp,
method = "R",
compress = FALSE
)
The REUTERS corpus is about oil production in the Middle East. To check quickly, whether it works, we count the number of occurrences of the word “oil”.
ids <- RcppCWB::cl_str2id(
corpus = "REUTERS",
p_attribute = "word",
str = "oil",
registry = registry_tmp
)
cpos <- RcppCWB::cl_id2cpos(
corpus = "REUTERS",
p_attribute = "word",
id = ids,
registry = registry_tmp
)
length(cpos)
## [1] 86
Finally, we show how to add a further p-attribute to an existing corpus, using the REUTERS corpus we just created.
The initial step here is to decode the entire token stream. It could
be done easily using polmineR::get_token_stream()
, but to
avoid creating a further dependency, we use basic RcppCWB
functionality.
reuters_size <- RcppCWB::attribute_size(
corpus = "REUTERS",
registry = registry_tmp,
attribute = "word",
attribute_type = "p"
)
ids <- RcppCWB::cl_cpos2id(
corpus = "REUTERS",
registry = registry_tmp,
p_attribute = "word",
cpos = 0L:(reuters_size - 1L)
)
token_stream <- RcppCWB::cl_id2str(
corpus = "REUTERS",
registry = registry_tmp,
p_attribute = "word",
id = ids
)
We then generate the stemmed token stream …
stemmed <- SnowballC::wordStem(token_stream, language = "en")
… and add this to the corpus.
p_attribute_encode(
token_stream = stemmed,
p_attribute = "stem",
encoding = "utf8",
corpus = "REUTERS",
registry_dir = registry_tmp,
data_dir = reuters_data_dir_tmp,
method = "R",
verbose = TRUE,
quietly = TRUE,
compress = FALSE
)
## ℹ creating indices (in memory)
## ✔ creating indices (in memory) [178ms]
##
## ℹ writing file: stem.corpus
## ✔ writing file: stem.corpus [527ms]
##
## ℹ writing file: stem.lexicon
## ✔ writing file: stem.lexicon [18ms]
##
## ℹ writing file: stem.lexicon.idx
## ✔ writing file: stem.lexicon.idx [17ms]
##
## ℹ updating registry file: /tmp/RtmpI6aoTF/registry/reuters
## ℹ run `Rcpp::cwb_makeall()`
## ✔ run `Rcpp::cwb_makeall()` [7ms]
##
## ✔ corpus reloaded: CL success / CQP success
This is a very basic demo: Obviously, you could use more advanced NLP tools to generate a part-of-speech annotation, lemmatization etc, and add it to an existing corpus.