Corpora indexed using the Corpus Workbench (CWB) offer an efficient data
structure for large, linguistically annotated corpora. The corpus class
keeps basic information on a CWB corpus. Corresponding to the name of the
class, the corpus() method is the initializer for objects of the corpus
class. A CWB corpus can also be hosted remotely on an OpenCPU server. The
remote_corpus class (which inherits from the corpus class) holds the
corresponding information. A (limited) set of polmineR functions and
methods can be executed on the corpus on the remote machine from the local
R session by calling them on the remote_corpus object. Calling the corpus()
method without an argument returns a data.frame with basic information on
the corpora that are available.
    # S4 method for character
    corpus(.Object, server = NULL, restricted)

    # S4 method for missing
    corpus()
Argument | Description |
---|---|
.Object | The upper-case ID of a CWB corpus, stated as a length-one character vector. |
server | If … |
restricted | A … |
Calling corpus() returns a data.frame listing the corpora that are
available locally and described in the active registry directory, along
with some basic information on each corpus.
A corpus object is instantiated by passing a corpus ID as argument .Object.
Following the conventions of the Corpus Workbench (CWB), corpus IDs are
written in upper case. If .Object includes lower-case letters, the corpus
object is instantiated nevertheless, but a warning is issued to discourage
bad practice. If .Object is not a known corpus, the error message will
include a suggestion if a potential candidate can be identified by agrep.
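The suggestion mechanism described above can be illustrated with base R alone. This is a minimal sketch of approximate matching with agrep, not polmineR's actual implementation: the helper suggest_corpus() and the vector of known IDs are hypothetical and exist only for this example.

```r
# Hypothetical helper (not part of polmineR): propose a known corpus ID
# for an input that does not match any ID exactly.
suggest_corpus <- function(id, known_ids) {
  if (id %in% known_ids) return(id)
  # Approximate matching; allow at most one insertion, deletion or
  # substitution, and ignore case so that lower-case input still matches.
  candidates <- agrep(
    id, known_ids,
    max.distance = 1, ignore.case = TRUE, value = TRUE
  )
  if (length(candidates) > 0L) candidates[1] else NA_character_
}

# Assumed set of locally available corpora, for illustration only.
known <- c("REUTERS", "GERMAPARLMINI")

suggestion <- suggest_corpus("REUTERSS", known)
if (!is.na(suggestion)) {
  message("Corpus 'REUTERSS' not found. Did you mean '", suggestion, "'?")
}
```

Returning NA_character_ when nothing is close enough lets the caller decide whether to add a hint to the error message or not.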
A limited set of methods of the polmineR package is exposed to be executed
on a remote OpenCPU server. As a matter of convenience, the whereabouts of
an OpenCPU server hosting a CWB corpus can be stated in an environment
variable "OPENCPU_SERVER". Environment variables for R sessions can be set
easily in the .Renviron file. A convenient way to do this is to call
usethis::edit_r_environ().
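As a rough illustration of the environment variable approach, the following base R sketch sets OPENCPU_SERVER for the current session and reads it back. The URL is a placeholder assumed for this example, not a real server; for a permanent setting, the same name-value pair would go on a line of its own in .Renviron.

```r
# Set the variable for the current session only.
# The URL is a made-up placeholder; replace it with your server's address.
Sys.setenv(OPENCPU_SERVER = "http://localhost:8005")

# Read it back; `unset = NA` makes a missing variable easy to detect.
server <- Sys.getenv("OPENCPU_SERVER", unset = NA)

if (is.na(server) || !nzchar(server)) {
  message("OPENCPU_SERVER is not set, only local corpora can be used.")
} else {
  message("Remote corpora will be requested from: ", server)
}
```

The value obtained this way is what the remote examples on this page pass as the server argument of corpus().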
corpus
A length-one character vector, the upper-case ID of a CWB corpus.
data_dir
The directory where the files for the indexed corpus are.
type
The type of the corpus (e.g. "plpr" for a corpus of plenary protocols).
name
An additional name for the object that may be more telling than the corpus ID.
encoding
The encoding of the corpus, given as a length-one character vector.
size
Number of tokens (size) of the corpus, a length-one integer vector.
server
The URL (can be IP address) of the OpenCPU server. The slot is available only with the remote_corpus class inheriting from the corpus class.
user
If the corpus on the server requires authentication, the username.
password
If the corpus on the server requires authentication, the password.
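To make the relationship between the two classes more concrete, here is a toy S4 sketch that mirrors the slots listed above. The class names corpus_sketch and remote_corpus_sketch are hypothetical stand-ins, the slot types are assumptions where the list does not state them, placing user and password on the remote class is likewise an assumption, and all slot values are invented. It is not polmineR's class definition.

```r
library(methods)

# Hypothetical stand-in for the corpus class (slot types partly assumed).
setClass(
  "corpus_sketch",
  representation(
    corpus   = "character",
    data_dir = "character",
    type     = "character",
    name     = "character",
    encoding = "character",
    size     = "integer"
  )
)

# Hypothetical stand-in for remote_corpus: inherits all slots above and
# adds the server-related ones.
setClass(
  "remote_corpus_sketch",
  contains = "corpus_sketch",
  representation(
    server   = "character",
    user     = "character",
    password = "character"
  )
)

# All values below are invented for illustration.
x <- new(
  "remote_corpus_sketch",
  corpus = "REUTERS", encoding = "latin1", size = 1000L,
  server = "http://localhost:8005"
)

is(x, "corpus_sketch") # the remote object is also a corpus_sketch
x@corpus               # inherited slot
x@server               # slot only present on the remote class
```

Because of the inheritance, a method defined for the parent class is also dispatched for the remote class unless a more specific method exists.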
Methods to extract basic information from a corpus object are covered by
the corpus-methods documentation object. Use the s_attributes method to get
information on structural attributes. Analytical methods available for
corpus objects are size, count, dispersion, kwic, cooccurrences and
as.TermDocumentMatrix.
Other classes to manage corpora: phrases, regions, subcorpus
    library(polmineR)

    # get corpora present locally
    y <- corpus()

    # initialize corpus object
    r <- corpus("REUTERS")
    r <- corpus("reuters") # will work, but will result in a warning
    #> Warning: Using corpus 'REUTERS', not 'reuters' - note that corpus ids are expected to be in upper case throughout.

    # apply core polmineR methods
    a <- size(r)
    b <- s_attributes(r)
    c <- count(r, query = "oil")
    d <- dispersion(r, query = "oil", s_attribute = "id")
    e <- kwic(r, query = "oil")
    f <- cooccurrences(r, query = "oil")

    # use corpus initialization in a pipe
    y <- corpus("REUTERS") %>% s_attributes()
    y <- corpus("REUTERS") %>% count(query = "oil")

    # working with a remote corpus
    if (FALSE) {
      REUTERS <- corpus("REUTERS", server = Sys.getenv("OPENCPU_SERVER"))
      count(REUTERS, query = "oil")
      size(REUTERS)
      kwic(REUTERS, query = "oil")

      GERMAPARL <- corpus("GERMAPARL", server = Sys.getenv("OPENCPU_SERVER"))
      s_attributes(GERMAPARL)
      size(x = GERMAPARL)
      count(GERMAPARL, query = "Integration")
      kwic(GERMAPARL, query = "Islam")

      p <- partition(GERMAPARL, year = 2000)
      s_attributes(p, s_attribute = "year")
      size(p)
      kwic(p, query = "Islam", meta = "date")

      GERMAPARL <- corpus("GERMAPARLMINI", server = Sys.getenv("OPENCPU_SERVER"))
      s_attrs <- s_attributes(GERMAPARL, s_attribute = "date")
      sc <- subset(GERMAPARL, date == "2009-11-10")
    }