Create a subcorpus and keep it in an object of the partition class. If defined, counts are performed for the p-attribute defined by the parameter p_attribute.
partition(.Object, ...)

# S4 method for character
partition(
  .Object,
  def = NULL,
  name = "",
  encoding = NULL,
  p_attribute = NULL,
  regex = FALSE,
  xml = "flat",
  decode = TRUE,
  type = get_type(.Object),
  mc = FALSE,
  verbose = TRUE,
  ...
)

# S4 method for environment
partition(.Object, slots = c("name", "corpus", "size", "p_attribute"))

# S4 method for partition
partition(
  .Object,
  def = NULL,
  name = "",
  regex = FALSE,
  p_attribute = NULL,
  decode = TRUE,
  xml = NULL,
  verbose = TRUE,
  mc = FALSE,
  ...
)

# S4 method for context
partition(.Object, node = TRUE)

# S4 method for remote_corpus
partition(.Object, ...)

# S4 method for remote_partition
partition(.Object, ...)
Argument | Description |
---|---|
.Object | A length-one character-vector, the CWB corpus to be used. |
... | Arguments to define partition (see examples). If |
def | A named list of character vectors of s-attribute values; the names of the list are the s-attributes (see details and examples). |
name | A name for the new |
encoding | The encoding of the corpus (typically "LATIN1" or "UTF-8"). If NULL, the encoding provided in the registry file of the corpus (charset="...") will be used. |
p_attribute | The p-attribute(s) for which a count is performed. |
regex | A logical value (defaults to FALSE), whether the s-attribute values are interpreted as regular expressions (see details). |
xml | Either 'flat' (default) or 'nested'. |
decode | Logical, whether to turn token ids to strings (set FALSE to minimize object size / memory consumption) in data.table with counts. |
type | A length-one character vector specifying the type of corpus / partition (e.g. "plpr") |
mc | Whether to use multicore (for counting terms). |
verbose | Logical, whether to be verbose. |
slots | Object slots that will be reported columns of |
node | A logical value, whether to include the node (i.e. query matches) in the region matrix generated when creating a |
An object of the S4 class partition.
The function sets up a partition object based on s-attribute values. The s-attributes defining the partition can be passed in as a list, e.g. list(interjection = "speech", year = "2013"), or directly (see examples).
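The two ways of passing s-attributes can be sketched as follows. This is a minimal illustration that assumes polmineR is installed and that `use("polmineR")` makes the GERMAPARLMINI demo corpus available; the object names `p_list` and `p_dots` are chosen here for illustration only.

```r
library(polmineR)
use("polmineR")  # assumption: activates the GERMAPARLMINI demo corpus

# s-attributes passed as a named list via the def argument
p_list <- partition(
  "GERMAPARLMINI",
  def = list(party = "SPD", interjection = "speech"),
  verbose = FALSE
)

# the same s-attributes passed directly through the three dots
p_dots <- partition(
  "GERMAPARLMINI",
  party = "SPD", interjection = "speech",
  verbose = FALSE
)

# both calls describe the same subcorpus, so the sizes should agree
size(p_list) == size(p_dots)
```

Passing a list is convenient when the definition is built programmatically; passing the s-attributes directly is shorter in interactive use.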
The s-attribute values defining the partition may use regular expressions. To use regular expressions, set the parameter regex to TRUE. Regular expressions are passed into grep, i.e. the regex syntax used in R needs to be used (double backslashes etc.). If regex is FALSE, the length of the character vectors can be > 1; matching s-attributes are identified with the operator '%in%'.
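The difference between the two matching modes can be illustrated in base R. This is only a sketch of the matching behaviour described above, not the package's internal code; the vector `speakers` holds hypothetical s-attribute values.

```r
# hypothetical s-attribute values, for illustration only
speakers <- c("Angela Dorothea Merkel", "Volker Kauder", "Dr. Guido Westerwelle")

# regex = TRUE: values are treated as patterns, as with grep()
regex_hits <- grep(".*Merkel", speakers, value = TRUE)

# regex metacharacters need a double backslash inside an R string
escaped_hits <- grep("^Dr\\. ", speakers, value = TRUE)

# regex = FALSE: one or more literal values, matched with %in%
exact_hits <- speakers[speakers %in% c("Volker Kauder", "Angela Dorothea Merkel")]

regex_hits
escaped_hits
exact_hits
```

With literal matching, a vector of several values selects every region whose s-attribute equals any of them; with regular expressions, a single pattern can cover spelling variants.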
The XML imported into the CWB may be "flat" or "nested". This needs to be indicated with the parameter xml (default is "flat"). If you generate a partition based on a flat XML structure, some performance gain may be achieved when ordering the s-attributes with decreasingly restrictive conditions. If you have a nested XML, it is mandatory that the order of the s-attributes provided reflects the hierarchy of the XML: The top-level elements need to be positioned at the beginning of the list with the s-attributes, the most restrictive elements at the end.
If p_attribute is not NULL, a count of tokens in the corpus will be performed and kept in the stat-slot of the partition-object. The length of the p_attribute character vector may be 1 or more. If two or more p-attributes are provided, the occurrence of combinations will be counted. A typical scenario is to combine the p-attributes "word" or "lemma" and "pos".
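What counting combinations of p-attributes means can be sketched in base R. The toy token stream below is hypothetical, and the code only mimics the idea of the count; it is not how the package computes the table kept in the stat-slot.

```r
# hypothetical token stream with two p-attributes, for illustration only
tokens <- data.frame(
  word = c("die", "Rede", "die", "Rede", "reden"),
  pos  = c("ART", "NN", "ART", "NN", "VVINF"),
  stringsAsFactors = FALSE
)

# add a helper column of ones and sum it per word/pos combination
tokens$count <- 1L
counts <- aggregate(count ~ word + pos, data = tokens, FUN = sum)

# most frequent combinations first
counts[order(-counts$count), ]
```

In a call to partition, the corresponding request would be p_attribute = c("word", "pos"), so that each row of the count refers to one combination of a word form and a part-of-speech tag.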
If .Object is a length-one character vector, a subcorpus/partition for the corpus defined by .Object is generated.
If .Object is an environment (typically .GlobalEnv), the partition objects present in the environment are listed.
If .Object is a partition object, a subcorpus of the subcorpus is generated.
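Narrowing an existing partition can be sketched as follows. The example assumes polmineR is installed and that `use("polmineR")` makes the GERMAPARLMINI demo corpus available; the names `spd` and `spd_speech` are illustrative.

```r
library(polmineR)
use("polmineR")  # assumption: activates the GERMAPARLMINI demo corpus

# first step: a partition of the corpus
spd <- partition("GERMAPARLMINI", party = "SPD", verbose = FALSE)

# second step: a partition of the partition, adding a further condition
spd_speech <- partition(spd, interjection = "speech", verbose = FALSE)

# the narrowed partition cannot be larger than the one it was derived from
size(spd_speech) <= size(spd)
```

Deriving partitions step by step can be handy when several subcorpora share a common base definition.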
To learn about the methods available for objects of the class partition, see partition_class.
Andreas Blaette
library(polmineR)
use("polmineR") # make the GERMAPARLMINI demo corpus available

# partitions defined by one or more s-attributes
spd <- partition("GERMAPARLMINI", party = "SPD", interjection = "speech")
kauder <- partition("GERMAPARLMINI", speaker = "Volker Kauder", p_attribute = "word")

# regular expression for the speaker name
merkel <- partition("GERMAPARLMINI", speaker = ".*Merkel", p_attribute = "word", regex = TRUE)
s_attributes(merkel, "date")
#> [1] "2009-10-28" "2009-11-10"
s_attributes(merkel, "speaker")
#> [1] "Angela Dorothea Merkel"

# several s-attributes combined, with a count of word forms
merkel <- partition(
  "GERMAPARLMINI", speaker = "Angela Dorothea Merkel",
  date = "2009-11-10", interjection = "speech", p_attribute = "word"
)

# drop punctuation and German stopwords from the count
merkel <- subset(merkel, !word %in% punctuation)
merkel <- subset(merkel, !word %in% tm::stopwords("de"))

# a certain defined time segment
days <- seq(
  from = as.Date("2009-10-28"),
  to = as.Date("2009-11-11"),
  by = "1 day"
)
period <- partition("GERMAPARLMINI", date = days)