Create a subcorpus and store it in an object of the partition class. If the parameter p_attribute is defined, counts are performed for that p-attribute.

partition(.Object, ...)

# S4 method for character
partition(
  .Object,
  def = NULL,
  name = "",
  encoding = NULL,
  p_attribute = NULL,
  regex = FALSE,
  xml = "flat",
  decode = TRUE,
  type = get_type(.Object),
  mc = FALSE,
  verbose = TRUE,
  ...
)

# S4 method for environment
partition(.Object, slots = c("name", "corpus", "size", "p_attribute"))

# S4 method for partition
partition(
  .Object,
  def = NULL,
  name = "",
  regex = FALSE,
  p_attribute = NULL,
  decode = TRUE,
  xml = NULL,
  verbose = TRUE,
  mc = FALSE,
  ...
)

# S4 method for context
partition(.Object, node = TRUE)

# S4 method for remote_corpus
partition(.Object, ...)

# S4 method for remote_partition
partition(.Object, ...)

Arguments

.Object

A length-one character-vector, the CWB corpus to be used.

...

Arguments to define the partition (see examples). If .Object is a remote_corpus or remote_partition object, the three dots (...) are used to pass arguments. Hence, it is necessary to state the names of all arguments to be passed explicitly.

def

A named list of character vectors of s-attribute values; the names are the s-attributes (see details and examples).

name

A name for the new partition object, defaults to "".

encoding

The encoding of the corpus (typically "LATIN1" or "UTF-8"). If NULL, the encoding provided in the registry file of the corpus (charset="...") will be used.

p_attribute

The p-attribute(s) for which a count is performed.

regex

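A logical value (defaults to FALSE), whether the s-attribute values defining the partition are interpreted as regular expressions (see details).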

xml

Either 'flat' (default) or 'nested'.

decode

Logical, whether to turn token ids to strings (set FALSE to minimize object size / memory consumption) in data.table with counts.

type

A length-one character vector specifying the type of corpus / partition (e.g. "plpr")

mc

Whether to use multicore (for counting terms).

verbose

Logical, whether to be verbose.

slots

Object slots that will be reported as columns of the data.frame summarizing the partition objects in the environment.

node

A logical value, whether to include the node (i.e. query matches) in the region matrix generated when creating a partition from a context-object.

Value

An object of the S4 class partition.

Details

The function sets up a partition object based on s-attribute values. The s-attributes defining the partition can be passed in as a list, e.g. list(interjection="speech", year = "2013"), or directly (see examples).
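The two ways of passing s-attributes can be compared side by side. The following is a minimal sketch, assuming the GERMAPARLMINI sample corpus shipped with polmineR is available and that the size() method reports the number of tokens of a partition; it is an illustration, not a statement about how the function resolves its arguments internally.

```r
library(polmineR)
use("polmineR")  # activate the sample corpora shipped with the package

# s-attributes passed as a named list via the def argument
spd_list <- partition(
  "GERMAPARLMINI",
  def = list(party = "SPD", interjection = "speech")
)

# the same s-attributes passed directly through the three dots
spd_direct <- partition("GERMAPARLMINI", party = "SPD", interjection = "speech")

# both calls are expected to describe the same subcorpus
size(spd_list) == size(spd_direct)
```

The list form is convenient when the definition is assembled programmatically; the direct form is shorter for interactive use.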

The s-attribute values defining the partition may use regular expressions. To use regular expressions, set the parameter regex to TRUE. Regular expressions are passed into grep, i.e. the regex syntax used in R needs to be used (double backslashes etc.). If regex is FALSE, the length of the character vectors can be > 1, and matching s-attributes are identified with the operator '
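The difference between the two matching modes can be illustrated in base R alone. This is a sketch on a made-up vector of speaker names; using %in% for the exact-matching case is an assumption made for illustration and is not taken from the polmineR source code.

```r
# Toy s-attribute values (hypothetical; in polmineR they come from the corpus)
values <- c("Angela Dorothea Merkel", "Volker Kauder", "Dr. Frank-Walter Steinmeier")

# regex = FALSE: exact matching, several values may be supplied at once
exact <- values[values %in% c("Volker Kauder", "Angela Dorothea Merkel")]

# regex = TRUE: pattern matching as performed by grep()
by_regex <- grep(".*Merkel", values, value = TRUE)

# R string syntax requires a doubled backslash to escape the dot
with_title <- grep("^Dr\\.", values, value = TRUE)

exact
by_regex
with_title
```

Note that a pattern such as ".*Merkel" would not match anything under exact matching, because no value is literally equal to that string.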

The XML imported into the CWB may be "flat" or "nested". This needs to be indicated with the parameter xml (default is "flat"). If you generate a partition based on a flat XML structure, some performance gain may be achieved when ordering the s-attributes with decreasingly restrictive conditions. If you have a nested XML, it is mandatory that the order of the s-attributes provided reflects the hierarchy of the XML: the top-level elements need to be positioned at the beginning of the list of s-attributes, the most restrictive elements at the end.

If p_attribute is not NULL, a count of tokens in the corpus will be performed and kept in the stat-slot of the partition-object. The length of the p_attribute character vector may be 1 or more. If two or more p-attributes are provided, the occurrence of combinations will be counted. A typical scenario is to combine the p-attributes "word" or "lemma" and "pos".
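What counting combinations of p-attributes means can be shown with a toy token table in base R. The data below is invented, and aggregate() is used only for illustration; it is not the counting routine polmineR uses.

```r
# Hypothetical token stream annotated with two p-attributes
tokens <- data.frame(
  word = c("the", "run", "the", "run", "run"),
  pos = c("DET", "NOUN", "DET", "VERB", "VERB"),
  stringsAsFactors = FALSE
)
tokens$count <- 1L

# One row per observed word/pos combination, with its frequency
counts <- aggregate(count ~ word + pos, data = tokens, FUN = sum)
counts
```

Counting on "word" alone would merge the noun and verb readings of "run" into a single row; adding "pos" keeps them apart.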

If .Object is a length-one character vector, a subcorpus/partition for the corpus defined by .Object is generated.

If .Object is an environment (typically .GlobalEnv), the partition objects present in the environment are listed.

If .Object is a partition object, a subcorpus of the subcorpus is generated.
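Narrowing an existing partition can be sketched as follows. The example assumes the GERMAPARLMINI sample corpus and reuses the speaker and date values from the examples section; size() is assumed to return the number of tokens of a partition.

```r
library(polmineR)
use("polmineR")  # activate the sample corpora shipped with the package

# Step 1: partition created from the corpus name (character method)
merkel <- partition("GERMAPARLMINI", speaker = "Angela Dorothea Merkel")

# Step 2: subcorpus of the subcorpus (partition method)
merkel_day <- partition(merkel, date = "2009-11-10", interjection = "speech")

# the narrower partition cannot be larger than the one it was derived from
size(merkel_day) <= size(merkel)
```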

See also

To learn about the methods available for objects of the class partition, see partition_class.

Author

Andreas Blaette

Examples

use("polmineR")
#> ... activating corpus: GERMAPARLMINI (version: 0.0.1 | build date: 2019-02-23)
#> ... activating corpus: REUTERS
spd <- partition("GERMAPARLMINI", party = "SPD", interjection = "speech")
#> ... get encoding: latin1
#> ... get cpos and strucs
kauder <- partition("GERMAPARLMINI", speaker = "Volker Kauder", p_attribute = "word")
#> ... get encoding: latin1
#> ... get cpos and strucs
#> ... getting counts for p-attribute(s): word
merkel <- partition("GERMAPARLMINI", speaker = ".*Merkel", p_attribute = "word", regex = TRUE)
#> ... get encoding: latin1
#> ... get cpos and strucs
#> ... getting counts for p-attribute(s): word
s_attributes(merkel, "date")
#> [1] "2009-10-28" "2009-11-10"
s_attributes(merkel, "speaker")
#> [1] "Angela Dorothea Merkel"
merkel <- partition(
  "GERMAPARLMINI",
  speaker = "Angela Dorothea Merkel",
  date = "2009-11-10",
  interjection = "speech",
  p_attribute = "word"
)
#> ... get encoding: latin1
#> ... get cpos and strucs
#> ... getting counts for p-attribute(s): word
merkel <- subset(merkel, !word %in% punctuation)
merkel <- subset(merkel, !word %in% tm::stopwords("de"))

# a certain defined time segment
days <- seq(
  from = as.Date("2009-10-28"),
  to = as.Date("2009-11-11"),
  by = "1 day"
)
period <- partition("GERMAPARLMINI", date = days)
#> ... get encoding: latin1
#> ... get cpos and strucs