Split corpus or partition into speeches. — as.speeches • polmineR

Split an entire corpus or a partition into speeches. The heuristic first splits the corpus or partition into subcorpora day by day, using the s-attribute provided by s_attribute_date. These subcorpora are then split into speeches by speaker name, using the s-attribute s_attribute_name. If two contributions by the same speaker are separated by more tokens than the number supplied by argument gap, they are assumed to be two separate speeches.
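
The heuristic can be illustrated with a minimal base R sketch. It does not use polmineR and is not the package's implementation: the `regions` table, its column names and the helper `split_speeches()` are hypothetical, and the way the distance between two regions is measured is an assumption made for illustration.

```r
# Hypothetical toy data: one row per region (struc), with left and right
# corpus positions, the date of the session and the name of the speaker.
regions <- data.frame(
  date       = rep("2009-10-27", 5),
  speaker    = c("A", "B", "A", "A", "B"),
  cpos_left  = c(0L, 100L, 120L, 900L, 1000L),
  cpos_right = c(99L, 119L, 199L, 999L, 1010L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number the speeches of each speaker on each day.
split_speeches <- function(regions, gap = 500L) {
  # Step 1 and 2: group regions by day, then by speaker.
  groups <- split(regions, list(regions$date, regions$speaker), drop = TRUE)
  out <- lapply(groups, function(df) {
    df <- df[order(df$cpos_left), ]
    # Assumed measure: tokens between the end of one region and the
    # start of the next region of the same speaker.
    distance <- df$cpos_left[-1] - df$cpos_right[-nrow(df)] - 1L
    # Step 3: a distance larger than `gap` starts a new speech.
    df$speech_no <- cumsum(c(TRUE, distance > gap))
    df
  })
  res <- do.call(rbind, out)
  rownames(res) <- NULL
  res[order(res$cpos_left), ]
}

split_speeches(regions, gap = 500L)
```

In this toy data, the short interruption of speaker A by speaker B does not end A's first speech, whereas A's later region is more than 500 tokens away and is counted as a second speech.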

as.speeches(.Object, ...)

# S4 method for partition
as.speeches(
  .Object,
  s_attribute_date = grep("date", s_attributes(.Object), value = TRUE),
  s_attribute_name = grep("name", s_attributes(.Object), value = TRUE),
  gap = 500,
  mc = FALSE,
  verbose = TRUE,
  progress = TRUE
)

# S4 method for subcorpus
as.speeches(
  .Object,
  s_attribute_date = grep("date", s_attributes(.Object), value = TRUE),
  s_attribute_name = grep("name", s_attributes(.Object), value = TRUE),
  gap = 500,
  mc = FALSE,
  verbose = TRUE,
  progress = TRUE
)

# S4 method for corpus
as.speeches(
  .Object,
  s_attribute_date = grep("date", s_attributes(.Object), value = TRUE),
  s_attribute_name = grep("name", s_attributes(.Object), value = TRUE),
  gap = 500,
  mc = FALSE,
  verbose = TRUE,
  progress = TRUE
)

# S4 method for character
as.speeches(
  .Object,
  s_attribute_date = grep("date", s_attributes(.Object), value = TRUE),
  s_attribute_name = grep("name", s_attributes(.Object), value = TRUE),
  gap = 500,
  mc = FALSE,
  verbose = TRUE,
  progress = TRUE
)

Arguments

.Object	A `partition`, `subcorpus` or `corpus` object, or a length-one `character` vector indicating a CWB corpus.
...	Further arguments.
s_attribute_date	A length-one `character` vector, the s-attribute that provides the dates of sessions.
s_attribute_name	A length-one `character` vector, the s-attribute that provides the names of speakers.
gap	An `integer` value, the number of tokens between strucs that decides whether a speech has merely been interrupted (by an interjection or question) or whether separate speeches are assumed.
mc	Whether to use multicore, defaults to `FALSE`. If `progress` is `TRUE`, argument `mc` is passed into `pblapply` as argument `cl`. If `progress` is `FALSE`, `mc` is passed into `mclapply` as argument `mc.cores`.
verbose	A `logical` value, defaults to `TRUE`.
progress	A `logical` value, whether to show progress bar.

Value

A partition_bundle. The name of each object in the bundle consists of the speaker name, the date of the speech and an index for the number of the speech on a given day, concatenated by underscores.
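
The naming scheme can be reproduced and taken apart again with base R. The helpers `speech_name()` and `parse_speech_name()` below are hypothetical and not part of polmineR; the parser assumes that neither the date nor the index contains an underscore.

```r
# Hypothetical helper: build a name from speaker, date and speech index.
speech_name <- function(speaker, date, index) {
  paste(speaker, date, index, sep = "_")
}

# Hypothetical helper: split names back into their three components.
# The greedy first group keeps any underscores within the speaker name.
parse_speech_name <- function(x) {
  m <- regmatches(x, regexec("^(.*)_([^_]+)_([0-9]+)$", x))
  data.frame(
    speaker = vapply(m, `[`, character(1), 2L),
    date    = vapply(m, `[`, character(1), 3L),
    index   = as.integer(vapply(m, `[`, character(1), 4L)),
    stringsAsFactors = FALSE
  )
}

nm <- speech_name(c("Norbert Lammert", "Petra Pau"), "2009-10-27", 1L)
parse_speech_name(nm)
```

Parsing the names this way gives a table of speakers, dates and speech indices that can be used to select or group the objects in the bundle.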

Examples

use("polmineR")
#> ... activating corpus: GERMAPARLMINI (version: 0.0.1 | build date: 2019-02-23)
#> ... activating corpus: REUTERS
speeches <- as.speeches(
  "GERMAPARLMINI",
  s_attribute_date = "date", s_attribute_name = "speaker"
)
speeches_count <- count(speeches, p_attribute = "word")
tdm <- as.TermDocumentMatrix(speeches_count, col = "count")
#> ... using the p_attribute-slot of the first object in the bundle as p_attribute: word
#> ... generating (temporary) key column
#> ... generating cumulated data.table
#> ... getting unique keys
#> ... generating integer keys
#> ... cleaning up temporary key columns

bt <- partition("GERMAPARLMINI", date = "2009-10-27")
#> ... get encoding: latin1
#> ... get cpos and strucs
speeches <- as.speeches(bt, s_attribute_name = "speaker")
#> ... generating partitions by date
#> ... generating speeches
#> ... generating names
#> ... reordering partitions
#> ... coercing partitions to plpr_partitions
summary(speeches)
#>                                 name size
#> 1     Heinz Riesenhuber_2009-10-27_1 4766
#> 2         Volker Kauder_2009-10-27_1   38
#> 3       Norbert Lammert_2009-10-27_1 4441
#> 4     Gerda Hasselfeldt_2009-10-27_1   23
#> 5      Wolfgang Thierse_2009-10-27_1   14
#> 6    Hermann Otto Solms_2009-10-27_1   17
#> 7             Petra Pau_2009-10-27_1   25
#> 8 Katrin Göring-Eckardt_2009-10-27_1   17
sp <- as.speeches(.Object = corpus("GERMAPARLMINI"), s_attribute_name = "speaker")