Split an entire corpus or a partition into speeches. The heuristic first splits the corpus/partition into partitions on a day-to-day basis, using the s-attribute provided by s_attribute_date. These subcorpora are then split into speeches by speaker name, using the s-attribute s_attribute_name. If the contributions of a speaker are separated by a gap larger than the number of tokens supplied by argument gap, they are assumed to be two separate speeches.
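The gap rule described above can be illustrated with a small base R sketch. It is not the package's implementation: the helper split_by_gap() is hypothetical, the regions matrix of left and right corpus positions is made-up data, and counting the tokens strictly between two regions is an assumption about how the gap is measured.

```r
# Hypothetical helper, not part of polmineR: number the speeches of one
# speaker on one day. A new speech starts whenever the number of tokens
# between two consecutive regions is larger than `gap`.
split_by_gap <- function(regions, gap = 500) {
  # regions: two-column integer matrix (left and right corpus positions),
  # sorted by left position
  if (nrow(regions) < 2L) return(rep(1L, nrow(regions)))
  # tokens between the end of one region and the start of the next
  distance <- regions[-1L, 1L] - regions[-nrow(regions), 2L] - 1L
  # cumsum() over the "gap exceeded" flags yields a running speech number
  c(1L, 1L + cumsum(distance > gap))
}

# Made-up regions for illustration
regions <- matrix(
  c(0L, 99L,        # first turn
    120L, 300L,     # resumes after 20 tokens: same speech
    1000L, 1200L),  # resumes after 699 tokens: new speech
  ncol = 2, byrow = TRUE
)
speech_id <- split_by_gap(regions, gap = 500)
speech_id
```

With a smaller gap value, short interruptions would also start a new speech, so gap effectively controls how tolerant the heuristic is towards interjections.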

as.speeches(.Object, ...)

# S4 method for partition
as.speeches(
  .Object,
  s_attribute_date = grep("date", s_attributes(.Object), value = TRUE),
  s_attribute_name = grep("name", s_attributes(.Object), value = TRUE),
  gap = 500,
  mc = FALSE,
  verbose = TRUE,
  progress = TRUE
)

# S4 method for subcorpus
as.speeches(
  .Object,
  s_attribute_date = grep("date", s_attributes(.Object), value = TRUE),
  s_attribute_name = grep("name", s_attributes(.Object), value = TRUE),
  gap = 500,
  mc = FALSE,
  verbose = TRUE,
  progress = TRUE
)

# S4 method for corpus
as.speeches(
  .Object,
  s_attribute_date = grep("date", s_attributes(.Object), value = TRUE),
  s_attribute_name = grep("name", s_attributes(.Object), value = TRUE),
  gap = 500,
  mc = FALSE,
  verbose = TRUE,
  progress = TRUE
)

# S4 method for character
as.speeches(
  .Object,
  s_attribute_date = grep("date", s_attributes(.Object), value = TRUE),
  s_attribute_name = grep("name", s_attributes(.Object), value = TRUE),
  gap = 500,
  mc = FALSE,
  verbose = TRUE,
  progress = TRUE
)



.Object
A partition, subcorpus or corpus object, or a length-one character vector indicating a CWB corpus.


...
Further arguments.


s_attribute_date
A length-one character vector, the s-attribute that provides the dates of sessions.


s_attribute_name
A length-one character vector, the s-attribute that provides the names of speakers.


gap
An integer value, the number of tokens between strucs that decides whether a speech is assumed to have been merely interrupted (by an interjection or question) or whether separate speeches are assumed.


mc
Whether to use multicore, defaults to FALSE. If progress is TRUE, argument mc is passed into pblapply as argument cl. If progress is FALSE, mc is passed into mclapply as argument mc.cores.


verbose
A logical value, whether to print status messages; defaults to TRUE.


progress
A logical value, whether to show a progress bar.


A partition_bundle. The names of the objects in the bundle consist of the speaker name, the date of the speech and an index for the number of the speech on a given day, concatenated by underscores.
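Because the names follow the pattern speaker, date and index joined by underscores, they can be taken apart again. The following base R sketch shows one way to do so. The vector bundle_names is typed in by hand in the style of the example output on this page, and the pattern assumes dates in YYYY-MM-DD format; neither is provided by the package.

```r
# Names in the style described above (typed in by hand for illustration)
bundle_names <- c(
  "Heinz Riesenhuber_2009-10-27_1",
  "Petra Pau_2009-10-27_1"
)

# Anchor on the last two fields (date and index), so that the speaker name
# may itself contain spaces or hyphens. Assumes dates formatted YYYY-MM-DD.
pattern <- "^(.*)_([0-9]{4}-[0-9]{2}-[0-9]{2})_([0-9]+)$"

parsed <- data.frame(
  speaker = sub(pattern, "\\1", bundle_names),
  date = as.Date(sub(pattern, "\\2", bundle_names)),
  index = as.integer(sub(pattern, "\\3", bundle_names)),
  stringsAsFactors = FALSE
)
parsed
```

A table like this may be convenient for filtering or aggregating speeches by speaker or day without touching the bundle itself.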


#> ... activating corpus: GERMAPARLMINI (version: 0.0.1 | build date: 2019-02-23)
#> ... activating corpus: REUTERS
speeches <- as.speeches(
  "GERMAPARLMINI",
  s_attribute_date = "date",
  s_attribute_name = "speaker"
)
speeches_count <- count(speeches, p_attribute = "word")
tdm <- as.TermDocumentMatrix(speeches_count, col = "count")
#> ... using the p_attribute-slot of the first object in the bundle as p_attribute: word
#> ... generating (temporary) key column
#> ... generating cumulated data.table
#> ... getting unique keys
#> ... generating integer keys
#> ... cleaning up temporary key columns
bt <- partition("GERMAPARLMINI", date = "2009-10-27")
#> ... get encoding: latin1
#> ... get cpos and strucs
speeches <- as.speeches(bt, s_attribute_name = "speaker")
#> ... generating partitions by date
#> ... generating speeches
#> ... generating names
#> ... reordering partitions
#> ... coercing partitions to plpr_partitions
#>                                 name size
#> 1     Heinz Riesenhuber_2009-10-27_1 4766
#> 2         Volker Kauder_2009-10-27_1   38
#> 3       Norbert Lammert_2009-10-27_1 4441
#> 4     Gerda Hasselfeldt_2009-10-27_1   23
#> 5      Wolfgang Thierse_2009-10-27_1   14
#> 6    Hermann Otto Solms_2009-10-27_1   17
#> 7             Petra Pau_2009-10-27_1   25
#> 8 Katrin Göring-Eckardt_2009-10-27_1   17
sp <- as.speeches(.Object = corpus("GERMAPARLMINI"), s_attribute_name = "speaker")