Generate TermDocumentMatrix / DocumentTermMatrix.

Methods to generate the classes TermDocumentMatrix or DocumentTermMatrix as defined in the tm package. There are many text mining applications for document-term matrices. A DocumentTermMatrix is required as input by the topicmodels package, for instance.
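To illustrate the topicmodels use case, the following minimal sketch fits a topic model on a DocumentTermMatrix with one row per date of the GERMAPARLMINI sample corpus. It assumes that the topicmodels package is installed. The number of topics, the Gibbs sampler and the control values are arbitrary choices made for illustration, not recommendations.

```r
library(polmineR)
library(topicmodels)  # assumption: installed separately

use("polmineR")  # activate the sample corpora shipped with polmineR

# one document per date, counts based on the p-attribute "word"
dtm <- as.DocumentTermMatrix(
  "GERMAPARLMINI",
  p_attribute = "word", s_attribute = "date",
  verbose = FALSE
)

# k = 3 and the control settings are illustrative values only
lda <- LDA(dtm, k = 3L, method = "Gibbs", control = list(seed = 42L, iter = 50L))

# five most probable terms per topic
terms(lda, 5)
```

In practice, stopwords and rare terms would usually be removed from the matrix before fitting a model.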

as.TermDocumentMatrix(x, ...)

as.DocumentTermMatrix(x, ...)

# S4 method for character
as.TermDocumentMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)

# S4 method for character
as.DocumentTermMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)

# S4 method for bundle
as.TermDocumentMatrix(x, col, p_attribute = NULL, verbose = TRUE, ...)

# S4 method for bundle
as.DocumentTermMatrix(x, col = NULL, p_attribute = NULL, verbose = TRUE, ...)

# S4 method for partition_bundle
as.DocumentTermMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)

# S4 method for partition_bundle
as.TermDocumentMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)

# S4 method for subcorpus_bundle
as.TermDocumentMatrix(x, p_attribute = NULL, verbose = TRUE, ...)

# S4 method for subcorpus_bundle
as.DocumentTermMatrix(x, p_attribute = NULL, verbose = TRUE, ...)

# S4 method for context
as.DocumentTermMatrix(x, p_attribute, verbose = TRUE, ...)

# S4 method for context
as.TermDocumentMatrix(x, p_attribute, verbose = TRUE, ...)

Arguments

x	A `character` vector indicating a corpus, or an object of class `bundle`, or inheriting from class `bundle` (e.g. `partition_bundle`).
...	Definitions of s-attributes used for subsetting the corpus; compare the partition method.
p_attribute	The p-attribute that counting is based on.
s_attribute	An s-attribute whose values define the columns (for a TermDocumentMatrix) or rows (for a DocumentTermMatrix).
verbose	A `logical` value, whether to output progress messages.
col	The column of the `data.table` in slot `stat` (if `x` is a `bundle`) to use for assembling the matrix.

Value

A TermDocumentMatrix, or a DocumentTermMatrix object. These classes are defined in the tm package, and inherit from the simple_triplet_matrix-class defined in the slam-package.
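The class relationship can be checked on a toy matrix without any corpus. The sketch below assumes that the tm and slam packages are installed; the terms, documents and counts are made up for illustration. It calls the tm generic explicitly (`tm::as.TermDocumentMatrix`), not the polmineR methods documented here.

```r
library(slam)
library(tm)

# toy counts for 3 terms and 2 documents, stored as (row, column, value) triplets
stm <- simple_triplet_matrix(
  i = c(1L, 2L, 3L, 1L),
  j = c(1L, 1L, 2L, 2L),
  v = c(2, 1, 4, 1),
  nrow = 3L, ncol = 2L,
  dimnames = list(Terms = c("a", "b", "c"), Docs = c("d1", "d2"))
)

# wrap the sparse matrix as a TermDocumentMatrix with term-frequency weighting
tdm <- tm::as.TermDocumentMatrix(stm, weighting = tm::weightTf)

# both classes are present on the object
class(tdm)

# transposing a TermDocumentMatrix yields a DocumentTermMatrix
dtm <- t(tdm)
class(dtm)

# a dense view is fine for small matrices only
as.matrix(tdm)
```

Because the objects are sparse, converting a matrix for a full corpus with `as.matrix()` can exhaust memory; functions from slam operate on the triplet representation directly.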

Details

If x refers to a corpus (i.e. is a length-one character vector), a TermDocumentMatrix or DocumentTermMatrix will be generated for subsets of the corpus based on the s_attribute provided. Counts are performed for the p_attribute. Further parameters (passed in as ...) are interpreted as s-attributes that define a subset of the corpus, which is then split according to s_attribute. If struc values for s_attribute are not unique, the necessary aggregation is performed, which slows things down somewhat.
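The counting and aggregation step can be pictured with a conceptual sketch in base R. This is not polmineR's implementation: the token table, its column names and its values are hypothetical. Regions 1 and 2 share the same date, so their tokens end up in the same document.

```r
# hypothetical token stream: one row per token, with the region it belongs to
# and the s-attribute value (date) of that region
tokens <- data.frame(
  word   = c("die", "Sitzung", "die", "Regierung", "die", "Sitzung"),
  region = c(1L, 1L, 2L, 2L, 3L, 3L),
  date   = c("2009-10-27", "2009-10-27", "2009-10-27",
             "2009-10-27", "2009-10-28", "2009-10-28"),
  stringsAsFactors = FALSE
)

# count tokens per (term, document); counting by date rather than by region
# aggregates the regions that share an s-attribute value
counts <- as.data.frame(
  table(term = tokens$word, doc = tokens$date),
  stringsAsFactors = FALSE
)
counts <- counts[counts$Freq > 0, ]  # a sparse matrix stores non-zero cells only

# integer row and column indices plus values: the triplet form of the matrix
term_labels <- sort(unique(counts$term))
doc_labels <- sort(unique(counts$doc))
triplets <- list(
  i = match(counts$term, term_labels),
  j = match(counts$doc, doc_labels),
  v = counts$Freq
)
```

The three vectors in `triplets`, together with the labels, are the ingredients of a simple_triplet_matrix with terms in rows and documents in columns.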

If x is a bundle or an object of a class inheriting from it, the counts, or whatever other measure is present in the stat slots (in the column indicated by col), become the values of the sparse matrix that is generated. A special case is a partition_bundle that does not yet include counts: in this case, a p_attribute needs to be provided, and counting is performed as part of the conversion.

If x is a partition_bundle and argument col is not NULL, a TermDocumentMatrix is generated based on the column indicated by col of the data.table with counts in the stat slots of the objects in the bundle. If col is NULL, the p-attribute indicated by p_attribute is decoded, and a count is performed to obtain the values of the resulting TermDocumentMatrix. The same procedure applies to get a DocumentTermMatrix.

If x is a subcorpus_bundle, the p-attribute provided by argument p_attribute is decoded, and a count is performed to obtain the resulting TermDocumentMatrix or DocumentTermMatrix.

Author

Andreas Blaette

Examples

use("polmineR")
#> ... activating corpus: GERMAPARLMINI (version: 0.0.1 | build date: 2019-02-23)
#> ... activating corpus: REUTERS
 
# enriching partition_bundle explicitly 
tdm <- partition("GERMAPARLMINI", date = ".*", regex = TRUE) %>% 
  partition_bundle(s_attribute = "date") %>% 
  enrich(p_attribute = "word") %>%
  as.TermDocumentMatrix(col = "count")
#> ... get encoding: latin1
#> ... get cpos and strucs
#> ... using the p_attribute-slot of the first object in the bundle as p_attribute: word
#> ... generating (temporary) key column
#> ... generating cumulated data.table
#> ... getting unique keys
#> ... generating integer keys
#> ... cleaning up temporary key columns
   
# leave the counting to the as.TermDocumentMatrix-method
tdm <- partition_bundle("GERMAPARLMINI", s_attribute = "date") %>% 
  as.TermDocumentMatrix(p_attribute = "word", verbose = FALSE)
   
# obtain TermDocumentMatrix directly (fastest option)
tdm <- as.TermDocumentMatrix("GERMAPARLMINI", p_attribute = "word", s_attribute = "date")
#> ... generate data.table with token and struc ids
#> ... generate unique document ids
#> ... counting token per doc
#> ... generate simple_triplet_matrix
#> ... add row and column labels

# split a corpus into a subcorpus_bundle and count during conversion
tdm <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  as.TermDocumentMatrix(p_attribute = "word")
#> ... generating corpus positions
#> ... getting ids
#> ... performing count
#> ... generating keys
#> ... generating simple triplet matrix