R/TermDocumentMatrix.R
as.DocumentTermMatrix.Rd
Methods to generate objects of the classes TermDocumentMatrix or DocumentTermMatrix as defined in the tm package. There are many text mining applications for document-term matrices. A DocumentTermMatrix is required as input by the topicmodels package, for instance.
as.TermDocumentMatrix(x, ...)

as.DocumentTermMatrix(x, ...)

# S4 method for character
as.TermDocumentMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)

# S4 method for character
as.DocumentTermMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)

# S4 method for bundle
as.TermDocumentMatrix(x, col, p_attribute = NULL, verbose = TRUE, ...)

# S4 method for bundle
as.DocumentTermMatrix(x, col = NULL, p_attribute = NULL, verbose = TRUE, ...)

# S4 method for partition_bundle
as.DocumentTermMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)

# S4 method for partition_bundle
as.TermDocumentMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)

# S4 method for subcorpus_bundle
as.TermDocumentMatrix(x, p_attribute = NULL, verbose = TRUE, ...)

# S4 method for subcorpus_bundle
as.DocumentTermMatrix(x, p_attribute = NULL, verbose = TRUE, ...)

# S4 method for context
as.DocumentTermMatrix(x, p_attribute, verbose = TRUE, ...)

# S4 method for context
as.TermDocumentMatrix(x, p_attribute, verbose = TRUE, ...)
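Argument | Description |
---|---|
x | A length-one character vector naming a corpus, a bundle or an object of a class inheriting from it (such as partition_bundle or subcorpus_bundle), or a context object. |
... | Definitions of s-attributes used for subsetting the corpus; compare the partition method. |
p_attribute | The p-attribute that counting is based on. |
s_attribute | An s-attribute that defines the content of the columns (or rows) of the matrix. |
verbose | A logical value, whether to output progress messages. |
col | The column of the data.table in the stat slot of the objects in the bundle whose values are used to fill the matrix. |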
A TermDocumentMatrix or a DocumentTermMatrix object. These classes are defined in the tm package and inherit from the simple_triplet_matrix class defined in the slam package.
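To make the triplet representation concrete, here is a minimal base R sketch. It does not use tm or slam; it only mimics the fields a simple_triplet_matrix stores (i, j, v, nrow, ncol, dimnames). The toy terms, document names and counts are made up for illustration.

```r
# Toy counts: rows are terms, columns are documents (term-document layout).
dense <- matrix(
  c(2, 0, 1,
    0, 3, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(Terms = c("haus", "wahl"), Docs = c("d1", "d2", "d3"))
)

# Keep only the non-zero cells as (row index, column index, value) triplets.
nz <- which(dense != 0, arr.ind = TRUE)
tdm_triplets <- list(
  i = unname(nz[, 1]),          # row (term) indices
  j = unname(nz[, 2]),          # column (document) indices
  v = dense[nz],                # the non-zero counts
  nrow = nrow(dense),
  ncol = ncol(dense),
  dimnames = dimnames(dense)
)

# The document-term layout is the transpose: swap row and column indices.
dtm_triplets <- list(
  i = tdm_triplets$j,
  j = tdm_triplets$i,
  v = tdm_triplets$v,
  nrow = tdm_triplets$ncol,
  ncol = tdm_triplets$nrow,
  dimnames = rev(tdm_triplets$dimnames)
)
```

Only three of the six cells are stored, which is why this layout suits term-document data, where most term/document combinations have a count of zero.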
If x refers to a corpus (i.e. is a length-one character vector), a TermDocumentMatrix or DocumentTermMatrix will be generated for subsets of the corpus based on the s_attribute provided. Counts are performed for the p_attribute. Further parameters (passed in as ...) are interpreted as s-attributes that define the subset of the corpus that is split according to s_attribute. If the struc values for s_attribute are not unique, the necessary aggregation is performed, which slows things down somewhat.
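The aggregation step can be pictured with a small base R sketch. This is an illustration of the idea only, not the package's implementation; the data frame, its column names and its values are hypothetical.

```r
# Hypothetical toy data: one row per token, with the region (struc) the token
# belongs to and the s-attribute value of that region. Regions 0 and 2 share
# the same date, so the values of the s-attribute are not unique.
tokens <- data.frame(
  struc = c(0, 0, 1, 1, 2),
  date  = c("2009-10-27", "2009-10-27", "2009-10-28", "2009-10-28", "2009-10-27"),
  word  = c("wir", "haben", "wir", "wir", "wir"),
  stringsAsFactors = FALSE
)

# Counting per region would give three columns. Aggregating by the value of
# the s-attribute merges regions 0 and 2 into one "2009-10-27" column.
counts <- table(Terms = tokens$word, Docs = tokens$date)
```

In this toy case, the count of "wir" for "2009-10-27" sums the occurrences from both regions carrying that date.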
If x is a bundle or an object of a class inheriting from it, the counts (or whatever measure is present in the column of the stat slots indicated by col) will be turned into the values of the sparse matrix that is generated. A special case is the generation of the sparse matrix based on a partition_bundle that does not yet include counts. In this case, a p_attribute needs to be provided, and counting will be performed as well.
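How a column selected by col could end up as matrix values is sketched below in base R. The list of data frames is a hypothetical stand-in for the stat tables of the objects in a bundle; the names stats, terms, i, j and v are assumptions made for this illustration.

```r
# Hypothetical stand-ins for the stat tables of the objects in a bundle,
# one table per document, named by document.
stats <- list(
  "2009-10-27" = data.frame(word = c("wir", "haben"), count = c(2L, 1L)),
  "2009-10-28" = data.frame(word = "wir", count = 2L)
)
col <- "count"  # the column whose values fill the matrix

# The vocabulary across all tables defines the rows of the matrix.
terms <- sort(unique(unlist(lapply(stats, function(s) as.character(s$word)))))

# Row indices (terms), column indices (documents) and values of the triplets.
i <- unlist(lapply(stats, function(s) match(s$word, terms)), use.names = FALSE)
j <- rep(seq_along(stats), times = vapply(stats, nrow, integer(1)))
v <- unlist(lapply(stats, function(s) s[[col]]), use.names = FALSE)
```

Choosing a different col would change only v, so any numeric measure present in the tables could fill the matrix in the same way.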
If x is a partition_bundle and argument col is not NULL, a TermDocumentMatrix is generated based on the column indicated by col of the data.table with counts in the stat slots of the objects in the bundle. If col is NULL, the p-attribute indicated by p_attribute is decoded, and a count is performed to obtain the values of the resulting TermDocumentMatrix. The same procedure applies to get a DocumentTermMatrix.
If x is a subcorpus_bundle, the p-attribute provided by argument p_attribute is decoded, and a count is performed to obtain the resulting TermDocumentMatrix or DocumentTermMatrix.
Andreas Blaette
library(polmineR)
use("polmineR")  # activate the demo corpora shipped with the package

# enriching partition_bundle explicitly
tdm <- partition("GERMAPARLMINI", date = ".*", regex = TRUE) %>%
  partition_bundle(s_attribute = "date") %>%
  enrich(p_attribute = "word") %>%
  as.TermDocumentMatrix(col = "count")

# leave the counting to the as.TermDocumentMatrix-method
tdm <- partition_bundle("GERMAPARLMINI", s_attribute = "date") %>%
  as.TermDocumentMatrix(p_attribute = "word", verbose = FALSE)

# obtain TermDocumentMatrix directly (fastest option)
tdm <- as.TermDocumentMatrix("GERMAPARLMINI", p_attribute = "word", s_attribute = "date")

# split a corpus object into a subcorpus_bundle and count from there
tdm <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  as.TermDocumentMatrix(p_attribute = "word")