Class, methods and functionality for processing phrases (lexical
units, lexical items, multi-word expressions) beyond the token level. The
envisaged workflow at this stage is to detect phrases using the ngrams
method and to generate a phrases class object from the ngrams object
using the as.phrases method. This object can be passed into a call of
count, see examples. Further methods and functions documented here are
used internally, but may be useful.
# S4 method for ngrams
as.phrases(.Object, ...)

# S4 method for matrix
as.phrases(.Object, corpus, enc = encoding(corpus))

# S4 method for phrases
as.character(x, p_attribute)

concatenate_phrases(dt, phrases, col)
Argument | Description |
---|---|
.Object | Input object, either a |
... | Arguments passed into internal call of |
corpus | A length-one |
enc | Encoding of the corpus. |
x | A |
p_attribute | The positional attribute (p-attribute) to decode. |
dt | A |
phrases | A |
col | If |
The phrases class considers a phrase as a sequence of tokens that can
be defined by a region, i.e. a left and a right corpus position. This
information is kept in a region matrix in the slot "cpos" of the
phrases class. The phrases class inherits from the regions class (which
inherits from the corpus class), without adding further slots.
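The region representation described above can be sketched in base R. This is only an illustration with made-up corpus positions, not the class definition itself; the object names and column names are assumptions chosen for readability.

```r
# Hypothetical region matrix: one row per phrase, the two columns hold the
# left and the right corpus position of the phrase (values are made up).
region_matrix <- matrix(
  c( 3L,  4L,
    10L, 12L),
  ncol = 2L, byrow = TRUE,
  dimnames = list(NULL, c("cpos_left", "cpos_right"))
)

# Number of tokens covered by each phrase
phrase_lengths <- region_matrix[, "cpos_right"] - region_matrix[, "cpos_left"] + 1L

# Expand every region into the full sequence of corpus positions it spans
phrase_cpos <- lapply(
  seq_len(nrow(region_matrix)),
  function(i) region_matrix[i, "cpos_left"]:region_matrix[i, "cpos_right"]
)
```

Storing only the two boundary positions keeps the object compact; the positions in between can always be recovered as shown.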
If .Object is an object of class ngrams, the as.phrases method will
interpret the ngrams as CQP queries, look up the matching corpus
positions and return a phrases object.
If .Object is a matrix, the as.phrases method will initialize a phrases
object. The corpus and the encoding of the corpus will be assigned to
the object.
Applying the as.character method on a phrases object will return the
decoded regions, concatenated using an underscore as separator.
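What decoding and underscore concatenation amount to can be sketched in base R. The token vector, the region values and the helper decode_regions() are hypothetical and stand in for the corpus lookup; the sketch assumes zero-based corpus positions.

```r
# Hypothetical token stream: element i holds the token at corpus position
# i - 1 (corpus positions are assumed to be zero-based in this sketch).
tokens <- c("the", "price", "of", "Saudi", "Arabia", "crude", "fell")

# Hypothetical region matrix with left and right corpus positions
regions <- matrix(c(3L, 4L, 0L, 1L), ncol = 2L, byrow = TRUE)

# Illustrative helper (not part of the package): look up the tokens of
# each region and join them with the separator.
decode_regions <- function(regions, tokens, sep = "_") {
  vapply(
    seq_len(nrow(regions)),
    function(i) {
      positions <- regions[i, 1L]:regions[i, 2L]
      paste(tokens[positions + 1L], collapse = sep)
    },
    character(1L)
  )
}

decoded <- decode_regions(regions, tokens)
```

Here decoded holds one string per region, in the order of the rows of the region matrix.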
The concatenate_phrases function takes a data.table (argument dt) as
input and concatenates the tokens of a phrase found in successive rows
into a single phrase.
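The effect on a token table can be sketched with a plain data.frame in base R. The helper concatenate_rows() is hypothetical and only mimics the behaviour visible in the examples (the merged row keeps the left corpus position); it is not the package's implementation and uses a data.frame instead of a data.table to stay dependency-free.

```r
# Illustrative helper (an assumption, not the package function): merge the
# rows covered by each region into the first row of that region.
concatenate_rows <- function(df, regions, col, sep = "_") {
  obsolete <- integer()
  for (i in seq_len(nrow(regions))) {
    rows <- which(df$cpos >= regions[i, 1L] & df$cpos <= regions[i, 2L])
    if (length(rows) < 2L) next
    # The first row of the region receives the concatenated phrase
    df[[col]][rows[1L]] <- paste(df[[col]][rows], collapse = sep)
    # The remaining rows of the region are dropped afterwards
    obsolete <- c(obsolete, rows[-1L])
  }
  if (length(obsolete) > 0L) df <- df[-obsolete, , drop = FALSE]
  rownames(df) <- NULL
  df
}

# Hypothetical token table with made-up corpus positions
df <- data.frame(
  cpos = 0:4,
  word = c("im", "Deutschen", "Bundestag", "wird", "debattiert"),
  stringsAsFactors = FALSE
)
regions <- matrix(c(1L, 2L), ncol = 2L)

result <- concatenate_rows(df, regions, col = "word")
```

The result has four rows: the tokens at corpus positions 1 and 2 are collapsed into a single row at position 1.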
Other classes to manage corpora: corpus-class, regions, subcorpus
# Workflow to create document-term-matrix with phrases
obs <- corpus("GERMAPARLMINI") %>%
  count(p_attribute = "word")

phrases <- corpus("GERMAPARLMINI") %>%
  ngrams(n = 2L, p_attribute = "word") %>%
  pmi(observed = obs) %>%
  subset(ngram_count > 5L) %>%
  subset(1:100) %>%
  as.phrases()

dtm <- corpus("GERMAPARLMINI") %>%
  as.speeches(s_attribute_name = "speaker", progress = TRUE) %>%
  count(phrases = phrases, p_attribute = "word", progress = TRUE, verbose = TRUE) %>%
  as.DocumentTermMatrix(col = "count", verbose = FALSE)
#> [1] 98
#> [1] 12260

# Derive phrases object from an ngrams object
reuters_phrases <- ngrams("REUTERS", p_attribute = "word", n = 2L) %>%
  pmi(observed = count("REUTERS", p_attribute = "word")) %>%
  subset(ngram_count >= 5L) %>%
  subset(1:25) %>%
  as.phrases()

phr <- as.character(reuters_phrases, p_attribute = "word")

# Derive phrases from explicitly stated CQP queries
cqp_phrase_queries <- c(
  '"oil" "revenue";',
  '"Sheikh" "Aziz";',
  '"Abdul" "Aziz";',
  '"Saudi" "Arabia";',
  '"oil" "markets";'
)
reuters_phrases <- cpos("REUTERS", cqp_phrase_queries, p_attribute = "word") %>%
  as.phrases(corpus = "REUTERS", enc = "latin1")

# Use the concatenate_phrases() function on a data.table
lexical_units_cqp <- c(
  '"Deutsche.*" "Bundestag.*";',
  '"sozial.*" "Gerechtigkeit";',
  '"Ausschuss" "f.r" "Arbeit" "und" "Soziales";',
  '"soziale.*" "Marktwirtschaft";',
  '"freiheitliche.*" "Grundordnung";'
)

# enc is the encoding of the corpus, not a p-attribute
phr <- cpos("GERMAPARLMINI", query = lexical_units_cqp, cqp = TRUE) %>%
  as.phrases(corpus = "GERMAPARLMINI", enc = "latin1")

dt <- corpus("GERMAPARLMINI") %>%
  decode(p_attribute = "word", s_attribute = character(), to = "data.table") %>%
  concatenate_phrases(phrases = phr, col = "word")

dt[word == "Deutschen_Bundestag"]
#>       cpos                word
#>  1:    308 Deutschen_Bundestag
#>  2:    508 Deutschen_Bundestag
#>  3:   3034 Deutschen_Bundestag
#>  4:   9408 Deutschen_Bundestag
#>  5:  10449 Deutschen_Bundestag
#>  6:  10580 Deutschen_Bundestag
#>  7:  11434 Deutschen_Bundestag
#>  8:  11963 Deutschen_Bundestag
#>  9:  21347 Deutschen_Bundestag
#> 10:  32024 Deutschen_Bundestag
#> 11:  53952 Deutschen_Bundestag
#> 12:  70369 Deutschen_Bundestag
#> 13:  70846 Deutschen_Bundestag
#> 14:  76952 Deutschen_Bundestag
#> 15:  89287 Deutschen_Bundestag
#> 16:  89318 Deutschen_Bundestag
#> 17: 100758 Deutschen_Bundestag
#> 18: 114793 Deutschen_Bundestag
#> 19: 118872 Deutschen_Bundestag
#> 20: 129975 Deutschen_Bundestag
#> 21: 131688 Deutschen_Bundestag
#> 22: 137340 Deutschen_Bundestag
#> 23: 160334 Deutschen_Bundestag
#> 24: 167455 Deutschen_Bundestag
#> 25: 171908 Deutschen_Bundestag
#> 26: 172517 Deutschen_Bundestag
#> 27: 177983 Deutschen_Bundestag
#> 28: 188692 Deutschen_Bundestag
#>       cpos                word

dt[word == "soziale_Marktwirtschaft"]
#>     cpos                    word
#> 1: 21178 soziale_Marktwirtschaft
#> 2: 42934 soziale_Marktwirtschaft
#> 3: 42944 soziale_Marktwirtschaft
#> 4: 42960 soziale_Marktwirtschaft
#> 5: 42981 soziale_Marktwirtschaft
#> 6: 64328 soziale_Marktwirtschaft