A library for corpus analysis using the Corpus Workbench (CWB) as an efficient back end for indexing and querying large corpora.

polmineR()

Details

The package offers functionality to flexibly create partitions and to carry out basic statistical operations (counts, co-occurrences etc.). The original full text of documents can be reconstructed and inspected at any time. Beyond that, the package is intended to serve as an interface to packages implementing advanced statistical procedures. The respective data structures (document-term matrices, term co-occurrence matrices etc.) can be created from the indexed corpora.

A session registry directory (see registry()) combines the registry files for corpora that may reside anywhere on the system. Upon loading polmineR, the files in the registry directory defined by the environment variable CORPUS_REGISTRY are copied to the session registry directory. To see whether the environment variable CORPUS_REGISTRY is set, use the function `Sys.getenv()`. Corpora wrapped in R data packages can be activated using the function use().
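The check of the environment variable described above can be sketched in base R as follows. This is a minimal illustration that does not call polmineR itself; the variable names `registry_dir` and `registry_is_set` are chosen for this example only.

```r
# Sys.getenv() returns an empty string ("") if the variable is not set
registry_dir <- Sys.getenv("CORPUS_REGISTRY")

# nzchar() is TRUE for any non-empty string
registry_is_set <- nzchar(registry_dir)

if (registry_is_set) {
  message("CORPUS_REGISTRY is set to: ", registry_dir)
} else {
  message("CORPUS_REGISTRY is not set")
}
```

Because the registry files are copied when polmineR is loaded, the check is most informative when run before the package is attached.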

The package includes a draft shiny app that can be called using polmineR().

Package options

  • polmineR.p_attribute: The default attribute

  • polmineR.left: Default value for left context.

  • polmineR.lineview: A logical value, whether ...

  • polmineR.pagelength: 10L

  • polmineR.meta:

  • polmineR.mc:

  • polmineR.cores:

  • polmineR.browse:

  • polmineR.buttons:

  • polmineR.specialChars:

  • polmineR.cutoff:

  • polmineR.corpus_registry: The system corpus registry directory defined by the environment variable CORPUS_REGISTRY before the polmineR package has been loaded. The polmineR package uses a temporary registry directory to be able to use corpora stored at multiple locations in one session. The path to the system corpus registry directory captures this setting to keep it available if necessary.

  • polmineR.shiny: A logical value, whether polmineR is used in the context of a shiny app. Used to control the appearance of progress bars depending on whether a shiny app is running or not.

  • polmineR.warn.size: When generating HTML table widgets (e.g. when preparing kwic output to be displayed in RStudio's Viewer pane), the function DT::datatable() that is used internally will issue a warning by default if the object size of the table is greater than 1500000. The warning addresses a client-server scenario that is not applicable in the context of a local RStudio session, so you may want to turn it off. Internally, the warning can be suppressed by setting the option DT.warn.size to FALSE. The polmineR option polmineR.warn.size is processed by functions calling DT::datatable() to set and reset the value of DT.warn.size. Please note: The formulation of the warning does not match the scenario of a local RStudio session, but it may still be useful to get a warning when tables are large and slow to process. Therefore, the default value of the setting is FALSE.
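The options listed above are ordinary R options, so they can presumably be read and set with the base R functions getOption() and options(). The following sketch uses only base R; the value 10L for polmineR.left is an arbitrary illustration, not a documented default, and the variable names are chosen for this example.

```r
# Set two options; options() invisibly returns the previous values
old_opts <- options(
  polmineR.left = 10L,        # illustrative value, not a documented default
  polmineR.warn.size = FALSE
)

# Read the values back
left_now <- getOption("polmineR.left")
warn_now <- getOption("polmineR.warn.size")

# Restore the previous state (options that were unset become unset again)
options(old_opts)
```

Saving the return value of options() and passing it back afterwards keeps the change local to a block of code.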

References

Jockers, Matthew L. (2014): Text Analysis with R for Students of Literature. Cham et al.: Springer.

Baker, Paul (2006): Using Corpora in Discourse Analysis. London: Continuum.

Author

Andreas Blaette (andreas.blaette@uni-due.de)

Examples

use("polmineR") # activate demo corpora included in the package
#> ... activating corpus: GERMAPARLMINI (version: 0.0.1 | build date: 2019-02-23)
#> ... activating corpus: REUTERS
# The package includes two sample corpora
corpus("REUTERS") %>% show_info()
corpus("GERMAPARLMINI") %>% show_info()

# Core methods applied to corpus
C <- count("REUTERS", query = "oil")
C <- count("REUTERS", query = c("oil", "barrel"))
C <- count("REUTERS", query = '"Saudi" "Arab.*"', breakdown = TRUE, cqp = TRUE)
D <- dispersion("REUTERS", query = "oil", s_attribute = "id")
K <- kwic("REUTERS", query = "oil")
CO <- cooccurrences("REUTERS", query = "oil")

# Core methods applied to partition
kuwait <- partition("REUTERS", places = "kuwait", regex = TRUE)
#> ... get encoding: latin1
#> ... get cpos and strucs
C <- count(kuwait, query = "oil")
D <- dispersion(kuwait, query = "oil", s_attribute = "id")
K <- kwic(kuwait, query = "oil", meta = "id")
#> ... getting corpus positions
#> ... number of hits: 14
#> ... checking that all p-attributes are available
#> ... getting token id for p-attribute: word
#> ... generating contexts
CO <- cooccurrences(kuwait, query = "oil")

# Go back to full text
p <- partition("REUTERS", id = 127)
#> ... get encoding: latin1
#> ... get cpos and strucs
if (interactive()) read(p)
h <- html(p)
h_highlighted <- highlight(h, highlight = list(yellow = "oil"))
if (interactive()) h_highlighted

# Generate term document matrix
pb <- partition_bundle("REUTERS", s_attribute = "id")
cnt <- count(pb, p_attribute = "word")
tdm <- as.TermDocumentMatrix(cnt, col = "count")
#> ... using the p_attribute-slot of the first object in the bundle as p_attribute: word
#> ... generating (temporary) key column
#> ... generating cumulated data.table
#> ... getting unique keys
#> ... generating integer keys
#> ... cleaning up temporary key columns