by Andreas Blätte
New polmineR version (v0.7.3) at CRAN!
The new release of the package has seen quite a bit of refactoring. It is the first version that is compatible with Windows. Documentation for many functions/methods has been improved, the package vignette provides an extended introduction to the package. With improved portability and documentation, I hope that more users will find the package useful.
Getting polmineR v0.7.3
The new version of polmineR is available at CRAN and can be installed using the install.packages mechanism.
Previous versions of polmineR were Linux/MacOS-only packages. It took some effort to learn the techniques necessary to make the package work on Windows (cross-compilation, compiling against an external C library), but I am really happy that things worked at last. The package vignette includes detailed instructions how to install polmineR on Windows systems.
Of course, v0.7.3 is not the end of history. The development version of the package that will include bug fixes and new features will be available at GitHub. The most recent development version can be installed from GitHub using the devtools package.
Feedback I receive how the installation instructions can be improved will find their way into the vignette. The most recent version of the vignette will be available at the development branch of polmineR at GitHub.
If nothing went wrong, the package can be loaded.
Installing packaged corpora
The last version of polmineR introduced an installation mechanism for corpora indexed using the Corpus Workbench (CWB) in R data packages. See ?install.corpus to learn more. In the following examples, I will use a corpus of parliamentary debates that can be installed from a repository hosted at the project server of the PolMine project.
The new polmineR version offers a more robust install.corpus mechanism, and a function to learn about the packaged corpora installed at your system.
To access a packaged corpus, the CORPUS_REGISTRY environment variable needs to be reset. That’s achieved calling use.
Cooccurrences and cooccurrences
The new polmineR version has seen considerable refactoring to make the naming of methods more consistent, and to make the code more maintainable. The biggest intervention is to clearly distinguish between getting the co-occurrences for one query, and getting all cooccurrences in a corpus, or partition.
Cooccurrences for one query are now obtained using the cooccurrences method.
Note that the output in RStudio now makes use of labels for the columns that somewhat explain the statistical results.
The ‘old’ cooccurrences method to get all cooccurrences in a corpus has been removed from the polmineR package. Don’t worry - it is not gone. It is included in an extension package ‘polmineR.graph’ that is available a GitHub. The big change that has happened for calculating all cooccurrences is that I now use a reference classes to handle the data for all cooccurrences, which can be quite bulky in memory. The huge advantage of reference classes is that objects are not copied, reducing memory consumption. As working with all coccurrences is more advanced anyway, and I did not want to mix the S4 class system and reference classes in a confusing fashion, the Coccurrences class had to emigrate.
Please note that the old context method is not gone, but it is now a worker for the cooccurrences and the kwic method. The reworked context class maintains data on word contexts in data.tables. Very good for speedy operations!
New uses for the size method
The uses of the size-method have been extended. A new sAttribute parameter has been introduced that will give you information how a corpus is spread over the s-attribute specified.
An immediate scenario may be to gain a quick understanding of the data you work with.
The parameter sAttribute will also swallow a character vector of length 2. The return value of the size method is a data.table (melted format). For our scenario, the dcast will turn it into a useful format.
An enhanced sAttributes method
I would not guess that anybody apart from myself might have discovered the old meta-method to learn about the metadata of regions of text. It has always been a worker. Now it is gone, and the functionality has been integrated into the sAttributes method. The sAttribute parameter is not limited to length 1 character vectors any more. If more than one string is provided, you get a data.table with the combinations that are available in the corpus.
Further changes, features and improvements
-
The way you can index objects inheriting from the textstat class is more consistent than previously, and mimicks that style you can index data.tables. See
?"textstat-class"
to learn more. -
Managing encodings has been reworked to achieve Windows compatibility. Two new functions (as.corpusEnc and as.nativeEnc) have been introduced for encoding conversion.
-
The count-method for a whole corpus (as provided by a character vector) will now work when more than pAttribute is provided.
-
There are some speed improvements for generating the html that is prepared by the read-method. A previously hidden highlight-method is now exported. It is faster and definitely more robust than before, as the xml2 package is used to parse and enhance html.
-
My old implementation of progress bars for multicore operations has never been stable in a satisfactory fashion. polmineR now relies on the pbapply package that is specialized on that task.
-
To make the package as reliable as possible, I have started to use the testthat package to include unit tests. We want to discover unintended errors.
An easter egg: An integrated shiny app!
Before I forget that… There is an experimental shiny app included in the package! It offers GUI access to core methods of the polmineR package (count, cooccurrences, features, dispersion). In fact, the app is meant to serve as a tutorial to the functionality of polmineR. It is experimental, but to check it out type:
Many of the changes that package has seen serve the purpose of providing a consistent backend for the shiny app. There will be more development in that respect.
Further perspectives
There has been a lot of refactoring behind the scences to ensure that the package is portable and to improve performance. Documentation should be much better than before. Feedback on advice is highly welcome, and we certainly plan to offer more tutorials, examples and advice.
A consequence of the refactoring of the context/cooccurrences method has been that some functionality was not immediatly re-implemented. That’s planned for v0.7.4.
I am happy that there is a growing community of users. I am looking forward to your advice, and to making this an increasingly useful package.
Subscribe via RSS