The PolMine Project grew out of the conviction that advancing text mining in the social sciences is as much about data as it is about exploring new analytical approaches. Ad hoc solutions inhibit reproducible findings and collaboration. The following four ideas are the cornerstones of the project:
Public data: PolMine focuses on texts published by public institutions. Corpora of parliamentary protocols have been, and remain, at the heart of the project.
Sustainability: To let corpora grow sustainably, the project aims to develop a maintainable codebase that ensures the reproducible preparation of corpora and makes updating the data as efficient as possible.
Standardization: We adhere to the standards suggested by the Text Encoding Initiative (TEI). An XML document that conforms to TEI can be turned into almost any other data format that is required.
Ensuring quality: Released corpora are available to registered academic users through our GitLab server. Choosing git/GitLab supports versioning of the data and provides an issue tracker. We want to involve users in the process of improving the quality of the data.
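
To illustrate the standardization point, the following is a minimal sketch of how a TEI-encoded document might be flattened into a tabular format (CSV) using only the Python standard library. The sample document, the speaker names, and the helper functions `tei_to_rows` and `rows_to_csv` are assumptions made for this example; they do not reproduce the schema or tooling of any released PolMine corpus. Only the TEI namespace and the general-purpose TEI elements `sp`, `speaker`, and `p` are taken from the TEI Guidelines.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The TEI namespace as defined by the TEI Guidelines.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, minimal TEI-like fragment (no teiHeader); invented for
# illustration and not taken from a PolMine corpus.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <sp who="#speaker_a">
        <speaker>Speaker A</speaker>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </sp>
      <sp who="#speaker_b">
        <speaker>Speaker B</speaker>
        <p>A reply.</p>
      </sp>
    </body>
  </text>
</TEI>"""


def tei_to_rows(xml_string):
    """Return one dict per paragraph, carrying the speaker of its <sp>."""
    root = ET.fromstring(xml_string)
    rows = []
    for sp in root.iterfind(".//tei:sp", TEI_NS):
        speaker = sp.findtext("tei:speaker", default="", namespaces=TEI_NS)
        for p in sp.iterfind("tei:p", TEI_NS):
            # itertext() also collects text nested in inline child elements.
            text = "".join(p.itertext()).strip()
            rows.append(
                {"who": sp.get("who", ""), "speaker": speaker, "text": text}
            )
    return rows


def rows_to_csv(rows):
    """Serialize the rows to a CSV string with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["who", "speaker", "text"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(tei_to_rows(SAMPLE_TEI)))
```

Because the structure lives in the XML markup rather than in the export code, the same rows could just as well be written to JSON or loaded into a data frame; only the last step would change.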