Changelog
v2.0.1
Major:
- improved detection of the end of debates so that remaining appendices are removed (issue #1)
- added missing speakers caused by missing line breaks (issue #2)
- improved agenda item recognition in the TEI/XML, in particular in LPs 15 to 17 (issue #5)
- improved the stage annotation mechanism, which sometimes resulted in very large paragraphs and missing speakers (issue #6)
- included four additional sessions in LP 15 (issue #7)
- improved recognition of speakers of the federal council with additional and adjusted regular expressions
Minor:
- changes to the PDF processing pipeline (improved margins for text extraction, modified regular expressions for end of debates)
- new regular expressions for stage annotation (e.g. “Anlage”, interrupted sessions)
- more meaningful “position” attribute for guest speakers (in TEI/XML)
- minor additions to protocol-specific preprocessing functions
- removed the literal “NA” at the end of lines, which resulted from incorrect concatenation in LPs 13 to 18
- improved stage annotation in LP 19
- speakers of regional states now consistently have the role “federal_council” instead of “misc”
- in TEI/XML: speakers in LP 19 now have the attribute “position” to match earlier legislative periods; its value is always “NA”
- improved concatenation of words split by line breaks
- fixed incorrect assignment of some governmental or presidential speakers to a parliamentary group; their parliamentary group is now set to “NA”
- adjustments in speaker metadata (corrected party assignments, addressed a speaker mismatch)
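As an illustration only, the two text-cleaning items in the list above (the trailing “NA” and words split by line breaks) could be approximated as follows. This is a minimal sketch, not the project's actual code: the function names and both patterns are hypothetical, and it assumes the stray “NA” is separated from the preceding text by whitespace and that split words end in a hyphen directly before the line break.

```python
import re


def strip_trailing_na(line: str) -> str:
    """Drop a stand-alone literal 'NA' at the end of a line.

    Assumption: the artifact is its own token (e.g. 'text NA'), so words
    that merely end in 'NA' are left untouched.
    """
    return re.sub(r"\s*\bNA$", "", line)


def join_split_words(lines: list[str]) -> str:
    """Rejoin words hyphenated across a line break.

    Assumption: a split word has a lowercase letter before the hyphen and a
    lowercase letter at the start of the next line. Suspended hyphens such
    as 'Haus-' followed by 'und ...' or 'oder ...' are kept as they are.
    """
    text = "\n".join(lines)
    pattern = r"(?<=[a-zäöüß])-\n(?!(?:und|oder)\b)(?=[a-zäöüß])"
    return re.sub(pattern, "", text)


if __name__ == "__main__":
    print(strip_trailing_na("Beifall bei der SPD NA"))
    print(join_split_words(["Bundes-", "tag"]))
```

The negative lookahead is a deliberately conservative choice: it prefers leaving a hyphen in place over merging two separate words.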
Protocol-specific changes:
- 01/019: restored the accidentally removed second part of the interrupted session (issue #3)
- 02/101: removed misplaced attachment
- 13/001: added a specific expression for “Rita Süssmuth”, who is neither president nor MP in this instance
- 13/096: removed speeches which were added twice (issue #4)
- 14/069: removed speeches which were added twice (issue #4)
- 17/148: now prepared based on PDF instead of plain text because of the quality of the source data
- 18/191: addressed encoding issue