Changelog

v2.0.1

Major:

improved the detection of the end of debates to remove remaining appendices (issue #1)
added speakers that were missing because of missing line breaks (issue #2)
improved agenda item recognition in the TEI/XML, in particular in LPs 15 to 17 (issue #5)
improved the stage annotation mechanism, which previously sometimes produced very large paragraphs and missing speakers (issue #6)
included four additional sessions in LP 15 (issue #7)
improved recognition of speakers of the federal council with additional and adjusted regular expressions
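
To illustrate the kind of regular expression involved in recognising federal council speakers, here is a minimal Python sketch. It is not the project's actual expression: the assumed line layout ("<name>, <office> (<state>):"), the pattern and the function name are hypothetical and only show the general idea.

```python
import re

# Hypothetical pattern, assuming speaker lines of the form
# "<name>, <office> (<state>): <start of speech>".
FEDERAL_COUNCIL_SPEAKER = re.compile(
    r"^(?P<name>[^,:()]+),\s+(?P<office>[^():]+?)\s+\((?P<state>[^()]+)\):\s*(?P<rest>.*)$"
)


def parse_federal_council_speaker(line):
    """Return speaker metadata for a matching line, otherwise None."""
    match = FEDERAL_COUNCIL_SPEAKER.match(line)
    if match is None:
        return None
    return {
        "name": match["name"].strip(),
        "office": match["office"].strip(),
        "state": match["state"].strip(),
        "role": "federal_council",  # role value as named in this changelog
    }


example = "Winfried Kretschmann, Ministerpräsident (Baden-Württemberg): Herr Präsident!"
speaker = parse_federal_council_speaker(example)
```

A line without the comma-separated office, such as one that names a parliamentary group in parentheses, does not match and yields None in this sketch.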

Minor:

changes to the PDF processing pipeline (improved margins for text extraction, modified regular expressions for end of debates)
new regular expressions for stage annotation (e.g. “Anlage”, interrupted sessions)
more meaningful “position” attribute for guest speakers (in TEI/XML)
minor additions to protocol-specific preprocessing functions
removed literal “NA” at the end of lines, which occurred due to faulty concatenation in LPs 13 to 18
improved stage annotation in LP 19
speakers of regional states now consistently have the role “federal_council” instead of “misc”
in TEI/XML: speakers in LP 19 now have a “position” attribute to match earlier legislative periods; its value is always “NA”
improved concatenation of words split by line breaks
fixed incorrect assignment of some governmental or presidential speakers to a parliamentary group; their parliamentary group is now set to “NA”
adjustments in speaker metadata (corrected party assignments, addressed a speaker mismatch)
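
Two of the text-level fixes above (joining words split by line breaks, removing a trailing literal “NA”) can be sketched in Python as follows. This is an illustration under assumptions, not the pipeline's implementation: the function names and the exact rules are hypothetical.

```python
import re


def join_hyphenated(lines):
    """Join a line ending in 'word-' with the next line if that line starts in lowercase.

    Suspended hyphens such as "Ein- und Ausgang" are left alone (assumed rule).
    """
    joined = []
    for line in lines:
        previous_is_split = bool(joined) and re.search(r"[a-zäöüß]-$", joined[-1])
        continues_word = re.match(r"[a-zäöüß]", line) and not re.match(r"(?:und|oder)\b", line)
        if previous_is_split and continues_word:
            joined[-1] = joined[-1][:-1] + line  # drop the hyphen and glue the halves
        else:
            joined.append(line)
    return joined


def strip_trailing_na(line):
    """Remove a literal 'NA' at the end of a line.

    Assumed rule: only strip it after whitespace, a lowercase letter or
    punctuation, so that words like "DNA" stay intact.
    """
    return re.sub(r"(?:\s+|(?<=[a-zäöüß.,;:!?]))NA$", "", line)
```

For example, `join_hyphenated(["Die Bundes-", "regierung hat"])` yields a single line, while `["Ein-", "und Ausgang"]` is left as two lines.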

Protocol-specific changes:

01/019: restored the accidentally removed second part of the interrupted session (issue #3)
02/101: removed misplaced attachment
13/001: added a specific expression for “Rita Süssmuth”, who is neither president nor MP in this instance
13/096: removed speeches which were added twice (issue #4)
14/069: removed speeches which were added twice (issue #4)
17/148: now prepared based on PDF instead of plain text because of the quality of the source data
18/191: addressed encoding issue