Changelog

v2.0.1

Major:

  • improved detection of the end of debates so that remaining appendices are removed (issue #1)
  • added speakers that were missing because of missing line breaks (issue #2)
  • improved agenda item recognition in the TEI/XML, in particular in LPs 15 to 17 (issue #5)
  • improved the stage annotation mechanism, which sometimes resulted in very large paragraphs and missing speakers (issue #6)
  • included four additional sessions in LP 15 (issue #7)
  • improved recognition of speakers of the federal council with additional and adjusted regular expressions

Minor:

  • changes to the PDF processing pipeline (improved margins for text extraction, modified regular expressions for end of debates)
  • new regular expressions for stage annotation (e.g. “Anlage”, interrupted sessions)
  • more meaningful “position” attribute for guest speakers (in TEI/XML)
  • minor additions to protocol-specific preprocessing functions
  • removed the literal “NA” at the end of lines, which resulted from faulty concatenation in LPs 13 to 18
  • improved stage annotation in LP 19
  • speakers of regional states now consistently have the role “federal_council” instead of “misc”
  • in TEI/XML: speakers in LP 19 now have a “position” attribute to match earlier legislative periods; its value is always “NA”
  • improved concatenation of words split by line breaks
  • fixed the incorrect assignment of some governmental or presidential speakers to a parliamentary group; their parliamentary group is now set to “NA”
  • adjustments in speaker metadata (corrected party assignments, addressed a speaker mismatch)

Protocol-specific changes:

  • 01/019: added accidentally removed second part of the interrupted session (issue #3)
  • 02/101: removed misplaced attachment
  • 13/001: added a specific expression for “Rita Süssmuth”, who is neither president nor MP in this instance
  • 13/096: removed speeches which were added twice (issue #4)
  • 14/069: removed speeches which were added twice (issue #4)
  • 17/148: now prepared based on PDF instead of plain text because of the quality of the source data
  • 18/191: addressed encoding issue