German Plenary Proceedings as an NLP Testbed

With the 19th electoral term of the German federal parliament (the "Bundestag") drawing to a close, it is a good time to do some data analysis on what has been said there. Fortunately, for the first time, the Bundestag makes all speeches of an electoral term available in an XML format, complete with speaker metadata.

In this article, we will download this data and preprocess it into a dataframe for easy analysis. Please note the rights of use of this dataset as published at https://www.bundestag.de/services/impressum.

Set Up Your Notebook Environment

Google Colaboratory (Colab), hosted on Google's cloud infrastructure, offers a JupyterLab-like computational notebook environment with all Python libraries needed for this project preinstalled, which makes it maximally convenient. On the other hand, you should be aware of the privacy and quality-of-service implications of using a free Google service.

If you'll be using your own infrastructure, you'll need to install Python and all required Python libraries. I'd also recommend installing JupyterLab or an alternative computational notebook environment, as many people prefer computational notebooks over IDEs for exploratory data science work.
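For a local setup, something like the following should cover the dependencies (a sketch: the package names are taken from the imports used below, plus pyarrow for the Parquet export at the end; google.colab is only available on Colab, so the Drive-mounting steps can simply be skipped locally):

pip install jupyterlab lxml numpy pandas pyarrow tqdm urllib3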

Import Python Dependencies

To start, we will import some Python dependencies. The glob module finds all files matching a given name pattern in a given directory, while os helps with manipulating file names. google.colab.drive lets us mount our Google Drive to permanently store the retrieved dataset. lxml parses XML and HTML, while urllib3 and urllib.request are used to fetch the list of files and download them. numpy and pandas are used to build our dataset.

from datetime import datetime
import glob
import os

from google.colab import drive
import lxml.etree
import lxml.html
import numpy as np
import pandas as pd
import urllib.request
import urllib3

# suppress warnings about unverified HTTPS requests made via urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

Mount Google Drive

Next, we will "mount" our google drive, so that we can save files to it programmatically from this notebook. We will save temporary files that we download, as well as our completed dataset. After executing the following code cell, click on the link displayed as output, allow access and copy the access token, then paste the access token into the textbox displayed as output.

drive.mount('/content/gdrive')

Download Plenary Proceedings as XML Files

Before downloading, we will first prepare directories on our Google Drive to save our data to, and then download the document type definition (DTD) that describes the XML structure of the plenary proceedings:

%%shell

PARQUET_DATA_DIR="/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/parquet"
XML_DATA_DIR="/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/xml"

mkdir -p "$PARQUET_DATA_DIR"
mkdir -p "$XML_DATA_DIR"
cd "$XML_DATA_DIR"

# download the XML DTD for the plenary protocols of the 19th term if it does not already exist on Google Drive
if [ ! -f "dbtplenarprotokoll-data.dtd" ]; then
    wget -N https://www.bundestag.de/resource/blob/575720/22c416420e8a51c380d2ddffb19ff5b7/dbtplenarprotokoll-data.dtd
fi

We are now ready to download the raw plenary proceeding XML files:

def download_plenary_protocols(to_path):
  http = urllib3.PoolManager()
  offset = 0
  count = 0
  while True:
    # the Bundestag open data page serves the list of XML files in pages of (at most) 10 links each
    response = http.request("GET", f"https://www.bundestag.de/ajax/filterlist/de/services/opendata/543410-543410?noFilterSet=true&offset={offset}")
    parsed = lxml.html.fromstring(response.data)
    empty = True  # remains True if this page contains no more links
    for link in parsed.getiterator(tag="a"):
      empty = False
      link_href = link.attrib["href"]
      count += 1
      filename = to_path + "/" + os.path.basename(link_href)
      file_url = "https://www.bundestag.de" + link_href
      print(f"downloading URL '{file_url}'")
      urllib.request.urlretrieve(file_url, filename)
    if empty: break
    offset += 10
  print(f"downloaded {count} XML files")

download_plenary_protocols("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/xml")
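As an optional sanity check (a sketch using the paths defined above), we can count the downloaded files and validate one of them against the DTD we fetched earlier:

xml_dir = "/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/xml"
downloaded_files = sorted(glob.glob(xml_dir + "/*.xml"))
print(f"{len(downloaded_files)} XML files on Google Drive")
# optionally validate the first proceeding against the DTD downloaded above
dtd = lxml.etree.DTD(xml_dir + "/dbtplenarprotokoll-data.dtd")
print(dtd.validate(lxml.etree.parse(downloaded_files[0])))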

Data Preprocessing and Cleanup

Next, we will define a function to extract speeches from a given plenary proceedings XML file. This function will also extract/calculate metadata for each speech:

def extract_speeches_from_xml(xml, only_J_paragraphs=True):
  speech_paragraphs_dict = {
      "id": [],
      "date": [],
      "speaker_id": [],
      "speaker_title": [],
      "speaker_first_name": [],
      "speaker_last_name": [],
      "speaker_fraction": [],
      "speaker_full_name": [],
      "speech_id": [],
      "paragraph_num": [],
      "paragraph": [],
      "paragraph_len_chars": [],
      "paragraph_len_words": []
  }
  def first_or_empty_string(a): return a[0] if a else ""
  # the session date (format DD.MM.YYYY) is stored as an attribute of the root element
  session_date_string = xml.xpath("/dbtplenarprotokoll/@sitzung-datum")[0]
  session_date = datetime.strptime(session_date_string, "%d.%m.%Y")
  # only consider speeches with exactly one <redner> (speaker) element
  speeches_with_one_speaker = xml.xpath("//sitzungsverlauf//rede[count(p/redner)=1]")
  for speech in speeches_with_one_speaker:
    speaker_id = speech.xpath("p/redner/@id")[0]
    speech_id = speech.xpath("@id")[0]
    speaker_title = first_or_empty_string(speech.xpath("p/redner/name/titel/text()"))
    speaker_first_name = first_or_empty_string(speech.xpath("p/redner/name/vorname/text()"))
    speaker_last_name = first_or_empty_string(speech.xpath("p/redner/name/nachname/text()"))
    speaker_fraction = first_or_empty_string(speech.xpath("p/redner/name/fraktion/text()"))
    speaker_full_name = speaker_title + (" " if speaker_title != "" else "") + speaker_first_name + " " + speaker_last_name + " (" + speaker_fraction + ")"
    #print(f"{speaker_id} {speaker_first_name} {speaker_last_name}:")
    # either only paragraphs of class 'J', or all paragraphs except the speaker ('redner') ones
    paragraphs = speech.xpath("p[@klasse='J']/text()") if only_J_paragraphs else speech.xpath("p[@klasse!='redner']/text()")
    for paragraph_num, paragraph in enumerate(paragraphs):
      id = speech_id + "_" + str(paragraph_num)
      speech_paragraphs_dict["id"].append(id)
      speech_paragraphs_dict["date"].append(session_date)
      speech_paragraphs_dict["speaker_id"].append(speaker_id)
      speech_paragraphs_dict["speaker_title"].append(speaker_title)
      speech_paragraphs_dict["speaker_first_name"].append(speaker_first_name)
      speech_paragraphs_dict["speaker_last_name"].append(speaker_last_name)
      speech_paragraphs_dict["speaker_fraction"].append(speaker_fraction)
      speech_paragraphs_dict["speaker_full_name"].append(speaker_full_name)
      speech_paragraphs_dict["speech_id"].append(speech_id)
      speech_paragraphs_dict["paragraph_num"].append(paragraph_num)
      speech_paragraphs_dict["paragraph"].append(paragraph)
      speech_paragraphs_dict["paragraph_len_chars"].append(len(paragraph))
      speech_paragraphs_dict["paragraph_len_words"].append(len(paragraph.split()))
      #print(f"{id}: {paragraph}")
  return speech_paragraphs_dict

We are now ready to process every XML file into the preliminary dataframe of speech paragraphs, speech_paragraphs_df:

from tqdm.notebook import tqdm # progress bar

xml_files = glob.glob("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/xml/*.xml")

# collect one dataframe per XML file and concatenate them once at the end
speech_paragraph_dfs = []
for xml_file in tqdm(xml_files):
  xml = lxml.etree.parse(xml_file)
  speech_paragraph_dfs.append(pd.DataFrame.from_dict(extract_speeches_from_xml(xml)))
speech_paragraphs_df = pd.concat(speech_paragraph_dfs, ignore_index=True)

This dataframe represents a table of every paragraph of every speech delivered by a single speaker, including metadata such as the speaker's name, their fraction (Fraktion, i.e. parliamentary group), and the paragraph length in words.
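A quick look at the raw speaker_fraction values (output omitted here; it depends on the downloaded proceedings) shows why some cleanup is needed, for example inconsistent casing of "fraktionslos" and varying spellings of the BÜNDNIS 90/DIE GRÜNEN label:

print(speech_paragraphs_df.shape)
speech_paragraphs_df.speaker_fraction.value_counts()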

We will now clean up the speaker_fraction column. We then remove speakers without a fraction, as these rows have some data quality problems that we will have to examine at a later point in time.

# data cleanup
speech_paragraphs_df.loc[speech_paragraphs_df.speaker_fraction == "Fraktionslos", "speaker_fraction"] = "fraktionslos"
speech_paragraphs_df.loc[speech_paragraphs_df.speaker_fraction.str.startswith("BÜNDNIS"), "speaker_fraction"] = "BÜNDNIS 90/DIE GRÜNEN"

# only keep speech paragraphs of speakers with a known fraction
speech_paragraphs_df = speech_paragraphs_df[speech_paragraphs_df.speaker_fraction.isin(["CDU/CSU", "SPD", "AfD", "FDP", "DIE LINKE", "BÜNDNIS 90/DIE GRÜNEN"])]

We can now take a look at our finished dataset:

speech_paragraphs_df

Finally, we save our finished dataset to our Google Drive for future analysis:

speech_paragraphs_df.to_parquet("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/parquet/bundestag_plenary_protocols_term_19.parquet")

To load this dataset back into memory at a later date, we would use the following code:

speech_paragraphs_df = pd.read_parquet("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/parquet/bundestag_plenary_protocols_term_19.parquet")

While we have our finished dataset in memory, let's perform a simple example analysis and tabulate the number of speech paragraphs by speaker fraction:

speech_paragraphs_df.speaker_fraction.value_counts()
CDU/CSU                  34349
SPD                      22774
AfD                      17829
FDP                      14677
DIE LINKE                13674
BÜNDNIS 90/DIE GRÜNEN    12794
Name: speaker_fraction, dtype: int64
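If relative shares are more interesting than absolute counts, the same call can normalize the counts:

speech_paragraphs_df.speaker_fraction.value_counts(normalize=True)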

Finally, we can transform our dataset of speech paragraphs into a dataset of whole speeches via the following Pandas incantation. We group by speech_id (which becomes the index of the resulting dataframe), keep the first value of each speaker-related column, join the paragraphs into one text, and sum up the length columns:

speeches_df = speech_paragraphs_df\
  .groupby("speech_id")\
  .agg({
    "speaker_id": "first",
    "speaker_title": "first",
    "speaker_first_name": "first",
    "speaker_last_name": "first",
    "speaker_fraction": "first",
    "speaker_full_name": "first",
    "speech_id": "first",
    "paragraph": " ".join,
    "paragraph_len_chars": "sum",
    "paragraph_len_words": "sum"})

That's about it. In my next post, we'll categorize all speech paragraphs into a set of predefined topics.