German Plenary Proceedings as an NLP Testbed
With the 19th electoral term of the German federal parliament ("Bundestag" in German) drawing to a close, it could be interesting to do some data analysis on what has been said there. Fortunately, this is the first electoral term for which the Bundestag makes all speeches available in an XML format, complete with speaker metadata.
In this article, we will download this data and preprocess it into a dataframe for easy analysis. Please note the rights of use of this dataset as published at https://www.bundestag.de/services/impressum.
Set Up Your Notebook Environment
Google Colaboratory (Colab) on the Google Cloud Platform offers a JupyterLab-like computational notebook environment with all Python libraries needed for this project preinstalled, making it the most convenient option. On the other hand, you should be aware of the privacy and quality-of-service implications of using a free Google service.
If you'll be using your own infrastructure, you'll need to install Python and all required Python libraries. I'd recommend also installing JupyterLab or an alternative computational notebook environment. Many people prefer computational notebooks over IDEs for exploratory data science work.
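For such a local setup, the third-party libraries used in this article can be installed with pip. The exact package list below is my assumption, derived from the imports in the next section plus pyarrow for writing Parquet files; the google.colab import only applies when running on Colab:
pip install jupyterlab lxml numpy pandas pyarrow tqdm urllib3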
Import Python Dependencies
To start, we will import some Python dependencies. The glob module is used to find all files matching a given name pattern in a given directory, while os helps with manipulating filenames. google.colab.drive will allow us to mount our Google Drive for permanently storing our retrieved dataset. lxml is an XML parser, and urllib3 is used to download files over HTTP. numpy and pandas are used to build our dataset.
from datetime import datetime
import glob
from google.colab import drive
import lxml
import lxml.html
import numpy as np
import os
import pandas as pd
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
import urllib.request
Mount Google Drive
Next, we will "mount" our Google Drive so that we can save files to it programmatically from this notebook. We will save the temporary files that we download as well as our completed dataset. After executing the following code cell, click the link shown in the output, allow access, copy the access token, and paste it into the text box shown in the output.
drive.mount('/content/gdrive')
Download Plenary Proceedings as XML Files
Before downloading the proceedings themselves, we will prepare directories on our Google Drive to save our data to, and then download the document type definition (DTD) describing the XML structure of the plenary proceedings:
%%shell
PARQUET_DATA_DIR="/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/parquet"
XML_DATA_DIR="/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/xml"
mkdir -p "$PARQUET_DATA_DIR"
mkdir -p "$XML_DATA_DIR"
cd "$XML_DATA_DIR"
# download the XML DTD for plenary protocols of the 19th term if it does not yet exist on Google Drive...
if [ ! -f "dbtplenarprotokoll-data.dtd" ]; then
wget -N https://www.bundestag.de/resource/blob/575720/22c416420e8a51c380d2ddffb19ff5b7/dbtplenarprotokoll-data.dtd
fi
We are now ready to download the raw plenary proceeding XML files:
def download_plenary_protocols(to_path):
    http = urllib3.PoolManager()
    offset = 0
    count = 0
    while True:
        # the open data page serves the links to the XML files in batches of 10, paged via the offset parameter
        response = http.request("GET", f"https://www.bundestag.de/ajax/filterlist/de/services/opendata/543410-543410?noFilterSet=true&offset={offset}")
        parsed = lxml.html.fromstring(response.data)
        empty = True
        for link in parsed.getiterator(tag="a"):
            empty = False
            link_href = link.attrib["href"]
            count += 1
            filename = to_path + "/" + os.path.basename(link_href)
            file_url = "https://www.bundestag.de" + link_href
            print(f"downloading URL '{file_url}'")
            urllib.request.urlretrieve(file_url, filename)
        if empty: break
        offset += 10
    print(f"downloaded {count} XML files")
download_plenary_protocols("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/xml")
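Since we downloaded the DTD in the previous step, we can optionally check one of the retrieved files against it. The following is just a sketch of how such a sanity check could look with lxml's DTD validation, not a required part of the pipeline:
# optional sanity check: validate one downloaded protocol against the DTD (a sketch)
xml_dir = "/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/xml"
with open(xml_dir + "/dbtplenarprotokoll-data.dtd") as dtd_file:
    dtd = lxml.etree.DTD(dtd_file)
sample_file = sorted(glob.glob(xml_dir + "/*.xml"))[0]  # pick any downloaded protocol
tree = lxml.etree.parse(sample_file)
print(sample_file, "valid:", dtd.validate(tree))
if not dtd.validate(tree):
    print(dtd.error_log.filter_from_errors())  # show details if the file does not conform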
Data Preprocessing and Cleanup
Next, we will define a function to extract speeches from a given plenary proceedings XML file. This function will also extract/calculate metadata for each speech:
def extract_speeches_from_xml(xml, only_J_paragraphs=True):
    speech_paragraphs_dict = {
        "id": [],
        "date": [],
        "speaker_id": [],
        "speaker_title": [],
        "speaker_first_name": [],
        "speaker_last_name": [],
        "speaker_fraction": [],
        "speaker_full_name": [],
        "speech_id": [],
        "paragraph_num": [],
        "paragraph": [],
        "paragraph_len_chars": [],
        "paragraph_len_words": []
    }
    def first_or_empty_string(a): return a[0] if a else ""
    # the session date is stored as an attribute of the root element in dd.mm.yyyy format
    session_date_string = xml.xpath("/dbtplenarprotokoll/@sitzung-datum")[0]
    session_date = datetime.strptime(session_date_string, "%d.%m.%Y")
    # only consider speeches ("rede") with exactly one speaker ("redner")
    speeches_with_one_speaker = xml.xpath("//sitzungsverlauf//rede[count(p/redner)=1]")
    for speech in speeches_with_one_speaker:
        speaker_id = speech.xpath("p/redner/@id")[0]
        speech_id = speech.xpath("@id")[0]
        speaker_title = first_or_empty_string(speech.xpath("p/redner/name/titel/text()"))
        speaker_first_name = first_or_empty_string(speech.xpath("p/redner/name/vorname/text()"))
        speaker_last_name = first_or_empty_string(speech.xpath("p/redner/name/nachname/text()"))
        speaker_fraction = first_or_empty_string(speech.xpath("p/redner/name/fraktion/text()"))
        speaker_full_name = speaker_title + (" " if speaker_title != "" else "") + speaker_first_name + " " + speaker_last_name + " (" + speaker_fraction + ")"
        #print(f"{speaker_id} {speaker_first_name} {speaker_last_name}:")
        # paragraphs with klasse='J' hold the spoken text; alternatively take all paragraphs except the speaker metadata ('redner')
        paragraphs = speech.xpath("p[@klasse='J']/text()") if only_J_paragraphs else speech.xpath("p[@klasse!='redner']/text()")
        for paragraph_num, paragraph in enumerate(paragraphs):
            id = speech_id + "_" + str(paragraph_num)
            speech_paragraphs_dict["id"].append(id)
            speech_paragraphs_dict["date"].append(session_date)
            speech_paragraphs_dict["speaker_id"].append(speaker_id)
            speech_paragraphs_dict["speaker_title"].append(speaker_title)
            speech_paragraphs_dict["speaker_first_name"].append(speaker_first_name)
            speech_paragraphs_dict["speaker_last_name"].append(speaker_last_name)
            speech_paragraphs_dict["speaker_fraction"].append(speaker_fraction)
            speech_paragraphs_dict["speaker_full_name"].append(speaker_full_name)
            speech_paragraphs_dict["speech_id"].append(speech_id)
            speech_paragraphs_dict["paragraph_num"].append(paragraph_num)
            speech_paragraphs_dict["paragraph"].append(paragraph)
            speech_paragraphs_dict["paragraph_len_chars"].append(len(paragraph))
            speech_paragraphs_dict["paragraph_len_words"].append(len(paragraph.split()))
            #print(f"{id}: {paragraph}")
    return speech_paragraphs_dict
We are now ready to process every XML file into the preliminary dataframe of speech paragraphs, speech_paragraphs_df:
from tqdm.notebook import tqdm # progress bar
xml_files = glob.glob("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/xml/*.xml")
speech_paragraphs_df = None
for xml_file in tqdm(xml_files):
    xml = lxml.etree.parse(xml_file)
    df = pd.DataFrame.from_dict(extract_speeches_from_xml(xml))
    speech_paragraphs_df = df if speech_paragraphs_df is None else speech_paragraphs_df.append(df)
This dataframe represents a table of every paragraph of every speech with a single identified speaker, including metadata such as the speaker's name, the speaker's fraction, and the paragraph length in words.
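As a quick, optional check of this intermediate result, we can look at its overall size and a few example rows:
# optional: inspect the intermediate result before cleanup
print(speech_paragraphs_df.shape)  # (number of paragraphs, number of columns)
speech_paragraphs_df[["date", "speaker_full_name", "paragraph_len_words"]].head()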
We will now clean up the speaker_fraction column and then remove speakers without a fraction, as those rows have some data quality problems that we will need to examine at some later point in time.
# data cleanup
speech_paragraphs_df.loc[speech_paragraphs_df.speaker_fraction == "Fraktionslos", "speaker_fraction"] = "fraktionslos"
speech_paragraphs_df.loc[speech_paragraphs_df.speaker_fraction.str.startswith("BÜNDNIS"), "speaker_fraction"] = "BÜNDNIS 90/DIE GRÜNEN"
# only keep speech paragraphs of speakers with a known fraction
speech_paragraphs_df = speech_paragraphs_df[speech_paragraphs_df.speaker_fraction.isin(["CDU/CSU", "SPD", "AfD", "FDP", "DIE LINKE", "BÜNDNIS 90/DIE GRÜNEN"])]
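As a small optional check after the cleanup, we can confirm that only the six expected fraction names remain:
# verify that only the six expected fraction names remain
print(sorted(speech_paragraphs_df.speaker_fraction.unique()))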
We can now take a look at our finished dataset:
speech_paragraphs_df
Finally, we save our finished dataset to our Google Drive for future analysis:
speech_paragraphs_df.to_parquet("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/parquet/bundestag_plenary_protocols_term_19.parquet")
To load this dataset back into memory at a later date, we would use the following code:
speech_paragraphs_df = pd.read_parquet("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/parquet/bundestag_plenary_protocols_term_19.parquet")
While we have our finished dataset available in memory, let's perform a simple example analysis, tabulating the number of speech paragraphs by speaker fraction:
speech_paragraphs_df.speaker_fraction.value_counts()
CDU/CSU 34349
SPD 22774
AfD 17829
FDP 14677
DIE LINKE 13674
BÜNDNIS 90/DIE GRÜNEN 12794
Name: speaker_fraction, dtype: int64
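If you prefer a visual summary, the same counts can also be plotted directly from pandas (a minimal sketch; matplotlib, which pandas uses for plotting, is preinstalled on Colab):
# horizontal bar chart of speech paragraph counts per fraction
speech_paragraphs_df.speaker_fraction.value_counts().plot(kind="barh", figsize=(8, 4))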
Finally, we can transform our dataset of speech paragraphs to a dataset of speeches via the following Pandas incantation:
speeches_df = speech_paragraphs_df\
    .groupby("speech_id")\
    .agg({
        "speaker_id": "first",
        "speaker_title": "first",
        "speaker_first_name": "first",
        "speaker_last_name": "first",
        "speaker_fraction": "first",
        "speaker_full_name": "first",
        "speech_id": "first",
        "paragraph": " ".join,
        "paragraph_len_chars": "sum",
        "paragraph_len_words": "sum"})
That's about it. In my next post, we'll categorize all speech paragraphs into a set of predefined topics.