Topics in the Bundestag (Part 1)
Today, we will apply a Deep Learning-based language model to classify (nearly) all speeches held in the 19th German Bundestag (the German federal parliament) into the following predefined topics, as provided by the German broadcasting network ARD:
- Außenpolitik (Foreign Policy)
- Bildung (Education)
- Digitalisierung (Digitalization)
- Familien (Family Policy)
- Gesundheitswesen (Healthcare)
- Jobs
- Klima (Climate Change)
- Pflege (Caregiving)
- Rente (Pension Schemes)
- Sicherheit (Security Policy)
- Steuern (Taxes)
- Wohnraum (Housing)
- Zuwanderung (Immigration)
DISCLAIMER: The classification model we're going to use hasn't been validated for this task and can produce biased or nonsensical results. If you're planning to use this approach in a production setting, at least make sure to validate result quality by conducting a double-blind study!
Statistical topic modeling methods (such as Latent Dirichlet Allocation, LDA) have been helping with the task of topic discovery for nearly 20 years. For our task of classifying speeches into topics, LDA results can be used as features for a downstream classification algorithm. But at least in my limited experience, it can be quite difficult to apply such a complicated construct in practice. Furthermore, with this approach, you would need to prepare a large enough dataset of topic-labeled speeches to train the classifier. Back to hardworking undergraduates it is. By the way, you should take a look at the excellent work of Open Discourse, who, among other things, did an LDA-based analysis of all Bundestag speeches since 1949.
In this post, we will try a different approach, one that will allow us to classify all speech paragraphs present in the dataset we compiled in my last post into a given set of topics. This dataset contains, depending on when you follow this post, nearly all or all speeches given during the 19th electoral term of the German federal parliament.
The approach we'll be trying is called zero-shot learning for text classification as Natural Language Inference (NLI). Let's take some time to unpack this. Zero-shot learning means that we'll use a model trained on an upstream task, i.e. NLI, to solve our downstream task, i.e. classifying speeches, without retraining or even fine-tuning the model. Zero-shot learning works without any labels for the downstream task, freeing us from compiling training data. NLI is a language classification task of the following form: Given a premise and a hypothesis, predict if the hypothesis follows from the premise (entailment), contradicts the premise (contradiction), or is unrelated to the premise (neutral). The premise "The European Emissions Trading System was the first large greenhouse gas emissions trading scheme in the world." entails the hypothesis "This text is about climate change.", contradicts the hypothesis "This text is about rock music." and is neutral with regard to the hypothesis "Germany is a country in Central Europe.". As this example suggests, a model that can solve NLI can also perform zero-shot text classification. See Joe Davison's excellent article on Zero-Shot Learning in Modern NLP for details, including alternative ideas.
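If you want to see what this looks like under the hood, here is a minimal sketch of a single NLI prediction for the example above. It uses facebook/bart-large-mnli, a widely used English NLI model, purely as an illustration; the German model we'll actually use is introduced below, and the hypothesis construction is handled for us by the zero-shot pipeline later on:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_model_name = "facebook/bart-large-mnli"  # illustrative English NLI model
tokenizer = AutoTokenizer.from_pretrained(nli_model_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_model_name)

premise = ("The European Emissions Trading System was the first large "
           "greenhouse gas emissions trading scheme in the world.")
hypothesis = "This text is about climate change."

# Encode the (premise, hypothesis) pair and predict
# entailment / neutral / contradiction probabilities.
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = nli_model(**inputs).logits.softmax(dim=-1)[0]

# The position of each class differs between models, so we look it up
# in the model config instead of hardcoding it.
for idx, label in nli_model.config.id2label.items():
    print(f"{label}: {probs[idx].item():.3f}")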
Putting all this theory into practice is easy, thanks to Hugging Face's transformers package and Sahaj Tomar's German Zeroshot model, which was fine-tuned on the German subset of the Cross-Lingual NLI Corpus and is based on deepset's German BERT language model. The latter model was trained by self-supervision on four text corpora: "OSCAR" (~145 gigabytes), a corpus scraped from websites and filtered for explicit material, "OPUS" (~10 gigabytes), a corpus compiled from various domains such as movie subtitles, parliament speeches, and books, "Wikipedia" (~6 gigabytes), a postprocessed dump of the German Wikipedia, and "Open Legal Data" (~2.4 gigabytes), a dataset of German court decisions.
With a few dozen lines of Python code, we'll be able to classify all of the 100,000+ speech paragraphs in our dataset in less than eight hours of GPU compute time. You can also follow along in Google Colab, which currently offers GPU compute time for free. If you're only interested in the results of our study, feel free to skip to the second part of this series.
Set Up Your Notebook Environment
Please see my last post on how to set up your notebook environment. You can use your own JupyterLab installation or a cloud service such as Amazon SageMaker or Google Colab. When using Google Colab, make sure to change the runtime type to "GPU" before proceeding.
Install and Import Dependencies
If you're using Google Colab, you'll first have to install Hugging Face's transformers library by issuing the following command in a notebook cell:
!pip install transformers
This library implements a wide range of modern Deep Learning-based NLP techniques. To train and infer deep neural networks, it depends on either TensorFlow or PyTorch. Next, you should check if your GPU is ready to use by issuing:
!nvidia-smi
You will need a GPU with at least eight gigabytes of memory. Depending on the GPU used, classifying our dataset will require different amounts of time. On an NVIDIA Tesla P100, the process will finish in less than eight hours.
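If you'd rather check from within Python, PyTorch can report the same information; this is just an optional sanity check and not part of the workflow below:
import torch
print(torch.cuda.is_available())             # True if a CUDA-capable GPU is visible
print(torch.cuda.get_device_name(0))         # GPU model name
print(torch.cuda.get_device_properties(0).total_memory / 1024 ** 3)  # GPU memory in GiB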
We will need to import the following Python libraries used in this project: numpy is Python's de facto standard library for numerical computing; the os package is needed to manipulate (temporary) files; pandas provides a dataframe abstraction for (kind of) conveniently working with largish tabular datasets; torch is PyTorch, the machine learning framework powering our deep language models; tqdm shows progress bars; and finally, transformers implements several modern NLP methods.
import numpy as np
import os
import pandas as pd
import torch
from tqdm.notebook import tqdm
from transformers import pipeline
Load and Prepare the Plenary Proceedings Dataset
We will now load our dataset of speeches as prepared in my previous post. In Google Colab, we'll have to mount our Google Drive, where we stored the dataset, first:
from google.colab import drive
drive.mount("/content/gdrive")
Next, we'll load the dataset into the Pandas dataframe speech_paragraphs_df
and remove all speech paragraphs with fewer than 10 words, as these are too short for meaningful classification:
speech_paragraphs_df = pd.read_parquet("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/parquet/bundestag_plenary_protocols_term_19.parquet")
speech_paragraphs_df = speech_paragraphs_df[speech_paragraphs_df.paragraph_len_words >= 10]
print(f"Number of paragraphs:\t\t{len(speech_paragraphs_df)}")
print(f"Median words per paragraph:\t{speech_paragraphs_df.paragraph_len_words.median()}")
print(f"Mean words per paragraph:\t{speech_paragraphs_df.paragraph_len_words.mean()}")
At the time of writing, there are 115,612 speech paragraphs in this dataset, with a median (mean) length of 54 (58.27) words.
Define the Classifier Pipeline
We can now prepare our zero-shot text classification pipeline, which entails downloading the models from Hugging Face's servers. Note that at least the base model (deepset's "GBERT large") is available under the permissive MIT license:
classifier = pipeline("zero-shot-classification",
model="Sahajtomar/German_Zeroshot",
device=0) # use GPU
With the classifier pipeline in place, we will define our "hypothesis" template and our labels. Our pipeline will merge the hypothesis template and the labels to yield NLI hypotheses like "In diesem Text geht es um Außenpolitik.", as described above:
hypothesis_template = "In diesem Text geht es um {}."
labels = ["Außenpolitik",
"Bildung",
"Digitalisierung",
"Jobs",
"Familien",
"Gesundheitswesen",
"Klima",
"Pflege",
"Rente",
"Sicherheit",
"Steuern",
"Wohnraum",
"Zuwanderung"]
Next, we'll write some code to classify every speech paragraph in our dataset. This code is complicated by the fact that we will work in batches to speed up processing, and that we will save a temporary file of intermediate results as a backup:
def classify(speech,
             labels=labels,
             hypothesis_template=hypothesis_template,
             multi_label=True):
    results = classifier(speech, labels,
                         hypothesis_template=hypothesis_template,
                         multi_label=multi_label)
    return results

def classify_to_df(paragraphs,
                   labels=labels,
                   hypothesis_template=hypothesis_template,
                   multi_label=True):
    classify_results = classify(paragraphs,
                                labels=labels,
                                hypothesis_template=hypothesis_template,
                                multi_label=multi_label)
    classify_results = [classify_results] if not isinstance(classify_results, list) else classify_results  # ensure result is a list
    scores_dict = [dict(sorted(zip(r["labels"], r["scores"]))) for r in classify_results]
    scores_df = pd.DataFrame(scores_dict)
    return scores_df

def classify_df_batch(df_batch,
                      labels=labels,
                      hypothesis_template=hypothesis_template,
                      multi_label=True):
    paragraphs = df_batch.paragraph.to_list()
    scores_df = classify_to_df(paragraphs,
                               labels=labels,
                               hypothesis_template=hypothesis_template,
                               multi_label=multi_label)
    scores_df.index = df_batch.index  # align indices
    scored_df_batch = pd.concat([df_batch, scores_df], axis=1)
    return scored_df_batch
def classify_df(df,
                labels=labels,
                hypothesis_template=hypothesis_template,
                multi_label=True,
                batch_size=20,
                tmp_file="/content/gdrive/MyDrive/Colab Notebooks/tmp_file.csv"):
    work_df = df
    result_df = pd.DataFrame()
    if os.path.isfile(tmp_file):
        # Resume from the temporary backup: keep the already classified rows
        # and only process the paragraphs that are not in it yet.
        incomplete_df = pd.read_csv(tmp_file)
        incomplete_df.date = pd.to_datetime(incomplete_df.date, format="%Y-%m-%d", errors="coerce")
        result_df = incomplete_df
        work_df = work_df[~work_df.id.isin(incomplete_df.id)].dropna()
    for g, df_batch in tqdm(work_df.groupby(np.arange(len(work_df)) // batch_size)):
        result_df_batch = classify_df_batch(df_batch,
                                            labels=labels,
                                            hypothesis_template=hypothesis_template,
                                            multi_label=multi_label)
        # Append the batch to the backup file; only write the header if the file is new.
        result_df_batch.to_csv(tmp_file, index=False, header=not os.path.isfile(tmp_file), mode="a")
        result_df = pd.concat([result_df, result_df_batch])
    os.remove(tmp_file)
    return result_df
Run the Classifier Pipeline and Save Results
We are now ready to classify our dataset. Depending on your computer or cloud service, this will take several hours to a few days. If the process is interrupted for some reason, just re-run it, as our code should pick up the temporary result file and continue where it left off. After the classification is finished, we save our result as an Apache Parquet file for later analysis:
%%time
classified_speech_paragraphs_df = classify_df(speech_paragraphs_df,
labels=labels,
multi_label=True)
classified_speech_paragraphs_df.to_parquet("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols_with_multiclass_topics_term_19.parquet")
Note that by setting the multi_label parameter to True, the classifier pipeline considers the labels as independent, i.e. a speech paragraph can be about multiple topics. Predicted probabilities are normalized for each candidate label by applying a softmax to the NLI entailment score versus the NLI contradiction score. This gives us, for each of our 13 labels, an estimate of the probability that a given speech paragraph is about the topic described by the label.
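To make this normalization concrete, here is a tiny numerical sketch with made-up entailment and contradiction logits (not actual model outputs):
import torch

entailment_logit = 2.1      # hypothetical logit for "entailment"
contradiction_logit = -0.7  # hypothetical logit for "contradiction"

# Softmax over (contradiction, entailment); the entailment probability
# becomes the score reported for the candidate label.
probs = torch.softmax(torch.tensor([contradiction_logit, entailment_logit]), dim=0)
print(f"{probs[1].item():.3f}")  # ~0.943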
Finally, we print a sample of our model predictions:
pd.set_option("display.max_colwidth", 250)
classified_speech_paragraphs_df.sample(8).iloc[:, [1, 7, 10] + list(range(13, 26))]
 | date | speaker_full_name | paragraph | Außenpolitik | Bildung | Digitalisierung | Familien | Gesundheitswesen | Jobs | Klima | Pflege | Rente | Sicherheit | Steuern | Wohnraum | Zuwanderung |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
491 | 2021-05-06 | Christine Aschenberg-Dugnus (FDP) | Damit sie sich nicht im Versorgungswirrwarr verirren, brauchen sie geeignete Ansprechpartner. Für Patientinnen und Patienten ist es oft sehr schwierig, den bestmöglichen Versorgungspfad zu finden; denn die Symptome von Long Covid sind unspezifisc... | 0.002154 | 0.001999 | 0.459230 | 0.131266 | 0.896095 | 0.006705 | 0.016970 | 0.568585 | 0.038141 | 0.303992 | 0.005355 | 0.020504 | 0.014352 |
5 | 2019-12-12 | Jürgen Braun (AfD) | Einseitigkeit zugunsten des Islam, Blindheit gegenüber dem islamischen Fundamentalismus, kein Gespür für die persönliche Freiheit der Bürger: Der Bericht der Bundesregierung zur Menschenrechtspolitik ist, um es auf einen Punkt zu bringen, ein Dok... | 0.039055 | 0.000737 | 0.000818 | 0.007973 | 0.000418 | 0.000512 | 0.000447 | 0.001041 | 0.000855 | 0.003408 | 0.000457 | 0.002898 | 0.013067 |
147 | 2020-10-08 | Linda Teuteberg (FDP) | Aller guten Dinge sind drei: Für ein modernes Einwanderungsrecht braucht es auch moderne Behörden. Die Visastellen vergeben längst nicht genügend Termine, damit qualifizierte Menschen zügig ihre Anträge stellen können. Wir brauchen außerdem Lotse... | 0.029913 | 0.638526 | 0.635045 | 0.047879 | 0.001336 | 0.628733 | 0.011468 | 0.006697 | 0.014245 | 0.045025 | 0.001524 | 0.027864 | 0.962281 |
90 | 2021-03-24 | Frank Schwabe (SPD) | Der aktuelle Fall ist genannt worden: Ömer Faruk Gergerlioglu, der als Menschenrechtler und als Abgeordneter anerkannt ist, soll wegen eines Tweets für zweieinhalb Jahre ins Gefängnis. | 0.053857 | 0.000553 | 0.322427 | 0.177035 | 0.000486 | 0.001232 | 0.050058 | 0.001185 | 0.002257 | 0.425522 | 0.001775 | 0.166230 | 0.349176 |
673 | 2021-05-06 | Dr. Konstantin von Notz (BÜNDNIS 90/DIE GRÜNEN) | Bedanken möchte ich mich bei den Kolleginnen und Kollegen von FDP und Linken – ich finde es gut, dass wir das so gut zusammen hinbekommen haben –, insbesondere beim Kollegen Stefan Ruppert, beim Kollegen Benjamin Strasser und bei der Kollegin Buc... | 0.589526 | 0.499660 | 0.660624 | 0.587036 | 0.373693 | 0.517687 | 0.625037 | 0.653395 | 0.544435 | 0.286163 | 0.589484 | 0.683710 | 0.688290 |
150 | 2018-06-14 | Siemtje Möller (SPD) | Ich kann hier also festhalten: Unsere Soldatinnen und Soldaten, unsere Marine macht vor Ort einen hervorragenden Job in einem klar abgegrenzten, rundherum europäischen Mandat. Liebe Kolleginnen und Kollegen, lassen Sie uns diesen sinnvollen Einsa... | 0.227165 | 0.002314 | 0.301667 | 0.018452 | 0.003619 | 0.969009 | 0.317800 | 0.428302 | 0.244927 | 0.955501 | 0.016070 | 0.030430 | 0.033591 |
68 | 2020-11-05 | René Röspel (SPD) | Zum Schluss darf ich mich ganz herzlich bei unseren Sachverständigen Lena-Sophie Müller, Jan Kuhlen, Sami Haddadin und Lothar Schröder bedanken. Wir haben richtig viel gelernt, toll diskutiert. Ich hoffe, dass sie uns auf dem Weg der Umsetzung vo... | 0.057079 | 0.168211 | 0.900108 | 0.059311 | 0.003292 | 0.162420 | 0.044387 | 0.030843 | 0.018963 | 0.069954 | 0.013431 | 0.466886 | 0.112802 |
83 | 2018-03-15 | Agnieszka Brugger (BÜNDNIS 90/DIE GRÜNEN) | Was aber aus meiner Sicht gar nicht geht – Herr Grosse-Brömer hat uns ja gerade Exzellenz versprochen –, ist, dass die Bundesregierung bezüglich der Frage von Abschiebungen und der Bewertung der Sicherheitslage in Afghanistan darauf verweist, das... | 0.876693 | 0.093432 | 0.048144 | 0.285236 | 0.007484 | 0.064532 | 0.029879 | 0.035582 | 0.045826 | 0.934408 | 0.009593 | 0.063168 | 0.442256 |
That's it for today. In the next installment of this series, we will proceed with an exploratory data analysis of our dataset of "classified" speech paragraphs.