Oliver Flasch's Blog

What LLMs Can and Cannot Do

Oliver Flasch — Sat, 14 Mar 2026 13:59:13 GMT

Will the power of Large Language Models (LLMs) continue to increase, or will they reach a plateau soon? Might the capabilities of current frontier models and AI systems be already good enough to cause massive disruption as implementations ripple through the economy? Or will the limitations of large deep learning models trained on massive amounts of tokens, mainly text, limit their reach?

Emergent Consensus

At first sight, even experts seem to struggle to reach a consensus on these questions. I think that, on further investigation, this is no longer true. The predictions of imminent danger from superintelligent AI did not come to pass, shifting the "AI doomer" narrative to a less dramatic, yet still very serious scenario of massive job loss. On the "AI boomer" side, voices promising Artificial General Intelligence are noticeably subdued, while the focus shifts from models to complex agentic systems that seem to be able to automate (most) knowledge work (very soon). In other words, after a period of hype, a consensus seems to form, albeit slowly. Let's speculate on what that consensus might look like, maybe even answering some of our initial questions on the way.

A World of Text

First, note that LLMs live in a world of symbols ("tokens"), nearly all of them sourced from massive amounts of text graciously copied from the Internet. Today, training an LLM consists of three stages: Data collection and preprocessing, pre-training, and post-training. Empirical results, i.e. "scaling laws" show that the capabilities of LLMs seem to scale with three factors: Model size, dataset size, and training compute. As more or less all easily obtainable data is already used in pre-training, scaling the remaining factors lead to the current investments in massive data centers. Algorithmic advances can be understood as a less controllable fourth factor, which recently resulted in "reasoning models". These models are post-trained by Reinforcement Learning with Verifiable Rewards (RLVR), a sort of "self play", where language models increase their performance by trying to find solutions to (often synthetically created) problems that are hard to solve, but whose solutions are easy to verify, such as certain mathematical proofs. Even multi-modal models that can process images, videos, and sound in addition to text, are, for reasons of efficiency and data availability, mainly pre-trained on text.

Embodiment

Second, contrast how animals (including us) learn to how LLMs are trained. While animals continuously learn while interacting with an evolving physical and social world, current LLMs are pre-trained on a massive, but static corpus of mostly text tokens, then post-trained on a static set of tasks and human (i.e. often commercial) preferences. This naturally leads to book-smart, somewhat biased LLM agents whose "world models" fail at common sense, suggesting you take a walk to the car wash to save on gas. These limits of current LLMs clearly show the importance of "true" understanding by "grasping" the real world in a literal sense, as an embodied intelligent being.

In principle, the deep learning paradigm, given continous training, should be sufficient to create these embodied artifical intelligences. In practice, building efficient neural architectures and large datasets to learn from true interactions with the real world, augmented with simulations where possible, is a considerably more complex task than collecting an Internet worth of text and training on that. Current "world models" trained on video data or video games already demonstrate what is possible beyond LLMs, but are also limited by the lack of true interaction with the physical world.

AI Agents and the Future of Work

Applying these ideas to current and future LLMs, I'd conclude that the "street-smarts" of LLM-based AI agents will remain limited in often surprising ways. Their flexibility will stay constrained by what is possible through "in-context learning" for several years to come, meaning that LLM-based AI agents will not be able to truly gain experience on their jobs. As these agents continue to "live" in a world of text, they should not be able to distinguish between fact and fiction in a dependable way for the foreseeable future, excluding certain high-stakes use cases and necessitating complex "guard rails". Their "creativity" should be limited to the combinatorial kind.

Taking these constraints into account, I think it's safe to say that LLM-based AI agents should not be able to directly replace humans in most roles. Productivity gains should arrive slowly, as organizations will need to change processes to create roles compatible with the limitations of LLM-based AI agents. When these roles are created though, AI agents should be able to automatically explore and synthesize solutions from existing ideas present in their massive training data sets, leading to interesting, even transformative results.

Photo by Ben Blumentritt

gerenbench - A Small Proprietary Benchmark for German RAG

Oliver Flasch — Sat, 07 Mar 2026 12:48:00 GMT

Retrieval Augmented Generation (RAG) is quickly maturing into a useful (and boring) technology to connect proprietary, often local data to Large Language Models (LLMs) to augment their training data with knowledge and skills through in-context learning. RAG can enable important use cases such as tailored AI assistants for answering questions, generating complex reports ("Deep Research") or automating tasks ("Agentic AI").

RAG employs an information retrieval system, like a database or search engine, to augment the context of an LLM with relevant information. There are many different ways of varying complexity to achieve this, from simple approaches that enrich a user’s prompt with search results every time to agentic workflows that enable an LLM to freely use multiple search engines or databases as tools. When connecting an LLM to proprietary data, it is extremely important to ensure that it cannot access public data sources at the same time, to prevent data exhilaration.

This article presents results of "gerenbench", a small proprietary benchmark that uses German encyclopedia articles to estimate the accuracy of LLMs on RAG tasks. It provides a glimpse at the native and RAG performance of open weight LLMs runnable on consumer hardware as of March 2026.

Benchmark Setup

The benchmarks contains 100 multiple choice questions based on 25 German encyclopedia articles from to mid-nineties. Each question contains the relevant article and four possible answers. It encompasses the following two scenarios:

LLM-Only: The LLM is presented with the question and four possible answers. The relevant article is withheld.
LLM+RAG: The LLM is presented with the relevant article, the question and four possible answers.

In both scenarios, the LLM is instructed (via the prompts given in the appendix) to answer the question by generating a single letter (A, B, C, or D) designating the correct answer. Therefore, scenario 1 tests world knowledge, while scenario 2 tests world knowledge and RAG capability. Result analysis is fully deterministic and automated. A redacted example task is given in the appendix of this article. Note that scenario 2 assumes the best case scenario where the relevant article is included in the LLMs context exclusively every time.

All experiments were run on an Apple M3 with 24 GB of unified RAM using Ollama Version 0.17.4.

Results

The following 10 open weight LLMs where tested on both scenarios (Table 1):

Table 1: LLMs Tested

LLM Name	Provider	Model Card	Size	Quantization
Gemma 3 1b	Google DeepMind	Hugging Face Link	1b	`q4_K_M`
Gemma 3 12b	Google DeepMind	Hugging Face Link	12b	`q4_K_M`
gpt-oss-20b	OpenAI	Hugging Face Link	20b	`q4_K_M`
LFM2 24b-a2b	Liquid AI	Hugging Face Link	24b-a2b	`q4_K_M`
Ministral 3 3b	Mistral AI	Hugging Face Link	3b	`q4_K_M`
Ministral 3 8b	Mistral AI	Hugging Face Link	8b	`q4_K_M`
Ministral 3 14b	Mistral AI	Hugging Face Link	14b	`q4_K_M`
Mistral Small 3.2 24b	Mistral AI	Hugging Face Link	24 b	`q4_K_M`
Phi-4-mini 3.8b	Microsoft	Hugging Face Link	3.8b	`q4_K_M`
Phi-4 14b	Microsoft	Hugging Face Link	14b	`q4_K_M`

Table 2 shows mean accuracies and processing times for scenario 1, i.e. LLM-Only performance, without RAG. "Prompt Duration" denotes the mean time in seconds taken to process the prompt, "Generate Duration" denotes the mean time in seconds taken to generate the answer. Best results are shown in bold font.

Table 2: Accuracy and Processing Time without RAG (Ordered by Mean Accuracy)

LLM Name	Accuracy (Mean)	Prompt Duration (Mean, s)	Generate Duration (Mean, s)
Ministral 3 14b (it-2512-q4_K_M)	0.76962	2.21029	0.15751
Mistral Small 3.2 24b (it-2506-q4_K_M)	0.75916	3.31204	0.177936
Ministral 3 8b (it-2512-q4_K_M)	0.74703	1.25622	0.291435
gpt-oss-20b	0.71656	0.865917	34.4813
Phi-4 14b (q4_K_M)	0.69075	2.39739	0.971595
Gemma 3 12b (it-q4_K_M)	0.65124	1.09994	0.104123
Ministral 3 3b (it-2512-q4_K_M)	0.58928	0.499747	0.0415913
LFM2 24b-a2b (q4_K_M)	0.57037	1.27881	0.0333128
Phi-4-mini 3.8b (q4_K_M)	0.43891	0.601904	0.260388
Gemma 3 1b (it-q4_K_M)	0.2892	0.080226	0.0246014

Table 3 shows mean accuracy and processing times for scenario 2, i.e. RAG. Best results are shown in bold font.

Table 3: RAG Accuracy and Processing Time (Ordered by Mean Accuracy)

LLM Name	RAG Accuracy (Mean)	RAG Prompt Duration (Mean, s)	RAG Generate Duration (Mean, s)
Mistral Small 3.2 24b (it-2506-q4_K_M)	0.97025	11.9376	0.172491
gpt-oss-20b	0.96963	2.0207	8.44707
Phi-4 14b (q4_K_M)	0.96947	10.1061	0.690504
Ministral 3 8b (it-2512-q4_K_M)	0.96173	5.20979	0.130678
Ministral 3 14b (it-2512-q4_K_M)	0.96017	8.46598	0.146143
Gemma 3 12b (it-q4_K_M)	0.96008	4.45786	0.137686
Ministral 3 3b (it-2512-q4_K_M)	0.91026	2.15204	0.0413062
LFM2 24b-A2b (q4_K_M)	0.87004	2.62395	0.0338279
Phi-4-mini 3.8b (q4_K_M)	0.81902	2.58125	0.615699
Gemma 3 1b (it-q4_K_M)	0.49717	0.261436	0.0246014

Figure 1 shows the accuracies of all tested LLMs for scenario 1 (LLM only) and scenario 2 (LLM+RAG). 95% confidence intervals, generated via bootstrap sampling, are included.

Figure 1: Accuracy for LLM-Only (Scenario 1) vs. LLM+RAG (Scenario 2)

Finally, figure 2 shows a Pareto plot of LLM accuracy vs. total generation time.

Figure 2: Pareto Plot of LLM Accuracy vs. Generation Time for LLM-Only (Scenario 1, Blue) and LLM+RAG (Scenario 2, Red)

Conclusions and Next Steps

The results show that mid-size, open-weight LLMs today are capable of solving the benchmark task with over 95% accuracy (an error rate of less than 1 in 20 queries). In contrast, when relying solely on internal world knowledge, even the best models tested are only about 75% accurate, producing an incorrect result for 1 out of 4 queries, on average. If an accuracy of 95% is good enough should be evaluated against the risk profile of the task at hand.

Result accuracy scales with model size, while there are still significant differences between models of comparable size, demonstrating the value of creating custom (proprietary) benchmarks. Model runtime is highly dependent on the LLM runtime (hard- and software used for inference), but also important in practice, which is why it is reported in the benchmark results.

Currently, the RAG portion of this benchmark (scenario 2) assumes perfect information retrieval, i.e. that relevant information is present every time, which is rarely the case in practice. Future iterations could be improved by including a quota of test cases where relevant information is missing. Furthermore, to move beyond the current multiple-choice format, a third scenario could be added to test free-form generation accuracy. This would involve removing multiple-choice options from the prompt template and evaluating the generated answers using pattern matching or an "LLM-as-a-judge" approach.

Appendix

All LLMs under test where instructed with the following prompt templates:

system_prompt = """
Your task is to answer multiple-choice questions with high precision and accuracy.
You must answer with the letter of the correct choice, i.e. either A,B,C, or D.
You must not generate anything else. 
"""

prompt_template = """**Question:**
{question}

**Multiple Choice Options:**
A) {answer_a}
B) {answer_b}
C) {answer_c}
D) {answer_d}

Answer the given question using one of the multiple choice options.
You must return either A, B, C, or D.
You must return exactly one letter.
---
"""

rag_prompt_template = """**Context:**
{context}

**Question:**
{question}

**Multiple Choice Options:**
A) {answer_a}
B) {answer_b}
C) {answer_c}
D) {answer_d}

Based on the given context, answer the given question using one of the multiple choice options.
You must return either A, B, C, or D.
You must return exactly one letter.
---
"""

The system_prompt gives general instructions to the model, while the prompt_template is used for the LLM-Only part of the benchmark (scenario 1) and rag_prompt_template is used for the LLM+RAG part of the benchmark (scenario 2).

The following shows a redacted example task of the benchmark:

Article
Thyssen AG, Dachgesellschaft eines in den Bereichen Investitionsgüter, Handel und Dienstleistungen und Stahlerzeugung und -verarbeitung tätigen Konzerns; Sitz: Düsseldorf. Die T. AG entwickelte sich, v.a. aus den beiden Hauptunternehmen Thyssen & Co. KG (gegr. 1871) und August Thyssen-Hütte AG (gegr. 1890), bis 1914 (...redacted...)

Question
Durch wen sollte die (...redacted...)?

Possible Answers
a) (...redacted...)
b) (...redacted...)
c) (...redacted...)
d) (...redacted...)

LLM Update

Oliver Flasch — Sat, 07 Mar 2026 12:17:58 GMT

I don’t use Large Language Models (LLMs) to write articles on my blog, because I believe it’s important to let my own voice be heard unaltered. To improve and stay authentic, there is no substitute to writing and thinking myself.

I use different commercial and open weight LLMs to test ideas and to critique drafts, though. I also use them in „deep research“ mode, as a semantic search engine to gather and structure source material. I feel that, in these use cases, they improve my productivity considerably.

I use LLMs extensively for coding, but not for „vibe coding“. As of 2026, I still review every line of code generated. This is mainly to prevent technical and cognitive debt stemming from code I do not fully understand. There are also security risks I‘m not willing to take. I find LLMs useful, although not perfect, to generate complex, algorithm-heavy code that would take me much longer to write on my own. I also use LLMs to quickly generate architectural overviews of large code bases, which I find useful, even though I need to carefully check for factual errors. The same applies to using LLMs to review code for defects and security problems, which at this stage can augment, but not replace, human work. As generating code is now basically free, I use coding-LLMs heavily for exploration, e.g. for one-off scripts and interactive visualizations, to learn a new concept, algorithm or to proof an idea.

Risk Assessment in AI Projects

Oliver Flasch — Fri, 12 Aug 2022 18:31:11 GMT

How do you assess risks in AI projects? While AI- and data science-projects share risks with classical software projects, there are specific risks you should be aware of. AI projects typically fall into one of the following four risk classes, ranging from comparatively low to very high risk:

Risk Class 1 (Low Risk)

Using a Pre-Trained AI Model in its Native Domain
‌‌In this scenario, no model training and no training data is required. Nonetheless, a labeled data set of sufficient quality is needed in order to validate model quality in a principled manner. Check published accuracy metrics (substracting a safety margin) of the pre-trained model and compare these to your requirements. In contrast to implementing off-the-shelf software, risk is slightly elevated as production input data must match model training data not only structurally, but also statistically. Examples of data science projects of risk class 1 include machine translation, face recognition / face detection, or detection of person names, organization names and geographical locations in natural language (named-entity recognition).

Risk Class 2 (Medium Risk)

Transferring a Pre-Trained AI Model to a Related Domain
If no pre-trained model is available for your task, from a risk-perspective, the next best option often is to fine-tune an existing foundation model. AI-models, specifically artificial neural networks associated with deep learning, can be pre-trained on large cross-domain datasets, leading to foundation models that can then be fine-tuned to your specific domain using a smaller, domain-specific dataset. This represents an elevated risk compared to using a pre-trained model directly, as you need to supply your own training data, which needs to be of sufficient quality for the task at hand. While this training dataset can be considerably smaller than what would be needed to train a model from scratch, fine-tuning deep learning models for language or vision tasks still need thousands to tens of thousands of training examples to reach high accuracy. Typical examples of projects in this risk class include solving vision-, e.g. image classification and detection, or language-tasks, e.g. machine translation of domain-specific texts.

Risk Class 3 (High Risk)

Training an Existing AI Model-Architecture on Proprietary Data
In case no foundation model applicable to your task at hand exists, you'll need to ressort to training a model from scratch. To reduce risk, you should keep to established model architectures with a proven track record on related tasks. Judging task / model architecture-fit without extensive experimentation is still an art-form predicated on considerable experience. Moreover, training a model from scratch requires a sufficient volume of training data, and significant effort for hyperparameter-tuning, model-specific data preprocessing (e.g. outlier detection, data imputation, data augmentation, and data transformation), validation, and quality assurance. Even more severely, it is often not known from the outset whether your training data quality is sufficient to solve your task with good enough accuracy at all. For all these reasons, AI projects of this risk class should be considered research projects and be managed as such. Calculate with a duration of 3 years. There will be considerable risks of delays or project failure, especially when encountering training data quality problems. Risks will increase further if your task at hand involves less common learning paradigms (e.g. PU-learning) and reduce if you are working with well-established statistical or machine learning methods (generalized linear models, support vector machines, graphical models, Bayesian networks, decision tree ensembles, etc.) on high-quality structured data. Risk mitigation measures should include data quality analyses and feasibility analyses (proof of value).

Risk Class 4 (Very High Risk)

Custom Development of a Application-Specific AI Model-Architecture
If your task is cannot be solved by an established model architecture, you will need to establish a research and development project of often considerable complexity, scale, and risk. If successful, the result can be intellectual property of very high value, especially if based on proprietary training data. Application-specific model architectures are somewhat common in deep learning, as models are build from composable submodules. Development of custom model architecture using the framework of inferential statistics or graphical models (including Bayesian networks) is common and useful if your team has the required specialized expertise. Though, from a risk perspective, developing custom AI model-architectures should by a measure of last ressort. You should calculate with a research project duration of 3 to 5 years and with considerable risk. Highly specialized knowledge in machine learning or statistics will be required. Examples for AI projects in this risk class include protein structure prediction or custom robotics.

Conclusions

Commercial AI vendors will often downplay project risks, be it from ignorance or malice. This phenomenon seems to be especially prevalent with vendors of AI-based off-the-shelf software for overly broad yet concrete-sounding application areas like "fraud detection", "predictive maintenance" or "anomaly detection". Be wary of the fact that a presumed class 1 project may very well turn into a failed class 3 project if your use case does not exactly match the vendor's reference!

Topics in the Bundestag (Part 2)

Oliver Flasch — Sun, 05 Sep 2021 15:04:52 GMT

If you followed the last part of this series, you ended up with a dataset of (nearly) all the paragraphs of the speeches held in the 19th Bundestag, augmented with speech ID, paragraph number, speaker and fraction information, length statistics, and probabilities of the paragraph belonging to each of the following 13 predefined topics. These topics where proposed by the German broadcasting network ARD:

Außenpolitik (Foreign Policy)
Bildung (Education)
Digitalisierung (Digitalization)
Familien (Family Policy)
Gesundheitswesen (Healthcare)
Jobs
Klima (Climate Change)
Pflege (Caregiving)
Rente (Pension Schemes)
Sicherheit (Security Policy)
Steuern (Taxes)
Wohnraum (Housing)
Zuwanderung (Immigraion)

In this part, we will analyze this dataset from multiple perspectives, including fractions, speaker, speeches and time. This will give us, among other things, fraction topic profiles, topics trending over time, topic profiles of individual speeches, and perhaps most interestingly, a tool for instantly retrieving speeches about a given combination of topics.

DISCLAIMER: The model we used for topic classification hasn't been validated for this task and could have produced biased or nonsensical results. If you're planning to use this approach in a production setting, make sure to validate result quality by comparing model predictions against human classifications, ideally in a double-blind study. In addition, the class labels taken from ARD are probably suboptimal, as the concepts vary widely in scope. You should also validate result stability against class label substitutions with synonyms.

Fraction (Party) Topic Profiles

What topics are the different fractions in the Bundestag talking about? One would expect to find interesting differences. Before we begin, let's take a short look at our dataset, to remind ourselves of what we're working with:

Click to show code.

Topics in the Bundestag (Part 1)

Oliver Flasch — Sun, 29 Aug 2021 13:34:18 GMT

Today, we will apply a Deep Learning-based language model to classify (nearly) all speeches held in the 19th German Bundestag (the German federal parliament) into the following predefined topics, as provided by the German broadcasting network ARD:

Außenpolitik (Foreign Policy)
Bildung (Education)
Digitalisierung (Digitalization)
Familien (Family Policy)
Gesundheitswesen (Healthcare)
Jobs
Klima (Climate Change)
Pflege (Caregiving)
Rente (Pension Schemes)
Sicherheit (Security Policy)
Steuern (Taxes)
Wohnraum (Housing)
Zuwanderung (Immigraion)

DISCLAIMER: The classification model we're going to use hasn't been validated for this task and can produce biased or nonsensical results. If you're planning to use this approach in a production setting, at least make sure to validate result quality by conducting a double-blind study!

We had statistical topic modeling methods (such as Latent Dirichlet Allocation, LDA) to help with the task of topic discovery for nearly 20 years. To help with our task of classifying speeches into topics, LDA results can be used as features for a downstream classification algorithm. But at least from my limited experience, it can be quite difficult to apply such a complicated construct in practice. Furthermore, with this approach, you would need to prepare a large enough dataset of topic-labeled speeches to train the classifier. Back to hardworking undergraduates it is. By the way, you should take a look at the excellent work of Open Discourse, who, among other things, did an LDA-based analysis of all Bundestag-speeches since 1949.

In this post, we will try a different approach, one that will allow us to classify all speech paragraphs present in the dataset we compiled in my last post into a given set of topics. This dataset contains, depending on when you follow this post, nearly all or all speeches given during the 19th electoral term of the German federal parliament.

The approach we'll be trying is called zero-shot learning for text classification as Natural Language Inference (NLI). Let's take some time to unpack this. Zero-shot learning means that we'll use a model trained on an upstream task, i.e. NLI, to solve our downstream task, i.e. classifying speeches, without retraining or even finetunig the model. Zero-shot learning works without any labels for the downstream task, freeing us from compiling training data. NLI is a language classification task of the following form: Given a premise and a hypothesis, predict if the hypothesis follows from the premise (entailment), contradicts the premise (contradiction), or is unrelated to the premise (neutral). The premise "The European Emissions Trading System was the first large greenhouse gas emmissions trading scheme in the world." entails the hypothesis "This text is about climate change.", contradicts the hypothesis "This text is about rock music." and is neutral with regard to the hypothesis "Gernany is a country in Central Europe.". When you solve NLI, you solve zero-shot text classification, as you can guess from the previous example. See Joe Davisons excellent article on Zero-Shot Learning in Modern NLP for details, including alternative ideas.

Putting all this theory into practice is easy, thanks to Hugging Face's transformers package and Sahaj Tomar's German Zeroshot model, which was finetuned on the German subset of the Cross-Lingual NLI Corpus and is based on Deep Set's German BERT language model. The latter model was trained by self-supervision on four text corpora: "OSCAR" (~145 Gigagytes), a corpus scraped from websites and filtered for explicit material, "OPUS" (~10 Gigabytes), a corpus compiled from various domains such as movie subtitles, parliament speeches and books, "Wikipedia" (~ 6 Gigabytes), a postprocessed dump of the German Wikipedia, and "Open Legal Data" (~ 2.4 Gigabytes), a dataset of German court decisions.

Exerting a few dozen lines of Python code, we'll be able to classify all of the 100,000+ speech paragraphs in our dataset in less than eight hours of GPU compute time. You can also follow along in Google Colab, which currently gives away GPU compute time for free. If you're only interested in the results of our study, feel free to skip to the second part of this series.

Setup Your Notebook Environment

Please see my last post on how to setup your notebook environment. You can use your own JupyterLab installation or a cloud service such as Amazon Sagemaker or Google Colab. When using Google Colab, make sure to change the runtime type to "GPU" before proceeding.

Install and Import Dependencies

If you're using Google Colab, you'll first have to install Hugging Face's transformers library by issuing the following command in a notebook cell:

!pip install transformers

This library implements a wide range of modern Deep Learning-based NLP techniques. To train and infer deep neural networks, it depends on either TensorFlow or PyTorch. Next, you should check if your GPU is ready to use by issuing:

!nvidia-smi

You will need a GPU with at least eight Gigabytes of memory. Depending on the GPU used, classifying our dataset will require different ammounts of time. On a NVIDIA Tesla P100, the process will finish in less than eight hours.

We will need to import the following Python libraries used in this project: numpy is Python's de facto standard library for numerical computing. The os package is needed to manipulate (temporary) files, pandas provides a dataframe abstraction for (kind of) conveniently working with largish tabular datasets, torch is PyTorch, the machine learning framework powering our deep language models, tqdm for showing progress bars, and finally transformers, a library implementing several modern NLP methods.

import numpy as np
import os
import pandas as pd
import torch
from tqdm.notebook import tqdm
from transformers import pipeline

Load and Prepare the Plenary Proceedings Dataset

We will now load our dataset of speeches as prepared in my previous post. In Google Colab, we'll have to mount our Google Drive, where we stored the dataset, first:

from google.colab import drive
drive.mount("/content/gdrive")

Next, we'll load the dataset into the Pandas dataframe speech_paragraphs_df and remove all speech paragraphs with less than 11 words, as these are too short for meaningful classification:

speech_paragraphs_df = pd.read_parquet("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/parquet/bundestag_plenary_protocols_term_19.parquet")
speech_paragraphs_df = speech_paragraphs_df[speech_paragraphs_df.paragraph_len_words >= 10]

print(f"Number of paragraphs:\t\t{len(speech_paragraphs_df)}")
print(f"Median words per paragraph:\t{speech_paragraphs_df.paragraph_len_words.median()}")
print(f"Mean words per paragraph:\t{speech_paragraphs_df.paragraph_len_words.mean()}")

At the time of writing, there are 115,612 speech paragraphs in this dataset, with a median (mean) length of 54 (58.27) words.

Define the Classifier Pipeline

We can now prepare our zero-shot text classification pipeline, which entails downloading the models from Hugging Face's servers. Note that at least the base model (DeepSet's "GBERT large") is available under the permissive MIT license:

classifier = pipeline("zero-shot-classification",
                      model="Sahajtomar/German_Zeroshot",
                      device=0) # use GPU

With the classifier pipeline in place, we will define our "hypothesis" template and our labels. Our pipeline will merge the hypothesis template and the labels to yield NLI-hypotheses like "In diesem Text geht es um Außenpolitik.", as described above:

hypothesis_template = "In diesem Text geht es um {}."
labels = ["Außenpolitik",
          "Bildung",
          "Digitalisierung",
          "Jobs",
          "Familien",
          "Gesundheitswesen",
          "Klima",
          "Pflege",
          "Rente",
          "Sicherheit",
          "Steuern",
          "Wohnraum",
          "Zuwanderung"]

Next, we'll write some code to classify every speech paragraph in our dataset. This code is complicated by that fact that we will work in batches to speed up processing, and that we will save a temporary file of intermediate results as a backup:

def classify(speech,
             labels=labels,
             hypothesis_template=hypothesis_template,
             multi_label=True):
  results = classifier(speech, labels, 
                       hypothesis_template=hypothesis_template,
                       multi_label=multi_label)
  return results

def classify_to_df(paragraphs,
                   labels=labels,
                   hypothesis_template=hypothesis_template,
                   multi_label=True):
  classify_results = classify(paragraphs,
                              labels=labels,
                              hypothesis_template=hypothesis_template,
                              multi_label=multi_label)
  classify_results = [classify_results] if not isinstance(classify_results, list) else classify_results # ensure result is a list
  scores_dict = [dict(sorted(list(zip(r["labels"], r["scores"])))) for r in classify_results]
  scores_df = pd.DataFrame(scores_dict)
  return scores_df

def classify_df_batch(df_batch,
                      labels=labels,
                      hypothesis_template=hypothesis_template,
                      multi_label=True):
  paragraphs = df_batch.paragraph.to_list()
  scores_df = classify_to_df(paragraphs, labels=labels, hypothesis_template=hypothesis_template)
  scores_df.index = df_batch.index # align indices
  scored_df_batch = pd.concat([df_batch, scores_df], axis=1)
  return scored_df_batch

def classify_df(df,
                labels=labels,
                hypothesis_template=hypothesis_template,
                multi_label=True,
                batch_size=20,
                tmp_file="/content/gdrive/MyDrive/Colab Notebooks/tmp_file.csv"):
  work_df = df
  result_df = pd.DataFrame()
  if os.path.isfile(tmp_file):
    incomplete_df = pd.read_csv(tmp_file)
    incomplete_df.date = pd.to_datetime(incomplete_df.date, format="%Y-%m-%d", errors="coerce")
    work_df = work_df[~work_df.id.isin(incomplete_df.id)].dropna()
  for g, df_batch in tqdm(work_df.groupby(np.arange(len(work_df)) // batch_size)):
    result_df_batch = classify_df_batch(df_batch,
                                        labels=labels,
                                        hypothesis_template=hypothesis_template,
                                        multi_label=multi_label)
    result_df_batch.to_csv(tmp_file, index=False, header=(g == 0), mode="a")
    result_df = result_df.append(result_df_batch)
  os.remove(tmp_file)
  return result_df

Run the Classifier Pipeline and Save Results

We are now ready to classify our dataset. Depending on your computer or cloud service, this will take several hours to a few days. If the process is interrupted for some reason, just re-run it, as our code should pickup the temporary result file and continue where it left off. After the classification is finished, we save our result as an Apache Parquet file for later analysis:

%%time
classified_speech_paragraphs_df = classify_df(speech_paragraphs_df,
                                              labels=labels,
                                              multi_label=True)
classified_speech_paragraphs_df.to_parquet("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols_with_multiclass_topics_term_19.parquet")

Note that by setting the multi_label parameter to True, the classifier pipeline considers the labels as independent, i.e. a speech paragraph can be about multiple topics. Predicted probabilities are normalized for each candidate label by doing a softmax of the NLI entailment score versus the NLI contradiction score. This gives us, for each of our 13 labels, an estimate of the probability that a given speech paragraph is about the topic described by the label.

Finally, we print a sample of our model predictions:

pd.set_option("display.max_colwidth", 250)
classified_speech_paragraphs_df.sample(8).iloc[:, [1, 7, 10] + list(range(13, 26))]

	date	speaker_full_name	paragraph	Außenpolitik	Bildung	Digitalisierung	Familien	Gesundheitswesen	Jobs	Klima	Pflege	Rente	Sicherheit	Steuern	Wohnraum	Zuwanderung
491	2021-05-06	Christine Aschenberg-Dugnus (FDP)	Damit sie sich nicht im Versorgungswirrwarr verirren, brauchen sie geeignete Ansprechpartner. Für Patientinnen und Patienten ist es oft sehr schwierig, den bestmöglichen Versorgungspfad zu finden; denn die Symptome von Long Covid sind unspezifisc...	0.002154	0.001999	0.459230	0.131266	0.896095	0.006705	0.016970	0.568585	0.038141	0.303992	0.005355	0.020504	0.014352
5	2019-12-12	Jürgen Braun (AfD)	Einseitigkeit zugunsten des Islam, Blindheit gegenüber dem islamischen Fundamentalismus, kein Gespür für die persönliche Freiheit der Bürger: Der Bericht der Bundesregierung zur Menschenrechtspolitik ist, um es auf einen Punkt zu bringen, ein Dok...	0.039055	0.000737	0.000818	0.007973	0.000418	0.000512	0.000447	0.001041	0.000855	0.003408	0.000457	0.002898	0.013067
147	2020-10-08	Linda Teuteberg (FDP)	Aller guten Dinge sind drei: Für ein modernes Einwanderungsrecht braucht es auch moderne Behörden. Die Visastellen vergeben längst nicht genügend Termine, damit qualifizierte Menschen zügig ihre Anträge stellen können. Wir brauchen außerdem Lotse...	0.029913	0.638526	0.635045	0.047879	0.001336	0.628733	0.011468	0.006697	0.014245	0.045025	0.001524	0.027864	0.962281
90	2021-03-24	Frank Schwabe (SPD)	Der aktuelle Fall ist genannt worden: Ömer Faruk Gergerlioglu, der als Menschenrechtler und als Abgeordneter anerkannt ist, soll wegen eines Tweets für zweieinhalb Jahre ins Gefängnis.	0.053857	0.000553	0.322427	0.177035	0.000486	0.001232	0.050058	0.001185	0.002257	0.425522	0.001775	0.166230	0.349176
673	2021-05-06	Dr. Konstantin von Notz (BÜNDNIS 90/DIE GRÜNEN)	Bedanken möchte ich mich bei den Kolleginnen und Kollegen von FDP und Linken – ich finde es gut, dass wir das so gut zusammen hinbekommen haben –, insbesondere beim Kollegen Stefan Ruppert, beim Kollegen Benjamin Strasser und bei der Kollegin Buc...	0.589526	0.499660	0.660624	0.587036	0.373693	0.517687	0.625037	0.653395	0.544435	0.286163	0.589484	0.683710	0.688290
150	2018-06-14	Siemtje Möller (SPD)	Ich kann hier also festhalten: Unsere Soldatinnen und Soldaten, unsere Marine macht vor Ort einen hervorragenden Job in einem klar abgegrenzten, rundherum europäischen Mandat. Liebe Kolleginnen und Kollegen, lassen Sie uns diesen sinnvollen Einsa...	0.227165	0.002314	0.301667	0.018452	0.003619	0.969009	0.317800	0.428302	0.244927	0.955501	0.016070	0.030430	0.033591
68	2020-11-05	René Röspel (SPD)	Zum Schluss darf ich mich ganz herzlich bei unseren Sachverständigen Lena-Sophie Müller, Jan Kuhlen, Sami Haddadin und Lothar Schröder bedanken. Wir haben richtig viel gelernt, toll diskutiert. Ich hoffe, dass sie uns auf dem Weg der Umsetzung vo...	0.057079	0.168211	0.900108	0.059311	0.003292	0.162420	0.044387	0.030843	0.018963	0.069954	0.013431	0.466886	0.112802
83	2018-03-15	Agnieszka Brugger (BÜNDNIS 90/DIE GRÜNEN)	Was aber aus meiner Sicht gar nicht geht – Herr Grosse-Brömer hat uns ja gerade Exzellenz versprochen –, ist, dass die Bundesregierung bezüglich der Frage von Abschiebungen und der Bewertung der Sicherheitslage in Afghanistan darauf verweist, das...	0.876693	0.093432	0.048144	0.285236	0.007484	0.064532	0.029879	0.035582	0.045826	0.934408	0.009593	0.063168	0.442256

That's it for today. In the next installment of this series, we will proceed with an exploratory data analysis of our dataset of "classified" speech paragraphs.

German Plenary Proceedings as an NLP Testbed

Oliver Flasch — Tue, 16 Mar 2021 10:46:48 GMT

With the 19th electoral term of the German federal parliament ("Bundestag" in German) drawing to a close, it could be interesting to do some data analysis on what has been said there. Fortunately, for the first time in this electoral term, the Bundestag makes all speeches available in an XML format, complete with speaker metadata.

In this article, we will download this data and preprocess it into a dataframe for easy analysis. Please note the rights of use of this dataset as published at https://www.bundestag.de/services/impressum.

Setup your Notebook Environment

Google Colaboratory (Colab) on the Google Cloud Platform offers a JupyterLab-like computational notebook environment with all Python libraries needed for this project preinstalled, offering maximal convenience. On the other hand, you should be aware of the privacy- and Quality-of-Service implications of using a free Google service.

If you'll be using your own infrastructure, you'll need to install Python and all required Python libraries. I'd recommend to also install JupyterLab or an alternative computational notebook environment. Many people prefer computational notebooks over IDEs for explorative data science work.

Import Python Dependencies

To start, we will import some Python dependencies. The glob module is used to find all files matching a given name pattern in a given directory, while os helps with manipulating filenames. google.colab.drive will allow us to mount our google drive for permanently storing our retrieved dataset. lxml is an XML parser, urllib3 is used to download files over HTTP. numpy and pandas are used to build our dataset.

from datetime import datetime
import glob
from google.colab import drive
import lxml
import lxml.html
import numpy as np
import os
import pandas as pd
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
import urllib.request

Mount Google Drive

Next, we will "mount" our google drive, so that we can save files to it programmatically from this notebook. We will save temporary files that we download, as well as our completed dataset. After executing the following code cell, click on the link displayed as output, allow access and copy the access token, then paste the access token into the textbox displayed as output.

drive.mount('/content/gdrive')

Download Plenary Proceedings as XML Files

Before downloading, we will first prepare directories on our Google Drive to save our data to, then we will download the XML schema (document type definition - DTD) of the plenary proceedings:

%%shell

PARQUET_DATA_DIR="/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/parquet"
XML_DATA_DIR="/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/xml"

mkdir -p "$PARQUET_DATA_DIR"
mkdir -p "$XML_DATA_DIR"
cd "$XML_DATA_DIR"

# download XML DTD for plenary protocolls of the 19th term if it do not exist on google drive...
if [ ! -f "dbtplenarprotokoll-data.dtd" ]; then
    wget -N https://www.bundestag.de/resource/blob/575720/22c416420e8a51c380d2ddffb19ff5b7/dbtplenarprotokoll-data.dtd
fi

We are now ready to download the raw plenary proceeding XML files:

def download_plenary_protocols(to_path):
  http = urllib3.PoolManager() 
  offset = 0
  count = 0
  while True:
    response = http.request("GET", f"https://www.bundestag.de/ajax/filterlist/de/services/opendata/543410-543410?noFilterSet=true&offset={offset}")
    parsed = lxml.html.fromstring(response.data)
    empty = True
    for link in parsed.getiterator(tag="a"):
      empty = False
      link_href = link.attrib["href"]
      count += 1
      filename = to_path + "/" + os.path.basename(link_href)
      file_url = "https://www.bundestag.de" + link_href
      print(f"downloading URL '{file_url}'")
      urllib.request.urlretrieve(file_url, filename)
    if empty: break
    offset += 10
  print(f"downloaded {count} XML files")

download_plenary_protocols("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/xml")

Data Preprocessing and Cleanup

Next, we will define a function to extract speeches from a given plenary proceedings XML file. This function will also extract/calculate metadata for each speech:

def extract_speeches_from_xml(xml, only_J_paragraphs=True):
  speech_paragraphs_dict = {
      "id": [],
      "date": [],
      "speaker_id": [],
      "speaker_title": [],
      "speaker_first_name": [],
      "speaker_last_name": [],
      "speaker_fraction": [],
      "speaker_full_name": [],
      "speech_id": [],
      "paragraph_num": [],
      "paragraph": [],
      "paragraph_len_chars": [],
      "paragraph_len_words": []
  }
  def first_or_empty_string(a): return a[0] if a else ""
  session_date_string = xml.xpath("/dbtplenarprotokoll/@sitzung-datum")[0]
  session_date = datetime.strptime(session_date_string, "%d.%m.%Y\)
  speeches_with_one_speaker = xml.xpath("//sitzungsverlauf//rede[count(p/redner)=1]")
  for speech in speeches_with_one_speaker:
    speaker_id = speech.xpath("p/redner/@id")[0]
    speech_id = speech.xpath("@id")[0]
    speaker_title = first_or_empty_string(speech.xpath("p/redner/name/titel/text()"))
    speaker_first_name = first_or_empty_string(speech.xpath("p/redner/name/vorname/text()"))
    speaker_last_name = first_or_empty_string(speech.xpath("p/redner/name/nachname/text()"))
    speaker_fraction = first_or_empty_string(speech.xpath("p/redner/name/fraktion/text()"))
    speaker_full_name = speaker_title + (" " if speaker_title != "" else "") + speaker_first_name + " " + speaker_last_name + " (" + speaker_fraction + ")"
    #print(f"{speaker_id} {speaker_first_name} {speaker_last_name}:")
    paragraphs = speech.xpath("p[@klasse='J']/text()") if only_J_paragraphs else speech.xpath("p[@klasse!='redner']/text()")
    for paragraph_num, paragraph in enumerate(speech.xpath("p[@klasse='J']/text()")):
      id = speech_id + "_" + str(paragraph_num)
      speech_paragraphs_dict["id"].append(id)

speech_paragraphs_dict["date"].append(session_date)
      speech_paragraphs_dict["speaker_id"].append(speaker_id)
      speech_paragraphs_dict["speaker_title"].append(speaker_title)
      speech_paragraphs_dict["speaker_first_name"].append(speaker_first_name)
      speech_paragraphs_dict["speaker_last_name"].append(speaker_last_name)
      speech_paragraphs_dict["speaker_fraction"].append(speaker_fraction)
      speech_paragraphs_dict["speaker_full_name"].append(speaker_full_name)
      speech_paragraphs_dict["speech_id"].append(speech_id)
      speech_paragraphs_dict["paragraph_num"].append(paragraph_num)
      speech_paragraphs_dict["paragraph"].append(paragraph)
      speech_paragraphs_dict["paragraph_len_chars"].append(len(paragraph))
      speech_paragraphs_dict["paragraph_len_words"].append(len(paragraph.split()))
      #print(f"{id}: {paragraph}")
  return speech_paragraphs_dict

We are now ready to process every XML file into the preliminary dataframe of speeches speech_paragraph_df:

from tqdm.notebook import tqdm # progress bar

xml_files = glob.glob("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/xml/*.xml")

speech_paragraphs_df = None
for xml_file in tqdm(xml_files):
  xml = lxml.etree.parse(xml_file)
  df = pd.DataFrame.from_dict(extract_speeches_from_xml(xml))
  speech_paragraphs_df = df if speech_paragraphs_df is None else speech_paragraphs_df.append(df)

This dataframe represents a table of every paragraph of every speech delivered by a speaker of a known fraction, including metadata such as speaker name, speaker fraction, and paragraph length in words.

We will now cleanup the speaker_fraction column. We will then remove speakers without fraction, as there are some data quality problems with these. We will need to examine these problems at some later point in time.

# data cleanup
speech_paragraphs_df.loc[speech_paragraphs_df.speaker_fraction == "Fraktionslos", "speaker_fraction"] = "fraktionslos"
speech_paragraphs_df.loc[speech_paragraphs_df.speaker_fraction.str.startswith("BÜNDNIS"), "speaker_fraction"] = "BÜNDNIS 90/DIE GRÜNEN"

# only keep speech paragraphs of speaker's with known fraction
speech_paragraphs_df = speech_paragraphs_df[speech_paragraphs_df.speaker_fraction.isin(["CDU/CSU", "SPD", "AfD", "FDP", "DIE LINKE", "BÜNDNIS 90/DIE GRÜNEN"])]

We can now take a look at our finished dataset:

speech_paragraphs_df

Finally, we save our finished dataset to our Google Drive for future analysis:

speech_paragraphs_df.to_parquet("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/parquet/bundestag_plenary_protocols_term_19.parquet")

To load this dataset back into memory at a later date, we would use the following code:

speech_paragraphs_df = pd.read_parquet("/content/gdrive/My Drive/Colab Notebooks/bundestag_plenary_protocols/parquet/bundestag_plenary_protocols_term_19.parquet")

While we have our finished dataset available in memory, lets perform a simple example analysis, tabulating the number of speech paragraphs by speaker fraction:

speech_paragraphs_df.speaker_fraction.value_counts()

CDU/CSU                  34349
SPD                      22774
AfD                      17829
FDP                      14677
DIE LINKE                13674
BÜNDNIS 90/DIE GRÜNEN    12794
Name: speaker_fraction, dtype: int64

Finally, we can transform our dataset of speech paragraphs to a dataset of speeches via the following Pandas incantation:

speeches_df = speech_paragraphs_df\
  .groupby("speech_id")\
  .agg({
    "speaker_id": "first",
    "speaker_title": "first",
    "speaker_first_name": "first",
    "speaker_last_name": "first",
    "speaker_fraction": "first",
    "speaker_full_name": "first",
    "speech_id": "first",
    "paragraph": " ".join,
    "paragraph_len_chars": "sum",
    "paragraph_len_words": "sum"})

That's about it. In my next post, we'll categorize all speech paragraphs into a set of predefined topics.

Deep Learning German

Oliver Flasch — Wed, 23 Sep 2020 15:30:09 GMT

Remember the frenzy OpenAI caused last year with it's announcement not to publish the largest variant of it's latest language model GPT-2? Because of concerns about it being used to generate deceptive, biased, or abusive language at scale? Of course, other parties quickly trained their own versions, proving these concerns to be largely unfounded. But GPT-3, OpenAI's newest iteration of basically the same model architecture, but scaled up a hundredfold, shows that we're quickly entering dangerous (and fascinating) territory.

In this blog, we're going to implement a small web app that uses a German GPT-2 model to auto-complete your sentences. We'll be using free GPU instances available in Google Colaboratory, so that you can follow along quickly, without expensive hardware or complicated setup. Of course, you can also use your local PyTorch installation with JupyterLab, which should give you exactly the same results.

But First, Some Theory

The "GPT" in GPT-2 stands for "Generative Pretrained Transformer", meaning, given a fixed window of text preprocessed into a fixed-size vocabulary of tokens (normalized pieces of words/sentences), the model will predict the probabilities of each token being the next one. To express this in simpler terms without being totally inaccurate: Given some words as a "prompt", GPT-2 will predict probable next words. By applying GPT-2 to it's own output, you can generate arbitrary long texts.
Therefore, GPT-2 can be trained on large unlabeled text corpora in an unsupervised fashion, making training relatively simple and cheap. This is, if you can provide the necessary compute resources.
OpenAI trained the orignal variant of this model on 8 million documents scraped from good quality Reddit submissions, for a total of 40 GB of text.
As a neural network, GTP-2 can be modified or plugged into a larger network to solve many different tasks, even across media boundaries. Neural networks are a kind of non-leaky mathematical abstraction.
See OpenAI's original GPT-2 Paper for the details, or The Illustrated GPT-2, Jay Alammar's excellent exposition.‌
Thilina Rajapakse's Simple Transformers library makes it very simple to apply GPT-2 and other transformer models to a wide variety of NLP tasks.

Set Up Your Notebook Environment

We will use PyTorch as a modern machine learning framework. PyTorch is a good fit for our task, as it offers a modern readable (eager, imperative) API, combined with very good performance and a rapidly growing community.

A simple way to get startet with Deep Learning-based NLP is by using Google Colaboratory (Colab) on the Google Cloud Platform. Colab offers a JupyterLab-like computational notebook environment with PyTorch already preinstalled. As of August 2020, Colab offers you free GPU instances, which is great for compute-intensive tasks like GPT-2 training and -inference. On the other hand, you should be aware of the privacy- and QoS implications of using a free Google service.

If you'll be using your own infrastructure, you'll need to install PyTorch and configure it to use your GPU (if applicable). I'd recommend to also install JupyterLab or an alternative computational notebook environment. Many people prefer computational notebooks over IDEs for explorative data science work.

Prepare PyTorch

Open your a new notebook in your notebook environment of choice, import both PyTorch and Numpy, and check if PyTorch supports your GPU(s). When running in Colab, you will need to enable GPU support by selecting "Change runtime type" in the "Runtime" menu. Then select "GPU" in the "Hardware accelerator" dropdown menu. Colab might assign you a different GPU model each time your open your notebook and start the runtime.

import torch
import numpy as np

print(f"Detected {torch.cuda.device_count()} PyTorch-compatbile GPU devices.")
print(f"Name of first GPU device: {torch.cuda.get_device_name(0)}")

This should print the number of GPUs detected by PyTorch and the model name of the first GPU detected.

If you're running in Colab, you should mount your Google Drive (which provides you with free permantent storage space) to store the required software and GPT-2 model parameters:

from google.colab import drive

drive.mount('/content/gdrive')

When running these lines, Colab will provide you with on-screen instructions on how to enable access to your Google Drive contents from your Colab notebook.

If you're running this for the first time, clone the transformer-lm Python package from GitHub:

%%shell
cd /content/gdrive/My\ Drive/Colab\ Notebooks
git clone https://github.com/lopuhin/transformer-lm.git

Next, install the requirements of the transformer-lm package to your local notebook environment:

%%shell
cd /content/gdrive/My\ Drive/Colab\ Notebooks/transformer-lm/
pip install -r requirements.txt

Then add transformer-lm to your Python package path:

import sys
sys.path.append('/content/gdrive/My Drive/Colab Notebooks/transformer-lm/')

Download Zamia AI's German GPT-2 Model

If you're running this for the first time, download and extract the pretrained GPT-2 model from Zamia AI. Otherwise, please skip this step to preserve Zamia AI's bandwidth!

%%shell
# only do this once to preserve bandwith!
cd /content/gdrive/My\ Drive/Colab\ Notebooks
mkdir pytorch_models
cd pytorch_models
wget https://goofy.zamia.org/zamia-speech/brain/gpt2-german-345M-r20191119.tar.xz
wget https://goofy.zamia.org/zamia-speech/brain/sp.vocab tar xvf gpt2-german-345M-r20191119.tar.xz rm gpt2-german-345M-r20191119.tar.xz
ls -lha

Now you should be able to load the German GPT-2 model into memory. We will be using Konstantin Lopuhin's transformer-lm language model wrapper:

import lm.inference as lmi
from pathlib import Path

mw = lmi.ModelWrapper.load(Path("/content/gdrive/My Drive/Colab Notebooks/pytorch_models/de345-root/"))

Make Predictions

Next, we'll define functions to infer new tokens from our model:

def gpt2_gen(mw, prefix, n_tokens_to_generate=10, top_k=8):
  prefix_tokens = mw.tokenize(prefix)
  generated_tokens = mw.generate_tokens(prefix_tokens, n_tokens_to_generate, top_k)
  generated_text = mw.sp_model.DecodePieces(generated_tokens)
  return generated_text

def gpt2_gen_until(mw, prefix, stop_token='.', top_k=8):
  prefix_tokens = mw.tokenize(prefix)
  generated_tokens = list(prefix_tokens) # generate until the stop_token is seen
  next_token = ''
  while next_token != stop_token:
    # generate TOP_K potential next tokens
    ntk = mw.get_next_top_k(generated_tokens, top_k)
    # convert log probs to real probs
    logprobs = np.array(list(map(lambda a: a[0], ntk)))
    probs = np.exp(logprobs) / np.exp(logprobs).sum()
    # pick next token randomly according to probs distribution
    next_token_n = np.random.choice(top_k, p=probs)
    next_token = ntk[next_token_n][1]
    # append next token
    generated_tokens.append(next_token)
    print(mw.sp_model.DecodePieces(generated_tokens)) # DEBUG output
  # decode tokens and return generated text
  generated_text = mw.sp_model.DecodePieces(generated_tokens)
  return generated_text

The function gpt2_gen takes a prefix string or "prompt" from which to generate a fixed number of tokens by iteratively applying the model to its own output. At each step, the model will generate a probability of each possible token as its output. If we would always choose the token with highest predicted probability, the model would generate very static and boring text, often getting stuck in loops. Instead, to choose the token to predict, we use top k sampling: First, we sort the tokens by probability, then zero-out the probabilities for tokens below the top_kth token and sample from the remaining distribution. This appears to improve percieved quality of the generated text by removing the distribution's tail, making it less likely for the generated text to go off topic. Sampling from language models is an active area of research. See this article for a good introduction.

We also define the tool function gpt_gen_until that generates tokens until a defined stop_token (e.g. a full stop '.') is generated by the model.

Finally, we are ready to make some predictions / generate some text by using gpt2_gen_until to complete a german sentence:

gpt2_gen_until(mw, "Wenn der Regen niederbraust, wenn der Sturm das Feld durchsaust,", stop_token=".", top_k=8)

'Wenn der Regen niederbraust, wenn der Sturm das Feld durchsaust, so kann sich auch die Landwirtschaft nicht sicher sein.'

Okay, maybe not exactly what Heine thought of, but nonetheless, it works! Time for another test, this time with timing via %%time to get an idea of the compute time required for inference:

%%time
gpt2_gen(mw, "Das blaue Pferd", n_tokens_to_generate=16, top_k=8)

CPU times: user 5.64 s, sys: 13.8 ms, total: 5.65 s
Wall time: 5.65 s

'Das blaue Pferd war in der deutschen Reitkunst ein sehr beliebter Ausdruck für seine Fähigkeit, auch im'

Build a Web App to Auto-Complete Your Sentences

Having a working model, we can now create a very simple web app to auto-complete german text:

from google.colab.output import eval_js
from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.parse
import json

port = 8000
colab_url = eval_js(f'google.colab.kernel.proxyPort({port})')

html_code = '''









GPT-2 Text-Autopilot






🦄 GPT-2 Schreib-Autopilot
Schreibblockade? Müde? Nutze den Schreib-Autopiloten, und Deine Texte schreiben sich von selbst! 🤩
Er dachte, automatisches Schreiben wäre eine gute Idee weil




© Copyright 2020 Oliver Flasch



'''

class WebServiceHandler(BaseHTTPRequestHandler):
  def do_GET(self):
    parsed_url = urllib.parse.urlparse(self.path)
    if parsed_url.path == '/':
      self.send_response(200)
      self.send_header('Content-type', 'text/html')
      self.end_headers()
      self.wfile.write(html_code.encode('utf-8'))
    else:
      self.send_response(404)
      self.end_headers()
  def do_POST(self):
    parsed_url = urllib.parse.urlparse(self.path)
    if parsed_url.path == '/autocomplete':
      self.send_response(200)
      self.send_header('Content-type', 'application/json')
      self.end_headers()
      content_length = int(self.headers['Content-Length'])
      prefix_text = json.loads(self.rfile.read(content_length))
      generated_text = gpt2_gen(mw, prefix_text, n_tokens_to_generate=5, top_k=8)
      new_text = generated_text[len(prefix_text):]
      self.wfile.write(json.dumps(new_text).encode('utf-8'))
    else:
      self.send_response(404)
      self.end_headers()

You can start this web and access it through the Google Colaboratory proxy port:

print(f'Open the following link: {colab_url}')
httpd = HTTPServer(('', port), WebServiceHandler)
try:
  httpd.serve_forever()
except KeyboardInterrupt:
  pass
httpd.server_close()

This makes the app accessible to your own browser only.

Serve Your Web App Through ngrok

Install and configure pyngrok to run the Web Application publicly (that is, as long as the runtime of your notebook is active):

%%shell
pip install pyngrok
ngrok authtoken

Now you can start your web app through ngrok, making it available to the public:

from pyngrok import ngrok

public_url = ngrok.connect(port = str(port))
print(f'Open the following link: {public_url}')
httpd = HTTPServer(('', port), WebServiceHandler)
try:
  httpd.serve_forever()
except KeyboardInterrupt:
  pass
httpd.server_close()

Wrapping Up

This covers the basics of using a pretrained GPT-2 model to generate German text.

Next, you could explore GPT-3 fascinating few-shot-learning abilities, fine tune GPT-2 for specific text genres or exploit the modularity of deep neural networks to modify the model for other tasks, such as text classification, sentiment analysis, or question answering. I hope to cover some of these topics in future posts.

Acknowledgements

Many thanks to Guenter Bartsch (Zamia AI) for training the first publically available German GPT-2 model, and to Konstantin Lopuhin for his excellent work on transformer-lm, his PyTorch implementation of GPT-2.

Read Me

Oliver Flasch — Mon, 24 Aug 2020 10:00:07 GMT

This blog is about data strategy and artificial intelligence. I'm writing to capture ideas, walk through experiments and document what works in practice (and what doesn't).

Here are some of the topics I hope to cover:

Deep-dives into real-world applications of current AI technology
German natural language processing (NLP) with transformer models
Automating machine learning and deep learning (AutoML)
Management of successful AI projects

It's nice to have you here. Enjoy your reading!