Remember the frenzy OpenAI caused last year with its announcement not to publish the largest variant of its latest language model, GPT-2? Because of concerns about it being used to generate deceptive, biased, or abusive language at scale? Of course, other parties quickly trained their own versions, proving these concerns to be largely unfounded. But GPT-3, OpenAI's newest iteration of basically the same model architecture, scaled up a hundredfold, shows that we're quickly entering dangerous (and fascinating) territory.

In this blog post, we're going to implement a small web app that uses a German GPT-2 model to auto-complete your sentences. We'll be using free GPU instances available in Google Colaboratory, so that you can follow along quickly, without expensive hardware or complicated setup. Of course, you can also use your local PyTorch installation with JupyterLab, which should work the same way.

But First, Some Theory

  • The "GPT" in GPT-2 stands for "Generative Pretrained Transformer", meaning, given a fixed window of text preprocessed into a fixed-size vocabulary of tokens (normalized pieces of words/sentences), the model will predict the probabilities of each token being the next one. To express this in simpler terms without being totally inaccurate: Given some words as a "prompt", GPT-2 will predict probable next words. By applying GPT-2 to its own output, you can generate arbitrarily long texts.
  • Therefore, GPT-2 can be trained on large unlabeled text corpora in an unsupervised fashion, making training relatively simple and cheap. That is, if you can provide the necessary compute resources.
  • OpenAI trained the original variant of this model on 8 million documents scraped from links in well-rated Reddit submissions, for a total of 40 GB of text.
  • As a neural network, GPT-2 can be modified or plugged into a larger network to solve many different tasks, even across media boundaries. Neural networks are a kind of non-leaky mathematical abstraction.
  • See OpenAI's original GPT-2 Paper for the details, or The Illustrated GPT-2, Jay Alammar's excellent exposition.
  • Thilina Rajapakse's Simple Transformers library makes it very simple to apply GPT-2 and other transformer models to a wide variety of NLP tasks.
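
The two core ideas from the list above (the raw text is its own training signal, and generation means feeding the model's output back in) can be sketched with a toy model. This is not GPT-2: the illustrative functions below only count which word follows which, and they look at just the last token instead of a window of text. The tiny corpus and all function names are made up for this sketch.

```python
import random
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count which token follows which; no labels needed, the text supervises itself."""
    tokens = text.split()
    counts = defaultdict(Counter)
    for current_token, next_token in zip(tokens, tokens[1:]):
        counts[current_token][next_token] += 1
    return counts

def next_token_probs(counts, token):
    """Turn the follow-up counts for `token` into a probability distribution."""
    followers = counts[token]
    total = sum(followers.values())
    return {t: c / total for t, c in followers.items()}

def generate(counts, prompt, n_tokens, rng):
    """Repeatedly sample a next token and feed the output back in as input."""
    tokens = prompt.split()
    for _ in range(n_tokens):
        probs = next_token_probs(counts, tokens[-1])
        if not probs:  # dead end: this token was never followed by anything
            break
        candidates = list(probs)
        weights = [probs[t] for t in candidates]
        tokens.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(tokens)

# Made-up miniature corpus, standing in for 40 GB of web text.
corpus = "der Hund jagt die Katze . die Katze jagt die Maus . die Maus frisst den Käse ."
model = train_bigram_model(corpus)
print(next_token_probs(model, "jagt"))
print(generate(model, "die Katze", n_tokens=6, rng=random.Random(0)))
```

GPT-2 replaces the count table with a transformer network and the single-token context with a long window, but the predict-sample-append loop has the same shape.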

Set Up Your Notebook Environment

We will use PyTorch as our machine learning framework. PyTorch is a good fit for our task, as it offers a modern, readable (eager, imperative) API, combined with very good performance and a rapidly growing community.

A simple way to get started with deep learning-based NLP is by using Google Colaboratory (Colab) on the Google Cloud Platform. Colab offers a JupyterLab-like computational notebook environment with PyTorch already preinstalled. As of August 2020, Colab offers you free GPU instances, which is great for compute-intensive tasks like GPT-2 training and inference. On the other hand, you should be aware of the privacy and quality-of-service (QoS) implications of using a free Google service.

If you'll be using your own infrastructure, you'll need to install PyTorch and configure it to use your GPU (if applicable). I'd also recommend installing JupyterLab or an alternative computational notebook environment. Many people prefer computational notebooks over IDEs for explorative data science work.

Prepare PyTorch

Open a new notebook in your notebook environment of choice, import both PyTorch and NumPy, and check whether PyTorch supports your GPU(s). When running in Colab, you will need to enable GPU support by selecting "Change runtime type" in the "Runtime" menu, then selecting "GPU" in the "Hardware accelerator" dropdown menu. Colab might assign you a different GPU model each time you open your notebook and start the runtime.

import torch
import numpy as np

print(f"Detected {torch.cuda.device_count()} PyTorch-compatible GPU devices.")
print(f"Name of first GPU device: {torch.cuda.get_device_name(0)}")

This should print the number of GPUs detected by PyTorch and the model name of the first GPU detected.

If you're running in Colab, you should mount your Google Drive (which provides you with free permanent storage space) to store the required software and GPT-2 model parameters:

from google.colab import drive
drive.mount('/content/gdrive')  # mount point matches the /content/gdrive paths used below

When running these lines, Colab will provide you with on-screen instructions on how to enable access to your Google Drive contents from your Colab notebook.

If you're running this for the first time, clone the transformer-lm Python package from GitHub:

cd /content/gdrive/My\ Drive/Colab\ Notebooks
git clone

Next, install the requirements of the transformer-lm package to your local notebook environment:

cd /content/gdrive/My\ Drive/Colab\ Notebooks/transformer-lm/
pip install -r requirements.txt

Then add transformer-lm to your Python package path:

import sys
sys.path.append('/content/gdrive/My Drive/Colab Notebooks/transformer-lm/')

Download Zamia AI's German GPT-2 Model

If you're running this for the first time, download and extract the pretrained GPT-2 model from Zamia AI. Otherwise, please skip this step to preserve Zamia AI's bandwidth!

# only do this once to preserve bandwidth!
cd /content/gdrive/My\ Drive/Colab\ Notebooks
mkdir pytorch_models
cd pytorch_models
wget
tar xvf gpt2-german-345M-r20191119.tar.xz
rm gpt2-german-345M-r20191119.tar.xz
ls -lha
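
To make the "only do this once" rule harder to break, a small guard can check whether the extracted model directory already exists before you download anything. This is a minimal sketch: the helper name model_is_present and the placeholder file name are made up for illustration, and the demo uses a temporary directory instead of your real Google Drive path.

```python
from pathlib import Path
import tempfile

def model_is_present(model_dir):
    """Return True if the directory exists and contains at least one entry."""
    model_dir = Path(model_dir)
    return model_dir.is_dir() and any(model_dir.iterdir())

# Demonstrate with a throwaway directory rather than a real Drive path.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "de345-root"
    print(model_is_present(model_dir))  # directory does not exist yet
    model_dir.mkdir()
    (model_dir / "placeholder.bin").write_text("x")  # hypothetical file name
    print(model_is_present(model_dir))  # directory now has content
```

In the notebook, you would call the helper with the path to your pytorch_models/de345-root directory and only run the download cell if it returns False.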

Now you should be able to load the German GPT-2 model into memory. We will be using Konstantin Lopuhin's transformer-lm language model wrapper:

import lm.inference as lmi
from pathlib import Path

mw = lmi.ModelWrapper.load(Path("/content/gdrive/My Drive/Colab Notebooks/pytorch_models/de345-root/"))

Make Predictions

Next, we'll define functions to infer new tokens from our model:

def gpt2_gen(mw, prefix, n_tokens_to_generate=10, top_k=8):
  prefix_tokens = mw.tokenize(prefix)
  generated_tokens = mw.generate_tokens(prefix_tokens, n_tokens_to_generate, top_k)
  generated_text = mw.sp_model.DecodePieces(generated_tokens)
  return generated_text

def gpt2_gen_until(mw, prefix, stop_token='.', top_k=8):
  prefix_tokens = mw.tokenize(prefix)
  generated_tokens = list(prefix_tokens)
  # generate until the stop_token is seen
  next_token = ''
  while next_token != stop_token:
    # generate TOP_K potential next tokens
    ntk = mw.get_next_top_k(generated_tokens, top_k)
    # convert log probs to real probs
    logprobs = np.array(list(map(lambda a: a[0], ntk)))
    probs = np.exp(logprobs) / np.exp(logprobs).sum()
    # pick next token randomly according to probs distribution
    next_token_n = np.random.choice(top_k, p=probs)
    next_token = ntk[next_token_n][1]
    # append next token
    generated_tokens.append(next_token)
    print(mw.sp_model.DecodePieces(generated_tokens)) # DEBUG output
  # decode tokens and return generated text
  generated_text = mw.sp_model.DecodePieces(generated_tokens)
  return generated_text

The function gpt2_gen takes a prefix string or "prompt" and generates a fixed number of tokens by iteratively applying the model to its own output. At each step, the model outputs a probability for each possible token. If we always chose the token with the highest predicted probability, the model would generate very static and boring text, often getting stuck in loops. Instead, we use top-k sampling to choose the next token: first, we sort the tokens by probability, then we zero out the probabilities of all tokens below the top_k-th token and sample from the remaining, renormalized distribution. This appears to improve the perceived quality of the generated text by removing the distribution's tail, making it less likely for the generated text to go off topic. Sampling from language models is an active area of research. See this article for a good introduction.
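
To make top-k sampling concrete, here is a small standalone sketch that does not need the model. The function top_k_sample and the example distribution are made up for illustration; in the real code, the log probabilities come from the model wrapper.

```python
import math
import random

def top_k_sample(logprobs, k, rng):
    """Sample a token from the k most probable entries of a {token: logprob} dict."""
    # Keep only the k tokens with the highest log probability.
    top = sorted(logprobs.items(), key=lambda item: item[1], reverse=True)[:k]
    # Subtract the maximum before exponentiating for numerical stability;
    # rng.choices normalizes the weights, so no explicit division is needed.
    max_lp = top[0][1]
    weights = [math.exp(lp - max_lp) for _, lp in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution after the prompt "Das blaue".
logprobs = {
    "Pferd": math.log(0.5),
    "Auto": math.log(0.3),
    "Haus": math.log(0.15),
    "Xylophon": math.log(0.05),  # unlikely tail token
}
rng = random.Random(42)
samples = [top_k_sample(logprobs, k=2, rng=rng) for _ in range(1000)]
print({token: samples.count(token) for token in set(samples)})
```

With k=2, only "Pferd" and "Auto" can ever be drawn, no matter how often you sample. With k=1 the procedure degenerates to always picking the most probable token, which is the "static and boring" case described above.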

We also define the helper function gpt2_gen_until, which generates tokens until the model produces a defined stop_token (e.g. a full stop '.').
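
One caveat with generating until a stop token: if the model never happens to sample that token, the loop will not terminate. A possible safeguard is an upper bound on the number of new tokens. The sketch below shows the idea with a stand-in next_token_fn instead of the real model. The names generate_until, next_token_fn, and max_new_tokens are assumptions for this example, not part of transformer-lm.

```python
import itertools

def generate_until(next_token_fn, prefix_tokens, stop_token=".", max_new_tokens=50):
    """Append tokens from next_token_fn until stop_token appears or the cap is hit."""
    tokens = list(prefix_tokens)
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)
        tokens.append(token)
        if token == stop_token:
            break
    return tokens

# Stand-in "model" that replays a fixed continuation, then repeats "und" forever.
canned = itertools.chain(["so", "regnet", "es", "."], itertools.repeat("und"))
print(generate_until(lambda tokens: next(canned), ["Wenn", "der", "Sturm", "kommt", ","]))

# A "model" that never emits the stop token still terminates thanks to the cap.
print(len(generate_until(lambda tokens: "und", ["Er", "sagte"], max_new_tokens=10)))
```

For an interactive app, a cap like this also bounds the worst-case response time.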

Finally, we are ready to make some predictions, that is, to generate some text, by using gpt2_gen_until to complete a German sentence:

gpt2_gen_until(mw, "Wenn der Regen niederbraust, wenn der Sturm das Feld durchsaust,", stop_token=".", top_k=8)
'Wenn der Regen niederbraust, wenn der Sturm das Feld durchsaust, so kann sich auch die Landwirtschaft nicht sicher sein.'

Okay, maybe not exactly what Heinrich Hoffmann had in mind, but nonetheless, it works! Time for another test, this time with timing via %%time to get an idea of the compute time required for inference:

%%time
gpt2_gen(mw, "Das blaue Pferd", n_tokens_to_generate=16, top_k=8)
CPU times: user 5.64 s, sys: 13.8 ms, total: 5.65 s
Wall time: 5.65 s

'Das blaue Pferd war in der deutschen Reitkunst ein sehr beliebter Ausdruck für seine Fähigkeit, auch im'
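
The timing above works out to roughly 5.65 s / 16 tokens ≈ 0.35 s per token. The %%time magic only works inside notebooks, so here is a sketch of the same measurement in plain Python. The helper timed is made up for this example, and fake_generate is a placeholder that sleeps instead of running the model, so that the snippet runs anywhere.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_generate(prefix, n_tokens_to_generate=16):
    """Placeholder for gpt2_gen: sleeps briefly per token instead of running a model."""
    for _ in range(n_tokens_to_generate):
        time.sleep(0.001)
    return prefix + " ..."

n_tokens = 16
text, seconds = timed(fake_generate, "Das blaue Pferd", n_tokens_to_generate=n_tokens)
print(f"{seconds / n_tokens * 1000:.1f} ms per token")
```

In the notebook, you would pass gpt2_gen and mw instead of the placeholder.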

Build a Web App to Auto-Complete Your Sentences

Having a working model, we can now create a very simple web app to auto-complete German text:

from google.colab.output import eval_js
from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.parse
import json

port = 8000
colab_url = eval_js(f'google.colab.kernel.proxyPort({port})')

html_code = '''
<!doctype html>
<html lang="de">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link rel="stylesheet" href="">
<link rel="stylesheet" href="" integrity="sha384-9aIt2nRpC12Uk9gS9baDl411NQApFmC26EwAOH8WgZl5MYYxFfc+NcPb1dKGj7Sk" crossorigin="anonymous">
<style>
html, body { font-family: 'Lato', sans-serif; }
.main-container { height: 100%; display: flex; flex-direction: column; }
.cover-container { height: 100vh; }
textarea { box-sizing: border-box; height: 100%; margin-top: 6px; margin-bottom: 6px; }
button { box-sizing: border-box; }
footer { margin-top: 12px; }
</style>
<script>
function init() {
  const t = document.getElementById("textarea");
  t.setSelectionRange(t.value.length, t.value.length); // move cursor to end
  t.addEventListener('keydown', function(e) {
    if(e.keyCode === 9) { // tab was pressed
      e.preventDefault(); // prevent tab navigation
      autoComplete();
    }
  });
}
function autoComplete() {
  const t = document.getElementById("textarea");
  const [start, end] = [t.selectionStart, t.selectionEnd];
  const prefixText = t.value.substring(0, t.selectionStart);
  fetch('/autocomplete', { method: 'POST', body: JSON.stringify(prefixText) }).
    then(response => response.json()).
    then(function(newText) {
      // insert the completion at the cursor and select it
      t.setRangeText(newText, start, end, 'select');
    });
}
</script>
<title>GPT-2 Text-Autopilot</title>
</head>
<body onload="init()">
<div class="container-fluid cover-container d-flex flex-column">
<div class="row flex-fill">
<div class="col-12">
<main class="main-container">
<h1>&#x1F984; GPT-2 Schreib-Autopilot</h1>
<div>Schreibblockade? Müde? Nutze den Schreib-Autopiloten, und Deine Texte schreiben sich von selbst! &#x1F929;</div>
<textarea id="textarea" autofocus="on" autocomplete="off" lang="de">Er dachte, automatisches Schreiben wäre eine gute Idee weil</textarea>
<button type="button" class="btn btn-primary" onclick="autoComplete()">automatisch vervollst&auml;ndigen (Tab)</button>
<footer>&copy; Copyright 2020 Oliver Flasch</footer>
</main>
</div>
</div>
</div>
</body>
</html>
'''

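class WebServiceHandler(BaseHTTPRequestHandler):
  def do_GET(self):
    parsed_url = urllib.parse.urlparse(self.path)
    if parsed_url.path == '/':
      # serve the single-page app
      self.send_response(200)
      self.send_header('Content-type', 'text/html')
      self.end_headers()
      self.wfile.write(html_code.encode('utf-8'))
    else:
      self.send_error(404)
  def do_POST(self):
    parsed_url = urllib.parse.urlparse(self.path)
    if parsed_url.path == '/autocomplete':
      # the request body is the prefix text as a JSON string
      content_length = int(self.headers['Content-Length'])
      prefix_text = json.loads(self.rfile.read(content_length).decode('utf-8'))
      generated_text = gpt2_gen(mw, prefix_text, n_tokens_to_generate=5, top_k=8)
      new_text = generated_text[len(prefix_text):]
      self.send_response(200)
      self.send_header('Content-type', 'application/json')
      self.end_headers()
      self.wfile.write(json.dumps(new_text).encode('utf-8'))
    else:
      self.send_error(404)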

You can start this web app and access it through the Google Colaboratory proxy port:

print(f'Open the following link: {colab_url}')
httpd = HTTPServer(('', port), WebServiceHandler)
try:
  httpd.serve_forever()  # blocks until the cell is interrupted
except KeyboardInterrupt:
  httpd.server_close()

This makes the app accessible to your own browser only.
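
If you want to check the request/response contract of the /autocomplete endpoint without a browser (or without a loaded model), a small client can help. The following sketch is self-contained: StubHandler is a made-up stand-in that returns a canned reply instead of calling GPT-2, and request_completion is an illustrative helper, not part of the app above.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubHandler(BaseHTTPRequestHandler):
    """Mimics the /autocomplete contract with a canned reply instead of GPT-2."""
    def do_POST(self):
        if self.path != '/autocomplete':
            self.send_error(404)
            return
        length = int(self.headers['Content-Length'])
        prefix_text = json.loads(self.rfile.read(length).decode('utf-8'))
        body = json.dumps(f' [{len(prefix_text)} Zeichen gelesen]').encode('utf-8')
        self.send_response(200)
        self.send_header('Content-type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the output quiet

def request_completion(base_url, prefix_text):
    """POST a JSON string to /autocomplete and return the decoded JSON reply."""
    data = json.dumps(prefix_text).encode('utf-8')
    request = urllib.request.Request(base_url + '/autocomplete', data=data, method='POST')
    # An empty ProxyHandler avoids routing the local request through a system proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read().decode('utf-8'))

server = HTTPServer(('127.0.0.1', 0), StubHandler)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
completion = request_completion(f'http://127.0.0.1:{server.server_port}', 'Ich glaube,')
print(repr(completion))
server.shutdown()
server.server_close()
```

The same request_completion call should also work against the real handler, as long as the server is running in a separate thread or process.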

Serve Your Web App Through ngrok

Install and configure pyngrok to serve the web app publicly (that is, for as long as the runtime of your notebook is active):

pip install pyngrok
ngrok authtoken <YOUR_AUTHTOKEN_HERE>

Now you can start your web app through ngrok, making it available to the public:

from pyngrok import ngrok

public_url = ngrok.connect(port=str(port))
print(f'Open the following link: {public_url}')
httpd = HTTPServer(('', port), WebServiceHandler)
try:
  httpd.serve_forever()  # blocks until the cell is interrupted
except KeyboardInterrupt:
  httpd.server_close()

Wrapping Up

This covers the basics of using a pretrained GPT-2 model to generate German text.

Next, you could explore GPT-3's fascinating few-shot learning abilities, fine-tune GPT-2 for specific text genres, or exploit the modularity of deep neural networks to modify the model for other tasks, such as text classification, sentiment analysis, or question answering. I hope to cover some of these topics in future posts.


Many thanks to Guenter Bartsch (Zamia AI) for training the first publicly available German GPT-2 model, and to Konstantin Lopuhin for his excellent work on transformer-lm, his PyTorch implementation of GPT-2.