Retrieval-Augmented Generation (RAG) is quickly maturing into a useful (and boring) technology for connecting proprietary, often local, data to Large Language Models (LLMs), supplementing their trained-in knowledge and skills through in-context learning. RAG can enable important use cases such as tailored AI assistants for answering questions, generating complex reports ("Deep Research"), or automating tasks ("Agentic AI").
RAG employs an information retrieval system, such as a database or search engine, to augment the context of an LLM with relevant information. There are many ways of varying complexity to achieve this, from simple approaches that enrich every user prompt with search results to agentic workflows that let an LLM freely use multiple search engines or databases as tools. When connecting an LLM to proprietary data, it is extremely important to ensure that it cannot access public data sources at the same time, to prevent data exfiltration.
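The simple "enrich every prompt" approach mentioned above can be sketched in a few lines. This is a minimal, illustrative example (the toy keyword-overlap retriever and all function names are assumptions, not part of the benchmark):

```python
def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Toy retriever: rank documents by naive keyword overlap with the query.
    A real system would use a search engine or vector database instead."""
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(documents, key=overlap, reverse=True)[:top_k]

def build_augmented_prompt(query: str, documents: list[str]) -> str:
    """Prepend the retrieved context to the user's question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion:\n{query}"

docs = [
    "Thyssen AG is headquartered in Duesseldorf.",
    "RAG augments an LLM prompt with retrieved text.",
]
prompt = build_augmented_prompt("Where is Thyssen AG headquartered?", docs)
```

The augmented prompt, rather than the bare question, is then sent to the LLM.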
This article presents the results of "gerenbench", a small proprietary benchmark that uses German encyclopedia articles to estimate the accuracy of LLMs on RAG tasks. It provides a glimpse of the native and RAG performance of open-weight LLMs runnable on consumer hardware as of March 2026.
## Benchmark Setup
The benchmark contains 100 multiple-choice questions based on 25 German encyclopedia articles from the mid-nineties. Each question includes the relevant article and four possible answers. It encompasses the following two scenarios:
- LLM-Only: The LLM is presented with the question and four possible answers. The relevant article is withheld.
- LLM+RAG: The LLM is presented with the relevant article, the question and four possible answers.
In both scenarios, the LLM is instructed (via the prompts given in the appendix) to answer the question by generating a single letter (A, B, C, or D) designating the correct answer. Scenario 1 therefore tests world knowledge, while scenario 2 tests world knowledge and RAG capability. Result analysis is fully deterministic and automated. A redacted example task is given in the appendix of this article. Note that scenario 2 assumes the best case, in which the relevant article (and only that article) is included in the LLM's context every time.
All experiments were run on an Apple M3 with 24 GB of unified RAM using Ollama Version 0.17.4.
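Ollama exposes a local HTTP API (by default on port 11434) that such a benchmark can call. The following is a hedged sketch of a non-streaming request against Ollama's `/api/generate` endpoint; the model name and the `temperature` setting are illustrative assumptions, not the benchmark's actual configuration:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(model: str, system: str, prompt: str) -> dict:
    """Assemble a non-streaming generation request for Ollama's /api/generate."""
    return {
        "model": model,
        "system": system,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.0},  # greedy decoding (assumption)
    }

def generate(model: str, system: str, prompt: str) -> str:
    """Send the request to a locally running Ollama server."""
    data = json.dumps(build_request(model, system, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_request("gemma3:1b", "You answer with one letter.", "Question...")
```

The response JSON also carries timing fields, which is how per-request prompt and generation durations like those reported below can be collected.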
## Results
The following 10 open-weight LLMs were tested on both scenarios (Table 1):
| LLM Name | Provider | Model Card | Size | Quantization |
|---|---|---|---|---|
| Gemma 3 1b | Google DeepMind | Hugging Face Link | 1b | q4_K_M |
| Gemma 3 12b | Google DeepMind | Hugging Face Link | 12b | q4_K_M |
| gpt-oss-20b | OpenAI | Hugging Face Link | 20b | q4_K_M |
| LFM2 24b-a2b | Liquid AI | Hugging Face Link | 24b-a2b | q4_K_M |
| Ministral 3 3b | Mistral AI | Hugging Face Link | 3b | q4_K_M |
| Ministral 3 8b | Mistral AI | Hugging Face Link | 8b | q4_K_M |
| Ministral 3 14b | Mistral AI | Hugging Face Link | 14b | q4_K_M |
| Mistral Small 3.2 24b | Mistral AI | Hugging Face Link | 24b | q4_K_M |
| Phi-4-mini 3.8b | Microsoft | Hugging Face Link | 3.8b | q4_K_M |
| Phi-4 14b | Microsoft | Hugging Face Link | 14b | q4_K_M |
Table 2 shows mean accuracies and processing times for scenario 1, i.e. LLM-Only performance, without RAG. "Prompt Duration" denotes the mean time in seconds taken to process the prompt, "Generate Duration" denotes the mean time in seconds taken to generate the answer. Best results are shown in bold font.
| LLM Name | Accuracy (Mean) | Prompt Duration (Mean, s) | Generate Duration (Mean, s) |
|---|---|---|---|
| Ministral 3 14b (it-2512-q4_K_M) | **0.77127** | 2.21029 | 0.15751 |
| Mistral Small 3.2 24b (it-2506-q4_K_M) | 0.75971 | 3.31204 | 0.177936 |
| Ministral 3 8b (it-2512-q4_K_M) | 0.7504 | 1.25622 | 0.291435 |
| gpt-oss-20b | 0.71902 | 0.865917 | 34.4813 |
| Phi-4 14b (q4_K_M) | 0.69161 | 2.39739 | 0.971595 |
| Gemma 3 12b (it-q4_K_M) | 0.6511 | 1.09994 | 0.104123 |
| Ministral 3 3b (it-2512-q4_K_M) | 0.58817 | 0.499747 | 0.0415913 |
| LFM2 24b-a2b (q4_K_M) | 0.56871 | 1.27881 | 0.0333128 |
| Gemma 3 1b (it-q4_K_M) | 0.50059 | **0.261436** | **0.0246014** |
| Phi-4-mini 3.8b (q4_K_M) | 0.4406 | 0.601904 | 0.260388 |
Table 3 shows mean accuracy and processing times for scenario 2, i.e. RAG. Best results are shown in bold font.
| LLM Name | RAG Accuracy (Mean) | RAG Prompt Duration (Mean, s) | RAG Generate Duration (Mean, s) |
|---|---|---|---|
| Mistral Small 3.2 24b (it-2506-q4_K_M) | **0.9697** | 11.9376 | 0.172491 |
| gpt-oss-20b | 0.96957 | 2.0207 | 8.44707 |
| Phi-4 14b (q4_K_M) | 0.96907 | 10.1061 | 0.690504 |
| Ministral 3 14b (it-2512-q4_K_M) | 0.96079 | 8.46598 | 0.146143 |
| Ministral 3 8b (it-2512-q4_K_M) | 0.95992 | 5.20979 | 0.130678 |
| Gemma 3 12b (it-q4_K_M) | 0.95978 | 4.45786 | 0.137686 |
| Ministral 3 3b (it-2512-q4_K_M) | 0.90893 | 2.15204 | 0.0413062 |
| LFM2 24b-a2b (q4_K_M) | 0.86951 | 2.62395 | 0.0338279 |
| Phi-4-mini 3.8b (q4_K_M) | 0.82001 | 2.58125 | 0.615699 |
| Gemma 3 1b (it-q4_K_M) | 0.50059 | **0.261436** | **0.0246014** |
Figure 1 shows the accuracies of all tested LLMs for scenario 1 (LLM only) and scenario 2 (LLM+RAG). 95% confidence intervals, generated via bootstrap sampling, are included.
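Bootstrap confidence intervals of this kind can be computed directly from the per-question 0/1 correctness indicators. A minimal sketch of a percentile bootstrap (parameter values are illustrative assumptions, not the benchmark's exact settings):

```python
import random

def bootstrap_ci(correct: list[int], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for mean accuracy.
    `correct` holds one 0/1 indicator per benchmark question."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample the indicators with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# e.g. 77 correct answers out of 100 questions
lo, hi = bootstrap_ci([1] * 77 + [0] * 23)
```

With only 100 questions, the resulting intervals are fairly wide, which is worth keeping in mind when comparing closely ranked models in the tables above.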
Finally, figure 2 shows a Pareto plot of LLM accuracy vs. generation time.
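The Pareto view boils down to finding models that no other model beats on both axes at once (higher accuracy and lower generation time). A small sketch of that computation, using an illustrative subset of the Table 3 figures:

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return models not dominated on (higher accuracy, lower generation time).
    `models` maps name -> (accuracy, generate_seconds)."""
    front = []
    for name, (acc, t) in models.items():
        dominated = any(
            a2 >= acc and t2 <= t and (a2 > acc or t2 < t)
            for n2, (a2, t2) in models.items() if n2 != name
        )
        if not dominated:
            front.append(name)
    return front

# illustrative subset of Table 3: (RAG accuracy, generate duration in s)
subset = {
    "Mistral Small 3.2 24b": (0.9697, 0.172),
    "gpt-oss-20b": (0.9696, 8.447),
    "Ministral 3 3b": (0.9089, 0.041),
}
front = pareto_frontier(subset)
```

In this subset, gpt-oss-20b is dominated (Mistral Small 3.2 24b is both more accurate and faster to generate), while the other two sit on the frontier.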
## Conclusions and Next Steps
The results show that mid-size, open-weight LLMs today are capable of solving the benchmark task with over 95% accuracy (an error rate of less than 1 in 20 queries). In contrast, when relying solely on internal world knowledge, even the best models tested reach only about 75% accuracy, producing an incorrect result for 1 out of 4 queries on average. Whether an accuracy of 95% is good enough should be evaluated against the risk profile of the task at hand.
Result accuracy scales with model size, but there are still significant differences between models of comparable size, which demonstrates the value of creating custom (proprietary) benchmarks. Model runtime depends heavily on the inference setup (the hardware and software used), yet it matters in practice, which is why it is reported alongside accuracy in the benchmark results.
Currently, the RAG portion of this benchmark (scenario 2) assumes perfect information retrieval, i.e. that relevant information is present every time, which is rarely the case in practice. Future iterations could be improved by including a quota of test cases where relevant information is missing. Furthermore, to move beyond the current multiple-choice format, a third scenario could be added to test free-form generation accuracy. This would involve removing multiple-choice options from the prompt template and evaluating the generated answers using pattern matching or an "LLM-as-a-judge" approach.
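For the free-form third scenario, the pattern-matching evaluation could start from a lenient normalized comparison before escalating to an LLM-as-a-judge. A sketch under that assumption (the normalization rules and function names are illustrative, not an existing implementation):

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents (e.g. German umlauts) and punctuation
    so that minor surface differences do not count as errors."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"[^\w\s]", " ", text.lower()).strip()

def matches_gold(generation: str, gold_answers: list[str]) -> bool:
    """Accept the generation when any normalized gold answer appears in it."""
    gen = normalize(generation)
    return any(normalize(g) in gen for g in gold_answers)
```

Cases this simple matcher rejects could then be forwarded to a judge model, keeping the cheap deterministic check for the clear-cut majority.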
## Appendix
All LLMs under test were instructed with the following prompt templates:

```python
system_prompt = """
Your task is to answer multiple-choice questions with high precision and accuracy.
You must answer with the letter of the correct choice, i.e. either A, B, C, or D.
You must not generate anything else.
"""

prompt_template = """**Question:**
{question}

**Multiple Choice Options:**
A) {answer_a}
B) {answer_b}
C) {answer_c}
D) {answer_d}

Answer the given question using one of the multiple choice options.
You must return either A, B, C, or D.
You must return exactly one letter.
---
"""

rag_prompt_template = """**Context:**
{context}

**Question:**
{question}

**Multiple Choice Options:**
A) {answer_a}
B) {answer_b}
C) {answer_c}
D) {answer_d}

Based on the given context, answer the given question using one of the multiple choice options.
You must return either A, B, C, or D.
You must return exactly one letter.
---
"""
```

The system_prompt gives general instructions to the model, while prompt_template is used for the LLM-Only part of the benchmark (scenario 1) and rag_prompt_template is used for the LLM+RAG part of the benchmark (scenario 2).
The following shows a redacted example task from the benchmark (the article, question, and answers are in German, the language of the benchmark):
**Article**

Thyssen AG, Dachgesellschaft eines in den Bereichen Investitionsgüter, Handel und Dienstleistungen und Stahlerzeugung und -verarbeitung tätigen Konzerns; Sitz: Düsseldorf. Die T. AG entwickelte sich, v.a. aus den beiden Hauptunternehmen Thyssen & Co. KG (gegr. 1871) und August Thyssen-Hütte AG (gegr. 1890), bis 1914 (...redacted...)

**Question**

Durch wen sollte die (...redacted...)?

**Possible Answers**

A) (...redacted...)
B) (...redacted...)
C) (...redacted...)
D) (...redacted...)