Retrieval-Augmented Generation (RAG) is quickly maturing into a useful (and boring) technology for connecting proprietary, often local, data to Large Language Models (LLMs), supplementing their trained-in knowledge and skills through in-context learning. RAG can enable important use cases such as tailored AI assistants for answering questions, generating complex reports ("Deep Research"), or automating tasks ("Agentic AI").
RAG employs an information retrieval system, such as a database or search engine, to augment the context of an LLM with relevant information. There are many ways of varying complexity to achieve this, from simple approaches that enrich every user prompt with search results to agentic workflows that let an LLM freely use multiple search engines or databases as tools. When connecting an LLM to proprietary data, it is extremely important to ensure that it cannot access public data sources at the same time, to prevent data exfiltration.
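The simple "enrich every prompt" approach mentioned above can be sketched in a few lines. This is a minimal, illustrative example (the toy keyword-overlap retriever and all function names are assumptions, not part of the benchmark):

```python
def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Toy retriever: rank documents by naive keyword overlap with the query.
    A real system would use a search engine or vector database instead."""
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(documents, key=overlap, reverse=True)[:top_k]

def build_augmented_prompt(query: str, documents: list[str]) -> str:
    """Prepend the retrieved context to the user's question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion:\n{query}"

docs = [
    "Thyssen AG is headquartered in Duesseldorf.",
    "RAG augments an LLM prompt with retrieved text.",
]
prompt = build_augmented_prompt("Where is Thyssen AG headquartered?", docs)
```

The augmented prompt, rather than the bare question, is then sent to the LLM.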
This article presents the results of "gerenbench", a small proprietary benchmark that uses German encyclopedia articles to estimate the accuracy of LLMs on RAG tasks. It provides a glimpse of the native and RAG performance of open-weight LLMs runnable on consumer hardware as of March 2026.
## Benchmark Setup
The benchmark contains 100 multiple-choice questions based on 25 German encyclopedia articles from the mid-nineties. Each question includes the relevant article and four possible answers. It encompasses the following two scenarios:
- LLM-Only: The LLM is presented with the question and four possible answers. The relevant article is withheld.
- LLM+RAG: The LLM is presented with the relevant article, the question and four possible answers.
In both scenarios, the LLM is instructed (via the prompts given in the appendix) to answer the question by generating a single letter (A, B, C, or D) designating the correct answer. Scenario 1 therefore tests world knowledge, while scenario 2 tests world knowledge and RAG capability. Result analysis is fully deterministic and automated. A redacted example task is given in the appendix of this article. Note that scenario 2 assumes the best case, in which the relevant article (and only that article) is included in the LLM's context every time.
All experiments were run on an Apple M3 with 24 GB of unified RAM using Ollama Version 0.17.4.
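Ollama exposes a local HTTP API (by default on port 11434) that such a benchmark can call. The following is a hedged sketch of a non-streaming request against Ollama's `/api/generate` endpoint; the model name and the `temperature` setting are illustrative assumptions, not the benchmark's actual configuration:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(model: str, system: str, prompt: str) -> dict:
    """Assemble a non-streaming generation request for Ollama's /api/generate."""
    return {
        "model": model,
        "system": system,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.0},  # greedy decoding (assumption)
    }

def generate(model: str, system: str, prompt: str) -> str:
    """Send the request to a locally running Ollama server."""
    data = json.dumps(build_request(model, system, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_request("gemma3:1b", "You answer with one letter.", "Question...")
```

The response JSON also carries timing fields, which is how per-request prompt and generation durations like those reported below can be collected.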
## Results
The following 10 open-weight LLMs were tested on both scenarios (Table 1):
| LLM Name | Provider | Model Card | Size | Quantization |
|---|---|---|---|---|
| Gemma 3 1b | Google DeepMind | Hugging Face Link | 1b | q4_K_M |
| Gemma 3 12b | Google DeepMind | Hugging Face Link | 12b | q4_K_M |
| gpt-oss-20b | OpenAI | Hugging Face Link | 20b | q4_K_M |
| LFM2 24b-a2b | Liquid AI | Hugging Face Link | 24b-a2b | q4_K_M |
| Ministral 3 3b | Mistral AI | Hugging Face Link | 3b | q4_K_M |
| Ministral 3 8b | Mistral AI | Hugging Face Link | 8b | q4_K_M |
| Ministral 3 14b | Mistral AI | Hugging Face Link | 14b | q4_K_M |
| Mistral Small 3.2 24b | Mistral AI | Hugging Face Link | 24b | q4_K_M |
| Phi-4-mini 3.8b | Microsoft | Hugging Face Link | 3.8b | q4_K_M |
| Phi-4 14b | Microsoft | Hugging Face Link | 14b | q4_K_M |
Table 2 shows mean accuracies and processing times for scenario 1, i.e. LLM-Only performance, without RAG. "Prompt Duration" denotes the mean time in seconds taken to process the prompt, "Generate Duration" denotes the mean time in seconds taken to generate the answer. Best results are shown in bold font.
| LLM Name | Accuracy (Mean) | Prompt Duration (Mean, s) | Generate Duration (Mean, s) |
|---|---|---|---|
| Ministral 3 14b (it-2512-q4_K_M) | **0.77127** | 2.21029 | 0.15751 |
| Mistral Small 3.2 24b (it-2506-q4_K_M) | 0.75971 | 3.31204 | 0.177936 |
| Ministral 3 8b (it-2512-q4_K_M) | 0.7504 | 1.25622 | 0.291435 |
| gpt-oss-20b | 0.71902 | 0.865917 | 34.4813 |
| Phi-4 14b (q4_K_M) | 0.69161 | 2.39739 | 0.971595 |
| Gemma 3 12b (it-q4_K_M) | 0.6511 | 1.09994 | 0.104123 |
| Ministral 3 3b (it-2512-q4_K_M) | 0.58817 | 0.499747 | 0.0415913 |
| LFM2 24b-a2b (q4_K_M) | 0.56871 | 1.27881 | 0.0333128 |
| Gemma 3 1b (it-q4_K_M) | 0.50059 | **0.261436** | **0.0246014** |
| Phi-4-mini 3.8b (q4_K_M) | 0.4406 | 0.601904 | 0.260388 |
Table 3 shows mean accuracy and processing times for scenario 2, i.e. RAG. Best results are shown in bold font.
| LLM Name | RAG Accuracy (Mean) | RAG Prompt Duration (Mean, s) | RAG Generate Duration (Mean, s) |
|---|---|---|---|
| Mistral Small 3.2 24b (it-2506-q4_K_M) | **0.9697** | 11.9376 | 0.172491 |
| gpt-oss-20b | 0.96957 | 2.0207 | 8.44707 |
| Phi-4 14b (q4_K_M) | 0.96907 | 10.1061 | 0.690504 |
| Ministral 3 14b (it-2512-q4_K_M) | 0.96079 | 8.46598 | 0.146143 |
| Ministral 3 8b (it-2512-q4_K_M) | 0.95992 | 5.20979 | 0.130678 |
| Gemma 3 12b (it-q4_K_M) | 0.95978 | 4.45786 | 0.137686 |
| Ministral 3 3b (it-2512-q4_K_M) | 0.90893 | 2.15204 | 0.0413062 |
| LFM2 24b-a2b (q4_K_M) | 0.86951 | 2.62395 | 0.0338279 |
| Phi-4-mini 3.8b (q4_K_M) | 0.82001 | 2.58125 | 0.615699 |
| Gemma 3 1b (it-q4_K_M) | 0.50059 | **0.261436** | **0.0246014** |
Figure 1 shows the accuracies of all tested LLMs for scenario 1 (LLM only) and scenario 2 (LLM+RAG). 95% confidence intervals, generated via bootstrap sampling, are included.
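Bootstrap confidence intervals of this kind can be computed directly from the per-question 0/1 correctness indicators. A minimal sketch of a percentile bootstrap (parameter values are illustrative assumptions, not the benchmark's exact settings):

```python
import random

def bootstrap_ci(correct: list[int], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for mean accuracy.
    `correct` holds one 0/1 indicator per benchmark question."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample the indicators with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# e.g. 77 correct answers out of 100 questions
lo, hi = bootstrap_ci([1] * 77 + [0] * 23)
```

With only 100 questions, the resulting intervals are fairly wide, which is worth keeping in mind when comparing closely ranked models in the tables above.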
Finally, figure 2 shows a Pareto plot of LLM accuracy vs. generation time.
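The Pareto view boils down to finding models that no other model beats on both axes at once (higher accuracy and lower generation time). A small sketch of that computation, using an illustrative subset of the Table 3 figures:

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return models not dominated on (higher accuracy, lower generation time).
    `models` maps name -> (accuracy, generate_seconds)."""
    front = []
    for name, (acc, t) in models.items():
        dominated = any(
            a2 >= acc and t2 <= t and (a2 > acc or t2 < t)
            for n2, (a2, t2) in models.items() if n2 != name
        )
        if not dominated:
            front.append(name)
    return front

# illustrative subset of Table 3: (RAG accuracy, generate duration in s)
subset = {
    "Mistral Small 3.2 24b": (0.9697, 0.172),
    "gpt-oss-20b": (0.9696, 8.447),
    "Ministral 3 3b": (0.9089, 0.041),
}
front = pareto_frontier(subset)
```

In this subset, gpt-oss-20b is dominated (Mistral Small 3.2 24b is both more accurate and faster to generate), while the other two sit on the frontier.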
## Conclusions and Next Steps
The results show that mid-size, open-weight LLMs today are capable of solving the benchmark task with over 95% accuracy (an error rate of less than 1 in 20 queries). In contrast, when relying solely on internal world knowledge, even the best models tested reach only about 75% accuracy, producing an incorrect result for 1 out of 4 queries on average. Whether an accuracy of 95% is good enough should be evaluated against the risk profile of the task at hand.
Result accuracy scales with model size, but there are still significant differences between models of comparable size, which demonstrates the value of creating custom (proprietary) benchmarks. Model runtime depends heavily on the inference setup (the hardware and software used), yet it matters in practice, which is why it is reported alongside accuracy in the benchmark results.
Currently, the RAG portion of this benchmark (scenario 2) assumes perfect information retrieval, i.e. that relevant information is present every time, which is rarely the case in practice. Future iterations could be improved by including a quota of test cases where relevant information is missing. Furthermore, to move beyond the current multiple-choice format, a third scenario could be added to test free-form generation accuracy. This would involve removing multiple-choice options from the prompt template and evaluating the generated answers using pattern matching or an "LLM-as-a-judge" approach.
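For the free-form third scenario, the pattern-matching evaluation could start from a lenient normalized comparison before escalating to an LLM-as-a-judge. A sketch under that assumption (the normalization rules and function names are illustrative, not an existing implementation):

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents (e.g. German umlauts) and punctuation
    so that minor surface differences do not count as errors."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"[^\w\s]", " ", text.lower()).strip()

def matches_gold(generation: str, gold_answers: list[str]) -> bool:
    """Accept the generation when any normalized gold answer appears in it."""
    gen = normalize(generation)
    return any(normalize(g) in gen for g in gold_answers)
```

Cases this simple matcher rejects could then be forwarded to a judge model, keeping the cheap deterministic check for the clear-cut majority.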
## Appendix
All LLMs under test were instructed with the following prompt templates:

```python
system_prompt = """
Your task is to answer multiple-choice questions with high precision and accuracy.
You must answer with the letter of the correct choice, i.e. either A, B, C, or D.
You must not generate anything else.
"""

prompt_template = """**Question:**
{question}

**Multiple Choice Options:**
A) {answer_a}
B) {answer_b}
C) {answer_c}
D) {answer_d}

Answer the given question using one of the multiple choice options.
You must return either A, B, C, or D.
You must return exactly one letter.
---
"""

rag_prompt_template = """**Context:**
{context}

**Question:**
{question}

**Multiple Choice Options:**
A) {answer_a}
B) {answer_b}
C) {answer_c}
D) {answer_d}

Based on the given context, answer the given question using one of the multiple choice options.
You must return either A, B, C, or D.
You must return exactly one letter.
---
"""
```

The system_prompt gives general instructions to the model, while prompt_template is used for the LLM-Only part of the benchmark (scenario 1) and rag_prompt_template is used for the LLM+RAG part of the benchmark (scenario 2).
The following shows a redacted example task from the benchmark (the article, question, and answers are in German, the language of the benchmark):
**Article**

Thyssen AG, Dachgesellschaft eines in den Bereichen Investitionsgüter, Handel und Dienstleistungen und Stahlerzeugung und -verarbeitung tätigen Konzerns; Sitz: Düsseldorf. Die T. AG entwickelte sich, v.a. aus den beiden Hauptunternehmen Thyssen & Co. KG (gegr. 1871) und August Thyssen-Hütte AG (gegr. 1890), bis 1914 (...redacted...)

**Question**

Durch wen sollte die (...redacted...)?

**Possible Answers**

A) (...redacted...)
B) (...redacted...)
C) (...redacted...)
D) (...redacted...)