Odunayo Ogundepo

A Simple Utility for Context Caching with Hugging Face

5 min read
Transformers Inference Caching

When working with Hugging Face Transformers, caching is usually discussed in the context of generation. Most of the available cache abstractions, like past_key_values or DynamicCache, are designed to speed up decoding during autoregressive generation.

However, these are fundamentally generation caches, not context caches.

This distinction becomes important when you want to reuse a fixed prefix, such as a system prompt, across multiple independent requests without carrying over previous user interactions. Most existing inference engines, such as vLLM, support this automatically.

The Problem

With Hugging Face Transformers, if you directly reuse out.past_key_values from model.generate, you are not just caching the system prompt. You are caching:

  • system prompt
  • user input
  • assistant response
  • generated tokens

This means every subsequent request implicitly appends to the entire previous interaction, which is not what we want.
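To make this concrete, here is a minimal sketch (assuming a recent version of Transformers where generate returns a DynamicCache) showing that the cache handed back by generate covers the entire exchange, not just the system prompt:

Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"  # same model used in the utility below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

with torch.no_grad():
    out = model.generate(
        input_ids,
        max_new_tokens=20,
        do_sample=False,
        use_cache=True,
        return_dict_in_generate=True,
    )

# The cached length equals prompt + generated tokens, so reusing this cache
# would drag the user turn and the assistant reply into every later request.
print("prompt tokens:", input_ids.shape[1])
print("cached tokens:", out.past_key_values.get_seq_length())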

What we really want is a way to cache only a reusable context prefix, like a system prompt, and combine it with fresh inputs each time.

The Constraint with Chat Templates

Using tokenizer.apply_chat_template introduces another limitation:

  • a system-only prompt is not valid under most chat templates
  • templates expect a structured conversation: system -> user -> assistant

So we cannot directly construct a system-only cache using the template.

A Simple Utility Approach

The workaround is simple and practical: cache only the shared prefix, rendered through the chat template as the start of a valid turn. In the utility below, the prefix is cached as the beginning of a user message (via continue_final_message=True); the same idea works for a system + empty user + assistant prefix.

This gives us a valid prefix that can be reused safely.

Then for every new request:

  • clone the cache
  • append only the new user input
  • generate a response

This ensures no history leakage, proper formatting, and efficient reuse of the system context.

Example Utility Script

Python
import copy
import inspect
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Qwen/Qwen2-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)
model.eval()
model.generation_config.temperature = None
model.generation_config.top_p = None
model.generation_config.top_k = None
prepare_inputs_params = set(inspect.signature(model.prepare_inputs_for_generation).parameters)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id


# Build a reusable KV cache for the shared prefix: render it through the chat
# template as the beginning of a user turn, run one forward pass, and keep the
# resulting past_key_values along with the prefix length and rendered text.
def build_user_prefix_cache(user_prefix: str):
    prefix_text = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_prefix}],
        tokenize=False,
        add_generation_prompt=False,
        continue_final_message=True,
    )

    inputs = tokenizer(
        prefix_text,
        add_special_tokens=False,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        outputs = model(**inputs, use_cache=True)

    return outputs.past_key_values, inputs["input_ids"].shape[1], prefix_text


# Answer one request: deep-copy the prefix cache, tokenize only the suffix
# (the full templated prompt minus the cached prefix text), build an attention
# mask over prefix + suffix positions, and generate.
def generate_with_prefix_cache(user_text, prefix_cache, prefix_len, prefix_text, user_prefix):
    kv_cache = copy.deepcopy(prefix_cache)

    full_prompt_text = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_prefix + user_text}],
        tokenize=False,
        add_generation_prompt=True,
    )
    suffix_text = full_prompt_text[len(prefix_text):]

    inputs = tokenizer(
        suffix_text,
        add_special_tokens=False,
        return_tensors="pt",
    ).to(model.device)
    suffix_len = inputs["input_ids"].shape[1]

    full_attention_mask = torch.ones(
        (1, prefix_len + suffix_len),
        device=model.device,
        dtype=torch.long,
    )

    generate_kwargs = {
        "input_ids": inputs["input_ids"],
        "attention_mask": full_attention_mask,
        "past_key_values": kv_cache,
        "max_new_tokens": 128,
        "do_sample": False,
        "use_cache": True,
        "return_dict_in_generate": True,
    }
    if "cache_position" in prepare_inputs_params:
        generate_kwargs["cache_position"] = torch.arange(
            prefix_len,
            prefix_len + suffix_len,
            device=model.device,
        )

    with torch.no_grad():
        out = model.generate(**generate_kwargs)

    generated = out.sequences[0, suffix_len:]
    return tokenizer.decode(generated, skip_special_tokens=True)

def timed_call(fn, *args):
    start = time.perf_counter()
    output = fn(*args)
    elapsed = time.perf_counter() - start
    return output, elapsed

# Common prefix.
prefix = (
    "You are an expert school principal, skilled in effectively managing "
    "faculty and staff. Draft 10-15 questions for a potential first grade "
    "Head Teacher for my K-12, all-girls', independent school that emphasizes "
    "community, joyful discovery, and life-long learning. The candidate is "
    "coming in for a first-round panel interview for a 8th grade Math "
    "teaching role. They have 5 years of previous teaching experience "
    "as an assistant teacher at a co-ed, public school with experience "
    "in middle school math teaching. Based on these information, fulfill "
    "the following paragraph: "
)

# Sample prompts.
prompts = [
    "Who do i teach?",
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]*100


# Build the prefix cache once, then reuse it for every request.
cache, cache_build_time = timed_call(build_user_prefix_cache, prefix)
cached_generation_total = 0.0

for prompt in prompts:
    cached_output, cached_time = timed_call(
        generate_with_prefix_cache, prompt, cache[0], cache[1], cache[2], prefix
    )
    cached_generation_total += cached_time


print(f"Cache build time: {cache_build_time:.4f}s")
print(f"Total with cache: {cached_generation_total:.4f}s")
                

Key Insights from Hugging Face KV Cache Design

A few important details from the Hugging Face KV cache design help clarify why this utility works and where its limitations come from.

KV Cache = Prefix + New Tokens

The cache is always treated as a prefix of the sequence, not a separate structure. When you pass a cache, you are effectively saying: everything in this cache happened before these new tokens.

This is exactly why reusing past_key_values from a full generation leads to unintended context carryover.
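A minimal sketch of this behavior, reusing the model and tokenizer loaded in the script above (and assuming a recent Transformers version): the prediction for the next token is the same whether the prefix comes from a cache or from the raw input.

Python
# A KV cache passed to forward() stands in for the tokens that came *before*
# the new input_ids; the model sees one continuous sequence.
ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    prefix_out = model(ids[:, :-1], use_cache=True)   # cache the prefix
    cached = model(ids[:, -1:], past_key_values=prefix_out.past_key_values, use_cache=True)
    full = model(ids, use_cache=False)                # same sequence, no cache

# Same next-token prediction either way: the cache is just the prefix.
print(cached.logits[:, -1].argmax(-1), full.logits[:, -1].argmax(-1))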

Attention Depends on Full Sequence Length

When using a cache, attention is computed over:

sequence_length = past_kv_length + new_tokens_length

So the attention mask must reflect the combined length of cached tokens and new inputs.

This is typically handled automatically inside generate(), but it becomes important when manually reusing caches or building custom loops.
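The utility above applies exactly this rule. With hypothetical lengths, just to make the shapes concrete:

Python
prefix_len, suffix_len = 42, 7  # hypothetical lengths

# The mask must cover cached positions AND new positions...
attention_mask = torch.ones((1, prefix_len + suffix_len), dtype=torch.long)

# ...while cache_position indexes only the new tokens, starting after the prefix.
cache_position = torch.arange(prefix_len, prefix_len + suffix_len)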

Caches Grow Over Time

The default cache, DynamicCache, grows as new tokens are generated by appending key/value pairs at each step.

This reinforces an important point: a KV cache is inherently tied to a growing sequence, not a fixed context.
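Here is a small sketch of that growth, again reusing the model and tokenizer from the script above (this assumes a Transformers version where a DynamicCache can be passed to forward and exposes get_seq_length):

Python
from transformers import DynamicCache

cache = DynamicCache()
ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    # First forward pass: the cache now holds key/value states for every prompt token.
    model(ids, past_key_values=cache, use_cache=True)
    print(cache.get_seq_length())  # == ids.shape[1]

    # Each additional token appends one more key/value entry per layer.
    next_token = ids[:, -1:]       # a hypothetical "next" token, only to show growth
    model(next_token, past_key_values=cache, use_cache=True)
    print(cache.get_seq_length())  # == ids.shape[1] + 1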

Cache Implementations Are Generation-Focused

Transformers provides multiple cache implementations, including DynamicCache, StaticCache, and QuantizedCache. All of these are optimized for generation efficiency, not for reusing a fixed prompt across independent requests.
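For example, recent Transformers versions let you pick an implementation at generation time through the cache_implementation argument. This is a generation-efficiency knob, not a prompt-reuse mechanism; a hedged sketch:

Python
# Choosing a cache implementation changes how generation stores key/value
# states (StaticCache pre-allocates, which pairs well with torch.compile);
# it does not make the prompt reusable across independent requests.
# Assumes a Transformers version that supports cache_implementation.
ids = tokenizer("The future of AI is", return_tensors="pt").input_ids.to(model.device)
out = model.generate(
    ids,
    max_new_tokens=16,
    do_sample=False,
    cache_implementation="static",
)
print(tokenizer.decode(out[0], skip_special_tokens=True))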

Why Context Caching Is Not Native

Because of how KV caching works, the cache always represents a single continuous sequence. There is no built-in concept of "this part is reusable" or "this part should be reset."

This is why building a reusable context cache requires manually controlling what goes into the cache, avoiding reuse of generated tokens, and carefully managing how new inputs are appended.

A Note on Inference Systems

It is worth noting that this approach is fundamentally a workaround.

Modern inference systems like TGI and vLLM already implement prefix caching internally. They handle prefix sharing across requests, efficient KV reuse, batching, and memory management for you.

This utility is most useful when you are:

  • working directly with Hugging Face Transformers
  • building custom pipelines
  • prototyping without a dedicated inference server


Have thoughts or questions about this post? Feel free to reach out via email or connect on LinkedIn.

Built from scratch by me and Claude :)