Blog

Thoughts on NLP, multilingual systems, evaluation, and building AI systems that work in the real world.

A practical fix for the common flash-attn setup failure: install the exact wheel that matches your Python, PyTorch, CUDA, and architecture versions instead of hoping a generic pip install lines up.
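As a rough sketch of the idea, the helper below assembles a wheel filename from the environment details. The function name and the filename pattern are illustrative (modeled on the naming of flash-attn's GitHub release assets), not an official API:

```python
# Hypothetical helper: build the flash-attn wheel filename matching the local
# Python / PyTorch / CUDA / C++ ABI combination. The pattern is illustrative,
# modeled on flash-attn release asset names.
def wheel_name(fa_version, cuda, torch_version, cxx11abi, py_tag,
               arch="linux_x86_64"):
    abi = "TRUE" if cxx11abi else "FALSE"
    return (f"flash_attn-{fa_version}+cu{cuda}torch{torch_version}"
            f"cxx11abi{abi}-{py_tag}-{py_tag}-{arch}.whl")

# e.g. Python 3.11 (cp311), PyTorch 2.3, CUDA 12.1, old C++ ABI:
print(wheel_name("2.5.8", "121", "2.3", False, "cp311"))
# flash_attn-2.5.8+cu121torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
```

Passing the resulting release-asset URL directly to pip install avoids the slow (and often failing) source build that a bare pip install flash-attn can trigger.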
Generation caches in Transformers are not the same thing as reusable context caches. This post shows a practical utility for reusing a fixed system prompt across independent requests without leaking prior conversation history.
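The core pattern can be sketched in a library-agnostic way: prefill the fixed system prompt once, then deep-copy that state per request so one request's tokens never leak into the shared prefix. Here the KV cache is represented as a plain Python list; the class and function names are illustrative, not the utility from the post:

```python
# Minimal, library-agnostic sketch of reusable prefix caching.
# A real implementation would hold Transformers KV tensors instead of strings.
import copy

class PrefixCache:
    """Holds the processed state for a fixed system prompt."""
    def __init__(self, system_prompt):
        # Stand-in for the expensive one-time prefill: one entry per token.
        self.state = [f"kv({tok})" for tok in system_prompt.split()]

def run_request(prefix, user_message):
    # Deep-copy so this request's tokens never mutate the shared prefix.
    cache = copy.deepcopy(prefix.state)
    cache.extend(f"kv({tok})" for tok in user_message.split())
    return cache

prefix = PrefixCache("You are a helpful assistant")
a = run_request(prefix, "first question")
b = run_request(prefix, "second question")
print(len(prefix.state))  # 5 -- the shared prefix is untouched by either request
```

Both requests start from an identical copy of the prefix state, which is exactly the property a naively reused generation cache fails to guarantee.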