Cachy: How we made our notebooks 60x faster.
Intro.
At AnswerAI we build software that makes working with AI that little bit easier. For example, in the past year we built a series of open-source Python packages (Claudette, Cosette) that make it much simpler to work with LLM providers like Anthropic and OpenAI.
These packages make many LLM calls, which poses a number of challenges that can really slow down development:
- running the test suite is slow, as each LLM call takes hundreds of milliseconds
- LLM responses are non-deterministic, which makes assertions difficult
- CI/CD pipelines (like GitHub Actions) need access to API keys to run tests
As we build most of our software in notebooks, non-deterministic responses create an additional problem: they add significant bloat to notebook diffs, which makes code review more difficult 😢.
Why cachy?
Although LLMs are relatively new, these challenges are not, and an established solution already exists: you simply mock each LLM call so that it returns a specific response instead of calling the LLM provider. This approach works pretty well, but it is a little cumbersome. In our case, we would need to call the LLM manually, capture the response, save it to our project, and write a mock that uses it. We would need to repeat this process for hundreds of LLM calls across our projects 😢.
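To make the overhead concrete, here is roughly what that workflow looks like with `unittest.mock`. It is only a sketch: `myproject.llm.ask_llm` and `greet` are hypothetical helpers that wrap an LLM call, and a real codebase would repeat this pattern, with its own captured fixture, for every call it tests:

```python
from unittest.mock import patch

# A reply captured by hand from a real API call and saved alongside the tests.
CANNED_REPLY = "Hey! How can I help you today?"

def test_greeting():
    # ask_llm is a hypothetical wrapper around the LLM call; each test needs
    # its own patch plus a saved fixture like this one.
    with patch("myproject.llm.ask_llm", return_value=CANNED_REPLY):
        from myproject.llm import greet
        assert greet("Hey!") == CANNED_REPLY
```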
We asked ourselves if we could do better and create something that just worked automatically in the background with zero manual intervention. That something better turned out to be very simple. We looked at the source code of the most popular LLM SDKs and found that they all use the `httpx` library to call their respective APIs. All we needed to do was modify `httpx`'s `send` method to save the response of every call to a local file (a.k.a. a cache) and re-use it on future requests. Here’s some pseudo-code that implements just that.
```python
@patch
def send(self:httpx._client.Client, r, **kwargs):
    id_ = req2id(r)  # convert request to a unique identifier
    if id_ in cache: return httpx.Response(content=cache[id_])
    res = self._orig_send(r, **kwargs)
    update_cache(id_, res)
    return res
```
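For intuition, `req2id` only needs to map identical requests to the same key. A minimal sketch, assuming we key on the request method, URL, and body (cachy’s real implementation may differ):

```python
import hashlib

def req2id(r):
    # Hash the parts of the request that determine the response.
    key = b"|".join([r.method.encode(), str(r.url).encode(), r.read()])
    return hashlib.sha256(key).hexdigest()
```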
We added this simple patch to one of our projects and the payoff was immediate.
- we could now run our tests in ~2 seconds instead of 2 minutes 🔥
- we could finally add a test suite to our CI/CD pipeline
- our notebook diffs were clean and focused
The best part is that we got all of these benefits without having to write a single line of code or bloat our project with mocks and fixtures.
Since then we’ve added support for async and streaming, and turned it into a separate package called cachy, which we’re open sourcing today 🎉.
Usage
Setting up cachy is pretty straightforward.
- install it with pip: `pip install pycachy`
- import cachy in your notebook or script: `from cachy import enable_cachy`
- enable cachy by adding `enable_cachy()` to the top of your notebook or script
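Putting those steps together, the top of a notebook or script looks like this:

```python
from cachy import enable_cachy

enable_cachy()  # subsequent SDK calls made over httpx are cached locally
```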
Now when you use Anthropic or OpenAI’s Python SDK, the response will be cached and re-used whenever you make the same LLM call again. You don’t need to write any additional code; `cachy` just works automatically in the background.
Here’s an example: let’s request a completion from OpenAI.
```python
from openai import OpenAI

cli = OpenAI()
r = cli.responses.create(model="gpt-4.1", input="Hey!")
r
```
Hey! How can I help you today? 😊
- id: resp_05b1a0c3eca9e1450068dbb5ff4a74819e8bc3099532846ea1
- created_at: 1759229439.0
- error: None
- incomplete_details: None
- instructions: None
- metadata: {}
- model: gpt-4.1-2025-04-14
- object: response
- output: [ResponseOutputMessage(id='msg_05b1a0c3eca9e1450068dbb600147c819e8684cbe7fe3adc40', content=[ResponseOutputText(annotations=[], text='Hey! How can I help you today? 😊', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')]
- parallel_tool_calls: True
- temperature: 1.0
- tool_choice: auto
- tools: []
- top_p: 1.0
- background: False
- conversation: None
- max_output_tokens: None
- max_tool_calls: None
- previous_response_id: None
- prompt: None
- prompt_cache_key: None
- reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
- safety_identifier: None
- service_tier: default
- status: completed
- text: ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium')
- top_logprobs: 0
- truncation: disabled
- usage: ResponseUsage(input_tokens=9, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=11, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=20)
- user: None
- billing: {'payer': 'developer'}
- store: True
If we run the same request again, the response is now read from the cache.
```python
r = cli.responses.create(model="gpt-4.1", input="Hey!")
r
```
Hey! How can I help you today? 😊
- id: resp_05b1a0c3eca9e1450068dbb5ff4a74819e8bc3099532846ea1
- created_at: 1759229439.0
- error: None
- incomplete_details: None
- instructions: None
- metadata: {}
- model: gpt-4.1-2025-04-14
- object: response
- output: [ResponseOutputMessage(id='msg_05b1a0c3eca9e1450068dbb600147c819e8684cbe7fe3adc40', content=[ResponseOutputText(annotations=[], text='Hey! How can I help you today? 😊', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')]
- parallel_tool_calls: True
- temperature: 1.0
- tool_choice: auto
- tools: []
- top_p: 1.0
- background: False
- conversation: None
- max_output_tokens: None
- max_tool_calls: None
- previous_response_id: None
- prompt: None
- prompt_cache_key: None
- reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
- safety_identifier: None
- service_tier: default
- status: completed
- text: ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium')
- top_logprobs: 0
- truncation: disabled
- usage: ResponseUsage(input_tokens=9, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=11, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=20)
- user: None
- billing: {'payer': 'developer'}
- store: True
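Because the cached response is byte-for-byte identical, exact assertions against LLM output become practical in tests. A minimal sketch, reusing the `cli` client and the cached request from above:

```python
def test_greeting():
    r = cli.responses.create(model="gpt-4.1", input="Hey!")
    # Served from cachy's cache, so the text is deterministic and an exact match is safe.
    assert r.output_text == "Hey! How can I help you today? 😊"
```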
General Purpose Caching
Although this post focuses on caching LLM responses, `cachy` can be used to cache any calls made with `httpx`. All you need to do is tell `cachy` which URLs you want to cache.
=["api.example.com", "api.demo.com"]) enable_cachy(doms
Conclusion
cachy is one of those little quality-of-life improvements that keeps us in a flow state for longer and helps us move that little bit faster. We hope you’ll find it useful.