Cachy: How we made our notebooks 60x faster.

open-source
Cache your API calls with a single line of code. No mocks, no fixtures. Just faster, cleaner code.
Author

Tommy

Published

October 1, 2025

Intro.

At AnswerAI we build software that makes working with AI that little bit easier. For example, in the past year we built a series of open-source Python packages (Claudette, Cosette) that make it much simpler to work with LLM providers like Anthropic and OpenAI.

These packages make many LLM calls, which poses a bunch of challenges that can really slow down development.

  • running the test suite is slow, as each LLM call takes hundreds of milliseconds
  • LLM responses are non-deterministic, which makes assertions difficult
  • CI/CD pipelines (like GitHub Actions) need access to API keys to run tests

As we build most of our software in notebooks, non-deterministic responses create an additional problem: they add significant bloat to notebook diffs, which makes code review more difficult 😢.

Why cachy?

Although LLMs are relatively new, these challenges are not, and an established solution already exists: you simply mock each LLM call so that it returns a fixed response instead of calling the LLM provider. This approach works pretty well, but it is a little cumbersome. In our case, we would need to call the LLM manually, capture the response, save it to our project, and write a mock that uses it, then repeat this process for hundreds of LLM calls across our projects 😢.
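
To see why this gets tedious, here's roughly what the mock-and-fixture approach looks like for a single call. This is an illustrative sketch using unittest.mock; the patch target and the canned response are assumptions for the example, not code from our projects.

# every mocked call needs a hand-captured response plus a patch like this
from unittest.mock import MagicMock, patch

canned = MagicMock()                                   # a response captured manually beforehand
canned.output_text = "Hey! How can I help you today?"

def test_greeting():
    # the patch target below is illustrative; it intercepts the SDK's create call
    with patch("openai.resources.responses.Responses.create", return_value=canned):
        from openai import OpenAI
        r = OpenAI(api_key="test").responses.create(model="gpt-4.1", input="Hey!")
        assert "help" in r.output_text.lower()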

We asked ourselves if we could do better and create something that just worked automatically in the background, with zero manual intervention. That something better turned out to be very simple. We looked at the source code of the most popular LLM SDKs and found that they all use the httpx library to call their respective APIs. All we needed to do was modify httpx’s send method to save the response of every call to a local file (a.k.a. a cache) and re-use it on future requests. Here’s some pseudo-code that implements just that.

@patch  # patch httpx's Client.send in place (fastcore-style patching)
def send(self:httpx._client.Client, r, **kwargs):
    id_ = req2id(r) # convert request to a unique identifier
    if id_ in cache: return httpx.Response(200, content=cache[id_]) # cache hit: rebuild the response locally
    res = self._orig_send(r, **kwargs) # cache miss: call the original send
    update_cache(id_, res) # save the response for future requests
    return res
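
The one piece the pseudo-code glosses over is how a request becomes a cache key. We won't show cachy's internals here, but a reasonable sketch is to hash the parts of the request that determine the response; this req2id is an illustrative assumption, not cachy's actual implementation.

import hashlib
import httpx

def req2id(r: httpx.Request) -> str:
    # method + URL + body together determine the response, so hash those
    key = r.method.encode() + str(r.url).encode() + r.content
    return hashlib.sha256(key).hexdigest()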

We added this simple patch to one of our projects and the payoff was immediate.

  • we could now run our tests in ~2 seconds instead of 2 minutes 🔥
  • we could finally add a test suite to our ci/cd pipeline
  • our notebook diffs were clean and focused

The best part is that we got all of these benefits without having to write a single line of code or bloat our project with mocks and fixtures.

Since then we’ve added support for async and streaming, and turned it into a separate package called cachy, which we’re open sourcing today 🎉.

Usage

Setting up cachy is pretty straightforward.

  • install it with pip: pip install pycachy
  • import cachy in your notebook or script: from cachy import enable_cachy
  • enable cachy by adding enable_cachy() to the top of your notebook or script

Now when you use Anthropic or OpenAI’s Python SDK, the response will be cached and re-used whenever you make the same LLM call again. You don’t need to write any additional code; cachy just works automatically in the background.

Here’s an example.

from cachy import enable_cachy
enable_cachy()

Now, let’s request a completion from OpenAI.

from openai import OpenAI

cli = OpenAI()
r = cli.responses.create(model="gpt-4.1", input="Hey!")
r

Hey! How can I help you today? 😊

  • id: resp_05b1a0c3eca9e1450068dbb5ff4a74819e8bc3099532846ea1
  • created_at: 1759229439.0
  • error: None
  • incomplete_details: None
  • instructions: None
  • metadata: {}
  • model: gpt-4.1-2025-04-14
  • object: response
  • output: [ResponseOutputMessage(id='msg_05b1a0c3eca9e1450068dbb600147c819e8684cbe7fe3adc40', content=[ResponseOutputText(annotations=[], text='Hey! How can I help you today? 😊', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')]
  • parallel_tool_calls: True
  • temperature: 1.0
  • tool_choice: auto
  • tools: []
  • top_p: 1.0
  • background: False
  • conversation: None
  • max_output_tokens: None
  • max_tool_calls: None
  • previous_response_id: None
  • prompt: None
  • prompt_cache_key: None
  • reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
  • safety_identifier: None
  • service_tier: default
  • status: completed
  • text: ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium')
  • top_logprobs: 0
  • truncation: disabled
  • usage: ResponseUsage(input_tokens=9, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=11, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=20)
  • user: None
  • billing: {'payer': 'developer'}
  • store: True

If we run the same request again, the response is now read from the cache. You can tell because the response id below is identical to the first call.

r = cli.responses.create(model="gpt-4.1", input="Hey!")
r

Hey! How can I help you today? 😊

  • id: resp_05b1a0c3eca9e1450068dbb5ff4a74819e8bc3099532846ea1
  • created_at: 1759229439.0
  • error: None
  • incomplete_details: None
  • instructions: None
  • metadata: {}
  • model: gpt-4.1-2025-04-14
  • object: response
  • output: [ResponseOutputMessage(id='msg_05b1a0c3eca9e1450068dbb600147c819e8684cbe7fe3adc40', content=[ResponseOutputText(annotations=[], text='Hey! How can I help you today? 😊', type='output_text', logprobs=[])], role='assistant', status='completed', type='message')]
  • parallel_tool_calls: True
  • temperature: 1.0
  • tool_choice: auto
  • tools: []
  • top_p: 1.0
  • background: False
  • conversation: None
  • max_output_tokens: None
  • max_tool_calls: None
  • previous_response_id: None
  • prompt: None
  • prompt_cache_key: None
  • reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
  • safety_identifier: None
  • service_tier: default
  • status: completed
  • text: ResponseTextConfig(format=ResponseFormatText(type='text'), verbosity='medium')
  • top_logprobs: 0
  • truncation: disabled
  • usage: ResponseUsage(input_tokens=9, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=11, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=20)
  • user: None
  • billing: {'payer': 'developer'}
  • store: True
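
The async and streaming support mentioned earlier works the same way. Here's a sketch using OpenAI's async client; the streaming event handling is standard Responses API usage, nothing cachy-specific, and it assumes enable_cachy() has already been called as above.

import asyncio
from openai import AsyncOpenAI

async def main():
    cli = AsyncOpenAI()
    stream = await cli.responses.create(model="gpt-4.1", input="Hey!", stream=True)
    async for event in stream:  # on a repeat run, the stream is served from the cache
        if event.type == "response.output_text.delta":
            print(event.delta, end="")

asyncio.run(main())  # in a notebook, `await main()` works instead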

General Purpose Caching

Although this post focuses on caching LLM responses, cachy can be used to cache any calls made with httpx. All you need to do is tell cachy which URLs you want to cache.

enable_cachy(doms=["api.example.com", "api.demo.com"])
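
With those domains registered, plain httpx requests to them are cached and replayed just like the LLM calls above. A quick sketch (the endpoint is made up for illustration):

import httpx

with httpx.Client() as client:
    r1 = client.get("https://api.example.com/items")  # first call goes over the network
    r2 = client.get("https://api.example.com/items")  # identical call is served from the cache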

Conclusion

cachy is one of those little quality-of-life improvements that keeps us in a flow state for longer and helps us move that little bit faster. We hope you’ll find it useful.