# Cachy: How we made our notebooks 60x faster.
Tommy
2025-10-01

<img src="./2025-10-01-cachy/1.png" data-fig-align="center" />

### Intro.

At AnswerAI we build software that makes working with A.I. that little
bit easier. For example, in the past year we built a series of open
source Python packages ([Claudette](https://claudette.answer.ai/),
[Cosette](https://answerdotai.github.io/cosette/)) that make it much
simpler to work with LLM providers like Anthropic and OpenAI.

These packages make many LLM calls, and that poses a bunch of
challenges that can really slow down development:

- running the test suite is slow, as each LLM call takes hundreds of
  milliseconds
- LLM responses are non-deterministic, which makes assertions difficult
- CI/CD pipelines (like GitHub Actions) need access to API keys to run
  tests

As we build most of our software in notebooks, non-deterministic
responses create an additional problem: they add significant bloat to
notebook diffs, which makes code review more difficult 😢.

### Why `cachy`?

Although LLMs are relatively new, these challenges are not, and an
established solution already exists: you simply mock each LLM call so
that it returns a specific response instead of calling the LLM provider.
This approach works pretty well, but it is a little cumbersome. In our
case, we would need to call the LLM manually, capture the response,
save it to our project, and write a mock that uses it. We would then
need to repeat this process for hundreds of LLM calls across our
projects 😢.

We asked ourselves if we could do better and create something that just
worked automatically in the background with zero manual intervention.
That something better turned out to be very simple. We looked at the
source code of the most popular LLM SDKs and found that they all use the
`httpx` library to call their respective APIs. All we needed to do was
modify `httpx`’s `send` method to save the response of every call to a
local file (a.k.a. a cache) and re-use it on future requests. Here’s some
pseudo-code that implements just that.

``` python
@patch
def send(self:httpx._client.Client, r, **kwargs):
    id_ = req2id(r)  # convert the request to a unique identifier
    # cache hit: return the stored response without calling the provider
    if id_ in cache: return httpx.Response(content=cache[id_])
    # cache miss: call the original (unpatched) send, then store the result
    res = self._orig_send(r, **kwargs)
    update_cache(id_, res)
    return res
```
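
For illustration, here is a minimal sketch of what the helpers above
could look like, assuming the cache is a JSON file keyed by a hash of
the request method, URL, and body. Names like `CACHE_PATH` are
hypothetical; the real implementation in `cachy` may differ.

``` python
import hashlib, json
from pathlib import Path

CACHE_PATH = Path(".cachy.json")  # hypothetical on-disk cache location
cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}

def req2id(r):
    # hash the request method, URL, and body into a stable cache key
    key = f"{r.method} {r.url} {r.content.decode(errors='ignore')}"
    return hashlib.sha256(key.encode()).hexdigest()

def update_cache(id_, res):
    # store the response body and persist the cache to disk
    cache[id_] = res.text
    CACHE_PATH.write_text(json.dumps(cache))
```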

We added this simple patch to one of our projects and the payoff was
immediate.

- we could now run our tests in ~2 seconds instead of 2 minutes 🔥
- we could finally add a test suite to our CI/CD pipeline
- our notebook diffs were clean and focused

The best part is that we got all of these benefits without having to
write a single line of code or bloat our project with mocks and
fixtures.

Since then we’ve added support for async and streaming, and turned it
into a separate [package](https://pypi.org/project/pycachy/) called
[cachy](https://github.com/AnswerDotAI/cachy), which we’re open sourcing
today 🎉.

### Usage

Setting up cachy is pretty straightforward.

- install it with pip: `pip install pycachy`
- import it in your notebook or script:
  `from cachy import enable_cachy`
- enable it by adding `enable_cachy()` to the top of your notebook or
  script

Now when you use Anthropic or OpenAI’s Python SDK the response will be
cached and re-used whenever you make the same LLM call again. You don’t
need to write any additional code. `cachy` just works automatically in
the background.

Here’s an example.

``` python
from cachy import enable_cachy
enable_cachy()
```

Now, let’s request a completion from OpenAI.

``` python
from openai import OpenAI

cli = OpenAI()
r = cli.responses.create(model="gpt-4.1", input="Hey!")
r
```

Hey! How can I help you today? 😊

<details>

- id: resp_05b1a0c3eca9e1450068dbb5ff4a74819e8bc3099532846ea1
- created_at: 1759229439.0
- error: None
- incomplete_details: None
- instructions: None
- metadata: {}
- model: gpt-4.1-2025-04-14
- object: response
- output:
  \[ResponseOutputMessage(id=‘msg_05b1a0c3eca9e1450068dbb600147c819e8684cbe7fe3adc40’,
  content=\[ResponseOutputText(annotations=\[\], text=‘Hey! How can I
  help you today? 😊’, type=‘output_text’, logprobs=\[\])\],
  role=‘assistant’, status=‘completed’, type=‘message’)\]
- parallel_tool_calls: True
- temperature: 1.0
- tool_choice: auto
- tools: \[\]
- top_p: 1.0
- background: False
- conversation: None
- max_output_tokens: None
- max_tool_calls: None
- previous_response_id: None
- prompt: None
- prompt_cache_key: None
- reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
- safety_identifier: None
- service_tier: default
- status: completed
- text: ResponseTextConfig(format=ResponseFormatText(type=‘text’),
  verbosity=‘medium’)
- top_logprobs: 0
- truncation: disabled
- usage: ResponseUsage(input_tokens=9,
  input_tokens_details=InputTokensDetails(cached_tokens=0),
  output_tokens=11,
  output_tokens_details=OutputTokensDetails(reasoning_tokens=0),
  total_tokens=20)
- user: None
- billing: {‘payer’: ‘developer’}
- store: True

</details>

If we run the same request again, the response is now read from the
cache.

``` python
r = cli.responses.create(model="gpt-4.1", input="Hey!")
r
```

Hey! How can I help you today? 😊

<details>

- id: resp_05b1a0c3eca9e1450068dbb5ff4a74819e8bc3099532846ea1
- created_at: 1759229439.0
- error: None
- incomplete_details: None
- instructions: None
- metadata: {}
- model: gpt-4.1-2025-04-14
- object: response
- output:
  \[ResponseOutputMessage(id=‘msg_05b1a0c3eca9e1450068dbb600147c819e8684cbe7fe3adc40’,
  content=\[ResponseOutputText(annotations=\[\], text=‘Hey! How can I
  help you today? 😊’, type=‘output_text’, logprobs=\[\])\],
  role=‘assistant’, status=‘completed’, type=‘message’)\]
- parallel_tool_calls: True
- temperature: 1.0
- tool_choice: auto
- tools: \[\]
- top_p: 1.0
- background: False
- conversation: None
- max_output_tokens: None
- max_tool_calls: None
- previous_response_id: None
- prompt: None
- prompt_cache_key: None
- reasoning: Reasoning(effort=None, generate_summary=None, summary=None)
- safety_identifier: None
- service_tier: default
- status: completed
- text: ResponseTextConfig(format=ResponseFormatText(type=‘text’),
  verbosity=‘medium’)
- top_logprobs: 0
- truncation: disabled
- usage: ResponseUsage(input_tokens=9,
  input_tokens_details=InputTokensDetails(cached_tokens=0),
  output_tokens=11,
  output_tokens_details=OutputTokensDetails(reasoning_tokens=0),
  total_tokens=20)
- user: None
- billing: {‘payer’: ‘developer’}
- store: True

</details>
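
Async clients are cached the same way. Here’s a quick sketch of the
async equivalent of the call above, using OpenAI’s async client with
the same `enable_cachy()` setup (the `main` wrapper is just for running
it as a script; in a notebook you’d simply `await` the call):

``` python
import asyncio
from openai import AsyncOpenAI

async def main():
    cli = AsyncOpenAI()
    # the first call hits the API; identical calls afterwards come from the cache
    r = await cli.responses.create(model="gpt-4.1", input="Hey!")
    print(r.output_text)

asyncio.run(main())
```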

### General Purpose Caching

Although this post focuses on caching LLM responses, `cachy` can be used
to cache any calls made with `httpx`. All you need to do is tell `cachy`
which domains you want to cache.

``` python
enable_cachy(doms=["api.example.com", "api.demo.com"])
```
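
Because `cachy` hooks into `httpx` itself, plain `httpx` requests to
those domains get cached too. A minimal sketch (the domain and endpoint
are placeholders):

``` python
import httpx
from cachy import enable_cachy

enable_cachy(doms=["api.example.com"])

# the first request goes over the network; repeating it returns the cached response
r = httpx.get("https://api.example.com/items")
print(r.status_code)
```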

### Conclusion

[cachy](https://answerdotai.github.io/cachy/) is one of those little
quality-of-life improvements that keeps us in a flow state for longer
and helps us move that little bit faster. We hope you’ll find it useful.
