# You can now train a 70b language model at home

2024-03-06

## Summary

Today, we’re releasing Answer.AI’s first project: a fully open source
system that, for the first time, can efficiently train a 70b large
language model on a regular desktop computer with two or more standard
gaming GPUs (RTX 3090 or 4090). This system, which combines FSDP and
QLoRA, is the result of a collaboration between Answer.AI, Tim Dettmers
(U Washington), and Hugging Face’s Titus von Koeller and Sourab
Mangrulkar.

This system will help the open source community release better models.
Teknium, the creator of the extremely popular OpenHermes models and
datasets, with over half a million downloads, said:

> “*With this capability we can take huge models to new heights locally,
> and gigantic, hundreds of billions of parameter models are now
> accessible by small labs.*”

At Answer.AI we made this our first project because it’s a key
foundation of our north star: helping make useful AI available to
everyone. Just being able to use *other* people’s models is not enough.
We want everyone to be able to create their *own* personalized models,
so that they are in control of their own AI systems.

## Background

### The big idea

There are two very different levels of hardware used to train deep
learning models. There is the data center class hardware, such as H100s
and A100s, costing [hundreds of thousands of
dollars](https://shop.lambdalabs.com/deep-learning/servers/blade/customize).
Then there are desktop computers containing gaming GPUs, such as dual
4090s, costing [under
$10,000](https://shop.lambdalabs.com/gpu-workstations/vector/customize)
(and which can be assembled from 2nd hand parts for less than half the
price of a pre-built system).

But here’s the key point: the gaming GPUs have similar performance to
the data center GPUs that cost over 10x more! It would be great if we
could use these 10x cheaper (but nearly as fast) cards to train large
language models, but we can’t, because they have much less memory. The
best currently available data center cards have 80GB RAM, whilst gaming
cards max out at 24GB RAM. Since only the largest models produce the
best results, creating the best models has been largely inaccessible to
most people.

We realized that there’s actually no intrinsic reason for this. The
super fast hardware is there, waiting to be used – we just need a way to
feed it with the model and the data in a way that meets its memory
constraints. The obvious question is: why hasn’t this been done then?
All the big industry labs have the 10x more expensive hardware already,
so they don’t really have the incentive to figure this out.

The big idea here is simple: figure out how to use these cheaper,
lower-memory gaming GPUs to train the best available open source models.
So the goal is this: train[1] a 70 billion parameter (70b) model using
only gaming GPUs, which means our per-GPU memory will be at most 24GB.
It’ll be a challenge, because each parameter normally takes 16 bits (2
bytes), so that’s 70\*2=140GB to even store the weights – and that’s
without including all the other data such as activations, gradients, and
optimization state!
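
To make the constraint concrete, here's a rough back-of-the-envelope sketch of the weight memory alone (activations, gradients, and optimizer state all come on top of this):

```python
# Rough weight-only memory arithmetic for a 70b-parameter model.
params = 70e9

for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:>9}: {gb:,.0f} GB of weights")

# 16-bit weights alone (140 GB) already dwarf the 24 GB of a single
# RTX 3090/4090, and even two cards (48 GB total) fall far short.
```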

### Why this project?

Answer.AI is a very unusual type of organization – a for-profit R&D lab
closer in spirit to [19th century electricity
labs](https://www.answer.ai/posts/2024-01-26-freaktakes-lessons.html)
than to today’s AI research groups. Figuring out how to make large model
training inexpensive and accessible is just the kind of thing Eric Ries
and Jeremy Howard hoped we’d be able to do when the organization was
[launched at
NeurIPS](https://www.answer.ai/posts/2023-12-12-launch.html) last year.

Solving this problem is hard. It requires understanding many separate
libraries (e.g. bitsandbytes, PEFT, Transformers, Accelerate, and
PyTorch), and computer science and math concepts (e.g. discretization,
distributed computing, GPU programming, linear algebra, SGD concepts
such as gradient checkpointing), and how they all interact.

Academia is full of brilliant people that solve hard problems. But
academia hasn’t solved this particular problem. That’s because it’s
difficult for university researchers to justify spending time on this
kind of work. Combining existing tools and techniques together isn’t
generally considered “novel” enough to result in publication in a high
impact journal, but that’s the currency that academics need.
Furthermore, academics are generally expected to become highly
specialized within their field, making it challenging to bring together
so many pieces into a single solution.

And, of course, big tech companies are also full of brilliant people
that solve hard problems. But this particular problem, training models
with consumer GPUs, isn’t a problem they need to solve – they’ve already
bought the big expensive GPUs! Many startups are also full of brilliant
people that solve hard problems! But, as [Eric Ries
explains](https://ltse.com/about/mission), “today’s financial market
forces businesses to prioritize short-term gains over everything else”.
It’s extremely hard for a startup to justify to investors why they’re
spending their funds on open source software and public research.

Whilst academia, big tech, and startups had good reasons for not solving
this problem, these are [the exact
reasons](https://www.answer.ai/posts/2023-12-12-launch.html) that this
problem was a great fit for Answer.AI. Everyone who works at the company
has built the kinds of systems that we had to work with on this problem,
so we were able to understand how all the pieces fit together. People
who love to both deeply understand the foundations of software and AI,
and also love to hack at fun and interesting end-to-end systems are the
kinds of people who are drawn to Answer.AI, and vice versa.

The problems we choose to solve together are selected by the same people
that will do the solving. So we tend to pick up projects that involve
bringing multiple ideas together to create practically useful
solutions. And because we’re a public benefit company with a charter to
produce *long term* benefit from AI, open source software and public
research are directly in line with our mission.

### QLoRA: Train bigger models on a single GPU

Two projects have been released recently that took the first critical
steps towards making this a reality: QLoRA (by [Tim Dettmers et
al](https://arxiv.org/abs/2305.14314)), and FSDP (by Meta’s [PyTorch
team](https://engineering.fb.com/2021/07/15/open-source/fsdp/)).

QLoRA is a simple but brilliant combination of two critically important
advances in modern neural networks: *quantization*, and *LoRA*.
Quantization is a technique where, instead of using 16 or even 32 bits
to store the weights of a neural network, 4 (or even fewer) bits are
used. There are only 16 possible values of a 4 bit number, but [Dettmers
and Zettlemoyer showed](https://arxiv.org/abs/2212.09720) that this can
be enough in the large language models that are popular today. Tim
Dettmers made these 4-bit “quantized” models easy to create, thanks to
his bitsandbytes library, and recently Hugging Face has stepped in to
help [maintain and
document](https://huggingface.co/docs/bitsandbytes/main/en/index) this
library, particularly thanks to the initiative of Titus von Koeller.
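
Thanks to that integration, loading a model with 4-bit bitsandbytes quantization through Transformers takes only a few lines. A sketch of what this typically looks like (the model name is purely illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Ask bitsandbytes to quantize the weights to 4 bits as they are loaded.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # 4-bit "normal float" data type
    bnb_4bit_compute_dtype=torch.bfloat16,   # dtype used for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # illustrative model id
    quantization_config=bnb_config,
)
```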

Unfortunately, once a model is quantized, it cannot be trained any
further with regular approaches – with just 16 possible values, the
gradient descent method used for model training will observe zero
gradients nearly everywhere, so it can’t make any updates to the
quantized weights. This is a major problem, because it means that
quantization can only be used for inference, not for continued
pre-training or fine-tuning. Whilst inference is useful and important,
it’s really just *consuming* models. But we want everybody to be able to
*contribute* to *creating* models!

The trick to avoiding this limitation is to use
[LoRA](https://arxiv.org/abs/2106.09685) – “Low-Rank Adaptation of Large
Language Models”. LoRA doesn’t train the whole large language model at
all, but instead adds “adaptors”, which are very small matrices
(generally smaller than 1% of the full model) that are trained, whilst
keeping the rest of the model constant. If you’ve played with models
like Stable Diffusion, you will have probably seen these adapters many
times; it’s how those models are generally shared, and why they are so
small and fast to download.
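
Conceptually, a LoRA adapter leaves a weight matrix frozen and adds a trainable product of two much smaller matrices alongside it. A minimal PyTorch sketch of the idea (not the PEFT implementation, just an illustration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a small trainable low-rank adapter."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the base weights stay frozen
            p.requires_grad = False
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)    # adapter starts out as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # ~65k trainable parameters vs ~16.8M frozen ones
```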

Tim realized that LoRA can be combined with quantization: use a
quantized base model, which is not changed at all by the training, and
add trainable LoRA adaptors that are not quantized. This combination is
*QLoRA*. Tim’s team was able to use this to, for the first time, train a
model that (unquantized) is larger than the GPU: they trained a 65b
model (which is 130GB unquantized) on a 48GB card.

Hugging Face stepped in once again here, creating the
[PEFT](https://huggingface.co/blog/peft) library, which made LoRA
training far simpler, and also integrating it directly with bitsandbytes
to allow anyone to use QLoRA with just a few lines of code. The Hugging
Face team has been working tirelessly behind the scenes to ensure that
the open source community can use these technologies to train their
models. If you’ve ever used Transformers to load a 4-bit model using a
single function argument, then you’ve got them to thank (and even if you
haven’t, you’ve almost certainly used the work of folks that have built
their model with this ecosystem).
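
Putting the two pieces together with PEFT takes just a few more lines. Roughly, a typical QLoRA setup looks like this (hyperparameters and target modules are illustrative; `model` is the 4-bit model loaded as in the earlier sketch):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training, then attach trainable LoRA
# adapters on top of the frozen 4-bit base weights.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapters are trainable
```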

QLoRA didn’t quite slay the problem we set out to solve, to train a 70b
model on 24GB cards, but it got closer than anything before. When
quantized to 4 bits (which is 0.5 bytes), the 70b model takes 70/2 = 35
GB, which is larger than the 24GB gaming GPUs we want to use.

There are other limitations to QLoRA. A 48GB card is very expensive, and
training a 65b model only just fits on such a card. That can be a
problem, because we need to store lots of other things too, including
the activations[2], gradients, and optimization state of the model
during training. If there’s not much memory left over after loading the
model weights, there’s not enough working memory to support training.

For instance, one of the benefits of language models is that we can use
them to “chat” with, or understand, or analyze long documents or
conversations. To make models that can handle long sequences like that,
we need to show them examples of long sequences during training. The
longest sequence used in training is called the “sequence length”.
Trying to use anything but a short sequence length will cause an error
when training a 65b QLoRA model on a 48GB card, because there isn’t
enough memory to store all the information about the sequence; nearly
all the memory is used just to store the model itself.

Furthermore, if the model can only look at a single sequence at a time,
it’s going to take a really long time to get through all the data in our
training set. So instead we want to be able to “batch” a few sequences
together at a time. The number of sequences included is the “batch
size”. When there’s very little space left on the GPU after loading the
model weights, we can only use very small batch sizes, resulting in
extremely slow training.

### FSDP: Scale training to multiple GPUs

One obvious solution to the problem of the RAM limitations of a single
consumer GPU, is to use more than one GPU! A very common approach in the
open source community is to simply place a few layers of the model on
each card. So then to train, you run the first few layers on the first
GPU, then the next few on the second GPU, and so forth. For instance, a
70b (140GB) model could be spread over eight 24GB GPUs, using 17.5GB on
each. There’s even a convenient setting in Hugging Face Transformers,
`device_map='auto'`, which you may well have used; that’s what this is
actually doing behind the scenes. This does the job, but there’s a giant
downside: only one GPU is ever active at a time, as all the others wait
for their “turn”. That means that ⅞ of the compute is wasted.
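
For reference, this naive splitting is what you get from a call like the following; it's convenient, but only one GPU is computing at any moment (model name illustrative):

```python
from transformers import AutoModelForCausalLM

# Accelerate spreads consecutive blocks of layers across all visible GPUs
# (spilling to CPU RAM if they still don't fit).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",   # illustrative model id
    device_map="auto",
    torch_dtype="auto",
)
print(model.hf_device_map)          # which device each block of layers landed on
```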

*Distributed Data Parallel* (DDP) was previously the gold standard
approach to training models across multiple GPUs efficiently. This
requires keeping the full model on each GPU – if you have a small model
(e.g. a 2b model, which takes 4GB RAM) you can simply load the whole
thing onto each GPU separately, and have each GPU then churn through
training examples in parallel. So for instance, if you had 4 GPUs,
that’s a 4x training speedup. But DDP doesn’t work if the model doesn’t
fit onto a GPU, with enough room to spare for the data needed for the
training process.
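
As a point of comparison, a minimal DDP training step looks roughly like this (a toy model stands in for a real one; it's run with a launcher such as `torchrun --nproc_per_node=2 ddp_example.py`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                 # one process per GPU
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(4096, 4096).cuda(rank)  # full copy on every GPU
    model = DDP(model, device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(8, 4096, device=rank)           # each GPU gets its own batch
    model(x).sum().backward()                       # gradients are averaged across GPUs
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```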

So we need something that can split a model across GPUs (like
`device_map='auto'`) and also use them in parallel (like DDP). This is
where Meta’s [Fully Sharded Data
Parallel](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html)
(FSDP) library comes in. It “shards” a large model, by splitting its
parameters across multiple GPUs, allowing all the GPUs to be used
simultaneously. When a layer of the neural network is calculated on a
particular GPU during training, all the required shards are copied
there. Then, the calculation is made, and finally the copied parts are
deleted from that GPU. Whilst this sounds terribly inefficient, in
practice, by copying the next layer’s shards at the same time as the
current layer is busy calculating, this approach can result in no
slowdown compared to DDP.
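
In code, moving from DDP to FSDP is a small change; a minimal sketch of wrapping a model (again launched with `torchrun`, with a toy model standing in for a large one):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)])

# Parameters are sharded across all GPUs; full layers are gathered on
# demand during forward/backward, then the gathered copies are freed.
# (In practice an auto-wrap policy controls the sharding granularity.)
model = FSDP(model, device_id=rank)

x = torch.randn(8, 4096, device=rank)
model(x).sum().backward()

dist.destroy_process_group()
```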

FSDP’s ability to bring the performance of DDP to models that are larger
than any one GPU has been a revelation. For instance, a 70b (70 billion
parameter) unquantized model takes 140GB of RAM (because each parameter
is stored as 16 bits, which is 2 bytes), but even NVIDIA’s H100 card
(which costs around $40,000 for a single card!) falls short of what’s
needed, with its 80GB RAM. But with FSDP, four H100 GPUs can be combined
for a total of 320GB RAM.

(Mind you, such a machine is going to set you back around $150,000…)

## Bringing FSDP and QLoRA together

At Answer.AI our north star is making useful AI more accessible.
$150,000 to create your own high-quality personalized model definitely
doesn’t count as accessible! So the first project we embarked on was to
make it possible to use a desktop with consumer gaming GPUs to
efficiently train a 70b model. We figured that if we could use QLoRA to
reduce the size of a model by around 4x (so a 70b model would fit into
35GB RAM), and then we used FSDP to shard that across two or more 24GB
consumer cards, that would leave enough RAM free to train a model.

### First steps

Jeremy and Tim in late 2023 discussed the idea of bringing FSDP and
QLoRA together. Tim connected Jeremy with Titus von Koeller, and Jeremy
and Titus worked together to try, explore, understand, and document the
issues that occurred when the two libraries were combined.

Answer.AI’s Johno Whitaker put together an important first step: a
simple standalone test script which allowed us to more deeply understand
the problem, and test solutions. A key breakthrough came in early 2024
when Answer.AI’s Benjamin Warner and Titus independently came up with a
key idea: store the quantized parameters in a selectable data type,
where that storage data type is the same data type as the “computation
type” of the model[3].
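
The core of this idea can be illustrated in a couple of lines: the packed 4-bit weights are just bytes, so the same bytes can be *viewed* as whichever floating-point dtype the model computes in, which is something FSDP knows how to shard. (This is a simplification; bitsandbytes now exposes the real mechanism via a selectable quantization storage dtype.)

```python
import torch

# Packed 4-bit weights come out of the quantizer as raw bytes (uint8).
packed = torch.randint(0, 256, (1024, 1024), dtype=torch.uint8)

# FSDP only shards floating-point parameters, so reinterpret the same
# bytes as the model's compute dtype (here bfloat16) for storage...
stored = packed.view(torch.bfloat16)     # no copy; same underlying bytes

# ...and view them back as bytes whenever it's time to dequantize.
assert torch.equal(stored.view(torch.uint8), packed)
```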

Benjamin had this prototyped within 24 hours of developing the idea, but
then we discovered another problem: FSDP was not copying the
quantization information needed for each shard to use the model! That’s
because FSDP is quite opinionated on the subset of data it will sync
between GPUs[4]. We realized that if we quantized the model on each GPU,
the metadata that FSDP doesn’t sync would already be present on every
GPU, so nothing would be lost. Furthermore, we had to move the
“quantization state” (the information necessary to (de)quantize the
parameters) from the parameters into the layers, in order to ensure it
was not removed when FSDP moved shards.

Once we had those issues resolved, we were able to successfully train
our first batch of data with a quantized model using FSDP! Benjamin and
Answer.AI’s Kerem Turgutlu were able to package this up with all the
tests and refactoring needed into a [pull
request](https://github.com/TimDettmers/bitsandbytes/pull/970) for
bitsandbytes. We’re extremely grateful to the maintainers of the
bitsandbytes project, who were very responsive in shepherding our PR
through their processes.

### Mission accomplished, nearly

At this point, we once again figured that we’d have things tied up
pretty quickly, and once again we under-estimated the complexity of the
task! The first thing we realized was that it still wasn’t actually
possible to load a quantized model that’s larger than a single GPU,
since the loading and quantization process itself required the whole
model to be put on one GPU.

Jeremy had spent a few weeks carefully studying Meta’s fantastic
[Llama-Recipes](https://github.com/facebookresearch/llama-recipes)
project, which was the best complete FSDP fine tuning implementation he
had found, and by closely tracking how it worked together with
bitsandbytes, along with Hugging Face’s PEFT, Transformers, and
Accelerate projects, he managed to construct a minimal standalone script
which manually completed all the steps needed to fine tune a model.

Benjamin realized that with some tweaks it would be possible to do the
loading and discretization one layer at a time, thus avoiding the need
to ever have the whole model on a single GPU. He also figured out how to
prevent the PEFT library from moving the quantization state to CPU.
Kerem wrote a custom implementation of the LoRA algorithm so that it
could work with Benjamin’s changes.
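
Heavily simplified, the layer-at-a-time loading idea looks something like the sketch below. Here `quantize_weight` is a hypothetical stand-in for the real bitsandbytes call, and the shard file name is illustrative; the point is simply that only one tensor's worth of full-precision weights needs to be in GPU memory at any moment.

```python
import torch
from safetensors.torch import load_file

def quantize_weight(w: torch.Tensor):
    """Hypothetical stand-in for the real 4-bit quantization call."""
    scale = w.abs().max() / 7                      # 4-bit signed range is [-8, 7]
    q = (w / scale).round().clamp(-8, 7).to(torch.int8)
    return q, scale

quantized_state = {}
for shard_path in ["model-00001-of-00002.safetensors"]:   # illustrative file name
    shard = load_file(shard_path)                          # one checkpoint shard at a time
    for name, weight in shard.items():
        q, scale = quantize_weight(weight.cuda())          # quantize a single tensor on GPU
        quantized_state[name] = (q.cpu(), scale.cpu())     # keep the small result, free the rest
```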

And with that, we were able to fine tune a 70b model on dual 3090 gaming
GPUs for the first time!

To make this work, we were benefiting not just from FSDP and QLoRA, but
also from a vast array of clever techniques developed over the last
couple of years by the academic and open source communities. We used:

- [Gradient checkpointing](https://arxiv.org/abs/1604.06174) (also known
  as activation checkpointing) to avoid storing the full set of
  activations, instead saving them only at a number of ‘checkpoints’
  throughout the model and then re-computing the intermediate activations
  by re-running the forward computation as needed during the backward
  pass (see the sketch after this list)
- [CPU
  offloading](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload)
  to store weights in CPU RAM rather than on the GPU when they are not
  in use, drastically reducing the GPU memory required. This technique
  isn’t very useful to the “GPU rich” using H100 GPUs, which have highly
  optimized ways to pass weights to each other. But for our use case,
  it’s absolutely necessary, since gaming GPUs and motherboards don’t
  have these systems
- [Flash Attention 2](https://arxiv.org/abs/2307.08691) to efficiently
  calculate attention using a memory-optimized CUDA kernel.
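
Here's a rough sketch of how each of these is typically switched on in a Transformers + FSDP setup (exact argument names vary between library versions, and the distributed process group is assumed to be initialized already, as in the earlier FSDP sketch):

```python
import torch
from transformers import AutoModelForCausalLM
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP

# Flash Attention 2 is requested when loading the model (model id illustrative).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Gradient (activation) checkpointing: trade recomputation for memory.
model.gradient_checkpointing_enable()

# CPU offloading: FSDP keeps sharded parameters in CPU RAM while they're idle.
model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```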

Making these work together with FSDP and QLoRA wasn’t always
straightforward. For instance, when using CPU offloading, Benjamin
discovered and fixed a problem in bitsandbytes. Each time an “offloaded”
weight was copied back to the GPU, it was being automatically
re-quantized, effectively turning the pretrained model into random
weights! We made a pull request to bitsandbytes that kept track of which
parameters had already been quantized, so we could avoid the redundant
computation.

After all this work, we were very pleased to see that we could train
large models with consumer GPUs. Jeremy had run detailed benchmarking of
the original llama-recipes across a range of GPU hardware, and Kerem
developed a comprehensive benchmarking system for the new project. In
comparing the two we realized that we were still not able to use the
sequence lengths or batch sizes we’d hoped – for some reason more memory
was being used than we expected.

When we looked closely, it turned out that it wasn’t due to our
FSDP/QLoRA integration at all – but actually as we increased seqlen in
bitsandbytes even without FSDP, the memory usage went up super-linearly,
eventually resulting in even higher memory usage than without
quantization! It turns out that we’re [not the
first](https://github.com/RahulSChand/gpu_poor/issues/1) people to
discover this problem. We don’t have a bitsandbytes solution yet (though
it’s being investigated), but it did lead us to an exciting discovery…

### Discovering HQQ

We love collaborating with like minded people, so when we saw the
amazing work being done by Daniel Han on [Unsloth](https://unsloth.ai/)
we wanted to learn more, and see whether we might be able to help each
other out. We asked Daniel if there were other interesting projects in
this space that we should be watching, and he pointed us to
[HQQ](https://mobiusml.github.io/hqq_blog/).

To explain HQQ, we’ll need to give a bit of background first… The 4-bit
quantization done by bitsandbytes uses a simple, fast, and clever
approach where each group of parameters is normalized to a consistent
range, and then each parameter is placed in a bucket, where the bucket
breakpoints are based on an assumption that the parameters are normally
distributed. This has the benefit that quantization is nearly instant,
but because real model parameters will not exactly fit the assumed
distribution, accuracy can suffer.
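
A toy illustration of that normalize-then-bucket pattern (real bitsandbytes NF4 places its 16 bucket values according to a normal distribution and uses carefully tuned block sizes; evenly spaced levels are used here purely for clarity):

```python
import torch

def toy_4bit_quantize(w: torch.Tensor, block_size: int = 64):
    """Normalize each block by its absmax, then snap values to 16 fixed levels."""
    levels = torch.linspace(-1, 1, 16)                        # 16 bucket centres (2**4)
    blocks = w.reshape(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True)           # per-block scale
    normed = blocks / scales                                  # now in [-1, 1]
    codes = (normed.unsqueeze(-1) - levels).abs().argmin(-1)  # nearest bucket, 0..15
    return codes.to(torch.uint8), scales, levels

def toy_4bit_dequantize(codes, scales, levels):
    return levels[codes.long()] * scales

w = torch.randn(4, 64)
codes, scales, levels = toy_4bit_quantize(w)
err = (w - toy_4bit_dequantize(codes, scales, levels).reshape(w.shape)).abs().max()
print(f"max reconstruction error: {err.item():.4f}")
```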

Other approaches such as GPTQ and the more recent AWQ go in a different
direction, where the quantization parameters are optimized based on the
actual behavior of the model when representative data is passed to
it.[5] These methods tend to produce more accurate models, potentially
with even fewer than 4 bits per parameter; but they have the downside
that the optimization process can take hours or even days for each
model.

HQQ combines the best of both worlds. It is 50x faster than GPTQ at
processing a 70b model, yet more accurate. Kerem decided to
investigate whether HQQ would work well with FSDP.

He discovered that making HQQ and FSDP work well together took nearly
the exact same steps as were required for bitsandbytes, and as a result
he had a complete working example within days. The [mobius.ml
folks](https://www.mobiuslabs.com/) couldn’t have been more responsive
and helpful in ensuring that our PR was successfully merged – so we are
now delighted to announce that FSDP works with HQQ too!

## How to use FSDP/QLoRA

To use FSDP, you will of course need more than one GPU. If you don’t
have access to such a system, you can rent a dual 3090 box from the
[Runpod Community Cloud](https://www.runpod.io/) for around $0.60/hour.
There are many other providers to choose from;
[cloud-gpus](https://cloud-gpus.com/) is a great place to see what’s on
offer.

You’ll need to install the latest version of Transformers, PEFT, and
bitsandbytes (and HQQ if you’re using that). Then, clone [our
repo](https://github.com/AnswerDotAI/fsdp_qlora/tree/main) and follow
the README there. Running `python train.py --help` will show the
available options. To train llama2-7b on the included alpaca dataset on
two 24GB cards you might run:

> `python train.py --train_type qlora --dataset alpaca --batch_size 8 --gradient_accumulation_steps 2 --output_dir qlora_output --log_to wandb`
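
(For the installs mentioned above, something along these lines usually does the trick, though exact package versions change quickly and the README has the current requirements:)

> `pip install --upgrade transformers peft bitsandbytes wandb` (add `hqq` if you want to use HQQ quantization)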

We’ve crammed everything required into this single file to make it
easier to see what is going on and modify things if necessary.

You should treat this script as an alpha/preview release. Whilst we’ve
used it to successfully train a variety of practically useful models on
a range of hardware, it’s still early days. If you’re not comfortable
with testing and debugging models, we’d suggest holding off for a few
months whilst the community more fully tests the approach.

**Update:** We’re excited to see that support is [already underway in
the Hugging Face
ecosystem](https://huggingface.co/docs/peft/main/en/accelerate/fsdp#use-peft-qlora-and-fsdp-for-finetuning-large-models-on-multiple-gpus)
via changes to
[Accelerate](https://github.com/huggingface/accelerate/pull/2544),
[Transformers](https://github.com/huggingface/transformers/pull/29587),
[TRL](https://github.com/huggingface/trl/pull/1416) and
[PEFT](https://github.com/huggingface/peft/pull/1550). Our code has also
been incorporated into the Axolotl finetuning library and [used to train
Mixtral](https://twitter.com/winglian/status/1766192708102562222) and
other models.

## A first step

Originally we’d planned to show some benchmarks in this article, along
with guidance based on the benchmark results about how to best take
advantage of FSDP/QLoRA. However, we had to keep delaying posting this,
because we kept making major improvements every few days! So we’ll
follow up with a benchmarking and recommendations article in the coming
weeks.

What we’ve shown here is just the first step. We have lots of ideas for
improvements, and we’re sure that the open source community will have
many other ideas we haven’t even thought of yet!

We hope that seeing this proof of concept, that it’s possible to scale
resource-efficient QLoRA training across inexpensive gaming GPUs, will
help bring more attention to the problem of bringing down the cost of
model training. It’s in everyone’s interest to make AI more accessible –
and to enable more people to not only consume, but also build, valuable
models.

[1] Throughout this article “training” can refer to either pre-training,
or fine-tuning.

[2] Strictly speaking, we don’t generally store the actual activations,
but rather just the intermediate pieces needed to recalculate them on
demand by using gradient checkpointing.

[3] FSDP only supports floating point types but most quantization
libraries store quantized weights in integer types. A selectable storage
datatype resolved this discrepancy.

[4] FSDP only supports syncing PyTorch parameters and buffers, while
most quantization libraries store “quantization state” metadata in
dictionaries.

[5] This is very important, as calibration data bias is another major
issue one can face when using these data-dependent methods.
