When deciding on a GPU spec to train or fine-tune a model, you're likely going to need to hold onto the pod for hours or even days for your training run. Even a difference of a few cents per hour easily adds up, especially if you have a limited budget. On the other hand, you'll need to be sure that you have enough VRAM in your pod to even get the job off the ground in the first place. Here's the info you'll need to make an informed purchase of your GPU time.
When deciding between two GPU specs at competing price points (say, the A5000 vs. the A6000), you can get a rough estimate of their relative performance by comparing how many CUDA cores each card has:
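As a quick back-of-envelope check, using the published CUDA core counts for these two cards (8,192 on the A5000 and 10,752 on the A6000), the gap works out to roughly 31%:

```python
a5000_cores = 8192   # published CUDA core count for the RTX A5000
a6000_cores = 10752  # published CUDA core count for the RTX A6000

# Theoretical headroom of the A6000 over the A5000, by core count alone
speedup = a6000_cores / a5000_cores - 1
print(f"~{speedup:.0%} more cores on the A6000")  # ~31% more cores on the A6000
```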
Be advised, of course, that you won't see that full 31% in practice due to overhead and other factors, but it's safe to assume you'll get at least three-quarters of it. However, at present the A5000 is only half the price of the A6000 in Secure Cloud, so the lower-spec card is actually the better economic choice in this situation, assuming you don't need the A6000's extra VRAM. This demonstrates that, from a purely economic standpoint, it's generally not advisable to reach for a higher-end card just to push your training along faster.
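To make the economics concrete, here's a minimal sketch. The 2x price gap comes from the paragraph above; the hourly rates and job length are hypothetical placeholders:

```python
# Hypothetical hourly rates reflecting the 2x price gap described above
a5000_price = 0.25  # $/hr (placeholder)
a6000_price = 0.50  # $/hr (placeholder)

job_hours_on_a5000 = 100   # baseline training time (placeholder)
best_case_speedup = 1.31   # the A6000's theoretical edge from core count

a5000_cost = a5000_price * job_hours_on_a5000
a6000_cost = a6000_price * (job_hours_on_a5000 / best_case_speedup)

print(f"A5000: ${a5000_cost:.2f}  vs  A6000: ${a6000_cost:.2f}")
# Even granting the A6000 its full 31% speedup, the A5000 finishes the
# same job for roughly a third less money.
```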
So it's not that GPU speed doesn't matter - it's just that the speed of your hardware matters far less than the techniques you apply. VRAM should be the focus of your decision, which leads us to...
During inference, you primarily need memory for the model parameters and a small amount for activations. Fine-tuning, however, introduces several additional memory-hungry components: gradients for every trainable parameter, optimizer states (Adam keeps two extra values per parameter), a full-precision master copy of the weights in mixed-precision training, and much larger activation buffers held for backpropagation.
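Counting bytes per parameter shows why these components get so expensive. Here's a back-of-envelope sketch for full fine-tuning with Adam in mixed precision (activations excluded, since they vary with batch size and sequence length):

```python
# Rough bytes-per-parameter accounting for full fine-tuning with Adam
# in mixed precision (activations not included)
weights_fp16   = 2  # half-precision working copy of the weights
gradients_fp16 = 2  # one gradient per parameter
adam_momentum  = 4  # first-moment estimate, kept in fp32
adam_variance  = 4  # second-moment estimate, kept in fp32
master_weights = 4  # fp32 master copy used by the optimizer

bytes_per_param = (weights_fp16 + gradients_fp16 + adam_momentum
                   + adam_variance + master_weights)
print(f"~{bytes_per_param} GB per 1B parameters")  # 16 bytes x 1e9 params ≈ 16 GB
```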
All told, these components add up to the ~16GB per 1B parameters rule of thumb. This seems like a lot - and it is - but you can reduce that footprint with techniques like flash attention (which might save you 10-20%). You can also use gradient checkpointing, which by even the most pessimistic estimates can save you half of your VRAM usage while increasing your training run time by 20-30%. That extra run time is easily compensated for by being able to train on a lower GPU spec, or in a pod with fewer GPUs, along with other clawed-back gains such as lower overhead from needing fewer physical cards for the job.
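If you're on the Hugging Face transformers stack, both techniques are typically a one-liner each. Here's a minimal sketch, assuming a recent transformers version with the flash-attn package installed; the checkpoint name is just an example:

```python
import torch
from transformers import AutoModelForCausalLM

# Example model ID - substitute whichever checkpoint you're fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # trims attention activation memory
)

# Trade ~20-30% extra compute for a large cut in activation memory
model.gradient_checkpointing_enable()
```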
So the 16GB per 1B rule of thumb is useful for determining the maximum requirement for the job, which makes it a good starting point. If you're training a small, lightweight model like a 1.5B, then something like gradient checkpointing can actually work against you: you won't save enough VRAM to drop from an A5000 to an A4000, so the longer run simply costs you more. For larger models, however, it's almost always going to be worth it, so long as you aren't under any pressing time constraints.
If you're on a budget and are willing to manage the technical overhead of a few additional files, you can use LoRA or QLoRA to train at a deep discount. These methods freeze the original model weights and only train small low-rank decomposition matrices. Instead of updating all of the parameters, we're only updating these small matrices. Let's look at what this means in practice:
For a 7B parameter model with LoRA:
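A back-of-envelope sketch gives a feel for the numbers, assuming a Llama-style architecture with 32 layers, a hidden size of 4096, and rank-16 adapters on the four attention projections (illustrative choices, not a prescription):

```python
# Illustrative LoRA sizing for a Llama-style 7B model
hidden_size = 4096
num_layers  = 32
rank        = 16   # LoRA rank (r)
targets     = 4    # q/k/v/o attention projections per layer

# Each adapted projection adds two matrices: (hidden x r) and (r x hidden)
lora_params  = num_layers * targets * (hidden_size * rank + rank * hidden_size)
total_params = 7e9

print(f"Trainable LoRA params: {lora_params/1e6:.1f}M "
      f"({lora_params/total_params:.2%} of the model)")
# ~16.8M trainable params, roughly 0.24% of 7B

# Memory: frozen fp16 base weights, plus optimizer state only for the adapters
base_fp16_gb = total_params * 2 / 1e9   # ~14 GB for the frozen base
adapter_gb   = lora_params * 16 / 1e9   # ~0.3 GB for adapter weights+grads+Adam
print(f"~{base_fp16_gb:.0f} GB frozen base + ~{adapter_gb:.2f} GB trainable state")
```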
QLoRA takes this efficiency even further by combining LoRA with 4-bit quantization:
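In practice this is typically wired up with the bitsandbytes and peft libraries. Here's a minimal sketch under the same illustrative assumptions as above; the checkpoint name and hyperparameters are placeholders, not a prescription:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit NF4 - roughly 0.5 bytes/param instead of 2
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",       # example checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach the small trainable LoRA adapters on top of the quantized base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the same ~0.2% figure as above
```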
This dramatic reduction in memory requirements has democratized LLM fine-tuning, making it possible to work with these models even on consumer-grade GPUs. For example, QLoRA enables fine-tuning of a 13B model on a single RTX 4090, which would be impossible with traditional methods.
When all is said and done, here's how those values might shake out for a number of different model sizes. You can see how stark the differences can get - for a 70B model, you could need anywhere from a 5xH200 pod down to a humble A40, depending on the techniques you use.
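If you want to reproduce ballpark figures like these yourself, here's a small estimator built on the back-of-envelope assumptions above: 16 bytes/param for full fine-tuning, an fp16 frozen base for LoRA, a 4-bit frozen base for QLoRA, and a loose allowance for adapters and activations. Treat the output as rough estimates, not quotes:

```python
def estimate_vram_gb(params_billion: float) -> dict:
    """Very rough VRAM estimates in GB for different fine-tuning strategies."""
    # Loose allowance for adapter weights, activations, and framework overhead
    overhead = 1.0 + params_billion * 0.1
    return {
        "full fine-tune": params_billion * 16,              # 16 bytes per parameter
        "LoRA":           params_billion * 2 + overhead,    # fp16 frozen base + extras
        "QLoRA":          params_billion * 0.5 + overhead,  # 4-bit frozen base + extras
    }

# Note: gradient checkpointing, flash attention, and sharded optimizers can
# cut the full fine-tune figure substantially, as discussed above.
for size in (1.5, 7, 13, 70):
    est = estimate_vram_gb(size)
    print(f"{size:>5}B  " + "  ".join(f"{k}: ~{v:.0f} GB" for k, v in est.items()))
```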
So to sum it up - when training a model, the spec's raw speed doesn't matter nearly as much as its VRAM and the memory-saving techniques you use. It's almost always worth taking the speed and technical tradeoffs in exchange for lower VRAM requirements - and from there it becomes a game of lowering that requirement as far as is feasible, because that's what makes your run cost less in the end.