Bunk Costs
Guest post by Ian Bruene

In the recent AI posts the oft-repeated refrain / question about whether any of this is economically viable came up. Usually when I’ve heard people make that objection I pay little attention: the facts they cite tend to be questionable and cherry-picked at best, and all too often outright fraudulent. Nothing new here, same old same old for the topic. More fundamentally, pointing out that a huge amount of money and resources have been poured into a new technology which is getting better at an accelerating pace, and that it hasn’t paid off yet, is… not a particularly interesting observation to make.
This time I decided to do a bit of figuring up, and it turns out that you can just do math, and no one can stop you. I’m going to talk about three different types of model which are the most relevant, and which have people raising the most questions about their viability.
But first an important distinction must be made for those who are unfamiliar with these: running a completed model and training the model require vastly different amounts of compute. It might take hundreds or thousands of GPUs crunching data for a month to train a new model, but when that is completed a single GPU can keep up with constant usage from multiple users.
Also I am going to limit my discussion of valuable usage to cases where there is a fairly solid and definable value proposition, because once I’ve laid out the math there, everything else is just gravy. And I am mostly not going to talk about the details of how the money flows: I’m just going to cover whether X amount of value is generated vs the training cost.
Large State-of-the-Art LLMs
These are what everyone knows from services like ChatGPT or Grok. They are the big boys which have massive datacenters built to train and run them. Information on what the more recent models cost to train has not been published, but we can still make some educated guesses. Estimates put GPT-4 around $60-80 million, but Altman has stated that it was “over 100 million”. There is even less information for -4o or o3, but a figure of $100-200 million for -4o is likely.
Can this recoup costs? Is there anything valuable enough which these can do to pay for that?
(Also I’d like to point out that while those sound like big numbers, as far as industrial investments go they are pretty tiny.)
Well let’s look at something where we can have objective standards: it is a fact that there are programmers who individually can create $10 million in value. They can go much higher than that, but there are fewer the higher you go.
Also we know for a fact that a -4o class model is useful to an expert programmer. How? Because ESR has been working on a new project using AI for a few weeks now.
From his reports, we know that AI assistance for an expert programmer can multiply development speed by a factor of 2 to 3. It might go higher, but let’s go with a very conservative 2x multiplier. And we won’t include any of the ancillary benefits: just the time it takes the project from start to finish.
(And to head off what I know some of you are furiously pounding your keyboards about: those figures were achieved while maintaining a high standard of quality.)
So let’s put all of those points together:
If you have a developer who can create $10m in value and you give him an AI, he can create $20m in value in the same period, for a gain of +$10m. While they are rare by general-population standards, $10m-value developers are fairly common among competent professionals. If you give 20 of them a -4o class AI, that is $200m of extra value: enough to offset even the high-end estimate of its training cost.
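To make the arithmetic explicit, here is a minimal back-of-envelope sketch in Python. The training-cost range and the per-developer figures are just the rough estimates quoted above, not hard data.
    # Back-of-envelope: how many $10m-value developers does a -4o class
    # model need to assist before the extra value covers its training cost?
    # All inputs are the rough estimates quoted above, not hard data.
    training_cost_estimates = (100e6, 200e6)   # estimated -4o training cost range ($)
    base_value_per_dev = 10e6                  # value a strong developer creates anyway ($)
    speedup = 2.0                              # conservative AI-assistance multiplier
    gain_per_dev = base_value_per_dev * (speedup - 1)   # extra value per developer ($)

    for cost in training_cost_estimates:
        print(f"${cost/1e6:.0f}m training cost -> {cost / gain_per_dev:.0f} developers to break even")
    # -> 10 developers at the low estimate, 20 at the high estimate.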
Any additional $10m-value developers who use the AI are over-unity, and the thousands of $1m-value developers just add to the pile. We haven’t even touched any business case beyond making the very best programmers more productive, and we’ve already demonstrated that the concerns – or perhaps concern trolls – about recouping the cost are full of nothing but wind.
But wait. It gets worse for that objection. It gets so very much worse.
Small LLMs
There is a wide variety of model sizes, all the way from 671 billion parameter behemoths like undistilled DeepSeek-r1, down to tiny models you can run on the cheapest Raspberry Pi. But a notable size range is around 7 billion parameters; there are a lot of small models of about this size, because you can do useful things at that scale, and such a model can easily run on even low-end consumer GPUs.
The specific model which ESR uses the most at the moment is 4.1-mini. We don’t know exactly how large it is because “Open”AI are a bunch of secretive little twerps. But they have stated that it is in this general size range, and most estimates put it around 7-8b. This means we know that a model in this size range is useful to an expert programmer.
Several different estimates for how much it cost to train 4.1-mini put it somewhere around $1 million, which in large-corporation terms is extra money they found while cleaning out the sofa. Now recall all those numbers I went through before showing that -4o could be profitable at 200 times this startup cost, and compare them to a $1m investment.
Even if you try to rescue the financial-doomer position by saying they had to train 4o before they could get to 4.1-mini (which is probably true), that just leaves you with the 4o training cost which we already know can generate over-unity value.
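For contrast, here is the same arithmetic applied to 4.1-mini itself, using the ~$1m estimate above (a minimal sketch with the same assumed figures):
    # Same break-even arithmetic, applied to the ~$1m estimate for 4.1-mini.
    training_cost = 1e6     # rough 4.1-mini training cost estimate ($)
    gain_per_dev = 10e6     # extra value from doubling one $10m-value developer ($)
    print(f"developers to break even: {training_cost / gain_per_dev:.1f}")
    # -> 0.1, i.e. a single such developer repays the training cost ten times over.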
ImageGen
Image generator models are much smaller than LLMs. Stable Diffusion 1.5 is just under a billion parameters, whereas a 7b LLM is considered very small. Here we actually have some useful data: SD1.5 was trained for about $600k on an AWS cluster of 256 A100 GPUs, using 150,000 GPU-hours of compute time. We also know that SDXL is 3.5b parameters, so all other factors being equal a naive scaling would put its training cost around $2.1 million.
Already we are talking about something much cheaper. But we can cut these prices down considerably. SD1.5 was trained on A100 cards. That’s the previous generation; most stuff nowadays uses the H100 (and the bleeding-edge B200 is starting to appear), which is more expensive per hour but 3-4 times faster. Going by AWS pricing, if you trained the exact same SD1.5 model on an AWS H100 cluster it would only cost about $200k.
But wait; there’s even more we can cut. AWS is the boutique end of GPU rental. If you want something more in line with the market price for compute you can go to runpod.io. Using their figures, training on A100 cards would only be about $300k, and using H100s it would be about $115k.
If we take these figures and apply the naive 3.5x scaling factor for the much more capable SDXL, that $115k works out to around $400k. Let’s be generous and round it all the way up to $1 million. Again: this is petty-cash-level expenditure for a larger company.
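For anyone who wants to check my sums, here is a sketch of that cost chain. The $/GPU-hour rates below are my own ballpark assumptions, not quoted prices from AWS or runpod.io; swap in current rates to redo the numbers.
    # Rough reconstruction of the image-model cost chain.  The $/GPU-hour
    # rates are my own ballpark assumptions, not quoted prices.
    sd15_a100_hours = 150_000     # SD1.5 training run, per the published figure
    h100_speedup = 3.5            # assumed midpoint of "3-4 times faster"
    sdxl_scale = 3.5              # naive parameter-count scaling up to SDXL

    rates = {                     # assumed $ per GPU-hour
        "AWS A100": 4.10,
        "AWS H100": 4.70,
        "runpod A100": 2.00,
        "runpod H100": 2.70,
    }

    for name, rate in rates.items():
        hours = sd15_a100_hours / (h100_speedup if "H100" in name else 1.0)
        sd15_cost = hours * rate
        print(f"{name}: SD1.5 ~${sd15_cost/1e3:.0f}k, "
              f"SDXL (naive x{sdxl_scale}) ~${sd15_cost*sdxl_scale/1e3:.0f}k")
    # -> roughly the $600k / $200k / $300k / $115k and ~$400k figures above.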
Future directions…
While I was coming up with figures for this post I asked o3 to work out an estimate of what it would cost to train a brand new 7b model, using RunPod prices and the current well-known state of the art in training techniques, but nothing exotic. The figures it came up with were on the order of $15-30k worth of compute, assuming no disastrous failed runs.
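I can’t reproduce o3’s working, but here is one way such an estimate could be assembled from first principles. Every input below (token budget, FLOPs rule of thumb, utilization, hourly rate) is an assumption of mine rather than anything from o3’s answer, so treat it as a sanity check, not a costing.
    # First-principles sketch of training a brand new 7b model on rented H100s.
    # Every input here is an assumption, not a figure from o3's estimate.
    params = 7e9
    tokens = 20 * params           # Chinchilla-style ~20 tokens per parameter
    flops = 6 * params * tokens    # standard 6*N*D training-FLOPs rule of thumb

    h100_peak = 989e12             # H100 BF16 peak, FLOP/s
    mfu = 0.30                     # assumed realistic model-FLOPs utilization
    gpu_hours = flops / (h100_peak * mfu) / 3600

    rate = 2.70                    # assumed RunPod H100 $ per GPU-hour
    print(f"{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * rate / 1e3:.0f}k of compute")
    # -> roughly 5,500 GPU-hours and ~$15k; a bigger token budget, lower
    #    utilization, or failed runs push it toward the top of that range.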
At which point we are talking about something which the medium to large end of small businesses can do without wincing.
Or a well-off hobbyist.
I currently have a janky AI “server” which I’m going to be rebuilding into a proper server with a 4x V100 NVLink board. The V100 is a couple of generations behind even the A100, which is why I’m able to get them cheaply.
Just counting those with no additional GPUs, limiting training time to 1 month, and using current training techniques, I will be able to train a brand new 2 billion parameter model at home. If I did it in summer the power cost would be about $70. If I was smart and did it in winter the power cost would be only $51. That doesn’t count additional AC or reduced furnace needs.
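The power math reconstructs easily enough. The wattage and the seasonal electricity rates below are my own assumptions (the rates roughly back-solved from the dollar figures), not measured numbers:
    # Rough power-cost check for a month-long run on 4x V100 plus the host.
    # Wattage and electricity rates are assumptions, not measured values.
    draw_kw = 4 * 0.30 + 0.20      # ~300 W per V100 under load plus ~200 W of host
    hours = 30 * 24                # one month of continuous training
    kwh = draw_kw * hours          # ~1,000 kWh

    for season, rate in (("summer", 0.070), ("winter", 0.051)):   # assumed $/kWh
        print(f"{season}: ~${kwh * rate:.0f}")
    # -> close to the ~$70 summer / ~$51 winter figures above.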
Even if nothing else could pay off the cost of training models, once a given size of model is within the capabilities of a geek who doesn’t have a ton of money to spend on the problem, your economic objections fly out the window.
Now tell me: what happens to computer hardware when it gets old and stops being useful in datacenters?
[The image for this post was generated on my existing V100, 19 seconds @ 150W, or about $0.0001 in power]









