We engineer tomorrow to build a better future.
Solutions to your liquid cooling challenges.
 
 
The five global giants' GPU totals revealed: H100 equivalents could top 12.4 million in 2025
新智元, December 2, 2024, 13:07, Beijing

In the AI giants' battle for chips, Google and Microsoft currently rank first and second, while newcomer xAI is rising fast. Who will come out on top?

This year, Musk shook the world with Colossus, the largest AI supercomputer on the planet. The machine is fitted with 100,000 Nvidia H100/H200 GPUs and is expected to grow to 200,000. Since then the other AI giants have felt the pressure, the data-center race has intensified, and each of them is drawing up its own build-out plans. Recently, a post on LessWrong used public data to estimate Nvidia's chip production and the GPU/TPU counts of the major AI players, and to look at where the chips are headed.

Blog post: https://www.lesswrong.com/posts/bdQhzQsHjNrQp7cNS/estimates-of-gpu-or-equivalent-resources-of-large-ai-players#Nvidia_chip_production


As things stand, the compute the world's five biggest tech companies hold in 2024, with projections for 2025:

Microsoft: 750,000-900,000 H100 equivalents, projected to reach 2.5-3.1 million next year
Google: 1-1.5 million H100 equivalents, projected to reach 3.5-4.2 million
Meta: 550,000-650,000 H100 equivalents, projected to reach 1.9-2.5 million
Amazon: 250,000-400,000 H100 equivalents, projected to reach 1.3-1.6 million
xAI: about 100,000 H100 equivalents, projected to reach 550,000-1 million

Company | 2024 YE (H100 equivalent) | 2025 (GB200) | 2025 YE (H100 equivalent)
MSFT | 750k-900k | 800k-1m | 2.5m-3.1m
GOOG | 1m-1.5m | 400k | 3.5m-4.2m
META | 550k-650k | 650k-800k | 1.9m-2.5m
AMZN | 250k-400k | 360k | 1.3m-1.6m
XAI | ~100k | 200k-400k | 550k-1m

 

Summary of estimated chip counts


Clearly, all of them are racing to lay out their compute footprints and train the next generation of more advanced models. Google's Gemini 2.0 is expected to launch this month. Musk has previously said that Grok 3 will also appear around the end of the year, though the exact date is unknown; he claims that once it finishes training on legal datasets, Grok 3 will be a powerful personal lawyer available around the clock. To keep pace, OpenAI's o2 model is reportedly in training as well.


None of this training happens without GPUs and TPUs.

Nvidia still rules the GPU market, and may sell 7 million chips in 2025

There is no doubt that Nvidia has long since become the largest producer of data-center GPUs. Based on the fiscal 2025 third-quarter results it reported on November 21, Nvidia's data-center revenue for calendar 2024 is expected to reach $110 billion, more than double 2023's $42 billion, and could exceed $173 billion in 2025.


The bulk of that revenue comes from GPUs. Nvidia is estimated to ship 6.5 to 7 million GPUs in 2025, almost all of them the latest Hopper and Blackwell models; based on the production mix and ramp expectations, that breaks down to roughly 2 million Hopper and 5 million Blackwell chips.


This year's output: 5 million H100s

So how many chips did Nvidia actually produce in 2024? Sources are thin and some of them conflict. One estimate puts Q4 2024 Hopper output at about 1.5 million GPUs, but that figure includes some lower-performance H20 chips and is therefore an upper bound. Scaling by quarter-over-quarter data-center revenue suggests a full-year upper bound of about 5 million, based on the assumption of roughly $20,000 of revenue per H100-equivalent, which looks low; with a more plausible $25,000 the figure comes out around 4 million. This conflicts with estimates from earlier in the year of 1.5 to 2 million H100s. Whether the gap comes down to H100 versus H200, expanded capacity, or something else is unclear, but because the lower estimate is inconsistent with the revenue data, the higher figure is used here.
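A minimal back-of-envelope sketch of that revenue-based estimate; the $110 billion revenue figure and the $20k/$25k per-chip prices are the estimates quoted in the text, and treating all data-center revenue as GPU sales is a simplification, which is why the naive division lands slightly above the 4-5 million range:

```python
# Back-of-envelope version of the revenue-based production estimate.
# Assumptions (not Nvidia disclosures): calendar-2024 data-center revenue of
# roughly $110bn is treated as if it were all GPU sales, at an assumed
# $20k (low) or $25k (more plausible) of revenue per H100-equivalent.
DATA_CENTER_REVENUE_2024 = 110e9  # USD, estimated

for price_per_chip in (20_000, 25_000):
    units = DATA_CENTER_REVENUE_2024 / price_per_chip
    print(f"${price_per_chip:,}/chip -> ~{units / 1e6:.1f}M H100 equivalents")
# Prints ~5.5M and ~4.4M; the 4-5M range in the text is a little lower
# because not all data-center revenue actually comes from GPU sales.
```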


Earlier production

For judging who has the most compute now and going forward, pre-2023 data has little effect on the overall picture, mainly because the GPUs themselves have improved and, judging from Nvidia's sales, production has grown sharply. Based on estimates that Microsoft and Meta each obtained about 150,000 H100s in 2023, and on Nvidia's data-center revenue, total 2023 production of H100-class chips was probably around 1 million.


H100-equivalent forecasts for the five tech giants

How many H100 equivalents will Microsoft, Meta, Google, Amazon, and xAI hold by the end of 2024, and how many GPUs/TPUs will they scale to in 2025?
Nvidia's quarterly (10-Q) and annual (10-K) filings distinguish between "direct" and "indirect" customers.
About 46% of revenue comes from direct customers, system integrators such as SMC, HPE, and Dell.
They buy the GPUs, assemble them into servers, and supply them to the indirect customers.
Indirect customers cover a much broader range: public cloud providers, consumer internet companies, enterprises, public-sector organizations, and startups.
Put plainly, Microsoft, Meta, Google, Amazon, and xAI are all "indirect customers" (the disclosures about their GPU holdings are looser and may be less reliable).
In its fiscal 2024 annual report, Nvidia disclosed that roughly 19% of total revenue came from a single indirect customer that purchases mainly through system integrators and distributors.


Under the reporting rules, Nvidia must disclose any customer that accounts for more than 10% of revenue. So what does this figure tell us?
Either the second-largest customer is at most half the size of the largest, or there is some measurement error in these numbers.
Who might that largest customer be?
On the available evidence, the most likely candidate is Microsoft.


Microsoft and Meta

Microsoft is very likely Nvidia's largest customer in both years, for several reasons:
It runs one of the world's largest public clouds; it is OpenAI's main compute supplier; unlike Google and Amazon, it has not deployed its own custom chips at scale; and it appears to enjoy a privileged relationship with Nvidia, being the first company to receive Blackwell GPUs.
In October, Microsoft Azure began testing racks of 32 GB200 servers.


Microsoft's revenue share for 2024 is not reported as precisely as for 2023: Nvidia's Q2 10-Q gives 13% for the first half, and Q3 only says "over 10%".
This suggests Microsoft's share of Nvidia's sales has come down from 2023.
Bloomberg data, meanwhile, puts Microsoft at 15% of Nvidia's revenue, followed by Meta at 13%, Amazon at 6%, and Google at about 6% (the source does not say which years these figures cover).
Omdia research from last year estimated that at the end of 2023 Meta and Microsoft each had 150,000 H100s, with Amazon, Google, and Oracle at 50,000 each, which lines up better with the Bloomberg numbers.


Meta, however, has publicly said it would have the equivalent of 600,000 H100s of compute by the end of 2024. That reportedly includes 350,000 H100s, with most of the balance likely to be H200s plus a small number of Blackwell chips arriving in the final quarter.


If we take that 600,000 figure as accurate and apply the revenue shares, we get a better estimate of Microsoft's available compute: 25% to 50% more than Meta's, or roughly 750,000 to 900,000 H100 equivalents.


Google and Amazon

Judged purely by their contributions to Nvidia's revenue, Amazon and Google clearly trail Microsoft and Meta, but the two cases are very different. Google already runs a large fleet of its own custom TPUs, which are the main chips for its internal workloads. Last December, Google launched TPU v5p, its most powerful AI accelerator to date.


In a late-2023 report, Semianalysis said Google is the only firm with great in-house chips.


Google's ability to deploy AI at scale reliably, at low cost and high performance, is nearly unmatched, making it the most compute-rich firm in the world.


And Google is only spending more on infrastructure. Its Q3 2024 earnings put AI capital expenditure at an estimated $13 billion, "the majority" of it on technical infrastructure, 60% of which was servers (GPUs/TPUs).


"The majority" probably means $7-11 billion, of which an estimated $4.5-7 billion went to TPU/GPU servers.
Taking a 2:1 estimate for TPU versus GPU spend, and conservatively assuming TPU performance per dollar matches that of Microsoft's GPU spend, Google is projected to hold the equivalent of 1 to 1.5 million H100s by the end of 2024.
Amazon's internal AI workloads, by contrast, are probably much smaller.


Amazon holds a sizable number of Nvidia chips mainly to serve external GPU demand on its cloud platform, above all Anthropic's need for compute.
After all, like Microsoft, Amazon is a deep-pocketed backer, in its case supplying ample compute to OpenAI's main rival.


Amazon does have its own Trainium and Inferentia chips, but it started on them much later than Google did with its TPUs.
These chips appear to be well behind the state of the art; Amazon has even offered up to $110 million in free credits to get users to try them, which suggests adoption so far has been underwhelming.


Mid-2024, though, seems to have brought a turning point for Amazon's custom silicon.
On the Q3 2024 earnings call, CEO Andy Jassy said of Trainium2 that the chips were seeing significant interest and that Amazon had gone back to its manufacturing partners several times to produce far more than originally planned.
Semianalysis has reported that "our data shows that both Microsoft's and Google's 2024 spending plans on AI infrastructure would have them deploying far more compute than Amazon".
How these chips translate into H100 equivalents is unclear, and counts of Trainium/Trainium2 chips are hard to come by, beyond the 40,000 made available through the free-credit program mentioned above.
xAI

xAI's signature infrastructure achievement this year was building the world's largest supercomputer, a cluster of 100,000 H100s, in 122 days.
And it keeps growing: Musk has announced plans to expand it to 200,000 H100/H200 GPUs.


The xAI supercomputer reportedly appears to be running into some problems getting enough power to the site.
Blackwell forecast for 2025

The latest 2024 State of AI report estimates Blackwell purchases: the big cloud companies are buying GB200 systems in huge volumes, with Microsoft at 700,000 to 1.4 million, Google at 400,000, and AWS at 360,000. OpenAI is rumored to have at least 400,000 GB200s to itself.


If Microsoft's GB200 estimate is set at 1 million, the Google and AWS figures are consistent with their usual ratios to Microsoft in Nvidia purchases. That would also put Microsoft at 12% of Nvidia's total revenue, in line with the small decline in its revenue share seen in 2024. The report gives no figure for Meta, but Meta expects a significant acceleration in AI-related infrastructure spending next year, which suggests its share of Nvidia spending will stay high; the LessWrong post assumes Meta's 2025 spending will hold at roughly 80% of Microsoft's. xAI is not mentioned either, but Musk has claimed a 300,000-chip Blackwell cluster will be running by the summer of 2025. Allowing for Musk's usual hyperbole, a more reasonable estimate is that xAI will actually have 200,000 to 400,000 of these chips by the end of 2025. So how many H100s is one B200 worth? That question is key to gauging the growth in compute.


For training, the current best estimate (as of November 2024) is about a 2.2x uplift. On launch day Nvidia's own figures claimed the GB200, which pairs two B200s, delivers 7x the performance of an H100 and trains 4x faster.
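As a minimal illustration of how that factor feeds the 2025 column of the summary table earlier, here is the xAI case (the simplest, since no cloud-share adjustment applies); the chip counts are this article's estimates, and applying 2.2x per Blackwell unit is the simplification used throughout:

```python
# Convert an estimated 2025 Blackwell count into H100 equivalents with the
# 2.2x training factor, then add the 2024 base. Figures are this article's
# estimates for xAI, not reported numbers.
H100_PER_BLACKWELL = 2.2

xai_base_2024 = 100_000                  # ~100k H100 equivalents at YE 2024
xai_blackwell_2025 = (200_000, 400_000)  # estimated Blackwell chips in 2025

low = xai_base_2024 + xai_blackwell_2025[0] * H100_PER_BLACKWELL
high = xai_base_2024 + xai_blackwell_2025[1] * H100_PER_BLACKWELL
print(f"xAI YE 2025: ~{low / 1e3:.0f}k-{high / 1e3:.0f}k H100 equivalents")
# -> ~540k-980k, roughly the 550k-1m range in the table above.
```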
For Google, Nvidia chips are assumed to continue to make up one-third of its total marginal compute; for Amazon the assumed share is 75%. Note that a large number of H100 and GB200 chips remain unaccounted for in these figures.
Some sit with organizations below Nvidia's revenue-reporting threshold, and cloud providers such as Oracle and other small and mid-sized clouds likely hold substantial numbers as well.

There are also some significant non-US Nvidia customers. Having surveyed how much GPU/TPU compute each player holds, the next question is where it all goes.

How much compute do the giants use to train their models?

Everything above is an estimate of each AI giant's total compute, but many readers probably care more about how much compute goes into training the latest frontier models. The following looks at OpenAI, Google, Anthropic, Meta, and xAI. Because these companies are either private or so large that they need not disclose cost breakdowns at this level (for Google, AI training costs are still only a small slice of its business), the analysis is necessarily speculative.
OpenAI and Anthropic

OpenAI's 2024 training costs were expected to reach $3 billion, with inference costs of $4 billion.


Microsoft is said to be supplying OpenAI with 400,000 GB200 GPUs for training, which exceeds AWS's entire GB200 capacity and keeps OpenAI's training capacity well ahead of Anthropic's. Anthropic, meanwhile, is expected to lose about $2 billion in 2024 on revenue of a few hundred million dollars. Since Anthropic's revenue comes mostly from its API, which should carry positive gross margins, and its inference costs should be relatively low, most of that $2 billion likely went to model training. A conservative estimate puts its training costs at $1.5 billion, about half of OpenAI's, which does not rule out competitiveness at the frontier. The gap is understandable: Anthropic's main cloud provider is AWS, whose resources are generally thinner than those of Microsoft, which supplies OpenAI, and that may constrain Anthropic.


Google and Meta

Google's Gemini Ultra 1.0 used roughly 2.5x the compute of GPT-4 yet launched nine months later, and used about 25% more compute than Meta's latest Llama model. Although Google probably has more compute than anyone else, as a cloud giant it faces more varied demands on that compute. Unlike Anthropic or OpenAI, which concentrate on model training, Google and Meta both have to support large internal workloads, such as the recommendation algorithms behind their social media products. Llama 3 used less compute than Gemini despite shipping eight months later, which suggests Meta allocates fewer resources to frontier models than OpenAI and Google do.
xAI

xAI reportedly used 20,000 H100s to train Grok 2 and plans to use 100,000 H100s for Grok 3. For reference, GPT-4 was reportedly trained on 25,000 A100s over 90-100 days. Given that an H100 is roughly 2.25x an A100, Grok 2's training compute was about double GPT-4's, and Grok 3 is projected at another 5x on top of that, putting it at the leading edge of compute use.


xAI does not rely entirely on its own chips; some capacity is rented. It is estimated to lease 16,000 H100s from Oracle's cloud. If xAI devotes a similar share of its compute to training as OpenAI or Anthropic, its training scale is probably comparable to Anthropic's and somewhat below OpenAI's and Google's.


References:
https://www.lesswrong.com/posts/bdQhzQsHjNrQp7cNS/estimates-of-gpu-or-equivalent-resources-of-large-ai-players


 

 

Estimates of GPU or equivalent resources of large AI players for 2024/5
by CharlesD 29th Nov 2024


AI infrastructure numbers are hard to find with any precision. There are many reported numbers of the form "[company] spending Xbn on infrastructure this quarter", "[company] has bought 100k H100s", or "[company] has a cluster of 100k H100s", but when I went looking for an estimate of how much compute a given company had access to, I could not find consistent numbers available. Here I’ve tried to pull together information from a variety of sources to get ballpark estimates of (i) as of EOY 2024, who do we expect to have how much compute? and (ii) how do we expect that to change in 2025? I then spend a little time talking about what that might mean for training compute availability at the main frontier labs. Before going into this, I want to lay out a few caveats:

These numbers are all estimates I’ve made from publicly available data, in limited time, and are likely to contain errors and miss some important information somewhere.
There are very likely much better estimates available from paywalled vendors, who can spend more time going into detail of how many fabs there are, what each fab is likely producing, where the data centers are and how many chips are in each one, and other detailed minutiae and come to much more accurate numbers. This is not meant to be a good substitute for that, and if you need very accurate estimates I suggest you go pay one of several vendors for that data.
With that said, let’s get started.

Nvidia chip production
The first place to start is by looking at the producer of the most important data center GPUs, Nvidia. As of November 21st, after Nvidia reported 2025 Q3 earnings[1], calendar year Data Center revenues for Nvidia look to be around $110bn. This is up from $42bn in 2023, and is projected to be $173bn in 2025 (based on this estimate of $177bn for fiscal 2026).[2]

Data Center revenues are overwhelmingly based on chip sales. 2025 chip sales are estimated to be 6.5-7m GPUs, which will almost entirely be Hopper and Blackwell models. I have estimated 2m Hopper models and 5m Blackwell models based on the proportion of each expected from the CoWoS-S and CoWoS-L manufacturing processes and the expected pace of Blackwell ramp up.

2024 production
Sources for 2024 production numbers were thin and often conflicting, but estimates of 1.5m Hopper GPUs for Q4 2024 (though this will include some H20 chips, a significantly inferior chip, and so is an upper bound) and data center revenue ratios quarter by quarter suggest an upper bound of 5m were produced (this would assume approx $20k of revenue per H100-equivalent, which seems low; using a more plausible $25k we get 4m). This is in conflict with estimates of 1.5-2m H100s produced from earlier in the year. Whether this difference could plausibly be attributed to H100 vs H200, expanded capacity, or another factor is unclear, but since the lower estimate is incongruent with their revenue numbers I have chosen to use the higher figure.

Previous production
For the purpose of knowing who has the most compute now and especially going forward, pre 2023 numbers are not going to significantly move the needle, due to improvements in GPUs themselves and big increases in the production numbers, based on Nvidia sales.

Based on estimates that Microsoft and Meta each got 150k H100s in 2023, and looking at Nvidia Data Center revenues, something in the 1m range for H100 equivalent production in 2023 seems likely.

GPU/TPU counts by organisation
Here I try to get estimates for how many chips (expressed as H100 equivalents) each of Microsoft, Meta, Google, Amazon and XAI will have access to at Year End 2024, and project numbers for 2025.

Numerous sources report things to the effect that “46% of Nvidia’s revenue came from 4 customers”. However, this is potentially misleading. If we look at Nvidia 10-Qs and 10-Ks, we can see that they distinguish between direct and indirect customers, and the 46% number here refers to direct customers. However, direct customers are not what we care about here. Direct customers are mostly middlemen like SMC, HPE and Dell, who purchase the GPUs and assemble the servers used by indirect customers, such as public cloud providers, consumer internet companies, enterprises, public sector and startups.

The companies we care about fall under “indirect customers”, and the disclosures around these are slightly looser, and possibly less reliable. For fiscal year 2024 (approx 2023 as discussed) Nvidia’s annual report disclosed that “One indirect customer which primarily purchases our products through system integrators and distributors [..] is estimated to have represented approximately 19% of total revenue”. They are required to disclose customers with >10% revenue share[3], so either their second customer is at most half as big as the first, or there are measurement errors here[4]. Who is this largest customer? The main candidate seems to be Microsoft. There are sporadic disclosures on a quarterly basis of a second customer exceeding 10% briefly[5], but not consistently and not for either the full year 2023 or the first 3 quarters of 2024[6].

Estimating H100 equivalent chip counts at year end 2024
Microsoft, Meta
Given Microsoft has one of the largest public clouds, is the major provider of compute to OpenAI, does not (unlike Google and possibly Amazon) have a significant installed base of its own custom chips, and appears to have a privileged relationship with Nvidia relative to peers (they were apparently the first to get Blackwell chips, for example) it seems very likely that this largest customer is Microsoft in both years. The revenue share for 2024 is not specified as precisely as for 2023, with 13% of H1 revenue mentioned in the Nvidia Q2 10-Q and just “over 10%” for Q3, but 13% seems a reasonable estimate, suggesting their share of Nvidia sales decreased from 2023.

There are other estimates of customer sizes - Bloomberg data estimates that Microsoft makes up 15% of Nvidia's revenue, followed by Meta Platforms at 13% of revenue, Amazon at 6% of revenue, and Google at about 6% of revenue - it is not clear from the source which years this refers to. Reports of the numbers of H100 chips possessed by these cloud providers as of year end 2023 (150k for Meta and Microsoft, and 50k each for Amazon, Google and Oracle) align better with the Bloomberg numbers.

An anchoring data point here is Meta’s claim that Meta would have 600k H100 equivalents of compute by year end 2024. This was said to include 350k H100s, and it seems likely most of the balance would be H200s and a smaller number of Blackwell chips arriving in the last quarter[7].

If we take this 600k as accurate and use the proportion of revenue numbers, we can get better estimates for Microsoft’s available compute as being somewhere between 25% and 50% higher than this, which would be 750k-900k H100 equivalents.
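A small sketch of that anchoring step; the 600k figure is Meta's claim and the 25-50% uplift is the revenue-share-based assumption discussed above:

```python
# Scale Meta's claimed 600k H100 equivalents by the 25-50% gap implied by
# the relative Nvidia revenue shares discussed above.
meta_h100_equiv = 600_000
uplift_low, uplift_high = 1.25, 1.50

msft_low = int(meta_h100_equiv * uplift_low)    # 750,000
msft_high = int(meta_h100_equiv * uplift_high)  # 900,000
print(f"Microsoft YE 2024: {msft_low:,}-{msft_high:,} H100 equivalents")
```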

Google, Amazon
Amazon and Google are consistently suggested to be behind here in terms of their contribution to Nvidia revenues. However, these are two quite different cases.

Google already has substantial amounts of its own custom TPUs, which are the main chips used for their own internal workloads[8]. It seems very likely that Amazon’s internal AI workloads are much smaller than this, and that their comparable amounts of Nvidia chips reflect mostly what they expect to need to service external demand for GPUs via their cloud platforms (most significantly, demand from Anthropic).

Let’s take Google first. As mentioned, TPUs are the main chip used for their internal workloads. A leading subscription service providing data on this sector, Semianalysis, claimed in late 2023 that “[Google] are the only firm with great in-house chips” and “Google has a near-unmatched ability to deploy AI at scale reliably with low cost and high performance”, and that they were “The Most Compute Rich Firm In The World”. Their infrastructure spend has remained high[9] since these stories were published.

Taking a 2:1 estimate for TPU vs GPU spend[9] and assuming (possibly conservatively) that TPU performance per dollar is equivalent to that of Microsoft’s GPU spend, I get to numbers in the range of 1m-1.5m H100 equivalents as of year end 2024.

Amazon, on the other hand, also has their own custom chips, Trainium and Inferentia, but they got started on these far later than Google did with its TPUs, and it seems like they are quite a bit behind the cutting edge with these chips, even offering $110m in free credits to get people to try them out, suggesting they’ve not seen great adoption to date. Semianalysis suggest “Our data shows that both Microsoft and Google’s 2024 spending plans on AI Infrastructure would have them deploying far more compute than Amazon” and “Furthermore, their upcoming in-house chips, Athena and Trainium2 still lag behind significantly.”

What this means in terms of H100 equivalents is not clear, and numbers on the count of Trainium or Trainium2 chips are hard to come by, with the exception of 40,000 being available for use in the free credits programme mentioned above.

However, as of mid 2024 this may have changed - on their Q3 2024 earnings call CEO Andy Jassy said regarding Trainium2 “We're seeing significant interest in these chips, and we've gone back to our manufacturing partners multiple times to produce much more than we'd originally planned.” At that point however, they were “starting to ramp up in the next few weeks” so it seems unlikely they will have huge supply on board in 2024.

XAI
The last significant player I will cover here is XAI. They have grown rapidly, and have some of the largest clusters and biggest plans in the space. They revealed an operational 100k H100 cluster in late 2024, but there seem to be issues with them getting enough power to the site at the moment.

2025 - Blackwell
The 2024 State of AI report has estimates of Blackwell purchases by major providers - “Large cloud companies are buying huge amounts of these GB200 systems: Microsoft between 700k - 1.4M, Google 400k and AWS 360k. OpenAI is rumored to have at least 400k GB200 to itself.” These numbers are for the chips in total and so we are at risk of double counting 2024 Blackwell purchases, so I have discounted them by 15%.

The Google and AWS numbers here are consistent with their typical ratio to Microsoft in Nvidia purchases, if we take 1m as the Microsoft estimate. This would also leave Microsoft at 12% of Nvidia total revenues[10], consistent with a small decline in its share of Nvidia revenue as was seen in 2024.

No Meta estimate was given in this report, however Meta anticipates a “significant acceleration” in artificial intelligence-related infrastructure expenses next year, suggesting its share of Nvidia spending will remain high. I have assumed they will remain at approximately 80% of Microsoft spend in 2025.

For XAI, they are not mentioned much in the context of these chips, but Elon Musk claimed they would have a 300k Blackwell cluster operational in summer 2025. Assuming some typical hyperbole on Musk's part it seems plausible they could have 200k-400k of these chips by year end 2025.

How many H100s is a B200 worth? For the purpose of measuring capacity growth, this is an important question. Different numbers are cited for training and for inference, but for training 2.2x is the current best estimate (Nov 2024).

For Google, I have assumed the Nvidia chips continue to be one-third of their total marginal compute. For Amazon, I have assumed they are 75%. These numbers are quite uncertain and the estimates are sensitive to them.
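To show how these assumptions combine, here is a sketch that approximately reproduces the summary table below: each 2025 year-end figure is the 2024 base plus the GB200 estimate converted at 2.2x, divided by the assumed Nvidia share of marginal compute (one-third for Google, 75% for Amazon). All inputs are this post's estimates, and the post's own rounding and the 15% discount mentioned above mean the edges differ slightly.

```python
# Approximate reconstruction of the summary table below, under this post's
# assumptions: 2025 YE = 2024 base + (GB200 estimate x 2.2) / Nvidia share
# of marginal compute. All inputs are estimates, not reported figures.
H100_PER_GB200 = 2.2  # training-performance factor used above

players = {
    # name: (2024 YE base, 2025 GB200 estimate, Nvidia share of marginal compute)
    "MSFT": ((750e3, 900e3), (800e3, 1_000e3), 1.00),
    "GOOG": ((1_000e3, 1_500e3), (400e3, 400e3), 1 / 3),
    "META": ((550e3, 650e3), (650e3, 800e3), 1.00),
    "AMZN": ((250e3, 400e3), (360e3, 360e3), 0.75),
    "XAI": ((100e3, 100e3), (200e3, 400e3), 1.00),
}

for name, (base, gb200, share) in players.items():
    lo = base[0] + gb200[0] * H100_PER_GB200 / share
    hi = base[1] + gb200[1] * H100_PER_GB200 / share
    print(f"{name}: {lo / 1e6:.1f}m-{hi / 1e6:.1f}m H100 equivalents at YE 2025")
```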

It is worth noting that there are still many, many H100s and GB200s unaccounted for here, and that there could be significant aggregations of them elsewhere, especially under Nvidia’s 10% reporting threshold. Cloud providers like Oracle and other smaller cloud providers likely hold many, and there are likely some non-US customers of significance too, as Nvidia in Q3 2025 said that 55% of revenue came from outside the US in the year to date (down from 62% the previous year). As this is direct revenue, it may not all correspond to non-US final customers.

Summary of estimated chip counts [11]
Company | 2024 YE (H100 equivalent) | 2025 (GB200) | 2025 YE (H100 equivalent)
MSFT | 750k-900k | 800k-1m | 2.5m-3.1m
GOOG | 1m-1.5m | 400k | 3.5m-4.2m
META | 550k-650k | 650k-800k | 1.9m-2.5m
AMZN | 250k-400k | 360k | 1.3m-1.6m
XAI | ~100k | 200k-400k | 550k-1m
Model training notes
The above numbers are estimates for total available compute, however many people are likely to care more about how much compute might be used to train the latest frontier models. I will focus on OpenAI, Google, Anthropic, Meta and XAI here. This is all quite speculative as all these companies are either private or so large they do not have to disclose the breakdowns of costs for this, which in Google’s case is a tiny fraction of their business as it stands.

OpenAI 2024 training costs were expected to reach $3bn, with inference costs at $4bn. Anthropic, per one source, “are expected to lose about ~$2B this year, on revenue in the high hundreds of millions”. This suggests total compute costs more on the order of $2bn than OpenAI’s $7bn. Their inference costs will be substantially lower: given their revenue mostly comes from the API and should have positive gross margins, this suggests that most of that $2bn was for training. Let’s say $1.5bn. A factor of two disadvantage in training costs vs OpenAI does not seem like it would prohibit them from being competitive. A gap also seems likely because their primary cloud provider is AWS, which as we’ve seen has typically had fewer resources than Microsoft, which provides OpenAI’s compute. The State of AI report mentioned earlier suggested 400k GB200 chips were rumoured to be available to OpenAI from Microsoft, which would exceed AWS’s entire rumoured GB200 capacity and therefore likely keep them well above Anthropic’s training capacity.

Google is less clear. The Gemini Ultra 1.0 model was trained on approximately 2.5x the compute of GPT-4, but published 9 months later, and on about 25% more compute than the latest Llama model. Google, as we have seen, probably has more compute available than peers; however, as a major cloud provider and a large business it has more demands[12] on its compute than Anthropic or OpenAI or even Meta, which also has substantial internal workloads separate from frontier model training, such as recommendation algorithms for its social media products. Llama 3 being smaller in compute terms than Gemini despite being published 8 months later suggests Meta has so far been allocating slightly fewer resources to these models than OpenAI or Google.

XAI allegedly used 20k H100s to train its Grok 2, and projected up to 100k H100s would be used for Grok 3. Given GPT-4 was allegedly trained on 25,000 Nvidia A100 GPUs over 90-100 days, and a H100 is about 2.25x an A100, this would put Grok 2 at around double the compute of GPT-4 and project another 5x for Grok 3, putting it towards the leading edge.
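A quick worked version of that comparison; the GPU counts are the rumoured figures above, the 2.25x H100:A100 ratio is the post's approximation, and training durations are treated as comparable for simplicity:

```python
# Rough training-compute comparison (GPU count x relative per-GPU performance),
# treating run lengths as comparable. Counts are the rumoured figures above.
H100_VS_A100 = 2.25  # approximate performance ratio used in the post

gpt4_compute = 25_000                    # 25k A100s, in A100-units
grok2_compute = 20_000 * H100_VS_A100    # 20k H100s, in A100-units
grok3_compute = 100_000 * H100_VS_A100   # planned 100k H100s, in A100-units

print(f"Grok 2 vs GPT-4: ~{grok2_compute / gpt4_compute:.1f}x")    # ~1.8x ("around double")
print(f"Grok 3 vs Grok 2: ~{grok3_compute / grok2_compute:.0f}x")  # 5x
```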

Note that not all of this has historically come from their own chips - they are estimated to rent 16,000 H100s from Oracle cloud. If XAI is able to devote a similar fraction of its compute to training as OpenAI or Anthropic, I would guess its training is likely to be similar in scale to Anthropic and somewhat below OpenAI and Google.

Thanks to Josh You for feedback on a draft of this post. All errors are my own. Note that Epoch have an estimate of numbers for 2024 here which mostly lines up with the figures I estimated, which I only found after writing this post, though I expect we used much of the same evidence so the estimates are not independent.

^
yes, 2025 - Nvidia’s fiscal year annoyingly runs from Feb-Jan and so their earnings in calendar year 2024 are mostly contained in fiscal year 2025

^
Note that for ease of comparison with other numbers, I have attempted to adjust Nvidia numbers back by a month, allowing calendar years to line up

^
Note that this is >10% of total revenue, not Data Center revenue, but Nvidia confirms it is attributable to their Data Center segment for all these customers.

^
From the Q2 2025 report - “Indirect customer revenue is an estimation based upon multiple factors including customer purchase order information, product specifications, internal sales data and other sources. Actual indirect customer revenue may differ from our estimates”.

^
Q2 2025 - “For the second quarter of fiscal year 2025, two indirect customers which primarily purchase our products through system integrators and distributors, including through Customer B and Customer E, are estimated to each represent 10% or more of total revenue attributable to the Compute & Networking segment. For the first half of fiscal year 2025, an indirect customer which primarily purchases our products from system integrators and distributors, including from Customer E, is estimated to represent 10% or more of total revenue, attributable to the Compute & Networking segment. “ this implies one customer exceeded the threshold only for Q2 and not for H1

^
Q3 2025 - For the third quarter and first nine months of fiscal year 2025, an indirect customer which primarily purchases our products through system integrators and distributors, including through Customer C, is estimated to represent 10% or more of total revenue, attributable to the Compute & Networking segment.

^
This source suggests 500k H100s, but I think this possibly stems from a misreading of the original Meta announcement which referred to 350k H100s total, and this source also omits H200s entirely.

^
From Google: “TPUs have long been the basis for training and serving AI-powered products like YouTube, Gmail, Google Maps, Google Play, and Android. In fact, Gemini was trained on, and is served, using TPUs.”

^
Google's Q3 2024 earnings report contained an estimate of $13bn for AI CapEx in Q3 2024,"the majority" on technical infra, 60% of which was servers (GPUs,TPUs). Taking “the majority” to mean $7-11bn, 60% of this being on servers suggests they spent $4.5-7bn that quarter on TPUs/GPUs. If we estimate them as being 6% of Nvidia total revenue as Bloomberg suggests, then they spent about $1.8bn on Nvidia GPUs, so that leaves $2.7bn-$5.2bn to spend on other servers. Given internal workloads run on TPUs, it seems likely the TPU spend is quite a bit higher than GPU spend, so taking the middle of this range we get just under $4bn on TPUs.
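A sketch of this footnote's arithmetic; the $4.5-7bn server figure and the ~$1.8bn Nvidia GPU spend are the footnote's own rounded estimates, not disclosed breakdowns:

```python
# Footnote 9 arithmetic as a sketch. All figures are this footnote's own
# rounded estimates, not a disclosed breakdown of Google's CapEx.
servers_low, servers_high = 4.5e9, 7e9  # ~60% of "the majority" ($7-11bn) of the $13bn Q3 CapEx
nvidia_gpu_spend = 1.8e9                # ~6% of Nvidia's quarterly revenue (Bloomberg share)

other_low = servers_low - nvidia_gpu_spend    # $2.7bn
other_high = servers_high - nvidia_gpu_spend  # $5.2bn
midpoint = (other_low + other_high) / 2       # ~$3.95bn, i.e. "just under $4bn on TPUs"

print(f"Left for TPUs and other servers: ${other_low / 1e9:.1f}-{other_high / 1e9:.1f}bn "
      f"(midpoint ~${midpoint / 1e9:.1f}bn)")
```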

^
Taking the 7m 2025 GPU production numbers from above, assuming 850k of the 5m Blackwell chips go to Microsoft in 2025 (as they will begin receiving them in 2024 and that is in their 2024 estimate already), and assuming Nvidia revenue is 90% Data Center and Blackwell costs 60-70% more than Hopper per Nvidia Q3 2025 earnings.
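The same footnote's revenue-share arithmetic as a sketch; the 850k Microsoft allocation, the ~65% Blackwell price premium, and the 90% Data Center share are the assumptions stated above:

```python
# Footnote 10 arithmetic as a sketch: Microsoft's implied share of Nvidia's
# 2025 revenue. All inputs are this post's assumptions, not Nvidia guidance.
hopper_units, blackwell_units = 2_000_000, 5_000_000  # assumed 2025 production mix
msft_blackwell = 850_000        # Blackwell chips assumed to go to Microsoft in 2025
blackwell_premium = 1.65        # Blackwell assumed to cost ~60-70% more than Hopper
dc_share = 0.90                 # Data Center assumed to be 90% of Nvidia revenue

hopper_price = 1.0              # arbitrary unit; only ratios matter here
dc_revenue = hopper_units * hopper_price + blackwell_units * hopper_price * blackwell_premium
total_revenue = dc_revenue / dc_share
msft_revenue = msft_blackwell * hopper_price * blackwell_premium

print(f"Implied Microsoft share of Nvidia 2025 revenue: {msft_revenue / total_revenue:.0%}")
# -> ~12%, the figure cited in the main text.
```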

^
Note that the ranges in these estimates are not confidence intervals, but rather ranges in which I think a plausible best guess based on the evidence I looked at might land. I have not attempted to construct confidence intervals here.

^
“Today, more than 60 percent of funded gen AI start-ups and nearly 90 percent of gen AI unicorns are Google Cloud customers. “ said Google CEO Sundar Pichai on their Q1 2024 earnings call

 

About Us

Beijing Hansen Fluid Technology Co., Ltd. (北京汉深流体技术有限公司) is a contracted distributor of Danfoss data-center liquid-cooling products in China. Products include the FD83 full-flow double-interlock liquid-cooling quick coupling; the UQD & UQDB universal liquid-cooling quick disconnects; the OCP ORV3 blind-mate quick coupling BMQC; EHW194 EPDM liquid-cooling hose; solenoid valves; and pressure and temperature sensors. Positioned at the intersection of China's digital-economy, East-Data-West-Computing, dual-carbon, and new-infrastructure strategies, the company focuses on building a highly qualified and experienced team of liquid-cooling engineers to provide customers with excellent engineering design and strong customer service.

Our product range covers Danfoss liquid-cooling fluid connectors, EPDM hoses, solenoid valves, pressure and temperature sensors, and manifolds.
Looking ahead, the company plans to become a data-center liquid-cooling infrastructure solutions provider with in-house R&D, design, and manufacturing capability for coolant distribution units (CDU), secondary fluid networks (SFN), and manifolds.

- For rack-server applications such as manifolds/nodes and CDUs/primary loops, we offer manual and fully automatic quick connectors in a range of bore sizes and locking styles.
- For high-availability, high-density blade racks, we offer blind-mate connectors with float that automatically corrects misalignment, enabling precise mating in confined spaces.
- UQD & UQDB universal liquid-cooling quick disconnects and the OCP ORV3 blind-mate quick coupling BMQC, built to the OCP standard and supported for high-volume delivery worldwide.

 

Beijing Hansen Fluid Technology Co., Ltd. (Hansen Fluid)
Danfoss Authorized Distributor in China

Address: Room 2115, Tower 1C, Wangjing SOHO, 10 Wangjing Street, Chaoyang District, Beijing
Postcode: 100102
Tel: 010-8428 2935, 8428 3983, 13910962635
Mobile: 15801532751, 17310484595, 13910122694, 13011089770, 15313809303
Http://www.hansenfluid.com
E-mail: sales@cnmec.biz

Fax: 010-8428 8762

Beijing ICP Filing No. 2023024665 (京ICP备2023024665号)
Beijing Public Security Filing No. 11010502019740 (京公网安备 11010502019740)

Since 2007 Strong Distribution & Powerful Partnerships