Leopold Aschenbrenner
# SITUATIONAL AWARENESS <br> The Decade Ahead
JUNE 2024
## Dedicated to Ilya Sutskever.
While I used to work at OpenAI, all of this is based on publicly available information, my own ideas, general field-knowledge, or SF-gossip.
Thank you to Collin Burns, Avital Balwit, Carl Shulman, Jan Leike, Ilya Sutskever, Holden Karnofsky, Sholto Douglas, James Bradbury, Dwarkesh Patel, and many others for formative discussions. Thank you to many friends for feedback on earlier drafts. Thank you to Joe Ronan for help with graphics, and Nick Whitaker for publishing help.
SITUATIONAL-AWARENESS.AI
Updated June 6, 2024
San Francisco, California
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from \$10 billion compute clusters to \$100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there's a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the willful blindness of "it's just predicting the next word". They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy-but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people-the smartest people I have ever met-and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
## Contents
Introduction
History is live in San Francisco.
I. From GPT-4 to AGI: Counting the OOMs
AGI by 2027 is strikingly plausible. GPT-2 to GPT-4 took us from preschooler to $\sim$ smart high-schooler abilities in 4 years. Tracing trendlines in compute ($\sim 0.5$ orders of magnitude or OOMs/year), algorithmic efficiencies ($\sim 0.5$ OOMs/year), and "unhobbling" gains (from chatbot to agent), we should expect another preschooler-to-high-schooler-sized qualitative jump by 2027.
II. From AGI to Superintelligence: the Intelligence Explosion
AI progress won't stop at human-level. Hundreds of millions of AGIs could automate AI research, compressing a decade of algorithmic progress $(5+$ OOMs) into 1 year. We would rapidly go from human-level to vastly superhuman AI systems. The power-and the peril-of superintelligence would be dramatic.
III. The Challenges
IIIa. Racing to the Trillion-Dollar Cluster
The most extraordinary techno-capital acceleration has been set in motion. As AI revenue grows rapidly, many trillions of dollars will go into GPU, datacenter, and power buildout before the end of the decade. The industrial mobilization, including growing US electricity production by 10s of percent, will be intense.
IIIb. Lock Down the Labs: Security for AGI
The nation's leading AI labs treat security as an afterthought. Currently, they're basically handing the key secrets for AGI to the CCP on a silver platter. Securing the AGI secrets and weights against the state-actor threat will be an immense effort, and we're not on track.
IIIc. Superalignment
Reliably controlling AI systems much smarter than we are is an unsolved technical problem. And while it is a solvable problem, things could very easily go off the rails during a rapid intelligence explosion. Managing this will be extremely tense; failure could easily be catastrophic.
IIId. The Free World Must Prevail
Superintelligence will give a decisive economic and military advantage. China isn't at all out of the game yet. In the race to AGI, the free world's very survival will be at stake. Can we maintain our preeminence over the authoritarian powers? And will we manage to avoid self-destruction along the way?
IV. The Project
As the race to AGI intensifies, the national security state will get involved. The USG will wake from its slumber, and by 27/28 we'll get some form of government AGI project. No startup can handle superintelligence. Somewhere in a SCIF, the endgame will be on.
V. Parting Thoughts
What if we're right?
Appendix
# I. From GPT-4 to AGI: Counting the OOMs
AGI by 2027 is strikingly plausible. GPT-2 to GPT-4 took us from $\sim$ preschooler to $\sim$ smart high-schooler abilities in 4 years. Tracing trendlines in compute ($\sim 0.5$ orders of magnitude or OOMs/year), algorithmic efficiencies ($\sim 0.5$ OOMs/year), and "unhobbling" gains (from chatbot to agent), we should expect another preschooler-to-high-schooler-sized qualitative jump by 2027.
Look. The models, they just want to learn. You have to understand this. The models, they just want to learn.
ILYA SUTSKEVER
(circa 2015, via Dario Amodei)
GPT-4's capabilities came as a shock to many: an AI system that could write code and essays, could reason through difficult math problems, and ace college exams. A few years ago, most thought these were impenetrable walls.
But GPT-4 was merely the continuation of a decade of breakneck progress in deep learning. A decade earlier, models could barely identify simple images of cats and dogs; four years earlier, GPT-2 could barely string together semi-plausible sentences. Now we are rapidly saturating all the benchmarks we can come up with. And yet this dramatic progress has merely been the result of consistent trends in scaling up deep learning.
There have been people who have seen this for far longer. They were scoffed at, but all they did was trust the trendlines. The
trendlines are intense, and they were right. The models, they just want to learn; you scale them up, and they learn more.
I make the following claim: it is strikingly plausible that by 2027, models will be able to do the work of an AI researcher/engineer. That doesn't require believing in sci-fi; it just requires believing in straight lines on a graph.
## Base Scaleup of Effective Compute
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-008.jpg?height=1046&width=1521&top_left_y=838&top_left_x=340)
In this piece, I will simply "count the OOMs" (OOM = order of magnitude, $10 \times=1$ order of magnitude): look at the trends in 1) compute, 2) algorithmic efficiencies (algorithmic progress that we can think of as growing "effective compute"), and 3) "unhobbling" gains (fixing obvious ways in which models are hobbled by default, unlocking latent capabilities and giving them tools, leading to step-changes in usefulness). We trace
Figure 1: Rough estimates of past and future scaleup of effective compute (both physical compute and algorithmic efficiencies), based on the public estimates discussed in this piece. As we scale models, they consistently get smarter, and by "counting the OOMs" we get a rough sense of what model intelligence we should expect in the (near) future. (This graph shows only the scaleup in base models; "unhobblings" are not pictured.)
the growth in each over four years before GPT-4, and what we should expect in the four years after, through the end of 2027 . Given deep learning's consistent improvements for every OOM of effective compute, we can use this to project future progress.
Publicly, things have been quiet for a year since the GPT-4 release, as the next generation of models has been in the oven, leading some to proclaim stagnation and that deep learning is hitting a wall. ${ }^{1}$ But by counting the OOMs, we get a peek at what we should actually expect.
The upshot is pretty simple. GPT-2 to GPT-4-from models that were impressive for sometimes managing to string together a few coherent sentences, to models that ace high-school exams-was not a one-time gain. We are racing through the OOMs extremely rapidly, and the numbers indicate we should expect another $\sim 100,000 x$ effective compute scaleup-resulting in another GPT-2-to-GPT-4-sized qualitative jump-over four years. Moreover, and critically, that doesn't just mean a better chatbot; picking the many obvious low-hanging fruit on "unhobbling" gains should take us from chatbots to agents, from a tool to something that looks more like drop-in remote worker replacements.
While the inference is simple, the implication is striking. Another jump like that very well could take us to AGI, to models as smart as PhDs or experts that can work beside us as coworkers. Perhaps most importantly, if these AI systems could automate AI research itself, that would set in motion intense feedback loops-the topic of the next piece in the series.
Even now, barely anyone is pricing all this in. But situational awareness on AI isn't actually that hard, once you step back and look at the trends. If you keep being surprised by AI capabilities, just start counting the OOMs.
## The last four years
We have machines now that we can basically talk to like humans. It's a remarkable testament to the human capacity to adjust that this seems normal, that we've become inured to the pace of progress. But it's worth stepping back and looking at the progress of just the last few years.
## GPT-2 to GPT-4
Let me remind you of how far we came in just the $\sim 4$ (!) years leading up to GPT-4.
GPT-2 (2019) preschooler: "Wow, it can string together a few plausible sentences." A very-cherry-picked example of a semi-coherent story about unicorns in the Andes it generated was incredibly impressive at the time. And yet GPT-2 could barely count to 5 without getting tripped up;${ }^{2}$ when summarizing an article, it just barely outperformed selecting 3 random sentences from the article. ${ }^{3}$
${ }^{3}$ From the GPT-2 paper, Section 3.6.
GPT-2 examples people found very impressive at the time
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-010.jpg?height=572&width=1613&top_left_y=1584&top_left_x=278)
Comparing AI capabilities with human intelligence is difficult and flawed, but I think it's informative to consider the analogy
Figure 2: Some examples of what people found impressive about GPT-2 at the time. Left: GPT-2 does an ok job on extremely basic reading comprehension questions. Right: In a cherry-picked sample (best of 10 tries), GPT-2 can write a semi-coherent paragraph that says some semi-relevant things about the Civil War.
here, even if it's highly imperfect. GPT-2 was shocking for its command of language, and its ability to occasionally generate a semi-cohesive paragraph, or occasionally answer simple factual questions correctly. It's what would have been impressive for a preschooler.
GPT-3 (2020)${ }^{4}$ elementary schooler: "Wow, with just some few-shot examples it can do some simple useful tasks." It started being cohesive over even multiple paragraphs much more consistently, and could correct grammar and do some very basic arithmetic. For the first time, it was also commercially useful in a few narrow ways: for example, GPT-3 could generate simple copy for SEO and marketing.
${ }^{4}$ I mean clunky old GPT-3 here, not the dramatically-improved GPT-3.5 you might know from ChatGPT.
GPT-3 examples people found very impressive at the time
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-011.jpg?height=791&width=1507&top_left_y=1133&top_left_x=320)
Again, the comparison is imperfect, but what impressed people about GPT-3 is perhaps what would have been impressive for an elementary schooler: it wrote some basic poetry, could tell richer and coherent stories, could start to do rudimentary
Figure 3: Some examples of what people found impressive about GPT-3 at the time. Top: After a simple instruction, GPT-3 can use a made-up word in a new sentence. Bottom-left: GPT-3 can engage in rich storytelling back-and-forth. Bottom-right: GPT-3 can generate some very simple code.
coding, could fairly reliably learn from simple instructions and demonstrations, and so on.
GPT-4 (2023) smart high schooler: "Wow, it can write pretty sophisticated code and iteratively debug, it can write intelligently and sophisticatedly about complicated subjects, it can reason through difficult high-school competition math, it's beating the vast majority of high schoolers on whatever tests we can give it, etc." From code to math to Fermi estimates, it can think and reason. GPT-4 is now useful in my daily tasks, from helping write code to revising drafts.
GPT-4 examples people found very impressive at the time
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-013.jpg?height=1041&width=735&top_left_y=363&top_left_x=714)
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-013.jpg?height=580&width=637&top_left_y=1453&top_left_x=430)
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-013.jpg?height=669&width=594&top_left_y=1449&top_left_x=1123)
Figure 4: Some of what people found impressive about GPT-4 when it was released, from the "Sparks of AGI" paper. Top: It's writing very complicated code (producing the plots shown in the middle) and can reason through nontrivial math problems. Bottom-left: Solving an AP math problem. Bottom-right: Solving a fairly complex coding problem. More interesting excerpts from that exploration of GPT-4's capabilities here.
On everything from AP exams to the SAT, GPT-4 scores better than the vast majority of high schoolers.
Of course, even GPT-4 is still somewhat uneven; for some tasks it's much better than smart high-schoolers, while there are other tasks it can't yet do. That said, I tend to think most of these limitations come down to obvious ways models are still hobbled, as I'll discuss in-depth later. The raw intelligence is (mostly) there, even if the models are still artificially constrained; it'll take extra work to unlock models being able to fully apply that raw intelligence across applications.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-014.jpg?height=217&width=1463&top_left_y=1030&top_left_x=342)
The trends in deep learning
The pace of deep learning progress in the last decade has simply been extraordinary. A mere decade ago it was revolutionary for a deep learning system to identify simple images. Today, we keep trying to come up with novel, ever harder tests, and yet each new benchmark is quickly cracked. It used to take decades to crack widely-used benchmarks; now it feels like mere months.
We're literally running out of benchmarks. As an anecdote, my friends Dan and Collin made a benchmark called MMLU a few years ago, in 2020. They hoped to finally make a benchmark that would stand the test of time, equivalent to all the hardest exams we give high school and college students. Just three years later, it's basically solved: models like GPT-4 and Gemini
get $\sim 90 \%$.

Test scores of AI systems on various capabilities relative to human performance. Within each domain, the initial performance of the AI is set to -100; human performance is used as a baseline, set to zero. When the AI's performance crosses the zero line, it scored more points than humans.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-015.jpg?height=548&width=1069&top_left_y=393&top_left_x=254)
Data source: Kiela et al. (2023), via OurWorldInData.org/artificial-intelligence (CC BY). Note: For each capability, the first year always shows a baseline of -100, even if better performance was recorded later that year.
More broadly, GPT-4 mostly cracks all the standard high school and college aptitude tests (Figure 7). ${ }^{5}$
Or consider the MATH benchmark, a set of difficult mathematics problems from high-school math competitions. ${ }^{6}$ When the benchmark was released in 2021, GPT-3 only got $\sim 5 \%$ of problems right. And the original paper noted: "Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue [...]. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community"-we would need fundamental new breakthroughs to solve MATH, or so they thought. A survey of ML researchers predicted minimal progress over the coming years (Figure 8);${ }^{7}$ and yet within just a year (by mid-2022), the best models went from $\sim 5 \%$ to $50 \%$ accuracy; now, MATH is basically solved, with recent performance over $90 \%$.
Figure 6: Deep learning systems are rapidly reaching or exceeding human-level in many domains. Graphic: Our World in Data.
## Performance on common exams (percentile compared to human test-takers)
| Exam | GPT-4 (2023) | GPT-3.5 (2022) |
| :--- | :--- | :--- |
| Uniform Bar Exam | 90th | 10th |
| LSAT | 88th | 40th |
| SAT | 97th | 87th |
| GRE (Verbal) | 99th | 63rd |
| GRE (Quantitative) | 80th | 25th |
| US Biology Olympiad | 99th | 32nd |
| AP Calculus BC | 51st | 3rd |
| AP Chemistry | 80th | 34th |
| AP Macroeconomics | 92nd | 40th |
| AP Statistics | 92nd | 51st |
Figure 8: Survey of ML researchers: "On June 30, 2022, what will be the state-of-the-art accuracy of a machine-learning model on the MATH Dataset?"
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-016.jpg?height=138&width=808&top_left_y=2056&top_left_x=363)
Figure 7: GPT-4 scores on standardized tests. Note also the large jump from GPT-3.5 to GPT-4 in human percentile on these tests, often from well below the median human to the very top of the human range. (And this is GPT-3.5, a fairly recent model released less than a year before GPT-4, not the clunky old elementary-school-level GPT-3 we were talking about earlier!)
Over and over again, year after year, skeptics have claimed "deep learning won't be able to do X" and have been quickly proven wrong. ${ }^{8}$ If there's one lesson we've learned from the past decade of AI, it's that you should never bet against deep learning.
Now the hardest unsolved benchmarks are tests like GPQA, a set of PhD-level biology, chemistry, and physics questions. Many of the questions read like gibberish to me, and even PhDs in other scientific fields spending $30+$ minutes with Google barely score above random chance. Claude 3 Opus currently gets $\sim 60 \%$,${ }^{9}$ compared to in-domain PhDs who get $\sim 80 \%$-and I expect this benchmark to fall as well, in the next generation or two.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-018.jpg?height=1791&width=1235&top_left_y=384&top_left_x=445)
Table 1: Six example questions from the dataset, two each from subdomains of chemistry, biology, and physics (respectively).
Figure 9: Example GPQA questions. Models are already better at this than I am, and we'll probably crack expert-PhD-level soon...
## Counting the OOMs
How did this happen? The magic of deep learning is that it just works-and the trendlines have been astonishingly consistent, despite naysayers at every turn.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-019.jpg?height=540&width=529&top_left_y=565&top_left_x=256)
Base compute
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-019.jpg?height=545&width=547&top_left_y=562&top_left_x=800)
$4 x$ compute
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-019.jpg?height=540&width=531&top_left_y=565&top_left_x=1361)
$32 x$ compute
With each OOM of effective compute, models predictably, reliably get better. ${ }^{10}$ If we can count the OOMs, we can (roughly, qualitatively) extrapolate capability improvements. ${ }^{11}$ That's how a few prescient individuals saw GPT-4 coming.
We can decompose the progress in the four years from GPT-2 to GPT-4 into three categories of scaleups:
1. Compute: We're using much bigger computers to train these models.
2. Algorithmic efficiencies: There's a continuous trend of algorithmic progress. Many of these act as "compute multipliers," and we can put them on a unified scale of growing effective compute.
3. "Unhobbling" gains: By default, models learn a lot of amazing raw capabilities, but they are hobbled in all sorts of dumb ways, limiting their practical value. With simple algorithmic improvements like reinforcement learning from human feedback (RLHF), chain-of-thought (CoT), tools, and scaffolding, we can unlock significant latent capabilities.
Figure 10: The effects of scaling compute, in the example of OpenAI Sora.
${ }^{10}$ And it's worth noting just how consistent these trendlines are. Combining the original scaling laws paper with some of the estimates on compute and compute efficiency scaling since then implies a consistent scaling trend for over 15 orders of magnitude (over 1,000,000,000,000,000x in effective compute)!
${ }^{11}$ A common misconception is that scaling only holds for perplexity loss, but we see very clear and consistent scaling behavior on downstream performance on benchmarks as well. It's usually just a matter of finding the right log-log graph. For example, in the GPT-4 blog post, they show consistent scaling behavior for performance on coding problems over 6 OOMs (1,000,000x) of compute, using MLPR (mean log pass rate).
The "Are Emergent Abilities a Mirage?" paper makes a similar point; with the right choice of metric, there is almost always a consistent trend for performance on downstream tasks.
More generally, the "scaling hypothesis" qualitative observation-very clear trends on model capability with scale-predates loss-scaling-curves; the "scaling laws" work was just a formal measurement of this.
We can "count the OOMs" of improvement along these axes: that is, trace the scaleup for each in units of effective compute. 3x is 0.5 OOMs; 10x is 1 OOM; 30x is 1.5 OOMs; 100x is 2 OOMs; and so on. We can also look at what we should expect on top of GPT-4, from 2023 to 2027.
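(To keep the bookkeeping concrete, here's a minimal sketch of that multiplier-to-OOM conversion; plain Python, purely illustrative.)

```python
import math

def ooms(multiplier: float) -> float:
    """Convert a compute multiplier into orders of magnitude (OOMs)."""
    return math.log10(multiplier)

# The conversions quoted above:
for m in [3, 10, 30, 100]:
    print(f"{m}x ~= {ooms(m):.1f} OOMs")   # 0.5, 1.0, 1.5, 2.0
```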
I'll go through each one-by-one, but the upshot is clear: we are rapidly racing through the OOMs. There are potential headwinds in the data wall, which I'll address-but overall, it seems likely that we should expect another GPT-2-to-GPT-4-sized jump, on top of GPT-4, by 2027.
## Compute
I'll start with the most commonly-discussed driver of recent progress: throwing (a lot) more compute at models.
Many people assume that this is simply due to Moore's Law. But even in the old days when Moore's Law was in its heyday, it was comparatively glacial-perhaps 1-1.5 OOMs per decade. We are seeing much more rapid scaleups in compute-close to $5 x$ the speed of Moore's law-instead because of mammoth investment. (Spending even a million dollars on a single model used to be an outrageous thought nobody would entertain, and now that's pocket change!)
| Model | Estimated Compute | Growth |
| :--- | :--- | :--- |
| GPT-2 (2019) | $\sim$ 4e21 FLOP | |
| GPT-3 (2020) | $\sim$ 3e23 FLOP | $+\sim 2$ OOMs |
| GPT-4 (2023) | 8e24 to 4e25 FLOP | $+\sim 1.5-2$ OOMs |
We can use public estimates from Epoch AI (a source widely respected for its excellent analysis of AI trends) to trace the compute scaleup from 2019 to 2023. GPT-2 to GPT-3 was a quick scaleup; there was a large overhang of compute, scaling from a smaller experiment to using an entire datacenter to train a large language model. With the scaleup from GPT-3 to GPT-4, we transitioned to the modern regime: having to build an entirely new (much bigger) cluster for the next model. And yet
the dramatic growth continued. Overall, Epoch AI estimates suggest that GPT-4 training used $\sim 3,000 x-10,000 x$ more raw compute than GPT-2.
In broad strokes, this is just the continuation of a longer-running trend. For the last decade and a half, primarily because of broad scaleups in investment (and specializing chips for AI workloads in the form of GPUs and TPUs), the training compute used for frontier AI systems has grown at roughly $\sim 0.5$ OOMs/year.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-021.jpg?height=651&width=1032&top_left_y=805&top_left_x=270)
The compute scaleup from GPT-2 to GPT-3 in a year was an unusual overhang, but all the indications are that the longer-run trend will continue. The SF-rumor-mill is abuzz with dramatic tales of huge GPU orders. The investments involved will be extraordinary-but they are in motion. I go into this more later in the series, in IIIa. Racing to the Trillion-Dollar Cluster; based on that analysis, an additional 2 OOMs of compute (a cluster in the \$10s of billions) seems very likely to happen by the end of 2027; even a cluster closer to +3 OOMs of compute (\$100 billion+) seems plausible (and is rumored to be in the works at Microsoft/OpenAI).
Figure 11: Training compute of notable deep learning models over time. Source: Epoch AI.
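(As a rough sanity check on these numbers, here is a small sketch that replays the arithmetic above. The FLOP figures are the public Epoch-style estimates quoted in the table, not official numbers, and the midpoint used for GPT-4 is my own illustrative choice.)

```python
import math

# Rough public training-compute estimates from the table above, in FLOP:
gpt2_flop = 4e21
gpt4_flop = 2e25          # illustrative midpoint of the 8e24-4e25 range
print(f"GPT-2 -> GPT-4: ~{math.log10(gpt4_flop / gpt2_flop):.1f} OOMs of raw compute")  # ~3.7 OOMs

# Extrapolating the longer-run ~0.5 OOMs/year trend out four more years, to end of 2027:
projected_flop = gpt4_flop * 10 ** (0.5 * 4)
print(f"End of 2027: ~{projected_flop:.0e} FLOP (i.e. +2 OOMs over GPT-4)")
```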
## Algorithmic efficiencies
While massive investments in compute get all the attention, algorithmic progress is probably a similarly important driver of progress (and has been dramatically underrated).
To see just how big of a deal algorithmic progress can be, consider the following illustration (Figure 12) of the drop in price to attain $\sim 50 \%$ accuracy on the MATH benchmark (high school competition math) over just two years. (For comparison, a computer science PhD student who didn't particularly like math scored $40 \%$, so this is already quite good.) Inference efficiency improved by nearly 3 OOMs (1,000x) in less than two years.
Relative (inference) cost of $\sim 50 \%$ performance on the MATH benchmark
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-022.jpg?height=729&width=1065&top_left_y=1072&top_left_x=256)
Though these are numbers just for inference efficiency (which may or may not correspond to training efficiency improvements, where numbers are harder to infer from public data), they make clear there is an enormous amount of algorithmic progress possible and happening.
Figure 12: Rough estimate on relative inference cost of attaining $\sim 50 \%$ MATH performance.
Calculations below.
Gemini 1.5 Flash scores $54.9 \%$ on MATH, and costs $\$ 0.35 / \$ 1.05$ (input/output) per million tokens. GPT-4 scored $42.5 \%$ on MATH pre-release and $52.9 \%$ on MATH in early 2023, and cost $\$ 30 / \$ 60$ (input/output) per million tokens; that's $85 x / 57 x$ (input/output) more expensive per token than Gemini 1.5 Flash. To be conservative, I use an estimate of $30 x$ cost decrease above (accounting for Gemini 1.5 Flash possibly using more tokens to reason through problems).
Minerva 540B scores $50.3 \%$ on MATH, using majority voting among 64 samples. A knowledgeable friend estimates the base model here is probably 2-3x more expensive to inference than GPT-4. However, Minerva seems to use somewhat fewer tokens per answer on a quick spot check. More importantly, Minerva needed 64 samples to achieve that performance, naively implying a 64x multiple on cost if you e.g. naively ran this via an inference API. In practice, prompt tokens can be cached when running an eval; given a few-shot prompt, prompt tokens are likely a majority of the cost, even accounting for output tokens. Supposing output tokens are a third of the cost for getting a single sample, that would imply only a $\sim 20 x$ increase in cost from the maj@64 with caching. To be conservative, I use the rough number of a $20 x$ cost decrease in the above (even if the naive decrease in inference cost from running this via an API would be larger).
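(A minimal sketch of that cost arithmetic, using only the prices and conservative factors quoted in the note above; purely illustrative.)

```python
# Per-million-token API prices quoted above (input, output), in USD:
gpt4_release   = (30.00, 60.00)
gemini15_flash = (0.35, 1.05)

print(gpt4_release[0] / gemini15_flash[0])   # ~85.7x more expensive on input tokens
print(gpt4_release[1] / gemini15_flash[1])   # ~57.1x more expensive on output tokens

# With the deliberately conservative factors chosen above (~20x Minerva -> GPT-4,
# ~30x GPT-4 -> Gemini 1.5 Flash), the overall drop in inference cost to reach
# ~50% on MATH is roughly:
print(20 * 30)   # ~600x, i.e. nearly 3 OOMs
```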
In this piece, I'll separate out two kinds of algorithmic progress. Here, I'll start by covering "within-paradigm" algorithmic improvements-those that simply result in better base models, and that straightforwardly act as compute efficiencies or compute multipliers. For example, a better algorithm might allow us to achieve the same performance but with 10x less training compute. In turn, that would act as a 10x (1 OOM) increase in effective compute. (Later, I'll cover "unhobbling," which you can think of as "paradigm-expanding/application-expanding" algorithmic progress that unlocks capabilities of base models.)
If we step back and look at the long-term trends, we seem to find new algorithmic improvements at a fairly consistent rate. Individual discoveries seem random, and at every turn, there seem insurmountable obstacles-but the long-run trendline is predictable, a straight line on a graph. Trust the trendline.
We have the best data for ImageNet (where algorithmic research has been mostly public and we have data stretching back a decade), for which we have consistently improved compute efficiency by roughly $\sim 0.5$ OOMs/year across the 9-year period between 2012 and 2021.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-023.jpg?height=382&width=539&top_left_y=1454&top_left_x=259)
(a) Pareto frontiers in data and compute for AlexNet performance
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-023.jpg?height=380&width=523&top_left_y=1458&top_left_x=817)
(b) Pareto frontiers in data and compute for ResNeXt-101 performance
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-023.jpg?height=376&width=526&top_left_y=1460&top_left_x=1358)
(c) Pareto frontiers in data and compute for ViT-e performance
That's a huge deal: that means 4 years later, we can achieve the same level of performance for $\sim 100 x$ less compute (and concomitantly, much higher performance for the same compute!).
Unfortunately, since labs don't publish internal data on this, it's harder to measure algorithmic progress for frontier LLMs over the last four years. Epoch AI has new work replicating their results on ImageNet for language modeling, and estimates a similar $\sim 0.5$ OOMs/year of algorithmic efficiency trend in LLMs from 2012 to 2023. (This has wider error bars though, and doesn't capture some more recent gains, since the leading labs have stopped publishing their algorithmic efficiencies.)

Figure 13: We can measure algorithmic progress: how much less compute is needed in 2021 compared to 2012 to train a model with the same performance? We see a trend of $\sim 0.5$ OOMs/year of algorithmic efficiency. Source: Erdil and Besiroglu 2022.
Efficiency doubles roughly every 8 months
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-024.jpg?height=697&width=1103&top_left_y=649&top_left_x=240)
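(The chart's headline "doubles roughly every 8 months" and the $\sim 0.5$ OOMs/year figure used throughout this piece are the same trend in different units; a quick, purely illustrative check.)

```python
import math

def ooms_per_year(doubling_time_months: float) -> float:
    """Convert an efficiency doubling time into OOMs of efficiency gain per year."""
    return (12 / doubling_time_months) * math.log10(2)

print(ooms_per_year(8))       # ~0.45 OOMs/year, i.e. roughly the ~0.5 OOMs/year trend
print(ooms_per_year(8) * 8)   # ~3.6 OOMs over 8 years, close to the ~4 OOMs in Figure 14
```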
More directly looking at the last 4 years, GPT-2 to GPT-3 was basically a simple scaleup (according to the paper), but there have been many publicly-known and publicly-inferable gains since GPT-3:
- We can infer gains from API costs: ${ }^{12}$
- GPT-4, on release, cost the same as GPT-3 when it was released, despite the absolutely enormous performance increase. ${ }^{13}$ (If we do a naive and oversimplified back-of-the-envelope estimate based on scaling laws, this suggests that perhaps roughly half the effective compute increase from GPT-3 to GPT-4 came from algorithmic improvements. ${ }^{14}$ )
- Since the GPT-4 release a year ago, OpenAI prices for GPT-4-level models have fallen another 6x/4x (input/output) with the release of GPT-4o.
Figure 14: Estimates by Epoch AI of algorithmic efficiencies in language modeling. Their estimates suggest we've made $\sim 4$ OOMs of efficiency gains in 8 years.
${ }^{12}$ Though these are inference efficiencies (rather than necessarily training efficiencies), and to some extent will reflect inference-specific optimizations, a) they suggest enormous amounts of algorithmic progress is possible and happening in general, and b) it's often the case that an algorithmic improvement is both a training efficiency gain and an inference efficiency gain, for example by reducing the number of parameters necessary.
${ }^{13}$ GPT-3: \$60/1M tokens; GPT-4: \$30/1M input tokens and \$60/1M output tokens.
${ }^{14}$ Chinchilla scaling laws say that one should scale parameter count and data equally. That is, parameter count grows "half the OOMs" of the OOMs that effective training compute grows. At the same time, parameter count is intuitively roughly proportional to inference costs. All else equal, constant inference costs thus imply that half of the OOMs of effective compute growth were "canceled out" by algorithmic wins.
That said, to be clear, this is a very naive calculation (just meant for a rough illustration) that is wrong in various ways. There may be inferencespecific optimizations (that don't translate into training efficiency); there may be training efficiencies that don't reduce parameter count (and thus don't translate into inference efficiency); and so on.
- Gemini 1.5 Flash, recently released, offers between "GPT-3.75-level" and GPT-4-level performance, ${ }^{15}$ while costing 85x/57x (input/output) less than the original GPT-4 (extraordinary gains!).
- Chinchilla scaling laws give a $3 x+$ (0.5 OOMs+) efficiency gain. ${ }^{16}$
- Gemini 1.5 Pro claimed major compute efficiency gains (outperforming Gemini 1.0 Ultra, while using "significantly less" compute), with Mixture of Experts (MoE) as a highlighted architecture change. Other papers also claim a substantial multiple on compute from MoE.
- There have been many tweaks and gains on architecture, data, training stack, etc. all the time. ${ }^{17}$
Put together, public information suggests that the GPT-2 to GPT-4 jump included 1-2 OOMs of algorithmic efficiency gains. ${ }^{18}$
## Decomposing drivers of progress
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-025.jpg?height=727&width=941&top_left_y=1428&top_left_x=359)
${ }^{15}$ Gemini 1.5 Flash ranks similarly to GPT-4 (higher than original GPT-4, lower than updated versions of GPT-4) on LMSys, a chatbot leaderboard, and has similar performance on MATH and GPQA (evals that measure reasoning) as the original GPT-4, while landing roughly in the middle between GPT-3.5 and GPT-4 on MMLU (an eval that more heavily weights towards measuring knowledge).
${ }^{16}$ At GPT-3 scale; more than $3 x$ at larger scales.
${ }^{17}$ For example, this paper contains a comparison of a GPT-3-style vanilla Transformer to various simple changes to architecture and training recipe published over the years (RMSNorm instead of LayerNorm, different positional embeddings, SwiGLU activation, AdamW optimizer instead of Adam, etc.), what they call "Transformer++", implying a 6x gain at least at small scale.
${ }^{18}$ If we take the trend of 0.5 OOMs/year, and 4 years between GPT-2 and GPT-4 release, that would be 2 OOMs. However, GPT-2 to GPT-3 was a simple scaleup (after big gains from e.g. Transformers), and OpenAI claims GPT-4 pretraining finished in 2022, which could mean we're looking at closer to 2 years worth of algorithmic progress that we should be counting here. 1 OOM of algorithmic efficiency seems like a conservative lower bound.
Figure 15: Decomposing progress: compute and algorithmic efficiencies. (Rough illustration.)
Over the 4 years following GPT-4, we should expect the trend to continue: ${ }^{19}$ on average $\sim 0.5$ OOMs/year of compute efficiency, i.e. $\sim 2$ OOMs of gains compared to GPT-4 by 2027. While compute efficiencies will become harder to find as we pick the low-hanging fruit, AI lab investments in money and talent to find new algorithmic improvements are growing rapidly. ${ }^{20}$ (The publicly-inferable inference cost efficiencies, at least, don't seem to have slowed down at all.) On the high end, we could even see more fundamental, Transformer-like ${ }^{21}$ breakthroughs with even bigger gains.
Put together, this suggests we should expect something like 1-3 OOMs of algorithmic efficiency gains (compared to GPT-4) by the end of 2027, maybe with a best guess of $\sim 2$ OOMs.
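(Stacking these rough projections up; all the inputs are the estimates from this piece, and the totals are just the sums of the endpoints. Unhobbling gains are not included here.)

```python
# Rough projections from this piece, in OOMs of effective compute over GPT-4 by end of 2027:
compute_ooms     = (2, 3)   # ~$10s-of-billions cluster to ~$100B+ cluster
algorithmic_ooms = (1, 3)   # best guess ~2

total_low  = compute_ooms[0] + algorithmic_ooms[0]
total_high = compute_ooms[1] + algorithmic_ooms[1]
print(total_low, total_high)   # 3 to 6 OOMs, bracketing the ~100,000x (~5 OOM) scaleup quoted earlier
```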
## The data wall
There is a potentially important source of variance for all of this: we're running out of internet data. That could mean that, very soon, the naive approach to pretraining larger language models on more scraped data could start hitting serious bottlenecks.
Frontier models are already trained on much of the internet. Llama 3, for example, was trained on over 15T tokens. Common Crawl, a dump of much of the internet used for LLM training, is >100T tokens raw, though much of that is spam and duplication (e.g., a relatively simple deduplication leads to 30T tokens, implying Llama 3 would already be using basically all the data). Moreover, for more specific domains like code, there are many fewer tokens still, e.g. public GitHub repos are estimated to be in the low trillions of tokens.
You can go somewhat further by repeating data, but academic work on this suggests that repetition only gets you so far, finding that after 16 epochs (a 16-fold repetition), returns diminish extremely fast to nil. At some point, even with more (effective) compute, making your models better can become much tougher because of the data constraint. This isn't to be understated: we've been riding the scaling curves, riding the wave of the language-modeling-pretraining paradigm, and without something new here, this paradigm will (at least naively) run out. Despite the massive investments, we'd plateau.

${ }^{19}$ At the very least, given over a decade of consistent algorithmic improvements, the burden of proof would be on those who would suggest it will all suddenly come to a halt!

${ }^{20}$ The economic returns to a 3x compute efficiency will be measured in the \$10s of billions or more, given cluster costs.

${ }^{21}$ Very roughly something like a $\sim 10x$ gain.
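(To see why the numbers bite, here is a rough, assumption-laden sketch using standard Chinchilla-style approximations: roughly 20 training tokens per parameter, and training cost of roughly 6 * params * tokens FLOP. These rules of thumb come from the scaling-laws literature, not from this piece.)

```python
def chinchilla_optimal_tokens(training_flop: float) -> float:
    """Compute-optimal token count under FLOP ~ 6*N*D with D ~ 20*N (rough rule of thumb)."""
    return (training_flop * 20 / 6) ** 0.5

for flop in (2e25, 2e27):   # roughly GPT-4-scale, and +2 OOMs beyond it
    print(f"{flop:.0e} FLOP -> ~{chinchilla_optimal_tokens(flop):.1e} compute-optimal tokens")
# ~8e12 (8T) tokens at GPT-4 scale, ~8e13 (80T) two OOMs later --
# already well past the ~30T deduplicated tokens of Common Crawl mentioned above.
```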
All of the labs are rumored to be making massive research bets on new algorithmic improvements or approaches to get around this. Researchers are purportedly trying many strategies, from synthetic data to self-play and RL approaches. Industry insiders seem to be very bullish: Dario Amodei (CEO of Anthropic) recently said on a podcast: "if you look at it very naively we're not that far from running out of data [...] My guess is that this will not be a blocker [...] There's just many different ways to do it." Of course, any research results on this are proprietary and not being published these days.
In addition to insider bullishness, I think there's a strong intuitive case for why it should be possible to find ways to train models with much better sample efficiency (algorithmic improvements that let them learn more from limited data). Consider how you or I would learn from a really dense math textbook:
- What a modern LLM does during training is, essentially, very very quickly skim the textbook, the words just flying by, not spending much brain power on it.
- Rather, when you or I read that math textbook, we read a couple pages slowly; then have an internal monologue about the material in our heads and talk about it with a few study-buddies; read another page or two; then try some practice problems, fail, try them again in a different way, get some feedback on those problems, try again until we get a problem right; and so on, until eventually the material "clicks."
- You or I also wouldn't learn much at all from a pass through a dense math textbook if all we could do was breeze through it like LLMs. ${ }^{22}$
- But perhaps, then, there are ways to incorporate aspects of how humans would digest a dense math textbook to let the models learn much more from limited data. In a simplified sense, this sort of thing-having an internal monologue about material, having a discussion with a study-buddy, trying and failing at problems until it clicks-is what many synthetic data/self-play/RL approaches are trying to do. ${ }^{23}$
The old state of the art of training models was simple and naive, but it worked, so nobody really tried hard to crack these approaches to sample efficiency. Now that it may become more of a constraint, we should expect all the labs to invest billions of dollars and their smartest minds into cracking it. A common pattern in deep learning is that it takes a lot of effort (and many failed projects) to get the details right, but eventually some version of the obvious and simple thing just works. Given how deep learning has managed to crash through every supposed wall over the last decade, my base case is that it will be similar here.
Moreover, it actually seems possible that cracking one of these algorithmic bets like synthetic data could dramatically improve models. Here's an intuition pump. Current frontier models like Llama 3 are trained on the internet, and the internet is mostly crap, like e-commerce or SEO or whatever. Many LLMs spend the vast majority of their training compute on this crap, rather than on really high-quality data (e.g. reasoning chains of people working through difficult science problems). Imagine if you could spend GPT-4-level compute on entirely extremely high-quality data-it could be a much, much more capable model.
A look back at AlphaGo-the first AI system that beat the world champions at Go, decades before it was thought possible-is useful here as well. ${ }^{24}$
- In step 1, AlphaGo was trained by imitation learning on expert human Go games. This gave it a foundation.

${ }^{23}$ One other way of thinking about it I find interesting: there is a "missing middle" between pretraining and in-context learning. In-context learning is incredible (and competitive with human sample efficiency). For example, the Gemini 1.5 Pro paper discusses giving the model instructional materials (a textbook, a dictionary) on Kalamang, a language spoken by fewer than 200 people and basically not present on the internet, in context-and the model learns to translate from English to Kalamang at human-level! In context, the model is able to learn from the textbook as well as a human could (and much better than it would learn from just chucking that one textbook into pretraining). When a human learns from a textbook, they're able to distill their short-term memory/learnings into long-term memory/long-term skills with practice; however, we don't have an equivalent way to distill in-context learning "back to the weights." Synthetic data/self-play/RL/etc. are trying to fix that: let the model learn by itself, then think about it and practice what it learned, distilling that learning back into the weights.
- In step 2, AlphaGo played millions of games against itself. This let it become superhuman at Go: remember the famous move 37 in the game against Lee Sedol, an extremely unusual but brilliant move a human would never have played.
Developing the equivalent of step 2 for LLMs is a key research problem for overcoming the data wall (and, moreover, will ultimately be the key to surpassing human-level intelligence).
All of this is to say that data constraints seem to inject large error bars either way into forecasting the coming years of AI progress. There's a very real chance things stall out (LLMs might still be as big of a deal as the internet, but we wouldn't get to truly crazy AGI). But I think it's reasonable to guess that the labs will crack it, and that doing so will not just keep the scaling curves going, but possibly enable huge gains in model capability.
As an aside, this also means that we should expect more variance between the different labs in coming years compared to today. Up until recently, the state of the art techniques were published, so everyone was basically doing the same thing. (And new upstarts or open source projects could easily compete with the frontier, since the recipe was published.) Now, key algorithmic ideas are becoming increasingly proprietary. I'd expect labs' approaches to diverge much more, and some to make faster progress than others-even a lab that seems on the frontier now could get stuck on the data wall while others make a breakthrough that lets them race ahead. And open source will have a much harder time competing. It will certainly make things interesting. (And if and when a lab figures it out, their breakthrough will be the key to AGI, key to superintelligence-one of the United States' most prized secrets.)
## Unhobbling
Finally, the hardest to quantify-but no less important-category of improvements: what I'll call "unhobbling."
Imagine if when asked to solve a hard math problem, you had to instantly answer with the very first thing that came to mind. It seems obvious that you would have a hard time, except for the simplest problems. But until recently, that's how we had LLMs solve math problems. Instead, most of us work through the problem step-by-step on a scratchpad, and are able to solve much more difficult problems that way. "Chain-of-thought" prompting unlocked that for LLMs. Despite excellent raw capabilities, they were much worse at math than they could be because they were hobbled in an obvious way, and it took a small algorithmic tweak to unlock much greater capabilities.
We've made huge strides in "unhobbling" models over the past few years. These are algorithmic improvements beyond just training better base models-and often only use a fraction of pretraining compute-that unleash model capabilities:
- Reinforcement learning from human feedback (RLHF). Base models have incredible latent capabilities, ${ }^{25}$ but they're raw and incredibly hard to work with. While the popular conception of RLHF is that it merely censors swear words, RLHF has been key to making models actually useful and commercially valuable (rather than making models predict random internet text, get them to actually apply their capabilities to try to answer your question!). This was the magic of ChatGPT—well-done RLHF made models usable and useful to real people for the first time. The original InstructGPT paper has a great quantification of this: an RLHF'd small model was equivalent to a non-RLHF'd >100x larger model in terms of human rater preference.
- Chain of Thought (CoT). As discussed, CoT started being widely used just 2 years ago and can provide the equivalent of a >10x effective compute increase on math/reasoning problems.
- Scaffolding. Think of CoT++: rather than just asking a model
to solve a problem, have one model make a plan of attack, have another propose a bunch of possible solutions, have another critique it, and so on. For example, on HumanEval (coding problems), simple scaffolding enables GPT-3.5 to outperform un-scaffolded GPT-4. On SWE-Bench (a benchmark of solving real-world software engineering tasks), GPT-4 can only solve $\sim 2 \%$ correctly, while with Devin's agent scaffolding it jumps to $14-23 \%$. (Unlocking agency is only in its infancy though, as I'll discuss more later.)
- Tools: Imagine if humans weren't allowed to use calculators or computers. We're only at the beginning here, but ChatGPT can now use a web browser, run some code, and so on.
- Context length. Models have gone from 2k token context (GPT-3) to 32k context (GPT-4 release) to 1M+ context (Gemini 1.5 Pro). This is a huge deal. A much smaller base model with, say, 100k tokens of relevant context can outperform a model that is much larger but only has, say, 4k relevant tokens of context-more context is effectively a large compute efficiency gain. ${ }^{26}$ More generally, context is key to unlocking many applications of these models: for example, many coding applications require understanding large parts of a codebase in order to usefully contribute new code; or, if you're using a model to help you write a document at work, it really needs the context from lots of related internal docs and conversations. Gemini 1.5 Pro, with its 1M+ token context, was even able to learn a new language (a low-resource language not on the internet) from scratch, just by putting a dictionary and grammar reference materials in context!
- Posttraining improvements. The current GPT-4 has substantially improved compared to the original GPT-4 when released, according to John Schulman due to posttraining improvements that unlocked latent model capability: on reasoning evals it's made substantial gains (e.g., $\sim 50 \%$ -> $72 \%$ on MATH, $\sim 40 \%$ to $\sim 50 \%$ on GPQA) and on the LMSys leaderboard, it's made nearly a 100-point elo jump (comparable to the difference in elo between Claude 3 Haiku and the much larger Claude 3 Opus, models that have a $\sim 50x$ price difference).

${ }^{26}$ See Figure 7 from the updated Gemini 1.5 whitepaper, comparing perplexity vs. context for Gemini 1.5 Pro and Gemini 1.5 Flash (a much cheaper and presumably smaller model).
A survey by Epoch AI of some of these techniques, like scaffolding, tool use, and so on, finds that techniques like this can typically result in effective compute gains of 5-30x on many benchmarks. METR (an organization that evaluates models) similarly found very large performance improvements on their set of agentic tasks, via unhobbling from the same GPT-4 base model: from $5 \%$ with just the base model, to $20 \%$ with the GPT-4 as posttrained on release, to nearly $40 \%$ today from better posttraining, tools, and agent scaffolding (Figure 16).
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-032.jpg?height=626&width=1073&top_left_y=755&top_left_x=252)
While it's hard to put these on a unified effective compute scale with compute and algorithmic efficiencies, it's clear these are huge gains, at least on a roughly similar magnitude as the compute scaleup and algorithmic efficiencies. (It also highlights the central role of algorithmic progress: the $\sim 0.5$ OOMs/year of compute efficiencies, already significant, are only part of the story, and put together with unhobbling algorithmic progress overall is maybe even a majority of the gains on the current trend.)
"Unhobbling" is a huge part of what actually enabled these models to become useful-and I'd argue that much of what is holding back many commercial applications today is the need for further "unhobbling" of this sort. Indeed, models today are still incredibly hobbled! For example:
- They don't have long-term memory.
Figure 16: Performance on METR's agentic tasks, over time via better unhobbling. Source: Model Evaluation and Threat Research
## Decomposing drivers of progress
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-033.jpg?height=699&width=940&top_left_y=393&top_left_x=343)
Figure 17: Decomposing progress: compute, algorithmic efficiencies, and unhobbling. (Rough illustration.)
- They can't use a computer (they still only have very limited tools).
- They still mostly don't think before they speak. When you ask ChatGPT to write an essay, that's like expecting a human to write an essay via their initial stream-of-consciousness. ${ }^{27}$
- They can (mostly) only engage in short back-and-forth dialogues, rather than going away for a day or a week, thinking about a problem, researching different approaches, consulting other humans, and then writing you a longer report or pull request.
- They're mostly not personalized to you or your application (just a generic chatbot with a short prompt, rather than having all the relevant background on your company and your work).
The possibilities here are enormous, and we're rapidly picking the low-hanging fruit. This is critical: it's completely wrong to just imagine "GPT-6 ChatGPT." With continued unhobbling
progress, the improvements will be step-changes compared to GPT-6 + RLHF. By 2027, rather than a chatbot, you're going to have something that looks more like an agent, like a coworker.
## From chatbot to agent-coworker
What could ambitious unhobbling over the coming years look like? The way I think about it, there are three key ingredients:
## 1. SOLVING THE “ONBOARDING PROBLEM"
GPT-4 has the raw smarts to do a decent chunk of many people's jobs, but it's sort of like a smart new hire that just showed up 5 minutes ago: it doesn't have any relevant context, hasn't read the company docs or Slack history or had conversations with members of the team, or spent any time understanding the company-internal codebase. A smart new hire isn't that useful 5 minutes after arriving-but they are quite useful a month in! It seems like it should be possible, for example via very-long-context, to "onboard" models like we would a new human coworker. This alone would be a huge unlock.
## 2. THE TEST-TIME COMPUTE OVERHANG (REASONING/ERROR CORRECTION / SYSTEM II FOR LONGER-HORIZON PROBLEMS)
Right now, models can basically only do short tasks: you ask them a question, and they give you an answer. But that's extremely limiting. Most useful cognitive work humans do is longer horizon-it doesn't just take 5 minutes, but hours, days, weeks, or months.
A scientist that could only think about a difficult problem for 5 minutes couldn't make any scientific breakthroughs. A software engineer that could only write skeleton code for a single function when asked wouldn't be very useful. Software engineers are given a larger task, and they then go make a plan, understand relevant parts of the codebase or technical tools, write different modules and test them incrementally, debug errors, search over the space of possible solutions, and eventually submit a large pull request that's the culmination of weeks of work. And so on.
In essence, there is a large test-time compute overhang. Think of each GPT-4 token as a word of internal monologue when you think about a problem. Each GPT-4 token is quite smart, but it can currently only really effectively use on the order of $\sim$ hundreds of tokens for chains of thought coherently (effectively as though you could only spend a few minutes of internal monologue/thinking on a problem or project).
What if it could use millions of tokens to think about and work on really hard problems or bigger projects?
| Number of tokens | Equivalent to me working on something for... | |
| :--- | :--- | :--- |
| 100s | A few minutes | ChatGPT (we are here) |
| 1,000s | Half an hour | +1 OOMs test-time compute |
| 10,000s | Half a workday | +2 OOMs |
| 100,000s | A workweek | +3 OOMs |
| Millions | Multiple months | +4 OOMs |
Table 2: Assuming a human thinking at $\sim 100$ tokens/minute and working 40 hours/week, translating "how long a model thinks" in tokens to human-time on a given problem/project.
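For concreteness, here is a minimal sketch (in Python) of the conversion assumed in Table 2, using the caption's rough numbers of ~100 tokens/minute and 40-hour workweeks; the thresholds and example token counts are illustrative only:

```python
# A minimal sketch of the token -> human-working-time conversion assumed in Table 2.
# Assumptions (from the caption): ~100 tokens/minute of "thinking", 40-hour workweeks.
TOKENS_PER_MINUTE = 100
MINUTES_PER_WORKWEEK = 40 * 60  # 2,400 working minutes per week

def human_time(tokens: float) -> str:
    """Translate a token budget into roughly-equivalent human working time."""
    minutes = tokens / TOKENS_PER_MINUTE
    if minutes < 60:
        return f"~{minutes:.0f} minutes"
    if minutes < MINUTES_PER_WORKWEEK:
        return f"~{minutes / 60:.1f} working hours"
    return f"~{minutes / MINUTES_PER_WORKWEEK:.1f} workweeks"

for tokens in [300, 3_000, 30_000, 300_000, 3_000_000]:
    print(f"{tokens:>9,} tokens ≈ {human_time(tokens)}")
# 300 tokens ≈ a few minutes (ChatGPT today); 3,000,000 tokens ≈ ~12 workweeks (+4 OOMs).
```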
Even if the "per-token" intelligence were the same, it'd be the difference between a smart person spending a few minutes vs. a few months on a problem. I don't know about you, but there's much, much, much more I am capable of in a few months vs. a few minutes. If we could unlock "being able to think and work on something for months-equivalent, rather than a few-minutes-equivalent" for models, it would unlock an insane jump in capability. There's a huge overhang here, many OOMs worth.
Right now, models can't do this yet. Even with recent advances in long-context, this longer context mostly only works for the consumption of tokens, not the production of tokens-after a while, the model goes off the rails or gets stuck. It's not yet able to go away for a while to work on a
problem or project on its own. ${ }^{28}$
But unlocking test-time compute might merely be a matter of relatively small "unhobbling" algorithmic wins. Perhaps a small amount of RL helps a model learn to error correct ("hm, that doesn't look right, let me double check that"), make plans, search over possible solutions, and so on. In a sense, the model already has most of the raw capabilities, it just needs to learn a few extra skills on top to put it all together.
In essence, we just need to teach the model a sort of System II outer loop ${ }^{29}$ that lets it reason through difficult, longhorizon projects.
If we succeed at teaching this outer loop, instead of a short chatbot answer of a couple paragraphs, imagine a stream of millions of words (coming in more quickly than you can read them) as the model thinks through problems, uses tools, tries different approaches, does research, revises its work, coordinates with others, and completes big projects on its own.
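As a toy illustration only (not a claim about how any lab actually implements this), an outer loop of this sort can be sketched in a few lines; here `llm` is a hypothetical stand-in for any text-generation call:

```python
from typing import Callable

def system2_outer_loop(task: str, llm: Callable[[str], str], max_steps: int = 8) -> str:
    """Toy 'System II' outer loop: draft, self-critique, and revise until satisfied.

    `llm` is a hypothetical stand-in for any text-generation call, not a real API.
    """
    draft = llm(f"Make a plan and a first attempt at this task:\n{task}")
    for _ in range(max_steps):
        critique = llm(f"Task: {task}\nAttempt:\n{draft}\n"
                       "List concrete errors or gaps; reply DONE if there are none.")
        if critique.strip().upper().startswith("DONE"):
            break  # the model judges its own work acceptable
        draft = llm(f"Task: {task}\nAttempt:\n{draft}\nCritique:\n{critique}\n"
                    "Produce a revised attempt that fixes the critique.")
    return draft
```

The point of the sketch is only that the loop itself is simple; the hard part is training models whose error correction and planning are good enough for it to keep making progress over long horizons.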
Trading off test-time and train-time compute in other ML domains. In other domains, like AI systems for board games, it's been demonstrated that you can use more test-time compute (also called inference-time compute) to substitute for training compute.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-036.jpg?height=488&width=765&top_left_y=1751&top_left_x=390)
(Original caption: Fig. 9. The trade-off between train-time compute and test-time compute. Each dotted line gives the minimum train/test compute required for a certain Elo on a $9 \times 9$ board.)
${ }^{28}$ Which makes sense-why would it have learned the skills for longer-horizon reasoning and error correction? There's very little data on the internet in the form of "my complete internal monologue, reasoning, all the relevant steps over the course of a month as I work on a project." Unlocking this capability will require a new kind of training, for it to learn these extra skills.
Or as Gwern put it (private correspondence): "'Brain the size of a galaxy, and what do they ask me to do? Predict the misspelled answers on benchmarks!' Marvin the depressed neural network moaned."
${ }^{29}$ System I vs. System II is a useful way of thinking about current capabilities of LLMs-including their limitations and dumb mistakes-and what might be possible with RL and unhobbling. Think of it this way: when you are driving, most of the time you are on autopilot (System I, what models mostly do right now). But when you encounter a complex construction zone or novel intersection, you might ask your passenger-seat companion to pause your conversation for a moment while you figure out-actually think about-what's going on and what to do. If you were forced to go about life with only System I (closer to models today), you'd have a lot of trouble. Creating the ability for System II reasoning loops is a central unlock.
Figure 18: Jones (2021): A smaller model can do as well as a much larger model at the game of Hex if you give it more test-time compute ("more time to think"). In this domain, they find that one can spend $\sim 1.2$ OOMs more compute at test-time to get performance equivalent to a model with $\sim 1$ OOM more training compute.
If a similar relationship held in our case, if we could unlock +4 OOMs of test-time compute, that might be equivalent to +3 OOMs of pretraining compute, i.e. very roughly something like the jump between GPT-3 and GPT-4. (Solving this "unhobbling" would be equivalent to a huge OOM scaleup.)
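As a hedged back-of-the-envelope, here is that extrapolation spelled out; the 1.2 : 1 exchange rate is taken from the Hex result above, and whether anything similar holds for language models is an open question:

```python
# Hypothetical extrapolation of the Jones (2021) exchange rate to LLMs; whether a
# similar train-time/test-time trade-off holds outside board games is unknown.
TEST_TIME_OOMS_PER_TRAIN_TIME_OOM = 1.2   # from the Hex result cited above

extra_test_time_ooms = 4   # e.g. hundreds of tokens -> millions of tokens of thinking
equivalent_pretraining_ooms = extra_test_time_ooms / TEST_TIME_OOMS_PER_TRAIN_TIME_OOM
print(f"+{extra_test_time_ooms} OOMs test-time ≈ +{equivalent_pretraining_ooms:.1f} OOMs pretraining")
# -> roughly +3 OOMs of pretraining compute, i.e. on the order of a GPT-3 -> GPT-4-sized jump.
```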
## 3. USING A COMPUTER
This is perhaps the most straightforward of the three. ChatGPT right now is basically like a human that sits in an isolated box that you can text. While early unhobbling improvements teach models to use individual isolated tools, I expect that with multimodal models we will soon be able to do this in one fell swoop: we will simply enable models to use a computer like a human would.
That means joining your Zoom calls, researching things online, messaging and emailing people, reading shared docs, using your apps and dev tooling, and so on. (Of course, for models to make the most use of this in longer-horizon loops, this will go hand-in-hand with unlocking test-time compute.)
BY THE END OF THIS, I expect us to get something that looks a lot like a drop-in remote worker. An agent that joins your company, is onboarded like a new human hire, messages you and colleagues on Slack and uses your software, makes pull requests, and that, given big projects, can do the model-equivalent of a human going away for weeks to independently complete the project. You'll probably need somewhat better base models than GPT-4 to unlock this, but possibly not even that much better-a lot of juice is in fixing the clear and basic ways models are still hobbled.
(A very early peek at what this might look like is Devin, an early prototype of unlocking the "agency overhang"/"test-time compute overhang" on models on the path to creating a fully automated software engineer. I don't know how well Devin works in practice, and this demo is still very
limited compared to what proper chatbot $\rightarrow$ agent unhobbling would yield, but it's a useful teaser of the sort of thing coming soon.)
By the way, the centrality of unhobbling might lead to a somewhat interesting "sonic boom" effect in terms of commercial applications. Intermediate models between now and the drop-in remote worker will require tons of schlep to change workflows and build infrastructure to integrate and derive economic value from. The drop-in remote worker will be dramatically easier to integrate-just, well, drop them in to automate all the jobs that could be done remotely. It seems plausible that the schlep will take longer than the unhobbling, that is, by the time the drop-in remote worker is able to automate a large number of jobs, intermediate models won't yet have been fully harnessed and integrated-so the jump in economic value generated could be somewhat discontinuous.
## The next four years
Putting the numbers together, we should (roughly) expect another GPT-2-to-GPT-4-sized jump in the 4 years following GPT-4, by the end of 2027.
- GPT-2 to GPT-4 was roughly a 4.5-6 OOM base effective compute scaleup (physical compute and algorithmic efficiencies), plus major "unhobbling" gains (from base model to chatbot).
- In the subsequent 4 years, we should expect 3-6 OOMs of base effective compute scaleup (physical compute and algorithmic efficiencies)-with perhaps a best guess of $\sim 5$ OOMs-plus step-changes in utility and applications unlocked by "unhobbling" (from chatbot to agent/drop-in remote worker).
To put this in perspective, suppose GPT-4 training took 3 months. In 2027, a leading AI lab will be able to train a GPT-4-level model in a minute. ${ }^{30}$[^8]
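The arithmetic behind that one-liner is just a consistency check against the ~5 OOM effective compute scaleup (the "3 months" figure is itself only illustrative):

```python
import math

# Consistency check on the "3 months -> 1 minute" illustration: that wall-clock speedup
# corresponds to roughly the ~5 OOMs of effective compute scaleup discussed above.
gpt4_training_minutes = 3 * 30 * 24 * 60          # ~3 months expressed in minutes
speedup_ooms = math.log10(gpt4_training_minutes)  # vs. a 1-minute run
print(f"3 months -> 1 minute ≈ a {speedup_ooms:.1f}-OOM speedup")   # ~5.1 OOMs
```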
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-039.jpg?height=572&width=1442&top_left_y=478&top_left_x=382)
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-039.jpg?height=146&width=439&top_left_y=1713&top_left_x=1380)
Figure 19: GPT-2 (2019) to GPT-4 (2024) and 2023-2027 (projection). Panels: Compute (2-3 OOMs), Algorithmic efficiency (1-3 OOMs), Unhobbling (chatbot to agent).[^9]
The OOM effective compute scaleup will be dramatic.
Where will that take us?
## Counting the OOMs
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-040.jpg?height=716&width=1524&top_left_y=561&top_left_x=325)
Figure 20: Summary of counting the OOMs.
GPT-2 to GPT-4 took us from $\sim$ preschooler to $\sim$ smart high schooler; from barely being able to output a few cohesive sentences to acing high-school exams and being a useful coding assistant. That was an insane jump. If this is the intelligence gap we'll cover once more, where will that take us? ${ }^{31}$ We should not be surprised if that takes us very, very far. Likely, it will take us to models that can outperform PhDs and the best experts in a field.
${ }^{31}$ Of course, any benchmark we have today will be saturated. But that's not saying much; it's mostly a reflection of the difficulty of making hard-enough benchmarks.
(One neat way to think about this is that the current trend of AI progress is proceeding at roughly $3 x$ the pace of child development. Your 3x-speed-child just graduated high school; it'll be taking your job before you know it!)
Again, critically, don't just imagine an incredibly smart ChatGPT: unhobbling gains should mean that this looks more like a drop-in remote worker, an incredibly smart agent that can reason and plan and error-correct and knows everything about you and your company and can work on a problem independently for weeks.
We are on course for AGI by 2027. These AI systems will basically be able to automate all cognitive jobs (think: all jobs that could be done remotely).
To be clear-the error bars are large. Progress could stall as we run out of data, if the algorithmic breakthroughs necessary to crash through the data wall prove harder than expected. Maybe unhobbling doesn't go as far, and we are stuck with merely expert chatbots, rather than expert coworkers. Perhaps the decade-long trendlines break, or scaling deep learning hits a wall for real this time. (Or an algorithmic breakthrough, even simple unhobbling that unleashes the test-time compute overhang, could be a paradigm-shift, accelerating things further and leading to AGI even earlier.)
In any case, we are racing through the OOMs, and it requires no esoteric beliefs, merely trend extrapolation of straight lines, to take the possibility of AGI—true AGI-by 2027 extremely seriously.
It seems like many are in the game of downward-defining AGI these days, as just a really good chatbot or whatever. What I mean is an AI system that could fully automate my or my friends' job, that could fully do the work of an AI researcher or engineer. Perhaps some areas, like robotics, might take longer to figure out by default. And the societal rollout, e.g. in medical or legal professions, could easily be slowed by societal choices or regulation. But once models can automate AI research itself, that's enough-enough to kick off intense feedback loops-and we could very quickly make further progress, the automated AI engineers themselves solving all the remaining bottlenecks to fully automating everything. In particular, millions of automated researchers could very plausibly compress a decade of further algorithmic progress into a year or less. AGI will merely be a small taste of the superintelligence soon to follow. (More on that in the next chapter.)
In any case, do not expect the vertiginous pace of progress to abate. The trendlines look innocent, but their implications are
intense. As with every generation before them, every new generation of models will dumbfound most onlookers; they'll be incredulous when, very soon, models solve incredibly difficult science problems that would take PhDs days, when they're whizzing around your computer doing your job, when they're writing codebases with millions of lines of code from scratch, when every year or two the economic value generated by these models 10xs. Forget sci-fi, count the OOMs: it's what we should expect. AGI is no longer a distant fantasy. Scaling up simple deep learning techniques has just worked, the models just want to learn, and we're about to do another 100,000x+ by the end of 2027. It won't be long before they're smarter than us.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-042.jpg?height=287&width=178&top_left_y=984&top_left_x=239)
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-042.jpg?height=444&width=926&top_left_y=980&top_left_x=412)
Figure 21: GPT-4 is just the beginning. Where will we be four years later? Do not make the mistake of underestimating the rapid pace of deep learning progress (as illustrated by progress in GANs).
## Addendum. Racing through the OOMs: It's this decade or bust
I used to be more skeptical of short timelines to AGI. One reason is that it seemed unreasonable to privilege this decade, concentrating so much AGI-probability-mass on it (it seemed like a classic fallacy to think "oh we're so special"). I thought we should be uncertain about what it takes to get AGI, which should lead to a much more "smeared-out" probability distribution over when we might get AGI.
However, I've changed my mind: critically, our uncertainty over what it takes to get AGI should be over OOMs (of effective compute), rather than over years.
We're racing through the OOMs this decade. Even at its bygone heyday, Moore's law was only 1-1.5 OOMs/decade. I estimate that we will do $\sim 5$ OOMs in 4 years, and over $\sim 10$ this decade overall.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-043.jpg?height=817&width=1012&top_left_y=1253&top_left_x=299)
In essence, we're in the middle of a huge scaleup reaping one-time gains this decade, and progress through the OOMs will be multiples slower thereafter. If this scaleup doesn't get us to AGI in the next 5-10 years, it might be a
Figure 22: Rough projections on effective compute scaleup. We've been racing through the OOMs this decade; after the early 2030s, we will face a slow slog.
long way out.
- Spending scaleup: Spending a million dollars on a model used to be outrageous; by the end of the decade, we will likely have \$100B or \$1T clusters. Going much higher than that will be hard; that's already basically the feasible limit (both in terms of what big business can afford, and even just as a fraction of GDP). Thereafter all we have is glacial $2\%$/year trend real GDP growth to increase this.
- Hardware gains: AI hardware has been improving much more quickly than Moore's law. That's because we've been specializing chips for AI workloads. For example, we've gone from CPUs to GPUs; adapted chips for Transformers; and we've gone down to much lower precision number formats, from fp64/fp32 for traditional supercomputing to fp8 on H100s. These are large gains, but by the end of the decade we'll likely have totally-specialized AI-specific chips, without much further beyond-Moore's law gains possible.
- Algorithmic progress: In the coming decade, AI labs will invest tens of billions in algorithmic R\&D, and all the smartest people in the world will be working on this; from tiny efficiencies to new paradigms, we'll be picking lots of the low-hanging fruit. We probably won't reach any sort of hard limit (though "unhobblings" are likely finite), but at the very least the pace of improvements should slow down, as the rapid growth (in $\$$ and human capital investments) necessarily slows down (e.g., most of the smart STEM talent will already be working on AI). (That said, this is the most uncertain to predict, and the source of most of the uncertainty on the OOMs in the 2030s on the plot above.)
Put together, this means we are racing through many more OOMs in the next decade than we might in multiple decades thereafter. Maybe it's enough-and we get AGI soon-or we might be in for a long, slow slog. You and I can reasonably disagree on the median time to AGI,
depending on how hard we think achieving AGI will be. But given how we're racing through the OOMs right now, your modal AGI year should certainly be sometime later this decade or so.
Matthew Barnett
@MatthewJBar
My own basic calculations suggest that, given the potential for increased investment and hardware progress, we could very soon move through a large fraction of the remaining compute gap between the current frontier models and the literal amount of computation used by evolution.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-045.jpg?height=628&width=1051&top_left_y=865&top_left_x=255)
Mar 26, 2024[^10]
# II. From AGI to Superintelligence: the Intelligence Explosion
AI progress won't stop at human-level. Hundreds of millions of AGIs could automate AI research, compressing a decade of algorithmic progress ($5+$ OOMs) into 1 year. We would rapidly go from human-level to vastly superhuman AI systems. The power—and the peril—of superintelligence would be dramatic.
Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion,' and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.
I. J. GOOD (1965)
## The Bomb and The Super
In the common imagination, the Cold War's terrors principally trace back to Los Alamos, with the invention of the atomic bomb. But The Bomb, alone, is perhaps overrated. Going from The Bomb to The Super-hydrogen bombs-was arguably just as important.
In the Tokyo air raids, hundreds of bombers dropped thousands of tons of conventional bombs on the city. Later that year, Little Boy, dropped on Hiroshima, unleashed similar destructive power in a single device. But just 7 years later, Teller's hydrogen bomb multiplied yields a thousand-fold once again-a single bomb with more explosive power than all of the bombs dropped in the entirety of WWII combined.
The Bomb was a more efficient bombing campaign. The Super was a country-annihilating device. ${ }^{32}$
So it will be with AGI and Superintelligence.
AI progress won't stop at human-level. After initially learning from the best human games, AlphaGo started playing against itself-and it quickly became superhuman, playing extremely creative and complex moves that a human would never have come up with.
We discussed the path to AGI in the previous piece. Once we get AGI, we'll turn the crank one more time-or two or three more times-and AI systems will become superhuman-vastly superhuman. They will become qualitatively smarter than you or I, much smarter, perhaps similar to how you or I are qualitatively smarter than an elementary schooler.
The jump to superintelligence would be wild enough at the current rapid but continuous rate of AI progress (if we could make the jump to AGI in 4 years from GPT-4, what might another 4 or 8 years after that bring?). But it could be much faster than that, if AGI automates AI research itself.
Once we get AGI, we won't just have one AGI. I'll walk through the numbers later, but: given inference GPU fleets by then, we'll likely be able to run many millions of them (perhaps 100 million human-equivalents, and soon after at 10x+ human speed). Even if they can't yet walk around the office or make coffee, they will be able to do ML research on a computer. Rather than a few hundred researchers and engineers at a leading AI lab, we'd have more than 100,000x that-furiously working on algorithmic breakthroughs, day and night. Yes, recursive self-improvement, but no sci-fi required; they would need only to accelerate the existing trendlines of algorithmic progress (currently at $\sim 0.5$ OOMs/year).
Automated AI research could probably compress a human-decade of algorithmic progress into less than a year (and that seems conservative). That'd be $5+$ OOMs, another GPT-2-to-GPT-4-sized jump, on top of AGI: a qualitative jump like that from a preschooler to a smart high schooler, on top of AI systems already as smart as expert AI researchers/engineers.
${ }^{32}$ And much of the Cold War's perversities (cf Daniel Ellsberg's book) stemmed from merely replacing A-bombs with H-bombs, without adjusting nuclear policy and war plans to the massive capability increase.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-048.jpg?height=1147&width=1488&top_left_y=351&top_left_x=321)
There are several plausible bottlenecks-including limited compute for experiments, complementarities with humans, and algorithmic progress becoming harder-which I'll address, but none seem sufficient to definitively slow things down.
Before we know it, we would have superintelligence on our hands-AI systems vastly smarter than humans, capable of novel, creative, complicated behavior we couldn't even begin to understand-perhaps even a small civilization of billions of them. Their power would be vast, too.
Figure 24: Automated AI research could accelerate algorithmic progress, leading to $5+$ OOMs of effective compute gains in a year. The AI systems we'd have by the end of an intelligence explosion would be vastly smarter than humans.
Applying superintelligence to $\mathrm{R} \& \mathrm{D}$ in other fields, explosive progress would broaden from just ML research; soon they'd solve robotics, make dramatic leaps across other fields of science and technology within years, and an industrial explosion would follow. Superintelligence would likely provide a decisive military advantage, and unfold untold powers of destruction. We will be faced with one of the most intense and volatile moments of human history.
## Automating AI research
We don't need to automate everything-just AI research. A common objection to transformative impacts of AGI is that it will be hard for AI to do everything. Look at robotics, for instance, doubters say; that will be a gnarly problem, even if AI is cognitively at the level of PhDs. Or take automating biology R\&D, which might require lots of physical lab-work and human experiments.
But we don't need robotics-we don't need many things-for AI to automate AI research. The jobs of AI researchers and engineers at leading labs can be done fully virtually and don't run into real-world bottlenecks in the same way (though it will still be limited by compute, which I'll address later). And the job of an AI researcher is fairly straightforward, in the grand scheme of things: read ML literature and come up with new questions or ideas, implement experiments to test those ideas, interpret the results, and repeat. This all seems squarely in the domain where simple extrapolations of current AI capabilities could easily take us to or beyond the levels of the best humans by the end of 2027. ${ }^{33}$
It's worth emphasizing just how straightforward and hacky some of the biggest machine learning breakthroughs of the last decade have been: "oh, just add some normalization" (LayerNorm/BatchNorm) or "do $f(x)+x$ instead of $f(x)$" (residual connections) or "fix an implementation bug" (Kaplan $\rightarrow$ Chinchilla scaling laws). AI research can be automated. And automating AI research is all it takes to kick off extraordinary feedback loops. ${ }^{34}$[^11]
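To make the "straightforward and hacky" point concrete, here is a minimal PyTorch-style sketch of those two tweaks (illustrative only, not any lab's actual code):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wrap any sub-layer f so the block computes x + f(norm(x)).

    The two 'hacky' breakthroughs named above are literally this small:
    normalization (LayerNorm) plus the residual connection f(x) + x.
    """
    def __init__(self, f: nn.Module, dim: int):
        super().__init__()
        self.f = f
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(self.norm(x))   # "do f(x) + x instead of f(x)"

# Example: a tiny MLP sub-layer wrapped in the block; shape is preserved.
block = ResidualBlock(nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)), dim=64)
out = block(torch.randn(2, 10, 64))       # -> (2, 10, 64)
```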
We'd be able to run millions of copies (and soon at 10x+ human speed) of the automated AI researchers. Even by 2027, we should expect GPU fleets in the 10s of millions. Training clusters alone should be approaching 3 OOMs larger, already putting us at 10 million+ A100-equivalents. Inference fleets should be much larger still. (More on all this in IIIa. Racing to the Trillion Dollar Cluster.)
That would let us run many millions of copies of our automated AI researchers, perhaps 100 million human-researcher-equivalents, running day and night. There are some assumptions that flow into the exact numbers, including that humans "think" at 100 tokens/minute (just a rough order-of-magnitude estimate, e.g. consider your internal monologue) and extrapolating historical trends and Chinchilla scaling laws on per-token inference costs for frontier models remaining in the same ballpark. ${ }^{35}$ We'd also want to reserve some of the GPUs for running experiments and training new models. Full calculation in a footnote. ${ }^{36}$
Another way of thinking about it is that given inference fleets in 2027, we should be able to generate an entire internet's worth of tokens, every single day. ${ }^{37}$ In any case, the exact numbers don't matter that much, beyond a simple plausibility demonstration.
${ }^{35}$ As noted earlier, the GPT-4 API costs less today than GPT-3 did when it was released-this suggests that the trend of inference efficiency wins is fast enough to keep inference costs roughly constant even as models get much more powerful. Similarly, there have been huge inference cost wins in just the year since GPT-4 was released; for example, the current version of Gemini 1.5 Pro outperforms the original GPT-4, while being roughly 10x cheaper.
We can also ground this somewhat more by considering Chinchilla scaling laws. On Chinchilla scaling laws, model size-and thus inference costs-grow with the square root of training cost, i.e. half the OOMs of the OOM scaleup of effective compute. However, in the previous piece, I suggested that algorithmic efficiency was advancing at roughly the same pace as compute scaleup, i.e. it made up roughly half of the OOMs of effective compute scaleup. If these algorithmic wins also translate into inference efficiency, that means that the algorithmic efficiencies would compensate for the naive increase in inference cost.
In practice, training compute efficiencies often, but not always, translate into inference efficiency wins. However, there are also separately many inference efficiency wins that are not training efficiency wins. So, at least in terms of the rough ballpark, assuming the $\$ /$ token of frontier models stays roughly similar doesn't seem crazy.
(Of course, they'll use more tokens, i.e. more test-time compute. But that's already part of the calculation here, by pricing human-equivalents as 100 tokens/minute.)
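A minimal sketch of that argument in code (using the assumptions stated above: Chinchilla-style growth in model size and algorithmic wins making up roughly half the OOMs; both are rough simplifications, not a precise model):

```python
# Sketch of the footnote's argument for why frontier $/token might stay roughly flat.
# Assumptions (from the text above): Chinchilla-optimal model size, and hence inference
# cost per token, rises by ~half the OOMs of the effective-compute scaleup; algorithmic
# efficiencies make up the other ~half, and (assumed) those wins also cut inference cost.
effective_compute_ooms = 5.0
naive_inference_cost_increase = effective_compute_ooms / 2   # bigger models: +2.5 OOMs/token
algorithmic_inference_savings = effective_compute_ooms / 2   # efficiency wins: -2.5 OOMs/token
net_change = naive_inference_cost_increase - algorithmic_inference_savings
print(f"net change in $/token: {net_change:+.1f} OOMs")       # ~0, i.e. roughly constant
```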
${ }^{36}$ GPT-4T is about \$0.03/1K tokens. We supposed we would have 10s of millions of A100-equivalents, which cost $\sim$\$1/hour per GPU. If we use API costs to translate GPUs into tokens generated, that implies 10s of millions of GPUs * \$1/GPU-hour * 33K tokens/\$ $=\sim$ one trillion tokens/hour. Suppose a human does 100 tokens/min of thinking; that means a human-equivalent is 6,000 tokens/hour. One trillion tokens/hour divided by 6,000 tokens/human-hour $=\sim 200$ million human-equivalents, i.e. as if running 200 million human researchers, day and night. (And even if we reserve half the GPUs for experiment compute, we get 100 million human-researcher-equivalents.)
${ }^{37}$ The previous footnote estimated $\sim 1 \mathrm{T}$ tokens/hour, i.e. 24T tokens a day. In the previous piece, I noted that a public deduplicated CommonCrawl had around 30T tokens.
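A minimal sketch of the arithmetic in footnotes 36-37 (the 30-million-GPU figure is an assumed stand-in for "10s of millions" of A100-equivalents; all other inputs are the rough ones from the footnotes):

```python
# Back-of-the-envelope from footnotes 36-37.
gpus = 30_000_000                         # assumed stand-in for "10s of millions" of A100-equivalents
dollars_per_gpu_hour = 1.0                # rough A100-equivalent cost
tokens_per_dollar = 1_000 / 0.03          # ~33K tokens per $ at ~$0.03/1K tokens
human_tokens_per_hour = 100 * 60          # 100 tokens/minute of "thinking"

tokens_per_hour = gpus * dollars_per_gpu_hour * tokens_per_dollar   # ~1e12 tokens/hour
human_equivalents = tokens_per_hour / human_tokens_per_hour         # ~170 million
tokens_per_day = tokens_per_hour * 24                               # ~2.4e13 tokens/day

print(f"~{tokens_per_hour:.1e} tokens/hour ≈ {human_equivalents:,.0f} human-researcher-equivalents")
print(f"~{tokens_per_day:.1e} tokens/day vs. a ~3e13-token deduplicated CommonCrawl")
```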
Moreover, our automated AI researchers may soon be able to run at much faster than human-speed:
- By taking some inference penalties, we can trade off running fewer copies in exchange for running them at faster serial speed. (For example, we could go from $\sim 5 x$ human speed to $\sim 100 \mathrm{x}$ human speed by "only" running 1 million copies of the automated researchers. ${ }^{38}$ )
- More importantly, the first algorithmic innovation the automated AI researchers work on is getting a 10x or 100x speedup. Gemini 1.5 Flash is $\sim 10 x$ faster than the originally-released GPT-4, ${ }^{39}$ merely a year later, while providing similar performance to the originally-released GPT-4 on reasoning benchmarks. If that's the algorithmic speedup a few hundred human researchers can find in a year, the automated AI researchers will be able to find similar wins very quickly.
That is: expect 100 million automated researchers each working at 100x human speed not long after we begin to be able to automate AI research. They'll each be able to do a year's worth of work in a few days. The increase in research effort-compared to a few hundred puny human researchers at a leading AI lab today, working at a puny $1 \times$ human speed-will be extraordinary.
This could easily dramatically accelerate existing trends of algorithmic progress, compressing a decade of advances into a year. We need not postulate anything totally novel for automated AI research to intensely speed up AI progress. Walking through the numbers in the previous piece, we saw that algorithmic progress has been a central driver of deep learning progress in the last decade; we noted a trendline of $\sim 0.5$ OOMs/year on algorithmic efficiencies alone, with additional large algorithmic gains from unhobbling on top. (I think the import of algorithmic progress has been underrated by many, and properly appreciating it is important for appreciating the possibility of an intelligence explosion.) Could our millions of automated AI researchers (soon working at 10x or 100x human speed) compress the algorithmic progress human researchers would have found in a decade into a year instead?[^12] That would be $5+$ OOMs in a year.
Don't just imagine 100 million junior software engineer interns here (we'll get those earlier, in the next couple years!). Real automated AI researchers will be very smart-and in addition to their raw quantitative advantage, automated AI researchers will have other enormous advantages over human researchers:
- They'll be able to read every single ML paper ever written, have been able to deeply think about every single previous experiment ever run at the lab, learn in parallel from each of their copies, and rapidly accumulate the equivalent of millennia of experience. They'll be able to develop far deeper intuitions about ML than any human.
- They'll be easily able to write millions of lines of complex code, keep the entire codebase in context, and spend human-decades (or more) checking and rechecking every line of code for bugs and optimizations. They'll be superbly competent at all parts of the job.
- You won't have to individually train up each automated AI researcher (indeed, training and onboarding 100 million new human hires would be difficult). Instead, you can just teach and onboard one of them-and then make replicas. (And you won't have to worry about politicking, cultural acclimation, and so on, and they'll work with peak energy and focus day and night.)
- Vast numbers of automated AI researchers will be able to share context (perhaps even accessing each others' latent space and so on), enabling much more efficient collaboration and coordination compared to human researchers.
- And of course, however smart our initial automated AI researchers would be, we'd soon be able to make further OOM-jumps, producing even smarter models, even more capable at automated AI research.
Imagine an automated Alec Radford—imagine 100 million automated Alec Radfords. ${ }^{40}$ I think just about every researcher at OpenAI would agree that if they had 10 Alec Radfords, let alone 100 or 1,000 or 1 million running at $10 \times$ or $100 x$ human speed, they could very quickly solve very many of their problems. Even with various other bottlenecks (more in a moment), compressing a decade of algorithmic progress into a year as a result seems very plausible. (A 10x acceleration from a million times more research effort, which seems conservative if anything.)
That would be $5+$ OOMs right there. 5 OOMs of algorithmic wins would be a similar scaleup to what produced the GPT-2-to-GPT-4 jump, a capability jump from $\sim$ a preschooler to $\sim$ a smart high schooler. Imagine such a qualitative jump on top of AGI, on top of Alec Radford.
It's strikingly plausible we'd go from AGI to superintelligence very quickly, perhaps in 1 year.
## Possible bottlenecks
While this basic story is surprisingly strong-and is supported by thorough economic modeling work-there are some real and plausible bottlenecks that will probably slow down an automated-AI-research intelligence explosion.
I'll give a summary here, and then discuss these in more detail in the optional sections below for those interested:
- Limited compute: AI research doesn't just take good ideas, thinking, or math-but running experiments to get empirical signal on your ideas. A million times more research effort via automated research labor won't mean a million times faster progress, because compute will still be limited-and limited compute for experiments will be the bottleneck. Still, even if this won't be a $1,000,000 x$ speedup, I find it hard to imagine that the automated AI researchers couldn't use the compute at least 10x more effectively: they'll be able to get incredible ML intuition (having internalized the whole ML literature and every previous experiment ever run!) and centuries-equivalent of thinking-time to figure out exactly the right experiment to run, configure it optimally, and get maximum value of information; they'll be able to spend centuries-equivalent of engineer-time before running even tiny experiments to avoid bugs and get them right on the first try; they can make tradeoffs to economize on compute by focusing on the biggest wins; and they'll be able to try tons of smaller-scale experiments (and given effective compute scaleups by then, "smaller-scale" means being able to train 100,000 GPT-4-level models in a year to try architecture breakthroughs). Some human researchers and engineers are able to produce 10x the progress of others, even with the same amount of compute-and this should apply even more so to automated AI researchers. I do think this is the most important bottleneck, and I address it in more depth below.
${ }^{40}$ Alec Radford is an incredibly gifted and prolific researcher/engineer at OpenAI, behind many of the most important advances, though he flies under the radar some.
- Complementarities/long tail: A classic lesson from economics (cf Baumol's growth disease) is that if you can automate, say, $70 \%$ of something, you get some gains but quickly the remaining $30 \%$ become your bottleneck. For anything that falls short of full automation-say, really good copilots-human AI researchers would remain a major bottleneck, making the overall increase in the rate of algorithmic progress relatively small. Moreover, there's likely some long tail of capabilities required for automating AI research-the last $10 \%$ of the job of an AI researcher might be particularly hard to automate. This could soften takeoff some, though my best guess is that this only delays things by a couple years. Perhaps 2026/27-models are the proto-automated-researcher; it takes another year or two for some final unhobbling, a somewhat better model, inference speedups, and working out kinks to get to full automation; and finally by 2028 we get the $10 x$ acceleration (and superintelligence by the end of the decade).
- Inherent limits to algorithmic progress: Maybe another 5 OOMs of algorithmic efficiency will be fundamentally impossible? I doubt it. While there will definitely be upper limits, ${ }^{41}$[^13] if we got 5 OOMs in the last decade, we should probably expect at least another decade's-worth of progress to be possible. More directly, current architectures and training algorithms are still very rudimentary, and it seems that much more efficient schemes should be possible. Biological reference classes also support dramatically more efficient algorithms being plausible.
- Ideas get harder to find, so the automated AI researchers will merely sustain, rather than accelerate, the current rate of progress: One objection is that although automated research would increase effective research effort a lot, ideas also get harder to find. That is, while it takes only a few hundred top researchers at a lab to sustain 0.5 OOMs/year today, as we exhaust the low-hanging fruit, it will take more and more effort to sustain that progress-and so the 100 million automated researchers will be merely what's necessary to sustain progress. I think this basic model is correct, but the empirics don't add up: the magnitude of the increase in research effort-a million-fold-is way, way larger than the historical trends of the growth in research effort that's been necessary to sustain progress. In econ modeling terms, it's a bizarre "knife-edge assumption" to assume that the increase in research effort from automation will be just enough to keep progress constant.
- Ideas get harder to find and there are diminishing returns, so the intelligence explosion will quickly fizzle: Related to the above objection, even if the automated AI researchers lead to an initial burst of progress, whether rapid progress can be sustained depends on the shape of the diminishing returns curve to algorithmic progress. Again, my best read of the empirical evidence is that the exponents shake out in favor of explosive/accelerating progress. In any case, the sheer size of the one-time boost-from 100s to 100s of millions of AI researchers-probably overcomes diminishing returns here for at least a good number of OOMs of algorithmic progress, even though it of course can't be indefinitely self-sustaining.
Overall, these factors may slow things down somewhat: the most extreme versions of intelligence explosion (say, overnight) seem implausible. And they may result in a somewhat longer
runup (perhaps we need to wait an extra year or two from more sluggish, proto-automated researchers to the true automated Alec Radfords, before things kick off in full force). But they certainly don't rule out a very rapid intelligence explosion. A year-or at most just a few years, but perhaps even just a few months-in which we go from fully-automated AI researchers to vastly superhuman AI systems should be our mainline expectation.
If you'd rather skip the in-depth discussions on the various bottlenecks below, click here to skip to the next section.
## Limited compute for experiments (optional, in more depth)
The production function for algorithmic progress includes two complementary factors of production: research effort and experiment compute. The millions of automated AI researchers won't have any more compute to run their experiments on than human AI researchers; perhaps they'll just be sitting around waiting for their jobs to finish.
This is probably the most important bottleneck to the intelligence explosion. Ultimately this is a quantitative question-just how much of a bottleneck is it? On balance, I find it hard to believe that the 100 million Alec Radfords couldn't increase the marginal product of experiment compute by at least 10x (and thus, would still accelerate the pace of progress by 10x):
- There's a lot you can do with smaller amounts of compute. The way most AI research works is that you test things out at small scale-and then extrapolate via scaling laws. (Many key historical breakthroughs required only a very small amount of compute, e.g. the original Transformer was trained on just 8 GPUs for a few days.) And note that with $\sim 5$ OOMs of baseline scaleup in the next four years, "small scale" will mean GPT-4 scale-the automated AI researchers will be able to run 100,000 GPT-4-level experiments on their training cluster in a year, and tens of millions of GPT-3-level experiments. (That's a lot
of potential-breakthrough new architectures they'll be able to test!)
- A lot of the compute goes into larger-scale validation of the final pretraining run-making sure you are getting a high-enough degree of confidence on marginal efficiency wins for your annual headline product-but if you're racing through the OOMs in the intelligence explosion, you could economize and just focus on the really big wins.
- As discussed in the previous piece, there are often enormous gains to be had from relatively low-compute "unhobbling" of models. These don't require big pretraining runs. It's highly plausible that the intelligence explosion starts off with automated AI research discovering, e.g., a way to do RL on top that gives us a couple OOMs via unhobbling wins (and then we're off to the races).
- As the automated AI researchers find efficiencies, that'll let them run more experiments. Recall the near-1000x cheaper inference in two years for equivalent-MATH performance, and the 10x general inference gains in the last year, discussed in the previous piece, from mere-human algorithmic progress. The first thing the automated AI researchers will do is quickly find similar gains, and in turn, that'll let them run 100x more experiments on e.g. new RL approaches. Or they'll be able to quickly make smaller models with similar performance in relevant domains (cf previous discussion of Gemini Flash, near-100x cheaper than GPT-4), which in turn will let them run many more experiments with these smaller models (again, imagine using these to try different RL schemes). There are probably other overhangs too, e.g. the automated AI researchers might be able to quickly develop much better distributed training schemes to utilize all the inference GPUs (probably at least 10x more compute right there). More generally, every OOM of training efficiency gains they find will give them an OOM
more of effective compute to run experiments on.
- The automated AI researchers could be way more efficient. It's hard to overstate how many fewer experiments you would have to run if you just got it right on the first try—no gnarly bugs, being more selective about exactly what you are running, and so on. Imagine 1,000 automated AI researchers spending a month-equivalent checking your code and getting the exact experiment right before you press go. I've asked some AI lab colleagues about this and they agreed: you should pretty easily be able to save $3 x$-10x of compute on most projects merely if you could avoid frivolous bugs, get things right on the first try, and only run high value-of-information experiments.
- The automated AI researchers could have way better intuitions.
- Recently, I was speaking to an intern at a frontier lab; they said that their dominant experience over the past few months was suggesting many experiments they wanted to run, and their supervisor (a senior researcher) saying they could already predict the result beforehand so there was no need. The senior researcher's years of random experiments messing around with models had honed their intuitions about what ideas would work-or not. Similarly, it seems like our AI systems could easily get superhuman intuitions about ML experiments-they will have read the entire machine learning literature, be able to learn from every other experiment result and deeply think about it, they could easily be trained to predict the outcome of millions of ML experiments, and so on. And maybe one of the first things they do is build up a strong basic science of "predicting if this large scale experiment will be successful just after seeing the first $1 \%$ of training, or just after seeing the smaller scale version of this experiment", and so on.
- Moreover, beyond really good intuitions about research directions, as Jason Wei has noted, there are incredible returns to having great intuitions on the dozens of hyperparameters and details of an experiment. Jason calls this ability to get things right on the first try based on intuition "yolo runs". (Jason says, "what I do know is that the people who can do this are surely 10-100x AI researchers.")
Compute bottlenecks will mean a million times more researchers won't translate into a million times faster research-thus not an overnight intelligence explosion. But the automated AI researchers will have extraordinary advantages over human researchers, and so it seems hard to imagine that they couldn't also find a way to use the compute at least 10x more efficiently/effectively-and so 10x the pace of algorithmic progress seems eminently plausible.
I'LL TAKE A MOMENT HERE to acknowledge perhaps the most compelling formulation of the counterargument I've heard, by my friend James Bradbury: if more ML research effort would so dramatically accelerate progress, why doesn't the current academic ML research community, numbering at least in the tens of thousands, contribute more to frontier lab progress? (Currently, it seems like lab-internal teams, of perhaps a thousand in total across labs, shoulder most of the load for frontier algorithmic progress.) His argument is that the reason is that algorithmic progress is compute-bottlenecked: the academics just don't have enough compute.
Some responses:
- Quality-adjusted, I think academics are probably more in the thousands not tens of thousands (e.g., looking only at the top universities). This probably isn't substantially more than the labs combined. (And it's way less than the hundreds of millions of researchers we'd get from automated AI research.)
- Academics work on the wrong things. Up until very
recently (and perhaps still today?), the vast majority of the academic ML community wasn't even working on large language models. In terms of strong academics in academia working on large language models, it might be meaningfully fewer than researchers at labs combined?
- Even when the academics do work on things like LLM pretraining, they simply don't have access to the stateof-the-art-the large accumulated body of knowledge of tons of details on frontier model training inside labs. They don't know what problems are actually relevant, or can only contribute one-off results that nobody can really do anything with because their baselines were badly tuned (so nobody knows if their thing is actually an improvement).
- Academics are way worse than automated AI researchers: they can't work at 10x or 100x human speed, they can't read and internalize every ML paper ever written, they can't spend a decade checking every line of code, replicate themselves to avoid onboarding bottlenecks, etc.
Another countervailing example to the academics argument: GDM is rumored to have way more experiment compute than OpenAI, and yet it doesn't seem like GDM is massively outpacing OpenAI in terms of algorithmic progress.
In general, I expect automated researchers will have a different style of research that plays to their strengths and aims to mitigate the compute bottleneck. I think it's reasonable to be uncertain how this plays out, but it's unreasonable to be confident it won't be doable for the models to get around the compute bottleneck just because it'd be hard for humans to do so.
- For example, they could just spend a lot of effort early on building up a basic science of "how to predict large scale results from smaller scale experiments". And I expect there's a lot that they could do that humans can't do, e.g. maybe things more like "predicting if this large scale experiment will be successful just after seeing the
first $1 \%$ of training". This seems pretty doable if you're a super strong automated researcher with very superhuman intuitions and this can save you a ton of compute.
- When I imagine AI systems automating AI research, I see them as compute-bottlenecked but making up for it in large part by thinking e.g. $1000 x$ more (and faster) than humans would, and thinking at a higher level of quality than humans (e.g. because of the superhuman ML intuitions from being trained to predict the result of millions of experiments). Unless they're just much worse at thinking than engineering, I think this can make up for a lot, and this would be qualitatively different from academics.
(In addition to experiment compute, there's the additional bottleneck of eventually needing to run a big training run, something which currently takes months. But you can probably economize on those, doing only a handful during the year of intelligence explosion, taking bigger OOM leaps for each than labs currently do. ${ }^{42}$ Or you could "spend" 1 out of the 5 OOMs of compute efficiency wins to do a training run in days rather than months.)
## Complementarities and long tails to $100 \%$ automation (optional, in more depth)
The classic economist objection to AI automation speeding up economic growth is that different tasks are complementary: so, for example, automating $80 \%$ of the labor humans did in 1800 didn't lead to a growth explosion or mass unemployment, but the remaining $20 \%$ became what all humans did and remained the bottleneck. (See e.g. a model of this here.)
I think the economists' model here is correct. But a key point is that I'm only talking about one currently-small part of the economy, rather than the economy as a whole. People may well still be getting haircuts normally during this time-robotics might not yet be worked out, AIs for every domain might not yet be worked out, the societal rollout might not yet be worked out, etc.-but they will be able to do AI research. As discussed in the previous piece, I think the current course of AI progress is taking us to essentially drop-in remote workers as intelligent as the smartest humans; as discussed in this piece, the job of an AI researcher seems totally within the scope of what could be fully automated.
${ }^{42}$ Note that while I think this is likely, it's kind of scary: it means that rather than a fairly continuous series of big models, each somewhat better than the previous generation, downstream model intelligence might be more discrete/discontinuous. We might only do one or a couple of big runs during the intelligence explosion, banking multiple OOMs of algorithmic breakthroughs found at smaller scale for each.
Still, in practice, I do expect somewhat of a long tail to get to truly $100 \%$ automation even for the job of an AI researcher/engineer; for example, we might first get systems that function almost as an engineer replacement, but still need some amount of human supervision.
In particular, I expect the level of AI capabilities to be somewhat uneven and peaky across domains: it might be a better coder than the best engineers while still having blindspots in some subset of tasks or skills; by the time it's human-level at whatever it's worst at, it'll already be substantially superhuman at easier domains to train, like coding. (This is part of why I think they'll be able to use the compute more effectively than human researchers. By the time of $100 \%$ automation/the intelligence explosion starting, they'll already have huge advantages over humans in some domains. This will also have important implications for superalignment down the line, since it means that we'll have to align systems that are meaningfully superhuman in many domains in order to align even the first automated AI researchers.)
But I wouldn't expect that phase to last more than a few years; given the pace of AI progress, I think it would likely just be a matter of some additional "unhobbling" (removing some obvious limitation of the models that prevented it from doing the last mile) or another generation of models to get all the way.
Overall, this might soften takeoff some. Rather than 2027 AGI $\rightarrow 2028$ Superintelligence, it might look more like:
- 2026/27: Proto-automated-engineer, but blind spots in
other areas. Speeds up work by 1.5x-2x already; progress begins gradually accelerating.
- 2027/28: Proto-automated-researchers, can automate $>90 \%$. Some remaining human bottlenecks, and hiccups in coordinating a giant organization of automated researchers to be worked out, but this already speeds up progress by $3 \mathrm{x}+$. This quickly does the remaining necessary "unhobbling" and takes us the remainder of the way to $100 \%$ automation.
- 2028/29: 10x+ pace of progress $\rightarrow$ superintelligence.
That's still very fast...
## Fundamental limits to algorithmic progress (optional, in more depth)
There's probably a real cap on how much algorithmic progress is physically possible. (For example, 25 OOMs of algorithmic progress seems impossible, since that would imply being able to train a GPT-4-level system in less than $\sim 10$ FLOPs. ${ }^{43}$ ) But something like 5 OOMs seems very much in the realm of possibilities; again, that would just require another decade of trend algorithmic efficiencies (not even counting algorithmic gains from unhobbling).
Intuitively, it very much doesn't seem like we have exhausted all the low-hanging fruit yet, given how simple the biggest breakthroughs are-and how rudimentary and obviously hobbled current architectures and training techniques still seem to be. For example, I think it's pretty plausible that we'll bootstrap our way to AGI via AI systems that "think out loud" via chain-of-thought. But surely this isn't the most efficient way to do it; surely something that does this reasoning via internal states/recurrence/etc. would be way more efficient. Or consider adaptive compute: Llama 3 still spends as much compute on predicting the "and" token as it does the answer to some complicated question, which seems clearly suboptimal. We're getting huge OOM algorithmic gains from even just small tweaks, while there are dozens of areas where much more efficient architectures and training procedures could likely be found.
${ }^{43}$ Though you could get results that would take 25 more OOMs of hardware with current architecture!
Biological references also suggest huge headroom. The human range of intelligence is very wide, for example, with only tiny tweaks to architecture. Humans have similar numbers of neurons as other animals, even though humans are much smarter than those animals. And current AI models are still many OOMs from the efficiency of the human brain; humans can learn from a tiny fraction of the data (and thus a tiny fraction of the "compute") that AI models require, suggesting huge headroom for our algorithms and architectures.
## Ideas get harder to find and diminishing returns (optional, in more depth)
As you pick the low-hanging fruit, ideas get harder to find. This is true in any domain of technological progress. Essentially, we see a straight line on a log-log plot: $\log$ (progress) is a linear function of $\log$ (cumulative research effort). Every OOM of further progress requires putting in more research effort than the last OOM.
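To make the straight-line claim concrete, one stylized way to write it (the notation, in particular the returns exponent $r$, is mine, a simplified sketch rather than the exact formulation used in the literature):

$$\log(\text{progress}) \approx \text{const} + r \cdot \log(\text{cumulative research effort})$$

Here $r$ captures how steep the returns are: each additional OOM of progress requires roughly $1/r$ additional OOMs of cumulative research effort, so the key empirical question below is whether $r$ is above or below 1.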
This leads to two objections to the intelligence explosion:
1. Automated AI research will merely be what's necessary to sustain progress (rather than dramatically accelerating it).
2. A purely-algorithmic intelligence explosion would not be sustained / would quickly fizzle out as algorithmic progress gets harder to find / you hit diminishing marginal returns.
I spent a lot of time thinking about these sorts of models in a past life, when I was doing research in economics. (In particular, semi-endogenous growth theory is the standard model of technological progress, capturing these two competing dynamics of growing research effort and ideas getting harder to find.)
In short, I think the underlying model behind these objections is sound, but how it shakes out is an empirical question-and I think they get the empirics wrong.
The key question is essentially: for every 10x of progress, does further progress become more or less than 10x harder? Napkin math (along the lines of how this is done in the economic literature) helps us bound this.
- Suppose we take the $\sim 0.5 \mathrm{OOMs} /$ year trend rate of algorithmic progress seriously; that implies 100x of progress in 4 years.
- However, quality-adjusted headcount / research effort at a given leading AI lab has definitely grown by $<100 \mathrm{x}$ in 4 years. Maybe it's increased 10x (from 10s to 100s of people working on relevant stuff at a given lab), but even that is unclear quality-adjusted.
- And yet, algorithmic progress seems to be sustained.
Thus, in response to objection 1, we can note that the $\sim$ million-fold increase in research effort will simply be a much larger increase than what would merely be necessary to sustain progress. Maybe, in 4 years, we'd need on the order of thousands of researchers working on relevant research at a lab to sustain progress; the 100 million Alec Radfords would still be an enormous increase, and surely lead to massive acceleration. It's just a bizarre "knife-edge" assumption to think that automated research would be just enough to sustain the existing pace of progress. (And that's not even counting thinking at 10x human speed and all the other advantages the AI systems will have over human researchers.)
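A minimal sketch of this napkin math (the specific inputs, roughly 2 OOMs of progress against roughly 1 OOM of effort growth, are the rough figures from the bullets above; the exponent $r$ is my notation from the sketch earlier):

```python
# Napkin math from the bullets above (rough, illustrative numbers).
progress_ooms = 2.0       # ~0.5 OOMs/year of algorithmic progress over 4 years -> 100x
effort_growth_ooms = 1.0  # quality-adjusted research effort grew maybe ~10x in that time

# If log(progress) ~ r * log(cumulative research effort), then
# r ~ (OOMs of progress) / (OOMs of effort growth):
r = progress_ooms / effort_growth_ooms
print(f"Implied returns exponent r ~ {r:.1f}")

# Each further 10x of progress then requires ~10**(1/r) of additional effort:
print(f"Each 10x of progress needs ~{10**(1/r):.1f}x more cumulative effort")

# r > 1 means each 10x of progress needs *less* than 10x more effort: the regime
# where research effort that scales with progress itself can be self-sustaining.
```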
In response to objection 2, we can note two things:
- First, the mathematical condition noted above. Given that, based on our napkin math, quality-adjusted research effort needed to grow ≪100x while we did 100x of algorithmic progress, it pretty strongly seems that the shape of the returns curve shakes out in favor of self-sustaining progress. ${ }^{44}$
- Secondly, the returns curve doesn't even need to shake out in favor of a fully sustained chain reaction for us to get a bounded-but-many-OOM surge of algorithmic progress. Essentially, it doesn't need to be a "growth effect"; a large enough "level effect" would be enough. That is, a million-fold increase in research effort (combined with the many other advantages automated researchers would have over human researchers) would be such a large one-time boost that even if the chain reaction isn't fully self-sustaining, it could lead to a very sizeable (many OOMs) one-time gain. ${ }^{45}$

${ }^{44}$ 100x progress $\rightarrow$ 100x more automated research effort, but you, say, only needed 10x more research effort to keep it going and do the next 100x, so the returns are good enough for explosive progress to be sustained.
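As a toy illustration of the growth-effect vs. level-effect distinction (this is entirely a stylized model of my own, with made-up parameters, not a calibrated forecast): research effort is assumed to scale with the algorithmic level itself, and the returns exponent r from the sketch above determines whether that feedback runs away or merely yields a large but bounded boost.

```python
import math

def simulate(r, boost, years=8, dt=0.01):
    """Toy model: algorithmic level (in OOMs) A = r * log10(cumulative research effort E).
    Effort flow = baseline * boost * 10**A, i.e. automated researchers become more
    effective as the algorithms they run on improve. Returns final A (inf if it explodes)."""
    E = 1.0        # cumulative research effort, normalized so the starting level is A = 0
    baseline = 1.0
    for _ in range(int(years / dt)):
        A = r * math.log10(E)
        if A >= 15:               # treat >15 OOMs within the window as "runaway"
            return float("inf")
        E += baseline * boost * (10 ** A) * dt
    return r * math.log10(E)

for r in (0.7, 1.2):              # sub-critical vs. super-critical returns
    for boost in (1.0, 1e5):      # without vs. with a one-time 100,000x jump in effort at automation
        A = simulate(r, boost)
        label = "runaway" if math.isinf(A) else f"~{A:.1f} OOMs"
        print(f"r={r}, one-time effort boost={boost:.0e}: gain over 8 years -> {label}")
```

In the sub-critical case (r < 1) progress doesn't run away either way, but the one-time jump in effort still buys many extra OOMs within the window (the "level effect"); in the super-critical case (r > 1) the feedback loop runs away (the "growth effect").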
On net, while obviously it won't be unbounded and I have a lot of uncertainty over just how far it'll go, I think something like a 5 OOM intelligence explosion purely from algorithmic gains / automated AI research seems highly plausible.
Tom Davidson and Carl Shulman have also looked at the empirics of this in a growth-modeling framework and come to similar conclusions. Epoch AI has done some recent work on the empirics as well, coming to the conclusion that the empirical returns to algorithmic R\&D favor explosive growth, with a helpful writeup of the implications.
## The power of superintelligence
Whether or not you agree with the strongest form of these arguments-whether we get a $<1$ year intelligence explosion, or it takes a few years-it is clear: we must confront the possibility of superintelligence.
The AI systems we'll likely have by the end of this decade will be unimaginably powerful.
- Of course, they'll be quantitatively superhuman. On our fleets of 100s of millions of GPUs by the end of the decade, we'll be able to run a civilization of billions of them, and they will be able to "think" orders of magnitude faster than humans. They'll be able to quickly master any domain, write trillions of lines of code, read every research paper in every scientific field ever written (they'll be perfectly interdisciplinary!) and write new ones before you've gotten past the abstract of one, learn from the parallel experience of every one of their copies, gain billions of human-equivalent years of experience with some new innovation in a matter of weeks, work $100 \%$ of the time with peak energy and focus, and won't be slowed down by that one teammate who is lagging, and so on.
- More importantly-but harder to imagine-they'll be qualitatively superhuman. As a narrow example of this, large-scale RL runs have been able to produce completely novel and creative behaviors beyond human understanding, such as the famous move 37 in AlphaGo vs. Lee Sedol. Superintelligence will be like this across many domains. It'll find exploits in human code too subtle for any human to notice, and it'll generate code too complicated for any human to understand even if the model spent decades trying to explain it. Extremely difficult scientific and technological problems that a human would be stuck on for decades will seem just so obvious to them. We'll be like high-schoolers stuck on Newtonian physics while it's off exploring quantum mechanics.
As an example of how wild this could be, look at some YouTube videos of video game speedruns, such as this one of beating Minecraft in 20 seconds. (If you have no idea what's going on in this video, you're in good company; even most normal players of Minecraft have almost no clue what's going on.)
Now imagine this applied to all domains of science, technology, and the economy. The error bars here, of course, are extremely large. Still, this is happening, and it's important to consider just how consequential this would be.
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-068.jpg?height=766&width=1027&top_left_y=262&top_left_x=278)
In the intelligence explosion, explosive progress was initially only in the narrow domain of automated AI research. As we get superintelligence, and apply our billions of (now superintelligent) agents to R\&D across many fields, I expect explosive progress to broaden:
- An AI capabilities explosion. Perhaps our initial AGIs had limitations that prevented them from fully automating work in some other domains (rather than just in the AI research domain); automated AI research will quickly solve these, enabling automation of any and all cognitive work.
- Solve robotics. Superintelligence won't stay purely cognitive for long. Getting robotics to work well is primarily an ML algorithms problem (rather than a hardware problem), and our automated AI researchers will likely be able to solve it (more below!). Factories would go from human-run, to AI-directed using human physical labor, to soon being fully run by swarms of robots.
- Dramatically accelerate scientific and technological progress. Yes, Einstein alone couldn't develop neuroscience and build a semiconductor industry, but a billion superintelligent automated scientists, engineers, technologists, and robot technicians (with the robots moving at 10x or more human speed!) ${ }^{46}$ would make extraordinary advances in many fields in the space of years. (Here's a nice short story visualizing what AI-driven R\&D might look like.) The billion superintelligences would be able to compress the R\&D effort human researchers would have done in the next century into years. Imagine if the technological progress of the 20th century were compressed into less than a decade. We would have gone from flying being thought a mirage, to airplanes, to a man on the moon and ICBMs in a matter of years. This is what I expect the 2030s to look like across science and technology.
- An industrial and economic explosion. Extremely accelerated technological progress, combined with the ability to automate all human labor, could dramatically accelerate economic growth (think: self-replicating robot factories quickly covering all of the Nevada desert ${ }^{47}$ ). The increase in growth probably wouldn't just be from $2 \% /$ year to $2.5 \% /$ year; rather, this would be a fundamental shift in the growth regime, more comparable to the historical step-change from very slow growth to a couple percent a year with the industrial revolution. We could see economic growth rates of $30 \% /$ year and beyond, quite possibly multiple doublings a year (see the short doubling-time sketch after this list). This follows fairly straightforwardly from economists' models of economic growth. To be sure, this may well be delayed by societal frictions; arcane regulation might ensure lawyers and doctors still need to be human, even if AI systems were much better at those jobs; surely sand will be thrown into the gears of rapidly expanding robo-factories as society resists the pace of change; and perhaps we'll want to retain human nannies; all of which would slow the growth of the overall GDP statistics. Still, in whatever domains we remove human-created barriers (e.g., competition might force us to do so for military production), we'd see an industrial explosion.

${ }^{46}$ The 10x speed robots doing physical R\&D in the real world is the "slow version"; in reality the superintelligences will try to do as much R\&D as possible in simulation, like AlphaFold or manufacturing "digital twins".
| Growth mode | Date began to dominate | Doubling time of global economy (years) |
| :--- | :--- | :--- |
| Hunting | 2,000,000 B.C. | 230,000 |
| Farming | 4700 B.C. | 860 |
| Science/commerce | 1730 A.D. | 58 |
| Industry | 1903 A.D. | 15 |
| Superintelligence? | 2030 A.D.? | $? ? ?$ |
Table 3: A shift in the growth regime is not unprecedented: as civilization went from hunting, to farming, to the blossoming of science and commerce, to industry, the pace of global economic growth accelerated. Superintelligence could kick off another shift in growth mode. Based on Robin Hanson's "Long-run growth as a sequence of exponential modes".

- Provide a decisive and overwhelming military advantage. Even early cognitive superintelligence might be enough here; perhaps some superhuman hacking scheme can deactivate adversary militaries. In any case, military power and technological progress have been tightly linked historically, and with extraordinarily rapid technological progress will come concomitant military revolutions. The drone swarms and roboarmies will be a big deal, but they are just the beginning; we should expect completely new kinds of weapons, from novel WMDs to invulnerable laser-based missile defense to things we can't yet fathom. Compared to pre-superintelligence arsenals, it'll be like 21st century militaries fighting a 19th century brigade of horses and bayonets. (I discuss how superintelligence could lead to a decisive military advantage in a later piece.)
- Be able to overthrow the US government. Whoever controls superintelligence will quite possibly have enough power to seize control from pre-superintelligence forces. Even without robots, the small civilization of superintelligences would be able to hack any undefended military, election, television, etc. system, cunningly persuade generals and electorates, economically outcompete nation-states, design new synthetic bioweapons and then pay a human in bitcoin to synthesize them, and so on. In the early 1500s, Cortes and about 500 Spaniards conquered the Aztec empire of several million; Pizarro and $\sim 300$ Spaniards conquered the Inca empire of several million; Alfonso and $\sim 1000$ Portuguese conquered the Indian Ocean. They didn't have god-like power, but the Old World's technological edge and an advantage in strategic and diplomatic cunning led to an utterly decisive advantage. Superintelligence might look similar.
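Back on the economics: as a quick sanity check on the growth-rate numbers in the industrial-and-economic-explosion bullet above, here is a minimal sketch (just standard compound-growth arithmetic):

```python
import math

def doubling_time_years(annual_growth_rate):
    """Years for the economy to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)

# Today's regime vs. the kind of growth rates discussed above.
for rate in (0.02, 0.30, 1.00):
    print(f"{rate:.0%}/year growth -> doubling time ~ {doubling_time_years(rate):.1f} years")
# ~2%/year doubles in ~35 years; 30%/year in ~2.6 years; 100%/year in 1 year.
# "Multiple doublings a year" would require growth rates well above 100%/year.
```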
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-071.jpg?height=914&width=1485&top_left_y=470&top_left_x=317)
Robots. A common objection to claims like those here is that, even if AI can do cognitive tasks, robotics is lagging way behind and so will be a brake on any real-world impacts.
I used to be sympathetic to this, but I've become convinced robots will not be a barrier. For years people claimed robots were a hardware problem-but robot hardware is well on its way to being solved.
Increasingly, it's clear that robots are an ML algorithms problem. LLMs had a much easier way to bootstrap: you had an entire internet to pretrain on. There's no similarly large dataset for robot actions, and so it requires more nifty approaches (e.g. using multimodal models as a base, then using synthetic data/simulation/clever RL) to train them.
Figure 26: Explosive growth starts in the narrower domain of AI R\&D; as we apply superintelligence to $R \& D$ in other fields, explosive growth will broaden.
There's a ton of energy directed at solving this now. But even if we don't solve it before AGI, our hundreds of millions of AGIs/superintelligences will make amazing AI researchers (as is the central argument of this piece!), and it seems very likely that they'll figure out the ML to make amazing robots work.
As such, while it's plausible that robots might cause a few years of delay (solving the ML problems, testing in the physical world in a way that is fundamentally slower than testing in simulation, ramping up initial robot production before the robots can build factories themselves, etc.)-I don't think it'll be more than that.
How all of this plays out over the 2030s is hard to predict (and a story for another time). But one thing, at least, is clear: we will be rapidly plunged into the most extreme situation humanity has ever faced.
Human-level AI systems, AGI, would be highly consequential in their own right-but in some sense, they would simply be a more efficient version of what we already know. But, very plausibly, within just a year, we would transition to much more alien systems, systems whose understanding and abilities-whose raw power-would exceed those even of humanity combined. There is a real possibility that we will lose control, as we are forced to hand off trust to AI systems during this rapid transition.
More generally, everything will just start happening incredibly fast. And the world will start going insane. Suppose we had gone through the geopolitical fever-pitches and man-made perils of the 20th century in mere years; that is the sort of situation we should expect post-superintelligence. By the end of it, superintelligent AI systems will be running our military and economy. During all of this insanity, we'd have extremely scarce time to make the right decisions. The challenges will be immense. It will take everything we've got to make it through in one piece.
The intelligence explosion and the immediate post-superintelligence period will be one of the most volatile, tense, dangerous, and wildest periods ever in human history.
And by the end of the decade, we'll likely be in the midst of it.
Confronting the possibility of an intelligence explosion-the emergence of superintelligence-often echoes the early debates around the possibility of a nuclear chain reaction-and the atomic bomb it would enable. HG Wells predicted the atomic bomb in a 1914 novel. When Szilard first conceived of the idea of a chain reaction in 1933, he couldn't convince anyone of it; it was pure theory. Once fission was empirically discovered in 1938, Szilard freaked out again and argued strongly for secrecy, and a few people started to wake up to the possibility of a bomb. Einstein hadn't considered the possibility of a chain reaction, but when Szilard confronted him, he was quick to see the implications and willing to do whatever needed to be done; he was willing to sound the alarm, and wasn't afraid of sounding foolish. But Fermi, Bohr, and most scientists thought the "conservative" thing was to play it down, rather than take seriously the extraordinary implications of the possibility of a bomb. Secrecy (to avoid sharing their advances with the Germans) and other all-out efforts seemed absurd to them. A chain reaction sounded too crazy. (Even when, as it turned out, a bomb was but half a decade from becoming reality.)
We must once again confront the possibility of a chain reaction. Perhaps it sounds speculative to you. But among senior scientists at AI labs, many see a rapid intelligence explosion as strikingly plausible. They can see it. Superintelligence is possible.
## III. The Challenges
## IIIa. Racing to the Trillion-Dollar Cluster
The most extraordinary techno-capital acceleration has been set in motion. As AI revenue grows rapidly, many trillions of dollars will go into GPU, datacenter, and power buildout before the end of the decade. The industrial mobilization, including growing US electricity production by tens of percent, will be intense.
You see, I told you it couldn't be done without turning the whole country into a factory. You have done just that.
NIELS BOHR (to Edward Teller, upon learning of the scale of the Manhattan Project in 1944)
The race to AGI won't just play out in code and behind laptops-it'll be a race to mobilize America's industrial might. Unlike anything else we've recently seen come out of Silicon Valley, AI is a massive industrial process: each new model requires a giant new cluster, soon giant new power plants, and eventually giant new chip fabs. The investments involved are staggering. But behind the scenes, they are already in motion.
In this chapter, I'll walk you through numbers to give you a sense of what this will mean:
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-076.jpg?height=973&width=1675&top_left_y=240&top_left_x=239)

Figure 27: The trillion-dollar cluster. Credit: DALLE

- As revenue from AI products grows rapidly-plausibly hitting a $\$ 100 B$ annual run rate for companies like Google or Microsoft by $\sim 2026$, with powerful but pre-AGI systems-that will motivate ever-greater capital mobilization, and total AI investment will grow accordingly.
- We're on the path to individual training clusters costing \$100s of billions by 2028-clusters requiring power equivalent to a small/medium US state and more expensive than the International Space Station.
- By the end of the decade, we are headed to \$1T+ individual training clusters, requiring power equivalent to $>20 \%$ of US electricity production. Trillions of dollars of capex will churn out 100s of millions of GPUs per year overall.
Nvidia shocked the world as its datacenter sales exploded from about $\$ 14 \mathrm{~B}$ annualized to about $\$ 90 \mathrm{~B}$ annualized in the last year. But that's still just the very beginning.
## Training compute
Earlier, we found a roughly $\sim 0.5 \mathrm{OOMs}^{48} /$ year trend growth of AI training compute. If this trend were to continue for the rest of the decade, what would that mean for the largest training clusters?

${ }^{48}$ As mentioned earlier, OOM = order of magnitude; $10 \mathrm{x}=1$ order of magnitude.
| Year | OOMs | H100s-equivalent | Cost | Power | Power reference class |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 2022 | $\sim$ GPT-4 cluster | $\sim 10 \mathrm{k}$ | $\sim \$ 500 \mathrm{M}$ | $\sim 10 \mathrm{MW}$ | $\sim 10,000$ average homes |
| $\sim 2024$ | +1 OOM | $\sim 100 \mathrm{k}$ | \$billions | $\sim 100 \mathrm{MW}$ | $\sim 100,000$ homes |
| $\sim 2026$ | +2 OOMs | $\sim 1 \mathrm{M}$ | \$10s of billions | $\sim 1 \mathrm{GW}$ | The Hoover Dam, or a large nuclear reactor |
| $\sim 2028$ | +3 OOMs | $\sim 10 \mathrm{M}$ | \$100s of billions | $\sim 10 \mathrm{GW}$ | A small/medium US state |
| $\sim 2030$ | +4 OOMs | $\sim 100 \mathrm{M}$ | $\$ 1 \mathrm{~T}+$ | $\sim 100 \mathrm{GW}$ | $>20 \%$ of US electricity production |
This may seem hard to believe-but it appears to be happening. Zuck bought 350k H100s. Amazon bought a 1GW datacenter campus next to a nuclear power plant. Rumors suggest a 1GW, 1.4M H100-equivalent cluster (a 2026-cluster) is being built in Kuwait. Media report that Microsoft and OpenAI are rumored to be working on a \$100B cluster, slated for 2028 (a cost comparable to the International Space Station!). And as each generation of models shocks the world, further acceleration may yet be in store.
Perhaps the wildest part is that willingness-to-spend doesn't even seem to be the binding constraint at the moment, at least for training clusters. It's finding the infrastructure itself: "Where do I find 10GW?" (power for the $\$ 100 \mathrm{~B}+$ trend 2028 cluster) is a favorite topic of conversation in SF. What any compute guy is thinking about is securing power, land, permitting, and datacenter construction. ${ }^{49}$ While it may take you a year of waiting to get the GPUs, the lead times for these are much longer still.
Table 4: Scaling the largest training clusters, rough back-of-the-envelope calculations. For details on the calculations, see Appendix.[^16]
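As a rough illustration of the kind of back-of-the-envelope scaling behind Table 4 (a minimal sketch: the 2022 baseline is read off the table, the per-chip cost and power are backed out of that baseline, and the trend is taken as ~0.5 OOMs/year; none of this substitutes for the more careful appendix calculations):

```python
# Rough extrapolation of largest-training-cluster scale, in the spirit of Table 4.
# Baseline (2022, ~GPT-4 cluster), read off the table: ~10k H100-equivalents,
# ~$500M, ~10 MW. Assumed trend: ~0.5 OOMs/year, i.e. +1 OOM every ~2 years.
base_year, base_h100e, base_cost_usd, base_power_w = 2022, 1e4, 5e8, 1e7

for year in (2024, 2026, 2028, 2030):
    ooms = 0.5 * (year - base_year)
    scale = 10 ** ooms
    h100e = base_h100e * scale
    cost = base_cost_usd * scale
    power_gw = base_power_w * scale / 1e9
    print(f"~{year}: +{ooms:.0f} OOMs -> ~{h100e:,.0f} H100-equivalents, "
          f"~${cost / 1e9:,.0f}B, ~{power_gw:,.1f} GW")
# 2030 comes out around 100M H100-equivalents and ~100 GW, roughly the
# trillion-dollar-cluster row of Table 4. The naive cost extrapolation overshoots
# the table's $1T+ because it ignores improving cost per effective FLOP over time.
```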
The trillion-dollar cluster-+4 OOMs from the GPT-4 cluster, the $\sim 2030$ training cluster on the current trend-will be a truly extraordinary effort. The 100 GW of power it'll require is equivalent to $>20 \%$ of US electricity production; imagine not just a simple warehouse with GPUs, but hundreds of power plants. Perhaps it will take a national consortium.
(Note that I think it's pretty likely we'll only need a $\sim \$ 100 B$ cluster, or less, for AGI. The \$1T cluster might be what we'll train and run superintelligence on, or what we'll use for AGI if AGI is harder than expected. In any case, in a post-AGI world, having the most compute will probably still really matter.)
## Overall compute
The above are just rough numbers for the largest training clusters. Overall investment is likely to be much larger still: a large fraction of GPUs will probably be used for inference ${ }^{50}$ (GPUs to actually run the AI systems for products), and there could be multiple players with giant clusters in the race.
My rough estimate is that 2024 will already feature \$100B-\$200B of AI investment:
- Nvidia datacenter revenue will hit a $\sim \$ 25 \mathrm{~B} /$ quarter run rate soon, i.e. $\sim \$ 100 B$ of capex flowing via Nvidia alone. But of course, Nvidia isn't the only player (Google's TPUs are great too!), and close to half of datacenter capex is on things other than the chips (site, building, cooling, power, etc.). ${ }^{51}$
- Big tech has been dramatically ramping their capex numbers: Microsoft and Google will likely do $\$ 50 \mathrm{~B}+{ }^{52}$, AWS and Meta $\$ 40 \mathrm{~B}+$, in capex this year. Not all of this is AI, but combined their capex will have grown \$50B-\$100B year-over-year because of the AI boom, and even then they are still cutting back on other capex to shift even more spending to AI. Moreover, other cloud providers, companies (e.g., Tesla is spending $\$ 10 \mathrm{~B}$ on AI this year), and nation-states are investing in AI as well.

${ }^{51}$ For example, this total-cost-of-ownership analysis estimates that around $40 \%$ of a large cluster cost is the H100 GPUs themselves, and another $13 \%$ goes to Nvidia for Infiniband networking. That said, excluding cost of capital in that calculation would mean the GPUs are about $50 \%$ of the cost, and with networking Nvidia gets a bit over $60 \%$ of the cost of the cluster.

${ }^{52}$ And apparently, despite Microsoft growing capex by $79 \%$ compared to a year ago in a recent quarter, their AI cloud demand still exceeds supply!
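A rough sketch of how the \$100B-\$200B 2024 estimate can be assembled from the figures above (the GPU share of total cluster cost is the roughly-half figure from the footnote; everything here is order-of-magnitude):

```python
# Back-of-the-envelope for total 2024 AI investment, using the figures above.
nvidia_dc_run_rate = 25e9 * 4     # ~$25B/quarter -> ~$100B/year flowing via Nvidia
gpu_share_of_cluster_cost = 0.5   # GPUs are very roughly half of total cluster capex
                                  # (per the TCO footnote; the rest is site, building,
                                  # cooling, power, networking, etc.)

implied_cluster_capex = nvidia_dc_run_rate / gpu_share_of_cluster_cost
print(f"Implied capex on Nvidia-based clusters alone: ~${implied_cluster_capex / 1e9:.0f}B")
# -> ~$200B of datacenter buildout implied by Nvidia shipments alone, consistent with
# the $100B-$200B range above (and partly overlapping with the big-tech capex figures).
```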
![](https://cdn.mathpix.com/cropped/2024_07_13_724a888048ed9b569582g-079.jpg?height=1610&width=1674&top_left_y=248&top_left_x=236)
Figure 28: Quarterly Nvidia datacenter revenue. Plot by Thomas Woodside
Figure 29: Big tech capex is growing extremely rapidly since ChatGPT unleashed the AI boom. Source.
Let's play this forward. My best guess is overall compute investments will grow more slowly than the $3 \mathrm{x} /$ year growth of the largest training clusters-let's say $2 \mathrm{x} /$ year. ${ }^{53}$

And these aren't just my idiosyncratic numbers. AMD forecasted a $\$ 400 \mathrm{~B}$ AI accelerator market by 2027, implying $\$ 700 \mathrm{~B}+$ of total AI spending, pretty close to my numbers (and they are surely much less "AGI-pilled" than I am). Sam Altman is reported to be in talks to raise funds for a project of "up to $\$ 7 \mathrm{~T}$" in capex to build out AI compute capacity (the number was widely mocked, but it seems less crazy if you run the numbers here...). One way or another, this massive scaleup is happening.

| Year | Annual investment | AI accelerator shipments (in H100s-equivalent) | Power as \% of US electricity production | Chips as \% of current leading-edge TSMC wafer production |
| :--- | :--- | :--- | :--- | :--- |
| 2024 | $\sim \$ 150 \mathrm{~B}$ | $\sim 5-10 \mathrm{M}$ | $1-2 \%$ | $5-10 \%$ |
| $\sim 2026$ | $\sim \$ 500 \mathrm{~B}$ | $\sim 10$s of millions | $5 \%$ | $\sim 25 \%$ |
| $\sim 2028$ | $\sim \$ 2 \mathrm{~T}$ | $\sim 100 \mathrm{M}$ | $20 \%$ | $\sim 100 \%$ |
| $\sim 2030$ | $\sim \$ 8 \mathrm{~T}$ | $\sim 100$s of millions | $100 \%$ | $4 \times$ current capacity |

Table 5: Playing forward trends on total world AI investment. Rough back-of-the-envelope calculation. For some further details on the calculations, see Appendix.
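A minimal sketch of the extrapolation behind Table 5 (just the ~\$150B 2024 base compounding at the assumed 2x/year; the shipment, power, and wafer columns come from separate assumptions in the appendix):

```python
# Total annual AI investment, compounding the ~$150B 2024 estimate at ~2x/year.
base_year, base_investment = 2024, 150e9
growth_per_year = 2.0

for year in (2024, 2026, 2028, 2030):
    investment = base_investment * growth_per_year ** (year - base_year)
    print(f"~{year}: ~${investment / 1e12:.1f}T" if investment >= 1e12
          else f"~{year}: ~${investment / 1e9:.0f}B")
# -> ~$150B, ~$600B, ~$2.4T, ~$9.6T: the same ballpark as Table 5's
# ~$150B / ~$500B / ~$2T / ~$8T rows (the table rounds down a bit).
```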
Will it be done? Can it be done?
The scale of investment postulated here may seem fantastical. But both the demand-side and the supply-side seem like they could support the above trajectory. The economic returns justify the investment, the scale of expenditures is not unprecedented for a new general-purpose technology, and the industrial mobilization for power and chips is doable.
## AI revenue
Companies will make large AI investments if they expect the economic returns to justify it.
Reports suggest OpenAI was at a $\$ 1 \mathrm{~B}$ revenue run rate in August 2023, and a $\$ 2 \mathrm{~B}$ revenue run rate in February 2024. That's roughly a doubling every 6 months. If that trend holds, we should see a $\sim \$ 10 \mathrm{~B}$ annual run rate by late 2024/early 2025, even without pricing in a massive surge from any next-generation model. One estimate puts Microsoft at $\sim \$ 5 \mathrm{~B}$ of incremental AI revenue already.
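The run-rate extrapolation here is simple compounding; a minimal sketch (dates treated as 6-month steps from the reported ~\$2B February 2024 figure):

```python
# OpenAI revenue run rate, assuming it keeps doubling every ~6 months
# from the reported ~$2B run rate in February 2024.
run_rate = 2e9
for date in ("Aug 2024", "Feb 2025", "Aug 2025"):
    run_rate *= 2
    print(f"{date}: ~${run_rate / 1e9:.0f}B annual run rate")
# -> ~$4B, ~$8B, ~$16B: i.e. roughly a $10B run rate around late 2024 / early 2025.
```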
So far, every 10x scaleup in AI investment seems to yield the necessary returns. GPT-3.5 unleashed the ChatGPT mania. The estimated $\$ 500 \mathrm{M}$ cost for the GPT-4 cluster would have been paid off by the reported billions of annual revenue for Microsoft and OpenAI (see above calculations), and a "2024-class" training cluster in the billions will easily pay off if Microsoft/OpenAI AI revenue continues on track to a $\$ 10 \mathrm{~B}+$ revenue run rate. The boom is investment-led: it takes time from a huge order of GPUs to build the clusters, build the models, and roll them out, and the clusters being planned today are many years out. But if the returns on the last GPU order keep materializing, investment will continue to skyrocket (and outpace revenue), plowing in even more capital in a bet that the next 10x will keep paying off.