Tensor Labbet: A blog of deep learnings

It's Hard to Feel the AGI

In this post: A reality check from leading researchers
(6 min read)

In the heat of the ongoing AI summer, a chilling effect is starting to spread from the growing cracks between marketing claims and the underlying technology. Against a backdrop of comparisons with the dot-com bubble, some of the most accomplished minds in the field are beginning to revise their projections for what can realistically be expected in the near future.

Ilya Sutskever shared his view on a recent podcast that the current approach of transformer-based LLMs is likely to stall out in the coming years as the scaling paradigm hits a ceiling. He notes a remarkable discrepancy between their excellent performance in evaluations and their inadequate generalization and low economic impact in practice. He argues that fundamentally new research insights are needed to break through this plateau.

Moreover, he expresses doubts about the future profitability of the current business models around LLMs, despite massive potential revenues, due to a lack of differentiation between competitors. Ultimately, he pushes his estimate for the emergence of systems with human-like learning abilities back by 5-20 years. His startup, Safe Superintelligence Inc., is currently exploring research ideas that may identify viable new approaches towards this goal.

Coming from the former chief scientist of OpenAI, these doubts about the future direction and profitability of its business model should raise concerns. OpenAI plays a central role in what has been described as circular investment dealings related to enormous investments into hardware and data centers. The latter have been claimed to account for over 90% of growth in US GDP over the first half of 2025.

The company is now seeking unprecedented funding for future spending commitments, with controversy around its CFO publicly commenting on its financial innovations and seemingly floating the idea that the US government could act as a financial backstop.

Andrej Karpathy previously featured on the same podcast and voiced a noteworthy critique of the current industry hype around LLM-based AI Agents. He argues that, while impressive, the technology still needs a decade of work and improvements. Only then could they hope to reach the promised level of performing like an automated employee or coworker, whereas currently “they’re cognitively lacking and it’s just not working”.

He expects these systems to contribute to gradual economic growth within the ongoing, long-term compounding pattern seen ever since the onset of the industrial revolution rather than a sudden jump in GDP.

He compares this comparatively slow development to earlier ambitions around automating radiology and self-driving cars. Despite witnessing impressive demos for the latter as early as 2014 and contributing as director of AI at Tesla, he argues that neither field has reached this goal yet. He points out that current self-driving technology still requires human supervision and frequent manual intervention by remote operators.

Development efforts over this decade instead faced diminishing returns, with ‘the march of nines’ in reliability requiring a constant amount of effort for the same relative reduction in errors. Ultimately, he revised ‘the year of agents’ to be ‘the decade of agents’ instead.

In the software industry, widespread reporting posits AI and automation as the main cause for large-scale layoffs, but the degree of autonomy vs supervision that tools for agentic code generation require remains controversial. As an example, a recent study deployed frontier AI agent frameworks on projects sourced from online freelancing platforms with a success rate of just 2.5%.

Rich Sutton appeared on the same podcast in a rather contentious earlier episode to share his view that LLMs are a dead end in AI research. He argues that, while surprisingly effective, LLMs have no internal ‘world model’ based on which they could explore potential actions and predict their outcomes, but instead merely mimic human use of language through imitation learning. He points out their lack of any actual goal towards which to take action as opposed to just mechanistically processing tokens.

He notes that, whereas LLMs gain the ability for next-token prediction in a supervised learning phase, no such thing occurs in nature. Instead, he emphasizes their critical lack of continual learning abilities. According to his ‘Big World Hypothesis’ the world is too complex for any agent to successfully navigate without this ability to adapt and learn from experience.

He mentions Moravec’s Paradox, which contrasts the ease with which machines can imitate highly evolved, specific cognitive tasks with their inability to perform lower-order, largely unconscious biological functions such as sensory, motor and social skills.

He raises conceptual limitations of deep learning and gradient descent for generalization. While he considers machine superintelligence as ultimately inevitable, he also highlights that even the intelligence exhibited by a squirrel remains fundamentally beyond our current understanding.

Yann LeCun has been a long-standing critic of the idea that LLMs could scale to human-level intelligence. Over recent years he shared many insights and concerns in this regard that overlap with those listed above.

He argues that language is not intelligence. He considers it a low-bandwidth modality with a discrete and restricted vocabulary, over which language models can predict probability distributions to determine which words or symbols are likely to follow one another. The physical world, by contrast, is experienced through human vision with high bandwidth, in high-dimensional and continuous representations that cannot be enumerated in the same way to form a probability distribution. Similar to Rich Sutton, he therefore argues that LLMs fundamentally lack an adequate mental model of the physical world in which they could plan a sequence of actions to arrive at an intended goal. This limits their cognitive abilities to a level below the intelligence of young children, or even of cats and dogs with no language abilities.

He expects that a truly intelligent system could acquire common sense and an understanding of the physical world from multimodal inputs like video, operate with persistent memory and perform reasoning and planning. He considers LLMs incapable of inventing solutions to new problems rather than merely performing knowledge retrieval from vast training data.

To him, the current exuberance around LLMs is not a new phenomenon, with parallels to the hype around expert systems of the 80s that set high expectations followed by failures and disillusionment. Although he does not consider it likely that any isolated group could suddenly discover the secret to AGI, he expects the required capabilities to be gradually developed through innovations from industry and academia. Along with his own work on a Joint Embedding Predictive Architecture (JEPA), these may become more viable within the coming 3-5 years.

Conclusion

These findings may not be entirely unexpected for many who have been following the field for a while. What is more surprising is the consensus that is taking hold even among those who previously presented much more optimistic timelines, sometimes in a context of financial incentives.

In sober review, there are undoubtedly many tangible achievements that have been reached by LLMs and other generative models. For tasks like generating text, images, video and audio, brainstorming, planning, summarization and agentic tasks with varying degrees of human supervision for software engineering and more, the technology can already provide genuine value for years to come.

Pinpointing the limits of their autonomy will remain a challenging task with a moving target. This nuance will hopefully not be lost if the industry climate drops to a new ‘AI Winter’ among disillusioned investors who may have massively bought into pre-orders for science fiction that conflated machine learning with human-like robots.

But is it conceivable that the collective wisdom of investors overprovisioned billions of dollars in tech funding in vain? If so, we could take consolation in a thought few have had the courage to consider, round up these hundreds of thousands of GPUs, take a daring bet, and use them instead to finally fire up the Metaverse.

When Machines that Simulate Intelligence Seemed Like a Summer Project

In this post: A look back at pioneering thoughts on AI research
(7 min read)

I recently stumbled upon a research proposal that must have raised a lot of eyebrows and received widespread attention in machine learning circles:

‘We propose that a 2 month, 10 man study of artificial intelligence be carried out […]. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.’

The summer in question was 1956, now almost seventy years ago. The resulting meeting at Dartmouth College in Hanover, New Hampshire, has often been called the time and place that established research on ‘artificial intelligence’ as a scientific field of its own.

Even today, the research proposal itself still makes for a fascinating read. Many of its core themes remain remarkably relevant, whereas others played out in unexpected directions. This blog post is dedicated to some observations on how these pioneering thoughts from over two generations ago are reflected in the technological reality of today.

The Conjecture

Just like in the hype of today, it would only take two years for the media to ascribe human-like intent, abilities and emotions to early implementations of these research concepts.

The proposal itself, however, is remarkably sober. It largely avoids the notion of human-like machines that are intelligent and instead explicitly aims for machines able to simulate intelligence.

As one of the originators of the proposal, Marvin Minsky describes how the actions of a machine “[…] would seem rather clever, and the behaviour would have to be regarded as rather ‘imaginative’” if it were designed to build an internal, abstract model of its environment in which to first explore solutions to a given task before taking action.

Nowadays, there is an ongoing debate about whether multi-modal LLMs form an internal ‘world model’ that mirrors this idea. And it is not the only concept that rings surprisingly familiar from the discourse of today. The proposal specifically lists seven major themes to be examined towards its goal.

Aspects of the Artificial Intelligence Problem

1. Automatic Computers are proposed as able to simulate the behaviour of any machine, and ultimately even higher functions of the human brain, limited mainly by lack of efficient algorithms rather than computing resources.

Looking back today, the jury is still out on the first half of this point, and it seems unclear whether it will ever be attained. The second part of the assertion, however, turned out exactly opposite to what was proposed here, with the ‘Bitter Lesson’ by Rich Sutton concluding that, in general, increasing computing power has proven far more impactful than clever design of algorithms.

2. How can a computer be programmed to use a language and perform reasoning based on words, forming sentences that imply one another and resemble human thought?

Here, the field of natural language processing made substantial progress. Examples include techniques that allow for words to be encoded as embeddings, like word2vec and language models like BERT that can furthermore relate embedded words to one another. Today, LLMs can iterate on intermediate processing steps in natural language with techniques like Chain-of-Thought, marketed outright as ‘reasoning’ with ‘thinking’ models.
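The core idea behind such embeddings can be illustrated with a toy sketch (the 3-dimensional vectors below are made up for illustration; real word2vec embeddings are learned from data and have hundreds of dimensions):

```python
def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # close to 1 for similar directions, near 0 for unrelated ones
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a trained model places related words close together
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.20],
    "apple": [0.10, 0.20, 0.90],
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high, ~0.996
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower
```

Models like BERT then go a step further by making each word's embedding depend on its surrounding context.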

3. Neuron nets were proposed as a promising approach, to be arranged so as to form concepts.

This research direction, among all competing paradigms of the time, proved to be spot on and foreshadowed what would later become known as deep learning. The McCulloch-Pitts neuron model had already been invented over a decade earlier. Arranging variants of this neuron model into network architectures with ever more layers would ultimately give rise to deep neural networks that enabled many of the most decisive capabilities within the field of AI research today.

4. Theory of the size of a calculation was to be developed, to quantify the efficiency of calculations and the complexity of functions.

Later work on algorithmic time complexity and space complexity would explore these questions more deeply from the sixties onwards. Concepts of information theory such as the Minimum Description Length principle and Kolmogorov complexity would furthermore be developed, providing some theoretical backing to the power of machine learning with neural networks.

5. Self-improvement was expected to be a defining theme of intelligence.

Here, the reality of today is somewhat mixed. Machine learning emerged as a sub-discipline of AI research, with algorithms designed to adapt to given data. Many methods feature some aspects of self-improvement.

Generative adversarial networks (GANs) feature two competing networks, with one facilitating the training process of the other. In reinforcement learning, improvements occur dynamically from interaction with an (often simulated) environment. Online learning and related methods also see ongoing improvement, often used in recommender systems. Self-criticism in LLMs enables them to refine their outputs during inference time.

Nonetheless, as of today, it is still common for many methods to feature a distinct training phase, after which trainable parameters are locked or frozen before deployment. Capabilities for an exponentially compounding self-improvement across varied skill sets still appear out of reach for now.

6. Abstractions formed by machines from sensory and other data were to be explored.

This point has seen substantial progress, like the ability to encode language and image data into relatable embeddings with models like CLIP, which can also be used to control image generation with natural language via CLIP-guided denoising diffusion, as in Stable Diffusion. Likewise, manipulation of the latent space learned by models powers other abstract capabilities, such as image inpainting with generative fill.

7. Randomness and creativity were conjectured to be related, with injections of controlled randomness enabling orderly thinking to reach imaginative solutions.

This direction took a surprising turn, as the success of generative models caught even many experts off guard. Randomness indeed turned out to be an important factor, from the randomized noise that serves as input to image generators like GANs and Stable Diffusion to the temperature setting in LLMs that controls how often less probable outputs are produced.
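The effect of the temperature setting can be sketched with a temperature-scaled softmax (a minimal sketch; real LLM samplers typically add further tricks such as top-k or nucleus sampling):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before the softmax:
    # low temperatures sharpen the distribution towards the most likely token,
    # high temperatures flatten it, letting improbable tokens through more often
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 0.5))  # sharply peaked on the first token
print(softmax_with_temperature(logits, 2.0))  # much closer to uniform
```

Sampling from the flattened distribution is exactly the "injection of controlled randomness" the proposal speculated about.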

Creativity was long considered a defining trait that separated humans from machines. Until recently, a prevailing line of thinking was therefore that machines could eventually automate all menial and administrative tasks and give humans the leisure to socialize and pursue poetry, music, arts and crafts.

Chances are that the originators of the Dartmouth proposal would have been stunned by haikus generated with LLMs, images from Stable Diffusion, videos from Veo or music from Suno. When looking at the results of all that put together, they may have in fact wanted to row things back.

Conclusion

For many of these ideas, it might be still too early to make a final call, as the ultimate goals of the proposal have arguably still not been fully attained. With time, some approaches that are hyped today may well seem like dead ends in hindsight, whereas others will seem obvious.

There is much more to the document yet than can be unpacked here, and this does not even include the individual proposals by its originators, John McCarthy, Marvin Minsky, Nathaniel Rochester and Claude Shannon. Many more big names were ultimately involved in this summer project, but all that may well warrant another blog post in the future.

Reviewing “AI Engineering” by Chip Huyen

In this post: A book review
(5 min read)

In January this year, Chip Huyen published her newest book ‘AI Engineering’, which quickly made waves online.

Having read her previous book ‘Designing Machine Learning Systems’ (2022), which I warmly recommend, I wondered what could possibly remain to be covered in 500+ additional pages. It turned out to be something entirely different, and this blog post will attempt a short review for anyone still curious about the book.

Machine Learning vs AI Engineering

Her previous book, ‘Designing Machine Learning Systems’, was a gentle but comprehensive overview of machine learning terms, techniques and applications that went easy on mathematical notation. While it briefly mentioned Large Language Models (LLMs) and remains highly relevant, it was published about half a year before the release of ChatGPT in November 2022, pre-dating its impact on the field.

The new book ‘AI Engineering’ is now entirely dedicated to working with LLMs. Subtitled ‘Building Applications with Foundation Models’, it addresses a much wider audience than just ML Engineers or Researchers with a technical or scientific interest. Indeed, it clearly sets ML apart from the scope of the book and expects little prior knowledge of the field. The contents therefore have hardly any overlap with the previous book and are mostly complementary to it.

Format and Style

The field is now moving faster than anyone could hope to read about, with sensational new announcements almost every week. Technical specifics are quickly outdated and the book accordingly sticks to more timeless high-level concepts, insights and approaches. These are nonetheless often supported by hard evidence from statistics or relevant papers and first-hand accounts from industry experts. However, this also tends to make it a quite verbose, lighter read of often more ‘qualitative’ nature, with plenty of examples and a minimum of mathematical notation and formulas.

Content

The author has a hopeful but grounded take on the capabilities of LLMs, with the credentials to back it up. The book provides many use cases and guides, but also known failure modes backed by the scientific literature, along with techniques to mitigate them.

The contents include concepts such as pre- and post-training, retrieval-augmented generation (RAG), agents, tool-use, evaluation techniques, emergent properties and quirks of the models. It offers strategies for application development, an in-depth chapter on fine-tuning with techniques such as Low-Rank Adaptation (LoRA) as well as optimization strategies for inference and more.

The facts laid out in the book are well-researched and appear solid overall. However, it does repeat the common claim that the 2012 AlexNet paper was the first to utilize GPUs for training neural networks. As mentioned in my previous summary on AlexNet, the truth seems somewhat more nuanced. Chances are that the author already has a rebuttal from Jürgen Schmidhuber waiting in her inbox.

Conclusion

The book provides a comprehensive overview and also offers many insights that were new to me, from the mapped-out political leanings of LLMs and the availability of training material in different languages to prompt injection attacks that could leak sensitive training data when a model is tasked with merely repeating the word ‘poem’ an infinite number of times.

The distinction between ML and AI Engineering itself is thought-provoking. It illustrates how ML-based capabilities which used to be academic research topics are rapidly becoming applicable, abstracted and commoditized via APIs.

As with any other API, these capabilities thereby become accessible even without any deeper understanding of the field. Such understanding nonetheless helps, as simple LLM wrappers still run into many limitations laid out in the book (hallucinations, costs, prompt injection attacks, etc.) and are still rarely competitive for hard engineering challenges (as seen, e.g., in Kaggle challenges).

The book provides a solid foundation in this regard. Even with a more technical background, its balance between breadth and depth covered many gaps in my knowledge and makes it likely to remain relevant for years to come, so that it was well worth reading for me.

Disclaimer: I have no affiliation with the author and this review is not monetized, sponsored or funded in any other way. Originally, this post was meant to also review the book ‘Alice’s Adventures in a Differentiable Wonderland’ by Simone Scardapane that I read earlier, but that will remain for a future post.

What Deep Learning can do for Image Segmentation in Radiology

In this post: From Fully Convolutional Networks to TotalSegmentator
(10 min read)

Amidst the ongoing hype around the growing capabilities of large language models, it is curious to note how earlier predictions about machine learning have stood the test of time.

Autonomous driving and radiology in particular were considered obvious candidates for automation, starting with the deep learning boom for image recognition around 2012. Geoffrey Hinton famously suggested in 2016 that training of radiologists should be discontinued altogether, arguing that they would be obsolete within five years.

And yet, things turned out quite differently. [Figure: Radiologist Jobs in 2025] Once at work, a US radiologist in 2025 may earn $265-495k per year or pick from a number of job openings that actually appears to be increasing.


This is a sobering reminder that many real-world problems turned out to be much harder to crack than expected at first. Although the AI hype has since mostly turned elsewhere, a closer look at what happened in these fields can be rather interesting.

This blog post examines one of their most active, and arguably most successful, research areas and reviews a decade's worth of progress around deep learning methods for semantic segmentation of medical images in radiology.

Semantic Segmentation for Radiology

From magnetic resonance imaging (MRI) to computed tomography (CT), medical imaging offers various modalities for visualising the human body. Just like the anatomy itself, these images are often three-dimensional and thus composed of volumetric pixels, or voxels, similar to the block worlds of Minecraft.

Early on, measurements and findings were often reported from two-dimensional X-ray images. With increasingly affordable imaging technology, the data has since grown both in quantity and resolution, leaving mere seconds on average for a radiologist to inspect images that can each consist of millions of voxels.

Semantic segmentation is one type of image analysis that is commonly performed in research and industry on these images. It aims to assign a class label to every pixel or voxel, typically to mark all parts of the image that contain a certain tissue or structure. Once complete, a segmentation mask can be used to measure volumes, render surface models or plan radiation treatments.
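The volume measurement, for instance, boils down to counting foreground voxels and multiplying by the physical voxel size (a minimal sketch, assuming a NumPy mask and a known voxel spacing in millimetres):

```python
import numpy as np

def mask_volume_ml(mask, spacing_mm=(1.0, 1.0, 1.0)):
    # Each voxel covers spacing_mm[0] * spacing_mm[1] * spacing_mm[2] cubic mm;
    # one millilitre equals 1000 cubic millimetres
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    return float(mask.astype(bool).sum()) * voxel_mm3 / 1000.0

# A 10x10x10 block of foreground voxels at 2 mm isotropic spacing:
# 1000 voxels * 8 mm^3 = 8000 mm^3 = 8 ml
mask = np.zeros((64, 64, 64), dtype=np.uint8)
mask[:10, :10, :10] = 1
print(mask_volume_ml(mask, spacing_mm=(2.0, 2.0, 2.0)))  # → 8.0
```
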

Video Example:
- Manual CT Image Segmentation in 3D Slicer (YouTube) [7:43 minutes]

When done by hand, a given volumetric image is typically segmented by drawing on it as a stack of dozens or hundreds of two-dimensional slices. This can take from minutes to hours or even days. Results may vary not only between different operators but also when the same image is analysed repeatedly by the same person. This becomes a challenge in studies where hundreds or even thousands of scans are to be analysed in this way.

Despite its many issues, manual segmentation remains the method of choice for many real-world projects that operate in risk-averse settings under restrictive regulatory constraints. On the flip side, these same regulations and concerns have helped to reduce the number of scenarios where malfunctioning software would burn patients with radiation or confuse surgeons with misleading 3D navigation views.

Risk-averse skeptics were furthermore proven right on several occasions before when they doubted claims about technology being superior to medical experts. In the late 90s, computer-aided detection (CAD) systems were funded with millions of dollars per year but later reported to provide no benefit or perhaps even cause harm. The deep learning boom later caused an entire flood of such claims, like in the 2017 CheXNet paper co-authored by Andrew Ng, the issues of which are discussed with many insights in the blog of Lauren Oakden-Rayner.

Despite this history of overpromising, the work done by a multitude of researchers and engineers over the years has nonetheless achieved some impressive progress. Especially deep learning systems for semantic segmentation of medical images have a lot to offer and are worth a closer look.

Deep Learning for Semantic Image Segmentation

In the ImageNet challenge of 2012, AlexNet famously set a new benchmark result for image recognition by assigning one of 1,000 possible class labels to a given input image with unprecedented accuracy.

Sliding window segmentation used such image classifiers for semantic segmentation by applying them at every position of an image: the classifier receives a patch around the current position as input and predicts a class label for the central pixel. This approach suffered from inefficiencies, however, with redundant processing wherever a given image area appeared in multiple, adjacent patches.
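The scheme can be sketched in a few lines, with a hypothetical classify_patch function standing in for the trained classifier (note how neighbouring patches overlap almost entirely, which is exactly the redundancy mentioned above):

```python
import numpy as np

def sliding_window_segment(image, classify_patch, patch_size=5):
    # Classify every pixel from the patch centred on it;
    # reflect-padding gives border pixels a full patch too
    r = patch_size // 2
    padded = np.pad(image, r, mode="reflect")
    h, w = image.shape
    labels = np.zeros((h, w), dtype=int)
    for y in range(h):
        for x in range(w):
            # One full classifier evaluation per pixel: h * w forward passes
            labels[y, x] = classify_patch(padded[y : y + patch_size, x : x + patch_size])
    return labels

# Toy stand-in classifier: call a pixel foreground if its patch mean is high
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0
seg = sliding_window_segment(image, lambda p: int(p.mean() > 0.5))
```
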

[Fully Convolutional Networks, 2015] were proposed as a neural network architecture for dense prediction of pixel-wise labels. This approach removes any fully-connected layers from established architectures for image classification. The remaining convolutional and pooling layers act as an encoder, producing feature maps that retain spatial image information at different resolutions. From these, a 1x1 convolution produces one feature map for each class, to be upsampled by transposed convolution layers to restore the original image dimensions.
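The 1x1 convolution at the end can be read as a per-pixel linear map across the channel dimension, producing one coarse score map per class (a minimal NumPy sketch with made-up shapes):

```python
import numpy as np

def conv1x1(features, weights):
    # features: (channels, height, width); weights: (num_classes, channels)
    # A 1x1 convolution mixes channels at each spatial position independently,
    # without looking at any neighbouring pixels
    return np.einsum("oc,chw->ohw", weights, features)

features = np.random.randn(16, 8, 8)  # encoder output: 16 low-resolution feature maps
weights = np.random.randn(3, 16)      # hypothetical weights for 3 classes
class_maps = conv1x1(features, weights)
print(class_maps.shape)  # → (3, 8, 8): one score map per class, still coarse
```

These coarse per-class maps are what the transposed convolution layers then upsample back to the input resolution.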

Transposed Convolution (Animations)
- Convolution arithmetic (GitHub) by Vincent Dumoulin, Francesco Visin

Note: Transposed convolutions differ from dilated (or à trous) convolutions that featured in a previous blog post.
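In one dimension, a transposed convolution can be sketched as a scatter-add: each input value stamps a scaled copy of the kernel into a larger output, with the stride controlling the upsampling factor (a minimal sketch mirroring the animations linked above):

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    # Output length follows the transposed-convolution arithmetic:
    # stride * (n - 1) + len(kernel)
    n, m = len(x), len(kernel)
    out = np.zeros(stride * (n - 1) + m)
    for i, v in enumerate(x):
        # Each input element adds a scaled kernel copy at its strided position;
        # overlapping contributions are summed
        out[i * stride : i * stride + m] += v * kernel
    return out

print(transposed_conv1d(np.array([1.0, 2.0]), np.array([1.0, 1.0, 1.0])))
# → [1. 1. 3. 2. 2.]
```

The two-element input becomes a five-element output, which is why stacking such layers in the decoder restores the original image dimensions.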

With skip connections, these upsampled feature maps are obtained not only from the final, most low-resolution output, but also from earlier steps and fused together by summation to incorporate more high-resolution features.

[U-Net, 2015] built on this approach by proposing a symmetric encoder-decoder architecture with even more skip connections. The encoder part forms the left half of its U-shape, with successive network layers producing feature maps of decreasing resolution. The decoder path then gradually restores the original resolution with transposed convolution layers. Long skip connections provide shortcuts that concatenate feature maps of the encoder to their counterparts in the decoder. This enables U-Nets to consider both coarse, low-resolution features as well as detailed, high-resolution features.

[3D U-Net, 2016] later extended these concepts to volumetric, voxel-based input data by using 3D variants of both pooling and convolution layers.

U-Net architectures enjoyed enormous success and remain competitive options for medical image segmentation to this day. They tend to be robust and reliable, with 2D variants training within minutes on modern GPUs and being lightweight enough for inference even on laptop CPUs. They can also perform well even with just a few dozen training images, as each pixel or voxel effectively forms one training sample. Hundreds of U-Net variants were subsequently proposed in the literature, including SegResNet with short skip connections and 2.5D approaches that stack adjacent slices to form RGB colour images suitable for encoders pre-trained on ImageNet.
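The 2.5D trick mentioned above can be sketched as stacking three adjacent slices into the colour channels of a single 2D image (assuming a NumPy volume with the slice axis first):

```python
import numpy as np

def stack_25d(volume, z):
    # Use slices z-1, z, z+1 as the R, G, B channels; clamp at the volume borders
    z_prev = max(z - 1, 0)
    z_next = min(z + 1, volume.shape[0] - 1)
    return np.stack([volume[z_prev], volume[z], volume[z_next]], axis=-1)

volume = np.random.rand(30, 64, 64)  # 30 slices of 64x64 voxels
rgb = stack_25d(volume, z=10)
print(rgb.shape)  # → (64, 64, 3), shaped like an RGB image for a pre-trained encoder
```

This gives the 2D network a hint of through-plane context while keeping the input format of ImageNet-style encoders.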

[nnU-Net, 2018] (‘no-new-Net’) was ultimately proposed as a self-adapting framework for training effective U-Net architectures. It enables the training of both 2D and 3D U-Nets, as well as cascades with subsequent segmentation steps. Instead of focusing on modifications of the model architecture, it adjusts the preprocessing, training, inference and post-processing with many domain-specific heuristics and also enables cross-validations for evaluation.

For example, as imaging devices of different vendors can vary in contrast, the preprocessing first standardizes or at least scales the image intensities. Some segmentation tasks suffer from extreme class imbalances (for example when small cancerous lesions are to be segmented) and the sampling strategy therefore tries to balance the foreground and background samples for each minibatch in training. Variations in patient anatomy and position are simulated by augmentation with rotation, scaling and mirroring during training. Test-time augmentation furthermore presents each sample with its mirrored copy and averages the predictions for increased robustness. Contemporary GPUs were often limited to 11GB, and so inference processes larger images by blending patch-wise predictions.
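Two of these heuristics are easy to sketch: z-score intensity normalization and mirrored test-time augmentation (a simplified illustration; nnU-Net's actual implementation handles many more cases, such as per-channel statistics and all mirroring axes):

```python
import numpy as np

def zscore_normalize(image):
    # Standardize intensities so scans from different scanners become comparable
    return (image - image.mean()) / (image.std() + 1e-8)

def predict_with_mirroring(predict, image):
    # Average predictions over the image and its left-right mirror,
    # flipping the mirrored prediction back before averaging
    p = predict(image)
    p_mirrored = predict(image[:, ::-1])[:, ::-1]
    return 0.5 * (p + p_mirrored)
```

Here `predict` stands in for any trained segmentation model that maps an image to per-pixel scores.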

Its authors at the German Cancer Research Center (DKFZ) published an open-source implementation whose research code (which purportedly left Godzilla dead, whereas I was merely scarred) enabled many to reproduce these techniques. So successful was this framework, both in benchmark challenges and various research papers, that even in summer 2024 it could still lay claim to dominance in this domain.

[TotalSegmentator, 2023] was later released as a freely available, already trained nnU-Net model for segmentation of 104 different structures in CT images, such as organs, bones, muscle and blood vessels. Up to this point, it was common that segmentation models trained on one dataset would not perform as well on data from other sources due to distribution shifts from different imaging devices, protocols or patient demographics. By training on a varied dataset of over one thousand real-world CT images with different age groups, sites, and protocols, TotalSegmentator made a substantial leap in generalization across arbitrary CT data. Notably, the model was made highly accessible with a Python package, integration into 3D Slicer and even a free web interface. In 2024, an extended version was released for segmentation of 59 structures in even more variable images from MRI.

Concluding Thoughts

From early Fully Convolutional Networks of 2014 to TotalSegmentator in 2024 for MRI, these papers trace the evolution of deep learning for semantic segmentation of radiology images over an entire decade. As an open research question it motivated thousands of papers with varied methodologies. Their insights were gradually distilled into the later publications and methods. Today, no programming or deep learning knowledge is required for anyone to simply drag and drop an image into TotalSegmentator with impressive results.
That is progress!

Medical image segmentation still remains an active field of research, with various benchmark challenges (beyond the scope of TotalSegmentator) in conferences like MICCAI and Kaggle challenges with monetary prizes.

The convolutional neural network architectures reviewed so far in this post also remain a competitive option especially for limited training data, as is common for medical images. Methods that utilise Transformers have emerged too, such as SwinUNETR (also see the open-source MONAI framework) and MedSAM as a medical version of the Segment Anything Model. More recently, even multimodal large language models are being considered for radiology tasks which may warrant an entire blog post of their own.

So with all these innovations, why has radiology not been automated yet? While a growing number of FDA-approved and CE-marked supporting tools are entering the market, these systems must overcome numerous obstacles, and the technical challenge is just one of them, next to regulatory hurdles, workflow integration and acceptance issues.

However, even the technical challenges in this space have not been truly solved yet. I got a taste of this myself when working on kidney segmentation in MRI of UK Biobank. Before my results for 40,000 participants could be uploaded to the official data catalogue, I skimmed through thousands of these images to identify and understand outlier cases in which my U-Net had failed to accurately segment both kidneys.

This taught me about horseshoe kidneys, or renal fusion, a condition in which both kidneys are connected from birth. Over a dozen such cases occurred in this large dataset, and without any training examples for this rare condition, the model initially had not learned how to handle them well. Mel Gibson is often cited as a famous case of renal fusion, and had he participated in UK Biobank, my system would likely have failed him. But who knows, perhaps today TotalSegmentator would succeed even for him?

The Lost Reading Items

In this post: An attempt to reconstruct Ilya Sutskever's 2020 AI reading list
(8 min read)

I recently shared a summary of a viral AI reading list attributed to Ilya Sutskever, which laid claim to covering ‘90% of what matters’ back in 2020. It boils down the reading items to barely one percent of their original word count, forming the TL;DR I would have wished for before reading.

The viral version of the list as shared online is known to be incomplete, however, including only 27 of about 40 original reading items. The rest allegedly fell victim to the e-mail deletion policy at Meta¹. These missing reading items have inspired some good discussions in the past, with many different ideas as to which papers would have been important enough to include.

This post is an attempt to identify these lost reading items. It builds on clues gathered from the viral list, contemporary presentations given by Ilya Sutskever, resources shared by OpenAI and more.

¹Correction: An earlier version mistakenly referred to OpenAI here instead of Meta

Filling the Gaps

The main piece of evidence is a claim, shared along with the list, that an entire selection of meta-learning papers was lost.

Meta-learning is often described as ‘learning to learn’: neural networks are trained for a general ability to adapt easily to new tasks for which only few training samples are available. A network should thus be able to benefit from its existing weights without requiring entirely new training from scratch on the new data. One-shot learning provides just a single training sample from which a model is expected to learn a new downstream task, whereas zero-shot settings provide no annotated training samples at all.
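The one-shot setting can be illustrated with a toy sketch: given a (stand-in) pretrained embedding and a single labelled example per new class, classification reduces to a nearest-prototype lookup, in the spirit of metric-based meta-learning. All names, data and the ‘embedding’ below are invented for illustration:

```python
import numpy as np

def embed(x):
    # Stand-in for a pretrained embedding network: a fixed linear map.
    W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    return x @ W

# One labelled support example ("shot") per previously unseen class:
support = {"kidney": np.array([1.0, 0.0, 1.0]),
           "liver":  np.array([0.0, 1.0, 0.0])}
prototypes = {name: embed(x) for name, x in support.items()}

def classify(query):
    # Nearest-prototype rule: no retraining, just distances in embedding space.
    return min(prototypes, key=lambda name: np.linalg.norm(embed(query) - prototypes[name]))

print(classify(np.array([0.9, 0.1, 0.8])))  # → kidney
```

No gradient steps are taken on the new classes at all: the single support example per class suffices, which is exactly what makes the one-shot setting attractive.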

For some of the candidate papers listed below, the case can be strengthened further by an endorsement straight from OpenAI itself. Ilya Sutskever was chief scientist at the time OpenAI published the educational resource ‘Spinning Up in Deep RL’, which includes several of these candidates in an entirely separate reading list of 105 ‘Key Papers in Deep RL’. Below, the papers that also appear in that list are marked with a symbol (⚛).

Clues from the Preserved Reading Items

Some meta-learning concepts can be found even in the known parts of the list. The preserved reading items can be arranged into a narrative arc around a related branch of research on Memory-Augmented Neural Networks (MANNs). Following the ‘Neural Turing Machine’ (NTM) paper, ‘Set2Set’ and ‘Relational RNNs’ experimented with external memory banks that an RNN could read from and write to. They directly cite or closely relate to several papers which may well have been part of the original list:
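The mechanism these memory-augmented architectures share is, at its core, a differentiable content-based read: the controller emits a key, attention weights are formed by similarity against each memory row, and the read-out is their weighted sum. A minimal sketch, with invented shapes and values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

memory = np.array([[1.0, 0.0, 0.0],   # external memory bank: one stored vector per row
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])

def content_read(key, beta=5.0):
    # Cosine similarity of the key against every memory row,
    # sharpened by beta and normalised into attention weights.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key))
    w = softmax(beta * sims)
    return w @ memory  # read vector: attention-weighted sum of memory rows

r = content_read(np.array([0.9, 0.1, 0.0]))
print(r.round(2))  # read-out dominated by the first memory row
```

Because every step is differentiable, the whole read (and the analogous write) can be trained end-to-end with backpropagation, which is what made these memory banks usable inside RNNs.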

Potential Reading Items (Part 1):

Clues from Contemporary Presentations

Certain papers about meta-learning and competitive self-play also feature repeatedly in a series of presentations given by Ilya Sutskever around this time and may well have been included in the reading list too.

Recorded Presentations:
- Meta Learning and Self Play - Ilya Sutskever, OpenAI (YouTube), 2017
- OpenAI - Meta Learning & Self Play - Ilya Sutskever (YouTube), 2018
- Ilya Sutskever: OpenAI Meta-Learning and Self-Play (YouTube), 2018

These presentations largely overlap and repeatedly reference known contents of the reading list. They open with a fundamental motivation of why deep learning works, framing backpropagation in neural networks as a search for small circuits, which relates to the Minimum Description Length principle: the shortest program that can explain given data will generalize best.
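The MDL idea can be made concrete with a toy two-part code: the total cost of a hypothesis is the bits needed to describe the model plus the bits needed to describe the data given the model. In the sketch below, the per-parameter cost and the Gaussian residual code are arbitrary stand-ins, not anything from the presentations:

```python
import numpy as np

# Two-part MDL criterion: total bits = bits(model) + bits(data | model).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 0.5 + rng.normal(scale=0.05, size=x.size)  # truly linear data

def description_length(degree, bits_per_param=32.0):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    model_bits = bits_per_param * (degree + 1)
    # Gaussian code length of the residuals (differential, up to a constant):
    data_bits = 0.5 * x.size * np.log2(2 * np.pi * np.e * residuals.var())
    return model_bits + data_bits

for d in (1, 4, 9):
    print(d, round(description_length(d), 1))
```

The linear model wins: higher-degree polynomials barely shrink the residuals yet pay for every extra coefficient, so the shortest total description, and per MDL the best expected generalization, belongs to the simplest adequate hypothesis.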

Next, all three presentations reference the following meta-learning papers:

Potential Reading Items (Part 2):

Reinforcement Learning (RL) also features heavily in all three presentations, with close links to meta-learning. One key concept is competitive self-play, in which agents interact in a simulated environment to reach specific, typically adversarial objectives. As a way to ‘turn compute into data’, this approach enabled simulated agents to outperform human champions and invent new moves in rule-based games. Ilya Sutskever presents an evolutionary-biology perspective that relates competitive self-play to the impact of social interaction on brain size (pay-walled link). He goes on to suggest that rapid competence gain in a simulated ‘agent society’ may, in his judgement, ultimately provide a plausible path towards a form of AGI.

Given the significance he ascribes to these concepts, it seems plausible that some of the cited papers on self-play were later also included in the reading list. They may form a sizeable chunk of the missing items, especially as RL is otherwise mentioned in only one of the preserved reading items.

Potential Reading Items (Part 3):

Even today, these presentations from around 2018 are still worth watching. Alongside fascinating bits of knowledge, they include gems such as the statement:

‘Just like in the human world: The reason humans find life difficult is because of other humans’

-Ilya Sutskever

While some concepts in computer science accordingly appear timeless, other points may seem surprising today, like this casual remark by an audience member in the Q&A session:

‘It seems like an important sub-problem on the path to AGI will be understanding language, and the state of generative language modelling right now is pretty abysmal.’

-Audience member

To which Ilya Sutskever responds:

‘Even without any particular innovations beyond models that exist today, simply scaling up models that exist today on larger datasets is going to go surprisingly far.’

-Ilya Sutskever (in 2018)

This response was later supported by experimental results in the reading item ‘Scaling Laws for Neural Language Models’ (which echoes the ‘Bitter Lesson’ by Rich Sutton). It was ultimately proven true, as he would go on to oversee Transformer architectures scaled up to an estimated 1.8 trillion parameters, trained on thousands of GPUs at a cost estimated above $60 million, forming Large Language Models (LLMs) that are today capable of generating text increasingly difficult to distinguish from human writing.
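The headline result of that scaling-laws paper fits in a few lines: held-out loss falls as a power law in the non-embedding parameter count N. The constants below are the approximate fitted values reported by Kaplan et al. (2020); treat them as rough, dataset-specific estimates:

```python
# Power-law scaling of language-model loss with parameter count N:
#   L(N) ≈ (N_c / N) ** alpha_N
# Approximate fitted constants from 'Scaling Laws for Neural Language Models':
N_C = 8.8e13      # characteristic non-embedding parameter count
ALPHA_N = 0.076   # fitted exponent

def loss(n_params):
    return (N_C / n_params) ** ALPHA_N

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N={n:.0e}  L≈{loss(n):.2f}")
```

Each tenfold increase in parameters multiplies the loss by the same constant factor, which is what made the returns of pure scaling predictable enough to bet on.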

Honorable Mentions

Many other works and authors may have featured on the original list, but the evidence wears increasingly thin from here on.

Overall, the preserved reading items manage to strike an impressive balance between covering different model classes, applications and theory while also including many famous authors of the field. Perhaps the exceptions to this rule are worth noting, even if they may have slipped among the ‘10% of what matters’ that didn’t make the original list.

As such, it would have seemed plausible to include:

Conclusion

This post will remain largely speculative until more becomes known. After all, even the viral list itself was never officially confirmed to be authentic. Nonetheless, the potential candidates for the lost reading items listed above seemed worth sharing. Taken together, they may well fill a gap in the viral version of the list that would, in the words of the author, have corresponded roughly to a missing ‘30% of what matters’ at its time.