Reviewing “AI Engineering” by Chip Huyen

21 Jun 2025, Taro Langner

In this post: A book review
(5 min read)

In January this year, Chip Huyen published her newest book ‘AI Engineering’, which quickly made waves online.

Having read her previous book ‘Designing Machine Learning Systems’ (2022), which I warmly recommend, I wondered what could possibly remain to be covered in 500+ additional pages. It really turned out to be something entirely different, and this blog post will attempt a short review for anyone still curious about the book.

Machine Learning vs AI Engineering

Her previous book, ‘Designing Machine Learning Systems’, was a gentle but comprehensive overview of machine learning terms, techniques and applications that went easy on mathematical notation. While it briefly mentioned Large Language Models (LLMs) and remains highly relevant, it was published about half a year before the release of ChatGPT in November 2022, pre-dating its impact on the field.

The new book ‘AI Engineering’ is now entirely dedicated to working with LLMs. Subtitled ‘Building Applications with Foundation Models’, it addresses a much wider audience than just ML Engineers or Researchers with a technical or scientific interest. Indeed, it clearly sets apart ML from the scope of the book and expects little prior knowledge of the field. The contents have therefore hardly any overlap with the previous book and are mostly complementary.

Format and Style

The field is now moving faster than anyone could hope to read about, with sensational new announcements almost every week. Technical specifics are quickly outdated and the book accordingly sticks to more timeless high-level concepts, insights and approaches. These are nonetheless often supported by hard evidence from statistics or relevant papers and first-hand accounts from industry experts. However, this also tends to make it quite a verbose, lighter read of an often more ‘qualitative’ nature, with plenty of examples and a minimum of mathematical notation and formulas.

Content

The author has a hopeful but grounded take on the capabilities of LLMs, with the credentials to back it up. The book provides many use cases and guides, but also known failure modes backed by the scientific literature, along with techniques to mitigate them.

The contents include concepts such as pre- and post-training, retrieval-augmented generation (RAG), agents, tool-use, evaluation techniques, emergent properties and quirks of the models. It offers strategies for application development, an in-depth chapter on fine-tuning with techniques such as Low-Rank Adaptation (LoRA) as well as optimization strategies for inference and more.

The facts laid out in the book are well-researched and appear solid overall. However, it does repeat the common claim that the 2012 AlexNet paper was the first to utilize GPUs for training of neural networks. As mentioned in my previous summary on AlexNet, the truth seems somewhat more nuanced. Chances are that the author already has a rebuttal from Jürgen Schmidhuber waiting in their inbox.

Conclusion

The book provides a comprehensive overview and also offers many insights that were new to me, from the mapped-out political leanings of LLMs over the availability of training material in different languages to prompt injection attacks that could leak sensitive training data when tasked to merely repeat the word ‘poem’ an infinite number of times.

The distinction between ML and AI Engineering itself is thought-provoking. It illustrates how ML-based capabilities which used to be academic research topics are rapidly becoming applicable, abstracted and commoditized via APIs.

As with any other API, these capabilities thereby become accessible even without any deeper understanding of the field. The latter nonetheless helps, as simple LLM wrappers still encounter many limitations which are laid out in the book (hallucinations, costs, prompt injection attacks etc.) and are still rarely competitive for hard engineering challenges (as seen e.g. in Kaggle challenges).

The book provides a solid foundation in this regard. Even with a more technical background, its balance between breadth and depth covered many gaps in my knowledge and makes it likely to remain relevant for years to come, so that it was well worth reading for me.

Disclaimer: I have no affiliation with the author and this review is not monetized, sponsored or funded in any other way. Originally, this post was meant to also review the book ‘Alice’s Adventures in a Differentiable Wonderland’ by Simone Scardapane that I read earlier, but that will remain for a future post.

What Deep Learning can do for Image Segmentation in Radiology

29 Jan 2025, Taro Langner

In this post: From Fully Convolutional Networks to TotalSegmentator
(10 min read)

Amidst the ongoing hype around the growing capabilities of large language models, it can be curious to note how earlier predictions about machine learning have stood the test of time.

Autonomous driving and radiology in particular were considered obvious candidates for automation, starting with the deep learning boom for image recognition around 2012. Geoffrey Hinton famously suggested in 2016 that training of radiologists should be discontinued altogether, arguing that they would be obsolete within five years.

And yet, things turned out quite differently...

(Source). And once at work, a US radiologist in 2025 may earn $265-495k per year or pick from a number of job openings that actually appears to be increasing.

This is a sobering reminder that many real-world problems turned out to be much harder to crack than expected at first. Although the AI hype has since mostly turned elsewhere, a closer look at what happened in these fields can be rather interesting.

This blog post examines one of their most active, and arguably most successful, research areas and reviews one decade's worth of progress around deep learning methods for semantic segmentation of medical images in radiology.

Semantic Segmentation for Radiology

From magnetic resonance imaging (MRI) to computed tomography (CT), medical imaging offers various modalities for visualising the human body. Just like the anatomy itself, these images are often three-dimensional and thus composed of volumetric pixels, or voxels, similar to the block worlds of Minecraft.

Early on, measurements and findings were often reported from two-dimensional X-ray images. With increasingly affordable imaging technology, the data has since grown both in quantity and resolution, leaving mere seconds on average for a radiologist to inspect images that can each consist of millions of voxels.

Semantic segmentation is one type of image analysis that is commonly performed in research and industry on these images. It aims to assign a class label to every pixel or voxel, typically to mark all parts of the image that contain a certain tissue or structure. Once complete, a segmentation mask can be used to measure volumes, render surface models or plan radiation treatments.
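To make the volume measurement concrete, here is a minimal sketch. It assumes a binary mask stored as nested Python lists and a voxel spacing in millimetres; the function name, the toy mask and the spacing values are all made up for illustration and do not come from any particular imaging toolkit.

```python
def mask_volume_ml(mask, spacing_mm):
    """Return the volume of all foreground voxels in millilitres.

    mask: nested lists indexed as mask[z][y][x], with 1 marking foreground.
    spacing_mm: (dz, dy, dx) edge lengths of a single voxel in millimetres.
    """
    dz, dy, dx = spacing_mm
    voxel_volume_mm3 = dz * dy * dx
    n_foreground = sum(v for plane in mask for row in plane for v in row)
    return n_foreground * voxel_volume_mm3 / 1000.0  # 1 ml = 1000 mm^3


# Toy example (assumed values): a 2x2x2 volume with three foreground voxels.
toy_mask = [
    [[1, 0], [0, 0]],
    [[1, 1], [0, 0]],
]
toy_volume = mask_volume_ml(toy_mask, spacing_mm=(5.0, 2.0, 2.0))
```

In practice the voxel spacing would be read from the image header, since the same voxel count can correspond to very different physical volumes depending on the scan protocol.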

Video Example:
- Manual CT Image Segmentation in 3D Slicer (YouTube) [7:43 minutes]

When done by hand, a given volumetric image is typically segmented by drawing on it as a stack of dozens or hundreds of two-dimensional slices. This can take from minutes to hours or even days. Results may vary not only between different operators but also when the same image is analysed repeatedly by the same person. This becomes a challenge in studies where hundreds or even thousands of scans are to be analysed in this way.

Despite its many issues, manual segmentation remains the method of choice for many real-world projects that operate in risk-averse settings under restrictive regulatory constraints. On the flip side, these same regulations and concerns have helped to reduce the number of scenarios where malfunctioning software would burn patients with radiation or confuse surgeons with misleading 3D navigation views.

Risk-averse skeptics were furthermore proven right on several occasions before when they doubted claims about technology being superior to medical experts. In the late 90s, computer-aided detection (CAD) systems were funded with millions of dollars per year but later reported to provide no benefit or perhaps even cause harm. The deep learning boom later caused an entire flood of such claims, like in the 2017 CheXNet paper co-authored by Andrew Ng, the issues of which are discussed with many insights in the blog of Lauren Oakden-Rayner.

Despite this history of overpromising, the work done by a multitude of researchers and engineers over the years has nonetheless achieved some impressive progress. Especially deep learning systems for semantic segmentation of medical images have a lot to offer and are worth a closer look.

Deep Learning for Semantic Image Segmentation

In the ImageNet challenge of 2012, AlexNet famously set a new benchmark result for image recognition by assigning one of 1,000 possible class labels to a given input image with unprecedented accuracy.

Sliding window segmentation used such image classifiers for semantic segmentation, applying them to each position of an image to receive as input a patch around the current position and predict a class label for the central pixel. This approach suffered from inefficiencies, however, with redundant processing wherever a given image area appeared in multiple, adjacent patches.
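The sliding window idea can be sketched in a few lines. The snippet below is only an illustration: `toy_classifier` is a hypothetical stand-in for a trained image classifier, and the zero padding at the border is an assumption rather than a detail taken from any specific paper.

```python
def sliding_window_segment(image, classify_patch, radius=1, pad_value=0):
    """Label every pixel by classifying the square patch centred on it."""
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Cut out the patch around (y, x), padding beyond the image border.
            patch = [
                [
                    image[py][px] if 0 <= py < height and 0 <= px < width else pad_value
                    for px in range(x - radius, x + radius + 1)
                ]
                for py in range(y - radius, y + radius + 1)
            ]
            labels[y][x] = classify_patch(patch)
    return labels


def toy_classifier(patch):
    """Hypothetical stand-in for a CNN: thresholds the mean patch intensity."""
    values = [v for row in patch for v in row]
    return 1 if sum(values) / len(values) > 0.5 else 0


toy_image = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]
toy_labels = sliding_window_segment(toy_image, toy_classifier)
```

With a radius of 1, every interior pixel is read as part of nine different patches, which is exactly the redundancy that made this approach inefficient.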

[Fully Convolutional Networks, 2015] were proposed as a neural network architecture for dense prediction of pixel-wise labels. This approach removes any fully-connected layers from established architectures for image classification. The remaining convolutional and pooling layers act as an encoder, producing feature maps that retain spatial image information at different resolutions. From these, a 1x1 convolution produces one feature map for each class, to be upsampled by transposed convolution layers to restore the original image dimensions.
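As a rough illustration of how a transposed convolution upsamples, the following sketch works on a one-dimensional signal. The kernel values are chosen by hand for the example, whereas in a real network they would be learned. Each input value pastes a scaled copy of the kernel into the output, and overlapping copies are summed.

```python
def transposed_conv1d(signal, kernel, stride=2):
    """Upsample a 1D signal with a transposed convolution (no padding)."""
    out_len = (len(signal) - 1) * stride + len(kernel)
    output = [0.0] * out_len
    for i, value in enumerate(signal):
        for k, weight in enumerate(kernel):
            # Input position i lands at output position i * stride.
            output[i * stride + k] += value * weight
    return output


# Hand-picked kernel (an assumption for this example) that interpolates linearly.
upsampled = transposed_conv1d([1.0, 2.0, 3.0], kernel=[0.5, 1.0, 0.5], stride=2)
```

Here three input values become seven output values, `[0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 1.5]`, with the in-between positions filled by the overlapping kernel copies.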

Transposed Convolution (Animations)
- Convolution arithmetic (GitHub) by Vincent Dumoulin, Francesco Visin

Note: Transposed convolutions differ from dilated (or à trous) convolutions that featured in a previous blog post.

With skip connections, these upsampled feature maps are obtained not only from the final, most low-resolution output, but also from earlier steps and fused together by summation to incorporate more high-resolution features.

[U-Net, 2015] built on this approach by proposing a symmetric encoder-decoder architecture with even more skip connections. The encoder part forms the left half of its U-shape, with successive network layers producing feature maps of decreasing resolution. The decoder path then gradually restores the original resolution with transposed convolution layers. Long skip connections provide shortcuts that concatenate feature maps of the encoder to their counterparts in the decoder. This enables U-Nets to consider both coarse, low-resolution features as well as detailed, high-resolution features.
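The difference between fusing skip connections by summation, as described for Fully Convolutional Networks, and by concatenation, as in U-Net, can be sketched as follows. Feature maps are represented as plain lists of 2D channels, and all values are invented for the example.

```python
def fuse_by_sum(decoder_maps, encoder_maps):
    """FCN-style fusion: add matching channels element by element."""
    return [
        [[d + e for d, e in zip(d_row, e_row)] for d_row, e_row in zip(d_map, e_map)]
        for d_map, e_map in zip(decoder_maps, encoder_maps)
    ]


def fuse_by_concat(decoder_maps, encoder_maps):
    """U-Net-style fusion: stack the channels and leave the mixing to later layers."""
    return list(decoder_maps) + list(encoder_maps)


coarse = [[[0.5, 0.5], [0.5, 0.5]]]  # one upsampled decoder channel (2x2)
detail = [[[1.0, 0.0], [0.0, 1.0]]]  # one encoder channel of the same size
summed = fuse_by_sum(coarse, detail)
stacked = fuse_by_concat(coarse, detail)
```

Summation keeps the channel count unchanged, while concatenation doubles it here and lets the following convolution decide how to weigh coarse against detailed features.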

[3D U-Net, 2016] later extended these concepts to volumetric, voxel-based input data by using 3D variants of both pooling and convolution layers.

U-Net architectures enjoyed enormous success and remain competitive options for medical image segmentation to this day. They tend to be robust and reliable, with 2D variants training within minutes on modern GPUs and being lightweight enough for inference even on laptop CPUs. They can also perform well even with just a few dozen training images, as each pixel or voxel effectively forms one training sample. Hundreds of U-Net variants were subsequently proposed in the literature, including SegResNet with short skip connections and 2.5D approaches that stack adjacent slices to form RGB colour images suitable for encoders pre-trained on ImageNet.

[nnU-Net, 2018] (‘no-new-Net’) was ultimately proposed as a self-adapting framework for training effective U-Net architectures. It enables the training of both 2D and 3D U-Nets, as well as cascades with subsequent segmentation steps. Instead of focusing on modifications of the model architecture, it adjusts the preprocessing, training, inference and post-processing with many domain-specific heuristics and also enables cross-validations for evaluation.

For example, as imaging devices of different vendors can vary in contrast, the preprocessing first standardizes or at least scales the image intensities. Some segmentation tasks suffer from extreme class imbalances (for example when small cancerous lesions are to be segmented) and the sampling strategy therefore tries to balance the foreground and background samples for each minibatch in training. Variations in patient anatomy and position are simulated by augmentation with rotation, scaling and mirroring during training. Test-time augmentation furthermore presents each sample with its mirrored copy and averages the predictions for increased robustness. Contemporary GPUs were often limited to 11 GB of memory, and so the inference processes larger images by blending patch-wise predictions.
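Of these heuristics, test-time augmentation by mirroring is perhaps the easiest to sketch. The code below is a simplified illustration and not taken from the nnU-Net code base: it mirrors along a single axis only, and `toy_predict` is a hypothetical, deliberately asymmetric model.

```python
def mirror(rows):
    """Flip a 2D image or prediction map from left to right."""
    return [list(reversed(row)) for row in rows]


def predict_with_mirror_tta(image, predict):
    """Average the prediction for an image with that of its mirrored copy."""
    direct = predict(image)
    # Predict on the mirrored image, then flip the result back into place.
    mirrored_back = mirror(predict(mirror(image)))
    return [
        [(a + b) / 2.0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(direct, mirrored_back)
    ]


def toy_predict(image):
    """Hypothetical model with a left-sided bias: blends each pixel with its left neighbour."""
    return [
        [0.5 * row[x] + 0.5 * (row[x - 1] if x > 0 else 0.0) for x in range(len(row))]
        for row in image
    ]


toy_scores = predict_with_mirror_tta([[0.0, 1.0, 0.0]], toy_predict)
```

The biased toy model alone returns `[0.0, 0.5, 0.5]` for this symmetric input, whereas the averaged result `[0.25, 0.5, 0.25]` is symmetric again.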

Its authors at the German Cancer Research Center (DKFZ) published an open-source implementation. Its research code (which purportedly left Godzilla dead, whereas I was merely scarred) enabled many to reproduce these techniques. So successful was this framework both in benchmark challenges and various research papers that even in summer 2024 it still lays claim to dominance in this domain.

[TotalSegmentator, 2023] was later released as a freely available, already trained nnU-Net model for segmentation of 104 different structures in CT images, such as organs, bones, muscle and blood vessels. Up to this point, it was common that segmentation models trained on one dataset would not perform as well on data from other sources due to distribution shifts from different imaging devices, protocols or patient demographics. By training on a varied dataset of over one thousand real-world CT images with different age groups, sites, and protocols, TotalSegmentator made a substantial leap in generalization across arbitrary CT data. Notably, the model was made highly accessible with a Python package, integration into 3D Slicer and even a free web interface. In 2024, an extended version was released for segmentation of 59 structures in even more variable images from MRI.

Concluding Thoughts

From early Fully Convolutional Networks of 2014 to TotalSegmentator in 2024 for MRI, these papers trace the evolution of deep learning for semantic segmentation of radiology images over an entire decade. As an open research question it motivated thousands of papers with varied methodologies. Their insights were gradually distilled into the later publications and methods. Today, no programming or deep learning knowledge is required for anyone to simply drag and drop an image into TotalSegmentator with impressive results.
That is progress!

Medical image segmentation still remains an active field of research, with various benchmark challenges (beyond the scope of TotalSegmentator) in conferences like MICCAI and Kaggle challenges with monetary prizes.

The convolutional neural network architectures reviewed so far in this post also remain a competitive option especially for limited training data, as is common for medical images. Methods that utilise Transformers have emerged too, such as SwinUNETR (also see the open-source MONAI framework) and MedSAM as a medical version of the Segment Anything Model. More recently, even multimodal large language models are being considered for radiology tasks which may warrant an entire blog post of their own.

So with all these innovations, then, why has radiology not been automated yet? While there is a growing number of success stories of FDA-approved and CE-marked supporting tools entering the market, these systems have to overcome numerous obstacles. Next to regulatory hurdles, workflow integration and acceptance issues, the technical challenge is just one of them.

However, even the technical challenges in this space have not been truly automated yet. I received a taste of this myself when working on kidney segmentations in MRI of UK Biobank. Before my results for 40,000 participants could be uploaded to the official data catalogue, I skimmed through thousands of these images to identify and understand outlier cases where my U-Net had failed to accurately segment both kidneys.

This taught me about horseshoe kidneys, or renal fusion, in which both kidneys are connected from birth. In this large dataset over a dozen such cases occurred, and without any training examples for this rare condition, the model had not learned how to handle them well at first. Mel Gibson is often given as a famous case of renal fusion, and if he had participated in UK Biobank, my system would have likely failed him. But who knows, perhaps today TotalSegmentator would have succeeded even for him?

The Lost Reading Items

11 Nov 2024, Taro Langner

In this post: An attempt to reconstruct Ilya Sutskever's 2020 AI reading list
(8 min read)

I recently shared a summary of a viral AI reading list attributed to Ilya Sutskever, which laid claim to covering ‘90% of what matters’ back in 2020. It boils down the reading items to barely one percent of the original word count to form the TL;DR I would have wished for before reading.

The viral version of the list as shared online is known to be incomplete, however, and includes only 27 of about 40 original reading items. The rest allegedly fell victim to the email deletion policy at Meta¹. These missing reading items have inspired some good discussions in the past, with many different ideas as to which papers would have been important enough to include.

This post is an attempt to identify these lost reading items. It builds on clues gathered from the viral list, contemporary presentations given by Ilya Sutskever, resources shared by OpenAI and more.

¹Correction: An earlier version mistakenly referred to OpenAI here instead of Meta

Filling the Gaps

The main piece of evidence is a claim shared along with the list according to which an entire selection of meta-learning papers was lost.

Meta-learning is often said to pursue ‘learning to learn’, with neural networks being trained for a general ability to adapt more easily to new tasks for which only few training samples are available. A network should thus be able to benefit from its existing weights without requiring an entirely new training from scratch on the new data. One-shot learning provides just a single training sample to a model from which it is expected to learn a new downstream task, whereas zero-shot settings provide no annotated training samples at all.

For some of the candidate papers listed below, the case can be strengthened further by evidence in the form of an endorsement straight from OpenAI itself. Ilya Sutskever was chief scientist at a time when OpenAI published the educational resource ‘Spinning Up in Deep RL’ which includes several of these candidates in an entirely separate reading list of 105 ‘Key Papers in Deep RL’. Below, the papers which also appear in that list are marked with a symbol (⚛).

Clues from the Preserved Reading Items

Some meta-learning concepts can be found even in the known parts of the list. The preserved reading items can be arranged into a narrative arc around a related branch of research on Memory-Augmented Neural Networks (MANNs). Following the ‘Neural Turing Machine’ (NTM) paper, ‘Set2Set’ and ‘Relational RNNs’ experimented with external memory banks that an RNN could read and write information on. They directly cite or closely relate to several papers which may well have been part of the original list:

Potential Reading Items (Part 1):

‘Meta-learning with memory-augmented neural networks’
from 2016
‘Prototypical networks for few-shot learning’
from 2017
‘Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks’⚛
from 2017

Clues from Contemporary Presentations

Certain papers about meta-learning and competitive self-play also feature repeatedly in a series of presentations held by Ilya Sutskever around this time and may well have eventually been included in the reading list too.

Recorded Presentations:
- Meta Learning and Self Play - Ilya Sutskever, OpenAI (YouTube), 2017
- OpenAI - Meta Learning & Self Play - Ilya Sutskever (YouTube), 2018
- Ilya Sutskever: OpenAI Meta-Learning and Self-Play (YouTube), 2018

These presentations largely overlap and repeatedly reference known contents of the reading list. They open with a fundamental motivation of why deep learning works, framing backpropagation with neural networks as a search for small circuits that relate to the Minimum Description Length principle, according to which the shortest program that can explain given data will reach the best generalization possible.

Next, all three presentations reference the following meta-learning papers:

Potential Reading Items (Part 2):

‘Human-level concept learning through probabilistic program induction’
as Lake et al., 2016
‘Neural Architecture Search with Reinforcement Learning’
as Zoph and Le, 2017
‘A Simple Neural Attentive Meta-Learner’⚛
as Mishra et al., 2017

Reinforcement Learning (RL) also features heavily in all three presentations, with close links to meta-learning. One key concept is competitive self-play in which agents interact in a simulated environment to reach specific, typically adversarial objectives. As a way to ‘turn compute into data’, this approach enabled simulated agents to outperform human champions and invent new moves in rule-based games. Ilya Sutskever presents an evolutionary biology perspective that relates competitive self-play to the impact of social interaction on brain size (pay-walled link). He goes on to suggest that rapid competence gain in a simulated ‘agent society’ may ultimately, according to his judgement, provide a plausible path towards a form of AGI.

Given the significance he ascribes to these concepts, it seems plausible that some of the cited papers on self-play may have later also been included in the reading list. They may form a sizeable chunk of the missing items, especially as RL is otherwise mentioned by only one of the preserved reading items.

Potential Reading Items (Part 3):

‘Hindsight Experience Replay’⚛
as Andrychowicz et al., 2017
‘Continuous control with deep reinforcement learning’⚛
as DDPG: Deep Deterministic Policy Gradients, 2015
‘Sim-to-Real Transfer of Robotic Control with Dynamics Randomization’
as Peng et al., 2017
‘Meta Learning Shared Hierarchies’
as Frans et al., 2017
‘Temporal Difference Learning and TD-Gammon [1995]’
as Tesauro et al., 1992
‘Karl Sims - Evolved Virtual Creatures, Evolution Simulation, 1994’
as Carl Sims, 1994 (YouTube video [4:09])
‘Emergent Complexity via Multi-Agent Competition’
as Bansal et al., 2017
‘Deep reinforcement learning from human preferences’⚛
as Christiano et al., 2017 (Note: Introduces RLHF)

Even today, these presentations from around 2018 are still worth watching. Next to fascinating bits of knowledge, they also include gems such as the statement:

‘Just like in the human world: The reason humans find life difficult is because of other humans’

-Ilya Sutskever

While some concepts in computer science accordingly appear timeless, other points may seem surprising today, like the casual remark of an audience member in the Q&A session:

‘It seems like an important sub-problem on the path to AGI will be understanding language, and the state of generative language modelling right now is pretty abysmal.’

-Audience member

To which Ilya Sutskever responds:

‘Even without any particular innovations beyond models that exist today, simply scaling up models that exist today on larger datasets is going to go surprisingly far.’

-Ilya Sutskever (in 2018)

This response was later confirmed by experimental results in the reading item ‘Scaling Laws for Neural Language Models’ (which echoes the ‘Bitter Lesson’ by Rich Sutton). It was ultimately proven true, as he would oversee Transformer architectures scaled up to an estimated 1.8 trillion parameters, reportedly costing over $60 million to train on tens of thousands of GPUs. These form the Large Language Models (LLMs) which are today capable of generating text that is increasingly difficult to distinguish from human writing.

Honorable Mentions

Many other works and authors may have featured on the original list, but the evidence wears increasingly thin from here on.

Overall, the preserved reading items manage to strike an impressive balance between covering different model classes, applications and theory while also including many famous authors of the field. Perhaps the exceptions to this rule are worth noting, even if they may have slipped among the ‘10% of what matters’ that didn’t make the original list.

As such, it would have seemed plausible to include:

Yann LeCun with pioneering work on CNNs for real-world use
Ian Goodfellow with Generative Adversarial Networks (GANs) that dominated image generation at the time and
Demis Hassabis for RL research towards AlphaFold that earned a Nobel prize

Conclusion

This post will remain largely speculative until more becomes known. After all, even the viral list itself was never officially confirmed to be authentic. Nonetheless, the potential candidates for the lost reading items listed above seemed worth sharing. Taken together, they may well fill a gap in the viral version of the list that would, in the words of the author, correspond roughly to a missing ‘30% of what matters’ at its time.

Summary of Ilya Sutskever's AI Reading List

24 Sep 2024, Taro Langner

In this post: Ilya Sutskever's AI Reading list in ~120 words per item
(15 min read)

Earlier this year, a reading list with about 30 papers was shared on Twitter.
It reportedly forms part of a longer version originally compiled by Ilya Sutskever, co-founder and chief scientist of OpenAI at the time, for John Carmack in 2020 with the remark:

‘If you really learn all of these, you’ll know 90% of what matters’.

While the list is fragmentary and much has happened in the field since, this endorsement and the claim that it was part of onboarding at OpenAI quickly made it go somewhat viral.

At about 300,000 words total, the combined content nonetheless corresponds to around one thousand book pages of dense, technical text and requires a decent investment in time and energy for self-study. After doing just that, I therefore dedicate this blog post to all those of us who provisionally bookmarked it (“for later”) and are still curious. What follows is my own condensed and structured summary with about 120 words per item, free of mathematical notation, to capture the essential key points, context and some perspective gained from reading it with the surrounding literature.

In a Nutshell

The list contains 27 reading items, with papers, blog posts, courses, one dissertation and two book chapters, all originally dating from 1993 to 2020.

The contents can be roughly broken down as follows:

Methodology	Items	Share*	Topics
Convolutional Neural Networks (CNNs)	5	25%	image recognition, semantic segmentation
Recurrent Neural Networks (RNNs)	10	19%	language modeling, speech-to-text, machine translation, combinatorial optimization, visual question answering, content-based attention
Transformers	3	6%	multi-head and dot-product attention, language model scaling
Information Theory	5	42%	Kolmogorov complexity, compression, Minimum Description Length
Miscellaneous	4	8%	variational inference, representation learning, graph neural networks, distributed training

Using these categories, the next sections summarize the gist of each item, roughly sorted by how they build on each other.

Convolutional Neural Networks

CS231n, 2017
Stanford University Course
Length: ~50,000 words, forming 11 blocks of 2 modules
Instructors: Fei-Fei Li, Andrej Karpathy and Justin Johnson
🔗

[CS231n, 2017] is a classic course on deep learning fundamentals from Stanford University. It builds up from linear classifiers and their ability to learn a given task based on mathematical optimization, or training, which adjusts their internal parameter weights such that applying them to input data will produce more desirable outputs. This basic concept is developed into backpropagation for training of neural networks, in which trainable parameters are typically arranged into multiple layers together with other modules such as activation functions and pooling layers. Convolutional Neural Networks (CNNs) are introduced as a specialized architecture for image recognition, as used in modern computer vision systems to this day. Extended video lectures are available on YouTube.
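The basic training loop behind such a linear classifier can be sketched in plain Python. This is a minimal illustration with made-up one-dimensional data, using a logistic loss and stochastic gradient descent. The learning rate and epoch count are arbitrary assumptions, and the code is not taken from the course material.

```python
import math


def train_linear_classifier(samples, labels, lr=0.1, epochs=300):
    """Fit weights and bias of a binary linear classifier by gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            prob = 1.0 / (1.0 + math.exp(-score))
            error = prob - y  # gradient of the logistic loss w.r.t. the score
            # Step against the gradient so that the output moves towards the label.
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def classify(weights, bias, x):
    """Predict class 1 if the linear score is positive, else class 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Toy data (assumed): negative inputs belong to class 0, positive ones to class 1.
toy_samples = [[-2.0], [-1.0], [1.0], [2.0]]
toy_targets = [0, 0, 1, 1]
toy_weights, toy_bias = train_linear_classifier(toy_samples, toy_targets)
toy_predictions = [classify(toy_weights, toy_bias, x) for x in toy_samples]
```

Backpropagation generalizes the same update rule to networks with several layers by passing the error signal backwards through each of them.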

Note: If you are starting from zero, this course and newer resources by e.g. DeepLearning.AI on Coursera or FastAI will help you to get more out of the remaining list.

AlexNet, 2012
Paper
Length: ~6,000 words
Authors: Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton
🔗

[AlexNet, 2012] established CNNs as state of the art for image recognition and arguably initiated the widespread hype around deep learning. It outperformed its competitors in the 2012 ImageNet benchmark challenge, predicting whether a given input image contained e.g. a cat, dog, ship or any other of 1,000 possible classes, so conclusively that the real-world dominance of deep learning became commonly accepted. An important factor was its early* CUDA implementation that enabled unusually fast training on GPUs.

*Note: Earlier GPU implementations are documented in section 12.1.2 of the book Deep Learning.

ResNet, 2015
Paper
Length: ~6,000 words
Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun
🔗

[ResNet, 2015] succeeded AlexNet as a more modern CNN architecture, reaching first place on the ImageNet challenge in 2015. It remains a popular CNN architecture to this day and is the subject of ongoing research. It introduced residual connections into CNN architectures that had become ever deeper, stacking more convolutional layers to achieve higher representational power. By allowing residual connections to skip or bypass entire blocks of layers, ResNet architectures suffered less from gradient degradation effects in training and could thus be robustly trained at previously unseen depth.
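A residual connection can be sketched as follows. This is a heavily simplified illustration in which small matrix multiplications stand in for convolutional layers and batch normalization is left out; the names and weights are assumptions for the example.

```python
def relu(vector):
    """Set negative entries to zero."""
    return [max(0.0, v) for v in vector]


def linear(vector, weights):
    """Matrix-vector product, standing in for a convolutional layer."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, weights_a, weights_b):
    """Compute relu(x + F(x)), where F consists of two layers with a ReLU in between."""
    hidden = relu(linear(x, weights_a))
    residual = linear(hidden, weights_b)
    # The skip connection adds the unchanged input to the output of the block.
    return relu([xi + ri for xi, ri in zip(x, residual)])


zeros = [[0.0, 0.0], [0.0, 0.0]]
passed_through = residual_block([1.0, 2.0], zeros, zeros)
```

With all weights at zero the block simply passes its non-negative input through as `[1.0, 2.0]`. The bypassed layers therefore only have to learn a correction on top of the identity.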

ResNet identity mappings, 2016
Paper
Length: ~6,000 words
Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun
🔗

[ResNet identity mappings, 2016] were later proposed by the ResNet authors as a ‘clean’ information path and best design for the skip connections, so that their contents are merely added to the results of a bypassed block without any further modification. Whereas earlier designs placed an activation layer on the skip path after the addition, the proposed pre-activation design moves this layer to the start of the bypassed block instead. The skip connections can thus form a shortcut through the entire neural network that is only interrupted by additions, allowing improved propagation of gradient signals that make it possible for even deeper neural networks to be trained.
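The effect of moving the activation can be shown with a small sketch. It is reduced to the ordering of addition and ReLU, omits batch normalization, and uses a hypothetical `zero_branch` in place of trained layers.

```python
def relu(vector):
    """Set negative entries to zero."""
    return [max(0.0, v) for v in vector]


def post_activation_block(x, branch):
    """Earlier design: the ReLU follows the addition and thus sits on the skip path."""
    return relu([xi + bi for xi, bi in zip(x, branch(x))])


def pre_activation_block(x, branch):
    """Pre-activation design: the ReLU moves into the branch, the skip path only adds."""
    return [xi + bi for xi, bi in zip(x, branch(relu(x)))]


def zero_branch(x):
    """Hypothetical bypassed block that contributes nothing."""
    return [0.0 for _ in x]


signal = [-1.0, 2.0]
after_post = post_activation_block(signal, zero_branch)
after_pre = pre_activation_block(signal, zero_branch)
```

The earlier design clips the negative entry to give `[0.0, 2.0]`, whereas the pre-activation design returns `[-1.0, 2.0]` unchanged. Stacking many such blocks keeps this additive shortcut intact across the whole network.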

Dilated convolutions, 2015
Paper
Length: ~6,000 words
Authors: Fisher Yu and Vladlen Koltun
🔗

[Dilated convolutions, 2015] (or à trous convolutions) were proposed as a new type of module for dense prediction with CNNs in tasks like semantic image segmentation, where class labels are assigned to every pixel of an input image. Architectures such as AlexNet and ResNet condense input images to lower-dimensional representations via strided convolutions or pooling layers to predict one class label for an entire image. Related architectures for dense prediction therefore typically restore the original input image resolution from these downsampled, intermediate representations via upsampling operations. Whereas e.g. transposed convolutions achieve this with competitive results, dilated convolutions avoid downsampling entirely. Instead, they space out the filter kernel of a convolutional layer to skip one or more neighboring input pixels, thereby providing a larger receptive field without any reduction in resolution.
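The spacing of the kernel is easiest to see in one dimension. The sketch below assumes zero padding and an odd kernel length, with hand-picked values, and also includes a small helper that adds up the receptive field of stacked layers.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Same-size 1D convolution with zero padding; the kernel length must be odd."""
    half = (len(kernel) // 2) * dilation
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        # Neighbouring kernel taps are `dilation` input positions apart.
        sum(w * padded[i + k * dilation] for k, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions with the given dilations."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
response = dilated_conv1d(impulse, kernel=[1.0, 1.0, 1.0], dilation=2)
stacked_field = receptive_field(3, [1, 2, 4])
```

The output keeps all seven positions of the input, and three stacked layers with dilations 1, 2 and 4 already cover 15 input positions with only three weights each.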

Recurrent Neural Networks

Today, Recurrent Neural Networks (RNNs) have been largely superseded by Transformers and date from what Ilya Sutskever himself would later call the “[pre-2017] stone age” of machine learning. They nonetheless remain the subject of active research and see continued use in certain applications. Forming a substantial part of the reading list, they showcase the evolution of early insights and architectural developments that led up to the systems of today. Most of the RNNs listed below are Long Short-Term Memory (LSTM) architectures. Some designs furthermore include Feedforward Networks with no recurrent connections, usually trained end-to-end as part of the model.

Understanding LSTM Networks, 2015
Blog Post
Length: ~2,000 words
Author: Christopher Olah
🔗

[Understanding LSTM Networks, 2015] provides a brief introduction to RNNs and LSTMs in particular. RNNs can process a sequence of inputs, one step at a time, while evolving a hidden state vector that is (re-)ingested, updated and returned again at each step along the input sequence. The hidden state vector thereby allows for information to persist and be passed to subsequent processing steps. Nonetheless, simpler RNNs typically struggle with long-term dependencies. LSTMs alleviate this by introducing a cell state as additional recurrent in- and output, acting as a memory pathway for addition, update or removal of information along each processing step via trainable gating mechanisms.
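
The gating mechanism can be sketched as a single LSTM step in plain Python. To keep it short, this sketch assumes scalar inputs and states instead of vectors, and the parameter layout (one `(w_x, w_h, bias)` triple per gate) is an illustrative choice, not taken from the blog post.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c_new = f * c + i * g              # cell state: the memory pathway
    h_new = o * math.tanh(c_new)       # hidden state passed to the next step
    return h_new, c_new


params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 3), round(c, 3))
```

If the forget gate saturates near one and the input gate near zero, the cell state passes through a step almost unchanged, which is what lets information persist over long sequences.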

The Unreasonable Effectiveness of RNNs, 2015
Blog Post
Length: ~6,000 words
Author: Andrej Karpathy
🔗

[The Unreasonable Effectiveness of RNNs, 2015] shows use cases and results of RNNs in action. They are distinguished by their ability to both process and predict variable-sized sequences while also maintaining an internal state, prompting the author Andrej Karpathy to state “If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs.” He showcases results for image captioning and character-level language modeling that enables RNNs to automatically generate prose and articles. Early code generation capabilities are noted, with convincing syntax but failure to compile and a tendency to suffer from hallucinations, where the model presents evidently incorrect outputs as most probable. The blog post also includes a minimal RNN code example.

RNN Regularization, 2014
Paper
Length: ~3,500 words
Authors: Wojciech Zaremba, Ilya Sutskever and Oriol Vinyals
🔗

[RNN regularization, 2014] addresses the challenge of training large RNNs without overfitting, where a model would excessively adapt to, or even memorize, its training samples and fail to generalize to new data. To reduce this effect, the authors propose a regularization technique based on dropout, which omits randomly selected outputs of a given neural network layer. Dropout had been known for several years and was used e.g. in AlexNet. Here, the key insight was to utilize dropout only within a given RNN cell, but to avoid it on the recurrent connections that carried the hidden state vector. In this way, larger RNNs could avoid overfitting while preserving long-term dependencies.

Neural Turing Machines, 2014
Paper
Length: ~7,500 words
Authors: Alex Graves, Greg Wayne and Ivo Danihelka
🔗

[Neural Turing Machines, 2014] were proposed as a form of memory-augmented neural network, with an external memory bank on which an RNN controller could write or erase information with a ‘blurry’, differentiable, attention-based focus. Equipped with this working memory, the Neural Turing Machine outperformed a baseline RNN in experiments involving associative recall, copying and sorting sequences and generalized more robustly to sequence lengths that exceeded those encountered in training.

Deep Speech 2, 2016
Paper
Length: ~7,000 words
Authors: Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, Jie Chen, Jingdong Chen, Zhijie Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Ke Ding, Niandong Du, Erich Elsen, Jesse Engel, Weiwei Fang, Linxi Fan, Christopher Fougner, Liang Gao, Caixia Gong, Awni Hannun, Tony Han, Lappi Johannes, Bing Jiang, Cai Ju, Billy Jun, Patrick LeGresley, Libby Lin, Junjie Liu, Yang Liu, Weigao Li, Xiangang Li, Dongpeng Ma, Sharan Narang, Andrew Ng, Sherjil Ozair, Yiping Peng, Ryan Prenger, Sheng Qian, Zongfeng Quan, Jonathan Raiman, Vinay Rao, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Kavya Srinet, Anuroop Sriram, Haiyuan Tang, Liliang Tang, Chong Wang, Jidong Wang, Kaifu Wang, Yi Wang, Zhijian Wang, Zhiqian Wang, Shuang Wu, Likai Wei, Bo Xiao, Wen Xie, Yan Xie, Dani Yogatama, Bin Yuan, Jun Zhan, Zhenyao Zhu
🔗

[Deep Speech 2, 2016] proposed an automatic speech recognition system that converts audio recordings into text. It processes log-spectrograms of the audio with RNNs to predict sequences of characters in either English or Mandarin. The authors utilized batch normalization instead of dropout for regularization on the non-recurrent layers and Gated Recurrent Units (GRUs) as a somewhat simplified alternative to the LSTMs used in most of the other papers examined so far. They also applied a plethora of other engineering tweaks, including batched processing for low-latency streaming output.

RNNsearch, 2015
Paper
Length: ~8,000 words
Authors: Dzmitry Bahdanau, KyungHyun Cho and Yoshua Bengio
🔗

[RNNsearch] is credited with introducing the first attention mechanism into Natural Language Processing (NLP), proposing additive, content-based attention for neural machine translation. Its encoder-decoder architecture encodes an input sequence of English words with an RNN encoder into a context vector used by an RNN decoder to predict an output sequence of French words. In prior work this context vector was simply the final hidden state of the encoder, which therefore had to contain all relevant information about the input sequence. RNNsearch addresses this bottleneck by making the context vector a weighted sum over all encoder hidden states, or annotations. When predicting a target word, the decoder can thereby rely on context from arbitrary parts of the encoded input sequence by (re-)calculating the context vector as a weighted sum of annotations. The weighting is determined by an alignment model, a feedforward network that receives the current decoder hidden state together with an annotation and assigns a score to the latter.
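
The scoring and weighting described above can be sketched in plain Python. The sketch assumes the common formulation in which the alignment model is a single hidden layer with a tanh non-linearity; the names `additive_attention`, `W_s`, `W_h` and `v` as well as the toy values are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def additive_attention(decoder_state, annotations, W_s, W_h, v):
    """score_j = v . tanh(W_s s + W_h h_j); context = sum_j softmax(score)_j * h_j."""
    projected_state = matvec(W_s, decoder_state)
    scores = []
    for h in annotations:
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, matvec(W_h, h))]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)  # normalized over all encoder positions
    dim = len(annotations[0])
    context = [sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)]
    return context, weights


annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder hidden states
decoder_state = [0.5, -0.5]
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
context, weights = additive_attention(decoder_state, annotations, W_s, W_h, v)
print(weights, context)
```

Because the context vector is recomputed for every decoder step, each target word can draw on a different mix of annotations.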

Pointer Networks, 2015
Paper
Length: ~4,500 words
Authors: Oriol Vinyals, Meire Fortunato and Navdeep Jaitly
🔗

[Pointer Networks] repurpose the concept of content-based attention to solve combinatorial optimization problems. Here, content-based attention is used to ‘point’ at elements of the input sequence in a specific order. The output sequence is therefore an indexing of the input elements. Given a set of two-dimensional points as input, Pointer Net was trained to solve for their convex hull, Delaunay triangulation or Traveling Salesman Problem by predicting in which order these points should be visited. With no limitation to the length of the output sequence or dictionary, this approach was found to generalize beyond the longest sequence length encountered in training.

Set2Set, 2016
Paper
Length: ~6,500 words
Authors: Oriol Vinyals, Samy Bengio and Manjunath Kudlur
🔗

[Set2Set] extends sequence-to-sequence methods as examined above to enable order-invariant processing of sets. These methods are shown to strongly depend on the specific order of both the input and output sequence (e.g. the exact order of random points provided to Pointer Networks for convex hull prediction). The authors propose Set2Set as a solution, with the encoder forming a memory bank (that resembles the annotations of RNNsearch) to create a context vector. This memory bank, however, is not read just once. Instead, a process block introduces a new LSTM, which evolves a query vector for repeated, content-based attention readouts of the memory. Finally, the write block (a Pointer Network) can add even more attention steps in the form of glimpses.

Relation Networks, 2017
Paper
Length: ~5,000 words
Authors: Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, Timothy Lillicrap
🔗

[Relation Network, 2017] modules were proposed as a method for relational inference tasks such as visual and text-based question answering. The Relation Network module ingests a pair of feature vectors, for example an LSTM hidden state for a word or sentence in text or the values at a specific pixel position in feature maps produced by a CNN for image data. A given pairing is processed with one or more neural network layers before forming an element-wise sum and creating an output with a second stack of layers. By doing this for all pairs of inputs, this approach outperformed the human baseline in answering textual questions regarding the size, position and color of 3D generated shapes relative to each other.
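
The pairwise structure can be sketched as follows. In this sketch the two functions `g` and `f` are hand-written, hypothetical stand-ins for the two learned stacks of layers, chosen only to make the example runnable.

```python
from itertools import product


def relation_network(objects, g, f):
    """Apply g to every ordered pair, sum the results element-wise, then apply f."""
    pair_outputs = [g(a, b) for a, b in product(objects, repeat=2)]
    summed = [sum(values) for values in zip(*pair_outputs)]
    return f(summed)


def g(a, b):
    # Hypothetical pair function; in the paper this is a trained network.
    return [abs(a[0] - b[0]), a[1] * b[1]]


def f(summed):
    # Hypothetical output function; in the paper this is a second trained network.
    return summed[0] + summed[1]


objects = [(0.0, 1.0), (2.0, 3.0), (5.0, 2.0)]
print(relation_network(objects, g, f))
print(relation_network(list(reversed(objects)), g, f))  # same value
```

Because the pair outputs are summed, the result does not depend on the order in which the objects are presented.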

Relational Recurrent Neural Networks, 2018
Paper
Length: ~6,000 words
Authors: Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, Timothy Lillicrap
🔗

[Relational Recurrent Neural Networks, 2018] proposed a Relational Memory Core module in which an attention mechanism allows memories to interact with each other and be recurrently refined as a fixed-size matrix. This approach was adapted for several tasks requiring relational reasoning and outperformed multiple baseline methods. Given a random set of vectors of which an arbitrary one was marked, it predicted which other vector had the highest Euclidean distance to it. It also learned to execute short code snippets involving variable manipulation, performed language modeling and scored well in a toy reinforcement learning task. The self-attention mechanism that enabled its memory interactions is described in the following section.

Transformers

The previous section tracks the rise of attention mechanisms as an increasingly potent tool for providing context in sequence-to-sequence prediction tasks. Eventually, these developments yielded the Transformer as a neural network architecture that predominantly relies on attention and discards both recurrent and convolutional layers entirely. The excellent scalability of this approach, together with growing compute resources and extensive training data, established Transformers as the dominant method for language modeling, forming the backbone of systems like ChatGPT and performing well even on image and multimodal data.

New attention mechanisms enabled substantial speed and efficiency advantages:
The additive attention mechanism of the previous section compared an encoder and decoder hidden state to each other by applying an alignment model to each such pair for scoring. Internally, the alignment model formed linear projections of both vectors and added them together to calculate a score, which was normalized over all encoder hidden states to form a context vector as their weighted average.
With multiplicative attention, Transformers compare multiple pairings of hidden states at once by forming a dot product of their linear projections, which can be implemented with faster, highly optimized matrix multiplications.

Attention Is All You Need, 2017
Paper
Length: ~4,500 words
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin
🔗

[Attention Is All You Need] proposed the Transformer architecture. In an encoder-decoder structure for machine translation, embedding layers convert each input token into a feature vector, to which positional encodings are added. The proposed Scaled Dot-Product Attention computes a weighted average over multiple value vectors, each weighted by comparing its associated key vector to a given query vector using the dot product. The result is scaled (for numerical stability) and then normalized over all keys with a softmax function. Multi-head attention conducts this process in parallel with different, learned projections of each input.
Three variants of this mechanism are used. In self-attention used by the encoder, the query, key and value are distinct linear projections of the same output vector from the previous layer. In masked self-attention the decoder furthermore masks out the weights for future tokens. Finally, in encoder-decoder attention, each decoder block obtains only the query from the preceding decoder layer, whereas key and value originate from the final encoder layer. The experiments exceeded state-of-the-art results, with two orders of magnitude lower compute resources than previous approaches.
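
A minimal sketch of Scaled Dot-Product Attention in plain Python follows. It omits the learned projections and multiple heads, operates on nested lists rather than matrices, and uses an illustrative `causal` flag to stand in for the decoder's masking of future tokens.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values, causal=False):
    """For each query: softmax(q . k / sqrt(d_k)) over all keys, then average the values."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Masked self-attention: position i must not attend to positions after i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = scaled_dot_product_attention(tokens, tokens, tokens)
masked = scaled_dot_product_attention(tokens, tokens, tokens, causal=True)
print(full)
print(masked)
```

In the masked variant, the first position can only attend to itself, so its output equals its own value vector.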

The Annotated Transformer, 2020
Blog Post (2022 version)
Length: ~6,000 words
Authors: Austin Huang, Suraj Subramanian, Jonathan Sum, Khalid Almubarak, and Stella Biderman (2020 original by Sasha Rush)
🔗

[The Annotated Transformer] implements the Transformer as described in ‘Attention is All You Need’ line by line as a fully functional Jupyter Notebook using PyTorch, with all code available on GitHub. Text segments of the original paper feature alongside the code, together with comments and visualizations that clarify various aspects of the architecture beyond the contents of the paper. The notebook also implements examples for data formatting, training and inference that show the Transformer applied in practice.

Note: The Illustrated Transformer by Jay Alammar is yet another in-depth guide.

Scaling Laws for Neural Language Models, 2020
Paper
Length: ~9,000 words
Authors: Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu and Dario Amodei
🔗

[Scaling Laws for Neural Language Models] explores the predictive performance of Transformers for language modeling as a function of model size, data quantity and available compute resources. Extensive empirical results enable the authors to establish formulas that relate these factors to each other over seven orders of magnitude and to derive several recommendations as to their optimal configuration. While each of these can form a bottleneck, model size (i.e. the number of trainable parameters) forms the single most impactful factor. Larger models reach higher sample-efficiency and better generalization earlier on in training. The specific model architecture has little effect. An eight-fold increase in model size requires a five-fold increase in training data. Given a fixed compute budget, the authors accordingly recommend prioritizing model size first, then batch size and only then the number of training steps, with early stopping before convergence typically providing the best trade-off in their experiments.
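
The relation between model size and data can be checked with a short calculation. It assumes the sublinear power law reported in the paper, in which the required amount of data grows roughly with model size to the power of 0.74; the function name is illustrative.

```python
def data_multiplier(model_multiplier, exponent=0.74):
    """Factor by which training data should grow when model size grows by model_multiplier."""
    return model_multiplier ** exponent


print(round(data_multiplier(8), 2))  # 4.66
```

An eight-fold larger model thus calls for roughly 4.7 times the data, which the paper rounds to about five-fold.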

Information Theory

A substantial portion of the reading list is dedicated to more abstract material on theoretical informatics. Rather than proposing specific architectures or engineering solutions for concrete applications, these works are concerned with more fundamental study of the limits of computability, probability and intelligence. Recurring themes are principles for inductive inference such as Occam’s razor, which states a preference for simplicity when choosing between competing explanations (be they theories, hypotheses or models) for some given evidence or data. Another core concept is Kolmogorov complexity* for quantifying the amount of information, or potential for compression, of a given input.

*Note: Kolmogorov complexity of a sequence can be defined as the length of the shortest program that prints it and then halts. While uncomputable in practice, it can be approximated with compression software such as gzip.
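
A small experiment along these lines, using Python's `zlib` module (which implements the same DEFLATE algorithm that gzip uses), could look as follows. The compressed size is only an upper-bound style approximation and includes a few bytes of format overhead.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes after DEFLATE compression, a crude stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


regular = b"ab" * 5000          # highly regular: a short program could print it
random_bytes = os.urandom(10000)  # random: essentially incompressible

print(len(regular), compressed_size(regular))
print(len(random_bytes), compressed_size(random_bytes))
```

Both inputs have the same length, but the regular one shrinks to a tiny fraction of its size while the random one does not shrink at all.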

A Tutorial Introduction to the Minimum Description Length Principle, 2004
Book Chapter
Length: ~30,000 words
Author: Peter Grünwald
🔗

[A Tutorial Introduction to the Minimum Description Length Principle, 2004] describes an approach for model selection that mathematically formalizes Occam’s razor, defining a preference for the simplest model among all those that explain the available data. The principle relates learning to data compression, as the ability to exploit regularity for achieving the shortest possible description. This description is defined by codes, and codelength functions are noted as corresponding to probability mass functions. The two-part code version of the Minimum Description Length (MDL) principle measures the simplicity of a model instance as the length of its description (in bits) added to the length of the data description as encoded with it. The refined, one-part code version examines entire families of models based on their goodness-of-fit and complexity.
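
The two-part code can be illustrated with a toy comparison of two coin models for a binary sequence. The sketch relies on the correspondence between codelengths and probabilities noted above, so that the data costs -log2 P(x) bits per symbol. The 8 bits charged for describing the bias parameter are an arbitrary assumption made for this example.

```python
import math


def data_bits(sequence, p_one):
    """Codelength of the data under a Bernoulli(p_one) model: sum of -log2 P(x)."""
    return sum(-math.log2(p_one if bit else 1.0 - p_one) for bit in sequence)


def two_part_length(sequence, p_one, model_bits):
    """Total description length: bits for the model plus bits for the data given the model."""
    return model_bits + data_bits(sequence, p_one)


sequence = [1] * 90 + [0] * 10
fair = two_part_length(sequence, 0.5, model_bits=0)    # nothing to describe
biased = two_part_length(sequence, 0.9, model_bits=8)  # assumed cost of encoding p
print(round(fair, 1), round(biased, 1))
```

Although the biased model is more expensive to describe, it compresses this regular sequence so much better that its total description length is shorter, and MDL would select it.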

Kolmogorov Complexity and Algorithmic Randomness (Chapter 14), 2017
Book Chapter
Length: ~35,000 words
Authors: Alexander Shen, Vladimir A. Uspensky and Nikolay Vereshchagin
🔗

[Kolmogorov Complexity and Algorithmic Randomness] features a final chapter on algorithmic statistics. In this framework, a given sequence of observations is encoded as one binary string. Kolmogorov complexity provides formal means of quantifying its randomness and regularity, as well as the expected and desired properties of a theory or model that can explain it. Such a model should preferably be simple, as indicated by low Kolmogorov complexity. It should also explain as much regularity in the data as possible, making the data “typical” for the model. This property is formally quantified by low randomness deficiency of the data relative to the model. Together, these two properties are also related to the two-part code of the Minimum Description Length principle. The chapter closes by drawing parallels between good models and good compressors, together with the potential of lossy compression to perform effective denoising.

The First Law of Complexodynamics, 2011
Blog Post
Length: ~2,000 words
Author: Scott Aaronson
🔗

[The First Law of Complexodynamics] explores the relationship between entropy and complexity. Whereas the second law of thermodynamics dictates that entropy of closed systems increases over time, their complexity or ‘interestingness’ is noted to first rise and then fall again. Giving the example of coffee and milk mixing in a glass, the highest such ‘complextropy’ is noted to occur midway, when tendrils of milk result from both liquids no longer being cleanly separated but also not yet forming a homogeneous blend. Kolmogorov complexity is explored as a way to express both entropy and this ‘complextropy’, with the conjecture that a resource-bounded definition could provide a suitable theoretical framework.

Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton, 2014
Paper
Length: ~8,500 words
Authors: Scott Aaronson, Sean M. Carroll and Lauren Ouellette
🔗

[Quantifying the Rise and Fall of Complexity] explores these ideas in further depth. Covering various theoretical notions of complexity, it eventually settles on ‘apparent complexity’ as a way of modeling the separate phenomena of entropy and the ‘interestingness’ of a closed system. Practical experiments inspired by the blending of coffee and milk fill a 2D array with a clean split of binary values and perturb these over multiple time steps to represent random mixing. This array forms an image which is compressed by gzip to approximate Kolmogorov complexity via file size. At each time step, this is done with the image itself to approximate entropy, but also with a coarse-grained, blurred version to estimate its apparent complexity. As envisioned, the increasingly noisy image values yield rising entropy, whereas the apparent complexity measured on their blurred representation first rises and then falls as the mix gets more homogeneous.
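
A heavily simplified version of such an experiment can be sketched in plain Python. Everything specific here is an assumption made for illustration and not the paper's exact procedure: the grid size, the random swapping of neighboring cells, the block averaging with quantization into four levels, and the use of `zlib` in place of gzip.

```python
import random
import zlib


def make_grid(size):
    # Top half "coffee" (1), bottom half "milk" (0).
    return [[1 if r < size // 2 else 0 for _ in range(size)] for r in range(size)]


def mix(grid, swaps, rng):
    """Swap randomly chosen pairs of neighboring cells in place."""
    size = len(grid)
    for _ in range(swaps):
        r, c = rng.randrange(size), rng.randrange(size)
        dr, dc = rng.choice([(0, 1), (1, 0), (0, -1), (-1, 0)])
        r2, c2 = r + dr, c + dc
        if 0 <= r2 < size and 0 <= c2 < size:
            grid[r][c], grid[r2][c2] = grid[r2][c2], grid[r][c]


def coarse_grain(grid, block=4, levels=4):
    """Average over block x block tiles and quantize the mean into a few levels."""
    size = len(grid)
    coarse = []
    for r in range(0, size, block):
        row = []
        for c in range(0, size, block):
            mean = sum(grid[r + i][c + j] for i in range(block) for j in range(block)) / block**2
            row.append(min(levels - 1, int(mean * levels)))
        coarse.append(row)
    return coarse


def compressed_size(grid):
    return len(zlib.compress(bytes(v for row in grid for v in row), 9))


rng = random.Random(0)
grid = make_grid(32)
initial_total = sum(map(sum, grid))
for step in range(6):
    # Columns: step, entropy estimate, apparent complexity estimate.
    print(step, compressed_size(grid), compressed_size(coarse_grain(grid)))
    mix(grid, swaps=20000, rng=rng)
```

The first printed column of sizes serves as the entropy estimate and the second as the apparent complexity estimate; how clearly the rise and fall shows up depends on the chosen grid size, number of swaps and coarse-graining.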

Machine Super Intelligence, 2008
Dissertation
Length: ~50,000 words
Author: Shane Legg, supervised by Marcus Hutter
🔗

[Machine Super Intelligence, 2008] explores universal artificial intelligence under aspects of algorithmic complexity, probability and information theory. It covers inductive inference from Epicurus’ principle of multiple explanations, Occam’s razor, Bayes’ rule and priors to complexity measures and agent-environment models as examined in reinforcement learning. Discussing various definitions and established tests for intelligence, it proposes a formal definition and measure for universal intelligence* as the ability of an agent to achieve specific goals in a wide range of environments. While the proposed measure itself is uncomputable in practice, it enables theoretical conclusions, such as the requirement that powerful agents be proportionally complex, and motivates several practical experiments in which a downscaled version of a hypothetically optimal agent is deployed for reinforcement learning.

*Note: This measure would accordingly score the universal intelligence of specialized, ‘narrow’ machine learning systems that form the bulk of the papers examined in this blog post as comparatively low.

Miscellaneous

Keeping Neural Networks Simple by Minimizing the Description Length of the Weights, 1993
Paper
Length: ~6,000 words
Authors: Geoffrey E. Hinton and Drew van Camp
🔗

[Keeping Neural Networks Simple by Minimizing the Description Length of the Weights] introduced the concept of Variational Inference with neural networks. This approach enables neural network training to approximate the otherwise computationally prohibitive concept of Bayesian inference. The authors propose a regularization technique that represents each weight of a neural network as a Gaussian probability distribution described by a mean and a variance value. Inspired by the Minimum Description Length principle, the cost function used during training penalizes the description length of the weights and the data misfits. The authors argue that this representation allows for a substantial reduction in the description length of the weights. Their Bits-Back Coding argument states that the distribution of each weight can be sampled with random bits at no additional cost, as the random bits can be reconstructed given a fixed learning algorithm, architecture and initial probability distribution for each weight.

Variational Lossy Autoencoder, 2017
Paper
Length: ~6,000 words
Authors: Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever and Pieter Abbeel
🔗

[Variational Lossy Autoencoders] provide a way for data compression with control over which aspects of the data should be retained or discarded. In experimental results, this enables 2D image compression that discards local texture while retaining global structure. Autoencoders use an inference model to compress input data to a compact latent code, from which a generative model decodes the original input. This latent code should accordingly represent all information relevant for describing the input. When using sufficiently powerful autoregressive models like RNNs, however, decoders had previously been found capable of predicting the output while ignoring the latent code entirely. Here, a theoretical explanation for this phenomenon is provided based on Bits-Back Coding. The proposed approach weakens the decoder (e.g. limiting it to reconstruct small receptive fields) such that it depends on the missing information (e.g. global structure) being fully provided by the latent code to which the input is compressed.

GPipe, 2018
Paper
Length: ~5,000 words
Authors: Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu and Zhifeng Chen
🔗

[GPipe] is* a library for distributed training of neural networks on more than one accelerator (e.g. GPUs). It subdivides the neural network architecture into cells formed by one or more consecutive layers and assigns each cell to a separate accelerator. It furthermore employs pipeline parallelism by also splitting each mini-batch of training samples into several micro-batches that are pipelined through, so that multiple accelerators can work on different micro-batches concurrently. The gradients for all micro-batches are aggregated for one synchronous update per mini-batch. Training thus remains consistent regardless of cell count or micro-batch size.

*Note: As for the library itself, GitHub shows that its most recent commit for the final version v0.0.7 occurred in September 2020.
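
Why the aggregated update stays consistent can be illustrated with a toy example. The sketch covers only the gradient accumulation over micro-batches, not the pipelining across accelerators, and assumes a one-parameter linear model with squared error. All names are illustrative and unrelated to the GPipe library's API.

```python
def grad_sum(w, batch):
    """Sum over samples of d/dw (w * x - y)^2."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def minibatch_gradient(w, batch):
    return grad_sum(w, batch) / len(batch)


def accumulated_gradient(w, batch, num_micro_batches):
    """Split the mini-batch, accumulate per-micro-batch gradients, normalize once."""
    size = len(batch) // num_micro_batches
    micro_batches = [batch[i:i + size] for i in range(0, len(batch), size)]
    total = sum(grad_sum(w, micro_batch) for micro_batch in micro_batches)
    return total / len(batch)


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
         (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.0)]
w = 1.5
print(minibatch_gradient(w, batch))
print(accumulated_gradient(w, batch, num_micro_batches=4))
```

Both calls yield the same gradient up to floating-point rounding, so the single synchronous update is unaffected by how the mini-batch was split.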

Neural Message Passing for Quantum Chemistry, 2017
Paper
Length: ~6,000 words
Authors: Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals and George E. Dahl
🔗

[Neural Message Passing for Quantum Chemistry] explores the application of graph neural networks to predict quantum mechanical properties of organic molecules. The commonalities between several related works that utilize neural networks for graph data are first discussed and abstracted into a new concept of Message Passing Neural Networks. This framework considers undirected graphs composed of edges and nodes, both of which can have features. Each forward pass performs one or more steps of a message passing phase in which the hidden state of each given node is updated based on messages that depend on the hidden states and connecting edges with all adjacent nodes. Next, a readout phase calculates one hidden state for the entire graph using a readout function that is invariant to the order of graph nodes. Using the Gated Graph Neural Network architecture, the authors present experimental results on graph data of molecular structures that achieved state-of-the-art results at the time.
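
The two phases can be sketched in a few lines of plain Python. For brevity, this sketch assumes scalar node states and edge features, and the message, update and readout functions are simple hypothetical stand-ins for the learned functions of an actual model.

```python
def message_passing_step(hidden, edges, message_fn, update_fn):
    """hidden: {node: state}; edges: {(u, v): edge_feature} of an undirected graph."""
    incoming = {node: 0.0 for node in hidden}
    for (u, v), edge_feature in edges.items():
        # Each undirected edge carries a message in both directions.
        incoming[v] += message_fn(hidden[v], hidden[u], edge_feature)
        incoming[u] += message_fn(hidden[u], hidden[v], edge_feature)
    return {node: update_fn(hidden[node], incoming[node]) for node in hidden}


def readout(hidden):
    # A plain sum is invariant to the order of the nodes.
    return sum(hidden.values())


def message_fn(h_node, h_neighbor, edge_feature):
    return edge_feature * h_neighbor  # hypothetical stand-in


def update_fn(h_node, message):
    return 0.5 * h_node + 0.5 * message  # hypothetical stand-in


hidden = {"a": 1.0, "b": 2.0, "c": 3.0}
edges = {("a", "b"): 1.0, ("b", "c"): 2.0}
updated = message_passing_step(hidden, edges, message_fn, update_fn)
print(updated)           # {'a': 1.5, 'b': 4.5, 'c': 3.5}
print(readout(updated))  # 9.5
```

Running several such steps lets information travel further across the graph before the readout condenses all node states into one value.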

Concluding Thoughts

Whereas this summary was written by hand, the described technology is, already in 2024, approaching a point where its language modeling capabilities are hard to distinguish from human writing. Eventually, Large Language Models may indeed become self-explanatory in a literal sense. Prompting e.g. ChatGPT to summarize this material nonetheless still yields explanations that seem very convincing but can also be largely hallucinated and often misleading. Perhaps that will already improve once this article is ingested into the training data?

Although low-quality, generated content was a recurring theme I encountered while researching this list, there are also several independent summaries of the reading list worth sharing here:

Aman Chadha’s Distilled AI (~12,000 words, includes other papers)
DataMListic’s YouTube video playlist (about 25 min total)

With this blog post, the known contents of the reading list are compressed to barely more than one percent of the original word count. This leaves a lot more to be discussed, but hopefully it still has something to offer for the interested reader. My own, subjective review will be saved for another post in the future.
