Introduction: Gradient Highways
A decade ago, training deep neural nets was a major bottleneck in ML. Increasing depth, counter-intuitively, reduced performance: although we knew that depth meant richer representations in theory, in practice training was brittle and numerically unstable as the chain of Jacobians grew longer. Many work-arounds existed, such as layer-wise pre-training, semi-supervised stacking, better initializations …
until a different approach successfully mitigated a major design flaw in purely sequential neural nets: the vanishing gradient problem, especially pronounced with sigmoid and tanh, the common activation function choices at that time.
Residual connections were introduced to solve that problem but over time this solution influenced a wide range of design choices, training strategies and representation learning dynamics.
note: figures in this blog reflect a focus on the residual stream(s) by deviating from the usual illustrations, and omit normalization layers in most figures for simplicity
Residual Connections
the high-level intuition behind residual connections is that, when training very deep networks (e.g. MLPs), the representation at one layer becomes less and less different from the next; thus, instead of fully learning the representation at layer $l+1$, we can simply learn the difference / change relative to the representation at layer $l$:

$$x_{l+1} = x_l + F_l(x_l)$$
Although the computed function class remains unchanged, during training the latter formulation offers a great optimization advantage.
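as a minimal sketch (sizes and names arbitrary), a residual MLP block only has to parameterize the deviation $F$:

```python
import torch
import torch.nn as nn

class ResidualMLPBlock(nn.Module):
    """Learns only the deviation F(x); the identity path comes for free."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x_{l+1} = x_l + F_l(x_l)
        return x + self.f(x)

# stacking many such blocks keeps a direct identity path from input to output
model = nn.Sequential(*[ResidualMLPBlock(256, 1024) for _ in range(50)])
y = model(torch.randn(8, 256))
```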
if we take a closer look at a single parameter tensor in a certain layer of a deep MLP and derive its gradient computation, we end up with something like:

$$\frac{\partial \mathcal{L}}{\partial \theta_l} \;=\; \frac{\partial \mathcal{L}}{\partial x_L} \left( \prod_{k=l}^{L-1} \frac{\partial x_{k+1}}{\partial x_k} \right) \frac{\partial x_l}{\partial \theta_l}$$
The premise of gradient-based methods is that useful gradient signal vanishes only at optima; in practice, however, a single Jacobian factor with a small norm is enough to collapse the entire product, effectively blocking learning in earlier layers and preventing gradient flow
intuitively, if a parameter can affect the loss only through its effect on the next layer, then a shrinking gradient downstream removes its only pathway to contribute to optimization.
A common cause of this phenomenon is the presence of saturation regions in activation functions (e.g. extreme-magnitude inputs to sigmoid & tanh, …)
residual blocks introduce an explicit identity term into the layer-to-layer Jacobian, $\frac{\partial x_{l+1}}{\partial x_l} = I + \frac{\partial F_l(x_l)}{\partial x_l}$, ensuring a theoretically depth-independent gradient component
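a quick, illustrative experiment: stack saturating layers with and without the identity path and compare the gradient norm that reaches the first layer (numbers vary with seed and sizes; nothing here is tuned):

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(use_residual: bool, depth: int = 40, dim: int = 64) -> float:
    layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
    h = torch.randn(16, dim)
    for layer in layers:
        update = torch.sigmoid(layer(h))          # saturating activation
        h = h + update if use_residual else update
    loss = h.pow(2).mean()
    loss.backward()
    return layers[0].weight.grad.norm().item()

print("plain stack   :", first_layer_grad_norm(False))
print("residual stack:", first_layer_grad_norm(True))
# with the identity term, the gradient reaching layer 0 is typically orders of magnitude larger
```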
the first major breakthrough using this idea was ResNet, which achieved SOTA performance on ImageNet at the time and demonstrated, for the first time, that networks hundreds of layers deep could be trained reliably.
beyond training stability, ResNet cemented a mental model of information flow and representation evolution through model depth. it shifted the view from semantically disassociated blocks to a continuous flow of updates, with each layer restricted by design to learn small representation-wise deviations, using depth to iteratively move the representation toward the target underlying manifold.
Transformers later adopted the same principle, modeling the entire network as a sequence of residual updates in which attention modules and MLPs write into a shared residual stream. this opened up an entirely new design space beyond vanilla feed-forward stacks: how to combine residuals with normalization layers, how strongly to weight each update, and how to keep the evolving representation well-conditioned across however many blocks we use.
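a (pre-norm, simplified) sketch of how each transformer block reads from and writes back into the shared stream; dropout, masking and other details are omitted:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Attention and MLP each contribute an additive update to the residual stream."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention writes into the stream
        x = x + self.mlp(self.norm2(x))                    # MLP writes into the stream
        return x
```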
Gates and Filters
once another stream of representation flow is introduced, normalization, selection, and filtering become necessary to keep representation updates stable and meaningful.
the simplest form of such control are gates:

$$x_{l+1} = x_l + g_l \odot F_l(x_l), \qquad g_l \in (0,1)$$

where $g_l$ regulates the residual connection update. this idea appeared in architectures such as GRU and LSTM in sequential modeling, and similar approaches appear in CNN and UNet variants (e.g. nnUNet, …) where skip connections aren't simply added to the main stream but rather gated or attention-weighted, allowing fine-grained control over information flow.
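a minimal sketch of such a gate, here a per-feature sigmoid gate conditioned on the input (names are illustrative):

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))   # per-feature gate in (0, 1)
        return x + g * self.f(x)          # x_{l+1} = x_l + g ⊙ F(x_l)
```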
a more structured extension is the mixture of experts (MoE) paradigm: instead of a single residual update, multiple specialized blocks (experts) propose updates and a learned router selects or weights them. in residual terms, MoEs decide which transformations are allowed to write into the shared stream for a given input.
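in residual-stream terms, a toy dense, softmax-weighted MoE update might look like the following; real MoE layers add sparse top-k routing, capacity limits and load-balancing losses:

```python
import torch
import torch.nn as nn

class MoEResidual(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(x), dim=-1)           # (batch, E)
        updates = torch.stack([e(x) for e in self.experts], -1)   # (batch, dim, E)
        update = (updates * weights.unsqueeze(-2)).sum(-1)        # weighted write into the stream
        return x + update
```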
another approach focuses on modulating residual updates rather than selecting them; feature-wise linear modulation (FiLM) applies an input-conditioned scale and shift:

$$\mathrm{FiLM}(x \mid c) = \gamma(c) \odot x + \beta(c)$$

allowing more dynamic updates based on input features, structure or other conditioning vectors (very commonly used in diffusion and generative models)
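a small sketch of FiLM-style modulation, with $\gamma$ and $\beta$ predicted from a conditioning vector $c$ (e.g. a timestep or class embedding; the module below is illustrative):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_scale_shift(c).chunk(2, dim=-1)
        return gamma * x + beta   # FiLM(x | c) = γ(c) ⊙ x + β(c)
```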
all the mechanisms and components cited above show how the simple residual turned into a controlled editing process over a shared stream; even modern activation functions like SwiGLU and GEGLU can be interpreted as following the same input-conditioned gate pattern
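for instance, a SwiGLU sketch where one branch multiplicatively gates the other (bias and sizing conventions vary across implementations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden)
        self.w_value = nn.Linear(dim, hidden)
        self.w_out = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu(gate) acts as an input-conditioned soft gate on the value branch
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))
```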
Neural ODEs and dynamical systems
a natural interpretation of a continuous residual update flow is that of a dynamical system. once models are written as

$$x_{l+1} = x_l + F_l(x_l)$$
depth can be interpreted as time, a formulation studied for decades across physics, control and robotics
the Neural ODE paper (Chen et al.) made this connection explicit by observing that a residual network corresponds to the explicit Euler discretization of an ordinary differential equation:

$$\frac{dx(t)}{dt} = f_\theta(x(t), t), \qquad x_{t+h} = x_t + h\, f_\theta(x_t, t)$$

a standard ResNet corresponds to this discretization with $h = 1$.
in this interpretation, each residual block computes a small update based on the current state and moves the representation forward along a trajectory in feature space. instead of stacking transformations, the network learns a discretized vector field.
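seen this way, depth is just the number of integration steps; a sketch of rolling a learned vector field forward with explicit Euler steps ($h = 1$ recovers a weight-tied ResNet):

```python
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """f_theta(x): the learned dynamics, shared across steps."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x)

def euler_integrate(field, x0: torch.Tensor, steps: int, h: float = 1.0) -> torch.Tensor:
    x = x0
    for _ in range(steps):
        x = x + h * field(x)   # one residual block == one Euler step
    return x

out = euler_integrate(VectorField(64), torch.randn(8, 64), steps=12)
```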
this perspective has since evolved into families of models such as Neural ODEs, continuous-time RNNs and liquid neural networks, which learn how an internal state flows over time. companies like liquid.ai build directly on this idea to achieve highly efficient models in several settings.
the Neural ODE paper's authors go further than ResNet-like fixed discretization, working with adaptive time discretization and using specialized solvers to integrate the learned dynamics
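as a toy illustration of adaptive stepping (simple step doubling, not the paper's adjoint-based solvers): take one full Euler step and two half steps, shrink $h$ when they disagree, grow it when they agree:

```python
import torch

def adaptive_euler(field, x, t_end: float = 1.0, h: float = 0.25, tol: float = 1e-2):
    t = 0.0
    while t < t_end:
        h = min(h, t_end - t)
        full = x + h * field(x)                   # one step of size h
        half = x + 0.5 * h * field(x)
        two_half = half + 0.5 * h * field(half)   # two steps of size h/2
        err = (full - two_half).norm() / (x.norm() + 1e-8)
        if err < tol:
            x, t = two_half, t + h                # accept, maybe grow the step
            h = h * 1.5
        else:
            h = h * 0.5                           # reject, shrink the step
    return x

x_final = adaptive_euler(lambda x: -x, torch.randn(8, 64))  # toy linear dynamics dx/dt = -x
```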
RAPTOR: Block Recurrence Effect
once models are built around a residual stream, a new regularization technique emerges: Drop Path,
ensuring no subset of blocks dominates the contribution of updates; during training we randomly skip layers / blocks with some probability $p$, further reinforcing block-to-block representation similarity and forcing later blocks to approximate earlier dynamics under stochastic depth
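a minimal Drop Path sketch: each residual branch is skipped per-sample with probability $p$ during training, and kept branches are rescaled so the expected update stays unchanged:

```python
import torch
import torch.nn as nn

class DropPathResidual(nn.Module):
    def __init__(self, block: nn.Module, drop_prob: float = 0.1):
        super().__init__()
        self.block = block
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and self.drop_prob > 0.0:
            # per-sample keep mask, broadcast over feature dims
            keep = (torch.rand(x.shape[0], *[1] * (x.dim() - 1), device=x.device)
                    >= self.drop_prob).float()
            return x + keep * self.block(x) / (1.0 - self.drop_prob)
        return x + self.block(x)
```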
the Recurrent Block Dynamics (RAPTOR) paper (Jacob et al) studies this behavior and finds large representation similarity across blocks, clustering depth into sets of consecutive blocks.
the authors tested and confirmed their block recurrence hypothesis by training a vision transformer (distilled from DINOv2) using only 2 blocks, each representing a set of functionally similar blocks in the teacher model; applying each block recurrently preserved 96% of DINOv2's linear-probe accuracy on ImageNet-1k.
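loosely sketched, the resulting student looks like two weight-tied blocks unrolled for several steps each (block grouping, distillation losses and the unroll counts below are illustrative, not the paper's exact recipe):

```python
import torch.nn as nn

class TwoBlockRecurrentViT(nn.Module):
    """Two shared blocks applied recurrently, standing in for two clusters of similar teacher blocks."""
    def __init__(self, block_a: nn.Module, block_b: nn.Module, steps_a: int = 6, steps_b: int = 6):
        super().__init__()
        self.block_a, self.block_b = block_a, block_b
        self.steps_a, self.steps_b = steps_a, steps_b

    def forward(self, x):
        for _ in range(self.steps_a):
            x = self.block_a(x)   # early-depth dynamics, reused
        for _ in range(self.steps_b):
            x = self.block_b(x)   # late-depth dynamics, reused
        return x

# e.g. wrapping two transformer blocks such as the PreNormBlock sketch from earlier:
# student = TwoBlockRecurrentViT(PreNormBlock(384), PreNormBlock(384))
```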
dynamical-systems-style interpretability was also studied in the same work, finding tokens converging to angular attractors and local self-corrective behavior, tested via robustness to angular perturbations
Hyper Connections & mHC
while earlier augmentations focused on expanding residual depth, most notably the Deep Layer Aggregation paper (Yu et al), which influenced much subsequent work, recent efforts shifted toward expanding residual width. in early 2025 a ByteDance team introduced hyper connections (Zhu et al) as an expansion of the vanilla residual, offering a learnable interpolation between pre-norm and post-norm residual connections as well as a broader mechanism generalizing width-expansion approaches (ResiDual, Parallel Transformer Blocks, …).
since the original transformer paper, many architectural tweaks have been adopted, most prominently pre-norm variants. yet the latter didn't offer a free lunch: while pre-norm mitigates vanishing gradients, it makes representation collapse more common, whereas post-norm exhibits the opposite trade-off.
the core idea behind hyper connections is the use of multiple residual streams together with three maps, which we will call $H^{\text{pre}}$, $H^{\text{post}}$ and $H^{\text{res}}$ here: a merging map $H^{\text{pre}}$ that aggregates the streams at the input of each layer, a redistribution map $H^{\text{post}}$ that writes the layer's update back into the streams, and a residual mixing matrix $H^{\text{res}}$ that mixes the residual streams between the layer's input and output
the hyper connection framework therefore introduces both depth connections (pre- and post-layer aggregation and redistribution) and width connections (mixing of the residual streams).
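to make the bookkeeping concrete, here is a deliberately simplified, static-only sketch of wrapping a layer with $n$ residual streams; names, shapes and initializations are illustrative rather than the paper's:

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Wraps a layer F with n residual streams (static variant, illustrative only)."""
    def __init__(self, layer: nn.Module, n_streams: int):
        super().__init__()
        self.layer = layer
        self.h_pre = nn.Parameter(torch.ones(n_streams) / n_streams)   # merge streams -> layer input
        self.h_post = nn.Parameter(torch.ones(n_streams))              # redistribute layer output
        self.h_res = nn.Parameter(torch.eye(n_streams))                # mix the streams themselves

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (batch, n_streams, dim)
        layer_in = torch.einsum("n,bnd->bd", self.h_pre, streams)       # depth connection (pre)
        update = self.layer(layer_in)                                   # the usual block computation
        mixed = torch.einsum("nm,bmd->bnd", self.h_res, streams)        # width connection
        return mixed + self.h_post[None, :, None] * update[:, None, :]  # depth connection (post)

# usage: replicate the token representation across n streams at the input of the stack
hc = HyperConnection(nn.Linear(256, 256), n_streams=4)
out = hc(torch.randn(8, 4, 256))
```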
in the original formulation of hyper connections, $H^{\text{pre}}$, $H^{\text{post}}$ and $H^{\text{res}}$ are each defined as the sum of two matrices: a learnable static "bias" matrix and an input-conditioned "dynamic" matrix (a scaled non-linear transformation of the input):
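one plausible way to write this down (the normalization, non-linearity $\phi$ and scale $\alpha$ below are placeholders rather than the paper's exact parametrization):

$$H(x) \;=\; \underbrace{H_{\text{static}}}_{\text{learnable bias}} \;+\; \underbrace{\alpha \cdot \phi\big(\mathrm{norm}(x)\,W\big)}_{\text{input-conditioned dynamic part}}$$

for each of $H^{\text{pre}}$, $H^{\text{post}}$ and $H^{\text{res}}$.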
though careful initialization of each matrix leads to stable training, as reported in the original paper, at larger scale the behavior becomes more subtle.
a recent paper from DeepSeek, mHC: Manifold-Constrained Hyper-Connections (Xie et al), identifies a critical design flaw in the hyper connections framework: it breaks the identity mapping property that underlies the stability of residual connections
looking at the unrolled hyper connection equation between non-consecutive layers reveals an exponential effect: the state at layer $l+m$ depends on the state at layer $l$ through the repeated product $\prod_{k=l}^{l+m-1} H^{\text{res}}_k$, which disturbs the identity mapping; factors with gain below 1 lead to exponential decay, gain above 1 results in exponential growth, and negative entries produce alternating oscillations, none of which is norm-controlled.
to address that, the mHC paper proposes constraining the mixing matrix $H^{\text{res}}$ to have non-negative entries with each row and column summing to 1, preserving the originally intended linear mixing behavior while maintaining stable identity transport and well-conditioned gradients.
the proposed constraint places $H^{\text{res}}$ in the Birkhoff polytope, i.e. the set of matrices with non-negative entries whose rows and columns each sum to 1, formally known as "doubly stochastic matrices":

$$\mathcal{B}_n = \left\{ H \in \mathbb{R}^{n \times n} \;\middle|\; H_{ij} \ge 0,\;\; \textstyle\sum_j H_{ij} = 1,\;\; \sum_i H_{ij} = 1 \right\}$$
projection onto the Birkhoff polytope is performed using the Sinkhorn-Knopp algorithm (iterative row / column normalization) applied to the element-wise exponential of the unconstrained $H^{\text{res}}$.
moreover, this constraint admits a compositional closure, i.e. the product of doubly stochastic matrices remains doubly stochastic, enabling norm-preserving residual transport across depth
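a compact numeric sanity check of both points, with Sinkhorn-Knopp written as plain alternating row / column normalization (iteration count and numerics simplified):

```python
import torch

def sinkhorn_knopp(h: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Project an unconstrained matrix onto (approximately) the Birkhoff polytope."""
    m = torch.exp(h)                            # ensure positivity
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)      # normalize rows
        m = m / m.sum(dim=0, keepdim=True)      # normalize columns
    return m

a = sinkhorn_knopp(torch.randn(4, 4))
b = sinkhorn_knopp(torch.randn(4, 4))
prod = a @ b                                    # product of doubly stochastic matrices
print(prod.sum(dim=1), prod.sum(dim=0))         # rows and columns still sum to ~1
```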
following the same norm-preserving principle, the mHC paper additionally enforces positivity on $H^{\text{pre}}$ and $H^{\text{post}}$ by using a sigmoid non-linearity
together resulting in the constrained formulation:
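schematically, using the naming from this post (not necessarily the papers' symbols), and writing $\tilde H$ for the unconstrained static + dynamic parametrization above:

$$H^{\text{res}} = \mathrm{SinkhornKnopp}\big(\exp(\tilde H^{\text{res}})\big), \qquad H^{\text{pre}} = \sigma(\tilde H^{\text{pre}}), \qquad H^{\text{post}} = \sigma(\tilde H^{\text{post}})$$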
implementation note: the DeepSeek team reported only a marginal compute overhead, enabled by selective activation checkpointing, specialized fused kernels and communication-compute overlap in their DualPipe setup, and, unlike the original hyper connections paper, directly addressed the memory overhead introduced by the multiple residual streams.
Conclusion
Throughout the years, mental models in deep learning have changed in many diverse and interesting ways, yet residual connections have persisted through most of them. As more top-venue work highlights the similarities between evolving machine-learning architectures and dynamical systems, we are gradually pushing past previously established scaling laws through more sophisticated geometric structure and constrained representation dynamics.
References
the following papers constitute the main references for the ideas discussed in this article
He et al.: Deep Residual Learning for Image Recognition, CVPR 2016
Chen et al.: Neural Ordinary Differential Equations, NeurIPS 2018
Jacob et al.: Recurrent Block Dynamics in ViTs, under review
Zhu et al.: Hyper-Connections, ICLR 2025
Xie et al.: mHC: Manifold-Constrained Hyper-Connections







