Finite-time Lyapunov exponents of deep neural networks

We compute how small input perturbations affect the output of deep neural networks, exploring an analogy between deep networks and dynamical systems, where the growth or decay of local perturbations is characterised by finite-time Lyapunov exponents. We show that the maximal exponent forms geometrical structures in input space, akin to coherent structures in dynamical systems. Ridges of large positive exponents divide input space into different regions that the network associates with different classes. These ridges visualise the geometry that deep networks construct in input space, shedding light on the fundamental mechanisms underlying their learning capabilities.

Deep neural networks can be trained to model complex functional relationships [1]. The expressivity of such neural networks, their ability to unfold intricate data structures, increases exponentially as the number of layers increases [2]. However, deeper networks are harder to train, due to the multiplicative growth or decay of signals as they propagate through the network. This multiplicative amplification, also known as the unstable-gradient problem [3], causes signals to either explode or vanish in magnitude if the number of layers is too large. A second important problem is that we lack insight into the learning mechanisms. Although there is some intuition for shallow networks [4], there is still no general understanding of the principles that cause some architectures to fail, while others work better.
For a common type of deep network, the so-called multi-layer perceptron [Fig. 1(a)], we show that these two problems are closely related.
We exploit the fact that such networks are discrete dynamical systems; inputs $x^{(0)}$ are mapped iteratively through
$$x_i^{(\ell)} = g\Big(\textstyle\sum_j w_{ij}^{(\ell)} x_j^{(\ell-1)} - \theta_i^{(\ell)}\Big)\,.$$
Here, $g(\cdot)$ is a non-linear activation function [3], the layer index $\ell = 0, \ldots, L+1$ plays the role of time, $L$ is the number of hidden layers, $N_\ell$ is the number of neurons in layer $\ell$, and the weights $w_{ij}^{(\ell)}$ and thresholds $\theta_i^{(\ell)}$ are parameters. Sensitivity of $x^{(\ell)}$ to small changes in the inputs $x^{(0)} = x$ corresponds to exponentially growing perturbations in a chaotic system with positive maximal Lyapunov exponent [5,6] $\lambda_1 = \lim_{\ell\to\infty} \lambda_1^{(\ell)}(x)$, with growth rate $\lambda_1^{(\ell)}(x) = \ell^{-1} \log\big(|\delta x^{(\ell)}|/|\delta x|\big)$. The multiplicative ergodic theorem [5] guarantees that $\lambda_1^{(L)}(x)$ converges as $L \to \infty$ to a limit that is independent of $x$.
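The layer map and the growth rate $\lambda_1^{(\ell)}(x)$ can be sketched numerically by propagating a small perturbation through the network and renormalising it after every layer. A minimal illustration, assuming tanh activations; the function and parameter names are ours, not the paper's:

```python
import numpy as np

def ftle(x0, weights, thetas, eps=1e-6):
    """Finite-time Lyapunov exponent lambda_1^(l) = l^{-1} log(|dx^(l)|/|dx|),
    estimated by propagating a small perturbation through the network."""
    rng = np.random.default_rng(0)
    delta = rng.standard_normal(x0.size)
    delta *= eps / np.linalg.norm(delta)
    x, xp = x0.copy(), x0 + delta
    log_growth = 0.0
    for W, theta in zip(weights, thetas):
        # layer map x^(l) = g(W^(l) x^(l-1) - theta^(l)) with g = tanh
        x = np.tanh(W @ x - theta)
        xp = np.tanh(W @ xp - theta)
        d = xp - x
        log_growth += np.log(np.linalg.norm(d) / eps)
        # renormalise the perturbation to stay in the linear regime
        xp = x + d * (eps / np.linalg.norm(d))
    return log_growth / len(weights)
```

For weight matrices $2\mathbb{1}$ and zero thresholds, the linearisation at $x = 0$ doubles the perturbation in every layer, so the estimate returns $\log 2$ per layer.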
The standard way of initialising network parameters is to choose zero thresholds and random weight matrices with independent Gaussian-distributed elements with zero mean and variance $\sigma_w^2$. In this case, the Lyapunov exponents are determined by a product of random matrices [7], and in the mean-field limit of $N_\ell = N \to \infty$ one finds $\lambda_1^{(L)} \sim \log(G N \sigma_w^2)$, independent of $x$ and $L$. Here, the constant $G$ depends on the choice of activation function [8]. This relation explains why the initial weight variance should be chosen so that $G N \sigma_w^2 = 1$, because then signals neither contract nor expand [8-10], stabilising the learning. The maximal Lyapunov exponent also determines the success or failure in predicting chaotic time series with recurrent networks [11-13] that use large reservoirs of neurons with random weights. In that case, the mean-field limit $N \to \infty$ works very well [13].
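The criticality condition $G N \sigma_w^2 = 1$ can be probed numerically. A rough sketch under our own simplifying assumptions: for tanh with zero thresholds, $x = 0$ is a fixed point with $g'(0) = 1$, so the activation factor drops out and the critical variance is $\sigma_w^2 = 1/N$; the per-layer growth rate of a perturbation should vanish there, and change sign with $\sigma_w^2$:

```python
import numpy as np

def lyapunov_at_init(N=200, L=30, sigma2=None, seed=1):
    """Per-layer growth rate of a perturbation around the fixed point x = 0
    of an untrained tanh network with zero thresholds (so D^(l) = identity)."""
    rng = np.random.default_rng(seed)
    sigma2 = 1.0 / N if sigma2 is None else sigma2
    delta = rng.standard_normal(N)
    delta /= np.linalg.norm(delta)
    lam = 0.0
    for _ in range(L):
        # fresh Gaussian weight matrix in every layer, variance sigma2
        W = rng.standard_normal((N, N)) * np.sqrt(sigma2)
        delta = W @ delta            # D^(l) = identity at x = 0 for g = tanh
        n = np.linalg.norm(delta)
        lam += np.log(n)
        delta /= n
    return lam / L
```

At $\sigma_w^2 = 1/N$ the rate is close to zero; larger (smaller) variances give exponential growth (decay), reproducing the sign structure behind the unstable-gradient problem.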
For finite $N$ and $L$, the maximal finite-time Lyapunov exponent (FTLE) $\lambda_1^{(L)}(x)$ depends on the input $x$. Averaging over input patterns yields an estimate for the Lyapunov exponent [14], but even if the average of $\lambda_1^{(L)}(x)$ over inputs vanishes, some patterns may exhibit large positive exponents, causing the training to fail.
Moreover, the weights of a trained network are not random but should reflect what the network has learned about the inputs. This raises the question: does the maximal exponent form geometric structures in input space, just as in dynamical systems, where ridges of high FTLE define Lagrangian coherent structures [15-17]? How do the variations of the maximal FTLE in input space depend on the number $L$ of layers of the network, and on its width $N$?
To answer these questions, we computed the maximal FTLEs for fully connected deep neural networks with different widths and numbers of layers. For a simple classification problem with two-dimensional inputs $x$ divided into two classes with targets $t(x) = \pm 1$ [Fig. 1(b)], we show how the $x$-dependence of the maximal FTLE changes when changing $N$ and $L$. For narrow networks (small $N$), we find that the maximal FTLE forms ridges of large exponents in the input plane, much like Lagrangian coherent structures in high-dimensional dynamical systems [15-17]. These ridges provide insight into the learning process, illustrating how the network learns to change its output by order unity in response to a small shift of the input pattern across the decision boundary. However, as the network width grows, we see that the ridges disappear, suggesting a different learning mechanism. Similar conclusions hold for a more complex classification problem using the MNIST data set of handwritten digits, where FTLE structures in input space explain variations in classification accuracy and predictive uncertainty.
Finite-time Lyapunov exponents. Figure 1(a) shows a multi-layer perceptron [3], a fully-connected feed-forward neural network with $L$ hidden layers, $N_0$ input components, $N$ neurons per hidden layer with non-linear activation functions, and $N_{L+1}$ output neurons. The network maps every input $x^{(0)} = x$ to an output $x^{(L+1)}$. Weights and thresholds are varied to minimise the output error $[x^{(L+1)} - t(x)]^2$, so that the network predicts the correct target $t(x)$ for each input $x$. The sensitivity of $x^{(\ell)}$ to small changes $\delta x$ is determined by linearisation,
$$\delta x^{(\ell)} = J^{(\ell)}(x)\,\delta x\,, \qquad J^{(\ell)}(x) = D^{(\ell)} W^{(\ell)} \cdots D^{(1)} W^{(1)}\,. \qquad (1)$$
Here, $W^{(\ell)}$ are the weight matrices, and $D^{(\ell)}$ are diagonal matrices with elements $D_{ii}^{(\ell)} = g'(b_i^{(\ell)})$, where $b_i^{(\ell)} = \sum_j w_{ij}^{(\ell)} x_j^{(\ell-1)} - \theta_i^{(\ell)}$ is the local field. The Jacobian $J^{(\ell)}(x)$ characterises the growth or decay of small perturbations to $x$ [5,6]. Its maximal singular value $\Lambda_1^{(\ell)}(x)$ increases or decreases exponentially as a function of $\ell$, with rate $\lambda_1^{(\ell)}(x)$. The singular values $\Lambda_1^{(\ell)}(x) \geq \Lambda_2^{(\ell)}(x) \geq \ldots$ are the square roots of the non-negative eigenvalues of the right Cauchy-Green tensor $J^{(\ell)}(x)^{\sf T} J^{(\ell)}(x)$. The maximal eigenvector of $J^{(\ell)}(x)^{\sf T} J^{(\ell)}(x)$ determines the direction of maximal stretching, i.e. in which input direction the output changes the most, starting from a given input $x$.
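The Jacobian product, its singular values, and the maximal-stretching direction can be computed directly by forward accumulation. A minimal sketch, again assuming tanh activations (names are ours):

```python
import numpy as np

def jacobian_ftle(x, weights, thetas):
    """Build J^(L)(x) = D^(L) W^(L) ... D^(1) W^(1) by forward accumulation
    and return the maximal FTLE and the maximal-stretching direction."""
    J = np.eye(x.size)
    for W, theta in zip(weights, thetas):
        b = W @ x - theta                  # local field b^(l)
        D = np.diag(1.0 - np.tanh(b)**2)   # D_ii^(l) = g'(b_i) for g = tanh
        J = D @ W @ J
        x = np.tanh(b)
    # singular values Lambda_1 >= Lambda_2 >= ... are the square roots of
    # the eigenvalues of the right Cauchy-Green tensor J^T J
    U, S, Vt = np.linalg.svd(J)
    L = len(weights)
    lam1 = np.log(S[0]) / L   # maximal FTLE lambda_1^(L)(x)
    e1 = Vt[0]                # maximal-stretching direction in input space
    return lam1, e1
```

The top right-singular vector `Vt[0]` lives in input space, which is why it gives the direction along which the output changes the most.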
FTLEs and Cauchy-Green tensors are used in solid mechanics to identify elastic deformation patterns [18], and to find regions of instability in plastic deformation [19] and crack initiation [20]. More generally, FTLEs help to characterise the sensitivity of complex dynamics to initial conditions [21-23]. In fluid mechanics, they explain the alignment of particles transported by the fluid [24,25], providing valuable insight into the stretching and rotation of fluid elements over time and space [26]. FTLEs allow one to identify Lagrangian coherent structures [15-17]: strongly repelling fluid-velocity structures that help to organise and understand flow patterns [27]. These geometrical structures appear as surfaces of large maximal FTLEs, orthogonal to the maximal stretching direction.
In applying these methods to neural networks, one should recognise several facts. First, in deep neural networks, the weights change from layer to layer. Therefore the corresponding dynamical system is not autonomous. Second, the number $N_\ell$ of neurons per layer may change as a function of $\ell$, corresponding to a changing phase-space dimension. Third, the neural-network weights are trained. This limits the exponential growth of the maximal singular value, as we show below. Fourth, one can use different activation functions, such as the piecewise linear ReLU function [3], or the smooth tanh function [8]. Here we use $g(b) = \tanh(b)$, so that the network map is continuously differentiable, just like the dynamical systems for which Lagrangian coherent structures were found and analysed.
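The smoothness distinction matters for the diagonal factors $D^{(\ell)}$, whose entries are $g'(b)$. An illustrative check (our own, not from the paper): for $g = \tanh$ the derivative $g'(b) = 1 - \tanh^2(b)$ varies smoothly, whereas the ReLU derivative jumps at $b = 0$:

```python
import numpy as np

# Compare the activation derivatives entering D^(l):
# tanh'(b) = 1 - tanh(b)^2 is continuous, ReLU'(b) jumps at b = 0.
b = np.linspace(-3.0, 3.0, 601)          # grid spacing 0.01
g_prime_tanh = 1.0 - np.tanh(b)**2       # smooth, equals sech(b)^2
g_prime_relu = (b > 0).astype(float)     # piecewise constant, unit jump
```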
Two-dimensional data set. To illustrate the geometric structures formed by the maximal FTLE, we first consider a toy problem. The data set [Fig. 1(b)] comprises $4 \times 10^4$ input patterns, with 90% used for training, the rest for testing. We trained fully connected feed-forward networks on this data set by stochastic gradient descent, minimising the output error $[x^{(L+1)} - t(x)]^2$. In this way we obtained classification accuracies of at least 98%. We considered different network layouts, changing the numbers of layers and hidden neurons per layer. The weights were initialised as independent Gaussian random numbers with zero mean and variance $\sigma_w^2 \sim N^{-1}$, while the thresholds were initially set to zero. After training, we computed the maximal FTLE in layer $L$ and the associated stretching direction from Eq. (1), as described in Refs. [28,29].
The results are summarised in Figure 2, which shows maximal-FTLE fields for trained networks with different layouts. First, we see that the ridges of large positive $\lambda_1^{(L)}(x)$ align with the decision boundary between the two classes [Fig. 1(b)]. The ridges are most prominent for small $N$ and large $L$. In this case, the network learns by grouping the inputs into two different basins of attraction for $t = \pm 1$, separated by a ridge of positive $\lambda_1^{(L)}(x)$. A small shift of the input across the decision boundary leads to a substantial change in the output.
Second, the contrast increases as $L$ becomes larger, quantifying the exponential expressivity of deep neural networks. For larger $L$, the network can resolve smaller input distances $\delta x$ because the singular values increase or decrease exponentially from layer to layer. Comparing networks of different depths, we find that $L\lambda_1^{(L)}(x)$ saturates on the ridge for large $L$. This is a consequence of the training: the network learns to produce output differences of order $\delta x^{(L+1)} \sim 1$, and to resolve input differences $\delta x$ on the scale of the mean distance between neighbouring patterns across the decision boundary. Therefore, the saturation value is larger when the number density of input patterns is higher (not shown). Even though $\Lambda_1^{(L)}(x)$ saturates, $\Lambda_2^{(L)}(x) < 1$ decreases exponentially as $L$ grows (not shown), causing the ratio $\Lambda_1^{(L)}(x)/\Lambda_2^{(L)}(x)$ to increase exponentially, as in dynamical systems [30].
Third, the ridges gradually disappear as the number $N$ of hidden neurons per layer increases, because the maximal singular value of $J^{(L)}(x)$ approaches a definite $x$-independent limit as $N \to \infty$ at fixed $L$. In the infinite-width limit, training is equivalent to kernel regression with a kernel that is independent of the inputs in the training data set [31,32]. But how can the network distinguish inputs with different targets in this case, without ridges indicating decision boundaries? One possibility is that the large number of hidden neurons allows the network to embed the inputs into a high-dimensional space where they can be separated, thanks to the universal approximation theorem [33]. In this case, training only the output weights (and threshold) should suffice. Figure 3(a) confirms this, as the classification error decreases with increasing embedding dimension, for random hidden weights. We remark that the classification error of the fully trained network is smaller than the error with random hidden weights. This is not surprising, since different random embeddings have different classification errors when the number of patterns exceeds twice the embedding dimension [3,34].
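The embedding mechanism can be sketched with random features: project the two-dimensional inputs through fixed random hidden weights, and train only the output weights by least squares. This is an illustrative setup of our own (decision-boundary radius, weight scales, and the least-squares fit are all assumptions, not the paper's experiment), but it shows the classification error decreasing with embedding dimension:

```python
import numpy as np

def random_embedding_error(N, seed=0):
    """Classification error for a circular decision boundary when only the
    output weights are trained, on top of a random N-dimensional embedding."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(2000, 2))
    t = np.where(np.linalg.norm(x, axis=1) < 0.7, 1.0, -1.0)  # two classes
    W = 2.0 * rng.standard_normal((N, 2))      # random, untrained hidden weights
    b = rng.uniform(-2.0, 2.0, size=N)         # random thresholds
    H = np.hstack([np.tanh(x @ W.T + b), np.ones((2000, 1))])  # embedding
    # train output weights (and threshold) only, by least squares
    w_out, *_ = np.linalg.lstsq(H[:1500], t[:1500], rcond=None)
    pred = np.sign(H[1500:] @ w_out)
    return np.mean(pred != t[1500:])
```

A wide embedding separates the circle well, while a narrow one cannot, mirroring the trend in Fig. 3(a).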
Fourth, Figure 2 also shows the maximal stretching directions. For large $L$ they become orthogonal to the ridges of large $\lambda_1^{(L)}(x)$. This demonstrates that there is a close analogy between the FTLE ridges of deep neural networks and Lagrangian coherent structures. The stretching patterns appear to exhibit singular points where the maximal stretching tends to zero [35], reflecting topological constraints imposed on the direction field.
Fifth, one may wonder how the FTLE structures depend on weight initialisation. When weights are initialised with a small variance, $\sigma_w \ll 1$, most FTLEs are negative initially [blue in Fig. 3(b)]. This implies a slowing down of the initial training (vanishing-gradient problem). To see this, consider the fundamental forward-backward dichotomy of deep neural networks [3]: weight updates in the stochastic-gradient algorithm are given by
$$\delta w_{mn}^{(\ell)} = \eta\, \Delta_m^{(\ell)}\, x_n^{(\ell-1)}\,, \qquad (2)$$
where $\eta$ is the learning rate and the errors $\Delta^{(\ell)}$ are obtained by propagating the output error backwards through the transposed factors of the linearisation (1). It follows from Eq. (2) that negative FTLEs cause small weight increments $\delta w_{mn}^{(\ell)}$. Conversely, when the maximal FTLE is positive and too large, the weights grow rapidly, leading to training instabilities. Remarkably, Figure 3(b) demonstrates a self-organising effect due to training: the distributions of the maximal FTLE converge to centre around zero. This is explained by the fact that the network learns by creating maximal-FTLE ridges in input space: to accommodate positive and negative $\lambda_1^{(L)}(x)$, the distribution centres around zero, alleviating the unstable-gradient problem.
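The forward-backward dichotomy can be made concrete with a minimal backpropagation sketch (our own simplified setting: zero thresholds, quadratic error): the errors $\Delta^{(\ell)}$ pass backwards through the transposes of the same factors $D^{(\ell)} W^{(\ell)}$ that build the Jacobian, so contracting linearisations imply small updates in early layers:

```python
import numpy as np

def weight_updates(x, t, weights, eta=0.1):
    """Stochastic-gradient updates dW^(l) = eta * Delta^(l) (x^(l-1))^T for a
    tanh network with zero thresholds and quadratic error |x_out - t|^2 / 2."""
    # forward pass, storing activations and derivatives g'(b)
    xs, Ds = [x], []
    for W in weights:
        b = W @ xs[-1]
        Ds.append(1.0 - np.tanh(b)**2)
        xs.append(np.tanh(b))
    # backward pass: errors propagate through transposed D^(l) W^(l) factors
    delta = Ds[-1] * (xs[-1] - t)
    updates = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        updates[l] = -eta * np.outer(delta, xs[l])   # Eq. (2)-type increment
        if l > 0:
            delta = Ds[l - 1] * (weights[l].T @ delta)
    return updates
```

A finite-difference check confirms that the increments are (minus) the gradient of the output error, so their magnitude is controlled by the same matrix products that set the FTLEs.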
MNIST data set. This data set consists of 60,000 images of handwritten digits 0 to 9. Each grayscale image has 28 × 28 pixels and was pre-processed to facilitate machine learning [36]. Deep neural networks can achieve high precision in classifying this data, with accuracies of up to 99.77% on a test set of 10,000 digits [37].
We determined the maximal-FTLE field for this data set for a network with $L = 16$ hidden layers, each containing $N = 20$ neurons, and a standard softmax layer with ten outputs [3]. To visualise the geometrical structures in the $28^2$-dimensional input space, we projected it to two dimensions as follows. We added a bottleneck layer with two neurons to the fully trained network, just before the softmax-output layer. We retrained only the weights and thresholds of this additional layer and the output layer, keeping all other hidden neurons unchanged. The local fields $b_1$ and $b_2$ of the two bottleneck neurons are the coordinates of the two-dimensional representation shown in Figure 4(a). We see that the input data separate into ten distinct clusters corresponding to the ten digits. The maximal FTLEs at the centre of these clusters are very small or even negative, indicating that the output is not sensitive to small input changes. These regions are delineated by areas with significantly larger positive FTLEs [see 3× zoom in panel (a)]. Figure 2 leads us to expect that patterns with large $\lambda_1^{(L)}(x)$ are located near the decision boundaries in high-dimensional input space. This is verified by strong correlations between $\lambda_1^{(L)}(x)$ and both the classification error and the predictive uncertainty. Figure 4(b) shows that the classification error on the test set is larger for inputs $x$ with larger $\lambda_1^{(L)}(x)$, and the predictive uncertainty is higher as well [Fig. 4(c)]; it was estimated from an ensemble of networks (here with ten members) with the same layout but different weight initialisations [39,40]. These observations confirm that ridges of maximal FTLEs localise the decision boundaries.
Figure 4(a) also shows $\lambda_1^{(L)}(x)$ along a path generated by an adversarial attack. The attack begins from a sample within the cluster corresponding to the digit 9 and aims to transform it into a digit 4 by making small perturbations to the input data [41] toward class 4. We see that the maximal FTLE is small at first, then increases as the path approaches the decision boundary, and eventually decreases again. This indicates that our conclusions regarding the correlations between large maximal FTLEs and decision boundaries extend to neighbourhoods of the MNIST training set that contain patterns the network has not encountered during training.
Conclusions. We explored geometrical structures formed by the maximal FTLE in input space, for deep neural networks trained on different classification problems. We found that ridges of positive exponents define the decision boundaries, for a two-dimensional toy classification problem, and for a high-dimensional data set of hand-written digits. In the latter case, we projected the high-dimensional input space to two dimensions, and found that the network maps digits into distinct clusters surrounded by FTLE ridges at the decision boundaries. This conclusion is supported by the fact that the locations of the FTLE ridges correlate with low classification accuracy and high predictive uncertainty.
The network layout determines how prominent the FTLE structures are.As the number of layers increases, the ridges sharpen, emphasising their role in learning and classification.However, as the number of hidden neurons per layer tends to infinity, the FTLE structures disappear.In this limit, the network separates the inputs by embedding them into a high-dimensional space, rendering training of the hidden neurons unnecessary.
It is important to underscore that the two different ways to learn, by FTLE ridges or embedding, result in qualitative differences regarding classification errors and predictive uncertainties, and may also affect how susceptible a network is to adversarial attacks.The geometrical method presented here extends to other network architectures (such as convolutional networks), and will help to visualise and understand the mechanisms that allow such neural networks to learn.
LS was supported by grants from the Knut and Alice Wallenberg (KAW) Foundation (no. 2019.0079) and Vetenskapsrådet (VR), no. 2021-4452. JB received support from UCA-JEDI Future Investments (grant no. ANR-15-IDEX-01). HL was supported by a grant from the KAW Foundation. KG was supported by VR grant 2018-03974, and BM by VR grant 2021-4452. Part of the numerical computations for this project were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC).

FIG. 1. Classification with a fully connected feed-forward network. (a) Layout with two input components $x_1^{(0)}$ and $x_2^{(0)}$, $L$ hidden layers with $N = 5$ neurons, and one output $x^{(L+1)}$ for classification. (b) Two-dimensional input plane (schematic) for a classification problem with a circular decision boundary that separates input patterns with targets $t = +1$ from those with $t = -1$ (green).

FIG. 2. Geometrical FTLE structures in input space for different widths $N$ and depths $L$ of fully-connected feed-forward neural networks trained on the data set shown schematically in Fig. 1(b). Shown is the colour-coded magnitude of $L\lambda_1^{(L)}(x)$, and the maximal stretching directions (black lines).

FIG. 3. (a) Classification error for a fully connected feed-forward network with $L = 2$ hidden layers with random weights (not trained), and trained output weights, as a function of the number $N$ of hidden neurons per layer (solid black line). Also shown is the classification error for the fully trained network (dashed line). Both curves were obtained for the data set shown schematically in Fig. 1(b). (b) Evolution of the maximal-FTLE distribution as a function of training time measured in epochs [3], for a network with $L = 8$ hidden layers with $N = 50$ neurons per layer. The weights were initialised with different variances, $\log(G N \sigma_w^2) = -0.2$ (blue), 0 (green), and 0.2 (red).

FIG. 4. Maximal-FTLE field for the MNIST data [36]. A fully connected feed-forward network with $N = 20$ neurons per hidden layer, $L = 16$ hidden layers, and a softmax layer with ten outputs was trained to a classification accuracy of 98.88%. The maximal FTLE was calculated for each of the $28^2$-dimensional inputs and projected to two dimensions (see text). (a) Training data in the non-linear projection. For each input, the maximal FTLE $\lambda_1^{(L)}$ is shown colour-coded (legend). The box contains 93% of the recognised digits 0. A threefold blow-up of this box is also shown. The line represents a sequence of adversarial attacks from 9 to 4 (see text), with $\lambda_1^{(L)}(x)$ colour-coded. (b) Classification error on the test set as a function of $\lambda_1^{(L)}(x)$. (c) Predictive uncertainty $H$ (see text) as a function of $\lambda_1^{(L)}(x)$.
Figure 4(c) shows that large values of $\lambda_1^{(L)}(x)$ correlate with high predictive uncertainty, measured by the entropy $H$ of the posterior predictive distribution [38]. For softmax outputs, where the $x_i^{(L+1)}$ can be interpreted as probabilities, $H = -\sum_i \langle x_i^{(L+1)}\rangle \log\langle x_i^{(L+1)}\rangle$, where $\langle\cdot\rangle$ denotes the ensemble average.
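The ensemble entropy can be sketched directly from this definition: average the softmax outputs over ensemble members, then take the entropy of the averaged probabilities. An illustrative implementation with synthetic outputs (not the paper's trained networks):

```python
import numpy as np

def softmax(b):
    """Softmax of the output local fields, numerically stabilised."""
    e = np.exp(b - b.max())
    return e / e.sum()

def predictive_entropy(ensemble_outputs):
    """H = -sum_i <x_i> log <x_i>, with <.> an average over ensemble members."""
    p = np.mean([softmax(b) for b in ensemble_outputs], axis=0)
    return -np.sum(p * np.log(p))
```

When all ensemble members agree on the class, $H$ is small; when they disagree, the averaged probabilities approach uniform and $H$ approaches its maximum $\log$(number of classes), which is why $H$ is large near decision boundaries.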