Learning by non-interfering feedback chemical signaling in physical networks

Both non-neural and neural biological systems can learn. So rather than focusing on purely brain-like learning, efforts are underway to study learning in physical systems. Such efforts include equilibrium propagation (EP) and coupled learning (CL), which require storage of two different states-the free state and the perturbed state-during the learning process to retain information about gradients. Inspired by slime mold, we propose a new learning algorithm rooted in chemical signaling that does not require storage of two different states. Rather, the output error information is encoded in a chemical signal that diffuses into the network in a similar way as the activation/feedforward signal. The steady state feedback chemical concentration, along with the activation signal, stores the required gradient information locally. We apply our algorithm using a physical, linear flow network and test it using the Iris data set with 93% accuracy. We also prove that our algorithm performs gradient descent. Finally, in addition to comparing our algorithm directly with EP and CL, we address the biological plausibility of the algorithm.


I. INTRODUCTION
What basic ingredients constitute a biological learning system, such as slime mold or higher-order organisms?Biological learning systems adapt to the external environment by tailoring specific responses for given external conditions.As the system continues to experience external conditions of a similar kind, it develops functionality to respond to the stimulus in such a way to increase its chances of survival.Intriguingly, this functionality is an emergent phenomenon as a result of interactions between the various components [1].For example, when birds come together in a flock, they increase their chances of survival [2].This happens not because of a 'supervisor' that commands each bird to fly in a particular way, but because birds, such as starlings, interact with a fixed number of neighbors independent of their density to give rise to emergent functionality [3].Similarly, in the presence of rising waters, fire ants cooperate to form floating rafts consisting of a structural base and freelymoving ants on top of the base with treadmilling between the two roles [4,5].Local, ant interaction rules, including an effective repulsive force between the freely-moving ants and the water replicate the types of observed shapes of rafts [6].These special interactions between components, responsible for emergent functionality in nature, have themselves emerged out of the long process of evolution.
Given the intricacies of biological learning systems, neural networks are in silico brain-like learning systems, resembling the visual cortex, in particular [7][8][9][10], that can recognize patterns and solve problems [11,12].More specifically, neural networks achieve functionality by modifying weights and biases to minimise a particular cost function.Of the many ways to do so, the algorithm of choice in neural networks with multiple layers (deep learning) is the backpropagation algorithm [13].Backpropagation updates the network such that its weights (and biases) perform gradient descent in the cost function landscape.The complex nature of the tasks that neural networks are capable of hints at the possibility that biological learning systems also achieve functionality by optimizing cost functions by gradient descent [14].In other words, the long process of evolution may have optimised the "learning algorithm" in such biological systems to update its components via gradient descent.The success of backpropagation has, therefore, encouraged a search for biologically plausible learning rules analogous to it [14][15][16][17][18][19].For completeness, here are properties one should ensure while constructing such a biologically plausible learning system : 1. local learning algorithms [20], 2. the implementation of such algorithms is constrained by the laws of physics, and 3. the algorithms minimize a cost function via gradient descent or stochastic gradient descent.
Indeed, there have been attempts to construct learning algorithms within purely physical systems [21][22][23][24][25][26].Here, we will focus on "equilibrium propagation" [24,27] and "coupled learning" [25].In these approaches, the error information corresponding to each component is encoded in terms of differences of local physical quantities measured between two phases.At each step of training, the outputs of the network are nudged towards the target output by applying additional boundary conditions at the output nodes.Next, the system is allowed to settle to a new steady state called the "nudged state" (or the "clamped state"), which is closer to the desired target than the initial "free state".In the limit where the nudge amplitude goes to zero, the difference of local, physical quantities between these two phases encodes the gradient of the cost function [27].Unlike backpropagation, these algorithms achieve gradient descent without an explicit layer-by-layer transfer of error information.Nevertheless, one caveat of these approaches is the requirement to store the free state.In other words, the learning rule requires information about the free state, which is no longer physically available at the end of the second phase when the parameters are updated.One way around this requirement is to build two copies of the same physical network [28], though biology does not necessarily have such a luxury.
In this work, we present a new algorithm and a new learning rule that overcomes the above requirement in equilibrium propagation and coupled learning.Our learning rule computes the gradients using local information for each weight without the need to store the free state.We demonstrate that the functionality of the nudged state can be realised in physical and biophysical, learning systems using chemical signalling.We show that steady state chemical concentrations can be used in the second phase to encode the required gradient information.Chemical signaling is ubiquitious in biology.For example, consider the structurally simple, yet functionally complex organism, named Physarum polycephalum, otherwise known as slime mold.Slime mold is a uni-cellular, multi-nucleated organism that is neither a plant nor an animal nor a fungus.This uni-cellular organism can span up to the meter scale and consists of a network of tubes whose underlying structure is driven by cytoskeletal reorganization [29].Despite its simplicity, in the sense that it is non-neuronal, this organism is capable of myriad complex tasksprecisely coordinating flows in its body [30], navigating mazes [31], and connecting food sources with optimal paths [32,33].Work by K. Alim and others showed that much of this complex phenomenon can be explained by a mechanism of signal propagation [34].Specifically, slime mold uses a chemical signal to send information regarding the location of food sources across its body, which triggers a change in its tubular structure due to a softening agent to optimize the connection between food sources [35].In light of an example, we construct a physical learning network of tubes/pipes that uses chemical signals to send error information across the system.Our system is a flow network, with activation pressures at nodes v and pipe conductance described by weights w.The information from the external environment is input into the system by fixing the boundary conditions I at input nodes.Node activation pressures v are the non-trainable variables (or "state variables") of the system that are determined by Laplace's equation and input boundary conditions.Our physical system is, therefore, a linear one.The functionality we seek is to obtain desired pressures ("target pressures") at output nodes for a given input I. To achieve this, a feedback chemical is released into the flow network by fixing the chemical currents at output nodes.The value of these chemical currents is proportional to the difference between target pressures and output pressures.The chemical concentrations u at internal nodes are determined by the same Laplace equation, but with feedback boundary conditions .We show that this chemical concentration u along with the node pressures v locally encode the weight gradients of the cost function that we want to optimize.We propose a new learning rule that updates the trainable weights w such that it does gradient descent with respect to the cost function.

II. THEORY
We consider a flow network of nodes interconnected by weighted edges.We denote w xy the weight (i.e.conductance) of the edge between node x and node y.A subset of the nodes are boundary node pairs (or "input" node pairs), denoted For each pair (b + j , b − j ), an input current I j flows into the network through the node b + j and flows out of the network through the node b − j .The state of the system is defined by the node pressures, denoted v(x) and governed by Laplace's equation at steady state.Another subset of the nodes are output node pairs The output of the network is defined as the set of pressure drops across output nodes {v is a function of input currents {I j } and all the weights of the network {w xy }.Training the network consists in modifying the weights {w xy } such that, given the input currents {I j }, we get desired pressure drops {v d (o + i , o − i )} across the output node pairs.We define the cost function Now we present a physical procedure and a learning rule for the weights that performs gradient descent with respect to the cost function.To achieve this, we release a feedback chemical into the network through the pairs of output nodes {(o + i , o − i )}, by fixing chemical currents across these nodes.Specifically, for each pair of output nodes flows into the network through node o + i and flows out of the network through node o − i , where η is a constant ("nudging").A steady state chemical concentration develops at every node, governed again by Laplace's equation.We denote u(x) the steady state concentration at node x, and u(x, y) the drop in chemical concentration between nodes x and y.Finally, we update each weight w xy according to where α is a constant.We show below that this learning rule performs gradient descent on the cost function with step size ("learning rate") αη, i.e.
for every weight w xy .The above learning rule for the weights is local.The final error term depends upon two quantities; the pressure drop due to flow and the concentration drop of the feedback chemical.If they have the same sign, then the weight gets a positive update and vice versa.Here, we assume that the relaxation time scale of the system is much faster than the time scale of weight updates so that the system is in steady state as the weights are updated.Note that the two quantities in the weight update are independent of each other, therefore, we assume that diffusion is fast enough that it is independent of the flow.
In an experimental setting, one can realize this by using very small flow rate, leading to very small pressure drops across weights and using a signalling chemical with very high diffusion rate via, perhaps, some catalytic process.We understand that such a construction is not necessarily realized in nature, therefore, we also propose a purely flow version of the model where the chemical signals are carried by the fluid flow and not by diffusion (Appendix A).While this chemical flow algorithm is presumably more physically plausible, it is not yet clear that the algorithm performs gradient descent.In any event, what we present here is an idealization.Obviously, nature may be using a complex combination of flow and diffusion for signalling.
Considering v as a pressure and u as a chemical concentration is just a certain packaging of the theory.The central idea of this work is to use two independent physical quantities which leads to non-interference of the input signal and the feedback signal, in other words one can use any two non-interfering modalities to conduct learning [36].For example, one can use two chemicals v and u diffusing in a static fluid, with distinct chemical signatures to encode input and error signal.In this case, there is no need for an additional assumption on the relationship between flow rate and diffusion rate.
Our result holds for any cost function C, not just the squared error (Eq.1).In general, in the second phase, the chemical current flowing in through o + i and flowing out through Our result also holds if we reverse the sign of η (2) and the sign of α in the learning rule (3).In other words, the algorithm performs gradient descent so long as αη > 0. Now we prove our claim that the learning rule (3) performs gradient descent (4).Let us number the nodes of the network 1, 2, . . ., n.Let I x be the input current at node x (with I x = 0 by convention if node x is not an input node) and v x the pressure at node x.For each node x, the steady state condition at node x in the first phase yields: We get a system of n linear equations.This system rewrites with matrix-vector notations as where v is the vector of node pressures, I is the vector of input currents and L is the matrix Note that the matrix L is symmetric because w xy = w yx for every pair of nodes (x, y).Next, we denote E x = −η ∂C ∂vx the chemical current at node x in the second phase (with e x = 0 by convention if node x is not an output node), and u x the concentration of the chemical at node x.Assuming that the diffusion constant of the chemical is equal to the flow conductivity (up to a constant of proportionality), the chemical concentration at steady state satisfies the same system of linear equations, with different boundary conditions, i.e.
where u is the vector of node concentrations, and E is the vector of chemical currents.Now we compute the gradient of the cost function: for every weight w xy we have Multiplying by −η on both sides, we get, by definition of E, Using the steady state condition of the second phase (8) and the fact that L is symmetric, we get Next, we differentiate the steady state condition of the first phase (6) with respect to w xy .We get Rearranging the terms and injecting this in (11), we get Looking at the form of the matrix L (7), the matrix ∂L ∂wxy has exactly four non-zero coefficients: those at positions (x, x), (x, y), (y, x) and (y, y), equal to +1, −1, −1 and +1, respectively.Therefore Combining this with (13), we conclude that Hence, the learning rule corresponds to one step of gradient descent with step size αη.This concludes the proof.Summing up the training mechanism: The chemical concentration reaches steady state u. 5.This procedure, which corresponds to one step of gradient descent, is repeated iteratively until convergence of the weights is achieved.

III. THE IRIS DATA SET
We train the flow network on a standard machine learning task: classifying Iris flowers.The Iris data set [37] contains 150 examples of Iris flowers belonging to three species (setosa, virginica and versicolor), and, therefore, 50 examples for each category.Each example is of the form X = (X 1 , X 2 , X 3 , X 4 ), composed of four features of the flower (petal width, petal length, sepal width and sepal length, all measured in cm), and comes with its assigned iris category, denoted Y .So an example would look something like X = (5.1,3.5, 1.4, 0.2) and Y = 'setosa'.Given the four features of the flower as input, the trained network should be able to tell which species it belongs to.
We now detail how we do this.A flow network is constructed as follows: 1. Generate a square lattice, 2. Perturb the positions of the lattice with a step of length δ in any random direction, 3. every node is connected to its d nearest neighbours [38].We choose d = 4 for all our simulations unless specified.
4. Every edge of the network is assigned a conductance from a truncated normal distribution.
5. Four pairs of input nodes are chosen from this flow network, where the input data (the normalised features of the iris) is given as external currents across these four pairs of boundary nodes.Three pairs of nodes are chosen as the output nodes such that every pair is composed of two neighbouring nodes.For pairs of output nodes that are not neighboring nodes, the training error decreased less smoothly.
Once the network is trained, for a given input, the set of potential drops across these node pairs should tell the category of iris the input data corresponds to.
The network architecture remains fixed throughout the training-testing process.Only the conductances of these weights are modified.
As for how the flow network interfaces with the Iris data set, 1.The data set is divided into two subsets: one training set (used for training) and one test set (used for testing).Each of these sets have 75 examples of irises, 25 from each category.
2. The data set is normalised.That is, for each example X in the data set, and for each feature X i of X; we set , where X max i and X min i are the minimum and maximum values for that feature X i in the training set.We choose λ = 5 for all simulations.
While choosing the desired outputs we must keep in mind the fact that this linear system may not be able to find a set of weights that give out the desired output.In other words, we must choose desired outputs that are physically attainable.We, therefore, implement the same technique as described in Ref. [28] to choose the desired output voltages for each of the three Iris categories.For each category, the desired voltage is the average, normalized input data.That is, each iris category has 25 examples of four input features, each input feature is averaged out over 25 examples.Finally, we obtain a four tuple of averaged input features for each iris category.When this is given as input to the initial network, we aim to arrive at an output voltage that corresponds to the average behavior of the input, which is the desired voltage.
To conduct the training process, first the input data is given to the network and the output is observed.If the output is not equal to the desired output for that iris category, the feedback chemical is released at the output node pairs by fixing the currents.The weights of the network are modified using the learning rule mentioned above.This process is repeated consecutively for all examples.Next, once the entire data set is exhausted, we say 'one epoch' has passed.We train the network for multiple epochs.At the beginning of each epoch, because the network has changed significantly, new desired voltages are calculated.Therefore, each epoch has its own set of desired voltages.
To conduct the testing process, after the network is trained for multiple epochs, we record how well it classifies unseen data from the test set.To be specific, the test set has 25 examples per iris category.The iris features from these test examples are given as input and the output voltage is compared with the desired voltage.The desired voltage is calculated using the testing data set.An example is then classified into that iris category, for which the output voltage is closest to the desired output.
This training-testing procedure is implemented on a biophysical network.Results are shown in Fig 2.

IV. LINK WITH EQUILIBRIUM PROPAGATION AND COUPLED LEARNING
To readily see the link with our algorithm and EP, suppose that, in the second phase, instead of injecting the chemical, we inject more of the main substance at output nodes (through the currents ).We can adapt the analysis of Sec.II to show that, instead of the chemical concentration, u now represents the pressure difference between the second phase and the first phase (after and before injecting the currents i at output nodes).Indeed, denoting v free and v nudged the node pressures at equilibrium in the first phase and second phase, respectively, we have where we recall that E is the vector of currents at output nodes.Eq. ( 18) is the same Laplace equation as (6), therefore v free = v.Moreover, subtracting ( 19) from ( 18), we see that v nudged − v free satisfies the same equation ( 8) as the chemical concentration u, that is We conclude that u = v nudged − v free .We thus recover the setting of equilibrium propagation [23] with "free state" v free = v and "nudged state" v nudged = v + u.Coupled learning [25] is also very closely related, with the difference that nudged states are realized by imposing boundary conditions on the pressures of output nodes, rather than on the currents.Without loss of generality, let us assume for convenience that α = −1.Our learning rule rewritten in terms of the free and nudged states is where η is the nudge amplitude, i.e. the factor that scales the amplitude of the currents injected in the second phase.Here we have used that u is proportional to the nudge amplitude, i.e. u(x, y) = O(η).We have thus recovered the learning rule of equilibrium propagation [23] and coupled learning [25] which, at order 1 in η, is In equilibrium propagation and coupled learning, the multiplicative factor in front of the learning rule is k = learning rate 2×nudge amplitude ; in our setting here, since α = −1, the nudge amplitude is the same as the learning rate, therefore k = 1 2 .
To understand the discrepancy of O(η 2 ) between our learning rule and the learning rule of EP and CL, let us rewrite v nudged as v η to explicitly show the dependence of the nudged state on the nudge amplitude η.In particular, v 0 = v free = v.In our linear flow network, u is a linear response of η (specifically u = −ηL −1 ∂C ∂v , see Eq. ( 8)).Combined with the relationship u = v η −v 0 shown above, this implies that u = η ∂v η ∂η η=0 . Therefore: = −η v(x, y) ∂v η (x, y) ∂η η=0 = − η 2 We recover the learning rule ∆w EP/CL xy of Eq. ( 23) as a finite difference approximation of Eq. ( 26) -hence the term O(η 2 ).
The advantage of our method over EP and CL is that, by using chemical signaling, the system components can now differentiate between "activation" and "feedback" signals based on their distinct chemical signatures.This happens because the chemical u encodes the same information as ∂v η ∂η .Not only this eliminates the need to store information about two separate phases, but also gives a way for exact gradient computation.
On the other hand, one of the strengths of equilibrium propagation and coupled learning is that they can be used for arbitrary physical systems driven by physical equilibration: their learning rule can be written in terms of derivatives of energy with respect to the trainable parameters [24,25,27], as follows: where E free is the energy of the system in the free phase and E nudged is the energy in the nudged phase.For example, Eq 23 can be obtained from Eq 27 by taking E as power dissipation function [23,25].While the present work focuses on linear physical systems, we will leave the study of our algorithm in nonlinear systems for future work.

V. DISCUSSION
We present a simple model of a physical learning system that learns via chemical signaling.In our system, the error between the desired behavior and observed behavior at the output nodes is encoded in the form of a feedback chemical signal.These signals travel across the network via diffusion with the weights of the network updating in response to the concentration of the feedback chemical and there is no need for two states, as there is in equilibrium propagation and coupled learning.We also show that this learning rule minimises a cost function via gradient descent even beyond the infinitesimal nudge amplitude or learning rate limit.
Given the prevalence of backpropagation, we also compare our algorithm to backpropagation.While there are many explicit differences between backpropagation and our algorithm -mostly because artificial neural networks are very different from physical flow networks -we can compare the basic ideas between these algorithms.In our model the weight update rule is proportional to two quantities: the "error term" u(x, y) which tells how much the pressure drop across the weight w xy must change, and the "activation term" v(x, y) which tells the existing pressure drop due to input.This is similar to backpropagation where the weight update is proportional to the presynaptic input and error in the postsynaptic output [11].However, the difference between these algorithms is seen in the way error information is communicated.In backpropagation, the error at the output layer and the weights projecting onto this output layer determines the error at the penultimate layer and so on [11].Therefore, the relationship between the error values of two layers only depends on the local weight values connecting them.In our model, once the error at output is known, the error at the neighbouring nodes is determined by the steady state of the system, which means that the relationship between their error depends on all the weights of the network.
While our work here is mostly concerned with linear flow networks in the absence of advection (see Appendix A for an exception to this), one can explore how our algorithm can be extended to include it.Moreover, our physical procedure (Sec.II) may also be applied to nonlinear systems, i.e., systems whose components have nonlinear characteristics.We will leave the mathematical analysis and experimentation of these settings for future work.We also must explore going beyond the quasistatic limit as time scales also constrain biological learning systems.Efforts towards this goal have recently been made in silico and in experiments using coupled learning [39].Similar extensions can be implemented using our algorithm.
Nature may indeed be using similar signaling mechanisms that we have elaborated on here.Cells use biochemical signals to structure themselves in response to external conditions to optimize their functionality and, thus, identity [40].Slime mold, a tubular network-like single cell organism uses chemical signals to modify its tube radii in response to food as the external stimuli [34].While the particular chemical still remains unknown, a recently proposed candidate is ATP [35].ATP is crucial for the activity of myosin and, hence, the contractility of the actin cytoskeleton.An increase in ATP would pre-sumably enhance contractility.One might conclude that enhanced contractility leads to stiffer cells as is found in cells adherent to a substrate.However, for cells in suspension, inhibition of myosin leads to stiffer cells due to accelerated depolymerization of actin filaments [41].Perhaps suspended cell behavior is more relevant for the tubular structures in slime model than adherent cells behavior.On other hand, given such understanding, we expect similar artificial versions of this signaling mechanism could be used for experimental realizations of our model with its aforementioned extensions.For instance, an experimental system that makes use of different mechanisms, such as different catalytic reactions on a flexible sheet to drive shape changes in the presence of flows (see Appendix A), is a distinct possibility [42].Alternatively, perhaps one may implement similar catalytic reactions in stiff sheets that fold.Incidentally, supervised learning in stiff sheets, i.e. origami, has been explored [43].
As for multi-cellular organisms, the brain presumably fine tunes the synaptic strengths of billions of neurons to generate an optimal behavior.For this to happen, feedback signals should not only carry precise credit information to individual neurons but while doing so, they must not interfere with the activation/feedforward signals [14].How the brain does this is still unknown and many models have been developed to explain this phenomenon.For instance, some models invoke the use of error neurons [44], while others assume a temporal seg-regation of activation and feedback phases [25].Others propose a compartmentalization of individual neurons to spatially separate information as opposed to temporal segregation [45].Our learning mechanism is similar to this last one.It avoids multiple phases by modifying a node to store two kinds of information -an activation signal and a feedback signal.The reason they do not interfere is because the system components can identify them by their chemical signatures -a ubiquitous phenomenon in nature.We have shown how the same network structure used to send an activation signal, can be used to send precise gradient information to individual weights.Therefore, our model optimises a cost function via gradient descent, something that deep neural networks already do to achieve human like functionality [46][47][48][49].In light of the reasons stated above, we think our model may ultimately help neuroscientists understand credit assignment mechanisms in the brain.

FIG. 1 .
FIG. 1.The flow network.Schematic of the flow network with blue points representing the input node pair and red points representing the output node pair.The weights of the flow network are denoted by wxy between neighboring nodes x and y.These weights are varied during the training-testing process.

4 .
The concentration drop u(x, y) and pressure drop v(x, y) are measured across each weight w xy , and the weights are updated according to ∆w xy = −αu(x, y)v(x, y).

FIG. 2 .
FIG. 2. Training and testing using the Iris data set.The Iris data set is trained on a network with 12 2 nodes and learning rate η = 10 −4 .The weights are sampled from a truncated normal distribution with mean of 0.1, standard deviation of 0.01 with a lower cutoff of 0.01 and upper cutoff of 0.19.We use the standard mean squared error (L2 norm) as the cost function.[Top Left] At each training step, the network is shown an example and the mean squared error is calculated between the network's output and desired output, shown as a blue dot.[Top Right] Accuracy is defined as the fraction of correct predictions out of total testing examples.[Bottom Left] The initial state of the network, with thickness of blue edges representing the conductance.Red crosses denote input nodes; green crosses denote output nodes.[Bottom Right] The change in weights between the initial state and the final trained state of the network.Red denotes negative change, while blue denotes positive change with thickness showing the magnitude of change.
1.A flow network is generated where chemicals can spread via a diffusion process.2. Input to the network is given by fixing currents {I j } across input node pairs {(b + j , b − j )}, leading to steady state pressures v. Outputs are measured as pressure drops {v(o +