Stochastic p-bits for Invertible Logic

Conventional logic and memory devices are built out of deterministic units such as transistors, or magnets with energy barriers in excess of 40-60 kT. We show that stochastic units, p-bits, can be interconnected to create robust correlations that implement Boolean functions with impressive accuracy, comparable to standard circuits. Also they are invertible, a unique property that is absent in digital circuits. When operated in the direct mode, the input is clamped, and the network provides the correct output. In the inverted mode, the output is clamped, and the network fluctuates among possible inputs consistent with that output. We present an implementation of an invertible gate to bring out the key role of a three-terminal building block to enable the construction of correlated p-bit networks. The results for this implementation agree well with those from a universal model, showing that p-bits need not be magnet-based: any three-terminal tunable random bit generator should be suitable. We present an algorithm for designing a Boltzmann machine (BM) with symmetric connections that implements a given truth table. We then show how BM Full Adders can be interconnected in a partially directed manner to implement large operations such as 32-bit addition. Hundreds of p-bits get precisely correlated such that the correct answer out of 2^33 possibilities can be extracted by looking at the mode of a number of time samples. With perfect directivity a small number of samples is enough, while for less directed connections more samples are needed, but even in the former case invertibility is largely preserved. This combination of accuracy and invertibility is enabled by the hybrid design that uses bidirectional units to construct circuits with partially directed connections. We establish this result with examples including a 4-bit multiplier which in inverted mode functions as a factorizer.


I. INTRODUCTION
Conventional semiconductor-based logic and nanomagnet-based memory devices are built out of stable, deterministic units such as standard MOS (metal oxide semiconductor) transistors, or nanomagnets with energy barriers in excess of ≈ 40-60 kT. The objective of this paper is to introduce the concept of what we call "p-bits" representing unstable, stochastic units which can be interconnected to create robust correlations that implement precise Boolean functions with impressive accuracy comparable to standard digital circuits. At the same time this "probabilistic spin logic" (PSL) is invertible, a unique property that is absent in standard digital circuits. When operated in the direct mode, the input is clamped, and the network provides the correct output. In the inverted mode, the output is clamped, and the network fluctuates among all possible inputs that are consistent with that output.
* kcamsari@purdue.edu † datta@purdue.edu
Any random signal generator whose randomness can be tuned with a third terminal should be a suitable building block for PSL. The icon in Fig. 1b represents our generic building block, whose input $I_i$ controls the output $m_i$ according to the equation (Fig. 1a)
$$m_i(t) = \mathrm{sgn}\{\mathrm{rand}(-1, 1) + \tanh(I_i(t))\} \qquad (1)$$
where rand(−1, +1) represents a random number uniformly distributed between −1 and +1. This random number is assumed to change every τ seconds, where τ is the retention time of an individual p-bit. We normalize the time axis to τ so that t is dimensionless and progresses in steps (0, 1, 2, . . .). At each time step, if the input is zero, the output takes on a value of −1 or +1 with equal probability, as shown in the middle panel of Fig. 1d. A negative input $I_i$ makes negative values more likely (left panel) while a positive input makes positive values more likely (right panel). Fig. 1c shows $m_i(t)$ as the input is ramped from negative to positive values. Also shown is the time-averaged value of $m_i$, which equals $\tanh(I_i)$.
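The behavior of Eq. (1) can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `pbit_samples`, the number of steps and the seed are illustrative choices and not part of the original model description. It draws many time steps for a few fixed inputs and compares the time average with $\tanh(I_i)$.

```python
import numpy as np

def pbit_samples(I, steps, rng):
    """Draw `steps` outputs of Eq. (1) for each input value in I.

    Returns an array of shape (steps, len(I)) with entries -1 or +1.
    """
    I = np.atleast_1d(np.asarray(I, dtype=float))
    r = rng.uniform(-1.0, 1.0, size=(steps, I.size))   # rand(-1, +1)
    return np.where(r + np.tanh(I) >= 0.0, 1, -1)       # sgn{...}

rng = np.random.default_rng(0)          # fixed seed, for repeatability only
inputs = np.linspace(-3.0, 3.0, 7)      # a ramp of input values, as in Fig. 1c
averages = pbit_samples(inputs, 20000, rng).mean(axis=0)

for I, avg in zip(inputs, averages):
    print(f"I = {I:+.1f}   <m> = {avg:+.3f}   tanh(I) = {np.tanh(I):+.3f}")
```

The probability of $m_i = +1$ is $(1 + \tanh I_i)/2$, so the sample mean approaches $\tanh(I_i)$ with a statistical error that shrinks as the number of steps grows.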
A possible physical implementation of p-bits could use stochastic nanomagnets with low energy barriers ∆ whose retention time [1],
$$\tau = \tau_0 \exp(\Delta / kT),$$
is very small, on the order of $\tau_0$, a material-dependent quantity called the attempt time that is experimentally found to be ≈ 10 ps − 1 ns [1] among different magnetic materials. Such stochastic nanomagnets can be pinned to a given direction with spin currents that are at least an order of magnitude smaller than those needed to switch 40 kT magnets. The sigmoidal tuning curve in Fig. 1c describing the time average of a fluctuating signal represents the essence of a p-bit. Purely CMOS implementations of a p-bit are possible [2,3], but the sigmoid seems like a natural feature of nanomagnets driven by spin currents. Indeed, the use of stochastic nanomagnets in the context of random number generators, stochastic oscillators and autonomous learning [4][5][6] has been discussed in the literature. But performing "invertible" Boolean logic utilizing large scale correlations has not been discussed before, to our knowledge.
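To make the contrast between stable and stochastic magnets concrete, the sketch below evaluates the standard thermal-activation (Arrhenius) form $\tau = \tau_0 \exp(\Delta/kT)$ for the retention time. The function name and the choice $\tau_0 = 1$ ns (the upper end of the quoted range) are assumptions made for illustration.

```python
import math

def retention_time(delta_over_kT, tau0):
    """Retention time tau = tau0 * exp(Delta / kT), in the units of tau0."""
    return tau0 * math.exp(delta_over_kT)

TAU0 = 1e-9                                   # assumed attempt time: 1 ns
tau_stable = retention_time(40.0, TAU0)       # stable magnet, Delta = 40 kT
tau_pbit = retention_time(0.0, TAU0)          # barrier-free magnet, Delta = 0

seconds_per_year = 365.25 * 24 * 3600
print(f"40 kT magnet: {tau_stable:.2e} s (~{tau_stable / seconds_per_year:.1f} years)")
print(f" 0 kT magnet: {tau_pbit:.2e} s")
```

Under these assumptions a 40 kT magnet retains its state for years, while a barrier-free magnet fluctuates on the scale of the attempt time, which is what makes it usable as a fast random bit source.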
Note that we are using the term invertibility in the broader sense of relation inverses and not in the narrower sense of function inverses. For example, the relation inverse of the AND gate maps the output 0 to the set of inputs {{0, 0}, {0, 1}, {1, 0}} even though the corresponding functional inverse is not defined. What our scheme provides, probabilistically, is the relation inverse [7,8].
Ensemble-average versus time-average: A sigmoidal response was presented in [9] for the ensemble-averaged magnetization of large barrier magnets biased along a neutral state. This was proposed as a building block for both Ising computers and directed belief networks, and a recent paper [10] describes a similar approach applied to a graph coloring problem. By contrast, low barrier nanomagnets provide a sigmoidal response for the time-averaged magnetization, and a suitably engineered network of such nanomagnets could cycle through the $2^N$ collective states at GHz rates, with an emphasis on the "low energy states" which can encode the solution to combinatorial optimization problems, like the traveling salesman problem (TSP), as shown in [11]. Once the time-varying magnetization has been converted into a time-varying voltage through a READ circuit, a simple RC circuit can be used to extract the answer through a moving time average. For example, in Fig. 1c the red trace was obtained from the rapidly varying blue trace using an RC circuit in a SPICE simulation.
The central feature underlying both implementations is the p-bit that acts like a tunable random number generator, providing an intrinsic sigmoidal response for the ensemble-averaged or the time-averaged magnetization as a function of the spin current. It is this response that allows us to correlate the fluctuations of different p-bits in a useful manner by interconnecting them according to
$$I_i(t) = I_0 \Big( h_i + \sum_j J_{ij}\, m_j(t) \Big) \qquad (2)$$
where $h_i$ provides a local bias to magnet i, $J_{ij}$ defines the effect of bit j on bit i, and $I_0$ sets a global scale for the strength of the interactions, like an inverse "pseudo-temperature", giving a dimensionless current $I_i$ to each p-bit. The computation of $I_i(t)$ in terms of $m_j(t)$ in Eq. (2) is assumed to be instantaneous; in hardware implementations there can be interconnect delays that relate $m_j(t)$ to currents at a later time, $I_i(t')$. Equation (1) arises naturally from the physics of low barrier nanomagnets, as we have discussed above. Equation (2) represents the "weight logic", for which there are many candidates such as memristors [12], floating-gate based devices [13], domain-wall based devices [14], and standard CMOS [15]. The suitability of these options will depend on the range of J values and the sparsity of the J-matrix.
Equations (1) and (2) are essentially the same as the defining equations for Boltzmann machines introduced by Hinton and his collaborators [16], which have had enormous impact in the field of machine learning, but they are usually implemented in software that is run on standard CMOS hardware. The primary contributions of this paper are threefold:
• Hardware implementation: It may seem "obvious" that an unstable magnet could provide a natural hardware for representing a p-bit, but we would like to stress a less obvious point. To the best of our knowledge, simple two-terminal devices are not suitable for constructing large scale correlated networks of the type envisioned here. Instead, we need three-terminal building blocks with transistor-like gain and input-output isolation, as shown in Fig. 1b [9]. To stress this point, we describe a concrete implementation of a Boolean function using detailed nanomagnet and transport simulations that are in good agreement with those obtained by the generic model based on Eq. (1). All other results in this paper are based on Eq. (1) in order to emphasize the generality of the concept of p-bits, which need not necessarily be nanomagnet-based [17,18].
• Boltzmann machines (BM) for invertible Boolean logic (Fig. 2a): Much of the current emphasis on BMs is on "learning", giving rise to the concept of restricted Boltzmann machines [19]. By contrast, this paper is about Boolean logic, extending an established method for Hopfield networks [20] to provide a mathematical prescription to turn any Boolean truth table into a symmetric J-matrix (Eq. (2), with $J_{ij} = J_{ji}$), in one shot with no "learning" being involved. This design principle seems quite robust, functioning satisfactorily even when the J-matrix elements are rounded off, so that the required interconnections are relatively sparse and quantized, which simplifies the hardware implementation. The numerical probabilities agree well with those predicted from the energy functional
$$E(\{m\}) = -I_0 \Big( \sum_{i<j} J_{ij}\, m_i m_j + \sum_i h_i m_i \Big) \qquad (3)$$
using the Boltzmann law:
$$P(\{m\}) = \frac{\exp[-E(\{m\})]}{\sum_{\{m\}} \exp[-E(\{m\})]} \qquad (4)$$
Most importantly, we show that the resulting Boolean gates are invertible: not only do they provide the correct output for a given input, for a given output they provide the correct input(s). If the given output is consistent with multiple inputs, the system fluctuates among all possible answers. This remarkable property of invertibility is absent in standard digital circuits and could help provide solutions to the Boolean satisfiability problem (Fig. 8) [21].
• Directed networks of BM (Fig. 2b): Finally, we show that individual BMs can be connected to perform precise arithmetic operations, which are the norm in standard digital logic but quite surprising for BMs, which are more like a collection of interacting particles than like a digital circuit. We show that a 32-bit adder converges to the one correct sum out of $2^{33}$ ≈ 8 billion possibilities when the interaction parameter is suddenly turned up from, say, $I_0 = 0.25$ to $I_0 = 5$. This can be likened to quenching a molten liquid and getting a perfect crystal. What we would expect is plenty of defects, distributed differently every time we do the experiment. That is exactly what we get if the individual BM Full Adders comprising the 32-bit adder are connected bidirectionally ($J_{ij} = J_{ji}$). But by making the connection between adders directed ($J_{ij} \neq J_{ji}$), we obtain the striking accuracy of digital circuits while largely retaining the invertibility of BMs. This is a key result that we establish with extensive examples, including a 4-bit multiplier which in inverted mode functions as a factorizer.
Each of these three contributions is described in detail in the three sections that follow.

II. AN EXAMPLE HARDWARE IMPLEMENTATION OF PSL
To ensure that individual p-bits can be interconnected to produce robust correlations, it is important to have separate terminals for writing (more correctly, biasing) and reading, marked W and R respectively in Fig. 3a. With nanomagnets having in-plane magnetic anisotropy (IMA), e.g. circular nanomagnets, this could be accomplished following existing experiments [24,25] using the giant spin Hall effect (GSHE). Recent experiments using a built-in exchange bias [26][27][28][29] could make this approach applicable to magnets with perpendicular magnetic anisotropy (PMA) as well. Note, however, that these experiments have all been performed with stable free layers, and would have to be carried out with low barrier magnets in order to establish their suitability for the implementation of p-bits. As the field progresses, one can expect the bias terminal to involve voltage control [30,31] instead of current control, just as the output could involve quantities other than magnetization.
We will now show a concrete implementation of a Boolean function using minimal CMOS circuitry in conjunction with stochastic nanomagnets, through detailed nanomagnet and transport simulations that are in good agreement with those obtained from the generic model based on Eq. (1). Fig. 3a shows a possible CMOS-assisted p-bit that has separate READ and WRITE paths. The device consists of a heavy metal exhibiting the GSHE that drives a circular magnet, which replaces the usual elliptical magnet in order to provide the stochasticity needed for the magnetization. A small read current flowing through the fixed layer senses the instantaneous magnetization; in our design this current is assumed not to disturb the magnetization of the free layer. The sensed signal is amplified and isolated by two inverters that act as a buffer.
This structure is very similar to the experimentally demonstrated GSHE switching of elliptical magnets that were similarly read out by an MTJ [24], the only difference being that the elliptical magnets are replaced by circular magnets with an aspect ratio of one. This device could be viewed as replacing the free layers of the GSHE-driven MTJs demonstrated in [24] with those in the telegraphic regime [25,32-34].
In the presence of thermal noise, the magnetization of such a circular magnet rotates in the plane of the circle without the preferred easy axis that would have arisen from shape anisotropy, effectively making its thermal stability ∆ ≈ 0 kT [35]. This magnetization can be pinned by a spin current that is generated by flowing a charge current through the GSHE layer. The magnetic-field-driven sigmoidal responses of magnetization for such circular magnets have been demonstrated experimentally [36,37], while the spin-current-driven pinning has not been demonstrated, to our knowledge. Using validated modules for transport and magnetization dynamics [38] (Fig. 3b), we solve the stochastic Landau-Lifshitz-Gilbert (sLLG) equation in the presence of thermal noise and a GSHE current. The following subsection gives the detailed simulation parameters.
Sigmoidal response: A long-time average (t = 500 ns) of the magnetization $m_z$ as a function of a GSHE-generated spin current is plotted in Fig. 3e, which displays the desired sigmoidal characteristic for p-bits dictated by Eq. (1). The x-axis of Fig. 3e is normalized to the geometric gain factor that relates the charge current to the spin current exerted [39,40]:
$$\beta = \frac{I_s}{I_c} = \theta_{SH}\, \frac{L}{t} \left( 1 - \mathrm{sech}\,\frac{t}{\lambda} \right) \qquad (5)$$
where $\theta_{SH}$ is the Hall angle, and L, t and λ are the length, thickness and spin-relaxation length of the heavy metal. The quantity β can be made much greater than 1, providing an intrinsic gain [41]; for the parameters used in the present examples, however, β ≈ 1.5. Another quantity that is used to normalize the x-axis of Fig. 3e is the "thermal spin current", which corresponds to the strength of the thermal noise that needs to be overcome for a circular magnet to be pinned in a given direction:
$$I_s^{th} = \frac{4q}{\hbar}\, \alpha\, kT \qquad (6)$$
where q is the electron charge and α is the damping coefficient of the magnet. $I_s^{th}$, $I_s$ and $I_c$ all have units of charge current, therefore we can define the dimensionless interaction parameter $I_0$ of Eq. (2) as $I_0 \equiv \beta I_c / I_s^{th} = I_s / I_s^{th}$. It can be seen from Fig. 3e that when the applied spin current is $\beta I_c / I_s^{th} = I_s / I_s^{th} \approx 10$, the magnetization of the circular magnet is pinned in the ±z directions for these particular parameters. For PMA magnets with low barriers (∆ ≪ kT), the pinning current is independent of the volume as long as increasing the volume does not invalidate the ∆ ≪ kT assumption. This can be shown analytically from a 1D Fokker-Planck equation [42], and we have reproduced this behavior directly from sLLG simulations. For the in-plane (circular) magnets considered here, the pinning current in general depends on $M_s$ and the volume, and the dimensionless pinning current can be larger.
Nevertheless, it is possible to estimate the thermal spin current for typical damping coefficients of α = 0.01 − 0.1, Pinning currents for superparamagnets are at least an order of magnitude smaller than the critical switching currents of stable magnets [43]. I th s , defined by Eq. (6) also sets the scale for I 0 defined in Eq. (2) suggesting that a stochastic nanomagnet based implementation of PSL could be more energy efficient than the standard spin-torque switching of stable magnets that suffer from high current densities.
Need for three-terminal devices with READ-WRITE separation: Note that a crucial function of the READ circuit and the CMOS transistors in this design is the ability to turn the magnetization into an output voltage that is proportional to m z , providing gain for fan-out and isolation to avoid any read disturb. Indeed, a critical requirement for any other alternative implementations of p-bits is the need for three terminal devices with separate READ and WRITE paths to provide gain and isolation. In this particular design these features come in by directly integrating CMOS transistors, but CMOS-free, all-magnetic designs with these characteristics have been proposed [41,44]. Our purpose is to simply show how a p-bit can be realized by using experimentally demonstrated technology. Alternative designs are beyond the scope of this paper.
READ Circuit: For the output to provide symmetric voltage swings on the GSHE layer, the minus supply $V^-$ needs to be set to $V_{DD}/2$, since $V_{OUT}$ ranges between 0 and $V_{DD}$. $V^+$ is set to $V_{DD}/2 + V_R$, where $V_R$ is a small READ voltage that is amplified by the inverters. We assume a simple, bias-independent MTJ model [45]:
$$G_{MTJ} = G_0 \left( 1 + P^2 m_z \right) \qquad (7)$$
where P is the interface polarization and $G_0$ is the average MTJ conductance. Setting the reference resistance $R_0$ (Fig. 3c) equal to $G_0^{-1}$, the input voltage to the inverters, $V_M$ in Fig. (2d), becomes:
$$V_M = \frac{V_{DD}}{2} + \frac{V_R}{2 + P^2 m_z} \qquad (8)$$
In the absence of a bias, $m_z$ averages to 0 and the middle voltage fluctuates around the mean $V_M = V_{DD}/2 + V_R/2$. This requires the inverter characteristic to be shifted to this value to produce a telegraphic output that fluctuates between 0 and $V_{DD}$ with equal probability (Fig. 3f). This shift is easily engineered by sizing the pFET and nFET transistors differently: a wider pFET shifts the inverter characteristic towards $V_{DD}$, as we will show in the next subsection.
Interconnection matrix: A passive resistor network can be used as a possible interconnection scheme to correlate the p-bits as shown in Fig. 4. A proper design of the interconnection matrix J that has only a few discrete values ensures a minimal number of different conductances (G ij ). In this demonstrated example the AND gate requires only 2 unique, discrete conductance values.
The spin currents that need to be delivered to each p-bit are on the order of a few µA and can be generated with charge currents that are even smaller, due to the GSHE gain. This means the interconnection resistances $R_{ij}$ could be on the order of 100 kΩ, since the voltage drops across these resistances are around $V_{OUT} - V^- \approx \pm 0.5$ V. Since the GSHE ground $V^- = V_{DD}/2$ simply shifts all the voltages to get symmetric ± swings, we define the shifted voltages $V'_{OUT,i} = V_{OUT,i} - V^-$. Then the input current to each p-bit can be expressed as (Fig. 4a):
$$I_i^{in} = \sum_j G_{ij}\, V'_{OUT,j} \qquad (9)$$
assuming $\sum_j G_{ij} \ll G_{GSHE}$, since the heavy metal resistances are typically much less than hundreds of kΩ. We have verified the validity of Eq. (9) by SPICE simulations for the parameters chosen for these examples.
As a result, we observe that Eq. (9) constitutes a hardware mapping for the interconnections of Eq. (2). In this scheme G ij conductances are initially adjusted to obtain a global interaction strength I 0 for a given problem. Alternatively, the interaction strength can be adjusted electrically by varying the supply voltages.
Invertible AND Gate: Fig. 4b shows an explicit implementation of an invertible AND gate (A ∩ B = C) corresponding to [J] and {h} matrices [46] that have 3 unique, integer entries: In Fig. 4d, we show the inverse operation of the AND gate where we clamp the output bit C to a 0 or 1 by the bias voltage attached to its input terminal. The interconnection resistance is chosen to be R 0 = 125 kΩ that roughly provides ≈ ±6 µA of charge current to each p-bit, corresponding to an I 0 ≈ 3.5 for the chosen parameters.
Generating the histogram: At the end of the simulation (t = 200 ns), we threshold the voltage outputs of A, B and C by treating all voltages above $V_{DD}/2 = 0.4$ V as 1 and all voltages below $V_{DD}/2$ as 0. A histogram of the thresholded word [ABC] is then obtained and normalized to unit probability. Clamping the output to 0 and letting A and B float makes A and B fluctuate in a correlated manner, so that they visit the three possible states (00, 01, 10) with approximately equal probability. Resolving the output 0 into the three possible input combinations is, in a way, "factorizing" the output. Conversely, clamping the output to 1 produces a strong (11) peak in the histogram of [ABC], which is the only consistent input combination for C = 1 (Fig. 4c-d).
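The inverse operation described above can be imitated with the generic model of Eqs. (1) and (2). The following is a rough sketch, assuming NumPy. The 3×3 matrix `J` and vector `h` are one known Ising-type encoding of AND and are an assumption here, not necessarily the matrices used for Fig. 4. Clamping is simplified to fixing the value of a p-bit rather than applying a bias voltage, and `run_gate`, the seed and the number of steps are illustrative.

```python
import numpy as np
from collections import Counter

# Assumed 3-p-bit AND encoding, p-bits ordered [A, B, C] (illustrative values).
J = np.array([[0, -1, 2],
              [-1, 0, 2],
              [2, 2, 0]], dtype=float)
h = np.array([1, 1, -2], dtype=float)

def run_gate(clamp, I0=3.5, steps=20000, seed=1):
    """Serially update the free p-bits and histogram the binary word [ABC].

    clamp: dict mapping a p-bit index to its fixed bipolar value (+1 or -1).
    """
    rng = np.random.default_rng(seed)
    m = rng.choice([-1, 1], size=3).astype(float)
    for i, value in clamp.items():
        m[i] = value
    free = [i for i in range(3) if i not in clamp]
    counts = Counter()
    for _ in range(steps):
        for i in rng.permutation(free):          # random serial update order
            I_i = I0 * (h[i] + J[i] @ m)         # Eq. (2)
            m[i] = 1.0 if rng.uniform(-1, 1) + np.tanh(I_i) >= 0 else -1.0  # Eq. (1)
        counts["".join("1" if x > 0 else "0" for x in m)] += 1
    return {word: n / steps for word, n in counts.items()}

inverse_0 = run_gate({2: -1.0})   # output C clamped to 0
inverse_1 = run_gate({2: +1.0})   # output C clamped to 1
print("C = 0:", inverse_0)
print("C = 1:", inverse_1)
```

With C clamped to 0 the words 000, 010 and 100 should each appear roughly a third of the time, and with C clamped to 1 the word 111 should dominate, mirroring the histograms described for Fig. 4c-d.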
Assumptions of the model: We have made several simplifying assumptions while modeling the hardware implementation of a p-bit. (1) The READ voltage that is amplified by the inverters produces a small current that passes through the circular magnet and might potentially disturb its current state. We assumed that this current (labeled as I S2 in Fig. 3b) is negligible and does not affect the magnetization of the stochastic magnet. (2) We assumed that the spin current generated by the heavy metal is deposited into the free layer with perfect efficiency (I S1 = I S1 in Fig. 3b); depending on the interface properties, however, this conversion factor can be less than 100%. (3) We have also assumed that the fixed layer does not produce a notable stray field on the circular magnet. Note that the presence of such a constant field would simply shift the sigmoidal behavior presented in Fig. 3d-e to the right (or left) and could be offset by a constant bias current. (4) Finally, we have neglected the resistance of the GSHE portion in the READ circuit (Fig. 3c), assuming the MTJ resistance would be dominant in this path.

Detailed Simulation Parameters
This section gives the details of the simulation parameters for the hardware implementation of p-bits that are used for Fig. 3.
sLLG for stochastic circular magnets: The magnetization of a circular nanomagnet, described as $\hat{m}_i$, is obtained from the stochastic Landau-Lifshitz-Gilbert (sLLG) equation: where α is the damping coefficient, q is the electron charge, γ is the electron gyromagnetic ratio, and $I_s$ is the spin current, which is assumed to be uniformly distributed over the total number of spins in the macrospin, $N_i = M_s \mathrm{Vol.}/\mu_B$, $\mu_B$ being the Bohr magneton. It is assumed that the spin current generated from the GSHE layer is polarized in the z-direction, such that $\vec{I}_{Si} = I_S \hat{z}$. $H_i$ is the effective field of the circular magnet, where the uniaxial anisotropy is assumed to be negligible, but there is still a strong demagnetizing field. The thermal fluctuations also enter through the effective magnetic field: The CMOS inverter characteristics in conjunction with a spherical representation-based sLLG are obtained using the modular framework developed in [38] using HSPICE.
14 nm FinFET Inverter Characteristics: Fig. 5 shows the input/output characteristics of the single and double inverters that are used to amplify the stochastic signal that is generated by the MTJ (Fig. 3). At zero-bias from the GSHE, the amplified signal V M (Eq. 8) is in the middle of V + and V − which is V DD /2 + V R /2. The buffer response can be shifted to this value by increasing the size of pFETs, as shown in Fig. 5.

III. INVERTIBLE BOOLEAN LOGIC WITH BOLTZMANN MACHINES
We now present a mathematical prescription that shows how any given truth table can be implemented in terms of Boltzmann Machines, in "one shot" with no learning being involved, unlike much of the past work in this area (see, for example, [50,51]). In Section II, we chose a simple [J] and {h} matrix to implement an AND gate based on [46]. In this section, we outline a general approach to show how any truth table can be implemented in terms of such matrices. Our approach, pictorially described in Fig. 6, begins by transforming a given truth table from binary (0, 1) to bipolar (−1, +1) variables. The lines of the truth table are then required to be eigenvectors, each with eigenvalue +1; all other eigenvectors are assumed to have eigenvalues equal to 0. This leads to the following prescription for J, as shown in Fig. 6:
$$J = \sum_{i,j} [S^{-1}]_{ij}\, u_i u_j^T \qquad (12a)$$
$$S_{ij} = u_i^T u_j \qquad (12b)$$
where $u_i$ are the eigenvectors corresponding to lines in the truth table of a Boolean operation and S is a projection matrix that accounts for the non-orthogonality of the vectors defined by different lines of the truth table. Note that the resultant J-matrix is always symmetric ($J_{ij} = J_{ji}$), with diagonal terms that are subtracted in our models such that $J_{ii} = 0$. The number of p-bits in the system is made greater than the number of lines in a truth table through the addition of hidden units (Fig. 6), to ensure that the number of conditions we impose is less than the dimension of the space defined by the number of p-bits. Another important aspect in the construction of [J] is that an eigenvector $u_i$ implies that its complement $-u_i$ is also a valid eigenvector. However, only one of these might belong to a truth table. We introduce a "handle" bit to each $u_i$ that is biased ($h_i$) to distinguish complementary eigenvectors. These handle bits provide the added benefit of reconfigurability. For example, AND and OR gates have complementary truth tables, and a given gate can be electrically reconfigured as an AND or an OR gate using the handle bit.
J-Matrices for AND/FA: We now provide the details of the J-matrix for the AND gate, obtained using the prescription shown in Fig. 6 based on Eq. (12a). The eigenvectors of the truth table for the AND gate in Fig. 6 are placed into a matrix U, such that $U = [u_1\ u_2\ u_3\ u_4]$, where $u_1$ is the first row of the matrix shown in Fig. 6, $u_1 = [-1\ {+1}\ {+1}\ {+1}\ {+1}\ {-1}\ {-1}\ {-1}]^T$, and so on. In matrix notation, the S-matrix can be written as:
$$S = U^T U \qquad (13)$$
Then the J-matrix becomes:
$$J = U S^{-1} U^T \qquad (14)$$
Removing the diagonal entries by making $J_{ii} = 0$ and multiplying the matrix entries by 2 to obtain simple integer entries gives the AND matrix $J_{AND}$ (Eq. (15)), with the notation [1-5: auxiliary bits and handle bit, 6: "A", 7: "B", 8: "C"]. Following a similar procedure, we use a 14 × 14 Full Adder matrix, $J_{FA}$ (Eq. (16)), with the notation [1-9: auxiliary bits and handle bit, 10: "$C_{in}$", 11: "B", 12: "A", 13: "S", 14: "$C_{out}$"]. These are the J-matrices (AND and FA) that are used for all examples in the paper, except for the AND gate described in Section II. Fig. 10 shows the "truth table" operation of the Full Adder, where all input/output terminals are "floating", using the J-matrix of Eq. (16). It shows excellent quantitative agreement with the Boltzmann distribution of Eq. (4) at steady state, even for the undesired peaks of the truth table.
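The projection prescription can be sketched numerically as $J = U (U^T U)^{-1} U^T$. The example below assumes NumPy. Only the first vector $u_1$ is quoted in the text; the auxiliary-bit patterns of the other three columns are hypothetical choices made so that the four vectors are linearly independent, so the resulting numbers should not be read as the paper's $J_{AND}$.

```python
import numpy as np

# Rows are truth-table lines in bipolar form, bit order
# [4 auxiliary bits, 1 handle bit, A, B, C]; transposed so that they
# become the columns u_1..u_4 of U.
U = np.array([
    [-1, +1, +1, +1, +1, -1, -1, -1],   # A=0, B=0, C=0  (u_1, quoted in the text)
    [+1, -1, +1, +1, +1, -1, +1, -1],   # A=0, B=1, C=0  (hypothetical auxiliary bits)
    [+1, +1, -1, +1, +1, +1, -1, -1],   # A=1, B=0, C=0  (hypothetical auxiliary bits)
    [+1, +1, +1, -1, +1, +1, +1, +1],   # A=1, B=1, C=1  (hypothetical auxiliary bits)
], dtype=float).T

S = U.T @ U                              # overlap matrix, S_ij = u_i^T u_j
J = U @ np.linalg.inv(S) @ U.T           # projector onto span{u_1, ..., u_4}

eigenvalues = np.sort(np.linalg.eigvalsh(J))
J_offdiag = J - np.diag(np.diag(J))      # remove diagonal terms, J_ii = 0

print("J u_i = u_i for all i:", np.allclose(J @ U, U))
print("eigenvalues of J:", np.round(eigenvalues, 6))
```

Because J is a projector onto the span of the truth-table vectors, each $u_i$ is an eigenvector with eigenvalue +1 and the remaining eigenvalues are 0, which is the stated design requirement; the inverse of S is what compensates for the non-orthogonality of the $u_i$.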
Note that this prescription for [J] is similar to the principles developed originally for Hopfield networks ([52], and Eq. (4.20) in [20]). However, other approaches are possible along the lines described in the context of Ising Hamiltonians for quantum computers [46]. We have tried some of these other designs for [J] and many of them lead to results similar to those presented here. For practical implementations, it will be important to evaluate different approaches in terms of their demands on the dynamic range and accuracy of the weight logic.
Description of universal model: Once a J-matrix and the h-vector are obtained for a given problem, the system is initialized by randomizing all $m_i$ at time $t = t_0$. First, the current (voltage) that a given p-bit ($m_i$) feels due to the other coupled $m_j$ is obtained from Eq. (2), and the $m_i$ value is updated according to Eq. (1). Next, the procedure is repeated for the remaining p-bits by finding the current they receive due to all other $m_i$, using the updated values of $m_i$. For this reason, the order of updating was chosen randomly in our models, and we found that the order of updating has no effect on our results. However, updating the p-bits in parallel leads to incorrect results. These two observations are well-known in the context of Hopfield networks and Boltzmann Machines [53][54][55]. This type of serial updating corresponds to the "asynchronous dynamics" [20,56]. We note that the hardware implementation discussed in this paper naturally leads to an asynchronous updating of p-bits in the absence of a global clock signal. We have set up an online simulator based on this model in Ref. [23] so that interested readers can simulate some of the examples discussed in this paper.
FIG. 8. Implementing a Boolean function and its inverse: The input or output terminals of an appropriately interconnected network of p-bits can be "clamped" to perform a specific logic operation or its inverse. In this example, the input bits (A,B) of an OR gate are clamped to be +1, forcing the output bit C to be 1, during the first phase of operation (t < t0). In the second phase of operation (t > t0), the output of the OR gate C is clamped to the value +1, which is consistent with three different combinations of (A,B). As shown in the time response and the long-time histogram plots, all three possibilities emerge with equal probability, demonstrating the "inverse" OR operation. In each case, the expected probabilities from the Boltzmann Law (Eq. (4)) closely match those produced by the generic model, Eqs. (1) and (2), after running the system for one million steps; only a fraction is shown in the upper panel for clarity.
FIG. 9. Noise Tolerance of AND: The probability of a wrong output for an AND gate (Eq. 15) operated with clamped inputs is investigated in the presence of a random noise field which enters Eq. (2) as indicated in the figure. The noise is assumed to be uniformly distributed over all p-bits in a given network, and centered around zero with magnitude ±hn, where (I0 = 2, hi = ±1). Each gate is simulated 50000 times for T = 100 time steps to produce an error probability for a given noise value, and the maximum peak produced by the system is assumed to be an output that can be read with certainty. The system shows robust behavior even in the presence of large levels of noise.
Fig. 7 shows the time evolution of an AND gate based on Eq. (15). Initially, for $t < t_0$, the interaction strength is zero ($I_0 = 0$), making the pseudo-temperature of the system infinite, and the network produces uncorrelated noise, visiting each state with equal probability. In the second phase ($t > t_0$), the interaction strength is suddenly increased to $I_0 = 2$, effectively "quenching" the network by reducing the temperature. This correlates the system such that only the states corresponding to the truth table of the AND gate are visited, each with equal probability when a long time average is taken. The average probabilities in each phase quantitatively match the Boltzmann Law defined by Eq. (4).
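The two phases of this quench can be reproduced by direct enumeration of the Boltzmann law. The sketch below assumes NumPy, the standard Boltzmann-machine energy $E = -I_0 (\sum_{i<j} J_{ij} m_i m_j + \sum_i h_i m_i)$, and a compact 3-p-bit AND encoding ordered [A, B, C]; that encoding is an assumption used to keep the state space small and is not the 8-p-bit matrix of Eq. (15).

```python
import itertools
import numpy as np

# Assumed 3-p-bit AND encoding, p-bits ordered [A, B, C] (illustrative values).
J = np.array([[0, -1, 2],
              [-1, 0, 2],
              [2, 2, 0]], dtype=float)
h = np.array([1, 1, -2], dtype=float)

def boltzmann(J, h, I0):
    """Enumerate all bipolar states and return (states, probabilities)."""
    states = np.array(list(itertools.product([-1, 1], repeat=len(h))), dtype=float)
    # 0.5 * m^T J m equals the sum over pairs i < j for symmetric J with J_ii = 0.
    energy = -I0 * (0.5 * np.einsum("si,ij,sj->s", states, J, states) + states @ h)
    weights = np.exp(-(energy - energy.min()))   # shift for numerical stability
    return states, weights / weights.sum()

def is_and_line(state):
    a, b, c = (state > 0)
    return bool(c == (a and b))

for I0 in (0.0, 2.0):
    states, probs = boltzmann(J, h, I0)
    print(f"I0 = {I0}")
    for s, p in zip(states, probs):
        word = "".join("1" if x > 0 else "0" for x in s)
        print(f"  [ABC] = {word}  P = {p:.4f}  {'(truth table)' if is_and_line(s) else ''}")
```

At $I_0 = 0$ all eight states are equally likely, while at $I_0 = 2$ nearly all of the probability sits on the four truth-table lines in equal shares, which is the qualitative behavior described for the two phases.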
In Fig. 8, we show how a correlated network producing a given truth table can be used to do directed computation analogous to standard CMOS logic. An OR gate is constructed by using the same [J] matrix for an AND gate, but with a negated handle bit. By "clamping" the input bits of an OR gate (t < t 0 ) through their bias terminals, h i , to (A,B)=(+1,+1), the system is forced to only one of the peaks of the truth table, effectively making C=1.
The PSL gates, however, differ remarkably from standard logic gates in that inputs and outputs are on an equal footing. Not only do clamped inputs give the corresponding output; a clamped output also gives the corresponding input(s). In the second phase (t > t0) the output of the OR gate is clamped to +1, which produces three peaks for the input terminals, corresponding to the input combinations that are consistent with the clamped output: (A,B) = (0,1), (1,0) and (1,1). The probabilistic nature of PSL allows it to obtain multiple solutions (Fig. 8c). It also seems to make the results more resilient to unwanted noise due to stray fields that are inevitable in physical implementations, as shown in Fig. 9. Here, we simulate an AND gate in the presence of normally distributed random noise that enters the bias fields of each p-bit, and define the computation to be faulty if the mode (most frequent value) of the output bit is not consistent with the programmed input combination after T = 100 time steps. We observe that even large levels of uncontrolled noise produce correct results with high probability.
Fig. 10 shows the design of a Full Adder (FA) with the 8-line truth table shown. There are three inputs in all: two from the numbers to be added, and one carry bit from the previous FA. It produces two outputs: the sum bit, and a carry bit to be passed on to the next FA. The probabilities of the different states are calculated using J_FA from Eq. (16), with I0 = 0.5, in the truth table mode, where all inputs and outputs are floating and the states are numbered using the decimal number corresponding to the binary word [C_i A B S C_o]. The decimal numbers corresponding to the truth table are shown in the inset, and these match the locations of the taller peaks in the histogram. Note that the Boltzmann distribution (Eq. (4)) quantitatively matches the model even for the suppressed peaks. A higher I0 reduces these suppressed peaks further. The statistics are collected for T = 10^6 steps, and each terminal output is then placed in the histogram.
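The direct and inverted modes of the OR gate described above can be checked by exact enumeration of a Boltzmann distribution. The following is a minimal sketch, not the simulation used for Fig. 8. It assumes one particular Ising encoding of the AND gate (the matrix `J_AND` and bias vector `H_AND` below, in bit order A, B, C), an energy of the form E = −(Σ Jij mi mj + Σ hi mi) with P ∝ exp(−I0 E), bipolar values ±1 standing for logical 1 and 0, and clamping modeled as conditioning (an infinitely strong bias). All function names are hypothetical.

```python
import itertools
import math

# Assumed Ising encoding of an AND gate, bit order (A, B, C), values +/-1.
J_AND = [[0, -1, 2],
         [-1, 0, 2],
         [2, 2, 0]]
H_AND = [1, 1, -2]
# OR gate: same J matrix with a negated bias vector, following the text.
J_OR = J_AND
H_OR = [-h for h in H_AND]


def energy(m, J, h):
    """E(m) = -(sum_{i<j} J_ij m_i m_j + sum_i h_i m_i)."""
    n = len(m)
    pair = sum(J[i][j] * m[i] * m[j] for i in range(n) for j in range(i + 1, n))
    bias = sum(h[i] * m[i] for i in range(n))
    return -(pair + bias)


def boltzmann(J, h, I0, clamp=None):
    """Exact Boltzmann probabilities over the states compatible with `clamp`.

    `clamp` maps a bit index to a fixed value (+1 or -1). Clamping is modeled
    by conditioning, which is an assumption standing in for a strong bias.
    """
    clamp = clamp or {}
    states = [m for m in itertools.product((-1, 1), repeat=len(h))
              if all(m[i] == v for i, v in clamp.items())]
    weights = [math.exp(-I0 * energy(m, J, h)) for m in states]
    Z = sum(weights)
    return {m: w / Z for m, w in zip(states, weights)}


# Direct mode: clamp A = B = +1 and read out C.
direct = boltzmann(J_OR, H_OR, I0=2.0, clamp={0: 1, 1: 1})
# Inverted mode: clamp C = +1 and inspect the distribution over (A, B).
inverted = boltzmann(J_OR, H_OR, I0=2.0, clamp={2: 1})

print("direct mode, P(C = +1):", round(direct[(1, 1, 1)], 6))
for state, p in sorted(inverted.items(), key=lambda kv: -kv[1]):
    print("inverted mode", state, round(p, 4))
```

Under these assumptions the direct mode puts essentially all of the weight on C = +1, while the inverted mode splits the weight almost equally among the three input combinations consistent with the clamped output, with the fourth combination strongly suppressed.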

IV. DIRECTED NETWORKS OF BOLTZMANN MACHINES
When constructing larger circuits composed of individual Boltzmann machines, the reciprocal nature of the Boltzmann machine often interferes with the directed nature of the computation that is desired. It seems advisable to use a hybrid approach. For example, in constructing a 32-bit adder we use Full Adders (FA) that are individually BMs with symmetric connections, Jij = Jji. But when connecting the carry bit from one FA to the next, the coupling element Jij is non-zero in only one direction, from the least significant to the most significant bit. This directed coupling between the components distinguishes PSL from purely reciprocal Boltzmann machines. Indeed, even the Full Adder could be implemented not as a Boltzmann machine but as a directed network of more basic gates, but then it would lose its invertibility. On the other hand, the directed connection of BM Full Adders largely preserves the invertibility of the overall system, as we will show.
Fig. 11 shows the operation of a 32-bit adder that sums two 32-bit numbers A and B to calculate the 33-bit sum S. In the initial phase (t < t0) we have I0 = 0, corresponding to infinite temperature, so that the sum bits (S) fluctuate among 2^33 ≈ 8.6 billion possibilities. With I0 = 1, Fig. 11 shows that the correct answer has a probability of ≈ 12%, which is much lower than the ≈ 100% that can be achieved with larger I0 values (as in Fig. 13a-c with I0 = 5). Nevertheless the peak is unmistakable, as is evident from the expanded-scale histogram, and the correct answer is extracted from the majority vote of T = 100 samples as shown in Fig. 13. This ability to extract the correct answer despite large fluctuations is a general property of probabilistic algorithms.
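The claim that a ≈ 12% peak is enough for a majority vote over T = 100 samples can be illustrated with a toy model. The sketch below does not simulate the adder: it assumes, purely for illustration, that each time sample equals the correct 33-bit word with probability 0.12 and is otherwise a uniformly random 33-bit word. The value of `correct_sum` and the function names are hypothetical.

```python
import random
from collections import Counter


def mode_of_samples(samples):
    """Return the most frequent value among the samples (majority vote)."""
    return Counter(samples).most_common(1)[0][0]


def draw_samples(correct, p_correct, T, n_bits, rng):
    """Toy sampler: the correct word with probability p_correct, otherwise a
    uniformly random n_bits-wide word (a simplifying assumption)."""
    return [correct if rng.random() < p_correct else rng.getrandbits(n_bits)
            for _ in range(T)]


rng = random.Random(0)        # fixed seed so the run is reproducible
correct_sum = 0x123456789     # hypothetical 33-bit answer
samples = draw_samples(correct_sum, p_correct=0.12, T=100, n_bits=33, rng=rng)

print("mode equals correct sum:", mode_of_samples(samples) == correct_sum)
print("occurrences of the correct sum:", samples.count(correct_sum))
```

In this toy model the wrong samples are spread over so many words that they almost never repeat, so the mode picks out the correct word as soon as it appears more than once. In the real network the wrong samples are correlated rather than uniform, so this is only a rough upper bound on how easy the extraction is.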

32-bit Adder/Subtractor
Interestingly, although the overall system includes several unidirectional connections, it seems to be able to perform the inverse function as well. With A and B clamped it calculates S = A + B as noted above. Conversely, with S clamped, the input bits A and B fluctuate in a correlated manner so as to make their sum sharply peaked around S. Fig. 11 shows the time evolution of the input bits, which have broad distributions spanning a wide range. Initially, when I0 is small, the sum of A and B also shows a broad distribution, but once I0 is turned up to 1, the distributions of A and B become strongly correlated, making the distribution of A+B sharply peaked around the fixed value of S. Note that, unlike standard digital circuits, the 32-bit adder shown in Fig. 11 is invertible. The demonstration of such an invertible 32-bit adder could be practically significant, since binary addition is noted to be the most fundamental and frequently used operation in digital computing [57].
FIG. 11. (a) The overall J-matrix for a 32-bit adder is shown, and it is quite sparse and quantized. (b) For t < t0, I0 = 0 and the sum fluctuates randomly. At t = t0, I0 is suddenly increased, and the adder converges on the correct result for two random inputs A and B. The distribution of 1000 data points (t > t0) shows a single peak with 24% probability of time spent in the correct state (not including the uncorrelated time points for t < t0). (c) Even though the connections between the Full Adder units are directed, the system performs the inverse function as well. When the output (S) is clamped to a fixed number, the inputs (A) and (B) fluctuate in a correlated manner to make A+B = S when I0 = 1. Note the broad distributions of A and B (collected for t > t0) as compared to the extremely sharp distribution of A+B.
Delay of Ripple Carry Adder: Just as in CMOS-based Ripple Carry Adders, the delay of the p-bit based RCA is a function of the inputs A and B. In Fig. 12 we have systematically studied the worst-case delay of the p-bit based Ripple Carry Adder (RCA) as a function of increasing bit size. We selected a "worst-case" combination that results in a carry that needs to be propagated from bit 1 to bit N, which results in a linear increase in the delay, exhibiting O(N) complexity with input size, similar to CMOS implementations [58]. When the inputs are random, the delay seems to increase sub-linearly. The system is quenched at t = 0 for different interaction parameters I0, and the delay is defined to be the time it takes for the system to settle to the mode of the array for T = 200. An error check has been carried out separately to ensure the calculated sum (mode) is always exactly equal to the expected sum. For random inputs the delay of the 32-bit adder is close to 20 time steps, in accordance with the example shown in Fig. 11.
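The difference between worst-case and random inputs can be made concrete with the deterministic analogue of the carry chain. The sketch below is not a p-bit simulation and does not reproduce Fig. 12; it only counts how many bit positions a single carry has to ripple through. The specific worst-case pair (A = 11...1, B = 1) is an assumption, since the text does not state which combination was used, and the function names are hypothetical.

```python
import random


def longest_carry_chain(a, b, n_bits):
    """Length of the longest run of bit positions one carry ripples through."""
    carry, chain, longest = 0, 0, 0
    for i in range(n_bits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        generate, propagate = ai & bi, ai ^ bi
        if generate:
            chain = 1          # a new carry starts at this position
        elif propagate and carry:
            chain += 1         # the incoming carry ripples one bit further
        else:
            chain = 0          # the carry is absorbed, or there is none
        carry = generate | (propagate & carry)
        longest = max(longest, chain)
    return longest


def worst_case_inputs(n_bits):
    """Assumed worst case: A = 11...1 and B = 1 push a carry through every bit."""
    return (1 << n_bits) - 1, 1


for n in (4, 8, 16, 32):
    a, b = worst_case_inputs(n)
    print(n, "bits, worst-case chain:", longest_carry_chain(a, b, n))

rng = random.Random(1)
trials = [longest_carry_chain(rng.getrandbits(32), rng.getrandbits(32), 32)
          for _ in range(1000)]
print("mean chain for random 32-bit inputs:", sum(trials) / len(trials))
```

For the assumed worst case the chain length equals the bit width, which mirrors the linear growth described above, while random inputs give much shorter chains on average.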
Digital accuracy AND logical invertibility: The striking combination of accuracy and invertibility is made possible by our hybrid design, whereby the individual Full Adders are Boltzmann Machines, even though their connection is directed. Our 32-bit adder is more like a collection of interacting particles than like a digital circuit, as is evident from Fig. 13a, which shows a colormap of the binary state of each of the 448 p-bits as a function of time with the interaction parameter I0 suddenly increased from 0.25 to 5 at t0 = 50, thereby quenching a "molten liquid" into a "solid". Nevertheless it shows the striking accuracy of a digital circuit, with S−A−B exactly equal to zero in each of the 1000 trials as shown in Fig. 13b. We do not expect a "molten liquid" to be quenched into a "perfect crystal" every time. Instead, we would expect a "solid full of defects" with different non-zero values for S−A−B in each trial. That is exactly what we get if the carry bits are bidirectional as in a fully BM implementation (Fig. 13d).
Note however, that this digital accuracy is achieved while maintaining the property of invertibility that is absent in digital circuits. Fig. 13 is not for direct mode operation, but for the adder operating in reverse mode as a subtractor. It might be expected that the directed connection of carry bits from the less significant to the more significant bit could lead to a loss of invertibility. To investigate this point, we show the error S−A−B as a function of trial number (Fig. 14) for four different modes of operation with (i) A and B clamped (Addition), (ii) S and A clamped (Subtraction), (iii) A, B and S for the 16 most significant bits (msb) clamped, and (iv) A, B and S for the 16 least significant bits (lsb) clamped. The fully bidirectional implementation shows very large errors for all modes of operation. The directed implementation, on the other hand, works perfectly for both the adder and the subtractor modes. It also works if we clamp the least significant bits, but not if we clamp the most significant bits. This seems reasonable since we expect to be able to control a flow by making changes upstream (lsb), but not downstream (msb).
Partial directivity: So far in our examples we have only considered fully directed (Jij = 2J0, Jji = 0) or fully bidirectional (Jij = J0, Jji = J0) carry bits when connecting the individual Full Adders. In Fig. 15 we systematically analyze the effects of partial directivity on the operation of a 32-bit adder. We observe that the 32-bit adder operates correctly even with a large degree of bidirectionality (Jji = 0.75 × Jij), provided that the system is allowed to run for a long time, T = 50000, in stark contrast with the fully directed case, which could resolve the right answer within T = 100, as shown in Fig. 14b. Decreasing the number of time steps systematically increases the error. Increasing the correlation parameter while keeping T constant also seems to adversely affect the bidirectional designs, possibly because the system gets stuck in local minima.
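One way to interpolate between the two limiting cases above is to fix the total coupling Jij + Jji = 2J0 and vary the ratio r = Jji/Jij. The following sketch shows that parametrization; the function name and the restriction of r to the interval [0, 1] are assumptions made for illustration.

```python
def carry_couplings(j0, r):
    """Split a fixed total coupling 2*j0 between forward and backward links.

    r = J_ji / J_ij is the degree of bidirectionality (0 = fully directed,
    1 = fully bidirectional). Returns (J_ij, J_ji) with J_ij + J_ji = 2*j0.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    j_ij = 2.0 * j0 / (1.0 + r)
    return j_ij, r * j_ij


for r in (0.0, 0.25, 0.5, 0.75, 1.0):
    j_ij, j_ji = carry_couplings(1.0, r)
    print(f"r = {r:4.2f}  J_ij = {j_ij:.4f}  J_ji = {j_ji:.4f}")
```

With r = 0 this gives (2J0, 0) and with r = 1 it gives (J0, J0), matching the fully directed and fully bidirectional cases quoted in the text.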
The components corresponding to λ = 0 decay instantaneously, while the eigenvector corresponding to λ = 1 is the stationary result representing the correct solution. But for the system to reach this state, we have to wait for the fourth eigenvector, corresponding to λ4, to decay sufficiently. A fully directed network has J21 = 0, so that λ4 = 0 and the system quickly reaches the correct solution. But in a bidirectional network with J12 = J21, the fourth eigenvalue can be quite close to one, especially for large I0, and take an exponentially long time to decay, since λ^T = exp(T ln λ) ≈ exp(−T(1 − λ)) when λ is close to 1.
This 2 p-bit model provides some insight into our general observation that directivity can be used to obtain accurate answers quickly. However, depending on the problem at hand it may be desirable to retain some degree of bidirectionality, since full directivity does lead to some loss of invertibility as seen for one set of inputs in Fig. 14. An example of a partially directed p-bit network is discussed in the next section.
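The eigenvalue argument above can be reproduced numerically in a simplified setting. The sketch below is an assumption-laden stand-in for the 2 p-bit model, not a transcription of it: it takes two unbiased p-bits with inputs I1 = I0 J12 m2 and I2 = I0 J21 m1, updates bit 1 and then bit 2 in each sweep, and assumes P(m = +1) = (1 + tanh I)/2. Because the outcome of a sweep then depends on the old state only through m2, the four-state chain has two zero eigenvalues, and the remaining two are those of the reduced two-state chain on m2 built here. All names are hypothetical.

```python
import math


def p_up(x):
    """Probability that a p-bit outputs +1 for dimensionless input x."""
    return 0.5 * (1.0 + math.tanh(x))


def slow_eigenvalue(i0, j12, j21):
    """Non-trivial eigenvalue of one sweep (update bit 1, then bit 2).

    K[(new, old)] is P(m2' = new | m2 = old) for the reduced chain on m2.
    A 2x2 stochastic matrix has eigenvalues 1 and (trace - 1).
    """
    vals = (-1, 1)
    K = {}
    for old in vals:
        for new in vals:
            total = 0.0
            for m1 in vals:  # intermediate value taken by bit 1
                up1 = p_up(i0 * j12 * old)
                up2 = p_up(i0 * j21 * m1)
                p1 = up1 if m1 == 1 else 1.0 - up1
                p2 = up2 if new == 1 else 1.0 - up2
                total += p1 * p2
            K[(new, old)] = total
    return K[(1, 1)] + K[(-1, -1)] - 1.0


def sweeps_to_decay(lam, eps=1e-3):
    """Number of sweeps T needed for lam**T to drop below eps."""
    if lam <= 1e-12:
        return 0
    if lam >= 1.0:
        return math.inf
    return math.ceil(math.log(eps) / math.log(lam))


for i0 in (1.0, 2.0, 5.0):
    directed = slow_eigenvalue(i0, j12=2.0, j21=0.0)
    bidirectional = slow_eigenvalue(i0, j12=1.0, j21=1.0)
    print(f"I0 = {i0}: directed lambda = {directed:.6f}, "
          f"bidirectional lambda = {bidirectional:.6f}, "
          f"sweeps to decay = {sweeps_to_decay(bidirectional)}")
```

In this simplified model the slow eigenvalue works out to tanh(I0 J12) tanh(I0 J21), which vanishes when either coupling is zero and approaches one for symmetric couplings at large I0, in line with the behavior described in the text.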
FIG. 14. Invertibility of 32-bit adder, directed vs bidirectional: An adder that provides the sum S of two 32-bit numbers A and B: S = A + B. The left panel shows the adder implemented with bidirectional carry bits, while the right panel shows one with carry bits directed from the least significant to the most significant bit. Four different modes are shown with (i) A and B clamped (Addition), (ii) S and A clamped (Subtraction), (iii) A, B and S for the 16 most significant bits (msb) clamped, and (iv) A, B and S for the 16 least significant bits (lsb) clamped. Note that the bidirectional implementation shows very large errors for all modes of operation. The directed implementation works perfectly for both the adder and the subtractor modes. It also works if we clamp the least significant bits, but not if we clamp the most significant bits. Correlation parameter I0 = 1, T = 100 steps for all trials. S, A, B are taken to be the mode (most frequent value) of the 100×1 array obtained at the end of each trial. Clamped inputs are random 32-bit words for each trial, for a total of 1000 trials.

FIG. 15. Error versus bidirectionality: The degree of bidirectionality Jji/Jij of the carry-out (j) to carry-in (i) link between the Full Adders is systematically varied while keeping the sum Jij + Jji constant. In each case the sum is obtained from the statistical mode (or majority vote) of T time samples over 50 trials. The y-axis shows the fraction of trials that yield the wrong result. Note that for large I0 and small T, error-free operation is obtained only if the bidirectionality is close to zero, similar to standard digital circuits. But with I0 = 1.5 and T = 50,000, error-free operation (at least for 50 trials) is obtained even with ≈ 75% bidirectionality.

4-Bit Multiplier / Factorizer
Fig. 16 shows how the invertibility of PSL logic blocks can be used to perform integer factorization using a multiplier in reverse. Normally, the factorization problem requires specific algorithms [59] to be performed in CMOS-like hardware; here we simply use a digital 4-bit multiplier working in reverse to achieve this operation. Specifically, with the output of the multiplier clamped to a given integer from 0 to 15, the input bits float to the correct factors. The interconnection strength I0 is increased suddenly from 0 to 2 at t = t0 (Fig. 16) and the input bits get locked to one of the possible solutions. For example, when the output is set to 9, both inputs float to 3. With the output set to 6, both inputs fluctuate between two values, 2 and 3. Note that factorizations like 9 = 9 × 1 do not show up, since encoding 9 in binary requires 4 bits (1001) and the input terminals only have 2 bits. We have checked other cases: factorizing 3 shows both 3 × 1 and 1 × 3, and factorizing zero shows all possible peaks, since there are many solutions such as 0 = 0 × 1, 0 × 2, 0 × 3 and so on.
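The set of input states that the inverted multiplier is expected to visit can be listed by brute force. The sketch below is an ordinary deterministic enumeration, not a p-bit network; it only shows which pairs of 2-bit inputs are consistent with a clamped 4-bit output. The function name is hypothetical.

```python
def factor_pairs(product, input_bits=2):
    """All ordered pairs (a, b) of input_bits-wide integers with a * b == product."""
    limit = 1 << input_bits
    return [(a, b) for a in range(limit) for b in range(limit) if a * b == product]


for n in (9, 6, 3, 0):
    print(n, factor_pairs(n))
# 9 [(3, 3)]
# 6 [(2, 3), (3, 2)]
# 3 [(1, 3), (3, 1)]
# 0 [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (2, 0), (3, 0)]
```

The single pair for 9, the two pairs for 6 and 3, and the seven pairs for 0 correspond to the peaks described in the text. Outputs such as 5 or 7 have no consistent pair with 2-bit inputs.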
We also kept the same directed connections between the Full Adders for the carry bits, making them a directed network of Boltzmann Machines, similar to the 32-bit adder. Moreover, we kept a directed connection from the Full Adders to the AND gates as shown in Fig. 16a, since the information needs to flow from the output to the input in the case of factorization. The input bits that go to multiple AND gates are "tied" to each other with a positive exchange value (J > 0), much like two spins interacting ferromagnetically; in PSL, however, we envision these interactions to be controlled purely electrically. In this example, we have observed that the system is sensitive to the relative strengths of the couplings within the AND gates and between the AND gates and the Full Adders, which can also depend on the chosen annealing profile.
FIG. 16. The reversibility of PSL allows integer factorization using a binary multiplication circuit built, following the principles of digital logic, from AND gates and Full Adders (FA) as shown in (a). The output nodes of a 4-bit multiplier are clamped to a given integer, and the system produces the only consistent factors of the product at the input terminals, probabilistically. The interaction parameter I0 is suddenly increased to a saturation value of 2, and held constant as shown. (b) The output terminal is clamped to 9 and is factored into 3 × 3. Note that 9 × 1 is not an achievable solution in this setup, since encoding 9 requires 4-bit inputs in binary, whereas the inputs are limited to 2 bits. (c) The output terminal is clamped to 6 and, after being correlated, the factors cross-oscillate between 2 and 3. In both cases the histogram is obtained by counting outputs for t > t_total/2 = 1.25 × 10^4 time steps, to collect statistics after the system is thermalized.
The design of factorizers of practical relevance is beyond the scope of this paper. Our main purpose has been to establish how the key feature of invertibility of p-bits can be creatively used for different circuits with unique functionalities. The demonstration of 4-bit factorization through reverse multiplication is similar to memcomputing [60] based on deterministic memristors. Note, however, that the building blocks and operating principles of stochastic p-bits and memcomputing [61] are very different and the only similarity noted here is the fact that both approaches treat the input and output terminals on an equal footing.

V. SUMMARY
It is generally believed that (1) probabilistic algorithms can tackle specific problems much more efficiently than classical algorithms [62], and that (2) probabilistic algorithms can run far more efficiently on a probabilistic computer than on a deterministic computer [62,63]. As such, it seems reasonable to expect that probabilistic computers based on robust room temperature p-bits could provide a practically useful solution to many challenging problems by rapidly sampling the phase space in hardware.
In this paper we have presented a framework for using probabilistic units or "p-bits" as a building block for a probabilistic spin logic (PSL) which is used to implement precise Boolean logic with an accuracy comparable to standard digital circuits, while exhibiting the unique property of invertibility that is unknown in deterministic circuits. Specifically we have:
• presented an implementation based on stochastic nanomagnets to illustrate the importance of three-terminal building blocks in the construction of large scale correlated networks of p-bits. We emphasize that this is just one possible implementation, by no means the only one (Section II),
• presented an algorithm for implementing Boolean gates as BMs with relatively sparse and quantized J-matrix elements, benchmarked their operation against the Boltzmann law, and established their capability to perform not just direct functions but also their inverse (Section III), and
• presented a 32-bit adder implemented as a hybrid BM that achieves digital accuracy over a broad combination of the interaction parameter I0, the directionality and the number of samples T. This striking accuracy is reminiscent of digital circuits, but it is achieved while preserving a certain degree of invertibility, which is absent in digital circuits. The accuracy is particularly surprising with high degrees of bidirectionality (J12 = 0.75 × J21), where the system is picking out the one correct answer out of 2^33 ≈ 8.6 billion possibilities. This may require a larger number of time samples, but these could be collected rapidly at GHz rates (Section IV).
We hope these findings will help emphasize a new direction for the field of spintronic and nanomagnetic logic by shifting the focus from stable high barrier magnets to stochastic, low barrier magnets, while inspiring a search for other possible physical implementations of p-bits.