Large-Scale Optical Neural Networks based on Photoelectric Multiplication

Recent success in deep neural networks has generated strong interest in hardware accelerators to improve speed and energy consumption. This paper presents a new type of photonic accelerator based on coherent detection that is scalable to large ($N \gtrsim 10^6$) networks and can be operated at high (GHz) speeds and very low (sub-aJ) energies per multiply-and-accumulate (MAC), using the massive spatial multiplexing enabled by standard free-space optical components. In contrast to previous approaches, both weights and inputs are optically encoded so that the network can be reprogrammed and trained on the fly. Simulations of the network using models for digit- and image-classification reveal a "standard quantum limit" for optical neural networks, set by photodetector shot noise. This bound, which can be as low as 50 zJ/MAC, suggests performance below the thermodynamic (Landauer) limit for digital irreversible computation is theoretically possible in this device. The proposed accelerator can implement both fully-connected and convolutional networks. We also present a scheme for back-propagation and training that can be performed in the same hardware. This architecture will enable a new class of ultra-low-energy processors for deep learning.

Here we encode both the inputs and weights in optical signals, allowing the weights to be changed on the fly at high speed. Synaptic connections (matrix-vector products) are realized by the quantum photoelectric multiplication process in the homodyne detectors. Our system is naturally adapted to free-space optics and can therefore take advantage of the massive spatial multiplexing possible in free-space systems [25,26] and the high pixel density of modern focal-plane arrays [27] to scale to far more neurons than can be supported in nanophotonics or electronic cross-bar arrays. The optical energy consumption is subject to a fundamental standard quantum limit (SQL) arising from the effects of shot noise in photodetectors, which lead to classification errors. Simulations based on MNIST neural networks empirically show the SQL can be as low as 50-100 zJ/MAC. Using realistic laser, modulator, and detector energies, performance at the sub-fJ/MAC level should be possible with present technology. The optical system can be used for both fully-connected and convolutional layers. Finally, backpropagation is straightforward to implement in our system, allowing both inference and training to be performed in the same optical device. Fig. 1 illustrates the device. A deep neural network is a sequence of K layers (Fig. 1(a)), each consisting of a matrix-vector multiplication x → A x (synaptic connections) and an element-wise nonlinearity x_i → f(x_i)

Figure 1: Coherent matrix multiplier. (a) A deep neural network, represented as a sequence of K layers, each consisting of a matrix-vector multiplication (grey) and an element-wise nonlinearity (red). (b) Implementation of a single layer. Matrix multiplication is performed by fanning out copies of x^(k), combining input and weight signals, and performing homodyne detection (inset) between each signal-weight pair (grey box). For details on the experimental implementation, see Supp. Sec. S1. The resulting electronic signals are sent through a nonlinear function (red box), serialized, and sent to the input of the next layer.
(activation function); thus the input into the (k+1)-th layer is related to the k-th layer input by:

x_i^(k+1) = f( Σ_j A_ij^(k) x_j^(k) )     (1)

For a given layer, let N and N′ be the number of input and output neurons, respectively. Input (output) data are encoded temporally as N (N′) pulses on a single channel as shown in Fig. 1(b). This encoding, reminiscent of the Coherent Ising Machine [28,29,30], contrasts with other approaches used for neural networks, which encode inputs in separate spatial channels [23,24,22]. As there are N N′ weights for an N′ × N fully-connected matrix, the weights enter on N′ separate channels, each carrying a single matrix row encoded in time. Input data is optically fanned out to all N′ channels, and each detector functions as a quantum photoelectric multiplier, calculating the homodyne product between the two signals (inset). As long as both signals are driven from the same coherent source and the path-length difference is less than the coherence length, the charge Q_i accumulated by homodyne receiver i is:

Q_i = (η e / ℏω) ∫ 2 Re[ E^(in)(t)* E_i^(wt)(t) ] dt ∝ Σ_j A_ij x_j     (2)

Here E^(in)(t) and E_i^(wt)(t) are the input and weight fields for receiver i, which are taken to be sequences of pulses with amplitudes proportional to x_j and A_ij, respectively (x_j, A_ij ∈ R). Thus each receiver performs a vector-vector product between x and a row A_i of the weight matrix; taken together, the N′ electronic outputs give the matrix-vector product A x. Fields are normalized so that power is given by P(t) = |E(t)|^2, and η is the detector efficiency. A serializer reads out these values one by one, applies the nonlinear function f(·) in the electrical domain, and outputs the result to a modulator to produce the next layer's inputs.
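As a minimal numerical sketch of this scheme (our own illustration, not the authors' code): each homodyne receiver accumulates the product of its weight pulse train with the fanned-out input pulse train, so the N′ integrated charges together form the matrix-vector product.

```python
import numpy as np

def homodyne_matvec(A, x):
    """Noiseless model of the coherent matrix-vector multiplier:
    receiver i integrates the homodyne product of the input pulse
    train x with its weight pulse train A[i, :], accumulating a
    charge proportional to sum_j A[i, j] * x[j] over the N pulses."""
    A = np.asarray(A, dtype=float)
    x = np.asarray(x, dtype=float)
    # One receiver per output neuron; the time integration is a dot product.
    return np.array([np.dot(row, x) for row in A])

A = np.array([[1.0, -2.0], [0.5, 4.0], [3.0, 0.0]])
x = np.array([2.0, 1.0])
y = homodyne_matvec(A, x)   # equals A @ x
```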
The homodyne detector in Fig. 1(b) (inset) combines the advantages of optics and electronics: it can process data encoded at extremely high speeds, limited only by the bandwidth of the beamsplitter (∼THz) and the (optical) bandwidth of the photodetectors (typically ∼100 nm, or ∼10 THz). The electrical bandwidth can be much slower, since only the integrated charge is measured. Finally, the present scheme avoids the need for the nonlinear optics that is a major stumbling block in all-optical logic [31]: since the output is electrical, the dot product Σ_j A_ij x_j can be computed at extremely low power (sub-fJ/MAC) using standard non-resonant components (photodiodes) that are CMOS-compatible and scalable to arrays of millions.
Coherent detection greatly simplifies the setup compared to alternative approaches. With a given set of weight inputs, the network in Fig. 1(b) requires N input pulses and N′ detectors to perform a matrix-vector operation with N N′ MACs, performing an operation that scales quadratically with size using only linear resources. This is in contrast to electrical approaches that require quadratic resources (N N′ floating-point operations total). The (optical) energy consumption of nanophotonic systems [23] also scales linearly for the same operation; however, the circuit is much more complex, requiring O(N N′) tunable phase shifters [32,33], which becomes very challenging to scale beyond several hundred channels and may be sensitive to the propagation of fabrication errors. The main caveat to our system is the need to generate the weights in the first place, which imposes an energy cost that does scale quadratically. However, in many cases (particularly in data centers) neural networks are run simultaneously over large batches of data, so with appropriate optical fan-out, the cost of the weights can be amortized over many clients. Put another way, running the neural network on data with batch size B, we are performing a matrix-matrix product Y_{N′×B} = A_{N′×N} X_{N×B}, so each weight pulse is used B times.

Deep Learning at the Standard Quantum Limit
As energy consumption is a primary concern in neuromorphic and computing hardware generally [14], an optical approach must outperform electronics by a large factor to justify the investment in a new technology. In addition, optical systems must show great potential for improvement, ideally by many orders of magnitude, to allow continued scaling beyond the physical limits of Moore's Law. Thus two questions are relevant: (1) the fundamental, physical limits to the energy consumption, and (2) the energy consumption of a practical, near-term device using existing technology.
The fundamental limit stems from quantum-limited noise. In an electrical signal, energy is quantized at a level E_el = h/τ_el, where τ_el ∼ 10^-10 s is the signal duration. Optical energy is quantized at a level E_opt = h/τ_opt, where τ_opt ≡ λ/c ∼ (2-5) × 10^-15 s, which is 10^4-10^5 times higher. As a result, E_opt ≫ kT ≫ E_el, and electrical signals can be treated in a classical limit governed by thermal noise, while optical signals operate in a zero-temperature quantum limit where vacuum fluctuations dominate. These fluctuations are read out on the photodetectors, where the photoelectric effect [34] produces a Poisson-distributed photocurrent [35,36]. While the photocurrents are subtracted in homodyne detection, the fluctuations add in quadrature, and Eq. (1) is replaced by (see Supp. Sec. S2 for derivation and assumptions):

x_i^(k+1) = f( Σ_j A_ij^(k) x_j^(k) + w_i^(k) ‖A^(k)‖ ‖x^(k)‖ / sqrt(N N′ n_mac) )     (3)

Here the w_i^(k) ∼ N(0, 1) are Gaussian random variables, ‖·‖ is the L2 norm, and n_mac is the number of photons per MAC, related to the total energy consumption of the layer by n_tot = N N′ n_mac.
The noise term in Eq. (3) scales as n_mac^(-1/2), and therefore the signal-to-noise ratio (SNR) of each layer will scale as SNR ∝ n_mac. Since noise adversely affects the network's performance, one expects that the energy minimum should correspond to the value of n_mac at which the noise becomes significant. To quantify this statement, we perform benchmark simulations using a collection of neural networks trained on the MNIST (digit recognition) dataset. While MNIST digit classification is a relatively easy task [13], the intuition developed here should generalize to more challenging problems. Data for two simple networks are shown in Fig. 2, both having a 3-layer, fully-connected topology (Fig. 2(a)). In the absence of noise, the networks classify images with high accuracy, as the example illustrates (Fig. 2(b)).
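The Monte Carlo noise model used in these benchmarks can be sketched in a few lines (a simplified stand-in for the paper's simulations, using the noise scaling of Eq. (3) with a ReLU activation; the layer sizes below are arbitrary):

```python
import numpy as np

def noisy_layer(A, x, n_mac, rng):
    """One fully-connected layer with quantum-limited homodyne readout.
    The matrix-vector product acquires additive Gaussian noise whose
    standard deviation scales as ||A|| ||x|| / sqrt(N * N' * n_mac),
    the shot-noise model of Eq. (3)."""
    n_out, n_in = A.shape
    signal = A @ x
    sigma = np.linalg.norm(A) * np.linalg.norm(x) / np.sqrt(n_in * n_out * n_mac)
    return np.maximum(signal + sigma * rng.standard_normal(n_out), 0.0)  # ReLU

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100)) / 10
x = rng.standard_normal(100)
out_hi = noisy_layer(A, x, n_mac=1e6, rng=rng)   # effectively noiseless
out_lo = noisy_layer(A, x, n_mac=0.1, rng=rng)   # shot-noise dominated
```

Sweeping n_mac between these extremes and measuring classification error reproduces the qualitative behavior of Fig. 2(c).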
As Fig. 2(c) shows, the error rate is a monotonically decreasing function of n_mac. The two asymptotic limits correspond to the noiseless case (n_mac → ∞, which returns the network's canonical accuracy) and the noise-dominated case (n_mac → 0, where the network is making a random guess). Of interest to us is the cutoff point, loosely defined as the lowest possible energy at which the network returns close to its canonical accuracy. This is around 0.5-1 aJ (5-10 photons) for the small network (inner layer size N = 100), and 50-100 zJ (0.5-1 photon) for the large network (inner layer size N = 1000). This bound stems from the standard quantum limit (SQL): the intrinsic uncertainty of quadrature measurements on coherent states [37], which is temperature- and device-independent. This should be viewed as an absolute lower bound for the energy consumption of neural networks of this type; although the use of squeezed light allows one to reach sensitivity below the SQL [38,39], this requires squeezing all inputs (including vacuum inputs), which will likely lead to a net increase in overall energy consumption.
The SQL is network-dependent, and not all layers contribute equally. For each MAC, we have SNR ∝ n_mac; however, the signal adds linearly while the errors add in quadrature. As a result, the larger network is more resilient to individual errors because each output is averaging over more neurons. Moreover, the solid curves in Fig. 2(c) are restricted to the case when n_mac is the same for all layers. The dashed lines show the error rate in a fictitious device where quantum-limited noise is only present in a particular layer. For the large network, a smaller n_mac can be tolerated in the second layer, suggesting that slightly better performance could be achieved by tuning the energy for each layer.
Quantum limits to computational energy efficiency in photonics are not unique to neural networks. In digital photonic circuits based on optical bistability [40], vacuum fluctuations lead to spontaneous switching events that limit memory lifetime and gate accuracy [41,42]. However, these effects require bistability at the attojoule scale [41,43], which is well out of the reach of integrated photonics (although recent developments are promising [44,45,46]). By contrast, neural networks are analog systems so the quantum fluctuations set a meaningful limit on efficiency even though no attojoule-scale optical nonlinearities are employed.

Energy Budget
Viewing the neural network as an analog system with quantum-limited performance shifts the paradigm for comparing neural networks. Fig. 3(a) shows the standard approach: a scatterplot comparing error rate with number of MACs, a rough proxy for time or energy consumption [12,13]. There is a tradeoff between size and accuracy, with larger networks requiring more operations but also giving better accuracy. In the SQL picture, each point becomes a curve because now we are free to vary the number of photons per MAC, and the energy bound is set by the total number of photons, not the number of MACs. Fig. 3(b) plots the error rate as a function of photon number for the networks above. While the general tradeoff between energy and accuracy is preserved, there are a number of counterintuitive results. For example, according to Fig. 3(a), networks 1 and 2 have similar performance but the first requires 8× more MACs, so under a conventional analysis, network 2 would always be preferred. However, Fig. 3(b) indicates that network 1 has better performance at all energy levels. This is because network 1 is less sensitive to shot noise due to averaging over many neurons, and therefore can be operated at lower energies, compensating for the increased neuron count. The same apparent paradox is seen with networks 3 and 4. This suggests that, in a quantum-limited scenario, reducing total energy may not be as simple as reducing the number of operations.
The total energy budget will depend on many other factors besides the SQL. Fig. 3(c) plots energy per MAC as a function of the average number of input neurons per layer, N, a rough "size" of the neural network. The SQL data are plotted for the eight networks in Fig. 3(a-b), and the corresponding dashed line is an empirical fit. Note that the SQL is an absolute lower bound: it assumes perfect detectors and counts only the input optical energy. In a realistic device, this curve is shifted up by a factor (η_d η_c η_s β_mod)^(-1), where η_d, η_c, and η_s are the detector, coupling, and source (laser) efficiencies and β_mod is the modulator launch efficiency [47]; these are all close enough to unity in integrated systems [26,48,49,50] that the total factor is ≲10.
Another key factor is the detector electronics. The homodyne signal from each neuron needs to be sent through a nonlinear function y i → f (y i ) and converted to the optical domain using a modulator ( Fig. 1(b)). The most obvious way to do this is to digitize the signal, perform the function f (·) in digital logic, serialize the outputs, convert back to analog, and send the analog signal into the modulator. ADCs in the few-pJ/sample regime are available [51] and simple arithmetic can be performed at the pJ scale [15,16,17]. A range of modulator designs support few-fJ/bit operation [52,53,54,55]. Thus a reasonable near-term estimate would be few-pJ/neuron; this figure is divided by the number of inputs per neuron to give the energy per MAC (solid green curve in Fig. 3(c)). A much more aggressive figure of 1 fJ/neuron is also plotted-approaching this bound may be possible with state-of-the-art modulators [52], but it is below the energy figures for ADCs and would therefore require well-designed analog electronics. At these energies, the SQL becomes relevant, but due to optical inefficiencies the SQL will likely be relevant at higher energies as well.
For context, the ∼1 pJ/MAC figure [15,16,17] for state-of-the-art ASICs is shown in Fig. 3(c). Energy consumption in non-reversible logic gates is bounded by the Landauer (thermodynamic) limit E op = kT log(2) ≈ 3 zJ [56]. While multiply-and-accumulate is technically a reversible operation, all realistic computers implement it using non-reversible binary gates, so Landauer's principle applies. A 32-bit multiplication [57,58] requires approximately 10 3 binary gates (see Supp. Sec. S3) and each bit operation consumes at least kT log(2), giving a limit E mac ≥ 3 aJ (dotted line in Fig. 3(c)). This is already higher than the SQL for the larger networks with N ≥ 100. The optical neural network can achieve sub-Landauer performance because the matrix product is performed through optical interference, which is reversible and not subject to the bound.
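The Landauer arithmetic above is simple enough to reproduce directly (a sanity check, assuming T = 300 K and the ~10^3-gate count quoted in the text):

```python
import math

k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # room temperature, K

E_op = k_B * T * math.log(2)    # Landauer limit per irreversible bit operation
gates_per_mac = 1000            # ~10^3 binary gates for a 32-bit multiply
E_mac = gates_per_mac * E_op    # thermodynamic bound per MAC

# E_op is ~3 zJ; E_mac is ~3 aJ, matching the dotted line in Fig. 3(c).
```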
A final consideration is the energy required to generate the weights. There is one weight pulse per MAC, so at the minimum this will be 1 fJ/MAC for the modulator, and may rise above 1 pJ/MAC once the driver electronics and memory access are included. However, once the optical signal is generated, it can be fanned out to many neural networks in parallel, reducing this cost by a factor of B, the batch size. Large batch sizes should enable this contribution to E mac to reach the few-femtojoule regime, and potentially much lower.

Training and Convolutions with Optical GEMM
As discussed previously, the optical unit in Fig. 1(b) performs a matrix-vector product, and running multiple units in parallel with the same set of weights performs a general matrix-matrix product (GEMM), a key function in the Basic Linear Algebra Subprograms (BLAS) [59]. Fig. 4(a) shows a schematic for an optical GEMM unit based on homodyne detection, inspired by the neural-network concept. The inputs are two matrices (M_1)_{m×k} and (M_2)_{n×k}, encoded into optical signals on the 1D red (blue) transmitter arrays and mapped with cylindrical lenses to rows (columns) of the 2D detector array. From the accumulated charge at each pixel, one can extract the matrix elements of the product (M_1 M_2^T)_{m×n}. This operation requires m·n·k MACs, and the total energy consumption (and energy per MAC) are:

E_tot = (mk + nk) E_in + (mn) E_out,   E_mac = E_tot / (mnk) = (1/n + 1/m) E_in + (1/k) E_out     (4)

where E_in and E_out are the transmitter and receiver energy requirements, per symbol, which include all optical energy plus electronic driving, serialization, DAC/ADC, etc. If all matrix dimensions (m, n, k) are large, significant energy savings per MAC are possible if E_in and E_out can be kept reasonably small.
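The I/O accounting above can be captured in a small helper (our illustration; it assumes the transmitters send (m + n)k symbols at E_in each and the detector array reads out m·n values at E_out each, as described in the text):

```python
def gemm_energy_per_mac(m, n, k, E_in, E_out):
    """Energy per MAC for the optical GEMM unit: total transmitter
    energy (m*k + n*k)*E_in plus receiver energy (m*n)*E_out,
    divided by the m*n*k MACs the product performs."""
    E_tot = (m * k + n * k) * E_in + (m * n) * E_out
    return E_tot / (m * n * k)

# Larger matrices amortize the I/O cost over many more MACs:
e_small = gemm_energy_per_mac(10, 10, 10, E_in=1e-12, E_out=1e-12)
e_large = gemm_energy_per_mac(1000, 1000, 1000, E_in=1e-12, E_out=1e-12)
```

With pJ-scale I/O, the 10×10×10 product costs 300 fJ/MAC while the 1000×1000×1000 product costs 3 fJ/MAC, illustrating the (1/n + 1/m) and 1/k scaling.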
We saw above that the optical system could be used for neural-network inference. When running a batch of B instances X = [x 1 . . . x B ], the output Y = [y 1 . . . y B ] can be computed through the matrix-matrix product Y = AX. In fully-connected layers, training and back-propagation also rely heavily on GEMM. The goal of training is to find the set of weights A (k) that minimize the loss function L, which characterizes the inaccuracy of the model. Training typically proceeds by gradient-based methods. Since the loss depends on the network output, we start at the final layer and work backward, a process called back-propagation [7,8].
At each layer, we compute the gradient (∇_A L)_ij = ∂L/∂A_ij from the quantity (∇_Y L)_ij = ∂L/∂Y_ij, and propagate the derivative back to the input (∇_X L)_ij = ∂L/∂X_ij (Fig. 4(c)). These derivatives are computed from the chain rule and can be written as matrix-matrix multiplications:

∇_A L = (∇_Y L) X^T,   ∇_X L = A^T (∇_Y L)     (5)

Once the derivative has been propagated to ∇_X^(k) L (for layer k), we use the chain rule to compute ∇_Y^(k-1) L = f′(Y^(k-1)) ⊙ ∇_X^(k) L (where ⊙ denotes the element-wise product and X^(k) = f(Y^(k-1))) and proceed to the previous layer. In this way, we sequentially compute the derivatives ∇_A^(k) L at each layer in the neural network.
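For a linear layer Y = AX, the two chain-rule products above are ordinary GEMMs; a NumPy sketch (the finite-difference check in the usage is our addition):

```python
import numpy as np

def backprop_layer(A, X, grad_Y):
    """Back-propagation through Y = A @ X as two matrix products:
    grad_A = (dL/dY) X^T  -- gradient with respect to the weights
    grad_X = A^T (dL/dY)  -- derivative propagated to the layer input
    """
    grad_A = grad_Y @ X.T
    grad_X = A.T @ grad_Y
    return grad_A, grad_X

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 5))
grad_Y = rng.standard_normal((4, 5))
grad_A, grad_X = backprop_layer(A, X, grad_Y)
```

Because both gradients are themselves GEMMs, training can reuse the same optical hardware as inference.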
In addition to fully-connected layers, it is also possible to run convolutional layers on the optical GEMM unit by employing a "patching" technique [60]. In a convolutional layer, the input x_{ij;k} is a W × H image with C channels. This is convolved to produce an output y_{ij;k} of dimension W′ × H′ with C′ channels [13]:

y_{ij;k} = Σ_{i′j′,l} K_{i′j′;kl} x_{(s_x i + i′)(s_y j + j′);l}     (6)

Here K_{i′j′;kl} is the convolution kernel, a 4-dimensional tensor of size K_x × K_y × C′ × C, and (s_x, s_y) are the strides of the convolution. Naïvely vectorizing Eq. (6) and running it as a fully-connected matrix-vector multiply is very inefficient because the resulting matrix is sparse and contains many redundant entries. Patching expresses the image as a matrix X of size K_x K_y C × W′H′, where each column corresponds to a vectorized K_x × K_y patch of the image (Fig. 4(d)). The elements of the kernel are rearranged to form a (dense) matrix K of size C′ × K_x K_y C. Eq. (6) can then be computed by taking the matrix-matrix product Y = KX, which has size C′ × W′H′. On virtually any microprocessor, GEMM is a highly optimized function with very regular patterns of memory access; the benefits of rewriting the convolution as a GEMM greatly outweigh the redundancy of data storage arising from overlapping patches [60]. The time required to rearrange the image as a patch matrix is typically very small compared to the time to compute the GEMM; therefore, by accelerating the GEMM, the optical matrix multiplier will significantly improve the speed and energy efficiency of convolutional layers. Note also that, since we are performing the convolution as a matrix-matrix (rather than matrix-vector) operation, it is possible to obtain energy savings even without running the neural network on large batches of data. Computing the convolution requires W′H′ K_x K_y C C′ MACs. Following Eq. (4), the energy per MAC (not including memory rearrangement for patching) is:

E_mac = E_in / c_in + E_out / c_out,   c_in = (1/C′ + 1/W′H′)^(-1),   c_out = K_x K_y C     (7)

The coefficients c_in and c_out govern the energy efficiency when we are limited by input/output energies (transmitter/receiver and associated electronics). Since reading a 32-bit register takes ∼pJ of energy [13], a reasonable lower bound for near-term systems is E_in, E_out ≳ pJ. Thus it is essential that c_in, c_out ≫ 1 for the energy performance of the optical system to beat an ASIC (∼pJ/MAC).
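As a concrete illustration of the patching technique, here is a minimal im2col sketch (our code: stride 1, no padding; the strided, padded case follows the same pattern):

```python
import numpy as np

def im2col(x, Kx, Ky):
    """Rearrange a C x W x H image into the patch matrix of size
    (Kx*Ky*C) x (W'*H'), stride 1, no padding: each column is one
    vectorized Kx x Ky patch taken across all C channels."""
    C, W, H = x.shape
    Wp, Hp = W - Kx + 1, H - Ky + 1
    cols = np.empty((Kx * Ky * C, Wp * Hp))
    idx = 0
    for i in range(Wp):
        for j in range(Hp):
            cols[:, idx] = x[:, i:i + Kx, j:j + Ky].ravel()
            idx += 1
    return cols

def conv_as_gemm(x, kernel):
    """Convolution computed as a single matrix-matrix product Y = K X."""
    Cp, C, Kx, Ky = kernel.shape              # C' output channels
    _, W, H = x.shape
    Wp, Hp = W - Kx + 1, H - Ky + 1
    K = kernel.reshape(Cp, C * Kx * Ky)       # dense C' x (Kx*Ky*C) matrix
    X = im2col(x, Kx, Ky)                     # patch matrix
    return (K @ X).reshape(Cp, Wp, Hp)
```

Both `reshape` and `ravel` use the same C-order flattening, so kernel and patch indices line up without an explicit permutation.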
As a benchmark problem, we consider AlexNet [1], the first convolutional neural network to perform competitively at the ImageNet Large-Scale Visual Recognition Challenge [9]. AlexNet consists of 5 convolutional (CONV) layers and 3 fully-connected (FC) layers and, consistent with deep neural networks generally, the majority of the energy consumption comes from the CONV layers [13]. Table 1 gives the layer dimensions and the values of c_in, c_out for the CONV layers in AlexNet [1]. The MAC-weighted averages for all layers are c_in > 100 and c_out > 1000. Thus, even under extremely conservative assumptions of E_in, E_out ∼ 100 pJ (comparable to DRAM read energies [13,14]), it is still possible to achieve sub-pJ/MAC performance.

Table 1: Layers in AlexNet [1]. Values of c_in, c_out are calculated from Eq. (7). Max-pooling layers after CONV1, CONV2, and CONV5 are used to reduce the image size, but the relative computational cost for these layers is negligible. The FC layers together account for 59M MACs.
More advanced technology, such as few-fJ optical interconnects [26], may significantly reduce E_in and E_out, and therefore the energy per MAC. However, the performance is still fundamentally limited by detector shot noise (e.g. Eq. (3) for FC layers). Supp. Sec. S2 extends the shot-noise analysis to the case of matrix-matrix products needed for the convolutional case. Using a pre-trained AlexNet model (see Methods for details), Fig. 5(b) shows the top-10 accuracy on the ImageNet validation set as a function of the number of photons per MAC, n_mac. Consistent with Fig. 2(c), there are two limits: n_mac ≪ 1 corresponds to the random-guess regime with 99% error rate (for top-10 accuracy with 1,000 classes), while n_mac ≫ 1 recovers the accuracy of the noiseless model. The dashed curves in Fig. 5(b) show the fictitious case where noise is present in only a single layer, while the solid green line corresponds to the case where all layers have noise and n_mac is the same for each layer. Not all layers contribute equally to the noise: CONV1 is the most sensitive, requiring n_mac ≳ 20, while the deeper layers (particularly the fully-connected layers) can tolerate much lower energies, n_mac ≲ 1. Since the SNR is related to the total power received, which scales as c_out n_mac for the convolutional layers (c_out pulses per detector), it is not surprising that the deeper layers, which have a larger c_out, are less sensitive to quantum noise. The SQL obtained for AlexNet (n_mac ≈ 20, or E_mac ≈ 3 aJ) is slightly larger than that from the MNIST networks in Fig. 2(c), but of the same order of magnitude, suggesting that the SQL is somewhat problem-dependent.

Discussion
This paper has presented a new architecture for optically accelerated deep learning that is scalable to large problems and can operate at high speeds with low energy consumption. Our approach takes advantage of the photoelectric effect, via the relation I ∝ |E|^2, to compute the required matrix products opto-electronically, obviating the need for the all-optical nonlinearity that has hobbled past approaches to optical computing [31]. Since the device can be constructed with free-space optical components, it can scale to much larger sizes than nanophotonic implementations [23], being ultimately limited by the size of the detector array (N ≳ 10^6).
A key advantage to this scheme is that the multiplication itself is performed passively by optical interference, so the main speed and energy costs are associated with routing data into and out of the device. For a matrix multiplication C m×n = A m×k B k×n , the input/output (I/O) energy scales as O(mk) + O(nk) + O(mn), while the number of MACs scales as O(mnk). For moderately large problems found in convolutional neural-network layers (m, n, k ≥ 100) with moderate I/O energies (∼pJ), performance in the ∼10 fJ/MAC range should be feasible, which is 2-3 orders of magnitude smaller than for state-of-the-art CMOS circuits [15,16,17]. Advances in optical interconnects [52,48,49] may reduce the I/O energies by large factors [26], translating to further improvements in energy per MAC.
The fundamental limits to a technology are important to its long-term scaling. For the optical neural network presented here, detector shot noise presents a standard quantum limit to neural network energy efficiency [37]. Because this limit is physics-based, it cannot be engineered away unless non-classical states of light are employed [38,39]. To study the SQL in neural networks, we performed Monte Carlo simulations on pre-trained models for MNIST digit recognition (fully-connected) and ImageNet image classification (convolutional). In both cases, network performance is a function of the number of photons used, which sets a lower bound on the energy per MAC. This bound is problem- and network-dependent, and for the problems tested in this paper, lies in the range 50 zJ-5 aJ/MAC. By contrast, the Landauer (thermodynamic) limit is 3 aJ/MAC (assuming 1,000 bit operations per MAC [57,58]); sub-Landauer performance is possible because the multiplication is performed through optical interference, which is reversible and not bounded by Landauer's principle.
Historically, the exponential growth in computing power has driven advances in machine learning by enabling the development of larger, deeper, and more complex models [11,13,16,17]. As Moore's Law runs out of steam, photonics may become necessary for continued growth in processing power-not only for interconnects [26], but also for logic. The architecture sketched in this paper promises significant short-term performance gains over state-of-the-art electronics, with a long-term potential, bounded by the standard quantum limit, of many orders of magnitude of improvement.

Methods
Neural-network performance was computed using Monte Carlo simulations. For fully-connected layers, Eq. (3) was used, while for convolutional layers, the convolution was performed by first forming the patch matrix ( Fig. 4(d)) and performing the matrix-matrix multiplication (noise model discussed in Supp. Sec. S2). The weights for the fully-connected MNIST neural networks were trained on a GPU using TensorFlow. A pretrained TensorFlow version of AlexNet (available online at Ref. [61]) was modified to implement the quantum noise model and used for ImageNet classification. Simulations were performed on an NVIDIA Tesla K40 GPU.
[Supplementary] Large-Scale Optical Neural Networks based on Photoelectric Multiplication

S1 Homodyne Product Implementation Details

Using optical homodyne detection (Fig. S1(a)), it is possible to obtain a signal proportional to the product of electric field amplitudes originating from two coherent, in-phase optical sources at different spatial locations. The input beams of electric field amplitudes x_1 and A_11 travel through a 50:50 beamsplitter, interfere, and outputs I_+ and I_- are detected, where I_+ = (1/2)|x_1 + A_11|^2 and I_- = (1/2)|x_1 - A_11|^2. The difference of the photocurrents, I_+ - I_-, is proportional to the real part of the product of the incident electric field amplitudes, 2 Re[A_11^* x_1]. If all field amplitudes are real, this returns the product 2 A_11 x_1. Alternatively, to reduce system complexity and minimize data-transfer requirements, we can instead opt to use a single detector to perform homodyne detection (Fig. S1(b)). In this case, two copies of the signal x_1 and weight A_11 are sent into the detector, and a phase modulator applies a π phase shift to the second copy of x_1, flipping its amplitude. The photocurrents I_+ and I_- now appear separated in time rather than space, and are read out separately and subsequently subtracted. This technique can be applied over many channels and time steps, such that the full matrix-vector product y = A x is computed (Fig. S1(c)). In this technique, half of the light is discarded and twice as many pulses are required, doubling the energy consumption and latency. However, since the same detector and beamsplitter are used for I_+ and I_-, one avoids the technical issues associated with beamsplitter imbalance and detector inhomogeneities.
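The balanced-detection identity used here, I_+ - I_- = 2 A_11 x_1 for real amplitudes, can be checked numerically (illustrative code, not part of the original):

```python
import numpy as np

def homodyne_product(x, a):
    """Balanced homodyne detection of two real field amplitudes:
    the beamsplitter outputs carry intensities I± = |x ± a|^2 / 2,
    and the photocurrent difference I+ - I- equals 2*a*x."""
    I_plus = 0.5 * np.abs(x + a) ** 2
    I_minus = 0.5 * np.abs(x - a) ** 2
    return I_plus - I_minus

# Single-detector variant (Fig. S1(b)): the same detector measures
# I+ and I- in successive time slots, with a pi phase shift mapping
# x -> -x, and the two reads are subtracted electronically.
signal, weight = 0.7, -1.3
product = homodyne_product(signal, weight)   # equals 2 * weight * signal
```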
In practice, optical fan-out of x and encoding of the weight matrix A by intensity and phase modulation can be performed on-chip. Optical phased arrays containing as many as 1024 grating antennas for outcoupling have been demonstrated on a sub-cm^2 CMOS chip, including 1192 phase and 168 amplitude modulators [62]. The signals can be interfered and imaged onto the detector array using bulk optics. An example experimental setup is illustrated in Fig. S2.

Figure S2: Experimental implementation of homodyne detection for the optical neural network. x is fanned out and the weight matrix A is generated on-chip. Outcoupling is achieved with grating antennas or nanoantennas, and the beams are focused onto separate detector pixels. Two example beams are drawn.

S2.1 Matrix-Vector Multiply
In the time-encoded neural network, at each layer, a stream of data x̄_i is broadcast to the neurons of the subsequent layer. Each neuron is a homodyne detector that interferes this broadcast signal against the weight signal Ā_ij. Assume perfect spatial and temporal mode-matching between input and weight signals, and normalize the quantities so that |x̄_i|^2 and |Ā_ij|^2 correspond to the number of photons per pulse. If a pulse with amplitude ū enters the detector, the output photocurrent is Poisson distributed:

n ∼ Poisson(|ū|^2)     (S1)

This has a mean of |ū|^2 and a standard deviation of |ū|. The homodyne signal at neuron i is obtained by interfering the signals x̄_j and Ā_ij on a 50/50 beamsplitter and taking the difference between the photocurrents. Each output port receives (1/2)|x̄_j ± Ā_ij|^2 photons per pulse, and the integrated photocurrent is the sum of many Poisson random variables, which is itself a Poisson process. In the useful limit of many photons per neuron (though not necessarily per MAC), this is approximately Gaussian:

I_± = Σ_j (1/2)|x̄_j ± Ā_ij|^2 + (1/√2) ‖x̄ ± Ā_i‖ w_i^(±)     (S2)

where w_i^(±) ∼ N(0, 1) is a (normal) random variable. This term is responsible for the detector shot noise. The homodyne detector measures the difference between the photocurrents. The means subtract while the noises add in quadrature:

Q_i = I_+ - I_- = 2 Σ_j Ā_ij x̄_j + sqrt(‖x̄‖^2 + ‖Ā_i‖^2) w_i     (S3)

In (S3), we have used the L2 vector norms for x̄ and the row-vectors Ā_i to simplify the notation, as follows: ‖u‖^2 = Σ_j |u_j|^2. As before, w_i^(±) and w_i are normally distributed as N(0, 1). This output is sent through a nonlinear function (here a ReLU) to obtain the inputs to the next layer:

x̄′_i = ReLU(α Q_i)     (S4)

Here x̄, x̄′ and Ā are the physical inputs/outputs and weights, normalized to the single-photon level (so that |x̄_i|^2 is the photon number in pulse x̄_i, etc.). These are related to the logical variables by scaling constants:

x̄ = ξ_x x,   Ā = ξ_A A,   x̄′ = ξ_x x′     (S5)

Set α = 1/(2ξ_A) to make the classical term match the desired relation x′_i = ReLU(Σ_j A_ij x_j). Because the ReLU is scale-free (ReLU(cx) = c ReLU(x) for c > 0), the relation between feature vectors at subsequent layers is:

x′_i = ReLU( Σ_j A_ij x_j + (w_i / (2 ξ_x ξ_A)) sqrt(ξ_x^2 ‖x‖^2 + ξ_A^2 ‖A_i‖^2) )     (S6)
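The Poisson statistics described here are easy to verify with a quick Monte Carlo (our sketch; amplitudes and shot counts are arbitrary): interfering real amplitudes x and a on a 50/50 beamsplitter gives ports with mean photon numbers (x ± a)^2/2, so the difference signal has mean 2ax and variance x^2 + a^2.

```python
import numpy as np

rng = np.random.default_rng(0)

def detected_charge(u, shots):
    """Photoelectron count for a pulse of amplitude u: Poisson with
    mean |u|^2, hence standard deviation |u|."""
    return rng.poisson(np.abs(u) ** 2, size=shots)

x, a = 6.0, 4.0            # many photons per pulse: Gaussian regime
shots = 200_000
I_plus = detected_charge((x + a) / np.sqrt(2), shots)   # beamsplitter port +
I_minus = detected_charge((x - a) / np.sqrt(2), shots)  # beamsplitter port -
Q = I_plus - I_minus       # homodyne difference signal

# Empirically: Q.mean() ~ 2*x*a = 48, Q.var() ~ x**2 + a**2 = 52,
# i.e. the means subtract while the shot noises add in quadrature.
```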
The quantum noise enters inversely in $\xi_x$ and $\xi_A$, as expected: these quantities set the photon numbers through the scaling of $(\bar{x}, \bar{x}', \bar{A})$, and more photons mean less relative shot noise.
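The shot-noise model above can be checked numerically. The following sketch (illustrative only; the vector length, trial count, and the scalings $\xi_x = \xi_A = 4$ are assumed values) draws the two Poisson photocurrents and verifies that their difference has the classical mean and the quadrature-summed shot noise of Eq. (S3):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 256                            # neurons in the previous layer (assumed)
x = rng.standard_normal(N)         # logical input vector
a = rng.standard_normal(N)         # one row A_i of the weight matrix

xi_x = xi_a = 4.0                  # photon-number scalings (assumed)
xbar, abar = xi_x * x, xi_a * a    # physical amplitudes, photon units

def homodyne_counts(xbar, abar, rng, trials=20000):
    """Difference photocurrent of the 50/50-beamsplitter homodyne detector.

    Each port photodetects a Poisson photon number whose mean is the
    pulse energy |(xbar_j ± abar_j)|^2 / 2, summed over the pulse train.
    """
    mean_plus = 0.5 * np.sum((xbar + abar) ** 2)
    mean_minus = 0.5 * np.sum((xbar - abar) ** 2)
    n_plus = rng.poisson(mean_plus, size=trials)
    n_minus = rng.poisson(mean_minus, size=trials)
    return n_plus - n_minus

n = homodyne_counts(xbar, abar, rng)
print(n.mean(), 2 * np.dot(abar, xbar))   # signal: mean close to 2 Σ_j Ābar_ij x̄_j
print(n.std(), np.hypot(np.linalg.norm(xbar), np.linalg.norm(abar)))  # shot noise
```

The empirical mean and standard deviation match the classical term and the shot-noise amplitude to within sampling error.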
The total number of photons involved in this layer is the sum of the input photons and weight photons:

$$n_{\mathrm{tot}} = n_{\mathrm{in}} + n_{\mathrm{wt}} = \xi_x^2 \|x\|^2 + \xi_A^2 \|A\|^2$$   (S8)

where $\|A\|$ is the Frobenius norm; equivalently, $\xi_x = \sqrt{n_{\mathrm{in}}}/\|x\|$ and $\xi_A = \sqrt{n_{\mathrm{wt}}}/\|A\|$. It is often appropriate to assume that the row vectors $A_i$ have approximately the same norm:

$$\|A_i\| \approx \|A\|/\sqrt{N}$$   (S9)

In this case, we can replace $\|A_i\| \to \|A\|/\sqrt{N}$. Making this approximation and substituting $\xi_x$, $\xi_A$ from Eq. (S8), we find:

$$x'_i = \mathrm{ReLU}\biggl(A_{ij} x_j + \frac{\|A\|\,\|x\|}{2}\sqrt{\frac{1}{n_{\mathrm{wt}}} + \frac{1}{N n_{\mathrm{in}}}}\, w_i\biggr)$$   (S10)

For a fixed energy per MAC, $n_{\mathrm{mac}} = n_{\mathrm{tot}}/N^2$ for an $N \times N$ layer, the noise term in (S10) is minimized at the split $n_{\mathrm{in}} = n_{\mathrm{wt}}/\sqrt{N}$, where it becomes $(1 + 1/\sqrt{N})\,\|A\|\,\|x\|/(2N\sqrt{n_{\mathrm{mac}}})$. At fixed $n_{\mathrm{mac}}$ the noise therefore falls off as $1/N$: larger networks are more resilient to shot noise, which is why the standard quantum limit decreases with network size.

S2.2 Matrix-Matrix Multiply

Fig. 4(a) of the main text introduced a modified scheme that multiplies two matrices rather than a matrix and a vector. This is advantageous when running the network on batches of data. It is also well adapted to computing convolutions through the patching method (see main text). To compute the matrix-matrix product $C_{m\times n} = A_{m\times k} B_{k\times n}$, pixel $(i,j)$ receives the homodyne product of two pulse trains, encoding $A_{i,:}$ (the $i$th row of $A$) and $B_{:,j}$ (the $j$th column of $B$). The result is $C_{ij} = \sum_l A_{il} B_{lj}$, assuming all matrices are real.
As in Sec. S2.1, we normalize all quantities $\bar{A}_{ij}$, $\bar{B}_{ij}$, $\bar{C}_{ij}$ to photon number. Following Eq. (S3), the measured value of $\bar{C}_{ij}$ is the photocurrent at detector $(i,j)$ (sum over $l$ implicit):

$$\bar{C}_{ij} = 2\bar{A}_{il}\bar{B}_{lj} + \sqrt{\|\bar{A}_{i,:}\|^2 + \|\bar{B}_{:,j}\|^2}\, w_{ij}$$   (S11)

The physical quantities are related to their logical values by scaling factors: $\bar{A} = \xi_A A$, $\bar{B} = \xi_B B$, $\bar{C} = \xi_C C$. Scaling to logical quantities and setting $\xi_C = 2\xi_A\xi_B$ (to scale $C$ so that $C = AB$ in the classical limit), we obtain:

$$C_{ij} = A_{il}B_{lj} + \frac{1}{2\xi_A\xi_B}\sqrt{\xi_A^2\|A_{i,:}\|^2 + \xi_B^2\|B_{:,j}\|^2}\, w_{ij}$$   (S12)

The optical energy per MAC is $n_{\mathrm{mac}} = n_{\mathrm{tot}}/(mkn)$, where $n_{\mathrm{tot}} = n\,\xi_A^2\|A\|^2 + m\,\xi_B^2\|B\|^2$ accounts for each row of $A$ being fanned out to the $n$ pixels of its row and each column of $B$ to the $m$ pixels of its column. Also, assuming that the rows of $A$ (and columns of $B$) have roughly the same norm, we can replace $\|A_{i,:}\| \to \|A\|/\sqrt{m}$ and $\|B_{:,j}\| \to \|B\|/\sqrt{n}$. Eq. (S12) then simplifies to:

$$C_{ij} = A_{il}B_{lj} + \frac{\|A\|\,\|B\|}{2}\sqrt{\frac{1}{m\,n_B} + \frac{1}{n\,n_A}}\, w_{ij}$$   (S13)

where $n_A = \xi_A^2\|A\|^2$ and $n_B = \xi_B^2\|B\|^2$ are the photon numbers used to encode $A$ and $B$ once each.
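The per-pixel noise model for the detector array can be sketched in a few lines (hypothetical dimensions and photon scalings; the Gaussian approximation replaces the underlying Poisson statistics, as in Sec. S2.1):

```python
import numpy as np

rng = np.random.default_rng(1)

m, k, n = 64, 128, 32              # C is m×n, with k MACs per pixel (assumed)
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
xi_A = xi_B = 8.0                  # photon-number scalings (assumed)

# Per-pixel Gaussian shot noise: pixel (i, j) interferes row A_{i,:}
# against column B_{:,j}, so its noise is set by those photon numbers.
row_norm = np.linalg.norm(xi_A * A, axis=1)       # ||Abar_{i,:}||
col_norm = np.linalg.norm(xi_B * B, axis=0)       # ||Bbar_{:,j}||
sigma = np.sqrt(row_norm[:, None]**2 + col_norm[None, :]**2) / (2 * xi_A * xi_B)
C = A @ B + sigma * rng.standard_normal((m, n))

err = np.linalg.norm(C - A @ B) / np.linalg.norm(A @ B)
print(err)   # relative error; shrinks as xi_A, xi_B (photons per pulse) grow
```

Doubling $\xi_A$ and $\xi_B$ (i.e., quadrupling the photon budget) halves the relative error, consistent with the shot-noise scaling above.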

S3 Landauer Limit Dependence on Architecture and Bit Precision
The Landauer limit [56] sets a thermodynamic lower bound of $kT\log(2)$ for the energy of an irreversible logic gate with a single-bit output. For a more complex operation such as multiplication, the limit is $(kT\log 2) \times G$, where $G$ is the gate count. The estimate of $G = 10^3$ for a MAC, used in the main text, is roughly accurate for 32-bit precision. However, the trend in deep learning has been toward lower-precision arithmetic, which is faster and more energy-efficient [17]. Table S1 lists the gate counts and transistor counts of commonly used integer multipliers [57,58,63,64,65] at all common bit precisions.
Gate count is an algorithmic construct independent of the physical implementation: any digital computer based on irreversible gates satisfies the bound $E_{\mathrm{op}} \geq (kT\log 2) \times G$, regardless of how (or whether) the gates are implemented with CMOS transistors. Fig. S3 shows the Landauer limit for the multipliers listed in Table S1. Since a multiplier requires far more gates than an adder, it accounts for the majority of the gates in a MAC, and Fig. S3 therefore gives a rough lower bound on the Landauer limit for $E_{\mathrm{mac}}$.
Note that for all multiplier types, the gate and transistor counts scale quadratically with the number of bits. By moving from 32-bit to 8-bit arithmetic, the Landauer limit decreases by a factor of 16. For the most efficient multiplier under this metric (Wallace/Booth), it reaches just under 100 zJ/MAC ($10^{-19}$ J/MAC). This is significantly lower than the 3 aJ value quoted in the main text, but still slightly above the SQL obtained for the larger networks in Fig. 3(c) (main text), suggesting that even when compared against low-bit-precision digital architectures, the optical network can beat the Landauer limit.
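The arithmetic behind these numbers is easy to reproduce (a sketch assuming room temperature and the quadratic gate-count scaling above, normalized to the main text's estimate $G \approx 10^3$ at 32 bits; the actual gate counts in Table S1 vary somewhat by multiplier design):

```python
import math

k_B = 1.380649e-23                   # Boltzmann constant, J/K (exact SI value)
landauer = k_B * 300 * math.log(2)   # kT log 2 at T = 300 K, about 2.87 zJ

# Quadratic gate-count model: G ∝ bits^2, normalized to G = 1000 at 32 bits
# (an assumption matching the main text's MAC estimate, not Table S1 exactly).
for bits in (8, 16, 32):
    G = 1000 * (bits / 32) ** 2
    print(f"{bits:2d}-bit: G = {G:6.1f} gates, "
          f"Landauer bound = {G * landauer:.2e} J/MAC")
```

At 32 bits this reproduces the roughly 3 aJ/MAC figure from the main text; at 8 bits the quadratic scaling reduces it by the factor of 16 noted above, to the $10^{-19}$ J/MAC scale.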
In practice, digital designers use different multipliers depending on whether they are optimizing for speed, chip area, or energy consumption [65].

Table S1: Gate and transistor counts for commonly used integer multipliers. Wallace/Booth multipliers are very energy efficient in modern designs and are common. Braun multipliers are sometimes used for unsigned multiplication [64].