Universal compiling and (No-)Free-Lunch theorems for continuous variable quantum learning

Quantum compiling, where a parameterized quantum circuit is trained to learn a target unitary, is an important primitive for quantum computing that can be used as a subroutine to obtain optimal circuits or as a tomographic tool to study the dynamics of an experimental system. While much attention has been paid to quantum compiling on discrete variable hardware, less has been paid to compiling in the continuous variable paradigm. Here we motivate several closely related, short-depth continuous variable algorithms for quantum compilation. We analyse the trainability of our proposed cost functions and numerically demonstrate our algorithms by learning arbitrary Gaussian operations and Kerr non-linearities. We further make connections between this framework and quantum learning theory in the continuous variable setting by deriving No-Free-Lunch theorems. These generalization bounds demonstrate a linear resource reduction for learning Gaussian unitaries using entangled coherent-Fock states and an exponential resource reduction for learning arbitrary unitaries using Two-Mode-Squeezed states.


I. INTRODUCTION
Progress in experimental implementations of quantum optical neural networks [1][2][3] and extensions of quantum machine learning frameworks to the continuous-variable (CV) setting [4][5][6] indicate that quantum photonics is a viable platform for near-term quantum algorithms. Variational quantum algorithms, where a problem-specific cost function is evaluated on a quantum computer [7,8] while a classical optimiser trains a parameterized quantum circuit to minimise this cost, have been implemented in photonic systems. For instance, the variational quantum eigensolver [9] and variational quantum unsampling [10], i.e., partial characterization of a unitary operator, have both been implemented on integrated photonic processors. Beyond the fundamental physical advantages of photonic systems, such as a well-characterized set of loss channels and the possibility of room temperature operation, there are computational advantages to CV implementations of variational quantum algorithms such as the existence of efficient quantum error mitigation schemes [11][12][13].
An important computational task that CV quantum processors are well-suited to is the variational compilation [14][15][16] of CV unitaries. The task is to optimize a parameterized quantum circuit to learn a given target unitary. The target unitary could take the form of a known gate sequence that one seeks to compile into a shorter depth, or more noise resistant, circuit. Hence quantum compiling could be used as a subroutine to reduce the resources required to implement large scale quantum algorithms. Alternatively, the target unitary could be the unknown dynamics of a quantum system. In this case, quantum compilation plays a role analogous to, but potentially less resource intensive than, a quantum sensing protocol [17] or unitary process tomography [18,19]. Specifically, our CV compiling algorithms make use of Gaussian measurements and CV resources such as intensity and quadrature squeezing, and so do not require preparation of exotic optimal probe states as in an optimal quantum sensing protocol, nor a large number of measured observables as in process tomography. In this sense, CV compiling provides a new tool for experimental physics.
In this paper, we establish frameworks for CV variational quantum compiling that are valid for arbitrary CV target unitaries. In contrast to the variational compiling method explored in Ref. [15], we include entanglement-enhanced methods that can be used to learn an entire unitary rather than just its action on a low-lying subspace. We illustrate the wide applicability of our cost functions for CV quantum compiling by numerically demonstrating efficient learning of arbitrary single-mode Gaussian unitaries, the generalized beamsplitter operation, and Kerr non-linearities.
We further make connections between this framework and CV quantum learning theory by deriving "(No)-Free-Lunch" theorems. These analytic theorems specify the minimal training data required to learn CV unitary operators in increasingly general settings, providing fundamental bounds on the limits of quantum learning. In particular, the bounds highlight how utilizing entangled training states can reduce the amount of training data required to learn an unknown unitary and thus entanglement could be seen to provide a 'free lunch'. We further use these results as an alternative motivation for the cost functions we propose for quantum compiling.
This manuscript is structured as follows. Section II provides a background to quantum compiling, including a discussion of its possible uses and a summary of previously proposed methods for discrete variable quantum compilation. Section III presents our main results, including the cost functions we propose for CV quantum compiling and an analysis of their trainability. Section IV contains numerical implementations of our proposed CV learning algorithm. Section V presents our No-Free-Lunch theorems for CV learning. Section VI summarises and discusses our results.

A. Applications of CV quantum compiling
The goal of CV quantum compiling is to take a (possibly unknown) unitary U and return a gate sequence V , executable on a CV quantum computer, that has approximately the same action as U on any given input state (up to possibly a global phase factor). Here we describe three possible applications of this subroutine.
Optimal circuit design. Quantum compilation could be used to variationally compile CV gate sequences to form optimal subcircuits. By optimal, we primarily mean short depth. However, compilation might also be used to find circuits that naturally compensate for systematic gate errors or that are more generally resistant to noise. The construction of such optimal circuits may prove critical for the successful implementation of larger scale algorithms, including proposals for generating optimal bosonic states in protocols for quantum metrology [20] and entanglement extraction [21].
Experimental quantum physics. More generally, variational quantum compilation could be used to learn the unknown unitary dynamics of a physical system. In the context of an optical system, one might be interested in studying the optical properties of a new material as sketched in Fig. 1. For example, as discussed further in Section IV, one might use quantum compiling to estimate the Kerr effect in nonlinear optical media that cannot itself be directly implemented in a CV quantum circuit. In this manner, variational quantum compilation provides a new tool for experimental physics.
Structured learning. In discrete variable systems, variational quantum compilation has proven useful for learning the spectral decomposition of a unitary operation. This in turn opens up the possibility of simulating beyond the coherence time of a quantum processor [22][23][24]. Similarly, one could use discrete variable quantum compiling to learn block decompositions of a given unitary, which is useful for studying the entanglement properties of a system. It would be interesting to explore whether CV quantum compiling could similarly be used to study the spectral or entanglement properties of a given CV unitary, or for simulating the dynamics of CV quantum systems [25][26][27].

FIG. 1. Here we sketch an experimental circuit to learn the unitary U implemented by a novel optical material (shown in orange) by training a parameterised quantum circuit $V^\dagger(\boldsymbol{\theta})$ implemented on an optical quantum computer (shown in grey). For specific details on the proposed circuit, and the cost it calculates, see Section III.

B. Discrete Variable Quantum Compilation
Before presenting our algorithm for continuous variable quantum compilation, let us first review the discrete variable Quantum-Assisted Quantum Compilation (QAQC) algorithm of Ref. [14]. In QAQC a compilation is found by variationally searching for a gate sequence that minimizes the Hilbert-Schmidt Test cost. This cost, which quantifies how close the compilation is to exact, can be written in terms of the normalized Hilbert-Schmidt inner product between the target unitary U and possible compilation V as
$$C_{\rm HST}(V,U) := 1 - \frac{1}{d^2}\left|{\rm Tr}\!\left[V^\dagger U\right]\right|^2 . \quad (1)$$
This cost is faithful, vanishing if and only if U and V differ by a global phase factor, i.e., $V = e^{i\phi}U$ for some $\phi \in \mathbb{R}$. Therefore, by minimizing $C_{\rm HST}$, we learn a circuit V that implements the target U up to a global phase. The Hilbert-Schmidt Test cost may be computed by the two closely related circuits shown in Fig. 2(a) and Fig. 2(b). To see how, we first note that
$$\frac{1}{d^2}\left|{\rm Tr}\!\left[V^\dagger U\right]\right|^2 = \left|\langle\Phi|\!\left(V^\dagger U \otimes \mathbb{1}\right)\!|\Phi\rangle\right|^2 , \quad (2)$$
where $|\Phi\rangle_{AB}$ is the Bell entangled state of two qubit registers A and B of $n = \log_2 d$ qubits, i.e. $|\Phi\rangle_{AB} := \bigotimes_{j=1}^{n}|\Phi^+\rangle_{A_jB_j}$ with $|\Phi^+\rangle := \frac{1}{\sqrt{2}}\left(|00\rangle + |11\rangle\right)$. It thus follows that we can write
$$C_{\rm HST}(V,U) = 1 - \left|\langle\Phi|\!\left(V^\dagger U \otimes \mathbb{1}\right)\!|\Phi\rangle\right|^2 , \quad (3)$$
and $C_{\rm HST}$ can be computed using the circuit shown in Fig. 2(a). Due to the ricochet property of the state $|\Phi\rangle$, viz., $(X \otimes \mathbb{1})|\Phi^+\rangle = (\mathbb{1} \otimes X^T)|\Phi^+\rangle$ for any linear operator X, the Hilbert-Schmidt Test cost can alternatively be written as
$$C_{\rm HST}(V,U) = 1 - \left|\langle\Phi|\!\left(U \otimes V^*\right)\!|\Phi\rangle\right|^2 . \quad (4)$$
Thus $C_{\rm HST}(V,U)$ can also be computed with the target and ansatz unitaries applied in parallel, instead of in series, reducing the total circuit depth as shown in Fig. 2(b).
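As a concrete reference point, both the Hilbert-Schmidt Test cost and its Bell-state form can be checked numerically. The following sketch (assuming numpy/scipy; `hst_cost` is our own illustrative helper, not code from the paper) verifies the faithfulness of $C_{\rm HST}$ and the identity $|\langle\Phi|(U\otimes V^*)|\Phi\rangle| = |{\rm Tr}[V^\dagger U]|/d$:

```python
import numpy as np
from scipy.stats import unitary_group

def hst_cost(V, U):
    """Hilbert-Schmidt Test cost C_HST = 1 - |Tr(V^dag U)|^2 / d^2."""
    d = U.shape[0]
    return 1.0 - np.abs(np.trace(V.conj().T @ U))**2 / d**2

d = 8
U = unitary_group.rvs(d, random_state=0)
V = unitary_group.rvs(d, random_state=1)

# Faithfulness: the cost vanishes iff V = e^{i phi} U.
assert np.isclose(hst_cost(np.exp(1j * 0.7) * U, U), 0.0)

# Bell-state form: <Phi|(U x V*)|Phi> = Tr(V^dag U)/d.
phi_bell = np.eye(d).reshape(-1) / np.sqrt(d)   # |Phi> = (1/sqrt d) sum_i |ii>
overlap = phi_bell @ np.kron(U, V.conj()) @ phi_bell
assert np.isclose(np.abs(overlap), np.abs(np.trace(V.conj().T @ U)) / d)
```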
Finally, we note that the Hilbert-Schmidt Test cost can be related to the average gate fidelity between U and V. Specifically, it can be shown [28,29] that
$$C_{\rm HST}(V,U) = \frac{d+1}{d}\left(1 - \bar{F}(U,V)\right) , \quad (5)$$
where
$$\bar{F}(U,V) := \int d\psi\, \left|\langle\psi|V^\dagger U|\psi\rangle\right|^2 \quad (6)$$
is the average fidelity of states acted upon by V versus those acted upon by U, with the average being over all pure states according to the Haar measure. In theory, Eq. (5) provides a third way of measuring $C_{\rm HST}$: one could perform a Loschmidt echo test, as shown in Fig. 2(c), over an ensemble of Haar random input states.

III. CONTINUOUS VARIABLE QUANTUM COMPILING

A. Cost functions

Loschmidt Echo Two-Mode-Squeezed State cost. The central resource for our first CV cost function is the two-mode squeezed state (TMSS),
$$|\psi_{\rm TMSS}(r)\rangle := \frac{1}{\cosh r}\sum_{n=0}^{\infty}\tanh^n r\, |n\rangle|n\rangle , \quad (7)$$
where $\{|n\rangle\}_{n=0}^{\infty}$ is the Fock basis and r is a squeezing parameter. To highlight the connection between the TMSS and Bell states it is helpful to consider its truncated variant
$$|\psi^d_{\rm TMSS}(r)\rangle := \frac{1}{\sqrt{\mathcal{N}_d(r)}}\sum_{n=0}^{d-1}\tanh^n r\, |n\rangle|n\rangle , \qquad \mathcal{N}_d(r) := \sum_{n=0}^{d-1}\tanh^{2n} r , \quad (8)$$
which tends to the standard TMSS as the truncation dimension d tends to infinity, i.e. $\lim_{d\to\infty}|\psi^d_{\rm TMSS}(r)\rangle = |\psi_{\rm TMSS}(r)\rangle$. For finite d, the truncated TMSS tends to a maximally entangled Bell-like state as r tends to infinity, that is
$$\lim_{r\to\infty}|\psi^d_{\rm TMSS}(r)\rangle = \frac{1}{\sqrt{d}}\sum_{n=0}^{d-1}|n\rangle|n\rangle . \quad (9)$$
In this sense, the TMSS may be viewed as a CV generalization of the Bell state. More generally, TMSSs are highly entangled states which, by reducing the number of measurements necessary to attain a given signal-to-noise ratio, have proven to be an important resource in quantum metrology [30][31][32]. Moreover, TMSSs were numerically shown to be nearly optimal for measuring the fidelity of noisy CV quantum teleportation channels [33]. These examples suggest that TMSSs may also be valuable for the unitary channel discrimination task we consider here. This is confirmed in Section V, where we use the entanglement-enhanced No-Free-Lunch theorem of Ref. [34] to argue that training on a single TMSS minimizes the generalization error.
This motivates our first proposed cost function to train an m-mode hypothesis unitary V to match an m-mode target unitary U:
$$C_{\text{LE-TMSS}_r}(V,U) := 1 - \left|\langle\psi^m_{\rm TMSS}(r)|\!\left(V^\dagger U \otimes \mathbb{1}\right)\!|\psi^m_{\rm TMSS}(r)\rangle\right|^2 , \quad (10)$$
where $|\psi^m_{\rm TMSS}(r)\rangle := \bigotimes_{j=1}^{m}|\psi_{\rm TMSS}(r)\rangle_{A_jB_j}$ is an m-mode TMSS shared between registers A and B. This is the CV analogue of Eq. (3), obtained by using an m-mode TMSS instead of a Bell state. We call this cost, which is evidently faithful by construction, the Loschmidt Echo Two-Mode Squeezed State (LE-TMSS) cost since it measures the inner product between $(V \otimes \mathbb{1})|\psi^m_{\rm TMSS}(r)\rangle$ and $(U \otimes \mathbb{1})|\psi^m_{\rm TMSS}(r)\rangle$ using the Loschmidt echo circuit sketched in Fig. 2(d).
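To make the LE-TMSS cost concrete, one can simulate it directly in a truncated Fock space. The sketch below (numpy assumed; the cutoff of 20 and the helper names are our own illustrative choices) builds the renormalized truncated TMSS and evaluates $1 - |\langle\psi|(V^\dagger U \otimes \mathbb{1})|\psi\rangle|^2$ for a single pair of modes:

```python
import numpy as np

def tmss(r, cutoff):
    """Truncated two-mode squeezed vacuum, |psi> ∝ sum_n tanh(r)^n |n,n>,
    renormalized after truncation at `cutoff` Fock levels."""
    lam = np.tanh(r) ** np.arange(cutoff)
    lam /= np.linalg.norm(lam)
    psi = np.zeros((cutoff, cutoff))
    np.fill_diagonal(psi, lam)           # coefficient of |n>_A |m>_B
    return psi.reshape(-1)               # flatten to a (cutoff^2,) vector

def le_tmss_cost(V, U, r):
    """LE-TMSS cost 1 - |<psi|(V^dag U x 1)|psi>|^2 on one pair of modes."""
    cutoff = U.shape[0]
    psi = tmss(r, cutoff)
    W = np.kron(V.conj().T @ U, np.eye(cutoff))
    return 1.0 - np.abs(psi.conj() @ W @ psi) ** 2

cutoff = 20
phase = np.diag(np.exp(1j * 0.3 * np.arange(cutoff)))   # phase gate e^{i phi n}
assert np.isclose(le_tmss_cost(phase, phase, r=1.0), 0.0)   # faithful at V = U
```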
FIG. 2. In (c), $W_{\psi_j}$ is the unitary that prepares the state $|\psi_j\rangle$, i.e. $W_{\psi_j}|0\rangle = |\psi_j\rangle$; the probability of measuring the all-zero state on the n qubits is therefore equal to $|\langle\psi_j|UV^\dagger|\psi_j\rangle|^2$. $C_{\rm HST}$ could theoretically be estimated by running this circuit over a Haar random ensemble of training states, but for large problems this is exponentially inefficient. In (d) and (e), S(r) is the unitary single-mode squeeze operator and • is a 50:50 beamsplitter that entangles the squeezed registers. The probability of measuring the all-zero state on all 2m modes in (d) and (e) is equal to $1 - C_{\text{LE-TMSS}_r}$ and $1 - C_{\text{R-TMSS}_r}$ respectively. In (f), D(α) is the unitary single-mode displacement operator. The mean probability of obtaining the all-zero state on the m modes is equal to the average of the k values $|\langle\alpha_j|UV^\dagger|\alpha_j\rangle|^2$, which is used to estimate $C_{\rm ACS}$.

To understand the structure of the circuit that we propose to measure $C_{\text{LE-TMSS}_r}$, it is helpful to recall that
the TMSS can be written as
$$|\psi_{\rm TMSS}(r)\rangle_{A_jB_j} = e^{r\left(a_j^\dagger b_j^\dagger - a_j b_j\right)}|0,0\rangle_{A_jB_j} = B_{A_jB_j}\!\left[S_{A_j}(-r) \otimes S_{B_j}(r)\right]|0,0\rangle_{A_jB_j} , \quad (11)$$
where $a_j$ and $b_j$ are the annihilation operators on the $A_j$ and $B_j$ modes respectively, $S(r)$ is the single-mode squeezing operator, and $B_{A_jB_j}$ denotes a 50:50 beamsplitter [35]. It follows that an m-mode TMSS can be prepared across two m-mode registers A and B by first negatively squeezing the modes on register A and positively squeezing the modes on register B and then pairwise entangling the modes $A_j$ and $B_j$ (for j = 1 to j = m) using a network of 50:50 beamsplitters.
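This squeeze-and-beamsplit preparation can be sanity-checked at the level of Gaussian covariance matrices. In the sketch below (numpy; we adopt one common convention in which the vacuum covariance matrix is the identity, and one particular beamsplitter sign convention), oppositely squeezed vacua passed through a 50:50 beamsplitter reproduce the TMSS covariance matrix:

```python
import numpy as np

r = 0.8
a, b = np.exp(2 * r), np.exp(-2 * r)
# Mode A squeezed by -r, mode B by +r; CM ordering (x_A, p_A, x_B, p_B).
V_in = np.diag([a, b, b, a])
# Symplectic matrix of a 50:50 beamsplitter (one common sign convention).
B = np.array([[1, 0, -1, 0],
              [0, 1,  0, -1],
              [1, 0,  1, 0],
              [0, 1,  0, 1]]) / np.sqrt(2)
V_out = B @ V_in @ B.T
# Covariance matrix of the TMSS: cosh(2r) on the diagonal blocks,
# sinh(2r) * diag(1, -1) on the off-diagonal (x-x correlated, p-p anticorrelated).
Z = np.diag([1.0, -1.0])
V_tmss = np.block([[np.cosh(2 * r) * np.eye(2), np.sinh(2 * r) * Z],
                   [np.sinh(2 * r) * Z, np.cosh(2 * r) * np.eye(2)]])
assert np.allclose(V_out, V_tmss)
```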
As shown in Fig. 2(d), preparing a TMSS in this manner is the first step of the circuit to measure $C_{\text{LE-TMSS}_r}$. The second step is to apply the target unitary U and the inverse of the ansatz $V^\dagger$ to register A. The final step is to implement the inverse of the m-mode TMSS state preparation in order to measure the overlap with the m-mode TMSS. This is done by first inverting the beamsplitter network and then reversing the initial local squeezing. The inverse squeezing can be carried out either by a two-step process consisting of an active optical unitary followed by an on-off photodetection measurement, or in a one-step process by an ideal general-dyne measurement [36]. The probability to obtain the measurement outcome in which all 2m modes are in the $|0\rangle$ state, i.e. the vacuum state, is equal to
$$\Pr(0,\ldots,0) = \left|\langle\psi^m_{\rm TMSS}(r)|\!\left(V^\dagger U \otimes \mathbb{1}\right)\!|\psi^m_{\rm TMSS}(r)\rangle\right|^2 = 1 - C_{\text{LE-TMSS}_r}(V,U) .$$
Hence this circuit can be used to measure $C_{\text{LE-TMSS}_r}$ as claimed.
Ricocheted Two-Mode-Squeezed State cost. Unlike the Bell states utilized in discrete variable quantum compiling algorithms, the TMSS only satisfies an approximate ricochet property for finite r. That is, with $|\psi^m_{\rm TMSS}(r)\rangle$ defined as in (7),
$$\left(X \otimes \mathbb{1}\right)|\psi^m_{\rm TMSS}(r)\rangle \approx \left(\mathbb{1} \otimes X^T\right)|\psi^m_{\rm TMSS}(r)\rangle , \quad (12)$$
with the exact property only holding in the limit that $r \to \infty$ or for specially chosen U and V. Consequently, the Ricocheted version of the Two-Mode Squeezed State cost function, i.e.
$$C_{\text{R-TMSS}_r}(V,U) := 1 - \left|\langle\psi^m_{\rm TMSS}(r)|\!\left(U \otimes V^*\right)\!|\psi^m_{\rm TMSS}(r)\rangle\right|^2 , \quad (13)$$
is equal to (10) in the limit $r \to \infty$. The circuit for computing (13), which is shown in Fig. 2(e), is identical to the circuit used to measure $C_{\text{LE-TMSS}_r}$ but with the target and ansatz unitaries applied in parallel rather than in series. The difference between cost functions (10) and (13) depends on V. In Appendix A we show how the cost functions differ in expectation over finite rank V. The calculation shows that even if the size of V increases multiplicatively, it is sufficient to increase the squeezing parameter r additively in order to make the cost functions (10) and (13) approximately equal.
Although the $C_{\text{R-TMSS}_r}$ cost can be computed by a simple circuit, it has the drawback that its minimum need not be zero when optimizing V for a given U. Hence, when used for variational compiling, it will be hard to determine when to terminate the optimization loop. It is therefore helpful to define a normalized version of (13),
$$\tilde{C}_{\text{R-TMSS}_r}(V,U) := 1 - \frac{\left|\langle\psi^m_{\rm TMSS}(r)|\!\left(U \otimes V^*\right)\!|\psi^m_{\rm TMSS}(r)\rangle\right|^2}{N_r(U)\,N_r(V)} , \quad (14)$$
where the normalization terms, for X = U and X = V,
$$N_r(X) := \left|\langle\psi^m_{\rm TMSS}(r)|\!\left(X \otimes X^*\right)\!|\psi^m_{\rm TMSS}(r)\rangle\right| \quad (15)$$
can be calculated using the same circuit used to measure $C_{\text{R-TMSS}_r}$. As shown in Appendix B, this normalized cost $\tilde{C}_{\text{R-TMSS}_r}$ is faithful, vanishing if and only if U and V agree up to a global phase.
Given the need to evaluate the normalization terms, as well as the original cost term, this cost is slightly more resource intensive than the LE-TMSS cost. However, the reduction in circuit depth achieved by using the approximate ricochet property may compensate for this in experimental contexts where coherence lifetimes are short.
Averaged coherent states cost. It is not possible to define a cost directly analogous to Eq. (5) in a CV context, as there is no direct equivalent of the Haar measure for CV states owing to the infinite dimensionality of the CV Hilbert space. Instead, one can consider averaging over a family of states up to a specific energy bound. In Ref. [15] a cost is defined in this manner as an average over Fock states. However, large Fock states are hard to produce experimentally, and so we argue that a more natural choice, given the ease with which they can typically be produced in the laboratory, is coherent states. With this in mind, one could consider using the cost function
$$C_{\text{ACS}_E}(V,U) := 1 - \int d\mu(\alpha)\, \left|\langle\alpha|V^\dagger U|\alpha\rangle\right|^2 , \quad (16)$$
where $d\mu(\alpha)$ is a normalized measure on the set of m-mode coherent states $|\alpha\rangle$ with energy less than E. Each of the coherent state overlaps in Eq. (16) can be computed using local heterodyne measurements on the m modes. That Eq. (16) is faithful can be seen from the fact that if it takes the value 0, the modulus of the Q-symbol of the unitary operator $V^\dagger U$ is equal to 1 almost everywhere on the domain, from which it follows that $V^\dagger U = e^{i\phi}\mathbb{1}$ due to the overcompleteness of coherent states [37]. In practice, this cost, which we call the Averaged Coherent State (ACS) cost, can only be estimated by sampling k coherent states with energy less than E, i.e. using
$$\hat{C}_{\text{ACS}_E}(V,U) := 1 - \frac{1}{k}\sum_{j=1}^{k}\left|\langle\alpha_j|V^\dagger U|\alpha_j\rangle\right|^2 , \quad (17)$$
where $\|\alpha_j\|^2 < E$ for all j. In Section V we will use an NFL theorem for Gaussian operations to argue that k = 2m training states will suffice to learn any Gaussian unitary U. In Section IV we provide numerics which suggest that, to learn weakly non-Gaussian operations, in particular a small Kerr non-linearity, k = 2m training states is also sufficient. However, in general, to learn an arbitrary operation we expect that k will scale with E.
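For intuition, the ACS cost can be estimated by Monte Carlo whenever the overlap is known in closed form. For the simple case $V^\dagger U = e^{i\theta\hat{n}}$ one has $\langle\alpha|e^{i\theta\hat{n}}|\alpha\rangle = \exp\!\left[|\alpha|^2\left(e^{i\theta}-1\right)\right]$; the sketch below uses this (numpy; sampling the energy $|\alpha|^2$ uniformly in $[0, E)$ is one arbitrary choice for the measure $d\mu$):

```python
import numpy as np

rng = np.random.default_rng(0)

def acs_cost_phase(theta, E, k=2000):
    """Monte Carlo ACS cost for V^dag U = e^{i theta n}, using the closed form
    <alpha| e^{i theta n} |alpha> = exp(|alpha|^2 (e^{i theta} - 1))."""
    n_bar = rng.uniform(0.0, E, size=k)          # sampled energies |alpha_j|^2 < E
    overlap_sq = np.abs(np.exp(n_bar * (np.exp(1j * theta) - 1.0))) ** 2
    return 1.0 - overlap_sq.mean()

# Faithful at V = U (theta = 0); strictly positive for a mismatched phase.
assert np.isclose(acs_cost_phase(0.0, E=4.0), 0.0)
```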
Thus, in general, estimating C ACS E will be more resource intensive than the TMSS costs, C LE-TMSSr and C R-TMSSr , in the sense that it requires a larger number of cost evaluations. However, C ACS E does not require generating large highly-entangled Two-Mode-Squeezed states, and therefore may in some contexts be less experimentally demanding. In particular, we expect this cost to be most useful for learning (approximately) Gaussian operations where the number of training states required is reduced.

B. Trainability
For a variational quantum algorithm to run successfully, i.e. for it to be possible to minimize the cost and thereby find the optimum solution, the cost landscape must have sufficiently large gradients to allow for training. Recently, it has been shown that discrete variable VQAs can exhibit so-called 'barren plateaus', where under certain conditions the gradient of the cost function vanishes exponentially with the size of the system [38][39][40][41][42][43][44][45][46][47][48][49][50]. Preliminary results further indicate that continuous variable systems [51] may exhibit an analogous barren plateau phenomenon where the cost gradients vanish exponentially with the number of system modes. On such barren plateau landscapes, (potentially untenably) precise measurements are required to determine the direction of steepest descent and navigate to the minimum. Thus for any learning algorithm to be scalable to large problem sizes it is essential to use a cost that does not exhibit a barren plateau.
Even from basic examples, one can see that the cost functions for CV quantum compiling exhibit a barren plateau. To demonstrate this we will focus on the Loschmidt Echo TMSS cost, but analogous arguments follow for the ricocheted TMSS cost and the averaged coherent state cost. Consider using the Loschmidt Echo TMSS cost to compile the m-mode identity operation using an ansatz composed of a product of phase gates, i.e.
$$V(\boldsymbol{\phi}) := \bigotimes_{j=1}^{m} e^{i\phi_j \hat{n}_j} , \qquad \hat{n}_j := a_j^\dagger a_j . \quad (18)$$
Then the cost takes the form
$$C_{\text{LE-TMSS}_r}\!\left(V(\boldsymbol{\phi}), \mathbb{1}\right) = 1 - \prod_{j=1}^{m} g_r(\phi_j) , \qquad g_r(\phi) := \frac{(1-\tanh^2 r)^2}{1 + \tanh^4 r - 2\tanh^2 r\cos\phi} . \quad (19)$$
It follows (examining $\phi_1$ without loss of generality) that the variance of the cost gradient over the landscape is
$$\mathrm{Var}_{\boldsymbol{\phi}}\!\left[\frac{\partial C_{\text{LE-TMSS}_r}}{\partial\phi_1}\right] = \left\langle g_r'(\phi)^2\right\rangle_{\phi}\left\langle g_r(\phi)^2\right\rangle_{\phi}^{m-1} , \quad (20)$$
where $\langle\cdot\rangle_{\phi}$ denotes the uniform average over $[0, 2\pi)$. Since $\langle g_r(\phi)^2\rangle_{\phi} < 1$ for any fixed $r > 0$, this variance vanishes exponentially with the number of modes m. It therefore follows from Chebyshev's inequality that the probability that the cost gradient deviates from zero vanishes exponentially. Thus the landscape exhibits a barren plateau [51]. If r is allowed to vary with m, then taking a sublinear scaling of the total squeezing, e.g., local squeezing $r(m) = O(\ln m^{\alpha}/m)$ with $\alpha > 0$, causes (20) to vanish only polynomially.
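The exponential concentration can be seen numerically from the product structure of the cost in this phase-gate example. The sketch below (numpy; `g` is the single-pair overlap probability $(1-\tanh^2 r)^2/(1+\tanh^4 r - 2\tanh^2 r\cos\phi)$, and the sampling setup is our own illustration) samples the partial derivative with respect to $\phi_1$ over uniformly random parameters and shows its variance collapsing as the mode number m grows:

```python
import numpy as np

rng = np.random.default_rng(1)
t2 = np.tanh(1.0) ** 2   # tanh^2 r at r = 1

def g(phi):
    """Single-pair overlap probability |<psi_TMSS|(e^{i phi n} x 1)|psi_TMSS>|^2."""
    return (1 - t2) ** 2 / (1 + t2 ** 2 - 2 * t2 * np.cos(phi))

def grad_samples(m, n_samples=20000):
    """Samples of dC/dphi_1 for C = 1 - prod_j g(phi_j), with phi_j ~ U[0, 2pi)."""
    phi = rng.uniform(0, 2 * np.pi, size=(n_samples, m))
    # g'(phi_1), from differentiating g directly
    dg = -(1 - t2) ** 2 * 2 * t2 * np.sin(phi[:, 0]) \
         / (1 + t2 ** 2 - 2 * t2 * np.cos(phi[:, 0])) ** 2
    return -dg * np.prod(g(phi[:, 1:]), axis=1)

# The gradient variance shrinks rapidly with the number of modes m.
v2, v8 = np.var(grad_samples(2)), np.var(grad_samples(8))
assert v8 < v2 / 10
```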
For fixed r, the barren plateau phenomenon can be circumvented by using a local variant of our proposed costs. Analogously to the local version of the Hilbert-Schmidt Test introduced in Ref. [14], where pairs of qubits, rather than all 2n qubits, are measured to compute the local cost, our proposed local TMSS costs can be calculated from measurements on pairs of CV modes instead of all 2m CV modes. Specifically, as shown in Fig. 3(a), the local version of the Loschmidt Echo TMSS cost is defined as
$$C^{(L)}_{\text{LE-TMSS}_r}(V,U) := 1 - \frac{1}{m}\sum_{j=1}^{m}\Pr(00)_{A_jB_j} , \quad (21)$$
where $\Pr(00)_{A_jB_j}$ is the probability of observing the outcome 00 on the pair of modes $A_jB_j$ from registers A and B in Fig. 2(d). This cost function can be shown to be faithful using the same probability-theoretic argument used to prove the faithfulness of the local Hilbert-Schmidt Test cost in Ref. [14].
The local cost $C^{(L)}_{\text{LE-TMSS}_r}$ can be expressed as a sum of entanglement fidelities. To see how, first note that the local marginal states of the m-mode TMSS can be written as
$$\mathrm{Tr}_{B_j}\!\left[|\psi_{\rm TMSS}(r)\rangle\langle\psi_{\rm TMSS}(r)|_{A_jB_j}\right] = \rho_{\beta(r)} . \quad (22)$$
Here $\rho_{\beta(r)}$ is a thermal state at the inverse temperature $\beta(r) := -2\ln\tanh r$, and $C^{(L)}_{\text{LE-TMSS}_r}$ can be written as
$$C^{(L)}_{\text{LE-TMSS}_r}(V,U) = 1 - \frac{1}{m}\sum_{j=1}^{m} F_j , \quad (24)$$
where $F_j$ is the entanglement fidelity of the channel
$$\mathcal{E}_j(\rho) := \mathrm{Tr}_{\bar{A}_j}\!\left[V^\dagger U\left(\rho \otimes \rho_{\beta(r)}^{\otimes(m-1)}\right)U^\dagger V\right]$$
with respect to the TMSS, where $\bar{A}_j$ denotes all modes of register A other than $A_j$. That is,
$$F_j := \langle\psi_{\rm TMSS}(r)|\left(\mathcal{E}_j \otimes \mathrm{id}_{B_j}\right)\!\left[|\psi_{\rm TMSS}(r)\rangle\langle\psi_{\rm TMSS}(r)|\right]|\psi_{\rm TMSS}(r)\rangle . \quad (25)$$
Thus, not only is the local cost faithful, it also has a natural conceptual interpretation. Crucially, the local TMSS cost (24) appears not to exhibit a barren plateau. For example, for the problem of compiling the identity with multimode phase shifters considered at the beginning of this section, one obtains
$$\frac{\partial C^{(L)}_{\text{LE-TMSS}_r}}{\partial\phi_1} = -\frac{1}{m}\frac{\partial}{\partial\phi_1}\!\left[\frac{(1-\tanh^2 r)^2}{1 + \tanh^4 r - 2\tanh^2 r\cos\phi_1}\right] , \quad (27)$$
which, for fixed r and up to the explicit 1/m prefactor, is constant as $m \to \infty$; the local cost gradient thus vanishes only polynomially, rather than exponentially, with the number of modes.
We note that, for a fixed number of modes m, the costs $C_{\text{LE-TMSS}_r}$ and $C^{(L)}_{\text{LE-TMSS}_r}$ concentrate to 1 when the squeezing parameter r is large. It follows that the gradients of $C_{\text{LE-TMSS}_r}$ and $C^{(L)}_{\text{LE-TMSS}_r}$, as seen from Eq. (19) and Eq. (27), vanish exponentially with r. Consequently, training becomes exponentially more resource intensive for larger r. A similar exponential vanishing with respect to r was observed for approximations of CV energy-constrained channel fidelities that compare the actions of CV channels on two-mode squeezed states [33]. This vanishing gradient problem is conceptually different from the barren plateau phenomenon, which may be resolved using a local cost. In Section IV, we propose a practical resolution for this vanishing gradient problem.
Finally, we note that for the example of compiling the identity operation considered earlier, the averaged coherent state cost function $C_{\text{ACS}_E}$ in (16), and its approximation in (17), do not exhibit barren plateaus if the energy bound E is taken to depend on the mode number m in such a way that the maximal energy per mode E(m)/m grows sublinearly as a function of m (see Section 2 of Ref. [51]). However, we expect that more general compiling problems, such as Gaussian compiling or compiling of Kerr non-linearities, will exhibit barren plateaus. Such trainability issues could again be mitigated by defining a local version of $C_{\text{ACS}_E}$. A natural choice of local cost would be (analogously to (21)) to compute a spatial average of the probability of measuring the vacuum state on each of the modes at the end of the circuit in Fig. 2(f). More concretely, one could use
$$C^{(L)}_{\text{ACS}_E}(V,U) := 1 - \frac{1}{km}\sum_{j=1}^{k}\sum_{i=1}^{m}\Pr(0)^{(j)}_{A_i} , \quad (28)$$
where $\Pr(0)^{(j)}_{A_i}$ is the probability of measuring the vacuum state on mode $A_i$ when the input $|\alpha_j\rangle_A$ is a coherent state in the 2m-dimensional phase space, and we have used the discrete version of $C_{\text{ACS}_E}(V,U)$ in (17). Computation of one term in the double sum defining $C^{(L)}_{\text{ACS}_E}$ is shown in Fig. 3(c).

IV. NUMERICAL IMPLEMENTATIONS
Here we present results for implementing CV quantum compilation to learn commonly encountered CV operations. In particular, we focus on learning arbitrary single-mode Gaussian operations, Kerr non-linearities and a general beamsplitter operation. In each case, we performed continuous parameter optimization in order to minimize the TMSS cost function, Eq. (10). We focus on the Loschmidt Echo variant of the cost, but similar results are obtained for the Ricocheted variant. We note that it is unnecessary to use the local version of the cost here since, for these proof-of-principle implementations, we consider learning single- and two-mode unitaries for which the cost gradients are expected to be manageable even with a global cost.
Given the close connections between the TMSS cost and the HST cost for large r, and the operational meaning of C HST as a measure of the average fidelity between U and V , ideally we would use a large r value, i.e. large squeezing, to learn U . However, as discussed in Section III B, and as demonstrated numerically in Fig. 4, the landscape of the TMSS cost becomes overwhelmingly flat for large r, making it difficult to train. We therefore found it more effective to train initially using a small r value. Then once reasonably accurate pre-trained parameters have been obtained using a small r, we trained on a larger r to refine the quality of the solution.

Gaussian Operations. An arbitrary single-mode Gaussian operation is generated by a quadratic Hamiltonian,
$$U_{\rm Gaus}(\alpha, \beta, \phi) := \exp\!\left[i\left(\alpha a^\dagger + \alpha^* a + \beta (a^\dagger)^2 + \beta^* a^2 + \phi\, a^\dagger a\right)\right] , \quad (29)$$
where α and β are arbitrary complex numbers and φ is an arbitrary real number. We generated a random target Gaussian operation $U_{\rm targ} := U_{\rm Gaus}(\alpha_{\rm targ}, \beta_{\rm targ}, \phi_{\rm targ})$.

FIG. 4. Here ε is a noise parameter that determines the deviation of the ansatz parameters, θ, from the optimum parameters, $\theta_{\rm opt}$. Specifically, we set $\theta_k = \theta^{\rm opt}_k + \epsilon R$, where R is a random number between −1 and 1.
We then used C LE−TMSS to learn U targ using an ansatz of the same form. That is, using an ansatz of the form V anz = U Gaus (α, β, φ) where α, β and φ are parameters to be variationally learnt. Since U Gaus is readily factorizable into the product of displacement, squeezing and phase operations, this ansatz can be straightforwardly implemented using standard gates on a CV-quantum computer.
The results of learning an arbitrary Gaussian operation are shown in the top row of Fig. 5. To quantify the quality of the optimization, we take the optimal parameters obtained at each iteration of the optimisation algorithm and plot both the Hilbert-Schmidt cost, $C_{\rm HST}$, and the errors in the individual optimised parameters. As shown in Fig. 5(a), we start with r = 0.1 and successfully optimize the TMSS cost (using the COBYLA algorithm) down to $10^{-6}$. This corresponds to errors in the HST cost and individual parameters in the region of $10^{-1}$ to $10^{-4}$. We then took the optimal parameters from minimizing the TMSS cost with r = 0.1 and optimized the TMSS cost with r = 0.5. The cost value and parameter errors initially go sharply up (because the old parameters that optimised the cost with r = 0.1 are no longer optimal) before decreasing again as the new cost is optimised. After optimising with r = 2.5 we get both the TMSS and HST costs down to $10^{-9}$ with errors in the individual parameters in the region of $10^{-5}$. Thus Fig. 5 both demonstrates the effectiveness of our perturbative strategy and highlights how the difference between $C_{\rm TMSS}$ and $C_{\rm HST}$ decreases with increasing r.
Kerr Non-Linearity. The second optimization task we consider is learning a Kerr non-linearity of the form
$$U_{\rm Kerr}(\chi) := e^{i\chi\left(a^\dagger a\right)^2} . \quad (31)$$
Since there is no simple ansatz which can capture an arbitrary non-Gaussian operation, in this case we use the general layered ansatz advocated in Refs. [15,52]. This ansatz is composed of multiple layers that each consist of a displacement, squeeze, phase shift and non-linear Kerr shift. That is, a single layer is of the form
$$V_{\rm layer}(\alpha, \beta, \phi, \chi) := U_{\rm Kerr}(\chi)\, R(\phi)\, S(\beta)\, D(\alpha) , \quad (32)$$
where $D(\alpha)$, $S(\beta)$ and $R(\phi)$ are the displacement, squeeze and phase-shift operators respectively, and the total ansatz is composed of a product of L such layers, $V_{\rm ans}(\boldsymbol{\theta}) := \prod_{l=1}^{L} V_{\rm layer}(\alpha_l, \beta_l, \phi_l, \chi_l)$ where $\boldsymbol{\theta} := \{\alpha_l, \beta_l, \phi_l, \chi_l\}_{l=1}^{L}$. Since the gates in every layer constitute a universal set [52], this ansatz can be used to implement any single-mode quantum operation. We focus on the task of learning a large Kerr non-linearity (of, perhaps, some new, yet-to-be-classified, material). To make this task both non-trivial and physically pertinent, we suppose that the Kerr non-linear components used as part of the ansatz are limited to implementing some maximum non-linearity which is less than that of the target non-linearity. Specifically, we suppose that the components of θ are bounded between 0 and 1, and we consider trying to learn $\chi_{\rm targ} = 3$ using a 4-layered ansatz. To perform the optimization we employ the gradient-based Limited-memory BFGS algorithm.
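The layered ansatz can be prototyped directly in a truncated Fock space. Since the truncated generators remain anti-Hermitian, each truncated gate matrix is exactly unitary; truncation errors instead appear as deviations from the true CV gate action near the cutoff. The sketch below (numpy/scipy; the cutoff and the ordering of the gates within a layer are our own choices) builds one layer:

```python
import numpy as np
from scipy.linalg import expm

CUTOFF = 30
n = np.arange(CUTOFF)
a = np.diag(np.sqrt(n[1:]), k=1)        # truncated annihilation operator: a|n> = sqrt(n)|n-1>

def D(alpha):                           # displacement: exp(alpha a^dag - alpha* a)
    return expm(alpha * a.conj().T - np.conj(alpha) * a)

def S(beta):                            # squeeze: exp((beta* a^2 - beta a^dag^2)/2)
    return expm(0.5 * (np.conj(beta) * a @ a - beta * a.conj().T @ a.conj().T))

def R(phi):                             # phase shift: diagonal e^{i phi n}
    return np.diag(np.exp(1j * phi * n))

def K(chi):                             # Kerr gate: diagonal e^{i chi n^2}
    return np.diag(np.exp(1j * chi * n ** 2))

def layer(alpha, beta, phi, chi):
    """One ansatz layer: Kerr o phase o squeeze o displacement (one ordering choice)."""
    return K(chi) @ R(phi) @ S(beta) @ D(alpha)

V = layer(0.1, 0.05, 0.3, 0.2)
assert np.abs(V.conj().T @ V - np.eye(CUTOFF)).max() < 1e-8   # exactly unitary
```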
To assess the performance of the optimization, in Fig. 5 we again plot the HST cost as well as a measure of the error in the individual parameters. Given the non-commutativity of the displacement, squeezing, phase shift and Kerr operations, there are multiple possible choices of the parameters θ such that $V(\boldsymbol{\theta}) = U_{\rm Kerr}(\chi_{\rm targ})$. Despite this, in practice, we found that the optimization algorithm found the 'obvious' solution where the displacement parameters $\alpha_l$, squeezing parameters $\beta_l$, and phase shift parameters $\phi_l$ each sum to zero and the Kerr non-linearity parameters $\chi_l$ sum to $\chi_{\rm targ}$. We therefore took the deviations from these values (i.e. $|\sum_{l=1}^{L}\alpha_l|$, $|\sum_{l=1}^{L}\beta_l|$, $|\sum_{l=1}^{L}\phi_l|$ and $|\sum_{l=1}^{L}\chi_l - \chi_{\rm targ}|$) as the measure of our displacement, squeezing, phase and Kerr errors respectively.
Similarly to the Gaussian case, we find that starting with a small r allows for successful training. Then increasing the value of r improves the quality of the training, in the sense that the HST error and parameter errors can be further decreased. We achieve a final TMSS and HST cost of $10^{-8}$ and parameter errors of $\sim 10^{-5}$.

FIG. 5. To mitigate the problem of vanishing cost gradients for large r we take a perturbative approach starting with r = 0.1 (left column), and after convergence increasing r to 0.5 (middle column) and then 2.5 (right column). To quantify the quality of the optimization, we take the optimal parameters obtained at each iteration of the optimisation algorithm and plot both the Hilbert-Schmidt cost $C_{\rm HST}$ (red dashed) and the errors in the individual optimised parameters (dotted). The parameter errors are given in natural units with $\hbar = 1$ and the mass and frequency of the modes equal to 1.
Beamsplitter Operation. Finally, we attempted to learn a beamsplitter operation of the form
$$U_{\rm BS}(\theta, \varphi) := \exp\!\left[\theta\left(e^{i\varphi} a^\dagger b - e^{-i\varphi} a b^\dagger\right)\right] \quad (33)$$
for a two-mode system with annihilation operators a and b respectively, where θ and φ are randomly chosen phases in the range [0, 2π]. To learn this operation we used a single-layer ansatz of the form
$$V_{\rm ans} := U_{\rm BS}(\theta', \varphi')\left(V^{(1)}_{\rm layer} \otimes V^{(2)}_{\rm layer}\right) , \quad (34)$$
where $V^{(1)}_{\rm layer}$ and $V^{(2)}_{\rm layer}$ indicate the single-mode gate sequence defined in Eq. (32) on the first and second mode respectively, and $\theta'$, $\varphi'$ parameterize a trainable beamsplitter. The optimization was successful, with the TMSS and HST costs reduced to $\sim 10^{-7}$.
We note that while it may superficially appear from Fig. 5 that fewer iterations are required to learn the Kerr non-linearity and beamsplitter operation than an arbitrary Gaussian operation, this is a feature of our choice of optimisation algorithm. Namely, BFGS uses a gradient-based approach which involves evaluating the cost $n_{\rm param}$ times at every iteration step, where $n_{\rm param}$ is the number of parameters to be learnt. Once this is accounted for, it requires more cost evaluations to learn the general beamsplitter or a Kerr non-linearity than to learn a Gaussian operation. This is precisely as one would expect, since these are more complex optimization problems.

V. NO-FREE-LUNCH THEOREMS FOR CV QUANTUM LEARNING
In classical machine learning, the No-Free-Lunch (NFL) theorems consider the task of learning a target function f, where f maps a discrete input set X to a discrete output set Y (both of size d). The learning is performed using a training set S consisting of |S| input-output training pairs,
$$S = \left\{\left(x_j, f(x_j)\right) : x_j \in X\right\}_{j=1}^{|S|} . \quad (35)$$
In the limit of perfect learning, one assumes it is possible to train a hypothesis function $h_S$ to match the target function f on all training pairs in S. The No-Free-Lunch theorems then quantify the 'generalization error', i.e. how well the hypothesis function matches the target on unseen data. In general terms, the theorems demonstrate that the generalization error of a given learning algorithm is no less than that of a random learning algorithm in expectation over target functions f [53][54][55][56][57].
That is, the average performance of a learning algorithm is determined not by the choice of learning algorithm but rather by the amount of training data |S|. Specifically, the generalization error can be quantified by the risk function
$$R_f(h_S) := \sum_{x \in X}\pi(x)\,\mathbb{1}\!\left[f(x) \neq h_S(x)\right] , \quad (36)$$
where $\mathbb{1}[\cdot]$ is the indicator function taking the value 1 (0) if its argument is satisfied (not satisfied). This is the probability that the hypothesis function $h_S(x)$ and target function f(x) differ when x is sampled from the uniform probability distribution π(x) on X, the domain of f. The average risk, averaged over training sets S and functions f, can for any optimization method be lower bounded as [57]
$$\mathbb{E}_f\,\mathbb{E}_S\!\left[R_f(h_S)\right] \geq \left(1 - \frac{1}{d}\right)\left(1 - \frac{|S|}{d}\right) . \quad (37)$$
Hence the average risk is determined by the number of training pairs |S|, with the bound vanishing if and only if S spans the full domain of f, i.e. if |S| = d.
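The classical bound can be checked by simulating the random learner it compares against. The sketch below (numpy; d = 16 is an arbitrary choice) draws random target functions f, lets the hypothesis agree with f on the training inputs and guess uniformly at random elsewhere, and confirms that the average risk matches $(1 - 1/d)(1 - |S|/d)$:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16

def avg_risk(n_train, trials=4000):
    """Empirical average risk of a learner that matches f on the training
    inputs and guesses uniformly at random elsewhere."""
    risks = []
    for _ in range(trials):
        f = rng.integers(0, d, size=d)                 # random target X -> Y, |X| = |Y| = d
        train = rng.choice(d, size=n_train, replace=False)
        h = rng.integers(0, d, size=d)                 # uniform random guesses...
        h[train] = f[train]                            # ...except on the training set
        risks.append(np.mean(h != f))                  # fraction of inputs misclassified
    return np.mean(risks)

for s in (0, 8, 16):
    bound = (1 - 1/d) * (1 - s/d)                      # NFL lower bound
    assert avg_risk(s) >= bound - 0.02                 # saturated, up to sampling noise
```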
Similar NFL theorems exist for finite-dimensional quantum circuit learning, in which the target function $f$ corresponds to a unitary quantum channel $U$ and the training set is generalized to a set of quantum state pairs $S = \{(|\psi_j\rangle, U|\psi_j\rangle)\}_{j=1}^{|S|}$. By defining the generalization error using a suitable distance on quantum state space, it can be shown that in general an exponential number of training states, $|S| \sim 2^n$, is required to learn an $n$-qubit unitary [58].
Further, by allowing the training set to consist of pairs of states that are entangled with a reference system, i.e. $S = \{(|\psi_j\rangle, (U \otimes I_R)|\psi_j\rangle)\}_{j=1}^{|S|}$, where the $|\psi_j\rangle \in H \otimes H_R$ are entangled pure states of Schmidt rank $r$, an entanglement-enhanced quantum No-Free-Lunch theorem can be derived. In this case, the lower bound on the expected error of a quantum learning algorithm, averaged over all target unitaries $U$, is reduced linearly in $r$ [34]. This has the important practical implication that, by using entanglement as a resource, the number of unique input-output state pairs needed to learn a target unitary $U$ may be exponentially reduced in the limit of perfect learning.
In Section V A we derive NFL theorems in a restricted setting where the task is learning a linear optical unitary operation. Specifically, Theorem 1 and Theorem 2 quantify learning with classical training data (coherent states) and quantum training data (entangled coherent-Fock states) respectively. Section V B shows how the entanglement-assisted NFL theorem of [34] can be applied in an unrestricted CV learning setting. We further discuss how CV quantum NFL theorems can be used to motivate cost functions for CV quantum compiling. These results are summarized in Table I.

A. Learning linear optical unitaries from Gaussian training data
Linear optical unitaries capture the dynamics of multimode beamsplitters, phase shifters and displacement operators. Such unitaries, on $m$ CV modes, can be written in the form $U = e^{iH}$ where $H = H^\dagger$ and $[H, \sum_{j=1}^m a_j^\dagger a_j] = 0$. Here we consider the task of training a hypothesis unitary $V_S$ to emulate a target linear optical unitary $U$ using a set of training data $S$ composed of $m$-mode coherent states. We analyze the expected performance of a generic learning algorithm, averaged over all target linear optical unitaries $U$ and all training sets containing $|S|$ training states.
To fix the notation, an $m$-mode coherent state with mean vector $w$ is written $|w\rangle$. Here $w$ is a row vector in $\mathbb{R}^{2m}$ given by $w = \langle w|R|w\rangle$, with $R = (q_1, p_1, \ldots, q_m, p_m)$ the vector of canonical quadrature operators. The action of the target linear optical unitary $U$ on $|w\rangle$ is given by $U|w\rangle = |wO\rangle$, where $O$ is a $2m \times 2m$ orthogonal matrix. The set of $2m \times 2m$ orthogonal matrices will be denoted $\mathrm{Orth}(2m)$. Equipped with this notation, the training set to learn a linear optical unitary $U$ using $|S|$ pairs of $m$-mode coherent states can be written as
$$S = \big\{(|w_j\rangle, |w_j O\rangle)\big\}_{j=1}^{|S|}. \qquad (38)$$
Similarly, the action of the hypothesis linear optical unitary $V_S$ on $|w\rangle$ can be written as $V_S|w\rangle = |wT_S\rangle$ where $T_S \in \mathrm{Orth}(2m)$. We focus on the limit of perfect learning and assume that the learning algorithm outputs an orthogonal matrix $T_S$ that agrees perfectly with $O$ on all coherent states $|w_j\rangle$ in the training set. That is, we assume that $w_j T_S = w_j O$ for the training data mean vectors $(w_j, w_j O) \in S$.
To quantify how well the hypothesis unitary $V_S$ matches the target unitary $U$ on all possible coherent states, i.e. not just the training states, we define a risk function. To do so we utilize a simple loss function of the form $L(y, z) = \|y - z\|^2$, where $y = xO$ and $z = xT_S$ are the output vectors of the target and hypothesis orthogonal matrices respectively. Throughout this section, $\|\cdot\|$ refers to the 2-norm on the Euclidean space $\mathbb{R}^{2m}$. The total risk is then defined as the (suitably normalized) average loss over a multivariate Gaussian distribution of input vectors $x$, i.e. over the distribution $\pi(x) = \frac{1}{(2\pi\sigma)^m} e^{-\|x\|^2/(2\sigma)}$. The total risk thus takes the form
$$R_O(T_S) = \frac{1}{8m\sigma}\int_{\mathbb{R}^{2m}} dx\, \pi(x)\, \|xO - xT_S\|^2. \qquad (39)$$
The following theorem quantifies the expected risk for learning a linear optical unitary using the training set $S$ of Eq. (38) in the limit of perfect learning.

Theorem 1. Let $O$ be distributed according to the normalized Haar measure on $\mathrm{Orth}(2m)$ and let the training data $S$ of cardinality $|S|$ be chosen uniformly from a compact connected set of $m$-mode coherent states, as defined in Eq. (38). Then
$$\mathbb{E}_S\big[\mathbb{E}_O[R_O(T_S)]\big] = \frac{1}{2} - \frac{|S|}{4m}. \qquad (40)$$

Proof. For fixed $S$, simplify (39) to
$$R_O(T_S) = \frac{1}{2} - \frac{\mathrm{Tr}(T_S O^T)}{4m}. \qquad (41)$$
Under the assumption that the learning algorithm outputs $T_S$ that agrees with $O$ on the $|S|$-dimensional subspace of $\mathbb{R}^{2m}$ spanned by the training data (i.e., $w_j T_S = w_j O$ for the training data mean vectors $w_j$), we can write
$$T_S O^T = I_{|S|} \oplus Y, \qquad (42)$$
where $Y \in \mathrm{Orth}(2m - |S|)$. Taking the expectation over $O$, and noting that $\mathbb{E}_Y[\mathrm{Tr}\,Y] = 0$, gives
$$\mathbb{E}_O[R_O(T_S)] = \frac{1}{2} - \frac{|S|}{4m}. \qquad (43)$$
Because $S$ is taken from a subset of $\mathbb{R}^{2m}$ with no isolated points, one always obtains a set of $|S|$ linearly independent coherent states when $S$ is sampled. Therefore, taking the expectation over $S$ does not change the right-hand side of (43).
Theorem 1 shows that the generalization error for learning generic linear optic unitaries reduces linearly with the number of pairs of coherent states trained on, vanishing completely for |S| = 2m. (We stress that |S| is the number of unique training pairs required to learn the unitary, not the total number, which, due to shot noise and the iterative optimization procedure, will be substantially larger.) This implies that the Averaged Coherent State cost in Eq. (17) can be approximated using only 2m training states when learning m-mode linear optical unitaries. More broadly, Theorem 1 can be viewed as a "classical" NFL theorem for CV systems.
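As a sanity check on this linear scaling, the proof mechanism can be simulated directly. The sketch below assumes the normalization convention $R_O(T_S) = \frac{1}{2} - \mathrm{Tr}(T_S O^T)/(4m)$ for the risk and models perfect training on an $|S|$-dimensional subspace by setting $T_S O^T = I_{|S|} \oplus Y$ with $Y$ Haar random.

```python
# Numerical check: expected risk for perfect training on an |S|-dim
# subspace equals 1/2 - |S|/(4m), with the normalization assumed above.
import numpy as np
from scipy.linalg import block_diag
from scipy.stats import ortho_group

rng = np.random.default_rng(1)
m, n_S, trials = 3, 2, 4000
risks = np.empty(trials)
for t in range(trials):
    O = ortho_group.rvs(2 * m, random_state=rng)          # Haar target
    Y = ortho_group.rvs(2 * m - n_S, random_state=rng)    # unlearnt block
    T_S = block_diag(np.eye(n_S), Y) @ O                  # perfect on training subspace
    risks[t] = 0.5 - np.trace(T_S @ O.T) / (4 * m)
print(risks.mean())   # ≈ 1/2 - |S|/(4m) = 1/2 - 2/12 ≈ 0.333
```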
It is possible to formulate a quasiclassical CV NFL theorem in which the training data consists of squeezed, rather than coherent, states. In this case we use a risk function that compares the action of $O$ and $T_S$ on the phase space fluctuations of a compact set of centered, pure CV Gaussian states. We find that squeezing in general inhibits the learning process. However, intriguingly, for this definition of the risk, the risk may be reduced by the training set size as a function of $|S|^2$ instead of $|S|$. This CV NFL theorem is discussed and proved in Appendix D.
We now show, similarly to the entanglement-assisted discrete variable NFL theorem [34], that utilizing entangled training states can lower the expected risk. This improvement is achieved by modifying the training data set in Theorem 1, while keeping the risk (39) the same. Specifically, we now consider a training set
$$S = \big\{(|\psi_j^r\rangle, (U \otimes I_R)|\psi_j^r\rangle)\big\}_{j=1}^{|S|}, \qquad (44)$$
with
$$|\psi_j^r\rangle = \frac{1}{\sqrt{r}} \sum_{k=1}^r |w_k^{(j)}\rangle_X \otimes |k\rangle_R. \qquad (45)$$
Here $\{|w_k^{(j)}\rangle_X\}_{k=1}^r$ is a set of linearly independent coherent states acting on a system $X$ and $|k\rangle_R$ denotes the $k$th Fock state of an ancilla register $R$. The positive integer $r$ acts as an analogue of Schmidt rank in this context, although we note that the coherent states with linearly independent mean vectors $w_k^{(j)}$ need not be approximately orthogonal, so $r$ is not strictly related to the entanglement entropy. To use a precise term, $r$ is equal to the exponential of the entropy of coherence [59] with respect to the orthonormal set $\{|k\rangle_R\}$. When $\|w_k^{(j)}\|$ is sufficiently large, $|\psi_j^r\rangle$ has entanglement entropy approximately equal to $\log_2 r$ with respect to the partition consisting of the $m$ CV modes $X$ and the CV register $R$ of the training set.
Analogously to the NFL theorem for coherent state training above, we derive the following theorem on the expected risk.
Theorem 2. Let $O$ be distributed according to the normalized Haar measure on $\mathrm{Orth}(2m)$ and let the training data $S$ of cardinality $|S|$ consist of pairs of entangled coherent-Fock states as defined in Eq. (44) and Eq. (45). Then
$$\mathbb{E}_S\big[\mathbb{E}_O[R_O(T_S)]\big] = \frac{1}{2} - \frac{r|S|}{4m}.$$

Proof. As in the setting of Theorem 1, the objective is to learn the orthogonal matrix $O$ corresponding to an $m$-mode linear optical unitary $U$. The assumption of perfect agreement of $T_S$ and $O$ on the training data set now corresponds to the condition $(V_S \otimes I_R)|\psi_j^r\rangle = (U \otimes I_R)|\psi_j^r\rangle$ for all $j$. Proceeding up to (41) in the same way as in the proof of Theorem 1, we now note that the assumption of perfect agreement on the training data implies that $w_k^{(j)} T_S = w_k^{(j)} O$ for all $k, j$. Taking into account the linear independence of the mean vectors $w_k^{(j)}$ in $\mathbb{R}^{2m}$, this means that $T_S$ and $O$ are identical on an $r|S|$-dimensional subspace of the phase space $\mathbb{R}^{2m}$. So $T_S O^T = I_{r|S|} \oplus Y$ with $Y \in \mathrm{Orth}(2m - r|S|)$ and, instead of (43) above, one gets
$$\mathbb{E}_O[R_O(T_S)] = \frac{1}{2} - \frac{r|S|}{4m}.$$
Again, the expectation over training sets $S$ of fixed cardinality $|S|$ is trivial when the mean vectors $w_k^{(j)}$ are chosen uniformly from a compact connected subset of $\mathbb{R}^{2m}$.

Theorem 2 shows that for a fixed training data set size, increasing the parameter $r$ in the training data (which, for large $\|w_k^{(j)}\|$, approximately corresponds to increasing the entanglement entropy of the training data) can reduce the generalization error. In this sense, entanglement could be seen to provide a 'free lunch'. However, as with all apparently free lunches, there are caveats. Namely, there may be a hidden cost in obtaining the entangled training data in the first place, since entanglement is generally experimentally challenging to create and preserve. Thus how 'free' this lunch is will depend on the relative scarcity of training states and entanglement.
It is also important to note that the enhancement provided by entanglement here is less necessary than the enhancement found in the discrete variable case. In the discrete variable case an exponential number of training pairs are required in the absence of entangled training data, whereas to learn linear optical unitaries, the number of unentangled pairs scales linearly in the number of modes.
Theorem 2 could be viewed as motivating a cost function of the form
$$C_{\rm ECFS}^{(r,k)} = 1 - \frac{1}{k} \sum_{j=1}^{k} \big|\langle \psi_j^r | (V^\dagger U \otimes I_R) |\psi_j^r\rangle\big|^2,$$
where the $|\psi_j^r\rangle$ are the entangled coherent-Fock states defined in Eq. (45). We note that this is a generalisation of $C_{\rm ACS}$ in the sense that it reduces to $C_{\rm ACS}$ in the limit that $r = 1$. To learn a linear optical unitary, Theorem 2 implies it suffices to use $k = 2m/r$ training pairs. One could also potentially use this cost to learn more general unitaries; however, Theorem 2 does not apply in that case and therefore one may need to use a significantly larger $rk$ to minimise the generalization error.
In Appendix C we prove that Theorems 1 and 2 generalize to learning arbitrary Gaussian operations. We thus expect it to be possible to learn a single mode Gaussian operation using a single entangled training pair ($r = 2$, $|S| = 1$) or two unentangled training pairs ($r = 1$, $|S| = 2$), but not a single unentangled training pair ($r = 1$, $|S| = 1$), since for the former two the expected risk vanishes whereas the latter corresponds to a finite risk. This is indeed supported by our numerical results shown in Fig. 6, where we optimize $C_{\rm ACS}$ (corresponding to training on two unentangled training state pairs) and $C_{\rm ECFS}^{(2,1)}$ (corresponding to training on a single entangled training state pair) using the same variational framework set out in Section IV. We find that while it is possible to minimise $C_{\rm ACS}^{(1)}$, this does not correspond to the Gaussian operation being successfully learnt. This is shown by the large learning errors, as measured by the truncated Hilbert-Schmidt Test cost (which quantifies the average error over all possible input states) and the individual parameter errors, in the left-hand panel of Fig. 6. Conversely, as shown in the middle- and right-hand panels of Fig. 6, when using entangled training data or multiple training states the learning errors are iteratively minimized as the cost is minimized.
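A toy phase-space analogue of this experiment can be run classically: fit an orthogonal matrix to $|S|$ mean-vector training pairs and measure the error on unseen vectors. The sketch below uses orthogonal Procrustes as a stand-in "perfect learner" (our choice for illustration, not the paper's variational method): for $m = 2$ modes, $|S| = 2m = 4$ pairs pin down the target exactly, while a single pair achieves zero training error yet fails to generalize.

```python
# Toy analogue: fit an orthogonal matrix to |S| training mean vectors
# and compare training error with generalization error on unseen vectors.
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.stats import ortho_group

rng = np.random.default_rng(2)
O = ortho_group.rvs(4, random_state=rng)       # Haar-random target (2m = 4)
X_test = rng.normal(size=(1000, 4))            # unseen phase-space vectors

for n_S in (1, 4):
    W = rng.normal(size=(n_S, 4))              # training mean vectors
    T, _ = orthogonal_procrustes(W, W @ O)     # best-fit orthogonal hypothesis
    train_err = np.linalg.norm(W @ T - W @ O)
    gen_err = np.mean(np.sum((X_test @ T - X_test @ O) ** 2, axis=1))
    print(n_S, train_err, gen_err)
# Training error vanishes in both cases, but only |S| = 2m training pairs
# determine O, so the generalization error is ~0 only in the second case.
```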
In Fig. 7 we present analogous results for the learning of a weak single-mode Kerr non-linearity. Specifically, as shown in Fig. 7, we find that a single-mode ($m = 1$) Kerr non-linearity with $\chi = 0.1$ or $\chi = 0.5$ can be learnt using either a single entangled training pair ($r = 2$, $|S| = 1$) or two unentangled training pairs ($r = 1$, $|S| = 2$), but not a single unentangled training pair ($r = 1$, $|S| = 1$).

B. Learning arbitrary unitaries and motivation of compiling cost functions
Theorems 1 and 2 concern learning linear optical unitaries. The question remains whether similar entanglement assisted NFL theorems can be derived for learning arbitrary CV unitaries.
To answer this question, it is useful to recall the discrete variable (i.e., finite dimensional) entanglement-assisted quantum NFL theorem of Ref. [34]. Specifically, the analogue of Theorem 2 takes the form
$$\mathbb{E}_U\big[\mathbb{E}_S[R_U(V_S)]\big] \geq 1 - \frac{d + r^2|S|^2}{d(d+1)}, \qquad (49)$$
where $U$ is the $d$-dimensional target unitary, $V_S$ is the output of the learning algorithm on the entangled training states in $S$, $r$ is the Schmidt rank of the training data states, and the risk is
$$R_U(V_S) = \int d\psi\, D^2\big(U|\psi\rangle\langle\psi|U^\dagger,\, V_S|\psi\rangle\langle\psi|V_S^\dagger\big), \qquad (50)$$
with $D$ a suitable distance on state space. In (50), the integral is over all pure states according to the Haar measure. A continuous variable NFL theorem strictly analogous to (49) cannot be derived in the CV setting because there is no Haar measure over the unitary group on $B(H)$ for infinite dimensional $H$. On the other hand, one is often only interested in the action of the target unitary $U$ on Fock states up to a finite cutoff. For example, recent proposals for efficient updates and derivatives of Gaussian gates in parameterized CV circuits utilize cutoff recursion relations for the Fock matrix elements of the gates [5].
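The Haar averages underlying such discrete variable risks can be verified numerically. For any fixed $d \times d$ unitary $W = U^\dagger V_S$, the standard identity $\int d\psi\, |\langle\psi|W|\psi\rangle|^2 = (d + |\mathrm{Tr}\,W|^2)/(d(d+1))$ holds when $|\psi\rangle$ is Haar random; a Monte-Carlo check:

```python
# Monte-Carlo check of the Haar-average identity behind DV risk functions.
import numpy as np
from scipy.stats import unitary_group

rng = np.random.default_rng(3)
d, n = 4, 100_000
W = unitary_group.rvs(d, random_state=rng)
psi = rng.normal(size=(n, d)) + 1j * rng.normal(size=(n, d))
psi /= np.linalg.norm(psi, axis=1, keepdims=True)   # Haar-random pure states
overlaps = np.einsum("ni,ij,nj->n", psi.conj(), W, psi)  # <psi|W|psi>
mc = np.mean(np.abs(overlaps) ** 2)
exact = (d + abs(np.trace(W)) ** 2) / (d * (d + 1))
print(mc, exact)   # agree to Monte-Carlo accuracy
```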
Eq. (49) implies that a single full rank training state, i.e. a state with Schmidt rank $d$, can be used to fully learn a $d$-dimensional unitary. Thus, the truncated TMSS states defined in Eq. (8) can be used to learn arbitrary $d$-dimensional unitaries without incurring a generalization error. Taking the limit in which $d$ tends to infinity, this implies that the Loschmidt-Echo TMSS cost can be used to learn arbitrary CV unitaries, further motivating its use.
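Assuming the truncated TMSS of Eq. (8) is the two-mode squeezed state projected onto the first $d$ Fock levels and renormalized, its Schmidt coefficients are proportional to $\tanh^n r$ and hence all non-zero, so a single such state indeed has full Schmidt rank:

```python
# The Fock-truncated TMSS has Schmidt coefficients ~ tanh^n(r), n < d,
# all non-zero, so its Schmidt rank is the full dimension d.
import numpy as np

d, r = 6, 0.8
c = np.tanh(r) ** np.arange(d)     # unnormalized Schmidt amplitudes
c /= np.linalg.norm(c)
coeff = np.diag(c)                 # coefficient matrix in the |n>|n> basis
schmidt = np.linalg.svd(coeff, compute_uv=False)
print(np.count_nonzero(schmidt > 1e-12))   # -> 6, i.e. full rank
```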

VI. DISCUSSION
In this work we have established a framework for quantum compiling in continuous variable systems. We started by motivating the TMSS cost (both the Loschmidt Echo and Ricocheted variants) and the averaged coherent state cost as natural CV analogues of the Hilbert-Schmidt state cost. Our numerical implementations demonstrated the successful learning of single mode Gaussian operations, a generalized beamsplitter operation and Kerr non-linearities using these costs.
We subsequently showed how these costs may be alternatively motivated via a series of increasingly general '(No-)Free Lunch' theorems. Firstly, the NFL theorem for Gaussian operations using coherent state mean vector training data establishes that it is possible to perfectly learn an $m$-mode Gaussian operation by training on only $2m$ coherent states. This implies that it is possible to learn arbitrary Gaussian operations by training on an approximation of the averaged coherent state cost using only $2m$ coherent states. Next, the NFL theorem for Gaussian operations using entangled coherent-Fock states both showed how entanglement may be used to reduce the amount of training data required to learn Gaussian operations and motivated an alternative entanglement-enhanced cost function for compiling that makes use of entangled coherent-Fock states. Finally, we argued that taking the continuum limit of the discrete variable entanglement-enhanced NFL theorem implies that learning an arbitrary unitary from a single training state requires a full rank state. This motivates training using the TMSS cost.

[Fig. 6 caption: In the left and middle columns we use $C_{\rm ACS}$ with a single training pair ($k = 1$) and two training pairs ($k = 2$) respectively. In the right-hand column we use $C_{\rm ECFS}$ with a single entangled training pair ($k = 1$ and $r = 2$). In all cases the underlying coherent training states have a random energy up to a maximum of 1 and a random phase in the range 0 to $2\pi$. To quantify the quality of the optimization, we take the optimal parameters obtained at each iteration of the optimization algorithm and plot both the Hilbert-Schmidt cost $C_{\rm HST}$ (red) and the errors in the individual optimized parameters in arbitrary units (dotted). The dashed and dot-dashed red lines show the truncated Hilbert-Schmidt cost $C_{\rm HST}$ on the first 50 and first 5 Fock states respectively.]
It is worth highlighting that these (No-)Free-Lunch theorems quantify the number of different training pairs required to learn a unitary in the ideal case of perfect training. That is, they do not give the total number of copies of training pairs that are required to learn the unitary. Indeed, given shot noise, a large number of copies of each pair will in fact be required to evaluate the cost. More generally, training may be imperfect not only due to shot noise but also hardware noise or the presence of barren plateaus or local minima in the training cost function landscape. A valuable extension would be to generalize the theorems to account for imperfect learning.
It would also be interesting to derive further NFL theorems for alternative classes of training data. For example, one might be concerned with learning a unitary $U$ from homodyne or heterodyne detection data, in which case a risk function could be defined in terms of the difference in the expected quadrature vector of the output state for the hypothesis and target unitaries. General unitary learning protocols based on other CV measurement-motivated risk functions, such as those associated with CV distinguishability norms [60,61], are expected to have associated NFL theorems and quantum compiling protocols that are adapted to the measurement class under consideration.
We further note that Two-Mode-Squeezed states are not the only choice of state to saturate the entanglement enhanced NFL bound for arbitrary unitaries. One could alternatively use any full rank state, such as cluster states. A finite energy CV cluster state is defined by $|CL_r\rangle = e^{i q \otimes q}\,(S(-r)|0\rangle)^{\otimes 2}$, where $q$ is the single-mode position quadrature and $S(r)$ is the unitary squeezing operator. The state $|CL_r\rangle$ limits to the well-known CV cluster state for $r \to \infty$ [62]. One could use $|CL_r\rangle$ to define a faithful cost function analogous to the Loschmidt Echo and Ricocheted TMSS costs.
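The full-rank claim can be probed numerically in a truncated Fock basis. The sketch below builds $|CL_r\rangle$ at a finite cutoff (with an assumed sign convention for the squeezer, which only affects which quadrature is squeezed) and counts its numerically non-negligible Schmidt coefficients:

```python
# Build |CL_r> = e^{i q (x) q} (S(-r)|0>)^{(x)2} at a finite Fock cutoff
# and count its numerically significant Schmidt coefficients.
import numpy as np
from scipy.linalg import expm

cutoff, r = 15, 0.5
a = np.diag(np.sqrt(np.arange(1, cutoff)), k=1)     # truncated annihilation op
q = (a + a.T) / np.sqrt(2)                          # position quadrature
S = lambda s: expm(0.5 * s * (a @ a - a.T @ a.T))   # squeezer (one convention)
vac = np.zeros(cutoff); vac[0] = 1.0
sq = S(-r) @ vac                                    # S(-r)|0>
CL = expm(1j * np.kron(q, q)) @ np.kron(sq, sq)     # the cluster state
sv = np.linalg.svd(CL.reshape(cutoff, cutoff), compute_uv=False)
print(np.count_nonzero(sv > 1e-6 * sv[0]))          # numerical Schmidt rank
```

Within the cutoff, many Schmidt coefficients are non-negligible, consistent with $|CL_r\rangle$ being a full rank resource in the same sense as the truncated TMSS.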
As quantum hardware develops, the CV quantum compiling algorithms we have presented here are expected to find use in optimizing short depth CV quantum circuits, thereby aiding the implementation of larger scale quantum algorithms. Further, we envision that tuning CV quantum resources such as intensity or squeezing could allow one to implement our CV quantum compiling algorithms in a noise resistant manner. For example, the results of Ref. [51] indicate that sublinear scaling (with mode number $m$) of the coherent state intensity and of the number of quantum-limited attenuator layers does not induce barren plateaus in cost functions such as $C_{\rm ACS}^{E}$ when restricted to linear optical unitaries. More generally, we are excited by the idea that these quantum compilation algorithms may be used to study the optical properties of new materials. It would be interesting to explore whether these algorithms could be combined with meta-learning strategies to actively design new materials with desirable properties such as controllable squeezing amplitudes or non-linearities.
with squeezing parameter $r$ satisfying $-2\ln\tanh r = \beta$.

Here, we show that Theorem 1 and Theorem 2 can be generalized to learning arbitrary Gaussian operations. In Corollary 1 below, the target unitary $U$ is associated with $L \in \mathrm{Sp}(2m, \mathbb{R})$ by $U^\dagger R U = RL$, and the learning algorithm outputs $T_S \in \mathrm{Sp}(2m, \mathbb{R})$ when given the training set $S$ in (38). A $2m \times 2m$ real matrix $L$ is symplectic iff $L \Delta L^T = \Delta$, where $\Delta$ is the matrix of the symplectic form. Let $G^{(2m)}$ denote the set of matrices of the form $O_1 F O_2$, with $O_1, O_2$ orthogonal symplectic matrices and $F = \bigoplus_{j=1}^m \mathrm{diag}(z_j, z_j^{-1})$, where $z_j \in \mathbb{R}_+$.

Corollary 1. Let $S$ be the training data (38) with $|S| \equiv 0 \bmod 2$, and let $L \in \mathrm{Sp}(2m, \mathbb{R})$. If $R_L(T_S)$ is the risk (39) and $G^{(2m)}$ is equipped with the probability measure $dO_1\, dO_2\, \mu(dz)$, with $dO_1$ and $dO_2$ Haar measure and $\mu(dz)$ a probability measure on $\mathbb{R}^m$, then $\mathbb{E}_S\big(\mathbb{E}_L(R_L(T_S))\big)$ is given by (40).
Proof. Every symplectic matrix $L$ is in $G^{(2m)}$ due to the Bloch-Messiah decomposition [65]. From (41), it follows that $R_L(T_S) = \frac{1}{2} - \frac{\mathrm{Tr}(T_S L^T)}{4m}$. Since $T_S$ and $L$ are assumed to agree on the subspace of $\mathbb{R}^{2m}$ spanned by the training data, one can write
$$T_S L^T = \begin{pmatrix} I_{|S|} & A \\ B & Y \end{pmatrix}.$$
But $T_S L^T \in \mathrm{Sp}(2m, \mathbb{R})$ implies that
$$(T_S L^T)\, \Delta\, (T_S L^T)^T = \Delta, \qquad (C1)$$
where $\Delta = \Delta_{|S|} \oplus \Delta_{2m-|S|}$. The first of the block equations in (C1) implies $B = 0$ from the non-degeneracy of the symplectic form. Let $V$ be the Gaussian unitary that acts on the canonical operators as $V^\dagger R V = R\, T_S L^T$. Then the action of $V$ on the $2m$-vector of operators $(0, \ldots, 0, \tilde{R})$, with $\tilde{R} := (q_{|S|/2+1}, p_{|S|/2+1}, \ldots, q_m, p_m)$, is $(\tilde{R}B, \tilde{R}Y) = (0, \ldots, 0, \tilde{R}Y)$, so the unitary invariance of the canonical commutation relations implies that $Y \in \mathrm{Sp}(2m - |S|, \mathbb{R})$. It then follows from the second block equation of (C1) that $A = 0$. One concludes that
$$\mathbb{E}_L[R_L(T_S)] = \frac{1}{2} - \frac{|S|}{4m} - \frac{\mathbb{E}_Y[\mathrm{Tr}\,Y]}{4m}, \qquad (C2)$$
where $\mathbb{E}_Y$ is taken with respect to the measure on $Y$ induced by the measure on $G^{(2m)}$. Since $Y$ is a symplectic matrix, it can be written $W_1 F W_2$ with $W_1, W_2 \in \mathrm{Orth}(2m - |S|)$. The fact that $W_1$ and $W_2$ are independent and distributed according to Haar measure follows from restricting the Haar measure on $O_1$ and $O_2$. Therefore, the expectation over $Y$ in (C2) is zero.
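The decomposition used here can be illustrated numerically. The sketch below assumes the block quadrature ordering $(q_1, \ldots, q_m, p_1, \ldots, p_m)$ (the paper's interleaved ordering differs only by a permutation): an element $O_1 F O_2$ of $G^{(2m)}$, with $O_1, O_2$ orthogonal symplectic matrices obtained by embedding unitaries and $F$ a diagonal squeezing block, is verified to be symplectic.

```python
# Sketch of the Euler/Bloch-Messiah structure L = O1 F O2, in the block
# ordering (q_1..q_m, p_1..p_m) with symplectic form Omega = [[0, I], [-I, 0]].
import numpy as np
from scipy.stats import unitary_group

rng = np.random.default_rng(4)
m = 3
Omega = np.block([[np.zeros((m, m)), np.eye(m)],
                  [-np.eye(m), np.zeros((m, m))]])

def orth_symplectic(U):
    """Embed an m x m unitary U as a 2m x 2m orthogonal symplectic matrix."""
    return np.block([[U.real, -U.imag], [U.imag, U.real]])

O1 = orth_symplectic(unitary_group.rvs(m, random_state=rng))
O2 = orth_symplectic(unitary_group.rvs(m, random_state=rng))
z = rng.uniform(0.5, 2.0, size=m)
F = np.diag(np.concatenate([z, 1.0 / z]))       # squeezing block diag(z, 1/z)
L = O1 @ F @ O2
print(np.allclose(L @ Omega @ L.T, Omega),      # L is symplectic
      np.allclose(O1 @ O1.T, np.eye(2 * m)))    # O1 is orthogonal
```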
An entirely equivalent argument can be used to generalize Theorem 2 to learning Gaussian operations.
distance gives a sum of two integrals:
$$\begin{aligned}
&e^{-2(r_j + r_k)}\big(e_{2j-1}^T W^T T_S^T O W e_{2k-1}\big)^2 + e^{-2(r_j - r_k)}\big(e_{2j-1}^T W^T T_S^T O W e_{2k}\big)^2 \\
+\,&e^{2(r_j - r_k)}\big(e_{2j}^T W^T T_S^T O W e_{2k-1}\big)^2 + e^{2(r_j + r_k)}\big(e_{2j}^T W^T T_S^T O W e_{2k}\big)^2.
\end{aligned} \qquad (D6)$$
The integral in the first line evaluates to $\frac{D^2 - D^{-2}}{4\log D}$. In the second integral, it is useful to break up the sum over $j, k$ into its $j = k$ and $j \neq k$ parts. We then use a lemma, proved by integration over the orthogonal group with respect to the Haar measure [66], giving the exact value of the Haar integrals of products of matrix elements of the form $L_{r,r'}L_{r',r}$, together with their large-$m$ asymptotic form with coefficient $\frac{2m+1}{4m(m+1)(2m-1)}$. We apply the lemma with $L = T_S^T O$, using the first integral from the lemma exactly to evaluate the $j = k$ part of the second integral in (D6), and using the second integral from the lemma in its asymptotic form to evaluate the $j \neq k$ part. The result of the integration over $W$ is (D7). The assumption that $T_S$ and $O$ agree on the training dataset $S$ is now taken into account. Recall that the training covariance matrices $\Sigma^{(j)}$ are associated with distinct directions in $\mathbb{R}^{2m}$. Therefore, we can write $T_S^T O = I_{|S|} \oplus Y$ with $Y \in \mathrm{Orth}(2m - |S|)$. Note that $\int dY\, \mathrm{Tr}\,Y = 0$, $\int dY\, \mathrm{Tr}(Y^2) = 1$, and $\int dY\, (\mathrm{Tr}\,Y)^2 = 1$. From this it follows that $\mathbb{E}_O\big[(\mathrm{Tr}\,T_S^T O)^2\big] = \mathbb{E}_Y\big[(|S| + \mathrm{Tr}\,Y)^2\big] = |S|^2 + 1$ and $\mathbb{E}_O\big[\mathrm{Tr}((T_S^T O)^2)\big] = \mathbb{E}_Y\big[|S| + \mathrm{Tr}(Y^2)\big] = |S| + 1$. Applying these to (D8) gives an expression with corrections of order $O(m^{-2})$, where the $O(m^{-2})$ comes from the asymptotic in (D7). Absorbing the remaining $O(m^{-2})$ terms and carrying out the trivial average over the finite set $S$ results in (D2).