Absence of Barren Plateaus in Quantum Convolutional Neural Networks

Quantum neural networks (QNNs) have generated excitement around the possibility of efficiently analyzing quantum data. But this excitement has been tempered by the existence of exponentially vanishing gradients, known as barren plateau landscapes, for many QNN architectures. Recently, Quantum Convolutional Neural Networks (QCNNs) have been proposed, involving a sequence of convolutional and pooling layers that reduce the number of qubits while preserving information about relevant data features. In this work we rigorously analyze the gradient scaling for the parameters in the QCNN architecture. We find that the variance of the gradient vanishes no faster than polynomially, implying that QCNNs do not exhibit barren plateaus. This provides an analytical guarantee for the trainability of randomly initialized QCNNs, which highlights QCNNs as being trainable under random initialization unlike many other QNN architectures. To derive our results we introduce a novel graph-based method to analyze expectation values over Haar-distributed unitaries, which will likely be useful in other contexts. Finally, we perform numerical simulations to verify our analytical results.


I. Introduction
The field of classical machine learning has been revolutionized by the advent of Neural Networks (NNs). One of the most prominent forms of classical neural network is the Convolutional Neural Network (CNN) [1,2]. CNNs differ from traditional fully-connected neural networks in that they use kernels applied across network layers, which reduce the dimension of the data and the number of parameters that need to be trained. The architecture of CNNs was inspired by seminal experiments investigating the structure of the visual cortex [3,4]. Indeed, CNNs have been employed for image-based tasks as their structure allows for local pattern recognition; reviews of state-of-the-art techniques can be found in Refs. [5][6][7].
Despite the tremendous success of NNs, several difficulties were encountered during their development that hindered the trainability of the NN's parameters. While multi-layer perceptrons are more powerful than single-layer ones [8,9], their trainability is more challenging, and the backpropagation method was developed to address this issue [10]. In addition, training difficulties can also arise in NNs that employ backpropagation and gradient-based methods, as the loss function gradients can be vanishing [11]. For CNNs this phenomenon can be mitigated by choosing rectifier neuron activation functions [12] or by implementing layer-by-layer training.
Recently, tremendous effort has been devoted to analyzing the trainability of the cost functions of Variational Quantum Algorithms (VQAs) and QNNs, as it has been shown that the optimization landscape can exhibit the so-called barren plateau phenomenon, where the cost function gradients of a randomly initialized ansatz vanish exponentially with the problem size [35][36][37][38][39][40][41][42][43][44]. If the cost exhibits a barren plateau, an exponentially large precision is needed to navigate through the landscape, rendering the architecture unscalable.
Specifically, it has been shown that barren plateaus can arise for different architectures such as VQAs with deep random parametrized quantum circuits [35] or with global cost functions [36]. Additionally, recent studies have shown that more general perceptron-based dissipative QNNs [32] can also show barren plateau landscapes [37,43], proving that this is a general phenomenon in many quantum machine learning applications. While strategies have been proposed to mitigate the effects of barren plateaus and improve trainability [45][46][47][48][49][50][51][52][53][54], much work still remains to be done to guarantee the efficient trainability of VQAs and QNNs.
In this work we analyze the trainability and the existence of barren plateaus in the Quantum Convolutional Neural Network (QCNN) architecture introduced by Cong et al. in [34]. A very similar architecture was proposed by Grant et al. in [55] under the name Hierarchical Quantum Classifier, to which our work applies as well. Motivated by the structure of CNNs, in a QCNN a series of convolutional layers are interleaved with pooling layers which effectively reduce the number of degrees of freedom, while preserving the relevant features of the input state. We remark that due to the specific architecture of the QCNN, its trainability does not follow from previous works. QCNNs have been successfully implemented for error correction, quantum phase detection [34,56], and image recognition [57], and their shallow depth makes them a promising architecture for the near term.
Here we provide a rigorous analysis of the scaling of the QCNN cost function gradient, under the following assumptions: (1) all the unitaries in the QCNN form independent (and uncorrelated) 2-designs; (2) the cost function is linear with respect to the input density matrix. Under those assumptions, we show that the variance of the cost function partial derivatives is at most polynomially vanishing with the system size. This implies that the cost function landscape does not exhibit a barren plateau, and hence that the QCNN architecture is trainable under random initialization of parameters. We obtain our results by introducing a novel method to compute expectation values of operators defined in terms of Haar-distributed unitaries, which we call the Graph Recursion Integration Method (GRIM). Intuitively, our results follow from the intrinsically short depth of the QCNN (logarithmic depth), and the local nature of the cost function. We then also provide trainability guarantees for the special case where the QCNN is composed only of pooling layers. Finally, we perform numerical simulations to verify our analytical results.
Our paper is organized as follows. In Section II we start by reviewing the necessary theoretical background on QCNNs, barren plateaus, and Haar-distributed unitaries. The GRIM is introduced in Sec. III A. Section III B contains our main results in the form of Theorem 1 and concomitant Corollary 1, which show the absence of barren plateaus for randomly initialized QCNNs. Then, in Sec. III C we present Theorem 2, which analyzes the pooling-based QCNN, while Sec. III D contains our numerical results. Finally, our discussions and conclusion are presented in Sec. IV. The proofs of our main results are presented in the Appendix.

II. Theoretical Framework
A. The QCNN architecture

As depicted in Fig. 1(a), the Quantum Convolutional Neural Network (QCNN) architecture takes as input an n-qubit state ρ_in, in a Hilbert space H_in, which is sent through a circuit composed of a sequence of convolutional and pooling layers. The convolutional layer is composed of two rows of parametrized two-qubit gates acting on alternating pairs of neighboring qubits. In each pooling layer half of the qubits are measured, with the measurement outcomes controlling the unitary applied to neighboring qubits. After L convolutional and pooling layers, the QCNN contains a fully connected layer that applies a unitary to the remaining qubits. Finally, at the end of the QCNN one measures the expectation value of some Hermitian operator O.
FIG. 1. Schematic representation of a quantum convolutional neural network (QCNN). The input of the QCNN is an n-qubit quantum state ρ_in. The state ρ_in is then sent through a sequence of L convolutional (C) and pooling (P) layers. The convolutional layers are composed of two rows of two-qubit unitaries (W) acting on alternating pairs of qubits. The pooling layer is composed of pooling operators (indicated in the dashed box). In each pooling module a qubit is measured, with the measurement outcome controlling a unitary (I) applied to a neighboring qubit. After the final pooling layer, one applies a fully connected unitary (F) to the remaining qubits and obtains an output state ρ_out whose dimension is much smaller than that of ρ_in. Finally, one measures the expectation value of some operator O over the state ρ_out.

From the previous description we have that the input state to the QCNN is mapped to a reduced state in a Hilbert space H_out whose dimension is much smaller than that of H_in. The output state can be expressed as

ρ_out(θ) = Tr_w̄out[ V(θ) ρ_in V†(θ) ].   (1)

Here, V(θ) is the unitary that contains the gates in the convolutional and pooling layers plus the fully connected layer, θ is the vector of trainable parameters, and Tr_w̄out denotes the partial trace over all qubits except those in H_out. Note that the non-linearities in a QCNN arise from the pooling operators (measurement and conditioned unitary) in the pooling layers, which effectively reduce the degrees of freedom in each layer.
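The layer bookkeeping behind this architecture can be sketched in a few lines (a toy function of our own naming, not from the original text): each pooling layer halves the register, so an n-qubit input supports at most log₂(n) layers.

```python
# Qubit bookkeeping for a QCNN: each pooling layer discards half the
# qubits, so the circuit depth is logarithmic in the input size n.
def qcnn_widths(n):
    widths = [n]
    while widths[-1] > 1:
        widths.append(widths[-1] // 2)  # pooling halves the register
    return widths

print(qcnn_widths(16))  # [16, 8, 4, 2, 1]
```

The length of this list minus one is the number of layers, which is the logarithmic depth that drives the trainability analysis.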

B. Cost function
The goal of the QCNN is to employ a training set S (of size M = |S|) containing input states {ρ^α_in}_{α=1}^M to optimize the parameters in the QCNN and minimize a cost function which we assume can be expressed as

C(θ) = Σ_α c_α Tr[ O ρ^α_out(θ) ],   (2)

where the c_α are real coefficients, and where the ρ^α_out(θ) are obtained from (1) for each input state ρ^α_in.
For example, in a binary classification problem the training set is usually of the form S = {ρ^α_in, y_α}_{α=1}^M with y_α ∈ {0, 1}. Here we can divide S into two subsets S_0 and S_1, of sizes M_0 = |S_0| and M_1 = |S_1| (such that M = M_0 + M_1), respectively composed of states associated with outputs 0 and 1. In this case, Eq. (2) can be explicitly expressed as

C(θ) = (1/M) [ Σ_{α∈S_0} Tr[O ρ^α_out(θ)] + Σ_{α∈S_1} (1 − Tr[O ρ^α_out(θ)]) ],   (3)

where the operator O is such that for any state τ ∈ H_out we have 0 ≤ Tr[Oτ] ≤ 1. Here we remark that it is convenient to rewrite the cost function as

C(θ) = Tr[ Ô V(θ) σ V†(θ) ],   (4)

where Ô = O ⊗ 1_w̄out, and 1_w̄out is the identity operator over the Hilbert space H_w̄out, which is defined such that H_in = H_out ⊗ H_w̄out. In addition, here we define the Hermitian operator

σ = Σ_α c_α ρ^α_in,   (5)

which is a quantum state in the special case when c_α ≥ 0 for all α and Σ_α c_α = 1. As shown below, our results are valid independently of σ being a quantum state.
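As a toy numerical illustration of this linear cost structure (our own example, not the paper's setup), one can evaluate a cost of the form C = Σ_α c_α Tr[O ρ^α_out] directly for single-qubit output states:

```python
import numpy as np

# Linear cost C = sum_a c_a * Tr[O rho_out^a] for single-qubit outputs.
# The coefficients and states here are illustrative only.
O = np.array([[0, 0], [0, 1]], dtype=complex)      # projector onto |1>

def linear_cost(coeffs, states, O):
    return float(np.real(sum(c * np.trace(O @ rho)
                             for c, rho in zip(coeffs, states))))

rho0 = np.array([[1, 0], [0, 0]], dtype=complex)   # |0><0|, label 0
rho1 = np.array([[0, 0], [0, 1]], dtype=complex)   # |1><1|, label 1
# signed coefficients reward Tr[O rho] for one class and penalize the other
print(linear_cost([0.5, -0.5], [rho0, rho1], O))   # -0.5
```

Note that σ = Σ_α c_α ρ^α_in built from such signed coefficients is Hermitian but not necessarily a quantum state, exactly as remarked above.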

C. Ansatz
In what follows we consider for simplicity the case where n = 2^k and L = log₂(n) = k, so that dim(H_out) = 2. Moreover, we assume that the unitaries in the convolutional and pooling layers are independent. That is, the convolutional and the fully connected layers in V(θ) are composed of two-qubit parameterized unitary blocks acting on neighboring qubits, which we denote as W_ij(θ_ij). We henceforth use the terminology of sub-layers, such that the unitaries in the first sub-layer act before those in the second sub-layer. On the other hand, the pooling layers are composed of pooling operators I_ij. Here, i = 0, . . . , L is the layer index, with i = 0 corresponding to the fully connected layer, while j is the index that determines block placement within the layer. We remark that this generalization contains as a special case the usual QCNN structure, where the blocks in the same convolutional or pooling layer are identical. Moreover, as discussed in our results section, correlating the unitaries in the convolutional layers tends to increase the magnitude of the cost function gradient.
Given a two-qubit unitary W_ij(θ_ij) in a convolutional or fully-connected layer, it is common to expand it as a product of two-qubit gates of the form

W_ij(θ_ij) = Π_η e^{−iθ_η H_η} W_η,   (6)

where H_η is a Hermitian operator, e^{−iθ_η H_η} is a parameterized gate (such as a rotation), W_η is an unparameterized gate (such as a CNOT), and where θ_ij = (θ_1, . . . , θ_η, . . .). For simplicity of notation we here omit the parameter dependence in W_ij. Note that Eq. (6) can be readily used to determine the circuit description of W_ij. Moreover, if the operator H_η has eigenvalues ±1 (or more generally, if it only has two distinct eigenvalues), one can evaluate the cost partial derivative via the parameter-shift rule [58,59].
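The parameter-shift rule mentioned above can be verified numerically. The following sketch (our own toy circuit; any fixed circuit works) checks that, for a gate e^{−iθH} with H² = 1 (here H = Z), the shifted-cost difference C(θ + π/4) − C(θ − π/4) reproduces the derivative obtained by finite differences:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)

def cost(theta):
    # cost C(theta) = <psi| X |psi> for |psi> = e^{-i theta Z} H |0>,
    # a fixed single-qubit toy circuit (generator Z has eigenvalues +/-1)
    U = np.diag(np.exp([-1j * theta, 1j * theta]))          # e^{-i theta Z}
    Had = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
    psi = U @ Had @ np.array([1, 0], dtype=complex)
    return float(np.real(psi.conj() @ X @ psi))

theta = 0.7
# parameter-shift rule for a generator with eigenvalues +/-1
shift = cost(theta + np.pi / 4) - cost(theta - np.pi / 4)
fd = (cost(theta + 1e-6) - cost(theta - 1e-6)) / 2e-6       # finite difference
print(shift, fd)  # the two derivative estimates agree
```

For this circuit C(θ) = cos(2θ), so the shift formula returns the exact derivative −2 sin(2θ), while the finite difference only approximates it.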
On the other hand, the modules in the pooling layers can be thought of as a map from a two-qubit Hilbert space H_AB = H_A ⊗ H_B to a single-qubit Hilbert space H_B. Given a two-qubit state τ ∈ H_AB, the action of the measurement and the controlled unitary is given by

P_ij(τ) = Tr_A[ I_ij τ I†_ij ],   (7)

where Tr_A denotes the partial trace over H_A, and where

I_ij = Π_0 ⊗ U^ij_0 + Π_1 ⊗ U^ij_1.

Here Π_k = |k⟩⟨k| for k = 0, 1, and the U^ij_k are parameterized single-qubit unitaries. It is straightforward to verify that the operators I_ij are unitaries: I_ij I†_ij = I†_ij I_ij = 1.
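These claims admit a quick numerical check (with Haar-random single-qubit unitaries standing in for the parameterized U^ij_k): the pooling operator I = Π_0 ⊗ U_0 + Π_1 ⊗ U_1 is unitary, and the induced channel Tr_A[I τ I†] preserves the trace.

```python
import numpy as np

def haar_unitary(d, rng):
    # Haar-random unitary via QR decomposition of a complex Ginibre matrix
    z = (rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    ph = np.diag(r)
    return q * (ph / np.abs(ph))

rng = np.random.default_rng(1)
U0, U1 = haar_unitary(2, rng), haar_unitary(2, rng)
P0, P1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])     # projectors |0><0|, |1><1|
I = np.kron(P0, U0) + np.kron(P1, U1)                 # pooling operator on A x B
print(np.allclose(I.conj().T @ I, np.eye(4)))         # True: I is unitary

def pool(tau, I):
    out = I @ tau @ I.conj().T
    # partial trace over subsystem A (the measured qubit)
    return out.reshape(2, 2, 2, 2).trace(axis1=0, axis2=2)

tau = np.kron(np.diag([0.3, 0.7]), np.diag([0.6, 0.4])).astype(complex)
print(np.trace(pool(tau, I)).real)  # ~1.0: the channel is trace preserving
```

Unitarity of I holds for any choice of U_0 and U_1, which is why the pooling operators can later be absorbed into Haar-random convolutional blocks.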

D. Trainability and variance of the cost
Consider a given trainable parameter θ_µ in a unitary W_ij appearing in the ℓ-th layer of V(θ), which for simplicity we denote as W, and let the cost function partial derivative with respect to θ_µ be ∂C(θ)/∂θ_µ = ∂_µC. Then, let us define the two-qubit Hilbert space that W acts on as H_w, such that H_in = H_w ⊗ H_w̄.
As shown in [35][36][37][38], the trainability of the parameters in a randomly initialized QCNN can be analyzed by studying the scaling of the variance

Var[∂_µC] = ⟨(∂_µC)²⟩ − ⟨∂_µC⟩²,   (8)

where the expectation value ⟨· · ·⟩ is taken over the parameters in V(θ). We recall that under standard assumptions (see below) we have ⟨∂_µC⟩ = 0 [35,36]. Hence, from Chebyshev's inequality, Eq. (8) bounds the probability that the cost function partial derivative deviates from zero by more than a given c > 0 as

Pr( |∂_µC| ≥ c ) ≤ Var[∂_µC] / c².   (9)

Hence, if the variance is exponentially small (i.e., if the cost exhibits a barren plateau), the cost function gradient will be on average exponentially small, and an exponential precision will be needed to navigate the flat landscape.
In contrast, large variances show that no barren plateaus are present and that the trainability of the parameters under random initialization can be guaranteed. Let us now present the explicit form of ∂_µC. From Eqs. (4) and (6) we find that

∂_µC = i Tr[ Ô V_R W_B [ W_A V_L σ V†_L W†_A , H_µ ] W†_B V†_R ],   (10)

where we write V = V_R W V_L. That is, V_R and V_L contain all gates in the QCNN except for W. We henceforth omit the parameter dependency for simplicity of notation. Moreover, from the expansion of W in (6) we also define W_A as the product of gates up to and including e^{−iθ_µ H_µ}, and W_B as the product of the remaining gates (11), so that W = W_B W_A. We remark that it is straightforward to verify that if H_µ is an idempotent operator, then ⟨∂_µC⟩ = 0.

E. Haar-distributed unitaries
When training the QCNN, a common strategy is to randomly initialize the parameters in V(θ) [34]. Hence, it is convenient to define W_ij as the set of unitaries obtained from W_ij(θ_ij) under each random initialization. To analyze the trainability of the QCNN we assume that the sets W_ij form independent local 2-designs.
Let us recall the definition of a t-design. Let W = {W_y}_{y∈Y} be a set of unitaries in a d-dimensional Hilbert space. Then, W is a t-design if for every polynomial P_{t,t}(W) of degree t in the matrix elements of W and of degree t in those of W†, the expectation value of P_{t,t}(W) over the set W is equal to the expectation value over the Haar distribution. That is,

(1/|Y|) Σ_{y∈Y} P_{t,t}(W_y) = ∫ dµ(W) P_{t,t}(W),   (12)

where the integral is taken with respect to the Haar measure over the unitary group of degree d.
The previous assumption has the following key consequence. Let us recall that the pooling layer is formed by pooling operators defined according to Eq. (7). Then, as shown in Fig. 1, every pooling operator I is preceded by a unitary W in the second sub-layer of the convolutional layer, such that their combined action can be obtained from IW. As W forms a 2-design, it is left- and right-invariant under the action of the unitary group. That is, for any function F(W) and for any unitary matrix A,

∫ dµ(W) F(W) = ∫ dµ(W) F(AW) = ∫ dµ(W) F(WA).   (13)

Equation (13) shows that the action of the operators I_ij in the pooling layer can be absorbed into the action of the unitaries in the convolutional layers, as the variance of the cost function partial derivative is independent of the controlled unitaries in I_ij. Hence, as schematically shown in Fig. 2, we can consider that the unitary V(θ) of the QCNN is simply composed of two-qubit unitaries W_ij which form independent 2-designs.
Finally, let us remark that, as shown in [36], if W_A and W_B form independent 2-designs, then ⟨∂_µC⟩ = 0 and Eq. (8) can be expressed as a sum of the form

Var[∂_µC] = Σ_{p,q,p',q'} ⟨ΔΩ^{q'p'}_{qp}⟩_{V_R} ⟨ΔΨ^{p'q'}_{pq}⟩_{V_L}.   (14)

Here the summation runs over all bitstrings p, q, p', q' labeling the 2^{n−2} computational-basis states of H_w̄. In addition, we define the quantities ΔΩ^{q'p'}_{qp} and ΔΨ^{p'q'}_{pq} (Eqs. (15)-(16)) in terms of operators Ω_qp and Ψ_qp on H_w (Eq. (17)), obtained by projecting V†_R Ô V_R and V_L σ V†_L, respectively, onto |p⟩⟨q| on the qubits outside H_w. We use Tr_w̄ to denote the trace over all qubits not in H_w.

III. Main results
In this section we present our main results. First, we introduce a novel method for analyzing the scaling of the variance in Eq. (14), which we call the Graph Recursion Integration Method (GRIM). Specifically, our method can be used when employing the Weingarten calculus to integrate unitaries over the unitary group. As discussed below, the GRIM is based on finding recursions when integrating groups of unitaries to form a graph, which can be readily used to compute the scaling of the cost function partial derivative variance in (14). Second, we employ the GRIM to obtain a lower bound on Var[∂_µC] for the QCNN architecture. Moreover, this lower bound can be used, under certain standard assumptions, to guarantee the trainability of the QCNN. We then present our results guaranteeing the trainability of a pooling-based QCNN. Finally, we present heuristic results obtained from numerically computing Var[∂_µC].

A. Graph Recursion Integration Method (GRIM)
Let us consider the task of computing the expectation values ⟨ΔΩ^{q'p'}_{qp}⟩_{V_R} and ⟨ΔΨ^{p'q'}_{pq}⟩_{V_L} in Eq. (14). In [36] it was shown that one can solve this problem by sequentially integrating each Haar-distributed unitary via the elementwise formula of the Weingarten calculus [60,61] (see Appendix).
While the previous approach has been successfully used to analyze the trainability of parametrized quantum circuits and of QNNs [36,37], it has the difficulty that the number of terms that one has to keep track of increases exponentially with the number of unitaries integrated. The GRIM allows us to circumvent this difficulty by simultaneously integrating specific groups of unitaries and recursively grouping the resulting terms to form a graph G_w. As such, the GRIM has to be implemented to form one graph for the integration of the gates in V_L and likewise one for the integration of the gates in V_R. In what follows we consider the case of obtaining the graph to compute the expectation value ⟨ΔΩ^{q'p'}_{qp}⟩_{V_R}. Moreover, we henceforth assume that V_R is composed of all the unitaries in the forward light-cone L_F of W.
The first step of the GRIM is to group the unitaries in V_R into modules that cover all of L_F. We remark that different modules will lead to different graphs, and that the fewer distinct modules one employs, the smaller the graph will be in terms of the number of distinct nodes. In Fig. 3(a) we show the three distinct basic modules needed to analyze the QCNN architecture, such that for any W in the ℓ-th layer, L_F can be covered with such modules. Specifically, Fig. 3(b) depicts an example where L_F can be covered by three middle modules M_M, an edge module M_E, and a center module M_C. The next step in the GRIM is to integrate the gates in each module M to form the graph. Each node N_i ∈ G_w will correspond to a different contraction of operators containing the modules, while the edges of the graph will be associated with real coefficients λ_{i,j} ∈ (0, 1). In this notation, the coefficient λ_{i,j} corresponds to the oriented edge N_i → N_j. Moreover, the initial node can always be defined from (15) as

N_1(T, T') = Tr[T T'] − Tr[T] Tr[T']/4,   (19)

where T and T' are two-qubit operators. Following (17), here T' contains primed operators obtained from |p'⟩⟨q'|. Then, as shown in the Appendix, the result of integrating a given module M can be expressed as

∫ dµ(M) N_i(T, T') = Σ_j e_{ij} N_j(T̃, T̃'),   (20)

where the integral is taken over the unitaries W_ij that appear in the module M, and where the terms e_{ij} lead to the edge coefficients λ_{i,j}. Moreover, here the operator T̃ is obtained from T such that it contains the same modules as the ones in T, except for the one that was integrated according to (20).
Equation (20) is at the basis of the GRIM, as it shows that when a module is integrated one connects the nodes N_i → N_j for all j in the summation. By sequentially integrating all unitaries, one can form an oriented graph which always starts with N_1 according to (15). As discussed in the Appendix, the number of nodes in the graph, denoted |G_w|, is always in O(ℓ) for the QCNN architecture. Additionally, the structure of the graphs G_w is recursive, which allows us to obtain results valid for arbitrary ℓ and for all placements of the unitary W. In Fig. 4(a) we show an example where W is at the edge of the ℓ-th layer, and where the forward light-cone L_F is composed of (ℓ − 1) edge modules and a single center module. The graph that arises from integrating the unitaries in the M_E modules is shown in Fig. 4(b). We refer the reader to the Appendix for additional details on the derivation of this graph and the definition of the N_i nodes. For this specific case the graph contains |G_w| = (ℓ + 1) nodes.
Once the graph G_w has been formed, one can use it to explicitly compute the contribution of expectation values to the variance of (14). Specifically, the following Proposition, proved in the Appendix, holds.

FIG. 4. (a) Example where W, indicated with a larger colored circle, is at the edge of the ℓ-th layer. (b) Graph G_w obtained via the GRIM by integrating all modules M_E in L_F for the case in (a). Here the edge coefficients λ_{i,j} are real and positive numbers such that λ_{i,j} ∈ (0, 1). Moreover, for this case the number of nodes is (ℓ + 1), and the structure of the graph is such that one can easily obtain the graph for any value of ℓ.
Proposition 1. The variance contribution can be computed via the GRIM as

Σ_{g∈P(G_w)} Λ_g ε_O ε_{σ_w},   (21)

where σ_w = Tr_w̄[V_L σ V†_L] is the reduced operator σ on the subsystem of the qubits that W acts on, and where we defined for an operator O

ε_O = D_HS( O, Tr[O] 1/4 ),   (22)

with D_HS(A, B) = Tr[(A − B)²] the Hilbert-Schmidt distance between A and B. Here G_w denotes the oriented graph obtained by integrating all the modules in L_F, P(G_w) is the set of all paths of length ℓ over G_w, and g = (g_1, . . . , g_ℓ) is a vector which indicates the nodes in the path, such that g_i ∈ [1, . . . , |G_w|]. Finally, the Λ_g are real positive numbers obtained as

Λ_g = Π_{ξ=1}^{ℓ−1} λ_{g_ξ, g_{ξ+1}},   (23)

where λ_{g_ξ, g_{ξ+1}} ∈ (0, 1) are the edge coefficients associated with the oriented edge N_{g_ξ} → N_{g_{ξ+1}}. Note that ε_O = N_1(O) from Eq. (19) by taking T = T', meaning that here we are using the surprising fact that the final node is always N_1(O). Proposition 1 shows that, given a unitary W, the contribution from the expectation value of ΔΩ^{q'p'}_{pq} can ultimately be obtained by adding up all the paths in the oriented graph G_w, weighted by the coefficients associated with the edges taken over the path. Moreover, since ε_O and ε_{σ_w} are always positive, any single path over the graph leads to a lower bound for Var[∂_µC]. Finally, we remark that one can also employ the GRIM to integrate the gates in V_L to compute the expectation value ⟨ε_{σ_w}⟩_{V_L}.
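The path-sum structure of Proposition 1 — summing Λ_g = Π λ_{g_ξ,g_{ξ+1}} over all length-ℓ paths of an oriented graph — can be emulated on a toy graph. The edge weights below are illustrative choices of ours, apart from the 28/125 self-loop value taken from the center-module example:

```python
from itertools import product

# Toy emulation of the GRIM path sum: total weight of all length-L paths
# starting at node 1, where each path g contributes the product of its
# edge coefficients. Absent edges contribute weight zero.
lam = {
    (1, 1): 28 / 125,   # self-loop (cf. the center-module example)
    (1, 2): 1 / 5,      # illustrative
    (2, 2): 1 / 2,      # illustrative
}

def path_weight_sum(lam, start, length):
    nodes = sorted({i for edge in lam for i in edge})
    total = 0.0
    for tail in product(nodes, repeat=length):
        g = (start,) + tail
        w = 1.0
        for a, b in zip(g, g[1:]):
            w *= lam.get((a, b), 0.0)  # missing edge kills the path
        total += w
    return total

print(path_weight_sum(lam, start=1, length=2))
```

Since every term in the sum is positive, truncating to any single path gives a valid lower bound, mirroring the remark that one path suffices to lower bound Var[∂_µC].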
In what follows we briefly give an example of how to use the GRIM to compute the expectation value ⟨ΔΩ^{q'p'}_{qp}⟩_{V_R} for the case when W is the center unitary of the ℓ-th layer of the QCNN, as indicated in Fig. 5(a). Here we can see that L_F can be covered with center modules M_C. Employing the notation O_ℓ = Ω_qp, and working in the tensor-network representation of quantum circuits shown in Fig. 5(b), we can always write a recursion relation in which O_{ℓ−1} is obtained from O_ℓ by integrating the ℓ-th module M_C (Eq. (24)); here |p⟩⟨q|_{ℓ−1} denotes the projectors on the qubits that the ℓ-th module M_C acts on. With this notation it then follows that O_0 = O. By employing the elementwise formula of the Weingarten calculus to integrate unitaries over the Haar distribution (see Appendix), we find an expression of the form (20) for all ℓ, where the integration is over all unitaries W_ij in the ℓ-th module M_C. As depicted in Fig. 5(c), the coefficients e_1, e_2, and e_3 arise from the operators p_1, p_2, and p_3. Then, as shown in the Appendix, the operators p_2 and p_3 lead to a factor 1/2, while the operator p_1 leads to a factor of 1, so that (4/25)(e_1 + 2(e_2 + e_3)/5) = 28/125. Equation (25) shows how the structure of the graph G_w emerges, as integrating a module connects the node N_1 with itself with a coefficient λ_{1,1} = 28/125. The ensuing graph is presented in Fig. 5(d), and we can see that it is composed of a single node.
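The edge-coefficient arithmetic quoted above can be checked with exact rational arithmetic (with e_1 = 1 and e_2 = e_3 = 1/2, as stated in the text):

```python
from fractions import Fraction

# Exact check of the self-loop coefficient from the center-module example:
# lambda_{1,1} = (4/25) * (e1 + 2*(e2 + e3)/5) with e1 = 1, e2 = e3 = 1/2.
e1, e2, e3 = Fraction(1), Fraction(1, 2), Fraction(1, 2)
lam11 = Fraction(4, 25) * (e1 + 2 * (e2 + e3) / 5)
print(lam11)  # 28/125
```

Since 28/125 < 1, repeatedly traversing this self-loop multiplies the path weight by a constant per layer, which is the mechanism the logarithmic depth keeps from becoming an exponential suppression.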

B. Trainability of the QCNN
Here we present the main result of our article, which is derived using the GRIM. The detailed proofs can be found in the Appendix. For convenience of notation we here consider the case when W is in the first sub-layer. The case where it belongs to the second sub-layer is considered in the Appendix, and we remark that the results are substantially the same for both cases, albeit with some notational differences.

Theorem 1. Consider the QCNN cost function of Eq. (4), and let W be a unitary in the first sub-layer of the ℓ-th layer of V(θ). Moreover, let us assume that the unitaries W_ij, W_A, and W_B in the convolutional and fully connected layers form independent 2-designs. Then, the lower bound on the variance of the cost partial derivative with respect to a parameter θ_µ in the unitary W can be computed, using the GRIM, from the graph G_w as

Var[∂_µC] ≥ F_n(L, ℓ),   (26)

where F_n(L, ℓ) = (1/9) Tr[· · ·] is given explicitly in Eq. (27) and collects the path weights Σ_{g∈P(G_w)} Λ_g together with the factors ε_O and ε_{σ_w}. Here σ_w = Tr_w̄[σ] is the reduced operator σ on the subsystem of the qubits that W acts on. Moreover, ε_O, P(G_w), g, and Λ_g are defined in Proposition 1.
From Theorem 1 we can obtain the following corollary.
Corollary 1. Consider the function F_n(L, ℓ) defined in Eq. (27). Then, since L is at most O(log(n)), the variance Var[∂_µC] vanishes no faster than 1/poly(n).

Let us now consider the main implication of this result. Corollary 1 provides conditions under which no barren plateaus arise and the trainability of the parameters in the unitaries of the QCNN architecture can be guaranteed. This is due to the fact that the determination of a cost-minimizing direction in the parameter hyperspace requires only a polynomially large precision. This is in contrast to landscapes which exhibit a barren plateau, where an exponentially large precision is needed to navigate through the flat landscape.
We remark that at the core of the result in Corollary 1 lies the fact that QCNNs are shallow, as they have at most a number of layers L in O(log(n)). Such shallowness naturally arises from the intrinsic structure of the QCNN architecture, where the number of qubits is reduced at each layer. Hence, terms such as 1/50^{L−ℓ+1} or Λ_g in Theorem 1 can vanish no faster than Ω(1/poly(n)) if L is in O(log(n)).
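The underlying scaling fact is elementary but worth making explicit: a constant per-layer factor raised to a logarithmic depth is only polynomially small, since c^{log₂ n} = n^{log₂ c}. A short check:

```python
import math

# A per-layer factor c raised to depth L = log2(n) equals n^{log2 c}:
# exponential-in-depth suppression becomes polynomial-in-n suppression
# once the depth is logarithmic (here c = 1/50, so the power is -log2(50)).
for n in [4, 16, 256, 4096]:
    L = math.log2(n)
    direct = (1 / 50) ** L
    as_poly = n ** (-math.log2(50))
    print(n, direct, math.isclose(direct, as_poly))
```

For c = 1/50 the decay is n^{−log₂ 50} ≈ n^{−5.6}: steep, but polynomial, in contrast with the 2^{−n}-type decay of a barren plateau.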
As indicated in Corollary 1, the bound in (28) additionally requires that ε_O and ε_{σ_w} are not exponentially vanishing. The latter means that O and the reduced operator σ_w need to respectively have a Hilbert-Schmidt distance to the (scaled) identity operator which is not exponentially vanishing. Note that this condition is expected, as extracting information by measuring operators close to the identity is naturally a hard task. Similarly, attempting to train a quantum circuit on a state close to the identity is also a difficult task. Here we remark that in the Appendix we relax the condition on σ, so that we can guarantee that the lower bound on Var[∂_µC] vanishes no faster than polynomially with n if the reduction of σ is not exponentially close to Tr[σ] 1/4 (as measured by the contraction in N_1) on any pair of qubits in the backwards-propagated light-cone of W.

C. Trainability from pooling
As previously discussed, when the unitaries W_ij in the convolutional layers form 2-designs, the action of the operators I_ij in the pooling modules can be absorbed by the action of their neighboring W_ij. Here, we analyze the trainability of the QCNN for the special case when the unitaries W_ij act trivially, i.e., when W_ij = 1 for all i, j. This case is illustrated by the pooling modules in Fig. 6, in which an input state (ρ^(0))^⊗n is processed by pooling layers, each locally described by the channel of Eqs. (29)-(30), in which a controlled unitary implements the pooling operation. Moreover, here we assume for simplicity that ρ^(0) = |0⟩⟨0|, so that the input is the all-zero state, where we recall that n = 2^L. As explicitly proved in the Appendix, the following theorem holds.

FIG. 6. Schematic circuits of a pooling-based QCNN. The left module belongs to the layer that acts prior to the one which contains the right module. As shown, after each pooling layer, half the qubits are measured. In this special case, the unitaries in the convolutional and fully connected layers of the QCNN act trivially.
Theorem 2. Consider a QCNN where the convolutional unitaries are W_ij = 1 for all i, j and where the action of the pooling layers can be described by Eqs. (29)-(30). Then, the expectation value of the magnitude of the cost function partial derivative ∂C/∂θ^(k) = ∂_kC with respect to any parameter θ^(k) in a pooling module satisfies the lower bound of Eq. (31).

Theorem 2 shows that the pooling-based QCNN does not exhibit a barren plateau, as ⟨|∂_kC|⟩ vanishes no faster than polynomially with the system size.

D. Numerical Verification
In this section we present numerical results to analyze the scaling of the cost function partial derivative variance. Specifically, we consider here two cases: when all the unitaries W_ij in each convolutional module are independent, and when all the unitaries in the same layer are identical. The latter corresponds to the QCNN architecture as originally introduced in [34]. As shown below, our numerics indicate that correlating ("corr") the unitaries W_ij (to make them identical) leads to a variance of the cost function partial derivatives which is larger than the one obtained in the uncorrelated ("uncorr") case, i.e.,

Var_corr[∂_µC] ≥ Var_uncorr[∂_µC].   (32)

This suggests that the lower bound obtained in Theorem 1 can also be used as a lower bound for the correlated case. We remark that in [47] it was shown that correlating the parameters in an ansatz can lead to larger variances. The QCNN in our heuristics is constructed such that each W_ij is given by the parametrized circuit of Fig. 7(a). As shown in [62], this decomposition allows us to prepare any two-qubit unitary (up to a global phase). Since we simulate QCNNs with an even number of qubits which is not always a power of 2, in the pooling layers we trace out qubits so that each convolutional layer always acts on an even number of qubits. Moreover, for simplicity, we consider the case where the training set contains a single state initialized to the all-zero state ρ = |0⟩⟨0|^⊗n, and at the end of the QCNN we measure the operator O = Z ⊗ Z, where Z denotes the Pauli-z operator. Finally, here the trainable parameter θ_µ is in the center unitary acting on the first layer. We remark that the results obtained for this case are representative of other possible choices for the location of θ_µ.
For each number of qubits n ∈ {4, 6, ..., 26}, we simulated 200 instances of randomly initialized QCNNs and computed the derivative ∂ µ C. All the simulations were performed using TensorFlow Quantum [63]. In Fig. 7(b) we show the results for Var[∂ µ C] as a function of n for the cases when the W ij are correlated and uncorrelated. First, let us remark that the variance for the correlated case is always larger than that of the uncorrelated case as in (32). Moreover, noting that the y-axis is a log scale, we can see that the curves are sub-linear, meaning that the scaling of Var[∂ µ C] is sub-exponential. This result then numerically verifies that QCNNs do not exhibit barren plateaus in their training landscape.
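Our simulations used TensorFlow Quantum; as a self-contained illustration of the kind of experiment involved, the following numpy-only sketch (a toy 4-qubit, one-layer analogue of the circuits described above, not the code behind Fig. 7) estimates Var[∂_µC] over Haar-random blocks using the parameter-shift rule:

```python
import numpy as np

def haar_unitary(d, rng):
    # Haar-random unitary via QR decomposition of a complex Ginibre matrix
    z = (rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    ph = np.diag(r)
    return q * (ph / np.abs(ph))

def apply_2q(psi, U, a, b, n):
    # apply a 4x4 unitary U to qubits (a, b) of an n-qubit statevector
    psi = np.moveaxis(psi.reshape((2,) * n), (a, b), (0, 1))
    shp = psi.shape
    psi = (U @ psi.reshape(4, -1)).reshape((2, 2) + shp[2:])
    return np.moveaxis(psi, (0, 1), (a, b))

def cost(theta, blocks, n=4):
    # toy QCNN-like circuit: one convolutional layer (two sub-layers), then
    # qubits 0 and 2 are dropped (pooling) and a fully connected block acts
    # on qubits 1 and 3 before measuring Z x Z on them
    WA, WB, W23, W12, F = blocks
    psi = np.zeros((2,) * n, dtype=complex)
    psi[(0,) * n] = 1.0
    rz = np.kron(np.diag(np.exp([-1j * theta, 1j * theta])), np.eye(2))
    psi = apply_2q(psi, WB @ rz @ WA, 0, 1, n)  # W = WB e^{-i th Z} WA
    psi = apply_2q(psi, W23, 2, 3, n)
    psi = apply_2q(psi, W12, 1, 2, n)           # second sub-layer
    psi = apply_2q(psi, F, 1, 3, n)             # fully connected layer
    z = np.array([1.0, -1.0])
    w = np.abs(psi.reshape((2,) * n)) ** 2
    w = np.moveaxis(w, (1, 3), (0, 1)).reshape(2, 2, -1)
    return float(np.einsum('i,j,ijk->', z, z, w))  # <Z_1 Z_3>

rng = np.random.default_rng(0)
grads = []
for _ in range(200):
    blocks = [haar_unitary(4, rng) for _ in range(5)]
    th = rng.uniform(0, 2 * np.pi)
    # parameter-shift rule for a generator with eigenvalues +/-1
    grads.append(cost(th + np.pi / 4, blocks) - cost(th - np.pi / 4, blocks))
print(np.var(grads))  # finite and well away from zero for this small instance
```

Scanning n as in Fig. 7 would only require extending the circuit builder to deeper QCNNs; the Haar-random blocks play the role of the randomly initialized two-qubit unitaries of Fig. 7(a).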

IV. Discussion
Investigating the barren plateau phenomenon in parametrized quantum circuits and quantum neural networks is fundamental to understanding their trainability. In the absence of a barren plateau, the determination of a minimizing direction in the cost function landscape does not require an exponentially large precision, meaning that one can always navigate through the landscape by measuring expectation values with a precision that grows at most polynomially with the system size. We remark that polynomial overhead is the standard goal for quantum algorithms, which aim to achieve a speedup over classical algorithms that often scale exponentially. Hence, the absence or existence of a barren plateau can determine whether or not a quantum speedup is achievable.
In this work we analyzed the gradient scaling for the Quantum Convolutional Neural Network (QCNN) architecture. We first introduced a novel method, called the Graph Recursion Integration Method (GRIM), which greatly simplifies the computation of expectation values of quantities that are expressed in terms of Haar-distributed unitaries. The goal of the GRIM is to create an oriented graph that can be used to evaluate these expectation values.
We then employed the GRIM to derive our main result in Theorem 1, which provides a general lower bound on the variance of the cost function gradient. In Corollary 1 we derived the asymptotic scaling of this lower bound under standard assumptions and showed that it vanishes no faster than polynomially with the system size. This result confirms the absence of barren plateaus in QCNNs and allows us to guarantee the trainability of this architecture when randomly initializing its parameters. Moreover, our work highlights QCNNs as potentially being generically trainable under random initialization, and hence sets them apart from other QNN architectures that are trainable only under certain conditions.
Since the results in Theorem 1 were derived assuming that the unitaries in the convolutional layers form independent 2-designs, we additionally considered the case when they are trivial and equal to the identity. In Theorem 2 we show that this pooling-based QCNN, i.e., a QCNN which is composed only of pooling layers, is also trainable when initialized randomly. In addition, we present numerical evidence to support our theoretical results, and we show that correlating the parameters in the convolutional layers (as in the original QCNN architecture) leads to larger gradient variances, meaning that our results can also be useful for that particular case. Similarly, when the two-qubit gates do not form 2-designs, one can expect the variance of the gradient to be larger [64]. Moreover, let us remark that while we have here considered cost functions which are linear with respect to the input density matrices, as shown in Section II B, this type of cost is quite general and encompasses a wide range of applications (such as classification).
While our work proves the trainability of the QCNN architecture, there are many future research directions that can be pursued to generalize our results. First, one can analyze more general cost functions than the ones considered here. For instance, one could consider mean-squared cost functions, such as the ones employed in regression problems, where C is a quadratic function of the expectation value Tr[V(θ)σV†(θ)O]. In addition, due to the close connection between the QCNN architecture and the Multi-scale Entanglement Renormalization Ansatz (MERA) [34], it would be interesting to analyze how our results and methods can be employed to derive trainability results for MERA frameworks. Finally, we believe the GRIM will be useful in analyzing the scaling of the cost function partial derivative variance for other VQA and QNN architectures.
Note added. After completion of this work, Ref. [65] was posted, in which the authors analyze the existence of barren plateaus in QNNs with a tree tensor network structure, concluding that this architecture also does not exhibit barren plateaus.

Appendix 1: Preliminaries
In this section, we present preliminary notation and results needed to derive our main Theorems.

A. Haar distributed unitaries and integration over the unitary group
Let us here recall the definition of a t-design. Let W = {W_y}_{y∈Y} be a finite set of size |Y| of d-dimensional unitaries W_y, and let P_{t,t}(W) be a polynomial of degree at most t in the matrix elements of W, and of degree at most t in those of W†. We say that W is a t-design if for every such polynomial we have

(1/|Y|) Σ_{y∈Y} P_{t,t}(W_y) = ∫ dμ(W) P_{t,t}(W) , (33)

where the integral is taken with respect to the Haar measure over the unitary group. Then, to compute expectation values as in (33) it is useful to recall formulas which allow for symbolic integration with respect to the Haar measure on the unitary group U(d) of degree d. Specifically, following the elementwise formulas of the Weingarten calculus [60,61], the first two moments are given by

∫ dμ(W) w_{ij} w*_{i'j'} = δ_{ii'} δ_{jj'} / d ,

∫ dμ(W) w_{ij} w_{kl} w*_{i'j'} w*_{k'l'} = (δ_{ii'} δ_{jj'} δ_{kk'} δ_{ll'} + δ_{ik'} δ_{jl'} δ_{ki'} δ_{lj'}) / (d² − 1)
  − (δ_{ii'} δ_{jl'} δ_{kk'} δ_{lj'} + δ_{ik'} δ_{jj'} δ_{ki'} δ_{ll'}) / (d(d² − 1)) ,

where w_{ij} are the matrix elements of the unitary W ∈ U(d). Here we remark that the integrations in our calculations were performed by employing the Random Tensor Network Integrator (RTNI) package for symbolic integration over Haar-distributed unitaries [66].
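The two moment formulas above are easy to sanity-check numerically by sampling Haar-random unitaries (here generated via the QR-decomposition method; RTNI itself is symbolic and is not used in this sketch):

```python
import numpy as np

def haar_unitary(d, rng):
    # Haar-random unitary from the QR decomposition of a complex
    # Gaussian matrix, with the phases of R's diagonal fixed.
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * (np.diagonal(R) / np.abs(np.diagonal(R)))

rng = np.random.default_rng(1)
d, N = 3, 20000
samples = np.array([haar_unitary(d, rng) for _ in range(N)])
w00 = samples[:, 0, 0]
w11 = samples[:, 1, 1]

# First moment: E[w_ij conj(w_i'j')] = delta_ii' delta_jj' / d.
m_diag = np.mean(np.abs(w00) ** 2)    # expect 1/d
m_off = np.mean(w00 * np.conj(w11))   # expect 0
# From the second-moment formula with all indices equal:
# E[|w_ij|^4] = 2/(d(d+1)).
m4 = np.mean(np.abs(w00) ** 4)

print(m_diag, abs(m_off), m4)  # approximately 0.333, 0, 0.167 for d=3
```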
For the proof of Lemma 1, see [36].
Lemma 3. Consider the operator ∆Ψ^{p′q′}_{pq} defined in the main text, where p, q, p′, q′ are vectors of length 2^n − d. Let H be a Hilbert space of n qubits and let H_w be a d-dimensional subsystem, such that H_w ⊗ H_w̄ = H. Then, let S and S̄ be two disjoint sets of qubits not in H_w such that H_S ⊗ H_S̄ = H_w̄. Then the equality in (45) holds, where σ_{S̄,w} = Tr_S[σ_L] and σ_L = V_L σ V_L†. Here Tr_S[·] indicates the partial trace over H_S.

Proof. Let us consider the first term in (45). We obtain the chain of equalities in (54), where the last line comes from Lemma 1. Similarly, for the second term in (45) we obtain (55). Therefore, combining the results from (54) and (55) we find the claimed equality, where in the last step we invoked Lemma 2.
Lemma 4 (Monotonicity of the trace distance). Let H = H_A ⊗ H_B be a tensor product Hilbert space, and let R_AB and S_AB be two operators on H. Then, the following inequality holds:

D_T(R_B, S_B) ⩽ D_T(R_AB, S_AB) ,

where R_B = Tr_A(R_AB) and S_B = Tr_A(S_AB). Here D_T(R_B, S_B) = (1/2)||R_B − S_B||_1 is the trace distance, with ||·||_1 the Schatten 1-norm.
Remark. The monotonicity of the trace distance is usually proven when R AB and S AB are density matrices. The proof we present here is valid for any two matrices.
Proof. By duality of Schatten norms [67], we have

||R_B − S_B||_1 = max_{||Λ_B||_∞ ⩽ 1} |Tr[Λ_B (R_B − S_B)]| . (60)

Therefore, there exists a matrix Λ*_B with ||Λ*_B||_∞ = 1 such that the maximum in (60) is reached, i.e.,

||R_B − S_B||_1 = |Tr[Λ*_B (R_B − S_B)]| = |Tr[(1_A ⊗ Λ*_B)(R_AB − S_AB)]| ⩽ ||R_AB − S_AB||_1 .

The inequality follows from (60) and the fact that ||1_A ⊗ Λ*_B||_∞ = 1, while the last step is simply another application of the Schatten norm duality.
We remark that an alternative proof, for the case where R_AB and S_AB are Hermitian, proceeds as follows. Let us begin by noting that one can write R_AB − S_AB = P_AB − Q_AB, where P_AB ⩾ 0, Q_AB ⩾ 0, and P_AB is orthogonal to Q_AB. The orthogonality condition gives |P_AB − Q_AB| = P_AB + Q_AB. One then has

||R_B − S_B||_1 = ||P_B − Q_B||_1 ⩽ ||P_B||_1 + ||Q_B||_1 = Tr[P_AB + Q_AB] = ||R_AB − S_AB||_1 ,

where P_B = Tr_A(P_AB) and Q_B = Tr_A(Q_AB). Note that the inequality in this proof follows from the triangle inequality for matrix norms.
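Both proofs can be sanity-checked numerically. The sketch below draws arbitrary (non-Hermitian) random matrices, as allowed by the remark above, and verifies that the trace distance never increases under the partial trace; the dimensions are made up for illustration:

```python
import numpy as np

def trace_dist(R, S):
    # D_T(R, S) = (1/2) ||R - S||_1; the Schatten 1-norm is the
    # nuclear norm (sum of singular values).
    return 0.5 * np.linalg.norm(R - S, ord='nuc')

def partial_trace_A(M, dA, dB):
    # Trace out the first tensor factor of an operator on C^dA (x) C^dB.
    return np.einsum('iaib->ab', M.reshape(dA, dB, dA, dB))

rng = np.random.default_rng(2)
dA, dB = 3, 4
for _ in range(50):
    R = (rng.normal(size=(dA * dB, dA * dB))
         + 1j * rng.normal(size=(dA * dB, dA * dB)))
    S = (rng.normal(size=(dA * dB, dA * dB))
         + 1j * rng.normal(size=(dA * dB, dA * dB)))
    lhs = trace_dist(partial_trace_A(R, dA, dB), partial_trace_A(S, dA, dB))
    rhs = trace_dist(R, S)
    assert lhs <= rhs + 1e-9
print("monotonicity holds on all random instances")
```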
Proof. The equivalence of matrix norms gives us the following inequality between the Frobenius norm ||M||_2 and the trace norm ||M||_1 of a matrix M of dimension d [68]:

||M||_2 ⩽ ||M||_1 ⩽ √d ||M||_2 . (69)

The first inequality comes from the monotonicity of the Schatten norms [67,69] and the second one from the Cauchy-Schwarz inequality. Since D_HS(R, S) = ||R − S||²_2 and D_T(R, S) = (1/2)||R − S||_1, it follows from Eq. (69) that

(1/4) D_HS(R, S) ⩽ D_T(R, S)² ⩽ (d/4) D_HS(R, S) . (70)

Therefore, the claimed bound follows in our case: the first and last inequalities come from the equivalence of matrix norms given by Eq. (70), while the second inequality follows from the monotonicity of the trace distance as stated in Lemma 4.
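The norm-equivalence inequalities in (69) can likewise be checked numerically on random matrices; the dimension below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
for _ in range(100):
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    fro = np.linalg.norm(M, ord='fro')     # ||M||_2 (Frobenius norm)
    nuc = np.linalg.norm(M, ord='nuc')     # ||M||_1 (trace norm)
    assert fro <= nuc + 1e-9               # monotonicity of Schatten norms
    assert nuc <= np.sqrt(d) * fro + 1e-9  # Cauchy-Schwarz on singular values
print("norm equivalence verified on all random instances")
```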

Appendix 2: Proof of Theorem 1
In this section we present a proof of Theorem 1, which establishes a lower bound for the variance of the cost function partial derivative in terms of the coefficients of a graph obtained through the GRIM. Moreover, in this section we also present a proof of Proposition 1. Let us restate the theorem here for convenience:

Theorem 1. Consider the QCNN cost function of Eq. (4) and let W be a unitary in the first sub-layer of the ℓ-th layer of V(θ). Moreover, let us assume that the unitaries W_ij, W_A and W_B in the convolutional and fully-connected layers form independent 2-designs. Then, a lower bound on the variance of the cost function partial derivative, with respect to a parameter θ_µ in the unitary W, can be computed using the GRIM from the graph G_w, with prefactor F_n(L, ℓ). Here σ_w = Tr_w̄[σ] is the reduced operator of σ on the subsystem of the qubits that W acts on. Moreover, ε_O, P(G_w), g and Λ_g are defined in Proposition 1.

A. Variance of the cost function partial derivative
Let us assume here that the block W is in the ℓ-th layer of the QCNN. As discussed in the main text, we can decompose V as a product in which V_R contains all the gates in the forward light-cone L_F of W, and V_L contains all remaining gates. Moreover, W can also be decomposed accordingly. As shown in [36], the partial derivative of C(θ) with respect to θ_µ can be written in this form, and if either W_A or W_B forms a 1-design, we obtain an expression in which the summation runs over all bitstrings p, q, p′, q′ of length 2^{n−2}. In addition, we defined the operators ∆Ω^{q′p′}_{qp} and ∆Ψ^{p′q′}_{pq}, where Ω_qp and Ψ_qp are operators on H_w. Here Tr_w̄ indicates the trace over all qubits not in H_w, with H_w being the Hilbert space associated with the qubits that W acts on.
In what follows, we employ the GRIM to compute the contributions of the expectation values ⟨∆Ω^{q′p′}_{qp}⟩_{V_R} and ⟨∆Ψ^{p′q′}_{pq}⟩_{V_L} to the variance of the cost function partial derivative.
B. Integration over the unitaries in VR via the GRIM

Grouping the unitaries in modules
Let us now focus on the term ⟨∆Ω^{q′p′}_{qp}⟩_{V_R}. As discussed in the main text, we here employ the GRIM, whose first step is to group the unitaries in L_F into modules M. Moreover, as shown in Fig. 8, for the QCNN architecture one only needs three basic modules: a center module M_C, a middle module M_M and an edge module M_E. Note that different choices of W lead to different modules being employed. Surprisingly, for any W we will have one of the following three cases:

• Case 1: The light-cone L_F can be covered by ℓ modules M_C.
• Case 2: The light-cone L_F can be covered by (ℓ − 1) modules M_E and a module M_C.
• Case 3: The light-cone L_F can be covered by ℓ′ modules M_M, ℓ″ modules M_E and a module M_C, such that ℓ′ + ℓ″ + 1 = ℓ.
Let us make the following important remarks. First, note that the final module is always M_C. Second, let us recall that W can be in the first or in the second sub-layer. As indicated in Fig. 8(a), if W is in the first sub-layer, then W does not belong to any module. However, if W is in the second sub-layer, then it is always one of the unitaries in a given module (see Fig. 8(b)). This can alternatively be thought of as introducing additional basic initial modules, obtained by taking M_C, M_M and M_E and removing a unitary from the second sub-layer. More formally, the first step of the GRIM implies that V_R can always be decomposed into a set of modules V_R^{(i)}. Given this decomposition, it is convenient to rewrite Ω_qp in terms of the qubits each module acts on. Finally, let S_w̄ be the set of qubits in H_w̄, that is, the set of qubits that W acts trivially on. We decompose S_w̄ as the disjoint union of sets L_k, where L_k represents all the qubits in S_w̄ that V_R^{(k)} acts non-trivially on. Moreover, we define L̄ = S_w̄ \ (L_1 ∪ … ∪ L_ℓ) as the set of all the qubits that are not in the forward light-cone of W. Using this decomposition we can rewrite the partial trace of Eq. (78) accordingly. In Fig. 9 we show the tensor network representation of Ω_qp and of ∆Ω^{q′p′}_{qp}. In the next section we show how to generate the graph G_w via the GRIM.

Basic notation
Here we introduce the basic notation employed to form the GRIM graph. For simplicity, let us consider the tensor network of ∆Ω^{q′p′}_{qp} as shown in Fig. 9, and let us consider the task of integrating the blocks in V_R^{(1)}, which involves contractions of the w and L_1 wires. Noting the structure of ∆Ω^{q′p′}_{qp} (cf. Eqs. (83) and (84)), we write the result of this integration as a linear combination of nodes with coefficients a_{1,β,s}. The coefficients a_{1,β,s} are real numbers to be found, the non-zero ones of which characterize the module that we are integrating. Moreover, we remark that P(L_1) and P(L_1 ∪ w) respectively denote the power sets of the qubits in L_1 and L_1 ∪ w. Hence, the summations in (82) and (83) are taken over all the subsets s and s′ of L_1 and L_1 ∪ w, respectively. In addition, here we have defined the nodes N_β, where c_{β,s′} are real coefficients that characterize the node, together with the contraction of the operator O. Here |p′⟩⟨q′| and |p⟩⟨q| are operators over the qubits in S_w̄ \ L_1. The contribution of these projectors over the qubits in L_1 is given by the operators p_{1,β,s}, which are defined with s̄ = L_1 \ s. We refer the reader to Fig. 10(a) for a graphical representation of the operator p_{1,β,s}. Here we remark that a_{1,β,s} and c_{β,s′} are not uniquely defined for a given integral, since for every s′ we can replace c_{β,s′} by γ_β c_{β,s′} and a_{1,β,s} by a_{1,β,s}/γ_β, where γ_β can be any non-zero number. To fix this freedom, we henceforth impose that c_{β,∅} = 1 or −1 and that a_{1,β,s} ⩾ 0. The choice a_{1,β,s} ⩾ 0 guarantees that the edge coefficients of the GRIM graph are always positive.

Construction of the graph
Employing the notation introduced in the previous section we can now construct the structure of the graph G_w, of which we provide a canonical version in Fig. 10(b). The recursive procedure is defined as follows:

• We define an initial node N_1, evaluated at any operator O; note that N_1 is simply a particular case of Eq. (83). As described in the previous section, we integrate the unitaries in V_R^{(1)} and obtain a result of the form (82). In the graph, we represent this by drawing an oriented edge (an arrow) from the initial node N_1 to the nodes N_β that appear on the right-hand side of (82).
• Let N_α be a node in the graph obtained by integrating the first (i − 1) modules. To construct the next nodes in the graph, we integrate N_α over the module V_R^{(i)}. As shown in Fig. 10(b), for any non-zero a_{α,β,s} p_{α,β,s} we associate a new edge (arrow) from N_α to the nodes on the right-hand side of (89). Here we define for convenience the operator e_{α,β,s} = a_{α,β,s} p_{α,β,s}.
Applying the aforementioned process ℓ times, we obtain the basic structure of the graph G_w, which is composed of a set of nodes {N_α} and associated oriented edges. Let us make two important remarks. First, each edge of the graph is now associated with an operator e_{α,β,s}. Below we show how to obtain the real edge coefficients λ_{αβ} from the e_{α,β,s}. Moreover, we also define an equivalence relation between nodes, such that N_{α1} and N_{α2} correspond to the same node if they apply the same wire contractions, independently of the qubit that each wire represents. Next, let us define a walk g through the graph as a sequence of connected edges g = (e_{α1,α2,s1}, e_{α2,α3,s2}, …), such that:

• The operator e_{αi,αi+1,si} is associated to one of the edges connecting N_{αi} to N_{αi+1}.
• As discussed below, in all QCNN graphs the walk ends at a node N 1 evaluated at the operator O.
We denote the set of all such walks of length ℓ as P_ℓ. From the ensuing graph and the previous results we can now compute the expectation value ⟨∆Ω^{q′p′}_{qp}⟩_{V_R}. Consider a walk g ∈ P_ℓ. Then, let us define the operator Q(g) as the product of the edge operators along the walk. We can then write the expectation value as a sum over walks of the corresponding Q(g). This equation can be simplified by noting that, by definition, the walk operators factorize. The next step in the GRIM is to simplify the graph by lower bounding Eq. (95), associating to each edge a real positive coefficient λ_{αβ} in place of each operator e_{α,β,s} that appears in Q(g). We remark that below we analyze the three possible cases for the placement of W.

Simplification of the graph
Let us first recall that all the p_{α,β,s} obtained from integrating the k-th module can be written in terms of the Kronecker-delta operators (δ_{pq})_{s_k} (δ_{q′p′})_{s_k} and (δ_{pq})_{s̄_k} (δ_{p′q′})_{s̄_k}, where s_k ∪ s̄_k = L_k and s_k ∩ s̄_k = ∅. Hence, from (95) we can write a bound in which S_P is the disjoint union of all the s_k and L̄, and in which S̄_P = S_w̄ \ S_P. Combining Eqs. (75), (95), and (96) we find the chain of inequalities leading to (101), where |s| denotes the number of qubits in the set s. The first equality arises from the fact that the a_{α,β,s} are independent of p, q, p′, q′. The second equality follows from Lemma 3. Then, the inequality is obtained by invoking Lemma 5 (equivalence of the norms and monotonicity of the partial trace). In (100) we have used the definition of |S_P|. Finally, in the last line we have simply grouped terms into a single product. We see from Eq. (101) that every coefficient of the form (δ_{pq})_s (δ_{p′q′})_s induces a factor 1/2^{|s|} in the edge coefficients. That leads us to a new, simplified version of our graph, where each edge operator e_{α,β,s} is replaced by an edge coefficient λ_{αβ}:

e_{α,β,s} = a_{α,β,s} p_{α,β,s} → λ_{αβ} = Σ_s a_{α,β,s} / 2^{|s|} .

(Figure caption: integrating the module leads to a summation of terms according to (107); one of these terms corresponds to ε_{σ_w^{(ℓ+2)}}, in whose computation many gates simplify to the identity, indicated as shallow circles.)
Moreover, two edges connecting two given nodes N_α and N_β, with associated operators e_{α,β,s} and e_{α,β,s′}, can be combined into a single edge in Eq. (101). The associated coefficient is the sum of the λ_{αβ} corresponding to the two edges.
We denote by G_w the final graph consisting of the simplified λ_{αβ} coefficients, and by P(G_w) the set of walks in this graph. Hence, the final lower bound can be written as a sum over g ∈ P(G_w) of the coefficients Λ_g, the products of the edge coefficients along each walk, where we define the sequence of connected edges g = (λ_{α1,α2}, λ_{α2,α3}, …). Note that this constitutes a proof of Proposition 1.
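The quantity Σ_{g∈P(G_w)} Λ_g is a sum, over all source-to-sink walks of a weighted directed acyclic graph, of the products of edge weights along each walk; such sums can be evaluated without enumerating walks by a simple dynamic program. The toy graph below is illustrative only: its node names and weights are made up and are not the λ_{αβ} of the actual G_w.

```python
from functools import lru_cache

def path_weight_sum(edges, source, sink):
    # Sum over all directed paths g from source to sink of the product
    # of edge weights (Lambda_g), via memoized recursion (assumes a DAG).
    @lru_cache(maxsize=None)
    def total(node):
        if node == sink:
            return 1.0
        return sum(w * total(nxt) for nxt, w in edges.get(node, ()))
    return total(source)

# Toy graph loosely mimicking the shape of a simplified GRIM graph.
edges = {
    'N1':  (('N2', 0.5), ('N3', 0.25)),
    'N2':  (('N2p', 0.5), ('N4', 0.125)),
    'N3':  (('N4', 0.75),),
    'N2p': (('N4', 0.5),),
}
total = path_weight_sum(edges, 'N1', 'N4')

# Brute-force check by explicit path enumeration.
def all_paths(node, sink, acc=1.0):
    if node == sink:
        yield acc
        return
    for nxt, w in edges.get(node, ()):
        yield from all_paths(nxt, sink, acc * w)

assert abs(total - sum(all_paths('N1', 'N4'))) < 1e-12
print(total)  # -> 0.375
```

The memoized recursion visits each node once, so the sum is computed in time linear in the number of edges even when the number of walks grows combinatorially.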
C. Integration over the unitaries in VL via the GRIM

Method
In this section we introduce the general method employed to compute the expectation value ⟨ε_{σ_w}⟩_{V_L}. In contrast to the forward light-cone, as shown in Fig. 11(a), the width of the backward light-cone L_B is not bounded and instead grows with the number of layers. The goal here is to employ the GRIM to reduce the number of gates that need to be integrated to compute ⟨ε_{σ_w}⟩_{V_L}.
First, let us remark that since we are interested in the operator σ_w = Tr_w̄[V_L σ V_L†], all the unitaries in V_L which are not in L_B compile to the identity (see Fig. 11(a) and (b)). Defining V_{L_B} as the unitary containing the blocks in L_B, we then have ⟨ε_{σ_w}⟩_{V_L} = ⟨ε_{σ_w}⟩_{V_{L_B}}.
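The statement that blocks outside the light-cone compile to the identity is the elementary fact that a unitary supported only on the traced-out qubits does not affect the reduced state. A minimal numerical check, with made-up dimensions:

```python
import numpy as np

def haar_unitary(d, rng):
    # Haar-random unitary via QR of a complex Gaussian with phase fix.
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * (np.diagonal(R) / np.abs(np.diagonal(R)))

rng = np.random.default_rng(4)
dA, dB = 4, 4   # dB plays the role of the kept subsystem H_w

def reduced(M):
    # Reduced operator on the kept factor: Tr_A[M].
    return np.einsum('iaib->ab', M.reshape(dA, dB, dA, dB))

sigma = (rng.normal(size=(dA * dB, dA * dB))
         + 1j * rng.normal(size=(dA * dB, dA * dB)))
sigma = sigma @ sigma.conj().T      # a positive, state-like operator
U_out = np.kron(haar_unitary(dA, rng), np.eye(dB))  # acts outside H_w only

rho1 = reduced(sigma)
rho2 = reduced(U_out @ sigma @ U_out.conj().T)
assert np.allclose(rho1, rho2)
print("unitaries outside the kept subsystem leave the reduced state unchanged")
```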
As depicted in Fig. 11(a), the next step is to draw an effective backward light-cone L̄_B and to define the unitary V_{L̄_B} which contains all the gates in L̄_B. This effective light-cone consists of the repetition of M_C modules, starting after the integration of an initial unitary if W is in the second sub-layer, as shown in Figs. 12 and 13. We describe below a method to effectively keep track only of the reduced light-cone L̄_B by sequentially removing gates outside of L̄_B that compile to the identity. We here follow again the first step of the GRIM to decompose V_{L̄_B} into a series of (L − ℓ) modules. Equation (105) also allows us to define the corresponding operators, which again leads to a recursion of the form (107), where here the nodes N_α contain no |p⟩⟨q| operators due to the fact that ε_{σ_w} is independent of p, q, p′, and q′. Moreover, as explicitly shown below for all three possible scenarios for the placement of W, there will always be specific tensor contractions N_α in which the unitaries not in V_{L̄_B} compile to the identity. Such a procedure is schematically shown in Fig. 11(c) for the middle module, where computing ε_{σ_w^{(ℓ+2)}} leads to many more blocks in V_{L_B} being simplified and compiling to the identity. Hence, by sequentially repeating this procedure we can obtain a lower bound for ⟨ε_{σ_w}⟩_{V_L} without needing to integrate all the gates in L_B, but just those in L̄_B.
In the next section we explicitly show this procedure for all possible choices of W gate placement.
FIG. 14. By integrating a single unitary in each sub-layer, one can obtain a reduced state σ in the subspace of any odd pair of qubits in L_B.

A. Case 1
Let us recall that the graph for Case 1 (when W is in the first sub-layer) was presented in the main text. As shown in Fig. 15(a, top), there is a single walk through G_w, which determines Σ_{g∈P(G_w)} Λ_g. Then, recalling that ℓ ⩽ L, it follows that the lower bound is in Ω(1/poly(n)) if L ∈ O(log(n)).
The case when W is in the second sub-layer leads to an additional module being added to the graph (see Fig. 15(a, bottom)). This simply modifies the lower bound, and we recover again that the lower bound is in Ω(1/poly(n)) if L ∈ O(log(n)).

B. Case 2
Let us first consider the case when W is in the first sub-layer. The graph obtained after integrating the first (ℓ − 1) modules is shown in Fig. 15(b). Let us here describe the nodes and their associated coefficients. The node N2 of the middle graph is also connected to a node N1 that connects to the rest of the edge graph.
The nodes in Fig. 15(b) are given by the contractions shown there, where we have also indicated how the wires are labeled. Similarly, the edge coefficients λ_{α,β} are presented in Table I. Finally, for Case 2, the final module that one needs to integrate is a center module M_C. Surprisingly, we find that for all the nodes N_α previously presented, all integrals generate positive coefficients. From here we can see that one can always find a path ĝ such that Λ_ĝ is a product of numbers λ_i in (0, 1), meaning that Λ_ĝ ⩾ (λ_min)^L, where λ_min is the smallest λ_i in the path ĝ. Hence, the lower bound is in Ω(1/poly(n)) if L ∈ O(log(n)). When W is in the second sub-layer, we simply find that the graph in Fig. 15(b) is slightly modified, but the scaling of the lower bound for Σ_{g∈P(G_w)} Λ_g remains unchanged.

C. Case 3
Let us first consider the integration of the middle modules M_M. The graph obtained for the case when W is in the first sub-layer is presented in Fig. 15(c).
The nodes in Fig. 15(c) are given by the contractions shown there, where we have also indicated how the wires are labeled. The edge coefficients λ_{α,β} are presented in Table II, where the coefficients a_k, b_k, p_k and q_k follow a recursive formula of the form f_n = 2f_{n−1} + f_{n−2}, with a_0 = 1, a_1 = 1, a_2 = 3; p_0 = 1, p_1 = 7, p_2 = 15; q_0 = 1, q_1 = 4, q_2 = 11. Similarly to Case 2, one can always find a path ĝ such that Λ_ĝ is a product of numbers λ_i in (0, 1), meaning that Λ_ĝ ⩾ (λ_min)^L, where λ_min is the smallest λ_i in the path ĝ. Hence, the lower bound is in Ω(1/poly(n)) if L ∈ O(log(n)). Moreover, a similar result can be obtained for W in the second sub-layer. Hence, we have shown that Corollary 1 is valid for all possible choices of W.

Appendix 4: Trainability from pooling
By consolidating parameters into smaller registers, pooling layers in a QCNN can enhance trainability in comparison to a QNN in which a full n-qubit register is maintained throughout a variational circuit. This can be illustrated by the pooling module in Fig. 6, in which an input state ρ^{(0)} is processed by pooling layers, locally described by the channel defined through the controlled unitary corresponding to the pooling. Starting from ρ^{(0)} = (|0⟩⟨0|)^{⊗n} with n = 2^L for simplicity, a cost function C^{(j)} at the j-th layer (j = 1, …, L) is obtained. The increase in trainability as j → L can be seen clearly by considering the cross-section θ^{(j)}_± = ±θ for all j. For these angles, the cost function C^{(j)}(θ_+, θ_−) can be evaluated in closed form by solving a recursion relation. For j = 1, i.e., a single pooling layer, the expression (159) is almost everywhere equal to 1 on [−π, π] as n → ∞ and, therefore, C^{(1)}(θ) exhibits a barren plateau. In contrast, for j = log n = L, the magnitude of the derivative is bounded away from zero in expectation for all n, which implies the absence of barren plateaus in the landscape. Uncorrelating the pooling layers, by taking the angles defining the controlled unitary to be θ^{(j)}_± := ±θ^{(j)} for j = 1, …, L, does not result in barren plateaus for any parameters, as long as the pooling module is sufficiently deep. In particular, the j-layer cost function for maximum depth j = L satisfies E[dC^{(L)}/dθ^{(k)}] = (1/2) n^{log₂(π)−1} for all k, which does not exhibit the exponentially fast decrease to 0 required by the definition of barren plateaus.