Optimization of CNOT circuits on limited connectivity architecture

A CNOT circuit is the key gadget for entangling qubits in quantum computing systems. However, the qubit connectivity of noisy intermediate-scale quantum (NISQ) devices is constrained by their limited connectivity architecture. To improve the performance of CNOT circuits on NISQ devices, we investigate the optimization of the size/depth of CNOT circuits under the limited connectivity architecture. We present a method that can optimize the size of any $n$-qubit CNOT circuit to $O\left(\frac{n^2}{\log \delta}\right)$ on any connected graph with minimum degree $\delta$, and prove that this bound is optimal for regular graphs. For the near-term sparsely connected structure, we additionally present a method that can optimize the size of any $n$-qubit CNOT circuit to below $2n^2$. Numerical experiments show that our method performs better than state-of-the-art results. Specifically, we present an example to illustrate the applicability of our algorithm. For the grid structure, which is commonly used in current quantum devices, we demonstrate that the depth of any $n$-qubit CNOT circuit can be optimized to be linear in $n$ with certain ancillary qubits (ancillas). Experimental results indicate that this method significantly improves on all of the existing methods. We additionally test our algorithms on five-qubit IBMQ devices, and the experiments show that the measurement results of the circuit optimized with our algorithm are more robust to noise than those obtained with the IBM mapping method.


I. INTRODUCTION
Quantum circuit synthesis is the process of constructing a quantum circuit that implements a desired unitary operator while optimizing its size/depth over a given gate set; it is an important task in quantum computation and quantum information [1][2][3]. Current intermediate-scale quantum devices have two key limitations. First, their performance and reliability depend heavily on how long the underlying quantum states can remain coherent. Hence it is natural to design algorithms that need less coherence time and are less exposed to environmental noise [4,5], in other words, circuits with smaller size and/or lower depth. Second, state-of-the-art quantum devices do not support placing 2-qubit gates (usually CNOT gates) on arbitrary pairs of qubits: 2-qubit gates can only be placed between adjacent qubits [5][6][7]. We represent a qubit as a vertex, and use an edge between two vertices to indicate that a 2-qubit gate can be performed on the two corresponding qubits. The restrictions on qubit connectivity can then be represented as a limited connectivity architecture. The limited connectivity architecture of near-term devices is usually a grid-style graph [4,6,7]. The $d$-dimensional grid is also considered a good candidate architecture for quantum supremacy by Harrow et al. [8].
CNOT circuits, which contain only CNOT gates, are indispensable in quantum circuit synthesis for constructing general circuits [2,9,10,11], since CNOT gates together with some single-qubit gates are commonly used to build universal quantum computing [12][13][14]. The optimization of the size/depth of CNOT circuits sheds some light on the more general problem of arbitrary circuit mapping. CNOT circuits also dominate stabilizer circuits, which play an important role in quantum error correction [3] and quantum fault-tolerant computation [15]. Aaronson et al. [9] proved that any stabilizer circuit has a canonical form, i.e., H-C-P-C-P-C-H-P-C-P-C, where H and P are one layer of Hadamard gates and Phase gates respectively, and C is a block of CNOT gates only. It follows that the depth of stabilizer circuits equals $5 d_{\mathrm{CNOT}} + 5$, where $d_{\mathrm{CNOT}}$ is the depth of CNOT circuits. Hence, the optimization of CNOT circuits can be directly applied to the optimization of stabilizer circuits.
* sunxiaoming@ict.ac.cn
Many researchers have aimed at reducing the size/depth of CNOT circuits without connectivity constraints [10,16,17]. For instance, Patel et al. [10] proposed an algorithm that optimizes any CNOT circuit to $O(n^2/\log n)$ size on the full connectivity architecture. Moore and Nilsson [16] proposed an algorithm that parallelizes any CNOT circuit to $O(\log n)$ depth with $O(n^2)$ ancillas on the full connectivity architecture, where the depth matches the lower bound $\Omega(\log n)$.
However, these works cannot be directly applied to near-term quantum devices with a limited connectivity architecture.
There are several size optimization algorithms for CNOT circuits under the limited connectivity architecture. Kissinger et al. [18] proposed an algorithm that gives an equivalent circuit of size $2n^2$ for any CNOT circuit if the architecture contains a Hamiltonian path. Unfortunately, their optimized size is $O(n^3)$ for architectures without a Hamiltonian path. Nash et al. [19] simultaneously proposed a similar algorithm, which gives an equivalent CNOT circuit of size $4n^2$ for any CNOT circuit under any connected graph. The following question arises: given any CNOT circuit, how can we implement it on state-of-the-art quantum devices with the smallest number of quantum gates and/or the lowest circuit depth?
In this paper, we first consider how to optimize the size of a CNOT circuit without ancillas. Our algorithm achieves a worst-case size of $O\left(\frac{n^2}{\log \delta}\right)$ on any connected graph with minimum degree $\delta$. Our algorithm is a generalization of that of Patel et al. [10]. Furthermore, we prove that this bound is tight for regular graphs. For sparse graphs with maximum degree $O(1)$, we slightly improve the results of Kissinger et al. [18] and Nash et al. [19]. Specifically, we propose an algorithm that can optimize the size of any given CNOT circuit to $2n^2$ on any connected graph. We simulate this size optimization algorithm on some particular graphs of near-term quantum devices, and the simulation results show that our optimized size is smaller than the existing results [18,19].
Secondly, motivated by the rapid decoherence of quantum systems and the grid constraint of recent quantum devices [4,5], we propose an algorithm that can optimize the depth of any given CNOT circuit to $O\left(\frac{n^2}{\min\{m_1, m_2\}}\right)$ with $3n \le m_1 m_2 \le n^2$ qubits on an $m_1 \times m_2$ grid. The optimized depth is asymptotically tight when $m_1 m_2 = n^2$. We generalize the result to any constant $d$-dimensional grid. We also give the experimental results of the depth optimization algorithm on an $n \times n$ grid. As the number of qubits grows, the optimized depth has significant advantages over the existing size optimization algorithms. We compare our algorithms with existing algorithms in Table I.
In the rest of the paper, we cover the basic notation and the preliminaries of CNOT circuits and their properties in Sec. II. In Sec. III we introduce a size optimization algorithm and the lower bounds on general graphs. We give another size optimization algorithm, for near-term devices, in Sec. IV, together with the numerical comparison of our algorithms and existing algorithms. Additionally, we give an example to show the application of the algorithm. Next, we introduce our depth optimization algorithm and the experimental results in Sec. V. In Sec. VI, we implement the optimized CNOT circuit on the IBMQ device. The experimental results show fewer outcome errors compared to the original circuit on the noisy IBMQ device. Finally, we give a discussion in Sec. VII.

II. PRELIMINARY
Notations. We use $[n]$ to denote $\{1, 2, \ldots, n\}$, $\mathbb{F}_q$ to denote the field with $q$ elements, $\oplus$ to denote addition over $\mathbb{F}_2$, $GL(n, 2)$ to denote the set of all $n \times n$ invertible matrices over $\mathbb{F}_2$, and $I$ to denote the identity matrix. Let $e_j$ be the $j$-th coordinate basis vector, which is all zeros except for a 1 in the $j$-th position. In later sections, with a slight abuse of notation, a non-integer quantity stands for its ceiling whenever it needs to be an integer.
CNOT/SWAP Gate and Circuit. A CNOT gate with control qubit $q_i$ and target qubit $q_j$ maps $(q_i, q_j)$ to $(q_i, q_j \oplus q_i)$. A SWAP gate operating on two qubits $q_i$ and $q_j$ maps $(q_i, q_j)$ to $(q_j, q_i)$. A CNOT circuit is a circuit that contains only CNOT gates. We say a CNOT circuit has size $s$ if it contains exactly $s$ CNOT gates.
Circuit with limited connectivity architecture. We use a graph $G(V, E)$ to represent the limited connectivity architecture of two-qubit gates (CNOT gates) in the circuit. A vertex in $G$ represents a qubit, and a two-qubit gate (CNOT gate) can only operate on two qubits that are connected in $G$. We say a circuit $C$ is under the limited connectivity architecture $G$ if all two-qubit gates in $C$ satisfy this constraint. Fig. 1 gives an example of a circuit on a limited connectivity architecture. The CNOT circuit in Fig. 1 (a) cannot be performed directly on a 3-qubit quantum device with the limited connectivity architecture in Fig. 1 (c). In contrast, the CNOT circuit in Fig. 1 (b) can be performed directly on it.
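The definition above can be checked mechanically. The following is a minimal sketch, assuming qubits are labelled from 0 and the architecture is given as an edge list; the function name `violating_gates` and the 3-qubit line graph are illustrative assumptions and are not taken from Fig. 1.

```python
from typing import Iterable, List, Tuple

Gate = Tuple[int, int]  # (control, target); qubits are labelled from 0 here


def violating_gates(circuit: Iterable[Gate], edges: Iterable[Tuple[int, int]]) -> List[Gate]:
    """Return the CNOT gates whose control and target are not adjacent in the graph."""
    # The architecture graph is undirected, so edges are stored as unordered pairs.
    allowed = {frozenset(edge) for edge in edges}
    return [gate for gate in circuit if frozenset(gate) not in allowed]


# Hypothetical line architecture q0 - q1 - q2 (an assumption, not taken from Fig. 1).
line_edges = [(0, 1), (1, 2)]

direct = [(0, 2)]                            # CNOT between non-adjacent qubits
routed = [(1, 2), (0, 1), (1, 2), (0, 1)]    # same linear map, nearest-neighbour gates only

print(violating_gates(direct, line_edges))   # [(0, 2)]
print(violating_gates(routed, line_edges))   # []
```

An empty result means the circuit is under the given architecture in the sense defined above.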
Circuit with ancillas. We say a CNOT circuit $C_{n,m-n}$ with $n$ input qubits and $m - n$ ancillas implements an $n$-qubit circuit $C$ if, for any input $|x\rangle$ with ancillas $|0\rangle^{\otimes(m-n)}$, the output of $C_{n,m-n}$ is $C|x\rangle \otimes |0\rangle^{\otimes(m-n)}$. We say two circuits (with or without ancillas) are equivalent if they implement the same circuit $C$.
Matrix representation of CNOT circuit [16]. We use $\mathrm{CNOT}_{i,j}$ to denote the CNOT gate with control qubit $q_i$ and target qubit $q_j$. The CNOT gate is the invertible linear map $\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}$ over $\mathbb{F}_2$. It is easy to check that a CNOT gate $\mathrm{CNOT}_{i,j}$ is equivalent to the row operation that adds the $i$-th row to the $j$-th row over $\mathbb{F}_2$ in the corresponding invertible matrix. By the linearity of CNOT circuits, we can represent an $n$-qubit CNOT circuit $C$ as an invertible matrix $M \in GL(n, 2)$ [10,16], where the $j$-th column of $M$ equals $C e_j$. We use $R(i, j)$ to denote this row-addition operation in the matrix, and call $(i, j)$ its index set. We take a 3-qubit CNOT circuit as an example of the matrix representation, as shown in Fig. 2. A sequence of row-addition operations that transforms $M$ to $I$ corresponds to a CNOT circuit represented by $M^{-1}$. The size optimization of a CNOT circuit $C$ is equivalent to minimizing a parameter $t$ such that there exist index pairs $(i_1, j_1), \ldots, (i_t, j_t)$ satisfying $M = \prod_{k=1}^{t} R(i_k, j_k)$. Fig. 3 illustrates the equivalence between a CNOT gate operation in a CNOT circuit and a row addition on its boolean matrix representation. In later sections, summation over the inputs is taken modulo 2.
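The correspondence between CNOT gates and row additions can be illustrated with a short sketch. It assumes 0-indexed qubits and a circuit given as a list of (control, target) pairs applied in order; the helper names and the 3-qubit example are illustrative and are not the circuit of Fig. 2.

```python
from typing import List, Sequence, Tuple

Gate = Tuple[int, int]  # (control i, target j), 0-indexed


def cnot_matrix(n: int, gates: Sequence[Gate]) -> List[List[int]]:
    """Matrix over F_2 of the circuit that applies `gates` in the given order."""
    m = [[int(r == c) for c in range(n)] for r in range(n)]
    for i, j in gates:
        # CNOT_{i,j} acts as the row operation R(i, j): row j <- row j + row i (mod 2).
        m[j] = [a ^ b for a, b in zip(m[j], m[i])]
    return m


def run_circuit(gates: Sequence[Gate], x: Sequence[int]) -> List[int]:
    """Apply the circuit to a computational basis input x."""
    y = list(x)
    for i, j in gates:
        y[j] ^= y[i]
    return y


def column(m: List[List[int]], j: int) -> List[int]:
    """Extract the j-th column of a matrix stored as a list of rows."""
    return [row[j] for row in m]


gates = [(0, 1), (1, 2)]          # an arbitrary 3-qubit example
M = cnot_matrix(3, gates)
print(M)                          # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
for j in range(3):
    e_j = [int(k == j) for k in range(3)]
    assert column(M, j) == run_circuit(gates, e_j)   # j-th column of M equals C e_j
```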
Grid Graph. In a $d$-dimensional grid graph $G(V, E)$ whose $i$-th dimension has size $m_i$, a vertex of this $m_1 \times \cdots \times m_d$ grid can be represented as a $d$-dimensional coordinate. We say an $n$-qubit CNOT circuit is under an $m_1 \times \cdots \times m_d$ grid if the number of ancillas is $m_1 \cdots m_d - n$, where $m_1 \cdots m_d \ge n$, and the initial input $x \in \{0, 1\}^n$ is located on any $n$ vertices of the grid.
Parallelized row-addition Matrices [17]. We say a matrix $R$ is a parallelized row-addition matrix if there exist $t \in [n]$, $i \in [n]^t$, $j \in [n]^t$ such that the $i_k$'s and $j_k$'s are $2t$ different indices and $R = \prod_{k=1}^{t} R(i_k, j_k)$. A parallelized row-addition matrix corresponds to a CNOT circuit with depth 1. Hence optimizing the depth of a CNOT circuit $C$ is equivalent to minimizing a parameter $t$ such that there exist $t$ parallelized row-addition matrices $R_1, \ldots, R_t$ with $M = \prod_{k=1}^{t} R_k$.
Several concepts in graph theory. The degree of a vertex is the number of edges that are incident to the vertex. In a graph $G$, $\Delta$ and $\delta$ denote the maximum and minimum degree of its vertices respectively. A graph is regular if $\Delta = \delta$. The Steiner tree with terminal set $S \subseteq V$ is a connected subgraph of $G$ with the minimum number of edges that contains all vertices in $S$. A cut vertex is a vertex whose removal makes the connected graph disconnected.
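As an illustration of the depth-1 condition, the sketch below tests whether a set of row additions uses $2t$ distinct indices, and groups a gate list into layers with a simple as-soon-as-possible schedule. The scheduler is an assumption added for illustration: it only gives the depth of the gate sequence as written (an upper bound), not the optimized depth discussed in this paper.

```python
from typing import Dict, List, Sequence, Tuple

Gate = Tuple[int, int]  # (control, target)


def is_parallel_layer(ops: Sequence[Gate]) -> bool:
    """True if the t row additions R(i_k, j_k) involve 2t distinct indices."""
    used = [q for op in ops for q in op]
    return len(used) == len(set(used))


def greedy_layers(gates: Sequence[Gate]) -> List[List[Gate]]:
    """Place each gate in the earliest layer after all earlier gates sharing a qubit."""
    next_free: Dict[int, int] = {}      # qubit -> first layer in which it is idle
    layers: List[List[Gate]] = []
    for i, j in gates:
        d = max(next_free.get(i, 0), next_free.get(j, 0))
        if d == len(layers):
            layers.append([])
        layers[d].append((i, j))
        next_free[i] = next_free[j] = d + 1
    return layers


layers = greedy_layers([(0, 1), (2, 3), (1, 2), (0, 3)])
print(layers)        # [[(0, 1), (2, 3)], [(1, 2), (0, 3)]]
print(len(layers))   # 2
```

Gates that share a qubit keep their relative order, and gates on disjoint qubits commute, so the layered circuit implements the same matrix as the input sequence.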

III. SIZE OPTIMIZATION ON GENERAL GRAPH
In this section, we propose an algorithm aiming at optimizing the size of CNOT circuits for quantum systems on the general limited connectivity architecture, as shown in Theorem 1. We additionally prove that our algorithm is asymptotically tight for the regular graph.

A. Size optimization algorithm
Theorem 1 generalizes the result of Patel et al. [10], which optimizes the size of CNOT circuits on the full connectivity architecture and gives an optimized size of $O\left(\frac{n^2}{\log n}\right)$. The most significant difference between the technique of Theorem 1 and the algorithms in Refs. [18,19] is that here we eliminate several columns simultaneously instead of eliminating a single column in each iteration.
Proof. Let $k = n/\delta$ in Theorem 2, where $\delta$ is the minimum degree of graph $G(V, E)$. It is easy to check that the sum of the degrees of any $k$ vertices is greater than $n$, and that $\log(n/k) = \log \delta$; thus we have an $O\left(\frac{n^2}{\log \delta}\right)$-size CNOT circuit.
It follows from Theorem 3 that the optimized size in Theorem 1 is asymptotically tight for a nearly regular graph, in which all vertices have degrees of the same order.
Algorithm 1: (SBE) Eliminate the first s columns of the given invertible matrix.
input: Integers $n$, $k$ such that $\sum_{j=1}^{k} d_{i_j} \ge n$ for any $k$ vertices $i_1, \ldots, i_k$, …
…
Let $S$ be the set containing all the rows in $M^{(2)}$ whose Gray code equals $\mathrm{Gray}(i)$;
10 Transform the binary string of row $l$ to $\mathrm{Gray}(i)$ by adding one of the rows in $M^{(1)}$ to row $l$ along the minimum path between the vertices of these two rows with Lemma 2;
Let set $S'$ be $k$ random rows of $S$, and let set $R$ be the vertices of these rows in $S'$;
13 Eliminate rows in $S'$ to 0 by row $l$ along the 2-approximate minimum Steiner tree of set $R$ in $G$ (Lemma 2);
We give an explicit algorithm to show the upper bound in Theorem 1. Recall from Sec. II that the construction of a CNOT circuit is equivalent to constructing an invertible matrix with row-addition operations. Algorithm SBE (size block elimination) gives the row additions for the first $s$ columns of the invertible matrix under graph $G(V, E)$. In the algorithm, we use vertex $V_i$ to represent row $i$, as well as qubit $q_i$. The $i$-th row can be directly added to the $j$-th row if vertex $V_i$ is connected to $V_j$. We also depict the process in Fig. 4.
Let $k$ be a number such that the sum of the degrees of any $k$ vertices in $G(V, E)$ is greater than the total number of qubits $n$. Algorithm SBE gives an optimized size as stated in Theorem 2. Theorem 2 is a generalization of Theorem 1: here we consider the $k$ smallest degrees of the graph instead of only the minimum degree.
Theorem 2. Algorithm 1 can optimize the size of any $n$-qubit CNOT circuit to $O\left(\frac{n^2}{\log(n/k)}\right)$ under a connected graph $G(V, E)$ as the limited connectivity architecture, where the sum of the degrees of any $k$ vertices in $G$ is greater than $n$.
Lemma 1 ensures the efficiency of our optimization algorithm, by which we can eliminate one row in one step on average.
Lemma 1. Given a connected graph $G(V, E)$, for any integer $k$ such that the sum of the degrees of any $k$ vertices is greater than $n$, the minimum Steiner tree for any $k$ vertices in $G$ has size less than $5k$.
This lemma can be obtained directly by applying the technique in Theorem 2.4 of Ali et al. [20]. The detailed proof is given in the Supplementary material. The core idea of the proof is that two vertices share no common neighbors if the distance between them equals three. Computing a minimum Steiner tree exactly takes exponential cost, hence we replace it with the 2-approximate minimum Steiner tree in Algorithm SBE, as stated in the following corollary.
Corollary. Given a connected graph $G(V, E)$, for any integer $k$ such that the sum of the degrees of any $k$ vertices is greater than $n$, the 2-approximate minimum Steiner tree for any $k$ vertices in $G$ has size less than $10k$.
The following lemma serves Lemma 3. It can be obtained directly from the optimization process of Nash et al. [19], by which we can add the value of a qubit to the target qubits while keeping the values of the other qubits the same.
The transformation $|x_1\rangle \cdots |x_n\rangle \to |y_1\rangle \cdots |y_n\rangle$ on any connected graph can be implemented by CNOT gates with size $O(n)$.
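To make the idea of adding one qubit's value to a distant qubit concrete, here is a minimal sketch for the simplest case of a single target reached along a path $v_0, v_1, \ldots, v_k$ of the graph. It uses a standard nearest-neighbour construction with $4k - 4$ gates for $k \ge 2$; this is an illustrative assumption and not necessarily the construction of Nash et al. [19], which handles several targets along a tree.

```python
from typing import Dict, List, Sequence, Tuple

Gate = Tuple[int, int]  # (control, target)


def add_along_path(path: Sequence[int]) -> List[Gate]:
    """Nearest-neighbour CNOTs that map x[path[-1]] to x[path[-1]] + x[path[0]] (mod 2)
    and leave every other qubit unchanged."""
    k = len(path) - 1
    if k < 1:
        raise ValueError("path needs at least two vertices")

    def edge(a: int) -> Gate:
        return (path[a], path[a + 1])

    first = edge(0)
    down = [edge(a) for a in range(k - 1, 0, -1)]
    # After down + first + up, every v_a with a >= 1 holds x_a + x_0.
    gates = down + [first] + down[::-1]
    if k >= 2:
        inner = [edge(a) for a in range(k - 2, 0, -1)]
        # The same pattern on v_0, ..., v_{k-1} removes x_0 from v_1, ..., v_{k-1}.
        gates += inner + [first] + inner[::-1]
    return gates


def simulate(gates: Sequence[Gate], x: Dict[int, int]) -> Dict[int, int]:
    """Apply the CNOT gates to a classical bit assignment."""
    y = dict(x)
    for c, t in gates:
        y[t] ^= y[c]
    return y


path = [3, 0, 4, 1, 2]                 # hypothetical path v_0, ..., v_4 in some graph
gates = add_along_path(path)
print(len(gates))                      # 12, i.e. 4k - 4 for k = 4 edges
print(simulate(gates, {0: 0, 1: 0, 2: 0, 3: 1, 4: 0}))   # only qubit 2 picks up x_3
```

The gate count is linear in the path length, which is consistent with the $O(n)$ size stated in the lemma for this special case.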
By this lemma, we can eliminate one column of the matrix with $O(n)$ row additions.
Proof. Let $s = \log(n/k)/2$. Given an $n$-qubit CNOT circuit $M \in GL(n, 2)$, the following algorithm uses $O(n^2/s)$ row additions to transform $M$ to $I$, and thus gives an equivalent CNOT circuit of size $O(n^2/s)$. The algorithm starts by dividing $M$ into $n/s$ blocks, $M = [M_1 \cdots M_{n/s}]$. Similarly, let $I = [I_1 \cdots I_{n/s}]$. Next, transform $M_j$ to $I_j$ for all $j \in [n/s]$, as shown in Fig. 4. Algorithm SBE states how to transform $M_1$ to $I_1$ with row additions. The process of transforming $M_j$ to $I_j$ for $j > 1$ is almost the same as that of transforming $M_1$ to $I_1$, except that in step (1) of Algorithm SBE we take $M^{(1)}$ to be the $((j-1)s+1)$-th to $js$-th rows of the input matrix $M_j$, and $M^{(2)}$ to be the remaining rows.
Now we prove that the number of row additions is indeed $O(n^2/s)$. Since any $k$ vertices have an $O(k)$-size 2-approximate minimum Steiner tree, the numbers of vertices in the 2-approximate minimum Steiner trees in Steps 1 and 13 of Algorithm SBE are both $O(k)$. Hence the number of row additions is equal to Steps (5-6) for $k \le n$. Since there are in total $n/s$ such blocks, we need $n/s \times O(n) = O(n^2/s)$ row additions.
The optimized size in Theorem 2 is asymptotically tight when $k = n^{\varepsilon}$ for $0 \le \varepsilon < 1$, i.e., when the sum of the degrees of any $n^{\varepsilon}$ vertices is greater than $n$, since $\Omega(n^2/\log n)$ is a lower bound for any limited connectivity architecture by Theorem 3.

B. Lower bound for size optimization
The best lower bound for CNOT circuit synthesis on the full connectivity architecture is $\Omega(n^2/\log n)$ size, by Patel et al. [10]. This lower bound is obtained by counting the number of distinct CNOT circuits with a given number of CNOT gates, and it also implies the same lower bound $\Omega(n^2/\log n)$ for CNOT circuit synthesis on a limited connectivity architecture. We say the quantum circuit $\hat{U} \in \mathbb{C}^{2^n \times 2^n}$ is an $\varepsilon$-approximation for the quantum … Here we prove a tighter lower bound, as stated in Theorem 3. The technique of the proof is inspired by the counting method of Christofides [21] and Jiang et al. [17].
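The simple counting argument mentioned above can be evaluated numerically. The sketch below assumes that circuits of size at most $k$ are covered by $k$ slots, each holding one of the allowed CNOT gates or the identity, so that $(c+1)^k \ge |GL(n,2)|$ is necessary, where $c$ is the number of allowed gates. It reproduces only this basic count, not the refined canonical-circuit count behind Theorem 3; the function names are illustrative.

```python
def gl2_order(n: int) -> int:
    """Number of invertible n x n matrices over F_2: prod_{i=0}^{n-1} (2^n - 2^i)."""
    order = 1
    for i in range(n):
        order *= (1 << n) - (1 << i)
    return order


def counting_lower_bound(n: int, gate_choices: int) -> int:
    """Smallest k with (gate_choices + 1)^k >= |GL(n, 2)|.

    Some n-qubit CNOT circuit needs at least this many gates when each gate
    is one of `gate_choices` allowed CNOTs.
    """
    if gate_choices < 1:
        raise ValueError("need at least one allowed gate")
    target = gl2_order(n)
    k, reachable = 0, 1
    while reachable < target:
        reachable *= gate_choices + 1
        k += 1
    return k


n = 20
print(counting_lower_bound(n, n * (n - 1)))   # full connectivity: n(n-1) ordered pairs
print(counting_lower_bound(n, 2 * (n - 1)))   # hypothetical path graph: 2 directions per edge
```

Fewer allowed gates per step give a larger bound, which is the intuition behind replacing $\log n$ by $\log \Delta$.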
Theorem 3. For any connected graph $G(V, E)$ as the limited connectivity architecture of the quantum system, there exist some $n$-qubit CNOT circuits for which any $\varepsilon$-approximation by 2-qubit circuits with $\varepsilon < 1/\sqrt{2}$ needs $\Omega\left(n^2/\log \Delta\right)$ size on graph $G$, where $\Delta$ is the maximum degree of the limited connectivity architecture.
Before giving the proof of Theorem 3, we first define the canonical CNOT circuit.
Definition (Canonical CNOT circuit). Consider any $n$-qubit CNOT circuit with $k$ CNOT gates, represented as a sequence of elementary row operations $R_1, R_2, \ldots, R_k$. We say it is canonical if and only if the sequence can be partitioned into several non-empty blocks $G_1, G_2, \ldots, G_s$ and these blocks satisfy the following properties,
…
2. For $2 \le i \le s$, for every matrix in the block $G_i$, there exists at least one element of its index set belonging to the index set of some matrix in the previous block $G_{i-1}$.
Intuitively, canonicity means that the CNOT gates in the same block can be executed simultaneously and that no CNOT gate in the block $G_i$ can be put into the previous block $G_{i-1}$. It is simple to prove that any CNOT circuit can be transformed into an equivalent canonical CNOT circuit within finitely many steps; readers can refer to Ref. [21] for a specific proof. Next, we show an upper bound on the number of distinct canonical CNOT circuits, as stated in Lemma 4.
In the following we prove that when $\eta = \frac{\varepsilon}{2k}$ and $\varepsilon < \frac{1}{\sqrt{2}}$, the $\eta$-discretizations of any two different CNOT circuits with size $k$ are different. Since for any two different CNOT circuits $U_f$, $U_g$ with size $k$ such that … we have … By the definition of $U_f^{\eta}$, we have … Similarly we have $\left\| \left( U_g^{\eta} - U_g \right) |x\rangle \right\|_2 \le \varepsilon$. Since our approximation needs to be $\varepsilon < \frac{1}{\sqrt{2}}$ close to the original solution, this implies … Hence by Lemma 4, we have an upper bound on the number of 2-qubit circuits that give an $\varepsilon < \frac{1}{\sqrt{2}}$ approximation to all of the possible circuits with $k$ CNOT gates,
$$8^k e^k \Delta^k \, 2^{n \log n} \left( \frac{2}{\eta} \right)^{32k} \qquad (1)$$
with error $2k\eta$. Since Eq. (1) must be at least the number of all possible CNOT circuits, which is greater than $2^{n(n-1)/2}$, we have $k = \Omega\left(n^2/\log \Delta\right)$.

IV. SIZE OPTIMIZATION ON NEAR-TERM QUANTUM DEVICES
Since the degree of the vertices of the limited connectivity architecture is very low in near-term superconducting quantum devices, the size optimization algorithm in Sec. III gives an $O(n^2)$ optimized size, which is asymptotically optimal. Nevertheless, that algorithm is more complex and has a larger constant in the optimized size than the algorithms of Kissinger et al. [18] and Nash et al. [19].
The algorithms of both Kissinger et al. and Nash et al. aim to reduce the $n \times n$ boolean matrix $M$ of the given CNOT circuit to the identity. They both first reduce the matrix to an upper triangular matrix by eliminating all of the ones below the diagonal, column by column, and then use the same method to eliminate the ones above the diagonal. To eliminate all of the ones below the diagonal in column $j$, they both label the $i$-th vertex of the limited connectivity architecture $G$ with the value $M_{ij}$. Then they generate a Steiner tree whose terminals are the vertices with value one, and use two slightly different methods to eliminate all terminals to zero.
In this section, we propose another size optimization algorithm that gives a $2n^2$ optimized size on any connected graph in the worst case. We also show that this algorithm performs better than the existing algorithms [18,19].

A. Size optimization algorithm
For the "elimination process" that transforms a matrix to the identity by row operations under a limited connectivity architecture, we cannot add a row to another if their corresponding vertices are not adjacent. Given the limited connectivity architecture $G$ and the matrix $M \in GL(n, 2)$ corresponding to a CNOT circuit, in contrast to the algorithms of Kissinger et al. [18] and Nash et al. [19], we propose an algorithm that eliminates the $i$-th row and the $i$-th column simultaneously for a vertex $i \in V$ that is not a cut vertex. The optimized size of the algorithm is at most $2n^2$ in the worst case on any connected limited connectivity architecture, as stated in the following theorem.
Algorithm ROWCOL ensures the correctness of Theorem 4. In Algorithm ROWCOL, Steps (3-6) aim to transform the $i$-th column into $e_i$ and Steps (7-10) aim to transform the $i$-th row into $e_i^T$. All arithmetic operations are over the binary field $\mathbb{F}_2$. An illustrative example is shown in Example 1. The limited connectivity architecture for the CNOT circuit in Example 1 is depicted in Fig. 6. The optimized CNOT circuit for this invertible matrix is the inverse of the joint circuit generated from steps (1) to (8).
…
Find a tree $T$ containing $S \cup \{i\}$;
8 Preorder traverse $T$ from $i$. When reaching $j \notin S$, add the $j$-th row to its parent;
9 Postorder traverse $T$ from $i$, add every row to its parent when reached;
10 Delete $i$ from graph $G$;
When we perform CNOT gates in Steps (2-3, 5-6), the number of CNOT gates is less than the number of remaining nodes, hence the total size is at most $4(n-1) + 4(n-2) + \cdots + 4 \times 1 = 2n(n-1) \le 2n^2$.
Since a connected graph must have a vertex that is not a cut vertex, and the graph remains connected after that vertex is deleted, this algorithm can be applied to any connected graph. We can run a breadth-first search (BFS) starting from any vertex of the graph and apply the above algorithm on the BFS tree, deleting a leaf node each time.
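The BFS argument can be sketched as follows: deleting vertices in reverse BFS discovery order always removes a leaf of the BFS tree, so the remaining graph stays connected. The adjacency list below is a hypothetical T-shaped 5-vertex graph chosen for illustration; `worst_case_size` simply evaluates the bound $4(n-1) + \cdots + 4$ from the text.

```python
from collections import deque
from typing import Dict, Iterable, List


def deletion_order(adj: Dict[int, List[int]], root: int) -> List[int]:
    """Reverse BFS discovery order: each deleted vertex is a leaf of the BFS tree."""
    order = [root]
    seen = {root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                order.append(v)
                queue.append(v)
    if len(order) != len(adj):
        raise ValueError("graph is not connected")
    return order[::-1]


def is_connected(adj: Dict[int, List[int]], keep: Iterable[int]) -> bool:
    """Check connectivity of the subgraph induced by the vertices in `keep`."""
    keep = set(keep)
    if not keep:
        return True
    start = next(iter(keep))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in keep and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == keep


def worst_case_size(n: int) -> int:
    """Evaluate 4(n-1) + 4(n-2) + ... + 4, which equals 2n(n-1)."""
    return sum(4 * r for r in range(1, n))


# Hypothetical T-shaped architecture (an assumption for illustration).
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
order = deletion_order(adj, root=0)
print(order)                                  # [4, 3, 2, 1, 0]
print(worst_case_size(5), 2 * 5 ** 2)         # 40 50
```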

B. Numerical results
In this subsection, we compare the experimental simulation of Algorithm ROWCOL with the algorithms in Refs. [18,19]. The CNOT circuits are performed on the IBM-Q20 graph and a T-like graph (T20). The limited connectivity architectures of IBM-Q20 and T20 are depicted in Fig. 5. The original circuit size ranges from 20 to 800 in the experiment, and the number of qubits is chosen to be 20. We randomly sample 200 CNOT circuits for each input size, and compare the average optimized size of Algorithm ROWCOL with that of Refs. [18,19], as depicted in Fig. 7.
Our algorithm performs better than the algorithm of Kissinger et al. on most of the generated random circuits for the limited connectivity architecture with a Hamiltonian path, as shown in Fig. 7 (a). In particular, we obtain a better optimized size for 82.3% of the random circuits on IBM-Q20, and for 99.9% of the random circuits on the T20 graph. When the graph does not have a Hamiltonian path, the advantage of our algorithm over that of Kissinger et al. [18] is more pronounced (as in Fig. 7 (b)).
The above results for randomly generated CNOT circuits show the advantages of Algorithm ROWCOL for general CNOT circuits. In the following, we run Algorithm ROWCOL on a specific CNOT circuit to show the applicability of our algorithm. The comparison results are consistent with the average case.
We also give an example to show the advantage of our algorithm, which is a staircase CNOT circuit followed by a randomly generated SWAP circuit (as shown in Fig. 6 (a)) under the limited connectivity architecture in Fig. 6 (b). Staircase CNOT and SWAP circuits are both crucial in the quantum circuit implementation of error detection and correction [22][23][24], simulation of quantum chemistry [25,26], Hamiltonian simulation [27] and near-term variational algorithms [28]. Fig. 8 compares the optimized circuits of Algorithm ROWCOL and the algorithm in Ref. [19]. The non-trivial optimized CNOT sizes of Algorithm ROWCOL and Ref. [19] are 10 and 20 respectively.

V. DEPTH OPTIMIZATION ON THE NEAR-TERM QUANTUM DEVICES
Because near-term quantum devices decohere as the execution time grows, it is meaningful to decrease the execution time of a given quantum circuit. Although Algorithm ROWCOL can also be used to optimize the depth of any given CNOT circuit, the depth of the optimized circuit is almost the same as its size. It is known that ancillas can greatly reduce the depth of a quantum circuit. Much work has aimed to reduce the depth of CNOT circuits by designing optimized circuits with some ancillas [16,17,29]. In this section, we first propose a depth optimization algorithm on the grid architecture of NISQ devices with limited ancillas, and then show by numerical experiment the great improvement of the optimized depth over other existing algorithms.

A. Depth optimization algorithm
Here we optimize the depth of CNOT circuits by bringing in limited ancillas on a 2-dimensional grid. We then generalize the result to any constant $d$-dimensional grid.
This theorem gives a trade-off between depth and ancillas for CNOT circuits under the grid limited connectivity architecture. The depth can be optimized to $O(n)$ when $m_1 = m_2 = n$. It is easy to check that there exists a CNOT circuit for which the optimized depth is $\Omega(n)$ on an $n \times n$ grid. Hence our optimized depth is asymptotically tight in this case. The main idea for Theorem 5 is to divide the output of the CNOT circuit into several intermediate results while conserving the ancillas, and then to combine the intermediate results to generate the output and restore the ancillas.
Before giving the algorithm that establishes Theorem 5, we state two lemmas that show how to copy and add inputs on the $d$-dimensional grid; one can easily check that the lower bound for this operation on an $(m_1 \times m_2 \times \cdots \times m_d)$ grid is also … We prove this lemma by constructing a tree rooted at vertex $y$ with depth $O\left(\sum_{i=1}^{d} m_i\right)$ in the $d$-dimensional grid, and then dividing the tree into disjoint paths to parallelize the CNOT gates. See the Supplementary material for the proofs of Lemmas 5 and 6.
In the following, we give the algorithm for Theorem 5. Let $y := y_1 \cdots y_n \in \{0, 1\}^n$ be the output; then $y_i = \sum_j M_{ij} x_j$. We first divide the summation into several parts. Let $z_{ij} := \sum_{k=(j-1)m_1+1}^{j m_1} M_{ik} x_k$, where $i \in [n]$ and $j \in [n/m_1]$ (here we suppose $n/m_1$ is an integer; it is easy to generalize to the general case). It is easy to check that the $i$-th output qubit is $y_i = \sum_j z_{ij}$. Let $s := m_2 - 2n/m_1$. Let the coordinate $(i, j)$ represent the $i$-th row and $j$-th column of the $m_1 \times m_2$ grid.
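The decomposition of each output bit into block partial sums can be checked with a few lines of code. This sketch uses 0-indexed rows and columns, assumes $m_1$ divides $n$ as in the text, and takes all sums modulo 2; the matrix and input are arbitrary illustrative values.

```python
from typing import List, Sequence


def partial_sums(M: Sequence[Sequence[int]], x: Sequence[int], m1: int) -> List[List[int]]:
    """z[i][j] = sum of M[i][k] * x[k] over the j-th block of m1 consecutive inputs (mod 2)."""
    n = len(x)
    if n % m1 != 0:
        raise ValueError("this sketch assumes m1 divides n")
    return [
        [sum(M[i][k] & x[k] for k in range(j * m1, (j + 1) * m1)) % 2
         for j in range(n // m1)]
        for i in range(n)
    ]


def combine(z: Sequence[Sequence[int]]) -> List[int]:
    """y[i] = sum over j of z[i][j] (mod 2)."""
    return [sum(row) % 2 for row in z]


M = [[1, 0, 0, 0],
     [1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 0, 1, 1]]
x = [1, 0, 1, 1]
z = partial_sums(M, x, m1=2)
print(z)            # [[1, 0], [1, 0], [0, 1], [1, 0]]
print(combine(z))   # [1, 1, 1, 1], which equals M x over F_2
```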
Algorithm DepAncGrid implements the transformation … Hence, the transformation can be implemented by first performing … and then performing … $= |0^{m-n}, y\rangle$.
Finally, move $y_j$ to the first $n/m_1$ columns for all $j$ by SWAP gates. Hence we have an equivalent parallelized CNOT circuit for any given CNOT circuit. We depict this process in Fig. 9.
FIG. 7: The experimental results of Algorithm ROWCOL and the algorithms in Refs. [19] and [18] under the limited connectivity architecture graphs: (a) IBM Q20 and (b) T20. For contrast, we also draw the curve $y = x$ in the figure.
FIG. 8: Optimized circuits for the staircase CNOT circuit in Fig. 6 (a) under the limited connectivity architecture in Fig. 6 (c). The non-trivial optimized sizes of Algorithm ROWCOL and Ref. [19] are 10 and 20 respectively, as shown in (a) and (b), where in (b) the two CNOT gates in the dotted box can be trivially canceled, hence we did not count them in the comparison.

B. Numerical results
In this subsection, we give the numerical tests for Algorithm DepAncGrid. We compare the optimized depth of Algorithm DepAncGrid with all of the existing size optimization algorithms on the grid graph mentioned previously. To show the performance of these algorithms, $n \times n$ grids with $n$ ranging from 4 to 11 are selected in this experiment. For each grid, we first randomly sample 200 different CNOT circuits, and then run all of the algorithms on them. The method for sampling a random CNOT circuit is as follows: (1) randomly sample an $n \times n$ 0-1 matrix by selecting a "0" or "1" at random in each position; (2) determine whether the matrix sampled in (1) is invertible; if not, return to (1), otherwise accept the matrix as a random CNOT circuit.
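The two-step sampling procedure described above is rejection sampling, and could be sketched as follows. The invertibility test is plain Gaussian elimination over $\mathbb{F}_2$; the function names and the use of a seeded `random.Random` are illustrative assumptions, not the authors' code.

```python
import random
from typing import List

Matrix = List[List[int]]


def is_invertible_gf2(matrix: Matrix) -> bool:
    """Gaussian elimination over F_2; True if the square matrix has full rank."""
    m = [row[:] for row in matrix]
    n = len(m)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col]), None)
        if pivot is None:
            return False
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            if m[r][col]:
                m[r] = [a ^ b for a, b in zip(m[r], m[col])]
    return True


def sample_cnot_matrix(n: int, rng: random.Random) -> Matrix:
    """Step (1): draw each entry uniformly from {0, 1}; step (2): retry until invertible."""
    while True:
        candidate = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]
        if is_invertible_gf2(candidate):
            return candidate


rng = random.Random(2024)          # fixed seed only to make the sketch repeatable
sample = sample_cnot_matrix(6, rng)
for row in sample:
    print(row)
```

Because every 0-1 matrix is equally likely in step (1), the accepted matrix is uniformly distributed over $GL(n, 2)$.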
The comparison results are depicted in Fig. 10, including the optimized depth of Algorithms ROWCOL, DepAncGrid, and algorithms in Refs. [18,19] on grid graph. In particular, here Algorithm DepAnc-Grid needs n 2 qubits and other algorithms only need n qubits. The y axis shows the average depth of the optimized circuit. In consideration of reducing the impact of the different Hamiltonian paths chosen in Ref. [18], we choose two different Hamiltonian paths to synthesize the same CNOT circuit. The comparison results show Algorithm DepAncGrid has a significant improvement for the optimized depth as the number of qubits increases. Theoretically, the depth of the CNOT circuit generated by Algorithm DepAncGrid equals O(n), while O(n 2 ) for other algorithms.
This experimental result shows that the depth of the CNOT circuit can be greatly reduced for the ancilla-free quantum system.

VI. EXPERIMENTAL RESULT ON IBMQ
In this section, we test the performance of our optimized CNOT circuits on IBM devices. We implement a staircase CNOT circuit and an Add CNOT circuit, which have wide applications in error correction [3], variational algorithms [3,30] and quantum chemistry [31,32].
We leverage IBMQ devices (ibmq_athens and ibmq_5_yorktown) as the limited connectivity architecture [33], as shown in Fig. S1 of the Supplementary material. In Fig. 11 (a), we give the staircase circuit without considering the limited connectivity architecture, with input $|\phi\rangle = \frac{|0\rangle+|1\rangle}{\sqrt{2}}\,|0\rangle\,\frac{|0\rangle+|1\rangle}{\sqrt{2}}\,|0\rangle\,|0\rangle$. We perform the circuit on IBMQ with the mapping of the CNOT circuit provided by the ROWCOL algorithm in Fig. 11 (b). There is a layer of H gates before the CNOT circuit in Fig. 11 (b) to generate the input state $|\phi\rangle$ from the initial state $|0\rangle^{\otimes 5}$ of the IBMQ device.

Algorithm listing (fragment):
    // $x_1, \ldots, x_{m_1}$ in the first column, and so on, as in Fig. 9
    for $l \leftarrow 1$ to $n/m_1$ do
        Copy all of $x_i$ in the $l$-th column to the columns $j$ for $n/m_1 + 1 \le j \le n/m_1 + s$ in the same row as $x_i$;
        ...
        Add all values of each column $k$ to the last row to generate $z_{k,1}$ in the coordinate $(m_1, n/m_1 + k)$;
        Add $z_{(j-1)s+1,1}, \ldots, z_{js,1}$ to the mirror symmetric coordinate;
        Restore the ancillas in columns $j$ for $n/m_1 + 1 \le j \le n/m_1 + s$;
We compared the measurement results of the ROWCOL algorithm and the IBM optimization algorithm on the ibmq_athens quantum device, and plot the classical simulation result (ideal quantum circuit without any error) for comparison, as shown in Fig. 12. We performed 8,000 measurements on each circuit independently. The horizontal axis shows the results measured in the computational basis, where $j$ represents the computational basis state $j_0 j_1 \ldots j_4$. The vertical axis shows the frequency of each outcome after 8,000 measurements. The ideal output state after performing the CNOT circuit $C$ in Fig. 11 (a) is $|\psi\rangle = C|\phi\rangle = \frac{|0\rangle+|4\rangle+|16\rangle+|20\rangle}{2}$, so each of the four outcomes occurs with probability $1/4$. Hence the expected frequency of the result after 8,000 repeated measurements is $\{|0\rangle: 2000;\ |4\rangle: 2000;\ |16\rangle: 2000;\ |20\rangle: 2000\}$. Fig. 12 shows that the simulation results are consistent with the expected frequency. It can also be seen from Fig. 12 that the circuit optimized by the ROWCOL algorithm is strongly robust to errors: we can extract the correct measurement results by setting a small threshold y = 1000 and selecting the outcomes whose frequency is greater than that threshold. In comparison, there are substantial errors in the measurement results of the circuit obtained directly by IBM's mapping method.
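The threshold-based readout described above can be sketched in a few lines. The ideal counts follow from the amplitudes of $|\psi\rangle$ (each amplitude $1/2$ gives probability $1/4$). The noisy histogram `hypothetical_counts` is invented purely for illustration and is not measured data; the function names are likewise hypothetical.

```python
def expected_counts(amplitudes, shots):
    """Ideal counts per basis state: shots * |amplitude|^2."""
    return {state: shots * abs(a) ** 2 for state, a in amplitudes.items()}


def outcomes_above_threshold(counts, threshold):
    """Keep the basis states whose observed frequency exceeds the threshold."""
    return sorted(state for state, c in counts.items() if c > threshold)


# Ideal output state (|0> + |4> + |16> + |20>) / 2.
ideal_amplitudes = {0: 0.5, 4: 0.5, 16: 0.5, 20: 0.5}
ideal = expected_counts(ideal_amplitudes, shots=8000)

# Hypothetical noisy histogram (NOT experimental data), 8000 shots in total.
hypothetical_counts = {0: 1800, 4: 1750, 16: 1700, 20: 1650,
                       1: 300, 5: 250, 8: 200, 21: 350}

if __name__ == "__main__":
    print(ideal)
    print(outcomes_above_threshold(hypothetical_counts, threshold=1000))
```

With a threshold of 1000, half of the ideal count per outcome, the selection recovers the four ideal outcomes as long as noise moves fewer than 1000 shots into or out of any single bin.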
The Add circuit, as shown in Fig. 13 (a) and performed on the ibmq_5_yorktown device, has a similar performance, as shown in Fig. 12 (b).

VII. DISCUSSION
Optimization of the size/depth of the quantum circuit with limited connectivity architecture is one of the main challenges in near-term quantum computing [34][35][36]. In this paper, we propose two size/depth optimization algorithms on the limited connectivity architecture. The experimental results show that our algorithms perform better than the existing optimization algorithms.
Specifically, for a connected graph with minimum degree $\delta$, any $n$-qubit CNOT circuit can be optimized to size $O\left(\frac{n^2}{\log \delta}\right)$ on such a graph. We also prove that this order is tight for a regular graph. Algorithm SBE further indicates that the size of any $n$-qubit CNOT circuit can be optimized to $O\left(\frac{n^2}{\log(n/k)}\right) = O\left(\frac{n^2}{\log \delta}\right)$ on a limited connectivity architecture where the average degree of any $k$-vertex set is greater than $n/k$.
For a limited connectivity architecture with constant degree, we can optimize any CNOT circuit to size $2n^2$ on any connected graph; the order is tight when the limited connectivity architecture is a simple path. We also give an algorithm that takes more features of the graph into account and gives a better upper bound for a specific class of graphs.
With these two size optimization algorithms, we see that the optimized size of a CNOT circuit is dominated by the degrees of the limited connectivity structure.
Here we give the intuition via the "implementation of a single CNOT gate". For a structure where the sum of any $k$ degrees from $k$ different vertices is larger than $n$, the number of CNOT gates required to construct a CNOT gate $\mathrm{CNOT}_{ij}$ scales as $10k$ by Lemma 1. This implies that for a limited connectivity structure with minimum degree $\delta$, the number of CNOT gates required to construct $\mathrm{CNOT}_{ij}$ scales as $10n/\delta$ in the worst case. This explains why a smaller optimized size corresponds to stronger connectivity.
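As a small illustration of why a single CNOT gate becomes more expensive under weaker connectivity, consider the three-vertex path 0 - 1 - 2, where qubits 0 and 2 are not adjacent. The sketch below is not the Steiner-tree construction behind Lemma 1; it uses the standard four-gate identity for a distance-2 CNOT and checks it on all classical inputs, which suffices because a CNOT circuit is a permutation of the computational basis states. The helper names are hypothetical.

```python
from itertools import product


def apply_cnots(bits, gates):
    """Apply CNOT gates, given as (control, target) pairs, in order."""
    bits = list(bits)
    for control, target in gates:
        bits[target] ^= bits[control]  # target <- target XOR control
    return tuple(bits)


def same_cnot_circuit(gates_a, gates_b, n):
    """Two CNOT circuits are equal iff they agree on every n-bit input."""
    return all(apply_cnots(x, gates_a) == apply_cnots(x, gates_b)
               for x in product((0, 1), repeat=n))


# CNOT from qubit 0 to qubit 2, which is not an edge of the path 0 - 1 - 2.
direct = [(0, 2)]
# Equivalent circuit that uses only the edges (0, 1) and (1, 2).
routed = [(1, 2), (0, 1), (1, 2), (0, 1)]

if __name__ == "__main__":
    print(same_cnot_circuit(direct, routed, n=3))
```

One logical gate costs four physical gates here; the $10k$ bound of Lemma 1 plays the analogous role for general graphs.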
Since current superconducting quantum devices [4,5,33] usually have a low degree in their limited connectivity architecture, and the coherence time is very limited, we also consider a specific limited connectivity architecture: the $m_1 \times m_2$ grid. We show that any $n$-qubit CNOT circuit can be optimized to depth $O\left(\frac{n^2 (m_1+m_2)}{m_1 m_2}\right)$ on this grid, where $3n \le m_1, m_2 \le n^2$. This result can be easily generalized to a grid of any constant dimension $d$. The dimensions for a corresponding grid of current quantum superconducting devices [4,5,33]

Figure caption (fragment): Fig. 13 (a) under ibmq_5_yorktown device.

To conclude the suitable scenarios of the proposed algorithms, we list the options for the algorithms with the degrees of the associated limited connectivity structures as follows:
(a) The SBE algorithm performs better on quantum computers whose limited connectivity structure has a comparatively large minimum degree. It outperforms other existing algorithms when the minimum degree $\delta$ of the structure is larger than a constant, i.e., $\delta = \omega(1)$, and the number of input qubits is comparatively large (more than hundreds of qubits).
(b) The ROWCOL algorithm has a better performance for the near-term quantum device with dozens of qubits or the constant degree of the limited connectivity structure, such as 1D or 2D grids.
(c) The DepAncGrid algorithm is suitable for 1D and 2D limited connectivity structures.
Note that the 65-qubit superconducting quantum device proposed by IBMQ [33] is not exactly a grid; nevertheless, it is a sub-graph of a grid. We can still perform Algorithm DepAncGrid on the expanded grid by using SWAP gates to implement the CNOT gate for two vertices that are not connected in the sub-graph. Another avenue is to construct a new virtual 2D grid and convert each single CNOT gate in the virtual grid to a series of CNOT gates on the real device. We can ensure that the additional cost in CNOT gates is bounded by a constant factor of the original one due to its sub-grid structure.
We list two open problems for the optimization of CNOT circuits on the limited connectivity architecture.
(1) Is there any improved size optimization algorithm for some more specific structures under the limited connectivity architecture?
(2) If there are no ancillas, can we give better results for the depth optimization of CNOT circuits on the $m_1 \times \cdots \times m_d$ grid for constant $d$?
Similar to the CNOT gate, the Toffoli gate is also a classical reversible gate. Nevertheless, the optimization of Toffoli circuits is much more complex than that of CNOT circuits. An explicit reason is that the Toffoli gate is not a linear map, so we cannot relate the gate operations to linear operations on a small matrix, e.g., a $\{0,1\}^{n\times n}$ Boolean matrix. The intrinsic reason is that Toffoli circuits are computationally universal: they generate all possible reversible transformations $f: \{0,1\}^n \to \{0,1\}^n$ if ancillas are allowed [37]. Since Toffoli is also a classical reversible gate, we can optimize specific Toffoli circuits guided by the specific Boolean function they implement, similar to the process with CNOT circuits. We give an example in the supplementary material for the implementation of the staircase of Toffoli gates under the full connectivity structure. We leave as an interesting open question whether this approach can be applied to more general Toffoli circuits.
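The linearity gap between the two gates can be checked directly. The sketch below treats each gate as a map on classical bit strings and tests $f(x \oplus y) = f(x) \oplus f(y)$ for all pairs of inputs, which is the defining property of a linear map over $\mathbb{F}_2$. The function names are hypothetical and chosen only for this illustration.

```python
from itertools import product


def cnot(bits):
    """CNOT with control bit 0 and target bit 1 on a 3-bit string."""
    a, b, c = bits
    return (a, b ^ a, c)


def toffoli(bits):
    """Toffoli with control bits 0, 1 and target bit 2."""
    a, b, c = bits
    return (a, b, c ^ (a & b))


def xor(x, y):
    return tuple(p ^ q for p, q in zip(x, y))


def is_linear_over_f2(f, n):
    """f is linear over F_2 iff f(x XOR y) == f(x) XOR f(y) for all x, y."""
    inputs = list(product((0, 1), repeat=n))
    return all(f(xor(x, y)) == xor(f(x), f(y)) for x in inputs for y in inputs)


if __name__ == "__main__":
    print("CNOT linear:", is_linear_over_f2(cnot, 3))
    print("Toffoli linear:", is_linear_over_f2(toffoli, 3))
```

The Toffoli check fails, for example, at $x = 100$ and $y = 010$: each input is left unchanged, but their sum $110$ is mapped to $111$.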
Proof. [Proof of Lemma 1] Denote $T_k(G) = \max_{|S|=k,\, S\subseteq V} T_G(S)$, where $T_G(S)$ is the size (the number of vertices) of the minimum Steiner tree for vertex set $S$ in $G$.
Let $S = \{u_1, \cdots, u_k\} \subseteq V$ and let $A$ be an empty set. In a connected graph $G(V, E)$, let the distance between two vertices be the number of edges in the shortest path that connects them. Let $d(i, j)$ denote the distance between vertices $i$ and $j$ in graph $G$, and let $d(A, v)$ denote the minimum distance between vertex $v$ and set $A$.
Firstly put $u_1$ into set $A$, and then put all $v \in V$ such that $d(A, v) = 3$ into $A$. Let $a_1, \cdots, a_k$ be vertices such that $a_i \in A$ is the vertex of $A$ closest to $u_i \in S$. By the construction of $A$, we have $T_G(\{a_1, \cdots, a_k\}) \le 3(|A| - 1) + 1 \le 3|A|$.
By the definition of $G$, for any $k$ vertices in $G$, the sum of their degrees is greater than $n$. On the other hand, since no two vertices in $A$ have a common neighbor, $\sum_{v\in A} d(v) + |A| \le n$, where $d(v)$ denotes the degree of $v$; thus we have $|A| < k$.
Therefore, $T_G(S) \le T_G(\{a_1, \cdots, a_k\}) + 2d$.

For $k$ from $m_1$ to 2, $(k-1, 1, \ldots, 1) \leftarrow (k, 1, \ldots, 1)$, we get the sum of all vertices, except that the vertices not in $S$ are added twice. The arithmetic operations are over $\mathbb{F}_2$, so $y = \sum_{i\in S} x_i$. We can recover $x_1, x_2, \ldots, x_{n-1}$ by reversing the above operations that are not related to $(1, 1, \ldots, 1)$. The operations in the $i$-th item and the $(2d + 1 - i)$-th item can be parallelized to depth $m_i - 1$, where $1 \le i \le d$, hence the total depth is O(

In this section, we give the optimization method for the staircase of Toffoli gates.

Proof. Let unitary $C$ denote the circuit of the staircase of Toffoli gates (Figure 15), and let output $y := Cx$, where $x \in \{0, 1\}^n$ is the input string. Then we have
$$y_k = \begin{cases} \bigoplus_{0\le j\le k/2} x_{2j} \prod_{j\le i\le k/2-1} x_{2i+1}, & \text{if } k \bmod 2 = 0,\\ x_k, & \text{otherwise,} \end{cases}$$
for any $0 \le k \le n-1$. It is easy to find that the frequency of $q_i$ in all of $y_k$ is
$$\#x_{2k-1} = k(n-k), \qquad \#x_{2k} = n-k,$$
where $0 \le k \le \frac{n-1}{2}$. We can construct all of the required outputs by
(2) Computing all of the multiplications of the variables in every clause in parallel with all of the copied variables in the ancillas;
(3) Computing all of the additions of all the clauses in parallel;
(4) Restoring all of the qubits except the generated output $y$;
(5) Let $U_f$ be the circuit of Steps (1)-(4). Restore all ancillas to zero by applying $U_{f^{-1}}$ to qubits $(y, x, 0)$, where $y = f(x)$ is the Boolean function associated with the staircase of Toffoli gates.
If we are allowed to use CNOT gates, then the copy operation can be implemented with CNOT circuits; otherwise, we can use a Toffoli gate with one control qubit initialized to one to implement the CNOT gate. Hence the copy operation and the multiplication $x_0 x_1 \cdots x_n$ or addition $x_0 \oplus x_1 \oplus \cdots \oplus x_n$ can each be implemented in $O(\log n)$ depth with $O(n)$ ancillary qubits. For each $y_k$ we require $O(n^2)$ ancillas, and Steps (1-4) require $O(n^3)$ ancillas in total.
Here $f^{-1}$ can be implemented as the inverse of the circuit in Fig. 15. $U_{f^{-1}}$ can be implemented with $O(n)$ Toffoli gates, since the expression of $f^{-1}(x)$ is much simpler.
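The closed form for $y_k$ used in this proof can be checked by brute force on small $n$. The sketch below assumes a particular layout for the staircase, since Fig. 15 is not reproduced here: the $t$-th Toffoli gate has controls $q_{2t-2}, q_{2t-1}$ and target $q_{2t}$, with gates applied in increasing order of $t$. Under that assumption the odd-indexed bits are never modified, and the even-indexed outputs match the XOR-of-products expression. The function names are hypothetical.

```python
from itertools import product


def staircase_toffoli(x):
    """Assumed layout: Toffoli(q_{k-2}, q_{k-1} -> q_k) for k = 2, 4, 6, ..."""
    y = list(x)
    for k in range(2, len(y), 2):
        y[k] ^= y[k - 2] & y[k - 1]
    return y


def closed_form(x):
    """y_k = XOR over 0<=j<=k/2 of x_{2j} * prod_{j<=i<=k/2-1} x_{2i+1} (k even)."""
    y = list(x)  # odd-indexed outputs stay equal to the inputs
    for k in range(0, len(x), 2):
        acc = 0
        for j in range(k // 2 + 1):
            term = x[2 * j]
            for i in range(j, k // 2):
                term &= x[2 * i + 1]
            acc ^= term
        y[k] = acc
    return y


def formula_matches_circuit(n):
    """Compare the simulated circuit with the closed form on all 2^n inputs."""
    return all(staircase_toffoli(x) == closed_form(x)
               for x in product((0, 1), repeat=n))


if __name__ == "__main__":
    print(all(formula_matches_circuit(n) for n in range(1, 10)))
```

The agreement follows from the recurrence $y_k = x_k \oplus x_{k-1} y_{k-2}$ for even $k \ge 2$, which multiplying out reproduces the closed form term by term.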