Weighted Quantum Channel Compiling through Proximal Policy Optimization

We propose a general and systematic strategy to compile arbitrary quantum channels without using ancillary qubits, based on proximal policy optimization -- a powerful deep reinforcement learning algorithm. We rigorously prove that, in sharp contrast to the case of compiling unitary gates, it is impossible to compile an arbitrary channel to arbitrary precision with any given finite elementary channel set, regardless of the length of the decomposition sequence. However, for a fixed accuracy $\epsilon$ one can construct a universal set with constant number of $\epsilon$-dependent elementary channels, such that an arbitrary quantum channel can be decomposed into a sequence of these elementary channels followed by a unitary gate, with the sequence length bounded by $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$. Through a concrete example concerning topological compiling of Majorana fermions, we show that our proposed algorithm can conveniently and effectively reduce the use of expensive elementary gates through adding the weighted cost into the reward function of the proximal policy optimization.

Quantum compilers, which decompose quantum operations into hardware-compatible elementary operations, play an important role in quantum computation [1] and digital simulation [2]. This technique is especially crucial for applications of noisy intermediate-scale quantum devices [3], where the performance of deep quantum circuits may be limited by noise and decoherence. A number of notable approaches have been proposed to compile unitary gates and the dynamics of isolated quantum systems [4][5][6][7][8][9][10][11][12][13][14][15][16]. However, in reality quantum systems cannot be perfectly isolated and inevitably interact with the external environment, making the more general quantum channel compiling indispensable for a wide range of applications [17,18]. Yet quantum channel compilation has barely been explored [19]. Previous work has mainly exploited the Stinespring dilation theorem [20], compiling arbitrary quantum channels through elementary gates acting on an expanded Hilbert space in which ancillary qubits are a prerequisite [21][22][23][24][25][26][27][28][29][30]. Hitherto, a general and systematic strategy to compile arbitrary quantum channels without using ancillary qubits has not been established. Here, we prove two generic theorems regarding channel compiling and introduce such a strategy based on deep reinforcement learning (see Fig. 1 for an illustration).
Machine learning, or more broadly artificial intelligence, has recently cracked a number of notoriously challenging problems, such as playing the game of Go [32,33], predicting protein spatial structures [34], and weather forecasting [35]. Its tools and techniques have been broadly exploited in various quantum physics tasks, including representing quantum many-body states [36,37], quantum state tomography [38,39], learning topological phases of matter [40][41][42][43][44][45][46][47][48][49][50], and nonlocality detection [51]. For quantum compiling of unitary gates, machine learning approaches have also been introduced to provide a near-optimal sequence [52,53]. In this paper, we first rigorously prove that it is impossible to compile an arbitrary quantum channel to arbitrary accuracy using unitary gates and a finite set of elementary channels, which is in sharp contrast to the case of compiling a unitary gate. As illustrated in Fig. 1, we propose a quantum channel compiler which, given an accuracy demand $\epsilon$, decomposes any quantum channel into a sequence of finitely many types of elementary quantum channels followed by a unitary gate. We provide a constructive method to obtain the elementary channel set and show that the size of the set scales as $O(d^2)$ with the dimension $d$ of the Hilbert space and is independent of $\epsilon$. We additionally prove that the length of the elementary channel sequence in the decomposition is bounded above by $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$. For the unitary gate at the end of the decomposition, we train a deep reinforcement learning (DRL) agent to decompose it into hardware-compatible elementary gates. To reduce the resource requirement of the compiler, we exploit the proximal policy optimization (PPO) algorithm [54] and train our agent with a weighted cost in the reward function to reduce the use of experimentally expensive elementary gates. We further prove an $\Omega(\log(1/\epsilon))$ lower bound for the count of any indispensable expensive elementary gate needed to compile an arbitrary unitary gate within error $\epsilon$. As a benchmark, we apply our algorithm to the topological quantum compiling of Majorana fermions [55,56], whose braidings together with a non-topological T gate form a universal set. We numerically show that our algorithm can reduce the use of T gates by a factor of two compared to the traditional Solovay-Kitaev algorithm.

FIG. 1. The compiler first decomposes the target channel into a sequence of elementary channels followed by a unitary gate. Then a DRL environment is set up to produce, from elementary gates, an approximation sequence $U_n$ for the target gate $U$ required in the previous step. At each step $n$, the agent gets an observation $O_n$ and feeds $O_n$ as the input vector to the deep neural network (DNN). The DNN outputs a probability distribution $\pi(a_n|O_n)$, according to which the agent chooses the next gate to apply in the decomposition sequence. The environment returns a reward $r_n$ to the agent afterward. See the Supplementary Materials for details [31].

arXiv:2111.02426v1 [quant-ph] 3 Nov 2021
Notations.-To begin with, we introduce some basic notations and concepts [1]. A quantum state can be represented by a positive semi-definite operator $\rho \in O(H_S)$ with $\mathrm{Tr}(\rho) = 1$, where $H_S$ is the Hilbert space and $O(H_S)$ the set of operators on $H_S$. In general, a quantum channel $\mathcal{E}$ can be characterized by a completely positive, trace-preserving (CPTP) map which maps a quantum state $\rho$ into another state $\mathcal{E}(\rho) \in O(H_S)$. Any single-qubit state can be represented as $\rho = \frac{1}{2}(I + a\cdot\sigma)$, where $a$ is a three-dimensional vector within the Bloch sphere and $\sigma = (\sigma_x, \sigma_y, \sigma_z)$ are the Pauli matrices. Any linear CPTP map for a single-qubit system can be represented by a four-by-four matrix $\mathcal{T}$ [21,57,58,59]:

$$\mathcal{T}_{ij} = \frac{1}{2}\mathrm{Tr}\left[\sigma_i\,\mathcal{E}(\sigma_j)\right], \qquad \mathcal{T} = \begin{pmatrix} 1 & 0 \\ t & T \end{pmatrix}, \qquad (1)$$

where $i, j \in \{0, 1, 2, 3\}$, $\sigma_0, \sigma_1, \sigma_2, \sigma_3$ represent the Pauli matrices $I, \sigma_x, \sigma_y, \sigma_z$, and $T \in \mathbb{R}^{3\times 3}$, $t \in \mathbb{R}^3$. Under this representation, a channel is an affine map [60] $\mathcal{E}: a \mapsto a' = Ta + t$. Geometrically, $\mathcal{E}$ maps the states within the Bloch sphere into states enveloped by an ellipsoid, with $t$ the shift of the center from the original center and $T$ the distortion matrix of the ellipsoid. When $\det(T) = 1$, the CPTP map reduces to a unitary gate. In this sense, unitary gates can be regarded as special channels. Throughout this paper, we differentiate unitary gates from channels for clarity.

A general theorem for channel compilation.-To formulate the problem, we consider a set $S$ with metric $d(\cdot,\cdot)$. A set $\Gamma \subset S$ is called a $\delta$-net if for any $x \in S$ there exists $y \in \Gamma$ such that $d(x, y) \le \delta$ [5]. The subset $\Gamma \subset S$ is called a dense subset under the metric $d(\cdot,\cdot)$ if it is a $\delta$-net of $S$ for arbitrary $\delta$.
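As an illustration of the affine representation introduced in the Notations paragraph, the following minimal sketch computes the distortion matrix $T$ and center shift $t$ of a single-qubit channel from its Kraus operators via $\frac{1}{2}\mathrm{Tr}[\sigma_i\,\mathcal{E}(\sigma_j)]$. The amplitude-damping channel and all function names are illustrative assumptions and do not appear in the paper.

```python
import math

# Pauli matrices sigma_0..sigma_3 as 2x2 nested lists.
PAULIS = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
    [[0, -1j], [1j, 0]],
    [[1, 0], [0, -1]],
]

def matmul(a, b):
    """Multiply two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]

def apply_channel(kraus_ops, op):
    """Return sum_k K op K^dagger for a 2x2 operator op."""
    out = [[0j, 0j], [0j, 0j]]
    for k in kraus_ops:
        term = matmul(matmul(k, op), dagger(k))
        for i in range(2):
            for j in range(2):
                out[i][j] += term[i][j]
    return out

def affine_representation(kraus_ops):
    """Return (T, t): the 3x3 distortion matrix and the center shift."""
    full = []
    for i in range(4):
        row = []
        for j in range(4):
            m = matmul(PAULIS[i], apply_channel(kraus_ops, PAULIS[j]))
            row.append(((m[0][0] + m[1][1]) / 2).real)
        full.append(row)
    T = [row[1:] for row in full[1:]]
    t = [row[0] for row in full[1:]]
    return T, t

def amplitude_damping(gamma):
    """Kraus operators of an amplitude-damping channel (illustrative choice)."""
    return [
        [[1, 0], [0, math.sqrt(1 - gamma)]],
        [[0, math.sqrt(gamma)], [0, 0]],
    ]

T, t = affine_representation(amplitude_damping(0.3))
```

For this example the distortion matrix is diagonal and the center shift points along the z-axis, which matches the geometric picture of an ellipsoid displaced inside the Bloch sphere.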
Suppose we have a set of elementary channels and want to approximate the target channel with a sequence of unitary gates and elementary channels chosen from the set. For technical convenience and simplicity, we consider the Schatten one-norm [61] $\|\mathcal{E}_{\rm target} - \mathcal{E}_{\rm approx}\|_{1\to 1} = \max_{\rho \in O(H_S)} \|\mathcal{E}_{\rm target}(\rho) - \mathcal{E}_{\rm approx}(\rho)\|_1$ as the distance measure. Now, we are ready to present our general theorem.

Theorem 1. Consider compiling single-qubit channels using unitary gates and elementary channels with the Schatten one-norm as the distance measure. Then:
(1) Given a finite set of elementary channels together with an arbitrary unitary gate, it is impossible to compile an arbitrary single-qubit channel to arbitrary accuracy.
(2) Given an accuracy bound $\epsilon$, one can construct a finite set of elementary quantum channels using $\epsilon$ as a parameter such that any single-qubit channel can be compiled by the elementary channels from this set and a unitary gate within error $\epsilon$. The length of the sequence is bounded above by $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$.
Proof. We provide the main idea here. The full proof is technically involved and thus left to the Supplementary Materials [31]. Suppose that we are provided with a finite set of elementary channels $C = \{\mathcal{E}_1, \ldots, \mathcal{E}_n\}$ with corresponding distortion matrices $\{T_1, \ldots, T_n\}$ and center shifts $\{t_1, \ldots, t_n\}$. Without loss of generality, we assume that $\det(T_n) \le \ldots \le \det(T_1) < 1$. Since the composition of channels cannot increase the determinant of the distortion matrix, a target channel with $\det(T_1) < \det(T)$ cannot be compiled to arbitrary accuracy by channels chosen from $C$, no matter how long the decomposition sequence is. For part (2), we decompose the compiling process into several steps and provide a constructive proof. For a target channel $\mathcal{E}$ with center shift $t$ and distortion matrix $T$ with eigenvalues $\lambda_1, \lambda_2, \lambda_3$, we first implement an intermediate channel $\mathcal{E}_1$ with parameters $T_1 = \mathrm{diag}(|\lambda_1|, |\lambda_2|, |\lambda_3|)$, $t_1 = t$ using elementary channels parametrized by $\epsilon$. We then use a unitary gate to realize the negative and complex eigenvalues and the basis transformations.
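The determinant argument in the proof sketch can be checked numerically. The sketch below composes two channels in the affine picture, $a \mapsto T_2(T_1 a + t_1) + t_2$, and confirms that the determinant of the composed distortion matrix is the product of the individual determinants, so it cannot exceed either factor when both lie in $[0, 1]$. The two example channels (a depolarizing and an amplitude-damping channel) and the helper names are assumptions made for illustration.

```python
import math

def matmul3(a, b):
    """Multiply two 3x3 real matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(a, v):
    """Apply a 3x3 matrix to a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def compose(second, first):
    """Affine data (T, t) of applying `first` and then `second`."""
    T2, t2 = second
    T1, t1 = first
    shifted = matvec3(T2, t1)
    return matmul3(T2, T1), [shifted[i] + t2[i] for i in range(3)]

def diag(x, y, z):
    return [[x, 0.0, 0.0], [0.0, y, 0.0], [0.0, 0.0, z]]

# Illustrative channels: depolarizing (uniform shrinking) and amplitude damping.
gamma = 0.3
depolarizing = (diag(0.9, 0.9, 0.9), [0.0, 0.0, 0.0])
damping = (diag(math.sqrt(1 - gamma), math.sqrt(1 - gamma), 1 - gamma),
           [0.0, 0.0, gamma])

T_composed, t_composed = compose(damping, depolarizing)
det_composed = det3(T_composed)
```

Because determinants multiply under composition, a finite set whose largest determinant is $\det(T_1) < 1$ can never reach a target with a larger determinant, which is the obstruction used in part (1).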
The above theorem can be extended to multi-qubit channels. For a $d$-dimensional quantum state $\rho \in O(H_S)$, there exists a canonical and orthonormal basis $\{O_\alpha, \alpha = 1, \ldots, d^2-1\}$ [59,62]. A density operator $\rho$ in such a basis can be described by a polarization vector $p$ in a $(d^2-1)$-dimensional ball, with $\|p\|_2 = 1$ representing pure states and $\|p\|_2 < 1$ representing mixed states. As a quantum state $\rho$ can be represented by a vector within a ball, a quantum channel $\mathcal{E}: O(H_S) \to O(H_S)$ can be written as an affine map represented by a distortion matrix $T \in \mathbb{R}^{(d^2-1)\times(d^2-1)}$ and a center shift $t \in \mathbb{R}^{d^2-1}$, similar to Eq. (1). With this representation, we can extend Theorem 1 to the multi-qubit case [31]. It is worth mentioning that in this case the size of the constructed set of quantum channels scales as $O(d^2)$ in part (2) of the theorem [31].
The above results imply that a finite number of elementary channels cannot approximate an arbitrary target channel to arbitrary accuracy, regardless of the specific structure of each elementary channel and the length of the compiling sequence. This is in sharp contrast with the case of unitary gate compiling, where a small number of elementary gates suffice to compile an arbitrary unitary gate within any accuracy demand. We remark that any quantum channel can be implemented by a sequence of elementary unitary gates acting on a dilated Hilbert space, which seems to contradict part (1) of Theorem 1. However, this apparent contradiction dissolves once we note that tracing out the ancillary qubits at different locations in the sequence effectively results in different channels, even for the same elementary unitary gates. In other words, although a small number of different unitary gates suffice to implement any quantum channel with ancillary qubits, no finite set of elementary channels is universal when restricted to the target system.
In the proof for part (2) of Theorem 1, we have provided two explicit constructions to decompose an arbitrary quantum channel into a sequence of elementary channels followed by a unitary gate [31]. The first construction has an elementary channel set of size $O(d^2)$ with a sequence length $O(d^2\frac{1}{\epsilon}\log\frac{1}{\epsilon})$, and the other uses a much larger elementary channel set [of size $O(2^{d^2})$] but a much shorter decomposition sequence [of length $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$]. In other words, we can decompose any target quantum channel into a fixed sequence of elementary channels followed by an $n$-qubit unitary gate. The channel compilation task has thus been reduced to unitary compiling with elementary unitary gates.
Weighted unitary gate compilation and a DRL algorithm.-We now consider quantum compilation for unitary gates $U \in SU(d)$ with the elementary gate set $S_E = \{g_1, \ldots, g_n \mid g_i \in SU(d)\}$. A gate set is universal if it can compile arbitrary unitary gates to any given accuracy demand under the distance measure. In other words, a gate set is universal if and only if it generates a dense subgroup in $SU(d)$ [5]. We present the following theorem concerning the lower bound on the count of any indispensable gate in compiling an arbitrary unitary.
Theorem 2. For a non-dense subgroup $G \subset SU(d)$ generated by $S_E$, suppose we can find $g^* \in SU(d)$ such that the group generated by $\{g_1, \ldots, g_n, g^*\}$ is dense in $SU(d)$. When employing $\{g_1, \ldots, g_n, g^*\}$ as the elementary gate set for the quantum compilation task on $SU(d)$, the number of $g^*$ gates needed to compile an arbitrary gate within distance $\epsilon$ is bounded below by

$$\Omega\left(\frac{d^2-1}{\log|G|}\log\frac{1}{\epsilon}\right).$$

The proof of Theorem 2 relies on the volume method, which considers covering the whole $SU(d)$ space with $\epsilon$-balls centered at each possible gate sequence. For brevity, we leave the technical details to the Supplementary Materials [31]. This theorem gives a lower bound for the count of any indispensable gate $g^*$ in compiling an arbitrary unitary, which scales linearly in $\log(1/\epsilon)$ but quadratically in the Hilbert space dimension $d$, which itself grows exponentially with the system size. In practical applications, $g^*$ may represent some experimentally expensive or flawed gate, and thus reducing its count in compiling can be crucial. For the case of quantum compiling with the Clifford+T gate set, a number of striking algorithms [12][13][14][15][16], which either exploit its specific structure or utilize ancillary qubits, have been proposed to reduce the T count. Here, we introduce a more general approach (in the sense that it does not rely on special properties of the elementary gate set and thus bears universal applicability) without using ancillary qubits. We exploit a reinforcement learning technique, proximal policy optimization [54] in particular, to reduce the count of experimentally expensive gates.
Unlike commonly used Q-learning algorithms [53,63] such as deep-Q networks [64], PPO represents a policy explicitly as $\pi_\theta(a|s)$ by a neural network, which receives the current state $s$ as input and outputs the probability $\pi_\theta(a|s)$ with which the agent chooses each action $a$. The updating rules used by PPO explore the biggest possible improvement step without causing a performance collapse [54]. This property makes PPO particularly suitable for quantum compilation tasks, since applying an inappropriate gate in the sequence can drastically degrade the approximation gate. Moreover, we modify the reward function used in PPO so that it efficiently reduces the count of a specific gate that is experimentally costly: $r_n = r(U_n, U_t) - C_{g^*}$ when the $n$-th gate applied is $g^*$, and $r_n = r(U_n, U_t)$ otherwise, where $r_n$ is the reward that the agent receives for the $n$-th step, $r(U_n, U_t)$ is the reward determined by comparing the approximation gate $U_n$ and the target gate $U_t$, and $C_{g^*}$ is the additional weighted punishment for the employment of $g^*$ gates.
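The weighted reward described above can be sketched as a small wrapper around an arbitrary base reward. This is a minimal illustration only: the action labels, the choice to penalize both a gate and its inverse, and the function names are assumptions, and the form of the base reward $r(U_n, U_t)$ is left to the caller rather than taken from the paper.

```python
def weighted_reward(base_reward, action, cost, penalized_actions=("T", "T_inv")):
    """Base reward minus the weighted cost when a penalized gate is applied.

    base_reward       -- value of r(U_n, U_t), computed elsewhere (assumption)
    action            -- label of the gate chosen at this step (hypothetical labels)
    cost              -- the weight C_{g*} of the expensive gate
    penalized_actions -- labels counted as uses of the expensive gate
    """
    penalty = cost if action in penalized_actions else 0.0
    return base_reward - penalty

def episode_return(base_rewards, actions, cost):
    """Sum of the weighted per-step rewards over one decomposition sequence."""
    return sum(weighted_reward(r, a, cost)
               for r, a in zip(base_rewards, actions))

# Two sequences with identical base rewards; the second uses one fewer T gate.
base = [0.1, 0.2, 0.5]
with_two_t = episode_return(base, ["B12", "T", "T"], cost=2.0)
with_one_t = episode_return(base, ["B12", "T", "B23"], cost=2.0)
```

With a positive cost, the sequence that uses fewer penalized gates collects the larger return, which is the pressure that steers the agent away from the expensive gate.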
By increasing $C_{g^*}$, the agent tends to avoid $g^*$ gates, and the $g^*$ count in the decomposition is thus reduced [31]. By exploiting the PPO algorithm, our algorithm searches for the approximation sequence in a depth-first scheme. Compared with the breadth-first search DRL algorithm proposed in Ref. [53], the PPO algorithm requires significantly less time and memory in both the training and searching stages. This advantage makes our algorithm feasible for compiling tasks for larger systems with more complicated actions and rewards.

Topological compiling of Majorana fermions.-To benchmark the performance of our PPO algorithm in practice, we consider topological compiling with the four-quasiparticle encoding scheme for Majorana fermions [56]. Two elementary gates corresponding to the braidings of four Majorana fermions are shown in Fig. 2(a), and unitary gates can be approximated through braiding sequences and T gates. A simple example for decomposing an x-rotation is shown in Fig. 2.

It is well known that braidings of Majorana fermions only lead to an elementary gate set $S_E = \{\mathrm{CNOT}, H, S\}$, which is not sufficient for universal quantum computation [56,65]. To achieve universality, a non-topological T gate with a relatively high experimental cost is necessary. Therefore, reducing the T count is of practical importance [66][67][68]. Here, we apply the PPO algorithm to attain this goal. To this end, we note that the topological gate set generates a Clifford group $\mathrm{Cl}_n \subset SU(d)$ with finite size $|\mathrm{Cl}_n| = O(2^{2n^2+3n})$ [69][70][71]. According to Theorem 2, the T count for compiling an arbitrary gate in the worst case scales as $\Omega(\frac{d^2-1}{n^2}\log\frac{1}{\epsilon})$. We mention that topological quantum compiling has been broadly explored, with various algorithms proposed [72][73][74][75][76][77][78].
Most of the algorithms run in time $O(\mathrm{polylog}(1/\epsilon))$ and output sequences of braidings with length $O(\mathrm{polylog}(1/\epsilon))$ to obtain an approximation within distance $\epsilon$ of a given target evolution. Here, we exploit the average gate fidelity $F(U, V) = \int d\psi\, |\langle\psi|UV^\dagger|\psi\rangle|^2$ provided in the open-source QuTiP package [79] to measure the distance $d(U, V) = 1 - F(U, V)$ between an approximated unitary gate $V$ and a target unitary gate $U$. To exploit our DRL algorithm as a single-qubit quantum compiler, the action space is a set $A = \{B_{12}, B_{12}^{-1}, B_{23}, B_{23}^{-1}, T, T^{-1}\}$ containing six elementary gates. To train the agent, we employ a deep neural network (DNN) with five layers of fully connected neurons and train it with a PPO algorithm encapsulated in the OpenAI Gym and Baselines packages [80]. To construct the training data, we generate a sequence of $L$ elementary gates and take the resulting gate as the target gate. To test our agent, we construct a test dataset of 1500 data samples, each a random sequence of gates from $A$ of length between 10 and 80. We compare the performance of our algorithm and the traditional Solovay-Kitaev algorithm on this dataset [31].
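For unitary gates the Haar integral in the average gate fidelity has the standard closed form $F(U, V) = (|\mathrm{Tr}(UV^\dagger)|^2 + d)/(d(d+1))$. The sketch below evaluates the distance $d(U, V) = 1 - F(U, V)$ with this formula in plain Python. It is meant as an illustration of the distance measure, not as a reproduction of the QuTiP routine used in the paper, and the function names are assumptions.

```python
import cmath
import math

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(a):
    """Conjugate transpose of a square matrix."""
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]

def average_gate_fidelity(u, v):
    """Haar average of |<psi|U V^dagger|psi>|^2 for unitary u and v."""
    d = len(u)
    m = matmul(u, dagger(v))
    tr = sum(m[i][i] for i in range(d))
    return (abs(tr) ** 2 + d) / (d * (d + 1))

def gate_distance(u, v):
    """Distance d(U, V) = 1 - F(U, V)."""
    return 1.0 - average_gate_fidelity(u, v)

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
T_GATE = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]
PHASED_I = [[1j, 0], [0, 1j]]  # identity up to a global phase

d_identity = gate_distance(I2, I2)
d_flip = gate_distance(X, I2)
d_t = gate_distance(T_GATE, I2)
d_phase = gate_distance(PHASED_I, I2)
```

The distance vanishes for gates that differ only by a global phase, so the compiler is free to ignore such phases when comparing the approximation with the target.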
In Fig. 3, we plot the error distribution over the test dataset and the proportion of T gates for our algorithm and the Solovay-Kitaev algorithm with the same threshold $t = 10^{-3}$. We observe that the error distributions for both algorithms show two peaks, one below the threshold $t$ and the other above it. This can be understood as the risk of failure inherent in the depth-first search scheme, which is exploited in both algorithms. On average, the Solovay-Kitaev algorithm with net depth 12 provides a more accurate approximation than our algorithm at the price of a longer average length, while the depth-8 version provides an equal average length with lower approximation accuracy. In the inset of Fig. 3, we find that by increasing the cost of the T gate from $C_T = 0$ to $C_T = 2$, the average T gate rate over the test set decreases from 0.354 to 0.201. Compared with the Solovay-Kitaev algorithm, which gives a T gate rate of 0.445, this is a reduction of roughly 55%. This shows that adding the cost of the T gate as a punishment in DRL can effectively reduce the T count.
We remark that the depth-first search scheme in our PPO algorithm makes the time complexity scale linearly with the maximal search depth $L_{\max}$. This is distinct from the A* search (which is breadth-first and hence exhibits an exponential scaling with $L_{\max}$) used in Ref. [53]. As a result, the PPO algorithm consumes significantly less time and memory in both the training and searching stages. This advantage makes our algorithm feasible for compiling tasks for larger systems with more complicated actions and rewards. As a trade-off, the PPO algorithm does not necessarily output the near-optimal sequence for a given target unitary and accuracy demand, and may even fail to find a decomposition if the accuracy threshold is too small. We mention that one can increase the success rate of the compilation by increasing $L_{\max}$ [31].
Discussion and conclusion.-Theorem 1 implies that for channel compiling, there exists no finite universal elementary channel set. However, for a given target channel, this theorem does not tell whether it can be decomposed into predetermined elementary channels to a desired accuracy. Finding a general and efficiently computable criterion for determining whether a given channel can be compiled with a fixed elementary channel set is of both theoretical and experimental importance, and worth future investigation. Another interesting and important future direction is to incorporate a partial breadth-first search mechanism into the current PPO algorithm to increase the success rate and reduce the total length of the output sequences, at the cost of a slightly more time- and memory-consuming training process. In addition, our proposed PPO algorithm may carry over straightforwardly to other scenarios, including quantum control problems [81] and digital quantum simulations [82][83][84][85][86] for both closed and open systems.
In summary, we have rigorously proved that, in sharp contrast to the case of unitary compiling, it is impossible to compile an arbitrary channel to arbitrary accuracy with any given finite elementary channel set, regardless of the length of the decomposition sequence. We analytically constructed a general scheme to decompose an arbitrary channel into a sequence of elementary channels plus a unitary gate, hence reducing the task of channel compiling to unitary compiling. To reduce the count of certain experimentally expensive gates, we further introduced a DRL algorithm that is generally applicable to any elementary gate set and uses no ancillary qubit. We benchmarked the performance of our algorithm with an example concerning quantum compiling of Majorana fermions, and demonstrated that our approach can reduce the use of expensive gates efficiently and effectively. Our results shed new light on the general problem of quantum compiling, which provides a valuable guide for future studies in both theory and experiment.

Supplementary Materials for: Weighted Quantum Channel Compiling through Proximal Policy Optimization
In the Supplementary Materials, we provide more details about the proofs for the two theorems in the main text. We also provide supplementary notes on deep reinforcement learning, proximal policy optimization (PPO) algorithms, and a detailed description of our algorithm with more numerical results. We additionally provide a brief recapitulation for encoding and braiding Majorana fermions.
A. Proof for the Theorem 1
We recall Eq. (1) in the main text, which provides a matrix-form representation for a linear CPTP quantum channel $\mathcal{E}$ as

$$\mathcal{T} = \begin{pmatrix} 1 & 0 \\ t & T \end{pmatrix},$$

where $T$ is called the distortion matrix, $t$ is called the center shift, and the state vectors $a$ and $a'$ are chosen within the Bloch sphere satisfying the relation $a' = Ta + t$. This representation indicates that the channel $\mathcal{E}$ maps the Bloch sphere into an ellipsoid. To guarantee physical feasibility, the ellipsoid must be enveloped within the original Bloch sphere, so that $|a|^2, |a'|^2 \le 1$. Therefore, $T$ cannot have an eigenvalue with magnitude larger than 1. Moreover, if all eigenvalues of $T$ have magnitude 1, then $t = 0$ and $\mathcal{E}$ is a unitary gate. For simplicity, we denote a quantum channel $\mathcal{E}$ with distortion matrix $T$ and center shift $t$ by $\mathcal{E}(T, t)$.
As mentioned in the main text, the distance measure between channels used in this paper is the Schatten one-norm [61], which measures the maximal 1-norm distance between the output states of different quantum channels for the same input state $\rho$. For single-qubit states $\rho_1 = (I + a_1\cdot\sigma)/2$ and $\rho_2 = (I + a_2\cdot\sigma)/2$, the trace distance between them reads $\|\rho_1 - \rho_2\|_1 = |a_1 - a_2|/2$. This indicates that the distance between channels $\mathcal{E}_1(T_1, t_1)$ and $\mathcal{E}_2(T_2, t_2)$ is $\max_{|a| \le 1} |(T_1 - T_2)a + (t_1 - t_2)|/2$.

Now we start the proof for the first part of Theorem 1. We first introduce the following lemma.
For part (2) of this theorem, we assume that the target channel is $\mathcal{E}^*(T^*, t^*)$ and that $T^*$ has eigenvalues $\lambda_1^*$, $\lambda_2^*$, and $\lambda_3^*$. Without loss of generality, we assume $t_i^* > 0$ and suppose the accuracy demand is $\epsilon$. Now, we propose a four-step procedure to decompose the target channel into a sequence of unitary gates and channels chosen from 14 elementary channels.
We introduce a procedure that uses the above elementary channels to compile $T_{\text{step 1}}$ within distance $6\delta$ using a sequence of $k_3$ elementary channels. We use Table S1 to record each elementary channel in this sequence. In this sequence, we introduce three $\{0, 1\}$ strings $s_x = s_{x,1}\ldots s_{x,k_1}$, $s_y = s_{y,1}\ldots s_{y,k_2}$, and $s_z = s_{z,1}\ldots s_{z,k_3}$. The entry $s_{i,j}$ ($i = x, y, z$) is 0 or 1 when the center shift of the $j$-th channel in the sequence along the $i$-axis is 0 or $\delta$, respectively. We use $T_j^{ii}$ ($i = 1, 2, 3$, $j = 1, \ldots, k_3$) to represent the $(i, i)$-th entry of the distortion matrix $T_j$ of the $j$-th channel in the sequence.
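Step 2. If the eigenvalues $\lambda_1^*, \lambda_2^*, \lambda_3^*$ contain complex values, i.e., $\lambda_2^* = re^{i\theta}$, $\lambda_3^* = re^{-i\theta}$, $r \in \mathbb{R}$, we use a unitary gate that can be represented as an affine map $\mathcal{E}_R(T_R, t_R)$ with $t_R = 0$ and

$$T_R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{pmatrix}.$$

Such a unitary gate can introduce the complex phases $e^{i\theta}$ and $e^{-i\theta}$ for $\lambda_2^*$ and $\lambda_3^*$, respectively. Therefore, in the following steps, we only need to consider the case where all eigenvalues are real.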
Step 3. If $T^*$ contains real negative eigenvalues, we consider three unitary transformations, specified through their affine map representations, to realize the negative signs.

Step 4. In this step we apply a unitary transformation to transfer the current basis to the $\{v_1^*, v_2^*, v_3^*\}$ basis.

In the above four steps, the unitary gates in steps 2 to 4 can be combined into one unitary gate, and according to Ref. [1] this unitary gate can be approximated within error $\delta$ by a sequence of elementary gates chosen from a universal gate set. Therefore, the total error cannot exceed the sum of the errors in all steps, which is $7\delta$. By fixing $\delta = \epsilon/7$, we can decompose an arbitrary quantum channel into a sequence of unitary gates and elementary channels chosen from the 14 elementary channels $\mathcal{E}_1, \ldots, \mathcal{E}_{14}$ constructed in Eqs. (A2)-(A7).
Following the four steps above, we can also bound the length of the sequence for the compilation. In steps 2-4, we need one unitary gate, while in step 1 the length of the table does not exceed $\log_{(1-\delta)}\delta + 1$. In practice, $\delta = \epsilon/7$ is usually a small number close to 0. Therefore, using $\ln(1-\delta) \approx -\delta$, we can make the approximation $\log_{(1-\delta)}\delta = \ln\delta/\ln(1-\delta) \approx \frac{1}{\delta}\ln\frac{1}{\delta} = O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$. This indicates that the length of the entire sequence is $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$. This completes the proof of part (2) of Theorem 1 in the main text.
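The approximation used in the length bound can be checked numerically. The sketch below compares the exact table-length bound $\log_{(1-\delta)}\delta + 1$ with the estimate $\frac{1}{\delta}\ln\frac{1}{\delta}$ for $\delta = \epsilon/7$. The function names and the sample values of $\epsilon$ are illustrative assumptions.

```python
import math

def table_length_bound(delta):
    """Exact bound log_{1-delta}(delta) + 1 on the step-1 table length."""
    return math.log(delta) / math.log(1.0 - delta) + 1.0

def asymptotic_estimate(delta):
    """Small-delta estimate (1/delta) * ln(1/delta)."""
    return (1.0 / delta) * math.log(1.0 / delta)

# Compare both expressions for a few accuracy demands, with delta = eps / 7.
comparison = []
for eps in (1e-1, 1e-2, 1e-3):
    delta = eps / 7.0
    exact = table_length_bound(delta)
    estimate = asymptotic_estimate(delta)
    comparison.append((eps, exact, estimate, (exact - 1.0) / estimate))
```

Since $|\ln(1-\delta)| > \delta$ for $0 < \delta < 1$, the exact logarithm never exceeds the estimate, and the ratio in the last column approaches one as $\epsilon$ decreases.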
We can extend Theorem 1 to the multi-qubit case. As mentioned in the main text, for a $d$-dimensional quantum state $\rho \in O(H_S)$ there exists a canonical and orthonormal basis $\{O_\alpha\}$ [59,62]. An arbitrary density operator $\rho$ can be written as $\frac{1}{d}I$ plus a linear combination of the basis operators $O_\alpha$ with real coefficients $\{p_\alpha\}$. The parameters in $\{p_\alpha\}$ form the polarization vector $p = (p_1, \ldots, p_{d^2-1})$ in a $(d^2-1)$-dimensional ball, with $\|p\|_2 = 1$ representing pure states and $\|p\|_2 < 1$ representing mixed states. Since a quantum state $\rho$ can be represented as a vector within a ball, a quantum channel $\mathcal{E}: O(H_S) \to O(H_S)$ can be written as an affine map represented by a distortion matrix $T \in \mathbb{R}^{(d^2-1)\times(d^2-1)}$ and a center shift $t \in \mathbb{R}^{d^2-1}$. Since this affine map is similar to the single-qubit case discussed in the main text, part (1) of Theorem 1 extends straightforwardly to the multi-qubit scenario.
For the second part, we assume the target channel to be $\mathcal{E}^*(T^*, t^*)$, where $T^*$ has eigenvalues $\lambda_1^*, \ldots, \lambda_{d^2-1}^*$. Without loss of generality, we can rank the $\lambda_i^*$ in decreasing order of magnitude such that $|\lambda_1^*| \ge \ldots \ge |\lambda_{d^2-1}^*|$. The quantity $k_i$ is defined as in the proof for single-qubit channels. If we directly use the constructive approach in the proof of Theorem 1, we create $2^{d^2} - 2$ elementary channels. The first $2^{d^2-1}$ channels $\mathcal{E}_1(T_1, t_1), \ldots, \mathcal{E}_{2^{d^2-1}}(T_{2^{d^2-1}}, t_{2^{d^2-1}})$ have the same distortion matrix $T_1 = \ldots = T_{2^{d^2-1}} = \mathrm{diag}\{1-\delta, \ldots, 1-\delta\}$, and their center shifts run through the $2^{d^2-1}$ cases in which each element of the center shift is either 0 or $\delta$. The next $2^{d^2-2}$ elementary channels share the same distortion matrix $\mathrm{diag}\{1, 1-\delta, \ldots, 1-\delta\}$, and their center shifts run through all cases in which the first element is 0 and the other elements are either 0 or $\delta$. The remaining channels are constructed similarly until the last two channels, which have the distortion matrix $\mathrm{diag}\{1, \ldots, 1, 1-\delta\}$ and center shifts $(0, \ldots, 0)$ and $(0, 0, \ldots, 0, \delta)$. Similar to the previous proof, we keep strings $s_i$, $i = 1, \ldots, d^2-1$, with the $j$-th element $s_{i,j} \in \{0, 1\}$ representing whether the $i$-th element of the center shift of the $j$-th channel in the sequence is 0 or $\delta$.
Under this construction, the total error can be bounded above by $(2(d^2-1)+1)\delta$. Therefore, we fix $\delta = \epsilon/(2d^2-1)$ to guarantee that the distance between our approximation and the target channel is no more than $\epsilon$. The length of the sequence is still bounded by $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$.
However, notice that the above construction requires $O(2^{d^2})$ elementary channels, which is doubly exponential in the number of qubits $n$. Here, we propose another construction that only requires $O(d^2)$ elementary channels. We still exploit the four-step compiling process of the previous section.
Under this construction, step 1 can be decomposed into sub-steps that compile $\mathcal{E}_{\text{step } i}(T_{\text{step } i}, t_{\text{step } i})$ with $T_{\text{step } i} = \mathrm{diag}\{1, \ldots, 1, |\lambda_i^*|, 1, \ldots, 1\}$ and $t_{\text{step } i} = (0, \ldots, 0, t_i^*, 0, \ldots, 0)$ using $\mathcal{E}_{2i-1}$ and $\mathcal{E}_{2i}$ separately, where $|\lambda_i^*|$ and $t_i^*$ are the $i$-th diagonal element of the distortion matrix and the $i$-th element of the center shift. In the $i$-th sub-step, we keep a 0-1 string $s_i$ of length $k_i$ and decompose $\mathcal{E}_{\text{step } i}$ into a sequence containing $k_i$ elementary channels. If the $j$-th element $s_{i,j}$ of $s_i$ is 0, we add $\mathcal{E}_{2i-1}$ to the sequence, and otherwise we add $\mathcal{E}_{2i}$. The $i$-th element of the approximated center shift is thus determined by $s_i$, and in each sub-step we can approximate $|\lambda_i^*|$ and $t_i^*$ within distance $\delta$. The total distance in this step cannot exceed $2(d^2-1)\delta$, which is the same as in the previous construction. We can still fix $\delta = \epsilon/(2d^2-1)$. It is worth mentioning that there exists a trade-off between the above two constructions. Though the second construction only requires $O(d^2)$ elementary channels, the output sequence has a total length $O(d^2\frac{1}{\epsilon}\log\frac{1}{\epsilon})$, which is exponential in the number of qubits $n$.
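The trade-off between the two constructions can be made concrete by tabulating the set sizes and rough sequence lengths. In the sketch below, the set sizes $2^{d^2}-2$ and $2(d^2-1)$ follow the counting in the text, while the sequence lengths are order-of-magnitude estimates with all hidden constants set to one, which is an assumption made for illustration. The function name and returned keys are hypothetical.

```python
import math

def construction_costs(d, eps):
    """Set sizes and rough sequence lengths for the two constructions.

    d   -- Hilbert space dimension (d = 2**n for n qubits)
    eps -- accuracy demand; delta = eps / (2 d^2 - 1) as in the text
    """
    m = d * d - 1                       # dimension of the polarization vector
    delta = eps / (2 * d * d - 1)
    per_axis = (1.0 / delta) * math.log(1.0 / delta)  # constants set to 1
    return {
        "large_set_size": 2 ** (m + 1) - 2,   # 2^{d^2} - 2 channels
        "large_set_length": per_axis,         # one shared sequence
        "small_set_size": 2 * m,              # two channels per axis
        "small_set_length": m * per_axis,     # one sub-sequence per axis
    }

single_qubit = construction_costs(d=2, eps=1e-2)
two_qubits = construction_costs(d=4, eps=1e-2)
```

For a single qubit the first construction gives the 14 elementary channels quoted in the single-qubit proof. Already for two qubits the large set has tens of thousands of channels, whereas the small set has 30 at the price of a sequence longer by a factor of $d^2-1$.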

B. Proof for the Theorem 2
In this section, we provide the detailed proof of Theorem 2 in the main text. To derive a lower bound, we exploit the volume method [5,87], based on the constraint that the whole space of $SU(d)$ should be covered by the $\epsilon$-balls centered at the gates that can be implemented by an elementary gate sequence.
We start with the case where the subgroup $G = \langle g_1, \ldots, g_n\rangle$ generated by $\{g_1, \ldots, g_n\}$ is finite. We denote the size of $G$ as $|G| = K$. Consider compilation with at most $t$ $g^*$ gates and an unlimited number of gates chosen from $G$. If no $g^*$ gate is used, we can only compile the $K$ gates in $G$. When we use at least one $g^*$ gate, any sequence containing $0 < k \le t$ $g^*$ gates can always be written as $g = (g_{i_{11}} g^* g_{i_{12}})\ldots(g_{i_{k1}} g^* g_{i_{k2}})$, where $g_{i_{11}}, g_{i_{12}}, \ldots, g_{i_{k1}}, g_{i_{k2}}$ are chosen from $G$. Consider the subset $G^* = \{g_s g^* g_t \mid g_s, g_t \in G\}$, which generates a dense subgroup of $SU(d)$. Any sequence that contains $k$ $g^*$ gates can be regarded as $k$ gates in $G^*$. Therefore, the number of $g^*$ gates can be regarded as the number of gates from $G^*$. As the size of $G^*$ is at most $K^2$, the number of gate sequences containing no more than $t$ $g^*$ gates is bounded above by $|G^*| + |G^*|^2 + \ldots + |G^*|^t \le O(K^{2t+2})$. Therefore, the total number of gates we can exactly compile with a gate sequence containing no more than $t$ $g^*$ gates is bounded above by $O(K^{2t+2})$.
We exploit the normalized Haar measure on $SU(d)$, such that the volume of $SU(d)$ is one [87]. Under this measure, the volume of an $\epsilon$-ball in $SU(d)$ scales as $\Theta(\epsilon^{d^2-1})$. If any gate in $SU(d)$ can be approximated within distance $\epsilon$, all $\epsilon$-balls centered at the gates generated by sequences with no more than $t$ $g^*$ gates should cover $SU(d)$, that is, $O(K^{2t+2}) \cdot \Theta(\epsilon^{d^2-1}) \geq 1$. Hence, we can deduce that, up to constant factors, the number of $g^*$ gates satisfies $k \geq \frac{d^2-1}{\log |G|}\log(\frac{1}{\epsilon})$. This completes the proof of Theorem 2 in the main text for a finite $G$.
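The counting argument above can be turned into a rough numerical estimate. The sketch below is only an illustration of the scaling: it assumes that every constant hidden in the $\Theta$ and $O$ notation equals one, so that the covering condition reads $K^{2t+2}\epsilon^{d^2-1} \geq 1$. The function name `min_gstar_count` is hypothetical and does not appear in the paper.

```python
import math


def min_gstar_count(K: int, d: int, eps: float) -> int:
    """Smallest t with K**(2t+2) * eps**(d*d-1) >= 1.

    Assumption: all constants hidden by the asymptotic notation are set to 1,
    so the result illustrates the scaling only, not a rigorous bound.
    """
    if K < 2 or d < 2 or not (0.0 < eps < 1.0):
        raise ValueError("need K >= 2, d >= 2 and 0 < eps < 1")
    # Work with logarithms: (2t + 2) * log K >= (d^2 - 1) * log(1 / eps).
    needed = (d * d - 1) * math.log(1.0 / eps) / math.log(K)
    return max(math.ceil((needed - 2.0) / 2.0), 0)


# Example: a finite group of size 24 acting on one qubit, accuracy 1e-3.
t_min = min_gstar_count(K=24, d=2, eps=1e-3)
```

In this toy setting the estimate grows linearly in $\log(1/\epsilon)$ and in $d^2-1$, and shrinks as $\log K$ grows, in line with the bound above.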
When the subgroup $G$ is infinite, the lower bound given by the volume method reduces to $\Omega(1)$. This lower bound, however, is not as trivial as it seems. Indeed, we can find a simple extreme example in which one can compile arbitrary gates with only two $g^*$ gates and an unlimited number of gates chosen from an infinite $G$. Consider single-qubit gate compilation with $g^* = H$ and $G = \langle R_x(\alpha\pi) \rangle$, generated by $R_x(\alpha\pi)$ with an irrational number $\alpha$. In this example, $G$ is an infinite group that can approximate any rotation about the $x$-axis to arbitrary accuracy. Notice that an arbitrary single-qubit gate can be written as $U = R_x(\theta_1) R_z(\theta_2) R_x(\theta_3)$ and that $R_z(\theta_2) = H R_x(\theta_2) H$. Therefore, any single-qubit gate can be compiled by gates chosen from $G$ and at most two $H$ gates.
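The identity $R_z(\theta) = H R_x(\theta) H$ used in this example can be checked numerically. The following sketch assumes the common convention $R_x(\theta) = e^{-i\theta X/2}$ and $R_z(\theta) = e^{-i\theta Z/2}$, and an X-Z-X Euler-angle form for the target gate; the helper names are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = (X + Z) / np.sqrt(2)  # Hadamard gate


def rx(theta: float) -> np.ndarray:
    """Rotation about the x-axis, exp(-i * theta * X / 2) (assumed convention)."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * X


def rz(theta: float) -> np.ndarray:
    """Rotation about the z-axis, exp(-i * theta * Z / 2) (assumed convention)."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * Z


def compile_with_two_h(t1: float, t2: float, t3: float) -> np.ndarray:
    """R_x(t1) R_z(t2) R_x(t3) written with x-rotations and exactly two H gates."""
    return rx(t1) @ H @ rx(t2) @ H @ rx(t3)


# The two-H sequence reproduces the X-Z-X Euler form for arbitrary angles.
angles = (0.3, 1.1, -2.0)
target = rx(angles[0]) @ rz(angles[1]) @ rx(angles[2])
same = np.allclose(compile_with_two_h(*angles), target)
```

In an actual compilation each $R_x(\theta_j)$ would itself be approximated by a power of $R_x(\alpha\pi)$, which costs no additional $H$ gates.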
In the main text, we focus on quantum compiling with Majorana fermions and non-topological T gates. Based on this compilation scheme, various algorithms have been proposed to reduce the T count for both multi-qubit unitary compilation [16,66] and single-qubit unitary compilation [67,68]. However, these algorithms focus on providing an optimal number of T gates for a given unitary. Compared with these approaches, our result provides a lower bound for the worst case over all unitaries chosen from $SU(d)$.

C. Deep reinforcement learning and PPO algorithm
In this section, we give a brief introduction to deep reinforcement learning (DRL) and the PPO algorithm that we exploit to compile unitary gates.

I. Introduction to Reinforcement Learning and Policy Gradient
To formalize the learning process, we introduce some basic concepts and notations. Reinforcement learning (RL) aims to train a decision-making agent in a Markovian decision process. This process involves a state set $S$ and an action set $A$. In step $n$, the agent chooses an action $a_n \in A$ while the environment shifts from $s_n \in S$ to $s_{n+1} \in S$, providing the agent with a scalar reward $r_n$ as feedback. In a Markovian decision process, the new state $s_{n+1}$ and the reward $r_n$ only depend on the former state $s_n$ and the action $a_n$. Therefore, the RL process can be written as a trajectory $\tau: s_0 \to a_0 \to r_0 \to s_1 \to \cdots \to a_{N-1} \to r_{N-1} \to s_N$, where $N$ is the maximal number of steps in an iteration.
The core problem in RL is to learn a policy that chooses the optimal action $a_n$ given the environment state $s_n$. To this end, a policy function $\pi$ is introduced to map states to probability distributions over actions. The agent uses $\pi$ to make decisions, choosing the action $a \sim \pi(s)$ for the state $s$. In deep reinforcement learning, a policy $\pi_\theta$ is represented by a policy network with parameters $\theta$. To evaluate the performance of $\pi_\theta$, an objective function is defined as the expected return over all complete trajectories: $J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{n=1}^{N}\gamma^n r_n\right]$, (C1) where $\gamma$ is the discount factor and the expectation is taken over all trajectories sampled from $\pi_\theta$. To find an optimal policy, we have to find the policy that maximizes the objective function. A straightforward approach is to perform gradient ascent on the policy parameters: $\theta \leftarrow \theta + \alpha\nabla_\theta J(\pi_\theta)$, (C2)
where $\alpha$ is a scalar factor known as the learning rate and $\nabla_\theta J(\pi_\theta)$ is known as the policy gradient. It has been proved that the policy gradient can be calculated by [88]: $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{n} R_n(\tau)\nabla_\theta \log \pi_\theta(a_n|s_n)\right]$, (C3) where the action $a_n$ is sampled from the probability distribution $\pi_\theta(a_n|s_n)$ given by the policy at step $n$, and $R_n(\tau) = \sum_{n'=n}^{N}\gamma^{n'-n} r_{n'}$ is the discounted sum of rewards from the current step $n$ to the end of the trajectory. The algorithm for this simple policy gradient method is summarized in Algorithm 1.

Algorithm 1 The Policy Gradient Algorithm
Input: The environment $E$, the initial policy parameters $\theta_0$, the number of epochs $T$, the step size $\alpha$.
Output: The optimal policy $\pi_{\theta^*}$
1: for $i = 0, 1, 2, \ldots, T$ do
2:   Run the current policy $\pi_{\theta_i}$ in $E$ and obtain a set of trajectories $D_i = \{\tau_i\}$.
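To make the policy gradient update concrete, the following is a minimal sketch of one update for a tabular softmax policy. It is an illustration under stated assumptions, not the implementation used in this work: the tabular parameterization, the `(state, action, reward)` trajectory format, and all function names are hypothetical.

```python
import numpy as np


def rewards_to_go(rewards, gamma):
    """R_n = sum_{n' >= n} gamma**(n' - n) * r_{n'} for each step n of a trajectory."""
    out = np.zeros(len(rewards))
    running = 0.0
    for n in reversed(range(len(rewards))):
        running = rewards[n] + gamma * running
        out[n] = running
    return out


def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()


def policy_gradient_estimate(theta, trajectory, gamma):
    """Single-trajectory estimate of sum_n R_n * grad_theta log pi_theta(a_n | s_n).

    theta: array of shape (num_states, num_actions) holding softmax logits.
    trajectory: list of (state, action, reward) tuples (assumed format).
    """
    returns = rewards_to_go([r for _, _, r in trajectory], gamma)
    grad = np.zeros_like(theta)
    for (s, a, _), R_n in zip(trajectory, returns):
        grad_log = -softmax(theta[s])  # d log softmax / d logits = onehot(a) - probs
        grad_log[a] += 1.0
        grad[s] += R_n * grad_log
    return grad


def gradient_step(theta, trajectories, gamma, alpha):
    """One update theta <- theta + alpha * (average gradient estimate)."""
    grads = [policy_gradient_estimate(theta, t, gamma) for t in trajectories]
    return theta + alpha * np.mean(grads, axis=0)
```

Each sampled trajectory is used for exactly one update and then discarded, which is the sample inefficiency that motivates the move to PPO.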
In practice, however, the parameter space and the policy space do not always map congruently onto each other, which makes it challenging to choose a step size $\alpha$. If $\alpha$ is chosen as a small constant, more iterations are potentially required in the training process. If $\alpha$ is chosen larger, the agent becomes vulnerable to a performance collapse, in which it chooses a bad action that results in a sudden drop in its performance. In addition, the policy gradient method is sample-inefficient because it does not reuse data. To address these two issues, the proximal policy optimization algorithms [54] were proposed.

II. Proximal Policy Optimization (PPO)
Our algorithm employs proximal policy optimization (PPO) [54], a policy gradient algorithm developed by OpenAI. PPO is motivated by an algorithm called Trust Region Policy Optimization (TRPO) [89], which aims to find an optimal policy iteratively to maximize $J(\pi)$ without causing a performance collapse. While TRPO applies second-order methods that are complex to compute, PPO uses first-order methods together with additional tricks to keep the policy from changing too quickly between updates. PPO is much simpler to implement than TRPO, while performing well in practice.
To introduce the PPO algorithm, we first define a value function $V^\pi(s) = \mathbb{E}_{s_0=s,\tau\sim\pi}\left[\sum_{n=1}^{N}\gamma^n r_n\right]$ and an action-value function $Q^\pi(s, a) = \mathbb{E}_{s_0=s,a_0=a,\tau\sim\pi}\left[\sum_{n=1}^{N}\gamma^n r_n\right]$ given state $s$, action $a$, discount factor $\gamma$, and policy $\pi$. These two functions are used to evaluate a state and a given state-action pair, respectively. The objective function used in the learning procedure is defined to be the expected reward of policy $\pi$ as $J(\pi) = \mathbb{E}_{\tau\sim\pi}\left[\sum_{n=1}^{N}\gamma^n r_n\right]$. In the policy update procedure, we obtain another policy $\pi'$ in the next iteration given a current policy $\pi$. The objective function changes as: $J(\pi') - J(\pi) = \mathbb{E}_{\tau\sim\pi'}\left[\sum_{n}\gamma^n A^\pi(s_n, a_n)\right]$, (C4) where $A^\pi(s_n, a_n) = Q^\pi(s_n, a_n) - V^\pi(s_n)$ is defined as the advantage function. The relative policy performance $J(\pi') - J(\pi)$ provides a metric to measure the improvement of performance after a policy shift. Since $J(\pi)$ is fixed by the current policy, maximizing $J(\pi')$ is equivalent to maximizing $J(\pi') - J(\pi)$.
To approximate Eq. (C4), we use the trajectories from the old policy $\tau \sim \pi$ and adjust with the importance sampling weights $R_n(\pi) = \frac{\pi'(a_n|s_n)}{\pi(a_n|s_n)}$ [88]. This approximation is given as $J^{\mathrm{CPI}}_\pi(\pi') = \mathbb{E}_{\tau\sim\pi}\left[\sum_{n} A^\pi(s_n, a_n) R_n(\pi)\right]$, (C5) where $J^{\mathrm{CPI}}_\pi(\pi')$ is known as a surrogate objective. The surrogate objective function can additionally be written as an average over both $\tau \sim \pi$ and $n$ as $\mathbb{E}_{\tau\sim\pi,n}\left[A^\pi(s_n, a_n) R_n(\pi)\right]$. In our algorithm, we use an alternative version of PPO with the clipped surrogate objective function $J^{\mathrm{CLIP}} = \mathbb{E}_{\tau\sim\pi,n}\left[A^\pi(s_n, a_n) \times \min\{R_n(\pi), \mathrm{clip}(R_n(\pi), 1-\epsilon, 1+\epsilon)\}\right]$. (C6) Here $\epsilon$ is a hyperparameter that defines the clipping bound $|R_n(\pi) - 1| \leq \epsilon$. This parameter decays during the training procedure. As the term $\mathrm{clip}(R_n(\pi), 1-\epsilon, 1+\epsilon) A^\pi(s_n, a_n)$ bounds the value of $J^{\mathrm{CPI}}$, this objective function prevents updates that create large and risky policy changes.
In the objective $J^{\mathrm{CLIP}}$, the most computationally costly parts are the weight $R_n(\pi)$ and the advantage $A^\pi(s_n, a_n)$. However, these parts are required in any algorithm that optimizes the surrogate objective function. The remaining calculations are essentially constant-time clipping and minimization operations. Therefore, the clipped objective is relatively easy to compute and understand. The whole PPO algorithm is summarized in Algorithm 2.

Algorithm 2 The Proximal Policy Optimization Algorithm
Input: The environment $E$, the initial policy parameters $\theta_0$, the initial value function parameters $\phi_0$, the number of epochs $T$, the step size $\alpha$.
Output: The optimal policy $\pi_{\theta^*}$
1: for $i = 0, 1, 2, \ldots, T$ do
2:   Run the current policy $\pi_{\theta_i}$ in $E$ and obtain a set of trajectories $D_i = \{\tau_i\}$.
3:   Compute the advantage function $A^{\pi_i}(s, a)$ using the estimate of the value function $V_{\phi_i}$.
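The clipped surrogate objective can be evaluated on a batch of samples in a few lines. The sketch below is a NumPy illustration rather than the code used in this work. It assumes the standard form of Ref. [54], in which the minimum is taken over the products $R_n A^\pi$ and $\mathrm{clip}(R_n, 1-\epsilon, 1+\epsilon) A^\pi$; this coincides with Eq. (C6) whenever the advantage is non-negative. The function and argument names are hypothetical.

```python
import numpy as np


def clipped_surrogate(logp_new, logp_old, advantages, eps):
    """Batch estimate of the PPO clipped surrogate objective.

    logp_new, logp_old: log-probabilities of the sampled actions under the
        updated and the old policy (hypothetical argument names).
    advantages: advantage estimates for the same samples.
    eps: clipping parameter defining |ratio - 1| <= eps.
    """
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Taking the minimum makes the objective a pessimistic bound on the
    # unclipped surrogate, which removes the incentive for large policy shifts.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

With a positive advantage, a ratio above $1+\epsilon$ contributes no more than $(1+\epsilon)A^\pi$, so there is no gain from pushing the policy further in that direction within a single update.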

D. Supplementary note on the algorithm and numerical experiments
In this section, we provide the design details of the agent and algorithm we used. We additionally provide more numerical data about applying this algorithm to compilation based on Majorana fermion systems.

I. Training the PPO agent
Our DNN, provided by the OpenAI Baselines package [80], consists of five fully connected layers, each containing 256 neurons. The activation function is the leaky ReLU function [90] throughout the neural network. We exploit the Adam algorithm [91,92] as our optimizer, and batch normalization is applied.
As mentioned in the main text, the DNN is trained to evaluate the objective function $J(\pi)$ with the reward function of Eqs. (D1) and (D2), where $r_s(U_n, U_t)$ is the state reward obtained from the distance $d(U_n, U_t)$ between the approximation gate $U_n$ and the target gate $U_t$, $C_{g^*}$ is the additional punishment for the employment of a $g^*$ gate, $L_{U_t}$ is the number of gates used for generating $U_t$, $L_{\max}$ is the maximal number of steps the agent is allowed to take to compile the target gate, $c$ is a constant that balances rewards and punishments, and $\epsilon_t$ is the distance tolerance.
Starting from the identity at each iteration, the agent chooses a gate from $A = \{B_{12}, B_{12}^{-1}, B_{23}, B_{23}^{-1}, T, T^{-1}\}$ in each step and obtains a reward value calculated by Eq. (D1). When the distance between the approximation sequence and the target unitary gate falls within the threshold $\epsilon_t$, the agent obtains a reward and starts a new iteration with a new target gate. When the number of steps exceeds the maximal length $L_{\max}$, the iteration also terminates.
Before the training process, the DNN is initialized with random parameters. At the beginning of the training process, we feed the agent random sequences of length 10 consisting of gates from $A = \{B_{12}, B_{12}^{-1}, B_{23}, B_{23}^{-1}, T, T^{-1}\}$. We choose the accuracy threshold $\epsilon_t = 10^{-3}$ and train the agent and the DNN to search for a policy $\pi$ with higher reward that generates approximations no longer than the maximal length $L_{\max}$. During the training process, we hold a reward threshold that is a function of the length of the random sequences in the training data. If the reward obtained by the agent when compiling the training data reaches this threshold, we increase the length of the random sequences generated as training data, until the length reaches 80. In Fig. S1(a), we plot the average reward as a function of the number of training steps. We observe that the average reward first increases to the reward threshold and then drops each time we increase the length of the sequences in the training data. Once the length of the randomly generated training sequences reaches the upper bound 80, the average reward increases, as we no longer increase the length. Moreover, we observe that the average reward decreases if we increase the cost for the T gate. We trained this model on a single NVIDIA TITAN V GPU for about one day.
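The length schedule described above amounts to a simple curriculum rule, sketched below. Only the starting length 10 and the upper bound 80 come from the text; the increment `step`, the threshold function `threshold_fn`, and the function name are hypothetical placeholders, since their actual values are not specified here.

```python
from typing import Callable


def next_length(
    length: int,
    average_reward: float,
    threshold_fn: Callable[[int], float],
    step: int = 10,        # hypothetical increment, not given in the text
    max_length: int = 80,  # upper bound stated in the text
) -> int:
    """Return the training-sequence length to use for the next training phase.

    The length grows only when the agent's average reward reaches the
    length-dependent threshold, and it never exceeds max_length.
    """
    if length < max_length and average_reward >= threshold_fn(length):
        return min(length + step, max_length)
    return length


# Usage with a purely illustrative constant threshold.
length = 10
for reward in [1.0, 4.5, 3.0, 5.0]:
    length = next_length(length, reward, threshold_fn=lambda L: 4.0)
```

Under such a rule the average reward is expected to dip right after each increase in length, since the agent suddenly faces harder targets, which matches the sawtooth behavior reported for Fig. S1(a).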
II. More results on applying PPO algorithms to topological quantum compiling on Majorana fermions
To further explore our DRL algorithm in topological quantum compiling on Majorana fermions, we construct another test dataset consisting of 1000 randomly generated sequences of length 80 with gates chosen from $A = \{B_{12}, B_{12}^{-1}, B_{23}, B_{23}^{-1}, T, T^{-1}\}$. We feed this dataset into the trained agent and increase the maximal length $L_{\max}$ of the generated sequences to observe the changes in the average distance, the number of T gates, and the approximation sequence length.
As shown in Fig. S1(b), we input the test dataset into the agent trained with $C_T = 0$. We observe that when the maximal length $L_{\max}$ increases, the average compilation error first decreases quickly and then converges to a stable value of about 0.005. This indicates that when $L_{\max}$ exceeds a threshold, the major obstacle to improving the accuracy of the compiler is the sparsity of the net with the approximated policy reward of the agent. It is worthwhile to mention that, compared with Ref. [53], the search complexity in our algorithm increases linearly rather than exponentially with $L_{\max}$. In Fig. S1(c), we plot the average length of the approximation sequences and the number of T gates in the sequences as functions of $L_{\max}$. Both quantities increase approximately linearly with $L_{\max}$, while the proportion of T gates in the approximation sequence remains nearly unchanged. This result indicates that our DRL agent can stably reduce the usage of T gates under different $L_{\max}$.

E. Encoding and operation on Majorana fermions
In this section, we briefly introduce the encoding methods for Majorana fermion systems. The fusion rule for Majorana fermions is of the Ising type, $\tau \times \tau \sim I + \psi$, with $I$, $\tau$, and $\psi$ representing a vacuum state, a Majorana fermion, and a normal fermion, respectively. In the main text, we consider the four-quasiparticle encoding scheme, where each qubit is encoded by four Majorana fermions with total topological charge 0. As mentioned in the main text, the gates $\{H, S\}$ can be realized by braidings in the four-quasiparticle scheme. We denote the four Majorana operators associated with the quasiparticles of one logical qubit as $b_i$, $i = 1, 2, 3, 4$; these operators satisfy $b_i^\dagger = b_i$, $b_i^2 = I$, and the anticommutation relation $\{b_i, b_j\} = 2\delta_{ij}$. As shown in Ref. [93], the Pauli operators in the computational basis can be expressed in terms of these operators, and unitary operations can be realized by counterclockwise exchanges of two neighboring quasiparticles $j$ and $j'$. In terms of the basic braidings, we can implement the $H$ gate with the braiding sequence $H = B_{23}^2 B_{12}^{-1} B_{23} B_{12}^{-1} B_{23}^2$. Hence, the single-qubit Clifford group generated by the $H$ gate and the $S$ gate can be realized by Majorana braidings. However, entangling gates on two qubits cannot be obtained through braiding due to the no-entanglement rule [94].
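The braiding sequence for the $H$ gate can be verified numerically once a matrix representation of the braidings is fixed. The sketch below assumes the representation $B_{12} = e^{-i\pi Z/4}$ and $B_{23} = e^{-i\pi X/4}$ on the encoded qubit, which is one common convention and is an assumption here; the explicit expressions of Ref. [93] may differ by signs or phases. Equality is therefore checked only up to a global phase, and the helper names are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = (X + Z) / np.sqrt(2)
S = np.diag([1, 1j]).astype(complex)

# Assumed single-qubit representation of the two basic braidings.
B12 = (I2 - 1j * Z) / np.sqrt(2)  # exp(-i * pi/4 * Z), proportional to S
B23 = (I2 - 1j * X) / np.sqrt(2)  # exp(-i * pi/4 * X)


def inv(U: np.ndarray) -> np.ndarray:
    """Inverse of a unitary matrix."""
    return U.conj().T


def equal_up_to_phase(U: np.ndarray, V: np.ndarray, tol: float = 1e-10) -> bool:
    """For unitaries, |Tr(U^dagger V)| equals the dimension iff U = e^{i phi} V."""
    return abs(abs(np.trace(U.conj().T @ V)) - U.shape[0]) < tol


# H = B23^2 B12^{-1} B23 B12^{-1} B23^2, as stated in the text.
W = B23 @ B23 @ inv(B12) @ B23 @ inv(B12) @ B23 @ B23
h_ok = equal_up_to_phase(W, H)
s_ok = equal_up_to_phase(B12, S)
```

Under this assumed representation the product equals $H$ up to a global phase, and $B_{12}$ itself is proportional to the $S$ gate, consistent with the statement that braidings generate the single-qubit Clifford group.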