Implementation of Measurement Reduction for the Variational Quantum Eigensolver

One limitation of the variational quantum eigensolver algorithm is the large number of measurement steps required to estimate different terms in the Hamiltonian of interest. Unitary partitioning reduces this overhead by transforming the problem Hamiltonian into one containing fewer terms. We explore two different circuit constructions of the transformation required - one built by a sequence of rotations and the other a linear combination of unitaries (LCU). To assess performance, we simulated chemical Hamiltonians and studied the ground states of H2 and LiH. Both implementations are successful even in the presence of noise. The sequence of rotations realization offers the greatest benefit to calculations, whereas the probabilistic nature of LCU reduces its effectiveness. To our knowledge, this work also demonstrates the first experimental implementation of LCU on quantum hardware.

One limitation of the variational quantum eigensolver algorithm is the large number of measurement steps required to estimate different terms in the Hamiltonian of interest. Unitary partitioning reduces this overhead by transforming the problem Hamiltonian into one containing fewer terms. We explore two different circuit constructions of the transformation required -one built by a sequence of rotations and the other a linear combination of unitaries (LCU). To assess performance, we simulated chemical Hamiltonians and studied the ground states of H2 and LiH. Both implementations are successful even in the presence of noise. The sequence of rotations realization offers the greatest benefit to calculations, whereas the probabilistic nature of LCU reduces its effectiveness. To our knowledge, this work also demonstrates the first experimental implementation of LCU on quantum hardware.

I. INTRODUCTION
Current quantum computing devices have significant limitations, namely short coherence times, low qubit numbers and little to no error correction. These machines are termed noisy intermediate-scale quantum (NISQ) computers [1]. The leading candidate algorithms for use on NISQ devices are variational hybrid quantum-classical algorithms such as the variational quantum eigensolver (VQE) and quantum approximate optimization algorithm (QAOA) [2,3]. VQE estimates Hamiltonian eigenvalues on near term quantum computers [3]. Many different implementations of the algorithm have been performed utilizing a wide array of different quantum platforms [4][5][6][7].
VQE has been widely applied to the electronic structure problem. The second quantized form of the molecular electronic Hamiltonian is converted to a qubit Hamiltonian by the Jordan-Wigner, Bravyi-Kitaev or related transformations [8][9][10]. The resulting qubit Hamiltonian is a linear combination of m Pauli operators on n qubits: where c i are real coefficients, P i are n-qubit Pauli operators, which are n-fold tensor products of 1-qubit Pauli operators, or the 2 × 2 identity matrix: σ i j ∈ {X, Y, Z, I} ∀i, and j indexes the qubit the operator acts on.
In general, the number of terms in equation (1) scales as O(N 4 orb ), where N orb is the number of orbitals [11]. The linearity of expectation values allows where |ψ( θ) is an ansatz state produced by a parameterized quantum circuit. In conventional VQE, the expectation value of each subterm P i is determined independently. An estimate of each term's expectation value P i is found by averaging over M i repeated measurement outcomes {s (i) j } j=1,2,..,Mi via [3,12,13]: where s (i) j ∈ {−1, +1}. The above expression is exact as the number of samples M i → ∞.
A finite number of runs is used to estimate each P i term and thus each outcome will belong to a distribution centred around the expectation value ψ( θ)| P i |ψ( θ) with standard deviation i . Each estimate ψ( θ)| P i |ψ( θ) is derived from sums of random variables with finite variance. Due to the central limit theorem, they must converge to a normal distribution [12]. Because we will be comparing Hamiltonians with different numbers of terms, we will take a slightly different approach. We combine a single sample of all terms into a single-shot energy estimate e j = i c i s (i) j . The distribution of this estimate determines how many samples are required to achieve a given error on the mean.
The optimal number of repetitions M i to achieve a certain precision is [14,15]:

arXiv:2012.02765v2 [quant-ph] 28 Jul 2021
where M is the total number of measurements. Since the number of terms in equation 1 scales as O(N 4 orb ), the total number of measurements required will scale as O(N 6 orb / 2 ), where chemical accuracy is defined as = 1 kcal/mol (1.6 mHa), the accuracy required to match typical thermochemical experiments. Wecker et al. showed that to obtain energy estimates for HeH + , BeH 2 and H 2 O requires 10 8 − 10 9 samples to achieve an error of 1 mHa [14]. This implies that the number of measurements required is an obstacle for experimental implementations of VQE to the number of qubits currently available on NISQ devices [7]. For example, take the experimental implementation of VQE by Hempel et al., which took 20 ms to perform each VQE repetition on a trapped ion quantum computer [6]. To obtain the ground state energy of H 2 , in a minimal basis to within chemical accuracy, of order 14000 repetitions were needed and 4.6 minutes of averaging required.
Various approaches have been proposed for reducing the total number of samples required by VQE [16][17][18][19][20][21][22][23][24]. In this paper, we focus on the unitary partitioning procedure independently proposed by Verteletskyi et al. [25] and Zhao et al. [26]. The main idea of this approach is to partition the qubit Hamiltonian into groups of n-fold Pauli operators whose linear combination is unitary. The overall operator represented by these sums can then be measured at once using additional coherent resources. In this work, we compare two different circuit realizations of unitary partitioning, as proposed in [26].

II. UNITARY PARTITIONING
The expectation value of any Hermitian operator can be obtained via a single set of single-qubit measurements, because it can be written in terms of its spectral decomposition. For example, consider the spectral decomposition of a general Hermitian operator A: d is the dimension of the space and A acts on orthonormal states |ψ a . Each |ψ a is an eigenstate of the operator with corresponding eigenvalue λ a . As the set of eigenvectors {|ψ a } form an orthonormal basis there always exists a unitary R that maps this basis to another: R |ψ a = |e a or |ψ a = R † |e a . The operator A can be written in this basis: The expectation value of A can be found by A = ψ| A |ψ = ψ| R † QR |ψ . Note Q is a matrix defined by the bracket in equation 6c. This idea underpins the unitary partitioning method.
Individual n-fold tensor product operators of Pauli operators P i are Hermitian and unitary; however, a sum of unitary operators is in general not unitary. To make j c j P j unitary three constraints are imposed [25,26]: The first condition is satisfied by partitioning the qubit Hamiltonian H q into m c sets denoted {S l } l=0,2,..,mc−1 . The sub-Hamiltonian corresponding to each anticommuting set H S l is defined as: S l is the set of P i terms in H S l , where {P j , P i } = 0 ∀P j = P i ∈ S l . The process of finding such sets is discussed in [22,25,26] and formulated as a minimum clique cover problem. This is an NP-hard problem [27]; however, heuristic algorithms can provide sufficiently good approximate solutions to this problem [17,25,28]. Condition (2) is satisfied by re-normalising each anticommuting set: j . The final condition is already satisfied, as all the coefficients c i in H q are real.
Using the unitary partitioning method, the qubit Hamiltonian is separated into m c sets of unitary sums: FIG. 1: General quantum circuit implementation of unitary partitioning constructed as a sequence of rotations. The subscript s denotes system qubits. V ψ0 is a unitary gate that prepares the ansatz state. To measure the qubits in the computational basis, the single qubit gates B ∈ {H, R x (−π/2), I} are required to perform a change of basis dependent on the Pauli operator P (l) j can be written as R † l Q l R l . The expectation value of the Hamiltonian can therefore be obtained via: and only m c ≤ m terms are estimated, at the expense of needing to implement R l coherently within each circuit [26]. In this work, the operator R l required by unitary partitioning is implemented by either a sequence of rotations or LCU [29]. These methods follow the constructions outlined in [26]. The relevant details for each method are summarised in the next subsections. It will be shown that the new operator Q l is simply a particular n-fold Pauli operator that we denote as P (l) w .

A. Ordered Sequence of rotations approach
In this section, R l is constructed such that it maps a completely anticommuting set H S l (equation 7) to a single Pauli operator via conjugation -formally: w . To begin a particular P j ∈ S l is selected to be reduced to. This term is denoted by the index w and written as P (l) w , where l indexes the set. Once chosen, this operator is used to define the following set of Hermitian self-inverse operators [26]: note the coefficients β w and β k are not present. As every P j operator in S l anticommutes with all other operators in the set by definition, it is clear from equation 11 that X (l) wk will commute with all {P j | ∀P j ∈ S l where j = w, k} and anticommute with P k . This property is the crux of this conjugation approach.
The adjoint rotation generated by X (l) wk can be written [26]: whose action on H S l is [26]: can be made to go to 0, by setting β k cos θ wk = β w sin θ wk . This approach removes the term with index k and increases the coefficient of P (l) w from β w → β 2 w + β 2 k . This process is repeated over all indices excluding k = w until only the P (l) w term remains. This procedure can be concisely written using the following operator: which is simply a sequence of rotations. The angle θ wk is defined iteratively at each step of the removal process, as the coefficient of P (l) w increases at each step and thus must be taken into account. Importantly the correct solution for θ wk must be chosen given the signs of β w and β k [26]. The overall action of this sequence of rotations is: where the number of terms in the qubit Hamiltonian requiring separate measurement is reduced from m → m c . Note that R S l ≡ R l , we use this notation to differentiate this assembly from the LCU construction.
B. Linear combination of unitaries method

LCU technique overview
Given a complex operator as a linear combination of d unitary operators: where U j are unitary operators (that are assumed to be easy-to-implement) and α j are real positive coefficients. Without loss of generality phase factors and signs can be absorbed into the unitaries U j to make all α j ≥ 0. Fig. 2 shows how to do this for n-fold Pauli operators. The linear combination of unitaries (LCU) method offers a way to probabilistically implement such an operator using the two unitary operators G and U LCU [29,30]: where subscript a denotes the ancilla register. U LCU is sometimes known as the "select" operator and G the "prepare" operator. The most important property of the unitary operator G is that the coefficients α j only define the first column of the matrix -resulting in G |0 a → |G a . The rest of the columns must be orthogonal, but can have any values -hence there is a freedom of choice when defining G. A practical note on this is if one finds a quantum circuit that performs |0 a → |G a , then its action on other basis states will automatically be accounted for and G is completely defined (provided the quantum circuit is composed as a product of unitaries). To summarise the LCU method, first G is used to initialize the ancilla register: G |0 a → |G a . The controlled unitary U LCU is then applied across the system and ancilla registers, resulting in [30]: If |G a is measured in the ancilla register, then the state will be projected onto A α 1 |ψ s and A was successfully applied to the system state |ψ s . If any other state in the ancilla register is measured (orthogonal complement ∈ H G ⊥ ), then the quantum state is projected into the wrong part of the Hilbert space and A α 1 |ψ s is not performed.
The LCU method gives a probabilistic implementation of the matrix A, which has a probability of success given by: As A can be a non-unitary matrix A † A doesn't necessarily result in the identity matrix. In general the success probably therefore depends on both |ψ and 1 α 1 2 . For the special case when A is a unitary matrix then the success probability is given by To increase the probability of success different techniques such as oblivious amplitude amplification [31,32] and amplitude amplification [33,34] can be used. However, extra coherent resources are required.
An alternate way to describe the LCU method is to note that the matrix A is in general not unitary. To build a unitary form, the normalised matrix A can be embedded into a larger Hilbert space by putting it into the upper-left block of a unitary matrix [35]: . .
where each [.] represents a matrix of arbitrary elements. The reason to divide by α 1 is unitary matrices must   The operator U is a probabilistic implementation of A and is commonly known as a 'block encoding' of A. Overall the LCU method encodes the desired matrix as [30]: A final note on notation. When a matrix e.g. A is block encoded using U we usually say "the unitary U gives a (α, k, )-block encoding of A". Here α is the l 1norm of the matrix to be block encoded. k is the extra ancilla qubits required to perform the block encoding. This depends on the number of operators in the linear combination of unitaries (equation 17) and scales logarithmically as k = log 2 (|A|) , where |A| is the number of operators in the linear combination.
Finally, is the error of the block encoding and is determined by [35]: The LCU technique can be used to implement nonunitary operations [29,36], such as matrix inversion. This is achieved by constructing the required operator as a linear combination of unitaries. As the n-fold tensor product of Pauli operators including the n-fold identity operation form a complete operator basis, any (2 n × 2 n ) complex operator can be built by different linear combinations of these unitary operators.
A toy example of the LCU method is given in the next section to illustrate the practical implementation of this technique. This can be skipped without loss of continuity.

Toy LCU example
Consider the Hamiltonian H = α 0 U 0 + α 1 U 1 , where α 0,1 ≥ 0 and α 1 = |α 0 | + |α 1 |. To implement this operator as a linear combination of unitaries G (equation 19) and U LCU (equation 18) must be defined. The construction of these operators is given by the definition of H. In this case, they will be: The values of x and y can be anything that ensures the columns of G are orthogonal. The following can be used: as discussed in [36]. The quantum circuit given in Fig.  3 shows the probabilistic implementation of H using the LCU method.
Stepping through this circuit we find: Measuring |0 a on the ancilla line will project the state onto α 0 U 0 + α 1 U 1 |ψ s = H |ψ s up to a normalization and heralds a successful implementation of the linear combination of unitaries method. If |1 a is measured then the state is projected into the wrong subspace and H is not performed. Overall this approach gives a ( α 1 , 1, 0)block encoding of H.
If H is a unitary operator, then the success probability (equation 21) is non-unitary then the success probability depends on . This matrix defines the projector onto the all zero state on the system register's qubit and is a non-unitary operation. Clearly the success probability of block encoding this operator will depend heavily on |ψ s (and whether it has overlap with |0 s ).

LCU approach to unitary partitioning
Analogous to the sequence of rotations method (Section II A), R l will be applied by conjugation to an anticommuting set H S l to give a single Pauli operator. However, unlike the previous method, where R l was achieved by a sequence of rotations (equation 14) here it is built via a linear combination of unitaries (LCU). An overview of the LCU method is given in the previous subsection.
To construct R l via LCU, we first need to manipulate each of the m c anticommuting sets (H S l ) that the qubit Hamiltonian was partitioned into (equation 9). As before, a particular Pauli operator P j ∈ S l in each set is selected to be reduced to. Again this will be denoted by the index w and written as P where Taking each normalised set H S l (equation 8) and rewriting them with the term we are reducing to (β w ) outside the sum: by re-normalising the remaining sum in equation 29: where and β We can substitute equation 28a into equation 30a: where β 2(l) w +Ω 2 l = 1. In this form we can use the trigonometric identity cos 2 (θ) + sin 2 (θ) = 1 to define the following operator: Comparing equations 31 and 32 it is clear that cos(φ Next using the definition of H (l) w in equation 32 it was shown in [26] that one can consider rotations of H w around an axis that is Hilbert-Schmidt orthogonal to both H S l \{P (l) w } (equation 28) and P (l) w : w , is self-inverse and has the following action [26]: This defines the rotation: where P kw will be another tensor product of Pauli operators, as products of n-fold Pauli operators will yield another operator in the Pauli group. The adjoint action of this rotation on H (l) w is: w , the coefficient of H S l \{P (l) w } will go to zero and we achieve the intended result of R l H (l) w . To build R l by the LCU method, we use its definition in equation 35c. In practice, it is easier to re-write equation 35 using the fact that all P kw and I are in the Pauli group. The terms can thus be combined into a single sum: Note all α q must be real and α q ≥ 0 ∀q. This is achieved by absorbing any signs and complex phases into P (l) kw , hence these operators are n-fold tensor Pauli operators up to a complex phase. When written in this form, it is easy to define the operators G (equation 19) and U LCU (equation 18): FIG. 4: General quantum circuit implementation of unitary partitioning constructed as a LCU. The subscripts s and a denote the system and ancilla registers respectively. V ψ0 is a unitary gate that prepares the ansatz state. To measure the qubits in the computational basis, the single qubit gatesB ∈ {H, R x (−π/2), I} are required to perform a change of basis dependent on the Pauli operator P that are required to perform R l as a LCU. Overall the operator is encoded as: Without using amplitude amplification, as R l is unitary, the probability of success is given by the square of the l 1norm of R l . Note that the l 1 -norm is defined as α q 1 = |H S l |−1 q=0 |α q |.

III. NUMERICAL STUDY
The ability of the unitary partitioning measurement reduction strategy is dependent on the problem Hamiltonian. To assess the performance of each implementation, we investigate application to Hamiltonians of interest in quantum chemistry.

A. Method
We consider Hamiltonians for H 2 and LiH molecules employing the STO-3G and STO-6G basis sets respectively. These were calculated using Openfermion-PySCF and converted into the qubit Hamiltonian using the Bravyi-Kitaev transformation in OpenFermion [37,38].
Partitioning into anticommuting sets H S l was performed using networkX [39]. First, a graph of the qubit Hamiltonian was built, where each node is a term in the Hamiltonian. Next edges are put between nodes on the graph that anticommute. Finally, a graph colouring of the complement graph was performed. This searches for the minimum number of colours required to colour the graph, where no neighbours of a node can have the same colour as the node itself. The "largest first" colouring strategy in NetworkX was used [39,40]. Each unique colour represents an anticommuting clique. Fig. 5 shows the method applied to H 2 . This approach is the minimum clique cover problem mapped to a graph colouring problem. Further numerical details for each Hamiltonian can be found in Appendix A.
The input state |ψ( θ) for all calculations was the exact full configuration interaction (FCI) ground state, found by diagonalizing the Hamiltonian. As our aim was to investigate different implementations of unitary partitioning, this meant the ansatz optimization step in VQE was not required.
For the simulations performed on IBM's ibmqx2 quantum processing unit (QPU), a measurement error mitigation strategy available in Qiskit was utilized and is a simple inversion procedure [41]. The quantum circuits required were generated by the qiskit.ignis complete meas cal method and executed alongside each separate ibmqx2 experiment, with the maximum number of shots (8192). This sampling cost was not included in the number of calls to quantum device. The CompleteMeas-Fitter method in qiskit.ignis [42] was used to generate the calibration matrix required for measurement error mitigation [41,42].
We denote the number of terms in the

B. Results
For a given preparation of the true ground state of H 2 , we compare both implementations of measurement reduction by unitary partitioning against a standard VQE calculation on IBM's open access quantum device (ibmq 5 Yorktown -ibmqx2) and Qiskit's qasm simulator [42]. The quantum circuits required are given in Appendix A. Fig. 6 shows the distribution of single-shot energy estimates of all three techniques applied to molecular hydrogen. The average energy is given by E = 1 N N −1 j=0 e j . To compare each method the measurement budget was fixed to M = 1.2663 × 10 6 . A calibration matrix method available in Qiskit was used to mitigate measurement errors and was used to amend the raw outputs from ib-mqx2.
The qubit Hamiltonian for H 2 has five terms, which is reduced to three by unitary partitioning, not including the identity term. The number of energy estimates e j obtained was 253260 for standard VQE and 5/3 this for unitary partitioning by the sequence of rotations method. This is because the smaller number of terms allowed a correspondingly larger number of samples to be taken for a fixed M . The total number of e j samples from ib-mqx2 for these techniques was reduced to 253074 and 421951 after measurement error mitigation was applied. The LCU approach to unitary partitioning is probabilistic and requires post-selection on the all zero state of the ancilla register. After post-selection, our simulation of unitary partitioning as a LCU on ibmqx2 gave 333407 raw e j samples and 332763 e j samples after measurement error mitigation was applied to the raw output. Our emulation of this method on a noise-free quantum processing unit (QPU) gave 336390 e j samples after post-selection. The theoretical maximum possible number of samples for LCU would be the same as the sequence of rotations method if all samples obtained were successful.
The reason a normal distribution is not obtained is due to the number of terms in the qubit Hamiltonian for H 2 . At most only 32 distinct values of e j are possible for standard VQE and 8 under unitary partitioning, and so we do not expect the central limit argument to apply here.
To investigate the distribution of energies obtained from each method in more detail, we simulated the larger problem of LiH using Qiskit's statevector simulator [42]. Further details are given in Appendix A 2. Fig. 7 summarizes the results. Again, each data point is an energy estimate found from the weighted measurement outcomes of a single-shot VQE run. The standard qubit Hamiltonian for this problem is made up of 630 terms and after applying unitary partitioning 102 terms, not including the identity term. The measurement budget was fixed at M = 1.018521 × 10 9 . The total number of energy estimates e j for standard VQE, the sequence of rotations and the LCU methods after post-selection were 1616700, 9985500 and 1447349 respectively.
We performed the Kolmogorov-Smirnov [43] and Shapiro-Wilk tests [44] on the data in Fig. 7 to check for normality. The P-values obtained in all cases were smaller than 0.05, and thus we could not assume a normal distribution. This may be caused by insufficient samples allowing convergence to the central limit or the problem size still being too small. To estimate the statistics of the true distribution we thus employed bootstrap resampling [45].

C. Discussion
In our results, for a fixed measurement budget M we obtain a set of independent identically distributed random energy samples {e 1 , e 2 , ..., e N }. The standard deviation of this sample σ e converges to the true standard deviation a the number of samples increases. As the number of samples increases, the error on the sample mean decreases. The standard error of the mean (SEM) is defined as: for N energy samples. Because we are considering a fixed measurement budget M , the effect of unitary partitioning will be to increase the number of energy samples N and hence reduce the SEM.
To benchmark each method we compare σ e and SEM of the ground state energies samples. 95% confidence intervals (CI) were calculated using bootstrapping with 10, 000 resamples with replacement. The full statistical analysis is given in Table I.
Qualitatively, the noise-free LiH simulation results in Fig. 7 show that VQE with unitary partitioning applied as either a LCU or a sequence of rotations give a similar distribution of energies compared to standard VQE. This is expected, as unitary partitioning leaves a given molecular Hamiltonian H q unchanged -equation 9 -thus both the standard deviation σ = H q 2 − H q 2 and full configuration interaction ground state energy E F CI will remain unchanged.
Quantitatively, the σ e of ground state energy estimates of LiH for each method were very similar, with the largest difference being 4.8 mHa. Note σ of the full population is independent of the number of samples N taken. Thus even though the distributions in Fig. 7 look very similar, the number of data points in each curve is significantly different and therefore the SEM is significantly different for each implementation.
Whereas, for the noise-free simulation of H 2 (Fig. 6c), the sample standard deviation of e j from VQE with unitary partitioning applied were an order of magnitude lower than standard VQE. We expect this is due to the small number of distinct e j outcomes for this specific problem under unitary partitioning.
As unitary partitioning is designed to require fewer terms to be measured, for a fixed measured budget M , the total number of energy estimates will be larger. The sequence of rotations construction of unitary partitioning is deterministic and will always give more e j samples than conventional VQE. Hence, the SEM of both H 2 and LiH noiseless simulations using the sequence of rotations method were an order of magnitude lower than standard VQE. On the other hand, the LCU realization of unitary partitioning is probabilistic. Even though fewer terms need measurement, post-selection requires some samples to be discarded. We see this in the simulation for LiH, where the LCU approach actually has the fewest e j samples at 1447349 compared to 1616700 for standard VQE. As the σ e of all three approaches are similar, the LCU implementation has the highest SEM in this case. The advantage over standard VQE is thus dependent on the success probability, which for each circuit is given by the inverse l 1 norm squared of the operator to be implemented as a LCU. Importantly post-selection is only performed on the ancilla register. The success probability is inversely proportional to the dimension of the ancilla Hilbert space. The number of ancilla qubits scales logarithmically with the number of terms in each anticommuting set (n ancilla = log 2 (|H S l | − 1) ), and the dimension of the ancilla Hilbert space, and hence success probability, is inversely proportional to the size of the anticommuting sets.
The experimental results for H 2 on ibmqx2 show that applying the unitary partitioning technique does not appreciably change the performance of VQE, when combined with error mitigation techniques. We suspect this is due to the extra coherent resource required to perform R l causing an increase in errors, which offsets the improvement of the SEM given by the technique. We expect this to be mitigated as gate fidelities increase.
The experimental execution of R l by LCU on ibmqx2 performed comparably to the sequence of rotations realization. Ignoring post-selection issues, the LCU algo-  The mean, standard deviation and standard error on the mean for each method calculating the ground state energies of H 2 and LiH using single-shot VQE. The simulator backend represents a noise-free QPU emulator and ibmqx2 a real quantum device. Ibmqx2-raw are the raw experimental results from the QPU and ibmqx2-mit with measurement error mitigation applied. 95% confidence intervals (CI) were calculated using bootstrap resampling [45].
rithm is more complex and requires more qubits to implement. We believe this motivates further examination of the use of more advanced quantum algorithms on NISQ devices. A particular feature of our results from ibmqx2 ( Fig.  6) is that the mean ground state energy obtained is overestimated by a seemingly constant amount. We suspect this could be due to two effects. Firstly, our ansatz circuit prepares the ground state. Any coherent errors in this circuit will increase the energy of the state prepared by virtue of the variational principle [46]. Secondly, inspecting the qubit Hamiltonian for H 2 most coefficients are positive. As our results overestimate the energy, it implies that measurement outcomes of each n-fold Pauli operator are more frequently +1 causing each estimate of e j to be larger. This could be an indication of a higher  The single-qubit gate error rates of IBM QPU's have error rates in the range of 0.1%-0.3% and two-qubit gate errors in the range of 2%-5% [48]. The most error-prone operation is measurement and ibmqx2 on average has a measurement error rate of 4%, but this can be much higher (13%) [48]. This large measurement error is apparent when comparing the raw and measurement error mitigated results from the QPU simulation of H 2 . In future experiments, it would be interesting to improve measurement fidelity, for example by using invert-andmeasure designs [48] as well as flipping the qubit encoding (|0 → |1 and |1 → |0 ) as in [49], or by other mitigation schemes [50].
In the original work in which these techniques were proposed, it was shown that the variance of the different methods should be similar [26]. This is observed on both the QPU emulator and quantum device.
Crucially, when partitioning the qubit Hamiltonian into anticommuting cliques (equation 9), the greatest measurement reduction is obtained if the minimum clique cover is found. This cover has the fewest H S l sets possible. However, non-optimal clique covers still give a measurement reduction. As the size of the quantum circuit for R l is proportional to the number of terms in H S l , we propose that for practical applications a non-optimal clique cover is beneficial. By splitting the problem Hamiltonian into pairs of anticommuting operators (|H S l | = 2 ∀{l} l=0,1,...,mc−1 ), the extra coherent resources required to perform R l are experimentally realistic for current and near term devices. This offers a constant factor improvement to the number of measurements required. A detailed circuit analysis of different implementations of unitary partitioning is given in the next Section.
FIG. 8: Quantum circuit to perform a unitary operator given as an exponentiated n-fold tensor product of Pauli operators [47].

IV. CIRCUIT ANALYSIS
In order to investigate the circuit depth of each technique, we consider circuits made up of arbitrary single qubit and CNOT gates. However, often when analysing fault-tolerant protocols it is common to consider the universal gate set of Clifford and T gates. The LCU method only requires arbitrary rotations in the operator G, whereas the sequence of rotations method requires a rotation for every operator in an anticommuting set, apart for P (l) w . The relative depth of LCU vs sequence of rotations circuits would be interesting to explore in this setting.

A. Sequence of rotations circuit analysis
In Section II A, it was shown that R S l could be constructed by a sequence of R We need to consider the cost to perform each R (l) wk rotation. Whitfield et al. in [47] show how to build the required quantum circuits for these types of operators and an example is illustrated in Fig. 8.
Every R (l) wk circuit will require O 2(N s − 1) CNOT gates, 1 R z (θ) gate and O(2N s ) change of basis gates {H, R x (θ)}. Here N s is the number of system qubits. A single R (l) wk is needed for each term in the set S l , apart from for P (l) w . The total number of rotations that make up the full sequence of rotation operator R S l is therefore |H S l | − 1. |H S l | is the size of the anticommuting set. The overall gate count scales as O (2N s + 1)(|H S l | − 1) single qubit and gates and O 2(N s − 1)(|H S l | − 1) CNOT gates.
We note that there is a choice in the ordering of R (l) wk when constructing R S l . By choosing an ordering that maximises the common substring between Pauli strings defining R (l) wk (lexicographical order), it is possible to cancel the common change of basis gates between subsequent R (l) wk rotations. We refer the reader to [51], which gives further possible gate cancellations -including CNOT cancellations. This can significantly reduce the circuit depth when constructing R S l .

B. LCU implementation
In the linear combination of unitaries approach to unitary partitioning, R l is written as a linear combination of n-fold Pauli operators (equation 37) -up to a complex sign. Fig. 2 shows how to implement such operators.
Such an operator can be implemented using the LCU technique. To achieve this, the gates required to realise G (l) (equation 38) and U (l) LCU (equation 39) are required. We will not explicitly consider the construction of G, as it heavily depends on the ancilla state required and many different approaches are possible. In the worst case, without introducing any additional qubits, O(N c log 2 (N c ) 2 ) standard 1 and 2-bit gate operations are required [52]. Here the number of control qubits N c is given by the number of operators U (l) LCU is constructed from. In this case N c = log 2 (|H S l | − 1) .
On the other hand, the quantum circuit to construct U LCU is well defined. Overall, (|H S l | − 1) N c -bit controlled P q gates are required. To efficiently construct each control P q gate, N w = (N c − 1) work qubits are employed. The control states of the ancilla qubits are stored on these work qubits using Toffoli gates [53]. An example is shown in Fig. 9. For every control P i a cascade of O(2(N c − 1)) Toffoli gates are required.
With no circuit simplifications on the ancilla and work qubit registers, the number of Toffoli gates required scales as O 2 Nc (2N c − 2) . However, significant simplifications can be made. Here we assume all 2 Nc control states are required.
Importantly if we arrange the sequence of control gates optimally, we can reuse some of the work qubits when the control states overlap. Fig. 10 and Fig. 11 show different possible circuit templates that can be utilized to simplify the quantum circuits.
To maximise the circuit simplifications on the ancilla and work qubit registers we show how a Gray encoding of U LCU should be used. Fig. 26 in Appendix B shows the control states for N c = 5 in a Gray and binary encoding. Importantly in a Gray code, adjacent bitstrings only differ by one bit [54]. In other words, the Hamming distance between adjacent control states is always 1.
Consider x leading bits in common between adjacent control bitstrings, where 2 ≤ x ≤ (N c − 1). In the circuit picture, these are the cases when the top x controls between two adjacent control unitaries are in common. For these cases, we get 2(x−1) trivial Toffoli reductions (Fig.  10a) on the Toffoli gates between the control unitaries. The number of times each x occurs is given by 2 x and therefore the total trivial Toffoli reduction is given by: (43) Next, consider the case of x = N c − 1. After the trivial Toffoli gate cancellations, there must be two Toffli gates that must differ by one bit in a Gray code. These will cancel to a single CNOT -illustrated in Fig. 10b. This occurs 2 x times generating an additional 2 x = 2 Nc−1 CNOT gates and removing a further 2 x (2) = 2 x+1 = 2 Nc Toffoli gates. No further reductions are possible. A full example of this is given in Fig. 12.
For 0 ≤ x < (N c − 1), after the trivial Toffoli simplifications, it will always be possible to convert the two Toffoli gates into a CNOT gate. Again they must differ by one bit in a Gray code and the template in Fig.  10b can be applied. The remaining circuit will have a Toffoli-CNOT-Toffoli. These can be further reduced by applying a template shown in Fig. 11. In a Gray encoding, the template in Fig. 11 can always be applied, giving  Fig. 13a shows an example case with a Gray encoding. Fig. 13b shows how a less optimal reduction, which occurs when a binary encoding is used. Fig. 14 shows another example, where a binary encoding again results in a less optimal reduction compared to a Gray code. In a Gray encoding, 3 Toffoli gates will be cancelled at each step and 1 CNOT gate generated. The number of Toffoli gates removed in this process is given by: The increase in CNOT count is given in equation 45.
The total number of CNOT gates is given by equation 45. In a Gray encoding, the optimized number of Toffoli gates is given by equation 46. Here T is short for Toffoli.  When U LCU is encoded using a Gray code and all 2 Nc control states are used there will be 3(2 Nc−1 ) − 5 Toffoli gates and 2 Nc−1 − 1 additional CNOT gates. Each Toffoli gate can be decomposed into 9 single qubit gates and 6 CNOT gates [57], therefore the reduced gate count requires 27(2 Nc−1 ) − 45 single qubit gates and 19(2 Nc−1 ) − 31 CNOT gates.
So far these counts do not include any gate that acts on the system register, as different approaches are possible. In the next two Sections we analyse two different possibilities -a cascade and direct approach. We consider each case with a Gray encoding of the control unitaries.

LCU cascade
In the cascade approach, each control-P i operator is performed using different changes of basis, a cascade of CNOT gates and a CNOT controlled by the work qubits. This approach requires a change of basis for every X and Y of each P i gate of U LCU (equation 39). In general O(2N s ) gates. The resulting operator can then be implemented using a cascade of O(2(N s − 1)) CNOT gates and a control Z gate. Two Hardamard gates can convert the control Z gate into a CNOT gate. An example is shown in Fig. 15.
The additional gate count on the system register will scale as O 2 Nc (2N s + 2) single qubit gates and O 2 Nc (2N s − 1) CNOT gates.
The overall gate count over the system and ancilla reg- H H FIG. 16: Example quantum circuit to perform a multi-control P i gate via a direct approach. By limiting the size of each anticommuting clique to |H S l | ≤ 5 ∀{l} l=0,1,...,mc−1 no work qubits will be required to implement R l .
The quantum circuits required to implement unitary partitioning under these conditions are realistic for implementation on current and near term devices. This offers a constant factor improvement on the number of measurements required.
If anticommuting cliques |H S l | > 5 are present, they can be partitioned into separate subsets each of size less than 5. The produced sets will still be valid anticommut- ing cliques. A re-normalization on all subsets must also be performed.

Further LCU simplifications
An additional circuit simplification is possible cancellations in the gates making up U LCU . We did not explicitly consider this in our work here, but note its clear application. The ordering of the control unitaries in U LCU is arbitrary and common qubit-wise Pauli strings can be cancelled. The optimal reduction is obtained if Pauli operators on common qubits are maximised -this is known as a lexicographical ordering. An example reduction is given in Fig. 17.
We did not employ this process for our H 2 simulation, as it offered no improvement. The LiH problem would have benefited from this reduction however, as we only simulated this problem on a QPU emulator multicontrol gates could be performed directly. We therefore didn't decompose these operations into their single and two qubit gates and simulated the control P i gates directly.

C. LCU Unary Implementation
Another approach to implementing the LCU is having each control unitary in equation 39 controlled by its own qubit. Hence the number of control qubits required is N c = |H S l | − 1. This is known as a unary or one-hot encoding [58]. An example encoding for 8 states is given in Table II. Under a unary encoding, the quantum circuit for G (equation 38) is made up of a two R y rotation on each qubit, where the amplitude αq αq 1 is encoded on either the zero or one state. This is due to each α q being real and positive. The number of single qubit gates on the ancilla register scales as O 2(|H S l | − 1) . No work qubits are required. The cascade or direct approach can then be utilized to implement the gates which act on the system register. Fig. 18 illustrates this approach with each amplitude encoded on the |0 a state. This implementation uses an exponentially small subspace of the ancilla qubits' Hilbert space [58]. This isn't an efficient use of quantum memory, but reduces the circuit depth of LCU significantly.
O(N c log 2 (N c ) 2 ) standard 1 and 2-bit gate operations [52] single:  Table III summaries the different resources required to implement R l via different approaches.
In summary, the sequence of rotations implementation of unitary partitioning gives the lowest gate count. For the different LCU implementations, the direct unary approach provides the lowest depth quantum circuits at the cost of requiring an ancilla qubit for each unitary in U (l) LCU . Recently, the largest implementation of VQE to date was only able to make use of 12 qubits out of 53 available [7] and these unused qubits could be utilized for unary encoding. When this overhead becomes prohibitively large for sizeable H S l sets, the Gray encoding schemes can be used.

V. CONCLUSION
Our work shows that the unitary partitioning technique for measurement reduction can significantly improve the precision of variational calculations. For a fixed measurement budget M , fewer terms need to be estimated and thus the total number of separate energy estimates is increased. As the sample standard deviation of energies is similar for the different approaches, the standard error of the mean will be lower when unitary partitioning is applied. Our results indicate the deterministic sequence of rotations implementation offers the best improvement, which we find in our noiseless simulation of H 2 and LiH. In contrast, the LCU approach is probabilistic and some measurements must be discarded. The advantage over standard VQE is thus dependent on the success probability. This naive implementation of LCU can be improved by using oblivious and standard amplitude amplification [31][32][33][34], which can boost the probability of success. However, further coherent resources are required.
The experimental results obtained using IBM's NISQ device (ibmqx2) show VQE with unitary partitioning applied performs no worse than conventional VQE for a fixed M , when combined with error mitigation techniques. Even though unitary partitioning requires fewer terms to be combined to give an energy estimate leading to less statistical noise and more energy samples, we suspect the additional coherent resources required causes an increased error accumulation, which offsets the advantages given by the technique. As quantum devices continue to improve, this effect should be reduced and we expect unitary partitioning will benefit many variational quantum algorithms.
Our work shows how precision can be improved for a fixed number of calls to a QPU; however, an alternate outlook is how this technique may allow larger problems to be studied. For a given precision, applying unitary partitioning requires fewer samples and thus may allow larger scale simulations to be performed on reasonable timescales.
Future work should investigate how the variance of energies obtained changes if different terms in H S l are reduced to, as there is flexibility in the unitary partitioning technique. We also note that this work has an interesting application to the recently proposed frugal shot Rosalin optimizer [59], which uses a weighted random sampling of Hamiltonian terms. Unitary partitioning transforms the Hamiltonian of interest into one with fewer terms of different coefficients; the effect this has on the optimizer's performance is an interesting avenue to explore.
Acknowledgments.-A. R. acknowledges support from the Engineering and Physical Sciences Research Council (EP/L015242/1). P. J. L. and A. T. acknowledge support by the NSF STAQ project (PHY-1818914). P. V. C. is grateful for funding from the European Commission for VECMA (800925). A. R. would also like to thank J. Dborin for useful discussions.  converted into the qubit Hamiltonian using the Bravyi-Kitaev transformation in OpenFermion [37]. The following Sections give numerical details.

Molecular Hydrogen
In the minimal STO-3G basis the qubit Hamiltonian for H 2 in the BK representation is: This Hamiltonian only acts off diagonally on qubits 0 and 2 [60], therefore it can be reduced to: Overall 2 qubits are required, and any Pauli operator indexed with a 2 is re-labelled with an index of 1. The input state was found by diagonalising the problem Hamiltonian (Equation A2) -|ψ ground H2 = −0.1125 |01 + 0.9936 |10 . For our calculation, the bond length was set to R(H-H)=0.74Å. Note that we index the state from left to right.
To perform unitary partitioning, the qubit Hamiltonian for H 2 needs to be split into anticommuting sets H S l . As discussed in the main text, the NetworkX package was utilized to do this [39]. First, a graph of the qubit Hamiltonian was built, where each node is a term in the Hamiltonian. Next edges are put between nodes on the graph that anticommute. Finally, a graph colouring of the complement graph was performed. This searches for the minimum number of colours required to colour the graph, where no neighbours of a node can have the same color as the node itself. The "largest first" colouring strategy in NetworkX was used. Each unique colour represents an anticommuting clique. Fig. 5 shows the method applied to H 2 and Table IV gives the resulting anticommuting sets H S l obtained. Note the set index l represents a unique colour obtained in the graph colouring. This approach is the minimum clique cover problem mapped to a graph colouring problem.
The following subsections give the quantum circuits used to estimate the ground state of H 2 by standard VQE, and VQE with unitary partitioning applied.       Table IV, this table gives