A general framework for randomized benchmarking

Randomized benchmarking (RB) refers to a collection of protocols that in the past decade have become central methods for characterizing quantum gates. These protocols aim at efficiently estimating the quality of a set of quantum gates in a way that is resistant to state preparation and measurement errors. Over the years many versions have been developed, however, a comprehensive theoretical treatment of RB has been missing. In this work, we develop a rigorous framework of RB general enough to encompass virtually all known protocols as well as novel, more flexible extensions. Overcoming previous limitations on error models and gate sets, this framework allows us, for the first time, to formulate realistic conditions under which we can rigorously guarantee that the output of any RB experiment is well-described by a linear combination of matrix exponential decays. We complement this with a detailed analysis of the fitting problem associated with RB data. We introduce modern signal processing techniques to RB, prove analytical sample complexity bounds, and numerically evaluate performance and limitations. In order to reduce the resource demands of this fitting problem, we introduce novel, scalable post-processing techniques to isolate exponential decays, significantly improving the practical feasibility of a large set of RB protocols. These post-processing techniques overcome shortcomings in efficiency of several previously proposed methods such as character benchmarking and linear-cross entropy benchmarking. Finally, we discuss, in full generality, how and when RB decay rates can be used to infer quality measures like the average fidelity. On the technical side, our work substantially extends the recently developed Fourier-theoretic perspective on RB by making use of the perturbation theory of invariant subspaces, as well as ideas from signal processing.


I. INTRODUCTION
In the last few years significant steps have been taken towards the development of large-scale quantum computers. A key part of the development of these quantum computers is a set of tools that provide diagnostics, certification, and benchmarking. Particularly for quantum operations, stringent conditions have to be met to achieve fault tolerance. Motivated by this observation, in recent years a significant body of work has been dedicated to the development of tools for the certification and benchmarking of quantum gates. A prominent role in this collection of tools is taken by methods that can be collectively referred to as randomized benchmarking (RB). These methods have risen to prominence because they conform well to the demands of realistic experimental settings. They estimate the magnitude of an average error of a set of quantum gates in a fashion that is robust to errors in state preparation and measurement (SPAM) and moreover is, in many settings, efficient, in the sense that the resources required scale polynomially with the number of qubits in the device. The various versions of RB apply sequences of randomly chosen quantum gates of varying length. Small errors are thus amplified with the sequence length, and gate quality measures can be extracted from the dependence of the output data on sequence length.
In RB protocols, group structures feature strongly, in that the gate set considered is in almost all cases a subset of a finite group. Such group structures not only make it possible to efficiently make predictions for error-free sequences and compute inverses, but they also provide the means to analyze the error contribution after averaging. Originally proposed for random unitary gates [1][2][3], RB is now most prominently executed with gates from the so-called Clifford group [4][5][6], a set of efficiently classically simulatable quantum gates that take a key role specifically in fault-tolerant quantum computing [7]. It has also been considered for other (subsets of) finite groups [8][9][10][11][12][13][14][15]. Moreover, RB has been generalized to capture other figures of merit of gate sets, such as relative average gate fidelities to specific anticipated target gates [4], fidelities within a symmetry sector [9,10], or the unitarity [16]. More recently, with the challenges of realizing fault-tolerant quantum computers in mind, emphasis has been put on capturing losses, leakage, and cross-talk within a single scheme [17][18][19]. Also, data from RB (or rather, suitably combined data from multiple such experiments) can be sufficient to acquire full tomographic information about a quantum gate [20][21][22]. This adds up to a wealth of RB protocols [23] proposed over the previous years. Fig. 2 summarizes a (to our knowledge) up-to-date list of the theoretical proposals for RB procedures presently known.
A significant body of work moreover deals with the limitations and precise preconditions of RB. The originally rather stringent assumptions on noise being necessarily identical across different quantum gates have over time been relaxed for particular protocols in later work [24][25][26], and the connection between the output of RB and operationally relevant quantities (such as average fidelity) has been studied in some detail [26,27].
And yet, it seems fair to say that a comprehensive picture of RB schemes for the quantum technologies [28] has so far been lacking; in particular, a theoretical framework broad enough to formalize the pre-conditions ensuring the proper functioning of RB protocols, beyond case-by-case arguments for specific protocols. This is unsatisfactory, as the development of higher-quality quantum gates currently relies heavily on a plethora of tailor-made variants of RB. This motivates our current effort at providing a clear and rigorous underpinning for RB and exploring its underlying mathematical structure, putting all variants of RB on a common footing.
With this effort we aim not only to better understand these protocols, but also to increase trust in them, making it possible to reliably use them without a detailed understanding of their inner workings. This is a timely effort, as procedures that fit within the RB framework, such as linear-cross-entropy benchmarking [29] and the behaviour of noisy random circuits more generally, have been the topic of significant attention recently [30][31][32], including for the purpose of benchmarking [33]. Since we identify linear-cross-entropy benchmarking as a randomized benchmarking procedure, our general framework connects directly to this timely discussion.
At the same time our framework allows us to go significantly beyond current protocols and establish a series of novel theoretical results and benchmarking schemes, addressing several shortcomings of the current state of the art. Among others, these novel results include a rigorous error bound for generator-style randomized benchmarking, a formal equivalence between linear cross-entropy benchmarking and randomized benchmarking, and a novel, scalable method for isolating signals in RB experiments, an absolute requirement if one wants to apply RB to non-standard gate sets. This latter method, which we call filtered RB, is a significant conceptual improvement over standard RB schemes, promising greater flexibility and applicability. Notably, it also obviates the need for physically implemented inversion gates in randomized benchmarking experiments and for the preparation of specific input states, making implementation significantly more straightforward. This framework therefore also constitutes a solid basis for developing new randomized benchmarking schemes. Altogether these results substantially advance the understanding of the possibilities and requirements of randomized benchmarking as a practical tool for estimation and certification.

II. OVERVIEW OF RESULTS
In this work, we aim at developing a mathematically comprehensive framework of randomized benchmarking protocols, synthesizing, generalizing, and substantially strengthening previous work. This paper covers a variety of different aspects of randomized benchmarking (RB), from general theorems on the validity of RB data, to a detailed study of the classical post-processing of data generated by RB, and an in-depth discussion of the connection between the outputs of RB and average fidelity. As our work is often quite technical, we have formulated a series of 'take-home messages' at the end of this section, summarizing the key takeaways of our work for experimental practice.

A. A general framework for randomized benchmarking
We begin by providing a general framework that generalizes and covers (to the best of our knowledge) all RB procedures currently present in the literature. This can also be thought of as an attempt at a formal definition of RB protocols, and is largely an effort to organize and formalize knowledge already present in the RB literature. RB protocols can be divided into two separate phases: a data collection phase, and a data processing phase.
• The data collection phase corresponds to the part of the protocol involving the actual quantum computer and can be described as (1) the preparation of a quantum state, (2) the application of a sequence of random quantum operations, capped by (3) an inversion operator mapping the state (ideally) to a specified final state (usually the initial state), upon which (4) a measurement is then performed.

Figure 1. The basic structure of RB. The RB data collection phase iterates the following steps: after (1) the preparation of an initial state ρ_0, (2) a sequence of m random gates g_i is applied, (3) followed by the gate g_inv inverting the sequence up to an end gate g_end, and (4) a final POVM measurement. In the data processing phase the measurement data, for many random sequences and different sequence lengths, is post-processed on a classical computer to extract decay parameters quantifying the imperfections in the implemented gates.
• This process yields estimates of a success probability p(m) for different sequence lengths m, which constitutes the input to the data processing phase. In this phase (which is entirely classical) the data p(m) is fitted to a functional model, generally a linear combination of exponential decays. One can consider the decay rates of these exponential decays as direct measures of the quality of the implementation, or further relate them to operational quantities like the average fidelity.
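For the common single-decay case, this fitting step can be sketched in a few lines. The model p(m) = A f^m + B and all numbers below are illustrative assumptions of ours, not values from this work:

```python
# Minimal sketch: fit RB data to p(m) = A * f**m + B, where A and B absorb
# SPAM errors and f is the decay rate of interest. Synthetic data only.
import numpy as np
from scipy.optimize import curve_fit

def rb_model(m, A, f, B):
    return A * f**m + B

rng = np.random.default_rng(0)
ms = np.arange(1, 101)
true_A, true_f, true_B = 0.45, 0.98, 0.5
p = rb_model(ms, true_A, true_f, true_B) + rng.normal(0, 0.002, ms.size)

# initial guess p0 matters in practice; a poor guess can stall the fit
(A_hat, f_hat, B_hat), _ = curve_fit(rb_model, ms, p, p0=[0.5, 0.9, 0.5])
```

With low shot noise and a good initial guess, `f_hat` recovers the decay rate to a few decimal places; multi-exponential data, as discussed below, is a much harder problem.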
Starting with the data collection phase, we write down a general RB protocol (alg. 1). This protocol depends on a number of input parameters, and by making particular choices for these parameters we can obtain all RB protocols currently in the literature. The key parameters include the group G from which the gates are drawn, the probability distribution ν according to which they are sampled, and the reference representation ω. Based on these choices one can distinguish classes of RB procedures, ranging from uniform RB, where ν is the uniform distribution over G, to protocols where the sampling distribution is very far from being uniform (for instance taking only non-zero values on a small set of generators).
These classes of RB procedures are motivated by the qualitatively different behaviour of the associated output data p(m), which we will discuss in more detail later. They also partially, but not completely, align with notions already present in the literature. In particular we will see that the behaviour of this data is dictated by the group G and the reference representation ω. We can always decompose this representation ω into a direct sum of irreducible sub-representations, i.e., $\omega = \bigoplus_{\lambda\in\Lambda} \sigma_\lambda^{\oplus n_\lambda}$, where the $\sigma_\lambda$ are irreducible (and occur with multiplicity $n_\lambda$).
A key tenet of RB is that this decomposition decides the functional form of the output data p(m) as a function of sequence length m. More precisely, we expect behaviour of the form
$$p(m) = \sum_{\lambda\in\Lambda} \operatorname{Tr}\!\big(A_\lambda M_\lambda^m\big), \qquad (1)$$
where $A_\lambda$, $M_\lambda$ are $n_\lambda \times n_\lambda$ matrices encoding state preparation and measurement errors (SPAM), and the quality of the gate implementation, respectively. This formalizes in a precise way the general idea that RB data is well described by a linear combination of exponential decays and allows for the classical processing of RB output data, thus providing the connection between the data collection and the data processing phases. Note, however, that if irreducible sub-representations appear with non-trivial multiplicities, the functional form of eq. (1) includes matrix exponential decays. These can have qualitatively different features than scalar exponential decays, requiring a more sophisticated data processing approach. It is for instance possible for these matrices to have complex eigenvalue pairs, which will lead to damped-oscillation behaviour in randomized benchmarking data.
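As a toy illustration of the latter point (our own construction, not an example from any experiment discussed here), a 2 × 2 matrix M with a complex-conjugate eigenvalue pair produces exactly this damped oscillation in Tr(A M^m):

```python
# Sketch: p(m) = Tr(A M^m) where M is a real 2x2 rotation-scaling matrix
# with eigenvalues r * exp(+/- i*theta); the trace equals 2 * r**m * cos(m*theta),
# a damped oscillation rather than a pure exponential decay.
import numpy as np

r, theta = 0.97, 0.3
M = r * np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
A = np.eye(2)  # trivial SPAM matrix for illustration

p = [np.trace(A @ np.linalg.matrix_power(M, m)) for m in range(50)]
# p oscillates in sign while its envelope decays like r**m
```

A naive single-exponential fit to such data would fail badly, which is one reason the more sophisticated processing discussed below is needed.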

B. The functional form of randomized benchmarking data
At the core of the RB literature is the promise that RB output data has a very specific form, namely that of a linear combination of (matrix) exponential decays (as expressed in eq. (1)), decaying with the length of the sequences of random gates. Moreover, this linear combination is of a specific structure, determined by the implemented gate-set. However, this functional form of the RB output data is not guaranteed by the protocol itself, but is instead derived from assumptions on the noisy implementation of the random quantum operations. In early work this assumption took the form of the gate-independent noise assumption. Later, it was realized that this assumption is not satisfactory [26] and it was subsequently generalized for standard Clifford RB to the more general assumption that the noisy implementations of gates are Markovian and time-independent, and moreover either that the gate-dependent variation of the noise is upper bounded in the diamond norm (in the work of ref. [24]), or lower bounded in average fidelity (in the work of ref. [25]). Here, we provide a series of theorems generalizing these works to (almost) all existing RB protocols, justifying eq. (1) in broad circumstances. The theorems we prove make claims of different strength for different classes of RB protocols, as per the typology outlined in fig. 2.
• We prove that the output data of uniform RB protocols (as per the typology in fig. 2) can be described as a linear combination of exponentials, up to an exponentially small error, provided that the gate implementations are Markovian, time-independent, and on average close in diamond norm to an ideal implementation that is a representation. This closeness requirement is independent of the particular RB protocol and of the underlying Hilbert space dimension. The complete statement is given as theorem 8, which can be summarized as follows:

Theorem 1 (Informal version of theorem 8). Consider an RB experiment with sequence length m, with gates uniformly drawn from a group G and implemented through a reference representation $\omega(g) = \bigoplus_{\lambda\in\Lambda} \sigma_\lambda^{\oplus n_\lambda}(g)$. Denote the corresponding noisy implementation on the quantum computer as φ(g) (note that this assumes time-independent and Markovian noise). If we have
$$\frac{1}{|G|}\sum_{g\in G}\big\|\omega(g) - \phi(g)\big\|_\diamond \le \delta \le \tfrac{1}{9}, \qquad (2)$$
then the output data p(m) of the RB experiment obeys the relation
$$p(m) = \sum_{\lambda\in\Lambda}\operatorname{Tr}\!\big(A_\lambda M_\lambda^m\big) + \varepsilon(m),$$
with the error ε(m) exponentially suppressed in m. Here $A_\lambda$ and $M_\lambda$ are $n_\lambda \times n_\lambda$ matrices, with $M_\lambda$ depending only on the actual implementation φ.

Figure 2. An overview of RB schemes, indicating how they fit within our typology (see section V D) of RB schemes and which theorem covers the behaviour of their output data (see section VI): Standard Clifford RB [5,34], Real RB [35], Simultaneous RB [36], Dihedral RB* [37], CNOT-dihedral RB [38], Character RB* [39], Restricted gate set RB [40], Monomial RB [14], Complete RB [41], Leakage RB (1)** [19], Leakage RB (2)** [42], Unitarity RB** [16], Loss RB** [18], Measurement-based RB [43], Logical RB [44], Pauli channel tomography* [45,46], Linear XEB* [29], Standard interleaved RB [4], T-gate interleaved RB [47], Iterative RB [48], Individual gate RB [9], Hybrid RB* [49], Cycle RB* [13], RB tomography [50], Approximate RB [14], NIST RB [5,51], Generator RB [14,52], Direct RB [15], Coset (2-for-1) RB* [39]. A * indicates that the protocol has a non-trivial post-processing scheme, while a ** indicates that the protocol in its original specification has no inversion gate; we discuss how this is equal to uniform RB (with inversion) together with a post-processing step in section VIII.
The proof of this theorem relies on a combination of techniques from earlier works: Taking the matrix Fourier transform perspective introduced to RB in ref. [25] and combining it with the realization in ref. [24] that the diamond distance (averaged over random gates) is the correct distance measure for the formulation of assumptions on noisy gate implementations. We also make heavy use of the perturbation theory of invariant subspaces of non-normal matrices [53,54]. We note that the specific parameter 1/9 is an artifact of the proof techniques and probably sub-optimal.
• Building on theorem 8, we prove multiple theorems for non-uniform RB protocols. The first subtype, approximate RB, is covered by theorem 9, a direct generalization of theorem 8, which also features an exponentially suppressed error. For the second subtype, subset RB, on the other hand, we can only give a weaker statement, guaranteeing that the RB output data is described by a linear combination of exponentials up to a constant error (in sequence length), as long as the sequence length m is taken to be larger than a mixing length $m_{\mathrm{mix}}$. This mixing length indicates the moment at which the m-fold convolution $\nu^{*m}$ of the probability distribution ν, which governs the sampling of random gates, becomes close to the uniform distribution; it is a function of both the initial distribution ν and the underlying group G. We can summarize our result on subset RB as follows:

Theorem 2 (Informal version of theorem 10). Consider an RB experiment with sequence length m, with gates drawn from a group G according to a probability distribution ν and implemented through a reference representation $\omega(g) = \bigoplus_{\lambda\in\Lambda} \sigma_\lambda^{\oplus n_\lambda}(g)$. Denote the corresponding (noisy) actual implementation on the quantum computer as φ(g). If, for some sequence length $m_{\mathrm{mix}}$, the convolved distribution $\nu^{*m_{\mathrm{mix}}}$ is δ′-close to the uniform distribution, the average deviation satisfies
$$\frac{1}{|G|}\sum_{g\in G}\big\|\omega(g) - \phi(g)\big\|_\diamond \le \frac{\delta}{m_{\mathrm{mix}}},$$
and δ + δ′ ≤ 1/9, then for $m \ge m_{\mathrm{mix}}$ the output data p(m) of the RB experiment obeys the relation
$$p(m) = \sum_{\lambda\in\Lambda}\operatorname{Tr}\!\big(A_\lambda M_\lambda^m\big) + \varepsilon,$$
with the error bound ε independent of m. Here $A_\lambda$ and $M_\lambda$ are $n_\lambda \times n_\lambda$ matrices, with $M_\lambda$ depending only on the actual implementation φ.
This theorem cannot guarantee an exponential error bound, but it still improves on the state of the art [14,15], both in the generality of the assumptions made and in the size of the possible error. Note also the appearance of the $m_{\mathrm{mix}}^{-1}$ term in the average diamond-norm deviation. This can be read as the requirement that the generating gates be of sufficiently high quality that any (composite) uniformly randomly chosen gate will be close in diamond norm to its ideal version. In this sense this requirement is of the same stringency as eq. (2).
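The notion of a mixing length can be illustrated on a toy example of our own choosing (a hypothetical group and distribution, not one analyzed in this work): a lazy random walk on the cyclic group Z_12, where we count how many convolutions are needed before ν^{*m} is close to uniform in total variation:

```python
# Toy mixing-length computation: nu is supported on the generators {0, +1, -1}
# of Z_12; repeated group convolution drives nu^{*m} toward uniform.
import numpy as np

N = 12
nu = np.zeros(N)
nu[[0, 1, N - 1]] = [0.5, 0.25, 0.25]   # lazy step on a small generator set

def convolve(p, q):
    # group convolution on Z_N: (p*q)[k] = sum_j p[j] * q[(k - j) mod N]
    return np.array([sum(p[j] * q[(k - j) % N] for j in range(N))
                     for k in range(N)])

uniform = np.full(N, 1.0 / N)
dist, m = nu.copy(), 1
while 0.5 * np.abs(dist - uniform).sum() > 0.01:   # total-variation distance
    dist = convolve(dist, nu)
    m += 1
# m is now an empirical mixing length m_mix for this nu at threshold 0.01
```

For larger groups one would bound the mixing length spectrally rather than by direct convolution, but the qualitative picture is the same.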
• We discuss the behaviour of interleaved RB protocols, illustrating how standard interleaved RB, as well as all but one non-standard interleaved RB protocol, are covered by theorem 8. We consider two non-standard interleaved RB protocols, namely cycle benchmarking [13], which is covered by our theorems in a non-trivial way and robust RB tomography [50], which is not covered by our theorems. We argue that this is not a weakness of our argument but rather that the RB output data of this protocol behaves in a non-standard manner, requiring tailor-made analysis.
• In section X, we provide a discussion of the central assumption $\frac{1}{|G|}\sum_{g\in G}\|\omega(g)-\phi(g)\|_\diamond \le \delta$, made on the behaviour of noisy gates in the above theorems. We argue that this assumption is a natural one to make (theorem 18) and moreover that it cannot be replaced by a similar assumption involving the average fidelity without requiring the gates to be exponentially close to perfect in the number of qubits. This also answers an open question posed in ref. [25] in the negative.
The unifying conceptual theme of all of our theorems is the fact that RB can be seen as a 'power iteration in frequency space'. The behaviour of the output data is dictated by the dominant eigenvalues of a fixed matrix that is obtained from the Fourier transform [25] (in a specific sense defined later) of the noisy implementation map φ. Taking powers of this matrix results in the exponential suppression of all but the largest eigenvalues. Together, these results provide a rigorous justification for the folkloric knowledge that RB protocols function under broad experimental circumstances.
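A minimal numerical sketch of this power-iteration picture (with an arbitrary stand-in matrix of our own choosing, not an actual Fourier transform of a channel):

```python
# Powers of a fixed matrix exponentially suppress all but the dominant
# eigenvalues: repeatedly applying F aligns any generic vector with the
# dominant eigenvector, mirroring how long gate sequences isolate the
# dominant decay in frequency space.
import numpy as np

F = np.diag([0.99, 0.6, 0.3])        # stand-in spectrum; 0.99 dominates
w = np.ones(3) / np.sqrt(3)          # generic starting vector

for _ in range(50):
    w = F @ w
    w /= np.linalg.norm(w)           # renormalize to watch only the direction
# after 50 powers, w is numerically aligned with the dominant eigenvector,
# since (0.6/0.99)**50 is of order 1e-11
```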

C. A framework for randomized benchmarking data processing
The second phase of the RB protocol, the data processing phase, takes in RB output data, which is well-described by a linear combination of exponentials, and outputs the decay rates associated with those exponentials. If the data is well described by a single exponential decay, this can be done by off-the-shelf curve-fitting procedures, but if the RB output data is of a more complex form (such as a linear combination of several exponentials), a more flexible approach is required. Here we provide a self-contained discussion of modern signal processing methods for extracting decay parameters from data with a functional form given by eq. (1). We review signal processing algorithms, in particular the MUSIC and ESPRIT algorithms, that are at least in principle applicable to the most general form of RB output data, even including matrix exponentials. Beyond that, we discuss theoretical guarantees that have been derived for these algorithms and their implications for RB data processing. Building upon these guarantees, we derive a sampling complexity statement that ensures the recovery of decay rates with these algorithms under measurements with finite statistics. We complement our analytical discussion with numerical evaluations and simulations that demonstrate the practical performance of these algorithms. Importantly, our discussion details the fundamental limitations of post-processing RB output data featuring many exponential decays.

D. A general post-processing scheme for isolating exponential decays

Even with modern methods, fitting multiple exponential decays is a difficult affair, and in many scenarios one is only interested in a subset of the decay parameters that describe the output data of a particular RB experiment. Because of this, several methods have been developed to isolate particular exponential decays. Examples include the class of uniform RB protocols without inversion gates (indicated with a double asterisk '**' in fig. 2) and a variety of other protocols that take linear combinations of RB output data with different ending gates g_end to isolate particular exponential decays (indicated with a single asterisk '*'). In section VIII, we give a novel class of protocols called filtered RB that subsumes all these earlier approaches. For simplicity, we only consider uniform RB, but our results generalize to other types of RB.
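To make the mechanics of multi-exponential extraction concrete, here is a minimal ESPRIT-style sketch on noiseless synthetic data (an illustration under our own assumptions, not the estimator analyzed in this work):

```python
# ESPRIT-style recovery of decay rates z_i from p(m) = sum_i c_i * z_i**m,
# using the shift-invariance of the signal subspace of a Hankel matrix.
import numpy as np

z_true = np.array([0.98, 0.80])            # two decay rates to recover
c = np.array([0.6, 0.4])                   # unknown amplitudes (SPAM-like)
M, K = 40, 2                               # number of samples, model order
p = np.array([np.dot(c, z_true**m) for m in range(M)])

L = M // 2
H = np.array([p[i:i + L] for i in range(M - L + 1)])   # Hankel data matrix
U, s, Vt = np.linalg.svd(H)
Us = U[:, :K]                              # signal subspace (rank K)
# shift invariance: Us[1:] ~ Us[:-1] @ Psi, and eig(Psi) are the decay rates
Psi = np.linalg.pinv(Us[:-1]) @ Us[1:]
z_est = np.sort(np.linalg.eigvals(Psi).real)[::-1]
```

With realistic shot noise the singular-value threshold and model order K become the delicate choices; this is precisely where the guarantees and limitations discussed in section C enter.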
This class of protocols is based on the realization that RB output data (indexed by an ending gate g end ) can be seen as a vector in the group algebra of the group being benchmarked. This allows for the design of filter functions α λ : G → C, based on the matrix elements of irreducible representations, that isolate exponential decays associated with subrepresentations of the ideal implementation of the gates in the group G. Using these filter functions we can write down a general post-processing scheme for the isolation of exponential decays and prove that it works when the assumptions of theorem 8 are satisfied. We prove a theorem of the following form.
Theorem 3 (Informal version of theorem 16). Let $\alpha_\lambda : G \to \mathbb{C}$ be the filter function associated with the irreducible representation $\sigma_\lambda$ and let $p(m, g_{\mathrm{end}})$ be the output data associated with a uniform RB experiment with ending gate $g_{\mathrm{end}}$, satisfying the condition eq. (2) with parameter δ. We have that
$$\mathbb{E}_{g_{\mathrm{end}}}\!\big[\alpha_\lambda(g_{\mathrm{end}})\, p(m, g_{\mathrm{end}})\big] \approx \operatorname{Tr}\!\big(A_\lambda M_\lambda^m\big),$$
with $M_\lambda$ associated with the irreducible sub-representation $\sigma_\lambda$ (as per eq. (1)).
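A hypothetical minimal example of such a filter (a toy construction of our own, not the general scheme of theorem 16): for the abelian group Z_N all irreps are one-dimensional characters χ_k(g) = e^{2πikg/N}, and filtering synthetic data with α_k = conj(χ_k) isolates the decay in sector k:

```python
# Filtered-RB toy: synthetic data p(m, g_end) mixes N scalar decays f_k, one
# per character sector of Z_N; filtering with the conjugate character
# projects out a single sector, yielding f_k**m exactly (by orthogonality).
import numpy as np

N = 8
ks = np.arange(N)
# symmetric choice f_k = f_{N-k} keeps the synthetic data real-valued
f = np.array([0.90 + 0.01 * min(k, N - k) for k in range(N)])

def p(m, g_end):
    # invented data: each sector k contributes a decay f_k**m
    return np.real(sum(f[k]**m * np.exp(2j * np.pi * k * g_end / N)
                       for k in ks)) / N

def filtered(m, k):
    # filter alpha_k(g) = conj(chi_k(g)); character orthogonality isolates f_k**m
    return np.real(sum(np.exp(-2j * np.pi * k * g / N) * p(m, g)
                       for g in range(N)))
```

For non-abelian groups the characters are replaced by matrix elements of irreps, but the orthogonality mechanism that isolates a single decay is the same.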
Beyond this theoretical result we note that this novel class of protocols allows one (by a simple re-parametrization) to eliminate the need for an explicitly implemented inversion gate in RB, making the protocol significantly simpler to implement in practice.
We also give a statistical analysis of this post-processing scheme. In particular, we prove that if the measurement POVM performed in the RB experiment is (proportional to) a state 3-design, the sample complexity of the complete benchmarking procedure (data collection plus post-processing) is asymptotically independent of the dimension of the underlying Hilbert space, for arbitrary benchmarking groups. This is a strong improvement on previous attempts at such a general post-processing procedure. Note that the 3-design condition appearing here plays a similar role in controlling the variance in scalable estimation procedures such as shadow estimation [55,56].
We stress, however, that the 3-design condition is a sufficient condition and there are examples in the literature covered by this post-processing scheme where this condition is not met but the overall procedure is still scalable. In particular we discuss the recently proposed linear cross entropy benchmarking procedure (XEB) [29] in section VIII C. We argue that the variant of XEB that performs multiple random gate sequences is an example of uniform RB (as per the typology) combined with an instance of our general post-processing scheme. Furthermore, we argue that the sample complexity of linear XEB is asymptotically independent of the underlying Hilbert space dimension even though the POVM being measured is not itself a 3-design.
E. Randomized benchmarking and average fidelity

RB was originally designed to estimate the average gate fidelity of a group of gates. Under the assumption of gate-independent noise, it can be proven (as was already done in ref. [1]) that the decay rates estimated in an RB experiment correspond exactly to the average fidelity of the noise associated with the gates. However, if this condition is relaxed, the connection between these decay rates and the average fidelity is less clear. Even more strongly, it has been argued in ref. [26] that due to a so-called gauge freedom in the representation of the gate set, the entire premise of a connection between RB decay rates and average fidelity may be suspect. This is because the choice of gauge does not influence the RB decay rates, but it does affect the average gate fidelity. Indeed, it has been shown that under some transformations the two quantities may differ by orders of magnitude, even in the gate-dependent noise case (where the previously proven connection can be seen as a 'natural' gauge choice).
Subsequently, proposals have been made to reconnect the average gate fidelity and RB decay rates in the context of standard Clifford RB: a natural gauge called the depolarizing gauge [25], and the noise-in-between-gates framework. Both of these proposals provide an exact connection between the decay rates of RB and the average fidelity. However, several crucial questions of interpretation have been left open, and in this work we aim to address some of them and sharpen others.
In section IX B 2, we substantially generalize both proposed connections between decay rates and average fidelity to RB with arbitrary finite groups. What is more, we argue that these two proposals are in fact equivalent. Moreover, we present an explicit example of a completely positive implementation map which is not completely positive in the depolarizing gauge (or, equivalently, has non-completely-positive noise-in-between-gates). This implies that both these interpretations of RB decay rates are not fully satisfactory, because they cannot be guaranteed to correspond to the average fidelity of a physical process. That said, this does not mean that RB decay rates are not useful figures of merit, as they can always be interpreted as meaningful benchmarks in their own right.
Complementing this, following the approximate approach of ref. [27], we show that the problem of connecting RB decay rates with the average gate fidelity can be (approximately) reduced to the overlap between the dominant (ideal) unperturbed eigenvectors and their (implemented) perturbed versions in Fourier space. We show that, as long as this overlap is sufficiently close to 1, any gauge choice that corresponds to a CPTP channel will connect RB parameters to the average gate fidelity. Hence we obtain, under precise conditions, an approximate version of the connection between average fidelity and RB decay rates.
More formally, we leverage the Fourier transform framework introduced in ref. [25] to derive an expression for the entanglement fidelity (which is linearly related to the average fidelity), averaged over all elements of the group, of the schematic form
$$\frac{1}{|G|}\sum_{g\in G} F_{\mathrm{e}}\big(\phi(g), \omega(g)\big) = \sum_{\lambda\in\Lambda} \alpha_{\mathrm{Overlap}}(\lambda)\, f_{\max}(\sigma_\lambda) + \alpha_{\mathrm{Res}},$$
where $f_{\max}(\sigma_\lambda)$ is the RB decay parameter associated with the irreducible subrepresentation $\sigma_\lambda$. In the Fourier framework, $f_{\max}(\sigma_\lambda)$ corresponds to the largest eigenvalue of the Fourier transform of the implementation map φ evaluated at $\sigma_\lambda$. Furthermore, the parameter $\alpha_{\mathrm{Overlap}}$ encodes the overlap between the (left and right) eigenvectors associated with this largest eigenvalue and the eigenvector of the Fourier transform of the reference representation ω evaluated at $\sigma_\lambda$. Finally, the term $\alpha_{\mathrm{Res}}$, the residuum, encodes information about the sub-dominant eigenspaces of the Fourier transform. The factors $\alpha_{\mathrm{Overlap}}$, $\alpha_{\mathrm{Res}}$ are gauge dependent. We give bounds on the overlap and residuum in terms of the deviation of φ from the reference ω and discuss relevant scenarios where these terms contribute only negligibly to the entanglement fidelity (and thus where RB decay data corresponds approximately to an average fidelity).

F. Non-technical discussion
In this work, we develop a comprehensive theory of randomized benchmarking (RB). Our main motivation has been to give a mathematical framework for RB and to classify known schemes. It should be clear, however, that our work goes significantly beyond a mere classification of what is present in the literature. Since our work is in parts rather technical, we will in the following formulate a series of 'take-home messages': actionable advice for experimentalists interested in using RB in the laboratory and in developing new protocols to suit their needs.
1. RB gives exponential decays under broad (Markovian) circumstances. Confirming experimental intuition, and extending earlier results for specific groups, our main result (theorem 8) proves that RB protocols behave (up to an exponentially small correction factor) as expected whenever the noise afflicting the gate-set is Markovian and time independent. Because the correction factor is so small, any deviation from the prescribed functional form can in fact be taken as evidence of non-Markovian or time-dependent noise processes (as suggested earlier by ref. [24]). We do wish to emphasize that the error term in theorem 8 can be quite significant for small sequence lengths. Hence we recommend as a rule of thumb that RB experiments should not include very short (m ≤ 5) sequence lengths, especially when strong gate-dependent (but Markovian) noise is suspected, as this might bias the estimator for the decay rate.
however, this process need not be physical (i.e., it does not always correspond to a CPTP map). Alternatively, we show that RB decay rates can always be connected approximately to the average fidelity of a physical process, but the degree of approximation depends on external beliefs about the underlying noise process. Hence, we believe the interpretation of RB decay rates as an average fidelity to be broadly valid, but subject to technical caveats.
These three messages can be considered folk knowledge in the RB community, for which we provide a rigorous underpinning. However, our work also contains new conceptual developments, notably the following.
1. Filtering scalably extends RB to a large class of groups. As formalized in section VII, a major practical hurdle to applying RB with arbitrary finite groups is that this generically requires fitting the output data to multi-exponential decays. This is a difficult problem both in theory and in practice, and it has so far contributed to the limited experimental use of RB beyond a few groups (such as the Clifford group). Our new procedure, which we call filtering (or filtered RB), takes a major step towards solving this problem by giving a generic procedure for isolating exponential decays in a fully scalable manner. This approach is discussed in great detail in sections VII and VIII, with filter functions being introduced in section VIII.A.
2. Inversion gates are not required for RB. Another key practical difficulty in performing randomized benchmarking has been the necessity to compute and implement a global inversion gate. However, filtered RB has the added benefit that it does not require the application of inverses. Instead, a random noisy gate sequence can be directly compared to a perfect, classically simulated version to extract the same RB decay rates, making the quantum part of the protocol significantly easier to implement. However, this simplicity is gained at a (constant) extra sampling overhead, as the inversion gate in standard RB also suppresses the sampling complexity [57].
With these new contributions, our framework serves as a convenient basis to design new schemes that come with rigorous performance bounds built in. We expect this to facilitate and accelerate the development of more sophisticated and tailor-made benchmarking schemes as required by experimental practitioners. Steps in this direction have already been made [58][59][60]. In particular, ref. [59] explores the framework put forth here for continuous groups of quantum gates.

G. Structure of this work
In section III, we discuss mathematical preliminaries: We set the notation for the rest of the work and recall standard notions from representation theory. This section can be skipped by experienced readers.
In section IV, we discuss implementation maps: linear maps from finite groups to super-operators, a central concept in our treatment of RB. We also give an introduction to matrix-valued Fourier theory and explicitly state several results from the perturbation theory of non-normal matrices which we use throughout the rest of the work.
In section V, we give a general framework for RB, with its two phases (data collection and data processing), and give a general protocol for the data collection phase. This protocol, which depends on a range of input parameters, covers (the data collection phase of) all known versions of RB. We also discuss a typology of RB schemes, dividing up the known protocols into a few generic classes.
In section VI, we present a series of general theorems that govern the behaviour of the output data of a RB protocol. We confirm the folklore knowledge that RB data is well described by a linear combination of (matrix) exponentials, under some general assumptions.
In section VII, we discuss general procedures for extracting decay parameters from RB output data. We discuss implementation and general limitations and prove a sampling complexity statement for RB.
In section VIII, we propose a general post-processing method for isolating exponential decays associated with particular sub-representations. We argue that this post-processing method covers many previously proposed procedures. We also prove a sufficient condition under which this post-processing scheme is scalable for any RB protocol and analyze linear cross-entropy benchmarking as an example.
In section IX, we discuss the relation between the decay rates generated by RB and the average fidelity, focusing in particular on the gauge freedom in the presentation of the underlying noise channels.
Finally, in section X, we argue that the assumptions made in section VI are natural and in some sense necessary for the correct behaviour of RB.

III. PRELIMINARIES: QUANTUM CHANNELS AND GROUP REPRESENTATIONS
In this section, we will go over some of the basic mathematical machinery needed to talk about randomized benchmarking (RB) and prove our central theorems. We will discuss quantum channels and their matrix representations (section III A), and groups and group representations (section III B). This is fairly standard material, and beyond the setting of notation it can be skipped by an experienced reader.
We begin by setting the stage and introducing some basic notation used throughout our work. We will denote complex vector spaces by V or more explicitly by C^d. We denote by M_d the vector space of complex linear transformations of C^d and by S_d the space of linear transformations of M_d, often called super-operators. Here d is an integer that in many cases can be thought of as being a power of two (d = 2^q); however, all theorems are valid for general d unless explicitly stated. We will denote by Tr_V the partial trace over a tensor factor V (of an implied tensor product space V ⊗ W for some W). Finally, we will denote the complex conjugate by a bar (i.e., Ā is the entry-wise complex conjugate of A).

A. Quantum channels and the operator-matrix representation

Unitary operations as generated by quantum gates (the focus of attention in this work) are quantum channels. Formally, quantum channels are super-operators, that is, elements of S_d, that are trace preserving and completely positive. In order to represent quantum channels (and elements of S_d more generally), we make use of the operator-matrix representation. Given a quantum channel E ∈ S_d, we can represent it as an element of M_{d²} by choosing an orthonormal basis {B_i}_{i=1}^{d²} of M_d (with respect to the trace or Hilbert-Schmidt inner product) and setting

E_{ij} = Tr[B_i† E(B_j)].

Analogously, (density) matrices ρ ∈ M_d can be represented as vectors |ρ⟩⟩ with entries ρ_i = Tr[B_i† ρ]. Note that the action E(ρ) now corresponds to a matrix-vector multiplication E|ρ⟩⟩ and the concatenation of two channels E and E′ to a matrix multiplication EE′. We can analogously write a (POVM element) matrix Π ∈ M_d as a co-vector ⟨⟨Π| with entries Tr[Π† B_i]. With this, the probability to obtain an outcome described by the POVM element Π when measuring ρ is p(Π|ρ) = ⟨⟨Π|ρ⟩⟩ = Tr[Πρ].
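The operator-matrix representation is easy to realize numerically. The following sketch uses a single qubit with the normalized Pauli basis and a depolarizing channel as the example map (both are our illustrative choices, not fixed by the text), and checks that ⟨⟨Π|E|ρ⟩⟩ reproduces Tr[Π E(ρ)]:

```python
import numpy as np

# Orthonormal (Hilbert-Schmidt) basis for M_2: normalized Pauli matrices.
d = 2
paulis = [np.eye(2), np.array([[0, 1], [1, 0]]),
          np.array([[0, -1j], [1j, 0]]), np.diag([1.0, -1.0])]
basis = [P / np.sqrt(d) for P in paulis]  # Tr[B_i^dag B_j] = delta_ij

def vec(rho):
    """|rho>>: expansion coefficients of rho in the orthonormal basis."""
    return np.array([np.trace(B.conj().T @ rho) for B in basis])

def op_matrix(channel):
    """Operator-matrix representation E_ij = Tr[B_i^dag E(B_j)]."""
    return np.array([[np.trace(Bi.conj().T @ channel(Bj))
                      for Bj in basis] for Bi in basis])

# Example channel: depolarizing, E(rho) = p*rho + (1-p)*Tr[rho]*I/d.
p = 0.9
depol = lambda rho: p * rho + (1 - p) * np.trace(rho) * np.eye(d) / d

E = op_matrix(depol)
rho0 = np.diag([1.0, 0.0]).astype(complex)   # |0><0|
Pi = np.diag([1.0, 0.0]).astype(complex)     # POVM element |0><0|

# p(Pi|rho) = <<Pi| E |rho>> as a co-vector / matrix / vector product.
prob = np.real(vec(Pi).conj() @ E @ vec(rho0))
```

For this choice, E is diagonal in the Pauli basis, diag(1, p, p, p), and `prob` equals Tr[Π E(ρ₀)] = (1 + p)/2.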

B. Representations of groups
At the heart of our discussion will be notions of representations of groups. In this section, we will hence recall some basic facts about the representations of finite (and compact) groups over complex vector spaces, with a focus on their use in quantum computation. For a more in-depth treatment of this topic we refer to refs. [61,62]. In this work we restrict our attention to finite groups to keep the notation concise. Most results can be analogously stated for continuous, compact groups and derived following the same strategy. Ref. [59] carefully discusses the required modifications and gives explicit reformulations for continuous compact groups.

Representations
Let G be a finite group and consider the space M_d of linear transformations of C^d. A representation ω is a map ω : G → M_d that preserves the group multiplication, i.e.,

ω(g)ω(h) = ω(gh) for all g, h ∈ G.

We will require the operators ω(g) to be unitary as well (for finite groups this can always be arranged).

Reducible and irreducible representations
If there is a non-trivial subspace W of C^d such that for all vectors w ∈ W we have

ω(g)w ∈ W for all g ∈ G,

then the representation ω is called reducible. The restriction of ω to the subspace W is also a representation, which we call a sub-representation of ω. If there are no non-trivial subspaces W with this property, the representation ω is called irreducible. We will generally reserve the letter σ to denote irreducible representations. Two representations ω, ω′ of a group G are called equivalent if there exists an invertible linear map T such that

T ω(g) T^{-1} = ω′(g) for all g ∈ G.

We will denote this by ω ≃ ω′. For finite groups G the set of irreducible representations (up to the above equivalence) is finite. We will denote it by Irr(G).

Sums, products, and Maschke's Lemma
We will make use of sums and products of representations. Given representations ω, ω′, the maps

(ω ⊕ ω′)(g) = ω(g) ⊕ ω′(g) and (ω ⊗ ω′)(g) = ω(g) ⊗ ω′(g)

are again representations. They are, however, generally not irreducible (even if ω and ω′ are). However, Maschke's Lemma ensures that every representation ω of a finite group can be written, uniquely up to equivalence, as a direct sum of irreducible representations, that is,

ω ≃ ⊕_{λ∈Λ} σ_λ^{⊕n_λ},

where the index set Λ is a subset of the set Irr(G) and n_λ is an integer denoting the number of copies (or multiplicity) of σ_λ present in ω.

Characters
Characters are a central object in representation theory, given by the trace of a representation.
Definition 4 (Character of a representation). The character χ_ω of a representation ω of a group G is defined as

χ_ω(g) = Tr[ω(g)].

One of the most important properties of characters of irreducible representations is the following orthogonality relation: for σ_λ, σ_μ ∈ Irr(G),

(1/|G|) Σ_{g∈G} χ_λ(g) χ̄_μ(g) = δ_{λμ}.

Projections onto irreducible representations
Given a representation ω = ⊕_{λ∈Λ} σ_λ^{⊕n_λ} on a vector space V_ω, we can choose a basis {v_j^λ | j ∈ {1, . . . , d_λ}} for each irreducible subspace V_λ. Each vector v in V_ω can thus be written as a linear combination v = Σ_{λ∈Λ} Σ_{j=1}^{d_λ} c_j^λ v_j^λ. We can conversely identify the basis vector components of any vector v by application of an appropriate projection P_j^λ, such that P_j^λ v = c_j^λ v_j^λ. These projections are given by

P_j^λ = (d_λ/|G|) Σ_{g∈G} σ̄_λ(g)_{jj} ω(g).

Note that, in order to construct these projections, knowledge of the diagonal elements of the corresponding irreducible representation σ_λ is required. However, it is also possible to project any vector onto distinct irreducible subspaces (up to multiplicity) by using only knowledge of the character of a representation:

P_λ = Σ_{j=1}^{d_λ} P_j^λ = (d_λ/|G|) Σ_{g∈G} χ̄_λ(g) ω(g).

This last formula follows simply from the definition of the character as χ_λ(g) = Tr[σ_λ(g)].
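The character projection formula can be checked on a small example. The sketch below uses the regular representation of the cyclic group Z_3 (our illustrative choice), whose three one-dimensional irreps have characters χ_k(g) = exp(2πi kg/3); the resulting P_k are orthogonal rank-one projectors summing to the identity:

```python
import numpy as np

# Regular representation of Z_3: omega(g) cyclically shifts the basis vectors.
G = [0, 1, 2]
omega = {g: np.roll(np.eye(3), g, axis=0) for g in G}  # omega(g)omega(h)=omega(g+h mod 3)

# Characters of the three 1-dimensional irreps: chi_k(g) = exp(2*pi*1j*k*g/3).
w = np.exp(2j * np.pi / 3)
chi = {k: {g: w ** (k * g) for g in G} for k in range(3)}

# Character projection: P_k = (d_k/|G|) sum_g conj(chi_k(g)) omega(g), d_k = 1.
P = {k: sum(np.conj(chi[k][g]) * omega[g] for g in G) / 3 for k in range(3)}
```

Each P_k projects onto the one-dimensional eigenspace on which the shift acts with eigenvalue exp(2πi k/3), reflecting the decomposition of the regular representation into its irreps.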

IV. FOURIER TRANSFORMS AND PERTURBATION THEORY OF IMPLEMENTATION MAPS
In this section, we review the concept of group implementation maps and their Fourier theory (section IV B). Mathematically this corresponds to non-commutative harmonic analysis of matrix-valued functions. We also discuss perturbation theory for non-normal matrices. This material is somewhat less well known, so we spend more time discussing these concepts.

A. Implementation maps
Given a group G, we can assign a quantum circuit (an element of U(d)) to each group element, which gives rise to a representation of the group. However, in practice quantum circuits will not be executed perfectly, but rather include noise. This noise can be modelled by a quantum channel, and we can thus envision assigning to each group element a quantum channel modelling the real implementation of that circuit. These quantum channels can be composed, but this composition will not necessarily respect the group structure and will thus in general not form a representation. However, we can define the more general concept of an 'implementation map' φ, which is a function from a finite group G to the space of super-operators S_d, where we will usually assume that φ(g) is a trace non-increasing quantum channel for all g. We will call φ completely positive if and only if φ(g) is completely positive for all g ∈ G, and draw explicit attention to this property where needed. Finally, note that if φ(g)φ(h) = φ(gh) for all g, h ∈ G then φ would be a representation. We can think of the implementation map as being an abstract description of the noisy implementation of the group elements, which depends on the noise processes in the quantum computer but also on other choices such as the compilation of circuits into elementary gates.

B. Fourier transforms of implementation maps
When considering an implementation map one can ask precisely when it is a representation, and failing that, whether it is close to a representation (in some reasonable way). To answer this question we need to introduce some mathematical machinery. This machinery was first introduced into the theory of randomized benchmarking (RB) by ref. [25], based on work by Gowers & Hatami [63], which is itself a partial review of older mathematical work. In this section, we will consider general maps φ from a group G to a space of d × d matrices M_d. Thinking of S_d as a matrix space, our notion of implementation map can be seen to be a special case of these maps. Given a map φ we define its Fourier transform F(φ) as

F(φ)(λ) = (1/|G|) Σ_{g∈G} φ(g) ⊗ σ̄_λ(g)

for all λ ∈ Irr(G). So the Fourier transform F(φ) is a function from the set Irr(G) of irreducible representations of G to a set of matrices. This definition has all the properties of a Fourier transform. Firstly, it has an inverse transform, which maps F(φ) back to φ, given by

φ(g) = Σ_{λ∈Irr(G)} d_λ Tr_{V_λ}[(1 ⊗ σ_λ(g)^T) F(φ)(λ)]

for all g ∈ G, where d_λ is the dimension of V_λ, the space on which the representation σ_λ acts.
Secondly, it has the correct behaviour with respect to convolutions of implementation maps: the Fourier transform of a convolution corresponds to a product of Fourier transforms. Recalling the definition of the convolution of two implementation maps φ, φ′,

(φ * φ′)(g) = (1/|G|) Σ_{h∈G} φ(gh^{-1}) φ′(h),

we can easily see that

F(φ * φ′)(λ) = F(φ)(λ) F(φ′)(λ)

for all λ ∈ Irr(G). Another useful property is the Parseval identity

(1/|G|) Σ_{g∈G} ||φ(g)||₂² = Σ_{λ∈Irr(G)} d_λ ||F(φ)(λ)||₂².

Finally, we note that the Fourier transform (evaluated at an irreducible representation) of a representation is an orthogonal projector with its rank given by the multiplicity of that irreducible representation. To see this, consider a representation ω = ⊕_{λ∈Λ} σ_λ^{⊕n_λ}. We have that

F(ω)(λ)² = F(ω)(λ) = F(ω)(λ)†

for all λ ∈ Irr(G). Moreover, for λ ∈ Λ we have

rank[F(ω)(λ)] = Tr[F(ω)(λ)] = (1/|G|) Σ_{g∈G} χ_ω(g) χ̄_λ(g) = n_λ

by the character orthogonality formula.
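The projector property is easy to verify numerically. The sketch below uses the convention F(φ)(λ) = |G|⁻¹ Σ_g φ(g) ⊗ σ̄_λ(g) (normalization conventions vary across the literature) for a three-dimensional representation of Z_2 containing the trivial irrep with multiplicity two:

```python
import numpy as np

# Z_2 = {0, 1}; representation omega = trivial (+) sign (+) trivial on C^3.
G = [0, 1]
omega = {0: np.diag([1.0, 1.0, 1.0]), 1: np.diag([1.0, -1.0, 1.0])}

# The irreps of Z_2 are 1-dimensional: sigma_0(g) = 1, sigma_1(g) = (-1)^g.
sigma = {0: lambda g: np.array([[1.0]]),
         1: lambda g: np.array([[(-1.0) ** g]])}

def fourier(phi, lam):
    """F(phi)(lam) = (1/|G|) sum_g phi(g) (x) conj(sigma_lam(g))."""
    return sum(np.kron(phi[g], np.conj(sigma[lam](g))) for g in G) / len(G)

F0 = fourier(omega, 0)  # should be a projector of rank 2 (multiplicity of trivial)
F1 = fourier(omega, 1)  # should be a projector of rank 1 (multiplicity of sign)
```

Since the irreps here are one-dimensional, the Kronecker products are trivial and F0, F1 come out as the diagonal projectors diag(1, 0, 1) and diag(0, 1, 0), with ranks matching the multiplicities.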

Fourier operators
We also give another, useful way to think about the matrix Fourier transform, namely in terms of what we call Fourier operators.
Note that the set of maps φ : G → S_d can be seen as a vector space under point-wise addition (of the super-operators). We can further lift this vector space to an algebra by considering the convolution operation * (as defined in eq. (26)) on the functions in this vector space. We can construct a faithful (i.e., injective) matrix representation of this algebra as

F(φ) = (1/|G|) Σ_{g∈G} φ(g) ⊗ ω̄_G(g), with ω_G = ⊕_{λ∈Irr(G)} σ_λ.

This is just the Fourier transform of φ gathered in a direct sum, F(φ) = ⊕_{λ∈Irr(G)} F(φ)(λ) (note that Irr(G), and hence the sum, is finite for any finite group). By the Peter-Weyl theorem for finite groups one can equally well think of ω_G(g) as an element of the group algebra C[G] associated with G; we will not be using this point of view explicitly. We will call F(φ) the Fourier operator of φ. From the properties of the Fourier transform we immediately see that

F(φ * φ′) = F(φ) F(φ′).

It will be useful to equip the algebra of Fourier operators with several norms, based on the diamond norm ||·||_⋄ on S_d (in principle this construction works for any norm on S_d). We define these norms block-wise in terms of the diamond norm, where D_ω = ⊕_{λ∈Irr(G)} d_λ 1_λ collects the relevant dimensional factors. These norms are bona fide matrix norms on the algebra of Fourier operators; notably, they are sub-multiplicative, viz.

||F(φ) F(φ′)|| ≤ ||F(φ)|| ||F(φ′)||,

and similarly for ||·||_m. We also have an identity involving both norms that will be helpful later.

C. Perturbation theory
In this section, we gather some technical tools from matrix perturbation theory that are essential to many of the proofs in this paper. Our sources for this section are the standard books of Stewart and Sun [54] and Kato [53]. For the rest of this section, we will assume that ||·|| denotes a sub-multiplicative matrix norm on M_d.

Let A ∈ M_d be a complex Hermitian matrix. Assume that there exists a unitary matrix X = [X_1, X_2] such that the columns of X_1 and X_2 span invariant subspaces of A, that is,

AX_1 = X_1 A_1, AX_2 = X_2 A_2,

for some matrices A_1, A_2. We call this a spectral resolution of A. We can think of A_1, A_2 as the matrix A restricted to the subspaces of C^d spanned by the columns of X_1, X_2, respectively, and furthermore we assume that the eigenvalues of A_1 are all distinct from those of A_2: the subspaces are then said to be simple. These subspaces are invariant under the action of A in the sense that AX_1 = X_1 A_1 and are hence called invariant subspaces. It turns out that spectral resolutions, and invariant subspaces more generally, are stable against (small) perturbations. That is, given a perturbation matrix E (not necessarily Hermitian), we can find matrices R = [R_1, R_2] and L = [L_1, L_2] such that L† = R^{-1} and

(A + E) R_1 = R_1 A′_1, (A + E) R_2 = R_2 A′_2,

for some A′_1, A′_2, and the matrices R, L are close to X in a well specified sense. This is what one would expect from a perturbation theorem. It, however, only holds if the perturbation E is small with respect to the difference between A_1 and A_2. This difference is made quantitative by the so-called separation function

sep(A_1, A_2) = min_{||P||=1} ||P A_1 - A_2 P||.

This separation function has some rather nice properties. Firstly, it is symmetric in its arguments: sep(A_1, A_2) = sep(A_2, A_1). Secondly, it is stable against perturbations, i.e., given perturbations E_1, E_2 of A_1, A_2 we have

sep(A_1 + E_1, A_2 + E_2) ≥ sep(A_1, A_2) - ||E_1|| - ||E_2||.

With this function we can state the following theorem, which can be derived from theorem 2.8 in ref. [54, p. 238].
Theorem 6 (Perturbation of spectral resolutions). Let A be a Hermitian matrix with a spectral resolution X = [X_1, X_2] as above, and let ||·|| be a matrix norm. Now let E be a complex matrix. If ||E|| is sufficiently small compared to sep(A_1, A_2), then there exist matrices P_1, P_2 with

||P_1||, ||P_2|| = O(||E|| / sep(A_1, A_2)),

such that the columns of R_1 = X_1 + X_2 P_1 and R_2 = X_2 + X_1 P_2 span simple invariant subspaces of A + E.

Proof. From the smallness condition on ||E|| and theorem 2.8 in ref. [54] we conclude the existence of a matrix P_1 with the stated norm bound such that the columns of X_1 + X_2 P_1 span an invariant subspace of A + E; the statement for P_2 follows analogously.

While eigenvalues and invariant subspaces are stable under small perturbations (as discussed above), that is, they are holomorphic functions with respect to analytic perturbations, the same is not true for eigenvectors. This is due to the fact that a vector basis spanning a multi-dimensional eigenspace is not uniquely determined, and thus the eigenvectors of the perturbed eigenspace may be completely different from the unperturbed basis. However, if an unperturbed eigenvalue a_1 is simple, the related eigenvector x_1 is unique (up to a scalar factor), and it is thus stable. We can make this more explicit by specializing theorem 6 to simple invariant subspaces of dimension one. Let us again consider a Hermitian matrix A and adopt a unitary basis transformation X = [x_1, X_2] so that

X† A X = diag(a_1, A_2),

where A_2 ∈ M_{d-1}. In this specific setting, the separation function becomes [64]

sep(a_1, A_2) = min_j |a_1 - a_j(A_2)|,

the distance of a_1 from the spectrum of A_2. From theorem 6, we then have the following.

Corollary 7 (Perturbation of a 1-dim simple subspace). The left and right perturbed eigenvectors originating from x_1 are

r_1 = x_1 + X_2 (a_1 1 - A_2)^{-1} X_2† E x_1, l_1† = x_1† + x_1† E X_2 (a_1 1 - A_2)^{-1} X_2†,

where we neglect terms O(||E||²).
Finally, to analyze perturbations of eigenvalues, we will make use of the Bauer-Fike theorem [54, theorem 1.6]: let A be diagonalizable, S^{-1} A S = diag((a_j)_j), and let E be an arbitrary operator of the same dimension. Then, for any eigenvalue ã of A + E, the bound

|ã - a_j| ≤ ||S|| ||S^{-1}|| ||E||

is satisfied for some eigenvalue a_j, in any vector-induced norm. This implies that, if A is Hermitian (so that S can be chosen unitary), then

min_j |ã - a_j| ≤ ||E||.
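The Hermitian case of the Bauer-Fike bound can be checked numerically; a small sketch with a randomly generated Hermitian matrix and a random (non-Hermitian) perturbation, all parameters illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hermitian A with known spectrum {0, 1, 2, 3} in a random orthonormal basis.
a = np.array([0.0, 1.0, 2.0, 3.0])
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
A = Q @ np.diag(a) @ Q.T

# Small, not necessarily Hermitian, perturbation E.
E = 0.01 * rng.normal(size=(4, 4))

# Bauer-Fike (Hermitian case): every eigenvalue of A + E lies within
# ||E||_2 (spectral norm) of some eigenvalue of A.
perturbed = np.linalg.eigvals(A + E)
bound = np.linalg.norm(E, ord=2)
dists = [min(abs(mu - aj) for aj in a) for mu in perturbed]
```

Here the spectral gaps of A are much larger than ||E||₂, so each perturbed eigenvalue can be unambiguously matched to an unperturbed one.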

V. THE RANDOMIZED BENCHMARKING PROTOCOL
The name randomized benchmarking (RB) is conventionally given to a class of methods that assess the quality of a set of quantum gates. These methods are probabilistic, and can be seen as constructing an estimator for a quantity that captures some notion of gate quality. In this section, we will make an attempt at defining randomized benchmarking. By this we mean that we will attempt to organize and make explicit various ideas that have been present in the literature. We begin (in section V A) by dividing RB into two parts: a data collection phase and a data processing phase. These correspond roughly to the parts of RB performed on a quantum computer and on a classical computer, respectively. Within this division we focus first on the data collection phase. In sections V B and V C, we give a general protocol for the data collection phase of RB. This general protocol depends on a number of input parameters, and we can obtain every known RB protocol from a choice of these input parameters. We complement this protocol with a classification of RB protocols into a few types in section V D. This classification, which pertains only to the data collection phase of RB, is largely a formalization of knowledge implicit in the literature, but we will see that it is a useful organizing tool when proving theorems about the data generated by RB. We discuss this data in section V E.
We note that the output data of RB is assumed to be of a very particular form, namely that of a linear combination of (matrix) exponential decays. However, this form is contingent on assumptions about the quantum computer on which (the data collection phase of) RB is implemented. We discuss what assumptions have been made before in the literature and propose our own set of assumptions, which we justify later in the text.
A. The data collection and data processing phases

RB is composed of two major parts, a data collection phase and a data processing phase. The data collection phase consists of what one typically thinks of as RB: one randomly selects a sequence of quantum gates and applies them to a quantum state together with a global inverse, and measures the resulting state. Averaging over many random choices of these gates one obtains RB output data that depends on the length of the random sequence in a controlled way. This vague description can be made more precise in many different ways and we will provide a general framework for this procedure in the next few subsections.
The data processing phase on the other hand consists of what one then does with the data given by a RB experiment. This can be as simple as fitting the data to an exponential decay, but in many cases also involves more sophisticated processing techniques. The key feature of the RB protocol that allows for a structured approach to data processing is the fact that the RB output data has a very controlled form. We will discuss this form in section V E after more formally discussing the data collection phase of RB.

B. Input parameters
The data collection phase of a RB procedure is characterized by a set of input parameters. These input parameters fully define a protocol (which we write down in section V C) that can be executed on a quantum computer, yielding probabilistic data that can then be interpreted. Below is a list of all input parameters to RB, together with an explanation and examples of choices for these parameters that correspond to versions of RB present in the literature.

1.
A gate set/group: A finite set of unitaries (quantum gates) on C d . In (almost) all RB protocols this gate set is also a finite subgroup G ⊂ U (d) of the unitary group. In a large section of the RB literature the group considered is the q-qubit Clifford group C q , but a range of other choices (such as the Pauli group P q [13], the real Clifford group [35] or the CNOT-dihedral group [37,38]) are possible. Choosing a group fixes what gates RB assesses the quality of and partially determines the structure of the output data. In generator-style RB [14,15] this group is defined implicitly by the set of generators.

2.
A reference implementation/representation: A map φ r from the gate set/group G to the d-dimensional superoperators that specifies how the gates in G should be implemented in the quantum computer. This map takes into account aspects of the specific RB protocol but also how gates are composed of elementary gates and other implementation details. In uniform RB the map φ r is a representation of the group G on S d . The prototypical example is the action on the space of Hermitian matrices ρ by conjugation, i.e., φ r (g)(ρ) = ω(g)(ρ) = U g ρU g † . In general, however, the reference implementation φ r is not a representation, though we will see that for any known RB procedure the reference implementation can be written as φ r (g) = Aω(g)B, where A, B are (unitary) quantum channels. We will refer to ω as the reference representation.
3. An ending gate: A group element g end that dictates the global action of an RB sequence. For most proposals this gate is simply the identity, but in other proposals non-trivial choices for g end (such as choosing it uniformly at random [13,37,39,45]) play an essential role in data-processing schemes. This ending gate also allows us to include RB schemes that do not involve an inversion gate [16,19,42,65]. We emphasize that it is not necessary to implement this gate physically, but rather it arises from compilation.

4.
A set of sequence lengths: A set of integers M denoting the length of the random sequences of gates implemented in a RB experiment. We will denote elements of this set by m and the largest element of this set by M .
5. An input state: A state ρ 0 that is prepared at the beginning of an RB experiment. This state will typically be a pure state (such as the |0, . . . , 0 state vector), but is chosen mixed in some versions of RB [57].
6. An output POVM: A POVM that is measured at the end of an RB experiment. We will denote this POVM as {Π i } i∈I with some index set I. In many cases this is a two-component POVM , but some RB procedures explicitly call for more complex measurements (such as a computational basis measurement [29]).

7.
A set of sampling distributions: A set of probability distributions ν_i for i ∈ {1, . . . , M} over the group G that govern the random sampling of group elements in RB. We will often consider the scenario where all these probability distributions are the same, in which case we will drop the subscript i and just write ν for the probability distribution. Moreover, in almost all instances in the literature this distribution is uniform, i.e., ν(g) = 1/|G|, and unless stated explicitly we will always assume this to be the case.
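For bookkeeping, the seven input parameters above can be collected into a single configuration object. A hypothetical Python sketch (all field names, types and defaults are our own, not part of any protocol specification):

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class RBParameters:
    """Hypothetical container for the RB data collection input parameters."""
    group: Sequence[np.ndarray]              # 1. gate set/group G as unitaries
    reference: Callable[[int], np.ndarray]   # 2. reference implementation phi_r
    ending_gate: int = 0                     # 3. g_end (identity by default)
    sequence_lengths: Sequence[int] = (2, 4, 8, 16)  # 4. the set M
    input_state: np.ndarray = None           # 5. rho_0
    povm: Sequence[np.ndarray] = None        # 6. {Pi_i}
    sampling: Callable[[int], int] = None    # 7. nu_i: draws a group index

# Minimal instantiation with placeholder entries.
params = RBParameters(
    group=[np.eye(2)],
    reference=lambda g: np.eye(4),           # placeholder superoperator
)
```

Defaulting `sampling` to `None` can be read as "uniform over G", matching the convention stated above.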

C. The data collection protocol
Given the input parameters discussed above we can write down a formal procedure for the data collection phase of RB. It has as output an estimator p̂(i, m) of a probability p(i, m) for each POVM element Π_i for i ∈ I and each sequence length m ∈ M.

Algorithm 1: RB data collection
1 for m ∈ M do
2   Sample g_1, . . . , g_m according to the distributions ν_1, . . . , ν_m
3   Prepare the input state ρ_0
4   Apply the noisy gates φ(g_1), . . . , φ(g_m) in sequence
5   Compute the inversion gate g_inv = (g_m · · · g_1)^{-1}
6   Apply φ_r(g_end g_inv) to the state
7   Measure the state in the POVM {Π_i}_{i∈I}
8   Repeat the above many times to obtain estimators p̂(i; g_1, . . . , g_m) for the probabilities p(i; g_1, . . . , g_m)
9   Repeat for many random g_1, . . . , g_m and average to obtain estimators p̂(i, m) for the probabilities p(i, m) = E_{g_1,...,g_m} p(i; g_1, . . . , g_m)
10 end
11 Output the estimators p̂(i, m) for all i ∈ I, m ∈ M

Note that the probabilities p(i, m) depend in a non-trivial manner on the initial state ρ_0, the POVM {Π_i}_{i∈I} and the ending gate g_end. We will, however, suppress this dependence unless it is explicitly necessary to refer to it.
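The data collection loop is straightforward to simulate classically for small systems. The sketch below assumes the single-qubit Pauli group as the gate set, gate-independent depolarizing noise, and a noiseless inversion (all illustrative choices); for this noise model the survival probability comes out as p(m) = (1 + f^m)/2 for every sequence:

```python
import numpy as np

rng = np.random.default_rng(1)
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0]).astype(complex)
gates = [I2, X, Y, Z]   # gate set: single-qubit Pauli group (modulo phase)

f = 0.95                # depolarizing parameter of the per-gate noise

def noisy_apply(U, rho):
    """Ideal conjugation followed by gate-independent depolarizing noise."""
    rho = U @ rho @ U.conj().T
    return f * rho + (1 - f) * np.trace(rho).real * I2 / 2

def rb_probability(m, shots=200):
    """Estimate p(0, m): survival of |0> after m random gates plus inversion."""
    total = 0.0
    for _ in range(shots):
        seq = rng.integers(0, 4, size=m)
        rho = np.diag([1.0, 0.0]).astype(complex)
        for k in seq:
            rho = noisy_apply(gates[k], rho)
        # Ideal (noiseless) inversion of the whole sequence U_tot = g_m ... g_1.
        U_tot = I2
        for k in seq:
            U_tot = gates[k] @ U_tot
        rho = U_tot.conj().T @ rho @ U_tot
        total += np.real(rho[0, 0])
    return total / shots
```

Because depolarizing noise commutes with all unitaries, the estimate is sequence-independent here and matches (1 + f^m)/2 exactly; with more realistic gate-dependent noise the average over random sequences becomes essential.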

D. A typology of randomized benchmarking protocols
Given protocol alg. 1, different choices of the parameters discussed in section V B give rise to different RB procedures. More strongly, (the data collection phases of) all variants of RB currently in the literature can be expressed by choosing these input parameters correctly. Surveying the literature, we can distinguish three major types, differentiated by their reference implementations and sampling distributions. The output data associated with these classes of protocols has varying behaviour and we will treat each class separately in section VI. All protocols included in these classes can be found in fig. 2 (here we will only give illustrative examples).
1. Uniform randomized benchmarking: This is the basic type of RB. It is characterized by the fact that the probability distributions ν_i are the uniform distribution for all i ∈ {1, . . . , m_max}, and that the reference implementation map φ_r is exactly a representation ω, usually the standard action by conjugation given by φ_r(g)(ρ) = ω(g)(ρ) = U_g ρ U_g† for unitaries U_g (other choices have been made in [42,66]). Randomized benchmarking proposals of this type are mainly distinguished by what group G they consider as a gate-set (at least when it comes to the data collection phase; different proposals in this class might have radically different data processing procedures). Protocols of this type include the original RB proposals [1,67] and many others.
2. Non-uniform randomized benchmarking: The defining feature of this class is that the sampling distributions ν_i are not the uniform distribution. It comes in two flavours, which we will discuss separately: (a) Subset RB: Here, the distributions ν_i are far from uniform (and typically only have support on a small subset of the group G). Examples from the literature are refs. [14,15,39,52]. (b) Approximate RB: Here the ν_i are close to uniform. This latter class will turn out to be essentially the same as uniform RB; it has been discussed in ref. [14] and also arises in the original 'NIST' RB proposal [5] (as per the analysis of ref. [51]).
In all works of this type so far the reference implementations are representations (akin to uniform RB).
3. Interleaved randomized benchmarking: This class of RB protocols is characterized by the addition of an extra 'interleaving gate' in the RB procedure. This is a class that is somewhat idiosyncratic, having one standard subtype and a collection of 'non-standard' protocols: (a) Standard interleaved randomized benchmarking: In this class the interleaving gate is an element of the benchmarked group G. In this case we find that it is most useful to interpret interleaved RB as uniform RB, with the reference implementation a representation ω, but with the probability distributions ν i uniform for even i and peaked on a single group element (the interleaving gate) for odd i. We will consider this in more detail in section VI C. The paradigmatic example is ref. [4], but nearly all uniform RB protocols have an interleaved version. (b) Non-standard interleaved randomized benchmarking: These protocols are characterized by the addition of interleaving gates that are not part of the group G as well as non-uniform sampling distributions. We will discuss these protocols on a more case by case basis in section VI C.

Protocols without inversion gates
A number of RB protocols have been developed that do not feature an inversion gate g_inv. These protocols are indicated with a ** in fig. 2. While not immediately obvious, these protocols are actually covered by the general procedure written down in alg. 1. We can think of these protocols as choosing the ending gate g_end at random for each experimental run and averaging over the results. Because of the invariance of uniform group averages, this is equivalent to not including an inversion gate and ending the protocol on a random group element. In section VIII we will see that protocols without an inversion gate can be seen as a special case of a general post-processing scheme for RB data.

E. Output data
There is a folkloric notion that the output data of RB has an exponential dependence on the sequence length, with the rate of decay dependent only on the implementation φ of the gates in G. This was first established to be true for uniform RB (in our typology) with the unitary and Clifford groups where, under certain assumptions (see section V F) on the quantum computer implementing operations, one can prove that p(i, m) = A f^m + B, where f only depends on the implementation map φ and A, B are constants depending on SPAM. However, if the group G was not the Clifford group it was found that the RB output data did not follow a single exponential decay but rather was of the form p(i, m) = Σ_λ A_λ f_λ^m, with the decay constants f_λ depending only on the implementation of the quantum operations and associated with the irreducible sub-representations of the reference representation ω.
However, this functional form is only valid if the reference representation ω has no multiplicities (no irreducible sub-representation occurs more than once), and hence does not describe all possible RB experiments. In this paper we will argue that for a general reference representation of the form $\omega = \bigoplus_{\lambda\in\Lambda} \sigma_\lambda^{\oplus n_\lambda}$ for $\Lambda \subset \mathrm{Irr}(G)$, RB data takes the form
$p(i, m) = \sum_{\lambda\in\Lambda} \mathrm{Tr}\bigl[A_\lambda M_\lambda^m\bigr],$
where $M_\lambda$ is an $n_\lambda \times n_\lambda$ real matrix that only depends on the implementation φ and $A_\lambda$ is an $n_\lambda \times n_\lambda$ matrix encoding SPAM behaviour. Note that the matrices $M_\lambda$ are not required to be normal, or even diagonalizable. This means that p(i, m) can appear strikingly non-exponential (at least when m is fairly small) unless ω is known to be multiplicity-free. We will discuss this in greater detail in section VII when we discuss general fitting procedures.
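To see how multiplicities can make the data look non-exponential, consider a minimal numerical sketch of the matrix decay model above with a single non-diagonalizable 2×2 block. All matrix entries are illustrative choices, not derived from any particular gate set:

```python
import numpy as np

# p(m) = Tr[A M^m] with a non-normal (Jordan-block) decay matrix M.
f = 0.95
M = np.array([[f, 1.0],
              [0.0, f]])                  # non-diagonalizable, eigenvalue f
A = np.array([[0.2, 0.0],
              [0.05, 0.1]])               # stand-in for SPAM contributions

p = [np.trace(A @ np.linalg.matrix_power(M, m)) for m in range(1, 40)]
# Here M^m = [[f^m, m f^(m-1)], [0, f^m]], so p(m) = 0.3 f^m + 0.05 m f^(m-1):
# the m f^(m-1) term makes the signal rise before it decays.
assert p[5] > p[0]            # initially increasing: not a pure exponential
assert max(p) > p[0] and max(p) > p[-1]
```

A fit that assumes a sum of plain exponentials would misinterpret such data, which is exactly the situation theorem 8 below is designed to handle.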

F. Assumptions
The functional form of RB output data given in eq. (63) does not immediately follow from the specification of the protocol in alg. 1. Rather it must be derived based on assumptions on the behaviour of the operations being performed inside the quantum computer. Here we give a run-down of assumptions that are made throughout the literature, and which we will make in order to derive eq. (63). The assumptions we will make are not the most general possible that still lead to eq. (63), but we attempted to strike a balance between generality and operational motivation. In the list we will point out where assumptions can be generalized and refer to work where this is done (for some versions of RB).
• State preparation and measurement consistency: We assume that the initial state $\rho_0$ and the measurement POVM $\{\Pi_i\}_{i\in I}$ are always prepared in the same manner, independently of the gates being implemented. Slightly more strongly, we will assume the existence of quantum channels $\mathcal{E}_{\mathrm{SP}}$ and $\mathcal{E}_{\mathrm M}$ such that the implemented initial state is given by $\mathcal{E}_{\mathrm{SP}}(\rho_0)$ and the elements of the implemented measurement POVM are given by $\mathcal{E}_{\mathrm M}(\Pi_i)$. This assumption is made throughout the RB literature.
• Markovianity and time-independence: We assume that the implementation of a gate g ∈ G is always the same, independently of when it is performed in the RB protocol and independently of its context (the gates being performed before and after it). This assumption leads to the concept of an implementation map $\phi : G \to \mathcal{S}_d$, which assigns to each group element g a completely positive super-operator φ(g) modelling the actual implementation of the gate.
-This assumption is not always justified, as the implementation of a gate can in principle depend on, e.g., the gates being implemented before it or the amount of time elapsed in the protocol. It can also depend on external uncontrolled variables (either deterministic or random). In ref. [68], a model of time dependence has been considered and in refs. [69][70][71] the effect of gate-correlations and certain uncontrolled variables such as quasi-static noise were investigated. In all of these scenarios, however, the exponential behaviour of eq. (63) breaks down. It might be possible to derive assumptions beyond the setting of Markovian time-independence that lead to output data of the correct form, but we will not pursue this here.
• Closeness to reference implementation: In order to derive eq. (63) we must make additional assumptions on the implementation map φ. We will assume that
$\frac{1}{|G|}\sum_{g\in G} \bigl\|\phi(g) - \phi_{\mathrm r}(g)\bigr\|_\diamond \le \delta$
for sufficiently small δ > 0. The appearance of the diamond distance might strike one as overly pessimistic; however, we will show in section IX that it is in fact required. It is also not the most general possible assumption that still guarantees eq. (63) (see below), but it has the advantage of making reference only to physical quantities and being operationally interpretable.
-In early works on RB the standard assumption was that of gate-independent noise. This means the implementation map φ is of the form $\phi(g) = \mathcal{A}\,\phi_{\mathrm r}(g)$ for all g, with some fixed quantum channel $\mathcal{A}$. This is not a very realistic assumption and several attempts were made to replace it with a weaker one. In ref. [67] it was proposed to consider a perturbation $\phi(g) = \mathcal{A}\,\phi_{\mathrm r}(g) + \mathcal{A}_g\,\phi_{\mathrm r}(g)$ with a gate-dependent correction $\mathcal{A}_g$. In ref. [26], however, this analysis was shown to not be strong enough to actually justify behaviour of the form eq. (63). There, an analysis of uniform Clifford randomized benchmarking as a power iteration of a matrix was proposed (see also early work in this direction by [41]), justifying the exponential decay model (but with a non-optimal correction). Subsequently, in ref. [24] eq. (63) was derived (with an exponentially small correction) for uniform RB with the multi-qubit Clifford group under the assumption that there exist super-operators $\mathcal R$, $\mathcal L$ such that $\|\phi(g) - \mathcal L\,\phi_{\mathrm r}(g)\,\mathcal R\|_\diamond \le \delta$ for all g and small enough δ. This assumption is quite general, but has as its main drawback that the operators $\mathcal R$, $\mathcal L$ are not guaranteed to be completely positive, complicating the interpretation of this assumption as a belief about physical quantities. Finally, ref. [25] derives (introducing the Fourier analysis also used here) eq. (63) (up to an exponentially small correction) for uniform RB with the multi-qubit Clifford group under an assumption on the fidelity of the implementation map φ w.r.t. its reference implementation, namely that $1 - F(\phi, \phi_{\mathrm r}) \le \delta$. This assumption has the advantage of making reference to physical objects only, but suffers from the drawback that δ must scale inversely with the underlying Hilbert space dimension for the argument in ref. [25] to hold. We discuss this further in section X.

VI. THE RANDOMIZED BENCHMARKING FITTING MODEL
In this section, we will prove a general theorem about the behaviour of RB output data, i.e., the probabilities p(i, m) associated with an RB experiment with its input parameters specified as in section V B and described in protocol alg. 1. We will argue that for a broad variety of choices of reference implementations and probability distributions this data is well described by a linear combination of exponential (matrix) decays (as in eq. (63)), as long as the physical implementation φ is close to its ideal version, the reference implementation $\phi_{\mathrm r}$. By close we mean that the diamond distance between the physical implementation and the reference implementation, averaged over the group, has to be bounded as
$\frac{1}{|G|}\sum_{g\in G} \bigl\|\phi(g) - \phi_{\mathrm r}(g)\bigr\|_\diamond \le \delta$
for a sufficiently small constant δ. One can think of the above equation as a relatively weak initial belief one must hold about one's quantum computer (instantiated in φ) before one can trust the outcome of RB.
For the rest of the work we will adopt the transfer matrix framework (discussed in section III) for describing the action of super-operators. We also explicitly write implementation noise on the initial state $\rho_0$ and output POVM $\{\Pi_i\}_{i\in I}$ through quantum channels $\mathcal{E}_{\mathrm{SP}}$ (state preparation) and $\mathcal{E}_{\mathrm M}$ (measurement). This is notationally somewhat clumsy, but it makes explicit one of the assumptions underlying RB, namely that SPAM noise is independent of sequence length.
The theorems we present in this section are generalizations of the theorems given in ref. [24], encompassing almost all known RB procedures, but the techniques used are based on the cleaner conceptual framework of matrix-valued Fourier transforms provided by ref. [25], which we reviewed in section IV B. The central observation of ref. [25] is that the data collection phase of uniform RB can be seen as evaluating an m-fold convolution of the implementation map φ. This observation generalizes beyond uniform RB to arbitrary implementation maps. In particular, we see that the averaged outcome probability
$p(i, g_{\mathrm{end}}, m) = \sum_{g_1,\dots,g_m\in G} \nu_m(g_m)\cdots\nu_1(g_1)\,\mathrm{Tr}\bigl[\Pi_i\,\mathcal{E}_{\mathrm M}\,\phi(g_{\mathrm{end}}\,g_{\mathrm{inv}})\,\phi(g_m)\cdots\phi(g_1)\,\mathcal{E}_{\mathrm{SP}}(\rho_0)\bigr]$
can be rewritten, using the invariance of the uniform sum over G under changes of variables, as
$p(i, g_{\mathrm{end}}, m) = \mathrm{Tr}\bigl[\Pi_i\,\mathcal{E}_{\mathrm M}\,\bigl(\phi * (\nu_m\phi) * \cdots * (\nu_1\phi)\bigr)(g_{\mathrm{end}})\,\mathcal{E}_{\mathrm{SP}}(\rho_0)\bigr],$
where we have used the definition of convolution of implementation maps given in eq. (26) and where $(\nu_i\phi)(g) = \nu_i(g)\phi(g)$. We will see that often the convolution product map $\phi * (\nu_m\phi) * \cdots * (\nu_1\phi)$ can be written exactly as an m-fold convolution $\tilde\phi^{*m}$ (for some $\tilde\phi$ that is not necessarily the same as φ).
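The convolution structure can be made concrete in a toy scalar model. The following sketch (cyclic group Z_3 with a one-dimensional representation, an illustrative choice standing in for the implementation maps of the text) checks the key fact that a representation is idempotent under the convolution of eq. (26), which is why its Fourier operator is a projector:

```python
import numpy as np

# Minimal sketch of the convolution of implementation maps,
# (phi1 * phi2)(g) = (1/|G|) sum_h phi1(g h^{-1}) phi2(h),
# for the cyclic group Z_3 with the 1-dimensional representation
# omega(g) = exp(2*pi*i*g/3) (an illustrative choice).
n = 3
omega = {g: np.exp(2j * np.pi * g / n) for g in range(n)}

def convolve(phi1, phi2):
    return {g: sum(phi1[(g - h) % n] * phi2[h] for h in range(n)) / n
            for g in range(n)}

# A representation is idempotent under convolution (omega * omega = omega),
# since omega(g h^{-1}) omega(h) = omega(g) for every h:
ww = convolve(omega, omega)
assert all(abs(ww[g] - omega[g]) < 1e-12 for g in range(n))
```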
We will begin in section VI A with discussing the case of uniform RB (as per the RB typology in section V D). This is the easiest case, but the results derived there will go a long way in analyzing the other two types (non-uniform and interleaved RB).

A. Uniform randomized benchmarking
Here we discuss the behaviour of RB output data given by a uniform RB scheme (as defined in section V D). We will prove that this data behaves as expected (i.e., as a controlled linear combination of exponential decays), as long as the implementation map φ is close enough to its reference implementation $\phi_{\mathrm r}$. As we saw in section V B, for uniform RB protocols this reference implementation is exactly a representation, which we denote by ω. We can always decompose ω into a direct sum of irreducible representations. We write this as $\omega = \bigoplus_{\lambda\in\Lambda}\sigma_\lambda^{\oplus n_\lambda}$, with Λ some index set and $\sigma_\lambda$ irreducible sub-representations appearing with multiplicity $n_\lambda$. As discussed in section V, we expect the RB output data to be approximately well described by a linear combination of the form
$p(i, m) \approx \sum_{\lambda\in\Lambda}\mathrm{Tr}\bigl[A_\lambda M_\lambda^m\bigr],$
where $M_\lambda$ is an $n_\lambda \times n_\lambda$ matrix depending only on the actual implementation φ. In particular, $M_\lambda$ is given by the projection of the Fourier mode $\mathcal F(\phi)[\sigma_\lambda]$ onto the subspace associated with its $n_\lambda$ largest (in absolute value) eigenvalues. This is the content of theorem 8. The essential idea in theorem 8 is the fact that convolutions correspond to matrix multiplication in Fourier space, together with a careful use of the subspace perturbation techniques discussed in section III.
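The mechanism "convolution becomes multiplication in Fourier space" can be illustrated in a scalar toy model before stating the theorem. In the sketch below (group Z_3, one-dimensional representation, damping rate f; all choices illustrative, with f standing in for the effect of noise), the m-fold convolution of a damped representation decays as a single exponential f^m:

```python
import numpy as np

# Toy model of why uniform RB data decays exponentially: repeated
# convolution multiplies Fourier modes. Implementation on Z_3 is the 1-d
# representation omega(g) = exp(2*pi*i*g/3) damped by an illustrative f.
n, f = 3, 0.9
phi = {g: f * np.exp(2j * np.pi * g / n) for g in range(n)}

def convolve(p1, p2):
    return {g: sum(p1[(g - h) % n] * p2[h] for h in range(n)) / n
            for g in range(n)}

seq = phi
for m in range(2, 11):           # build phi^{*m} for m = 2, ..., 10
    seq = convolve(seq, phi)
# The m-fold convolution at the identity is exactly f^m: a single
# exponential decay governed by the Fourier mode of phi.
assert abs(seq[0] - f**10) < 1e-12
```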
Theorem 8 (Output data of uniform randomized benchmarking). Let p(i, m) be the outcome probability associated with a uniform RB experiment with group G, initial state $\rho_0$, reference representation $\omega = \bigoplus_{\lambda\in\Lambda}\sigma_\lambda^{\oplus n_\lambda}$, and ending gate $g_{\mathrm{end}}$, for a specific sequence length $m \in \mathbb M$ and POVM element $\Pi_i$ in the POVM $\{\Pi_i\}_i$ (as described in protocol alg. 1). Let φ be the implementation map describing the actually implemented operations. Moreover, assume that there exists a δ > 0 such that
$\frac{1}{|G|}\sum_{g\in G}\bigl\|\phi(g) - \omega(g)\bigr\|_\diamond \le \delta \le \frac{1}{9}.$
Then the RB output probability p(i, m) is well approximated as
$p(i, m) = \sum_{\lambda\in\Lambda}\mathrm{Tr}\bigl[A_\lambda M_\lambda^m\bigr] + \epsilon(m),$
with a correction $\epsilon(m)$ that is exponentially small in m, where $M_\lambda$, $A_\lambda$ are $n_\lambda \times n_\lambda$ real matrices and $M_\lambda$ only depends on the implementation φ.
Proof. Note from eq. (69), with $\nu_i$ the uniform probability distribution for all $i \in \{1, \dots, m\}$, that
$p(i, g_{\mathrm{end}}, m) = \mathrm{Tr}\bigl[\Pi_i\,\mathcal{E}_{\mathrm M}\,\phi^{*(m+1)}(g_{\mathrm{end}})\,\mathcal{E}_{\mathrm{SP}}(\rho_0)\bigr].$
Inserting the Fourier transform of φ, we can express this probability in terms of powers of the Fourier operator, where $\omega_G(g) = \bigoplus_{\lambda\in\mathrm{Irr}(G)}\sigma_\lambda(g)$ is the direct sum of all irreducible representations of G and $D_G = \bigoplus_{\lambda\in\mathrm{Irr}(G)} d_\lambda \mathbb 1_\lambda$ accounts for the dimensional factor in the inverse Fourier transform. Now we can consider the Fourier operator $\mathcal F(\phi)$ (as defined in eq. (31)) associated with φ as a perturbation of its ideal version $\mathcal F(\omega)$. From our discussion of Fourier transforms and Fourier operators we know that $\mathcal F(\omega) = \frac{1}{|G|}\sum_{g\in G}\bigoplus_{\lambda\in\Lambda}\sigma_\lambda(g)\otimes\omega(g)$ is an orthogonal projection, with rank given by the number of irreducible sub-representations of ω ($\mathrm{Rk}(\mathcal F(\omega)) = \sum_{\lambda\in\Lambda} n_\lambda$). Recall also that there is a natural matrix norm $\|\cdot\|_{\mathrm m}$ on the space of Fourier operators and that $\|\mathcal F(\phi - \omega)\|_{\mathrm m} \le \frac{1}{|G|}\sum_{g\in G}\|\phi(g)-\omega(g)\|_\diamond$. The plan is now to use the perturbation theorem (theorem 6) to split the above into dominant and sub-dominant invariant subspaces. To do this, note that $\mathcal F(\omega)$ is a projector, so we trivially get a spectral resolution with $X_1 = \mathcal F(\omega)$, $X_2 = \mathbb 1 - \mathcal F(\omega)$, with $\mathcal F(\omega)$ acting as the identity on the column and row space of $X_1$ and as the zero operator on the column and row space of $X_2$. Thinking of $\mathcal F(\phi - \omega)$ as a perturbation of $\mathcal F(\omega)$, we need to ensure that the conditions in eq. (42) are satisfied with respect to the norm $\|\cdot\|_{\mathrm m}$. Using the sub-multiplicativity of this norm, the fact that $\|X_1\|_{\mathrm m} = 1$ by construction, and the triangle inequality, we obtain a sufficient condition on $\|\mathcal F(\phi - \omega)\|_{\mathrm m}$ for the applicability of theorem 6, where we also use that $\mathrm{sep}(1, 0) = 1$, which is easy to see from the definition of sep (see section IV C). Working this out, we see that the condition is satisfied if eq. (72) holds, which it does by assumption.
Hence we can use theorem 6 to conclude the existence of operators resolving $\mathcal F(\phi)$ into a dominant and a sub-dominant invariant subspace. Using the fact that $L^\dagger = R^{-1}$ (and thus that $L_2^\dagger R_1 = L_1^\dagger R_2 = 0$), we can now write $p(m, g_{\mathrm{end}}, \Pi)$ as a sum of two terms corresponding to the above spectral resolution. We will consider both of these terms separately, dealing first with the second term. Note that, using the definitions of R, L from theorem 6, the second term reduces to a statement about the max-norm of a Fourier operator. Note that $X_2^\dagger \mathcal F(\omega) X_2 = 0$ by construction, so this term only depends on $\mathcal F(\phi - \omega)$. Now, using the max-mean norm inequality in eq. (35) several times and the fact that $X_2 = \mathbb 1 - X_1$, we can upper bound this term. Using, from theorem 6, the upper bounds on $\|P_1\|_{\mathrm m}$ and $\|P_2\|_{\mathrm m}$ (where we exploit the assumption δ ≤ 1/9), and inserting these bounds into the main expression together with $\|\mathcal F(\phi - \omega)\|_{\max} \le 2$ and δ ≤ 1/9, we conclude that the second term is exponentially small in m. Next we consider the first term in eq. (80), for which we desire an exact expression. We begin by noting that both $\mathcal F(\omega)$ and $\mathcal F(\phi)$ are block diagonal with respect to the decomposition of $\omega_G$ into irreducible representations. This implies that the matrices R, L are block diagonal w.r.t. this decomposition as well, and that, moreover, we can take the matrices $P_1$, $P_2$ to be block diagonal, with the blocks labeled by the irreducible sub-representations present in $\omega = \bigoplus_{\lambda\in\Lambda}\sigma_\lambda^{\oplus n_\lambda}$. Writing $P = \bigoplus_{\lambda\in\Lambda}P^\lambda$, and similarly for other operators, we can write the first term of eq. (80) block-wise, where we have also used that $\mathcal F(\omega) X_2 = 0$ by construction. To continue further we need to pick a convenient basis in which to express $X_1^\lambda$, $R_1^\lambda$. For this, note that we can specify rank-one Fourier operators in $\mathcal L(V_{\omega_G}) \otimes \mathcal S_d$ by specifying pairs of super-operators $\mathcal A$, $\mathcal B$ and considering Fourier operators of the form $\mathcal F(\mathcal A\omega\mathcal B)$ (it is useful to think of the Fourier operator $\mathcal F(\omega)$ as a vectorization operation on $\mathcal A$, $\mathcal B$).
We can express $X_1 = \mathcal F(\omega)$ in this way by considering the operators $\mathcal F(P^{j_\lambda}\omega P^{j_\lambda})$, where $P^{j_\lambda}$ is the (super-operator) projector onto the $j_\lambda$-th copy of the representation $\sigma_\lambda$ in $\omega = \bigoplus_{\lambda\in\Lambda}\sigma_\lambda^{\oplus n_\lambda}$. Note that these operators are rank-one orthogonal projectors summing to $X_1$. Now, noting that $X_1^\lambda + X_2^\lambda P_1^\lambda = R_1^\lambda$ is a rank-$n_\lambda$ matrix with $X_1^\lambda R_1^\lambda = X_1^\lambda$, we can similarly find $n_\lambda$ super-operators $R^{j_\lambda}$ ($j_\lambda \in \{1, \dots, n_\lambda\}$) resolving $R_1^\lambda$ into operators $\mathcal F(P^{j_\lambda}\omega R^{j_\lambda})$, each again of rank one (but no longer orthogonal). Note that $X_1^\lambda R_1^\lambda = X_1^\lambda$ gives rise to an orthogonality property between these rank-one operators. Using these resolutions of $X_1^\lambda$, $R_1^\lambda$ and the orthogonality property, we can express the first term in eq. (80) as a trace of products of $n_\lambda \times n_\lambda$ matrices, by the fact that $\mathcal F(P^{j_\lambda}\omega P^{j_\lambda})$, $\mathcal F(R^{j_\lambda}\omega P^{j_\lambda})$ are of rank one. Collecting these matrix entries into $M_\lambda$ and $A_\lambda$, we can combine the two terms in eq. (80) to obtain the statement of the theorem.

B. Randomized benchmarking with non-uniform sampling
Several works [5,14,15,39,51] discuss adaptations of RB where the elements of the group are no longer sampled exactly uniformly at random, but are instead sampled according to (1) a distribution close to uniform [5,14,51] (which we call 'approximate RB' in section V B, following ref. [14]), or (2) a distribution that only has support on a small subset of the group; group generators in the case of ref. [14] (see also early work on the Clifford group by [52]), subgroup cosets in the case of ref. [39], and constant-depth circuits (layers) in the case of ref. [15]. In section V B, we called these approaches 'subset RB'.
We begin by treating the case of approximate RB. This corresponds to performing RB as described in protocol alg. 1, but instead of sampling group elements from the group G uniformly at random, one samples group elements according to some prescribed probability distributions $\nu_i : G \to [0, 1]$ (with i indicating the time at which the gate is applied). In ref. [14] it has been argued that as long as the distributions $\nu_i$ are all close to the uniform distribution in the $\ell_1$-norm, then the output data of approximate RB is close to the output data of exact RB.
As a corollary of theorem 8 we obtain a similar result. Our result is somewhat less general than the one given in theorem 17 of ref. [14]. In particular, we will assume that all distributions $\nu_i$ are equal to a fixed distribution ν. In return for this restriction we will be able to make a much stronger statement on the behaviour of the RB output data. Moreover, our approach does not require the gate-independent noise assumption (replacing it with the more general diamond norm assumption of eq. (72)). We have the following statement.
Theorem 9 (Randomized benchmarking data with non-uniform sampling). Let ν be a probability distribution on G and let $p_\nu(i, m)$ be the outcome probability associated with a non-uniform RB experiment with implementation map φ and reference representation $\omega = \bigoplus_{\lambda\in\Lambda}\sigma_\lambda^{\oplus n_\lambda}$. Moreover, assume that there exist δ, δ′ > 0 such that
$\frac{1}{|G|}\sum_{g\in G}\bigl\|\phi(g) - \omega(g)\bigr\|_\diamond \le \delta, \qquad \sum_{g\in G}\Bigl|\nu(g) - \frac{1}{|G|}\Bigr| \le \delta',$
with δ + δ′ ≤ 1/9. Then $p_\nu(i, m)$ is well approximated as
$p_\nu(i, m) = \sum_{\lambda\in\Lambda}\mathrm{Tr}\bigl[A_\lambda (M_\lambda^\nu)^m\bigr] + \epsilon(m),$
with $\epsilon(m)$ exponentially small in m, where $M_\lambda^\nu$, $A_\lambda$ are $n_\lambda \times n_\lambda$ real matrices and $M_\lambda^\nu$ depends on the implementation φ and the measure ν.
Proof. Consider the map $\phi_\nu : G \to \mathcal S_d : g \mapsto |G|\nu(g)\phi(g)$. Note that we can think of non-uniform RB as uniform RB with this (not trace-preserving but still completely positive) implementation map. In particular, $p_\nu(i, g_{\mathrm{end}}, m)$ takes the form (74), but with the 'effective implementation' $\phi_\nu$. From the assumptions of the theorem we have
$\frac{1}{|G|}\sum_{g\in G}\bigl\|\phi_\nu(g) - \omega(g)\bigr\|_\diamond \le \delta + \delta'.$
Hence, the proof of theorem 8 immediately applies to $p_\nu(i, g_{\mathrm{end}}, m)$, yielding (105).
We note that in the case of NIST RB [51] the probability distribution over (a subgroup of) the single-qubit Clifford group is, strictly speaking, not close enough to uniform to apply the above theorem. This can be easily solved by blocking a few gate applications together, defining a new effective implementation map $\phi' = (\nu\phi) * (\nu\phi) * \cdots * (\nu\phi)$, which is close enough to uniform to apply theorem 9.
The above approach fails utterly when applied to subset RB. In this scenario the distribution ν only has support on a small subset A of G and consequently $\sum_{g\in G}\bigl|\nu(g) - \frac{1}{|G|}\bigr| \approx 1$ in many cases. This is not necessarily a weakness of theorem 8 but rather a statement of the fact that strong deviations from exponential behaviour can be observed if one does not give the distribution ν time to converge to the uniform distribution through repeated convolution. This was already noted more or less explicitly in previous papers on subset RB. There are two approaches to solving this problem. The first, followed in refs. [14,15,39,52], is to restrict the set of sequence lengths M at which RB data is gathered to $m \ge m_{\mathrm{mix}}$, where $m_{\mathrm{mix}}$ is related to the mixing time of the distribution ν. Note that in the direct RB proposal [15], this convergence time is instead enforced directly by applying a uniformly random gate before applying non-uniformly sampled gates. The second approach is to take this deviation from uniform RB behaviour at face value [13] and draw conclusions from the RB output directly. We believe this latter approach is more accurately classified as an interleaved benchmarking scheme and we will discuss it there.
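The mixing-time phenomenon can be simulated directly. In the toy sketch below, a distribution supported only on a small "generator" set of Z_16 converges to uniform under repeated convolution; the group, support, and threshold are all illustrative choices, with the loop counter playing the role of m_mix:

```python
import numpy as np

# Toy illustration of the mixing-time requirement m >= m_mix in subset RB:
# a distribution nu supported only on {0, +1, -1} in Z_16 converges to
# uniform under repeated convolution (a lazy random walk on the group).
n = 16
nu = np.zeros(n)
nu[0], nu[1], nu[-1] = 0.5, 0.25, 0.25    # support on a small subset only

def convolve(p, q):                        # distribution convolution on Z_n
    return np.array([sum(p[(g - h) % n] * q[h] for h in range(n))
                     for g in range(n)])

def l1_to_uniform(p):
    return np.abs(p - 1.0 / n).sum()

p, m_mix = nu.copy(), 1
while l1_to_uniform(p) > 0.05 and m_mix < 1000:
    p, m_mix = convolve(p, nu), m_mix + 1
# Below m_mix the l1 distance to uniform is large and strong deviations
# from the exponential model can occur.
assert l1_to_uniform(p) <= 0.05 and m_mix < 1000
```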
With regards to the first approach we can make a statement akin to theorem 8 about subset RB procedures by making the (natural) assumption that upon equilibration of the distribution ν the quality of the total gates has not degraded too much. Intuitively, this means that the gates that have high weight in the initial distribution are of high enough quality to generate (by composition) good quality implementations of all gates in the group. Concretely, we have the following theorem.
Theorem 10 (Subset randomized benchmarking). Let ν be a probability distribution on G and let $p_\nu(i, m)$ be the outcome probability associated with a non-uniform RB experiment with implementation map φ and reference representation $\omega = \bigoplus_{\lambda\in\Lambda}\sigma_\lambda^{\oplus n_\lambda}$. Moreover, assume that there exist an integer $m_{\mathrm{mix}}$ and real numbers δ, δ′ > 0 such that
$\frac{1}{|G|}\sum_{g\in G}\bigl\|\phi(g) - \omega(g)\bigr\|_\diamond \le \delta, \qquad \sum_{g\in G}\Bigl|\nu^{*m_{\mathrm{mix}}}(g) - \frac{1}{|G|}\Bigr| \le \delta',$
with δ + δ′ ≤ 1/9. Then for $m \ge m_{\mathrm{mix}}$, $p_\nu(i, m)$ is well approximated as
$p_\nu(i, m) \approx \sum_{\lambda\in\Lambda}\mathrm{Tr}\bigl[A_\lambda M_\lambda^{m - m_{\mathrm{mix}}}\bigr],$
with $M_\lambda$ the projection onto the $n_\lambda$-dimensional dominant invariant subspace of $\mathcal F(\nu\phi)[\sigma_\lambda]$, and with an approximation error controlled by $\delta'' = \delta + \delta'$.
Note that this theorem is qualitatively less strong than theorem 8. In particular, we cannot guarantee that the distance between the output data of subset RB and the exponential decays associated with the irreducible sub-representations of the reference representation closes exponentially fast with increasing sequence length. However, our bound on this distance is stronger than previous rigorous statements (theorem 20 in ref. [14]) and holds under weaker assumptions. The distance bound given in ref. [39] (theorem 3) does close exponentially, but the proof relies critically on the fact that ν is uniformly non-zero on a (large) subgroup coset in G, and thus only applies to a far more restricted situation. Note also that the theorem does not directly apply to the approach taken in [15]. However, we believe that with very minor alterations the reasoning below can be made to fit.
Proof. Consider again the map $\phi_\nu : G \to \mathcal S_d : g \mapsto |G|\nu(g)\phi(g)$, so that $p_\nu(i, g_{\mathrm{end}}, m)$ takes the form (74) with effective implementation $\phi_\nu$. We now establish a bound on the quality of $\phi_\nu^{*m_{\mathrm{mix}}}$, namely we show that
$\frac{1}{|G|}\sum_{g\in G}\bigl\|\phi_\nu^{*m_{\mathrm{mix}}}(g) - \omega(g)\bigr\|_\diamond \le \delta + \delta'.$
This can be seen as follows, with $\omega_\nu(g) = |G|\nu(g)\omega(g)$: we split $\phi_\nu^{*m_{\mathrm{mix}}} - \omega$ into $(\phi_\nu^{*m_{\mathrm{mix}}} - \omega_\nu^{*m_{\mathrm{mix}}}) + (\omega_\nu^{*m_{\mathrm{mix}}} - \omega)$. Writing out the convolution in the second term and changing variables shows that it is controlled by the convergence of $\nu^{*m_{\mathrm{mix}}}$ to the uniform distribution, while for the first term we use the telescoping series identity $A^m - B^m = \sum_{j=1}^m A^{m-j}(A - B)B^{j-1}$, which holds for any elements A, B of an associative algebra (such as the implementation maps with convolution), the sub-multiplicativity of the diamond norm, and the fact that $\|\phi(g)\|_\diamond = \|\omega(g)\|_\diamond = 1$ for all g ∈ G. Together with the theorem assumptions, this yields (113). Now, as in theorem 8, we can write the RB output data in terms of $\mathcal F(\phi_\nu^{*m_{\mathrm{mix}}})\,\mathcal F(\phi_\nu)^{m'}$, where $m' = m - m_{\mathrm{mix}}$. We can again consider $\mathcal F(\phi_\nu^{*m_{\mathrm{mix}}})$ as a perturbation of $\mathcal F(\omega)$. Since $\mathcal F(\omega)$ is a projector, the operator $\mathcal F(\phi_\nu^{*m_{\mathrm{mix}}})$ will resolve into a dominant and a sub-dominant invariant subspace (as in theorem 8). Now note that $\mathcal F(\phi_\nu^{*m_{\mathrm{mix}}})$ and $\mathcal F(\phi_\nu)$ commute, and hence share invariant subspaces. This allows us to evaluate the first (dominant) term in eq. (121) exactly. Finally, we can bound the second term in eq. (121), using the max-mean inequality of the norms on Fourier operators. Note now that $\|\phi_\nu(g)\|_\diamond \le |G|\nu(g)$, where we have used that ν is a probability distribution and that $\|\phi(g)\|_\diamond \le 1$; moreover, $\|\mathcal F(\phi_\nu)\|_{\max} \le 1$. Using this and the reasoning from theorem 8, we can thus bound the second term in terms of $\delta'' = \delta + \delta'$. Inserting the assumption that $\|\mathcal F(\phi_\nu^{*m_{\mathrm{mix}}} - \omega)\|_{\mathrm m} \le \delta''$, we obtain the statement of the theorem.

C. Interleaved randomized benchmarking
As discussed in section V D, a common variant of RB is interleaved randomized benchmarking (IRB). IRB is performed like uniform RB, as formulated in alg. 1, but the reference implementation is not a representation. Instead, a fixed operation C is interleaved between the applications of randomly selected group elements. The outcome of this experiment is then compared to the same RB experiment without the interleaving gate to infer the quality of the interleaved gate C. The literature splits into two strands: standard interleaved RB [4,48] and non-standard interleaved RB [9,47]. We emphasize that here we discuss only the so-called 'interleaved' step of the interleaved RB protocol, and do not interpret the resulting decay rate (for a thorough discussion of the relationship between interleaved RB decay rates and their interpretation see [72]).

Standard interleaved randomized benchmarking
In the standard protocol the interleaved operation C is applied after every randomly selected gate and is also an element of the group G. Hence, at the end of a random sequence, the inversion step can be performed inside the group. The IRB output data is thus of the form
$p_C(i, g_{\mathrm{end}}, m) = \frac{1}{|G|^m}\sum_{g_1,\dots,g_m\in G}\mathrm{Tr}\bigl[\Pi_i\,\mathcal{E}_{\mathrm M}\,\phi(g_{\mathrm{end}}\,g_{\mathrm{inv}})\,\phi(C)\phi(g_m)\cdots\phi(C)\phi(g_1)\,\mathcal{E}_{\mathrm{SP}}(\rho_0)\bigr]$
for a POVM element $\Pi_i$, an ending gate $g_{\mathrm{end}}$, a sequence length m, an implementation map φ and an initial state $\rho_0$, where $g_{\mathrm{inv}}$ now inverts the full interleaved sequence. It is interesting to interpret this procedure in the light of the protocol given in section V C. Namely, we can define a probability distribution $\nu_C$ over G that takes the value 1 for g = C and 0 for all other group elements. With this probability distribution, we can reconsider the above as an RB experiment according to the protocol written in alg. 1, with sampling distributions $\nu_i = \mu$ for even i and $\nu_i = \nu_C$ for odd i, where μ is the uniform distribution on G. Hence, we can think of standard IRB as an RB experiment with a particular choice of sampling distributions. In this picture, it becomes trivial to extend theorem 8 to standard interleaved RB by considering the map $\phi_C = (\nu_C\phi) * \phi$. By the standard change of variables we see that the interleaved sequence average equals the m-fold convolution of $\phi_C$, and hence interleaved RB is just uniform RB with the implementation map $\phi_C$. If φ(C) is close enough to its reference representation element ω(C), the assumption eq. (72) is reasonable for $\phi_C$ as well. Hence, theorem 8 holds equally well for interleaved RB.
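The reduction of interleaving to an effective implementation map can be checked end-to-end in a toy model. The sketch below (abelian group Z_3 with one-dimensional damped representations; all gates and rates are illustrative stand-ins) averages over every interleaved sequence and recovers a single decay at the effective rate of the interleaved pair:

```python
import numpy as np

# Toy check that interleaving a fixed noisy gate C yields uniform-RB-type
# data with an effective decay rate. Group Z_3, 1-d representation, with
# scalar dampings f (random gates) and f_C (interleaved gate).
n, f, f_C, c, m = 3, 0.95, 0.9, 1, 8      # c: the interleaving gate in Z_3
omega = lambda g: np.exp(2j * np.pi * g / n)
phi = lambda g: f * omega(g)              # noisy implementation of g
phi_int = f_C * omega(c)                  # noisy implementation of C

total = 0
for seq in np.ndindex(*(n,) * m):         # average over all length-m sequences
    g_inv = (-(sum(seq) + m * c)) % n     # inverts the whole interleaved word
    word = phi(g_inv)
    for g in seq:
        word *= phi_int * phi(g)
    total += word / n**m
# The signal decays as (f * f_C)^m, the effective rate of the pair
# "random gate followed by C" (the inversion gate adds one factor f):
assert abs(total - f * (f * f_C) ** m) < 1e-9
```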
Non-standard interleaved RB protocols [9,13,47,50] depart from the above framework by including interleaved gates that are not part of the group G (the Pauli group in the case of ref. [13] and the Clifford group in the case of ref. [47]) and by sampling from the group in a non-uniform manner. These protocols are somewhat idiosyncratic, so we will treat them separately.
We will see that the protocols of refs. [9,47] are covered by theorem 8, while the protocols of ref. [13] and ref. [50] are not covered. We expect that it is possible to make guarantees on the output data of these protocols with suitable adaptations to theorem 8 but we do not pursue this here.

Interleaved T-gate randomized benchmarking
In ref. [47] the quality of a T-gate (with ideal implementation $\mathcal T$ and associated noisy implementation $\tilde{\mathcal T}$) is assessed by estimating a quantity $p_T(m)$, with $\mathrm C_q$ the q-qubit Clifford group, $\mathrm P_q \subset \mathrm C_q$ the Pauli group, $\phi : \mathrm C_q \to \mathcal S_d$ an implementation of the Clifford group (and the Pauli group), and $t : \mathrm P_q \to \mathrm C_q$ the injective map sending Pauli elements p to $T p T^\dagger$. Because T is in the third level of the Clifford hierarchy, we have $T p T^\dagger \in \mathrm C_q$ for all $p \in \mathrm P_q$, making this well defined. By defining a map $\phi_T$ and a probability distribution $\nu_T$ on $\mathrm C_q$ taking non-zero values only on the image of the map t (strictly speaking $t^{-1}(g)$ is not defined for $g \notin \mathrm{Im}(t)$, but $\nu_T$ is zero there anyway), we can rewrite the output probability $p_T(m)$ as that of a uniform RB experiment with the convoluted map $\phi * \phi_T$. Hence, theorem 8 generalizes to $p_T(m)$ as long as (72) is satisfied for the convoluted map $\phi * \phi_T$. In the ideal case of φ = ω (the reference representation) and $\tilde{\mathcal T} = \mathcal T$, we see that $(\phi * \phi_T)(g) = \omega(g)$. Hence this is a reasonable assumption to make, and theorem 8 thus covers the protocol presented in ref. [47].

Individual gate benchmarking
Individual RB, as proposed in ref. [9], is an interleaved RB protocol characterized by uniform probability distributions and, interestingly, a reference implementation $\phi_{\mathrm r}$ that is not a representation. Rather, the reference implementation is of the form $\phi_{\mathrm r}(g) = \mathcal U\,\omega(g)$, where ω is the standard action by conjugation, i.e. $\omega(g)(\rho) = U_g \rho U_g^\dagger$, and $\mathcal U(\rho) = U\rho U^\dagger$ is a fixed unitary gate (that is not a part of the group G). Moreover, $\mathcal U$ is assumed to commute with the representation ω(g). The output RB data p(i, m) associated with this procedure is of the form (74); however, the central assumption (eq. (72)) of theorem 8 is generally far from satisfied (unless $\mathcal U$ is the identity). However, we can make the alternative assumption that
$\frac{1}{|G|}\sum_{g\in G}\bigl\|\tilde{\mathcal U}\phi(g) - \mathcal U\omega(g)\bigr\|_\diamond \le \delta,$
where $\tilde{\mathcal U}$ is the noisy implementation of the unitary $\mathcal U$ and φ is the implementation of the reference representation ω. This is a reasonable assumption to make since
$\bigl\|\tilde{\mathcal U}\phi(g) - \mathcal U\omega(g)\bigr\|_\diamond \le \bigl\|\tilde{\mathcal U} - \mathcal U\bigr\|_\diamond + \bigl\|\phi(g) - \omega(g)\bigr\|_\diamond,$
so as long as the implementation of the interleaving unitary $\mathcal U$ is of sufficient quality, eq. (133) is reasonable. Furthermore, we note that due to the commutation assumption $[\omega(g), \mathcal U] = 0$, the Fourier operator $\mathcal F(\mathcal U\omega)$ has the same dominant invariant subspace as $\mathcal F(\omega)$ (since $\mathcal F(\mathcal U\omega) = (\mathbb 1_G \otimes \mathcal U)\mathcal F(\omega) = \mathcal F(\omega)(\mathbb 1_G \otimes \mathcal U)$). Hence the proof of theorem 8 goes through for individual gate benchmarking as well, replacing the assumption eq. (72) with eq. (133).

Cycle benchmarking
Cycle benchmarking [13] is a recently developed RB protocol that can also be subsumed under the framework of theorem 8, albeit after some non-trivial considerations that we will discuss in this section.
The data collection phase of cycle benchmarking can be seen as interleaved RB over the Pauli group, with the interleaving gate C being a (non-Pauli) Clifford gate. In particular, cycle benchmarking implements sequences $C, g_m, \dots, C, g_1$, where each $g_i$ is drawn uniformly at random from the Pauli group $\mathrm P_q$ and C is a Clifford gate.
A key aspect of cycle benchmarking is the cycle length, i.e. an integer c such that $C^c = e$ (note that for any Clifford gate such a cycle length exists). In cycle benchmarking the number of random Pauli elements implemented is always a multiple of the cycle length. Writing φ(g) for the noisy implementation of the standard conjugation representation of the Pauli group, and $\tilde{\mathcal C}$ for the noisy implementation of the Clifford gate C, we can define the cycle implementation map (on the Pauli group)
$\phi_c(g) = \frac{1}{|\mathrm P_q|^{c-1}} \sum_{\substack{g_1,\dots,g_c \in \mathrm P_q \\ C g_c \cdots C g_1 = g}} \tilde{\mathcal C}\phi(g_c) \cdots \tilde{\mathcal C}\phi(g_1). \quad (136)$
Note that because the Clifford group contains the Pauli group, the equation $C g_c \cdots C g_1 = g$ makes sense. Moreover, because of the cycle property $C^c = e$, and since $C^{-1} g C$ is always a Pauli element, the product $C g_c \cdots C g_1$ is itself always a Pauli element; hence, for each fixed $g \in \mathrm P_q$, the equation $C g_c \cdots C g_1 = g$ has exactly $|\mathrm P_q|^{c-1}$ solutions. Furthermore, averaging uniformly over all mc Pauli elements (with the inversion step fixing the total product) reproduces the m-fold convolution $\phi_c^{*m}$, which means cycle benchmarking can be framed as RB with the implementation map $\phi_c$. Moreover, since in the limit of perfect gates we have, if $C g_c \cdots C g_1 = g$, that $\mathcal C\omega(g_c) \cdots \mathcal C\omega(g_1) = \omega(g)$, we can reasonably make the assumption that $\phi_c$ is close to its reference implementation (i.e. eq. (72)). Hence the behaviour of cycle benchmarking data is covered by theorem 8. What is less clear is how to interpret the resulting exponential decays (especially in terms of the implementations φ and $\tilde{\mathcal C}$). This requires a more sophisticated analysis, which is done in [13].
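The cycle length can be found numerically for any concrete Clifford gate. A minimal sketch, using the single-qubit phase gate S as an arbitrary example and checking equality up to global phase (as appropriate for gates):

```python
import numpy as np

# Find the cycle length c with C^c = e for a gate C, checking equality of
# matrices up to global phase. The phase gate S is an arbitrary example.
S = np.array([[1, 0], [0, 1j]], dtype=complex)

def cycle_length(C, max_c=100):
    P = C.copy()
    for c in range(1, max_c + 1):
        # P is proportional to the identity iff C^c = e as a gate
        phase = P[0, 0]
        if abs(phase) > 1e-12 and np.allclose(P, phase * np.eye(len(C))):
            return c
        P = P @ C
    raise ValueError("no cycle found up to max_c")

assert cycle_length(S) == 4        # S^4 = I
assert cycle_length(S @ S) == 2    # (S^2)^2 = Z^2 = I
```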

Robust benchmarking tomography
In robust benchmarking tomography [50] one uses an RB protocol as a subroutine to extract tomographic information about a super-operator $\mathcal E$ (not necessarily a unitary). This is done by estimating a probability in which a fixed group element $g' \in G$ and the super-operator $\mathcal E$ are interleaved with uniformly random gates, where φ is the implementation of a reference representation ω (the goal is to estimate correlations between $\omega(g')$ and $\mathcal E$). We can consider this as an interleaved RB scheme with reference implementation $\phi_{\mathrm{tom}}(g) = \omega(g' g)$ (thinking of $\mathcal E$ as a noisy implementation of the identity gate). However, this reference implementation is not close to a representation (unless $g' = e$), which means that theorem 8 does not apply. This is not an artifact of the proof technique but rather a reflection of the fact that robust benchmarking tomography features extremely rapid exponential decays. In the gate-independent noise case the decay rate is set by the average fidelity $F(\omega(g'), \mathcal E)$, which can be very small. In the language of matrix Fourier theory this means that the dominant eigenvalues of the Fourier operator $\mathcal F(\phi_{\mathrm{tom}})$ will be small even in the ideal case. Hence, we do not expect an assumption of the form (72) to be strong enough to guarantee exponential behaviour of the RB output data in this scenario.

VII. DATA PROCESSING AND SAMPLE COMPLEXITY
As discussed before, the randomized benchmarking (RB) protocol can be divided into a data collection and a post-processing phase. The data collection protocol is summarized in algorithm 1. The outputs of the data collection phase are mean estimators $\hat p(i, m, g_{\mathrm{end}})$ that simultaneously estimate the average over all sequences of length m (according to the measures $\nu_i$) and over the quantum measurement statistics. The main theorems of the data collection phase (theorems 8-10) state that this expectation value, again over both the measurement statistics and the random sequences, is well approximated by a linear combination of (matrix) exponentials in m.
The figures of merit that RB experiments report are the decay parameters associated with the linear combination of (matrix) exponentials. Extracting these decay parameters is the objective of the data-processing phase that is the focus of the current section. For gate-independent noise and reference representations without multiplicities the decay parameters can be directly connected to the average gate fidelity of the noise. In the more general case, the interpretation of the decay parameters in terms of other operational measures of quality can be more complicated. We will consider the connection between the decay parameters and the average gate-set fidelity in section IX.
Here we want to take a more pragmatic approach for the post-processing phase. The deviation of the decay parameters from unity can directly be regarded as a measure of quality that captures the deviation of the actually implemented gates from an ideal implementation. In principle, the set of decay parameters itself provides a refined image of the quality of the implementation, as compared to the average gate fidelity. This motivates us to limit the post-processing phase to the extraction of the decay parameters. The estimation of other measures of quality from the decay parameters is then left to an optional subsequent processing phase.
In the simplest RB setting (e.g., uniform RB with the Clifford group), featuring a single noise-affected representation, the data processing phase only involves fitting a single exponential decay curve. The analysis of RB data arising in more general settings, however, requires a considerably more flexible approach for the data processing.
Extracting multiple decay coefficients, or poles, from a discrete series of data points is a well-studied problem in signal processing that arises in many different disciplines. For this reason, this section includes a review of modern approaches to this fitting problem that not only have been generalized to the fitting of matrix exponentials but also come with theoretical performance guarantees and bounds. The pole-finding algorithms we review (MUSIC and ESPRIT) come with multiple merits: (1) they are easily and efficiently implementable, (2) they are flexible enough to in principle analyze any RB signal of the general form (63), (3) they come with in-built de-noising and super-resolution capabilities, (4) they feature theoretical bounds that can (4.a) inform the design of experimental parameters and (4.b), very importantly, be used to identify parameter regimes where distinguishing the different decay parameters becomes infeasible in practice.
Following this review we combine analytical guarantees and numerical simulations to evaluate the performance of these algorithmic approaches for the processing of RB data. In particular, we discuss the effect of the configuration of the decay parameters, such as their number and spacings, on the overall number of required measurements and the maximal sequence length in the experiment. We thereby provide theoretical guiding principles for designing RB experiments and explicitly work out limitations where the experimental precision required to separate multiple decays becomes impractical.
These fundamental limitations in analyzing RB data have previously motivated a variety of more resource-intensive data-gathering protocols that take further data from which one can isolate different decay curves in the classical post-processing phase. We will turn our attention to devising a novel general method for isolating matrix exponentials in section VIII. We begin with a detailed description of the data processing problem.
A. The randomized benchmarking data processing phase

The theorems on the data collection phase, morally summarized by eq. (63), state that in expectation RB output data is well-approximated by a linear combination of (matrix) exponentials in m. Every matrix M_λ ∈ C^{n_λ×n_λ} in the expansion is associated with an irreducible representation λ of the reference representation ω, and n_λ is the multiplicity of σ_λ in the decomposition of ω. From the collected data, a RB protocol subsequently extracts decay parameters that describe the exponential decay. The decay parameters associated with a matrix M_λ are its eigenvalues spec(M_λ) = {z_1^{(λ)}, …, z_{n_λ}^{(λ)}}, appearing with coefficients determined by A_λ and the Jordan structure of M_λ. Therefore, the matrix exponential takes the form

Tr(A_λ M_λ^m) = Σ_{i=1}^{n_λ} Σ_j a_i^{(λ,j)} m^{(j)} (z_i^{(λ)})^m ,   (143)

with real coefficients a_i^{(λ,j)}. Note that the m^{(j)} := m(m−1)⋯(m−j+1) are falling polynomials in m. Thus, the function space of Tr(A_λ M_λ^m) is in general spanned by exponential functions parametrized by the eigenvalues, modulated by falling polynomials. With the pole-finding techniques which we discuss in the next section, one can extract the set of all poles from RB output data. Thus, the general post-processing task of RB is the following: given a data series p̂(m) that is approximately described by linear combinations of polynomially modulated decays, extract the set Q of all poles.
Loosely speaking, estimating Q is typically possible, provided that the coefficients of all representations are sufficiently large and the poles are sufficiently spaced. In the remainder of this section, we assess this statement quantitatively using analytical and numerical methods.
In practice, one might operate under additional assumptions and need not extract all poles individually. For example, if one expects multiple poles in the data series that are all more or less aligned, the data processing problem becomes equivalent to extracting a single pole. The general form of the data-processing task, however, stays the same, namely extracting the poles in the data series.
Without additional assumptions or post-processing, the resulting poles are unlabeled, in the sense that one does not know which pole is associated with which irreducible representation. This issue will be addressed when we turn our attention to techniques that filter the RB data for specific representations in section VIII.
B. Data processing algorithms and guarantees

Fitting single decays
Many proposals for RB derive a data model that is well-approximated by a single decay curve. This is for example the case when the group is a unitary 2-design, the reference representation ω is the adjoint representation and the actual implementation is close to being trace-preserving [24]. The adjoint representation of a unitary 2-group acts irreducibly on the space of traceless matrices and yields a single dominant decay curve.
A single dominant decay parameter can be extracted using non-linear least-squares fitting algorithms such as Levenberg-Marquardt, see, e.g., ref. [73, Chapter 3.2]. In ref. [57] it has been shown that in RB for the Clifford group the variance of the data points is expected to vary strongly with the sequence length m. This observed heteroskedasticity motivates the use of iteratively re-weighted variants of least-squares fitting algorithms.
Ref. [74] analyses a simplified fitting procedure that estimates the decay parameter from the ratio of the data for two sufficiently separated sequence lengths. In the regime of high fidelity, it establishes a multiplicative error bound on the deviation of the decay parameter from one, using an efficient number of samples. Relatedly, ref. [45] gives an estimation scheme for a RB procedure that estimates, in parallel, multiple single exponential decays with multiplicative accuracy. This scheme makes use of post-processing techniques to guarantee the 'single-exponential' shape of the data. We will discuss this more in section VIII.
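To make the ratio idea concrete, the following is a minimal Python sketch of such an estimator under the idealized assumption of a single, offset-free decay y(m) = A p^m; the function name and the parameter values are our own choices, not those of ref. [74].

```python
import numpy as np

def ratio_decay_estimate(y1, m1, y2, m2):
    """Estimate the decay parameter p of y(m) = A * p**m from the
    ratio of the data at two separated sequence lengths m1 < m2."""
    return (y2 / y1) ** (1.0 / (m2 - m1))

# Synthetic single-decay RB-like data with A = 0.4 and p = 0.98.
A, p = 0.4, 0.98
m1, m2 = 10, 100
y1, y2 = A * p**m1, A * p**m2
p_hat = ratio_decay_estimate(y1, m1, y2, m2)
print(p_hat)  # recovers p = 0.98 in the noiseless case
```

Note that the SPAM pre-factor A cancels in the ratio, which is the reason this scheme is robust to state preparation and measurement errors.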

Fitting multiple decays with pole-finding algorithms: MUSIC and ESPRIT
Algorithms for simultaneously identifying multiple poles (frequencies and decay parameters) from a discrete series of data points date back to at least the work of Prony [75]. A zoo of modern algorithmic approaches has been developed in the context of direction-of-arrival estimation in array signal processing. In principle, these techniques can extract poles that are closer together than the grid spacing defined by the finite sampling rate, a phenomenon dubbed super resolution. The theoretical framework to derive guarantees for these algorithms that go beyond a perturbative analysis of special noise models or very simple configurations was only developed recently [76,77], first focusing on convex optimization.
Here, we will analyze the performance of the MUSIC algorithm [78] and the ESPRIT algorithm [79] on RB data. Performance guarantees for these two subspace algorithms were derived in refs. [80][81][82][83] for the multiplicity-free case. Furthermore, the ESPRIT algorithm was extended to polynomially modulated exponentials of the type we encounter in RB data with multiplicities in refs. [84,85]. We will summarize the required modification in section VII B 5. For the sake of clarity, we now start reviewing the algorithms for identifying multiple poles without polynomial modulation. This corresponds to the case of RB with a multiplicity-free reference representation. For the rest of this section we will denote the output data as y_m instead of p̂(m), in keeping with the signal processing literature. We will also assume equidistant spacing of the available sequence lengths m. As we point out in section VII B 5, this requirement can be relaxed by running a low-rank completion algorithm on incomplete data and thereby inferring equidistantly spaced data y_m. When clear from the context, we will write the data series simply as a vector y, dropping the explicit dependence on m.
The strategy of both algorithms, MUSIC and ESPRIT, is to identify the range of the subspaces associated with the dominant singular values of the Hankel matrix of the data series {y_m}_m. The crucial observation is that from this subspace the poles can be extracted. Let y ∈ R^M be the RB data with M the maximal sequence length. The Hankel matrix for 1 ≤ L < M is given by

Hankel_L(y) ∈ C^{L×(M−L+1)} ,  [Hankel_L(y)]_{j,k} = y_{j+k−1} ,  1 ≤ j ≤ L , 1 ≤ k ≤ M − L + 1 .

To a set of poles z = (z_1, …, z_n) we associate the n × M Vandermonde matrix W_M(z) with entries [W_M(z)]_{j,m} = z_j^{m−1}. If n = 1, and thus z ∈ C, we will refer to W_M(z) as the Vandermonde vector of length M and pole z.
With this notation, the data vector y, without noise, is in the range of W_M(z)^T. Furthermore, cyclically shifting the entries of y amounts to multiplication of the summands with the respective poles. In effect, the Hankel matrix has a Vandermonde decomposition

Hankel_L(y) = W_L(z)^T diag(a) W_{M−L+1}(z) + Hankel_L(α) ,   (147)

where we have denoted by α the deviation of y from an ideal linear combination of exponentials due to the perturbative error and finite statistics, and where a is the vector of pre-factors given in eq. (143).
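The noiseless part of this decomposition is easy to verify numerically; the following numpy sketch (helper names are ours) builds a two-pole signal and checks the factorization:

```python
import numpy as np

def hankel(y, L):
    """L x (len(y)-L+1) Hankel matrix of the data series y."""
    M = len(y)
    return np.array([[y[j + k] for k in range(M - L + 1)] for j in range(L)])

def vandermonde(z, M):
    """n x M Vandermonde matrix with rows (z_j^0, ..., z_j^{M-1})."""
    return np.array([[zj**m for m in range(M)] for zj in z])

z = np.array([0.98, 0.90])      # poles of the noiseless signal
a = np.array([0.5, 0.3])        # pre-factors
M, L = 40, 20
y = vandermonde(z, M).T @ a     # y_m = sum_j a_j z_j^m
H = hankel(y, L)
# Vandermonde decomposition of the noiseless Hankel matrix:
H_dec = vandermonde(z, L).T @ np.diag(a) @ vandermonde(z, M - L + 1)
print(np.allclose(H, H_dec))  # True
```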
To identify the signal subspace and distinguish it from the noise subspace, the MUSIC and ESPRIT algorithms employ a singular value decomposition (SVD) of the Hankel matrix, Hankel_L(y) = U Σ V^T. In the absence of noise and perturbation, i.e., α = 0, Hankel_L(y) has n non-vanishing singular values and the corresponding singular vectors form an orthonormal basis of the signal space span(W_L(z)^T). Let U_signal be the matrix consisting of the singular vectors of the non-trivial singular values as columns and let U_noise be the matrix consisting of an orthonormal basis of the complement. In the presence of noise, analogously choosing the singular vectors of the n largest singular values yields an estimate of the signal space.
From the noise-space projector P_noise = U_noise U_noise^†, the MUSIC algorithm defines the noise-space correlation function R : C → R,

R(z) = ‖P_noise W_L(z)‖_2 / ‖W_L(z)‖_2 ,   (148)

where W_L(z) denotes the Vandermonde vector of length L and pole z. The poles z can then be identified as the peaks of the inverse noise-space correlation function R^{−1}(z). These can be found by a continuous scan of the values of R^{−1}(z), which can be done numerically.
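For real decaying poles, this scan can be sketched compactly in numpy; in the sketch below the grid bounds, the peak-separation heuristic and all names are our own choices:

```python
import numpy as np

def music_poles(y, n, L, grid, min_sep=0.005):
    """Locate n real poles as the dominant peaks of the inverse
    noise-space correlation function computed from Hankel_L(y)."""
    M = len(y)
    H = np.array([[y[j + k] for k in range(M - L + 1)] for j in range(L)])
    U_noise = np.linalg.svd(H)[0][:, n:]          # noise-space basis
    powers = np.arange(L)
    R_inv = []
    for z in grid:
        w = z ** powers                            # Vandermonde vector
        R = np.linalg.norm(U_noise.conj().T @ w) / np.linalg.norm(w)
        R_inv.append(1.0 / max(R, 1e-15))
    R_inv = np.array(R_inv)
    # local maxima of R^{-1}, largest first, enforcing a minimum separation
    idx = [i for i in range(1, len(grid) - 1)
           if R_inv[i] >= R_inv[i - 1] and R_inv[i] >= R_inv[i + 1]]
    idx.sort(key=lambda i: R_inv[i], reverse=True)
    found = []
    for i in idx:
        if all(abs(grid[i] - g) > min_sep for g in found):
            found.append(grid[i])
        if len(found) == n:
            break
    return sorted(found)

z_true = [0.85, 0.97]
m = np.arange(60)
y = 0.5 * z_true[0]**m + 0.5 * z_true[1]**m       # noiseless two-pole signal
grid = np.linspace(0.5, 0.999, 2000)
rec = music_poles(y, 2, 30, grid)
print(rec)  # two peaks close to 0.85 and 0.97
```

The accuracy of the recovered poles is limited here by the grid spacing, which is the reason ESPRIT, discussed next, avoids the grid search altogether.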
A slightly different approach that avoids the continuous search for poles is taken by the ESPRIT algorithm. The ESPRIT algorithm exploits a so-called 'rotational invariance' property. To this end, let W_L^↑(z) and W_L^↓(z) be the sub-matrices of the Vandermonde matrix W_L(z)^T that omit the last and the first row, respectively. These sub-matrices are related via

W_L^↓(z) = W_L^↑(z) diag(z) .

This rotational invariance property is inherited by U_signal. In consequence, let H^↑ and H^↓ be the sub-matrices of the Hankel matrix H that omit the last and the first row, respectively. Then, in the noiseless case, a solution matrix Ψ of the equation

H^↑ Ψ = H^↓

has as its non-zero eigenvalues the poles z contained in the data. It is given explicitly by the pseudo-inverse of H^↑ applied to H^↓. Again, noisy signals can be considerably de-noised by projecting H^↑ onto the signal space before inversion. Altogether we find the algorithmic strategy of ESPRIT to be (i) calculate the SVD of the Hankel matrix of y and determine P_signal = U_signal U_signal^†, (ii) calculate Ψ = (P_signal H^↑)^+ H^↓ and (iii) determine z as the eigenvalues of Ψ.
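Steps (i)-(iii) fit in a few lines of numpy. The sketch below (names ours) uses the common variant that applies the shift invariance directly to the signal-space basis U_signal rather than to the projected Hankel matrix; in the noiseless case both formulations recover the same poles:

```python
import numpy as np

def esprit_poles(y, n, L):
    """ESPRIT: SVD of the Hankel matrix, then eigenvalues of the shift
    operator represented in the signal-space basis."""
    M = len(y)
    H = np.array([[y[j + k] for k in range(M - L + 1)] for j in range(L)])
    U_sig = np.linalg.svd(H)[0][:, :n]         # signal-space basis
    # rotational invariance: rows without the last vs. without the first
    Psi = np.linalg.pinv(U_sig[:-1]) @ U_sig[1:]
    return np.sort(np.linalg.eigvals(Psi).real)

z_true = [0.85, 0.97]
m = np.arange(60)
y = 0.5 * z_true[0]**m + 0.5 * z_true[1]**m    # noiseless two-pole signal
poles = esprit_poles(y, 2, 30)
print(poles)  # ≈ [0.85, 0.97]
```

In contrast to the MUSIC scan, the poles are returned directly as eigenvalues and are not limited to a pre-specified grid.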

Performance guarantees
Non-perturbative analysis of the performance of MUSIC has been conducted in refs. [80,82]. Therein, the following bound for the deviation of the noise-correlation function R(z) from the ideal noiseless counterpart R_signal(z) has been derived for poles z of unit absolute value (sinusoids). The argument, however, holds verbatim for all z ∈ C^n.
Theorem 11 (Noise-correlation function bound [82], Proposition 4.2). Let E = Hankel_L(α) denote the Hankel matrix of the perturbation/noise of the signal vector y, and let ε_min be the smallest singular value of the Hankel matrix of the noise-free signal. Suppose L ≥ n, M − L + 1 ≥ n and 2‖E‖_∞ < ε_min. Then

|R(z) − R_signal(z)| ≤ 2‖E‖_∞ / (ε_min − 2‖E‖_∞)

for all z ∈ C.
We observe that the bound on R(z) is proportional to the spectral norm of the noise in the signal, but in addition is decorated by a noise-enhancing factor inversely proportional to the smallest singular value ε_min of the Hankel matrix.
The bound on R(z) can thus not be directly translated into a bound on the precision in recovering the poles z without further assumptions, see ref. [80, theorem 4] in this context. Nonetheless, the peaks of R^{−1}(z) are typically very sharp, and the bound on R(z) indicates a regime where one can typically expect MUSIC to work accurately. For the ESPRIT algorithm, similar bounds can be found in refs. [81,83]. The bounds for ESPRIT additionally involve the minimum singular value of the truncation (as defined above) of the Hankel matrix.

Conditioning of Vandermonde matrices
The performance guarantees for MUSIC (and ESPRIT) show a noise-enhancement inversely proportional to the minimum singular value ε_min of the Hankel matrix of the ideal signal. The minimum singular value ε_min in turn can be regarded as a measure for the conditioning of the Vandermonde matrices into which the Hankel matrix decomposes. This conditioning depends on the system parameters and on the configuration of poles. Given expected values for the poles and the maximal sequence length, it is straightforward to calculate the minimum singular value numerically. This can provide valuable information in the design of RB experiments.
More systematically, it is informative to understand the scaling behaviour of the conditioning of the Vandermonde matrices with the help of theoretical bounds. One such bound that allows us to study its asymptotic behaviour is briefly reviewed in this section. A lot of work has been devoted to studying the often surprisingly favorable conditioning of Vandermonde matrices for poles on the unit circle, which describe sinusoidal oscillations; see, e.g., ref. [83] and references therein for a discussion of the phenomenon of super resolution.
In the context of RB, we are conversely interested in poles on the real line. A more general characterization of the conditioning of Vandermonde matrices with poles inside the unit circle (allowing for decays beyond oscillations) has been studied in ref. [86]. The conditioning obviously depends on the set of poles z and the size M of the Vandermonde matrix. To state the result given in ref. [86] we define several quantities. To the set of poles z = (z_1, …, z_n), we associate ž := max_j |z_j|, ẑ := min_j |z_j| and z̃ := min_{j≠k} |z_j − z_k|. Furthermore, let us define

Q_M(z) := (W_M(z) W_M(z)^†)^{−1/2} .

Note that W_M(z) W_M(z)^† is the frame operator of the frame defined by the rows of the Vandermonde matrix, and Q_M(z) is the orthogonalizing matrix arising in symmetric orthogonalization. With the help of Q_M(z), we define the matrix

F_M(z) := Q_M(z) diag(z) Q_M(z)^{−1}

that will play a prominent role in analyzing the Vandermonde conditioning. In particular, its departure from normality as measured by D²(F_M(z)) = ‖F_M(z)‖_F² − ‖z‖_2² will appear. In ref. [86] a bound is derived for the 2-norm condition number κ_2(W_M) = ‖W_M‖_∞ ‖W_M^+‖_∞ through the bounding of the Frobenius-norm condition number κ_F(W_M) = ‖W_M‖_F ‖W_M^+‖_F. Here X^+ denotes the (Moore-Penrose) pseudo-inverse of a matrix X.

The condition number of a linear map A gives a worst-case bound on the relative reconstruction error in ℓ_2-norm induced by an additive error in ℓ_2-norm for a linear inverse problem. But here we are more concerned with how it enters the accuracy of identifying poles in the MUSIC and ESPRIT algorithms. For the analysis of these algorithms, we want to upper bound the inverse minimum singular value ε_min^{−1}. By means of the Vandermonde decomposition (147) and the sub-multiplicativity of the spectral norm, we have

ε_min^{−1} ≤ a_min^{−1} ẑ^{−1} κ_2(W_L(z)) κ_2(W_{M−L+1}(z)) ,   (154)

where a_min denotes the smallest pre-factor in eq. (143) in magnitude. For the condition number the following bound holds.
Theorem 12 (Conditioning of Vandermonde matrices [86], theorem 6). For M > n ≥ 2 and a Vandermonde matrix W_M(z), the condition number κ_F(W_M(z)) is bounded from above and below by explicit expressions in n, ẑ, ž, the minimal spacing z̃ and the departure from normality D(F_M(z)); we refer to ref. [86, theorem 6] for the explicit form of these bounds.

Most interesting in our context is the asymptotic scaling in the limit of large maximal sequence length M, for poles inside the unit disc, |z_i| < 1 for all i. In this limit, the above bounds become tight and the following holds true.
Lemma 13 (Asymptotics of condition number [86], lemma 8). Let z = (z_1, …, z_n) ∈ C^n with |z_i| < 1 for all i ∈ [n]. Define C(z) ∈ C^{n×n} as the matrix with entries

[C(z)]_{j,k} = 1 / (1 − z_j z̄_k) .

Then,

lim_{M→∞} κ_2(W_M(z)) = κ_2(C(z))^{1/2} .

Later in this section we will use this bound to perform numerical investigations of the resolving power of the MUSIC and ESPRIT algorithms and to give a sampling complexity bound for general RB.
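This limit is convenient numerically: for poles strictly inside the unit disc, the Gram matrix of the infinite Vandermonde rows can be summed in closed form via the geometric series Σ_m z_j^m z̄_k^m = 1/(1 − z_j z̄_k). A small numpy sketch (function name ours):

```python
import numpy as np

def kappa2_infinite(z):
    """Asymptotic 2-norm condition number of W_M(z) as M -> infinity,
    for poles strictly inside the unit disc, via the closed-form Gram
    matrix C_{jk} = 1 / (1 - z_j * conj(z_k))."""
    z = np.asarray(z, dtype=complex)
    C = 1.0 / (1.0 - np.outer(z, z.conj()))
    ev = np.linalg.eigvalsh(C)         # C is Hermitian positive definite
    return np.sqrt(ev[-1] / ev[0])     # singular values of W are sqrt(eigs)

# conditioning degrades as two real poles approach each other
for gap in (0.05, 0.01, 0.002):
    print(gap, kappa2_infinite([0.90, 0.90 + gap]))
```

The printed values grow as the gap shrinks, which is the divergence of the condition number for merging poles discussed below.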

Extensions of the algorithms
Incomplete data or logarithmic grids. So far the presented algorithms and analysis relied on having an equidistant grid of sequence lengths. It is well-known that a low-rank matrix can, under fairly general assumptions, be completed from the knowledge of just a subset of its entries [87]. Thus, given only data y_m for values m on an irregular subset of a regular grid, one can attempt to complete the Hankel matrix for the regular grid using a low-rank matrix completion algorithm. This pre-processing step can be combined with MUSIC or ESPRIT to arrive at pole-finding algorithms that do not rely on complete data from an equidistant grid [80]. In particular, we suspect that for exponential decays a logarithmic grid can potentially yield improved recovery, similar to the multiplicative error bounds for the fitting of single exponentials derived in ref. [74], but we leave formally verifying this to future work.
Generalization of ESPRIT to matrix exponentials. Refs. [84,85] have generalized the ESPRIT algorithm to signal spaces spanned by products of falling polynomials and exponentials. This is exactly the signal model (143) that we encountered for RB output data when the reference representation has multiplicities. The key insight in this generalization is that the Hankel matrix of such signals admits a decomposition analogous to the Vandermonde decomposition (147) in terms of Pascal-Vandermonde matrices. These Pascal-Vandermonde matrices feature the same rotational invariance property underlying the ESPRIT algorithm. Thus, one can show that when applying the standard ESPRIT algorithm to data of this form, the vector of eigenvalues of the matrix Ψ is still the vector of poles z, with the eigenvalues appearing with multiplicities according to the maximal degree of the associated falling polynomial. Hence, ESPRIT can be directly applied to estimate matrix-exponential data series. Noise in the signal will generically break the degeneracy of the eigenvalue spectrum, corresponding to the fact that a generic matrix has non-degenerate eigenvalues. Searching for regular polygons of poles allows for matching groups of perturbed poles corresponding to the same unperturbed pole. We refer to refs. [84,85] for further details.
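This degeneracy can be illustrated numerically: applying a plain, self-contained ESPRIT sketch (our own numpy implementation, as in the previous section) to a polynomially modulated decay, of the kind produced by a defective 2 × 2 Jordan block, returns the single pole twice:

```python
import numpy as np

def esprit_poles(y, n, L):
    """Standard ESPRIT via the signal-space basis of the Hankel matrix."""
    M = len(y)
    H = np.array([[y[j + k] for k in range(M - L + 1)] for j in range(L)])
    U_sig = np.linalg.svd(H)[0][:, :n]
    Psi = np.linalg.pinv(U_sig[:-1]) @ U_sig[1:]
    return np.linalg.eigvals(Psi)

# Matrix-exponential data from a defective decay matrix: with the
# Jordan block J = [[z, 1], [0, z]], Tr(A J^m) has the (a + b*m) z^m form.
z = 0.95
m = np.arange(60)
y = (0.4 + 0.05 * m) * z**m
poles = esprit_poles(y, 2, 30)
print(poles)  # both eigenvalues of Psi cluster at the single pole z = 0.95
```

In floating-point arithmetic the two eigenvalues are split slightly, mirroring the noise-induced breaking of the degeneracy described above.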

C. Randomized benchmarking sampling complexity: estimation of the Hankel matrix
The performance bounds on the pole-finding algorithms, such as theorem 11, depend on the deviation of the Hankel matrix from ideal data in spectral norm. In RB protocols this error has two contributions: 1. The finite sampling statistics of the measurements, which yields a statistical error of the mean estimator p̂(m).
2. The perturbative error that comes from neglecting sub-dominant eigenvalues, which is controlled by our theorems 8, 9, and 10.
For the finite sampling error, we provide the following bound. To this end, we model the individual measurement performed during the RB protocol by a random variable Ŷ_m. To simplify the notation in the proof, we assume that the number of different sequence lengths is even and use a square Hankel matrix.

Lemma 14 (Concentration of the Hankel matrix). Let L = M/2 and, for each m ∈ [M], let p̂(m) be the mean estimator of N i.i.d. copies of a random variable Ŷ_m taking values in [0, 1] with Var[Ŷ_m] ≤ ε². Then ‖Hankel_L(p̂) − Hankel_L(p)‖_∞ ≤ t, where p(m) = E[p̂(m)], with probability at least 1 − δ, provided that

N ≥ 2 t^{−2} (Lε² + t/3) log(2L/δ) .
Combining lemma 14 with the performance bound for MUSIC, theorem 11, and (154), we can state the following result for the overall sampling complexity of randomized benchmarking experiments.
Corollary 15 (Sampling complexity). Let M be even and L = M/2, and let z = (z_i)_{i=1}^n be a set of poles with smallest pre-factor a_min in magnitude. For m ∈ [M] let p̂(m) be the mean estimator of i.i.d. copies of random variables with variance bounded by ε². Choose ε̃, δ > 0. Provided that the total number of random trials satisfies

N_total = Ω( M L ε² ε̃^{−2} a_min^{−2} ẑ^{−2} κ_2^4(W_{M/2}(z)) log(L/δ) ) ,

it holds for the noise-space correlation function (148) defined by the MUSIC algorithm with input data p̂ that |R(z′) − R_signal(z′)| ≤ ε̃ for all z′ ∈ C, with probability at least 1 − δ.
We state this bound in terms of the condition number of the Vandermonde matrix, which allows us to make analytic claims about the behaviour of the sampling complexity in various regimes. However, one can state an equivalent bound in terms of the smallest singular value, which will often be significantly smaller. It is, however, difficult to work with analytically.
For the application of corollary 15 to RB data processing, one has to additionally control the perturbative error appearing in theorems 8, 9, and 10. The perturbative error per RB data point, see e.g. (73), yields an additive error in the noise-correlation function of order M ẑ^{−1} κ_2²(W_{M/2}(z)). The scaling with M originates from the spectral norm of the Hankel matrix, and the factor ẑ^{−1} κ_2²(W_{M/2}(z)) captures the noise-enhancement. Lemma 14 follows from the matrix Bernstein bound [88,89], which requires us to control the spectral norm and the matrix variance statistics in order to provide a tail bound for sums of random matrices. We follow the same strategy as presented in ref. [88] for Toeplitz matrices.
Proof of lemma 14. With the help of the L × L exchange matrix J, which has ones on its anti-diagonal and zeros everywhere else, and the L × L (non-cyclic) shift matrix X, which has ones on its first upper off-diagonal and zeros everywhere else, we can write

Hankel_L(p̂) = Σ_{k=1}^{2L−1} p̂(k) J X^{k−L} ,

where X^{−j} := (X^T)^j and we identify the elements of p̂ cyclically. We define

A_{i,k} := N^{−1} (Ŷ_k^{(i)} − p(k)) J X^{k−L} ,

so that Hankel_L(p̂) − Hankel_L(p) = Σ_{i,k} A_{i,k} is a sum of independent, centred random matrices. Since ‖J X^{k−L}‖_∞ ≤ 1 and Ŷ_k takes values in [0, 1], we have ‖A_{i,k}‖_∞ ≤ N^{−1} for all i, k. For the matrix variance we calculate

Σ_i E[A_{i,k} A_{i,k}^†] = N^{−1} Var[Ŷ_k] J P_k J ,

with P_k a diagonal projector having L − |k − L| ones on the diagonal and zeros everywhere else. One finds the same structure for Σ_i E[A_{i,k}^† A_{i,k}] analogously. By the assumption of the lemma, Var[Ŷ_k] ≤ ε². Therefore, the matrix variance statistics is dominated as

‖ Σ_{i,k} E[A_{i,k} A_{i,k}^†] ‖_∞ ≤ L ε² N^{−1} .

The matrix Bernstein inequality [88] yields

P( ‖Hankel_L(p̂) − Hankel_L(p)‖_∞ ≥ t ) ≤ 2L exp( −(t²/2) / (Lε² N^{−1} + t/(3N)) ) .

Requiring the right-hand side to be dominated by δ and solving for N yields the lemma's assertion.
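The scaling of this statistical error can be checked with a small Monte Carlo experiment (our own sketch, with binomially distributed measurement counts): the spectral norm of the Hankel matrix of the estimation error shrinks as the number N of samples per sequence length grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def hankel_sq(y, L):
    """Square L x L Hankel matrix of a data series of length 2L - 1."""
    return np.array([[y[j + k] for k in range(L)] for j in range(L)])

L = 10
M = 2 * L - 1                              # data points for a square Hankel
p = 0.5 * 0.95**np.arange(M)               # ideal success probabilities
means = []
for N in (100, 10000):                     # samples per sequence length
    p_hat = rng.binomial(N, p, size=(200, M)) / N    # 200 repetitions
    errs = [np.linalg.norm(hankel_sq(e, L), 2) for e in p_hat - p]
    means.append(np.mean(errs))
print(means)  # the error for N = 10000 is markedly smaller than for N = 100
```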

D. Vandermonde conditioning for randomized benchmarking decays
The noise-enhancement factor in the performance guarantees for the pole-finding algorithms MUSIC and ESPRIT is given by the inverse minimum singular value ε_min^{−1} of the Hankel matrix of the ideal, noise-free signal. This minimum singular value is, by (154), in turn controlled by the minimal absolute value of the poles and the conditioning of the Vandermonde matrix W_L(z) associated with the poles and the signal length. Here we numerically investigate this conditioning in various scenarios relevant to RB. We express all data in terms of the dimension L of the Hankel matrix, which one can generally take to be about half of the maximal sequence length M.
When the RB data model is described by many poles that are close in value, the noise-enhancement due to bad conditioning can be a limiting factor, rendering the extraction of poles infeasible. Increasing the sequence length improves the conditioning of W_L(z), see figure 4. But theorem 12 shows that the condition number of W_L(z) remains bounded away from one even in the asymptotic limit W_∞(z) of large L. Thus, increasing the length of the observed RB series only improves the conditioning up to a certain point.
The explicit expressions of the upper and lower bounds on the condition number in theorem 12 have a rather complicated dependency on the geometrical constellation of the poles. One can argue that for RB data with poles on the real line there are roughly speaking two effects coming into play: (1) The spacing of the poles and (2) the number of poles.
To illustrate the dependency on the spacing of the poles, we have numerically evaluated κ_2(W_∞(z)) for different pairs of poles as they might appear in RB data. The result is shown in fig. 3. The first pole is chosen to deviate from 1 by a value r ∈ {10^−2, 10^−3, 10^−4}; the second pole is chosen at different values around the first one. We observe that as both poles move together the condition number diverges. Importantly, the size of the interval in which the condition number grows over a certain threshold scales with r. Correspondingly, we expect that poles closer to 1 can still be resolved with a smaller spacing, compared to poles that deviate considerably from 1.
Secondly, even if the poles are spaced such that the ratio of the departure from normality and the minimum spacing is fixed, the upper bound in theorem 12 exhibits an exponential dependency on the number of poles. We numerically evaluate the dependency of κ_2(W_∞(z)) on the number of poles n for different families of poles that each define a set of poles for every cardinality, see table I. We find that, due to a typically exponential dependency, the conditioning indicates that the reconstruction of multiple poles becomes demanding already for small numbers n.
Note that the conditioning is significantly improved if the poles are not exclusively on the real line but also have nonvanishing imaginary parts. Such pole sets for example arise in the RB variant of ref. [9] focusing on individual gates.

E. Performance evaluation
Having collected evidence that the reconstruction of multiple poles quickly becomes a demanding task, we here show that for moderate configurations (i.e., not too many poles, not too close together) the ESPRIT algorithm is suitable for the post-processing of RB data. To this end, we implemented the ESPRIT algorithm in Python. For a fixed set of poles, the ideal data series (constructed from the poles and a fixed identical pre-factor) is made noisy by sampling binomial distributions. This simulates the random noise due to finite statistics for a certain number of samples per sequence length. Subsequently, the set of poles is reconstructed from the noisy data using the ESPRIT algorithm. We compare the reconstructed set of poles z′ ∈ C^{n′} with the ideal set of poles z ∈ C^n using the symmetric Hausdorff distance

d_H(z, z′) = max{ d_dH(z′; z), d_dH(z; z′) } ,  d_dH(z; z′) = max_j min_k |z_j − z′_k| .

We observe that the reconstruction error drops sharply once the number of samples per sequence length exceeds a certain threshold. This phenomenon is observed for different families of poles, and the location of the threshold depends strongly on the number of poles in the signal. It is interesting to note in fig. 7 that the minimal number of samples needed for reconstruction depends on the maximal sequence length. Since increasing the maximal sequence length has an implicit sampling cost, this points to a non-trivial optimization problem in allocating resources. We leave investigation of the optimal point for a family of poles to future research. The conclusion from these numerical investigations is that the RB decay rate recovery problem is feasible using modern methods when the number of poles is small, but rapidly becomes impractical as the number of poles grows.
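For completeness, the symmetric Hausdorff distance between two sets of poles is straightforward to compute (our own numpy helper):

```python
import numpy as np

def hausdorff(z1, z2):
    """Symmetric Hausdorff distance between two sets of (complex) poles."""
    z1, z2 = np.asarray(z1), np.asarray(z2)
    d = np.abs(z1[:, None] - z2[None, :])    # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

print(hausdorff([0.95, 0.85], [0.951, 0.86]))  # ≈ 0.01
```

Note that the two pole sets may have different cardinalities, so spurious or missed poles are penalized automatically.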

VIII. ISOLATING MATRIX EXPONENTIALS ASSOCIATED WITH A REPRESENTATION
We have seen in section VI that for uniform randomized benchmarking (RB) the output data is well described by a linear combination of (matrix) exponential decays associated with irreducible sub-representations of a reference representation. The decay rates can in principle be extracted by the methods described in section VII. However, two issues crop up here: (1) the sample complexity of extraction is strongly dependent on the number of decays present in the RB output data, limiting RB to groups with reference representations containing at most a few irreducible sub-representations, and (2) upon successful extraction of decay constants, it is not clear a priori how they are related to the different irreducible sub-representations present, making it hard to relate the decay constants to the average fidelity.
A data processing technique that addresses this problem was proposed in various papers (marked with a * in fig. 2) such as the dihedral benchmarking scheme [37] for the single qubit dihedral group, the character benchmarking scheme [39] which works for general groups (with some technical constraints on the reference representation) and the Pauli channel tomography scheme [45] and cycle benchmarking [13] for the Pauli group (in ref. [45] multiple decays are actually estimated in parallel). The unifying theme in all of these procedures is that one estimates RB output data p(i, m, g end ) for different ending gates g end ∈ G, and then correlates the resulting vector of signals [p(i, m, g end )] g end with a scalar function f λ (g end ) (which can be thought of as a dual vector) that depends on an irreducible sub-representation σ λ of the reference representation ω.
In this section we will take this idea and generalize it as far as possible. In particular we will propose a post-processing method that, for any group G and reference representation ω, takes in RB output data p(i, m, g end ) (for all g end ∈ G) and an irreducible sub-representation σ λ of the reference representation ω, and outputs post-processed data k λ (m) that only depends on the (matrix) exponential decay associated with σ λ . We will state theorems for uniform RB, but the discussion below generalizes to the other types of RB.
We note that all examples of RB schemes without inversion gates (marked with a * * in fig. 2) can be seen as special cases of the procedure given below, where the output data [p(i, m, g end )] g end is simply averaged over g end . We would also like to note that the procedure defined here obviates the need for explicitly implementing the inversion gate (as it can be simply absorbed by redefining g end ). This makes the protocol more experimentally practical.
A. The post-processing procedure

We begin by defining filter functions α_λ (associated with a sub-representation σ_λ),

α_λ(i, g) := ⟨⟨Π_i| P_λ ω(g) |ρ_0⟩⟩ ,

where P_λ : S_d → S_d is the projection onto the sub-representation σ_λ^{⊕n_λ} of the reference representation ω. This is (up to normalization) the matrix element of the sub-representation σ_λ^{⊕n_λ} corresponding to the vectors |ρ_0⟩⟩ and ⟨⟨Π_i|. From the RB data and the above matrix element function we can now compute the following quantity, which we call the λ-filtered RB output data:

k_λ(m) := (N_λ^{−1}/|G|) Σ_{g_end∈G} Σ_{i∈I} α_λ(i, g_end) p(i, m, g_end) ,   (173)

where N_λ is a normalization constant depending only on λ, the input state ρ_0 and the POVM {Π_i}_{i∈I}; its role is discussed further in section VIII B. One can think of this quantity as measuring the presence of the sub-representation σ_λ in the data p(i, g_end, m). We will make this more precise in the following theorem.

Theorem 16 (Measuring sub-representations in the data). Let G be a finite group and ω : G → S_d a reference representation of G with decomposition ω = ⊕_{λ′∈Λ} σ_{λ′}^{⊕n_{λ′}}. Moreover, let φ be an implementation of ω for which theorem 8 holds. For a fixed λ ∈ Λ consider the λ-filtered data k_λ(m) as defined in (173). As a function of m we now have that

k_λ(m) = Tr(B_λ M_λ^m) + ε_λ(m) ,

where B_λ is an n_λ × n_λ matrix encoding SPAM terms, M_λ is given by the projection onto the subspace associated with the n_λ largest eigenvalues of F(φ)[σ_λ] (as given in theorem 8), and the remainder ε_λ(m) is bounded by K times the perturbative error of theorem 8, with K some constant independent of m.
Proof. We know from theorem 8 that, up to the perturbative error,

p(i, m, g_end) = Σ_{λ′∈Λ} Tr(A_{λ′} M_{λ′}^m) ,

with A_{λ′} given in eq. (101). From the definition of k_λ(m), we can thus compute k_λ(m) as a sum of a dominant term and a perturbative remainder. Considering only the first term, and inserting the definition of α_λ(i, g_end), we are interested in the SPAM operator quantity

B_{λ,λ′} := (N_λ^{−1}/|G|) Σ_{g_end∈G} Σ_{i∈I} α_λ(i, g_end) A_{λ′}

for λ′ ∈ Λ. From the proof of theorem 8 (eq. (101)), we can recover an expression for the n_{λ′} × n_{λ′} matrix A_{λ′} in terms of the projectors P^j_{λ′} onto the j-th copy of σ_{λ′} in the reference representation ω and operators R_{λ′}, L_{λ′} encoding the deviation of φ from ω (their precise shape is not relevant for our argument). By linearity, we can now carry out the average over g_end, where we use that

(1/|G|) Σ_{g∈G} σ_λ(g) ⊗ σ̄_{λ′}(g)

is the projector onto the trivial sub-representations and hence vanishes for λ ≠ λ′, which is the Fourier-transform analog of the orthogonality of characters of irreducible representations. Hence B_{λ,λ′} = δ_{λ,λ′} B_{λ,λ} =: B_λ.
Plugging this back into the expression for k_λ(m), we can upper bound the difference k_λ(m) − Tr(B_λ M_λ^m) by the magnitude of the sub-leading term, which theorem 8 guarantees decays exponentially in m. It follows that there exists a constant K, independent of m, controlling this correction. Hence, the λ-filtered output data has essentially the same behaviour as regular RB data, except that only the Fourier mode associated with σ_λ is included in the signal. One can think of the filter function α_λ as placing a delta-peak filter centered on the 'frequency' σ_λ. Note that by linearity we get essentially the same result if one defines a filter function associated with a non-irreducible representation (via a direct sum of irreducible representations); this can be thought of as placing a frequency comb on the RB data. Finally, it is instructive to write down the form of the SPAM matrix B_λ explicitly in the limit of no SPAM errors and perfect gates. In the case of a multiplicity-free reference representation ω, the resulting expression emphasizes the importance of the normalization constant (on which more below), but also the importance of choosing ρ_0 and {Π_i}_{i∈I} such that B_λ is non-zero.
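The delta-peak picture can be made concrete in a toy setting. The sketch below is an illustrative analogy, not the full RB protocol: it uses the abelian group Z_N, whose irreducible representations are the one-dimensional characters χ_k(g) = e^{2πikg/N}, and shows how correlating data of the form p(g) = Σ_λ c_λ χ_λ(g) against one character isolates the weight of a single 'frequency'.

```python
import numpy as np

# Toy illustration (abelian stand-in for the lambda-filter): the characters
# chi_k(g) = exp(2*pi*i*k*g/N) are the irreps of Z_N, and correlating data
# against chi_k isolates the coefficient of that irrep via orthogonality.
N = 16
g = np.arange(N)
chi = lambda k: np.exp(2j * np.pi * k * g / N)

# Synthetic "RB-like" data: a sum of two irrep contributions.
c = {2: 0.7, 5: 0.25}
p = sum(ck * chi(k) for k, ck in c.items())

# Filter for irrep k: (1/|G|) * sum_g conj(chi_k(g)) * p(g).
def filtered(k):
    return np.vdot(chi(k), p) / N  # vdot conjugates its first argument

print(abs(filtered(2) - 0.7))  # ~0: irrep 2 is present with weight 0.7
print(abs(filtered(3)))        # ~0: irrep 3 is absent from the data
```

For non-abelian groups the characters are replaced by matrix elements of the irreducible sub-representations, and the normalization N_λ plays the role of the factor 1/N here.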

B. Statistical estimation
When computing the filtered output data k_λ(m) in the previous section we assumed we had access to the RB output data p(i, g_end, m) for all i ∈ I and g_end ∈ G. This is not realistic, since both the size of the POVM {Π_i}_{i∈I} and the size of the group |G| can be exponential in the number of qubits. In practice we will need to construct a statistical estimator k̂_λ for k_λ, and argue that k̂_λ is a good approximation for a reasonable number of samples. We do so in this section.
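The basic sampling idea can be illustrated in a toy setting (using the abelian group Z_N and its characters χ_k as stand-ins for the filter functions, with hypothetical noiseless data, not a real RB experiment): since the filtered quantity is a uniform average over group elements, averaging the filter-weighted data over uniformly sampled elements yields an unbiased estimate.

```python
import numpy as np

# Toy sketch of the sampling estimator: the filtered quantity is a uniform
# average over the group, so the empirical mean of filter-weighted data over
# uniformly sampled group elements is an unbiased estimator of it.
rng = np.random.default_rng(0)
N, k0, k1 = 64, 3, 7
c_true = 0.6
# Hypothetical data with two "modes"; we want the weight of mode k0.
p = lambda g: (c_true * np.exp(2j * np.pi * k0 * g / N)
               + 0.3 * np.exp(2j * np.pi * k1 * g / N))

S = 4000                                      # number of sampled group elements
gs = rng.integers(0, N, size=S)
weights = np.exp(-2j * np.pi * k0 * gs / N)   # conj(chi_k0(g)): the filter
estimate = np.mean(weights * p(gs))

print(abs(estimate - c_true))  # small, shrinking as 1/sqrt(S)
```

The question addressed next is whether the variance of such an estimator stays bounded when the filter weights themselves grow with the Hilbert space dimension.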
Note that the normalization factor N_λ is essential in lower bounding the magnitude of the filtered function k_λ (i.e., making sure that the number k_λ is not too small). However, this normalization factor can be proportional to the Hilbert space dimension d, making it tricky to set up an estimator for k_λ whose sampling complexity does not grow with d (growth with d would make sampling practically impossible for more than a few qubits). This is the task we turn to now. We can construct an estimator for k_λ(m) essentially directly from its definition: for each sampled random sequence, record the ending gate g_end and the observed measurement outcome i, compute the λ-filter function value α_λ(i, g_end), and then compute the empirical weighted average k̂_λ(m) of the values N_λ^{-1} α_λ(i, g_end) over all samples. It is easy to see that the mean of this estimator is equal to the λ-filtered output data k_λ(m). However, this does not mean that the associated estimation procedure is efficient. A priori the variance of the estimator could scale with the Hilbert space dimension d, since the magnitude of the filter function N_λ^{-1} α_λ does so in general. We cannot prove that this estimator is efficient for all groups G and POVMs {Π_i}_{i∈I}. We can, however, make some partial statements. In particular, we can prove that the estimator is efficient as long as the POVM {Π_i}_{i∈I} is generated by a 3-design. This is a restrictive condition, but not impossible to fulfill. We will discuss how to implement such a POVM after stating and proving the following theorem, which essentially states that under the 3-design condition, the variance of the estimator k̂_λ(m) does not scale with the Hilbert space dimension d. This means that the sampling resources required by the protocol do not depend on the number of qubits in the system, making the post-processing step scalable (at least with respect to sampling). We note that this theorem gives an extremely crude bound on the variance; the actual variance is liable to be substantially smaller.
For simplicity, we assume that there is no SPAM or gate noise, but the conclusions made here easily generalize.
Theorem 17 (Efficient estimators). Consider a uniform RB experiment of sequence length m, with group G, reference representation ω, measurement POVM {Π_i}_{i∈I} and initial state ρ_0, and further assume that the POVM {Π_i}_{i∈I} is proportional to an exact 3-design, that is, Π_i = (d/|I|) |χ_i⟩⟨χ_i| with states |χ_i⟩ satisfying (1/|I|) Σ_{i∈I} (|χ_i⟩⟨χ_i|)^{⊗3} = ∫ dψ (|ψ⟩⟨ψ|)^{⊗3}. Then for all λ ∈ Λ the variance of the estimator k̂_λ(m) is asymptotically independent of the Hilbert space dimension d.
Proof. First we calculate the effect of the 3-design condition on the normalization factor of the correlation function α(i, ·) by direct calculation, where we have used the fact that the Haar measure is invariant under unitary action to absorb the ω(g) dependence, as well as a standard formula for the second moment of a Haar average over the unitary group, see e.g. [56, Proposition 37] or [55] (and that Tr(ρ_0) = 1). We can now calculate the variance. We denote by k̂_λ(m, g_end) the estimator of Σ_{i∈I} N_λ^{-1} α(i, g_end) p(i, m, g_end) for a fixed g_end ∈ G. By the law of total variance we can upper bound the variance of k̂_λ(m) by two terms, dropping the negative contributions in the variances. We begin with the second term. For this, note that for all g_end ∈ G it can be evaluated (again using the invariance of the Haar measure) via the expression for p(i, m, g_end) from eq. (69). Note that the resulting expression is asymptotically independent of the Hilbert space dimension (depending only on how well the initial state overlaps with the projector P_λ). Next we discuss the first term. Here a third moment of a Haar average appears, which can be evaluated using Weingarten calculus (see for instance equations S35 and S36 in ref. [55], ref. [56], or ref. [90] more generally). In this particular instance we obtain an expression involving A|_t = A − Tr(A)1 for matrices A. By isolating a common d^{-3} factor and plugging back in, we obtain a bound which is again asymptotically independent of the Hilbert space dimension.
Measurement POVMs that are proportional to 3-designs are not very common. However, when considering a system of q qubits it is possible to construct one by considering computational basis measurements conjugated by a random element of the q-qubit Clifford group C_q. That is, we consider the POVM with elements proportional to C |x⟩⟨x| C† for x ∈ {0,1}^q and C ∈ C_q. This POVM is proportional to a 3-design, because the multi-qubit Clifford group is a unitary 3-design [91, 92], and hence every orbit {C |x⟩⟨x| C†}_{C∈C_q} is a state 3-design (and thus so is the union over x).
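The 3-design property of the Clifford group can be checked numerically in the smallest case. The sketch below (assuming only that the single-qubit Clifford group is generated by H and S up to global phase) builds the 24-element group by closure and verifies that its third frame potential matches the Haar value, which for t = 3 and d = 2 equals 5.

```python
import numpy as np
from itertools import product

# Numerical check (single-qubit case) that the Clifford group is a unitary
# 3-design: its frame potential (1/|C|^2) sum_{U,V} |Tr(U^dag V)|^6 must
# equal the Haar value, which for t = 3 and d = 2 is 5.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]])

def canon(U):
    # Fix the global phase so group elements can be compared as tuples.
    idx = np.flatnonzero(np.abs(U.round(8)))[0]
    U = U / (U.flat[idx] / abs(U.flat[idx]))
    return tuple(np.round(U, 8).flatten())

# Generate the group by closure under multiplication with H and S.
group = {canon(np.eye(2)): np.eye(2)}
while True:
    new = {}
    for U in list(group.values()):
        for G in (H, S):
            key = canon(G @ U)
            if key not in group:
                new[key] = G @ U
    if not new:
        break
    group.update(new)

cliffords = list(group.values())
print(len(cliffords))  # 24 elements (modulo global phase)

F3 = np.mean([abs(np.trace(U.conj().T @ V)) ** 6
              for U, V in product(cliffords, cliffords)])
print(F3)  # ~5, the Haar frame potential for t = 3, d = 2
```

The same closure idea does not scale to many qubits, where tableau representations of the Clifford group are used instead; the check here is only meant to make the 3-design claim tangible.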
We emphasize that the 3-design condition is only a sufficient condition for a controlled variance of the estimator for the filtered output data, one that works for any group G and sub-representation σ_λ. For particular choices of G and σ_λ the estimator k̂_λ(m) might be efficient for other choices of the POVM {Π_i}_{i∈I}. It is for instance easy to see that the variance will also be controlled if the degree d_λ of the irrep σ_λ is small. This follows from the structure of the normalization factor N_λ: assuming the POVM {Π_i}_{i∈I} and the initial state ρ_0 can be chosen to have sufficient (larger than 1/d) overlap with the sub-representation σ_λ, the magnitude of the inverse normalization factor N_λ^{-1}, and hence of the weighted random variable N_λ^{-1} α(i, g_end) distributed according to p(i, m, g_end), is controlled by 1/d_λ. Hence, if d_λ is small, the estimator k̂_λ(m) is efficient, since it is constructed by sampling a random variable bounded by O(1) in d. Examples of this behaviour have been noted in the literature [37, 39, 45].
Alternatively, there are situations where the dimension of the representation σ_λ scales with the total Hilbert space dimension d but the estimator k̂_λ(m) is still efficient, because the group G under consideration is sufficiently randomizing (roughly, it generates its own 3-design through the randomization over the ending gate g_end). An example of this is the recently introduced linear cross-entropy benchmarking procedure, which we discuss in the next section.
Finally, we would like to add that if one re-uses the same experimental data p(m, g_end) to estimate k_λ(m) for different λ, the resulting estimates for k_λ(m) (and consequently the associated decay rates) will be correlated. This must be taken into account when performing joint statistical inferences on estimates for several M_λ. It can of course be remedied by gathering new data for each representation label λ.

C. Example: Linear cross-entropy benchmarking
Recently, ref. [29] has introduced an RB-like protocol referred to as linear cross-entropy benchmarking, in short XEB. We will see in this section that this protocol falls into the framework of the benchmarking schemes introduced here. In fact, it can be seen as uniform RB with G the full unitary group, together with a post-processing scheme that is a special case of the above filtering scheme. Let φ : U(2^q) → S_d be an implementation map of the unitary group, let {Π_x}_{x∈{0,1}^q} be the computational basis POVM, and let ρ_0 = |0⟩⟨0|, with E_M, E_SP the usual SPAM error channels entering the definition of the linear cross-entropy fidelity F_XEB. Setting α(x, U) = |⟨x|U|0⟩|² = ⟨⟨Π_x|ω(U)|ρ_0⟩⟩, we see that F_XEB can be interpreted as an RB experiment of sequence length '0' with g_end = U, together with post-processing by correlation with the representation ω(U) = U · U†. Note that the dimensional factor in the definition of F_XEB almost precisely serves as the correct normalization factor for α(x, U). We can extend this interpretation by considering the linear cross-entropy of a sequence of m random unitaries (this is done implicitly in ref. [29]). Using the invariance of the Haar measure and the linearity of the trace and the tensor product, we can rewrite the resulting quantity in terms of p(x, U_m, m), the output probability of a regular RB experiment. Now noting that ω(U) decomposes into the trivial representation (on the span of the identity) and the adjoint representation (on the space of traceless matrices {|A⟩⟩ | Tr(A) = 0}), we apply theorem 8 to the above, obtaining an expression valid up to a correction exponentially small in m, where s_tr (f_adj) is the largest eigenvalue of the Fourier transform of φ evaluated at the trivial (adjoint) representation. Recall that s_tr = 1 if φ(U) is trace preserving for all U, and that we can moreover interpret f_adj as affinely related to the average fidelity (certainly in the gate-independent noise setting).
Hence, through theorem 8 and our general post-processing scheme the linear cross entropy benchmarking procedure inherits both the stability and interpretation of uniform RB.
It is notable that the estimator k̂_λ(m), which in this case estimates the linear cross-entropy fidelity F_XEB,m, is actually efficient in the sense of theorem 17. We can sketch an argument for this by directly estimating the variance of the estimator. For this argument we assume gate-independent noise (i.e., φ(U) = Aω(U) for some completely positive A). Following theorem 17, and using the gate-independent noise assumption and the fact that ω(U)(ρ) = UρU†, the first relevant term is a Haar integral of a degree-3 homogeneous polynomial in the entries of U and Ū, and the second term is a Haar integral of a degree-4 homogeneous polynomial. The asymptotic behaviour of such integrals (in the limit of large d) is well known [90]; they evaluate to O(d^{-3}) and O(d^{-4}), respectively. Hence the overall variance is O(1) in d. One could fill in the exact constants by evaluating the Haar integrals (as we did in theorem 17), but we do not pursue this here.
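The XEB decay can be made concrete in a simplified toy model. The sketch below assumes global depolarizing noise, p_noisy(x) = f·p_ideal(x) + (1 − f)/d (a hypothetical noise model, not the general gate-dependent setting), for which F_XEB = d Σ_x p_noisy(x) p_ideal(x) − 1 = f·(d Σ_x p_ideal(x)² − 1); for Haar-random ideal states the Porter-Thomas statistics give d Σ_x p_ideal(x)² → 2, so F_XEB ≈ f.

```python
import numpy as np

# Toy linear-XEB sketch under global depolarizing noise: averaging
# F_XEB = d * sum_x p_noisy(x) * p_ideal(x) - 1 over Haar-random ideal
# states recovers (approximately) the depolarizing fidelity parameter f.
rng = np.random.default_rng(1)
q, trials = 10, 200
d, f = 2 ** q, 0.9

fxeb = []
for _ in range(trials):
    # Ideal output distribution of a Haar-random state (Porter-Thomas).
    psi = rng.normal(size=d) + 1j * rng.normal(size=d)
    psi /= np.linalg.norm(psi)
    p_ideal = np.abs(psi) ** 2
    p_noisy = f * p_ideal + (1 - f) / d
    fxeb.append(d * np.sum(p_noisy * p_ideal) - 1)

print(np.mean(fxeb))  # ~0.9: the XEB estimate recovers f up to O(1/d)
```

The small residual bias comes from d Σ p_ideal² averaging to 2d/(d+1) rather than exactly 2 at finite d.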

IX. RANDOMIZED BENCHMARKING AND AVERAGE FIDELITY
Up until now, we have treated the information extracted from RB procedures, and in particular the decay rates, as figures of merit in their own right, without establishing a direct connection to other well-known quantities such as the average gate fidelity. Indeed, this latter object is often portrayed as the conclusive result of an RB protocol.
In this section, we provide a series of arguments to validate the interpretation of the RB parameters as standalone information, by showing that connecting RB decays to the average gate fidelity presents complications that are hard to overcome. The underlying reason for this incompatibility is the gauge-dependent nature of the average gate fidelity (as argued in [26]), which cannot be established or controlled under RB. More precisely, in subsection IX A we provide an explicit example showing that adopting a gauge to match the average gate fidelity gives rise to a channel that is not physical. In subsection IX B, we substantiate our argument with an analysis of the expression of the entanglement fidelity, a quantity closely related to the average fidelity, in terms of RB decay parameters and the adopted gauge.
Observing this expression, we conclude that RB parameters and fidelity can be linked only if there is a close overlap between the dominant eigenvector of the ideal operator and the dominant, gauge-dependent left and right eigenvectors of its implemented version; the critical point is that ascertaining whether this requirement is met is not possible with an RB procedure. We want to highlight that this intricacy in connecting RB to other well-established quantities does not mean that RB protocols are inherently flawed, but only that the information they provide has to be regarded independently, with decay rates as the defining quantities characterizing the accuracy of experimentally implemented sets of gates.
A. The depolarizing gauge and in-between noise average fidelity

In an attempt to resolve the apparent disconnect between fidelity and RB decay parameters in the gate-dependent noise setting, refs. [24] and [25] have made proposals for the precise connection between RB decay rates and average fidelity. In ref. [24], it has been noted that the output data of Clifford RB can be exactly fitted to a single exponential whose decay rate is exactly interpreted as the average fidelity of the 'noise in between gates', a manifestly gauge-invariant quantity. Similarly, in ref. [25] it has been argued that the decay of Clifford RB can be regarded as the average fidelity of the implementation w.r.t. a particular gauge choice, namely the one in which the average implementation inverted with the reference representation is precisely a depolarizing channel. We will show here that (1) both of these statements can be generalized to RB with arbitrary groups, (2) both statements in fact say the exact same thing, and (3) both interpretations suffer from the same problem, namely that the channel whose average fidelity is measured by RB is not necessarily a CP map (i.e., physical), even if the implementation map φ is.
In ref. [24], the RB decay rate is interpreted as measuring the fidelity of 'the noise in between gates'. (A general version of) this construction goes as follows. For an implementation φ of a group G, close to some reference representation ω = ⊕_{λ∈Λ} σ_λ, we can pick the dominant eigenvectors vec(R_λ) of the Fourier transform F(φ) evaluated at the irreducible sub-representation σ_λ ⊂ ω (for now assuming no multiplicities; this easily generalizes). We can de-vectorize these eigenvectors and sum them up to create a super-operator R, related through φ to the generalized depolarizing channel Dep = Σ_{λ∈Λ} f_λ P_λ, with f_λ the eigenvalue corresponding to R_λ. W.l.o.g. we can assume that R is invertible (as a matrix). Note also that for any φ we can write φ(g) = Rω(g)L(g), where L(g) is some implementation map (not necessarily completely positive).
With this parametrization, the noise between two gates g, g′ (which in this parametrization only depends on g) is given by L(g)R. The entanglement fidelity w.r.t. the identity, averaged over all g ∈ G, of this map can be evaluated using the linearity and unitary invariance of the average fidelity. Note that F_avg(Dep, 1) is precisely the average fidelity one would obtain by plugging the RB decay rates f_λ into eq. (242).
On the other hand, ref. [25] connects the RB decay rates to the average fidelity of the implementation map φ in a particular gauge, that is, for a particular choice of invertible super-operator S. The map φ_dep = S^{-1}φS is called the depolarizing gauge. According to ref. [25], the correct interpretation of the RB decay rates is that they measure the fidelity of the implementation map φ in the depolarizing gauge with respect to the reference implementation ω. It turns out that the correct choice for S is precisely the operator R mentioned above, as can easily be seen by explicit computation. We can connect the above two interpretations by inserting the parametrization φ(g) = Rω(g)L(g) into the expression for φ_dep as φ_dep(g) = R^{-1}Rω(g)L(g)R = ω(g)L(g)R. (227) Hence, the depolarizing gauge is precisely the gauge in which each super-operator φ_dep(g) is viewed as the ideal super-operator ω(g) preceded by the noise in between gates L(g)R (in the sense of ref. [24]). These two interpretations of the RB decay rates as corresponding to an average fidelity of 'something' thus neatly map to each other.
A central open question in both of the above constructions is whether the noise in between gates, or equivalently the noise in the implementation in the depolarizing gauge, can always be chosen to be a completely positive implementation map. This is essential if we want to consider these interpretations as actual descriptions of reality. Here we answer this question in the negative by giving an example (an adaptation of a construction given in ref. [26]) of a point-wise CP implementation map φ where the noise in between gates (the implementation in the depolarizing gauge) is not completely positive. Let G be the single-qubit Clifford group, and consider, in the Pauli basis, super-operators T(γ), M_1(α), M_2(α). From these we construct the implementation φ(g) = T(γ)M_1(α)ω(g)M_2(α), with ω(g)(ρ) = U_g ρ U_g† the standard reference representation. It is easy to identify the transformation to the depolarizing gauge; equivalently, the noise in between gates is given by M_2(α)T(γ)M_1(α). The claim is now that there exist pairs α, γ such that φ(g) is completely positive for all g ∈ G while M_2(α)T(γ)M_1(α) is not. An easy pathological example can be obtained by setting γ = 0: for all α < 1 the maps φ(g) are CP while the map M_2(α)T(0)M_1(α) is not (this can be verified using the complete positivity conditions for qubit channels from ref. [93]). For γ < 1 one can always construct interval conditions on α such that the same holds. Hence, the interpretations of refs. [24, 25] both suffer from the problem that, in order to imagine RB as 'measuring the average fidelity' of some object, this object has to be chosen in a way that is not necessarily physical. This possibility was already indicated by both papers, but no explicit example was given.
It is unclear how to resolve this problem: one could for instance try to find natural conditions on φ such that the noise in between gates, or equivalently the implementation in the depolarizing gauge, is always completely positive. Alternatively, one could adopt the framework of ref. [27], which relaxes the problem by asking for a positive gauged implementation map whose fidelity is approximately given by the RB decay rates (with 'approximately' meaning up to an error small relative to 1 − f_λ). This can be done for Clifford RB on a single qubit [27], but generalizing to higher dimensions seems difficult (although some work in this direction has been done [94]).

B. Connecting average fidelity and randomized benchmarking decay rates
In the previous subsection we showed that the depolarizing gauge does not always give rise to a CP implementation map, and hence cannot in all cases be connected to the average fidelity of a physical process. Here we investigate the link between fidelity and the RB decay parameters under a general gauge choice S. We do this using the tools of perturbation theory employed earlier to establish theorem 8.

The randomized benchmarking measurement outcome
Let us consider a special case of theorem 8 corresponding to reference representations ω that are multiplicity-free (for simplicity), making the gauge freedom S explicit. In this situation, we can write the Fourier operator F(ω) as a direct sum of rank-1 orthogonal projections, since from eqs. (29) and (30) it follows that, for each unitary irreducible representation σ_λ of G, F(ω)[π] equals the rank-1 orthogonal projection |z(σ_λ)⟩⟨z(σ_λ)| if π and σ_λ are equivalent irreducible representations, and vanishes otherwise. Furthermore, we also assume that the Fourier transform evaluated at σ_λ is a diagonalizable operator. Since the set of diagonalizable matrices is dense [95], it is always possible to find such a diagonalizable matrix in arbitrary proximity of any given operator. We can thus write the Fourier transform of the implementation map on each irreducible representation appearing in the decomposition of ω as the perturbation E(σ_λ) := F(SφS^{-1} − ω)[σ_λ] of the rank-1 operator F(ω)[σ_λ], where f_max(σ_λ) denotes the largest eigenvalue of F(SφS^{-1})[σ_λ] and {f_{j_λ}(σ_λ)} the remaining eigenvalues. The sets of left and right eigenvectors form a bi-orthogonal system, that is, ⟨ℓ_max(σ_λ)|r_max(σ_λ)⟩ = ⟨ℓ_{j_λ}(σ_λ)|r_{j_λ}(σ_λ)⟩ = 1 and ⟨ℓ_max(σ_λ)|r_{j_λ}(σ_λ)⟩ = ⟨ℓ_{j_λ}(σ_λ)|r_max(σ_λ)⟩ = ⟨ℓ_{j_λ}(σ_λ)|r_{k_λ}(σ_λ)⟩ = 0 for j_λ ≠ k_λ. The important remark to make here is that this basis of eigenvectors reflects the gauge transformation SφS^{-1}.
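The bi-orthogonal eigensystem invoked here is easy to demonstrate numerically for a generic (non-normal) diagonalizable matrix, standing in for a perturbed Fourier operator; a minimal sketch:

```python
import numpy as np

# Bi-orthogonal left/right eigenvectors of a diagonalizable, generally
# non-normal matrix: the rows of the inverse of the right-eigenvector
# matrix are left eigenvectors, normalized so that <l_j|r_k> = delta_{jk}.
rng = np.random.default_rng(2)
A = np.eye(4) + 0.1 * rng.normal(size=(4, 4))  # small perturbation of 1

evals, R = np.linalg.eig(A)   # columns of R are right eigenvectors
L = np.linalg.inv(R)          # rows of L are the matching left eigenvectors

# Left eigenvector property: L[j] @ A = evals[j] * L[j].
print(np.max(np.abs(L @ A - np.diag(evals) @ L)))  # ~0
# Bi-orthogonality: <l_j|r_k> = delta_{jk}.
print(np.max(np.abs(L @ R - np.eye(4))))           # ~0
```

Note that for non-normal matrices the left and right eigenvectors of a given eigenvalue are generally not related by conjugate transposition, which is precisely why the overlaps discussed below can deviate from 1.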
In this scenario, we can write eq. (75) in the proof of theorem 8 for g_end = 1. By eq. (62), it follows that f_max(σ_λ), for each σ_λ in the irreducible decomposition of ω, is lower bounded by 1 − ‖E(σ_λ)‖_∞, while the sub-dominant eigenvalues, corresponding to perturbations of the kernel of F(ω)[σ_λ], are upper bounded by ‖E(σ_λ)‖_∞. Moreover, by theorem 18 presented in section X, the eigenvalues in those subspaces not related to irreducible representations appearing in the decomposition are again dominated by ‖E(σ_λ)‖_∞. Hence, we can choose m large enough such that f_max^m(σ_λ) ≫ f_{j_λ}^m(σ_λ) for all f_{j_λ}(σ_λ) and for each irreducible representation σ_λ occurring in the decomposition of ω, and such that the leakage of the perturbation into non-occurring irreducible subspaces is suppressed.

Average gate fidelity and entanglement fidelity
The first RB protocols based on the Clifford group [5, 34] linked a single decay parameter f to the average fidelity of a quantum channel R, under the assumption of gate-independent noise, i.e., φ(g) = Rω(g). The relation is given by F_avg(R) = f + (1 − f)/d. This formula generalizes to uniform RB with an arbitrary group G with reference representation ω = ⊕_{λ∈Λ} σ_λ^{⊕n_λ}, again under the assumption of gate-independent noise. However, it is more convenient to express it in terms of the entanglement fidelity F_e(R) = Tr(R)/d², where the trace is taken over the super-operator, which is related to the average gate fidelity by F_avg(R) = (d F_e(R) + 1)/(d + 1).
In particular, the entanglement fidelity can then be expressed in terms of the traces of the matrices M_λ (first formally written down in ref. [14]), with M_λ again an n_λ × n_λ matrix.
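The relation between entanglement and average gate fidelity can be checked concretely for the depolarizing channel, whose average fidelity is known in closed form, F_avg = f + (1 − f)/d. The sketch below builds its superoperator in the column-stacking convention and verifies F_avg = (d F_e + 1)/(d + 1) with F_e(R) = Tr(R)/d²:

```python
import numpy as np

# Check of F_avg = (d*F_e + 1)/(d + 1) for the depolarizing channel
# Dep_f(rho) = f*rho + (1 - f)*Tr(rho)*I/d, whose average fidelity has the
# closed form F_avg = f + (1 - f)/d.
d, f = 4, 0.85

# Superoperator of Dep_f in the column-stacking convention:
# R = f * I_{d^2} + (1 - f)/d * |vec(I)><vec(I)|,
# using Tr(rho) = <vec(I), vec(rho)>.
vec_id = np.eye(d).reshape(-1, 1)
R = f * np.eye(d * d) + (1 - f) / d * (vec_id @ vec_id.T)

F_e = np.trace(R) / d**2              # entanglement fidelity: Tr(R)/d^2
F_avg_from_Fe = (d * F_e + 1) / (d + 1)
F_avg_closed = f + (1 - f) / d
print(F_avg_from_Fe, F_avg_closed)    # agree
```

For this channel one finds F_e = f + (1 − f)/d², and the two expressions for F_avg coincide identically in f and d.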
The connection between the RB decay rates and the fidelity has been challenged in ref. [26], where it has been argued that the average fidelity and the output of RB are not related in a unique way. In doing so they introduced the concept of gauge freedom into the RB literature.
In the context of RB, gauge freedom is the observation that two implementation maps φ and φ′ give rise to the same RB output data p(m) if they are related by a similarity transformation S, i.e., φ′ = SφS^{-1}. However, the average fidelities of these implementation maps (relative to some reference implementation) will generally differ. Note that this is an issue even under the assumption of gate-independent noise; in that case, however, there is a 'canonical' choice of gauge for which the RB decay rates and the fidelity are related. In the gate-dependent noise scenario there is no such obvious gauge choice. The rest of this section is concerned with this question.
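The gauge freedom can be demonstrated numerically in a stripped-down toy model (generic matrices standing in for superoperators and a hypothetical gate-independent noise matrix, not a specific gate set): transforming the implementation as φ → SφS^{-1} while absorbing S into the state and measurement leaves every sequence value unchanged, yet a fidelity-like overlap with the ideal gates changes.

```python
import numpy as np

# Toy sketch of gauge freedom: sequence data is invariant under the gauge
# transformation phi -> S phi S^{-1} (with SPAM gauged accordingly), while
# the naive overlap Tr(omega^T phi)/n with the ideal gates is not.
rng = np.random.default_rng(3)
n = 4

def rand_orth(k):
    q, _ = np.linalg.qr(rng.normal(size=(k, k)))
    return q

omegas = [rand_orth(n) for _ in range(3)]          # "ideal" superoperators
noise = np.eye(n) + 0.2 * rng.normal(size=(n, n))  # gate-independent noise
phis = [noise @ w for w in omegas]                 # implementation

rho = rng.normal(size=n); E = rng.normal(size=n)   # vectorized SPAM
S = np.eye(n) + 0.5 * rng.normal(size=(n, n))      # gauge transformation
Sinv = np.linalg.inv(S)
phis_g = [S @ p @ Sinv for p in phis]
rho_g, E_g = S @ rho, E @ Sinv                     # gauged SPAM

seq = lambda ps, r, e: e @ ps[2] @ ps[1] @ ps[0] @ r
print(abs(seq(phis, rho, E) - seq(phis_g, rho_g, E_g)))  # ~0: identical data

overlap = lambda ps: np.mean([np.trace(w.T @ p) / n for w, p in zip(omegas, ps)])
print(overlap(phis), overlap(phis_g))  # generally differ: gauge dependence
```

The cancellation in the sequence value is exact because each S^{-1}S pair between consecutive gates collapses, mirroring the argument in the text.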
At this point we can again make use of the property in eq. (230) for F(ω)[σ_λ] and the re-formulation in eq. (233) for F(SφS^{-1})[σ_λ], and write E_g F_e(SφS^{-1}(g), ω(g)) in terms of the decay parameters f_max(σ_λ), the overlaps ⟨z(σ_λ)|r_max(σ_λ)⟩⟨ℓ_max(σ_λ)|z(σ_λ)⟩, and a residuum term α_Res. This establishes a connection between the decay parameters f_max(σ_λ) retrieved from eq. (238) and the entanglement fidelity as expressed in eq. (248).
The last regime we consider is when |1 − f_max(σ_λ)| is close to the deviation of the vector overlap from 1. This is troublesome not only because we cannot retrieve the overlap, but also because in this case Δ_max may be of the same magnitude as, or smaller than, |α_Res|. Indeed, in this regime the residuum can play a significant role in the characterization of the average gate fidelity.
The conclusion we draw from this analysis is that the overlap ⟨z(σ_λ)|r_max(σ_λ)⟩⟨ℓ_max(σ_λ)|z(σ_λ)⟩ is the key factor to consider when relating RB decays to the fidelity. This overlap must be sufficiently close to 1 under the adopted gauge, relative to the difference |1 − f_max(σ_λ)|.
Finally, we wish to relate {‖E(σ_λ)‖_∞}_λ to a promise on a physical quantity related to the perturbation of the ideal gate implementation ω. We recall that E(σ_λ) = F(SφS^{-1} − ω)[σ_λ] and use that ‖·‖_∞ ≤ ‖·‖_F together with Parseval's identity to obtain such a bound. Note, however, that the resulting sum runs over all irreducible representations of G, and not only over the ones appearing in the decomposition of ω.

X. RANDOMIZED BENCHMARKING UNDER DIAMOND NORM AND FIDELITY CONSTRAINTS
In theorem 8, we have argued that randomized benchmarking (RB) output data associated with an implementation of a group G can be approximated as a sum of (matrix) exponentials provided the implementation map φ is close to a reference representation ω w.r.t. the diamond norm (averaged over all group elements). Here we will argue that this is a natural condition to demand in the context of RB. In particular, we will show that this condition is stable, in the sense that it is impossible to be close (in the sense of eq. (72)) to two inequivalent representations at once; moreover, we show that this requirement cannot be replaced with a weaker one involving the average fidelity, resolving an open question in ref. [25].

A. Stability of representations under diamond norm
First, we prove that 'closeness to a representation' is a stable concept, that is, it is impossible to be close to two inequivalent representations at once (in a suitable sense).
Theorem 18 (Stability of representations). Let φ be an implementation map of a group G taking values in S_d, and let ω, ω′ be representations of G on V_n, V_{n′} with embedding maps L : S_d → V_n, L′ : S_d → V_{n′} and R : V_n → S_d, R′ : V_{n′} → S_d such that (1/|G|) Σ_{g∈G} ‖φ(g) − Rω(g)L‖_⋄ ≤ ε and (1/|G|) Σ_{g∈G} ‖φ(g) − R′ω′(g)L′‖_⋄ ≤ ε.
Theorem 19 (Stability of the closeness to a representation). Let φ, φ′ be implementations of a group G on the super-operators S_d that are close to each other on average, and let ω be a representation of G on V_n with associated maps L : S_d → V_n and R : V_n → S_d such that φ is close to RωL in the sense of eq. (72).

In this subsection, we argue that the condition of eq. (72) is in some sense necessary for the correct behaviour of RB, in the sense that it cannot be replaced with a natural weaker condition. Given the worst-case nature of the diamond norm, eq. (72) is rather restrictive, and one might wonder if it is possible to replace this diamond norm constraint with a more congenial constraint based on the average fidelity. That is, one can imagine replacing eq. (72) with a constraint of the form (1/|G|) Σ_{g∈G} F_avg(φ(g), ω(g)) ≥ 1 − δ for some δ > 0. Indeed, this is the assumption made in ref. [25] to prove a version of theorem 8 for the Clifford group.
There, it has been noted that in order to guarantee correct behaviour the constant δ must be chosen inversely proportional to the Hilbert space dimension (δ ∼ 1/d). It has been speculated that this dimensional scaling could perhaps be an artifact of the proof techniques used.
We will argue that this scaling is in fact real, by providing an explicit family of implementations φ_L (where L is an integer independent of d), inspired by example 8.1 in ref. [96], of a group G with (1/|G|) Σ_{g∈G} F_avg(φ_L(g), ω(g)) ≥ 1 − 2L/d relative to a reference implementation ω, but with associated RB output data that is not even qualitatively of the form of eq. (63). In fact, by choosing L large (but constant in d) we can obtain almost arbitrary non-exponential behaviour in the RB output data associated with φ_L.
Example 1 (Real scaling). Choose G to be the q-qubit Clifford group with standard reference implementation ω(g) = U_g · U_g†. Now let Λ_L^µ be a super-operator indexed by an integer L and a real number 0 ≤ µ ≤ 1, defined by its action on the basis matrices |i⟩⟨j| in terms of a d × d stochastic matrix S_µ. For convenience we write Λ for Λ_L^µ in the following. It is easy to see that Λ is a quantum channel and moreover that if i, j ≤ L then Λ(|i⟩⟨j|) ∈ Span{|i′⟩⟨j′| : i′, j′ ≤ L}.
The resulting data shows curious behaviour. For small sequence lengths we have p(|1⟩⟨1| + |L⟩⟨L|, m) ≈ µ^m, but with increasing sequence length we observe wildly non-exponential behaviour.

XI. CONCLUSIONS
In this work, we have introduced a comprehensive theory of RB. As such, it goes beyond a mere classification of known protocols (a task that we also hope to have achieved): it provides a deeper understanding, a more precise formulation and interpretation of what the data acquired in RB means, actionable advice to experimentalists and theoretical practitioners, and a conceptual platform from which new schemes can be derived. Specifically, we show how RB gives rise to exponential decays under broad classes of Markovian circumstances, show, importantly in practical contexts, in what sense RB is robust to deviations from uniform sampling, and provide further evidence for the interpretation in terms of average gate fidelities. Perhaps most important for our work to serve as a basis for substantial further development of methods and protocols are new conceptual insights into how inversion gates are, in contrast to common belief, not required for RB, and into how large classes of groups can become available to RB by means of new filtering techniques. This contributes to overcoming the problem of isolating exponential decays in a fully scalable manner. First steps towards exploiting the insights established here when devising new schemes have already been made [58-60]. We hope that this work provides a starting point for a further rich class of new protocols for quantum certification and benchmarking, providing stringent and rigorous quality criteria while respecting experimental needs and desiderata.

discussions and Susane Calegari for contributions to the illustration. The authors would also like to acknowledge an anonymous referee for pointing out the correct way to include cycle benchmarking into the framework of theorem 8.
The Berlin team has been supported by the BMBF project DAQC, for which it introduces new methods for randomized benchmarking of near-term superconducting quantum platforms, and the BMBF project MUNIQC-ATOMS, for which it introduces a starting point to develop schemes of analog randomized benchmarking. It has also been funded by the DFG (EI 519/9-1, for which this work develops ideas of signal processing, and DFG CRC 183, for which this is an inter-node Berlin-Copenhagen work, as well as DFG EI 519/14-1), and the Munich Quantum Valley (K-8