Practical Trainable Temporal Postprocessor for Multistate Quantum Measurement

arXiv:2310.18519v3 [quant-ph] 27 Jul 2024

I. INTRODUCTION
High-fidelity quantum measurement is essential for any quantum information processing scheme, from quantum computation to quantum machine learning. However, while measurement optimization has focused on quantum hardware advancements [1][2][3], several modern experiments operate in regimes where optimal hardware conditions are difficult to sustain, or, for machine learning with general quantum systems [4][5][6][7][8], may not always be known. For example, in the push towards higher qubit readout fidelities with complex multi-qubit processors in circuit QED (cQED), optimization of individual readout resonators becomes increasingly difficult. More importantly, finite qubit coherence means that simply extending the measurement duration is not a viable option to enhance fidelity: faster, and hence higher-power, measurements are needed. However, these readout powers are associated with enhanced qubit transitions, leading to the T_1-versus-n̄ problem [9][10][11][12][13][14][15] and excitation to higher states [14,16,17] outside the computational subspace. Machine learning with quantum devices operating in unconventional regimes allows for an even broader range of complex dynamics. Quantum measurement data obtained under these conditions cannot be expected to be optimally analyzed using schemes built for more standard readout paradigms [18]. A practical approach to extract the maximum information possible from such data is therefore timely.
In this paper, we demonstrate a machine learning scheme to optimally process quantum measurement data for completely general quantum state classification tasks. For the most common such task of single-shot qubit state readout, standard post-processing of measurement records has remained relatively unchanged (with some exceptions [19,20]): data is filtered using a "matched filter" (MF) constructed from the sample mean of measurement records for the two states to be distinguished (for example, states |e⟩ and |g⟩ of a qubit). Crucially, the MF thus defined applies only to binary classification and, much more restrictively, is optimal only under idealized conditions in which the readout signal is subject to Gaussian white (i.e. uncorrelated) noise processes [21]. In many deployments where complex conditions prevail (such as multi-qubit readout), an even simpler and less optimal boxcar filter is employed, due to the ease of its construction. Our approach harnesses machine learning to provide a model-free trainable temporal post-processor (TPP) of quantum measurement data under the most general noise conditions, and for an arbitrary number of states of a generic measured quantum system (see [22] for source code). We test our approach by applying it to the experimental readout of distinct qubits across a range of measurement powers. Our results demonstrate that the TPP reliably outperforms the standard MF whenever measured data exhibits nontrivial temporal correlations, including those of a quantum origin. We find an important such regime, one that has garnered significant recent attention, to be that of high-power readout [16,[23][24][25][26]; here, we experimentally show that the TPP can provide a reduction in errors by up to 30% in certain cases. Furthermore, the TPP achieves this improvement while requiring only linear weights applied to quantum measurement data (see Fig. 1): this makes it compatible with FPGA implementations for real-time hardware processing, and exacts a lower training cost [27,28] than neural network-based machine learning schemes [20,29,30].
Machine learning has already been established as a powerful approach to classical temporal data processing, providing state-of-the-art fidelity in tasks such as time-series prediction [31], and chaotic systems' forecasting [32][33][34] and control [35]. Adapting this approach to quantum state classification, as we do here, requires its application to time-evolving quantum signals. Signals extracted from the readout of quantum systems are often dominated by noise, making their processing distinct from that required of typical data from classical systems. More importantly, the noise in such signals can arise from truly quantum-mechanical sources, such as stochastic transitions between states of a multi-level atom (quantum 'jumps'), or vacuum fluctuations in quantum modes. A key finding of our work is that the TPP is able to learn from precisely these quantum noise correlations in data extracted from quantum systems to improve classification fidelity. To uncover this essential principle of TPP learning, we first develop an interpretation of the TPP as the application of optimal filters to quantum measurement data. This provides a framework to quantify and visualize what is 'learned' by the TPP from a given dataset. Secondly, the TPP is tested on simulated quantum measurement datasets generated using stochastic master equations, where quantum noise sources, and hence their correlation signatures in measured data, can be precisely controlled.
Using simulated datasets where all noise sources contribute additive Gaussian white noise (a reasonable assumption for measurement chains under asymptotically ideal conditions), we show that the TPP provides filters that reduce exactly to the matched filter for binary classification. More importantly, as the TPP is valid for the classification of any number of states, it provides the generalization of matched filters to arbitrary state classification. We then provide a systematic analysis of the TPP applied to quantum measurement with more complex quantum noise sources, such as quantum amplifiers adding correlated quantum noise, or noise due to state transitions. In such scenarios the TPP provides filters adapted to the noise characteristics: we also provide an efficient semi-analytic form for these general TPP filters, which can deviate substantially from filters learned under the white noise assumption, and crucially outperform the latter in qubit classification. By learning from quantum noise correlations, the TPP therefore utilizes a characteristic of quantum measurement data that is inaccessible to post-processing schemes relying on noise-agnostic matched filtering methods.
The established learning principles provide a structure and interpretability to the general applicability of the TPP which enhances its practical utility. First, the exact mapping to matched filters under appropriate noise conditions places the TPP on firm footing, guaranteed to perform at least as well as these baseline methods. Secondly, and much more importantly, the TPP's ability to learn from noise (crucially, quantum noise) renders it able to beat the MF when noise conditions change. This theoretical adaptability becomes practical due to the TPP's straightforward training procedure, which is also ideal for autonomous repeated calibrations, necessary on even industrial-grade quantum processors [36][37][38]. Ultimately, the trainable TPP could provide an ideal component to optimally process quantum measurement data from general quantum devices used for machine learning, which may exhibit exotic quantum noise characteristics.

FIG. 1. The objective is to process temporal data corresponding to an unknown state (indexed σ) of an arbitrary physical system (here, the state of a qubit in a quantum measurement chain) to estimate the true label σ with maximum accuracy. The TPP approach uses a set of weights W and biases b to map the vector ⃗x of measured data, comprising an instance of N_O observables, each a time series of length N_T, to the corners of a hypercube in C-dimensional space. Optimal values of W and b are learned by training to realize this mapping with minimal error, in a least-squares sense. Scatter plots shown in C = 3 dimensional space are data from real qubit p ∈ {e, g, f} readout after the TPP.
The rest of this paper is organized as follows. In Sec. II we introduce the TPP framework for multi-state classification: a model-free supervised machine learning approach that can be applied to the classification of arbitrary time series. We also introduce the task used to demonstrate the TPP (dispersive qubit readout in the cQED architecture) and the standard approaches currently used for this task. In Sec. III we draw connections between the TPP approach and these standard filtering-based approaches to qubit state measurement, and provide the TPP's generalization of matched filtering to arbitrary states. In Sec. IV, we apply the developed TPP framework to experimental data for qubit readout, showing that it can outperform standard matched filtering at strong measurement powers relevant for high-fidelity readout. Sec. V explores the learning principles that enable the TPP to be more effective than standard matched filters, using controlled simulations. We conclude with a discussion of the general applicability of the TPP for quantum state classification and temporal processing of quantum measurement data.

II. TRAINABLE TEMPORAL POST-PROCESSOR FOR MULTI-STATE CLASSIFICATION
To overview its key features we first introduce the mathematical framework underpinning our trainable temporal post-processor (TPP), which is defined as follows. We consider N_O continuously measured observables, each measurement yielding a time series of length N_T. All measured data corresponding to an unknown state with index σ can be compiled into the vector ⃗x^(σ), which thus exists in the space ⃗x^(σ) ∈ R^{N_O·N_T}; here ⃗(·) specifies vectors containing vectorized temporal data. Examples will be provided shortly (see also Fig. 1).
Formally, the operation of the TPP is then described as an input-output transformation, mapping a vector ⃗x^(σ) from the space of measured data, R^{N_O·N_T}, to a vector y ∈ R^C in the space of class labels; the scalar predicted class label σ_est is given by an operation F[·] on this vector y, so that the complete transformation is:

y = W ⃗x^(σ) + b,  σ_est = F[y].  (1)

Crucially, the TPP transformation, defined by a trainable matrix of weights W ∈ R^{C×N_O·N_T} and a trainable vector of biases b ∈ R^C, is linear. Machine learning using only linear trainable weights has shown remarkable success in time-dependent supervised machine learning tasks mapping time series to a dynamically-evolving target function, although with a focus on classical data with weak noise [27,28]. Here, we adapt this framework to the processing of temporal measurement data from a quantum system and with a time-independent target, as is relevant for initial state classification [21].
More precisely, W and b are both learned from sampled data ⃗x^(p) with known labels p (C in total) in a supervised learning framework. The target y ∈ R^C for any instance of ⃗x^(p) is taken to be a vector with only one nonzero element, a single 1 at index p, defining a corner of a C-dimensional hypercube (referred to as one-hot encoding, see Fig. 1). Then, the optimal W_opt, b_opt minimize a least-squares cost function to achieve this target with minimal error:

(W_opt, b_opt) = argmin_{W,b} ∥ W X + b − Y ∥².  (2)

Here X is the matrix containing the complete training dataset, comprising N_train instances of ⃗x^(p) for each class p, while Y is the corresponding set of targets (see Appendix C for full training details).
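As a concrete illustration, this training step reduces to a single ordinary least-squares solve. The minimal NumPy sketch below is our own (it is not the released source code [22], and all function and variable names are hypothetical); it absorbs the bias b into the weights via a constant feature:

```python
import numpy as np

# Illustrative TPP training sketch (our construction, not the paper's code).
# X: (N_shots, N_O*N_T) matrix of vectorized measurement records;
# labels: integer class index p in {0, ..., C-1} for each row.
def train_tpp(X, labels, C):
    N = X.shape[0]
    # One-hot targets Y: each row is a corner of the C-dimensional hypercube.
    Y = np.zeros((N, C))
    Y[np.arange(N), labels] = 1.0
    # Absorb the bias b into the weights by appending a constant-1 feature.
    Xa = np.hstack([X, np.ones((N, 1))])
    # Least-squares solution of Xa @ Wa = Y: a convex problem with a
    # closed-form optimum.
    Wa, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
    W, b = Wa[:-1].T, Wa[-1]          # W: (C, N_O*N_T), b: (C,)
    return W, b

def classify(W, b, x):
    # F[.] taken as argmax over the C outputs y = W x + b.
    return int(np.argmax(W @ x + b))
```

The untrained readout stage F[·] here is the simple argmax of Eq. (1); swapping in a Gaussian discriminator would change only `classify`, not the training solve.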
A distinguishing feature of the TPP framework amongst other ML paradigms is that its optimization is convex, and hence guaranteed to converge. We note that the function F[·] used to map the TPP output to an estimated class label is untrained, and hence does not affect the training complexity; it is often taken to be the argmax{·} function that extracts the position of the largest element in y. However, it can also be a more general classifier, such as a Gaussian discriminator (clarified shortly). The dimensions of the various components making up the TPP framework are summarized in Table I.
A. Learning from noise correlations

While Eq. (1) presents a formal mathematical formulation of the TPP framework in the machine learning context, we can develop further understanding of how the TPP learns from data to enable classification. To this end, we first note that this stochastic measurement data can be written in the very general form:

⃗x^(σ) = ⃗s^(σ) + ⃗ζ^(σ).  (3)

Here ⃗ζ^(σ) describes the stochasticity of the measured data: most importantly, we are interested in data where ⃗ζ^(σ) is dominated by contributions from quantum noise sources. We take the noise process to have zero mean, E[⃗ζ^(σ)_j] = 0, where E[·] denotes ensemble averages over distinct noise realizations (obtained for distinct measurements). Then, ⃗s^(σ) = E[⃗x^(σ)] is simply the sample mean of the measured data traces for state σ. Crucially, the noise is characterized by nontrivial second-order temporal correlations, which we define as

Σ^(σ)_{jk} = E[⃗ζ^(σ)_j ⃗ζ^(σ)_k].

Higher-order correlations of the noise can also be generally non-zero, but are not explicitly analyzed here due to the TPP's use of a quadratic loss function.
The use of a least-squares cost function in Eq. (2) is now crucial: it means that a closed form of the optimal weights W_opt and biases b_opt learned by the TPP can be obtained (see Appendix D). Furthermore, the form of Eq. (3) allows us to write these learned weights and biases as

W_opt = M D⁻¹.  (4)

Here M is a matrix that depends only on the mean traces (full form in Appendix D). In contrast, D is the matrix of second-order moments:

D = G + V,  (5)

which depends on the "Gram" matrix of mean traces, G = Σ_c ⃗s^(c)(⃗s^(c))^T, but also on the temporal correlations via the matrix V ≡ Σ_c Σ^(c). Both these quantities emerge naturally in the analysis of the resolvable expressive capacity of physical systems that are subject to noise [39]. Here, Eq. (4) implies that the weights learned by the TPP are not determined only by data means via G, but are also sensitive to temporal correlations through V. This simple feature distinguishes the TPP from standard classification approaches, a result we demonstrate in the rest of our analysis.
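The split of the second-moment matrix into a mean-trace part G and a noise-correlation part V can be verified directly on toy data. The sketch below is our own construction (synthetic correlated Gaussian noise shared across classes); it checks the identity Σ_c E[⃗x^(c)(⃗x^(c))^T] = G + V numerically:

```python
import numpy as np

# Numerical check (toy data, ours) that the class-summed second-moment matrix
# decomposes as D = G + V, with G the Gram matrix of mean traces and
# V = sum_c Sigma^(c) the summed noise-correlation matrices.
rng = np.random.default_rng(0)
N_T, C, N_shots = 8, 3, 50_000
means = [rng.normal(size=N_T) for _ in range(C)]
L = 0.3 * rng.normal(size=(N_T, N_T))
Sigma = L @ L.T                                  # correlated-noise covariance

D_emp = np.zeros((N_T, N_T))
for s in means:
    x = rng.multivariate_normal(s, Sigma, size=N_shots)
    D_emp += (x.T @ x) / N_shots                 # empirical E[x x^T] per class

G = sum(np.outer(s, s) for s in means)           # Gram matrix of mean traces
V = C * Sigma                                    # every class shares Sigma here
assert np.allclose(D_emp, G + V, atol=0.2)
```

Because E[⃗x⃗x^T] = ⃗s⃗s^T + Σ for each class, the decomposition holds for any noise statistics with finite second moments, not just the Gaussian case sampled here.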

B. Quantum noise in dispersive qubit readout
We will demonstrate the utility of the TPP framework for contemporary cQED applications by focusing on the readout of dispersive qubit-cavity systems. However, we emphasize that the TPP is model-free: it can process data ⃗x generated by an arbitrary physical system, without any knowledge of its underlying physical model. Nevertheless, we introduce a simplified theoretical model of dispersive qubit-cavity systems below, to lay a foundation for the interpretability of the TPP [40]. First, this enables us to identify the sources of quantum noise at play in dispersive qubit readout. More importantly, we use this model to generate benchmarking datasets with controlled, practically-relevant quantum noise characteristics: the TPP's application to these datasets with known temporal correlations in Secs. III and V allows us to interpret its learning principles. The ultimate test of the TPP remains its application to real qubit readout data, in Sec. IV.
The standard quantum measurement chain for heterodyne readout of a multi-level artificial atom (here, a transmon) dispersively coupled to a readout cavity is depicted schematically in Fig. 1, and can be modeled via the stochastic master equation (SME):

dρ̂_c = L_sys ρ̂_c dt + L_envt ρ̂_c dt + L_meas ρ̂_c.  (6)

Here the Liouvillian superoperator L_sys defines the quantum system whose states are to be read out. For dispersive qubit readout, L_sys ρ̂ = −i[Ĥ_disp, ρ̂], where the dispersive Hamiltonian Ĥ_disp for a multi-level transmon takes the form (for cavity operators in the interaction frame with respect to an incident readout tone at frequency ω_d, and setting ℏ = 1)

Ĥ_disp = (−∆_da + Σ_p χ_p |p⟩⟨p|) â†â.  (7)

Here ∆_da = ω_d − ω_a is the detuning between the cavity and the readout tone, while χ_p is the dispersive shift per photon when the artificial atom is in state |p⟩ [41,42].
Unfortunately, the artificial atom can undergo transitions from its initial state via unmonitored loss channels, which can reduce readout fidelity; all losses through such channels are described by the general Liouvillian L_envt.
The final superoperator L_meas defines measurement chain components that are actively monitored to read out the state of the quantum system of interest. Here, we consider continuous heterodyne monitoring of a single quantum mode of the measurement chain, generally labelled d̂. In the simplest case, L_meas defines readout of the cavity itself (then, d̂ → â); however, it can also describe the dynamics (coherent or otherwise) of any other monitored quantum devices in the measurement chain. The most pertinent example is readout of the signal mode of an (ideally linear) quantum-limited amplifier that follows the dispersive qubit-cavity system via an intermediate circulator, as shown schematically in Fig. 1. Most generally, L_meas can describe the monitoring of several modes of a general quantum nonlinear processor embedded in the measurement chain [5]. Crucially, L_meas must include a stochastic component (indicated by the Wiener increment dW), describing measurement-conditioned dynamics of the dispersive qubit-cavity system under such continuous monitoring (see Appendix B).
For a qubit in the (a priori unknown) initial state |σ⟩ before measurement, continuous monitoring of the measurement chain then yields a single 'shot' of heterodyne records {I^(σ)(t), Q^(σ)(t)} contingent on this state σ. The complexity of this readout task can be appreciated from the form of raw heterodyne records even under a simplified theoretical model:

I^(σ)(t_i) = ⟨X̂^(σ)(t_i)⟩ + ξ_I(t_i) + ξ^qm_I(t_i) + ξ^cl_I(t_i),
Q^(σ)(t_i) = ⟨Ŷ^(σ)(t_i)⟩ + ξ_Q(t_i) + ξ^qm_Q(t_i) + ξ^cl_Q(t_i).  (8)

We consider discretized temporal indices t_i, for i ∈ [N_T] and N_T = T_meas/∆t, where T_meas is the total measurement time and ∆t is the sampling time set by the digitizer. Heterodyne measurement is intended to probe the expectation values ⟨X̂^(σ)(t_i)⟩, ⟨Ŷ^(σ)(t_i)⟩ of the monitored mode d̂; however, any individual measurement record is obscured by noise ξ from various sources. Vacuum noise ξ_{I/Q}(t_i) is associated with heterodyne measurement of even an empty cavity, and is modelled as zero-mean Gaussian white noise, uncorrelated between distinct times t_i ≠ t_j. More importantly, ξ^qm_I(t_i), ξ^qm_Q(t_i) describe quantum noise contributions to measurement records, whose origin is intrinsically tied to the nature of quantum measurement. The measurement of a quantum system imposes an evolution of its state, so that a given measurement affects the outcome of subsequent measurements. This effect is described via a measurement-conditioned stochastic quantum state ρ̂_c (referred to as a quantum trajectory), which is distinct from the unconditional quantum state ρ̂ formally obtained when ensemble-averaging over repeated measurements. Consequently, for any given measurement instance, observables such as the conditional quadrature expectation ⟨X̂^(σ)(t_i)⟩_c = Tr{X̂ ρ̂^(σ)_c(t_i)} under heterodyne monitoring can deviate from the unconditional ensemble average ⟨X̂^(σ)(t_i)⟩; this difference, given by ξ^qm, manifests as quantum noise. These terms include amplified quantum fluctuations when measuring the output field from a quantum amplifier (see Sec. V A), or the influence of quantum jumps in the measured cavity field due to transitions of the dispersively coupled qubit (see Sec. V B). Finally, ξ^cl_{I/Q}(t_i) describe classical noise contributions to measurement records, for example noise added by classical HEMT amplifiers. While the statistics of this noise may take different forms, it is formally distinct from heterodyne measurement noise, as it has no associated stochastic measurement superoperator in Eq. (6).
The objective of the qubit readout task is then to use noisy single-shot [43] temporal measurement data to obtain an estimated class label σ_est that is ideally equal to the true class label σ. Within the TPP framework, then,

⃗x^(σ) = (I^(σ)(t_1), …, I^(σ)(t_{N_T}), Q^(σ)(t_1), …, Q^(σ)(t_{N_T}))^T,

so that N_O = 2 for heterodyne readout. The noise ⃗ζ in Eq. (3) then contains the terms ξ, ξ^qm, ξ^cl. However, before describing TPP results, we first briefly review standard approaches to qubit state classification.
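A toy numerical stand-in for such records makes this vectorization concrete. The sketch below is entirely our own construction (exponential cavity ring-up envelopes plus additive Gaussian white noise only, i.e. no ξ^qm or ξ^cl terms; all names are hypothetical):

```python
import numpy as np

# Toy stand-in for single-shot heterodyne records (illustrative only): mean
# quadrature traces given by a cavity ring-up envelope, plus additive Gaussian
# white noise; quantum (xi^qm) and classical (xi^cl) noise terms are omitted.
def synthetic_shot(state, N_T=100, noise=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    t = np.linspace(0, 1, N_T)
    ring_up = 1 - np.exp(-5 * t)              # cavity ring-up envelope
    sign = +1 if state == 'e' else -1         # dispersive shift flips the phase
    I = sign * ring_up + noise * rng.normal(size=N_T)
    Q = 0.5 * ring_up + noise * rng.normal(size=N_T)
    # Vectorize both quadratures into x in R^{N_O * N_T}, with N_O = 2.
    return np.concatenate([I, Q])
```

Averaging many such shots recovers the mean traces ⃗s^(σ) of Eq. (3), while any single shot is dominated by the noise term ⃗ζ^(σ).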

C. Standard post-processing for binary qubit state readout: matched filters
The standard classification paradigm in cQED for obtaining σ_est from raw heterodyne records would formally be described as a filtered Gaussian discriminant analysis (FGDA) in contemporary learning theory [44]. This comprises two stages: (i) temporal filtering of each measured quadrature, and (ii) assigning a class label to the filtered quadratures that maximizes the likelihood of their observation amongst all C classes, as determined by a Gaussian probability density function. Formally, this procedure can be written as:

σ_est = G[ Σ_i h_I(t_i) I(t_i), Σ_i h_Q(t_i) Q(t_i) ].  (10)

The function G[·] assigns class labels according to the aforementioned Gaussian discriminator. A fact seldom mentioned explicitly is that both the temporal filters and the Gaussian discriminator must be constructed using a calibration dataset, analogous to the training phase of the TPP: a set of N_train heterodyne records obtained when the initial qubit states are known under controlled initialization protocols. For example, for the most commonly considered case of binary qubit state classification to distinguish states |e⟩ and |g⟩, and under the assumption that the noise in heterodyne records is additive Gaussian white noise, an optimal filter is known: the matched filter [21,45,46]. The empirical matched filter is constructed from the calibration dataset, where (n) indexes distinct records, via

⃗h_I = (1/N_train) Σ_n [ ⃗I^(e,n) − ⃗I^(g,n) ],  (11)

with ⃗h_Q defined analogously for I → Q. The function G[·] requires fitting Gaussian profiles to measured probability distributions of known classes, and hence uses means and variances estimated from calibration data. While a Gaussian discriminant analysis can be applied to the classification of an arbitrary number of states C and beyond white noise constraints, the choice of an optimal temporal filter in these more general situations is not straightforward [47]. Due to its ease of construction, often a matched filter akin to Eq. (11), or an even more rudimentary boxcar filter (a uniform filter that is nonzero only when the measurement signal is on), is deployed regardless of the complexity of the noise conditions (for example, when qubit decay is significant and more optimal filters can be found [21]). We will show how the TPP approach provides a natural generalization of matched filtering to multi-state classification, and furnishes a trainable classifier that can generalize to more complex noise environments.
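For concreteness, both standard filters can be built from calibration records in a few lines. The NumPy sketch below uses our own naming, and reduces the Gaussian-discriminator stage to its equal-variance special case (a midpoint threshold on the filtered scalar):

```python
import numpy as np

# Ie, Ig: calibration I-quadrature records of shape (N_train, N_T) for known
# initial states |e> and |g>.
def matched_filter(Ie, Ig):
    # Eq. (11)-style construction: difference of sample-mean traces.
    return Ie.mean(axis=0) - Ig.mean(axis=0)

def boxcar_filter(N_T, i_on, i_off):
    # Uniform weight only while the measurement tone is on.
    h = np.zeros(N_T)
    h[i_on:i_off] = 1.0
    return h

def discriminate(h, Ie, Ig, I_new):
    # Simplified 1D Gaussian discriminator: for equal class variances this
    # reduces to assigning the class whose filtered mean is closest.
    mu_e, mu_g = (Ie @ h).mean(), (Ig @ h).mean()
    score = I_new @ h
    return 'e' if abs(score - mu_e) < abs(score - mu_g) else 'g'
```

Note that both the filter and the discriminator thresholds are estimated from the same calibration dataset, mirroring the TPP's training phase.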

III. TPP LEARNING AS OPTIMAL FILTERING: GENERALIZED MATCHED FILTERS
To understand how the TPP generalizes standard matched filtering approaches, we first show an important connection between the two schemes. Note that the learned matrix of weights W_opt ∈ R^{C×N_O·N_T} can be equivalently expressed as:

W_opt = ( ⃗f_1, ⃗f_2, …, ⃗f_C )^T,  (12)

where ⃗f_k ∈ R^{N_O·N_T} is the kth row of W_opt. With this parameterization, Eq. (1) for the kth component of the vector y can be rewritten as:

y_k = ⃗f_k · ⃗x + b_k.  (13)

When Eq. (13) is compared against Eq. (10), the interpretation of ⃗f_k becomes clear: this set of weights can be viewed as a temporal filter applied to the data ⃗x. TPP-based classification can therefore be interpreted as the application of C filters (one for each k) to obtain the estimated label σ_est. The optimal W_opt therefore defines the optimal filters that enable this estimation with minimal error. The use of C optimal filters for a C-state classification task indicates the linear scaling of the TPP approach with the complexity of the task.
Remarkably, the optimal W_opt given by Eq. (4), and hence the C optimal filters, can be expressed in the simple semi-analytic form:

⃗f_k = Σ_p C_kp V⁻¹ ⃗s^(p),  (14)

where the mean traces ⃗s^(p) and correlation matrix V = Σ_p Σ^(p) can both be empirically estimated from data under the known initial state p, while the coefficients C_kp can also be shown to depend only on ⃗s^(p) and V (see Appendix D for full details). Furthermore, we note that the C filters are not all independent; they can be shown to satisfy the constraint (see Appendix D)

Σ_{k=1}^{C} ⃗f_k = ⃗0,  (15)

where ⃗0 ∈ R^{N_O·N_T} is the null vector. This powerful constraint, which holds regardless of the statistics of the noise ⃗ζ, implies that only C − 1 of the C filters need to be learned from training data.
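The sum-to-zero constraint on the C filters can be verified directly on toy data. The following sketch is our own (plain least squares with the bias absorbed into the weights, noise statistics chosen arbitrarily); it recovers the constraint to numerical precision, illustrating that only C − 1 filters are independent:

```python
import numpy as np

# Numerical check (toy data, ours) that the C rows of the learned W, i.e. the
# C temporal filters f_k, sum to the null vector, while the biases sum to 1.
rng = np.random.default_rng(1)
C, N_T, N_train = 3, 16, 500
means = [rng.normal(size=N_T) for _ in range(C)]

X = np.vstack([m + 0.5 * rng.normal(size=(N_train, N_T)) for m in means])
labels = np.repeat(np.arange(C), N_train)
Y = np.eye(C)[labels]                            # one-hot targets

Xa = np.hstack([X, np.ones((len(X), 1))])        # bias absorbed into weights
Wa, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
filters, biases = Wa[:-1].T, Wa[-1]              # rows f_k and biases b_k

assert np.allclose(filters.sum(axis=0), 0.0, atol=1e-8)
assert np.isclose(biases.sum(), 1.0)
```

Intuitively, the one-hot targets of the C classes sum to the constant vector 1, which the bias alone can fit exactly; by linearity of the least-squares solution, the filter rows must therefore sum to zero for any noise statistics.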
A. TPP performance under Gaussian white noise in comparison to standard FGDA

We can now analyze the case most often assumed in cQED: that the dominant noise source in the heterodyne records I, Q is stationary Gaussian white noise (independent of the undetermined state), an assumption under which matched filters are optimal for binary classification. Engineering of cQED measurement chains is geared towards approaching this limit, by (i) developing large-bandwidth, high-dynamic-range amplifiers that operate with fast response times and minimal nonlinear effects even at high gain and large input signal powers [48,49], [50][51][52][53], (ii) improving qubit T_1 and tolerance to strong cavity drives to reduce transitions during T_meas [3], and (iii) controlling technical noise sources such as electronic white noise from classical cryo-HEMT amplifiers and room-temperature electronics.
In this relevant limit, the correlation matrix V of Eq. (5) becomes proportional to the identity matrix, and the resulting TPP-learned filters depend only on the mean traces ⃗s^(p). For any C = 2 state classification task, for example p ∈ {e, g} qubit readout, we can show that C_ke = −C_kg, which reduces ⃗f_k exactly to a standard binary matched filter. Remarkably, the TPP-learned optimal filters in the Gaussian white noise approximation then provide a semi-analytically calculable generalization of matched filters to C states.
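This reduction can be checked numerically: training the linear TPP on toy two-state records with purely white noise yields a filter parallel to the matched filter ⃗s^(e) − ⃗s^(g). The sketch below (our own toy mean traces and training setup) measures the alignment via cosine similarity:

```python
import numpy as np

# Sketch: under stationary Gaussian *white* noise, the TPP-learned binary
# filter aligns with the matched filter s^(e) - s^(g) up to overall scale.
rng = np.random.default_rng(2)
N_T, N_train = 16, 20_000
t = np.linspace(0, 1, N_T)
s_e = 1 - np.exp(-4 * t)                 # toy mean traces for the two states
s_g = -(1 - np.exp(-4 * t))

X = np.vstack([s_e + rng.normal(size=(N_train, N_T)),
               s_g + rng.normal(size=(N_train, N_T))])
Y = np.repeat(np.eye(2), N_train, axis=0)          # one-hot targets
Xa = np.hstack([X, np.ones((2 * N_train, 1))])     # bias absorbed
Wa, *_ = np.linalg.lstsq(Xa, Y, rcond=None)

f_e = Wa[:-1, 0]                         # TPP-learned filter for class e
h_mf = s_e - s_g                         # matched filter from mean traces
cos = f_e @ h_mf / (np.linalg.norm(f_e) * np.linalg.norm(h_mf))
assert cos > 0.99
```

Replacing the white noise here with temporally correlated noise breaks this alignment: the learned filter then acquires the V⁻¹ structure of Eq.-(14)-type expressions and departs from the matched filter.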
We can now analyze the multi-state classification performance enabled by these TPP-learned optimal filters in comparison to the standard FGDA approach. To guarantee dispersive qubit readout data that is subject only to white noise, we use a theoretical simulation of Eq. (6) to generate measured heterodyne records for C qubit states, under the following assumptions: (i) all qubit state transitions are neglected, (ii) any additional classical noise sources in the measurement chain are ignored, and (iii) direct readout of the cavity is therefore considered, instead of the use of a quantum amplifier and the potential quantum noise added by it. We take the cavity measurement tone to be applied for a subset [T_on, T_off] of the total T_meas, and to be coincident with the cavity center frequency so that ∆_da = 0, as is usual for transmon readout (for full details, see Appendix B 1). Other system parameters can be found in the caption of Fig. 2.
The TPP can be used to generate optimal filters, and hence perform classification, for arbitrary C; for examples of calculated filters, see Fig. 10 in Appendix D. For concreteness, here we analyze the classification performance enabled by the TPP in distinguishing C = 3 states p ∈ {e, g, f}. Our choice of resonantly driving the readout cavity means the sign of the cavity dispersive shifts for transmon states e and f is the same, and is opposite to that for g, making these two states harder to distinguish (see also Fig. 2 inset). We note that the specific details of the readout scheme do not change the TPP learning procedure.
For this three-state classification task, a unique filter choice for the FGDA is not known. While certain approaches to constructing filters have been attempted [54], boxcar filtering is still commonly employed. Another approach might be to use a matched filter that optimizes the distinction of just one pair of states. There are three such filters in total: one for discrimination of the e-g states as defined in Eq. (11), as well as analogously-defined filters for the e-f and g-f states.
In Fig. 2, we show classification infidelities 1 − F, calculated for datasets with increasing measurement tone amplitude (more opaque markers), using both the optimal TPP filter and the FGDA with the four aforementioned filter choices. We emphasize again that these datasets are generated via simplified theoretical simulations guaranteeing white noise conditions, in particular ignoring any non-idealities associated with strong readout drives; under these conditions, classification performance improves steadily with increasing measurement tone amplitude, as shown. Even in this regime, we clearly observe that the FGDA infidelities for most filter choices are worse than those of the TPP. Interestingly, the poorest performer is not the boxcar filter; instead, it is the e-g filter, which would be optimal if we were only distinguishing the {e, g} states, that yields the worst performance. This is because the e-g filter is completely unaware of the f state: it attempts to best discriminate e and g, but in doing so substantially confuses the e and f states, which are already the hardest to distinguish. The e-f filter corrects this major problem and hence performs better, but does not discriminate e and g as well as the e-g filter would. Due to the specific driving conditions and phases, the g-f filter unwittingly does a good job of addressing both these problems, yielding the best performance. Nevertheless, it can only match the TPP. This trial-and-error approach relies on knowledge of optimal matched filtering from binary classification, but clearly cannot be optimal for C > 2: none of the filter choices are informed by the statistical properties of measured data for all C classes to be distinguished. Alternative approaches, such as using multiple classifiers with up to C − 1 independent filters (for an equivalent resource cost to the TPP), can account for all classes, but as we show in Appendix D 7 they do not outperform the TPP, and also exhibit a dependence on readout conditions. In either case, the brute-force determination of pairwise matched filters scales at least with the number of distinct state pairs, which grows quadratically with C; this is before one even accounts for fine-tuning of filter coefficients (analogous to learning C_kp in the TPP approach). In contrast, the TPP approach provides a simple automated scheme to learn optimal filters, takes data for readout of all classes into account, is model-free and thus applicable to arbitrary readout conditions, and scales only linearly with the task dimension set by C.

FIG. 2. We consider dispersive qubit readout to distinguish states p ∈ {e, g, f} as a function of measurement power, for a transmon with dispersive shifts χ_p ∈ {−χ, χ, −3χ}, where χ/κ = 0.195 and κ/2π = 1.54 MHz. More opaque markers indicate higher measurement tone amplitudes. The inset shows the induced dispersive shifts for each state (not to scale). The standard FGDA is performed using one of three MFs corresponding to each distinct state pair, as well as a boxcar filter. TPP filters are also followed by a Gaussian discriminator for an equivalent comparison. Only one of the binary MFs allows the FGDA to approach the TPP, while all other filter choices yield a worse performance.
However, the true strength of TPP learning arises when the noise in measured heterodyne records no longer satisfies the additive Gaussian white noise assumption, which may be the case if any of the conditions (i)-(iii) for qubit measurement chains listed earlier are not met. Departures from this ideal scenario are widely prevalent in cQED, and will be apparent in the experimental results of the following section. Through the rest of this paper, we show how the trainability of the TPP approach enables it to learn filters tailored to these more general noise conditions, and consequently to outperform the standard FGDA based on binary matched filters.

IV. TPP LEARNING FOR REAL QUBITS

A. Experimental Results
To demonstrate how the general learning capabilities of the TPP approach can aid qubit state classification in a practical setting, we now apply it to the readout of finite-lifetime qubits in an experimental cQED measurement chain. The essential components of the measurement chain are as depicted schematically in Fig. 1 and described by Eq. (6). The actual circuit diagram is shown in Fig. 8 in Appendix A, and important parameters characterizing the measurement chain components are summarized in Fig. 3(a).
We consider two distinct cavity systems, for the dispersive readout of distinct single qubits A and B to discriminate states p ∈ {e, g}. For lossless qubits read out dispersively for a fixed measurement time T_meas, the ratio χ/κ determines the theoretical maximum readout fidelity; in particular, an optimal value for this ratio is known under these ideal conditions [42]. However, experimental considerations mean that operating parameters must be designed with several other factors in mind. At high χ/κ ratios with modest or higher κ, at large κ with modest χ/κ ratios, and especially when both are true, the experiment is sensitive to dephasing from the thermal occupation of the readout resonator, at a rate proportional to n̄κ [55]. This can be quite limiting to the T_2 dephasing time of the qubit if the readout resonator is strongly coupled to the environment and/or the environment has an appreciable average thermal photon occupation n̄. In the opposite low-χ/κ limit, the qubit is shielded from thermal dephasing, but readout becomes very difficult, as the rate at which one learns about the qubit state from a steady-state coherent drive is proportional to χ/κ [42]. In this experiment, the lower-than-usual χ/κ ≈ 0.2 of qubit B represents a compromise between these two limits, while also enabling the high-fidelity discrimination of multiple excited states of the transmon (see Fig. 7, Appendix A).
Each readout cavity is driven in reflection, and its output signal is amplified, also in reflection, using a Josephson Parametric Amplifier (JPA). We employ the latest iteration of strongly-pumped and weakly-nonlinear JPAs [53], boasting a superior dynamic range. Such JPAs operate well below saturation even at signal powers that correspond to over 100 photons, enabling us to probe qubit readout at high measurement powers. By choosing a signal frequency at exactly half the pump frequency, we can operate the JPA in phase-sensitive mode. We can also operate the amplifier in phase-preserving mode if we detune the signal from half the pump frequency by greater than the spectral width of the pulse. Several filters are used to reject the strong JPA pump tone required to enable this operation. Circulators are used to route the output signals away from the input signals, and to isolate the qubit from amplified noise.
In ideal circumstances, the use of stronger measurement tones should increase the classification fidelity for qubit readout, as shown via simplified theoretical simulations in Fig. 2. In practice, however, higher measurement powers are known to be associated with a variety of complex dynamical effects that can limit fidelity. Perhaps the most common observation is enhanced qubit e → g decay under strong driving (referred to as the T_1 versus n̄ problem). The relative accessibility of higher excited states in transmon qubits means that at strong enough driving, general multi-level transitions to these higher levels can also be observed. There have also been predictions of chaotic dynamics and ionization [14, 56] at certain readout resonator occupation levels, as well as complex dynamics due to qubit-induced resonator nonlinearities [57]. The theoretical understanding of these effects, and their modeling via an SME analogous to Eq. (6), is an ongoing challenge.
In our experiments, we perform readout across this domain using two different qubits. For Qubit A, we simultaneously vary both pulse amplitude and pulse duration (T_off − T_on), the latter from 300 ns to 1150 ns, to together obtain roughly 9±3 to 18±5 photons in the cavity in the steady state. For Qubit B's phase-preserving dataset, measurement pulse durations vary independently from 500 ns to 900 ns, and measurement amplitudes are adjusted to drive roughly 44±5 to 363±40 photons in the cavity in the steady state; the significantly larger photon number is tolerated due to the low Qubit B χ/κ. At the lowest pulse duration and amplitude, this corresponds to just enough discriminating power to separate the measured distributions for the two states by approximately their width in a boxcar-filtered IQ plane (namely, without the use of an empirical MF). An example of the individual readout histograms for qubits initialized in states p ∈ {e, g} at this lowest measurement tone power is shown in Fig. 3(b). Qubit B's phase-sensitive dataset was recorded with a pulse time of 800 ns, using a shaped pulse to shorten the effect of the cavity ring-up time, similar to [58].
At the highest measurement powers, we are able to populate the readout cavity with up to 100 photons, calibrated by observing the shift of the qubit drive frequency versus the occupation of the readout resonator. At these powers, extreme higher-state transitions become visible during the readout pulse [9]; an example is shown in Fig. 3(b) (see also Fig. 7 in Appendix A). There is also a notable elliptical distortion in the high-amplitude data, particularly for qubit A. We suspect that this is due to the short duration of the pulses and the inclusion of the cavity ring-up and ring-down in the integration, since the simple boxcar filter used to integrate the histograms in Fig. 3(b) does not rotate with the signal mean.
For such complex regimes where no simple model of the dynamics exists, the construction of an optimal filter is not known; this hence serves as an ideal testing ground for the TPP approach to qubit state classification. We compute the infidelities of binary classification using both the TPP scheme and an FGDA using the standard MF [Eq. (11)] under a variety of readout conditions, plotting the results against each other in Fig. 3(c).
The highest fidelity using both schemes is obtained for qubit B, under conditions where its T_1 time is longest. This dataset was collected at a fixed, moderate measurement power; the different points correspond to a rolling of the relative JPA pump and measurement tone phase, which determines the amplified quadrature under phase-sensitive operation. The dashed line marks equal classification infidelities, so that any datasets above this line yield a higher classification infidelity with the FGDA than with the TPP. Here we see that both schemes exhibit very similar performance levels.
The other two datasets are obtained for readout under varying measurement powers. The depth of shading of the markers indicates the strength of the measurement drives: the more opaque the marker, the higher the measurement power. We first note that the classification fidelity does not uniformly increase with signal amplitude in experiment; this is in contrast to the simplified theoretical simulations of Sec. III A, and is expected due to the aforementioned dynamical effects exhibited in real qubit readout at higher readout powers (neglected in Fig. 2).
For weaker measurement powers, we see that the TPP and the FGDA are once again comparable. However, a very clear trend emerges: for higher measurement powers - where measurement dynamics become much more complex as demonstrated in Fig. 3(b) - the TPP generally outperforms the FGDA. To quantify the difference in performance between the TPP and FGDA more precisely, we introduce the metric E = (I_FGDA − I_TPP)/I_FGDA × 100%, where I = 1 − F is the classification infidelity of each scheme; E essentially asks: "what percentage fewer errors does the TPP make when compared to the FGDA?" We plot E in the inset of Fig. 3(c) for the two qubit readout experiments where the input power is varied. We see clearly that with increasing power, the TPP can significantly outperform the FGDA scheme, committing as many as 30% fewer errors in the experiments considered. In fact, in certain cases where the FGDA predicts a reduction in classification fidelity with increasing readout power, the TPP's learning advantage can even enable a qualitatively different trend, instead boosting classification performance with increasing readout power (for details, see Appendix E 1). In both cases, at weaker amplitudes the general TPP filter closely matches the TPP filter assuming white noise. However, for higher measurement amplitudes, a marked difference between the white noise TPP filter and the general TPP filter is observed.
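As a concrete sketch, E can be computed directly from predicted labels on a test set; the helper below and its toy inputs are illustrative, not part of the experimental pipeline:

```python
import numpy as np

def error_reduction(y_true, pred_tpp, pred_fgda):
    """Percentage fewer errors made by the TPP relative to the FGDA."""
    inf_tpp = np.mean(np.asarray(pred_tpp) != np.asarray(y_true))
    inf_fgda = np.mean(np.asarray(pred_fgda) != np.asarray(y_true))
    return 100.0 * (inf_fgda - inf_tpp) / inf_fgda

# Toy example: on 1000 shots the FGDA makes 10 errors, the TPP makes 7.
y = np.zeros(1000, dtype=int)
fgda = y.copy(); fgda[:10] = 1   # 10 misclassified shots
tpp = y.copy(); tpp[:7] = 1      # 7 misclassified shots
print(round(error_reduction(y, tpp, fgda), 6))  # → 30.0
```

E > 0 means the TPP commits fewer errors; E < 0 (as in one case discussed in Sec. V B) means the FGDA wins.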
Our results demonstrate that the TPP approach can be successfully applied to real qubit readout across a broad spectrum of measurement conditions. Furthermore, the TPP can even outperform the standard FGDA in certain relevant regimes, such as for high-power readout. While the TPP can thus be applied as a model-free learning tool, we are also interested in understanding the principles that enable the TPP to outperform standard approaches using an MF. Uncovering these principles can help identify the types of classification tasks where TPP learning is essential. Our interpretation of TPP learning as optimal filtering proves a useful tool in this vein.

B. Adaptation of TPP-learned filters under strong measurement tones
For visualization, we only analyze filters f⃗_k ∈ R^(N_T) for I-quadrature data; the complete vector f⃗_k includes filters for all N_O observables. Recall that for a C-state classification task, the TPP learns C filters; however, the sum of filters is constrained by Eq. (16), so that C − 1 filters are sufficient to describe the TPP's learning capabilities. In Fig. 4(a) we first consider filters learned by the TPP for a C = 2 classification task, for select experimental datasets from Fig. 3 obtained under a low and a high measurement power. It therefore suffices to analyze just f⃗_1, the first filter for the I quadrature, as a function of measurement power. The black curves are filters learned under the assumption of Gaussian white noise, given by Eq. (14); recall that for this binary case, these filters are exactly the standard MF. The gray curves, in contrast, are filters learned by the TPP for arbitrary noise conditions, obtained by solving Eq. (2). At a low measurement tone amplitude (less opaque marker), the general TPP filter appears very similar to the TPP filter under white noise. As the measurement tone amplitude is increased, however, the TPP-learned filter under arbitrary noise can deviate substantially from the TPP filter under white noise. This is accompanied by a marked difference in performance, observed in Fig. 3(c).
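The distinction between white-noise and general filters can be illustrated with a minimal numerical sketch. Here the mean traces, the AR(1) correlated-noise model, and the filter construction as V⁻¹ applied to the mean-trace difference are all illustrative assumptions, not the experimental quantities:

```python
import numpy as np

N_T = 64
t = np.arange(N_T)

# Toy mean I-quadrature traces for the two states (cavity ring-up shapes;
# purely illustrative, not the experimental traces).
mu_e = 1.0 - np.exp(-t / 10.0)
mu_g = -mu_e
d = mu_e - mu_g  # difference of mean traces

# Correlation matrix V for stationary noise with correlations rho^|j-k|
# (an AR(1) model); rho = 0 recovers white noise, V = identity.
def tpp_filter(diff, rho):
    V = rho ** np.abs(np.subtract.outer(t, t)).astype(float)
    return np.linalg.solve(V, diff)  # filter ~ V^{-1} (mu_e - mu_g)

f_white = tpp_filter(d, 0.0)    # reduces to the matched filter shape
f_general = tpp_filter(d, 0.9)  # filter adapted to correlated noise

# Shape overlap between the two filters: well below 1 under correlated noise.
cos = np.dot(f_white, f_general) / (
    np.linalg.norm(f_white) * np.linalg.norm(f_general))
print(round(float(cos), 2))
```

For white noise the two constructions coincide; once the noise acquires a nonzero correlation time, the noise-aware filter develops a qualitatively different shape, mirroring the deviation seen in Fig. 4(a).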
Crucially, the generalization of matched filters provided by TPP learning, as discussed in Sec. III A, enables a similar comparison for classification tasks with an arbitrary number of states. We show learned filters for C = 3 state classification of p ∈ {e, g, f} in Fig. 4(b), again for a low and a high measurement power. It is now sufficient to consider any two of the three distinct I-quadrature filters; here we choose f⃗_1 and f⃗_3. Once more, the general TPP filters begin to deviate significantly from TPP filters under the white noise assumption at high powers. Most importantly, these filters provide an improvement in 3-state classification fidelity relative to the FGDA scheme (for brevity, full results are provided in Appendix E 2).
Clearly, the precise form of filters learned by the TPP to outperform white noise filters must be influenced by some physical phenomena that arise at strong measurement powers. However, the TPP is not provided with any physical description of such phenomena, which is in fact part of its model-free appeal. What, then, is the mechanism through which the TPP can learn about such phenomena to compute optimal filters? The answer lies explicitly in Eq. (14): TPP-learned filters are sensitive to noise correlations in data via V. Using simulations of measurement chains where the noise structure of quantum measurement data can be precisely controlled, we show that the noise structure can strongly deviate from white noise conditions under practical settings. Crucially, the TPP can adapt to these changes, whereas the MF cannot.

V. TPP LEARNING: SIMULATION RESULTS
As discussed in Sec. II, the TPP weights and hence optimal filters depend on mean traces, but are also cognizant of - and can learn from - the noise structure of measured data via the temporal correlation matrix V. This is in stark contrast to the use of a matched filter.
Crucially, data obtained from quantum systems can exhibit temporal correlations that have a quantum-mechanical origin. In what follows, we demonstrate the ability of the TPP to learn these quantum correlations, using simulations of two experimental setups where such quantum noise sources arise naturally: (i) readout using phase-preserving quantum amplifiers with a finite bandwidth, so that the amplifier added noise (demanded by quantum mechanics) has a nonzero correlation time, and (ii) readout of finite-lifetime qubits with multi-level transitions (quantum jumps).
A. Correlated quantum noise added by finite-bandwidth phase-preserving quantum amplifiers

Quantum-limited amplifiers are a mainstay of measurement chains in cQED, needed to overcome the added classical noise of the HEMT amplifiers that follow them. Phase-preserving quantum amplifiers are required by quantum mechanics to add a minimum amount of noise to the incoming cavity signal being processed. The correlation time of this added quantum noise is determined by the dynamics of the amplifier itself, namely its active linewidth, reduced by the anti-damping necessary for gain. For finite-bandwidth amplifiers operating at large enough gains, this can lead to the addition of quantum noise with nonzero correlation time in measured heterodyne data.
To simulate qubit readout in these circumstances, we consider a quantum measurement chain described by Eq. (6), now consisting of a qubit-cavity-amplifier setup. L_meas then describes the readout of a non-degenerate (i.e. two-mode) parametric amplifier and its non-reciprocal coupling to the cavity used to monitor the qubit. We ignore qubit state transitions, so that L_envt only describes losses via unmonitored ports of the cavity and amplifier. Full details of the simulated SME are included in Appendix B 2.
We must consider added classical noise in the measurement chain, as this is what demands the use of a quantum amplifier in the first place. We take the added classical noise to be purely white, ξ_cl(t_i) = √n̄_cl (dW/dt)(t_i), with a noise power n̄_cl = 30, parameterized as usual in "photon number" units; these assumptions on the noise structure and power are taken from standard cQED experiments, including our own. The obtained heterodyne measurement records, Eqs. (8a) and (8b), then contain two dominant noise sources: (i) excess classical white noise, and (ii) quantum noise added by the amplifier, contained once again in the quantum trajectories ⟨X̂^(σ)(t)⟩_c and ⟨P̂^(σ)(t)⟩_c.
We restrict ourselves for the moment to binary classification of states |e⟩ and |g⟩; here, the matched filtering (MF) scheme is unambiguously defined, and serves as a concrete benchmark for comparison to the TPP approach. In Fig. 5, we compare calculated infidelities using the FGDA and TPP approaches for three different values of the amplifier transmission gain G_tr, and as a function of the coherent input tone power: darker markers correspond to readout with stronger input tones.
To understand how correlations in the measured data depend on the varying amplifier gain, we introduce the noise power spectral density (PSD) of the data (here, the I-quadrature) for state |p⟩: by the Wiener-Khinchin theorem, the PSD is simply the Fourier transform of the noise autocorrelation function V^(p)_jk, considered as a function of the lag τ_jk = ∆t(j − k). Through V, the TPP learns from these correlations when optimizing filters. The noise PSD is plotted in the inset of Fig. 5; for the current readout task, it is independent of p. With increasing gain, the PSD deviates from the flat spectrum representative of white noise to a spectrum peaked at low frequencies, indicative of an extended correlation time. These observations also emphasize that the noise added by the quantum amplifier dominates over the heterodyne measurement noise ξ, as well as the excess classical noise ξ_cl. For the lowest considered amplifier gain, we see that the FGDA and TPP classification performance is quite close. However, with increasing gain, the FGDA infidelity is substantially higher, up to an order of magnitude worse for the largest gain considered here. This TPP performance advantage is enabled by optimized filters, shown in Fig. 5(b). The measurement tone is only on between the two dashed vertical lines. The curves in black show white noise filters, exactly equal to the MF in this binary case. Note that these filters also change with gain: the amplifier response time increases at higher gains, so the mean traces, and hence the MF derived from these traces, exhibit much slower rise and fall times. The general TPP filter is similar to the MF at low gains, but becomes markedly distinct at higher gains.
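This PSD diagnostic can be estimated from an ensemble of records via the correlation matrix V. In the sketch below, the slow random-walk component is merely a stand-in for correlated amplifier-added noise; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N_rec, N_T, dt = 4000, 128, 1.0

# Toy records: white heterodyne noise plus a slow random-walk component
# standing in for correlated amplifier-added noise (illustrative only).
white = rng.standard_normal((N_rec, N_T))
slow = 0.05 * np.cumsum(rng.standard_normal((N_rec, N_T)), axis=1)
records = white + slow

dX = records - records.mean(axis=0)   # subtract the mean trace
V = (dX.T @ dX) / N_rec               # temporal correlation matrix V

# Wiener-Khinchin: average V over equal-lag diagonals to estimate the
# autocorrelation, then Fourier transform for a rough one-sided PSD.
acf = np.array([np.diagonal(V, k).mean() for k in range(N_T)])
psd = np.abs(np.fft.rfft(acf)) * dt

# Correlated noise shows up as a peak at low frequencies.
print(bool(psd[0] > psd[-1]))  # → True
```

For purely white noise the same estimate is flat (up to sampling fluctuations), matching the flat spectra in the low-gain insets.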
Interestingly, one such change is that at high gains the general TPP filter becomes non-zero even prior to the measurement signal turning on (the first vertical dashed line). This appears odd at first sight, since there must not be any information that could enable state classification before a measurement tone probes the cavity used for dispersive qubit measurement. To validate this, in Fig. 5(d) we plot 1 − F calculated for an increasing length of measured data, t ∈ [0, T_meas]. We clearly see that for t < T_on, both the TPP and FGDA cannot distinguish the states, as must be the case. The non-zero segment of the general TPP filter before T_on instead accounts for noise correlations. In particular, due to the long correlation time of noise added by the quantum amplifier, noise in data beyond T_on is correlated with noise from t < T_on. The general TPP filter is aware of these correlations - which the standard MF is completely oblivious to - and by accounting for them improves classification performance.

B. Correlated quantum noise due to multi-level transitions
A transmon is a multi-level artificial atom, as described by Eq. (7); as a result, it is possible to excite levels beyond the typical two-level computational subspace of e and g states. Such transitions manifest as stochastic quantum jumps in quantum measurement data, and are an important source of error in readout.
To model measurement under such conditions, we now consider the dispersive heterodyne readout of a finite-lifetime transmon with possible occupied levels {e, g, f}. We further allow only a subset of all possible transitions between these levels, with static rates: |e⟩ → |g⟩ at rate γ_eg, the reverse |g⟩ → |e⟩ at rate γ_ge, and |e⟩ → |f⟩ at rate γ_ef (see Fig. 6 inset). The transitions are described by the superoperator L_envt, while L_meas describes the measurement tone incident on the cavity, as well as the heterodyne measurement superoperator for the same; for full details see Appendix B 3.
For simplicity, we now further neglect excess classical noise added by the measurement chain, dropping the terms ξ_cl^(I/Q)(t). As a result, the obtained measurement records, Eqs. (8a) and (8b), contain only two noise sources: white heterodyne measurement noise, and quantum noise due to qubit state transitions imprinted on the emanated cavity field, contained in the quantum trajectories of the cavity quadratures ⟨X̂^(σ)(t)⟩_c and ⟨P̂^(σ)(t)⟩_c. We then generate simulated datasets by integrating the resulting full SME, Eq. (6), for different values of the transition rates, and consider the task of binary classification of states p ∈ {e, g}.
We compare the performance of a trained TPP against that of an FGDA with an empirical MF using the metric E in Fig. 6(a), with varying transition rates. The noise PSD is plotted in Fig. 6(b) for representative datasets. In the absence of any transitions (lightest orange), S^(p)[f] is flat at all frequencies, regardless of the initially prepared state p. This is because the measured data only contains heterodyne white noise. With an increase in γ_eg, we note that S^(e)[f] deviates from the white noise spectrum, attaining a peak at low frequencies. In contrast, S^(g)[f] remains unchanged, as trajectories for initial state |g⟩ undergo no transitions. In the most complex case where we allow for all considered transitions, S^(g)[f] also starts to deviate from the white noise spectrum.
From readout datasets with no transitions to readout data with increasing transition rates, we note a small but clear improvement in classification performance using the trained TPP in comparison to the FGDA. That the TPP is able to learn information in the presence of transitions that evades the MF is clear when we compare the two sets of filters in Fig. 6(c). As the transition rates increase, the MF undergoes modifications due to the changes in the means of heterodyne records. However, the TPP is sensitive to changes beyond means - in the correlations of measured data - and increasingly learns a distinct filter with sharply decaying features. We note that the utility of similar exponential linear filters for finite-lifetime qubits has been the subject of earlier analytic work [21]. The TPP approach generalizes the ability to learn such filters in the presence of arbitrary transition rates and measurement tones, and for multi-state classification.
One may note that in the absence of any multi-level transitions (Fig. 6(a), first datapoint), the FGDA appears to outperform the TPP (E < 0); given the results of Sec. III A, this may seem odd, as here the measurement noise is exactly Gaussian white noise, so the TPP filter reduces exactly to the MF used in the standard FGDA. The important distinction is that, unlike in Sec. III A, here we are deploying the general TPP, which makes no a priori assumptions about noise characteristics. In the special case where the noise is in fact Gaussian white noise, the MF is already cognizant of the correct noise statistics, while the TPP must learn them via training, leading to a slight under-performance that is alleviated as the size of the training dataset is increased (see also Appendix C). Of course, this freedom is precisely what enables the TPP to learn more efficiently when noise characteristics are not simply Gaussian and white, for example under increasing multi-level transitions. There, the TPP shows an improvement relative to the standard FGDA in spite of having to learn the new noise statistics from training data. For these more complex noise conditions, the standard MF is now sub-optimal, and the FGDA performance suffers as a result.
Finally, we emphasize that the simplified transition model considered here is chosen to highlight the ability of the TPP to learn quantum noise associated with quantum jumps under controlled noise conditions, where no other nontrivial noise sources (classical or quantum) exist. The TPP approach to learning is model-free, and its ability to learn in more general noise settings is demonstrated by its adaptation to real qubit readout in Sec. IV.

VI. DISCUSSION AND OUTLOOK
In this paper we have demonstrated a machine learning approach to the classification of an arbitrary number of states using temporal data obtained from quantum measurement chains. While we have focused on the task of dispersive readout of multi-level transmons, the TPP approach applies broadly to quantum systems, and more generally physical systems, monitored over time. Our results show that the TPP framework for processing quantum measurement data reduces to standard approaches based on matched filtering in the precise regimes of validity of the latter. However, the TPP can adapt to more general readout scenarios to significantly outperform matched filtering schemes. We show this improvement for the TPP trained on real qubit readout data, confirming the practical utility of our scheme.
Rather than treating the TPP as a black box, in our work we clarify the learning mechanism that enables the TPP to outperform matched filtering schemes. First, we develop a heuristic interpretation of the TPP mapping as one of applying temporal filters to measured data. TPP learning then amounts to learning optimal filters. Deconstructing the learning scheme, we find the TPP performance advantage is enabled by its ability to learn optimal filters by accounting for noise correlations in temporal data. When this noise is purely white, the TPP approach provides a generalization of matched filtering to an arbitrary number of states.
Crucially, we find that the TPP can efficiently learn from correlations not just due to classical signals, or due to quantum noise in principle, but in practical systems where the majority of the noise is quantum in origin. In addition to real qubit readout, using theoretical simulations where the strength of quantum noise sources can be tuned precisely - such as noise due to multi-level transitions, or the added noise of phase-preserving quantum amplifiers - we clearly demonstrate that the TPP can learn from quantum noise correlations to outperform standard matched filtering. Furthermore, our precise identification of quantum correlations as a harnessable resource can help guide future machine learning approaches to quantum signal processing.
The TPP approach - anchored by its connection to standard matched filtering, with demonstrated advantages for real qubit readout under complex readout conditions, and feasibility for FPGA implementations (to be demonstrated in future work) - is ideal for integration with cQED measurement chains for the next step in readout optimization. Furthermore, the TPP's generality and ability to efficiently learn from data could pave the way for an even broader class of applications. An important potential use is as a post-processor of quantum measurement data for quantum machine learning. With the use of general quantum machines for information processing, the optimal means to extract data from their measurements may not always be known. We believe the TPP is ideally suited to uncover the optimal linear post-processing step, through training that could be incorporated as part of the optimization of the quantum machine. This is because the existence of an exact analytic form for the optimal trained TPP weights eliminates the need for multiple training epochs, batch-wise evaluations, or gradient computations, so that training the TPP adds minimal complexity to the optimization of an already complex quantum measurement chain, in stark contrast to the substantial overhead of training a neural network used as a post-processor. Finally, optimal state estimation is essential for control applications. The trainable TPP can form part of a framework for control applications, such as Kalman filtering for quantum systems.

Appendix A: Experimental Setup

In this appendix section, we show a few more examples of readout IQ histograms, as well as a more detailed circuit diagram for the measurement chain. Shown in Fig. 7 below, we see two examples of the extremes of the measurement data for readout of Qubit B used to generate Fig. 3. Part (a) shows a lower-power readout pulse performed for a short 300 ns time, where the cavity barely has time to reach a steady state before the drive is turned off. Consequently, information from both the ring-up and ring-down must be integrated to achieve the SNR shown in this figure. Despite this measure, there is still significant infidelity from the lack of separation of the Gaussian signals. In the second case, the displacement voltage is larger, and the pulse is three times as long, resulting in significantly increased separation of the Gaussian signals and enabling discrimination of the |g⟩, |e⟩, |f⟩ and |h⟩ states. However, the large powers required induce transitions between these states, resulting in the trails between them as the measurement integrates a mixture of different cavity states at different times.

In Fig. 8, the hardware schematic of the measurements in Sec. IV is shown. The measurement setup is fairly standard, using single-sideband upconversion to send signals into the dilution refrigerator, moving through three stages of attenuation: 20 dB at 4 K, 20 dB at the 100 mK stage, and approximately 45 dB at the base stage of the refrigerator, with 10 dB of the base-stage attenuation coming from a particularly well-thermalized copper-body attenuator. The signal interacts with the qubit and cavity system, is routed by two circulation stages to the amplifier, amplified in reflection, and then routed once again back through the circulators to the remaining stages of amplification at 4 K and room temperature accordingly. From there it is downconverted by the same local oscillator to 50 MHz, filtered, amplified once more at low frequency, digitized at 1 GS/s, and finally demodulated and integrated to acquire a readout histogram such as the ones shown in Fig. 7.

Appendix B

In this appendix section, we describe the SMEs used to model the various quantum measurement chains and generated datasets analyzed in the main text. For convenience we reproduce the general SME of Eq. (6): For all the considered models of quantum measurement chains for the fixed task of dispersive qubit readout, L_sys remains the same, as identified in the main text: where Ĥ_disp is the dispersive cQED Hamiltonian for a multi-level artificial atom. The superoperators L_envt and L_meas[dW] will depend on the specific model considered.

1. Dispersive readout with no qubit transitions and using a cavity
For qubit readout in the absence of any state transitions, L_envt → 0. As a result, the SME of Eq. (B1) takes the simpler form: Here L_sys is given by Eq. (B2). The superoperator L_meas describes quantum modes in the measurement chain that are used to measure the quantum system of interest. This superoperator can be expressed in the general form: Here L_q defines the unconditional dynamics of quantum modes used for measurement; here, it takes the explicit form: which describes the measurement tone used for cavity readout, and the cavity losses due to its monitored port. Importantly, L_q is independent of the qubit sector. Then, S[dW] is the stochastic measurement superoperator that describes conditional evolution under continuous heterodyne monitoring: These explicit forms of superoperators fully define Eq. (B1) in this regime without qubit transitions. However, the no-transitions assumption can be used to further simplify the form of the SME. In particular, in the absence of transitions, the quantum state of the measurement chain is given by the ansatz: where ρ̂_c(t) is the conditional density matrix defining the quantum state of all quantum modes in the measurement chain other than the qubit (namely, the cavity mode). The above implies that the qubit state is completely unchanged during the readout time. The only evolution is in the state of the modes used to read out the qubit, namely the cavity modes.
By now tracing out the qubit subspace in Eq. (B1), we can obtain an SME for ρ̂_c(t) alone, under the ansatz of Eq. (B8). The Hamiltonian contribution from the dispersive qubit Hamiltonian yields: and by conjugation, following which we arrive at: where we have defined Ĥ_cav as the cavity Hamiltonian alone: We can perform a similar simplification on terms due to L_meas. For the ansatz in Eq. (B8), we find for L_q: As L_q was independent of the qubit subsector, it remains unchanged following the partial trace over this subsector.
The stochastic measurement operator S[dW] is again independent of the qubit subspace. Hence tracing out the qubit sector yields: The final cavity-only SME in the absence of any qubit transitions takes the form: The resulting SME preserves Gaussian states and can be solved exactly using a truncated equations of motion (TEOMs) approach.
2. Dispersive readout with no qubit transitions and using a quantum-limited amplifier with added noise

In the absence of any state transitions, L_envt → 0, and the SME of Eq. (B1) takes the simpler form: Again L_sys is given by Eq. (B2), and L_meas takes the form: Now, L_q for the unconditional dynamics of quantum modes used for measurement takes the explicit form: The first term again describes the measurement tone used for cavity readout, and the second describes cavity losses. However, the cavity's open port is now directed to a phase-preserving amplifier downstream. The superoperator L_amp is the Liouvillian defining this quantum amplifier, which we take to be a two-mode non-degenerate parametric amplifier providing phase-preserving gain: The superoperator L_c then defines the non-reciprocal coupling between the cavity mode and the amplifier's signal mode d̂. To ensure non-reciprocal coupling, so that fields from the cavity that carry qubit state information are transmitted to the amplifier for readout, but transmission in the reverse direction is forbidden, we require g = Γ [59].
Finally, S[dW] describes conditional evolution under continuous heterodyne monitoring, now of the amplifier's signal mode: We now summarize the actual parameter choices used to generate the quantum amplifier simulated datasets in the main text. We define the total cavity loss rate κ = κ′ + Γ. Then, we choose cavity parameters so that κ′ = Γ = 0.5κ, and the dispersive shift χ/κ = 0.5. Recall that perfect non-reciprocal coupling in the desired direction requires g = Γ = 0.5κ. Lastly, amplifier parameters are chosen so that γ = γ_d + Γ = 5κ, yielding the ratio of cold amplifier linewidth to cavity linewidth γ/κ = 5 used in the main text, and also implying that γ_d = 4.5κ.
In the absence of qubit transitions, Eq. (B8) holds once again, as L_meas is completely independent of the qubit sector. Hence this sector may be traced out exactly as in the previous subsection. We thus arrive at a cavity-amplifier-only SME in the absence of any qubit transitions: for L_meas now given by Eq. (B17). The resulting SME again preserves Gaussian states and can be solved exactly using a truncated equations of motion (TEOMs) approach.

Dispersive readout including multi-level transitions using a cavity
For qubit readout allowing for state transitions, we must now include L envt in the SME: Again, L sys is given by Eq. (B2). Now the nontrivial superoperator L envt takes the form: where γ jk is the rate of transition from qubit state |j⟩ to state |k⟩.
As we still consider readout using a cavity, the remaining terms in Eq. (B23) are as in Eq. (B25); in particular, L meas takes the form: where L q is given by: while S[dW ] is given by: We emphasize that now the quantum state of the measurement chain cannot generally be expressed in the form of Eq. (B8). Hence Eq. (B23) is integrated in the joint qubit-cavity Hilbert space to generate simulated measurement datasets.
Eq. (C7) lets us define X ∈ R (NO•NT+1)×CNtrain as a matrix which contains all measured records, together with a row of ones to account for the contribution of biases. Then, W ∈ R C×(NO•NT+1) is the composite matrix of all learned weights. Eq. (C7) defines a regression problem that can be solved to obtain the optimal weights [60]. For convenience of the analysis to follow, we introduce two new matrices: the mean matrix M ∈ R C×(NO•NT+1) , and the second-order moments matrix C, so that Eq. (C8) can equivalently be written as: where the factors of N train cancel out. Note that the matrix C = X X T can at times be ill-conditioned, making its inverse difficult to compute numerically. In such cases, we instead compute the quantity C + , related to the pseudoinverse of X and defined by the following limit relation: where I is the identity matrix on R (NO•NT+1)×(NO•NT+1) and λ is typically referred to as a regularization parameter.
If C is invertible, we have C + → C −1 . We emphasize that for the datasets analyzed in this paper, the intrinsic dataset noise serves as an effective regularizer, such that we can typically set λ = 0.
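As an illustration, this regularized least-squares solution for the weights can be sketched numerically. The following is a minimal sketch, not the released implementation [22]; the toy dataset, array shapes, and the `lam` parameter name are assumptions for illustration only:

```python
import numpy as np

def tpp_weights(X, Y, lam=0.0):
    """Solve W = Y X^T (X X^T + lam*I)^{-1}, a regularized form of
    the regression problem for the optimal weights.

    X : (N_O*N_T + 1, C*N_train) measured records, with a final row
        of ones to account for the bias contributions.
    Y : (C, C*N_train) one-hot target labels.
    lam : regularization parameter; lam = 0 recovers the plain
          inverse when X X^T is well conditioned.
    """
    C_mat = X @ X.T                        # second-order moments matrix
    reg = C_mat + lam * np.eye(C_mat.shape[0])
    return (Y @ X.T) @ np.linalg.inv(reg)

# Toy usage: binary classification, 1 observable, 5 time bins
rng = np.random.default_rng(0)
records = rng.normal(size=(5, 40))         # 40 records of length 5
records[:, 20:] += 1.0                     # mean shift for the second state
X = np.vstack([records, np.ones(40)])      # append the bias row
Y = np.zeros((2, 40)); Y[0, :20] = 1; Y[1, 20:] = 1
W = tpp_weights(X, Y, lam=1e-6)
pred = np.argmax(W @ X, axis=0)            # predicted state labels
```

A small `lam` stabilizes the inversion for ill-conditioned moment matrices, mirroring the role of the regularized pseudoinverse discussed above.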

Testing via cross-validation
For all classification infidelities calculated in the main text, we perform cross-validation. For a full dataset of N traj records per state, a training set is constructed with N train < N traj records as described above. The remaining N test = N traj − N train records are used to construct a testing set. We use 80% of the dataset for training, and the remaining 20% for testing. This is consistent with training and testing set sizes for standard machine learning applications; for example, the MNIST handwritten digits classification task [61] uses 85.7% of the total dataset for training, and the remaining 14.3% for testing. Predicted state labels are obtained using this testing set via both the FGDA scheme, Eq. (10) of the main text, and the TPP, Eq. (1). This process is repeated until a total of L = 10 iterations are completed: each time, a new set of weights W opt is obtained from a distinct, randomly chosen training set of the total N traj records, and classification infidelities are computed using the new random testing datasets. All classification fidelities are averaged to obtain the final values plotted in the main text. This cross-validation approach is standard in machine learning, and ensures that the observed performance is not unduly affected by variations due to the specific training or testing dataset used.
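The repeated random-split procedure above can be sketched in a few lines. This is a generic sketch with a least-squares linear classifier standing in for the trained post-processor; the function name and toy data are illustrative assumptions:

```python
import numpy as np

def cross_validate(records, labels, train_frac=0.8, n_iter=10, seed=0):
    """Mean classification fidelity over n_iter random 80/20 splits.

    records : (N, D) array of flattened measurement records
    labels  : (N,) integer state labels
    """
    rng = np.random.default_rng(seed)
    N = len(labels)
    n_train = int(train_frac * N)
    fids = []
    for _ in range(n_iter):
        perm = rng.permutation(N)               # fresh random split each iteration
        tr, te = perm[:n_train], perm[n_train:]
        X = np.hstack([records[tr], np.ones((n_train, 1))])   # bias column
        Y = np.eye(labels.max() + 1)[labels[tr]]              # one-hot targets
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)             # train weights
        Xte = np.hstack([records[te], np.ones((N - n_train, 1))])
        pred = np.argmax(Xte @ W, axis=1)
        fids.append(np.mean(pred == labels[te]))
    return float(np.mean(fids))

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 5)); data[100:] += 1.0
labs = np.repeat([0, 1], 100)
fid = cross_validate(data, labs)
```

Averaging over splits, as in the text, reduces the sensitivity of the reported fidelity to any one random partition.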

Dependence on size and fidelity of training sets
As the TPP deploys a supervised learning approach to training (not unlike a standard matched filter, as shown by Eq. (11) in the main text), an important question is how its performance depends on the size of the available training dataset, as well as on possible errors in data labelling, such as may arise from qubit initialization errors in the case of qubit state readout.
To answer these questions, we consider again the case of measurement data experiencing only Gaussian white noise, and compare the general TPP and FGDA performance for binary classification of p ∈ {e, g}. Note that only the general TPP is meaningful here, as the white noise TPP is exactly equal to the standard matched filter learned from a given training dataset in the special case of white noise. In Fig. 9(a), we thus plot the performance of the general TPP against the FGDA as in the main text, with a training set size of N train = 4000 measurement records per class. We also consider the impact of qubit ground state |g⟩ initialization error from 5% up to 35%: more opaque markers correspond to a lower qubit initialization error, and hence better classification performance. Fig. 9(b) repeats the same plot, but now for a larger training set with N train = 8000 measurement records per class.
We note that for the smaller training dataset, the TPP very marginally underperforms the FGDA for some of the data points. This is because the TPP has not yet converged to the optimal filter at this training set size. With increasing training set size, this difference shrinks. We also note that qubit initialization error appears to impact both schemes similarly, so that the TPP appears not to be unduly impacted by data mislabeling.
Finally, we emphasize that the task considered here is the one that most heavily favours the standard FGDA over the general TPP, as the measurement data actually satisfies the noise conditions assumed a priori by the standard matched filter. Furthermore, the signal amplitudes used are weak (as indicated by the relatively low classification fidelity), so that the measured data has a low signal-to-noise ratio and more data is needed to probe its statistics faithfully. If the true measured data exhibits noise statistics that deviate from this white noise case, even TPP filters learned using small training sets can already outperform the then sub-optimal standard MF trained on the same dataset.
Next we consider YX T , which can be expanded out explicitly, where we have used Eq. (D5) in obtaining the final expression. Hence, using Eq. (D7), the matrix M takes the simple form (after the factors of N train cancel out): which contains the mean traces for all measured observables over all states, explaining the nomenclature of the mean matrix. We have further introduced the vectors ⃗ S (c) , which also include the contribution from the bias.
b. Simplification of second-order moments matrix C

Simplifying the second-order moments matrix C is more involved. We begin by expanding it to the form: Note that XX T is simply the two-time correlation matrix of the measured data. We can further simplify C, which has four components. Starting with the simplest, we note that: Next, we consider the off-diagonal block term: The other off-diagonal term is simply the transpose of the above. Finally, we consider the block matrix: To proceed further, we substitute Eq. (D1) into the final expression and expand: Note that the sums indexed by n over the training data are estimators of the statistics of the noise process. We can therefore write: It now proves useful to introduce two further matrices, the Gram matrix G: and the empirical correlation matrix V: We can therefore write C in the simplified form: and hence construct the full C via Eq. (D11).
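The Gram matrix G and the empirical correlation matrix V introduced here can be estimated directly from training records. A minimal numerical sketch follows; the normalization convention (dividing by the number of records) is an assumption chosen only for illustration:

```python
import numpy as np

def mean_and_noise_stats(records_by_state):
    """Empirical mean traces s^(c), their Gram matrix G, and the
    noise correlation matrix V accumulated over states.

    records_by_state : list of (N_train, D) arrays, one per state c.
    """
    S = np.stack([r.mean(axis=0) for r in records_by_state])  # (C, D) mean traces
    G = S @ S.T                                 # Gram matrix of mean traces
    D = S.shape[1]
    V = np.zeros((D, D))
    for r, s in zip(records_by_state, S):
        dr = r - s                              # noise fluctuations about the mean
        V += dr.T @ dr / len(r)                 # accumulate per-state covariance
    return S, G, V

rng = np.random.default_rng(0)
data = [rng.normal(size=(500, 4)) + m for m in (0.0, 1.0)]
S, G, V = mean_and_noise_stats(data)
```

Because V is a sum of sample covariances, it is symmetric and positive semi-definite by construction, consistent with the properties used in the derivation below.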
Having constructed explicit forms of M and C, we are in principle positioned to evaluate the optimal weights and biases W opt explicitly as well. To do so, it again proves useful to interpret the learned weights in terms of optimal filters.

Constraints on TPP filters
The learned matrix of weights can be written in vector form as: Next, using Eq. (C8) together with the explicit form of the mean matrix M in Eq. (D10), we arrive at the important relation: where we have used the fact that C, and hence its inverse, is a symmetric matrix, and thereby computed the transpose of both sides. The above equation then implies: We note that the matrix C is very general, as it is constructed for completely arbitrary measured signals; it is therefore generally dense, and its inverse C −1 cannot be determined analytically. However, Eq. (D22) suggests that if we can find a way to work with the quantities C −1 ⃗ S (c) directly, we can avoid having to evaluate this inverse of C. This is our strategy for evaluating optimal filters analytically.
We demonstrate this approach by considering the action of C on the constant inhomogeneous vector ⃗ n, where ⃗ 0 ∈ R NO•NT is a vector of zeros. In particular, we wish to evaluate C⃗ n. Using the block representation of C, we have: Most importantly, note that the right hand side is entirely independent of the covariance matrix V, depending instead only on mean traces. Now, using Eq. (D22), multiplying through by C −1 allows us to work directly with the (unknown) optimal filters ⃗ F (c) . We immediately find: For completeness, we also consider the case where we instead require the calculation of C + . To this end, we add and subtract the regularization parameter λ: The above defines a constraint on the learned optimal filters, implying that they are not all linearly independent. Crucially, this constraint holds regardless of the correlation properties of the noise characterized by V, and is hence very general.
3. Analytically-calculable TPP filters: "matched filters" for arbitrary C

Having obtained a useful constraint on TPP-learned filters, we now take a step further and calculate semi-analytic expressions for these learned filters (eventually arriving at Eq. (14) of the main text).
The first step is to simplify the form of the matrix C in Eq. (D19), which we reproduce and expand below: We have thus far allowed the auxiliary matrix L to be completely general; we can now use it to simplify the form of C. Note that V as defined in Eq. (D18) is the positive sum of individual positive-definite correlation matrices; as a result, it must also be positive-definite and real. Among the useful properties of such positive-definite matrices is that they admit a Cholesky decomposition. We choose the auxiliary matrix L such that it precisely determines the Cholesky decomposition of V: where we have also used the fact that a positive-definite matrix is always invertible. With this choice, we immediately find that C reduces to: where Ī is the identity matrix on R NO•NT×NO•NT .
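The Cholesky factorization invoked here is readily available numerically. A small sketch, using a random positive-definite matrix as a stand-in for V:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
V = A @ A.T + 1e-3 * np.eye(6)   # positive-definite stand-in for V
L = np.linalg.cholesky(V)        # lower-triangular factor with V = L L^T
```

The small diagonal shift simply guarantees strict positive-definiteness of the stand-in matrix, so that the factorization is well defined.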
a. Obtaining the linear system for filters

To obtain a system of equations for the learned filters, we now consider the action of C on the vector ⃗ S (c) . To do so, we once again make use of the simplified block representation of C, which allows us to write: It proves useful to define the overlap of mean traces: where we have used Eq. (D29). We can thus write: Finally, making one further definition and once again introducing ⃗ n from Eq. (D23), we arrive at the form: Therefore, we find that the action of C on ⃗ S (c) can be expressed as a linear combination of the set of vectors { ⃗ S (c) } and a vector ⃗ n that is independent of c.
We now wish to introduce the unknown filters ⃗ F c into the above system, using Eq. (D22). To do so, we add and subtract the regularization parameter λ, and multiply through by the regularized inverse of C. This yields Eq. (D36). However, Eq. (D36) is not entirely free of the (C − λI) −1 matrix, due to the inhomogeneous term. Fortunately, as the inhomogeneous term is constant, it can be removed by considering the difference of Eq. (D36) for any two distinct values of c. This naturally introduces the difference of mean traces into the calculation of learned filters. Finally, we recall that the unknown filters ⃗ F c are not all linearly independent. We therefore use the constraint Eq. (D25) in the formal limit λ → 0 to eliminate one of the unknown vectors, here taken to be ⃗ F C : Then Eq. (D37) can be rewritten as: Each pair yields an equation of the form of Eq. (D39); it is easily seen that the full set of C − 1 equations can be compiled into the matrix system: using the properties of the Kronecker product. Here I is the identity matrix on R (NO•NT+1)×(NO•NT+1) as before, while both Q and T are elements of the much smaller space R (C−1)×(C−1) . In particular, their matrix elements are given by: Note further that T is a diagonal matrix.
b. Solving the linear system for filters

Being a simple linear system, Eq. (D41) has the formal solution: We can now simply read off the solution for the unknown vector ⃗ F c : The first term on the right hand side completely defines the filter components in ⃗ F c , as they have a zero at the position corresponding to the bias component. The second term then entirely defines the bias. Using the form of ⃗ F c from Eq. (D20), we can immediately read off the individual filters for each measured quadrature: which simplifies to: where we have again used Eq. (D29). The bias terms are finally given by: The remaining learned filter and bias are then given by the constraint, Eq. (D25). An alternative, more practical form of the learned filters can be extracted by transitioning from the representation in terms of difference vectors ⃗ S Pp to the individual traces ⃗ S (c) , using Eq. (D40). We find: which provides the learned filters as a linear combination of mean signals corresponding to each state to be classified. Comparing with Eq. (14) from the main text, we can read off the coefficients C kp of Eq. (D49).
For C = 2, the matrix system in Eq. (D41) reduces to a single equation: From here we can directly read off the filter and bias term: We now provide an example of the construction of TPP-learned optimal filters for C = 3 state classification. To compute these filters using Eq. (D49), we simply require knowledge of the matrix Q, whose matrix elements are given by Eq. (D42). For C = 3, Q ∈ R 2×2 , and the distinct state pairs Pp for p = 1, 2 follow accordingly. Then, Q takes the form: and its inverse can hence be easily computed: Using Eq. (D49), we can therefore write the non-trivial TPP-learned filters: Note that the final filter ⃗ f 3 must be defined by the constraint Eq. (16) of the main text (or equivalently Eq. (D25) of this appendix section).
6. TPP-learned optimal filters for multi-state classification under Gaussian white noise

We now present an example of TPP-learned optimal filters for dispersive qubit readout where the dominant noise source is additive Gaussian white noise. This is ensured via a theoretical simulation of Eq. (6), as discussed in Sec. III A of the main text. These simulations yield single-shot measurement records for any number of transmon states. Examples of these records are shown in Fig. 10 for four distinct transmon states p ∈ {e, g, f, h}; for ease of visualization we only consider the I quadrature. We use this simulated dataset as a training set to determine the TPP-learned filters under the white noise assumption, as defined by Eq. (14) in the main text with V ∝ Ī. While the individual measurement records are obscured by white noise, the empirically-calculated mean traces in the top right of Fig. 10 illustrate the physics at play. The mean traces grow once the measurement tone is turned on past T on , and settle to a steady state depending on the induced dispersive shift χ p and the measurement amplitude. The traces begin to fall beyond T off and eventually settle to background levels. These means, together with an estimate of the variances, determine the coefficients C kp that define the contribution of the mean trace ⃗ s (p) to the kth filter, and are hence sufficient to calculate optimal filters for the classification of any subset of states.
For the standard binary classification task (C = 2) of distinguishing {e, g} states, the learned filters are shown in black in the top row of Fig. 10; the first (k = 1) is simply proportional to the difference of mean traces for the two states, ⃗ f 1 ∝ ⃗ s (e) − ⃗ s (g) , making it exactly equivalent to the standard matched filter for binary classification, as discussed in the previous subsection. We note that the second filter (k = 2) is simply the negative of the first, as demanded by Eq. (16).
Crucially, the TPP approach now provides the generalization of such matched filters to the classification of an arbitrary number of states. For three-state (C = 3) classification of {e, g, f } states, the three TPP-learned filters are plotted in the middle row, while the last row shows the four filters for the classification of C = 4 states {e, g, f, h}. Filters for the classification of an arbitrary number of states C can be constructed similarly. The bar plots of C kp show how these filters typically have non-zero contributions from the mean traces for all states. This emphasizes that the TPP-learned filters are not simply a collection of binary matched filters, but a more non-trivial construction. Most importantly, our analytic approach enables this construction by inverting a matrix in R (C−1)×(C−1) to determine C kp . This is of substantially lower complexity than the pseudoinverse calculation demanded by Eq. (2), which requires inverting the much larger matrix C ∈ R (NO•NT+1)×(NO•NT+1) .
Of course, the latter approach of obtaining W opt and hence TPP filters using Eq. (2) can also be employed for learning using the same training data. Here, it yields the filters shown in gray. The resulting filters appear to simply be noisier versions of the analytically calculated filters. The reason for this is straightforward: the fact that the noise in the measurement data is additive Gaussian white noise is a key piece of information used in calculating the TPP filters via Eq. (14), but is not a priori known to the general TPP. The latter makes no assumptions regarding the underlying noise statistics of the dataset. Instead, the training procedure itself enables the TPP to learn the statistics of the noise and adjust W opt accordingly. The fact that the general TPP filters approach the white noise filters shows this learning in practice. This ability to extract noise statistics from data is a key feature that makes TPP learning useful under more general noise conditions, as was demonstrated in Secs. IV, V of the main text.

Figure 11. (a) Tables to implement C = 3 state classification using C − 1 instances of FGDA. Each instance predicts outcomes given by headers of rows and columns respectively, while bold labels indicate final predicted labels based on joint outcomes; see text for details. Performance comparison for (b) the same readout conditions as Fig. 2 of the main text, and (c) for readout conditions where the measurement drive is resonant with the dispersively-shifted cavity when the transmon qubit is in state |g⟩. The TPP still outperforms C − 1 FGDA instances, with the latter's performance also varying depending on readout conditions.

In Sec. III A of the main text, the performance of TPP-learned optimal filters was compared against standard FGDA implementations where a single matched filter is used. However, as the TPP uses C − 1 independent filters, it is natural to ask, for C > 2 state classification tasks, whether employing multiple instances of the FGDA with distinct filters could provide an improvement in performance. In other words, does the improvement in TPP performance observed in Fig. 2 of the main text arise simply because the TPP uses more filters, or because the learned optimal filters are able to extract more useful information from the noisy temporal measurement data?
To investigate this, we consider the C = 3 state classification task from Sec. III A, but now compare the TPP against C − 1 instances of the FGDA. A standard approach is to consider one-versus-all classification. Here, for a single instance, an FGDA is trained to process temporal data and to output a state label as being p or not p (or !p for short), instead of predicting a precise state label in the !p case. The 'filter' portion of this FGDA can be labelled a one-versus-all matched filter, and can be constructed, for example, as: Next, a second instance of the FGDA processes the same temporal data, but using a one-versus-all matched filter constructed for a different state label q, and hence now predicts the state label as being q or !q. FGDA instances are used to process the temporal data until C − 1 instances have been used, and hence one of 2 (C−1) possible outcomes has been obtained. A concrete example of the possible outcomes for C = 3 state classification is shown in Fig. 11(a), for one-versus-all filters constructed for p = g, q = e. Depending on the joint outcome, a state label can finally be assigned: for example, the result g and !e is consistent with the state label g, !g and e implies e, and !g and !e implies f . Note that the final outcome g and e is ambiguous; here we use a random choice to assign a state label. Note also that using C − 1 instances of the FGDA introduces more ambiguities than using just a single filter: different choices of p and q can be made, as indicated by the other tables in Fig. 11(a), where we instead choose p = e, q = f or p = g, q = f . There is even greater ambiguity in the choice of the C − 1 one-versus-all matched filters, Eq. (D55), where the prefactors of each mean trace can be chosen arbitrarily. Even before exploring the performance of C − 1 FGDA instances, we note that the TPP already provides a unique set of filters, determined by the coefficients C kp as given by Eq. (D49).
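The joint-outcome logic of the table in Fig. 11(a) (for the choice p = g, q = e) can be written out explicitly. A short sketch, with the random tie-break for the ambiguous outcome handled as described in the text:

```python
import random

def assign_label(is_g, is_e, rng=random):
    """Map joint outcomes of two one-versus-all FGDA instances
    (trained for g vs !g and e vs !e) to a final C = 3 state label."""
    if is_g and not is_e:
        return "g"                      # g and !e -> g
    if is_e and not is_g:
        return "e"                      # !g and e -> e
    if not is_g and not is_e:
        return "f"                      # !g and !e -> f
    return rng.choice(["g", "e"])       # ambiguous g and e: random choice
```

Other choices of (p, q) yield analogous tables with their own ambiguous cells, which is precisely the ambiguity discussed above.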
We now compare the performance of the TPP against the three distinct C − 1 FGDA instance implementations shown in Fig. 11(b). Also shown is the use of a single g-f matched filter, which was found to match the performance of the TPP in this case. We clearly see that the performance of the three distinct C − 1 FGDA instances matches the TPP performance much more closely than the single matched filters used in Fig. 2 of the main text. However, if the readout conditions are modified, for example if the cavity readout drive is now resonant with the cavity frequency when the qubit is in the ground state |g⟩, the performance can vary significantly, as shown in Fig. 11(c). Now, all C − 1 FGDA instances have a higher classification infidelity than the TPP, with certain instances faring much worse than others.
It is therefore clear that the improvement in classification fidelity provided by the TPP is not due only to its use of more than a single filter: C − 1 FGDA instances using the same number of independent filters as the TPP do not always match its performance. This emphasizes the need to optimize the individual filters used; the TPP provides an autonomous, model-free approach to achieve precisely this objective for the classification of an arbitrary number of states.

Semi-analytic TPP-learned optimal filters beyond Gaussian white noise conditions
As shown in the main text, a key feature of the TPP is that it applies to post-processing of temporal data experiencing more general noise conditions than simply uniform, observable-independent Gaussian white noise. In the main text, we compared numerically-calculated general TPP filters to semi-analytic filters computed under the white noise approximation. In this subsection we also show the semi-analytic but general TPP filters, as defined by Eq. (14) of the main text for a general correlation matrix V.
We start with a simple case where the required V −1 can be computed analytically. Consider the case of heterodyne measurement (N O = 2), but where the two measured observables (quadrature time series ⃗ I and ⃗ Q) have stationary but distinct variances σ 2 I , σ 2 Q respectively; for concreteness we assume σ 2 Q > σ 2 I . In this case, V takes the simple form: where Ĩ is the identity matrix in R NT×NT . Of course, this form of V can be straightforwardly inverted: For convenience, we define filters and mean traces for each quadrature, respectively.
To calculate the semi-analytic general filters, we then simply use Eq. (14) from the main text to immediately find: We see that there is now a relative weighting of the filters in accordance with their variance: noisier observables are suppressed relative to less noisy observables. Additionally, the coefficients C kp also depend on V −1 . In Fig. 12(a) we plot the resulting filters for the readout conditions considered in Fig. 11(b) (this ensures both I and Q quadratures have non-zero mean signal values), using both the semi-analytic general TPP filters given by Eq. (D58), as well as the exact general TPP filters; the latter are shown with thicker lines deliberately to highlight differences between the two (to be expanded upon in due course). Finally, also shown are the filters assuming uniform white noise across the measured quadratures, which are clearly distinct from the general filters and do not penalize the noisier Q quadrature. Secondly, in Fig. 12(b), we consider the case of correlated quantum noise added by a finite-bandwidth phase-preserving quantum amplifier from Sec. V A of the main text, now also showing the calculated semi-analytic general filters. We see that in all cases the semi-analytic general TPP filters show only very small differences when compared to the exact general TPP filter. As both schemes use empirically-calculated mean traces to construct the Gram matrix G and empirically estimate the correlation matrix V, the residual differences can be attributed to the fact that the semi-analytic TPP filter assumes the noise terms have zero mean, while the exact general filter makes no such assumption.
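The suppression of noisier quadratures can be made concrete with a toy sketch of the relative weighting. Overall normalization and the C kp coefficients are omitted here, so only the ratio between the two quadratures is meaningful:

```python
import numpy as np

def quadrature_weighted_filter(sI, sQ, var_I, var_Q):
    """Weight each quadrature's mean-trace contribution by the inverse
    of its stationary variance, as implied by the block-diagonal V."""
    return sI / var_I, sQ / var_Q

# Q quadrature twice as noisy as I: its filter is suppressed by 2x
fI, fQ = quadrature_weighted_filter(np.ones(4), np.ones(4), 1.0, 2.0)
```

A uniform-white-noise filter would instead weight both quadratures equally, which is exactly the distinction visible in Fig. 12(a).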
We also emphasize that computing the exact general TPP filter requires the inversion of the matrix C ∈ R (NO•NT+1)×(NO•NT+1) , while the semi-analytic general TPP requires the inversion of V ∈ R (NO•NT)×(NO•NT) .

3-state classification results for real qubit readout
In this appendix section we include supplementary results to Fig. 3 of the main text, now comparing classification performance for multi-state (C = 3) readout of a real qubit, with p ∈ {e, g, f }. The results are shown in Fig. 14 for the readout of Qubit B.
The standard FGDA is deployed here using the g-f matched filter introduced in Sec. III; as discussed in Appendix D 7, this provides the best performance amongst single matched filters, while C − 1 matched filters do not provide a marked improvement in this readout configuration. We note again that the TPP outperforms the FGDA for almost all data points, and that the performance difference increases at stronger measurement tone amplitudes. The underperformance at the lowest measurement tone amplitude can again be attributed to the fact that under these simpler readout conditions, the optimal filter is close to the white noise filter (see Fig. 4 of the main text); the general TPP does not know this a priori, and must learn this information from a finite training dataset, whose size constrains the fidelity of the learned filter. At higher signal amplitudes, the TPP outperforms the FGDA in spite of this training cost. Overall, we see that the TPP provides a better classification scheme for multi-state readout of real qubits, supplementing the improvement in performance demonstrated for binary classification of real qubits in the main text.

TPP learning of correlated classical noise
In this appendix section, we use a further example to demonstrate the ability of TPP-based learning to extract correlations from measured data, to supplement the simulations in Sec. V. As in Sec. V A, we again consider simulated datasets of measured heterodyne records from a measurement chain of a qubit-cavity-amplifier setup, as in Appendix D 6. Now, however, we consider the excess classical noise added by the measurement process to also possess a component with a colored spectrum (suppressing quadrature labels for clarity): where ξ W (t i ) describes white noise as before, while ξ P (t i ) describes 1/f (or pink) noise. The power spectral density of each noise process is given by the Fourier transform of its steady-state autocorrelation function (by the Wiener-Khinchin theorem), S N [f ] = ∫ dτ e −i2πf τ E[ξ N (0)ξ N (τ )] for N ∈ {W, P}. The noise processes are normalized so that the total noise power, ∫ df |S N [f ]|, is the same for all of the considered noise processes; hence the relative magnitude (σ P /σ W ) 2 determines the relative strength of the noise processes with different correlation statistics. We restrict ourselves again to binary classification of states |e⟩ and |g⟩. In Fig. 15, we plot the calculated infidelities using the MF and TPP approaches against each other in logscale for different noise conditions parameterized by (σ P /σ W ) 2 , and as a function of the coherent input tone power: darker markers correspond to readout with stronger input tones.
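A pink-noise component of this kind can be generated by spectrally shaping white noise. The sketch below is one common construction, not necessarily the generator used for the datasets here:

```python
import numpy as np

def pink_noise(n, rng):
    """Generate approximately 1/f-distributed noise by shaping the
    spectrum of white noise, then normalizing to unit variance."""
    white = rng.normal(size=n)
    spec = np.fft.rfft(white)
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                  # avoid division by zero at DC
    spec /= np.sqrt(f)           # amplitude ~ 1/sqrt(f)  =>  PSD ~ 1/f
    x = np.fft.irfft(spec, n)
    return x / x.std()

rng = np.random.default_rng(0)
xi_P = pink_noise(4096, rng)
```

Scaling such a trace by σ P and adding it to a white-noise trace scaled by σ W reproduces the relative-strength parameterization (σ P /σ W ) 2 used above.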

Figure 1 .
Figure 1. Temporal post-processor (TPP) for multistate classification using quantum measurement data, demonstrated for dispersive qubit readout in cQED. The objective is to process temporal data corresponding to an unknown state (indexed σ) of an arbitrary physical system, here the state of a qubit in a quantum measurement chain, to estimate the true label σ with maximum accuracy. The TPP approach uses a set of weights W and biases b to map the vector ⃗ x of measured data, comprising an instance of NO observables, each a time series of length NT, to the corners of a hypercube in C-dimensional space. Optimal values of W and b are learned by training to realize this mapping with minimal error, in a least-squares sense. Scatter plots shown in C = 3 dimensional space are data from real qubit p ∈ {e, g, f } readout after the TPP.

Figure 2 .
Figure 2. Multi-state (C = 3) classification performance of TPP versus FGDA under Gaussian white noise conditions. We consider dispersive qubit readout to distinguish states p ∈ {e, g, f } as a function of measurement power. For a transmon, χp/κ ∈ {−χ, χ, −3χ}, χ/κ = 0.195, and κ/2π = 1.54 MHz. More opaque markers indicate higher measurement tone amplitudes. Inset shows induced dispersive shifts for each state (not to scale). Standard FGDA is performed using one of three MFs corresponding to each distinct state pair, as well as a boxcar filter. TPP filters are also followed by a Gaussian discriminator for an equivalent comparison. Only one of the binary MFs allows the FGDA to approach the TPP, while all other chosen filters yield a worse performance.

Figure 3 .
Figure 3. Classification performance of TPP versus FGDA for readout of real qubits. (a) Parameters of various dispersive qubit-cavity systems used for gathering readout data. Coherence measurements are subject to 10% variation over time. (b) Representative qubit readout histograms under boxcar filtering as a function of measurement signal amplitude. (c) Readout data for three dispersive qubit-cavity systems is analyzed and the resulting classification infidelities for binary (C = 2) state classification are plotted against each other. The dashed line marks 1 − FFGDA = 1 − FTPP. For datasets with variable shading of markers (red and black), more opaque markers indicate higher measurement tone amplitudes, with corresponding resonator photon number n indicated via colorbars. Inset: Percentage fewer errors E computed for indicated datasets with increasing input signal amplitude.

Figure 4 .
Figure 4. Adaptation of TPP-learned filters with increasing measurement tone amplitude and evolving noise conditions. Black curves are normalized TPP filters under the white noise assumption; for binary state classification, these are identical to standard matched filters. Gray curves are general TPP filters with no assumptions on noise statistics. (a) Filter ⃗ f1 for binary (C = 2) classification, and (b) filters ⃗ f1,3 for C = 3 state classification. In both cases, at weaker amplitudes the general TPP filter closely matches the TPP filter assuming white noise. However, for higher measurement amplitudes, a marked difference between the white noise TPP filter and the general TPP filter is observed.

Figure 5 .
Figure 5. Classification performance of TPP versus FGDA on a simulated dataset of readout via a phase-preserving quantum amplifier. (a) Classification infidelities for varying amplifier transmission gains Gtr as a function of measurement signal amplitude (more opaque markers are higher amplitudes). The ratio of the bare amplifier linewidth to the cavity mode linewidth is γ/κ = 5. Noise PSD is shown in the inset for the different operating gains (for a linear amplifier, this is independent of the measurement signal amplitude). (b) Learned filters under the white noise assumption (black) and general noise conditions (gray) for representative datasets at each value of Gtr. (c) Classification infidelities as a function of total time t. The measurement tone is only on between the two vertical dashed lines.

Figure 6. Classification performance of TPP versus FGDA on a simulated dataset of readout of a qubit experiencing multi-level transitions. (a) E as a function of increasing transition rate values (more opaque markers), as shown. The schematic in the inset shows the transmon levels and the non-zero transition rates considered. (b) Noise PSD S^(p)[f] for three representative datasets in the inset, indicating deviation from flat (white noise) as the measurement data includes more transitions. (c) TPP-learned filters (gray) compared to matched filters (black) for representative datasets, showing adaptation with transition rates.
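Multi-level transitions of the kind shown in the inset schematic can be mimicked with a simple discrete-time Markov chain over the transmon levels. The per-step transition probabilities below are purely illustrative, not the rates used in the simulated dataset:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-step transition probabilities between levels
# g (0), e (1), f (2). Each row must sum to 1.
P = np.array([[0.999, 0.001, 0.000],   # from g
              [0.002, 0.997, 0.001],   # from e
              [0.000, 0.003, 0.997]])  # from f

def simulate_trajectory(start, steps):
    """Sample a discrete-time Markov chain of level occupations."""
    state, path = start, [start]
    for _ in range(steps):
        state = rng.choice(3, p=P[state])
        path.append(state)
    return np.array(path)

traj = simulate_trajectory(start=1, steps=1000)  # prepared in |e>
```

Records conditioned on such trajectories acquire temporal correlations from the random jump times, which is why their PSD deviates from flat and why a filter beyond the matched filter can help.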

Figure 7. Comparison between boxcar-integrated IQ results of (a) a lower-power pulse applied for a short time, corresponding to n = 116 readout photons, and (b) a higher-power pulse applied for a longer time, corresponding to a larger n = 176 readout photons. State transitions are visible as "trails" between the primary symbols in (b). Counts are shown in logarithmic units to emphasize low-count trails.

Figure 8. Upconversion and downconversion schematic for drive pulses sent first to the qubit readout resonator, driven in reflection, then routed to the amplifier and to the HEMT via two circulators.

Figure 9. Comparison of the general TPP versus FGDA as a function of training set size and qubit initialization error. Results are shown for training set sizes of (a) N_train = 4000 measurement records and (b) N_train = 8000 measurement records per class. More opaque markers indicate lower qubit initialization error.

Figure 10. TPP-learned optimal filters for simulated multi-state classification under Gaussian white noise conditions. Top right: single-shot measurement records obtained under the indicated measurement tone, and empirical mean traces of several heterodyne records of the cavity I quadrature corresponding to multi-level atom states |p⟩, where p ∈ {e, g, f, h}. For a transmon, χ_p/κ ∈ {−χ, χ, −3χ, −5χ}, with χ/κ = 0.195 and κ/2π = 1.54 MHz. Rows: TPP-learned optimal filters for classifying states p ∈ {e, g} (C = 2), {e, g, f} (C = 3), and {e, g, f, h} (C = 4). Black curves are filters learned under the white noise assumption, calculated analytically using Eq. (14). Bar plots show the coefficients C_kp applied to the respective mean traces in calculating these filters. Gray curves are general filters calculated by numerically solving Eq. (2). Both the analytically-computed white noise filters and the general filters can be extended to arbitrary C.
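The construction described in the caption, where each white-noise filter is a linear combination of class-mean traces weighted by coefficients C_kp, can be sketched as follows. Both the mean traces and the coefficient matrix below are placeholders, not the analytic result of Eq. (14):

```python
import numpy as np

rng = np.random.default_rng(4)
T, C = 120, 3

# Hypothetical mean heterodyne traces s_p for states p in {e, g, f}.
s = rng.normal(size=(C, T)).cumsum(axis=1) / 10

# Hypothetical coefficient matrix C_kp (rows: filters k, cols: states p).
C_kp = np.array([[ 1.0, -1.0,  0.0],
                 [ 0.0,  1.0, -1.0],
                 [-1.0,  0.0,  1.0]])

# Each filter is a weighted sum of the mean traces, then normalized,
# mirroring the bar-plot construction in Fig. 10.
filters = C_kp @ s
filters /= np.linalg.norm(filters, axis=1, keepdims=True)
```

This structure makes the white-noise filters cheap to compute for any C: only the C mean traces and a C x C coefficient matrix are needed.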

Figure 11. Multi-state (C = 3) classification performance of TPP versus C − 1 instances of FGDA under Gaussian white noise conditions. (a) Three distinct schemes (each corresponding to a different table) to implement C = 3 state classification using C − 1 instances of FGDA. Each instance predicts the outcomes given by the headers of rows and columns respectively, while bold labels indicate final predicted labels based on joint outcomes; see text for details. Performance comparison for (b) the same readout conditions as Fig. 2 of the main text, and (c) for readout conditions where the measurement drive is resonant with the dispersively-shifted cavity when the transmon qubit is in state |g⟩. The TPP still outperforms C − 1 FGDA instances, with the latter's performance also varying depending on readout conditions.
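One way to realize a decision table like those in panel (a) is to map the pair of binary FGDA outcomes to a final label. The particular table below is a hypothetical example of such a scheme, not necessarily one of the three used in the figure:

```python
# Two hypothetical binary discriminators: one trained on g-vs-e,
# one trained on g-vs-f. Their joint outcome is mapped to a final
# 3-state label via a fixed decision table.
def combine(pred_ge, pred_gf):
    """Map a pair of binary outcomes to a final C = 3 label."""
    table = {("g", "g"): "g",
             ("g", "f"): "f",
             ("e", "g"): "e",
             ("e", "f"): "e"}  # tie-break choice is illustrative
    return table[(pred_ge, pred_gf)]
```

Because the final label depends on how conflicting binary outcomes are resolved, different tables give different error rates, consistent with the scheme-dependent FGDA performance noted in the caption.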

Comparison of TPP against C − 1 instances of FGDA (a), first for the readout conditions from Fig. 2 of the main text; the results are shown in Fig. 11(b).

Figure 14. Multi-state (C = 3) classification performance of TPP versus FGDA for readout of real qubits. Classification infidelities using both schemes are plotted against each other for one of the three dispersive qubit-cavity systems analyzed in the main text. The dashed line marks 1 − F_FGDA = 1 − F_TPP. As before, more opaque markers indicate stronger measurement tone amplitudes.

Table I. Summary of the components of the TPP learning framework and their dimensions.