Bridging the reality gap in quantum devices with physics-aware machine learning

The discrepancies between reality and simulation impede the optimisation and scalability of solid-state quantum devices. Disorder induced by the unpredictable distribution of material defects is one of the major contributions to the reality gap. We bridge this gap using physics-aware machine learning, in particular, using an approach combining a physical model, deep learning, a Gaussian random field, and Bayesian inference. This approach has enabled us to infer the disorder potential of a nanoscale electronic device from electron transport data. This inference is validated by verifying the algorithm's predictions about the gate voltage values required for a laterally-defined quantum dot device in AlGaAs/GaAs to produce current features corresponding to a double quantum dot regime.


I. INTRODUCTION
Differences between theory and experiment pervade all of science, and are one of the driving forces of human discovery. Simulations often require fewer resources than real experiments but rarely capture the full complexity of a system, limiting their practical application. Narrowing the gap between a model and the real world is key for the control of complex systems using machine learning, especially when a machine learning model is trained on a simulation before being applied to real systems [1,2]. The reality gap is widened further when there are quantities which are not directly observable. Such unobservable quantities may be estimated through their influence on other characteristics of the system; for example, indirect observation of black holes [3], observation of the signature of Higgs boson decay [4], or machine learning estimation of human poses from behind walls [5].
Solid-state quantum devices of nominally identical design will often display different characteristics. This variability hinders the scalability of otherwise promising qubit realisations, such as in the spin states of electrons confined in electrostatically-defined quantum dots [6][7][8]. Different devices exhibit different electron transport features for identical gate voltage values. This variability is even observed in the same device after being exposed to thermal cycling [9]. In particular, electrostatic disorder induced by randomly located donor ions can be a significant source of variability in delta-doped semiconductor quantum dot devices [10,11]. Confinement potentials of individual quantum dots have been probed using in-plane magnetic fields [12], but there has been no quantitative experimental study of the disorder present in these devices beyond the observation of its effects [13].
To access the disorder characteristics that can only be observed indirectly through the transport of electrons, in this work we develop a physics-aware machine learning approach. We use transport measurements of an electrostatically-defined quantum dot device in an AlGaAs/GaAs heterostructure to inform and verify our approach.
† These authors contributed equally to this work and are listed in alphabetical order.
* natalia.ares@eng.ox.ac.uk
To infer the disorder potential we use a combination of transport measurements and predictions from a physical model. The physical model is an electrostatic simulation from which transport features can be estimated. Many simulations with different parameter settings are required to compare this physical model with transport measurements. To accommodate this need without extreme computation times, we develop a fast approximation of the model using deep learning.
The transport measurements and electrostatic simulations inform the inference algorithm to produce plausible disorder potentials, i.e. posterior samples. The inference mechanism used in this paper follows the philosophy of approximate Bayesian computation [14][15][16][17] by utilising the deep learning approximation of the electrostatic model.
A naive implementation of this inference still leads to unrealistically expensive and wasteful computation. This is because electrons are confined in a 2-dimensional electron gas (2DEG), and thus the disorder potential to be inferred is a dense 2D function. We develop a novel reparameterisation to greatly reduce the dimensionality of the inference problem, while selecting only the most informative regions of the disorder potential. This maps the non-parametric 2D disorder potential into a parametric model. This reparameterisation approximates the function in the spatial and spectral domains simultaneously using an inducing point approximation of a Gaussian process [18][19][20], and random Fourier features [21][22][23][24][25][26].
To assess the performance of inference results we use the disorder potentials produced by the algorithm to predict the electron transport regime of new measurements. These predictions provide good agreement with experiment, indicating that our physics-aware method is effective. The physical model can determine the number of quantum dots at a given voltage location. Using posterior disorder samples within this model allows us to predict the voltage locations of double quantum dots and verify these predictions with the experiment. Results show that our physics-aware machine learning provides a clear advantage over an uninformed model of the disorder potential when predicting the location of double quantum dot features in gate voltage space.

II. THE DEVICE
A bias is applied to ohmic contacts to drive current through the device from source to drain, and applying voltages to the gates allows for the control of this current. With appropriate gate voltages, electrons may be confined to form quantum dots. Current peaks as a function of gate voltage in transport measurements are a signature of Coulomb blockade, indicating the formation of quantum dots. A random distribution of Si donor ions contributes a disordered component to the electrostatic potential experienced by electrons confined in a 2DEG. The distribution of donor ions is thought to freeze at low temperatures with rearrangement only possible significantly above device operating temperature [27].
Our device has 8 Ti/Au gate electrodes to which DC voltages can be applied to control electron transport in a 2DEG within a GaAs/AlGaAs heterostructure [9,28]. The gate architecture of the device used in the experiments is depicted in Figure 1(a), where each of the gate voltages can be set to any value between 0V and -2V. In our device gate G6 is held at 0V to avoid leakage currents. The device is operated at millikelvin temperatures.

A. Electrostatics
As part of our physics-aware machine learning method, summarised in Figure 1(b), we require a model of the quantum dot device. The effects of gate electrodes and donor ions on the electron density in the 2DEG are calculated self-consistently using the pinned surface model [29,30]. Delta doping results in donor ions being randomly located in a plane at a constant height of 45nm above the 2DEG. With $r = (x, y)$ denoting a location in the 2DEG plane, the total electrostatic potential is

$$\phi_{tot}(r) = \phi_g(r) + \phi_d(r) + \phi_s + \phi_e(r),$$

where the electrostatic potential contributions are $\phi_g$ from the gate electrodes, $\phi_d$ from the randomly located donor ions, $\phi_s$ from surface states, and $\phi_e$ from the presence of electrons in the 2DEG. The potential $\phi_g$ results from the combined effect of the potential from each gate electrode weighted by the applied voltages. We find that this model underestimates the magnitude of $\phi_g$, so we use experimental data to fit an appropriate scale factor for each thermal cycle as discussed in Appendix D. The surface potential is determined by the Schottky barrier with the gates, as discussed in Ref. [31]. Following this work, we set the surface potential to a constant value of $\phi_s = -800$mV. The potential in the 2DEG from a donor at location $r_k$ in the donor plane is $\phi_d(r, r_k)$. The random potential from all donor ions is then

$$\phi_d(r) = \sum_k \phi_d(r, r_k).$$

Calculated using the Thomas-Fermi approximation in 2D, the electron density contributes to $\phi_{tot}$ while also depending on $\phi_{tot}$. A self-consistent solution for $\phi_{tot}$ is computed using an iterative under-relaxation process, with an example shown in Figure 2(c).

B. Modelling the Transport Regime
To model the transport regime of the device we consider the transport path of an electron from source to drain. If any point on the transport path has a fully depleted electron density, we say the classical channel for transport is closed (i.e. current does not flow freely). When the channel is closed, the device can be in the quantum dot regime with transport features from quantum tunneling events, or pinch-off where no current flows at all. When scanning a random combination of all gate voltages we can approximate the device as an open or closed channel. A semi-classical electron trajectory between source and drain is calculated by formulating $\phi_{tot}$ as a graph, where each pixel is a node with nearest neighbour edges weighted by the mean of connected node values. The minimum spanning tree (MST) [32] of the graph is calculated and the unique path from source to drain is determined as shown in Figure 2(e). With the electrostatic potential energy defined as $U(r) = -e\phi_{tot}(r)$, the path through the MST has the minimum possible maximum value of $U$. The location of this point will be called the minimax point, $r^* = (x^*, y^*)$, with $U^* \equiv U(r^*)$. If $U^*$ is greater than or equal to the Fermi energy $\mu_F$, the model transport channel is considered closed.
The electron trajectory approximated by the MST path can also be used to determine the number of quantum dots formed by a given $\phi_{tot}$. The number of dots defined in the device can be determined using regions of the 1D MST path where $U(r^*) < \mu_F$ which are delimited by barriers with $U(r^*) \geq \mu_F$. An example of the electron density and path corresponding to a single dot in our model is shown in Figure 2(d-f). Since dots in our device are 2-dimensional objects in the plane of the 2DEG, the 1-dimensional MST path from source to drain is not sufficient to fully determine the number of dots. Additional paths through the MST are calculated to ensure dot labels are robust to all possible configurations of the electron density. Transport features corresponding to quantum dots can only be observed near the closed channel boundary due to tunnel barriers typically suppressing current far beyond this boundary. The dots identified using our model are not affected by this limitation.
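The minimax construction described above can be illustrated with a short sketch. It builds the nearest-neighbour graph with edges weighted by the mean of the connected pixel values and grows the MST with Kruskal's algorithm, stopping as soon as source and drain are joined; the weight of the edge that joins them is the bottleneck of the MST path. The function name `minimax_barrier`, the toy grid, and the use of the bottleneck edge weight as a stand-in for $U^*$ are assumptions made for illustration, not the implementation used in this work.

```python
def minimax_barrier(U, source, drain):
    """Bottleneck edge weight on the MST path between two distinct pixels.

    U is a 2D list of potential energies; source and drain are (row, col).
    Edge weights are the mean of the two connected pixel values.
    """
    rows, cols = len(U), len(U[0])

    def idx(r, c):
        return r * cols + c

    # Nearest-neighbour edges (right and down neighbours cover every pair once).
    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges.append((0.5 * (U[r][c] + U[r][c + 1]), idx(r, c), idx(r, c + 1)))
            if r + 1 < rows:
                edges.append((0.5 * (U[r][c] + U[r + 1][c]), idx(r, c), idx(r + 1, c)))
    edges.sort()

    # Union-find structure used by Kruskal's algorithm.
    parent = list(range(rows * cols))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    s, d = idx(*source), idx(*drain)
    for weight, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            # The first MST edge that connects source and drain is the
            # largest edge on the MST path between them.
            if find(s) == find(d):
                return weight
    raise ValueError("source and drain are not connected")


# Toy potential: low-energy leads on the left and right, a barrier column between.
U_toy = [
    [-5.0, 3.0, -5.0],
    [-5.0, 1.0, -5.0],
    [-5.0, 4.0, -5.0],
]
barrier = minimax_barrier(U_toy, source=(1, 0), drain=(1, 2))
channel_closed = barrier >= 0.0  # Fermi energy set to zero
```

In this toy grid the path through the middle row has the lowest barrier, so the channel is classified as open.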
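As a minimal sketch of the 1D part of this counting rule, the function below takes the potential energy sampled along a source-to-drain path and counts regions below the Fermi energy that are bounded by barriers on both sides. It assumes that the first and last low-energy regions of the path are the source and drain reservoirs, and it ignores the additional MST paths needed for a robust 2D count. The name `count_dots` and the example profiles are hypothetical.

```python
def count_dots(path_energy, mu_f=0.0):
    """Count regions with U < mu_f enclosed by barriers with U >= mu_f."""
    # Collapse the path into alternating runs: True = below mu_f, False = barrier.
    runs = []
    for u in path_energy:
        below = u < mu_f
        if not runs or runs[-1] != below:
            runs.append(below)
    # Runs touching either end of the path are treated as reservoirs (assumption),
    # so only interior runs below mu_f are counted as dots.
    return sum(1 for below in runs[1:-1] if below)


open_channel = [-1.0, -2.0, -1.0]
single_dot = [-1.0, 1.0, -1.0, 1.0, -1.0]
double_dot = [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
pinch_off = [-1.0, 1.0, 1.0, -1.0]
```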

C. Deep Learning Approximation
For disorder inference we require fast prediction of the transport regime, determined by $U^*$ in our model, given gate voltages and a disorder potential. The self-consistent electrostatic model and MST path require up to 10 seconds to calculate $U^*$ in serial computation. This computation time is impractical for the large batches of $U^*$ required for the inference algorithm. Deep learning methods, and their ease of implementation on GPU hardware, allow for a significant acceleration [33,34].
A convolutional neural network (CNN) is trained to calculate $\phi_{tot}(r^*)$. The architecture of a CNN is particularly suited to data in 2D grids such as the potentials in our electrostatic model. Each input is a 2D potential $\phi_{in} = \phi_g + \phi_d + \phi_s$ where $\phi_g$ and $\phi_d$ are randomly generated and $\phi_s$ remains constant. The output training data consists of the self-consistent potential $\phi_{tot}$ and $\phi_{tot}(r^*)$ corresponding to each input, with $U^* = -\phi_{tot}(r^*)$ in units of electron volts. The complete mapping, expressed as $\phi_{in} \to \phi_{tot} \to \phi_{tot}(r^*)$, is approximated by the CNN $F_U$. The resolution of $\phi_{in}$ and $\phi_{tot}$ is reduced from that used in the electrostatic model to improve the performance of $F_U$. A series of resolution preserving convolutions in a residual neural network (ResNet) architecture [35] learn the non-linear transformation $\phi_{in} \to \phi_{tot}$ and further layers learn the mapping $\phi_{tot} \to \phi_{tot}(r^*)$. Test results achieve a mean absolute error (MAE) of 1.27meV in $U^*$ estimations, with a 1.2% error when classifying the transport regime using $U^*$. Batching inputs and using a GPU (GTX 1080 Ti) gives a computation time of 0.6ms for a single $U^*$ using $F_U$, a speed up of order $10^4$ over the electrostatic model and path finding algorithm. This evaluation of $U^*$ is also significantly faster than measurement of current. The parallel computation of CNN outputs surpasses any acceleration which could be achieved by optimising the exact computation of the methods discussed in Section III, which cannot be parallelised.
To make predictions of voltage locations with a given number of quantum dots using disorder inference results, a fast method for counting dots is required. A second CNN is trained to approximate the number of quantum dots at a given set of gate voltages. The input is $\phi_{in}$ as used for $F_U$, with the output being the number of dots, $N_{dot} \in \{0, 1, 2, 3\}$. The network learns the mapping $F_D : \phi_{in} \to P(N_{dot})$, where $P(N_{dot})$ is the probability for a given $N_{dot}$ and classification is determined by the maximum $P(N_{dot})$. Due to the sparsity of dots in gate voltage space, the training set used for $F_U$ is such that the classifier cannot accurately determine $N_{dot}$, but only the presence or absence of dots. We thus use an intermediate classifier which produces a new training set that ideally only includes gate voltages for which $N_{dot} > 0$. A mixture of selected (dot-abundant) data and the original (dot-sparse) data is used to train $F_D$. When determining the maximum number of dots in the direction of a given vector of gate voltages, $F_D$ achieves 95.8% classification accuracy. Using a GPU with batched inputs, the computation time for a single classification with $F_D$ is 0.6ms. Further details of networks $F_U$ and $F_D$ can be found in the supplemental material.

IV. INFERENCE ALGORITHM

A. Disorder Potential Reparameterisation
The disorder potential used in the electrostatic model is a dense 2D grid covering the entire 2DEG plane, as displayed in Figure 2(b). A dense grid is unnecessary for inference since $\phi_d$ is continuous and values can be interpolated from a sparse grid. Using a dense grid would be infeasible even with the reduced resolution CNN inputs.
We propose a novel reparameterisation algorithm, with the objective to find a set of $n_Z$ locations $Z = \{r^Z_k \mid k = 1, \ldots, n_Z\}$, where the disorder potential values on those locations sufficiently determine the transport regime. Following the literature of Gaussian Process regression [18], $Z$ defines a set of inducing points. For the experiments in this paper, $Z$ is parameterised as a 14×14 uniform grid defined by two corner points, with the initial grid shown in Figure 3(a).
Our reparameterisation requires the locations of the inducing points $Z$ as well as the values of the disorder potential on these points, represented by a vector $\alpha = [\phi_d(r^Z_1), \ldots, \phi_d(r^Z_{n_Z})]$. The full dense grid of $\phi_d$ cannot be exactly recovered from the values on $Z$, because the inducing points are too sparse and random disorder potential variations between the points could influence the transport regime. This variability is encoded in the vector $\beta = [\epsilon_1, \ldots, \epsilon_{2q}]$ containing amplitudes of random Fourier features [21][22][23][24][25][26], where $q$ is the number of frequencies considered. The parameters contained in $\beta$ are thus dependent on $Z$.
With optimal inducing points $Z_{opt}$, the disorder potential values contained in $\alpha$ sufficiently determine the transport regime, while the contribution of random Fourier features from $\beta$ is marginal. A numerical optimiser is used to find $Z_{opt}$, where the optimisation objective is to minimise the effect of $\beta$ on transport regime predictions made by $F_U$. During optimisation, the disorder potential contributing to the input of $F_U$ is approximately reconstructed from $Z$, $\alpha$, and $\beta$ using a deterministic function $f$ (see Appendix C),

$$\tilde{\phi}_d = f(Z, \alpha, \beta). \qquad (2)$$

The optimisation of $Z$ can be performed on simulated data, and the optimised inducing points used by our inference algorithm are shown in Figure 3(b). We observe that the inducing points are located where the transport channel is more likely to be depleted, so that the disorder potential on $Z_{opt}$ can determine the transport regime of the device. Detailed formulation and implementation of the inducing point optimisation algorithm can be found in Appendix D.
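Parameterising the inducing points by two corner points means that an optimiser only has to adjust four numbers, however many grid points are used. A minimal sketch of such a grid is given below; the function name `inducing_grid`, the row-major ordering, and the example corner coordinates are assumptions made for illustration.

```python
def inducing_grid(corner_a, corner_b, n=14):
    """Return n*n uniformly spaced (x, y) points spanning two opposite corners."""
    (xa, ya), (xb, yb) = corner_a, corner_b
    xs = [xa + (xb - xa) * i / (n - 1) for i in range(n)]
    ys = [ya + (yb - ya) * j / (n - 1) for j in range(n)]
    # Row-major ordering: x varies fastest.
    return [(x, y) for y in ys for x in xs]


# Hypothetical corner coordinates, in arbitrary length units.
Z = inducing_grid((0.0, 0.0), (130.0, 260.0))
```

With the default of 14 points per side this gives 196 inducing points, so the vector of potential values on the grid has 196 entries.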

B. Bayesian inference
To reconstruct the disorder potential, in addition to determining $Z_{opt}$, we must infer suitable values of $\alpha$ and $\beta$. To do this, the inference algorithm requires measurements of current in gate voltage space. We generate random directions in the 7-dimensional gate voltage space, each defined by a unit vector $u_j$ normalised such that $\max_i |u^i_j| = 1$ and $u^i_j \leq 0$ for all $u^i_j \in u_j$. A specific voltage location is defined as $v = R u_j$, where $R$ is the voltage distance along $u_j$. In particular, the inference algorithm requires information about the location of the boundary between open and closed channel transport. To obtain this information, stored in a dataset $D$, a current trace is conducted along a given $u_j$ from the origin at $R = 0$mV to the device voltage limit at $R = 2000$mV. Each current trace contributes 2 entries in $D$: the voltages immediately before and after current drops to half the open channel current, paired with $y = 1$ and $y = 0$ respectively. The resulting dataset can be defined by $D = \{(v_i, y_i) \mid i = 1, \ldots, 2n_u\}$, where $n_u$ is the number of unit vectors considered. We use $n_u = 200$ in this paper, which is well below the typical data requirements of deep learning methods used to predict features of quantum devices [36,37].
To infer $\alpha$ and $\beta$ using $D$ we define a prior distribution $p(\alpha, \beta)$ and a likelihood of data $p(D|\alpha, \beta)$. The posterior distribution then follows from Bayes' rule, $p(\alpha, \beta|D) \propto p(D|\alpha, \beta)\,p(\alpha, \beta)$. In our formulation, $p(\alpha, \beta)$ is a multivariate normal distribution with a zero mean vector and a diagonal covariance matrix. The likelihood function utilises the estimated $U^*$ from the CNN $F_U$ for each data point $(v_i, y_i)$ by calculating $\phi_g$ from $v_i$, and approximating $\phi_d$ from $\alpha$ and $\beta$.
A set of $n_s$ posterior samples $\{(\alpha_i, \beta_i) \mid i = 1, \ldots, n_s\}$ can be drawn using Markov-chain Monte Carlo (MCMC) methods. Using (2), the posterior samples of $\alpha$ and $\beta$ generate a set of 2D disorder potentials $S_{\tilde{\phi}} = \{\tilde{\phi}_i \mid i = 1, \ldots, n_s\}$, which can be used for CNN inputs. The CNN computation is differentiable, unlike the electrostatic model and path finding algorithm, allowing us to use Hamiltonian Monte Carlo (HMC) [38] with TensorFlow Probability [39].
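The construction of the dataset can be sketched in a few lines. The first function draws a direction with non-positive components and rescales it so that the largest magnitude is one; the second extracts the two labelled voltage locations from a current trace. The sampling distribution of the directions is not specified in the text, so uniform sampling is an assumption here, as are the function names and the toy trace.

```python
import random


def random_direction(n_gates=7, rng=random):
    """Direction with all components <= 0 and max |component| equal to 1."""
    while True:
        raw = [-rng.random() for _ in range(n_gates)]  # uniform draw: an assumption
        scale = max(abs(x) for x in raw)
        if scale > 0.0:
            return [x / scale for x in raw]


def boundary_entries(u, radii_mv, currents, open_current):
    """Return [(v_before, 1), (v_after, 0)] around the half-current crossing.

    radii_mv are the distances R along u at which the current was recorded.
    An empty list is returned if the current never drops below half.
    """
    half = 0.5 * open_current
    for k in range(1, len(radii_mv)):
        if currents[k - 1] >= half > currents[k]:
            v_before = [radii_mv[k - 1] * ui for ui in u]
            v_after = [radii_mv[k] * ui for ui in u]
            return [(v_before, 1), (v_after, 0)]
    return []


# Toy trace along a two-gate direction (values are illustrative only).
entries = boundary_entries(
    u=[-1.0, -0.5],
    radii_mv=[0.0, 100.0, 200.0, 300.0],
    currents=[1.0, 0.9, 0.4, 0.0],
    open_current=1.0,
)
```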
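The structure of the unnormalised log posterior can be sketched as follows. The explicit form of the likelihood is not given in the text, so this sketch assumes a Bernoulli likelihood with a logistic link on $\mu_F - U^*$; the `width` parameter, the function names, and the toy stand-in for $F_U$ are all hypothetical and serve only to show how the prior and likelihood combine.

```python
import math


def log_prior(params, sigmas):
    """Zero-mean normal prior with diagonal covariance."""
    return sum(
        -0.5 * (p / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for p, s in zip(params, sigmas)
    )


def log_likelihood(data, u_star_fn, params, mu_f=0.0, width=1.0):
    """Assumed Bernoulli likelihood: P(y=1) = sigmoid((mu_f - U*) / width)."""
    total = 0.0
    for v, y in data:
        z = (mu_f - u_star_fn(v, params)) / width
        # Numerically stable log(sigmoid(z)).
        if z >= 0:
            log_p_open = -math.log1p(math.exp(-z))
        else:
            log_p_open = z - math.log1p(math.exp(z))
        log_p_closed = log_p_open - z  # log(1 - sigmoid(z))
        total += log_p_open if y == 1 else log_p_closed
    return total


def log_posterior(data, u_star_fn, params, sigmas):
    """Unnormalised log posterior: log likelihood plus log prior."""
    return log_likelihood(data, u_star_fn, params) + log_prior(params, sigmas)


def toy_u_star(v, params):
    """Hypothetical stand-in for F_U: more negative voltages raise U*."""
    return params[0] - 0.01 * sum(v)


toy_data = [([-50.0], 1), ([-150.0], 0)]
good = log_posterior(toy_data, toy_u_star, [-1.0], [10.0])
bad = log_posterior(toy_data, toy_u_star, [5.0], [10.0])
```

In the toy example the parameter value that places the boundary between the two labelled voltages receives the higher log posterior.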
V. RESULTS

A. Transport Channel Prediction
From the Bayesian inference process we obtain a set of posterior samples of the disorder potential $S_{\tilde{\phi}}$. The standard deviation of posterior inducing point values used to generate $S_{\tilde{\phi}}$ is shown in Figure 4 for three thermal cycles of the same device. A low posterior standard deviation on an inducing point means the inference algorithm has learned more about the disorder potential at that location. In each case the posterior standard deviation is lowest in regions surrounding gate G1 (the 'nose'). This reflects the fact that the minimax point, where $U^*$ is evaluated, most frequently lies in these regions, due to the primary role of G1 in depleting the transport path from source to drain. After performing inference of the disorder potential, the set of posterior samples is used in the electrostatic model approximated by $F_U$. We verify the posterior prediction of the distance $R_C$ required to close the transport channel in a simulated and experimental device for a set of unit vectors, $\{u_i \mid i = 1, \ldots, n_v\}$. We set $R_C$ to be the point at which the current drops below 50% of the open channel current. For a given $u \in \{u_i \mid i = 1, \ldots, n_v\}$, we calculate the mean value of $R_C$ predicted using each posterior sample in $S_{\tilde{\phi}}$. To probe the generality of inference results, we evaluate predictions on both training and test data.
For a simulated device in which the true disorder can be chosen but is hidden from the algorithm, we compare the performance of random and posterior disorder potentials when predicting the value of $R_C$ over 5 independent iterations of the inference algorithm. Random disorder potentials, generated using the electrostatic model with randomly located donor ions, predict $R_C$ with a mean absolute percentage error (MAPE) of 7.0% across training and test data. In contrast, posterior samples predict the value of $R_C$ with a MAPE of 0.3% on training data, and 0.5% on test data. These results show that the inference algorithm is successful in finding disorder potentials which effectively describe features of a simulated device.
We then verify the posterior prediction of $R_C$ in a real device. Thermal cycling the device 4 times, we run a total of 5 iterations of the inference algorithm. The value of $R_C$ is predicted with a MAPE of 1.5% for training data and 2.0% for test data. The MAE of $R_C$ predictions is 24.3mV and 31.9mV for training and test data respectively. Random disorder potentials predict $R_C$ with a MAPE of 7.5% across training and test data. Compared to the simulated device, the reduced performance of the inference algorithm can be attributed to differences between the model and the experiment. The inference results remain effective in predicting the gate voltages which close the transport channel.
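For reference, the two error metrics quoted above can be computed as in the sketch below, where the prediction for each unit vector is the mean of $R_C$ over posterior samples. The function names and the numerical values are illustrative only and are not measured data.

```python
def posterior_mean(per_sample_predictions):
    """Average the R_C predicted by each posterior sample for one direction."""
    return sum(per_sample_predictions) / len(per_sample_predictions)


def mae(predicted, measured):
    """Mean absolute error, in the units of the inputs (mV here)."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)


def mape(predicted, measured):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured)) / len(measured)


# Illustrative values in mV for two directions, three posterior samples each.
measured_rc = [1000.0, 2000.0]
predicted_rc = [
    posterior_mean([1005.0, 1010.0, 1015.0]),
    posterior_mean([1950.0, 1960.0, 1970.0]),
]
```

As a rough consistency check on the reported figures, an MAE of 24.3mV together with a MAPE of 1.5% suggests typical values of $R_C$ around 1600mV, which lies within the scanned range of 0mV to 2000mV.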

B. Double Dot Prediction
Having demonstrated the success of the inference algorithm in determining the values of $R_C$, we use the posterior disorder samples to predict transport features beyond the training domain of the inducing point optimisation and inference algorithm. We specifically consider features corresponding to the double quantum dot regime. We implement a method requiring minimal knowledge of the transport characteristics of a particular device. Three pieces of information are required: i) quantum dots form near the closed channel boundary, ii) gates G3 and G7 in our device couple most strongly to dot energy levels, and iii) double quantum dots form features with periodicity in two gate voltage directions in transport measurements.
The method of finding double quantum dots using posterior disorder samples is summarised in Figure 5(a). Random unit vectors $u$ are generated and scanned from $R = 0$mV to $R = 2000$mV in the simulated device. A randomly chosen unit vector is unlikely to lead to double dot transport features given the sparsity of double dots in voltage space. Based on predictions made by $F_D$ for each posterior disorder, we select candidate voltage vectors. If $F_D$ detects a double dot along a vector for a given posterior disorder, the vector's score is increased by one. Vectors with a score greater than a selected threshold (taken to be $n_s/3$) are accepted to be investigated further in a test device.
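The scoring rule can be written compactly. In the sketch below, `detections[i][j]` records whether the dot classifier reports a double dot along vector `j` for posterior sample `i`; the function name, the data layout, and the example values are assumptions made for illustration.

```python
def accept_vectors(detections):
    """Score each vector and accept those with score > n_s / 3.

    detections is a list of n_s rows (one per posterior sample), each a list
    of booleans with one entry per candidate vector.
    """
    n_s = len(detections)
    n_vectors = len(detections[0])
    scores = [sum(1 for row in detections if row[j]) for j in range(n_vectors)]
    # Integer form of score > n_s / 3, which avoids floating point comparison.
    accepted = [j for j, score in enumerate(scores) if 3 * score > n_s]
    return scores, accepted


# Three posterior samples and three candidate vectors (illustrative values).
example = [
    [True, False, True],
    [True, False, False],
    [False, False, False],
]
scores, accepted = accept_vectors(example)
```

Here the first vector is detected in two of three samples and is accepted, while the third vector sits exactly at the threshold and is rejected.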
Accepted vectors only indicate a direction in gate voltage space in which double quantum dot features could be observed in a test device. As these features are expected to be found near the closed channel boundary in transport measurements, we investigate multiple voltage locations near this boundary along each accepted vector. To investigate each accepted $u$, an automated protocol performs a current trace along $u$ from the origin to the device voltage limit. The gate voltages are then set to the boundary between open and closed channel regimes along $u$, identified by a drop of 20% from open channel current. To allow for the identification of double quantum dot transport features, gates G3 and G7 are scanned in a 200mV×200mV window centred at this gate voltage location. Such 2D scans are subsequently performed at intervals of 13.3mV in $R$ along the direction of $u$ until the maximum current value in a 2D scan drops below 100pA. The 2D scans are labelled by 6 human experts to determine the presence of double quantum dot features along $u$. Multiple human experts are required as the identification of double quantum dot features is often subjective and difficult to perform computationally [9]. Further details of the vector filtering and labelling of 2D scans can be found in the supplemental material.
Similar to our $R_C$ predictions, we first test the predictive power of posterior disorders in a simulated device in which the true disorder potential is known. By selecting random unit vectors we find a mean double dot occurrence rate of 0.83% using several random disorder potentials, generated using the electrostatic model with randomly located donor ions. We thus perform disorder inference followed by vector filtering. We do not scan gates G3 and G7 as in the real device since $F_D$ can determine the number of dots at a point in voltage space. After performing vector filtering, double dots are correctly identified in 28% of instances using random disorders, and in 67% of instances using posterior samples. This demonstrates that posterior disorder samples have greater predictive power than random disorder potentials.
In the real device, we produce two sets of posterior samples from independent iterations of disorder inference. Accepted vectors and the associated labels from both iterations are combined to provide larger sets of results using posterior and random disorders. Examples of 2D current scans which scored highly for double quantum dot features are shown in Figure 5(b). To assess the success of our dot prediction method, a Binomial distribution is fitted to posterior and random results, where the probability of finding a double dot along an accepted voltage vector is $P(DD) = p$, with $P(\overline{DD}) = 1 - p$. The fit results in 95% confidence intervals of $0.235 < p_{post} < 0.449$ using posterior samples, and $0.004 < p_{rand} < 0.149$ using random disorders.
These values, with a separation of the 95% confidence intervals, demonstrate that using posterior disorders results in a higher rate of success than random disorders in finding experimental double quantum dots. Our results show that the inference algorithm produces disorder potentials with predictive power beyond the original domain of training data, and can reduce the human expertise required to tune a double quantum dot.
In addition to the comparison of random and posterior disorders, we also perform the filtering process with featureless (i.e. constant valued) disorder potentials. Fewer vectors are accepted to be tested than when using posterior or random disorders, and vectors which produce the highest scoring 2D scans, identified in results of posterior disorder predictions, are not found. Further details can be found in the supplemental material.
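The comparison of success rates can be illustrated with an interval estimate for a Binomial proportion. The text does not state how the confidence intervals were obtained, so the Wilson score interval used below is only one common choice and is an assumption, and the counts are hypothetical rather than the experimental numbers.

```python
import math


def wilson_interval(successes, trials, z=1.959964):
    """Wilson score interval for a Binomial proportion (z = 1.96 for 95%)."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    ) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: double dots found along accepted vectors.
post_lo, post_hi = wilson_interval(successes=20, trials=60)
rand_lo, rand_hi = wilson_interval(successes=3, trials=60)
intervals_separated = rand_hi < post_lo
```

Separation of the two intervals, as in this hypothetical example, is the criterion used in the text to argue that one method has the higher success rate.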

VI. CONCLUSION
We demonstrate that hidden disorder in a nanoscale electronic device can be inferred with indirect measurements and physics-aware machine learning. The reparameterisation of the disorder potential proves effective in reducing the dimensionality of the problem, and the successful acceleration of an electrostatic model with a differentiable convolutional neural network allows for Bayesian inference. The entire inference process, from inducing point location optimisation to selecting posterior samples, is general and applicable to any gate structure. The device specifics are contained in the electrostatic model, and can easily be adapted to other gate architectures. The prediction of double dot locations using both random and posterior disorders shows the benefits of model assisted tuning, and results indicate that the posterior disorders perform better in this task. The real device still has greater complexity than the model can capture, but the success of current predictions indicates that the use of physics-aware machine learning has narrowed the reality gap. The generality of this method and the minimal data required for inference are promising qualities for future utility in understanding nanoscale quantum devices.

Appendix A: Self-Consistent Electron Density
The electron number density is calculated using the Thomas-Fermi approximation in 2D,

$$n(r) = 2\,\frac{m^*}{2\pi\hbar^2}\,\bigl(\mu_F + e\phi_{tot}(r)\bigr)\,\Theta\bigl(\mu_F + e\phi_{tot}(r)\bigr),$$

where $m^*$ is the effective mass of an electron in GaAs, $\hbar$ is the reduced Planck constant, $\mu_F$ is the chemical potential or Fermi level of the 2DEG which is set to zero, and $\Theta(\cdot)$ is the Heaviside step function. The factor of two accounts for spin degeneracy, and the Heaviside step function approximates the Fermi distribution at low temperatures. The electrostatic potential $\phi_e$ associated with the electron density is evaluated from $n(r)$, and a self-consistent solution is computed using an iterative under-relaxation process.
The device is fabricated from a wafer (Gossard-060926C) with 2DEG density $n = 2.64 \times 10^{15}\,\mathrm{m}^{-2}$ and delta-doping density $n_\delta \approx 6 \times 10^{16}\,\mathrm{m}^{-2}$. As the 2DEG density can be accurately measured, and there is no guarantee that all Si donors become effective dopants, we fit $n_\delta$ such that the electrostatic model produces the known 2DEG density. The fitted value is $n_\delta = 1.25 \times 10^{16}\,\mathrm{m}^{-2}$, giving a calculated mean electron density of $n = (2.64 \pm 0.04) \times 10^{15}\,\mathrm{m}^{-2}$, where the quoted values are the mean and standard deviation of 100 calculations. This value is in agreement with the experimental value of $2.64 \times 10^{15}\,\mathrm{m}^{-2}$.
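The Thomas-Fermi relation between potential energy and density is simple enough to evaluate directly. The sketch below assumes the commonly quoted GaAs effective mass of 0.067 electron masses, which is not stated in the text, and uses the potential energy $U = -e\phi_{tot}$ with $\mu_F = 0$; the function name is hypothetical.

```python
import math

HBAR = 1.054571817e-34      # reduced Planck constant, J s
M_E = 9.1093837015e-31      # electron mass, kg
E_CHARGE = 1.602176634e-19  # elementary charge, C
M_EFF_GAAS = 0.067 * M_E    # assumed GaAs effective mass


def tf_density(u_joule, mu_f=0.0, m_eff=M_EFF_GAAS):
    """2D Thomas-Fermi electron density in m^-2 for potential energy U in J.

    The prefactor m_eff / (pi * hbar^2) already includes spin degeneracy.
    """
    excess = mu_f - u_joule
    if excess <= 0.0:
        return 0.0  # fully depleted: Heaviside step function
    return m_eff / (math.pi * HBAR ** 2) * excess


# A potential energy about 9.43 meV below the Fermi level gives roughly the
# measured 2DEG density of 2.64e15 m^-2 under the assumed effective mass.
n_bulk = tf_density(-9.43e-3 * E_CHARGE)
n_depleted = tf_density(1.0e-3 * E_CHARGE)
```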

Appendix B: Disorder Covariance
A Gaussian Process, requiring a covariance function of the random disorder potential, is used to generate random disorder potentials in the inference algorithm. The donor plane is divided into cells, with a random variable $I_{ij} \in \mathbb{N}_0$ determining the number of donors in the cell at $r_{ij} = (x_i, y_j)$. The potential in the 2DEG from the donor ion distribution is given by $\phi_d(r) = \sum_{ij} I_{ij}\,\phi_d(r, r_{ij})$, summing over each cell in the donor plane.
The covariance between two points in the 2DEG plane can be evaluated numerically, and appropriate kernel parameters are fitted. A rational quadratic kernel function is chosen, with fitted values of $\rho = 139.8$nm and $\sigma = 20.8$mV. Alternative kernels provide better fits, but the explicit form of the corresponding frequency distribution of random Fourier features is unknown or intractable.

Appendix C: Reparameterisation
Let $X = \{r^X_k \mid k = 1, \ldots, n_X\}$ denote the set of dense grid points (34×52 or 45×69 for the experiments in the paper, depending on the CNN model) on the x-y plane, the potential of which is the input of the CNN. Without any measurement, the disorder potential values on $X$, denoted by $\phi_X$, are approximately a random vector following the normal distribution

$$\phi_X \sim \mathcal{N}(m\mathbf{1}, K_X),$$

where $m$ is the pre-calculated mean potential level, $\mathbf{1}$ is a one-filled vector, and $K_X$ is the covariance matrix, of which element $(i, j)$ is $k(r^X_i, r^X_j)$. The value of $m = 1184$mV is determined from the mean values of 1000 random disorder potentials generated using the electrostatic model (with $\phi_s$ absorbed into the disorder potential, $m = 384$mV). For the sake of simplicity, all derivations below are based on the mean-adjusted potential, $f = \phi_X - m\mathbf{1}$. In order to generate a random sample from $f$, we can draw a random sample $\epsilon_X \sim \mathcal{N}(0, I_X)$ and then transform it as

$$f = L_X \epsilon_X,$$

where $0$ is a zero-filled vector, $I_X$ is the $n_X \times n_X$ identity matrix, and $L_X$ is the lower Cholesky factor of $K_X$.
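Sampling by Cholesky factorisation can be sketched for a handful of points without external libraries. The kernel below is the standard rational quadratic form with its shape parameter set to one, which is an assumption since only the fitted length scale and amplitude are quoted; the jitter term, the function names, and the example points are also illustrative choices.

```python
import math
import random


def rq_kernel(r1, r2, sigma=20.8, rho=139.8, alpha=1.0):
    """Standard rational quadratic kernel (mV^2); alpha = 1 is an assumption."""
    d2 = (r1[0] - r2[0]) ** 2 + (r1[1] - r2[1]) ** 2
    return sigma ** 2 * (1.0 + d2 / (2.0 * alpha * rho ** 2)) ** (-alpha)


def cholesky(K, jitter=1e-9):
    """Lower-triangular L with L L^T = K + jitter * I."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] + jitter - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L


def sample_potential(points, mean_mv, rng):
    """Draw phi_X = m 1 + L eps with eps a standard normal vector."""
    K = [[rq_kernel(a, b) for b in points] for a in points]
    L = cholesky(K)
    eps = [rng.gauss(0.0, 1.0) for _ in points]
    return [
        mean_mv + sum(L[i][k] * eps[k] for k in range(i + 1))
        for i in range(len(points))
    ]


points_nm = [(0.0, 0.0), (50.0, 0.0), (0.0, 50.0)]
phi_sample = sample_potential(points_nm, mean_mv=1184.0, rng=random.Random(0))
```

For the dense grids used as CNN inputs the same steps apply, but a numerical linear algebra library would be the natural choice for the factorisation.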
Since $n_X$ is too large for a practical Bayesian inference problem, and we want to make the inference algorithm independent of $n_X$, the inducing point approach is used. The set of inducing points, $Z = \{r^Z_k \mid k = 1, \ldots, n_Z\}$, usually has many fewer points than $X$: $n_Z < n_X$. Let $u$ denote the vector of the mean-adjusted potential values at $Z$ (i.e. $u = \alpha - m\mathbf{1}$ using notation from the main text). The two mean-adjusted potential vectors $f$ and $u$ are jointly normally distributed, and the joint distribution can be decomposed into two terms: $p(u, f) = p(u)\,p(f|u)$. The first term is a prior distribution, $p(u) = \mathcal{N}(u; 0, K_Z)$, where $K_Z$ is the covariance matrix, of which element $(i, j)$ is $k(r^Z_i, r^Z_j)$. The second term is the conditional distribution of $f$ given $u$:

$$p(f \mid u) = \mathcal{N}\bigl(f;\; K_{XZ} K_Z^{-1} u,\; K_X - K_{XZ} K_Z^{-1} K_{XZ}^\top\bigr), \qquad (C2)$$

where $K_{XZ}$ is the cross-covariance matrix, of which element $(i, j)$ is $k(r^X_i, r^Z_j)$. The computational complexity of computing the covariance of $f|u$ is $O(n_X^3)$ because of the covariance matrix in (C2). To reduce the computational complexity, any low-rank approximation can be used. In this paper, we approximate the covariance matrix with spectral features. The idea behind this approach is to let inducing points take account of spatially important locations, while the spectral features control relatively unimportant spatial information. The approximation of many types of covariance kernel functions with spectral features is extensively studied in the context of random Fourier features [21][22][23][24][25][26].
The spectral feature $\psi(\mathbf{r})$ is a vector of length $2q$ constructed from random frequencies $\omega_1, \ldots, \omega_q$, where $q$ is an arbitrarily chosen integer satisfying $n_Z < 2q < n_X$, and each $\omega_i$ is a random sample whose probability density function depends on the underlying covariance kernel function. We use $q = 300$ in this work. The inequality $n_Z < 2q$ ensures that $\Psi_Z \Psi_Z^\top$ (defined below) is invertible, and $2q < n_X$ ensures that an advantage is gained in computational complexity when using random Fourier features.
The probability distribution of the samples $\omega_1, \ldots, \omega_q$ corresponding to the kernel function (B1) is expressed in terms of $\omega$ and $\|\omega\|_1$, where $\omega$ is an angular frequency and $\|\cdot\|_1$ is the L1 norm.
The prior covariance matrices, $K_X$ and $K_Z$, are approximated by the spectral features: $K_X \approx \Psi_X \Psi_X^\top$ and $K_Z \approx \Psi_Z \Psi_Z^\top$, where $\Psi_X \in \mathbb{R}^{n_X \times 2q}$ has $\psi(\mathbf{r})$ for $\mathbf{r} \in X$ as rows, and $\Psi_Z$ is defined in a similar fashion. The posterior covariance in (C2) is approximated as $\mathrm{cov}(\mathbf{f}|\mathbf{u}) \approx \tilde{\Psi}_X \tilde{\Psi}_X^\top$. Substituting the approximated covariance matrix into (C2) gives the approximated random field, expressed in terms of $\epsilon_Z$ and $\epsilon_{2q}$, which are standard normal random vectors of length $n_Z$ and $2q$, respectively, and of $L_Z$, the lower-triangular Cholesky factor of $K_Z$. The approximated posterior random vector follows directly. It defines the reconstruction of $\phi_X$ from the mean-adjusted values; in the notation of (2), $\alpha = \mathbf{u} + m\mathbf{1}$ and $\beta = \epsilon_{2q}$.
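One way to see why a low-rank factor of the posterior covariance exists is sketched below. The sketch assumes that every kernel block in (C2), including the cross-covariance between $X$ and $Z$, is replaced by its spectral approximation, and it introduces a projector $P_Z$ that is not part of the notation above; the exact definition used in the paper may differ.

```latex
\begin{aligned}
\operatorname{cov}(\mathbf{f}\mid\mathbf{u})
  &\approx \Psi_X\Psi_X^\top
     - \Psi_X\Psi_Z^\top\left(\Psi_Z\Psi_Z^\top\right)^{-1}\Psi_Z\Psi_X^\top \\
  &= \Psi_X\left(I_{2q} - P_Z\right)\Psi_X^\top,
     \qquad P_Z = \Psi_Z^\top\left(\Psi_Z\Psi_Z^\top\right)^{-1}\Psi_Z \\
  &= \left[\Psi_X\left(I_{2q} - P_Z\right)\right]
     \left[\Psi_X\left(I_{2q} - P_Z\right)\right]^\top .
\end{aligned}
```

The last step uses the fact that $I_{2q} - P_Z$ is symmetric and idempotent. Under these assumptions, a posterior sample needs only a product of an $n_X \times 2q$ matrix with a standard normal vector of length $2q$, rather than a Cholesky factorisation of an $n_X \times n_X$ matrix, and the inverse of $\Psi_Z\Psi_Z^\top$ exists only if $n_Z \le 2q$.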
Appendix D: Detailed Inference Algorithm

Overview
The posterior inference requires two prerequisites with no interdependence: i) fixing a gate scale factor, ii) optimising the inducing points. The gate voltages are multiplied by the gate scale factor. The gate scale factor is optimised by maximising the likelihood of the observations under the assumption that the disorder is perfectly flat. Optimised scale factor values range from 3.48 to 3.94 for runs of the inference algorithm on different thermal cycles.

Inducing Points Optimisation
Before obtaining any measurements, the inducing point optimisation can be conducted with simulated data. The inducing point optimisation is expensive to compute, but the computation time is not critical, because the optimisation only has to be performed once for a given gate architecture. Algorithm 1 shows the optimisation procedure. Line 10 of Algorithm 1 is important: it generates a posterior random sample using the information $\{Z, \mathbf{u}\}$. The sample only retains information about $\phi_d$ at $Z$. For the experiments, we used (C2) for the posterior sampling; the approximated distribution (C4) can be used instead if computation speed matters.
For the input of Algorithm 1, equation (C1) is used for generating $\Phi^X_{\mathrm{sim}}$. Each element of $V_{\mathrm{sim}}$ is generated by choosing a disorder randomly from $\Phi^X_{\mathrm{sim}}$, then choosing a pair of voltage vectors near the closed channel boundary using the uniform direction sampling of [9]. The function $\mathrm{KL}(p, p')$ computes $p_i \log(p_i/p'_i) + (1 - p_i)\log((1 - p_i)/(1 - p'_i))$ element-wise for $p$ and $p'$, then averages the results. For the experiments in the paper, $n_d$ and $n_v$ are set to 20, and the ADAM optimiser is used. The current probability prediction $F^{\mathrm{prob}}_U$ applies a sigmoid function to the output of the CNN model $F_U$, where $g(\phi_d, v)$ computes $\phi_{\mathrm{in}}$ on the dense grid for the CNN (see Section III C), and $\sigma_\xi$ is a modified sigmoid function with a steepness parameter, $\sigma_\xi(\cdot\,; 10) = \xi + (1 - 2\xi)\,\sigma(\cdot\,; 10)$. The margin $\xi$ allows for discrepancy between our approximated model and real-world measurements, and the steepness parameter, set to 10, gives a relatively sharp current probability while keeping the function differentiable. Differentiability is required to use the ADAM optimiser.

    for all $\phi_d \in \Phi^X_{\mathrm{mini}}$ and $v \in V_{\mathrm{mini}}$ do
8:      $\mathbf{u}$ ← interpolated values of $\phi_d$ at $Z$
10:     $\phi_d$ ← a random sample from the posterior GP with $\{Z, \mathbf{u}\}$
11:     $l$ ← KL($p$, $p'$) + KL($p$, $p'$)
        loss ← loss + $l$
14:   end for
15:   $Z$ ← opt.update(loss, $Z$)
16: end while
ALGORITHM 1. Inducing point optimisation
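The two scalar building blocks of the loss can be sketched as follows. The sketch assumes that $\sigma(x; s)$ is the logistic function of $s x$, and the margin value `xi = 0.01` is a placeholder, since the value of $\xi$ is not stated here; the function names are illustrative.

```python
import numpy as np

def sigmoid(x, steepness=10.0):
    """Assumed form of sigma(x; s): the logistic function of s * x."""
    return 1.0 / (1.0 + np.exp(-steepness * x))

def sigmoid_margin(x, xi=0.01, steepness=10.0):
    """Modified sigmoid xi + (1 - 2 xi) sigma(x; s), bounded in [xi, 1 - xi]."""
    return xi + (1.0 - 2.0 * xi) * sigmoid(x, steepness)

def kl_bernoulli(p, p_prime):
    """Average element-wise KL divergence between Bernoulli probabilities."""
    p = np.asarray(p, dtype=float)
    p_prime = np.asarray(p_prime, dtype=float)
    terms = p * np.log(p / p_prime) + (1.0 - p) * np.log((1.0 - p) / (1.0 - p_prime))
    return terms.mean()

# Probabilities produced by the modified sigmoid stay away from 0 and 1,
# so the logarithms in the KL term remain finite.
p_true = sigmoid_margin(np.array([-0.4, 0.1, 0.6]))
p_sample = sigmoid_margin(np.array([-0.3, 0.0, 0.7]))
loss_term = kl_bernoulli(p_true, p_sample)
```

Because both functions are smooth in their inputs, gradients with respect to the inducing point locations can in principle be propagated through them by an automatic differentiation framework.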

MCMC Inference
The goal of the MCMC inference is to generate random samples from the posterior distribution of the uncertain variables. The uncertain variables in (C4) are $\mathbf{u}$ and $\epsilon_{2q}$. The posterior pdf of $(\mathbf{u}, \epsilon_{2q})$ given the observed current measurements $D$ is

$$p(\mathbf{u}, \epsilon_{2q} \mid D) \propto p(D \mid \mathbf{u}, \epsilon_{2q})\, p(\mathbf{u})\, p(\epsilon_{2q}).$$

The prior distributions $p(\mathbf{u})$ and $p(\epsilon_{2q})$ are defined in Appendix C. For the experiments in the paper, Hamiltonian Monte Carlo is used with the posterior pdf. Each time MCMC inference is performed, a different number of posterior samples is generated. In our work we find typical values to be $150 < n_s < 320$.
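A sampler such as Hamiltonian Monte Carlo only needs the posterior up to a constant. The sketch below evaluates an unnormalised log posterior under two assumptions that are not stated in this appendix: the observations $y_i$ are binary (current or no current), and the likelihood is Bernoulli with probabilities supplied by a hypothetical callable `predict_prob`, standing in for the CNN-based current probability prediction.

```python
import numpy as np

def log_posterior(u, eps, K_Z, y, predict_prob):
    """Unnormalised log p(u, eps | D), dropping additive constants.

    predict_prob(u, eps) is a hypothetical callable returning the predicted
    probability of current for each observation in y.
    """
    # Gaussian prior on u: N(0, K_Z), evaluated via the Cholesky factor.
    L = np.linalg.cholesky(K_Z)
    w = np.linalg.solve(L, u)
    log_p_u = -0.5 * w @ w - np.log(np.diag(L)).sum()
    # Standard normal prior on the spectral amplitudes.
    log_p_eps = -0.5 * eps @ eps
    # Assumed Bernoulli likelihood for binary current observations.
    p = predict_prob(u, eps)
    log_lik = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return log_lik + log_p_u + log_p_eps

# Toy check with an uninformative predictor that always returns 0.5.
y_obs = np.array([1.0, 0.0, 1.0, 0.0])
value = log_posterior(np.zeros(2), np.zeros(3), np.eye(2), y_obs,
                      lambda u, eps: np.full(4, 0.5))
```

In practice the gradient of this quantity with respect to $(\mathbf{u}, \epsilon_{2q})$ would also be required, which is one reason a differentiable surrogate model is convenient.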

Supplemental Material

ELECTROSTATIC MODEL
As discussed in the main text, the total electrostatic potential in the 2DEG, $\phi_{\mathrm{tot}}(\mathbf{r})$, is considered as the sum of components $\phi_{\mathrm{tot}}(\mathbf{r}) = \phi_g(\mathbf{r}) + \phi_d(\mathbf{r}) + \phi_s(\mathbf{r}) + \phi_e(\mathbf{r})$, and the electrostatic potential energy of an electron is $U(\mathbf{r}) = -e\phi_{\mathrm{tot}}(\mathbf{r})$.
The gate electrodes lie on the surface of the device, and the potential from each gate at a depth $d = 115$ nm beneath the gates, in the plane of the 2DEG, is determined using analytic expressions. A further 5 nm is added to the depth of the AlGaAs/GaAs heterojunction (110 nm) to account for the extent of the electron density beyond this junction. A representation of the gates is taken from an SEM image of a device of identical design. The image of each gate is used, and the potential from each pixel is calculated individually and summed to give the total potential from each gate, $\phi_{g_i}(\mathbf{r})$. The total gate potential is

$$\phi_g(\mathbf{r}) = G_{SF} \sum_i v_i\, \phi_{g_i}(\mathbf{r}), \tag{S1}$$

where the sum is over all gates and $v_i$ is the voltage applied to the $i$-th gate. The gate scale factor $G_{SF}$ is used because the pinned surface model underestimates the magnitude of the gate potential. This underestimation is observed when gate voltages stop current in the experimental device but do not deplete the transport path from source to drain in the simulated device.
Donor ions exist in a plane at a constant height $h = 45$ nm above the 2DEG. The potential in the 2DEG from a single donor ion at location $\mathbf{r}_{ij} = (x_i, y_j)$ in the donor plane is denoted $\phi_d(\mathbf{r}, \mathbf{r}_{ij})$. The potentials from all donor ions are summed to give the disorder potential $\phi_d(\mathbf{r})$. The electron density is calculated using the Thomas-Fermi approximation in 2D, and a self-consistent solution for $\phi_{\mathrm{tot}}(\mathbf{r})$ is computed using an iterative under-relaxation process.

MODELLING TRANSPORT AND DOTS
A semi-classical trajectory of electrons between source and drain is calculated by formulating the 2D potential $\phi_{\mathrm{tot}}$ as a graph $G_\phi$, where each pixel is a node. $G_\phi$ is defined by a set of nodes $V$, with $v_i$ the value of the $i$-th node, and a set of edges $E$, with $e_{ij}$ the edge connecting $v_i$ and its nearest neighbour $v_j$.
Dijkstra's algorithm yields the shortest path from source to drain through $G_\phi$, but this path overestimates the maximum potential energy of an electron along the path. Gate voltages which close the transport channel are then underestimated by our model. This effect can be observed when comparing the transport path with an electron density profile in Figure S1(a). As discussed in the main text, calculating the minimum spanning tree (MST) of $G_\phi$ resolves this issue by providing a unique path connecting source and drain with a minimum sum of edge weights. The transport path determined by the MST and the associated electron density are shown in Figure S1(b). This trajectory also allows the number of quantum dots in the transport channel to be counted.
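The barrier height along such a path can be illustrated with a short sketch. Rather than building the MST explicitly, it uses a bottleneck variant of Dijkstra's algorithm, which gives the same barrier because the MST path between two nodes minimises the largest edge weight along the path. The edge weight is assumed here to be the larger of the two node values, and the 4-connected grid, the array `U`, and the function name are illustrative choices.

```python
import heapq
import numpy as np

def minimax_barrier(U, source, drain):
    """Smallest achievable maximum of U along any 4-connected path."""
    rows, cols = U.shape
    best = np.full(U.shape, np.inf)
    best[source] = U[source]
    heap = [(float(U[source]), source)]
    while heap:
        bottleneck, (i, j) = heapq.heappop(heap)
        if (i, j) == drain:
            return bottleneck
        if bottleneck > best[i, j]:
            continue  # stale heap entry
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols:
                # Assumed edge weight: the larger of the two node values.
                cand = max(bottleneck, float(U[ni, nj]))
                if cand < best[ni, nj]:
                    best[ni, nj] = cand
                    heapq.heappush(heap, (cand, (ni, nj)))
    return float("inf")

# Toy potential energy landscape: the direct route crosses a barrier of 5,
# while a longer detour never exceeds 1.
U = np.array([[0.0, 5.0, 0.0],
              [1.0, 9.0, 1.0],
              [1.0, 1.0, 1.0]])
barrier = minimax_barrier(U, (0, 0), (0, 2))
```

Comparing the returned barrier with the Fermi level then gives a binary open or closed decision for the transport channel.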

DEEP LEARNING
A deep convolutional neural network (CNN), denoted $F_U$, is trained to approximate $U^*$ given gate voltages and a disorder potential. The structure of $F_U$ is shown in Figure S2(a). The ResNet skip-connections used in $F_U$, as shown in Figure S2(b), share similarities with the iterative method used to solve for the self-consistent potential. The training data set contains 85,000 entries, with each input being a potential profile $\phi_{\mathrm{in}} = \phi_g + \phi_d + \phi_s$, where $\phi_g$ and $\phi_d$ are randomly generated. The output training data consists of the self-consistent potential $\phi_{\mathrm{tot}}$ and $\phi_{\mathrm{tot}}(\mathbf{r}^*)$, with $U^* = -\phi_{\mathrm{tot}}(\mathbf{r}^*)$. The resolution of each input is reduced from the high resolution required to accurately compute the training data, as shown in Table S1. Training is performed for 100 epochs with a learning rate of $1 \times 10^{-3}$ (dropping to $2.5 \times 10^{-4}$ in two steps) and an MSE loss function.
The structure of the dot classifier CNN is shown in Figure S2(c). The network learns the mapping $F_D : \phi_{\mathrm{in}} \to P(N_{\mathrm{dot}})$ for $N_{\mathrm{dot}} \in \{0, 1, 2, 3\}$, where the classification is taken as the class with the maximum value of $P(N_{\mathrm{dot}})$. An intermediate classifier is used to generate a suitable dataset, as discussed in the main text. Training is performed for 100 epochs with a learning rate of $1 \times 10^{-3}$, dropping to $5 \times 10^{-4}$ after 70 epochs. Test results for dot classification have an accuracy of 98.0% on random (dot-sparse) data and 75.9% on selected (dot-abundant) data.
Following the notation used in Table S1, $F_U$ and $F_D$ use 1/8 and 1/6 resolution of $\phi_{\mathrm{in}}$ respectively. The computation time for both networks $F_U$ and $F_D$ is approximately 0.2 ms given a 2D potential input and using a GPU. However, processing a vector of gate voltages into a 2D potential increases this time to 0.6 ms. The processing involves determining the total gate potential using (S1), and summing this with the disorder potential.

Table S1. Performance metrics of $F_U$ for different resolutions of $\phi_{\mathrm{in}}$, computed using a GPU (GTX 1080 Ti). Max-pooling processes are adapted for each resolution; otherwise the networks are identical. The resolution reduction fraction is applied to both input dimensions. Full resolution (269, 411) is only required for computing training data, so it is not considered. Time per output is from a batch of 1000 inputs. MAE is chosen so that units remain in mV, and transport channel error is based on binary classification using the value of $\phi_{\mathrm{tot}}(\mathbf{r}^*)$ produced by $F_U$.


DISORDER COVARIANCE
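As discussed in the main text, the potential at a point $\mathbf{r}$ in the 2DEG from the donor ion distribution is given by $\phi_d(\mathbf{r}) = \sum_{ij} I_{ij}\,\phi_d(\mathbf{r}, \mathbf{r}_{ij})$, summing over each cell in the donor plane (subscripts $i$ and $j$ reflect the 2D grid of cells). The covariance of $\phi_d$ between two points in the 2DEG plane can be computed as $\mathrm{cov}(\mathbf{r}, \mathbf{r}') = \mathrm{var}(I) \sum_{ij} \phi_d(\mathbf{r}, \mathbf{r}_{ij})\,\phi_d(\mathbf{r}', \mathbf{r}_{ij})$, where $I$ is the distribution from which each $I_{ij}$ is independently drawn. Using the correlation function eliminates the dependence on $I$,

$$\mathrm{corr}(\mathbf{r}, \mathbf{r}') = \frac{\sum_{ij} \phi_d(\mathbf{r}, \mathbf{r}_{ij})\,\phi_d(\mathbf{r}', \mathbf{r}_{ij})}{\sqrt{\sum_{ij} \phi_d(\mathbf{r}, \mathbf{r}_{ij})^2 \,\sum_{ij} \phi_d(\mathbf{r}', \mathbf{r}_{ij})^2}}.$$

This expression is evaluated numerically, and appropriate parameters are fitted to the kernel function given in the main text, $k(\mathbf{r}, \mathbf{r}') = \sigma \left(1 + \frac{|\mathbf{r} - \mathbf{r}'|^2}{\rho^2}\right)^{-1}$. Normalising to the correlation kernel, a least-squares fit results in $\rho = 139.8$ nm, with an MSE of $1.65 \times 10^{-4}$. An appropriate value of $\sigma = 20.8$ mV is found using the standard deviations of 1000 disorder profiles.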
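The fitting step can be sketched as a one-parameter least-squares search. The correlation data below are synthetic, generated from the kernel itself as a stand-in for the numerically evaluated correlation, and the grid search and function names are illustrative rather than the fitting procedure actually used.

```python
import numpy as np

def correlation_kernel(d, rho):
    """Normalised kernel 1 / (1 + d^2 / rho^2) as a function of separation d."""
    return 1.0 / (1.0 + (d / rho) ** 2)

def fit_rho(d, corr, candidates):
    """Return the candidate rho with the smallest mean squared error."""
    errors = [np.mean((corr - correlation_kernel(d, rho)) ** 2) for rho in candidates]
    best = int(np.argmin(errors))
    return candidates[best], errors[best]

# Synthetic stand-in for the numerically evaluated correlation (d in nm).
d = np.linspace(0.0, 600.0, 61)
corr = correlation_kernel(d, 139.8)

# Search rho between 50 nm and 300 nm in steps of 0.1 nm.
rho_fit, mse = fit_rho(d, corr, np.linspace(50.0, 300.0, 2501))
```

With real correlation data the residual would not vanish, and the remaining mean squared error indicates how well the kernel form matches the donor-induced correlation.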

INDUCING POINT VALUES
As discussed in the main text, the standard deviation of inducing point values across the posterior samples indicates how much the inference algorithm has learned about the disorder potential at those locations. A low standard deviation on an inducing point means that all posterior samples have similar values at that location and therefore the inference algorithm is confident of the disorder potential there. To further demonstrate this, we consider inference results for the simulated device using training datasets of different sizes, as shown in Figure S3. The training dataset is $D = \{(v_i, y_i) \mid i = 1, \ldots, 2n_u\}$, as defined in the main text. A training dataset with $n_u = 200$ is used for the inference results discussed in the main text.
We observe that the number of inducing points with a low posterior standard deviation increases with the size of the training dataset. This indicates that the inference algorithm has gained information about a larger area of the disorder potential by considering more directions in voltage space. We can also observe that, even for a small training dataset, the inference results are most confident about the disorder potential values at the tip of gate G1, which reflects its role in depleting the electron density along the path from source to drain.
The lowest standard deviation on a given inducing point is approximately 2 mV using the simulated measurements to inform the inference algorithm, and approximately 5.5 mV using experimental measurements. This performance difference can be expected, as the simulated device is a controlled and self-contained environment, whereas the experiment will have more unknowns which may not be accounted for in the model.

POSTERIOR DISORDER SAMPLES
The inference algorithm produces a set of posterior values for the inducing point values, $\alpha$, and random Fourier feature amplitudes, $\beta$. These values are used to produce posterior samples of the real-space disorder potential, as outlined in the main text and Appendix C. The resolution of these posterior disorder potentials can be chosen depending on the desired use (e.g. as inputs to $F_U$, $F_D$, or the self-consistent electrostatic model). Figure S4 shows the true disorder and posterior samples for two iterations of the inference algorithm on a simulated device. The posterior samples exhibit much more detailed features inside the region spanned by the optimised inducing points, where qualitative similarities with the true disorder can be observed. This further demonstrates the information gained at these points (in addition to Figure S3).
For posterior disorder potentials, features outside the inducing point region are governed by the amplitudes of the random Fourier features contained in $\beta$, which are necessary to ensure that the posterior samples are continuous and suitable to be used as inputs to $F_U$ and $F_D$.

Figure 1. (a) Device geometry including the gate electrodes (labelled G1-G8), donor ion plane, and an example disorder potential experienced by confined electrons. Typical flow of current from source to drain is indicated by the white arrow. (b) Schematic of the disorder inference process. Colours indicate the following: red for experimentally controllable variables, green for quantities relevant to the electrostatic model, blue for the experimental device, and yellow for machine learning methods. Dashed arrows represent the process of generating training data for the deep learning approximation and are not part of the disorder inference process.
summing over the location of each donor. Examples of $\phi_g$ and $\phi_d$ are shown in Figure 2(a) and (b) respectively.

Figure 2. An example of the device model as discussed in Section III. Spatial coordinates x and y are used to indicate the scale of the device. (a) Electrostatic potential from the gate electrodes $\phi_g$, (b) a disorder potential $\phi_d$, and (c) self-consistent potential $\phi_{\mathrm{tot}}$ given the potentials in (a) and (b). (d) The electron density in the 2DEG given the potential in (c), which shows a single dot. (e) Example of an MST path from source to drain in 2D (yellow line), with the location of $U^*$ marked by a green circle. (f) The potential energy, $U$, corresponding to the MST path in (e), with $U^*$ marked by a green circle. The horizontal axis indicates the total distance moved in 2D space. The channel is closed since the value of $U^*$ is above the Fermi level indicated by the red dashed line.

Figure 3. Gate architecture overlaid with inducing point locations X, indicated by blue dots, (a) before optimisation and (b) after optimisation. Red circles indicate the two corners defining the rectangular array of inducing points. The locations of these corners are optimised to ensure that the disorder potential values at the inducing points determine whether the transport channel is open or closed. Hence the optimised inducing points are located closer to the source and drain reservoirs.

Figure 4. Disorder inference results using experimental data for three thermal cycles (A, B, C) of the same device. The inducing point locations, $Z_{\mathrm{opt}}$, are indicated by circles with the gate structure in the background. The colour of each inducing point represents the standard deviation over posterior samples of the disorder potential value at that point.

Figure 5. (a) Predicting double dot locations. A unit vector $\mathbf{u}$ is passed to the filter, which determines whether the vector is considered for the test device (which can be a real or simulated device). The filter uses the dot classifier $F_D$ to scan along $\mathbf{v} = R\mathbf{u}$ for each of the $n_s$ disorder samples, and the score is increased for each disorder sample which produces a double dot in the scan. Posterior disorder samples from the inference algorithm are shown. Vectors with a score greater than $n_s/3$ are accepted to be tested. An example current trace (solid blue line) of a voltage vector from the origin to the limit of device operation is shown. The dashed red line indicates the 80% threshold used to begin 2D current scans over gates G3 and G7. The 2D scans are taken at intervals along the original current trace (indicated by red circles). The resulting 2D current scans are passed to multiple human experts to label the presence of double quantum dots. (b) Example current scans over G3 and G7 which scored highly for double dots when labelled by 6 human experts for 3 different unit vectors. Each scan is a 200 mV × 200 mV window with the voltages associated with the direction of $\mathbf{u}$ at the centre.

Input: Set of randomly generated disorders $\Phi^X_{\mathrm{sim}}$, set of randomly generated voltages $V_{\mathrm{sim}}$, initial inducing points $Z_{\mathrm{init}}$, minibatch size of disorders $n_d$, minibatch size of voltages $n_v$, Gaussian process kernel $k$ for disorder, optimiser parameters $\theta_{\mathrm{opt}}$, CNN model for current probability prediction $F^{\mathrm{prob}}_U$
Output: Optimised inducing points $Z$
1: opt ← Adam optimiser($\theta_{\mathrm{opt}}$)
2: $Z$ ← $Z_{\mathrm{init}}$
3: while stopping criterion not satisfied do
4:   $\Phi^X_{\mathrm{mini}}$ ← choose random $n_d$ samples from $\Phi^X_{\mathrm{sim}}$
5:   $V_{\mathrm{mini}}$ ← choose random $n_v$ samples from $V_{\mathrm{sim}}$

Figure S1. Comparison of the semi-classical electron trajectory for (a) the full graph $G_\phi$ and (b) the MST of $G_\phi$ at identical gate voltages. The electron density with the relevant transport path (yellow line) is shown in the upper plot of each panel. The potential along each path (blue line) is shown in the lower plot of each panel, where the Fermi energy, $\mu_F$, is indicated by a red dashed line. The 'Distance' axis indicates the total distance moved in 2D space. The transport path using the full graph overestimates the potential barrier height, as it shows two regions where $U > \mu_F$. The electron density demonstrates that there is only one region where $U > \mu_F$, matching the prediction of the MST path.

Figure S2. The neural networks used to compute $U^*$ and classify dots. Text to the left and right of the arrows indicates the activation function applied and dimension reduction processes respectively. (a) Structure of $F_U$, beginning with the 2D normalised input potential $\phi_{\mathrm{in}}$, to the self-consistent potential $\phi_{\mathrm{tot}}$, and finally to the minimax value $\phi_{\mathrm{tot}}(\mathbf{r}^*)$. All convolutional units have 32 channels. (b) Schematic of a residual block as used in $F_U$ and $F_D$. The number of channels is preserved through a residual block. (c) Structure of $F_D$, beginning with the normalised input potential, with a probability distribution for the number of dots as the output. All convolutional units have 64 channels, and dots are classified using one-hot encoding for {0, 1, 2, 3} dot classes.

Figure S3. Disorder inference results using simulated data for three training dataset sizes ($n_u = 50, 100, 200$, with $D = \{(v_i, y_i) \mid i = 1, \ldots, 2n_u\}$ as discussed in the main text). The inducing point locations are indicated by circles with the gate structure in the background. The colour of each inducing point represents the standard deviation of the posterior disorder potentials at that point.

Figure S4. The true disorder and 4 randomly selected posterior disorder samples for 2 iterations of the inference algorithm on a simulated device, (a) and (b). The true disorder is randomly generated using the electrostatic model with randomly located donor ions. The blue box on each disorder potential indicates the region spanned by the optimised inducing points. The posterior disorder samples have the same resolution as the true disorder (134×206). In each subfigure, the posterior disorders have more detailed features within the box, and qualitative similarities with the true disorder can be observed. All plots in each subfigure share the same scale, as shown at the bottom of each subfigure.