Active Learning approach to simulations of Strongly Correlated Matter with the Ghost Gutzwiller Approximation

Quantum embedding (QE) methods such as the Ghost Gutzwiller Approximation (gGA) offer a powerful approach to simulating strongly-correlated systems, but come with the computational bottleneck of computing the ground state of an auxiliary embedding Hamiltonian (EH) iteratively. In this work, we introduce an active learning (AL) framework integrated within the gGA to address this challenge. The methodology is applied to the single-band Hubbard model and results in a significant reduction in the number of instances where the EH must be solved. Through a principal component analysis (PCA), we find that the EH parameters form a low-dimensional structure that is largely independent of the geometric specifics of the systems, especially in the strongly-correlated regime. Our AL strategy enables us to discover this low-dimensionality structure on the fly, while leveraging it for reducing the computational cost of gGA, laying the groundwork for more efficient simulations of complex strongly-correlated materials.


I. INTRODUCTION
At present, most quantitative simulations of quantum matter utilize standard approximations to density functional theory (DFT) [1,2].However, these approximations face limitations when simulating the properties of strongly-correlated systems, which are solids and molecules where electrons are localized around specific atomic sites and have intensified interactions due to spatial confinement.This issue is especially relevant in materials containing transition metals from the 3d series and to lanthanides and actinides.To address this challenge, various QE [3,4] many-body techniques have been developed.Methods such as dynamical mean-field theory (DMFT) [5][6][7][8][9], density matrix embedding theory (DMET) [10,11], rotationally invariant slave boson theory (RISB) [12][13][14], and the multi-orbital Gutzwiller approximation (GA) [15][16][17][18][19][20][21][22], are now widely used for quantitatively simulating strongly-correlated systems.Recently an extension of the GA, denoted as the gGA [23,24], has been developed.The gGA framework incorporates auxiliary Fermionic degrees of freedom to enrich the variational space.Notably, gGA has demonstrated accuracy that is comparable to DMFT [23][24][25][26][27], indicating that it might serve as an advantageous alternative, especially when aiming for a combination of accuracy and computational manageability.
However, all of the available QE many-body techniques pose a computational burden for emerging applications in materials discovery, where computational efficiency is crucial for reducing both the time and cost of material development.The main reason lies in their common QE algorithmic structure, that requires the iterative solution of an EH for each correlated fragment in the system, constituting the most computationally intensive step [28][29][30][31].Addressing this bottleneck could enable accurate simulations of strongly-correlated materials at computational costs comparable to traditional approximations to DFT.
In prior work, a machine-learning-based solution to this problem was proposed both in the context of DMFT [32,33] and in the context of the GA [34], exploiting the observation that the mathematical structure of the EH is determined solely by the electron shell structure of the impurity, thus being consistent across diverse materials and molecules.This intrinsic commonality that arises when solving the EH across different materials and molecules, which we refer to as "universality," suggests that machine learning (ML) techniques could, in principle, be trained once and for all to solve the EH problem, thereby bypassing the computational bottleneck of all subsequent QE simulations.In particular, in Ref. [34] a method combining the GA method with a ML algorithm, termed "n-KRR," demonstrated success in implementing this program for a series of actinide systems.However, this achievement was enabled by the possibility to specifically conjecture the physically-relevant range of training data for these materials -an advantage that arXiv:2312.05343v2[cond-mat.str-el]12 Dec 2023 is not generally available.Indeed, this represents the primary barrier to overcome for extending this approach to general many-body systems: it is generally impossible to preemptively determine which training data should be generated.Consequently, a different approach is required to make this ML strategy universally applicable.
To overcome the challenge of determining a priori training data, here we introduce an active learning methodology that marries probabilistic ML techniques -specifically, a recent extension of Gaussian Process Regression (GPR)-with the gGA framework.As new gGA calculations proceed, our active learning model continuously evaluates new instances of the EH problem, adaptively updating and refining its own training set based on the level of uncertainty in its predictions.This strategy eliminates the need for a predefined training set and ensures that only physically relevant data are gathered throughout the computational process.
We benchmark our method using the single-band Hubbard model across varying geometries and interaction strengths, thereby significantly reducing the required number of explicit EH calculations.Using a PCA, we show that the EH parameters explored throughout these calculations have a low-dimensional structure, largely independent of the specific lattice configurations, particularly in strongly-correlated regimes.We discuss how such inherent low-dimensional structure of the parameter space opens a path for computational techniques commonly found in computer science, underlining the potential of our active learning strategy to generalize across a wide array of strongly-correlated materials in future work.

II. MODEL AND gGA METHOD
This section aims to lay the foundation for the subsequent development of the QE algorithmic structure.We begin by introducing the single-band Hubbard model, that we employ in our benchmark calculations.We then present the formulation of the gGA.A primary focus of gGA is to iteratively solve for the ground state of an EH, an essential component of the QE approach.
A. The single-band Hubbard model For clarity, in this work we present the formalism underlying our AL framework focusing on a generic singleband Hubbard Hamiltonians represented as follows: where N is the number of system fragments, c † iσ and c iσ are Fermionic creation and annihilation operators, σ is a spin label, i and j are fragment labels, µ is the chemical potential, U is the interaction strength and ni = σ c † iσ c iσ is the number operator for the system fragment i, and t ij is a generic hopping matrix, with arbitrary entries.

B. The gGA Lagrange function
Specializing the theory presented in Refs.[23,24] to the single-orbital Hubbard equation given by Eq. ( 1), and focusing on solutions preserving both spin and translational symmetries, we find that the ground state in the gGA can be obtained by extremizing the following Lagrange function: where I is the identity matrix, the integer number B controls the size of the gGA variational space and, in turn, the precision of the gGA approach, E and E c i are scalars, and ∆ i , Λ i , and Λ c i are B × B Hermitian matrices.Additionally, D i and R i are B × 1 column matrices.The socalled "quasiparticle Hamiltonian" ( Ĥqp ) and EH ( Ĥi emb ) are defined as: Ĥi Here the vector |Φ i ⟩ is the most general embedding state for the fragment i, i.e., the most general state within the Fock space spanned by the 2(B + 1) modes c † iσ and b † iaσ , with (B + 1) Fermions in total (half-filled).The vector |Ψ 0 ⟩ is the most general single-particle state within the so-called "quasi-particle" space, spanned by the 2BN modes f † iaσ .For B = 1, Eq. ( 2) reduces to the standard GA Lagrange function.In this work we set B = 3, which proved to be sufficient for capturing the ground-state properties with accuracy comparable to DMFT for the ground-state properties [23][24][25][26][27].

C. Gauge Invariance and Physical Observables
It can be readily verified that the gGA Lagrangian is invariant with respect to the following gauge transformation: with: where θ = (θ 1 , .., θ N ), θ i are B × B Hermitian matrices and the superscript "T" denotes the transpose, while the superscript " * " denotes the complex conjugate.The name "gauge" here refers to the fact that modifications of the parameters generated by such a gauge transformation do not influence any physical observable, which can be extracted from the variational parameters that extremize the Lagrange function.
For completeness, below we write explicitly how the physical observables that we calculate in this paper are computed as a function of the variational parameters, based on the theoretical framework derived in previous work [23,24].
The total energy of the system is given by the Lagrange function value after extremization (which is gauge invariant).The expectation values for local observables are encoded in |Φ i ⟩.In particular, the local double-occupancy expectation value in the gGA ground state is given by the following gauge-invariant expression: where niσ = c † iσ c iσ .To calculate the quasi-particle weight, it is convenient to express the variational parameters in a gauge where [Λ i ] ab = [l i ] a δ ab (which always exists, since Λ i is Hermitian).In this gauge, the mathematical expression for the quasi-particle weight is the following: .
where Σ i (ω) is the self-energy.

III. FORMULATION OF THE ML PROBLEM
A pivotal insight at the base of the ML approach proposed in this work is that the problem of extremizing the Lagrange function in Eq. ( 2) can be formally tackled by first solving for |Φ i ⟩ and E c i as a function of the other parameters.This amounts to replacing the original variational problem of extemizing L in Eq. ( 2) with the problem of extremizing the following Lagrange function: where: Ĥi emb is the EH defined in Eq. ( 4), and the expectation value is taken with respect to the corresponding halffilled ground state |Φ i ⟩.Note that the subscript i is not present in the function Ēc of Eq. ( 16), highlighting the fact that the function does not depend on it.
A key property of the function Ēc (D, Λ c , U, µ) is that it is "universal," in the sense that its definition is irrespective of the details of the model system under study.For example, when applying gGA to any of the Hubbard models defined in Eq. (45), the form of Ēc (D, Λ c , U, µ) would remain consistent, independent of the specific lattice structure or the numerical values of the hopping matrix t.Therefore, if one could learn the energy function Ēc (D, Λ c , U, µ), along with its gradient, the computational cost of extremizing L would be significantly alleviated.The purpose of this work is to derive a ML model to accomplish this goal.

A. The EH Universal Function
The Lagrange equations obtained by extremizing the Lagrange function in Eq. ( 16) can be solved iteratively [23,24].The specific algorithm employed in this work is detailed in the Supplemental Material [35].
For the purposes of this paper, the essential point is that the algorithmic structure involves iteratively evaluating Ēc (D i , Λ c i , U, µ) and its gradient, as the parameters D i and Λ c i are updated at each step.Specifically, the computational bottleneck lies in the evaluation of the ground-state single-particle density matrix elements of each fragment i: where the identities above hold true because of the Hellmann-Feynman theorem.

B. Reducing the complexity of the learning problem
In the previous subsection we have introduced the function Ēc (D, Λ c , U, µ).Note that, since D and Λ c are generally complex and Λ c is Hermitian, Ēc is a function of 2B + B 2 + 2 real parameters.
In this section we show that it is possible to reduce the problem of learning the aforementioned universal EH energy function to the problem of learning the following function of only 2B real variables: This simplification is made possible by the fact that the EH function Ēc , defined in Sec.III A, satisfies the following general properties: • Invariance of half-filled ground state |Φ⟩ under simultaneous shift of impurity and bath energies: • Linear homogeinity: • Gauge invariance: where x is any real number and u is any B × B unitary matrix (i.e., a gauge transformation, see Sec.II C).From the properties above it follows that: = U Ēc ( D, Λc , 1, 0) + µ = U E( D1 , .., DB , Λc 11 , .., Λc BB ) + µ , where: and the unitary matrix u is a gauge transformation chosen in such a way that Λc is diagonal and the entries of D are real and sorted in descending order, as detailed in the Supplemental Material [35].
Another important consequence of Eqs. ( 23) and ( 24) is that, as shown in the Supplemental Material [35], also the gradient of Ēc is fully encoded in the gradient of E, which is given by the following equations: where the expectation values are taken with respect to the ground state of the EH defined in Eq. (21).

C. Summary
In summary, in this section we have reduced the problem of solving the gGA equations to an iterative procedure which consists of evaluating iteratively the functions: where we have introduced the 2B-dimensional real vector: Each evaluation of E(X) and F(X) requires to calculate the ground state of the EH defined in Eq. ( 21), whose dimension is 2 2(B+1) .This can quickly become the computational bottleneck of the gGA framework as B increases, which we aim to mitigate using ML.An important point to highlight is the simplification introduced by reducing the problem to the learning of a single scalar function E(X), which we achieved through the use of the Hellmann-Feynman theorem.This approach streamlines computation and reinforces predictive accuracy compared to the method of individually learning each entry of the ground-state single-particle density matrix of the EH (which is the technique we previously employed in Ref. [34]).First, learning a single scalar function is computationally less demanding than learning an array of functions corresponding to each matrix element.Second, this method also automatically enforces inherent prior information about these specific functions, such as the condition that: i.e., F(X) is conservative, thereby enhancing the overall predictive accuracy of the model.Interestingly, from the mathematical perspective, the problem outlined above bears a striking resemblance to a specific successful ML application, namely learning force fields for accelerating molecular dynamics simulations [36][37][38].In fact, in both instances, the challenge revolves around learning the total energy E(X) of a system and its gradient F(X) from computational data.On the other hand, there are 2 key differences: • Universality: Within our framework, if we could learn once and for all E(X), the resulting model could be used to bypass such computational bottleneck for any system involving single-orbital fragments (e.g., all models of the form represented in Eq. ( 45), for all hopping matrices t ij and for any values of U and µ).
• Domain structure: In gGA the sequence of points X α explored throughout the computation of any given model always converges towards the specific X realizing the corresponding solution.Therefore, the majority of these points, where E(x) needs to be learned, gravitate around X, instead of being spread around the whole parameters space.
These unique characteristics suggest a more tractable learning problem.Specifically, we are not compelled to learn the universal function E(X) over its entire domain, hereafter referred to as the "ambient space".Rather, we can confine our attention to a subset of parameters situated in the proximity of the ground states of physical models.These ground states encapsulate the possible physical embeddings, i.e., the feasible interactions that fragments can have with their environment in the ground state of physical systems.Such parameters presumably constitute a limited fraction of the ambient space.
Therefore, we are confronted with a dual challenge.The first is learning the function E(X) over this restricted domain, hereafter referred to as the "latent space."The second is concurrently unveiling the structure of this latent space, which is expected to have lower dimensionality than the ambient space.The successful completion of the latter task would considerably streamline the overall learning problem, specifically by potentially mitigating the effects of the so-called "curse of dimensionality," where computational cost grows exponentially with the number of dimensions.
In the next section we describe an AL framework for learning E(X) and F(X), which is specifically tailored for this purpose, capitalizing on the observations above.

IV. ACTIVE LEARNING FRAMEWORK
In this section, we outline an AL strategy based on probabilistic ML to overcome the computational challenges delineated in the previous section.
A critical aspect of our strategy is the use of a probabilistic ML model, i.e., a model capable of combining our prior knowledge and observed data to make predictions for E and its gradient F as expectation values ⟨E(X)⟩ and ⟨F(X)⟩ with respect to a suitable probability distribution, and to quantify their uncertainties through standard deviations: where F i (X) are the components of F(X), with i = 1, . . ., 2B, where 2B is the dimension of X.Furthermore, we require a model capable of learning both the energy function E and its gradient F simultaneously, ensuring exactly the consistency between these two quantities.We achieve this by a recent generalization of GPR, satisfying both of these requirements [39,40], described in Sec.IV B and in the Supplemental Material [35].

A. gGA+AL Algorithmic Structure
Given a probabilistic ML method with the requirements listed above in place, we proceed to outline the AL strategy as follows: 1.For each self-consistency cycle, the gGA algorithm produces a parameter vector X for every impurity in the system.If the uncertainty estimate Σ for ML predictions at a given point X is within a threshold, the prediction is accepted.Otherwise, the machine database is updated with energy E and gradient F data, and a new machine evaluation is performed.
2. The probabilistic ML framework predicts the expectation values ⟨E(X)⟩ and ⟨F(X)⟩.It also estimates the uncertainty quantification of these predictions, termed Σ 0 (X) and Σ i (X), respectively.
3. If the existing data can produce a sufficiently accurate prediction for ⟨E(X)⟩ and ⟨F(X)⟩ (with respect to pre-established accuracy thresholds), the ML model's outputs are directly employed, thereby bypassing the need for ground state computation of the EH.
4. If the data are insufficient for an accurate prediction corresponding to the EH at the given X, new training data are computed for suitable parameters X α .These data points are then added to the database for future use.
5. The output, obtained through either step 3 or 4, is fed back into the QE algorithm to initiate the next iteration.The process is repeated until convergence is achieved.
The salient features of this framework, as depicted in Fig. 1, are twofold: (1) Since all data are generated through actual gGA calculations, they are inherently situated in the proximity of the "latent space" of physically meaningful embeddings.This circumvents the issue of requiring prior knowledge of the QE latent space, as discussed in Sec.III C. (2) Owing to the universality of the EH, data collected during any calculation for strongly correlated systems are retained permanently.Consequently, the ML component of our framework becomes increasingly efficient with each new calculation, potentially enhancing the computational efficiency of all subsequent gGA simulations.
The details about the algorithm implementation outlined above are described in the subsections below.

B. Generalized Gaussian Process Regression
Above in this section we introduced the need for a probabilistic ML model that can learn and predict both the energy function E and its gradient F, as well as quantify their uncertainties.To meet these requirements, we employ a generalized form of Gaussian Process Regression (GPR) [39,40], as implemented in a development version of the program package MidasCpp [41].The method constructs a so-called "posterior probability distribution" for the function to learn, which is based on a prior probability distribution and available data.
• The prior encodes our expectations about the general properties of the function we aim to learn, such as its range and smoothness.
• The observed data in our case consist of a database: where X α are points with evaluated energies E α and gradients F α , σ 0α measures the uncertainty associated with the energy data E α , and σ α = (σ 1α , .., σ dα ), where σ iα is the uncertainty associated with the i-th component of the gradient data F iα and d is the dimension of X α .
In GPR the prior probability distribution is assumed to be a zero-mean Gaussian.Consequently, it is fully characterized by the so-called "kernel function," which is essentially the correlation function ⟨E(X)E(X ′ )⟩ prior , where the expectation value is taken with respect to the prior probability distribution.The kernel function we employ in this work is the "square exponential kernel," given by: This kernel is governed by two hyperparameters: l and σ f .The length scale l is essentially a correlation length, defining the expected "minimum wavelength" or smoothness of the function E(X).The parameter σ f specifies the expected amplitude or range of the function, serving as an infrared cutoff.Specifically, k(X, X) = σ 2 f corresponds to the expected variance ⟨E(X) 2 ⟩ prior of the function.
The posterior probability distribution thus integrates both our prior knowledge and the available observed data, generating predictions for E and F that align with the observed data, while utilizing the prior for making predictions elsewhere.In particular, our AL framework requires to compute the following quantities at any given point X: where the expectation values are taken with respect to the posterior distribution, that depends on the hyperparameters l, σ f , as well as the available data D. Explicit expressions for these distributions are provided in the Supplemental Material [35].
It is important to note that, as opposed to standard GPR and the KRR-based method previously used in Ref. [34], the generalized GPR framework outlined above enforces by construction the condition: where Fi (X) are the components of F(X), with i = 1, . . ., d, where d = 2B is the dimension of X. Enforcing exactly these conditions yields more accurate predictions.
On the other hand, as explained in the Supplemental Material [35], such gain comes with additional computational cost.Specifically, if the database D contains N training data points, making predictions requires to invert a matrix (the so-called "covariance matrix"), whose size is In light of this, the computational complexity and matrix size present two challenges that need to be carefully addressed: • RAM Storage: The large covariance matrix, of size N (d + 1) × N (d + 1), necessitates considerable memory storage.This can become a significant issue as the number of data loaded in the GPR framework grows.
• Scalability: The matrix inversion operation itself has a time complexity of O (N (d + 1)) 3 .As N increases, it becomes computationally burdensome.Specific measures must be incorporated into our active learning framework to ensure its scalability for large databases.
• Numerical Stability: The large size of the covariance matrix and closely spaced data points in D can make the matrix inversion prone to numerical issues.Specifically, the matrix can become ill-conditioned, having a large condition number, which is the ratio of the largest to the smallest eigenvalue.This makes the matrix sensitive to small changes, potentially leading to a loss of numerical precision.
In the following subsection, we detail how these challenges are addressed in our active learning framework, and also elaborate on our procedure for choosing the hyperparameters l and σ f .

C. Addressing computational and numerical challenges in AL framework
To cope with the computational and numerical challenges presented by the large covariance matrix, we have designed the following strategy within our AL framework.
Our approach is based on the construction of a hyper cubic lattice discretization of the d-dimensional parameter space X, yielding a set of grid points X i1,...,i d .Each i α index covers all integers, effectively tiling the whole parameter space.We use a lattice spacing a = 0.0125, which is set to be much smaller than 1, taking into account the dimensionless nature of our parameters.When predicted values for a given X are being requested during the self-consistent embedding computations we find the nearest point on the lattice to this point.In addition we also consider this lattice points 2d nearest neighbors on the lattice (corresponding to i d ± 1 for all d).Should any of these 2d + 1 points not already be in the database D, they are calculated and added.Note that these calculations are independent of each other and, therefore, can be executed in parallel.For each evaluation at X, the GPR prediction and its associated uncertainties are calculated based solely on these 2d + 1 data points.We locate these nearest points using a k-d tree-based algorithm to maintain computational efficiency.This step effectively imposes a "budget" on the number of training data we use for making GPR predictions, thereby controlling the size of the covariance matrix and the associated computational complexity.
Next, we address the setting of the hyperparameter σ f within this localized framework.For the GPR prediction of each point X, the hyperparameter σ f for the squared exponential kernel is set based on the data as follows: which is consistent with its interpretation in terms of the prior probability distribution: σ 2 f = ⟨E 2 (X)⟩ prior .To determine the appropriate value of the hyperparameter l, we employ the following iterative approach: 1. Initialize l at l init = 0.5.The range for l is predetermined to be between l init and l final = 2.0, with increments of ∆l = 0.1.
2. For the current l, calculate Σ max , which is the maximum of the uncertainties Σ i (X) associated with all components of the gradient F(X) for i = 1, .., d.
3. If Σ max < Σ = 10 −3 , accept the current GPR prediction for that l and terminate the loop.
4. Otherwise, increment l by ∆l and return to step 2. If l exceeds l final , revert to exact calculations for that specific test point and terminate the loop.
The uncertainty parameters σ iα , see Eq. ( 36), have been all set to 10 −5 in all of our calculations.
The rationale underlying this algorithm is to initiate with the smallest l value, thereby making the least assumptions about the landscape's smoothness and relying more heavily on the data for our predictions.If this l proves insufficient, we then increment l in a stepwise manner, each time reassessing the prediction quality.This iterative fine-tuning of l is in essence a method for optimizing the trade-off between bias and variance, a standard criterion in ML, which ensures that our model is neither too simplistic (high bias; large l) nor too sensitive to fluctuations in the data (high variance; small l).Note also that our choice of l range is in line with the fact that our parameters X are dimensionless, from which we expect the optimal l to be around 1. Furthermore, it is consistent with the assumption l ≫ a, acknowledging that we cannot resolve scales smaller than the lattice spacing a. Finally, note that our choice of Σ = 10 −3 is consistent with the assumption Σ ≫ σ iα , acknowledging that it is impossible to make predictions with higher accuracy than the available data.Thus, the framework above effectively manages challenges related to RAM storage, computational scalability, and numerical stability, while maintaining a balance between computational cost and prediction accuracy.

V. RESULTS
In this section we document the results obtained by applying the AL algorithm described above to gGA calculations on the single-band Hubbard model at half filling.Here, we consider different values for the Hubbard interaction strength U and hopping parameters t ij (corresponding to multiple lattice geometries).
We carry out our calculations within the gGA framework set at B = 3, which, as proven in previous work, is sufficient for achieving accuracy comparable to DMFT for ground-state properties.As clarified in Sec.III C, in this setting the dimension of the "ambient space" of parameters X, where the energy function E(X) is defined, is 2B = 6.All training-data evaluations for E(X) and F(X) are calculated using the exact diagonalization (ED) method.
In Subsections V B and V C we document the efficiency and accuracy obtained in these calculations using our AL approach.In Subsection V D, we demonstrate that the parameters explored span a low-dimensional latent space and discuss the practical implications of these results, as well as their physical interpretation in relation to Mott physics.Additional calculations and analysis for the Hubbard model away from half filling are presented in the Supplemental Material [35].

A. Goal of Benchmark Calculations
Our aim is to test the AL method within all interaction regimes of the half-filled Hubbard model at zero temper-ature, including the so-called coexistence region, which is an interval of parameters U featuring a metastable Mott state.To capture all of these regimes, all calculations are organized into series constructed as follows.Each series starts from a large value of interaction strength U max and decreases it in intervals of ∆U to a small value U min .Then, the interaction strength is increased back to U max with the same spacing.From now on, we refer to such a series of calculations as a "sweep." A critical metric for quantifying the efficiency of our gGA+AL approach is the ratio of the number of times new data must be acquired and added to the database during a given calculation, N data , to the total number of gGA iterations necessary to perform the same calculation without ML, N iterations : which we would like to be as small as possible.
It is important to note that the value of S is heavily influenced by the choice of hyper-parameters, as detailed in Sec.IV C. In these benchmarks we strived for high accuracy by requiring that both the energy E and its gradient F are estimated to a precision of at least Σ = 10 −3 .Furthermore, a minimum of 2d + 1 training points are required within a grid with tight lattice spacing a = 0.0125 for each test point.This ensures that new calculations are invoked when the exploration enters a new parameter region spaced by more than a.
In the forthcoming benchmark calculations, we aim to address three specific scientific questions for evaluating the utility and efficiency of our gGA+AL method: 1. Ability to learn: Can a sweep of gGA+AL calculations, once completed and with the data stored, be repeated without requiring any new data for the Hamiltonian parameters already explored?This is a necessary condition for realizing the computational benefits of our data-driven approach.

Transfer-Learning Efficiency:
If multiple sweeps are performed, each with different settings such as ∆U or t ij , can data acquired in one sweep be leveraged in another to reduce the need for new calculations?We are interested in whether the explored parameters can span a latent space with overlapping regions that can be exploited for computational efficiency in future calculations.
3. Accuracy Preservation: Is the accuracy in physical quantities preserved when completing a calculation using the gGA+AL method?While it is always feasible to refine the results through a few standard gGA iterations without active learning at the end of any gGA+AL calculation, achieving high accuracy directly with AL is preferable for maximizing computational gains.

B. Benchmarks for Hubbard Model on Infinite-Coordination Bethe Lattice
In this subsection we present benchmarks of our gGA+AL method as applied to the Hubbard model represented by the following Hamiltonian: where t the hopping between nearest-neighbor sites, and the the hopping parameters are finite only between nearest-neighbour sites on a infinite-coordination Bethe lattice, with semicircular density of states The energy measure is set in unit of the half-bandwidth D ∝ t.

Efficiency starting with empty database
For each ∆U , we start with an empty database and execute a sweep of calculations, as defined in Section V A. Immediately following this, a test sweep is performed, leveraging the data acquired during the initial sweep.
Once the test sweep for a specific ∆U is completed, the database is reset to empty, and the entire process is repeated for the subsequent ∆U values.The results of these benchmarks are summarized in Fig. 2, where each row corresponds to a different ∆U .The figure showcases the value of the efficiency metric S for both the original and test sweeps.The left and right columns of the figure display the S values for metallic and Mott states, respectively.
A key observation is that no additional training data are required in the test sweeps for all ∆U values, with the only exception at ∆U = 0.6, where a few new training data are added to the database.This confirms the ability of our framework to "learn monotonically," in the sense outlined in Section V A. It is also remarkable that the computational gains achieved through our gGA+AL approach are substantial even in the initial sweep, when starting from an empty database.In fact, the average of the efficiency metric S registered throughout each sweep at ∆U = 0.6, 0.3, 0.15, 0.075 is at 83%, 73%, 56%, and 40%, respectively, when considering both the Mott and metallic phases.The fact that tighter meshes lead to increased overall computational savings is explained by the fact that data acquired along the way for solving the gGA equations can be used by the AL framework for reducing computational cost of subsequent calculations with similar interaction strengths.
From the physical perspective, a very interesting feature of the results shown in Fig. 2 is that S is smaller in the Mott phase and the strongly-correlated metallic phase, pointing to higher transfer-learning efficiency in these regions, compared to the weakly correlated regime.This finding, and its physical interpretation, is discussed later in Sec.V D with a PCA.

Efficiency with Progressive Data Accumulation
In Fig. 3, we compare the efficiency of our gGA+AL method across the sweeps with spacings ∆U = 0.3, 0.15, 0.075.Each row of the figure corresponds to one of such sweeps.The top panels present the S values obtained when starting with an empty database for each new sweep, as in Fig. 2. In contrast, the bottom panels show the S values calculated starting from a database that was initially populated at the end of a ∆U = 0.6 sweep and subsequently updated without resetting as we traverse through the mentioned series of ∆U values.
We observe that, using the scheme that retains data, the metric S demonstrates a computational gain of over 50% compared to calculations performed with a reset database.Specifically, in this same scheme, the average of the efficiency metric S registers at 33%, 22%, and 10% for ∆U = 0.3, 0.15, 0.075, respectively.Consistently with the trend of the results in Fig. 2, this gain is even more significant in the regime of strongly-correlated parameters.This further supports the transfer-learning ability of our AL framework.It is important to note that, as mentioned in Sec.III B, in the gGA framework the sequence of points X i explored throughout the computation of any given model always converges towards the point X realizing the corresponding solution.Thus, the majority of these points, where E(X) needs to be learned, tend to cluster around X instead of being distributed throughout the entire parameter space.As a result, the data effectively form an approximate 1-dimensional latent structure (parametrized by U ), embedded within the 6-dimensional ambient space, as we are going to show explicitly in Sec.V D with a PCA.The "transfer-learning" ability displayed by our AL framework (i.e., its ability to exploit the data previously stored for improving efficiency of subsequent calculations) is grounded on its ability to exploit such type of low-dimensional structures, effectively bypassing the exponential computational burden associated with the unnecessary task of learning E(X) in the whole ambient space.

C. Benchmarks for Hubbard Model with different lattice structures
Here we extend our analysis to consider different geometries, specifically variations in the hopping matrices t ij .This enables us to evaluate how the gGA+AL frame-FIG.4. Efficiency metric S for 3D cubic (upper panels) and 2D square (lower panels) lattices.Each panel is divided into two sections: The upper section shows results obtained without storing any data, while the lower section presents results where all previously acquired data, including that from the Bethe lattice calculations, were retained.Mott points are depicted on the right side, and metallic points are on the left.
work performs when the data do not naturally span an approximately 1-dimensional curve solely parametrized by U .A central question we aim to address is whether a low-dimensional structure is commonly present among the embedding parameters X in these more general scenarios, and if so, whether our AL framework can leverage this structure for more efficient and accurate calculations.

Efficiency with Progressive Data Accumulation
To extend the scope of our analysis, we have also performed calculations of the Hubbard model on 3D cubic and 2D square lattices, employing a mesh with ∆U = 0.075.The resulting efficiency metrics S for these calculations are illustrated in Fig. 4.
The figure is organized as follows: the upper panels correspond to the 3D cubic lattice calculations, while the lower panels are for the 2D square lattice.Within each panel, the upper and lower parts distinguish between the two modes of data acquisition.The upper part of each panel displays the efficiency metrics obtained without any stored data, whereas the lower part showcases results when retaining all previously acquired data, including that from our Bethe lattice calculations.
The results of Fig. 4 are consistent with our previous findings.Even when starting from an empty database, there is a significant computational advantage.More notably, employing the data acquisition model that continuously accumulates data results in additional gain compared to the data-reset calculations.This suggests that despite the differences in geometry between the systems, there is a degree of overlap in the data that is explored.Intriguingly, these gains are predominantly observed in the Mott phase and strongly correlated regime, implying a greater degree of overlap in these cases compared to the weakly correlated regime.
These observations lead us to conclude that the latent space of the "physically relevant embeddings," or the set of parameters X probed during gGA ground-state calculations, possesses a "special" structure.Specifically, this structure may be such that it occupies only a small subset of the ambient parameter space.We delve deeper into understanding this "special" structure of the latent space in Sec.V D, where we employ the PCA to study the data structure in detail and provide a physical interpretation of these findings.

Accuracy of gGA+AL solution
In addition to computational efficiency, another critical aspect of our gGA+AL framework is its accuracy in calculating physical observables.To rigorously evaluate this, we consider observables such as the total energy, local double occupancy, and quasi-particle weight.These observables are computed from the variational parameters obtained after convergence, as detailed in Sec.II C.
From Fig. 5, it is evident that the application of our ML algorithm does not result in a significant loss of ac-curacy compared to a canonical gGA algorithm.Also, the endpoint of the metal-insulator transition U c1 is in perfect agreement with that obtained using the standard method.The only discrepancy is that the endpoint of the metal-insulator coexistence region U c2 shows a slight overestimation on the 2D square lattice when using our ML algorithm.
It is also worth pointing out that, as previously mentioned in Sec.V A, it is always possible to refine the results with a few iterations of the standard gGA method after the active learning steps.This ensures that the computational efficiency gained by the gGA+AL framework does not compromise accuracy, providing a risk-free framework for high-efficiency, reliable calculations.

D. PCA analysis of the training database
In this section, we turn our attention to the underlying structure of the database that has been acquired in the course of our calculations.Our primary objective is to probe the latent space within which our AL framework operates.In particular, we aim to elucidate why our AL framework shows notably higher transfer-learning efficiency in the strongly-correlated regime, and to discuss the physical implications of these findings.
Our database comprises vectors X α from calculations on the Bethe lattice in the limit of infinite coordination number, the 2D square lattice, and the 3D cubic lattice, all initiated without pre-existing data.To investigate the low-dimensional structure of such "latent space" of embedding parameters, we perform a PCA analysis.

Definition of the PCA
The PCA analysis of our database consists of the following steps.
• We construct a N × d data matrix M, where d = 6 is the dimension of the vectors X α and N is the number of database points, by placing each EH parameter vector X α of the database as a row, leading to: i.e., M αi = [X α ] i .
• The Singular Value Decomposition (SVD) of M is represented as a sum of the outer products of its singular vectors: where σ r denotes the singular values sorted from the largest to the smallest, the vectors u r and v r are the column vectors corresponding to the left and right singular vectors, respectively.From Eq. ( 47) it follows that the data points X α can be expressed as the following expansion of v T r : with coefficients x rα = σ r [u r ] α , where [u r ] α denotes the α component of u r .Since the singular values σ r are sorted from the largest to the smallest, the first terms of Eq. ( 48) retain the most significant features of the parameter space.
• The approximation of each data point X α is thus obtained by truncating the sum in Eq. ( 48) as follows: where d cut is the number of retained principal components selected to capture the desired amount of total variance from M, and v T r are the corresponding "principal axes".

Application of the PCA
The results of the PCA analysis described above are in Fig. 6, which shows the first two principal components, i.e. [u 1 ] α σ 1 and [u 2 ] α σ 2 , where σ 1 = 77.8 and σ 1 = 14.6.These two principal components account for more than 88 % of the variability of the data.
To further investigate the low-dimensional structure of the latent space, we present a scatter plot of these first two principal components in Fig. 6, providing us with a pictorial representation of the latent space.In this plot, the points are color-coded based on the lattice type and the stage of the calculation.Specifically, points obtained after convergence for the Bethe lattice, 2D square lattice, and 3D cubic lattice are colored in blue, red, and green, respectively.All other points, which are gathered during the self-consistency procedure but do not correspond to converged solutions, are colored in grey.
In line with our earlier discussion in Sec.V B 2, the data for each lattice effectively form an approximate onedimensional latent curve, parametrized by U , which bifurcates within the coexistence region.Remarkably, data subsets corresponding to each lattice structure are very similar.Furthermore, we observe that the separation between the data corresponding to different lattices is more pronounced in the weakly correlated regime (small U , lower-left part of the graph).As U increases, the data corresponding to different lattices in the metallic regime become increasingly overlapping, culminating in maximum overlap near the end of the coexistence region U c2 .
From the computational perspective, these observations shed light on the higher transfer-learning efficiency of our AL framework in the strongly correlated regime.The overlapping data imply that similar regions of the feature space are explored across different calculations.Consequently, the data from one calculation can be effectively transferred to subsequent calculations, reducing the need for additional data points.
From a physical standpoint, the observed overlapping behavior of the databases across different lattice structures is rooted in the universality of Mott physics.The parameters X α obtained after convergence, represented by the colored dots in Fig. 6, can be interpreted as physical embeddings of the correlated fragments.These embeddings capture the essence of electron localization induced by the Hubbard interaction, which transcends the specifics of the lattice structure.As we approach and enter the Mott phase in the strongly correlated regime, the fragments become less entangled with their surrounding environment due to reduced charge fluctuations.Consequently, it is understandable that the physical embeddings become increasingly less dependent on the lattice type, leading to the observed overlaps in the databases.

Possible Future Methodological Enhancements
In light of our findings concerning the low-dimensional latent space, and their general origin rooted in Mott physics, it is natural to consider additional computational techniques that could further leverage this structure in future applications to complex multi-orbital strongly correlated systems.Specifically, Deep Kernel Learning (DKL) with autoencoders could further facilitate learning within our AL framework, as it could offer enhanced flexibility and scalability for discovering optimal feature spaces, all while preserving the essential element of uncertainty quantification employed within our AL procedure.

VI. CONCLUSIONS AND OUTLOOK
In this study, we have presented an AL framework integrated within the gGA to efficiently explore the ground state of the EH in the context of the single-band Hubbard model.From a computational standpoint, this approach leads to a marked reduction in the number of EH instances that must be solved iteratively, thus significantly mitigating the computational cost inherent to gGA.Moreover, our PCA analysis reveals that the parameters of the EH reside in a latent space with a low-dimensional structure that is largely invariant to the specifics of the lattice geometry, especially in the strongly-correlated regime.
From the physical perspective, the existence of this low-dimensional latent space can be attributed to the universal features of Mott physics.The phenomenon of electron localization, caused by the Hubbard interaction, transcends the geometric specifics of the systems studied.As a result, the correlated fragments are less susceptible to environmental influences, leading to a latent space whose characteristics are conserved across various lattice structures.
Looking forward, extending this methodology to more complex systems involving multiple orbitals, such as 5orbital d-systems and 7-orbital f-systems, presents an interesting challenge.While the universality of Mott physics gives us reason to expect similar low-dimensional structures in these more complicated systems, the actual existence and dimensionality of such a latent space remains an open question.In this respect, it is important to note that, when considering real-material calculations, the parameters of the electronic Hamiltonian are not freely adjustable.Structural stability, which emerges from the interplay between electronic and lattice degrees of freedom, imposes further constraints on the physically realizable electronic structures, which do not exist in model calculations, where all parameters can be tuned in arbitrary ways.A trivial example of how structural stability limits the possible quantum embeddings of the correlated degrees of freedom is that it often leads to symmetry, which can be exploited to reduce the number of gGA parameters using group-theoretical considerations.Additionally, structural stability restricts the possible atomic environments based on fundamental principles of chemistry, such as valence compatibility between atoms.These constraints may significantly limit the dimensionality and structure of the latent space of physically realizable embeddings, facilitating the learning problem.
Hence, the implementation of our AL framework in real-material calculations, potentially within an ab-initio DFT+gGA framework, could provide further insights into the structure of this latent space and its limitations, laying the groundwork for more efficient simulations of complex strongly-correlated materials.

ACKNOWLEDGMENTS
This work was supported by a grant from the Simons Foundation (1030691, NL).We gratefully acknowledge funding from the Novo Nordisk Foundation through the Exploratory Interdisciplinary Synergy Programme project NNF19OC0057790.DGA acknowledges funding from the Rising Star Fellowship of the Department of Biology, Chemistry, Pharmacy of Freie Universität Berlin.The part of the work by YXY was supported by the U.S. Department of Energy (DOE), Office of Science, Basic Energy Sciences, Materials Science and Engineering Division, including the grant of computer time at the National Energy Research Scientific Computing Center (NERSC) in Berkeley, California.This part of research was performed at the Ames National Laboratory, which is operated for the U.S. DOE by Iowa State University under Contract No. DE-AC02-07CH11358.KB acknowl-

I. GAUSSIAN PROCESS REGRESSION IN THE CONTEXT OF ACTIVE LEARNING
Within our active learning (AL) algorithm described in the main text, we store data gathered during the ghost Gutzwiller Approximation (gGA) iterations in a training database consisting of all the quantities needed for training our ML model and making predictions.In Eq. ( 1) E(X) and F(X) refer to the energy and its d gradient components evaluated at a point X in a d dimensional space.Furthermore, the model's prior uncertainties σ 0 and σ σ σ for the energy and its gradients are stored.These uncertainties take into account noise in the training and test data.However, since in our case all the training and test data is deterministic, they only are used as regularization parameters and, hence, we use σ 0 = σ i ∀ i ∈ {1, 2, . . .d}.
During the gGA iterations we encounter new points X * for which we do not know the corresponding energies and gradients.For such a new test point we then attempt to predict its energy and gradient using Gaussian process regression (GPR).If there are closeby points (nearest neighbors) available in the database D, their energies and gradients are retrieved from the database and used as training points for GPR.All of the remaining nearest neighbors to X * which are not yet available in the database are then generated on the fly and their energies and gradients are merged into the database until we have the full set of required 2d + 1 training points required by our algorithm (for more details on the hypercubic lattice used for discretizing the space of training points it is referred to the main text).
This local subset of D is stored in a vector of length N (d + 1) containing the energy and its d gradients for all of the N = 2d + 1 training points, where the superscript "T" denotes the transpose.This training data will be used to define equations for making prediction at the test point X * .

A. Gradient Enhanced GPR
Using our prior knowledge about the data given by the training data under the assumption that all elements of y(X i ) for any given point X i are normally distributed we arrive at the following equations for the posterior mean and covariance matrix where ȳ is the posterior mean vector of length N * (d+1) containing the expectation values (predictions) for the energy and its gradients for the N * test points and K, K * and K * * are the modified prior covariance matrices of size N (d + 1) × N (d + 1), N (d + 1) × N * (d + 1) and N * (d + 1) × N * (d + 1), respectively.The modified structure of these covariance matrices ensures that the predicted energy and its gradients are consistent, i.e. ⟨F⟩ = ∇⟨E(X)⟩, by also incorporating the gradients of the kernel function as follows: Here, the following short-hand notations is introduced for the original values and partial derivatives of the kernel function, where the kernel function used for this work (square exponential kernel) is given by k This also allows for the computation of standard deviations (uncertainties) of energies and gradients for a test point By utilizing this modified structure of the covariance matrix, the GPR model effectively leverages both function and gradient observations, often yielding enhanced predictions while at the same time allowing us to learn a scalar function and being able to use the gradient predictions for the construction of the density function.For more information on the program implementation of this approach as well as on the detailed analysis of its computational scaling, the interested reader is referred to Refs.[1,2].

II. THE gGA APPROXIMATION
The goal of this section is to outline the algorithmic structure of the gGA.

A. The gGA Lagrange function
As explained in the main text, for the single-band models of the form considered in this work, the gGA framework is encoded the following Lagrange function: where N denotes the total number of unit cells, E and E c i are scalars, and ∆ i , Λ i , and Λ c i are B × B Hermitian matrices.Additionally, D i and R i are B × 1 column matrices.The so-called "quasiparticle Hamiltonian" ( Ĥqp ) and EH ( Ĥi emb ) are defined as: Ĥi where ni = σ c † iσ c iσ .The integer number B controls the size of the gGA variational space and, in turn, the precision of the gGA approach.
Below we write explicitly the saddle-point equations obtained by extremizing the Lagrange function above.

B. The gGA Lagrange Equations
In order to simplify the final saddle point equations for the gGA Lagrangian, we rewrite Eq. ( 15) as follows: where we have introduced the matrix: and the projectors over the degrees of freedom corresponding to each fragment: Additionally, we introduce the following parametrization of the Lagrange multipliers ∆ i , Λ i and Λ c i , where they are expanded in terms of a basis of orthonormal matrices [h i ] s with respect to the canonical scalar product (A, B) = Tr A † B : Given the definitions above, we write the saddle point equations for the gGA Lagrangian as follows: B c=1 [ Ĥi where E 0 is the ground-state eigenvalue of the quasi-particle Hamiltonian, E c i is the ground-state eigenvalue of the i-th EH, and f is the zero-temperature Fermi function.
As explained in the main text, because of the Helmann-Feynmann theorem: where: Ĥi emb is the EH defined in Eq. ( 16), and the expectation value is taken with respect to the corresponding half-filled ground state.

C. Iterative procedure to solve the gGA equations
The equations above can be solved with multiple approaches.In the benchmark calculations of this work we used the following iterative procedure: 1. Given an initial guess for the parameters R i , Λ i , solve Eq. ( 23) and determine ∆ i using Eq. ( 24).
2. Use Eq. ( 25) to determine D i from the parameters above.
3. Use Eq. ( 26) to determine Λ c i from the parameters above.
4. Iteratively determine chemical potential µ such that the number of physical particles in the EH is equal to the total number of particles in fragment i.This is done by solving a root problem for the particle number, which involves the following steps in each iteration: • Incorporate chemical potential into Λc i as Λc i = u † i Λ c i u i − Iµ. • Solve Eq. ( 27) using exact diagonalization.
• Compute hybridization and bath blocks ρ hyb i and ρ bath i of density matrix.
5. Use Eq. ( 29) to determine ∆ i from the parameters above.
6. Use Eq. ( 28) to determine R i from the parameters above.
8. Restart from the first step using the so-obtained parameters R i , Λ i , and iterate until self-consistency is reached (i.e., until the initial and R i , Λ i and those obtained after the steps above are equal up to a gauge transformation, within a given accuracy threshold).

III. GAUGE FIXING
In this section we are going to explain the details of the ML procedure outlined in the main text, where ML is used to circumvent the bottleneck in gGA calculations, i.e. the repeated solution of Eq. ( 27) in step 4 in the algoorithm outlined in Sec.II C.
In the main text we have established that the energy function Ēc of the EH can be expressed in a reduced domain where Λ c is diagonal as follows: where E is the energy of the EH in this reduced domain and the matrix elements of D and Λc have been arranged into a vector: X = ( D1 , .., DB , Λc 11 , .., Λc BB ) (35) and the elements of the vector X are then given by the following transformations: where: u = u perm u phase u eigen T ∈ O(B), where • u eigen transforms to the eigenbasis of Λ c .
• u phase fixes the phase of D such that Di > 0.
• u perm is a permutation matrix, which ensures that Di ≥ Di+1 .
Since the diagonalization of the EH is gauge perserving, the only outputs available are This means that we can only directly obtain ρ hyb and the diagonal elements of ρ bath .However, ρ bath is generally not diagonal in the gauge defined above.Furthermore, the function E does not provide us with direct access to the double occupancy.Below we derive analytical expressions which enable to compute these quantities from E(X), F(X) and X.

A. Calculation of off-Diagonal Density Matrix Elements of the EH
To arrive at an expression which also enables us to compute the off-diagonal elements of ρ bath , we start by rotating the embedding parameters about an angle θ r .For a fixed set of embedding parameters.This leads to the following expression: where u (θ r ) = e −iθrhr is an arbitrary unitary matrix, {h r } is a set of hollow Hermitian matrices and the index r ∈ 1, 2, . . ., B(B−1) 2 runs over all non-redundant off-diagonal elements of Λ c .As mentioned in the main text, such a unitary transformation cannot change the energy due to the gauge invariance of the embedding Hamiltonian (EH), i.e. ∂F/∂θ r = 0. Exploiting this stationarity condition with respect to any unitary rotation we end up with the following equation The terms [M r ] a and [M c r ] a correspond to the partial derivatives of the transformed D and Λc with respect to θ r which are given by: Eq. ( 41) is exact and can be used to iteratively solve for the off-diagonal elements of ρ hyb ab since these are the only unknowns in Eq. ( 41),

B. Calculation of the Double Occupancy
The double occupancy is not directly accessible via the gradient of E with respect to X. Hence, we employ a similar trick as we did for obtaining the off-diagonal elements of ρ hyb ab .But instead of using the invariance of the EH with respect to unitary transformations, we employ its scaling invariance: where α ∈ R is an arbitrary scaling factor.Taking the total derivative of the above equation yields: where: With this, we arrive at the following expression for the double occupancy where the Hellmann-Feynman theorem has been used to evaluate the partial derivatives with respect to X i (as already described in the main text): and to evaluate the double occupancy as partial derivative with respect to U : which allows us to compute the double occupancy analytically using Eq.(47).

IV. BENCHMARK CALCULATIONS OF gGA+AL AWAY FROM HALF FILLING
Here we present benchmark calculations of the Hubbard model away from half-filling.Specifically, we consider dopings of 10 %, 20 % and 30 %, corresponding to fillings of N = 1.1, 1.2, 1.3.Similarly to the procedure outlined in the main text, we organize our calculations as follows.Each series of calculations starts from a small initial value of Hubbard interaction strength U min , which is increased in steps with equal spacing ∆U up to a value U max .Subsequently, the interaction strength is decreased back to U min , using the same spacing.Below these series of calculations will be referred to as "forward sweep" and "backward sweep", respectively.
As in the main text, we will quantify the efficiency of our approach using the same: which is the ratio of the number of times new data must be acquired and added to the database during a given calculation, N data , to the total number of gGA iterations necessary to perform the same calculation without ML, N iterations .In Fig. 1 we show the behavior of the metric S for the Bethe lattice in the limit of infinite coordination number.Each row corresponds to a different number of total particles in the system and the columns show the results for the forward and backward sweep, respectively.Each row is subdivided in three part: The upper part refers to results obtained with an empty/reset initial training set and the middle part shows the results obtained by initializing the training database with the half-filling data and continuously updating the database throughout all of the doped calculations.The lower part shows the results for a second set of sweeps, to verify if the previously gathered data can be efficiently leveraged for reducing the computational cost.
We note that the efficiency metric S for these calculations is even lower than for the half-filling calculations presented in the main text, indicating higher gains.The reason is that for the doped calculations the chemical potential has to be determined iteratively (see algorithm in Sec.II C), rather than being fixed by particle-hole symmetry.Therefore, the calculation requires more ED evaluations compared to the half-filling case, which our AL algorithm can efficiently leverage on for training.We also note that the efficiency of the AL framework does not change substantially for the calculations performed when continuously updating the database.The absence of transfer learning between calculations at different values of N is not surprising, as calculations at different dopings presumably correspond to non-overlapping regions of the ambient space.
In Fig. 2 we show the predictions of the gGA+AL calculations for all fillings.The accuracy of the method is satisfactory, although it slightly deteriorates for larger values of U , particularly for the quasiparticle weights Z.In this subsection we show calculations of the Hubbard model on a 3D cubic lattice, performing the same analysis as done above for the case of the Bethe lattice in the limit of infinite coordination.Note that, in this section, the training database of the calculations performed with data accumulation includes all of the data gathered throughout the calculations on the Bethe lattice away from half-filling.
In Fig. 3 we show the behavior of the metric S for the 3D cubic lattice.As for the calculations performed for the infinite-coordination Bethe lattice, the AL method results in a substantial reduction in computational cost, even when starting each set of calculations from an empty database.Interestingly, further gains are observed within the model of progressive data accumulation, indicating that the AL framework is able to leverage on the data previously acquired during the calculations of the model on the Bethe lattice.We delve deeper into the underlying reason in Sec.IV D, where we analyze the data using a principal component analysis (PCA), in a similar fashion as in the main text for the half-filling calculations.
In Fig. 4 we show the predictions of the gGA+AL calculations for all fillings.The accuracy of the method is satisfactory, although it slightly deteriorates for larger values of U , as for the case of the Bethe lattice.
C. Benchmarks for the Hubbard Model for the 2D Square Lattice Lastly, we analyze the procedure outlined above on the 2D square lattice, comparing the results obtained starting from an empty database with those obtained with the model of progressive data accumulation, including also the data acquired while performing calculations on the Hubbard model for the 3D cubic lattice.
In Fig. 5 we show the behavior of the metric S for the 2D Square Lattice.As expected, further gains are observed within the model of progressive data accumulation, indicating that the AL framework can leverage on the data previously acquired during the calculations of both the model on a Bethe lattice and the model on a 3D cubic lattice.We also note that this effect is particularly pronounced for large values of U , which is the same trend observed in the main text for the gGA+AL calculations at half-filling.
In Fig. 6 we show the predictions of the gGA+AL calculations for all fillings.The accuracy of the method is satisfactory, showing the same trends as for the other lattices analyzed previously.

D. Principal Components Analysis
As in the main text, here we turn our attention to the underlying structure of the database, performing a principal component analysis of all training points accumulated during the sweeps with dopings N = 1.1, 1.2, 1.3.To generate this plot each sweep was started from an empty database.As for the half-filling calculations discussed in the main text, we find that the first two principal components account for about 85% of the variability of the data, which are shown in Fig. 7.Note that the curves obtained for different lattice structures tend to converge to the the same area at large U , therefore explaining the observed higher transfer-learning efficiency of our AL framework in the strongly-correlated regime.

FIG. 1 .
FIG. 1. Schematic representation of the AL structure.A machine is trained on the fly while performing gGA calculations.If the uncertainty estimate Σ for ML predictions at a given point X is within a threshold, the prediction is accepted.Otherwise, the machine database is updated with energy E and gradient F data, and a new machine evaluation is performed.

FIG. 2 .
FIG. 2. Efficiency metric S for sweeps performed at different values of ∆U .Each row corresponds to a unique ∆U .The left and right columns show S values for metallic and Mott states, respectively.The results of both the original sweep and the subsequent test sweep are included.

FIG. 3 .
FIG. 3. Comparison of the efficiency metric S for different sweeps: ∆U = 0.3, 0.15, 0.075.Each row corresponds to mesh spacing ∆U .The left and right columns represent the S values for metallic and Mott states, respectively.For each ∆U , the top panel shows S calculated using an empty database at the start of each sweep.The bottom panel displays S calculated starting from the database obtained at the end of the ∆U = 0.6 sweep and updated continuously as we proceed through the series of ∆U , from the largest to the tightest.

FIG. 5 .
FIG.5.Comparison of total energy, local double occupancy, and quasi-particle weight calculated using gGA+AL and standard gGA methods.The results referred to as "exact results" correspond to the exact gGA solutions and are represented by continuous black lines.The results obtained with gGA+AL are represented by triangles.

FIG. 6 .
FIG. 6. Scatter plot of the first two principal components of the training database.Points obtained after convergence for the Bethe lattice, 2D square lattice, and 3D cubic lattice are colored in blue, red, and green, respectively.All other points are colored in grey.

FIG. 1 .
FIG. 1.Comparison of the efficiency metric S for different dopings: N = 1.1, 1.2, 1.3 for the Bethe lattice using ∆U = 0.075.Each row corresponds to a different doping.The left and right columns represent the S values for forward and backward sweeps, respectively.For each N , the top panel shows S calculated using an empty database at the start of each sweep, the panel in the middle displays S calculated starting from the database obtained from the half-filling calculations and updated continuously as we proceed through the series of dopings, from smallest to largest.The lower panel displays a test sweep using the database containing all of the data gathered so far.

FIG. 2 .
FIG.2.Comparison of total energy, local double occupancy, and quasi-particle weight for the Bethe lattice calculated using gGA+AL and standard gGA methods.The results referred to as "exact results" correspond to the exact gGA solutions and are represented by continuous black lines.The results obtained with gGA+AL are represented by triangles.

FIG. 3 .
FIG. 3. Comparison of the efficiency metric S for different dopings: N = 1.1, 1.2, 1.3 for the 3D cubic lattice using ∆U = 0.35.Each row corresponds to a different doping.The left and right columns represent the S values for forward and backward sweeps, respectively.For each N , the top panel shows S calculated using an empty database at the start of each sweep, the panel in the middle displays S calculated starting from the database obtained from the calculations previously performed on a Bethe-lattice and continuously updated throughout the calculations at different N , from smallest to largest.The lower panel displays a test sweep using the database containing all of the data gathered so far.

FIG. 4 .
FIG.4.Comparison of total energy, local double occupancy, and quasi-particle weight for the 3D cubic lattice calculated using gGA+AL and standard gGA methods.The results referred to as "exact results" correspond to the exact gGA solutions and are represented by continuous black lines.The results obtained with gGA+AL are represented by triangles.