Hidden self-energies as origin of cuprate superconductivity revealed by machine learning

Experimental data are the source of understanding matter. However, measurable quantities are limited and theoretically important quantities are sometimes hidden. Nonetheless, recent progress of machine-learning techniques opens possibilities of exposing them only from available experimental data. In this paper, after establishing the reliability of the method in various careful benchmark tests, the Boltzmann-machine method is applied to the angle-resolved photoemission spectroscopy spectra of cuprate high temperature superconductors, Bi$_2$Sr$_2$CuO$_{6+\delta}$ (Bi2201) and Bi$_2$Sr$_2$CaCuO$_{8+\delta}$ (Bi2212). We find prominent peak structures both in normal and anomalous self-energies, but they cancel in the total self-energy making the structure apparently invisible, while the peaks make universally dominant contributions to superconducting gap, hence evidencing the signal that generates the high-$T_{\rm c}$ superconductivity. The relation between superfluid density and critical temperature supports involvement of universal carrier relaxation associated with dissipative strange metals, where enhanced superconductivity is promoted by entangled quantum-soup nature of the cuprates. The present achievement opens avenues for innovative machine-learning spectroscopy method to reveal fundamental properties hidden in direct experimental accesses.


I. INTRODUCTION
Momentum k and energy ω dependent electron singleparticle spectral function A(k, ω) can be measured with recent revolutionarily refined resolution of angle resolved photoemission spectroscopy (ARPES) [1]. From A(k, ω), the interaction effects crucial for unconventional superconductors can be identified in the self-energy [2,3]. Scanning tunneling microscope (STM) and its spectra (STS) including the quasiparticle interference method [4] also give us insights into the self-energy [5,6].
In superconductors, the self-energy consists of the normal and anomalous (superconducting) contributions, Σ nor and Σ ano , respectively. ARPES and STS provide us with only the total self-energy Σ tot in a specific combination of these two [2] (see below for details). However, to understand the superconducting mechanism, it is crucially important to extract these two separately, because they represent different part of interaction effects: Σ ano is proportional to the superconducting gap function, at the heart of superconducting properties, while normalelectron correlation effects, such as renormalized mass and life time, are encoded in Σ nor . Despite its importance, Σ ano can be straightforwardly extracted separately only when Σ nor is non-singular as in the BCS superconductivity of weakly correlated systems [5,6]. Indeed, the * YAMAJI.Youhei@nims.go.jp decomposition of the self-energies of the BCS superconductors has played the role of establishing the phonon mechanism since the anomalous part contains the information of the phonon density of states, which is crucial for the identification of the glue for the superconductivity. In case of the cuprate high-T c superconductors, because of the strong electron correlation, the subject of extracting normal and anomalous self-energy separately belongs to an open enigmatic inverse problem, which has hampered the full understanding of the superconducting mechanism for decades.
Recently machine learning and data science are developing rapidly as tools to analyze accumulated data across various research domains. The expectation is to solve complex, important problems for human being, which are hardly solved by conventional tools, as in the attempt of forecasting future and designing new functional materials by utilizing existing big data. More specifically, machine learning has potential for innovating routes of exposing quantities, which is invisible in direct measurements. Solving inverse problems to construct theories with predictive power by using existing experimental data is a typical target of the machine learning innovation.
In this paper, we develop a scheme of machine-learning technique to extract physical quantities hidden in experimental data. To demonstrate the power of our method, we apply it to the electronic structure of the cuprate high-temperature superconductors under strong correlation effects manifested by the formation of the pseudo-gap in the normal state. Specifically, the Boltzmann machine [7,8] is examined to extract Σ nor and Σ ano separately from available ARPES spectra even when the normal self-energy is subject to prominent or singular correlation effects. Discovered prominent peak structure in the energy dependence of Σ ano hidden in the ARPES is shown to generate most of the superconducting gap, namely more than 90% of the gap, and hence to be the driving force of the superconductivity in the cuprates. From the extracted self-energies, we elucidate the factors that determine the superconducting transition temperatures.
The organization of the present paper is as follows: In Sec. II, we summarize the essence of the regression scheme to extract the self-energies from the spectral functions. The details of the method is given in Sec. III. The readers who are not interested in the technical details of the method but are interested only in the results may skip Sec. III and directly jump to Sec. IV. Various benchmark tests of the regression by using the present method are presented in Sec. IV for simple metals and conventional Bardeen-Cooper-Schrieffer-type superconductors to show the reliability of the method. The main results of the present paper studying the regression for the cuprate superconductors are given in Sec. V. The self-energies of cuprate superconductors extracted separately for the normal and anomalous contributions are shown, which have prominent peak structures each in the normal and anomalous part, but cancels in the spectral function. We also show that the prominent peak in the anomalous self-energy gives rise to the major part of the high temperature superconductivity. Sec. VI is devoted to the discussion on the implications of the present results. Then, we summarize this paper and give our outlook in Sec. VII.

II. METHODS
In this section, we introduce the fundamentals and basic concept of the present method. The detailed regression scheme is given in Sec. III for readers who are interested in technical details.

A. Regression
The present machine learning scheme is classified into the category of a general regression task, which optimizes a function A to find B and/or F , when A is a nonlinear and complex functional of another function B as A = F (B), where the goal of solving the inverse problem B = F −1 (A) by the optimization of A is set from a physical purpose [9]. Another example belonging to the regression task as an application of the machine learning is found in the use for quantum many-body problems or statistical physics problems (see for instance, Ref. 10 1. Flow chart of machine-learning procedure. Regression procedure of the normal self-energy Σ nor (k, ω) and anomalous self-energy Σ ano (k, ω) using the experimental spectral function. The procedure starts from the central top (initial guess) to the training process (the inner loop) consisting of the red and black arrows to optimize all parameters in the artificial neural network (ANN). When the error converges, the outer loop to decrease the test error, which delivers the initial values for the next inner loop until the test error is minimized.
In the present case, A is the ARPES spectral function A(k, ω), B is the self-energies Σ nor and Σ ano and F is given by Eqs. (2) to (4) below. The training data is given by experimental A(k, ω) at a discrete and limited number of ω. Then the present machine learning is a typical regression task to infer both Σ ano (k, ω) and Σ nor (k, ω) separately as continuous functions of ω.
Here, we note that the present scheme is not a simple interpolation for functions of ω. The regression with artificial neural network provides the self-energies Σ ano (k, ω) and Σ nor (k, ω) that simultaneously satisfy the constraints from the several rigorous and physically sound prior knowledge together with the experimental data.
In the machine learning, flexible regression models that can minimize the training error without any limitation are employed. In contrast, when we employ phenomenological regression models, the constraints arising from phenomenological functions substantially increase the cost function far beyond the noise in the experimental data as demonstrated in the following discussion (see Appendix A). If the phenomenological form of the regression model is not a priori justified, the machine learning scheme should have an advantage over standard phenomenological regression schemes.
The basic procedure is illustrated in Fig. 1. More technical details are found in the following Sec. III and the robustness, accuracy and reliability of the present machine learning are shown in detail in Appendices F and D. An important advantage of the machine learning is that the approximation converges to the correct results without overfitting and bias if one increases the data point. We do not call any other approach that does not reach this systematic improvement as a machine learning.

B. Green function
We propose a theoretical method to extract Σ nor and Σ ano from experimentally observed spectral functions A(k, ω) of superconductors. In the superconducting phases, the single-particle retarded Green function at a given momentum k as a function of frequency ω is given by a diagonal component of 2 × 2 matrix in Nambu representation, with ζ = ω + iδ (δ is a small positive real number). The bare dispersion is given by k . A(k, ω) measurable by ARPES is related toĜ as with the normal component of the Green function G nor (k, ω) ≡Ĝ(k, ω) 11 = [ω − k − Σ tot (k, ω)] −1 . (3) Here, the total self-energy Σ tot is given by [11] Σ tot (k, ω) = [Σ nor (k, ζ) + W (k, ζ)] δ→+0 , with W given as a specific combination, W (k, ω) = Σ ano (k, ω) 2 /[ω + k + Σ nor (k, −ω) * ]. (5) The gap function ∆(k, ω) = Q(k, ω)Σ ano (k, ω), which is a measure of superconducting order, is proportional to Σ ano (k, ω) with the coefficient Q(k, ω) called the frequency dependent renormalization factor defined as The ω → 0 limit of Q is theoretically equivalent to the quasiparticle weight (renormalization factor) defined in Eq. (9) below as we calculate in Appendix B with the help of the procedure in Appendix C. The real part of ∆(k, ω), Re∆(k, ω = 0) is nothing but the superconducting gap (see the definition of ∆ in Eq. (6)). In the present report, δ is chosen to be equal to the experimental resolution as δ = 10 meV [12], instead of taking δ → 0 + .
To estimate the density of the Cooper pairs, mass renormalization, and gap amplitude from the spectral function, we define F (k), z qp (k), and ∆ 0 (k) respectively as, z −1 qp (k) = 1 − ∂ReΣ nor (k, ω)/∂ω| ω→0 , (9) ∆ 0 (k) = ∆(k, ω = ∆ 0 (k)), (10) where z qp is called the renormalization factor (see Appendix B). As is well known, the gap function is interpreted by the product of the Cooper pair density and the effective attractive interaction to form the Cooper pair as the mean field acting on the Cooper pair formation.

C. Prior knowledge
When we can measure entire ω dependence ofĜ(k, ω + iδ) at a fixed k, we can reconstruct Σ nor (k, ω) and Σ ano (k, ω) solely without any information at momenta other than k, because Eqs. (3) and (4) are all diagonal in the k space. However, we can not measure the entire ω dependence of the complete Green function ma-trixĜ(k, ω + iδ), which makes the information through the Kramers-Kronig relation incomplete. In the literature [13], it has been assumed that the momentum dependence of the normal-state spectrum at a fixed ω is a single Lorentzian curve so that Σ nor (k, ω) and Σ ano (k, ω) can be extracted separately without knowing the ω dependence at large |ω|. In the present article, to overcome the lack of information at large |ω|, we utilize physically sound constraints and extract Σ nor (k, ω) and Σ ano (k, ω) from experimentally observed A(k, ω) at a single fixed momentum k, instead of assuming specific momentum dependences of the spectra. Indeed, the present Boltzmann machine learning as detailed below successfully reproduces Σ nor (k, ω) and Σ ano (k, ω) separately from benchmark spectral functions without the momentum dependence of the spectra as shown in Appendix D.
The physical constraints employed in the present article are classified into two categories. The constraints in the first category are the structure of the Green function given in Eq. (1), the Kramers-Kronig relationship between the real and imaginary part of the selfenergies, negative definiteness of ImΣ nor , and odd nature of ImΣ ano as ImΣ ano (−ω) = −ImΣ ano (ω). The constraint in the second category is the sparse and localized nature of the ImΣ ano along the ω axis. In the present context, the sparseness is defined as the property of ImΣ ano concentrated and localized in the small frequency range around the Fermi level. It should be mentioned that the optimization procedure of the present machine learning does not explicitly impose this latter constraint because of the flexible representability of the present (restricted) Boltzmann machine [14]. The sparse nature of ImΣ ano is, however, justified a posteriori in the optimized solution, which turns out to satisfy physically reasonable sparseness although the machine learning procedure does not explicitly impose this constraint.
From physical grounds, the sparse and localized nature of ImΣ ano is a natural consequence as clarified in Ref. 15, where irrespective of the mechanism and symmetry of the realistic pairing, strong and long-range nature of Coulomb repulsion in general causes severe pair breaking at large energies and suppresses Σ ano at energies far away from the Fermi level.

D. Overview of optimization
We represent the self-energies by using artificial neural networks in the present scheme, and optimize the neural network to reproduce the experimentally observed spectral functions. The optimization of the neural network consists of inner and outer optimization loops (see Figure 1). In the inner optimization loop, starting with given initial parameters for the neural network, all the parameters are optimized to minimize the training error (see the definition given in Eq. (26) below) by following the natural gradient method (see Appendix E for details). On the other hand, in the outer optimization loop, test errors (see Eq. (28)) are minimized by the Bayesian optimization. The updated distribution delivers the initial values for the next inner loop. The next inner loop starts again with the updated neural network parameters, while, if the number of repetitions of the outer loop already reaches an upper limit (typically less than a hundred in the present paper), the optimization is completed and the current best neural network gives the optimized self-energies. Here, we combine these inner loop and outer loop optimization to avoid the overfitting, which would be inevitable if we would use only the inner loop optimization with the training data.

III. TECHNICAL DETAILS
In this section, we show the details of the present regression scheme to extract the self-energy from ARPES data. The neural network representation of the selfenergy is detailed in Sec. III A, which is supplemented by the Kramers-Kronig relation in Sec. III B. The procedures to optimize the self-energy are given in Sec. III C, III D, and III E. You may skip this section if you are not interested in the technical details of the present method.
A. Wavelet analysis and Boltzmann machine representation of imaginary part of self-energy Although high-resolution ARPES data for A(k, ω) are available in experiments, Σ nor and Σ ano are not directly given separately from A(k, ω), while if Σ nor and Σ ano and k are given, A(k, ω) can be determined easily by using The piece-wise rectangular function representation of ImΣ ano is shown. While total ImΣ ano is represented by open red rectangles, Boltzmann machine representation generates their components as is illustrated as filled red, cyan, and blue rectangles. To satisfy anti-symmetry of ImΣ ano , the copies of the Boltzmann machines shown in grey rectangles are also supplemented. (c) The wavelet-like structure of the rectangular basis set is illustrated. From the longest wave length structure governed by σ0 to the shortest wave length structure controlled by σL−1, each rectangular basis (open red rectangle) is labeled by the set of the bits σ = (σ0, σ1, . . . , σL−1) (σ = ±1). The structure of the restricted Boltzmann machine for ImΣ nor and mixed distribution consisting of Boltzmann machines for ImΣ ano are depicted in (d) and (e), respectively.
Eqs. (1) and (2). Therefore we need to solve an underdetermined non-linear inverse problem. To overcome the underdetermined nature of the problem, we employ physically sound constraints justifiable even in strongly correlated electron systems as prior knowledge as introduced above. By incorporating these physical constraints, we try to optimize Σ nor and Σ ano so as to reproduce experimental A(k, ω). For this purpose we employ a machinelearning method by applying a Boltzmann-machine algorithm [7]. The reliability, accuracy and robustness of the present machine learning procedure are shown in several robustness test against noise in Appendix F and benchmark tests in Appendix D.
In the retarded Green function representation, ImΣ nor is negative definite and ImΣ ano is an odd function of ω. The negative definiteness and odd nature are guaranteed by expanding ImΣ nor and ImΣ ano with positive definite Eqs. (16), (17), (19), and (22) Eqs. (24) and (25) Sec. IIB Sec. IIID FIG. 3. Flow chart of machine-learning procedure. Regression procedure of the normal self-energy Σ nor (k, ω) and anomalous self-energy Σ ano (k, ω) using the experimental spectral function A exp (k, ω). Here, (R)BM, and COM (green arrows) stand for (restricted) Boltzmann machine, and center of mass, respectively. The procedure starts from the central top (initial guess) to the training process (the inner loop) consisting of the red and black arrows to optimize all the BM parameters. When the error converges, the outer loop (blue and black arrows) updates the COM positions to decrease the test error, which delivers the initial values for the next inner loop until the test error is minimized. The test error is minimized by repeating the combined inner and outer loop updates. The histograms are schematic ω-dependences of ImΣ nor/ano (k, ω) at a fixed k: The stepwise representation of ImΣ ano (k, ω) (the open purple boxes) is obtained by antisymmetrizing the superposition of the three BM distributions (red, blue, and green filled boxes), D(S), to satisfy the odd-function property in ω, where the light or dark gray histograms are added. ImΣ nor is directly given from C(S) as the open red boxes. In total, the machine learning minimizes the training (inner loop) and test (outer loop) errors given by the average E over ω for Eqs. (26) and (28), respectively.
bases {θ j } as where c j and d j are real coefficients. Due to the positive definiteness of the normal part, c j is positive. One of the simplest basis set {θ j } is a set of rectangular functions, which gives a step-wise representation of the self-energies. To obtain a flexible and compact representation for the coefficients c j and d j , here we will combine a well-established wavelet-type representation [16][17][18] and Boltzmann machines as follows. The high representability of the wavelet formalism with the rectangular basis is discussed in Appendix G. The piece-wise rectan-gular representations of ImΣ nor and ImΣ ano are shown in Fig. 2(a) and (b), respectively.
In this fitting, the frequency range of our interest ω ∈ [−Λ/2, +Λ/2] is first divided into 2 L grids using an integer L and assign an L-digit binary representation as where σ i = mod(I/2 i , 2) for the decimal representation I(σ) in the range 0 I(σ) 2 L − 1 of the grid number coordinate; Then the unit rectangular function Θ L σ (x) is defined as The correspondence between the position of the rectangular functions and the decimal representation I(σ) is shown in Fig. 2(c). Then, the basis set expansions of ImΣ nor and ImΣ nor are obtained as with two sets of 2 L fitting parameters, C(σ) and D(σ).
In the actual calculation, we take Λ ∼ 0.8 eV to efficiently fit the experimentally observed A(k, ω) confined to the range ω > −0.4 eV. Note that we assume ImΣ nor (ω) is thus restricted and nonzero in the comparable range (ω > −Λ/2 eV, Λ ∼ 0.8 eV ) to the experimentally measurable one of A(k, ω) to fit them, while we do not impose any constraint on ImΣ ano (ω), because Σ ano is restricted within this energy range as we noted above. We have confirmed in Appendix F that the result is basically unchanged even when the energy cut-off Λ imposed on ImΣ nor (ω) is removed by adding possible high-energy tail or peak in the range ω < −Λ/2. This means that our main finding is not altered by the experimentally unknown shape of the spectral function at ω < −0.4 eV.
Irrespective of the cut-off Λ, the Kramers-Kronig relations given below (Eqs. (24) and (24)) are applied for all the self-energies to the whole energy range (−∞ < ω < ∞) in the machine learning procedure so that the optimized solution satisfies the strict causality. In fact, for the converged A(ω), the sum rule ∞ −∞ dωA(ω) = 1 is satisfied.
As a compressed representation for C(σ) and D(σ), in this paper, we employ two different types of Boltzmann machines to enhance their representational power and determine each Σ nor and Σ ano , separately. ImΣ nor is negative definite and a widely distributed function within energy scale set by the Coulomb repulsion. Thus, we use the flexible and nonnegative restricted Boltzmann machine, which will be introduced below. However, ImΣ ano has different properties: It is sparse, which is justified a posteriori as we addressed already. Therefore, we employ a mixture distribution of the Boltzmann machine without the hidden variables to accelerate the optimization. There may still remain ambiguities in determining Σ nor and Σ ano from an observed A(k, ω). As proposed below, the ambiguities are removed by imposing physical constraints of ImΣ ano . Now we introduce the Boltzmann machine to represent C(σ) and D(σ) in Eqs. (16) and (17), respectively. We first change the binary variable σ introduced in Eq. (13) to the Ising variable S = (S 0 , S 1 , ...S L−1 ) using the relation S = 2σ − 1 for later convenience and rewrite as C(S)(= C(σ)) and D(S)(= D(σ)). Then by adding hidden Ising variables h = (h 1 , h 2 , ...), the Boltzmann machine is generally defined as a Boltzmann weight for Ising variables ν = ±1 consisting of S and h in the notation ν = (S, h) as where (W ) m = W m represents interaction among ν, and (B) = B represents bias fields applied to ν. W lm , and B are variational parameters to minimize the difference between the resultant A(k, ω) and the measured spectral functions. The role of the hidden variables h is to enhance the representability of B to approximate C(S) and D(S).
Thanks to the non-negativity of C(S), it is efficiently represented by the restricted Boltzmann machine (RBM) [8,19,20], one of the most widely used one, as represented by where B C restricts the interactions in B only between visible and hidden variables in the form W m S h m . The advantage of the RBM is that one can analytically trace out the hidden variables h m , leading to where L h is the number of the hidden variables. Any ωdependent line shape in the energy range [−Λ/2, Λ/2] can be flexibly represented by optimized Boltzmann-machine parameters, if they are nonnegative. L h is typically set L h = 2L to achieve a convergence with reasonable computational costs. The restricted Boltzmann machine representation is schematically illustrated in Fig. 2(d).
For Σ ano , to remove the ambiguities, we impose the physically required symmetries, which can be constrained by employing the odd function Eq. (17). If ImΣ ano is sparse, namely confined in a certain range of ω, ImΣ ano can be better represented by a mixture distribution, namely by a linear combination of the full Boltzmann machine in the form where M is the number of the Boltzmann machines in the linear combination. Here, B D allows only the physical variable S without the hidden one h, but allows the interaction between S as The Boltzmann machine representation is schematically illustrated in Fig. 2(e). Note that the linear combination of the Gaussian distributions is one of the standard procedure to approximate a smooth function [21] and can be used as an initial guess of D (for the detailed procedure, see in Appendix E 4). Of course, the Boltzmann machine has representability far beyond the Gaussian distribution after the optimization.
Since it is sufficient to take the number of variables S at most L = 9 to fit the experiment data containing the resolution limitation, we can explicitly take the trace summation over S for all (with 2 L terms) at each iteration step. Therefore the drawback of the mixture distribution of BM beyond the RBM (namely, the complexity arising from containing the interaction term between two physical variables proportional to W λ m S S m ) is not a serious problem. We set the number of the Boltzmann machines in Eq. (22) up to 3 (M ≤ 3) to promote the faster optimization of the imaginary part of the anomalous self-energy.
When the experimental spectral data contain small noise, the step-wise representation for the imaginary part of the self-energies may introduce a systematic increase in the test errors. To reduce the possible increased error, we introduce a piecewise-linear representation instead of the step-wise representation ImΣ nor/ano (see Appendix G). Since the estimated noise is very small for the experimental ARPES spectra of the optimally doped Bi2212 (T c = 90K) at 12K, the piecewise linear representation is helpful to achieve the comparable size of test errors with the noise in the experimental data (see Appendix F for quantitative discussions).

B. Real part of self-energy
The real part of the retarded self-energy is obtained through the Kramers-Kronig relation as, where a broadening factor δ = 10meV is introduced to represent a principal value, See Appendix H for details.

C. Numerical procedure for optimization
We optimize the Boltzmann machines to reproduce experimentally observed spectral functions. The optimization of the Boltzmann machines consists of an inner and outer optimization loops (see Figure 3). In the inner optimization loop, starting with given initial parameters for the Boltzmann machines, all the parameters of the Boltzmann machine are optimized to minimize the training error defined in Eq. (26) by following the natural gradient method detailed in Appendix E. On the other hand, in the outer optimization loop, a test error defined below (Eq. (28)) is minimized by the Bayesian optimization that only updates the centers of mass of the distributions in D(S) by fixing other BM parameters to the values obtained in the inner loop. The updated distribution delivers the initial values for the next inner loop.
To find the optimized self-energy, these inner and outer loop optimization processes are combined as follows. First, Boltzmann machines are initialized to follow the prior knowledge explained above. Then, these Boltzmann machines are optimized to minimize the training error in the inner loop. Once the inner-loop optimization converges, the test error is evaluated by the optimized Boltzmann machines, and the self-energies given by the present Boltzmann machines are stored as the current best ones. When the outer loop is repeated, only if the test error becomes minimum in the whole optimization history, the present self-energies are stored as the current best. Then, the centers of mass of the M Boltzmannmachine distributions are updated in the outer loop by using the Bayesian optimization scheme. The next inner loop starts again with the updated Boltzmann machines parameters, while, if the number of repetitions of the outer loop already reaches an upper limit (typically less than a hundred in the present paper), the optimization is finalized and the current best Boltzmann machines give the optimized self-energies. For an efficient optimization, the initial condition at the largest L (L = 9) is prepared from the optimization at L = 8 (see Appendix E.)

D. Minimization of training error
For given initial parameters of the Boltzmann machines, the least square error of the training defined by is minimized, where N d is the number of the experimental data points, A exp (ω) is an experimentally observed A(k, ω), {ω j } (j = 1, 2, . . . , N d ) is the set of frequency where A(k, ω) is observed in the experiment, and f (ω j ) is a convolution of the Fermi-Dirac distribution and a Gaussian distribution. The experimental data A exp (ω j ) involves the Fermi-Dirac distribution broadened by the resolution of the experiments. Therefore, we introduce the convolution f (ω) of the Fermi-Dirac distribution at 12 K for Bi2212 and 11 K for Bi2201 and the Gaussian distribution with standard deviation √ σ 2 = 5 meV. Here, we normalize the experimental data A exp (ω) by assuming (1/N d ) j A exp (ω j ) = n 0 , where 0 < n 0 < 1. In this paper, we infer n 0 = 0.3 per spin, which means that 60% of an electron is assumed to be distributed in the experimentally observed range (ω −0.4 eV). We show that n 0 = 0.3 is indeed the optimized value of the least square fit in Appendix I while the result of the selfenergies does not sensitively depend on the choice of n 0 around 0.3.

E. Minimization of test error
To further explore the multi-dimensional parameter space of the Boltzmann machine and find an optimized solution, we employ the Bayesian optimization scheme in the outer loop. Before performing the Bayesian optimization, as explained above, we perform sufficiently large number of optimization steps, which is typically 4 × 10 3 , in the inner optimization loop to minimize the training error.
Then, we update the center of mass of each component of mixture distribution represented by Boltzmann machine in ImΣ ano defined in Eq. (17), in the following procedure. First, we extract the weight, center of mass, and variance of each Boltzmann-machine distribution by zeroth, first, and second moments as the function of ω.
To update the center of mass, we use the Bayesian optimization scheme [21] depending on the history of the optimization process for the center of mass, where the test error χ 2 defined below is the cost function to be minimized instead of χ 2 , to avoid overfitting. Then, we construct the initial values for D(S) that defines ImΣ ano for the next inner loop by using the updated center of mass with the weight and variance obtained above. Each Boltzmann machine in the mixture distribution D(S) is initialized as the Gaussian distribution with the weight, the updated center of mass, and variance. One may wonder why we initialize the Boltzmann machine again with the Gaussian distribution: Since the next inner loop optimizes the Boltzmann machine again, the initialization by the Gaussian does not alter the final results, where the final convergence to the optimized self-energies is reached after the inner loop. The reason to reduce the distribution temporarily to the Gaussian is that the outer loop optimization can be handled easily since only the three lowest moments are needed.
To define the test error, first, we generate synthetic experimental data from the original data. Because the overfitting originates from reproducing detailed noisy behaviors in the experimental data finer than the experimental resolution [22], to eliminate the noise, the experimental data A exp is fitted by a smooth function A fit defined as a linear combination of the Gaussian distributions [21] with standard deviation √ σ 2 = 10 meV equal to the experimental resolution. Then, we can estimate amplitude of noise in the experimental data as This error estimate is a standard procedure in the linear regression problem [23]. Inferring error or noise of the experimental data from the smoothed curve represented by an interpolation of the experimental data based on the physical assumption of smooth and continuous behavior in nature is a general established procedure in the linear regression problem. (See for instance, Ref. 21). This is a natural regularization to infer the generalization error reliably. There, it is important to assume the smoothness within the scale of the experimental resolution and with a frequency scale sufficiently longer than the interval of the experimental discrete points to exclude overfittings. This is achieved by the superposition of the Gaussian with the 10meV width to meet the experimental resolution, 10meV (see Ref. 9 as well).
By using the probability distribution p(A syn |A fit , ω) ∝ By assuming that p(A syn |A fit , ω) well reproduces real experimental data, the cost function to avoid the overfitting is defined by, where A syn r is the rth synthetic experimental data independently generated by the probability distribution p(A syn |A fit , ω) and {ω (r) s } is a set of randomly chosen frequency points for each synthetic data A syn r . See also Appendix F.
The optimization of the internal parameters of the Boltzmann machines in the inner loop and the optimization of the center of mass of each mixture distribution in the outer loop is repeated together several tens. The self-energies that give the minimum of χ 2 for the test data are called the optimized self-energies.

IV. BENCHMARK TESTS
In this section, we benchmark the performance of the present self-energy inference by utilizing the artificial neural network representation. First, we reproduce the normal self-energy of metals. Since there is no anomalous component, the regression becomes easier and less underdetermined. As typical examples, the surface state of Be and normal state of (La,Sr) 2 CuO 4 are analyzed.
Next, we perform the regression to reproduce the known normal and anomalous self-energies of superconductors. The present method indeed reproduces these self-energies for a model of strong coupling Bardeen-Cooper-Schrieffer (BCS) superconductors. The readers who are interested in the main results on the superconducting cuprates may skip this section and go to Sec. V. show the theoretical spectral function given by the machinelearning self-energy, which is shown in the right panel. The red (blue) closed squares represent the imaginary (real) part of the self-energy obtained by the present Boltzmann-machine learning with n0 = 0.5 (n0 is the ratio of the measured weight to the total weight). For the definition of n0, see Sec. III D. The magenta (cyan) curves represent the imaginary (real) part of the self-energy obtained in Hengsberger et al. [24] by using the Migdal-Eliashberg theory with a constant shift of the imaginary part due to elastic scatterings. Here we note that wavy structures in Σ shown in the right panel simply originates from the wavy structures in the experimental ARPES data superimposed on top of the sharp peak shown in the left panel (red open squares) likely to be ascribed to the experimental noise.

A. Normal metals
To benchmark the capability of the present scheme, we analyze non-superconducting metals as trivial examples.

Surface state of Be
The ARPES spectrum of Be (0001) surface state [24] is analyzed as a typical metal. In Fig. 4(a), the spectral function at the Fermi momentum k F obtained by our regression task reproduces the experimental data. The optimized normal self-energy is consistent with the selfenergy given by the combination of Migdal-Eliashberg (ME) theory and data from experiments by considering a constant shift of the imaginary part due to elastic scatterings, as shown in Fig. 4(b). Note that our machine learning procedure does not assume the ME theory and shows the ability of reliable regression later even when the ME theory is not a priori justified as in the case of strongly correlated electron systems.
2. Normal state of (La,Sr)2CuO4 We have also analyzed the ARPES data of Zhou, et al. [25], to see the normal-state self-energy of (La,Sr) 2 CuO 4 . After subtracting the background proposed in this paper, we have analyzed intrinsic spectral function by the machine learning and succeeded in reproducing the spectral function from the optimized selfenergy as shown in Fig. 5.
We also compare thus obtained self-energy with the results of the standard Kramers-Kronig (KK) transformation scheme and found that they are essentially consistent each other. Here, the standard KK transformation scheme employed in Ref. 25 consists of three steps: First, a high-energy tail is attached to the spectral function to obtain the full ω dependence of −(1/π)ImG(ω). Second, by assuming the particle-hole symmetry, the KK transformation is used to obtain the real part of G. Then, from the Dyson equation, the self-energy Σ = ω − G −1 is obtained. Without the assumption on the high-energy tail and the particle-hole symmetry, our regression scheme indeed reproduces the results of the standard KK transformation.

B. BCS superconductors
Here, we show that the present scheme reproduces the self-energies of the strong coupling BCS superconductors. In the BCS superconductors, the amplitude of the self-energies is smaller than those in strongly correlated electron systems such as cuprate superconductors. The weaker self-energy effects are suitable for perturbative treatments for forward problems such as the ME theory if the ME theory is justified. In contrast, the present self-energy regression from the spectral functions works well for the strongly correlated electron systems because the spectral functions provide more information necessary for solving the inverse problem to obtain the selfenergies, such as large superconducting gaps and broader quasiparticle peaks. Then, the regression of the BCS superconductors are difficult tasks to perform by using the present scheme when the amplitude of the anomalous self-energy and its influence on the spectral function are small. Nevertheless, in this section, we demonstrate the successful regression of the self-energies for a BCS superconductor when it is close to the strong-coupling limit, in which we have substantial amplitude of the anomalous self-energy. To perform the KK transformation, a high energy Gaussian tail is added to A exp (ω) for ω < −0.8 eV as shown in (c). Real and imaginary parts of the normal self-energy (blue and red squares, respectively) obtained from the machine learning using the EDC curve shown in (b) are compared with the real and imaginary parts obtained by the standard KK transformation (cyan and magenta solid curves, respectively). The standard KK scheme is detailed in Appendix D. As a typical model, the superconducting state described by the following Eliashberg equations is examined. By following the standard strong coupling theory for boson-mediated superconductors [5,26,27], the superconducting gap ∆(ω) and particle-hole symmetric component of the normal self-energy ω(1 − Z(ω)) at zero temperature are given by the Eliashberg equations, where ∆ 0 = Re∆(∆ 0 ), ω c is the cutoff frequency, and η is a positive broadening factor. Again, we assume that the noninteracting density of states is given by a momentum and energy independent constant N F for simplicity and the superconducting symmetry is momentum independent s-wave. In the following, we use the name "phonon" for the boson, although Eqs. (29), (30), and the following Eq. (31) can be used to describe electrons coupled to any localized optical boson mode, irrespective of the origin of the boson mode. Here, we assume that the kernel functions K ± originate from the Einstein phonon as where Ω is the Einstein phonon frequency and g el-ph is the electron-phonon coupling constant. With the assumption that the density of states is a constant around the Fermi level, the self-energies obtained by the Eliashberg equations are independent of the electron density and dimension of the system. Then, the normal component of the Green function is defined as where φ(ω) = Z(ω)∆(ω) and is the energy measured from the Fermi energy. The spectral function is defined as is the Fermi-Dirac distribution, and the superconducting density of states is given by The notation of the self-energies used in the literature on the phonon-mediated superconductors [26,27], and φ(ω), is different from Σ nor (ω) and Σ ano (ω) in the present paper. The normal and anomalous components of the self-energies are obtained as In Fig. 6, an example of the self-energy inference for the phonon-mediated superconductors is shown. Here, we set the coupling constant g 2 el-ph N F = 0.275 eV, the Einstein phonon frequency Ω = 0.075 eV, the cutoff frequency ω c = 4 eV, and the broadening factor η = 0.0075 eV 2 . The parameters in Σ (0) (ω) are chosen as a = 0.008 eV 2 , b = 0.016 eV 2 , and α = 0.005 eV 3 . For normalization of the spectral function, we choose n 0 = 0.3 without tuning. Although its effect is negligibly small in the inferred self-energies, the Fermi-Dirac distribution with T = 40 K is introduced in the spectral function used in the machine learning just by following the scheme with finite-temperature experimental data. The machine learning results capture essential features of the original normal and anomalous self-energies that generate the target spectrum A b , where the anomalous selfenergy has a dip around −(∆ 0 + Ω) and −(∆ 0 + 2Ω) and the normal self-energy shows a sharp drop. The dip in the anomalous self-energy, which arises from the electron-phonon coupling and gives the superconducting gap through the Kramers-Kronig relation, is responsible for the s-wave superconductivity. These features are characteristic of the strong coupling BCS (phononmediated) superconductors and the anomaly of N b (ω) at −(∆ 0 + Ω) shown in the inset of (a) is regarded as the evidence of the phonon mechanism. Although the slight deviation between ImΣ nor b and the inferred self-energy ImΣ nor for ω −0.3eV due to the finite high-energy cutoff in ImΣ nor , in contrast to ImΣ nor b that stays constant even for ω −0.3eV, the machine learning well reproducing the exact results of the dip in ImΣ ano and the sharp drop in ImΣ nor indicates the reliability of the present method.

V. RESULTS: CUPRATE SUPERCONDUCTORS
In this section, we will show the results of the regression for ARPES data of Bi2212 and Bi2201. The prominent peak structures are found both in the normal and anomalous self-energies, which cancel each other.

A. Experimental data
We utilize high resolution ARPES data taken for two cuprate compounds, Bi2212 for optimally doped sample with critical temperature T c ∼ 90K [12] and Bi2201 underdoped sample with T c ∼ 29K [28]. We analyze Bi2212 data at temperature T = 12 K and Bi2201 at T = 11K, which are both well below T c . The machine learning enables us to obtain Σ nor and Σ ano separately, and reveals prominent peak structures in both of them, which are apparently hidden in the original ARPES data, because of the cancellation of these two contributions. We elucidate its profound consequences for the superconducting mechanism. Although tremendous efforts have been devoted since the discovery of the cuprate superconductors with many fruitful clarifications, various puzzling issues remain open. The normal-state A(k, ω) is highly unusual including the pseudogap. Nevertheless, the superconducting phase does not look so unusual except for the dwave-type nodal gap itself and somewhat inconspicuous "peak-dip-hump" structure (see red square symbols in Fig. 7(a) [12,28]): Outside the sharp quasiparticle peak (at ∼ −40 meV in Fig. 7(a)) expected at the superconducting gap edge, A(k, ω) (energy distribution curve (EDC)) particularly at the antinodal point k = k AN is characterized by a deeper-energy weak dip followed by a broad hump [1,2]. In contrast, the underdoped sample does not show the gap-edge peak ( Fig. 7(b)), although comparable gaps ∼ 30 meV open as a first look. They are in contrast with the strong-coupling BCS superconductors, where the solution of the Eliashberg equation using the phonon density of states predicts prominent peaks (or saw-tooth-like) structures outside the gap in A(k, ω) (or density of states after angle integration), which has finger-print correspondence to the actual peak measured by the tunneling spectra, while the peak is shown to be crucial in the emergence of the superconductivity, thus constitutes the decisive testimony of the electron-phonon mechanism [5,6,11].
There are limitation on information available from the ARPES measurements. The unoccupied states invisible in the ARPES spectra may impose such a limitation. The bilayer nature of Bi2212 may also affect our regression scheme. There are also issues such as photon energy dependence and effects of matrix elements. The influence of these uncertainties and issues do not change our results in the following sections, as examined in Appendices F and J.
By using the Boltzmann-machine learning, a dramatic consequence is revealed for Σ nor and Σ ano by reconstructing them from the mild structure of A(k, ω) given by ARPES. The present reconstruction is a non-linear underdetermined problem as in many of machine learning problems. To obtain a reliable solution, we utilize physically sound constraints such as the rigorous causality encoded as the Kramers-Kronig relation. Sparse and localized nature of ImΣ ano has resulted a posteriori as the optimized solution under physical constraint as detailed in Sec. II C. To represent the self-energies and incorporate the physically sound constraints, the Boltzmann machine as universal function approximators developed in machine learning is employed. See Sec. III and also Figure 3 for the flow chart.
The obtained A(k AN , ω) (cross points in Fig. 7) perfectly reproduces the distinct behaviors of both the optimally doped and underdoped samples.
These EDC curves are constructed from ImΣ nor (k AN , ω) and ImΣ ano (k AN , ω) in Figs. 8(a) and (b). Remarkably, prominent peaks are found in ImΣ ano (k AN , ω) at ω = ω OP PEAK ∼ ±70 meV for Bi2212 and at ω = ω UD PEAK ∼ ±45 meV for Bi2201, accompanied by weaker peaks at ± 180 meV and ± 160 meV, respectively. We will show below that the discovered peaks are the main source of superconductivity. Although the peak of ImΣ ano (k AN , ω) had been searched for long time in analogy to the strong coupling BCS superconductors [2,13], its clear signature was missing in the cuprates. The machine learning now has succeeded in its identification. Surprisingly, ImΣ nor (ω) also has distinct (positive or negative) peaks at the same energy as ImΣ ano (ω) and, as we clarify below, their contributions to the spectral function cancel out each other.
The robustness of our finding against noise, uncertainty in experimental data and the experimentally uncertain high-energy part is demonstrated in Appendix F and D. In Fig. 8(a), we also plot the error bars of the peak position and height inferred from the experimental uncertainty and the machine learning error (for detailed procedure of the error-bar estimate, see Appendix F 1, Figure 17, and Sec. III ("Minimization of training error" and "Minimization of test error" as well as Appendix G). The small error bars indicate that the existence of prominent peak is reliable. For this analysis, we have utilized the inferred experimental noise estimated from the interpolated smooth A(k, ω) (namely, the inferred true smooth A(k, ω)) obtained in the standard linear regression analysis . Then the error bars of our peak estimate here is given from the optimization to hypothetical experimental data points generated with the same level of noise to the inferred true A(k, ω). See Sec. III E for details. We note that in the case of underdoped Bi2201 sample below T c , there exists subtlety in the machine learning solution. Although the superconducting solution presented here is the optimized solution that gives the smallest mean-square error in the fitting of A(k, ω), an insulating solution is also found with larger error. This may be related to severe competition of insulating and superconducting behaviors in the real sample. We show the properties of the superconducting solution because it gives the best optimized solution and the sample is indeed superconducting.
Despite the peaks in ImΣ nor , and ImΣ ano , prominent peaks are missing in ImΣ tot (ω) as shown in Figs. 8(c) and (d) (black symbols). We discuss below why the peaks in ImΣ ano at ω = 0 necessarily show up and their contribution cancels with ImΣ nor , when we impose physically justifiable constraints such as the Kramers-Kronig transformation. Instead of the peak in ImΣ nor (ω) and ImΣ ano (ω), a negative prominent peak [29] generating the superconducting gap is found centered at ω ∼ 0 in ImΣ tot , which arises from the zero of the denominator in Eq. (5), commonly to the conventional BCS superconductors.

C. Role of self-energy peaks to superconductivity
To understand the significance of the peaks at ω PEAK in ImΣ ano , we show the contribution of the peaks to ReΣ ano (k, ω = 0) estimated from the normalized partial Kramers-Kronig relation [30] (Cauchy relation) (see also Eqs. (24) and (25)) defined by ReΣ ano (k, ω = 0) is a measure of the superconducting amplitude, because the gap ∆(k, ω = 0) is proportional to Σ ano (k, ω = 0) (Eq. (6)). Since I Σ (Ω = −∞) = 1, the contribution of the peak in ImΣ ano to the superconductivity can be estimated from the increment in I Σ (Ω). Figure 9(a) shows that the inner energy peak at ω PEAK ∼ −70 meV (−45 meV) for Bi2212 (Bi2201) both contribute to more than 90% of ReΣ ano (k AN , ω = 0) (note that ImΣ ano is an odd function of ω). Namely, these peaks are indispensable in the emergence of the superconductivity.
To further demonstrate the crucial role of the inner peak in ImΣ ano (k AN , ω), we have hypothetically eliminated the peaks in ImΣ nor and ImΣ ano as shown in Fig. 9(c) for Bi2212. The resultant A(k, ω) in Fig. 9(d) shows the gap disappearance and switching to a normal metal with a sharp quasiparticle peak, confirming the crucial role of the peaks to superconductivity.
Through our Boltzmann-machine analyses, Σ nor , Σ ano and W are revealed to have prominent (positive or negative) peaks, while they cancel each other in the sum Σ tot . It is important that this conclusion is obtained directly from experimental data without assuming any theoretical models aside from mathematical (causal) requirement for the Green function. A recent self-energy analyses of ARPES data [13,31] did not identify the present prominent structure. However, our analyses on the momentum dependence suggest that the results are not necessarily inconsistent each other for the case of Ref. 13, because the Contribution of peak of ImΣ ano (kAN, ω) showing dominance to superconductivity. IΣ(Ω) calculated from ImΣ ano (kAN, ω) is shown for the Bi2212 (a) and for the Bi2201 (b) The right negative peaks of ImΣ ano (kAN, ω) both contribute to more than 90% of ReΣ ano (ω = 0). (c) ImΣ nor (kAN, ω) and ImΣ ano (kAN, ω) modified from the originals in Figs. 8(a) and 9(a) by eliminating the low-energy peaks around ω = −70 meV for Bi2212. The peak component of ImΣ nor (kAN, ω) to be subtracted is ImΣPEAK(kAN, ω) in Fig. 8(c) and the subtracted ImΣ nor (kAN, ω) is nothing but ImΣN(kAN, ω). On the other hand, ImΣ ano (kAN, ω) consists of only two peaks and the right peak around ω = −70 meV can be easily subtracted by using the sigmoid function. Peak-subtracted ImΣ nor (kAN, ω) and ImΣ ano (kAN, ω) are represented by red and purple circles, respectively. (See Appendix K for the prescription to decompose the self-energy and see Appendix C that the peak position of the gap function ∆(ω) is slightly shifted from Σ ano (ω).) (d) In comparison to the experimental A(k, ω) (blue thin curve), spectral function obtained from the peak-subtracted ImΣ nor (ω) and ImΣ ano (ω) is shown by red circles, where superconductivity disappears resulting in a good normal metal with a quasiparticle width comparable to the experimental resolution (∼ 10 meV). momentum range in Ref. 13 is limited in the nodal region far from the antinodal point, where we also see that the prominent peak is missing because of the d-wave symmetry of the superconducting gap. It is crucially important to see the antinodal region to see the prominent peak as we discuss in Appendix A and L. Further, we show in Appendix A that physically inappropriate assumptions for strongly correlated electron systems posed in Refs. 13 and 31 lead to failures of identifying the existence of the peaks.
If a large superconducting gap is open around ω = 0 as in the experimental A(k, ω), it requires the corresponding famous gap structure in ReΣ ano around ω = 0, where inside the two peaks at the gap energy ω = ±∆, ReΣ ano shows plateau as shown in Fig. 8(a). Then consistency requires a prominent peak structure around ω PEAK in ImΣ ano through the Kramers-Kronig relation. The peak of ImΣ ano in turn naively anticipates prominent structures outside the gap in A(k, ω) through Eq. (3). However, such structures are missing. This is possible only when Σ nor plays a role to cancel the prominent structure in ImW . This is corroborated by the vanishing superconductivity in Fig. 9(d). Furthermore, the superconducting order accompanied by coherent quasiparticle excitations observed in experiments can be generated from ImΣ ano (ω) only when electrons at ω become coherent, signaled by the reduction of |ImΣ tot | (or more precisely |ImΣ N |) (see Figs. 8(c) and (d), their captions, Sec. III, and Appendix K for the definition of |ImΣ N |) seen in the region |ω| < ω * ∼ 0.07 eV. This restricts ω PEAK to this range. The present machine learning indeed reproduced this natural expectation in a physically transparent and reasonable way.
We show some analyses on the temperature dependence of the self-energies including a case above T c and the momentum dependence away from the antinodal point in Appendix M and L, respectively, as supporting data of the present analyses. The results confirm the validity of the present conclusion.
Although we found that the prominent peak in ImΣ ano is crucial for the high critical temperature of the curates, full understanding and mechanism of prominent peaks in Σ nor (ω) and Σ ano (ω), which are absent in Σ tot (ω) are open to further analyses. Our result is significant because the cancellation poses a severe constraint on theories of curate superconductors and calls for further consistent studies based on this finding. In Discussion, we refer to one possible explanation.

A. Insights into intrinsic (Planckian) dissipation
In this and next subsections, we discuss further possible connection gained from the main findings of the peaks and their cancellation to other more open issues of the cuprate superconductors in regard to the damping (incoherence) of quasiparticle and the factors that determines T c to emphasize the significance of the findings with outlook. The present machine learning is also useful to separately extract other theoretically fundamental quantities such as the momentum resolved superconducting order parameter (the density of Cooper pairs or the superfluid density) F (k), mass renormalization factor z qp (k) and the single-particle relaxation time τ , which had been inferred only indirectly or only in combinations of more than one quantity in experiments in the literature, although these quantities play crucial roles independently of each other below in understanding physics. (see Sec. II B for precise definition of the above quantities).
How frequently the single-particle excitations are scattered is encoded in the imaginary part of the normal self-energy ImΣ nor . Landau's Fermi-liquid-like behavior characterized by ImΣ nor (ω) ∝ ω 2 , is satisfied only in a small region (|ω| < 0.03 eV) for Bi2212, and looks even linear (∝ |ω|) in the same region for Bi2201, implying non-Fermi liquid (marginal Fermi liquid) behavior [32] (see Figs. 8(c) and (d)), which can be fit by The ω-linear component c 1 (k)ω is disruptive to the quasiparticle picture, and manifests emergent inelastic dissipation absent in Landau's Fermi liquids. As supporting information, tiny quasi-particle renormalization factor z qp corroborating the non-Fermi liquid together with its effects on pair breaking is also shown in Appendix B (see definition of z qp in Eq. (9) ).
The single particle relaxation time τ is defined by τ (k, ω) −1 = z qp (k)ImΣ nor (k, ω)/ . When the carrier relaxation time is estimated from τ , the ω-linear term, z qp (k)c 1 (k)ω, is associated with the universally observed T -linear resistivity in the cuprates [33,34] through the ω-T correspondence τ ( ω) ↔ τ (k B T ) transformed to the self-energy of two-particle Green function for the conductivity. The temperature-insensitive z qp (k)c 1 (k) shown in Appendix N also supports the correspondence. (See Fig. 25 for each z qp (k) and c 1 (k).) A remarkable property of the inelastic relaxation rate Γ(k) = z qp (k)c 1 (k) is its high value (∼ 1) with only weak dependence on the doping, momentum (see Fig. 10(a)) and temperature.
This universal behavior of Γ ∼ 1-1.5 seems to support a local and universal mechanism of the relaxation, for instance, the Planckian dissipation mechanism of the hydrodynamic state, which claims τ −1 (T ) = Γk B T / or τ −1 (ω) = Γω with a universal constant Γ of the order unity [35,36].
Although simple version of Planckian mechanism expects only an extended broad self-energy structure due to "unparticle physics," the self-energy has a broad but prominent peak structure around ω = ω PEAK which is responsible for the superconductivity through the Kramers-Kronig transformation as we discussed. At the same time, the actual line shape is rather broad with the width around 0.05 eV (see Figs. 8(c) and (d)), which is comparable to ω PEAK itself. More importantly, the peak is smoothly connected in the tail with the ω-linear behavior near the zero energy, implying that the "Planckian dissipation and hydrodynamic behavior" associated with the strange metal [35] is caused by the source of the superconductivity. The broad prominent peaks could be due to the damped pole but it could also be ascribed to "unparticle object" generated by entangled bare electron and dark object.
The marginal fermi liquid behavior (ImΣ nor (ω) ∝ ω) needs to be understood with care. Since the present photoemission data could include extrinsic background effect, our analysis may not clarify the high-energy part of intrinsic ω-linear behavior. In fact, in relation to the T -linear resistivity, the related ω-linear behavior should show up around the gapless nodal region, while the peak of the normal self-energy vanishes at the nodal point (see Fig. 22). Therefore, the ω-linear coefficient observed as the tail of the peak is not necessarily the same as the Tlinear coefficient in the resistivity. In fact the high-energy ω-linear component in Fig. 8 (c) and (d) (black squares) has substantially smaller slope than the present ω-linear component directly associated with the prominent peak. This smaller slope at the high-energy region (ω < −0.1 eV region) is consistent with the high-temperature Tlinear resistivity [34] through the correspondence relation ω ↔ πk B T and the ω-linear self-energy in the highenergy part identified in an earlier study [13]. We need further studies on the relation between these two ω-linear components.
The T -linear scaling may remind the readers of quantum critical behaviors. In the present study, we examine only two sets of data for different compounds with different doping concentrations. Therefore, we could not exclude the possibility that these samples are by chance both close to quantum critical points. In fact, the underdoped sample shows more linear behavior than the optimum doped sample, which might imply that the underdoped sample is closer to the quantum critical point. Alternatively, the T -linear scaling behavior could emerge in a distinct phase covering a finite range of doping concentrations as discussed above as the Planckian fluid. However, it is impossible to draw a conclusion from these two samples only and it is left for future studies.

B. Factors that determine the superconducting critical temperature
Fundamental quantities revealed by the machine learning provide further insight into the superconductivity through the scaling among experimental observables: The linear relation F ∝ T c between T c and the superfluid density F measured from the muon-spin relaxation rate R (theoretically proportional to F (k)z qp (k) averaged over Fermi surface momentum) has been examined through the Uemura plot [40] in high-T c superconductors as in an example of the purple triangles in the inset of Fig. 10(b) for Bi2201 [37]. The linearity should be satisfied for attractive interaction stronger than the effective Fermi energy scale E F , which is proportional to the carrier density in two spatial dimensions. Here, E F is roughly the effective bandwidth of the dispersion z qp k . This proposal interprets the linearity as a signature of the Bose-Einstein condensation (BEC) regime. Homes et al. [39] proposed empirical but more universal fitting as plotted in an example of Bi2201 by blue upside-down triangles in the inset of Fig. 10(b) [37,38], where the dc conductivity σ dc at T c enters as R ∼ CT c σ dc (T c ) with a material independent constant C. Since σ dc is proportional to the momentum relaxation time, the Homes relation proposes qualitatively different physics involving dissipation and scattering effects beyond the naive BEC regime. However, since σ dc is believed to be proportional to both the carrier density and the relaxation time, it is not easy to single out the relaxation effect. Related scaling of the superfluid density F z qp proportional to the quasiparticle peak weight was also proposed [41,42].
Here, we heuristically propose a better scaling for T c by utilizing the present finding to show the power of the machine learning and the significance of the peak. The amplitude of the self-energy peak discovered here responsible for the superconductivity has to represent the scale of the effective attractive interaction for the Cooper pair in analogy to the Eliashberg formalism and should enter the T c scaling. Our proposal for T c is given by where Γ is the damping introduced before and F is the superfluid density. The factor g is the scale of the effec-tive attractive interaction as will be discussed later. It is reasonable that T c is scaled by the mean field acting on the formation of the Cooper pair given by the product of the attractive interaction g and the order parameter given by F (superfluid density).
Let us first discuss how the characteristic effective attractive interaction g is extracted from our self-energy analysis. We first introduce the bare attraction Ω 0 (k AN ), which is represented by the ratio of the peak intensity W PEAK (k AN ) (peak intensity of W ) in Fig. 8 to the absolute value of the peak energy ω PEAK (k AN ). Note that W PEAK (k AN )/ω PEAK (k AN ) is proportional to the gap through Kramers-Kronig relation. (Through Eq. (5), the residues of the poles of W , Σ nor and Σ ano follow the same scaling at the self-energy peak). Here we assume that the antinodal value is the representative of this estimate, because the gap is the maximum. Now W PEAK (k AN ) = dωImW PEAK (k, ω) is the integrated intensity of the peak ImΣ PEAK = −ImW PEAK plotted as the yellow and pink areas in Figs. 8(c) and (d). We note that the quasiparticle weight constrains the coherent pairing so that the peak intensity W PEAK (k AN ) should involve the averaged quasiparticle weight at the peak energy. For the quasiparticle weight, instead of z qp defined in the ω → 0 limit, we employ the renormalization factor Q at k AN (see Eq. (7)) averaged in the self-energy peak region, namely, Q = dωImW PEAK (k, ω)Q(k, ω)/W PEAK (k) with the integration over the interval ω < 0. Then gives the scale of the attractive interaction. The damping Γ in Eq. (37) looks counterintuitive because it tells that the strongly damped electrons would have higher T c . However, in this strong coupling superconductor, the strong damping is originated from the quantum entanglement as is discussed in the previous subsection, which may promote the quantum mechanical singlet pairing. In Fig. 10(b), the fit by Eq. (37) is shown (the main panel of Fig. 10(b)).
Although the Homes plot does not offer how T c is determined because σ(T c ) ∝ 1/T c cancels in the relation to F , the present result indeed shows T c linearly scaled by Γ(k N )g(k AN )F (k AN ). The linearity is crucially different from the Uemura plot as well because of the dependence on the relaxation rate Γ. Intuitively, τ = /(Γk B T ) or /(Γk B ω) is related to the characteristic length scale λ for the extension of the quantum mechanically entangled area through λ ∼ v F τ [35], where v F is the characteristic electron velocity ("Fermi velocity"). The larger attraction generates stronger self-energy peak. It necessarily generates the steeper ω-linear tail of ImΣ nor near zero energy, further enhancing more local and stronger pairing adiabatically continued to the BEC limit beyond the realistic cuprate regime, and raises T c through Eq. (37). The strange (dissipative) metal and high T c with the strong attraction represent the two sides of the same coin. It is also interesting to note that Eq. (37) looks compatible with the scaling E c ∝ γ c T 2 c ,where E c is the condensation energy and γ c is the Sommerfeld constant of the specific heat [43,44], because gF plays the role of the gap, which generates the energy gain.
Note that Γ(k) should be analyzed around k N , while Ω 0 and F contribute to T c at k AN for better fitting. The in-plane transport and the quantum entanglement are dominated by the contribution around the nodal region, while the pairing looks driven in the antinodal region, both of which contribute to raise T c .
Equation (37) is the best scaling among various attempts we have made. To convince readers, we just show two examples of plot in Figs. 10(c) and (d). The first example is z qp (k AN )F (k AN ) vs. T c plotted in Fig. 10(c), which, though not perfectly equivalent, apparently mimics the Uemura plot. The second is z qp (k N )F (k AN ) vs. 1/z qp (k N )c 1 (k N ) which mimics the Homes plot, because T c σ(T c ) ∝ 1/z qp (k N )c 1 (k N ) is expected (see Fig. 10(d)). The standard deviation is by far best for the present fit in Fig. 10(b) with Eq. (37).
The primary origin of the large difference of T c between Bi2201 and Bi2212 is identified as the difference in W PEAK (namely the coupling strength of the electron with the dark object which makes the prominent self-energy peaks), supplemented by the difference in ω PEAK . (see Table I. W PEAK (k AN ) ∼ 7.6 × 10 −3 eV 2 and ω PEAK (k AN ) ∼ 0.07 eV for the optimum Bi2201 and W PEAK (k AN ) ∼ 1.4×10 −2 eV 2 and ω PEAK (k AN ) ∼ 0.045 eV for the optimal Bi2212 at k = k AN ). (More quantitative details, see also the list of Ω 0 (k AN ) in Table I and the angle and doping dependences of W PEAK (k)/ω PEAK (k), and Q(k) in Fig. 25.) A recent intensive study [45] on clean thin films of (La,Sr) 2 CuO 4 has revealed that the Uemura plot shows a linear scaling for T c 10 K and a quadratic scaling, where T c is proportional to the square root of the relaxation rate, for T c 10 K. Consequently, the higher T c linear scaling extrapolates T c to a finite value at the zero relaxation rate limit. The present scaling modifies the higher T c linear scaling by taking into account gΓ. If g is approximately a constant for a given crystal structure and, Γ increases by decreasing the doping, as already reported in the overdoped region [46], the linear scaling in Fig. 10(b) is consistent with the sub-linear scaling of T c to F observed in Ref. 45. We note that, if Γ diverges as T c decreases in the underdoped region, intrinsic doping dependence of Γ may explain the square-root scaling at the underdoped limit qualitatively. However, the quantitative estimate of the power of the scaling is beyond our scope. On the other hand, ARPES measurements on clean samples at the low T c limit in the overdoped region are not available.
The present result is significant because the whole analyses are obtained solely from the single ARPES line shape of A(k, ω) and contains much less ambiguity than before. The present machine learning purely from experimental data sheds new light on understanding the superconducting mechanism, where the energy dissipation plays a role through the extension of quantum entanglement. For detailed doping concentration and momentum dependences of T c , F (k), c 1 (k), z qp (k), and the superconducting gap ∆ 0 (k) for Bi2201 at 11K are found in Appendix O.

C. On cancellation of peaks
On the cancellation of the two self-energy contributions which make the superconducting temperature high and the role of dissipation in determining T c , it is desired to examine the present results in other cuprate compounds by measuring A(k, ω) at high accuracy and resolution.
The present finding also call for research to find the microscopic origin of the canceling self-energy peaks. A two-component model was proposed [47,48] for the cancellation of the self-energy poles, but even if it is the case, based on the present experimental evidence, further pursuit for the microscopic description of the hidden-fermion excitation and its interaction is highly desirable. In the two-component fermion theory, electrons are fractionalized into the bare electrons and dark fermions (hidden fermions) consistently with the cluster dynamical mean-field theory (cDMFT) [47,48]. However, the prominent tail of the peak extending to ω ∼ 0 as ω-linear feature suggests that the dark fermion must have strong interaction effect. See also Ref. 49 for the prediction of the fractionalization expected in other spectroscopic data such as the resonant inelastic X-ray scattering.
If the spontaneous symmetry breaking such as the stripe order coexists with the superconductivity, the cancellation may be accounted for as an alternative interpretation [47,48]. A phenomenological resonating valence bond theory [50] also accounts for the cancellation because of the same fractionalized description as is discussed in Ref. 51. The present result poses severe constraints on possible theories. Whether there exist other origins of the peaks and their cancellation rather than the above possibilities would be equally intriguing based on the present finding.

D. Kink
A kink structure was observed in the dispersion of momentum distribution curve (MDC) peak of Bi2212 and other compounds mainly near the nodal point [52][53][54]. In addition, a kink-like structure was identified in the study on the Hubbard model [55]. It is an intriguing future issue to study whether this kink in the MDC peak dispersion has any connection to the peak and accompanied sudden change of the slope and sign in the real part of normal self-energy found here in Figs. 8(a) and (b), because the energy scale ∼ −0.07 eV of the sign change for Bi2212 is similar to the kink energy scale.

VII. SUMMARY AND OUTLOOK
We have formulated a reliable way of extracting the normal and anomalous self-energies separately from the ARPES data of superconductors by taking advantage of recently developed machine learning method. Careful benchmark tests including simple metals, conventional BCS superconductors and model systems that have established solutions indicate that the method offers an accurate and reliable regression of the self-energies only from the ARPES data even for challenging strongly correlated electron systems.
Then the method has been applied to cuprate superconductors, Bi2201 and Bi2212. We have successfully extracted the normal and anomalous components of the self-energy from the ARPES spectra.
In contrast to previous studies, the result shows that the imaginary part of the normal and anomalous selfenergies have prominent peak structures as a function of the frequency. However, the contributions of the normal and anomalous components cancel in the spectral function, which accounts for the failure to identify the prominent structure for long time. Nevertheless the peak in the anomalous self-energy has been shown to generate more than 90% of the superconducting gap and thus turned out to be the primary source of the superconductivity. Therefore, the discovered these peak structures and their cancellation pose a severe constraint with insight on the mechanism of the high transition temperature of superconductivity.
The origin of the failure in identifying the peak structures in the previous studies is elucidated in Appendix A to be primarily the assumptions in the previous studies: The previous studies assumed linear energy dispersion, and/or self-energies that are momentum independent along the direction perpendicular to the Fermi surface, for example in Ref. 13. These assumptions are not justified in the strongly correlated electron systems.
Thus newly obtained quantities hidden in the direct experimental measurements in the past allow us to show that the superconducting transition temperatures are well scaled by the product of the superfluid density F , the effective attractive interaction g and the Planckian dissipation Γ.
Present successful examples of insight obtained purely from the machine learning analysis of experimental data indicates an opening of a promising field which allows understanding physics hidden in experiments, without relying on involved and specific theoretical assumptions and constraints that are not shown to be justifiable in strongly correlated electron systems. At the same time, we have shown that very accurate experimental data are required to extract hidden quantities. For instance, in Refs. 56, 57, and 58, the signal-to-noise ratio in the ARPES measurements for Bi2212 seems to be already sufficiently small while available momenta in the Brillouin zone are limited. Although Ref. 59 for Bi2212 and Ref. 60 for Bi2201 reported the ARPES spectra in a wide range of momenta, the signal-to-noise ratio does not seem to be small enough. It is important to improve the experimental resolution and suppress errors to increase the reliability of the machine learning inference.
In addition combining with other independent measurements such as the quasiparticle interference obtained from the scanning tunneling microscope in this case is important to reach better statistics. By combining with other experimental data and indisputable theoretically basic constraints such as symmetry, much more powerful tool will be provided for understanding physics of complex phenomena.
The present study will stimulates studies on the origin of the peak structures. Indeed, there have been studies [61,62] on the origin of the peak structures, which are inspired and motivated by our results [63].

ACKNOWLEDGMENTS
We thank Takeshi Kondo and Adam Kaminski for providing us ARPES data published in Refs. 12 and 28. We also thank Takeshi Kondo for discussions on the experimental results. We are grateful to Shiro Sakai for discussions and comments on the manuscript and Chandra Varma for clarification of the procedure of the analysis in Ref. 13. This research was supportd by MEXT as "Priority Issue on Post-K computer" (Creation of New Functional Devices and High-Performance Materials to Support Next-Generation Industries (CDMSI)) and "Basic Science for Emergence and Functionality in Quantum Matter -Innovative Strongly-Correlated Electron Science by Integration of Fugaku and Frontier Experiments -" (JPMXP1020200104) as a program for promoting researches on the supercomputer Fugaku, supported by RIKEN-Center for Computational Science (R-CCS) through HPCI System Research Project (Project ID: hp170263, hp180170, hp190145, hp200132 and hp210163). Y. Y. was supported by PRESTO, JST (JPMJPR15NF). Y. Y. and M.I. were supported by JSPS KAKENHI (Grant No. 16H06345). A.F. was supported by KAKENHI (Grant No. 19K03741). The present regression is performed by using our house code. The numerical code will be available upon request.

Appendix A: Comparison with Previous Studies on Self-Energy
Here, we discuss the comparison with the analysis by Bok et al. [13], which has not clearly identified a prominent peak structure in the normal and anomalous selfenergies. We first point out that the primary origin of the discrepancy may be the momentum region they studied. They have analyzed mainly only around the nodal region and at most up to θ = 20 • measured from the nodal point, which is far away from the antinodal point. This makes the identification of the prominent peak difficult. In Appendix L, we show the momentum dependence of the EDC curve for the optimally doped Bi2201 (see Fig. 22). In this notation, 20 • from the nodal point in [13] nearly corresponds to the point between Fig. 22(h). It is natural that the peak structure is not clearly visible there. However, if we take a close look, aside from the clear difference of the featureless slope arising from the instrumental difference and presumable different background effect, the peak-like (or shoulder-like) structures at -0.06 eV in the imaginary part of the normal and anomalous self-energies in Fig. 3 of Ref. 13 shares a common feature with our result in Fig. 22(h). Unfortunately, this tiny signature and the cancellation in the spectral function is hardly identified conclusively because of the momentum far from the antinodal point.
More importantly, a crucial origin of the underestimate of the peak is the usage of the Dynes function as explained below. It leads to the underestimation of peaklike structures in the imaginary part of the self-energies. The underestimation inevitably leads to the difference in the self-energies obtained in the present paper and Ref. 13.
Before going into the explanation of the underestimation, we review the method used in Ref. 13 to make the discussion self-contained. The method assumes that the real part of the Dynes function N (ω) defined as, is equal to the ratio of the integrated spectral function at the normal and superconducting states as where the integrated spectral functions are obtained with respect to the bare dispersion (k ⊥ ) along momentum perpendicular to the Fermi surface as A S (ω) = dk ⊥ A s (k ⊥ , φ, ω) and A S (ω) = dk ⊥ A n (k ⊥ , φ, ω). Here, A s (A n ) is the spectral function of the superconducting (normal) state. The relation, however, holds only when the bare dispersion (k ⊥ ) is linear at the large bandwidth limit. Practically, the method approximately works around the nodal direction where the bare band is linear within a certain energy range. However, even around φ = 20 • , the approximation does not work, where (k ⊥ ) shows strong deviation from the linear dispersion, as shown below.
Here, we concretely demonstrate that the relationship ReN (ω) = A S (ω)/A N (ω) indeed underestimates the imaginary part of the self-energies. As the most striking example, we would like to demonstrate that, even if there are peak structures in the genuine self-energies, the selfenergies estimated by using ReN (ω) = A S (ω)/A N (ω) collapse to shoulder structures instead of the peak structures.
When there is no background b(ω) in the experimentally observed spectral function, the scheme to extract the self-energies used by Bok et al. is summarized as follows: (1): ReN (ω) is (initially) estimated by A S (ω)/A N (ω).
(2): ImN (ω) is obtained from ReN (ω) by using the KK transformation. Then, ∆(ω) is obtained from (3): By inputing ∆(ω) into the Green function in the superconducting phase, Z(ω) = 1 − Σ nor (ω)/ω and Σ ano (ω) = φ(ω) = ∆(ω)Σ nor (ω) are obtained. Below, we estimate the self-energies by following Bok et al. [13]. As a model self-energies that show peak structures, we take the self-energies at φ = 21.24 • obtained in 22(h), which is shown in the left panel of 11. We also assume a typical bare dispersion for bismuth cuprates, (k) = µ − 2t 1 (cos k x + cos k y ) + 4t 2 cos k x cos k y − 2t 3 (cos(2k x ) + cos(2k y )), where µ = 405 meV, t 1 = 360 meV, t 2 = 108 meV, and t 3 = 36 meV. Then, we integrated the spectra A S (ω) with Σ nor and Σ ano along the momentum cuts (in the first quadrant of the Brillouin zone) specified by the angle φ measured from the antinode and obtain A N (ω). To estimate A N (ω), we generate the normal state selfenergy Σ nor by eliminating peak structures that cancel  [13]. The shoulder structure in ImΣ nor (red circles) and the peak structure in ImW (green circles) around ω = −0.06 eV are canceled in ImΣ tot (black squares). The peak structures in ImΣ nor and ImW for ω > −0.02 eV, which are interpreted as impurity effects in Ref. 13, are also canceled each other in ImΣ tot . For details, see Appendix A.
the peak structure in Σ ano , which is shown in the middle panel of Fig. 11 by assuming that the effect of the superconductivity appears at the peak structure only in the normal self-energy. Then, in the right panel of Fig. 11, we obtain the estimated Σ nor (black curve) by assuming ReN (ω) = A S (ω)/A N (ω), in comparison with the original Σ nor taken from Fig. 22(h). Although we started from Σ nor and Σ ano that contain prominent peaks, in the resultant Σ nor derived by assuming the above Dynes function shows the disappearance of the peak. This shows that the usage of the assumption of the Dynes function leads to the self-contradiction with the underestimate of the peak structure. As shown in the right panel of Fig. 11, the estimated ImΣ nor shows the shoulder structure around -50 meV, instead of the peak around -70 meV in the original ImΣ nor . The shoulder structure of ImΣ nor found in Fig.3D of Ref. 13 around -50 meV is interpreted as a remnant of the original peak structure in the genuine self-energies, which is consistent with our present results. Therefore, the difference between our results and those of Ref. 13 is attributed to the artificial reduction in the amplitude of ImΣ nor due to the assumption ReN (ω) = A S (ω)/A N (ω) in Ref. 13.
The self-energies obtained in Ref. 13 also show a remnant of the cancellations of the peak structures of ImΣ nor and ImW found in the present paper. As shown in 12, the shoulder structure of ImΣ nor obtained in Ref. 13 is canceled by ImW . Then, ImΣ tot = ImΣ nor + ImW does not show any shoulder structure.
The above general underestimate (or elimination) of the self-energy peak is a universal failure of the assumption employed by the Dynes function and the assumption of the wide band limit. The physical origin is the following: In the wide band width limit, A N (ω) only contains the information of the bare band, which does not depend on whether the system is normal or superconducting. Then, the left-hand side of A S (ω)/A N (ω) = Re{ω/ √ ω 2 − ∆ 2 } does not contain not only the normalstate self-energy but also the bare band information, which are absent in the right-hand side of the relation. However, even at φ = 20 • , the assumption of the wide band width is invalid. Therefore, the normal-state selfenergy information remains in the denominator of the ratio A S (ω)/A N (ω), where Z n (ω) (Z s (ω)) is Z(ω) = 1 − Σ nor (ω)/ω in the normal (superconducting) state. To compensate the contribution of Σ nor in the left-hand side of the relation A S (ω)/A N (ω) = Re{ω/ √ ω 2 − ∆ 2 }, the normalstate self-energy Σ nor below T c , whose contribution is contained in A S (ω), may resemble the self-energy Σ nor above T c , whose contribution is contained in A N (ω). When the normal state self-energy does not show significant structures and, thus, does not introduce pseudogap and anomalies, such as peak-dip-hump structures, in the normal-state spectral functions, the estimated superconducting self-energies indeed do not show significant structures. We note that the shoulder structure consistently present in Ref. 13 and 12b in ImΣ nor without a signature of the pseudogap around 20 degrees momentum could be contributed from the background. However, the above analysis does not change even when the background is subtracted, because it appears both in A S (ω) and A N (ω) and the effect of the subtraction cancels.
In addition, Bok et al. [13] have assumed that the normal and anomalous self-energies are momentum independent along the direction perpendicular to the Fermi surface. The assumption imposes crucially restrictive condition when one infers the self-energy and superconducting gap function from the momentum distribution curve (MDC) as in Ref. 13 (see Eq. (S8) in the Supplementary information of their paper). This assumption is adequate in the BCS superconductors in conventional weakly correlated systems, while it is questionable in the present strongly correlated cases such as the cuprates, where even the normal self-energy can be singular and strongly dependent on the momentum. In addition to the restricted momentum dependence assumed in [13], they have assumed the quasiparticle representation along the momentum perpendicular to the Fermi surface (or the Lorentzian form of the MDC) and the spectral function that cannot be represented by the Lorentzian form are interpreted as the background, while these have not been assumed in the present paper, because, though it is satisfied in the Fermi-liquid normal state, it is not clear whether these constraints are satisfied in the strong coupling cuprate superconductors. Indeed, there exist a number of numerical evidences for the violation of the assumption: For instance, the singular momentum dependence of the normal self-energy with emergence of the coexisting zeros and poles of the Green function [64] and non-quasiparticle spectral function [65,66] in doped Mott insulators. The machine learning is more fit in solving the present problem, because the flexible fitting of the self-energy function is required at least away from ω = 0 particularly in the antinodal region, where the breakdown of the quasiparticle picture is apparent. Concerning the difference in the high-energy part (ω < −0.1 eV) of the original two ARPES data (namely in [13] and [12]), we have already analyzed the effect of the possible extrinsic high-energy part in F and has shown that it does not affect the structure of the peaks as clarified in 19 and it cannot be the origin of the difference in the peak structure. Next we discuss the origin of discrepancy in the result by Li, et al. [31]. They assumed momentum independent self-energy in the MDC analysis as one sees in Eq. (3) of their supplementary note 2, whose basis is unclear. More crucially, they assumed the imaginary part of self-energy in the forms Eq.
with constant fitting parameters W 3 , W 4 , E 3 and E 4 , for the anomalous part, which is unjustified. Particularly, the assumed form Eq. (A1) does not allow the formation of peak or dip and it does not allow the cancellation with the structure in the normal contribution in the spectral function as we discovered. It is crucially important to allow the flexibility of the self-energy form and the machine learning is one of the best way to incorporate it while the attempt by Li, et al. failed in implementing the flexibility. We have attempted to fit the self-energy with the constraint of Eq. (A1) for the anomalous part and found the resultant optimized χ 2 is χ 2 MLF = 6.1 × 10 −6 , which is much higher than the present result χ 2 ML = 2.1 × 10 −6 . Because the experimental resolution is χ 2 exp = 1.4 × 10 −6 as mentioned above, the intrinsic χ 2 defined by δχ 2 MLF = χ 2 MLF − χ 2 exp is 4.7 × 10 −6 . Then the intrinsic machine learning error in the unit of the experimental standard deviation is δχ MLF /χ exp = 1.8, which is close to the twice of the standard deviation. Namely, the probability that this constrained choice is true is less than 7%. The resultant self-energy does not show any appreciable peak as it should be in contrast to the present result as one sees in Fig. 13. Note that, despite the constrained anomalous part, due to the unbiased choice of the normal self-energy here, the self-energies in Fig. 13 should be much better fit than the more constrained ones (including the normal part) in Li, et al [31]. The spectral function obtained from the additional constraint in the normal part, Eq. (5) or (6) in Supplementary Information of Ref. 31 must give even higher χ 2 than χ 2 MLF . In addition, the form of the imaginary part of the anomalous self-energy is unphysical. The form assumes the attractive interaction ranging to the infinite frequency scale (namely, instantaneous attractive interaction), which can not exist in the real experiments.
In summary, constraints or assumptions unjustified a priori such as momentum independent self-energy, strictly constrained but unjustifiable self-energy form, and Eliashberg equation in the previous works do not allow prominent peak structures and at the same time give higher errors in the regression of the experimental data than our result indicating the superiority of our analysis. It supports that the cancellation of the normal and anomalous contribution in the total self-energy found in our analysis and not in the former studies should be considered seriously.

Appendix B: Pair Breaking Effect
The renormalization factor (quasiparticle residue) estimated from the expression theoretically equivalent to z qp defined in Eq. (9) is the weight of the quasiparticle, which can be substantially reduced from the noninteracting value z(k) = 1 due to the interaction effects. The renormalization factor estimated from the fitting of Eq. (B1) is z qp ∼ 0.1 for Bi2212 and z qp ∼ 0.03 for Bi2201 (see Fig. 14) supporting the non-Fermi liquid behavior especially in the underdoped case.
As shown in Figs. 14(c) and (d), the non-Fermiliquid-like ImΣ nor (ω) affects the gap function ∆(ω) (in Eq. (6) through Q(ω) (see Eq. 7). In general, negative Im∆(ω) for ω < 0 enhances Re∆(ω = 0) through the Kramers-Kronig relation and is indeed negative in most of ω in Fig. 14(c) and (d). However, Im∆(ω) is positive at |ω| < 0.04 eV (|ω| < 0.06 eV) for Bi2212 (Bi2201). Because ImΣ ano is found to be always negative for ω < 0, it is ascribed to the pair breaking effect of Q, arising from poles of Σ nor inside the superconducting gap as already pointed out [47]. The pair breaking is much more prominent for underdoped sample, Bi2201. Although a similar conclusion for the underdoped Bi2201 suggests a universal nature, the prominent non-Fermi liquid behavior and the pair breaking could be accounted for by an alternative at k = k AN , namely the pole of Σ nor shifts to the energy ω ∼ 0 and destroys Σ ano accompanied by an insulating gap. Although such a solution gives worse χ 2 in our analysis, a momentum selective insulating behavior at the antinodal point deserves to be explored further together with the full momentum and temperature dependences. The gap function ∆(ω) defined in Eq. (6) can show significant δ dependence near the small δ limit around ω ∼ 0. The δ dependence originates from the finite imaginary part of the normal self-energy ImΣ nor (k, ω = 0) inevitable in the experimental data. When we modify Q as we obtain stable behaviors of ∆(ω) for |ω| > 10 meV by keeping δ = 10 meV and restricting to δ < δ. In Fig. 14, we use δ = 2.5 meV.  6) and (7). It reveals that the peak positions in ∆(k AN , ω) are different from those in ImΣ ano , which is consistent with the hidden fermion theory [47]. In fact, the peak positions of Im∆(k AN , ω) (∼ ±80 and ±220 meV for Bi2212 and ∼ ±80 and ±210 meV for Bi2201) are nearly the same as the peak positions of ReΣ ano , while the peak positions of Re∆(k AN , ω) (∼ ±50 and ±180 meV for Bi2212 and ∼ ±50 and ±160 meV for Bi2201) are nearly the same as the peak positions of ImΣ ano . This is because the imaginary part of Q is dominant in the relevant frequency region (∼ 100 − 200 meV) as shown in Fig. 14. The shift of the peak positions indicates the strong renormalization effect in the normal quasiparticle contained in Q. In any case, in the contribution to the real order parameter of the superconductivity ∆ AN = Re∆(k AN , ω = 0) is expected to be contributed mostly from the two peaks in Im∆(k AN , ω) through the Kramers-Kronig relation. The d-wave gap amplitude ∆ AN is 30 meV for the optimally doped Bi2212 while it is around 10 meV. However, at small energy (∼ 60 meV), the gap amplitude is both around 40 meV, which is comparable to the peak energy of ImΣ ano and ImΣ nor suggesting the similar pseudogap energy for optimum and underdoped samples.

Appendix D: Accuracy, Stability and Robustness Tested by Solvable Benchmarks
In this section, we employ exactly solvable models as benchmarks. We examine whether our machine learning correctly reproduces the exact self-energies (with prominent peak structures), if the exact solution indeed shows the cancellation of the normal and anomalous self-energy contributions in the total self-energy and the spectral function A(ω) shows only a weak peak-dip-hump structure. To our knowledge, exact solution, which shows such a cancellation is not found except for the case of the two-component fermion model. Then, as a benchmark, we inferred the self-energy of a superconducting two-component fermion model defined by the following Lagrangian, , (D1) which is essentially the same form as that introduced in Ref. 47 and discussed in Ref. 48. In the following discussion, we assume that the noninteracting density of states determined from c (k) is a constant N F and focus on a specific momentum k at the Fermi momentum just for simplicity. Because of the momentum independence, this consideration at a specified momentum does not cause loss of generality. Here, we add Σ (0) (ω) at the above momentum defined by in addition to c (k) to mimic the additional normal Fermi-liquid-like component seen in the experimental result arising from interaction effect for the part not represented by the coupling to the d fermion, where a, b and α are constants. The self-energy in the exact solution of this two-fermion model is given as, For simplicity, we have dropped the momentum dependence in the solutions (D3) and (D4). In our calculation, we set a = 0.008 eV 2 , b = 0.2 eV 2 , α = 0.08 eV 3 , V 1 = 0.075 eV, D 1 = 0.0375 eV, and d = c = 0. The present choice of the parameters is enough to generate the spectral function observed at the Fermi momentum we focus, and the dependence on the doping and dimension of the system etc. are implicitly contained in N F . By using the exact solution for the spectral function A 2f (ω), we add small but finite noise, where σ 2 of the noise is set to be 6 × 10 −4 to simulate the role of noise in experiments and perform the machine learning using this noisy A 2f (ω). Although it is irrelevant to the inferred self-energies, the Fermi-Dirac distribution with T = 40 K is introduced in the spectral function used in the machine learning just by following the scheme with finitetemperature experimental data. In Figs. 15(a)-(c), the spectral function of the two-component fermion model A 2f (ω) and the self-energies are shown for the exact solutions (solid and dashed curves) and the machine learning results (symbols). In the self-energy inference, we choose n 0 = 0.45 without fine tuning and use the spectrum within −0.55 eV < ω < 0.05 eV. The peak position in ImΣ nor and ImΣ ano and the Fermi liquid-like normal contribution in the exact solution (shown with the index 2f such as A 2f (ω) illustrated by solid and dashed curves) are well reproduced by our machine learning results (symbols). The peak cancellation in Σ tot 2f in the exact solution is also well reproduced.
In Figs. 15(d)-(f), we show an artificial case, where the pole of Σ ano is shifted 0.03 eV from the solution (D4), where the pole cancellation in the total self-energy does not occur any more and the spectral function shows weird two peaks. The concrete representation of the selfenergies of the modified two-component model is given by the following form, where the pole of Σ ano is shifted from that of Σ ano because of ∆D 1 = 0.03 eV and a factor r = 2 is introduced to avoid singular spectrum. The spectrum and self-energies of the modified two-component model are denoted with the index m as A m (ω). Even in this case, the machine learning results well reproduce all the line shapes. This indicates that our machine learning flexibly and accurately reproduces the exact solution irrespective of the presence or absence of the peak cancellation.
The parameters in the Boltzmann machine, α nor = (b, {W m }) and α ano = (w λ , {b λ }, {V λ m }), are optimized by using the standard gradient method. The parameters at the k + 1th step, α nor k+1 and α ano k+1 , are updated from the k-th values as where and · · · 1 represents L 1 norm. The factors S −1 g nor k 1 −1/2 and ( g ano k 1 ) −1/2 are introduced to accelerate the optimization. Here, we use the natural gradient method to optimize the variational parameters in ImΣ nor (ω j ) because of its efficiency, [20,67,68] while the simple steepest descent method is employed to optimize the part of ImΣ ano (ω j ) because the natural gradient method assumes that the optimized distribution is positive or negative definite, while ImΣ ano (ω j ) does not satisfy this condition. During the optimization of the Boltzmann machine, we may introduce a regularization term by L 1 norm of the mixture of the Boltzmann machines as λ w λ |w λ |. While λ w = 10 −3 will accelerate the optimization, the results of the optimization is confirmed to be insensitive if λ w ≤ 10 −3 . In the actual fitting, we employed λ w = 10 −3 .

Parameters in optimization
In the present paper, first, we optimize the Boltzmann machine with L = 8 visible nodes and 2L = 16 hidden nodes for the part ImΣ nor and, then, we enhance the resolution with L = 9 visible nodes and 18 hidden nodes to obtain better resolution with reasonable numerical cost. In the optimization with L = 9, we skip the outer loop (the update of the center of mass by the Bayesian process) to reduce the computational cost and perform longer minimization steps up to 2 × 10 4 . We employ the broadening factor δ = 10 meV throughout this paper. We show in Appendix E 3 that the result does not sensitively depend on the choice of δ.

self-energy (eV)
FIG. 16. Resolution (δ) dependence of self-energies. Selfenergies are obtained from the machine learning using the ARPES EDC curves with δ=5 meV. The ARPES EDC is taken from the experimental data of Bi2212 at optimum doping at 11K, which are supplied by Kondo et al. [28]. The peak position and their cancellation between W and Σ nor remain essentially the same even for δ smaller than the experimental resolution.

Effects of resolution δ
In the present study, the small imaginary part iδ utilized in the Green functions is chosen to be equal to the experimental resolution. When substantially larger resolution δ is taken, the detailed spectra are trivially not reproducible. On the other hand, when smaller resolution δ is taken, the spectra may be easily fitted. Here, we examine how the smaller δ affects the inferred selfenergy. As a typical example, we take δ=5 meV, which is a half of δ used in the main article, and confirmed that the smaller δ does not change the qualitative structure of the self-energy. As shown in Fig. 16, the peak structures in ImΣ nor and ImW , and the cancellation between them are reproduced.

Gaussian distribution represented by Boltzmann machine
When we choose the parameters as in Eqs (26) and (27), the Boltzmann machine easily represents the Gaussian distribution with the center x λ , variance s 2 λ , and weight w 0λ , which is a localized sparse dis-tribution. Superposition of the Gaussian distribution can easily be expressed by Eq. (22) by taking M larger than 1 (typically we take M several).

Appendix F: Robustness of Machine Learning
The present use of machine learning is categorized to a general class of regression analysis as addressed in the first paragraph of Sec. II A. In the standard simple case of the regression task, training data set is simply given by the observed A at discrete number of x and we infer the functional form of A(x). In the present case, it is more involved and the training data is the experimentally measured discrete and limited number of A and ω, and the regression task is to determine Σ as a continuous function of ω. In terms of the optimization with the machine learning, our task is to minimize the difference between the measured data A and that obtained from the inferred Σ(ω), which is a continuous function of ω. Therefore, our work is categorized to the machine learning application to a regression task, one of the most widely applied machine learning fields. Our regression scheme is illustrated in Figure 5.
In the regression analysis, it is helpful to examine the reliability of the machine learning by using solvable cases as the benchmark, as in other type of the regression task found in the problem of solving quantum many-body problems and classical statistical physics problems [10] It is also important to test the stability of the procedure by adding noises. In this section we show the robustness against the noise and in D, we show several benchmark tests for solvable models.

Stability against noise
We examine stability of the present machine-learning scheme. By using A fit and σ 2 n introduced in Sec. III E, we can generate synthetic experimental data with the same or larger amplitude of noise than the original data. Here, we use the synthetic data to examine the input data dependence of the present scheme.
Here, χ 2 of the optimized A(k, ω) by the machine learning from the synthetic ARPES spectrum A syn generated by the maximally-likelihood inference of the ARPES spectrum is given by s )) 2 /N d N r = 2.1 × 10 −6 (defined in Eq. (28)), which is the same level as the experimental χ 2 , namely (27). The same level of χ 2 value indicates that the machine learning optimization to fit the experimental A(k, ω) is successfully achieved within the limit of the level of the experimental noise. The standard deviation of the experimental uncertainty is around χ exp = 1. Imaginary part of normal self-energy ImΣ nor (kAN, ω), ImW (kAN, ω) and ImΣ tot (kAN, ω) deduced by the present machine learning from A(k, ω). The error bars are those for the dip energy (horizontal bar) and the dip depth (vertical bar) derived in the procedure mentioned above.
as the experimental interpretation, we used a standard index (for noise, variance and bias decomposition, see Ref. 21) expressed as δχ ML = χ 2 ML − χ 2 exp = 0.8×10 −3 . This is nothing but the pure generalization error/test error derived after subtracting the experimental noise. Here δχ ML /χ exp = 0.7 is the intrinsic machine learning error in the unit of the experimental standard deviation. This is well within the experimental error bar. We show in A that other example of optimization without peak structure shows much larger standard error. If we assume that the inferred A(k, ω) follows the probability distribution P = exp[−χ 2 ML /2χ 2 exp ] given from the maximum likelihood inference (see Ref. 21), one can estimate the corresponding variance of the inference for the selfenergy by sampling the variation of the peak structure. The variance is plotted in Fig. 17(b) for the peak part of ImW . This indicates that the variance for the peak position and the weight is small and the existence of the peak is robust.
To further examine the reliability of the emergence of the peak, in Fig. 17(a), we first show the estimated A fit and amplified noiseσ n = 4σ n for the optimum doped Bi2212. Withσ n we generate many synthetic experimental samples. The reason why we takeσ n instead of σ n is to secure the stability of the peak structure in the presence of the experimental noise with the safety factor 4. Then we perform the machine learning and extract the self-energy from the synthetic A(k, ω), which provides us with the error bars of the estimated self-energies in our machine learning. As shown in Fig. 2(c), the variance of ImΣ nor , and ImΣ ano thus obtained from the synthetic data is reasonably small with the peak structure in the imaginary part of the self-energy, which indicates that our solution of the inverse problem is numerically stable. Note that the error bars are somewhat overestimated here (namely, larger error bars than those of Fig.2(c)) because of the factor 4 above, but still the peak structure is reasonably retained. However, further increase of the noise to several times ofσ n smears out the peak structure, implying that very accurate experimental data in the present ARPES quality are required to reveal the peak structure.
The stability in the present inverse problem shown here clearly indicates the difference from notorious illconditioned problems such as the analytic continuation from the imaginary time (Matsubara frequency) variable to the real frequency typically studied by the maximum entropy method. In contrast to the non-sparse and involved nature of the analytic continuation from the Matsubara frequency, the present stability numerically shown here is consequences of the sparse structure of the transformation between the spectral function and the self-energy. The mapping between A and Σ is, though strongly nonlinear, diagonal in the variable ω and transformation is sparse confined in a limited frequency range. Imposed physical requirement further ensures the stability. The machine learning including the Boltzmann machine in known to be powerful to strongly nonlinear transformation [7,9,14,69,70] such as Eqs. (1) and (2).
Of course if the noise is too high, the peak structure is smeared out. Our synthetic data analysis tells that 64 times higher noise level than the estimated experimental noise washes out the peak structure (not shown) and the factor 10 to the present experimental level would be the limit for the meaningful quantitative analysis.

Dependence on initial guess
The stability of the optimal solution is examined in the previous subsection. While a single local minimum of the cost function is analyzed above, in the present regression scheme, a multi-valley structure of the cost function in the parameter space may appear due to the non-linear nature of the regression. To explore the nature of the possible multi-minima, we perform the outer loop optimization, which is illustrated in Figs. 1 and 3. In the outer loop optimization, we update the initial condition of the imaginary part of the anomalous self-energy to prepare for the next iteration of the inner loop optimization. During the practical optimization, the initial guess for the (R)BM parameters at the first stage of the optimization shown in Fig. 3 will affect the optimized self-energies.
To illustrate the initial guess dependence, here, we examine typical solutions for Bi2212 at 12K obtained from the randomly chosen initial guesses for ImΣ ano . As explained in Sec. III A, we initialize ImΣ ano as a linear combination of the Gaussian distributions. We randomly chose the center of mass, height, and width of these Gaussian distributions. From the physical constraint, we choose the center of mass within |ω| < 0.3 eV. In Fig. 18, the typical examples of the solutions are shown.
There are two kinds of the solutions: Superconducting and pseudogap solutions are found. When an initial guess for ImΣ ano generates a large enough superconducting gap, superconducting solutions are obtained. The superconducting solutions show a minimum of the amplitude of ImΣ nor around ω = 0, while the amplitude of ReΣ ano around ω = 0 is substantial enough to generate a quasiparticle gap in the spectral function. In contrast, if an initial guess for ImΣ ano cannot generate a large enough superconducting gap, the amplitude of the normal component ImΣ nor shows a (negative) peak around ω = 0 to generate a gap in the spectral function. We call such a solution the pseudogap solution. A typical example of the pseudogap solution is shown in Fig. 18(a), which shows a three order of magnitude larger cost function as illustrated in Fig. 18(d). Two superconducting solutions with cost functions larger than the minimum value χ 2 ML = 1.6 × 10 −6 are shown in Figs. 18(b) and (c). Similarly to the optimal solution shown in Fig. 8(a), the superconducting solution shown in Fig. 18(b) exhibits peak structures of ImΣ ano . However, due to the shift in the peak position, the solution gives an one order of magnitude larger cost function. A featureless ImΣ ano also generates a superconducting solution with a large cost function as shown in Figs. 18(c) and (d).
Here we note that the larger cost functions originate from systematic deviation of the regression model A(ω) from the experimental data A exp (ω). The difference between them, A(ω) − A exp (ω), is shown for the three solutions in Fig. 18(e). The pseudogap solution (Fig. 18(a)) and the superconducting solution with the featureless ImΣ ano (Fig. 18(c)) overestimate the spectral function within the quasiparticle gap: A(ω) is larger than A exp (ω) for ω ∼ 0 eV. In contrast, the solution with a peak of ImΣ ano at a higher energy scale shows stronger superconducting gap, which results in A(ω) < A exp (ω) around ω = 0 eV.
As examplified by the solutions from randomly chosen initial guesses in Fig. 18, the optimal self-energies shown in Fig. 8 indeed give the spectral function closer to the experimental data. Within our many attempts, we found the unique solution that has a cost function value comparable to the estimated experimental error, as shown in Fig. 8. (e) The differences between the regression models A(ω) and the experimental data A exp (ω) are shown. For the regression model with the self-energies (a), the difference, A(ω) − A exp (ω), multiplied by a factor 0.1 is shown because the difference is larger than those with the self-energies shown in (b) and (c). The black solid curve shows the difference, A(ω) − A exp (ω), for the optimal solution with χ 2 ML = 2.1 × 10 −6 eV −2 (shown in Fig. 8(a)).

Stability against energy cut-off and background
The present machine-learning scheme is based on the imaginary parts of the self-energy within a finite frequency range −Λ < ω < Λ, where Λ 0.4 eV, because the experimental data observed within −0.4 ω 0.2. Therefore, in the genuine self-energy, there is a possible unknown contribution from the outside of the cutoff energy Λ. However, as explained below, such a contribution is a monotonic and bounded function of ω, and, thus, possible errors due to the lack of information can be estimated.
Due to the Kramers-Kronig relation, the real part of the self-energy can be affected by the cutoff energy Λ. Because the imaginary part of the normal self-energy is expected to extend over the cutoff energy, the real part of the normal self-energy has a monotonic and bounded contribution from the outside of the cutoff energy. On the other hand, because the anomalous self-energy is finite only within the cutoff energy scale, the real part of the anomalous self-energy can be affected by the cutoff only through the normal self-energy.
In the main text, we ignored the contribution of the high-energy part of normal self-energy. To critically examine the possible contribution from the outside of the cutoff energy, here, we assume a possible distribution of the imaginary part of the normal self-energy outside the cutoff: The imaginary part of the normal selfenergy outside the cutoff is assumed to be confined within Ω − W /2 ω Ω + W /2 centered at Ω , where |Ω | > Λ W and the amplitude of the imaginary part is approximately constant within this energy range. Then, contribution to the real part is given by which is monotonic for −Λ < ω < Λ. When |Ω | > W is assumed, the contribution ∆ReΣ nor (ω = 0) and its derivative ∂∆ReΣ nor (ω)/∂ω| ω=0 are approximately If we consider formation of a lower Hubbard band, for instance, we may assume that |Ω | ∼ |ImΣ nor (Ω )| ∼ O(1) eV and W Λ. Then, the contribution from the outside of the cutoff is bounded as | − ImΣ nor (Ω )W /(πΩ )| Λ/π and ImΣ nor (Ω )W /(πΩ 2 ) < 1/(2π). As we show in Fig. 19(a), the qualitative feature of the peak structure at ω < −0.4 eV does not change even when we add an artificial high-energy part for the normal self-energy, where we compare the self-energy inferred from the experimental ARPES data with the self-energy inferred with an additional high-energy part Σ H . Here, we added the following fixed high-energy part Σ H in the optimization process, where s = 0.15eV 3 , a = 0.3eV 2 , b = 0.025eV 2 , and c = 0.01eV 2 , whose imaginary part becomes substantial The self-energies ImΣ nor (k, ω) and ImΣ ano (k, ω) obtained by the machine learning of the procedure in (c). Inset: ImΣ nor (k, ω) and ImW (k, ω) together with ImΣ tot (k, ω), showing the robust cancellation of ImΣ nor (k, ω) and ImW (k, ω) in the peak (dip). The blue dashed line ω/π has a similar slope with ImΣ tot (k, ω), implying a universal origin of this marginal Fermi-liquid behavior.
for ω > −0.4 eV as shown in the inset of Fig. 19(a). Then we fit the experimental ARPES data within the experimentally measured energy range using this self-energy form with the added high-energy tail. Indeed, in the solution, the artificial high-energy part does not affect the prominent peak structures at all in ImΣ nor and ImΣ ano . When the energy range covered by the measured A(k, ω) becomes wider after excluding the extrinsic background, the uncertainty becomes of course further reduced.
To examine effects of high-energy structure of ImΣ nor further, especially effects of a possible peak structure re-sponsible for the waterfall structure around -0.5eV [72][73][74][75][76][77], we conducted the following analysis by adding Σ H that has a peak around -0.5eV to the Boltzmann-machine representation of ImΣ nor . Here, we assume a Lorentzian peak around ω = -0.5eV as Σ H instead of monotonic ω dependence at high energies and find optimized Σ + Σ H to fit the experimentally observed ARPES data of the optimally doped Bi2212. As shown in Fig. 19(b), the presence of the peak structure around -0.5eV does not essentially change the structures of the low-energy self-energy in the region ω > -0.2eV. Here, we note that larger am-plitude and/or wider width of the Lorentzian cannot fit the experimental spectral function, and we also note that the anomalous cusp in ImΣ nor near the peak structure around -0.5eV is inevitable to minimize the cost function if one keeps the Lorentzian peak structure. Aside from details, thus, the potential peak structure does not alter the peak structure of our interest. As examined in theoretical and numerical studies (for example, Ref. 78), we also note that the realistic waterfall structure can be derived from broader structures in ImΣ nor than Σ H in Fig. 19(b). In this case the effect of that broad structure becomes smaller than the present critical test.
As an alternative way, one can also examine whether the peak structure is insensitive to the possible slowly varying extrinsic energy dependence or not, by studying the effect of the possible background onto the spectral function. The origin of the background in the experimental spectral function may be the electronic incoherent part or the extrinsic experimental setup extended in the low-and high-energy regions. The background is expected to have a broad (slowly ω dependent) structure.
Such a background has been studied before [71] and we mimic such background structure as a possible effect to see the sensitivity to the peak (dip) structure around ω ∼ −0.07 eV. In Fig. 19(c), we show the optimized solution when a model background b(ω) shown in the inset is subtracted from the spectral function by hand. The high energy offset in the experimental A(k, ω) does not appreciably depend on temperature and momentum, implying such an extrinsic origin. The way of optimization to minimize χ 2 in Eq. (26) is the same as before except that we fit A exp (k, ω)−b(ω) with a ω independent rescaling of the amplitude to satisfy the optimal n 0 = 0.4, instead of A exp (k, ω). The result shows again that such a broad structure does not affect the prominent peak structure and the peak cancellation between the normal and anomalous self-energies is retained. While the cancellation continues to be retained even after the subtraction of the background, the amplitude of the peak structures becomes smaller after the subtraction of the background, as shown in 19(d). Although the reduction of the peak intensity is obtained as a consequence of the regression, it has a simple physical interpretation originating from the renormalization effect: The inferred renormalization factor is affected by the amplitude of the imaginary part of the normal self-energy for ω < -0.1 eV. After the background subtraction, the amplitude of the imaginary part becomes smaller and, thus, the renormalization factor becomes larger. While the superconducting gap amplitude estimated in the quasiparticle peak does not depend on the background, the normalization factor depends. To reproduce the gap amplitude, the smaller amplitude of ImΣ ano is required when the renormalization becomes larger. Therefore, the peak structures of ImΣ nor , which have to cancel the peaks in ImΣ ano to reproduce the spectral function become smaller after the subtraction of the background. By the subtraction of the hypothesized background which is essentially constant for ω < −0.1 eV, we see that the high-energy imaginary part of the normal self-energy has a marginal fermi liquid feature ImΣ nor (k, ω) ∝ ω consistent with the coefficient of the T -linear resistivity in experiments.
Here we remark that the realistic background estimated from the ARPES data has a similar form to b(ω), but with much smaller amplitude. As a standard method to estimate the background, we utilize the EDC curve at the momentum far outside the Fermi surface provided by Kondo [79]. The estimated background at the antinodal point can be well fitted by a modified sigmoid function, where a = 2.24 eV −1 , b = −0.02 eV, c = 0.014 eV, and the weight of the background w is determined to constitute 20% of the original spectrum for -0.4 eV < ω < 0.1 eV, which is nearly a quarter of b(ω) in amplitude employed in Fig. 19(c). Therefore, the analysis in Fig. 19(c) is regarded as the effect of unrealistically exaggerated background as a extreme case. In the realistic background, the modification of the self-energy peak by the background must be much smaller.

self-energy (eV)
FIG. 20. Self-energy learning based on two-band Green function for Bi2212. The total spectral function of the bilayer model A(ω) (blue crosses) is shown in (a), which reproduces the experimental data (red squares). Here, the spectral function is rescaled to fulfill n0 = 0.3, where n0 is the integrated spectral function for -0.4 eV < ω < 0.1 eV. See the text at the end of Appendix F for the definition of the integrated spectral function. We choose that the bilayer splitting AB − B at the antinodal region is equal to 0.19 eV by following the tightbinding Hamiltonian for an underdoped Bi2212 with Tc = 78 K reported by Drozdov et al. [80]. The matrix element of the antibonding band ρAB is equal to 0.25 by following Kordyuk et al. [81], where ρB = 1−ρAB. The optimized Boltzmann machine self-energies shown in (b) reproduce the spectral function A(ω) (blue crosses) in (a). The cancellation between ImΣ nor and ImW is illustrated in (b). The contribution from the bonding band A B (ω) (magenta curve) and antibonding band A AB (ω) (cyan curve) is shown in (a).
ARPES data are basically not available in the unoccupied part and therefore the inferred behavior in the positive energy side of the spectral function has relatively larger uncertainty. In fact, electron-hole asymmetry was suggested in the scanning tunneling microscope data [82]. If any kind of data in the positive energy side can be analyzed together, the asymmetry can be analyzed more quantitatively. However, this issue is beyond the scope of the present work. Nevertheless, the asymmetry expected in the optimally doped Bi2212 (our main target of the analysis) is modest or weak if it exists as one sees from the STM data and the level of quantitative uncertainty in our analysis does not essentially alter the peak structure in the negative energy side. Technically this is analogous to the case of the uncertainty at the tail part of negative energy side beyond the accessibility by ARPES or the effect of the background discussed above. In fact, consideration of the background assumed only in the negative energy side employed here necessarily introduces electron-hole asymmetry in the measurement and the insensitivity to the background is interpreted for the asymmetry as well.

Bilayer nature
In the bilayer cuprate, the bare band structure consists of the bonding and antibonding band. However, we have analyzed the ARPES data by using the single-band description of the Green function, which describes the bonding band, because it crosses the Fermi energy. We here examine the effects from the antibonding band, which is located above the bonding band in the ω axis, and show that the contribution from the antibonding band does not change the qualitative results. Below, we explicitly take into account the two-band nature of Bi2212 and demonstrate that the results from the multiband treatment are consistent with those from the singleband treatment.
Due to the symmetry of the Bi2212 crystal structure, the bare band structure is diagonalized by using the bonding and antibonding Wannier orbitals that are given by the bonding and antibonding combination of the d x 2 −y 2 orbitals in two adjacent CuO 2 planes. Even when the self-energies are taken into account, the Green functions for the bilayer cuprates are diagonal if the interlayer components of the normal and anomalous self-energies are negligible.
When we neglect the interlayer components of the normal and anomalous self-energies, the self-energies in the two adjacent CuO 2 planes are identical. Therefore, we only introduce the self-energies, Σ nor (k, ω) and Σ ano (k, ω), for both of the adjacent CuO 2 planes. The Green functions for the bonding and antibonding band are obtained as follows, respectively: Here, we note that the self-energies are identical even after diagonalization to obtain the bonding and antibonding band, irrespective of the amplitude of the interlayer hopping constituting the bare band.
Then, the spectral function, A(ω), is decomposed into the contribution from the bonding orbitals, A B (ω), and antibonding orbitals, A AB (ω), as follows: Here, the orbital dependent matrix elements ρ B (= 1 − ρ AB ) and ρ AB are taken into account. Then, we extract the self-energies by using the two band decomposition of A(ω) = ρ B A B (ω) + ρ AB A AB (ω). As same as in the single band picture, we train the selfenergies by minimizing the training error between the theoretical spectral function A(ω) and experimental spectral function A exp (ω). By following Kordyuk et al. [81], we assume that the integrated weight of the contribution from the antibonding band is 25% as ρ AB = 0.25. In Fig. 20, the optimized spectral function A(ω) and selfenergies are shown.
Even when we take the multiband nature into account, we obtain the self-energies that are essentially same as those obtained in the single band analysis. The can-cellation between ImΣ nor and ImW is also observed in Fig. 20b.

Appendix G: Details of Wavelet Analysis and Improved ImΣ to Reduce Test Error
Here, we discuss why we employ this wavelet formalism. The idea of using the binary representation Eq. (13) is to represent a complex function of ω by successive coarse graining. If a function has ω dependence with various frequency scales, this hierarchical structure can be efficiently picked up by wavelet with different scales and each wavelet is represented by each digit of the binary number σ j (j = 0, 1, . . . , L − 1). For example, the last digit σ L−1 represents the slowest nonzero modulation of frequency dependence (or in other words, short real-time value), namely with the period of the half of our frequency range. σ 0 picks up the most rapid modulation in frequency (or in other words, long real-time value) alternating in the period of the frequency grid mesh ∆ω = ω/2 L etc.. One can have an analogy to Fourier series analysis, where the first digit of the binary number σ 0 corresponds to the largest time component and the last digit σ L−1 corresponds to the shortest nonzero time component in the form of e iωt . It was shown that the wavelet can represent the ω dependence in the orders of magnitude different scales simultaneously and has a flexible representability in the regression problem with small number of parameters because of the logarithmic description (in the present case, C and D with only L arguments, where each of the L components represent logarithmically different scales of frequency dependences) for any discrete data-point set [16][17][18].
The grid mesh ∆ω is chosen to be smaller than or comparable to the experimental energy resolution (∼ 10meV [12,28]) to fully reproduce the experimental A(k, ω) within the resolution of the grid size ∆ω.
If the experimental noise level is comparable to the energy resolution, the step-wise representation is sufficient. However, when the experimental spectral data contains only small noise, the step-wise representation for the imaginary part of the self-energies may introduce a systematic increase in the test errors. To reduce the possible increased error, we introduce a piecewise-linear representation instead of the step-wise representation ImΣ nor/ano . Namely, we interpolate the self-energies between ω = ω I and ω = ω I+1 linearly as for ω I ≤ ω < ω I+1 , where ω I is the midpoint of the I(σ)th interval defined by ω I ≡ Λ(I + 1/2)/2 L − Λ/2. Since the estimated noise is very small for the experimental ARPES spectra of the optimally doped Bi2212 (T c = 90K) at 11K (analyzed in Figures 6(a the increment ∆ImΣ(ω I ), because for δ > ∆ω and thus is satisfied when δ = 10meV and ∆ω = Λ/2 L ∼ 3.2meV. Thus, the piecewise linear representation introduces a negligible correction to the real part.
Appendix I: Optimization of n0 We employed n 0 = 0.3 in the present analyses. This choice is justified by the least square fit. We have examined the optimum choice by the least square fit of χ 2 by taking several choices of n 0 . Here, we note that χ 2 is trivially scaled by the square of the amplitude of A exp (ω), and thus is scaled by the square of n 0 . Therefore, we need to optimize χ 2 /n 2 0 . Fig. 21a shows that χ 2 normalized by n 2 0 for optimally doped Bi2212 at the antinodal momentum has indeed minimum at n 0 = 0.3, which indicates that the machine learning suggests that this choice is the optimized value of n 0 . The obtained self-energies do not sensitively depend on the choice of n 0 as one sees in Figs. 21(b)-(d) and we see no qualitative change in the feature of the pronounced peaks in ImΣ ano and ImΣ nor at the same energy together with their cancellation in ImΣ tot .
Appendix J: Unoccupied states, bilayer nature, and matrix element

Unoccupied states
One might be concerned with the effect of the unoccupied states, especially the effect of the particle-hole asymmetry in the scanning tunneling spectroscopy reported in the literature [83,84]. Our present scheme does not assume the particle-hole symmetry and is capable of the asymmetry if it exists, although the asymmetry at the energies far from the Fermi level may have some uncertainty because of the lack of the data. Aside from the possible origin of the asymmetry [84,85], the cancellation of the peak structures in the self-energies turns out to be robust.
First of all, in the superconducting phase, the lowenergy spectra around the Fermi level are plausibly particle-hole symmetric. Therefore, due to the lack of the information about the unoccupied states, the optimized self-energies are almost symmetric. Then, the asymmetric behaviors could originate from the self-energies away from the Fermi level. However, these high energy asymmetric behaviors hardly affect the cancellation. We have already demonstrated the robustness of the cancellation under the high-energy perturbation in Appendix F. Thus, the essence of the present results is not affected by the asymmetry, which may originate from the high-energy or unoccupied parts of the spectrum. See also Appendix F2.

Matrix elements
One might also be concerned with the effects of the matrix elements. Even when the energy dependent matrix elements exist, the self-energies obtained by using our method is robust as far as the matrix elements are smooth functions of ω. This is because, when the energy dependent matrix elements are treated as a smooth function of ω, the effects of the matrix element can be simply taken into account as a smooth background. As already examined in Appendix F 3, a smooth background does not change the qualitative and essential conclusion of the present paper. Only if the high resolution ARPES measurements on the entire Brillouin zone, especially at the antinodal region, become available from different experimental setups, quantitative examination of the effects of the matrix elements and background will be conducted and these are highly desirable. However, the data are not available yet and clearly beyond the scope of the present paper.
Appendix K: Decomposition of Self-Energy While the normal and superconducting components of the total self-energy, ImΣ nor (k, ω) and ImW (k, ω), show the prominent peak structures, which are absent in the Bardeen-Cooper-Schrieffer (BCS) mean-field theory [86], there are an extended background and a BCS-like superconducting contribution, in addition to the peaks. To highlight the peak part, we decompose ImΣ nor (k, ω) and ImW (k, ω) into the peaks and other components. As proposed in Ref. 29, Σ tot (k, ω) may consist of a single pole that generates a superconducting gap and a smooth normal state component. Then, we decompose ImΣ tot (k, ω) as ImΣ tot (k, ω) = ImΣ N (k, ω) + L BCS (k, ω). (K1) Here, while the BCS-like superconducting contribution is represented by a Lorentzian, where ∆ 0 and Γ are phenomenological parameters that correspond to a BCS-like superconducting gap and life time of quasiparticles, respectively. The background ImΣ N (k, ω) can be represented by a linear combination of many Gaussian distributions, where its large amplitude signals either the incoherence of electrons at that energy or some extrinsic origin arising from the experimental setup or background. Then, the peak contribution canceled in ImΣ tot (k, ω) is, if it exists, obtained from ImΣ nor (k, ω) as ImΣ PEAK (k, ω) = ImΣ nor (k, ω) − ImΣ N (k, ω), (K3) and from ImW (k, ω) as ImW PEAK (k, ω) = ImW (k, ω) − L BCS (k, ω), (K4) where ImΣ PEAK (k, ω) = −ImW PEAK (k, ω) holds.

Appendix L: Momentum Dependence
We have shown the machine learning results in the main text at the antinodal point, because the remarkable structure of the pronounced anomalous self-energy peak, coexisting with the normal self-energy peak is most clearly identified with its dominant contribution to the superconductivity. However, the momentum dependence of the peak structure provides us with useful insight. We here show the momentum dependence of self-energy structure. Fig. 22 shows the imaginary part of the normal and anomalous self-energies together with W for the ARPES measurement angle 1.0 • , 11.1 • , 21.2 • , 31.3 • , and 41.9 • obtained by the machine learning result of Bi2201 at optimum doping at 11.3 K [12]. Note that 0 • is the antinodal and 45 • is the nodal points. Although the peaks in the normal and anomalous self-energies become less significant with approaching the nodal point as is expected, the cancellation of ImΣ nor and W always holds and the prominent peak is missing in ImΣ tot similarly to the case at the antinodal point. This result further corroborates the universal mechanism of the peak cancellation and the dominant contribution to the superconductivity.

Appendix M: Machine Learning Results above Tc
We have shown the machine learning results in the main text for the superconducting phase well below T c to show the remarkable structure of the pronounced anomalous self-energy peak with its dominant contribution to the superconductivity. However, the question how the cancellation of the normal and anomalous self-energy evolves with raising temperatures provides us with further insight on its role in the superconductivity. Fig. 23 shows machine learning results of the normal and anomalous self-energies together with W for the ARPES measurement obtained by the machine learning of Bi2212 at optimum doping at 40, 80 and 120K [12]. The anomalous self-energy peak vanishes above T c as it should be, which further confirms the validity of the present machine learning scheme. ImΣ nor (ω) at 120 K (red curve overlapped with the black curve, ImΣ tot (ω)) shows small signature of pseudogap (small dip around ω = 0). The pseudogap, though not a standard behavior as observed in the underdoped region, is clearly seen in the spectrum (23(c)) if one would perform the electron-hole symmetrization and was analyzed in detail in the original experiments (Figs. 17c and 1d in Ref. 12). In Ref. 12, it was even argued that the (incoherent) electron pairing is formed below 150K. The energy of the dip of ImΣ nor (ω) shifts with raising temperatures and crosses the quasiparticle peak energy when T crosses T c consistently with the results in Ref. 47. Then the pseudogap is interpreted as generated by the peak of the normal self-energy above T c , which is continued from the peak below T c , while it is hidden in the spectral function in the superconducting phase below T c because of the cancellation with the anomalous self-energy. It turns out that the d-wave superconducting gap has an entirely different origin, which emerges as the sharp drop around ω = 0 in ImW , namely as arising from a pole of W , distinct from the pole of the self-energies Σ ano and Σ nor . Although the superconducting gap has a different origin, the main contribution to self-energy (eV) self-energy (eV) self-energy (eV) self-energy (eV) (b) 11.1°( g) 11.1°( c) 21.2°( h) 21.2°( d) 31.3°( i) 31.3°( e) 41.9°( j) 41.9°F IG. 22. Momentum (angle) dependence of self-energies. Self-energies are obtained from the machine learning using the ARPES EDC curves plotted in the upper panel and taken from the experimental data of Bi2201 at optimum doping at 11.3 K at the angle 1.0 • ((a) and (f)), 11.1 • ((b) and (g)), 21.2 • ((c) and (h)), 31.3 • ((d) and (i)), and 41.9 • ((e) and (j)) supplied by Kondo et al. [12]. Although the quasiparticle peak becomes sharper when the nodal point is approached, prominent peaks are found at all angles in imaginary parts of the normal and anomalous self-energies around ±0.07 eV, which are missing in W at all the angles, though the peaks become less pronounced and are almost missing at 41.9 • (nearly nodal point).
the superconducting order is attributed to the peak (ideally pole) of ImΣ ano around ω = −0.07(0.04) eV for Bi 2212 (Bi2201) (see Figs. 2 and 3). This indicates a tight relation of the pseudogap (manifested by the normal selfenergy peak) to the superconductivity (contributed from the peak in ImΣ ano ) through their cancellation. , and (f): Self-energies obtained from the machine learning yielding the spectral functions in the left panels. Although the peak (dip) cancellation between W and Σ nor around -0.07eV still exists at 40K and 80K, the peak and dip of ImΣ ano and ImW essentially vanish above Tc, while the dip of ImΣ nor at ω < 0 below Tc, shifts to the energy around ω = 0, indicating the formation of the pseudogap. Furthermore, the total self-energy (black symbols) below the peak energy (< 0.07 eV) shows a constant slope approximately given by ω/π as drawn as blue dashed lines, supporting marginal Fermi liquid behavior except for the constant value of possible background. (g). Comparison of imaginary parts of normal and anomalous self-energies at 120K. While the peak (dip) structure of ImΣ nor at ω = 0 (as shown in f) introduces the pseudogap even at 120K in the spectrum (e). (h). Gap function at 120K. Even though ∆(ω) is very small but finite at ω = 0, ∆(ω) is more strictly vanishing at ω = 0.