Deep Learning and AdS/QCD

We propose a deep learning method to build an AdS/QCD model from the data of hadron spectra. A major problem of generic AdS/QCD models is that a large ambiguity is allowed for the bulk gravity metric with which QCD observables are holographically calculated. We adopt the experimentally measured spectra of $\rho$ and $a_2$ mesons as training data, and perform a supervised machine learning which determines concretely a bulk metric and a dilaton profile of an AdS/QCD model. Our deep learning (DL) architecture is based on the AdS/DL correspondence (arXiv:1802.08313) where the deep neural network is identified with the emergent bulk spacetime.


I. INTRODUCTION
The AdS/CFT correspondence [2][3][4], or the holographic principle, is a promising way to define quantum gravity. Despite its importance, a crucial open problem is how to find a dual gravity system for a given QFT, for which no systematic approach has been successful so far. Particularly important QFTs are those with Yang-Mills sectors, including QCD, whose large-$N$ and strong-coupling limit is believed to have a classical gravity dual; yet we lack a method to construct that dual concretely and explicitly. So far, we know necessary conditions for the gravity dual, such as symmetries and spectral properties, as well as the recently investigated OTOCs [5][6][7] and computational complexities [8].
The string theory "top-down" construction does not solve the problem, because it merely provides examples in which a gravity theory and a QFT are given as a pair at the same time. Many excellent works have provided pairs in which the QFT side resembles a given target QFT. In particular, a lot of effort has been devoted to seeking a gravity dual of QCD, the most renowned and realistic QFT of all.
It should be emphasized that once a classical gravity system is given, the dual QFT quantities can be easily calculated by the AdS/CFT dictionary. The problem is how to go backwards: given QFT correlators, how do we obtain the gravity system? Solving this kind of inverse problem requires special techniques. In particular, a strongly coupled QFT comes with a vast amount of data, such as $n$-point correlators for infinitely many kinds of local/nonlocal gauge-invariant operators. Furthermore, QCD is a part of the Standard Model, for which a lot of experimental data is available. For this purpose, machine learning methods may help us. If there exists a gravity dual of QCD which is simple enough to have a finite number of parameters, the features of the QCD data need to be efficiently extracted by solving the AdS/CFT backwards. Deep learning [9][10][11], which has advanced technically in recent years, may shed light on this crucial problem of the AdS/CFT.
In Ref. [1], deep learning was applied to determine an emergent gravity metric from given data of a QFT. The dictionary used was that for a one-point function of an operator in the QFT, corresponding to a bulk field value in the dual gravity system. In Ref. [12], the method was applied to lattice QCD data of a chiral condensate, to find a metric of the gravity system dual to QCD. Building on the success of this method for one-point functions, in this paper we take one more step toward realistic QCD. We use hadronic two-point functions, i.e. the hadron spectra which are measured in experiments.
Needless to say, the best-measured quantities in QCD experiments are hadron spectra and hadron couplings. The best framework in which to test the deep learning method is AdS/QCD [13][14][15], a bottom-up construction of phenomenological gravity models based on symmetries and the dictionary. It is known that the simple AdS/QCD framework can accommodate many QCD quantities, including hadron spectra. Here, again, the conventional method in AdS/QCD is first to come up with a gravity model, then to calculate the QCD quantities from the model, and then to compare them with experimental data. If the quantities do not match well, one throws away the gravity model and tries again with a different one. The gravity model has a large arbitrariness, and in addition the dictionary is nonlocal, so solving the inverse problem is challenging. This is the reason why the deep learning method can help in finding the gravity dual of QCD.
The AdS/DL correspondence of Ref. [1] was proposed on the basis of the similarity between a deep neural network (DNN) and a bulk gravity system.$^1$ In the training, the weights of the neural network are determined by the machine, which is regarded as an emergence of the spacetime: the differential equation on the discretized gravity spacetime is regarded as a propagation of information through the deep neural network, with the depth direction identified with the AdS radial direction.
In this paper, we upgrade the deep neural network given in Refs. [1,12] to accommodate hadron two-point functions in a supervised learning, and use the experimental data of the masses of $\rho$ mesons and $a_2$ mesons at zero temperature as the training data. The neural network is a discretized bulk action of the AdS/QCD model [13][14][15], with the metric and the dilaton fields identified as the network weights. With some physically reasonable regularizations, the supervised learning of the neural network is successful, ending up with smooth profiles of the gravity metric and the dilaton configuration in 5 spacetime dimensions as the trained weight parameters of the neural network. Namely, the inverse problem of finding the gravity system for given hadron spectra is solved by deep learning. Using the obtained metric and dilaton, we can predict excited hadron masses which were not used as the training data.
The advantage of the deep learning method is that it provides a systematic approach to determining the gravity dual from a given dataset of QFT correlators, which even generalizes to predictions. Using the data analysis methods of deep learning [21], we may hope to combine all possible information on QCD as training data to discover a proper gravity dual. For that, the various QCD requirements studied in improved holographic QCD [23,24] will help.
This paper is organized as follows. First, in Sec. II we briefly review the concept of the renowned soft-wall AdS/QCD model of Ref. [15]. Based on the model, in Sec. III we construct our deep neural network by discretizing the bulk equation of motion.$^2$ We provide our deep learning architecture with experimental vector meson mass spectra as the supervising dataset, and describe our hyperparameters and regularizations. The deep learning is performed in Sec. IV. After a reproduction test of the background of Ref. [15], we use the experimentally measured data to train our AdS/QCD model and find an optimized, emergent background geometry and dilaton. We discuss the physical implications of our emergent geometry. Sec. V is for our conclusion. App. A includes the details of our deep learning architecture.

II. REVIEW: THE ADS/QCD MODEL
The aim of AdS/QCD [13][14][15] is to provide a simple phenomenological model in a 5-dimensional curved spacetime which effectively describes desired sectors of QCD. The 5-dimensional Lagrangian is built under the guidance of the AdS/CFT dictionary. For example, to compute hadron mass spectra, one introduces fields propagating in a 5-dimensional curved spacetime which are dual to the hadrons of concern. One of the most popular AdS/QCD models is the so-called soft-wall model of Ref. [15] where, rather than using a hard cut-off of the 5-dimensional geometry as in the inaugural works [13,14], one introduces a 5-dimensional dilaton field to realize a smooth wall that confines fields in the curved geometry, which enables discussion of a part of the QCD Regge trajectories.
In this section, we briefly review the model given by Ref. [15] on which our deep learning architecture is built. With the explicitly given 5-dimensional action, the vector meson spectra can be calculated by the model.
Generically, any AdS/QCD model is given in the following manner. First of all, one assumes the existence of an effective theory which is holographically dual to QCD via the AdS/CFT correspondence. Next, based on the well-known dictionary of the AdS/CFT, one writes the 5-dimensional bulk action of the theory, with the ingredients necessary to reproduce the QCD sector of concern. For example, for a flavor symmetry, one introduces the corresponding gauge symmetry in the bulk theory. The associated gauge bosons in 5 dimensions correspond to the flavor current operators in QCD, which are nothing but the vector mesons. The gauge bosons live in a 5-dimensional curved spacetime whose gravity metric and bulk dilaton field are prearranged.
Let us focus on the vector meson spectra in QCD. The gravity side is given by the 5-dimensional effective action (1) of a $U(1)$ gauge theory, where $F_{MN}$ is the field strength of the 5-dimensional massless gauge field $V_M(z, x^\mu)$ and $g_5$ is the gauge coupling constant. The indices $M, N$ run over the 5-dimensional spacetime, while the directions along the 4 dimensions associated with QCD are denoted by $\mu, \nu$. The emergent fifth direction is parameterized by the coordinate $z$ ($\geq 0$). Everything is made dimensionless by using the AdS radius $L$. The theory lives in a curved spacetime. The soft-wall models include the gravity metric $g_{MN}(z)$ and the dilaton field $\Phi(z)$ as background fields. In particular, the dilaton field is essential in the spectral analyses of Ref. [15]. The metric can be written in terms of a single function $A(z)$ without loss of generality, because QCD lives in an infinitely extended flat Lorentzian spacetime at zero temperature. For the AdS/CFT correspondence to apply, the curved spacetime is assumed to be asymptotically AdS: $A(z) \sim -\log z$ near the AdS boundary $z = 0$, in units where $L = 1$. Following Ref. [15], we choose the gauge $V_z = 0$, $\partial_\mu V^\mu = 0$. Using the plane-wave basis for the 4 dimensions, we consider a solution whose $z$-dependence is carried by a coefficient function $v(z)$, with the mass-shell condition $-k^2 = m^2$ for the 4-dimensional mass $m$. Then we obtain the equation (4) which the coefficient function $v(z)$ should satisfy; the metric and the dilaton enter it through a combination denoted by $B(z)$. In the AdS/CFT dictionary, for the current operator of QCD to be excited, the corresponding modes on the gravity side need to be normalizable. The differential equation (4) has normalizable solutions only for discrete values of the mass, $m = m_n$ ($n = 0, 1, 2, \cdots$). Therefore, these discrete values $m_n$ are interpreted as the vector meson spectrum.
FIG. 1. In the ordinary AdS/QCD modeling, one prepares the background metric and dilaton fields ($A(z)$ and $\Phi(z)$) and then calculates the QCD observables (the hadron masses $m_n$). Whether the chosen background is appropriate is judged afterwards, by matching the calculated observables with experimental data. In contrast, our AdS/DL approach solves the problem backward.
In summary, once the explicit forms of the metric function $A(z)$ and the dilaton profile $\Phi(z)$ are given, one can calculate the normalizable solutions of (4) to obtain the vector meson mass spectra. See Fig. 1 for an illustration. In Ref. [15] the following functions, referred to as (6), are used: $A(z) = -\log z$ and $\Phi(z) = z^2$. This ansatz is simple, since $A(z) = -\log z$ means that the whole 5-dimensional spacetime is AdS$_5$, and $\Phi(z) = z^2$ was adopted to produce the asymptotic Regge behavior of the hadron spectra.
One can also calculate the spectra of higher-spin mesons from the 5-dimensional model. A spin-$S$ meson is dual to a rank-$S$ symmetric tensor field [25]. By procedures similar to those of the vector meson case, the $z$-dependent part of the bulk tensor field obeys the same equation as (4), with a spin-dependent definition (7) of $B(z)$. One thus sees that the spin dependence of the meson spectra is encoded in the background bulk field profiles.
Now one can see that the difficulty of AdS/QCD model building lies in inverting the duality: the standard procedure goes only from gravity to QCD quantities, and requires an explicit metric and dilaton which are not known a priori.$^3$ In contrast, this paper aims at "optimizing" the model by experimental data. More concretely, we determine the model background $B(z)$ by deep learning, with the hadron masses as the training data. See Fig. 1. In addition, using the spin dependence of the definition of $B(z)$ in (7), we can obtain the spacetime metric and the dilaton separately: the machine finds different $B(z)$'s from the data for the mesons with $S = 1$ and $S = 2$, and the metric and the dilaton are then obtained by combining the two. Using these, one can calculate other QCD observables as predictions of the determined model. Our proposal offers a new data-driven approach to AdS/QCD model building.
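As a worked illustration of this separation, suppose that the spin-$S$ combination has the form $B_S(z) = \Phi(z) - (2S-1)A(z)$. This form is an assumption made here for illustration: it is the standard soft-wall combination and it agrees with the near-boundary behavior $B \simeq (2S-1)\log z$ quoted in App. A 1. The two learned profiles can then be inverted by elementary linear algebra:

```latex
\begin{align}
B_{S=1}(z) &= \Phi(z) - A(z), &
B_{S=2}(z) &= \Phi(z) - 3A(z), \\
A(z) &= \tfrac{1}{2}\bigl[B_{S=1}(z) - B_{S=2}(z)\bigr], &
\Phi(z) &= \tfrac{1}{2}\bigl[3B_{S=1}(z) - B_{S=2}(z)\bigr].
\end{align}
```

Since these relations are linear, under the same assumption they hold equally for the $z$-derivatives, so they could be applied layer by layer to the trained weights before integrating.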

III. NEURAL NETWORK FOR ADS/QCD
In this section we construct our neural network. According to the concept of the AdS/DL correspondence [1], we present a deep neural network architecture optimizing a generic soft-wall AdS/QCD model. The optimization is conducted under a supervised deep learning, with respect to the QCD experimental data, where the meson spectrum m in (4) is treated as the input data.
The implementation can be divided into three steps. First, we translate the equation of motion (EOM) of the bulk field into a deep neural network. The network itself is regarded as a spacetime, and the metric on the discretized spacetime corresponds to trainable weight parameters in the neural network. The next step is to arrange the input and the output layers for a binary classification. The input data is the vector meson spectrum data, and the output data is related to the normalizability condition of the solution of the bulk field equation. The remaining step is technical and needed for the training to be successful: fixing hyperparameters and introducing a regularization. We have to adjust the values of hyperparameters such as the discretization spacing and the number of layers. We also need a regularization which comes from physical requirements such as smoothness of the spacetime. In the following subsections we explain these three steps in turn.
FIG. 2. Neural network representation of (12). The layer depth direction (horizontal direction in this figure) corresponds to the emergent radial direction of the 5-dimensional spacetime. The input is on the left, the output is on the right. Only the thick lines are trainable weights, corresponding to the bulk background field $B'(z)$. The binary classification layer at the output is not shown in this figure, for simplicity.

A. Deep neural network
We prepare a deep neural network where trainable weight parameters are identified with the configuration of the gravity/dilaton fields of the holographic model. We identify the radial propagation (4) of the bulk vector field from z to z + ∆z with the propagation of the information on the neural network from one layer to another.
We introduce a conjugate momentum field $\pi$ to reduce (4) to a set of first-order differential equations, (10) and (11). Then we discretize the $z$ coordinate in (10) and (11) with a spacing $\Delta z$, which gives the propagation equation (12). The derivatives acting on the fields $\pi$ and $v$ are replaced by finite differences, and we leave $B'(z) \equiv \partial B/\partial z$ as it is. With (12), let us make a neural network representation of the bulk field equation. The propagation equation (12) tells us that the values of $(\pi, v)$ at $z = N\Delta z$ with an integer $N$ can be computed from the initial values of $(\pi, v)$ at $z = \Delta z$ and from $m$, once a background $B'(z)$ is given. Namely, the output of the system is determined by the input and $B'(z)$, as expressed in (13). Recall that in a generic feedforward deep neural network (DNN) the unit values at the final layer are calculated from the unit values at the initial layer with given trainable weights. The concept of the AdS/DL correspondence is based on this similarity [1]. Following that, we find the dictionary between the bulk field equation and the neural network shown in TABLE I. Here $m$ is $z$-independent in (11), so it corresponds to a unit of fixed value.
Equation (13) is nothing but the deep neural network itself, with the layer depth direction identified as the emergent radial direction $z$. See FIG. 2. The AdS boundary $z = 0$ is identified with the input layer, while the deep infrared region of the emergent space, $z = \infty$, is the output layer. Generically, neural network weights are matrix-valued trainable parameters, while our weights include only $B'(z)$ as trainable parameters. This means that our neural network is sparse, and most of the weight components are set to 0 or to some fixed value.
As we mentioned, the 5-dimensional spacetime is asymptotically AdS$_5$ near $z = 0$, so we can determine the values of $(\pi, v)$ at the initial layer $z = \Delta z$ based on their behavior in AdS. Using the asymptotic solution of (4) with the AdS metric (and also assuming that the dilaton vanishes there), we find the initial values (14), $v(\Delta z) = (\Delta z)^{2S}$ and $\pi(\Delta z) = 2S(\Delta z)^{2S-1}$. See App. A 1 for the derivation. The overall magnitude of $v$ and $\pi$ is ignored since the EOM (4) is linear in $v(z)$. See FIG. 2 for a schematic view of our whole neural network.
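The forward pass described in this subsection can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that (12) is the forward-Euler discretization of $\partial_z v = \pi$, $\partial_z \pi = B'(z)\pi - m^2 v$, a form chosen because it reproduces the near-boundary exponents $\alpha = 0, 2S$ of App. A 1. The function name `forward` and its arguments are hypothetical.

```python
import numpy as np

def forward(b_prime, m2, dz, spin=1):
    """Propagate (v, pi) layer by layer, starting from z = dz.

    b_prime[k] is the weight B'(z_k) at z_k = (k + 1) * dz, so that
    len(b_prime) Euler steps end at z = (len(b_prime) + 1) * dz.
    The update rule is an assumed forward-Euler form of (12).
    """
    # Initial layer: asymptotic AdS behavior, v = z^(2S), pi = 2S z^(2S-1)
    v = dz ** (2 * spin)
    pi = 2 * spin * dz ** (2 * spin - 1)
    for bp in b_prime:
        # Both right-hand sides use the values of the previous layer
        v, pi = v + dz * pi, pi + dz * (bp * pi - m2 * v)
    return v, pi

# Example: pure AdS background for S = 1, B'(z) = 1/z, with m^2 = 0
dz, n_layers = 0.2, 15
z = dz * np.arange(1, n_layers)        # z_1 ... z_{N-1}
v_out, pi_out = forward(1.0 / z, m2=0.0, dz=dz)
```

In this pure-AdS example the exact solution is $v = z^2$, and the Euler update keeps $\pi = 2z$ exactly, while $v$ at the last layer lags behind $z^2$ by an amount of order $\Delta z$, which illustrates the discretization error discussed in Sec. III C.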

B. Dataset for binary classification
The next step is to prepare the training dataset for our supervised learning. Since we want to extract possible features from the experimental data of the meson spectrum, we may simply use the values of the spectrum as our input. Then, what should the output data be? In view of (13), the output data is the value of the vector field at $z = \infty$. Depending on whether the mode $v(z)$ is normalizable or not, the value of $v(z)$ at large $z$ vanishes or diverges. The mode should be normalizable only when the input mass $m$ takes a correct value. In this way, we can arrange the neural network as a binary classification problem, by attaching a discrimination label to the input mass.
More concretely, as the input data we prepare a set of random real numbers (the masses). The range of the generated random numbers includes the experimentally measured values of the meson masses. If a random number is close to (far from) the experimental values in the real spectrum, it is called positive (negative) data. In the training data, the output value serving as the label for the positive (negative) data is 0 (1), which is the discrimination label. Therefore, we give our neural network the task of classifying the input real numbers: it is a meson mass classifier.
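A minimal sketch of such a labeled dataset is given below. The width of the positive region, the sampling range, and the names `label_masses` and `make_dataset` are assumptions for illustration; the values actually used are described in App. A 5.

```python
import numpy as np

def label_masses(m2, levels, width):
    """Return 0 for inputs within `width` of any level (positive data), else 1."""
    m2 = np.asarray(m2, dtype=float)
    levels = np.asarray(levels, dtype=float)
    # Distance of every input to its nearest level
    dist = np.min(np.abs(m2[:, None] - levels[None, :]), axis=1)
    return np.where(dist < width, 0.0, 1.0)

def make_dataset(levels, low, high, width, n_samples, seed=0):
    """Draw uniform random inputs in [low, high) and attach labels."""
    rng = np.random.default_rng(seed)
    m2 = rng.uniform(low, high, size=n_samples)
    return m2, label_masses(m2, levels, width)

# Example: the first two levels m_n^2 = 4(n + 1) of the model of Ref. [15]
m2, labels = make_dataset(levels=[4.0, 8.0], low=0.0, high=10.0,
                          width=0.5, n_samples=1000)
```

Uniform sampling produces many more negative than positive examples; whether and how the classes are balanced is a design choice not fixed by this sketch.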
Since we expect the final output to be 0 or 1, we introduce an additional layer there. This layer outputs 0 or 1 according to whether the value of $v(N\Delta z)$ satisfies the normalizability condition (15) or not. For this, we arrange as the final layer a smeared box function whose width is set by the discrimination threshold, with no weight multiplication.$^4$ As for the loss function, we simply adopt the L1 loss.
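For concreteness, one possible realization of the final layer and the loss is sketched below. The tanh-based smearing, the `sharpness` parameter, and the function names are assumptions made for illustration; the text only specifies a smeared box function of the threshold width and an L1 loss.

```python
import numpy as np

def smeared_box(v, threshold, sharpness=50.0):
    """Smooth classifier layer: ~0 if |v| < threshold, ~1 otherwise."""
    inside = 0.5 * (np.tanh(sharpness * (v + threshold))
                    - np.tanh(sharpness * (v - threshold)))
    return 1.0 - inside

def l1_loss(prediction, label):
    """Mean absolute difference between predictions and labels."""
    return float(np.mean(np.abs(np.asarray(prediction) - np.asarray(label))))

# A small final-layer value is classified as normalizable (label 0),
# a large one as non-normalizable (label 1).
outputs = smeared_box(np.array([0.0, 1.0]), threshold=0.1)
loss = l1_loss(outputs, np.array([0.0, 1.0]))
```

A smooth function is used instead of a hard step so that gradients can flow back to the weights during training.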

C. Hyperparameters and regularization
For the machine learning to work, we have to tune the hyperparameters in the architecture, and also need to introduce regularization terms to the loss function.
Our hyperparameters are the number of layers $N$, the discretization spacing $\Delta z$ which appears in the fixed weights of the neural network,$^5$ and the discrimination threshold in (15) used in the classification layer. In fact, these are closely related to each other from a physical viewpoint, as follows.
First, $N$ and $\Delta z$ give the infrared location $z = N\Delta z$ at which we check whether the vector field $v(z)$ behaves as a normalizable function, while the remaining hyperparameter is the threshold for this discrimination. Although in principle the threshold tends to 0 for $N \to \infty$, for our numerical calculations we need a finite $N$. In addition, even though the analytic solution goes to 0 at $z \to \infty$, the value at $z = N\Delta z$ could deviate from 0 due to the discretization error of the differential equation (4). So, for the network to work properly even with the discretization effect, we need to tune the threshold such that it also allows for this deviation. In practice, we fix these hyperparameters by using the analytic solutions of the well-known soft-wall model of Ref. [15]; see App. A 2 for the details.
The regularization to be added to the loss function is introduced for the following three reasons. The first one is the spacetime interpretability. The obtained set of weights is interpreted as $B(z)$, which consists of the dilaton field and the metric field. It needs to be a smooth function of $z$, otherwise there is no physical interpretation [1,12]. Second, as in (14), for the AdS/CFT to work, we impose the asymptotic AdS condition, which constrains the form of $B(z)$ near the initial layer, the boundary $z = 0$. The third reason is the soft wall. For the equation (4) of the vector field to have normalizable solutions,$^6$ the background field needs to have an infrared wall. This is also related to the technical trainability of the neural network, and to the choice of the random initial configurations of $B'(z)$ before the training.
Therefore, we introduce the following three kinds of regularization terms:
• $B(z)$ is a smooth function of $z$.
• $B(z)$ is asymptotically AdS at small $z$.
• $B(z)$ has a "wall" at large $z$.
For concrete functional forms of these regularizations in the loss function, see App. A 3.
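To indicate how such terms might enter the loss, the following sketch implements the first two requirements. The functional forms, the coefficients `c_smooth` and `c_ads`, and the function names are hypothetical and are not the forms of App. A 3; the asymptotic AdS term uses $B'(\Delta z) \simeq (2S-1)/\Delta z$, which follows from $B \simeq (2S-1)\log z$ in App. A 1. The wall term is omitted because its form cannot be inferred from the text.

```python
import numpy as np

def smoothness_penalty(b_prime):
    """Penalize jumps between the weights of neighboring layers."""
    return float(np.sum(np.diff(np.asarray(b_prime, dtype=float)) ** 2))

def ads_penalty(b_prime, dz, spin=1):
    """Penalize deviation of the first-layer weight from its AdS value."""
    return float((b_prime[0] - (2 * spin - 1) / dz) ** 2)

def regularized_loss(data_loss, b_prime, dz, spin=1,
                     c_smooth=1e-2, c_ads=1e-2):
    """Data loss plus weighted regularization terms (assumed coefficients)."""
    return (data_loss
            + c_smooth * smoothness_penalty(b_prime)
            + c_ads * ads_penalty(b_prime, dz, spin))
```

The relative size of the coefficients controls how strongly the physical requirements compete with the fit to the data, so in practice they would be treated as additional hyperparameters.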

IV. OPTIMIZATION BY THE DATA OF MESON SPECTRA
In this section, we determine the background of the AdS/QCD model (1) by training our neural network with the dataset of the meson spectrum. We perform two numerical experiments. The first one is a test case to check whether our deep learning can actually reproduce the soft-wall model (6) of Ref. [15] from data generated in advance by the model with (6). We confirm that our architecture can learn the model successfully. The second numerical experiment is the optimization of the model by deep learning with the experimental data of the meson spectrum. The machine finds a metric function $A(z)$ and a dilaton profile $\Phi(z)$ which are consistent with the experimental data. We discuss the physical implications of the determined AdS/QCD model.

A. Reproduction test
Since the implementation of our neural network is entirely based on the generic soft-wall model, as a first check of whether our architecture works properly, we perform a reproduction test of the model of Ref. [15] with (6). First, we prepare the mass spectra calculated by the model with (6). Then, using only that set of data, we train our neural network to find $B'(z)$. Then we compare the function $B'(z)$ determined by the machine with the one following from (6). If we confirm that they are similar enough, then we claim that our architecture works and the model of Ref. [15] is reproduced.
For the dataset of $m^2$, we generate a set of real numbers in the range $[0, 10)$. This range includes the first two levels of the spectrum, since the meson masses calculated by the model of Ref. [15] are $m_n^2 = 4(n + 1)$ with $n = 0, 1, 2, \cdots$ (see Fig. 3(a)). Note that the widths of the region of the positive data (the blue dots in Fig. 3(a) [...] [15] is reproduced well, qualitatively. Although we introduce some physical regularizations such as the asymptotic AdS condition and the wall condition, it is worth noting that only a part of the meson spectra is used for training the model. Hence we claim that our deep learning architecture works with the meson spectrum as the input data. In particular, the AdS/DL paradigm is shown to be helpful for constructing effective AdS/QCD models.

B. Model determined by experimental data
Finally, we come to the actual aim of the deep learning method: the determination of the metric/dilaton functions by the experimental data of the meson spectra. We use the spectra of the ρ meson (S = 1) and the a 2 meson (S = 2). Combining the results of these two, we can obtain Φ(z) and A(z) separately, as shown in (7).
The experimental data for the tower of the $\rho$ meson spectrum [26] are $\rho(770)$, $\rho(1450)$, $\rho(1700)$, $\cdots$, and we use only the first two levels, since the analysis of the model of Ref. [15] suggests that the discretization errors are difficult to handle for higher excitations. As for the $a_2$ meson, $a_2(1320)$ and $a_2(1700)$ are reported as established poles [26]. We generate the input data with these values, and design our neural network. (See Fig. 3(b) and Fig. 3(c) for the generated input data, and TABLE II for the hyperparameters of the architecture. For details of the hyperparameters, the initial weights, and the dataset generation, see App. A 2, App. A 4, and App. A 5, respectively.) To implement dimensionful quantities (such as the masses) in the numerical experiment, we normalize everything appearing in (4) in units of the AdS radius $L$, which is the unique dimensionful parameter in the simple AdS/QCD model (1). The input values in Fig. 3(b) and Fig. 3(c) are multiplied by $L$ when they are used in the training. The value of $L$ can be chosen arbitrarily, and here, for simplicity, we choose it such that the mass of the lowest $\rho$ meson (0.77 GeV) equals the mass of the ground state ($m_n^2 = 4(n+1)$ with $n = 0$) of the model of Ref. [15] in units of $L$. This results in $L = \sqrt{4/0.77^2}\ \mathrm{GeV}^{-1} \simeq 2.6\ \mathrm{GeV}^{-1}$ as the unit of length.
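The unit conversion described here amounts to the following short calculation. The masses of the excited states below are nominal values read off the particle names, which is an assumption for illustration and not the experimental values of Ref. [26]; the variable and function names are hypothetical.

```python
import math

RHO_GROUND_GEV = 0.77    # lowest rho meson mass quoted in the text
GROUND_STATE_M2 = 4.0    # m_0^2 of the model of Ref. [15] in units of L

# AdS radius in GeV^-1, fixed by requiring (m_rho * L)^2 = 4
L_ADS = math.sqrt(GROUND_STATE_M2) / RHO_GROUND_GEV

def dimensionless_m2(mass_gev):
    """Squared mass in units of the AdS radius."""
    return (mass_gev * L_ADS) ** 2

# Nominal masses in GeV taken from the particle names (assumption)
NOMINAL_GEV = {"rho(770)": 0.77, "rho(1450)": 1.45,
               "a2(1320)": 1.32, "a2(1700)": 1.70}
inputs = {name: dimensionless_m2(m) for name, m in NOMINAL_GEV.items()}
```

By construction the lowest $\rho$ level maps to $m^2 = 4$, and all other levels are rescaled by the same factor $L^2$.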
We repeat the training of the neural networks with these datasets, twenty times for the $\rho$ meson and five times for the $a_2$ meson, separately. The network is optimized, and the emergent profiles $B'(z)$ are obtained, which are shown in FIG. 5.
To obtain the metric and the dilaton, we calculate $B(z)$ by the discretized integration (16). We set the integration constant $C$ such that $B(\Delta z) = \log \Delta z$, because the background should be asymptotically AdS, $A(z) \sim -\log z$, by assumption. From this integral we compute the metric profile function $A(z)$ and the dilaton profile function $\Phi(z)$, which are shown in FIG. 6 and FIG. 7. The emergent metric $A(z)$ in the near-boundary region $0 < z \leq 2.0$ can be consistently fit with the function (17), with $a_1 = 0.849$, $a_2 = -0.496$, and $a_3 = -0.433$. The first term ($-\log z$) is that of the AdS$_5$ spacetime, and the remaining terms are the corrections. The Taylor series fitting the deviation from AdS is in powers of $(z - 0.2)$, where $z = \Delta z = 0.2$ is the location of the initial layer.
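The discretized integration can be sketched as a cumulative sum over the trained weights. The left Riemann sum used below is an assumed discretization, chosen to match a forward-Euler propagation; the function name `integrate_b` is hypothetical.

```python
import numpy as np

def integrate_b(b_prime, dz):
    """Return B on the grid z_k = (k + 1) * dz from b_prime[k] = B'(z_k).

    The integration constant is fixed by B(dz) = log(dz), and each step
    adds dz * B'(z_k) evaluated at the left end of the interval.
    """
    b_prime = np.asarray(b_prime, dtype=float)
    steps = np.concatenate(([0.0], dz * b_prime[:-1]))
    return np.log(dz) + np.cumsum(steps)

# Example with a constant profile B'(z) = 2 on five layers
b = integrate_b(np.full(5, 2.0), dz=0.2)
```

The profiles obtained in this way for $S = 1$ and $S = 2$ are the inputs needed to separate the metric from the dilaton.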
The emergent dilaton profile is plotted in Fig. 7. The obtained points with the error bars can be fit well with the linear function (18), with a constant $\phi_1 = 1.43$. As one can see in Fig. 7, our deep learning has found a linear dilaton profile, rather than the $z^2$ behavior which was originally adopted in Ref. [15]. The linear dilaton is a popular background in string theory, as the worldsheet theory is solvable, and it is interesting that it shows up in the machine learning.$^7$

C. Physical properties of the emergent spacetime
With the optimized emergent metric and dilaton functions, we can even try to make a prediction. We calculate the mass of the next-higher excitation of the $\rho$ meson and of the $a_2$ meson. We numerically calculate the eigenvalue of (4) with our emergent $B(z)$, and summarize the result in TABLE III. The middle column of TABLE III shows the calculated mass of the third-lowest state of each meson. As for the $\rho$ meson, our optimized model predicts a mass which agrees with the experimental data within 10 percent. Our prediction for the $a_2$ meson mass can be tested in future experiments, as there is no established experimental value at present.
Our prediction comes with a caveat, as our architecture depends on various hyperparameters and is subject to discretization errors, and we use only the lowest and the second-lowest meson masses as the training data. A change in the widths introduced for the positive/negative data in FIG. 3 may cause the training to fail. One of the drawbacks of deep learning in general is that the relation between hyperparameters and generalization is unknown, and any prediction carries such a caveat. Nevertheless, we find that our prediction is still reasonable, which is encouraging.
Finally, let us discuss the physical properties of the emergent metric (17) and the dilaton (18). Using these, we find the effective volume element $\sqrt{-g}\,e^{-\Phi}$ which appears in the action (1). Since the radial $z$-dependence of the volume element reflects physical properties of the geometry, we plot the logarithm of our optimized effective volume element, $5A - \Phi$, in FIG. 8. In the figure, for comparison, the dashed line shows $5A - \Phi$ for the model of Ref. [15].
We notice that as $z$ increases toward the infrared, our volume element at first gradually deviates in the positive direction from that of the AdS spacetime of Ref. [15]. This can also be seen from the fact that the first correction $a_1$ in (17) is positive, meaning that the AdS spacetime is deformed toward a slower warping. One possible interpretation is that this is a tendency toward a confining geometry, which is consistent with the fact that non-supersymmetric QCD is a confining theory at zero temperature.
Then, as $z$ increases further, our effective volume element goes below the AdS line. The resultant bump of the volume element as a function of $z$ resembles the machine-learned geometry of the AdS/DL model for the lattice data of the QCD chiral condensate [12], where at finite temperature the emergent geometry consists of both a confining wall and a black hole horizon. Our geometry is trained with experimental meson spectra, while the one in Ref. [12] is trained with different QCD observables. Finding a unique geometry consistent with more QCD observables is a challenging problem.
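The bump described above can be checked numerically from the quoted fit parameters. The sketch below assumes that the fit (17) has the form $A(z) = -\log z + \sum_{k=1}^{3} a_k (z - 0.2)^k$ and that (18) is $\Phi(z) = \phi_1 z$; both forms are inferred from the surrounding description and are assumptions, as are the function names. The reference curve uses $A = -\log z$ and $\Phi = z^2$ of Ref. [15].

```python
import math

A_COEFFS = (0.849, -0.496, -0.433)   # a_1, a_2, a_3 quoted in the text
PHI_1 = 1.43                         # dilaton slope quoted in the text
Z0 = 0.2                             # location of the initial layer

def log_volume_fit(z):
    """5A - Phi for the assumed fit forms, valid for 0 < z <= 2.0."""
    u = z - Z0
    a = -math.log(z) + sum(c * u ** (k + 1) for k, c in enumerate(A_COEFFS))
    return 5.0 * a - PHI_1 * z

def log_volume_ref15(z):
    """5A - Phi for the model of Ref. [15]."""
    return -5.0 * math.log(z) - z ** 2

def deviation(z):
    """Difference between the emergent and the reference curve."""
    return log_volume_fit(z) - log_volume_ref15(z)
```

With these assumed forms, `deviation(0.5)` is positive and `deviation(2.0)` is negative, in line with the behavior described for FIG. 8.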

V. CONCLUSION
In this paper, we have proposed a deep learning architecture to discover an AdS/QCD model from given experimental data of meson mass spectra. The neural network depth direction is identified with the emergent radial direction $z$ of the 5-dimensional model. The model is a soft-wall model based on Ref. [15], with an unknown metric function $A(z)$ of the curved geometry and an unknown dilaton profile $\Phi(z)$. We have identified those profiles with the weights of the deep neural network (FIG. 2). By supervised training with the lowest and the second-lowest $\rho$ and $a_2$ meson mass values as the training data, our machine has found optimized profiles of the geometry $A(z)$ and the dilaton $\Phi(z)$ (FIGs. 6 and 7). Therefore, deep learning, based on the concept of AdS/DL [1,12], can derive an effective AdS/QCD model from QCD experimental data.
With the emergent geometry and the dilaton profile, we have calculated the excited meson masses. This prediction has turned out to be reasonable, although it should be taken with caution, as the architecture involves various regularizations and discretization errors. Nevertheless, the training to obtain the emergent geometry is worthwhile from a larger perspective. This framework may open up a whole scheme for determining a better holographic model from a vast amount of QCD data. As we have emphasized in Sec. I, a QFT contains an infinite amount of data, namely spectra and scattering amplitudes, equivalent to $n$-point correlators of infinitely many kinds of gauge-invariant operators. Finding better holographic models is equivalent to the feature extraction of QCD, which may help to reveal the hidden mechanism of how the AdS/CFT correspondence works.
Relatedly, the similarity between the holographic dictionary and the deep neural network architecture may have some physical origin, and such a standpoint may provide a new way to investigate two subjects which appear only distantly related: quantum gravity and data science. In this growing subject (see Ref. [27] for a recent summary of data science applications to string theory), the idea of equating a holographic spacetime with a neural network [1,12,16,17,32,33,34] may become intertwined with the machine learning of string landscapes initiated by Refs. [28][29][30][31]. Discovering a complete gravity dual of QCD is a challenging problem, and various data-scientific methods applied to string theory may help with it.

ACKNOWLEDGMENTS
We would like to thank Hong-Ye Hu

APPENDIX A
In this appendix we provide details of our deep learning architecture. Our Python code for the machine learning is written on the basis of the code used in Ref. [12]. The latter code is available at [35], and we revise the forward function and the regularization terms of the code. In addition, the preparation of the input data shown in Fig. 3 is necessary. In the following, we provide details about the input layer, the hyperparameters, the loss function and the regularizations, the initial weights, and the dataset generation.

Input values at the initial layer
At the end of Sec. III A, we set the units $v$ and $\pi$ at the initial layer to the asymptotic solution of a typical soft-wall model, as in (14), while the unit $m^2$ receives its value from the input data. We derive (14) in the following.
As mentioned, we assume that our model background is asymptotically AdS$_5$. This means that $A(z) \sim -\log z$ and $\Phi(z)$ vanishes at $z \sim 0$. Then the function $B(z)$ defined in (7) is approximated by $(2S-1)\log z$. Hence the bulk field equation (4) reduces near the AdS boundary $z \sim 0$ to

$$\left[\partial_z^2 - \frac{2S-1}{z}\,\partial_z\right] v = 0 . \tag{A1}$$

Assuming a power-law configuration $v \sim z^{\alpha}$, the equation above reduces to

$$\alpha(\alpha-1) - (2S-1)\,\alpha = \alpha\,(\alpha - 2S) = 0 ,$$

which forces us to choose $\alpha = 0, 2S$. We can write the general solution as $v = a + b z^{2S}$.
According to the AdS/CFT dictionary, the non-normalizable part $a$ corresponds to a source in the boundary theory and the normalizable part $b$ corresponds to the expectation value of the operator associated with the source. Since we are interested in hadron spectra without the source, we set $a = 0$; moreover, we can choose $b = 1$ due to the linearity of equation (A1). Therefore we estimate the behavior of $v$ near the boundary as

$$v \sim z^{2S} ,$$

which also gives $\pi \sim 2S z^{2S-1}$, that is, (14).
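As a quick numerical sanity check of the exponents found above, the following sketch evaluates the residual of the near-boundary equation by central finite differences. It assumes that the reduced equation has the form $v'' - (2S-1)\,v'/z = 0$, which is what the exponents $\alpha = 0, 2S$ imply. The function names, the sample point, and the step size are illustrative choices and are not taken from the training code.

```python
def residual(v, z, S, h=1e-4):
    """Residual of v'' - (2S-1)/z * v' at the point z, by central differences."""
    d1 = (v(z + h) - v(z - h)) / (2 * h)
    d2 = (v(z + h) - 2 * v(z) + v(z - h)) / h**2
    return d2 - (2 * S - 1) / z * d1


def power_law(alpha):
    """Trial configuration v(z) = z**alpha."""
    return lambda z: z**alpha


S = 1      # spin of the rho meson
z = 0.3    # sample point near the boundary (arbitrary)
res_normalizable = residual(power_law(2 * S), z, S)   # alpha = 2S
res_source = residual(power_law(0), z, S)             # alpha = 0
res_other = residual(power_law(1), z, S)              # alpha = 1, not a solution
```

The first two residuals vanish up to rounding, while a generic exponent such as $\alpha = 1$ leaves a nonzero residual.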

Hyperparameters
As we described briefly in Sec. III C, the hyperparameters of our neural network, namely N, ∆z, and the threshold used to judge normalizability, are fixed by carefully looking at discretization errors. Here we describe how to choose their values. We choose ∆z first, then determine the others. To evaluate the discretization errors, we need analytic solutions for comparison, and we adopt the fluctuation solutions of Ref. [15].
Fig. 9 shows the lowest solution $v_0(z)$ of the bulk field equation of motion, i.e., $v(z)$ with $m = m_{n=0}$ in the model of Ref. [15]. Both the analytic solution and the numerical solution are plotted. The numerical solution is obtained by the Euler method with step size ∆z = 0.2. As mentioned in Sec. III C, we have to look at the location where the solution converges, and also at the difference between the analytic solution and the numerical solution.
First, the plot shows that the analytic $v_0$ almost touches the z axis around z = 4. This suggests setting N = 20 (so that N∆z = 4) in order to discriminate the normalizability of the solution. Next, the numerical solution approaches 0.2, not 0, due to the discretization error. So we have to choose the normalizability threshold to be greater than 0.2. From this analysis, we set the hyperparameters to the values shown in Table III. Of course, we could choose a smaller ∆z and a larger N. However, the discretization error would not change much, while a larger N makes the computation too heavy because the neural network becomes deeper.
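To illustrate how the Euler step size and the depth N enter, here is a minimal sketch of the layer-by-layer propagation on a toy case. It is not the network used in this paper. It assumes that the bulk equation has the form $v'' - B'(z)\,v' + m^2 v = 0$, as suggested by the boundary analysis, and it uses the pure-AdS weight $B'(z) = (2S-1)/z$ with $m = 0$, for which $v = z^{2S}$ is exact. The names `euler_layers` and `ads_weight` are hypothetical.

```python
def euler_layers(weight, m2, S, dz, n_layers):
    """Propagate (v, pi) from layer 1 (z = dz) to layer n_layers by Euler steps.

    Assumed first-order form: v' = pi, pi' = weight(z) * pi - m2 * v.
    """
    z = dz
    v = z ** (2 * S)                 # boundary behaviour v ~ z^(2S)
    pi = 2 * S * z ** (2 * S - 1)    # pi ~ 2S z^(2S-1)
    for _ in range(n_layers - 1):
        v, pi = v + dz * pi, pi + dz * (weight(z) * pi - m2 * v)
        z += dz
    return z, v


def ads_weight(S):
    """Pure-AdS weight B'(z) = (2S-1)/z (toy background, not the trained one)."""
    return lambda z: (2 * S - 1) / z


z_max = 4.0
errors = {}
for dz in (0.2, 0.1, 0.05):
    n_layers = round(z_max / dz)
    z_end, v_end = euler_layers(ads_weight(1), 0.0, 1, dz, n_layers)
    errors[dz] = abs(v_end - z_end ** 2)
    print(f"dz={dz}: N={n_layers}, v={v_end:.3f}, exact={z_end ** 2:.3f}")
```

In this toy case the error at fixed $z_{\max}$ shrinks roughly linearly in ∆z while the depth N grows as 1/∆z, which illustrates the trade-off between accuracy and depth. The behaviour for the soft-wall background of Ref. [15] may differ.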

Loss function and regularization
The total loss function we use is

$$\text{Loss} = E + \sum_{i=1}^{4} p_i , \tag{A5}$$

where $E$ is an L1 loss function, which is commonly used in binary classification. Denoting by $y$ the output of the neural network and by $\bar{y}$ the output part of the training data, the loss function is made of the L1 distance $|y - \bar{y}|$,

$$E = \frac{1}{d}\sum_{i=1}^{d} \left| y^{(i)} - \bar{y}^{(i)} \right| .$$

Here $i$ labels the data and $d$ is the number of data points. The output of our neural network takes a binary value, 0 or 1, which is obtained from the unit value of $v$ at the $N$-th layer through its normalizability condition. This transformation is given by a smeared box (step) function, which was also used in Ref. [12]. For more details, see the function named "t" in the code of Ref. [12].

Fig. 9: The ground state solution of (4) with the model background given in Ref. [15] ($B(z) = z^2 + \log z$), solved analytically (yellow line) and numerically (blue line). The blue line interpolates the discrete points of the numerical solution.
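A minimal sketch of the two ingredients just described: a smeared box function that maps the final-layer value to approximately 0 or 1, and the L1 loss averaged over the data. The tanh-based form, the threshold `eps`, the sharpness `k`, and the convention that small $|v|$ maps to 0 are all assumptions made here for illustration. The function actually used is the one named "t" in the code of Ref. [12].

```python
import math


def smeared_box(v, eps=0.5, k=100.0):
    """Smooth step: close to 0 for |v| < eps and close to 1 for |v| > eps."""
    return 0.5 * (math.tanh(k * (v - eps)) - math.tanh(k * (v + eps)) + 2.0)


def l1_loss(outputs, targets):
    """Mean absolute difference between network outputs and data labels."""
    d = len(outputs)
    return sum(abs(y - t) for y, t in zip(outputs, targets)) / d


final_values = [0.0, 0.05, 1.5, -3.0]   # hypothetical values of v at the N-th layer
labels = [0, 0, 1, 1]                   # hypothetical binary labels
outputs = [smeared_box(v) for v in final_values]
loss = l1_loss(outputs, labels)
```

A smooth function is used rather than a hard step so that gradients can propagate back through the output layer during training.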
The loss function (A5) contains a sum of the $p_i$, which are the regularizations. Explicitly, they have the following form:

Here $B'^{(l)}$ denotes the weight at the $l$-th layer, or equivalently $B'(l\Delta z)$, and the constant coefficients $c_{i,l}$ determine how strongly each constraint acts as a regularization. There are four kinds of regularizations, $p_i$ ($i = 1, 2, 3, 4$). In the following we explain each in detail.

$p_1$ is the regularization requiring that the emergent background be asymptotically AdS. Note that $S$ is the spin of the meson whose mass spectrum we use as the input, and the $S$-dependent factor comes from (7). Since this constraint applies only to the boundary region of the 5-dimensional geometry, we turn it on only in the first few layers and set most of the $c_{1,l}$ to 0. When we train the network with the ρ meson spectrum, we use the following coefficients:

.001, 0.0001, 0.0001, 0, · · · , 0} . (A8)

$p_2$ and $p_3$ are the regularizations requiring that the emergent background be a smooth geometry. This constraint needs to be imposed on all layers except the first few. So, in the ρ meson case, we impose the $p_2$ regularization on the 4th layer and deeper layers. For $p_3$, after some tuning, we find that it is enough to impose it from the 7th layer onward. The coefficients used in the ρ meson training are shown below. (A9)

$p_4$ is the regularization requiring that the weight $B'(z)$ be a monotonically increasing function. Here "relu" is a built-in function of machine learning packages, defined as

$$\mathrm{relu}(x) = \max(0, x) . \tag{A10}$$

Hence, if $B'$ decreases between two consecutive layers, this regularization gives a positive loss. A small constant $\delta$ plays the role of requiring that $B'$ increase by at least $\delta$. A small nonzero $\delta$ is necessary to train on the $a_2$ meson data in our case. The $p_4$ regularization requires that $B(z)$ provide the "wall"-like behavior so that the spectrum has no continuum. For that, we need the $p_4$ regularization in the large-$z$ region; otherwise the model allows an unexpected continuous spectrum at larger mass. The coefficients used in the ρ meson training are the same as those of $p_3$,

$$\{c_{4,l}\} = \{0, 0, 0, 0, 0, 0, 0.01, \cdots\} . \tag{A11}$$

All the coefficients above are for the ρ meson training. For the $a_2$ meson training, we choose the coefficients as follows:

$c_{1,1} = 0.01$, $c_{1,i} = 0.001$ ($i = 2, \cdots, 8$),
$c_{3,i} = 0.01$ ($i = 9, \cdots, 12$), $c_{3,j} = 0.005$ ($j = 13, \cdots, 21$), $c_{3,k} = 0.001$ ($k = 22, \cdots, 80$),
$c_{4,i} = 10$ ($i = 61, \cdots, 80$),

and the other components are set to zero.
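The monotonicity penalty described above can be sketched as follows. The explicit formula is an assumption chosen to match the verbal description, namely a positive loss whenever the weight fails to increase by at least δ between consecutive layers. The name `p4`, the depth of 20 layers, and the sample weights are hypothetical.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def p4(weights, coeffs, delta=0.0):
    """Assumed form: sum over l of c_l * relu(w_l - w_{l+1} + delta)."""
    return sum(
        c * relu(w - w_next + delta)
        for c, w, w_next in zip(coeffs, weights, weights[1:])
    )


# Coefficients quoted for the rho meson: zero on the first six layers, 0.01 afterwards.
n_layers = 20
coeffs = [0.0] * 6 + [0.01] * (n_layers - 6)

increasing = [0.5 * l for l in range(1, n_layers + 1)]   # toy increasing weights
with_dip = list(increasing)
with_dip[10] = with_dip[9] - 1.0   # the weight drops by 1.0 between two deep layers

penalty_ok = p4(increasing, coeffs)
penalty_dip = p4(with_dip, coeffs)
```

With these toy weights the increasing profile is not penalized, while the profile with a dip picks up a contribution from the single pair of layers where the weight decreases.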

Initial weight
For successful training, we need to control the initial values of the weights to some extent. In the ρ meson training, we sample the initial weights at the $l$-th layer from a normal distribution with mean $1 + 0.5l$ and standard deviation $0.3l$. In the $a_2$ meson training, we sample them from a normal distribution with mean $3/(0.5l + 0.05) + 0.05l - 7$ and standard deviation 10.
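The sampling just described can be sketched in a few lines using only the Python standard library. The means and standard deviations are those quoted above. The layer index running from 1, the depths passed to the functions, the fixed seed, and the function names are assumptions made for this illustration.

```python
import random


def rho_initial_weights(n_layers, rng):
    """Layer l = 1..n_layers: normal with mean 1 + 0.5 l and std 0.3 l."""
    return [rng.gauss(1 + 0.5 * l, 0.3 * l) for l in range(1, n_layers + 1)]


def a2_mean(l):
    """Mean of the initial weight at layer l in the a_2 meson training."""
    return 3 / (0.5 * l + 0.05) + 0.05 * l - 7


def a2_initial_weights(n_layers, rng):
    """Layer l = 1..n_layers: normal with mean a2_mean(l) and std 10."""
    return [rng.gauss(a2_mean(l), 10.0) for l in range(1, n_layers + 1)]


rng = random.Random(0)                 # fixed seed, only for reproducibility here
rho_w = rho_initial_weights(20, rng)   # depth 20 is an illustrative choice
a2_w = a2_initial_weights(80, rng)     # depth 80 is an illustrative choice
```

Both means grow with $l$ at large depth, so the initial weights already favor the increasing, wall-like profile that the $p_4$ regularization enforces.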

Training dataset
Here we present our Python code to generate the dataset given in Fig. 3(b) and Fig. 3(c). In our training trials, we observe that the success of the training depends on the range of the positive data values and on the local density of data points, both of which are specified explicitly in the following code for the ρ meson and for the $a_2$ meson.
1) Dataset for ρ meson (Fig. 3