Impact of the form of weighted networks on the quantum extreme reservoir computation

The quantum extreme reservoir computation (QERC) is a versatile quantum neural network model that combines the concepts of extreme machine learning with quantum reservoir computation. Key to QERC is the generation of a complex quantum reservoir (feature space) that does not need to be optimized for different problem instances. Originally, the dynamics of a periodically driven Hamiltonian system was employed as the quantum feature map. In this work we capture how the quantum feature map develops as the number of time steps of the dynamics increases, using a method that characterizes unitary matrices as weighted networks. Furthermore, to identify the key properties of a sufficiently grown feature map, we evaluate various weighted network models that could serve as the quantum reservoir in image-classification settings. Finally, we show how a simple Hamiltonian model based on a disordered discrete time crystal, with its straightforward implementation route, provides nearly optimal performance while removing the need to program the quantum processor gate by gate.


I. INTRODUCTION
In recent years we have seen steady growth in the number of qubits available on a variety of quantum processors [1][2][3][4]. This has led to a new phase of quantum computer development, often called the "NISQ" era. Here NISQ stands for noisy intermediate-scale quantum, indicating that the quantum processor is too small to implement logical quantum operations and hence is inherently noisy. The number of qubits in these quantum processors (well in excess of 50 [3,5,6]) has already reached the point where the quantum computational tasks they can perform are intractable on a conventional computer; however, noise prevents us from extracting the quantum advantage such quantum computers promise. Hence, for the NISQ era to mark its significance in computing history, quantum advantages for real applications have to be demonstrated.
Many of the current NISQ processors are designed to operate via quantum gates [1,7-9]. To run a quantum algorithm, we need to obtain a quantum gate circuit from the algorithm and then decompose each quantum gate into ones implementable on the quantum processor at hand. The noise in these quantum processors necessitates the optimization of quantum gate circuits to minimize its effect. As long as the physical qubits are directly used for computation, quantum algorithms also need to be relatively short and resilient to noise. Variational quantum algorithms (VQAs) have attracted a lot of attention from this viewpoint and have been intensively investigated [10-12]. However, several issues remain, the most significant obstacle being the difficulty of optimizing the variational models [13,14]. VQAs are one type of quantum neural network (QNN) model, and other QNN models are well known. One such example is quantum reservoir computation [15-19], which is expected to be more implementation friendly. Similarly to the chaotic dynamics used in (classical) reservoir computation [20,21], the quantum reservoir generates complex dynamics in the quantum system. Realizing such dynamics using a quantum gate circuit approach is, however, not that simple [1,22-25]. Yet precise programming is not required to generate sufficient complexity in the quantum reservoir to realize our quantum algorithm [16,18]. Instead, an effective quantum reservoir can potentially be generated by a simpler quantum system, giving us a better way to utilize the computational power of QNNs.
Recently, the quantum extreme reservoir computation (QERC) was proposed [26] as a more advanced yet simpler QNN model based on reservoir computation and extreme machine learning [27]. This model uses a quantum reservoir to generate a quantum neural network, which is then used for extreme machine learning. The use of quantum reservoirs for extreme machine learning has been discussed before; however, there were no attempts at image classification until recent years [19], and the previous models were presented only in a general form. Unlike those models, the QERC was the first concrete model able to perform the MNIST image classification task, an important task in computer vision. The model has been numerically shown to achieve the highest accuracy in classifying handwritten digits from the MNIST dataset with the smallest number of qubits [28]. An interesting feature of this approach is that it utilizes a discrete time crystal (DTC) as the feature map, which is much simpler to implement than the quantum gate circuit needed to generate a random unitary matrix. This suggests that if we could understand the mechanism by which the complexity of the quantum dynamics generates an effective feature space, it would become possible to design quantum feature maps more efficiently.
One versatile method for studying the complexity of quantum dynamics is to characterize it as a complex network. Such network approaches have allowed us to quantify the complexity of quantum states around critical points of quantum phase transitions [29], to build a graphical calculus for Gaussian pure states [30], and to reveal the preferential attachment mechanism during the melting process of the DTC [31]. Further, these network approaches can also be applied to the analysis of quantum machine learning models. In Ref. [26], the performance behavior of the QERC with the DTC dynamics was explored through the complex network emerging in the Hilbert space of the DTC dynamics [31]. Let us now outline the focus and structure of this paper. We use the QERC proposed in [26] as a tool to investigate the role of the feature map in quantum neural networks. This model provides a convenient platform to do so, as the quantum contribution relies fully on the quantum reservoir. We start in Section II with a description of the QERC model, where we also present our method to characterize the unitary map responsible for the quantum reservoir as a complex network. Using this method, we investigate the feature-map properties of the DTC and some random unitaries, and observe the differences in their dynamics in Section III. Then, in Section IV, by benchmarking the QERC performance with those unitaries on concrete practical tasks, we discuss which properties of the performance arise from the differences in the models' dynamics. We confirm that the differences lie not only in the quantum reservoir properties but can, in fact, be exploited for better QERC performance on practical tasks. In Section V, we summarize our results.

II. THE QERC MODEL AND ITS CHARACTERIZATION
Let us begin with a brief description of QERC. As shown in Fig. 1a), QERC can be described in terms of three key components: the encoder, the quantum reservoir, and the classical processor. Encoder: Here, the data to be classified is preprocessed (if necessary) and encoded into the initial state of the quantum reservoir. In more detail, as shown in Fig. 1a), a Principal Component Analysis (PCA) map is used for the preprocessing of the classical data. Then an appropriate encoding strategy needs to be chosen for the problem at hand. For a quantum reservoir of L qubits, the 2L most significant parameters from the PCA map are encoded by single-qubit rotations.
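The encoder stage can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the PCA is done via a plain SVD, and the mapping of the 2L components to rotation angles (pairs (θ_l, φ_l) preparing cos θ_l |0⟩ + e^{iφ_l} sin θ_l |1⟩ on qubit l) is an assumed convention for illustration only.

```python
import numpy as np

def pca_compress(X, k):
    """Project data onto its top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def encode(images, L):
    """PCA-compress each image to 2L numbers, then build L-qubit product states.

    The rotation convention is an illustrative assumption: parameters
    (theta_l, phi_l) prepare qubit l in cos(theta_l)|0> + e^{i phi_l} sin(theta_l)|1>.
    """
    feats = pca_compress(np.asarray(images, dtype=float), 2 * L)
    states = []
    for f in feats:
        theta, phi = f[:L], f[L:]
        state = np.array([1.0 + 0j])
        for l in range(L):
            qubit = np.array([np.cos(theta[l]),
                              np.exp(1j * phi[l]) * np.sin(theta[l])])
            state = np.kron(state, qubit)   # tensor product over the chain
        states.append(state)
    return np.array(states)                  # shape (n_images, 2**L)
```

Since each single-qubit state is normalized, every encoded product state is automatically a valid (unit-norm) initial state of the reservoir.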
Quantum reservoir: In this step, the quantum reservoir provides the feature space for QERC. The quantum dynamics of the quantum reservoir determines the feature-map properties for the quantum computation, given by the unitary operator Û.

Classical processor:
In this final step, the state given by the unitary operator Û acting on the initial state is measured projectively in the computational basis. The process is repeated to obtain the amplitude distribution of the state generated by the unitary operator. This amplitude distribution is then processed through a one-layer neural network (ONN).
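The classical processor stage above can be sketched in a few lines. This is a hedged illustration assuming the ideal infinite-shot limit (the measured amplitude distribution equals the Born-rule probabilities) and a softmax output layer for the ONN; the function names are ours, not the paper's.

```python
import numpy as np

def measure_distribution(state):
    """Amplitude (probability) distribution from repeated projective
    measurements in the computational basis (ideal, infinite-shot limit)."""
    return np.abs(state) ** 2                # Born rule

def onn(probs, W, b):
    """One-layer neural network: class probabilities softmax(W p + b).
    W and b are the only trained parameters in QERC."""
    z = probs @ W.T + b
    z = z - z.max(axis=-1, keepdims=True)    # numerical stabilization
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Only W and b are optimized (classically); the reservoir unitary stays fixed, which is the defining feature of the extreme-learning approach.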
We immediately notice that this is a hybrid quantum-classical algorithm. In QERC the feature space is provided by the quantum reservoir, whereas the optimization is carried out on the classical processor (ONN). Typically the quantum reservoir does not need to be optimized for different problem instances [15,18,26]. Our interest in this paper is in the properties we require of the quantum reservoir and their influence on the performance of QERC. In particular, we want to show that how we set up the quantum reservoir is important. In this work, we first employ the DTC model used in [26] as our choice of quantum reservoir. The DTC model has a parameter that controls the complexity of the dynamics: starting from the perfect discrete time crystal at rotation-error parameter ϵ = 0, the dynamics gradually deviates from a DTC, acquiring complexity as ϵ increases. This parameter ϵ represents an error in the single-qubit rotation in the DTC Hamiltonian, which is given by

$$\hat H_1 = g(1-\epsilon)\sum_{l}\hat\sigma^x_l,\qquad \hat H_2 = \sum_{l<m} J_{lm}\,\hat\sigma^z_l\hat\sigma^z_m + \sum_{l} D_l\,\hat\sigma^z_l, \qquad (1)$$

where $\hat\sigma^a_l$ (a = x, y, z) represent the Pauli operators on the l-th qubit. Next, T is the driving cycle, while the DTC cycle is 2T. Further, g is the rotation strength, and in this case we set gT = π. Now J_lm = J_0/|l − m|^α is the coupling strength between qubits l and m, with a power-law decay governed by the constant exponent α. Finally, D_l is a disordered external field on each qubit l. Unless explicitly stated, all the D_lT are set to zero in this work.
The time-periodic system is conveniently characterized by the Floquet operator $\hat F = e^{-i\hat H_2 T/2\hbar}\, e^{-i\hat H_1 T/2\hbar}$, where the stroboscopic time evolution is obtained from the unitary operator $\hat U(nT, 0) = \hat F^n$ for n ∈ ℕ. Hence, we use the unitary operator Û(nT, 0) for different values of n to characterize the quantum reservoir.

FIG. 1. a) Schematic architecture of the quantum extreme reservoir computation (QERC) processor. It begins with an image of size 28 × 28 pixels, which is processed through principal component analysis and compressed to 2L components (where L is the number of qubits). Using these 2L components, an initial state corresponding to the image is created by single-qubit rotations. The quantum reservoir then lets the initial state evolve. By projective measurements in the computational basis, the final state is converted to classical information. The amplitude distribution of this classical information is fed into the one-layer neural network (ONN). b) Schematics of the weight distributions for the x-, y-, and z-components. c) Summary of the definition of the ratio $R_\nu = s^{(\nu,G)}_+/s^{(\nu)}$ for the ν-component, where $s^{(\nu)}$ is defined in Sec. II B.
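The Floquet construction above can be sketched numerically. This is a minimal illustration (ℏ = 1, a small chain for tractability) assuming the σ^x-rotation / σ^zσ^z-coupling form of Eq. 1; the helper names are ours.

```python
import numpy as np
from scipy.linalg import expm

def pauli_chain(op, site, L):
    """Embed a single-qubit operator `op` on `site` of an L-qubit chain."""
    mats = [np.eye(2, dtype=complex)] * L
    mats[site] = op
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def dtc_floquet(L, eps, J0T=0.12, alpha=1.51, DlT=None, gT=np.pi):
    """Floquet operator F = exp(-i H2 T/2) exp(-i H1 T/2), hbar = 1."""
    if DlT is None:
        DlT = np.zeros(L)                  # no disorder unless stated
    # H1 T = gT (1 - eps) sum_l sigma^x_l  (imperfect global pi-pulse)
    H1T = gT * (1 - eps) * sum(pauli_chain(sx, l, L) for l in range(L))
    # H2 T = sum_{l<m} J0T/|l-m|^alpha sz_l sz_m + sum_l DlT_l sz_l
    H2T = sum((J0T / abs(l - m) ** alpha)
              * pauli_chain(sz, l, L) @ pauli_chain(sz, m, L)
              for l in range(L) for m in range(l + 1, L))
    H2T = H2T + sum(DlT[l] * pauli_chain(sz, l, L) for l in range(L))
    return expm(-1j * H2T / 2) @ expm(-1j * H1T / 2)

# stroboscopic map U(nT, 0) = F^n
F = dtc_floquet(L=4, eps=0.03)
U_2T = np.linalg.matrix_power(F, 2)
```

The paper's simulations use L = 10 (a 1024-dimensional unitary); the L = 4 example here merely illustrates the construction.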

A. Characterization of the unitary matrices
The next step is the characterization of the unitary operator Û(nT, 0). This unitary operator acts as a map between the input and output states, giving the feature map used for the quantum computation. Such a map can be considered as a weighted network [32,33]. However, as unitary operators are defined over the complex field, the translation to a weighted network is not trivial. Here we apply a generator decomposition of a unitary matrix U ∈ U(N), where N is the dimension of the unitary matrix, in order to represent the unitary operator as a weighted network. The generators of U(N) are the Hermitian matrices forming its Lie algebra, so a unitary matrix U ∈ U(N) can be written in the form U = e^{−iG}, where G is a Hermitian matrix. This Hermitian matrix can be represented with real coefficients a_lm, b_lm, and c_k by decomposing G with respect to the generators λ as

$$G = \sum_{l<m} a_{lm}\,\lambda^{x}_{lm} + \sum_{l<m} b_{lm}\,\lambda^{y}_{lm} + \sum_{k} c_{k}\,\lambda^{z}_{k}, \qquad (2)$$

where the λ generators are the generalized Gell-Mann matrices [34-36], with $(\lambda^{\nu}_{lm})_{ij} = \delta_{il}\delta_{jm}\,\sigma^{\nu}_{12} + \delta_{im}\delta_{jl}\,\sigma^{\nu}_{21}$ for ν = x, y and $(\lambda^{z}_{k})_{ij} = \delta_{ik}\delta_{jk}$. Here δ_ij represents the Kronecker delta, and σ^ν_ij (ν = x, y) is the (i, j)-component of the Pauli matrices. G as a weight matrix then has three components, x, y, and z, with {a_lm}, {b_lm}, and {c_k} being the x-, y-, and z-contributions of the weight matrix, respectively.
The Hermitian matrix G obtained from the unitary matrix U is not necessarily unique.To uniquely determine G for a given unitary matrix, we employ the principal logarithm of a matrix [37] in our numerical analysis.
If A is a complex-valued matrix of dimension N with no eigenvalues on the negative real line ℝ⁻, then there is a unique logarithm X of A such that all of its eigenvalues lie in the strip {z : −π < Im(z) < π}. Here X is called the principal logarithm of A and is denoted by X = log(A). For the DTC model, we compute the Hermitian matrix for period n as G(n) = i log Û(nT, 0). To convert the weight matrix to its weight distribution, we first count how many coefficients fall in a given value window (s, s + ds] for s, ds ∈ ℝ. This gives us a histogram h(s) showing how likely the coefficients are to take a value in (s, s + ds]. In the numerical calculations, we take M = 100 segments for each coefficient set to determine the value of ds: let the support of the histogram be denoted by the sequence {s_0, s_1, …, s_{M−1}}; then, once s_0, s_{M−1}, and M are defined, we have ds = (s_{M−1} − s_0)/M. After obtaining the histogram, we have a density function ρ(s) = h(s)/(N_ν ds), where N_ν is the number of elements in the component ν = x, y, or z, that is, N_x = N_y = N(N − 1)/2 and N_z = N. We refer to this as the weight distribution of the ν-component (ν = x, y, z) (see Fig. 1b)).
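The steps above — principal logarithm, generator decomposition, and histogram density — can be sketched as follows. The sign convention b_lm = −Im G_lm follows from the λ^y generators of Eq. 2 and is an assumption of this sketch.

```python
import numpy as np
from scipy.linalg import logm

def weight_components(U):
    """Weights of G = i log(U) (principal branch): the x/y/z components."""
    G = 1j * logm(U)                         # principal matrix logarithm
    G = (G + G.conj().T) / 2                 # remove numerical anti-Hermitian part
    iu = np.triu_indices(G.shape[0], k=1)
    a = G[iu].real                           # x-component: a_lm, N(N-1)/2 elements
    b = -G[iu].imag                          # y-component: b_lm, N(N-1)/2 elements
    c = np.diag(G).real                      # z-component: c_k, N elements
    return a, b, c

def weight_density(w, M=100):
    """Density rho(s) = h(s)/(N_nu ds) from an M-bin histogram of weights w."""
    h, edges = np.histogram(w, bins=M)
    ds = edges[1] - edges[0]
    return h / (len(w) * ds), edges
```

By construction the density integrates to one, so distributions of components with different element counts N_ν can be compared on the same axes.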

B. Characterization of the weight distributions
We will show weight distributions for the DTC model in different configurations, and for other models, in Sec. III. To quantitatively characterize those weight distributions, we calculate two quantities for each weight distribution. The first is the empirical standard deviation σ_ν, given by the standard deviation of the values of the elements in component ν; for example, in the case of ν = x, $\sigma_x = \sqrt{\mathrm{var}_{lm}(a_{lm})}$.
Our second quantity is a ratio that represents how far the weight distribution reaches from its center compared with a Gaussian function approximating the weight distribution, analogous to the MP rank defined in Ref. [38]. The ratio for the weight distribution of the ν-component is given by $R_\nu = s^{(\nu,G)}_+/s^{(\nu)}$. The denominator s^{(ν)} represents how far the weight distribution reaches from its center along the horizontal axis. Since the weight distribution ρ(s) is not necessarily symmetric about s = 0, s^{(ν)} is given by $s^{(\nu)} = (s^{(\nu)}_+ - s^{(\nu)}_-)/2$, where $s^{(\nu)}_\pm$ denote the upper and lower edges of the support of ρ(s). The numerator $s^{(\nu,G)}_+$ represents how far a Gaussian distribution function, obtained by a Gaussian fit to the weight distribution, reaches from s = 0. The Gaussian function only vanishes at s = ±∞, so we introduce a cutoff. In more detail, let f(s; u, v) denote the Gaussian function given by the fit with fitting parameters u, v: f(s; u, v) = u exp(−vs²). To introduce the cutoff, we consider the crossing points between f(s; u, v) and a horizontal line at value d, that is, d = f(s; u, v). Then $s^{(\nu,G)}_+$ is given by $s^{(\nu,G)}_+ = \sqrt{\ln(u/d)/v}$. In our numerical calculations, we set d = (N_ν ds)^{−1}, which corresponds to the minimum possible nonzero height of the density function ρ(s) = h(s)/(N_ν ds). The definition of R_ν is summarized in Fig. 1c).
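A minimal sketch of the ratio computation, assuming the distribution is centered near s = 0 (as in Fig. 1c)); the fit initialization and the identification of the support edges with the sample extremes are our assumptions, not details stated in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def tail_ratio(w, M=100):
    """R = s^(G)_+ / s^(nu): Gaussian-fit reach over the empirical half-width."""
    h, edges = np.histogram(w, bins=M)
    ds = edges[1] - edges[0]
    centers = (edges[:-1] + edges[1:]) / 2
    rho = h / (len(w) * ds)                      # density function rho(s)
    s_emp = (w.max() - w.min()) / 2              # (s_+ - s_-)/2, empirical reach
    gauss = lambda s, u, v: u * np.exp(-v * s ** 2)
    (u, v), _ = curve_fit(gauss, centers, rho,
                          p0=[rho.max(), 1.0 / (2 * np.var(w))])
    d = 1.0 / (len(w) * ds)                      # cutoff: minimum nonzero height
    s_gauss = np.sqrt(np.log(u / d) / v)         # reach of the fitted Gaussian
    return s_gauss / s_emp
```

For a genuinely Gaussian sample the two reaches roughly coincide (R near unity), while a heavy-tailed sample has a much larger empirical reach and hence a small R — exactly the diagnostic used in Table I.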

C. Simulation setup for the QERC
We begin our considerations by first directly evaluating the properties of the feature map generated by our DTC dynamics using the method outlined above. Setting a computational task is not essential for this analysis; however, it is extremely useful when we later compare these properties to the performance of the QERC, as it is convenient to evaluate both at the same time and on a similar footing. In this paper, we use the well-known MNIST dataset [39], where each image has 784 (= 28 × 28) pixels. We employ PCA to reduce each image to 2L components, which can then be encoded into the initial state of the quantum reservoir of L qubits by single-qubit rotations. Finally, to optimize the parameters of the ONN, we employ the stochastic gradient descent method used in [26]. Throughout this work, our parameters are set as L = 10 and J_0T = 0.12 with α = 1.51; these are compatible with current ion-trap experiments [40]. We also set ϵ = 0.03, as the highest accuracy rate reported for the QERC was obtained with this parameter value [26].

III. QUANTUM RESERVOIR WEIGHT DISTRIBUTION
It is important to emphasize that, unless perfectly periodic, the quantum dynamics of the DTC model deviates from its initial state as it evolves in time. This allows complexity to grow in the system. To observe such complexity growth in the unitary dynamics, we evaluate the weight distribution of G(n) for various time periods: n = 2, 10, 50, and 100. From Eq. 2, we can determine the real coefficients a_lm, b_lm, and c_k characterizing G(n) for each n. The Hermitian weight matrix G(n) is equivalent to the n-period effective Hamiltonian up to the constant factor ℏ/nT. Thus, the diagonal entries of G(n) (corresponding to {c_k}) are associated with the energies of the basis states, while the off-diagonal entries (corresponding to {a_lm}, {b_lm}) are associated with the transition energies between the basis states.
Figure 2a) shows the weight distributions at different periods of the time evolution. For n = 2 (blue line), we observe very sharp peaks at s = 0 for the x and y components. As these components correspond to the off-diagonal entries of the Hermitian matrix G(n), the sharp peaks around s = 0 mean very few transitions between the basis states at this time period. However, for large periods, n = 50 (brown curve) and 100 (orange curve), the weight distributions for all the components converge to a similar shape that is approximately quadratic in the log-scaled plots (Gaussian in a linear plot). In between these two time regions, at n = 10 (green curve), the x and y components have already converged to the typical distribution, while the z component has broadened the most. This suggests a trade-off in this time regime: the stationary elements of the z component significantly suppress the effect of the x and y components. This trade-off captures the dynamics of the DTC melting slowly in time in this (ϵ = 0.03) parameter regime.
The behavior of these components can be quantitatively observed through the empirical standard deviations of the weight distributions, defined in Sec. II B. In Fig. 2b), the empirical standard deviations are plotted against the number of periods n. While the z-component is broadest at n = 10, the x- and y-components gradually acquire higher standard deviations and then converge.

A. Comparing the weight distribution
To capture the characteristics of the weight distribution of the DTC model, we first introduce Haar-measure sampling of unitary operators [25,41,42]. Haar-measure sampling can be considered to exhibit the typical complexity a quantum computer may provide, and its gate implementation is usually given through a unitary t-design [22,23]. Hence the similarities and disparities between the weight functions for these cases give us valuable insight into the DTC dynamics and its role in QERC. In our analysis, to obtain a typical distribution from Haar-measure sampling, an N × N unitary matrix U_H is created using the QR decomposition [43], where N = 2^L = 2^10 = 1024. We compare this unitary map, which we refer to as the Haar-random model, to the converged weight distribution of the DTC model. As shown in Fig. 3a) and b), the typical weight distributions are approximately Gaussian for all components x, y, and z. Here we use only one sample from the Haar-random model, since one sample, and not an average over many samples, will be used within the QERC. We do not lose generality by this choice, as discussed in Appendix D.
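A standard recipe for Haar sampling via QR, sketched below: QR of a complex Ginibre matrix alone is not Haar-distributed, so the columns are rephased by the diagonal of R (Mezzadri's correction). The paper does not specify these details, so this is a plausible reconstruction.

```python
import numpy as np

def haar_unitary(N, rng=None):
    """Sample a Haar-random element of U(N) via QR decomposition."""
    rng = np.random.default_rng() if rng is None else rng
    # complex Ginibre matrix: i.i.d. standard complex Gaussian entries
    Z = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
    Q, R = np.linalg.qr(Z)
    # phase fix: multiply each column by R_ii/|R_ii| to remove QR's sign bias
    phases = np.diagonal(R) / np.abs(np.diagonal(R))
    return Q * phases
```

Feeding such a sample through the decomposition of Eq. 2 yields the approximately Gaussian x-, y-, and z-weight distributions described above.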
We can now compare the DTC model to the Haar-random model, characterizing the weight distributions of the DTC model by two properties: broadness and tail. Although the DTC's y-component has a narrower distribution than the Haar-random model's, the broadness of the distributions for the x- and z-components is comparable, as shown by the empirical standard deviations in Table I.
However, only the DTC model has a tail in the weight distribution of the x component: a few large elements at the edge of the weight distribution. To quantitatively observe tails in the weight distribution, we calculate the ratio R defined in Sec. II B. Table I shows the averages of the ratios for the DTC and Haar models, where R denotes the average, that is, Σ_{ν=x,y,z} R_ν/3. One can see that the averaged ratio of the DTC model is smaller than that of the Haar model, which is close to unity. This implies that the weight distribution of the DTC model deviates from that of the Haar model in terms of the tail.
Next, to explore the difference associated with the tail we found in the DTC model's distribution, we employ the Cauchy distribution. The reason is as follows: in classical reservoir computation, the Cauchy distribution was used to obtain the edge of chaos, where reservoir computation should be optimal [44]. Hence it is interesting to see the properties of a feature map generated from the Cauchy distribution. The Cauchy distribution is given by

$$\mathrm{Cauchy}(x;\gamma) = \frac{1}{\pi\gamma}\,\frac{1}{1 + (x/\gamma)^2}, \qquad (7)$$

where γ is the scale parameter. Since the Cauchy distribution has a power-law tail, one would expect the resulting weight distribution to exhibit a long tail. The unitary matrix U_C for this Cauchy-random model is defined as follows. First, we generate an N × N Hermitian matrix A whose real and imaginary parts in each independent entry are drawn from the Cauchy distribution (7). Then we define the unitary matrix U_C as U_C = e^{−iA}.
Further, we set γ = 0.04 in Eq. 7 for consistency with the corresponding parameter of the Haar-random model (σ ≈ 0.04), and the size N is set as N = 2 L = 1024.
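The Cauchy-random construction can be sketched as follows; Hermitizing via (B + B†)/2 is our implementation choice, and it preserves the Cauchy marginals because the Cauchy distribution is stable under averaging.

```python
import numpy as np
from scipy.linalg import expm

def cauchy_random_unitary(N, gamma=0.04, rng=None):
    """U_C = exp(-i A), with A Hermitian and Cauchy(gamma)-distributed entries."""
    rng = np.random.default_rng() if rng is None else rng
    # independent Cauchy draws for the real and imaginary parts of each entry
    B = gamma * (rng.standard_cauchy((N, N)) + 1j * rng.standard_cauchy((N, N)))
    A = (B + B.conj().T) / 2                 # Hermitize; marginals stay Cauchy
    return expm(-1j * A)
```

In the paper N = 1024 and γ = 0.04; smaller N works identically for experimentation.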
Applying the decomposition (2) to G_C = i log(U_C), we obtain the weight distribution for each of the three components x, y, and z, depicted in Fig. 3a) and b). The weight distributions for the Cauchy-random model indeed have tails, much longer than those of the DTC model, for all the components x, y, and z. Table I shows the averaged ratio of the Cauchy-random model, which is even smaller than those of the other two models. This reflects the nature of the Cauchy distribution (7), and we will come back to this point later.

IV. RELATION BETWEEN THE QERC PERFORMANCE AND THE WEIGHT DISTRIBUTION
As we have characterized the three models through their weight distributions, let us now turn our attention to the performance of the QERC employing these three different models as its feature space. In previous work [26] it was shown that the accuracy of the QERC increases with the number of time periods of the DTC model, saturating near n = 50. This behavior of the accuracy rate can be predicted from the time evolution of the weight distributions seen in Fig. 2, as the unitary map of the DTC model acquires the typical complexity around n = 50. To illustrate this further, Fig. 4 summarizes the comparison between the accuracy rate and the weight distribution. Here we plot the accuracy rates for training (blue dots) and testing (orange downward triangles) against the time period n in the DTC model, and insert the weight distributions for n = 2, 10, 50, and 100.
The broadness of the x- and y-components of the weight distribution is essential for the quantum reservoir to achieve higher performance. The trade-off between the x, y components and the z component is reflected in the average accuracy rates. This suggests that even if the system dynamics is complex enough, within a short coherent regime the system does not evolve enough to achieve the computational power it would otherwise promise.
The complexity generated in a finite system must be bounded, and unlike unitary maps from Haar-measure sampling, the DTC model does not reach the maximum randomness allowed for the system, retaining certain tendencies in its dynamics. Next, we further investigate the effect of this difference between the models on the performance of the QERC.

A. Tails in the Distribution and the Performance
The unitary operators we characterized through the weight distributions directly serve as the feature map for the QERC.We will now explore how these different weight distributions and their associated feature maps affect the performance of the QERC with the MNIST dataset.
Table IIa) presents the accuracy rates for each model (see also Appendix A).
Before comparing the quantum feature maps, we first provide the performance of the case where the PCA components are directly fed into the ONN (without quantum feature maps). One can immediately observe that this case, denoted "PCA" in Table IIa), has lower accuracy rates in both training and testing than any of the cases with quantum feature maps. This shows that the quantum feature maps significantly help the QERC achieve high performance.
Next, we compare the accuracy rates of the quantum feature maps. In testing, the DTC (n = 100) model is comparable to the Haar-random model, whereas this is not so in training. Moreover, it is interesting to note that the Haar-random model does not give the best testing accuracy rate in this setting; the Cauchy-random model does. These observations suggest that the tail in the weight distribution of the DTC and Cauchy models contributes to the higher accuracy rate.

TABLE II. Average accuracy rates, with the associated standard deviations, of the various feature models for a) the MNIST and b) Fashion-MNIST datasets. For the Fashion-MNIST case, we picked three classes: T-shirt, Pullover, and Dress. The average and standard deviation are taken from epochs 250 to 300, and ∆acc. denotes the gap between training and testing. The label "PCA" denotes the case where the PCA components are directly fed into the ONN (without any quantum feature maps). The results for the DTC cases with/without disorder are for period n = 100. For the random models, the accuracy rates are from a specific realization.
To see the relation between the accuracy rate and the weight distribution, we show in Fig. 5a) the correlation between the testing accuracy rate and the quantities used to characterize the weight distribution: the averaged empirical standard deviation (σ) and ratio (R). These quantities correspond to the vertical and horizontal axes, respectively, and the color of the markers indicates the testing accuracy rate. One can see that the color becomes darker as the ratio R gets smaller.
To investigate this further, we introduce the t-random model, whose unitary is defined using Student's t-distribution in the same way as for the Cauchy-random model. Student's t-distribution is defined as

$$f_t(x;\nu,\gamma_t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\gamma_t\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu\gamma_t^2}\right)^{-\frac{\nu+1}{2}},$$

where Γ(·) denotes the Gamma function and γ_t is a scale parameter. The parameter ν determines how heavy the tail of the distribution is; it connects the standard Cauchy (ν = 1) and normal (ν = ∞) distributions when γ_t = 1. In our context, this parameter allows us to generate weight distributions located between those of the Haar- and Cauchy-random models in Fig. 5a), with γ_t = σ = γ = 0.04. In Fig. 5a), the stars correspond to data points averaged over ten realizations of the unitary for each value of ν (ν = 1, 2, 3, 5, 10, and 100). The error bars correspond to the standard deviations of each data point (see Appendix B for detailed results of the t-random model). One can see that both the ratio and the testing accuracy rate tend to decrease (darker markers) as ν gets smaller. Moreover, data points close to each other in the plot have similar testing accuracy rates. This simulation with Student's t-distribution illustrates the correlation between the testing accuracy rate and the tail in the weight distribution.
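The t-random model follows the same construction as the Cauchy-random one, just swapping the sampling distribution; as before, the Hermitization step is our implementation choice.

```python
import numpy as np
from scipy.linalg import expm

def t_random_unitary(N, nu, gamma_t=0.04, rng=None):
    """exp(-i A), with A Hermitian and Student-t(nu) distributed entries.
    nu = 1 reproduces the Cauchy-random model; large nu approaches Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    B = gamma_t * (rng.standard_t(nu, (N, N)) + 1j * rng.standard_t(nu, (N, N)))
    A = (B + B.conj().T) / 2                 # Hermitize
    return expm(-1j * A)
```

Sweeping ν from 1 to 100 interpolates the tail weight between the Cauchy and (approximately) Gaussian regimes plotted in Fig. 5a).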
Considering the implementability of the QERC with an effective quantum feature map, one may wonder what physical system lies close to the Cauchy- and t-random (ν = 1) models in Fig. 5a) and achieves a high testing accuracy rate, since the DTC model is far from those models and there still seems to be room for improvement. Here we consider adding disorder to the DTC model in Eq. 1; indeed, the disordered DTC (DDTC) model has a higher accuracy rate, as we will see.
In Eq. 1, we had set D_lT = 0 to consider the DTC model without disorder. That constraint can now be relaxed. Disorder in Floquet systems is considered important for suppressing thermalization and stabilizing DTCs [8,40]. Moreover, it is more realistic for such quantum systems to have a little disorder, so it is worth checking whether the QERC's performance is robust to it. We choose the disorder terms D_lT in Eq. 1 independently drawn from a uniform distribution on [0, 2π). As illustrated in Fig. 3c), the introduction of disorder changes the form of the weight distribution, as characterized by the empirical standard deviation and the ratio (see Table I). The DDTC model is thus located in the upper-left region of Fig. 5a), and one can see that it achieves a testing accuracy rate similar to those of the Cauchy- and t-random (ν = 1) models (see also Table IIa)).
So far, we have only discussed the testing accuracy; it is important now to turn our attention to the training accuracy and the difference between the testing and training accuracy rates. The difference ∆acc. is an important parameter in terms of overfitting and the generalization performance of this machine learning model. Neural networks often show the effects of overfitting [45,46], where they are so well optimized to the training data that they lose the flexibility to deal with the testing data. Generalization performance is hence an important factor in designing QNNs.
In Table II, we also provide the training accuracy and the difference ∆acc. for the various feature models we have considered. We also plot the correlation between the generalization performance and the properties of the weight distribution in Fig. 5b). One can observe that the Cauchy-random model has the smallest gap ∆acc. among the artificial models, and that the value of ∆acc. tends to be higher for larger values of the ratio. This strongly suggests that the tail in the weight distribution helps the QERC acquire generalization performance by suppressing the training accuracy rate. One can also see that the physical models (DTC and DDTC) have even smaller gaps. Therefore, from our analysis, the DDTC model gives the best generalization performance and a nearly optimal testing accuracy rate among the quantum feature maps we have shown. It is encouraging that a simple Hamiltonian system can perform at least as well as a t-designed unitary map, which makes the implementation of such QNNs much simpler and more feasible.

B. Simulations in other settings
First, we consider the optimizer for the ONN. In these numerical simulations, we used the stochastic gradient descent method as the optimizer for the ONN. This method is broadly employed in many situations, so our observations above should carry over to many scenarios. Moreover, we found the same effect of the tail with a more sophisticated optimizer, AdaGrad [47] (see Appendix C).
Next, we direct our attention to the dataset. So far, we have used the MNIST dataset to benchmark quantum feature maps and concluded that the tail in the weight distribution contributes to the testing accuracy rate and generalization performance. In this section, we examine the difference in QERC performance between the quantum feature models using another dataset. To see a difference in generalization performance, we need to choose the dataset carefully. With a simple dataset, such as the 2D isotropic Gaussian samples demonstrated in Ref. [26], all the feature models would reach high accuracy rates, and it would be hard to see any difference between our models. In contrast, with a hard dataset, such as the full Fashion-MNIST [48], all our models may perform poorly, leaving no room to discuss generalization.
For our calculations, we therefore choose the Fashion MNIST dataset restricted to a few classes, since classifying all ten classes is too hard for the QERC. We picked three classes: T-shirt, Pullover, and Dress, and show the accuracy rates in Table IIb). One can see the same trend we obtained with the MNIST dataset: the models with the tail tend to have higher testing accuracy rates and better generalization. This examination suggests that choosing a tailed feature map is an effective technique to push the QERC performance further when the performance is good but not perfect.
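The class restriction described above amounts to filtering the dataset by label and relabeling the survivors. A minimal sketch with a synthetic stand-in for the data (the Fashion MNIST label indices 0, 2, and 3 are the standard ones for T-shirt/top, Pullover, and Dress):

```python
import numpy as np

def select_classes(images, labels, keep=(0, 2, 3)):
    """Keep only the samples whose label is in `keep`, and relabel
    them 0..len(keep)-1 in the order given."""
    keep = list(keep)
    mask = np.isin(labels, keep)
    remap = {c: i for i, c in enumerate(keep)}
    new_labels = np.array([remap[c] for c in labels[mask]])
    return images[mask], new_labels

# toy stand-in for a labeled image dataset
rng = np.random.default_rng(0)
X = rng.random((10, 4))
y = np.array([0, 1, 2, 3, 4, 0, 2, 3, 9, 2])
Xs, ys = select_classes(X, y)
```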

V. DISCUSSION AND CONCLUSION
In this work, we have performed a network analysis on the unitary maps used in the QERC. Such a unitary map U can be converted to a weighted Hermitian matrix G = i log U that is characterized by a set of three weight distribution functions. We observed that the weight distribution for the DTC model grows in time until near n = 50, where it converges to its typical shape. We then compared the DTC model weight distribution against those associated with the Haar-random and Cauchy-random models. The DTC and Haar-random models are similar with respect to the Gaussian-like broadness of their weight distributions, whereas the DTC and Cauchy-random models are similar in terms of the tails in their respective weight distributions. This suggests that the unitary map for the DTC with period over n = 50 is nearly as complex as that of the Haar-random model, yet it still has power-law characteristics and so is not totally random.
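The conversion U → G = i log U can be sketched numerically via the eigendecomposition of U. The split of G into three weight sets (real and imaginary off-diagonal parts plus the diagonal) is our reading of the x-, y-, and z-component labels used in the figures, so treat that part as an assumption:

```python
import numpy as np

def hermitian_generator(U):
    """G = i log U from the eigendecomposition of the unitary U.
    U = V diag(e^{i theta}) V^dag gives G = V diag(-theta) V^dag,
    which is Hermitian since the eigenphases theta are real."""
    eigvals, V = np.linalg.eig(U)
    theta = np.angle(eigvals)          # principal branch of the logarithm
    return V @ np.diag(-theta) @ V.conj().T

def weight_distributions(G):
    """Split G into three weight sets for the network view (our
    assumed reading: x ~ Re off-diagonal, y ~ Im off-diagonal,
    z ~ diagonal)."""
    idx = np.triu_indices(G.shape[0], k=1)
    a = G[idx].real                    # x-like weights a_lm
    b = G[idx].imag                    # y-like weights b_lm
    c = np.diag(G).real                # z-like weights c_k
    return a, b, c

# sanity check on a random unitary obtained from a QR decomposition
rng = np.random.default_rng(1)
M = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))
U, _ = np.linalg.qr(M)
G = hermitian_generator(U)
```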
Next, in comparing the performance of the QERC on the MNIST and Fashion MNIST datasets, we found that the power-law tendency (tail) in the unitary map contributes to a high testing accuracy rate by suppressing the training accuracy rate. This indicates that, at least for certain image classification problems, this tendency in the feature map would help QNNs to acquire better generalization performance. Although similar observations have been noted in classical neural network models, including reservoir computation [20,21,44,49], this is the first time it has been observed in the QNN scenario.
Not only could the properties found here serve as a guideline for designing more effective feature maps in the future, but our network approach to quantum machine learning models could also be used to investigate other useful network properties of quantum feature spaces. Furthermore, the approach could provide a physical interpretation of information processing in quantum machine learning schemes. As a long tail in the weight distribution implies strong connectivities between certain basis states in the network dynamics, our approach may provide physical insights into the performance of quantum machine learning.
Finally, the fact that the QERC can perform well with a feature map generated by a simple Hamiltonian model, such as the DTC with disorder, is encouraging for the QERC's implementation. It strongly suggests that such models can significantly reduce the overhead of the feature map in many other QNNs.
On the other hand, another overhead remains: the reconstruction of the probability amplitudes of all computational basis states may require an exponentially large number of samples of the same initial state. However, it is non-trivial whether such a precise reconstruction is needed for the ONN; we only need to acquire enough information for the task performed by the ONN. Hence, it remains an open question how much the measurement overhead can be reduced while maintaining good performance with a simple Hamiltonian model used for the quantum feature map.
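The sampling overhead can be illustrated with a toy estimator: measuring the output state S times in the computational basis yields empirical frequencies whose statistical error shrinks only as 1/√S, which is what makes full amplitude reconstruction expensive. The state below is a random stand-in, not an actual reservoir output:

```python
import numpy as np

def estimate_probabilities(state, shots, rng):
    """Empirical computational-basis probabilities of `state` from a
    finite number of measurement shots (multinomial sampling of
    the Born probabilities |amplitude|^2)."""
    p = np.abs(state) ** 2
    p = p / p.sum()                    # guard against float round-off
    counts = rng.multinomial(shots, p)
    return counts / shots

rng = np.random.default_rng(7)
# hypothetical 4-qubit output state (16 amplitudes), normalized
amp = rng.normal(size=16) + 1j * rng.normal(size=16)
state = amp / np.linalg.norm(amp)
p_hat = estimate_probabilities(state, shots=1000, rng=rng)
```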
) being the maximum (minimum) value of s in the ν-component. Next, one can consider a Gaussian fitting function f(s; u, v) = u exp(−vs²), which allows us to define s₊^(ν,G) := √(ln(u/d)/v), where d corresponds to the smallest possible nonzero height of the density function ρ(s) = h(s)/(N_ν ds), that is, d = (N_ν ds)⁻¹.
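The cutoff definition can be checked directly: solving u exp(−v s²) = d for s gives s₊ = √(ln(u/d)/v), the point where the Gaussian fit drops to the minimum resolvable density. A minimal sketch (the parameter values are arbitrary illustrations):

```python
import math

def gaussian_cutoff(u, v, n_nu, ds):
    """Point s_+ at which the Gaussian fit f(s) = u*exp(-v*s^2)
    drops to the minimum resolvable density d = 1/(n_nu * ds)."""
    d = 1.0 / (n_nu * ds)
    return math.sqrt(math.log(u / d) / v)

# illustrative parameters: fit height u=4, width parameter v=1,
# N_nu=100 weights binned with ds=0.01, so d = 1
s_plus = gaussian_cutoff(u=4.0, v=1.0, n_nu=100, ds=0.01)
```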
FIG. 2. a) Convergence of the weight distribution in the DTC model. From the left to right panels, the weight distribution functions for a_lm, b_lm, and c_k are depicted, where the colors correspond to different periods: the blue (dot), green (upward triangle), brown (downward triangle), and orange (filled circle) curves are for n = 2, 10, 50, and 100, respectively. b) The n-dependence of the empirical standard deviations, σx = √var_lm(a_lm) (blue with diamonds), σy = √var_lm(b_lm) (orange with hexagons), and σz = √var_k(c_k) (green with pentagons).

FIG. 3. Comparison of the Haar-random, Cauchy-random, and DTC models for n = 100 in a). In each case, from the left to right panels, the distributions of the x-, y-, and z-components are depicted, respectively. In b), the weight distributions of those models are shown for the range [−0.75, 0.75]. Finally, in c), a comparison is shown between the DTC model with and without disorder for n = 100.

FIG. 4. Average accuracy rates for training and testing, with the associated standard deviations, against the period in the DTC model. The blue (dot) and orange (downward triangle) curves correspond to training and testing, respectively. At each data point, the average and the standard deviation are taken over epochs 250 to 300 of the ONN optimization.

FIG. 5. Scatter plot of the models we consider in this paper against σ and R, with colored markers indicating a) the testing accuracy rate and b) the gap ∆acc. between the training and testing accuracy rates.

FIG. 8. Weight distributions of the Haar-random model a), the Cauchy-random model b), and the disordered DTC model c). The total number of realizations in each case is 10. The colored curves correspond to the respective curves in Fig. 3 a,b,d). The other realizations are plotted as gray curves.