Tensor network to learn the wavefunction of data

How many different ways are there to handwrite the digit 3? To quantify this question, imagine extending the MNIST dataset of handwritten digits by sampling additional images until they start repeating. We call the collection of all resulting images of the digit 3 the "full set." To study the properties of the full set we introduce a tensor network architecture which simultaneously accomplishes both classification (discrimination) and sampling tasks. Qualitatively, our trained network represents the indicator function of the full set. It can therefore be used to characterize the data itself. We illustrate this by studying the full sets associated with the digits of MNIST. Using a quantum mechanical interpretation of our network we characterize the full set by calculating its entanglement entropy. We also study its geometric properties such as mean Hamming distance, effective dimension, and size. The latter answers the question above: the total number of black and white threes written MNIST style is $2^{72}$.


Introduction
Generalization is a remarkable ability of supervised learning algorithms to learn patterns underlying training data and subsequently perform well on new datasets. It reflects both the potency of the algorithm and a certain simplicity of the training data, namely the presence of patterns that might be apparent to a human eye but are usually very difficult to quantify. In contrast, datasets without underlying patterns, such as fully random or ad hoc ones, can be learned but cannot be generalized [1,2]. To better understand when generalization is possible, and to inform the development of more efficient supervised learning algorithms, it would be important to characterize the patterns that underlie various datasets of interest. In this context a training set should be thought of as a small subset of the "full set of data", which includes all possible hypothetical data exhibiting the given patterns.
We introduce a novel tool to study the full set: a tensor network sampler-discriminator, which is a function $\Psi(x)$ on the set $I$ of all $N_I = 2^{784}$ possible images. Qualitatively, the function
$$P(x) = \theta\left(|\Psi(x)|^2 - \epsilon\right) \qquad (1)$$
with some appropriate threshold $\epsilon$ is the indicator function of the full set. To emphasize that $\Psi$ characterizes the data itself, and that its properties are robustly independent of the tensor network architecture, we call it the wavefunction of data. Using the quantum mechanical interpretation of $\Psi(x)$ we can characterize the full set by calculating its entanglement entropy. We also study geometric properties of the full set such as mean Hamming distance, effective dimension, and size. The latter is simply the approximate total number of images recognized by our network as depicting the given digit. In contrast to the first two properties, which can be studied using the training set alone, size is a global property of the full set.
Before we proceed with the results, we would like to explain why a tensor network sampler-discriminator/classifier is an appropriate architecture to define the full set via (1). In recent years tensor networks, such as Matrix Product States (MPS) and Tensor Trains, have been actively used to build various classification [5,6,7,8] and generative [9,10] algorithms. They demonstrate robust performance on par with advanced CNN architectures [7,11].
In our case we train $\Psi$ for a particular digit $i$. Then the value $P(x) = 1$ means $\Psi$ recognizes $x$ as an image of $i$. Good quality of generalization means our network reliably recognizes images of $i$ outside of the training set. Even in the ideal case with perfect quality of generalization there could still be images of other digits, or even noise, recognized by our network as $i$. This is illustrated in the left panel of Fig. 1. There the gray square region represents all possible $N_I = 2^{784}$ images. The red area $R$ represents images which our network "recognizes" as $i$, i.e. $|\Psi(x)|^2$ exceeds $\epsilon$ for $x$ from this area. The blue disk, denoted $F$, represents the full set of images depicting the given digit $i$. It is a subset of the red area $R$.

Figure 1: On both panels $I$ is the set of all $2^{784}$ images and $F$ is the full set of images of digit $i$. Left panel: an ideal discriminator $\Psi$ recognizes all images of digit $i$, but may also recognize as $i$ images of other digits or noise. This means $|\Psi(x)|^2 \geq \epsilon$ for all $x \in F$, as well as for some set $x \in R$ depicted in red. Right panel: all images sampled by an ideal sampler $\Psi$ are images of $i$. This means $\Psi$ has support on a subset of the full set, $S \subset F$.
Tensor network architectures allow for an efficient evaluation of $|\Psi(x)|^2$ as a function of some components $x_\alpha$ while the values of the other components $x_\beta$ are fixed. A tensor network can therefore be used for sampling: pixels are sampled sequentially, using the conditional probability distribution specified by $\Psi(x)$. This idea has gained traction recently, and several such architectures were introduced in [9,10]. Clearly, only images with large values of $|\Psi(x)|^2$ can be sampled. Provided our sampler $\Psi(x)$ achieves good quality, i.e. ideally all sampled images depict $i$, we can think of $\Psi(x)$ as a function with support on a subset of the full set. This is illustrated in the right panel of Fig. 1. There the orange subset $S$ of the blue disk represents images $x$ for which $|\Psi(x)|^2$ is sufficiently large to be sampled, while for all other $x \notin S$, $|\Psi(x)|^2 \approx 0$.
The idea of the sampler-discriminator is to train an MPS-based tensor network $\Psi(x)$ which accomplishes both tasks. Schematically, we minimize the objective function
$$L = -\frac{1}{N_T}\sum_{x \in T} \ln|\Psi(x)|^2, \qquad (2)$$
where $T$ represents the training set, a set of $N_T$ images of digit $i$. It is a small subset of the full set $F$. Importantly, our architecture enforces the wavefunction normalization
$$\sum_{x \in I} |\Psi(x)|^2 = 1. \qquad (3)$$
As a result, decreasing the loss function (2) automatically decreases the value of $|\Psi(x)|^2$ for $x$ outside of $T$. Assuming $T$ is approximately uniformly distributed within $F$ and $\Psi(x)$ changes smoothly, we may expect $|\Psi(x)|^2$ to decrease mostly outside of $F$, while inside $F$ it would remain relatively large. The latter behavior would assure generalization of the discriminator: the value of $|\Psi(x)|^2$ for $x \in F$ would exceed a certain threshold. The former property, smallness of $|\Psi(x)|^2$ for $x \notin F$, assures good quality of sampling. We thus conclude that a network $\Psi$ which simultaneously accomplishes both discrimination and sampling with high quality has support on $F$, with (1) being its indicator function.
In practice, decreasing the loss function during training will eventually lead to overfitting, when $|\Psi(x)|^2$ is large for $x \in T$ but not necessarily for $x \in F$. We therefore stop training as soon as the quality of discrimination/classification begins to decline after reaching its maximal value. The logic outlined above is schematic; we justify it a posteriori by examining the quality of recognizing (classifying) and sampling achieved by the trained $\Psi$. Further details of the network architecture and the training process are described below in Methods.
Ideally, for the trained network, $P$ defined in (1) is the indicator function of the full set: $|\Psi(x)|^2$ exceeds a certain threshold for $x \in F$ and plunges below it for $x \notin F$. It therefore reflects the data itself rather than peculiarities of the architecture or the training process. To justify this claim we show that certain properties of $\Psi(x)$, such as the quality of discriminating/classifying and sampling, the typical value of $|\Psi(x)|^2$ for $x \in F$, the value of the entanglement entropy associated with $\Psi(x)$, etc., are not sensitive to the MPS bond dimension or the initialization seed. This confirms our main conclusion that the proposed architecture provides a novel way to quantitatively characterize the data itself.

Results
The core of our construction is a real Matrix Product State tensor network in the canonical form [12]. Mathematically it is a real-valued function $\Psi(x)$, where $x$ is a vector of $28^2 = 784$ binary variables $x_\alpha$. The canonical form imposes the normalization condition (3). We train the network by minimizing the loss function (2) via gradient descent, where the training set $T$ is the set of black and white MNIST images of digit $i$. The corresponding tensor network is labeled $\Psi_i$.
As the learning process proceeds, the quality of sampling by $\Psi_i$ gradually grows: the network remembers images from the training set and tries to replicate them. This is shown in the left panel of Fig. 2. The quality of recognizing digit $i$ for images from the test set (calibration of the threshold is discussed in Methods) grows initially, but then may decay slightly due to overfitting. Similar behavior is exhibited by the quality of classification, for which all ten $\Psi_i$ must be trained. This is shown in the right and central panels of Fig. 2. Overfitting becomes more pronounced when the bond dimension $D$ of the tensor network grows. To prepare the network of interest, which would simultaneously accomplish both sampling and discrimination/classification tasks, the training process is stopped as soon as the quality of discrimination/classification reaches its maximum. For sufficiently large $D \gtrsim 100$ this happens already after a few epochs.
We now demonstrate that the core properties of a properly trained $\Psi_i$ are largely independent of the bond dimension $D$, provided the latter is sufficiently large, $D \gtrsim 30$. To begin with, we study how the quality of sampling and classification depends on the bond dimension. The quality of classification is the maximal value from the central panel of Fig. 2, since we stop training at that point. The results for sampling and classification for different $D$, shown in Fig. 3, confirm that the quality remains essentially the same in a wide range of bond dimensions. It is also not sensitive to the initial seed.
Next we discuss to what extent (1) defines the characteristic function of the full set of images of a given digit $i$. We also address the question of the size $N_F$ of the full set, a global property which cannot be deduced directly from the training dataset. In what follows we focus on $i = 3$, while results for the other nine digits are qualitatively similar.
Throughout, $E(x) := -\ln|\Psi(x)|^2$ is the energy of an image and $e^{S(E)}dE$ is the number of images with energy between $E$ and $E + dE$. In these notations the normalization condition (3) becomes unity of the "partition function" at unit temperature,
$$\sum_{x \in I} e^{-E(x)} = \int dE\, e^{S(E)-E} = 1.$$
We first discuss the distribution of energies $E(x)$ of the training set, which we denote $\rho_{tr}(E)dE$, shown in blue in Fig. 4. Minimization of the loss function (2) is the minimization of the energy averaged over $\rho_{tr}$. The shape of $\rho_{tr}$ is not robust, and the mean value of the loss function $\langle E\rangle_{tr}$ decreases with the increase of $D$. What remains essentially the same is the energy of the lowest states, which we define by averaging $|\Psi(x)|^2$ over the training set,
$$E_0 := -\ln\langle e^{-E}\rangle_{tr}. \qquad (6)$$
The distribution $\rho_{tr}$ should be compared with the distribution of energies for images sampled by the network itself. The probability of sampling an image $x$ is equal to $|\Psi(x)|^2$, and therefore the distribution of $E$ for the sampled images is the Gibbs distribution at unit temperature,
$$\rho_{sm}(E) = e^{S(E)-E}. \qquad (7)$$
It is shown in red in Fig. 4. Naturally, $\langle E\rangle_{sm}$ is larger than $\langle E\rangle_{tr}$, i.e. the value of the loss function averaged over the set of sampled images is larger than the one for the training set. The shape of (7) also changes with $D$. At the same time the energy of the lowest states is robust and matches (6) with good precision,
$$-\ln\langle e^{-E}\rangle_{sm} \approx E_0.$$
Moreover, with good accuracy it is equal to the energy of the ground state, $E_0 \approx E_g \equiv \min_x E(x)$, where the minimization can go over the training set $x \in T$ or the set of sampled images $x \in S$. From $e^{S} = \rho_{sm}\, e^{E}$ we find that $e^{S}$ grows rapidly with $E$, at least for energies around $E \sim \langle E\rangle_{sm}$.
In other words, there are exponentially more images $x$ with larger values of $E(x)$. As $E$ grows, the quality of sampled images deteriorates. Qualitatively we can explain this as follows. Low-energy states $x$ with $E(x) \approx E_0$ are the high quality "neat" images of digit 3, which will be recognized as such with almost 100% confidence. Each neat image gives rise to many more "corrupted" images, which can still be recognized as 3, albeit with smaller confidence. These are the images with larger values of $E$. As the level of corruption grows, so does the total number of such images, and their typical $E$ increases. This is demonstrated in Fig. 5, where we show typical sampled images with three different values of $E$.
From this discussion it is clear that there is no sharp value of the threshold $\epsilon$ that defines the boundary of the full set. The size of the full set, i.e. the total number of images $x$ for which $P(x) = 1$, is dominated by the images with $E \approx -\ln\epsilon$, and grows roughly as $e^{E} \approx 1/\epsilon$.
To unambiguously define the size of the full set, we define it to include only neat images of 3, in which case the threshold, which we will call $\varepsilon$, can be taken very close to $E_0 \approx E_g$. We propose a way to fix $\varepsilon$ in Methods, but provided $\Delta E = \varepsilon - E_g$ is small enough, at leading order the total size of the full set will be given by the exponent, while $\Delta E$ will control the subleading term,
$$N_F \sim e^{E_0}. \qquad (10)$$
The number $N_F$ can be interpreted both as the leading exponent controlling the size of the full set (the total number of neat images of 3) and as the number of images $M \approx N_F$ which need to be sampled before there are repetitions, i.e. an image sampled twice. The latter interpretation follows from equating the total number of sampled images with given energy $E$, $M\rho_{sm}(E)$, and the total number of images $e^{S(E)}$ with this energy. Understood as an equation on $M$, it yields $M = e^{E}$, where $E$ should be larger than $E_g \approx E_0$. Minimization of $M$ over all possible values of $E$ readily gives (10). An immediate question is how robust the latter interpretation is, given that $\rho_{sm}$ depends on the network architecture. We test it by sampling images with one trained network, with bond dimension $D$, and evaluating $E(x)$ using another trained network, with bond dimension $D'$. Alternatively, we sample images using a properly trained GAN. In all cases the resulting $\rho_{sm}$ have approximately the same value of $E_0 = -\ln\langle e^{-E}\rangle_{sm}$, and therefore $N_F \sim M \sim e^{E_0}$ remains the same. The comparison of $E_0$ evaluated with $\Psi_3$ for different sets (training, test, sampled with the help of $\Psi_3$ itself, and sampled with an auxiliary GAN) is shown in Fig. 6.
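The counting argument above can be spelled out in a few lines. This is only a restatement of the relations already given in the text, using the Gibbs form of $\rho_{sm}$; no new assumptions enter the derivation itself.

```latex
\begin{aligned}
M\,\rho_{sm}(E) &= e^{S(E)}, \qquad \rho_{sm}(E) = e^{S(E)-E} \\
\Longrightarrow\quad M(E) &= e^{E}, \\
\min_{E \,\ge\, E_g} M(E) &= e^{E_g} \approx e^{E_0}.
\end{aligned}
```

As a rough consistency check, and assuming the size quoted in the abstract is to be read as $e^{E_0} = 2^{72}$, the corresponding energy would be $E_0 = 72 \ln 2 \approx 49.9$ in natural units.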
At this point we would like to conclude that the energy of the low-lying states $E_0$ is a robust characteristic of the full set which defines its size, thus answering the question from the abstract. Here the full set is defined to include only neat images of 3, and similarly for other digits. The results for the size $V$ defined in (10) are summarized in Table 1.
With the threshold taken this close to $E_0$, however, $P(x)$ is not quite the indicator function of the full set we hoped it would be. Looking at the distribution $\rho_{test}$ of energies $E(x)$ for the images from the test set, we immediately find many neat images of 3 with energies of order $\langle E\rangle_{test}$, which are significantly larger than $E_0$. This is clear from Fig. 4, where $\rho_{test}$ is shown in green. This indicates not a conceptual flaw but a certain deficiency in how our network was trained. We now argue that, for an idealized properly trained network, typical values of $E(x)$ for images from both training and test sets should be around $E_0$. To confirm this we retrain our network after doubling the training set with GAN-generated images. As expected, as the size of the training set $N_T$ increases both $\langle E\rangle_{tr}$ and $\langle E\rangle_{test}$ decrease, but the value of $E_0 = -\ln\langle e^{-E}\rangle$ defined with the help of any set (train, test, or sampled) remains robust. The resulting picture is as follows. At leading order the total size of the full set is given by (10) and is accessible to a network trained with the help of MNIST, while the number of neat images of 3 which our network misclassifies by assigning $P(x) = 0$ is substantially smaller than $e^{E_0}$.
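The quantity $E_0 = -\ln\langle e^{-E}\rangle$ is dominated by the lowest energies in the set, and a direct evaluation of $e^{-E}$ can underflow for large $E$. The following is a minimal NumPy sketch of a numerically stable evaluation; the function names are illustrative and not taken from the paper, and the conversion to bits assumes the leading-order relation $N_F \sim e^{E_0}$.

```python
import numpy as np

def lowest_state_energy(energies):
    """E_0 = -ln <exp(-E)>, shifted by the minimum energy for stability."""
    e = np.asarray(energies, dtype=float)
    e_min = e.min()
    # <exp(-E)> = exp(-e_min) * <exp(-(E - e_min))>
    return float(e_min - np.log(np.mean(np.exp(-(e - e_min)))))

def log2_size(e0):
    """log2 of e^{E_0}, i.e. the size estimate expressed in bits (assumption: N_F ~ e^{E_0})."""
    return e0 / np.log(2.0)

# Hypothetical energies of a handful of images, in natural units.
example_e0 = lowest_state_energy([50.0, 60.0, 70.0])
```

By Jensen's inequality the result always lies between the minimum and the mean of the supplied energies, which matches the observation in the text that $E_0$ stays close to $E_g$ even when the mean energy of the set is much larger.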
To further characterize the full set geometrically, we evaluate its mean distance and effective dimension, defined with the help of the Hamming distance. Individual images $x$ are binary strings, and the Hamming distance $d(x_1, x_2)$, defined as the number of components in which $x_1$ and $x_2$ differ, provides a simple notion of distance between them. Clearly, the Hamming distance is primitive in the sense that it does not reflect how similar or different the essence of the images would be to a human observer. Nevertheless, the full set, understood as the subset of vertices $x$ of a unit cube satisfying $P(x) = 1$ and equipped with the Hamming distance, exhibits well-defined coarse-grained geometric properties. The mean distance $\langle d(x_1, x_2)\rangle$ between two random images of digit $i$, taken either from the train/test or from the sampled sets, is substantially smaller than that between two completely random images with the same mean number of black and white pixels. This is demonstrated in Table 1, where we show results for all ten digits.
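The mean Hamming distance of a set of binary images is straightforward to compute. The sketch below is illustrative only: the function names are hypothetical, and the random baseline assumes independent pixels that are black with a fixed probability, which is one possible reading of "completely random images with the same mean number of black and white pixels".

```python
import numpy as np

def mean_hamming_distance(images):
    """Mean pairwise Hamming distance; images has shape (K, n) with entries 0/1."""
    x = np.asarray(images, dtype=np.int64)
    counts = x.sum(axis=1)
    # For binary vectors: d(a, b) = |a| + |b| - 2 a.b
    dist = counts[:, None] + counts[None, :] - 2 * (x @ x.T)
    upper = np.triu_indices(x.shape[0], k=1)  # each unordered pair once
    return float(dist[upper].mean())

def random_baseline(n_pixels, mean_black):
    """Expected distance between two independent random images whose pixels
    are black with probability q = mean_black / n_pixels (assumed model)."""
    q = mean_black / n_pixels
    return 2.0 * n_pixels * q * (1.0 - q)

toy_mean = mean_hamming_distance([[0, 0], [1, 1], [0, 1]])
```

Comparing `mean_hamming_distance` on images of one digit with `random_baseline` evaluated at the same average number of black pixels gives the kind of contrast described in the text.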
To evaluate the effective dimension of the full set, we use the standard approach of [13,14]. For a set of $K$ images $x_a$ we define $d_{min}(x_a) = \min_{b \neq a} d(x_a, x_b)$, the distance from $x_a$ to its nearest neighbor in the set, and then study how the mean value $d = \langle d_{min}(x_a)\rangle$, averaged over the set of $x_a$, scales with the set size,
$$d(K) \propto K^{-1/\Delta}.$$
Here $\Delta$ is the effective (or fractal) dimension. A linear fit of $d(K)$ in log-log coordinates is shown in Fig. 8. We focus on neat images of digits $i$. Therefore, we consider subsets of the train/test and sampled sets for which $E(x_a)$ is close to $E_0$. In both cases we obtain similar results, indicating that the full sets of digits have well-defined effective dimensions. The results are shown in Table 1. We would like to note that, unlike the size of the full set, which is a global property requiring knowledge of the whole full set, the mean distance and the effective dimension can be inferred directly from the train/test set. Our results for $\Delta$ are compatible with previous studies of the effective dimension [13,14], which used the Euclidean distance in conjunction with the original MNIST. This suggests that rendering images black and white does not drastically affect the geometric properties of the full set.
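A nearest-neighbor scaling estimate of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [13,14]: subsets of size $K$ are drawn at random, the mean nearest-neighbor Hamming distance is averaged over a few draws, and $\Delta$ is read off from the slope of a log-log fit. All function names and the toy dataset are hypothetical.

```python
import numpy as np

def hamming_matrix(x):
    """All pairwise Hamming distances between rows of a 0/1 array."""
    x = np.asarray(x, dtype=np.int64)
    counts = x.sum(axis=1)
    return (counts[:, None] + counts[None, :] - 2 * (x @ x.T)).astype(float)

def mean_nn_distance(x):
    """Mean distance from each point to its nearest neighbor in the set."""
    dist = hamming_matrix(x)
    np.fill_diagonal(dist, np.inf)  # exclude the point itself
    return dist.min(axis=1).mean()

def effective_dimension(points, sizes, rng, repeats=20):
    """Fit log d(K) = -(1/Delta) log K + const and return Delta."""
    points = np.asarray(points)
    log_k, log_d = [], []
    for k in sizes:
        d = np.mean([
            mean_nn_distance(points[rng.choice(len(points), size=k, replace=False)])
            for _ in range(repeats)
        ])
        log_k.append(np.log(k))
        log_d.append(np.log(d))
    slope, _ = np.polyfit(log_k, log_d, 1)
    return -1.0 / slope

# Toy check: "thermometer" codes form a one-dimensional chain in Hamming space,
# since rows j and j' differ in exactly |j - j'| positions.
rng = np.random.default_rng(0)
codes = np.tril(np.ones((1001, 1000), dtype=np.int8), k=-1)
toy_delta = effective_dimension(codes, sizes=[20, 40, 80, 160], rng=rng)
```

On the toy chain the estimate should come out close to one, up to sampling noise and edge effects; on real images the range of $K$ over which the fit is linear has to be checked by eye, as in Fig. 8.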
As an application of our architecture, we evaluate the entanglement entropy (EE) by interpreting $\Psi$ as a quantum-mechanical wavefunction. For a tensor network, its maximal EE determines its expressiveness. In the context of quantum physics, EE is a central quantity which measures the amount of information shared between different parts of the system [15]. In particular, it rigorously bounds the classical mutual information associated with a bi-partition [16]. The popularity of EE has spread from physics into machine learning, with several different groups recently discussing it in the context of tensor-based architectures [17,18,19,20,21]. For an MPS architecture it is natural to discuss the EE of bi-partitions, $S_k$, where all $n = 784$ pixels are split into two groups of $k$ and $n - k$ pixels correspondingly, see Methods. The resulting Page curve is shown in Fig. 9. We take the EE averaged along the "plateau" region of bi-partitions with $k$ ranging between 200 and 600, for which $S_k$ is approximately the same. This corresponds to splitting the image horizontally into two halves. The averaged EE, which we denote by $S$, slowly grows with epoch, which is expected: as the tensor network tries to better fit the training data it needs more expressiveness, associated with larger entanglement. Notably, after a few epochs $S$ becomes essentially independent of the initial seed. We stop training $\Psi$ when it exhibits maximal quality of discrimination/classification. The averaged $S$ for such $\Psi$ as a function of bond dimension is shown in the left panel of Fig. 7. It quickly grows for small $D$ and becomes approximately constant for $D \gtrsim 100$. The robust independence of $S$ from $D$ and the initial seed is a further confirmation that the EE of a trained network is a characteristic of the data itself, not of the network architecture. This is the crucial difference between our work and previous studies of the EE in the context of tensor network algorithms. Our value of $S$ should be contrasted with the maximal EE $S_{max} = 2 \ln D$ for a tensor network with bond dimension $D$, and with that of a random network with all MPS tensors drawn from the unitary ensemble, $S_{random} \sim \ln D$, see Fig. 7.
The relatively small value of $S$, and hence of the mutual information between the upper and lower halves of the images, for all ten digits indicates that there is a small number of ways to continue the image of a given digit if half of it is known. Schematically, this means there is a finite number of styles in which to handwrite any given digit $i$: once the upper part of the image is given, it fixes the digit itself and its style within a range of a few possibilities. This interpretation is corroborated by a positive correlation between the value of the entanglement, shown in Table 1, and the sizes of the full sets $N_F$, see eq. (10). The logic here is that a larger number of styles will be reflected in a larger value of $N_F$.
The EE provides an upper bound on the mutual information, an important information-theoretic property characterizing the data. Recently, the mutual information and the entanglement entropy of data in the training set have been studied in [21,23,24,25]. Our work provides an alternative, conceptually better way to evaluate it, as it is based on the full set rather than the training set alone.

Discussion
In this paper we introduced a tensor network architecture to learn the wavefunction of data $\Psi$. We introduced the notion of "the full set of data", the collection of all hypothetical data exhibiting the same underlying pattern. Our tensor network provides a practical way to learn and subsequently characterize the full set by defining its indicator function $P(x)$, see eq. (1). We have trained $\Psi$ using the black and white version of MNIST and demonstrated that its core properties are independent of the network parameters, which confirms that they are characteristics of the data itself. Using $\Psi$ we have estimated the sizes of the full sets of individual digits, i.e. the total number of black and white MNIST-style images depicting a particular digit $i$. The results are shown in Table 1. To further illustrate the utility of $\Psi$ as a vehicle to study the data itself, we have calculated the entanglement entropy, which upper bounds the mutual information, associated with splitting images into two parts. The results are also shown in Table 1. The entanglement entropy/mutual information of MNIST images is small, which indicates a relatively small number of different styles of handwriting.
The full set is a concrete, learnable cousin of an abstraction called the manifold of data in the series of recent papers [26,27,28]. The manifold of data, by definition, requires an idealized "infinite data limit" in which the training set grows indefinitely. This is in contrast with our approach, which is suited for practically available datasets. We have seen in the case of MNIST digits that certain images from the train/test set fall outside of the full set defined via the indicator function (1). We argued that this problem goes away as the size of the training set grows. In this case the distributions $\rho_{tr}$, $\rho_{sm}$, and $\rho_{test}$ have smaller support and eventually, in the infinite data limit, collapse to a narrow vicinity of $E_0$. This is the limit in which the full set, which would strictly include all images of a given digit, would become the manifold of data of refs. [26,27,28]. The rate with which the values of the loss function $\langle E\rangle_{tr} - E_g$ and $\langle E\rangle_{test} - E_g$ decrease with $N_T$, the size of the training set, as well as their dependence on the bond dimension $D$, should presumably follow the universal scaling laws outlined in [26,27,28]. Verifying this would be an interesting problem.
One of the most important open questions of machine learning is to understand why certain datasets admit generalization, as is the case for virtually any visual dataset exhibiting a pattern recognizable by a human observer. It would be a substantial step forward to quantitatively characterize this phenomenon by explaining why generalization is possible. The study of mutual information/entanglement entropy, initially motivated by the narrower question of gauging the suitability of tensor network-based architectures, is a first step in this direction. It is clear, though, that a much more comprehensive characteristic of data is necessary to understand whether it admits generalization and what the best way to achieve it is. Our network is a novel tool to characterize the full set of data globally through the boolean function $P$, in the present case defined on a unit cube of dimension $n = 784$. We surmise that boolean function complexity [29] could be the right language to characterize the possible efficiency of generalization in quantitative terms. Furthermore, drawing on the connection with quantum mechanics, we believe the circuit complexity of $\Psi$, interpreted as a quantum-mechanical wavefunction, could be an important characteristic of the patterns underlying the initial dataset.

MPS training procedure
We train a separate MPS $|\Psi_i\rangle$ for each label $i$ in the dataset ($i \in \{0, \dots, 9\}$ in the case of MNIST). Our training procedure is similar to [9,10], with the key difference being the use of the tangent-space gradient optimization of [30]. We start by mapping training samples $x$ to product states $|x\rangle = \otimes_\alpha |x_\alpha\rangle$ by representing each black or white pixel $x_\alpha$ with the "down" $|0\rangle$ and "up" $|1\rangle$ states. With this representation of data samples we can define the probability of a given sample in accordance with the Born rule as
$$p(x) = |\langle x|\Psi\rangle|^2.$$
We train our network by minimizing the negative log-likelihood (NLL) of the probability distribution $p(x)$ averaged over the training set $T$, while keeping the normalization condition (3) for the wavefunction,
$$L = -\frac{1}{N_T}\sum_{x \in T} \ln p(x). \qquad (15)$$
The gradient of the NLL loss function can be found analytically. In practice we do not evaluate the gradient with respect to $|\Psi\rangle$; instead we update each tensor of the MPS independently via the DMRG method [9,10] with a two-site update.
Gradient descent is carried out by TSGO [30] with rotation angle (learning rate) η = π/36, which showed better and faster convergence compared to Adam or SGD.
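To make the objects entering the loss concrete, here is a minimal NumPy sketch of an open-boundary real MPS, its amplitude $\Psi(x)$, the norm entering condition (3), and the NLL of eq. (15). It is an illustration under simplifying assumptions: normalization is imposed by an explicit rescaling rather than by the canonical form used in the paper, no training (DMRG or TSGO) is implemented, and all function names are hypothetical.

```python
import numpy as np

def random_mps(n_sites, bond_dim, rng):
    """Random real MPS; tensor k has shape (D_left, 2, D_right)."""
    dims = [1] + [bond_dim] * (n_sites - 1) + [1]
    return [rng.normal(size=(dims[k], 2, dims[k + 1])) for k in range(n_sites)]

def norm_squared(mps):
    """Sum over all binary x of Psi(x)^2, contracted site by site."""
    env = np.ones((1, 1))
    for A in mps:
        env = np.einsum("ab,asc,bsd->cd", env, A, A)
    return float(env[0, 0])

def normalize(mps):
    """Rescale every tensor equally so that norm_squared(mps) == 1."""
    scale = norm_squared(mps) ** (-0.5 / len(mps))
    return [A * scale for A in mps]

def amplitude(mps, x):
    """Psi(x): product of the matrices selected by the pixel values."""
    v = np.ones(1)
    for A, bit in zip(mps, x):
        v = v @ A[:, bit, :]
    return float(v[0])

def nll_loss(mps, samples):
    """Average of -ln p(x) with p(x) = Psi(x)^2."""
    return float(-np.mean([np.log(amplitude(mps, x) ** 2) for x in samples]))

def classify(mps_per_digit, x):
    """Label with the largest Born probability among the per-digit MPS."""
    return int(np.argmax([amplitude(m, x) ** 2 for m in mps_per_digit]))

rng = np.random.default_rng(0)
toy_mps = normalize(random_mps(n_sites=6, bond_dim=3, rng=rng))
toy_samples = [[0, 1, 1, 0, 1, 0], [1, 1, 0, 0, 1, 1]]
toy_loss = nll_loss(toy_mps, toy_samples)
```

Because the total probability is fixed to one, lowering `nll_loss` on the samples necessarily lowers the probability of some other configurations, which is the mechanism the sampler-discriminator relies on.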
Sampling. Sampling from a trained $\Psi$ is carried out via the Born rule. Starting from the first pixel, which corresponds to the state $|x_1\rangle$, one samples it with the marginal probability $p(x_1) = \|\langle x_1|\Psi\rangle\|^2$. Here $\langle x_1|\Psi\rangle$ is a "state", i.e. a tensor with $N - 1$ indices $\alpha_2, \alpha_3, \dots, \alpha_N$. The subsequent pixel is sampled conditionally: $p(x_2|x_1) = p(x_1, x_2)/p(x_1)$. To calculate the probabilities efficiently one needs to keep the MPS in the right canonical form [9]. Typical examples of sampled images are shown in Fig. 5.
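The sequential sampling described above can be sketched in NumPy as follows. As an assumption made for simplicity, the sketch does not bring the MPS to the right canonical form; instead it precomputes the right "environments" explicitly, which yields the same conditional probabilities for a generic (not necessarily normalized) MPS. The function names are hypothetical.

```python
import numpy as np

def random_mps(n_sites, bond_dim, rng):
    """Random real MPS; tensor k has shape (D_left, 2, D_right)."""
    dims = [1] + [bond_dim] * (n_sites - 1) + [1]
    return [rng.normal(size=(dims[k], 2, dims[k + 1])) for k in range(n_sites)]

def right_environments(mps):
    """envs[k] sums Psi^2 over all pixels k..n-1, leaving the two left bonds open."""
    envs = [None] * (len(mps) + 1)
    envs[-1] = np.ones((1, 1))
    for k in range(len(mps) - 1, -1, -1):
        envs[k] = np.einsum("asc,bsd,cd->ab", mps[k], mps[k], envs[k + 1])
    return envs

def pixel_probabilities(A, v, env_right):
    """Conditional probabilities of pixel values 0 and 1 given the prefix encoded in v."""
    candidates = [v @ A[:, s, :] for s in (0, 1)]
    weights = np.array([c @ env_right @ c for c in candidates])
    return candidates, weights / weights.sum()

def sample_image(mps, rng):
    """Draw one binary string pixel by pixel."""
    envs = right_environments(mps)
    v, x = np.ones(1), []
    for k, A in enumerate(mps):
        candidates, p = pixel_probabilities(A, v, envs[k + 1])
        bit = int(rng.random() < p[1])
        x.append(bit)
        # Rescaling v does not change later conditionals and avoids under/overflow.
        v = candidates[bit] / np.linalg.norm(candidates[bit])
    return x

def chain_probability(mps, x):
    """Product of the conditionals used by the sampler for a given string x."""
    envs = right_environments(mps)
    v, prob = np.ones(1), 1.0
    for k, (A, bit) in enumerate(zip(mps, x)):
        candidates, p = pixel_probabilities(A, v, envs[k + 1])
        prob *= p[bit]
        v = candidates[bit] / np.linalg.norm(candidates[bit])
    return prob
```

For a short chain, `chain_probability` can be compared against a brute-force evaluation of $|\Psi(x)|^2$ divided by the total norm, which is a convenient sanity check before moving to 784 pixels.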

Quality of sampling.
To assess the quality of sampling we trained an auxiliary CNN to classify MNIST-like images on QMNIST [31], a dataset which extends MNIST with an additional 50K images; this allowed us to train the CNN on images never seen by our MPS. We sampled images with the MPS and passed them through the CNN, which returned the probability that a sampled image of, say, a three is classified as a three. We interpret this probability as the quality of sampling, which is shown in the left panels of Fig. 2 and Fig. 3.

Classification.
To classify images we take ten MPS $\Psi_i$, each trained to minimize the loss function (15) for its digit $i$ separately. To predict the label of an image $x$ we calculate $\arg\max_i |\Psi_i(x)|^2$. The accuracy of such classification peaked at around 96% in our simulations, which is not as high as that of contemporary supervised NN architectures, but on par with common unsupervised methods.
Discrimination. Discrimination quality is the ability of a NN to distinguish images of, say, threes from any other images. In our case, due to the normalization condition (3), the wavefunction $\Psi_3$ tends to decrease outside of the set of threes. In our setup $|\Psi(x)|^2 \geq \epsilon$ indicates that $x$ is contained in the set of threes, and $x$ is outside of the set of threes if $|\Psi(x)|^2 < \epsilon$. Here we propose a simple method to fix $\epsilon$ and estimate the discrimination quality. While an assessment of discrimination quality in principle requires testing the MPS against the whole "space of images" $I$, in practice most of this space is irrelevant. Our model does not confuse threes with random images. The hardest examples to discriminate are pictures which are structurally similar to 3s, i.e. those depicting other symbols. That is why we test $\Psi_i$ on the other digits of MNIST. We split the training set into two parts. One part is used to train the model, while the other is used to fix $\epsilon$ such that the balanced accuracy on the test set is maximized. The resulting quality of discrimination is shown in the right panel of Fig. 2.
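A threshold scan of this kind can be sketched as follows. The sketch works with energies $E = -\ln|\Psi(x)|^2$, so that $|\Psi(x)|^2 \geq \epsilon$ becomes $E \leq -\ln\epsilon$, and simply tries every observed energy as a candidate threshold. The function names and the toy numbers are hypothetical, and the exhaustive scan is an assumption about how the maximization could be done, not a description of the authors' code.

```python
import numpy as np

def balanced_accuracy(is_positive, predicted_positive):
    """Average of the true-positive and true-negative rates."""
    tpr = np.mean(predicted_positive[is_positive])
    tnr = np.mean(~predicted_positive[~is_positive])
    return 0.5 * (tpr + tnr)

def best_energy_threshold(energies, is_positive):
    """Return (threshold, score); an image is accepted when E <= threshold."""
    energies = np.asarray(energies, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    best_t, best_score = None, -1.0
    for t in np.unique(energies):  # sorted candidate thresholds
        score = balanced_accuracy(is_positive, energies <= t)
        if score > best_score:
            best_t, best_score = float(t), float(score)
    return best_t, best_score

# Toy calibration set: three images of the target digit, three of other digits.
toy_threshold, toy_score = best_energy_threshold(
    [10.0, 12.0, 11.0, 30.0, 40.0, 35.0],
    [True, True, True, False, False, False],
)
toy_epsilon = np.exp(-toy_threshold)  # corresponding threshold on |Psi|^2
```

Balanced accuracy is used here, as in the text, because images of the target digit are heavily outnumbered by images of the other nine digits.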
Fixing $\varepsilon$. The value of $\varepsilon$, introduced around eq. (9), should be fixed to separate "neat" images of digit $i$ from the "corrupted" ones. There is no precise definition of "neat" or "corrupted". However, we can try to estimate $\varepsilon$ as follows. First, we use the trained $\Psi_i$ to sample images of $i$ and keep only those with $E \approx E_0$. With the help of the auxiliary CNN, we estimate the sampling quality of each sample and calculate the mean and variance of the sampling quality distribution. Then we look at sampled images with larger values of $E > E_0$. For each given $E$ we again estimate the mean and variance of the sampling quality distribution. We then look for $E = E^*$ which is large enough that the corresponding sampling quality distribution is sufficiently different from the one for $E = E_0$, namely the corresponding mean values are separated by at least one standard deviation. We take $\varepsilon = E^*$. We have checked numerically that $\varepsilon$ obtained in this way yields a subleading contribution to $N_F$, see eq. (10).
Further comments. We trained an auxiliary Deep Convolutional GAN to independently sample additional MNIST-like images. With these samples we doubled the number of images and retrained the MPS on the extended data. As a result, the mean value of $E$ averaged over the test set decreased by 4%. Additionally, we used GAN samples to verify our estimate of the effective (fractal) dimension for each digit. The results are in agreement with the values of $\Delta$ from Table 1.

Entanglement Entropy
The entanglement entropy of a bi-partition which splits a quantum spin chain into two parts $A$ and $B$, consisting of $k$ "left" and $n - k$ "right" spins correspondingly, is defined as
$$S_k = -\mathrm{Tr}\,\rho_A \ln \rho_A = -\mathrm{Tr}\,\rho_B \ln \rho_B,$$
where $\rho_A = \mathrm{Tr}_B(\rho_{AB})$ and $\rho_B = \mathrm{Tr}_A(\rho_{AB})$ are the reduced density matrices of each part, and $\rho_{AB} = |\Psi\rangle\langle\Psi|$ for a pure state.
For any pure state the entanglement entropy can be expressed using the singular values $\lambda_i$ of the Schmidt decomposition of the state,
$$|\Psi\rangle = \sum_i \lambda_i\, |u_i\rangle_A \otimes |v_i\rangle_B, \qquad (18)$$
where $|u_i\rangle_A$ and $|v_i\rangle_B$ are orthonormal states of the subsystems $A$ and $B$. The entanglement entropy reduces to
$$S_k = -\sum_i \lambda_i^2 \ln \lambda_i^2.$$
The advantage of the MPS is that we can bring the network to a form in which the orthogonality center is the bond between the subsystems $A$ and $B$. Then the Schmidt decomposition (18) becomes the singular value decomposition of the product of the two MPS tensors at the nodes adjacent to the orthogonality-center bond [12].
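For a small number of spins the Schmidt values can be obtained directly from a dense state vector, which makes the formula easy to check. The sketch below is illustrative only: it reshapes the full vector of length $2^n$ into a matrix and takes its singular values, which is feasible only for small $n$; for an MPS with 784 sites one would instead read the singular values off at the orthogonality center, as described above. The function name is hypothetical.

```python
import numpy as np

def entanglement_entropy(psi, k, n):
    """EE (in nats) between the first k and the remaining n - k spins
    of a real state vector psi of length 2**n."""
    psi = np.asarray(psi, dtype=float)
    psi = psi / np.linalg.norm(psi)
    # Rows index configurations of A, columns index configurations of B.
    schmidt = np.linalg.svd(psi.reshape(2 ** k, 2 ** (n - k)), compute_uv=False)
    p = schmidt ** 2
    p = p[p > 1e-15]  # drop numerically zero Schmidt weights
    return float(-np.sum(p * np.log(p)))

product_state = [1.0, 0.0, 0.0, 0.0]             # |00>
bell_state = np.array([1.0, 0.0, 0.0, 1.0])      # (|00> + |11>) / sqrt(2) after normalization
s_product = entanglement_entropy(product_state, k=1, n=2)
s_bell = entanglement_entropy(bell_state, k=1, n=2)
```

The two toy states bracket the possible values for a single cut between two spins: zero for a product state and $\ln 2$ for a maximally entangled pair.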

Figure 2: Left panel: quality of sampling by $\Psi_3$ with bond dimension $D = 100$ during the training process. Quality is assessed by an auxiliary CNN. Central and right panels: quality of classification and discrimination by $\Psi_i$ and $\Psi_3$ with the same $D$ during the training process.
First we would like to understand how many different images $x \in I$ there are with a given value of $|\Psi(x)|^2$, and what different values of $|\Psi(x)|^2$ represent. It will prove useful to use the language of statistical mechanics and think about different images $x \in I$ as possible microstates of some auxiliary physical system (see footnote 2). We define the "energy" of a microstate $E(x) := -\ln|\Psi(x)|^2$ and the entropy $S(E)$ such that $e^{S(E)}dE$ is the number of microstates with energy between $E$ and $E + dE$.

Footnote 2: Familiarity with the basics of statistical mechanics is helpful but not necessary in what follows.

Figure 3: Quality of sampling (left) and classification (right) by the trained Ψ 3 as a function of bond dimension.

Figure 7: Left: Average entanglement S of Ψ 3 as a function of bond dimension.Right: Average entanglement S for each Ψ i during training.

Figure 8: Log-log plot for d as a function of number of samples K. Images are sampled with Ψ 3 with the bond dimension D = 100.Only images with E(x) ≈ E 0 are included.

Figure 10: Matrix Product State graph representation.Blue boxes represent tensors (of rank 3) with α i external indices, associated with pixel representation.Bond dimension D controls the maximum of other two dimensions of the tensors, depicted by horizontal lines.

Table 1: The table summarizes the main properties of the full sets associated with the digits of black and white MNIST. $V$, defined in (10), controls the size of the full set. $d_{ab}$ is the mean Hamming distance. $\Delta$ is the effective (fractal) dimension of the full set. $n$ is the average number of black pixels for images of a given digit $i$. $S$ is the (averaged) entanglement entropy.