Compressing deep neural networks by matrix product operators

A deep neural network is a parameterization of a multi-layer mapping of signals in terms of many alternately arranged linear and nonlinear transformations. The linear transformations, which are generally used in the fully-connected as well as convolutional layers, contain most of the variational parameters that are trained and stored. Compressing a deep neural network to reduce its number of variational parameters without diminishing its prediction power is an important but challenging problem, both for training these parameters efficiently and for lowering the risk of overfitting. Here we show that this problem can be effectively solved by representing linear transformations with matrix product operators (MPO). We have tested this approach on five main neural networks, including FC2, LeNet-5, VGG, ResNet, and DenseNet, on two widely used datasets, namely MNIST and CIFAR-10, and found that the MPO representation indeed sets up a faithful and efficient mapping between input and output signals, which can keep or even improve the prediction accuracy with a dramatically reduced number of parameters.

A deep feedforward neural network sets up a mapping between a set of input signals, such as images, and a set of output signals, say categories, through a multi-layer transformation F, which is represented as a composition of many alternately arranged linear (L) and nonlinear (N) mappings 32,33 . More specifically, an n-layer neural network F is a sequential product of alternating linear and nonlinear transformations,

F = N_n ∘ L_n ∘ ··· ∘ N_1 ∘ L_1 . (1)

The linear mappings contain most of the variational parameters that need to be determined. The nonlinear mappings, which contain almost no free parameters, are realized by operations known as activations, such as the Rectified Linear Unit (ReLU) and softmax. A linear layer maps an input vector x of dimension N_x to an output vector y of dimension N_y via a linear transformation characterized by a weight matrix W,

y = W x . (2)

A fully-connected layer plays the role of a global linear transformation, in which each output element is a weighted summation of all input elements, and W is a full matrix. A convolutional layer 2 represents a local linear transformation, in the sense that each element in the output is a weighted summation of a small portion of the input elements, which form a local cluster. The variational weights of this local cluster form a dense convolutional kernel, which is designed to extract a specific feature. To maintain good performance, different kernels are used to extract different features. A graphical representation of W is shown in Fig. 1(a). Usually, the numbers of elements or neurons, N_x and N_y , are very large, and thus there is a huge number of parameters to be determined in a fully-connected layer 10 . The convolutional layer reduces the variational parameters by grouping the input elements into many partially overlapping kernels, with each output element connected to one kernel. The number of variational parameters in a convolutional layer is determined by the number of kernels and the size of each kernel.
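The contrast in parameter counts can be made concrete with a short calculation. The layer sizes below are illustrative only (a flattened 28×28 input with 256 hidden units, and a small bank of 5×5 kernels); they are not a specification of the networks studied here.

```python
# Parameter counts (weights only, biases ignored) for the two kinds of
# linear layers discussed above, with illustrative sizes.

# Fully-connected layer: every output neuron is connected to every input,
# so the weight matrix has N_y * N_x entries.
Nx, Ny = 784, 256                  # e.g. a flattened 28x28 image, 256 hidden units
fc_params = Nx * Ny                # 200704 weights

# Convolutional layer: C_out kernels of size k x k over C_in input channels.
# The count depends only on the kernels, not on the image size.
C_in, C_out, k = 1, 6, 5
conv_params = C_out * C_in * k * k # 150 weights
```

Even though the convolutional layer is applied across the whole image, its weight count is three orders of magnitude smaller here, which is exactly the reduction mechanism described in the text.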
It can be much smaller than that in a fully-connected layer. However, the total number of parameters in all the convolutional layers can still be very large in a deep neural network containing many convolutional layers 11 . Training and storing these parameters raises a big challenge in this field. First, it is time consuming to train and optimize these parameters, and a large parameter count may even increase the probability of overfitting, which would limit the generalization power of deep neural networks. Second, a big memory space is needed to store these parameters, which limits applications where disk space is strongly confined, for example, on mobile terminals.
However, the linear transformations in a commonly used deep neural network have a number of features which may allow us to simplify their representations. In a fully-connected layer, for example, it is well known that the rank of the weight matrix is strongly restricted [34][35][36] , due to short-range correlations or entanglements among the input pixels. This suggests that we can safely use a lower-rank matrix to represent this layer without affecting its prediction power. In a convolutional layer, the correlations of images are embedded in the kernels, whose sizes are generally very small in comparison with the whole image size. This implies that the "extracted features" of this convolution can be obtained from very local clusters. In both cases, a dense weight matrix is not absolutely necessary to perform a faithful linear transformation. This peculiar feature of linear transformations results from the fact that the information hidden in a dataset is just short-range correlated. Thus, to reveal accurately the intrinsic features of a dataset, it is sufficient to use a simplified representation that catches more accurately the key features of local correlations. This motivates us to adopt matrix product operators (MPO) 37,38 , a commonly used approach to represent effectively a higher-order tensor or a Hamiltonian with short-range interactions, to represent the linear transformation matrices in deep neural networks. The application of MPO in condensed matter physics and quantum information science has achieved great success 39,40 in the past decade. In a quantum many-body system, the Hamiltonian or any other physical operator can be expressed as a higher-order tensor in the space spanned by the local basis states 41 . An MPO is simply a tensor-train approximation 42,43 that factorizes a higher-order tensor into a sequential product of so-called local tensors.
Nevertheless, it provides an efficient and faithful representation of systems with short-range interactions whose entanglement entropies are upper bounded 44 , or equivalently, systems with finite excitation gaps above their ground states. To represent exactly a quantum many-body system, the total number of parameters should in principle grow exponentially with the system size (or the size of each "image" in the language of neural networks). In the MPO representation, however, the number of variational parameters is greatly reduced, since it grows only linearly with the system size 45 .
To construct the MPO representation of a weight matrix W, we first reshape it into a 2n-indexed tensor,

W_{yx} → W^{j_1 j_2 ··· j_n}_{i_1 i_2 ··· i_n} . (3)

Here, the one-dimensional coordinate x of the input signal x with dimension N_x is reshaped into a coordinate in an n-dimensional space, labelled by (i_1 i_2 ··· i_n); hence there is a one-to-one mapping between x and (i_1 i_2 ··· i_n). Similarly, the one-dimensional coordinate y of the output signal y with dimension N_y is reshaped into a coordinate in an n-dimensional space, with a one-to-one correspondence between y and (j_1 j_2 ··· j_n). If I_k and J_k are the dimensions of i_k and j_k , respectively, then

N_x = ∏_k I_k ,  N_y = ∏_k J_k . (4)

The index decomposition in Eq. (3) is not unique. One should in principle decompose the input and output vectors such that the test accuracy is the highest. However, testing all possible decompositions is time consuming. For the results presented in this work, we have chosen the decomposition simply by convenience.
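The reshape in Eq. (3) is a pure relabelling of indices, with no change to the matrix elements. A minimal numpy sketch, using the 784 = 4·7·7·4 and 256 = 4·4·4·4 factorizations that appear later in the FC2 example (any other factorization would work the same way):

```python
import numpy as np

I = (4, 7, 7, 4)   # factorization of the input dimension  N_x = 784
J = (4, 4, 4, 4)   # factorization of the output dimension N_y = 256
Nx, Ny = int(np.prod(I)), int(np.prod(J))

W = np.random.randn(Ny, Nx)   # ordinary weight matrix
T = W.reshape(*J, *I)         # the same data viewed as a 2n-indexed tensor

# One-to-one index mapping: the flat column index x corresponds to the
# multi-index (i1, i2, i3, i4) in row-major order, and likewise for y.
x = 123
multi = np.unravel_index(x, I)
assert W[0, x] == T[(0, 0, 0, 0) + multi]
```

Because `reshape` only reinterprets the memory layout, every element of W is recovered exactly from T through this correspondence.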
The MPO representation of W is obtained by factorizing it into a product of n local tensors,

W^{j_1 j_2 ··· j_n}_{i_1 i_2 ··· i_n} = w^{(1)}[j_1, i_1] w^{(2)}[j_2, i_2] ··· w^{(n)}[j_n, i_n] , (5)

where, for fixed indices (j_k, i_k), the local tensor w^{(k)}[j_k, i_k] is a D_{k-1} × D_k matrix, with D_k the virtual basis dimension on the bond linking w^{(k)} and w^{(k+1)}, and D_0 = D_n = 1. For convenience in the discussion below, we assume D_k = D for all k except k = 0 and k = n. A graphical representation of this MPO is shown in Fig. 1. In this MPO representation, the tensor elements of w^{(k)} are the variational parameters. Their number increases with the virtual bond dimension D; hence D serves as a tunable parameter that controls the expressive power.
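In this work the local tensors are trained directly, but the factorization of Eq. (5) can also be obtained from a given matrix by the standard tensor-train construction: sequential truncated SVDs with the bond dimension capped at D. A sketch of that construction (the function name and core layout are our own convention, not from the paper):

```python
import numpy as np

def tt_decompose(W, I, J, D):
    """Factorize W (shape prod(J) x prod(I)) into MPO cores of shape
    (D_{k-1}, J_k, I_k, D_k) by sequential SVDs truncated to bond D."""
    n = len(I)
    # Group the k-th output and input indices together: (j1,i1,j2,i2,...).
    T = W.reshape(*J, *I)
    perm = [p for k in range(n) for p in (k, n + k)]
    T = T.transpose(perm)
    cores, Dk = [], 1
    M = T.reshape(Dk * J[0] * I[0], -1)
    for k in range(n - 1):
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        r = min(D, len(S))                        # truncate the bond to D
        cores.append(U[:, :r].reshape(Dk, J[k], I[k], r))
        M = (S[:r, None] * Vt[:r]).reshape(r * J[k + 1] * I[k + 1], -1)
        Dk = r
    cores.append(M.reshape(Dk, J[-1], I[-1], 1))  # last core, D_n = 1
    return cores
```

With D large enough that no singular values are discarded, contracting the cores back together reproduces W exactly; smaller D gives the compressed, approximate representation whose expressive power D controls.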

II. RESULTS
Here we show the results obtained with the MPO representation in five kinds of neural networks on two datasets: FC2 46 and LeNet-5 2 on the MNIST dataset 47 , and VGG 10 , ResNet 11 , and DenseNet 12 on the CIFAR-10 dataset 48 . Among them, FC2 and LeNet-5 are relatively shallow networks, while VGG, residual CNN (ResNet), and dense CNN (DenseNet) are deeper neural networks.
For convenience, we use MPO-Net to denote a deep neural network in which all or some of the linear layers are represented by MPOs. Moreover, we denote an MPO, defined by Eq. (5), as

M^{J_1, J_2, ···, J_n}_{I_1, I_2, ···, I_n}(D) . (6)

To quantify the compressibility of an MPO-Net with respect to a neural network, we define its compression ratio ρ as

ρ = Σ_l N^(l)_mpo / Σ_l N^(l)_ori , (7)

where l sums over the linear layers whose transformation tensors are replaced by MPOs, and N^(l)_ori and N^(l)_mpo are the numbers of parameters in the l-th layer in the original and MPO representations, respectively. The smaller the compression ratio, the fewer parameters are used in the MPO representation. Furthermore, to examine the performance of a given neural network, we train the network m times independently to obtain a test accuracy ā with a standard deviation σ defined by

σ = [ (1/m) Σ_i (a_i − ā)² ]^{1/2} , (8)

where a_i is the test accuracy of the i-th training procedure and ā is the average of the a_i. The results presented in this work are obtained with m = 5.
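The parameter count of an MPO M^{J_1,…}_{I_1,…}(D) is Σ_k I_k J_k D_{k-1} D_k with D_0 = D_n = 1, so the compression ratio is easy to tabulate. A small helper, assuming (as the text does) that all interior bonds equal D; the FC2 decomposition used later serves as the worked example:

```python
def mpo_params(I, J, D):
    """Parameter count of M^{J...}_{I...}(D): sum_k I_k*J_k*D_{k-1}*D_k,
    with D_0 = D_n = 1 and all interior bond dimensions equal to D."""
    n = len(I)
    Ds = [1] + [D] * (n - 1) + [1]
    return sum(I[k] * J[k] * Ds[k] * Ds[k + 1] for k in range(n))

def compression_ratio(layers):
    """rho = sum_l N_mpo^(l) / sum_l N_ori^(l); layers holds (N_ori, N_mpo) pairs."""
    return sum(m for _, m in layers) / sum(o for o, _ in layers)

# FC2 first layer: a 784 x 256 matrix replaced by M^{4,4,4,4}_{4,7,7,4}(16)
n_ori = 784 * 256
n_mpo = mpo_params((4, 7, 7, 4), (4, 4, 4, 4), 16)
ratio = compression_ratio([(n_ori, n_mpo)])
```

For D = 16 this single layer already gives a ratio of roughly 0.074, consistent with the below-8% compression quoted for FC2.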

A. MNIST dataset
We start from the identification of handwritten digits in the MNIST dataset 47 , which consists of 60,000 digits for training and 10,000 digits for testing. Each image is a square of 28×28 grayscale pixels, and all the images are divided into 10 classes corresponding to numbers 0∼9, respectively.

FC2
We first test the MPO representation on the simplest textbook neural network structure, FC2 46 . FC2 consists of only two fully-connected layers, whose weight matrices have 784 × 256 and 256 × 10 elements, respectively. We replace these two weight matrices by M^{4,4,4,4}_{4,7,7,4}(D) and M^{1,1,10,1}_{4,4,4,4}(4), respectively, in the corresponding MPO representation. Here we fix the bond dimension in the second layer to 4, and only allow the bond dimension D of the first layer to vary. Even for D = 16, the compression ratio is still below 8%, which indicates that the number of parameters to be trained can be significantly reduced without any loss of accuracy.

LeNet-5
We further test MPO-Net with the famous LeNet-5 network 2 , the first instance of a convolutional neural network. LeNet-5 has five linear layers. Among them, the last convolutional layer and the two fully-connected layers contain most of the parameters. We represent these three layers by three MPOs, structured as M^{2,5,6,2}_{2,10,10,2}(4), M^{2,3,7,2}_{2,5,6,2}(4), and M^{1,5,2,1}_{2,3,7,2}(2), respectively. The compression ratio is ρ ∼ 0.2. Table I shows the results obtained with the original and MPO representations of LeNet-5. We find that the test accuracy of LeNet-5 can be faithfully reproduced by MPO-Net. Since LeNet-5 is the first and prototypical convolutional neural network, this success gives us confidence in using the MPO representation in deeper neural networks.
B. CIFAR-10 dataset

CIFAR-10 is a more complex dataset 48 . It consists of 50,000 images for training and 10,000 images for testing. Each image is a square of 32×32 RGB pixels. All the images in this dataset are divided into 10 classes, corresponding to airplane, automobile, ship, truck, bird, cat, deer, dog, frog, and horse, respectively. To achieve good classification accuracy, deeper neural networks with many convolutional layers are used. To show the effectiveness of the MPO representation, as a preliminary test, we use MPOs only on the fully-connected layers and on some heavily parameter-consuming convolutional layers. For both structures, the compression ratio of MPO-Net is about 0.0006; hence the number of parameters used is much smaller than in the original representation. Nevertheless, we find that the prediction accuracy of MPO-Net is even better than that obtained from the original networks. This is consistent with the results reported by Novikov et al. 43 for the ImageNet dataset 49 . It results from two features of MPO. First, since the number of variational parameters is greatly reduced in MPO-Net, the representation is more economical and the parameters are easier to train. Second, the local correlations between input and output elements are more accurately represented by MPO, which can reduce the probability of overfitting.

ResNet
ResNet 11 is commonly used to address the degradation problem of deep convolutional neural networks. It won first place in the detection task of ILSVRC 2015, and differs from an ordinary convolutional neural network through the so-called ResUnit structure, in which identity mappings are added to connect some of the input and output signals. The ResNet structure used in our calculation has a fully-connected layer realized by a weight matrix of size 64k × 10, where k controls the width of the network. We represent this layer by an MPO of the form M^{1,5,2,1}_{4,4,4,k}(3). In our calculation, k = 4 is used, and the corresponding compression ratio is about 0.11. Figure 3 shows the test accuracy as a function of the depth of ResNet with k = 4. We find that MPO-Net has the same accuracy as the normal ResNet in all the cases we have studied. We also find that even the ResUnits can be compressed by MPO. For example, for the 56-layer ResNet, by representing the heaviest (last) ResUnit and the fully-connected layer with two M^{2,4,4,4,4,4,4,2}_{2,4,4,4,4,4,4,2}(4) and one M^{1,5,2,1}_{4,4,4,k}(3), we obtain the same accuracy as the normal ResNet. Similar observations are obtained for other values of k.

DenseNet
The last deep neural network we have tested is DenseNet 12 . Constructed in the framework of ResNet, DenseNet modifies the ResUnit into a DenseUnit by adding more shortcuts within the units. This forms a wider neural network, allowing the extracted information to be recycled more efficiently. It also achieved great success in the ILSVRC competition, and drew much attention at the CVPR conference in 2017.
The DenseNet used in this work has a fully-connected layer with a weight matrix of size (n + 3km) × 10, where m controls the total depth L of the network through L = 3m + 4, and n and k are two other parameters that specify the network. Although there is only one fully-connected layer in DenseNet, it consumes about half of the total parameters in the network. We use an MPO to reduce the parameter number in this layer; for different m, k, and n, we use different MPO representations.
Our results are summarized in Table II. For the four DenseNet structures we have studied, the fully-connected layer is compressed by more than 7 to 21 times, with compression ratios varying from 0.044 to 0.129. In the first three cases, we find that the test accuracies obtained with MPO-Net agree with the DenseNet results within numerical errors. In the fourth case, the test accuracy obtained with MPO-Net is even slightly higher than that obtained with DenseNet.

III. DISCUSSION
Motivated by the success of MPO in the study of quantum many-body systems with short-range interactions, we propose to use MPO to represent the linear transformation matrices in deep neural networks. This is based on the assumption that the correlations between pixels, or the inherent structures of the information hidden in "images", are essentially localized 50,51 . We have tested our approach with five main kinds of neural networks on two datasets, and found that MPO can not only improve the training efficiency and reduce the memory space, as originally expected, but also slightly improve the test accuracy, using far fewer parameters than the original networks. This, as already mentioned, may result from the fact that the variational parameters can be trained more accurately and efficiently owing to the dramatic reduction of parameters in MPO-Net. The MPO representation emphasizes the local correlations of the input signals; it puts a strong constraint on the linear transformation matrix and helps prevent the training from being trapped at certain local minima. We believe this can reduce the risk of overfitting.
MPO can be used to represent both fully-connected and convolutional layers. One can also use it just to represent the kernels in convolutional layers, as suggested by Garipov et al. 52 . However, it is most efficient in representing a fully-connected layer, where the weight matrix is fully dense. The representation can greatly reduce the memory cost and shorten the training time in a deep neural network where all or most of the linear layers are fully-connected, such as in a recurrent neural network 53,54 used to process video data.
The tensor-network representation of deep neural networks is actually not new. Inspired by the locality assumption about the correlations between pixels, matrix product representations have already been successfully used to characterize and compress images 50 and to determine the underlying generative models 55 . Novikov et al. 43 also used MPO to represent some fully-connected layers, not including the classifiers, in FC2 and VGG. Our work, however, demonstrates that all fully-connected layers, especially including the classifiers, as well as convolutional layers, can be effectively represented by MPO, no matter how deep a neural network is.
Other mathematical structures have also been used to represent deep neural networks. For example, Kossaifi et al. 56 used a Tucker-structure representation, which is a lower-rank approximation of a high-dimensional tensor, to represent a fully-connected layer and its input feature. Hallam et al. 57 used a tensor network called the multi-scale entanglement renormalization ansatz 58 , and Liu et al. 59 used a unitary tree tensor network 60 , to represent the entire map from the input to the output labels.
In this work, we have proposed to use MPO to compress the transformation matrices in deep neural networks. Similar ideas can be used to compress complex datasets, for example ImageNet 49 , in which each image contains about 224 × 224 pixels. In that case, it is matrix product states 61 , instead of MPO, that should be used. We believe this can reduce the cost of decoding each "image" in a dataset and, combined with the MPO representation of the linear transformation matrices, can further compress deep neural networks and enhance their prediction power.

IV. METHODS
The tensor elements of w^{(k)} in Eq. (5) are the variational parameters that need to be determined in the training procedure of deep neural networks. For an MPO whose structure is defined by Eq. (6), the total number of these variational parameters equals

N_mpo = Σ_k I_k J_k D_{k-1} D_k . (9)

The strategy of training is to find a set of optimal w's such that the following cost function is minimized:

L = − Σ_n t_n · ln y_n + α Σ_i |w_i|² , (10)

where n labels the images and i labels all the parameters, including the local tensors in the MPO representations and the kernels in the untouched convolutional layers. |w_i| represents the norm of the parameter w_i, and α is an empirical parameter that is fixed prior to the training. The first term measures the cross entropy between the prediction vectors y and the target label vectors t. The second term is a constraint, called the L2 regularization 62 , added to alleviate overfitting.
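The cost function can be transcribed almost literally into numpy. This sketch takes the softmax outputs y and one-hot targets t as given, with hypothetical array shapes; a production implementation would use a framework's fused softmax cross-entropy for numerical stability.

```python
import numpy as np

def cost(y_pred, t, params, alpha):
    """Cross entropy between softmax outputs y_pred and one-hot targets t,
    plus the L2 regularization term alpha * sum_i |w_i|^2."""
    cross_entropy = -np.sum(t * np.log(y_pred))   # first term of the cost
    l2 = alpha * sum(np.sum(w ** 2) for w in params)  # L2 penalty on all parameters
    return cross_entropy + l2
```

Here `params` would hold every trainable array, i.e. all MPO local tensors plus any untouched convolutional kernels, matching the index i in the cost function.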
In each training step, L is evaluated using the current w's, which are randomly initialized at the beginning, and the input dataset. The gradients of the cost function with respect to the variational parameters are determined by standard back propagation 63 , and the parameters w are updated by stochastic gradient descent with momentum 64 . The training is terminated when the cost function stops decreasing.
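The momentum update used here follows a common convention (v ← μv − ηg, w ← w + v); individual frameworks differ in details such as Nesterov correction, so this is a sketch rather than the paper's exact implementation:

```python
import numpy as np

def sgd_momentum_step(params, grads, velocities, lr=0.01, mu=0.9):
    """One in-place update of SGD with momentum:
    v <- mu * v - lr * g;  w <- w + v."""
    for w, g, v in zip(params, grads, velocities):
        v *= mu        # decay the accumulated velocity
        v -= lr * g    # add the (negative) gradient step
        w += v         # move the parameter along the velocity
    return params, velocities
```

The velocity arrays start at zero and accumulate a running average of past gradients, which damps oscillations and speeds convergence along shallow directions of the cost surface.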
The detailed structures of the neural networks we have studied are introduced in the supplemental material. The source code used in this work is available at https://github.com/zfgao66/deeplearning-mpo.

Competing Interests - The authors declare that they have no competing financial interests.

* qingtaoxie@ruc.edu.cn
† huihai.zhao@riken.jp
‡ zlu@ruc.edu.cn
§ txiang@iphy.ac.cn

In this supplementary material, we give the detailed structures of the neural networks used in the paper titled Compressing deep neural networks by matrix product operators. The corresponding source code is available at https://github.com/zfgao66/deeplearning-mpo. The structures of FC2, LeNet-5, VGG, ResNet, and DenseNet used in this paper are summarized in Tables II-VII, and their prototypes can be found in Refs. [1][2][3][4][5] , respectively. To simplify the descriptions, we introduce some shorthand notation, summarized in Table I, which is used throughout this material.