Building separable approximations for quantum states via neural networks

Finding the closest separable state to a given target state is a notoriously difficult task, even more difficult than deciding whether a state is entangled or separable. To tackle this task, we parametrize separable states with a neural network and train it to minimize the distance to a given target state, with respect to a differentiable distance, such as the trace distance or Hilbert--Schmidt distance. By examining the output of the algorithm, we obtain an upper bound on the entanglement of the target state, and construct an approximation for its closest separable state. We benchmark the method on a variety of well-known classes of bipartite states and find excellent agreement, even up to local dimension of $d=10$, while providing conjectures and analytic insight for isotropic and Werner states. Moreover, we show our method to be efficient in the multipartite case, considering different notions of separability. Examining three and four-party GHZ and W states we recover known bounds and obtain additional ones, for instance for triseparability.


I. INTRODUCTION
Entanglement is now considered a defining feature of quantum theory, with broad implications in modern physics, from quantum information processing to many-body physics.
The detection and characterisation of entanglement is, however, a notoriously challenging problem [1,2]. First of all, it is known that the problem of determining whether a given density matrix is entangled or separable is NP-hard [3,4]. There exist, however, general methods for detecting entanglement, notably the celebrated negativity under partial transposition (NPT) criterion, which ensures that the considered density matrix must be entangled [5,6]. The converse, however, does not hold, as there exist entangled states which are positive under partial transposition, so-called bound (or PPT) entanglement [7]. Other techniques have been developed, yet all of them are only useful in specific cases in practice. In particular, Ref. [8] developed a method based on semidefinite programming, while Refs. [9,10] proposed a numerical algorithm to construct separable decompositions. Moving beyond the bipartite case, the certification of multipartite entanglement, of which there exists a zoology of different forms, is even more challenging and less understood.
Beyond the question of determining whether a given quantum state is entangled or not, one may consider the problem of approximating a given target state via a separable one. More precisely, if the target state is separable, the question is to provide an explicit (separable) decomposition for the density matrix, while if the state is entangled, the question is to construct a separable state that minimizes a certain distance (in the Hilbert space) with respect to the target state.
This question has been addressed indirectly in the studies of entanglement measures based on the distance from the set of separable states [1,11], and is particularly relevant when constructing entanglement witnesses [12][13][14][15][16]. Additionally, finding the closest separable state has been studied directly, but this task is difficult even for two-qubit systems [17]. For a very specific notion of distance, it has also been studied directly through the concept of "best separable approximation" of a quantum state [18]. The construction of separable approximations for multipartite states is largely unexplored, except for specific families of states, which typically have a high level of symmetry [19][20][21][22][23][24][25][26][27]. Moreover, Ref. [28] developed a numerical method based on Gilbert's algorithm for constructing closest separable approximations for multipartite states, considering various notions of separability.

Figure 1. Schematic of the proposed algorithm. Given a target state $\rho_T$, a neural network constructs a separable state $\rho_{\rm NN}$ which minimizes the distance to the target, i.e. it tries to find the closest separable state. For a single input $k$ (represented in the one-hot encoding), the neural network outputs the (subnormalized) pure state $\rho_1^k \otimes \rho_2^k$. The neural network is evaluated for $K$ values of $k$, and its outputs are summed up to construct $\rho_{\rm NN}$. The distance between this and $\rho_T$ is used to update the neural network's parameters.
In the present work, we attack these questions using tools from machine learning. Specifically, we devise neural networks for constructing a separable approximation, given a target density matrix. We define a notion of "closest separable state", which represents the separable state minimizing a given distance with respect to the target; note that this does not coincide with the best separable approximation in general. We benchmark our method with two distance measures, the trace distance and Hilbert-Schmidt distance, on several examples, including bipartite entangled states of local dimension up to $d = 10$ (isotropic and Werner states). From the output of the algorithm, we obtain some analytical insight on the distance to the closest separable state as well as its structure. We also consider a family of states featuring bound entanglement. In turn, we demonstrate the potential of our method in the multipartite case, where we construct multi-separable decompositions for several classes of entangled states (noisy GHZ and W states) up to four qubits. Again, we show how to obtain some analytical bounds on the noise thresholds from the output of the algorithm. In particular, we establish estimates on multi-separability. We conclude with a number of open questions and directions for future research. Finally, in the Appendices we study the case of randomly chosen two-qubit states, for which we create ansätze for closest separable states, and derive an exact bound for the two-qubit case.

II. RELATED WORK
Previous work on using machine learning for the separability problem has focused either on having the machine choose good measurements and then using an existing entanglement criterion [29,30], or on viewing the task as a classification problem [31][32][33][34][35][36][37]. For classification, typically a training set is constructed where quantum states are labeled as separable or entangled. The machine learns on this training set and, given a new example, predicts whether it is entangled or separable. There are several difficulties with this approach. First, the machine just gives a guess of whether the state is entangled or separable, and does not provide any kind of certificate. Second, the training data can only be generated in a regime where we already understand the problem well, which results in the machine giving only marginal new insight at best. This could be circumvented by using suboptimal criteria (e.g. PPT) to create the training data; however, the machine would then just learn this criterion instead of correctly identifying the entanglement and/or separability boundary.
We overcome these challenges by using a generative model, which tries to give an explicit separable decomposition of a target state. This way, we immediately get a certified upper bound on the distance from the separable states. A similar approach has been taken in Refs. [38,39], where the authors represent the quantum states with "quantum neural network states" [40,41], and their extension to density matrices [42][43][44][45], as opposed to the dense representation we utilise. Their results show a more limited flexibility in the loss function and in the design of types of separable states, in particular in the multipartite case. In contrast, our technique allows us to examine the key notion of genuine multipartite entanglement [1,2], the strongest form of entanglement in multipartite quantum systems. This is possible as our model allows for optimising over biseparable models. Hence we can obtain close to optimal bounds for demonstrating 3-party and 4-party genuine multipartite entanglement. Moreover, our method also allows us to tackle the even harder question of triseparability.

III. PRELIMINARIES
In this section we first introduce the notions of separability for bipartite and multipartite systems and then define the closest separable state. Finally we introduce the basic concepts of neural networks. For more detailed introductions on separability and entanglement or on neural networks, we refer the interested reader to Refs. [1] and [46], respectively.
A quantum state $\rho_{12}$ acting on $\mathcal{H}_1 \otimes \mathcal{H}_2$, shared between two parties, is said to be separable if it can be constructed as a convex combination of local density matrices $\rho_1^k$ acting on $\mathcal{H}_1$ and $\rho_2^k$ acting on $\mathcal{H}_2$, as

$$\rho_{12} = \sum_k p_k \, \rho_1^k \otimes \rho_2^k . \tag{1}$$

For a multipartite system of $n$ parties several notions of separability exist. The straightforward generalization of Eq. (1) results in the notion of a fully separable decomposition,

$$\rho = \sum_k p_k \, \rho_1^k \otimes \rho_2^k \otimes \cdots \otimes \rho_n^k . \tag{2}$$

Naturally, one can also just examine bipartite separability on the multipartite system by grouping the parties together. This leads to the notion of biseparability with respect to the partition $(I|\bar{I})$,

$$\rho = \sum_k p_k \, \rho_I^k \otimes \rho_{\bar{I}}^k , \tag{3}$$

where $I$ denotes a subset of the indices $\{1, 2, \dots, n\}$ and $\bar{I}$ denotes its complement. A multipartite state is called biseparable if it can be decomposed as a convex mixture of states that are separable considering all possible bipartitions, namely

$$\rho = \sum_k p_k \, \rho_{I_k}^k \otimes \rho_{\bar{I}_k}^k , \tag{4}$$

where crucially, now each $I_k$ can be different.

There are many ways to quantify entanglement of a target state $\rho_T$, among which a particularly useful one is based on the distance of a state from the set of separable states. Any distance measure $D$ between quantum states $\sigma_1, \sigma_2$, which is zero if and only if $\sigma_1 = \sigma_2$, and for which $D(\sigma_1, \sigma_2) \geq D(\Lambda(\sigma_1), \Lambda(\sigma_2))$ for any completely positive trace preserving map $\Lambda$, can be used to construct an entanglement measure, by minimizing $D(\rho_T, \rho_{\rm Sep})$ over separable states $\rho_{\rm Sep}$ [1,11]. We will use the neural network to find the closest separable state with respect to a distance $D$, formally

$$\rho_{\rm CSS} = \underset{\rho_{\rm Sep}}{\arg\min}\; D(\rho_T, \rho_{\rm Sep}),$$

where $\rho_{\rm Sep}$ is a separable state. Note that the closest separable state is not necessarily unique. For the neural network method presented in this paper, any $D$ which is differentiable with respect to one of the states can be used. We choose to work with two distances; the first is the trace distance (related to the Schatten 1-norm) [47],

$$D_{\rm Tr}(\sigma_1, \sigma_2) = \frac{1}{2} \mathrm{Tr}\,|\sigma_1 - \sigma_2| = \frac{1}{2} \sum_i |\mu_i| ,$$

where $\{\mu_i\}_i$ are the eigenvalues of $\sigma_1 - \sigma_2$.
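As an illustration of the definitions above, the following minimal NumPy sketch builds a bipartite separable state from a list of weights and local states, and evaluates the trace distance as half the sum of the absolute eigenvalues of the difference. The function names (`pure_state_dm`, `separable_state`, `trace_distance`) and the example states are our own illustrative choices, not part of the original work.

```python
import numpy as np

def pure_state_dm(psi):
    """Density matrix |psi><psi| of a normalized copy of the vector psi."""
    psi = np.asarray(psi, dtype=complex)
    psi = psi / np.linalg.norm(psi)
    return np.outer(psi, psi.conj())

def separable_state(probs, states_1, states_2):
    """Convex mixture sum_k p_k rho_1^k (x) rho_2^k, cf. Eq. (1)."""
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()  # make sure the weights are normalized
    return sum(p * np.kron(r1, r2)
               for p, r1, r2 in zip(probs, states_1, states_2))

def trace_distance(sigma_1, sigma_2):
    """Half the sum of absolute eigenvalues of sigma_1 - sigma_2."""
    mu = np.linalg.eigvalsh(sigma_1 - sigma_2)
    return 0.5 * np.sum(np.abs(mu))

# Example: the classically correlated state (|00><00| + |11><11|) / 2
# compared with the two-qubit Bell state (|00> + |11>) / sqrt(2).
zero, one = pure_state_dm([1, 0]), pure_state_dm([0, 1])
rho_sep = separable_state([0.5, 0.5], [zero, one], [zero, one])
bell = pure_state_dm([1, 0, 0, 1])
print(trace_distance(bell, rho_sep))  # approximately 0.5
```

In this example the difference of the two states has eigenvalues $\pm 1/2$ and two zeros, so the trace distance is $1/2$.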
Note that the trace-distance-based measure can be useful in quantum hypothesis testing, and, among other measures, is an important measure in the study of closest classical states, which is distinct from the closest separable state [48][49][50][51][52]. We will not examine closest classical states in this work, but note that our methods can easily be adapted for their study. The second distance we consider is the Hilbert-Schmidt distance (related to the Schatten 2-norm) [53][54][55],

$$D_{\rm HS}(\sigma_1, \sigma_2) = \sqrt{\mathrm{Tr}\left[(\sigma_1 - \sigma_2)^2\right]} .$$

The Hilbert-Schmidt-based measure can be useful for constructing entanglement witnesses [12][13][14][15][16]. Both the trace distance and Hilbert-Schmidt distance can be used as a basis for an entanglement measure; however, one could consider others, such as the Bures distance [53], relative entropy of entanglement [11] or the robustness of entanglement [56]; see e.g. Ref. [57] for an overview and other examples of geometric measures of entanglement.

Let us now concisely introduce the concept of an artificial neural network [46], the basis of our numerical representation of separable states. A neural network is a numeric model which can in principle represent any multivariate function. A crucial point is to be able to adjust the parameters of the neural network to represent the desired function; however, in many use cases this can be done surprisingly efficiently with the techniques of deep learning.
In this work we will be using one of the simplest types of neural networks, the so-called multilayer perceptron. It is characterized by the number of neurons per layer (width), the number of layers (depth), and the activation functions used at the neurons. Altogether these model an iterative sequence of parametrized affine, and fixed nonlinear, transformations on the input; namely, the map from layer $l$ to $l + 1$ is

$$r_{l+1} = h(W_l r_l + b_l),$$

where the weight matrix $W_l$ and bias vector $b_l$ parametrize the affine transformation, $h$ is a fixed differentiable nonlinear function (activation function), and $r_l$ is the input of layer $l$, whose length signifies the width (number of "neurons") of layer $l$. The vector $r_1$ ($r_{\rm depth}$) is the input (output) of the whole model. At initialization, the weights and biases of all layers are set randomly. During training, the parameters of the model ($\{W_l, b_l\}_l$) are updated such that they minimize a differentiable loss function of the training set, which, as we will see later, in our case will be the trace or Hilbert-Schmidt distance. This is done by first evaluating the model for a batch of inputs, and then by slightly updating the parameters via a method called backpropagation, which relies on the gradient of the loss function with respect to the model parameters. This is repeated for many batches, until the model converges, a maximum training time is reached, or a satisfactory loss is achieved. Once trained, the neural network can be evaluated on new input instances.
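The layer-to-layer map described above can be sketched in a few lines of NumPy. This is only an illustrative forward pass with randomly initialized parameters; the helper names (`relu`, `init_layers`, `forward`), the layer widths and the scaling of the random weights are assumptions made for the example, and no training is performed.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, one common choice for the activation h."""
    return np.maximum(x, 0.0)

def init_layers(widths, rng):
    """Random weight matrices and bias vectors for consecutive layer widths."""
    return [(rng.normal(size=(n_out, n_in)) / np.sqrt(n_in),
             rng.normal(size=n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(layers, r, h=relu):
    """Apply r_{l+1} = h(W_l r_l + b_l) layer by layer."""
    for W, b in layers:
        r = h(W @ r + b)
    return r

rng = np.random.default_rng(0)
layers = init_layers([4, 100, 9], rng)   # input width 4, hidden 100, output 9
output = forward(layers, np.array([1.0, 0.0, 0.0, 0.0]))
print(output.shape)  # (9,)
```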

IV. NEURAL NETWORKS AS SEPARABLE STATES
The task is to find the closest separable state to a given target density matrix. The central idea of this work is to use a neural network as a variational ansatz for the density matrix by representing the local components of the separable decomposition with a single neural network. The approach is inspired by a similar approach taken for nonlocality, where neural networks represent the local components of a Bell-local behavior [58].
To demonstrate the method, let us examine the example of a bipartite 2-qubit state. We ask a neural network to represent the map

$$k \mapsto \left(p_k, \rho_1^k, \rho_2^k\right),$$

where we take $\rho_i^k = |\psi_i^k\rangle\langle\psi_i^k|$ to be pure states, with $|\psi_i^k\rangle = \alpha_i^k |0\rangle + \beta_i^k |1\rangle$. That is, the neural network will take as input an integer value $k$ between 1 and $K$ (in a one-hot representation), and will output the numbers $(p_k, \alpha_1^k, \beta_1^k, \alpha_2^k, \beta_2^k)$, such that normalization for each subsystem is satisfied. Note that for each complex number, two real numbers are output, the real and imaginary part. We evaluate the neural network for $K$ values of $k$, normalize the probability vector $(p_k)_k$ and sum up the outputs $p_k \rho_1^k \otimes \rho_2^k$ to construct a separable state via

$$\rho_{\rm NN} = \sum_{k=1}^{K} p_k \, \rho_1^k \otimes \rho_2^k .$$

The neural network is trained to minimize the distance between the target density matrix $\rho_T$ and the constructed separable density matrix $\rho_{\rm NN}$, i.e. $D_{\rm Tr}(\rho_T, \rho_{\rm NN})$. The process is roughly illustrated in Fig. 1, where the $p_k$ are not shown explicitly.
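To make the construction concrete, the sketch below evaluates a small, untrained perceptron on the $K$ one-hot inputs and assembles the resulting two-qubit separable state. It is a simplified illustration rather than the implementation in the repository: the function names, the linear output layer, the random parameters and the ordering of the nine real outputs are all assumptions; the softmax over the weights and the division by the 2-norm follow the normalization described in Appendix A.

```python
import numpy as np

def network(k_onehot, params):
    """Hypothetical one-hidden-layer perceptron returning 9 real numbers."""
    (W1, b1), (W2, b2) = params
    hidden = np.maximum(W1 @ k_onehot + b1, 0.0)   # ReLU hidden layer
    return W2 @ hidden + b2

def separable_state_from_network(params, K):
    """Sum of p_k rho_1^k (x) rho_2^k over the K one-hot inputs."""
    weights, products = [], []
    for k in range(K):
        out = network(np.eye(K)[k], params)
        weights.append(out[0])                      # unnormalized weight
        amps = out[1:5] + 1j * out[5:9]             # (alpha_1, beta_1, alpha_2, beta_2)
        psi_1 = amps[:2] / np.linalg.norm(amps[:2]) # normalize each local state
        psi_2 = amps[2:] / np.linalg.norm(amps[2:])
        psi = np.kron(psi_1, psi_2)
        products.append(np.outer(psi, psi.conj()))
    weights = np.array(weights)
    p = np.exp(weights - weights.max())             # softmax over the K weights
    p /= p.sum()
    return np.einsum('k,kij->ij', p, np.array(products))

rng = np.random.default_rng(1)
K, width = 16, 100
params = [(rng.normal(size=(width, K)), rng.normal(size=width)),
          (rng.normal(size=(9, width)) / np.sqrt(width), rng.normal(size=9))]
rho_nn = separable_state_from_network(params, K)
print(np.trace(rho_nn).real)  # approximately 1
```

Whatever the parameters are, the output is a valid separable state by construction, so any distance computed from it is an upper bound on the distance to the closest separable state. Training would adjust `params` to reduce that distance.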
By construction the neural network represents a single density matrix $\rho_{\rm NN}$, so for each target state $\rho_T$, the network must be retrained in order to obtain an approximation of the closest separable state to that target state. During training, requiring $K$ values of $k$ in order to evaluate the state technically means working with a batch size of $K$. That is, we evaluate $K$ inputs ($k = 1, 2, \dots, K$) to construct $\rho_{\rm NN}$ and only then calculate the gradients required for the optimization of the neural network. An advantage of this method is that only the input layer increases with $K$, which implies that the global size of the neural network can grow slowly with the dimension of the target state.
More generally, for more parties or higher dimensions, the neural network represents the map

$$k \mapsto \left(p_k, \rho_1^k, \dots, \rho_n^k\right),$$

where we take the $\rho_i^k$ ($i = 1, \dots, n$) to be pure, and the neural network explicitly outputs the parameters of the pure states. By evaluating this neural network for $K$ values of $k$, we construct a separable state via either Eq. (1) for the bipartite case ($n = 2$), or any of Eqs. (2,3,4) for the different notions of multipartite separability. Recall that by Carathéodory's theorem, in principle the largest $K$ needed is $\prod_j d_j^2$; however, even fewer terms could be sufficient. Thus we keep $K$ as a free hyperparameter, which we set before training begins. More technical details on the neural networks we used can be found in Appendix A or in the sample code provided in the Code Availability section.
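For orientation, the small sketch below compares the Carathéodory upper bound on $K$ with the smaller value $K = \prod_j d_j$ that Appendix A reports as typically sufficient in practice. The function names are illustrative only.

```python
from math import prod

def caratheodory_bound(local_dims):
    """Largest K that can be needed: the product of the squared local dimensions."""
    return prod(d * d for d in local_dims)

def product_of_dims(local_dims):
    """The smaller choice K = prod_j d_j mentioned in Appendix A."""
    return prod(local_dims)

for dims in ([2, 2], [10, 10], [2, 2, 2, 2]):
    print(dims, caratheodory_bound(dims), product_of_dims(dims))
```

For two qubits the bound is 16, while for two systems of local dimension 10 it is $10^4$, compared with the product of dimensions 100.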
The neural network is optimized in the high-dimensional non-convex landscape of the network's weights, so it is not guaranteed to converge to the optimal solution. However, in practice, optimization procedures based on gradient descent reach close-to-optimal solutions efficiently. Notice that even for suboptimal solutions we obtain an upper bound on the amount of entanglement of the target state, since the utilized distances serve as entanglement measures [11,53,59]. However, we can go one step further, and examine families of states parametrized by a single parameter, which we refer to as $q$, typically of the form

$$\rho_T(q) = q \, \rho_{\rm ent} + (1 - q) \, \rho_{\rm sep},$$

where $\rho_{\rm ent}$ is an entangled state and $\rho_{\rm sep}$ is a separable state, oftentimes the maximally mixed state. If $\rho_{\rm ent} \equiv \rho_T(q = 1)$ is truly entangled, then when decreasing $q$, for some value $q^*$ we will cross the separability boundary. We can observe this transition by varying $q$ and retraining the neural network from scratch for each target state. An approximation of $q^*$ becomes clear from how close the algorithm can get to the target states for different $q$ values.
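The idea of scanning a one-parameter family for its separability boundary can be illustrated without any training in the special case of two qubits, where positivity of the partial transpose is known to be equivalent to separability. The sketch below uses this as a stand-in for the neural-network distance (it is not the method of this work) and bisects on $q$; the function names and the choice of the Bell state mixed with the maximally mixed state are assumptions made for the example.

```python
import numpy as np

def family(q, rho_ent, rho_sep):
    """One-parameter family q * rho_ent + (1 - q) * rho_sep."""
    return q * rho_ent + (1 - q) * rho_sep

def min_pt_eigenvalue(rho, d1, d2):
    """Smallest eigenvalue of the partial transpose on the second subsystem."""
    pt = rho.reshape(d1, d2, d1, d2).transpose(0, 3, 2, 1).reshape(d1 * d2, d1 * d2)
    return np.linalg.eigvalsh(pt).min()

def ppt_threshold(rho_ent, rho_sep, d1, d2, tol=1e-10):
    """Bisect for the largest q at which the family is still PPT.

    Assumes the family is PPT at q = 0 and not PPT at q = 1.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if min_pt_eigenvalue(family(mid, rho_ent, rho_sep), d1, d2) >= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)   # two-qubit Bell state
bell = np.outer(phi, phi)
q_star = ppt_threshold(bell, np.eye(4) / 4, 2, 2)
print(q_star)  # close to 1/3
```

In the neural-network approach the role of the PPT test is played by the minimal distance found after training, which also applies when no simple separability criterion is available.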

V. CASE STUDIES
To benchmark the method, we first use the algorithm to examine the separability boundary for some exemplary families of bipartite states. We consider symmetric classes of states, i.e. isotropic and Werner states, and estimate the noise threshold for separability. We find excellent agreement with analytically known optimal thresholds. Additionally, we compare our results for the minimal Hilbert-Schmidt distance for isotropic states to the known analytic distance, and find excellent correspondence. We furthermore conjecture the minimal trace distances for isotropic states, and the minimal trace and Hilbert-Schmidt distances for Werner states. Moreover, we obtain analytical insight into the problem of characterising the closest separable states for the Bell state in terms of trace distance. Then we also discuss an example of bound entanglement, i.e. an entangled state that cannot be detected by the partial transpose criterion.
Then we move on to the multipartite case, where we consider 3- and 4-qubit GHZ and W states. We show that our algorithm can capture various notions of multipartite entanglement, including full separability, biseparability and even triseparability. We again estimate noise thresholds, finding excellent agreement with previously known bounds. Moreover, we explain how, from the numerical output of the algorithm, one can obtain analytical bounds on the noise thresholds.
Additionally, in Appendix B we compare our neural network algorithm to a naive gradient-descent based heuristic to show its advantage. In Appendix C we examine the performance of the algorithm on random bipartite two-qubit density matrices, and compare it to the optimal solution obtained via a semidefinite program (SDP). We conjecture an analytic ansatz of the closest separable state for a general 2-qubit state, which we find to be very close to the solutions found by the neural network, and prove a bound on the trace distance. In Appendix D, we detail a method to obtain a strict lower bound on the separability threshold compatible with our method.

A. Bipartite case
We start our benchmarking with two classes of highly symmetric bipartite states. Isotropic states are defined as

$$\rho_{\rm iso}(q) = q \, |\phi^+_d\rangle\langle\phi^+_d| + (1 - q) \, \frac{I_{12}}{d^2}, \tag{12}$$

where $0 \leq q \leq 1$, $I_{12}$ is the identity operator on the joint space, and

$$|\phi^+_d\rangle = \frac{1}{\sqrt{d}} \sum_{i=0}^{d-1} |ii\rangle \tag{13}$$

is the maximally entangled state of local dimension $d$. Werner states are defined as

$$\rho_{\rm Wer}(q) = q \, \frac{2}{d(d-1)} \left(I_{12} - P_{\rm sym}\right) + (1 - q) \, \frac{2}{d(d+1)} \, P_{\rm sym}, \tag{14}$$

where $P_{\rm sym} = \frac{1}{2}(I_{12} + F_{12})$, with $F_{12} = \sum_{i,j}^{d} |i\rangle\langle j|_1 \otimes |j\rangle\langle i|_2$ the flip operator. These classes of states represent a good benchmark as the separability thresholds, i.e. the value of $q$ for which the state becomes separable, are known analytically. Specifically, isotropic states are separable for $q \leq \frac{1}{d+1}$, while Werner states are separable for $q \leq \frac{1}{2}$. Hence when running our algorithm for these classes of target states, for different values of $q$, we expect to find a distance to the closest separable state that vanishes when $q$ approaches the separability threshold. This is precisely what we observe. We run the neural network independently for 11 values of $q$, and additionally for the exact separability boundary value. The results for both the trace distance and Hilbert-Schmidt distance for $d \leq 5$ are depicted in Fig. 2 (each line is plotted with its respective loss function, the trace or Hilbert-Schmidt distance). They confirm that the algorithm works properly in this regime, finding a sharp transition at the known separability thresholds. When making a linear fit to the data that is outside the seemingly flat separable region, we recover the thresholds with a precision of at least $10^{-4}$. To give an example of the running time on a personal computer (an Intel i7-8700k CPU @ 3.70 GHz with 6 cores (12 threads) and 16 GB RAM), for isotropic states the training for a single target state for $d = 5$ took at most 15 minutes, while for $d = 2$ it took at most 30 seconds.

Figure 2. Distance to the closest separable state found by the neural network for isotropic and Werner states (first two panels) and for the bound entangled family of Eq. (20) (right). Note that for the Werner states with $2 \leq d \leq 5$, all the trace distance curves essentially overlap and are thus not distinguishable on the plot. For the right plot we did 2 runs and kept the best results to improve the smoothness of the curves.

When the trace distance is found to be smaller than $2 \cdot 10^{-3}$, we choose to stop the training, and conclude that the state is separable. Otherwise we run the algorithm until the resulting trace distance converges, i.e. it does not change by more than $2 \cdot 10^{-4}$ in one epoch. Additionally, for $d = 10$ we examine the Werner states, also plotted in Fig. 2. For such a large state, with $K = 100$, training took about 1 hour 15 minutes on a personal computer for a single epoch (3000 batches), which was reduced to 45 minutes when training on a GPU (an RTX-3080 with 10 GB memory). Due to the increased runtime we only ran one epoch for each point in Fig. 2, and did not wait until convergence. We observe that the neural network struggles more in finding a closest separable state in the separable area; however, it works remarkably well in the entangled regime, and still manages to give qualitatively interpretable results on where the entanglement boundary lies. For increased accuracy one could run the algorithm several times independently and take the smallest value for each $q$, or one could run the algorithm with a larger batch size $K$. For example, for the separability boundary at $q = 0.5$, by using $K = 150$ instead of 100, after 5 epochs (5 times 3000 batches), the trace distance reduced to 0.024 from the 0.045 seen in Fig. 2.
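The two benchmark families can be written down explicitly, as in the following NumPy sketch. It assumes the isotropic state is the maximally entangled state mixed with the maximally mixed state, and that the Werner family puts weight $q$ on the normalized antisymmetric projector and $1 - q$ on the normalized symmetric one; this parametrization is our reading of the text, chosen because it reproduces the $d$-independent threshold $q \leq 1/2$. For both families positivity of the partial transpose is known to coincide with separability, which the example uses as a check of the thresholds $1/(d+1)$ and $1/2$.

```python
import numpy as np

def isotropic_state(q, d):
    """q |phi+_d><phi+_d| + (1 - q) I / d^2."""
    phi = np.eye(d).reshape(d * d) / np.sqrt(d)    # sum_i |ii> / sqrt(d)
    return q * np.outer(phi, phi) + (1 - q) * np.eye(d * d) / d**2

def werner_state(q, d):
    """Weight q on the normalized antisymmetric projector, 1 - q on the symmetric one."""
    identity = np.eye(d * d)
    flip = identity.reshape(d, d, d, d).transpose(1, 0, 2, 3).reshape(d * d, d * d)
    p_sym = 0.5 * (identity + flip)
    p_asym = identity - p_sym
    return q * 2 / (d * (d - 1)) * p_asym + (1 - q) * 2 / (d * (d + 1)) * p_sym

def min_pt_eigenvalue(rho, d):
    """Smallest eigenvalue of the partial transpose on the second subsystem."""
    pt = rho.reshape(d, d, d, d).transpose(0, 3, 2, 1).reshape(d * d, d * d)
    return np.linalg.eigvalsh(pt).min()

for d in (2, 3, 4):
    q_iso = 1 / (d + 1)
    print(d,
          min_pt_eigenvalue(isotropic_state(q_iso + 0.01, d), d) < 0,
          min_pt_eigenvalue(werner_state(0.5 + 0.01, d), d) < 0)
```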
For isotropic states, the exact Hilbert-Schmidt distance to the closest separable state is known. In dimension 2, we retrieve the value of $1/\sqrt{3}$ that was shown in Refs. [13,54]. A generalized formula is given by Ref. [15], which we recover up to excellent precision for $q = 1$ and up to $d = 10$ in Fig. 3. Moreover, we examine the trace distance, and based on the results in Figs. 2 and 3, we conjecture that the trace distance from the isotropic states to their respective closest separable states follows Eq. (16). Using the semidefinite program described in Appendix C, we can verify that a trace distance of 1/2 is optimal for the Bell state, i.e. for $d = 2$. We have not found this explicitly proven in the literature, even though the related concept of finding the closest classical state to the Bell state has been well studied [48][49][50]. For Werner states, similarly drawing from Fig. 2, we conjecture that the distance to the closest separable state is given by Eq. (18) for the Hilbert-Schmidt distance, and by Eq. (17) for the trace distance. To the best of our knowledge, these relations are not proven in the literature. We note that plotting these conjectured equations in Fig. 2 (first two panels) would result in lines that are indistinguishable from the data.
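A simple way to obtain reference values for such distances is to compare an isotropic state with the isotropic state sitting exactly at the separability threshold. Since the latter is separable, the resulting numbers are upper bounds on the distance to the closest separable state; whether they are optimal in general is not established by this sketch. The code assumes the isotropic parametrization with the maximally mixed state and the threshold $q = 1/(d+1)$, and the function names are our own.

```python
import numpy as np

def isotropic_state(q, d):
    """q |phi+_d><phi+_d| + (1 - q) I / d^2."""
    phi = np.eye(d).reshape(d * d) / np.sqrt(d)
    return q * np.outer(phi, phi) + (1 - q) * np.eye(d * d) / d**2

def distances_to_threshold_state(q, d):
    """Trace and Hilbert-Schmidt distance to the isotropic state at q = 1/(d+1)."""
    delta = isotropic_state(q, d) - isotropic_state(1 / (d + 1), d)
    mu = np.linalg.eigvalsh(delta)
    trace_dist = 0.5 * np.sum(np.abs(mu))
    hs_dist = np.sqrt(np.sum(mu**2))       # sqrt(Tr[delta^2]) for Hermitian delta
    return trace_dist, hs_dist

d_tr, d_hs = distances_to_threshold_state(1.0, 2)
print(d_tr, d_hs)  # approximately 0.5 and 1/sqrt(3)
```

For the two-qubit Bell state this reproduces the values $1/2$ and $1/\sqrt{3}$ quoted above.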
While the algorithm performs a numerical optimisation, one can nevertheless obtain some analytical insight from the output. Here we illustrate this point. First, we could partially characterize the closest separable state for the two-qubit Bell state, i.e. $|\phi^+_2\rangle$. Setting the Bell state as the target, and using the trace distance as the loss function of the neural network, at the end of training it finds the separable state of Eq. (19), with $a = \frac{1}{4}$, $|b| \in [0, 1/4]$ and small $c$ values. However, when using the Hilbert-Schmidt distance $D_{\rm HS}$ as the loss, the neural network converges to the $a = \frac{1}{6}$, $b = 0$ and $c = 0$ solution. Both solutions have the same trace distance of 0.5 from the Bell state. From these two extremes, we constructed the ansatz of Eq. (19) for the closest separable state and verified that for $c = 0$, $a \in [0, \frac{1}{6}]$ and $b \in [0, a]$ they indeed all give a trace distance of $\frac{1}{2}$. We go further and find other values of $(a, b, c)$ for which the trace distance is $\frac{1}{2}$. For example, if all parameters are set to be real, and $b = a$, then it is a closest separable state for $(a - \frac{1}{8})^2 + c^2 \leq \frac{1}{8}$. The same holds for $a = 1/6$, $24c^2 - \frac{1}{6} \leq b \leq \frac{1}{6}$ (with all parameters real). Clearly there are countless others, but characterizing the whole range of satisfactory $(a, b, c)$ values is beyond the scope of this paper. This analysis stands here to show how one can gain insight by looking at the output state of the neural network.
To conclude our discussion of the bipartite case, we consider a family of entangled states that feature bound entanglement. Specifically, we consider the class of states introduced in Ref. [60], although we adopt the parametrization used in Ref. [61]. This family of 2-qutrit states exhibits bound entanglement, i.e. a PPT entangled region. The states are given by Eq. (20), with $\beta_\pm = \frac{5}{2} \pm q$ and $q \in [-2.5, 2.5]$; however, we only consider $q \in [0, 2.5]$ since the negative $q$ regime gives the same states up to permutations. It is known that $\rho_q$ is separable for $q \in [0, 0.5]$, is PPT entangled for $q \in (0.5, 1.5]$ and is NPT entangled for $q \in (1.5, 2.5]$. For several values of $q$ we train the neural network to approximate $\rho_T = \rho_q$, and display the results in Fig. 2. We can see that by explicitly constructing the separable decomposition, our results are not sensitive to whether the partial transpose is positive or negative, and the neural network approach successfully identifies the separable and entangled regions.

Figure 3. Distances to the closest separable state for isotropic and Werner states. Eq. (15) gives the exact expression for the isotropic state as proven in Ref. [15]. We conjecture the general formula for the trace distance of the isotropic state in Eq. (16). For the Werner states, we conjecture the general formula for the trace distance in Eq. (17) and for the Hilbert-Schmidt distance in Eq. (18).
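The PPT and NPT regions quoted above can be checked numerically. Since the defining equation of the family is not reproduced here, the sketch below assumes the commonly used form of the two-qutrit family of Ref. [60], namely $\rho_q = \frac{2}{7} |\phi^+_3\rangle\langle\phi^+_3| + \frac{\beta_+}{7} \sigma_+ + \frac{\beta_-}{7} \sigma_-$ with $\sigma_+ = \frac{1}{3}(|01\rangle\langle 01| + |12\rangle\langle 12| + |20\rangle\langle 20|)$ and $\sigma_-$ its swapped counterpart. This form, and the function names, are assumptions and should be compared against Eq. (20) of the original work.

```python
import numpy as np

def bound_entangled_family(q):
    """Assumed form of the two-qutrit family with beta_pm = 5/2 +- q."""
    d = 3
    phi = np.eye(d).reshape(d * d) / np.sqrt(d)   # maximally entangled state

    def proj(i, j):
        v = np.zeros(d * d)
        v[i * d + j] = 1.0
        return np.outer(v, v)                     # |ij><ij|

    sigma_plus = (proj(0, 1) + proj(1, 2) + proj(2, 0)) / 3
    sigma_minus = (proj(1, 0) + proj(2, 1) + proj(0, 2)) / 3
    beta_plus, beta_minus = 2.5 + q, 2.5 - q
    return (2 * np.outer(phi, phi) + beta_plus * sigma_plus
            + beta_minus * sigma_minus) / 7

def min_pt_eigenvalue(rho, d=3):
    """Smallest eigenvalue of the partial transpose on the second subsystem."""
    pt = rho.reshape(d, d, d, d).transpose(0, 3, 2, 1).reshape(d * d, d * d)
    return np.linalg.eigvalsh(pt).min()

for q in (0.25, 1.0, 2.0):
    print(q, min_pt_eigenvalue(bound_entangled_family(q)))
```

Under this assumed form, the partial transpose stays positive at $q = 1.0$ (inside the PPT entangled region) and becomes negative at $q = 2.0$, so the PPT criterion alone cannot separate the first two regions, whereas an explicit separable decomposition can.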

B. Multipartite case
We mix both the GHZ and the W state with the maximally mixed state, as we did for the isotropic states in Eq. (12). For three qubits, we use the neural network to distinctly examine

6. triseparability, as a generalization of Eq. (4), namely $\rho = \sum_k p_k \, \rho^k_{I^1_k} \otimes \rho^k_{I^2_k} \otimes \rho^k_{I^3_k}$, with $(I^1_k|I^2_k|I^3_k)$ a partitioning of $I$ for each $k$, and
7. biseparability, as in Eq. (4).
For the biseparable and triseparable cases, on a technical level, for each $k$ we ask the neural network to output density matrices for all possible partitions, i.e. for each $k$ it actually outputs 3 terms at a time for the 3-party case, and 6 terms for the 4-party case.
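The partitions that enter such decompositions can be enumerated directly. The following sketch lists all ways of splitting $n$ parties into $m$ non-empty groups; the function name is illustrative. It gives 3 bipartitions for three parties and 6 splits of four parties into three groups, matching the counts quoted above, while four parties admit 7 bipartitions.

```python
from itertools import product

def partitions_into_blocks(n, m):
    """All ways to split parties 1..n into m non-empty, unlabeled groups."""
    found = set()
    for labels in product(range(m), repeat=n):
        if len(set(labels)) != m:          # every group must be non-empty
            continue
        blocks = frozenset(
            frozenset(i + 1 for i in range(n) if labels[i] == b)
            for b in range(m))
        found.add(blocks)                  # the set removes relabelings
    return sorted(sorted(sorted(b) for b in blocks) for blocks in found)

print(partitions_into_blocks(3, 2))        # [[[1], [2, 3]], [[1, 2], [3]], [[1, 3], [2]]]
print(len(partitions_into_blocks(4, 3)))   # 6
print(len(partitions_into_blocks(4, 2)))   # 7
```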
We present the results in Fig. 4, except for 4-qubit separability with respect to a fixed partition, to not overcrowd the figure, however, note that those results are qualitatively similar. The consistent straight lines formed from independent runs give us confidence that the algorithm works well for approximately detecting the separability boundaries.
From Fig. 4, we extract estimates of the separability bound by fitting linear curves to the data that is outside the seemingly flat separable region and close to the boundary. Moreover, inspired by Ref. [28], we show how to obtain actual lower bounds on the separability thresholds in Appendix D. All results are given in Tables I and II. With the flexibility of the current technique, we are able to quickly get estimates and bounds on the noise thresholds for many notions of separability, or alternatively, entanglement. These results can be improved by taking more points or running the algorithm multiple times. In cases where the exact threshold is known, our results are close to it. Where the boundary is not known to be exact, we can see how close it is to being tight. We observe that in these cases the analytic upper bounds seem to be close to optimal, or in fact optimal. Finally, we established estimates for many notions of separability for which we did not find previous estimates or bounds in the literature. These were complemented by lower bounds provided by J. Shang and O. Gühne based on the method in Ref. [28] in a private communication.

Table II. Four-party separability thresholds for the noisy GHZ and W states. Results are given as (i) an estimate of the bound obtained via a linear fit close to the threshold and (ii) a certified lower bound obtained from the method discussed in Appendix D. Bounds marked with a dagger (†) are lower bounds provided by J. Shang and O. Gühne based on the method in Ref. [28] in a private communication. All other previously known bounds are upper bounds. An asterisk (*) denotes cases where the bound is known to be exact.
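The linear-fit estimate of a threshold can be sketched as follows. The function name and the cutoff `flat_tol` that separates the flat region from the rest are assumptions. The input data in the example is not network output: it is a stand-in, namely the trace distance between the two-qubit isotropic state and the isotropic state at $q = 1/3$, which grows linearly with slope $3/4$ above the threshold.

```python
import numpy as np

def estimate_threshold(q_values, distances, flat_tol=1e-3):
    """Fit a line to the points outside the flat region and return its zero."""
    q = np.asarray(q_values, dtype=float)
    dist = np.asarray(distances, dtype=float)
    mask = dist > flat_tol                       # discard the flat separable region
    slope, intercept = np.polyfit(q[mask], dist[mask], 1)
    return -intercept / slope

# Stand-in data on 11 values of q (see the lead-in for its origin).
q_values = np.linspace(0.0, 1.0, 11)
distances = np.maximum(0.0, 0.75 * (q_values - 1 / 3))
q_star = estimate_threshold(q_values, distances)
print(q_star)  # close to 1/3
```

With real training output the points carry numerical noise, so restricting the fit to points close to the boundary, as described above, may matter more than in this idealized example.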

VI. CONCLUSION AND OUTLOOK
In summary, we have addressed the question of constructing the closest separable state to a given target state, by using a neural network as a compact model for separable states. We avoided the bottleneck of having to explicitly model many (up to $\prod_{j=1}^{n} d_j^2$) separable pure states in a decomposition by using a single neural network to represent them all. We demonstrated that by training the model independently on multiple states from a family, we can identify the separability boundary well. We did this for examples where the boundaries are known, PPT entangled states, as well as 3- and 4-party multiqubit states. Additionally, we showed how analytical insight can be gained from the output of the algorithm. In the bipartite case, we partially characterized the closest separable state for the two-qubit Bell state, and conjectured relations for the distances in the case of arbitrary dimension. In the multipartite case we showed how to obtain strict lower bounds on the noise threshold.
The technique presented here opens up avenues for a variety of numeric applications in quantum foundations. In particular, for any task with reasonable Hilbert space sizes, it is possible to optimize over the set of separable states, as long as the loss function is differentiable. Among other potential applications, it can be especially helpful for obtaining (estimates or bounds on) entanglement measures, measures of robustness, separable ground state energies, and with minor modifications can be easily adapted to finding the closest classical state. Moreover, a particularly fruitful avenue for research could be focused on combining our approach with other generative neural network approaches to quantum state representations, namely "quantum neural network states" [40,41], particularly their extension to density matrices [42][43][44][45]. Using such an ansatz for the separability problem has been examined in Ref. [39]. Such prospects of further developing the algorithms give the promise of exciting novel numerical tools for a broad range of tasks, both for numerical work and gaining analytic insight.

VII. CODE AVAILABILITY
We have made sample code available at www.github.com/Antoine0Girardin/Neural-network-for-separability-problem.

VIII. ACKNOWLEDGMENTS
We thank Pavel Sekatski for discussions, and Otfried Gühne and Jiangwei Shang for pointing out the work of Ref. [28] and for providing additional values for Table II. We acknowledge financial support from the Swiss National Science Foundation (project 2000021_192244/1 and NCCR QSIT). T.K. additionally acknowledges funding from the Swiss National Science Foundation Doc.Mobility grant (project P1GEP2_199676).

Appendix A: Technical details of the utilized neural networks
The main idea of how we use neural networks can be found in the main text, while the implemented code can be found in the online repository provided. Here, we briefly describe some of the technical details and hyperparameters that we used.
As described in the main text, we use a feedforward neural network to represent a generic separable state of a fixed dimension and separability structure. We use a multilayer perceptron with rectified linear units as activations, except in the final layer where we use sigmoid activations. The outputs are normalized via a softmax function for the probability vectors, and by dividing by the 2-norm for the complex entries of the pure states. For the calculations in the main text we employed a single hidden layer, with a width of 100, or 200 for more difficult calculations. The number of elements in the separable decomposition, $K$, is analytically upper bounded by $\prod_j d_j^2$; however, in the implementation, typically $K = \prod_j d_j$ gives satisfactory results and allows for much quicker training. For training we use the Adadelta optimizer. [...] where the expected closest distance should be 0.8 (which the neural network approaches to $10^{-6}$ precision).
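To make the normalization steps of this appendix concrete, the following is a minimal NumPy sketch of a single forward pass, not the implementation from the repository. Several details are assumptions made only for illustration: the input vector `x`, the random weight initialization in `init_params`, the layout of the output vector, and the affine map `2 * out - 1` that lets the sigmoid outputs take both signs before they are used as real and imaginary parts.

```python
import numpy as np
from functools import reduce

def init_params(n_in, dims, K, width=100, seed=0):
    """Random weights for one hidden layer (hypothetical initialization)."""
    rng = np.random.default_rng(seed)
    n_out = K + 2 * K * sum(dims)  # K probabilities + re/im parts of all local states
    return (rng.normal(size=(width, n_in)) / np.sqrt(n_in), np.zeros(width),
            rng.normal(size=(n_out, width)) / np.sqrt(width), np.zeros(n_out))

def forward(x, params, dims, K):
    """Map an input vector x to a separable density matrix with K product terms."""
    W1, b1, W2, b2 = params
    hidden = np.maximum(0.0, W1 @ x + b1)             # ReLU hidden layer
    out = 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))   # sigmoid final layer

    probs = np.exp(out[:K] - out[:K].max())           # softmax -> probability vector
    probs /= probs.sum()

    local_states, offset = [], K
    for d in dims:
        # ASSUMPTION: shift sigmoid outputs to (-1, 1) so amplitudes can change sign.
        block = (2.0 * out[offset:offset + 2 * K * d] - 1.0).reshape(K, d, 2)
        psi = block[..., 0] + 1j * block[..., 1]
        psi /= np.linalg.norm(psi, axis=1, keepdims=True)  # divide by the 2-norm
        local_states.append(psi)
        offset += 2 * K * d

    dim = int(np.prod(dims))
    rho = np.zeros((dim, dim), dtype=complex)
    for k in range(K):
        v = reduce(np.kron, [psi[k] for psi in local_states])  # product pure state
        rho += probs[k] * np.outer(v, v.conj())
    return rho
```

For two qubits with $K = \prod_j d_j = 4$, calling `forward(np.ones(3), init_params(3, (2, 2), 4), (2, 2), 4)` returns a unit-trace, positive semidefinite matrix that is separable by construction, whatever the weights are.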

Appendix B: Comparing with gradient descent
To see the advantage of using a neural network, we compare our algorithm with the naive optimization algorithm of gradient descent, for the simplest case of two qubits.
We parametrize the quantum state in a similar way as in Eq. (9), i.e. the free parameters are the K probabilities and the real and imaginary parts of the pure states composing the separable state according to Eq. (1), with d = 2, K = 16. The gradient descent algorithm varies these parameters to minimize the trace distance with respect to a target state, which we chose to be the Bell state, namely Eq. (13), with d = 2. The gradient descent algorithm was run with an initial learning rate of 1, decreased by a factor of 0.98 each round for 250 rounds, and with a momentum factor of 0.2.
Recall that the neural network, even with one layer, did not have any trouble finding the closest separable state with a trace distance of 0.5. However, as shown in the left panel of Fig. 5, we notice that already for this simple case the gradient descent technique has difficulties in finding the closest state. Somewhat surprisingly, if only real numbers are chosen to represent the state, the gradient descent technique performs better and converges to a good solution. Note that for higher dimensions, e.g. d = 5, the real-valued gradient descent also has difficulties, as shown in the right panel of Fig. 5.
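The baseline described in this appendix can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used for Fig. 5: the gradient is estimated by forward finite differences, the probabilities are obtained from free parameters through a softmax, the initial parameters are drawn from a normal distribution, and the Bell state of Eq. (13) is assumed to be $(|00\rangle + |11\rangle)/\sqrt{2}$. The learning rate, decay factor, momentum and number of rounds are the values quoted above.

```python
import numpy as np

K, D_LOCAL = 16, 2  # K = 16 product terms, local dimension d = 2

def build_state(params):
    """Map a flat real parameter vector to a two-qubit separable density matrix."""
    logits = params[:K]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                     # ASSUMPTION: softmax for the probabilities
    amps = params[K:].reshape(K, 2, D_LOCAL, 2)  # (term, party, amplitude, re/im)
    psi = amps[..., 0] + 1j * amps[..., 1]
    psi /= np.linalg.norm(psi, axis=-1, keepdims=True)
    rho = np.zeros((D_LOCAL**2, D_LOCAL**2), dtype=complex)
    for k in range(K):
        v = np.kron(psi[k, 0], psi[k, 1])
        rho += weights[k] * np.outer(v, v.conj())
    return rho

def trace_distance(a, b):
    """Half the sum of absolute eigenvalues of the (Hermitian) difference."""
    return 0.5 * np.abs(np.linalg.eigvalsh(a - b)).sum()

def numerical_gradient(loss, params, h=1e-6):
    """Forward finite-difference gradient (an assumption; any gradient source works)."""
    base = loss(params)
    grad = np.zeros_like(params)
    for i in range(params.size):
        shifted = params.copy()
        shifted[i] += h
        grad[i] = (loss(shifted) - base) / h
    return grad

def gradient_descent(target, rounds=250, lr=1.0, decay=0.98, momentum=0.2, seed=0):
    """Momentum gradient descent with a geometrically decaying learning rate."""
    rng = np.random.default_rng(seed)
    params = rng.normal(size=K + K * 2 * D_LOCAL * 2)
    loss = lambda p: trace_distance(target, build_state(p))
    velocity = np.zeros_like(params)
    best = loss(params)
    for _ in range(rounds):
        velocity = momentum * velocity - lr * numerical_gradient(loss, params)
        params = params + velocity
        best = min(best, loss(params))  # keep the lowest distance seen so far
        lr *= decay
    return best

# Projector onto (|00> + |11>)/sqrt(2), assumed to be the Bell state of Eq. (13).
bell = np.zeros((4, 4))
bell[np.ix_([0, 3], [0, 3])] = 0.5
```

Calling `gradient_descent(bell)` returns the lowest trace distance reached over the run. Since every state produced by `build_state` is separable, this value can never fall below 0.5; how close it gets depends on the random seed.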

Appendix C: Random states
When benchmarking the method on random states, we noticed that there is a strong connection between the obtained trace distance of the closest separable state and the lowest eigenvalue of the partial transpose. In this appendix, we first show benchmark results for the method on random two-qubit states ($d_1 = d_2 = 2$), where the PPT criterion clearly distinguishes entangled from separable states. We observe a strong correlation between the trace distance and Hilbert-Schmidt distance of the closest separable state and the smallest eigenvalue of the partial transpose of the state. We verify these results with an SDP to see how close to the optimal solution the neural network can get for two-qubit states. Finally, we present an analytic ansatz of the closest separable state, based on the numerical results of the neural network and our intuition, which we numerically validate to be very close to the actual closest separable state.
In the two-qubit case the positive partial transpose criterion is a necessary and sufficient condition for separability. Thus, in Fig. 6 we plot the distance to the closest separable state obtained by the neural network against the smallest eigenvalue of the partial transpose, which we will refer to as $\lambda$. We tested 400 random states with the trace distance as the loss function, and 300 with the Hilbert-Schmidt distance as the loss (the neural network was retrained 5 times for each state and the lowest distance was kept).
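The quantity $\lambda$ on the horizontal axis can be computed in a few lines. The sketch below is illustrative only; in particular, sampling random states as normalized $GG^\dagger$ with a complex Gaussian matrix $G$ is an assumption, since the sampling procedure is not specified here.

```python
import numpy as np

def partial_transpose(rho, d1=2, d2=2):
    """Transpose the second subsystem of a (d1*d2) x (d1*d2) matrix."""
    return rho.reshape(d1, d2, d1, d2).transpose(0, 3, 2, 1).reshape(d1 * d2, d1 * d2)

def min_pt_eigenvalue(rho, d1=2, d2=2):
    """Smallest eigenvalue of the partial transpose (negative iff NPT)."""
    return np.linalg.eigvalsh(partial_transpose(rho, d1, d2)).min()

def random_state(dim, rng):
    """ASSUMPTION: random density matrix G G^dagger / Tr(G G^dagger)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

# Reference points: a Bell state and the maximally mixed state.
bell = np.zeros((4, 4))
bell[np.ix_([0, 3], [0, 3])] = 0.5
lam_bell = min_pt_eigenvalue(bell)            # -1/2
lam_mixed = min_pt_eigenvalue(np.eye(4) / 4)  # +1/4
```

For two qubits, a state is entangled exactly when `min_pt_eigenvalue` is negative, which is what separates the left and right halves of Fig. 6.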
First, we observe that the neural network achieves close to zero distance in the separable regime for all states. Clearly it cannot and should not reach zero distance for entangled states (i.e. on the left side of the figures, where $\lambda < 0$). We observe a much stronger relation: in fact the Hilbert-Schmidt distances of the closest separable state seem to line up on a line with slope $-1$, while the trace distance results seem to be below this. We formulate these two observations; namely in the entangled regime, for $\lambda(\rho_T) < 0$,
$$D_{HS}(\rho_T, \rho_{CSS_{HS}}) = -\lambda(\rho_T), \tag{C1}$$
$$D_{Tr}(\rho_T, \rho_{CSS_{Tr}}) \leq -\lambda(\rho_T), \tag{C2}$$
where we explicitly denoted which distance was minimized in the subscript of $\rho_{CSS}$.
It is possible to use an SDP to find the closest separable state with respect to the trace distance by using the PPT criterion.
The SDP has a dual form and can be expressed as [67]
$$\text{minimize} \quad \mathrm{Tr}\,Y + \mathrm{Tr}\,Z, \tag{C3}$$
with $Y$ and $Z$ Hermitian matrices, $X = \rho_T - \rho_{CSS}$, $\rho_T$ being the target state and $\rho_{CSS}$ the separable state, $\rho_{CSS} \geq 0$, [...]. The results of the SDP for all random states are plotted in Fig. 6.
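For orientation, one standard way of writing such a program combines the textbook semidefinite program for the trace norm with the PPT constraint. The form below is a sketch under our own choice of normalization and sign conventions, and is not necessarily identical to the expression given in Ref. [67]:

```latex
\begin{aligned}
\text{minimize}\quad & \tfrac{1}{2}\left(\mathrm{Tr}\,Y + \mathrm{Tr}\,Z\right) \\
\text{subject to}\quad & \begin{pmatrix} Y & -X \\ -X^\dagger & Z \end{pmatrix} \geq 0,
  \qquad X = \rho_T - \rho_{CSS}, \\
& \rho_{CSS} \geq 0, \qquad \rho_{CSS}^{\Gamma} \geq 0, \qquad \mathrm{Tr}\,\rho_{CSS} = 1 .
\end{aligned}
```

With this normalization the optimal value equals the trace norm $\|\rho_T - \rho_{CSS}\|_1$ minimized over PPT states, i.e. twice the trace distance. For two qubits the PPT set coincides with the separable set, so the optimum is the exact closest separable state.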
Finally, we provide an ansatz for the closest separable state with respect to the trace distance. Intuitively, we set the smallest eigenvalue of the partial transpose to be 0 instead of negative, and adjust the others such that the trace remains unchanged.
Theorem. Let $\rho_T$ be an entangled state whose partial transpose has an eigendecomposition $U D U^\dagger$, with $D = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3, \lambda_4)$, where $\lambda_1$ is the smallest eigenvalue (i.e. $\lambda_1 \equiv \lambda(\rho_T)$). Then let our ansatz of the closest separable state be $\rho' = (U D' U^\dagger)^\Gamma$ with $D' = \mathrm{diag}(0, \lambda_2 + \frac{\lambda_1}{3}, \lambda_3 + \frac{\lambda_1}{3}, \lambda_4 + \frac{\lambda_1}{3})$ and $X^\Gamma$ denoting the partial transpose of $X$. If $\rho'$ is a valid density matrix then
$$D_{Tr}(\rho_T, \rho') \leq -\lambda_1. \tag{C4}$$
Before proceeding to the proof, note that $\rho'$ is only actually a separable density matrix if $\lambda_2 + \frac{\lambda_1}{3} > 0$. However, only about 0.1% of random states have a $\rho'$ approximation which is not a valid separable density matrix. The trace distances of the approximations of the 400 random states examined previously are depicted in Fig. 6.
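The ansatz of the theorem is straightforward to evaluate numerically. The following sketch, with helper names of our own choosing, builds $\rho'$ from a two-qubit target state and returns it together with $\lambda_1$; the Bell state $(|00\rangle + |11\rangle)/\sqrt{2}$ is used as a check, for which the bound is saturated.

```python
import numpy as np

def partial_transpose(m):
    """Partial transpose on the second qubit of a 4x4 matrix."""
    return m.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)

def trace_distance(a, b):
    return 0.5 * np.abs(np.linalg.eigvalsh(a - b)).sum()

def ansatz_state(rho_t):
    """Return (rho', lambda_1): zero the smallest PT eigenvalue, shift the others by lambda_1/3."""
    lam, U = np.linalg.eigh(partial_transpose(rho_t))  # eigenvalues in ascending order
    shifted = lam.copy()
    shifted[0] = 0.0
    shifted[1:] += lam[0] / 3.0                        # keeps the trace equal to 1
    rho_prime = partial_transpose(U @ np.diag(shifted) @ U.conj().T)
    return rho_prime, lam[0]

bell = np.zeros((4, 4))
bell[np.ix_([0, 3], [0, 3])] = 0.5
rho_prime, lam1 = ansatz_state(bell)
dist = trace_distance(bell, rho_prime)  # equals -lam1 = 0.5 for the Bell state
```

Whether $\rho'$ is a valid separable state can be checked by verifying that both `rho_prime` and `partial_transpose(rho_prime)` have no negative eigenvalues.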
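Proof. Recall that $D_{Tr}(\rho_T, \rho') = \frac{1}{2}\sum_i |\mu_i|$, where $\{\mu_i\}_i$ is the set of eigenvalues of the difference $\rho_T - \rho'$. As a first step let us examine this difference,
$$\rho_T - \rho' = \left(U (D - D') U^\dagger\right)^\Gamma = \frac{\lambda_1}{3}\left(U (4E_{11} - \mathbb{I}) U^\dagger\right)^\Gamma = \frac{\lambda_1}{3}\left(4 (uu^\dagger)^\Gamma - \mathbb{I}\right), \tag{C5}$$
where $E_{11}$ is the matrix with a single nonzero entry in its first position, and thus $u$ is the first column of $U$.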
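To prove the theorem we must show that no matter what $u$ appears in the decomposition, the trace distance is bounded, namely that $\max_u D_{Tr}(\rho_T, \rho') \leq -\lambda_1$, which, after canceling out $-\lambda_1$, reads explicitly as
$$\max_u \sum_i \left| \mathrm{e.v.}_i\left(4 (uu^\dagger)^\Gamma - \mathbb{I}\right) \right| \leq 6, \tag{C6}$$
where we have used the notation $\mathrm{e.v.}_i$ for the $i$-th eigenvalue.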
Using that the identity matrix is jointly diagonalizable with $(uu^\dagger)^\Gamma$, the left-hand side becomes
$$\max_u \sum_{i=1}^{4} |4\nu_i - 1|, \tag{C7}$$
where $\{\nu_i\}_{i=1}^{4}$ are the eigenvalues of $(uu^\dagger)^\Gamma$ in nondecreasing order. Notice that the partial transpose preserves the trace, so $\sum_i \nu_i = 1$, since $u$ is the (unit-length) first column of a unitary matrix. So essentially, we must maximize (C7) by distributing 1 among the four eigenvalues $\nu_i$. Due to the absolute value, the value $\frac{1}{4}$ becomes a divider: eigenvalues below it should be as small as possible, while eigenvalues above it should be as large as possible. So we split the eigenvalues into two parts,
$$S_{\leq \frac{1}{4}} = \{\nu_i : \nu_i \leq \tfrac{1}{4}\}, \qquad S_{> \frac{1}{4}} = \{\nu_i : \nu_i > \tfrac{1}{4}\}. \tag{C8}$$
If we have $S_{\leq \frac{1}{4}} = \{\nu_1\}$, then the best we can do is push down $\nu_1$ to be as negative as possible, so the other eigenvalues can jointly be larger ($\nu_2 + \nu_3 + \nu_4 = 1 + |\nu_1|$ if $\nu_1 < 0$). Additionally, it is known that there can be at most one negative eigenvalue of the partial transpose [68,69], and that all eigenvalues are bounded from below by $-1/2$ [68], i.e.
$$\nu_1 \geq -\tfrac{1}{2}, \tag{C9}$$
$$0 \leq \nu_2 \leq \nu_3 \leq \nu_4. \tag{C10}$$
Using the first, and that the eigenvalues sum to 1, we see that
$$\sum_{i=1}^{4} |4\nu_i - 1| = (1 + 4|\nu_1|) + 4(1 + |\nu_1|) - 3 = 2 + 8|\nu_1| \leq 6.$$
Finally, note that it does not make sense to add more eigenvalues to $S_{\leq \frac{1}{4}}$, since if $S_{\leq \frac{1}{4}} = \{\nu_1, \nu_2\}$, then by Ineq. (C10), namely that $0 \leq \nu_2$, we cannot increase the weight of $S_{> \frac{1}{4}}$, i.e. $\nu_3 + \nu_4 = 1 + |\nu_1| - \nu_2$. So essentially we are in the same position as when $S_{\leq \frac{1}{4}} = \{\nu_1\}$, and thus the upper bound is 6. Placing this back in Expression (C7), or Ineq. (C6), we see that the theorem is proven.
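The key inequality of the proof, that the sum $\sum_i |4\nu_i - 1|$ over the eigenvalues $\nu_i$ of $(uu^\dagger)^\Gamma$ never exceeds 6, can also be checked numerically. The sketch below samples random unit vectors $u \in \mathbb{C}^4$; the sampling distribution and the sample size are arbitrary choices made for illustration.

```python
import numpy as np

def partial_transpose(m):
    """Partial transpose on the second qubit of a 4x4 matrix."""
    return m.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)

def bound_expression(u):
    """Evaluate sum_i |4 nu_i - 1| for the eigenvalues nu_i of (u u^dagger)^Gamma."""
    u = u / np.linalg.norm(u)
    nu = np.linalg.eigvalsh(partial_transpose(np.outer(u, u.conj())))
    return np.abs(4 * nu - 1).sum()

rng = np.random.default_rng(1)
samples = [bound_expression(rng.normal(size=4) + 1j * rng.normal(size=4))
           for _ in range(2000)]
largest = max(samples)  # stays at or below 6 for every sampled u

# A maximally entangled u has nu = (-1/2, 1/2, 1/2, 1/2) and reaches the bound.
bell_vector = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
at_bell = bound_expression(bell_vector)
```

A numerical scan of this kind does not replace the argument above, but it is a quick sanity check on the constants appearing in the proof.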

Appendix D: Lower bounds on separability thresholds
To give separability thresholds, and not only approximations from linear plots, we follow a procedure similar to the one used in Ref. [28]. The core idea is to prove the separability of a target state by showing that it can be written as the convex combination of two separable states. One of the two states is separable because it is generated by the neural network ($\rho_{CSS}$ in the following). The other is separable because it lies within the separability ball around the fully mixed state ($\rho_x$).
In more detail, we start from a state $\rho$ we want to prove separable, and build a new state $\rho_t = (1+\epsilon)\rho - \epsilon\frac{\mathbb{I}}{d}$, where $\epsilon > 0$ is an arbitrarily small number such that $D_{Tr}(\rho_t, \rho)$ is at least larger than the neural network's precision. That new state $\rho_t$ will be further away from the fully mixed state. Then we use the neural network to obtain an approximately closest separable state to $\rho_t$, which we call $\rho_{CSS}$. Finally, we try to find a separable state $\rho_x = (1+\epsilon')\rho - \epsilon'\rho_{CSS}$ by scanning several values of $\epsilon'$. We can use the condition $\mathrm{tr}(\rho_x^2) \leq \frac{1}{2^N - \alpha^2}$ with $\alpha^2 = \frac{2^N}{\frac{17}{2} 3^{N-3} + 1}$ and $N$ the number of parties, from Ref. [9], to show that $\rho_x$ is fully separable. The condition $\mathrm{tr}(\rho_x^2) \leq \frac{1}{d-1}$, with $d$ the dimension of the Hilbert space, from Ref. [70] can be used to show biseparability. If the condition is satisfied, the separability of $\rho$ follows from the convexity of the set of separable states, since $\rho = \frac{1}{1+\epsilon'}\rho_x + \frac{\epsilon'}{1+\epsilon'}\rho_{CSS}$.
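The final scanning step can be sketched as follows. This is an illustration under stated assumptions rather than the procedure's reference implementation: the residual state is taken to be $\rho_x = (1+\epsilon')\rho - \epsilon'\rho_{CSS}$ (the form fixed by requiring unit trace), the purity bound is read as $1/(2^N - \alpha^2)$ with $\alpha^2 = 2^N / (\frac{17}{2} 3^{N-3} + 1)$ for $N$ qubits, and the function names, the scan grid and the numerical tolerance are hypothetical.

```python
import numpy as np

def full_separability_purity_bound(n_qubits):
    """Purity threshold 1 / (2^N - alpha^2) of the fully separable ball (as read above)."""
    alpha_sq = 2**n_qubits / (8.5 * 3**(n_qubits - 3) + 1)
    return 1.0 / (2**n_qubits - alpha_sq)

def certify_separable(rho, rho_css, purity_bound, eps_grid, tol=1e-12):
    """Return the first eps' for which rho_x is a state inside the ball, else None."""
    for eps in eps_grid:
        rho_x = (1 + eps) * rho - eps * rho_css
        if np.linalg.eigvalsh(rho_x).min() < -tol:
            continue                      # rho_x is not positive semidefinite
        if np.trace(rho_x @ rho_x).real <= purity_bound:
            return eps                    # rho is a convex mix of rho_x and rho_css
    return None

# Toy three-qubit example with a product state standing in for the network output.
bound = full_separability_purity_bound(3)
e000 = np.zeros((8, 8))
e000[0, 0] = 1.0
rho = 0.02 * e000 + 0.98 * np.eye(8) / 8
found = certify_separable(rho, e000, bound, np.linspace(0.0, 0.05, 51))
```

In practice `rho_css` would be the separable state returned by the trained network for the shifted target $\rho_t$. A return value of `None` means only that the scan failed to certify separability, not that the state is entangled.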