Optimal entanglement witnesses: a scalable data-driven approach

Multipartite entanglement is the key resource allowing quantum devices to outperform their classical counterparts, and entanglement certification is fundamental to assess any quantum advantage. The only scalable certification scheme relies on entanglement witnessing, typically effective only for special entangled states. Here we focus on finite sets of measurements on quantum states (hereafter called quantum data); and we propose an approach which, given a particular spatial partitioning of the system of interest, can effectively ascertain whether or not the data set is compatible with a separable state. When compatibility is disproved, the approach produces the optimal entanglement witness for the quantum data at hand. Our approach is based on mapping separable states onto equilibrium classical field theories on a lattice; and on mapping the compatibility problem onto an inverse statistical problem, whose solution is reached in polynomial time whenever the classical field theory does not describe a glassy system. Our results pave the way for systematic entanglement certification in quantum devices, optimized with respect to the accessible observables.

Introduction. Preparing and processing strongly entangled many-body states, in both a controlled and scalable way, is the goal of all quantum simulators and computers. Indeed, as the efficient representation of generic entangled many-body states is impossible on classical machines, entanglement represents the key computational resource of these devices [1,2]. As a consequence, developing generic and scalable methods to certify entanglement in multipartite systems stands as a grand challenge of quantum information science. Even more fundamentally, entanglement certification is a central task to probe the resilience of quantum correlations from the microscopic world to the macroscopic one [3].
Any practical method must circumvent the tomographic reconstruction of the full density matrix [4,5] (which implies a number of measurements scaling exponentially with system size), and it should instead infer entanglement from the partial information contained in a given data set of measurement results (hereafter referred to as quantum data). When one adopts this data-driven strategy, the goal of entanglement certification is to establish whether or not the quantum data are compatible with a separable state [5][6][7]. Given an extended quantum system composed of N_tot degrees of freedom, grouped together into N ≤ N_tot clusters [see Fig. 1(a)], the state ρ̂ of the system is separable [8] if it can be written in the form

ρ̂_p := ∫ dλ p(λ) ρ̂_prod(λ),    (1)

where ρ̂_prod(λ) = ⊗_{i=1}^N |ψ_i(λ_i)⟩⟨ψ_i(λ_i)| is a product state of the partition, |ψ_i(λ_i)⟩ being the state of the i-th cluster, parametrized by the parameters λ = (λ_1, ..., λ_i, ..., λ_N), distributed according to p(λ) ≥ 0. The distribution p fully specifies classical correlations across the partition. A multipartite entangled state ρ̂, on the other hand, cannot be written in the above form. Given a set of observables Â_a (a = 1, ..., R), multipartite entanglement is therefore witnessed by the quantum data set {⟨Â_a⟩_ρ}_{a=1}^R [where ⟨Â_a⟩_ρ = Tr(Â_a ρ̂)] if one proves that the latter cannot be reproduced by any separable state. This task is accomplished by proving that the quantum data violate an entanglement witness (EW) inequality, ⟨Ŵ⟩_{ρ_p} = Σ_a W_a ⟨Â_a⟩_{ρ_p} ≥ B_sep, valid for all separable states ρ̂_p [9]. Here the W_a are suitable coefficients and B_sep is the so-called separable bound.

* Electronic address: irenee.frerot@gmail.com
† Electronic address: tommaso.roscilde@ens-lyon.fr
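As a deliberately minimal illustration of the separability notion of Eq. (1) (not part of the algorithm itself; all function names are illustrative), the following Python sketch builds a two-qubit separable state as a discrete mixture of product states and checks the Peres positive-partial-transpose criterion, which every separable state satisfies and which the Bell state violates:

```python
import numpy as np

def product_state(psi1, psi2):
    """Density matrix of the two-qubit product state |psi1> ⊗ |psi2>."""
    psi = np.kron(psi1, psi2)
    return np.outer(psi, psi.conj())

def partial_transpose(rho):
    """Partial transpose over the second qubit of a 4x4 density matrix."""
    r = rho.reshape(2, 2, 2, 2)          # indices (i, j, i', j')
    return r.transpose(0, 3, 2, 1).reshape(4, 4)

rng = np.random.default_rng(0)
# Random separable state: a discrete version of Eq. (1), i.e. a convex
# mixture of product states with uniform weights p(λ) = 1/10.
rho_sep = np.zeros((4, 4), dtype=complex)
for _ in range(10):
    v1 = rng.normal(size=2) + 1j * rng.normal(size=2)
    v2 = rng.normal(size=2) + 1j * rng.normal(size=2)
    rho_sep += 0.1 * product_state(v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2))

# Separable states keep a positive partial transpose (Peres criterion) ...
assert np.linalg.eigvalsh(partial_transpose(rho_sep)).min() > -1e-12
# ... while the entangled Bell state (|00> + |11>)/sqrt(2) does not.
bell = np.zeros(4); bell[0] = bell[3] = 1 / np.sqrt(2)
rho_bell = np.outer(bell, bell)
assert np.linalg.eigvalsh(partial_transpose(rho_bell)).min() < -0.1
```
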
EW operators Ŵ are generally defined based on the properties of special entangled states (e.g. squeezed states, total spin singlets, etc.) [9], and the failure of a data set to violate a given EW inequality does not exclude the existence of a different violated inequality involving the same data, yet to be discovered. This may erroneously suggest that entanglement witnessing is limited only by creativity and physical insight; and that the entanglement witnessing problem ("is a quantum data set compatible with a separable state?") [5][6][7] is generically undecidable. The goal of our work is to show that this is not the case, and that the entanglement witnessing capability of a quantum data set can be exhaustively tested. Our key insight is that the problem of finding the distribution p(λ) which defines the separable state best reproducing the quantum data is a statistical inference problem; remarkably, it has the structure of a convex optimization problem, whose solution can be attained in a time scaling polynomially with the partition size (under mild assumptions) and with the Hilbert-space dimension of the subsystems composing the partition. When the optimal separable state fails to reproduce the quantum data, the distance between the quantum data set {⟨Â_a⟩_ρ} and the optimal separable set {⟨Â_a⟩_{ρ_p}} allows one to reconstruct the optimal EW inequality violated by the quantum data. We benchmark our approach by establishing new EW inequalities violated by the low-temperature states of the Heisenberg antiferromagnetic chain and of the quantum Ising chain; in the latter case, our new EW inequalities outperform all previously known EW criteria for multipartite entanglement. Our work parallels the recent mapping of the Bell-nonlocality detection problem onto an inverse statistical problem [10], and it offers an efficient scheme for entanglement detection in state-of-the-art quantum devices within a device-dependent scenario.

FIG. 1: (a) Partition of a quantum device into N clusters, each of which is subject to M_i measurements; (b) a separable state of the system is described as a probability distribution p(λ) of local states defined by the {λ_i} parameters; (c) our algorithm builds a trajectory of separable states [parametrized by the couplings {K_a} defining p(λ)] which converges to the optimal state best approximating some target quantum data. If that state fails to reproduce the quantum data exactly, the vector joining the optimal separable data and the quantum data reconstructs the optimal EW inequality.
Quantum data set. For definiteness, we assume that on each subsystem i = 1, ..., N, M_i local observables Ô^(i)_m can be measured (m = 1, ..., M_i; e.g. the Pauli matrices σ̂^(i)_a, a ∈ {x, y, z}, for individual qubits taken as subsystems). For convenience, we denote the local identity operator by Ô^(i)_0 := 1. In order to reveal entanglement, these local observables must be non-commuting [11]. From these local observables, we build p-body correlators of the form Ô_m = ⊗_{i=1}^N Ô^(i)_{m_i}, where m_i = 0 for N − p subsystems. Arbitrary observables can be built as linear combinations of correlators, such as e.g. powers of the collective spin variables [12,13] Ĵ_a = Σ_i σ̂^(i)_a / 2 (a = x, y, z) for systems of qubits. Hence we shall assume that R observables of the form Â_a = Σ_m x^(a)_m Ô_m can be measured, where the sum runs over all strings m = (m_1, ..., m_N), and the x^(a)_m are arbitrary real coefficients. The quantum data {⟨Â_a⟩_ρ}_{a=1}^R form the basis for entanglement certification in our scheme. The problem of entanglement certification based on a data set has been discussed in the past, but the proposed methods either lack scalability [6], or are scalable only under restrictive assumptions (short-range correlations, low-dimensional geometry) [7]. Our method aims at surpassing these limitations.
Mapping onto an inverse statistical problem. The key aspect behind our approach is the limited information content of separable states. The parameters λ specifying the product state ρ̂_prod(λ) can indeed be chosen as vectors of 2d_i − 2 real parameters per cluster, where d_i is the dimension of the local Hilbert space of the i-th subsystem [14]. The average of the observable Â_a on a separable state reads

⟨Â_a⟩_{ρ_p} = ∫ dλ p(λ) A_a(λ),    (2)

where A_a(λ) := Tr[ρ̂_prod(λ) Â_a] = Σ_m x^(a)_m Π_i ⟨ψ_i(λ_i)|Ô^(i)_{m_i}|ψ_i(λ_i)⟩. Given a product state, the calculation of each term in the sum defining A_a(λ) is clearly an operation scaling as O(N). Once the quantum nature of the state has been absorbed into A_a(λ), the calculation of ⟨Â_a⟩_{ρ_p}, Eq. (2), is a classical statistical average over the distribution p which, from a statistical-physics viewpoint, can be regarded as the Boltzmann distribution p(λ) =: exp[−H(λ)]/Z of a classical field theory on a lattice (normalized by the factor Z), with a vector field λ_i defined on each of the N clusters [Fig. 1(b)]. The complexity of separable states is fundamentally inscribed in the effective Hamiltonian H(λ), which is a priori arbitrary, namely it is specified by a number O(exp(N)) of parameters.
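The mapping underlying Eq. (2) can be checked numerically on a small instance: for a discrete distribution p(λ) over Bloch vectors, the quantum expectation value on the separable state coincides with the classical average of the factorized function A(λ). A minimal sketch (names are illustrative):

```python
import numpy as np

PZ = np.array([[1, 0], [0, -1]], dtype=complex)

def qubit(n):
    """Pure qubit density matrix with Bloch vector n: (1 + n·σ)/2."""
    PX = np.array([[0, 1], [1, 0]], dtype=complex)
    PY = np.array([[0, -1j], [1j, 0]])
    return 0.5 * (np.eye(2) + n[0] * PX + n[1] * PY + n[2] * PZ)

rng = np.random.default_rng(3)
weights = rng.random(5); weights /= weights.sum()      # discrete p(λ), 5 points
bloch = rng.normal(size=(5, 2, 3))                     # λ = one unit vector per cluster
bloch /= np.linalg.norm(bloch, axis=2, keepdims=True)

# Quantum side: the separable state of Eq. (1), then <σ_z ⊗ σ_z> = Tr(ρ_p σ_z ⊗ σ_z).
rho_p = sum(w * np.kron(qubit(n1), qubit(n2)) for w, (n1, n2) in zip(weights, bloch))
quantum = np.trace(rho_p @ np.kron(PZ, PZ)).real

# Classical side, Eq. (2): A(λ) = n_z^(1) n_z^(2) factorizes, averaged over p(λ).
classical = np.sum(weights * bloch[:, 0, 2] * bloch[:, 1, 2])
assert abs(quantum - classical) < 1e-12
```
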
Once the classical statistical structure of expectation values on separable states is exposed, the problem of reproducing the quantum data with a separable state takes the form of a statistical inference problem, whose solution is well known in statistical physics [15]. First of all, applying a maximum-entropy principle [16], the Hamiltonian can be parametrized without loss of generality with as many parameters as the elements of the quantum data set [17]:

H(λ) = −Σ_{a=1}^R K_a A_a(λ).

The parameters K = {K_a}_{a=1}^R, the coupling constants of the classical field theory, are Lagrange multipliers whose optimization allows one to build the separable state ρ̂_p whose expectation values {⟨Â_a⟩_{ρ_p}} best approximate the quantum data {⟨Â_a⟩_ρ}. The optimization of K can be efficiently achieved upon minimizing the cost function L(K) := log Z(K) − Σ_a K_a ⟨Â_a⟩_ρ [10,15]. The a-th component of the gradient of L is g_a := ∂L/∂K_a = ⟨A_a⟩_p − ⟨Â_a⟩_ρ, and its Hessian matrix is χ_ab = ⟨A_a A_b⟩_p − ⟨A_a⟩_p ⟨A_b⟩_p, namely the covariance matrix of the A_a(λ) functions. Since the latter is a positive-semidefinite matrix, L is a convex function. Therefore a simple gradient-descent algorithm, which consists in iterating the update rule K_a → K_a − ε [⟨A_a⟩_p − ⟨Â_a⟩_ρ] with ε ≪ 1, or any improvement thereof, is guaranteed to reach the global optimum of the problem. In practice, this requires repeatedly computing ⟨A_a⟩_p as in Eq. (2), a task efficiently accomplished e.g. by Markov-chain Monte Carlo sampling of p(λ) whenever the Hamiltonian H does not describe a glassy system. The restriction to non-glassy systems is the only practical limitation of our approach [17]; it is ensured in the examples considered below by the translational invariance of the systems. Construction of an optimal entanglement witness. As illustrated in Fig. 1(c), the algorithm converges to the distribution p which minimizes |g|, the norm of the gradient of L. If the minimal gradient g^(min) vanishes (within the error on the quantum data), i.e.
if ⟨Â_a⟩_{ρ_p^(min)} = ⟨Â_a⟩_ρ for all a = 1, ..., R, then entanglement cannot be assessed from the available data. But in the opposite case, the coupling constants K_a increase indefinitely along the optimization, and the components of the gradient, g^(min)_a = ⟨Â_a⟩_{ρ_p^(min)} − ⟨Â_a⟩_ρ, allow us to build a violated EW inequality. First, we define the normalized coefficients W_a := −g^(min)_a / |g^(min)|. The condition |g^(min)|² > 0 is then rewritten as

−Σ_a W_a ⟨Â_a⟩_ρ < −Σ_a W_a ⟨Â_a⟩_{ρ_p^(min)}.

The linear combination Ŵ := −Σ_{a=1}^R W_a Â_a is the data-driven EW operator. The separable bound B_sep, namely the minimal value of Tr(ρ̂ Ŵ) over separable states, is violated by the data set, ultimately proving that entanglement is present among the subsystems. The operator Ŵ is optimal, in that any other normalized linear combination Ŵ' = −Σ_a W'_a Â_a defines an EW inequality whose violation cannot exceed the violation of the inequality involving Ŵ. This property follows from the convexity of the set of separable states.

Complexity of the algorithm. If the quantum data contain correlation functions involving up to k points, the effective Hamiltonian H contains O(N^k) terms; therefore the computational cost of evaluating statistical averages of the kind of Eq. (2) with a precision ε (using Monte Carlo sampling) scales as O(N^k / ε²), times the cost of evaluating the local expectation values ⟨ψ_i(λ_i)|Ô^(i)_m|ψ_i(λ_i)⟩. In the following, we restrict our attention to the case of systems of N qubits partitioned into subsystems consisting of single qubits; and quantum data will be assumed to consist of one- and two-point correlations, ⟨σ̂^(i)_a⟩_ρ and ⟨σ̂^(i)_a σ̂^(j)_b⟩_ρ.

Heisenberg antiferromagnetic chain. The first example of entangled states that we study with our approach is the thermal equilibrium state of the S = 1/2 Heisenberg chain Ĥ = J Σ_{i=1}^N Ŝ^(i) · Ŝ^(i+1), where the Ŝ^(i) are S = 1/2 spin operators, J is the exchange energy, and periodic boundary conditions (PBC) are assumed. Thermal equilibrium correlations take the form ⟨σ̂^(i)_a σ̂^(i+r)_b⟩_ρ = δ_ab C^(r), due to the rotational invariance of the spin-spin couplings and to translational invariance.
These elementary symmetries of the quantum data are directly inherited by the classical Hamiltonian defining the separable states aimed at reproducing them. The Hamiltonian takes the form of a classical long-range Heisenberg model, H(λ) = −Σ_i Σ_r K_r n^(i) · n^(i+r), where the unit vectors n^(i) parametrize the single-qubit states [17]. The most effective existing multipartite entanglement criterion for these quantum data is based on the collective spin, namely ⟨Ĵ²⟩ = Σ_ij ⟨Ŝ^(i) · Ŝ^(j)⟩ < N/2 [19,20], which is verified for t = T/J ≲ 1.4. This criterion is a permutationally invariant EW (PIEW), treating correlations at all distances on the same footing, and it cannot be optimal at sufficiently high temperatures, namely when the correlation length ξ becomes of the order of a few lattice spacings.
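The full pipeline (convex descent, runaway couplings, witness reconstruction, separable bound) can be illustrated on a toy data set for which all classical averages are known in closed form: the two-qubit singlet correlations, which no separable state reproduces. This is a hypothetical minimal example, not the Heisenberg-chain calculation of the text:

```python
import numpy as np

def langevin(K):
    """<n·m>_p for the isotropic classical model p(n, m) ∝ exp(K n·m) of two unit vectors."""
    return K / 3.0 if abs(K) < 1e-6 else 1.0 / np.tanh(K) - 1.0 / K

# Toy quantum data: the two-qubit singlet, <σ_a σ_a>_ρ = -1 for a = x, y, z.
# Product states obey Σ_a <σ_a σ_a> = n·m ≥ -1, hence so do their mixtures:
# these data lie outside the separable set and the couplings must run away.
data = np.array([-1.0, -1.0, -1.0])
K = 0.0                                  # isotropic coupling K_a = K, by symmetry of the data
for _ in range(20000):
    g = langevin(K) / 3.0 - data[0]      # per-component gradient <n_a m_a>_p - <σ_a σ_a>_ρ
    K -= 0.1 * g                         # plain gradient descent; K drifts toward -infinity

g_min = np.full(3, langevin(K) / 3.0) - data   # residual gradient, -> (2/3, 2/3, 2/3)
W = -g_min / np.linalg.norm(g_min)             # witness coefficients W_a = -g_a / |g|
B_sep = -np.max(np.abs(W))                     # min of -Σ_a W_a n_a m_a over product states
value = -np.dot(W, data)                       # <Ŵ>_ρ on the quantum data: -sqrt(3)
assert value < B_sep - 0.5                     # the data-driven EW inequality is violated
```

Here the separable bound follows from the fact that the minimum of a diagonal bilinear form −Σ_a W_a n_a m_a over unit vectors is −max_a |W_a|.
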
As a first validation of our approach, we search for the optimal EW based on the two-body correlations ⟨σ̂^(i)_a σ̂^(j)_a⟩, using as input quantum data the correlations (obtained via quantum Monte Carlo, QMC [17]) at t = 1 for N = 64 spins, at which ξ = 0.72. Because of their finite range, we only used correlations up to a distance r_max = 10. Fig. 2 illustrates the results of our approach. The saturation to a finite value of the distance between the quantum data and those of the optimized separable state [measured by the norm of the vector g, see Fig. 2(a)] and the divergence of the couplings K_r [Fig. 2(b)] clearly indicate the success of entanglement witnessing. The optimal EW operator can in principle be reconstructed from the asymptotic value of the gradient vector g^(∞). In practice, we found a more strongly violated EW inequality using the asymptotic couplings of the effective Hamiltonian, namely w_r = K^(∞)_r / |K^(∞)|, which display a clear spatial structure, shown in Fig. 2(c) (see [17] for the numerical values). The final step of the approach consists in determining the separable bound B_sep = min_{ρ_p} Tr(ρ̂_p Ŵ). The latter can be obtained as the solution of a set of algebraic equations [21,22]; here we rather obtain it by finding the ground-state energy of the classical Hamiltonian W_cl = −Σ_{i=1}^N Σ_{r=1}^{r_max} w_r n^(i) · n^(i+r) via temperature annealing [18] [Fig. 2(d)]. We observe that B_sep/N = −0.5032, while the quantum data reach ⟨Ŵ⟩_ρ/N = −0.6089. In contrast, the best PIEW, properly normalized [17], is violated by an amount of only 0.04552. This result is not merely incremental: the EW inequality we find is optimal among all those containing two-body correlators.
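The temperature-annealing determination of B_sep amounts to a standard simulated-annealing search for the classical ground state of W_cl. A self-contained sketch, with hypothetical nearest-neighbour couplings rather than the optimized w_r of the text (those are tabulated in [17]):

```python
import numpy as np

def energy(spins, w):
    """Classical witness Hamiltonian W_cl = -Σ_i Σ_r w_r n^(i)·n^(i+r), periodic chain."""
    return -sum(w[r - 1] * np.sum(spins * np.roll(spins, -r, axis=0))
                for r in range(1, len(w) + 1))

def anneal(w, N=12, steps=30000, seed=2):
    """Estimate the ground-state energy per site by single-spin Metropolis annealing."""
    rng = np.random.default_rng(seed)
    spins = rng.normal(size=(N, 3))
    spins /= np.linalg.norm(spins, axis=1, keepdims=True)
    E = energy(spins, w)
    for step in range(steps):
        T = 2.0 * (1.0 - step / steps) + 1e-3     # linear temperature ramp toward zero
        i = rng.integers(N)
        old = spins[i].copy()
        spins[i] = spins[i] + 0.5 * rng.normal(size=3)   # local random rotation
        spins[i] /= np.linalg.norm(spins[i])
        E_new = energy(spins, w)
        if E_new < E or rng.random() < np.exp(-(E_new - E) / T):
            E = E_new                             # accept the move
        else:
            spins[i] = old                        # reject: restore the spin
    return E / N

# Hypothetical couplings: pure nearest-neighbour w_1 = 1; the exact ground state
# is the fully aligned configuration, with energy -1 per site.
E_per_site = anneal([1.0])
assert -1.0 - 1e-9 <= E_per_site < -0.8           # annealing approaches the exact bound
```
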
Interestingly, for temperatures t ≳ 1.4 (at which the PIEW ceases to work) we found it numerically impossible to prove that ρ̂(T) is entangled based solely on two-point correlators: this in turn shows that the maximal set of thermal states whose entanglement can be witnessed using two-point correlators is essentially captured by the PIEW. This will not be the case in our next example, in which our approach significantly extends the range of witnessed entangled thermal states. Quantum Ising chain.
Our final example is the quantum Ising model with Hamiltonian Ĥ = −J Σ_i [Ŝ^(i)_z Ŝ^(i+1)_z + g Ŝ^(i)_x], where J is the interaction strength and Jg the transverse field. In the ground state, the system displays a quantum critical point (QCP) at g = g_c = 1/2 between a ferromagnetic phase (g < g_c) and a paramagnetic phase (g > g_c) [25]. At finite temperature around the QCP, the system is known to exhibit robust entanglement [24,26,27]. Given the symmetries of the correlation functions (⟨σ̂^(i)_a σ̂^(j)_b⟩_ρ ∝ δ_ab), the classical Hamiltonian tailored to reproduce them is of the form H(λ) = −Σ_i [K_x n^(i)_x + Σ_r Σ_a K^(r)_a n^(i)_a n^(i+r)_a]. As input quantum data, we consider the correlation functions of a chain of N = 64 spins with PBC at a temperature t = T/J = 0.28 for g = 0.5, obtained as well via QMC. Given the finite correlation length, we only used correlators up to a distance r_max = 20. Following the same procedure as described for the Heisenberg chain, we find an optimal EW operator which is spatially structured (coefficients and separable bound in the Supplemental Material [17]). In Fig. 3, we show that this new EW criterion, optimal for the thermal state at t = 0.28, g = 0.5, allows one to prove entanglement for a larger set of thermal states than all the existing criteria in the literature (namely the nearest-neighbour concurrence [23], the best PIEW [12], and the quantum Fisher information [24]; see [17] for further details).

FIG. 3: The star corresponds to t = 0.28, g = 0.5, at which the quantum data used as input were calculated. The color represents the violation ∆ = (⟨Ŵ⟩_ρ − B_sep)/N of our data-driven EW. The various curves correspond to the temperature below which different entanglement criteria are satisfied [nearest-neighbour concurrence [23]; best PIEW [12]; and quantum Fisher information (QFI) of Ĵ_z [24]].

Conclusions. We introduced a data-driven method to probe multipartite entanglement in many-body systems. This method relies on mapping separable states onto Boltzmann distributions of a classical field theory on a lattice. The classical degrees of freedom of this field theory are dictated by the chosen partitioning of the system. The structure of the corresponding classical Hamiltonian is dictated by the quantum data at hand; and its parameters are optimized in order to fit the quantum data at best.
This method allows one to exhaustively test the entanglement witnessing capability of a set of quantum data in a time scaling polynomially with the number of parties in the partition (provided the size of the quantum data set also scales polynomially); polynomial scaling is guaranteed whenever the classical field theory is not a model of a glass (namely, when it does not feature disorder and frustration). This opens the way to the systematic certification of entanglement in intermediate-scale quantum devices.

Acknowledgments
We are very grateful to Antonio Acín for insightful discussions. IF acknowledges support from the

Supplemental Material
In this Supplemental Material, we provide further technical details on: 1) the variational algorithm described and implemented for the data presented in the main text; 2) the generation of the quantum data, used as input to our algorithm, by quantum Monte Carlo; 3) the relative versus absolute violation of the entanglement witnesses; 4) the comparison with existing entanglement criteria. In the attached .csv files, the numerical coefficients of the entanglement witnesses discussed in the main text are given.

General strategy. Following the notation of the main text, we assume that the quantum data consist of a collection of expectation values {⟨Â_a⟩_ρ}_{a=1}^R measured on the quantum device. Our constructive strategy to solve the separability problem is to try and reproduce these data with a separable state, the failure of this attempt marking the success of entanglement detection. As discussed in the main text, a separable state ρ̂_p is defined by an arbitrary probability distribution p(λ) over local quantum states |ψ_i(λ_i)⟩. Our strategy is then to build an optimal p_opt(λ), such that the corresponding separable state ρ̂_{p_opt} produces the best possible approximation to the available data attainable using separable states. For a given separable state ρ̂_p, we have:

⟨Â_a⟩_{ρ_p} = ∫ dλ p(λ) A_a(λ),   A_a(λ) = Σ_m x^(a)_m Π_{i=1}^N ⟨ψ_i(λ_i)|Ô^(i)_{m_i}|ψ_i(λ_i)⟩

(see the main text for the precise definition of the Ô operators).
Our approach is in essence a variational approach, in which we parametrize the distribution p(λ) as a Boltzmann distribution p(λ) = exp[−H(λ)]/Z(K) associated with the classical Hamiltonian H(λ) = −Σ_{a=1}^R K_a A_a(λ). Two crucial properties, on which we further elaborate in this section, make this choice of Ansatz especially suited to solve the separability problem. Firstly, the expressive power of this Ansatz is complete, which means that there is no loss of generality in looking for a separable state of this specific form: if a separable state of this form cannot reproduce the data, then no separable state whatsoever can do so. Secondly, the variational parameters K_a can be optimized by minimizing a convex cost function, whose gradient can be evaluated at a cost scaling polynomially with N (the number of local subsystems) and with d (the local Hilbert space dimension), under mild assumptions (specifically, the absence of glassiness of the classical model H(λ)).
Completeness of the Ansatz. If a distribution p(λ) exists which reproduces the data set, ⟨A_a⟩_p = ⟨Â_a⟩_ρ for all a = 1, ..., R, it is generically not unique. One may therefore choose the distribution which, as a further specification, maximizes the Shannon entropy functional S[p] = −∫ dλ p(λ) log p(λ). This amounts to removing any other constraints on the distribution except that of reproducing the data set with its averages. Following the seminal work of Jaynes [16], maximizing S[p] under the constraint of reproducing the data is achieved upon introducing Lagrange multipliers K_a, and minimizing the functional F[p] = −S[p] − Σ_a K_a [∫ dλ p(λ) A_a(λ) − ⟨Â_a⟩_ρ] (plus a multiplier enforcing normalization). Setting to zero the functional derivative with respect to p(λ) yields as solution the Boltzmann distribution p(λ) = exp[Σ_a K_a A_a(λ)]/Z(K). The parameters K_a are hence exactly the tuning knobs that allow p(λ) to satisfy the constraints to the best that a classical probability distribution can do.
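Written out explicitly (a short derivation sketch in the notation above, with μ a Lagrange multiplier enforcing normalization), the constrained minimization reads:

```latex
\mathcal{F}[p] = -S[p]
- \sum_{a=1}^{R} K_a \Big( \int d\lambda\, p(\lambda)\, A_a(\lambda) - \langle \hat{A}_a \rangle_\rho \Big)
+ \mu \Big( \int d\lambda\, p(\lambda) - 1 \Big),

\frac{\delta \mathcal{F}}{\delta p(\lambda)} = 1 + \log p(\lambda) - \sum_a K_a A_a(\lambda) + \mu = 0
\;\Longrightarrow\;
p(\lambda) = \frac{e^{\sum_a K_a A_a(\lambda)}}{Z(K)},
\qquad
Z(K) = \int d\lambda\; e^{\sum_a K_a A_a(\lambda)}.
```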
To further understand this point, let us stress that throughout our work we assume (as is reasonable) that the size of the quantum data set scales at most polynomially with system size, so that the number of constraints associated with the reproduction of the quantum data also scales polynomially. On the other hand, a distribution p(λ) is uniquely defined by an exponentially large number of constraints, as many as the possible values of the argument λ. The exponentially many additional constraints that would be needed to specify the distribution uniquely cannot help it in any way in reproducing the quantum data. Maximizing the entropy of the distribution precisely gets rid of the useless constraints beyond the ones imposed by the quantum data themselves. Once the least constrained distribution is obtained by maximizing the entropy (subject to the data constraints), varying the parameters K_a allows one to reproduce all the data sets which could potentially be produced by the most general distribution p(λ). The Boltzmann distribution associated with the classical Hamiltonian H(λ) = −Σ_a K_a A_a(λ) can therefore be viewed as an Ansatz whose expressive power over quantum data sets is as high as one can possibly achieve with a classical distribution.
Optimizing the variational parameters. We now show that the parameters K_a can be optimized upon minimizing a convex cost function. Convexity is a crucial property of the whole procedure: if the optimization relied on a heuristic algorithm, then the failure to reproduce the quantum data could simply mean that the optimization has been trapped in some local minimum [28], and therefore the result would be inconclusive. As stated in the main text, a convex cost function for our problem is given by L(K) = log Z(K) − Σ_a K_a ⟨Â_a⟩_ρ. Another crucial aspect for the scalability of our algorithm is that the cost function L(K) itself is never computed. Only its gradient g_a = ∂L/∂K_a = ⟨A_a⟩_p − ⟨Â_a⟩_ρ is evaluated, and used to update the parameters K_a in a gradient-descent algorithm, or any improvement thereof (in this paper, we used the accelerated gradient-descent algorithm of Nesterov). Even though the cost function itself is never computed, its existence and its convexity are key to ensuring the monotonic convergence of our algorithm towards the global optimum of the problem [28]. Specifically, together with the cost function, the norm of its gradient converges towards its minimal value; namely, the distribution p(λ) converges towards the best possible approximation to the data by a separable state. If a distribution p(λ) exists which reproduces the data, then the gradient of the cost function vanishes, and it is impossible to detect entanglement from the available data. Notice that this is not a limitation of our approach; on the contrary, it represents a fundamental property of the data that our method exhibits. On the other hand, if the data lie outside of the convex region reachable by separable states, the cost function L is not bounded from below, and the gradient will stabilize at a finite value, leading to a runaway to infinity of the coupling constants K_a and marking the success of entanglement detection, as further discussed in the main text.
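For concreteness, a minimal implementation of a Nesterov-accelerated update of the kind used here, tested on a convex quadratic toy cost with a known minimum (the actual gradients in our problem are Monte Carlo estimates; this sketch uses exact gradients, and all names are illustrative):

```python
import numpy as np

def nag_minimize(grad, K0, eps=0.01, momentum=0.9, iters=5000):
    """Nesterov accelerated gradient descent: gradient evaluated at the look-ahead point."""
    K = np.asarray(K0, dtype=float)
    v = np.zeros_like(K)
    for _ in range(iters):
        g = grad(K + momentum * v)   # look-ahead gradient, the hallmark of NAG
        v = momentum * v - eps * g
        K = K + v
    return K

# Convex toy cost with known minimum at K* = (1, -2): L(K) = (K_0 - 1)^2 + 2 (K_1 + 2)^2.
grad = lambda K: np.array([2 * (K[0] - 1), 4 * (K[1] + 2)])
K_opt = nag_minimize(grad, [0.0, 0.0])
assert np.allclose(K_opt, [1.0, -2.0], atol=1e-6)
```
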
Computational complexity. Finally, we would like to remark that the computational cost required to evaluate the gradient g_a = ⟨A_a⟩_p − ⟨Â_a⟩_ρ with a given relative precision ε via Monte Carlo methods scales as 1/ε². One could imagine a fine-tuned situation in which the distance between the data under investigation and the separable set is exponentially small in the system size, |g| = O[exp(−N)], which would translate into a computational cost of our algorithm diverging exponentially with N. While such a situation cannot be excluded a priori, in any practical instance the quantum data come with a finite uncertainty, certainly not decreasing exponentially with the system size. Indeed, the best scaling of the relative uncertainty that one can expect is as N^{−1/2}, when considering collective observables which are sums of O(N) nearly independent degrees of freedom (as happens in systems with a finite correlation length), and the same benign scaling is shared by Monte Carlo estimates of the same quantities. On the other hand, an exponentially small uncertainty would require exponentially large statistics, which is not a realistic assumption for any source of the quantum data set (be it experiments or numerical calculations). As a consequence, quantum data whose distance to the separable set decreases exponentially with system size would inevitably be reproduced by our algorithm using a separable state within their uncertainty, and at a polynomial cost.
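The 1/ε² cost of Monte Carlo estimation can be illustrated directly: multiplying the number of samples by 100 reduces the statistical error on a sample mean by a factor of about 10. A small numerical check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
errors = []
for M in (10**3, 10**5):
    # Statistical error of a Monte Carlo mean, estimated over 200 independent repetitions.
    estimates = [rng.random(M).mean() - 0.5 for _ in range(200)]
    errors.append(np.std(estimates))

# 100x more samples -> ~10x smaller error: reaching precision ε costs O(1/ε²) samples.
ratio = errors[0] / errors[1]
assert 5 < ratio < 20
```
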
In the literature, the separability problem has been proved to be NP-hard in the bipartite case [29]. This implies that there exist instances requiring a cost exponential in the local Hilbert space dimension d. On the other hand, we are not aware of a similar complexity result in the multipartite case, namely for fixed d (d = 2 in the qubit examples treated explicitly in this work) but an increasing number N of parties. For the multipartite separability problem with N qubits, we state in the main text that classical glassy models define the practical limitation to the scalability of our approach. We would like to emphasize that this assumption is rather conservative. Indeed, the classical models one has to sample in our approach involve continuous degrees of freedom (e.g. N classical rotators representing vectors on the Bloch sphere, defining the local quantum states, see Section A 2 below). While finding the ground state of Ising spin glasses, which involve ±1 variables, has been proved to be NP-hard [30], a similar result does not exist (to the best of our knowledge) for frustrated disordered classical models involving rotators. This (classical) statistical-physics observation is consistent with the absence of a formal proof of NP-hardness for the (quantum) multipartite separability problem.
Concerning the bipartite case (N = 2, increasing d), whose NP-hardness is proven [29], our algorithm has a cost which is polynomial in d, in apparent contradiction with the complexity result. First, we notice that the NP-hardness [29] concerns the situation where the full bipartite state ρ_AB (which is a d² × d² Hermitian matrix of unit trace) is used as input. Our algorithm treats a more general situation, where one- and two-body correlations of the form ⟨Â_a⟩, ⟨B̂_b⟩, ⟨Â_a B̂_b⟩ are known [where Â_a (a ∈ {1, ..., R_A}) and B̂_b (b ∈ {1, ..., R_B}) are local observables on the A and B subsystems, respectively]. This knowledge is equivalent to the knowledge of ρ_AB if the Â_a and B̂_b form tomographically complete sets of observables (for instance, the R_A = R_B = d² − 1 generalized Gell-Mann matrices, which are the generators of SU(d)). In our approach, we parametrize separable states as Boltzmann distributions p(ψ_A, ψ_B) ∝ exp[Σ_a K_a A_a(ψ_A) + Σ_b K_b B_b(ψ_B) + Σ_{ab} K_ab A_a(ψ_A) B_b(ψ_B)], where ψ_A and ψ_B represent local quantum states, each parametrized by 2d − 2 classical variables, and where the K_a, K_b, K_ab are R_A + R_B + R_A R_B variational parameters. The NP-hardness result [29] implies that, if one considers tomographically complete sets of observables, then there exist instances of parameters K_a, K_b, K_ab for which sampling the corresponding Boltzmann distribution takes a time diverging exponentially with d. We cannot immediately identify to which hard statistical-physics problem this situation would correspond, but certainly such hard instances must exist, as imposed by the result of Ref. [29]. In analogy with glassy problems, for these instances the energy landscape described by H(ψ_A, ψ_B) should display a myriad of local minima separated by energy barriers whose height is proportional to d, making the sampling of the model via Monte Carlo methods inefficient.
On a more constructive note, we would like to remark that such complexity results only refer to worst-case instances. In the context of our approach, such worst-case instances could correspond to glassy models; in the case of the translationally invariant data considered in this paper, glassiness is avoided by construction. Such instances are not expected to be generically encountered when analyzing data from present-day quantum simulators of non-disordered systems. Finally, we would like to emphasize that there is no risk of erroneously concluding that entanglement is present if such hard instances manifest themselves. We have already argued above that realistic quantum data cannot reveal entanglement in the case of exponentially small violations of witness inequalities. In the presence of glassiness, instead, one would be unable to carry the optimization forward, due to very large error bars in the Monte Carlo evaluation of the expectation values on separable states. As a consequence, one would conclude that entanglement cannot be detected within the accuracy of the method.

Special case: partitioning the system into N qubits
In this work we introduce a variational algorithm to fit a given data set of expectation values by using separable states. In the case of qubits taken as individual subsystems, separable states are represented without loss of generality as Boltzmann distributions over classical Heisenberg spins n^(i) on the unit sphere (each of which represents a pure state on the Bloch sphere of an individual qubit).
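The parametrization can be made explicit: a classical unit vector n maps onto the pure qubit state |ψ(n)⟩ = cos(θ/2)|0⟩ + e^{iφ} sin(θ/2)|1⟩, whose Pauli expectation values return exactly n. A short numerical check (function names are illustrative):

```python
import numpy as np

PAULI = np.array([[[0, 1], [1, 0]],
                  [[0, -1j], [1j, 0]],
                  [[1, 0], [0, -1]]], dtype=complex)

def bloch_to_state(n):
    """|ψ(n)⟩ with ⟨ψ|σ_a|ψ⟩ = n_a: the unit-sphere parametrization of a qubit state."""
    theta = np.arccos(np.clip(n[2], -1, 1))
    phi = np.arctan2(n[1], n[0])
    return np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])

# Any classical unit vector ("Heisenberg spin") maps onto a pure qubit state and back.
rng = np.random.default_rng(4)
for _ in range(20):
    n = rng.normal(size=3); n /= np.linalg.norm(n)
    psi = bloch_to_state(n)
    recovered = np.array([np.vdot(psi, P @ psi).real for P in PAULI])
    assert np.allclose(recovered, n, atol=1e-12)
```
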
In the examples discussed in the main text, the data set contains one-qubit expectation values ⟨σ̂_a^(i)⟩_ρ and two-qubit correlations ⟨σ̂_a^(i) σ̂_b^(j)⟩_ρ. In the examples we considered (namely, the one-dimensional antiferromagnetic Heisenberg model, and the Ising model in a transverse field, both with periodic boundary conditions), only correlations of the form ⟨σ̂_a^(i) σ̂_a^(j)⟩_ρ are considered. Since we used translationally invariant chains, the one-qubit data reduce to the average magnetizations m_a = Σ_i ⟨σ̂_a^(i)⟩_ρ / N, and the two-qubit correlations depend only on the inter-qubit distance: C_a^(r) = Σ_i ⟨σ̂_a^(i) σ̂_a^(i+r)⟩_ρ / N. In the case of the Heisenberg model, which displays SU(2) invariance, we have m_a = 0; in this case, we considered as quantum data the correlations C^(r) = Σ_{a∈{x,y,z}} C_a^(r). Correspondingly, the classical Hamiltonian aiming at reproducing the quantum data contains one- and two-body interaction terms (the latter truncated beyond a given distance r_max). For the Heisenberg model we take H = −Σ_{r=1}^{r_max} K^(r) Σ_i n^(i) · n^(i+r), while for the quantum Ising model, where m_y = m_z = 0, we have H = −K_x Σ_i n_x^(i) − Σ_{a∈{x,y,z}} Σ_{r=1}^{r_max} K_a^(r) Σ_i n_a^(i) n_a^(i+r). The K coefficients are the variational parameters of our algorithm, which are optimized in an iterative manner. A simple gradient-descent algorithm consists in iterating the following update rules (for the Ising model): K_x ← K_x + ε (m_x|_ρ − ⟨m_x⟩_p) and K_a^(r) ← K_a^(r) + ε (C_a^(r)|_ρ − ⟨C_a^(r)⟩_p), for a ∈ {x, y, z} and r ∈ {1, 2, ..., N/2}; and (for the Heisenberg model): K^(r) ← K^(r) + ε (C^(r)|_ρ − ⟨C^(r)⟩_p). In the above equations, ⟨·⟩_p is the expectation value on the Boltzmann distribution for the classical Hamiltonian (whose couplings are the K coefficients), while the quantities with subscript ρ are the target quantum data. As discussed in the main text (see also [10]), ε is a small parameter, implementing a numerical gradient descent of the (convex) loss function L.
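As an illustration, the Ising-ansatz classical Hamiltonian can be evaluated as follows (a sketch under our own conventions: couplings stored as an array K[a, r-1], periodic boundary conditions implemented with np.roll, and each ordered pair (i, i+r) counted once):

```python
import numpy as np

def classical_energy(spins, Kx, K, r_max):
    """Energy of the Ising-ansatz classical Hamiltonian for unit vectors.

    spins: (N, 3) array of unit vectors n^(i); Kx: field coupling on n_x;
    K: (3, r_max) array of pair couplings K_a^(r) with a = x, y, z.
    H = -Kx * sum_i n_x^(i) - sum_a sum_r K_a^(r) sum_i n_a^(i) n_a^(i+r)  (PBC)
    """
    E = -Kx * spins[:, 0].sum()
    for r in range(1, r_max + 1):
        shifted = np.roll(spins, -r, axis=0)   # n^(i+r) with periodic wrap-around
        for a in range(3):
            E -= K[a, r - 1] * np.dot(spins[:, a], shifted[:, a])
    return E
```

The Heisenberg-ansatz energy is recovered by tying the three couplings together, K_x^(r) = K_y^(r) = K_z^(r) = K^(r).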
In practice, we implemented Nesterov's accelerated gradient-descent (NAG) algorithm, with ε = 0.01. Each step of the NAG algorithm requires computing the Euclidean distance g between the separable data and the quantum data, namely computing ⟨m_x⟩_p and ⟨C_a^(r)⟩_p for the Ising model, and ⟨C^(r)⟩_p for the Heisenberg model. This was implemented using Markov-chain Monte Carlo. The number of Monte Carlo steps (defined below) implemented at each step of the NAG algorithm was chosen such that the relative error on g is smaller than a given threshold η, which we chose as η = 0.05 for the Ising model, and η = 0.1 for the Heisenberg model. In other words, one step of the NAG algorithm is completed when Σ_α [Err(g_α)]² ≤ η² Σ_α g_α², where Err(g_α) is the error on the component g_α, as estimated from the Monte Carlo algorithm. Each step of the Monte Carlo algorithm consisted of 2N iterations of single-spin Metropolis updates and of single-spin microcanonical overrelaxation updates [31]. The amplitude of the proposed Metropolis updates was adapted along the Monte Carlo simulation so that moves are accepted with frequency 0.5 ± 0.1. Therefore, a single Monte Carlo step consists of 2N microcanonical updates, and of N accepted Metropolis updates (on average). As the variational optimization of the K parameters progresses along the NAG algorithm, the norm of the gradient g decreases, and therefore an increasing number of Monte Carlo steps is required at each NAG step in order to achieve the required relative precision η. When the quantum data cannot be fitted by a separable state, |g| stabilizes at a finite value. The number of NAG steps needed to achieve this convergence (and therefore the total number of Monte Carlo steps along the whole optimization) depends on the value of |g| obtained at the end of the optimization. For the examples presented in the main text, about 10³ NAG steps were necessary, each of them comprising 10⁴ to 10⁷ Monte Carlo steps.
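A minimal sketch of the single-spin Metropolis update used to sample the Boltzmann distribution e^{−H} follows (overrelaxation moves, the adaptive step amplitude, and the NAG wrapper are omitted for brevity; function names and the fixed proposal step are our own choices):

```python
import numpy as np

rng = np.random.default_rng(7)

def site_energy(spins, i, n, Kx, K, r_max):
    """Terms of the Ising-ansatz energy involving site i when its spin is n (PBC)."""
    N = spins.shape[0]
    E = -Kx * n[0]
    for r in range(1, r_max + 1):
        for a in range(3):
            E -= K[a, r - 1] * n[a] * (spins[(i + r) % N, a] + spins[(i - r) % N, a])
    return E

def metropolis_sweep(spins, Kx, K, r_max, step=0.7):
    """One sweep of single-spin Metropolis updates on the unit sphere."""
    N = spins.shape[0]
    for i in rng.permutation(N):
        new = spins[i] + step * rng.normal(size=3)   # local random proposal
        new /= np.linalg.norm(new)                   # project back to the sphere
        dE = (site_energy(spins, i, new, Kx, K, r_max)
              - site_energy(spins, i, spins[i], Kx, K, r_max))
        if dE <= 0 or rng.random() < np.exp(-dE):
            spins[i] = new
```

Sampled configurations then provide the Monte Carlo estimates ⟨m_x⟩_p and ⟨C_a^(r)⟩_p entering the gradient update of the couplings.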
(with a, b ∈ {x, y, z}) [4]. The dashed line in Fig. 3 marks the temperature below which ρ̂_12 is entangled. Since the concurrence criterion [23] is based on a subset of the full quantum data we considered (which contains all one- and two-qubit correlation functions, equivalent to all two-body reduced density matrices ρ̂_ij, and not only ρ̂_12), by construction our data-driven method must detect entanglement in a region of the phase diagram strictly larger than the one detected by the concurrence, a fact clearly visible in Fig. 3.
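For reference, the concurrence of a two-qubit density matrix is given by Wootters' formula, C(ρ̂) = max(0, λ₁ − λ₂ − λ₃ − λ₄), where the λ_i are the decreasing square roots of the eigenvalues of ρ̂ρ̃̂, with the spin-flipped state ρ̃̂ = (σ̂_y ⊗ σ̂_y) ρ̂* (σ̂_y ⊗ σ̂_y). A self-contained sketch:

```python
import numpy as np

SY = np.array([[0, -1j], [1j, 0]])

def concurrence(rho):
    """Wootters concurrence of a two-qubit density matrix rho (4x4)."""
    YY = np.kron(SY, SY)
    rho_tilde = YY @ rho.conj() @ YY                       # spin-flipped state
    lam = np.sqrt(np.abs(np.linalg.eigvals(rho @ rho_tilde).real))
    lam = np.sort(lam)[::-1]                               # decreasing order
    return max(0.0, lam[0] - lam[1] - lam[2] - lam[3])
```

A Bell state gives concurrence 1, and the maximally mixed two-qubit state gives 0.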
Permutationally-invariant entanglement witnesses. In Ref. [12], a complete family of 8 entanglement witnesses based on the two-qubit reduced density matrix averaged over all pairs, ρ̂_av,2 = 2 Σ_{i<j} ρ̂_ij / [N(N − 1)], was derived. Equivalently, ρ̂_av,2 is reconstructed from the knowledge of all one- and two-body correlations averaged over all permutations: m_a = Σ_i ⟨σ̂_a^(i)⟩_ρ / N and C_ab = Σ_{i≠j} ⟨σ̂_a^(i) σ̂_b^(j)⟩_ρ / [N(N − 1)]. Since m_a and C_ab are coarse-grained features of the quantum data we have considered, if an EW inequality is violated by m_a and C_ab (namely, if one of the 8 EW inequalities of Ref. [12] is violated), then our data-driven algorithm must also reconstruct a violated entanglement witness; in general, a more strongly violated one. As illustrated in Fig. 3 for the quantum Ising model, for which we tested all 8 criteria for each pair of parameters (t, g) (temperature and transverse field), this is clearly the case.
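Computing the permutation-averaged data m_a and C_ab from site-resolved correlations is a simple averaging step; a sketch (the array layout is our choice for illustration):

```python
import numpy as np

def averaged_correlations(m_site, C_site):
    """Permutation-averaged data m_a and C_ab.

    m_site: (N, 3) one-qubit expectation values <sigma_a^(i)>;
    C_site: (N, N, 3, 3) two-qubit correlations <sigma_a^(i) sigma_b^(j)>.
    Returns m (3,) and C (3, 3), averaged over sites / over pairs i != j.
    """
    N = m_site.shape[0]
    m = m_site.mean(axis=0)                        # m_a
    mask = ~np.eye(N, dtype=bool)                  # exclude i == j
    C = C_site[mask].sum(axis=0) / (N * (N - 1))   # C_ab
    return m, C
```

The resulting m and C are exactly the coarse-grained quantities entering the 8 PIEW inequalities of Ref. [12].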
Quantum Fisher information. The quantum Fisher information (QFI) is another multipartite entanglement witness. Formally, the QFI quantifies the sensitivity of the state ρ̂ to unitary transformations ρ̂(φ) = e^{−iφÔ} ρ̂ e^{iφÔ}, with Ô a quantum observable [35]. The QFI can be expressed as QFI(Ô, ρ̂) = 2 Σ_{n≠m} (p_n − p_m)² |⟨n|Ô|m⟩|² / (p_n + p_m), where ρ̂ is diagonalized as ρ̂ = Σ_n p_n |n⟩⟨n|. Here, we chose for Ô the collective spin Ĵ_z = Σ_{i=1}^N σ̂_z^(i) / 2, which is optimal to witness entanglement around the quantum critical point of the quantum Ising model [24, 26]. The inequality QFI(Ĵ_z, ρ̂) ≤ N is satisfied by all separable states, so that a QFI exceeding the system size is an entanglement witness [35]. In general, computing the QFI requires the knowledge of the full density matrix ρ̂, but the mapping of the quantum Ising chain onto a free-fermion model [25] makes this computation tractable [24]. Notice that computing the QFI requires knowledge beyond one- and two-body correlators, and therefore it goes beyond the data set we have considered; hence there is no a priori guarantee that our method exceeds the EW capability of the QFI. Nevertheless, as illustrated in Fig. 3, the parameter region where entanglement is detected by the QFI is largely contained within the region where entanglement is detected via our data-driven algorithm.
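The spectral formula above translates directly into code for small systems; a sketch (the tolerance guard skips pairs with p_n + p_m ≈ 0, whose contribution vanishes):

```python
import numpy as np

def qfi(O, rho, tol=1e-12):
    """Quantum Fisher information QFI(O, rho) from the spectral formula."""
    p, U = np.linalg.eigh(rho)          # rho = sum_n p_n |n><n|
    Om = U.conj().T @ O @ U             # O in the eigenbasis of rho
    F = 0.0
    for i in range(len(p)):
        for j in range(len(p)):
            if i != j and p[i] + p[j] > tol:
                F += 2 * (p[i] - p[j]) ** 2 * abs(Om[i, j]) ** 2 / (p[i] + p[j])
    return F
```

For a pure state the formula reduces to 4 Var(Ô); e.g. for the N-qubit GHZ state, QFI(Ĵ_z) = N² > N, violating the separable bound.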
Appendix D: Absolute versus relative violation of the entanglement witnesses

By construction, the optimal witness found by our approach is the one whose absolute violation is maximized. Namely, among all possible witness operators Ŵ = −Σ_a W_a Â_a, properly normalized with the Euclidean norm Σ_a W_a² = 1, our witness operator maximizes the difference B_sep − Tr(ρ̂Ŵ), where B_sep = min_{ρ̂_sep} Tr(ρ̂_sep Ŵ). As a consequence, it is the witness operator which is most robust to the uncertainty present in the quantum data, namely the one that requires the lowest amount of measurement statistics to produce the quantum data themselves.
On the other hand, it is also relevant to consider the noise robustness of a given entanglement witness, namely the robustness to a noisy, imperfect preparation of the quantum state ρ̂ that should produce the quantum data. Noisy state preparation can be generically modeled as turning the target state into (1 − η)ρ̂ + η 1/D, where η parametrizes the strength of the noise, and D is the total Hilbert-space dimension. Assuming that all operators Â_a composing the witness are traceless (which is the case for tensor products of local Pauli matrices, as considered in this paper), this leads us to define the noise robustness as the maximal value of η such that (1 − η_max) Tr(ρ̂Ŵ) = B_sep, namely η_max = 1 − B_sep / Tr(ρ̂Ŵ). There is no guarantee that the witnesses found by our approach are those whose noise robustness is maximal, and in fact this turns out not to be the case, as shown by the following example.
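The noise robustness η_max follows directly from the formula above; a one-line sketch (valid only under the traceless-witness assumption stated in the text):

```python
def noise_robustness(B_sep, W_expect):
    """eta_max = 1 - B_sep / Tr(rho W), for a traceless witness operator.

    Under rho -> (1 - eta) rho + eta * 1/D, tracelessness of W gives
    Tr(rho_eta W) = (1 - eta) Tr(rho W); the witness stops being violated
    when (1 - eta_max) Tr(rho W) = B_sep.
    """
    return 1.0 - B_sep / W_expect
```

For example, with B_sep = −1 and Tr(ρ̂Ŵ) = −2 (a violated witness, since separable states satisfy Tr(ρ̂_sep Ŵ) ≥ B_sep), half of the state can be replaced by white noise before the violation disappears.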
In the case of the Heisenberg chain, we have considered translationally-invariant entanglement witnesses of the form Ŵ = −Σ_{a∈{x,y,z}} Σ_{i=1}^N Σ_{r=1}^{r_max} w_r σ̂_a^(i) σ̂_a^(i+r). Our convention has been to normalize them to Σ_r w_r² = 1. For a meaningful comparison with the PIEW ⟨[Σ_{i=1}^N Ŝ^(i)]²⟩ ≥ N/2, the latter should be properly normalized according to the same convention, namely Ŵ_PIEW = (N − 1)^{−1/2} Σ_{a∈{x,y,z}} Σ_{i≠j} σ̂_a^(i) σ̂_a^(j), with a separable bound given by −N/√(N − 1). For the data considered in the main text (namely, a thermal state of the one-dimensional Heisenberg model at temperature T/J = 1.00 with N = 64 spins), we find a violation −N/√(N − 1) − Tr(ρ̂Ŵ_PIEW) = 0.04552. In contrast, the optimal witness found by our data-driven approach exhibits a larger violation of 0.10570. On the other hand, the noise robustness of the PIEW is η_max = 0.255, while the noise robustness of the data-driven EW found by our approach is η_max = 0.174. This is qualitatively consistent with the observation that the PIEW is violated up to a temperature of T/J ≈ 1.400, which is higher than the temperature up to which the data-driven EW (optimal by construction at T/J = 1.00) is violated.

Appendix E: Detailed numerical values of the entanglement witnesses
The numerical coefficients of the entanglement witnesses reconstructed by our algorithm are given in this Section. For the Heisenberg model at temperature T/J = 1 (Fig. 2 of the main text), we discarded the correlations at distances beyond r = 10. The coefficients K^(r) of the entanglement witness are given in Table I. For the quantum Ising model (Fig. 3 of the main text), the corresponding coefficients K_x, K_x^(r), K