Conformal Bootstrap with Reinforcement Learning

We introduce the use of reinforcement-learning (RL) techniques to the conformal-bootstrap programme. We demonstrate that suitable soft Actor-Critic RL algorithms can perform efficient, relatively cheap, high-dimensional searches in the space of scaling dimensions and OPE-squared coefficients that produce sensible results for tens of CFT data from a single crossing equation. In this paper we test this approach in well-known 2D CFTs, with particular focus on the Ising and tri-critical Ising models and the free compactified boson CFT. We present results of searches in as many as 36 dimensions, whose sole input is the expected number of operators per spin in a truncation of the conformal-block decomposition of the crossing equations. Our study of 2D CFTs uses only the global $so(2,2)$ part of the conformal algebra, and our methods are equally applicable to higher-dimensional CFTs. When combined with other, already available, numerical and analytical methods, we expect our approach to yield an exciting new window into the non-perturbative structure of arbitrary (unitary or non-unitary) CFTs.


Introduction
The non-perturbative formulation of a generic Quantum Field Theory (QFT) and the analytic, or numerical, solution of its dynamics remains an extremely challenging conceptual and computational problem with important theoretical and experimental implications.
The problem becomes more tractable in Conformal Field Theories (CFTs): a special class of QFTs that typically describe the short- and large-distance behaviours of generic QFTs. Most notably, in a unitary, relativistic CFT in D spacetime dimensions, the local structure of the theory is characterised by a set of discrete data: the scaling dimensions ∆_i of local conformal primary operators O_i and their Operator Product Expansion (OPE) coefficients C^k_ij. Once these data are known, any correlation function of local operators in the theory can be determined.
Unitarity implies certain well-known constraints on these data. For example, a conformal primary operator with scaling dimension ∆ and spin s must satisfy the inequalities

∆ ≥ (D − 2)/2 ,   s = 0 ,   (1.1)
∆ ≥ s + D − 2 ,   s > 0 .   (1.2)

The equality ∆ = s + D − 2 occurs only for conserved currents.
More elaborate, and powerful, constraints on the CFT data arise from crossing symmetry: the property that a correlation function is the same irrespective of the channel used in its OPE decomposition. These constraints (consistency conditions) form the basis of the conformal bootstrap approach. Since the 1970s (see e.g. [1]) it was hoped that by solving the conformal bootstrap equations, one would be able to solve CFTs non-perturbatively, without the need for a Lagrangian formulation. For many years the complexity of the conformal bootstrap equations, and the fact that they admit an infinite set of solutions for an infinite set of unknowns, did not allow the programme to evolve beyond a limited set of cases in 2D conformal field theory.

Brief Background on the Modern Conformal Bootstrap
Significant progress was instigated in 2008 by the seminal paper [2], which shifted the focus away from the search for exact solutions of the conformal bootstrap equations and towards the following approach: make an assumption about the spectrum of the CFT and ask whether the bootstrap equations can be satisfied; if they cannot, that assumption is successfully eliminated. With suitable truncations of the infinite-dimensional CFT spectrum, this programme can be implemented numerically, and powerful linear and semidefinite programming methods 1 have been employed in recent years to obtain many significant results in this direction. It is impossible to list here all the results and different applications of this approach. For a concise review, and orientation in the relevant literature, we refer the reader to [4][5][6].
The assumptions that drive this approach are selected blindly; in the words of [7], the bootstrap computations in this context are performed in an "oracle mode". Nevertheless, suitable assumptions not only carve out significant parts of the space of potential CFTs, but, interestingly, in many cases one finds that known theories lie at cusps of the boundary of allowed possibilities. Even better, sometimes one discovers that the allowed region is an isolated "island". When this happens, the oracle mode can be used to compute scaling dimensions and OPE coefficients remarkably well. A beautiful application of this method is encountered in the 3D Ising model [8]. Theories at the boundary between the allowed and disallowed regions are obviously special from this perspective and have been the primary target of standard applications of the conformal bootstrap. Efficient computational methods, like the Extremal Functional Method [9], can be used to enhance the arsenal of the conformal bootstrap in this context. 1 A commonly used package is the Semidefinite Program Solver (SDPB) [3].
Nevertheless, some obvious shortcomings of this approach include:
(a) For theories inside the allowed region one cannot, in general, tell how far they are located from the boundary.
(b) With generic assumptions in oracle mode it is hard to identify and solve specific pre-selected CFTs, such as one's favourite gauge (conformal field) theory, that may not lie on the boundary between the allowed and disallowed regions of the search.
(c) Higher-dimensional searches that would facilitate the study of more general classes of CFTs are computationally expensive and difficult to implement with existing techniques. Typically, current standard techniques restrict one to searches over a couple of parameters.
To address some of these problems, Ref. [7] recently introduced the Navigator-function method, which replaces the binary information of the oracle mode with the continuous information of an optimisable, differentiable function, called the Navigator function.
The Navigator function is positive in the disallowed region, negative in the allowed region, zero at the boundary and, in principle, it is defined globally on the space of parameters.
By minimising the Navigator function one can flow from a disallowed region to an allowed region and thus map out islands in parameter space, e.g. by finding one feasible point inside the island or by finding an island's extremal points. The algorithms of [7] employ the same well-developed semi-definite programming tools of SDPB that were previously used to determine OPE coefficients as a maximisation problem. Notable precursors of the Navigator-function method are the optimisation methods proposed in [10].
Another notable approach to the conformal bootstrap, with the potential to address the above issues, was proposed earlier on by Gliozzi in [11]; see also [12][13][14][15] for further work in this direction. In [11] the conformal-block expansion of the crossing equations was arbitrarily truncated and Taylor-expanded in cross-ratio space. A specific assumption was made about the spectrum of operators that enter the truncated conformal-block expansions.
Viewing the resulting crossing equations as an over-constrained system of linear equations for the unknown OPE-squared coefficients, and demanding the existence of non-trivial solutions, yields conditions on the allowed scaling dimensions, which are phrased as vanishing determinants. This method can be used, in principle, to study a wider class of CFTs, including non-unitary CFTs, which are beyond the reach of the above-mentioned SDPB approaches. It requires, however, that the CFT is "truncable", which is not an a priori obvious property of a given CFT (see [12] for an example that is not truncable). In [14], the Gliozzi approach was reformulated as a minimisation problem, which improves important aspects of the method. The approach to the conformal bootstrap that we introduce in this paper is similar in spirit to the reformulation of [14].
Both of the above approaches, and the one we introduce below, are phrased as optimisation problems. A distinctive feature of our method is that, instead of minimising the quantity of interest directly, we optimise a Neural Network (NN) that predicts a probability distribution, which is then sampled to make the actual predictions. This approach has several advantages. When optimising a function directly, one needs to compute its partial derivatives, which can become expensive in high-dimensional searches. 2 In contrast, we use fixed optimisation algorithms for the NNs, independent of the details and complexity of the specific problem.
Moreover, when one optimises the function of interest directly, one first has to pick a point in state space to initialise the process, after which the derivatives guide the search towards the closest minimum. In order to flow to a minimum one has to pick a small enough learning rate, but that inevitably restricts the flow to the closest minimum, even if it is not the global one. Our approach is better suited to finding the global minimum, because the learning rate varies and the search probes minima at varying distances from the original starting point. The price we pay for these advantages is that our computations become less "exact", i.e. less direct and more statistical.

A Novel Study of Truncations Based on Artificial Intelligence
In the present work we study truncated crossing equations as an optimisation problem and develop methods to find approximate numerical solutions taking advantage of recent developments in Machine Learning (ML) and the wider availability of associated techniques.
2 In [7] this problem is avoided with a general SDP gradient formula and the efficient use of a quasi-Newton method.
Similar to [11,14], our approach is more akin to the original philosophy of the 1970s, which aimed at a direct solution of the conformal bootstrap equations. We will explain momentarily how we set up and implement a multi-dimensional search for approximate solutions and how this search benefits from artificial-intelligence techniques.

Introductory Comments on ML Terminology
Designing architectures and algorithms that could one day surpass human performance has been a long-running goal in the field of ML. Although a significant part of the theoretical (statistical and probabilistic) groundwork was laid down more than half a century ago, ML has only recently started to truly flourish. Algorithms that beat professional chess players were designed decades ago, but these approaches involved code that was rigid and non-dynamic, meaning that once written their knowledge would be capped. In contrast, all of the modern developments in having machines learn how to solve problems include dynamic programming and a statistical approach to learning. The latter has only recently become practically feasible, with the rapid development of, and easier access to, powerful central processing units (CPUs) and graphics processing units (GPUs).
The three best-known categories of ML algorithms are: supervised, unsupervised and reinforcement learning. In supervised learning some of the data are tagged and contain both the input and the desired output. The algorithm trains on the tagged data and learns how to produce a sensible output from any input. Typical applications of supervised learning are classification and regression problems. In unsupervised learning there are no externally provided tagged data for training; the algorithm recognises structure in a given set of data on its own. In Reinforcement Learning (RL) [16], or Deep Reinforcement Learning (DRL), which employs Deep Neural Networks (DNNs) in the learning steps of the "agent", one knows the goals but does not know how to achieve them. The algorithm interacts with a dynamic environment and receives feedback based on its performance that guides it towards the desired result.
In recent years, ML has had a rising number of applications in High Energy Physics. 3 In this paper, we initiate a study of the conformal-bootstrap programme using RL techniques. This is the first study of conformal field theory of this kind. 4
3 See [17] for a compendium of reviews ranging from the more experimental to the more computational aspects, and [18] for a summary of applications to String Theory. RL implementations have appeared in the context of String Theory even more recently in [19]. See also [20] for a nice introduction to deep learning from a physics-motivated viewpoint.
4 An alternative ML approach towards certain aspects of CFT, using supervised learning, appeared in [21].

RL Setup in the Conformal Bootstrap
Ultimately, a successful RL algorithm should be able to identify a proper CFT, by converging to a configuration of CFT data that satisfy the crossing equations within a prescribed accuracy. It should similarly be able to exclude improper CFTs by failing to converge to a configuration that satisfies the crossing equations within the prescribed accuracy.
The basic scenario of our approach includes the following ingredients: • Consider a specific four-point function with operators that have fixed symmetry properties, scaling dimensions and spins. If the scaling dimensions of the external operators are unknown, one can include them, as unknown variables, into the search.
• The crossing equations are truncated with a specific assumption about the number of operators per spin that appear in each channel. We call this assumption the spin-partition of the truncated conformal-block expansion. For example, if the truncation of the conformal-block expansion in a given channel is assumed to include only operators of integer spin, and we truncate at maximum spin 3, then the spin-partition specifies the number of operators at spins 0, 1, 2 and 3. The spin-partition, which is an input to the RL algorithm, specifies the dimensionality of the vector of unknown scaling dimensions and OPE-squared coefficients (∆, C) that we aim to determine.
• The crossing equations, which are functions of the cross-ratios (see Sec. 2.1 for details), are reduced to a set of algebraic equations for the unknown scaling dimensions and OPE-squared coefficients (∆, C). The reduction can be achieved by Taylor-expanding the conformal blocks around a particular point (as in standard applications of the numerical conformal bootstrap), or by evaluating the conformal blocks on a set of different points in cross-ratio space. We will implement the latter approach in this paper. Naturally, the number of algebraic crossing equations obtained in this manner should be larger than the number of unknowns. In compact vector form, the reduced algebraic crossing equations read

E(∆, C) = 0 .   (1.3)

Since we truncate the crossing equations, it is not guaranteed (or expected) that Eqs. (1.3) admit an exact solution. Our aim is to find approximate solutions to (1.3) that minimise the norm of E. Approximate solutions are expected to flow towards exact solutions of the exact crossing equations as one adds more and more operators to the truncation. The methodology, focus and scope of [21] are very different from the one that we introduce below.
• One can specify the width of the search either individually for each unknown scaling dimension and OPE-squared coefficient, or collectively. For example, one can set a common upper cutoff, ∆_max, on the unknown scaling dimensions. Clearly, because of the unitarity constraints (1.1)-(1.2), if the maximum spin in the spin-partition is s_max, the cutoff ∆_max cannot be chosen below s_max.
• With these specifications in mind, we set up a soft Actor-Critic RL algorithm [22] that performs a multi-dimensional search on the vector space of the unknown scaling dimensions and OPE-squared coefficients (∆, C) and returns configurations that minimise the norm of the crossing-equation vector E. The operation and key components of the RL algorithm will be discussed in Sec. 3.
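The ingredients above can be sketched as a minimal search environment. This is a simplified illustration of our own, not the implementation used in the paper: the toy "crossing vector" below (an over-determined linear system) stands in for the true truncated crossing equations, and the class and variable names are ours.

```python
import math
import random

# Hypothetical sketch: the agent proposes a configuration
# x = (Delta_1, ..., C_1, ...); the environment evaluates the truncated
# crossing vector E(x) and returns reward = -||E(x)||, so that maximising
# the reward minimises the violation of the crossing equations.
class CrossingEnv:
    def __init__(self, crossing_vector, lo, hi):
        self.crossing_vector = crossing_vector  # callable: x -> vector E(x)
        self.lo, self.hi = lo, hi               # search-window bounds

    def step(self, action):
        # keep the proposed configuration inside the specified search window
        x = [min(max(a, self.lo), self.hi) for a in action]
        E = self.crossing_vector(x)
        reward = -math.sqrt(sum(e * e for e in E))
        return x, reward

# Toy stand-in for Eq. (1.3): an over-determined linear system M x = b
# with N_z = 10 evaluation points and N_unknown = 4 parameters.
random.seed(0)
M = [[random.gauss(0, 1) for _ in range(4)] for _ in range(10)]
x_true = [1.0, 0.5, 2.0, 0.1]
b = [sum(m * xi for m, xi in zip(row, x_true)) for row in M]

def crossing_vector(x):
    return [sum(m * xi for m, xi in zip(row, x)) - bi
            for row, bi in zip(M, b)]

env = CrossingEnv(crossing_vector, lo=0.0, hi=3.0)
_, r = env.step(x_true)   # an exact solution gives zero crossing violation
```

A soft Actor-Critic agent would replace the hand-picked `x_true` by actions sampled from its learned policy, updated from the rewards returned by `step`.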

Overview and Discussion of Results
Our main goal in this paper is to show that suitable RL algorithms can be applied to the conformal-bootstrap programme to efficiently perform multi-dimensional searches, and (when appropriately guided) to detect and solve arbitrary CFTs. We aim primarily at a proof-of-concept demonstration of the approach with less emphasis on maximising the accuracy of the results, which we will consider in future work. In that vein, we want to test RL algorithms against results that can be obtained independently using analytic methods.
We choose to analyse 2D CFTs, as in this case it is straightforward to write exact conformal blocks for operators of arbitrary spin. Throughout our computations, we will only use the global so(2, 2) part of the 2D conformal algebra, without making any reference to the Virasoro algebra, which is a special feature of two dimensions. Consequently, every tool that we set up in this paper is directly generalisable and applicable to higher-dimensional CFTs, which will be treated elsewhere. For concreteness, we will focus separately on the two leading unitary minimal models (the Ising and tri-critical Ising model) and the free boson CFT on a circle.

Key Results
We highlight the following results: • In all the cases we analysed, the algorithm was able to detect the CFT whose spin-partition we used as input. This is extremely promising. It suggests that Reinforcement Learning has great potential as a tool in conformal-bootstrap studies of generic CFTs, for example by using solutions obtained at special points (e.g. at weak-coupling points) to solve the theory at generic points by adiabatically changing the parameters.
• We can perform efficient high-dimensional searches; our current algorithm can do direct searches with tens of operators. In the context of the 2D compactified boson CFT, we present results of a run with 36 parameters. We can, in principle, go to even higher spins and scaling dimensions with multiple sequential runs that start with a smaller number of operators and gradually introduce more.

Numerical Uncertainties
An important aspect of our approach, which is not addressed in detail in the preliminary investigations of this paper, has to do with the systematic treatment of errors. As emphasised at the beginning of this subsection, the main goal of the present work is to establish that our algorithm detects the intended CFT and produces sensible numbers. We achieve this goal by comparing said numbers with the available exact analytic results. A preliminary discussion of errors and uncertainties, and how they can be incorporated systematically in the future, is relegated to the concluding Sec. 6. In the rest of this subsection, we flesh out an important aspect of our approximations that affects the implementation of our approach.
As already noted, the truncated crossing equations that we are trying to solve do not admit, in general, any exact solutions. Therefore, our main task is to find configurations that minimise the violation of the truncated equations. What is the minimal violation of the truncated equations that we should be aiming for? This is not known a priori, and the answer can depend strongly on the specifics of the CFT, the four-point function under consideration, the type of truncation implemented on the spectrum, and the way the crossing equations are reduced from functions in cross-ratio space to a number of algebraic equations. We have empirically found that, in all computations performed for this paper, a solution has been properly identified for values of the accuracy measure A below 0.1%, irrespective of the spin truncation.
Once A falls below this empirical threshold, stops improving, and the agent has visibly converged to a configuration, we terminate the run and record the result. We have implemented this triple selection rule in all the runs that are reported in this paper.
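The triple selection rule can be written down as a simple stopping criterion. The sketch below is our own illustration (thresholds other than the 0.1% bound, and all names, are assumptions): stop once (i) A is below threshold, (ii) the best A has stopped improving over a patience window, and (iii) the proposed configuration has stabilised.

```python
def should_stop(A_history, config_history,
                threshold=1e-3, patience=5, tol=1e-6, drift_tol=1e-3):
    """Return True once all three selection criteria hold simultaneously."""
    if len(A_history) <= patience:
        return False
    # (i) A is below the empirical 0.1% threshold
    below = A_history[-1] < threshold
    # (ii) the best A has not improved during the last `patience` steps
    plateau = min(A_history[-patience:]) > min(A_history[:-patience]) - tol
    # (iii) the agent's configuration has visibly settled: the spread of
    # every coordinate over the last `patience` steps is small
    recent = config_history[-patience:]
    drift = max(max(c[i] for c in recent) - min(c[i] for c in recent)
                for i in range(len(recent[0])))
    stable = drift < drift_tol
    return below and plateau and stable
```

For example, a run whose accuracy has flattened at A = 5 × 10⁻⁴ with a frozen configuration is terminated, while a run whose A is still decreasing keeps going.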
To obtain further evidence for the acceptance, or rejection, of a configuration, one can study how the best A obtained by the algorithm changes as more and more operators are included. Once a configuration has been accepted as a valid approximation to the exact problem, one can define individual uncertainties for each CFT datum that is being computed. We present preliminary results on statistical errors in specific examples in Sec. 5.
We discuss general uncertainties and their sources further in the concluding Sec. 6.

Outline
The rest of this paper is organised as follows. In Sec. 2 we present a brief review of useful basic CFT properties and set up our notation. We introduce the truncation scheme that we use, the associated spin-partitions, and a measure of accuracy that plays a key role in our analysis. A shorter version of this paper, summarising the key approaches and results, can be found in [23].

CFT Prerequisites and Notation
In what follows we assume some familiarity with the basic concepts of conformal field theory.
For a review of conformal field theory we refer the reader to the standard textbook [24] and the recent overviews in [4][5][6], which summarise the more modern perspective on CFTs above two dimensions. Sec. 2.1 provides a general overview of useful properties for CFTs in any spacetime dimension. In Secs 2.2 and 2.3 we specialise the discussion to 2D CFTs, which will be the main focus of the computations in this paper.

Generalities
The so(D, 2) conformal algebra of a CFT in D spacetime dimensions organises the spectrum of local operators/states of the theory into corresponding representations. A primary operator O_i has scaling dimension ∆_i and spin (under the SO(D) Lorentz group) s_i. Notice that the case D = 2 is special, since the so(2, 2) part of the conformal algebra extends to the infinite-dimensional Virasoro algebra. It is, therefore, customary in 2D CFTs to refer to operators that are highest weights of Virasoro representations as primaries, while operators that are highest weights of representations of the global part so(2, 2) are called quasi-primary. Since we will be using only the so(2, 2) structure of 2D CFTs, the reader should keep in mind the distinction between primary and quasi-primary operators in the context of our applications.
A central object in the analysis of CFTs is the Operator Product Expansion (OPE), which allows one to recast the product of two conformal primaries O_i, O_j as a sum over single conformal primaries and their descendants,

O_i(x) O_j(0) = Σ_k C^k_ij ( O_k(0) + descendants ) .

The OPE coefficients C^k_ij are c-numbers that are closely connected to the three-point function coefficients C_ijk of the conformal primaries O_i, O_j, O_k. For example, with x_ij ≡ x_i − x_j, the two- and three-point functions of conformal primary scalar operators are given by the expressions

⟨O_i(x_1) O_j(x_2)⟩ = δ_ij / |x_12|^{2∆_i} ,   (2.2)
⟨O_i(x_1) O_j(x_2) O_k(x_3)⟩ = C_ijk / ( |x_12|^{∆_i+∆_j−∆_k} |x_23|^{∆_j+∆_k−∆_i} |x_13|^{∆_k+∆_i−∆_j} ) ,

while a four-point function of conformal primary scalar operators can be written as

⟨O_1(x_1) O_2(x_2) O_3(x_3) O_4(x_4)⟩ = K(∆_i, x_i) g(u, v) ,

where the factor K(∆_i, x_i) has a fixed form (that will be written explicitly in two dimensions below), and g(u, v) is a typically complicated, theory-specific function of the cross-ratios

u = x_12² x_34² / (x_13² x_24²) ,   v = x_14² x_23² / (x_13² x_24²) .

The function g(u, v) admits the conformal-block decomposition

g(u, v) = Σ_k C_k g_{∆_k, s_k}(u, v) ,

where g_{∆_k, s_k}(u, v) is the conformal block that captures the contribution of the intermediate primary O_k and its descendants. The conformal blocks are theory-independent and, as already mentioned earlier, in many cases are either known analytically in closed form, or can be determined using convenient relations. Specific expressions for two-dimensional conformal blocks will be given momentarily.
It is customary (in the context of the so-called conformal frame) to re-express the cross-ratios in terms of two variables z, z̄ as

u = z z̄ ,   v = (1 − z)(1 − z̄) .   (2.7)

In Euclidean CFT z and z̄ are complex conjugate.
It is also customary to work in a basis of conformal primaries that diagonalises the two-point functions (2.2). This is a convenient choice in general, but it can be subtle in conformal manifolds for degenerate protected operators because of operator-mixing effects.
In what follows we denote the OPE-squared sum at fixed scaling dimension ∆_k as

C^k_{i_1 i_2 i_3 i_4} ≡ Σ_{O : ∆_O = ∆_k} C_{i_1 i_2 O} C_{i_3 i_4 O} .   (2.8)

In the absence of degeneracies in the spectrum of operators that run in this sum, the sum (2.8) comprises a single term. This is not, however, the only possibility, and in some of the applications of the main text we will encounter cases where degeneracies do exist. Our algorithm tries to determine the full coefficients C^k_{i_1 i_2 i_3 i_4}; hence, if there are degeneracies, it will not be able to resolve them to determine the individual contributions that make up the sum in (2.8).
Obviously, the OPE expansion in (2.6) is not unique. Instead of using the OPEs of the pairs (12) and (34), one can use the OPEs of the pairs (14) and (23) to obtain a different-looking, but equivalent, expansion of the four-point function. These two approaches yield respectively the so-called s- and t-channel expansions of the four-point function. 6 To distinguish the OPE-squared coefficients in each channel, we will denote the s-channel coefficients as sC^k_{i_1 i_2 i_3 i_4} and the t-channel coefficients as tC^k_{i_1 i_2 i_3 i_4}. The t-channel can be obtained from the s-channel by exchanging the insertions 1 ↔ 3 and, correspondingly, the cross-ratios u ↔ v, or z ↔ 1 − z and z̄ ↔ 1 − z̄. The equality of the two expansions leads to the crossing-symmetry constraints

Σ_k sC^k g_{∆_k, s_k}(u, v) = h(∆_i; u, v) Σ_{k'} tC^{k'} g_{∆_{k'}, s_{k'}}(v, u) ,   (2.9)

where the factor h(∆_i; u, v) accounts for the contribution of the prefactor K.
In general, the operators that appear in the s-channel k-sum are different from the operators that appear in the t-channel k′-sum. Moreover, note that the crossing equations (2.9) have to be satisfied as functions of u, v, at any values of u, v. This imposes stringent constraints on the CFT data of scaling dimensions and OPE coefficients. We will set up an RL algorithm that solves these equations, yielding the CFT data, using an assumption about the rough structure of the spin-dependence of the spectrum of operators that appear in the OPE of each channel.

Crossing Equations in 2D CFTs
It will be useful for our purposes to spell out the above results in the more specific case of two-dimensional CFTs.
The analysis of the crossing equations (2.9) requires explicit knowledge of the conformal blocks g_{∆,s}. Over the years, significant progress in the computation of conformal blocks (see [5] for a guide to the literature) has provided important input in the development of the conformal-bootstrap programme. In even-dimensional CFTs the conformal blocks in four-point functions of scalar operators are known analytically in closed form. In two-dimensional CFTs, in particular, they are also known analytically for any four-point function of spinless or spinning conformal primary operators [25]. The latter is one of the basic reasons why we will focus on 2D CFTs. We stress again that the aforementioned conformal blocks in two dimensions are conformal blocks for the global so(2, 2) part of the Virasoro algebra. In this paper we will not be using Virasoro conformal blocks. 7
6 It is also possible to consider the (13)-(24) OPEs that yield the u-channel expansion. We will not consider the u-channel expansion in this paper. We note that the s-, t- and u-channel expansions do not converge simultaneously at all cross-ratio values. For further comments we refer the reader to the review [5].
7 In two dimensions it would have been more efficient, in general, to work with the full Virasoro blocks. However, this would be problematic for us for two reasons. First, the general Virasoro conformal blocks are not known in closed analytic form (see, however, [26] for useful expansions of these quantities). Second, and more important, this would limit the direct applicability of our approach to the special features of two dimensions.

Concretely, consider four quasi-primary operators in a (Euclidean) 2D CFT, denoted as O_i (i = 1, 2, 3, 4), with left- and right-moving conformal weights (h_i, h̄_i). The corresponding scaling dimensions and spins of these operators are ∆_i = h_i + h̄_i and s_i = h_i − h̄_i. We insert the operators at four distinct spacetime points, denoted in complex coordinates as (z_i, z̄_i).
The s-channel conformal-block expansion of the four-point function of these operators reads

⟨O_1(z_1, z̄_1) O_2(z_2, z̄_2) O_3(z_3, z̄_3) O_4(z_4, z̄_4)⟩ = K(h_i, h̄_i; z_i, z̄_i) Σ_{h,h̄} sC_{h,h̄} g^{(1234)}_{h,h̄}(z, z̄) ,

with z, z̄ the complex parameters that express the cross-ratios u, v in (2.7). We are also using the global so(2, 2) conformal blocks

g^{(1234)}_{h,h̄}(z, z̄) = k_h(z) k_{h̄}(z̄) ,   k_h(z) = z^h 2F1(h − h_12, h + h_34; 2h; z) ,

where h_12 ≡ h_1 − h_2, h_34 ≡ h_3 − h_4, and 2F1 is the ordinary hypergeometric function.
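The holomorphic half-block k_h(z) is straightforward to evaluate numerically. The sketch below is our own illustration, specialised to identical external operators (so h_12 = h_34 = 0), with the Gauss hypergeometric function summed term by term; the series converges for |z| < 1.

```python
import math

def hyp2f1(a, b, c, z, terms=200):
    """Gauss hypergeometric series 2F1(a, b; c; z), valid for |z| < 1."""
    total, coeff = 0.0, 1.0
    for n in range(terms):
        total += coeff
        # ratio of successive series terms: (a)_n (b)_n z^n / ((c)_n n!)
        coeff *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
    return total

def k_block(h, z):
    """Holomorphic half-block k_h(z) = z^h 2F1(h, h; 2h; z) (identical externals)."""
    return z**h * hyp2f1(h, h, 2 * h, z)

# The full so(2,2) block for an intermediate operator with weights (h, hbar)
# then factorises as g(z, zbar) = k_block(h, z) * k_block(hbar, zbar).
```

As a sanity check, 2F1(1, 1; 2; z) = −ln(1 − z)/z, so k_block(1, z) should equal −ln(1 − z).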
Adapting (2.8), we also set C_{h,h̄} for the OPE-squared sum of the intermediate quasi-primary operators with weights (h, h̄), suppressing the reference to the operators O_i.
In the above notation the crossing equations (2.9) take the form

Σ_{h,h̄} sC_{h,h̄} g^{(1234)}_{h,h̄}(z, z̄) = h(h_i, h̄_i; z, z̄) Σ_{h′,h̄′} tC_{h′,h̄′} g^{(3214)}_{h′,h̄′}(1 − z, 1 − z̄) ,   (2.14)

where, as in (2.9), the factor h(h_i, h̄_i; z, z̄) accounts for the contribution of the prefactor K. At this point it is useful to make the following observations.
First, when one sums over the conformal block of a spinning quasi-primary operator (i.e. an operator with conformal weights (h, h̄) and h ≠ h̄) in either channel, one is also summing over the contribution of its conjugate with weights (h̄, h). In this manner, we can restrict the sums in (2.14) to only run over operators with h ≥ h̄, hence reducing by half the number of intermediate quasi-primary operators that we need to consider in the ensuing application of the RL algorithm.
Second, it is useful to single out the contribution of the identity operator, when this is present in a given channel, by setting C_{0,0} g^{(1234)}_{0,0}(z, z̄) = g_12 g_34. This explicit non-vanishing constant in (2.14) will prevent, in general, the RL algorithm from converging to the trivial solution where all sC_{h,h̄} and tC_{h′,h̄′} are set to zero.

Truncations, Spin-partitions and Measures of Accuracy
We view the exact crossing equations (2.14) as non-linear equations for the unknown scaling dimensions and OPE-squared coefficients. However, in their current form, the exact crossing equations (2.14) are impractical for both analytic and numerical methods. As already mentioned in Sec. 1.2.2, we need to implement a truncation.
For numerical methods the first obvious obstacle is the appearance of a typically infinite number of contributions to the conformal-block expansion. We address this problem by truncating the spectrum of intermediate quasi-primary operators, by setting some upper cutoff ∆ max on the scaling dimensions. The convergence properties of the conformal-block expansion [27] imply that one does not have to consider very large values of ∆ max for sensible numerical results, but the precise value of an optimal ∆ max is not easy to determine a priori and is, in general, theory-dependent. We will later make the surprising observation that in some examples values of ∆ max as low as 2 can already yield good approximations. 9 A second issue has to do with the continuous dependence of the exact crossing equations (2.14) on the cross-ratio parameters z,z. In this paper, we follow the approach of [28] and evaluate the truncated crossing equations at a finite discrete set of points in the z-plane. We have noticed experimentally that the sampling of z-points suggested in Sec. 3.1 of [28] works well also in our computations. In general, if the number of unknown scaling dimensions and OPE-squared coefficients is, in total, N unknown , we choose N z z-points (with N z > N unknown ) to evaluate the truncated crossing equations.
With these specifications, the exact crossing equations (2.14) have been reduced to a finite set of non-linear algebraic equations, where the scaling dimensions of all contributing intermediate quasi-primary operators are bounded from above by ∆ max . This necessarily also puts an upper bound on the allowed spin s of these operators, since |s| ≤ ∆ ≤ ∆ max . 10 However, despite the above considerable simplifications, the problem remains intractable: there is still a vast space of possibilities that an algorithm can explore associated with the freedom to choose any number of quasi-primaries at each spin. This final issue can be fixed by introducing a spin-partition.
The spin-partition is a sequence of positive integers that specifies the number of quasi-primaries per spin contributing to the conformal-block expansions of the truncated crossing equations. The spin-partition is an input to the RL algorithm that we set up in the next section. It fixes the dimensionality N_unknown of the vector space of parameters (∆, C) where the search takes place. We will be listing spin-partitions using the template of Tab. 1.
We have thus arrived at a framework of truncated equations,

E(∆, C) = 0 ,   (2.16)

where the dimension of the vector (∆, C) is N_unknown and the dimension of the vector E is N_z. Each entry E_i (i = 1, . . . , N_z) of the vector E contains the evaluation of the truncated version of Eq. (2.14) at one of the points (z_i, z̄_i) in our z-sampling.

Spin        0     1     2     · · ·   n − 1     n
s-channel   a_0   a_1   a_2   · · ·   a_{n−1}   a_n

Tab. 1: Template for listing spin-partitions (partially recovered): a_s is the number of s-channel quasi-primaries at spin s.

This framework is very similar to the starting point of the approach of [11,14]. Notice, however, that the truncation in the scheme of [11,14] is arbitrary, whereas here it comes with the further assumption that the unknown scaling dimensions lie inside a specific window of scaling dimensions. This detail is an important distinction between our approach/implementation and those of [11,14]. In particular, our approach entails a probabilistic search in specified parameter windows.
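The dimensionality N_unknown fixed by a spin-partition can be made concrete with a small bookkeeping sketch. This is our own illustration, under the assumption (ours) that each intermediate quasi-primary contributes one unknown scaling dimension and one unknown OPE-squared coefficient, while identity contributions are fixed; the example partition is hypothetical.

```python
def n_unknown(s_channel, t_channel, identity_channels=1):
    """Dimension of the search space fixed by a spin-partition.

    s_channel, t_channel: dicts mapping spin -> number of quasi-primaries.
    identity_channels: number of channels containing the (fixed) identity.
    """
    ops = sum(s_channel.values()) + sum(t_channel.values())
    ops -= identity_channels      # identity contributes no unknowns
    return 2 * ops                # one Delta and one C per remaining operator

# Hypothetical spin-partition in the spirit of Tab. 1:
s_part = {0: 3, 1: 2, 2: 2, 3: 1}   # a_0, ..., a_3 in the s-channel
t_part = {0: 3, 1: 2, 2: 2, 3: 1}
print(n_unknown(s_part, t_part))     # a 30-dimensional search for this example
```

Counts of this kind show how searches with a few operators per spin quickly reach the tens of dimensions quoted in Sec. 1.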
In general, (2.16) is not expected to have any exact solutions. Accordingly, as we explain in the next section, our RL algorithm is designed to minimise the Euclidean norm of E and determine configurations of CFT data that satisfy the truncated crossing equations with the best possible accuracy. Although the Euclidean norm ||E|| is an important quantity of the computation, it is not straightforward to judge whether its raw value at an optimal configuration is actually small or large. For that reason, we find it useful to introduce a "relative measure of accuracy" A, defined in the context of (2.17). The quantity A is guaranteed to be a number between 0 and 1. Its value gives a percentage measure of the accuracy with which we have been able to satisfy the truncated equations (2.16), and it can in turn be compared more straightforwardly between different computations.
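One natural reading of such a relative measure (the precise definition accompanies (2.17), which is not reproduced in this excerpt) normalises ||E|| by the norm obtained when the individual contributions to each E_i are summed in absolute value; this automatically confines A to [0, 1]. A minimal sketch under this assumption:

```python
import numpy as np

def accuracy(contributions):
    """Relative measure of accuracy A for a truncated crossing vector.
    contributions[i] holds the individual signed terms that sum to E_i.
    Normalising ||E|| by the norm of the absolute-value sums guarantees
    0 <= A <= 1.  This is one plausible reading of the definition near
    (2.17); the paper's exact formula may differ."""
    E = contributions.sum(axis=1)
    E_abs = np.abs(contributions).sum(axis=1)
    return np.linalg.norm(E) / np.linalg.norm(E_abs)

# two z-points, two terms each: near-cancellation gives small A
terms = np.array([[1.0, -0.9], [0.5, -0.55]])
A = accuracy(terms)
assert 0.0 <= A <= 1.0
```

Since |E_i| is bounded by the sum of the absolute values of its terms, the ratio can never exceed 1, which is the property the text attributes to A.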

Continuous Action Space Reinforcement Learning
In many physical settings it is very common to have access to large amounts of data (e.g. collider physics), where supervised/unsupervised ML techniques find direct application.
However, in scenarios often found in theoretical physics this is not usually the case. This is where RL comes in handy because the learning agent is able to generate its own data.
Reinforcement Learning, in brief, consists of two equally important parts. The first is the so-called "agent", the brain of the algorithm. The second is the "environment": what the agent interacts with. The basic setup is a process in which the agent makes decisions as it explores the provided environment, while the environment gives feedback on the agent's actions. One wants the agent to explore the environment in search of an ideal solution, while exploiting the best solution it has found so far (the explore-exploit dilemma). One also has to choose a suitable algorithm for how the agent (the neural network) "learns" and retains its experiences.
There exists a considerable amount of previous work on DRL algorithms, which have been applied to a large variety of problems, both theoretical and real-world. There are examples of agents which can beat video games, drive cars, guide robots, solve mathematical equations and-possibly the most famous one-AlphaGo, which beat professional Go champions using a combination of supervised learning and DRL [29], and the improved AlphaGo Zero, which relied completely on DRL [30].
Such algorithms can be split into two main sets, distinguished by whether the actions (defined by numbers) taken by the agent are discrete or continuous. Algorithms such as Deep Q-Learning [31] or Actor-Critic methods [32] use a discrete action space (convenient when one can take only a finite number of actions), while algorithms such as the soft Actor-Critic method [22] and the Deep Deterministic Policy Gradient method [33] were developed for actions that can take any real value.
In this paper we make use of the soft Actor-Critic algorithm, implemented using the PyTorch package for Python 3. We will not go into the details of the aforementioned algorithms, since these can be found in the original papers (with pseudo-code), and there exist plenty of additional online resources showcasing their implementation. Furthermore, we will treat the learning algorithm itself as a black box, i.e. we will not be interested in its study, although one can adjust the hyperparameters for the learning following [22]. We are mostly interested in the environment that the agent gets to explore.

Soft Actor-Critic Algorithm
Although we will not be providing the full details of the agent implementation, it is still useful to give a short overview of what actually happens inside the brain of the algorithm.
The algorithm itself is an iterative process, where the iteration is over "steps" taken by the agent. These steps can also be grouped into "episodes". An episode is concluded when the last step results in a terminal state. The steps and terminal states are more important when talking about the environment, and they will be discussed in more detail in the following subsection. In every iterative step of the algorithm a number of processes are executed by the code. In order, these include:

1. Choose Action: Since our agent is designed to come up with scaling dimensions and OPE-squared coefficients for given CFT spin-partitions, each action directly corresponds to an unknown (a scaling dimension or an OPE-squared coefficient). The actions themselves take continuous values. The agent takes an action by predicting values for the unknowns.

2. Implement the Action in the Environment: We shall explain the implementation of the environment in detail in the next subsection. For now we simply note that the values predicted by the agent are fed into the environment code.

3. Observe the Environment: In this step the constraints (such as the crossing equations or additional constraints) are calculated by the environment and fed back to the agent as observations (what the agent "sees").

4. Obtain Reward: The environment comes up with a quantitative judgment (discussed in the next section) on how well the agent did with its prediction of the parameters. This is then fed back to the agent.

5.-6. Store in Memory Buffer: the information of the current step is stored in a memory buffer that retains the experience of previous iterations. This is very important for the next step.

7. Update Neural Networks (learn): A random set of samples is taken from the previously mentioned memory buffer and used as training data to update the weights of the neural networks of the learning algorithm. In the optimisation step of the weights we use the ADAM optimiser [34]. Once the networks have been fed forward and backpropagated, their weights adjust to better suit the data. Hence, in the next iteration they will try to predict results which better satisfy the constraints. It is important to note that the networks do not actually predict the values themselves but a probability distribution which is then sampled for the predictions; this is where the explore-exploit dilemma enters.
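The iterative loop described above can be sketched schematically as follows. The `ReplayBuffer` and `run_episode` names, and the stubbed agent/environment interface, are illustrative and do not reproduce the paper's PyTorch implementation of the soft Actor-Critic networks.

```python
import random
from collections import deque

class ReplayBuffer:
    """Memory buffer storing (state, action, reward, next_state, done)
    transitions from previous iterations."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def store(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        # random mini-batch used as training data for the networks
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

def run_episode(agent, env, buffer, batch_size=64):
    """One episode of the schematic agent-environment loop."""
    state, done = env.reset(), False
    while not done:
        action = agent.choose_action(state)            # 1. choose action
        next_state, reward, done = env.step(action)    # 2.-4. act, observe, reward
        buffer.store((state, action, reward, next_state, done))  # store in memory
        agent.learn(buffer.sample(batch_size))         # 7. update networks
        state = next_state
```

Any agent exposing `choose_action`/`learn` and any environment exposing `reset`/`step` can be plugged into this loop; the soft Actor-Critic specifics live entirely inside the agent.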
We display the details of the NNs that we used for our searches in Tab. 2.

Environment
Here we summarise some of the most salient features of the environment implementation. The very last section of the environment checks for final states. In our case this is simply a flag checking whether the current solution is better than the current best from previous runs. If it is, then the code overwrites the previous best and supplies the flag to the agent. The agent needs to know whether or not the step led to a final state, as this directly feeds into the approximation of the probability distribution.
We summarise these steps in Alg. 1, where A stands for an action by the agent and R * for the current best reward.
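The final-state check just described amounts to a comparison against the running best reward R*. A minimal sketch, with hypothetical names:

```python
def check_final_state(reward, best_reward):
    """Flag a final state when the current configuration beats the best
    reward R* found so far, and update R* accordingly.  Schematic version
    of the last section of the environment (cf. Alg. 1)."""
    if reward > best_reward:
        return True, reward   # final state: overwrite the previous best
    return False, best_reward
```

The boolean flag is what gets passed back to the agent, while the second return value plays the role of R* in subsequent steps.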

Three Modes of Running the Algorithm
The RL algorithm can be implemented in several different ways depending on the scope and focus of the search. In this subsection, we outline three different modes that were employed in producing the results of Secs 4 and 5. In summary, these are:

• Mode 1. Specify the spin-partition and ∆max and search for scaling dimensions between the unitarity bound and ∆max. For the OPE-squared coefficients there are very few constraints; e.g. they may only be restricted by unitarity to be positive.

• Mode 2. There is a specific expectation for the scaling dimensions, for which the search is contained within a narrow window. There are no expectations for the OPE-squared coefficients, where the search is initially as wide as in mode 1.

• Mode 3. There are specific expectations for both the scaling dimensions and the OPE-squared coefficients, so the search is narrow in all components of (∆, C).

Clearly, the range of the search becomes narrower as we go from mode 1 to mode 3.
The computational time is expected to be larger, in general, in mode 1.
Our algorithm gives the user two key dials that can be tuned at will at the beginning of a run, or multiple times in the middle of one. The first is a lower bound for each parameter (we will call it the "floor"). The second is a separate size for the search window of each parameter, in each action of the agent (we will call this dial the "guess-size"). As a rule of thumb, the initial window should be set large enough to minimise the probability of the agent getting trapped at a local minimum. Once the presence of a potential global minimum has been established, one can then start to home in by gradually reducing its size.
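A simple way to realise the two dials is an affine map from the agent's raw actions to parameter values. The exact mapping used in the paper is not spelled out in this excerpt, so the following is an assumption-laden sketch in which actions are taken to lie in [0, 1].

```python
import numpy as np

def action_to_parameters(action, floor, guess_size):
    """Map raw actions (assumed in [0, 1]) to CFT parameters.
    'floor' is the lower bound of each search window and 'guess_size'
    its width; setting guess_size[i] = 0 freezes parameter i at its
    floor value.  This affine form is an illustrative choice."""
    return np.asarray(floor) + np.asarray(guess_size) * np.asarray(action)

floor = [0.9, 0.0]   # e.g. a scaling dimension and an OPE^2 coefficient
guess = [0.2, 0.0]   # second parameter frozen at its floor
params = action_to_parameters([0.5, 0.7], floor, guess)
# values: 1.0 and 0.0 (the frozen parameter ignores the action)
```

Freezing a parameter is then literally a zero guess-size, which is how the mode-2 protocol below can be phrased.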
We next provide a more detailed description of each mode.

Mode 1
Since this mode involves the widest search windows, a blind search may be hindered by the existence of multiple false vacua, or may lead to an approximate solution that represents a CFT that is not of immediate interest. As a result, this mode can be assisted by additional preparation that partially restricts the search. For example, one could start with a rough preliminary exploration of the minima of || E|| using Mathematica, or obtain a rough estimate of some of the scaling dimensions using the approach of [11]. This preparation can help significantly facilitate the subsequent search.
To commence the search we initially run the algorithm in "guessing mode" where the RL agent only tries to improve on its own guess in the current cycle. This allows for the random exploration of configuration space and generates some initial profiles of CFT data.
Then, we enter the "normal mode", where the agent initially takes the final state from the guessing mode and tries to find small corrections so as to better satisfy the constraints.
Once it finds such a correction, it replaces the final state and proceeds with a new correction iteratively. Here one can set specific values for the floor and guess-sizes. It helps to set the guess-size at a magnitude comparable to the expected order of parameter change as the agent hits the next final state. In most cases, the user can easily detect this size by observing how the agent generates configurations in real time.
The algorithm continues the search ad infinitum, and the crucial question is when to stop and record the result. We have observed, in the context of different theories, that near actual solutions the agent reaches within a reasonable time (of the order of an hour on a modern laptop) a value of the relative measure of accuracy A below 0.5%. In addition, when the search window is set near actual solutions, the agent keeps reducing A significantly below the 0.5% threshold, with an apparent convergence of the values of the parameters (∆, C).
Based on this observation, we have always aimed for runs that drop A below 0.1%.

Mode 2
In this mode we conduct, from the beginning, a narrow search in scaling dimensions. We have found that the following protocol produces good results.
We set the floor of the scaling dimensions to the expected values and the corresponding guess-sizes to 0. This freezes the scaling dimensions and reduces the dimensionality of the search by half, since we are conducting a search by varying only the OPE-squared coefficients. After exiting the guessing mode, we conduct the search for the optimal OPE-squared coefficients using the same procedure as in mode 1.
Once the relative accuracy A drops to the order of 1%, we unfreeze the scaling dimensions by reducing their floor and opening their guess-size. The size of the search window around the expected values of the scaling dimensions can be controlled freely by the user. If the agent is already in the vicinity of a solution, the scaling dimensions will not move significantly once unfrozen, and the full set of parameters ( ∆, C) will now be adjusted by the agent to reduce A even further. We continue the search until we achieve an acceptably small value of A and observe an apparent convergence following the general procedure outlined in mode 1.
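The mode-2 freeze/unfreeze protocol can be phrased in terms of the floor and guess-size dials introduced earlier. The helper below is a hypothetical sketch of the two stages (scaling dimensions frozen at their expected values, then a small window opened around them); all numbers are illustrative.

```python
def mode2_windows(expected_deltas, ope_floor, ope_guess, unfrozen_guess=0.1):
    """Sketch of the mode-2 protocol.  Stage 1: scaling dimensions frozen
    (guess-size 0) at their expected values, search only over OPE^2
    coefficients.  Stage 2: dimensions unfrozen with a small window of
    width unfrozen_guess centred on the expected values."""
    frozen = {"floor": list(expected_deltas) + list(ope_floor),
              "guess": [0.0] * len(expected_deltas) + list(ope_guess)}
    unfrozen = {"floor": [d - unfrozen_guess / 2 for d in expected_deltas]
                         + list(ope_floor),
                "guess": [unfrozen_guess] * len(expected_deltas)
                         + list(ope_guess)}
    return frozen, unfrozen
```

Stage 1 halves the dimensionality of the search, exactly as described above; stage 2 lets the full set of parameters (∆, C) adjust to reduce A further.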
During this process it may happen that some scaling dimensions are driven towards the boundary of the prescribed window of search. In that case, the user can slightly increase the corresponding window to explore whether the approximate solution lies nearby. As long as the agent keeps improving the accuracy A, the window can be kept in place. If there is, however, a stage in the run where the agent stops improving at an unacceptably high A, and the adjustment of guess-sizes does not help, then this can be viewed as a strong signal that a solution does not exist in the prescribed windows.

Mode 3
In this case, we are conducting a narrow search in all components of the parameters ( ∆, C). We can run the algorithm as in mode 2 without the initial run to approximate the configuration of the OPE-squared coefficients, since this is already approximately known.

Enlarging the Spin-Partition
After having obtained results for a given spin-partition, one can implement a shortcut for subsequent searches with an enlarged spin-partition (e.g. when ∆max is increased). Instead of re-running the algorithm for all parameters, it is more economical to implement a strategy akin to that of mode 2:

• Perform the search with the least number of parameters using the steps outlined previously.
• Freeze these parameters and search for the additional parameters of the enlarged spin-partition.
• Unfreeze all parameters and let the agent determine how these new parameters change the old ones to find a better solution.
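The steps above can be encoded as a two-stage schedule. The function below is an illustrative sketch, not the paper's implementation; it merely records which parameters are frozen and which are searched at each stage.

```python
def enlarged_search_schedule(old_params, new_params):
    """Schedule for enlarging a spin-partition (schematic):
    stage 1 keeps the previously determined parameters frozen while the
    new ones are searched; stage 2 unfreezes everything for a joint
    refinement of the full configuration."""
    stage1 = {"frozen": list(old_params), "searched": list(new_params)}
    stage2 = {"frozen": [], "searched": list(old_params) + list(new_params)}
    return [stage1, stage2]
```

Iterating this schedule as ∆max grows is what would allow CFT data to be built up without a full, a priori given spin-partition.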
This type of implementation opens up the exciting possibility of reconstructing considerable amounts of CFT data without a full, specific, a priori given spin-partition.

Comments on User Input
To summarise, our overall approach is sketched in Alg. 2. It should be apparent from the description of the above three modes that, although the RL algorithm is set up to run independently without the input of an external user, in actual runs user intervention can help in significantly speeding up the search. A suitable real-time adjustment of the guess-size for individual parameters helps the agent focus faster around a region of potential interest. In the future, this is an aspect of the algorithm we would like to improve-or better automate-in order to facilitate more efficient parallel runs. At this stage, the mode with the minimal user input is mode 3, which involves the smallest search windows.

Application I: Minimal Models
We now pass on to explicit applications of our algorithm, starting with minimal models.
The unitary minimal models are, in an appropriate sense, the simplest possible 2D CFTs and benchmarks of the original conformal-bootstrap programme of the 1970s. Here we revisit them from the perspective of the global part of the Virasoro algebra, completely disregarding the Virasoro enhancement of the global so(2, 2) conformal algebra.
In this section we search for approximate solutions to the crossing equations that we listed in Sec. 2.2, which describe minimal models. The consistency of the crossing equations in this well-known class of 2D CFTs was understood analytically early on. It is therefore a good starting point to verify that our method recovers known facts about these theories correctly. We focus on the two leading representatives in the series of unitary minimal models, the Ising and tri-critical Ising models.

Analytic Solution
We next briefly recall some of the salient features of the Ising and tri-critical Ising models (see [24] for a comprehensive review).

Ising Model
The Ising model, M(4, 3), is the simplest model in the unitary minimal series M(p + 1, p). It has central charge c = 1/2 and is equivalent to the CFT of a free Majorana fermion. Besides the identity operator I, its spectrum contains two more primary operators: the spin operator σ with conformal weights (h, h̄) = (1/16, 1/16), and the energy-density operator (also called the thermal operator) ε with conformal weights (h, h̄) = (1/2, 1/2). The corresponding OPEs take the schematic form σ × σ ∼ I + ε, σ × ε ∼ σ, ε × ε ∼ I, and we will study the four-point functions ⟨σσσσ⟩ and ⟨ε(z_1, z̄_1) ε(z_2, z̄_2) ε(z_3, z̄_3) ε(z_4, z̄_4)⟩. For example, by focusing on the holomorphic part of the theory, we obtain at the first few levels the following quasi-primaries in the Virasoro conformal families of the identity and energy-density operators. In the conformal family of the identity, the resulting states are the only quasi-primaries up to level 5. In the conformal family of the energy-density, the states |ε⟩, …, are the only quasi-primaries up to level 5. A potential quasi-primary at level 2 does not exist, because it is one of the characteristic null states of the Ising model.
When combined with the anti-holomorphic sector, these results yield the spin-partitions that will be employed in the analysis of Sec. 4.2.1 below.

Tri-critical Ising Model
The tri-critical Ising model, M(5, 4), is the next minimal model in the unitary series. It has central charge c = 7/10 and, besides the identity operator, its conformal primary spectrum comprises three energy-density operators ε, ε′, ε″ with (h, h̄) = (1/10, 1/10), (3/5, 3/5), (3/2, 3/2), and two spin operators σ, σ′ with (h, h̄) = (3/80, 3/80), (7/16, 7/16). The OPEs of these operators are listed in Tab. 7.4 of [24]. We will be interested in four-point functions of the tri-critical Ising model that resemble those of the Ising model, and in the way our algorithm differentiates between the two CFTs. We will therefore focus on the primary operators σ′ and ε′, which satisfy OPEs closely analogous to (4.1) and (4.3). Accordingly, in the next subsection we will study the four-point functions ⟨σ′σ′σ′σ′⟩ and ⟨ε′ε′ε′ε′⟩. Similar to the case of the Ising-model primary ε, we find that the conformal family of ε′ in the tri-critical Ising model contains the states |ε′⟩, …, as its only quasi-primaries up to level 4 in the holomorphic sector. To obtain this result we had to use the fact that the Verma module of the state |ε′⟩ contains a null state at level 3 (in the holomorphic sector).

Reinforcement-Learning Results
The above analytic data can now be compared with those obtained from our RL algorithms.
This exercise is helpful in checking the efficiency of our code before proceeding to the more complicated example of the c = 1 compactified boson CFT.

σσσσ in Ising Model
The exact crossing equation for the four-point function (4.4) in the Ising model is given in (4.13).

Table 3: A spin-partition informed by the conformal-block decomposition of the four-point function ⟨σ(z_1, z̄_1) σ(z_2, z̄_2) σ(z_3, z̄_3) σ(z_4, z̄_4)⟩ in the Ising model with ∆max = 6.5.
Using the crossing equation (4.13) to determine our reward function, we performed the following computation with the RL algorithm. We set ∆σ = 1/8 for the external spin operator σ and searched in mode 2 for solutions with the spin-partition of Tab. 3, which is informed by the analytic solution with a cutoff ∆max = 6.5. A more agnostic search in mode 1, with more limited information about the initial profile of the scaling dimensions, is also feasible; such runs are presented in Sec. 5. Here, the mode-2 runs compute the OPE-squared coefficients independently and confirm the analytic values of the scaling dimensions that were used to initiate the runs. In the implementation of the algorithm we enforced the unitarity constraint that the OPE-squared coefficients are positive. This is a search in a 10-dimensional space of unknowns (5 for the scaling dimensions and 5 for the corresponding OPE-squared coefficients). The results of a run with 29 crossing equations (that is, (4.13) evaluated at 29 different points on the z-plane) appear in Tab. 4.
This particular run took approximately 12 hours on a modern laptop to yield the relative accuracy A = 3.31618 × 10⁻⁶. When unfrozen, the scaling dimensions were allowed to vary with a guess-size of 0.1. It is worth noting that the agent started the run with a random profile of OPE-squared coefficients (some of which were orders of magnitude away from those of the Ising model) and gradually converged to the results of Tab. 4.
We observe that the relative accuracy with which we can satisfy the truncated crossing equations is impressively high, even with a very rough truncation of only 5 quasi-primary operators. When compared against the analytic expectations, the numerical results for the scaling dimensions agree at the level of 1%. For the OPE-squared coefficients, the agreement is equally impressive for the two lower-lying operators, ε and the level-2 quasi-primary built from L₋₂, with scaling dimensions 1 and 2 respectively, but (as might be expected) becomes worse for the higher-dimension operators at ∆ = 4, 6 that lie closer to ∆max.
Notice that the exact unitarity bound for the spin-2, 4 and 6 operators requires their scaling dimensions to satisfy ∆ ≥ 2, 4 and 6, respectively. Since we have truncated the crossing equations, we do not expect the results to obey the strict unitarity bounds and, as a result, we have allowed the agent to explore solutions with a small violation of these bounds. (After submission to PRD, we found that replacing the mpmath numerical Python library with SciPy reduces the running time to just 30 minutes, with results similar to those of Tab. 4.)

Table 4: Analytic and numerical solutions for scaling dimensions and OPE-squared coefficients in the conformal-block decomposition of the four-point function ⟨σσσσ⟩ for ∆σ = 1/8 and the spin-partition of Tab. 3 with ∆max = 6.5. The numerical results were obtained with a mode-2 run of the RL algorithm.

σ σ σ σ in Tri-critical Ising Model
Similarly, in the tri-critical Ising model we study the four-point function (4.9), whose crossing equation is given in (4.14). Once again, the sum over h, h̄ does not include the contribution of the identity operator, which has been singled out in the last two terms of the equation. In this case we ran the RL algorithm in mode 2 by setting ∆σ′ = 7/8 for the external operator σ′, using the spin-partition of Tab. 5, informed by the analytic solution of the tri-critical Ising model with ∆max = 6.5.
It may be instructive to compare this spin-partition with the corresponding spin-partition for the Ising model in Tab. 3. The only difference is three versus one spin-2 quasi-primary operators.
In the analytic solution there is another difference, which is not apparent in Tab. 5.

Spin       0  1  2  3  4  5  6
s-channel  2  -  3  -  1  -  1

Table 5: A spin-partition informed by the conformal-block decomposition of the four-point function ⟨σ′(z_1, z̄_1) σ′(z_2, z̄_2) σ′(z_3, z̄_3) σ′(z_4, z̄_4)⟩ in the tri-critical Ising model with ∆max = 6.5.

Table 6: Analytic and numerical solutions for scaling dimensions and OPE-squared coefficients in the conformal-block decomposition of the four-point function ⟨σ′(z_1, z̄_1) σ′(z_2, z̄_2) σ′(z_3, z̄_3) σ′(z_4, z̄_4)⟩ for ∆σ′ = 7/8 and the spin-partition of Tab. 5 with ∆max = 6.5. The numerical results were obtained with a mode-2 run of the RL algorithm.

At spin 6 the tri-critical Ising model has two degenerate quasi-primary states instead of just one, whose contributions combine into a single term in the crossing equations. Such degeneracies are therefore invisible to the spin-partition and consequently not detectable in our analysis.
In this context, we performed a search in a 14-dimensional space of scaling dimensions and OPE-squared coefficients. The RL algorithm was run with 29 different points on the z-plane. The results that appear in Tab. 6 were obtained after a run that lasted approximately 8 hours and yielded a configuration with relative accuracy A = 0.000705966 (significantly larger than that in Tab. 4 for the Ising model).
The comparison between the numerical and analytic results follows a pattern similar to that in the Ising model. The agent has clearly located the CFT data of the tri-critical Ising model.

Spin       0  1  2  3  4  5  6
s-channel  1  -  2  -  1  -  1

The operators ε, ε′, Y are all spinless with different scaling dimensions: 1, 3, 6, respectively.
In this subsection, we compare the first two cases: ⟨εεεε⟩ in the Ising model, and ⟨ε′ε′ε′ε′⟩ in the tri-critical Ising model.
In all these cases the crossing equations take a similar form.

Table 9: Analytic and numerical solutions for scaling dimensions and OPE-squared coefficients in the conformal-block decomposition of the corresponding four-point function.

Application II: c = 1 Compactified Boson
With an eye towards more general applications, it is important to explore the performance of our approach beyond the restricted class of rational conformal field theories, of which minimal models are a special case. In this section, we study the c = 1 compactified boson CFT. This is a free scalar CFT. Free CFTs are the benchmark of the Lagrangian QFT approach and the basis of perturbative methods in quantum field theory, readily solved by traditional methods and an entry-level litmus test for the generalisation of our method to more challenging settings.
The reader should appreciate that by rediscovering the compactified boson CFT as a solution to the crossing equations, one would be able to solve it without the use of standard Lagrangian methods; for example, one could determine correlation functions without using Wick's theorem. Despite its simplicity, the free scalar CFT has a rich spectrum of primary operators with momentum and winding around the target circle, and scaling dimensions that depend non-trivially on an exactly marginal coupling: the radius of the circle. It is therefore an interesting toy model for our methods, which we apply to four-point functions of charged vertex operators and of the conserved U(1) current. We discover that even with a very small cutoff, as low as ∆max = 2, the algorithm can correctly detect the 2D compactified boson CFT and returns rather accurate approximate values for scaling dimensions and OPE-squared coefficients.

Analytic Solution
Before delving into the results of the RL exercise, it is useful to recall briefly the analytic solution of the 2D S 1 scalar theory that we want to rediscover from a conformal bootstrap/RL perspective.
Consider the 2D CFT of a compact boson X with radius R. Since this is a free theory, it is straightforward to compute all of its data analytically. Let us summarise some of the pertinent details, following closely the conventions of [37]. The basic conformal primaries of the theory are the U(1) currents j, j̄ and the vertex operators V_{p,p̄}, where n and w are the integer momentum and winding quantum numbers of the corresponding states. j, j̄ have conformal weights (h, h̄) = (1, 0), (0, 1) respectively, while V_{p,p̄} has (h, h̄) = (p²/2, p̄²/2). The spin of an operator is s = h − h̄. As a result, the vertex operator V_{p,p̄} has spin s = ½(p² − p̄²) = nw. Corresponding states with only momentum, or only winding, are spinless.
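As a quick consistency check of these statements, the snippet below computes (h, h̄, ∆, s) for given (n, w, R). The normalisation p = n/R + wR/2, p̄ = n/R − wR/2 is an assumption about the conventions of [37] (other conventions rescale R), but the relation s = nw that it reproduces is convention-independent.

```python
def vertex_operator_data(n, w, R):
    """CFT data of the vertex operator V_{p,pbar} with momentum n and
    winding w at radius R.  The normalisation of p, pbar below is an
    assumed convention; only s = nw is convention-independent."""
    p, pbar = n / R + w * R / 2, n / R - w * R / 2
    h, hbar = p**2 / 2, pbar**2 / 2
    return h, hbar, h + hbar, h - hbar   # weights, dimension ∆, spin s

h, hbar, delta, spin = vertex_operator_data(n=1, w=1, R=2.0)
assert spin == 1 * 1   # s = (p^2 - pbar^2)/2 = n*w, independent of R
```

In particular, states with only momentum (w = 0) or only winding (n = 0) come out spinless, as stated above.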
The remaining spectrum of operators can be organised using the Virasoro algebra, but since we only want to use the global so(2, 2) part of the 2D conformal algebra, we need to also identify all the quasi-primary operators. All quasi-primaries of the theory can be obtained by combining any quasi-primary operator from the left-moving (holomorphic) sector with any quasi-primary operator from the right-moving (anti-holomorphic) sector.
Because of the equations of motion, ∂∂̄X = 0, no factors with mixed holomorphic-antiholomorphic derivatives appear in an operator. Hence, let us focus momentarily on the holomorphic sector.
As already noted in our minimal-model discussion, a quasi-primary state (in the holomorphic sector) must be annihilated by the global lowering operator, and the quasi-primaries of the theory can be identified level by level. The two- and three-point functions involving the above quasi-primaries can be computed straightforwardly using Wick contractions. Explicit results, to be compared against the RL output, will be listed in the next subsection.

Reinforcement-Learning Results
We will now attempt to rediscover the S 1 theory from the conformal-bootstrap perspective.
We consider two kinds of four-point functions. The first one is the four-point function of four spinless conformal primaries with arbitrary, but fixed, scaling dimension. The zero-spin assumption is not necessary; we only make it here for convenience and illustration purposes.
We further assume that these operators are charged under a conserved U(1) symmetry. We denote them as V_p and parametrise their scaling dimension ∆_p by the real variable p using the relation

∆_p = p² .    (5.10)

We emphasise that this equation should be viewed as the definition of the real number p.
At this point we do not specify how p relates to the U (1) charge of V p and hence (5.10) is not a dynamical statement about the scaling dimension ∆ p in terms of some other quantum number.
Keeping the above in mind, we consider the four-point function (5.11), where V̄_p denotes the complex conjugate of V_p. Since V_p and V̄_p have opposite U(1) charge, the four-point function (5.11) is neutral under the assumed global U(1) symmetry. V_p is expected to capture the primary vertex operator V_{p,p}(z, z̄) = e^{ip(X(z)+X̄(z̄))} with p = p̄ = n/R and winding w = 0, or the T-dual V_{p,−p}(z, z̄) = e^{ip(X(z)−X̄(z̄))} with p = wR and momentum n = 0. Only a minimal part of this information will be incorporated indirectly into the algorithm via the spin-partition. Using this partial information, the agent will have to uncover that V_p is indeed part of the S¹ theory and that p is related to the U(1) charge.
The second kind of four-point function that we will consider is the correlator of the conserved spin-1 operator j, j(z 1 )j(z 2 )j(z 3 )j(z 4 ) . (5.12) We next display the results of the RL algorithm for each case.

Momentum/winding Sector
The crossing equation for the four-point function (5.11) can be written as in (5.13). In the t-channel block decomposition we have separated the contribution of the identity operator and have used the normalisation convention ⟨V_p V̄_p⟩ = 1.
Let us fix for concreteness the scaling dimension ∆_p of V_p to some specific value, e.g. ∆_p = 0.1, and consider different values of the cutoff ∆max. In Tab. 11 we collect four spin-partitions that will be used to study the corresponding truncations. The results at ∆max = 3.5 in Tab. 12 exhibit a noticeable decrease in A (which translates to a smaller violation of the truncated-reduced crossing equations) and an agreement between the numerical and analytic results for the low-lying spectrum that is comparable with the ∆max = 2 run. Notice that there are two deliberate features complicating the ∆max = 3.5 run.
First, the fact that the spin-3 operator is absent in the s-channel was not an input. The agent had to discover this feature (as it does), but this complicates the search. Interestingly, although the spin-3 operator is absent in the exact conformal decomposition, the agent manages to identify its scaling dimension with remarkable accuracy. Apparently, this is not an accident; similar results are obtained in the higher-cutoff runs. Comparing the numerical and analytic results across these runs, we find an agreement comparable to the one obtained in Tab. 14 with the use of the RL algorithm.
As an illustration, we performed a preliminary analysis of the statistical errors with multiple runs for the ∆ max = 2, ∆ p = 0.1 case by completing 12 runs with 20 z-points in about 2 hours each. The results, collected in Tab. 15, provide a more complete picture of the final output of the computation. We note that the errors in Tab. 15 do not include systematic errors associated with the truncation or the choice of the z-points.
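The multiple-run error analysis amounts to computing means and sample standard deviations of the CFT data across independent runs. A minimal sketch with illustrative numbers (not the values of Tab. 15):

```python
import numpy as np

def run_statistics(results):
    """Mean and sample standard deviation of CFT data over multiple
    independent RL runs, as in the 12-run analysis of Tab. 15.
    Numbers fed in below are illustrative only."""
    arr = np.asarray(results)          # shape: (n_runs, n_parameters)
    return arr.mean(axis=0), arr.std(axis=0, ddof=1)

# hypothetical: 3 runs estimating two parameters (a dimension, an OPE^2)
runs = [[1.01, 0.24], [0.99, 0.26], [1.00, 0.25]]
mean, std = run_statistics(runs)
assert np.allclose(mean, [1.00, 0.25])
```

As noted in the text, such statistical errors do not account for the systematic errors associated with the truncation or the choice of z-points.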
Finally, we performed the following exercise. Using the fixed spin-partition for ∆ max = 2 from Tab. 11, we varied ∆ p from 0.1 to 0.6 with a step of 0.1. As ∆ p increases so do the scaling dimensions in the s-channel. As a result, in the s-channel we increase appropriately the upper cutoff in the search and the fixed spin-partition is no longer that of ∆ max = 2.
At the same time, the t-channel scaling dimensions remain within the ∆ max = 2 window.
Fig. 1 collects the results of this exercise. Without any additional information about the CFT, Fig. 1 would provide evidence that the variable p is proportional to the U(1) charge of the operator V_p, since the scalar appearing in the OPE V_p V_p has twice the U(1) charge of V_p (the U(1) charge is additive) and its scaling dimension ∆_s is found to be ∆_s = (2p)². A sharper argument along these lines could be obtained by studying the four-point function ⟨V_{p₁} V_{p₂} V̄_{p₁} V̄_{p₂}⟩ for a generic pair p₁, p₂. The four-point function ⟨jjV_p V̄_p⟩ would also yield related information.
At this point, it is interesting to ask whether the RL results allow us to conclusively determine that the CFT in question has a one-dimensional conformal manifold (namely an exactly marginal operator). The uncharged, spinless operator of scaling dimension 2 that appears in the t-channel is an obvious candidate that indicates the existence of a one-dimensional conformal manifold. Moreover, if there is some additional information that the spectrum of the CFT is discrete, the fact that we can solve the crossing equations for a continuous set of scaling dimensions ∆ p for the operators V p , signals the fact that the theory has an exactly marginal deformation and that the scaling dimension of V p can be used as a proxy for the value of the exactly marginal coupling.

Spin-1 correlation functions
A characteristic feature of the S^1 theory is the existence of a conserved holomorphic (and separately an anti-holomorphic) U(1) current j(z), under which many of the operators of the theory are charged. In this subsection, we study the four-point function of this current, (5.12). The holomorphic current j(z) has spin 1 and (since it is conserved) scaling dimension ∆ j = 1. Keeping its scaling dimension ∆ j free for the moment, we find that the four-point function (5.12) yields the crossing equation (5.14). The 1/16 factor in the last term, which captures the contribution of the identity, originates from the normalisation condition ⟨jj⟩ = 1/4.

Table 17: Analytic and numerical solutions from 10 runs for the mean and standard deviation of scaling dimensions and OPE-squared coefficients in the conformal-block decomposition of the four-point function ⟨j(z 1 )j(z 2 )j(z 3 )j(z 4 )⟩. ∆ j is also an unknown and the spin-partition is that of Tab. 16. The numerical results were obtained with 16 z-points and a mode-1 run of the RL algorithm.
With this spin-partition we ran the RL algorithm 10 times in mode 1 using 16 z-points.
Each run lasted approximately two hours. In this case, we kept the conformal scaling dimension of the external operator j as one of the unknowns to be determined by the agent. Overall, this was a 9-dimensional search. The results, collected in Tab. 17, include statistical errors and exhibit the relative accuracy A = (2.13657 ± 0.0819217) × 10^−4. It is very rewarding to see that the agent determined the scaling dimension of the conserved U(1) current to excellent accuracy just from knowledge of the spin-partition, and sensibly reproduced the low-lying spectrum and OPE data of the quasi-primary operators that appear in the OPE of the current with itself. For comparison, we also performed a single, independent mode-2 run with 16 z-points, in which the scaling dimension of the current was fixed from the beginning at the analytic value ∆ j = 1.

Table 18: Analytic and numerical solutions for scaling dimensions and OPE-squared coefficients in the conformal-block decomposition of the four-point function ⟨j(z 1 )j(z 2 )j(z 3 )j(z 4 )⟩ for ∆ j = 1 and the spin-partition of Tab. 16. The numerical results were obtained with 16 z-points and a mode-2 run of the RL algorithm.

Conclusions and Outlook
In this paper we introduced the use of Reinforcement-Learning techniques into the conformal-bootstrap programme. We tested an RL soft Actor-Critic algorithm in the context of several well-known two-dimensional CFTs. We view the approach introduced here as largely complementary to the more standard ones that have been developed to date in the context of the numerical conformal bootstrap. We believe that our method is comparatively stronger at performing efficient multi-dimensional searches in arbitrary, a priori selected (unitary or non-unitary) CFTs.
Since it is based on statistical and probabilistic techniques, it can be weaker in accuracy, in detecting rigorous bounds, and in conclusively rejecting CFT data as inconsistent. The latter is the context in which standard numerical conformal-bootstrap approaches have excelled over the last decade. Eventually, one would like to combine all available analytic and numerical methods into a powerful multi-purpose toolbox.
We envisage the most efficient application of our approach in contexts where a CFT can be solved in a parametrically convenient regime (e.g. a weakly coupled large-N regime, or a weakly coupled regime on a conformal manifold). One can then use the information from the perturbative solution to set up a well-informed spin-partition, which can in turn be applied adiabatically to a search with gradually changing parameters. By updating the CFT data gradually, one should be able to implement the RL algorithm step-by-step and track the data from a weak- to a strong-coupling regime. This is a concrete context in which one can try to leverage all available analytic and numerical information. For example, in superconformal field theories, our approach can benefit from many recent developments that use the superconformal structure of the theory in an essential way.
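A minimal sketch of this adiabatic strategy, with a toy `solve` standing in for one full RL search (both names are ours, not part of the paper's code):

```python
# Seed each search with the solution found at the previous coupling, so the
# agent only has to make a small correction at every step of the scan.

def adiabatic_track(couplings, solve, initial_data):
    """Track CFT data along a list of couplings, using the solution at each
    step as the starting point of the next (the 'gradual update' above)."""
    data, history = initial_data, []
    for g in couplings:
        data = solve(g, data)        # one RL search, warm-started at `data`
        history.append((g, data))
    return history
```

In practice `solve` would wrap a full soft Actor-Critic run whose search window is centred on the warm-start data.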
Although our results provide a proof of principle for the usefulness of RL techniques for this class of problems, several aspects of our approach require further investigation and development. The most urgent is to understand systematically how to incorporate reliable errors into our computations. The primary source of error is analytic in nature and originates from the truncation of the conformal-block expansions. The convergence properties of these expansions [27] imply that there is a sufficiently high ∆ max above which the error is negligible. It is unclear, however, how to identify this optimal ∆ max in a generic theory and for generic four-point functions. Hence, one might initially need to perform a case-by-case analysis to explore how our results are affected by increasing ∆ max.
Another source of error, sometimes more significant than the error due to the ∆ max truncation, comes from the way we reduce the functional dependence of the crossing equations on the cross-ratios to a discrete set of algebraic equations. In this paper we implemented this reduction by evaluating the crossing equations on a finite set of cross-ratio values. We noticed experimentally that the sampling of z-points suggested in Sec. 3.1 of [28] works well in our computations. However, we lack a good understanding of whether this sampling is optimal, or of how the calculations are affected by the number of z-points selected. An error can be associated with these effects by varying the sampling (in form and size). Alternatively, one can explore more standard reductions based on Taylor expansions of the conformal blocks around some point in z-space. It would be interesting to repeat the computations of this paper with this alternative approach and compare results.
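The reduction step can be illustrated schematically. The uniform grid below is a placeholder, not the sampling of [28], and `crossing_violation` is a toy stand-in for the truncated crossing equations:

```python
# Reduce a crossing "function" of the cross-ratio z to a finite vector of
# algebraic residuals by sampling z-points; the search then minimises the
# norm of this vector.
import math

def sample_z_points(n):
    # n points on the real interval (0, 1), avoiding the endpoints
    # (a uniform placeholder grid, not the sampling of [28]).
    return [(k + 0.5) / n for k in range(n)]

def crossing_violation(z_points, trial_data):
    # Toy stand-in: the residual vanishes at every sampled point iff the
    # trial datum takes its exact value (here a == 1).
    a = trial_data["a"]
    return [a * z * (1 - z) - z * (1 - z) for z in z_points]

def euclidean_norm(residuals):
    return math.sqrt(sum(r * r for r in residuals))
```

Varying the number and placement of the z-points, and watching how the minimising CFT data move, is one concrete way to assign the error discussed above.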
Other errors have to do with the statistical nature of our approach and the fact that we do not know a priori the minimal possible violation of the truncated crossing equations for a given truncation and reduction. In this paper we quantified this violation with a relative measure of accuracy A and performed runs of the RL algorithm up to the point where the improvement of A saturated. An important additional measure of error for each CFT datum is the statistical error obtained by performing the same type of run many times, which we sampled in the case of the c = 1 compactified boson CFT on S^1 for the simplest case of ∆ max = 2 in the momentum sector and for ∆ max = 8 in the four-point function of the conserved U(1) current. The evaluation of this type of error would benefit from a fully parallelisable algorithm. As we noted in Sec. 3.3, current implementations of the algorithm benefit from judicious supervision by the user, which obstructs the full parallelisability of the code. It would be useful to improve this aspect in future work.
In this paper we did not make systematic use of the constraints of global symmetries, or of the full constraints of unitarity on the OPE-squared coefficients. As we observed in Sec. 5.2.1, multi-dimensional searches can benefit significantly from prior information on the signs of the OPE-squared coefficients. Without such information the agent is allowed to explore cancellations between different conformal blocks that sidetrack the search by increasing the statistical error on certain OPE-squared coefficients, especially those at higher scaling dimensions, which naturally come with suppressed numerical values.
Finally, we treated the learning algorithm itself as a black box, using the off-the-shelf soft Actor-Critic algorithm of [22]. It would be interesting to explore what gains in efficiency and speed one can achieve by tuning hyperparameters, or by choosing a different algorithm, such as the Deep Deterministic Policy Gradient method [33]. We also chose the simplest definition (3.1) for the reward function. The choice of an appropriate reward function is crucial for achieving better results with RL algorithms, and this is another area that deserves further investigation.
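As an illustration of the kind of reward we have in mind, here is a minimal sketch assuming, purely for concreteness, that the reward is the inverse Euclidean norm of the crossing-equation violations at the sampled z-points; this is a stand-in for, not a reproduction of, Eq. (3.1):

```python
# Toy reward: the smaller the crossing violation, the larger the reward.
import math

def reward(residuals, eps=1e-12):
    # Inverse Euclidean norm of the violation vector; eps guards against
    # division by zero at an exact solution.
    return 1.0 / (math.sqrt(sum(r * r for r in residuals)) + eps)
```

Shaping this function, e.g. rescaling it or penalising unphysical regions of the search space, is the natural place to inject further prior knowledge about the CFT.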