Similarity-Based Equational Inference in Physics

Automating the derivation of published results is a challenge, in part due to the informal use of mathematics by physicists, compared to that of mathematicians. Following demand, we describe a method for converting informal hand-written derivations into datasets, and present an example dataset crafted from a contemporary result in condensed matter. We define an equation reconstruction task completed by rederiving an unknown intermediate equation posed as a state, taken from three consecutive equational states within a derivation. Derivation automation is achieved by applying string-based CAS-reliant actions to states, which mimic mathematical operations and induce state transitions. We implement a symbolic similarity-based heuristic search to solve the equation reconstruction task as an early step towards multi-hop equational inference in physics.


I. INTRODUCTION
Automating physical reasoning first involves the comprehension of physical concepts, language, and algebra, which form a cohesive informal mathematical explanation. Physics-inspired data-driven neural approaches are often used for accurate calculation and simulation, but not for deriving equations [1][2][3][4][5][6]. Examples of equational [7][8][9] and conceptual [10] inference do not convey the complex arguments conducted within derivations, which combine assumed mathematics similar to premise selection, symbolic manipulation of equations, and reference to physical concepts. Automated Theorem Proving (ATP) is associated with mathematical rigour, but is not suitable for the type of informal equational and conceptual reasoning used by physicists [11,12]. Literature at this interface is scarce [13], and few complete real-world derivations of published results exist in a computer-interpretable format.
Considering these limitations, this paper proposes three contributions: (i) we present a novel dataset consisting of 368 equations compiled from the detailed derivation of a published equation in physics [14], represented as a sequence of states and actions, and we describe the dataset creation method; (ii) we define an equation reconstruction task by considering the smallest nontrivial derivation as three consecutive equational states with an unknown intermediate state we aim to reconstruct; (iii) we propose the use of symbolic similarity-based heuristics, a knowledge base of assumed equations, and a set of granular mathematical operations posed as string-related actions formulated in a computer algebra system (CAS), to solve the equation reconstruction task on the PhysAI-DS1 dataset [15]. We claim an approach accuracy of 56.2% considering both exact matching and the non-zero similarity cutoff regime, given the dataset, knowledge base, the set of actions, and the computer algebra system.

* jordan.meadows@postgrad.manchester.ac.uk † andre.freitas@manchester.ac.uk

II. EQUATIONAL INFERENCE IN PHYSICS
Research at the border between artificial intelligence and physics has seen many recent successes. Data-driven approaches are popular, where many share the theme of feeding large datasets to physics-inspired architectures, learning physical model parameters often within the context of real-world problems [1][2][3][16][17][18][19]. In contrast to these simulation and calculation themed approaches, others aim to infer equations or conceptual insights from experimental or generated data from toy model physics closer to symbolic regression [7-10, 20, 21], even recovering the latent network structure of dynamical systems from time series data [22]. While this class of approach is more similar to traditional theoretical physics, neural methods suffer from an explainability problem, and equational and conceptual reasoning are crucial components of physics derivations.
Automated theorem proving has seen little success in physics on the equational reasoning side: literature is scarce [13], although efforts have been made to formalise physics towards ATP [12]. While Ref. [11] highlights five directions, we expand upon three in this work: (i) increasing the collection of physical theories we have in forms that can be used for training and validating symbolic reasoners in physics; (ii) developing representation methods and evaluation tasks which necessitate approaches that select suitable approximations, idealisations and abstractions; and (iii) analysis of the nature of the informal argumentation used in physics.
Aligned with direction (i), we present 368 equations taken from physical theory [14] converted into a dataset as the first contribution. Following direction (iii), this dataset is generated by a method which captures much of the derivation argumentation, by treating derivation progression as a finite state machine reliant upon computer algebra operations and representations. Direction (ii) we address with an equation reconstruction task as a second contribution, which provides a medium for creating approaches which involve the selection of idealisations, approximations, and abstractions. As a third contribution, our similarity-based approach partially reconstructs the equational argument from PhysAI-DS1, which requires use of symbolic approximations and the selection of supporting premises as a result of the physical theory the comprised derivation represents.
ATP methods have a strong reliance on logical formalisation. In order to be comparable to ATP-based methods, derivations in physics would need to be translated into a logical form. There is a shared understanding that existing standard logical frameworks used in ATPs are limited and not aligned to the requirements of the nature of physics argumentation [11,12], and thus comparative baselines of this kind cannot be made apart from a very controlled fragment of the physics domain. Additionally, to our knowledge there are no efforts automating computer algebra in physics, at least not at the discourse-level form expressed in physics papers. We claim our baseline for the equation reconstruction task is the first of its kind, with other approaches unable to be easily adapted to solve it.
Reinforcement learning has been applied in the context of regular [23] and equational [24] theorem proving, and separately nuclear physics [25]. Although such techniques may be useful for searching large state spaces, they are not currently applied in the context of computer algebra expressed at the surface form of physics papers. Instead, our contribution targets a granular understanding of the dialog between equational symbolic forms in physics and similarity metrics, and its utility as one component of the symbolic inference mechanism.
We develop and evaluate our early single-hop [26] approach in the context of a multi-hop extensible equational inference task, using a computer algebra sequence and subsequent dataset created using a novel method for converting contemporary physics derivations into CAS-interpretable data.

III. DATASET CONSTRUCTION METHODOLOGY
In theoretical condensed matter, Ref. [14] is concerned with manipulating topological features of polaritons in cavity-embedded honeycomb metasurfaces. Its first numbered result, equation (1), is accompanied by a brief derivation highlighting key equations. The challenge is to use this derivation as a guide to derive (1) with CAS-level representation, at the granularity at which physicists perform the derivation in real time, and then to store this derivation as a dataset.

A. Manual Derivation
We start by manually deriving (1) from a single first-quantised harmonic oscillator (as seen in Figure 1). We avoid skipping any steps, which drastically lengthens the derivation: minute mathematical operations that are taken for granted in the paper form must be made explicit to enable CAS compliance. We frame the derivation as a finite state machine where the current equation is the current state, and we carefully choose an informal action (operation), and action argument, that will progress the derivation to the next state. The state, action, and argument sequences are recorded up to equation (1).

B. Computer Algebra Augmentation
This curated derivation serves as a guide for the approximate derivation to be recreated in the CAS. However, particular actions, e.g. "divide RHS by 2", may not already exist explicitly in the CAS and must be constructed. There are limitations on how actions can be constructed, so the reconstructed derivation may not exactly mirror the hand-written version in general.
We use SymPy [27], which allows the rendering and manipulation of equations in LaTeX, such that equations can match those within papers. From the hand-written derivation, which we represent as a state sequence, we define an initial 2-tuple (LHS, RHS) equational state using LHS and RHS generated from the CAS, and build and store appropriate actions which induce state transitions to other 2-tuples, which closely match the hand-written state sequence where possible. Along the way, actions may require as yet unused symbols and assumed equations to progress the derivation, both of which we collect in a knowledge base (KB). We employ the following conventions:
1. Equations are stored as (LHS, RHS) tuples.
2. Actions are functions which accept two arguments: the current state and one of the following: no argument, a symbol, or an equation. These secondary arguments are called Non-State Arguments (NSAs). Respectively, these action categories are called self-state, symbol-state, and equation-state actions.
3. Actions cause the minimum possible state transition while still progressing down the derivation sequence. In contrast, the action causing the maximum possible state transition would be called "immediately derive the goal state from the initial state". A more granular action example is "divide RHS by 2", which induces a comparatively minimal change in the state semantics. Actions are designed to operate at this resolution, but physicists work less granularly.
4. No two states in the sequence share the same LHS (by default). For example, if $(H, \omega_0)$ is the current state and the action "divide RHS by 2" is applied to it, then the next state in the sequence is $(H^{(1)}, \omega_0/2)$ with an index appended to the LHS, because $\omega_0 \neq \omega_0/2$. However, an action "remove index" exists for the specific case of algebraic manipulation with the LHS.
5. If an action is not appropriate given its arguments, then the action returns the current state (e.g. action = "divide current state RHS by KB equation RHS", but the KB equation RHS is a vector).

FIG. 1. The first three consecutive equational states from the computer algebra derivation. Each state is seen associated with four main pieces of information stored in PhysAI-DS1. If state 2 is considered unknown, the three states form a derivation unit and this diagram represents the equation reconstruction task, where the goal is to select the correct action and NSA to reconstruct state 2 given states 1 and 3.
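The conventions above can be illustrated with a toy sketch. It is not the SymPy-based implementation used to build PhysAI-DS1: states are plain strings rather than CAS objects, and the helper `bump_index`, the `^(n)` index notation, and both action functions are hypothetical names introduced here for illustration only.

```python
import re
from typing import Optional, Tuple

# Convention 1: a state is an (LHS, RHS) 2-tuple (plain strings in this toy).
State = Tuple[str, str]
# Convention 2: a non-state argument is None, a symbol, or an equation 2-tuple.
NSA = Optional[object]


def bump_index(lhs: str) -> str:
    """Append or increment a '^(n)' index so a changed state gets a new LHS (Convention 4)."""
    match = re.fullmatch(r"(.*)\^\((\d+)\)", lhs)
    if match:
        return f"{match.group(1)}^({int(match.group(2)) + 1})"
    return f"{lhs}^(1)"


def divide_rhs_by_two(state: State, nsa: NSA = None) -> State:
    """Self-state action causing a minimal state transition (Convention 3)."""
    lhs, rhs = state
    return bump_index(lhs), f"({rhs})/2"


def divide_rhs_by_kb_rhs(state: State, nsa: NSA) -> State:
    """Equation-state action; returns the current state if the NSA is unsuitable (Convention 5)."""
    if not (isinstance(nsa, tuple) and len(nsa) == 2):
        return state
    lhs, rhs = state
    return bump_index(lhs), f"({rhs})/({nsa[1]})"


state = ("H", "omega_0")
state = divide_rhs_by_two(state)
print(state)
```

Because every action has the same `(state, NSA) -> state` signature and falls back to the unchanged state, any action can be applied to any state during a search without special-casing failures.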
From this sequence of states, actions, and arguments, we construct a dataset PhysAI-DS1, consisting of 368 consecutive entries. Features from PhysAI-DS1 from left to right read: {State string length (LaTeX text), State string (LaTeX text), State string length (SymPy text), State string (SymPy text), State string length (SymPy tree), State string (SymPy tree), Action, Non-state argument (SymPy text), State type, Action type}, which we have designed to capture the important aspects of a physics derivation. The minimum, average, and maximum equation string lengths for the SymPy text considered in this paper are respectively 52, 495, 5476. The LaTeX equivalent is 56, 582, 6318. We consider three different representations for the states, while NSAs are expressed using the native representation of the CAS. Of the 368 states, 76 are categorised as type integrative, 217 are consequent, and 75 are terminal.
The states are categorised by their role in the derivation:
• Integrative states: formed by the action "consider knowledge base equation".
• Consequent states: follow integrative states in the sequence.
• Terminal states: directly precede integrative states.
There are 35 unique actions split into self-state, symbol-state and equation-state categories, described in Convention 2. The final state 2-tuple in the dataset, represented as an equation, is mathematically equivalent to (1) but rearranged, up to the LHS. SymPy has a native method of ordering equations and this ordering problem persists throughout the derivation. The order of noncommutative terms is preserved, however, so this issue is cosmetic. The names of some actions are misleading, and some actions behave unintuitively. This reflects the limitations of our chosen CAS and of the curated derivation, as often the most obvious way to force a desired state transition during construction involves the invention of obscure string-based operational actions. See Figure 1 for key information associated with equational states in PhysAI-DS1.
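As a rough illustration of the three state categories, the sketch below labels each state from the action that produced it and the action that follows it. The tie-breaking order (integrative takes priority over terminal) and the action name "expand RHS" are assumptions made for this example, not details taken from PhysAI-DS1.

```python
from collections import Counter
from typing import List

KB_ACTION = "consider knowledge base equation"


def categorise_states(actions: List[str]) -> List[str]:
    """Label state i using the action that formed it and the action forming state i+1."""
    labels = []
    for i, action in enumerate(actions):
        next_action = actions[i + 1] if i + 1 < len(actions) else None
        if action == KB_ACTION:
            labels.append("integrative")   # formed by a knowledge base lookup
        elif next_action == KB_ACTION:
            labels.append("terminal")      # directly precedes an integrative state
        else:
            labels.append("consequent")    # continues the current derivation branch
    return labels


# Hypothetical action sequence; "expand RHS" is an invented action name.
actions = [KB_ACTION, "divide RHS by 2", "expand RHS", KB_ACTION, "remove index"]
labels = categorise_states(actions)
print(Counter(labels))
```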

IV. EQUATION RECONSTRUCTION TASK
The equation reconstruction task aims to support the construction of AI-supported derivation systems operating at different levels of symbolic representation (from a string/natural language level to an algebraic object/operational level). The task considers three consecutive states in a derivation grouped into a derivation unit $\{s_{i-1}, s_i, s_{i+1}\}$, where two consecutive actions performed on $s_{i-1}$ form the unit.
We aim to create a reconstruction $\hat{s}_i$ such that it is mathematically equivalent and sufficiently similar to an unknown $s_i$, given $s_{i-1}$ and $s_{i+1}$.
In practice this involves selecting the correct action and NSA to cause the transition $s_{i-1} \to s_i$ (with probability 1) given a suitable knowledge base and action set. This derivation unit translates along the state sequence, and combined with a suitable inference algorithm, outputs one reconstruction per translation.
States are non-Markov, which reflects the behaviour of physicists who may use equations in the distant history to progress the derivation. Action and state sequences behave deterministically, and actions receive two arguments: the current state and the non-state argument (NSA). In addition to those in the KB, we allow for equation-type NSAs to be any equational state in the history up to $s_i$, which violates the Markov condition.
A derivation unit treats the contained states as a "micro-derivation" with no explicit connection to the complete sequence. Although the state history is required by the actions for progression, it effectively becomes part of the KB, which grows in proportion to $i$. It follows that, for $N$ states in the full derivation, there are $N - 2$ fully independent reconstructions $\hat{s}_i$, which is equivalent to solving the equation reconstruction task for $N - 2$ independent derivations, each of size three.
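The translation of a derivation unit along the state sequence amounts to a sliding window of width three. A minimal sketch, with placeholder state names that are purely illustrative:

```python
from typing import List, Sequence, Tuple


def derivation_units(states: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Return every window (s_{i-1}, s_i, s_{i+1}); N states give N - 2 units."""
    return [
        (states[i - 1], states[i], states[i + 1])
        for i in range(1, len(states) - 1)
    ]


# Placeholder states; in the task the middle element of each unit is hidden.
states = ["s0", "s1", "s2", "s3", "s4"]
units = derivation_units(states)
print(len(states), len(units))
for previous, hidden, following in units:
    print(f"reconstruct {hidden} from {previous} and {following}")
```

Each unit is solved independently of the others, so the reconstructions can be computed in any order.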
We append dummy states onto the beginning and end of the sequence, totalling $N = 370$ states. This results in 368 reconstructions comparable with the 368 states in the derivation. The first dummy state is given by $s_0 = (x, ?)$ and represents an entirely neutral equation akin to beginning with no prior knowledge. The second dummy state, appended last, is the 2-tuple form of equation (1) from Ref. [14] passed through a computer algebra system, such that the final state in PhysAI-DS1 and the dummy state are given respectively by equations (3) and (4).

V. SYMBOLIC SIMILARITY-BASED SEARCH

A. Similarity Measures
If we consider equations (3) and (4) in the sequence we can observe they are equivalent but symbolically dissimilar, as the string representations of each LHS are not identical. We employ a similarity measure M to classify whether one state is equal and similar to another, with a hyperparameter ε.
We form a set of canonical algebraic actions referred to by the PhysAI-DS1 dataset. Actions accept two arguments: the current state, and a Non-State Argument (NSA). An NSA can be either None, a symbol, or an equation 2-tuple.
We form a knowledge base of the equations required for the derivation, and a set $L$ consisting of every (non-numeric) symbol from each equation. All states prior to $s_{i-1}$ are also included in the knowledge base. Additionally, if $l_{i-1}$ is the set of symbols which constitutes $s_{i-1}$, and $l_{i+1}$ is the set which constitutes $s_{i+1}$, then we append the symmetric difference $l_{i-1} \triangle l_{i+1}$ to $L$. As we know the form of $s_{i-1}$ and $s_{i+1}$, we can use the constituent symbols to guide the derivation, which we achieve by including $l_{i-1} \triangle l_{i+1}$ in $L$, an element of which may be accepted as the argument of an action function as an NSA. The set $R$ contains the state history up to $s_i$ and the requisite equations of the knowledge base. We define a combined knowledge base $K = L \cup R$.
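The construction of the combined knowledge base can be sketched with Python sets. This is a toy under stated assumptions: equations are (LHS, RHS) string tuples rather than CAS objects, symbols are extracted with a simple regular expression (which would also pick up function names), and the example equations are invented for illustration.

```python
import re
from typing import Set, Tuple

Equation = Tuple[str, str]


def symbols_of(equation: Equation) -> Set[str]:
    """Non-numeric identifiers appearing on either side of a plain-text equation."""
    return set(re.findall(r"[A-Za-z_][A-Za-z_0-9]*", " ".join(equation)))


# Hypothetical known states s_{i-1} and s_{i+1}, and one requisite KB equation.
s_prev: Equation = ("H", "p**2/(2*m) + m*omega**2*x**2/2")
s_next: Equation = ("H", "hbar*omega*(a_dag*a + 1/2)")
kb_equations = {("E", "hbar*omega")}
history: Set[Equation] = set()  # states before s_{i-1}; empty in this toy

# Symbols appearing in exactly one of the two known states guide the search.
guide = symbols_of(s_prev) ^ symbols_of(s_next)

L = set().union(*(symbols_of(eq) for eq in kb_equations)) | guide
R = kb_equations | history
K = L | R  # candidate NSAs: symbols and equation 2-tuples together

print(sorted(guide))
```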
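Given the knowledge base $K$, the set of actions $A$, and the initial and goal states $s_{i-1}, s_{i+1} \in S$ respectively, for the set of states $S$, we aim to generate a state $\hat{s}_i \in S$ such that the similarity measure satisfies $M(\hat{s}_i, s_i) \leq \epsilon$, where $M : S \times S \to \mathbb{R}_{\geq 0}$, and $\epsilon \in \mathbb{R}_{\geq 0}$ is a hyperparameter determining whether $\hat{s}_i$ and $s_i$ are considered both equal and similar. We choose $M$ to be based upon one of five similarity measures: Levenshtein, Damerau-Levenshtein, Hamming, Jaro, and Jaro-Winkler.
• Jaro similarity, $M(s_i, s_j) = 1 - \mathrm{sim}_J(s_i, s_j)$:
$$\mathrm{sim}_J(s_i, s_j) = \begin{cases} 0 & \text{if } m = 0, \\ \frac{1}{3}\left(\frac{m}{|s_i|} + \frac{m}{|s_j|} + \frac{m - t}{m}\right) & \text{otherwise,} \end{cases}$$
where $|s_i|$ is string length, $m$ is the number of matching characters (two identical characters are considered matching if they are $\lfloor \max(|s_i|, |s_j|)/2 \rfloor - 1$ or fewer characters apart), and $t$ is half of the number of transpositions.
• Jaro-Winkler similarity, $M(s_i, s_j) = 1 - \mathrm{sim}_{JW}(s_i, s_j)$:
$$\mathrm{sim}_{JW}(s_i, s_j) = \mathrm{sim}_J(s_i, s_j) + l p \left(1 - \mathrm{sim}_J(s_i, s_j)\right),$$
where $l$ is the length of a common prefix at the start of the string up to a maximum of 4 characters, and $p$ is a constant scaling factor (defined by the Jellyfish library default for our case).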
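For readers who want to see the Jaro-Winkler measure and the Jaro similarity it builds on spelled out, the following is a minimal pure-Python sketch. It is not the Jellyfish implementation used in the experiments; the scaling factor p = 0.1 is the conventional value and is an assumption here, and the function `measure` is a name introduced for this example.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1 means identical strings."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    # Characters match only if they are equal and within this many positions.
    window = max(max(n1, n2) // 2 - 1, 0)
    matched1, matched2 = [False] * n1, [False] * n2
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that appear out of order.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / n1 + m / n2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def measure(s1: str, s2: str) -> float:
    """Distance-like measure M = 1 - sim_JW, so that 0 means an exact match."""
    return 1 - jaro_winkler(s1, s2)


print(round(jaro("MARTHA", "MARHTA"), 4), round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Subtracting the similarity from 1 puts the Jaro-based measures on the same footing as the edit distances, where smaller values mean more similar strings.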
Taking $M$ as the Levenshtein measure, the distance between $s_N$ and $s_{N-1}$ from the derivation state sequence is $M(s_{N-1}, s_N) = 5$; therefore we require $\epsilon \geq 5$ to determine that the two states are similar within the Levenshtein metric. Lower bounds on $\epsilon$ that capture the final reconstruction for each of the similarity measures can be obtained via
$$\epsilon_{low} = M(s_{N-1}, s_N).$$

B. Heuristic Search
Following the equation reconstruction task description, while we know the form of $s_{i-1}$ and $s_{i+1}$ with $s_i$ unknown, we can form reconstruction candidates $\hat{c}_i$ by applying an action $a \in A$ to $s_{i-1}$ such that $a : (s_{i-1}, k) \to \hat{c}_i$, where $\hat{c}_i \in S$ and $k \in K$. We apply a second action to all possible $\hat{c}_i$ with the mapping $a' : (\hat{c}_i, k') \to \hat{c}_{i+1}$, where in general $k' \neq k$ and $a' \neq a$, to obtain $\hat{c}_{i+1}$ for $(\hat{c}_i, \hat{c}_{i+1}) \in S_{final}$.
At this stage multiple paths $\{s_{i-1}, \hat{c}_i, \hat{c}_{i+1}\}$ exist. By solving the following optimisation problem, we can determine $\hat{s}_{i+1}$:
$$\hat{s}_{i+1} = \underset{\hat{c}_{i+1}}{\arg\min}\; M(\hat{c}_{i+1}, s_{i+1}). \qquad (6)$$
We employ a heuristic $H(\hat{c}_i, s_{i+1}) : S \times S \to \mathbb{R}_{\geq 0}$ to order the paths, as the search may end early under a specific condition. If we find a state $\hat{c}_{i+1}$ such that $M(\hat{c}_{i+1}, s_{i+1}) = M(\hat{s}_{i+1}, s_{i+1}) = 0$, then the reconstruction candidate $\hat{c}_i$ from the path containing $\hat{c}_{i+1} = \hat{s}_{i+1}$ is selected as the reconstruction $\hat{s}_i$, which ends that iteration. Otherwise, all $\hat{c}_{i+1}$ are compared from each path, and the $\hat{c}_i$ of the path corresponding to the lowest $M(\hat{c}_{i+1}, s_{i+1})$ is taken as $\hat{s}_i$. Based on the Levenshtein metric, $H$ is defined as follows:
• $x(\hat{c}_i, s_{i+1}) = |l_s \setminus l_c|$, where $l_s$ and $l_c$ are the sets of symbols that constitute $s_{i+1}$ and $\hat{c}_i$ respectively. If $x$ is large, there are many symbols in $s_{i+1}$ that are not in $\hat{c}_i$, and the two are unlikely to be semantically linked.
• $y(\hat{c}_i, s_{i+1}) = M_L(c_{min}, s_{min})$, where $c_{min}$ and $s_{min}$ are the subexpressions in the trees of $\hat{c}_i$ and $s_{i+1}$ respectively corresponding to the lowest $M_L$, given that the number of characters in the string representations of $c_{min}$ and $s_{min}$ is less than or equal to 100. This represents a similarity calculation for small subexpressions. If no suitable comparisons exist then $y = M_L(\hat{c}_i, s_{i+1})$.
Given $r = (n_1 x(\hat{c}_i, s_{i+1}), n_2 y(\hat{c}_i, s_{i+1}), n_3 z(\hat{c}_i, s_{i+1}))$ for $n_1, n_2, n_3 \in \mathbb{Z}_{\geq 0}$, the heuristic is given by equation (7). Each path $\{s_{i-1}, \hat{c}_i, \hat{c}_{i+1}\}$ corresponds to a heuristic value, and the lowest-valued paths are searched first with the aim of finding a comparison $M(\hat{c}_{i+1}, s_{i+1}) = 0$ which terminates the search; otherwise all comparisons are made. The process is based on the notion that if $s_{i+1}$ can be obtained by applying two consecutive actions to $s_{i-1}$, then the state after one action may be $s_i$.
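The two-step search described above can be sketched as follows. This is a simplified toy, not the experimental implementation: states are bare strings, the two actions and the knowledge base are invented for illustration, only the symbol-count component x is used as the ordering heuristic, and the Levenshtein measure stands in for M.

```python
import re
from itertools import product
from typing import Callable, Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def missing_symbols(candidate: str, goal: str) -> int:
    """Heuristic component x: symbols of the goal that are absent from the candidate."""
    pattern = r"[A-Za-z_][A-Za-z_0-9]*"
    return len(set(re.findall(pattern, goal)) - set(re.findall(pattern, candidate)))


# Hypothetical toy actions with the (state, NSA) -> state signature.
def add_symbol(state: str, nsa: Optional[str]) -> str:
    return state if nsa is None else f"{state}+{nsa}"


def double(state: str, nsa: Optional[str]) -> str:
    return f"2*({state})" if nsa is None else state


def reconstruct(s_prev: str, s_next: str, actions: Sequence[Callable], kb: Sequence) -> str:
    """Pick the first-hop candidate whose second hop lands closest to s_next."""
    candidates = {action(s_prev, k) for action, k in product(actions, kb)}
    best, best_dist = None, None
    # Lowest heuristic value first, so an exact match can end the search early.
    for c_i in sorted(candidates, key=lambda c: (missing_symbols(c, s_next), c)):
        for action, k in product(actions, kb):
            dist = levenshtein(action(c_i, k), s_next)
            if dist == 0:
                return c_i
            if best_dist is None or dist < best_dist:
                best, best_dist = c_i, dist
    return best


actions = [add_symbol, double]
kb = [None, "x", "y"]
print(reconstruct("a", "2*(a+x)", actions, kb))
```

In this toy the candidate containing every symbol of the goal is visited first, and doubling it reproduces the goal exactly, so the search stops without comparing the remaining paths.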
Extending to the multi-hop case, for a number of intermediate states $N$, the number of derivation paths $\{s_{i-1}, \hat{c}_i, ..., \hat{c}_{i+N}\}$ grows as $(|A||K|)^{N+1}$. Then, each Levenshtein comparison $M_L(\hat{c}_{i+N-1}, s_{i+N})$ in the heuristic, and the final (e.g. Damerau-Levenshtein) comparison in the search, $M_{DL}(\hat{c}_{i+N}, s_{i+N})$, has $O(mn)$ time complexity for equation string lengths $m$ and $n$. However, this work targets step-wise, single-hop inference [26] as a unit of analysis, and end-to-end or multi-hop inference is currently outside our scope.
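To give a feel for this growth, the short calculation below evaluates the path count for a few values of $N$. The action count of 35 comes from the dataset description; the knowledge base size of 100 is an arbitrary assumption chosen only to make the numbers concrete.

```python
def path_count(num_actions: int, kb_size: int, n_intermediate: int) -> int:
    """Number of derivation paths, (|A||K|)^(N+1), for N intermediate states."""
    return (num_actions * kb_size) ** (n_intermediate + 1)


# |A| = 35 actions (from the dataset); |K| = 100 is a hypothetical KB size.
for n in (1, 2, 3):
    print(n, f"{path_count(35, 100, n):,}")
```

Under these assumptions the single-hop case already involves about 12 million paths, and each additional intermediate state multiplies that by a further 3500.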

VI. EMPIRICAL EVALUATION
A reconstruction $\hat{s}_i$ is classified as a success in the evaluation if $M(\hat{s}_i, s_i) \leq \epsilon$, where $\epsilon = \eta \epsilon_{low}$ for $\eta \in \mathbb{Z}_{\geq 0}$. The similarity value $\epsilon_{low} = M(s_{N-1}, s_N)$, for a state sequence of length $N$, is necessary in order to classify equations (3) and (4) as equal and similar after a second action, and represents a unit of difference scalable with $\eta$. It can be found directly by comparing the penultimate and final equations (states) in the derivation (state sequence).
The value $\eta = 0$ represents exact string matching, and $\eta = 1$ represents the minimum similarity required to reconstruct all equational states. At $\eta = 0$ false negatives may occur with no false positives and the accuracy is minimised, while for $\eta > 0$ the accuracy increases due to a decrease in false negatives and an increase in false positives. We evaluate for accuracy at values of $\eta \in \{0, 1\}$. Table 1 describes the approach accuracy from the equation reconstruction experiments across the similarity measures.
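The success criterion can be expressed in a few lines. In this sketch the list of distances is invented for illustration and does not correspond to any result in the tables; the value 5 for the unit of difference matches the Levenshtein example given earlier.

```python
from typing import Sequence


def accuracy(distances: Sequence[float], eps_low: float, eta: int) -> float:
    """Fraction of reconstructions with M(s_hat_i, s_i) <= eta * eps_low."""
    eps = eta * eps_low
    hits = sum(d <= eps for d in distances)
    return hits / len(distances)


# Hypothetical distances M(s_hat_i, s_i) for eight reconstructions.
distances = [0, 0, 3, 5, 12, 0, 7, 40]
eps_low = 5

print(accuracy(distances, eps_low, eta=0))  # exact matching only
print(accuracy(distances, eps_low, eta=1))  # one unit of difference allowed
```

Raising eta can only keep or increase the reported accuracy, which is why the additional successes at eta = 1 need to be read as a mixture of recovered false negatives and newly admitted false positives.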
From the Table 1 results, the approach based on the Damerau-Levenshtein metric outperforms the other string measures with respect to exact matches, and does not differ in accuracy upon introducing a unit of difference $\epsilon_{low}$. This suggests that Damerau-Levenshtein is less susceptible to the inclusion of false positives with increasing $\eta$. The Levenshtein-based measures both show low sensitivity to increasing $\eta$.
The Hamming measure approach results in the lowest exact matching accuracy, but the highest accuracy at unit difference $\epsilon_{low}$. This suggests that the measure is relatively more susceptible to false positives within our formalism, and less suitable for this task.

A. Categories of States and Actions
Actions may be categorised dependent upon their NSA as either self-state, symbol-state, or equation-state.
States are categorised by their role in the state sequence as either integrative, consequent, or terminal (see Section III). There are two actions per state reconstruction (see Section IV) but only one action is responsible for the reconstruction directly. We consider this action, and the corresponding state, in the following analysis. Table 2 describes the total number of cases for each category of state versus each category of action. Table 3 describes the accuracy corresponding to each case.
At η = 0, consequent states formed by self-state actions and equation-state actions, and all integrative states, are most accurately reconstructed by the Damerau-Levenshtein approach. The standard Levenshtein approach reconstructs consequent states with symbol-state actions most accurately. Even in the exact matching case at η = 0, integrative states are reconstructed with almost 90% accuracy, and all terminal states fail.
At $\eta = 1$ the Levenshtein-based approaches share the lowest performance, whereas the Hamming approach displays non-zero accuracy in terminal states, and over 0.9 accuracy in the consequent self-state and symbol-state categories. This suggests the Hamming approach quickly includes false positives with increasing $\eta$, as do the Jaro-based approaches, particularly within the equation-state class, compared to the Levenshtein-based approaches.

B. Qualitative Analysis and Approach Limitations
To provide additional insights into the limitations of our approach to solving the equation reconstruction task, we categorise known reconstruction failure types for the Damerau-Levenshtein variant.

Terminal State Problem.
All terminal state reconstructions will fail, a claim supported by Table 3. This arises because the state that follows a terminal state is integrative by definition, and integrative states are reconstructed by the specific action "consider knowledge base equation". Integrative states represent the beginning of new derivation branches, and arriving at an integrative state after two consecutive actions depends only upon the second action being "consider knowledge base equation". Therefore, any possible first action and subsequent reconstruction is suitable because of this condition on the second action. Terminal states represent 20.4% of the test set and around half of all reconstruction errors.
Multiple Pathways. Reconstructions are ultimately determined by the final similarity check $M(\hat{c}_{i+1}, s_{i+1})$ such that $\hat{c}_{i+1}$ satisfies equation (6). Many $\hat{c}_i$ may be associated with a unique $\hat{c}_{i+1}$, and the $\hat{c}_i$ with the lowest heuristic value is then selected.
Repeating equations. If $s_{i+1}$ is an equation that has already been derived in the state history (up to the LHS index) then the first action will be "consider knowledge base equation" with the NSA as the repeated equation, and a second action chosen to not transition the state (equivalent to "do nothing", Convention 5).
Action Limitations. Actions are made specifically to induce state transitions between known states, i.e. from the curated derivation passed through CAS. This means upscaling the data or considering other datasets is time-consuming, as the set of consecutive actions that form a separate derivation may be entirely different. This would mean for each new derivation at least one pass of the full derivation sequence through CAS must be made, and a new dataset must be constructed. Additionally, even within the same derivation, actions may not return mathematically correct states in all cases, as many are reliant on string manipulation and are not native to SymPy directly. There is considerable advantage in finding mathematically "robust" actions suitable for rigorous exploration of unknown states, but such work is outside our current scope.
Knowledge Base Selection. Premise selection is the task of selecting relevant mathematical statements aiming to maximise the probability of proving a given conjecture. The set of knowledge base equations has been hand-crafted in the same vein as the action set, and suffers similar limitations. Selecting suitable axiomatic equations or supporting statements capable of reconstructing derivations or proofs is a non-trivial task [28], and is outside our current scope.
The hyperparameters used for each experiment were $(\epsilon, n_1, n_2, n_3) = (\epsilon_{low}, 10, 10, 10)$, where $\{n_i\}$ refer to the hyperparameters associated with equation (7), the heuristic.

VII. CONCLUSION
In response to demand for informal mathematical datasets [11], we have created a method for producing datasets from physics derivations as a step towards multi-hop inference in physics. This method first involves a curated hand-written derivation, then a recreation of the derivation in a suitably expressive computer algebra system, under an expressive representation scheme. We categorise states and actions within this methodology, and employ it to create a dataset PhysAI-DS1 from a recent result in condensed matter [14]. We propose an equation reconstruction task which considers sets of three consecutive states within the derivation. The intermediate state is then considered unknown, and information from a knowledge base and the initial and final states may be used to mathematically infer the form of the unknown state by applying sequences of string-based CAS-reliant actions to the known state, and by utilising similarity measures to approximate equality. As an early solution to the equation reconstruction task, we formulate primitive similarity-based heuristics with a search algorithm capable of reconstructing equations within PhysAI-DS1 to 56.2% accuracy using the Damerau-Levenshtein metric, compared to baseline approaches using four distinct string similarity measures. We discuss limitations of the CAS-reliant similarity-based approach, and systematically characterise the performance of separate categories of action and state. Future work will focus on the physics explanation extraction process to account for different physical systems in a novel PhysAI-DS dataset. From this dataset, we aim to address multi-hop inference in physics via a neuro-symbolic approach designed to account for both the linguistic and mathematical features of physics argumentation and explanation.