Quiver Mutations, Seiberg Duality and Machine Learning

We initiate the study of applications of machine learning to Seiberg duality, focusing on the case of quiver gauge theories, a problem also of interest in mathematics in the context of cluster algebras. Within the general theme of Seiberg duality, we define and explore a variety of interesting questions, broadly divided into the binary determination of whether a pair of theories picked from a series of duality classes are dual to each other, and the multi-class determination of the duality class to which a given theory belongs. We study how the performance of machine learning depends on several variables, including the number of classes and the mutation type (finite or infinite). In addition, we evaluate the relative advantages of Naive Bayes classifiers versus Convolutional Neural Networks. Finally, we also investigate how the results are affected by the inclusion of additional data, such as ranks of gauge/flavor groups and certain variables motivated by the existence of underlying Diophantine equations. In all questions considered, high accuracy and confidence can be achieved.


Introduction
Seiberg duality [1] for supersymmetric quantum field theories is one of the most fundamental concepts in modern physics, generalizing the classical electro-magnetic duality of the Maxwell equations. In parallel, cluster algebras [2,3] have become a widely pursued topic in modern mathematics, interlacing structures from geometry, combinatorics and number theory. These seemingly unrelated subjects were brought together in [4][5][6] in the context of quiver gauge theories realized as world-volume theories of D-branes probing Calabi-Yau singularities. Interestingly, the common theme, quiver Seiberg duality in physics and mutation of cluster algebras in mathematics, emerged almost simultaneously around 1995, completely unbeknownst to the authors on each side; it was not until almost a decade later that a proper dialogue was initiated. Meanwhile, [7][8][9][10][11] placed the study of quiver gauge theories and toric Calabi-Yau spaces on a firm footing via brane tilings, or dimer models, which are bipartite tilings of the torus. In the mathematics community, cluster algebras have taken on a life of their own [12]. Seiberg duality for quiver gauge theories and cluster mutations for quivers have thus formed a fruitful alliance. Continued and often surprising interactions between the physics and mathematics have persisted, ranging from QFT amplitudes [13,14], to quantization [15], to dualities [16].
Given the highly combinatorial nature of quivers and cluster algebras, it is natural to ask whether the machine learning program could be applied to this context. Specifically, one could wonder where in the hierarchy of difficulty, from the most amenable numerical analysis to the most resilient number theory, quivers and mutations would reside. This is the motivation of our current work. The paper is organized as follows. After a rapid parallel introduction to Seiberg duality in quiver gauge theories and cluster mutation, from the physics and mathematics points of view in Sections §2.1 and §2.2, we proceed in Sections §3 to §5 to study a host of pertinent problems, which we summarize shortly. We conclude in Section §6, and present some details of the neural networks, and their performance over training, in the appendices.

Summary of Results
To provide the readers with an idea of the machine learning performance at a glance, we provide here: a brief description of the problem-styles addressed in this paper; a list of the quivers used to generate the mutation classes examined in the investigations; and a table summarizing the investigations' key results.
Data Format The datasets used in these investigations represent each quiver under consideration by its graph-theoretic adjacency matrix (in some investigations with an additional vector structure appended). Each investigation has its own dataset of quivers, generated using the Sage software [43], such that each full dataset is the union of mutually exclusive sets of quiver matrices, where all quivers in each set belong to the same duality class.
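As an illustration of how such a dataset can be generated, here is a minimal sketch assuming Sage's ClusterQuiver interface; the helper function is our own and not the paper's code:

from sage.all import ClusterQuiver

def mutation_class_matrices(quiver_type, max_depth):
    """Collect the adjacency (exchange) matrices of all quivers reachable
    from the seed quiver by at most max_depth mutations (breadth-first)."""
    seed = ClusterQuiver(quiver_type)      # e.g. ClusterQuiver(['A', 4])
    n = seed.n()
    seen = {tuple(seed.b_matrix().list()): seed.b_matrix()}
    frontier = [seed.b_matrix()]
    for _ in range(max_depth):
        next_frontier = []
        for B in frontier:
            for k in range(n):             # dualize each of the n nodes
                Q = ClusterQuiver(B)
                Q.mutate(k)
                key = tuple(Q.b_matrix().list())
                if key not in seen:
                    seen[key] = Q.b_matrix()
                    next_frontier.append(Q.b_matrix())
        frontier = next_frontier
    return list(seen.values())

The loop index plays the role of the mutation "depth" used throughout the paper.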
Two styles of classification problem are addressed in this paper, and each processes the input quiver data in a different format. The first is binary classification on pairwise data inputs. Here each data input is a pair of matrices, and each pair can be classified as having its two constituent quivers in the same class, or not in the same class. On these problems the Naive Bayes (NB) classification method, as described in appendix A.2, performed best and was hence used. The second problem style is multiclassification directly on the matrices. Here each data input is a matrix, and the matrix is classified into one of the duality classes the classifier is trained on. On these problems Convolutional Neural Networks (NN), as described in appendix A.3, performed best and were hence used.
Within each investigation 5-fold cross validation was used to produce a statistical dataset of measures for the analysis of the classifier's performance. In 5-fold cross validation, 5 independent classifiers are each trained on 80% of the data, and validated on the remaining 20%, such that the union of the validation sets gives the full dataset for the investigation. Measures of the classifiers' performance are calculated for each classifier and averaged. In addition, the investigations were also run for varying training/validation % splits, with results plotted as 'learning curves', shown in appendix B.
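For concreteness, a minimal sketch of the 5-fold protocol, written here with scikit-learn's KFold purely for illustration (the paper's own implementation may differ):

import numpy as np
from sklearn.model_selection import KFold

def five_fold_cv(X, y, make_classifier, metric):
    """5-fold cross validation: 5 classifiers, each trained on 80% of the
    data and validated on the complementary 20%; scores are averaged."""
    scores = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True).split(X):
        clf = make_classifier()
        clf.fit(X[train_idx], y[train_idx])
        scores.append(metric(y[val_idx], clf.predict(X[val_idx])))
    return np.mean(scores), np.std(scores)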
Quivers considered Here we list the quivers used to generate the duality classes making up the datasets of the investigations considered in this paper. They are listed with an adjacency matrix representation and are labelled in the form Qi. Different combinations of these quivers (with further Dynkin type examples) were used in each investigation, as listed in the following table.
The first 3 quivers, Q1, Q2, Q3, as well as Q12, Q13, Q15, are of finite mutation type under the duality, whilst the remaining quivers listed here are of infinite mutation type. Additionally, other Dynkin and finite mutation types were used in the investigations, labelled in the standard Sage quiver package format [44]. These additional quivers were either of Dynkin type of various sizes, labelled by the letter and rank of the Dynkin diagram they are equivalent to (with directions added to the edges); or of affine type, corresponding to affine Dynkin diagrams and labelled using Kac's notation with Dynkin letter, rank, and an optional twist. In the case of affine A, the rank is given by a pair of integers counting the clockwise/anticlockwise edges respectively. The specific affine quiver used to generate a mutation class in an investigation is the choice auto-generated by the Sage package for the input label information. Finally, 'T' type quivers are so named for being shaped like the letter 'T'; their three integer entries give the number of nodes in each of the branches from the branch point (inclusive). (For instance, ['T',[3,3,3]] is also of type affine E6.) These quivers are described further as they are introduced with each investigation.

Investigation Results
Here we tabulate each investigation with a brief description, a list of the quivers used to generate the duality classes in the dataset, and the measures of learning performance. The measures of performance (as described in appendix A.4) are presented as a pair (acc, φ), consisting of the accuracy of agreement, acc, and Matthews' correlation coefficient, φ, where calculated. Both evaluate to 1 for perfect learning; results are shown to 2 decimal places. Dynkin and T type quivers are denoted using the Sage quiver package convention; other infinite mutation type quivers are denoted using the labels assigned in the preceding 'Quivers considered' list.
NB classifier results showed perfect classification between 2 mutation type classes. Classifying classes of different quiver sizes was trivial and, as expected, did not reduce performance. Where classification was between more than 2 classes, the performance was lower but still very good. Enhancing the datasets with rank information, or Diophantine-inspired variables, did not improve NB classification.
NNs required rank information in their dataset to classify well, but with this included NNs outperformed the NB classifier, particularly when classifying quivers at unseen mutation depths, and when classifying against random antisymmetric matrices.
We should also mention that we use the word "depth" throughout the paper. Starting with a quiver (at depth 0) having n nodes, we have n choices of node to dualize; these newly generated quivers are said to be at depth 1. We can then apply mutations to these depth-one quivers again by choosing one node to dualize. The quivers so obtained are at depth 2 (excluding the depth-0 quiver we started with, i.e., the result of dualizing the same node twice). Hence, when we say a quiver is at depth k, the (shortest) distance from this quiver to our starting quiver under mutations is k.

Seiberg Duality

In this section we review Seiberg duality, which is an IR equivalence between 4d N = 1 gauge theories [1]. We will phrase our discussion in the language of quivers, since all the theories considered in this paper are of this type.
Let us consider dualizing a node j in the quiver, which does not have adjoint chiral fields. 1 The transformation of the gauge theory can be summarized in terms of the following rules:

1. Flavors. In physics, the arrows connected to the mutated node are usually referred to as flavors. The flavors transform by simply reversing their orientation, namely:

1.a) Replace every incoming arrow $i \to j$ with the outgoing arrow $j \to i$. Calling $X_{ij}$ the incoming arrow, we replace it by the dual flavor $\widetilde{X}_{ji}$.

1.b) Replace every outgoing arrow $j \to k$ with the incoming arrow $k \to j$. Calling $X_{jk}$ the outgoing arrow, we replace it by the dual flavor $\widetilde{X}_{kj}$. This is the quiver implementation of the fact that the magnetic flavors are in the complex conjugate representations, of both the dualized gauge group and the spectator nodes, of the original flavors. 2 This transformation is shown in Figure 2.1.
1 Generalizations of Seiberg duality to gauge groups with adjoints are known, under certain conditions (see e.g. []). 2 In our discussion, including the points that follow, we allow for the possibility of chiral fields connecting pairs of nodes in both directions.
2. Mesons. Next we add mesons, i.e. composite arrows, to the quiver as follows. For every 2-path $i \to j \to k$ we add a new arrow $i \to k$. This meson $M_{ik}$ can be regarded as the composition of the flavors $i \to j$ and $j \to k$ of the original theory, namely $M_{ik} = X_{ij} X_{jk}$. In other words, we generate all possible composite arrows consisting of incoming and outgoing chiral fields. Figure 2.1 also illustrates the addition of a meson.
3. Ranks. The rank of the dualized node transforms as

$$N_j \to N_{f_j} - N_j \,,$$

where $N_{f_j}$ is the number of flavors at the dualized node j. Later we will consider generic quivers, which are not necessarily anomaly free. These quivers are interesting from a mathematical point of view and, in such cases, we will not consider the ranks of the nodes. Ranks will only be taken into account for anomaly free quivers, i.e. theories for which the gauge (and hence dualizable) nodes have an equal number of incoming and outgoing arrows. In these cases $N_{f_j}$, more explicitly, is given by

$$N_{f_j} = \sum_i a_{ij} N_i \,,$$

with $a_{ij}$ the (positive) number of bifundamental arrows going from node i into node j.

4. Superpotential. The superpotential transforms as follows:

4.a) In the original superpotential, we replace instances of $X_{ij} X_{jk}$ with the meson $M_{ik}$ obtained by composing the two arrows.

4.b) Cubic dual flavor-meson couplings. For every meson, we add a new cubic term to the superpotential, coupling it to the corresponding magnetic flavors; namely, we add the term $M_{ik} \widetilde{X}_{kj} \widetilde{X}_{ji}$.
5. Massive fields. If there are fields that acquire mass in this process, we can integrate them out using their equations of motion.
All the rules discussed above, with the exception of the one for the ranks, are the same as those used for cluster algebras. Cluster algebras also come equipped with a set of generators known as cluster variables.
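In terms of the antisymmetric adjacency matrix $B$, rules 1 and 2 (with the massive 2-cycles of rule 5 integrated out) are captured by the standard matrix mutation formula $b'_{ij} = -b_{ij}$ if $i = k$ or $j = k$, and $b'_{ij} = b_{ij} + \frac{1}{2}(|b_{ik}| b_{kj} + b_{ik} |b_{kj}|)$ otherwise. A minimal sketch in Python (our own helper, not the paper's code):

import numpy as np

def mutate(B, k):
    """Mutate the antisymmetric integer adjacency matrix B at node k:
    rules 1 and 2 above, with massive 2-cycles cancelled (rule 5)."""
    B = np.asarray(B, dtype=int)
    Bp = B.copy()
    n = B.shape[0]
    for i in range(n):
        for j in range(n):
            if i == k or j == k:
                Bp[i, j] = -B[i, j]   # rule 1: reverse all arrows at k
            else:
                # rule 2 plus cancellation; the combination is always even
                Bp[i, j] = B[i, j] + (abs(B[i, k]) * B[k, j]
                                      + B[i, k] * abs(B[k, j])) // 2
    return Bp

For instance, mutating twice at the same node returns the original matrix, mirroring the involutive nature of Seiberg duality.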

Mutation of Cluster Algebras
Mathematically speaking, an algebra is a structure that behaves like a vector space with the additional feature that elements can be multiplied together. An algebra can be presented by generators (think of basis vectors) and relations, i.e. algebraic dependencies generalizing the linear dependencies of a vector space. A rank n cluster algebra is a subalgebra of the field of rational functions in n variables whose generators can be grouped into algebraically independent sets known as clusters, all of size n, such that certain exchange relations allow one to transition from one cluster to another [2]. These exchange relations, known as cluster mutation, can be described using the language of quivers, echoing the description of Seiberg duality in physics.

Cluster Variables. Given an initial cluster {x_1, x_2, . . . , x_n}, we allow cluster mutations in n directions, each of the form

$$x_j \, x_j' = \prod_{i \to j} x_i \, + \, \prod_{j \to k} x_k \,, \qquad 1 \le j \le n \,,$$

where the products are over all incoming arrows and outgoing arrows at node j, respectively (an empty product is 1). We thus get a new generator, i.e. cluster variable, $x_j'$, yielding the cluster {x_1, x_2, . . . , x_{j-1}, x_j', x_{j+1}, . . . , x_n}. The process of cluster mutation may be continued, but to mutate using this new cluster as a reference we use the quiver µ_j Q in place of Q, where µ_j Q is the quiver obtained by applying the rules of Seiberg duality at node j.
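As a simple worked example (our own, for the A2 quiver $1 \to 2$ with initial cluster $\{x_1, x_2\}$), alternating mutations give

$$x_1' = \frac{1+x_2}{x_1}\,, \quad x_2' = \frac{1+x_1'}{x_2} = \frac{1+x_1+x_2}{x_1 x_2}\,, \quad x_1'' = \frac{1+x_2'}{x_1'} = \frac{1+x_1}{x_2}\,, \quad x_2'' = \frac{1+x_1''}{x_2'} = x_1 \,,$$

and the next mutation returns $x_2$: only five cluster variables arise in total, with the initial cluster recovered after five mutations (the famous pentagon periodicity).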
Given a quiver Q, we construct the associated cluster algebra A Q by applying cluster mutation in all directions and iterating to obtain the full list of cluster variables, i.e. generators of A Q . Generically, this process yields an infinite number of generators for the cluster algebra, as well as an infinite number of different quivers along the way. However, in special cases, a cluster algebra, and its defining quiver, have a specified mutation type.
We refer to a cluster algebra, or its associated quiver, as being of finite type if it has a finite number of generators, i.e. cluster variables, constructed by the cluster mutation process 3 . As proven by Fomin and Zelevinsky [3], the list of cluster algebras of finite type agrees exactly with Gabriel's ADE classification 4 of quivers admitting only finitely many indecomposable representations [50], or those equivalent to them via quiver mutation, i.e. Seiberg duality.
Another important family of cluster algebras are those of finite mutation type. Such cluster algebras are those with only a finite number of quivers reachable via mutation, i.e. Seiberg duality. This class of cluster algebras completely encompasses the subclass of cluster algebras of finite type. In totality, this class contains all rank 2 cluster algebras, like the aforementioned cluster algebra associated to SU(2), cluster algebras of surface type, and eleven exceptional types (E6, E7, E8, affine E6, E7, E8, elliptic E6, E7, E8, and two additional quivers known as X6 and X7) [51,52]. Such finite mutation type quivers have also been studied previously in the physics literature, where they were referred to as complete quantum field theories [53].
Cluster algebras of surface types, i.e. associated to orientable Riemann surfaces, were first described by Fomin, Shapiro, and Thurston [54]. Generically, the quiver associated to a triangulation of a Riemann surface is obtained by taking the medial graph where nodes of the quiver correspond to non-boundary arcs of the triangulation and we draw an arrow of the quiver between nodes i and j for every triangular face where arcs associated to i and j meet at a vertex and j follows i in clockwise order. Mutating at a node corresponds to flipping between the two possible diagonals for triangulating a quadrilateral. Since such triangulations live on an orientable Riemann surface, any associated quiver has at most two arrows between any given pair of nodes, thus demonstrating that such cluster algebras admit only finitely many quivers and are hence of finite mutation type. The eleven exceptional cases of Felikson, Shapiro and Tumarkin do not have a surface model but at least the finite and affine type E quivers are well-known from previous representation theory, e.g. Gabriel's ADE classification, and Kac's extension to affine quivers [55].
In this paper we will focus on the transformation of the quiver (rules 1 and 2) and in some cases include information on the ranks (rule 3); we will deal with neither rule 4 nor rule 5. Even with this restriction, we will manage to obtain non-trivial results. Having said that, the superpotential is a crucial element of the duality, as is the mutation of cluster variables in the context of cluster algebras. We plan to incorporate both in future studies.

Recognizing Mutations
There are various ways to construct the dataset. We can directly assign each mutation class a different label; the machine is then asked to perform a multiclass classification. Alternatively, we can build datasets consisting of matrix pairs, so that every {input→output} has the form

$$\{(M_1, M_2)\} \to 1 \text{ or } 0 \,,$$

where 1 indicates that $M_1$ and $M_2$ are in the same class while 0 indicates that they are not. Let us first start with the latter, using the Mathematica built-in function Classify.
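For illustration, a minimal sketch of this pairing in Python (our own helper; the paper then feeds such pairs to Mathematica's Classify):

import itertools
import random

def make_pair_dataset(classes, seed=0):
    """Build {(M1, M2) -> 1 or 0} data from `classes`, a list of lists of
    adjacency matrices, one inner list per duality class."""
    rng = random.Random(seed)
    ones = [((A, B), 1)
            for mats in classes
            for A, B in itertools.combinations(mats, 2)]
    zeros = [((A, B), 0)
             for c1, c2 in itertools.combinations(classes, 2)
             for A in c1 for B in c2]
    # keep the dataset balanced (cf. the discussion in section 3.4)
    k = min(len(ones), len(zeros))
    return rng.sample(ones, k) + rng.sample(zeros, k)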

Classifying Two Mutation Classes
As the simplest example, let us machine learn only two different classes, ['A',4] and ['D',4] 5 , shown with their adjacency matrices as Q1 and Q2, for the cases n = 4, in the Quivers list of §1.2.
Notice that these matrices/quivers are of finite mutation type, i.e., the duality trees are closed. Many (but not all) quivers of finite mutation type 6 contain sources and sinks, and are hence anomalous. Albeit not physically meaningful, these quivers are still interesting from pure mathematics and machine learning viewpoints. Furthermore, we can compare these results with those from infinite mutation types.
The result 7 of 5-fold cross validation is tabulated in Table 3.1. We also plot the learning curves at different training percentages in Fig. B.1. We can see that the machine gives 100% accuracy most of the time, which is very inspiring. Before we add more mutation classes to our data, we are also curious about how the machine would behave when it is asked to predict unseen classes. In the above two-class example, the validation set V is the complement of the training set T. Now we instead train on pairs drawn from ['A',6] and ['D',6], and validate on pairs involving the unseen class ['E',6] 8 , as sketched in Fig. 3.1. (For Fig. 3.1: we generate 517, 572 and 600 matrices respectively, choosing training data out of 14182 pairs and validation data out of 13897 pairs; data with indeterminate φ, which appeared 7 times in all (training and validating ten times at each training percentage), is not plotted; the method is chosen by the machine.) We find that the overall result is not very satisfying, and the Matthews φ can occasionally be indeterminate. From the confusion matrices, we see that there is always a zero entry. This zero always appears at FP or TN, i.e., only 1's or only 0's are predicted when the actual values are 0 in each single training. It is reasonable to see such a result, as the machine has met some unseen mutation classes. This also shows that the machine is certainly not learning mutations (at least not the whole knowledge thereof) when the dataset only contains two different mutation classes 9 .

5 Henceforth, we will use the same notation as in Sage [43,44] for known quiver mutation types, and we will not specify the matrices and quivers. 6 To be clear, we should point out that finite mutation types and finite types refer to different concepts. In the sense of [3,51], a finite mutation type indicates that there are finitely many dual quivers generated from our starting quiver, while a finite type is the namesake of a Dynkin type. Sometimes we will use the term "finite classes"; this is the same as "finite mutation types". However, note that the word "class" is slightly different from "mutation type" in our context: each class refers to one duality tree. For instance, ['A',4] and ['A',5] are not in the same class, as they are certainly not duals, but they are both of finite mutation type. 7 The metrics used to evaluate the machine's performance (accuracy, F-score, and MCC φ) are defined in appendix A.4. 8 Unlike 1's, the 0's always have a matrix from trained classes. However, as we will further discuss in §3.2 and Appendix A.2 when finding the optimal method of the classifier, assigning 1 or 0 to a given pair is solely determined by the two matrices in this pair; any other matrices, no matter whether they are related by mutations to the matrices in this pair, are irrelevant. In this sense, the ['E',6]/['A',6] and the ['E',6]/['D',6] pairs are always unseen classes. 9 One may wonder whether the dimensions of matrices would affect our result, but in fact this is not a main influence. We will further study this when we include more different mutation classes in our training.

Fixing the Method
The Classify function in Mathematica has an option where one can specify the method used in the classifier. So far, this value has been left at its default in our experiments, and the method is chosen automatically by the machine. However, it is worth finding which method gives better predictions. It turns out that Naive Bayes (NB) is the method we should choose. When studying the ADE Dynkin type quivers with 6 nodes above, we find that at each training percentage, the relatively higher accuracy is obtained only when the machine chooses NB. Hence, we perform this experiment with the same dataset again, but this time we fix our method to NB. The learning curves, reported in appendix B, now behave like those of usual learning curves. Moreover, although there is still always a zero entry in the confusion matrix, the Matthews φ is never indeterminate anymore. In contrast, we can try what happens if we fixate on other methods. As an example, the result of only using Random Forest with the same dataset at 80% training percentage is reported in Table 3.2; it is obviously inferior to the result using NB. Henceforth, unless specified, we will always apply NB in the Classify function in future experiments 10 . Now, we would like to understand why NB always yields such good results. In Appendix A.2, we give the mathematical background of NB; the main reason is that the mutual independence of matrix pairs coincides with the basic assumption of NB.

Two Classes Revisited
To some extent, machine learning finite mutation classes is not that necessary in applications, simply because we can traverse all the matrices. Let us therefore try another example containing two infinite mutation classes. The first one is the theory living on D3-branes probing F0, the 0th Hirzebruch surface, which is isomorphic to P1 × P1, as depicted in Q4 [56,57]. The second one is generated by the quiver and adjacency matrix given in Q5, which is also anomaly free. The learning result of 5-fold cross validation is tabulated in Table 3.3. We also plot the learning curves in appendix B; we see that the learning curve now looks smoother and more beautiful when we use NB. Now that infinite mutation types generate infinitely many quivers under the Seiberg duality mutation, we can do something that is not possible for finite mutations. In the training dataset T, we include the matrices generated to some depth (equal to the number of mutations from the original quiver) in the duality tree, while the validation dataset V consists of matrices generated at depths far away from those in T. We still start with the above two matrices, and generate (102+138) matrices. From these matrices, we create 6933 1's and 6358 0's. The 1's and 0's of T are then evenly chosen out of these 13291 pairs, while V is built from matrices at larger, unseen depths. Given that the machine is validated on pairs at unseen depths, the result, tabulated in Table 3.4, is not very surprising. As a matter of fact, the confusion matrices always have a vanishing TP (actual=predicted=1) and an extremely small FP (actual=0, predicted=1). This shows that the machine tends to regard the pairs from unseen depths as unrelated theories.
We have now seen that the machine does a good job in validation, but does not perform well when meeting unseen depths far away. It is then natural to ask, given matrices both of depth 0 to depth n1 and of depth n2 to depth n3 (n3 > n2 > n1 > 0), whether the machine can interpolate the matrices of depths between n1 and n2. We still contemplate the above case with two different mutation classes (Q4 and Q5), but this time we take n1 = 3, n2 = 6 and n3 = 8 for both classes (hence, we are validating matrices of depths 4 and 5) 11 . The learning result at 90% training percentage is listed in Table 3.5. We can see that the result is better than the one in Table 3.4. This is natural to expect, since many more matrices are trained (or, more precisely, the ratio of seen to unseen matrices is much larger).
On the other hand, we should also expect that the result still has much room for improvement, given the feature of NB. We now make a proposal using the assumption of NB. As discussed in §3.2 and Appendix A.2, whether a pair of matrices are related to each other by mutations is independent of other matrices. This condition certainly applies here. 11 In our training set, we also include pairs of 1's from depths 0-3 and depths 6-8. Likewise, in our validation set, we also include pairs of 1's from depths 4-5 and depths 0-3/6-8. The same holds for the 0's as well.
We can actually visualize the duality trees of quivers; examples can be found in Figs. 2 and 7 of [57]. Since a mutation can act on every single node of a quiver, an n-node quiver is directly connected to n other dual quivers. This is true for any quiver in any mutation class. Furthermore, the duality tree of an infinite mutation class is infinite. Thus, it does not matter which quiver we choose to start with, due to the symmetry of the duality tree 12 . Now, from the example of Table 3.4, we know that the machine is poor at predicting matrices of depths from (n1+1) to n2 when only matrices within depth n1 are trained. This can be illustrated as in Fig. 3.3(a). Likewise, for the example of Table 3.5, we have Fig. 3.3(b). Then we can have a green disk of trained matrices, centered at each point (up to the azimuth) in the blue annulus, tangent to the two boundaries of the blue annulus, as shown in Fig. 3.4(a). We can use such a trained green disk/dataset to predict the matrices inside the white annulus bounded by the green disk and the disk of radius n4. By the same reasoning, the machine would give poor predictions for those matrices. Notice that the disks of radii n2 and n4 have a leaf-shaped overlap, which means that, given the small blue disk and the green disk as the training set, this leaf would not enjoy a good prediction. If we drag the green disk along the blue annulus, then those green disks, along with the blue disk in the middle, become the same training set as in Fig. 3.3(b). The leaf-shaped overlaps form the white annulus in the middle, bounded by the blue disk and the blue annulus, which is the unseen dataset as in Fig. 3.3(b). Since the machine cannot learn well in the leaf shapes, although the training set is larger (compared to Fig. 3.3(a)), which may improve the result, as a consequence of the mutual independence assumption the performance of the machine would still not be greatly improved. Nevertheless, we should emphasize that this is mainly due to the particular feature of NB. As we will see in §3.5, the situation for Neural Networks (NN) is quite different. 12 For a finite mutation class, this is also true, as the duality tree will eventually close and be symmetric.

Classifying More Mutation Classes
We now contemplate datasets containing more mutation classes. It is natural to first consider the case with three mutation classes. We again use ['A',6], ['D',6] and ['E',6] as an example. Of course, unlike the aforementioned case, all three classes have to appear in the training dataset this time. The learning result of 5-fold cross validation is reported in Table 3.6.

Accuracy: 0.90291800±0.00920160; F-score: 0.90936100±0.00886124; φ: 0.81580000±0.01625320. The learning curves at different training percentages are given in Fig. B.3. We can see that the performance, albeit not as perfect as the cases with two classes, is still very satisfying, with ∼90% accuracies and ∼0.8 Matthews correlation coefficients when only ∼60% of the data is trained.
We can also add one more class to the two-class example of Q4 and Q5. The new one is generated by Q6. The learning results are reported in Table 3.7 for 5-fold cross validation and Fig. B.4 for learning curves. The performance (accuracy 0.90553300±0.00970378, F-score 0.91187400±0.00831757, φ 0.82051800±0.01696320) is still very good, though not as perfect as the two-class example.
Let us now contemplate examples with four and five mutation classes. To compare this with the three-class example above, we first choose Q4, Q5, and Q6 for our data. For the four-class example, the remaining quiver is depicted in Q7.
The learning results are reported in Table 3.8. For the five-class example, we further include Q8; the learning results are reported in Table 3.9. We can see that the number of different classes affects the performance of the machine.
Nevertheless, a better learning result is always wanted. When we have more classes, a combinatorial problem arises: if there are more mutation classes in the data, there will be more and more distinct pairs of 0's than pairs of 1's. If we want adequate combinations of 0's, then to keep the dataset well-balanced, correspondingly many 1's are required as well; however, even after all the distinct pairs of 1's are included, they may still not match the number of possible 0's. On the other hand, if we keep adding pairs to our dataset, although we will cover more combinations of 0's, there will be duplicated pairs of 1's; these repeated pairs are not helpful, and the dataset becomes biased. Thus, how the number of mutation classes is (quantitatively) related to the number of matrices generated and the number of pairs assigned is a newly raised question. Roughly speaking, the best way is perhaps to include all the 1's and correspondingly many 0's; then the number of distinct pairs is maximized while keeping the dataset balanced. Another possible way to resolve this is to use multiclassification, with a single matrix as a data point instead of matrix pairs, so that the combinatorial problem is avoided. Let us now contemplate such multiclassifications.
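Before moving on, the counting behind this imbalance can be made explicit (our own remark): for k classes containing $n_1, \ldots, n_k$ matrices,

$$\#\{1\text{'s}\} = \sum_{i=1}^{k} \binom{n_i}{2} \,, \qquad \#\{0\text{'s}\} = \sum_{1 \le i < j \le k} n_i \, n_j \,,$$

so for k classes of a common size n, the ratio of available 0's to 1's is $(k-1)n/(n-1) \approx k-1$, growing linearly with the number of classes.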

Multiclass Classifications
For datasets consisting of matrix pairs, we have already seen that NB is the best method for learning mutations. To make this more convincing and clearer, we also plot the learning curves with different methods in Fig. 3.5 as an example 13 . We also tabulate the 5-fold cross validation for NN in Table 3.10. We should emphasize that the NN used here in Mathematica is different from the (C)NN we will use below for multiclassifications: the NN in Mathematica Classify is used on matrix pairs, while the NN in Python deals with a single matrix as one datapoint in the dataset 14 . Unless specified, when saying NN below we will always refer to multiclassifications in Python.
Besides pairing matrices and assigning 1's and 0's, there is a more direct way to classify theories in distinct duality trees, as mentioned above. We can simply assign different mutation classes different labels, and then let the machine tell which class a given quiver belongs to. So far, we have been using Mathematica and its built-in function to do the machine learning. One can still use Classify and NB to do the training, but it turns out that NB (and the Mathematica classifier) is only good when the data is a set of pairs. Thus, we turn to Python to perform machine learning on mutations, with the help of Sage [43] and TensorFlow [58]. Henceforth, when we say that the method is NB (or NN), we simultaneously mean that the type of the dataset used is the one suitable for this method. This time, we choose three classes generated by Q12, Q13, Q14. Quiver Q12 is defined through a triangulation of a 10-gon; this process is shown in Fig. 3.6. 13 At first, we wanted to try many more matrices and much larger datasets. However, a normal laptop is not capable of producing the whole learning curve of SVM. Nevertheless, this example with a smaller size can still tell the difference between the various methods. Here, although Random Forest is still inferior to NB, the discrepancy is small; however, one can check that if we include more matrices and more data, the advantage of NB over other methods becomes greater. 14 We should mention that Mathematica now also incorporates complicated neural networks, though we are using Python here for CNNs to make a clearer distinction between binary and multi classifications in our discussions.
According to the theorem of Felikson, Shapiro and Tumarkin [51], the first two classes are finite while the third one is infinite. Now we label the three classes with [1,0,0], [0,1,0] and [0,0,1] respectively. Thus, when the machine predicts [a1,a2,a3], it is giving the probabilities that the matrix being predicted belongs to each of the three classes. For instance, if the output is [0.9,0.06,0.04], then the machine classifies the matrix into the first class.
We use Convolutional Neural Networks (CNNs) to deal with the dataset, which contains (1547+1956+1828) matrices. We find that there is only ∼55% accuracy when 80% of the data is trained. However, it is quite remarkable that for the last class, which is the only infinite one, the machine has 100% accuracy, i.e., it always correctly recognizes the matrices in this class and never misclassifies other matrices into this class. Hence, the machine seems to have learnt something related to finite and infinite mutations. We will explore this in §5.4.
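A minimal sketch of this style of multiclassifier in TensorFlow/Keras (layer sizes and the 7-node input shape are our own illustrative assumptions; the actual architecture is described in appendix A.3):

import tensorflow as tf

def build_cnn(n_nodes, n_classes):
    """Minimal CNN multiclassifier: input is an n x n adjacency matrix,
    output is a probability vector [a_1, ..., a_k] from a softmax."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               input_shape=(n_nodes, n_nodes, 1)),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

# e.g. three classes of 7-node quivers (our assumption for Q12-Q14)
model = build_cnn(n_nodes=7, n_classes=3)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])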

Classifying against Random Antisymmetric Matrices
There is also another possible way to have a machine learning model on quiver mutations. If we are given some quiver and a class of dual theories, we may wonder whether this quiver also belongs to the duals. Therefore, we can train the machine using a specific class of matrices, along with some randomly generated antisymmetric matrices.
So as not simply to learn anomalies, when dealing with anomaly-free quivers we should mainly use random matrices that are anomaly free as well. For simplicity, let us contemplate 3×3 matrices. As a non-zero antisymmetric 3×3 matrix always has nullity exactly 1, it is easier to generate matrices that are anomaly free 15 . We first test the dP0 theory, viz, the class generated by Q9, against correspondingly many random antisymmetric matrices. We generate matrices up to depth 7, giving (382+388) matrices for training and validation. The learning curves are plotted in Fig. B.7.
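A minimal sketch of one way to draw such matrices (our own construction): writing $a = B_{12}$, $b = B_{13}$, $c = B_{23}$, the kernel of a non-zero antisymmetric $3 \times 3$ matrix is spanned by $(c, -b, a)$, so anomaly freedom with positive ranks requires these kernel entries to share a sign.

import random
import numpy as np

def random_anomaly_free_3x3(max_arrows=5):
    """Random non-zero antisymmetric 3x3 integer matrix whose kernel
    contains a strictly positive vector (a candidate rank vector)."""
    while True:
        a, b, c = (random.randint(-max_arrows, max_arrows) for _ in range(3))
        B = np.array([[0, a, b], [-a, 0, c], [-b, -c, 0]])
        v = np.array([c, -b, a])                 # spans ker(B) when B != 0
        if B.any() and ((v > 0).all() or (v < 0).all()):
            return B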
As we can see, the result is pretty good, with ∼90% accuracy when only ∼60% of the data is trained. If we use this model to predict unseen matrices, i.e. the 384 matrices at depth 8 plus 377 random matrices, the prediction can still reach ∼97% accuracy. The accuracy for the matrices in the dP0 duals is ∼93% while the accuracy for random matrices is 100%. In a further experiment of this type, at 95% training percentage the accuracy is ∼85% and the Matthews φ only ∼0.7, which is certainly not as satisfying 16 .

Examples with Different Types
We now consider experiments that mix mutation classes with different numbers of nodes. Notice that in these three experiments we also have matrix pairs {(M4×4, M6×6) → 0} in our data, that is, we also include the trivial zeros from pairs of quivers with different numbers of nodes. In all of the experiments, when we train on 95% of the dataset and validate on the remaining 5%, the accuracy is about 70%-80%, and φ is about 0.4-0.6. As expected, when we have more mutation classes, the performance of the machine becomes worse. As a sanity check, we remove the pairs {(M4×4, M6×6) → 0} from our data. For instance, we generate 52, 50, 70, 54, 76, 77 and 77 matrices respectively, and create 14254 pairs with 7375 1's and 7529 0's. We find that the accuracy becomes 65%-75%, and φ becomes 0.4-0.5. Getting a lower accuracy and a lower φ makes complete sense: quivers with different numbers of nodes are obviously not dual to each other, and such pairs are trivially learnt to be 0's. Henceforth, we will not include pairs of matrices of different dimensions among the 0's in our datasets.

Dynkin and Affine Types
So far in this section, we have discussed two different (finite) mutation types: we mainly deal with ADE types and include affine types as well. In light of the above learning results, we wonder whether different types would affect our result. A simple check involves only two mutation classes, one Dynkin and one affine; for instance, we test ['D',4] and ['A',(3,1),1] here. We pick out two points on the whole learning curve, as in Table 4.1. The performance is as good as that of the example of ['A',4] and ['D',4]. From the viewpoint of machine learning, this is definitely a successful and exciting result. More importantly, our point here is to seek out the influence of different types. We find that learning mutation classes of the same type (e.g. only Dynkin) and learning those of different types (e.g. Dynkin+affine) have the same performance. Let us further try an example with one finite mutation type (['D',4]) and one infinite mutation type; for the infinite one, we choose the quiver Q4. We pick out two points on the learning curve, as tabulated in Table 4.2. We then consider a larger dataset of seven classes mixing Dynkin and affine types (including, e.g., ['E',6]). This time, let us remove the two affine types and study the learning performance of the data with the remaining 5 classes. The results are reported in Table 4.4 for 5-fold cross validation and Fig. B.9 for learning curves. We find that the result is improved. Unlike the above tests, this seems to tell us that the influence of different types outcompetes the influence of the number of mutation classes. However, as we will see next, this is not the real reason.

T Type
Now, we perform a test on 3 infinite classes, all of which are of T type [44]: ['T',(4,4,4)], ['T',(4,5,3)] and ['T',(4,6,2)]. A quiver of T type is an orientation of a tree containing a unique trivalent vertex, three leaves of degree one, and with the remaining vertices in the branches being of degree two. When we say a quiver is of type ['T',(a,b,c)], we mean there are a total of (a−2)+(b−2)+(c−2) vertices of degree two, summing up the contributions from the three branches. These are all 10×10 matrices 17 . The learning results are given in Table 4.5 for 5-fold cross validation and Figure B.10 for learning curves (accuracy 0.88569500±0.00987409, F-score 0.89199500±0.00793421, φ 0.77648300±0.01925770). We see that the performance is basically the same as in the three-infinite-class example above.

Splitting the Dataset
Let us now try to solve the puzzle left at the end of §4.1. Consider the quivers and matrices Q9 and Q10. We can machine learn the dataset with these two classes; this yields 100% accuracy and φ = 1 most of the time, which is as good as expected. However, we can put these two quivers together with the two quivers of §3.3 (Q4 and Q5) and machine learn the four classes generated from these four quivers. The 5-fold cross validation is given in Table 4.6; we also pick out three points on the learning curve. Unlike the usual result one would expect from a four-class case, this learning result is almost as good as the two-class cases. In fact, this is the key. Since we have two classes of 3×3 matrices and two classes of 4×4 matrices, the machine actually splits the dataset into two pieces, viz, it treats the 3×3 and 4×4 matrices separately. Just like including trivial zeros from pairs of matrices of different sizes, although machine learning is not affected by the dimensions of matrices "longitudinally", it does take advantage of them "transversally" 18 : the machine spontaneously splits a mixed-size dataset into smaller single-size problems, improving the apparent performance.

Adding Ranks of Nodes for NB
Since physically interesting quivers have (round) nodes as gauge groups, each node carries the rank information of its gauge group. Thus, we can further add the rank information to "help" the machine learn Seiberg duality. Above all, these quivers should be anomaly free, which is encoded by the kernel of the adjacency matrix M, with certain transformation rules under Seiberg duality as discussed in §2.1 [59,60]. We simply add the ranks of the nodes as a column vector v to our dataset, so that each data point becomes a pair {((M, v_M), (N, v_N)) → 1 or 0}. We first test this on the three classes Q4, Q5, and Q6. The results are given in Table 5.1 for 5-fold cross validation and Fig. B.11 for learning curves. We find that the learning result (accuracy 0.91041400±0.00306970, F-score 0.91662600±0.00340356, φ 0.82855000±0.00626524) is the same as in the former example with bare matrix input. Now we add the class generated by Q7 to our data; the four-class result is reported in Table 5.2. We also further include Q8 to construct the five-class example with extra rank information; the result can be found in Table 5.3 for 5-fold cross validation. Again, we learn that the learning results are not improved by the extra vectors. Based on the above results, it is possible that the machine already sees the rank information when we only feed it the bare matrix input (since it is related to the adjacency matrix kernel), and therefore does not require us to give the rank vector explicitly.
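Concretely, the augmentation can be implemented by appending the rank vector as an extra column (a minimal sketch under our own conventions):

import numpy as np

def augment_with_ranks(M, v):
    """Append the rank vector v (one entry per node, e.g. a generator of
    ker(M) for an anomaly-free quiver) as an extra column of M, so the
    input grows from n x n to n x (n+1)."""
    return np.hstack([np.asarray(M), np.reshape(v, (-1, 1))])

For the anomaly-free quivers considered here, v can be taken to be a suitably normalized generator of the kernel of M.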
Moreover, we can try predicting totally unseen matrices as well. Let us use the three-class example (Q4, Q5, and Q6). We still train on (102+138+161) matrices, viz, those generated to (and including) depth 4. Our validation then contains the (688+978+1258) matrices of depths 5 and 6. The training set has 12938 1's and 12961 0's while the validation set has 8987 1's and 8974 0's. After picking out correspondingly many pairs from each set, at 90% training we find that the accuracy is 0.50632400±0.00932148, and φ is 0.01286830±0.01174640. The performance (essentially random guessing) is thus the same as before. Therefore, we would say that for NB, the machine already sees the rank information to some extent even if we only give it the bare matrix input 19 .

Adding Diophantine Variables
It is also natural to ask what would happen if we use some other kind of dataset enhancement. For superconformal chiral quivers, physical constraints should be imposed on block quivers. The following conditions: chiral anomaly cancellation for the gauge groups, vanishing NSVZ β-function for each coupling as well as their weighted sum, and marginality of the chiral operators in the superpotential at the interacting fixed point, lead to a Diophantine equation of Markov type in the $a_{ij}$'s and $\alpha_i$'s [56,60,61] 20 , where the $a_{ij}$'s are the numbers of arrows among blocks (i.e., entries of the matrix) and the $\alpha_i$'s the numbers of nodes in the blocks. Motivated by this intrinsic structure of the mutation classes, rooted in these physical constraints, we simply arrange the $a_{ij}^2$'s and $a_{12} a_{23} a_{31}$ (which we shall call Diophantine variables for simplicity) into a vector and add it to the data. Now each pair looks like {((M, $(a_{12}^2, a_{23}^2, a_{31}^2, a_{12} a_{23} a_{31})^T_M$), (N, $(a_{12}^2, a_{23}^2, a_{31}^2, a_{12} a_{23} a_{31})^T_N$)) → 1 or 0}. However, we should emphasize that we are not actually telling the machine that the quivers/matrices should obey the Diophantine equation; otherwise, for instance, for superconformal three-block quivers we would only have 16 of them [59]. We are just using some specific combinations of the $a_{ij}$'s (inspired by Diophantine equations), and putting this extra explicit vector into the data to see if it gives any improvement; a sketch of this feature construction is given below. We first try an example with three mutation classes of 3×3 matrices 21 . We use the quivers Q9, Q10 and Q11. We list the 5-fold cross validation result in Table 5.4; for reference, the learning result without including any extra information/vectors is also given in Table 5.5. We can see that there is no improvement.
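The extra feature vector can be computed directly from a 3×3 adjacency matrix (a minimal sketch; the naming and sign convention are our own):

import numpy as np

def diophantine_variables(M):
    """Read off (a12^2, a23^2, a31^2, a12*a23*a31) from a 3x3 antisymmetric
    adjacency matrix M, assuming the cyclic convention 1 -> 2 -> 3 -> 1,
    so that a31 = M[2][0]."""
    a12, a23, a31 = M[0][1], M[1][2], M[2][0]
    return np.array([a12**2, a23**2, a31**2, a12 * a23 * a31])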

For the record, the corresponding measures are: accuracy 0.91431900±0.00644304, F-score 0.91987000±0.00657059, φ 0.83621300±0.01123010. Let us now contemplate an example with four mutation classes. This time, we use the quivers Q4, Q5, Q6, and Q7. We report the results in Table 5.7 for 5-fold cross validation and Fig. B.15 for learning curves. Again, the performance is the same.
We generate (102+138+161+102) matrices, giving 14040 1's and 14109 0's; the method is NB and the Diophantine variables are included. Now we move on to the case with five mutation classes. Besides the above four quivers, we further include the quiver Q8. The corresponding experiment without the Diophantine variables was done in §3.4. The new learning results are given in Table 5.8 for 5-fold cross validation and Fig. B.16 for learning curves. We find that the result is still not improved.
Moreover, we can again try predicting totally unseen matrices. Let us use the three-class example (Q4, Q5, and Q6). We still train on (102+138+161) matrices, viz, those generated to (and including) depth 4; our validation then contains the (688+978+1258) matrices of depths 5 and 6. The training set has 12886 1's and 13029 0's while the validation set has 8979 1's and 8981 0's. After picking out correspondingly many pairs from each set, at 90% training we find that the accuracy is 0.50191000±0.01061240, and φ is 0.00206997±0.025543800. We also performed a similar experiment for NN, where this extra Diophantine-inspired structure does not improve the learning either. This suggests that such information does not help encode the structure of the quivers, which may be reasonable as we are also considering more general quivers and classes.

Adding Ranks of Nodes for NN
Now, returning to the example of Q12, Q13, and Q14 in the multiclass classification, let us add the rank information to our dataset by augmenting the data input matrices to include the rank vectors as before. We have (496+898+484) matrices for training and validation. The learning curves of the accuracies are plotted in Fig. 5.1. We can see that the result is greatly improved once we include the rank information. With enough data trained, the accuracies approach 1, which is much better than the examples using NB. We also notice that at very low training percentage, the machine again confuses the two finite mutation classes, while almost always giving correct results for the infinite one 23 . The test without rank information above looks like the "limit" at low training percentage of the test with rank information. To see whether this model is really useful, we use it to predict matrices at unseen depths in these classes. For the predicted (1051+3263+1344) matrices, we get ∼74% accuracy and ∼71% F1 score. Although this is not perfect, particularly for the purpose of applications, the results for unseen matrices are still much better than those for NB. It is not just guessing any more, and we are on track to further improve this.

Finite and Infinite Mutations
Recall that in §3.5, the machine seems to treat finite and infinite mutations separately. Hence, we replace the infinite one (Q14) with another finite class, as shown in Q15, which is anomalous.
We have tried CNNs, as well as MLPs and RNNs, and find that all of them predict [∼0.333,∼0.333,∼0.333]. This means that the machine is not able to decide the classes of the matrices. Hence, comparing the two examples (Q12, Q13, Q14 versus Q12, Q13, Q15), whether a mutation class is finite or infinite can affect the learning result. More precisely, the machine is learning something that helps it distinguish between finite and infinite mutation types.
We can also include the rank information for the example of Q12, Q13, and Q15. Although the quiver Q15 is anomalous, we can still assign some vector, say (1,1,1,1,1,1,1)^T, to it. Then the anomalies for every node should still add some consistent information on the duality operation among duals 24 . We have (496+484+499) matrices for training and validation 25 , and the model will be used to predict (1051+1344+1631) matrices. For training and validation, the learning curves are plotted in Fig. 5.2. We can see that with enough training, the result is still very good. It is also worth noting that when the machine meets a matrix belonging to the second class (Q13), it never misclassifies the matrix into other classes, viz, the red learning curve is constant at 100%. For prediction, the machine again gives ∼71% accuracy and ∼0.71 F1 score.
The above two examples show promising results for both physicists and mathematicians. We see that imposing rank information in NN significantly improves the machine's performance in learning Seiberg duality. From a pure mathematical point of view, and in particular for the second example with all finite mutation types, this shows that the machine can learn which quivers come from which surfaces (or the 11 sporadic quivers) if we enhance the data as above.

Predicting Matrices at Middle Depths
Now we would like to know whether the results for unseen data in predictions can be improved. Our strategy is again to train on matrices up to some depth, together with some matrices at depths far away; we then check how the NN behaves when predicting the matrices at middle depths. As a toy model, we train on the matrices generated from Q12, Q13, and Q14 at depths 0-3 and 5, and then use the trained model to predict the (351+705+350) matrices at depth 4. In order to have a more balanced dataset, we choose 1062 out of the 3263 matrices at depth 5 for the class of Q13. Therefore, we have (1196+1255+1478) matrices for training and validation. We train on 90% and validate on the remaining 10%, which gives almost always 100% accuracy, as expected. Impressively, after repeating training/validation and prediction a few times, we find that the machine almost always gives 100% accuracy on the matrices at the unseen depth (with only several errors out of tens of thousands of predictions; in particular, these few errors never happen for the infinite class). Nothing of the sort happens in the NB cases. This is a perfect result, especially in the sense of applying machine learning to quiver mutations: it means that we can have a model that makes good predictions on data of a different style from the training data (here, at unseen depths).
One may also wonder whether things would change if more mutation classes were involved. Hence, we further add Q15 to the above dataset. For training and validation alone, we find that the result remains just as good: having more classes does not seem to affect the learning result much. Applying this model to matrices at the unseen depth, just as above, the machine gives ∼98% accuracy and ∼0.98 F1 score, which is an impressive result.

Classifying Against Random Antisymmetric Matrices
Let us repeat the test involving randomly generated antisymmetric matrices, but with rank information included. We still generate the matrices to depth 7, giving 382 matrices, and train these together with 384 random antisymmetric matrices. The learning curves are plotted in Fig. 5.3: including the rank information improves the result significantly. Even at low training percentage, the accuracy still looks perfect. We then use this model to predict the 384 matrices at depth 8, along with 461 unseen random matrices. It turns out the accuracy is almost 100%, with only roughly ten mistakes. Thus, if we would like to know whether a quiver belongs to some specific class of theories, this kind of model would be very useful. It is also worth noting that here we do not even need to include matrices at depths far away in the training.
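For reference, random antisymmetric integer matrices of the appropriate size can be generated as in the following sketch; the entry range is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_antisymmetric(n, low=-5, high=6):
    """Random integer antisymmetric n x n matrix: with A strictly
    upper-triangular, A - A^T satisfies (A - A^T)^T = -(A - A^T).
    The entry range is illustrative, not the one used in our tests."""
    U = rng.integers(low, high, size=(n, n))
    A = np.triu(U, k=1)  # keep the strictly upper-triangular part
    return A - A.T

negatives = np.array([random_antisymmetric(7) for _ in range(384)])
```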
We can further try an example with two classes plus some random matrices; this time, Q10 is involved as well. We generate to depth 6 and choose 384 of the 506 matrices for this newly added class. It turns out that at 90% training, the accuracy is only 0.846 ± 0.034, with an F1 score of 0.842 ± 0.026. If we use this model to predict matrices at the next unseen depths, along with unseen random matrices, the accuracy is ∼80%, with F1 score ∼0.81; this does not decrease much from the validation result. However, using an NN to identify whether a random quiver belongs to a particular duality class works best when only one class is considered at a time.

Conclusions and Outlook
Based on all the tests above, we can see that Seiberg duality and quiver mutations are eminently machine-learnable. Several points are summarized below. We first list the conclusions for NB and the Mathematica classifier:

• The number of distinct mutation classes is the dominant influence on our machine learning; fewer classes in the dataset give better learning results.
Other factors (such as mutation types, dimensions of matrices and the addition of rank information) have comparatively little influence on the learning when the number of mutation classes is large.
• One reason that the number of classes greatly affects our results is the large number of matrices we have. In particular, (#[combinations of assigning 0] − #[combinations of assigning 1]) grows as we include more mutation classes. We need to find a balance between avoiding duplicated 1's and covering the various combinations of 0's. Our strategy is to generate as many distinct 1's as possible, and then generate approximately the same number of 0's. Thus, we maximize the combinations of 0's without duplicated 1's while keeping the dataset unbiased.
• The dimensions of the matrices affect the result "transversally" rather than "longitudinally". If we have two datasets with, say, k different mutation classes of m × m matrices and k different mutation classes of n × n matrices (m ≠ n), the performance should be roughly the same. On the other hand, the machine spontaneously splits the data into smaller parts according to the dimensions of the matrices. For instance, a dataset with 2 classes of 4×4 matrices and 3 classes of 5×5 matrices leads to a better result than a dataset with 4 classes of 4×4 matrices does: the former effectively has (2+3) classes, and hence the machine performs better than it would on a dataset with 4 or 5 classes of a single size. Of course, the (2+3)-class case is still a bit worse than a pure 2-class example. Moreover, in light of the above two points, we should never include trivial 0's in which the two matrices of a pair have different sizes. Although the transversal influence of dimensions does improve our results, such pairs would enlarge the discrepancy between the combinations of 0's and 1's, which, as mentioned above, is cumbersome, especially for datasets with many mutation classes. Since these 0's represent theories that are obviously not dual to each other, there is no need to include them in the dataset.
• NB is the best method in the Classify function due to its mutual independence assumption.
• The NB classifier already picks up hints of the rank information when only the bare matrices are given as input; thus, imposing rank information explicitly does not further improve the NB results.
• When the machine encounters mutation classes that are not seen in the training data, the performance gets worse. This is a reasonable result.
For the multiclass classifications (and the cases with random antisymmetric matrices), we mainly use CNNs, and we see that they behave differently from NB. What NB is good at does not seem to carry over to an NN method, and vice versa: NB gives good results when the data are arranged in pairs, while NNs perform very well in multiclass classifications. It turns out that NNs are the more useful option for applying machine learning to mutations, in light of the following points:

• We find that an NN can distinguish whether a mutation class is finite or infinite, even without rank information. If we have a finite (infinite) mutation class among infinite (finite) mutation classes, the machine can almost always single out that finite/infinite class with 100% accuracy.
• We can impose the ranks as additional vectors augmenting the matrices. An NN classifier then gives extremely good validation results, which means the ranks of the nodes reveal, to some extent, the structure behind a quiver. If we include some matrices at depths far away, then the unseen matrices at middle depths can be perfectly classified (as depicted in Fig. 3.3(b)): the machine almost always gives nearly 100% accuracy when making predictions. Furthermore, the number of distinct mutation classes does not seem to strongly affect the performance of the NN in this case.
• We can train one class of matrices against some other randomly generated matrices. Even without rank information, the results are quite good (e.g., see the results at the end of §3.6), and including rank information brings further improvement. If we use this model to predict matrices at unseen depths in that class (as depicted in Fig. 3.3(a)), as well as unseen random matrices, the results are still almost perfect (i.e., nearly 100% accuracy). Unlike the previous point, this does not even require matrices at depths far away to be involved in training. However, this kind of model works best for classification with a single class (against the random matrices); having more classes makes it lose efficacy (e.g., two classes plus random matrices decrease the prediction accuracy to ∼80%).
We see that ∼100% prediction accuracy can be obtained in all three of the above points. These are the key results that might be useful in real-world applications.
Outlook It would also be interesting to ask whether the machine can recognize totally unseen classes (rather than just matrices at unseen depths within trained classes) after training. For NB and Mathematica, we can use matrix pairs, and the predictions on pairs involving unseen classes will still be 0 or 1. However, as we have already seen, such a model is poor at prediction on unseen data, so it may not be that useful here. On the other hand, NNs perform well in prediction but are not suitable for datasets of matrix pairs; we can therefore only apply these classification networks to multiclass classification problems. Unfortunately, due to the structure of multiclass classification, NNs can only recognize, and classify into, the categories on which they are trained. When meeting an unseen class, the network treats the matrix as an element of some trained class. Under the supervised learning design used with these NNs, no machine can even tell that such a matrix does not belong to any trained class, let alone recognize a totally unseen class. Perhaps the closest realization so far is the model containing random matrices, where the machine at least knows that the unseen classes are different from the class being trained.
It is thus natural to ask whether the advantages of the above two methods can be combined: NB behaves better when the matrices are paired, while NNs give really good results on matrices at unseen depths. From the machine learning perspective, the network structure, such as the choices of layers and loss functions, might also be improved. We hope in future work to develop new techniques for our models, especially for NNs or similar models, that make good predictions for matrix pairs and hence become useful for unseen classes.
More generally, we can imagine training the machine with a large number of pairs consisting of a randomly generated quiver and a dual connected to it by a single Seiberg duality on one of its nodes. We could then investigate if the machine can determine whether a pair of quivers are dual. If successful, this would arguably amount to the machine "learning Seiberg duality".
There are many other directions for future work as well. For instance, supervised learning is used throughout this paper; we would also like to see what happens if we do not label the matrices and let the machine learn without supervision. We have also not taken superpotentials into account here: all the bidirectional arrows get cancelled as we integrate out the corresponding fields. It would be intriguing to explore quivers with non-trivial superpotentials; such data may be constructed with the help of Kasteleyn matrices [7,8]. Moreover, similarly to what we have done for Seiberg duality in 4d, we can try applying machine learning to 2d N = (0, 2) triality [63,64], 0d N = 1 quadrality [65], and the order-(m + 1) dualities of m-graded quivers that generalize them [16]. It is also worth noting that in [21], machine learning is applied to D-branes probing toric CY cones; it is therefore possible to study volume minimization with machine learning. Finally, it would be interesting to ask whether the concept of finite type could be machine-learnt. Such types are exactly the ADE Dynkin types, and their matrices have eigenvalues less than 2 [66]. Matrices and their eigenspaces are ubiquitous in mathematics, physics and machine learning, so this would lead to a deeper study of matrices in machine learning.

A.1 Mathematica's Classify
Within the Mathematica software, the Classify function allows analysis of a variety of input data types, including strings, sounds, and images, as well as the familiar numerical inputs. In our case the input data are tensor structures with integer entries. The generality of this function's allowed inputs makes it less likely to be optimised exclusively for tensors.
The Classify function takes as input training and validation sets; in our case these were lists of pairs of square matrices (or pairs of matrices together with vectors of their respective rank data). In addition, when calling the function, the user can specify the classification method, the classification performance goal, and even a pseudo-random number seed for the classification process.
The performance goal used was the standard "automatic" option. This selection trains the final classifier under a weighted tradeoff, aiming for high output accuracy whilst still running quickly in subsequent classifications and not requiring excessive memory.
More important in the creation of the classifier is the classification method. Mathematica offers 9 method options, which include Decision Trees, Markov Sequence Classifiers, Support Vector Machines, and Simple Artificial Neural Networks. When Classify is run without specifying a method, the program runs all methods and outputs a learning curve to allow comparison of their performance on the input dataset (using comparison parameters based on the validation data) [67].
In initial testing of the Classify function on some of the datasets, the Naive Bayes method consistently produced the best-performing classifier, a fact linked to the independence of the pair structure of the input data. Therefore, to avoid superfluous classifier training, the method was fixed to Naive Bayes for the remainder of the investigation. The design and success of this method are discussed further in Appendix A.2.

A.2 The Naive Bayes Method
We have seen that the Naive Bayes method, as a machine learning classifier, always gives us the best result when applying the built-in Classify to learn the matrix mutations. Essentially, our model is a conditional probability problem: $p(v_i|T)$, where $T$ acts as the condition for the machine to predict each $v_i \in V$ to be 0 or 1. Then Bayes' theorem gives
$$p(v_i|T) = \frac{p(T|v_i)\,p(v_i)}{p(T)}.$$
Since $p(T)$ does not affect our result, being determined solely by the fixed training set $T = \{t_1, t_2, \ldots, t_n\}$ in each single experiment, we can focus on the numerator
$$p(T|v_i)\,p(v_i) = p(t_1, t_2, \ldots, t_n|v_i)\,p(v_i).$$
Naive Bayes is "naive" because it assumes that every $t_i$ is independent of the other conditions in $T$, which is exactly the property of matrix mutations: whether a pair of matrices/quivers are related by mutations is always independent of the other matrices/quivers. This is why the NB method is always the ideal choice here.

Therefore, we may omit all the $t_k$'s in the conditional probability of $t_j$, viz.,
$$p(t_j|t_1, \ldots, t_{j-1}, v_i) = p(t_j|v_i).$$
As a result, we have
$$p(v_i|T) \propto p(v_i) \prod_{j=1}^{n} p(t_j|v_i).$$
For our binary classification, the output is either 0 or 1, so the Bayesian classifier $C_B$ should output $m$ ($m = 0, 1$) if $p(v_i = m|T) \geq p(v_i = 1-m|T)$ [68]. Hence, we require
$$p(v_i = m)\prod_{j=1}^{n} p(t_j|v_i = m) \;\geq\; p(v_i = 1-m)\prod_{j=1}^{n} p(t_j|v_i = 1-m).$$
For the NB classifier, we get
$$C_B(T) = \underset{m \in \{0,1\}}{\arg\max}\; p(v_i = m) \prod_{j=1}^{n} p(t_j|v_i = m).$$
As NB is the simplest (Bayes) network, it is often faster than other methods. More importantly, the assumption of conditional independence in NB reflects the special feature of the data.
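For concreteness, a minimal Python analogue of such a classifier might look as follows; scikit-learn's GaussianNB is only a stand-in for Mathematica's internal NB implementation (which treats feature distributions in its own way), and the variable names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def pair_features(B1, B2):
    """Flatten a pair of exchange matrices into one feature vector;
    the label is 1 if the pair lies in the same mutation class,
    0 otherwise."""
    return np.concatenate([np.ravel(B1), np.ravel(B2)])

# Hypothetical usage, with `pairs` a list of matrix pairs and `labels`
# the corresponding 0/1 duality labels:
# X = np.array([pair_features(B1, B2) for (B1, B2) in pairs])
# clf = GaussianNB().fit(X, np.array(labels))
# print(clf.score(X_val, y_val))  # validation accuracy
```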

A.3 Python's CNNs
In investigations requiring multiclass classification, a more technical machine learning structure is needed for high-performance classification. To facilitate this, the TensorFlow library, and within it the machine-learning sub-library Keras, were used [58]. Artificial Neural Networks (NNs) are code structures for non-linear function fitting. Their design was broadly inspired by the biological brain, and they have seen significant success in recent years as increased computation speed now compensates for their computational inefficiency relative to traditional algorithms. The networks used in this investigation were dense and deep: all neurons were fully connected between layers, and there were multiple hidden layers.
More specifically, the network style used was a Convolutional Neural Network (CNN). The defining feature of these networks is the local action at the neurons of the hidden layers, which preserves the multidimensional structure of the tensor input by acting with a simple linear 2d function and then applying a non-linear activation. It is important to stress the non-linearity of the activation functions at each neuron, which allows NNs to address problems of higher complexity. These networks are traditionally used for image recognition, since convolution is good at identifying local structure in arrays of dimension larger than 1; this motivated their use for our matrix-based datatype [69].
The specific CNN used in this investigation had a sequential structure, i.e., a linear stack of layers. The network had 3 convolutional layers, each with LeakyReLU activation and each followed by a MaxPooling layer, followed by 2 generic dense layers, one with LeakyReLU activation and the other with softmax activation. The MaxPooling layers simply assign to each entry the maximum value over a set of surrounding entries; they are a traditional component of the CNN structure.
LeakyReLU was used as the standard activation function at each layer. This activation is simple to compute, monotonic, and inherently non-linear, with the added benefit of fast gradient descent in training due to its piecewise-proportional derivative. It leaves positive inputs to a neuron unchanged but scales negative inputs down, in our case by a factor of 10: $f(x) = x$ for $x \geq 0$ and $f(x) = x/10$ for $x < 0$. The additional dense layers are needed in CNNs to recreate the vector data structure for classification. Softmax was used as the final activation: it is a multi-class generalization of the sigmoid, with traditionally better results and a normalized output, which is essential for classification problems with multiple classes.
When compiling the NN, additional inputs of loss function, optimizer, and metric are required. The loss function is a measure of the performance of the model: it is the function whose optimal value indicates a well-trained NN, and hence a good model. "Mean squared error" was used for the loss function in this investigation; this measure is simple and computationally inexpensive. It is calculated as the sum of squares of the differences between each label and the model's prediction; accordingly, the output values used in training are float vectors bounded by 0 and 1, reflecting the one-hot encoding of the Boolean output in this classification. The optimizer is the method by which the parameters of the network are updated in accordance with the performance of the loss function; here the "Adam" optimizer was used, an inexpensive first-order gradient-based method [70]. Finally, the metric used was "accuracy", which gives the final measure of the NN's performance and is simply the proportion of correct classifications the model performs on the validation dataset.
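A minimal Keras sketch of this architecture is given below; the filter counts, kernel sizes, pooling sizes and input shape are illustrative choices rather than the precise values used in our experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

n, num_classes = 8, 3  # e.g. a rank-augmented matrix input, 3 classes

model = models.Sequential([
    layers.Input(shape=(n, n, 1)),
    # 3 convolutional layers, each with LeakyReLU then MaxPooling
    layers.Conv2D(32, (2, 2), padding="same"),
    layers.LeakyReLU(alpha=0.1),  # negative inputs scaled by 1/10
    layers.MaxPooling2D((2, 2), padding="same"),
    layers.Conv2D(64, (2, 2), padding="same"),
    layers.LeakyReLU(alpha=0.1),
    layers.MaxPooling2D((2, 2), padding="same"),
    layers.Conv2D(64, (2, 2), padding="same"),
    layers.LeakyReLU(alpha=0.1),
    layers.MaxPooling2D((2, 2), padding="same"),
    # 2 dense layers: LeakyReLU, then softmax for class probabilities
    layers.Flatten(),
    layers.Dense(128),
    layers.LeakyReLU(alpha=0.1),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(loss="mean_squared_error",  # MSE on one-hot label vectors
              optimizer="adam",
              metrics=["accuracy"])
```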

A.4 Measures of the Machine's Performance
Measures of the performance of a classification method are essential for justifying the use of machine learning. The most standard measure of a classifier is "accuracy"; as mentioned in Appendix A.3, this is the proportion of correct classifications performed by the classifier on a validation dataset. To ensure the measure is unbiased, it is important that the validation dataset is not used for training whilst still being representative.
To ensure representative validation datasets, as well as to provide a means of calculating errors for these measures, k-fold cross-validation was used. In these investigations k = 5: in each investigation the full dataset (all data points with their respective classification labels) was first randomized and then split into 5 equal-size sub-datasets. The process of training and then validating the classifier was iterated 5 times, with the validation dataset in each iteration being a different sub-dataset from the split and the training dataset being the remaining 4 sets combined. For each of the 5 iterations the measures of performance were calculated and recorded, giving a small dataset for each measure from which a mean and standard error could be computed [71].
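A minimal sketch of this procedure, assuming a compiled Keras-style model returned by a hypothetical `build_model` and a labelled dataset (X, y), is:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(build_model, X, y, k=5, seed=0):
    """k-fold cross-validation: each fold serves once as the validation
    set, giving k accuracy measurements from which a mean and standard
    error are computed."""
    scores = []
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(X):
        model = build_model()  # fresh, compiled model for each fold
        model.fit(X[train_idx], y[train_idx], epochs=50, verbose=0)
        _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        scores.append(acc)
    scores = np.array(scores)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(k)
```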
More technical measures of performance include the Matthews correlation coefficient (MCC, φ) and the F1 score (also called simply the F-score). Both of these measures take into account Type I and Type II errors from misclassification. A Type I error is a "false positive" (FP), where for example a random matrix is classified as being in the mutation class; conversely, a Type II error is a "false negative" (FN), where a quiver matrix is classified as not being in the class the machine is trained on.
The F1 score gives equal weight to Type I and II errors, whereas the MCC uses variable weights based on the occurrence of true positives and true negatives (TP/TN). These factors make MCC the more favorable measure for this style of binary classification problem [72].
All three measures can be summarized as functions over the "confusion matrix",
$$C = \begin{pmatrix} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \end{pmatrix},$$
in terms of whose entries
$$\mathrm{accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}\,, \qquad F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}\,,$$
$$\phi = \frac{\mathrm{TP}\cdot\mathrm{TN}-\mathrm{FP}\cdot\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}\,.$$
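In code, the three measures follow directly from the entries of the confusion matrix:

```python
import numpy as np

def binary_metrics(tp, fn, fp, tn):
    """Accuracy, F1 score and Matthews correlation coefficient from
    the binary confusion-matrix entries."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f1, mcc

# e.g. binary_metrics(tp=95, fn=5, fp=3, tn=97)
```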

B Investigation Learning Curves
This appendix presents additional learning curves calculated for the investigations discussed in the paper. Each graph shows the performance of the investigation's classification method on the specified dataset for varying proportional splits of the dataset into training and validation data. The measures of classification performance considered were accuracy and the Matthews correlation coefficient φ, as discussed in §A.4.