Trie-based ranking of quantum many-body states

Ranking bit patterns -- finding the index of a given pattern in an ordered sequence -- is a major bottleneck scaling up numerical quantum many-body calculations, as fermionic and hard-core bosonic states translate naturally to bit patterns. Traditionally, ranking is done by bisectioning search, which has poor cache performance on modern machines. We instead propose to use tries (prefix trees), thereby achieving a two- to ten-fold speed-up in numerical experiments with only moderate memory overhead. For the important problem of ranking permutations, the corresponding tries can be compressed. These compressed"staggered"lookups allow for a considerable speed-up while retaining the memory requirements of prior algorithms based on the combinatorial number system.


I. INTRODUCTION
At their first encounter with the quantum manyfermion problem, students are usually warned that any brute-force attempt at a solution is destined to shatter at the exponential wall -the doubling of computing requirements with the addition of each state. Commonly, this then segues into the presentation of some polynomial-cost approximation.
So it is perhaps ironic that the brute-force solution of small many-fermion problems, known as exact diagonalization (ED) [1][2][3], many-body Lanczos method [4,5], or full-configuration interaction (FullCI) [6,7], has remained one of the workhorses of many-body physics and quantum chemistry. On small systems, it serves as benchmark for more advanced (classical [8] or quantum computing [9]) methods and to explore emergent many-body behavior [10]. For realistic systems, exact diagonalization is particularly useful as kernel of embedding theories [11][12][13], where one approximately maps the full problem onto one or more difficult but small many-body problems. In that problem space, ED is competing with Monte Carlo methods [14,15], which do not have as severe a restriction in system size, but usually require sufficient symmetry to avoid a prohibitive sign problem. An iterative ED is also at the core of more sophisticated renormalization schemes, most notably the numerical renormalization group (NRG) [16] and density matrix renormalization group (DMRG) [17] method. Since scaling with system size is the fundamental limitation of ED, we are seeking ways to mitigate it while retaining an unbiased deterministic approach. Here, the Lanczos method combined with on-the-fly representation of the Hamiltonian [6,18] and use of quantum numbers [1] is the state-of-the-art in diagonalizing spin systems [19] and gaining traction for many-fermion solvers [20,21] (see Fig. 1 and Sec. II for a recap). This approach is still memory-bound, as the subspaces still grow exponentially, albeit somewhat more slowly. The main runtime bottleneck is, perhaps surprisingly, not the application of the Hamiltonian itself, but the mapping of the many-body states back into the block structure generated by the quantum numbers, a procedure known as "hashing" in the ED community [18] and "ranking" Figure 1. On-the-fly construction of the matrix elements of a non-interacting three-site Hamiltonian in the symmetry sector N = 2: unranking the states into occupation number states |α = n2n1n0 (left column), followed by applying the Hamiltonian (center column) and "ranking", i.e., mapping back to the state index in the symmetry sector (right column). in computer science [22] (red arrows in Fig. 1). Aside from generic techniques for the mapping, such as bisection search and hash tables, for special sectors explicit formulas have been put forward for computing the rank by examining the position of each bit in the state [23] (see Sec. III A).
In this paper, we first show how to create a set of small precomputed tables for these explicit formulas, allowing us to process data in chunks of multiple bits rather than one bit at a time at a considerable speedup (see Sec. III B). We then generalize these formulas to an arbitrary set of quantum numbers by leveraging a special kind of n-any search tree called "trie" (Sec. IV) [24,25]. Numerical experiments in Sec. V show significant speedups for both microbenchmarks and realworld ground state computations.

II. THE FINITE MANY-FERMION PROBLEM
To set the stage, let us briefly review the challenge of finding the ground state of a system of interacting fermions in M spinorbitals. We need to find the lowest eigenvalue and associated eigenvector of the Hamiltonian, a 2 M × 2 M matrix given in 2nd quantization by: where i, j, k, l are spinorbital indices, i = 0, . . . , M − 1, t ij are hopping amplitudes, U ijkl are two-body interaction strengths,ĉ i is (a matrix) annihilating a fermion in spinorbital i, andĉ i † is its Hermitian conjugate. The chemical potential, if present, can be absorbed into the diagonal entries of t.
The explicit form ofĤ is easily constructed in the occupation number basis. There, each basis state is one possible combination of occupations n i of the spinorbitals, |n M −1 . . . n 1 n 0 . Since for fermions n i ∈ {0, 1}, each state is nothing but a bit pattern. In order to form a matrix, we assume these patterns are ordered, or "ranked", lexicographically. Interpreting a bit pattern as a number in base two then gives us a natural correspondence of a state and its rank α ∈ {0, . . . , 2 M − 1} in the basis. For example, the state |011101 has rank α = 011101 2 = 29. The i-th annihilator is then a matrix confined to the (2 i )'th side diagonal: where δ ij is the Kronecker delta and the product in the prefactor ensures anticommutativity.
In implementing Eqs. (1) and (2), the motivated but unseasoned physicist has to brace for a series of increasingly painful concessions. First, at around M = 14, the cost O(2 3M ) of numerical diagonalization becomes prohibitive and Krylov subspace methods can be used instead [4,5]. Secondly, at M ∼ 18, the memory O(2 2M ) required to store H starts to blow up, and one may switch to sparse storage [26] as it also combines neatly with Krylov methods: there, we do not need to construct H explicitly, but only its applications on vectors H|ψ . Sparse storage requires O(K 2 M ) memory, where K is the number of unique side diagonals created by Eq. (2), which becomes problematic for M ∼ 22. One may then try to compress multiple columns into bitsets [26], which will get one to M ∼ 26. Finally one may use massive parallelization to distribute the required memory [27,28], which for a supercomputer of reasonable size will break down at M ∼ 32. (These limits vary with system type, technical prowess of the implementer and, assuming "Moore's law" holds, should be incremented by one every 18 months or so.) The key to compressing H further is to realize that Eq. (1) already constitutes a highly compressed form: a sum of O(M 4 ) terms, each of which is a product of creation and annihilation operators with a scalar prefactor. For each term T of this form, there exists a tuple (m, r, x, s, v) such that the application on any occupation basis state |α is given by the following, extremely efficient formula [6,18]: Here, denotes bitwise and, ⊕ denotes bitwise xor, and h(x) is the Hamming weight-the number of set bits-of x. (For completeness, we show how to construct these tuples in Appendix A.) Eq. (3) allows one to compute H|ψ in O(K2 M ) time, where K is the number of terms in H. K ≥ K as K is equal to the number of unique values of x, so we increase runtime, but require only O(K) memory for storing the tuples for each term in H. This takes care of applying operators, however, eventually the memory O(2 M ) required to store a single vector will become a problem. To mitigate this, we can use the fact that the Hamiltonian (1) conserves particle number (commutates with the particle number operatorN ): which partitions the Hamiltonian into M + 1 blocks with different particle numbers N = 0, 1, . . . , M . SinceN is diagonal in the occupation number basis, each state |α can be assigned to exactly one block. For this, we introduce the notation |α ≡ |N i , where N is the block number (the number of set bits in α) and i is the rank of α within its block. As with the full space, we choose |α to have rank i if it is the i-th state, ordered lexicographically within the N -th block.
SinceĤ is now block-diagonal thanks to particle number conservation, we can treat each block separately and then collect the results afterwards. This only involves states in the matrix-vector product confined to one block, i.e.: The number of states in the block N is equivalent to the number of choices of N set bits out of M , so the storage requirements for a block vector drop to M N . Since the largest block is half-filled (N = M/2) and for large M , we do not substantially affect the scaling, but the prefactor allows us to go to slightly larger systems, M → M + 3 or so. Combining Eq. (5) with on-the-fly application (3) involves yet another complication, which shall become the focus of this paper: in order to apply some term T , we first need to unrank |N j , i.e., map it to its corresponding |α for evaluating Eq. (3), and after we have computed |β = T |α , we need to rank it, i.e., map |β back to its |N i (cf. Fig. 1). Unranking is the easier of the two problems, as we simply need to maintain a list of |α for each |N i . Ranking is trickier to do quickly: we cannot simply use a lookup table which maps |α to |N j , as this would again require O(2 M ) memory. Hithero, one typically uses bisectioning search into the list of manybody states, which is slow on modern machines. As a result, it is ranking and not Eq. (3) that is usually the bottleneck of finite many-body calculations.

III. COMBINATION RANKING
As outlined in the previous section, we are concerned with ranking and unranking the M N basis vectors having N electrons for M spinorbit states (or, in general, bitpatterns of length M with N bits set to 1). The set of all such combinations C M N [29] can be written as where c i conveys the ith selected elements out of the M possibilities. The task is now to assign a rank i ∈ {0, . . . , M N − 1} to each element of C M N . We do so lexicographically, i.e., first order by c N , then by c N −1 , and so forth till c 1 . For example, the set C 4 3 contains the elements (2, 1, 0), (3, 1, 0), (3, 2, 0), and (3,2,1) which are thus assigned the lexicographical ranks 0, 1, 2 and 3, respectively.

A. Ranking using combinadics
A convenient way of computing this rank is the combinatorial number system or combinadics [22,30] which allows one to compute the lexical rank by calculating the number of combinations ordered before the current one as [22,30]: where c n = 0 whenever n > c. We can comprehend Eq. (8) by realizing that there are c N N possibilities to select N elements with flavors 0 . . . c N − 1 that have lexicographically the leading (N th) bit set to 1 before c N ; then for the (N − 1)th element there are c N −1 N −1 such possibilities and so forth. For example, 310 C = 3 3 = 1, which reflects that (3, 1, 0) is ranked second after (2, 1, 0) in C N 3 . We contrast this with the "bit pattern" of each element: For the same example, we have α(310) = 1011 2 = 11. We can thus define a rank and unrank function by the following identities: end while 10: end function Figure 2. Combinadics algorithm for ranking a bit pattern using Eq. (8) [23]. Here, ctz(x) counts the number of trailing zeros in the binary representation of x, denotes bitwise and, and α (α − 1) is an efficient means to remove the rightmost bitwise 1 from α. The set of binomial coefficients c k should be precomputed and stored.
Eq. (8) allows one to compute rank(α), known also as "perfect hashing" in the context of exact diagonalization [23,27]. We reproduce this algorithm in Figure 2. The basic idea is to extract the position c k of the k-th set bit by using the count trailing zeros (ctz) instruction, available on most modern CPUs, followed by clearing the least significant bit. One then adds the corresponding binomial coefficient of Eq. (8) to the rank i. These coefficients should be precomputed for all 0 ≤ c < M and 1 < k ≤ M , which comes at a moderate memory cost of 32 KiB for M = 64 (64-bit numbers).
A priori it is not clear why this algorithm should be faster than simply maintaining an ordered list of all elements of C M N and finding a representative by bisectioning, since from Eq. (6) we expect O(M ) steps are needed for both algorithms in the half-filled case. However, depending on the exact structure of the Hamiltonian, ranking by Eq. (8) can be significantly faster on modern machines, because: (i) bisectioning heavily relies on efficient branch prediction, as at each step we have to choose which "half" of the list we will focus on; and (ii) bisectioning has poor cache locality, as we have to jump around the complete list. In contrast to that, the algorithm in Figure 2 is essentially branch-free except for the loop condition, and the lookup table is small enough such that the CPU may reasonably keep it in its cache.
For unranking, Eq. (11), we only have a (relatively) small number of M N elements for each C M N . For these the bitpattern α can be stored in a lookup table.

B. Staggered lookup
While a full lookup table for rank(α) is prohibited by the memory cost, we can further enhance the numerical efficiency of the ranking algorithm Figure 2 by splitting up α into chunks of R bits, starting from the least significant ones, and employ for these chunks of bits precalculated ranks. This balances computational cost vs. memory cost and can be optimized with respect to R.
If in the previous chunks of α, we already encountered α ← α/2 R 10: end while 11: end function In the corresponding algorithm Figure 3, we keep track of how many bits M and how many set bits N we have encountered previously. We then use Eq. (12) for the next chunk of R bits of α, remove these bits, and update M and N before turning to the next bits. Crucially, rank(α, M , N ), unlike rank(α), can be precomputed and stored provided R is not too large. For a given bound M , one needs to precompute for M ∈ {0, R, 2R, . . . , M − R}. For a given M , one needs to precompute for N = {0, 1, . . . , M } and all possible values of α ∈ {0, 1, . . . , 2 R − 1}. This gives a total memory demand of: numbers. For M = 64 and R = 8, this requires 580 KiB of memory, and the corresponding lookup table can be held in cache on reasonably modern machines. Let us conclude with a couple of remarks on the lookup table: firstly, a table constructed in aforementioned fashion is universal in the sense that it works for any sector; we can save memory by restricting ourselves to the values of N to those admittable for a given N at the expense of a slightly more complicated lookup logic. Since the table is of moderate size, it is usually not necessary to do so. However, the table should be stored with the α index varying fastest to ensure that only those values of (M , N ) that are actually used are loaded as cache lines. Secondly, in computing the entries of the lookup table, the algorithm in Figure 2 can be reused, since: Thirdly, one may not only store rank(α, M , N ) for a given α, but also the offset in the lookup table corresponding to the updated values of M and N . This saves the computation of the Hamming weight and some index manipulation, yet doubles the memory demand. We empirically find this to be a beneficial tradeoff on most modern CPUs, and have employed it in all benchmarks. Similar to the standard combination ranking algorithm (Figure 2), the fast ranking algorithm can be made essentially branch-free. (The "while" loop can turned into a for loop if M is added as an argument, and unrolled if M is known a priori.) Unlike the standard algorithm, only M/R instead of N lookups are required, as R bits are processed at a time. The rest of the manipulations are cheap, since the computation of the Hamming weight is available as a separate CPU instruction on all common machines. For the "bottle-neck" case of N ≈ M/2 and the choice R = 8, we thus expect a significant speedup. This is evidenced numerically in Sec. V.

IV. TRIE-BASED RANKING
If our Hamiltonian (1) only conserves particle number, we can use the algorithm presented in Sec. III B to rank. Commonly, however, the Hamiltonian will have more symmetries. For instance, if H conserves total spin as well, it conserves both the number N ↑ and N ↓ of spinup and spin-down particles, respectively. Assuming that the least significant bits correspond to spin-down, we have: where α ↑ = α/2 M/2 and α ↓ = α−α ↑ ; rank N is given by Eq. (10) as before, allowing us to reuse the corresponding fast algorithm (Fig. 3). For other sets of quantum numbers, e.g., total momentum or orbital parity [31], the situation is different: there, we do not usually have an expression that is both, compact and fast, for ranking states from the corresponding sector, and need to fall back on a search. Ideally, we would like to carry over the advantages from the staggered lookup in a combination to this case: a fast, cachefriendly search.
To this end, let us understand the fast ranking algorithm Fig. 3 as lookup in a tree index. For the case of N = 2 particles in M = 6 flavors and a chunk size of R = 2 this is depicted in Fig. 4a: for any pattern α, we start at the root node. We dispatch on the R least significant bits (α 1 α 0 ) and follow the (α 1 α 0 ) branch to  the next node. We then shift α by 2 bits to the right and repeat the procedure until we arrive at a leaf node, which contains the rank. This procedure already suggests a generalization of the lookup for arbitrary quantum number sets: Elevating Fig. 4a to a data structure, we have constructed a trie (or prefix tree) [24,25]. A trie of radix 2 R for bit patterns of length M is a search tree of height M/R with a branching factor of 2 R , i.e., each non-leaf node of the trie maintains an array of size 2 R , which are pointers to descendant nodes. The leaf nodes instead represent the index i. (This is not exactly congruent with Fig. 4a, where we have omitted branches with only one possible 1: function i = rank-trie(t, α) r ← α mod 2 R
path, a procedure known as suffix compression.) Tries maintain two important benefits of the staggered lookup algorithm: they process data in chunks of R bits, thus only requiring M/R lookups, and they replace branching by an indexing operation into an array of size 2 R .
Tries, however, are not ideal in terms of memory locality and require significant memory overhead. To mitigate this, we can switch to a linearized representation [32], depicted in Fig. 4b: the trie is represented by a single contiguous array (t 0 . . . t T −1 ), which is efficient since the trie is not mutated once created. Each node is then represented by an index k in the array (the root node has index 0). For a branch node, the element t k+r contains the index of the child for the branch corresponding the R active bits of the state being equal to r (white boxes). For a leaf node, t k corresponds to the rank of the state (gray boxes).
We note that this structure is compatible with packing [32]: e.g., in Fig. 4, the branch node corresponding to 10 2 does not have a child for r = 11 2 . We can thus omit the element for r = 11 2 = 3. In general, a sequence of forbidden trailing descendants can be omitted. The same is true for forbidden descendants at the beginning: there, we omit the elements and move the index in the parent node forward by the number of elements deleted. For example, the node 0000 only allows r = 11 2 , so we can omit the first three elements. Consequently the index t 4 , which is the corresponding element in node 00 2 , contains 5 rather than 8, reflecting the omission of 3 elements. As illustrated in Fig. 4c, packing thus significantly reduces the memory demand in cases where the trie is only sparsely populated. (Even more advanced packing strategies, where "holes" in the middle of the index are kept track of and filled by suitable nodes, are possible but beyond the scope of this paper.) Fig. 5 presents the algorithm for ranking a state using the linearized (and optionally packed) trie. Let us walk through the algorithm for, e.g., the trie in Fig. 4 and α = 100100 2 : we start at the root with i = 0. Since r = α 11 2 = 00 2 = 0 and t 0 = 4, we move to i = 4 and consider the next two bits of the state, r = 01 2 . Note that t 5 = 8, even though the data for node 0100 starts at index 9, as we have omitted the unused first child. Finally, we have α = 10 2 and the rank is given as t 8+2 = 12.
As with staggered lookup, we can now trade off memory overhead with lookup performance by adjusting the radix R. Unlike staggered lookup, the memory overhead now scales with the number N s of states in the sector, typically requiring O(f N s ) space, where the overhead factor f ≥ 1 depends on R and the structure of the sector. In the example above, we have f ≈ 2.
Let us add a remark on the trailing elements in the packed trie (Fig. 4c): we could omit those, since they are not pointing to any valid rank. However, including the full root node and the trailing elements of the last leaf node and setting the unused indices to zero ensures that the lookup procedure (Lines 5 and 8 in Fig. 5) will only ever access valid array indices, regardless of whether α is a valid state or not. This allows us to modify the algorithm in Fig. 5 to include a cheap check whether α is indeed a valid state in the sector: after the lookup, we simply verify that t i = α by consulting the unrank lookup table.
Finally, let us turn back to our initial consideration, i.e., making the ranking algorithm suitable to conservation laws beyond the total number of particles (per spin sector). This is possible by simply eliminating all leaves and branches that do not fulfill a given conservation law. In practice, one starts by generating all states on the fly which fulfill particle number conservation in lexicographic order. Next, one filters out those states swhich do not satisfy additional conservation law. This iterator over state numbers is then fed into the trie generation routine, which can directly generate the packed trie.

A. Ranking microbenchmark
To compare the established methods against one another, we first perform a microbenchmark focusing on lookup performance alone. For this, we consider M = 28 spinorbitals and conserved total occupationN and analyze all sectors N = 1, . . . , 27, which corresponds to ranking the combination set C 28 N . To this end, we implemented each ranking strategybisectioning, combination ranking, staggered lookup and trie lookup-as Julia [33] code and compiled with maximum optimization levels. We drew 10 8 random states from each sector with replacement. We then ordered them by numerical value to simulate the situation that while the states generated in the application of Eq. (3) vary somewhat unpredictably, they are usually still ordered to some degree (omitting this ordering increases the cost of bisection search considerably and unfairly.) We then timed 20 iterations on a single AMD Ryzen-3600 CPU core.
The results are presented in Fig. 6. Panel (a) compares lookup times in terms of CPU cycles. We see that established methods (bisectioning and combinadics ranking) are generally slowest, except for the case of low filling: there, combinadics is most efficient since its scaling is with N rather than M . Otherwise, we observe significant speedups for our improved algorithms, i.e., staggered and trie lookups. This speedup is particularly pronounced for close to half-filled sectors (N ∼ M/2), since there the number of states is largest and thus cache misses become more and more of an issue for bisectioning. We further see that staggered lookup slightly outperforms tries for large sectors for the same R, since it is more economical in terms of cache demand. For a more sparsely populated sector (away from the half-filled case), we see that tries are slightly more efficient. In general, both trie and staggered lookups perform well, typically requiring around 10-15 CPU cycles per lookup. Fig. 6(b) presents the corresponding memory demand for each index: here, the baseline is given by bisectioning, which requires an ordered list of all states N s in the sector. The trie indices add a radix-dependent overhead: for the half-filled case, we find an overhead factor f ≈ 2.11 for R = 4, f ≈ 4.58 for R = 8, and f ≈ 6.16 for R = 12. For the more sparsely populated cases N = 5, one finds that the overhead increases more strongly: f ≈ 2.73, 10.45, and 61.06 for R = 4, 8, and 12, respectively. This is to be expected, as in the limit R → M we recover the inverse of the "sparseness" factor, f → 2 M /N s . Fortunately, the total memory demand of the index is still monotonously falling as we increase sparseness.

B. Hubbard chain
To examine the performance of ranking in a more realistic setting, let us study a special case of Eq. (1), namely a chain of M/2 Hubbard atoms: where the sum over spins σ runs over {↑, ↓}. We are interested in computing the lowest energy eigenvalue by constructing the Krylov subspace given sector N, S z (cf. Eq. 12) [34] using the Lanczos algorithm, on-the-fly evaluation (3). We implemented these techniques in Julia and ran the calculations in parallel on six AMD Ryzen-3600 CPU cores on the same node. Fig. 7 presents the average runtime for a single matrixvector multiplication H|ψ per state, i.e., the total runtime divided by the number N s of states in the sector as well as divided by M , in units of nanoseconds. We see that both our ranking techniques, staggered and trie lookup, improve substantially on the state-of-the-art bisectioning, achieving an around three-fold speed-up.
We note that the timings in Fig. 7 not only include rankings. Indeed, Eq. (3) suggests the following, rather crude, estimate for the total runtime: t ≈ N s N t (t rank + t apply + t FMA + t mem ), (17) where N t is the number of non-branching terms (in our case, N t = 3M/2), t rank is the time needed for ranking states, t apply is the time needed for computing the bit operations when applying T in Eq. (3), t FMA is the time needed for multiplying with the corresponding element of ψ and accumulating the result, and t mem is the time for accessing the corresponding component of ψ. Disentangling the constituent times is difficult, but profiling information suggests that around 1/3 of runtime is spent on ranking in the case of staggered and trie lookup. This is consistent with Fig. 6(a), where we only observe a moderate increase of t with N even though one expects linear scaling of t rank . In contrast, for the bisectioning lookup the ranking times completely dominate the computation.
Since trie and staggered lookup scale with M rather than N , the half-filled case N = M/2 in Fig. 6(a) is most favorable for these algorithm. To study a more sparse setting, Fig. 6(b) presents timings for the quarterfilled case N = M/4. There, bisectioning, which scales with min(N, M − N ), cf. Fig. 5, improves compared to staggered and trie lookup. However, we still observe an about four-fold speed-up by switching to the improved ranking algorithms.
Let us finally note that the differences within the improved algorithms are minor, consistent with the ranking microbenchmark in Sec. V A. Interestingly, even though trie lookup with R = 8 consistently outperforms R = 4 in the microbenchmark, the two algorithms have similar performance in the benchmark Fig. 6. This may be due to the fact that R = 8 has a significantly larger memory footprints which becomes more relevant for the full algorithm, so the cache misses may balance out the speed benefit. (Cache pressure is expected to be higher in this benchmark due to the manipulation of the state vectors.)

C. Hubbard ring
To showcase trie ranking (Sec. IV), we choose a system and a set of quantum numbers where staggered lookup is not possible. To do so, we consider again a onedimensional system of M/2 Hubbard atoms, but with periodic boundary conditions: Let us note that H in the momentum basis (18) is significantly less compact than the real-space formulation, cf. Eq. (16), since the Hubbard interaction generates O(M 3 ) terms. For this reason, such symmetries are usually taken into account implicitly in ED codes by generating a set of "representative" states in the realspace basis and then symmetrizing the result to obtain the corresponding momentum state [19]. However, there are other quantum numbers, such as orbital parity in quantum impurity models (also known as the PS quantum number [31]), which can be directly expressed in the real-space occupation number basis.
These caveats nonwithstanding, the momentum basis admits a further quantum number, the total crystal momentum: which commutes with both the Hamiltonian and the occupation number basis and can thus be readily used to lower the block size. Therefore, it provides a benchmark for trie rankings vs. established bisectioning methods.  Fig. 7, the runtime is scaled with M −3 , since that is the scaling of the number of terms in Eq. (18). We see that the speedup of trie lookup vs. bisection is milder here than in Sec. V B. We can attribute this to the greater "sparseness" of states in the space of M -bits, which means that tries have to process more bits per state. However, the benefit is still substantial, with a two-to three-fold improvement in runtime observed.
Notably, instead of plateau for trie-based ranking in Fig. 7, we see an uptick in runtime for large M in Fig. 8. This may be indicative of growing cache pressure in this cases due to the memory footprint of the trie.

VI. CONCLUSIONS AND OUTLOOK
We have removed a computational bottleneck of exact many-fermion calculations by speeding up the ranking of states, thereby providing an efficient mapping between indices of state in the full Fock space and in the block corresponding to a set of quantum numbers. For the common case of conserved particle number, the state-ofthe-art combinadics algorithm present in many ED algorithms can be modified to yield considerable ranking speed-ups at negligible overhead with a staggered lookup. For more complictated quantum number combinations, the state-of-the-art bisectioning algorithm can be replaced by a trie lookup, which has a similar scaling with memory but a much-improved performance. Thanks to these improvements, ranking is no longer the bottleneck of the ED code.
With these optimizations, we expect exact diagonalization to stay competitive as, e.g., a solver for quantum impurity models [11] in cases where the quantum Monte Carlo techniques encounter a significant sign problem or where high numerical accuracy is required for, e.g., analytic continuation [35,36].
Trie ranking is also applicable to non-Abelian symmetries [19] leveraged in spin systems, where instead of ranking states in a quantum number sector we are ranking representatives of a symmetry orbit within a symmetry sector. Combining trie lookup and sublattice coding techniques [19,28] seems particularly promising, as sublattice coding reduces the size and scaling of the states which have to be ranked, which should make the memory overhead associated with tries less problematic also for large systems.
An intriguing challenge for trie rankings are selected CI techniques [37] as well as related Monte Carlo approaches [38,39], which restrict FullCI to a subset of states in the Fock space. These techniques optimize the set, either stochastically or deterministically, by minimizing the ground state energy, which means the set to be ranked cannot be static. This "dynamic ranking" makes the use of linearized and packed tries impractical, so ranking based on hash tables could be competitive.
A unit-tested Julia implementation of the techniques outlined here is available from the authors upon reasonable request and forthcoming as an open-source package. confined to their mask. This means we simply have to fulfill both demands whenever the masks intersect m a m b . When the demands are not fulfilled, the Pauli principle is violated and the value v of the term is set to zero (line 7), otherwise it is-for now-simply the product of both scalars (line 5).
Line 9-14 deal with the actual composition of the two terms. Firstly, in line 9, we construct the mask m as union of the individual masks, since the operator flavors in the product are just the union of the operators in each factor. For the right outer state r, r b takes precedence whenever m b is set, otherwise, the other term can make demands r a (line 10). For the left outer state l, we make a similar argument (line 11). The change mask x is given simply as the symmetric bit difference between left and right (line 12).
The sign mask s is at first set as the simple exclusive or of the individual masks (line 13), however, there is a complication: the creation and annihilation operators alter the state α, so each flavor in m is "locked" in place by T . This means we need to exclude any flavor from the sign mask s that is also present in m (line 14).
The term T we have constructed so far is almost equal to T = T a T b , however, it may still deviate from T by a sign due to the reordering of creators and annihilators encoded in the tuple t. Instead of keeping track of these permutations explicitly, we use a computationally appealing shortcut: we simply compare the effects of T a T b |α and T |α for a state where T |α = 0. One such state is, by construction, |r . Lines 15-18 compute Eq. (A1) for T a T b |r and keep track of the relevant sign masks overlaps in p. Any deviation between T and T is then absorbed into v (line 19) and the algorithm is complete.
We note that the algorithm in Fig. 9 does not only prove that a tuple for on-the-fly application exists, it also offers an extremely fast way to compute products of terms on modern machines, as it relies exclusively on bit operations.