The variance of relative surprisal as single-shot quantifier

The variance of (relative) surprisal, also known as varentropy, has so far mostly played a role in information theory as quantifying the leading-order corrections to asymptotic i.i.d. limits. Here, we comprehensively study its use in deriving single-shot results in (quantum) information theory. We show that it gives genuine sufficient and necessary conditions for approximate state transitions between pairs of quantum states in the single-shot setting, without the need for further optimization. We also clarify its relation to smoothed min- and max-entropies, and construct a monotone for resource theories using only the standard (relative) entropy and the variance of (relative) surprisal. This immediately gives rise to enhanced lower bounds for entropy production in random processes. We establish certain properties of the variance of relative surprisal which will be useful for further investigations, such as uniform continuity and upper bounds on the violation of sub-additivity. Motivated by our results, we further derive a simple and physically appealing axiomatic single-shot characterization of the (relative) entropy which we believe to be of independent interest. We illustrate our results with several applications, ranging from the interconvertibility of ergodic states and Landauer erasure to a bound on the necessary dimension of the catalyst for catalytic state transitions and Boltzmann's H-theorem.


I. INTRODUCTION
Many central results of quantum information theory are concerned with the manipulation of quantum systems in the so-called asymptotic i.i.d. setting, in which one considers the limit of infinitely many independent and identically distributed copies of a quantum system [1][2][3][4][5][6][7][8][9]. While not always physically realistic, this setting is convenient to work in because it allows for the application of standard concentration results from statistics and information theory [10,11]. Recent years have seen considerable effort devoted to more general settings, in which subsystems may be correlated with one another or the size of the system is finite. Conceptually, the most extreme weakening of the i.i.d. setting is the single-shot setting, in which no assumptions about the size of the system or its correlations are made. This setting derives its name from the fact that it can be seen to describe a single iteration of a protocol, in contrast to the i.i.d. setting, which is concerned with infinitely many independent iterations. Several classic results [12][13][14][15][16] have been instrumental to the recent development of majorization-based resource theories [8,[17][18][19][20][21][22][23], which have found diverse applications in entanglement theory, thermodynamics, asymmetry and many more central topics in information theory. By now there exists a detailed understanding of an intuitive trade-off between the above settings (which we detail in later sections). On the one hand, the i.i.d. setting, due to its various assumptions, can usually be characterized by variants of the (quantum) relative entropy, S(ρ‖σ) := tr(ρ(log(ρ) − log(σ))). On the other hand, majorization-like conditions play a central role for single-shot transformations, even extending to the study of approximate transitions.
Many such results are tight in the quasi-classical regime [24,25] (when we restrict to transformations between pairs of states that commute with one another). Nevertheless, the number of conditions that need to be checked increases linearly with the dimension of the states of interest, which makes a systematic and simplified characterization of possible state transitions difficult. Fortunately, the family of smoothed entropies has turned out to be a powerful tool to describe these constraints and to operationally characterize a variety of single-shot tasks [22,[26][27][28][29][30][31][32][33][34][35][36][37][38][39]. However, these quantities typically require an optimization over the potentially high-dimensional state space.
In this work, we develop a complementary approach to the study of the single-shot setting, in which "single-shot effects" are witnessed and quantified by a single quantity, the variance of the (relative) surprisal, where log(ρ/σ) ≡ log(ρ) − log(σ) is the relative surprisal of one quantum state ρ with respect to another σ, and we use log ≡ log 2. In the following, we refer to this quantity simply as the relative variance. The relative variance has been shown to measure leading-order corrections to asymptotic results in (quantum) information theory [40][41][42][43][44][45][46][47][48][49][50][51][52][53][54][55][56], but here we show that its role extends to the genuine single-shot setting. We do this from two points of view: First, we show that the relative variance quantifies single-shot corrections to possible state transitions between pairs of quantum states. These results imply that state transitions between pairs of states with low relative variance are essentially characterized by the relative entropy and hence do not exhibit strong single-shot effects. Simple examples of such states are ergodic states (see below).
Since single-shot effects often represent operational impediments (such as, for example, a decreased ability to extract work in the context of thermodynamics), the above finding motivates the question of whether there is, in some sense, a "cost" associated to obtaining states of low relative variance. We proceed to show that this is indeed the case: reducing the relative variance between a pair of states necessitates a proportional reduction of the relative entropy between those states. In the resource-theoretic setting (defined later), in which the relative entropy is itself known to be a measure of a state's resourcefulness, this finding implies that increasing the operational value of a system in the sense of reducing its relative variance comes at the cost of reducing its value as measured by the relative entropy. Formally, the above trade-off result is an implication of a resource monotone that we construct. We formulate the trade-off between relative variance and relative entropy both for single systems and for the marginal changes of bipartite systems.
Overall, our findings motivate the relative variance as a simple measure of single-shot effects: The smaller the relative variance of pairs of states, the better their single-shot properties are described by the relative entropy, and vice versa. In addition to the above results, we clarify the relation between the relative variance and the smoothed entropies, relate our single-shot results to the known asymptotic leading order corrections mentioned above, and present a novel axiomatic characterization of the (relative) entropy from a single, physically motivated axiom. We believe the latter finding to be of independent interest. Apart from its applications in the context of resource theories, our work can also be seen as a thorough investigation of the mathematical properties of the variance of relative surprisal, such as uniform continuity bounds, corrections to sub-additivity and the relation of the variance of relative surprisal to smoothed Rényi divergences and the theory of approximate majorization.
The remainder of this paper is structured as follows: Our results are concerned with generic state transitions between pairs of states in the quasi-classical setting. However, following a formal introduction of the setup and terminology below, for the sake of clarity, in Section II we first provide an overview of our main results for the special case of state transitions under unital channels, which also corresponds to the resource theory of purity. The results are then shown in full generality and further discussed in Section III. Throughout, we focus on the formal results and provide applications mostly for illustration in boxes, leaving a more detailed study of applications and implications of our results to future work.

A. Setup and Notation
Let S and S′ be quantum systems represented on Hilbert spaces with fixed and finite respective dimensions d and d′. Further, let D(S) be the set of quantum states on S, and similarly for S′. Given two pairs of states ρ, σ ∈ D(S) and ρ′, σ′ ∈ D(S′), sometimes referred to as dichotomies, we write (ρ, σ) ≻ (ρ′, σ′) if there exists a quantum channel E : D(S) → D(S′) such that E(ρ) = ρ′ and E(σ) = σ′. The pre-order ≻ has been studied both classically and quantumly (e.g. [13,17,57,58]) and forms the backbone of many resource theories. In a resource theory, a set of so-called free states D_F ⊆ D(S) is specified, together with a set of quantum channels F such that each of these channels maps free states to free states, F[D_F] ⊆ D_F. These channels are therefore called free. This terminology originates from the idea that free states constitute a class of states that are easy to prepare in a given physical context, while free channels constitute physical operations that are easy to implement in this context (see Ref. [23] for a review of resource theories). One is then concerned with the special case of the ordering in which E is free. In the case that there is only a single free state σ, this corresponds to the special case σ′ = σ and induces the pre-order ≻_σ on D(S), known as σ-majorization, where ρ ≻_σ ρ′ is equivalent to (ρ, σ) ≻ (ρ′, σ). As such, the results presented below can naturally be applied to those resource theories that follow the above form, although we emphasize that they hold more generally as well.
Since we are often concerned with approximate state transitions, we further write (ρ, σ) ≻_ε (ρ′, σ′) whenever there exists a state ρ̃′ such that (ρ, σ) ≻ (ρ̃′, σ′) and D(ρ̃′, ρ′) := ½‖ρ̃′ − ρ′‖₁ ≤ ε. Finally, the i.i.d. setting corresponds to the special case of dichotomies of the form (ρ^⊗n, σ^⊗n) ≻ (ρ′^⊗n, σ′^⊗n) for some n ∈ N. Here, one usually considers the asymptotic limit n → ∞. The statement made in the introduction, that in this limit the relative entropy characterizes possible state transitions, is based on the well-known fact that, for any two dichotomies (ρ, σ) and (ρ′, σ′) and any ε > 0, there exists a sufficiently large n such that (ρ^⊗n, σ^⊗n) ≻_ε (ρ′^⊗n, σ′^⊗n) if and only if S(ρ‖σ) > S(ρ′‖σ′). We will recover this statement as a special case of our results below.
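As a concrete illustration, the two quantities entering this asymptotic statement, the relative entropy S(ρ‖σ) and the trace distance D, can be evaluated directly from the spectra for commuting (quasi-classical) states. A minimal sketch (the function names are our own, not from the paper):

```python
import numpy as np

def relative_entropy(p, q):
    """S(p||q) = sum_i p_i (log2 p_i - log2 q_i), with the convention 0 log 0 = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf  # support of p must lie inside the support of q
    return float(np.sum(p[mask] * (np.log2(p[mask]) - np.log2(q[mask]))))

def trace_distance(p, q):
    """D(p, q) = (1/2) sum_i |p_i - q_i| for commuting states."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * float(np.sum(np.abs(p - q)))
```

For a pure state versus the uniform distribution in dimension d, `relative_entropy` returns log2(d), the maximal value, consistent with the purity interpretation used below.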

II. OVERVIEW OF MAIN RESULTS
In this section, we provide an overview of our results for the special case σ = σ′ = I := 1_d/d, where I is our shorthand for the maximally mixed state in d dimensions, whereas 1_d is the d-dimensional identity operator. This choice of σ corresponds to the resource theory of purity, also known as the resource theory of stochastic nonequilibrium [59], and captures the essential insights from our results while being easier to state.
Channels that preserve the maximally mixed state are also called unital channels¹ and the ordering ≻_I is known as majorization². Unital channels are often used to model random processes. For instance, a (strict) subset of unital channels are random unitary channels, which describe the evolution of a system under a unitary operator drawn at random from some fixed distribution. As such, the resource theory of purity is concerned with describing the evolution and operational "value" of quantum states in the presence of random processes.

¹ In general, unital channels are channels that map the identity of their domain to the identity of their image, and hence are more general than the channels we consider here. We ignore this difference because here we are only interested in strict preservation of the input state.

² There exist various equivalent definitions of majorization between quantum states. At the level of d-dimensional probability distributions, we say that p ≻_I q iff for all k = 1, 2, . . . , d − 1 it holds that Σ_{i=1}^k p_i^↓ ≥ Σ_{i=1}^k q_i^↓, where p^↓ denotes p sorted in non-increasing order. An alternative definition to the one given above is to say that ρ majorizes ρ′ when the vector of eigenvalues of ρ majorizes that of ρ′.
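For commuting states, the majorization condition of footnote 2 reduces to a check on sorted partial sums. A minimal sketch (the helper name `majorizes` is our own):

```python
import numpy as np

def majorizes(p, q, tol=1e-12):
    """True iff p majorizes q: for every k, the sum of the k largest entries of p
    is at least the sum of the k largest entries of q (spectra are padded with
    zeros to equal length; both are assumed normalized)."""
    n = max(len(p), len(q))
    ps = np.sort(np.pad(np.asarray(p, dtype=float), (0, n - len(p))))[::-1]
    qs = np.sort(np.pad(np.asarray(q, dtype=float), (0, n - len(q))))[::-1]
    return bool(np.all(np.cumsum(ps) >= np.cumsum(qs) - tol))
```

In the resource theory of purity, a unital channel mapping ρ to ρ′ exists iff the spectrum of ρ majorizes that of ρ′; in particular, every state majorizes the maximally mixed one, and a pure state majorizes everything.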

A. Variance of surprisal
The central quantity in this work is the relative variance, defined in Eq. (1). For σ = I, the relative variance reduces to the variance (of surprisal) V(ρ) := tr(ρ(log ρ)²) − S(ρ)², where S(ρ) := log(d) − S(ρ‖I) = −tr(ρ log(ρ)) is the von Neumann entropy (itself the mean of the surprisal, −log(ρ)). The variance of surprisal is also known as the information variance or varentropy, and as the capacity of entanglement in the context of entanglement in many-body physics [60][61][62].
Operationally, the variance can be understood, for instance, as the variance of the length of a codeword in an optimal quantum source code. However, while, as mentioned in the introduction, it is well known to quantify the leading order corrections to various (quantum) information theoretic tasks in the asymptotic limit, its relevance in the single-shot setting has not yet, to the authors' knowledge, been thoroughly investigated (recently, some formal properties have, however, been developed in Ref. [63]). In this work, we study the (relative) variance and show that it in fact provides a useful measure of single-shot effects for approximate state transitions.
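For commuting states, both the variance of surprisal and the relative variance can be computed directly from the spectra. A sketch with our own helper names, using the convention log ≡ log 2:

```python
import numpy as np

def variance_of_surprisal(p):
    """V(p): variance of the surprisal -log2 p_i with respect to p itself."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    surprisal = -np.log2(p)
    mean = np.sum(p * surprisal)  # this mean is the entropy S(p)
    return float(np.sum(p * (surprisal - mean) ** 2))

def relative_variance(p, q):
    """V(p||q): variance of the relative surprisal log2(p_i/q_i) under p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    llr = np.log2(p[mask]) - np.log2(q[mask])
    mean = np.sum(p[mask] * llr)  # this mean is the relative entropy S(p||q)
    return float(np.sum(p[mask] * (llr - mean) ** 2))
```

For σ = I the relative surprisal differs from the surprisal only by the constant shift log d, so `relative_variance(p, uniform)` coincides with `variance_of_surprisal(p)`, matching the reduction described above.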
We begin by mentioning some properties that the variance of surprisal fulfills, and which we use throughout the paper:

1. Additivity under tensor products: V(ρ₁ ⊗ ρ₂) = V(ρ₁) + V(ρ₂).

3. Uniform continuity (Lemma 10), with an explicit dimension-dependent constant K.

4. Bounds on the violation of sub-additivity (Lemma 11).
5. V (ρ) = 0 if and only if all non-zero eigenvalues of ρ are the same. We call such states flat states. Examples include any pure state and the maximally mixed state.
6. For fixed dimension d ≥ 2, the state ρ̂_d with maximal variance has the spectrum [64]

spec(ρ̂_d) = (1 − r, r/(d − 1), . . . , r/(d − 1))    (2)

with r being the unique solution to a transcendental equation. In the limit of large d, r ≈ 1/2, so that the maximal variance scales as ¼ log²(d).

Properties 3 and 4 are original contributions of this work and are part of our main technical results.
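Some of these properties are easy to verify numerically. The sketch below checks additivity (Property 1) and flatness (Property 5), and scans the one-parameter family of Eq. (2) for the maximizing r; all helper names are ours, and note that for moderate d the numerical optimum still lies well below 1/2, drifting toward 1/2 only as d grows:

```python
import numpy as np

def V(p):
    """Variance of surprisal of a probability vector p (base-2 logs)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    s = -np.sum(p * np.log2(p))
    return float(np.sum(p * (np.log2(p) + s) ** 2))

def spectrum(d, r):
    """The family of Eq. (2): (1 - r, r/(d-1), ..., r/(d-1))."""
    return [1.0 - r] + [r / (d - 1)] * (d - 1)

def best_r(d, grid=999):
    """Numerically maximize V over the family of Eq. (2) by a grid scan."""
    rs = np.linspace(0.001, 0.999, grid)
    return float(max(rs, key=lambda r: V(spectrum(d, r))))
```

A grid scan is a crude but transparent substitute for solving the transcendental equation for r; for quantitative use one would refine it with a one-dimensional optimizer.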

B. Sufficient criteria for single-shot state transitions
It is well known that, in the i.i.d. limit, approximate majorization reduces to an ordering with respect to the von Neumann entropy. More precisely, given two states ρ and ρ′, S(ρ′) > S(ρ) implies that for any ε > 0 there exists a number N ∈ N such that ρ^⊗n ≻_{I,ε} (ρ′)^⊗n for all n ≥ N. However, this is not the case in the single-shot setting. Here, the question whether ρ ≻_I ρ′ in full generality depends on d − 1 independent constraints on the spectra of ρ and ρ′. This makes dealing with exact single-shot state transitions considerably more difficult. Our first result shows that for approximate state transitions there nevertheless exist simple sufficient conditions at the single-shot level that also involve the von Neumann entropy, but with a correction quantified by the variance:

Result 1 (Sufficient conditions for approximate state transitions). Let ρ, ρ′ be two states on S and 1 > ε > 0. If the entropy gap S(ρ′) − S(ρ) is sufficiently large compared to the variances V(ρ), V(ρ′) and the error ε (the precise condition is stated in the generalization, Thm 8), then ρ ≻_{I,ε} ρ′.

We emphasize that this is a fully single-shot result. It shows that the variance quantifies the single-shot deviation from the above i.i.d. case. A convenient reformulation of Result 1 is as follows: for two states ρ, ρ′ on S with S(ρ′) > S(ρ), solving for ε in Result 1 shows that ρ ≻_{I,ε} ρ′ can be achieved with an error ε given explicitly in terms of the entropies and variances of the two states. An appealing feature of this result is that it does not require any optimization, as is typically present in results relying on smoothed entropies (compare, for example, Result 1 and its generalization Thm 8 to the results in [58]). When applied to state transitions under unital channels in the i.i.d. limit, Result 1 straightforwardly produces finite-size corrections towards asymptotic interconvertibility: given two states ρ and ρ′ with S(ρ′) > S(ρ), it implies that ρ^⊗n ≻_{I,ε_n} (ρ′)^⊗n with an error ε_n which vanishes in the limit n → ∞. For finitely many copies of the two states, the variances of the initial and final states bound the achievable precision. In Appendix F we provide a more detailed analysis of the i.i.d.
case, where we use Result 1 (and its generalization to σ-majorization) to study convertibility between sequences of n i.i.d. states for large but finite n and with an error ε_n such that ε_n n → ∞, but possibly ε_n → 0. This can be seen as a simple form of moderate-deviation analysis, and in particular we recover the "resonance" phenomenon reported in Refs. [55,56], namely that second-order corrections vanish when the variances of the initial and final states coincide, with the proof being as simple as solving a quadratic equation. Result 1 implies that state transitions between pairs of initial and final states with low variance are essentially characterized by the entropy. As an application, in Box 1, we prove a simpler version of a recent result on the macroscopic interconvertibility of ergodic states under thermal operations [65,66]. Finally, let us note that Eq. (3) is in general not very tight, since we know from Hoeffding-type bounds that, in the i.i.d. limit, the amount of probability outside of the typical window (which directly contributes to the error) scales as ∝ exp(−n). Nevertheless, the absence of any trailing terms makes Eq. (3) simple to evaluate, especially in single-shot scenarios.

C. Relation to smoothed min-and max-entropies
As mentioned in the introduction, smoothed generalized entropies are often used to describe single-shot processes. For instance, the continuous family of Rényi entropies has been found to characterize possible single-shot transitions in the semi-classical setting [37,68]. Among those entropies, the smoothed min- and max-entropies are of particular prominence, since they enjoy clear operational meanings in various information-processing tasks such as randomness extraction or data compression (see, e.g., Refs. [26,30]) and, in a sense, quantify complementary single-shot properties of quantum states. These quantities, the precise definitions of which are given in Section III C, can differ significantly from the von Neumann entropy for arbitrary states. However, the following result shows that one can bound this difference by the variance.

Box 1. Interconvertibility of ergodic states
As an application of Result 1, we discuss the interconvertibility of ergodic states in the unital setting. Informally speaking, ergodic states are states on an infinite chain of identical, finite-dimensional Hilbert spaces of dimension d, enumerated by Z and called sites below, which have the property that correlations between observables located at distant sites converge to zero as their distance is increased. See Ref. [65] for a detailed description of ergodic states. Importantly, such states are in general correlated, examples being ground states of gapped, local Hamiltonians or thermal states of many-body systems away from the critical temperature. Nevertheless, if ρ_n denotes the density matrix of n consecutive sites of the chain with entropy S_n = S(ρ_n), the (quantum) Shannon-McMillan-Breiman theorem shows that for arbitrarily small ε > 0 and sufficiently large n, we can find an approximation ρ̃_n of ρ_n with the property that each eigenvalue p_j of ρ̃_n fulfills [67] |−log(p_j) − S_n| ≤ 2εn and D(ρ_n, ρ̃_n) ≤ ε. Thus, the variance V(ρ̃_n) fulfills a corresponding bound in terms of εn and S̃_n = S(ρ̃_n).
By uniform continuity of the variance, Lemma 10, we therefore obtain a corresponding bound on V(ρ_n), with K̃ some constant, where we used ε² ≤ √ε. Result 1 now tells us that if we have two ergodic states with entropy densities s and s′, i.e. S_n = sn and S′_n = s′n, such that s < s′, then for any ε > 0 and sufficiently large n, we can convert ρ_n to ρ′_n using a unital channel with an error that can be made arbitrarily small.
Result 2 (Bounds on smoothed min- and max-entropies). Let 1 > ε > 0 and let ρ be a state on S. Then the smoothed min- and max-entropies of ρ differ from the von Neumann entropy S(ρ) by at most a bound that factorizes into the variance V(ρ) and a function of the smoothing parameter ε.

As a straightforward corollary, Result 2 implies a two-sided estimate, Eq. (4), valid for any 1 > ε > 0 and any state ρ. Result 2 has the appealing feature of providing an upper bound that factorizes into the variance and a function of the smoothing parameter ε. In analogy with Result 1, Result 2 shows that for states with small variance, finite-size corrections (e.g. to coding rates) are less pronounced, making these states ideal candidates for information encoding and transmission. This makes intuitive sense if we recall that the spectrum of a state with zero variance is uniform over its support. Within the class of states with the same entropy, these flat states are therefore maximally "compressed" and "random". As a side note, we observe that various models for "batteries" used in single-shot quantum thermodynamics restrict to battery states with zero or very low relative variance. This choice can conveniently be interpreted in terms of Eq. (4), since S_min and S_max quantify the single-shot work of creation and the extractable work, respectively, for non-equilibrium states [19,37]. Restricting to low-variance battery states therefore ensures that the process of storing and extracting work from a battery involves a minimal dissipation of heat.
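In one common non-smoothed convention, the min- and max-entropies are simple spectral functionals, S_min(ρ) = −log λ_max(ρ) and S_max(ρ) = log rank(ρ), which sandwich the von Neumann entropy and collapse onto it exactly for flat states. A quick sketch for commuting states (helper names ours; the smoothed versions of Section III C additionally optimize over an ε-ball):

```python
import numpy as np

def s_min(p):
    """Min-entropy: -log2 of the largest eigenvalue."""
    return float(-np.log2(max(p)))

def s_max(p):
    """Max-entropy (Renyi-0 convention): log2 of the support size."""
    return float(np.log2(sum(1 for x in p if x > 0)))

def s_vn(p):
    """Von Neumann / Shannon entropy."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))
```

The ordering s_min ≤ s_vn ≤ s_max holds for every spectrum, and the gap between the extremes is the quantity that Result 2 controls via the variance.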

D. Decrease of variance
The previous two results establish the variance as a measure of single-shot effects, bounding the extent of such effects in the context of approximate state transitions and operational tasks such as data compression or randomness extraction. In particular, they imply that the manipulation of states with low variance produces less overhead due to finite-size effects. State transitions between states of low variance can therefore be operationally advantageous. This motivates the question of whether there exists a resource-theoretic "cost" associated to decreasing the variance of a state. We show that this is indeed the case: under unital channels, a decrease of variance lower bounds the entropy production.
The statement above, formulated in Result 3, is a consequence of a new resource monotone that we derive.
Resource monotones that are non-increasing with respect to majorization are called Schur convex, and non-decreasing ones Schur concave. Resource monotones are an important tool in the study of resource theories. For example, for a Schur convex function f, f(ρ′) > f(ρ) suffices to conclude that ρ does not majorize ρ′, and the former is often far easier to check than the latter. An example of a Schur convex function is the purity tr(ρ²) of a state ρ, while the entropy S(ρ) is Schur concave. The variance itself is evidently not a monotone, which might partially explain why it has so far not been studied resource-theoretically. However, the following lemma shows that the variance and entropy jointly give rise to a monotone.
Lemma 2 (Monotone). The function M, constructed from the entropy and the variance of surprisal, is Schur concave.
By Schur concavity, we have 0 ≤ M(ρ) ≤ (1/ln(2) + log(d))². Notably, unlike many commonly used monotones, M is not additive with respect to product states. Fig. 1 compares the regions of increasing M and increasing entropy with the majorization ordering for two initial states in d = 3 and fixed eigenbasis, illustrating that for some states M provides strictly stronger necessary conditions for the majorization ordering than the entropy S.
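The non-monotonicity of the variance itself is easy to exhibit numerically: mixing a flat state toward the maximally mixed state (a unital evolution in which each step is majorized by the previous one) first raises and then lowers V, so V alone is neither Schur convex nor Schur concave. A minimal sketch (helper names ours):

```python
import numpy as np

def V(p):
    """Variance of surprisal (base-2 logs)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    s = -np.sum(p * np.log2(p))
    return float(np.sum(p * (np.log2(p) + s) ** 2))

def depolarize(p, lam):
    """Mix p with the uniform distribution: a doubly stochastic (unital) map."""
    p = np.asarray(p, dtype=float)
    return (1 - lam) * p + lam * np.full_like(p, 1.0 / len(p))
```

Starting from the flat state (1/2, 1/2, 0), V is zero, becomes strictly positive at intermediate mixing, and returns to zero at the maximally mixed state, so no fixed sign of ΔV can hold along majorization chains.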
By means of Lemma 2, we can derive the following bound on entropy production, which is our third main result and proven in the more general statement of Corollary 13.
Result 3 (Lower bound on entropy production). Let ρ, ρ′ be a pair of states on S. If ρ ≻_I ρ′, then the entropy production S(ρ′) − S(ρ) is lower bounded in terms of the decrease of variance, as stated in Eq. (5).

Result 3 shows that a decrease of variance under unital channels can only come at the cost of increasing the system's entropy. It also complements the previous results, in that it shows that states with positive variance necessarily exhibit finite-size effects: the entropy fails to characterize single-shot transitions, as witnessed by the variance. Still, the i.i.d. limit is consistent with Eq. (5), since in the asymptotic limit the LHS grows linearly, while the other terms remain constant to leading order. At the same time, there exist sequences of state transitions for which the constraint imposed by Eq. (5) remains non-trivial in the limit of large system size. An example is given by transitions from the state ρ̂_d, as defined by Eq. (2), to any state of constant variance, in the limit of growing d. As a simple application, in Box 2, we apply Result 3 to the task of erasure and find corrections to Landauer's principle that are quantified by the variance.

Box 2. Finite-size corrections to Landauer's principle
Landauer's principle states that the erasure of information physically requires the dissipation of entropy, which consumes work, converting it into heat [69,70]. A simple resource-theoretical model of an erasure process consists of a unital channel acting on a system S whose state is to be erased (i.e. mapped to a fixed pure state |ψ⟩) together with an information battery B that acts as a source of purity. In the simplest setting, B is an n-qubit system, with each qubit being either in a pure state |0⟩ or maximally mixed. We define the finite-size work cost (in units of k_B ln(2)) of erasing an initial state ρ as the size of the smallest information battery that allows for an erasure of ρ, i.e. the smallest integer n such that ρ ⊗ |0⟩⟨0|^⊗n ≻_I |ψ⟩⟨ψ| ⊗ (1₂/2)^⊗n. The usual formulation of Landauer's bound, n ≥ S(ρ), is then a simple consequence of the monotonicity of the entropy.
However, applying Result 3 yields Eq. (6), which provides corrections to Landauer's bound that increase with the initial variance of ρ. Indeed, we note that Eq. (6) remains true even when the battery size is not constrained, i.e. one can allow the initial and final battery states to contain arbitrarily large "reservoirs" of pure and maximally mixed qubits, that is, states of the form |0⟩⟨0|^⊗λ₁ ⊗ (1₂/2)^⊗λ₂ and |0⟩⟨0|^⊗(λ₁−n) ⊗ (1₂/2)^⊗(λ₂+n) for arbitrarily large λ₁, λ₂. At the same time, the correction term in Eq. (6) can easily be made to vanish in the presence of a bystander system whose state is returned unchanged and uncorrelated from S and B. Such systems are technically known as trumping catalysts (see Box 3), and hence the above bound is not robust to this simple extension of the setting.
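The finite-size work cost defined in Box 2 can be found by brute force: erasure with an n-qubit battery is possible iff the spectrum of ρ majorizes the uniform distribution on 2^n outcomes. A sketch (the function name and the `n_max` cutoff are ours):

```python
import numpy as np

def erasure_cost(spec, n_max=30):
    """Smallest n such that rho ⊗ |0..0><0..0| majorizes |psi><psi| ⊗ (1/2)^⊗n,
    i.e. such that the sorted spectrum of rho majorizes the uniform vector
    on 2^n outcomes."""
    spec = np.sort(np.asarray(spec, dtype=float))[::-1]
    for n in range(n_max + 1):
        m = 2 ** n
        # top-m partial sums of the (zero-padded) spectrum
        padded = np.pad(spec, (0, max(0, m - len(spec))))[:m]
        if np.all(np.cumsum(padded) >= np.arange(1, m + 1) / m - 1e-12):
            return n
    return None
```

For the spectrum (0.9, 0.025, 0.025, 0.025, 0.025) the entropy bound gives n ≥ 1, while the single-shot cost is 3, since the rank alone already forces 2^n ≥ 5: a finite-size effect of exactly the kind the variance correction in Eq. (6) quantifies.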

E. Bounds on marginal entropy production
In the previous subsection, it was shown that a decrease in the variance lower-bounds the entropy production under unital channels. However, not all quantum channels are unital, and hence it is natural to ask whether a similar result exists for a more general class of channels. It is clear that Result 3 cannot be generalized to all channels: for instance, the channel that maps every input state to a fixed pure state reduces both entropy and variance. However, using the properties of the variance and the previous results, we can formulate an analogous lower bound for arbitrary quantum channels by considering their effect on the environment. To state those results, we first note the well-known fact that every quantum channel on a quantum system S can be understood as the local effect of a unital channel acting on S together with an environment E. More formally, any channel E from S to itself can be written as E(ρ) = tr_E(U(ρ ⊗ ρ_E)) for some initial state ρ_E of the environment and unital channel U on the joint system SE. We call the pair (U, ρ_E) a dilation of E. By the Stinespring dilation theorem, one can always choose ρ_E to be pure and U to be a unitary channel for sufficiently large environment dimension d_E, but here, in the context of the resource theory of purity, we use the above more general representation.
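A toy example of such a dilation: take U to be the (unitary, hence unital) SWAP channel on SE. The induced channel on S is then the constant channel ρ ↦ ρ_E, which is maximally non-unital even though its dilation is unital. A sketch (helper names ours):

```python
import numpy as np

def apply_dilation(rho, rho_E, U):
    """E(rho) = tr_E( U (rho ⊗ rho_E) U† ) for a unitary U on the joint system."""
    d_S, d_E = rho.shape[0], rho_E.shape[0]
    joint = U @ np.kron(rho, rho_E) @ U.conj().T
    joint = joint.reshape(d_S, d_E, d_S, d_E)
    return np.einsum('ijkj->ik', joint)  # partial trace over the environment

# SWAP on two qubits, in the computational basis
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)
```

Feeding in |+⟩⟨+| on S and diag(0.9, 0.1) on E returns diag(0.9, 0.1): the output is ρ_E regardless of the input, which is why Result 3 alone cannot constrain such channels and the marginal formulation of Result 4 is needed.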
We next show that one can extend Result 3 to local changes of variance and entropy. Let (U, ρ_E) be a dilation of a given quantum channel E. For an initial state ρ_S on S, let ρ′_S = E(ρ_S) denote the final state on S and let ρ′_E = tr_S(U(ρ_S ⊗ ρ_E)) denote the final state on E. Moreover, denote by ΔS_S := S(ρ′_S) − S(ρ_S) and ΔV_S := V(ρ′_S) − V(ρ_S) the changes of entropy and variance on S, respectively, and similarly for the environment. Finally, let I_{S:E} denote the mutual information between S and E after the application of U. We then have the following:

Result 4 (Lower bound on marginal entropy production). Given a quantum channel E from S to itself, let (U, ρ_E) be any dilation of E where U is a unital map. Then the marginal entropy production ΔS_S + ΔS_E is lower bounded in terms of the decrease of the marginal variances and the mutual information I_{S:E}, as stated in Eq. (8).

This result follows straightforwardly from combining Result 3 with Property 4 of the variance (or, more precisely, Lemma 11) together with the subadditivity of the von Neumann entropy. It is particularly interesting from a resource-theoretic point of view, since in such theories the environment E often explicitly models a particular kind of physical system, such as a thermal bath, a clock, a battery, or a catalyst. For instance, we may then consider the setting of Landauer erasure described in Box 2, with the additional requirement that I_{S:E} = 0. In Box 3, we also apply Result 4 to gain insight into catalytic processes, in particular, to derive bounds on the dimension of the catalyst required for certain processes.

Box 3. Bound on catalyst dimension for state transitions
It is well known that the set of possible state transitions in a resource theory can be enlarged with the help of catalysts, that is, auxiliary systems whose local state remains unchanged in a process. In terms of the notation established in the main text, and given two quantum states ρ, ρ′ ∈ D(S), we write ρ ≻_C ρ′ if there exists a quantum channel E with dilation (U, ρ_E) such that ρ′_E = ρ_E, that is, if the local state of the environment remains unchanged.
Moreover, we write ρ ≻_T ρ′ if the dilation can be chosen such that I_{S:E} = 0, i.e. the catalyst not only remains locally unchanged but is also returned uncorrelated from S. This relation is known as trumping [71][72][73]. Clearly, ρ ≻_T ρ′ implies ρ ≻_C ρ′, while the converse relations in general do not hold. As such, catalysts enable previously impossible state transitions. Indeed, it was recently shown that if ρ and ρ′ are two full-rank states, then ρ ≻_C ρ′ is equivalent to S(ρ′) > S(ρ) [74], a result that has found applications in the context of fluctuation theorems in (quantum) thermodynamics [75] and can be further strengthened to the case that U is unitary [76,77]. What these results are silent about, however, is the required size of the catalyst. Here, we apply Result 4 to show that transitions between states with similar entropy that decrease the variance can only be realized by means of a catalyst of very large dimension. In particular, consider any state transition ρ ≻_C ρ′ between full-rank states that decreases the variance while increasing the entropy by at most a small amount δ. The resulting bound follows from the fact that ΔV_E = ΔS_E = 0, the monotonicity of f and of the logarithm, as well as I_{S:E} ≤ S(ρ′) − S(ρ). It shows that, for any fixed ΔV_S and fixed system dimension d_S, d_E has to grow as d_E ≥ O(exp(δ^{−1/8})) for the bound to be satisfied. For state transitions in which the entropy change is small, reducing the variance is therefore possible only at the expense of using a large catalyst.

F. Local monotonicity and entropy
Result 4 is non-trivial only when the RHS of Eq. (8) is positive (i.e. when the decrease of the marginal variances is significant compared to the mutual information). This is because the LHS is always non-negative, a property we call local monotonicity with respect to maximally mixed states (and unital channels). The local monotonicity of the entropy follows straightforwardly from the fact that it is Schur concave, additive and sub-additive. Our last result shows that, conversely, this property is essentially unique to the von Neumann entropy: it singles out the latter from all continuous functions on quantum states.
To state our result, let us first define local monotonicity more formally. Consider a function f on quantum states on finite-dimensional Hilbert spaces. Let I₁ and I₂ be the maximally mixed states on systems S₁ and S₂, and let U be a unital channel, namely U(I₁ ⊗ I₂) = I₁ ⊗ I₂. We say that f is locally monotonic with respect to maximally mixed states if for any such Iᵢ, U and two states ρᵢ ∈ D(Sᵢ), we have f(ρ′₁) + f(ρ′₂) ≥ f(ρ₁) + f(ρ₂), where ρ′₁ = tr₂ U(ρ₁ ⊗ ρ₂) and similarly for ρ′₂. We then have the following result:

Result 5 (Uniqueness of von Neumann entropy). Let f be a continuous function that is locally monotonic with respect to maximally mixed states. Then f(ρ) = a S(ρ) + b_d, where S is the von Neumann entropy, a ≥ 0 is a constant, d is the Hilbert-space dimension of ρ, and b_d depends only on d but not otherwise on ρ. It is sufficient for this result to restrict the set of channels to unitary channels.
The proof of the result can be found in Appendix H. While there exist many axiomatic characterizations of the entropy, we consider the above interesting because its only axiom (apart from continuity), local monotonicity, is directly motivated by physical, as opposed to mathematical, considerations. This is because physics is often concerned with the possible changes of local quantities in the course of physical processes. As an example, in Box 4, we apply Result 5 to derive a version of Boltzmann's H-theorem.
Finally, going back to Result 4, we see that just as Result 3 strengthens the monotonicity of the entropy, Result 4 strengthens its local monotonicity.

Box 4. A version of Boltzmann's H theorem
Here we note that one can derive a version of Boltzmann's H-theorem as a simple corollary of Result 5: Consider a "gas" of N independent quantum systems initially in the product state ρ(0) = ⊗_{i=1}^N ρ_i. At any point in time t, two of these systems, call them i and j, first undergo a joint evolution, described by a (possibly random) unitary channel U_t. We further assume that, following this interaction, any correlations between these two particles vanish (this is the infamous "Stoßzahlansatz"). A single iteration of this process then yields a corresponding chain of states; all other systems in the gas remain unchanged during it. We are now interested in finding a continuous, real-valued function f whose sum over the individual systems is non-decreasing along this chain, for all t and all possible U_t. Result 5 then implies that such an f exists and is essentially unique: it is given by the von Neumann entropy (up to affine rescaling).
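The mechanism behind this H-theorem, that a joint (random) unitary on a product state can only increase the sum of the marginal von Neumann entropies, is easy to check numerically. The following sketch is ours and not part of the paper (assuming Python with numpy; all helper names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_density_matrix(d, rng):
    # Random density matrix rho = G G† / tr(G G†) from a Ginibre matrix G.
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    m = g @ g.conj().T
    return m / np.trace(m).real

def random_unitary(d, rng):
    # Random unitary from the QR decomposition of a Ginibre matrix.
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(g)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def entropy(rho):
    # von Neumann entropy in bits.
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-12]
    return float(-np.sum(ev * np.log2(ev)))

def marginals(rho, d1, d2):
    # Partial traces of a state on a (d1 x d2)-dimensional bipartite system.
    r = rho.reshape(d1, d2, d1, d2)
    return np.trace(r, axis1=1, axis2=3), np.trace(r, axis1=0, axis2=2)

# One "collision": a joint unitary on a product state, after which the
# correlations are discarded (the Stosszahlansatz step).
d = 2
for _ in range(20):
    rho1, rho2 = random_density_matrix(d, rng), random_density_matrix(d, rng)
    U = random_unitary(d * d, rng)
    out = U @ np.kron(rho1, rho2) @ U.conj().T
    m1, m2 = marginals(out, d, d)
    # Unitary invariance plus subadditivity force the local entropy sum up.
    assert entropy(m1) + entropy(m2) >= entropy(rho1) + entropy(rho2) - 1e-9
```

The assertion never fires, because S(ρ′_1) + S(ρ′_2) ≥ S(ρ′_12) = S(ρ_12) = S(ρ_1) + S(ρ_2), exactly the subadditivity-plus-additivity argument behind local monotonicity.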

III. MAIN RESULTS FOR GENERIC QUANTUM CHANNELS
In the previous section, we have provided an overview of our main results for the special case of the resource theory of purity. We now turn to an exposition of our results in their full generality. All previously mentioned results are special cases of the general results presented here. Since we discussed the interpretation of the results already in the last section, we now focus on the technical and formal presentation. Some of the more technical proofs are nevertheless delegated to the appendices.

A. Notation and main concepts
Recall the pre-order defined at the beginning of the last section for pairs of states (ρ, σ). A well-known example of this ordering that goes beyond majorization is β-majorization in quantum thermodynamics, where σ = σ′ is the thermal state of the system at inverse temperature β [19]. Throughout the following, we focus on the quasi-classical setting, in which the initial and final pairs commute, i.e. [ρ, σ] = [ρ′, σ′] = 0.

Lorenz curves
A key tool in proving our sufficiency results is a well-known connection between the pre-order and the Lorenz curve. For self-consistency, we present this connection and all relevant constructions in the notation of this paper. In particular, let ρ and σ be two commuting positive-semidefinite operators on the same d-dimensional Hilbert space H, and denote by {|i⟩}_{i=1}^d an orthonormal basis of H that simultaneously diagonalizes both ρ and σ, i.e. σ = Σ_i s_i |i⟩⟨i| and ρ = Σ_i p_i |i⟩⟨i|. Furthermore, let us assume w.l.o.g. that the basis {|i⟩}_{i=1}^d orders ρ relative to σ, namely p_i/s_i ≥ p_{i+1}/s_{i+1} for any i = 1, . . . , d − 1. Note that in this ordering neither the (p_i)_i nor the (s_i)_i are necessarily ordered themselves. Given these notations, we can now introduce the Lorenz curve L_{ρ|σ}: the piecewise linear curve interpolating the points (Σ_{i=1}^k s_i, Σ_{i=1}^k p_i) for k = 0, . . . , d.
If ρ and σ do not commute, the Lorenz curve is taken to be L_{ρ|σ} = L_{W(ρ)|σ}, where W(ρ) is the state pinched to the eigenbasis of σ, i.e., W(ρ) = Σ_i P_i ρ P_i, with P_i the projectors onto the eigenspaces of σ.
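As a small illustrative sketch (ours, not the paper's; assuming numpy), the pinching map can be implemented directly from the spectral projectors of σ, and one can verify that the pinched state commutes with σ while the trace is preserved:

```python
import numpy as np

def pinch(rho, sigma):
    # Pinching map W(rho) = sum_i P_i rho P_i, with P_i the projectors
    # onto the eigenspaces of sigma (degenerate eigenvalues grouped).
    ev, vec = np.linalg.eigh(sigma)
    out = np.zeros_like(rho, dtype=complex)
    for lam in np.unique(np.round(ev, 10)):
        cols = vec[:, np.isclose(ev, lam, atol=1e-9)]
        P = cols @ cols.conj().T
        out += P @ rho @ P
    return out

# A qubit example: rho and sigma do not commute, but W(rho) and sigma do.
sigma = np.diag([0.7, 0.3]).astype(complex)
rho = np.array([[0.6, 0.2], [0.2, 0.4]], dtype=complex)
w = pinch(rho, sigma)
assert np.allclose(w @ sigma, sigma @ w)        # [W(rho), sigma] = 0
assert np.isclose(np.trace(w).real, 1.0)        # trace is preserved
```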
Due to the way we have ordered the eigenvalues according to Eq. (9), the Lorenz curve is by definition always concave.
The following now provides a simple and well-known equivalence between Lorenz curves and σ-majorization. Theorem 4. Given two pairs of commuting states (ρ, σ) and (ρ′, σ′), the following are equivalent: 1. L_{ρ|σ}(x) ≥ L_{ρ′|σ′}(x) for the entire range of x ∈ [0, 1]; 2. (ρ, σ) ≻ (ρ′, σ′), i.e. there exists a channel E with E(ρ) = ρ′ and E(σ) = σ′. Theorem 4 is a condensed version of a more extended statement found as Theorem 24 in Appendix B, also see [13][14][15][16].
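The easy direction of this equivalence (processing a pair by a channel pushes the Lorenz curve down) can be checked numerically. The following sketch is ours, assuming numpy; since both curves are piecewise linear, comparing them on the union of their kink points suffices:

```python
import numpy as np

def lorenz_curve(p, s):
    # Lorenz curve of the pair (p, s): order outcomes by decreasing ratio
    # p_i/s_i; the curve interpolates the cumulative points linearly.
    order = np.argsort(-p / s)
    x = np.concatenate(([0.0], np.cumsum(s[order])))
    y = np.concatenate(([0.0], np.cumsum(p[order])))
    return x, y

def dominates(p, s, p2, s2):
    # Check L_{p|s}(x) >= L_{p2|s2}(x) on the union of all kink points.
    x1, y1 = lorenz_curve(p, s)
    x2, y2 = lorenz_curve(p2, s2)
    grid = np.union1d(x1, x2)
    return bool(np.all(np.interp(grid, x1, y1) >= np.interp(grid, x2, y2) - 1e-9))

# Post-processing both members of a pair by any right stochastic matrix M
# (rows summing to 1) can only lower the Lorenz curve.
rng = np.random.default_rng(1)
for _ in range(50):
    d = 4
    p, s = rng.dirichlet(np.ones(d)), rng.dirichlet(np.ones(d))
    M = rng.dirichlet(np.ones(d), size=d)
    assert dominates(p, s, p @ M, s @ M)
```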

Flat and steep approximations relative to σ
We are now in a position to define the following approximations, known as the flat and steep approximations of a state ρ relative to σ, denoted ρ_fl and ρ_st respectively, which will play an important role in the derivation of our results. These states were initially defined in Ref. [24] for the special case of thermal reference states. Although the following Definitions 5 and 6 seem technical, they have the essential and appealing property that for any state ρ and any 1 > ε > 0, we have D(ρ_fl^ε, ρ) ≤ ε and D(ρ_st^ε, ρ) ≤ ε. The states are constructed as follows.
Definition 5 (Flat approximation relative to σ). Let σ, ρ be commuting quantum states on a d-dimensional Hilbert space H and {|i⟩}_{i=1}^d a common eigenbasis of the two states that orders ρ relative to σ, yielding σ = Σ_i s_i |i⟩⟨i| and ρ = Σ_i p_i |i⟩⟨i|. For any 0 ≤ ε ≤ 1, the ε-flat approximation relative to σ is the state ρ_fl^ε = Σ_i p̃_i |i⟩⟨i|, where the p̃_i are defined as follows: If D(ρ, σ) < ε, set p̃_i = s_i. Otherwise, define M ∈ {1, 2, . . . , d − 1} as the smallest and N ∈ {2, . . . , d} as the largest integer satisfying the respective threshold conditions of Ref. [24]. These integers always exist when ε ≤ D(ρ, σ) and moreover satisfy M ≤ N ([24], App. D, Lemma 6). Using these integers, finally set the p̃_i accordingly. Definition 6 (Steep approximation relative to σ). Let σ, ρ be commuting quantum states on a d-dimensional Hilbert space H and {|i⟩}_{i=1}^d a common eigenbasis of the two states that orders ρ relative to σ, yielding σ = Σ_i s_i |i⟩⟨i| and ρ = Σ_i p_i |i⟩⟨i|. Then, for 0 ≤ ε ≤ 1, the ε-steep approximation relative to σ is the state ρ_st^ε = Σ_i p̃_i |i⟩⟨i|, where R ∈ {2, . . . , d} is the largest index entering its construction. In the statement of Lemma 7 below, f_σ(ρ, ε) := √(V(ρ‖σ)(ε^{−1} − 1)) and l_c(x) := min(c · x, 1).
Lemma 7 is proven in Appendix E. Intuitively, it shows that the steep (flat) approximation allows us to obtain a state close to ρ in trace distance whose Lorenz curve is lower (upper) bounded by a straight line l_c with gradient c = r_st (r_fl) governed by both the relative entropy and its variance. These simple bounds on the Lorenz curves of ρ_st and ρ_fl are crucial for our derivation of Theorem 8 and Theorem 9, which are the general statements of Results 1 and 2 and are obtained as a direct consequence of Lemma 7.
B. Sufficient criteria for state transitions under σ-majorization

Using Lemma 7, we can now derive sufficient conditions for approximate state transitions between commuting pairs of quantum states. Theorem 8. Let (ρ, σ) and (ρ′, σ′) be commuting pairs of states and 1 > ε > 0. If S(ρ‖σ) − f_σ(ρ, ε/2) ≥ S(ρ′‖σ′) + f_{σ′}(ρ′, ε/2), then there exists a channel E with E(σ) = σ′ and D(E(ρ), ρ′) ≤ ε.
Proof. Let ε̄ = ε/2, r_st = 2^{S(ρ‖σ) − f_σ(ρ, ε̄)} and r_fl = 2^{S(ρ′‖σ′) + f_{σ′}(ρ′, ε̄)}. Note that if the above condition holds, then in the whole range of x ∈ [0, 1] we have l_{r_st}(x) ≥ l_{r_fl}(x). By Lemma 7, we then have L_{ρ_st^ε̄|σ}(x) ≥ L_{ρ′_fl^ε̄|σ′}(x) in that range, which by Theorem 4 implies that there exists a channel E such that E(σ) = σ′ and E(ρ_st^ε̄) = ρ′_fl^ε̄. Applying the same channel to ρ yields a state E(ρ) such that D(E(ρ), ρ′) ≤ ε, since the contractivity of the trace distance under channels gives D(E(ρ), ρ′_fl^ε̄) ≤ D(ρ, ρ_st^ε̄) ≤ ε̄, so that the triangle inequality yields D(E(ρ), ρ′) ≤ 2ε̄ = ε. Result 1 in Section II follows as a special case for σ = I. As mentioned earlier, in Appendix F we apply Theorem 8 to derive sufficient conditions for i.i.d. state transitions with a large but finite number of copies, recovering a previously observed resonance condition under which second-order corrections can vanish even for non-zero variances of the initial and final states ρ, ρ′.

C. Relation to smoothed min- and max-relative entropies
Two quantities that are useful in describing single-shot processes are the min- and max-relative entropies. Given a positive semidefinite operator σ ≥ 0 and a quantum state ρ ∈ D(S), let π_ρ denote the projector onto the support of ρ. Moreover, for two operators A and B, we write A ≥ B to mean that the operator A − B is positive semidefinite. In terms of this notation, if supp(ρ) ⊆ supp(σ), then we have the following definitions [78]: S_min(ρ‖σ) := −log tr(π_ρ σ), S_max(ρ‖σ) := log min{λ : ρ ≤ λσ}.
The smoothed variants are further defined as S_min^ε(ρ‖σ) := max_{ρ̃ ∈ B^ε(ρ)} S_min(ρ̃‖σ) and S_max^ε(ρ‖σ) := min_{ρ̃ ∈ B^ε(ρ)} S_max(ρ̃‖σ), where the optimizations are over the set B^ε(ρ) of all quantum states ε-close to ρ in trace distance. Finally, define as S_max(ρ) := log(d) − S_min(ρ‖I) and S_min(ρ) := log(d) − S_max(ρ‖I) the min- and max-entropies utilized in Result 2.
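For commuting pairs these quantities reduce to elementary spectral formulas. The following sketch is ours (assuming numpy, and taking I to be the maximally mixed state, as in Section II); it recovers the familiar facts S_max(ρ) = log rank(ρ) and S_min(ρ) = −log max_i p_i:

```python
import numpy as np

def s_min_rel(p, s):
    # S_min(rho||sigma) = -log2 tr(Pi_rho sigma), for commuting rho, sigma
    # with spectra p, s in a common eigenbasis.
    supp = p > 1e-12
    return -np.log2(np.sum(s[supp]))

def s_max_rel(p, s):
    # S_max(rho||sigma) = log2 min{lam : rho <= lam sigma} = log2 max_i p_i/s_i.
    supp = p > 1e-12
    return np.log2(np.max(p[supp] / s[supp]))

d = 4
p = np.array([0.5, 0.5, 0.0, 0.0])
mm = np.full(d, 1.0 / d)                 # maximally mixed reference state I
S_max = np.log2(d) - s_min_rel(p, mm)    # = log2 rank(rho)
S_min = np.log2(d) - s_max_rel(p, mm)    # = -log2 max_i p_i
assert np.isclose(S_max, 1.0) and np.isclose(S_min, 1.0)
```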
We now present the generalization of Result 2 for the smoothed min- and max-relative entropies, which is easily proven by making use of Lemma 7.
Using this fact and Lemma 7, in particular the definition of l_c(x), we then find tr(π_{ρ_st^ε} σ) ≤ r_st^{−1}. Combining this with the definition of the smoothed min-relative entropy then yields Eq. (11). Finally, combining Eqs. (10) and (11) yields the second claim of the theorem.

D. Uniform continuity and correction to subadditivity
We here show that the variance of relative surprisal is uniformly continuous, and we bound its violation of sub-additivity. Both properties are key tools in deriving our main results; however, we believe that they are also of independent interest and use.
This lemma is proven in Appendix C.
Lemma 11 (Correction to sub-additivity of relative variance). Let ρ, σ be two commuting quantum states on a d-dimensional, bipartite system, with d ≥ 2 and σ full-rank. If σ = σ_1 ⊗ σ_2 is a product state with smallest eigenvalue s_min, then V(ρ‖σ) ≤ V(ρ_1‖σ_1) + V(ρ_2‖σ_2) + K f(I_{1:2}), where K depends only on d and s_min, f(x) = max{x^{1/4}, x^{1/2}} and I_{1:2} denotes the mutual information between the two partitions of ρ.
This lemma is proven in Appendix D.

E. A new monotone and relative entropy production
We now turn to the presentation and derivation of the results that generalize Results 3 and 4. We begin by noting that the relative entropy S(ρ‖σ) is a non-increasing resource monotone with respect to the ordering, generalizing the Schur concavity of the von Neumann entropy. We then have the following generalization of Lemma 2, which we prove in App. G; Result 2 follows by setting σ = σ′ = I. Furthermore, since σ is the minimum of the pre-order, monotonicity implies a corresponding bound for any state ρ that commutes with σ. We now derive the following corollary of the above theorem as a general version of Result 3, where we write ∆S = S(ρ‖σ) − S(ρ′‖σ′) and ∆V = V(ρ‖σ) − V(ρ′‖σ′): Corollary 13. Let (ρ, σ) and (ρ′, σ′) be pairs of commuting states, with σ, σ′ both full-rank. If (ρ, σ) ≻ (ρ′, σ′), then ∆S ≥ ∆V / (2 M_{s_min}(ρ‖σ)) ≥ ∆V / (2(1/ln(2) − log(s_min))).
Proof. By monotonicity of the relative entropy and positivity of M, the statement is trivially true whenever ∆V ≤ 0. Hence, assume that ∆V > 0. By Theorem 12, ∆S and ∆V obey a quadratic relation. Let us write a = 1/ln(2) − log(s_min). Inserting the definition of M_{s_min} and reshuffling terms yields a quadratic inequality in ∆S, where we write χ = a − S(ρ‖σ) ≥ 0. Solving this quadratic inequality in ∆S then gives the claimed bound. Here, we have used the fact that ∆S ≥ 0 by monotonicity in the first step (to disregard one solution), positivity of χ in the second step and the concavity of the square root in the third step (more precisely, that f(y) ≥ f(x) + f′(y)(y − x) for any differentiable concave function f). This concludes the proof.
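The monotonicity invoked above is, in the quasi-classical setting, just the data-processing inequality for the classical relative entropy, which is easy to probe numerically. The sketch below is illustrative and ours (assuming numpy):

```python
import numpy as np

def rel_entropy(p, q):
    # Classical relative entropy S(p||q) in bits (supp(p) within supp(q)).
    mask = p > 1e-15
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Data processing: any right stochastic M applied to both members of a pair
# can only decrease S(p||q) -- the monotonicity underlying the pre-order.
rng = np.random.default_rng(2)
for _ in range(50):
    d = 5
    p, q = rng.dirichlet(np.ones(d)), rng.dirichlet(np.ones(d))
    M = rng.dirichlet(np.ones(d), size=d)
    assert rel_entropy(p @ M, q @ M) <= rel_entropy(p, q) + 1e-9
```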
We note in passing that such lower bounds on the production of relative entropy are essential for quantifying irreversibility in thermodynamics, where (1/β) S(ρ‖τ_β) measures the non-equilibrium free energy of a system in state ρ in an environment of inverse temperature β. Here, τ_β denotes the Gibbs state of the system at inverse temperature β. We leave the detailed investigation of applications of our results to thermodynamics for future work.
Next, we present the generalized version of Result 4. Let S and E be two systems of respective dimensions d_S and d_E, and let σ ≡ σ_S ⊗ σ_E as well as σ′ ≡ σ′_S ⊗ σ′_E be two fixed product states on the joint system SE. Let E : D(S ⊗ E) → D(S ⊗ E) be a quantum channel such that E(σ) = σ′. Using this channel we can define a channel C : D(S) → D(S) as C(ρ_S) := tr_E[E(ρ_S ⊗ ρ_E)] for an initial state ρ_E of the environment. As in the previous section, for some initial state ρ_S on S, we denote by ρ′_S = C(ρ_S) the final state on S, by ∆S_S = S(ρ_S‖σ_S) − S(ρ′_S‖σ′_S) the marginal change of relative entropy on S, and similarly for ∆V_S and the environment E. Finally, I_{S:E} is the mutual information of the final state E(ρ_S ⊗ ρ_E), as defined in Eq. (7) (with U replaced by E). We then have the following: Theorem 14. Let C be a channel defined via Eq. (12), for fixed and full-rank states σ, σ′ as well as ρ_E. Then, for any initial system state ρ_S such that ρ_S ⊗ ρ_E commutes with σ and E(ρ_S ⊗ ρ_E) commutes with σ′, the corresponding bound holds with f(x) = max{x^{1/4}, x²}. Here, s_min is the smallest eigenvalue of σ.
Proof. Applying Corollary 13 with ρ ≡ ρ_S ⊗ ρ_E and Lemma 11 yields the intermediate bound. The statement then follows from the fact that, for any state ρ_SE on SE with mutual information I_{S:E}, S(ρ_SE‖σ_S ⊗ σ_E) = S(ρ_S‖σ_S) + S(ρ_E‖σ_E) + I_{S:E}.

F. Relative entropy from local monotonicity
Lastly, let us discuss the general version of local monotonicity, which uniquely characterizes the relative entropy. To do this, let F be the set of all finite-dimensional density matrices with full rank and let C_F be the set of quantum channels that map states of full rank to states of full rank (on possibly different Hilbert spaces), symbolically C_F(F) ⊆ F. We further generalize the notion of local monotonicity to states σ_1 ⊗ σ_2 that are not fixed points of a given channel: We say that a function f on pairs of quantum states (ρ, σ), with ρ defined on the same Hilbert space as σ ∈ F, is locally monotonic with respect to C_F if the corresponding local inequality holds for every such channel, where again ρ′_1 = tr_2[C(ρ_1 ⊗ ρ_2)] and similarly for ρ′_2. We then have the following theorem.
Theorem 15. Let f be a function that is locally monotonic with respect to C_F and assume that ρ → f(ρ, σ) is continuous for fixed σ ∈ F. Then f(ρ, σ) = a S(ρ‖σ) + b, where a and b are constants.
The proof can be found in Appendix H.

IV. CONCLUSIONS AND OUTLOOK
In this work we comprehensively studied formal properties of the variance of (relative) surprisal together with their applications to single-shot (quantum) information theory. Before closing, let us comment on the high-level motivation for this work and open avenues for further research. To do this, we restrict again to the case of unital channels (σ = I) for simplicity.
As discussed throughout the paper, the von Neumann entropy quantifies information-theoretic tasks in the asymptotic limit. Conversely, the min- and max-entropies typically appear in the fully single-shot regime. All these quantities are special cases of the Rényi entropies, with S(ρ) = S_1(ρ) := lim_{α→1} S_α(ρ), S_min(ρ) := lim_{α→∞} S_α(ρ) and S_max(ρ) := lim_{α→0} S_α(ρ). Indeed, we can consider the min- and max-entropies to be the end points of the "Rényi curve" α → S_α(ρ). This curve encodes the full spectrum of a state (see below). Hence, roughly speaking, one can say that in the single-shot regime the full shape of the curve matters, while in the asymptotic i.i.d. limit only the point α = 1 matters. From this point of view, the approach to single-shot information theory using min- and max-entropies rests on the observation that the end points of the (smoothed) curve capture many of the operationally relevant single-shot effects for a given state. In contrast, the approach presented here quantifies single-shot effects by studying not the end points but rather the neighbourhood of the Rényi curve around α = 1. To see this, consider the Taylor expansion of S_α(ρ) around α = 1, writing α = 1 + x. First performing the expansion and finally taking the limit x → 0 yields the expansion (13), where κ^(n) is the n-th cumulant of surprisal. In Appendix I, we present the definition of the cumulants of surprisal as well as the derivation of (13) (also see [38]). Eq. (13) is interesting for a number of reasons. To begin with, we have κ^(1) = S(ρ) and κ^(2) = V(ρ). Hence, (13) shows that the variance of surprisal is (up to a factor of −2) the slope of the Rényi curve at α = 1 and gives the first-order correction to the approximation S_α(ρ) ≈ S(ρ). This fact is well-known [38,79]. It lets us apply some of our results to the Rényi curve.
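The slope statement is easy to verify numerically. The sketch below is ours (assuming numpy) and works in natural logarithms, where the relation reads dS_α/dα|_{α=1} = −V/2 without base-conversion factors; in base 2, an extra factor of ln 2 appears:

```python
import numpy as np

def renyi_nats(p, alpha):
    # Rényi entropy in nats (alpha != 1).
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

p = np.array([0.5, 0.25, 0.125, 0.125])
surprisal = -np.log(p)
mean = float(np.sum(p * surprisal))                   # entropy S (nats)
var = float(np.sum(p * surprisal ** 2) - mean ** 2)   # variance V (nats^2)

# Symmetric finite-difference slope of the Rényi curve at alpha = 1.
h = 1e-4
slope = (renyi_nats(p, 1 + h) - renyi_nats(p, 1 - h)) / (2 * h)
assert abs(slope - (-var / 2)) < 1e-4                 # slope = -V/2
```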
For instance, Result 2 relates the neighbourhood of the Rényi curve around α = 1 to its smoothed end points, while Result 3 constrains the possible changes of the Rényi curve under unital channels, in the sense that such channels can only flatten the slope at α = 1 at the expense of raising the curve at this point.
More generally, the expansion (13) is interesting because it implies that the higher-order cumulants of surprisal give a hierarchy of increasingly fine-grained knowledge about a state's spectrum. This follows once we recognize that, for a d-dimensional state ρ, it suffices to know S_n(ρ) for n = 2, . . . , d to fully reconstruct the spectrum of ρ (for the reader's convenience we provide a proof of this statement in Appendix J). In turn, the results of this paper, in which we studied the first order of this hierarchy, suggest that studying the single-shot properties of higher-order cumulants of surprisal could yield insights about single-shot information theory that are complementary to the approach of smoothed Rényi entropies.
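The reconstruction of the spectrum from S_2, . . . , S_d proceeds, in essence, via Newton's identities: the Rényi entropies fix the power sums of the eigenvalues, which fix the elementary symmetric polynomials and hence the characteristic polynomial. A minimal numerical sketch of this route (ours, assuming numpy; one concrete way to realize the claim, not necessarily the proof given in Appendix J):

```python
import numpy as np

def spectrum_from_power_sums(P):
    # Newton's identities: from the power sums P[k-1] = sum_i p_i^k,
    # k = 1..d, recover the elementary symmetric polynomials e_k, then
    # the spectrum as the roots of the characteristic polynomial.
    d = len(P)
    e = [1.0]
    for k in range(1, d + 1):
        acc = sum((-1) ** (i - 1) * e[k - i] * P[i - 1] for i in range(1, k + 1))
        e.append(acc / k)
    coeffs = [(-1) ** k * e[k] for k in range(d + 1)]
    return np.sort(np.roots(coeffs).real)

p = np.array([0.4, 0.3, 0.2, 0.1])
d = len(p)
# Rényi entropies determine the power sums via sum_i p_i^n = 2^{(1-n) S_n};
# together with normalization (P_1 = 1) this fixes the whole spectrum.
S = {n: float(np.log2(np.sum(p ** n)) / (1 - n)) for n in range(2, d + 1)}
P = [1.0] + [2.0 ** ((1 - n) * S[n]) for n in range(2, d + 1)]
assert np.allclose(spectrum_from_power_sums(P), np.sort(p), atol=1e-6)
```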
In particular, it would be interesting to know whether it is possible to construct a hierarchy of Schur-concave functions of increasing relevance at the single-shot level from cumulants of surprisal. As a first step in this direction, the following interesting problem arises: We have mentioned that knowing the Rényi entropies S_n(ρ) for n = 2, . . . , d provides full information about the spectrum of the state. Is it also true that the first d − 1 cumulants of surprisal encode the full spectrum of the state? Another problem to consider is the extension of our study to the fully quantum setting of non-commuting matrices. We leave these questions to future work.
Acknowledgements. The authors would like to thank Angela Capel, Xavier Coiteux-Roy, Iman Marvian, Renato Renner, Carlo Sparaciari, Marco Tomamichel and Stefan Wolf for stimulating discussions and suggestions, and especially Jens Eisert for fruitful comments on an earlier version of this work. We want to particularly thank Mark Wilde for suggesting the extension of our work to generic channels.

Appendix B establishes all the notation used throughout our proofs and collects a handful of technical lemmas. Appendices C and D present the proofs of Lemma 10 and Lemma 11, respectively. Appendix E presents the proof of the central technical Lemma 7 that underlies all of our sufficiency results. Appendix F presents the application of Theorem 8 to the case of finite i.i.d. sequences. Appendix G then provides the proof of Theorem 12. Appendix H discusses details of the axiomatic characterization of locally monotonic functions, including the proofs of Result 5 and Theorem 15. Finally, Appendix I provides the details of the expansion Eq. (13), and Appendix J sketches the proof that a state's spectrum can be inferred from the values of d − 1 Rényi entropies, as claimed in the conclusion.

Appendix B: Notation and auxiliary lemmata
In the following we will make frequent use of the following definitions: • L(ρ‖σ) := tr(ρ (log(ρ) − log(σ))²), • η(x) := −x log(x), • χ(x‖q) := x log²(x/q). We also remind the reader that we use logarithms with base 2, log = log_2. Lemmas 16–22 are technical tools used in the derivation of our results. We list them here for completeness.
We will also use the following generalization of the Fannes-Audenaert inequality, which is implied by Lemma 7 in [80]: Lemma 18 (Continuity of relative entropy (Lemma 7, [80])). Consider any full-rank state σ with s_min > 0 denoting its smallest eigenvalue. Then, for any two states ρ, ρ′ such that D(ρ, ρ′) ≤ ε, the relative entropies S(ρ‖σ) and S(ρ′‖σ) differ by an amount controlled by ε and s_min. Lemma 19 (Pinsker inequality). For quantum states ρ, σ acting on the same Hilbert space, S(ρ‖σ) ≥ (1/(2 ln(2))) ‖ρ − σ‖₁². Lemma 20 states that, for commuting pairs as above, a channel mapping (ρ, σ) to (ρ′, σ′) exists if and only if there exists a right stochastic d × d matrix E, that is, a matrix with all non-negative entries each of whose rows sums up to 1, such that pE = q and sE = s′, where p, q, s, s′ are simply vectors containing the eigenvalues of ρ, ρ′, σ and σ′, respectively.
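The Pinsker inequality in the stated (base-2) normalization is easy to probe numerically on classical distributions; the sketch below is ours, assuming numpy:

```python
import numpy as np

def rel_entropy_bits(p, q):
    # Classical relative entropy S(p||q) in bits.
    mask = p > 1e-15
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Pinsker: S(p||q) >= ||p - q||_1^2 / (2 ln 2), checked on random pairs.
rng = np.random.default_rng(3)
for _ in range(100):
    p, q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    tv1 = np.sum(np.abs(p - q))          # the 1-norm ||p - q||_1
    assert rel_entropy_bits(p, q) >= tv1 ** 2 / (2 * np.log(2)) - 1e-12
```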
Next, four technical lemmas are proven explicitly and used in later parts of our work.
Proof. We write σ = ⊕_i s_i 1_i, where 1_i is the identity operator on the i-th eigenspace of σ. Similarly, we can write ρ = ⊕_i ρ_i and ρ′ = ⊕_i ρ′_i. The mapping ρ′ → U ρ′ U† =: U(ρ′), with U = ⊕_i U_i a block-diagonal unitary, is a σ-preserving quantum channel. Now, without loss of generality, choose a basis |i, j⟩ in each eigenspace of σ such that ρ_i = Σ_j p_{i,j} |i, j⟩⟨i, j| with p_{i,j} ≥ p_{i,j+1}.

Theorem 24 (Relative majorization and Lorenz curves). Let (p, q) ∈ R^n_+ and (p′, q′) ∈ R^m_+ be two pairs of probability vectors (non-negative, normalised) such that when q_i = 0, then p_i = 0, and similarly for p′ and q′. Furthermore, denote by ρ, σ states diagonalized in the same basis with eigenvalues p, q, and by ρ′, σ′ states diagonalized in the same basis with eigenvalues p′, q′, respectively. Then the following are equivalent: 1. There exists an n × m right stochastic matrix M such that pM = p′ and qM = q′; 2. L_{ρ|σ} ≥ L_{ρ′|σ′}; 3. For every continuous convex function g : R → R, Σ_i q_i g(p_i/q_i) ≥ Σ_j q′_j g(p′_j/q′_j). Proof. Forms and proofs of this theorem appear in various places in the literature, and can be found in e.g. [13][14][15][16]. But since the proven statements often vary in some detail, we here present the proof for the reader's convenience. 1. ⇒ 3.: Let m_{ij} denote the elements of the matrix M that exists by assumption. Define the matrix A with elements a_{ij} = q_i m_{ij} (q′_j)^{−1}. It is easy to check that A is a left-stochastic matrix with Aq′ = q, and that (p/q)A = (p′/q′), where p/q denotes element-wise division. Since p/q and p′/q′ are real vectors due to the fact that q_i = 0 implies p_i = 0, and similarly for the final pair, we can now apply Proposition A.1, p. 579, of [16] to arrive at the desired statement.
3. ⇒ 2.: First, we note that, by definition of the Lorenz curve, L_{ρ|σ}(x) = L_{ρ′|σ′}(x) for x = 0, 1. We hence need to show domination of the Lorenz curve only for x ∈ (0, 1). For this, let (g_t)_{t∈R} and (h_t)_{t∈R} be families of parametrized functions defined by g_t(x) := max{x − t, 0} and h_t(x) := max{t − x, 0}. By convexity of the max function, g_t as well as h_t are convex for every t. Next, define the function S_{ρ|σ} : (0, 1) → R by S_{ρ|σ}(x) := p_i/q_i for Σ_{j=0}^{i−1} q_j < x ≤ Σ_{j=0}^{i} q_j, where q_0 = 0 and we enumerate the vectors such that p_i/q_i ≥ p_{i+1}/q_{i+1}, without loss of generality. A moment's thought shows that S_{ρ|σ}(x) is the value of the slope of the Lorenz curve L_{ρ|σ} at x, whenever this slope is well defined. Now, for fixed x, we distinguish between the cases S_{ρ|σ}(x) ≥ S_{ρ′|σ′}(x) and S_{ρ|σ}(x) < S_{ρ′|σ′}(x). In the first case, we evaluate condition 3. for g_t with t = S_{ρ|σ}(x), where k is the largest index such that p_k/q_k > t, and similarly k′ is the largest index such that p′_{k′}/q′_{k′} > t. Geometrically, we can interpret the resulting inequality as a statement about two parallel lines with slope t running through the respective points L_{ρ|σ}(Σ_{i=1}^k q_i) (on the LHS) and L_{ρ′|σ′}(Σ_{i=1}^{k′} q′_i) (on the RHS). Now, by our choice of t, the point (x, L_{ρ|σ}(x)) lies on the left-hand parallel, while by the fact that S_{ρ|σ}(x) ≥ S_{ρ′|σ′}(x), the point (x, L_{ρ′|σ′}(x)) is guaranteed to lie below the right-hand parallel. Hence, we have L_{ρ|σ}(x) ≥ L_{ρ′|σ′}(x), as required.
We now turn to the second case: If S_{ρ|σ}(x) < S_{ρ′|σ′}(x), then we instead evaluate condition 3. for h_t with t = S_{ρ′|σ′}(x). This yields, after canceling out some terms, an analogous inequality, where this time k is the largest index such that p_k/q_k ≥ t, and similarly k′ is the largest index such that p′_{k′}/q′_{k′} ≥ t. By following the same argument as in the first case, we know that the point (x, L_{ρ′|σ′}(x)) lies on the right-hand parallel, while the point (x, L_{ρ|σ}(x)) is guaranteed to lie above the left-hand parallel. Hence, we have L_{ρ|σ}(x) ≥ L_{ρ′|σ′}(x), as required.
Note that under the condition [ρ, σ] = [ρ′, σ′] = 0, the existence of a quantum channel E such that E(σ) = σ′ and E(ρ) = ρ′ is equivalent to having a classical stochastic matrix M such that pM = p′ and qM = q′ (by Lemma 20). Theorem 4 is therefore implied by the first and second statements of Theorem 24.
Lemma 25. Let x, y ∈ [0, 1] such that |x − y| ≤ 1/2. Then |η(x) − η(y)| ≤ η(|x − y|). Proof. The statement is clearly true if x = y. Hence, w.l.o.g. let x > y, and set z = |x − y| = x − y ≤ 1/2. First, note that |η(x) − η(y)| = |∫_0^z η′(y + r) dr| =: F_z(y), and that F_z(0) = η(z), so that it is sufficient to show that F_z(y) ≤ F_z(0) for all admissible y. To show this, let us begin by evaluating the derivative η′(x) = −(1/ln(2))[ln(x) + 1], and noting that this is a monotonically decreasing function with a root at x* = e^{−1}. As graphically shown in Fig. 3, Eq. (B1) states that, of all integrals of fixed width z, the one with the largest absolute value is the one over the interval [0, z].
We now show that this is indeed the case. We consider three different cases: If y ≤ x* − z, the statement is automatically true because η′ is positive and monotonically decreasing below x*, so that an application of Lemma 22 yields F_z(y) ≤ F_z(0). From similar reasoning, we know that F_z(y) ≤ F_z(1 − z) whenever x* ≤ y ≤ 1 − z, by monotonicity and negativity of η′ above x*. For the third case, when x* − z < y < x*, the fact that part of the integral is positive and part negative implies that F_z(y) ≤ max{F_z(0), F_z(1 − z)}, where we applied the bounds derived for the previous cases. Hence, it remains to show that F_z(0) ≥ F_z(1 − z). To do so, first note that F_z(0) = η(z) and F_z(1 − z) = η(1 − z). Furthermore, the function g(z) := η(z) − η(1 − z) is continuous over [0, 1], positive at z = e^{−1} ∈ [0, 1/2], and has roots at z = 0, 1/2, 1. By invoking the intermediate value theorem, we conclude that g(z) ≥ 0 for all z ≤ 1/2, which concludes the proof.
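A brute-force numerical scan is a useful sanity check on the inequality as we read it from the proof, |η(x) − η(y)| ≤ η(|x − y|) for |x − y| ≤ 1/2. The sketch is ours, assuming numpy:

```python
import numpy as np

def eta(x):
    # eta(x) = -x log2(x), extended by eta(0) = 0.
    return 0.0 if x <= 0 else -x * np.log2(x)

# Brute-force check of |eta(x) - eta(y)| <= eta(|x - y|) for |x - y| <= 1/2.
xs = np.linspace(0.0, 1.0, 201)
for x in xs:
    for y in xs:
        z = abs(x - y)
        if z <= 0.5:
            assert abs(eta(x) - eta(y)) <= eta(z) + 1e-9
```

Note that the bound is tight, e.g. at (x, y) = (1, 1/2), where both sides equal 1/2.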
Lemma 26. Let q ∈ (0, 1] and x, y ∈ [0, q]. If |x − y| ≤ q/e², then |χ(x‖q) − χ(y‖q)| ≤ χ(|x − y| ‖ q). Proof. The proof is in spirit very similar to that of Lemma 25. Again the statement is trivially true if x = y. Assume, then, again without loss of generality that x > y and set z = |x − y| = x − y ≤ q/e². We note that |χ(x‖q) − χ(y‖q)| = |∫_0^z χ′(y + r‖q) dr| =: G_z(y‖q), and G_z(0‖q) = χ(z‖q), so it is sufficient to show G_z(0‖q) ≥ G_z(y‖q). Now, with the same strategy, let us first evaluate χ′(x‖q) = (1/ln(2)²)[2 ln(x/q) + ln²(x/q)], plotted in Fig. 4. We are interested in three intervals of this function: [0, q/e²], where it is monotonically decreasing and positive; [q/e², q/e], where it is monotonically decreasing and negative; and [q/e, q], where it is monotonically increasing and negative. By monotonicity on these separate intervals, the fact that z ≤ q/e² (which implies that each of these intervals is at least as wide as z) and invoking Lemma 22, it follows that max_{r∈[0,q−z]} G_z(r‖q) = max{G_z(0‖q), G_z(q/e − z‖q) + G_z(q/e‖q)}, by the following reasoning: For r ≤ q/e² − z, by applying Lemma 22, we have that G_z(r‖q) ≤ G_z(0‖q) by positivity in that interval. Similarly, we can bound G_z(r‖q) for the values q/e² ≤ r ≤ q − z by the second term in the above bracket. Finally, for q/e² − z < r < q/e², parts of the integral cancel out, so that we can bound the integral by the above two terms. It hence remains to show that G_z(0‖q) always dominates the second term above, which we check by explicit evaluation: we first have G_z(0‖q) = χ(z‖q) = z ln²(z/q)/ln(2)² ≥ 4z/ln(2)², where, since z ≤ q/e² and ln²(x/q) is strictly monotonically decreasing for x/q ∈ (0, 1], we used ln²(z/q) ≥ ln²(e^{−2}) = 4. On the other hand, we can upper bound the second term by noting that G_z(q/e − z‖q) + G_z(q/e‖q) ≤ 2z · |χ′(q/e‖q)| = 2z/ln²(2), which is always smaller than G_z(0‖q).
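As for Lemma 25, a brute-force scan of the inequality as we read it from the proof, |χ(x‖q) − χ(y‖q)| ≤ χ(|x − y| ‖ q), is a reassuring sanity check. The sketch is ours, assuming numpy:

```python
import numpy as np

def chi(x, q):
    # chi(x||q) = x * log2(x/q)^2, extended by chi(0||q) = 0.
    return 0.0 if x <= 0 else x * np.log2(x / q) ** 2

# Brute-force check of |chi(x||q) - chi(y||q)| <= chi(|x - y| || q)
# for x, y in [0, q] with |x - y| <= q/e^2.
q = 0.5
zmax = q / np.e ** 2
xs = np.linspace(0.0, q, 151)
for x in xs:
    for y in xs:
        z = abs(x - y)
        if 0.0 < z <= zmax:
            assert abs(chi(x, q) - chi(y, q)) <= chi(z, q) + 1e-9
```

Equality is attained at y = 0, in line with the proof's observation that G_z(0‖q) = χ(z‖q) is the maximizer.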
The main result of this appendix, the uniform continuity of the relative variance (and therefore of the non-relative variance of surprisal), is stated as Lemma 10. To prove it, let us first establish the following technical lemma.
Proof. As a first step, we note that, due to the fact that both ρ and ρ′ commute with σ, we only need to consider the spectra of the various states. This follows from Lemma 23. Let U denote the unitary channel defined in the proof of that lemma. Then by construction we have that [ρ, U(ρ′)] = 0. Since D(ρ, U(ρ′)) ≤ D(ρ, ρ′) and L(U(ρ′)‖σ) = L(ρ′‖σ), we can in the following replace ρ′ by U(ρ′) without loss of generality. Since all three states then commute with one another, we can assume the decompositions ρ = Σ_i p_i |i⟩⟨i|, ρ′ = Σ_i q_i |i⟩⟨i| and σ = Σ_i s_i |i⟩⟨i|, in terms of which we can expand the difference of the L-terms, introducing the variable x_i := |p_i − q_i|. We now show that each of the terms on the RHS of (C1) can be upper bounded by a term of the form either χ(x_i‖s_i) or C · x_i for some constant C. To see this, consider the i-th term in the sum, and let us distinguish the following cases, where we assume without loss of generality that q_i < p_i. Case I: p_i ≤ s_i/e². In this case, we know that x_i ≤ s_i/e² and so can apply Lemma 26. Case II: s_i/e² ≤ q_i. In this case, we can make use of the fact that χ(·‖s_i) is Lipschitz continuous in its first argument over the interval [s_i/e², 1], by differentiability of χ over this interval. Case III: q_i < s_i/e² < p_i. Here, we distinguish three sub-cases. To discuss these, we note that since, for fixed s_i, χ(·‖s_i) is continuous, has roots at 0 and s_i, and has a local maximum at s_i/e², by the mean value theorem there must be a point q*_i ≤ s_i/e² with χ(q*_i‖s_i) = χ(p_i‖s_i). We now distinguish the following sub-cases: First, assume that q*_i ≤ q_i. Then we can make use of Lipschitz continuity as in Case II. Note that this sub-case always covers situations in which p_i ≥ s_i, because we are guaranteed that q*_i ≤ s_i. Hence, it remains to consider the case q_i < q*_i with p_i < s_i.
Now, if x_i ≤ s_i/e², then we can apply Lemma 26 directly. Finally, if x_i > s_i/e², the corresponding bound follows, where we twice used the fact that χ(·‖s_i) is strictly monotonically decreasing on the interval [s_i/e², s_i] and positive: in the first step, combined with the definition of q*_i, this implies that χ(p_i‖s_i) > χ(q_i‖s_i); in the second step, it implies that χ(p_i‖s_i) < χ(s_i/e²‖s_i).
Overall, we have seen that we can upper bound each of the terms on the RHS of (C1) by either χ(x_i‖s_i) or Ĉ(s_i) · x_i, where Ĉ(s_i) := max{4, C(s_i)} = max{4, χ′(1‖s_i)}.
Let A denote the set of indices i whose terms we have bounded by χ(x_i‖s_i), and B = [d]\A those that we have bounded by Ĉ(s_i) · x_i. We now upper bound the two groups of terms corresponding to these two sets. In particular, let T_1 and T_2 denote the corresponding partial sums, and denote ∆_1 := Σ_{i∈A} x_i, ∆_2 := Σ_{i∈B} x_i. We can straightforwardly bound T_2 in terms of ∆_2, where we recall that D ≡ D(ρ, ρ′). Bounding T_1 is more involved. By applying the previously derived upper bound, we first split the χ-terms into an η-part and a remainder, where in one step we used that η(x_i) is positive and s_i ≤ 1, in another that x_i ≤ s_i/e², which holds for all the terms in T_1 by the previous arguments, and in the last step again positivity of η(x_i). Next, we make use of the identity η(ab) = a η(b) + b η(a). Plugging this into the RHS of (C2) yields an expression with two contributions, F_1 and F_2. To bound F_1, we note that the {x_i/∆_1}_{i∈A} form an |A|-dimensional probability vector, corresponding to some density matrix τ. Hence, using Property 6 of the variance, as presented in the main text, and the fact that S(τ‖I)² = S(τ)² ≤ log²(d) for a d-dimensional density matrix τ in the case |A| ≥ 2, we obtain a bound on F_1; clearly, this upper bound is also valid for |A| ∈ {0, 1}. Next, to lower bound the term F_2, we distinguish two cases: If ∆_1 ≥ 1, then the terms log(∆_1) ∆_1 · η(x_i/∆_1) are positive and can be lower bounded by zero, while if ∆_1 < 1, the term is negative and can be lower bounded by ∆_1 log(∆_1) log(d), again using the fact that the {x_i/∆_1}_{i∈A} form a probability distribution; this is the origin of the minimization in the resulting bound. Plugging these bounds back into (C3) then yields the bound on T_1. We are finally in a position to combine the bounds on T_1 and T_2. This gives the claimed estimate, where we used Ĉ(s_min) + 8 log²(d) + 8 ≤ 12 + log²(s_min) + 8 log²(d) =: c_1.
With this technical lemma established, it is then relatively easy to prove uniform continuity of V (ρ σ), which is stated as Lemma 10 in the main text (we restate it here for convenience).
Proof. The claim follows by combining Lemma 27 with the continuity of the relative entropy (Lemma 18).

For notational convenience, for the remainder of this appendix we write V ≡ V(ρ‖σ) and V_1 ≡ V(ρ_1‖σ_1), and similarly for the other subsystem and the other quantities, L and S.
Proof. Let us begin by introducing the following shorthand notation: I for the mutual information of ρ across the partitions 1 and 2, and L_⊗, S_⊗ for the corresponding quantities of the product state ρ_1 ⊗ ρ_2. Next, note that S(ρ‖σ_1 ⊗ σ_2) = S(ρ_1‖σ_1) + S(ρ_2‖σ_2) + I. The special case of this equality for the non-relative von Neumann entropy, i.e. I(A : B) = S(ρ_A) + S(ρ_B) − S(ρ_AB), is well-known, while the identity for relative entropies is proved in Proposition 2 of [82]. Therefore, we can express the violation of sub-additivity in terms of ζ_ρ := L − L_⊗, since the terms I² and 2 I S_⊗ are positive and can be dropped with the inequality. Our goal is to bound this remaining function in terms of the mutual information I. To do so, we can apply Lemma 27, since [ρ, σ] = 0 implies that [ρ_1 ⊗ ρ_2, σ] = 0. Applying this lemma yields a bound with constants c′_1 and c′_2 derived from the constants c_1, c_2 in the statement of Lemma 27, where in the second step we used Pinsker's inequality (Lemma 19) and the fact that I = S(ρ‖ρ_1 ⊗ ρ_2). Finally, by noting that, for d ≥ 2, the two constants can be merged into a single constant, and f(I) = max{√I, I^{1/4}}, we obtain the claimed bound. The statement of the Lemma then follows by setting K equal to this constant.
Appendix E: Proof of Lemma 7

In order to prove the central technical result of Lemma 7, we first make the following simple observation about lower and upper bounds for Lorenz curves, which is spelled out in Lemma 28.

Proof. This follows directly from the concavity of the Lorenz curve.
Remark 29. The functions $l_c(x) = \min(c\cdot x, 1)$ furthermore satisfy the property that whenever $c \ge d$, we have $l_c(x) \ge l_d(x)$ for all $x \in [0, 1]$.
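The monotonicity claimed in Remark 29 is easy to verify numerically; a minimal sketch (with $c = 3 \ge d = 2$):

```python
def l(c, x):
    """Clipped linear function l_c(x) = min(c*x, 1), an upper bound on a Lorenz curve."""
    return min(c * x, 1.0)

xs = [k / 100 for k in range(101)]
# For c >= d, l_c dominates l_d pointwise on [0, 1].
assert all(l(3.0, x) >= l(2.0, x) for x in xs)
```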
A small further technical observation stated in Lemma 30 is required to then prove Lemmas 31 and 32, which jointly give rise to Lemma 7.
Since $\epsilon_2 \ge \epsilon_1$ implies that $\rho^{\epsilon_1}_{\mathrm{fl}} \in \mathcal{B}^{\epsilon_2}(\rho)$, the fourth inequality holds. Lastly, consider the first inequality. First, note that $\rho^{\epsilon_1}_{\mathrm{st}}$ and $\rho^{\epsilon_2}_{\mathrm{st}}$ always share the same basis, so we may write them as $\rho^{\epsilon_1}_{\mathrm{st}} = \sum_{i=1}^d \tilde p^{(1)}_i |i\rangle\langle i|$ and $\rho^{\epsilon_2}_{\mathrm{st}} = \sum_{i=1}^d \tilde p^{(2)}_i |i\rangle\langle i|$, respectively. Furthermore, they have the same relative ordering w.r.t. $\sigma$. In other words, the discrete points that define the respective Lorenz curves are aligned w.r.t. the x-axis, and therefore the condition reduces to a simple comparison between the cumulative sums of the eigenvalues. Concretely, we want that for all $k \in \{1, \dots, d\}$:

Denoting by $R_1$ and $R_2$ the respective indices $R$ according to the construction of Definition 6, using $\epsilon_1$ and $\epsilon_2$ respectively, note that $R_2 \le R_1$. For the various regimes of the Lorenz curves we therefore have:

This finishes the proof.
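When two states share both eigenbasis and $\sigma$-relative ordering, the Lorenz-curve comparison above reduces to comparing partial sums of eigenvalues. A minimal sketch of that check (the helper name is ours):

```python
def dominates(p1, p2, tol=1e-12):
    """Check that every partial sum of p1 is at least the corresponding one of p2,
    assuming both lists share the same (sigma-relative) ordering."""
    c1 = c2 = 0.0
    for a, b in zip(p1, p2):
        c1 += a
        c2 += b
        if c1 < c2 - tol:
            return False
    return True

# A steeper distribution dominates a flatter one with the same ordering.
assert dominates([0.5, 0.3, 0.2], [0.4, 0.35, 0.25])
```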
Proof. Using the fact that $[\rho, \sigma] = 0$, we can decompose the states into their simultaneous eigenbasis as

and take the ordering of the eigenbasis such that $p_i/s_i \ge p_{i+1}/s_{i+1}$ for all $i \in \{1, \dots, d-1\}$. Next, given $\epsilon$, define

namely, $\tilde i$ is the largest index such that the tail-sum of the ordered distribution $p$ is larger than or equal to $\epsilon$. Also, denote the following tail-sums

where we set $\epsilon_- = 0$ if $\tilde i = d$. By construction, $\epsilon_- \le \epsilon \le \epsilon_+$. We are now going to make use of the steep state with the smaller parameter $\epsilon_-$: let $\rho^{\epsilon_-}_{\mathrm{st}} = \sum_i \tilde p_i |i\rangle\langle i|$ denote the $\epsilon_-$-steep approximation of $\rho$ relative to $\sigma$, with $\tilde p_i$ explicitly defined in Definition 6. By construction, we have that

where $P_+ = \{i \,|\, \tilde p_i > 0\}$. We can now infer that

where the first inequality follows from Lemma 30 together with the fact that $\epsilon \ge \epsilon_-$, while the second inequality follows from Lemma 28. Hence, it remains to show that

which implies $S(\rho\|\sigma) - f_\sigma(\rho, \epsilon) \le \log A$. If $S(\rho\|\sigma) \le \log A$, then this inequality holds for any non-negative function $f_\sigma(\rho, \epsilon)$. Otherwise, we can derive the explicit form of $f_\sigma$ so that Eq. (E1) holds via Cantelli's inequality. More precisely, consider the real-valued random variable $X$ with sample space $\Omega = \{1, 2, \dots, d\}$, distributed as $\mathrm{Prob}(X = \log(p_i/s_i)) = p_i$. We then have

$\epsilon \le \epsilon_+ = \mathrm{Prob}(X \le \log A) = \mathrm{Prob}\big(S(\rho\|\sigma) - X \ge S(\rho\|\sigma) - \log A\big)$

where the first step follows by the definition of $\epsilon_+$. The second step expresses the fact that, since we have ordered the state bases by decreasing ratios $p_i/s_i$, $\epsilon_+$ is the total probability, with respect to $X$, that a ratio smaller than or equal to $p_{\tilde i}/s_{\tilde i}$ is sampled. In the final step we used Cantelli's inequality (Lemma 21) with random variable $-X$ and $\lambda \equiv S(\rho\|\sigma) - \log A$, together with the fact that the mean and variance of $-X$ are given by $-S(\rho\|\sigma)$ and $V(\rho\|\sigma)$, respectively. The claim then follows by a simple rearrangement of the terms above.
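The Cantelli step above can be illustrated numerically: for the random variable $X$ taking value $\log(p_i/s_i)$ with probability $p_i$, the lower-tail probability is bounded by $V/(V + \lambda^2)$. A toy check of this bound (our notation, base-2 logarithms):

```python
import math

p = [0.5, 0.25, 0.15, 0.1]
s = [0.25, 0.25, 0.25, 0.25]

# Relative surprisal values X_i = log2(p_i / s_i), sampled with probability p_i.
x = [math.log2(pi / si) for pi, si in zip(p, s)]
mean = sum(pi * xi for pi, xi in zip(p, x))               # equals S(p||s)
var = sum(pi * (xi - mean) ** 2 for pi, xi in zip(p, x))  # equals V(p||s)

# Cantelli's inequality: Prob(X <= mean - lam) <= var / (var + lam^2) for lam > 0.
for lam in (0.5, 1.0, 2.0):
    tail = sum(pi for pi, xi in zip(p, x) if xi <= mean - lam)
    assert tail <= var / (var + lam ** 2) + 1e-12
```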
Lemma 32. Let $\epsilon \in [0, 1]$ and let $\rho, \sigma$ be two commuting $d$-dimensional states that satisfy $\mathrm{supp}(\rho) \subseteq \mathrm{supp}(\sigma)$, and denote the $\epsilon$-flat approximation of $\rho$ w.r.t. $\sigma$ as $\rho^{\epsilon}_{\mathrm{fl}}$, according to Definition 5. Then,

To see the left equality, note that we have defined $\tilde i$ and $\epsilon_-$ in such a way that, in terms of the notation of Definition 5, $M = \tilde i - 1$.
Even in the special case of $\tilde i = 1$, the above ratio is well-defined. Moreover, the definition of the values $\tilde p_i$ ensures equality of the ratios $\tilde p_i/s_i$ for $i \le M$. The right inequality, on the other hand, is a property of the flat approximation that is proven in [24]. Together, they imply that

Therefore, by Lemma 30, Lemma 28 and Remark 29, we know that $L_{\rho^{\epsilon}_{\mathrm{fl}}|\sigma}(x) \le L_{\rho^{\epsilon_-}_{\mathrm{fl}}|\sigma}(x) \le l_B(x)$. Our goal is then to show that $B \le 2^{S(\rho\|\sigma) + f_\sigma(\rho, \epsilon)}$. By positivity of $f_\sigma(\rho, \epsilon)$, this is clearly true whenever $S(\rho\|\sigma) \ge \log B$. In case $S(\rho\|\sigma) < \log B$, we again consider a real-valued random variable $X$ with sample space $\Omega = \{1, 2, \dots, d\}$, distributed as $\mathrm{Prob}(X = \log(p_i/s_i)) = p_i$. We then have

where, in the last step, we used Cantelli's inequality with $X$ and $\lambda \equiv \log(B) - S(\rho\|\sigma) > 0$. The claim then follows by rearranging the terms in the above inequality.