Bayesian estimation of correlation functions

We apply Bayesian statistics to the estimation of correlation functions. We give the probability distributions of auto- and cross-correlations as functions of the data. Our procedure uses the measured data optimally and reports the certainty level of the estimation. Our results apply to general stationary processes, and their essence is a non-parametric estimation of spectra. It allows one to better understand the statistical noise fluctuations, assess the correlations between two variables, and postulate parametric models of spectra that can be further tested. We also propose a method to numerically generate correlated noise with a given spectrum.


I. INTRODUCTION
In this article, we propose a rigorous method to estimate correlation functions. By "rigorous" we mean a non-parametric estimation based on Bayesian statistics, free of ad hoc procedures, ansätze, or assumptions going beyond the observed data. We use this notion to distinguish our approach from the orthodox methods, as we explain below. To motivate our work, we briefly introduce the concept of correlation and discuss why its estimation can benefit from a Bayesian approach.
Correlation is intimately related to causality and prediction. For this reason, in physics one often deals with detecting, quantifying, explaining, and testing correlations. A common way to analyze them is through correlation functions. Roughly, a correlation function ⟨A(t)B(t′)⟩ is the expected value of the product of two variables (observables) A and B measured at times t and t′.
Correlation functions are ubiquitous in, and often the cornerstone of, different theories and formalisms. They play a major role in the study of noise, particularly regarding spectra. [1-3] Actually, a spectrum is the Fourier transform of a correlation function. Spectral analysis has been the main tool for the study of noise in very different contexts for more than a century. [4-10] In the past, the application of spectral analysis has ranged from the pioneering works on vacuum tubes [4] and electric circuits [5] to particle scattering, [11] earthquakes and tides, and music and geographic features, [6,7] to name a few examples. Among the remarkable discoveries made along that path, one can highlight the almost universal appearance of 1/f or flicker noise, whose origin is still under debate. [7-10] Today, noise is also in the spotlight in cutting-edge fields like quantum computing: It requires using error correction, [12] it affects coherence times, and it drives spatial correlations between qubits. [13,14] With emerging fields exploiting Noisy Intermediate-Scale Quantum (NISQ) devices, [15,16] there is a need for a better understanding of the properties and origin of noise.
The estimation of correlation functions emerges then as a relevant task. However, the current standard techniques to estimate correlation functions leave room for improvement. By "standard", "traditional", or "orthodox" we mean non-Bayesian. To cast more light on our motivations, we point out some issues in traditional parameter estimation. Later, we introduce Bayesian statistics as the way to sort them out and achieve a better (and actually optimal, as we argue below) estimation of correlation functions, which is the main goal of our paper.
Standard statistics estimates a variable by yielding a "best guess" of its value, called an estimator, and a confidence interval that reflects its uncertainty. Two familiar examples of estimators are the mean and the median. In the context of spectral analysis, we can mention the periodogram [1] (defined and discussed in the main text) as the most common estimator of the spectrum. A fundamental issue in standard estimation is the arbitrariness in the choice of an estimator. For example, there is no universally applicable ordering that prefers the mean over the median or vice versa. Several arguments can support either according to different criteria, but still, in the traditional approach, one estimator has to be chosen. Even the often adopted unbiased estimators [1] have drawbacks under certain circumstances. [17] Ultimately, there is no universal "best estimator".
A similar arbitrariness also affects the choice of parameters to quantify errors, test hypotheses, and, most importantly for the subject of this article, assess correlations. For example, we will discuss in the main text the limitations of Pearson's r coefficient [18] in quantifying the correlation between two variables. There is always the doubt that a different parameter could serve that purpose better. The search for "better" r coefficients leads to a series of more complicated definitions [1] like nonparametric or rank correlation, [1] that end up raising the same kind of doubts: "It is precisely the uncertainty in interpreting the significance of the linear correlation coefficient r that leads us to the important concepts of nonparametric or rank correlation." [1] Moreover, the statistical fluctuations of these coefficients, necessary to assess the confidence interval, are another matter that quickly gets similarly nontransparent: New (and potentially arbitrary) parameters are needed to analyze these fluctuations, making the problem scale in complexity. The whole process becomes less transparent and more dependent on assumptions not always met in practice, such as the asymptotic conditions required by the central limit theorem. For example, Ref. 19 studies the statistical properties of 1/f noise analyzing the "variance of the variance" of the periodogram, namely "the error of the error." Quantities like these are difficult to interpret and add elements of arbitrariness. The variance of the variance of the variance would be the next variable to look at, in a presumably never ending progression (see footnote 1).

This arbitrariness manifests a fundamental issue in traditional estimation: One cannot guarantee the optimal usage of the data, or optimal processing of information, when performing any inference. By inference we mean, for instance, estimating a variable, assessing correlations, or testing a hypothesis. And with "optimal usage of the data," we mean that no relevant information contained in the data about the inference is lost in the inference process.
Bayesian probability theory, [17,21] or Bayesian statistics, puts an end to these issues. This theory is solidly built as an extension of logic, departing from the minimal Cox's axioms, which include Bayes' rule, hence the name. [17,21] Throughout its construction, there is no arbitrary assumption or ad hoc choice of parameters, of estimators, or of statistical tests. Instead, Bayesian statistics focuses on the calculation of the probability distributions that encode all our knowledge (including measured data), and nothing but our knowledge, about a variable (see footnote 2). In this way, Bayesian statistics guarantees the optimal usage of the data when making inferences. The interested reader is referred to Ref. 17 for a complete overview of fundamental historical problems and paradoxes of the orthodox theory that only a Bayesian approach can resolve (see footnote 3).

Footnote 1: On the other hand, if it is not a (two-point) correlation function but noise as a phenomenon which is in the focus, one should note that the former does not fully describe the latter. Various procedures have been put forward to test whether the noise, or the noisy system, is equilibrium, Markovian, Gaussian, linear, time-reversal symmetric, or stationary [7,9,20].

Footnote 2: Remarkably, Bayesian statistics rejects the idea of observation as a sampling of the "true" probability distribution of a variable. Actually, the "true" probability distribution is an ill-defined concept, in contrast to a probability distribution that just reflects our knowledge. These distinctions might seem irrelevant at first sight, but they become decisive on numerous occasions. The method presented in this work is one of them, as we will explicitly point out.
Footnote 3: Another outstanding achievement of Bayesian statistics is the formulation of the whole of statistical mechanics from information theory, requiring a minimal set of axioms [22].

Consequently, Bayesian statistics can enhance parameter estimation. Surprisingly, beyond a few exceptions in some specific problems, [3,23] the Bayesian estimation of correlation functions has not been fully developed. In this article, we take up this task. Our work can find the following applications. First and foremost, in the context of spectral analysis, our results allow one to estimate the spectrum including the quantification of the corresponding statistical estimation errors (encoded in probability distributions or in the error bars calculated from them). We emphasize that these statistical uncertainty measures (such as error bars) are produced also in the non-parametric estimation, which is the topic of this article. Though we do not pursue it here, one could in turn use the results of a non-parametric estimation for a further parametric estimation, that is, for the estimation of parameters of a specific model of the spectrum. [3] Second, our results also allow one to discuss with generality several statistical properties of noise and the periodogram, like the signal-to-noise ratio (whose value we prove to be universal in the non-parametric estimation). Third, we develop a formalism to judge, with its corresponding uncertainty, the correlation between two generic variables, without involving arbitrary parameters. Finally, we provide a method to numerically generate correlated noise with an arbitrary spectrum.
The paper is organized as follows. In Sec. II we introduce the Bayesian formalism through a simple example and set the basis for the estimation of correlation functions. In Sec. III we study auto-correlations, namely the correlation functions of a time-dependent variable with itself. This includes a subsection discussing the statistical properties of the periodogram and the generation of uncorrelated noise. Sec. IV is parallel to Sec. III but focuses on the correlation between two variables, the cross-correlation. It also proposes a method to generate correlated noise with an arbitrary spectrum. In Sec. V we illustrate our results numerically, and in Sec. VI B we give some guidelines on estimating continuous spectra. Sec. VII contains the conclusions. To improve the flow of the main text, topics which are technical or can be discussed separately are delegated to appendices: In App. A we prove that the multivariate Gaussian distribution maximizes entropy, in App. B we derive the prior distributions from invariance principles, in App. C we derive Eq. (24), App. D contains a useful identity for the zero-frequency case, and in App. E we give the estimating distribution for the unnormalized correlation strength. The remaining appendices are especially notable for practical usage of our formulas: In App. F we explain how to merge points in a spectral plot, superseding artificial "windowing" or batching of the data, in App. G we explain how to rigorously correct the estimating distributions for errors on input, and in App. H we give the central-limit versions of the estimating distributions.

II. VARIANCE IN ONE DIMENSION
In this section, we use a simple example to introduce Bayesian statistics and to set the basis of parameter estimation. However, the results presented here go beyond mere illustration purposes and directly apply to spectral analysis. The next sections will make the connection clear.
Before going into details, we recall a few definitions and conventions from probability theory and logic. Everything inside the probability symbol P is to be interpreted as a logical statement. P(a) represents the probability that the statement a is true. P(a ∪ b) is the probability that either a or b (or both) are true. P(a ∩ b) is the probability that both a and b are true. However, usually the logical "and" operator is substituted by a comma or even omitted: P(ab) = P(a, b) ≡ P(a ∩ b). The conditional probability P(a|b) ≡ P(ab)/P(b) denotes the probability that a is true provided that b is true. When considering the probability that a variable x has a value x′, one should write P(x = x′) (note that x = x′ is a logical statement), but the notation is usually simplified by omitting the variable name: P(x′) ≡ P(x = x′). At last, the mean of the expression f(x), with the variable x having a probability distribution P(x), is

⟨f(x)⟩ = ∫ dx f(x) P(x). (1)

The mean and variance of x, the two central quantities in this article, are defined as

ν = ⟨x⟩, Λ = ⟨(x − ν)²⟩. (2)

Having set the notation, consider a scalar variable x. Assume we want to estimate its mean ν and variance Λ given a data sample x̂ = {x^(0), ..., x^(M−1)} that contains M values x^(m). One cannot directly apply the definitions ν = ⟨x⟩ and Λ = ⟨(x − ν)²⟩ because P(x) in Eq. (1) is unknown. One usually resorts to standard formulas like

ν̄ = (1/M) Σ_m x^(m), Λ̄ = (1/M) Σ_m (x^(m) − ν̄)². (3)

However, this estimation of ν and Λ is not very informative. To begin with, Eqs. (3) do not report on the confidence of the result and how this confidence depends on the size M of the sample. In other words, ν̄ and Λ̄ are just specific estimator choices or "best guesses" of ν and Λ. We aim at more than that. Our goal is to get the probability distribution of ν and Λ given the observed data, namely P(ν|x̂) and P(Λ|x̂). By estimating the parameters ν and Λ, we mean computing these distributions.
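As a concrete illustration, the two statistics of Eq. (3) take a few lines of code. This is a minimal sketch; the function name is ours, and the 1/M normalization of Λ̄ is assumed (it is consistent with the M = 1 limit discussed later, where a single point gives Λ̄ = 0):

```python
import numpy as np

def sufficient_statistics(x):
    """Return (nu_bar, Lambda_bar) for a data sample x = {x^(0), ..., x^(M-1)}.

    nu_bar is the sample mean; Lambda_bar is the 1/M-normalized sample
    variance, so a single data point (M = 1) gives Lambda_bar = 0.
    """
    x = np.asarray(x, dtype=float)
    M = x.size
    nu_bar = x.sum() / M
    Lambda_bar = np.sum((x - nu_bar) ** 2) / M
    return nu_bar, Lambda_bar

print(sufficient_statistics([1.0, 3.0]))  # (2.0, 1.0)
```

These two numbers are all the estimating distributions below depend on; the rest of the information they carry (e.g. the sample size M) enters through the functional form.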
We will refer to them as the estimating distributions, to make a clear distinction from the sampling distributions defined below. The estimating distributions encode all the information in the data, and nothing but the information in the data, that is relevant for the knowledge of ν and Λ. This includes, for example, the size M of the data sample.
Our goal is to calculate the distributions P(ν|x̂) and P(Λ|x̂), but we start by looking at the sampling distribution P(x̂|νΛ). Its meaning is the following: If we knew only the mean ν and variance Λ of a variable x, what would be the probability to get the data sample x̂?
Shannon's entropy theorem [22,24] is the key to construct sampling distributions. It is about the entropy S of a distribution P(x), defined as

S = −∫ dx P(x) ln P(x).

The theorem proves that S measures the information about the variable x encoded in P(x). Therefore, assuming nothing about x other than its mean ν and variance Λ, the distribution P(x|νΛ) representing the knowledge about x must have the maximal entropy allowed by the constraints ⟨x⟩ = ν and ⟨(x − ν)²⟩ = Λ. In App. A we prove that it is a Gaussian,

P(x|νΛ) = (2πΛ)^(−1/2) exp[−(x − ν)²/(2Λ)]. (4)

It is important to emphasize the following. This result does not mean that the "true" distribution of x is Eq. (4). It means that our knowledge of x is described by Eq. (4). This interpretation is one of the core concepts of Bayesian statistics. In contrast, standard statistics would consider Eq. (4) an approximation or ansatz, even when it is not. The rest of the estimation process, and its validity for a generic variable x, relies on Eq. (4), and hence on this subtle point. Equation (4) gives the probability to observe the single value x based on ν and Λ as the only knowledge about x. The probability to observe the whole data sample x̂ can be obtained from the definition of conditional probabilities, that is, P(ab) = P(a)P(b|a). When b is logically independent of a, namely when P(b|a) = P(b), we have P(ab) = P(a)P(b). Therefore, if the values in the data sample x̂ are logically independent,

P(x̂|νΛ) = ∏_{m=0}^{M−1} P(x^(m)|νΛ). (5)

With this and Eq. (4), we have the sampling distribution P(x̂|νΛ). However, this is not what we want to calculate. For the estimation of ν and Λ, we rather need P(ν|x̂) and P(Λ|x̂). It is possible to get these estimating distributions from the sampling distribution P(x̂|νΛ) by using elementary probability theory. We need to apply just two rules. The first is the marginalization rule,

P(a) = Σ_i P(a b_i), (6)

which can be easily derived from (i) the axiom of additivity for mutually exclusive statements b_i, and (ii) the equivalence between the logical statement a and a ∩ (∪_i b_i), because ∪_i b_i is always true by virtue of the statements b_i exhausting all possibilities. The rule of Eq. (6) allows us to write

P(ν|x̂) = ∫ dΛ P(νΛ|x̂), P(Λ|x̂) = ∫ dν P(νΛ|x̂).

The second rule we need is Bayes' rule,

P(a|b) = P(b|a)P(a)/P(b).

When applied to the integrands, one gets

P(ν|x̂) = ∫ dΛ P(x̂|νΛ)P(νΛ)/P(x̂), P(Λ|x̂) = ∫ dν P(x̂|νΛ)P(νΛ)/P(x̂). (7)

These equations express the estimating distributions as functions of the sampling distribution P(x̂|νΛ) and constitute, together with Shannon's entropy theorem, the essence of parameter estimation from the perspective of Bayesian statistics.
The steps that led to Eq. (7) will be repeatedly used throughout the article. We can look at P(x̂) in the denominators as a mere normalization constant. P(νΛ) is the prior probability for the mean and variance, and it represents our knowledge about ν and Λ previous to the acquisition or analysis of any data. Assigning priors has been subject to extensive discussions, see Ref. 17. We follow the approach advocated therein, and assign priors based on requirements of symmetry (invariance under certain transformations). Accordingly, in App. B we prove that P(νΛ) ∝ 1/Λ keeps P(νΛ) invariant under "frame" transformations of the variable x, that is, under changes of the units and of the reference value chosen to measure x relative to. Performing the integrations, we get Eq. (7) in a closed form,

P(ν|x̂) ∝ [Λ̄ + (ν − ν̄)²]^(−M/2), (8a)
P(Λ|x̂) ∝ Λ^(−(M+1)/2) exp(−M Λ̄/(2Λ)). (8b)

Remarkably, these distributions depend only on two variables that are functions of the data, or sufficient statistics, namely ν̄ and Λ̄. Not surprisingly, they are given by the well-known formulas of Eqs. (3). However, the estimating distributions allow one to evaluate any desired estimator. For example, the maximum-likelihood estimators are the values of ν and Λ that maximize P(ν|x̂) and P(Λ|x̂), respectively. Equation (8) gives

mle(ν) = ν̄, mle(Λ) = M Λ̄/(M + 1).

Interestingly, mle(Λ) significantly differs from Λ̄ for low values of M. In a similar way, we could use the estimating distributions to calculate, for example, the confidence intervals. Before illustrating the distributions, we examine Eq. (8) in certain special limits. The case M = 1 corresponds to measuring a single value x^(0), giving ν̄ = x^(0) and Λ̄ = 0 by Eq. (3). Evaluating Eq. (8a) with Λ̄ = 0 and M = 1 is undefined, but let us take the limits Λ̄ → 0 and then M → 1:

P(ν|x^(0)) ∝ 1/|ν − x^(0)|.

That is, the estimating distribution for the mean that we deduce from a single data point is centered at that point and falls off as 1/|ν − x^(0)| symmetrically to both sides. Keeping M at a finite, albeit possibly small, distance above 1 is a way of regularizing this non-normalizable distribution. The same happens with Eq. (8b): For M = 1, the formula should be interpreted as a limit towards a non-normalizable distribution 1/Λ, with the parameters M → 1 and Λ̄ → 0 possibly providing effective cut-offs for the divergences at both the Λ → 0 and Λ → ∞ tails of the distribution. Since a single point delivers no information about the variance, our estimating distribution equals the prior distribution.
Remarkably, one can meaningfully interpret Eq. (8) even for M = 0, even though the expressions in Eq. (3) are undefined. Taking them as regularization parameters instead, we get that for M → 0 also Eq. (8a) reduces to the corresponding prior, P(ν|no data) = P(ν). Recovering the prior distributions when the data deliver no information is an appealing property of Eq. (8).
As a final example, consider that two points were measured, M = 2, and they came out exactly the same, so that Λ̄ = 0. Equation (8a) now becomes a regularized delta function centered at ν̄ with the width Λ̄ → 0. In reality, the values x^(m) are measured with some finite precision. We therefore obtain another natural result: if Λ̄ = 0, one can be sure that ν = ν̄, but only up to the precision with which Λ̄ = 0 holds, which is never infinite in reality.
As stressed in Ref. 17, this behavior is the hallmark of Bayesian statistics. Results such as Eq. (8), which are based on Bayes' rule and other rules backed by the probability-theory axioms used in Eq. (7), remain valid even in "extreme" cases. We will not repeatedly alert the reader to this fact for the formulas to come. Nevertheless, in Sec. VI B we discuss some paradoxes which occur for M = 1 that are of a different type.
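The estimating distributions are straightforward to tabulate numerically. The sketch below assumes the unnormalized forms P(ν|x̂) ∝ [Λ̄ + (ν − ν̄)²]^(−M/2) and P(Λ|x̂) ∝ Λ^(−(M+1)/2) exp(−MΛ̄/2Λ), which are consistent with the limiting cases just discussed (a 1/|ν − ν̄| fall-off and the 1/Λ prior for M = 1); the function names are ours:

```python
import numpy as np

# Unnormalized estimating distributions (assumed forms; see lead-in text).
def p_nu(nu, nu_bar, Lambda_bar, M):
    return (Lambda_bar + (nu - nu_bar) ** 2) ** (-M / 2)

def p_Lambda(Lam, Lambda_bar, M):
    return Lam ** (-(M + 1) / 2) * np.exp(-M * Lambda_bar / (2 * Lam))

M, nu_bar, Lambda_bar = 5, 1.0, 1.0

# The posterior of nu peaks at the sample mean nu_bar.
nu_grid = np.linspace(-3.0, 5.0, 100_001)
mle_nu = nu_grid[np.argmax(p_nu(nu_grid, nu_bar, Lambda_bar, M))]

# The posterior of Lambda peaks below Lambda_bar: M * Lambda_bar / (M + 1).
Lam_grid = np.linspace(1e-3, 5.0, 500_000)
mle_Lam = Lam_grid[np.argmax(p_Lambda(Lam_grid, Lambda_bar, M))]
print(mle_nu, mle_Lam)  # mle_nu near 1.0, mle_Lam near 5/6
```

The grid search reproduces the analytic maximum-likelihood estimators; for M = 5 the shift of mle(Λ) away from Λ̄ is already a noticeable 17%.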
To conclude the discussion of their general aspects, we note that having the estimating distributions P(ν|x̂) and P(Λ|x̂), we know everything that can be known about ν and Λ from the observed data x̂ and, we emphasize again, nothing more than the data and the prior invariance principles: no model of x or ansatz for the "true" P(x) has been invoked. These estimating distributions give more information than just the estimators ν̄ and Λ̄: Spread or narrow, they express our confidence in the knowledge of ν and Λ. We will illustrate it with a figure shortly, but let us first consider a particular case relevant for spectral analysis.
Assume that the mean of x is known, ν = ν₀. Proceeding as before, we get the estimating distribution of the variance in the form

P(Λ|ν = ν₀, x̂) ∝ Λ^(−(M/2+1)) exp(−M Λ̄/(2Λ)). (9)

Compared to Eq. (8b), there is a difference in the power exponent. Also, Λ̄ should here be calculated according to the second expression in Eq. (3), replacing ν̄ → ν₀.
Though not needed here, for later convenience we note that this short calculation can also be performed in the following way,

P(Λ|ν = ν₀, x̂) = ∫ dν P(x̂|νΛ)P(Λ)P(ν)/P(ν = ν₀, x̂), (10)

using the prior P(ν) = δ(ν − ν₀). More importantly for spectral estimation, we plot Eq. (9) in Fig. 1 for different values of M and Λ̄ = 1. The information content of the full estimating distribution can be appreciated from the figure. Despite all the plotted distributions corresponding to the same variance estimator (Λ̄ = 1), they clearly convey different knowledge about Λ. One way to illustrate it is to look at the dependence of the confidence-interval size on the sample size M, plotted in the right panel of Fig. 1. The change of the information content with the size M of the data sample is obvious.
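The dependence of the confidence-interval size on M shown in Fig. 1 can be reproduced numerically by integrating the distribution of Eq. (9) on a grid. The sketch below assumes the inverse-gamma-shaped form quoted above; the equal-tail interval and the 68% level are illustrative choices of ours:

```python
import numpy as np

def credible_interval(Lambda_bar, M, level=0.68):
    """Equal-tail interval for Lambda from the (assumed) known-mean
    estimating distribution
    P(Lambda | nu0, data) ∝ Lambda^-(M/2 + 1) exp(-M Lambda_bar / (2 Lambda))."""
    grid = np.linspace(1e-4, 30.0 * Lambda_bar, 300_000)
    p = grid ** (-(M / 2 + 1)) * np.exp(-M * Lambda_bar / (2.0 * grid))
    cdf = np.cumsum(p)
    cdf /= cdf[-1]  # grid is equally spaced, so cumsum ~ CDF after normalization
    lo = grid[np.searchsorted(cdf, (1.0 - level) / 2)]
    hi = grid[np.searchsorted(cdf, (1.0 + level) / 2)]
    return lo, hi

# The interval shrinks around Lambda_bar = 1 as the sample size M grows:
print(credible_interval(1.0, 4))
print(credible_interval(1.0, 100))
```

Running it for several M values reproduces the qualitative message of the right panel of Fig. 1: the same Λ̄ with more data means a much sharper statement about Λ.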
As a final remark, we emphasize that in this section, with the main results being Eqs. (8) and (9), we have indeed estimated a correlation function: The variance Λ = ⟨(x − ⟨x⟩)²⟩ is the correlation function of x with itself. What we do next is to extend this procedure to correlation functions of (possibly two different) variables that evolve in time. The essence of the parameter-estimation formalism remains identical: First, we find the sampling distribution that maximizes the entropy. Then, we apply Bayes' rule to calculate the estimating distribution. The main difference is the passage to Fourier space in order to simplify both distributions.

III. AUTO-CORRELATIONS
This section extends the estimation of Sec. II to a real variable A that evolves in time. This scenario includes the study of noise. The results that we present allow one to estimate the auto-correlation matrix of A (defined below) with Bayesian statistics. We establish connections with the common estimators used in the literature, such as the periodogram. Along the lines of Sec. II, our results contain more information.
Consider M independent batches of n measurements of A in time: Â ≡ {A^(0), ..., A^(M−1)}, with A^(m) = (A^(m)_0, ..., A^(m)_{n−1})^T. The super- and subscripts are the batch and time index, respectively. Different batches correspond to uncorrelated copies or runs of the experiment. Let the time interval ∆ between A^(m)_j and A^(m)_{j+1} be the same for all m and j. Let

µ = ⟨A_j⟩

be the mean of the variable A (the same for all j) and Σ the correlation matrix, with components

Σ_jk = ⟨(A_j − µ)(A_k − µ)⟩.

Our goal now is to estimate these parameters. We adopt two approximations. First, we impose periodic boundary conditions: A_{j+n} = A_j. These conditions are implicit in the discrete Fourier transforms used throughout this paper. Second, we consider stationary noise, so that Σ_jk only depends on j − k. The scalar function Σ_j defined by Σ_ij = Σ_{i−j} is called the auto-correlation function, or simply correlation function. Stationary noise implies that Σ_j = Σ_{−j}. These two commonly adopted approximations are necessary for the correlation matrix to be diagonal in Fourier space.
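That these two approximations make the correlation matrix diagonal in Fourier space is easy to check numerically: a stationary, periodic Σ is a circulant matrix, and circulant matrices are diagonalized by the discrete Fourier transform. The sketch below uses hypothetical values of Σ_j (even in j, as required):

```python
import numpy as np

n = 8
# Hypothetical correlation function Sigma_j, even in j (Sigma_j = Sigma_{n-j}):
Sigma_row = np.array([1.0, 0.5, 0.2, 0.1, 0.05, 0.1, 0.2, 0.5])
# Circulant correlation matrix Sigma_jk = Sigma_{(j-k) mod n}:
Sigma = np.array([[Sigma_row[(j - k) % n] for k in range(n)] for j in range(n)])

# Unitary DFT matrix F_kj = exp(-2 pi i j k / n) / sqrt(n):
F = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n) / np.sqrt(n)
Lam = F @ Sigma @ F.conj().T

off_diag = Lam - np.diag(np.diag(Lam))
assert np.max(np.abs(off_diag)) < 1e-10  # diagonal in Fourier space
# The diagonal is the Fourier transform of Sigma_j (real, since Sigma_j is even):
assert np.allclose(np.diag(Lam).real, np.fft.fft(Sigma_row).real)
print(np.round(np.diag(Lam).real, 3))
```

Dropping either approximation (aperiodic boundaries or non-stationarity) breaks the circulant structure, and the off-diagonal elements no longer vanish.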
In parallel with Sec. II, we start by giving the sampling distribution of the variable A. Next, we estimate the mean µ and the correlation matrix Σ, and we conclude by discussing some spectral properties of noise.

A. Sampling distribution
Since by definition the M data batches are independent, the sampling distribution given only µ and Σ reads

P(Â|µΣ) = ∏_{m=0}^{M−1} P(A^(m)|µΣ), (12)

with, as we prove in App. A, the multivariate Gaussian

P(A|µΣ) = (2π)^(−n/2) (det Σ)^(−1/2) exp[−(A − µ⃗)^T Σ^(−1) (A − µ⃗)/2], (13)

and the vector µ⃗ = µ(1, ..., 1)^T. These expressions are analogous to Eqs. (4) and (5). We insist on the fact that this is not an approximation of the distribution of A. Rather, it describes our knowledge based on the mean µ and correlation matrix Σ.
It is convenient to perform a coordinate change to simplify the sampling distribution. The translational invariance encoded in Σ_ij = Σ_{j−i} hints at switching to Fourier space. The Fourier transform of each batch (m) reads

α^(m)_k = (1/√n) Σ_{j=0}^{n−1} A^(m)_j e^(−i2πjk/n).

In agreement with our previous notation, we define α̂ ≡ {α^(0), ..., α^(M−1)}. Denoting by F the Fourier-transform matrix, with components

F_kj = e^(−i2πjk/n)/√n,

and thus unitary, we can write α^(m) = F A^(m). Similarly, we define the vector ν⃗ ≡ F µ⃗. All components of this vector except the first are zero, ν⃗_k = ν δ_{k,0}, where, for later convenience, we define ν ≡ µ√n. As mentioned, with periodic boundary conditions the correlation matrix is diagonal in Fourier space, with diagonal elements

Λ_k = Σ_{j=0}^{n−1} Σ_j e^(−i2πjk/n).
Notice the difference in notation between the correlation matrix Λ, with components Λ_jk, and the scalar function (of k) Λ_k. Λ_k gets the same name as Σ_j, namely auto-correlation function or simply correlation function. Since it is the Fourier transform of Σ_j, it is also called the spectrum. Another notation caveat is pertinent. The bijections between A^(m) and α^(m), and between Σ and Λ, allow one to write the sampling distribution in several equivalent forms: Basically, the data (in real or Fourier space) are on the left of the bar, whereas the mean and correlation matrices (in real or Fourier space) are on the right. These changes are trivial in the Bayesian formalism as introduced in Sec. II. We will express the sampling distribution in any of these forms according to our convenience. Also, distributions like P(α^(m)_k|µΛ_k) may be referred to as "sampling distributions" with no risk of confusion. Analogously, the estimating distributions will be written in different equivalent forms, easily identifiable because the mean and correlation matrix will lie on the left of the bar, and the data on the right. In a similar spirit, we can replace the mean µ by the vector ν⃗ or its only non-zero component ν.
The sampling distribution of Eq. (12) becomes much simpler in the α coordinates:

P(α̂|ν⃗Λ) = ∏_{m=0}^{M−1} ∏_{k=0}^{⌊n/2⌋} P(α^(m)_k|νΛ_k), (17)

with ⌊x⌋ the largest integer smaller than or equal to x, d_k = 1/2 for the real components k = 0 and k = n/2 (the latter present for even n), d_k = 1 otherwise, and

P(α^(m)_k|νΛ_k) = (πΛ_k/d_k)^(−d_k) exp(−d_k |α^(m)_k − ν δ_{k,0}|²/Λ_k). (18)

The k-th component of the vector ν⃗ is the mean of the variable α_k; out of these, the only one with a non-zero mean is α_0. This remark will be relevant in the next subsection for the choice of priors. It also allows us to write P(α^(m)_k|Λ_k) for k ≠ 0. A last detail worth discussing about Eq. (17) is that k ranges from 0 to ⌊n/2⌋ (and not n). This, as well as the origin of the factor d_k, is due to (i) A^(m) being real, implying α^(m)_{n−k} = (α^(m)_k)*, and (ii) Σ_j being even in j, implying Λ_{n−k} = Λ_k. (In these relations, the subscript indexes are understood mod n.) In other words, the independent coordinates in α^(m) can be taken as α^(m)_k for k = 0, ..., ⌊n/2⌋, and, similarly, Λ is fully parametrized by the independent values Λ_k with k = 0, ..., ⌊n/2⌋. The sampling distribution already encodes this information.
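The two properties behind the restricted range k = 0, ..., ⌊n/2⌋ can be verified directly. The sketch below assumes the unitary convention F_kj = e^(−i2πjk/n)/√n, which numpy implements through the norm="ortho" option of its FFT:

```python
import numpy as np

def to_fourier(A):
    """Unitary DFT alpha = F A with F_kj = exp(-2 pi i j k / n) / sqrt(n)."""
    return np.fft.fft(A, norm="ortho")

n = 8
A = np.random.default_rng(0).normal(size=n)  # one real data batch
alpha = to_fourier(A)

# Unitarity: the transform preserves the norm (Parseval).
assert np.isclose(np.sum(np.abs(alpha) ** 2), np.sum(A ** 2))
# Reality of A implies alpha_{n-k} = conj(alpha_k) for k = 1, ..., n-1,
# so only k = 0, ..., n/2 are independent (and alpha_0, alpha_{n/2} are real).
assert np.allclose(alpha[1:][::-1], np.conj(alpha[1:]))
assert abs(alpha[0].imag) < 1e-12 and abs(alpha[n // 2].imag) < 1e-12
print("reality and unitarity checks passed")
```

This is why a real batch of n points carries only ⌊n/2⌋ + 1 independent spectral components, with the endpoints k = 0 and k = n/2 counting "half" (d_k = 1/2).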

B. Estimation of auto-correlations
The estimation of the correlation matrix Σ requires the calculation of P(Σ|Â). Equivalently, we can estimate its transform Λ with P(Λ|α̂). This is easier because P(α̂|µΛ) manifestly factors in k, see Eq. (17). Assuming a factored prior (as argued in App. B) and using Bayes' rule, it is easy to see that P(Λ|α̂) also factors:

P(Λ|α̂) = ∏_{k=0}^{⌊n/2⌋} P(Λ_k|α_k), (19)

where we denoted α_k ≡ {α^(0)_k, ..., α^(M−1)_k}. Equation (19) greatly simplifies the estimation of Λ: It decouples the estimation of the matrix Λ, dependent on all the (Fourier-transformed) data α̂, into the estimation of the diagonal elements Λ_k for k = 0, ..., ⌊n/2⌋, each of which only depends on α_k. In this way, the problem reduces to the one solved in Sec. II. This connection is not unexpected: Since the change to Fourier space diagonalizes Σ, the multivariate Gaussian of Eq. (13) is mapped to a product of one-variable Gaussians. Thus, spectral analysis reduces to the estimation of the variance in each of the ⌊n/2⌋ + 1 independent one-dimensional problems labeled by k = 0, ..., ⌊n/2⌋.
One could now in principle read off the result from Eqs. (8) and (9), noting that Eq. (4) gives Eq. (18) upon replacing 1/2 → d_k. However, it might be easier to repeat the calculation: Λ_k can be estimated with Bayes' rule, the sampling distribution of Eq. (18), and the prior (cf. App. B)

P(Λ_k) ∝ 1/Λ_k.

The resulting estimating distribution of Λ_k reads

P(Λ_k|α_k) ∝ Λ_k^(−Md_k−1) exp(−M d_k Λ̂_k/Λ_k), (21)

with the sufficient statistics

Λ̂_k = (1/M) Σ_{m=0}^{M−1} |α^(m)_k|². (22a)

This is the main result of this section. In the literature, the quantities Λ̂_k (sometimes up to a factor) are called the periodogram. Whereas Λ_k is commonly estimated with Λ̂_k, our distribution P(Λ_k|α_k) contains more information. Figure 1 and its discussion in Sec. II apply to Λ_k and make this evident. One important application of our results can take place in the design of experiments: As the right bottom plot of Fig. 1 reflects, one can figure out the amount of data needed in order to reach a certain confidence level in Λ_k (that is, in the spectrum). This precise assessment of confidence would be difficult without the estimating distributions. One would be limited, at most, to the asymptotic analysis of the periodogram fluctuations to give a rough idea of the error. The maximum-likelihood estimator for Λ_k is

mle(Λ_k) = M d_k Λ̂_k/(M d_k + 1),

and we repeat that it tends to the periodogram value Λ̂_k only for large M, and that the estimating distribution P(Λ_k|α_k) allows the calculation of the corresponding confidence interval as explained in Fig. 1.
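In practice, the sufficient statistics can be computed directly from the raw batches. This is a minimal sketch, assuming the definition Λ̂_k = (1/M) Σ_m |α^(m)_k|² (up to the overall factor conventions mentioned above) and the unitary FFT convention:

```python
import numpy as np

def periodogram(batches):
    """Lambda_hat_k = (1/M) sum_m |alpha_k^(m)|^2 for k = 0, ..., n//2.

    batches: array of shape (M, n), M real batches of n time samples each.
    Uses the unitary FFT convention (norm="ortho").
    """
    alpha = np.fft.fft(np.asarray(batches, dtype=float), axis=1, norm="ortho")
    n = alpha.shape[1]
    return np.mean(np.abs(alpha) ** 2, axis=0)[: n // 2 + 1]

# Sanity check: zero-mean white noise of unit variance has a flat spectrum.
rng = np.random.default_rng(0)
M, n = 4000, 16
Lam_hat = periodogram(rng.normal(size=(M, n)))
print(np.round(Lam_hat, 2))  # all entries close to 1
```

Feeding these Λ̂_k (together with M and d_k) into the estimating distribution then gives the full Bayesian statement about the spectrum, instead of the point estimate alone.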
The estimation of the mean is completely identical to that of Sec. II, with the result given in Eq. (8a). We copy it here, converting to the notation used in this section,

P(ν|α_0) ∝ [Λ̄_0 + (ν − ν̄)²]^(−M/2),

with ν̄ = (1/M) Σ_m α^(m)_0 and Λ̄_0 = (1/M) Σ_m (α^(m)_0 − ν̄)². For small values of M, this Lorentzian-like function significantly differs from the Gaussian that one might conjecture under (an abuse of) the central limit theorem. The maximum-likelihood estimator of µ is

mle(µ) = ν̄/√n = (1/(Mn)) Σ_m Σ_j A^(m)_j.

As expected, mle(µ) is the mean of the variable A taken over all batches and all times.

C. Statistical properties of noise
Not only is the sampling distribution P(α^(m)_k|µΛ_k) necessary to estimate Λ_k as in Sec. III B, it is also a key to understanding the statistical properties of noise. In this section, we discuss two applications. On the one hand, we show how to generate noise with a given mean µ and spectrum Λ_k. On the other, we find the exact expression for the signal-to-noise ratio (SNR) of the periodogram. Our derivation will clarify and extend some aspects of the SNR that can be found in the literature. [1,2,10,26] In particular, we elaborate on its value of order 1 irrespective of the data-sample size, arguing that it is universal for any kind of noise. We offer a simple general proof that avoids relying on asymptotic limits, since these are not necessarily met in practice.
We focus first on the generation of noise with a given mean µ and spectrum Λ_k. Since α^(m)_k ∈ C, it is convenient to get the sampling distributions of its real and imaginary parts, Rα^(m)_k and Iα^(m)_k, separately. From Eq. (18), Rα^(m)_k is a Gaussian variable centered at ν δ_{k,0} with variance Λ_k/(2d_k); the distribution for Iα^(m)_k is analogous, with zero mean, the imaginary part vanishing for the real components k = 0 and k = n/2. This suggests the following procedure: Draw Rα^(m)_k and Iα^(m)_k from these Gaussians for k = 0, ..., ⌊n/2⌋, fix the remaining components through α^(m)_{n−k} = (α^(m)_k)*, and then Fourier transform α^(m) to get A^(m). Repeat the process to generate M batches. This is in agreement with Ref. 10, which restricts itself to 1/f^β noise and starts from the statistical properties of white noise (Λ_k ∝ 1). Our approach does not impose any condition on the noise other than being stationary. In fact, the statistical properties of noise, given by the sampling distribution P(α̂|µΛ) and common to every function Λ_k, should not be confused with the particular shape of the function Λ_k, which describes a specific kind of noise. With our sampling distribution, one can generate noise with arbitrary spectra. This discussion sets the basis for Sec. IV C, which extends the method to the generation of correlated noise.
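The recipe above can be sketched as follows. The variance assignments (Λ_k/2 per quadrature for the complex components, Λ_k for the real components k = 0 and n/2) are our reading of the sampling distribution under the unitary FFT convention, and the mean is set to zero for simplicity:

```python
import numpy as np

def generate_noise(spectrum, rng):
    """One batch A^(m) of zero-mean stationary noise with a given spectrum.

    spectrum: Lambda_k for k = 0, ..., n/2 (n even), real and non-negative.
    Draws Re/Im of alpha_k as Gaussians, enforces the reality condition
    alpha_{n-k} = conj(alpha_k), and inverse-transforms (unitary convention).
    """
    n = 2 * (len(spectrum) - 1)
    alpha = np.zeros(n, dtype=complex)
    alpha[0] = rng.normal(scale=np.sqrt(spectrum[0]))        # real component
    alpha[n // 2] = rng.normal(scale=np.sqrt(spectrum[-1]))  # real component
    for k in range(1, n // 2):
        alpha[k] = (rng.normal(scale=np.sqrt(spectrum[k] / 2))
                    + 1j * rng.normal(scale=np.sqrt(spectrum[k] / 2)))
        alpha[n - k] = np.conj(alpha[k])
    return np.fft.ifft(alpha, norm="ortho").real

# Sanity check: a flat spectrum Lambda_k = 1 should give unit-variance noise.
rng = np.random.default_rng(2)
flat = np.ones(9)  # n = 16
A = np.stack([generate_noise(flat, rng) for _ in range(4000)])
print(round(float(A.var()), 2))  # close to 1
```

Replacing `flat` by, e.g., `1.0 / np.arange(1, 10)` produces a 1/f-like batch; nothing in the construction is specific to any spectral shape, in line with the discussion above.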
We proceed to discuss some statistical properties of the periodogram. The literature reports that the periodogram Λ̂_k for one batch, given by Eq. (22a) with M = 1, has SNR ∼ 1 irrespective of the size n of the data sample. In other words, its value fluctuates a lot for different measurements of the same phenomenon (i.e., when repeating an experiment or measuring a new batch), displaying a relative error around 1. We offer a simple and general proof that this holds for any kind of noise. Once again, the key point is to deal with distributions instead of estimators. According to App. C, the distribution for the periodogram value Λ̂_k depends only on Λ_k and reads

P(Λ̂_k|Λ_k) ∝ Λ̂_k^(Md_k−1) exp(−M d_k Λ̂_k/Λ_k). (24)

This distribution (see footnote 10) has the moments (see footnote 11)

⟨Λ̂_k^s⟩ = [Λ_k/(M d_k)]^s Γ(M d_k + s)/Γ(M d_k),

the integral converging for s + M d_k > 0. We observe that while the maximum of the distribution is not at Λ̂_k = Λ_k, the mean of the observed values of Λ̂_k is equal to Λ_k. For the signal-to-noise ratio, we get

SNR ≡ ⟨Λ̂_k⟩/⟨(Λ̂_k − ⟨Λ̂_k⟩)²⟩^(1/2) = (M d_k)^(1/2). (25)

For k ≠ 0 and taking M = 1, we finally recover the result SNR = 1. Our derivation confirms that the SNR does not depend on the batch size n, but only on the number of data batches M. The paradox that the SNR does not increase with n is discussed in Sec. VI C in the context of continuous spectra. We also point out that Eq. (25) agrees with the dependence √M when M → ∞, expected from the law of large numbers [2].
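The SNR statement is easy to check by simulation. The sketch below uses white Gaussian noise, a convenient special case (the proof above does not need it), and estimates the relative fluctuation of the single-batch periodogram value at one finite frequency:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, k = 64, 20_000, 5  # k != 0: a finite-frequency component

# Single-batch (M = 1) periodogram values over many repeated "experiments":
alpha = np.fft.fft(rng.normal(size=(trials, n)), norm="ortho", axis=1)
vals = np.abs(alpha[:, k]) ** 2

snr = vals.mean() / vals.std()
print(round(float(snr), 2))  # close to 1, and independent of n
```

Rerunning with different n confirms that the fluctuations do not shrink with the batch size; averaging over M batches instead improves the SNR as √M.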

IV. CROSS CORRELATIONS
This section extends the estimation of Sec. III to two real variables A and B that evolve in time. Such estimation includes the study of their noise, which might be correlated. The results that we present allow the estimation of the correlation strength and phase (defined below) with Bayesian statistics. We discuss the connection with common estimators used in the literature, like Pearson's r coefficient. Similarly to the previous cases, our results provide more than a suggestion of a specific estimator.
The notation for the variable A is identical to that of Sec. III and extends trivially to the variable B. Let us use X = (A, B) as a composite variable. The batch (m) contains the data X^(m), and the whole data set X̂ comprises the M batches, X̂ = X^(0), …, X^(M−1).
Let the means of A and B be µ_A and µ_B, respectively, and write the correlation matrix in block form,

    Σ ≡ ( Σ^A    Σ^AB
          Σ^BA   Σ^B  ),
^10 For example, Ref. [28] investigated this distribution experimentally and compared the findings with a model where the distribution is Gaussian (which it is not).

^11 As a side remark, we note that from Eqs. (24) and (21) we can get the prior distribution for the periodogram value, P(Λ̂_k) ∝ P(Λ̂_k|Λ_k)/P(Λ_k|Λ̂_k) ∝ 1/Λ̂_k for k ≠ 0 and ∝ 1/Λ̂_0^{1/2} for k = 0. Thus, the prior for a finite-frequency element of the periodogram is 1/Λ̂_k, as expected for a scale variable. However, the symmetry between the priors for Λ_k and Λ̂_k is not valid for k = 0.

with analogous expressions for Σ^B_jk and Σ^BA_jk. As in Sec. III, our only approximations are (i) periodic boundary conditions, A_{j+n} = A_j and the same for B; and (ii) stationary noise, Σ^C_jk = Σ^C_{j−k} for C = A, B, AB, BA. As before, it follows that the scalar functions Σ^A_j and Σ^B_j are even in j. Here, Σ^A_j and Σ^B_j are auto-correlation functions, and Σ^AB_j and Σ^BA_j are the so-called cross-correlation functions. Often, all of them are called correlation functions, without risk of confusion. We note that from now on, the matrix Σ refers to the variable X, not to A or B separately.
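These two approximations make Σ (and each of its blocks) a circulant matrix, which the discrete Fourier transform diagonalizes. The following small check (our own illustration; the sign convention of the DFT matrix is an assumption) verifies this and shows that the eigenvalues are the DFT of the correlation function:

```python
import numpy as np

n = 8
# A correlation function Sigma_j that is even in j (indices understood mod n)
sigma = np.array([2.0, 1.0, 0.5, 0.2, 0.1, 0.2, 0.5, 1.0])
# Stationarity + periodic boundary conditions: Sigma_{jk} = Sigma_{j-k}
Sigma = np.array([[sigma[(j - k) % n] for k in range(n)] for j in range(n)])
# Unitary DFT matrix, playing the role of F in the text
F = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n) / np.sqrt(n)
# Conjugating the circulant matrix with F diagonalizes it
Lam = F @ Sigma @ F.conj().T
```

The off-diagonal elements of `Lam` vanish, and its diagonal equals the DFT of `sigma`, i.e., the spectrum Λ_k.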
As in Sec. III, we first give the sampling distribution of the variable X, then estimate the correlation matrix Σ and the means µ_A, µ_B, and finish with the procedure to generate correlated noise.

A. Sampling distribution
Since the M data batches are independent, the sampling probability, given only µ_A, µ_B, and Σ, factors in the batch index. As we prove in App. A, P(X^(m)|µΣ) is a multivariate Gaussian; the vector µ gathers the means, with components µ_A and µ_B at the positions of A and B, respectively. As in Sec. III A, it is convenient to Fourier transform the variables for each batch (m). We denote by Z the conjugate variable to X; F is the Fourier-transform matrix given by Eq. (15), so F is unitary. For each batch, the coordinates Z^(m) gather the Fourier transforms of A^(m) and B^(m). In the transformed coordinates Z, the correlation matrix takes the form Λ described next.

The notation for α_k^(m) and the spectrum Λ^A_k is inherited from Sec. III, with analogous expressions for B and BA. Therefore, Λ is block diagonal in k, with 2 × 2 blocks Λ_k. To prevent confusion, the reader should note the difference in notation between the 2n × 2n matrix Λ, the 2 × 2 matrices Λ_k, and the scalar functions (of k) Λ^C_k for C = A, B, AB, BA. The functions Λ^C_k are correlation functions in Fourier space and are thus termed spectra; Λ^AB_k and Λ^BA_k are called cross spectra or simply spectra. Below, we also need the Fourier transform of the vector µ, which is a vector with 2n components, only two of which are nonzero. In line with the notation for Λ_k, we collect the two nonzero components into the vector ν. For a convenient parametrization of Λ_k, it is necessary to consider the constraints on its matrix elements; in particular, the elements at k = 0 and, for even n, at k = n/2 are real.
We can now parametrize Λ_k. Each block Λ_k of the correlation matrix is given by the real parameters Λ^A_k, Λ^B_k, the modulus |Λ^AB_k|, and the cross-correlation phase φ_k of the (cross) spectrum. Alternatively, we can consider the real parameters Λ^A_k, Λ^B_k, s_k, and φ_k. The dimensionless parameter s_k quantifies the degree of correlation between A and B, normalized to the auto-correlations of A and B; we will refer to s_k as the correlation strength. It hints at a connection with Pearson's r coefficient, or linear correlation coefficient, which will be defined in Eq. (37). Far from being an arbitrary choice to quantify cross-correlations, we stress that the definition of s_k emerges from a natural parametrization of the matrix Λ_k. The caveat regarding notation made in Sec. III also applies here: The sampling distribution will be denoted by the equivalent forms P(X̂|µΣ) = P(X̂|νΛ) = P(Ẑ|µΣ) = P(Ẑ|νΛ), according to convenience. In these forms, the data lie on the left of the bar, and the means and correlation function on the right. In this sense, it should not be confusing to refer to distributions such as P(Z_k^(m)|ν_k Λ_k) as sampling distributions. Analogous considerations, with sides swapped, apply to the estimating distribution.
After the passage to Fourier space, the sampling probability of Eq. (26) takes the form of Eq. (29), with the obvious notation for the transposed vectors; the inverse of the block Λ_k enters Eq. (29b). Equation (29) is the main result of this section. As in Eq. (18), note that in Eq. (29b) the means µ_A and µ_B (inside ν) appear only for k = 0. This fact is important for choosing the priors, see App. B. Finally, we note that in Eq. (29a) k ranges from 0 to ⌊n/2⌋, and not to n: the coefficients for k > ⌊n/2⌋ are fixed by complex conjugation of those for k = 0, …, ⌊n/2⌋. We also recall that in the parametrization of Λ, only the values k = 0, …, ⌊n/2⌋ are needed.

B. Estimation of cross-correlations
In this section, we estimate the correlation matrix Σ, or equivalently its transform Λ, which in turn contains the spectra and cross spectra. As previously [see Eq. (19)], the factoring in k of the sampling distribution, Eq. (29a), implies the factoring of the estimating distribution, Eq. (30). This means that the estimation of the correlation matrix Λ decouples into ⌊n/2⌋ + 1 independent estimations, one for each k = 0, …, ⌊n/2⌋, of the set of four parameters {Λ^A_k, Λ^B_k, s_k, φ_k}. We start by estimating the spectrum Λ^A_k; the estimation of Λ^B_k is identical upon the exchange of A and B. Our goal is to calculate the estimating distribution P(Λ^A_k|X̂), equivalent to P(Λ^A_k|Z_k) by virtue of the k factoring, see Eq. (30). We do not expect the result to be exactly the same as Eq. (21), because now, due to possible correlations, the data about B might give extra information about A and vice versa. The estimating distribution is given by Eq. (31). In these two steps, we applied Eq. (6) and Bayes' rule. Note also the obvious replacement Λ^A_k Λ^B_k s_k φ_k → Λ_k inside the probability sign P, since both contain the same information. We have also introduced the notation of Eq. (32), which makes it easier to provide general formulas covering both the k = 0 and k ≠ 0 cases. The sampling distribution P(Z_k|ν_k Λ_k) entering Eq. (31) is given by Eq. (29). As for the priors, they are given according to App. B by Eqs. (33)-(35), with analogous expressions for the variable B. Inserting all this into Eq. (31) and performing the integrations in Λ^B_k and φ_k, we get Eq. (36). Here, F is the hypergeometric function.^12 We could not find a primitive for the integral in s_k and leave it for numerical evaluation. The sufficient statistics in Eq. (36) are the periodograms Λ̂^A_k and Λ̂^B_k, given by applying Eq. (22a) to A and B, respectively, and s̄_k, defined in Eq. (37), where ᾱ_0 is defined in Eq. (22b) and β̄_0 analogously. The sufficient statistic s̄_k is Pearson's r coefficient, a parameter commonly used in the literature to analyze cross-correlations. We have arrived at it as the natural parameter of the estimating distribution.
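For concreteness, one plausible implementation of these sufficient statistics from M batches is sketched below (our own; the exact normalizations of Eqs. (22a) and (37) are assumptions, as is restricting to k > 0):

```python
import numpy as np

def sufficient_stats(A, B):
    """Per-frequency sufficient statistics from M batches (rows of A, B).
    Assumed concrete forms (not the paper's exact equations):
      periodograms:  mean_m |a_k|^2,  mean_m |b_k|^2
      strength:      s_k   = |sum_m a_k b_k*| / sqrt(sum_m |a_k|^2 sum_m |b_k|^2)
      phase:         phi_k = arg sum_m a_k b_k*
    Valid for k > 0; the k = 0 statistics would also involve the means."""
    a = np.fft.fft(A, norm="ortho", axis=1)
    b = np.fft.fft(B, norm="ortho", axis=1)
    per_A = (np.abs(a) ** 2).mean(axis=0)
    per_B = (np.abs(b) ** 2).mean(axis=0)
    cross = (a * b.conj()).sum(axis=0)
    s = np.abs(cross) / np.sqrt((np.abs(a) ** 2).sum(0) * (np.abs(b) ** 2).sum(0))
    return per_A, per_B, s, np.angle(cross)
```

With B = −A, this yields s̄_k = 1 and φ̄_k = ±π at every k > 0, while for independent signals s̄_k fluctuates around ∼1/√M.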
As follows from Eq. (36), the data about B enter the estimation of the auto-correlation Λ^A_k through s̄_k and the integration over s_k. On the other hand, the noise of B is integrated out in the estimation, see Eq. (31). In contrast, applying Eq. (21) would implicitly assign all the noise observed in the data on A directly to Λ^A_k. It is reassuring to note that Eq. (36) reduces to Eq. (21) in the absence of correlations, that is, in the limit s̄_k → 0 and M → ∞. Figure 2 visualizes the difference between the two formulas. One can see that Eq. (36) corrects Eq. (21) appreciably when M is small, while the difference diminishes for larger M. Nevertheless, the difference proves the need to consider all the available data (X̂) to perform an estimation, even when it may seem that only a part of it (Â) is relevant. After these remarks, our study of auto-correlations is complete.
So far, we have estimated the diagonal elements of Λ_k, namely the auto-correlations of the variables A and B. Now, we focus on estimating the remaining parameters, s_k and φ_k, which account for cross-correlations.

FIG. 2. Top: Estimating distributions of Λ^A_k, namely Eq. (21) versus Eq. (36). Bottom: Estimating distributions of µ_A, namely Eq. (23) versus Eq. (46). The deviations from the dashed lines are significant for low values of M and s̄_k close to 0 or 1.

To do so, we calculate the estimating distribution P(s_k φ_k|X̂), equivalently P(s_k φ_k|Z_k), once again by virtue of the k factoring in Eq. (30). Performing the same manipulations used to get Eq. (31), we can write the estimating distribution in terms of the sampling distribution and the prior. The sampling distribution P(Z_k|ν_k Λ_k) is given by Eq. (29b), and the prior by Eq. (35). Performing the integrations, one gets Eq. (40). Here, the angle φ̄_k is the argument of the complex number Λ̂^AB_k, and we have also introduced the quantity q(s_k, φ_k) of Eq. (41) to ease the formulas in this section. In sum, the only sufficient statistics relevant to inferences about cross-correlations are s̄_k and φ̄_k.^13 The estimating distribution for the correlation strength, P(s_k|Z_k), can be obtained by integrating out φ_k in Eq. (40), which yields Eq. (43). We note in passing that for d_k = 1, so that m is a positive integer, this hypergeometric function can be rewritten in terms of the Legendre polynomial L_{m−1} of order m − 1. The advantage of this form is that for small m one gets the result as a simple polynomial in the two rational expressions appearing in the previous equation.
Integrating out s_k instead gives the estimating distribution for the phase, Eq. (44). With Eqs. (36), (40), (43), and (44), the estimation of the correlation matrix Λ is complete.^14 Since its diagonal elements (auto-correlations) were already discussed, we now focus on the off-diagonal ones (cross-correlations). We proceed to describe the estimating distributions of s_k and φ_k in terms of M and of the sufficient statistics s̄_k, φ̄_k. As expected from the dependence on q(s_k, φ_k), defined in Eq. (41), P(s_k φ_k|Z_k) is radially symmetric: Its shape does not depend on φ̄_k but only on s̄_k and M. The angle φ̄_k is just a polar shift, see Fig. 3a. The dependence on s̄_k is analyzed with P(s_k|X̂) in Fig. 3b and with the contour lines of P(s_k φ_k|Z_k) in Fig. 3c. We point out that the estimating distribution is "conservative" in the assessment of the correlation strength. We mean that (i) as Fig. 3b shows, for low or moderate values of s̄_k, say in the range 0 < s̄_k ≲ 0.5, P(s_k|Z_k) is quite spread along the s_k axis (the radial axis); and (ii) P(s_k φ_k|Z_k) is skewed towards low values of s_k, see Fig. 3c. As for the dependence on M for a fixed s̄_k, Fig. 3c shows that greater values reduce the spread of P(s_k φ_k|Z_k). Figure 4 displays the transition upon changing M for the estimating distributions P(s_k|Ẑ) and P(φ_k|Ẑ).

^13 That the scale variables Λ̂^A_k and Λ̂^B_k are not needed is another advantage of the parametrization through the dimensionless variable s_k. See App. E for a different choice.

^14 One might be interested in different parametrizations, most often using the unnormalized correlation strength Λ^AB_k instead of s_k. The single additional estimating distribution needed for this parametrization is given in App. E.

FIG. 3. (caption fragment) The complex-plane axes have been omitted (but note the given length scale) and the plots arranged in a matrix for the sake of visibility and ease of comparison. We take φ̄_k = 0, but recall that the value of φ̄_k is irrelevant for the shape of P(s_k φ_k|X̂) up to polar rotations. The crosses mark the maximum-likelihood estimators.
The case M = 1 does not appear in Fig. 4, but it is worth discussing.
For any k, Eq. (43) reduces to a uniform distribution, P(s_k|Z_k) = 1. It follows from identities [29] together with 2F1(0, 0, 1/2, x) = 1, applied for m = 1, m = 1/2, and m = 0, respectively. Consequently, a single batch of data provides only a single value of the pair {α_k, β_k}. With only two values, it is impossible to tell whether α and β are correlated, anticorrelated, or not correlated at all. Hence the posterior for s_k is equal to its prior. Section VI D contains more on this point.
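The same conclusion is immediate numerically (our own illustration): the modulus of a Pearson-like coefficient built from a single pair of complex Fourier coefficients is identically 1, whatever the data, so a single batch carries no information about the correlation strength at a fixed k:

```python
import numpy as np

# With M = 1 there is a single pair (alpha_k, beta_k) at each frequency.
# The "Pearson r" magnitude of one complex pair is always 1, so the data
# cannot distinguish correlation from anticorrelation at that k.
rng = np.random.default_rng(2)
pairs = rng.normal(size=(5, 2)) + 1j * rng.normal(size=(5, 2))
s_single = np.abs(pairs[:, 0] * np.conj(pairs[:, 1])) / (
    np.abs(pairs[:, 0]) * np.abs(pairs[:, 1]))
# s_single is an array of ones, up to rounding, for any random draw
```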
In parallel with the previous sections, we discuss next the importance and possible applications of our results.
To begin with, one could study cross-correlations with coefficients like Pearson's r, which is equal to s̄_k [cf. Eq. (37)], and with the phase φ̄_k of Λ̂^AB_k. But we stress once again that it is more informative to make Bayesian inferences, that is, to describe s_k and φ_k with P(s_k φ_k|Ẑ), or with their marginals P(s_k|Ẑ) and P(φ_k|Ẑ). Figures 3c and 4 illustrate this point: Even for the same values of the sufficient statistics s̄_k and φ̄_k, the estimating distributions convey a different degree of confidence depending on the amount of available data. Not only the confidence level, but also other features of the estimating distribution (like the skewness that we have pointed out) are possibly relevant pieces of information that our estimation retains. As we discussed with Fig. 1, the explicit dependence of the estimating distributions on the number of batches M can help to design experiments aiming at a certain confidence level. In turn, this enables one to detect low correlations or correlations contaminated by strong noise: When a variable (like s_k) is close to 0, one can only resolve it from its fluctuations by knowing how its error scales with the amount of data.
Moreover, there is an important conceptual difference between choosing a coefficient like Pearson's r (or, for example, Kendall's τ) [1] and our approach. Sometimes, these coefficients are not defined from the parametrization of the correlation matrix Λ_k, but instead from other arguments. For example, Pearson's r is introduced in the context of linear regressions, hence the name "linear correlation coefficient" [1]. Consequently, one might wonder whether other parameters would be better suited to analyze cross-correlations. However, this concern does not apply to our estimation: The sufficient statistic s̄_k appeared as a native parametrization of the correlation matrix Λ_k in Sec. IV A. Actually, with our set of parameters {Λ^A_k, Λ^B_k, s_k, φ_k} for k = 0, …, ⌊n/2⌋, we are sure not to lose any information about the correlation matrix Λ.
The estimation of cross-correlations is now complete. We conclude this section by discussing the estimation of µ_A; it holds identically for µ_B. Similarly to what we pointed out about Λ^A_k, the result for the estimating distribution P(µ_A|X̂) is not expected to be exactly the same as P(µ_A|Â), Eq. (23). The estimating distribution P(µ_A|X̂), equivalent to P(µ_A|Z_0) due to the k factoring in Eq. (30), follows by once more applying Eq. (6) and Bayes' rule. As for the presence of the means ν in this formula, recall our previous remark that ν only appears in the sampling distribution for k = 0. The sampling distribution P(Z_0|ν Λ_0) is given by Eq. (29a), and the prior P(νΛ_0) by Eq. (34). After these substitutions, we can integrate analytically in all variables but s_0, which remains for numerical evaluation; the result is Eq. (46). There is a clear resemblance to the estimating distribution of Sec. III: As a check, Eq. (46) reduces to Eq. (23) in the absence of cross-correlations, namely using the prior P(s_0) = δ(s_0). The comparison between both results appears in Fig. 2. A discussion identical to that comparing Eqs. (21) and (36) applies here as well.

C. Generation of noise
While the method to generate noise given in Sec. III C was previously reported in the literature for a specific kind of noise [10], we are not aware of proposals for how to generate correlated noise, namely variables with given means µ_A, µ_B and spectra Λ^A_k, Λ^B_k, and Λ^AB_k (or equivalently Λ^A_k, Λ^B_k, s_k, and φ_k). In this section, we propose a general method for this purpose.
We proceed analogously to Sec. III C, now using the sampling distribution given by Eq. (29b). However, it is necessary to transform the variables α_k and β_k so that the distribution factorizes; the formula for the second variable follows upon the substitution H_k → J_k, and one then splits the transformed variables into their real and imaginary parts.

FIG. 5. (caption fragment) The maximum-likelihood estimators appear circled, and the estimating distributions for Λ^A_k, Λ^B_k, and Λ^C_k are overlaid as a linear color gradient. The dashed lines plot the spectral functions used to generate the data, see Eqs. (49). Note that adopting the log scale for the horizontal axis makes the distributions for high k overlap. To declutter the plot and keep its information content, we locally average the points using the procedure described in App. F.
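The essence of the procedure just described can be sketched as follows (our own implementation via a Cholesky factor of the 2 × 2 block Λ_k, not the text's H_k, J_k variables; zero means and the cross-spectrum convention Λ^AB_k = s_k √(Λ^A_k Λ^B_k) e^{iφ_k} are assumptions):

```python
import numpy as np

def correlated_pair(spec_A, spec_B, s, phi, rng):
    """One batch of two real, zero-mean signals with spectra spec_A, spec_B
    and cross-spectrum s*sqrt(spec_A*spec_B)*exp(i*phi). For each k, the
    Fourier pair (alpha_k, beta_k) is drawn from the 2x2 complex Gaussian
    with covariance Lambda_k, built via a Cholesky factor.
    k = 0 and the Nyquist component are left at zero for brevity."""
    n = len(spec_A)
    alpha = np.zeros(n, dtype=complex)
    beta = np.zeros(n, dtype=complex)
    for k in range(1, (n + 1) // 2):
        c = s * np.sqrt(spec_A[k] * spec_B[k]) * np.exp(1j * phi)
        cov = np.array([[spec_A[k], np.conj(c)], [c, spec_B[k]]])
        L = np.linalg.cholesky(cov)            # cov = L L^H (Hermitian PSD)
        z = (rng.normal(size=2) + 1j * rng.normal(size=2)) / np.sqrt(2)
        alpha[k], beta[k] = L @ z
        alpha[n - k] = np.conj(alpha[k])       # Hermitian symmetry -> real signals
        beta[n - k] = np.conj(beta[k])
    return (np.fft.ifft(alpha, norm="ortho").real,
            np.fft.ifft(beta, norm="ortho").real)
```

Estimating the correlation strength and phase from many batches generated this way recovers the prescribed s and φ.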

V. NUMERICAL EXAMPLE
In this section, we test and illustrate our analytical Bayesian estimation formulas by applying them to numerically generated uncorrelated and correlated noise. Specifically, we generate M = 10 batches of noisy signals for three variables A, B, and C. Each batch contains n = 1000 data points. We work in arbitrary units for all magnitudes, including time. We choose the spectra of Eqs. (49), corresponding to white noise; to noise with an exponentially decaying correlation function, x(s)x(s + t) ∝ e^{−t/λ}, taking λ = 10; and to 1/f noise, respectively.
For the 1/f noise, we take a frequency cutoff smaller than the lowest nonzero frequency. A is generated independently from B and C with the method presented in Sec. III C. B and C are anti-correlated with a constant strength s_k = 0.7 and phase φ_k = π; they are generated according to Sec. IV C. Finally, A and B have the same total spectral power (that is, the same area under the spectral-power curve), while C is a hundred times weaker.
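A sketch of the three discrete spectra (our reconstruction; the prefactors and exact forms of Eqs. (49) are not reproduced here and are placeholders, e.g. the Lorentzian shape follows from the exponential correlation via the Wiener-Khinchin relation):

```python
import numpy as np

# Discrete spectra mimicking the three cases of the numerical example
# (arbitrary units; normalizations are illustrative only).
n = 1000
k = np.arange(n // 2 + 1)
kappa = k / n                                # frequency in units of 1/Delta
spec_white = np.ones_like(kappa)             # white noise
lam = 10.0                                   # correlation time of e^{-t/lam}
spec_lorentz = 1.0 / (1.0 + (2 * np.pi * kappa * lam) ** 2)  # Lorentzian
cutoff = 0.5 / n                             # below the lowest nonzero frequency
spec_flicker = 1.0 / np.maximum(kappa, cutoff)               # 1/f with cutoff
```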
We chose to simulate these data with the following purposes. First, we aim at analyzing auto-correlations for some typical noise types. Second, we intend to check that the cross-correlation between the variables B and C is correctly quantified in strength and phase, and also to assess the lack of correlation between A and B, and between A and C; in other words, we set up a test for both the positive and the negative assessment of correlation. Last, we want to check that our estimation of discrete spectra is sensible, at least qualitatively, for data generated from continuous dispersions; the next section discusses this point further. Figure 6 analyzes cross-correlations. The estimating distributions for s_k and φ_k are plotted in a color scale for every pair in the set {A, B, C}. The degree of correlation is correctly estimated in all cases: While P(s_k|X̂) accumulates around 0.7 for {B, C}, it peaks at 0 for the uncorrelated pairs {A, B} and {A, C} at the majority of the frequencies.^15 The results for P(φ_k|X̂) are consistent with those for the correlation strength: For the uncorrelated pairs {A, B} and {A, C}, the estimating distributions peak at random values and are mostly spread over [0, 2π). For the correlated pair {B, C}, the estimating distribution is narrow and peaks at π.
Our results succeed in estimating the parameters that were used to generate the data. Some comments regarding the convergence from discrete to continuous spectra, oriented towards the application to actual experimental data, appear in the next section.

^15 But not at every k: As a result of statistical fluctuations, the noise of uncorrelated variables shows false correlations (a peak away from s_k = 0) at some frequencies. This behavior is to be expected. At these values, the estimating distribution of the phase is localized (not spread over the whole [0, 2π) interval), which is consistent with a finite value of MLE(s_k). One could come up with a statistical distribution for their number, analogous to Eq. (24). However, one can judge whether two given variables are or are not correlated even without such tests, see Sec. VI D.

FIG. 6. Left: Estimation of the correlation strength for all possible pairs in the set {A, B, C}, with P(s_k|X̂) as a linear color gradient. The legend appears in the upper right corner and the color code is consistent with Fig. 5. The maximum-likelihood estimators appear circled. The results reflect that only the variables B and C are correlated. The value s_k = 0.7, used to generate the data, is marked with a faint gray line. Right: The same for the correlation phase, with the estimating distribution P(φ_k|X̂). The gray line marks φ_k = π. The points were averaged according to the procedure described in App. F for better visibility.

VI. DISCUSSION
We now discuss several aspects of our results concerning their application. In addition, a reader with experience in spectral estimation might find some of our results counter-intuitive; we clarify those seeming contradictions, attributing them mostly to the difference between parametric and non-parametric estimation. The technical results of this article were already given in the previous sections and can be used without following the discussion here.

A. Non-parametric versus parametric estimation
Let us return to a single variable of time A(t) observed at discretized times t_i = i∆, resulting in a discrete set {A_i ≡ A(t_i)}_{i=1}^n. This discrete set allows one to infer properties of the auto-correlation Σ(τ) evaluated at discrete values of the time delay, τ_d = d∆ with d = −⌊n/2⌋ + 1, …, ⌊n/2⌋. Due to the discrete sampling of A, one has no access to Σ(τ) for other values of τ. Since the Fourier transform is a bijection, the discrete set of Fourier coefficients {Λ_k} is equivalent to {Σ_d}: It is a minimal-size set of parameters required to describe what can be known about Σ(t) from the discretized values of A. However, one can introduce a different set of parameters as a model to explain or predict {Σ_d}. Any such model can then be tested against the data, in the sense that the parameters of the model can be estimated.
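The bijection between {Σ_d} and {Λ_k} is just the (invertible) discrete Fourier transform, as this round-trip check (our own illustration) shows:

```python
import numpy as np

# {Sigma_d} and {Lambda_k} carry identical information: the DFT is invertible.
n = 16
k = np.arange(n)
Lam = 1.0 / (1.0 + np.minimum(k, n - k))   # a symmetric spectrum, Lam_k = Lam_{n-k}
Sigma = np.fft.ifft(Lam).real              # the discrete correlation function Sigma_d
# Transforming back recovers the spectrum exactly; Sigma_d is even in d.
```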
If the function Σ(τ) is defined for any real τ, which requires A(t) to be defined for any real t, it is natural to consider Λ(κ), the continuum of Fourier components of Σ(τ), as such an extended model. Introducing a continuum of redundant variables, there would be no hope of estimating them from {Σ_d}. The redundancy is compensated by considering simple functions for Λ(κ), describable by only a few parameters. A typical example would be Λ(κ) = c/κ^α for κ ∈ [κ_low, κ_high] and zero otherwise. Here, the overall strength c, the power α, and the cutoffs κ_low and κ_high would be the four (instead of a continuum of) parameters to be estimated. In general, one considers a family of continuous curves C_p(κ) defined by N parameters gathered in a vector p. Since the goal is to estimate the parameters p, one calls the procedure parametric estimation. We do not introduce such functional models: We estimate the natural parametrization of {Σ_d}, namely the discrete set {Λ_k}. To distinguish it from the other approach, we call it non-parametric estimation.
Now we come to the central point of this section. The motivation to introduce the continuum version of Λ_k is not so much the existence of Σ(τ) as a "more-true" entity than its discretized version {Σ_d}.^16 Rather, the motivation is to compress the information contained in the results of non-parametric estimation.^17 The latter typically results in hundreds to millions of data points in a plot such as Fig. 5. To interpret such a plot requires that one or a few of the typical shapes are recognized in it.^18 Whether knowingly or not, anyone looking at a figure such as Fig. 5 intuitively performs at least a rough parametric estimation and draws conclusions from there. That is, global properties of the spectral data are judged, meaning relations between different frequencies. Our formulas do not do this; they describe each frequency k in isolation. Once this point is appreciated, the "paradoxes" that we discuss below quickly resolve.

^16 Indeed, a continuum model can be adopted even if A_i are undefined for intermediate "times" (or if i does not represent time, but an index that is inherently integer, such as the rank). There is no difference from the point of view of the estimation. The difference would be in the interpretation: It might be unclear what a continuum function Λ(κ), even if fitting a simple shape well, represents. In this case, one might be better off postulating a candidate shape directly in the discrete space, with the benefit that no transformation F, discussed in the next subsection, arises.

^17 While Ref. [30] discusses a slightly different "frequency distribution", the following citation from page 60 there is exactly to the point here: "But it is often possible to find a simple form that fits the frequency distribution approximately. If this can be done it has the advantage of describing the results of the statistics briefly. In some cases it is suggestive of the causes that lie behind the results. But the main reason, in general, for looking for a simple mathematical 'law' of this type, is that if it is found it is believed to have predictive value. That is to say the simple law, if it is a very good approximation to the distribution function F of the original sample, is likely to describe the distribution function of another sample (or of the whole population) even better than F would. This is partly because it is likely that there are a few predominating causes lying behind the statistics, even though these causes are unknown".

^18 The typical shapes are: a constant, representing white noise, that is, a Markovian environment; a Lorentzian, for an environment equivalent to a two-level system with fixed transition rates [31]; 1/f, for a collection of two-level systems with a specific distribution of transition rates (Ref. [7] attributes it to Johnson 1925 and Schottky 1926); and 1/f^2, for a random walk [32]. Further, a substantial volume of results exists on ARMA models [33], whose spectrum is a general rational function of k.


B. Continuous spectra

Let us now comment shortly on the parametric estimation, which is fitting the parameters of a continuous curve Λ(κ) ≡ C_p(κ). Unfortunately, the issue with it is that, unlike the relations of {A_i} to A(t) and of {Σ_d} to Σ(τ), which are trivial, the relation of {Λ_k} to Λ(κ) is not: Λ_k is not equal to Λ(κ_k) evaluated at the corresponding real frequency κ_k = k/n∆. Namely, the function Λ has to be additionally transformed, Eq. (50). The transformation F reflects two well-known complications: The continuous frequencies beyond the Nyquist frequency, |κ| > 1/2∆, are aliased (folded over) into this range, and the values of Λ(κ′) for which n∆κ′ is not an integer are smeared throughout the discrete set {Λ_k} with a weight falling off only as 1/(κ − κ′)². We do not give formulas for F; they can be found in Ref. [1]. Equation (50) then states that Λ(κ) has to be first aliased and smeared (the first arrow) and then sampled (the second arrow) to arrive at a discrete set {Λ_k}, the equivalent discrete spectrum. Nevertheless, once the candidate model function C_p(κ) has been decided on and the above transformation is performed, one can estimate the parameters p from the likelihood of Eq. (51). Here, P(Λ_k = F[C_p](κ_k)|X̂) is given by Eq. (21) or Eq. (36) evaluated at Λ_k = F[C_p](κ_k). In practice, the parametric fitting is often done either by simply ignoring F in Eq. (51), or by moving it onto the data, changing the equation Λ = F[C] into F_inv[Λ] = C, using an abstract notation without indexes. In the first approach, the quality of the approximation F ≈ 1 relies on the properties of the function C_p(κ_k) being estimated, and is thus unknown. In the second approach, the effective inverse F_inv is implemented by windowing [34], filtering, and procedures of a similar kind.^19 Again, since F certainly does not have an exact inverse (being a map from a real axis to a finite set), the quality of the effective inverse is difficult to predict and will depend on the function C itself. Therefore, we suggest using Eq. (51) instead: There, the data are, in principle, processed independently from the parametric model, by a non-parametric estimation. The likelihood is given by evaluating the resulting estimating distributions at points given by the model function transformed into an effective discrete spectrum.^20 The advantage is that this likelihood is exact, and thus allows one to assign rigorous error bars to the estimated parameters p.
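The folding part of F is easy to see directly (our own illustration): two continuous frequencies that differ by 1/∆ produce identical samples and hence identical discrete spectra:

```python
import numpy as np

# Aliasing: a tone beyond the Nyquist frequency 1/(2*Delta) is
# indistinguishable, after sampling, from its folded image below it.
n, Delta = 64, 1.0
t = np.arange(n) * Delta
kappa = 0.15 / Delta                 # below the Nyquist frequency
kappa_alias = 1.0 / Delta - kappa    # beyond it; folds back onto kappa
x_low = np.cos(2 * np.pi * kappa * t)
x_high = np.cos(2 * np.pi * kappa_alias * t)
# The two sampled signals coincide, so their periodograms coincide too.
```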
We note that one can do parametric estimation directly, skipping the non-parametric one. Reference 3 is one example, where the model likelihood is calculated in the time domain (that is, directly from the data A_i). Nevertheless, the key to being able to do so is to have the candidate model, equivalent to C_p in the notation here.^21 Without that, to propose a reasonable candidate C_p one has to start with a plot such as Fig. 5 anyway. Our approach can, therefore, be viewed as splitting the parametric estimation into two steps, which allows one to delay conjecturing about C_p until a more informed decision can be taken.
Finally, analogous considerations hold for the cross-correlations, where formulas such as Eq. (51) will apply to Λ^A_k, Λ^B_k, and Λ^AB_k together. We point out here that the estimating distribution in Eq. (30) is actually non-separable in the four variables; there are thus dependencies among the plausible values of the parameters. Going to marginals, which we have done in calculating, for example, the estimating distributions P(s_k) and P(φ_k), means that those dependencies are lost. One could consider fitting parametric models to the multivariate estimating distributions such as P(s_k φ_k), instead of to the marginals. Nevertheless, we leave investigations of parametric estimation as outlined in this subsection for future works.

C. Paradox 1: Precision does not increase upon acquiring more data?
Let us again focus on a single variable with discrete spectrum Λ_k. Consider that we increase the batch size n, keeping the number of batches M and the time interval ∆ fixed. While the number of k points, being ⌊n/2⌋ + 1, increases, the number of data points α_k^(m) for each k does not, remaining at M. Given that our estimating distributions (and the error that they convey) depend only on M [see, e.g., P(Λ^A_k|α̂_k) in Eq. (21)], it looks paradoxical that the acquisition of more data by increasing n does not have any effect on the precision of the estimation (given by M).
This paradox is resolved by our previous discussion of Eq. (51): While it is true that, after increasing n, the precision of Λ^A_k does not increase for a given k, the fact that there are more values of k results in a more precise parametric estimation. For example, in Fig. 5, doubling n while keeping M fixed would correspond to roughly doubling the number of red distributions (and circles), while keeping their shape, and thus the length of the corresponding error bar, fixed. We could fit a continuous curve better than with a lower n. The same point arose in Sec. III C: It might have looked paradoxical that the signal-to-noise ratio given in Eq. (25) does not increase with the number of measured data points, proportional to n. Again, the increase in the number of k points delivers a higher precision of a parametric fit.

D. Paradox 2: A single trace cannot reveal anything about cross-correlations?
The paradox from the previous subsection has a variant for cross-correlations. Namely, we found that the estimating distribution of the correlation strength s_k, Eq. (43), is constant, P(s_k|Z_k) = 1, irrespective of the data if M = 1. This seems to imply that, irrespective of what was measured, from a single batch one cannot draw any conclusions about a possible correlation of two variables A and B. One can easily find an example where intuition speaks against such a statement: Imagine that a million pairs B_i, C_i were measured in a single batch and, while wildly fluctuating individually, they happen to be opposite in all instances, B_i = −C_i. One can obviously judge the perfect anti-correlation of B and C from such data.
The explanation is, again, in the global properties of the estimating distributions, that is, their relations across different frequencies k. While it is true that, looking at a single k, one can conclude nothing about s^BC_k, the fact that all the estimated phases φ_k are the same in the given example makes it obvious that B and C are correlated. The parametric estimation would confirm it, for example by examining a model such as B(t) = aA(t) + cC(t) and finding c ≈ −1, a ≈ 0. To illustrate, Fig. 7 shows the resulting estimating distributions in the first row. A similar situation arises if the correlation is imperfect; see the second row of the figure. Again, while nothing can be concluded from the estimating distributions for s_k, as they are all constant, the clustering of the phases φ_k shows that they are not randomly distributed and thus the two signals are not uncorrelated.
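The phase clustering is easy to reproduce. In the sketch below (our illustration; the phase convention conj(B_k)C_k is our assumption), a single batch of a signal and its inverse yields a cross-spectral phase of π at every frequency, exposing the perfect anti-correlation even though M = 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024
B = rng.normal(size=n)       # a single batch, M = 1
C = -B                       # perfectly anti-correlated partner

# cross-spectral phase at each interior frequency
phi = np.angle(np.conj(np.fft.rfft(B)) * np.fft.rfft(C))[1:-1]

# all phases cluster exactly at +-pi, exposing the anti-correlation
print(np.allclose(np.abs(phi), np.pi))
```

For uncorrelated signals, the phases would instead be spread uniformly over the circle.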
E. Paradox 3: Are formulas for M > 1 ever relevant?
In practice, one seldom (if ever) has access to statistically independent batches. In the majority of cases, the whole set of n × M data points is measured on a single system, after which the data are separated into M batches. There might or might not be time delays between measurements taking data assigned to different batches m. Nevertheless, data measured in this way correspond to a single batch, possibly with some portions of the data omitted. As omitting valid data never improves any estimation, the best estimation one can hope for in this scenario is to retain all the data without omissions and use it as a single batch, not splitting it into artificial batches. This is the best (most informative) procedure for a subsequent parametric estimation: despite our formulas taking trivial forms at each k for M = 1, for example s_k = 1 irrespective of the data, there is no issue in using the M = 1 non-parametric estimation results for the parametric estimation; we have explained this in the previous two subsections.
Nevertheless, calculating the likelihood in the parametric estimation, Eq. (51), is not the only use of the non-parametric estimation. As we pointed out in Sec. VI B, the latter is required to assist in formulating the parametric model in the first place. This task might be difficult based on seeing only the M = 1 plots. The artificial splitting of a single measurement into M batches can then be seen as a visualization tool.[22] Once the parametric model is formulated, one should use the true number of batches, without splitting them artificially into smaller ones, even for M = 1.

FIG. 7. The analog of Fig. 6 for fully or partially anti-correlated signals for a single batch, M = 1. The left panels show the estimating distributions for the correlation strength s_k, the right ones for the correlation phase φ_k. We generate B and C with spectra as shown in Fig. 5, with s_k = 0.7 and φ_k = π for all k. In the first row we calculate the correlation of B with its inverse −B; in the second row, the correlation of B and C. To declutter the plot, the horizontal axis was divided into 100 bins equally spaced on the logarithmic scale, and all distributions falling within the same bin were arithmetically averaged. To deliver the point of the figure, we intentionally do not use here the more efficient averaging described in App. F.
A natural question is the error stemming from splitting a single batch into several smaller ones which are then treated as statistically independent (even though they are not). While we have investigated this question, we conclude that quantifying the error arising in this way in the non-parametric estimation requires additional assumptions on the correlation function. We have not been able to find assumptions which would be general and lead to useful formulas, and we leave this question open. In any case, we expect that the "averaging" described in App. F makes this issue irrelevant, since it replaces the artificial splitting of batches by a more efficient procedure.

VII. CONCLUSION
In this article, we have applied Bayesian statistics to estimate correlation functions. We have given the estimating probability distributions to calculate (i) the variance for scalar variables; (ii) the auto-correlation functions for time-dependent variables, which includes the study of noise spectra; and (iii) the cross-correlations for possibly coupled pairs of variables, which covers the study of correlated noise. Our approach guarantees an optimal usage of the information contained in the data, makes no assumptions on the distribution of the variables, reduces to a minimum the assumptions on the priors, and avoids the arbitrary choice of estimators. Every result is expressed in terms of a distribution, from which confidence intervals can be calculated afterwards.
The relevance of our results is manifest when applied to spectral analysis. They allow one to estimate the spectrum in terms of the estimating distributions. This inference is more informative than using standard estimators such as the periodogram: it allows one to assess the certainty of the spectrum estimation (for example, by error bars) and to fit continuous spectra in terms of maximum likelihood. Our analysis also covers the study of cross-correlations from a Bayesian perspective: we offer a systematic way to assess the correlation between two variables, without choosing arbitrary parameters to quantify it and with distributions that convey the certainty level. We presented numerical tests that support our calculations.
Beyond these fundamental results, our work discusses other aspects of noise from a Bayesian perspective. As an example concerning the statistical properties of the periodogram, we derived its exact signal-to-noise ratio and proved that it is universal. We also proposed a method to numerically simulate correlated stationary noise with an arbitrary spectrum.

which ends the proof for n = 1. Now, consider n > 1. Once again, only in this section and for notational simplicity, we denote P(x|μΣ) by P(x). The goal is to find the distribution P(x) that maximizes the entropy subject to the conditions of Eqs. (A5) and (A6). It is convenient to perform a coordinate change y = Ux, with U unitary and such that Ξ = UΩU† is diagonal. Because Ω is symmetric and real, U exists and can be chosen real. Define ν = Uμ and note that P(x) = P(y), since the Jacobian of the change x → y is |U| = 1. Rewriting Eqs. (A5) and (A6) in the y coordinates is then trivial: it only requires the changes μ_i → ν_i and Ω_ij → Ξ_ij. We apply the method of Lagrange to P(y) with multipliers λ − 1, λ_i, and λ_ij, which gives ln P(y) + λ + Σ_i λ_i y_i + Σ_ij λ_ij y_i y_j = 0, and thus P(y) = exp(−λ − Σ_i λ_i y_i − Σ_ij λ_ij y_i y_j).
To determine the multipliers, we insert this expression into Eqs. (A6) (rewritten for the y variable). The normalization ∫d^n y P(y) = 1 fixes λ, so that the exponentials in Eq. (A7) are independent of the λ_ij with i ≠ j. Therefore, the maximization of S decouples into n independent maximizations of S_i, each of them with respect to P(y_i) and subject to the conditions ⟨y_i⟩ = ν_i and ⟨y_i²⟩ = Ξ_ii. The solution for each i is given by Eq. (A4). Transforming back to the original x coordinates, we get the result with Σ_ij = ⟨x_i x_j⟩ − μ_i μ_j. This completes the proof.
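Written out in the notation above, the end point of this maximization (the maximum-entropy distribution with the given first and second moments) is the standard multivariate Gaussian:

```latex
P(\vec x\,|\,\vec\mu\,\Sigma)
  = \frac{1}{\sqrt{(2\pi)^n\,|\Sigma|}}
    \exp\!\left[-\tfrac12\,(\vec x-\vec\mu)^{\mathrm T}
    \Sigma^{-1}(\vec x-\vec\mu)\right],
\qquad
\Sigma_{ij} = \langle x_i x_j\rangle - \mu_i\mu_j .
```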
with P(Λ_k) given by Eq. (B5). Eq. (B6) seems reasonable due to the decoupling of the variables A_j into the n/2 + 1 independent variables α_k, with k = 0, ..., n/2; see the discussion after Eq. (18). This decoupling is manifest in the sampling distribution P(α|μΛ), Eq. (17), and suggests considering priors that also factor in k. However, let us take up a more systematic approach to get Eq. (B6). The invariance of the prior under the frame transformation Eq. (B1) of A reads P(μ′, Λ′) dμ′ dΛ′_0 ... dΛ′_{n/2} = P(μ, Λ) dμ dΛ_0 ... dΛ_{n/2}. It is immediate to see that Eq. (B6) together with Eqs. (B4) fulfills this condition. However, we now demonstrate that it is not the only solution. We proceed as before in this section: we rewrite Eq. (B7) as a system of differential equations and solve it. The first equation is trivial: it implies the independence of P(μ, Λ) of μ. We can thus focus on the second, 2Λ · ∂_Λ P(Λ) + (2 n/2 + 2) P(Λ) = 0 (B8). A canonical way to solve linear differential equations on manifolds is the method of separation of variables [36]. It starts with a factored ansatz like Eq. (B6), leaving for the end the proof that it is the unique solution. Here, the ansatz leads to a system of uncoupled equations, which we can solve to conclude that any function of the resulting form is a solution of Eq. (B7). That is, such a function is invariant under the frame transformations of A. We also impose λ_k > 0 to have lim_{Λ_k→∞} P(μ, Λ) = 0. In many classical problems that require solving a differential equation like the one in Eq. (B3), one has a set of boundary conditions (along a so-called characteristic curve) [36] that (i) narrow down the solution to unique values of λ_k and (ii) allow one to prove that the factored ansatz is actually the only solution. However, we do not have such boundary conditions for the prior P(μ, Λ), and thus we find that the frame invariance does not lead to a unique factorizable prior.
We now invoke an additional invariance argument: the prior for the noise at a fixed physical frequency should not depend on how many data points the batch contains. The Fourier index k corresponds to the frequency k/n∆, where ∆ is the time step with which the data are measured. Crucially, changing n changes the assignment of the indexes k to the physical frequencies. If the reassignment is to leave the priors the same, the constants λ_k for k ≠ 0 must actually not depend on k. Requiring in addition that the prior for Λ_{k=0} is described by Eq. (B4), we arrive at λ_k = 1 for all k. With this additional invariance requirement, we conclude that there is a unique factorizable prior, where we separated Λ_{k=0}, which is a somewhat different variable than all other Λ_{k≠0}.

Many variables, non-factorizable priors
It is not too difficult to find a non-factorizable prior fulfilling Eq. (B8). We first note the following function family. Here, p can be any real number except zero, though probably only p ≥ 1, or at least p > 0, can ever be relevant for physically realistic scenarios. As the expression in the outer bracket is the p-norm of the vector Λ, this family already includes appealing choices, such as an expression which depends only on the sum of the Λ_k (for p = 1), the Euclidean distance in the space of vectors Λ (for p = 2), or the maximum max_k Λ_k (for p → ∞). Yet another possibility worth considering is a prior factorizable in Λ_0 and the rest.
A second look at Eq. (B8) reveals that it specifies the prior P(Λ) as a homogeneous function of the variables Λ of degree −n/2 − 1, by the Euler theorem on homogeneous functions [37]. Equations (B11), (B12), (B13), and (B14) are instances of such homogeneous functions, and there are many more. Still, we believe that the symmetry of the prior, whether factorizable or not, with respect to permutations of the indexes k within the set k ≠ 0 is a relevant invariance requirement. It leads to a unique factorizable prior and reduces the number of possibilities to consider for a non-factorizable one, suggesting Eqs. (B13) or (B14) as (perhaps the only) viable candidates.
To conclude, without imposing extra conditions beyond the frame invariance, the prior is not unique. We adopt Eq. (B12) and its analogs as the prior for the main text, following the common practice of the separation-of-variables technique. We leave the investigation of the consequences of non-factorizable priors as a future research avenue.

Priors for cross-correlations
At last, we make some remarks about the priors needed for Sec. IV. Following the notation of the main text, note that now Λ refers to Eq. (27) and not Eq. (16). A discussion analogous to that of Eq. (B6) leads to a prior factorized in k, with P(Λ^A_k) and P(Λ^B_k) given by Eq. (B5). Note that we have assumed the factoring of the prior in k and also in the independent quantities {Λ^A_k, Λ^B_k, s_k, φ_k} that parametrize Λ_k for each k. Analogously to the previous case, the factorization does not follow from the invariance requirements that we impose.
In Eq. (B15), it remains to determine the priors P(s_k) and P(φ_k). The correlation strength s_k and the correlation phase φ_k are the modulus and the argument of the normalized cross-correlation Λ^AB_k/√(Λ^A_k Λ^B_k), respectively. Similarly to the quantity Λ of the one-dimensional case, discussed at the beginning of this appendix, Λ^A_k, Λ^B_k, and Λ^AB_k transform by a multiplicative factor under the frame transformation. Therefore, s_k and φ_k remain invariant, and there are no nontrivial conditions such as Eqs. (B2) that apply to P(s_k) and P(φ_k). We can impose only entropy maximization on these priors. The result, of a trivial proof, is a uniform distribution. The phase prior for k = 0, n/2 follows from Λ^AB_0 and Λ^AB_{n/2} (for even n) being real.

Appendix C: Derivation of the periodogram distribution
In this appendix, we derive the distribution P(Λ̂_k|Λ_k), necessary to understand the statistical properties of the periodogram Λ̂_k. We start by applying some probability-theory rules to P(Λ̂_k|Λ_k): the first equality uses Eq. (6); in the second, we apply the conditional-probability definition P(ab) = P(a)P(b|a) twice. The differentials in the variables α follow from α^(m)_k being real for k = 0, n/2 and complex (in general) otherwise. Consequently, the number of one-dimensional integrals over α is 2Md_k. Next, we use that P(μ|Λ_k) = P(μ) and define the corresponding average. Also, using the definition of Eq. (22a), we write P(Λ̂_k|α_k μΛ_k) as a delta function. We now consider k ≠ 0 and k = 0 separately. For k ≠ 0, the sampling distribution P(α_k|μΛ_k) is independent of μ. The integral in μ is then trivial, giving ∫dμ P(μ) = 1 by the normalization of the probability distribution P(μ), and the overline can simply be omitted. We can now substitute the condition imposed by the delta function into P(α_k|μΛ_k), in turn given by the product ∏_m P(α^(m)_k|μΛ_k) and Eq. (18). For k = 0, a simple integration in μ is needed to evaluate Eq. (C1) and confirm that the previous equation holds also for k = 0, and thus for all k. Using δ(x) = dΘ(x)/dx and taking the derivative out of the integrals, we recognize the formula for the volume of a 2Md_k-dimensional sphere, where the N-dimensional volume reads V_N(r) = π^{N/2} r^N / Γ(N/2 + 1). Substituting this expression, taking the derivative with respect to Λ̂_k, and normalizing the result, we get Eq. (24).
Appendix D: A useful identity to integrate out the mean

Concerning zero-frequency spectral components, which have a finite mean, the following formula is useful. The k = 0 sampling probability from Eq. (29b) follows, where we have used d_k = 1/2 and, with s_φ = exp(iφ), we note that all quantities in these equations are real. It means that s_φ is either +1 or −1. The variable Z_0 denotes together the two sets {α^(m)_0} and {β^(m)_0}. In the derivation, it is helpful to notice that Λ̂^A_0 and Λ̂^B_0 are the variances of these two sets, and Λ̂^AB_0 is their covariance. In particular, the (co)variance is not changed upon a uniform shift of all input numbers. One can use Eq. (D2) to calculate the sampling distributions averaged over the mean, P(···|ν_0 Λ_0). Indeed, setting s = 0 and ignoring the B-related variables, one can easily integrate out the mean μ_A in Eq. (D2), deriving Eq. (21) from Eq. (20) for k = 0. Similarly, integrating out both ν_A and ν_B in Eq. (D2) gives the k = 0 sampling distribution entering Eq. (31).

Here we consider an alternative parametrization of the cross-correlation, using {Λ^A_k, Λ^B_k, |Λ^AB_k|, φ_k}. One can see that the marginals of Λ^A_k, Λ^B_k, and φ_k are the same using either of the coordinate sets, since integrating out s_k is the same as integrating out |Λ^AB_k|. However, the marginals of s_k versus |Λ^AB_k| are different and cannot be obtained from each other, as these two variables define different hyperplanes in the four-dimensional space.
For completeness, we repeat the definition of the marginal, which we could cast as a one-dimensional integral.

Since the density of the points on the horizontal axis is constant on the linear scale, it grows exponentially on the log scale. Due to the finite resolution of the printer, the computer screen, and the human eye, one can often see just an impenetrable mess in such plots above a certain frequency. Here we propose a solution to this problem.
The idea is to replace several neighboring points on the plot by a single representative point. While this idea seems obvious, there are different ways to merge several points into one; we suggest the following. We aim to represent a (dimensionful) frequency interval F = [κ_low, κ_high], where for simplicity κ_low > 0. It contains a set K of integer Fourier indexes k, that is, k ∈ K ⇔ κ_k ≡ k/n∆ ∈ F. Let the number of integers in K be K. For each of these integers, one has the associated set {α_k}; merging them gives a set of KM values. This merged set is interpreted as an input to Eq. (22) corresponding to an effective batch size KM. Evaluating Eq. (22a) gives Λ̂_eff = (1/K) Σ_{k∈K} Λ̂_k, a simple average of the sufficient statistics. One then evaluates Eq. (21) with the effective parameters Λ̂_k → Λ̂_eff and M → KM to obtain the representative estimating distribution.
This procedure has the following advantages: 1) One can merge a variable number of Fourier components throughout the plot, for example plotting non-merged values (K = 1) at low frequencies and increasing K toward high frequencies. Therefore, one does not have to sacrifice the lowest frequencies, usually of primary interest, a deficiency which results from artificially splitting the data into shorter batches [38]. 2) In the limit of a narrow frequency window (more specifically, in the limit where the spectral density is constant within the covered frequency interval), the resulting estimating distribution is exact. 3) Since one gets an estimating distribution for the representative point, one still has the freedom of presenting it in various ways, for example plotting its MLE or average, in a way which is consistent with the non-merged points on the plot. 4) Most crucially, due to 2), a rigorous error bar can be assigned to each representative point.
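The merging rule itself is a few lines of code. The sketch below is our illustration: `Lam_hat[k]` stands for the per-frequency sufficient statistics Λ̂_k of Eq. (22a), and the function groups the Fourier indexes into log-spaced bins, returning per bin the representative frequency index, Λ̂_eff, and the effective batch count KM.

```python
import numpy as np

def merge_logbins(Lam_hat, M, nbins=100):
    """Merge per-frequency statistics Lam_hat (index k = 1..len) into
    log-spaced bins: each bin of K indexes is represented by the mean
    Lam_eff of its K values together with the effective count K*M."""
    k = np.arange(1, len(Lam_hat) + 1)          # skip k = 0
    edges = np.logspace(0, np.log10(k[-1]), nbins + 1)
    label = np.digitize(k, edges)
    merged = []
    for b in np.unique(label):
        sel = label == b
        K = int(sel.sum())
        merged.append((k[sel].mean(), Lam_hat[sel].mean(), K * M))
    return merged
```

One then evaluates Eq. (21) for each returned tuple with Λ̂_k → Λ̂_eff and M → KM to obtain the representative estimating distribution.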
Let us point out another paradox related to this approach. The number of points assigned to a single representative point grows with frequency. As one expects, and as we prove in App. H on the central limits, the estimating distributions shrink in proportion to 1/√K. It would seem that when such points are used in some subsequent estimation (say, a parametric estimation of the continuous spectrum), the points with minuscule error bars will dominate the fit, making the low-frequency points irrelevant. The resolution of the paradox is in realizing that an "effective point" represents an interval rather than a precise frequency, irrespective of where exactly it is placed. The interval width is proportional to K. Therefore, to use an effective point correctly, one needs to use error bars for both of its coordinates, the vertical as well as the horizontal one.
We have used this averaging in plotting Fig. 5. We split the horizontal axis into intervals of constant length on the log scale and merged the frequencies within each such interval. This means that K starts at 1 on that plot, becomes larger than 1 at k = 47, and grows toward high frequencies, reaching 31 at the highest frequency.

Appendix G: How to treat errors (finite precision) on input
In this article, we have considered stochastic variables, denoted A, B, or C. Their fluctuations (or noise) are what we aim to extract. However, apart from these inherent fluctuations, the values that we use as input in the above formulas are often subject to additional errors. Namely, there is no realistic case where a continuous quantity, such as A_j, would be known precisely. In line with the spirit of this article, the values A_j will be given as probability distributions rather than as exact numbers. These probability distributions are the results of some measurement protocol, which is nothing else than another case of estimation. This estimation comes with its own precision, and thus a non-zero width of those probability distributions on the input. We introduce the notation A′ = A + δA, with A′ the available signal, A the quantity of interest, and δA the unwanted error.
To illustrate, consider that A is the energy of a qubit. As there is no way to measure it directly, one needs to extract it from observing, for example, qubit precession, fitting it to damped harmonic oscillations. While there is no issue with the formulas in the main text, since they work whatever the nature, character, or origin of the fluctuations in the signal, in this case one is interested in the inherent noise of the qubit energy, without the contribution from the imperfect energy estimation. We now give a simple recipe for subtracting the latter contribution. The subtraction is possible if we can separately estimate the spectrum of the additional error δA and this error is uncorrelated with the remaining part of the signal A. In the following, we give the necessary formulas for correcting the auto- and cross-spectra.

Auto-correlation case
In this scenario, one has individual estimates of the auto-correlation for the available input A′ and for the error itself δA. The estimating distribution of the auto-correlation for the signal with the error removed is given by Eq. (G2). Here, the probability distributions P and p are the estimating distributions given in Eq. (21) or Eq. (36) for the error-polluted signal A′ and the error δA, respectively. In the simple case of the error being white noise with variance σ², one can use p(λ_k) = δ(λ_k − σ²) in the above, and the only calculation required is to get the overall normalization factor as a one-dimensional integral.
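For the white-noise case, this reduces numerically to evaluating the error-polluted distribution at a shifted argument and renormalizing, the one-dimensional integral mentioned above. A minimal sketch (the exponential `P` is a toy stand-in for the actual estimating distribution, not the article's formula):

```python
import numpy as np

def subtract_white_error(P, grid, sigma2):
    """White-noise case of the error removal: with
    p(lambda) = delta(lambda - sigma^2), the corrected distribution
    is P(Lam + sigma^2), renormalized on Lam >= 0.
    P is a callable; grid is a uniform array of Lam values."""
    Pt = P(grid + sigma2)
    dx = grid[1] - grid[0]
    return Pt / (Pt.sum() * dx)      # overall normalization factor

# toy estimating distribution for the error-polluted signal
P = lambda L: np.exp(-L) * (L >= 0)
grid = np.linspace(0.0, 20.0, 2001)
Pt = subtract_white_error(P, grid, sigma2=1.0)
```

The returned array is a well-defined probability density regardless of how large σ² is compared to the signal auto-correlation.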
In practice, one cannot sample the error together with the signal (since that would mean one could simply subtract the error and use precise inputs). Therefore, the assumption of their zero correlation cannot be checked and will hardly be met exactly in practice. For this reason, we do not delve into this topic further[25] and stop at Eq. (G2) as a rough approximation for the signal stripped of the additional error. We put a tilde on the resulting estimating distribution to emphasize that the result of Eq. (G2) relies on an explicit assumption (zero correlation between the signal of interest and the error) which will not be met exactly in practice, unlike all other estimating distributions derived in this article, which are, in a well-defined sense, exact. Nevertheless, even if only approximate, we stress that Eq. (G2) produces a valid estimating distribution (a well-defined probability distribution), whatever the relation of the signal and error: the error might be comparable to or even much bigger than the typical value of the signal auto-correlation as estimated from the available data.

Cross-correlation case
We now extend the input-error-removal procedure to cross-correlations. Equation (G1) then applies to each of the two stochastic variables A and B. The cross-correlator accessible from the error-polluted input data follows as before; this result assumes that the errors are uncorrelated with the signals. We now discuss two typical scenarios. First, consider that A and B are estimated independently, either using separate experimental equipment or unrelated measurement and estimation procedures. In this case, one expects mutually independent errors, ⟨δA_i δB_j⟩ = 0, and there is nothing to correct for.[26] Second, consider that A and B are composite quantities: not directly accessible, they are derived from other measured or estimated variables. In this situation, the errors will often be correlated through common elements in that derivation. Assuming that one can somehow estimate the cross-spectrum ⟨δA_i δB_j⟩, it can be removed analogously as before for auto-correlations. The main difference is that cross-correlations are complex quantities, having both phase and magnitude. The subtraction needs to be done for the estimating distributions describing both, that is, before either of these two variables (phase or magnitude) is marginalized out. Let us consider that the removal is done for the estimating distribution expressed using the unnormalized cross-correlation strength |Λ_k| and the phase φ_k. It corresponds to Eq. (E3) before integrating out the phase and reads

[26] First, this result still relies on the assumption of the errors being uncorrelated with the signals. Otherwise, the errors do become correlated through the correlation between the signals; such errors we simply cannot correct for. Second, the result is counterintuitive, since acquiring additional relevant information must change the Bayes posterior. The lack of such a correction uncovers a deficiency of the approximate procedure expressed by Eq. (G2) and its analogs. See Footnote 25.
where the complex number Λ_k = |Λ_k| exp(iφ_k). We now use this distribution in the following variants: P(Λ′) for the cross-correlation evaluated with the available (error-polluted) data, p(λ) for the cross-correlation of the errors, and P(Λ) for the cross-correlation of interest. Using the tilde with the same meaning as explained below Eq. (G2), the desired estimating distribution is Eq. (G5), with the integration being over the complex plane. Otherwise, the equation is completely analogous to Eq. (G2).
The marginal distribution for the magnitude is obtained by integrating out the phase in P(Λ_k), and vice versa.
Let us finish with comments on using Eq. (G5) in practice. The formula contains two explicit one-dimensional integrals, while the distributions P and p require another integral each (they are given by two instances of Eq. (G4), which do not have to be normalized individually, as only the final distribution needs to be normalized). Counting also the integration required for the marginalized distributions, we get a procedure requiring five nested integrals, making it computationally demanding. Still, this is the most favorable scenario: the removal can be applied using alternative parametrizations, for example the set {Λ^A_k, Λ^B_k, Re Λ^AB_k, Im Λ^AB_k}. However, using the normalized correlation strength s_k among the parameters, the presence of auto-correlations in the denominator [see Eq. (28)] makes the procedure too complex to be practical in our experience. Therefore, we do not give formulas for that variant: the errors on input can be removed only for the cross-correlation parametrized with the unnormalized correlation strength described in App. E.

Appendix H: Central-limit results
Here we consider how the estimating distributions scale in the limit M → ∞. Compared to the exact expressions, the limiting approximations have two benefits. First, this limit corresponds to a large number of observations, for which the formulas should display the central-limit behavior: each estimating distribution should become a Gaussian with a variance falling off as 1/M. Deriving such a variance serves as a consistency check and gives insight into how the distribution width (the uncertainty) depends on the sufficient statistics. The second benefit is practical: the exact expressions are complicated functions, often given only as integrals, which become exceedingly unwieldy in this limit. Standard numerical libraries might fail to reliably evaluate hypergeometric functions with parameters of order a thousand or more. While one rarely has so many data batches available, very high values of M easily result when plotting representative values according to App. F. Here, the limit formula might become numerically more precise than the exact one (unless the latter is evaluated in a special way). Perhaps even more importantly, the Gaussian-like posteriors drastically simplify the evaluation of the figure of merit in a parametric fit of the spectrum: instead of possibly involving complicated integrals, the logarithm of Eq. (51) becomes a simple sum of weighted squared differences.[27] Therefore, the central-limit expressions can be used to quickly locate the approximate position of the minimum for a parametric fit of spectra, followed by a refinement using the exact formulas.
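To illustrate the practical benefit, here is a toy sketch of such a central-limit figure of merit. The Gaussian form Λ̂_k ∼ N(Λ_k, Λ_k²/M) and the power-law model A/k^g are our assumptions for illustration, not the article's exact formulas.

```python
import numpy as np

def fom(theta, k, Lam_hat, M):
    """Weighted-squared-differences figure of merit: the negative log
    of a Gaussian approximation Lam_hat_k ~ N(Lam_k, Lam_k^2 / M)
    for the hypothetical power-law model Lam_k = A / k**g."""
    A, g = theta
    Lam = A / k**g
    return np.sum(M * (Lam_hat - Lam)**2 / (2 * Lam**2) + np.log(Lam))

# toy data: a 1/f spectrum with relative scatter of order 1/sqrt(M)
rng = np.random.default_rng(2)
M, k = 200, np.arange(1, 257)
Lam_hat = (1.0 / k) * (1 + rng.normal(scale=1 / np.sqrt(M), size=k.size))

# crude grid search to locate the approximate minimum, to be refined
# afterwards with the exact formulas
As = np.linspace(0.5, 1.5, 101)
gs = np.linspace(0.5, 1.5, 101)
vals = np.array([[fom((A, g), k, Lam_hat, M) for g in gs] for A in As])
iA, ig = np.unravel_index(np.argmin(vals), vals.shape)
print("A ~", As[iA], " g ~", gs[ig])
```

The minimum lands close to the generating parameters (A, g) = (1, 1), locating the starting point for a refinement with the exact expressions.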
In this part, we prefer simplicity to rigor: we are interested only in the leading-order results. Among other simplifications, we neglect factors of order one compared to M, so that we drop any small integer, δ_{k,0} among them, accompanying M in a sum. We skip the derivations, give results only for the most important estimating distributions, and do not specify the error arising from the approximations. Concerning the last point, it is important to understand the following: the results given below are valid in the limit M → ∞ unless the sufficient statistics are singular, for example s = 0 or s = 1. The singular cases might not display the central-limit behavior. This fact implies that, when M is fixed and the estimating distributions are considered as functions of their sufficient statistics, the limit expressions will fail once the sufficient statistics come close enough to a singular value, for example once s becomes small enough. In other words, at what value the parameter M can be considered "large", so that the formulas below apply, depends not only on M itself but also on the values of the sufficient statistics.

The auto-correlation
The estimating distribution in Eq. (21) is the Poisson distribution, simple enough to work with that no approximation is needed. Nevertheless, to be consistent with the remainder of this section, we approximate it by the following form, valid in the limit M → ∞. The distribution p(λ; m) in Eq. (H2a) is the probability to observe m events over time m for a Poisson process with rate λ.[28] It peaks at the rate λ = 1, corresponding to Λ_k = Λ̂_k. At large m, the distribution is well approximated by Eq. (H3). In the first panel of Fig. 9, we compare Eq. (21) to its central-limit approximation in Eq. (H3).

[27] For example, in the case of the correlation spectrum Λ_k, Eq. (51) becomes a sum of the logarithms of the Gaussians in Eq. (H3), and analogously for other variables.
[28] The exact expression for the density p(λ; m) contains an additional factor 1/λ^{δ_{k0}/2+1}.
While it is not exact for Eq. (21), the appeal of Eq. (H2) is that it also covers the case of the correlation measurement considered in Sec. IV B. Namely, Eq. (36) reduces to it in the limit M → ∞; the only change needed is to put the superscript A on the two quantities on the right-hand side of Eq. (H2b).

The normalized cross-correlation
For large M, the hypergeometric function in Eq. (40) can be set to 1, which greatly simplifies the expression. Integrating out the variable s_k, we get Eq. (H5), a form which is suitable when the explicit periodicity in the phase variable is beneficial. Alternatively, the expression explicitly Gaussian in the phase variable is Eq. (H6). Similarly, integrating out the phase gives

(H7)
The two approximations are compared to the exact distributions in the lower two panels of Fig. 9.

The unnormalized cross-correlation
We can cast Eq. (E3) into the following form. We were not able to obtain an analytic estimate for the variance of Λ^AB_k directly from Eq. (E3); however, our numerical investigations suggested the expression below. As we could derive the same formula from Eq. (E3) in the limit ŝ_k → 0, we believe that it is valid generally. The correspondence between the exact distribution and its approximate form is illustrated in the upper right panel of Fig. 9. Among other things, Eq. (H8) implies that one needs M ∼ Λ^A_k Λ^B_k/(Λ^AB_k)² batches to discriminate a finite value of Λ^AB_k from zero, that is, to detect a cross-correlation.

FIG. 1. Left: estimating distributions, Eq. (9), for Λ = 1 and different sizes M of the data sample. The curves have been shifted and scaled along the vertical axis for visibility. Right: 90%-confidence intervals for the estimating distributions. The top figure shows how the maximum-likelihood estimator and the confidence intervals are defined. The bottom figure plots the relation between confidence and sample size. The color code is consistent with the plot on the left.
for k = 0, n/2; α^(m)_k ∈ R otherwise. These expressions are all we need: to generate the batch (m), generate the normally distributed and independent variables Re α^(m)_k and Im α^(m)_k with their sampling distributions. The distribution of Re α^(m)_k is also shared with Sec. III and trivially extends to β^(m)_k for B.

FIG. 2. Comparison of estimating distributions of the variable A when considering the partial data Â (dashed lines) versus the whole data X (colored solid lines). The different colors indicate the values of s̄_k, ranging from 0 to 0.9 in steps of 0.1. The batch size M appears in the top right corners, Λ̄^A_k = 1 and ᾱ_0 = 0. Top: Estimating distributions of Λ^A_k, namely Eq. (21) versus Eq. (36). Bottom: Estimating distributions of μ_A, namely Eq. (23) versus Eq. (46). The deviations from the dashed lines are significant for low values of M and s̄_k close to 0 or 1.

FIG. 3. Distributions of cross-correlations as a function of s̄_k, φ̄_k, and M. (a) Contours of P(s_k, φ_k | X), Eq. (40), enclosing the 75, 50, 25, 10, and 5%-probability regions. The plot is in the complex plane, with s_k as the radial coordinate and φ_k as the polar angle. The unit length is given by the radius of the circle. For red, yellow, blue, and green, s̄_k = {0.2, 0.5, 0.7, 0.9} and φ̄_k = {0, 3π/4, 5π/4, 7π/4}, respectively. In all cases, M = 10. (b) Dependence of P(s_k | X), Eq. (43), on s̄_k for M = 10. The curves have been shifted and scaled in the vertical axis. (c) Contours like in (a) as a function of s̄_k and M. The complex-plane axes have been omitted (but note the given length scale) and the plots arranged in a matrix for the sake of visibility and ease of comparison. We take φ̄_k = 0, but recall that the value of φ̄_k is irrelevant for the shape of P(s_k, φ_k | X) up to polar rotations. The crosses mark the maximum-likelihood estimators.

FIG. 4. Distributions of cross-correlations as a function of M for s̄_k = 0.7 and φ̄_k = π. The curves are shifted and scaled in the vertical axis for visibility.
Z^(m)_k, into two independently distributed scalars. To do so, we perform the transformation W^(m)_k = T_k (Z^(m)_k − ν_k) that diagonalizes the matrix inside the sampling distribution, with eigenvalues H_k and J_k. The distribution for the imaginary parts only applies to k ≠ 0, n/2, because h^(m)_k is real for k = 0, n/2. These distributions are all we need to generate two variables A and B with given means μ_A, μ_B and spectra parametrized by Λ^A_k, Λ^B_k, s_k, and φ_k. The procedure is as follows. To generate a batch (m): (i) for each k, diagonalize the matrix Λ̄_k, finding the coordinate-change matrix T_k and the eigenvalues H_k and J_k; (ii) generate the independent and normally distributed variables Re h^(m)_k and Im h^(m)_k with their sampling distributions, Eq. (48) and similar, and do the same for j^(m)_k; finally, (iii) perform the transformation W^(m)_k → Z^(m)_k − ν_k with T^{-1}_k, and Z^(m)_k → Â^(m), B̂^(m) with F^{-1} (the inverse Fourier transform from frequency space back to time space).
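As a concrete illustration of this procedure, the following sketch generates two correlated real time series in the frequency domain. It uses a Cholesky factor of the 2×2 spectral matrix in place of the eigendecomposition into T_k and diag(H_k, J_k) (both are matrix square roots, so the statistics are the same); the function name and the periodogram normalization convention are ours, not the paper's:

```python
import numpy as np

def generate_pair(n, S_A, S_B, s, phi, rng):
    """Generate one batch of two real, zero-mean series of length n whose
    auto-spectra follow S_A, S_B and whose normalized cross-spectrum has
    magnitude s[k] and phase phi[k] (arrays over the n//2 + 1 rfft bins).
    At k = 0 and k = n/2 the Fourier coefficients must be real, so phi
    there should be 0 or pi."""
    nf = n // 2 + 1
    a = np.empty(nf, dtype=complex)
    b = np.empty(nf, dtype=complex)
    for k in range(nf):
        cross = s[k] * np.sqrt(S_A[k] * S_B[k]) * np.exp(1j * phi[k])
        C = np.array([[S_A[k], cross], [np.conj(cross), S_B[k]]])
        # Cholesky factor plays the role of the coordinate change applied
        # to the independent scalars h_k, j_k
        L = np.linalg.cholesky(C)
        if k in (0, n // 2):
            w = rng.standard_normal(2)  # real coefficients at these bins
        else:
            w = (rng.standard_normal(2) + 1j * rng.standard_normal(2)) / np.sqrt(2)
        a[k], b[k] = L @ w
    scale = np.sqrt(n / 2)  # so that E|rfft(A)[k]|^2 * 2/n = S_A[k]
    return np.fft.irfft(a * scale, n), np.fft.irfft(b * scale, n)

# example: two length-256 series with flat spectra and coherence 0.5
rng = np.random.default_rng(0)
nf = 256 // 2 + 1
A, B = generate_pair(256, np.ones(nf), np.ones(nf),
                     np.full(nf, 0.5), np.zeros(nf), rng)
```

Averaging the cross-periodogram of many such batches recovers the prescribed magnitude s and phase phi of the normalized cross-spectrum.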

FIG. 5. Top: evolution of the generated variables A, B, and C for batch m = 1. Bottom: estimation of the spectra Λ^A_k with Eq. (21), and of Λ^B_k, Λ^C_k with Eq. (36). The maximum-likelihood estimators appear circled, and the estimating distributions for Λ^A_k, Λ^B_k, and Λ^C_k are overlaid as a linear color gradient. The dashed lines plot the spectral functions used to generate the data, see Eqs. (49). Note that adopting the log scale for the horizontal axis makes the distributions for high k overlap. To declutter the plot and keep its information content, we locally average the points using the procedure described in App. F.

Figure 5 plots the time evolution of the first batch (top panel) and the resulting estimated auto-correlations (bottom panel). The estimation of Λ_k uses Eq. (36) of Sec. IV B for B and C, and Eq. (21) of Sec. III B for A. These estimating distributions are plotted in a color scale. The results fit well the continuous curves used to generate the data.
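A minimal sketch of this estimation step: generate M batches of a single variable with an assumed target spectrum (a simple 1/(1+k) shape stands in for the paper's Eqs. (49)), and recover the spectrum from the batch-averaged periodogram, which under the Gaussian model is where the maximum-likelihood estimator of Λ_k sits (up to normalization conventions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, M = 256, 500
k = np.arange(n // 2 + 1)
S = 1.0 / (1.0 + k)  # assumed target spectrum, not the paper's Eqs. (49)

periodogram = np.zeros(n // 2 + 1)
for m in range(M):
    # independent complex-Gaussian Fourier coefficients alpha_k, variance S_k
    alpha = np.sqrt(S / 2) * (rng.standard_normal(k.size)
                              + 1j * rng.standard_normal(k.size))
    alpha[0] = np.sqrt(S[0]) * rng.standard_normal()    # real at k = 0
    alpha[-1] = np.sqrt(S[-1]) * rng.standard_normal()  # and at k = n/2
    A = np.fft.irfft(alpha * np.sqrt(n / 2), n)         # one real batch
    periodogram += np.abs(np.fft.rfft(A)) ** 2 * 2 / n  # matching normalization

S_hat = periodogram / M  # batch-averaged spectrum estimate
```

Per frequency bin, the relative spread of S_hat scales as 1/sqrt(M), consistent with the narrowing of the estimating distributions as M grows.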
contain M real numbers each. It is now an algebraic exercise to cast the expression into the following form, where s̄_k and |Λ̄^AB_k| are as defined in the main text, K_n is the modified Bessel function of the second kind, and I_n is the modified Bessel function of the first kind. Note how |Λ̄^AB_k| sets the scale for Λ^AB_k. In Fig. 8 we illustrate this distribution for different values of M.

FIG. 8. Distribution of the magnitude of unnormalized cross-correlations as a function of M for s̄_k = 0.7. The curves are shifted and scaled in the vertical axis for visibility.
To get the single representative, one should merge these sets into one set containing KM entries, {α^(m)_eff} = ∪_{k∈K} {α^(m)_k}.
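The merge is a plain union of sample sets, i.e., a concatenation; a trivial sketch, with array names of our choosing:

```python
import numpy as np

# per-frequency sample sets {alpha_k^(m)}, M = 10 entries each,
# for the frequencies k in the averaging window K = {3, 4, 5}
rng = np.random.default_rng(0)
samples = {k: rng.standard_normal(10) for k in (3, 4, 5)}

# the single representative set contains all K*M entries
alpha_eff = np.concatenate([samples[k] for k in sorted(samples)])
```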

FIG. 9. Central-limit (large-M) approximations to the estimating distributions. In each panel, exact estimating distributions for the variable given on the x axis are plotted for several values of m, as given by the color bar. The black dashed curves give the approximate (central-limit) estimating distributions for m = 200.
in the main text. These sets differ in only one variable, |Λ^AB_k| in place of s_k. Therefore, the only additional marginal posterior to calculate is p(|Λ^AB_k|); the remaining three posteriors remain as given in the main text.