Testing the Quasar Hubble Diagram with LISA Standard Sirens

Quasars have recently been used as an absolute distance indicator, extending the Hubble diagram to high redshift to reveal a deviation from the expansion history predicted for the standard, $\Lambda$CDM cosmology. Here we show that the Laser Interferometer Space Antenna (LISA) will efficiently test this claim with standard sirens at high redshift, defined by the coincident gravitational wave (GW) and electromagnetic (EM) observations of the merger of massive black hole binaries (MBHBs). Assuming a fiducial $\Lambda$CDM cosmology for generating mock standard siren datasets, the evidence for the $\Lambda$CDM model with respect to an alternative model inferred from quasar data is investigated. By simulating many realizations of possible future LISA observations, we find that for $50\%$ of these realizations (median result) 4 MBHB standard siren measurements will suffice to strongly differentiate between the two models, while 14 standard sirens will yield a similar result in $95\%$ of the realizations. In addition, we investigate the measurement precision of cosmological parameters as a function of the number of observed LISA MBHB standard sirens, finding that 15 events will on average achieve a relative precision of 5\% for $H_0$, reducing to 3\% and 2\% with 25 and 40 events, respectively. Our investigation clearly highlights the potential of LISA as a cosmological probe able to accurately map the expansion of the universe at $z\gtrsim 2$, and as a tool to cross-check and cross-validate cosmological EM measurements with complementary GW observations.


I. INTRODUCTION
A key goal for cosmology is to characterize the accelerated cosmic expansion [1,2] and understand the underlying physics [3,4]. Major efforts are underway to measure the expansion history using a variety of techniques, across a wide span of redshift, in hopes of revealing a clue to the nature of dark energy. To date, the Hubble diagram has been constructed out to redshift z ∼ 1.4 using the local distance ladder [5] and type Ia supernovae as standard candles [6]. In the near future, dedicated observatories and experiments such as DESI [7], Euclid [8], the Rubin Observatory [9], and the Roman Space Telescope [10] will refine and extend the Hubble diagram out to z ∼ 3. Other methods, using for example the cosmic microwave background (CMB) [11], can retrieve information on the cosmic expansion at higher redshift, but must assume a cosmological model in order to constrain the late-time expansion rate.
A new, direct method to measure the expansion history out to significantly higher redshift has recently been implemented, using quasars as absolute distance indicators [12,13]. Determination of the quasar luminosity distance relies on a phenomenological relationship between the ultraviolet and X-ray emissions of the accreting plasma [14]. Risaliti & Lusso [13] analyzed a large catalog of X-ray and UV observations of quasars to calibrate this relationship and build a Hubble diagram that extends out to z 5.5. They found that their quasar sample yields distances that are in agreement with the standard, ΛCDM cosmological model at z 1.4. However, at larger redshifts there appears to be significant disagreement between the quasar distances and the ΛCDM model predictions [13,15]. This innovation necessarily raises new questions. Are quasars truly reliable distance indicators?
There are no clear answers at present, and secondary analyses of the quasar data have come to varying conclusions about the state of the standard cosmology [16][17][18][19]. Is there a crisis? After all, the local Hubble diagram has come under close scrutiny, recently, due to the apparent tension with parameter values inferred from the CMB [5,11,20,21]. For certain, there is a need for new data and new methods to probe the expansion history.
The nascent field of gravitational wave (GW) astronomy offers a way to address these and other questions, providing a new, independent probe of the expansion using standard sirens [22][23][24]. An absolute distance is obtained from the inspiral and merger gravitational waveform, which, coupled with an electromagnetic (EM) measurement of the redshift, yields a new, independent determination of the cosmological parameters, for example of the Hubble constant [25]. It is worth stressing that distance measurements with GWs are absolute, meaning that they do not depend on a specific cosmological model or other distance indicators.
The standard siren method has already been used to produce constraints on the Hubble constant H 0 with real data collected by the LIGO-Virgo interferometers [25][26][27][28]. In particular GW170817 [29], a binary neutron star merger, was the first multi-messenger event for which both GW and EM signals were observed. This first standard siren provided a measurement of H 0 = 70.0 +12.0 −8.0 km s −1 Mpc −1 [30][31][32]. Other GW events observed without EM counterparts can also be used as standard sirens. If the host galaxy cannot be identified, one can still cross-correlate the sky localization region of the GW event with a galaxy catalog and assign to each galaxy within this region a non-zero probability of being the host galaxy [22,33,34]. This "statistical method" has less constraining power per event than the counterpart method because the redshift is not known precisely. However, combining information from many GW events allows the statistical significance to be increased. Combining all the events observed in the first two LIGO and Virgo observing runs gave a result H 0 = 68 +14 −7 km s −1 Mpc −1 [32]. Looking ahead, GW observations by LIGO/Virgo and proposed observatories, such as the Cosmic Explorer [35] or the Einstein Telescope [36], will yield orders of magnitude more candidate standard sirens. Perhaps the boldest of these future plans is the Laser Interferometer Space Antenna (LISA) [37], a GW observatory consisting of a fleet of three satellites in an Earth-trailing, heliocentric orbit, positioned at the vertices of a 2.5 million km triangle. Using laser interferometry, LISA will measure GWs in the mHz frequency range. The loudest objects in this band of the GW sky are expected to be mergers of massive black hole binaries (MBHBs). Observations of the expected GW signal, and an EM counterpart by ground-based telescopes, offer the exciting possibility to extend the Hubble diagram out to z ∼ 7 or higher [38,39]. This is the possibility we explore: the ability of deep-redshift standard sirens observed by LISA to validate or refute the expansion history inferred by the quasar Hubble diagram [13]. To do this, we generate mock LISA MBHB standard siren observations based on a flat ΛCDM cosmology and we adopt a Bayesian framework to test the deviation found in the quasar observations. We aim to estimate how many MBHB standard sirens are necessary to rule out a deviation of that size and to explore the power of LISA to constrain the matter density and Hubble parameter. Due to the general formulation of our analysis, this work can be also used to test with standard sirens other cosmological models against ΛCDM.
In the long lead time before LISA flies, it is clear that the quasar Hubble diagram will evolve. The X-ray satellite eROSITA [40] will deliver approximately three million AGNs [41]. On this basis, the Hubble diagram could be boosted by a sample of ∼ 10 4 quasars, distributed up to z ∼ 3, after various selection cuts and criteria are applied [42]. Higher redshift quasars that stretch back to the epoch of reionization may be observed by Euclid [8], with estimates suggesting 50 − 200 objects in the range z = 7 − 8, and order 10 at z = 8 − 9 [43]. Follow-up observations of a few of these rare objects could yield luminosity distances reaching as far back as standard sirens. The number of objects in the catalog is not the limiting factor, however. The major barrier is the scatter in the X-ray and UV relationship and any underlying systematic effects. Time will tell if the quasar Hubble diagram can be refined with ensuing observations, and whether the discrepancy with the ΛCDM expansion history persists. In this paper, we take the tension at face value, as a target for standard sirens. Of course, LISA offers a unique view of the cosmos in its own right, and will provide an independent probe of the high-redshift expansion history. This paper is organized as follows. In Sec. II we introduce the quasar Hubble diagram constructed with quasar observations and we briefly summarise the main steps to reproduce the results found in Ref. [13]. In Sec. III we describe how the mock LISA standard sirens observations are generated and we investigate how such observations can constrain the Hubble parameter and the matter density. We present our method of Bayesian hypothesis testing in Sec. IV A, and our results, in terms of the number of LISA standard sirens needed to test the quasar Hubble diagram, in Sec. IV B. We conclude with a discussion of the results and assess future prospects in Sec. V. The code used for the analysis is available at [44].

II. QUASAR HUBBLE DIAGRAM
Cosmologists have long hoped that quasars, observed at redshifts z ∼ 6 and greater, could provide a reliable distance indicator with which to chart the cosmic expansion history. These aspirations took a significant step forward when a phenomenological relationship between the quasar x-ray and ultraviolet luminosities log 10 L X = γ log 10 L U V + β, was obtained by Avni & Tananbaum [45]. Here, L X , L U V are the monochromatic luminosities at 2 keV and 2500Å. There is good physical reason to expect the UV and X-ray spectra are connected: the UV emission originates in the infalling accretion material onto the quasar, whereas the X-rays are emitted by the plasma of hot electrons surrounding the accretion disk that scatter off the UV photons. Measurement of the fluxes yields the luminosity distance, once the parameters γ, β are determined. In terms of the observed fluxes, Eq. (1) becomes log 10 F X = γ log 10 F U V + 2(γ − 1) log 10 H 0 D L +β, The quasar Hubble diagram based on the data set presented in Ref. [13], in the context of two cosmological models. Green dots are quasar data points, shown without error bars; red circles are the binned data, with errors. Blue triangles are the binned SN data, also shown without error bars. The solid black curve is the Ωm = 0.3 ΛCDM model that best fits the joint quasar and type Ia supernovae at z < 1.4. The dashed magenta curve is the best-fit phenomenological model proposed in Ref. [13]. (Right) The relative difference between the luminosity distances for the binned quasar data relative to ΛCDM is shown as a function of redshift. The solid black line represents the ΛCDM baseline. The dashed magenta curve is the best-fit phenomenological model of Ref. [13].
whereβ absorbs the introduction of H 0 into the log of the luminosity distance. One can then use flux and redshift measurements to simultaneously fit the parameters and a theoretical model D L . A data set consisting of measured values of log 10 F X , log 10 F U V , z, and uncertainty δ(log 10 F X ) for N = 1598 quasars has been collected by Risaliti & Lusso [13], and generously shared with us.
To constrain the luminosity distance-redshift relationship with quasar observations we maximise the total likelihood of quasar data, In the above equation, ∆ i = log 10 F X | i − (γ log 10 F U V + 2(γ − 1) log 10 H 0 D L +β)| i ,σ 2 i = δ 2 + δ(log 10 F X )| 2 i , where log 10 F X | i and δ(log 10 F X ) are respectively the observed log of the X-ray flux and its uncertainty, and δ introduces a new global variable to describe the intrinsic dispersion in the flux law, Eq. (2). Following Ref. [13], we make a joint analysis with type Ia SNe data [46]. The SNe likelihood is a multivariate gaussian distribution of the form L SN e ∝ exp − 1 2 χ 2 SN e , where and ∆µ = µ obs − µ thy is the difference in observed and theoretical distance modulus, and C is the covariance matrix. The total likelihood is the product of L SN e and L QSO . To begin, we use only those quasars in the same redshift range as the SNe (z < 1.4). We bin the quasar data to match the redshifts of the SNe data set. In each bin, the mean log-flux is determined and its uncertainty is obtained. We maximize the sum of the log-likelihoods; the cosmological terms are common to both. In this way we calibrate the quasars, and thereby determine parameters γ = 0.6,β = −14.7, δ = 0.25; uncertainties in these parameters are subdominant toσ. Given these parameters, the luminosity distance -redshift is fixed and Eq. (2) yields the data set y = (z i , log 10 (H 0 D i L ), σ log 10 (H 0 D i L ) ), shown in Fig. 1. For the redshift range used for calibration (z < 1.4), we recover a ΛCDM model best fit value consistent with Ω m = 0.31 ± 0.05 (1σ) as obtained in Ref. [13].
Proceeding to higher redshift, however, it is not immediately apparent whether the ΛCDM model and quasars are still in agreement. Fig. 1 shows the binned quasar measurements depart from the ΛCDM prediction at high redshifts. To quantify this tension, Risaliti & Lusso introduce an alternative (ALT) phenomenological model of the luminosity distance -redshift relation [13], D ALT L (z; a 2 , a 3 ) = c ln 10 H 0 (x + a 2 x 2 + a 3 x 3 ) , x = log 10 (1 + z) . The redshift-luminosity-distance relation D ALT L in Eq. (5) is useful to encode the deviation found with the quasar observations. Due to its general formulation, D ALT L can be also used to account for other kinds of observations or alternative cosmic expansion predictions which deviate from the ΛCDM. By expanding the luminosity distance of a flat ΛCDM cosmology at low redshift, it is possible to relate the coefficients a 2 , a 3 of the phenomenological model in Eq. (5) with the matter density Ω m : Setting Ω m = 0.3, we find that, with a 2 2.95 and a 3 3.54, D ALT L differs at most 0.6% from the predictions of the flat ΛCDM for redshifts up to z ∼ 1, and, at most 10% for redshifts up to z ∼ 5 [47].
Using the combined data set of SNe and quasars at all redshifts, it has been claimed [13] that the ALT model with a 2 3.5 and a 3 1.5 provides a much better fit to the data. Indeed, we reproduce the results of Ref. [13] using the ensemble sampler for Markov Chain Monte Carlo (MCMC) emcee [48], and the data set y consisting of all 1598 quasars out to redshift z ∼ 5.5 and the SNe observations to obtain the posterior distribution of a 2 , a 3 . The likelihood used for the MCMC is given by the product of a normal likelihood with mean given by the prediction of D ALT L for the quasar data and the aforementioned multivariate normal distribution L SN e for the SNe data. The prior on a 2 , a 3 is chosen to be a flat distribution. These choices allow us to recover the posterior distribution p(a i | y) of a 2 , a 3 shown in Fig. 2, which is in good agreement with that of Figure 5 of Ref. [13].
It is possible to represent the flat ΛCDM model at low redshift in the (a 2 , a 3 )−parameter space by using the relation between the parameters a 2 , a 3 and the matter density Ω m in Eq. (6). The black dot in Figure 2 corresponds to the specific value of a 2 , a 3 for Ω m = 0.3, whereas the black solid line corresponds to other values of matter density. Although the relation between a 2 , a 3 and the ΛCDM parameters is only approximate and based on an expansion at low redshift, Fig. 2 shows that the point corresponding to ΛCDM in the (a 2 , a 3 ) parameter space is ∼ 4σ away from the best fit value in the (a 2 , a 3 ) posterior retrieved from quasar data. Based mainly on these results, it is further claimed [13] that ΛCDM is no longer a good fit to the quasar Hubble diagram at z 3, and, consequently, there is a 4σ tension between the standard cosmological model and the quasar data.
For the sake of argument, we can ask what does the phenomenological model have that ΛCDM does not? The arc of H 0 D L given by Eq. (5) passes closer to the quasars at z ∼ 1.8, 3 and above, than ΛCDM with Ω m = 0.3. For visualization purposes, in Fig. 1 we have binned the data to emphasize this point. In particular, the quasars at 2.7 z 3 look like an outlier from the perspective of ΛCDM. This disagreement is made clearer where we plot the relative difference, ∆ = (D QSO L /D LCDM L ) − 1, between the binned data and ΛCDM luminosity distance; the ΛCDM curve passes away from the binned data point by roughly four times its standard deviation. This difference is independent of binning in this redshift interval. However, we caution that the quasar data are sparse around both theoretical models at high redshift, suggesting that more data are needed. It is well to keep in mind that unresolved systematics, e.g. selection bias, quasar evolution, or accretion properties, may still affect the inferred luminosity-redshift relationship for quasars (see Ref. [49], but also Ref. [50].) Now we formulate our questions. We would like to find some complementary check of the Hubble diagram at z 3. Can we determine the luminosity distance with sufficient resolution to test deviations from ΛCDM at the 10% level? Can we distinguish ΛCDM from the alternative illustrated in Fig. 1, at redshift z 3? These questions highlight a sort of new, high-redshift Hubble tension that can be addressed with GW standard sirens.

III. LISA MBHB STANDARD SIRENS
LISA is a space mission designed to open the millihertz frequency range to GW observations [37]. In this frequency range different populations of GW sources are expected to be detected, including Galactic binaries [51][52][53][54], stellarorigin black hole binaries [55], extreme mass ratio inspirals (EMRIs) [56], the mergers of massive black hole binaries (MBHBs) with masses between 10 4 to 10 7 M [57] and perhaps stochastic GW backgrounds [58][59][60]. The population of MBHB sources is particularly interesting from a cosmological perspective, since MBHB mergers will be detected at high redshift (up to z ∼ 15 − 20) and could have EM counterparts observable up to z ∼ 7 [38].
As mentioned in Sec. I, obtaining complementary redshift information on GW events, through the identification of the corresponding host galaxy once an EM counterpart is spotted, is in fact one of the simplest ways to perform cosmological tests with GW observations. The GW events are standard sirens [22,24] that provide a direct measurement of the luminosity distance D L to the source. If this can be linked to a redshift z obtained from EM observations, the combined observation provides a point on the distance-redshift relation D L (z; θ ), which can be used to constrain the parameters, θ, of the cosmological model under consideration.
All GW sources that LISA will be able to observe at cosmological distances could be used as standard sirens, either with or without EM counterpart identification. For sources such as stellar-origin BHBs or EMRIs we do not expect any measurable EM counterpart (see however [61][62][63]). For these sources, cosmological inference will only be possible using the statistical method [64][65][66][67]. MBHBs on the other hand could produce observable EM emissions to be detected at higher redshift with respect to stellar-mass BHBs and EMRIs. Several studies found that MBHBs could emit radiation in different bands of the EM spectrum both at merger and during long lasting (ranging from weeks to months) afterglows (see e.g. [68][69][70]). Moreover pre-merger EM observational signatures could even be spotted during their inspiral phase [71][72][73][74]. Given a sufficiently accurate sky localization and a sufficiently powerful EM emission, future telescopes are expected to detect the counterpart of at least a few MBHB mergers per year [38,39]. These high-redshift standard sirens will allow us to probe the expansion of the universe at unprecedented distances and to test possible deviations from ΛCDM at z 3 [75][76][77]. This latter topic is the focus of our investigation, for which we will need realistic catalogs of LISA MBHB standard sirens. In what follows we will summarise how we construct such catalogs, building mainly on the results of [38,39]. The details of our approach are provided in Appendix A.

A. Construction of standard siren catalogs
To simluate catalogs of MBHB standard sirens, we consider the source redshift distributions provided by [38], where the redshift distribution of LISA MBHB standard sirens with identified EM counterparts are reported according to the simulations performed in [39] and based on the LISA sensitivity curve of [37]. From these redshift distributions we pick suitable values for all events in the catalogs. To each redshift value we then associate a unique value of the luminosity distance D L by using the flat ΛCDM distance-redshift relation  with fiducial cosmological parameters set to H 0 = 70 km s −1 Mpc −1 and Ω m = 0.3. The final value of the luminosity distance of each MBHB event in the catalog is then randomly scattered around the value given by our fiducial cosmology according to a Gaussian probability with standard deviation σ D , which is estimated as follows.
To each MBHB event in our simulated dataset we associate a realistic 1σ distance error by combining the following independent uncertainties (see Appendix A for details): • The weak lensing contribution, estimated as in [39,78] but taking into account a possible delensing of up to 30%.
• The LISA instrument error, estimated to scale as σ LISA /D L ∝ 2D L [80].
• The redshift measurement error from the EM counterpart, assumed to be observed photometrically at z > 2 and spectroscopically at z < 2.
The total distance uncertainty σ D is then obtained by adding in quadrature all contributions listed above. Therefore, the total distance uncertainty depends not only on the source redshift but also on the assumed cosmological model. Fig. 3 shows the contribution of all these sources of error as a function of redshift, together with the total estimated error on the luminosity distance.
Each LISA MBHB event that will be used in the rest of the paper has been constructed according to the procedure outlined above and detailed in Appendix A. Note however that this procedure cannot estimate the actual number of cosmologically useful events that LISA will detect. This information can only be obtained by simulating the full formation and evolution history of MBHBs, by performing parameter estimation over the GW signal and by estimating the emission and detectability of EM counterparts [39,81]. Because of this all our results are given in terms of the number of standard siren that LISA will observe, with fiducial values taken from the existing literature.

B. Testing ΛCDM with LISA MBHB standard sirens
We can now use the catalogs defined above to check how well ΛCDM, the standard cosmological model, can be constrained by a given number of MBHB standard sirens. In particular we focus on how well the matter density parameter and the Hubble constant are constrained. For this reason, we express Eq. (7) in terms of the Hubble parameter h defined through H 0 /c = h/2997.9 Mpc −1 , which for our fiducial flat ΛCDM is equal to h = 0.7. Since no more than ∼50 standard sirens are optimistically expected to be detected by LISA over its maximal extended mission duration (10 years) [38,39,77], we will analyse the constraints arising from different possible observational outcomes, with the number of total detected standard sirens ranging from a minimum of 10 to a maximum of 50 events. Our fiducial scenario will assume 15 standard sirens, in agreement with a realistically expected outcome over a nominal mission of 4 years [38,77]. Given a set of standard sirens observations with associated luminosity distance errors, namely y = (z j , D j , σ Dj ), we choose to work with the following likelihood, to conduct our statistical analysis. This specific choice implicitly assumes that the luminosity distance measurements are normally distributed around the cosmological predictions D M A L (z j ; θ A ) of a given cosmological model M A with parameters θ A . In addition, the standard deviation of the normal likelihood is assumed to be the luminosity distance uncertainty σ Dj = σ D (z j ) of the respective measurement, which in turn depends on the redshift and on the luminosity distance prediction of the cosmological model used to generate the data, i.e. ΛCDM. In the following inference analysis we study the constraining power on the cosmological parameters of the ΛCDM model through the luminosity distance . Therefore, we do not include the dependency of the luminosity distance uncertainty σ Dj on the cosmological model in our inference analysis and we take each σ Dj as an estimate for the measurement uncertainty (in other words the ΛCDM parameters over which σ Dj depends will not be varied, but fixed to their fiducial values).
In Fig. 4 we report the ΛCDM relative constraints at 90% Confidence Interval (C.I.) as a function of the number N SS of standard siren events. To obtain these results, for each value of N SS we have simulated 300 standard siren catalogs. For each catalog, we find the best-fit ΛCDM cosmology through maximum likelihood estimation, which is equivalent to a least squares-fit to the D L -z measurements. Confidence intervals on the best-fit parameters are found by assuming Gaussian errors in the measurements, consistent with the model above. The numbers reported in Fig. 4 correspond to the median of the distribution of all the 90% C.I. obtained in this way from all the 300 catalogs with the same N SS . In other words, the reported numbers represent the median of the corresponding quantity over 300 realizations with a fixed number of standard siren events, but different redshift and distance values for each event. is clear that roughly 25 MBHB standard sirens will suffice to provide a ∼3% error on H 0 and ∼6% error on Ω m , while 40 MBHB standard sirens will improve these results to ∼2% and ∼5%, respectively. For our fiducial case however (N SS = 15) we can reach only a relative error of 5% on H 0 and of 10% on Ω m . These results are broadly in agreement with previous analyses [39,77].
The constraining power in the ΛCDM parameter space depends on the specific realization. We estimate the constraining power of the measurements by computing a Gaussian approximation to the likelihood, which is where θ = (h, Ω m ). The determinant of the matrix Γ ij determines the volume of the localization region. For each N SS , we found the realization that had the median localization volume among all those simulated and took this to be "representative" of that number of standard sirens. In Fig. 5 we use these representative realizations to construct contour regions at the 90% confidence level in the Ω m -h parameter space, for N SS = 10, 15, 20. The contour regions in Fig. 5 are based on the Gaussian approximation of the likelihood, and thus well characterize the latter only for a relatively large number of observations, which in our case means N SS 15. The contour regions (ellipses) are centered on the best fit value estimated from the representative realization through the maximum likelihood estimator. Note that since in each dataset the data points have been scattered randomly around our fiducial cosmology, the actual values of the best fit parameters are not significant, as the best fit parameters scatter around the injected values depending on which redshift and measurement uncertainty are randomly drawn in the data generation procedure (the fact that all three best fit values lie on the bottom-right with respect to the true injected value is just a coincidence). In order to further check the results for our fiducial scenario, and thus cross-check the Gaussian approximation, we have further performed a Bayesian analysis of the representative realization with 15 events. Therefore, we obtain the posterior distribution p(Ω m , h| y) using emcee [48] and we show the results of the MCMC sampling in Fig. 6. We also provide the MCMC implementation in Ref. [44]. For the MCMC, we set flat priors for both parameters Ω m and h respectively with support Ω m ∈ [0, 1] and h ∈ [0.2, 2.2] and we use the likelihood reported in Eq. (8).
In Table I we show some summary statistics of the Bayesian and frequentist analysis of the 15 representative standard sirens. The median of the marginalized posterior shown in Fig. 6 are in agreement with the best-fit values obtained with the aforementioned frequentist analysis at 68% confidence level and both Bayesian and frequentist analysis yield 3% and 16% constraints on the relative precision of h and Ω m , respectively. Fig. 5 shows that the higher the number of standard sirens, the better the ΛCDM constraints, as expected. By comparing Fig. 5 and Fig. 6 we can also notice that the Gaussian approximation cannot fully catch all the features  Table I of the true posterior for 15 events, although the approximation seems reasonably correct at least to define 1 and 2σ confidence regions.
Finally, in order to better visualize the uncertainty in the distance-redshift space as well, the representative realization with 15 standard sirens has been used to produce a regression plot in the z-D L space (left panel of Fig. 7), which shows the allowed ΛCDM curves at the 90% C.I. (solid blue lines) together with the best fit from the dataset realization (dotted blue line), the injected ΛCDM curve (black solid line) and the data points of that particular realization with their associated 1σ uncertainties in the distance (in red). The right panel of Fig. 7 shows the residuals of each curve and data point with respect to the fiducial ΛCDM curve. From Fig. 7 we deduce that LISA will test deviations from ΛCDM in the distance-redshift space at the 10% level or better up to redshift ∼6, roughly.
This clearly shows the potential of LISA to provide unique cosmological constraints, not only for ΛCDM but also for other cosmological models, in a regime not yet efficiently probed by standard EM observations. These results already indicate that LISA will have the required power to test the deviations from ΛCDM reported with quasar data at z 3. In what follows we will quantify this statement using Bayesian hypothesis testing, and by looking in particular at how well LISA will differentiate between ΛCDM and the high-z deviation found in Ref. [13].

A. Methodology
The scope of this section is to forecast the number N SS of MBHB standard sirens that we would need to observe to test the deviation from the ΛCDM model inferred from the quasar luminosity distances [13]. Bayesian hypothesis testing is a suitable framework for this goal, because the parameters of different cosmological models are not known precisely and are therefore well represented as distributions which depend on the model.
If we have two cosmological models M 1 and M 2 that describe the expansion history of the Universe and predict a luminosity distance relation D M A L (z; θ A ), with parameters of the model θ A , we can compute the Bayes factor to understand which model is favored by a set y of observations, assuming that the two models are equally probable a priori. In our analysis, the Bayes factor is used to forecast how many measurements y = (z j , D j , σ Dj ) of MBHB standard sirens are needed to conclude that the data strongly favor ΛCDM with respect to the deviation found in Ref. [13], assuming the underlying model providing the data is indeed ΛCDM. As observations, we use catalogues of luminosity distances y = (z j , D j , σ Dj ) generated using the prescription explained in Sec. III. The likelihood p( y | θ A , M A ) and the model parameter priors p( θ A |M A ), are chosen according to the following prescriptions. The likelihood p( y | θ A , M A ) is given by Eq. (8) and the expected value D M A L (z j ; θ A ) is given by the prediction of luminosity distance of the following two cosmological models: We fixed in both models H 0 = 70 km/s Mpc and c = 2.9979 × 10 8 m/s as done in Ref. [13]. We do not include the Hubble constant as an additional parameter to Model 2 because we aim to test exclusively the predictions of the Universe expansion history of the two cosmological models, and because it is a calibration constant for the ALT best-fit model. Even including the Hubble constant as an additional parameter to Model 2, we find that our results are not considerably affected. For Model 1 (flat ΛCDM model) a suitable prior is the beta p.d.f. Ω m ∼ β(a, b) because of its support Ω m ∈ [0, 1]. Notwithstanding the tight constraints on the value of Ω m from different cosmological measurements (see e.g. [11]), we choose a prior peaked on the value of matter density of Ω m = 0.31±0.05 found in Ref. [13] at low redshifts, where good agreement is obtained between the ΛCDM model and the quasar data. Therefore, we choose β(a = 22.65, b = 50) as the prior on the ΛCDM model in order to have a median of 0.31 ± 0.05, where the uncertainty is the 68% credible interval. For comparison, the analysis of the CMB power spectrum combined with lensing and BAO give the following constraint on the matter density Ω m = 0.315 ± 0.007 [11].
For Model 2 (ALT best-fit model) we approximate the posterior of a 2 , a 3 of Fig. 2 (Fig. 5 in Ref. [13]) with a Multivariate Normal distribution (a 2 , a 3 ) ∼ N ( µ, Σ): with covariance matrix Σ and expected value µ estimated from the MCMC posterior samples a 2 , a 3 of Fig. 2 (Fig. 5 in Ref. [13]): The goodness of this approximation is confirmed by the very low Kullback-Leibler divergence (≈ 4 × 10 −7 ) between the approximate multivariate Gaussian distribution and true posterior distribution of a 2 , a 3 . We use this multivariate normal distribution as the prior for Model 2, rather than the posterior samples, since it allows us to evaluate the double integral of the evidence p( y|M 2 ) analytically, as shown in Appendix B. Having specified all the tools needed to calculate the Bayes factor, we can use several MBHB standard siren catalogues to study how many observations are necessary to test the claimed deviation from ΛCDM.

B. Results
We aim to study the Bayes factor O 12 as we add more observations. However, the Bayes factor itself depends on the redshifts of the observations, and measurements with different redshifts can favor one model over the other. Therefore we decided to study the distribution of Bayes factors by generating 10 4 realizations of catalogues of N SS standard sirens.
For each value of N SS we obtain the distribution of O 12 as follows • We draw a set of N SS redshifts from the expected distribution of MBHB standard sirens and assign uncertainties σ Dj , as described in Sec. III.
• We scatter the luminosity distance measurements around the flat ΛCDM model with a Normal distribution D j ∼ N (D ΛCDM L , σ Dj ), as mentioned in Sec. III.
• We calculate the Bayes factor for the N SS simulated standard sirens y = (z j , D j , σ Dj ) using the likelihood in Eq. (8) and the priors of the two cosmological models with the aforementioned values of a, b, µ, Σ.
We then obtain 10 4 realizations of Bayes factors per each N SS , which gives us a distribution of Bayes factors of possible future observations of MBHBs standard sirens. We can define the α% Evidence Interval I = [O L , O R ] as the symmetric interval of Bayes factors which covers the population of Bayes factor with probability of α%: Note that an evidence interval is neither a confidence interval nor a credible interval, because we are not estimating any parameters, but it tells us how likely we are to observe a realization of the Universe which allows us to distinguish between the two models. The distribution of the Bayes factors for the aforementioned priors is shown in Fig. 8. As we add more observations, the distribution of Bayes factors initially widens, but the median, represented by the vertical line, shifts to higher values. The shift in the median is due to the data favoring the ΛCDM model (Model 1) with respect to the ALT best-fit model (Model 2) as we increase the number of observations. This is of course expected since the standard siren data are generated according to the ΛCDM model.
The widening of the distribution is a consequence of the larger space of realizations of Bayes factor per N SS . In fact, if we draw a large number of N SS we can obtain different redshift realizations that can favor more or less one model over the other one. Those catalogues with many high-redshift standard sirens easily discriminate between the two models and have high Bayes factors, whereas catalogues with many low-redshift standard sirens have lower Bayes factors.
For Bayes factors above ∼ 20 Model 1 (ΛCDM) is strongly favored with respect to Model 2, while for Bayes factors above ∼ 150 the data very strongly favor Model 1 with respect to Model 2 [82]. In order to assess how many standard sirens we need to distinguish between the two models, we plot in Fig. 9 the median, the 90% and 50% evidence interval of the Bayes factor distribution as a function of the number of standard sirens. In 50% of realizations of the Universe, the Bayes factor will lie above the red line, under the aforementioned assumptions. From Fig. 9 it is clear that already with four standard sirens we have a 50% chance that the Bayes factor would lie above the very strong evidence threshold, ruling out the deviation found in Ref. [13]. With 14 MBHB standard sirens we have a 95% chance that the two models will be distinguished by LISA.

V. DISCUSSION AND CONCLUSION
Constructing the Hubble diagram is a fundamental step to study the expansion history of the Universe, but it requires accurate and precise measurements of the luminosity distance. The extension of the Hubble diagram with quasar observations has not only opened up the possibility to explore higher redshift regions but it has also suggested a possible new tension with the current cosmological model, the ΛCDM model. In this work we addressed the crucial question of whether and when we will be able to rule out or confirm this tension. We have every reason to expect the quasar Hubble catalog to grow and refine in the intervening years before LISA launches. Various estimates suggest a boost in the size of the quasar catalog by a factor of 10 [15] or even 100 [83]. It is harder to forecast whether there will be a refinement in the intrinsic scatter of the quasar flux law. Hence, we have taken the quasar tension at face value in this work, as a target or reference against which to compare our forecasts of LISA standard siren measurements. Observations of GWs from MBHB mergers with associated EM counterparts will be critical to explore the high redshift regions of the Hubble diagram and, thus, to test the claimed deviation from ΛCDM. Median 90% evidence interval 50% evidence interval Very strong evidence for M 1 = ΛCDM Strong evidence for M 1 = ΛCDM Figure 9. Bayes factor evolution as a function of the number of standard sirens NSS. The red dots represent the median of the distribution of Bayes factor for given NSS, and the shaded regions represent respectively the 90% and 50% evidence interval. The horizontal dashed lines represent the threshold above which the ΛCDM model is favored with respect to the ALT best-fit model, which encodes the deviation obtained from quasar observations. As a reference, we report that the Bayes factor of the 15 representative SS dataset of Fig. 7 is 6387.
Assuming the validity of the ΛCDM model, we described how to generate catalogs of mock MBHB standard siren observations by LISA. We used a Bayesian approach to determine the relative favor of ΛCDM and the best fit model of Ref. [13], as determined by the data. This is expressed in terms of the Bayes factor, which can be calculated given priors of the model parameters, the likelihood function and the observed data. In the process, we reproduced some of the results of Ref. [13], in particular the posterior distribution of the phenomenological model, which we used as a basis for the prior beliefs on the deviation from ΛCDM. Given a fixed number of simulated standard sirens N SS we used 10 4 catalog realizations to study the behavior of the Bayes factor and we found a 50%, 75% and 95% chance to rule out the deviation found by Ref. [13] with N SS = 4, 7, 14, respectively. These numbers are within the range of the expected rates for LISA MBHB standard sirens [39], suggesting that LISA will indeed be able to discriminate the observed tension from the standard ΛCDM model. Furthermore, since the alternative hypothesis for the cosmic expansion has been formulated in a generic way, our analysis can also be used to test a wide variety of other cosmological models against the ΛCDM model with standard sirens. To facilitate this, our code for the Bayesian hypothesis testing analysis has been made available at [44].
We have also evaluated the constraints on the matter density and Hubble constant as a function of the number of MBHB standard sirens and we found that the Hubble parameter can be more precisely constrained than the matter density on average. The relative precision of the Hubble parameter ranges from 3 − 20% for 10 standard sirens up to 1 − 4% for 50. Whereas, the matter density can be constrained with relative precision ranging from 7 − 50% for 10 standard sirens up to 4 − 7% for 50. The combination of MBHB events with other standard sirens detectable by LISA, in particular stellar-origin black hole binaries and EMRIs, will allow to improve the constraints on H 0 found in this paper, perhaps to below-percent accuracy in optimistic scenarios. These results will help elucidate the current tension on the measured value of the Hubble constant, which may well not be solved by the time LISA will launch.
The great potential of MBHB mergers for cosmology lies however beyond the results one can obtain on H 0 . As we showed in this paper, by using these events as standard sirens LISA will probe the expansion history at z 2 with an accuracy possibly not attained by future EM observations, even as a new suite of large-scale structure surveys [7-10, 84] in the post-reionisation matter-dominated regime, 2 z 6, are poised to search for clues to the nature of dark energy and chart the cosmic acceleration. Quasars have recently been proposed anew as a standard candles in this redshift range [13]. Other methods may also open up this frontier in the coming years, such as line intensity mapping [85][86][87]. Reconciling the Hubble diagram among all these different methods will be a crucial activity in the future. LISA standard sirens offer a proven and complementary method to independently test the Hubble diagram at z 2. This synergy between GW and EM observations will provide the necessary confidence to reliably test the cosmic expansion at high redshift.

ACKNOWLEDGMENTS
We thank Guido Risaliti and Elisabeta Lusso for kindly providing the quasar data from Ref. [13]. We also thank Ollie Burke for his useful help and comments, Chiara Caprini for comments on the draft, and Ryan Hickox for discussions on quasar astrophysics.
Appendix A: Details of the MBHB standard siren catalogs In this appendix we provide the details of our approach to construct catalogs of MBHB standard sirens for LISA, as outlined in Sec. III A. We start by drawing plausible redshift values for MBHB mergers from an interpolated redshift distribution based on the data presented in Fig. 1 of [38]. These data are based on the LISA sensitivity curve adapted to the LISA design proposed in [37] and have been constructed using the procedure outlined in [39], where MBHB merger histories have been reproduced using semi-analytical merger-tree simulations and the emission and detection of GW and EM signals have been estimated with standard astrophysical and data analysis techniques. We choose to work with the "popIII" model, which produces average results with respect to the astrophysical populations considered in [38,39]. In any case we checked that our final results do not change appreciably if the other astrophysical populations are considered, as the redshift distributions are similar (cf. Fig. 1 of [38]). We moreover set to zero the probability of drawing MBHB redshift values at z < 0.1, in order to match our expectations of not finding MBHBs at very low redshift. We do not expect our results to be sensitive to this cut because the probability to have an event below z = 0.1 from our redshift distribution is particularly low, approximately 0.05%. To each redshift value in the catalog we then find a unique value of the luminosity distance D L by using the flat ΛCDM distance-redshift relation D ΛCDM L (z; H 0 , Ω m ) as given in Eq. 7. For ease of notation, we drop the label ΛCDM in this appendix. To estimate a realistic 1σ distance error to each MBHB merger event, we combine the following uncertainties: • A weak lensing contribution analytically estimated by the following fitting formula [39,78] (see also [88]) σ lens (z) D L (z) = 0.066 1 − (1 + z) −0.25 0.25 1.8 .

(A1)
We will consider the possibility of delensing, i.e. the use of dedicated matter surveys along the line of sight of the GW event in order to estimate the lensing magnification distribution and thus remove part of the uncertainty due to weak lensing. Following [89] we will realistically assume that up to a 30% delensing will be achievable at redshift 2 (note that in the most optimistic scenario this reduction could be as large as 50% [89]). Furthermore the delensing power should grow linearly from redshift zero, until it reaches a constant value at around z = 2 (see Fig. 4 of [89]). For these reasons we will employ a delensing factor provided by the following phenomenological formula where z * = 0.073 has been fixed requiring that F delens (z) be within 1% of 0.7 (the z → ∞ asymptotic value) at z = 2. The final lensing uncertainty on D L will thus be σ delens (z) = F delens (z)σ lens (z) . (A3) • A peculiar velocity uncertainty contribution as estimated by the following fitting formula [39,79] σ v (z) D L (z) where the r.m.s. peculiar velocity value is set to v 2 = 500 km/s, in agreement with average values observed in galaxy catalogs.
• A LISA instrument error estimated as σ LISA /D L ∝ 2/SNR [80]. If we further assume that, once marginalised over all other waveform parameters, the SNR scales as 1/D L , we can take the LISA measurement relative error on D L to be σ LISA /D L ∝ 2D L . To fix the constant of proportionality we rely on recent parameter estimation results [90], which yield a 1σ error on D L of roughly 5% at redshift 4, by employing a full Bayesian approach over a MBHB signal with SNR = 945. The LISA measurement error on D L that we use is thus be given by where 36.6 Gpc corresponds to z = 4 in our fiducial cosmology.
For the sake of simplicity we propagate this redshift uncertainty to the distance uncertainty by assuming our fiducial cosmology, similarly to [39]. We assume redshift measurements at z < 2 are taken spectroscopically and, for our purposes, with negligible errors (see e.g. [94,95]).
Finally, as mentioned in the main text, the total distance uncertainty σ D is then obtained by adding in quadrature all contributions listed above (see Fig. 3). Moreover the final value of the luminosity distance of each MBHB event is then randomly scattered around the value given by our fiducial cosmology, according to a Gaussian probability with standard deviation σ D . This procedure delivers simulated of LISA MBHB standard sirens, similar for example to the one exposed in Fig. 7.

Appendix B: Details of the Bayesian methods
The evidence is the integral of the product between the likelihood and the model prior. Here we show how it is possible to calculate explicitly the evidence for the ALT best-fit model (Model 2) with the likelihood and model prior given respectively in Eqs. (8) and (11).
Let us expand the argument of the exponential of Eq. (8): where we defined ξ = c ln 10/H 0 and the sum over j is the sum over the SS measurements expressed in terms of x j = log 10 (1 + z j ). We now define ψ j = D j /ξ − x j and we express the above sum as In order to obtain the evidence for the ALT model we need to multiply the likelihood by the model prior in Eq. (11).
So the evidence for the ALT model is where we introduced V = − ξ 2 2 B + Σ −1 µ in the last line. The above equation shows how it is possible to directly evaluate the evidence of the ALT best fit model using the observations y.