Calibrating gravitational-wave search algorithms with conformal prediction

In astronomy, we frequently face the decision problem: does this data contain a signal? Typically, a statistical approach is used, which requires a threshold. The choice of threshold presents a common challenge in settings where signals and noise must be delineated, but their distributions overlap. Gravitational-wave astronomy, which has gone from the first discovery to catalogues of hundreds of events in less than a decade, presents a fascinating case study. For signals from colliding compact objects, the field has evolved from a frequentist to a Bayesian methodology. However, the issue of choosing a threshold and validating noise contamination in a catalogue persists. Confusion and debate often arise due to the misapplication of statistical concepts, the complicated nature of the detection statistics, and the inclusion of astrophysical background models. We introduce Conformal Prediction (CP), a framework developed in Machine Learning to provide distribution-free uncertainty quantification to point predictors. We show that CP can be viewed as an extension of the traditional statistical frameworks whereby thresholds are calibrated such that the uncertainty intervals are statistically rigorous and the error rate can be validated. Moreover, we discuss how CP offers a framework to optimally build a meta-pipeline combining the outputs from multiple independent searches. We introduce CP with a toy cosmic-ray detector, which captures the salient features of most astrophysical search problems and allows us to demonstrate the features of CP in a simple context. We then apply the approach to a recent gravitational-wave Mock Data Challenge using multiple search algorithms for compact binary coalescence signals in interferometric gravitational-wave data. Finally, we conclude with a discussion on the future potential of the method for gravitational-wave astronomy.


I. INTRODUCTION
The burgeoning field of gravitational-wave astronomy is in a state of rapid evolution. Second-generation detectors [1][2][3] have progressed from the first observation of a binary black hole merger [4] to the compilation of extensive transient event catalogues [5][6][7], which also include binary neutron star and black-hole neutron-star mergers. With this progress, the methodologies for evaluating the statistical significance of Compact Binary Coalescence (CBC) signals have undergone notable transformations. While the significance of the initial detection [4] was assessed through the frequentist False Alarm Rate (FAR), contemporary catalogues [5][6][7] now use probabilistic Bayesian methods.
However, astrophysicists aiming to learn from gravitational-wave data are confronted with a challenge: the difficulty in identifying signals when their distribution and the noise distributions overlap. This issue is by no means unique in astronomy (see, e.g., Feigelson and Babu [8]). However, gravitational-wave astronomy is an especially intriguing case study because the Signal-to-Noise Ratio (SNR) of sources is low, but the potential scientific reward is high. Moreover, many of the insights derive from studying the population of identified sources [9]. The events producing signals within current sensitivities are isotropically distributed, so the number of detections scales with the cube of the horizon distance (a measure of the detector sensitivity). Therefore, there are always more events just beyond the horizon than within: increasing the horizon distance by just 25% will roughly double the number of events. The conundrum facing anyone wishing to utilise the hundreds of sources now reported is how to select a threshold to cut between the signals and the noise. On the one hand, we can choose a conservative threshold, ensuring a high catalogue purity (the fraction of true signals). However, the conservative threshold also entails a loss of accuracy; after all, we must discard many low-significance astrophysical signals buried in the noise. On the other hand, choosing a liberal threshold would include a larger number of astrophysical signals but at the cost of bias induced by non-astrophysical catalogue contamination.
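As a quick check of the 25% figure, assuming a uniform source density in Euclidean volume (an idealisation that ignores cosmological effects) and writing $d_{\rm h}$ for the horizon distance (a symbol introduced here only for illustration):

```latex
N \propto d_{\rm h}^{3}
\quad\Longrightarrow\quad
\frac{N(1.25\, d_{\rm h})}{N(d_{\rm h})} = 1.25^{3} \approx 1.95 .
```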
There are efforts underway to address these issues. For example, population-level analyses can utilise hierarchical models to assess mixed catalogues of signals and noise, avoiding the contamination problem altogether [26][27][28][29], and recent efforts are also underway to produce a unified significance estimate [30]. Nevertheless, the problem of choosing thresholds will continue to be of interest as mixed methods are in their infancy, and some of the most interesting events will inevitably come from close to the detection horizon: the question of "does this data contain an astrophysical signal?" will inevitably persist.
This work will introduce a new and transformative framework to solve this problem using Conformal Prediction (CP) [31]. CP is an approach to uncertainty quantification developed within the context of Machine Learning (ML). CP takes an existing point-prediction algorithm and a calibration data set (consisting of correctly labelled data) and generalises the underlying algorithm's point-prediction to a prediction set with a guaranteed validity (where valid means that the true label is guaranteed to belong to the set with a predefined confidence). Its appeal arises from its universal applicability, guarantees, and single assumption: exchangeability of the data. Moreover, the prediction guarantees are distribution-free: there is no asymptotic assumption or underlying model. It can be used for classification and regression or, correspondingly, search/detection and inference/measurement in the language of gravitational-wave astronomy [32]. This work will explore the classification (or search/detection) problem. We will demonstrate how CP can be applied to calibrate pipelines without requiring knowledge of their internal behaviour. Moreover, we will discuss how CP offers an alternative approach to developing a meta-pipeline: taking the inputs from multiple search algorithms and providing a single statement which optimally combines their outputs and is well calibrated.
As we will show, CP is simple to implement, easily tested, has minimal assumptions, and no required astrophysical model. For these reasons, we anticipate that CP will be of general interest to the field. While we will discuss CP exclusively in the context of searching for CBC signals, we anticipate it will find utility for searches for other sources of gravitational-wave radiation and beyond.
The remainder of this article is structured as follows. In Section II, we introduce the existing traditional approaches for significance estimation within gravitational-wave astronomy and further motivate this work by considering their real-world performance. We provide a lay guide to CP in Section III. We apply it in Section IV to a toy cosmic-ray detector problem to demonstrate the basic algorithm and extensions in the noise-dominated regime. Moreover, we also use our toy problem to explain some of the subtleties of CP. In Section VI, we then go on to apply CP to the recent Mock Data Challenge of LIGO-Virgo data [33]. Finally, we end with a discussion on the advantages, difficulties, and future prospects of CP for gravitational-wave astronomy in Section VII.

II. METHODOLOGY: QUANTIFYING SIGNIFICANCE WITH TRADITIONAL APPROACHES
To begin our discussion, we first review the data, search algorithms, detection statistics, and two dominant quantities used to assess candidate significance: FAR and $p_{\rm astro}$. Gravitational-wave strain data comprises quasi-stationary coloured-Gaussian background noise, astrophysical signals, and a variety of non-astrophysical transient noise sources termed glitches [34,35]. Absent glitches, the optimal detection statistic is the coloured Gaussian noise matched-filter SNR. When the signal source properties (e.g., the mass of the system) are unknown (as is typical), a bank of templates is searched, often in combination with techniques to maximise or marginalise over subsets of the full parameter space (see, e.g., Sathyaprakash and Dhurandhar [36]). However, in the presence of glitches, the optimal statistic is unknown. To guide the reader on how the leading searches remain sensitive to astrophysical signals despite frequent glitches, we now describe in broad terms a typical search algorithm or pipeline: the interested reader may wish to review Abbott et al. [35] for a deeper discussion.
The central tools used by most pipelines to distinguish between signals and glitches are the coincidence between detectors and signal consistency checks such as the $\chi^2$ detection statistic [18], which discriminates cases where the data is likely to contain a glitch by analysing the way power is distributed in the broadband signal. Typically, the $\chi^2$ and matched-filter SNR are combined to produce a combined ranking statistic which we label $\rho$. Additional terms may also be included in the combined ranking statistics, such as weights based on whether the region of parameter space is expected to contain more astrophysical signals and amplitude-phase-time consistency checks between detectors. The combined ranking statistic can be tuned to maximise the separation of signals from noise (as verified by simulations). Since the combined ranking statistic is ad-hoc, its background distribution (where the background is taken to mean in the absence of any astrophysical signal) is inherently unknown and must be empirically estimated from the data. However, gravitational-wave detectors cannot be shielded from astrophysical signals. Therefore, pipelines use approaches such as time-sliding between separate independent detectors to destroy correlations between astrophysical signals (see, e.g., [37,38]), resulting in empirical measurement of the background. We denote such a background as the set $\{\rho\} = \{\rho_0, \rho_1, \ldots, \rho_{n-1}\}$ of $n$ values measured on the background.
Once the background has been estimated for a new candidate event with ranking statistic $\rho'$, the pipeline estimates its significance by calculating the FAR. Informally, the FAR is the amount of background data one must observe to see a ranking statistic as large as $\rho'$. Such a dimensionful approach results in an intuitive understanding of the significance given knowledge of the amount of data searched. For example, for a search of one month of data, an event with a FAR of 1 per millennium is a clear detection, while a FAR of 1 per day is more likely to be noise. More precisely, the FAR is calculated empirically from the fraction of background events with a ranking statistic of $\rho'$ or greater, divided by the segment duration used in the search. One sees then that the FAR is the one-sided right-tail empirical p-value divided by the segment duration:

$$\mathrm{FAR}(\rho') = \frac{P(\rho \ge \rho' \mid H_0)}{T} \approx \frac{1}{T}\,\frac{\left|\{\rho_i \in \{\rho\} : \rho_i \ge \rho'\}\right|}{n}, \tag{1}$$

where $H_0$ is the null hypothesis, $T$ is the segment duration, we apply set-builder notation, and define the set size by $|\cdot|$.
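The empirical FAR described above can be sketched in a few lines. This is a minimal illustration, not any pipeline's implementation: the function name `empirical_far`, the background values, and the 30-day segment are all hypothetical, and real pipelines treat details such as zero louder background events more carefully.

```python
def empirical_far(background, rho_new, segment_duration):
    """Right-tail empirical p-value divided by the segment duration.

    background       : ranking statistics measured on the background
    rho_new          : ranking statistic of the candidate
    segment_duration : duration of the searched segment (any time unit)
    """
    background = list(background)
    n = len(background)
    if n == 0:
        raise ValueError("need at least one background sample")
    # Count background events at least as loud as the candidate.
    n_louder = sum(1 for rho in background if rho >= rho_new)
    p_value = n_louder / n
    return p_value / segment_duration


# Hypothetical background ranking statistics and a 30-day segment.
background = [5.1, 6.3, 7.2, 8.0, 9.4, 6.8, 7.7, 5.9, 8.8, 10.2]
far = empirical_far(background, rho_new=8.5, segment_duration=30.0)
print(f"FAR = {far:.3g} per day")
```

The returned value carries the inverse of whatever time unit is used for `segment_duration`.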
The FAR of the first detection was reported in the paper abstract: "less than 1 event per 203 000 years" [4]. However, once a population of signals was established, it became preferable to move to a probabilistic approach instead. Following Farr et al. [39], the foreground and background distributions are modelled by a Poisson mixture model with prior choices informed by the pipeline outputs and previously observed signals. From this, each pipeline produces a new significance estimate, $p_{\rm astro}$: the probability that the signal is astrophysical [40][41][42][43]. Moreover, the modelled approach allows further sub-classification as $p_{\rm astro} = p_{\rm BNS} + p_{\rm NSBH} + p_{\rm BBH}$ (and a complementary probability of terrestrial origin). With this new approach, the first Gravitational Wave Transient Catalogue (GWTC-1) [44] defined "GW" events as those with a FAR less than 1 per 30 days and a $p_{\rm astro}$ greater than 1/2. This latter definition has become a de facto standard. For example, a $p_{\rm astro}$ greater than 1/2 is the threshold used to identify events for further follow-up in several recent catalogues [5][6][7]. Yet, it demonstrates that even with a probabilistic interpretation of the nature of a candidate, researchers still like to establish a threshold and draw a clear delineation, and it is quite common to see astrophysics research take the provided thresholds at face value.
The final complicating piece of this picture is that multiple pipelines analyse the same data. Our typical pipeline above described the core features, but each employs a unique arsenal of techniques built over many years by many people. The result is that for any given candidate, we end up with multiple estimates of its significance: a FAR and $p_{\rm astro}$ per pipeline. The pipelines broadly agree for unambiguous signals and noise events where apples-to-apples comparisons can be made. However, it is in the grey middle ground where things become complicated. To demonstrate this, we use data from the recent GWTC-3 catalogue [5], which reported on data from the second part of the third LIGO-Virgo observing run. We use the associated data release, which includes triggers where at least one pipeline had a FAR of less than 2 per day: as such, we expect this to include both the astrophysical signals and a great number of non-astrophysical candidates.
FIG. 1. Comparison of the probability of astrophysical origin estimated by pairs of pipelines for all candidates reported in GWTC-3 (including sub-threshold candidates). While clear signal (top right) and clear noise (bottom left) cases usually agree, a significant off-diagonal scatter remains between these points.
In Fig. 1, we scatter-plot the $p_{\rm astro}$ of each trigger for pairs of CBC search pipelines used in GWTC-3 (we exclude the Coherent WaveBurst pipeline that applies an unmodelled search approach). At the two ends of the diagonal, two dense regions correspond to the clear signal (top right) and clear noise (bottom left) cases where pipelines agree. However, scattered through the plane are confusion cases where one pipeline finds $p_{\rm astro} > 0.5$, indicating the data contained an astrophysical source, while the other pipeline is more pessimistic ($p_{\rm astro} < 0.5$). If we are lucky enough to know experts from both pipelines, we can understand the cause of the discrepancy. Sometimes, it is well understood that different choices lead to different sensitivities in different parts of the parameter space. If the more sensitive pipeline found the event while the other did not, this explains the difference, and we may gain confidence that this is an astrophysical signal. Other times, the differences are more contentious or yet to be understood; this should be expected, as these are complicated multi-stage pipelines with differing and often implicit assumptions. Nevertheless, it leaves the uninformed with the previously described choice-of-threshold conundrum exacerbated by the need to learn the detailed inner workings of the pipeline to understand the results. One standard solution is to take the maximum $p_{\rm astro}$, implicitly trusting that the only explanation is variations in sensitivity. However, another explanation is random uncertainty in significance or even that one pipeline is performing poorly.
FIG. 2. Comparison of the FAR estimated by pairs of pipelines demonstrating the intrinsic scatter for all candidates reported in GWTC-3 (including sub-threshold candidates). While clear signal (bottom left) and clear noise (top right) cases usually agree, a significant off-diagonal scatter remains between these points.
One may imagine that the inclusion of different astrophysical foreground prior models in the Bayesian analysis may explain the scatter in Fig. 1 between pipelines; however, Fig. 2 demonstrates that the scatter is also inherent in the underlying and simpler FAR. Finally, in Fig. 3, we plot each pipeline's FAR against $p_{\rm astro}$. Here, we see the approximate sigmoid relationship with significant scatter.
The GWTC-3 results demonstrate the inherent difficulty facing anyone wishing to select a set of events for further analysis. However, these results are only part of the picture. They present only the pipelines used by the LVK collaborations. There are external groups that produce independent catalogues where the same conclusions hold up: scatter between significance estimates. Moreover, pipelines are not static: they are constantly developed, improved and re-configured. It is well known that the same pipeline with a different configuration can produce a different significance estimate (usually for reasons well understood by the pipeline experts). Therefore, even choosing a single pipeline can effectively represent a different pipeline per observing run (or period in which the methodology and configuration are static). Finally, using $p_{\rm astro}$ as a threshold also utilises information from estimates of the population properties. Since we are constantly learning new information and improving estimates, this can lead to the re-ranking of past data, resulting in the possibility of reclassifying old candidates.
One naive way of describing the situation is that significance estimates (i.e. the FAR or $p_{\rm astro}$) do not come with an associated uncertainty (from, e.g., intrinsic configuration choices, population choices, or data choices). The oft-used approach to resolve this is to take the scatter from multiple pipelines as a proxy indication of the uncertainty. This has primarily been the community approach: confidence in the first detection from a new source class is validated by the involvement of multiple pipelines. However, this is not satisfactory and discards inherent information about pipeline sensitivity. In the remainder of this article, we will introduce a formal alternative based on CP. Our fundamental interest is to develop a tool that takes the FAR or $p_{\rm astro}$ as a heuristic and calibrates it, enabling standardisation between pipelines and proper uncertainty estimates for whether a candidate is of astrophysical origin.

III. METHODOLOGY: QUANTIFYING SIGNIFICANCE WITH CONFORMAL PREDICTION
We now introduce the CP methodology. We intend to give the reader a guide to the application without delving into the foundational theory, which can be found in reviews such as Angelopoulos and Bates [45] and Shafer and Vovk [46].
To begin, it should be understood that CP was developed in the Machine Learning classification algorithm context. Specifically, it can be applied to any classification algorithm, i.e. given some observed data $x$, an algorithm that produces a single predicted label $y^{(\ell)}$ drawn from a set of $N$ possible labels $\{y^{(0)}, y^{(1)}, \ldots, y^{(N-1)}\}$. CP calibrates the classification algorithm by producing a prediction set $\Gamma_\alpha$, where $\alpha \in [0, 1]$ is the allowed error rate, also known as the significance level. The steps to generate the prediction set are as follows:
1. Definitions: Define a non-conformity measure $A(x, y^{(\ell)})$, which returns a non-conformity score $s$ for each label in the complete set. The requirements for the non-conformity score are loose; it must simply be a real-valued number. However, for the algorithm to be useful, the score should be large when $y^{(\ell)}$ is not the correct label (i.e. it measures how unusual the labelling would be).
2. Calibration: Now define the calibration data: $n$ pairs of $(x, y^{(\hat{\ell})})$ where $x$ is the observed data and $y^{(\hat{\ell})}$ is the true label (indicated by the hat on the index). In our context, calibration data will always be drawn from simulations. Now, for each element of the calibration data, calculate the equivalent score for the true label and store this in a set of calibration scores

$$s_i = A(x_i, y_i^{(\hat{\ell})}),$$

where the lower subscript $i$ is added to indicate the $i$th element of the calibration data.
3. Quantile: The final step before generating the prediction set is to define the allowed error rate $\alpha \in [0, 1]$. Then, given a set of calibration scores, we calculate

$$q = s_{(\lceil (n+1)(1-\alpha) \rceil)}, \tag{2}$$

where $\lceil \cdot \rceil$ is the ceiling function, and we indicate by the use of $s_{(j)}$ the $j$th value of the ordered set of $s_i$. As described in Angelopoulos and Bates [45], $q$ is essentially the $1 - \alpha$ quantile of the calibration scores with a small correction.
4. Prediction: Finally, given a new observed data point $x'$, we generate the prediction set:

$$\Gamma_\alpha(x') = \left\{ y^{(\ell)} : A(x', y^{(\ell)}) \le q \right\};$$

that is, for each label $y^{(\ell)}$, we first calculate the corresponding score $A(x', y^{(\ell)})$, then if the score does not exceed $q$ we include the label in $\Gamma_\alpha$, the set of predicted labels.
CP guarantees that the probability that the true label is contained in $\Gamma_\alpha$ is approximately $1 - \alpha$; this is known as marginal coverage. More concretely, it can be shown [45] that

$$1 - \alpha \le P\left(y^{(\hat{\ell})} \in \Gamma_\alpha(x')\right) \le 1 - \alpha + \frac{1}{n + 1},$$

such that if $n$, the number of calibration data points, is sufficiently large, we recover the standard approximate result of $1 - \alpha$.
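The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names, the distance-based score, the two labels and the calibration values are all hypothetical, labels are kept when their score does not exceed the quantile (the convention of Angelopoulos and Bates [45]), and returning infinity when the required rank exceeds the number of calibration points is a choice made here.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Return the k-th smallest score, k = ceil((n + 1) * (1 - alpha))."""
    ordered = sorted(calibration_scores)
    n = len(ordered)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every label.
        return math.inf
    return ordered[max(k, 1) - 1]  # 1-based rank to 0-based index


def prediction_set(score_fn, x_new, labels, q):
    """Keep every label whose non-conformity score does not exceed q."""
    return {label for label in labels if score_fn(x_new, label) <= q}


# Hypothetical non-conformity measure: distance from a per-label centre.
CENTRES = {"noise": 0.0, "signal": 3.0}


def score_fn(x, label):
    return abs(x - CENTRES[label])


# Step 2: calibration pairs (observed value, true label), made up here.
calibration = [
    (0.1, "noise"), (-0.4, "noise"), (0.7, "noise"), (2.8, "signal"),
    (3.5, "signal"), (1.9, "signal"), (-0.2, "noise"), (3.1, "signal"),
    (0.3, "noise"),
]
scores = [score_fn(x, true_label) for x, true_label in calibration]

# Step 3: quantile for an allowed error rate of 0.2.
q = conformal_quantile(scores, alpha=0.2)

# Step 4: prediction sets for new data points.
for x_new in (0.5, 1.5, 2.4):
    print(x_new, prediction_set(score_fn, x_new, CENTRES, q))
```

Note that the point midway between the two centres receives the empty set: neither label conforms at this error rate.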
Is this useful? Practitioners in the field will no doubt know that there is a well-built-up statistical literature on decision theory behind the FAR and $p_{\rm astro}$ introduced in Section II (and we will explore this in detail in our toy model; cf. Section IV). However, as discussed, pipelines can be miscalibrated and disagree with one another. The core motivation behind studying CP is that we can treat the statistical quantities arising from pipelines as heuristics and use the calibration data set to adjust them, ensuring robust performance. As we will see later in Section VI C, this calibration process can, in fact, be viewed as a generalisation of the empirical measurement of the FAR itself.
It is worthwhile to consider how CP quantifies uncertainty in the label. As scientists, we are used to talking about uncertainty on a measurement, e.g. a real-valued number accompanied by an uncertainty interval. CP can also tackle this problem (the realm of parameter estimation or regression), but in our current context, we don't have a real-valued number; instead, we have a label. For example, should we classify this chunk of data as containing a "signal" or just "noise"? CP provides uncertainty on the point prediction made by an underlying classifier by introducing the prediction set $\Gamma_\alpha$. Inspecting Eq. (3), one can see that for binary classification of signal or noise, the four possible prediction sets are the empty set, $\emptyset$, one of two singleton sets {noise} and {signal}, or the double-label {noise, signal}. As an anthropomorphic explanation, when asked "does this data contain a signal or noise?" the CP algorithm can respond "Neither", "Noise", "Signal", or "Either noise or signal".
Varying the error rate for a fixed test data point will vary the size of the prediction set. In the extremes, with $\alpha$ close to zero or one, the CP algorithm will be forced to respond with the double label or empty set (in the case of binary classification). Between the extremes, the performance will depend on the problem setup and choice of non-conformity score (we will demonstrate this later). This observation leads to the identification of what is known as the CP confidence [46], which we discuss later in Section IV C.

IV. CONFORMAL PREDICTION FOR A TOY COSMIC RAY DETECTOR
We now provide a guide to CP in the context of classification and a simple astrophysics problem: a cosmic-ray detector.We will describe the problem and implementation qualitatively here, but the reader may wish to refer to the data release associated with this article, which contains program code to reproduce all parts of this section [47].

A. Problem setup
Consider a toy cosmic-ray detector consisting of a Geiger counter, which records the number of incidents of ionising radiation it receives per minute while pointing to the sky (this example is not intended to be realistic but indicative of typical astronomy problems). Absent a cosmic ray, the detector will be subject to background radiation from terrestrial sources, which we model as Poisson distributed with a mean of $\lambda_b$ counts per minute. The detector will observe a cosmic ray as a transient burst of $N_c$ ionising particles in some time $\delta t$, which, for the sake of this discussion, we take to be $\delta t \ll 1$ minute. As such, we can identify and localise a cosmic ray in the data by searching for minute-long bins where the count rate exceeds the background. The excess amount will depend on $N_c$, which we will model again as Poisson distributed with mean $\lambda_c$. Finally, we will also model the number of cosmic rays as Poisson distributed with some rate $\lambda_r$ per minute. In Fig. 4, we provide an illustrative example of data from our toy detector showing minute-long bins with background, clear cosmic ray events (far above the background) and marginal cases in-between.
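A simulation of this toy detector might look like the sketch below. It is an illustration only and is not taken from the data release [47]: the function names, the parameter values (100 background counts per minute, 20 counts per cosmic ray, 0.1 cosmic rays per minute) and the seed are assumptions, and the Poisson sampler uses Knuth's multiplication method, which is adequate only for modest means.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_bins(n_bins, lam_b, lam_c, lam_r, seed=0):
    """Simulate minute-long bins; return (counts, true labels)."""
    rng = random.Random(seed)
    counts, labels = [], []
    for _ in range(n_bins):
        n_rays = sample_poisson(rng, lam_r)   # cosmic rays in this bin
        count = sample_poisson(rng, lam_b)    # terrestrial background
        for _ in range(n_rays):
            count += sample_poisson(rng, lam_c)  # burst from each ray
        counts.append(count)
        labels.append("cosmic-ray" if n_rays > 0 else "background")
    return counts, labels


# Hypothetical parameter values, chosen only for illustration.
counts, labels = simulate_bins(n_bins=200, lam_b=100.0, lam_c=20.0, lam_r=0.1)
print(sum(label == "cosmic-ray" for label in labels), "bins contain a cosmic ray")
```

For a small rate of cosmic rays per minute, bins containing more than one cosmic ray are rare, so the counts in a cosmic-ray bin are close to Poisson distributed with the sum of the background and burst means.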
The standard statistical search algorithm used in cases such as this to identify if a bin contains a cosmic ray event is the frequentist one-sided p-value or, equivalently, the FAR. Namely, for an observed count $c'$ and given the background rate $\lambda_b$,

$$p(c') = P(c \ge c' \mid \lambda_b) = \sum_{c = c'}^{\infty} \frac{\lambda_b^{c}\, e^{-\lambda_b}}{c!}, \qquad \mathrm{FAR}(c') = \frac{p(c')}{T},$$

where $T$ is the bin duration of 1 minute. Note: for this toy model, we know the FAR in closed form; this differs from the empirical FAR, Eq. (1), we use in gravitational-wave astronomy. Finally, our search algorithm proceeds by applying a threshold to the p-value or FAR: bins whose counts lie above the corresponding count threshold likely contain a cosmic ray, while those below do not. In Fig. 4, we apply a p-value threshold of 1/20 or, equivalently, a FAR of 1 per 20 minutes. At this threshold, we can identify four categories: several actual signals are identified (true positives: TP), but four background events above the threshold are identified as cosmic rays (false positives: FP). Meanwhile, several cosmic rays are missed and classified as background (false negatives: FN), but most background events are correctly classified as background (true negatives: TN). The non-zero counts of FP and FN are not a deficiency of the algorithm but rather inherent: with the true labels coloured in Fig. 4, it is obvious which bins contain a cosmic ray and which do not, but our search algorithm has only the count rate, leading inevitably to errors in classification.
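The toy p-value and FAR can be evaluated directly from the Poisson distribution. The sketch below assumes a right-tail p-value (the probability of a count at least as large as the observed one under background only); the function names and the background mean of 100 counts per minute are hypothetical.

```python
import math


def poisson_pmf(count, mean):
    """Poisson probability mass, evaluated in log space for stability."""
    if count < 0:
        return 0.0
    return math.exp(count * math.log(mean) - mean - math.lgamma(count + 1))


def poisson_p_value(count, mean):
    """Right-tail p-value: probability of observing `count` or more."""
    total, c = 0.0, max(count, 0)
    while True:
        term = poisson_pmf(c, mean)
        total += term
        # Stop once past the mode and the terms no longer contribute.
        if c > mean and term <= 1e-16 * total:
            break
        c += 1
    return min(total, 1.0)


def toy_far(count, lam_b, bin_duration=1.0):
    """FAR in units of 1 / bin_duration (per minute for one-minute bins)."""
    return poisson_p_value(count, lam_b) / bin_duration


lam_b = 100.0  # hypothetical background mean (counts per minute)
# Smallest count whose p-value is at or below the 1/20 threshold.
threshold_count = next(
    c for c in range(1000) if poisson_p_value(c, lam_b) <= 1 / 20
)
print("bins with at least", threshold_count, "counts pass the threshold")
```

Summing the upper tail directly, rather than subtracting the lower tail from one, avoids losing precision for the very small p-values of loud events.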
Of course, this is a well-studied problem of statistical decision theory (see, e.g., Cowan [48]). In Fig. 5 and Fig. 6, we reproduce two standard figures of merit which demonstrate this behaviour. First, the ROC curve shows the true positive rate against the false positive rate. The ROC curve is generated by varying the FAR threshold, repeatedly simulating our cosmic-ray detector, and empirically measuring the two rates. The curve demonstrates the trade-off between true positives and false positives possible with our given search algorithm: points closer to the ideal case (top-left corner) are better in maximising the true positive rate while minimising the false positive rate. Second, in Fig. 6, we show an alternative visualisation of the same data: the precision and miss-rate. Considering the case of a catalogue of gravitational-wave signals, these are of more direct relevance. The precision tells of the purity of the catalogue. If the precision is sufficiently close to 1, one can be reasonably assured the catalogue is pure and does not contain any potentially biasing terrestrial artefacts. However, such a guarantee comes at a cost: the miss-rate tends to 1 in the same limit, indicating the catalogue size will shrink.

B. Conformal Prediction
At this point, we now step beyond the confines of classical statistical decision theory and introduce the application of conformal prediction (CP). In this context, the cosmic-ray detector search algorithm described above can be considered a classification algorithm that produces a label $y \in$ {background, cosmic-ray} (where by "cosmic-ray" we implicitly mean there is both a cosmic ray and background).
We apply the CP approach defined in Section III to our cosmic-ray detector problem. We generate a large set of calibration data points consisting of simulated data and the true classification (i.e. whether a cosmic ray was present or not). Next, we define our non-conformity score. We choose to use the complement of the Poisson probability mass function (noting that for the background + cosmic-ray case, the sum of two Poisson distributed variables is itself Poisson distributed with a rate equal to the sum of the rates), i.e.,

$$A(c, \text{background}) = 1 - \mathrm{Poisson}(c; \lambda_b), \tag{6}$$

$$A(c, \text{cosmic-ray}) = 1 - \mathrm{Poisson}(c; \lambda_b + \lambda_c), \tag{7}$$

where $\mathrm{Poisson}(c; \lambda) = \lambda^{c} e^{-\lambda} / c!$ is the Poisson probability mass function. In Fig. 7, we visualise our non-conformity scores, showing that close to the mean, the non-conformity is at a minimum for each class, while away from these, they are close to unity. We note that the absolute magnitude of the variation in non-conformity measure is not important: what matters is the relative quantile at which they appear when ranked by the conformal algorithm. In this sense, the relative magnitude between classes is important (though this will not be the case when we consider the class-conditional Mondrian conformal prediction later on).
FIG. 7. Visualisation of the non-conformity scores expressed in Eq. (6) and Eq. (7).
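The complement-of-the-Poisson-mass score can be written down directly. In the sketch below the function names are hypothetical, and the means of 100 background counts and 20 cosmic-ray counts are assumptions, chosen only because they place the crossing of the two scores near the count of 110 quoted later in the text.

```python
import math


def poisson_pmf(count, mean):
    """Poisson probability mass, evaluated in log space for stability."""
    if count < 0:
        return 0.0
    return math.exp(count * math.log(mean) - mean - math.lgamma(count + 1))


def nonconformity(count, label, lam_b, lam_c):
    """One minus the Poisson mass of `count` under the given label."""
    if label == "background":
        mean = lam_b
    elif label == "cosmic-ray":
        mean = lam_b + lam_c  # sum of two Poisson variables is Poisson
    else:
        raise ValueError(f"unknown label: {label!r}")
    return 1.0 - poisson_pmf(count, mean)


lam_b, lam_c = 100.0, 20.0  # assumed values, for illustration only
for count in (90, 100, 110, 120, 130):
    s_bkg = nonconformity(count, "background", lam_b, lam_c)
    s_ray = nonconformity(count, "cosmic-ray", lam_b, lam_c)
    print(f"count={count}: background={s_bkg:.4f}, cosmic-ray={s_ray:.4f}")
```

Each score is smallest near the mean of its own class and approaches unity far from it, which is the behaviour described for Fig. 7.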
Once our non-conformity score is defined, we can apply the conformal algorithm to new test data given some choice of $\alpha$. For each data point, the output of the algorithm will be the prediction set $\Gamma_\alpha$. In our binary case, $\Gamma_\alpha$ can be the empty set, $\emptyset$, one of two singleton sets {background} and {cosmic-ray}, or the double-label {background, cosmic-ray}.
The marginal coverage guarantee, Eq. (3), states that, if implemented correctly, the correct label will be in $\Gamma_\alpha$ a fraction $\sim 1 - \alpha$ of the time. To check this, in Fig. 8, we plot the empirically measured coverage after applying the conformal algorithm to a large simulated cosmic-ray data set. The marginal coverage (the fraction of times the true label appears in the prediction set) follows the one-to-one mapping guaranteed by Eq. (3), demonstrating proper algorithm implementation. There is some variation when $1 - \alpha$ is close to zero as the set sizes become small; moreover, the step-like nature of the empirical coverage arises from the discrete nature of the Poisson data in our toy model. Fig. 8 also provides an insight into the limitation of the simple CP algorithm: the coverage guarantee applies only to the marginal, not the conditional labels. As a result, the conditional labels may be over- or under-covered (i.e., exceed the allowed error rate). We see this manifest in Fig. 8 for the cosmic-ray label, which strays away from the diagonal. This is problematic: in gravitational-wave astronomy, we are not interested in ensuring that the label is correct as averaged over both the signal and noise labels. We want the validity guarantee (i.e. Eq. (3)) to apply to conditional labels. To achieve the guarantee for all labels individually, we can use Mondrian Conformal Prediction (MCP) [49], where the data is split by class, and then the conformal prediction algorithm is applied to each group separately. Using this technique, both the conditional labels are guaranteed to follow Eq. (3) and, by extension, the marginal labels do too.
FIG. 8. The empirically measured coverage (the fraction of events for which the true label is in the prediction set) for the cosmic-ray test data set after applying CP. A grey band marks the 95% binomial confidence interval expected given the size of the test data; we see variations around this due to the discrete nature of the underlying data.
The cost of MCP is that the number of calibration data points entering Eq. (3) is no longer the total number but the number per label. Therefore, the intrinsic error on rare classes consistently exceeds that on more common labels by design. We apply the simple class-wise algorithm where the possible labels define the groups [49]. However, more advanced approaches are possible: see Ding et al. [50] for a formal introduction to the topic and discussion of a clustered algorithm capable of extending to many sets.
To apply MCP, we split our calibration data set into simulated data points containing a cosmic ray and those that do not. Then, we apply CP to each label and the corresponding calibration set separately for the test data. For this reason, unlike the standard CP algorithm, the relative values between non-conformity measures do not matter in MCP. In Fig. 9, we reproduce Fig. 8 but having applied MCP. Now, Eq. (3) is valid for both the marginal and class-conditional labels.
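The class-wise split can be sketched as below: one quantile is computed per label from that label's calibration scores only, and each label is then tested against its own quantile. This is a minimal illustration under stated assumptions: the function names, the distance-based score and the small, deliberately imbalanced calibration set are all hypothetical.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Return the k-th smallest score, k = ceil((n + 1) * (1 - alpha))."""
    ordered = sorted(calibration_scores)
    n = len(ordered)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few points for this alpha: keep the label
    return ordered[max(k, 1) - 1]


def mondrian_quantiles(calibration, score_fn, alpha):
    """One quantile per label, using only that label's calibration data."""
    by_label = {}
    for x, true_label in calibration:
        by_label.setdefault(true_label, []).append(score_fn(x, true_label))
    return {label: conformal_quantile(s, alpha) for label, s in by_label.items()}


def mondrian_prediction_set(x_new, quantiles, score_fn):
    """Test each label against its own class-conditional quantile."""
    return {label for label, q in quantiles.items() if score_fn(x_new, label) <= q}


# Hypothetical score: distance of the count from a per-label centre.
CENTRES = {"background": 100, "cosmic-ray": 120}


def score_fn(count, label):
    return abs(count - CENTRES[label])


# Imbalanced, made-up calibration counts: 9 background, 4 cosmic-ray.
calibration = [(c, "background") for c in (98, 101, 95, 104, 100, 107, 99, 93, 102)]
calibration += [(c, "cosmic-ray") for c in (118, 125, 121, 112)]

quantiles = mondrian_quantiles(calibration, score_fn, alpha=0.2)
for count in (106, 110, 113):
    print(count, mondrian_prediction_set(count, quantiles, score_fn))
```

Because each label is compared only with its own calibration scores, rescaling the score of one class leaves the prediction sets unchanged, at the price of a rank computed from fewer calibration points for the rarer class.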

C. Confidence
There is a defined quantity within the CP framework known as the confidence [46]. This arises from noting that the $\Gamma_\alpha$ prediction sets are nested, such that if $\alpha_1 \ge \alpha_2$, then $\Gamma_{\alpha_1} \subseteq \Gamma_{\alpha_2}$. Since the size of $\Gamma_\alpha$ is a discrete quantity, it varies in steps, and these change points can be used to assign significance statements. This observation leads us to the standard definition of confidence:
Definition 1 The confidence is the value of $\alpha$ such that the size of $\Gamma_\alpha$ changes from 1 to 2 (i.e. the point where we go from the single to the double label).
FIG. 10. The illustrative example of data from Fig. 4, but with the labels as predicted by the MCP algorithm and using $\alpha = 0.1$ (i.e. at a 90% coverage guarantee).
Necessarily, each data point has a unique confidence assigned to whichever label is the single-label given the data.
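The nesting and the change points can be made concrete with conformal p-values. The sketch below assumes the common equivalent formulation in which a label is in the prediction set when its conformal p-value exceeds α; the function names and the example p-values are hypothetical.

```python
import numpy as np

def conformal_p_value(cal_scores, test_score):
    """Fraction of calibration scores (counting the test point) at least as large."""
    cal_scores = np.asarray(cal_scores)
    return (np.sum(cal_scores >= test_score) + 1) / (len(cal_scores) + 1)

def prediction_set(p_values, alpha):
    """Labels whose p-value exceeds alpha; the set can only shrink as alpha grows."""
    return {label for label, p in p_values.items() if p > alpha}

def change_points(p_values):
    """Values of alpha at which the set size steps 2 -> 1 and 1 -> 0 (binary case)."""
    lo, hi = sorted(p_values.values())
    return lo, hi

# Made-up p-values for a single data point
p_values = {"cosmic-ray": 0.62, "background": 0.03}
for alpha in (0.01, 0.1, 0.7):
    print(alpha, prediction_set(p_values, alpha))
print("change points:", change_points(p_values))
```

In this binary example the step between the double and the single label sits at the smaller of the two p-values, which is the change point that Definition 1 refers to.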
In Fig. 11, we take our demonstration cosmic-ray data and add the confidence, assigning [0, 1] as the confidence for data points with single-label prediction "cosmic-ray" and flipping the confidence to [−1, 0] for data points with single-label prediction "background" (this is non-standard, but in the binary case allows us to plot the confidence on a single diverging colour scale). From this figure, we observe a sharp divide near the boundary between the non-conformity scores of the two labels (cf. Fig. 7). This notion of confidence does have uses; for example, it automatically produces a potential decision algorithm for calling something a signal: select only those data points for which the single label is "cosmic-ray". However, it is limited in that it does not allow one to talk about the confidence that an arbitrary data point contains a signal because, for those with a single-label "background", the confidence is the background confidence.
To further understand the confidence, we note that, in this toy example, it is a function only of the observed count rate. Therefore, as in Fig. 12, we can plot the confidence as a function of the count rate to see the mapping. In this figure, we see that at a count rate of 110 (the point where the non-conformity scores of background and cosmic-ray labels are equal, cf. Fig. 7), the confidence flips between the cosmic-ray and background single label. There is a minimum, and on either side, the confidence monotonically increases for either label. This motivates us to consider an alternative definition, the conditional confidence:
Definition 2 The conditional confidence in label y is the minimum value of α such that y ∈ Γ α .
We add this to Fig. 12 for both the cosmic-ray and background labels, demonstrating that it can be calculated for any data point. Comparing Fig. 12 and Fig. 7, it is apparent that in this example, the conditional confidence is the scaled complement of the non-conformity score. In a sense, this may seem circular. However, it is worth noting that the conditional confidence depends on the distribution of non-conformity scores in the calibration set and not solely on the non-conformity score itself. Intuitively, the conditional confidence in label y can be understood as the probability (interpreted as a relative frequency) that the true label is y as measured from the calibration data set. We believe conditional confidence is useful in providing an intuitive guide to understanding the significance associated with each label for a given data point. To conclude, we finally apply the conditional confidence to our demonstration data in Fig. 13 which, contrasted with Fig. 11, demonstrates a smoother variation in assigned confidence and the ability to assign confidence in the cosmic-ray label to all data points.

FIG. 12. The mapping from counts to confidence (cf. Definition 1). In blue, we show the confidence of counts where the single-label prediction is "cosmic-ray", in orange cases where the single-label prediction is "background". We also plot the mapping to the conditional confidence (cf. Definition 2) for the cosmic-ray (green) and background (red) labels.

FIG. 13. The illustrative example of data from Fig. 4 coloured by the conditional confidence (i.e. the minimum value of α such that the conditional label is included in the set, cf. Definition 2) for the cosmic-ray label.

D. Measuring performance by set size
Fig. 9 may give the impression that we achieved perfect performance at no cost: the calibrated CP label sets contain the true labels a fraction 1 − α of the time despite us never testing the performance of the non-conformity scores. However, we did not consider the set size, i.e. how many data points are given the singleton labels "cosmic-ray" or "background", the double label, or no label at all. Indeed, the set size is critical to practical utility and is where we should measure the performance of our non-conformity scores.
In Fig. 14, we plot the set size for all four possible prediction sets as a function of 1 − α. In doing so, we show the performance: the ability to identify cosmic-ray and background events uniquely varies as a function of the allowed error rate. At the lower extreme, we have the limiting behaviour of the algorithm. Namely, for 1 − α ∼ 0 (the maximum allowed error rate), all data points are in the empty set while the size of the singleton and double labels is close to zero. For 1 − α ≲ 0.6, the set size of the singleton labels grows linearly while the size of the empty set decreases. Above 1 − α ∼ 0.6, the set size of the singletons and empty set decrease while the set size of the double label rapidly increases. Fig. 14 explains why there is no free lunch with CP. While we can choose 1 − α arbitrarily close to one (i.e. minimise the allowed error rate), this comes at the cost of increasing the size of the double label. That is, the cost is a majority of triggers for which the algorithm is essentially uninformative. Here, there is a parallel with Fig. 6 in which we saw that choosing a conservative threshold increased the precision at the cost of increasing the miss rate. Such behaviour is unavoidable, but by measuring the set size, one can compare and optimise choices of non-conformity score.
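The set-size accounting can be sketched as below. The toy rates and the scores (counts for "background", minus counts for "cosmic-ray") are assumptions for illustration and are not the paper's Eqs.; labels are included when their class-wise conformal p-value exceeds α.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, p_cr=0.3, bg_rate=100.0, cr_rate=20.0):
    """Toy detector with assumed rates; returns counts and a boolean cosmic-ray flag."""
    is_cr = rng.random(n) < p_cr
    counts = rng.poisson(bg_rate, n) + is_cr * rng.poisson(cr_rate, n)
    return counts, is_cr

def class_p_values(cal_scores, test_scores):
    """Conformal p-value of each test score against one class's calibration scores."""
    cal_sorted = np.sort(cal_scores)
    # Number of calibration scores >= each test score
    n_ge = len(cal_sorted) - np.searchsorted(cal_sorted, test_scores, side="left")
    return (n_ge + 1) / (len(cal_sorted) + 1)

def set_size_fractions(p_bg, p_cr, alpha):
    """Fraction of test points falling into each of the four possible prediction sets."""
    in_bg, in_cr = p_bg > alpha, p_cr > alpha
    return {
        "empty": float(np.mean(~in_bg & ~in_cr)),
        "background": float(np.mean(in_bg & ~in_cr)),
        "cosmic-ray": float(np.mean(~in_bg & in_cr)),
        "double": float(np.mean(in_bg & in_cr)),
    }

cal_counts, cal_is_cr = simulate(5000)
test_counts, _ = simulate(1000)
p_bg = class_p_values(cal_counts[~cal_is_cr], test_counts)   # score: counts
p_cr = class_p_values(-cal_counts[cal_is_cr], -test_counts)  # score: -counts

for one_minus_alpha in (0.5, 0.9, 0.99):
    print(one_minus_alpha, set_size_fractions(p_bg, p_cr, 1 - one_minus_alpha))
```

Scanning 1 − α over a grid and plotting the four fractions gives a curve of the same kind as Fig. 14, with the exact shape depending on the assumed rates.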

E. Performance of a poor non-conformity score
Finally, it is helpful to take an illustrative example of what happens when the non-conformity score performs poorly. To demonstrate this, we take our cosmic-ray detector example and consider an alternative choice of non-conformity score: while the background score stays the same, we replace the cosmic-ray score with a uniform random number generator. We show the results of applying this to our demonstration data in Fig. 15. At first, it may appear to still perform reasonably well: most of the cosmic-ray events are labelled as cosmic-ray. However, on closer inspection, we see that almost all the noise events are given the double label, multiple prominent cosmic rays have no label assigned, and background data points are labelled as cosmic rays. This choice of the non-conformity score is extreme but yields insights into what to expect if a poor choice is made for the non-conformity score. We can further study the behaviour by looking at the set sizes as a function of 1 − α; this is done in Fig. 16 and shows that at 1 − α = 0.5, labels are randomly assigned between the four choices, while at either extreme either no label or the double label is assigned.
The set size is one way to measure the performance of a non-conformity score. For example, comparing Fig. 14 and Fig. 16 we see that around 1 − α ∼ 0.7, the standard non-conformity score produces more single labels than either the double or background. Meanwhile, this is never true for the alternative non-conformity scores (i.e., Eq. (8) and Eq. (9), which are intentionally broken), demonstrating that the informative non-conformity measure outperforms the alternative. The choice of non-conformity score can therefore be viewed as an optimisation problem. However, the choice of objective function is itself subjective and will depend on the use case. For example, one option is to choose a non-conformity score that minimises the number of double labels, aiming to increase the algorithm's capacity to unambiguously label the data. However, such a choice may come at the cost of increasing the empty label set. Alternatively, one may choose to maximise the TPR (or minimise the FPR) at some fixed α. Extending this idea, the non-conformity score itself can be parameterised, enabling direct optimisation (see, e.g. Colombo [51]). Regardless of the methodology, the choice of objective function for the optimisation will always be subjective and the best choice will depend on the overarching use case. For gravitational-wave astronomy, we anticipate some combination of maximising the number of single labels while minimising the number of false positives, but we intend to explore this in future work.
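As one possible objective, the fraction of single-label predictions at a fixed α can be compared between an informative score and an uninformative one. The sketch below is an assumption-laden illustration: the toy rates, the informative score (minus counts), and the helper names are hypothetical, and the uninformative cosmic-ray score is a uniform random number as in the example above.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n, p_cr=0.3, bg_rate=100.0, cr_rate=20.0):
    """Toy detector with assumed rates; returns counts and a boolean cosmic-ray flag."""
    is_cr = rng.random(n) < p_cr
    counts = rng.poisson(bg_rate, n) + is_cr * rng.poisson(cr_rate, n)
    return counts, is_cr

def class_p_values(cal_scores, test_scores):
    """Conformal p-value of each test score against one class's calibration scores."""
    cal_sorted = np.sort(cal_scores)
    n_ge = len(cal_sorted) - np.searchsorted(cal_sorted, test_scores, side="left")
    return (n_ge + 1) / (len(cal_sorted) + 1)

def singleton_fraction(cr_score, alpha=0.3, n_cal=5000, n_test=1000):
    """Fraction of test points given exactly one label, for a given cosmic-ray score."""
    cal_counts, cal_is_cr = simulate(n_cal)
    test_counts, _ = simulate(n_test)
    p_bg = class_p_values(cal_counts[~cal_is_cr], test_counts)  # background score: counts
    p_cr = class_p_values(cr_score(cal_counts[cal_is_cr]), cr_score(test_counts))
    in_bg, in_cr = p_bg > alpha, p_cr > alpha
    return float(np.mean(in_bg ^ in_cr))  # XOR: exactly one label in the set

informative = singleton_fraction(lambda c: -c.astype(float))
uninformative = singleton_fraction(lambda c: rng.random(len(c)))
print(f"single-label fraction: informative={informative:.2f}, random={uninformative:.2f}")
```

Other objectives mentioned in the text, such as the TPR at fixed α, could be substituted for the returned quantity without changing the structure.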

V. CONCLUSION: TOY MODEL
In this section, we have used a simplistic toy model to introduce CP. In the main, we use this as a tool to understand CP and not as a demonstration of the application of CP to realistic astrophysical problems. We recognise that there are steps that do not carry over to realistic problems, e.g. here, we know the statistical properties of the signal and noise distributions perfectly and can use these to construct a non-conformity score. Nevertheless, we hope it may prove useful as a starting point for others to apply CP using the accompanying notebook [47].

VI. CONFORMAL PREDICTION FOR GRAVITATIONAL-WAVE ASTRONOMY
Having introduced CP for a simple toy model, we now extend the discussion to gravitational-wave astronomy. We will focus on the use case of modelled transient searches for CBC signals. However, the discussion applies generally since the standard statistical framework is applied across the field.
Our primary task is to define the non-conformity measure A(x, y). Considering the binary classification problem, signal or noise, two obvious initial choices exist: using the FAR or the Bayesian p astro quantities. For source classification, e.g. binary black hole, neutron star black hole, binary neutron star, or terrestrial, one could use the multi-class CP algorithm and the Bayesian probabilities provided by the pipeline for each source class. Therefore, these choices are readily applied to the outputs of existing pipelines, which is what we choose to do in this work.

FIG. 16. The set sizes of the four possible prediction sets after applying MCP to 1000 test points with the non-conformity scores given in Eq. (8) and Eq. (9).
However, CP offers scope for further development. For example, the FAR used by pipelines uses a ranking statistic combining the matched-filter SNR and χ² statistic amongst other quantities. Such a combined ranking statistic can itself be used as a non-conformity score: in effect, the "calibration" data set of CP is then analogous to the background data used in a traditional search pipeline. Building on this idea, if the combination is parameterised, one could optimise the ranking statistic (non-conformity score) to minimise the counts of the empty-set or multi-label prediction sets on some test data. Such an idea builds on a similar application by McIsaac and Harry [52], which seeks to maximise the separation of signals and noise. Many more such innovations are likely possible.

A. Using Conformal prediction to calibrate multiple competing pipelines
To demonstrate the application of CP to gravitational-wave astronomy, we will use the results of a recent Mock Data Challenge (MDC) study in advance of the LVK fourth observing run [33]. In this MDC, four low-latency CBC online search algorithms were applied to a real-time data replay from the third observing run. Simulated signals were added to the data at a rate much greater than the anticipated astrophysical rate under current detector sensitivities. This higher rate was used to stress-test the low-latency infrastructure: the primary goal of the MDC was to measure expected performance in producing public alerts used to trigger event follow-up. Taking the MDC data, we adjust classifications for all real gravitational-wave detector events present in the MDC, but do note there are potential sub-threshold signals that remain. We also remove all early warning triggers from the MDC and use the corrected p astro values from Ray et al. [43].
The MDC data products provide a perfect test bed for CP. The increased rate produces a sizeable set of simulated triggers, i.e. points in the data stream that the search pipelines identify as likely to contain a signal. Most recorded triggers in the MDC are simulated signals (this differs from the astrophysical scenario where, at a high FAR threshold, most triggers will be non-astrophysical noise). Moreover, the configuration of the pipelines was in development during the MDC, leading to imperfect performance. For these reasons, the performance of the pipelines is not representative of the tuned performance expected during the run. This point is discussed within Chaudhary et al. [33] specifically for the case of PyCBC: "The FAR values for injections recovered during the MDC are subject to a substantial upward bias due to the high rate of high-SNR injected events, which significantly influences the background estimation." As a result, in the context of candidate significance estimation, we can consider the MDC data as the application of poorly calibrated pipelines to a given data set. It, therefore, is a good test bed to show how CP can automatically calibrate the pipelines. However, we stress that the following discussion should not be taken as indicative of the performance of the pipelines, only as an example where they are known to be ill-tuned.
Let us begin by studying the performance of the pipelines using traditional significance estimation approaches. We start by thinking about the catalogue of events that would be produced at a given threshold. In Fig. 17, we plot the purity of the resulting catalogue as a function of p astro ; we present results separated by pipeline. We calculate the purity as the fraction of triggers with p astro greater than the threshold that pertain to an injected signal. We plot the actual purity (the true number of simulated signals in the trigger set) and the estimated purity: the sum of the p astro for all triggers above the threshold. The sum of p astro to estimate the number of astrophysical signals is commonly used in the context of a catalogue of triggers (see, e.g. Abbott et al. [5]). It formally amounts to the posterior-estimated number of foreground events in the Farr et al. [39] framework. Fig. 17 shows varying behaviour by pipeline, with all pipelines under-estimating the actual purity by varying amounts. (We note that, due to the presence of potential sub-threshold real signals in the MDC data, the "Actual" estimate here is potentially biased; however, given the expected purity of sub-threshold candidates in GWTC-3 [5], the level of bias is at most a few percent).
FIG. 17. The estimated and actual purity for the MDC results as a function of the p astro threshold split by pipeline. Estimated purity refers to the sum of p astro above the threshold, while actual purity refers to the count of triggers pertaining to signals above the threshold. We use purity here as it is the common language of the field. However, we note that it is identical to the coverage defined in the field of CP.
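The two purity estimates compared in Fig. 17 can be sketched as follows. Dividing both the sum of p astro and the injection count by the number of selected triggers, so that each is expressed as a fraction, is an assumption of this sketch, as are the function name and the made-up input values.

```python
import numpy as np

def purity_curves(p_astro, is_injection, thresholds):
    """Estimated and actual purity of the catalogue selected by p_astro > threshold."""
    p_astro = np.asarray(p_astro, dtype=float)
    is_injection = np.asarray(is_injection, dtype=bool)
    estimated, actual = [], []
    for t in thresholds:
        selected = p_astro > t
        n = selected.sum()
        if n == 0:
            # No triggers pass the threshold: purity is undefined
            estimated.append(np.nan)
            actual.append(np.nan)
            continue
        estimated.append(p_astro[selected].sum() / n)  # expected signal fraction
        actual.append(is_injection[selected].mean())   # true signal fraction
    return np.array(estimated), np.array(actual)

# Made-up triggers: three injections and one noise trigger
p_astro = [0.99, 0.9, 0.6, 0.2]
is_injection = [True, True, True, False]
est, act = purity_curves(p_astro, is_injection, thresholds=[0.0, 0.5])
print("estimated:", est, "actual:", act)
```

In this made-up example the estimated purity falls below the actual purity at both thresholds, which is the qualitative pattern described for the MDC pipelines.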
By comparison, the advantage of CP is that α, the allowed error rate of the algorithm, maps directly onto the actual purity of the resulting catalogue.
To demonstrate CP in practice, for the set of candidates from each pipeline, we evenly split the MDC data results into a calibration and test set. We then apply MCP using the FAR as the non-conformity score for 'signal' and the inverse False Alarm Rate (iFAR) as the non-conformity score for 'noise'. This way, we use the pipeline outputs directly without adding additional information. We then apply MCP to each trigger in the test data set, using the calibration data for producing a prediction set. Note that the computational effort required for this step is negligible (a few CPU seconds on any modern computer).
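The procedure above can be sketched on mock triggers. The FAR distributions below are invented for illustration and do not represent any pipeline or the MDC data; the helper names `calibrate` and `predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented mock triggers: signals tend to have a smaller FAR (per year) than noise
n_sig, n_noise = 4000, 1000
far = np.concatenate([10 ** rng.normal(-3, 2, n_sig), 10 ** rng.normal(1, 1, n_noise)])
is_signal = np.concatenate([np.ones(n_sig, dtype=bool), np.zeros(n_noise, dtype=bool)])

# Even random split into calibration and test sets
perm = rng.permutation(len(far))
cal, test = perm[: len(far) // 2], perm[len(far) // 2 :]

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

def calibrate(far_cal, is_signal_cal, alpha):
    """Class-wise thresholds: FAR scores 'signal', iFAR = 1/FAR scores 'noise'."""
    q_s = conformal_quantile(far_cal[is_signal_cal], alpha)
    q_n = conformal_quantile(1.0 / far_cal[~is_signal_cal], alpha)
    return q_s, q_n

def predict(far_test, q_s, q_n):
    """Boolean arrays: is 'signal' in the set, is 'noise' in the set (ties included)."""
    return far_test <= q_s, (1.0 / far_test) <= q_n

alpha = 0.1
q_s, q_n = calibrate(far[cal], is_signal[cal], alpha)
has_signal, has_noise = predict(far[test], q_s, q_n)

# Class-conditional coverage on the test half
cov_signal = has_signal[is_signal[test]].mean()
cov_noise = has_noise[~is_signal[test]].mean()
print(f"signal coverage: {cov_signal:.3f}, noise coverage: {cov_noise:.3f}")
```

Both coverages should sit near 1 − α up to finite-sample scatter, with larger scatter for the rarer class.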
In Fig. 18, we plot the label coverage for each pipeline, demonstrating it satisfies Eq. (3), i.e. for all α, the fraction of test triggers which contain a simulated signal has a one-to-one correspondence with 1 − α. Moreover, we note that all pipelines satisfy this: irrespective of their underlying performance, once calibrated by CP the coverage guarantee is ensured. We now note that what is known in the field of CP as coverage is equivalent to the catalogue purity. As such, Fig. 18 and Fig. 17 can be contrasted to show how calibrating with CP regularises the meaning of the threshold between pipelines. The implication is that once calibrated by CP, the catalogue produced at a fixed α threshold contains an a priori known contamination rate: α. Therefore, downstream analysis can decide the contamination rate they are willing to accept and then use that to set the threshold for inclusion.

FIG. 18. The marginal and conditional coverage for all MDC results after applying MCP, demonstrating they satisfy the validity guarantee. Recall that the marginal coverage is averaged over all labels, while conditional coverage is as applied to a single label at a time. A grey band marks the 95% binomial confidence interval expected, given the size of the entire test data for each pipeline. Note that for the conditional labels, the size of the effective test data set is smaller, and therefore, the anticipated Poisson counting error can be larger, as is the case of the GstLAL conditional noise label.

B. Understanding individual events: confidence
In the last sub-section, we saw how a catalogue could be created by applying MCP to calibrate the significance estimates.Such an application guarantees the purity of the resulting catalogue.It is, therefore, directly applicable to the case of population analyses, where one often needs to control the purity over a set of triggers.However, this leaves the question of assessing individual events and deciding if they are astrophysical, which we now discuss.
In the traditional framework, candidate significance is assessed by combining the FAR, p astro , their constituent elements (e.g. the χ² statistic), and a deep knowledge of the performance of the pipeline. For example, the first direct observation of gravitational waves from GW150914 [4] reported a FAR of 1 event per 203,000 years (and gave an equivalent > 5σ estimate). Once a source class is established, p astro is generally the preferred mechanism to identify new events (for example, independent reanalyses use this criterion; see Venumadhav et al. [53]). However, for newly detected source classes, because p astro requires an astrophysical model of the rates, which is generally poorly constrained, it is common to revert to a more detailed study of the FAR (see, e.g. the discovery of the first neutron star black hole mergers [54]).
In the CP framework, we can use the confidence to assess candidate significance.As discussed in Section IV C, one can compute either the standard definition of confidence, Definition 1, or the conditional confidence, Definition 2. We now consider how these definitions of the confidence can be applied to CBC signals using the MDC for illustration.
In the left-hand panel of Fig. 19, we plot the standard confidence (Definition 1) for all triggers in the MDC against their iFAR. We find a one-to-one mapping, which is expected since we use the FAR as the non-conformity score for the signal label. The standard confidence that the data contains a signal can only be computed when the single label prediction is for a signal (see Section IV C). Therefore, we find there is a minimum iFAR below which the standard confidence that the data contains a signal cannot be computed. Instead, we can compute the confidence that the data is noise (since, in this binary case, that is now the single-label prediction). We illustrate this by adding the noise confidence as a dashed line.
In the middle panel, we go on to show the mapping between the conditional confidence in the signal label, Definition 2, against the iFAR.Unlike the standard confidence, the conditional confidence can be computed for all values.
For both the standard and conditional confidence, we note that they behave broadly as we expect: the confidence increases monotonically with the iFAR.However, it is notable that the mapping is at odds with the expectation of seasoned analysts in this field: namely, we find that even at a FAR of 1 per 1000 years, the confidence of some pipelines is barely above 0.5.For comparison, in Fig. 3, at a FAR of 1 per 1000 years, all pipelines report a p astro close to unity.
Moreover, the confidence is pipeline-dependent, with substantial disagreements between pipelines. This occurs due to our choice of non-conformity score: we use the FAR. The non-conformity score ranks how signal-like the data is compared to the most significant signal in the data: smaller FARs are more signal-like. As a result, pipelines that have a long tail in the iFAR for signals will consequently produce less confidence at the same iFAR relative to pipelines with shorter tails. (It should be remembered, however, at this point that CP is distribution-free in the sense that the distributions are never explicit but learned via the calibration data set). There is nothing inherently wrong here, but we do acknowledge that what is known as confidence in CP does not reflect what a gravitational-wave analyst might understand the term to mean.
If we would like the confidence to better reflect our understanding, we can either look at the choice of non-conformity score, or the definition of the confidence. An obvious alternative choice for the non-conformity score is p astro : however, since this is closely related to the FAR (cf. Fig. 3), we encounter similar issues. Meanwhile, it is worthwhile reflecting on why the seasoned analysts' intuition suggests that a signal with an iFAR of 1000 years should confidently be called a signal. This is because, if the pipeline is well calibrated (which we anticipate to be the case most often), then the iFAR intrinsically suggests the data is not consistent with the background. With this in mind, we introduce another definition of confidence, the not-noise confidence:
The not-noise confidence is the minimum 1 − α such that the noise label is not included in Γ α .
Applying this definition in the right-hand panel of Fig. 19, we recover a mapping much more in line with expectation: we see a rapid increase in the not-noise confidence, and for values above 1 year, the confidence is close to unity. This demonstrates the power of CP: it should be remembered that the underlying algorithm is distribution-free; it has learned this intuitive threshold directly from the calibration data. Moreover, if the underlying algorithm itself were not well calibrated, the confidence still would be (this would manifest as a significant departure from the four calibrated pipelines in the right-hand panel of Fig. 19).
The three definitions of confidence presented in Fig. 19 all offer different ways to assess the confidence we may have in an individual event. However, we believe that further work needs to be done to identify which of these (or perhaps an alternative definition) is best suited to providing a summary of the significance of an individual event. Moreover, careful future study will need to be made of how these interact with the choice of non-conformity score. We also suggest that alternative choices of non-conformity score be explored to see if these can better represent our understanding.
Finally, if the CP calibration has succeeded, we should expect it to regularise pipeline behaviour, i.e. we would expect that the same event found by different pipelines would have a similar confidence. We would not expect it to give the same confidence to a given event since pipeline performance differs. To investigate this, in Fig. 20, we plot histograms of the normalised difference between the not-noise confidence for all pairs of pipelines. We also show the difference between p astro for the same pairs. Notably, while the p astro difference has a bimodal structure, with frequent cases in which the pipelines completely disagree about a candidate, the confidence difference peaks at zero, albeit with a spread up to the extremes. This demonstrates that the confidence measured by CP regularises behaviour between pipelines by learning from the calibration data set.

FIG. 19. For each definition, we plot the confidence against the iFAR for all triggers (separated by pipeline) in the MDC. For the standard definition, Definition 1, the confidence that the data contains a signal can only be calculated when the single label prediction is for a signal (see Section IV C); we mark these points by a solid line in the left-hand panel. Meanwhile, for values of the iFAR where the single label prediction is for noise, we use a dashed line. We, therefore, see a turnover in the left-most panel, a minimum iFAR below which we cannot assign any confidence that the data contains a signal. We sort triggers by iFAR to produce a continuous line showing the learned mapping. In all cases, we truncate the figure at an iFAR of 10^4 years for visualisation purposes: the mapping extends up to the maximum iFAR in the data set and monotonically approaches unity in that limit. In the right-hand panel, we add an inset showing the behaviour as each curve approaches unity.

FIG. 20. Histogram of the normalised difference (i.e. the difference divided by the sum) in the not-noise confidence and p astro for all pairs of pipelines in the MDC. Note: we filter to only cases where both pipelines identify the signal (defined as finding a trigger within a 0.1 s window) and take the closest match in trigger time. We also filter cases where p astro is not predicted by one or both pipelines.
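The matching and normalisation behind Fig. 20 can be sketched as below. Interpreting the 0.1 s window as a maximum absolute time difference is an assumption, the function names are hypothetical, and the trigger times and values are made up.

```python
import numpy as np

def match_triggers(times_a, times_b, window=0.1):
    """Pair each trigger in A with the closest trigger in B, if within `window` seconds.

    `times_b` must be sorted and non-empty. Returns index arrays (idx_a, idx_b).
    """
    times_a = np.asarray(times_a, dtype=float)
    times_b = np.asarray(times_b, dtype=float)
    right = np.searchsorted(times_b, times_a)
    left = np.clip(right - 1, 0, len(times_b) - 1)
    right = np.clip(right, 0, len(times_b) - 1)
    # Pick whichever neighbour in B is closer in time
    use_left = np.abs(times_b[left] - times_a) <= np.abs(times_b[right] - times_a)
    nearest = np.where(use_left, left, right)
    ok = np.abs(times_b[nearest] - times_a) <= window
    return np.flatnonzero(ok), nearest[ok]

def normalised_difference(x_a, x_b):
    """Difference divided by the sum; pairs with a missing (NaN) value are dropped."""
    x_a, x_b = np.asarray(x_a, dtype=float), np.asarray(x_b, dtype=float)
    keep = ~(np.isnan(x_a) | np.isnan(x_b))
    return (x_a[keep] - x_b[keep]) / (x_a[keep] + x_b[keep])

# Made-up trigger times (seconds) and p_astro values for two pipelines
times_a, p_a = np.array([10.00, 20.00, 30.00]), np.array([0.9, 0.5, 0.8])
times_b, p_b = np.array([10.03, 25.00, 30.20]), np.array([0.3, 0.7, 0.6])
idx_a, idx_b = match_triggers(times_a, times_b)
print(normalised_difference(p_a[idx_a], p_b[idx_b]))
```

Only the first pair of triggers falls within the window here, so a single normalised difference is returned.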

C. Conformal Prediction as a generalisation of the traditional framework
To conclude our discussion, we finally discuss how the CP and traditional FAR thresholds are related. In the traditional framework, to determine if the data contains a signal, we calculate the FAR (cf. Eq. (1)) and then apply a threshold: FAR′. If the FAR is below the threshold, we reject the null hypothesis and determine it is likely a signal. We can, therefore, formulate this in the language of CP by saying that the prediction set of the traditional framework is
$$\{\text{"signal"} : \mathrm{FAR} < \mathrm{FAR}'\}. \qquad (10)$$
Formally, this is incorrect as it falls into the "inverse fallacy" in that by rejecting the null hypothesis, we assume the data contains a signal. However, in practice, it is very often done. Meanwhile, in MCP, if the "signal" non-conformity measure is given by the FAR while the "noise" non-conformity measure is given by the iFAR, the prediction set is given by
$$\{\text{"signal"} : \mathrm{FAR} < q_s,\ \text{"noise"} : \mathrm{iFAR} < q_n\}, \qquad (11)$$
where q_s and q_n are (effectively) the 1 − α quantile FAR and iFAR of the calibration data set (cf. Section III). Comparing Eq. (10) and Eq. (11), we now see the following three connections between the two methods in the binary classification case where the FAR (or equivalently the p-value) is used as the non-conformity score. We use these to explain the differences and advantages of CP.
First, in the traditional framework, the threshold for determining if the data contains a signal is chosen by hand. In contrast, in the CP framework, the threshold is automatically decided by the algorithm and calibration data set (i.e. q is determined by the user choice of α). Of course, if the FAR is already well-calibrated, the CP framework offers no advantage in this respect. However, if that is not the case, CP calibrates the pipeline automatically.
Second, CP extends the labelling: while in the traditional framework, one either learns the data is a signal or not, for CP, the prediction set can be used to assess significance.I.e. at a fixed choice of α, the set may contain both signal and noise: this provides the user with a means to understand the inherent uncertainty, and a choice of definition can be applied to calculate a confidence in a given label.
Finally, we see that in CP, one does not fall foul of the inverse fallacy: the signal label arises naturally from the definition of the non-conformity score without assuming it is the negation of the noise label.
Taken together, we therefore argue that CP can be viewed as an extension of the traditional statistical framework.

VII. DISCUSSION AND APPLICATIONS OF CONFORMAL PREDICTION
Conformal prediction offers a generalisation of the traditional framework for significance quantification in gravitational-wave astronomy. In this work, we aim to introduce and explore CP in the context of CBC searches: we do not seek to demonstrate real application yet and envision this for future work.
We now outline three ways where CP may enhance existing efforts.
First, to add the conditional confidence as a calibrated alternative to the p astro and FAR in assessing the significance of single events.A motivating question we posed in the introduction is how to answer questions such as "does this data contain an astrophysical signal?".The traditional framework answers this by comparing the FAR to a threshold or with the astrophysical probability p astro .In contrast, CP offers the confidence: the key difference between these concepts is that the confidence does not rely on an explicit astrophysical model like the p astro and is learned from the performance of the pipeline on calibration data.As shown in Fig. 20, this moderates the differences between pipelines, leading to a more stable estimate of the significance.
Second, as a means to automatically set thresholds which guarantee the purity of a catalogue.With CP we can circumvent the problem of determining a threshold on the significance by instead only requiring the user to specify the error rate.Specifically, given the appropriate tools, a user could set an error rate of 1% and then take all events where the "signal" label is in the prediction set and be assured by Eq. (3) that at least 99% of the catalogue are astrophysical signals (within the bounds of the exchangeability assumption).As shown in Fig. 18, this guarantees the user that the catalogue contains a fixed contamination fraction.
Finally, CP offers a framework to develop a post-processing search pipeline combining the outputs from multiple search pipelines. Specifically, in future work, we will develop a parameterised non-conformity score combining the outputs from multiple pipelines into a single meta-pipeline. This has the advantage that the between-pipeline behaviour can be regularised using the test and calibration data and we can optimise the score leveraging parameter-space dependent pipeline performance.
For any of these applications to be successful, the critical missing ingredient is a large-scale MDC, which accurately captures the actual pipeline performance on realistic data. The MDC used in this work used an unrealistically high event rate and, therefore, is inappropriate for application to astrophysical signals. Indeed, this underlines the primary limiting factor of CP: the assumption of exchangeability between the calibration and test data. Ensuring this in practice will not be easy. Unlike many ML use cases, we must simulate the calibration data set for gravitational-wave applications since we do not have a ready training data set. In the simulation, assumptions must be introduced, e.g., about the waveform models and the rate: assessing and validating these will be critical. Moreover, using data from past observing runs breaks exchangeability as the detector sensitivity changes dramatically (and, since it also changes during an observing run, this is a concern even within a single run). In conformal prediction, such non-exchangeability cases are known as distribution drift and can be accounted for by applying weighted conformal procedures [45]. Nevertheless, we expect this to be a challenge for any successful application.
We acknowledge that the direction of CP is in many respects orthogonal to the overall direction of the field where the p astro approach has become dominant.However, we believe that in some cases, end users of the data products do not sufficiently understand the assumptions and caveats of the many p astro methodologies to interpret them fully.While p astro offers a valuable and powerful approach, CP offers an alternative in which the end user can, given existing open access to the data and software, calibrate the pipeline themselves, allowing CP to learn the uncertainty inherent in the underlying method.Moreover, we want to emphasise that, for either the p astro or FAR (or equivalently p-value approach), if the underlying assumptions are met, CP cannot improve on them.I.e., CP does not offer a mechanism to improve the sensitivity of well-calibrated searches.However, it does enable calibration without requiring an understanding of the internal models or making asymptotic assumptions.
Finally, in this work, we have discussed the potential application to CBC searches. However, CP may also find utility in other areas of the field, such as the low-latency alert products attached to open public alerts, the search for continuous gravitational waves from rapidly rotating neutron stars, or the search for bursts of GWs from unknown sources.

VIII. DATA AVAILABILITY STATEMENT
The source program behind Section IV is openly available on Zenodo [47].

FIG. 3. Comparison by pipeline between p_astro and FAR for all candidates reported in GWTC-3 (including sub-threshold candidates).

FIG. 4. An illustrative example of data from our toy cosmic-ray detector. Each data point records the number of counts within a minute-long interval or bin. Thick circles mark bins containing a cosmic ray. Data points are filled according to the prediction of the FAR detection approach: blue circles correspond to data points which surpass the threshold and, hence, where we reject the null hypothesis. In contrast, orange circles indicate those that are consistent with background noise.

FIG. 9. The empirically measured coverage (the fraction of events for which the true label is in the prediction set) for the cosmic-ray test data set after applying MCP. A grey band marks the 95% binomial confidence interval expected given the size of the cosmic-ray test data; we see variations around this due to the discrete nature of the underlying data.

FIG. 11. The illustrative example of data from Fig. 4 coloured by the confidence as defined in Definition 1. To aid visualisation, positive values are assigned to data points where the single-label prediction is "cosmic-ray", while we assign negative confidences to those where the single-label prediction is "background" (i.e., values closer to −1 indicate greater confidence in the noise label).

FIG. 14. The set sizes for the four possible prediction sets after applying MCP to 1000 test points for the cosmic-ray detector problem.

FIG. 19. The relation between the iFAR and three definitions of confidence within the CP framework: the standard confidence given in Definition 1 (left-hand panel), the conditional signal confidence given in Definition 2 (middle panel), and the not-noise confidence given in Definition 3 (right-hand panel). For each definition, we plot the confidence against the iFAR for all triggers (separated by pipeline) in the MDC. For the standard definition, Definition 1, the confidence that the data contains a signal can only be calculated when the single-label prediction is for a signal (see Section IV C); we mark these points by a solid line in the left-hand panel. Meanwhile, for values of the iFAR where the single-label prediction is for noise, we use a dashed line. We therefore see a turnover in the left-hand panel: a minimum iFAR below which we cannot assign any confidence that the data contains a signal. We sort triggers by iFAR to produce a continuous line showing the learned mapping. In all cases, we truncate the figure at an iFAR of 10^4 years for visualisation purposes: the mapping extends up to the maximum iFAR in the data set and monotonically approaches unity in that limit. In the right-hand panel, we add an inset showing the behaviour as each curve approaches unity.