Reproducing GW150914: the first observation of gravitational waves from a binary black hole merger

In February 2016, LIGO and Virgo announced the first observation of gravitational waves from a binary black hole merger, known as GW150914. To establish the confidence of this detection, large-scale scientific workflows were used to measure the event's statistical significance. These workflows used code written by the LIGO Scientific Collaboration and Virgo and were executed on the LIGO Data Grid. The codes used to perform these analyses are publicly available, but there has not yet been an attempt to directly reproduce the results, although several subsequent analyses have replicated the result, confirming the detection. In this paper, we attempt to reproduce the result from the compact binary coalescence search presented in the GW150914 discovery paper using publicly available code executed primarily on the Open Science Grid. We show that we reproduce the original result, discuss the challenges we encountered, and make recommendations for scientists who wish to make their work reproducible.

FOR THE SCIENTIFIC COMMUNITY to build on previous results, it must trust that these results are not accidental or transient, but rather that they can be reproduced to an acceptably high degree of similarity by subsequent analyses. This notion of reproducibility is magnified in both importance and difficulty in the context of computational science workflows [1]. An increasingly large fraction of scientific results depend on computational elements, which in turn creates reproducibility challenges associated with the implementation of these computational elements. Reasoning about the validity of published scientific results and re-using them in derivative works becomes an extremely challenging task. Publishers have made great strides in including artifacts associated with manuscripts. However, data, methods, and results are still hard to find and harder still to reproduce (re-creating the results from the original author's data and code), to replicate (arriving at the same conclusion from a study using new data or different methods), and to re-use in derivative works (using code or data from a previous study in a new analysis) [2].
Our work focuses on reproducing the computational analysis used to establish the significance of the first detection of gravitational waves from binary black holes by the Advanced Laser Interferometer Gravitational-wave Observatory (LIGO) [3]. As part of its commitment to Open Data, LIGO made the data and scientific codes from its first observing run available to the scientific community. Previous analyses have replicated the results of the GW150914 discovery [4], [5]. In these analyses, the data from LIGO's first observing run was re-analyzed by independent teams of scientists using different codes, different data, or different workflows from those used in the original GW150914 discovery.
In a previous work, we have performed a post-hoc comparison of these results using the published papers and the PRIMAD reproducibility formalism [6]. Here, we attempt to reproduce ab-initio the original LIGO analysis used in the GW150914 discovery paper using public information. Specifically, we attempt to reproduce the results of the PyCBC search for gravitational waves [7], [8] shown in Figure 4 of Abbott et al. [3].
Our effort is not completely separate from the original analysis, as one co-author of this paper was a member of the team involved in running the original LIGO analysis. However, our aim was to automate the production of the result in a way that other co-authors of this paper who were not members of the LIGO or Virgo collaborations, as well as other scientists, could reproduce the result.
The original analysis workflows were executed on the LIGO Data Grid, a collection of computational resources that are not available to the wider community. Since non-LIGO scientists do not have access to these systems, we execute the analysis on the Open Science Grid (OSG) [9] and rely on a cyberinfrastructure software stack that has the latest stable releases of key software packages such as HTCondor [9], Pegasus [10], and the CERN Virtual Machine Filesystem (CVMFS) [11]. We have created a script that automates the setup and deployment of the LIGO workflows on a typical local compute cluster; from there, Pegasus manages their execution on the OSG.
Our main goal was to reproduce the results of the PyCBC search shown in Figure 4 of Abbott et al. [3], shown on the left side of our Figure 1. Our reproduction of this plot, shown on the right-hand side of Figure 1, shows that we can reproduce the search result, but there are small, noticeable differences in the search background (explained later in the paper). Based on documentation, we believe that these differences arise because the data used in the original analysis was different from that released by the Gravitational Wave Open Science Center (GWOSC) [12] and used in our analysis. Unfortunately, the original data set is not public and so we are unable to confirm this hypothesis. However, we consider our ability to re-run a scientific workflow last executed in 2015 and largely reproduce the results to be a success.
This article is structured as follows. First, we provide background on the first gravitational-wave discovery. We then describe our recent ab-initio effort to reproduce the GW150914 result, followed by the challenges we encountered. In the results section, we explain the differences observed between our reproduction and the result published by LIGO. We perform an analysis of the workflow's run-time provenance data and the compute resources required to execute the workflow. We conclude with recommendations for others who wish to reproduce the GW150914 result.

THE DISCOVERY OF GW150914
Gravitational-wave astronomy is an interesting case study for robust science because it has three main science phases: low-latency data analysis, offline analysis, and public and educational dissemination of results. The low-latency analysis processes instrumental data in near real time to identify astrophysical signals. Alerts are disseminated to a wide range of electromagnetic telescopes to identify multi-messenger counterparts. Offline analyses validate the low-latency detections, identify signals missed in low latency, and provide determination of source properties. When a detection is published, the data is released to the scientific community and the public. Since the analysis codes are also released, it should be possible for people outside the LIGO Scientific Collaboration and Virgo to reproduce the published results.
Figure 1. Results from the binary coalescence search presented in the GW150914 discovery paper, from [3] with permission (left), and our attempt to reproduce these results (right). These histograms show the number of candidate events (orange markers) and the mean number of background events (black lines) as a function of the search detection statistic and with a bin width of 0.2. The scales on the top give the significance of an event in Gaussian standard deviations based on the corresponding noise background. We were able to reproduce the search result for GW150914, but we were unable to exactly reproduce the search background. The differences between the two figures are likely due to differences in the gravitational-wave strain data used, as described in the text.
Our attempt to reproduce the first detection of gravitational waves from binary black holes, known as GW150914, starts from the data released by the GWOSC [12]. GW150914 was first detected by a low-latency search for gravitational-wave bursts that identifies interesting candidates but does not provide the final statistical significance of detected events. To establish the significance of events, data from the LIGO detectors is subsequently analyzed by scientific workflows that use longer stretches of data to provide a measure of the noise background in the detectors and use this to measure the significance of candidate events. Results from two offline analyses were presented in the GW150914 discovery paper: one that used a search technique that did not make assumptions about the shape of the gravitational waveform [3] and one using matched filtering to search for the signals from merging black holes [7], known as PyCBC. Here, we focus on reproducing the results of the PyCBC binary black hole search.
The PyCBC search uses matched filtering to compare the LIGO data with a bank of template binary black hole waveforms that model the target sources. If the noise in the LIGO detectors were stationary and Gaussian, estimating the statistical significance of candidate events that cross a signal-to-noise ratio threshold would be straightforward. However, the LIGO detector data contains non-Gaussian noise transients and periods of non-stationary noise. Additional signal-processing techniques are therefore applied to the data that down-weight the signal-to-noise ratio of non-Gaussian noise events. Candidate signals are required to be coincident between the detectors in time and in mass parameters. The map between the detection statistic (the weighted signal-to-noise ratio) and the statistical significance of an event must be empirically measured by the workflow. This is done by time-shifting the data between the detectors and repeating the coincidence analysis many times. The most computationally intensive parts of the PyCBC workflow are the matched filtering and the calculation of the detection statistics. Performing the coincidence and the time-shift analysis can require a large amount of memory to process the candidate events. Once these steps are complete, the workflow produces a measurement of the statistical significance of candidates. A separate script, run after the workflow completes, produces a histogram that compares candidate events to the noise background.
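The matched-filtering step described above can be sketched in a few lines of Python. This is the generic textbook frequency-domain matched filter, not the PyCBC implementation itself; the function name and normalization conventions here are ours:

```python
import numpy as np

def matched_filter_snr(data, template, psd, delta_f):
    """Frequency-domain matched filter: correlate data against a template,
    weighting each frequency bin by the inverse of the noise power
    spectral density (PSD). All inputs are frequency series of equal
    length; the return value is |SNR| as a function of time shift."""
    # Noise-weighted correlation integrand, <d, h>(f)
    integrand = data * np.conj(template) / psd
    # Inverse FFT evaluates the correlation at every time shift at once;
    # multiplying by len() undoes numpy's 1/N normalization of ifft.
    snr_time_series = 4.0 * delta_f * np.fft.ifft(integrand) * len(integrand)
    # Normalize by the template's own noise-weighted power, sigma^2 = <h, h>
    sigma_sq = 4.0 * delta_f * np.sum(np.abs(template) ** 2 / psd).real
    return np.abs(snr_time_series) / np.sqrt(sigma_sq)
```

With this normalization, filtering a noise-free copy of the template against itself peaks at zero time shift with SNR equal to the template norm sigma.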

REPRODUCING THE ANALYSIS
Our work is the first attempt to reproduce the original LIGO analysis. Previous analyses, for example the 1-OGC result [5], provide examples of replication in gravitational-wave science. In the 1-OGC analysis, a different team with a different experimental setup recovered the discovery of GW150914. Here, "different experimental setup" means a modified data-analysis pipeline with a different configuration from that used in the original analysis. The 1-OGC result independently confirmed that GW150914 was a high-significance discovery, but the event was recovered with slightly different parameters from the original discovery; these parameter differences can be explained by differences between the algorithms used.
Here, we provide an example of reproducibility of the measurement of the statistical significance of GW150914 shown in Figure 4 of Abbott et al. [3]. Reference [3], by its nature as a brief letter, does not provide sufficient information to reproduce the result. Reference [8] provides additional description of the analysis, and the codes 1 and configuration files 2 are publicly released on GitHub. Although the codes and configuration are public, the LIGO/Virgo collaboration does not provide full instructions for running the workflow and reproducing the analysis. Our work provides a fully reproducible process.
Not all of the information needed to reproduce the GW150914 workflow was available in the public release accompanying the publications. The lack of a single, public repository of this knowledge is the most significant challenge for a group outside the LIGO and Virgo collaborations in reproducing the GW150914 result. However, one of the co-authors of our work was a member of the team who performed the original analysis. They were able to review their unpublished notes, which allowed us to successfully reproduce the LIGO analysis. To ensure that scientists who were not involved in the original analysis could reproduce the results, we created scripts that were run independently by another author of this paper not involved in the original analysis. These scripts were created in a peer-programming style, starting from the original scripts used to run the LIGO workflow and create the result plot. We iteratively fixed problems encountered when trying to run the analysis using information entirely in the public domain, filling in missing public information with the original analysis notes where necessary.
PyCBC is a gravitational-wave data-analysis toolkit written primarily in Python with C extensions for numerically-intensive computation. Rerunning old versions of interpreted Python code can be challenging if the underlying software stack has changed since the code was originally executed. Fortunately, LIGO packaged the PyCBC codes used in the original analysis as PyInstaller bundles. These bundles package the Python code with a Python interpreter and the Python library dependencies allowing us to run the original codes without needing to recreate the entire software stack. Our final version of the workflow execution script is provided in a data release that accompanies this paper 3 and the GitHub commit history 4 documents the iterative process of addressing the issues encountered, which we describe below.

Software versions
Software provenance is critical to the reproducibility of scientific workflows. However, neither the discovery paper published in Physical Review Letters, nor the technical paper published in Physical Review D, documented the exact version of the PyCBC code used to produce the analysis. The notes from the original run recorded that PyCBC v1.3.2 was used, and recorded the git commit hash of the configuration files used (which are stored in a separate GitHub repository).

Open data
The original analysis used data and metadata that are proprietary to the LIGO Scientific Collaboration, and the workflow used tools that queried proprietary servers to locate and access these data. Our script modifies the workflow to use the public data and services provided by GWOSC. For the metadata, we created wrapper codes that have the same command-line API as the proprietary codes and translate their queries into queries against the public data repositories. The format of the data-quality metadata provided by GWOSC is different to that used in the original analysis. Information from the public LIGO technical note T1600011-v3 was used to determine how to use the public metadata in a way that is as close as possible to the original metadata. LIGO publishes its public data using CVMFS under the gwosc.osgstorage.org organization. Our script configures the workflow to use data from CVMFS, allowing us to rely on its distribution and caching capabilities when running jobs on the OSG. To allow the workflow generation script to find these data, we installed the LIGO Diskcache API to index the CVMFS files and the LIGO Datafind Server to resolve the workflow's metadata queries to file URLs for the CVMFS data. Configuration files for these tools are provided in our data release.
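The wrapper approach can be illustrated with a minimal sketch. The function, directory layout, and file-naming scheme below are hypothetical, chosen only to show how a datafind-style query (observatory plus GPS interval) can be resolved to file URLs under a CVMFS mount; the real resolution is performed by the LIGO Datafind Server and Diskcache API described above:

```python
def cvmfs_frame_urls(observatory, gps_start, gps_end,
                     root="/cvmfs/gwosc.osgstorage.org/gwdata"):
    """Resolve a datafind-style query to file:// URLs for fixed-length
    frame files under a CVMFS mount. The layout and naming here are
    illustrative placeholders, not the actual GWOSC structure."""
    frame_len = 4096  # GWOSC frame files span 4096 seconds
    # Frame files start on multiples of the frame length, so round down.
    first = (gps_start // frame_len) * frame_len
    urls = []
    for start in range(first, gps_end, frame_len):
        name = f"{observatory}-{observatory}1_LOSC_4_V1-{start}-{frame_len}.gwf"
        urls.append(f"file://{root}/{name}")
    return urls
```

Because the wrapper keeps the proprietary tool's command-line API, the unmodified workflow jobs can call it without knowing that the files now live on CVMFS.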

Workflow format
To provide sufficient resources to run the workflow, we executed the computationally-intensive jobs on the OSG. This required a newer version of the Pegasus workflow management system [10] than the version originally used to plan and execute the analysis. Our workflow generation script modifies the workflow written by PyCBC v1.3.2 to be compatible with Pegasus 4.9.3.

Access to codes
Although all of the codes used to generate and run the analysis workflow were public, the script used to make the figure shown in Abbott et al. [3] was never released in the PyCBC software repository. Since one of the authors of this paper helped create this script, we were able to obtain the original code used.

WORKFLOW EXECUTION
After modifying the original workflow generation script to address the challenges described in the previous section, we attempted to reproduce the analysis. LIGO did not provide estimates of the runtimes or the resource requirements of the analysis tasks, so we executed the workflow on a combination of local and OSG resources. We used USC-ISI computers to manage the workflow and run the post-processing jobs and OSG resources to run the computationally-intensive jobs. Several challenges were encountered during our attempt to execute the workflow, as described below.

Operating system and hardware mismatches
The PyCBC PyInstaller bundles are not true static executables, nor are they packaged in a robust containerized environment like Singularity. The bundles require the appropriate C standard-library shared objects to be installed on the target machine and perform just-in-time compilation of bundled C code using the now-deprecated scipy.weave module. A standard set of OS libraries, the GNU C Compiler, and processor instructions was guaranteed for the original analysis, as it was run on a single homogeneous LIGO Data Grid cluster. However, not all of the OSG compute nodes had the correct versions of the C standard library installed, and some nodes lacked processor instructions that the PyCBC bundles required (specifically, we encountered QEMU-emulated virtual machines that lacked the FMA4 instruction). To address this, we used the matchmaking and fault-tolerance functionality of Pegasus and HTCondor, and their ability to express requirements on node characteristics, to steer the PyCBC executables to compatible OSG compute nodes.
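A submit-file fragment of the kind used to steer jobs might look as follows. The ClassAd attribute names here are illustrative, not the exact expressions from our configuration; the attributes actually advertised by execute nodes vary from pool to pool:

```
# Illustrative HTCondor submit-file fragment: only match execute nodes
# that advertise a compatible OS and the CPU features the bundles need.
# Attribute names (OpSysAndVer, has_avx) are examples, not our exact ones.
requirements = (OpSysAndVer == "CentOS7") && (has_avx == True)

# If a job still lands on an incompatible node and fails, let the
# scheduler retry it elsewhere rather than giving up.
max_retries = 5
```

The same mechanism generalizes to any node characteristic a pool advertises, which is why ClassAd-based matchmaking was sufficient to work around the heterogeneity of the OSG.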

Non-deterministic memory use
The amount of memory that the matched-filtering jobs require is determined by the data that they analyze. If the LIGO data contains more non-Gaussian noise than average, more memory is required to compute signal-based vetoes and to store the resulting triggers. Since the noise is random, it is not possible to determine in advance how much memory a given job will require. To address this, we configured HTCondor to automatically request more memory on each retry of a filtering job that had failed.
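This retry pattern can be expressed with standard HTCondor ClassAd functions. The starting value, growth factor, and release condition below are illustrative, not the exact values from our configuration:

```
# Illustrative memory-escalation pattern. On the first attempt, request
# a baseline 4 GB; on each retry after a failure, request double the
# memory the previous attempt was observed to use.
request_memory = ifthenelse(isUndefined(MemoryUsage), 4096, 2 * MemoryUsage)

# Release held jobs so they can retry, up to a bounded number of attempts.
periodic_release = (NumJobStarts < 5)
```

Because `request_memory` is re-evaluated on each attempt, a job whose input data happens to contain many noise transients converges on an adequate request without over-provisioning every job in the workflow.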

Post-processing memory requirements
Several of the workflow's post-processing jobs require very large memory footprints (greater than 128 GB). It was challenging to find machines with sufficient capacity for these jobs on the OSG, and so these jobs were executed on the local cluster at USC-ISI. This cluster is managed using HTCondor partitionable slots, allowing a single job to request sufficient memory on our multi-core machines. Coordination with the cluster administrators was required to ensure that these resources were available.

Long-term code archival
During the preparation of this paper, the LIGO Scientific Collaboration deleted the repository at git.ligo.org that stored the compiled PyInstaller PyCBC executables used in the original analysis. We had preserved a copy of the PyCBC v1.3.2 PyInstaller bundles prior to their deletion in an archive file on the IEEE DataPort server 5 . We have hosted an uncompressed version of this archive on a USC/ISI web server 6 and configured our workflow generation script to download the bundles from the USC/ISI server. Preserving these executables will allow others to run using the bundles rather than having to recreate the complex PyCBC software stack from the public source code available on GitHub.

RESULTS
Once the various issues described above had been addressed in our workflow-generation script generate_workflow.sh 7 and LIGO's PyCBC v1.3.2 PyInstaller bundles, we were able to reproduce the LIGO analysis workflow. We observed 28,676 task failures, and the workflow accumulated approximately 155 days of badput (computation time used by failed jobs). The majority of job failures were caused by PyInstaller bundles landing on incompatible nodes, and the majority of badput was due to compute-intensive jobs being evicted for using too much memory. Failing jobs were re-run using the retry-on-failure semantics of Pegasus and HTCondor. The workflow generates a web page with a number of diagnostic plots that are used by LIGO scientists to understand the detector state, the properties of the noise, and the results of the search. For the purpose of reproducing Figure 4 of Abbott et al. [3], the primary data product is a 2.5 GB HDF5 file that contains the triggers found in coincidence between the LIGO detectors and the search background estimated using the time-slide method [7], [8]. We have archived a compressed version of this HDF5 file on the IEEE DataPort server 8 .
To allow future researchers to reproduce our work on their own resources, we show the distribution of physical memory used by LIGO jobs and their run-times as frequency histograms in Figure 2 and Figure 3 respectively. In PyCBC workflows, each job type is associated with a transformation (executable). In Table 1 we show the top 10 transformations ordered by the maximum physical memory used. Table 2 shows the top ten transformations ordered by maximum runtime in seconds.
The workflow generates the data required to make the histogram shown in Figure 4 of Abbott et al. [3]. As noted earlier, the script that produces this histogram was not made public. Even though we had an internal version of the script, no PyInstaller bundle was created that captured the software stack used by that script. Running the plotting script against current versions of the libraries resulted in failures, so we needed to reproduce the original software stack. This was a considerable challenge and illustrates the importance of releasing containerized executables in addition to the source code for reproducibility in scientific analyses. We obtained the versions of PyCBC and LALSuite used by this code from notes made at the time of the original analysis (v1.3.4 and v6.36, respectively). We then determined the necessary and sufficient set of lower-level libraries required by these high-level libraries by examining the setup.py and requirements.txt in PyCBC v1.3.4. Using the PyCBC v1.3.4 install instructions, we created a Python virtual environment with the same versions of pip and virtualenv used in the original analysis. An iterative process of running the LALSuite configure script was performed until all the required dependencies of LALSuite were installed. The specific versions of fourteen libraries (and their dependencies) were either installed using pip or compiled from source into the Python virtual environment. The iterative process of determining the required dependencies was complicated by the fact that pip caches previous software builds, so the install process is not necessarily idempotent.
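The caching pitfall can be avoided by disabling pip's build cache when recreating such a pinned environment. The following fragment sketches the idea; the environment name is a placeholder and, apart from PyCBC v1.3.4, any version pins one adds would need to come from the original analysis notes:

```
# Recreate a pinned plotting environment (environment name illustrative).
virtualenv pycbc-plot-env
source pycbc-plot-env/bin/activate

# --no-cache-dir prevents pip from silently reusing wheels built during
# earlier, failed install attempts, keeping repeated installs idempotent.
pip install --no-cache-dir pycbc==1.3.4
```

Pinning every transitive dependency in a single requirements file, rather than resolving them iteratively, would make this process reproducible in one step.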
After these libraries were installed, LALSuite and PyCBC were installed and the Python plotting script executed. Our data release includes a script make_pycbc_hist.sh 9 that automates the installation and execution of the plotting code. Our reproduction of the LIGO result is shown in Figure 1, which includes the original LIGO/Virgo result for comparison. We find that we were able to reproduce the search result. However, there are some small but noticeable differences in the search background (continuous black line) and in the lower bound on the significance that the workflow reports for GW150914: we find the significance is greater than 5σ, rather than greater than 5.1σ. We attribute both of these differences to changes in the input data used by the workflow. The usage instructions for the GWOSC data state that the LIGO strain data in the public data set are based on the C02 calibration of the LIGO detectors, whereas the original PyCBC configuration files state that C01 data was used for the analysis of Abbott et al. [3]. We hypothesize that the GWOSC C02-based data contains slightly less analysis time than the C01 data originally used. This would result in a lower bound for the significance of the event and produce slight changes in the search background. However, we are not able to verify this, as we were unable to obtain access to the proprietary C01 data.
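Statements such as "greater than 5σ" translate the search's false-alarm probability into Gaussian standard deviations, as on the top axes of Figure 1. The conversion can be sketched as below; we assume the common one-sided Gaussian-tail convention, which may differ in detail from the exact convention used in the PyCBC post-processing:

```python
import math

def fap_to_sigma(p):
    """Convert a one-sided false-alarm probability p to the equivalent
    number of Gaussian standard deviations by numerically inverting
    p = 0.5 * erfc(sigma / sqrt(2)) with bisection."""
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        # The tail probability decreases as sigma grows, so a tail
        # probability still above p means sigma is not yet large enough.
        if 0.5 * math.erfc(mid / math.sqrt(2.0)) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, a false-alarm probability of about 2.9 × 10⁻⁷ corresponds to a 5σ significance under this convention.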

CONCLUSIONS
We have described the process and challenges encountered in reproducing the measured statistical significance of GW150914. Our script is configured to execute compute-intensive jobs on the OSG. To allow scientists to run on other resources, our data release provides instructions for running all jobs on local resources. The memory and runtime profiling of the workflow tasks provided in this paper will enable appropriate resource selection.
Our execution of the workflow used HTCondor as the job scheduler; this scheduler is also used by the LIGO Data Grid. Although one could modify our workflow generation script to use alternative job schedulers, we recommend against this because of the wide variance in the memory requirements of the jobs and the need for a relatively homogeneous environment. We rely on HTCondor for job resubmission in case of failure and on its custom ClassAd expressions to increase the memory requested for a job after a failure. When using a scheduler such as SLURM, we recommend using the upper memory bounds we provide in this paper. A better practice would be to overlay HTCondor on the native scheduler using resource provisioning techniques such as HTCondor glideins. We also recommend the use of CVMFS to access GWOSC data. Although Pegasus can be configured to transfer the data at runtime, e.g., from the submit host where the workflow system is located (USC/ISI in our case) or via HTTP from the GWOSC web site, this requires the movement of tens of thousands of input data files. It is more efficient to rely on the CVMFS storage and caching mechanism and to configure Pegasus to create symbolic links to the CVMFS locations rather than performing true copies.
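In Pegasus, this symlinking behavior is controlled through its properties file. A minimal fragment might look as follows, assuming the site catalog already marks the CVMFS mount as a filesystem visible to the execute nodes:

```
# Illustrative Pegasus properties fragment: create symbolic links to
# input files on a locally visible filesystem (here, the CVMFS mount)
# instead of copying them into each job's work directory.
pegasus.transfer.links = true
```

With this setting, each of the tens of thousands of input frame files is resolved to a symlink served from the local CVMFS cache rather than transferred over the network.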
We have demonstrated that, although LIGO did not provide complete instructions for reproducing the GW150914 result, sufficient information exists, either in the public domain or recorded as notes describing the original analysis, to reproduce the PyCBC GW150914 workflow. Although we made substantial progress in reproducing the PyCBC result shown in Figure 4 of Ref. [3], we were unable to reproduce it exactly, as we did not have access to the original input data and metadata. If these are made public, it would be straightforward to modify our workflow generation script to use them. As part of this work, we have released scripts that allow other scientists to easily reproduce the LIGO analysis from publicly available data, on their own compute resources or on the OSG, using the latest stable versions of Pegasus and HTCondor.
Our results show that, in principle, it is possible to release instructions and code that allow other scientists to reproduce a major scientific result. We encourage scientists who wish to do so to ensure that instructions include: access to the original data and codes used; documentation of software and configuration file versions; containerized executables that capture the complete software stack used in the original analysis; long-term archival of code and data products used; and documentation about the computational resources needed to execute the analysis.