Optimal inference of the start of COVID-19

Zheng-Meng Zhai,1,* Yong-Shang Long,1,* Ming Tang ,1,2,† Zonghua Liu,1 and Ying-Cheng Lai 3,4,‡ 1State Key Laboratory of Precision Spectroscopy and School of Physics and Electronic Science, East China Normal University, Shanghai 200241, China 2Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai 200241, China 3School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, Arizona 85287, USA 4Department of Physics, Arizona State University, Tempe, Arizona 85287, USA


I. INTRODUCTION
The first case of COVID-19 in the United States was reported on January 20, 2020 and, according to the official account, the first death on American soil occurred on February 29. Astonishingly, on April 21, California reported that the first death in the state occurred on February 6, more than three weeks earlier than previously reported. One implication is that community spreading may have already occurred in the U.S. three weeks earlier than believed. The importance of knowing precisely the starting date of community spreading cannot be overstated: this is the date based on which government mitigating actions and control measures would be imposed. In the U.S., according to the government report, the onset of an exponential increase in the infections occurred in the middle of March, leading to the belief that COVID-19 began the phase of community spreading around the same time. Based on this perception, the White House issued a nationwide social-distancing order on March 16. Statewide stay-at-home or shelter-in-place orders were given by the governors of various states at different times. The effectiveness of these government actions notwithstanding, as of August 28, there have been close to six million cases in the US and over 181 000 deaths. This devastating development means that the perceptive date of community spreading of COVID-19 in the U.S. was wrong: it could have been weeks earlier than the governments have chosen to believe.
To develop a reliable method to infer the starting date of COVID-19, or any infectious disease, is of uttermost importance. Precise knowledge of exactly when community spreading started would prompt the government to take actions at the earliest possible moment, drastically reducing the number of infections and saving many thousands of lives. More specifically, knowing this time in combination with knowledge about the symptomatic individuals in the early stage of the disease spreading enables (1) an effective reduction in the range of contact tracing, which increases the chance of accurately locating the source of infection with limited resources, (2) an assessment of the ability to infect of the virus and the way by which it spreads, providing guidance for early control measures, (3) determination of the interstate and international propagation paths of the virus, and (4) providing unequivocal early warnings for the governments. In this regard, a recent work based on gene sequencing analysis found clusters of related viruses in patients living in different neighborhoods of New York City, suggesting that multiple, independent but isolated introductions of the virus had mainly come from Europe and other parts of the U.S. [1]. In another study [2], the frequencies of the key words such as coughing and fevers on Twitter were used for model analysis, with the finding that the actual outbreak time can be 5-19 days earlier than officially reported.
Our method is based on inverse inference and enables the starting date of the epidemic to be deduced from limited available data. Another feature of our model is that it contains sufficient details to capture the key dynamical behaviors of COVID-19 spreading. In particular, our inverse method of inference is based on a comprehensive, non-Markovian spreading model tailored to COVID-19 in either an open or a closed setting [3]. The dynamics is described by the time evolution of the populations in five distinct states (SHIJR): susceptible (S), hidden (H), infected (I), confirmed (J), and removed (R). The S, I, and R states are conventional, but the H and J states are COVID-19 proper. Of particular importance is the H-state population: it is the population of individuals who have already contracted the coronavirus but have shown only mild symptoms or would never show any symptoms. To evolve the dynamics, the initial hidden population, H (t 0 ), is an essential parameter, where t 0 is the time (day) at which the number of real confirmed cases begins to increase. Government control measures are usually imposed some time after this day. It is important to note that t 0 is not the day when the coronavirus first appears in the community (i.e., the starting date of community spreading): the latter could be significantly earlier, at a date termed as time zero, denoted as 0! For a time period between t 0 and t 1 , where t 1 > t 0 , there is an appreciable and continuous increase in the number of infections, prompting the government to impose vigorous control measures on day t 1 . The time line is thus 0 < t 0 < t 1 . The problem is to determine time 0. Our inverse inference method is articulated to solve this problem. By generating dynamical evolution of the epidemic model, comparing the model prediction with the limited data available after t 0 , and invoking an optimization procedure, we determine the key model parameter values including H (t 0 ). Running the model between a hypothesized time 0 and t 0 to determine how long it takes for H (0) (a small positive integer, e.g., one or two) to reach H (t 0 ) allows us to pin down time 0 precisely. The relevant dates and time line are illustrated in Sec. II.
We apply the inverse method to 10 states in the U.S. plus New York City and the country as a whole, and find that time zero is as early as the beginning of January. In the U.S., the day t 1 is March 16, while t 0 varies among the individual states (e.g., February 26 for Washington and March 2 for New York). For a specific system (a state or a country), letting T ≡ t 0 − 0 be the time span between the date on which the virus started community spreading and the date of the number of real confirmed cases beginning to increase, we find an exponential scaling relation between H (t 0 ), the number of people who already carried the coronavirus on the officially confirmed date in that system and T . This means that a longer delay in reporting the first case would lead to an exponentially large virus-carrying population, rendering significantly more challenging to fully control the disease spreading. The need for early, preemptive testing thus cannot be overemphasized.
Our model has the power to predict the occurrence of community spreading of COVID-19 long before the time of official report of the outbreak, making it possible for the governments to impose control measures and to summon the essential medical resources. Take the example of the U.S. In February, there was already unmistakable evidence that the coronavirus already existed in the U.S. In complete hindsight, consider the fictitious scenario that widespread tests had been carried out in the U.S. in February so that adequate data had been available. Inverse inference could have been carried out then to determine time zero. This could have sent the vital message to the governments that community spreading of COVID-19 had already started weeks ago. If strict government control actions had been taken at the end of February, the epidemic picture of the U.S. today would have been drastically different. Our framework of modeling and inverse inference can arguably be a valuable asset for guiding the governments to take appropriate actions for possible future outbreaks of coronavirus or other infections diseases.

II. METHODS
In the U.S., the circumstances under which COVID-19 spreads vary dramatically among different states: not only are the levels of travel restriction orders dissimilar, but other factors affecting the disease spreading such as the population, medical resources, and social and political cultures are also distinct. A quantitative assessment of the effects of the control measures taken by the government to contain COVID-19 thus needs to be carried out on a state-by-state basis. A complication is that each individual state is not a closed system: people move into and out of the state on a daily basis. This presents a tremendous challenge to modeling [4,5], as most current data analyses and models for COVID-19 were for the setting of a closed system without considering the inbound and outbound population movements . Another feature of the COVID-19 pandemic that most existing models did not take into account is the non-Markovian nature of the spreading dynamics, as characterized by the various time delays associated with the dynamical states. To account for the non-Markovian characteristics in the model, coupled differential equations with distinct time delays are necessary [3].
To determine time 0 for a state in the U.S. treated as an open system, we develop a coupled, dual-system spreading model. In particular, we treat the target system (A) as one under influences from another, much larger system (B) that represents all the other states. Because the size of B is much larger than that of A, in terms of the spreading dynamics, system B can be regarded as a closed system. From the standpoint of nonlinear physics, the influences of system B on system A can be viewed as a perturbation or background noise, while the effects of A on B can be neglected. The perturbation can be estimated based on the population of the target state and the empirical human movement data. The backbone of this unidirectionally coupled modeling framework is a non-Markovian spreading model incorporating various time delays in a closed-system setting [3].
For the five-state (SHIJR) model in the closed-system setting, there are three parameters whose values are to be determined from data: (1) the infection rate β (the probability that an individual in the S state catches the virus and switches into the H state), (2) H (t 0 ) (the hidden population at the initial time t 0 when the available data, i.e., the number of confirmed cases, began to increase with time to enable reliable estimation of the model parameters), and (3) the fraction of undocumented infections, denoted as η, whose value is determined by the testing and surveillance capability of the government. For COVID-19, the state transitions in the  Table IV in Appendix B for dates of the first confirmed case(s) reported by the World Health Organization for each state/city/country studied in this paper]. Time t 1 is the date on which the government imposed control measures. The time interval [t 0 , t 2 ] is used to estimate the three key parameters of the SHIJR spreading model: In the open-system setting, an additional feature exists: there is time dependence due to human movements in and out of the system. For both closedand open-system models, the government control measures result in an exponential decrease in the human social and movement activities. The collective effects of these measures can be described by one parameter: the exponential decay rate of the activities, denoted by λ, where a larger rate corresponds to more stringent control measures. The value of λ can be estimated based on the available epidemic data. A detailed mathematical description of the model is presented in Appendix A. Figure 1 explains our principle to infer time 0 in terms of a number of key dates underlying COVID-19 spreading. Some days after time 0, the first or the first few cases emerged and were officially reported. However, the number of cases is typically small, e.g., one or two, and this number can remain unchanged for a number of days. Time t 0 is the date after which the number of cases begins to increase. The government imposes control measures on date t 1 > t 0 . Since the government control is an integral part of our model, it is necessary to use data up to some date later for determining the model parameters when the effects of the control measures have been manifested in the data. We assume that this would require at least 12 days and thus set t 2 = t 1 + 12 days.
To estimate the four model parameters in a computationally feasible way while ensuring accuracy, we devise the following two-step procedure. We separate the four parameters into two groups: [β, H (t 0 ), η] and λ, where the first three are associated with the intrinsic spreading process while the last is tied to the reduction in the human movements as a result of government control measures. Because of the time delay for the effects of the measures to be manifested, the value of λ does not affect the fitting with the data in the early stage of the disease. As a result, we first use the available data in the time interval For each parameter, we assign an initial range of its possible values. A large number of combinations of the three parameter values are then used to evolve the model from t 0 to generate the number of confirmed cases, J (t ), in the time interval [t 0 , t 2 ]. Using the real data, we carry out a weighted optimization procedure to determine a set of combinations of the three parameters with minimum errors. We then choose a range for λ and uniformly distribute a number of values of λ in this range. Each λ value is combined with the already determined combinations of [β, H (t 0 ), η] to yield an equal number of combinations [β, H (t 0 ), η, λ]. With all the chosen λ values, this leads to a large number of combinations of the four parameters. Finally, we carry out the same optimization procedure in the time interval [t 0 , t 3 ] with t 3 > t 2 to determine the combination of the four parameters with the minimum error. We choose t 3 = t 2 + 12 days.
With the values of the optimal parameters so estimated, the model generates the time series J (t ) that can be compared with the data. In the time interval [t 0 , t 3 ], the agreement is generally excellent. However, this is not indicative of the predictive power of the model because the model parameters are estimated using the data in the same time interval. To test the model for prediction, we run the model in the time interval [t 3 , t 4 ], as indicated in Fig. 1. Comparison with the real data in this time interval reveals generally quite good agreement, validating the model.
The model is now ready to be used for inferring time 0. Quite straightforwardly, for a given system (a state, a city, or a country), we set H (0) = 1, choose a number of possible candidates for time 0, and run the model from time 0 to t 0 to test which candidate date leads to the known number H (t 0 ): the starting date that gives the correct H (t 0 ) value is taken to be time 0.

A. Finding time zero for the U.S. states, Italy, Spain, and United Kingdom
We aim to find time zero for 10 representative states in the U.S., plus New York City (NYC), the entire U.S.A., and three European countries [Italy, Spain, and the United Kingdom (UK)]. Each State in the U.S. is treated as an open system, while NYC, U.S.A., and the three European countries are treated as a closed system. We first demonstrate that each corresponding model has the required predictive power. Figure 6(a) shows the inferred time zero together with its confidence interval for the 15 states/city/countries, in an ascending order. It can be seen that SARS-CoV-2 appeared in Italy about the New Year day. In the U.S., it first emerged between January 7 and 11 in Washington or New York State. When the whole country of U.S.A. is treated as a closed system, the virus first appeared on about January 9, with confidence interval overlapping with that of the Washington and New York States, suggesting Washington or New York as the first state in which the virus emerged. It can be seen from Fig. 6(a) that the officially reported time of the emergence of the first few cases does not represent the beginning of the community spreading. In most cases, the time zeros inferred by our method were earlier than the official time, e.g., about one month earlier in Italy, two months earlier in New York State and New York City, over half month earlier for Washington State. These results indicate that, before the official report of cases, COVID-19 had already spread in the community for some time. The danger of the misconception of the delayed starting date of community spread as reported by the governments is real with devastating consequences: when the governments decided to impose control measures, it may have already been too late. For example, New York State issued the lockdown order on March 22, which is more than two months later than time 0 (January 9), indicating unusually slow response of the state government to COVID-19. For the U.S. as a country, the White House issued a nationwide social-distancing order on March 16, but it was already two months later than the starting time of COVID-19 in the country, attesting to the extremely and unreasonably slow response of the federal government to the pandemic! There are cases where the inferred dates of time zero agree approximately with the official dates, e.g., the State of Arizona. This is because the outbreak in other regions of the country (e.g., the original epicenter New York City) increased the awareness level, promoting the local governments and population to take certain protective measures and thereby delaying the community spread. Figure 6(b) shows that the values of the interval between time 0 and t 0 for different systems are in the range [20,53], where the maximum value of 53 days occurs for the State of New York. The average time interval t 0 − time 0 is about 33 days, indicating that SARS-CoV-2 began to spread about a month earlier than reported officially. The overall time interval of free growth from time 0 to t 1 for the States of Washington and New York as well as the whole U.S.A. is about 73 days. This explains why the controlling effect in the U.S.A. was largely unsuccessful, as the virus had spread freely for so long, making it difficult for nonpharmacological interventions such as tracing and isolation strategies to effectively inhibit the spreading. Comparing the situations between U.S.A. and Italy, the time interval t 0 − time 0 of the former is smaller than that of the latter, but U.S.A. has a greater value of the time interval t 1 − t 0 , meaning that earlier prevention and control measures were more seriously taken in Italy. It is plausible that this resulted in better control of the virus in Italy than in the U.S.A. indicating insufficient testing for a relatively long period of time after the exponential outbreak. Insufficient test, of course, gave fewer confirmed cases than actual, opening the door for the governments to undermine the severity of the pandemic and even to have the deception that COVID-19 would be under control. A consequence is that COVID-19 has continued to spread aggressively in the U.S. at the present, with no indication of control in sight. In contrast, in countries that have successfully controlled the disease, such as China and South Korea, the value of λ is between 0.1 and 0.19, signifying significantly stringent government control measures [3]. For a given system, after time zero has been determined, it is straightforward to predict how the hidden, asymptomatic population grows with time from a single case, before any government control measure is imposed. Under such a circumstance, there is free growth characterized by an exponential law, as shown in Fig. 7 on a semilogarithmic scale for the 15 systems, where the exponential growth rate is determined by the infection rate β.
What is the relation between the size of the virus-carrying hidden population at the time of first confirmed case and T , the time elapsed since zero? Figure 8 answers this question for the 15 systems. It can be seen that T varies drastically among the 15 systems. When the first confirmed case was reported, there is already a sizable population of the asymptomatic individuals (hundreds or even thousands), whose actual size also varies dramatically among the systems. A general trend is that the hidden population at t 0 tends to grow exponentially with T , a consequence of the free growth behavior exemplified in Fig. 7.

B. Finding time zero for the recent second COVID-19 outbreak in Beijing, China
We apply our framework of inverse inference to predict time zero for the recent second outbreak of COVID-19 in Beijing, China. From June 11 to 14, 79 cases were reported in Beijing. Since the city had had 0 new cases for several months before, the new outbreak is independent of the previous one in the January-February time frame, and can thus be regarded as a second outbreak. Analyzing the official report indicates that symptomatic patients first emerged on June 6. The extreme sparsity of the available data prevents an accurate determination of the four key parameters in our inverse model through optimization, but it is still possible to make approximate estimates. In particular, in Beijing, the regions of   For uncontrolled and free growth, the exponential rate is essentially the infection rate β whose value has been estimated from data. five days, the estimated time 0 is quite close to the actual time 0. This demonstrates that, in China where government responses are quick and testing is widely available, our model can predict time 0 even during the early stage of the outbreak with sparse data. That is, from the first official report of the outbreak, our model is already capable of providing a rough estimate of time 0, facilitating greatly localization of the spreading source(s).

IV. DISCUSSION
There were speculations that SARS-CoV-2 could have been in the U.S. a few weeks earlier than January 20, the FIG. 8. Approximately exponential relation between the size of the hidden population at the time of first confirmed case reported and the time elapsed since zero. For the 15 systems studied, both the duration T and the hidden population H (t 0 ) vary widely. Among all the systems, when the first confirmed case was reported, the hidden population ranges from hundreds to thousands. officially reported date. Inverse inference based on our non-Markovian SHIJR model for COVID-19 leads to a consistent answer confirming the speculations: the virus first emerged in the U.S. at the very beginning of the year. Our confidence in this result comes from our model that has been comprehensively constructed and rigorously tested in terms of the following three aspects. First, the designation of the model states and their dynamical evolution are fully in accord with the known characteristics of COVID-19, subject to government control measures. Second, the key model parameters are optimally estimated using empirical data from an adequately long time period including the initial growth phase of the epidemic. Third, the model has been validated with its predictive power firmly established through a comparison between the model generated and real data of the daily accumulative number of confirmed cases in a 14-day period that is not involved in parameter estimation. All these have been done for 10 U.S. states and New York City, the U.S. as a whole, plus three European countries most severely hit by the COVID-19 pandemic. Particularly worth noting is that our inverse procedure gives two results that are mutually consistent: the virus was already in Europe as early as the end of December 2019 and the earliest possible date for New York City to have the virus is around January 10. This consistence gives credence at a quantitative level to the widely believed proposition that the virus in New York City was from Europe through air travel [1].
Our study has demonstrated that, for a given system (a state, a city, or even a country), open or closed, our non-Markovian SHIJR model is capable of yielding an estimate of time zero and generating the possible epidemic trajectories into the future based on limited data with the power to predict the most likely epidemic scenario. With the inclusion of appropriate optimization and inverse inference procedures, the predictive modeling framework represents a contribution to mathematical and computational epidemiology, going beyond the existing models and offering a general and comprehensive paradigm applicable not only to COVID-19 but also to future pandemics. In addition, the framework developed in this paper can be an accurate and reliable tool or source for governments at all levels, enabling not only an accurate assessment of the government testing and surveillance capabilities for the infectious disease, but also a comprehensive evaluation of the effects of government imposed measures to control the disease. This can provide guidance for optimizing these measures to save human lives and for determining the optimal time for reopening to minimize the economical and social impacts.  Figure 9 illustrates the generalized model. An individual can be in one of the five states at each time step: susceptible (S), hidden (H), infected (I), confirmed and isolated (J), and removed (R). The states S, I, and R have the same meanings as in the classical SIR model for infectious disease, but states H and J are unique for COVID-19. In particular, an individual in H has had the virus and is infectious but is asymptomatic or only mildly symptomatic, in contrast to the I state in which individuals show symptoms. The H individuals are capable of infecting others. The J state contains individuals who are confirmed with COVID-19. Individuals in the R state, by definition, are not infectious.
We treat each state in the U.S. as an open system, regarding the influences from all the other states as perturbations, mathematically represented by the populations moving into and out of the S and H states, denoted as S in (t ), S out (t ), H in (t ), and H out (t ), respectively, as shown in Fig. 9. These functions are determined by the travel intensity as a function of time. Two types of travel need to be distinguished: interstate and intrastate, with the corresponding intensity functions l inter (t ) and l intra (t ). Due to government-imposed travel restrictions, these functions decay exponentially from an initial value to a final smaller constant value. For instance, a recent estimate [17] has given that, for several major U.S. cities, the travel restrictions would reduce the outbound human movements by 50%. In general, we have l inter (t ) = 1, e −λ inter (t−t c ) , and e −λ inter (t s −t c ) for t < t c , t c t t s , and t > t s , respectively, and TABLE IV. Additional information and results for 10 states in the U.S., New York City, Italy, Spain, and the United Kingdom. The