Article Text

## Abstract

**Objectives** Assessment of recuperation and death times of a population inflicted by an epidemic has only been feasible through studying a sample of individuals via time-to-event analysis, which requires identified participants. Therefore, we aimed to introduce an original model to estimate the average recovery/death times of infected population of contagious diseases without the need to undertake survival analysis and just through the data of unidentified infected, recovered and dead cases.

**Design** Cross-sectional study.

**Setting** An internet source that asserted from official sources of each government. The model includes two techniques—curve fitting and optimisation problems. First, in the curve fitting process, the data of the three classes are simultaneously fitted to functions with defined constraints to derive the average times. In the optimisation problems, data are directly fed to the technique to achieve the average times. Further, the model is applied to the available data of COVID-19 of 200 million people throughout the globe.

**Results** The average times obtained by the two techniques indicated conformity with one another showing p values of 0.69, 0.51, 0.48 and 0.13 with one, two, three and four surges in our timespan, respectively. Two types of irregularity are detectable in the data, significant difference between the infected population and the sum of the recovered and deceased population (discrepancy) and abrupt increase in the cumulative distributions (step). Two indices, discrepancy index (DI) and error of fit index (EI), are developed to quantify these irregularities and correlate them with the conformity of the time averages obtained by the two techniques. The correlations between DI and EI and the quantified conformity of the results were −0.74 and −0.93, respectively.

**Conclusion** The results of statistical analyses point out that the proposed model is suitable to estimate the average times between recovery and death.

- COVID-19
- epidemiology
- public health

## Data availability statement

Data are available upon reasonable request.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

## Statistics from Altmetric.com

### Strengths and limitations of this study

Data were drawn from the Kaggle website comprised of cumulative infected, recovered and deceased population affected by COVID-19.

We used the curve fitting technique to propose a multimodal delayed outcome-based compartmental technique.

The process of curve fitting functions as a denoising agent to facilitate achieving a better compartmental model.

The study does not have individual-level data, nor an individual assessment of digital access, use or competency.

The study is limited by being cross-sectional rather than interventional or prospective.

## Introduction

Virus epidemics have been known as one of the major health issues leading to a high mortality rate in human communities throughout history. The Spanish influenza emerged in 1918, caused about 50 millions deaths for just over 2 years.1 Also, since the early reports of HIV infection in 1980, more than 36 million of deaths have been reported around the world due to virus infection until the end of 2020.2 Tragically, not only these older global epidemics but also the local spread of SARS, MERS and Ebola viruses in recent years causing disruptions in functions of societies, implies that the health authorities are not ready yet for such crises.3

The COVID-19 is a new strain that has not been previously identified in human. Most people infected with the COVID-19 virus, the cause of the recent pandemic across the globe would experience mild-to-moderate respiratory illness and recover without requiring special treatment. Older people, and those with underlying medical problems such as cardiovascular diseases, diabetes, chronic respiratory diseases and cancer, are more likely to develop serious illnesses.4 Since December 2019, a growing number of cases for the COVID-19, a worldwide health disaster of this time, has been discovered initially in China.

A variety of models and simulations have developed as important decision tools that can be useful to control human and animal diseases.5 However, since each disease exhibits its own particular biological characteristics, the models need to be adapted to each specific case in order to be able to tackle real situations.6

Among the epidemiological models, the most common model used to study the spread of COVID-19 are the compartmental models. In such models, there is a susceptible population, which is assumed to be equal to the population of whichever region is being examined minus the number of people that have previously had the disease. Some of the susceptible individuals get infected in each period, where the rate of infection is a function of the number of infected individuals as well as other factors that shift the rate of transmission. Finally, infected individuals, according to the selected model, either move to the removed compartment or again to the susceptible compartment.**7 8**

Some prevalent forms of compartmental models are SIR, SI, SIS, SIRS, SEIR, SEIRS, MSIR, MSEIR and MSEIRS, among others. In all these acronyms, S, I, R, E and M stand for susceptible, infected, removed, exposed (individuals already exposed to the disease but not infected yet) and maternal (those with maternal immunity), respectively.9 These models have been applied to many emerging infectious diseases, for example, avian influenza, Ebola, HIV/AIDS and many others. The inclusion of different compartments is based on the nature of the diseases or the temporary stage of the epidemic.10

In order to be able to study the COVID-19 comprehensively, the removed compartment splits into two subcompartments, recovered (R) and deceased (D), as due to the noticeable case fatality rate (CFR) of the COVID-19, the number of deceased individuals becomes important for statistical analyses.11 Delayed SIR-alike models are introduced to account for the time delays of transmissions of individuals between different classes.12 13 Distribution fitting is the process of fitting a probability distribution to a series of data concerning the repeated measurement of a variable phenomenon. Distribution fitting aims to predict the probability or to forecast the frequency of occurrence of the magnitude of the phenomenon in a certain interval. There are many probability distributions of which some can be fitted more closely to the observed frequency of the data than others, depending on the characteristics of the phenomenon and of the distribution. The distribution giving a close fit is supposed to lead to good predictions. The conditions that need to be considered in the selection of probability distribution are the general trend, skewness (symmetry or asymmetry of the data), etc.14

The importance of estimation of the time averages to recovery and death, either via survival analysis or modelling, emerges where the facilities plan to provide the necessary supplies. Hence, in this study, we try to present a model to estimate these time averages using the process of curve fitting and confirm it by an optimisation problem model.

## Method

### Data preparation

Data were drawn from the Kaggle website comprised of the date of the observation, country-wise separation, segregation by state or province (if provided by the country) and cumulative infected, recovered and deceased population affected by COVID-19.14 In order to mitigate the disjunction in the data, the data underwent a procedure that if there is a descent in the cumulative data, the data of the inconsistent day would be replaced by a weighted average of the data of the day before inconsistency and the next valid data. As another modification, for countries with state or province-wise separation of the data, we added up the data of these states or provinces for the same day to get a dataset for the whole country. If the data of a country were missing before or after a certain day, that day would be appointed as the onset or terminal day in dataset for that country. The mathematical method that is used for data preparation is provided in online supplemental information.

### Supplemental material

### Patient and public involvement

It was not possible to involve patients or the public in the design, or conduct, or reporting, or dissemination plans of our research.

### Data categorisation

In order to be able to study the COVID-19 comprehensively, the retrieved compartment of SIR model splits into two subcompartments, R and D, because due to the noticeable CFR of the COVID-19, the number of deceased individuals becomes important for statistical analyses.11 Delayed SIR-alike models are taken to account for the time delays of transmissions of individuals between different classes.12 13 In this study, we focus on the average time of incubation, recovery and death of individuals, according to a delayed susceptible–infected–recovered/deceased (SIRD) model given in system of equations (1)–(4).

(1)

(2)

(3)

(4)

Where S, I, R and D stand for the susceptible, infected, recovered and deceased compartments, respectively.

α is the average incubation rate of an individual.

β is the average recovery rate of an individual.

γ is the average death rate of an individual.

τ_{1} is the average incubation time of a susceptible individual.

τ_{2} is the average recovery time of an infected individual.

τ_{3} is the average death time of an infected individual.

### Curve fitting and goodness of fit (GoF)

After modification of the secondary data, the frequency distribution of each category of the secondary data was plotted against the number of days from the onset of the outbreak to determine the number of peaks to choose to fit a cumulative Gaussian distribution to each of these classes.

The validity of the fitted models was checked via three GoF indices, namely Pearson’s reduced χ^{2} statistics, root mean square error of approximation and standardised root mean square residual.

The cumulative Gaussian fitted to the modified secondary data for some countries could not pass the GoF criterion. So, by looking at the frequency distribution of these countries, it was decided to put skewed normal distribution and hence its cumulative function to use.

The steps to obtain time averages to recovery and death via curve fitting is explained in the following steps to be replicable:

Download the secondary data from the internet source.

List the countries in the dataset.

Pick a country to study.

If the data is given state wise or province wise, add up number of people in each category.

If the data (given cumulatively) show a descent, modify the data according to online supplemental equations s1 and s2.

Determine the number of waves according to the frequency plot of the three categories of the data (visually determined).

Determine the initial guesses of the parameters given in online supplemental equation s4 for each wave.

Minimise online supplemental equations s6–s8 simultaneously with regard to the constraints given in (online supplemental equations s9–s11.

Calculate the GoF indices and determine whether the fit is acceptable according to the defined criteria of the indices or not.

If the fit was acceptable proceed to step 14; if not acceptable, then minimise the skewed Gaussian error functions simultaneously with regard to the constraints (online supplemental equation s11, s20 and s21) (similar to step 8).

Repeat step 9. If the criteria are met, proceed to step 14; otherwise calculate the discrepancy (equation (5)) and step (equation (6)) indices.

Fit the last function fitted to the data but this time omit constraint (online supplemental equation s11) (checking for steps in the data).

Check whether the discrepancy and step indices pass the criteria to identify the incoherency in the data.

Subtract the mean of infected peak from the means of recovered and deceased peaks to attain the average time to recovery and time to death of the peak rendered by curve fitting method, respectively. Repeat for the number of waves determined in step 6.

All detailed procedure about Gaussian and skewed Gaussian distributions, and GoF can be found in online supplemental information.

### SIRD compartmental model

A quite suitable compartmental model was selected for our study, by the process of elimination of irrelevant ones. SI and SIS models were eliminated because infected individuals either recover or die,15 so the models excluding the removed compartment are unfit. The models involving exposed compartment are unsuitable for our study, as there is no data for a mediate class between susceptible and infected group in our dataset. Additionally, accounting for the exposed compartment requires predictions and estimations.16 Since there is not enough evidence of maternal immunity and also lack of data,17 the models involving maternal compartment would be eliminated. Finally, the models such as SIRS, in which the individual from the removed compartment, would be retransmitted to the susceptible compartment is ruled out, because there is not a strong claim for reinfection of population of the removed compartment in a single surge.18 The SIR version of compartmental models seemed appropriate for our dataset, where the removed compartment was divided into two moieties, recovered and deceased.

The numerical solution method and the computation software are introduced in the online supplemental information.

### Optimisation problem

The objective here is to find the optimum values of *τ*_{2} and *τ*_{3}, and to compare these values with those achieved as explained in Gaussian fit and skewed Gaussian fit sections. For this purpose, two optimisation problems were defined (method I and method II) and discussed in the online supplemental information.

The steps to obtain time averages to recovery and death via optimisation problems are explained in the following steps:

Download the secondary data from the internet source.

List the countries in the dataset.

Pick a country in the dataset.

If the data is given state wise or province wise, add up number of people in each category.

If the data (given cumulatively) show a descent, modify the data according to equations (online supplemental equations s1 and s2).

If the number of waves is more than one, the time interval of each peak is found out by utilising equations (online supplemental equations s37 and s38).

Method I: Assign the initial value of t

_{3}(the initial value of t_{3}is t_{3}^{1}).Minimise equation (online supplemental equation s25) and obtain t

_{2}^{1}for each day (t) (t_{3}^{1}is fixed). Note that t+t_{2}^{1}cannot exceed the number of days from the onset of pandemic or the limit set in step 6.Minimise equation (online supplemental equation s25) and obtain t

_{3}^{2}for each day (t) (the average of t_{2}^{1}is calculated and fixed).Minimise equation (online supplemental equation s25) and obtain t

_{2}^{2}for each day (t) (the average of t_{3}^{2}is calculated and fixed).Repeat steps 9 and 10 recursively until the criterion in equation (online supplemental equation s26) is met.

Rewrite t

_{3}as the product of t_{2}(variable) and the ratio of the averages of t_{3}^{fin}to t_{2}^{fin}(parameter) as given in equation (online supplemental equation s29).Minimise equation (online supplemental equation s28). The average of t

_{2}renders the value of τ_{2}and likewise τ_{3}of method I is achieved.Method II: Divide the number of days of each wave to 3 (the initial number of divisions).

Assign the initial values of t

_{3}.Minimise equation (online supplemental equation s35) (Periodic Fatality Ratio (PFR) is fixed and t

_{3}is the variable to optimise).Recalculate the values of PFR with regard to the new values of t

_{3}(PFR is a vector with the size equal to the number of divisions of days for each wave).Minimise equation (online supplemental equation s35) again to obtain t

_{3}^{2}.Repeat steps 17 and 18 recursively until the criterion in equation (online supplemental equation s26) is met for t

_{3}.Minimise equation (online supplemental equation s36) to obtain t

_{2}.Calculate the SE of t

_{2}and t_{3}.Increase the number of divisions of days of each wave by 1 and repeat steps 15–21.

Repeat step 22 until the number of days in each division is at least 2.

Calculate the average of t

_{2}and t_{3}having minimum values of SE to achieve τ_{2}and τ_{3}of method II.

Compare the SE of t_{2} and t_{3} obtained from step 13 and step 24 (method I and method II). The smaller ones are selected to fill tables 1–4 for optimisation problem.

## Results

### Gaussian curve fitting

The data of each country went through data preparation to assure consistency of the cumulative data (figure 1A), and then were subjected to the curve fitting procedure using sequential quadratic programming algorithm to find the ordinary least squares of errors defined in equations (online supplemental equations s6–s8) or their corresponding defined errors for skewed Gaussian fit. The results indicated an acceptable GoF for most of the countries and were able to point at the illogical jumps in the data and regulate it as well, as illustrated in figure 1B. The overall GoF was rejected for the cases (countries) with jumps in their data (steps) by our defined criteria, but the piecewise GoF before and after the steps satisfied the criteria. Yet, there existed some cases with too many steps, which could not satisfy any piecewise GoF in any arbitrary set of intervals as shown in figure 1C. The comparison between fitting power of Gaussian curve fitting and skewed Gaussian curve fitting is given as an example in figure 2. Although scaling the difference between the two fitting functions is not clearly visible on the cumulative data, it can be observed that skewed Gaussian curve fittings follow the trend of frequency distribution figures much better. The subtraction of the infected population mean from the mean of its corresponding recovered and dead peak renders average times to recovery and death (τ_{2} and τ_{3}) derived by curve fitting technique, respectively.

In equations (online supplemental equation s3, s4 and s17, s18), the parameters a and σ represent population in each category and the timespan of a wave, respectively, and the parameters µ, and ζ are connected to the position of each wave in timespan and the relation between rise to decay of each wave, respectively.

### SIRD compartmental model

The improved SIRD model was implemented in two cases to model the COVID-19 breakdown for different countries. In Case 1, the model was directly fitted to the prepared data extracted for those countries that showed only one peak in our dataset, because of the inability of the SIRD models with the rate of incubation, recovery and death as time-invariant variables to fit to multiple peaks. In this model, the average incubation, recovery and death times are considered variable to obtain the optimal fit. In the second method, the improved SIRD model is fitted to the compartments derived by the Gaussian model. In this method, each surge is separated and the compartments are constructed as explained in the SIRD compartmental model subsection. In this case, the average incubation time is considered a variable and the average recuperation (recovery) and death times are considered parameters drawn from the difference between the mean of corresponding peaks (ie, the subtraction of the mean of the infected peak from the mean of the recovered peak which renders average time to recovery and similarly the average time to death is calculated). The data of countries with only one peak went through both cases of SIRD modelling as a means of comparison between the two methods as shown in figure 3A–C. The sum of absolute error is calculated for both cases with respect to the prepared data, as the observation (not the Gaussian model fitted to the observation) and the average and the SD of the ratio of the sum of absolute normalised error for case 1 (directly fitted to the prepared data) to case 2 (fitted to the Gaussian) for 10 countries with 1 peak in their data is calculated (1.3951±0.3275). The impediment of using SIRD model with invariant τ_{2} and τ_{3} to multiple-peaked data is solved in case 2 as shown in figure 3D–F.

### Optimisation problems

The results of the optimisation problems are acquired after the comparison between the SE of t_{2} and t_{3} vectors of the two methods. This scenario rendered incoherent results for the countries with steps or discrepancy in their data.

The discrepancy between values of the two methods of estimation of the CFR as introduced in equations (online supplemental equations s31 and s32 will impede the correct calculations of the time averages to recovery and death via the optimisation problem; however, they can be obtained from the curve fitting procedure while neglecting the prior given in equation (online supplemental equation s11), because the error of curve fitting for these countries, as we put equation (online supplemental equation s11) in use, is relatively high (eg, where the value of reduced χ^{2} GoF statistics is higher than 3), in such cases, the constraint of curve fitting presented in equation (online supplemental equation s11) is disregarded. As illustrated in figure 4A, as the number of the days from the onset of the pandemic grows, the CFR estimate approaches a horizontal asymptote. The intercept of this asymptote is equal to the estimation of the CFR, which is determined by applying equation (online supplemental equation s32) at the terminal data of the dataset. As mentioned earlier, equations (online supplemental equations s31 and s32) can be used to estimate the CFR of countries; however, equation (online supplemental equation s32) is more likely to render a better estimate. It is predictable that in countries with consistent data, the estimation of the CFR from both equations (online supplemental equations s31 and s32 approach the same asymptote unlike figure 4B.

### Hypothesis test

In order to investigate the nearness of the values of time averages obtained via the two techniques, a hypothesis test was performed on the data with acceptable values of the discrepancy index (DI) and the error of fit index (EI, equation (6)). DI and EI are defined in the next section. The null hypothesis is that the expected value of the normalised error between the values achieved via the two techniques is zero (H_{0}: E((τ_{j,i,op}*−*τ_{j,i,cf})/τ_{j,i,cf})=0, where the subscript j can be 2 or 3, i is the peak number, and op and cf subscripts state that whether the average time is calculated by the optimisation problem of curve fitting, respectively) and the alternate hypothesis (H_{1}) is that the expected value of the mentioned variable is non-zero. The test is carried out for countries with identical number of peaks separately and their p values are given in tables 1–4. The result of the test clarified that the null hypothesis could not be rejected in a two-tailed test with 5% level of significance.

## Discussion

Since epidemics are a great concern for jeopardising the healthcare of humanity, it is important to develop more comprehensive models to assimilate their behaviour as thoroughly as possible. The ability to predict and estimate the average time to recovery and death of the infectious population of an epidemic, in addition to other factors, is imperative to maintaining and providing necessary stocks that healthcare institutes need for confronting the decease.

In the dataset, there were some countries which could not be fitted to either of Gaussian and skewed Gaussian model while applying the mentioned priors. After the examination of the data of such countries, it was found out that the reported numbers are not coherent (ie, addition of cumulative recovered and deceased population does not add up near to cumulative infectious data). In such cases, the prior remarked in equation (online supplemental equation s11) must be excluded so that the GoF criteria can be met. Nevertheless, the results of the fit of the data to these countries can be involved in the calculation of further statistical parameters, unless the value of DI (given in equation (5), is above 0.1. This error is calculated directly from the last day of our dataset and also from the amplitudes of the fitted Gaussian model for the discrepant countries. Another shortcoming for our dataset is that the data of one or more classes ceased to be presented after a specific day, which compels us to prune the data of those countries up to that day or omit them in our calculations.

The countries with steps in their cumulative data do not hinder the implementation of the improved SIRD model, but the countries that have shown discrepancy in their data (having a DI greater than 0.1) do prevent carrying it out, because eventually, the infectious compartment will turn into either of recovered or deceased compartment, and existence of discrepancy makes an acceptable fit impossible. The results of the comparison between the two cases indicate that the model fitted by case 2, although it is not directly fitted to the observation, has a better fit for the most countries (ie, showed relatively a smaller sum of absolute error), even though case 1 has a greater degree of freedom (ie, also average time to recovery and death are variables of fit). This phenomenon can be attributed to the uncertainty of parameter estimation when fitting an epidemiological model to a noised dataset.19 This proves that our suggested method is not only conducive to model a multimodal epidemic, but also is favourable to estimate the parameters of unimodal epidemiological models with.

The results of estimation of the CFR for some countries from our dataset were quite conflicting with what is reported from WHO as an average.20 This can be attributed to either of the inability of the officials to account for all of the infected population for some reasons (eg, preference for home treatment) in contrast to ‘the countability of the deceased population’, or data manipulation or other errors. As a consequence, these results prevent researchers from performing credible multivariate analyses including the CFR factor, yet it can be used as a mean to roughly estimate the true infected population of countries. The convergence of the values of CFR estimated by the two methods confirms that equation (online supplemental equation s11) is applicable for the considered country and the DI is within acceptable range, hence it can be applied for further statistical analyses.

Tables 1–4 provides the peak-wise time averages to recovery and death for countries obtained from the two techniques (curve fitting and optimisation problem) introduced in the model development section. The descending trend of the p values as the number of peaks grow can be attributed to the negative influence that overlapping of the peaks have on estimation accuracy of the two techniques. Contrary to the nearness of most of the average time values obtained by method I of the optimisation problem to those obtained by the curve fitting process, the deceased class exhibited relatively high values of SE, and it is caused by the difference in magnitude between recovered and deceased population; a small deviation in the number of recovered from its mean for a specific day causes drastic change in the calculation of the estimation of time to death for that day, causes the SE of t_{3} to escalate.

Even though the corresponding average times to recovery and death drawn from the two techniques for most countries showed an acceptable agreement, there existed three types of inaccuracy which caused a disjunction in the average times, obtained by the two techniques or caused one or both techniques to malfunction and therefore not being able to produce a result. The first cause of inaccuracy is the lack of data on one or more categories of the people inflicted by the virus, this inaccuracy leads both techniques to produce invalid results. The second inaccuracy is when the difference in the two methods of estimation of the CFR is significant. This type of inaccuracy affects the optimisation problem in a way that it may not be able to render an output. The third inaccuracy occurs when there is an abrupt increment (steps) in the cumulative data of the three classes. This will cause a disjunction in the results of the two techniques for the specific peak where the step is observable or for more extreme cases causes the inability of optimisation problems.

The existence of these inaccuracies leads us to develop indices to quantify them. The DI is defined in equation (5):

(5)

The DI is defined in a way that varies between 0 and 1 and it has a steep slope between the values 0.01 and 0.1 of |CFR_{2}–CFR_{1}|. The EI (briefly, the error index) is developed to quantify the frequency and amplitude of steps in the data by means of comparing the cumulative data with the fitted cumulative distribution to the data. This comparison is made via the reduced χ^{2} GoF statistics. Since the reduced χ^{2} GoF is calculated for each class of our data separately, EI includes those values of reduced χ^{2} GoF that are above our defined acceptable range. If more than one of the classes of the data has a reduced χ^{2} value above the acceptable range, the geometric mean of those values is applied to calculate the EI index. For countries with multiple peaks of the pandemic, the piecewise reduced χ^{2} GoF is calculated for each peak whose bounds are determined by using equations (online supplemental s37 and s38). The *EI* is formulated in a way to vary between 0 and 1 too and is given in equation (6):

(6)

The DI of the countries with discrepancy in their data and the arithmetic mean of the ratios of the average times of the two techniques is given in table 5.

The ‘time ratios’ that are defined to correlate with the DI are formulated to be ≤1, for such purpose, the shorter τ_{2} and τ_{3} are divided by the longer one extracted by the two techniques, hence their average is between 0 and 1. The values of unavailable τ_{2} and τ_{3} for these calculations are deemed to be zero. The correlation between DI and the average of the ratios is −0.74 (figure 5A). Some of the causes of relatively large deviation of scattered data from the regression line are combination with the effect produced by the steps for most countries with discrepancy, inability to rule out the effect of a class and relate the DI to one of the time averages because the discrepancy is calculated using all three classes, and the inability in shrinking the multiple time averages of multiple peaks and not being able to determine which peak caused the discrepancy.

Three causes come to mind with regard to the relatively large deviation of scattered data from the regression line. The first one is combination with the effect produced by the steps for most countries with that of discrepancy. The second one is the inability to rule out the effect of a class and relate the DI to one of the time averages, because the discrepancy is calculated using all three classes. The last cause is the inability in shrinking the multiple time averages of multiple peaks and not being able to determine which peak caused the discrepancy. Nevertheless, the value of this correlation is large enough in magnitude to infer the existence of a negative correlation between the two variables.

The EI of the countries with reduced χ^{2} values more than 3 and the geometric mean of the ratios of the time average of the specific peaks where the step is observed between the two techniques is given in table 6.

The defined ‘time ratios’ are formulated to be smaller than 1, like those of the DI, therefore their geometric mean is also between 0 and 1. Similar to the calculation of the DI, the values of unavailable τ_{2} and τ_{3} are deemed to be zero. When there is no step in recovered or deceased data, the τ_{3} or τ_{2} is ignored, respectively, and geometric mean is not used for the ‘time ratios’. On the other hand, when there is a step in infected data and/or in both recovered and deceased data, the geometric mean is considered for the ‘time ratios’. The correlation between EI and the geometric mean of the defined ratios is −0.93 (figure 5B). The causes of relatively small deviation of scatter data from the regression line are exactly opposite of the causes mentioned for the relative high deviation of the data of DI. For example, fewer countries with steps have manifested discrepancy effect. For some countries one of the recovered or deceased classes can be dispensed with in the calculation of the defined ratios for EI as long as it does not show a reduced χ^{2} value over 3 but if the infected class contains a step in its data, no classes are excluded from calculation. Another constituent of this relatively large correlation is that in bimodal or multimodal distributions, the specific peaks, and consequently their EI via piecewise GoF, where steps have occurred are extractible. These correlations establish the validity of the two techniques to estimate the average times to recovery and death of COVID-19 statistically as a case study.

The data on infected and recovered population, despite the data of deceased population, is acquired via specific test of the infection and the accuracy of this data partly depends on the tendency of individuals to check on their health status and this might affect the time averages and moreover the CFR factor of the epidemiological disease. These observations pose threat to the validity of the resultant data, but if the dataset is well-labelled with respect to its validity or if there exists a prior knowledge of the epidemiology through precise statistical analyses, the true values of interest can be calculated. Further, the valid data of the time averages can be subjected to multivariate analyses and machine learning algorithms to investigate effective factors and predict these time averages in different conditions, which is out of scope of the current study.

Spatial stratified heterogeneity (SSH) of the time-to-event data (tables 1–4) is measured.21 22 Since the study of heterogeneity in current study is not spatiotemporal, the average data of all the waves of the pandemic is allocated for measuring SSH. The mean of the data resulted by the two techniques is used for attributing a single time-to-event data to a country. For those countries where an average time-to-event via one technique was not available, the other technique is used to assign the time-to-event data to that country. The data are partitioned into six strata by minimising the within-strata variances and maximising the between-strata variance. The result shows extreme heterogeneity with q-statistic results being 0.9339 and 0.9397, and f-test results being 358.65 and 361.75 for average time to recovery and time to death of countries, respectively. The results of f-tests indicate that heterogeneity of the study population for the selected variables. These results are also projected in figure 6 in which spatial stratification of times to recovery and death for all countries is given. The number of countries with DI and EI larger than the critical values of their corresponding indices are about half of the available countries data, therefore the observed heterogeneity is partly caused by such phenomenon. This can intervene in the analysis of findings and deviate the epidemiological interpretations from their origin.

In the current study, two techniques, curve fitting and optimisation problem, are developed to calculate the recuperation and death average time of epidemics based on their observed data. The curve fitting process is a utility for fitting better unimodal SIRD model and are also able to generate SIRD models with two or more peaks. The result of the estimation of the time averages by the two techniques agreed to one another in absence of irregularities. It was also shown that the closer the peaks are to each other, the less is the conformity between the results of the two techniques. Two indices are defined to correlate the level of mismatch in the result of the two techniques for estimating the average times and both indicated strong negative correlations between intensity of irregularities and level of matching the results of the two techniques. Further to this study, the findings of this work can be subjected to statistical analyses to correlate them with determining factors (eg, socioeconomic parameters, geography, seasonal changes, etc) to achieve a better understanding of their underlying behaviour for COVID-19 pandemic.

## Data availability statement

Data are available upon reasonable request.

## Ethics statements

### Patient consent for publication

### Ethics approval

Not applicable.

## Acknowledgments

We wish to thank Mrs. Akram Kia from Social Determinants of Health ResearchCenter for her great assistance in registration of our study.

## Supplementary materials

## Supplementary Data

This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

## Footnotes

Contributors AR, EG, DH-M, FF and HH-M provided a substantial contribution to the design and interpretation of the paper and revised drafts. AR initiated the project and wrote the initial draft. AR and HH-M are the guarantors for the study.

Funding Shahid Beheshti University of Medical Sciences, award no 33247.

Map disclaimer The inclusion of any map (including the depiction of any boundaries therein), or of any geographic or locational reference, does not imply the expression of any opinion whatsoever on the part of

*BMJ*concerning the legal status of any country, territory, jurisdiction or area or of its authorities. Any such expression remains solely that of the relevant source and is not endorsed by*BMJ*. Maps are provided without any warranty of any kind, either express or implied.Competing interests None declared.

Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

Provenance and peer review Not commissioned; externally peer reviewed.

Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.