A Time Series Model to Forecast COVID-19 Infection rate in Nigeria Using Box-Jenkins Method



Introduction
Coronavirus disease, popularly known as COVID-19, is a severe viral disease caused by the contagious Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). It belongs to the genus 'coronavirus' of the Coronaviridae family (Sahin, 2020). It is characterized by crown-shaped peplomers 80-160 nm in size (the name "coronavirus" is derived from the Greek κορώνα, meaning crown). The genome of CoV contains a linear, single-stranded RNA molecule of positive (mRNA) polarity, about 28-32 kb in length (Woo et al., 2020). It was first discovered in Wuhan, Hubei Province, China, in December 2019 (WHO, 2020) and has since spread to over 200 countries of the world. On March 11th, 2020, the World Health Organization declared the outbreak a pandemic. COVID-19 is currently a major worldwide threat to human existence and has caused the largest global recession. It has been spreading rapidly, with a considerable impact on global morbidity, mortality and healthcare utilization (Rauf and Oladipo, 2020). As of 31st October, 2020, the world had registered over 46.4 million confirmed cases of the virus, with 1,200,565 recorded deaths and 33,493,349 recoveries.
On February 27, 2020, Nigeria recorded its first case of COVID-19. The index case was an Italian citizen who arrived in Nigeria via the Murtala Mohammed International Airport, Lagos, at 10pm aboard a Turkish airline flight from Milan, Italy. Since then, there has been an exponential rise in the number of confirmed cases of the virus. As of 31st October, 2020, Nigeria had 62,853 confirmed cases, 58,675 discharged patients and 1,144 recorded deaths (NCDC, 2020). Curtailing the infection rate, preventing transmission and reducing deaths is the goal of every society. How many persons will be infected on a daily basis, how to manage them, and how the disease will occur in future are stochastic (uncertain), and the effect of the intervention strategies employed by the government relies greatly on past and future trends of the pandemic. Because the trend varies, it is pertinent to construct a realistic model that will help policy makers, the medical field, government and other relevant authorities to understand the components of the series, control the global epidemic threat, and forecast the possible number of daily infections. This will prepare the healthcare system for upcoming cases. Using statistical models to study the trend of the pandemic in Nigeria can provide critical information for responding to outbreaks and for understanding the impact of the strategies employed by the government in containing the spread of the disease.
Time series modeling is a dynamic area that carefully collects and rigorously studies past observations to develop an appropriate model which describes the inherent structure of the series and is also used to generate future values (Cochrane, 1997). Time series forecasting is the act of predicting the future by understanding the past (Raicharoen et al., 2001). One of the most popular and frequently used stochastic time series models is the Autoregressive Integrated Moving Average (ARIMA) model (Zhang, 2003). During the ongoing pandemic, some research publications have focused on epidemiology, trend analysis and forecasting for different cities and countries. These studies presented long-term and short-term trends using time series data from relevant databases and offered forecasting applications using models such as the ARIMA model, exponential smoothing methods, the SEIR model and regression models.
Applying a purely data-driven statistical method, Yang et al. (2020) estimated the case fatality rate (CFR) for COVID-19 in three clusters: Wuhan city, other cities of Hubei province, and other provinces of mainland China. A simple linear regression model was applied to estimate the CFR for each cluster. The results showed that the CFR during the first weeks of the epidemic ranged from 0.15% (95% CI: 0.12-0.18%) in mainland China excluding Hubei, through 1.41% (95% CI: 1.38-1.45%) in Hubei province excluding the city of Wuhan, to 5.25% (95% CI: 4.98-5.51%) in Wuhan. Their results indicate that the CFR of COVID-19 was lower than that of the previous coronavirus epidemics caused by SARS-CoV and Middle East respiratory syndrome coronavirus (MERS-CoV).
To study the epidemic trend of COVID-19 in mainland China, Hubei province, Wuhan city and other provinces outside Hubei from January 16 to February 14, 2020, Zhu et al. (2020) used data from the National Health Commission to generate the epidemic curve of new confirmed cases, the period-over-period multiple of new confirmed cases, the fixed-base multiple of new confirmed cases, and the period-over-period growth rate of new confirmed cases. From January 16 to February 14, 2020, the cumulative number of new confirmed cases of COVID-19 in mainland China was 50,031, including 37,930 in Hubei province, 22,883 in Wuhan city and 12,101 in other provinces outside Hubei.
Fanelli and Piazza (2020) analyzed the temporal dynamics of the COVID-19 outbreak in China, Italy and France over the period January 22 to March 15, 2020. A first analysis of simple day-lag maps pointed to some universality in the epidemic spreading. The analysis of the same data within a simple susceptible-infected-recovered-deaths model indicated that the kinetic parameter describing the rate of recovery appeared to be the same regardless of the country, while the infection and death rates appeared to be more variable.
Piccolomini and Zama (2020) also proposed a modification of the Susceptible-Exposed-Infected-Recovered-Dead (SEIRD) differential model for the analysis and forecast of the COVID-19 spread in some regions of Italy. They introduced a time-dependent transmission rate and reported the maximum infection spread for the three Italian regions first affected by the COVID-19 outbreak (Lombardia, Veneto and Emilia Romagna). Danon et al. (2020) applied an existing national-scale metapopulation model to capture the spread of COVID-19 in England and Wales. They used data on population sizes and population movement, together with parameter estimates from the outbreak in China, and were able to predict the peak of the outbreak after person-to-person transmission was established in England and Wales. Jit et al. (2020) fitted an exponential growth model to critical care admissions from multiple surveillance sources to study likely COVID-19 case numbers and progress in the United Kingdom from February 16 to March 23, 2020. They estimated that on March 23 there were 102,000 (median; 95% credible interval 54,000-155,000) new cases and 320 (211-412) new critical care reports, with 464,000 (266,000-628,000) cumulative cases since February 16. Prashant et al. (2020) applied ARIMA and fuzzy models to forecast the COVID-19 outbreak in India. Both models suggested an exponential rise in COVID-19 cases in the near future.
Rauf and Oladipo (2020) applied the Box-Jenkins procedure to forecast the spread of COVID-19 in Nigeria. The ARIMA (1, 1, 0) model was selected as the best fit for the dataset. A limitation of that study was its 10-day forecast horizon. The main aim of the present study is to employ the Box-Jenkins modeling approach to develop a model and apply it to forecast future incidences of COVID-19 disease in Nigeria, using a more robust dataset and projections of future occurrences.
The specific objectives are to:
i. Develop a time series model that will identify the trend of COVID-19 occurrence in Nigeria
ii. Estimate the parameters of the developed model
iii. Diagnose the model
iv. Predict the future incidence of COVID-19 disease in Nigeria.

Data and Source
Confirmed cases of COVID-19 infection in Nigeria are collected by the Nigeria Centre for Disease Control (NCDC). Data were therefore extracted from the official website of the NCDC (http://www.ncnc.org) from February 11, 2020 to October 31, 2020 (7 months) to build a predictive model.

Procedures
The Box-Jenkins method was employed in building the Autoregressive Integrated Moving Average (ARIMA) model. This is an iterative three-stage approach to modeling (model identification, parameter estimation and diagnostic checking), as shown in the diagram.

Postulating a general class of ARIMA
The selection of a proper model is extremely important as it reflects the underlying structure of the series and this fitted model in turn is used for future forecasting.
A linear time series model was considered, in which the current value of the observed series is a linear function of past values and random errors. Different univariate time series models are used in the literature, such as the Autoregressive, AR (p), and Moving Average, MA (q), models (Hipel and McLeod, 1994). The combination of these two models forms the Autoregressive Moving Average (ARMA) model. However, in this study, the Autoregressive Integrated Moving Average (ARIMA) model is considered.
The ARIMA model is a transformed ARMA model: it combines the Autoregressive (p) and Moving Average (q) components and transforms the series from a non-stationary one to a stationary one (constant mean and variance). In an Autoregressive (p) model, the future value of a variable is assumed to be a linear combination of p past observations and a random error, together with a constant term.
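Mathematically, the Autoregressive (p) model can be expressed as (Lee, 2010)

$$y_t = c + \sum_{i=1}^{p} \varphi_i y_{t-i} + \varepsilon_t,$$

where $y_t$ is the observed value at time $t$, $c$ is a constant, $\varphi_1, \dots, \varphi_p$ are the autoregressive parameters and $\varepsilon_t$ is the random error. In a Moving Average (q) model, rather than regressing against past values of the series, the model uses past errors as the explanatory variables. The MA (q) model is given by (Lee, 2010)

$$y_t = \mu + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \varepsilon_t,$$

where $\mu$ is the mean of the series and $\theta_1, \dots, \theta_q$ are the moving average parameters. The random shocks $\varepsilon_t$ are assumed to be a white noise process.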
As stated earlier, Autoregressive (AR) and Moving Average (MA) models can be combined to form a more general and useful model known as the ARMA model. Mathematically, an ARMA (p,q) model is represented as (Lee, 2010)

$$y_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j},$$

with $\varphi_p \neq 0$, $\theta_q \neq 0$ and $\sigma_\varepsilon^2 > 0$, where $\sigma_\varepsilon^2$ is the variance of the zero-mean white noise $\varepsilon_t$. Usually, ARMA models are manipulated using the lag operator notation (Lee, 2010). The lag or backshift operator is defined as

$$L y_t = y_{t-1}.$$

In practice, ARMA (p, q) models can only be used for stationary time series data. However, many time series show non-stationary behavior, and in such situations the ARIMA model is implemented instead (Hipel and McLeod, 1994). This study is no exception: the data contain a trend and exhibit non-stationary behavior, so an ARMA model is inadequate. The research therefore proposes an ARIMA model, which generalizes the ARMA model to include the case of non-stationarity. Here, finite differencing is applied to the data points to transform the non-stationary data into a stationary series.
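The mathematical formulation of the ARIMA (p,d,q) model using lag polynomials is given by

$$\varphi(L)(1 - L)^d y_t = \theta(L)\varepsilon_t.$$

That is,

$$\left(1 - \sum_{i=1}^{p} \varphi_i L^i\right)(1 - L)^d y_t = \left(1 + \sum_{j=1}^{q} \theta_j L^j\right)\varepsilon_t,$$

where p, d and q refer to the orders of the autoregressive, integrated and moving average parts of the model respectively.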
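The differencing operator $(1 - L)^d$ can be illustrated with a minimal Python sketch. The helper name `difference` and the example values are hypothetical and are not taken from the NCDC data; the sketch only shows how d rounds of differencing remove a polynomial trend.

```python
def difference(series, d=1):
    """Apply the differencing operator (1 - L) to `series` d times."""
    out = list(series)
    for _ in range(d):
        # Each pass replaces y_t with y_t - y_{t-1} and shortens the series by one.
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out


# Hypothetical counts with a linear trend (illustrative values only).
example = [10, 13, 16, 19, 22, 25]
first_diff = difference(example, d=1)   # trend removed, constant level remains
second_diff = difference(example, d=2)  # differencing again removes the level
print(first_diff, second_diff)
```

With d = 1 the linear trend in the example becomes a constant series, which is the sense in which differencing moves a trending series toward stationarity.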

Model Identification
To determine a proper model and the order of the autoregressive and moving average terms for the time series, autocorrelation function (ACF) and partial autocorrelation function (PACF) analysis was carried out. Both functions were plotted against consecutive time lags to decide which autoregressive or moving average components to include in the model.
The autocorrelation coefficient at lag k is defined as

$$\rho_k = \frac{\gamma_k}{\gamma_0}, \qquad \gamma_k = \operatorname{Cov}(y_t, y_{t-k}).$$

For an ARMA (p, q) process, the autocovariances satisfy

$$\gamma_h - \varphi_1 \gamma_{h-1} - \dots - \varphi_p \gamma_{h-p} = 0,$$

where $h \geq \max(p, q + 1)$.
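As an illustration of how the sample autocorrelation coefficients behind an ACF plot can be computed, the following Python sketch uses the usual estimator with the sample mean removed. The function name `sample_acf` and the toy series are assumptions made for this example; the approximate 95% bounds of ±1.96/√n are the standard large-sample limits for white noise.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(v * v for v in dev) / n  # lag-0 autocovariance
    acf = []
    for k in range(max_lag + 1):
        ck = sum(dev[t] * dev[t - k] for t in range(k, n)) / n
        acf.append(ck / c0)
    return acf


# Toy alternating series (illustrative only): strong negative lag-1 correlation.
toy = [1, -1] * 10
r = sample_acf(toy, max_lag=2)
bound = 1.96 / len(toy) ** 0.5  # approximate 95% white-noise limits
print(r, bound)
```

Lags whose coefficient falls outside ±`bound` are the spikes that would cross the dotted lines on a correlogram.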

Order Determination
The order of the time series model was determined either by defining a criterion for choosing the order of the model or by testing the hypothesis that the relevant model coefficients are equal to zero.
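One common order-selection criterion, and the one reported later in this paper, is the corrected Akaike Information Criterion (AICc). The sketch below shows the standard formula. The function name `aicc`, the parameter-counting rule and the candidate log-likelihoods are assumptions for illustration only; the numbers are made up and are not results from this study.

```python
def aicc(log_likelihood, k, n):
    """Corrected AIC for a model with k estimated parameters and n observations."""
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


# Made-up log-likelihoods for three hypothetical candidate orders (p, d, q).
n_obs = 50
candidates = {(1, 1, 0): -105.0, (0, 1, 1): -100.0, (1, 1, 1): -99.8}

scores = {}
for (p, d, q), loglik in candidates.items():
    k = p + q + 1  # assumption: AR and MA coefficients plus the error variance
    scores[(p, d, q)] = aicc(loglik, k, n_obs)

best_order = min(scores, key=scores.get)  # lowest AICc is preferred
print(best_order, scores[best_order])
```

The correction term penalizes extra parameters more heavily when the sample is short, which is why a larger model with a slightly better likelihood can still lose.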

Parameter Estimation
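The best estimates of the parameters of the Autoregressive Moving Average (p, q) model are collected in the vector $\hat{\beta} = (\hat{\varphi}_1, \dots, \hat{\varphi}_p, \hat{\theta}_1, \dots, \hat{\theta}_q)$, where a grid search is used to obtain the value of $\beta$ that maximizes $S(\beta)$.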
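A grid search of this kind can be sketched for the simplest case, a single moving average parameter fitted to an already differenced series. The sketch assumes that the objective S is a conditional Gaussian log-likelihood, which the text does not specify; the function names, the grid spacing and the simulated data are all hypothetical.

```python
import math
import random


def css(theta, w):
    """Conditional sum of squares for w_t = e_t + theta * e_{t-1}, taking e_0 = 0."""
    prev_e, total = 0.0, 0.0
    for x in w:
        e = x - theta * prev_e  # recover the shock implied by theta
        total += e * e
        prev_e = e
    return total


def objective(theta, w):
    """Concentrated conditional Gaussian log-likelihood, up to an additive constant."""
    n = len(w)
    return -0.5 * n * math.log(css(theta, w) / n)


def grid_search_theta(w):
    """Return the theta on a 0.01-spaced grid in (-1, 1) that maximizes the objective."""
    grid = [i / 100 for i in range(-99, 100)]
    return max(grid, key=lambda th: objective(th, w))


# Simulated MA(1) data with a known parameter (illustrative, not NCDC data).
rng = random.Random(0)
true_theta = -0.5
shocks = [rng.gauss(0, 1) for _ in range(2001)]
w = [shocks[t] + true_theta * shocks[t - 1] for t in range(1, 2001)]

theta_hat = grid_search_theta(w)
print(theta_hat)
```

Statistical packages normally use a numerical optimizer in place of a fixed grid; the grid is used here only because it makes the search explicit.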

Diagnostic checking
After fitting, the estimated model is tested to determine whether it conforms to the specification of a stationary univariate process. The Ljung-Box test is performed to test model adequacy, and the autocorrelation function of the residuals is plotted. The steps are repeated until the required adequacy is achieved.
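The Ljung-Box statistic used in this diagnostic step is Q = n(n + 2) Σ r_k² / (n − k), summed over lags k = 1, ..., h, where r_k is the lag-k autocorrelation of the residuals. The Python sketch below computes Q only; the function name and the toy residuals are assumptions for illustration. Under the null hypothesis of no residual autocorrelation, Q is approximately chi-square distributed, with degrees of freedom usually taken as h minus the number of fitted ARMA parameters.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic for residual autocorrelation up to max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [x - mean for x in residuals]
    c0 = sum(v * v for v in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total


# Toy "residuals" that are clearly autocorrelated (illustrative only).
toy_residuals = [1, -1] * 10
q_stat = ljung_box_q(toy_residuals, max_lag=1)
print(q_stat)
```

A large Q relative to the chi-square reference, as in this toy case, signals that the residuals still contain structure and the model should be revised.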

Results and Discussion
The daily number of confirmed COVID-19 infections from February 27, 2020 to September 10, 2020 was retrieved from the official website of the Nigeria Centre for Disease Control (NCDC) (http://covid19.ncdc.gov.ng/). Analysis was conducted using R and Python. Figure 2 plots the daily confirmed cases of COVID-19 infection in Nigeria. From the plot, it can be deduced that there are random fluctuations in the data which are roughly constant over time.

Decomposing the COVID-19 Series
The series is separated into its constituent components, which in this case are mainly the trend and the irregular component. The trend is estimated under the additive model by computing a simple moving average. Here, an 8-point moving average (n=8) is used to obtain the smoothed series that estimates the trend. Figure 3 shows the smoothed series. Removing the trend component leaves the irregular component.
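The smoothing step can be sketched in a few lines of Python. This is a minimal illustration assuming a trailing (not centered) window, which the text does not state; the function name and the input values are hypothetical.

```python
def simple_moving_average(series, n=8):
    """Trailing n-point moving average; the result has len(series) - n + 1 values."""
    return [sum(series[i:i + n]) / n for i in range(len(series) - n + 1)]


# Illustrative daily counts (not NCDC data).
daily = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
trend = simple_moving_average(daily, n=8)

# Under an additive model, irregular = observed - trend,
# aligning each average with the last day of its window.
irregular = [daily[i + 7] - trend[i] for i in range(len(trend))]
print(trend, irregular)
```

A wider window gives a smoother trend estimate at the cost of losing more points at the start of the series.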

Autoregressive Integrated Moving Average (ARIMA) Model
The research considered the ARIMA (p,d,q) model for the analysis because it allows for nonzero autocorrelations in the irregular component and explicitly models the correlations between successive values of the series.
ARIMA models are defined for stationary series. The time series plotted in Figure 2 is non-stationary, so the series must be differenced d times to obtain a stationary series. From Table 1, the best model is the one with the lowest corrected Akaike Information Criterion (AICc) value, which is the ARIMA (0, 1, 1) model.

Model Parameter Estimation
As observed from the above analysis, the best model is the ARIMA (0,1,1), based on the AICc criterion and the ACF and PACF graphs. The model parameters are then estimated for forecasting the daily spread of COVID-19 in Nigeria. From the output, the estimated value of θ is -0.6543 (see Appendix IV). Therefore, the working predictive model obtained after substituting the estimated parameter is the ARIMA (0,1,1) model

$$y_t = y_{t-1} + \varepsilon_t - 0.6543\,\varepsilon_{t-1}.$$
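To make the fitted model concrete, the one-step forecast implied by an ARIMA (0,1,1) model with θ = -0.6543 can be worked out as follows. The notation ($\hat{y}_{t+1|t}$ for the forecast, $\hat{\varepsilon}_t$ for the residual, $\alpha$ for the smoothing weight) is introduced here for illustration, and the sign convention assumed is the one in which the moving average term enters as $+\theta\varepsilon_{t-1}$.

```latex
\begin{aligned}
y_t - y_{t-1} &= \varepsilon_t + \theta\,\varepsilon_{t-1}, \qquad \theta = -0.6543,\\
\hat{y}_{t+1|t} &= y_t + \theta\,\hat{\varepsilon}_t
                 = y_t - 0.6543\,\hat{\varepsilon}_t,
                 \qquad \hat{\varepsilon}_t = y_t - \hat{y}_{t|t-1},\\
\hat{y}_{t+1|t} &= (1+\theta)\,y_t - \theta\,\hat{y}_{t|t-1}
                 = \alpha\,y_t + (1-\alpha)\,\hat{y}_{t|t-1},
                 \qquad \alpha = 1 + \theta = 0.3457.
\end{aligned}
```

Read this way, the forecast is a simple exponential smoothing of the series: each new forecast places a weight of about 0.35 on the latest observation and about 0.65 on the previous forecast.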

Box-Ljung test
The Ljung-Box test is a diagnostic tool used to test for lack of fit of a time series model (Box and Jenkins, 1976). The value of the Ljung-Box test statistic (X-squared) is 20.688, with a p-value of 0.5157. Since p > 0.05, the null hypothesis of no residual autocorrelation is not rejected at the 5% level of significance, which supports the suitability of the model.

Residual ACF:
The correlogram of the forecast errors (residuals) was plotted to measure the goodness of fit, as shown below. The density plot of the residual error values in Figure 8 suggests that the residual errors are Gaussian.

Forecast ARIMA Model
The ARIMA model is used to forecast future time steps of the COVID-19 confirmed cases in Nigeria. One-step forecasts are generated; the forecast function accepts the index of the time steps to predict as arguments. Since 228 observations are used in the training dataset to fit the model, the index of the next time step for prediction starts at 229. The dataset is split into train and test sets: the train set is used to fit the model, and a prediction is generated for each element of the test set. The forecast is performed by re-creating the ARIMA model after each new observation is received. All observations are tracked in a history that is seeded with the training data and to which new observations are appended at each iteration. This procedure prints the prediction and the expected value at each iteration. The results of the iterations are tabulated below. A time series plot showing the expected values (blue) and the forecast predictions (red) is shown below. The forecast for the next 85 days (November 1, 2020 - January 24, 2021), together with the lower (Lo) and upper (Hi) limits of the 80% and 95% prediction intervals, is shown in the table below.
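The walk-forward procedure described above can be sketched as follows. This is a simplified illustration, not the code used in the study: for brevity the moving average parameter is held fixed instead of being re-estimated at each step, the first forecast is initialised at the first observation, and the function names and input values are hypothetical.

```python
def one_step_arima011(history, theta):
    """One-step-ahead forecast from an ARIMA(0,1,1) model with a fixed theta."""
    forecast = history[0]  # assumption: initialise at the first observation
    for y in history:
        error = y - forecast          # residual for the current observation
        forecast = y + theta * error  # forecast for the next time step
    return forecast


def walk_forward(train, test, forecaster):
    """Forecast each test point in turn, then append the true value to the history."""
    history = list(train)
    predictions = []
    for observed in test:
        predictions.append(forecaster(history))
        history.append(observed)  # the history grows by one observation per iteration
    return predictions


# Illustrative values only (not NCDC data).
train, test = [100, 120, 110, 130], [125, 140]
preds = walk_forward(train, test, lambda h: one_step_arima011(h, theta=-0.6543))
for predicted, expected in zip(preds, test):
    print(f"predicted={predicted:.1f}, expected={expected}")
```

Replacing the `forecaster` callable with one that re-fits the model on `history` at every call would reproduce the re-creation step described in the text.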
The figure below shows the plot of the confirmed cases of COVID-19 for the first 235 days and for the next 85 days using the estimated ARIMA (0,1,1) model.
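For context on how such prediction intervals widen with the horizon, the standard result for an ARIMA (0,1,1) model without drift can be written out. The symbols $h$ (forecast horizon), $T$ (last observed time step), $\hat{\sigma}$ (estimated residual standard deviation) and $z$ (normal quantile) are introduced here for illustration, and the estimate θ = -0.6543 is taken from the text.

```latex
\begin{aligned}
\hat{y}_{T+h|T} &= \hat{y}_{T+1|T} \quad \text{for all } h \ge 1,\\
\operatorname{Var}\!\left(y_{T+h} - \hat{y}_{T+h|T}\right)
   &= \sigma^2\left[1 + (h-1)(1+\theta)^2\right],\\
\text{interval: } \hat{y}_{T+h|T} &\pm z\,\hat{\sigma}\sqrt{1 + (h-1)(1+\theta)^2},
   \qquad z \approx 1.28 \ (80\%),\quad z \approx 1.96 \ (95\%).
\end{aligned}
```

With $(1+\theta)^2 \approx 0.1195$, the interval at $h = 85$ is roughly $\sqrt{1 + 84 \times 0.1195} \approx 3.3$ times as wide as the one-step interval, while the point forecast itself stays flat.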

Conclusion
There was no case of COVID-19 in Nigeria until February, 2020. Since then, the virus has been reported on a daily basis by the NCDC and has shown an upward trend. Special precautionary measures were taken, such as total lockdowns, use of facemasks, social distancing,

Figure 2: Time Series Plot of confirmed COVID-19 cases in Nigeria

Figure 3: Simple Moving Average to estimate trend using 8-point moving average (n=8)

Figure 7: Residual ACF Plot
From Figure 7 above, it can be deduced that the spikes between the horizontal dotted lines are random and gradually decrease to zero. This implies that the ARIMA model is

Figure 8: ARIMA fit Residual Error Density Plot

Figure 9: Graphical representation of forecasted (red) and expected (blue) number of COVID-19 Cases in Nigeria.

Table 1: ARIMA Models and corresponding AICc

Table 4: Predicted and expected values with lower (Lo) and upper (Hi) prediction intervals