Last edited: 2024-10-30 19:25:16

**The ARMA process, short for autoregressive moving average, is one of the most basic time series models, but it is an important building block in forecasting. Let us look at its definition and some of its properties, such as causality and invertibility, but also how to choose which order your ARMA model should have.**

The ARMA($p,q$) process $X_t$ has the following definition:

$X_t - \sum_{i=1}^p \phi_i X_{t-i} = Z_t + \sum_{i=1}^q \theta_i Z_{t-i},$where $Z_t \sim \text{WN}(0,\sigma^2)$ (white noise with mean 0 and variance $\sigma^2$) and the polynomials $1 -\sum_{i=1}^p \phi_i z^i$ and $1 + \sum_{i=1}^q \theta_i z^i$ have no common zeros. You might also define the ARMA model in the following short form:

$\phi(B) X_t = \theta(B) Z_t,$where $B$ is the lag operator defined as $B X_t = X_{t-1}$. The two polynomials are defined as:

$\phi(z) = 1 - \sum_{i=1}^p \phi_i z^i$and

$\theta (z) = 1 + \sum_{i=1}^q \theta_i z^i.$An ARMA process $X_t$ has a unique (weakly) stationary solution if and only if $\phi(z) \neq 0$ for all $z \in \mathbb{C}$ with $|z| = 1$.
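This zero condition is straightforward to check numerically. Below is a minimal sketch (the ARMA(2,1) coefficients are made up for illustration) that locates the zeros of $\phi$ with NumPy and tests the stationarity condition:

```python
import numpy as np

# Made-up ARMA(2,1) example: phi(z) = 1 - 0.5 z + 0.25 z^2
phi = [0.5, -0.25]

# np.roots expects coefficients from the highest power of z downwards
ar_roots = np.roots([-c for c in reversed(phi)] + [1.0])

# Unique (weakly) stationary solution iff phi has no zero on |z| = 1
stationary = all(abs(abs(z) - 1.0) > 1e-8 for z in ar_roots)
# Here both zeros are 1 +/- i*sqrt(3), with modulus 2, so the condition holds
```

The same root computation applied to $\theta(z)$ covers the invertibility condition discussed further down.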

An ARMA process $X_t$ is causal if there exists a real-valued sequence $(\varepsilon_i, i \in \mathbb{N}_0)$ such that for all $t \in \mathbb{Z}$:

$\sum_{i=0}^\infty |\varepsilon_i| < \infty$and

$X_t = \sum_{i=0}^\infty \varepsilon_i Z_{t-i}.$In other words, $X_t$ can be described as an MA$(\infty)$ process. Equivalently, an ARMA process $X_t$ is causal if and only if $\phi(z) \neq 0$ for all $z \in \mathbb{C}$ with $|z| \leq 1$.
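For a causal process, the coefficients $\varepsilon_i$ can be computed by matching powers of $z$ in $\phi(z)\,\varepsilon(z) = \theta(z)$, which gives the recursion $\varepsilon_0 = 1$ and $\varepsilon_j = \theta_j + \sum_{k=1}^{\min(j,p)} \phi_k \varepsilon_{j-k}$ (with $\theta_j = 0$ for $j > q$). A minimal sketch, with a made-up helper name:

```python
import numpy as np

def ma_infinity_weights(phi, theta, n_terms=20):
    """Hypothetical helper: first n_terms coefficients eps_i of the
    MA(infinity) representation X_t = sum_i eps_i Z_{t-i}, i.e. the
    power-series expansion of theta(z) / phi(z)."""
    eps = np.zeros(n_terms)
    eps[0] = 1.0
    for j in range(1, n_terms):
        acc = theta[j - 1] if j <= len(theta) else 0.0
        for k in range(1, min(j, len(phi)) + 1):
            acc += phi[k - 1] * eps[j - k]
        eps[j] = acc
    return eps

# Sanity check on an AR(1) with phi_1 = 0.5, where eps_i = 0.5**i
w = ma_infinity_weights([0.5], [])
```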

An ARMA process $X_t$ is invertible if there exists a real-valued sequence $(\pi_i, i \in \mathbb{N}_0)$ such that for all $t \in \mathbb{Z}$:

$\sum_{i=0}^\infty |\pi_i| < \infty$and

$Z_t = \sum_{i=0}^\infty \pi_i X_{t-i}.$In other words, $X_t$ can be described as an AR$(\infty)$ process. Equivalently, an ARMA process $X_t$ is invertible if and only if $\theta(z) \neq 0$ for all $z \in \mathbb{C}$ with $|z| \leq 1$.
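The $\pi_i$ coefficients follow from the mirror-image identity $\pi(z)\,\theta(z) = \phi(z)$, giving $\pi_0 = 1$ and $\pi_j = -\phi_j - \sum_{k=1}^{\min(j,q)} \theta_k \pi_{j-k}$ (with $\phi_j = 0$ for $j > p$). A sketch under the same conventions as before, again with a made-up helper name:

```python
import numpy as np

def ar_infinity_weights(phi, theta, n_terms=20):
    """Hypothetical helper: first n_terms coefficients pi_i of the
    AR(infinity) representation Z_t = sum_i pi_i X_{t-i}, i.e. the
    power-series expansion of phi(z) / theta(z)."""
    pi = np.zeros(n_terms)
    pi[0] = 1.0
    for j in range(1, n_terms):
        acc = -phi[j - 1] if j <= len(phi) else 0.0
        for k in range(1, min(j, len(theta)) + 1):
            acc -= theta[k - 1] * pi[j - k]
        pi[j] = acc
    return pi

# Sanity check on an MA(1) with theta_1 = 0.4, where pi_i = (-0.4)**i
pi_w = ar_infinity_weights([], [0.4])
```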

It’s always possible to fit an ARMA$(p,q)$ model with excessively large $p$ and $q$ values, but this isn’t advantageous for forecasting. While this approach often yields a small estimated white noise variance, the mean squared error of forecasts is also affected by errors in parameter estimation. To address this, we introduce a "penalty factor" to discourage the selection of overly complex models.

We start by introducing the AICC criterion, where AIC stands for Akaike's Information Criterion and the last C denotes bias-corrected. The AICC estimates the Kullback-Leibler divergence, which measures how different one probability distribution is from another, comparing the estimated distribution to the actual data distribution. It assumes that $Z_t \sim \text{IID}\ \mathcal{N} (0, \sigma^2)$, but it shows robustness to moderate deviations from normality, such as when $Z_t$ follows a $t$-distribution.

The AICC criterion says that you have to choose $p$, $q$, $\phi_p$ and $\theta_q$ to minimize the following:

$-2 \ln{L(\phi_p, \theta_q, S(\phi_p,\theta_q)/n)} + 2n \frac{p+q+1}{n-p-q-2},$where $\phi_p = (\phi_1, ... , \phi_p)$ and $\theta_q = (\theta_1, ... , \theta_q)$. To apply this criterion in practice, we fit a wide range of models with varying orders $(p, q)$ to the data and select the one that minimizes the negative log-likelihood, adjusted by the penalty factor $2n \frac{p+q+1}{n-p-q-2}$.
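In code, the grid search looks something like the sketch below; the $-2\ln L$ values are invented stand-ins for the likelihoods your fitting routine would return for each candidate order:

```python
# Made-up example: neg2loglik[(p, q)] holds -2 ln L for each fitted candidate
n = 200
neg2loglik = {(0, 1): 310.2, (1, 0): 305.8, (1, 1): 301.5, (2, 1): 300.9}

def aicc(neg2ll, p, q, n):
    # AICC = -2 ln L + 2n(p+q+1)/(n-p-q-2)
    return neg2ll + 2.0 * n * (p + q + 1) / (n - p - q - 2)

scores = {pq: aicc(v, *pq, n) for pq, v in neg2loglik.items()}
best = min(scores, key=scores.get)
```

Note how the penalty reverses the raw ranking: the ARMA(2,1) fit has the smallest $-2\ln L$, but the extra parameter costs more than it buys, so the ARMA(1,1) wins.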

One issue with the AICC criterion is that the estimators for p and q aren’t consistent; they don’t converge almost surely to the true values. In contrast, consistent estimators can be derived using the Bayesian Information Criterion (BIC), which also penalizes the selection of large p and q values, helping to prevent overfitting.

The BIC criterion says that you have to choose $p$ and $q$ to minimize the following:

$(n-p-q) \ln{\frac{n\hat{\sigma}^2}{n-p-q}}+n (1+\ln{\sqrt{2 \pi}}) + (p+q) \ln{\frac{\sum_{t=1}^n X_t^2 - n \hat{\sigma}^2}{p+q}},$where $\hat{\sigma}^2$ is the maximum likelihood estimate of the white noise variance. A downside of the BIC is its lack of efficiency in finding minimizers: while selecting the model that minimizes the AICC is asymptotically efficient for causal and invertible ARMA processes, this isn't the case for the BIC. In this context, efficiency means that minimizing the AICC will, in the long run, lead to a model with the lowest one-step-ahead prediction errors.
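As a sketch, the criterion translates directly into code (the function name is made up, and the formula requires $p+q \geq 1$, since the last term divides by $p+q$):

```python
import numpy as np

def bic(x, sigma2_hat, p, q):
    """Hypothetical sketch of the BIC criterion for an ARMA(p, q) fit;
    sigma2_hat is the ML estimate of the white-noise variance."""
    n = len(x)
    k = p + q  # must be >= 1 for the last term to be defined
    return ((n - k) * np.log(n * sigma2_hat / (n - k))
            + n * (1 + np.log(np.sqrt(2 * np.pi)))
            + k * np.log((np.sum(x**2) - n * sigma2_hat) / k))

# Toy inputs just to exercise the formula
val = bic(np.ones(100), 0.5, 1, 1)
```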

When using constrained maximum likelihood estimation—where certain coefficients are assumed to be zero during the estimation process—the term $p+q+1$ is replaced by $m$, which represents the number of non-zero coefficients, in both the AICC and BIC.
