This document gathers some of the main concepts in time series analysis, together with examples written in R. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and macOS.
When working with R, we can define our own functions and create modules called packages, each serving a specific purpose. These can be submitted to CRAN (The Comprehensive R Archive Network). Publishing a package enables other people to use the code you have developed! In this case we are using the following packages:
library(ggplot2)
library(readr)
library(lubridate)
library(reshape2)
library(RcppTOML)
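If any of these packages are missing from your installation, they can all be obtained from CRAN, for example with:
install.packages(c("ggplot2", "readr", "lubridate", "reshape2", "RcppTOML"))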
In order for you to reproduce the results found in this notebook, we must set a seed for the random number generator. This is a topic beyond the scope of this repo, but I highly advise you to read about pseudorandom number generators. We might as well parse the configuration from a file to avoid using magic constants. For this repo (and many of my personal projects) my file format of choice is TOML (Tom’s Obvious, Minimal Language).
set.seed(204) # so you can reproduce the results
config <- parseToml("../config.toml") # The only magic constant will be the path to the config file.
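The config file itself is not included in this document; judging from the key accessed later (config$data$build$example_glc) and the commented-out path further down, a minimal ../config.toml could look something like this:
[data.build]
example_glc = "../data/example_glucose.csv"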
Let’s start by defining stochastic processes: A family of indexed random variables \(Z(\omega, t)\) where \(\omega\) belongs to a sample space and \(t\) belongs to an index set.
Thus, we can conclude that time series are a kind of stochastic process.
For example, one of the most basic time series is white noise: a series generated from uncorrelated random variables, which are often assumed to be normally distributed.
white.noise <- rnorm(300,0,1) # N(0,1) variates
plot(white.noise, type = "l")
This collection of random variables \(e_{t}\) has the following properties: zero mean, \(E[e_{t}] = 0\); constant variance, \(Var(e_{t}) = \sigma^{2}\); and no correlation between distinct timesteps, \(Cov(e_{t}, e_{s}) = 0\) for \(t \neq s\).
For this particular case, having only a few hundred observations, the sample estimates deviate slightly from the theoretical values, but we can nevertheless observe that these properties hold approximately. Here we have descriptive statistics for the previously generated realisation of Gaussian white noise, as well as a visualisation of its distribution courtesy of ggplot2.
mean(white.noise)
## [1] 0.01316385
var(white.noise)
## [1] 1.003168
p.w <- ggplot(as.data.frame(white.noise), aes(x=white.noise)) +
geom_histogram(aes(y=..density..), colour="black", bins=18) +
geom_density(alpha=.2, fill="blue")
print(p.w)
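To check the lack of correlation between timesteps empirically, we could also inspect the sample autocorrelation function. A quick sketch using base R's acf (not part of the original analysis):
acf(white.noise, main = "Sample ACF of the white noise realisation")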
We can visualise another type of time series, this time a real biomedical example: glycaemic history!
#glc <- read_csv("../data/example_glucose.csv")
glc <- read_csv(config$data$build$example_glc)
## Parsed with column specification:
## cols(
## datetime = col_datetime(format = ""),
## glucose = col_double()
## )
#glc$datetime <- sapply(glc$datetime, as_datetime)
p.glc <- ggplot(data = glc, aes(x = as_datetime(datetime), y = glucose)) +
geom_line(color = "#00AFBB")
print(p.glc)
Some basic building blocks for working with time series include moving averages and filtering, autoregressions, and random walks. These are used to construct more advanced time series models and to handle missing values.
A moving average is calculated by taking the mean of a certain number of observations within a defined stencil/window. This is a widespread smoothing and filtering technique: after applying a moving average, the slower oscillations become more apparent while some of the faster oscillations are removed. Its definition is as follows:
\[ ma_{t} \; = \; \frac{1}{N} \sum_{i=0}^{N-1} v_{t-i} \]
Where \(ma_{t}\) represents the moving average value at timestep \(t\), defined using a stencil of the \(N-1\) observations before that timestep (we could call them left or past values) and the observation itself. We could alternatively define a centered moving average. Writing these operations compactly calls for the definition of a lag operator.
\[ Ly_{t} = y_{t-1} \]
The lag operator has some convenient properties that will be stated but not proven here.
The lag operator can be raised to powers. \[ L^{n}y_{t} = y_{t-n} \]
Polynomials of it can be subsequently defined. \[ a(L) = a_{0} + a_{1}L + a_{2}L^{2} + ... + a_{n}L^{n} \]
Lag polynomials can be multiplied. Said operation is commutative. \[ a(L)b(L) = b(L)a(L) \]
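As a quick illustrative sketch (the helper lag.op below is my own and not part of this repo), applying the lag operator to a vector just shifts its entries back, and raising it to a power shifts further:
lag.op <- function(y, n = 1) c(rep(NA, n), head(y, -n)) # L^n applied to a plain vector
y <- c(10, 20, 30, 40, 50)
lag.op(y)     # L y_t   = y_{t-1}: NA 10 20 30 40
lag.op(y, 2)  # L^2 y_t = y_{t-2}: NA NA 10 20 30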
A centered moving average of window width \(N\) uses \(M\) observations on each side of the current timestep: \[ ma_{t} \; = \; \frac{1}{N} \sum_{i=-M}^{M} v_{t+i}, \qquad M = \frac{N-1}{2} \] Where \(M\) is guaranteed to be an integer if \(N\) is odd.
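A small sanity check of that relationship, with arbitrary toy values and base R's filter (the same function used more extensively below):
N <- 5                       # odd window width
M <- (N - 1) / 2             # 2 observations on each side
v <- c(1, 3, 5, 7, 9, 11, 13)
filter(v, sides = 2, filter = rep(1/N, N))  # NA NA 5 7 9 NA NA
mean(v[(3 - M):(3 + M)])                    # 5, the centered average at t = 3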
Caveats: when computing moving averages, be careful with the programming language/library/package of your choice. Some functions may yield errors or warnings, some may exhibit undefined behaviour, and others could be well defined but not act as expected. Here I show how to calculate both centered and left moving averages in R. To check the documentation of a function in R, type the name of the function preceded by a question mark, i.e. ?filter.
n <- 200
w <- rnorm(n,0,1) # n N(0,1) variates
m.avg <- data.frame(
x = 1:n,
norm = w, # the original series
c3 = filter(w, sides=2, filter=rep(1/3,3)), # centered moving average
c5 = filter(w, sides=2, filter=rep(1/5,5)), # centered moving average
l3 = filter(w, sides=1, filter=rep(1/3,3)), # left moving average
l5 = filter(w, sides=1, filter=rep(1/5,5)), # left moving average
l10 = filter(w, sides=1, filter=rep(1/10,10)) # left moving average
)
# We `melt` the data.frame to enable easier visualisation.
melted.m.avg <- melt(m.avg, id="x")
p.m.avg <- ggplot(data = melted.m.avg, aes(x = x, y = value, color = variable)) +
geom_line() + # we want lines
facet_grid(variable ~ .) # we want each series to be on a different grid
print(p.m.avg)
\[ x_{t} \; = \; \sum_{i=1}^{N}c_{i} x_{t-i} \; + w_{t} \]
This equation defines an autoregression of \(N\)-th order: the current value is expressed as a linear combination of the previous values, each multiplied by a coefficient \(c_{i}\), plus a white noise term. The following is an example of a second order autoregression performed on the previously presented glycaemic data.
\[ glucose_{t} = glucose_{t-1} - 0.9glucose_{t-2} + w_{t} \]
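Before applying this to the glucose data, here is a toy sketch (arbitrary numbers, not from the data set) of what filter does with method = "recursive": each output value is the current input plus a linear combination of the previous outputs.
w.toy <- c(1, 2, 3)
filter(w.toy, filter = c(1, -0.9), method = "recursive")
# By hand, with the pre-sample values taken as zero:
# x1 = 1
# x2 = 2 + 1*x1            = 3
# x3 = 3 + 1*x2 - 0.9*x1   = 5.1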
Note how the same function filter is used to perform both moving average calculations and autoregressions. This is achieved by changing the value of the method parameter: method = "convolution" for moving averages and method = "recursive" for autoregressions.
glc$autoreg <- filter(glc$glucose, filter=c(1, -.9), method="recursive")
melted.glc <- melt(glc, id="datetime")
p.glc.autoreg <- ggplot(
data = melted.glc,
aes(x = as_datetime(datetime), y = value, color = variable)
) + geom_line() # we want lines
print(p.glc.autoreg)
Another possible model for describing stochastic processes is the random walk. In this model, each observation \(x_{t}\) is given by the previous value \(x_{t-1}\) plus a random value, which could be, for instance, white noise \(w_{t}\). The model can also include a drift \(\delta\), a constant that expresses a trend. Summed up, it is written as follows:
\[ x_{t} = \delta + x_{t-1} + w_{t} \]
This can alternatively be written as the cumulative sum of the white noise observations, plus the trend (drift) term and an initial value.
\[ x_{t} = \delta t + \sum_{j=1}^{t}w_{j} + x_{0} \]
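The two formulations agree (here taking \(x_{0} = 0\)); a quick sketch with illustrative names, not part of the original analysis, to convince ourselves:
w.check <- rnorm(10)
delta.check <- .2
x.rec <- numeric(10)
x.rec[1] <- delta.check + w.check[1]
for (t in 2:10) x.rec[t] <- delta.check + x.rec[t-1] + w.check[t] # recursive definition
x.sum <- delta.check * (1:10) + cumsum(w.check)                   # closed-form summation
all.equal(x.rec, x.sum) # TRUE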
set.seed(154) # Reproducibility
w <- rnorm(200)   # 200 N(0,1) variates
x <- cumsum(w)    # random walk without drift
delta <- .2
wd <- w + delta   # R automatically performs element-wise addition. No need to iterate
xd <- cumsum(wd)  # random walk with drift
plot.ts(xd, ylim=c(-5,55), main="random walk", ylab='') # walk with drift
lines(x, col=4)                                         # walk without drift, in blue
abline(h=0, col=4, lty=2)                               # zero reference line
abline(a=0, b=.2, lty=2)                                # drift line with slope delta = 0.2