This document gathers some main concepts in time series analysis as well as some examples written in R. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

Index

Used packages

When working with R, we can define our own functions and create modules called packages which serve a specific purpose. These can be submitted to CRAN(The Comprehensive R Archive Network). Publishing a package enables people to use the code you have developed! In this case we are using the following packages:

library(ggplot2)
library(readr)
library(lubridate)
library(reshape2)
library(RcppTOML)

Reproducibility

In order for you to reproduce the results found in this notebooks, we must set a seed for the random number generator. This is a topic beyond the scope of this repo, but I highly advise you to read about pseudorandom number generators. We might as well parse config from a file to avoid using magic constants. For this repo (and many of our my personal projects) my file format of choice is TOML (Tom’s Obvious, Minimal Language).

  set.seed(204) # so you can reproduce the results
  config <- parseToml("../config.toml") # The only magic constant will be the path to the config file.

Some definitions

Let’s start by defining stochastic processes: A family of indexed random variables \(Z(\omega, t)\) where \(\omega\) belongs to a sample space and \(t\) belongs to an index set.

Thus, we can conclude that time series are a sort of stochastic processes.

Examples

  1. White noise

For example, one of the basic time series is white noise, which is a time series generated from uncorrelated variables, which are most of the time normally distributed.

white.noise <-  rnorm(300,0,1) # N(0,1) variates 
plot(white.noise, type = "l")

This collection of random variables \(e_{t}\) have the following properties:

  1. \(E(e_{t}) = 0\)
  2. \(E(e_{t}e_{s}) = 0\)
  3. \(E(e^{2}_{t}) = \sigma^{2}\)

For this particular case, having only a couple hundred observations, the estimations are slightly biased but we can observe nevertheless that these properties are conserved. Here we have descriptive statistics for the previously generated realisation of gaussian white noise, as well as a visualisation of its distribution courtesy of ggplot2.

mean(white.noise)
## [1] 0.01316385
var(white.noise)
## [1] 1.003168
p.w <- ggplot(as.data.frame(white.noise), aes(x=white.noise)) +
  geom_histogram(aes(y=..density..), colour="black", bins=18) +
    geom_density(alpha=.2, fill="blue") 
print(p.w)

  1. Glycaemic history

We can visualise another type of time series, this time a real biomedical example: glycaemic history!

#glc <- read_csv("../data/example_glucose.csv")
glc <- read_csv(config$data$build$example_glc)
## Parsed with column specification:
## cols(
##   datetime = col_datetime(format = ""),
##   glucose = col_double()
## )
#glc$datetime <- sapply(glc$datetime, as_datetime)
p.glc <- ggplot(data = glc, aes(x = as_datetime(datetime), y = glucose)) + 
  geom_line(color = "#00AFBB")
print(p.glc)

Basic techniques

These include moving averages and filtering, autoregressions and random walks. These are some of the building blocks used to work with more advanced time series models and handle missing values.

  1. Moving average

A moving average is calculated by taking the mean of a certain number of observations, measurements within a defined stencil/window. This is a widespread smoothing and filtering technique. After applying a moving average the slower oscillations are more apparent and some of the faster oscillations are taken out Its definition is as follows:

\[ ma_{t} \; = \; \frac{1}{N} \sum_{i=0}^{N-1} v_{t-i} \]

Where \(ma_{t}\) represents the moving average value at timestep \(t\), defined using stencil of \(N-1\) observations before the timestep (we could call them left or past) and that observation itself. We could alternatively define a centered moving average. Doing this operation requires the definition of a lag operator.

  1. Lag operator The lag operator sometimes called backshift operator, applied on a time series observation yields its previous element. Denoted \(L\), operating on an arbitrary element of a time series \(y_{t}\):

\[ Ly_{t} = y_{t-1} \]

The lag operator has some convenient properties that will be stated but not proven here.

  1. Centered moving average Restricting the size of the stencil to odd values, a centered moving average can be defined using a lag operator. \[ cma_{t} \; = \; \frac{1}{N} \sum_{i=0}^{N-1} L^{M} v_{t+i} \;\; \; \;\; M := (N-1)/2 \]

Where M is guaranteed to be integer if N is odd.

Caveats: When computing moving averages, be careful with the programming language/library/package of your choice. Some functions may yield errors or warnings. Some others may exhibit undefined behaviour. Others could be well defined but not act as expected. Here I show how to calculate both centered and left moving averages in R. To check the documentation of a function in R, type the name of the function preceeded by a question mark i.e. ?filter.

n <- 200
w <- rnorm(n,0,1)  # n N(0,1) variates 
m.avg <-data.frame(
  x = 1:n,
  norm = w, # the original series
  c3 = filter(w, sides=2, filter=rep(1/3,3)), # centered moving average 
  c5 = filter(w, sides=2, filter=rep(1/5,5)), # centered moving average
  l3 = filter(w, sides=1, filter=rep(1/3,3)), # left moving average
  l5 = filter(w, sides=1, filter=rep(1/5,5)), # left moving average
  l10 = filter(w, sides=1, filter=rep(1/10,10)) # left moving average
)
# We `melt` the data.frame to enable easier visualisation.
melted.m.avg <- melt(m.avg, id="x")
p.m.avg <- ggplot(data = melted.m.avg, aes(x = x, y = value, color = variable)) +
  geom_line() + # we want lines
  facet_grid(variable ~ .) # we want each series to be on a different grid
print(p.m.avg)

  1. Autoregressions An autoregression is a technique used to determine values of a time series, based on previous observations of itself. In other words, imagine we take the two previous values of time series and use it to compute the current one. We are analising stochastic processes, so there is a random element, a deviaton we cannot account for which can be expressed as white noise. This subject will be retaken, in depth, afterwards on this course.

\[ x_{t} \; = \; \sum_{i=1}^{N}c_{i} x_{t-i} \; + w_{t} \]

This equation defines an autoregression of \(N\)-th order, expressed as a linear combination of previous values and some coefficient plus a white noise term. The following is an example of a second order autoregression performed on the previously presented glycaemic data.

\[ glucose_{t} = glucose_{t-1} - 0.9glucose_{t-2} + w_{t} \]

Note how the same function filter is used to perform moving average calculations and autoregressions. This is achieved by changing the value of the method parameter.

glc$autoreg <- filter(glc$glucose, filter=c(1, -.9), method="recursive")
melted.glc <- melt(glc, id="datetime")
p.glc.autoreg <- ggplot(
    data = melted.glc, 
    aes(x = as_datetime(datetime), y = value, color = variable)
  ) + geom_line() # we want lines
print(p.glc.autoreg)

Random Walk

Another possible model for describing stochastic processes, are random walks. In these models, each observation \(x_{t}\) is given by the previous value \(x_{t-1}\) plus a random value. This could be, for instance, white noise \(w_{t}\). These models can also include drift \(\delta\) which is a constant that expresses a trend. Summed up, it is written as follows:

\[ x_{t} = \delta + x_{t-1} + w_{t} \]

Which can alternatively be simply the summation of white noise observations, plus the trend (drift), and an initial value.

\[ x_{t} = \delta t + \sum_{j=1}^{t}w_{j} + x_{0} \]

set.seed(154) # Reproducibility
w = rnorm(200); 
x = cumsum(w)
delta = .2
wd = w + delta  # R automatically performs element-wise addition. No need to iterate 
xd = cumsum(wd)
plot.ts(xd, ylim=c(-5,55), main="random walk", ylab='')
lines(x, col=4)
abline(h=0, col=4, lty=2) 
abline(a=0, b=.2, lty=2)