🕔 Time Series

Introduction

Any metric that is measured over regular time intervals forms a time series. Time series analysis is commercially important because many industries depend on forecasting: weather, sports scores, population growth, stock prices, demand, sales, supply, and so on. The graph below shows temperature over time in two US cities:

A time series can be broken down into its components so as to systematically understand, analyze, model and forecast it. We begin by answering fundamental questions such as:

  1. What are the types of time series?
  2. How do we decompose a time series, i.e. extract its level, trend, and seasonal components?
  3. What is autocorrelation?
  4. What is a stationary time series?
  5. And, how do you plot time series?

Introduction to Time Series: Data Formats

There are multiple formats for time series data.

  • Tibble format: the simplest is of course the standard tibble / data frame, with a time variable to indicate that the other variables vary with time.

  • The ts format: The stats::ts() function will convert a numeric vector into an R time series ts object.

  • The modern tsibble format: this is a new format for time series analysis, and is used by the tidyverts set of packages.

  • The base ts object is used by established packages such as forecast

  • The standard tibble object is used by timetk & modeltime

  • The special tsibble object (“time series tibble”) is used by fable, feasts and others from the tidyverts set of packages
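
To make the three formats concrete, here is a minimal sketch that takes one series through all of them. It assumes the tsibble and tibble packages are installed; the object names air_ts, air_tsibble and air_tibble are made up for this illustration.

```r
library(tsibble)   # as_tsibble(), is_tsibble()
library(tibble)    # as_tibble()

# 1. Base ts object: a numeric vector plus start / frequency metadata
air_ts <- AirPassengers
class(air_ts)

# 2. tsibble: an explicit time index column plus a value column
air_tsibble <- as_tsibble(air_ts)
air_tsibble

# 3. Plain tibble: the same columns, with the time series metadata dropped
air_tibble <- as_tibble(air_tsibble)
air_tibble
```

Converting a ts object with as_tsibble() yields two columns, index and value; the plain tibble keeps those columns but no longer knows which one is the time index.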

Creating and Plotting Time Series

We will work through three examples: first simple ts data, then data in tsibble format, and finally a tibble that we can plot as is and analyse further after converting it to tsibble format.

ts format data

There are a few datasets in base R that are in ts format already.

AirPassengers
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
str(AirPassengers)
##  Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...

This can be easily plotted using base R:

plot(AirPassengers)

One can see an upward trend, along with seasonal variation whose amplitude also grows over time.
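
One way to check this visual impression is to split the series into components. The following is only a sketch using the classical decompose() function from base R; choosing type = "multiplicative" is an assumption made here because the seasonal swings grow with the level of the series, and the name air_parts is made up.

```r
# Classical decomposition from the stats package (loaded by default in R)
air_parts <- decompose(AirPassengers, type = "multiplicative")

# The result holds the observed series and its estimated components;
# in the multiplicative model: observed = trend * seasonal * random
names(air_parts)

# Stacked panels: observed, trend, seasonal, random
plot(air_parts)
```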

Now let us take data that is “time oriented” but not yet in ts format, and convert it with ts(). The syntax of ts() is:

Syntax:  objectName <- ts(data, start, end, frequency)

where,

    `data`: represents the data vector
    `start`: represents the first observation in time series
    `end`: represents the last observation in time series
    `frequency`: represents number of observations per unit time.
    For example 1=annual, 4=quarterly, 12=monthly, etc.
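
As a small illustration of these arguments, the sketch below builds a monthly series. The values in monthly_values are made up purely for demonstration, and the start date of January 2020 is arbitrary.

```r
# Made-up monthly values, two years' worth, purely for illustration
monthly_values <- c(12, 15, 14, 18, 21, 25, 27, 26, 22, 19, 15, 13,
                    13, 16, 16, 19, 23, 26, 29, 27, 23, 20, 16, 14)

monthly_ts <- ts(monthly_values,
                 start = c(2020, 1),   # first observation: year 2020, period 1 (January)
                 frequency = 12)       # 12 observations per year, i.e. monthly

start(monthly_ts)       # year and period of the first observation
end(monthly_ts)         # year and period of the last observation
frequency(monthly_ts)   # observations per unit time
print(monthly_ts)       # printed as a year-by-month table
```

When end is omitted, as here, ts() infers it from start, frequency and the length of the data.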

We will pick a simple numeric vector, one variable from the trees dataset:

trees
##    Girth Height Volume
## 1    8.3     70   10.3
## 2    8.6     65   10.3
## 3    8.8     63   10.2
## 4   10.5     72   16.4
## 5   10.7     81   18.8
## 6   10.8     83   19.7
## 7   11.0     66   15.6
## 8   11.0     75   18.2
## 9   11.1     80   22.6
## 10  11.2     75   19.9
## 11  11.3     79   24.2
## 12  11.4     76   21.0
## 13  11.4     76   21.4
## 14  11.7     69   21.3
## 15  12.0     75   19.1
## 16  12.9     74   22.2
## 17  12.9     85   33.8
## 18  13.3     86   27.4
## 19  13.7     71   25.7
## 20  13.8     64   24.9
## 21  14.0     78   34.5
## 22  14.2     80   31.7
## 23  14.5     74   36.3
## 24  16.0     72   38.3
## 25  16.3     77   42.6
## 26  17.3     81   55.4
## 27  17.5     82   55.7
## 28  17.9     80   58.3
## 29  18.0     80   51.5
## 30  18.0     80   51.0
## 31  20.6     87   77.0
# Choosing the `Height` variable
trees_ts <- ts(trees$Height, 
               frequency = 1, # No reason to believe otherwise
               start = 1980)  # Arbitrarily picked "1980" !
plot(trees_ts)

(Note that this example is purely for demonstration: the rows of trees are measurements of different trees, not of one tree over time, and tree heights do not decrease over time!)

tsibble data

The tsibbledata package contains several ready-made datasets in tsibble format. Run data(package = "tsibbledata") to list them. Let us try PBS, a dataset of monthly Medicare prescription data in Australia.

library(tsibbledata)  # provides the PBS dataset
data("PBS")

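This is a fairly large dataset. When PBS is printed, the `[1M]` in its header denotes a one-month interval between observations, not one million rows: there are 336 combinations of key variables, with at most 204 monthly observations each. Note that there is more than one quantitative variable, which a simple univariate ts object would not be able to hold.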

The Quantitative variables are Scripts and Cost. The Qualitative variables are described below (type help("PBS") in your Console).

The data is disaggregated using four keys:

Concession: Concessional scripts are given to pensioners, unemployed, dependents, and other card holders
Type: Co-payments are made until an individual’s script expenditure hits a threshold ($290.00 for concession, $1141.80 otherwise). Safety net subsidies are provided to individuals exceeding this amount.
ATC1: Anatomical Therapeutic Chemical index (level 1)
ATC2: Anatomical Therapeutic Chemical index (level 2)
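
The index, keys and interval of a tsibble can also be queried directly. This is a short sketch that assumes the tsibble and tsibbledata packages are installed; interval() is called with its package prefix because lubridate exports a function of the same name.

```r
library(tsibble)
library(tsibbledata)   # provides PBS

index_var(PBS)           # name of the column that holds the time index
key_vars(PBS)            # names of the key columns
n_keys(PBS)              # number of distinct key combinations
tsibble::interval(PBS)   # spacing between consecutive observations
```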

Let us simply plot Cost over time:

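library(tsibble)  # support for the yearmonth index
library(dplyr)    # %>%, filter(), count()
library(ggplot2)

PBS %>% ggplot(aes(x = Month, y = Cost)) + 
  geom_point() + 
  geom_line()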

This basic plot is quite messy, because it draws all 336 combinations of the Qualitative variables at once. We ought to use dplyr to filter the data down to a few combinations. Let us first count the observations in each combination:

PBS %>% count(ATC1, ATC2, Concession, Type)
## # A tibble: 336 × 5
##    ATC1  ATC2  Concession   Type            n
##    <chr> <chr> <chr>        <chr>       <int>
##  1 A     A01   Concessional Co-payments   204
##  2 A     A01   Concessional Safety net    204
##  3 A     A01   General      Co-payments   204
##  4 A     A01   General      Safety net    204
##  5 A     A02   Concessional Co-payments   204
##  6 A     A02   Concessional Safety net    204
##  7 A     A02   General      Co-payments   204
##  8 A     A02   General      Safety net    204
##  9 A     A03   Concessional Co-payments   204
## 10 A     A03   Concessional Safety net    204
## # ℹ 326 more rows

We have 336 combinations of Qualitative variables; those shown above each contain 204 observations. Let us filter for a few such combinations and plot:

PBS %>% dplyr::filter(Concession == "General", 
                      ATC1 == "A",
                      ATC2 == "A10") %>% 
  ggplot(aes(x = Month, y = Cost, colour = Type)) + 
  geom_line() + 
  geom_point()

As can be seen, the two Types of payment show very different patterns over time. Both are strongly seasonal, with seasonal variation increasing over the years, but there is an upward trend with the Co-payments method of payment.
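
Instead of plotting each key combination separately, one can also aggregate across the keys. The sketch below sums Cost over all Concession and Type combinations for ATC2 == "A10"; the names a10_total and TotalCost are made up for this illustration, and the packages dplyr, tsibble, tsibbledata and ggplot2 are assumed to be installed.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)   # provides PBS
library(ggplot2)

# On a tsibble, summarise() keeps the time index and
# collapses the keys: the result has one row per Month
a10_total <- PBS %>%
  filter(ATC2 == "A10") %>%
  summarise(TotalCost = sum(Cost))

ggplot(a10_total, aes(x = Month, y = TotalCost)) +
  geom_line()
```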

tibble data

Let us read in and inspect the US births data from 2000 to 2014. Download this data by clicking on the icon below, and save the downloaded file in a sub-folder called data inside your project:

Read this data in:

## Rows: 5,479
## Columns: 5
## $ year          <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
## $ month         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ date_of_month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ day_of_week   <dbl> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3,…
## $ births        <dbl> 9083, 8006, 11363, 13032, 12558, 12466, 12516, 8934, 794…

So there are separate numerical variables for year, month, date_of_month, and day_of_week, and of course the births on a daily basis. We will combine the first three into a single date column, convert the result to a tsibble, and then plot the births, say for the month of March, in each year:
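
The conversion step can be sketched as follows. Because the file name of the download is not given here, the sketch uses a small stand-in tibble, births_sample, whose five rows are copied from the output above; with the real data the same pipeline would apply to the tibble that was read in. The object names are made up, and dplyr, lubridate and tsibble are assumed to be installed.

```r
library(dplyr)
library(lubridate)   # make_date()
library(tsibble)     # as_tsibble()

# Stand-in for the first five rows of the real data
births_sample <- tibble::tibble(
  year          = 2000,
  month         = 1,
  date_of_month = 1:5,
  day_of_week   = c(6, 7, 1, 2, 3),
  births        = c(9083, 8006, 11363, 13032, 12558)
)

births_sample_tsibble <- births_sample %>%
  # Combine the three numeric columns into one Date column
  mutate(date = make_date(year, month, date_of_month)) %>%
  select(date, births, date_of_month, day_of_week) %>%
  # Declare `date` as the time index; there are no key variables
  as_tsibble(index = date)

births_sample_tsibble
```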

## # A tsibble: 5,479 x 4 [1D]
##    date       births date_of_month day_of_week
##    <date>      <dbl>         <dbl>       <dbl>
##  1 2000-01-01   9083             1           6
##  2 2000-01-02   8006             2           7
##  3 2000-01-03  11363             3           1
##  4 2000-01-04  13032             4           2
##  5 2000-01-05  12558             5           3
##  6 2000-01-06  12466             6           4
##  7 2000-01-07  12516             7           5
##  8 2000-01-08   8934             8           6
##  9 2000-01-09   7949             9           7
## 10 2000-01-10  11668            10           1
## # ℹ 5,469 more rows

Hmm… can we try to plot box plots over time (somewhat like Candle-Stick Plots)? By month, quarter, or year?

# Monthly box plots
births_tsibble %>%
  index_by(month_index = ~ yearmonth(.)) %>% # 180 months over 15 years
  # No need to summarise, since we want boxplots per year / month
  ggplot(., aes(y = births, x = date, 
                group =  month_index)) + # plot the groups
  
  geom_boxplot(aes(fill = month_index))      # 180 plots!!  

# Quarterly boxplots
births_tsibble %>%
  index_by(qrtr_index = ~ yearquarter(.)) %>% # 60 quarters over 15 years
  # No need to summarise, since we want boxplots per quarter
  ggplot(., aes(y = births, x = date, 
                group = qrtr_index)) +
  
  geom_boxplot(aes(fill = qrtr_index))        # 60 plots!!

# Yearwise boxplots
births_tsibble %>% 
  index_by(year_index = ~ lubridate::year(.)) %>% # 15 years, 15 groups
  # No need to summarise, since we want boxplots per year

  ggplot(., aes(y = births, 
                x = date, 
                group = year_index)) + # plot the groups
  
  geom_boxplot(aes(fill = year_index)) +           # 15 plots
  scale_fill_distiller(palette = "Spectral")

Although the graphs are very busy, they do reveal seasonal patterns at different time scales.
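
If the box plots are too busy, a lighter alternative is to summarise within each index_by() group and plot a single line. The sketch below uses randomly generated daily values for the year 2000, not the real births data; the names daily and monthly are made up for this illustration.

```r
library(dplyr)
library(tsibble)
library(ggplot2)

# Simulated daily data for the (leap) year 2000; NOT the real births data
set.seed(1)
daily <- tsibble(
  date   = as.Date("2000-01-01") + 0:365,
  births = round(rnorm(366, mean = 11000, sd = 1500)),
  index  = date
)

# Group the daily index by calendar month, then reduce each month to its mean
monthly <- daily %>%
  index_by(month_index = ~ yearmonth(.)) %>%
  summarise(mean_births = mean(births))

ggplot(monthly, aes(x = month_index, y = mean_births)) +
  geom_line() +
  geom_point()
```

The same pattern works with yearquarter() or lubridate::year() in place of yearmonth().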

A Workflow in R

Download the RMarkdown tutorial file by clicking the icon above and open it in RStudio or rstudio.cloud.
