Resources for Econ Grads: 2017

Saturday, September 23, 2017

Hamilton Filtering

In this post I present an R function that implements James D. Hamilton's alternative to the Hodrick-Prescott Filter as described in his article "Why You Should Never Use the Hodrick-Prescott Filter", forthcoming in the Review of Economics and Statistics. In the paper, Hamilton demonstrates how HP filtering creates artificial correlation between variables and suggests filtering using the residuals of the following regression:
$$ y_{t+h} = \beta_0 + \sum_{l=1}^p \beta_l y_{t+1-l} + \nu_{t+h}, $$
Here the fitted value of $y_{t+h}$ is the part of $y$ that can be predicted $h$ periods ahead from its $p$ most recent observations as of date $t$. This corresponds to the trend, i.e. what remains once fluctuations have died out after $h$ periods, while the residual $\nu_{t+h}$ is the cyclical component. $h$ should therefore be set to 2 years, the horizon over which we assume macroeconomic shocks to have worn off. $p$ is recommended to cover 1 year of data to remove possible seasonal components from the trend. I highly recommend reading Hamilton's blog post on the topic.
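For quarterly data, for example, these recommendations give $h = 8$ and $p = 4$, so the regression reads:
$$ y_{t+8} = \beta_0 + \beta_1 y_t + \beta_2 y_{t-1} + \beta_3 y_{t-2} + \beta_4 y_{t-3} + \nu_{t+8}. $$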
I implement this filter in the following R function based on the MATLAB code accompanying Hamilton's paper.

hamfilter <- function(yt, h=2*frequency(yt), p=frequency(yt), trend=FALSE){
  # Hamilton Filter
  #
  # R code to calculate cyclical component based on h-period-ahead
  # forecast error from linear regression
  # y(t+h) = b0 + b1*y(t) + ... + bp*y(t+1-p) +  e(t+h) 
  # as recommended in
  # James D. Hamilton, "Why You Should Never Use the Hodrick-Prescott Filter"
  # Review of Economics and Statistics, forthcoming
  # Code based on James D. Hamilton's matlab code
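  # inputs:
  #   yt = (T x 1) vector or time series, tth element is observation for date t
  #   h = forecast horizon of the regression,
  #   default is 2 times the frequency of the time series (i.e. 2 years)
  #   p = number of lags that should be considered in the regression of y(t+h),
  #   default is the frequency of the time series (i.e. 1 year)
  #   trend = if TRUE, the trend component is returned alongside the cycle
  # outputs:
  #   yreg = (T x 1) vector or time series, tth element is cyclical component for date t
  #   (optional) ytrend = (T x 1) vector or time series, tth element is trend component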
  #
  # works with any frequency of data
  # works with leading and trailing missing values (not internal)
  # preserves class of input data (i.e. time series, vector, etc.)
  #
  require(zoo)
  # create final output matrices to preserve input class and length
  yreg <- yt
  yreg[] <- NaN
  ytrend <- yt
  ytrend[] <- NaN
  # prepare data 
  y <- na.omit(yt)
  T <- length(y)
  lseq <- 0:-(p-1)
  X <- cbind(rep(1,T-p-h+1), as.matrix(lag(as.zoo(y),lseq)[((p):(T-h)),]))
  yh <- y[(p+h):T]
  # estimate and predict y(t+h) = b0 + b1*y(t) + ... + bp*y(t+1-p) + e(t+h)
  b <- solve(t(X) %*% X) %*% (t(X) %*% yh)
  ytrend[!is.na(yt)] <- c(rep(NaN,p+h-1),X%*%b)
  yreg[!is.na(yt)] <- y - c(rep(NaN,p+h-1),X%*%b)
  if(trend==TRUE){
    return(cbind(ytrend,yreg))
  } else {
    return(yreg)
  }
}
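
A quick way to sanity-check the filter without the zoo dependency is to run the same regression in base R. The sketch below is neither Hamilton's code nor the function above. It assumes a simulated quarterly random walk with no missing values, and the names `lags`, `fit`, `cycle` and `trend` are purely illustrative.

```r
# Base-R sketch of the Hamilton regression (assumption: y has no missing values)
set.seed(1)  # arbitrary seed, only for reproducibility
y <- ts(cumsum(rnorm(120)), start = c(1990, 1), frequency = 4)  # simulated quarterly data

h <- 2 * frequency(y)  # 8 quarters ahead
p <- frequency(y)      # 4 lags
n <- length(y)

# embed() returns rows (y(t), y(t-1), ..., y(t-p+1)) for t = p, ..., n
lags <- embed(as.numeric(y), p)
X  <- lags[1:(n - p - h + 1), , drop = FALSE]  # regressors dated t = p, ..., n-h
yh <- as.numeric(y)[(p + h):n]                 # targets dated t+h

fit <- lm(yh ~ X)  # lm() adds the intercept b0

# cyclical component: NaN for the first p+h-1 dates, regression residuals afterwards
cycle <- ts(c(rep(NaN, p + h - 1), residuals(fit)),
            start = start(y), frequency = frequency(y))
trend <- y - cycle  # fitted values, i.e. the trend component
```

With these settings the first `p + h - 1 = 11` observations have no cycle or trend value, because the regression needs `p` lags and an `h`-period lead before it can produce its first residual.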

Tuesday, September 12, 2017

A Function to Retrieve Time Series by Variable Number from CanSim (Statistics Canada)

In this post, I want to make a function available that allows you to download a time series from Statistics Canada directly into R. The function is a wrapper for the get_cansim_vector function from the 'cansim' package. It determines the frequency of a series and converts it into a time series.

All CanSim vector identifiers start with a "v". To find the identifier of a series, change the view of the table on the Statistics Canada website by clicking 'Add/Remove Data' => 'Customize Layout' and then ticking the box 'Display Vector identifier and coordinate'.

Below is the function that will do the trick. The function works with daily, weekly, monthly, quarterly and yearly data. If the series does not have one of these frequencies, the data is still returned, but not properly converted into a time series.

ts_cansim <- function(vector, start, end = NA){

  # load required packages
  require(cansim)
  require(lubridate)

  # get series
  if(is.na(end)){
    xx <- get_cansim_vector(vector, start_time = start)
  } else {
    xx <- get_cansim_vector(vector, start_time = start, end_time = end)
  }

  # determine frequency from the gap (in days) between the first two observations
  time.diff <- as.numeric(as.Date(xx$REF_DATE[2]) - as.Date(xx$REF_DATE[1]))

  # fallback: plain vector of values if none of the frequencies below matches
  x <- xx$VALUE
  st.date <- as.Date(xx$REF_DATE[1])

  # daily frequency (a gap of up to 4 days, e.g. over a weekend)
  # ts() only uses the first two elements of start, so use the day of the year
  if (1 <= time.diff & time.diff <= 4){
    n.frequency <- 365.25
    x <- ts(xx$VALUE, start = c(year(st.date), yday(st.date)), frequency = n.frequency)
  }

  # weekly frequency (start is given as year and week of the year)
  if (time.diff == 7){
    n.frequency <- 365.25/7
    x <- ts(xx$VALUE, start = c(year(st.date), week(st.date)), frequency = n.frequency)
  }

  # monthly frequency
  if (25 < time.diff & time.diff <= 31){
    n.frequency <- 12
    x <- ts(xx$VALUE, start = c(year(st.date), month(st.date)), frequency = n.frequency)
  }

  # quarterly frequency
  if (70 < time.diff & time.diff <= 100){
    n.frequency <- 4
    x <- ts(xx$VALUE, start = c(year(st.date), quarter(st.date)), frequency = n.frequency)
  }

  # yearly frequency
  if (300 < time.diff & time.diff <= 400){
    n.frequency <- 1
    x <- ts(xx$VALUE, start = year(st.date), frequency = n.frequency)
  }

  return(x)

}
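
The frequency detection can be tried offline, without downloading anything. The sketch below isolates that step in a hypothetical helper, `guess_frequency`, which is not part of the function above or of the cansim package. It uses the same day-gap thresholds and a few made-up dates.

```r
# Hypothetical helper: classify the sampling frequency from the gap (in days)
# between the first two observation dates, using the thresholds from ts_cansim.
guess_frequency <- function(dates) {
  gap <- as.numeric(as.Date(dates[2]) - as.Date(dates[1]))
  if (gap >= 1 && gap <= 4)    return(365.25)      # daily
  if (gap == 7)                return(365.25 / 7)  # weekly
  if (gap > 25 && gap <= 31)   return(12)          # monthly
  if (gap > 70 && gap <= 100)  return(4)           # quarterly
  if (gap > 300 && gap <= 400) return(1)           # yearly
  NA_real_                                         # unknown frequency
}

f_monthly   <- guess_frequency(c("2020-01-01", "2020-02-01"))  # 31 days apart
f_quarterly <- guess_frequency(c("2020-01-01", "2020-04-01"))  # 91 days apart
f_unknown   <- guess_frequency(c("2020-01-01", "2020-01-15"))  # 14 days apart

# made-up values turned into a monthly time series starting in January 2020
x <- ts(c(100.0, 100.4, 100.1), start = c(2020, 1), frequency = f_monthly)
```

A gap of 14 days falls outside every range, which is the case where the function above returns the raw values instead of a time series.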

Now we can call the function retrieving the variable from CanSim.

Let me know if you have any comments or suggestions.

- this post was updated 2020-11-16 -

Thursday, February 16, 2017

Creating a Dataset with Multiple Lags of Each Variable

In many forecasting exercises it is useful to include several lags of a variable as potential predictors. I have not been able to find a nice solution online to create a data set that includes several lags of each variable, so I want to share my solution.
The goal is to create h=12 lags of each variable in a data set. So if you have three variables in a matrix, say [x1, x2, x3], you want a new matrix of the form [x1_l0, x1_l1, x1_l2, ... , x1_lh, x2_l0, x2_l1, x2_l2, ... , x2_lh, x3_l0, x3_l1, x3_l2, ... , x3_lh], where h is the maximum lag length.

To do this, we'll need the zoo package, which can create multiple lags of a series at once and pads the resulting matrix of lags with missing values. The example time series are simulated with the arima.sim function.

First, we will create a sample data set of time series using the arima.sim function (part of R's stats package):
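
The post's own example code is not shown here, so the following is only a sketch of what such a sample data set could look like. The seed, series length, AR coefficients, start date and monthly frequency are all made-up assumptions. `arima.sim` is available in a plain R session because it is part of the stats package.

```r
set.seed(123)  # arbitrary seed, only for reproducibility
n <- 100       # assumed number of observations

# three simulated AR(1) series with made-up coefficients,
# bound into one multivariate monthly time series
x <- ts(cbind(
  x1 = as.numeric(arima.sim(model = list(ar = 0.5),  n = n)),
  x2 = as.numeric(arima.sim(model = list(ar = 0.8),  n = n)),
  x3 = as.numeric(arima.sim(model = list(ar = -0.3), n = n))
), start = c(2000, 1), frequency = 12)

head(x)
```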

Now, we will create the lagged data set. The code avoids loops by using the sapply function, thus creating a list with each variable and its lags as elements. The list is then concatenated back into a matrix using the do.call function. The next step is to assign each predictor the variable name combined with its respective lag. This is done by using the same trick as before, concatenating a list that includes as its elements a combination of the variable name and the respective lag.
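
The steps just described can be sketched on a tiny made-up matrix, where the result is easy to check by eye. This is not the post's original code: it uses base R only, with a hypothetical helper `lag_k` standing in for zoo's lagging, `lapply` for the outer step so that the result is guaranteed to be a list, and h = 2 instead of 12 to keep the output small.

```r
# toy data: 3 variables, 6 observations (made-up numbers)
x <- cbind(x1 = 1:6, x2 = 11:16, x3 = 21:26)
h <- 2  # maximum lag length (the post uses h = 12)

# hypothetical helper: shift a vector down by k periods, padding with NA
lag_k <- function(v, k) c(rep(NA, k), head(v, length(v) - k))

# one list element per variable: a matrix with columns for lag 0, 1, ..., h
lag_list <- lapply(colnames(x), function(nm) {
  sapply(0:h, function(k) lag_k(x[, nm], k))
})

# concatenate the list back into one matrix
x_lags <- do.call(cbind, lag_list)

# same trick for the names: variable name combined with its lag
colnames(x_lags) <- do.call(c, lapply(colnames(x), function(nm) {
  paste0(nm, "_l", 0:h)
}))

x_lags
```

The resulting matrix has `ncol(x) * (h + 1)` columns, ordered variable by variable, and the first k rows of each lag-k column are missing.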