Interrupted time series & regression discontinuity

Causal impact assessment workshop

Author

Erik-Jan van Kesteren & Oisín Ryan

In this practical, you will create several versions of interrupted time-series models for estimating the counterfactual cigarette sales. For the advanced time-series models with autoregression and differencing, we will use the package fpp3. First, load the following two packages:

library(tidyverse)
library(fpp3)

We will again be using the proposition 99 dataset:

prop99 <- read_rds("raw_data/proposition99.rds")

Data preparation

In this practical, we will need to transform our dataset into a tsibble (a time-series table object). This is necessary for the fpp3 package to figure out which column indicates time. We also only need the cigarette sales data from California for this practical You can prepare the data by running the following code:

# try to figure out what each line does!
prop99_ts <- 
  prop99 |> 
  filter(state == "California") |> 
  select(year, cigsale) |>
  mutate(prepost = factor(year > 1988, labels = c("Pre", "Post"))) |> 
  as_tsibble(index = year) |> 
  mutate(year0 = year - 1989)

Note that we have also already included a prepost variable in the preparation which can be used to filter the pre or post-intervention time-series.

Growth curve

Through estimating the effect of time on the pre-intervention time-series, we can create a prediction for the post-intervention counterfactual which includes a trend. In some fields, this approach is called estimating a “growth curve”.

Exercise 1

Create the linear growth curve model from the slides: use linear regression to predict cigarette sales with the time variable year in the pre-intervention period. What is the estimated year-over-year decrease in cigarette sales?

Code

# here fit a very simple growth curve simple model
fit_growth <- lm(
  formula = cigsale ~ year, 
  prop99_ts |> filter(prepost == "Pre")
)

summary(fit_growth)
# here we see a negative slope; decrease of 1.78 per year

Exercise 2

Create predictions for the post-intervention period from this model. Then, estimate and interpret the causal effect(s) of the policy.

Code

pred <- predict(
  object = fit_growth, 
  newdata = prop99_ts |> filter(prepost == "Post")
)

# effect at each time-point T
ce_growth <- prop99_ts |> filter(prepost == "Post") |> pull(cigsale) - pred

# You could summarize these effects by taking the mean of them
mean(ce_growth)

# on average, 28.27 fewer cigarette packages per 100000 
# people sold each year due to the intervention

Time-series model

Time-series techniques also take into account autocorrelations and the idea that recent values of the outcome of interest have more predictive power over the current value than values far in the past. With the fpp3 package, we can do a data-driven model selection for the proposition 99 dataset and produce a counterfactual with automatic uncertainty quantification.

Exercise 3

Fit an ARIMA() model using only the pre-intervention cigarette sales data in California. Then, using the forecast() function, create forecasts for 12 years and plot those forecasts with the autoplot() function. What do you notice about the uncertainty in this plot?

Code

# create ARIMA model. NB: this user interface of the
# FPP3 package is slightly idiosyncratic. Don't worry
# too much about it!
fit_arima <-  
  prop99_ts |> 
  filter(prepost == "Pre") |> 
  # here we use the corrected AIC to do model selection
  # many other methods of model selection are also 
  # possible (e.g. cross-validation)
  model(timeseries = ARIMA(cigsale, ic = "aicc")) 

# create forecasts
fcasts <- forecast(fit_arima, h = "12 years") 

# plot the forecasts
fcasts |> autoplot(prop99_ts)

# Uncertainty interval becomes wider as we move further
# in time.

Exercise 4

Use the ARIMA forecasts to estimate the causal effect of the proposition 99 policy intervention.

Code

observed_cigsale <- 
  prop99_ts |> 
  filter(prepost == "Post") |> 
  pull(cigsale)

# you can use the predicted means directly
mean(fcasts$.mean - observed_cigsale)

# or you can get a prediction interval of the differences
# by using the distribution objects in the forecasts
# again, very idiosyncratic interface but this is how you
# could get uncertainty around your ACE estimate
ace_distribution <- sum(fcasts$cigsale - observed_cigsale) / 12
ace_distribution

# using the hilo function you can get 95% intervals:
hilo(ace_distribution)

# So the ACE is small and not significantly different from 0

Regression Discontinuity

Regression discontinuity designs (RDDs) share many similarities to Interrupted Time Series approaches. In an RDD analysis, we typically fit a piecewise linear model of some kind, and test whether the relationship between two variables changes on either side of a threshold. In the context of Interrupted Time Series, this often amounts to fitting a growth-curve type model on the full time-series, including main and interaction effects of an intervention indicator.

Exercise 5

Using lm() and prop99_ts, fit the piecewise growth curve model described in the slides. For time use the year0 column, in which time is centered around the intervention moment (i.e 1989 represents \(time =0\)). Make use of the dummy indicator prepost. Based on your model, does the intervention have an effect on the trend in cigarette sales?

Code

# year0 is scaled in this way to aid interpretability 
# of the intercept terms, but has no effect on the slope terms
# Note that the model is a little more complex than in 
# Exercise 1, but is fit to the full time series
fit_rdd <- lm(cigsale ~ year0 + prepost + year0:prepost, prop99_ts)

# inspect the parameter estimates
summary(fit_rdd)

# the interaction effect year0:prepost parameterizes 
# the change in trend after the intervention -1.4947, 
# p = .005 indicates a significant change in trend after 
# the intervention the trend is "more strongly negative" 
# after the intervention

# You can also visualize your RDD model using the below code
pred_df |> 
  ggplot(aes(x = year, y = cigsale)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2, 
              data = pred_df |> filter(prepost == "Pre")) +
  geom_line(aes(y = fit), data = pred_df |> filter(prepost == "Pre")) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2, 
              data = pred_df |> filter(prepost == "Post")) +
  geom_line(aes(y = fit), data = pred_df |> filter(prepost == "Post")) +
  geom_line(linewidth = 1, color = "darkgreen") +
  ylim(0, 150) +
  theme_minimal() + 
  geom_vline(xintercept = 1988, linetype = 2) +
  annotate("label", x = 1988, y = 150, label = "Intervention")

Conclusion

In this practical, you have used growth curves and time-series models to estimate the effect of the proposition 99 policy intervention. You have seen that these models can be used to “impute” the counterfactual or to directly parameterize the change in the target variable after the intervention. There are many details we skipped over here, such as how to best perform model selection, and the many different RDD-type analyses you could perform, but this provides a basic starting point. Notice how different model types can provide both different point estimates of the causal effect, and very different quantifications of our uncertainty around that causal effect. In particular ARIMA-type models, explicitly designed to forecast, will often reflect the idea that we become more and more uncertain about the future the farther ahead we want to predict. There is likely no simple answer to the question of which of these models or approaches should be preferred in practice.