16  Flight Delays

In this case study, we’ll analyze data on airline flights in the United States and their delays. The data, published by the Bureau of Transportation Statistics, is popular in examples and tutorials all across data science—there’s even an R package containing an extract to make it easier to include in examples. As there are examples all across the Internet with this data, I can give a sample data analysis with less worry that I’m giving something away and ruining someone’s carefully crafted homework assignment.

This case study is designed to provide a realistic research question that can be answered with regression. The question is phrased in business terms, rather than statistical terms, as statisticians must learn to translate research questions into statistical ones and translate statistical results into business and policy recommendations.

We’ll begin with a description of the data and problem before figuring out how to model it and giving a sample data analysis report.

16.1 Problem statement

16.1.1 Background

In the United States, the Bureau of Transportation Statistics records data on each airline flight conducted by US airlines over a certain size. The data includes the flight date, carrier, its scheduled departure and arrival times, and its actual departure and arrival times—allowing us to examine delays in detail.

16.1.2 Data

The data file pit-flights.csv.gz contains data on every flight from Pittsburgh International Airport in 2023 by the covered airlines.1 The variables include:

  • year, month, day: The date of the flight
  • dep_time, sched_dep_time: The departure time and scheduled departure time of the flight, in 24-hour format (e.g. 505 = 5:05am, 1340 = 1:40pm)
  • dep_delay: Departure delay (minutes)
  • arr_time, sched_arr_time, arr_delay: The arrival time, scheduled arrival time, and arrival delay, in the same formats as for departures
  • carrier: The airline as a two-letter code (the IATA airline designator)
  • flight: The airline’s flight number for this flight
  • tailnum: The tail number (basically a serial number) for the airplane that flew the flight
  • origin: The origin airport (PIT for all flights)
  • dest: The destination airport, as a three-letter FAA airport code
  • air_time: Time the flight spent in the air, in minutes
  • distance: Distance between the origin and destination, in miles
  • hour, minute: Time of the scheduled departure, separated into hour and minutes
  • time_hour: Date and hour of the flight in ISO 8601 timestamp form. (If you’re not familiar with dates and times in R, look at lubridate; its ymd_hms() function can automatically parse this into a date-time object, and its various accessor functions can extract different components of the date and time.)
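
Because dep_time and the other clock-time variables are stored as integers rather than as times, subtracting them directly is misleading: 600 minus 555 is 45, but the two times are only 5 minutes apart. Here is a minimal sketch of a conversion to minutes since midnight. The helper name hhmm_to_minutes() is hypothetical and is not part of the data or of any package.

```r
# Hypothetical helper (not from the original text): convert clock times
# stored as integers in HHMM form into minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  hours <- hhmm %/% 100    # integer division keeps the hour digits
  minutes <- hhmm %% 100   # the remainder keeps the minute digits
  hours * 60 + minutes
}

hhmm_to_minutes(c(505, 1340))                 # 5:05am and 1:40pm
hhmm_to_minutes(600) - hhmm_to_minutes(555)   # 5 minutes apart, not 45
```

Missing times stay missing (NA), so the helper can be applied to a whole column without special handling.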

16.1.3 Research questions

You have been hired by Indiana Airways, a budget carrier that is considering expanding to offering service from Pittsburgh.2 As a budget airline, they are extremely concerned about reducing costs.3 Delays are expensive because flight crews have to be paid for longer, and because passengers may miss connections (which Indiana Airways must rebook) or need refunds. Indiana Airways would like to understand typical delays for flights from Pittsburgh so they can plan their service.

Specifically, the Senior Associate Vice President for Operations has several questions:

  1. Which times of year and days of the week have the most delays? Compare both the fraction of flights delayed more than 15 minutes and the typical delay amounts.
  2. Some airline staff believe that departure delays are less important on longer flights, as the pilots have more time to make up for the delay by flying faster or adjusting their route. Does this appear to occur in the data—that is, do pilots seem to make up for departure delays on longer flights, compared to shorter ones?

Write a report analyzing the data and answering these questions.

16.2 Exploration

Before starting any data analysis task, we should do exploratory data analysis. We have three goals:

  1. Ensure we understand what the data represents and what each observation means.
  2. Identify any problems, missing data, typos, and so on.
  3. Make plots that could help answer the research questions or indicate what methods we should use to do so.

Often the first two can be done together, as we check the data and check our understanding of it.

16.2.1 Understanding and checking the data

Let’s start by loading the data.

library(dplyr)

flights <- read.csv("data/pit-flights.csv.gz")

We observe there are 42,130 rows. In Table 16.1 I’ve printed 10 rows from the data, selecting out a few of the key variables we might use. As expected, all the flights are from PIT (Pittsburgh). According to a random great circle distance calculator I found online, the distance from Pittsburgh to Denver (DEN) is 1,290 miles, matching the distance shown for the first flight, so the distance units are correct.
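
If you would rather not trust an online calculator, the same check can be done in a few lines with the haversine formula. This is only a sketch: the airport coordinates below are approximate values I am assuming for illustration (they are not in the data file), and haversine_miles() is a hypothetical helper name.

```r
# Great-circle (haversine) distance in statute miles.
# Coordinates are in decimal degrees; the radius is the Earth's mean radius.
haversine_miles <- function(lat1, lon1, lat2, lon2, radius = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(sqrt(a))
}

# Approximate airport coordinates (assumed, not taken from the data)
pit <- c(lat = 40.4915, lon = -80.2329)
den <- c(lat = 39.8561, lon = -104.6737)

pit_den <- haversine_miles(pit[["lat"]], pit[["lon"]],
                           den[["lat"]], den[["lon"]])
pit_den  # should land within a few miles of the 1,290 recorded in the data
```

A result in kilometers would be roughly 1.6 times larger, so even a rough calculation like this is enough to confirm the units.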

Table 16.1: An extract of the first 10 rows of the flights data, showing some key variables.
time_hour flight carrier dep_time sched_dep_time dep_delay origin dest air_time distance
2023-01-01T05:00:00Z 67 WN 505 500 5 PIT DEN 194 1290
2023-01-01T05:00:00Z 137 WN 509 510 -1 PIT TPA 128 873
2023-01-01T05:00:00Z 43 WN 540 540 0 PIT BWI 44 210
2023-01-01T06:00:00Z 114 AA 555 600 -5 PIT PHL 46 268
2023-01-01T06:00:00Z 198 YX 555 600 -5 PIT EWR 52 319
2023-01-01T06:00:00Z 49 AA 556 600 -4 PIT MIA 141 1013
2023-01-01T06:00:00Z 122 DL 559 600 -1 PIT ATL 80 526
2023-01-01T06:00:00Z 123 UA 600 600 0 PIT IAH NA 1117
2023-01-01T06:00:00Z 50 WN 602 605 -3 PIT MCO 116 834
2023-01-01T06:00:00Z 4 WN 604 605 -1 PIT MDW 121 402

The times match the described format, but there’s one problem. Notice that each time in the time_hour column ends with Z; in ISO 8601 that denotes UTC, which for our purposes is the same as Greenwich Mean Time. Depending on Daylight Saving Time, Pittsburgh is 4 or 5 hours behind GMT, so 5:00Z is about midnight in Pittsburgh. What time zone is the data really in? At most airports, the first flights of the day are around 5 or 6am, and arriving flights end around midnight. Let’s make a histogram of the departure hours as recorded in the time_hour column:

library(lubridate)
library(ggplot2)

# hour() and ymd_hms() from lubridate:
flights |>
  mutate(hour = hour(ymd_hms(time_hour))) |>
  ggplot(aes(x = hour)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Departure hour", y = "Flights")

The flights are indeed between about 5am and 10-11pm, so these times are clearly local time, not Greenwich Mean Time—the data is mislabeled.4 We’ll keep this in mind when reporting results about the times of day with the most delays.
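
If we did need correctly labeled timestamps, one option is to re-read the clock times as Pittsburgh local time. The sketch below uses base R and assumes the "America/New_York" zone applies to Pittsburgh; the July timestamp is a made-up example to show the Daylight Saving case. (lubridate’s force_tz() does the same relabeling for a date-time that has already been parsed.)

```r
# Two timestamps in the same form as the time_hour column; the July value
# is a made-up example to show the Daylight Saving Time case.
stamps <- c("2023-01-01T05:00:00Z", "2023-07-01T05:00:00Z")

# Treat the trailing "Z" as a literal character and declare the clock
# times to be Pittsburgh local time instead of UTC.
local_time <- as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%SZ",
                         tz = "America/New_York")

format(local_time, "%Y-%m-%d %H:%M %Z")  # still 05:00, now labeled EST or EDT
format(local_time, "%z")                 # offset from UTC on each date
```

The clock time is unchanged; only the instant it refers to moves by 4 or 5 hours.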

Next, let’s examine the delay amounts. Outliers here could throw off our analysis. The code below uses patchwork to put two ggplots side-by-side.

library(patchwork)

g1 <- ggplot(flights, aes(x = dep_delay)) +
  geom_histogram() +
  labs(x = "Departure delay (minutes)",
       y = "Flights")

g2 <- ggplot(flights, aes(x = arr_delay)) +
  geom_histogram() +
  labs(x = "Arrival delay (minutes)",
       y = "Flights")

g1 | g2

But if you run this code yourself, you’ll notice there were two warning messages:

Warning messages:
1: Removed 544 rows containing non-finite outside the scale range (`stat_bin()`).
2: Removed 637 rows containing non-finite outside the scale range (`stat_bin()`).

That’s concerning. Let’s check the summaries of the delay variables to see what “non-finite” values there might be:

summary(flights$dep_delay)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
 -40.000   -7.000   -4.000    7.476    2.000 1709.000      544 
summary(flights$arr_delay)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
 -65.000  -18.000   -9.000    1.525    4.000 1703.000      637 

So there are NAs (missing values) for between 1% and 2% of flights: 544 departure delays and 637 arrival delays out of 42,130 flights. It’s not clear why they might be missing, so we will look at an extract of a few flights with missing values below.

flights |>
  filter(is.na(arr_delay)) |>
  select(time_hour, carrier, flight, dep_time, sched_dep_time,
         dep_delay, arr_time, sched_arr_time) |>
  head(n = 5) |>
  knitr::kable()
time_hour carrier flight dep_time sched_dep_time dep_delay arr_time sched_arr_time
2023-01-01T06:00:00Z UA 123 600 600 0 1229 828
2023-01-02T08:00:00Z YX 227 821 830 -9 1133 958
2023-01-02T16:00:00Z G4 160 NA 1627 NA NA 1900
2023-01-02T05:00:00Z WN 67 NA 505 NA NA 650
2023-01-03T13:00:00Z 9E 228 NA 1310 NA NA 1440

In some of them, the departure time is present, but there is no arrival time. Perhaps these were flights that were diverted or returned to the original airport due to problems. In others, there is no departure or arrival time. It’s possible these flights were simply canceled. (Indeed, the first table I found on Google with information about canceled flights indicated about 1-2% are canceled, matching the rate of missingness.)

So what do we do with these flights? Indiana Airways didn’t say anything about cancellations, and we don’t know how or why they were canceled. In principle, a canceled flight is a major delay (passengers have to wait until the next flight with empty seats), but we don’t know how much of a delay. Hence we can’t include these flights in average delay calculations, but we could choose to count them when calculating the fraction of delayed flights, making it the fraction of delayed or canceled flights. On the other hand, Indiana Airways doesn’t have to pay its pilots for not flying, so perhaps they don’t care. We will omit these flights from our analysis, but ideally we would first ask Indiana Airways whether that is acceptable.
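
To see how much the choice matters, we can label flights by which values are missing and compute the delay rate both ways. The sketch below uses the five rows from the extract above plus two made-up completed flights; the status labels are guesses about what the missingness means, not facts recorded in the data.

```r
# Five rows copied from the extract above, plus two made-up completed
# flights (the last two rows) so every category appears.
demo <- data.frame(
  dep_time  = c(600, 821, NA, NA, NA, 1405, 710),
  arr_delay = c(NA, NA, NA, NA, NA, 22, -5)
)

# Tentative labels: these are guesses about why values are missing.
demo$status <- ifelse(is.na(demo$dep_time), "never departed",
               ifelse(is.na(demo$arr_delay), "departed, no arrival",
                      "completed"))
table(demo$status)

# Delay rate among completed flights only
completed <- !is.na(demo$arr_delay)
rate_completed_only <- mean(demo$arr_delay[completed] > 15)

# Delay rate counting flights with no arrival delay as delayed
rate_with_missing <- mean(is.na(demo$arr_delay) | demo$arr_delay > 15)

c(completed_only = rate_completed_only, with_missing = rate_with_missing)
```

With only 1 to 2% of real flights affected, the two rates will differ far less on the full data than in this toy example.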

16.2.2 Plotting the research questions

Let’s try to make plots and tables that can help answer the research questions. First, what times of year and days of week have the most delays? Indiana Airways did not say whether arrival or departure delays are most important, but we can infer: it is the arrival delay that determines if a customer will miss their connecting flight at the next airport, and the flight crew are being paid until the flight arrives at its destination, so arrival delays are what matter for costs. Let’s plot them by month and day of week.

g1 <- ggplot(flights, aes(x = factor(month), y = arr_delay)) +
  geom_boxplot() +
  labs(x = "Month", y = "Arrival delay (minutes)")

# wday() is from lubridate
g2 <- ggplot(flights, aes(x = wday(time_hour, label = TRUE), y = arr_delay)) +
  geom_boxplot() +
  labs(x = "Day of week", y = "Arrival delay (minutes)")

g1 | g2

But because the delay distributions are so skewed, as we saw above, these boxplots are nearly useless. More detailed plots by month, day, or hour would be hard to read because of how many plots would be necessary. This calls for tables. We’ve used knitr’s kable() function before, but for beautiful tables for publications, gt is like ggplot for tables.

library(gt)

flights |>
  filter(!is.na(arr_delay)) |>
  # use lubridate to turn month numbers into text:
  mutate(month = month(time_hour, label = TRUE, abbr = FALSE)) |>
  group_by(month) |>
  summarize(arr_mean = mean(arr_delay),
            arr_median = median(arr_delay),
            arr_75 = quantile(arr_delay, probs = 0.75),
            pct_delayed = mean(arr_delay > 15),
            n = n()) |>
  gt() |>
  tab_header(title = "Flight delays from Pittsburgh") |>
  tab_spanner(label = "Arrival delay",
              columns = c(arr_mean, arr_median, arr_75)) |>
  cols_label(month = "Month",
             arr_mean = "Mean",
             arr_median = "Median",
             arr_75 = "75th pct.",
             pct_delayed = "% delayed",
             n = "Flights") |>
  fmt_number(arr_mean, decimals = 1) |>
  fmt_number(c(arr_75, n), decimals = 0) |>
  fmt_percent(pct_delayed, decimals = 1) |>
  cols_align("left", month)
Flight delays from Pittsburgh
             Arrival delay (minutes)
Month        Mean   Median   75th pct.   % delayed   Flights
January       2.9       -9           7       18.4%     3,299
February     -4.3      -12           0       11.9%     3,049
March         4.3       -7           7       18.0%     3,579
April         3.5       -7           6       15.9%     3,450
May          -2.9       -9           0       10.3%     3,618
June          8.6       -6           9       20.5%     3,431
July         11.9       -6          15       24.4%     3,464
August        2.6      -10           4       15.1%     3,669
September     3.7       -8           3       14.9%     3,341
October      -3.5      -10          -1       10.6%     3,742
November     -4.6      -11           0       10.0%     3,496
December     -4.4      -12           0       10.9%     3,355

It’s clear that the worst delays are in June and July, not December, when you’d expect the holiday rush and winter weather to cause problems. January comes in third by delay rate (18.4%, just ahead of March at 18.0%), perhaps because the worst winter weather in Pittsburgh is in January and February, not December.

We can make similar plots or tables for days of the week and hours of the day, but to avoid redundancy, we’ll display those below in the report.
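
As a rough illustration of how the same summary generalizes to any grouping, here is a base R sketch. The helper name delay_summary() and the toy delays are made up for illustration; on the real data one would pass flights$arr_delay and a day-of-week factor instead.

```r
# Hypothetical helper: summarize arrival delays within each group,
# dropping flights whose arrival delay is missing.
delay_summary <- function(arr_delay, group) {
  keep <- !is.na(arr_delay)
  arr_delay <- arr_delay[keep]
  group <- droplevels(factor(group[keep]))
  data.frame(
    group       = levels(group),
    arr_mean    = as.vector(tapply(arr_delay, group, mean)),
    arr_median  = as.vector(tapply(arr_delay, group, median)),
    arr_75      = as.vector(tapply(arr_delay, group, quantile, probs = 0.75)),
    pct_delayed = as.vector(tapply(arr_delay > 15, group, mean)),
    n           = as.vector(tapply(arr_delay, group, length))
  )
}

# Toy data: 2023-01-01 was a Sunday, 2023-01-02 a Monday.
dates <- as.Date(c("2023-01-01", "2023-01-01",
                   "2023-01-02", "2023-01-02", "2023-01-02"))
delays <- c(-9, 30, 4, NA, 20)

# Build day-of-week labels without depending on the system locale.
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
wday <- factor(day_labels[as.POSIXlt(dates)$wday + 1], levels = day_labels)

by_day <- delay_summary(delays, wday)
by_day
```

Keeping the day labels as a factor with explicit levels means the rows come out in calendar order rather than alphabetical order.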

To explore the second research question, that departure delays may be less important on longer flights than shorter ones, a plot is not immediately obvious. There are three variables in question: departure delay, arrival delay, and distance. (We use distance instead of total flight time, which is also in the data, for reasons explored in Exercise 2.9.) We would need a 3D plot to visualize them all at once, which would be hard to read.

Instead, let’s consider the gain: the difference between departure and arrival delay. A positive value indicates the arrival delay is smaller than the departure delay, and so the pilots made up lost time somehow. How does the gain vary with distance?

ggplot(flights, aes(x = distance, y = dep_delay - arr_delay)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  labs(x = "Distance (miles)", y = "Delay gain (minutes)")

The relationship is not particularly strong. But notice the largest gains (over 50 minutes) are on flights over 1,000 miles, while the largest losses (flights that fell more than 2 hours further behind) are on flights under 1,000 miles. Perhaps there is a relationship, and perhaps it will become clearer when we control for time of year and other factors.

However, the plot also hints at a serious problem. Notice how the flights over 1,500 miles appear in vertical lines. That’s because there are only a few major airports with flights from Pittsburgh greater than that distance:

flights |>
  filter(distance > 1500) |>
  count(dest)
  dest   n
1  LAS 844
2  LAX 361
3  PHX 463
4  SEA 435
5  SFO 478
6  SLC   2

If the arrival delay is affected by the destination in any way, perhaps because of local air traffic control policies or how busy the airports typically are, then distance is confounded with destination: the long flights are only to certain destinations. Perhaps pilots can make up for delays when flying to Las Vegas (LAS), Los Angeles (LAX), Phoenix (PHX), Seattle (SEA), or San Francisco (SFO), not because of their distance but because of how those airports work. We would not be able to determine this from our data. Additionally, those airports may be served by only certain airlines that have different procedures from those doing shorter flights:

flights |>
  filter(distance > 1500) |>
  count(carrier) |>
  arrange(desc(n))
  carrier   n
1      NK 787
2      WN 612
3      UA 478
4      AS 435
5      AA 268
6      DL   2
7      G4   1
flights |>
  filter(distance < 500) |>
  count(carrier) |>
  arrange(desc(n))
   carrier     n
1       YX 10872
2       AA  3782
3       WN  3332
4       B6  1155
5       OO  1034
6       UA  1013
7       9E   904
8       OH   775
9       NK   560
10      MQ   323
11      DL   278
12      G4   210

Indeed, long flights tend to be on airlines like Southwest (WN) and Spirit (NK), while short flights tend to be on carriers like Republic Airways (YX), who operate short-haul flights for American, Delta, and United.

16.3 Modeling decisions

Next, we need to decide how to answer the research questions using the statistical methods covered so far during the course.

Research question 1, on the times and days with the most delays, does not seem to require any modeling or inference. It will be sufficient to display plots or tables highlighting the differences in delay rates.

Research question 2, on whether pilots make up for departure delays on longer flights, may require modeling. The gain variable defined above seems like a reasonable response, in which case the research question is: is the gain larger for longer-distance flights?

This suggests a model of gain ~ distance, and the research question asks for the sign on distance. But is that sufficient, or would it be useful to control for other variables? Figure 16.1 shows a causal diagram of the relevant variables, based on considerations discussed above in the EDA. Different airlines operate flights to different destinations, and so their flights have different distances; they may also have different flight policies that affect how much pilots try to make up for delays. Perhaps different destinations have different air traffic control policies that affect the possible gain. And destination affects the choice of departure times—airlines schedule flights based on time zones and time changes that affect the local time when the flight arrives at the destination. Finally, perhaps the amount of departure delay affects the gain: on a flight that’s not delayed, pilots don’t try to speed up, but on delayed flights, pilots try to gain some extra minutes before arrival.

Figure 16.1: A plausible causal diagram of the relevant variables.

The diagram suggests we should control for the air carrier as a possible confounding variable. We would also like to control for destination, but as every flight is from Pittsburgh, distance is completely determined by destination: each destination has only one possible distance. So we cannot include destination without making the model perfectly collinear. The consideration about departure delays suggests we should include an interaction or otherwise allow the distance \(\to\) gain relationship to depend on departure delay.
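
The collinearity problem is easy to demonstrate on simulated data. In this sketch the three destinations and their distances come from Table 16.1, but the gains are random numbers generated purely for illustration.

```r
set.seed(1)  # simulated gains only; not real data

# Three destinations with the distances shown in Table 16.1
dest_distance <- c(DEN = 1290, TPA = 873, BWI = 210)
sim <- data.frame(dest = rep(names(dest_distance), each = 20))
sim$distance <- unname(dest_distance[sim$dest])
sim$gain <- 5 + 0.002 * sim$distance + rnorm(nrow(sim), sd = 10)

# Each destination has exactly one distance...
tapply(sim$distance, sim$dest, function(d) length(unique(d)))

# ...so once destination is in the model, distance adds no new information
# and lm() reports NA for its coefficient.
fit <- lm(gain ~ dest + distance, data = sim)
coef(fit)
```

Note that lm() drops whichever collinear term comes last in the formula, so writing gain ~ distance + dest would instead return NA for one of the destination coefficients. Either way, the two effects cannot be separated.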

16.4 Producing a report

16.5 The report

Now we must translate this analysis into a report. Chapter 25 describes the basic structure we’ll use. Our report will omit the code and only show plots and tables that are important—that is, important for the reader to understand to interpret our results.

The report is annotated with various notes commenting on its style and structure. Review these notes for more information on designing a good report.

16.5.1 Executive summary

16.5.2 Introduction

16.5.3 Methods

16.5.3.1 Data

Data description and “Table 1”

Notice that we give a detailed summary of the flights from Pittsburgh in Table 16.2. This is a classic “Table 1”: as described in Section 25.1.3, we typically give tables describing the data so readers understand the sample and can judge whether it is representative of the population they’re interested in. For instance, if Indiana Airways is interested only in routes of certain distances, they can judge if our sample is appropriate for that.

Our data includes all flights from Pittsburgh International Airport by 14 major air carriers during 2023, as reported to the Bureau of Transportation Statistics. A breakdown of carriers and distances is shown in Table 16.2. There are 42,130 flights in total, spread relatively evenly throughout the year, as shown in Figure 16.2.

Table 16.2: 2023 flights from Pittsburgh International Airport, broken down by airline and distance.
                    On time   Delayed   Flights
Distance
  < 750 miles           87%       13%    28,818
  750-1500 miles        80%       20%    10,729
  > 1500 miles          81%       19%     2,583
Airline
  YX                    91%        9%    10,896
  DL                    89%       11%     2,700
  9E                    88%       12%       965
  OO                    87%       13%     2,014
  UA                    86%       14%     2,875
  OH                    85%       15%       775
  MQ                    83%       17%       437
  AS                    82%       18%       435
  WN                    82%       18%     9,452
  AA                    82%       18%     5,929
  G4                    81%       19%     1,320
  B6                    79%       21%     1,155
  F9                    76%       24%       180
  NK                    74%       26%     2,997
Total                   85%       15%    42,130

Figure 16.2: Flights per week from Pittsburgh International Airport in 2023. Dips in flight counts appear during the winter holidays, July 4th, and Labor Day in early September.

Of these flights, 637 have missing departure or arrival delay information, perhaps because the flights were canceled or diverted from their intended destination. As these flights do not provide complete delay information and are a small proportion of the total number of flights, they are excluded from our analysis.

16.5.3.2 Delay analysis

To identify the times of year, days of week, and times of day with the most delays, we break flights down by month, day of week, and departure hour, and calculate mean, median, and 75th percentile arrival delays, along with the percentage of flights arriving more than 15 minutes late. We study arrival delays rather than departure delays because it is late arrivals that cause customers to miss connections and cause flight crews to remain on duty past their scheduled hours, so controlling arrival delays is of paramount importance to Indiana Airways.

16.5.3.3 Delay gains analysis

To study whether pilots can make up for departure delays on longer flights by flying faster, we examine delay gains: the difference between departure delay and arrival delay for each flight. A positive gain indicates the flight was less delayed at arrival than it was at departure, suggesting the flight crew made up for lost time. If longer-distance flights have higher gains than shorter-distance flights, then perhaps crews have a deliberate strategy of catching up from departure delays.

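We constructed a linear regression to predict gain using distance, departure delay, and their interaction, so that the relationship between distance and gain can depend on how late the flight departed. We controlled for air carrier, as different carriers fly different routes (affecting distance) and may have different policies and procedures (affecting their reaction to delays), making carrier a confounding variable. The model also includes month and day of week; all predictors are listed in Table 16.6.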

Partial residual diagnostics (shown in Figure 16.3) did not find signs of nonlinearity, so we did not test nonlinear or additive models. However, the residuals are severely heteroskedastic when plotted against departure delay, and the residual distribution is severely skewed, as shown in Figure 16.4. The large sample size means the non-normality is not a serious concern, but to account for the heteroskedasticity, we use the sandwich estimator for all inference.

Figure 16.3: TODO partial residuals
Figure 16.4: Normal quantile-quantile plot of the standardized residuals.
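
A note for readers following along in R (this would not appear in the report itself): the sketch below fits a model of the same general form on simulated data and computes sandwich standard errors by hand. The data, the three carriers, and the choice of the simplest (HC0) variant are all assumptions for illustration; the report does not say which variant was used. In practice, the sandwich package’s vcovHC() provides this and other variants.

```r
set.seed(2)  # simulated data standing in for the real flights

n <- 500
sim <- data.frame(
  distance  = runif(n, 200, 2500),
  dep_delay = rexp(n, rate = 1 / 20),
  carrier   = sample(c("AA", "WN", "YX"), n, replace = TRUE)
)
# Noise that grows with departure delay, to mimic heteroskedasticity
sim$gain <- 5 + 0.002 * sim$distance - 0.01 * sim$dep_delay +
  rnorm(n, sd = 3 + 0.2 * sim$dep_delay)

# distance * dep_delay expands to both main effects plus their interaction
fit <- lm(gain ~ distance * dep_delay + carrier, data = sim)

# HC0 sandwich estimator: (X'X)^-1 [sum of e_i^2 x_i x_i'] (X'X)^-1
X <- model.matrix(fit)
e <- residuals(fit)
bread <- summary(fit)$cov.unscaled   # (X'X)^-1, computed from the QR fit
meat <- crossprod(X * e)             # each row of X scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

robust_se <- sqrt(diag(vcov_hc0))
cbind(estimate     = coef(fit),
      classical_se = sqrt(diag(vcov(fit))),
      robust_se    = robust_se)
```

Comparing the last two columns shows how far the usual standard errors can drift from the robust ones when the error variance depends on a predictor.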

16.5.4 Results

16.5.4.1 Delays by time of year, time of day, and day of week

Table 16.3: Typical arrival delays by month. June, July, and January have the highest delay rates, meaning flights delayed more than 15 minutes. Delays average about 12 minutes per flight in July.
Flight delays from Pittsburgh
             Arrival delay (minutes)
Month        Mean   Median   75th pct.   % delayed   Flights
January       2.9       -9           7       18.4%     3,299
February     -4.3      -12           0       11.9%     3,049
March         4.3       -7           7       18.0%     3,579
April         3.5       -7           6       15.9%     3,450
May          -2.9       -9           0       10.3%     3,618
June          8.6       -6           9       20.5%     3,431
July         11.9       -6          15       24.4%     3,464
August        2.6      -10           4       15.1%     3,669
September     3.7       -8           3       14.9%     3,341
October      -3.5      -10          -1       10.6%     3,742
November     -4.6      -11           0       10.0%     3,496
December     -4.4      -12           0       10.9%     3,355

Arrival delays by time of year are shown in Table 16.3. Unexpectedly, the largest average delays are in July rather than during the busy holiday season (November and December) or the depths of winter (January and February), with 12-minute average delays in July versus on-time or early arrivals in November and December. Nearly 25% of flights in July are delayed more than 15 minutes, versus only 10% during the holiday season, and 18% in January.

Table 16.4: Delay amounts and gain by day of week. Friday, Saturday, and Sunday flights are most likely to be delayed more than 15 minutes, but the average delays are only a few minutes higher than flights on weekdays.
Flight delays by day of week
       Arrival delay (minutes)       Delay gain (minutes)
Day    Mean   Median   75th pct.     Mean   Median   75th pct.   % delayed   Flights
Sun     3.1       -9           5      6.2        8          14       16.1%     6,075
Mon     1.3       -9           4      5.8        7          14       14.6%     6,137
Tue    -1.6      -10           1      6.1        8          14       12.4%     5,806
Wed    -0.4      -10           1      6.7        8          14       13.1%     5,859
Thu     1.9       -9           4      5.7        7          14       15.3%     6,245
Fri     3.3       -7           6      4.8        7          13       17.5%     6,170
Sat     3.0       -8           5      6.2        7          14       16.2%     5,201

Delays also vary by day of week, as shown in Table 16.4. The largest average arrival delays occur Friday through Sunday, with over 16% of flights being delayed more than 15 minutes, despite Saturday being the least-busy day for departures from Pittsburgh. This may be due to different scheduling or staffing practices on weekends, or congestion elsewhere in the country that affects Pittsburgh flights. Table 16.4 also shows that typical delay gains are consistent from day to day, with means of roughly 5 to 7 minutes, so there is no major change in delay practices throughout the week.

Table 16.5: Delay amounts by hour of the day. Afternoon and evening departures, starting around 3pm, are the most likely to have delays more than 15 minutes.
Flight delays from Pittsburgh
                Arrival delay (minutes)       Delay gain (minutes)
Time            Mean   Median   75th pct.     Mean   Median   75th pct.   % delayed   Flights
5-9am           -4.0      -11          -1      6.7        8          15        8.9%    13,928
9am-noon        -3.7      -12          -2      7.3        9          14        9.9%     7,124
Noon-3pm         1.1       -8           4      5.7        7          13       15.1%     6,163
3-6pm            9.2       -5          12      3.8        6          13       22.2%     7,610
6-9pm           10.5       -4          15      5.0        7          13       25.0%     6,109
9pm-midnight     8.1       -6          16      8.7       10          18       25.8%       559

Table 16.5 shows typical delays by time of day. Afternoon and evening departures, from about 3pm onward, are most likely to have delays more than 15 minutes, while morning flights arrive at their destinations, on average, slightly early. Only 9-10% of flights departing before noon are delayed, versus about 25% of flights after 6pm. One possible mechanism is that delays compound throughout the day: aircraft typically make several flights each day, and so a mechanical problem or thunderstorm early in the day can cause cascading delays throughout the day.
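
A note for readers following along in R (not part of the report): time-of-day bins like those in Table 16.5 can be built with cut(). The bin edges below are assumed from the table’s row labels, and the hours are illustrative values rather than real flights.

```r
# Scheduled departure hours (0-23); values here are illustrative.
hour <- c(5, 8, 9, 11, 12, 14, 15, 17, 18, 20, 21, 23)

# Bin edges assumed from the row labels of Table 16.5. right = FALSE makes
# each bin include its lower edge, so hour 9 falls in "9am-noon".
# Hours before 5am would become NA with these breaks.
time_of_day <- cut(hour,
                   breaks = c(5, 9, 12, 15, 18, 21, 24),
                   labels = c("5-9am", "9am-noon", "Noon-3pm",
                              "3-6pm", "6-9pm", "9pm-midnight"),
                   right = FALSE)
table(time_of_day)
```

The result is a factor whose levels are already in chronological order, so it can be passed straight to a grouped summary.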

16.5.4.2 Delay gain analysis

Presenting the regression

Observe that the linear regression model we chose is shown only in the form of Table 16.6: there is no regression equation written out. In most fields, readers know what linear regression is, so it is sufficient to tell them what the predictors are; a mathematical formula is redundant (and often hard to read). See Section 25.4 on how to produce tables like this.

Table 16.6: Results of a linear regression model predicting delay gains.

Characteristic                             Beta      95% CI            p-value
(Intercept)                                8.07      7.03, 9.11        <0.001
Distance (mi)                              0.002     0.002, 0.003      <0.001
Departure delay (min)                     -0.012     -0.018, -0.006    <0.001
Month
    1                                      (reference)
    2                                      3.21      2.46, 3.96        <0.001
    3                                     -0.611     -1.34, 0.116      0.10
    4                                      0.221     -0.487, 0.929     0.5
    5                                      2.52      1.84, 3.20        <0.001
    6                                      0.843     0.125, 1.56       0.021
    7                                      0.248     -0.494, 0.989     0.5
    8                                      2.27      1.58, 2.97        <0.001
    9                                      0.817     0.117, 1.52       0.022
    10                                     3.22      2.56, 3.88        <0.001
    11                                     3.15      2.46, 3.83        <0.001
    12                                     5.54      4.81, 6.28        <0.001
Airline
    9E                                     (reference)
    AA                                    -6.81      -7.70, -5.92      <0.001
    AS                                    -4.51      -6.33, -2.70      <0.001
    B6                                    -6.66      -7.68, -5.63      <0.001
    DL                                    -2.64      -3.57, -1.71      <0.001
    F9                                    -4.13      -6.47, -1.78      <0.001
    G4                                   -10.2       -11.3, -9.01      <0.001
    MQ                                    -5.17      -6.65, -3.69      <0.001
    NK                                    -7.20      -8.24, -6.16      <0.001
    OH                                    -8.80      -10.1, -7.54      <0.001
    OO                                    -2.46      -3.45, -1.46      <0.001
    UA                                    -5.58      -6.62, -4.53      <0.001
    WN                                    -4.46      -5.33, -3.59      <0.001
    YX                                    -3.46      -4.32, -2.59      <0.001
Day of week
    1                                      (reference)
    2                                     -0.520     -0.988, -0.052    0.029
    3                                     -0.393     -0.874, 0.089     0.11
    4                                      0.165     -0.304, 0.634     0.5
    5                                     -0.432     -0.902, 0.037     0.071
    6                                     -1.46      -1.95, -0.977     <0.001
    7                                     -0.198     -0.692, 0.297     0.4
Distance (mi) * Departure delay (min)      0.000     0.000, 0.000      0.062

CI = Confidence Interval

Our regression model for predicting delay gain is shown in Table 16.6. Notably, the model predicts that for flights with no departure delay, flight crews can gain an additional 0.21 minutes (95% CI [0.17, 0.25]) for each 100 miles of additional flight distance. However, it also predicts that for each additional minute of departure delay, the gain decreases by 0.012 minutes (95% CI [0.006, 0.018]).

Because of the interaction between delay and distance, this second slope applies, strictly speaking, to flights of 0 distance. The interaction is positive (\(\hat \beta = 7.2\times 10^{-6}\), 95% CI \([-3.5\times 10^{-7}, 1.5\times 10^{-5}]\)) but not statistically significant at the 5% level (\(t(41459) = 1.9\), \(p = 0.0617\)); its confidence interval includes zero.

The fit is illustrated in Figure 16.5, the effects plot for distance and delay gains. Longer-distance flights indeed have higher average delay gains, suggesting pilots are able to make up for lost time. The point estimate of the interaction suggests that when departure delays are longer, short flights are unable to make up the delay, while longer flights are able to speed up further and produce higher gains, though the data cannot rule out there being no interaction at all. This matches what we might expect if pilots attempt to make up for delays on longer flights, though the overall average delay gain is small, indicating this effect cannot offset delays of more than 10 or 15 minutes.

Figure 16.5: Effects plot for flight delay gains. The positive slope indicates longer flights indeed have higher gains, and the slope is steeper for longer departure delays.
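
A note for readers (not part of the report): with an interaction in the model, each slope depends on the value of the other variable. The sketch below works through that arithmetic using the rounded point estimates quoted in the text. The function names are hypothetical, and because the interaction’s confidence interval includes zero, the results should be read as illustrative rather than as firm estimates.

```r
# Point estimates quoted in the text (rounded as reported there)
beta_distance    <- 0.21 / 100  # gain per mile at zero departure delay
beta_dep_delay   <- -0.0119     # gain per minute of delay at zero distance
beta_interaction <- 7.2e-6      # change in either slope per unit of the other

# Estimated gain per additional 100 miles, at a given departure delay
distance_slope_per_100mi <- function(dep_delay) {
  100 * (beta_distance + beta_interaction * dep_delay)
}

# Estimated gain per additional minute of departure delay, at a given distance
delay_slope <- function(distance) {
  beta_dep_delay + beta_interaction * distance
}

distance_slope_per_100mi(c(0, 60))   # no delay vs. a one-hour delay
delay_slope(c(0, 500, 1500))         # slope at several flight distances
```

This is why the departure delay coefficient in Table 16.6 describes a hypothetical flight of 0 distance: for any real flight, the interaction term shifts that slope.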

16.5.5 Discussion


  1. The data is compressed by gzip due to its size. R’s read.csv() and readr’s read_csv() functions can decompress gzip files automatically, so you do not need to do anything special to load the data.

  2. No trademarks were harmed in the making of this case study. Indiana Airways had their license revoked by the FAA in early 1980 for safety violations (Wewe 1980).

  3. That’s why they hired you, a student, not a fancy business consultant.

  4. I used the anyflights package to download the data, and this seems to be a bug in its data cleaning code. Fortunately it’s not a serious problem for us, but it would be if we had flights from multiple airports in different time zones and we needed to know which flights happened first.