Airline Passenger Analysis - R & Weka

25 minute read

Executive summary.

Based on the models, customers’ overall satisfaction appears to be heavily influenced by Seat Comfort satisfaction, Departure Delay, and Arrival Delay. There does not seem to be a significant relationship with check-in service (A12) or baggage handling (A23). Additionally, there is no statistically significant difference between the classes in food-and-drink satisfaction. The majority of customers are loyal (returning), fall in the youth to middle-age range, and typically travel on medium-haul flights. Surprisingly, most arrivals and departures are delayed by 15 minutes or less.

According to attribute selection, Type of Travel, Class, Inflight Wifi Service, Online Boarding, Inflight Entertainment, and Check-in Service seem to have the most influence on overall satisfaction. Arrival Delay and Departure Delay are contested in attribute selection, appearing important in one method but not the other. Gender appears to be the least important feature in relation to overall satisfaction or to any other feature.

Preprocessing

Library

Loading the libraries here {See R Code}:

library(tidyverse)
library(ISLR2)
library(boot)
library(dplyr)
library(glmnet)
library(tree)
library(pls)
library(randomForest)
library(caret)
library(moments)
library(e1071)
library(gridExtra)
library(arules)
library(arulesViz)
library(rpart)
#install.packages("rpart.plot")
library(rpart.plot)
#install.packages("factoextra")
library(factoextra)

Import the Data Set.

We split the original data into a training set and a test set to be used in the Final Project. The training set, AirlineSurvey_train, is about 80% of the original data set, and the test set, AirlineSurvey_test, is about 20%.

For example, here are the features of the training data set:

names(airlinesurvey_train)

##  [1] "X"                                 "id"                               
##  [3] "Gender"                            "Customer.Type"                    
##  [5] "Age"                               "Type.of.Travel"                   
##  [7] "Class"                             "Flight.Distance"                  
##  [9] "Inflight.wifi.service"             "Departure.Arrival.time.convenient"
## [11] "Ease.of.Online.booking"            "Gate.location"                    
## [13] "Food.and.drink"                    "Online.boarding"                  
## [15] "Seat.comfort"                      "Inflight.entertainment"           
## [17] "On.board.service"                  "Leg.room.service"                 
## [19] "Baggage.handling"                  "Checkin.service"                  
## [21] "Inflight.service"                  "Cleanliness"                      
## [23] "Departure.Delay.in.Minutes"        "Arrival.Delay.in.Minutes"         
## [25] "satisfaction"

Clean the data sets of missing values

There are 310 rows with missing data in some columns of the training set. Likewise, there are 83 rows with missing data in some columns of the test set.

We remove the rows with missing data {See R code}:

AirlineSurvey_train <- airlinesurvey_train %>%
                        na.omit()
AirlineSurvey_test <- airlinesurvey_test %>%
                        na.omit()
dim_new_train <- dim(AirlineSurvey_train)
dim_new_test <- dim(AirlineSurvey_test)
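
The counts of incomplete rows quoted above can be obtained with complete.cases. The following is a minimal sketch on a toy data frame (the values and the names toy, n_missing_rows, na_per_column, and toy_clean are made up for illustration), not the code used to produce the numbers in this report:

```r
# Toy data frame standing in for the raw survey data (hypothetical values).
toy <- data.frame(
  id = 1:5,
  Departure.Delay.in.Minutes = c(0, 12, 3, 40, 0),
  Arrival.Delay.in.Minutes = c(0, NA, 5, NA, 2)
)

# Number of rows with at least one missing value.
n_missing_rows <- sum(!complete.cases(toy))

# Missing values per column, to see where the gaps are.
na_per_column <- colSums(is.na(toy))

# na.omit() drops exactly the incomplete rows.
toy_clean <- na.omit(toy)

n_missing_rows
na_per_column
nrow(toy) - nrow(toy_clean)
```

The last expression should equal n_missing_rows, which is the same consistency check as Original minus NA values equals Post-Sanitization in the tables of this section.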

After cleaning, we get the following row counts for the training and test sets:

For the train set:

Train_set <- c("Original", "NA values", "Post-Sanitization")
dim_og_train_1 <- dim_og_train[1]
dim_new_train_1 <- dim_new_train[1]
NumberRows <- c(dim_og_train_1,sum_og_train,dim_new_train_1)
table <- rbind(Train_set,NumberRows)
table

##            [,1]       [,2]        [,3]               
## Train_set  "Original" "NA values" "Post-Sanitization"
## NumberRows "103904"   "310"       "103594"

For the test set:

Test_set <- c("Original", "NA values", "Post-Sanitization")
dim_og_test_1 <- dim_og_test[1]
dim_new_test_1 <- dim_new_test[1]
Rowsr <- c(dim_og_test_1,sum_og_test,dim_new_test_1)
tabler <- rbind(Test_set,Rowsr)
tabler

##          [,1]       [,2]        [,3]               
## Test_set "Original" "NA values" "Post-Sanitization"
## Rowsr    "25976"    "83"        "25893"

Step 1: Statistics on departure delay (A8) and arrival delay (A9).

First we create a function in R to find the mode, shown below:

#Create a function that finds the mode
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
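
As a quick check of how this function behaves, here it is applied to two small made-up vectors. The function body is repeated so the snippet runs on its own; the vector delays is hypothetical:

```r
# Same definition as above, repeated so the snippet is self-contained.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

delays <- c(0, 5, 0, 12, 5, 0)   # hypothetical delay values
Mode(delays)

# With a tie, which.max() returns the first maximum, so the value that
# appears first in x wins.
Mode(c(7, 3, 3, 7))
```

Note the tie behavior: when several values share the highest count, only one of them is returned.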

i. mean and median

For the mean and median, we will use the built-in summary function in R.

The mean, median, and mode for Departure.Delay.in.Minutes:

summary(AirlineSurvey_train$Departure.Delay.in.Minutes)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   14.75   12.00 1592.00

mod_delay <- Mode(AirlineSurvey_train$Departure.Delay.in.Minutes)

cat("The mode of the Departure Delay in Minutes is: ", mod_delay)

## The mode of the Departure Delay in Minutes is:  0

The mean, median, and mode for Arrival.Delay.in.Minutes:

summary(AirlineSurvey_train$Arrival.Delay.in.Minutes)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   15.18   13.00 1584.00

mod_arrival <- Mode(AirlineSurvey_train$Arrival.Delay.in.Minutes)
cat("The mode of the Arrival Delay in Minutes is: ", mod_arrival)

## The mode of the Arrival Delay in Minutes is:  0

ii. The spread: standard deviation (for each)

We can find the standard deviation of the feature using the sd function in R:

std_D_dim <- sd(AirlineSurvey_train$Departure.Delay.in.Minutes)
std_A_dim <- sd(AirlineSurvey_train$Arrival.Delay.in.Minutes)
cat("The std for Departure.Delay.in.Minutes: ", std_D_dim)

## The std for Departure.Delay.in.Minutes:  38.11674

cat("The std for Arrival.Delay.in.Minutes: ", std_A_dim)

## The std for Arrival.Delay.in.Minutes:  38.69868

iii. Percentiles: the 10th, 50th, 75th and 90th percentiles (for each)

R offers the quantile function to determine percentiles. Since this set of percentiles differs from the defaults (0, .25, .50, .75, 1), we pass .10, .50, .75 and .90 explicitly:

Arrival.delay.Quartile <- quantile(AirlineSurvey_train$Arrival.Delay.in.Minutes, c(.10,.50,.75,.90))
Departure.delay.Quartile <- quantile(AirlineSurvey_train$Departure.Delay.in.Minutes, c(.10,.50,.75,.90))

The following table displays the percentiles for Arrival.Delay.in.Minutes and Departure.Delay.in.Minutes:

table_quartile<- rbind(Arrival.delay.Quartile,Departure.delay.Quartile)
table_quartile

##                          10% 50% 75% 90%
## Arrival.delay.Quartile     0   0  13  44
## Departure.delay.Quartile   0   0  12  44

iv. 1st quartile, 3rd quartile, and the median

We will use the summary function again available in R to display the 1st quartile, 3rd quartile, and the median:

departure_summary <- summary(AirlineSurvey_train$Departure.Delay.in.Minutes)
arrival_summary <-summary(AirlineSurvey_train$Arrival.Delay.in.Minutes)
table_summary <- rbind(departure_summary,arrival_summary)
table_summary

##                   Min. 1st Qu. Median     Mean 3rd Qu. Max.
## departure_summary    0       0      0 14.74794      12 1592
## arrival_summary      0       0      0 15.17868      13 1584

v. The skewness (for each)

R offers the skewness function that can be used to determine the skewness of the features:

# Skewness of each delay feature
Departure.delay.Skewness <- skewness(AirlineSurvey_train$Departure.Delay.in.Minutes)
Arrival.delay.Skewness <- skewness(AirlineSurvey_train$Arrival.Delay.in.Minutes)

The following table displays the skewness for Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes:

table_skewness <- rbind(Departure.delay.Skewness,Arrival.delay.Skewness)
table_skewness

##                              [,1]
## Departure.delay.Skewness 6.768853
## Arrival.delay.Skewness   6.596446
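
To make the statistic concrete, the moment-based skewness can be written out by hand in base R. This is a sketch of the usual definition (third central moment divided by the second central moment raised to 3/2, both using 1/n), which is, to my understanding, what skewness from the moments package computes. The helper name skew_by_hand and the example vectors are hypothetical:

```r
# Moment-based sample skewness: m3 / m2^(3/2), with central moments
# computed using 1/n. skew_by_hand is a hypothetical helper name.
skew_by_hand <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3/2)
}

symmetric <- c(1, 2, 3, 4, 5)           # symmetric around its mean
right_tail <- c(0, 0, 0, 0, 1, 2, 50)   # mostly zeros with one large value

skew_by_hand(symmetric)
skew_by_hand(right_tail)
```

A vector dominated by zeros with a few very large values, like the delay features here, gives a clearly positive skewness.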

vi. The covariance and correlation between A8 and A9

R offers the cov and cor functions, which compute the covariance and correlation of the features, respectively:

covariance.A8wA9 <- cov(AirlineSurvey_train$Arrival.Delay.in.Minutes, AirlineSurvey_train$Departure.Delay.in.Minutes, method="pearson")
correlation.A8wA9 <- cor(AirlineSurvey_train$Arrival.Delay.in.Minutes, AirlineSurvey_train$Departure.Delay.in.Minutes, method="pearson")

The following table displays the covariance and correlation for Arrival.Delay.in.Minutes and Departure.Delay.in.Minutes:

table_covwcorr <- rbind(covariance.A8wA9,correlation.A8wA9)
table_covwcorr

##                           [,1]
## covariance.A8wA9  1424.1494855
## correlation.A8wA9    0.9654809
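
The two numbers are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below redoes that arithmetic with the values printed in this step (variable names are chosen for illustration only):

```r
# Values copied from the output above.
cov_a8_a9 <- 1424.1494855   # covariance of the two delay features
sd_departure <- 38.11674    # sd of Departure.Delay.in.Minutes
sd_arrival <- 38.69868      # sd of Arrival.Delay.in.Minutes

# Pearson correlation is the covariance scaled by both standard deviations.
r <- cov_a8_a9 / (sd_departure * sd_arrival)
r
```

Up to rounding of the printed standard deviations, this reproduces the correlation reported by cor.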

vii. Plotting the distribution

Plotting the distribution in a histogram in R:

par(mfrow=c(2,1))
hist(AirlineSurvey_train$Arrival.Delay.in.Minutes)
hist(AirlineSurvey_train$Departure.Delay.in.Minutes)

viii. Summary

There is a very strong correlation between A8 and A9, with a value of 0.9654809. Since the skewness of both A8 and A9 is greater than 1, both distributions are highly right-skewed. The histograms reinforce this conclusion: most of the values are concentrated on the left side of the graph, with a long tail to the right. While the maximum departure and arrival delays exceed 1500 minutes (about 25 hours), the average delay is close to 15 minutes. Ideally both delays would be close to zero, but an average delay of 15 minutes is quite small compared to the logistics involved.

Step 2: Discretize some features

i. Discretize Age

Our next step is to convert numerical values to categorical values. We discretize age (A3) to nominal values using the following criteria: 0-15: Child; 16-35: Youth; 36-55: Middle age; 56-70: Old; >70: Senior.

We discretize the values in R and store the new categorical (nominal) values as a new feature called Age.cat in the AirlineSurvey_train training set:

# This adds a new categorical feature column rather than replacing Age.
# The numerical column can be removed later on.
AirlineSurvey_train <- within(AirlineSurvey_train,
                      {
                      Age.cat <- NA 
                      Age.cat[AirlineSurvey_train$Age < 16] <- "Child"
                      Age.cat[AirlineSurvey_train$Age >= 16 & AirlineSurvey_train$Age < 36] <- "Youth"
                      Age.cat[AirlineSurvey_train$Age >= 36 & AirlineSurvey_train$Age < 56] <- "MiddleAge"
                      Age.cat[AirlineSurvey_train$Age >=56 & AirlineSurvey_train$Age < 71] <- "Old"
                      Age.cat[AirlineSurvey_train$Age >= 71] <- "Senior"
                      })

# Age.cat is a character vector for now, so we convert it to a factor.
AirlineSurvey_train$Age.cat <- as.factor(AirlineSurvey_train$Age.cat)

The following shows the distribution (in alphabetical order) of the newly discretized feature Age.cat:

summary(AirlineSurvey_train$Age.cat)

##     Child MiddleAge       Old    Senior     Youth 
##      6024     45379     16135       755     35301
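
The same binning can also be expressed with cut, which handles the interval logic and the factor conversion in one call. This is an alternative sketch on made-up ages, assuming whole-number ages as in the criteria above; it is not the code that produced the counts in this report:

```r
ages <- c(7, 15, 16, 35, 36, 55, 56, 70, 71, 85)   # hypothetical ages

# Intervals are right-closed by default:
# (-Inf,15], (15,35], (35,55], (55,70], (70,Inf]
age_cat <- cut(
  ages,
  breaks = c(-Inf, 15, 35, 55, 70, Inf),
  labels = c("Child", "Youth", "MiddleAge", "Old", "Senior")
)

table(age_cat)
```

One design difference: cut keeps the levels in the order given (Child to Senior) rather than alphabetical order, which tends to read better in tables and plots.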

ii. Discretize flight

Furthermore, we discretize distance (A7) to nominal values using the following criteria: 0-500 miles: Short haul; 501-3000 miles: Medium haul; >3000 miles: Long haul.

We discretize the following values in R, and store the new categorical (nominal) values as a new feature called flightDistance.cat in the AirlineSurvey_train training set:

# Doesn't replace the feature but actually adds a new feature column that is categorical
AirlineSurvey_train <- within(AirlineSurvey_train,
        {
        flightDistance.cat <- NA 
        flightDistance.cat[AirlineSurvey_train$Flight.Distance < 501] <- "ShortHaul"
        flightDistance.cat[AirlineSurvey_train$Flight.Distance >= 501 & AirlineSurvey_train$Flight.Distance < 3001] <- "MediumHaul"
        flightDistance.cat[AirlineSurvey_train$Flight.Distance >= 3001] <- "LongHaul"
        })

# flightDistance.cat is a character vector for now, so we convert it to a factor.
AirlineSurvey_train$flightDistance.cat <- as.factor(AirlineSurvey_train$flightDistance.cat)

The following shows the distribution (in alphabetical order) of the newly discretized feature flightDistance.cat:

summary(AirlineSurvey_train$flightDistance.cat)

##   LongHaul MediumHaul  ShortHaul 
##       8248      63109      32237

iii. Discretize delays (A8 and A9)

Likewise we discretize A8 and A9 to nominal values: Small: 0-15; Medium: 16-45; Long: >45.

We discretize the following values in R, and store the new categorical (nominal) values as a new feature called Departure.Delay.cat in the AirlineSurvey_train training set:

# This adds a new categorical feature column rather than replacing the original.
# The numerical column can be removed later on.
AirlineSurvey_train <- within(AirlineSurvey_train,
        {
        Departure.Delay.cat <- NA 
        Departure.Delay.cat[AirlineSurvey_train$Departure.Delay.in.Minutes < 16] <- "Small"
        Departure.Delay.cat[AirlineSurvey_train$Departure.Delay.in.Minutes >= 16 & AirlineSurvey_train$Departure.Delay.in.Minutes < 46] <- "Medium"
        Departure.Delay.cat[AirlineSurvey_train$Departure.Delay.in.Minutes >= 46] <- "Long"
        })
# Departure.Delay.cat is a character vector for now, so we convert it to a factor.
AirlineSurvey_train$Departure.Delay.cat <- as.factor(AirlineSurvey_train$Departure.Delay.cat)

The following shows the distribution (in alphabetical order) of the newly discretized feature Departure.Delay.cat:

summary(AirlineSurvey_train$Departure.Delay.cat)

##   Long Medium  Small 
##   9879  13037  80678

We discretize the following values in R, and store the new categorical (nominal) values as a new feature called Arrival.Delay.cat in the AirlineSurvey_train training set.

# Doesn't replace the feature but actually adds a new feature column that is categorical
AirlineSurvey_train <- within(AirlineSurvey_train,
        {
        Arrival.Delay.cat <- NA 
        Arrival.Delay.cat[AirlineSurvey_train$Arrival.Delay.in.Minutes < 16] <- "Small"
        Arrival.Delay.cat[AirlineSurvey_train$Arrival.Delay.in.Minutes >= 16 & AirlineSurvey_train$Arrival.Delay.in.Minutes < 46] <- "Medium"
        Arrival.Delay.cat[AirlineSurvey_train$Arrival.Delay.in.Minutes >= 46] <- "Long"
        })
# Arrival.Delay.cat is a character vector for now, so we convert it to a factor.
AirlineSurvey_train$Arrival.Delay.cat <- as.factor(AirlineSurvey_train$Arrival.Delay.cat)

The following shows the distribution (in alphabetical order) of the newly discretized feature Arrival.Delay.cat:

summary(AirlineSurvey_train$Arrival.Delay.cat)

##   Long Medium  Small 
##  10066  13569  79959

iv. Plot the distributions

Plotting the following distributions using the ggplot function in R:

v. Summary

As seen in Step 1, the majority of the Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes values are relatively small delays, and Step 2 reinforces that finding. According to the categorical age feature, the majority of travelers are either Youth or Middle Age, that is, 16-55 years old. Likewise, the majority of flights are Medium Haul, or 501-3000 miles long.

Step 3.

Next, we are going to test three hypotheses. First, however, we must factorize the quality features.

Factorize the quality features needed for the problem in the train set {Output not shown, see R code}:

Factorize the quality features needed for the problem in the test set {Output not shown, see R code}:

Test 1: Long haul passengers’ overall satisfaction is influenced more by the in-flight service quality than by the departure delays.

First, we create a new Long Haul list by filtering the AirlineSurvey_train training set into a new data frame called Long.Haul.Passenger.

Long.Haul.Passenger <- AirlineSurvey_train[AirlineSurvey_train$flightDistance.cat == "LongHaul",]

The dimension of the Long.Haul.Passenger:

dim(Long.Haul.Passenger)

## [1] 8248   29

The number of rows should match the LongHaul count in the summary statistics of Step 2 (ii).

Next, we examine the relationship between Long Haul travelers’ overall satisfaction and Departure.Delay.in.Minutes, and between their overall satisfaction and Inflight.service. Since overall satisfaction is a nominal value, we need to use classification methods. In this step, we use Logistic Regression to compare the relationships.

The relationship between overall satisfaction and Departure.Delay.in.Minutes:

#Departure Delay:
glm.fit1 <- glm(satisfaction ~ Departure.Delay.in.Minutes, data = Long.Haul.Passenger, family = binomial)
summary(glm.fit1)

## 
## Call:
## glm(formula = satisfaction ~ Departure.Delay.in.Minutes, family = binomial, 
##     data = Long.Haul.Passenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7538   0.6955   0.6955   0.7016   1.4833  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.2961490  0.0285069  45.468  < 2e-16 ***
## Departure.Delay.in.Minutes -0.0039751  0.0006114  -6.502 7.93e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8817.8  on 8247  degrees of freedom
## Residual deviance: 8776.4  on 8246  degrees of freedom
## AIC: 8780.4
## 
## Number of Fisher Scoring iterations: 4

The relationship between overall satisfaction and Inflight.service:

#ServiceQuality:
glm.fit2 <- glm(satisfaction ~ Inflight.service, data = Long.Haul.Passenger, family = binomial)
summary(glm.fit2)

## 
## Call:
## glm(formula = satisfaction ~ Inflight.service, family = binomial, 
##     data = Long.Haul.Passenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0863   0.1310   0.1310   0.5321   1.3795  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)         -11.57     196.97  -0.059    0.953
## Inflight.service1    11.10     196.97   0.056    0.955
## Inflight.service2    11.26     196.97   0.057    0.954
## Inflight.service3    11.48     196.97   0.058    0.954
## Inflight.service4    13.45     196.97   0.068    0.946
## Inflight.service5    16.32     196.97   0.083    0.934
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8817.8  on 8247  degrees of freedom
## Residual deviance: 6203.9  on 8242  degrees of freedom
## AIC: 6215.9
## 
## Number of Fisher Scoring iterations: 10

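FINDING: According to the Pr(>|z|), Departure.Delay.in.Minutes is a stronger predictor of a Long Haul passenger’s overall satisfaction than Inflight.service. This comparison should be read with caution, however: every Inflight.service coefficient has the same inflated standard error (196.97), which makes its Wald p-values unreliable, and the residual deviance drops much further with Inflight.service (8817.8 to 6203.9) than with Departure.Delay.in.Minutes (8817.8 to 8776.4).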
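
As a complementary check on the Wald p-values, the deviances printed in the two summaries can be compared directly with a likelihood-ratio test against the intercept-only model. The sketch below only redoes arithmetic on the printed numbers; it is an additional diagnostic, not part of the original analysis, and the variable names are chosen for illustration:

```r
# Deviances copied from the two summaries above.
null_dev <- 8817.8
dev_delay <- 8776.4     # satisfaction ~ Departure.Delay.in.Minutes (1 added parameter)
dev_service <- 6203.9   # satisfaction ~ Inflight.service (5 added parameters)

# Drop in deviance relative to the intercept-only model.
drop_delay <- null_dev - dev_delay
drop_service <- null_dev - dev_service

# Likelihood-ratio test: the drop is approximately chi-squared distributed
# with df equal to the number of added parameters.
p_delay <- pchisq(drop_delay, df = 1, lower.tail = FALSE)
p_service <- pchisq(drop_service, df = 5, lower.tail = FALSE)

data.frame(
  model = c("Departure.Delay.in.Minutes", "Inflight.service"),
  deviance_drop = c(drop_delay, drop_service),
  df = c(1, 5),
  lrt_p_value = c(p_delay, p_service)
)
```

Unlike the per-coefficient Wald tests, this comparison is not affected by a sparsely populated reference level, so it is a more robust way to rank the two single-predictor models.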

Test 2: Medium haul passengers’ overall satisfaction is influenced more by the arrival delays than by the in-flight entertainment.

First, we create a new list by filtering the AirlineSurvey_train training set into a new data frame called Medium.Haul.Passenger.

Medium.Haul.Passenger <- AirlineSurvey_train[AirlineSurvey_train$flightDistance.cat == "MediumHaul",]

The dimension of the Medium.Haul.Passenger:

dim(Medium.Haul.Passenger)

## [1] 63109    29

The number of rows should match the MediumHaul count in the summary statistics of Step 2 (ii).

Next, we examine the relationship between Medium Haul travelers’ overall satisfaction and Arrival.Delay.in.Minutes, and between their overall satisfaction and Inflight.entertainment. Since overall satisfaction is a nominal value, we need to use classification methods. In this step, we use Logistic Regression to compare the relationships.

The relationship between overall satisfaction and Arrival.Delay.in.Minutes:

#Arrival Delay
set.seed(1)
glm.fit_arrdelay <- glm(satisfaction ~ Arrival.Delay.in.Minutes, data = Medium.Haul.Passenger, family = binomial)
summary(glm.fit_arrdelay)

## 
## Call:
## glm(formula = satisfaction ~ Arrival.Delay.in.Minutes, family = binomial, 
##     data = Medium.Haul.Passenger)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.095  -1.095  -1.040   1.262   2.833  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -0.1968598  0.0086762  -22.69   <2e-16 ***
## Arrival.Delay.in.Minutes -0.0029660  0.0002307  -12.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 86579  on 63108  degrees of freedom
## Residual deviance: 86397  on 63107  degrees of freedom
## AIC: 86401
## 
## Number of Fisher Scoring iterations: 4

The relationship between overall satisfaction and Inflight.entertainment:

#In-flight Entertainment
set.seed(1)
glm.fit_entertain <- glm(satisfaction ~ Inflight.entertainment, data = Medium.Haul.Passenger, family = binomial)
summary(glm.fit_entertain)

## 
## Call:
## glm(formula = satisfaction ~ Inflight.entertainment, family = binomial, 
##     data = Medium.Haul.Passenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4564  -0.8061  -0.5406   0.9776   1.9978  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)
## (Intercept)              -10.566     42.238  -0.250    0.802
## Inflight.entertainment1    8.717     42.238   0.206    0.837
## Inflight.entertainment2    9.328     42.238   0.221    0.825
## Inflight.entertainment3    9.608     42.238   0.227    0.820
## Inflight.entertainment4   11.056     42.238   0.262    0.794
## Inflight.entertainment5   11.201     42.238   0.265    0.791
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 86579  on 63108  degrees of freedom
## Residual deviance: 74686  on 63103  degrees of freedom
## AIC: 74698
## 
## Number of Fisher Scoring iterations: 9

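FINDING: According to the Pr(>|z|), Arrival.Delay.in.Minutes is a stronger predictor of a Medium Haul passenger’s overall satisfaction than Inflight.entertainment. As in Test 1, the rating coefficients share an inflated standard error (42.238), and the residual deviance drops further with Inflight.entertainment (86579 to 74686) than with Arrival.Delay.in.Minutes (86579 to 86397), so the p-values alone should be read with caution.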

Test 3: Satisfaction is influenced by the combination of arrival delay time and departure delay time for all passengers.

To test the hypothesis:

set.seed(1)
glm.fit_arrWdelay <- glm(satisfaction ~ Arrival.Delay.in.Minutes+Departure.Delay.in.Minutes, data = AirlineSurvey_train, family = binomial)
summary(glm.fit_arrWdelay)

## 
## Call:
## glm(formula = satisfaction ~ Arrival.Delay.in.Minutes + Departure.Delay.in.Minutes, 
##     family = binomial, data = AirlineSurvey_train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.162  -1.085  -1.020   1.272   2.894  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -0.2201797  0.0067958 -32.400  < 2e-16 ***
## Arrival.Delay.in.Minutes   -0.0072306  0.0006531 -11.071  < 2e-16 ***
## Departure.Delay.in.Minutes  0.0040626  0.0006599   6.157 7.43e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 141768  on 103593  degrees of freedom
## Residual deviance: 141360  on 103591  degrees of freedom
## AIC: 141366
## 
## Number of Fisher Scoring iterations: 4

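FINDING: Both Arrival.Delay.in.Minutes and Departure.Delay.in.Minutes are statistically significant predictors of overall satisfaction in the AirlineSurvey_train training data set. This suggests that customers care more about departure and arrival punctuality than about in-flight service and entertainment. Because the two delays are highly correlated (0.965, see Step 1), the individual coefficients, including the positive sign on Departure.Delay.in.Minutes, are hard to interpret separately.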
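
To read the size of these effects, the coefficients can be converted from log-odds to odds ratios. The sketch below uses the estimates printed above and assumes that satisfied is the modeled outcome (the second factor level under R's default alphabetical ordering); the variable names are chosen for illustration:

```r
# Coefficients copied from the summary above (log-odds per minute of delay).
b_arrival <- -0.0072306
b_departure <- 0.0040626

# exp(coefficient) is the multiplicative change in the odds per extra minute,
# holding the other delay fixed.
or_arrival <- exp(b_arrival)
or_departure <- exp(b_departure)

# Effect of a 15-minute longer arrival delay, departure delay held fixed.
or_arrival_15 <- exp(15 * b_arrival)

c(or_arrival = or_arrival, or_departure = or_departure, or_arrival_15 = or_arrival_15)
```

Both per-minute odds ratios are very close to 1, so although the coefficients are statistically significant on more than 100,000 rows, the per-minute effect on the odds is small.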

Step 4.

Let’s look for associations between some of the important attributes. In this section, we will be using Weka.

i. Determining association rules.

We filter the original data set to consist only of Gender, Age (Nominal), Type of travel, Flight distance (Nominal), Class, Arrival delays (Nominal), and Overall satisfaction as attributes. Then we calculate the association rules based on the given support.

In this step, we use Weka to compute the association rules using the Apriori algorithm in the Associate tab.

=== Summary ===
Minimum support: 0.4 (41438 instances)
Minimum metric <confidence>: 0.6
Number of cycles performed: 12

Generated sets of large itemsets:
Size of set of large itemsets L(1): 10
Size of set of large itemsets L(2): 8

Best rules found:

 1. Class=Business 49533 ==> Customer.Type=Loyal Customer 42191 <conf:(0.85)> lift:(1.04) lev:(0.02) [1710] conv:(1.23)
 2. Gender=Male 51018 ==> Customer.Type=Loyal Customer 42326 <conf:(0.83)> lift:(1.02) lev:(0.01) [631] conv:(1.07)
 3. flightDistance.cat=MediumHaul 63109 ==> Customer.Type=Loyal Customer 52024 <conf:(0.82)> lift:(1.01) lev:(0) [448] conv:(1.04)
 4. Arrival.Delay.cat=Small 79959 ==> Customer.Type=Loyal Customer 65444 <conf:(0.82)> lift:(1) lev:(0) [97] conv:(1.01)
 5. Gender=Female 52576 ==> Customer.Type=Loyal Customer 42336 <conf:(0.81)> lift:(0.99) lev:(-0.01) [-631] conv:(0.94)
 6. Customer.Type=Loyal Customer 84662 ==> Arrival.Delay.cat=Small 65444 <conf:(0.77)> lift:(1) lev:(0) [97] conv:(1.01)
 7. flightDistance.cat=MediumHaul 63109 ==> Arrival.Delay.cat=Small 48593 <conf:(0.77)> lift:(1) lev:(-0) [-117] conv:(0.99)
 8. satisfaction=neutral or dissatisfied 58697 ==> Customer.Type=Loyal Customer 44249 <conf:(0.75)> lift:(0.92) lev:(-0.04) [-3721] conv:(0.74)
 9. satisfaction=neutral or dissatisfied 58697 ==> Arrival.Delay.cat=Small 43458 <conf:(0.74)> lift:(0.96) lev:(-0.02) [-1847] conv:(0.88)
10. Customer.Type=Loyal Customer 84662 ==> flightDistance.cat=MediumHaul 52024 <conf:(0.61)> lift:(1.01) lev:(0) [448] conv:(1.01)

ii. Summary.

For #1: Most Business class travelers are returning customers. This result may be due to businesses having partner relationships with air carriers; an employee may only be able to fly with one carrier due to business policies. For #2 and #5: Both genders appear to identify loyalty (returning) with high confidence. This may be because the number of disloyal customers is low compared to the number of loyal customers. For #9: It is surprising that there is an association satisfaction=neutral or dissatisfied 58697 ==> Arrival.Delay.cat=Small. That is, if the arrival delay is small, we would assume that the customers would be satisfied. Note, however, that the lift of this rule is 0.96 (below 1), so its high confidence mostly reflects how common small arrival delays are overall.
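
The metrics Weka prints next to each rule follow from simple counts. As a worked example, the sketch below recomputes them for the rule Class=Business ==> Customer.Type=Loyal Customer, using only counts that appear in the Weka output and the size of the cleaned training set (variable names are chosen for illustration):

```r
n_total <- 103594      # rows in the cleaned training set
n_business <- 49533    # Class=Business
n_loyal <- 84662       # Customer.Type=Loyal Customer
n_both <- 42191        # Class=Business and Customer.Type=Loyal Customer

support <- n_both / n_total                 # share of all rows matching the rule
confidence <- n_both / n_business           # P(loyal | business)
lift <- confidence / (n_loyal / n_total)    # confidence relative to the base rate
leverage <- support - (n_business / n_total) * (n_loyal / n_total)

round(c(support = support, confidence = confidence, lift = lift, leverage = leverage), 2)
round(leverage * n_total)   # leverage expressed as a number of instances
```

A lift close to 1 means the antecedent barely changes the probability of the consequent, which is why rules with high confidence can still be uninformative when the consequent is very common.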

Step 5.

i. Reduce the satisfaction features using PCA.

Using PCA (Principal Component Analysis), we combine features A10-A23 into a single feature, called PCAS. Next, we find the average, minimum, and maximum of A10-A23 (computed for each passenger record), called AVES, MINS, and MAXS, respectively. Lastly, we convert A24 (overall satisfaction) into a numeric value by mapping neutral or dissatisfied to 1.0 and satisfied to 4.0. Let us call it DA24.

We use Weka for this step. Using PCA on the training set, we found the following statistics for the first-ranked component (we removed all others): Rank1: Variance: .9399 {MIN: -5.698, MAX: 4.777, MEAN: 0, STD: 2.233}
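
For readers who prefer to stay in R, the derived features can be sketched with prcomp and row-wise summaries. This is only an R analogue on synthetic ratings, not the Weka workflow used for the numbers in this report, and Weka's PCA filter may scale and rank components differently:

```r
set.seed(1)

# Synthetic stand-in for the 14 rating columns A10-A23 (scores 0-5).
ratings <- as.data.frame(matrix(sample(0:5, 200 * 14, replace = TRUE), ncol = 14))
names(ratings) <- paste0("A", 10:23)

# PCA on centered, scaled ratings; keep the first component as PCAS.
pca <- prcomp(ratings, center = TRUE, scale. = TRUE)
PCAS <- pca$x[, 1]

# Per-passenger summaries of the same columns.
AVES <- rowMeans(ratings)
MINS <- apply(ratings, 1, min)
MAXS <- apply(ratings, 1, max)

# Numeric coding of overall satisfaction, as described above.
satisfaction <- sample(c("neutral or dissatisfied", "satisfied"), 200, replace = TRUE)
DA24 <- ifelse(satisfaction == "satisfied", 4, 1)

# Share of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
head(data.frame(PCAS, AVES, MINS, MAXS, DA24))
```

Because the scores are centered, PCAS has mean 0, which matches the MEAN: 0 reported for each ranked component.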

ii. Using three models.

We apply the first principal component to the following classifiers:

Logistic Regression:

Summary
Correctly Classified Instances	14528	56.1078 %
Incorrectly Classified Instances	11365	43.8922 %

AdaBoostM1:

Summary
Correctly Classified Instances	14528	56.1078 %
Incorrectly Classified Instances	11365	43.8922 %

Using the training set instead of the test set, with 10-fold cross-validation, we get the following summary:

Time taken to build model: 0.66 seconds

Stratified cross-validation:

Summary
Correctly Classified Instances	79951	77.1772 %
Incorrectly Classified Instances	23643	22.8228 %

FINDING: Stratified cross-validation shows a noticeably higher accuracy (77.18%). However, that is only because the cross-validation was run on the training set. Both AdaBoostM1 and Logistic Regression were trained on the training set and evaluated on the test set, where the accuracy is 56.11%.
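
The 10-fold procedure itself can be sketched in base R. The example below uses synthetic data and plain random folds (Weka additionally stratifies the folds by class), so it illustrates the mechanics only; the names toy, pc1, satisfied, and fold_acc are hypothetical:

```r
set.seed(1)

# Synthetic data: one numeric predictor (standing in for a principal
# component) and a binary outcome that depends on it.
n <- 500
toy <- data.frame(pc1 = rnorm(n))
toy$satisfied <- rbinom(n, 1, plogis(1.2 * toy$pc1))

# Assign each row to one of k folds at random.
k <- 10
folds <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k-1 folds, then score the held-out fold.
fold_acc <- sapply(seq_len(k), function(i) {
  fit <- glm(satisfied ~ pc1, data = toy[folds != i, ], family = binomial)
  p <- predict(fit, newdata = toy[folds == i, ], type = "response")
  mean((p > 0.5) == (toy$satisfied[folds == i] == 1))
})

cv_accuracy <- mean(fold_acc)
cv_accuracy
```

Every row is scored by a model that never saw it, but all rows still come from the same data set, which is why cross-validated accuracy on the training set can differ from accuracy on a separate test set.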

iii. Using PCAS from A10-A23

PCAS reduces the complexity of the features while retaining the relationships among them. The following top 3 components/ranks were used:

Rank1: {MIN: -5.698, MAX: 4.777, MEAN: 0, STD: 2.233}

Rank2: {MIN: -5.596, MAX: 4.652, MEAN: 0, STD: 2.003}

Rank3: {MIN: -4.678, MAX: 6.565, MEAN: 0, STD: 1.941}

Using these 3 components, we get the following results:

Using Logistic Regression with the supplied test set:

Summary
Correctly Classified Instances	14528	56.1078 %
Incorrectly Classified Instances	11365	43.8922 %

There is no apparent increase in accuracy on the test set.

If we evaluate on the training set instead:

Summary
Correctly Classified Instances	80100	77.3211 %
Incorrectly Classified Instances	23494	22.6789 %
Kappa statistic	0.5387

If we use the training set again, but with 10-fold cross-validation:

=== Stratified cross-validation ===

Summary
Correctly Classified Instances	80106	77.3269 %
Incorrectly Classified Instances	23488	22.6731 %

There does not seem to be a large increase in accuracy when we use three components instead of one.

Step 6.

i. flight distance (in miles) and arrival delay (in minutes)

In this step, we use simple linear regression to explore the relationship between Arrival.Delay.in.Minutes and Flight.Distance in the AirlineSurvey_train data set.

The summary shows:

set.seed(1)
#names(AirlineSurvey_train)
#plot(Flight.Distance ~ Arrival.Delay.in.Minutes, data=AirlineSurvey_train)
lm.fit <- lm(Arrival.Delay.in.Minutes ~ Flight.Distance , AirlineSurvey_train)
summary(lm.fit)

## 
## Call:
## lm(formula = Arrival.Delay.in.Minutes ~ Flight.Distance, data = AirlineSurvey_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -15.29  -15.22  -15.04   -2.20 1568.81 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.529e+01  1.871e-01  81.713   <2e-16 ***
## Flight.Distance -9.413e-05  1.206e-04  -0.781    0.435    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.7 on 103592 degrees of freedom
## Multiple R-squared:  5.885e-06,  Adjusted R-squared:  -3.769e-06 
## F-statistic: 0.6096 on 1 and 103592 DF,  p-value: 0.4349

The Pr(>|t|) indicates that there is little to no relationship between the two features. Likewise, the R^2 value is close to 0.

ii. flight distance (in miles) and departure delay (in minutes)

In this step, we use simple linear regression to explore the relationship between Departure.Delay.in.Minutes and Flight.Distance in the AirlineSurvey_train data set.

#plot(Flight.Distance ~ Departure.Delay.in.Minutes, data=AirlineSurvey_train)
set.seed(1)
lm.fit2 <- lm(Departure.Delay.in.Minutes~Flight.Distance, AirlineSurvey_train)
#names(summary(lm.fit2))

The summary shows:

summary(lm.fit2)

## 
## Call:
## lm(formula = Departure.Delay.in.Minutes ~ Flight.Distance, data = AirlineSurvey_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -15.02  -14.73  -14.68   -2.70 1577.26 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.466e+01  1.843e-01  79.546   <2e-16 ***
## Flight.Distance 7.284e-05  1.187e-04   0.613     0.54    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.12 on 103592 degrees of freedom
## Multiple R-squared:  3.632e-06,  Adjusted R-squared:  -6.021e-06 
## F-statistic: 0.3762 on 1 and 103592 DF,  p-value: 0.5396

#summary(lm.fit2)$r.squared

The Pr(>|t|) indicates that there is little to no relationship between the two features. Likewise, the R^2 value is 3.632e-06, which is close to 0.

FINDING: Flight.Distance was not a strong predictor of Departure or Arrival Delay. One possible explanation is that delays behave like an exponential-type waiting-time problem, where the waiting time is “memoryless”, which a linear regression on distance would not capture.

Step 7.

Next we look into a few more tests.

Test 1: Is satisfaction with seat comfort related to (or dependent on) passenger Gender?

We examine the relationship between seat comfort and Gender with a Naive Bayes classifier, using the training and test sets.

Naive Bayes Classifier

Summary
Correctly Classified Instances	7969	30.7767 %
Incorrectly Classified Instances	17924	69.2233 %

Using RandomTree:

Summary
Correctly Classified Instances	7969	30.7767 %
Incorrectly Classified Instances	17924	69.2233 %

FINDING: There does not seem to be a relationship between Gender and seat comfort.

Test 2: Is satisfaction with gate location related to passenger age?

Let’s examine passenger age (numerical) against gate location satisfaction, using a J48 decision tree in Weka:

Summary
Correctly Classified Instances	7122	27.5055 %
Incorrectly Classified Instances	18771	72.4945 %
Kappa statistic	0
Mean absolute error	0.2626
Root mean squared error	0.3624
Relative absolute error	99.9997 %
Root relative squared error	100 %
Total Number of Instances	25893

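FINDING: No significant relationship between age and gate location. With a Kappa statistic of 0 and a relative absolute error of about 100%, the tree does no better than a baseline that ignores age.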
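The Kappa statistic is what tells us the 27.5% accuracy is not real signal. As a rough sketch of how it is computed, the function below (our own helper, not a Weka call) takes a confusion matrix with actual classes in rows and predicted classes in columns. The matrix is a hypothetical three-class example in which the classifier always predicts the most common class.

```r
# Cohen's kappa from a square confusion matrix (rows = actual, columns = predicted)
cohen_kappa <- function(confusion) {
  n <- sum(confusion)
  observed <- sum(diag(confusion)) / n                             # plain accuracy
  expected <- sum(rowSums(confusion) * colSums(confusion)) / n^2   # chance agreement
  (observed - expected) / (1 - expected)
}

# Hypothetical counts: every instance is predicted as class "B"
majority_only <- rbind(
  A = c(0, 30, 0),
  B = c(0, 45, 0),
  C = c(0, 25, 0)
)
colnames(majority_only) <- c("A", "B", "C")

accuracy <- sum(diag(majority_only)) / sum(majority_only)
kappa <- cohen_kappa(majority_only)
print(c(accuracy = accuracy, kappa = kappa))
```

Here the accuracy is 45%, above the one-in-three rate of a uniform guess, yet Kappa is 0 because all of that accuracy comes from the class frequencies alone.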

Test 3: Do first-time passengers have higher or lower expectations than returning customers, measured in terms of overall satisfaction?

Using Naive Bayes, we can see that:

                                    Class

Attribute	neutral or dissatisfied	satisfied
~	(0.57)	(0.43)
Customer.Type
+ Loyal Customer	44250.0	40414.0
+ disloyal Customer	14449.0	4485.0
[total]	58699.0	44899.0

FINDING: From the split, we can see that disloyal customers (first-time?) are more likely to be dissatisfied than loyal customers. About 14449.0/(14449.0+4485.0) = 0.76 = 76% of disloyal customers are neutral or dissatisfied, compared with 44250.0/(44250.0+40414.0) = 0.52 = 52% of loyal customers.

Test 4: Is there a distinct (statistically significant) difference between business and personal travelers (A5) in terms of their reaction to their flights? (Hint: Use any attribute(s) that you think appropriate to measure their reaction.)

The most appropriate attribute for comparing business and personal travelers is A10 (Departure and Arrival Time Satisfaction). Business travelers are under tighter time constraints than personal travelers, so we would expect them to rate a punctual flight highly. Let’s examine:

Naive Bayes Classifier

                  Class

Attribute	0	1	2	3	4	5
~	(0.05)	(0.15)	(0.17)	(0.17)	(0.25)	(0.22)
Type.of.Travel
+ Personal Travel	869.0	2716.0	3046.0	3785.0	11436.0	10283.0
+ Business travel	4423.0	12738.0	14098.0	14120.0	14040.0	12052.0
[total]	5292.0	15454.0	17144.0	17905.0	25476.0	22335.0

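FINDING: The expected pattern does not appear: business travelers do not rate Departure and Arrival Time Satisfaction higher than personal travelers. From the counts above, about 37% of business travelers gave a rating of 4 or 5 (26092.0/71471.0), against about 68% of personal travelers (21719.0/32135.0).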
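Since the question asks about statistical significance, one way to check it directly is a chi-square test of independence on the counts in the table above. This is a sketch rather than part of the original Weka workflow. The counts are copied from the Naive Bayes output, which appears to add 1 to each cell for smoothing, a negligible shift at this sample size.

```r
# Counts copied from the Naive Bayes table above (ratings 0 to 5)
counts <- rbind(
  "Personal Travel" = c(869, 2716, 3046, 3785, 11436, 10283),
  "Business travel" = c(4423, 12738, 14098, 14120, 14040, 12052)
)
colnames(counts) <- 0:5

# Share of each rating within each travel type
row_share <- prop.table(counts, margin = 1)
high_share <- rowSums(row_share[, c("4", "5")])  # share rating 4 or 5

# Chi-square test of independence between travel type and rating
test <- chisq.test(counts)

print(round(row_share, 3))
print(round(high_share, 3))
print(test)
```

On these counts, the share of 4 or 5 ratings comes out near 68% for personal travel and 37% for business travel, and the test's p-value is far below 0.05.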

Let’s examine overall satisfaction instead.
Using the Naive Bayes classifier, the split is:

                                     Class

Attribute	neutral or dissatisfied	satisfied
~	(0.57)	(0.43)
Type.of.Travel
+ Personal Travel	28867.0	3264.0
+ Business travel	29832.0	41635.0
[total]	58699.0	44899.0

FINDING: Now there is a clear difference between personal and business travel. Personal travelers are mostly neutral or dissatisfied, 28867.0/(28867.0+3264.0) = 90%, while business travelers are mostly satisfied, 41635.0/(29832.0+41635.0) = 58%.

Test 5: Is there a distinct (statistically significant) difference between business class passengers and economy passengers (A6) in terms of their reaction to satisfaction with food-and-drink?

Naive Bayes Classifier

                            Class

Attribute	0	1	2	3	4	5
~	(0)	(0.12)	(0.21)	(0.21)	(0.23)	(0.21)
Class
+ Eco Plus	19.0	1102.0	1552.0	1591.0	1690.0	1520.0
+ Business	32.0	4359.0	10570.0	10702.0	12380.0	11496.0
+ Eco	57.0	7342.0	9799.0	9948.0	10227.0	9226.0
[total]	108.0	12803.0	21921.0	22241.0	24297.0	22242.0

FINDING: No major difference in the distribution of food-and-drink satisfaction between the classes. The one minor difference is that Business class has a lower share of its passengers rating food-and-drink satisfaction as 1.
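The raw counts are hard to compare because the classes differ in size, so it helps to convert each row to shares. A minimal sketch in R, using the counts from the table above (the object names are ours):

```r
# Counts copied from the Naive Bayes table above (ratings 0 to 5)
food <- rbind(
  "Eco Plus" = c(19, 1102, 1552, 1591, 1690, 1520),
  "Business" = c(32, 4359, 10570, 10702, 12380, 11496),
  "Eco"      = c(57, 7342, 9799, 9948, 10227, 9226)
)
colnames(food) <- 0:5

# Distribution of ratings within each class (each row sums to 1)
food_share <- prop.table(food, margin = 1)
print(round(100 * food_share, 1))
```

The column for rating 1 shows the difference mentioned in the finding: just under 9% of Business passengers, against roughly 15% to 16% for Eco Plus and Eco.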

Step 8.

i. A12 vs A23

We will determine whether any relationship exists between check-in service (A12) and baggage handling (A23) using four data mining techniques:

Using J48:

Summary
Correctly Classified Instances	7254	28.0153 %
Incorrectly Classified Instances	18639	71.9847 %
Kappa statistic	0
Mean absolute error	0.259
Root mean squared error	0.3599
Relative absolute error	99.9996 %
Root relative squared error	100 %
Total Number of Instances	25893

Using RandomTree:

Summary
Correctly Classified Instances	7279	28.1118 %
Incorrectly Classified Instances	18614	71.8882 %
Kappa statistic	0.0023
Mean absolute error	0.2548
Root mean squared error	0.3569
Relative absolute error	98.3792 %
Root relative squared error	99.1746 %
Total Number of Instances	25893

Using LogitBoost:

Summary
Correctly Classified Instances	7277	28.1041 %
Incorrectly Classified Instances	18616	71.8959 %
Kappa statistic	0.0038
Mean absolute error	0.2548
Root mean squared error	0.3569
Relative absolute error	98.3956 %
Root relative squared error	99.1755 %
Total Number of Instances	25893

Using Naive Bayes:

Summary
Correctly Classified Instances	7279	28.1118 %
Incorrectly Classified Instances	18614	71.8882 %
Kappa statistic	0.0023
Mean absolute error	0.2548
Root mean squared error	0.3569
Relative absolute error	98.38 %
Root relative squared error	99.1746 %
Total Number of Instances	25893

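Overall, there doesn’t seem to be a relationship between check-in service (A12) and baggage handling (A23): all four models reach about 28% accuracy with a Kappa statistic near 0.

ii. A10, A16 vs A24

We will examine how A10 (departure and arrival time satisfaction) and A16 (seat comfort satisfaction) each relate to A24 (overall satisfaction).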

Departure and arrival time satisfaction vs overall satisfaction:

Using Random Forest:

Summary
Correctly Classified Instances	14528	56.1078 %
Incorrectly Classified Instances	11365	43.8922 %
Kappa statistic	0
Mean absolute error	0.4894
Root mean squared error	0.4949
Relative absolute error	99.4982 %
Root relative squared error	99.7166 %
Total Number of Instances	25893

Using OneR:

Summary
Correctly Classified Instances	14528	56.1078 %
Incorrectly Classified Instances	11365	43.8922 %
Kappa statistic	0
Mean absolute error	0.4389
Root mean squared error	0.6625
Relative absolute error	89.2364 %
Root relative squared error	133.4939 %
Total Number of Instances	25893

Using AdaBoostM1:

Summary
Correctly Classified Instances	14528	56.1078 %
Incorrectly Classified Instances	11365	43.8922 %
Kappa statistic	0
Mean absolute error	0.4897
Root mean squared error	0.4949
Relative absolute error	99.5565 %
Root relative squared error	99.7262 %
Total Number of Instances	25893

Seat Comfort satisfaction vs overall satisfaction

Using Random Forest:

Summary
Correctly Classified Instances	17514	67.6399 %
Incorrectly Classified Instances	8379	32.3601 %
Kappa statistic	0.3629
Mean absolute error	0.419
Root mean squared error	0.4587
Relative absolute error	85.1956 %
Root relative squared error	92.423 %
Total Number of Instances	25893

Using OneR:

Summary
Correctly Classified Instances	17514	67.6399 %
Incorrectly Classified Instances	8379	32.3601 %
Kappa statistic	0.3629
Mean absolute error	0.3236
Root mean squared error	0.5689
Relative absolute error	65.7908 %
Root relative squared error	114.6232 %
Total Number of Instances	25893

Using AdaBoostM1:

Summary
Correctly Classified Instances	17514	67.6399 %
Incorrectly Classified Instances	8379	32.3601 %
Kappa statistic	0.3629
Mean absolute error	0.4202
Root mean squared error	0.4588
Relative absolute error	85.4337 %
Root relative squared error	92.438 %
Total Number of Instances	25893

FINDING: Seat comfort predicted overall satisfaction more accurately than departure and arrival time satisfaction: 67.6% accuracy with a Kappa of 0.36, against 56.1% with a Kappa of 0.
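All three learners report the same number of correct instances for a given predictor. A plausible reason is that with a single rating attribute as input, the best any of them can do is predict the most frequent class for each rating value, which is the OneR rule. The sketch below is our own minimal version of that rule on simulated data. The ratings and the satisfaction rates per rating are assumptions, not survey values, and it is not Weka's implementation.

```r
# Learn one rule per attribute value: predict the most frequent class for that value
one_r_fit <- function(x, y) {
  counts <- table(x, y)
  rule <- colnames(counts)[apply(counts, 1, which.max)]
  names(rule) <- rownames(counts)
  rule
}

# Look up the learned class for each attribute value
one_r_predict <- function(rule, x) unname(rule[as.character(x)])

set.seed(1)
n <- 20000
seat <- sample(1:5, n, replace = TRUE)           # simulated seat comfort ratings
p_sat <- c(0.1, 0.1, 0.2, 0.8, 0.9)[seat]        # assumed satisfaction rate per rating
overall <- ifelse(runif(n) < p_sat, "satisfied", "neutral or dissatisfied")

train <- 1:15000
test <- 15001:n
rule <- one_r_fit(seat[train], overall[train])
acc <- mean(one_r_predict(rule, seat[test]) == overall[test])
baseline <- max(table(overall[test])) / length(test)  # always predict the majority class

print(rule)
print(c(one_r_accuracy = acc, majority_baseline = baseline))
```

When the attribute carries no signal, the rule predicts the same class for every value and the accuracy falls back to the majority baseline, which matches the Kappa of 0 seen for departure and arrival time satisfaction.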

Step 8:

In this step we will use Weka attribute selection to rank the attributes:

InfoGainAttributeEval:

Ranked attributes:

Value	Col.	Attribute
0.304096	10	Online.boarding
0.233257	5	Inflight.wifi.service
0.192687	4	Class
0.163958	3	Type.of.Travel
0.135373	12	Inflight.entertainment
0.113619	11	Seat.comfort
0.087677	14	Leg.room.service
0.082571	13	On.board.service
0.074691	18	Cleanliness
0.073303	7	Ease.of.Online.booking
0.06156	15	Baggage.handling
0.05916	17	Inflight.service
0.04597	16	Checkin.service
0.039726	20	Age.cat
0.03777	9	Food.and.drink
0.037176	21	flightDistance.cat
0.026794	2	Customer.Type
0.017381	8	Gate.location
0.00538	23	Arrival.Delay.cat
0.00364	22	Departure.Delay.cat
0.00314	6	Departure.Arrival.time.convenient
0.00011	1	Gender
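For reference, the information gain values can be reproduced by hand from a class-by-attribute count table: it is the entropy of the class minus the weighted entropy of the class within each attribute value. The sketch below uses our own helper functions together with the Type.of.Travel and Customer.Type counts from the Naive Bayes tables in Step 7. Those counts appear to include Weka's add-one smoothing, so the results should be close to, not identical to, the ranked values.

```r
# Shannon entropy (in bits) of a vector of counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of an attribute; rows = attribute values, columns = classes
info_gain <- function(tab) {
  h_class <- entropy(colSums(tab))
  weights <- rowSums(tab) / sum(tab)
  h_conditional <- sum(weights * apply(tab, 1, entropy))
  h_class - h_conditional
}

classes <- c("neutral or dissatisfied", "satisfied")

travel <- rbind(
  "Personal Travel" = c(28867, 3264),
  "Business travel" = c(29832, 41635)
)
colnames(travel) <- classes

customer <- rbind(
  "Loyal Customer"    = c(44250, 40414),
  "disloyal Customer" = c(14449, 4485)
)
colnames(customer) <- classes

ig_travel <- info_gain(travel)
ig_customer <- info_gain(customer)
print(round(c(Type.of.Travel = ig_travel, Customer.Type = ig_customer), 4))
```

Both values land close to the ranked entries above (0.163958 for Type.of.Travel and 0.026794 for Customer.Type).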

OneRAttributeEval with Ranker:

Ranked attributes:

Value	Col.	Attribute
79.0335	10	Online.boarding
75.2399	4	Class
74.2398	5	Inflight.wifi.service
70.2155	12	Inflight.entertainment
68.0541	3	Type.of.Travel
68.0426	11	Seat.comfort
66.668	14	Leg.room.service
65.5974	7	Ease.of.Online.booking
65.3455	13	On.board.service
63.2662	18	Cleanliness
62.5519	15	Baggage.handling
62.4071	17	Inflight.service
61.0219	21	flightDistance.cat
61.0122	16	Checkin.service
60.7178	20	Age.cat
59.9282	9	Food.and.drink
58.5932	8	Gate.location
56.6606	2	Customer.Type
56.6606	23	Arrival.Delay.cat
56.6606	6	Departure.Arrival.time.convenient
56.6606	22	Departure.Delay.cat
56.6606	1	Gender

CfsSubsetEval with GreedyStepwise:

Selected attributes: 3,4,5,10,12,16,23 : 7

                 Type.of.Travel
                 Class
                 Inflight.wifi.service
                 Online.boarding
                 Inflight.entertainment
                 Checkin.service
                 Arrival.Delay.cat
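Unlike the two rankers, CfsSubsetEval scores whole subsets, which is why its selection can differ from the top of the ranked lists. As we understand it, the method follows correlation-based feature selection: a subset scores well when its features correlate with the class but not with each other. The sketch below writes out that merit heuristic with made-up correlation values. The function name and the inputs are ours, and Weka's own correlation measure is not reproduced here.

```r
# CFS merit of a subset of k features:
#   k * mean feature-class correlation / sqrt(k + k * (k - 1) * mean feature-feature correlation)
cfs_merit <- function(k, mean_feature_class_cor, mean_feature_feature_cor) {
  k * mean_feature_class_cor /
    sqrt(k + k * (k - 1) * mean_feature_feature_cor)
}

# Hypothetical values: same relevance to the class, different redundancy among features
merit_low_redundancy <- cfs_merit(7, 0.3, 0.1)
merit_high_redundancy <- cfs_merit(7, 0.3, 0.6)
print(c(low_redundancy = merit_low_redundancy, high_redundancy = merit_high_redundancy))
```

Under this heuristic, a weakly ranked attribute can still be selected if it adds information that the other chosen attributes do not already carry.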


Justin Gausin
