Airline Passenger Analysis - R & Weka
Executive summary.
Based on the models, customers’ overall satisfaction appears to be heavily influenced by Seat Comfort satisfaction, Departure Delay, and Arrival Delay. There doesn’t seem to be a significant relationship between check-in service (A12) and baggage handling (A23). Additionally, there is no statistically significant difference between the travel classes in food-and-drink satisfaction. The majority of customers are loyal (returning), range from youth to middle age, and typically travel on medium-haul flights. Surprisingly, most flights depart and arrive with a delay of 15 minutes or less.
According to attribute selection, Type of Travel, Class, Inflight Wifi Service, Online Boarding, Inflight Entertainment, and Check-in Service seem to have the most influence on overall satisfaction. Arrival Delay and Departure Delay are contested in attribute selection, appearing important in one but not the other. Gender appears to be the least important feature in relation to overall satisfaction or any other data features.
Preprocessing
Library
We load the libraries used throughout the analysis {See R Code}:
library(tidyverse)
library(ISLR2)
library(boot)
library(dplyr)
library(glmnet)
library(tree)
library(pls)
library(randomForest)
library(caret)
library(moments)
library(e1071)
library(gridExtra)
library(arules)
library(arulesViz)
library(rpart)
#install.packages("rpart.plot")
library(rpart.plot)
#install.packages("factoextra")
library(factoextra)
Import the Data Set.
We split the original data into a training set and a test set for use in the Final Project. The training set, AirlineSurvey_train, is about 80% of the original data set, and the test set, AirlineSurvey_test, is about 20%.
As an example we can see the features of the training data set:
names(airlinesurvey_train)
## [1] "X" "id"
## [3] "Gender" "Customer.Type"
## [5] "Age" "Type.of.Travel"
## [7] "Class" "Flight.Distance"
## [9] "Inflight.wifi.service" "Departure.Arrival.time.convenient"
## [11] "Ease.of.Online.booking" "Gate.location"
## [13] "Food.and.drink" "Online.boarding"
## [15] "Seat.comfort" "Inflight.entertainment"
## [17] "On.board.service" "Leg.room.service"
## [19] "Baggage.handling" "Checkin.service"
## [21] "Inflight.service" "Cleanliness"
## [23] "Departure.Delay.in.Minutes" "Arrival.Delay.in.Minutes"
## [25] "satisfaction"
Clean Data sets for any missing values
There are 310 rows with missing data in some columns of the training set. Likewise, there are 83 rows with missing data in some columns of the test set.
We remove the rows with missing data {See R code}:
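# Assumed definitions (not shown in the original): original dimensions and count of rows with any NA
dim_og_train <- dim(airlinesurvey_train)
dim_og_test <- dim(airlinesurvey_test)
sum_og_train <- sum(!complete.cases(airlinesurvey_train))
sum_og_test <- sum(!complete.cases(airlinesurvey_test))
AirlineSurvey_train <- airlinesurvey_train %>%
  na.omit()
AirlineSurvey_test <- airlinesurvey_test %>%
  na.omit()
dim_new_train <- dim(AirlineSurvey_train)
dim_new_test <- dim(AirlineSurvey_test)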
After cleaning, we get the following tables for the training and test sets.
For the train set:
Train_set <- c("Original", "NA values", "Post-Sanitization")
dim_og_train_1 <- dim_og_train[1]
dim_new_train_1 <- dim_new_train[1]
NumberRows <- c(dim_og_train_1,sum_og_train,dim_new_train_1)
table <- rbind(Train_set,NumberRows)
table
## [,1] [,2] [,3]
## Train_set "Original" "NA values" "Post-Sanitization"
## NumberRows "103904" "310" "103594"
For the test set:
Test_set <- c("Original", "NA values", "Post-Sanitization")
dim_og_test_1 <- dim_og_test[1]
dim_new_test_1 <- dim_new_test[1]
Rowsr <- c(dim_og_test_1,sum_og_test,dim_new_test_1)
tabler <- rbind(Test_set,Rowsr)
tabler
## [,1] [,2] [,3]
## Test_set "Original" "NA values" "Post-Sanitization"
## Rowsr "25976" "83" "25893"
Step 1: Statistics on departure delay (A8) and arrival delay (A9).
First we create a function in R to find the mode, shown below:
#Create a function that finds the mode
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
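As a quick sanity check of the helper above, it can be applied to a small made-up vector. The values below are illustrative assumptions, not survey data. One property worth knowing: when several values tie for the highest count, which.max returns the first maximum, so the tie resolves to whichever value appears earliest in the input.

```r
# Same helper as above, repeated so the example is self-contained
Mode <- function(x) {
  ux <- unique(x)                          # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]    # value with the highest count
}

# Hypothetical delay values in minutes (illustrative only)
toy_delays <- c(0, 12, 0, 45, 12, 0)
toy_mode <- Mode(toy_delays)               # 0 occurs three times

# 5 and 7 both occur twice; the tie resolves to 5 because it appears first
tie_mode <- Mode(c(5, 7, 7, 5))
```

For a column such as the delays, where zero dominates, the tie behaviour should not matter, but it is worth keeping in mind for more evenly spread features.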
i. mean and median
For the mean and median, we use R's built-in summary function.
The mean, median, and mode for Departure.Delay.in.Minutes:
summary(AirlineSurvey_train$Departure.Delay.in.Minutes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 14.75 12.00 1592.00
mod_delay <- Mode(AirlineSurvey_train$Departure.Delay.in.Minutes)
cat("The mode of the Departure Delay in Minutes is: ", mod_delay)
## The mode of the Departure Delay in Minutes is: 0
The mean, median, and mode for Arrival.Delay.in.Minutes:
summary(AirlineSurvey_train$Arrival.Delay.in.Minutes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 15.18 13.00 1584.00
mod_arrival <- Mode(AirlineSurvey_train$Arrival.Delay.in.Minutes)
cat("The mode of the Arrival Delay in Minutes is: ", mod_arrival)
## The mode of the Arrival Delay in Minutes is: 0
ii. The spread: standard deviation (for each)
We can find the standard deviation of the feature using the sd function in R:
std_D_dim <- sd(AirlineSurvey_train$Departure.Delay.in.Minutes)
std_A_dim <- sd(AirlineSurvey_train$Arrival.Delay.in.Minutes)
cat("The std for Departure.Delay.in.Minutes: ", std_D_dim)
## The std for Departure.Delay.in.Minutes: 38.11674
cat("The std for Arrival.Delay.in.Minutes: ", std_A_dim)
## The std for Arrival.Delay.in.Minutes: 38.69868
iii. Percentiles: the 10th, 50th, 75th and 90th percentiles (for each)
R offers the quantile function to determine percentiles. Since the 10th, 50th, 75th, and 90th percentiles are not the default set, we pass them explicitly as the probabilities .10, .50, .75, and .90.
Arrival.delay.Quartile <- quantile(AirlineSurvey_train$Arrival.Delay.in.Minutes, c(.10,.50,.75,.90))
Departure.delay.Quartile <- quantile(AirlineSurvey_train$Departure.Delay.in.Minutes, c(.10,.50,.75,.90))
The following table displays the percentiles for Arrival.Delay.in.Minutes and Departure.Delay.in.Minutes:
table_quartile<- rbind(Arrival.delay.Quartile,Departure.delay.Quartile)
table_quartile
## 10% 50% 75% 90%
## Arrival.delay.Quartile 0 0 13 44
## Departure.delay.Quartile 0 0 12 44
iv. 1st quartile, 3rd quartile, and the median
We again use R's summary function to display the 1st quartile, 3rd quartile, and the median:
departure_summary <- summary(AirlineSurvey_train$Departure.Delay.in.Minutes)
arrival_summary <-summary(AirlineSurvey_train$Arrival.Delay.in.Minutes)
table_summary <- rbind(departure_summary,arrival_summary)
table_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## departure_summary 0 0 0 14.74794 12 1592
## arrival_summary 0 0 0 15.17868 13 1584
v. The skewness (for each)
R offers the skewness function that can be used to determine the skewness of the features:
Departure.delay.Skewness <- skewness(AirlineSurvey_train$Departure.Delay.in.Minutes)
Arrival.delay.Skewness <- skewness(AirlineSurvey_train$Arrival.Delay.in.Minutes)
The following table displays the skewness for Arrival.Delay.in.Minutes and Departure.Delay.in.Minutes:
table_skewness <- rbind(Departure.delay.Skewness,Arrival.delay.Skewness)
table_skewness
## [,1]
## Departure.delay.Skewness 6.768853
## Arrival.delay.Skewness 6.596446
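To make the skewness figure less of a black box, here is a minimal sketch of one common definition: the third central moment divided by the 1.5 power of the second central moment. This is assumed to correspond to the moment-based definition; packages can apply slightly different sample-size corrections, so results on real data may differ in the later decimals. The helper name manual_skewness and the toy vectors are hypothetical.

```r
# Moment-based skewness: m3 / m2^(3/2),
# where m_k = mean((x - mean(x))^k) is the k-th central moment
manual_skewness <- function(x) {
  x <- x[!is.na(x)]            # ignore missing values
  centered <- x - mean(x)
  m2 <- mean(centered^2)       # second central moment (population variance)
  m3 <- mean(centered^3)       # third central moment
  m3 / m2^(3 / 2)
}

# Hypothetical right-skewed data: mostly zeros with one large delay
toy_delays <- c(0, 0, 0, 0, 10)
toy_skew <- manual_skewness(toy_delays)           # m3 = 96, m2 = 16, so 96 / 64 = 1.5

# A symmetric vector has zero skewness
sym_skew <- manual_skewness(c(-2, -1, 0, 1, 2))
```

A positive value indicates a long right tail, which is the pattern expected for delay data where most flights have little or no delay and a few have very long ones.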
vi. The covariance and correlation between A8 and A9
R offers the covariance and correlation functions that can be used to determine the covariance and correlation of the features, respectively:
covariance.A8wA9 <-cov(AirlineSurvey_train$Arrival.Delay.in.Minutes,AirlineSurvey_train$Departure.Delay.in.Minutes, method="pearson")
correlation.A8wA9 <- cor(AirlineSurvey_train$Arrival.Delay.in.Minutes,AirlineSurvey_train$Departure.Delay.in.Minutes, method="pearson")
The following table displays the covariance and correlation for Arrival.Delay.in.Minutes and Departure.Delay.in.Minutes:
table_covwcorr <- rbind(covariance.A8wA9,correlation.A8wA9)
table_covwcorr
## [,1]
## covariance.A8wA9 1424.1494855
## correlation.A8wA9 0.9654809
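The two numbers in the table are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. As a consistency check, the sketch below recomputes the correlation from the covariance above and the standard deviations reported in part ii. The variable names are illustrative.

```r
# Values copied from the outputs reported above
cov_a8_a9    <- 1424.1494855   # covariance of arrival and departure delay
sd_departure <- 38.11674       # sd of Departure.Delay.in.Minutes
sd_arrival   <- 38.69868       # sd of Arrival.Delay.in.Minutes

# Pearson correlation = cov(x, y) / (sd(x) * sd(y))
cor_from_cov <- cov_a8_a9 / (sd_departure * sd_arrival)
round(cor_from_cov, 4)
```

The result agrees with the reported correlation of 0.9654809 up to the rounding of the standard deviations.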
vii. Plotting the distribution
Plotting the distribution in a histogram in R:
par(mfrow=c(2,1))
hist(AirlineSurvey_train$Arrival.Delay.in.Minutes)
hist(AirlineSurvey_train$Departure.Delay.in.Minutes)
viii. Summary
There is a very strong correlation between A8 and A9, with a value of 0.9654809. Since the skewness of both A8 and A9 is greater than 1, both distributions are highly right-skewed. The histograms reinforce this conclusion: most of the values are concentrated on the left side of the graph, with a long tail to the right. While the maximum departure and arrival delays are above 1500 minutes (more than 25 hours), the average delay is close to 15 minutes. Ideally, delays would be close to zero, but an average delay of 15 minutes is quite small compared to the logistics involved.
Step 2: Discretize some features
i. Discretize Age
Our next step is to convert numerical values to categorical values. We discretize age (A3) to nominal values using the following criteria: 0-15: Child; 16-35: Youth; 36-55: Middle age; 56-70: Old; >70: Senior.
We discretize the following values in R, and store the new categorical (nominal) values as a new feature called Age.cat in the AirlineSurvey_train training set:
# Adds a new categorical feature column rather than replacing the original feature
# The numerical values can be removed later on
AirlineSurvey_train <- within(AirlineSurvey_train,
{
Age.cat <- NA
Age.cat[AirlineSurvey_train$Age < 16] <- "Child"
Age.cat[AirlineSurvey_train$Age >= 16 & AirlineSurvey_train$Age < 36] <- "Youth"
Age.cat[AirlineSurvey_train$Age >= 36 & AirlineSurvey_train$Age < 56] <- "MiddleAge"
Age.cat[AirlineSurvey_train$Age >=56 & AirlineSurvey_train$Age < 71] <- "Old"
Age.cat[AirlineSurvey_train$Age >= 71] <- "Senior"
})
# Age.cat is a character vector at this point, so convert it to a factor
AirlineSurvey_train$Age.cat <- as.factor(AirlineSurvey_train$Age.cat)
The following shows the distribution (in alphabetical order) of the new discretized feature Age.cat:
summary(AirlineSurvey_train$Age.cat)
## Child MiddleAge Old Senior Youth
## 6024 45379 16135 755 35301
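The same binning can also be expressed with base R's cut function. The sketch below is an alternative, not the code used in this report; the helper name discretize_age and the toy ages are hypothetical. It assumes integer ages, so the right-closed intervals (..., 15], (15, 35], and so on match the criteria above.

```r
# Alternative sketch: bin ages with cut() instead of repeated logical indexing
discretize_age <- function(age) {
  cut(age,
      breaks = c(-Inf, 15, 35, 55, 70, Inf),   # right-closed intervals by default
      labels = c("Child", "Youth", "MiddleAge", "Old", "Senior"))
}

# Hypothetical ages chosen to sit on each bin boundary
toy_ages <- c(5, 15, 16, 35, 36, 55, 56, 70, 71)
toy_age_cat <- discretize_age(toy_ages)
```

One difference from the approach above: cut returns a factor whose levels follow the order of the labels (Child, Youth, MiddleAge, Old, Senior), so summary would list the categories in age order rather than alphabetically.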
ii. Discretize flight distance
Furthermore, we discretize distance (A7) to nominal values using the following criteria: 0-500 miles: Short haul; 501-3000 miles: Medium haul; >3000 miles: Long haul.
We discretize the following values in R, and store the new categorical (nominal) values as a new feature called flightDistance.cat in the AirlineSurvey_train training set:
# Doesn't replace the feature but actually adds a new feature column that is categorical
AirlineSurvey_train <- within(AirlineSurvey_train,
{
flightDistance.cat <- NA
flightDistance.cat[AirlineSurvey_train$Flight.Distance < 501] <- "ShortHaul"
flightDistance.cat[AirlineSurvey_train$Flight.Distance >= 501 & AirlineSurvey_train$Flight.Distance < 3001] <- "MediumHaul"
flightDistance.cat[AirlineSurvey_train$Flight.Distance >= 3001] <- "LongHaul"
})
# flightDistance.cat is a character vector at this point, so convert it to a factor
AirlineSurvey_train$flightDistance.cat <- as.factor(AirlineSurvey_train$flightDistance.cat)
The following shows the distribution (in alphabetical order) of the new discretized feature flightDistance.cat:
summary(AirlineSurvey_train$flightDistance.cat)
## LongHaul MediumHaul ShortHaul
## 8248 63109 32237
iii. Discretize delays (A8 and A9)
Likewise we discretize A8 and A9 to nominal values: Small: 0-15; Medium: 16-45; Long: >45.
We discretize the following values in R, and store the new categorical (nominal) values as a new feature called Departure.Delay.cat in the AirlineSurvey_train training set:
# Doesn't replace the feature but actually adds a new feature column that is categorical
# The numerical values can be removed later on
AirlineSurvey_train <- within(AirlineSurvey_train,
{
Departure.Delay.cat <- NA
Departure.Delay.cat[AirlineSurvey_train$Departure.Delay.in.Minutes < 16] <- "Small"
Departure.Delay.cat[AirlineSurvey_train$Departure.Delay.in.Minutes >= 16 & AirlineSurvey_train$Departure.Delay.in.Minutes < 46] <- "Medium"
Departure.Delay.cat[AirlineSurvey_train$Departure.Delay.in.Minutes >= 46] <- "Long"
})
# Departure.Delay.cat is a character vector at this point, so convert it to a factor
AirlineSurvey_train$Departure.Delay.cat <- as.factor(AirlineSurvey_train$Departure.Delay.cat)
The following shows the distribution (in alphabetical order) of the new discretized feature Departure.Delay.cat:
summary(AirlineSurvey_train$Departure.Delay.cat)
## Long Medium Small
## 9879 13037 80678
We discretize the following values in R, and store the new categorical (nominal) values as a new feature called Arrival.Delay.cat in the AirlineSurvey_train training set.
# Doesn't replace the feature but actually adds a new feature column that is categorical
AirlineSurvey_train <- within(AirlineSurvey_train,
{
Arrival.Delay.cat <- NA
Arrival.Delay.cat[AirlineSurvey_train$Arrival.Delay.in.Minutes < 16] <- "Small"
Arrival.Delay.cat[AirlineSurvey_train$Arrival.Delay.in.Minutes >= 16 & AirlineSurvey_train$Arrival.Delay.in.Minutes < 46] <- "Medium"
Arrival.Delay.cat[AirlineSurvey_train$Arrival.Delay.in.Minutes >= 46] <- "Long"
})
# Arrival.Delay.cat is a character vector at this point, so convert it to a factor
AirlineSurvey_train$Arrival.Delay.cat <- as.factor(AirlineSurvey_train$Arrival.Delay.cat)
The following shows the distribution (in alphabetical order) of the new discretized feature Arrival.Delay.cat:
summary(AirlineSurvey_train$Arrival.Delay.cat)
## Long Medium Small
## 10066 13569 79959
iv. Plot the distributions
We plot the distributions of the discretized features using the ggplot function in R {Plots Not Shown, see R code}.
v. Summary
As seen in Step 1, the majority of the Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes values are relatively small delays, and Step 2 reinforces this finding. According to the categorical age feature, the majority of travelers are either Youth or Middle Age, that is, 16-55 years old. Likewise, the majority of flights are Medium Haul, that is, 501-3000 miles long.
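The "majority" statements can be checked directly from the category counts reported in Step 2. The sketch below only redoes that arithmetic; the vector name share and its element names are illustrative.

```r
# Counts copied from the summary() outputs in Step 2
n_train <- 103594   # rows in the cleaned training set

share <- c(
  small_departure_delay = 80678 / n_train,
  small_arrival_delay   = 79959 / n_train,
  youth_or_middle_age   = (35301 + 45379) / n_train,
  medium_haul           = 63109 / n_train
)
round(share, 3)
```

Each share is above one half: roughly 78% of departure delays and 77% of arrival delays are Small, about 78% of travelers are Youth or Middle Age, and about 61% of flights are Medium Haul.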
Step 3.
Next we are going to test three hypotheses. However, first we must factorize the quality features.
Factorize the quality features needed for the problem in the train set {Output Not Shown, see R code}.
Factorize the quality features needed for the problem in the test set {Output Not Shown, see R code}.
Test 1: Long haul passengers’ overall satisfaction is influenced more by the in-flight service quality than by the departure delays.
First, we create the long-haul subset by filtering the AirlineSurvey_train training set into a new data frame called Long.Haul.Passenger.
Long.Haul.Passenger <- AirlineSurvey_train[AirlineSurvey_train$flightDistance.cat == "LongHaul",]
The dimension of the Long.Haul.Passenger:
dim(Long.Haul.Passenger)
## [1] 8248 29
The number of rows (8248) matches the LongHaul count in Step 2 (ii).
Next, we examine the relationship between long-haul travelers’ overall satisfaction and Departure.Delay.in.Minutes, and between their overall satisfaction and Inflight.service. Since overall satisfaction is a nominal value, we need to use classification methods. In this step, we use logistic regression to compare the relationships.
The relationship between overall satisfaction and Departure.Delay.in.Minutes:
#Departure Delay:
glm.fit1 <- glm(satisfaction ~ Departure.Delay.in.Minutes, data = Long.Haul.Passenger, family = binomial)
summary(glm.fit1)
##
## Call:
## glm(formula = satisfaction ~ Departure.Delay.in.Minutes, family = binomial,
## data = Long.Haul.Passenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7538 0.6955 0.6955 0.7016 1.4833
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2961490 0.0285069 45.468 < 2e-16 ***
## Departure.Delay.in.Minutes -0.0039751 0.0006114 -6.502 7.93e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8817.8 on 8247 degrees of freedom
## Residual deviance: 8776.4 on 8246 degrees of freedom
## AIC: 8780.4
##
## Number of Fisher Scoring iterations: 4
The relationship between overall satisfaction and Inflight.service:
#ServiceQuality:
glm.fit2 <- glm(satisfaction ~ Inflight.service, data = Long.Haul.Passenger, family = binomial)
summary(glm.fit2)
##
## Call:
## glm(formula = satisfaction ~ Inflight.service, family = binomial,
## data = Long.Haul.Passenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0863 0.1310 0.1310 0.5321 1.3795
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.57 196.97 -0.059 0.953
## Inflight.service1 11.10 196.97 0.056 0.955
## Inflight.service2 11.26 196.97 0.057 0.954
## Inflight.service3 11.48 196.97 0.058 0.954
## Inflight.service4 13.45 196.97 0.068 0.946
## Inflight.service5 16.32 196.97 0.083 0.934
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8817.8 on 8247 degrees of freedom
## Residual deviance: 6203.9 on 8242 degrees of freedom
## AIC: 6215.9
##
## Number of Fisher Scoring iterations: 10
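FINDING: According to the Pr(>|z|) values, Departure.Delay.in.Minutes is a stronger predictor of long-haul passengers’ overall satisfaction than Inflight.service. This comparison should be read with caution: the standard errors of the Inflight.service coefficients are very large (196.97), which makes their p-values unreliable, and the Inflight.service model has the lower residual deviance (6203.9 vs. 8776.4) and AIC (6215.9 vs. 8780.4).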
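Per-coefficient p-values are only one way to compare the two fits. As a complementary check that is not part of the original analysis, the sketch below works purely from the deviances printed above. For a binary-response logistic regression, AIC equals the residual deviance plus twice the number of estimated coefficients, and the drop from null to residual deviance is a likelihood-ratio statistic. The data frame fits and its column names are illustrative.

```r
# Values copied from the two summary() outputs above
null_dev <- 8817.8
fits <- data.frame(
  model     = c("Departure.Delay.in.Minutes", "Inflight.service"),
  resid_dev = c(8776.4, 6203.9),
  n_coef    = c(2, 6)   # intercept + slope; intercept + 5 level contrasts
)

# AIC = residual deviance + 2 * number of coefficients (binary response)
fits$aic <- fits$resid_dev + 2 * fits$n_coef

# Likelihood-ratio test of each model against the intercept-only model
fits$dev_drop <- null_dev - fits$resid_dev
fits$lr_df    <- fits$n_coef - 1
fits$lr_p     <- pchisq(fits$dev_drop, df = fits$lr_df, lower.tail = FALSE)
fits
```

The recomputed AIC values reproduce the 8780.4 and 6215.9 shown in the outputs. The deviance drop is about 41 for the delay model and about 2614 for the Inflight.service model, so by this measure Inflight.service explains considerably more of the variation in satisfaction.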
Test 2: Medium haul passengers’ overall satisfaction is influenced more by the arrival delays than by the in-flight entertainment.
First, we create the medium-haul subset by filtering the AirlineSurvey_train training set into a new data frame called Medium.Haul.Passenger.
Medium.Haul.Passenger <- AirlineSurvey_train[AirlineSurvey_train$flightDistance.cat == "MediumHaul",]
The dimension of the Medium.Haul.Passenger:
dim(Medium.Haul.Passenger)
## [1] 63109 29
The number of rows (63109) matches the MediumHaul count in Step 2 (ii).
Next, we examine the relationship between medium-haul travelers’ overall satisfaction and Arrival.Delay.in.Minutes, and between their overall satisfaction and Inflight.entertainment. Since overall satisfaction is a nominal value, we need to use classification methods. In this step, we use logistic regression to compare the relationships.
The relationship between overall satisfaction and Arrival.Delay.in.Minutes:
#Arrival Delay
set.seed(1)
glm.fit_arrdelay <- glm(satisfaction ~ Arrival.Delay.in.Minutes, data = Medium.Haul.Passenger, family = binomial)
summary(glm.fit_arrdelay)
##
## Call:
## glm(formula = satisfaction ~ Arrival.Delay.in.Minutes, family = binomial,
## data = Medium.Haul.Passenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.095 -1.095 -1.040 1.262 2.833
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1968598 0.0086762 -22.69 <2e-16 ***
## Arrival.Delay.in.Minutes -0.0029660 0.0002307 -12.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 86579 on 63108 degrees of freedom
## Residual deviance: 86397 on 63107 degrees of freedom
## AIC: 86401
##
## Number of Fisher Scoring iterations: 4
The relationship between overall satisfaction and Inflight.entertainment:
#In-flight Entertainment
set.seed(1)
glm.fit_entertain <- glm(satisfaction ~ Inflight.entertainment, data = Medium.Haul.Passenger, family = binomial)
summary(glm.fit_entertain)
##
## Call:
## glm(formula = satisfaction ~ Inflight.entertainment, family = binomial,
## data = Medium.Haul.Passenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4564 -0.8061 -0.5406 0.9776 1.9978
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.566 42.238 -0.250 0.802
## Inflight.entertainment1 8.717 42.238 0.206 0.837
## Inflight.entertainment2 9.328 42.238 0.221 0.825
## Inflight.entertainment3 9.608 42.238 0.227 0.820
## Inflight.entertainment4 11.056 42.238 0.262 0.794
## Inflight.entertainment5 11.201 42.238 0.265 0.791
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 86579 on 63108 degrees of freedom
## Residual deviance: 74686 on 63103 degrees of freedom
## AIC: 74698
##
## Number of Fisher Scoring iterations: 9
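FINDING: According to the Pr(>|z|) values, Arrival.Delay.in.Minutes is a stronger predictor of medium-haul passengers’ overall satisfaction than Inflight.entertainment. The large standard errors (42.238) on the Inflight.entertainment coefficients make their p-values unreliable, however, and the Inflight.entertainment model has the lower AIC (74698 vs. 86401), so this comparison should be read with caution.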
Test 3: Satisfaction is influenced by the combination of arrival delay time and departure delay time for all passengers.
To test the hypothesis:
set.seed(1)
glm.fit_arrWdelay <- glm(satisfaction ~ Arrival.Delay.in.Minutes+Departure.Delay.in.Minutes, data = AirlineSurvey_train, family = binomial)
summary(glm.fit_arrWdelay)
##
## Call:
## glm(formula = satisfaction ~ Arrival.Delay.in.Minutes + Departure.Delay.in.Minutes,
## family = binomial, data = AirlineSurvey_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.162 -1.085 -1.020 1.272 2.894
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2201797 0.0067958 -32.400 < 2e-16 ***
## Arrival.Delay.in.Minutes -0.0072306 0.0006531 -11.071 < 2e-16 ***
## Departure.Delay.in.Minutes 0.0040626 0.0006599 6.157 7.43e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 141768 on 103593 degrees of freedom
## Residual deviance: 141360 on 103591 degrees of freedom
## AIC: 141366
##
## Number of Fisher Scoring iterations: 4
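FINDING: Both Arrival.Delay.in.Minutes and Departure.Delay.in.Minutes are statistically significant predictors of overall satisfaction in the AirlineSurvey_train training data set. This suggests that customers care more about departure and arrival delays than about in-flight service and entertainment. Because the two delays are highly correlated (0.965), the individual coefficients, including the positive sign on Departure.Delay.in.Minutes, should be interpreted with care.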
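Logistic regression coefficients are on the log-odds scale, which makes them hard to read directly. A minimal sketch of one way to interpret them is to exponentiate the coefficient times a delay length, giving the multiplicative change in the odds of being satisfied. The 15-minute step is an arbitrary choice for illustration, and the variable names are hypothetical.

```r
# Coefficients copied from the summary(glm.fit_arrWdelay) output above
coefs <- c(Arrival.Delay.in.Minutes   = -0.0072306,
           Departure.Delay.in.Minutes =  0.0040626)

# Odds ratio for a 15-minute increase in one delay, holding the other fixed
minutes <- 15
odds_ratio_15min <- exp(coefs * minutes)
round(odds_ratio_15min, 3)
```

Under this model, 15 extra minutes of arrival delay multiply the odds of satisfaction by roughly 0.90, while the departure-delay ratio sits slightly above 1. Since the two delays move together so closely, holding one fixed while changing the other is rarely a realistic scenario, so these ratios are best read jointly.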
Step 4.
Let’s look for associations between some of the important attributes. In this section, we will be using Weka.
i. Determining association rules.
We filter the original data set to consist only of Gender, Age (Nominal), Type of travel, Flight distance (Nominal), Class, Arrival delays (Nominal), and Overall satisfaction as attributes. Then we calculate the association rules based on a given support.
In this step, we use Weka to compute the association rules using the Apriori algorithm in the Associate tab.
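For readers less familiar with association rules, the sketch below computes the three usual measures (support, confidence, and lift) for a single rule in base R. It is an illustration of the definitions only, not a reproduction of Weka's Apriori run; the six-row data frame toy is made up, and only the attribute names are borrowed from the survey.

```r
# Hypothetical six-passenger data set (illustrative values only)
toy <- data.frame(
  Class = c("Business", "Business", "Eco", "Business", "Eco", "Eco Plus"),
  Customer.Type = c("Loyal Customer", "Loyal Customer", "Loyal Customer",
                    "disloyal Customer", "Loyal Customer", "Loyal Customer"),
  stringsAsFactors = FALSE
)

# Rule under test: Class=Business ==> Customer.Type=Loyal Customer
lhs <- toy$Class == "Business"
rhs <- toy$Customer.Type == "Loyal Customer"

support    <- mean(lhs & rhs)              # share of all rows matching both sides: 2/6
confidence <- sum(lhs & rhs) / sum(lhs)    # share of LHS rows that also match RHS: 2/3
lift       <- confidence / mean(rhs)       # confidence relative to the RHS base rate: 0.8
```

A lift below 1, as in this toy case, means the left-hand side makes the right-hand side less likely than its base rate. This is why a rule can have high confidence simply because its right-hand side is very common in the data.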
=== === === === === === === === Summary === === === === === === === ===
Minimum support: 0.4 (41438 instances)
Minimum metric === === === === === === === === === === === === === === === === === For #1. Most returning customers are in business class. This result may be due to businesses haveing partner relationship with Air carriers; an employee may only be able to fly with one carrier due to business policies. For #2 and #4: Gender seems to both positively identify loyalty (returning). This may be due to the fact that the amount of disloyal customers is low compared to the loyal customers. For #9: It is surprising that there is an association between satisfaction=neutral or dissatisfied 58697 ==> Arrival.Delay.cat=Small. That is, if the arrival is short, we would be assume that the customers would be satisfied. Using PCA (Principal Component Analysis), we combine features A10-A23 into a single feature. Let us call it PCAS. Next we find average, minimum, and maximum of A10-A23 (computed for each passenger record). Let us call them AVES, MINS, and MAXS, respectively. Lastly, we convert A24 (overall satisfaction) into a numeric value by converting neutral or unsatisfied to 1.0 and satisfied to 4. Let us call it DA24. We use Weka for the following step. USING PCA for the training set, we found the following AVES, MIN, and MAX for the first component in the rank (We remove all others): Rank1: Variance: .9399 {MIN: -5.698, MAX: 4.777, MEAN: 0, STD: 2.233} We use the first components analysis on the following classifier: Logistic Regression: AdaBoostM1: Using the training set instead of the test set, we get the following summary using the cross validation of 10 folds: Time taken to build model: 0.66 seconds Stratified cross-validation: FINDING: Stratified Cross-validation shows a significant accuracy. However, that is only due the fact that the cross validation was done using the training set. Both AdaBoostM1 and Logistic Regression used the training set and used the test set. PCAS reduces the complexity of the features while retaining the relationship of the features. 
The following top 3 components/ranks were used: Rank1: {MIN: -5.698, MAX: 4.777, MEAN: 0, STD: 2.233} Rank2: {MIN: -5.596, MAX: 4.652, MEAN: 0, STD: 2.003} Rank3: {MIN: -4.678, MAX: 6.565, MEAN: 0, STD: 1.941} If we use the following 3 components we get the following results: Using Logistic Regression with the supplied test set: No apparent increase in accuracy on the test set If we train it in the training set instead: If we use the training set again, but with crossvalidation of 10. === Stratified cross-validation === There does not seem to be a high increase in accuracy if we use three components instead of one. In the following step, we will use a general linear regression to explore the relationship between Arrival.Delay.in.Minutes and Flight.Distance in the AirlineSurvey_train data set. The summary shows: The Pr(>|t|) indicates that there is little to no relationship between the two features. Likewise, the R^2 value is close to 0. In the following step, we will use a general linear regression to explore the relationship between Departure.Delay.in.Minutes and Flight.Distance in the AirlineSurvey_train data set. The summary shows: The Pr(>|t|) indicates that there is little to no relationship between the two features. LIkewise, the R^2 value is FINDING: Flight.Distance was not a strong predictor for Departure and Arrival Delay. Arrival Delay and Departure Delay is a type of exponential problem where the waiting time is “memory-less”. Hence, linear regression would fail to fit the model. Next we look into a few more tests. Test 1: Is satisfaction with seat comfort related (or depends on) to passenger Gender? We will try to examine the relationship with seat comfort and Gender based on a Naive bayes classifier using the training and the test set. Naive Bayes Classifier In RandomTree: FINDING:There does not seem to be a relationship between Gender and seat comfort. Test 2: Is satisfaction with gate location related to passenger age? 
Let’s examine passenger age (numerical) and gate location satisfaction: Using J48 we got the following statistics: FINDING: No significant relationship between age and gate location. Worse than random guess. Test 3: Do first time passengers have more or less expectations than returning customers measured in terms of overall satisfaction? Using Naive Bayes, we can see that: FINDING: From the split, we can see that Disloyal Customers (First-time?) are more likely to be dissatisfied than Loyal customers. About 14449.0/(14449.0+4485.0) = 0.76 = 76% of disloyal customers are neutral or dissatisfied and 44250.0/(44250.0+40414.0) = .52 = 52% of loyal customers are neutral or dissatisfied. Test 4: Is there a distinct (statistically significant) difference between business and personal travelers (A5) in terms of their reaction to their flights? (Hint: Use any attribute(s) that you think appropriate to measure their reaction.) The most appropriate to measure for business and personal travelers is A10 (Departure and Arrival Time Satisfaction). Business travelers are more in a time constraints than personal travelers. Hence, a flight that is punctual would be highly rated for business travelers. Let’s examine: Naive Bayes Classifier FINDING: There does not seem to be a distinct difference between the Departure and Arrival Time Satisfaction between Business and personal travelers. Let’s examine overall satisfaction instead: FINDING: Now there is a clear distinct difference between personal and business travel. Personal travelers tend to be more neutral or dissatisfied (28867.0/(28867.0+3264.0)) = 90% and among Business travelers, they tend to be more satisfied overall (41635.0/(29832.0+41635.0)) = 58%. Test 5: Is there a distinct (statistically significant) difference between business class passengers and economy passengers (A6) in terms of their reaction to satisfaction with food-and-drink? 
Naive Bayes Classifier FINDING: No distribution difference between the classes with the food-and-drink satisfaction. The minor key difference in the distribution; Business class has a lower count (relative to its class) when rating the food-and-drink satifaction as 1. We will determine if any relationship exists between check-in service (A12) and baggage handling (A23) using four data mining technique: Using J48: Using RandomTree: Using LogitBoost: Using Naive Bayes: Overall there doesn’t seem to be a relationship with check-in service (A12) and baggage handling (A23). We will examine the relationship between A10 and A16. Departure and arrival time satisfaction vs overall satisfaction: Using Random Forest: OneR: AdaBoost1: vs Seat Comfort satisfaction vs overall satisfaction Using Random Forest: Using OneR: Using AdaBoost1: FINDING: Seat comfort had better accuracy in predicting overall satisfaction than Departure and arrival time satisfaction. In this step we will use Weka attribute selection to rank the attributes: InfoGainAttributeEval: Ranked attributes: OneRAttributeEval w/ Ranker Ranked attributes: CfsSubsetEval with GreedyStepwise: Selected attributes: 3,4,5,10,12,16,23 : 7
Generated sets of large itemsets:
Size of set of large itemsets L(1): 10
Size of set of large itemsets L(2): 8
Best rules found:</p>
ii. Summary.
Step 5.
i. Reduce the satisfaction features using PCA.
ii. Using three models.
Summary
Correctly Classified Instances
14528
56.1078 %
Incorrectly Classified Instances
11365
43.8922 %
Summary
Correctly Classified Instances
14528
56.1078 %
Incorrectly Classified Instances
11365
43.8922 %
Summary
Correctly Classified Instances
79951
77.1772 %
Incorrectly Classified Instances
23643
22.8228 %
5c. Using PCAS from A10-A23
Summary
Correctly Classified Instances
14528
56.1078 %
Incorrectly Classified Instances
11365
43.8922 %
Summary
Correctly Classified Instances
80100
77.3211 %
Incorrectly Classified Instances
23494
22.6789 %
Kappa statistic
0.5387
Summary
Correctly Classified Instances
80106
77.3269 %
Incorrectly Classified Instances
23488
22.6731 %
Step 6.
i. flight distance (in miles) and arrival delay (in minutes)
set.seed(1)
#names(AirlineSurvey_train)
#plot(Flight.Distance ~ Arrival.Delay.in.Minutes, data=AirlineSurvey_train)
lm.fit <- lm(Arrival.Delay.in.Minutes ~ Flight.Distance , AirlineSurvey_train)
summary(lm.fit)
##
## Call:
## lm(formula = Arrival.Delay.in.Minutes ~ Flight.Distance, data = AirlineSurvey_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.29 -15.22 -15.04 -2.20 1568.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.529e+01 1.871e-01 81.713 <2e-16 ***
## Flight.Distance -9.413e-05 1.206e-04 -0.781 0.435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.7 on 103592 degrees of freedom
## Multiple R-squared: 5.885e-06, Adjusted R-squared: -3.769e-06
## F-statistic: 0.6096 on 1 and 103592 DF, p-value: 0.4349
ii. flight distance (in miles) and departure delay (in minutes)
#plot(Flight.Distance ~ Departure.Delay.in.Minutes, data=AirlineSurvey_train)
set.seed(1)
lm.fit2 <- lm(Departure.Delay.in.Minutes~Flight.Distance, AirlineSurvey_train)
#names(summary(lm.fit2))
summary(lm.fit2)
##
## Call:
## lm(formula = Departure.Delay.in.Minutes ~ Flight.Distance, data = AirlineSurvey_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.02 -14.73 -14.68 -2.70 1577.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.466e+01 1.843e-01 79.546 <2e-16 ***
## Flight.Distance 7.284e-05 1.187e-04 0.613 0.54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.12 on 103592 degrees of freedom
## Multiple R-squared: 3.632e-06, Adjusted R-squared: -6.021e-06
## F-statistic: 0.3762 on 1 and 103592 DF, p-value: 0.5396
#summary(lm.fit2)$r.squared
summary(lm.fit2)$r.squared
which is close to 0.
Step 7.
Summary
Correctly Classified Instances    7969     30.7767 %
Incorrectly Classified Instances  17924    69.2233 %

Summary
Correctly Classified Instances    7969     30.7767 %
Incorrectly Classified Instances  17924    69.2233 %
Using decision trees in Weka, we got the following results:
Summary
Correctly Classified Instances    7122     27.5055 %
Incorrectly Classified Instances  18771    72.4945 %
Kappa statistic                   0
Mean absolute error               0.2626
Root mean squared error           0.3624
Relative absolute error           99.9997 %
Root relative squared error       100 %
Total Number of Instances         25893
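A kappa statistic of 0 means the tree agrees with the true labels no more often than chance would. One way this can happen is a tree that predicts a single class for every instance. The sketch below computes accuracy and Cohen's kappa from a confusion matrix under that assumption; the collapsed two-row matrix is hypothetical (only the counts 7122 and 18771 come from the summary above), and `cohen_kappa` is an illustrative helper, not a Weka function.

```r
# Cohen's kappa from a confusion matrix (rows = actual, columns = predicted).
cohen_kappa <- function(cm) {
  total    <- sum(cm)
  observed <- sum(diag(cm)) / total                      # observed agreement
  expected <- sum(rowSums(cm) * colSums(cm)) / total^2   # chance agreement
  (observed - expected) / (1 - expected)
}

# ASSUMPTION: every instance is predicted as one class, so the 7122 correct
# instances are exactly the members of that class and the other 18771 are not.
cm <- matrix(c(7122, 18771, 0, 0), nrow = 2,
             dimnames = list(actual    = c("predicted class", "other classes"),
                             predicted = c("predicted class", "other classes")))

accuracy    <- sum(diag(cm)) / sum(cm)
kappa_value <- cohen_kappa(cm)

print(round(100 * accuracy, 4))
print(kappa_value)
```

Under this assumption the accuracy reproduces the reported 27.5055 % while kappa is 0, because observed and chance agreement coincide.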
                     Class
Attribute            neutral or dissatisfied  satisfied
                     (0.57)                   (0.43)
Customer.Type
  Loyal Customer     44250.0                  40414.0
  disloyal Customer  14449.0                  4485.0
  [total]            58699.0                  44899.0
                   Class
Attribute          0        1        2        3        4        5
                   (0.05)   (0.15)   (0.17)   (0.17)   (0.25)   (0.22)
Type.of.Travel
  Personal Travel  869.0    2716.0   3046.0   3785.0   11436.0  10283.0
  Business travel  4423.0   12738.0  14098.0  14120.0  14040.0  12052.0
  [total]          5292.0   15454.0  17144.0  17905.0  25476.0  22335.0
Using the Naive Bayes classifier, the split is:
                   Class
Attribute          neutral or dissatisfied  satisfied
                   (0.57)                   (0.43)
Type.of.Travel
  Personal Travel  28867.0                  3264.0
  Business travel  29832.0                  41635.0
  [total]          58699.0                  44899.0
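The counts in this table are what a Naive Bayes model turns into class priors and conditional probabilities. As a minimal sketch, assuming a model that uses Type.of.Travel as its only attribute and takes the printed counts at face value, the posterior for a Personal Travel passenger can be worked out as follows (the object names are illustrative):

```r
# Counts copied from the table above (rows = Type.of.Travel, columns = class).
counts <- matrix(c(28867, 29832, 3264, 41635), nrow = 2,
                 dimnames = list(
                   Type.of.Travel = c("Personal Travel", "Business travel"),
                   satisfaction   = c("neutral or dissatisfied", "satisfied")))

# Class priors: these round to the (0.57) and (0.43) shown in the table.
prior <- colSums(counts) / sum(counts)

# Conditional probabilities P(Type.of.Travel = value | class).
likelihood <- sweep(counts, 2, colSums(counts), "/")

# Bayes' rule for a Personal Travel passenger: prior * likelihood, normalised.
unnormalised <- prior * likelihood["Personal Travel", ]
posterior    <- unnormalised / sum(unnormalised)

print(round(prior, 2))
print(round(posterior, 3))
```

With these counts the posterior probability of "satisfied" for a Personal Travel passenger is about 0.10, so this single-attribute model would label such passengers "neutral or dissatisfied".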
            Class
Attribute   0      1        2        3        4        5
            (0)    (0.12)   (0.21)   (0.21)   (0.23)   (0.21)
Class
  Eco Plus  19.0   1102.0   1552.0   1591.0   1690.0   1520.0
  Business  32.0   4359.0   10570.0  10702.0  12380.0  11496.0
  Eco       57.0   7342.0   9799.0   9948.0   10227.0  9226.0
  [total]   108.0  12803.0  21921.0  22241.0  24297.0  22242.0
Step 8.
i. A12 vs A23
Summary
Correctly Classified Instances    7254     28.0153 %
Incorrectly Classified Instances  18639    71.9847 %
Kappa statistic                   0
Mean absolute error               0.259
Root mean squared error           0.3599
Relative absolute error           99.9996 %
Root relative squared error       100 %
Total Number of Instances         25893

Summary
Correctly Classified Instances    7279     28.1118 %
Incorrectly Classified Instances  18614    71.8882 %
Kappa statistic                   0.0023
Mean absolute error               0.2548
Root mean squared error           0.3569
Relative absolute error           98.3792 %
Root relative squared error       99.1746 %
Total Number of Instances         25893

Summary
Correctly Classified Instances    7277     28.1041 %
Incorrectly Classified Instances  18616    71.8959 %
Kappa statistic                   0.0038
Mean absolute error               0.2548
Root mean squared error           0.3569
Relative absolute error           98.3956 %
Root relative squared error       99.1755 %
Total Number of Instances         25893

Summary
Correctly Classified Instances    7279     28.1118 %
Incorrectly Classified Instances  18614    71.8882 %
Kappa statistic                   0.0023
Mean absolute error               0.2548
Root mean squared error           0.3569
Relative absolute error           98.38 %
Root relative squared error       99.1746 %
Total Number of Instances         25893
ii. A10, A16 vs A24
Summary
Correctly Classified Instances    14528    56.1078 %
Incorrectly Classified Instances  11365    43.8922 %
Kappa statistic                   0
Mean absolute error               0.4894
Root mean squared error           0.4949
Relative absolute error           99.4982 %
Root relative squared error       99.7166 %
Total Number of Instances         25893

Summary
Correctly Classified Instances    14528    56.1078 %
Incorrectly Classified Instances  11365    43.8922 %
Kappa statistic                   0
Mean absolute error               0.4389
Root mean squared error           0.6625
Relative absolute error           89.2364 %
Root relative squared error       133.4939 %
Total Number of Instances         25893

Summary
Correctly Classified Instances    14528    56.1078 %
Incorrectly Classified Instances  11365    43.8922 %
Kappa statistic                   0
Mean absolute error               0.4897
Root mean squared error           0.4949
Relative absolute error           99.5565 %
Root relative squared error       99.7262 %
Total Number of Instances         25893

Summary
Correctly Classified Instances    17514    67.6399 %
Incorrectly Classified Instances  8379     32.3601 %
Kappa statistic                   0.3629
Mean absolute error               0.419
Root mean squared error           0.4587
Relative absolute error           85.1956 %
Root relative squared error       92.423 %
Total Number of Instances         25893

Summary
Correctly Classified Instances    17514    67.6399 %
Incorrectly Classified Instances  8379     32.3601 %
Kappa statistic                   0.3629
Mean absolute error               0.3236
Root mean squared error           0.5689
Relative absolute error           65.7908 %
Root relative squared error       114.6232 %
Total Number of Instances         25893

Summary
Correctly Classified Instances    17514    67.6399 %
Incorrectly Classified Instances  8379     32.3601 %
Kappa statistic                   0.3629
Mean absolute error               0.4202
Root mean squared error           0.4588
Relative absolute error           85.4337 %
Root relative squared error       92.438 %
Total Number of Instances         25893
Step 9.
Value     Col.  Attribute
0.304096  10    Online.boarding
0.233257  5     Inflight.wifi.service
0.192687  4     Class
0.163958  3     Type.of.Travel
0.135373  12    Inflight.entertainment
0.113619  11    Seat.comfort
0.087677  14    Leg.room.service
0.082571  13    On.board.service
0.074691  18    Cleanliness
0.073303  7     Ease.of.Online.booking
0.06156   15    Baggage.handling
0.05916   17    Inflight.service
0.04597   16    Checkin.service
0.039726  20    Age.cat
0.03777   9     Food.and.drink
0.037176  21    flightDistance.cat
0.026794  2     Customer.Type
0.017381  8     Gate.location
0.00538   23    Arrival.Delay.cat
0.00364   22    Departure.Delay.cat
0.00314   6     Departure.Arrival.time.convenient
0.00011   1     Gender
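The values in this ranking are consistent with information gain measured in bits, although the source does not name the evaluator, so that reading is an assumption. As a rough check, the sketch below computes information gain for Type.of.Travel from the Naive Bayes count table reported under Step 7; `entropy` and `info_gain` are illustrative helpers, not Weka functions.

```r
# Shannon entropy (in bits) of a probability vector.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of an attribute from a table of counts
# (rows = attribute values, columns = classes).
info_gain <- function(tab) {
  total   <- sum(tab)
  h_class <- entropy(colSums(tab) / total)
  h_cond  <- sum(apply(tab, 1, function(row) {
    sum(row) / total * entropy(row / sum(row))
  }))
  h_class - h_cond
}

# Type.of.Travel counts from the Step 7 Naive Bayes table.
travel <- matrix(c(28867, 29832, 3264, 41635), nrow = 2,
                 dimnames = list(
                   Type.of.Travel = c("Personal Travel", "Business travel"),
                   satisfaction   = c("neutral or dissatisfied", "satisfied")))

ig_travel <- info_gain(travel)
print(round(ig_travel, 4))
```

The result is roughly 0.164, close to the 0.163958 listed for Type.of.Travel, which supports the information-gain reading without proving it.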
Value    Col.  Attribute
79.0335  10    Online.boarding
75.2399  4     Class
74.2398  5     Inflight.wifi.service
70.2155  12    Inflight.entertainment
68.0541  3     Type.of.Travel
68.0426  11    Seat.comfort
66.668   14    Leg.room.service
65.5974  7     Ease.of.Online.booking
65.3455  13    On.board.service
63.2662  18    Cleanliness
62.5519  15    Baggage.handling
62.4071  17    Inflight.service
61.0219  21    flightDistance.cat
61.0122  16    Checkin.service
60.7178  20    Age.cat
59.9282  9     Food.and.drink
58.5932  8     Gate.location
56.6606  2     Customer.Type
56.6606  23    Arrival.Delay.cat
56.6606  6     Departure.Arrival.time.convenient
56.6606  22    Departure.Delay.cat
56.6606  1     Gender
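The values in this second ranking look like the percentage accuracy of a one-attribute rule: for each value of the attribute, predict the class that is most frequent for that value. The source does not name the evaluator, so this is an assumption. The sketch below applies that rule to the Step 7 count tables; `one_rule_accuracy` is an illustrative helper, not a Weka function.

```r
# Accuracy of a one-attribute rule: for each attribute value (row),
# predict the most frequent class and count those instances as correct.
one_rule_accuracy <- function(tab) {
  sum(apply(tab, 1, max)) / sum(tab)
}

classes <- c("neutral or dissatisfied", "satisfied")

# Count tables copied from Step 7 (rows = attribute values, columns = classes).
travel <- matrix(c(28867, 29832, 3264, 41635), nrow = 2,
                 dimnames = list(c("Personal Travel", "Business travel"), classes))
customer <- matrix(c(44250, 14449, 40414, 4485), nrow = 2,
                   dimnames = list(c("Loyal Customer", "disloyal Customer"), classes))

acc_travel   <- one_rule_accuracy(travel)
acc_customer <- one_rule_accuracy(customer)

print(round(100 * c(Type.of.Travel = acc_travel, Customer.Type = acc_customer), 2))
```

Type.of.Travel comes out at roughly 68.05 %, in line with the 68.0541 in the table. Customer.Type comes out at roughly 56.66 %, which is simply the share of the majority class: both customer types are predicted "neutral or dissatisfied", so the rule adds nothing over the baseline. That would also explain why the last five attributes share the same value of 56.6606.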
Type.of.Travel
Class
Inflight.wifi.service
Online.boarding
Inflight.entertainment
Checkin.service
Arrival.Delay.cat