# Titanic: Machine Learning from Disaster, [Kaggle Competition](https://www.kaggle.com/c/titanic/data), [Top 15 Accuracy](https://www.kaggle.com/pcyslm)

C.Y. Peng

Keywords: Data Science, Exploratory Data Analysis, Feature Engineering, Principal Component Analysis, Machine Learning, Basic Application
## Lib Information
- Release Ver: 20200101-R 1.0.0
- Lib Ver: 20200101
- Author: C.Y. Peng
- Required Lib: requirements_venv36dl.txt
- OS Required: Windows 64 bit
- Data Source: [Titanic: Machine Learning from Disaster, Kaggle Competition](https://www.kaggle.com/c/titanic/data)
- [My Kaggle Competition](https://www.kaggle.com/pcyslm)

## Part I. Project Overview

This project implements an analysis of the Titanic problem and a submission to the Kaggle competition. We cover the following segments:
- [x] Project Overview
- [x] Fundamental Principle
- [x] Introduction to Data Science
- [x] Titanic Problem Overview
- [x] Titanic Problem Definition
- [x] Data Analysis through Data Science
- [x] From EDA to Model Establishment
- [x] Analysis Conclusion
- [x] Quick Start
- [x] Project Lib Architecture
- [x] Model Establishment Pipeline
- [x] Operation Mode
- [x] Other Records
- [x] Reference

Data science is a technology for extracting effective information from large data sets. This repo gives you a quick understanding of data science through the Titanic problem analysis and lets you join the Kaggle competition.

## Part II. Fundamental Principle
### Introduction to Data Science

Figure 1. Data Science Flow Chart
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. There are five steps to establish a prediction system from large data, as shown in Figure 1:
- Problem Definition: use domain expertise to define the problem and the objective of solving it.
- Exploratory Data Analysis (EDA): first, use statistical methods to summarize the main characteristics of the data after collection and cleaning; second, use principal component analysis to analyze feature importance.
- Feature Engineering: convert the raw data into features and use these features to establish the model.
- Model Establishment: split the data into training, testing, and validation sets to establish, select, and optimize the model.
- Model Maintenance: runtime model implementation, monitoring, and diagnosis.

### Titanic Problem Overview

Figure 2. Sinkable Titanic, From Reference [https://www.britannica.com/topic/Titanic](https://www.britannica.com/topic/Titanic)
Titanic, in full Royal Mail Ship (RMS) Titanic, set sail on its maiden voyage from Southampton, England, to New York City, as shown in Figure 2. On April 15, 1912, just four days after leaving Southampton, the widely considered "unsinkable" RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren't enough lifeboats for everyone onboard, resulting in the deaths of 1502 out of 2224 passengers and crew (mortality rate: 68%). Why did some people survive while others died? There are clues in the available information, such as ticket class. In this project, we create a model that predicts which passengers survived the Titanic shipwreck. Some tips for this project:
- Data Source: [Titanic: Machine Learning from Disaster, Kaggle Competition](https://www.kaggle.com/c/titanic/data)
- There are 12 variables (including the output) in the train file and 11 variables (excluding the output) in the test file.
- The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
- We use the model to predict the test file and submit the results to the Kaggle competition.

### Titanic Problem Definition

Figure 3. Titanic Problem Definition
According to the references, we can reduce the problem to a math problem:
- The training data set consists of 891 samples, including one output variable in the train file, Survived (0: died, 1: survived).
- We want to establish a binary classification system that predicts which passengers survived the Titanic shipwreck.
- This project shows the analysis for the Titanic problem.

## Data Analysis through Data Science
### From EDA to Model Establishment
#### Titanic Problem Data Basic Profile
First, we look into the insights for all data, as in the following table:

| No. | Variables | Data Type | Unique Data | Description |
|-----|-------------|------------------------------|-------------|-------------|
| 1 | PassengerId | numerical, integer | | passenger ID |
| 2 | Pclass | categorical, ordinal | 3 | a proxy for socio-economic status (SES), ticket class, 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower) |
| 3 | Name | string object | | passenger name |
| 4 | Sex | categorical, nominal binary | 2 | male or female |
| 5 | Age | numerical, integer | | age is fractional if less than 1; if the age is estimated, it is in the form of xx.5 |
| 6 | SibSp | numerical, integer | | family relations, # of siblings/spouses aboard the Titanic |
| 7 | Parch | numerical, integer | | family relations, # of parents/children aboard the Titanic |
| 8 | Ticket | alphabet/number mixed object | | ticket number |
| 9 | Fare | numerical, float | | passenger fare |
| 10 | Cabin | alphabet/number mixed object | | cabin number |
| 11 | Embarked | categorical, nominal | 3 | port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton |
| 12 | Survived | categorical, nominal binary | | prediction output |
Table 1. Titanic Problem Data Information
From Table 1, we split the data into three types:
- Numerical Data Type: PassengerId, Age, SibSp, Parch, Fare
- Categorical Data Type: Pclass, Sex, Embarked, Survived (System Output)
- Other Data Type: Name, Ticket, Cabin

Furthermore, we need to transform some objects into numerical or categorical data:
- Name Variable: transform male names into Mr. (older) or Master (younger), female names into Mrs. (married woman) or Miss (unmarried woman), and rare names into Rare. The Name variable becomes nominal categorical data.
- Ticket Variable: there are two types of ticket values. One mixes alphabets and numbers (A&N type), which we view as categorical; the other contains numbers only (N type), which makes up most of the values and which we view as numerical.
- Cabin Variable: the Cabin value is made up of a single alphabet plus numbers; we keep only the first alphabet as nominal categorical data.

Next, we check the missing data in the files, as in the following table:

| variables | train file (missing no., percentage) | test file (missing no., percentage) |
| ----------- | ----------------------------------- | ---------------------------------- |
| PassengerId | 0 | 0 |
| Survived | 0 | - |
| Pclass | 0 | 0 |
| Name | 0 | 0 |
| Sex | 0 | 0 |
| Age | 177, 0.20 | 86, 0.21 |
| SibSp | 0 | 0 |
| Parch | 0 | 0 |
| Ticket | 0 | 0 |
| Fare | 0 | 1, 0.002 |
| Cabin | 687, 0.77 | 327, 0.78 |
| Embarked | 2, 0.002 | 0 |
Table 2. Missing Data Information in Train File and Test File
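The counts and percentages in Table 2 come from a simple pandas missing-value check. A minimal sketch on a toy frame (the values here are illustrative, not the real train.csv):

```python
# Sketch of the missing-data report behind Table 2 (toy data, not train.csv).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, None, None],
    "Fare":  [7.25, 71.28, 7.92, 53.10, 8.05],
})
missing_no = df.isnull().sum()                 # missing count per column
missing_pct = df.isnull().mean().round(2)      # missing fraction per column
report = pd.DataFrame({"missing no": missing_no, "percentage": missing_pct})
print(report)
```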
From Table 2, we discover that the Cabin variable is missing in about 77% of rows; this variable will be discarded due to its low information content. Second, we look into the numerical data and organize its statistics in the following tables and figures:

| | PassengerId | Survived | Age | SibSp | Parch | Fare |
| ----- | ----------- | -------- | ----- | ----- | ----- | ------ |
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.38 | 29.69 | 0.52 | 0.38 | 32.20 |
| std | 257.35 | 0.49 | 14.52 | 1.10 | 0.81 | 49.69 |
| min | 1.0 | 0.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 20.12 | 0.0 | 0.0 | 7.91 |
| 50% | 446.0 | 0.0 | 28.0 | 0.0 | 0.0 | 14.45 |
| 75% | 668.5 | 1.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 80.0 | 8.0 | 6.0 | 512.33 |
Table 3. Numerical Data Statistical Description in Train File
| | PassengerId | Age | SibSp | Parch | Fare |
| ----- | ----------- | ----- | ----- | ----- | ------ |
| count | 418.0 | 332.0 | 418.0 | 418.0 | 417.0 |
| mean | 1100.5 | 30.27 | 0.45 | 0.39 | 35.63 |
| std | 120.81 | 14.18 | 0.90 | 0.98 | 55.91 |
| min | 892.0 | 0.17 | 0.0 | 0.0 | 0.0 |
| 25% | 996.25 | 21.0 | 0.0 | 0.0 | 7.90 |
| 50% | 1100.5 | 27.0 | 0.0 | 0.0 | 14.45 |
| 75% | 1204.75 | 39.0 | 1.0 | 0.0 | 31.5 |
| max | 1309.0 | 76.0 | 8.0 | 9.0 | 512.32 |
Table 4. Numerical Data Statistical Description in Test File
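The per-column statistics in Tables 3-4 are pandas `describe()` output. A minimal sketch on toy numbers standing in for the real files:

```python
# Sketch of the summary statistics in Tables 3-4 (toy values, not the real files).
import pandas as pd

df = pd.DataFrame({"Age": [22, 38, 26, 35, 28],
                   "Fare": [7.25, 71.28, 7.92, 53.10, 8.05]})
stats = df.describe()                     # count, mean, std, min, quartiles, max
print(stats.round(2))
```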
and these data distribution plots, box plots, and pair plots:

Figure 4. Numerical Data Distribution in Train File

Figure 5. Numerical Data Distribution in Test File

Figure 6. Numerical Data Box Plot in Train File

Figure 7. Numerical Data Distribution Pair Plot versus Output in Train File
From Tables 3-4 and Figures 4-7, we find the following:
- PassengerId:
  - A meaningless number (Figures 4-5).
- Age:
  - Most passengers are over 16 years old; among those under 16, the highest percentage are under 5 years old (Figures 4-5).
  - Positive and negative samples are close to 1:1 in the Age data; the young (20-35 years old) have a higher survival rate than the older (22-40 years old) (Figure 6.).
  - The survival rate is highest at about 10-20 years old (Figure 7.).
- SibSp:
  - Most passengers are without any sibling or spouse; secondly, some passengers have at least 1 sibling or spouse (Tables 3-4, Figures 4-5).
  - Positive and negative samples are close to 1:1 in the SibSp data (Figure 6.).
  - The death rate is higher for passengers without any sibling or spouse; passengers with about 2 siblings or spouses have a high survival rate (Figure 7.).
- Parch:
  - Most passengers are without any parent or child aboard the Titanic (Tables 3-4, Figures 4-5); secondly, some passengers are with one child or parent (Figure 5.).
  - The samples are imbalanced in the Parch data (Figure 6.).
  - Passengers with two or fewer children or parents have a higher survival rate, especially those with none (Figure 7.).
- Fare:
  - Most people paid a low fare (Tables 3-4, Figures 4-5).
  - The samples are imbalanced in the Fare data (Figure 6.).
  - It seems that the survival rate for higher fares is higher than for lower fares (Figure 7.).
- Overall:
  - The numerical data distribution in the train file is similar to the test file, so the model established on the train file is effective for prediction (Figures 4-5).

There is no correlation among the numerical data in the train file and test file, as in the following figures:

Figure 8. Linear Correlation Heat Map Plot in Train File

Figure 9. Linear Correlation Heat Map Plot in Test File

Figure 10. Kendall Correlation Heat Map Plot in Train File

Figure 11. Kendall Correlation Heat Map Plot in Test File

Figure 12. Pearson Correlation Heat Map Plot in Train File

Figure 13. Pearson Correlation Heat Map Plot in Test File

Figure 14. Spearman Correlation Heat Map Plot in Train File

Figure 15. Spearman Correlation Heat Map Plot in Test File
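The correlation matrices behind the heat maps above can be produced with pandas, which supports the Pearson, Kendall, and Spearman methods. A minimal sketch on toy columns (illustrative values, not the real files):

```python
# Sketch of the pairwise correlation matrices behind Figures 8-15 (toy data).
import pandas as pd

df = pd.DataFrame({"Age":   [22, 38, 26, 35, 28],
                   "SibSp": [1, 1, 0, 1, 0],
                   "Fare":  [7.25, 71.28, 7.92, 53.10, 8.05]})
for method in ("pearson", "kendall", "spearman"):
    corr = df.corr(method=method)         # square correlation matrix
    print(method, corr.loc["Age", "Fare"].round(2))
```

A heat-map library (e.g. seaborn) would then render each `corr` matrix as in the figures.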
Furthermore, we look into the categorical data; its statistics are shown in the following tables:

| type | Cabin | type | Embarked | type | Pclass | type | Sex | type | Name | type | Ticket |
| ------- | ----- | ------- | -------- | ---- | ------ | ------ | ---- | ------ | ---- | ---- | ------ |
| unknown | 0.77 | S | 0.72 | 3 | 0.55 | male | 0.65 | Mr | 0.58 | N | 0.74 |
| C | 0.07 | C | 0.19 | 1 | 0.24 | female | 0.35 | Miss | 0.21 | A&N | 0.26 |
| B | 0.05 | Q | 0.09 | 2 | 0.21 | | | Mrs | 0.14 | | |
| D | 0.04 | unknown | 0.002 | | | | | Master | 0.04 | | |
| E | 0.04 | | | | | | | Rare | 0.03 | | |
| A | 0.02 | | | | | | | | | | |
| F | 0.01 | | | | | | | | | | |
| G | 0.004 | | | | | | | | | | |
| T | 0.001 | | | | | | | | | | |
Table 5. Percentage of Each Data Type in Categorical Data in Train File
| type | Cabin | type | Embarked | type | Pclass | type | Sex | type | Name | type | Ticket |
| ------- | ----- | ------- | -------- | ---- | ------ | ------ | ---- | ------ | ---- | ---- | ------ |
| unknown | 0.78 | S | 0.65 | 3 | 0.52 | male | 0.64 | Mr | 0.57 | N | 0.71 |
| C | 0.08 | C | 0.24 | 1 | 0.26 | female | 0.36 | Miss | 0.19 | A&N | 0.29 |
| B | 0.04 | Q | 0.11 | 2 | 0.22 | | | Mrs | 0.17 | | |
| D | 0.03 | unknown | | | | | | Master | 0.05 | | |
| E | 0.02 | | | | | | | Rare | 0.01 | | |
| F | 0.02 | | | | | | | | | | |
| A | 0.02 | | | | | | | | | | |
| G | 0.002 | | | | | | | | | | |
Table 6. Percentage of Each Data Type in Categorical Data in Test File
and

| | Cabin | Embarked | Pclass | Sex | Name |
| -------- | ----- | -------- | ------ | ---- | ---- |
| Died | 0.47 | 0.43 | 0.55 | 0.53 | 0.48 |
| Survived | 0.53 | 0.57 | 0.45 | 0.47 | 0.52 |
Table 7. Percentage of the Categorical Data versus Output in Train File

Figure 16. Categorical Data Count Plot in Train File

Figure 17. Categorical Data Count Plot in Test File

Figure 18. Categorical Data Count Plot versus Output in Train File

Figure 19. Categorical Data Percentage Bar Plot versus Output in Train File
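The per-category percentages in Tables 5-6 are normalized value counts. A minimal sketch with toy values standing in for the real Embarked column:

```python
# Sketch of the category percentages in Tables 5-6 (toy Embarked values).
import pandas as pd

embarked = pd.Series(["S", "S", "C", "S", "Q", "C", "S"])
pct = embarked.value_counts(normalize=True).round(2)   # fraction per category
print(pct)
```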
From the above figures and tables, we organize some tips:
- Sex:
  - Male:Female = 1.9:1 boarding the Titanic (Tables 5-6, Figures 16-17).
  - The male survival rate is less than the female survival rate (Figures 17-18); the survival rate of females is about 0.7, but that of males is about 0.2 (Figure 19.).
- Name:
  - Most males on the Titanic are married (Mr.); unmarried males (Master) are the lowest percentage of passengers (Tables 5-6, Figures 16-17).
  - Married males (Mr.) have the lowest survival rate of all passengers and account for most of the deaths on the Titanic; the survival rate of females (Miss, Mrs.) is higher than that of males, married or not (Figures 18-19).
- Pclass:
  - Most people paid for the lowest class of rooms on the Titanic (Tables 5-6, Figures 16-17).
  - Most passengers in the lowest class died, far more than in the other classes (Figures 18-19).
  - The mortality rate rises as the class gets lower (Figure 19.).
- Cabin:
  - The percentage of missing data is too high, 0.78 (Tables 5-6, Figures 16-17), so there is little information in the Cabin variable.
  - There is a higher survival rate for some first alphabets: 'B', 'D', 'E' (Figures 18-19).
- Embarked:
  - Most passengers boarded the Titanic at Southampton (Tables 5-6, Figures 16-17).
  - The survival rate is related to the boarding location: Cherbourg > Queenstown > Southampton (Figures 18-19).
- Ticket:
  - 75% are N type (treated as numerical data); the others are A&N type (Tables 5-6).
- Overall:
  - The categorical data distribution in the train file is similar to the test file, so the model established on the train file is effective for prediction (Tables 5-6, Figures 16-17).
  - Most of the samples are imbalanced in the categorical variables (Figures 18-19).

#### Feature Engineering from Basic Profile
Before establishing the model, we convert the variables into features. Some tips:
- Numerical Data:
  - For the Ticket variable, we transform the A&N categorical type into the number -1; the N-type values stay as numbers above 0, so the whole Ticket variable becomes numerical.
  - We compare normalization and standardization of the variables as features.
  - We use KNN to impute some missing values, such as Age.
- Nominal Categorical Data:
  - We use a one-hot encoder for nominal categorical data as features, such as Name.
  - We use KNN to impute some missing values, such as Cabin and Embarked.
- Ordinal Categorical Data:
  - For the Pclass variable, we convert the 1/2/3 classes into the numerical integers 0/1/2.
- Binary Categorical Data:
  - For the Sex variable, we map male/female to a digital binary.

We show the one-hot encoding feature names for the nominal categorical data conversion:

| Name | Embarked |
|------|----------|
| x0 | x1 |
Table 8. One-Hot Encoding Features for the Nominal Categorical Data Conversion
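The feature-engineering steps above (KNN imputation, scaling, one-hot encoding, binary mapping) can be sketched with scikit-learn. This is a minimal illustration on a toy frame, not the repo's actual pipeline; column choices and parameters are assumptions:

```python
# Sketch of the feature engineering described above (toy rows, not the real data).
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, 35.0],      # has a missing value
    "Fare":     [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", "S", "Q"],
    "Sex":      ["male", "female", "female", "male"],
})
df["Sex"] = (df["Sex"] == "male").astype(int)    # binary mapping for Sex

numeric = Pipeline([("impute", KNNImputer(n_neighbors=2)),   # KNN imputation
                    ("scale", MinMaxScaler())])              # normalization variant
features = ColumnTransformer(
    [("num", numeric, ["Age", "Fare", "Sex"]),
     ("cat", OneHotEncoder(), ["Embarked"])],    # one-hot for nominal data
    sparse_threshold=0.0)                        # force a dense output array
X = features.fit_transform(df)
print(X.shape)   # 3 numeric columns + 3 one-hot columns
```

Swapping `MinMaxScaler` for `StandardScaler` gives the standardization variant compared throughout this analysis.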
On the other hand, the ROC curve can be used to rank features in order of importance, giving a visual way to compare feature performance. This technique is most suitable for binary classification tasks. We show the single-feature ROC curves for a decision stump in the following figures, displaying the features whose AUC is above 0.5:

Figure 20. ROC Curve for Numerical Data Normalization

Figure 21. ROC Curve for Numerical Data Standardization
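Ranking single features by decision-stump AUC can be sketched as below. Synthetic data and illustrative feature names replace the engineered Titanic features:

```python
# Sketch of single-feature decision-stump AUC ranking (synthetic data;
# feature names are illustrative placeholders).
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(int)  # only feature 0 is informative

aucs = {}
for j, name in enumerate(["Fare", "Age", "SibSp"]):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)  # decision stump
    proba = cross_val_predict(stump, X[:, [j]], y, cv=5,
                              method="predict_proba")[:, 1]
    aucs[name] = round(roc_auc_score(y, proba), 3)
print(aucs)   # the informative feature should rank clearly above 0.5
```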
From the above figures, we find some critical features:
- Fare (normalization, AUC = 0.514) > Age (normalization, AUC = 0.509) > Parch (normalization, AUC = 0.505) > Ticket (normalization, AUC = 0.504) > SibSp (normalization, AUC = 0.502)
- Fare (standardization, AUC = 0.687) > Ticket (standardization, AUC = 0.599) > Parch (standardization, AUC = 0.561) > SibSp (standardization, AUC = 0.543) > Age (standardization, AUC = 0.516)

#### Principal Component Analysis (PCA)
In this section, we use PCA to analyze the features after feature engineering. We compare the normalized and standardized numerical data after applying PCA:

Figure 22. Percentage Explained Variance for Numerical Data Normalization

Figure 23. Percentage Explained Variance for Numerical Data Standardization

Figure 24. Eigenvectors for Numerical Data Normalization

Figure 25. Eigenvectors for Numerical Data Standardization
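The PCA quantities read off Figures 22-25 (explained-variance ratios and eigenvectors) can be sketched with scikit-learn. Toy features stand in for the engineered Titanic features:

```python
# Sketch of the PCA analysis behind Figures 22-25 (toy features).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # one correlated pair

pca = PCA().fit(StandardScaler().fit_transform(X))
cum = np.cumsum(pca.explained_variance_ratio_)
n_80 = int(np.searchsorted(cum, 0.8) + 1)        # components for 80% variance
print(n_80)
print(pca.components_[0].round(2))               # first eigenvector (loadings)
```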
In the above figures, we list the first 10 eigenvectors. We organize some tips:
- For numerical data normalization:
  - About the first 4 eigenvalues (first: 0.33, second: 0.22, third: 0.15, fourth: 0.12) reach 80% of the variance (Figure 22.).
  - The first eigenvector (Figure 24.):
    - Name (Mr.), Sex, Ticket, Fare, and Pclass carry the main variation of the dataset.
    - Ticket, Pclass, Name (Mr.), and Sex are negatively correlated; Fare and Name (Miss) are slightly positively correlated.
    - Pclass is the most important factor in the variation of the dataset.
  - The second eigenvector (Figure 24.):
    - Name (Mr., Miss, Mrs.), Sex, Ticket, Fare, and Pclass carry the second-largest variation of the dataset.
    - Ticket, Pclass, Name (Miss, Mrs.), and Embarked (Queenstown) are positively correlated; Name (Mr.) and Sex are negatively correlated.
    - Sex and Name are the most important factors in the variation of the dataset.
  - The third eigenvector (Figure 24.):
    - Age (positive), Ticket (negative), and Fare (positive) are related.
    - Ticket is the critical factor; secondly, Age.
  - The fourth eigenvector (meaningless) (Figure 24.):
    - The Embarked components are negatively correlated.
- For numerical data standardization:
  - About the first 5 eigenvalues (first: 0.33, second: 0.18, third: 0.13, fourth: 0.1, fifth: 0.07) reach 80% of the variance (Figure 23.).
  - The first eigenvector (Figure 25.):
    - Age, Name (Mr.), and Sex are negatively correlated; the others are also negative.
    - Age, SibSp, and Parch are the critical factors.
  - The second eigenvector (Figure 25.):
    - Ticket (negative), Fare (positive), and Pclass (negative) are highly correlated.
  - The third eigenvector (Figure 25.):
    - Age and Ticket are highly positively correlated.
  - The fourth eigenvector (Figure 25.):
    - Ticket (positive), Fare (positive), Name (Miss, positive), and Embarked (Cherbourg, positive) are correlated; the others are negative.
  - The fifth eigenvector (Figure 25.):
    - Positive correlations: Parch, Name (Miss, Mrs.); negative correlations: SibSp, Fare, Name (Mr.), Sex.

#### Model Establishment
We extract a 7:3 training-testing/validation split from the training data. According to machine learning theory, we select multiple models to establish the prediction model because each model has a different VC dimension. We use the nested cross-validation grid search technique to establish the system model, with 5 inner folds for parameter tuning and 10 outer folds for model evaluation. For this project, we choose the following models:
- Naive Bayes Model: BernoulliNB, GaussianNB
- Gaussian Processes Model: GaussianProcessClassifier
- Linear Model: Perceptron, LogisticRegressionCV
- Linear and Quadratic Discriminant Model: LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
- Ensemble Model: GradientBoostingClassifier, BaggingClassifier, AdaBoostClassifier, DecisionTreeClassifier, RandomForestClassifier, ExtraTreesClassifier, XGBClassifier
- Extraction Model: MLPClassifier
- Others: KNeighborsClassifier

Using the training-testing data set to execute the nested cross-validation grid search, we evaluate each model from its learning curve, as in the following figures:

Figure 26. Learning Curve for Numerical Data Normalization

Figure 27. Learning Curve for Numerical Data Standardization
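The nested cross-validation grid search described above can be sketched as follows: a grid search over the inner folds tunes hyperparameters, and the outer folds score the tuned model. A toy problem and one model stand in for the 16 classifiers; the parameter grid is illustrative:

```python
# Sketch of nested CV grid search (inner folds = 5 for tuning,
# outer folds = 10 for evaluation); toy data and grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
inner = GridSearchCV(RandomForestClassifier(random_state=0),
                     param_grid={"n_estimators": [10, 50]},
                     cv=5)                       # inner loop: parameter tuning
scores = cross_val_score(inner, X, y, cv=10)     # outer loop: evaluation
print(round(scores.mean(), 2), round(3 * scores.std(), 2))
```

The mean and 3*STD printed here correspond to the accuracy columns reported in Tables 9-10.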
And we organize the tables:

| Classifier Name | Classifier Train Accuracy Mean | Classifier Test Accuracy Mean | Classifier Test Accuracy 3*STD | Classifier Time |
| ----------------------------- | ---- | ---- | ---- | ------ |
| GaussianProcessClassifier | 0.84 | 0.85 | 0.08 | 4.42 |
| GradientBoostingClassifier | 0.92 | 0.84 | 0.12 | 555.98 |
| XGBClassifier | 0.92 | 0.84 | 0.17 | 40.55 |
| MLPClassifier | 0.84 | 0.82 | 0.21 | 314.18 |
| BaggingClassifier | 0.89 | 0.81 | 0.12 | 13.59 |
| RandomForestClassifier | 0.93 | 0.81 | 0.07 | 34.19 |
| LogisticRegressionCV | 0.80 | 0.81 | 0.17 | 29.41 |
| DecisionTreeClassifier | 0.88 | 0.81 | 0.12 | 0.24 |
| AdaBoostClassifier | 0.83 | 0.80 | 0.16 | 22.44 |
| GaussianNB | 0.80 | 0.80 | 0.18 | 0.04 |
| KNeighborsClassifier | 0.94 | 0.79 | 0.13 | 1.22 |
| LinearDiscriminantAnalysis | 0.81 | 0.79 | 0.12 | 0.07 |
| ExtraTreesClassifier | 0.92 | 0.79 | 0.22 | 20.21 |
| BernoulliNB | 0.80 | 0.78 | 0.12 | 0.08 |
| Perceptron | 0.66 | 0.66 | 0.58 | 0.24 |
| QuadraticDiscriminantAnalysis | 0.65 | 0.62 | 0.06 | 0.04 |
Table 9. Nested Cross-Validation Grid Search For Numerical Data Normalization
| Classifier Name | Classifier Train Accuracy Mean | Classifier Test Accuracy Mean | Classifier Test Accuracy 3*STD | Classifier Time |
| ----------------------------- | ---- | ---- | ---- | ------ |
| BaggingClassifier | 0.93 | 0.85 | 0.14 | 19.04 |
| LinearDiscriminantAnalysis | 0.82 | 0.85 | 0.07 | 0.06 |
| RandomForestClassifier | 0.93 | 0.84 | 0.11 | 41.27 |
| BernoulliNB | 0.80 | 0.83 | 0.12 | 0.13 |
| LogisticRegressionCV | 0.84 | 0.82 | 0.12 | 34.07 |
| MLPClassifier | 0.85 | 0.82 | 0.12 | 339.83 |
| DecisionTreeClassifier | 0.87 | 0.82 | 0.11 | 0.27 |
| KNeighborsClassifier | 0.89 | 0.81 | 0.12 | 1.48 |
| ExtraTreesClassifier | 0.90 | 0.81 | 0.12 | 24.31 |
| GaussianProcessClassifier | 0.88 | 0.81 | 0.09 | 5.08 |
| GradientBoostingClassifier | 0.91 | 0.80 | 0.13 | 574.52 |
| XGBClassifier | 0.91 | 0.80 | 0.12 | 38.04 |
| AdaBoostClassifier | 0.85 | 0.80 | 0.13 | 27.00 |
| Perceptron | 0.77 | 0.78 | 0.15 | 0.35 |
| GaussianNB | 0.81 | 0.77 | 0.17 | 0.07 |
| QuadraticDiscriminantAnalysis | 0.67 | 0.65 | 0.13 | 0.05 |
Table 10. Nested Cross-Validation Grid Search For Numerical Data Standardization
Furthermore, we use the validation data as the model input and plot ROC curves of the produced outputs:

Figure 28. ROC Curve for Numerical Data Normalization

Figure 29. ROC Curve for Numerical Data Standardization
From the above figures and tables, we organize some tips for these models:
- Degree of fitting for numerical data normalization (Tables 9-10, Figures 26-27):
  - All models are overfitting (Figures 26-27) due to the limited training data.
  - GaussianProcessClassifier is the best model of the nested cross-validation stage.
- Degree of fitting for numerical data standardization (Tables 9-10, Figures 26-27):
  - All models are overfitting (Figures 26-27) due to the limited training data.
  - BaggingClassifier is the best model of the nested cross-validation stage.
- From the ROC curves for numerical data normalization (Figures 28-29):
  - AdaBoostClassifier, BaggingClassifier, and KNeighborsClassifier are the best models by AUC.
- From the ROC curves for numerical data standardization (Figures 28-29):
  - GradientBoostingClassifier and KNeighborsClassifier are the best models by AUC.
- We choose the best model from the learning curves and the ROC curves (Tables 9-10, Figures 26-29). In this project, we use the best model to predict the test file in the Kaggle competition.
We choose the following models as the best models and submitted their outputs to the Kaggle website:
- From the CV learning curve, GaussianProcessClassifier for numerical data normalization: **0.79425** score
- From the CV learning curve, BaggingClassifier for numerical data standardization: *0.76555* score
- From the ROC curve, AdaBoostClassifier, BaggingClassifier, and KNeighborsClassifier for numerical data normalization: *0.77511*, *0.77511*, *0.78468* scores
- From the ROC curve, GradientBoostingClassifier and KNeighborsClassifier for numerical data standardization: *0.76555*, *0.74162* scores

From the above submissions, we see that the best models are GaussianProcessClassifier and KNeighborsClassifier for numerical data normalization; moreover, we summarize their model reports:

| | precision | recall | f1-score | support |
| ------------ | --------- | ------ | -------- | ------- |
| 0 | 0.77 | 0.89 | 0.83 | 121.0 |
| 1 | 0.80 | 0.62 | 0.70 | 84.0 |
| accuracy | 0.78 | 0.78 | 0.78 | 0.78 |
| macro avg | 0.79 | 0.76 | 0.76 | 205.0 |
| weighted avg | 0.78 | 0.78 | 0.77 | 205.0 |
Table 11. Best Model Performance Summary Report for GaussianProcessClassifier Model
| | precision | recall | f1-score | support |
| ------------ | --------- | ------ | -------- | ------- |
| 0 | 0.82 | 0.90 | 0.86 | 128.0 |
| 1 | 0.80 | 0.68 | 0.73 | 77.0 |
| accuracy | 0.81 | 0.81 | 0.81 | 0.81 |
| macro avg | 0.81 | 0.79 | 0.80 | 205.0 |
| weighted avg | 0.81 | 0.81 | 0.81 | 205.0 |
Table 12. Best Model Performance Summary Report for KNeighborsClassifier Model
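Reports like Tables 11-12 are produced by scikit-learn's `classification_report` on the held-out validation split. A minimal sketch with illustrative labels standing in for the real predictions:

```python
# Sketch of the per-class report behind Tables 11-12 (illustrative labels).
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
report = classification_report(y_true, y_pred, output_dict=True)
print(classification_report(y_true, y_pred))   # precision/recall/f1/support table
```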
#### Some Information from Tree-Based Models
We can get some information about feature importance from the tree-based models.

Figure 30. Feature Importance for Numerical Data Normalization

Figure 31. Feature Importance for Numerical Data Standardization
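Fitted tree ensembles in scikit-learn expose these importances through the `feature_importances_` attribute. A minimal sketch on a toy forest, standing in for the tuned models of Figures 30-31:

```python
# Sketch of the tree-based importance ranking behind Figures 30-31 (toy data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]   # most important first
print(ranking, forest.feature_importances_.round(2))
```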
From the above figures, we understand that:
- For tree-based models on numerical data normalization (GradientBoostingClassifier, AdaBoostClassifier, DecisionTreeClassifier, RandomForestClassifier, ExtraTreesClassifier) (Figures 30-31):
  - The highest test scores, from first to third: XGBClassifier, RandomForestClassifier, DecisionTreeClassifier.
  - Name (Mr.) is the most important factor in the XGBClassifier; the next are Sex, Name (Rare), and Pclass.
  - RandomForestClassifier feature importance: Age > Fare > Sex > Ticket > Pclass > Name (Mr.) > Name (Mrs.) > SibSp > Name (Miss)
  - Name (Mr.) is the most important factor in the DecisionTreeClassifier; the next are Pclass, Fare, and Name (Rare).
- For tree-based models on numerical data standardization (GradientBoostingClassifier, AdaBoostClassifier, DecisionTreeClassifier, RandomForestClassifier, ExtraTreesClassifier) (Figures 30-31):
  - The highest test scores, from first to third: RandomForestClassifier, ExtraTreesClassifier, DecisionTreeClassifier.
  - RandomForestClassifier feature importance: Fare > Age > Ticket > Sex > Name (Mr.) > Pclass
  - ExtraTreesClassifier feature importance: Name (Mr.) > Sex > Pclass > Fare > Ticket > Age > SibSp
  - DecisionTreeClassifier feature importance: Sex

### Analysis Conclusion
We organize some tips from the above analysis:
- Name, which is related to Age and Sex, is an important factor, e.g., x0_Mr (from the basic profile).
- Pclass and Fare are the key variables (from the basic profile).
- Furthermore, some variables need more analysis, such as Age, SibSp, Parch, Fare, and Ticket (from the decision-stump ROC curves).
- Model prediction performance for numerical data normalization is better than for standardization (from model establishment).
- Name is the critical feature; the second is Pclass (from feature importance and PCA).
- We reached a high score of **0.79425**, around the [top 15 accuracy](https://www.kaggle.com/pcyslm) in the Kaggle competition.

## Part III. Quick Start
In this project, we use the preprocessed data sets to train/test the model, and validate the model.

Figure 32. Titanic Problem Lib Function Flow