Feature Engineering (Wk 1)
Titanic:
Name [Text]
* Age [Numeric] [Continuous] [177 null records]
* Gender [Text] [Categorical] [Complete]
* Pclass [Numeric] [Categorical] [Complete]
Embarked [Text] [Categorical]
Survived [Target Variable] [Complete]
? Parch [Numeric]
SibSp [Numeric]
Fare [Numeric]
PassengerId [Numeric]
Ticket [Alphanumeric]
Cabin [Alphanumeric]
He who masters errors, masters the universe.
Questions to ask yourself
Do any of the features have null records?
Notes to Self
We need to address null records
- Strategy 1: impute the mean from the rest of the sample
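A minimal sketch of Strategy 1 in pandas, using a tiny made-up sample in place of the real Titanic `Age` column:

```python
import pandas as pd

# Toy sample standing in for the Titanic training set (values are made up)
df = pd.DataFrame({"Age": [22.0, None, 38.0, None, 26.0]})

# Strategy 1: fill null Age records with the mean of the non-null values
mean_age = df["Age"].mean()          # (22 + 38 + 26) / 3
df["Age"] = df["Age"].fillna(mean_age)

print(df["Age"].isnull().sum())      # no nulls remain
```

Mean imputation keeps the column's overall average unchanged, but it shrinks the variance, which is worth keeping in mind before modelling.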
NOTE: Pandas Profiling can help with initial exploratory data analysis and feature engineering
Colab Notebook (link)
Kaggle Tutorials (link) / Advanced Features — Kaggle (link) / Best Notebook (link) / Jake Book (link)
Precision and Recall (link)
Suppose a computer program for recognizing dogs in photographs identifies 8 dogs in a picture containing 12 dogs and some cats. Of the 8 identified as dogs, 5 actually are dogs (true positives), while the rest are cats (false positives). The program’s precision is 5/8 and its recall is 5/12. When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 and its recall is 20/60 = 1/3. In this case, precision measures “how useful the search results are”, and recall measures “how complete the results are”.
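The two worked examples above can be checked directly with exact fractions:

```python
from fractions import Fraction

# Dog-recognition example: 8 predicted dogs, of which 5 are correct,
# and 12 actual dogs in the picture.
tp, predicted, actual = 5, 8, 12
precision = Fraction(tp, predicted)    # 5/8
recall = Fraction(tp, actual)          # 5/12

# Search-engine example: 30 pages returned, 20 relevant,
# 40 relevant pages missed (so 60 relevant pages in total).
search_precision = Fraction(20, 30)    # 2/3
search_recall = Fraction(20, 20 + 40)  # 1/3
```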
Sensitivity (also called the true positive rate, the epidemiological/clinical sensitivity, the recall, or probability of detection[1] in some fields) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition). It is often confused with the detection limit[2][3]; the detection limit, however, is calculated from the analytical sensitivity, not from the epidemiological sensitivity.
Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
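Both rates are simple ratios over the confusion-matrix counts; a quick sketch with hypothetical screening-test numbers (illustrative only, not from the text):

```python
# Hypothetical counts for a screening test on 50 sick and 100 healthy people
tp, fn = 45, 5     # sick people: 45 correctly flagged, 5 missed
tn, fp = 90, 10    # healthy people: 90 correctly cleared, 10 false alarms

sensitivity = tp / (tp + fn)   # true positive rate: 45/50 = 0.9
specificity = tn / (tn + fp)   # true negative rate: 90/100 = 0.9
```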
Dataset link: (Titanic)
Notebook link:
Converting Categorical Data into Numerical Data
Data Imputation
Feature importance Function
Confusion Matrix Example (Actual vs Predicted) [Parables: the boy who cried wolf, the pregnant lady (Type I error and Type II error)]
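A confusion matrix can be tallied from actual vs. predicted labels in plain Python; here 1 means "wolf" and 0 means "no wolf" (toy labels, made up for illustration). Crying wolf when there is none is a false positive (Type I error); missing a real wolf is a false negative (Type II error):

```python
# Toy actual vs. predicted labels (1 = "wolf", 0 = "no wolf")
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 1]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # wolf, correctly called
tn = sum(a == 0 and p == 0 for a, p in pairs)  # no wolf, correctly silent
fp = sum(a == 0 and p == 1 for a, p in pairs)  # Type I: crying wolf
fn = sum(a == 1 and p == 0 for a, p in pairs)  # Type II: missing the wolf
```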
Workflow for Titanic Dataset
Which features are available in the dataset? [Titanic]
Which features are categorical?
Which features are numerical?
Which features are mixed data types?
Which features may contain errors or typos?
Which features contain blank, null or empty values?
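Most of the questions above can be answered with two pandas calls; a sketch using a tiny made-up stand-in for the Titanic training set:

```python
import pandas as pd

# Tiny stand-in for the Titanic training set (values are made up)
df = pd.DataFrame({
    "Pclass":   [3, 1, 3],
    "Gender":   ["male", "female", "female"],
    "Age":      [22.0, 38.0, None],
    "Embarked": ["S", "C", None],
})

print(df.dtypes)             # which features are numerical vs. text
null_counts = df.isnull().sum()
print(null_counts)           # which features contain blank/null values
```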
Advanced (Open Kaggle Tutorials)
Class Encoding
Frequency Encoding
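A minimal sketch of both encodings on the `Embarked` column (sample values are made up):

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "S", "C"]})

# Class (label) encoding: map each category to an integer code
df["Embarked_class"] = df["Embarked"].astype("category").cat.codes

# Frequency encoding: replace each category with its occurrence count
freq = df["Embarked"].value_counts()
df["Embarked_freq"] = df["Embarked"].map(freq)
```

Class encoding imposes an arbitrary ordering on the categories, which tree models tolerate but linear models may misread; frequency encoding instead carries information about how common each category is.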