Feature Engineering (Wk 1)

Pradeep Ankem
2 min read · Jun 5, 2020

Titanic dataset features:

* Name [Text]
* Age [Numeric] [Continuous] [177 null records]
* Gender [Text] [Categorical] [Complete]
* Pclass [Numeric] [Categorical] [Complete]
* Embarked [Text] [Categorical]
* Survived [Target Variable] [Complete]
* Parch [Numeric] (?)
* SibSp [Numeric]
* Fare [Numeric]
* PassengerId [Numeric]
* Ticket [Alphanumeric]
* Cabin [Alphanumeric]

He who masters errors, masters the universe.

Questions to ask yourself

Do they have null records?

Notes to Self

We need to address null records.

- Strategy 1: impute the missing values with the mean of the rest of the sample
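A minimal sketch of mean imputation for the Age column, assuming the Titanic data is loaded into a pandas DataFrame (the toy values below stand in for the real CSV):

```python
import pandas as pd

# Toy stand-in for the Titanic data; in practice: pd.read_csv("train.csv")
df = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0, None],
                   "Survived": [0, 1, 1, 1, 0]})

# Strategy 1: fill null Age records with the mean of the non-null values
mean_age = df["Age"].mean()          # computed from the rest of the sample
df["Age"] = df["Age"].fillna(mean_age)

print(df["Age"].tolist())
```

The same idea extends to the median (more robust to outliers) by swapping `mean()` for `median()`.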

NOTE: Pandas Profiling can help with initial exploratory data analysis and feature engineering.

Colab Notebook (link)

Kaggle Tutorials (link) / Advanced Features on Kaggle (link) / Best Notebook (link) / Jake Book (link)

Precision and Recall (link)

Suppose a computer program for recognizing dogs in photographs identifies 8 dogs in a picture containing 12 dogs and some cats. Of the 8 identified as dogs, 5 actually are dogs (true positives), while the rest are cats (false positives). The program’s precision is 5/8 while its recall is 5/12. When a search engine returns 30 pages only 20 of which were relevant while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3. So, in this case, precision is “how useful the search results are”, and recall is “how complete the results are”.
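The dog example above can be verified with a few lines of arithmetic (TP = 5, FP = 3, FN = 7):

```python
from fractions import Fraction

# Dog-recognition example: 8 identified, of which 5 are dogs; 12 actual dogs
tp, fp = 5, 3          # 8 identified = 5 dogs + 3 cats
fn = 12 - tp           # dogs the program missed

precision = Fraction(tp, tp + fp)   # of those identified, how many are right
recall = Fraction(tp, tp + fn)      # of all dogs, how many were found

print(precision, recall)  # 5/8 5/12
```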

Sensitivity (also called the true positive rate, the epidemiological/clinical sensitivity, the recall, or the probability of detection[1] in some fields) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition). It is often confused with the detection limit[2][3]; the detection limit, however, is calculated from the analytical sensitivity, not the epidemiological sensitivity.

Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
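To make the two definitions concrete, a small illustrative example (the counts here are made up):

```python
from fractions import Fraction

# Hypothetical screening test on 100 sick and 100 healthy people
tp, fn = 90, 10    # sick people: correctly flagged vs. missed
tn, fp = 80, 20    # healthy people: correctly cleared vs. false alarms

sensitivity = Fraction(tp, tp + fn)   # true positive rate
specificity = Fraction(tn, tn + fp)   # true negative rate

print(sensitivity, specificity)  # 9/10 4/5
```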

Dataset link: (Titanic)

Notebook link:

Converting Categorical Data into Numerical Data

Data Imputation

Feature importance Function
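A minimal end-to-end sketch of these three steps, assuming scikit-learn is available (the tiny DataFrame below is made up for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up miniature of the Titanic data
df = pd.DataFrame({
    "Gender": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0, 28.0, None],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 1, 1, 0],
})

# Converting categorical data into numerical data
df["Gender"] = df["Gender"].map({"male": 0, "female": 1})

# Data imputation: fill missing Age with the mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Feature importance via a tree ensemble
X, y = df[["Gender", "Age", "Pclass"]], df["Survived"]
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(dict(zip(X.columns, model.feature_importances_)))
```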

Confusion Matrix Example (Actual vs. Predicted) [Parables: the boy who cried wolf, the pregnant lady (Type I error and Type II error)]
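A small hand-rolled confusion matrix over made-up actual vs. predicted labels (a sketch only; scikit-learn's `confusion_matrix` computes the same counts):

```python
# Made-up actual vs. predicted labels (1 = survived, 0 = did not)
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # Type I error
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # Type II error

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=3
```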

Workflow for Titanic Dataset

Which features are available in the dataset? [Titanic]

Which features are categorical?

Which features are numerical?

Which features are mixed data types?

Which features may contain errors or typos?

Which features contain blank, null or empty values?
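The last question can be answered in one line with pandas, assuming the data is loaded as a DataFrame (the slice below is made up):

```python
import pandas as pd

# Made-up slice of the Titanic data
df = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Cabin": [None, "C85", None],
    "Pclass": [3, 1, 1],
})

# Which features contain blank, null or empty values?
print(df.isnull().sum())
```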

Advanced (Open Kaggle Tutorials)

Class Encoding

Frequency Encoding
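A sketch of both encodings with pandas (the column and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "S", "C"]})

# Class (label) encoding: map each category to an integer code
df["Embarked_class"] = df["Embarked"].astype("category").cat.codes

# Frequency encoding: map each category to its relative frequency
freq = df["Embarked"].value_counts(normalize=True)
df["Embarked_freq"] = df["Embarked"].map(freq)

print(df)
```

Frequency encoding can help when a category's prevalence carries signal, but note that two categories with the same frequency collapse to the same value.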


Pradeep Ankem

In a parallel universe, I would have been a Zen monk.