Stats + Probability 101 — GL

Topics to be covered

Pradeep Ankem
Jun 8, 2019

Pandas Stats Module

To read a csv

import pandas as pd
df = pd.read_csv('file_name.csv')  # accepts a local path or a URL

To get a statistical summary

df.describe()

To make head or tail of the data frame

df.head()
df.tail()

Shape of You

df.shape

Structure of the data

df.info() 

Get some column names

df.columns

Let’s get the unique values

def rstr(df):
    return df.apply(lambda x: [x.unique()])  # unique values of every column
print(rstr(df))

Identify unique values in a column

df['col_name'].unique()

Cross tables

pd.crosstab(df.column1,df.column2)

Feature Engineering

In the Titanic dataset, the most important features are Survived, Pclass, Gender, and Age.

Let's explore together.

The four sisters, who are always together, with a sidekick:

import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

Calculate Histogram for time spent on day calls by customers

plt.hist(ch['total day minutes'], bins= 10, facecolor= 'tan')
plt.xlabel('Total Day Minutes')
plt.ylabel('No. of Customers')
plt.show()

How do we categorize the churner and the non-churner for the time spent on day calls (total day minutes)?

import seaborn as sns
g = sns.FacetGrid(ch, col="churn")
g.map(plt.hist, "total day minutes")

Find the number of customers who did opt for voice mail plan

ch['voice mail plan'].value_counts()

Produce a countplot for the above result

sns.set(style="whitegrid", color_codes=True)
sns.countplot(x="voice mail plan", hue= "churn", data=ch)

Create a boxplot for a categorical variable (international plan) and a continuous variable (area code).

sns.boxplot(x = "international plan", y = "area code", data=ch)

Create a crosstab for the area code to find the churner or non-churner.

Let's find out
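One way to build that crosstab, sketched on a tiny made-up stand-in for the churn frame. The rows below are hypothetical, and `churn` is written as "yes"/"no" strings purely for readability. On the real data, the same two calls would be made on `ch` directly.

```python
import pandas as pd

# Hypothetical toy rows standing in for the churn data frame
ch_toy = pd.DataFrame({
    "area code": [408, 415, 415, 510, 408, 415],
    "churn":     ["no", "yes", "no", "no", "yes", "no"],
})

# Count churners and non-churners per area code
counts = pd.crosstab(ch_toy["area code"], ch_toy["churn"])
print(counts)

# Same table as row shares, i.e. the churn rate within each area code
shares = pd.crosstab(ch_toy["area code"], ch_toy["churn"], normalize="index")
print(shares)
```

The `normalize="index"` variant is often easier to read when the area codes have very different numbers of customers.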

How do we pivot information using Python for categorical values? Plot one.

pd.pivot_table(ch, index = ['area code','voice mail plan'], columns=['international plan'], aggfunc=len)

Now summarize total intl minutes for all the combinations above. Note that pivot_table averages by default; pass aggfunc='sum' to get totals instead.

pd.pivot_table(ch, 'total intl minutes', index = ['area code','voice mail plan'], columns=['international plan'])

How do we understand the correlation between the variables, or the columns, within the dataframe? Plot one and analyze.

sns.pairplot(ch)

Find Standard deviation of total night calls.

You know it already
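For completeness, a minimal sketch of the calculation on a handful of made-up values (the numbers are hypothetical). On the real frame the equivalent call would be `ch['total night calls'].std()`.

```python
import pandas as pd

# Hypothetical night-call counts, only for illustration
night_calls = pd.Series([90, 100, 110, 95, 105])

# pandas uses the sample formula (divides by n - 1) by default
sample_std = night_calls.std()

# Pass ddof=0 for the population formula (divides by n)
population_std = night_calls.std(ddof=0)

print(sample_std, population_std)
```

The two results differ noticeably on small samples and converge as the number of rows grows.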

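Plot the distribution for the above result to look specifically at total night calls.

# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(ch['total night calls'], kde=True)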

Plot a histogram to group it by churner or non-churner for the column area code.

ch.hist(by='churn', column = 'area code')

Calculate areawise churner or non-churner using countplot.

ch['area code']= ch['area code'].astype('category')
sns.countplot(x="area code", hue= "churn", data=ch)

Dataset: https://bit.do/titanicdf

Ref: [link] and playground link

“Never get into a lake if it is, on average, 4 ft. deep.”

“If Bill Gates walks into a bar, on average everybody is a millionaire.”
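Both quotes warn that a mean can hide what the individual values look like. A minimal sketch with made-up net worths (every figure here is hypothetical) shows one extreme value dragging the mean far away while the median barely moves:

```python
from statistics import mean, median

# Hypothetical net worths (in dollars) of nine people in a bar
patrons = [40_000, 55_000, 60_000, 72_000, 80_000,
           95_000, 110_000, 130_000, 150_000]

# A hypothetical billionaire walks in
with_billionaire = patrons + [100_000_000_000]

print("before:", mean(patrons), median(patrons))
print("after: ", mean(with_billionaire), median(with_billionaire))
```

After the new arrival the mean is in the billions, yet the median stays under 100,000, which is why the median is usually the safer summary for skewed data.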

Descriptive Statistics

Min

Max

Median

Mode

Quartile (1st Quartile and 3rd Quartile)

Std. Deviation
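The measures listed above can each be read off a pandas Series. A small sketch, assuming a made-up sample (the values are hypothetical):

```python
import pandas as pd

# Hypothetical sample, only for illustration
values = pd.Series([2, 4, 4, 5, 7, 9, 11])

summary = {
    "min": values.min(),
    "max": values.max(),
    "median": values.median(),
    "mode": values.mode().tolist(),   # mode() returns a Series because ties are possible
    "q1": values.quantile(0.25),      # 1st quartile (linear interpolation by default)
    "q3": values.quantile(0.75),      # 3rd quartile
    "std": values.std(),              # sample standard deviation (ddof=1)
}

for name, value in summary.items():
    print(name, value)
```

`values.describe()` reports most of these in one call; the dictionary form is used here only to name each measure explicitly.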

Introduction to the Central Limit Theorem
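As a rough illustration of the theorem, the simulation below draws many samples from a strongly skewed distribution and looks at the means of those samples. The choice of an exponential distribution, the sample size of 50, and the seed are all assumptions made for this sketch.

```python
import random
from statistics import fmean, stdev

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_mean(n):
    # Mean of n draws from an exponential distribution (mean 1, std. deviation 1)
    return fmean([rng.expovariate(1.0) for _ in range(n)])

# 2000 sample means, each from a sample of size 50
means = [sample_mean(50) for _ in range(2000)]

print("mean of sample means:", fmean(means))
print("std of sample means: ", stdev(means))
```

The sample means cluster around 1 with a spread close to 1/sqrt(50), about 0.14, and a histogram of `means` looks roughly bell-shaped even though the underlying draws are heavily skewed.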

Measure of Central Values

Measure the spread around Central Values

Overall distribution shape

Discrete Vs Continuous

Online Tools: GeoGebra/Desmos/NetLogo/Scratch Gaming

Excel Tools:

Bell Curve Vs Pareto Chart

Univariate Distribution/Binomial Distribution

Stats Meme Page

XKCD Essentials

Source: Machine Learning with R
Source: link

Exploration of Formulas

Mean

Measures of Central Tendency: Mean, Median, Mode

Measures of Variability: Std. Deviation, Min, Max, Variance, Kurtosis and Skewness

Distribution

Interquartile Details (Post-Mortem of Box and Whisker Plot)
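A box and whisker plot can be taken apart into a few numbers. The sketch below uses a made-up sample (hypothetical values) and the common 1.5 x IQR rule, which is also the default whisker setting in matplotlib and seaborn boxplots.

```python
import pandas as pd

# Hypothetical sample with one suspiciously large value
values = pd.Series([2, 4, 4, 5, 7, 9, 11, 30])

q1 = values.quantile(0.25)   # bottom edge of the box
q3 = values.quantile(0.75)   # top edge of the box
iqr = q3 - q1                # height of the box

# Fences: points beyond these are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the fences
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, iqr, whisker_low, whisker_high, outliers.tolist())
```

Note that the whiskers stop at actual data points, not at the fences themselves, so they are often shorter than 1.5 x IQR.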

Wolfram notebook on Statistics

Null hypothesis Vs Alternative hypothesis

P < 0.05 Vs P > 0.05
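To make the P < 0.05 rule concrete, here is a minimal sketch of a permutation test, one of several ways to get a p-value and chosen here only because it needs nothing beyond the standard library. The function name `permutation_p_value`, the two groups, and their values are all hypothetical. The null hypothesis is that both groups come from the same distribution, so shuffling the group labels should often produce a gap in means as large as the observed one.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Share of label shuffles whose gap in means is at least the observed gap."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if gap >= observed:
            extreme += 1
    return extreme / n_resamples

# Hypothetical day minutes for two groups of customers
churners = [250, 265, 240, 270, 255, 260]
stayers = [180, 175, 190, 185, 170, 195]

p = permutation_p_value(churners, stayers)
print("p-value:", p)
print("reject H0" if p < 0.05 else "fail to reject H0")
```

A p-value above 0.05 means the data do not give enough evidence to reject the null hypothesis; it does not prove the null hypothesis is true.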

Few commands in R

Table function in R

Data Set (Churn) [link]

Calculate Histogram for time spent on day calls by customers

How do we categorize the churner and the non-churner for the time spent on day calls(total day minutes)?

Find the number of customers who did opt for voice mail plan

Create a boxplot for a categorical variable(international plan) and continuous variable(area code).

Create a crosstab for the area code to find the churner or non-churner.
