Stats + Probability 101 — GL
Topics to be covered
Pandas Stats Module
To read a csv
import pandas as pd
df = pd.read_csv('file_name.csv or URL.csv')
To get a statistical summary
df.describe()
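As a minimal sketch of what the summary contains, the toy frame below (hypothetical column names and values, not the workshop dataset) shows that describe() is called on the DataFrame and returns one row per statistic.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({"age": [22, 38, 26, 35], "fare": [7.25, 71.28, 7.92, 53.10]})

# One row per statistic: count, mean, std, min, quartiles, max (numeric columns only).
summary = df.describe()
print(summary)
```

Individual values can be read back with summary.loc["mean", "age"] and similar lookups.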
To make head or tail of the data frame
df.head()
df.tail()
Shape of You
df.shape
Structure of the data
df.info()
Get some column names
df.columns
Let’s get the unique values
def rstr(df): return df.apply(lambda x: [x.unique()])
print(rstr(df))
Identify unique values in a column
df['col_name'].unique()
Cross tables
pd.crosstab(df.column1,df.column2)
Feature Engineering
In the Titanic dataset, the most important columns are Survived (the target), Pclass, Gender, and Age.
Let's explore them together.
The four sisters, who are always together with a side kick
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
Calculate Histogram for time spent on day calls by customers
plt.hist(ch['total day minutes'], bins= 10, facecolor= 'tan')
plt.xlabel('Total Day Minutes')
plt.ylabel('No. of Customers')
plt.show()
How do churners and non-churners differ in time spent on day calls (total day minutes)?
import seaborn as sns
g = sns.FacetGrid(ch, col="churn")
g.map(plt.hist, "total day minutes")
Find the number of customers who did opt for voice mail plan
ch['voice mail plan'].value_counts()
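The counts can also be expressed as proportions. The sketch below uses a small invented stand-in for ch, so the numbers are illustrative and not taken from the churn data.

```python
import pandas as pd

# Hypothetical stand-in for `ch`; values are invented.
ch = pd.DataFrame({"voice mail plan": ["no", "yes", "no", "no", "yes"]})

counts = ch["voice mail plan"].value_counts()                # absolute counts per category
shares = ch["voice mail plan"].value_counts(normalize=True)  # fractions that sum to 1
print(counts)
print(shares)
```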
Produce a countplot for the above result
sns.set(style="whitegrid", color_codes=True)
sns.countplot(x="voice mail plan", hue= "churn", data=ch)
Create a boxplot for a categorical variable (international plan) and a numeric variable (area code).
sns.boxplot(x = "international plan", y = "area code", data=ch)
Create a crosstab for the area code to find the churner or non-churner.
Let's find out
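One possible answer, sketched on an invented stand-in for ch (the real frame is assumed to have the area code and churn columns used elsewhere in these notes):

```python
import pandas as pd

# Hypothetical stand-in for the churn DataFrame `ch`; values are made up.
ch = pd.DataFrame({
    "area code": [408, 415, 415, 510, 408, 415],
    "churn": [False, True, False, False, True, False],
})

# Rows are area codes, columns are churn status, cells are customer counts.
area_churn = pd.crosstab(ch["area code"], ch["churn"])
print(area_churn)
```

Passing normalize="index" to pd.crosstab turns each row into churn proportions for that area code.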
How do we pivot information in Python for categorical values? Build one pivot table.
pd.pivot_table(ch, index = ['area code','voice mail plan'], columns=['international plan'], aggfunc=len)
Now calculate the total international minutes for all the combinations above.
pd.pivot_table(ch, values='total intl minutes', index=['area code','voice mail plan'], columns=['international plan'], aggfunc='sum')
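Note that pivot_table aggregates with the mean unless told otherwise, so a total needs aggfunc="sum". A small sketch on invented data (a hypothetical stand-in for ch) makes this concrete:

```python
import pandas as pd

# Hypothetical stand-in for `ch`; all values are invented for illustration.
ch = pd.DataFrame({
    "area code": [408, 408, 415, 415],
    "voice mail plan": ["no", "no", "yes", "yes"],
    "international plan": ["no", "yes", "no", "no"],
    "total intl minutes": [10.0, 12.5, 8.0, 9.5],
})

totals = pd.pivot_table(
    ch,
    values="total intl minutes",
    index=["area code", "voice mail plan"],
    columns=["international plan"],
    aggfunc="sum",  # the default is "mean"; "sum" gives totals per combination
)
print(totals)
```

Combinations that never occur in the data show up as NaN; fill_value=0 can be passed if zeros are preferred.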
How do we understand the correlation between the variables (columns) in the dataframe? Plot one and analyze.
sns.pairplot(ch)
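A pair plot shows the relationships visually; the numbers behind it can be read from a correlation matrix. This is a minimal sketch on an invented stand-in for ch, with column names assumed for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `ch`; values are invented.
ch = pd.DataFrame({
    "total day minutes": [100.0, 200.0, 300.0, 400.0],
    "total day charge": [17.0, 34.0, 51.0, 68.0],  # exactly proportional to minutes
    "state": ["KS", "OH", "NJ", "OK"],              # non-numeric, excluded below
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = ch.select_dtypes(include="number").corr()
print(corr)
```

Passing corr to sns.heatmap(corr, annot=True) gives a more compact picture than the full pair plot when there are many columns.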
Find Standard deviation of total night calls.
You know it already
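In case it is not obvious, a minimal sketch follows. The small frame is an invented stand-in for ch; on the real data only the .std() call is needed.

```python
import pandas as pd

# Hypothetical stand-in for the `total night calls` column of `ch`.
ch = pd.DataFrame({"total night calls": [90, 100, 110, 100, 100]})

sample_std = ch["total night calls"].std()            # divides by n - 1 (pandas default)
population_std = ch["total night calls"].std(ddof=0)  # divides by n
print(sample_std, population_std)
```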
Plot the distribution of total night calls (sns.distplot is deprecated in recent seaborn; sns.histplot with kde=True is its replacement).
sns.histplot(ch['total night calls'], kde=True)
Plot a histogram to group it by churner or non-churner for the column area code.
ch.hist(by='churn', column = 'area code')
Count churners and non-churners per area code using a countplot.
ch['area code']= ch['area code'].astype('category')
sns.countplot(x="area code", hue= "churn", data=ch)
Dataset: https://bit.do/titanicdf
Ref: [link] and play ground link
“Never wade into a lake just because it is, on average, 4 ft. deep.”
“If Bill Gates walks into a bar, then on average everybody in it is a millionaire.”
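Both quotes make the same point: the mean is pulled around by extreme values, while the median barely moves. A small sketch with invented net worths (none of these figures are real) shows the effect:

```python
from statistics import mean, median

# Hypothetical net worths in dollars for nine bar patrons; figures are invented.
patrons = [40_000, 55_000, 60_000, 75_000, 80_000, 90_000, 110_000, 120_000, 150_000]

# One extreme value joins the group (an illustrative figure, not a real net worth).
with_billionaire = patrons + [100_000_000_000]

print(mean(patrons), median(patrons))                    # both are under 100,000
print(mean(with_billionaire), median(with_billionaire))  # mean explodes, median shifts slightly
```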
Descriptive Statistics
Min
Max
Median
Mode
Quartile (1st Quartile and 3rd Quartile)
Std. Deviation
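The measures listed above can all be computed with Python's standard library. This sketch uses an invented sample, and the choice of the "inclusive" quartile method is an assumption made for illustration (other conventions give slightly different quartiles).

```python
import statistics as st

# Invented sample of daily call counts, for illustration only.
calls = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# Three cut points that split the sorted data into four equal parts.
q1, q2, q3 = st.quantiles(calls, n=4, method="inclusive")

summary = {
    "min": min(calls),
    "max": max(calls),
    "median": st.median(calls),
    "mode": st.mode(calls),
    "q1": q1,
    "q3": q3,
    "std": st.stdev(calls),  # sample standard deviation (divides by n - 1)
}
print(summary)
```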
Introduction Central Limit Theorem
Measure of Central Values
Measure the spread around Central Values
Overall distribution shape
Discrete Vs Continuous
Online Tools : GeoGebra/Desmos/Netlogo/Scratch Gaming
Excel Tools:
Bell Curve Vs Pareto Chart
Uni-variate Distribution/Binomial Distribution
Stats Meme Page
XKCD Essentials
Exploration of Formulas
Mean
Measures of Central Tendency: Mean, Median, Mode
Measures of Variability: Std. Deviation, Min, Max, Variance, Kurtosis and Skewness
Distribution
Interquartile Details (Post-Mortem of Box and Whisker Plot)
Wolfram notebook on Statistics
Null hypothesis Vs Alternate hypothesis
P < 0.05 Vs P > 0.05
Few commands in R
Table function in R
Data Set (Churn) [link]