5 Key Steps in Data Science

Pradeep Ankem
2 min read · Aug 3, 2019

Pre-requisites:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression

1. Data collection: The data collection step involves gathering the learning material an algorithm will use to generate actionable knowledge. In most cases, the data will need to be combined into a single source like a text file, spreadsheet, or database.

[e.g. read_csv]

df = pd.read_csv('data.csv')

NOTES: Common sources include raw datasets in CSV format, Google BigQuery, toy datasets available in Seaborn or other predefined datasets, Twitter data, and stock data.
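To make the "combined into a single source" idea concrete, here is a minimal sketch. It assumes two CSV fragments that share the same columns; the fragment contents, the variable names, and the use of `io.StringIO` in place of real files are all made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV fragments standing in for two separate raw files.
# The heights are made-up illustration values, not real measurements.
csv_part_a = "Father,Son\n65.0,67.5\n70.0,70.0\n"
csv_part_b = "Father,Son\n68.0,69.0\n72.0,71.0\n"

# pd.read_csv accepts file paths or file-like objects; StringIO keeps
# this example self-contained.
frames = [pd.read_csv(io.StringIO(text)) for text in (csv_part_a, csv_part_b)]

# Combine both sources into one DataFrame with a fresh 0..n-1 index.
df = pd.concat(frames, ignore_index=True)

print(df.shape)
```

With real files you would pass the file paths to `pd.read_csv` instead of the `StringIO` objects.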

2. Data exploration and preparation: The quality of any machine learning project is based largely on the quality of its input data. Thus, it is important to learn more about the data and its nuances during a practice called data exploration. Additional work is required to prepare the data for the learning process. This involves fixing or cleaning so-called “messy” data, eliminating unnecessary data, and re-coding the data to conform to the learner’s expected inputs.

“Half of Data Science is imports, the other half is cleaning”

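# scikit-learn expects a 2-D feature matrix, so add a column axis to the feature
x_train = df['Father'].values[:, np.newaxis]
# the target can stay a 1-D array
y_train = df['Son'].values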
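A slightly fuller sketch of exploration and cleaning might look like the following. The DataFrame here is made-up illustration data (one missing value and one duplicate row are planted on purpose), and dropping those rows is only one possible cleaning choice, not necessarily the right one for a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical, made-up heights; one missing value and one duplicate row
# are included on purpose so there is something to clean.
df = pd.DataFrame({
    "Father": [65.0, 70.0, 68.0, np.nan, 70.0],
    "Son": [67.5, 70.0, 69.0, 66.0, 70.0],
})

# Exploration: summary statistics and a count of missing values per column.
print(df.describe())
print(df.isna().sum())

# Preparation: drop rows with missing values, then drop exact duplicates.
clean = df.dropna().drop_duplicates().reset_index(drop=True)

# Re-code into the shapes scikit-learn expects: 2-D features, 1-D target.
x_train = clean["Father"].values[:, np.newaxis]
y_train = clean["Son"].values
print(x_train.shape, y_train.shape)
```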

3. Model training: By the time the data has been prepared for analysis, you are likely to have a sense of what you are capable of learning from the data. The specific machine learning task chosen will inform the selection of an appropriate algorithm, and the algorithm will represent the data in the form of a model.
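As a rough sketch of this step, the snippet below fits scikit-learn's `LinearRegression` to a single feature. The data is made up and follows an exact line (son = 0.5 * father + 35) so that the fitted parameters are easy to check; real data would be noisy and the choice of algorithm would depend on the task.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line son = 0.5 * father + 35.
father = np.array([62.0, 65.0, 68.0, 71.0, 74.0])
son = 0.5 * father + 35.0

x_train = father[:, np.newaxis]  # shape (5, 1): one feature column
y_train = son                    # shape (5,)

# Fitting produces the model: here, a slope and an intercept.
model = LinearRegression()
model.fit(x_train, y_train)

slope = model.coef_[0]
intercept = model.intercept_

# Predict for a new, hypothetical father's height.
prediction = model.predict(np.array([[70.0]]))[0]
print(slope, intercept, prediction)
```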

4. Model evaluation: Because each machine learning model results in a biased solution to the learning problem, it is important to evaluate how well the algorithm learns from its experience. Depending on the type of model used, you might be able to evaluate the accuracy of the model using a test dataset or you may need to develop measures of performance specific to the intended application.
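One common way to evaluate with a test dataset is to hold out part of the data before training. The sketch below assumes synthetic, randomly generated data and picks mean squared error and R² as metrics; both the data and the choice of metrics are illustrative assumptions, and a real application may need its own performance measures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a linear trend plus a little Gaussian noise.
rng = np.random.default_rng(0)
father = rng.uniform(60.0, 75.0, size=100)
son = 0.5 * father + 35.0 + rng.normal(0.0, 0.5, size=100)

X = father[:, np.newaxis]

# Hold out 20% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, son, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Score the predictions on the held-out rows only.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"test MSE: {mse:.3f}, test R^2: {r2:.3f}")
```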

5. Model improvement: If better performance is needed, it becomes necessary to utilize more advanced strategies to augment the performance of the model. Sometimes, it may be necessary to switch to a different type of model altogether. You may need to supplement your data with additional data or perform additional preparatory work as in step two of this process.
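One way to decide whether a different type of model is worth switching to is to compare candidates with cross-validation. The sketch below uses synthetic curved data and compares a plain linear model against a pipeline with degree-2 polynomial features; the data, the two candidates, and the 5-fold setup are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a curved (quadratic) relationship plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 60)
y = x ** 2 + rng.normal(0.0, 0.2, size=60)
X = x[:, np.newaxis]

# Use the same shuffled 5-fold split for both candidates.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Candidate 1: a straight line. Candidate 2: a quadratic fit.
linear = LinearRegression()
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

linear_r2 = cross_val_score(linear, X, y, cv=cv, scoring="r2").mean()
quadratic_r2 = cross_val_score(quadratic, X, y, cv=cv, scoring="r2").mean()

print(f"linear mean R^2: {linear_r2:.3f}")
print(f"quadratic mean R^2: {quadratic_r2:.3f}")
```

On data like this the quadratic pipeline should score clearly higher, which is the kind of evidence that would justify moving to a different model.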

[Source: Machine Learning with R book]
