Data Preprocessing Techniques in Machine Learning

Machine learning cannot be imagined without data, so it is very important to provide appropriate data to machine learning algorithms. First, we should check the relevance of the data: it must be relevant to the objective we want to achieve. For example, suppose I want to group students based on their grades. Providing the students' personal data, such as name, phone number, and address, will not help; providing their academic data will. Here the academic data is valid, and the personal data is irrelevant (invalid) for this objective. The following data preprocessing techniques make our data ready for machine learning algorithms.

Managing Missing Values

A dataset containing missing values is incomplete, and most machine learning algorithms cannot work with it directly, so we need to replace each NaN in the dataset with something. What can this something be? Common choices are the column's mean or median; the minimum or maximum value can be used in the same way. The standard deviation is included for completeness, but it measures spread rather than a typical value, so it is rarely a sensible fill. The code below shows each case.

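# Pick ONE of the following; after the first fill no NaN remains, so later lines would have no effect
data['price'] = data['price'].fillna(data['price'].mean())    # fill with the mean
data['price'] = data['price'].fillna(data['price'].median())  # fill with the median
data['price'] = data['price'].fillna(data['price'].std())     # fill with the standard deviation
data['price'] = data['price'].fillna(data['price'].min())     # fill with the minimum value
data['price'] = data['price'].fillna(data['price'].max())     # fill with the maximum value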
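To see how much the choice of fill value can matter, here is a minimal sketch on a small made-up `price` column (the numbers are hypothetical, chosen so that one extreme value distorts the mean). It writes each result to a new column instead of overwriting `price`, so the two strategies can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two missing prices and one extreme value (1000.0).
data = pd.DataFrame({"price": [10.0, 12.0, np.nan, 11.0, 1000.0, np.nan]})

# Count the gaps before choosing a strategy.
n_missing = int(data["price"].isna().sum())

# Both statistics are computed from the observed (non-missing) values only.
mean_fill = data["price"].mean()      # pulled upward by the extreme value
median_fill = data["price"].median()  # much less affected by it

# Keep the original column intact and store each strategy separately.
data["price_mean"] = data["price"].fillna(mean_fill)
data["price_median"] = data["price"].fillna(median_fill)

print(n_missing, mean_fill, median_fill)
print(data)
```

With skewed data like this, the median is usually the safer default, because a single extreme value does not drag it far from the typical price.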


Standardization

Standardization is a scaling technique that rescales each feature so that it has mean μ = 0 and standard deviation σ = 1, the same mean and standard deviation as a standard normal distribution (the shape of the distribution itself is not changed). It is also called z-score normalization. The z-score of a value x is calculated as z = (x - μ) / σ, where μ and σ are the mean and standard deviation of the feature.

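from sklearn.preprocessing import StandardScaler

# `a` is assumed to be a 2D array-like of shape (n_samples, n_features)
data_scaler = StandardScaler().fit(a)      # learn each column's mean and standard deviation
data_rescaled = data_scaler.transform(a)   # apply z = (x - μ) / σ column by column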
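The arithmetic behind the scaler can be sketched by hand using only the Python standard library. The sample values below are made up for illustration. The sketch uses the population standard deviation (`pstdev`), which matches the convention `StandardScaler` follows by default.

```python
from statistics import fmean, pstdev

# Hypothetical values of a single feature.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = fmean(values)      # mean of the feature
sigma = pstdev(values)  # population standard deviation of the feature

# z = (x - mu) / sigma for every value.
z_scores = [(x - mu) / sigma for x in values]

print(mu, sigma)
print(z_scores)

# After standardization the feature has mean 0 and standard deviation 1.
print(fmean(z_scores), pstdev(z_scores))
```

Each z-score says how many standard deviations a value lies above or below the mean, which is why features measured in very different units become comparable after this step.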


Binarization

Binarization, as the name suggests, converts each number into either ‘1’ or ‘0’ by comparing it with a threshold. For example, when a dataset contains probabilities and we want to convert them into crisp values, we can use binarization.

from sklearn.preprocessing import Binarizer

# `a` is assumed to be a 2D numeric array; values above the threshold become 1, all others 0
binary = Binarizer(threshold=0.5)
binary1 = binary.transform(a)
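A small sketch on made-up probabilities shows how values at and around the threshold are treated. The array contents and variable names are illustrative only. Note that the comparison is strict: a value exactly equal to the threshold maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities (2 samples, 3 columns).
probs = np.array([[0.10, 0.50, 0.51],
                  [0.90, 0.49, 0.75]])

# Values strictly greater than 0.5 become 1.0; everything else becomes 0.0.
binarizer = Binarizer(threshold=0.5)
labels = binarizer.fit_transform(probs)

print(labels)
```

If a value exactly at the threshold should count as 1 in your application, lowering the threshold slightly is one simple workaround.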

One-Hot Encoding

To convert text-based (categorical) values into numeric values we can use one-hot encoding, which creates one binary column per category. In each column, ‘1’ means the row belongs to that category and ‘0’ means it does not. The code below first converts the team names in the TEAM column to integer labels with LabelEncoder and then one-hot encodes those labels.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dff = pd.read_csv("team.csv")

# Label-encode the TEAM column: each team name becomes an integer code
le = LabelEncoder()
df1 = dff.copy()  # work on a copy so the original DataFrame is not modified
df1['TEAM'] = le.fit_transform(df1['TEAM'])

# One-hot encode the integer codes: one 0/1 column per team
enc = OneHotEncoder()
enc_df1 = pd.DataFrame(enc.fit_transform(df1[['TEAM']]).toarray())

# Join the encoded columns to the data and drop the original TEAM column
abc = df1.join(enc_df1)
final = abc.drop(['TEAM'], axis='columns')
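For quick experiments, pandas offers a shorter route to the same kind of result. The sketch below uses `pd.get_dummies`, which is a different tool from the scikit-learn encoders used above, on a small made-up table. The team names and the `SCORE` column are hypothetical and do not come from `team.csv`.

```python
import pandas as pd

# Hypothetical data standing in for the contents of a file like team.csv.
df = pd.DataFrame({
    "TEAM": ["Lions", "Tigers", "Lions", "Bears"],
    "SCORE": [3, 1, 2, 5],
})

# Replace TEAM with one 0/1 column per distinct team name.
encoded = pd.get_dummies(df, columns=["TEAM"], dtype=int)

print(encoded)
```

Each row has exactly one ‘1’ among the new `TEAM_...` columns, and the column names carry the category, which makes the result easier to read than the unnamed numeric columns produced by converting an encoder's array output into a DataFrame.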
