Machine learning is impossible without data, so it is very important to give machine learning algorithms appropriate data. First, we should check the relevance of the data: it must be relevant to the objective we want to achieve. For example, suppose I want to group students based on their grades. Providing the students' personal data, such as name, phone number, and address, will not help, while providing their academic data will. So here the academic data is valid and the personal data is irrelevant (invalid). The following data preprocessing techniques make our data ready for machine learning algorithms.
Managing Missing Values
A dataset with missing values is incomplete, and many machine learning algorithms cannot use it as it is, so we need to replace each NaN in the dataset with something. That something could be the mean, median, standard deviation, minimum, or maximum of the column; the mean and the median are the most common choices. The code below shows each case. The cases are alternatives: once one of them has filled the column, no NaN is left for the others to replace.
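    # Fill missing prices with the column mean
    data['price'] = data['price'].fillna(data['price'].mean())
    data

    # Fill missing prices with the column median
    data['price'] = data['price'].fillna(data['price'].median())
    data

    # Fill missing prices with the column standard deviation
    data['price'] = data['price'].fillna(data['price'].std())
    data

    # Fill missing prices with the column minimum
    data['price'] = data['price'].fillna(data['price'].min())
    data

    # Fill missing prices with the column maximum
    data['price'] = data['price'].fillna(data['price'].max())
    data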
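To see how the choice of fill value matters, here is a minimal sketch on a made-up DataFrame. The values are hypothetical; only the 'price' column name comes from the snippets above. Because one price is much larger than the others, the mean and the median give different fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two missing prices and one large value (90).
data = pd.DataFrame({"price": [10.0, np.nan, 20.0, np.nan, 90.0]})

# Count the missing entries before filling anything.
n_missing = int(data["price"].isna().sum())

# mean() and median() skip NaN by default, so they use only 10, 20 and 90.
filled_mean = data["price"].fillna(data["price"].mean())
filled_median = data["price"].fillna(data["price"].median())

print(n_missing)               # 2
print(filled_mean.tolist())    # [10.0, 40.0, 20.0, 40.0, 90.0]
print(filled_median.tolist())  # [10.0, 20.0, 20.0, 20.0, 90.0]
```

The mean is pulled up by the large value, while the median is not, which is one reason the median is often preferred for skewed columns.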
Standardization
Standardization is a scaling technique that rescales each feature so that it has mean μ = 0 and standard deviation σ = 1, as a standard normal distribution does (the shape of the distribution itself is not changed). It is also called z-score normalization. The z-score z of a value x is calculated as z = (x - μ) / σ, where μ and σ are the mean and standard deviation of the feature.
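    from sklearn.preprocessing import StandardScaler

    # a is assumed to be a 2D numeric array or DataFrame (rows = samples, columns = features)
    data_scaler = StandardScaler().fit(a)      # learn the mean and standard deviation of each feature
    data_rescaled = data_scaler.transform(a)   # apply z = (x - mean) / std to each feature
    data_rescaled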
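As a sanity check, the z-score can be computed by hand and compared with StandardScaler. This is a minimal sketch; the array a below is hypothetical example data, not the data used in this article.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 samples, 2 features on very different scales.
a = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Manual z-score per column. np.std uses the population standard
# deviation (ddof=0) by default, which is what StandardScaler uses too.
z_manual = (a - a.mean(axis=0)) / a.std(axis=0)

# Same computation through scikit-learn.
z_sklearn = StandardScaler().fit_transform(a)

print(np.allclose(z_manual, z_sklearn))  # True
print(z_sklearn.mean(axis=0))            # each column mean is (numerically) 0
print(z_sklearn.std(axis=0))             # each column standard deviation is 1
```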
Binarization
As the name suggests, binarization converts each number into either ‘1’ or ‘0’ by comparing it with a threshold. When a dataset contains probabilities and we want to convert them into crisp values, we can use binarization.
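    from sklearn.preprocessing import Binarizer

    # a is assumed to be a 2D numeric array; values above 0.5 become 1, all others become 0
    binary = Binarizer(threshold=0.5)
    binary1 = binary.transform(a)   # Binarizer is stateless, so no fit is needed
    binary1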
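The threshold comparison is strict: a value exactly equal to the threshold maps to 0. The sketch below uses a hypothetical array of probabilities to show this and compares the result with a plain NumPy comparison.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical probabilities, including one exactly at the threshold.
a = np.array([[0.2, 0.5, 0.8],
              [0.51, 0.49, 1.0]])

# Values strictly greater than the threshold become 1, the rest become 0.
binary1 = Binarizer(threshold=0.5).fit_transform(a)

# The same rule written directly with NumPy.
manual = (a > 0.5).astype(float)

print(binary1)
print(np.array_equal(binary1, manual))  # True
```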
One hot encoding
To convert text-based (categorical) values into numeric values we can use one hot encoding, which creates one binary column per category, with entries ‘1’ or ‘0’. ‘0’ means the value does not belong to that category and ‘1’ means it does.
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    dff = pd.read_csv("team.csv")
    dff

    # Label-encode the TEAM column into integer codes
    le = LabelEncoder()
    df1 = dff.copy()   # work on a copy so dff keeps the original text values
    df1['TEAM'] = le.fit_transform(df1['TEAM'])
    df1

    # One-hot encode the integer codes: one 0/1 column per team
    enc = OneHotEncoder()
    enc_df1 = pd.DataFrame(enc.fit_transform(df1[['TEAM']]).toarray())
    enc_df1

    # Join the encoded columns to the data, then drop the original TEAM column
    abc = df1.join(enc_df1)
    abc
    final = abc.drop(['TEAM'], axis='columns')
    final
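For small tables, pandas can produce the same kind of result in one call with pd.get_dummies. This is an alternative to the LabelEncoder plus OneHotEncoder steps above, not the method used there. The data below is hypothetical: team.csv is not shown in this article, so the team names and the SCORE column are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for team.csv: a text TEAM column plus one numeric column.
dff = pd.DataFrame({
    "TEAM": ["red", "blue", "red", "green"],
    "SCORE": [3, 1, 2, 5],
})

# Replace TEAM with one 0/1 column per distinct team name.
final = pd.get_dummies(dff, columns=["TEAM"], dtype=int)

print(list(final.columns))  # ['SCORE', 'TEAM_blue', 'TEAM_green', 'TEAM_red']
print(final)
```

Each row has exactly one ‘1’ among the TEAM_ columns, and the new column names keep the original team names, which makes the result easier to read than numbered columns.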