Preprocessed data cannot be passed to an algorithm directly. First we need to decide which variables in the data are independent and which are dependent. An example dataset is shown below.
This data consists of 6 samples and 3 columns. It describes scooter prices, where the price depends on the distance travelled and the number of years the scooter has been in use. Hence price is the dependent variable, and distance and years are the independent variables.
import pandas as pd
df = pd.read_csv("abcd.csv")
df.head()
As shown below, X holds the independent variables (distance and years) and y holds the dependent variable (price).
X = df[['distance', 'years']]
y = df['price']
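Since the contents of abcd.csv are not reproduced here, the selection above can be tried on a small stand-in DataFrame. This is a minimal sketch: the six rows of values are made up purely for illustration and are not the original data; only the column names (distance, years, price) come from the text.

```python
import pandas as pd

# Hypothetical values for illustration only; the real data lives in abcd.csv.
df = pd.DataFrame({
    "distance": [1200, 5400, 8000, 15000, 22000, 30000],
    "years": [1, 2, 3, 4, 5, 6],
    "price": [52000, 47000, 41000, 35000, 28000, 21000],
})

# Double brackets select a list of columns and return a 2-D DataFrame.
X = df[["distance", "years"]]

# Single brackets select one column and return a 1-D Series.
y = df["price"]

print(X.shape)  # (6, 2): 6 samples, 2 independent variables
print(y.shape)  # (6,): one price per sample
```

Keeping X two-dimensional matters because scikit-learn estimators expect the independent variables as a table of shape (samples, features), even when there is only one feature.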
To generate the training and testing data, import train_test_split from sklearn.model_selection. As shown below, we pass the independent and dependent variables along with the test size or the train size (if neither is given, scikit-learn defaults to a 25% test set). random_state fixes the seed used to shuffle the data, so the same training and testing sets are produced on every run; if it is not set, the split changes from run to run. The call returns X_train, X_test, y_train and y_test.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
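The effect of test_size and random_state can be checked with a short, self-contained sketch. It again assumes a made-up six-row DataFrame in place of abcd.csv; the variable names X_train_again and X_test_again are introduced here only for the comparison. With 6 samples and test_size=0.3, scikit-learn rounds the test count up to 2, leaving 4 samples for training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical values for illustration only; the real data lives in abcd.csv.
df = pd.DataFrame({
    "distance": [1200, 5400, 8000, 15000, 22000, 30000],
    "years": [1, 2, 3, 4, 5, 6],
    "price": [52000, 47000, 41000, 35000, 28000, 21000],
})
X = df[["distance", "years"]]
y = df["price"]

# First split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

# Second split with the same seed, to compare against the first.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=10
)

print(X_train.shape, X_test.shape)  # (4, 2) (2, 2)

# The same random_state selects the same rows for each set.
print(X_train.index.equals(X_train_again.index))  # True
print(X_test.index.equals(X_test_again.index))    # True
```

Omitting random_state in either call would let the shuffle differ between runs, so the two index comparisons would no longer be guaranteed to match.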