
Data Representation

Introduction

● The main objective of machine learning is to build models that understand data and find underlying patterns.

● Data must be fed in a form that the computer can interpret.

● To feed data into a model, it must be represented as a table or as a matrix of the required dimensions.

● Converting the data into the correct tabular form is one of the first steps in data preprocessing.
Data Represented in a Table
Data should be arranged in a two-dimensional space made of rows and columns.

This arrangement makes it easy to understand the data and pinpoint any problems.


[Figure: CSV data in table format]

To load a CSV file and work on it as a table, we use the pandas library.

The data is loaded into tables called DataFrames.
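
As a minimal sketch of loading CSV data into a DataFrame: the inline CSV text and its column names below are made up for illustration, and `io.StringIO` stands in for a file on disk. With a real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real workflow would pass a file path instead.
csv_text = """size_sqft,bedrooms,price
1400,3,240000
1600,3,275000
2100,4,350000
"""

# read_csv parses the text into a DataFrame; header=0 uses the first line as column names.
df = pd.read_csv(io.StringIO(csv_text), header=0)

print(type(df).__name__)  # DataFrame
print(df)
```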


Independent and Target Variables
The DataFrame that we use contains variables, or features, that fall into two categories.

Independent Variable (Predictor Variable)

o Used to predict the target variable.

o Independent variables are independent of each other.

Dependent Variable (Target Variable)

o The variable that the model predicts from the independent variables.


Independent Variables
The independent variables are the features in the DataFrame. Together they form a table of size (m, n), where m is the number of observations and n is the number of features.
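
The (m, n) size can be read directly from a DataFrame's `shape` attribute. The sketch below uses a small hypothetical feature table whose column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical feature table: 4 observations (rows), 3 features (columns).
features = pd.DataFrame({
    "size_sqft": [1400, 1600, 2100, 900],
    "bedrooms": [3, 3, 4, 2],
    "age_years": [10, 5, 2, 30],
})

# shape is a (rows, columns) tuple, i.e. (m, n).
m, n = features.shape
print(m, n)  # 4 3
```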


Ideally, independent variables are approximately normally distributed and do not contain:

• Missing or Null Values

• Highly categorical data features

• Outliers

• Data on different scales

• Human error

• Multicollinearity (independent variables that are correlated)

• Very large independent feature sets

• Sparse data

• Special characters
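
A few of the issues listed above can be checked with standard pandas calls. This is only a rough sketch on hypothetical data (the column names and values are invented), and it covers missing values, correlated features, and differing scales, not the whole list.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and two strongly related columns.
df = pd.DataFrame({
    "size_sqft": [1400.0, 1600.0, np.nan, 900.0, 2100.0],
    "size_sqm": [130.1, 148.6, 167.2, 83.6, 195.1],
    "bedrooms": [3, 3, 4, 2, 4],
})

# Count of missing or null values in each column.
missing_per_column = df.isnull().sum()

# Pairwise Pearson correlation; values near 1 or -1 hint at multicollinearity.
correlation = df.corr()

# Summary statistics help reveal differing scales and possible outliers.
summary = df.describe()

print(missing_per_column)
print(correlation.round(2))
print(summary)
```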
Feature Matrix and Target Vector
A single piece of data is called a scalar.

A group of scalars is called a vector, and a group of vectors is called a matrix.

A matrix is represented in rows and columns.

The feature matrix is made up of the independent columns, and the target vector depends on the feature matrix columns.

Independent variables: Car Model, Car Capacity, Car Brand
Dependent variable: Car Price
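
The car example can be sketched in pandas as follows. The rows are invented for illustration; only the column names come from the example above.

```python
import pandas as pd

# Hypothetical car data, invented for illustration.
cars = pd.DataFrame({
    "Car Model": ["A1", "B2", "C3"],
    "Car Capacity": [4, 5, 7],
    "Car Brand": ["Alpha", "Beta", "Gamma"],
    "Car Price": [18000, 22000, 31000],
})

# Feature matrix: the independent columns (2-D, rows by columns).
X = cars[["Car Model", "Car Capacity", "Car Brand"]]

# Target vector: the dependent column (1-D).
y = cars["Car Price"]

# A scalar is a single value, here one cell of the feature matrix.
capacity_of_first_car = X.loc[0, "Car Capacity"]

print(X.shape)  # (3, 3)
print(y.shape)  # (3,)
print(capacity_of_first_car)  # 4
```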
Loading a Sample Dataset and Creating Feature Matrix and Target Matrix

1. Import the pandas library
    import pandas as pd

2. Load the dataset into a pandas DataFrame
    dataset = "filename"
    df = pd.read_csv(dataset, header=0)

3. Print all the columns
    df.columns

4. Total number of rows
    df.index    # shows the row index; len(df.index) gives the row count

5. Set the Address column as the index
    Syntax: DataFrame.set_index('column name', inplace=True)
    df.set_index('Address', inplace=True)

6. Reset the index
    df.reset_index(inplace=True)

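7. Retrieve the first five rows and first three columns by position
    df.iloc[0:5, 0:3]    # the end position is excluded

8. Retrieve data using labels
    df.loc[0:4, ["Avg. Area Income", "Avg. Area House Age"]]    # the end label is included

9. Reset the index
    df.reset_index(inplace=True)
    Note: on a default integer index this call adds an "index" column; pass drop=True to avoid that.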

10. Drop a column to create the feature matrix
    X = df.drop('Price', axis=1)

11. Shape of the feature matrix
    X.shape

12. Store the target variable
    y = df['Price']
    y.head(10)

13. Shape of the new variable
    y.shape

[Figure: Created feature and target matrices of a dataset]
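
The steps above can be run end to end on a small stand-in dataset. In this sketch the rows are made up and `io.StringIO` replaces the real file; only the column names (Address, Avg. Area Income, Avg. Area House Age, Price) are taken from the steps. The second reset of the index is left out here, since it would add an extra "index" column.

```python
import io
import pandas as pd

# Made-up rows using the column names that appear in the steps.
csv_text = """Avg. Area Income,Avg. Area House Age,Price,Address
79545.0,5.7,1059033.0,Address A
79248.0,6.0,1505890.0,Address B
61287.0,5.9,1058987.0,Address C
63345.0,7.2,1260616.0,Address D
59982.0,5.0,630943.0,Address E
80175.0,5.0,1068138.0,Address F
"""
df = pd.read_csv(io.StringIO(csv_text), header=0)

# Number of rows, taken from the index.
n_rows = len(df.index)

# Position-based slicing excludes the end position: rows 0 to 4, columns 0 to 2.
by_position = df.iloc[0:5, 0:3]

# Label-based slicing includes the end label: rows labelled 0 to 4.
by_label = df.loc[0:4, ["Avg. Area Income", "Avg. Area House Age"]]

# Use Address as the index, then restore the default integer index.
df.set_index("Address", inplace=True)
df.reset_index(inplace=True)

# Feature matrix: every column except the target.
X = df.drop("Price", axis=1)

# Target vector: the Price column alone.
y = df["Price"]

print(n_rows, by_position.shape, by_label.shape)
print(X.shape, y.shape)
```

Note that `df.drop` returns a new DataFrame and leaves `df` unchanged, which is why `df['Price']` is still available when the target vector is created.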
