
Sr. No: 39
Name: Vanraj Pardeshi

Experiment No. 03

Aim: Implement Data Exploration and Data Preprocessing in Python.

Theory:
Data Exploration:
Data exploration is the initial step in data analysis, where users explore a large data set in an unstructured
way to uncover initial patterns, characteristics, and points of interest. This process isn’t meant to reveal
every bit of information a dataset holds, but rather to help create a broad picture of important trends and
major points to study in greater detail.
Data exploration can use a combination of manual methods and automated tools such as data visualizations,
charts, and initial reports.

Data Preprocessing:
Data preprocessing is the process of transforming raw data into an understandable format. It is also an
important step in data mining as we cannot work with raw data. The quality of the data should be checked
before applying machine learning or data mining algorithms.
Preprocessing of data is mainly about checking data quality. Quality can be assessed against the
following criteria:

• Accuracy: whether the data entered is correct.
• Completeness: whether all required data is recorded and available.
• Consistency: whether the same data stored in different places matches.
• Timeliness: whether the data is kept correctly up to date.
• Believability: whether the data is trustworthy.
• Interpretability: how easily the data can be understood.

Tasks involved in Data Preprocessing

• Data cleaning: Data cleaning is the process of removing incorrect, incomplete, and inaccurate
data from the dataset; it also replaces missing values. Two common data-cleaning tasks are:
1. Handling Missing Values: Missing values can be handled by filling them in manually (if the
dataset is relatively small), by filling the NA values with an appropriate measure of central
tendency (mean, median, or mode), or by marking them explicitly as ‘Not Available’ or ‘NA’.
2. Handling Noisy Data: Noisy data contains random error or unnecessary data points. The
following methods can be used to deal with noisy data:
1. Binning
2. Clustering
3. Regression
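For instance, smoothing by bin means sorts the values, partitions them into equal-depth bins, and replaces every value with the mean of its bin. A minimal sketch (the values are illustrative, not taken from the experiment's dataset):

```python
import numpy as np

# Noisy attribute values (illustrative data)
values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Smoothing by bin means: sort, split into three equal-depth bins,
# then replace each value by the mean of its bin
sorted_vals = np.sort(values)
bins = np.split(sorted_vals, 3)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```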
• Data integration: Data integration is the process of combining data from multiple sources into a
single, coherent dataset. It is one of the main components of data management.

• Data reduction: Data reduction decreases the volume of the data, which makes analysis easier
while producing the same (or almost the same) results. It also reduces storage space. Some
data-reduction techniques are:
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
• Data transformation: Changing the format or structure of the data is called data
transformation. This step can be simple or complex depending on the requirements. Some
data-transformation methods are:
1. Smoothing
2. Aggregation
3. Discretization
4. Normalization
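As an example of normalization, min-max scaling rescales an attribute into the range [0, 1] via v' = (v − min) / (max − min). A short sketch (the KM values below are illustrative):

```python
import numpy as np

# Min-max normalization: v' = (v - min) / (max - min)
km = np.array([38500.0, 41711.0, 46986.0, 72937.0])  # illustrative values
km_norm = (km - km.min()) / (km.max() - km.min())
print(km_norm)  # smallest value maps to 0.0, largest to 1.0
```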
Code and Output:

fig 1: Importing the necessary Libraries

fig 2: Reading the CSV file / Dataset using pandas into a pandas dataframe
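A sketch of these first two steps is given below. The filename "ToyotaCorolla.csv" and the sample values are assumptions inferred from the attributes mentioned later (Age, KM, Fuel, HP, MetColor, Price); a small in-memory CSV stands in for the real file so the snippet is self-contained:

```python
import io

import pandas as pd

# In the actual experiment the dataset is read from disk, e.g.:
# df = pd.read_csv("ToyotaCorolla.csv")   # filename assumed
# A small in-memory sample (with deliberate blanks) stands in for it here:
csv_data = io.StringIO(
    "Price,Age,KM,Fuel,HP,MetColor\n"
    "13500,23,46986,Diesel,90,1\n"
    "13750,23,72937,Diesel,90,1\n"
    "13950,,41711,Diesel,90,\n"
    "14950,26,48000,Petrol,,0\n"
)
df = pd.read_csv(csv_data)  # blanks become NaN in the dataframe
print(df.head())
```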

fig 3: Pandas Dataframe of the processed dataset

fig 4: Shape and Size of the Dataset


Shape: Number of Rows, Number of Columns
Size: Total number of entries or values present in the dataset
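Assuming a pandas dataframe `df` loaded as above, shape and size are read off directly (a tiny stand-in dataframe is used here to keep the sketch runnable):

```python
import numpy as np
import pandas as pd

# Tiny stand-in dataframe; the real one comes from pd.read_csv(...)
df = pd.DataFrame({
    "Price": [13500, 13750, 13950],
    "Age":   [23.0, 23.0, np.nan],
})

print(df.shape)  # (number of rows, number of columns) -> (3, 2)
print(df.size)   # rows * columns -> 6
```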
fig 5: Extracting the information about the dataset using pandas info() method

fig 6: Using the pandas describe() method to get summary statistics, including measures of central tendency
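These two inspection steps might look as follows (stand-in data; `info()` reports dtypes and non-null counts, while `describe()` summarises each numeric column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Price": [13500, 13750, 13950, 14950],
    "Age":   [23.0, 23.0, np.nan, 26.0],
})

df.info()                # column dtypes and non-null counts (fig 5)
summary = df.describe()  # count, mean, std, min, quartiles, max (fig 6)
print(summary)
```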

fig 7: Creating a Copy of your Dataset

fig 8: Finding the total number of null values present in the dataset
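Copying the dataframe and counting nulls per column can be sketched as (stand-in data with deliberate gaps):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age":  [23.0, np.nan, 26.0],
    "Fuel": ["Diesel", None, "Petrol"],
})

# Work on a copy so the raw data stays untouched (fig 7)
data = df.copy()

# Count null values per column (fig 8)
null_counts = data.isnull().sum()
print(null_counts)
```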

fig 9: Filling in the null values present in the Age column with the mean.

fig 10: Filling in the null values present in the Fuel column with the mode

fig 11: Filling in the null values present in MetColor Column with the mode

fig 12: Filling in the null values present in HP and KM columns with their respective mean

fig 13: All Null values handled
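The null-handling steps in figs 9–13 might be implemented along these lines: numeric attributes (Age, HP, KM) are filled with their column means, and categorical or flag attributes (Fuel, MetColor) with their column modes. The sample values are assumptions, not the experiment's data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Age":      [23.0, np.nan, 26.0, 27.0],
    "Fuel":     ["Diesel", "Diesel", None, "Petrol"],
    "MetColor": [1.0, np.nan, 1.0, 0.0],
    "HP":       [90.0, 90.0, np.nan, 110.0],
    "KM":       [46986.0, np.nan, 41711.0, 48000.0],
})

# Numeric columns: fill nulls with the column mean (figs 9 and 12)
for col in ["Age", "HP", "KM"]:
    data[col] = data[col].fillna(data[col].mean())

# Categorical/flag columns: fill nulls with the column mode (figs 10 and 11)
for col in ["Fuel", "MetColor"]:
    data[col] = data[col].fillna(data[col].mode()[0])

# All null values handled (fig 13)
print(data.isnull().sum().sum())  # 0
```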

fig 14: Scatter Plot between Age and Price

fig 15: Custom Scatter Plot

fig 16: Plotting a Histogram for the KM attribute

fig 17: Plotting a Boxplot for the attribute ‘Price’
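The exploration plots in figs 14–17 can be sketched with matplotlib as below (stand-in data; the `Agg` backend lets the script run without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    "Age":   [23, 24, 26, 30, 32],
    "Price": [13500, 13750, 13950, 12950, 11950],
    "KM":    [46986, 72937, 41711, 48000, 38500],
})

# Scatter plot between Age and Price (figs 14 and 15)
fig, ax = plt.subplots()
ax.scatter(data["Age"], data["Price"], c="tab:blue", marker="o")
ax.set_xlabel("Age")
ax.set_ylabel("Price")

# Histogram for the KM attribute (fig 16)
fig2, ax2 = plt.subplots()
ax2.hist(data["KM"], bins=5)

# Boxplot for the Price attribute (fig 17)
fig3, ax3 = plt.subplots()
ax3.boxplot(data["Price"])
```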

Conclusion: From the above experiment, I understood how data exploration and preprocessing are
performed on a dataset containing records with null values and various types of attributes. The
null values were filled in using appropriate data preprocessing techniques, and various plots
were charted to explore the initial patterns in the dataset.

