No: 39
Name: Vanraj Pardeshi
Experiment No. 03
Theory:
Data Exploration:
Data exploration is the initial step in data analysis, where users explore a large data set in an unstructured
way to uncover initial patterns, characteristics, and points of interest. This process isn’t meant to reveal
every bit of information a dataset holds, but rather to help create a broad picture of important trends and
major points to study in greater detail.
Data exploration can use a combination of manual methods and automated tools such as data visualizations,
charts, and initial reports.
Data Preprocessing:
Data preprocessing is the process of transforming raw data into an understandable format. It is an
important step in data mining because raw data usually cannot be used directly. The quality of the data
should be checked before applying machine learning or data mining algorithms.
Preprocessing mainly serves to ensure data quality. Its main steps are the following:
• Data cleaning: Data cleaning is the process of removing incorrect, incomplete, and inaccurate data
from a dataset and of replacing missing values. Two common concerns in data cleaning are:
1. Handling Missing Values:
Missing values can be handled using standard procedures: by filling them in manually if the dataset is
relatively small, by filling them with an appropriate measure of central tendency (mean, median, or
mode), or by marking them as ‘Not Available’ or ‘NA’.
2. Noisy Data:
Noise generally refers to random error or unnecessary data points. The following methods can be used
to handle noisy data:
1. Binning
2. Clustering
3. Regression
• Data integration: Data integration is the process of combining data from multiple sources into a single
dataset. It is one of the main components of data management.
• Data reduction: Data reduction lowers the volume of the data, which makes the analysis easier while
producing the same or almost the same result. It also reduces the storage space required. Some of the
techniques in data reduction are:
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
• Data transformation: Data transformation is a change made to the format or the structure of the data.
This step can be simple or complex depending on the requirements. Some of the methods in data
transformation are:
1. Smoothing
2. Aggregation
3. Discretization
4. Normalization
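Two of the techniques listed above, binning (for noisy data) and normalization, can be sketched in pandas. This is only a minimal illustration: the sample values, the choice of three equal-frequency bins, and the use of min-max scaling are assumptions and are not taken from the experiment's dataset.

```python
import pandas as pd

# Hypothetical sample values (made up for illustration).
values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34], name="value")

# Binning: split the values into 3 equal-frequency bins, then smooth
# by replacing every value with the mean of its bin.
bins = pd.qcut(values, q=3, labels=False)
smoothed = values.groupby(bins).transform("mean")

# Normalization (min-max): rescale the values to the range [0, 1].
normalized = (values - values.min()) / (values.max() - values.min())

print(pd.DataFrame({"value": values, "bin": bins,
                    "smoothed": smoothed, "normalized": normalized}))
```

Smoothing by bin means reduces the effect of small random fluctuations, while min-max normalization puts attributes with different ranges on a common scale.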
Code and Output:
fig 2: Reading the CSV file (dataset) into a pandas DataFrame
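A minimal sketch of this step is given below. The inline CSV text is a hypothetical stand-in for the dataset file (the file name is not stated in this report and the values are made up); with a real file, its path would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; the values are made up.
csv_text = """Price,Age,KM,Fuel,HP,MetColor
13500,23,46986,Diesel,90,1
13750,,72937,Diesel,90,1
13950,24,,,90,
"""

# Empty fields are read as missing values (NaN).
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the DataFrame
print(df.shape)    # (number of rows, number of columns)
```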
fig 6: Using the pandas describe() method to get summary statistics, including measures of central tendency
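A minimal sketch of this step, assuming a small made-up DataFrame in place of the original dataset (the values are hypothetical):

```python
import pandas as pd

# Hypothetical values standing in for two numeric columns of the dataset.
df = pd.DataFrame({
    "Age": [23, 24, 26, 30],
    "KM": [46986, 72937, 41711, 48000],
})

# describe() reports count, mean, std, min, quartiles and max
# for every numeric column.
summary = df.describe()
print(summary)
```

The `50%` row of the result is the median, so both the mean and the median can be read from the same table.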
fig 8: Finding the total number of null values present in the dataset
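A minimal sketch of how the null values might be counted, assuming a small made-up DataFrame in place of the original dataset (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
df = pd.DataFrame({
    "Age": [23, np.nan, 24, np.nan],
    "Fuel": ["Diesel", None, "Petrol", "Petrol"],
    "HP": [90, 90, np.nan, 110],
})

# isnull() marks each missing cell as True; sum() counts them per column.
nulls_per_column = df.isnull().sum()

# Summing the per-column counts gives the total for the whole dataset.
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column)
print("Total null values:", total_nulls)
```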
fig 9: Filling in the null values present in the Age column with the mean.
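A minimal sketch of mean imputation for the Age column, assuming made-up values in place of the original dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical Age values with two missing entries.
df = pd.DataFrame({"Age": [20.0, np.nan, 30.0, np.nan, 40.0]})

# mean() skips missing values by default.
age_mean = df["Age"].mean()

# Replace every missing Age with the column mean.
df["Age"] = df["Age"].fillna(age_mean)

print(df)
```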
fig 10: Filling in the null values present in the Fuel column with the mode
fig 11: Filling in the null values present in MetColor Column with the mode
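The mode is the natural choice for categorical columns such as Fuel and MetColor, where a mean is not meaningful. A minimal sketch, assuming made-up values in place of the original dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data with missing entries.
df = pd.DataFrame({
    "Fuel": ["Petrol", None, "Diesel", "Petrol", None],
    "MetColor": [1.0, 0.0, np.nan, 1.0, 1.0],
})

for col in ["Fuel", "MetColor"]:
    # mode() ignores missing values and can return several values
    # when there is a tie, so the first one is taken.
    most_frequent = df[col].mode()[0]
    df[col] = df[col].fillna(most_frequent)

print(df)
```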
fig 12: Filling in the null values present in HP and KM columns with their respective mean
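Several numeric columns can be filled in one step. A minimal sketch, assuming made-up values in place of the original dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical HP and KM values, each with one missing entry.
df = pd.DataFrame({
    "HP": [90.0, np.nan, 110.0],
    "KM": [40000.0, 60000.0, np.nan],
})

cols = ["HP", "KM"]

# df[cols].mean() is a Series indexed by column name, so fillna()
# fills each column with its own mean.
df[cols] = df[cols].fillna(df[cols].mean())

print(df)
```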
fig 15: Custom Scatter Plot
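A customised scatter plot might be drawn as sketched below. The report does not state which plotting library or which columns were used, so matplotlib, the Age and Price columns, the sample values, and the styling choices are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; the columns and values are assumed for illustration.
df = pd.DataFrame({
    "Age": [23, 24, 26, 30, 32, 40],
    "Price": [13500, 13750, 13950, 12950, 11950, 9950],
})

fig, ax = plt.subplots(figsize=(6, 4))

# Customisation: marker colour, shape and transparency.
ax.scatter(df["Age"], df["Price"], c="tab:red", marker="o", alpha=0.7)
ax.set_xlabel("Age")
ax.set_ylabel("Price")
ax.set_title("Price vs. Age")
ax.grid(True)

# In an interactive session, plt.show() would display the figure.
```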
Conclusion: In this experiment, I learned how to perform data preprocessing and data exploration on a
dataset containing records with null values and various types of attributes. The null values were filled in
using appropriate data preprocessing techniques, and various plots were charted to explore the initial
patterns in the dataset.