Data Cleaning

Data cleaning is an essential part of statistical analysis.


We have used a range of techniques to build data-cleaning scripts for data suffering from a wide range of errors and inconsistencies. These techniques cover both technical and subject-matter aspects of data cleaning. Technical aspects include data reading, type conversion, and string matching and manipulation. Subject-matter aspects include data checking, error localization, and an introduction to imputation methods.

Analysis of data is a process of inspecting, cleaning, transforming, and modelling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.

We have spent most of our time preparing the data before performing any statistical operations.

Data cleaning is the process of transforming raw data into consistent data that can be analysed. We have aimed at improving both the content and the reliability of statistical statements based on the data.

Data cleaning has profoundly influenced our statistical statements based on the data. Typical actions such as imputation and outlier handling influence the results of a statistical analysis.
Statistical Analysis

We have utilized the above techniques to clean the data obtained.


Steps followed for Data Cleaning:

1. Examine the data.
2. Read the data column by column.
3. Use apply to return the type of each variable; its second argument selects the dimension ('2' for columns, '1' for rows).
4. Use str to display the structure of the data frame.
5. Display the descriptive statistics.
6. Total the missing values in each column and display the total below that column; the first column shows the row number.
7. Select and display the rows with missing values. Because the number of missing rows is very low, delete them: listwise (simple) deletion, appropriate when only a few records have missing values.
8. Note that loading the second file into the same data frame overrides the previous contents.
9. Handle string columns such as gender.
10. Use imputation: estimate a plausible value based on the values that are present.
11. Replace missing values with a central-tendency value: the mean or median for continuous variables, the mode for categorical ones. This is a good choice when a variable is missing less than 30% of its values.
12. Calculate the mean (for continuous columns) or the mode (for categorical columns) and substitute it for the missing values.
13. Check for outliers.
14. Apply a clamp transformation.
15. Apply binning.
16. Apply normalization and rescaling.
17. Split the datasets.
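The inspection and deletion steps above (reading the data, displaying its structure and descriptive statistics, totalling missing values per column, and listwise deletion) can be sketched as follows. The report's steps reference R functions (apply, str); this is a minimal pandas equivalent, and the CSV contents are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the report's data file.
csv_text = """name,age,gender
Alice,34,F
Bob,,M
Carol,29,F
Dan,41,
"""

# Read the data column-wise into a data frame.
df = pd.read_csv(io.StringIO(csv_text))

# Inspect structure and descriptive statistics
# (pandas analogues of R's str() and summary()).
print(df.dtypes)       # type of each column
print(df.describe())   # descriptive statistics for numeric columns

# Total the missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Because only a few rows have missing values,
# drop them entirely (listwise deletion).
clean = df.dropna()
print(clean)
```

With this sample, two rows contain a missing value and are dropped, leaving two complete records.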
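The string-handling and imputation steps (cleaning a gender column, then replacing missing values with the mean for continuous variables and the mode for categorical ones) might look like this in pandas; the gender labels and income figures are invented for illustration.

```python
import pandas as pd

# Hypothetical data frame with gaps in a continuous
# and a categorical column.
df = pd.DataFrame({
    "gender": ["male", "F", "M", None, "female", "F"],
    "income": [30000.0, None, 45000.0, 52000.0, None, 38000.0],
})

# Normalise inconsistent gender strings before imputing:
# "male" -> "M", "female" -> "F", "F" stays "F".
df["gender"] = df["gender"].str.upper().str[0]

# Impute a plausible value from the values that are present:
# the mean (or median) for continuous columns...
df["income"] = df["income"].fillna(df["income"].mean())

# ...and the mode for categorical columns. This is reasonable
# when a variable is missing less than ~30% of its values.
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

print(df)
```

Note that string methods such as `.str.upper()` propagate missing values, so the cleanup step does not disturb the subsequent mode imputation.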
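A sketch of the final steps (outlier checking, clamp transformation, binning, normalization, and splitting the dataset), again with hypothetical data; the percentile bounds, bin count, and split fraction are assumptions for illustration, not values taken from the report.

```python
import pandas as pd

# Hypothetical numeric column with one extreme value.
df = pd.DataFrame({"age": [22, 25, 29, 31, 35, 38, 41, 150]})

# Check for outliers and clamp them to percentile bounds
# (a clamp / winsorising transformation).
lo, hi = df["age"].quantile([0.05, 0.95])
df["age_clamped"] = df["age"].clip(lower=lo, upper=hi)

# Binning: group the continuous values into intervals.
df["age_bin"] = pd.cut(df["age_clamped"], bins=3,
                       labels=["low", "mid", "high"])

# Min-max normalization rescales values into [0, 1].
col = df["age_clamped"]
df["age_scaled"] = (col - col.min()) / (col.max() - col.min())

# Split the dataset, e.g. 75% training / 25% test.
train = df.sample(frac=0.75, random_state=0)
test = df.drop(train.index)
print(train.shape, test.shape)
```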
