Dr. Megha Huma Naz Assistant Professor Roll No-1810981004 M.E Fellowship (C.S.E)

Table of Contents
• Data Preprocessing: An Overview
  ▫ Data Quality
  ▫ Why Is Data Preprocessing Important?
  ▫ Major Tasks in Data Preprocessing
• Data Cleaning
  ▫ Importance
  ▫ Data Cleaning Tasks
  ▫ Fill in missing values
  ▫ Identify outliers and smooth out noisy data
  ▫ Correct inconsistent data
  ▫ Resolve redundancy caused by data integration

What Is Data Preprocessing?
• Data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, or lacking in certain behaviours or trends, and is likely to contain many errors.
• Data preprocessing is a proven method of resolving such issues; it prepares raw data for further processing.
• It makes data more suitable for data mining.
• It improves data mining analysis with respect to time, cost, and quality.

Data Quality
• Measures for data quality: a multidimensional view
  ▫ Accuracy: are the values correct or wrong, accurate or not?
  ▫ Completeness: are values not recorded, unavailable, …?
  ▫ Consistency: some modified but some not, dangling, …
  ▫ Timeliness: are the data updated in a timely manner?
  ▫ Believability: how much can the data be trusted to be correct?
  ▫ Interpretability: how easily can the data be understood?

Why Data Preprocessing?
• Data in the real world is dirty
  ▫ incomplete: missing attribute values, lacking certain attributes of interest, or containing only aggregate data (e.g., occupation=“”)
  ▫ noisy: containing errors or outliers (e.g., Salary=“-10”)
  ▫ inconsistent: containing discrepancies in codes or names, e.g., Age=“42” with Birthday=“03/07/1997”; a rating that was “1, 2, 3” and is now “A, B, C”; discrepancies between duplicate records

Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
  ▫ Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics.
• Data preparation, cleaning, and transformation make up the majority of the work in a data mining application (90%).

Major Tasks in Data Preprocessing
• Data cleaning
  ▫ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  ▫ Integration of multiple databases or files
• Data transformation
  ▫ Normalization and aggregation
• Data reduction
  ▫ Obtains a representation that is reduced in volume but produces the same or similar analytical results
• Data discretization (for numerical data)

Data Cleaning
• Importance
  ▫ “Data cleaning is the number one problem in data warehousing”
  ▫ Real-world data tend to be incomplete, noisy, and inconsistent.
  ▫ Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
• Data cleaning tasks: these routines attempt to
  ▫ Fill in missing values
  ▫ Identify outliers and smooth out noisy data
  ▫ Correct inconsistent data
  ▫ Resolve redundancy caused by data integration

Missing Data
• Data is not always available
  ▫ E.g., many tuples have no recorded values for several attributes, such as customer income in sales data
• Missing data may be due to
  ▫ equipment malfunction
  ▫ values that were inconsistent with other recorded data and thus deleted
  ▫ data not entered due to misunderstanding
  ▫ certain data not being considered important at the time of entry

How to Handle Missing Data?
1. Ignore the tuple
  ▫ Usually done when the class label is missing (in classification)
  ▫ Not an effective method unless the tuple has several attributes with missing values
2. Fill in the missing values manually: tedious (time-consuming) and often infeasible for a large database
3. Fill them in automatically with a global constant
  ▫ e.g., “unknown”; the mining program may misunderstand this as a new class
4. Fill them in with the attribute mean
  ▫ e.g., if the average income of All Electronics customers is $28,000, use this value to replace missing incomes
5. Fill them in with the attribute mean for all samples belonging to the same class as the given tuple
6. Fill them in with the most probable value
  ▫ determined with regression, inference-based tools such as a Bayesian formula, or a decision tree (the most popular strategy)

Noisy Data
• Noise: random error or variance in a measured variable.
• Incorrect attribute values may be due to
  ▫ faulty data collection instruments
  ▫ data entry problems
  ▫ data transmission problems, etc.
• Other data problems that require data cleaning
  ▫ duplicate records, incomplete data, inconsistent data

How to Handle Noisy Data?
• Binning method
  ▫ first sort the data and partition it into (equi-depth) bins
  ▫ then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
  ▫ Similar values are organized into groups (clusters).
  ▫ Values that fall outside of the clusters are considered outliers.
• Combined computer and human inspection
  ▫ the computer detects suspicious values and a human checks them (e.g., to deal with possible outliers)
• Regression
  ▫ Data can be smoothed by fitting it to a function, for example with linear regression or multiple linear regression.

Binning Methods for Data Smoothing

Outlier Removal
• Outliers are data points inconsistent with the majority of the data
• Different kinds of outliers
  ▫ Valid: a CEO’s salary
  ▫ Noisy: a person’s age = 200, widely deviating points
• Removal methods
  ▫ Clustering
  ▫ Curve-fitting
  ▫ Hypothesis-testing with a given model

Summary
• Data preparation is a big issue for data mining
• Data preparation includes
  ▫ Data cleaning and data integration
  ▫ Data reduction and feature selection
  ▫ Discretization
• Many methods have been proposed, but this is still an active area of research
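
The automatic strategies listed under "How to Handle Missing Data?" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `segment` class attribute, and all function names are hypothetical and do not come from the slides, and the income values were chosen so that the attribute mean matches the $28,000 example.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing income value.
records = [
    {"customer": "a", "segment": "gold", "income": 40000},
    {"customer": "b", "segment": "gold", "income": None},
    {"customer": "c", "segment": "basic", "income": 20000},
    {"customer": "d", "segment": "basic", "income": 24000},
    {"customer": "e", "segment": "basic", "income": None},
]


def drop_missing(rows, attr):
    """Strategy 1: ignore tuples whose value for attr is missing."""
    return [r for r in rows if r[attr] is not None]


def fill_constant(rows, attr, constant="unknown"):
    """Strategy 3: replace missing values with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Strategy 4: replace missing values with the mean of the observed values."""
    attr_mean = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: attr_mean if r[attr] is None else r[attr]} for r in rows]


def fill_class_mean(rows, attr, class_attr):
    """Strategy 5: replace missing values with the mean of the same class.

    Assumes every class has at least one observed value for attr.
    """
    observed = {}
    for r in rows:
        if r[attr] is not None:
            observed.setdefault(r[class_attr], []).append(r[attr])
    class_means = {label: mean(vals) for label, vals in observed.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]


by_mean = fill_mean(records, "income")
by_class = fill_class_mean(records, "income", "segment")
```

With this toy data the global mean fills both gaps with 28000, while the class-conditional mean fills customer "b" with 40000 and customer "e" with 22000, which shows why the per-class variant can preserve differences between groups. Strategy 6 (most probable value) would need a fitted model and is not shown here.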
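
The binning method described under "How to Handle Noisy Data?" (sort, partition into equi-depth bins, then smooth) can be illustrated as follows. This is a minimal sketch: the price values and function names are illustrative assumptions, not taken from the slides, and ties in boundary smoothing are resolved toward the lower boundary as an arbitrary choice.

```python
from statistics import mean, median


def equal_depth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items.

    The last bin is shorter when len(values) is not a multiple of depth.
    """
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        # Assumption: a value equidistant from both boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Illustrative price data (hypothetical values).
prices = [15, 4, 21, 8, 24, 21, 28, 34, 25]
bins = equal_depth_bins(prices, depth=3)
# bins                       -> [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
# smooth_by_means(bins)      -> [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
# smooth_by_boundaries(bins) -> [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by means or medians flattens each bin to a single representative value, while smoothing by boundaries keeps the spread of each bin but snaps interior values to its edges.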
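
Smoothing by regression, mentioned under "How to Handle Noisy Data?", amounts to fitting a function to the data and replacing each measured value with the fitted one. The sketch below assumes simple linear regression with one predictor, fitted by ordinary least squares; the function names and the sample measurements are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Assumes xs contains at least two distinct values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each noisy y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
```

The residuals `y - smoothed_y` can also serve as a simple signal for the curve-fitting approach to outlier detection: points with unusually large residuals are candidates for human inspection.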
THE STEP BY STEP GUIDE FOR SUCCESSFUL IMPLEMENTATION OF DATA LAKE-LAKEHOUSE-DATA WAREHOUSE