
UNIT II:

The Data Science Process: Overview of the Data Science Process - Setting the research
goal, Retrieving Data, Data Preparation, Exploration, Modeling, Data Presentation and
Automation. Getting Data in and out of R, Using the reader package, Interfaces to the
outside world.

Data Mining Process:

Data Mining is a process of discovering various models, summaries, and derived
values from a given collection of data.
The general experimental procedure adapted to data-mining problems involves the
following steps:
1. State the problem and formulate the hypothesis

Most data-based modeling studies are performed in a particular application
domain. Hence, domain-specific knowledge and experience are usually necessary
in order to come up with a meaningful problem statement. Unfortunately, many
application studies tend to focus on the data-mining technique at the expense
of a clear problem statement. In this step, a modeler usually specifies a set
of variables for the unknown dependency and, if possible, a general form of
this dependency as an initial hypothesis. There may be several hypotheses
formulated for a single problem at this stage. The first step requires the
combined expertise of the application domain and of data-mining modeling; in
practice, it usually means a close interaction between the data-mining expert
and the application expert. In successful data-mining applications, this
cooperation does not stop in the initial phase; it continues during the
entire data-mining process.

2. Collect the data

This step is concerned with how the data are generated and
collected. In general, there are two distinct possibilities. The first
is when the data-generation process is under the control of an
expert (modeler): this approach is known as a designed
experiment. The second possibility is when the expert cannot
influence the data-generation process: this is known as the
observational approach. An observational setting, namely, random
data generation, is assumed in most data-mining applications.
Typically, the sampling
distribution is completely unknown after data are collected, or it is
partially and implicitly given in the data-collection procedure. It is
very important, however, to understand how data collection affects
its theoretical distribution, since such a priori knowledge can be
very useful for modeling and, later, for the final interpretation of
results. Also, it is important to make sure that the data used for
estimating a model and the data used later for testing and applying
a model come from the same, unknown, sampling distribution. If
this is not the case, the estimated model cannot be successfully
used in a final application of the results.

3. Preprocessing the data

In the observational setting, data are usually "collected" from existing
databases, data warehouses, and data marts. Data preprocessing usually
includes at least two common tasks:

1. Outlier detection (and removal) – Outliers are unusual data values that are
not consistent with most observations. Commonly, outliers result from
measurement errors, coding and recording errors, and, sometimes, are natural,
abnormal values. Such nonrepresentative samples can seriously affect the model
produced later. There are two strategies for dealing with outliers:
a. Detect and eventually remove outliers as a part of the preprocessing phase, or
b. Develop robust modeling methods that are insensitive to outliers.

2. Scaling, encoding, and selecting features – Data preprocessing includes
several steps such as variable scaling and different types of encoding. For
example, one feature with the range [0, 1] and another with the range
[−100, 1000] will not have the same weight in the applied technique; they will
also influence the final data-mining results differently. Therefore, it is
recommended to scale them and bring both features to the same weight for
further analysis. Also, application-specific encoding methods usually achieve
dimensionality reduction by providing a smaller number of informative features
for subsequent data modeling.

These two classes of preprocessing tasks are only illustrative examples of a
large spectrum of preprocessing activities in a data-mining process.
Data-preprocessing steps should not be considered completely independent from
other data-mining phases. In every iteration of the data-mining process, all
activities, together, could define new and improved data sets for subsequent
iterations. Generally, a good preprocessing method provides an optimal
representation for a data-mining technique by incorporating a priori knowledge
in the form of application-specific scaling and encoding.

4. Estimate the model

The selection and implementation of the appropriate data-mining technique is
the main task in this phase. This process is not straightforward; usually, in
practice, the implementation is based on several models, and selecting the
best one is an additional task. The basic principles of learning and discovery
from data, together with the specific techniques applied to perform a
successful learning process and to develop an appropriate model, are treated
in detail in standard data-mining texts.

5. Interpret the model and draw conclusions

In most cases, data-mining models should help in decision making. Hence, such
models need to be interpretable in order to be useful, because humans are not
likely to base their decisions on complex "black-box" models. Note that the
goals of accuracy of the model and accuracy of its interpretation are somewhat
contradictory. Usually, simple models are more interpretable, but they are
also less accurate. Modern data-mining methods are expected to yield highly
accurate results using high-dimensional models. The problem of interpreting
these models, also very important, is considered a separate task, with
specific techniques to validate the results. A user does not want hundreds of
pages of numeric results: he cannot understand, summarize, interpret, and use
them for successful decision making.

Data preprocessing

Data preprocessing is an important step in the data mining process. It refers
to the cleaning, transforming, and integrating of data in order to make it
ready for analysis. The goal of data preprocessing is to improve the quality
of the data and to make it more suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or
inconsistencies in the data, such as missing values, outliers, and duplicates.
Various techniques can be used for data cleaning, such as imputation,
removal, and transformation.
Data Integration: This involves combining data from multiple sources to
create a unified dataset. Data integration can be challenging as it requires
handling data with different formats, structures, and semantics. Techniques
such as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable
format for analysis. Common techniques used in data transformation
include normalization, standardization, and discretization. Normalization is
used to scale the data to a common range, while standardization is used to
transform the data to have zero mean and unit variance. Discretization is
used to convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved
through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset,
while feature extraction involves transforming the data into a lower-
dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete
categories or intervals. Discretization is often used in data mining and
machine learning algorithms that require categorical data. Discretization
can be achieved through techniques such as equal width binning, equal
frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range,
such as between 0 and 1 or -1 and 1. Normalization is often used to handle
data with different units and scales. Common normalization techniques
include min-max normalization, z-score normalization, and decimal
scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and
the accuracy of the analysis results. The specific steps involved in data
preprocessing may vary depending on the nature of the data and the
analysis goals.
By performing these steps, the data mining process becomes more efficient
and the results become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform
the raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part,
data cleaning is done. It involves handling of missing data, noisy data etc.

 (a). Missing Data:

This situation arises when some values are missing in the data. It can be
handled in various ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable value.
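As a small illustration of the second option, the attribute mean can be filled in
for the missing entries. The following R sketch uses a made-up data frame and
column name:

# Hypothetical data frame with missing values in a numeric attribute
df <- data.frame(age = c(23, NA, 31, 27, NA))

# Fill the missing values with the attribute mean
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)
df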

 (b). Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can
be generated due to faulty data collection, data entry errors, etc. It can be
handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size, and then various methods are performed to
complete the task. Each segment is handled separately: one can replace all
data in a segment by its mean, or boundary values can be used to complete the
task (see the sketch after this list).

2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers may then go
undetected, or they will fall outside the clusters.
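As a rough sketch of the binning method referred to above (the values are made
up), equal-depth bins can be smoothed by their bin means in R:

# Sorted data split into equal-sized segments (equal-depth bins)
x <- sort(c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34))
bins <- split(x, rep(1:4, each = 3))   # four bins of three values each

# Smoothing by bin means: replace every value in a bin with its bin mean
smoothed <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
smoothed

Replacing rep(mean(b), ...) with the nearest bin boundary would give smoothing
by bin boundaries instead.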

2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for
the mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values to a specified range (-1.0 to 1.0
or 0.0 to 1.0).

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.

3. Discretization:
This is done to replace the raw values of a numeric attribute by interval
levels or conceptual levels.

4. Concept Hierarchy Generation:

Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute "city" can be converted to "country".

3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information.
This is done to improve the efficiency of data analysis and to avoid
overfitting of the model. Some common steps involved in data reduction
are:
Feature Selection: This involves selecting a subset of relevant features
from the dataset. Feature selection is often performed to remove irrelevant
or redundant features from the dataset. It can be done using various
techniques such as correlation analysis, mutual information, and principal
component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-
dimensional space while preserving the important information. Feature
extraction is often used when the original features are high-dimensional
and complex. It can be done using techniques such as PCA, linear
discriminant analysis (LDA), and non-negative matrix factorization
(NMF).
Sampling: This involves selecting a subset of data points from the dataset.
Sampling is often used to reduce the size of the dataset while preserving
the important information. It can be done using techniques such as random
sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into
clusters. Clustering is often used to reduce the size of the dataset by
replacing similar data points with a representative centroid. It can be done
using techniques such as k-means, hierarchical clustering, and density-
based clustering.
Compression: This involves compressing the dataset while preserving the
important information. Compression is often used to reduce the size of the
dataset for storage and transmission purposes. It can be done using
techniques such as wavelet compression, JPEG compression, and gzip
compression.
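
The following short R sketch illustrates two of these ideas on the built-in
iris data (purely for illustration): feature extraction with PCA and simple
random sampling.

# Feature extraction: project the four iris measurements onto two principal components
pca <- prcomp(iris[, 1:4], scale. = TRUE)
reduced <- pca$x[, 1:2]      # lower-dimensional representation
summary(pca)                 # proportion of variance retained by each component

# Sampling: keep a 10% random subset of the rows
set.seed(1)
idx <- sample(nrow(iris), size = 0.1 * nrow(iris))
iris_sample <- iris[idx, ]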

Data Cleaning in Data Mining

Data cleaning is an essential step in the data mining process and is crucial
to the construction of a model, yet it is frequently overlooked. The major
problem with quality information management is data quality. Problems with
data quality can happen at any place in an information system. Data cleansing
offers a solution to these issues.

Data cleaning is the process of correcting or deleting inaccurate, damaged,


improperly formatted, duplicated, or insufficient data from a dataset. Even
if results and algorithms appear to be correct, they are unreliable if the data
is inaccurate. There are numerous ways for data to be duplicated or
incorrectly labeled when merging multiple data sources.

In general, data cleaning lowers errors and raises the caliber of the data.
Although it might be a time-consuming and laborious operation, fixing data
mistakes and removing incorrect information must be done. A crucial
method for cleaning up data is data mining. A method for finding useful
information in data is data mining. Data quality mining is a novel
methodology that uses data mining methods to find and fix data quality
issues in sizable databases. Data mining mechanically pulls intrinsic and
hidden information from large data sets. Data cleansing can be
accomplished using a variety of data mining approaches.

To arrive at a precise final analysis, it is crucial to comprehend and improve
the quality of your data. To identify key patterns, the data must be prepared;
this is understood as exploratory data mining. Before doing business analysis
and gaining insights, data cleaning in data mining enables the user to
identify erroneous or missing data.

Data cleaning before data mining is often a time-consuming procedure that
necessitates IT personnel to assist in the initial step of reviewing your
data. If your final analysis is inaccurate or you get an erroneous result, it
is possibly due to poor data quality.

Steps for Cleaning Data

You can follow these fundamental stages to clean your data even if the
techniques employed may vary depending on the sorts of data your firm
stores:

1. Remove duplicate or irrelevant observations

Remove duplicate, pointless, or otherwise undesirable observations from your
dataset. The majority of duplicate observations will occur during data
gathering. Duplicate data can be produced when you merge data sets from
several sources, scrape data, or get data from clients or other departments.
One of the most important factors to take into account in this procedure is
de-duplication. Observations are deemed irrelevant when they do not pertain to
the particular issue you are attempting to analyze.

You might eliminate those useless observations, for instance, if you wish to
analyze data on millennial clients but your dataset also includes
observations from earlier generations. This can improve the analysis's
efficiency, reduce deviance from your main objective, and produce a
dataset that is easier to maintain and use.
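
A minimal R sketch of both ideas, using hypothetical columns and values:

# Hypothetical customer records containing a duplicate row
customers <- data.frame(id = c(1, 1, 2, 3),
                        generation = c("millennial", "millennial", "boomer", "millennial"))

customers <- customers[!duplicated(customers), ]               # de-duplication
millennials <- subset(customers, generation == "millennial")   # drop irrelevant observations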
2. Fix structural errors

Structural errors are odd naming practices, typos, or instances of wrong
capitalization that you find when you measure or transfer data. These
inconsistencies can result in mislabelled categories or classes. For instance,
"N/A" and "Not Applicable" might both be present on any given sheet, but they
ought to be analyzed under the same heading.
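
For example, such labels can be standardized with a few lines of R (the vector
below is made up):

# "N/A" and "Not Applicable" should be analyzed under the same heading
status <- c("N/A", "Not Applicable", "yes", "Yes", "No")
status[status %in% c("N/A", "Not Applicable")] <- NA   # treat both as missing
status <- tolower(status)                              # fix inconsistent capitalization
status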

3. Filter unwanted outliers

There will frequently be isolated findings that, at first glance, do not seem
to fit the data you are analyzing. Removing an outlier if you have a good
reason to, such as incorrect data entry, will improve the performance of the
data you are working with.

However, occasionally the emergence of an outlier will support a theory you
are investigating, and just because there is an outlier does not necessarily
mean it is inaccurate. This step is necessary to determine the reliability of
the number. If an outlier turns out to be incorrect or unimportant for the
analysis, you might want to remove it.
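
One common way to flag candidate outliers before deciding what to do with them
is the 1.5 × IQR rule; a small R sketch with made-up values:

x <- c(10, 12, 11, 13, 12, 95)            # 95 looks suspicious
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * IQR(x)
is_outlier <- x < q[1] - fence | x > q[2] + fence
x[is_outlier]    # inspect first; an outlier is not automatically an error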

4. Handle missing data

Because many algorithms won't tolerate missing values, you can't overlook
missing data. There are a few options for handling missing data. While
neither is ideal, both can be taken into account, for example:

Although you can remove observations with missing values, doing so will
result in the loss of information, so proceed with caution.

Again, there is a chance to undermine the integrity of the data since you can
be working from assumptions rather than actual observations when you
input missing numbers based on other observations.

To handle null values efficiently, you may need to change the way the data
is used.
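
A short sketch of these options in R, on an illustrative vector:

x <- c(3.1, NA, 2.7, 4.4)

mean(x, na.rm = TRUE)                     # change how the data is used: ignore NAs in a calculation
x_dropped <- x[!is.na(x)]                 # drop observations with missing values (loses information)
x[is.na(x)] <- median(x, na.rm = TRUE)    # or impute from the other observations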
5. Validate and QA

As part of fundamental validation, you ought to be able to respond to the


following queries once the data cleansing procedure is complete:

o Are the data coherent?


o Does the data abide by the regulations that apply to its particular
field?
o Does it support or refute your working theory? Does it offer any new
information?
o To support your next theory, can you identify any trends in the data?
o If not, is there a problem with the data's quality?

Inaccurate or noisy data can lead to false conclusions that inform poor
company strategy and decision-making. False conclusions can also result in an
embarrassing situation in a reporting meeting when you find out your data
could not withstand further investigation. Before you get there, it is crucial
to establish a culture of quality data in your organization, and the tools you
might employ to develop this plan should be documented.

Techniques for Cleaning Data

The data should be passed through one of the various data-cleaning


procedures available. The procedures are explained below:

1. Ignore the tuples: This approach is not very practical, because it is
only useful when a tuple has several attributes with missing values.
2. Fill in the missing value: This strategy is also not very practical or
effective, and it can be a time-consuming technique. With this approach, one
must fill in the missing value; the most common method is to do this
manually, but other options include using the attribute mean or the most
likely value.
3. Binning method: This strategy is fairly easy to comprehend. The
values nearby are used to smooth the sorted data. The information is
subsequently split into several equal-sized parts. The various
techniques are then used to finish the assignment.
4. Regression: With the use of a regression function, the data is smoothed
out. The regression may be linear or multiple: multiple regression has more
than one independent variable, whereas linear regression has only one.
5. Clustering: This technique focuses on grouping data. Similar values are
grouped into a "group" or "cluster", and the clustering result is then used
to find the outliers.

Process of Data Cleaning

The data cleaning process for data mining is demonstrated in the following
steps.

1. Monitoring the errors: Keep track of the areas where errors seem to
occur most frequently. It will be simpler to identify and maintain
inaccurate or corrupt information. Information is particularly
important when integrating a potential substitute with current
management software.
2. Standardize the mining process: To help lower the likelihood of
duplication, standardize the point of data entry.
3. Validate data accuracy: Analyse the data and invest in data cleaning
software. Artificial intelligence-based tools can be utilized to
thoroughly check for accuracy.
4. Scrub for duplicate data: To save time when analyzing data, find
duplicates. By analyzing and investing in independent data-erasing tools
that can analyze imperfect data in quantity and automate the operation, it
is possible to avoid processing the same data again.
5. Research on data: Our data needs to be vetted, standardized, and
duplicate-checked before this action. There are numerous third-party
sources, and these vetted and approved sources can extract data
straight from our databases. They assist us in gathering the data and
cleaning it up so that it is reliable, accurate, and comprehensive for
use in business decisions.
6. Communicate with the team: Keeping the group informed will help
with client development and strengthening as well as giving more
focused information to potential clients.

Usage of Data Cleaning in Data Mining.

The following are some examples of how data cleaning is used in data
mining:

o Data Integration: Since it is challenging to guarantee quality with


low-quality data, data integration is crucial in resolving this issue. The
process of merging information from various data sets into one is
known as data integration. Before transferring to the ultimate location,
this step makes sure that the embedded data set is standardized and
formatted using data cleansing technologies.
o Data Migration: The process of transferring a file from one system,
format, or application to another is known as data migration. It is
crucial to maintain the data's quality, security, and consistency while it
is in transit, to ensure that the resulting data has the correct format
and structure without any degradation at the destination.
o Data Transformation: The data must be changed before being
uploaded to a location. Data cleansing, which takes into account
system requirements for formatting, organizing, etc., is the only
method that can achieve this. Before conducting additional analysis,
data transformation techniques typically involve the use of rules and
filters. Most data integration and data management methods include
data transformation as a necessary step. Utilizing the systems' internal
transformations, data cleansing tools assist in cleaning the data.
o Data Debugging in ETL Processes: To prepare data for reporting
and analysis throughout the extract, transform, and load (ETL)
process, data cleansing is essential. Only high-quality data are used
for decision-making and analysis thanks to data purification.

Cleaning data is essential. For instance, a retail business could receive
inaccurate or duplicate data from different sources, including CRM or ERP
systems. A reliable data debugging tool would find and fix these
discrepancies; the cleaned information is then transformed into a common
format and transferred to the intended database.

Characteristics of Data Cleaning

To ensure the correctness, integrity, and security of corporate data, data


cleaning is a requirement. These may be of varying quality depending on
the properties or attributes of the data. The key components of data
cleansing in data mining are as follows:

o Accuracy: The business's database must contain only extremely


accurate data. Comparing them to other sources is one technique to
confirm their veracity. The stored data will also have issues if the
source cannot be located or contains errors.
o Coherence: To ensure that the information on a person or body is the
same throughout all types of storage, the data must be consistent with
one another.
o Validity: There must be rules or limitations in place for the stored
data. The information must also be confirmed to support its veracity.
o Uniformity: A database's data must all share the same units or values.
Since it doesn't complicate the process, it is a crucial component while
doing the Data Cleansing process.
o Data Verification: Every step of the process, including its
appropriateness and effectiveness, must be checked. The study,
design, and validation stages all play a role in the verification process.
The disadvantages are frequently obvious after applying the data to a
specific number of changes.
o Clean Data Backflow: After quality issues have been addressed, the cleaned
data should flow back to replace the unclean data in the original sources,
so that legacy applications can also profit from it and the need for a
subsequent data-cleaning program is avoided.

Tools for Data Cleaning in Data Mining

Data Cleansing Tools can be very helpful if you are not confident of
cleaning the data yourself or have no time to clean up all your data sets.
You might need to invest in those tools, but it is worth the expenditure.
There are many data cleaning tools in the market. Here are some top-ranked
data cleaning tools, such as:

1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM Infosphere Quality Stage
9. TIBCO Clarity
10. Winpure

Benefits of Data Cleaning

When you have clean data, you can make decisions using the highest-quality
information and eventually boost productivity. Some important advantages of
data cleaning in data mining include:

o Removal of inaccuracies when several data sources are involved.


o Clients are happier and employees are less annoyed when there are
fewer mistakes.
o The capacity to map out the many functions and the planned uses of
your data.
o Monitoring mistakes and improving reporting make it easier to
resolve inaccurate or damaged data for future applications by allowing
users to identify where issues are coming from.
o Making decisions more quickly and with greater efficiency will be
possible with the use of data cleansing tools.

DATA TRANSFORMATION INTRODUCTION:


Data transformation in data mining refers to the process of converting raw
data into a format that is suitable for analysis and modeling. The goal of
data transformation is to prepare the data for data mining so that it can be
used to extract useful insights and knowledge. Data transformation
typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and
missing values in the data.
2. Data integration: Combining data from multiple sources, such as
databases and spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such
as between 0 and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a
subset of relevant features or attributes.
5. Data discretization: Converting continuous data into discrete categories
or bins.
6. Data aggregation: Combining data at different levels of granularity, such
as by summing or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process, as it
helps to ensure that the data is in a format that is suitable for analysis and
modeling and that it is free of errors and inconsistencies. Data
transformation can also help to improve the performance of data mining
algorithms, by reducing the dimensionality of the data and by scaling the
data to a common range of values.
The data are transformed in ways that are ideal for mining the data. The
data transformation involves steps that are:
1. Smoothing: It is a process that is used to remove noise from the dataset
using some algorithms. It allows for highlighting important features present
in the dataset and helps in predicting patterns. When collecting data, it
can be manipulated to eliminate or reduce any variance or any other noise
form. The concept behind data smoothing is that it will be able to identify
simple changes to help predict different trends and patterns. This serves as
a help to analysts or traders who need to look at a lot of data, which can
often be difficult to digest, to find patterns they would not see otherwise.
2. Aggregation: Data collection or aggregation is the method of storing
and presenting data in a summary format. The data may be obtained from
multiple data sources to integrate these data sources into a data analysis
description. This is a crucial step since the accuracy of data analysis
insights is highly dependent on the quantity and quality of the data used.
Gathering accurate data of high quality and a large enough quantity is
necessary to produce relevant results. The collection of data is useful for
everything from decisions concerning financing or business strategy of the
product, pricing, operations, and marketing strategies. For example, sales
data may be aggregated to compute monthly and annual total amounts.
3. Discretization: It is a process of transforming continuous data into a set
of small intervals. Most data mining activities in the real world require
continuous attributes, yet many of the existing data mining frameworks are
unable to handle these attributes. Also, even if a data mining task can
manage a continuous attribute, it can significantly improve its efficiency by
replacing the continuous attribute with its discrete values. For example, age
values may be grouped into intervals (1-10, 11-20, ...) or into conceptual
labels (young, middle age, senior).
4. Attribute Construction: Here, new attributes are created from the given
set of attributes and applied to assist the mining process. This simplifies
the original data and makes the mining more efficient.
5. Generalization: It converts low-level data attributes to high-level data
attributes using a concept hierarchy. For example, age initially in numerical
form (22, 25) is converted into a categorical value (young, old). Similarly,
categorical attributes, such as house addresses, may be generalized to
higher-level definitions, such as town or country.
6. Normalization: Data normalization involves converting all data variables
into a given range. Techniques that are used for normalization are (see the
sketch after this list):
 Min-Max Normalization:
 This transforms the original data linearly.
 Suppose that min_A is the minimum and max_A is the maximum value of an
attribute A, v is the old value, and v’ is the new value obtained after
normalizing v into the new range [new_min_A, new_max_A]. Then
 v’ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
 Z-Score Normalization:
 In z-score normalization (or zero-mean normalization) the values of an
attribute (A) are normalized based on the mean of A and its standard
deviation.
 A value, v, of attribute A is normalized to v’ by computing
 v’ = (v − mean_A) / std_A
 Decimal Scaling:
 It normalizes the values of an attribute by changing the position of
their decimal points.
 The number of positions by which the decimal point is moved is determined
by the maximum absolute value of attribute A.
 A value, v, of attribute A is normalized to v’ by computing
 v’ = v / 10^j
 where j is the smallest integer such that Max(|v’|) < 1.
 Suppose the values of an attribute A vary from −99 to 99; the maximum
absolute value of A is 99. For normalizing the values we divide each number
by 100 (i.e., j = 2), so that the values come out as 0.99, 0.98, 0.97 and
so on.
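The aggregation, discretization, and normalization steps described above can
be sketched in a few lines of R (all of the data below are made up):

# Aggregation: monthly sales totals
sales <- data.frame(month = c("Jan", "Jan", "Feb", "Feb"),
                    amount = c(100, 250, 400, 150))
aggregate(amount ~ month, data = sales, FUN = sum)

# Discretization / generalization: numeric age to conceptual labels
age <- c(17, 22, 25, 48, 62, 71)
cut(age, breaks = c(0, 30, 55, Inf), labels = c("young", "middle age", "senior"))

# Normalization of an attribute A with values in [-99, 99]
v <- c(-99, -50, 0, 45, 99)
v_minmax  <- (v - min(v)) / (max(v) - min(v))   # min-max normalization to [0, 1]
v_zscore  <- (v - mean(v)) / sd(v)              # z-score (zero-mean) normalization
j <- ceiling(log10(max(abs(v))))                # decimal scaling: j = 2 here
v_decimal <- v / 10^j                           # 99 becomes 0.99, -99 becomes -0.99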
ADVANTAGES OR DISADVANTAGES:
Advantages of Data Transformation in Data Mining:
1. Improves Data Quality: Data transformation helps to improve the quality
of data by removing errors, inconsistencies, and missing values.
2. Facilitates Data Integration: Data transformation enables the integration
of data from multiple sources, which can improve the accuracy and
completeness of the data.
3. Improves Data Analysis: Data transformation helps to prepare the data for
analysis and modeling by normalizing, reducing dimensionality, and
discretizing the data.
4. Increases Data Security: Data transformation can be used to mask
sensitive data, or to remove sensitive information from the data, which
can help to increase data security.
5. Enhances Data Mining Algorithm Performance: Data transformation can
improve the performance of data mining algorithms by reducing the
dimensionality of the data and scaling the data to a common range of
values.
Disadvantages of Data Transformation in Data Mining:
1. Time-consuming: Data transformation can be a time-consuming process,
especially when dealing with large datasets.
2. Complexity: Data transformation can be a complex process, requiring
specialized skills and knowledge to implement and interpret the results.
3. Data Loss: Data transformation can result in data loss, such as when
discretizing continuous data, or when removing attributes or features from
the data.
4. Biased transformation: Data transformation can result in bias, if the data
is not properly understood or used.
5. High cost: Data transformation can be an expensive process, requiring
significant investments in hardware, software, and personnel.
6. Overfitting: Data transformation can lead to overfitting, which is a
common problem in machine learning where a model learns the detail and
noise in the training data to the extent that it negatively impacts the
performance of the model on new, unseen data.

Data science is about extracting knowledge and insights from data. The tools
and techniques of data science are used to drive business and process
decisions.
Data Science Processes:
1. Setting the Research Goal
2. Retrieving Data
3. Data Preparation
4. Data Exploration
5. Data Modeling
6. Presentation and Automation
1. Setting the research goal:
Data science is mostly applied in the context of an organization. When the
business asks you to perform a data science project, you’ll first prepare a project
charter. This charter contains information such as what you’re going to research,
how the company benefits from that, what data and resources you need, a
timetable, and deliverables.
2. Retrieving data:
The second step is to collect data. You’ve stated in the project charter which
data you need and where you can find it. In this step you ensure that you can use
the data in your program, which means checking the existence of, quality, and
access to the data. Data can also be delivered by third-party companies and takes
many forms ranging from Excel spreadsheets to different types of databases.

3. Data preparation:
Data collection is an error-prone process; in this phase you enhance the quality
of the data and prepare it for use in subsequent steps. This phase consists of three
subphases: data cleansing removes false values from a data source and
inconsistencies across data sources, data integration enriches data sources by
combining information from multiple data sources, and data transformation ensures
that the data is in a suitable format for use in your models.
4. Data exploration:
Data exploration is concerned with building a deeper understanding of your data.
You try to understand how variables interact with each other, the distribution of the
data, and whether there are outliers. To achieve this, you mainly use descriptive
statistics, visual techniques, and simple modeling. This step often goes by the
abbreviation EDA, for Exploratory Data Analysis.

5. Data modeling or model building:


In this phase you use models, domain knowledge, and insights about the data
you found in the previous steps to answer the research question. You select a
technique from the fields of statistics, machine learning, operations research, and
so on. Building a model is an iterative process that involves selecting the variables
for the model, executing the model, and model diagnostics.

6. Presentation and automation:


Finally, you present the results to your business. These results can take many
forms, ranging from presentations to research reports. Sometimes you’ll need to
automate the execution of the process because the business will want to use the
insights you gained in another project or enable an operational process to use the
outcome from your model.
Knowledge and Skills for Data Science Professionals:
• Statistical / mathematical reasoning.
• Business communication/leadership.
• Programming.
1. Statistics:
Wikipedia defines it as the study of the collection, analysis, interpretation,
presentation, and organization of data. Therefore, it shouldn’t be a surprise that
data scientists need to know statistics.
For example, data analysis requires descriptive statistics and probability theory,
at a minimum. These concepts will help you make better business decisions from
data.
2. Programming Language R/ Python:
Python and R are among the most widely used languages by data scientists.
The primary reason is the number of packages available for numeric and
scientific computing.
3. Data Extraction, Transformation, and Loading:
Suppose we have multiple data sources like MySQL DB, MongoDB, Google
Analytics. You have to Extract data from such sources, and then transform it for
storing in a proper format or structure for the purposes of querying and analysis.
Finally, you have to load the data in the Data Warehouse, where you will analyze
the data. So, for people from an ETL (Extract, Transform, and Load)
background, data science can be a good career option.
4. Data Wrangling and Data Exploration:
Cleaning and unifying messy and complex data sets for easy access and
analysis is termed Data Wrangling. Exploratory Data Analysis (EDA) is the
first step in your data analysis process. Here, you make sense of the data you have
and then figure out what questions you want to ask and how to frame them, as well
as how best to manipulate your available data sources to get the answers you
need.

5. Machine Learning:
Machine Learning, as the name suggests, is the process of making machines
intelligent, giving them the power to think, analyze, and make decisions. By
building precise Machine Learning models, an organization has a better chance
of identifying profitable opportunities or avoiding unknown risks.
You should have good hands-on knowledge of various Supervised and
Unsupervised algorithms.

6. Big Data Processing Frameworks:


Nowadays, most organizations use Big Data analytics to gain hidden business
insights, so it is a must-have skill for a data scientist. Frameworks like
Hadoop and Spark are required to handle Big Data.

Getting Data In and Out of R

5.1 Reading and Writing Data


There are a few principal functions for reading data into R.

 read.table, read.csv, for reading tabular data


 readLines, for reading lines of a text file
 source, for reading in R code files (inverse of dump)
 dget, for reading in R code files (inverse of dput)
 load, for reading in saved workspaces
 unserialize, for reading single R objects in binary form

There are, of course, many R packages that have been developed to read in
all kinds of other datasets, and you may need to resort to one of these
packages if you are working in a specific area.
There are analogous functions for writing data to files:

 write.table, for writing tabular data to text files (i.e. CSV) or


connections
 writeLines, for writing character data line-by-line to a file or
connection
 dump, for dumping a textual representation of multiple R objects
 dput, for outputting a textual representation of an R object
 save, for saving an arbitrary number of R objects in binary format
(possibly compressed) to a file.
 serialize, for converting an R object into a binary format for
outputting to a connection (or file).
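
A small round trip using a few of the functions above (the file names are
just examples):

x <- data.frame(a = 1:3, b = c("p", "q", "r"))

write.table(x, "x.txt", sep = ",", row.names = FALSE)   # tabular text file
y <- read.table("x.txt", header = TRUE, sep = ",")

dput(x, "x.R")                                          # textual representation of one object
z <- dget("x.R")

save(x, file = "x.rda")                                 # binary (possibly compressed) format
load("x.rda")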

5.2 Reading Data Files with read.table()

The read.table() function is one of the most commonly used
functions for reading data. The help file for read.table() is worth
reading in its entirety if only because the function gets used a lot
(run ?read.table in R). I know, I know, everyone always says to read
the help file, but this one is actually worth reading.
The read.table() function has a few important arguments:

 file, the name of a file, or a connection


 header, logical indicating if the file has a header line
 sep, a string indicating how the columns are separated
 colClasses, a character vector indicating the class of each column in the
dataset
 nrows, the number of rows in the dataset. By default read.table() reads an
entire file.
 comment.char, a character string indicating the comment character. This
defaults to "#". If there are no commented lines in your file, it’s worth
setting this to be the empty string "".
 skip, the number of lines to skip from the beginning
 stringsAsFactors , should character variables be coded as factors? This
defaults to TRUE because back in the old days, if you had data that were
stored as strings, it was because those strings represented levels of a
categorical variable. Now we have lots of data that is text data and they
don’t always represent categorical variables. So you may want to set
this to be FALSE in those cases. If you always want this to be FALSE, you
can set a global option via options(stringsAsFactors = FALSE) . I’ve never
seen so much heat generated on discussion forums about an R function
argument than the stringsAsFactors argument. Seriously.

For small to moderately sized datasets, you can usually call


read.table without specifying any other arguments
> data <- read.table("foo.txt")
In this case, R will automatically

 skip lines that begin with a #


 figure out how many rows there are (and how much memory needs to be
allocated)
 figure out what type of variable is in each column of the table.

Telling R all these things directly makes R run faster and more
efficiently. The read.csv() function is identical to read.table except
that some of the defaults are set differently (like
the sep argument).
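
For example, with a hypothetical well-formed comma-separated file foo.csv, the
following two calls should produce the same data frame:

> dat1 <- read.csv("foo.csv")                              # header = TRUE, sep = "," by default
> dat2 <- read.table("foo.csv", header = TRUE, sep = ",")
> identical(dat1, dat2)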

5.3 Reading in Larger Datasets with read.table

With much larger datasets, there are a few things that you can do
that will make your life easier and will prevent R from choking.

 Read the help page for read.table, which contains many hints
 Make a rough calculation of the memory required to store your
dataset (see the next section for an example of how to do this). If
the dataset is larger than the amount of RAM on your computer,
you can probably stop right here.
 Set comment.char = "" if there are no commented lines in your file.
 Use the colClasses argument. Specifying this option instead of
using the default can make ’read.table’ run MUCH faster, often
twice as fast. In order to use this option, you have to know the
class of each column in your data frame. If all of the columns are
“numeric”, for example, then you can just set colClasses = "numeric" .
A quick and dirty way to figure out the classes of each column is
the following:
> initial <- read.table("datatable.txt", nrows = 100)
> classes <- sapply(initial, class)
> tabAll <- read.table("datatable.txt", colClasses = classes)

 Set nrows. This doesn’t make R run faster but it helps with memory
usage. A mild overestimate is okay. You can use the Unix tool wc to
calculate the number of lines in a file.

In general, when using R with larger datasets, it’s also useful to


know a few things about your system.

 How much memory is available on your system?


 What other applications are in use? Can you close any of them?
 Are there other users logged into the same system?
 What operating system are you using? Some operating systems can limit
the amount of memory a single process can access.

5.4 Calculating Memory Requirements for R Objects

Because R stores all of its objects in physical memory, it is
important to be cognizant of how much memory is being used up
by all of the data objects residing in your workspace. One situation
where it’s particularly important to understand memory
requirements is when you are reading in a new dataset into R.
Fortunately, it’s easy to make a back-of-the-envelope calculation of how much
memory will be required by a new dataset.

For example, suppose I have a data frame with 1,500,000 rows and
120 columns, all of which are numeric data. Roughly, how much
memory is required to store this data frame? Well, on most modern
computers double precision floating point numbers are stored
using 64 bits of memory, or 8 bytes. Given that information, you
can do the following calculation

1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes
= 1,440,000,000 / 2^20 bytes/MB
= 1,373.29 MB
= 1.34 GB

So the dataset would require about 1.34 GB of RAM. Most


computers these days have at least that much RAM. However, you
need to be aware of

 what other programs might be running on your computer, using up RAM


 what other R objects might already be taking up RAM in your workspace

Reading in a large dataset for which you do not have enough RAM
is one easy way to freeze up your computer (or at least your R
session). This is usually an unpleasant experience that requires you to kill
the R process, in the best case scenario, or reboot your computer, in the
worst case. So make sure to do a rough calculation of memory requirements
before reading in a large dataset. You’ll thank me later.
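
The back-of-the-envelope calculation above is easy to reproduce in R, and for
objects already in your workspace, object.size() reports the memory actually
used (the data frame below is only illustrative):

> 1500000 * 120 * 8 / 2^20                          # about 1373.29 MB
> 1500000 * 120 * 8 / 2^30                          # about 1.34 GB
> x <- data.frame(matrix(rnorm(1e6), ncol = 10))    # one million numeric values
> print(object.size(x), units = "Mb")               # about 7.6 Mb (1e6 doubles at 8 bytes each)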
