
Unit 1

Data science
• Data science is the domain of study that deals with vast volumes of
data using modern tools and techniques to find unseen patterns,
derive meaningful information, and make business decisions. Data
science uses complex machine learning algorithms to build predictive
models.

• The data used for analysis can come from many different sources and
be presented in various formats.
The Data Science Lifecycle
• Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This stage
involves gathering raw structured and unstructured data.
• Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data
Architecture. This stage covers taking the raw data and putting it in a form that can be
used.
• Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization.
Data scientists take the prepared data and examine its patterns, ranges, and biases to
determine how useful it will be in predictive analysis.
• Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining,
Qualitative Analysis. Here is the real meat of the lifecycle. This stage involves performing
the various analyses on the data.
• Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision
Making. In this final step, analysts prepare the analyses in easily readable forms such as
charts, graphs, and reports.
Prerequisites for Data Science
1. Machine Learning
• Machine learning is the backbone of data science. Data Scientists need to have a solid grasp of ML in addition to
basic knowledge of statistics.
2. Modeling
• Mathematical models enable you to make quick calculations and predictions based on what you already know
about the data. Modeling is also a part of Machine Learning and involves identifying which algorithm is the most
suitable to solve a given problem and how to train these models.
3. Statistics
• Statistics are at the core of data science. A firm grasp of statistics can help you extract more intelligence and
obtain more meaningful results.
4. Programming
• Some level of programming is required to execute a successful data science project. The most common
programming languages are Python and R. Python is especially popular because it is easy to learn and it supports
many libraries for data science and ML.
5. Databases
• A capable data scientist needs to understand how databases work, how to manage them, and how to extract data
from them.
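Because extracting data from a database is such a routine task, a minimal sketch is shown below. It assumes Python with pandas and a local SQLite file; the database name, table, and column names are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical SQLite file and table: pull only the columns needed for analysis.
conn = sqlite3.connect("sales.db")
query = "SELECT customer_id, order_date, amount FROM orders WHERE amount > 0"
df = pd.read_sql_query(query, conn)
conn.close()

print(df.head())                 # inspect the first rows
print(df["amount"].describe())   # basic summary statistics
```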
Data Scientist
• Data scientists work in a variety of fields. Each is crucial to finding
solutions to problems and requires specific knowledge. These fields
include data acquisition, preparation, mining and modeling, and
model maintenance. Data scientists take raw data and turn it into a
goldmine of information with the help of machine learning algorithms
that answer questions for businesses seeking solutions to their
queries. Each of these fields is explained in this introduction to data
science, starting with:
• Data Acquisition: Here, data scientists take data from all its raw sources, such as databases and flat files. Then, they
integrate and transform it into a homogeneous format, collecting it into what is known as a “data warehouse,” a system
from which information can easily be extracted. Also known as ETL (extract, transform, load), this step can be done with
tools such as Talend Studio, DataStage, and Informatica.
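Dedicated ETL tools such as those named above are used in practice; purely to illustrate the extract-transform-load pattern, here is a minimal pandas sketch. The file names and column names are made up.

```python
import pandas as pd

# Extract: read raw data from two hypothetical flat files.
customers = pd.read_csv("customers.csv")        # columns: id, name, country
transactions = pd.read_csv("transactions.csv")  # columns: id, customer_id, amount, ts

# Transform: integrate both sources into one homogeneous table and fix types.
merged = transactions.merge(customers, left_on="customer_id", right_on="id",
                            suffixes=("_txn", "_cust"))
merged["ts"] = pd.to_datetime(merged["ts"])

# Load: write the integrated table to the "warehouse" (here just a CSV file).
merged.to_csv("warehouse/transactions_enriched.csv", index=False)
```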
• Data Preparation: This is the most important stage, in which about 60 percent of a data scientist’s time is spent, because
data is often “dirty” or unfit for use and must be made scalable, productive, and meaningful. In fact, five sub-steps exist here:
• Data Cleaning: Important because bad data can lead to bad models, this step handles missing values and null or void
values that might cause the models to fail. Ultimately, it improves business decisions and productivity.
• Data Transformation: Takes raw data and turns it into the desired form by normalizing it. This step can use, for example,
min-max normalization or z-score normalization (see the sketch after these sub-steps).
• Handling Outliers: Outliers are values that fall far outside the range of the rest of the data. Using exploratory analysis,
a data scientist quickly uses plots and graphs to determine what to do with the outliers and to see why they are there.
Outliers are often useful for fraud detection.
• Data Integration: Here, the data scientist ensures the data is accurate and reliable.
• Data Reduction: This compiles multiple sources of data into one, increases storage capabilities, reduces costs and
eliminates duplicate, redundant data.
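As a rough illustration of the cleaning and transformation sub-steps above, the following sketch fills a missing value and applies min-max and z-score normalization with pandas; the column names and numbers are made up.

```python
import pandas as pd

# Toy table with a missing value (hypothetical columns and numbers).
df = pd.DataFrame({"age": [25, 32, None, 51],
                   "income": [40000, 52000, 61000, 75000]})

# Data cleaning: fill the missing age so downstream models do not fail on nulls.
df["age"] = df["age"].fillna(df["age"].median())

# Min-max normalization: rescale income to the [0, 1] range.
df["income_minmax"] = (df["income"] - df["income"].min()) / \
                      (df["income"].max() - df["income"].min())

# z-score normalization: center income at 0 with unit standard deviation.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df)
```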
• Data Mining: Here, data scientists uncover data patterns and relationships to make better business decisions. It’s a
discovery process for finding hidden and useful knowledge, commonly known as exploratory data analysis. Data mining is
useful for predicting future trends, recognizing customer patterns, helping to make decisions, quickly detecting fraud
and choosing the correct algorithms. Tableau works nicely for data mining.
• Model Building: This goes further than simple data mining and requires building a machine learning model. The model
is built by selecting a machine learning algorithm that suits the data, problem statement and available resources.
• There are two types of machine learning algorithms, supervised and unsupervised (contrasted in a short sketch after this list):
• Supervised: Supervised learning algorithms are used when the data is labeled. There are two types:
• Regression: When you need to predict continuous values and the variables are linearly related,
algorithms such as linear and multiple regression, decision trees, and random forests are used
• Classification: When you need to predict categorical values, some of the classification algorithms used
are KNN, logistic regression, SVM and Naïve-Bayes
• Unsupervised: Unsupervised learning algorithms are used when the data is unlabeled, i.e., there is no
labeled data to learn from. There are two types:
• Clustering: This is the method of grouping objects so that objects in the same cluster are similar to each
other and dissimilar to objects in other clusters. K-Means is a commonly used clustering algorithm, and PCA
is often applied alongside it for dimensionality reduction.
• Association-rule analysis: This is used to discover interesting relations between variables; algorithms such
as Apriori and the Hidden Markov Model can be used
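To make the supervised vs. unsupervised distinction concrete, here is a small sketch on synthetic data, assuming scikit-learn is available: logistic regression (supervised) learns from the labels, while K-Means (unsupervised) ignores them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Synthetic labeled data: X holds the features, y the class labels.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised (classification): the labels y are used during training.
clf = LogisticRegression().fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Unsupervised (clustering): the labels are ignored; K-Means groups X on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("first ten cluster assignments:", kmeans.labels_[:10])
```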
• Model Maintenance: After gathering data and performing the mining and model building, data
scientists must maintain the model’s accuracy. Thus, they take the following steps:
• Assess: Occasionally running a fresh data sample through the model to make sure it remains accurate
• Retrain: When the results of the reassessment aren’t right, the data scientist must retrain the
algorithm so that it provides correct results again
• Rebuild: If retraining fails, the model must be rebuilt.
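A minimal sketch of the assess/retrain loop, assuming a scikit-learn-style model; the helper names and the accuracy threshold are hypothetical choices, not a fixed recipe.

```python
from sklearn.metrics import accuracy_score

def assess(model, X_recent, y_recent, threshold=0.85):
    """Assess: score the deployed model on a recent labeled sample."""
    return accuracy_score(y_recent, model.predict(X_recent)) >= threshold

def maintain(model, X_recent, y_recent, X_full, y_full):
    """Retrain the same algorithm on refreshed data when the assessment fails."""
    if assess(model, X_recent, y_recent):
        return model                   # still accurate enough, keep as-is
    return model.fit(X_full, y_full)   # retrain on the full, refreshed data set
```

A rebuild, by contrast, would mean going back to model building and selecting a new algorithm from scratch.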
Exploratory data analysis
• Database-oriented data sets and applications
• Relational databases, data warehouses, transactional databases
• Advanced data sets and advanced applications
• Data streams and sensor data
• Time-series data, temporal data, sequence data (incl. bio-sequences)
• Structured data, graphs, social networks and multi-linked data
• Object-relational databases
• Heterogeneous databases and legacy databases
• Spatial data and spatiotemporal data
• Multimedia databases
• Text databases
• The World-Wide Web
Functionalities
• Multidimensional concept description: Characterization and discrimination
• Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Frequent patterns, association, correlation vs. causality
• Diaper → Beer [support = 0.5%, confidence = 75%] (Correlation or causality? A sketch after this list shows how support and confidence are computed.)
• Classification and prediction
• Construct models (functions) that describe and distinguish classes or
concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on (gas
mileage)
• Predict some unknown or missing numerical values
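The two numbers attached to a rule such as Diaper → Beer are its support and confidence. A minimal sketch, using made-up transactions (so the numbers differ from the 0.5%/75% above), shows how they are computed:

```python
# Toy market-basket data: each set is one customer transaction (made up).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"diaper", "beer"} <= t)   # contain both items
diaper = sum(1 for t in transactions if "diaper" in t)           # contain diaper

support = both / n          # share of all transactions containing both items
confidence = both / diaper  # share of diaper transactions that also contain beer
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```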
Functionalities
• Cluster analysis
• Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
• Maximizing intra-class similarity & minimizing interclass similarity
• Outlier analysis
• Outlier: Data object that does not comply with the general behavior of the data
• Noise or exception? Useful in fraud detection and rare-events analysis (see the sketch after this list)
• Trend and evolution analysis
• Trend and deviation: e.g., regression analysis
• Sequential pattern mining: e.g., digital camera → large SD memory
• Periodicity analysis
• Similarity-based analysis
• Other pattern-directed or statistical analyses
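For outlier analysis, one simple approach is to flag points that lie far from the mean in units of standard deviation (z-score). A minimal sketch with made-up numbers:

```python
import numpy as np

# Toy measurements with one value far from the rest (made-up numbers).
values = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 47.0])

# Flag points whose z-score (distance from the mean in standard deviations)
# exceeds a chosen cutoff; 2 is used here because the sample is tiny.
z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2]
print("outliers:", outliers)   # -> [47.]
```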
