
Unit I

INTRODUCTION: DATA SCIENCE AND BIG DATA

Introduction to Data science and Big Data, Defining Data science and Big Data, Big Data
examples, Data Explosion: Data Volume, Data Variety, Data Velocity and Veracity. Big data
infrastructure and challenges Big Data Processing Architectures: Data Warehouse,
Re-Engineering the Data Warehouse, shared everything and shared nothing architecture, Big data
learning approaches. Data Science – The Big Picture: Relation between AI, Statistical Learning,
Machine Learning, Data Mining and Big Data Analytics

Examples: Applications of Big Data in Real Life
Brief Introduction to Big Data
∙ Big Data is the amount of data just beyond technology’s capability to store, manage and
process efficiently.
∙ Data Science: Data Science is the discipline that uses computer science, statistics, machine learning, visualization and human-computer interaction to collect, clean, integrate, analyze and visualize data, and to interact with data to create data products.

Examples of Big Data Applications


∙ Big Data Contributions to Education

Big data has a great influence in the education world. It has been able to provide a solution to one of the biggest pitfalls in the education system, that is, the one-size-fits-all fashion of the academic set-up, by contributing to e-learning solutions.
Big data has also proven to be really important in reframing coursework and grading systems.
Big data contributions to Healthcare
Big data is in extended use in the field of medicine and healthcare. It has provided us with so many benefits in this field that, now that healthcare organizations are able to utilize big data approaches and solutions, it seems impossible, not to mention pointless, to go back to how things were before big data.
Following are some of the many ways in which big data has contributed to healthcare:
∙ Big data reduces the cost of treatment, since there are fewer chances of having to perform unnecessary diagnoses.
∙ It helps in predicting outbreaks of epidemics and also helps in deciding what preventive measures could be taken to minimize their effects.
∙ It helps avoid preventable diseases by detecting them in early stages, which keeps them from getting worse and also makes treatment easier and more effective.
∙ Patients can be provided with evidence-based medicine, which is identified and prescribed after researching past medical results.
Big data contributions in Public sector

Big data has also played a major role in public sectors. It provides a wide range of benefits to public sectors, most of which are facilities provided by Big Data to government departments, such as energy consumption analysis, fraud detection, etc. Some of the facilities of Big Data for government sectors are as follows:
∙ Big Data is used by the FDA (Food and Drug Administration) to identify and examine food-based infections.
∙ Big Data is used by the government to stay up to date in the field of agriculture as well, by keeping track of all the land and livestock that exist, the crops and farm animals, etc.
∙ Big Data is heavily used for fraud detection and has also helped in catching tax evaders.
Big Data Contributions to Communications, Media and Entertainment
Now here is an interesting one: you have been using some of the platforms that make heavy use of big data. For example:
Spotify, an on-demand music streaming platform, uses big data analytics to collect data from all of its users around the globe and then uses the analyzed data to give informed music recommendations and suggestions to every individual user.
Amazon Prime, which offers videos, music and Kindle books in a one-stop shop, also relies heavily on big data.
∙ Weather patterns
There are weather sensors deployed all around the globe, and data is collected from them. There are also satellites, deployed by the Joint Polar Satellite System, which are used to monitor weather and environmental conditions.
All of the data collected from these sensors and satellites can be used in different ways, such as weather forecasting, studying global warming, understanding the patterns of natural disasters to make necessary preparations in case of crisis, predicting the availability of usable water around the world, and many more.
∙ Big Data Contributions to Transportation

Since the rise of big data, it has been used in various ways to make transportation more efficient
and easy. Following are some of the areas where big data contributed to transportation.
∙ Route planning: Big data can be used to understand and estimate users' needs on different routes and across multiple modes of transportation, and then to plan routes that reduce users' wait times.
∙ Congestion management by predicting traffic conditions: Using big data, real-time estimation of congestion and traffic patterns is now possible. For example, people use Google Maps to locate the least traffic-prone routes.
∙ Safety level of traffic: Real-time processing of big data and predictive analysis to identify accident-prone areas can help reduce accidents and increase the safety level of traffic.
And guess what? We too make use of this application when we plan a route to save fuel and time, based on our knowledge of having taken that particular route sometime in the past. In this case we analysed and made use of the data that we had previously acquired through experience and then used it to make a smart decision. It is pretty cool that big data plays a part not only in such big fields but also in our smallest day-to-day decisions, isn't it?
∙ Big Data Contributions to the Banking Sector and Fraud Detection
Benefits of big data
Why Big Data is so important?
This has been one of the most asked questions since the advent of Big Data.
The importance of big data lies in how an organization uses the collected data, not in how much data it has been able to collect. If an organization wants to benefit from Big Data, then being able to use it efficiently is what matters most.
To do that, there are Big Data solutions that make the analysis of Big Data much easier than it used to be, which in turn helps organizations make better and smarter business decisions using big data. This is where the benefits of Big Data start showing up. Let's see what those benefits are:
∙ Gaining Insights:
In older times, when even storing and managing big data was considered a tedious task, let alone analyzing it to gain benefits, a very large amount of data went unused and wasted, even though it could contain important information and help provide insights about businesses or industries. Now, with all the different Big Data solutions available to manage and analyze Big Data, data containing information no longer goes unused. Big Data now provides deep insights with the help of not only structured data but also unstructured and semi-structured data.
∙ Prediction and Decision making:
Now that organizations are able to analyze Big Data, they have successfully started using it to mitigate risks revolving around various factors of their businesses. Using Big Data to reduce the risks in organizational decisions and to make predictions has become one of the many benefits big data brings to industries.
∙ Cost-effectiveness:
Using and analyzing Big Data to make relevant predictions and smart decisions also makes organizations cost-effective. In addition, using Big Data tools to manage and analyze data brings cost advantages to businesses, especially when a huge amount of data has to be stored and processed.
∙ Marketing effectiveness:

Big Data, along with helping businesses and organizations make smart decisions, also drastically increases their sales and marketing effectiveness, thus greatly improving their performance in the industry. Organizations can also use big data to understand the latest trends in customer and user needs through real-time analytics, act on them accordingly, and increase their market value.
So, to sum it all, we have tabulated the benefits of Big Data in brief:
Areas of concern               | Big Data benefits
Insights                       | Big data can provide better insights with the help of unstructured and semi-structured data
Prediction and Decision making | Big data helps mitigate risk and make smart decisions through proper risk analysis
Cost-effectiveness             | Better and smarter decisions made with the help of Big Data make organizations cost-effective
Marketing effectiveness        | Big Data also increases the sales and marketing effectiveness of an organization
When talking about big data, we all have heard about the 3V’s of big data. It’s nothing but a
better way to define big data. Big data initially consisted of three attributes namely volume,
velocity, and variety.
While these three attributes pretty much gave the essence of the definition of big data, there is
another attribute that was added in the list, later on, termed as veracity.
Let’s see what these attributes mean:
∙ Velocity– Denotes the speed at which data is emanating and changes are occurring between
the diverse data sets.

Velocity:
∙ Velocity refers to the high speed of accumulation of data.
∙ In Big Data, velocity means that data flows in from sources like machines, networks, social media, mobile phones, etc.
∙ There is a massive and continuous flow of data. This determines the potential of the data: how fast the data is generated and must be processed to meet demand.
∙ Sampling data can help in dealing with issues of velocity (see the sketch after this list).
∙ Example: More than 3.5 billion searches are made on Google per day. Also, the number of Facebook users is increasing by approximately 22% year on year.
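
The bullet on sampling above can be made concrete with a minimal, illustrative Python sketch (not part of the original text) of reservoir sampling, which keeps a fixed-size uniform random sample of an unbounded, fast-arriving stream; the stream source and sample size here are assumed for the example.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)     # replace an existing item with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Example: sample 5 events from a simulated high-velocity stream of 1,000,000 events
events = (f"event-{n}" for n in range(1_000_000))
print(reservoir_sample(events, 5))
```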

∙ Volume – This refers to the sheer volume of data being generated every second.

Volume:

∙ The name 'Big Data' itself is related to a size which is enormous.
∙ Volume refers to the huge amount of data.
∙ To determine the value of data, the size of the data plays a very crucial role. If the volume of data is very large, then it can actually be considered 'Big Data'. This means that whether particular data can be considered Big Data or not depends upon the volume of the data.
∙ Hence, while dealing with Big Data it is necessary to consider the characteristic 'Volume'.
∙ Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month. Also, by the year 2020 we will have almost 40,000 exabytes of data.

∙ Variety– As more and more data is being digitized, some data is found to be structured and
some is found to be unstructured in nature, with big data in the picture we can use
structured as well as unstructured data.

Variety:
∙ It refers to the nature of data: structured, semi-structured and unstructured data.
∙ It also refers to heterogeneous sources.
∙ Variety is basically the arrival of data from new sources, both inside and outside an enterprise. It can be structured, semi-structured or unstructured (a small illustrative sketch follows this list).
o Structured data: This is basically organized data. It generally refers to data whose length and format have been defined.
o Semi-structured data: This is basically semi-organized data. It is generally a form of data that does not conform to the formal structure of data. Log files are examples of this type of data.
o Unstructured data: This basically refers to unorganized data. It generally refers to data that does not fit neatly into the traditional row-and-column structure of a relational database. Text, pictures, videos, etc. are examples of unstructured data, which cannot be stored in the form of rows and columns.
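
To make the three categories above concrete, here is a small hypothetical Python sketch (the sample values are invented for illustration) showing how each kind of data typically appears and is handled.

```python
import csv, json, io

# Structured: fixed columns, fits rows-and-columns storage
structured = "id,name,amount\n1,Alice,250\n2,Bob,125\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing but flexible schema (e.g. a JSON log line)
log_line = '{"ts": "2023-01-01T10:00:00", "level": "INFO", "msg": "user login", "user": 1}'
log = json.loads(log_line)

# Unstructured: free text (or images, video) with no predefined fields
review = "The delivery was late but the product itself works great."

print(rows[0]["name"], log["level"], len(review.split()), "words")
```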

∙ Veracity – This refers to the discrepancies found in the data.


All of these attributes contribute to the definition of big data but then again where does it all lead
us? To answer that question, people have largely started taking yet another attribute into
consideration, that is, the 5th V termed as value.

Veracity:

∙ It refers to inconsistencies and uncertainty in data; that is, the available data can sometimes get messy, and its quality and accuracy are difficult to control.
∙ Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
∙ Example: Data in bulk can create confusion, whereas a smaller amount of data may convey only half or incomplete information.

∙ Value – Having access to big data is all well and good, but it is only useful if we can turn it into value.

Value:

∙ After taking the 4 V's into account, there comes one more V, which stands for Value. A bulk of data having no value is of no good to the company unless it is turned into something useful.
∙ Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information. Hence, you can state that Value is the most important of all the 5 V's.

Goal of Data Science


∙ Turn data into data products
Data Science – A Visual Definition

The Data Scientist


“The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”
“The critical job in the next 20 years will be the analytic scientist … the individual with the
ability to understand a problem domain, to understand and know what data to collect about it, to
identify analytics to process that data/information, to discover its meaning, and to extract
knowledge from it—that’s going to be a very critical skill.”
Analytic scientists require advanced training in specific domains, data science tools,
multiple analytics, and visualization to perform predictive and prescriptive analytics. They may
hold Ph.D.’s, but pragmatic experience in a domain will be equally important.
The Goal of Analytics
From sensors (data collection, measurement, observation, …)
to Monitoring and Alerting to Sensemaking (Data and Analytics Science)

Big Data Challenges


1. Dealing with data growth
The most obvious challenge associated with big data is simply storing and analyzing all that
information.
2. Generating insights in a timely manner
Organizations don't just want to store their big data — they want to use that big data to achieve
business goals.
3. Recruiting and retaining big data talent
Organizations need professionals with big data skills. That has driven up demand for big data
experts — and big data salaries have increased dramatically as a result.
4. Integrating disparate data sources
The variety associated with big data leads to challenges in data integration. Big data comes from
a lot of different places — enterprise applications, social media streams, email systems,
employee-created documents, etc. Combining all that data and reconciling it so that it can be used
to create reports can be incredibly difficult.
5. Validating data
Closely related to the idea of data integration is the idea of data validation. Often organizations
are getting similar pieces of data from different systems, and the data in those different systems
doesn't always agree.
6. Securing big data
Security is also a big concern for organizations with big data stores. After all, some big data
stores can be attractive targets for hackers or advanced persistent threats.
Data processing architectures
Methods to interact with data storage.
∙ Lambda
∙ Kappa
∙ Zeta
Lambda Architecture
Lambda architecture is a data processing technique that is capable of dealing with huge amounts of data in an efficient manner. The efficiency of this architecture becomes evident in the form of increased throughput, reduced latency and negligible errors. When we mention data processing here, we basically use the term to represent high throughput, low latency and near-real-time applications. The architecture also allows developers to define delta rules, in the form of code logic or natural language processing (NLP), in event-based data processing models, to achieve robustness, automation and efficiency and to improve data quality. Moreover, any change in the state of data is an event to the system, and it is possible to issue a command, run a query or carry out delta procedures as a response to such events on the fly.
Event sourcing is the concept of using events to make predictions as well as to store the changes in a system in real time; a change of state of the system or an update in a database can be understood as an event. For instance, if someone interacts with a web page or a social network profile, events like a page view, a like or an 'Add as a Friend' request are triggered, and these events can be processed or enriched and the data stored in a database.
Data processing deals with these event streams, and most enterprise software that follows Domain-Driven Design uses the stream processing method to predict updates to the basic model and to store the distinct events that serve as a source for predictions in a live data system. To handle the numerous events occurring in a system, or delta processing, Lambda architecture enables data processing by introducing three distinct layers. Lambda architecture comprises the Batch Layer, the Speed Layer (also known as the Stream Layer) and the Serving Layer.
1. Batch layer
New data keeps coming as a feed to the data system. At every instance it is fed to the batch layer and the speed layer simultaneously. Any new data stream that comes to the batch layer of the data system is computed and processed on top of a data lake. When data gets stored in the data lake, using databases such as in-memory databases or long-term persistent ones like NoSQL-based storage, the batch layer uses it to process the data, using MapReduce or machine learning (ML), to prepare the upcoming batch views.
2. Speed Layer (Stream Layer)
The speed layer uses the results of the event sourcing done at the batch layer. The data streams processed in the batch layer result in an updated delta process, MapReduce job or machine learning model, which is further used by the stream layer to process the new data fed to it. The speed layer provides outputs on the basis of this enrichment process and supports the serving layer in reducing the latency of responding to queries. As is obvious from its name, the speed layer has low latency because it deals with real-time data only and has a smaller computational load.
3. Serving Layer
The outputs from the batch layer, in the form of batch views, and from the speed layer, in the form of near-real-time views, are forwarded to the serving layer, which uses this data to answer the pending queries on an ad-hoc basis.
Here is a basic diagram of what Lambda Architecture model would look like:
Lambda Architecture
Let's translate that into a functional equation which defines any query in the big data domain. The symbol used in this equation is lambda (λ), and the name of the Lambda architecture is coined from the same equation. This function is widely known to those who are familiar with big data analysis.
Query = λ (Complete data) = λ (live streaming data) * λ (Stored data)
The equation means that all data-related queries can be catered for in the Lambda architecture by combining the results from historical storage, in the form of batches, with live streaming, with the help of the speed layer.
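
As an illustration of the equation above, here is a minimal Python sketch (invented data and function names, not a production implementation) in which a query result is obtained by combining a precomputed batch view over stored data with a real-time view maintained by the speed layer.

```python
from collections import Counter

# Batch layer: periodically recomputes a view over the complete historical data lake
historical_events = ["page_view", "like", "page_view", "add_friend", "page_view"]
def compute_batch_view(events):
    return Counter(events)                      # e.g. counts per event type

# Speed layer: incrementally updates a view from events that arrived after the last batch run
realtime_view = Counter()
def on_new_event(event):
    realtime_view[event] += 1

# Serving layer: answers a query by merging both views (Query = f(batch view, real-time view))
def query(event_type, batch_view):
    return batch_view[event_type] + realtime_view[event_type]

batch_view = compute_batch_view(historical_events)
on_new_event("page_view")                       # arrives while the next batch is still pending
print(query("page_view", batch_view))           # 3 from batch + 1 from speed layer = 4
```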
Applications of Lambda Architecture
Lambda architecture can be deployed for those data processing enterprise models where:
∙ User queries are required to be served on an ad-hoc basis using immutable data storage.
∙ Quick responses are required and the system should be capable of handling various updates in the form of new data streams.
∙ None of the stored records shall be erased, and the system should allow the addition of updates and new data to the database.
Lambda architecture can be considered as near real-time data processing architecture. As
mentioned above, it can withstand the faults as well as allows scalability. It uses the functions of
batch layer and stream layer and keeps adding new data to the main storage while ensuring that
the existing data will remain intact. Companies like Twitter, Netflix, and Yahoo are using this
architecture to meet the quality of service standards.
Pros and Cons of Lambda Architecture
Pros
∙ The batch layer of Lambda architecture manages historical data with fault-tolerant distributed storage, which ensures a low possibility of errors even if the system crashes.
∙ It is a good balance of speed and reliability.
∙ Fault-tolerant and scalable architecture for data processing.
Cons
∙ It can result in coding overhead due to the comprehensive processing involved.
∙ It re-processes every batch cycle, which is not beneficial in certain scenarios.
∙ A data model built with Lambda architecture is difficult to migrate or reorganize.
Kappa Architecture
In 2014, Jay Kreps started a discussion in which he pointed out some drawbacks of Lambda architecture; this led the big data world to an alternative architecture that uses fewer code resources and is capable of performing well in certain enterprise scenarios where a multi-layered Lambda architecture seems like extravagance.
Kappa architecture should not be taken as a substitute for Lambda architecture; on the contrary, it should be seen as an alternative to be used in those circumstances where the active performance of a batch layer is not necessary for meeting the standard quality of service. This architecture finds its applications in the real-time processing of distinct events. Here is a basic diagram for the Kappa architecture that shows the two-layer system of operation for this data processing architecture.

Kappa Architecture

Let's translate the operational sequencing of the Kappa architecture into a functional equation which defines any query in the big data domain.
Query = K (New Data) = K (Live streaming data)
The equation means that all the queries can be catered for by applying the kappa function to the live streams of data at the speed layer. It also signifies that stream processing occurs on the speed layer in the Kappa architecture.
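
The following minimal Python sketch (with an invented in-memory 'log' standing in for something like a Kafka topic) illustrates the Kappa idea: there is only a stream path, state is built by processing the log, and reprocessing is done simply by replaying the same retained log through new code.

```python
from collections import Counter

# An append-only, retained event log (in Kafka this would be a topic with long retention)
event_log = ["page_view", "like", "page_view", "add_friend"]

def process(log, logic):
    """Single processing path: fold the whole log (and any new events) through one function."""
    state = Counter()
    for event in log:
        logic(state, event)
    return state

# Version 1 of the stream job: count every event type
v1 = process(event_log, lambda s, e: s.update([e]))

# Code change: only count engagement events; reprocess by replaying the same retained log
ENGAGEMENT = {"like", "add_friend"}
v2 = process(event_log, lambda s, e: s.update([e]) if e in ENGAGEMENT else None)

print(v1, v2)
```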
Applications of Kappa architecture
Some variants of social network applications, devices connected to a cloud-based monitoring system, and the Internet of Things (IoT) use an optimized version of the Lambda architecture which mainly uses the services of the speed layer, combined with the streaming layer, to process the data over the data lake.
Kappa architecture can be deployed for those data processing enterprise models where:
∙ Multiple data events or queries are logged in a queue to be catered for against distributed file system storage or history.
∙ The order of the events and queries is not predetermined. Stream processing platforms can interact with the database at any time.
∙ The system is resilient and highly available, as handling terabytes of storage is required for each node of the system to support replication.
The above-mentioned data scenarios are handled by using Apache Kafka, which is extremely fast, fault tolerant and horizontally scalable. It provides a good mechanism for governing the data streams. A balanced control of the stream processors and databases makes it possible for the applications to perform as expected. Kafka retains the ordered data for long durations and caters for analogous queries by linking them to the appropriate position of the retained log. LinkedIn and some other applications use this flavor of big data processing and reap the benefit of retaining large amounts of data to cater for those queries that are mere replicas of each other.
Pros and Cons of Kappa architecture
Pros
∙ Kappa architecture can be used to develop data systems that are online learners and
therefore don’t need the batch layer.
∙ Re-processing is required only when the code changes.
∙ It can be deployed with fixed memory.
∙ It can be used for horizontally scalable systems.
∙ Fewer resources are required, as the machine learning is done in real time.
Cons
The absence of a batch layer might result in errors during data processing or while updating the database; this requires an exception manager to reprocess the data or perform reconciliation.
Conclusion
In short, the choice between Lambda and Kappa architectures is a tradeoff. If you seek an architecture that is more reliable in updating the data lake as well as efficient in devising machine learning models to predict upcoming events in a robust manner, you should use the Lambda architecture, as it reaps the benefits of both the batch layer and the speed layer to ensure fewer errors and speed. On the other hand, if you want to deploy a big data architecture using less expensive hardware and need it to deal effectively with unique events occurring at runtime, then select the Kappa architecture for your real-time data processing needs.
Zeta Architecture
The Zeta Architecture is a high-level enterprise architectural construct not unlike the Lambda
architecture which enables simplified business processes and defines a scalable way to increase
the speed of integrating data into the business. The result? A powerful, data-centric enterprise.
There are seven pluggable components of the Zeta Architecture which work together, reducing
system-level complexity while radically increasing resource utilization and efficiency.
∙ Distributed File System - all applications read and write to a common, scalable solution,
which dramatically simplifies the system architecture.
∙ Real-time Data Storage - supports the need for high-speed business applications through
the use of real-time databases.
∙ Pluggable Compute Model / Execution Engine - delivers different processing engines and
models in order to meet the needs of diverse business applications and users in an
organization.
∙ Deployment / Container Management System - provides a standardized approach for
deploying software. All resource consumers are isolated and deployed in a standard way.
∙ Solution Architecture - focuses on solving specific business problems, and combines one
or more applications built to deliver the complete solution. These solution architectures
encompass a higher-level interaction among common algorithms or libraries, software
components and business workflows.
∙ Enterprise Applications - brings simplicity and reusability by delivering the components
necessary to realize all of the business goals defined for an application.
∙ Dynamic and Global Resource Management - allows dynamic allocation of resources so
that you can accommodate whatever task is the most important for that day.
Benefits of Zeta Architecture
There are several benefits to implementing a Zeta Architecture in your organization
∙ Reduce time and costs of deploying and maintaining applications
∙ Fewer moving parts with simplifications such as using a distributed file system
∙ Less data movement and duplication - transforming and moving data around will no
longer be required unless a specific use case calls for it
∙ Simplified testing, troubleshooting, and systems management
∙ Better resource utilization to lower data center costs

What Is Data Warehousing? Types, Definition & Example


https://www.youtube.com/watch?v=T_D2tDTmrWE
What is Data Warehousing?
A data warehouse is a collection of technologies aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions.
Data Warehousing (DW) is a process for collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data warehouse is the core of the BI system, which is built for data analysis and reporting.
It is a blend of technologies and components which aids the strategic use of data. It is electronic storage of a large amount of information by a business, designed for query and analysis instead of transaction processing. It is a process of transforming data into information and making it available to users in a timely manner to make a difference.
The decision support database (Data Warehouse) is maintained separately from the organization's
operational database. However, the data warehouse is not a product but an environment. It is an
architectural construct of an information system which provides users with current and historical
decision support information which is difficult to access or present in the traditional operational
data store.
Data warehouse system is also known by the following name:
∙ Decision Support System (DSS)
∙ Executive Information System
∙ Management Information System
∙ Business Intelligence Solution
∙ Analytic Application
∙ Data Warehouse
How does a Data Warehouse work?
A Data Warehouse works as a central repository where information arrives from one or more data
sources. Data flows into a data warehouse from the transactional system and other relational
databases.
Data may be:
∙ Structured
∙ Semi-structured
∙ Unstructured data
The data is processed, transformed, and ingested so that users can access the processed data in the
Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data
warehouse merges information coming from different sources into one comprehensive database.
By merging all of this information in one place, an organization can analyze its customers more
holistically. This helps to ensure that it has considered all the information available. Data
warehousing makes data mining possible. Data mining is looking for patterns in the data that may
lead to higher sales and profits.
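
As a toy illustration of the flow described above (extract from heterogeneous sources, transform into one consistent format, load into a central store that BI tools can query), here is a hedged Python sketch using an in-memory SQLite database; the source data, column names and table are invented for the example.

```python
import sqlite3, json

# Extract: two heterogeneous sources, e.g. a CRM export and a JSON web-analytics feed
crm_rows = [("1", "Alice", "250.00"), ("2", "Bob", "125.50")]
web_feed = '[{"customer_id": 1, "page_views": 40}, {"customer_id": 2, "page_views": 12}]'

# Transform: normalize types and merge on a common key
spend = {int(cid): float(amount) for cid, _, amount in crm_rows}
names = {int(cid): name for cid, name, _ in crm_rows}
views = {row["customer_id"]: row["page_views"] for row in json.loads(web_feed)}

# Load: one comprehensive table in the (in-memory) warehouse
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE customer_facts (customer_id INTEGER, name TEXT, total_spend REAL, page_views INTEGER)")
dw.executemany(
    "INSERT INTO customer_facts VALUES (?, ?, ?, ?)",
    [(cid, names[cid], spend[cid], views.get(cid, 0)) for cid in names],
)

# A typical analytical query over the merged, holistic view of the customer
for row in dw.execute("SELECT name, total_spend, page_views FROM customer_facts ORDER BY total_spend DESC"):
    print(row)
```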
Types of Data Warehouse
Three main types of Data Warehouses are:
1. Enterprise Data Warehouse:
An Enterprise Data Warehouse is a centralized warehouse. It provides decision support services across the enterprise. It offers a unified approach for organizing and representing data. It also provides the ability to classify data according to subject and to give access according to those divisions.
2. Operational Data Store:
An Operational Data Store, also called an ODS, is nothing but a data store required when neither a data warehouse nor OLTP (Online Transaction Processing) systems support an organization's reporting needs. In an ODS, the data warehouse is refreshed in real time. Hence, it is widely preferred for routine activities like storing records of employees and regular data updates.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as sales or finance. In an independent data mart, data can be collected directly from sources.
General stages of Data Warehouse
Earlier, organizations made relatively simple use of data warehousing. However, over time, more sophisticated use of data warehousing began.
The following are general stages of use of the data warehouse:
Offline Operational Database:
In this stage, data is just copied from an operational system to another server. In this way,
loading, processing, and reporting of the copied data do not impact the operational system's
performance.
Offline Data Warehouse:
Data in the Data warehouse is regularly updated from the Operational Database. The data in Data
warehouse is mapped and transformed to meet the Data warehouse objectives.
Real time Data Warehouse:
In this stage, Data warehouses are updated whenever any transaction takes place in operational
database. For example, Airline or railway booking system.
Integrated Data Warehouse:
In this stage, Data Warehouses are updated continuously when the operational system performs a
transaction. The Data warehouse then generates transactions which are passed back to the
operational system.
Components of Data warehouse
Four components of Data Warehouses are:
Load manager: The load manager is also called the front-end component. It performs all the operations associated with the extraction and loading of data into the warehouse. These operations include transformations to prepare the data for entering the data warehouse.
Warehouse Manager: The warehouse manager performs operations associated with the management of the data in the warehouse. It performs operations like analysis of data to ensure consistency, creation of indexes and views, generation of denormalizations and aggregations, transformation and merging of source data, and archiving and backing up data.
Query Manager: The query manager is also known as the back-end component. It performs all the operations related to the management of user queries. This component directs queries to the appropriate tables and schedules the execution of queries.
End-user access tools:
These are categorized into five different groups:
∙ Data reporting tools
∙ Query tools
∙ Application development tools
∙ EIS (Executive Information System) tools
∙ OLAP tools and data mining tools
Who needs Data warehouse?
A data warehouse is needed for all types of users, such as:
∙ Decision makers who rely on massive amounts of data.
∙ Users who use customized, complex processes to obtain information from multiple data sources.
∙ People who want a simple technology to access the data.
∙ People who want a systematic approach to making decisions.
∙ Users who want fast performance on a huge amount of data, which is a necessity for reports, grids or charts.
∙ Users who want to discover 'hidden patterns' of data flows and groupings; a data warehouse is the first step here.
Benefits of a Data Warehouse
Organizations have a common goal – to make better business decisions. A data warehouse, once
implemented into your business intelligence framework, can benefit your company in numerous
ways. A data warehouse:
1. Delivers enhanced business intelligence
By having access to information from various sources from a single platform, decision makers
will no longer need to rely on limited data or their instinct. Additionally, data warehouses can
effortlessly be applied to a business’s processes, for instance, market segmentation, sales, risk,
inventory, and financial management.
2. Saves time
A data warehouse standardizes, preserves, and stores data from distinct sources, aiding the
consolidation and integration of all the data. Since critical data is available to all users, it allows
them to make informed decisions on key aspects. In addition, executives can query the data
themselves with little to no IT support, saving more time and money.
3. Enhances data quality and consistency
A data warehouse converts data from multiple sources into a consistent format. Since the data
from across the organization is standardized, each department will produce results that are
consistent. This will lead to more accurate data, which will become the basis for solid decisions.
4. Generates a high Return on Investment (ROI)
Companies that invest in a data warehouse experience higher revenues and cost savings than those that haven't.
5. Provides competitive advantage
Data warehouses help organizations get a holistic view of their current standing and evaluate opportunities and risks, thus providing them with a competitive advantage.
6. Improves the decision-making process
Data warehousing provides better insights to decision makers by maintaining a cohesive database
of current and historical data. By transforming data into purposeful information, decision makers
can perform more functional, precise, and reliable analysis and create more useful reports with
ease.
7. Enables organizations to forecast with confidence
Data professionals can analyze business data to make market forecasts, identify potential KPIs,
and gauge predicted results, allowing key personnel to plan accordingly.
8. Streamlines the flow of information
Data warehousing facilitates the flow of information through a network connecting all related or
non-related parties.

What Is a Data Warehouse Used For?


Here, are most common sectors where Data warehouse is used:
Airline:
In the airline system, it is used for operational purposes like crew assignment, analysis of route profitability, frequent flyer program promotions, etc.
Banking:
It is widely used in the banking sector to manage the resources available on the desk effectively. A few banks also use it for market research and for performance analysis of products and operations.
Healthcare:
The healthcare sector also uses the data warehouse to strategize and predict outcomes, generate patients' treatment reports, share data with tie-in insurance companies, medical aid services, etc.
Public sector:
In the public sector, the data warehouse is used for intelligence gathering. It helps government agencies to maintain and analyze tax records, health policy records, etc., for every individual.
Investment and Insurance sector:
In this sector, the warehouses are primarily used to analyze data patterns, customer trends, and to
track market movements.
Retail chain:
In retail chains, the data warehouse is widely used for distribution and marketing. It also helps to track items, customer buying patterns and promotions, and is used for determining pricing policy.
Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions and to make
distribution decisions.
Hospitality Industry:
This industry utilizes warehouse services to design and estimate their advertising and promotion campaigns, targeting clients based on their feedback and travel patterns.
Steps to Implement Data Warehouse
The best way to address the business risk associated with a data warehouse implementation is to employ a three-pronged strategy as below:
∙ Enterprise strategy: Here we identify technical requirements, including the current architecture and tools. We also identify facts, dimensions, and attributes. Data mapping and transformation are also addressed.
∙ Phased delivery: Datawarehouse implementation should be phased based on subject areas.
Related business entities like booking and billing should be first implemented and then
integrated with each other.
∙ Iterative Prototyping: Rather than a big bang approach to implementation, the
Datawarehouse should be developed and tested iteratively.
Disadvantages and Limitations of a Data Warehouse
This section discusses various disadvantages and limitations of a data warehouse.
Maintenance costs outweigh the benefits
A data warehouse is a huge IT project and involves high-maintenance systems, which may affect the revenue of medium-scale organizations. The cost-to-benefit ratio can be on the lower side, as it involves not only suitably equipped systems but also long hours invested by the IT department.
This can restrict the organization's growth, especially when it is a business that is adapting to its market conditions.
Data Ownership
An important concern with data warehouses is the security of data. Primarily, data warehouses serve many software applications, which restricts data security, as data which has been implemented locally might be sensitive for only a certain department.
Leaking of data within the same organization could cause problems for the executives. You can avoid this by ensuring that the individuals entrusted with the analysis are trusted employees of the company with no departmental bias, as that could lead to reluctance because of data censorship.
Data Rigidity
The type of data imported into a data warehouse is often a static data set, which has the least flexibility to generate specific solutions. Before the data can be used, it has to be transformed and cleansed, which could take several days or weeks.
Moreover, warehouses are subjected to ad hoc queries, which are extremely difficult to serve because of limited processing and query speed. Even though queries are supposed to be restricted to the data marts created during consolidation and integration, most of them remain ad hoc queries.
Underestimation of ETL (Extract, Transform, Load) processing time
Often organizations do not estimate the time required for the ETL process and find their work interrupted, leading to backlogs. A significant portion of the time required for the entire process of data warehouse development goes into the extraction, cleaning, and loading of consolidated data into the warehouse. Even with tools to make the process faster, efficient transformation can take several days or weeks.
Hidden problems of the Source
Hidden problems of the source arise when an organization, after several years of operation, finds itself with problems related to the original source systems that were involved in importing data into the warehouse.
Practically, a human error while entering data such as property details, for example leaving certain fields incomplete or improperly filled, could result in void property data.
Inability to capture required data
There is always the probability that the data which was required for analysis by the organization
was not integrated into the warehouse leading to loss of information. Consider the example of
property registration, apart from the regular details, the date of registration plays an important
role in statistical analysis at the end of the month. However, such data could be overlooked and
not imported.
Increased demands of the users
After success with the initial few queries, users of the facility may ask more complicated queries
which would increase the workload on the system and server. With awareness of the features of
the data warehouse, there might also be an increase in the number of queries posed by the staff
which also increase the server load.
However, if your organization’s systems and servers are equipped with high-end hardware, this
wouldn’t be a problem.
Long-duration project
A comprehensive warehouse project might take up to three years to complete. Not all organizations are able to dedicate themselves entirely to it and are hence more reluctant to invest in a data warehouse. If the organization has very little historical data to offer, a majority of the data mart features will not be utilized much and will only be a limitation.
Complications
The integration feature is one of the most important aspects of a data warehouse, which is why it is recommended that an organization pays special attention to the disparate and equally compelling data warehousing tools and their results, to arrive at a proper business conclusion and make its decision. This could be a challenging task if the organization's management is not dedicated and lacks experience.
Introduction to Data Warehousing and Business Intelligence
A Data Warehouse may be described as a consolidation of data from multiple sources that is
designed to support strategic and tactical decision making for organizations. The primary purpose
of DW is to provide a coherent picture of the business at a point in time.
Business Intelligence (BI), on the other hand, describes a set of tools and methods that transform
raw data into meaningful patterns for actionable insights and improving business processes. This
usually involves data preparation, data analytics, and data visualization.
However, BI tools vary greatly in capabilities, and while full-stack solutions aim to provide all three of these, many tools labeled as BI offer only analytics and visualization. Such BI tools require a data warehouse to work with unstructured data, as the tools have very limited data preparation capabilities. However, certain full-stack Business Intelligence Analytics & Dashboard Software, such as Sisense, can provide end users with an end-to-end solution that does not require additional investment in data warehousing.
Business Intelligence is an umbrella term that is used interchangeably with Data Analytics or to
describe a process which includes data preparation, analytics, and visualization. Data
warehousing describes tools that take care of joining disparate data sources, cleaning the data
and preparing it for analysis.
Reengineering data warehouse
Reengineering a data warehouse is done in such a way that the risk associated with the corporate/public sector is determined and reduced in the early phase of reengineering or the early phase of software development, and so that a modification in one part of an application, in 'n' parts of an application, or in the whole application does not affect other applications that exist in the corporate/public sector. This entire task is performed through a pattern-based reengineering methodology.
Shared Nothing vs Shared Everything
In a database cluster implementation we can choose between multiple ways in which the different nodes communicate with each other.
Shared nothing approach: None of the nodes uses another node's memory or storage. This is best suited for solutions where inter-node communication is not required, i.e. a node can come up with a solution on its own.
Shared Memory: In this approach memory is shared, i.e. each node/processor works with the same memory. This is used when we need nodes to share solutions/calculations done by other nodes that are available in memory.
Shared Everything: In this approach nodes share memory plus storage. This makes sense when nodes are working on a problem where the calculations and data created/used by one node depend on the others.
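
To illustrate the shared-nothing idea, here is a small hypothetical Python sketch: rows are hash-partitioned across nodes, each node answers a query using only its own local data, and a coordinator merges the partial results. The node count and data are invented for the example.

```python
# Shared-nothing sketch: each "node" owns a private partition; nothing is shared between nodes.
NUM_NODES = 3
nodes = [[] for _ in range(NUM_NODES)]          # each inner list is a node's private local storage

def route(key):
    return hash(key) % NUM_NODES                # client requests routed to the owning node

# Load some (customer, amount) rows; each row lives on exactly one node
for customer, amount in [("alice", 250), ("bob", 125), ("carol", 300), ("alice", 50)]:
    nodes[route(customer)].append((customer, amount))

# Query: total amount per customer. Each node computes a partial answer on its own data...
def local_totals(partition):
    totals = {}
    for customer, amount in partition:
        totals[customer] = totals.get(customer, 0) + amount
    return totals

# ...and the coordinator merges the partial results (the only inter-node communication is messages).
merged = {}
for partial in map(local_totals, nodes):
    for customer, total in partial.items():
        merged[customer] = merged.get(customer, 0) + total

print(merged)                                    # e.g. {'alice': 300, 'bob': 125, 'carol': 300}
```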
Clusters :
Clustering involves multiple, independent computing systems working together as one. So, if one
of the independent systems fails, the cluster software can distribute work from the failing system
to the remaining systems in the cluster.
Why Use Clusters?
Companies generally turn to clustering to improve availability and scalability. Clusters improve availability by providing alternate options in case a system fails: work from the failing system is distributed to the remaining systems in the cluster. Users won't know the difference – they interact with a cluster as though it were a single server – and the resources they rely on will still be available.
Most companies consider enhanced availability the primary benefit of clustering. In some cases,
clustering can help companies achieve “five nines” (99.999 percent) availability.
But clustering offers scalability benefits, too. When the load exceeds the capabilities of the systems that make up the cluster, you can incrementally add more systems to increase the cluster's computing power and meet processing requirements. As traffic or availability requirements increase, all or some parts of the cluster can be increased in size or number.
Types of Clustering
Shared-disk and shared-nothing architectures are the predominant approaches to clustering. The
names are fairly accurate descriptions of each type.
In a shared-nothing environment, each system has its own private (not shared) memory and one or
more disks (see Figure 1). The clustered processors communicate by passing messages through a
network that interconnects the computers. Client requests are automatically routed to the system
that owns the resource. Resources include things like memory, disk, and really any computing
resource at the disposal of the computer. Only one of the clustered systems can “own” and access
a particular resource at a time. But, in the event of a failure, resource ownership may be
dynamically transferred to another system in the cluster.

Figure 1. The shared-nothing architecture


Shared-nothing clustering offers excellent scalability. In theory, a shared-nothing multiprocessor
can scale up to thousands of processors because the processors do not interfere with one another
– no resources are shared. For this reason shared-nothing is generally preferable to other forms of
clustering. Furthermore, the scalability of shared-nothing clustering makes it ideal for read
intensive analytical processing typical of data warehouses.
In a shared-disk environment, all of the connected systems share the same disk devices (see
Figure 2). Each processor still has its own private memory, but all the processors can directly
address all the disks.

Figure 2. The shared-disk architecture


Typically, shared-disk clustering does not scale as well as shared-nothing for smaller machines.
Because all the nodes have access to the same data, they need a controlling facility to direct
processing so all the nodes have a consistent view of the data as it changes. Furthermore, attempts
by two (or more) nodes to update the same data at the same time must be prohibited. These
management requirements can impose performance and scalability problems for shared disk
systems. But with some optimization techniques shared-disk is well-suited to the large-scale
processing you find in mainframe environments. Mainframes are already very large processors
capable of processing enormous volumes of work. To equal the computing power resulting from
only a few clustered mainframes would require many, many clustered PC and midrange
processors.
Benefits of Shared Nothing Architecture
Scaling becomes simpler when things such as disks are not shared. For example, scaling up a
single shared disk to get more storage space can lead to enormous problems if things do not go
well, as all of the other resources require that disk to be able to do their work.
On the other hand, if you are using several nodes that do not share the space, scaling up the disk
space on any or all of the resources becomes quite a bit easier with less potential problems. If the
scaling should fail on one of the resources, the others will still continue to do their work normally.
According to Gerardnico, “This architecture is followed by essentially all high-performance,
scalable, DBMSs, including Teradata, Netezza, Greenplum, as well as several Morpheus
integrations. It is also used by most of the high-end e-commerce platforms, including Amazon,
Akamai, Yahoo, Google, and Facebook.”
What is MapReduce?
What is MapReduce in Hadoop?
MapReduce is a programming model suitable for processing huge volumes of data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature and are thus very useful for performing large-scale data analysis using multiple machines in the cluster.
MapReduce programs work in two phases:
∙ Map phase
∙ Reduce phase.
An input to each phase is key-value pairs. In addition, every programmer needs to specify two
functions: map function and reduce function.
How MapReduce Works? Complete Process
The whole process goes through four phases of execution namely, splitting, mapping, shuffling,
and reducing.
Let's understand this with an example –
Consider you have following input data for your Map Reduce Program
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad

MapReduce Architecture
The final output of the MapReduce task is:
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1

The data goes through the following phases


Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a map-reduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in the input splits (more details about input splits are given below) and prepare a list in the form of <word, frequency>.
Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are clubbed together along with their respective frequencies.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
In our example, this phase aggregates the values from the Shuffling phase, i.e., it calculates the total occurrences of each word.
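
The four phases above can be simulated in a few lines of Python. This is only a local, single-process sketch of the word-count example from this section, not Hadoop code; the three input lines are the ones given above.

```python
from itertools import groupby

lines = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]   # input splits (one per line)

# Map phase: emit a <word, 1> pair for every word in every split
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group all pairs with the same key together
shuffled = groupby(sorted(mapped), key=lambda pair: pair[0])

# Reduce phase: sum the values for each key to get the total occurrences of each word
reduced = {word: sum(count for _, count in pairs) for word, pairs in shuffled}

print(reduced)   # {'Class': 1, 'Hadoop': 3, 'Welcome': 1, 'bad': 1, 'good': 1, 'is': 2, 'to': 1}
```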
MapReduce Architecture explained in detail
∙ One map task is created for each split which then executes map function for each record in
the split.
∙ It is always beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken to process the whole input. When the splits are smaller, the processing is better load-balanced, since we are processing the splits in parallel.
∙ However, it is also not desirable to have splits that are too small in size. When splits are too small, the overhead of managing the splits and creating map tasks begins to dominate the total job execution time.
∙ For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default in older Hadoop versions, 128 MB in Hadoop 2.x and later).
∙ Execution of map tasks results in output being written to a local disk on the respective node and not to HDFS.
∙ The reason for choosing local disk over HDFS is to avoid the replication which takes place in the case of an HDFS store operation.
∙ Map output is intermediate output which is processed by reduce tasks to produce the final
output.
∙ Once the job is complete, the map output can be thrown away. So, storing it in HDFS with
replication becomes overkill.
∙ In the event of node failure, before the map output is consumed by the reduce task, Hadoop
reruns the map task on another node and re-creates the map output.
∙ The Reduce task doesn't work on the concept of data locality. The output of every map task is fed to the reduce task. Map output is transferred to the machine where the reduce task is running.
∙ On this machine, the output is merged and then passed to the user-defined reduce function.
∙ Unlike the map output, reduce output is stored in HDFS (the first replica is stored on the local node and other replicas are stored on off-rack nodes). So, writing the reduce output does consume network bandwidth.
How MapReduce Organizes Work?
Hadoop divides the job into tasks. There are two types of tasks:
∙ Map tasks (Splits & Mapping)
∙ Reduce tasks (Shuffling, Reducing)
as mentioned above.
The complete execution process (execution of Map and Reduce tasks, both) is controlled by two
types of entities called a
∙ Jobtracker: Acts like a master (responsible for complete execution of the submitted job)
∙ Multiple Task Trackers: Act like slaves, each of them performing part of the job
For every job submitted for execution in the system, there is one Jobtracker that resides on
Namenode and there are multiple tasktrackers which reside on Datanode.
∙ A job is divided into multiple tasks which are then run on multiple data nodes in a cluster.
∙ It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes.
∙ Execution of an individual task is then looked after by the task tracker, which resides on every data node executing part of the job.
∙ The task tracker's responsibility is to send the progress report to the job tracker.
∙ In addition, the task tracker periodically sends a 'heartbeat' signal to the Jobtracker so as to notify it of the current state of the system.
∙ Thus the job tracker keeps track of the overall progress of each job. In the event of task failure, the job tracker can reschedule it on a different task tracker.
Hadoop Architecture in Detail – HDFS, Yarn & MapReduce
Hadoop now has become a popular solution for today’s world needs. The design of Hadoop
keeps various goals in mind. These are fault tolerance, handling of large datasets, data locality,
portability across heterogeneous hardware and software platforms etc. In this blog, we will
explore the Hadoop Architecture in detail. Also, we will see Hadoop Architecture Diagram that
helps you to understand it better.
So, let’s explore Hadoop Architecture.
What is Hadoop Architecture?
Hadoop has a master-slave topology. In this topology, we have one master node and multiple
slave nodes. Master node’s function is to assign a task to various slave nodes and manage
resources. The slave nodes do the actual computing. Slave nodes store the real data, whereas on the master we have metadata, that is, data about data. What this metadata comprises, we will see in a moment.
Hadoop Application Architecture in Detail
Hadoop Architecture comprises three major layers. They are:-
∙ HDFS (Hadoop Distributed File System)
∙ Yarn
∙ MapReduce
1. HDFS
HDFS stands for Hadoop Distributed File System. It provides the data storage layer of Hadoop.
HDFS splits a data unit into smaller units called blocks and stores them in a distributed
manner. It has two daemons running: one for the master node (NameNode) and one for the slave
nodes (DataNode).
a. NameNode and DataNode
HDFS has a Master-slave architecture. The daemon called NameNode runs on the master
server. It is responsible for Namespace management and regulates file access by the client.
DataNode daemon runs on slave nodes. It is responsible for storing actual business data.
Internally, a file gets split into a number of data blocks and stored on a group of slave machines.
The NameNode manages modifications to the file system namespace. These are actions like opening,
closing and renaming files or directories. The NameNode also keeps track of the mapping of blocks
to DataNodes. The DataNodes serve read/write requests from the file system's clients. A DataNode
also creates, deletes and replicates blocks on demand from the NameNode.
Java is the native language of HDFS. Hence one can deploy DataNode and NameNode on
machines having Java installed. In a typical deployment, there is one dedicated machine running
NameNode, and all the other nodes in the cluster run DataNode. The NameNode contains
metadata like the location of blocks on the DataNodes and uses it to coordinate block placement
and replication across them.
b. Block in HDFS
Block is nothing but the smallest unit of storage on a computer system. It is the smallest
contiguous storage allocated to a file. In Hadoop, we have a default block size of 128MB or
256 MB.

One should select the block size very carefully. To explain why, let us take an example of a file
which is 700 MB in size. If our block size is 128 MB, then HDFS divides the file into 6 blocks:
five blocks of 128 MB and one block of 60 MB. What would happen if the block size were 4 KB? In
HDFS we deal with files of sizes in the order of terabytes to petabytes. With a 4 KB block size,
we would have an enormous number of blocks. This, in turn, would create huge metadata which would
overload the NameNode. Hence we have to choose our HDFS block size judiciously.
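The arithmetic behind this trade-off can be sketched in a few lines of Python (purely illustrative, not part of Hadoop):

```python
def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of HDFS-style blocks needed for a file (the last block may be smaller)."""
    return -(-file_size_bytes // block_size_bytes)   # ceiling division

KB, MB, TB = 1024, 1024**2, 1024**4

print(block_count(700 * MB, 128 * MB))   # -> 6  (five blocks of 128 MB and one of 60 MB)
print(block_count(1 * TB, 128 * MB))     # -> 8192 blocks: manageable NameNode metadata
print(block_count(1 * TB, 4 * KB))       # -> 268435456 blocks: metadata would overwhelm the NameNode
```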
c. Replication Management
To provide fault tolerance, HDFS uses a replication technique: it makes copies of the blocks and
stores them on different DataNodes. The replication factor decides how many copies of each block
get stored. It is 3 by default, but we can configure it to any value. For example, for a file of
1 GB with a replication factor of 3, HDFS will require 3 GB of total storage.
To maintain the replication factor, the NameNode collects a block report from every DataNode.
Whenever a block is under-replicated or over-replicated, the NameNode adds or deletes replicas
accordingly.
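Conceptually, the NameNode's replication check reduces to comparing the replica count reported for each block against the configured replication factor. The toy sketch below (with an invented block report, not the real NameNode code) illustrates the idea:

```python
REPLICATION_FACTOR = 3   # default value of dfs.replication

# Hypothetical block report: block id -> DataNodes currently holding a replica
block_report = {
    "blk_001": ["dn1", "dn4", "dn7"],         # correctly replicated
    "blk_002": ["dn2"],                        # under-replicated
    "blk_003": ["dn3", "dn5", "dn6", "dn8"],   # over-replicated
}

for block, nodes in block_report.items():
    if len(nodes) < REPLICATION_FACTOR:
        print(f"{block}: under-replicated ({len(nodes)}/{REPLICATION_FACTOR}), schedule new copies")
    elif len(nodes) > REPLICATION_FACTOR:
        print(f"{block}: over-replicated ({len(nodes)}/{REPLICATION_FACTOR}), delete extra replicas")
```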
d. What is Rack Awareness?

A rack contains many DataNode machines, and there are several such racks in a production cluster.
HDFS follows a rack awareness algorithm to place the replicas of the blocks in a distributed
fashion. This rack awareness algorithm provides for low latency and fault tolerance. Suppose the
configured replication factor is 3. The rack awareness algorithm will place the first replica on
the local rack and keep the other two replicas on a different rack; it does not store more than
two replicas in the same rack if possible.
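The placement policy described above can be sketched roughly as follows (a simplified illustration assuming a replication factor of 3 and a two-rack topology; the real HDFS policy also accounts for node load and disk usage):

```python
import random

def place_replicas(writer_node, topology):
    """Pick 3 DataNodes: the first on the writer's own node, the other two on one remote rack."""
    local_rack = next(rack for rack, nodes in topology.items() if writer_node in nodes)
    remote_rack = random.choice([rack for rack in topology if rack != local_rack])
    remote_nodes = random.sample(topology[remote_rack], 2)
    return [writer_node] + remote_nodes

topology = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
print(place_replicas("dn2", topology))   # e.g. ['dn2', 'dn5', 'dn4']
```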
2. YARN
YARN or Yet Another Resource Negotiator is the resource management layer of Hadoop. The
basic principle behind YARN is to separate resource management and job
scheduling/monitoring function into separate daemons. In YARN there is one global
ResourceManager and per-application ApplicationMaster.
Inside the YARN framework, we have two daemons: ResourceManager and NodeManager. The
ResourceManager arbitrates resources among all the competing applications in the system. The
job of the NodeManager is to monitor the resource usage by the containers and report it to the
ResourceManager. The resources are CPU, memory, disk, network and so on.
The ApplicationMaster negotiates resources with the ResourceManager and works with the
NodeManager to execute and monitor the job.
The ResourceManager has two important components – Scheduler and ApplicationManager.
i. Scheduler
Scheduler is responsible for allocating resources to various applications. This is a pure scheduler
as it does not perform tracking of status for the application. It also does not reschedule the tasks
which fail due to software or hardware errors. The scheduler allocates the resources based on the
requirements of the applications.
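As a rough illustration of what a "pure scheduler" means – grant or queue container requests purely on resource availability, with no tracking of application status – consider the toy sketch below. It is not YARN's actual Scheduler API; the class and method names are invented for illustration.

```python
class ToyScheduler:
    """Grants containers while capacity lasts; does not track or restart applications."""

    def __init__(self, total_vcores, total_mem_mb):
        self.free_vcores = total_vcores
        self.free_mem_mb = total_mem_mb

    def allocate(self, app_id, vcores, mem_mb):
        # Purely availability-based decision: no knowledge of what the application does
        if vcores <= self.free_vcores and mem_mb <= self.free_mem_mb:
            self.free_vcores -= vcores
            self.free_mem_mb -= mem_mb
            print(f"{app_id}: container granted ({vcores} vcores, {mem_mb} MB)")
            return True
        print(f"{app_id}: request queued, insufficient resources")
        return False

sched = ToyScheduler(total_vcores=8, total_mem_mb=16384)
sched.allocate("app_01", vcores=2, mem_mb=4096)
sched.allocate("app_02", vcores=8, mem_mb=16384)   # queued: not enough resources left
```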
ii. Application Manager
Following are the functions of ApplicationManager
∙ Accepts job submission.
∙ Negotiates the first container for executing ApplicationMaster. A container incorporates
elements such as CPU, memory, disk, and network.
∙ Restarts the ApplicationMaster container on failure.
Functions of ApplicationMaster:-
∙ Negotiates resource container from Scheduler.
∙ Tracks the resource container status.
∙ Monitors progress of the application.
We can scale the YARN beyond a few thousand nodes through YARN Federation feature. This
feature enables us to tie multiple YARN clusters into a single massive cluster. This allows for
using independent clusters, clubbed together for a very large job.
iii. Features of Yarn
YARN has the following features:-
a. Multi-tenancy
YARN allows a variety of access engines (open-source or proprietary) to operate on the same
Hadoop data set. These access engines can perform batch processing, real-time processing,
iterative processing and so on.
b. Cluster Utilization
With dynamic allocation of resources, YARN allows for good use of the cluster, as compared to the
static map-reduce slot allocation in previous versions of Hadoop, which gave poorer utilization
of the cluster.
c. Scalability
The processing power of any data center keeps on expanding. YARN's ResourceManager focuses on
scheduling and copes with ever-expanding clusters processing petabytes of data.
d. Compatibility
MapReduce programs developed for Hadoop 1.x can still run on YARN, without any disruption to
processes that already work.

What is Artificial Intelligence?

Artificial Intelligence is a field where algorithms are used to perform automatic actions. Its
models are based on the natural intelligence of humans and animals. Similar patterns of the past
are recognized, and related operations are performed automatically when the patterns are
repeated.

It utilizes the principles of software engineering and computational algorithms for the
development of solutions to a problem. Using Artificial intelligence, people can develop
automatic systems that provide cost savings and several other benefits to companies. Large
organizations are heavily dependent on Artificial Intelligence, including tech giants like
Facebook, Amazon, and Google.
Comparison of Data Science and Artificial Intelligence, factor by factor:
∙ Meaning: Data Science aims to curate massive data for analytics and visualization, whereas
Artificial Intelligence helps in implementing data and the knowledge of machines.
∙ Skills: Data Science requires statistical techniques for development and design, whereas AI
requires algorithms for development and design.
∙ Technique: Data Science makes use of data analytics techniques, whereas AI uses Deep Learning
and Machine Learning techniques.
∙ Observation: Data Science looks for patterns in data to make well-informed decisions, whereas
AI imposes intelligence on machines using data to make them respond as humans do.
∙ Solving Issues: Data Science utilizes parts of a loop or program to solve particular issues,
whereas AI represents the loop of planning and perception.
∙ Processing: Data Science uses a medium level of data processing for data manipulation, whereas
AI uses high-level processing of scientific data for data manipulation.
∙ Graphic Representation: Data Science allows you to represent data in several graphical formats,
whereas AI helps you use an algorithm network node representation.
∙ Tools Involved: Data Science makes use of tools such as SAS, SPSS, Keras, R, Python, etc.,
whereas AI uses tools such as Shogun, Mahout, Caffe, PyTorch, TensorFlow, Scikit-Learn, etc.
∙ Applications: Applications of Data Science are dominantly used in Internet search engines such
as Yahoo, Bing, Google, etc., whereas AI applications are used in several industries, including
transport, healthcare, manufacturing, automation, etc.
What is Data Science?

You must have wondered, 'What is Data Science?' Data science is a broad field of study
pertaining to data systems and processes, aimed at maintaining data sets and deriving meaning
out of them. Data scientists use a combination of tools, applications, principles and algorithms
to make sense of random data clusters. Since almost all kinds of organizations today are
generating exponential amounts of data around the world, it becomes difficult to monitor and
store this data. Data science focuses on data modelling and data warehousing to track the
ever-growing data set. The information extracted through data science applications is used to
guide business processes and reach organisational goals.

Scope of Data Science

One of the domains that data science influences directly is business intelligence. Having said
that, there are functions that are specific to each of these roles. Data scientists primarily deal with
huge chunks of data to analyse the patterns, trends and more. These analysis applications
formulate reports which are finally helpful in drawing inferences. A Business Intelligence expert
picks up where a data scientist leaves – using data science reports to understand the data
trends in any particular business field and presenting business forecasts and course of action
based on these inferences. Interestingly, there is also a related role which draws on data
science, data analytics and business intelligence applications: the Business Analyst. A business
analyst profile combines a little bit of both to help companies take data-driven decisions.

Data scientists analyse historical data according to various requirements, by applying different
formats, namely:

∙ Predictive causal analytics: Data scientists use this model to derive business forecasts.
The predictive model showcases the outcomes of various business actions in
measurable terms. This can be an effective model for businesses trying to understand
the future of any new business move.
∙ Prescriptive Analysis: This kind of analysis helps businesses set their goals by
prescribing the actions which are most likely to succeed. Prescriptive analysis uses the
inferences from the predictive model and helps businesses by suggesting the best
ways to achieve those goals.
Data science uses a wide array of data-oriented technologies including SQL, Python, R, and
Hadoop, etc. However, it also makes extensive use of statistical analysis, data visualization,
distributed architecture, and more to extract meaning out of sets of data.
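For instance, a few lines of Python with pandas and matplotlib (one common combination among the tools listed; the data set here is made up for illustration) already cover basic statistical analysis and visualization:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical monthly sales data, purely for illustration
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "revenue": [120, 135, 150, 160, 155, 180],
})

print(sales["revenue"].describe())        # summary statistics: mean, std, quartiles, ...

sales.plot(x="month", y="revenue", kind="bar", legend=False, title="Monthly revenue")
plt.ylabel("Revenue (in thousands)")
plt.tight_layout()
plt.show()
```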

Data scientists are skilled professionals whose expertise allows them to quickly switch roles at
any point in the life cycle of data science projects. They can work with Artificial Intelligence and
machine learning with equal ease. In fact, data scientists need machine learning skills for specific
requirements like:

∙ Machine Learning for Predictive Reporting: Data scientists use machine learning
algorithms to study transactional data to make valuable predictions. Also known as
supervised learning, this model can be implemented to suggest the most effective
courses of action for any company.
∙ Machine Learning for Pattern Discovery: Pattern discovery is important for
businesses to set parameters in various data reports, and the way to do that is through
machine learning. This is basically unsupervised learning, where there are no
pre-decided parameters. The most popular algorithm used for pattern discovery is
Clustering (a minimal sketch follows below).
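A minimal pattern-discovery sketch using scikit-learn's KMeans on synthetic, unlabeled data (purely illustrative) could look like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data: no pre-decided parameters or target column
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

print(labels[:10])               # cluster assignments discovered for the first few points
print(model.cluster_centers_)    # the three cluster centres found in the data
```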

What is Artificial Intelligence?


AI, a rather hackneyed tech term that is used frequently in our popular culture – has come to be
associated only with futuristic-looking robots and a machine-dominated world. However, in
reality, Artificial Intelligence is far from that.
Simply put, artificial intelligence aims at enabling machines to execute reasoning by replicating
human intelligence. Since the main objective of AI processes is to teach machines from
experience, feeding the right information and self-correction is crucial. AI experts rely on deep
learning and natural language processing to help machines identify patterns and inferences.

Scope of Artificial Intelligence

∙ Automation is easy with AI: AI allows you to automate repetitive, high volume tasks by
setting up reliable systems that run frequent applications.
∙ Intelligent Products: AI can turn conventional products into smart commodities. AI
applications when paired with conversational platforms, bots and other smart
machines can result in improved technologies.
∙ Progressive Learning: AI algorithms can train machines to perform any desired
functions. The algorithms work as predictors and classifiers.
∙ Analysing Data: Since machines learn from the data we feed them, analysing and
identifying the right set of data becomes very important. Neural networking makes it
easier to train machines.

What is Machine Learning?

Machine Learning is a subsection of Artificial Intelligence that devises means by which systems
can automatically learn and improve from experience. This particular wing of AI aims at
equipping machines with independent learning techniques so that they don't have to be explicitly
programmed; this is the difference between AI in general and Machine Learning.

Machine learning involves observing and studying data or experiences to identify patterns and
set up a reasoning system based on the findings. The various components of machine learning
include:

∙ Supervised machine learning: This model uses historical data to understand behaviour
and formulate future forecasts. Such learning algorithms analyse a given training
data set to draw inferences which can be applied to output values. Supervised
learning parameters are crucial in mapping the input-output pair (a minimal sketch
follows after this list).
∙ Unsupervised machine learning: This type of ML algorithm does not use any classified
or labelled parameters. It focuses on discovering hidden structures from unlabeled
data to help systems infer a function properly. Algorithms with unsupervised learning
can use both generative learning models and a retrieval-based approach.
∙ Semi-supervised machine learning: This model combines elements of supervised and
unsupervised learning yet isn’t either of them. It works by using both labelled and
unlabeled data to improve learning accuracy. Semi-supervised learning can be a
cost-effective solution when labelling data turns out to be expensive.
∙ Reinforcement machine learning: This kind of learning doesn't use any answer key to
guide the execution of any function. The lack of training data results in learning from
experience. The process of trial and error finally leads to long-term rewards.
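As a contrast to the clustering sketch above, here is a minimal supervised-learning example (scikit-learn, synthetic data, purely illustrative) in which labelled historical examples are used to predict outcomes for new data points:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic "historical" data with known labels (the training data set)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = model.predict(X_test)                  # forecasts for unseen data points
print("accuracy:", accuracy_score(y_test, predictions))
```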
Machine learning delivers accurate results derived through the analysis of massive data sets.
Applying AI cognitive technologies to ML systems can result in the effective processing of data
and information. But what are the key differences between Data Science vs Machine Learning
and AI vs ML? Continue reading to learn more.

Difference between AI and Machine Learning


∙ AI aims to make a smart computer system that works just like humans to solve complex problems,
whereas ML allows machines to learn from data so they can provide accurate output.
∙ Based on capability, AI can be categorized into Weak AI, General AI and Strong AI, whereas ML
can be categorized into Supervised Learning, Unsupervised Learning and Reinforcement Learning.
∙ AI systems are concerned with maximizing the chances of success, whereas Machine Learning is
primarily concerned with accuracy and patterns.
∙ AI enables a machine to emulate human behaviour, whereas Machine Learning is a subset of AI.
∙ AI mainly deals with structured, semi-structured and unstructured data, whereas Machine
Learning deals with structured and semi-structured data.
∙ Some applications of AI are virtual assistants such as Siri, chatbots and intelligent humanoid
robots, whereas applications of ML include recommendation systems, search algorithms and
Facebook's auto friend-tagging system.

Difference Between Data Science and Machine Learning


∙ Data Science helps with creating insights from data that deals with real-world complexities,
whereas Machine Learning helps in accurately predicting or classifying outcomes for new data
points by learning patterns from historical data.
∙ Preferred skill-set for Data Science: domain expertise, strong SQL, ETL and data profiling,
NoSQL systems, standard reporting and visualization. Preferred skill-set for Machine Learning:
Python/R programming, strong mathematics knowledge, data wrangling, SQL and model-specific
visualization.
∙ Data Science prefers horizontally scalable systems to handle massive data, whereas Machine
Learning prefers GPUs for intensive vector operations.
∙ In Data Science, much of the complexity lies in the components for handling unstructured raw
data, whereas in Machine Learning the major complexity is in the algorithms and the mathematical
concepts behind them.
∙ In Data Science, most of the input data is in human-consumable form, whereas in Machine
Learning the input data is transformed specifically for the type of algorithms used.

Relationship between Data Science, Artificial Intelligence and Machine Learning

Artificial Intelligence and data science cover a wide field of applications and systems that aim
at replicating human intelligence through machines. Artificial Intelligence can be viewed as a
loop of perception, planning and action, with the feedback of perception closing the loop:

Perception > Planning > Action > Feedback of Perception


Data Science uses different parts of this pattern or loop to solve specific problems. For instance,
in the first step, i.e. Perception, data scientists try to identify patterns with the help of the data.
Similarly, in the next step, i.e. planning, there are two aspects:

∙ Finding all possible solutions


∙ Finding the best solution among all solutions
Data science creates a system that interrelates both the aforementioned points and helps
businesses move forward.
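The Perception > Planning > Action > Feedback loop can also be written down schematically. The sketch below is purely conceptual; the environment, the function names and the thermostat-style example are invented for illustration.

```python
def perceive(environment):
    """Perception: turn raw data from the environment into an observation."""
    return environment["temperature"]

def plan(observation, target=22.0):
    """Planning: enumerate candidate actions, then pick the one closest to the goal."""
    candidates = {"heat": observation + 1, "cool": observation - 1, "idle": observation}
    return min(candidates, key=lambda action: abs(candidates[action] - target))

def act(environment, action):
    """Action: change the environment; its effect is perceived on the next iteration."""
    environment["temperature"] += {"heat": 1, "cool": -1, "idle": 0}[action]

env = {"temperature": 18.0}
for _ in range(5):   # Perception > Planning > Action > Feedback of Perception
    observation = perceive(env)
    chosen = plan(observation)
    act(env, chosen)
    print(observation, chosen, "->", env["temperature"])
```

Each pass through the loop perceives the environment, plans by comparing candidate actions, acts, and then perceives the effect of that action on the next iteration – which is the "feedback of perception".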
Although it’s possible to explain machine learning by taking it as a standalone subject, it can
best be understood in the context of its environment, i.e., the system it’s used within.
Simply put, machine learning is the link that connects Data Science and AI. That is because it’s
the process of learning from data over time. So, AI is the tool that helps data science get results
and solutions for specific problems. However, machine learning is what helps in achieving that
goal. A real-life example of this is Google’s Search Engine.

∙ Google’s search engine is a product of data science


∙ It uses predictive analysis, a system used by artificial intelligence, to deliver intelligent
results to the users
∙ For instance, if a person types “best jackets in NY” on Google’s search engine, then the
AI collects this information through machine learning
∙ Now, as soon as the person types the words "best place to buy" in the search tool, the AI
kicks in and, with predictive analysis, completes the query as "best place to buy jackets in
NY", which is the most probable query the user had in mind.

To be precise, Data Science covers AI, which includes machine learning. However, machine
learning itself covers another sub-technology – Deep Learning. Deep Learning is a form of
machine learning that differs in its use of neural networks, where we simulate the functioning
of a brain to a certain extent and use a layered hierarchy in the data to identify patterns that
are much more useful.
Difference Between Data Science, Artificial Intelligence and Machine Learning
Although the terms Data Science, Machine Learning and Artificial Intelligence are related and
interconnected, each of them is unique in its own way and is used for different purposes.
Data Science is a broad term, and Machine Learning falls within it. Here are the key differences
between the terms.
∙ Artificial Intelligence includes Machine Learning; Machine Learning is a subset of Artificial
Intelligence; Data Science includes various data operations.
∙ Artificial Intelligence combines large amounts of data through iterative processing and
intelligent algorithms to help computers learn automatically; Machine Learning uses efficient
programs that can use data without being explicitly told to do so; Data Science works by
sourcing, cleaning and processing data to extract meaning out of it for analytical purposes.
∙ Some of the popular tools: AI – TensorFlow, Scikit-Learn, Keras; Machine Learning – Amazon
Lex, IBM Watson Studio, Microsoft Azure ML Studio; Data Science – SAS, Tableau, Apache Spark,
MATLAB.
∙ Artificial Intelligence uses logic and decision trees; Machine Learning uses statistical
models; Data Science deals with structured and unstructured data.
∙ Chatbots and voice assistants are popular applications of AI; recommendation systems such as
Spotify and facial recognition are popular examples of ML; fraud detection and healthcare
analysis are popular examples of Data Science.
