Introduction to Data Science and Big Data; Defining Data Science and Big Data; Big Data examples; Data Explosion: Data Volume, Data Variety, Data Velocity and Veracity; Big Data infrastructure and challenges; Big Data Processing Architectures: Data Warehouse, Re-Engineering the Data Warehouse, shared-everything and shared-nothing architectures; Big Data learning approaches; Data Science – The Big Picture: Relation between AI, Statistical Learning, Machine Learning, Data Mining and Big Data Analytics.
Examples: Applications of Big Data in Real Life
Brief Introduction to Big Data
∙ Big Data is data whose scale lies just beyond technology's capability to store, manage and
process it efficiently.
∙ Data Science: Data Science is the discipline that uses computer science, statistics,
machine learning, visualization and human-computer interaction to collect, clean,
integrate, analyze, visualize and interact with data to create data products.
Big data has a great influence on the world of education. By powering e-learning solutions, it has
been able to address one of the biggest pitfalls of the education system: its one-size-fits-all
academic setup.
Big data has proven to be important not only in reframing coursework and grading systems, but
also in personalising learning for individual students.
Big data contributions to Healthcare
Big data is in extended use in the field of medicine and healthcare. It has provided so many
benefits in this field that, now that healthcare organizations are able to use big data approaches
and solutions, going back to how things were before big data seems impossible, not to mention
pointless.
Following are some of the many ways in which big data has contributed to healthcare
∙ Big data reduces the cost of treatment, since there is less chance of having to perform
unnecessary diagnoses.
∙ It helps in predicting outbreaks of epidemics and in deciding what preventive measures
could be taken to minimize their effects.
∙ It helps avoid preventable diseases by detecting them in their early stages, which keeps
them from getting any worse and also makes treatment easier and more effective.
∙ Patients can be provided with evidence-based medicine, identified and prescribed after
researching past medical results.
Big Data Contributions to the Public Sector
Big data has also played a major role in the public sector. It provides a wide range of benefits,
most of which are facilities offered to government agencies, such as tracking power consumption,
fraud detection, etc. Some of the ways Big Data serves government sectors are as follows:
∙ Big Data is used by the FDA (Food and Drug Administration) to identify and examine
food-based infections.
∙ Big Data is used by the government to stay up to date in the field of agriculture by
keeping track of all the land, crops and livestock that exist.
∙ Big Data is heavily used for fraud detection and has also helped in catching tax evaders.
Big Data Contributions to Communications, Media and Entertainment
Now here is an interesting one: you have been using some of the platforms that make heavy use
of big data. For example:
Spotify, an on-demand music streaming platform, uses big data analytics to collect data from all
of its users around the globe, and then uses the analyzed data to give informed music
recommendations and suggestions to every individual user.
Amazon Prime, which offers videos, music and Kindle books in a one-stop shop, is also big on
using big data.
∙ Weather patterns
Weather sensors are deployed all around the globe, and data is collected from them, along with
data from satellites deployed by the Joint Polar Satellite System to monitor weather and
environmental conditions.
All of the data collected from these sensors and satellites can be used in different ways, such as
for weather forecasting, studying global warming, understanding the patterns of natural disasters
in order to make the necessary preparations in case of crisis, predicting the availability of usable
water around the world, and many more.
∙ Big Data Contributions to Transportation
Since the rise of big data, it has been used in various ways to make transportation more efficient
and easy. The following are some of the areas where big data has contributed to transportation.
∙ Route planning: Big data can be used to understand and estimate users' needs on
different routes and across multiple modes of transportation, and then to plan routes that
reduce users' wait times.
∙ Congestion management by predicting traffic conditions: Using big data, real-time
estimation of congestion and traffic patterns is now possible; for example, people use
Google Maps to locate the least traffic-prone routes.
∙ Safety level of traffic: Real-time processing of big data and predictive analysis can
identify accident-prone areas, which helps reduce accidents and increase the safety
level of traffic.
And guess what? We too make use of this application when we plan a route to save fuel and time,
based on our knowledge of having taken that particular route sometime in the past. In this case
we analysed and made use of data we had previously acquired through experience, and then
used it to make a smart decision. It's pretty cool that big data has played a part not only in such
big fields but also in our smallest day-to-day decisions, isn't it?
∙ Big Data Contributions to Banking Zones and Fraud Detection
Benefits of big data
Why Big Data is so important?
This has been one of the most asked questions since the advent of Big Data.
The importance of big data lies in how an organization uses the collected data, not in how much
data it has been able to collect. For an organization looking to benefit from Big Data, being able
to use it efficiently is what matters most.
To that end, there are Big Data solutions that make analysing Big Data much easier than it used
to be, which in turn helps organizations make better and smarter business decisions. This is
where the benefits of Big Data start showing up. Let's see what those benefits are:
∙ Gaining Insights:
In older times, when even storing and managing big data was considered a tedious task,
let alone analyzing it to gain benefits from it, a huge amount of data went unused and
wasted, data that could contain important information and help gain insights about
businesses and industries. Now, with all the different Big Data solutions available to
manage and analyze Big Data, no data containing information goes unused. Big Data now
provides deep insights with the help of not only structured data but also unstructured
and semi-structured data.
∙ Prediction and Decision making:
Now that organizations are able to analyze Big Data, they have successfully started
using it to mitigate risks revolving around various factors of their businesses.
Using Big Data to reduce the risks around organizational decisions and to make
predictions has become one of the many benefits big data brings to industries.
∙ Cost-effectiveness:
Using and analyzing Big Data to make relevant predictions and smart decisions also makes
organizations cost-effective. Beyond that, using Big Data tools to manage and analyze the data
brings cost advantages to businesses, especially when a huge amount of data has to be stored
and processed.
∙ Marketing effectiveness:
Big Data, along with helping businesses and organizations make smart decisions, also
drastically increases their sales and marketing effectiveness, thereby greatly improving their
performance in the industry. Organizations can also use big data to understand the latest trends
in customer and user needs through real-time analytics, and then act on them to increase their
market value.
So, to sum it all up, we have tabulated the benefits of Big Data in brief:

Area of concern                   Big Data benefit
Insights                          Big data can provide better insights with the help of
                                  unstructured and semi-structured data
Prediction and Decision making    Big data helps mitigate risk and make smart decisions
                                  through proper risk analysis
Cost-effectiveness                Better and smarter decisions made with the help of Big
                                  Data make organizations cost-effective
Marketing effectiveness           Big Data also increases the sales and marketing
                                  effectiveness of an organization
When talking about big data, we have all heard about the 3 V's of big data. They are simply a
concise way to define big data. Big data initially consisted of three attributes, namely volume,
velocity, and variety.
While these three attributes pretty much gave the essence of the definition of big data, another
attribute, veracity, was added to the list later on.
Let's see what these attributes mean:
∙ Velocity – Denotes the speed at which data is generated and at which changes occur
between the diverse data sets.
Velocity:
∙ Velocity refers to the high speed of accumulation of data.
∙ In Big Data, velocity means data flows in from sources like machines, networks, social
media, mobile phones, etc.
∙ There is a massive and continuous flow of data. Velocity determines the potential of the
data: how fast it is generated and processed to meet demand.
∙ Sampling data can help in dealing with issues arising from velocity.
∙ Example: More than 3.5 billion searches per day are made on Google. Also,
Facebook users are increasing by approximately 22% year over year.
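The sampling bullet above can be sketched in code. Reservoir sampling keeps a fixed-size, uniformly random sample of a stream without ever storing the whole stream, which is one simple way to cope with velocity. This is a minimal illustrative sketch, not tied to any particular big data tool:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Only k items are ever held in memory, so the sampler keeps up with a
    high-velocity stream without storing it all.
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 events from a simulated stream of 1,000,000 events
sample = reservoir_sample(range(1_000_000), 5)
print(sample)  # 5 items, each equally likely to be any event in the stream
```

Each incoming item replaces a stored one with decreasing probability, which is what keeps the sample uniform over the whole stream.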
∙ Volume – This refers to the sheer volume of data being generated every second.
Volume:
∙ It refers to the huge amounts of data generated every second from sources such as
social media, machines, and business transactions.
∙ Variety – As more and more data is digitized, some of it is found to be structured and
some unstructured in nature; with big data in the picture we can use structured as well
as unstructured data.
Variety:
∙ It refers to the nature of data: structured, semi-structured and unstructured.
∙ It also refers to heterogeneous sources.
∙ Variety is basically the arrival of data from new sources both inside and outside an
enterprise. It can be structured, semi-structured or unstructured.
o Structured data: This is basically organized data. It generally refers to data
with a defined length and format.
o Semi-structured data: This is basically semi-organized data. It is generally
a form of data that does not conform to the formal structure of structured data.
Log files are an example of this type of data.
o Unstructured data: This basically refers to unorganized data. It generally refers to
data that doesn't fit neatly into the traditional row-and-column structure of a relational
database. Text, pictures, videos, etc. are examples of unstructured data, which can't be
stored in the form of rows and columns.
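The three kinds of data above can be illustrated side by side with a small sketch. The records, log line, and review text are made-up examples, using only Python's standard library:

```python
import csv
import io
import json

# Structured: fixed rows and columns (relational-style), e.g. a CSV table
structured = io.StringIO("id,name,age\n1,Asha,30\n2,Ravi,28\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing but with no fixed schema, e.g. a JSON log entry
log_line = '{"level": "ERROR", "msg": "disk full", "host": "node-7"}'
event = json.loads(log_line)

# Unstructured: free text with no row/column layout; needs custom processing
review = "The delivery was late but the product itself is great."
word_count = len(review.split())

print(rows[0]["name"], event["level"], word_count)
```

The structured rows parse into uniform fields, the JSON event carries its own field names, and the free text only yields information after extra processing (here, a trivial word count).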
Veracity:
∙ It refers to inconsistencies and uncertainty in data; that is, the available data can
sometimes get messy, and its quality and accuracy are difficult to control.
∙ Big Data is also variable because of the multitude of data dimensions resulting from
multiple disparate data types and sources.
∙ Example: Data in bulk can create confusion, whereas too little data can convey half or
incomplete information.
∙ Value – Having access to big data is all well and good, but that's only useful if we can turn it
into value.
Value:
∙ After taking the four V's into account, there comes one more V, which stands for Value. A
bulk of data with no value is of no good to the company unless it is turned into
something useful.
∙ Data in itself is of no use or importance; it needs to be converted into something valuable
in order to extract information. Hence, you can argue that Value is the most important of
all the 5 V's.
Kappa Architecture
Let’s translate the operational sequencing of the kappa architecture to a functional equation which
defines any query in big data domain.
Query = K (New Data) = K (Live streaming data)
The equation means that all queries can be catered for by applying the kappa function to the live
streams of data at the speed layer. It also signifies that stream processing occurs on the speed
layer in the kappa architecture.
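As an illustration of Query = K(Live streaming data), here is a minimal sketch (with hypothetical event names) in which a single stream-processing function serves every query, and historical results are recomputed simply by replaying the retained event log through that same function:

```python
from collections import defaultdict

def kappa_query(event_log):
    """Kappa-style query: one stream function applied over the whole log.

    There is no separate batch layer; re-processing just means replaying
    the retained, ordered event log through the same function.
    """
    totals = defaultdict(int)
    for user, amount in event_log:   # live events and replayed history alike
        totals[user] += amount
    return dict(totals)

# The retained log (as a broker such as Kafka would keep it, in order)
log = [("alice", 10), ("bob", 5), ("alice", 7)]
print(kappa_query(log))   # {'alice': 17, 'bob': 5}

# A new live event arrives; the same function answers the updated query
log.append(("bob", 3))
print(kappa_query(log))   # {'alice': 17, 'bob': 8}
```

The point of the sketch is that "batch" and "speed" results come from one code path, so a code change only requires replaying the log, mirroring the pros listed below.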
Applications of Kappa architecture
Some variants of social network applications, devices connected to cloud-based monitoring
systems, and Internet of Things (IoT) deployments use an optimized version of the Kappa
architecture, which mainly uses the services of the speed layer combined with the streaming
layer to process data over the data lake.
Kappa architecture can be deployed for those data processing enterprise models where:
∙ Multiple data events or queries are logged in a queue to be catered for against a distributed
file system storage or history.
∙ The order of the events and queries is not predetermined. Stream processing platforms can
interact with the database at any time.
∙ Resilience and high availability are required, since handling terabytes of storage is needed
on each node of the system to support replication.
The data scenarios mentioned above are handled by harnessing Apache Kafka, which is extremely
fast, fault-tolerant and horizontally scalable. It provides a better mechanism for governing
data streams. Balanced control over the stream processors and databases makes it possible for
applications to perform as expected. Kafka retains the ordered data for long durations and caters
for analogous queries by linking them to the appropriate position in the retained log.
LinkedIn and some other applications use this flavor of big data processing and reap the benefit of
retaining large amounts of data to cater for queries that are mere replicas of each other.
Pros and Cons of Kappa architecture
Pros
∙ Kappa architecture can be used to develop data systems that are online learners and
therefore don't need the batch layer.
∙ Re-processing is required only when the code changes.
∙ It can be deployed with fixed memory.
∙ It can be used for horizontally scalable systems.
∙ Fewer resources are required, as machine learning is done in real time.
Cons
The absence of a batch layer might result in errors during data processing or while updating the
database, requiring an exception manager to reprocess the data or perform reconciliation.
Conclusion
In short, the choice between the Lambda and Kappa architectures is a trade-off. If you seek an
architecture that is more reliable in updating the data lake, as well as efficient in devising
machine learning models to predict upcoming events in a robust manner, you should use the
Lambda architecture, as it reaps the benefits of both the batch layer and the speed layer to
ensure fewer errors and good speed. On the other hand, if you want to deploy a big data
architecture using less expensive hardware, and require it to deal effectively with unique events
occurring at runtime, then select the Kappa architecture for your real-time data processing needs.
Zeta Architecture
The Zeta Architecture is a high-level enterprise architectural construct, not unlike the Lambda
architecture, which enables simplified business processes and defines a scalable way to increase
the speed of integrating data into the business. The result? A powerful, data-centric enterprise.
There are seven pluggable components of the Zeta Architecture which work together, reducing
system-level complexity while radically increasing resource utilization and efficiency.
∙ Distributed File System - all applications read and write to a common, scalable solution,
which dramatically simplifies the system architecture.
∙ Real-time Data Storage - supports the need for high-speed business applications through
the use of real-time databases.
∙ Pluggable Compute Model / Execution Engine - delivers different processing engines and
models in order to meet the needs of diverse business applications and users in an
organization.
∙ Deployment / Container Management System - provides a standardized approach for
deploying software. All resource consumers are isolated and deployed in a standard way.
∙ Solution Architecture - focuses on solving specific business problems, and combines one
or more applications built to deliver the complete solution. These solution architectures
encompass a higher-level interaction among common algorithms or libraries, software
components and business workflows.
∙ Enterprise Applications - brings simplicity and reusability by delivering the components
necessary to realize all of the business goals defined for an application.
∙ Dynamic and Global Resource Management - allows dynamic allocation of resources so
that you can accommodate whatever task is the most important for that day.
Benefits of Zeta Architecture
There are several benefits to implementing a Zeta Architecture in your organization:
∙ Reduce time and costs of deploying and maintaining applications
∙ Fewer moving parts with simplifications such as using a distributed file system
∙ Less data movement and duplication - transforming and moving data around will no
longer be required unless a specific use case calls for it
∙ Simplified testing, troubleshooting, and systems management
∙ Better resource utilization to lower data center costs
MapReduce Architecture
The final output of the MapReduce word-count task is:
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
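The listing above is the classic word-count example. As a minimal in-process sketch of the two MapReduce phases (map, then shuffle/sort and reduce), assuming the unstated input text was "Welcome to Hadoop Class Hadoop is good Hadoop is bad":

```python
from itertools import groupby
from operator import itemgetter

def map_phase(text):
    # Mapper: emit a (word, 1) pair for every word in the input split
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    # Shuffle/sort: bring all pairs with the same key together,
    # then reduce each group by summing its counts
    pairs.sort(key=itemgetter(0))
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

# Assumed input that yields the output shown above
text = "Welcome to Hadoop Class Hadoop is good Hadoop is bad"
counts = reduce_phase(map_phase(text))
for word in sorted(counts, key=str.lower):
    print(word, counts[word])
```

In real Hadoop the map and reduce phases run as distributed tasks over HDFS splits; this sketch only mirrors the data flow on a single machine.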
Hadoop Architecture
A rack contains many DataNode machines, and there are several such racks in production.
HDFS follows a rack awareness algorithm to place the replicas of the blocks in a distributed
fashion, which provides low latency and fault tolerance. Suppose the configured replication
factor is 3. The rack awareness algorithm will place the first replica on the local rack and
the other two replicas on a different rack. It does not store more than two replicas in the
same rack, if possible.
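The placement rule described above can be sketched as a toy function. This is an illustration of the default rule for replication factor 3, not HDFS's actual placement-policy code:

```python
import random

def place_replicas(local_rack, racks, replication_factor=3):
    """Toy sketch of the HDFS default replica placement for factor 3.

    The first replica goes on the writer's local rack; the remaining two
    go together on one other rack, so no rack ends up holding more than
    two replicas of the block.
    """
    other_racks = [r for r in racks if r != local_rack]
    remote_rack = random.choice(other_racks)
    return [local_rack, remote_rack, remote_rack][:replication_factor]

placement = place_replicas("rack-1", ["rack-1", "rack-2", "rack-3"])
print(placement)   # e.g. ['rack-1', 'rack-3', 'rack-3']
```

Losing any single rack still leaves at least one replica available, which is the fault-tolerance property the text describes.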
3. YARN
YARN or Yet Another Resource Negotiator is the resource management layer of Hadoop. The
basic principle behind YARN is to separate resource management and job
scheduling/monitoring function into separate daemons. In YARN there is one global
ResourceManager and per-application ApplicationMaster.
Inside the YARN framework, we have two daemons: the ResourceManager and the NodeManager.
The ResourceManager arbitrates resources among all the competing applications in the system.
The job of the NodeManager is to monitor the resource usage of the containers and report it to
the ResourceManager. The resources are CPU, memory, disk, network and so on.
The ApplicationMaster negotiates resources with the ResourceManager and works with the
NodeManager to execute and monitor the job.
Artificial Intelligence is a field where algorithms are used to perform automatic actions. Its
models are based on the natural intelligence of humans and animals. Similar patterns of the past
are recognized, and related operations are performed automatically when the patterns are
repeated.
It utilizes the principles of software engineering and computational algorithms to develop
solutions to a problem. Using Artificial Intelligence, people can develop automatic systems that
provide cost savings and several other benefits to companies. Large organizations are heavily
dependent on Artificial Intelligence, including tech giants like Facebook, Amazon, and Google.
Comparison of Data Science and Artificial Intelligence:
∙ Skills: Data Science – you need to use statistical techniques for development and design.
AI – you must use algorithms for development and design.
∙ Technique: Data Science makes use of data analytics techniques. AI uses Deep Learning
and Machine Learning techniques.
∙ Solving Issues: Data Science utilizes parts of a loop or program to solve particular issues.
AI, however, represents the loop of planning and perception.
∙ Graphics: Data Science allows you to represent data in several graphical formats. AI helps
you use an algorithm network node representation.
∙ Tools Involved: Data Science makes use of tools such as SAS, SPSS, Keras, R, Python, etc.
AI uses tools viz. Shogun, Mahout, Caffe, PyTorch, TensorFlow, Scikit-Learn, etc.
You must have wondered, 'What is Data Science?' Data science is a broad field of study
pertaining to data systems and processes, aimed at maintaining data sets and deriving meaning
out of them. Data scientists use a combination of tools, applications, principles and algorithms
to make sense of random data clusters. Since almost all kinds of organizations today are
generating exponential amounts of data around the world, it becomes difficult to monitor and
store this data. Data science focuses on data modelling and data warehousing to track the
ever-growing data set. The information extracted through data science applications is used to
guide business processes and reach organisational goals.
One of the domains that data science influences directly is business intelligence. Having said
that, there are functions that are specific to each of these roles. Data scientists primarily deal with
huge chunks of data to analyse the patterns, trends and more. These analysis applications
formulate reports which are finally helpful in drawing inferences. A Business Intelligence expert
picks up where a data scientist leaves – using data science reports to understand the data
trends in any particular business field and presenting business forecasts and course of action
based on these inferences. Interestingly, there's also a related role which uses data science,
data analytics and business intelligence applications: the Business Analyst. A business analyst
profile combines a little of each to help companies take data-driven decisions.
Data scientists analyse historical data according to various requirements, by applying different
formats, namely:
∙ Predictive causal analytics: Data scientists use this model to derive business forecasts.
The predictive model showcases the outcomes of various business actions in
measurable terms. This can be an effective model for businesses trying to understand
the future of any new business move.
∙ Prescriptive Analysis: This kind of analysis helps businesses set their goals by
prescribing the actions which are most likely to succeed. Prescriptive analysis uses the
inferences from the predictive model and helps businesses by suggesting the best
ways to achieve those goals.
Data science uses a wide array of data-oriented technologies including SQL, Python, R, and
Hadoop, etc. However, it also makes extensive use of statistical analysis, data visualization,
distributed architecture, and more to extract meaning out of sets of data.
Data scientists are skilled professionals whose expertise allows them to quickly switch roles at
any point in the life cycle of data science projects. They can work with Artificial Intelligence and
machine learning with equal ease. In fact, data scientists need machine learning skills for specific
requirements like:
∙ Machine Learning for Predictive Reporting: Data scientists use machine learning
algorithms to study transactional data to make valuable predictions. Also known as
supervised learning, this model can be implemented to suggest the most effective
courses of action for any company.
∙ Machine Learning for Pattern Discovery: Pattern discovery is important for
businesses to set parameters in various data reports and the way to do that is through
machine learning. This is basically unsupervised learning where there are no
pre-decided parameters. The most popular algorithm used for pattern discovery is
Clustering.
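As a minimal illustration of pattern discovery through clustering, here is a toy k-means on one-dimensional data. It is a pure-Python sketch with invented numbers, not a production algorithm:

```python
def kmeans(points, k, iterations=10):
    """Minimal k-means: group unlabeled 1-D points around k centroids."""
    centroids = points[:k]                       # naive initialisation
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its assigned cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups hiding in unlabeled data (e.g. transaction amounts)
data = [1.0, 1.2, 0.8, 9.9, 10.1, 10.3]
centroids, clusters = kmeans(data, k=2)
print(sorted(round(c, 1) for c in centroids))
```

No labels or pre-decided parameters are supplied; the algorithm discovers the two groups on its own, which is exactly the unsupervised setting described above.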
∙ Automation is easy with AI: AI allows you to automate repetitive, high volume tasks by
setting up reliable systems that run frequent applications.
∙ Intelligent Products: AI can turn conventional products into smart commodities. AI
applications when paired with conversational platforms, bots and other smart
machines can result in improved technologies.
∙ Progressive Learning: AI algorithms can train machines to perform any desired
functions. The algorithms work as predictors and classifiers.
∙ Analysing Data: Since machines learn from the data we feed them, analysing and
identifying the right set of data becomes very important. Neural networking makes it
easier to train machines.
Machine Learning is a subsection of Artificial Intelligence that devises means by which systems
can automatically learn and improve from experience. This particular wing of AI aims at
equipping machines with independent learning techniques so that they don't have to be
explicitly programmed; this is the difference between AI and Machine Learning.
Machine learning involves observing and studying data or experiences to identify patterns and
set up a reasoning system based on the findings. The various components of machine learning
include:
∙ Supervised machine learning: This model uses historical data to understand behaviour
and formulate future forecasts. Learning algorithms of this kind analyse a given
training data set to draw inferences which can be applied to output values. Supervised
learning parameters are crucial in mapping the input-output pairs.
∙ Unsupervised machine learning: This type of ML algorithm does not use any classified
or labelled parameters. It focuses on discovering hidden structures from unlabeled
data to help systems infer a function properly. Algorithms with unsupervised learning
can use both generative learning models and a retrieval-based approach.
∙ Semi-supervised machine learning: This model combines elements of supervised and
unsupervised learning yet isn’t either of them. It works by using both labelled and
unlabeled data to improve learning accuracy. Semi-supervised learning can be a
cost-effective solution when labelling data turns out to be expensive.
∙ Reinforcement machine learning: This kind of learning doesn't use an answer key to
guide the execution of any function. The lack of training data results in learning from
experience, and the process of trial and error finally leads to long-term rewards.
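The supervised model above can be shown in miniature: a least-squares line fitted to labelled (x, y) training pairs, then used to forecast an unseen input. The ad-spend numbers are invented purely for illustration:

```python
def fit_line(xs, ys):
    """Supervised learning in miniature: fit y = a*x + b by least squares
    on labelled training pairs, then predict outputs for unseen inputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Training data: past ad spend (input x) with its known sales outcome (label y)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

a, b = fit_line(xs, ys)
predict = lambda x: a * x + b
print(predict(6))   # 12.0 -- forecast for an unseen input
```

The labels steer the fit, which is what distinguishes this from the unsupervised setting, where no such answer key exists.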
Machine learning delivers accurate results derived through the analysis of massive data sets.
Applying AI cognitive technologies to ML systems can result in the effective processing of data
and information. But what are the key differences between Data Science vs Machine Learning
and AI vs ML? Continue reading to learn more.
Differences between AI and Machine Learning:
∙ AI aims to make a smart computer system that works just like humans to solve complex
problems; ML allows machines to learn from data so they can provide accurate output.
∙ Based on capability, AI can be categorized into Weak AI, General AI, and Strong AI; ML
can be categorized into Supervised Learning, Unsupervised Learning, and Reinforcement
Learning.
∙ AI systems are concerned with maximizing the chances of success; Machine Learning is
primarily concerned with accuracy and patterns.
∙ AI mainly deals with structured, semi-structured, and unstructured data; ML deals with
structured and semi-structured data.
Differences between Data Science and Machine Learning:
∙ Data Science helps with creating insights from data that deals with real-world
complexities; Machine Learning helps in accurately predicting or classifying outcomes
for new data points by learning patterns from historical data.
∙ In Data Science, horizontally scalable systems are preferred to handle massive data; in
ML, GPUs are preferred for intensive vector operations.
∙ Data Science involves components for handling unstructured raw data; in ML, the major
complexity lies in the algorithms and the mathematical concepts behind them.
∙ In Data Science, most of the input data is in human-consumable form; in ML, input data
is transformed specifically for the type of algorithms used.
Artificial Intelligence and data science form a wide field of applications and systems that aim at
replicating human intelligence through machines. Artificial Intelligence represents action
planned as feedback on perception.
To be precise, Data Science covers AI, which includes machine learning. However, machine
learning itself covers another sub-technology: Deep Learning. Deep Learning is a form of
machine learning but differs in its use of Neural Networks, where we simulate the function of a
brain to a certain extent and use a 3D hierarchy in data to identify patterns that are much more
useful.
Difference Between Data Science, Artificial Intelligence and Machine Learning
Although the terms Data Science, Machine Learning and Artificial Intelligence are related
and interconnected, each of them is unique in its own way, and they are used for different
purposes. Data Science is a broad term, and Machine Learning falls within it. Here are the key
differences between the terms.
Artificial Intelligence | Machine Learning | Data Science
∙ AI includes Machine Learning. | ML is a subset of Artificial Intelligence. | Data Science
includes various data operations.
∙ AI combines large amounts of data through iterative processing and intelligent algorithms
to help computers learn automatically. | ML uses efficient programs that can use data
without being explicitly told to do so. | Data Science works by sourcing, cleaning, and
processing data to extract meaning out of it for analytical purposes.
∙ Popular AI tools: 1. TensorFlow 2. Scikit-Learn 3. Keras | Popular ML tools: 1. Amazon
Lex 2. IBM Watson Studio 3. Microsoft Azure ML Studio | Popular Data Science tools:
1. SAS 2. Tableau 3. Apache Spark 4. MATLAB
∙ AI uses logic and decision trees. | ML uses statistical models. | Data Science deals with
structured and unstructured data.
∙ Chatbots and voice assistants are popular applications of AI. | Recommendation systems
such as Spotify, and facial recognition, are popular ML examples. | Fraud detection and
healthcare analysis are popular examples of Data Science.