COURSE OBJECTIVES:
To understand the basics of big data analytics
To understand the search methods and visualization
To learn mining data streams
To learn frameworks
To gain knowledge on R language
UNIT I INTRODUCTION TO BIG DATA 9
Introduction to Big Data Platform – Challenges of Conventional Systems - Intelligent data analysis
–Nature of Data - Analytic Processes and Tools - Analysis Vs Reporting - Modern Data Analytic
Tools- Statistical Concepts: Sampling Distributions - Re-Sampling - Statistical Inference -
Prediction Error.
UNIT II SEARCH METHODS AND VISUALIZATION 9
Search by simulated Annealing – Stochastic, Adaptive search by Evaluation – Evaluation
Strategies –Genetic Algorithm – Genetic Programming – Visualization – Classification of Visual
Data Analysis Techniques – Data Types – Visualization Techniques – Interaction techniques –
Specific Visual data analysis Techniques
UNIT III MINING DATA STREAMS 9
Introduction To Streams Concepts – Stream Data Model and Architecture - Stream Computing -
Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream –
Estimating Moments – Counting Oneness in a Window – Decaying Window - Real time Analytics
Platform(RTAP) Applications - Case Studies - Real Time Sentiment Analysis, Stock Market
Predictions
UNIT IV FRAMEWORKS 9
MapReduce – Hadoop, Hive, MapR – Sharding – NoSQL Databases - S3 - Hadoop Distributed File
Systems – Case Study- Preventing Private Information Inference Attacks on Social Networks Grand
Challenge: Applying Regulatory Science and Big Data to Improve Medical Device
Innovation
UNIT V R LANGUAGE 9
Overview, Programming structures: Control statements -Operators -Functions -Environment and
scope issues -Recursion -Replacement functions, R data structures: Vectors -Matrices and arrays -
Lists -Data frames -Classes, Input/output, String manipulations
COURSE OUTCOMES:
CO1: Understand the basics of big data analytics.
CO2: Ability to use Hadoop and the MapReduce framework.
CO3: Ability to identify the areas for applying big data analytics to increase business outcomes.
CO4: Gain knowledge of the R language.
CO5: Contextually integrate and correlate large amounts of information to gain faster insights.
TOTAL: 45 PERIODS
A big data platform acts as an organized storage medium for large amounts of data. Big data
platforms utilize a combination of data management hardware and software tools to store
aggregated data sets, usually onto the cloud.
o Volume
o Veracity
o Variety
o Value
o Velocity
Scalability in big data refers to the ability of data to expand and accommodate a growing influx of information without compromising its integrity or performance. A scalable data platform utilizes added hardware or software to increase output and storage of data, and accommodates rapid changes in the growth of data, either in traffic or volume. Data scalability is important for any successful business operation today, allowing organizations to handle an ever-increasing amount of data easily and efficiently.
Velocity of Big Data
Velocity refers to the speed with which data is generated. High-velocity data is generated at such a pace that it requires distinct (distributed) processing techniques. An example of data generated with high velocity would be Twitter messages or Facebook posts.
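High-velocity streams are usually processed in a single pass, which is why stream sampling techniques matter (they reappear later under "Sampling Data in a Stream"). A minimal sketch of reservoir sampling in Python, which keeps a uniform random sample of fixed size without storing the whole stream:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)    # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each item ends up in the sample with equal probability k/n, no matter how long the stream runs.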
The data, which comes in structured, semi-structured, and unstructured forms, is collected from
multiple sources across web, mobile, and the cloud. It is then stored in a repository—a data lake
or data warehouse —in preparation to be processed.
7. What is security?
Big data analytics in security is the use of advanced analytical techniques on large-scale data sets to identify and address potential cybersecurity threats. It involves the ability to gather, analyze, visualize and draw insights from massive amounts of digital information. It can help predict and stop cyber attacks by detecting anomalies and patterns. It works together with security technologies and sensors to improve the cyber defence posture of organizations.
Applications of big data span many sectors, including recommendation systems, IoT, the education sector, and the energy sector.
• 1. Lower costs: across sectors such as healthcare, retail, production, and manufacturing, Big Data solutions help reduce costs. ...
• 2. New innovations and business opportunities: analytics gives a lot of insight into trends and customer preferences. ...
• Reporting just provides the data that is asked for, while analysis provides the information or the answer that is actually needed.
They enable organizations to make informed decisions based on large volumes of data,
improving business strategies and operations.
Companies that can effectively harness Big Data gain a competitive edge by identifying trends,
customer preferences, and market opportunities.
Big Data platforms facilitate innovation by enabling the development of advanced analytics,
machine learning models, and artificial intelligence applications.
They can reduce the cost of data storage and processing through scalable, distributed
architectures.
Data comes in various formats, including structured data (e.g., databases), unstructured data
(e.g., text and multimedia), and semi-structured data (e.g., XML and JSON).
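The three forms can be illustrated concretely: semi-structured JSON carries its own field names, which makes it straightforward to flatten into structured rows (the record and field names below are invented for illustration):

```python
import json

# Semi-structured: the schema travels with each record
raw = '{"user": "alice", "events": [{"type": "click", "ts": 1}, {"type": "view", "ts": 2}]}'
record = json.loads(raw)

# Structured: flatten into fixed-column rows, as a database table would store them
rows = [(record["user"], e["type"], e["ts"]) for e in record["events"]]
```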
This component involves collecting data from various sources, such as databases, logs, IoT
devices, and social media. Tools like Apache Kafka and Flume are commonly used for real-time
data ingestion.
Big Data platforms offer scalable and distributed storage solutions capable of handling the large
volume of data. Examples include Hadoop Distributed File System (HDFS) and cloud-based
storage services like Amazon S3.
To extract valuable insights, data needs to be processed. Big Data platforms support batch
processing (e.g., Apache Hadoop) and stream processing (e.g., Apache Spark) for real-time
analytics.
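The difference between the two models can be sketched in plain Python: a batch job sees the whole dataset before it starts, while a stream processor updates its state one record at a time (frameworks like Hadoop and Spark distribute this across a cluster; the sketch only shows the access pattern):

```python
values = [3, 1, 4, 1, 5, 9, 2, 6]

# Batch processing: the complete dataset is available up front
batch_total = sum(values)

# Stream processing: records arrive one at a time; keep a running aggregate
running_total = 0
for v in values:          # in practice this source is unbounded
    running_total += v    # update state per record, never revisiting old ones
```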
Once data is processed, it can be analyzed using various tools and frameworks like Apache Hive,
Apache Pig, or machine learning libraries. Visualization tools like Tableau or Power BI help in
presenting insights.
Data governance ensures data quality, compliance, and security. Access control, encryption, and
auditing are essential components of data security in Big Data platforms.
Part B
• Government and public administration: track tax, defense and public health data.
Big data and marketing go hand-in-hand, as businesses harness consumer information to forecast
market trends, buyer habits and other company behaviors. All of this helps businesses determine
what products and services to prioritize.
Big Data Examples in Transportation
Navigation apps and databases, whether used by car drivers or airplane pilots, frequently rely on
big data analytics to get users safely to their destinations. Insights into routes, travel time and
traffic are pulled from several data points and provide a look at travel conditions and vehicle
demands in real time.
To stay on top of citizen needs and other executive duties, governments may look toward big
data analytics. Big data helps to compile and provide insights into suggested legislation,
financial procedure and local crisis data, giving authorities an idea of where to best delegate
resources.
Big Data Examples in Business
Succeeding in business means companies have to keep track of multiple moving parts — like
sales, finances, operations — and big data helps to manage it all. Using data analytics,
professionals can follow real-time revenue information, customer demands and managerial tasks
to not only run their organization but also continually optimize it.
When it comes to medical cases, healthcare professionals may use big data to determine the best
treatment. Patterns and insights can be drawn from millions of patient data records, which guide
healthcare workers in providing the most relevant remedies for patients and how to best advance
drug development.
As cyber threats and data security concerns persist, big data analytics is used behind the scenes to protect customers every day. By reviewing multiple web patterns at once, big data can help identify unusual user behavior or online traffic and defend against cyber attacks before they even start. It can also prioritize concurrent breaches, map out multipart attacks, and identify potential root causes of security issues.
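A toy version of the anomaly detection described above is a simple statistical rule: flag traffic counts that sit far from the mean. Real security analytics uses much richer models; the data and threshold here are invented for illustration:

```python
from statistics import mean, stdev

def flag_anomalies(counts, z=2.0):
    """Flag values more than z sample standard deviations from the mean."""
    mu, sigma = mean(counts), stdev(counts)
    return [c for c in counts if abs(c - mu) > z * sigma]

hourly_requests = [100, 104, 98, 101, 99, 103, 102, 5000]  # one traffic spike
suspicious = flag_anomalies(hourly_requests)               # picks out the spike
```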
Big data has revolutionized the way businesses operate, but it has also presented a
number of challenges for conventional systems. Here are some of the challenges
faced by conventional systems in handling big data:
Big data is a term used to describe the large amount of data that can be stored and
analyzed by computers. Big data is often used in business, science and government.
Big Data has been around for several years now, but it's only recently that people
have started realizing how important it is for businesses to use this technology in
order to improve their operations and provide better services to customers. A lot of
companies have already started using big data analytics tools because they realize
how much potential there is in utilizing these systems effectively!
However, while there are many benefits associated with using such systems -
including faster processing times as well as increased accuracy - there are also some
challenges involved with implementing them correctly.
• Scalability
• Speed
• Storage
• Data Integration
• Security
Scalability
A common problem with conventional systems is that they can't scale. As the
amount of data increases, so does the time it takes to process and store it. This
can cause bottlenecks and system crashes, which are not ideal for businesses
looking to make quick decisions based on their data.
Conventional systems also lack flexibility in terms of how they handle new types of information. For example, you may want to add another column (columns are like fields) or row (rows are like records) without having to rewrite all your code from scratch.
Speed
Speed is a critical component of any data processing system. Speed is important
because it allows you to:
• Process and analyze your data faster, which means you can make better-
informed decisions about how to proceed with your business.
• Make more accurate predictions about future events based on past
performance.
Storage
The amount of data being created and stored is growing exponentially, with
estimates that it will reach 44 zettabytes by 2020. That's a lot of storage space!
The problem with conventional systems is that they don't scale well as you add
more data. This leads to huge amounts of wasted storage space and lost information
due to corruption or security breaches.
Data Integration
The challenges of conventional systems in big data are numerous. Data
integration is one of the biggest challenges, as it requires a lot of time and
effort to combine different sources into a single database. This is especially true
when you're trying to integrate data from multiple sources with different
schemas and formats.
Security measures such as firewalls, passwords and encryption help protect against
unauthorized access and attacks by hackers who want to steal data or disrupt
operations. But these security measures have limitations: They're expensive; they
require constant monitoring and maintenance; they can slow down performance if
implemented too extensively; and they often don't prevent breaches altogether
because there's always some way around them (such as through phishing emails).
Conventional systems are not equipped for big data. They were designed for a
different era, when the volume of information was much smaller and more
manageable. Now that we're dealing with huge amounts of data, conventional
systems are struggling to keep up. Conventional systems are also expensive and
time-consuming to maintain; they require constant maintenance and upgrades in
order to meet new demands from users who want faster access speeds and more
features than ever before.
• Disk Capacity
– 1990 – 20MB
– 2000 - 1GB
– 2010 – 1TB
• Disk Latency (speed of reads and writes) – not much improvement in the last 7–10 years, currently around 70–80 MB/sec
Intelligent Data Analysis provides a forum for the examination of issues related to the research
and applications of Artificial Intelligence techniques in data analysis across a variety of
disciplines. These techniques include (but are not limited to): all areas of data visualization,
data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering,
database mining techniques, tools and applications, use of domain knowledge in data analysis,
big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic,
statistical pattern recognition, knowledge filtering, and post-processing. In particular, papers are
preferred
that discuss the development of new AI-related data analysis architectures, methodologies, and
techniques and their applications to various domains.
Intelligent Data Analysis (IDA) is one of the most important approaches in the field of data mining and has attracted great interest from researchers. Based on the basic principles of IDA and the features of the datasets that IDA handles, the development of IDA is briefly summarized
from three aspects: algorithm principles, and the scale and type of the datasets. Moreover, the challenges facing IDA in the big data environment are analyzed from four views: big data management, data collection, data analysis, and application patterns. It is also made clear that, in order to extract more value from data, the further development of IDA should combine practical applications and theoretical research.
That’s a pretty broad title, but, really, what we’re talking about here are some fundamentally different ways to treat data as we work with it. This topic can seem academic, but it is relevant for web analysts specifically and researchers broadly. Yes, this topic turns out to be pretty darn important when it comes time to apply statistical operations and perform model building and testing.
So, we have to start with the basics: the nature of data. There are four types of data:
• Nominal
• Ordinal
• Interval
• Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be
performed.
The distinction between the four types of scales centers on three different characteristics:
1. The order of responses – whether it matters or not
2. The distance between observations – whether it matters or is interpretable
3. The presence or inclusion of a true zero
Nominal Scales
Consider traffic source (or last touch channel) as an example in which visitors reach our site
through a mutually exclusive channel, or last point of contact. These channels would include:
1. Paid Search
2. Organic Search
3. Email
4. Display
(This list looks artificially short, but the logic and interpretation would remain the same for nine
channels or for 99 channels.)
If we want to know that each channel is simply somehow different, then we could count the
number of visits from each channel. Those counts can be considered nominal in nature.
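Counting per category is exactly the kind of summary nominal data supports, since the labels have no order or distance. A quick sketch (the visit labels are invented):

```python
from collections import Counter

# Each visit is tagged with a nominal channel label
visits = ["Email", "Paid Search", "Organic Search", "Email", "Display", "Email"]

# Counts (and the mode) are the meaningful summaries for a nominal scale;
# averaging or ordering the labels themselves would be meaningless
counts = Counter(visits)
```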
Big data is the storage and analysis of large data sets. These are complex data sets, which can be structured or unstructured, and they are so large that it is not possible to work on them with traditional analytical tools. These days, organizations are realising the value they get out of big data analytics, and hence they are deploying big data tools and processes to bring more efficiency to their work environment. They are willing to hire good big data analytics professionals at a good salary. In order to be a big data analyst, you should get acquainted with big data first and get certified by enrolling yourself in analytics courses online.
Top 5 Big Data Tools
There are many big data tools and processes being utilised by companies these days. These are
used in the processes of discovering insights and supporting decision making. The top big data
tools used these days are open source data tools, data visualization tools, sentiment tools, data
extraction tools and databases. Some of the best used big data tools are mentioned below –
1. R-Programming
R is a free open source software programming language and a software environment for
statistical computing and graphics. It is used by data miners for developing statistical software
and data analysis. It has become a highly popular tool for big data in recent years.
2. Datawrapper
It is an online data visualization tool for making interactive charts. You need to upload your data file in CSV, PDF, or Excel format, or paste it directly into the field. Datawrapper then generates a visualization in the form of a bar chart, line chart, map, etc. It can be embedded into any other website as well. It is easy to use and produces visually effective charts.
3. Tableau Public
Tableau is another popular big data tool. It is simple and very intuitive to use. It communicates
the insights of the data through data visualisation. Through Tableau, an analyst can check a
hypothesis and explore the data before starting to work on it extensively.
4. Content Grabber
Content Grabber is a data extraction tool suitable for people with advanced programming skills. It is web-crawling software that businesses can use to extract content and save it in a structured format. It offers editing and debugging facilities, among many others, for later analysis.
The market is full of big data tools these days. These tools help unlock the power that big data provides to business processes. By choosing tools carefully, a company can increase the efficiency of its operations.
Analytics and reporting can help businesses transform data into actionable insights,
identify customer behavior patterns, measure each department’s performance, and
improve operational efficiency.
However, while these two terms are often used interchangeably, they represent different
approaches to understanding and communicating data.
Reporting involves gathering data and presenting it in a structured way, whereas analytics
is using data to identify patterns and gain insights to inform future decision-making.
A nurse takes vital signs, records symptoms, and reports this information to the doctor.
The doctor then uses this information to diagnose the patient’s condition and develop a
treatment plan.
But how many companies actually know the difference between analytics and reporting?
And do they have dedicated roles for both areas?
We conducted a survey with 22 respondents to find the answers to these questions (and a
few more you’ll want to stick around for).
Analytics and reporting both represent ways of understanding and communicating data,
but they do it differently.
Reporting is the process of collecting data and presenting it in a structured and easy-to-
understand manner, often in the form of charts, tables, or graphs. It’s important to have a
well-defined process and present data accurately, to prevent any misinterpretations.
Reports usually provide information on past performances and KPIs like sales figures,
website traffic, or customer demographics. Depending on what you want to focus on,
there are several types of reports (e.g. financial report, sales report, marketing report,
etc.).
Analytics, on the other hand, involves using data to draw insights and make informed
decisions. It goes beyond simply looking at what has happened in the past and instead
aims to answer questions about why something happened and what might happen in the
future.
Modern analytics tools also leverage complex data analysis techniques, such as predictive
modeling, data mining, and machine learning, to uncover hidden insights and trends in
the data. The purpose of analytics is to help managers and executives make informed
decisions that will drive the business forward.
Overall, analytics answers why something is happening based on the data, whereas
reporting tells what is happening.
Because these two terms represent different processes, companies should employ
different people for both areas – data analysts and reporting analysts.
We asked our respondents whether they have reporting analysts in the company and most
of them answered “Yes”.
We also asked those who have reporting analysts about how long they’ve had them on the
team. Most respondents have had them for between 1-3 years.
As for data analysts, most respondents have 2-3 data analysts in the organization.
Analytics and reporting both play a critical role in modern business operations.
Because of the vast amounts of data they collect, businesses need effective analytics and
reporting processes to leverage the information and make strategic decisions. It helps
them optimize operations, improve efficiency, reduce costs, and deliver better customer
experiences.
Like most marketers and marketing managers, you want to know how your efforts are
translating into results each month. How is your website performing? How well are you
converting traffic into leads and customers? Which marketing channels are performing
best? How does organic search compare to paid campaigns and to previous months? You
might have to scramble to put all of this together in a single report, but now you can
have it all at your fingertips in a single Databox dashboard.
Our Monthly Marketing Performance Dashboard includes data from Google Analytics 4
and HubSpot Marketing with key performance metrics like:
1. Website sessions, new users, and new leads. Basic engagement data from
your website. How much traffic? How many new visitors? How many lead
conversions?
2. Lead generation vs goal. Did you reach your goal for lead conversion
for the month, quarter, or year? If not, by how much did you miss?
3. Overall marketing performance. A summary list of the main KPIs for your
website: sessions, contacts, leads, customers, bounce rate, avg. session
duration, pages/session, and pageviews.
5. Blog post traffic. How much traffic did your blog attract during a certain
period?
6. New contacts by source. Which sources drove the highest number of new contacts?
7. Visits and contacts by source. How did your sources compare by both
sessions and new contacts in a certain period of time?
Now you can benefit from the experience of our Google Analytics and HubSpot
Marketing experts, who have put together a plug-and-play Databox template that contains
all the essential metrics for monitoring and analyzing your website traffic and its sources,
lead generation, and more. It’s simple to implement and start using as a standalone
dashboard or in marketing reports, and best of all, it’s free!
Step 2: Connect your HubSpot and Google Analytics 4 accounts with Databox.
We already touched briefly on some of the main differences between data analytics and
reporting, but we also wanted to do a deep dive into each one individually and show you
some interesting things our respondents pointed out.
• Differences in goals
To begin with, analytics and reporting both serve different purposes. If you’re looking to
get an answer to ‘what’s happening’ you need data reporting.
However, if you already have data reports (in simple words: organized and summarized
data) and you need to find out the answer to ‘what now,’ you need to dive into analytics
(and analytics dashboards).
Technically speaking, reporting is a subdivision of analytics and you can’t have analytics
without reporting, but analytics goes a bit further and is generally a more complex
process.
“Analytics looks at the incoming data reports, looks for patterns, delivers insights, and
guides actionable marketing decisions,” Ryan explains.
If you want actionable insights or recommendations from raw data, you’ll first need to
organize and format it – which is what reporting takes care of.
Similarly, reporting without analytics is useless at its core. Because then you have an idea
of what’s happening based on the data gathered, but no way to interpret it into actionable
takeaways to execute.
With this in mind, it’s apparent that their use cases drastically differ.
Sean Carrigan of MobileQubes adds that “analytics is useful for ad hoc interpretation of data to answer specific questions related to user behavior, trends, etc. so that improvements can be implemented.”
Related: Marketing Reporting: The KPIs, Reports, & Dashboard Templates You Need to
Get Started
Since reporting is about formatting and making data easy to understand, it’s more
presentation-oriented than analytics. It typically relies on showcasing data in charts,
graphs, and other visually appealing formats.
The focus is on summarizing key metrics and performance indicators so that shareholders
and managers can easily grasp the information.
On the other hand, analytics outputs are generally in the form of documented insights, recommended actions and strategies, forecasts, ad hoc reports, summary reports, and dashboards.
Eden Cheng from PeopleFinderFree adds that “reporting is utilized to drag details from
the raw data, in the leading form of easy-to-read dashboards of valuable graphs.
Therefore, via reporting, data is carefully arranged and summarized in seamlessly
digestible ways.”
Cheng also mentions that “analytics is one step ahead of reporting and enables you to
question and discover variable data.”
Difference in Goals
The goal of reporting is to present accurate, well-organized summaries of what has happened, so stakeholders can quickly see the state of the business.
On the other hand, analytics is focused on exploring and understanding data in greater
detail to uncover insights and opportunities for improvement. The goal of analytics is to
identify patterns, relationships, and trends within the data that may not be immediately
visible in standard reports. Analytics tools are designed to provide users with the ability
to ask more complex questions, test hypotheses, and gain a deeper understanding of the
data.
Alina Clark of Cocodoc agrees and adds that “the goal of reporting is to change data from
its raw form, which is unintelligible and hard to understand, into an easy-to-visualize
format. The end result of any reporting system is to make the analysis as easy as possible.
At the same time, analytics churns through the data, draws out the problems, and provides
the solution while at it. Any data analysis that doesn’t look at the three stages (problems-
solutions-conclusions) fails to achieve the intended goals in most instances.”
Put simply, the goal of reporting is to organize and summarize data, while the purpose of
analytics is to interpret it and deliver actionable recommendations.
Building a report and preparing for data analytics involve different step-by-step processes.
• Translate the data into a format that can be analyzed and presented
• Develop and design a dashboard or report format that meets the needs of
the audience
• Develop and test analytical models to read the data and extract insights
• Use trend and pattern analysis, and data visualization techniques to communicate
the results
• Make decisions and create strategies based on the insights and recommendations
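The steps above can be sketched end to end: reporting organizes raw records into a readable summary, and analytics interprets that summary into a recommendation (the figures and the decision rule are invented for illustration):

```python
# Raw data: (month, leads generated)
raw = [("Jan", 120), ("Feb", 90), ("Mar", 150)]

# Reporting: organize and summarize the data for presentation
report = {month: leads for month, leads in raw}

# Analytics: interpret the report and recommend an action
trend = report["Mar"] - report["Jan"]
recommendation = "scale up campaigns" if trend > 0 else "revisit strategy"
```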
1. Apache Hadoop:
1. Apache Hadoop is a Java-based, free, open-source big data analytics framework.
2. It helps in the effective storage of a huge amount of data in a storage place known as a cluster.
3. It runs in parallel on a cluster and has the ability to process huge data across all nodes in it.
4. There is a storage system in Hadoop popularly known as the Hadoop Distributed File System (HDFS), which splits large volumes of data and distributes them across the many nodes present in a cluster.
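The map/shuffle/reduce pattern that Hadoop implements can be simulated in a few lines of pure Python; this word-count sketch only mimics the three phases that Hadoop would actually distribute across cluster nodes:

```python
from collections import defaultdict

docs = ["big data tools", "big data streams"]

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word
word_counts = {word: sum(vals) for word, vals in groups.items()}
```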
2. KNIME:
1. The KNIME analytics platform is one of the leading open solutions for data-driven innovation.
2. This tool helps in discovering the potential hidden in huge volumes of data; it can also mine for fresh insights or predict new outcomes.
3. OpenRefine:
1. The OpenRefine tool is one of the most efficient tools for working on messy and large volumes of data.
2. It includes cleansing data and transforming it from one format to another.
3. It helps to explore large data sets easily.
4. Orange:
1. Orange is famous for open-source data visualization and helps with data analysis for beginners as well as experts.
2. This tool provides interactive workflows with a large toolbox option, which helps in analyzing and visualizing data.
5. RapidMiner:
1. The RapidMiner tool operates using visual programming and is capable of manipulating, analyzing, and modelling data.
2. RapidMiner makes data science teams more productive by providing an open-source platform for all their jobs, like machine learning, data preparation, and model deployment.
6. R-programming:
1. R is a free open source software programming language and a software environment
for statistical computing and graphics.
2. It is used by data miners for developing statistical software and data analysis.
3. It has become a highly popular tool for big data in recent years.
7. Datawrapper:
1. It is an online data visualization tool for making interactive charts.
2. It uses data files in CSV, PDF, or Excel format.
3. Datawrapper generates visualizations in the form of bars, lines, maps, etc. It can be embedded into any other website as well.
8. Tableau:
1. Tableau is another popular big data tool. It is simple and very intuitive to use.
2. It communicates the insights of the data through data visualization.
3. Through Tableau, an analyst can check a hypothesis and explore the data
before starting to work on it extensively.
Severe class imbalance between majority and minority classes in Big Data can bias the
predictive performance of Machine Learning algorithms toward the majority (negative) class.
Where the minority (positive) class holds greater value than the majority (negative) class and the
occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to
adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six
sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely
investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-
Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark
framework. The first case study is based on a Medicare fraud detection dataset. The second case
study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test
data from a separate source (POST dataset). Results from the Medicare case study are not
conclusive
regarding the best sampling approach using Area Under the Receiver Operating Characteristic
Curve and Geometric Mean performance metrics. However, it should be noted that the Random
Undersampling approach performs adequately in the first case study. For the SlowlorisBig case
study, Random Undersampling convincingly outperforms the other five sampling approaches
(Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1 ,
SMOTE-borderline2 , ADAptive SYNthetic) when measuring performance with Area Under the
Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its
classification performance in both case studies, Random Undersampling is the best choice as it
results in models with a significantly smaller number of samples, thus reducing computational
burden and training time.
9. Discuss briefly about resampling?
This work addresses the problem of low learning-algorithm accuracy caused by a serious imbalance of big data in the Internet of Things, and proposes a bidirectional self-adaptive resampling algorithm for imbalanced big data. Based on the sizes of the data sets and the imbalance ratios input by the user, the algorithm processes the data using a combination of oversampling for the minority class and distribution-sensitive undersampling for the majority class.
This paper proposes a new distribution-sensitive resampling algorithm. According to the
distribution of samples, the majority and minority samples are divided into different categories,
and different processing methods are adopted for samples with different distribution
characteristics. The algorithm makes the resampled sample set retain the characteristics of the
original data set as much as possible. The algorithm emphasizes the importance of boundary
samples; that is, the samples at the boundary between the majority and minority classes are more
important to the learning algorithm than other samples. Boundary minority samples are copied,
and boundary majority samples are retained. A real-world application is introduced at the end,
which shows that, compared with existing imbalanced-data resampling algorithms, this
algorithm improves the accuracy of the learning algorithm, especially the accuracy and recall of
the minority class.
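A much-simplified sketch of the bidirectional idea (oversample the minority by duplication, undersample the majority, meeting at a common target size); the paper's distribution-sensitive logic is omitted and all names are illustrative:

```python
import random

def bidirectional_resample(data, labels, seed=0):
    """Oversample the minority class (by duplication) and undersample the
    majority class so both meet at a size halfway between the two."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(data, labels):
        by_class.setdefault(y, []).append(x)
    sizes = sorted(len(xs) for xs in by_class.values())
    target = (sizes[0] + sizes[-1]) // 2  # meet-in-the-middle target size
    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:   # majority side: undersample
            chosen = rng.sample(xs, target)
        else:                   # minority side: oversample by duplication
            chosen = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(chosen)
        out_y.extend([y] * len(chosen))
    return out_x, out_y

X = list(range(110))
y = [0] * 100 + [1] * 10
Xr, yr = bidirectional_resample(X, y)  # both classes end up with 55 samples
```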
10. What are the types of modern data analytics tools? Explain it.
• Data management tools, such as Apache Hadoop, Cassandra, and Qubole, that
store and process large amounts of data.
• Data mining tools, such as KNIME, RapidMiner, and Wolfram Alpha, that
extract patterns and insights from data.
• Data visualization tools, such as Tableau Public, Google Fusion Tables, and
NodeXL, that present data in graphical or interactive forms.
• Data analysis techniques, such as in-memory analytics, predictive analytics, and text
mining, that apply algorithms and models to data.
11. Elaborately write about Statistical inference?
The need for new methods to deal with big data is a common theme in most scientific fields,
although its definition tends to vary with the context. Statistical ideas are an essential part of this,
and as a partial response, a thematic program on statistical inference, learning and models in big
data was held in 2015 in Canada, under the general direction of the Canadian Statistical Sciences
Institute, with major funding from, and most activities located at, the Fields Institute for
Research in Mathematical Sciences. This paper gives an overview of the topics covered,
describing challenges and strategies that seem common to many different areas of application
and including some examples of applications to make these challenges and strategies more
concrete.
Big data provides big opportunities for statistical inference, but perhaps even bigger challenges,
especially when compared with the analysis of carefully collected, usually smaller, sets of data.
From January to June 2015, the Canadian Statistical Sciences Institute organised a thematic
program on Statistical Inference, Learning and Models in Big Data. It became apparent within
the first two weeks of the program that a number of common issues arose in quite different
practical settings. This paper arose from an attempt to distil these common themes from the
presentations and discussions that took place during the thematic program.
Scientifically, the program emphasised the roles of statistics, computer science and mathematics
in obtaining scientific insight from big data. Two complementary strands were introduced:
cross- cutting, or foundational, research that underpins analysis, and domain-specific research
that focused on particular application areas. The former category included machine learning,
statistical inference, optimisation, network analysis and visualisation. Topic-specific workshops
addressed problems in health policy, social policy, environmental science, cyber-security and
social networks. These divisions are not rigid, of course, as foundational and application areas
are part of a feedback cycle in which each inspires developments in the other. Some very
important application areas where big data is fundamental were not able to be the subject of
focused workshops, but many of these applications did feature in individual presentations. The
program started with an opening conference.
12. What is prediction error?
• In regression analysis, it’s a measure of how well the model predicts the
response variable.
• In classification (machine learning), it’s a measure of how well samples
are classified to the correct category.
Sometimes, the term is used informally to mean exactly what it means in plain English (you’ve
made some predictions, and there are some errors). In regression, the term “prediction error” and
“Residuals” are sometimes used synonymously. Therefore, check the author’s intent before
assuming they mean something specific (like the mean squared prediction error).
Prediction error can be quantified in several ways, depending on where you’re using it. In
general, you can analyze the behavior of prediction error with bias and variance (Johari, n.d.).
In statistics, the root-mean-square error (RMSE) aggregates the magnitudes of prediction errors.
The Rao-Blackwell theorem can estimate prediction error as well as improve the efficiency of
initial estimators.
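As a concrete example, RMSE aggregates residuals like this (a minimal sketch):

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error: aggregates the magnitudes of prediction errors."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Residuals are 1, -2 and 2, so RMSE = sqrt((1 + 4 + 4) / 3) = sqrt(3)
error = rmse([3.0, 5.0, 8.0], [2.0, 7.0, 6.0])
```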
In machine learning, cross-validation (CV) assesses prediction error and trains the prediction
rule. A second method, the bootstrap, begins by estimating the prediction rule’s sampling
distribution (or the sampling distribution’s parameters); it can also quantify prediction error and
other aspects of the prediction rule.
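A minimal sketch of estimating prediction error by k-fold cross-validation; the deterministic fold scheme and the toy mean-predictor are illustrative assumptions:

```python
def k_fold_prediction_error(xs, ys, fit, predict, k=5):
    """Estimate prediction error (mean squared error) via k-fold CV:
    train on k-1 folds, measure error on the held-out fold, average."""
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]  # deterministic folds
    total, count = 0.0, 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            err = ys[i] - predict(model, xs[i])
            total += err * err
            count += 1
    return total / count

# Toy "prediction rule": always predict the training-set mean of y.
fit = lambda xs, ys: sum(ys) / len(ys)
predict = lambda model, x: model
cv_mse = k_fold_prediction_error(list(range(10)), [2.0] * 10, fit, predict)
```

With a constant response the toy rule is exact, so the cross-validated error is zero; on real data the same loop yields an honest out-of-sample error estimate.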
Unit-II
Answer all the questions
Part-A
1. What is meant by probabilistic techniques?
Examples of probabilistic data structures are as follows: membership query (Bloom filter,
counting Bloom filter, quotient filter, cuckoo filter); cardinality (linear counting, probabilistic
counting, LogLog, HyperLogLog, HyperLogLog++); frequency (count sketch, count-min
sketch); similarity (LSH, MinHash, SimHash); and others.
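As an illustration of the membership-query family, a toy Bloom filter can be sketched as follows; the bit-array size, hash count, and hashing scheme are illustrative choices, not production tuning:

```python
import hashlib

class BloomFilter:
    """Membership-query sketch: may yield false positives, never false negatives."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-length bit array

    def _positions(self, item):
        # Derive k hash positions by salting one cryptographic hash
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
```

Queries for added items always succeed; a query for an absent item fails unless all of its bit positions happen to collide with set bits, which is what makes the structure probabilistic.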
Big data is one of the most discussed topics in business today across information technology
sectors, and most research fields are moving toward using big data tools to leverage the huge
amount of data available today. The traveling salesman problem is one of the problems whose
complexity grows factorially (n!) with the input size; therefore it is important to find algorithms
that can solve large numbers of cities in feasible time and within available memory space.
This article introduces two proposed algorithms that solve the traveling salesman problem by
clustering, using three methods (k-means, Gaussian Mixture Model, and Self-Organizing Map)
to select the best one for the proposed algorithms. The proposed algorithms arrange the cities
(points) in chromosomes for a Genetic Algorithm after clustering the big data to reduce the
problem, solving each cluster separately based on the divide-and-conquer concept. The two
proposed algorithms were tested on different numbers of points; the nearest-points algorithm
solved the traveling salesman problem with 2 million points.
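The cluster-then-solve idea can be sketched as follows; this is a simplified stand-in (plain k-means plus a greedy nearest-neighbour tour) rather than the article's GA-based algorithms, and every function name here is illustrative:

```python
import math
import random

def kmeans(points, k, iters=20, seed=1):
    """Plain k-means clustering of 2-D points (stdlib only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return clusters

def nearest_neighbour_tour(points):
    """Greedy tour: repeatedly hop to the closest unvisited city."""
    tour, rest = [points[0]], list(points[1:])
    while rest:
        nxt = min(rest, key=lambda p: math.dist(tour[-1], p))
        tour.append(nxt)
        rest.remove(nxt)
    return tour

def divide_and_conquer_tsp(points, k=3):
    """Cluster the cities, solve each cluster separately, then concatenate."""
    tour = []
    for cluster in kmeans(points, k):
        if cluster:
            tour.extend(nearest_neighbour_tour(cluster))
    return tour

rng = random.Random(0)
cities = [(rng.random(), rng.random()) for _ in range(30)]
tour = divide_and_conquer_tsp(cities)  # visits every city exactly once
```

Clustering first shrinks each sub-problem, which is the divide-and-conquer step that makes very large instances tractable.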
3.Define stochastic?
Stochastic modelling has evolved in response to other developments in statistics, notably time
series and sequential analysis, and to applications in artificial intelligence, economics and
engineering. Its resurgence in the big data era has led to new advances in both theory and
applications of this microcosm of statistics and data science.
Stochastic processes can be grouped into various categories based on their mathematical
properties.
Data streaming is the process of transmitting, ingesting, and processing data continuously
rather than in batches. It is used to deliver real-time information to users and help them make
better decisions. Big data streaming is a process in which large streams of real-time data are
processed to extract insights and useful trends. Data streaming is a key capability for
event-driven architectures.
Data Lakes have evolved from the batch-based, large scale ingestion platforms to becoming
event-driven as the need for data “now” becomes more and more important. Capturing all
enterprise and external data in one place is now a commodity service. Doing that, plus
providing the data up to the hour and even minute it’s available is the new capability that
enterprises are targeting to continue identifying insights and monetizing their data capabilities.
But the prospect of building out a real-time architecture can be quite overwhelming for the
enterprise with no past experience in this space. Or even for those who do, but are working to
upgrade their technology stack and make a pivot to more modern tools. Below are some of the
considerations one should make when looking to make the move to a real-time or event-driven
architecture.
Pre-processing of data involves a set of key tasks that demand extensive computational
infrastructure and this in turn will make way for better results from your big data strategy.
Moreover, cleanliness of the data would determine the reliability of your analysis and this should
be given high priority while plotting your data strategy.
Since the extracted data tend to be imperfect with redundancies and imperfections, data pre-
processing techniques are an absolute necessity. The bigger the data sets, the more complex
mechanisms are needed to process it before analysis and visualization. Pre-processing prepares
the data and makes the analysis feasible while improving the effectiveness of the results.
Following are some of the crucial steps involved in data pre-processing.
Data cleansing
Cleansing the data is usually the first step in data processing and is done to remove the
unwanted elements as well as to reduce the size of the data sets, which will make it easier for the
algorithms to analyze it. Data cleansing is typically done by using instance reduction techniques.
Instance reduction helps reduce the size of the data set without compromising the quality of
insights that can be extracted from the data. It removes instances and generates new ones to
make the data set compact. There are two major instance reduction algorithms:
Instance selection:
Instance selection is used to identify the best examples from a very large data set with many
instances in order to curate them as the input for the analytics system. It aims to select a subset of
the data that can act as a replacement for the original data set while completely fulfilling the
goal. It will also remove redundant instances and noise.
Instance generation:
Instance generation methods involve replacing the original data with artificially generated data
in order to fill regions in the domain of an issue with no representative examples in the master
data. A common approach is to relabel examples that appear to belong to wrong class labels.
Instance generation thus makes the data clean and ready for the analysis algorithm.
Data normalization
Normalization improves the integrity of the data by adjusting the distributions. In simple words,
it rescales each row to have a unit norm, where the norm is specified by the parameter p, which
denotes the p-norm used. Some popular methods are:
StandardScaler: Standardizes each feature to zero mean and unit variance.
MinMaxScaler: Uses two parameters, an upper and a lower bound, to rescale each feature to a
specific range.
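Both scalers can be implemented in a few lines; the function names below are illustrative (they mirror scikit-learn's naming, but this is a from-scratch sketch, not that library's API):

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """MinMaxScaler-style: rescale values linearly into [lower, upper]."""
    lo, hi = min(values), max(values)
    return [lower + (v - lo) * (upper - lower) / (hi - lo) for v in values]

def standard_scale(values):
    """StandardScaler-style: shift a feature to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

scaled = min_max_scale([10.0, 20.0, 30.0])         # [0.0, 0.5, 1.0]
standardized = standard_scale([10.0, 20.0, 30.0])  # mean 0, variance 1
```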
Data transformation
If a data set happens to be too large in the number of instances or predictor variables, a
dimensionality problem arises. This is a critical issue that obstructs the functioning of most
data mining algorithms and increases the cost of processing. There are two popular methods for
data transformation by dimensionality reduction – Feature Selection and Space Transformation.
Feature selection: It is the process of spotting and eliminating as much unnecessary information
as possible. FS can be used to significantly reduce the probability of accidental correlations in
learning algorithms that could degrade their generalization capabilities. FS will also cut the
search space occupied by features, thus making the process of learning and mining faster. The
ultimate goal is to derive a subset of features from the original problem that describes it well.
Space transformation: Instead of selecting a subset, this approach derives a brand new set of
features by combining the originals. Such a combination can be made to obey certain criteria.
Space transformation techniques ultimately aim to exploit non-linear relations among the
variables.
Missing values
One of the common assumptions with big data is that the data set is complete. In fact, most data
sets have missing values that are often overlooked. Missing values are datums that haven’t been
extracted or stored due to budget restrictions, a faulty sampling process or other limitations in the
data extraction process. Missing values are not something to be ignored, as they could skew your
results.
Fixing the missing values issue is challenging. Handling it without utmost care could easily lead
to complications in data handling and wrong conclusions.
There are some relatively effective approaches to tackle the missing values problem. Discarding
the instances that might contain missing values is the common one but it’s not very effective as
it could lead to bias in the statistical analyses. Apart from this, discarding critical information is
not a good idea. A better and more effective method is to use maximum likelihood procedures to
model the probability functions of the data while also considering the factors that could have
induced the missingness. Machine learning techniques are so far the most effective solution to
the missing values problem.
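As a toy sketch of the two simplest strategies contrasted above (deletion versus imputation; names are illustrative), assuming missing entries are represented as None:

```python
def drop_incomplete(rows):
    """Listwise deletion: discard any row containing a missing value (None)."""
    return [row for row in rows if None not in row]

def mean_impute(rows):
    """Replace each missing value with the mean of its column."""
    columns = list(zip(*rows))
    means = []
    for col in columns:
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return [
        [means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]

data = [[1.0, 4.0], [None, 6.0], [3.0, None]]
dropped = drop_incomplete(data)   # only the first row survives
imputed = mean_impute(data)       # gaps filled with column means 2.0 and 5.0
```

The example shows the bias risk mentioned above: deletion throws away two of three rows, while imputation keeps them at the cost of an assumption about the missing values.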
Noise identification
Data gathering is not always perfect, but data mining algorithms will always assume it to be.
Data with noise can seriously affect the quality of the results, so tackling this issue is crucial.
Noise can affect the input features, the output or both. Noise found in the input is called
attribute noise, whereas noise that creeps into the output is referred to as class noise. If noise is
present in the output, the issue is very serious and the bias in the results will be very high.
There are two popular approaches to remove noise from the data sets. If the noise has affected
the labelling of instances, data polishing methods are used to eliminate the noise. The other
method involves using noise filters that can identify and remove instances with noise from the
data and this doesn’t require modification of the data mining technique.
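A noise filter of the kind described above can be sketched with a k-nearest-neighbour majority vote; this is a simplified illustration (all names are hypothetical), not a specific published filter:

```python
import math
from collections import Counter

def knn_noise_filter(points, labels, k=3):
    """Drop an instance when the majority label among its k nearest
    neighbours disagrees with its own label (class-noise filtering)."""
    kept_points, kept_labels = [], []
    for i, (p, y) in enumerate(zip(points, labels)):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        majority = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        if majority == y:
            kept_points.append(p)
            kept_labels.append(y)
    return kept_points, kept_labels

# Two tight clusters, plus one point inside cluster "a" mislabelled as "b"
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
       (0.05, 0.05)]
lbs = ["a", "a", "a", "b", "b", "b", "b"]  # last label is class noise
clean_pts, clean_lbs = knn_noise_filter(pts, lbs)
```

As the comment says, the filter works on the data alone and requires no modification of the downstream mining algorithm.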
Minimizing the pre-processing tasks
Preparing the data for your data analysis algorithm can involve many more processes depending
on the application’s unique demands. However, basic processes like cleansing, deduplication and
normalization can be avoided in most cases if you choose the right source for data extraction. It’s
highly unlikely that a raw source can give you clean data. As far as web data extraction is
concerned, a managed web scraping service like PromptCloud can give you clean, ready-to-use
data that can be plugged straight into your analytics system.
Structured Data:
• Definition: Structured data is highly organized and follows a specific, predefined format.
It is typically found in relational databases and consists of rows and columns.
• Examples: Examples include data stored in SQL databases, spreadsheets, and CSV files.
Structured data can represent customer records, sales transactions, financial data,
and more.
• Characteristics:
• Data is organized into tables with well-defined schemas.
• Easy to query and analyze using SQL or similar query languages.
• Suitable for traditional business intelligence and reporting.
Semi-Structured Data:
• Definition: Semi-structured data is partially organized but does not conform to a rigid
schema. It may contain tags or key-value pairs that separate and label elements.
• Examples: JSON (JavaScript Object Notation), XML (eXtensible Markup Language), and
NoSQL databases like MongoDB.
• Characteristics:
• Flexible, self-describing structure with no fixed schema.
• Can evolve easily, but is harder to query than tabular data.
Unstructured Data:
• Definition : Unstructured data lacks a predefined structure or format and is typically not
organized in a database-like manner. It includes text, images, videos, audio, and more.
• Examples: Social media posts, emails, documents (e.g., PDFs and Word documents),
multimedia content, and sensor data are all examples of unstructured data.
• Characteristics:
• No fixed schema or format, making it challenging to analyze using
traditional methods.
• Requires advanced techniques like natural language processing (NLP) and
machine learning for analysis.
• Valuable for sentiment analysis, content categorization, and image recognition.
14.What is data visualization techniques?
• Charts and Graphs: Visual representations like bar charts, line graphs, scatter plots, and
pie charts are commonly used to display patterns and trends in data.
• Heatmaps: Heatmaps use color intensity to represent data values, making it easy to
identify hotspots or concentration in large datasets.
• Geospatial Visualization: Mapping data onto geographic maps to reveal spatial patterns,
like geographical information systems (GIS) and location-based data.
• Treemaps: Treemaps display hierarchical data structures, such as folder structures or
organizational hierarchies, using nested rectangles.
15.What is image analysis techniques?
• Image Processing: Techniques like image filtering, segmentation, and edge detection are
used to process and enhance visual data.
• Object Detection: Identifying and locating objects within images or videos, often using
deep learning models like Convolutional Neural Networks (CNNs).
• Image Classification: Categorizing images into predefined classes or labels, commonly
used in applications like content moderation and image tagging.
• Image Recognition: Going beyond classification to recognize specific objects, scenes, or
patterns within images.
16.What is video analysis techniques?
• Time Series Plots: Visualizing data changes over time, useful for monitoring trends and
patterns.
• Gantt Charts: Displaying project timelines and scheduling information.
• Calendar Heatmaps: Visualizing data patterns across days, weeks, or months on
a calendar grid.
20.What is interactive and dynamic visualization?
Part-B
1.Explain about adaptive search by evaluation?
Random search algorithms are very useful for simulation optimization, because they are
relatively easy to implement and typically find a “good” solution quickly. One drawback is that
strong convergence results to a global optimum require strong assumptions on the structure of
the problem.
This chapter begins by discussing optimization formulations for simulation optimization that
combine expected performance with a measure of variability, or risk. It then summarizes
theoretical results for several adaptive random search algorithms (including pure adaptive search,
hesitant adaptive search, backtracking adaptive search and annealing adaptive search) that
converge in probability to a global optimum on ill-structured problems. More importantly, the
complexity of these adaptive random search algorithms is linear in dimension, on average.
While it is not possible to exactly implement stochastic adaptive search with the ideal linear
performance, this chapter describes several algorithms utilizing a Markov chain Monte Carlo
sampler known as hit-and-run that approximate stochastic adaptive search. The first optimization
algorithm discussed that uses hit-and-run is called improving hit-and-run, and it has polynomial
complexity, on average, for a class of convex problems. Then a simulated annealing algorithm
and a population based algorithm, both using hit-and-run as the candidate point generator, are
described. A variation to hit-and-run that can handle mixed continuous/integer feasible regions,
called pattern hit-and-run, is described. Pattern hit-and-run retains the same convergence results
to a target distribution as hit-and-run on continuous domains. This broadly extends the class of
the optimization problems for these algorithms to mixed continuous/integer feasible regions.
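As a minimal illustration of the adaptive random-search family, here is improving random search over a box-constrained problem; note this is not hit-and-run itself (whose candidates move along random directions from the incumbent), and all names are illustrative:

```python
import random

def improving_random_search(f, bounds, iters=2000, seed=7):
    """Sample the feasible box uniformly; keep a candidate only when it
    improves on the incumbent (minimisation)."""
    rng = random.Random(seed)
    best = [rng.uniform(lo, hi) for lo, hi in bounds]
    best_val = f(best)
    for _ in range(iters):
        candidate = [rng.uniform(lo, hi) for lo, hi in bounds]
        val = f(candidate)
        if val < best_val:  # accept improving points only
            best, best_val = candidate, val
    return best, best_val

# Minimise the sphere function on [-5, 5]^2; the optimum is 0 at the origin.
sphere = lambda x: sum(v * v for v in x)
point, value = improving_random_search(sphere, [(-5.0, 5.0)] * 2)
```

The algorithm is trivial to implement and quickly finds a "good" solution, which is exactly the appeal noted above; the convergence-rate guarantees, however, belong to the more structured variants such as pure adaptive search and improving hit-and-run.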
In today's technologically advanced world, big data is prominent as the world's new currency.
The term big data does not denote a framework, language or technology; big data is really a
problem statement. In the current era a large number of IoT-enabled devices are producing data
in huge amounts, coming from different datasets at an enormous rate. As data increases
exponentially every year, traditional systems for storing and processing it have become
incapable of handling it; the existing technologies cannot cope with big data. In this digital
world, data is generated automatically by the online interactions of big data applications, and
big data is used in the evaluation of emerging forms of information. In the last two years data
has grown at an enormous exponential speed compared to the last twenty years, and human life
in this era depends heavily on IoT. This paper presents the overall changes and growth in big
data analytics in recent years. Innovations in technology and the greater affordability of digital
devices with internet access have made a new world of data, called big data. The data captured
by enterprises, with the rise of IoT and multimedia, has produced an overwhelming flow of data
in either structured or unstructured format. It is a fact that data that is too big to process is also
too big to transfer anywhere, so it is the analytical program that needs to be moved (not the
data), and this is only possible with cloud computing.
Big Data Analysis using the Genetic Algorithm: The field of information theory refers to big
data as datasets whose rate of increase is exponentially high, so that within a small span of time
it becomes very painful to analyze them using typical data mining tools. Such data sets result
from the daily capture of stock exchange data, credit card users' usage trends, insurance
cross-line capture, health care services, etc. In real time these data sets keep increasing and with
the passage of time create complex scenarios, so typical data mining tools need to be
empowered by computationally efficient and adaptive techniques to increase their degree of
efficiency. Using GA over data mining creates robust, computationally efficient and adaptive
systems. In the past there have been several studies on data mining using statistical techniques;
the statistics that have contributed heavily are ANOVA, ANCOVA, the Poisson distribution,
and random indicator variables, among others. The biggest drawback of any statistical tactic
lies in its tuning: with the exponential explosion of data, this tuning takes ever more time and
inversely affects the throughput, and due to their static nature, complex hidden patterns are
often left out. The idea here is to use genes to mine data with great efficiency, and to show how
this mined data can be used effectively for different purposes. Rather than sticking to the
general notion of probabilities, the concept of Expectations is used here, with the theory of
Expectations modified to achieve the desired results. Any data comprises three main
components: the constants, the variables and the variants. The constants comprise data that
remains practically unaltered in a given span of time; the variables change with time; and in the
case of variants it is not clear whether they will behave as constants or variables. So, taking this
as the first step, we have three sets, each containing the respective data as stated. We then
calculate the expectancy of each datum inside the data set.
A new algorithm called multi-objective genetic programming (MOGP) is proposed for complex civil
engineering systems. The proposed technique effectively combines the model structure selection
ability of a standard genetic programming with the parameter estimation power of classical
regression, and it simultaneously optimizes both the complexity and goodness-of-fit in a system
through a non-dominated sorting algorithm. The performance of MOGP is illustrated by
modeling a complex civil engineering problem: the time-dependent total creep of concrete. A
Big Data is used for the model development so that the proposed concrete creep model—referred
to as a “genetic programming based creep model” or “G-C model” in this study—is valid for
both normal and high strength concrete with a wide range of structural properties. The G-C
model is then compared with currently accepted creep prediction models. The G-C model
obtained by MOGP is simple, straightforward to use, and provides more accurate predictions
than other prediction models.
Introduction
Different techniques can be used for modeling nonlinear systems in structural engineering, and
the models obtained from these techniques can be broadly categorized into two groups:
phenomenological (or knowledge-based) and behavioral. Phenomenological models consider the
physical laws governing the system (such as energy, momentum, etc.). In these models, the
structure of the system should be selected by the model developer based on the physical laws,
which requires prior knowledge about the system. Due to the complexity of many structural
engineering systems/phenomena (such as modeling of concrete shrinkage and creep), it is not
always possible to derive such models. In contrast to phenomenological models, behavioral
models can be easily developed by finding the relationships between input variables and outputs
for a set of experimental data without considering the physical theories. For developing
behavioral models, no prior knowledge is needed about the mechanism or fundamental theory
that produced the experimental data. Therefore, behavioral modeling techniques can be used for
approximate modeling of many structural engineering systems [1], [2].
While behavioral models can be advantageous, many behavioral models require the user to pre-
specify/hypothesize the formulation structure of the model. In other words, behavioral
techniques optimize the unknown coefficients of a pre-defined formulation structure. In
particular, regression analysis is a commonly used technique for developing behavioral models.
Although this technique can be used for developing both linear and nonlinear models, it has a
strong sensitivity to outliers and can exhibit large model errors due to the idealization of
complex processes, approximation, and averaging widely varying prototype conditions [3], [4].
Furthermore, for linear regressions, the least square estimate of unknown parameters can be
obtained analytically, while nonlinear regressions typically use an iterative optimization
procedure to estimate the unknown parameters, which requires the user to provide starting
values. Failure in defining the appropriate starting values can lead to convergence problems or
finding the local minimum rather than a global minimum in the optimization process. Therefore,
using traditional techniques such as regression analysis cannot guarantee that a reliable and
accurate behavioral model will be obtained, particularly for complex nonlinear engineering
systems.
Although ANNs are generally successful in prediction, they are only appropriate to use as part of
a computer program, not for the development of practical prediction equations. In addition, ANN
requires data to be initially normalized based on the suitable activation function and the best
network architecture to be determined by the user, and it can have a complex structure and a high
potential for over-fitting [10]. SVMs, on the other hand, are one of the efficient kernel-based
methods that can solve a convex constrained quadratic programming (CCQP) problem to find a
set of parameters. However, selecting the appropriate kernel in SVM can be a challenge, and the
results are not transparent [11].
One powerful technique for developing nonlinear behavioral models in the case of complex
optimization problems is genetic programming (GP) [12]. GP is a specialized subset of genetic
algorithms (GAs) [13], which are based on the principles of genetics and natural selection. GP
and its variants have been successfully used for solving a number of different civil engineering
problems (e.g., [14], [15]). Multi-gene genetic programming (MGGP) is a robust variant of GP
that combines the ability of the standard GP in constructing the model structure with the
capability of traditional regression in parameter estimation. In this technique, each symbolic
model (and each member of the GP population) is a weighted linear combination of low order
non-linear transformations of the input variables. In contrast to standard symbolic regression,
MGGP allows the evolution of accurate and relatively compact mathematical models. Even
when large numbers of input variables are used, this technique can automatically select the most
contributing variables in the model, formulate the structure of the model, and solve for the
coefficients in the regression equation [16], [17], [18], [19]. Therefore, unlike other techniques
such as traditional regression analysis or ANN, there is no need in the MGGP technique for the
user to pre-define the formulation structure of the model or select any existing form of the
relationship for optimization [3], [4], which makes it more practical for complex optimization
problems. Recent studies also show that compared to other novel computer-based techniques
such as SVM and particle swarm model selection, GP shows better performance in problems
having high dimensionality and large training sets [20].
Typically, standard GP algorithms (including MGGP) will optimize only one objective in the
model development process: maximizing the goodness-of-fit to the training data. The main
drawback of using a single objective in the optimization process is that the developed models can
become overly complex. In other words, minimizing the complexity of the developed models
should be another important objective to be considered. In this study, a new algorithm called
multi-objective genetic programming (MOGP) is developed. MOGP is an extension of standard
GP algorithms that can simultaneously solve for two competing objectives (i.e. maximizing the
goodness-of-fit and minimizing the model complexity). By performing multi-gene symbolic
regression via MOGP, one can develop parsimonious and accurate data-based models for
complex engineering systems.
Big Data visualization techniques (charts, maps, interactive content, infographics, motion
graphics, scatter plots, regression lines, timelines, for example) enable companies'
decision-makers to get results by better understanding their processes and stakeholders.
Software supports multiple, high-volume feeds of raw data to provide instant analysis of facts,
trends, and patterns. Big data visualization is a remarkably powerful business capability.
According to IBM, every day 2.5 quintillion bytes of data are created from social media,
sensors, webpages, and all kinds of management systems, and businesses are using that data to
control their processes.
By revealing correlations between thousands of variables available in the big data world,
technologies can present massive amounts of data in an understandable way, which means Big
Data visualization initiatives combine IT and management projects.
These types include:
• Temporal: data is linear and one dimensional
• Hierarchical: visualizes ordered groups within a larger group
• Network: visualizes the connections between datasets
• Multidimensional: the contrast of the temporal type, with many dimensions
• Geospatial: involves geospatial or spatial maps
• Miscellaneous: other types of visualizations
There is no clear consensus on the boundaries between these fields, but broadly speaking,
scientific visualization deals with data that has a natural geometric structure, while information
visualization deals with more abstract data structures.
Structured Data
• Structured data can be crudely defined as the data that resides in a fixed field within
a record.
• It is the type of data most familiar to us in everyday life, for example: birthday, address.
• A certain schema binds it, so all the data has the same set of properties. Structured
data is also called relational data. It is split into multiple tables to enhance the
integrity of the data by creating a single record to depict an entity. Relationships are
enforced by the application of table constraints.
• The business value of structured data lies within how well an organization can
utilize its existing systems and processes for analysis purposes.
Name      Class   Section   Roll No   Grade
Geek 1    11      A         1         A
Geek 2    11      A         2         B
Geek 3    11      A         3         A
Semi-Structured Data
• Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly organized into
rows and columns like that in a spreadsheet. However, there are some features like
key-value pairs that help in discerning the different entities from each other.
• Since semi-structured data doesn’t need a structured query language, it is commonly
called NoSQL data.
• A data serialization language is used to exchange semi-structured data across
systems that may even have varied underlying infrastructure.
• Semi-structured content is often used to store metadata about a business process but
it can also include files containing machine instructions for computer programs.
• This type of information typically comes from external sources such as social media
platforms or other web-based data feeds.
Data is created in plain text so that different text-editing tools can be used to draw valuable
insights. Due to a simple format, data serialization readers can be implemented on hardware
with limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write memory-based data to files, and to
transmit, store, and parse it. The sender and the receiver don't need to know about the other
system: as long as the same serialization language is used, the data can be understood by both
systems comfortably. There are three predominantly used serialization languages.
1. XML– XML stands for eXtensible Markup Language. It is a text-based markup language
designed to store and transport data. XML parsers can be found in almost all popular
development platforms. It is human and machine-readable. XML has definite standards for
schema, transformation, and display. It is self-descriptive. Below is an example of a
programmer’s details in XML.
XML
<ProgrammerDetails>
<FirstName>Jane</FirstName>
<LastName>Doe</LastName>
<CodingPlatforms>
<CodingPlatform Type="Fav">GeeksforGeeks</CodingPlatform>
<CodingPlatform Type="2ndFav">Code4Eva!</CodingPlatform>
<CodingPlatform Type="3rdFav">CodeisLife</CodingPlatform>
</CodingPlatforms>
</ProgrammerDetails>
<!--The 2ndFav and 3rdFav Coding Platforms are imaginative because Geeksforgeeks
is the best!-->
XML expresses the data using tags (text within angular brackets) to shape the data (for example,
FirstName) and attributes (for example, Type) to describe the data. However, since it is a verbose
and voluminous language, other formats have gained more popularity.
2. JSON– JSON (JavaScript Object Notation) is a lightweight open-standard file format for
data interchange. JSON is easy to use and uses human/machine-readable text to store and
transmit data objects.
JSON
{
  "firstName": "Jane",
  "lastName": "Doe",
  "codingPlatforms": [
    { "type": "Fav", "value": "Geeksforgeeks" },
    { "type": "2ndFav", "value": "Code4Eva!" },
    { "type": "3rdFav", "value": "CodeisLife" }
  ]
}
This format isn't as formal as XML; it's more of a key/value pair model than a formal data
depiction. JavaScript has built-in support for JSON. Although JSON is very popular among
web developers, non-technical personnel find it tedious to work with due to its heavy
dependence on JavaScript and structural characters (braces, commas, etc.)
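To illustrate why the key/value model is convenient to work with, here is a short Python sketch (using only the standard library; the field names follow the example above) that parses the programmer record and pulls values out of it:

```python
import json

# the programmer's details from the JSON example above
doc = """
{
  "firstName": "Jane",
  "lastName": "Doe",
  "codingPlatforms": [
    { "type": "Fav", "value": "Geeksforgeeks" },
    { "type": "2ndFav", "value": "Code4Eva!" },
    { "type": "3rdFav", "value": "CodeisLife" }
  ]
}
"""
record = json.loads(doc)  # parse the text into nested dicts and lists
print(record["firstName"])                              # Jane
print([p["value"] for p in record["codingPlatforms"]])  # all three platforms
```

Once parsed, the document is ordinary dictionaries and lists, so no query language is needed to navigate it.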
3. YAML– YAML (YAML Ain't Markup Language) is a human-readable data serialization
language. It relies on indentation rather than brackets, which makes it popular for configuration
files. The programmer's details above could be written in YAML as:
YAML
firstName: Jane
lastName: Doe
codingPlatforms:
  - type: Fav
    value: Geeksforgeeks
  - type: 2ndFav
    value: Code4Eva!
  - type: 3rdFav
    value: CodeisLife
Unstructured Data
• Unstructured data is the kind of data that doesn’t adhere to any definite schema or
set of rules. Its arrangement is unplanned and haphazard.
• Photos, videos, text documents, and log files can be generally considered
unstructured data. Even though the metadata accompanying an image or a video
may be semi-structured, the actual data being dealt with is unstructured.
• Additionally, Unstructured data is also known as “dark data” because it cannot be
analyzed without the proper software tools.
Big data visualization makes a difference
John Tukey, a celebrated mathematician and researcher, once said: "The greatest value of a
picture is when it forces us to notice what we never expected to see." And our data visualization
team couldn't agree more. Visualization allows business users to look beyond individual data
records and easily identify dependencies and correlations hidden inside large data sets. Here are
examples of how big data analysis results can look with and without well-implemented data
visualization.
Example 1: Analysis of industrial data
In some cases, the maintenance team can skip the 'looking for insights' part and just get notified
by the analytical system that part 23 at machine 245 is likely to break down. Nevertheless, the
maintenance team is unlikely to be satisfied with instant alerts only. They should be proactive,
not just reactive, in their work, and for that they need to know dependencies and trends. Big data
visualization helps them get the required insights. For example, if the maintenance team would
like to understand the connections between machinery failures and the events that trigger them,
they should look at connectivity charts for insights.
Example 2: Analysis of social comments
Imagine a retailer operating nationwide. One customer may visit their store and post on
Facebook: "Guys, if you haven't bought Christmas presents yet, go to [the retailer's name]."
Another customer may share on Twitter: "I hate New Year time! I've never seen lines that long! I
wasted an hour at [the retailer's name] today. And the staff was rude. Hate this place!" A third
customer may post on Instagram: "Look what a gorgeous reindeer sweater I bought at [the
retailer's name]!" The company's customer base is 20+ million. It would be impossible for the
retailer to browse all over the internet in search of all the comments and reviews and try to get
insights just by scrolling through and reading them. To automate these tasks, companies resort
to sentiment analysis. And to get instant insights into the analysis results, they apply big data
visualization. For example, word clouds demonstrate the frequency of the words used: the
higher the frequency, the bigger a word's font. So, if the biggest words are hate, awful, terrible,
failed, and their likes, it's high time to react.
Example 3: Analysis of customer behavior
Companies use a similar scenario to analyze customer behavior. They strive to implement big
data solutions that allow gathering detailed data about purchases in brick-and-mortar and
online stores, browsing history and engagement, GPS data and data from the customer mobile
app, calls to the support center, and more. Registering billions of events daily, a company is
unable to identify trends in customer behavior just by looking at individual records.
With big data visualization, ecommerce retailers, for instance, can easily notice the change in
demand for a particular product based on page views. They can also understand the peak times
when visitors make most of their purchases, as well as look at the share of coupon redemption,
etc.
Most frequently used big data visualization techniques
Earlier, we studied on practical examples how companies can benefit from big data
visualization; now we'll give an overview of the most widely used data visualization techniques.
Symbol maps
The symbols on such maps differ in size, which makes them easy to compare. Imagine a US
manufacturer who has recently launched a new brand. The manufacturer is interested to know
which regions particularly liked the brand. To find out, they can use a map with symbols
representing the number of customers who liked the product (left a positive comment in social
media, rated the new product highly in a customer survey, etc.)
Line charts
Line charts allow looking at the behavior of one or several variables over time and identifying
trends. In traditional BI, line charts can show sales, profit, and revenue development for the last
12 months. When working with big data, companies can use this visualization technique to track
total application clicks by week, the average number of complaints to the call center by month,
etc.
Pie charts
Pie charts show the components of the whole. Companies that work with both traditional and
big data may use this technique to look at customer segments or market shares. The difference
lies in the sources from which these companies take raw data for the analysis.
Bar charts
Bar charts allow comparing the values of different variables. In traditional BI, companies can
analyze their sales by category, the costs of marketing promotions by channel, etc. When
analyzing big data, companies can look at visitors' engagement with their website's multiple
pages, the most frequent pre-failure cases on the shop floor, and more.
Heat maps
Heat maps use colors to represent data. A user may encounter a heat map in Excel that
highlights sales in the best performing store with green and in the worst performing with red. If
a retailer wants to know the most frequently visited aisles in the store, they will also use a heat
map of their sales floor. In this case, the retailer will analyze big data, such as data from a video
surveillance system.
How to avoid mistakes related to big data visualization?
The main purpose of big data visualization is to provide business users with insights. Choosing
the right visualization tool among the variety of options on the market
(Microsoft Power BI, Tableau, QlikView, and Sisense are just a few of the product names) and
applying the right techniques to create uncluttered and intuitive dashboards may be a more
complicated task than it seems. If you feel that you need assistance with this, you can involve
big data consultants to help you choose the most suitable visualization solution and/or
customize it. Read more at https://www.scnsoft.com/blog/big-data-visualization-techniques
9. Explain about search by simulated annealing?
In a situation like the one shown above, gradient descent gets stuck at a local minimum if it
starts at the indicated point; it cannot go on to reach the global minimum. In cases like these,
simulated annealing proves useful.
Simulated annealing is an algorithm based on the physical annealing process used in metallurgy.
During physical annealing, the metal is heated up until it reaches its annealing temperature and
is then gradually cooled down to change it into the desired shape. It is based on the principle
that the molecular structure of the metal is weak when it is hot and can be changed easily,
whereas when it cools down it becomes hard and changing the shape of the metal becomes
difficult.
Simulated annealing has a probabilistic way of moving around in a search space and is used for
optimizing model parameters. It mimics physical annealing as a temperature parameter is used
here too.
The higher the temperature, the more likely the algorithm is to accept a worse solution. This
expands the search space, unlike gradient descent, and allows the algorithm to travel down an
unpromising path; this promotes exploration.
The lower the temperature, the less likely it is to accept a worse solution. This tells the algorithm
that once it is in the right part of the search space, it does not need to search any other parts and
should instead focus on finding the global maximum by converging; this promotes exploitation.
The main difference between a greedy search and simulated annealing is that a greedy search
always goes for the best option, whereas simulated annealing accepts a worse solution with
some probability (given by the Boltzmann distribution).
Algorithm
For a function h(·) we are trying to maximize, the steps of the simulated annealing algorithm
are as follows:
1. Generate an initial solution x and set an initial temperature t1.
2. Define a neighborhood from which candidate solutions are drawn.
3. For the n iterations i=1,2,...,n, loop through the following steps until the
termination condition is reached:
• Generate a candidate solution x' in the neighborhood of x, and compute
Δh = h(x') − h(x) and the acceptance probability
p = exp(Δh/ti)
• If Δh is greater than zero, it means that our new solution is better and we accept it.
If it is less than zero, then we generate a random number u ~ U(0,1). We accept
the new solution x' if u ≤ p.
• We then reduce the temperature t using a temperature reduction function α.
Temperature reduction functions like t = t - α or t = t * α may be used here.
The termination condition here may be reaching a particular temperature or a performance
threshold.
Note that if the temperature is high, say 100, then the probability of accepting the candidate
solution comes out high when we substitute it into the formula. As the temperature gets closer
to 0, the algorithm behaves like the greedy hill-climbing algorithm.
Properties of simulated annealing:
• It does not rely on restrictive properties of the model and hence is versatile.
• The precision of the numbers used in its implementation has a significant effect on
the quality of the results.
• There is a tradeoff between the quality of the result and the time taken for the
algorithm to run.
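The algorithm above can be sketched in a few lines of Python. The objective h, the neighbor function, and the geometric cooling schedule below are illustrative choices, not prescribed by the text:

```python
import math
import random

def simulated_annealing(h, x0, neighbor, t0=100.0, alpha=0.95, n_iter=1000, t_min=1e-3):
    """Maximize h, accepting worse candidates with probability exp(dh/t)."""
    x, t = x0, t0
    best = x
    for _ in range(n_iter):
        if t < t_min:                      # termination: temperature threshold
            break
        x_new = neighbor(x)                # candidate from the neighborhood of x
        dh = h(x_new) - h(x)
        if dh > 0 or random.random() <= math.exp(dh / t):
            x = x_new                      # accept better, or worse with prob. p
        if h(x) > h(best):
            best = x                       # track the best solution seen so far
        t *= alpha                         # geometric cooling, t = t * alpha
    return best

# toy run: maximize h(x) = -(x - 3)^2, whose global maximum is at x = 3
random.seed(0)
result = simulated_annealing(lambda x: -(x - 3) ** 2, x0=0.0,
                             neighbor=lambda x: x + random.uniform(-1, 1))
print(round(result, 2))
```

Early on, the high temperature lets the search wander (exploration); as t shrinks, only improvements are accepted and the run converges near the maximum (exploitation).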
The term random function is also used to refer to a stochastic or random process,[25][26] because a
stochastic process can also be interpreted as a random element in a function space.[27][28] The
terms stochastic process and random process are used interchangeably, often with no
specific mathematical space for the set that indexes the random variables.[27][29] But often these
two terms are used when the random variables are indexed by the integers or an interval of
the real line.[5][29] If the random variables are indexed by the Cartesian plane or some higher-
dimensional Euclidean space, then the collection of random variables is usually called a random
field instead.[5][30] The values of a stochastic process are not always numbers and can be vectors
or other mathematical objects.[5][28]
Based on their mathematical properties, stochastic processes can be grouped into various
categories, which include random walks,[31] martingales,[32] Markov processes,[33] Lévy
processes,[34] Gaussian processes,[35] random fields,[36] renewal processes, and branching
processes.[37] The study of stochastic processes uses mathematical knowledge and techniques
study of stochastic processes uses mathematical knowledge and techniques
from probability, calculus, linear algebra, set theory, and topology[38][39][40] as well as branches of
mathematical analysis such as real analysis, measure theory, Fourier analysis, and functional
analysis.[41][42][43] The theory of stochastic processes is considered to be an important contribution
to mathematics[44] and it continues to be an active topic of research for both theoretical reasons
and applications.[45][46][47]
Data and its visual representation should move together to ensure the data is effectively
employed.
Today’s companies collect and store vast amounts of information that would take years for
a human to read and understand.
Visualization resources rely on powerful tools to interpret raw data and process it to
generate visual representations that allow humans to take in and understand enormous
amounts of data in a few minutes.
Big Data visualization describes data of almost any type — numbers, trigonometric functions,
linear algebra, geometric or statistical algorithms — in a visual format — code, report analytics,
graphical interaction — that makes it easy to understand and interpret.
Thus, it goes far beyond typical graphs, bubble plots, histograms, pie, and donut charts to more
complex representations like heat maps and box-and-whisker plots, enabling decision-makers to
explore data sets to identify correlations or unexpected patterns.
The amount of data is growing every year thanks to the Internet and innovations such as
operational systems, sensors, and the Internet of Things.
The problem for companies is that data is only useful if valuable insights can be extracted from
large amounts of raw data and read, in near real time, by people who can analyze them.
Big data visualization helps to:
• Enable decision-makers to understand what the amount of data means very quickly;
• Capture trends — the use of appropriate techniques can make it easy to recognize this
information;
• Reveal patterns — identify correlations and unexpected connections that could not be
found with specific questions; and
• Provide a highly effective way to communicate any insights that surface to others.
Big Data visualization provides a relevant suite of techniques for gaining a qualitative
understanding.
Charts
Charts use elements to match the values of variables and compare multiple components, showing
the relationship between data points.
• Line chart — the comparable elements are lines, which can help to analyze the peaks and
falls of a variable on an axis, such as sales volume over a period.
• Pie and donut charts — they are used to compare parts of the whole, such as
components of one category. The angle and the arc of each sector correspond to
the illustrated value, and the distance from the center evaluates their importance.
• Bar chart — each value is displayed by a bar, either vertical or horizontal. It is
not recommended when values are very close to each other.
Plots
• Scatter (X-Y) plot — shows the mutual variation of two data items (axis X and Y).
• Bubble plot — it has the same scatter plot concept, but the markers are bubbles.
The main difference is the bubble size, the third measure that represents another
variable.
• Histogram plot — represents the frequency distribution of a variable over a set of intervals (bins).
Maps
Maps make it possible to position data points on different objects and areas, such as layouts,
geographical maps, and building plans. They could be heat maps or dot distribution maps.
Big Data also pushes companies to find new ways of data visualization — semistructured and
unstructured data require new visualization techniques. You can try to use some of the
ones below to address these challenges.
Kernel density estimation
If we do not have enough knowledge about the amount and the distribution of the data, it can
best be visualized with this Big Data visualization technique, which represents the probability
distribution function.
Box and whisker plot
It shows the distribution of massive data, often to understand the outliers in the data in a
graphical display of five statistics:
• Minimum;
• Lower quartile;
• Median;
• Upper quartile; and
• Maximum.
Extreme values are represented by whiskers that extend out from the edges of the box.
Downloaded by BARATH S (htarab86@gmail.com)
lOMoARcPSD|20574153
Word clouds
It represents the frequency of a word within a body of the text: the bigger the word, the more
relevant it is.
Network diagrams
They represent relationships as nodes and ties, for example to analyze social networks or to
map product sales across geographic areas.
Correlation matrices
They are used to summarize data, as input and output for advanced analyses, and allow quick
identification of relationships between variables with fast response times.
Big Data visualization tools need to support multiple and high amounts of data sources and
provide instant analysis. Users can better understand information by designs and dashboards to
discover correlations, trends, and patterns in data. The main tools to build a decision-making
platform are:
• Visual.ly
• Microsoft Power BI
• Sisense
• Periscope Data
• Zoho Analytics
• IBM Cognos Analytics
• Tableau Desktop
• Qlik solutions — QlikSense and QlikView
• Oracle Visual Analyzer
• FineReport
Visual.ly is a new way to think about content creation and data visualization for your
company — capture more relevant information with visuals to deliver better content faster.
By using charts, maps, interactive content, infographics, motion graphics, explaining videos,
histograms, scatter plots, regression lines, timelines, treemaps, and word clouds, the
Visual.ly platform reaches more details from data to leverage businesses’ results and
generate better opportunities for brands.
Unit-III
Answer all the questions
Part-A
1. What is Data stream?
Data streaming is the process of transmitting, ingesting, and processing data continuously
rather than in batches. It is used to deliver real-time information to users and help them make
better decisions. Big data streaming is a process in which large streams of real-time data are
processed to extract insights and useful trends. Data streaming is a key capability for
organizations that want to generate analytic results in real time.
2. What is meant transactional data stream?
Transactional data streams log interactions between entities, such as credit card purchases, web
clickstreams, and phone calls. When used right, transactional data can be a key source of
business intelligence: in big data analytics, transactional data is vital to understand peak
transaction volume, peak ingestion rates, and peak data arrival rates.
3. Write short notes on measurement data stream?
Measurement data streams monitor the evolution of entity states, such as physical phenomena,
road traffic, temperature, or a network. In general, data streams are either transactional (i.e.,
logs of interactions between entities, such as credit card purchases, web clickstreams, phone
calls) or measurement streams, and continuous queries over them can be used for monitoring,
alerting, security, personalization, etc.
4. What are the examples in data stream?
Examples
Some real-life examples of streaming data include use cases in every industry, including real-
time stock trades, up-to-the-minute retail inventory management, social media feeds, multiplayer
game interactions, and ride-sharing apps.
For example, when a passenger calls Lyft, real-time streams of data join together to create a
seamless user experience. Through this data, the application pieces together real-time location
tracking, traffic stats, pricing, and real-time traffic data to simultaneously match the rider with
the best possible driver, calculate pricing, and estimate time to destination based on both real-
time and historical data.
In this sense, streaming data is the first step for any data-driven organization, fueling big data
ingestion, integration, and real-time analytics.
Typical applications include:
1. Fraud detection
2. Real-time stock trading
3. Customer experience
4. Monitoring and reporting on internal IT systems
Stream Processing is used by organizations in various industries to keep up with data from
billions of "things". Stream processing is useful in use cases where we can detect a problem and
have a reasonable response that improves the outcome. Following are some of the use cases:
• Algorithmic Trading
• Stock Market Surveillance
• Smart Patient Care
• Monitoring a production line
Stream processing architectures help simplify the data management tasks required to consume,
process, and publish the data securely and reliably.
Part-B
1. Explain about stream concepts?
2. Image Data –
Satellites frequently send down to earth streams containing many terabytes of
images per day. Surveillance cameras produce images with lower resolution than
satellites, but there can be very many of them, each producing a stream of images
at intervals of one second.
Before we get to streaming data architecture, it is vital that you first understand streaming data.
Streaming data is a general term used to describe data that is generated continuously at high
velocity and in large volumes.
A stream data source is characterized by continuous time-stamped logs that document events in
real time.
Examples include a sensor reporting the current temperature, or a user clicking a link on a web
page. Stream data sources include:
• IoT sensors
An effective streaming architecture must account for the distinctive characteristics of data
streams which tend to generate copious amounts of structured and semi-structured data that
requires ETL and pre-processing to be useful.
Due to its complexity, stream processing cannot be solved with one ETL tool or database. That’s
why organizations need to adopt solutions consisting of multiple building blocks that can be
combined with data pipelines within the organization’s data architecture.
Although stream processing was initially considered a niche technology, it is hard to find a
modern business that does not have an eCommerce site, an online advertising strategy, an app, or
products enabled by IoT.
Each of these digital assets generates real-time event data streams, thus fueling the need to
implement a streaming data architecture capable of handling powerful, complex, and real-time
analytics.
Stream Computing
The stream processing computational paradigm consists of assimilating data readings from
collections of software or hardware sensors in stream form (i.e., as an infinite series of tuples),
analyzing the data, and producing actionable results, possibly in stream format as well.
In a stream processing system, applications typically act as continuous queries, ingesting data
continuously, analyzing and correlating the data, and generating a stream of results.
Applications are represented as data-flow graphs composed of operators and interconnected by
streams, as shown in the figure. The individual operators implement algorithms for data
analysis, such as parsing, filtering, feature extraction, and classification. Such algorithms are
typically single-pass because of the high data rates of external feeds (e.g., market information
from stock exchanges, environmental sensor readings from sites in a forest, etc.).
Stream processing applications are usually constructed to identify new information by
incrementally building models and assessing whether new data deviates from model predictions
and, thus, is interesting in some way. For example, in a financial engineering application, one
might be constructing pricing models for options on securities, while at the same time detecting
mispriced quotes, from a live stock market feed. In such an application, the predictive model
itself might be refined as more market data and other data sources become available (e.g., a feed
with weather predictions, estimates on fuel prices, or headline news).
Streams applications may consist of dozens to hundreds of analytic operators, deployed on
production systems hosting many other potentially interconnected stream applications,
distributed over a large number of processing nodes.
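The data-flow view described above can be sketched with Python generators, where each operator consumes one stream and produces another. The quote format and operator names here are made up for illustration:

```python
def parse(lines):
    """Operator 1: parse raw feed lines into (symbol, price) tuples."""
    for line in lines:
        sym, price = line.split(",")
        yield sym, float(price)

def keep(quotes, symbol):
    """Operator 2: filter the stream down to one symbol."""
    for sym, price in quotes:
        if sym == symbol:
            yield price

def running_mean(prices):
    """Operator 3: incrementally maintain the mean price (single-pass)."""
    total = n = 0
    for p in prices:
        total += p
        n += 1
        yield total / n

feed = ["IBM,140.0", "AAPL,190.0", "IBM,142.0"]
print(list(running_mean(keep(parse(feed), "IBM"))))  # [140.0, 141.0]
```

Chaining the generators mirrors the operator graph: each tuple flows through parsing, filtering, and aggregation in one pass, without ever materializing the whole stream.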
Data arrives as a sequence of items, sometimes continuously and at high speed. We can't store
them all in main memory, and we can't read an item again (or reading it again has a cost). We
abstract the data to a particular feature, the data field of interest: the label.
The data: we have a set of n labels Σ, and our input is a stream s = x1, x2, x3, . . . , xm, where
each xi ∈ Σ. Take into account that sometimes we do not know the length of the stream in
advance.
Goal: compute a function of the stream, e.g., the median, the number of distinct elements, or the
longest increasing sequence.
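Sampling a stream whose length is unknown in advance is commonly done with reservoir sampling; the notes above do not spell it out, so this is a standard sketch rather than the text's own method:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)          # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)  # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = x
    return sample

random.seed(1)
print(reservoir_sample(range(1000), 5))
```

Each item is seen exactly once and only k items are kept in memory, matching the single-pass, bounded-memory constraints above.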
Another common process on streams is selection, or filtering. We want to accept those tuples in
the stream that meet a criterion. Accepted tuples are passed to another process as a stream, while
other tuples are dropped. If the selection criterion is a property of the tuple that can be calculated
(e.g., the first component is less than 10), then the selection is easy to do. The problem becomes
harder when the criterion involves lookup for membership in a set. It is especially hard, when
that set is too large to store in main memory. In this section, we shall discuss the technique
known as “Bloom filtering” as a way to eliminate most of the tuples that do not meet the
criterion.
A Motivating Example
Again let us start with a running example that illustrates the problem and what we can do about
it. Suppose we have a set S of one billion allowed email addresses – those that we will allow
through because we believe them not to be spam. The stream consists of pairs: an email address
and the email itself. Since the typical email address is 20 bytes or more, it is not reasonable to
store S in main memory. Thus, we can either use disk accesses to determine whether or not to let
through any given stream element, or we can devise a method that requires no more main
memory than we have available, and yet will filter most of the undesired stream elements.
Suppose for argument’s sake that we have one gigabyte of available main memory. In the
technique known as Bloom filtering, we use that main memory as a bit array. In this case, we
have room for eight billion bits, since one byte equals eight bits. Devise a hash function h from
email addresses to eight billion buckets. Hash each member of S to a bit, and set that bit to 1. All
other bits of the array remain 0. Since there are one billion members of S, approximately 1/8th of
the bits will be 1. The exact fraction of bits set to 1 will be slightly less than 1/8th, because it is
possible that two members of S hash to the same bit. We shall discuss the exact fraction of 1’s in
Section 4.3.3. When a stream element arrives, we hash its email address. If the bit to which that
email address hashes is 1, then we let the email through. But if the email address hashes to a 0,
we are certain that the address is not in S, so we can drop this stream element. Unfortunately,
some spam email will get through. Approximately 1/8th of the stream elements whose email
address is not in S will happen to hash to a bit whose value is 1 and will be let through.
Nevertheless, since the majority of emails are spam (about 80% according to some reports),
eliminating 7/8th of the spam is a significant benefit. Moreover, if we want to eliminate all
spam, we need only check for membership in S those good and bad emails that get through the
filter. Those checks will require the use of secondary memory to access S itself. There are also
other options, as we shall see when we study the general Bloom-filtering technique.
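The scheme above can be sketched in a few lines of Python. This is a toy-scale, single-hash illustration (10 addresses and 80 bits, mirroring the 8-bits-per-member ratio of the example); the `BloomFilter` class and its method names are illustrative, not from any particular library:

```python
import hashlib

class BloomFilter:
    """Single-hash Bloom filter: one hash function into a bit array."""

    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.bits = bytearray((n_bits + 7) // 8)   # bit array, all 0

    def _bucket(self, key):
        # Hash the key (e.g. an email address) to one of n_bits buckets.
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key):
        b = self._bucket(key)
        self.bits[b // 8] |= 1 << (b % 8)

    def might_contain(self, key):
        # True  -> key may be in S (a false positive is possible)
        # False -> key is certainly not in S
        b = self._bucket(key)
        return bool(self.bits[b // 8] & (1 << (b % 8)))

# Toy scale, same ratio as the example: 80 bits for 10 members of S.
allowed = BloomFilter(n_bits=80)
S = [f"user{i}@example.com" for i in range(10)]
for addr in S:
    allowed.add(addr)

print(all(allowed.might_contain(a) for a in S))   # True: no false negatives
```

As in the text, members of S are never rejected, while roughly 1/8th of non-members slip through, since at most 10 of the 80 bits are set.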
6. Elaborately explain about Counting Distinct Elements in a Stream?
The count-distinct problem is the problem of finding the number of distinct elements in a data
stream with repeated elements. One way to solve this problem is to create a map and store the
elements in the map, with the value as their frequency, because duplicates cannot exist in a map
data structure. All the keys inserted into the map will therefore be distinct, and the size of the
map gives the number of distinct elements present in the given input stream. There are also
algorithms, such as Recordinality, that estimate the number of distinct elements approximately,
using far less memory than an exact count requires.
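The map-based approach described above can be sketched as follows; `count_distinct` is an illustrative name, and Python's `Counter` plays the role of the map keyed by element with frequency values:

```python
from collections import Counter

def count_distinct(stream):
    # Map each element to its frequency; keys are unique, so the size
    # of the map is the number of distinct elements.
    freq = Counter(stream)
    return len(freq)

print(count_distinct([1, 2, 2, 3, 1, 4]))   # 4
```

Note that this exact method stores every distinct element, which is why approximate algorithms are preferred for unbounded streams.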
Suppose we have a window of length N on a binary stream. We want at all times to be able
to answer queries of the form "how many 1's are there in the last k bits?" for any k ≤ N. For
this purpose we use the DGIM algorithm.
The basic version of the algorithm uses O(log² N) bits to represent a window of N bits, and
allows us to estimate the number of 1's in the window with an error of no more than 50%.
To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has
timestamp 1, the second has timestamp 2, and so on.
Since we only need to distinguish positions within the window of length N, we shall represent
timestamps modulo N, so they can be represented by log₂ N bits. If we also store the total
number of bits ever seen in the stream (i.e., the most recent timestamp) modulo N, then we can
determine, from a timestamp modulo N, where in the current window the bit with that timestamp
is.
Each bucket is represented by:
1. The timestamp of its right (most recent) end, recorded modulo N.
2. The number of 1's in the bucket. This number must be a power of 2, and we refer to
the number of 1's as the size of the bucket.
To represent a bucket, we need log₂ N bits to represent the timestamp (modulo N) of its right
end. To represent the number of 1's we only need log₂ log₂ N bits. The reason is that we know
this number i is a power of 2, say 2^j, so we can represent i by coding j in binary. Since j is at
most log₂ N, it requires log₂ log₂ N bits. Thus, O(log N) bits suffice to represent a bucket.
There are six rules that must be followed when representing a stream by buckets:
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to some maximum size.
• All bucket sizes must be powers of 2.
• Bucket sizes do not decrease as we move back in time (to the left).
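The DGIM bucket bookkeeping can be sketched in Python as below. This is a simplified illustration, not a production implementation: the class and method names are made up, and timestamps are stored un-reduced for clarity (the text's version stores them modulo N):

```python
class DGIM:
    """Sketch of DGIM: each bucket is a [right_end_timestamp, size] pair,
    kept newest first; sizes are powers of 2, at most two per size."""

    def __init__(self, window):
        self.window = window
        self.t = 0            # timestamp of the most recent bit
        self.buckets = []     # newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its right end leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.t, 1])
        # Whenever three buckets share a size, merge the two oldest of
        # them into one of twice the size; merges can cascade upward.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                self.buckets[i + 1][1] *= 2
                del self.buckets[i + 2]
            i += 1

    def count_ones(self, k):
        """Estimate the number of 1's among the last k bits (k <= window)."""
        total = last = 0
        for ts, size in self.buckets:
            if ts <= self.t - k:
                break
            total += size
            last = size
        # Count only half of the oldest contributing bucket, since it may
        # straddle the left edge of the query range.
        return total - last // 2

dgim = DGIM(window=10)
for _ in range(100):
    dgim.add(1)
print(dgim.count_ones(10))   # estimate of the true count (10), within 50%
```

Feeding 100 ones through a window of 10 yields an estimate between 5 and 15, consistent with the 50% error bound of the basic algorithm.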
Decaying window is a concept in big data that assigns more weight to recent elements. The
technique computes a smooth aggregation of all the 1's ever seen in the stream, with decaying
weights: the further back an element appears in the stream, the less weight it is given. The
decaying window algorithm allows you to identify the most popular elements in an incoming
data stream, while discounting any random spikes or spam requests that might have temporarily
boosted an element's count.
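A minimal sketch of the idea, assuming a decay constant c and the update rule "multiply every existing score by (1 − c), then add 1 to the arriving element's score"; the function name and the constant c = 0.1 are illustrative:

```python
def popular_items(stream, c=0.1):
    """Decaying-window scores: on each arrival, every existing score is
    multiplied by (1 - c), then the arriving element's score gets +1.
    Recent elements therefore dominate and old spikes fade away."""
    scores = {}
    for item in stream:
        for key in scores:
            scores[key] *= (1 - c)
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores

# An old spam spike (100 hits) is outscored by a genuinely recent item.
stream = ["spam"] * 100 + ["hot"] * 20
scores = popular_items(stream)
print(round(scores["hot"], 2), round(scores["spam"], 2))   # 8.78 1.22
```

Despite five times fewer total occurrences, the recent element wins, which is exactly the spike-discounting behaviour described above.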
Real-time Analytics Platform (RTAP) Applications can be broken down into smaller, easier-to-understand parts as follows:
1. What is Real-time Analytics? Real-time analytics is the analysis of data as soon as it
enters the system, allowing action to be taken immediately. It helps in measuring data from a
business point of view in real time, making the best use of the data.
2. What is an ideal RTAP? An ideal RTAP would help in analyzing the data, correlating it,
and predicting the outcomes on a real-time basis.
3. What are the benefits of a Real-time Analytics Platform (RTAP)? RTAPs help in
managing and processing data, leading to timely decision-making. RTAPs connect data
sources for better analytics and visualization, and they help organizations track things in real
time, thus supporting their decision-making process.
4. What are some Real-life Applications of Real-time Analytics?
• Crisis Management: Real-time analytics can be used to monitor social media and news
feeds to detect and respond to crises quickly.
• Increased Company Vision: Real-time analytics can help organizations to identify trends
and patterns in their data, leading to better decision-making and increased company vision.
• Quicker and Less Costly Changes: Real-time analytics can help organizations to identify
and respond to changes in their data quickly, leading to quicker and less costly changes.
• Personalized Marketing: Real-time analytics can be used to analyze customer data and
provide personalized marketing experiences.
• Fraud Detection: Real-time analytics can be used to detect fraudulent activities in real
time, such as credit card fraud.
5. What are some Real-time Analytics Tools for Data Analytics? Some widely used
RTAPs include Apache Spark Streaming, a Big Data platform for data stream analytics in
real time, and Cisco Connected Streaming Analytics.
In summary, Real-time Analytics Platform (RTAP) Applications are tools that enable
organizations to extract valuable information and trends from real-time data, leading to timely
decision-making and increased company vision. They can be used for various applications, such
as crisis management, personalized marketing, and fraud detection. Some widely used RTAPs
include Apache Spark Streaming and Cisco Connected Streaming Analytics.
The big data trend has forced data-centric systems to handle continuous, fast data streams. In
recent years, real-time analytics on stream data has formed into a new research field, which aims
to answer queries about “what-is-happening-now” with a negligible delay. The real challenge
with real-time stream data processing is that it is impossible to store instances of data, and
therefore online analytical algorithms are utilized. To perform real-time analytics, pre-processing
of data should be performed in a way that only a short summary of stream is stored in main
memory. In addition, due to high speed of arrival, average processing time for each instance of
data should be in such a way that incoming instances are not lost without being captured. Lastly,
the learner needs to provide high analytical accuracy measures. Sentinel is a distributed system
written in Java that aims to solve this challenge by enforcing both the processing and learning
process to be done in distributed form. Sentinel is built on top of Apache Storm, a distributed
computing platform. Sentinel's learner, the Vertical Hoeffding Tree, is a parallel decision tree-
learning algorithm based on the VFDT, capable of parallel classification in distributed
environments. Sentinel also uses SpaceSaving to keep a summary of the data stream, stored in
a synopsis data structure. The application of Sentinel to the Twitter Public Stream API is shown
and the results are discussed.
In recent years, stream data is generated at an increasing rate. The main sources of stream data
are mobile applications, sensor applications, measurements in network monitoring and traffic
management, log records or click-streams in web exploring, manufacturing processes, call detail
records, email, blogging, twitter posts, Facebook statuses, search queries, finance data, credit
card transactions, news, emails, Wikipedia updates [5]. On the other hand, with the growing
availability of opinion-rich resources such as personal blogs and microblogging platforms,
challenges arise as people now use such systems to express their opinions. The knowledge of
real-time sentiment analysis of social streams helps to understand what social media users think
or express "right now". Application of real-time sentiment analysis of social streams brings
many opportunities: data-driven marketing (a customer's immediate response to a campaign),
immediate prevention of disasters, business crises such as Toyota's crisis in 2010 or the Swine
Flu epidemic in 2009, and debates in social media. Real-time sentiment analysis can be applied
in almost all domains of business and industry. Data stream mining is the extraction of
informational structure, in the form of models and patterns, from continuous and evolving data
streams.
Traditional methods of data analysis require the data to be stored and then processed off-line
using complex algorithms that make several passes over the data. However, in principle, data
streams are infinite, and data is generated at high rates and therefore cannot be stored in
main memory. Different challenges arise in this context: storage, querying and mining. The
latter is mainly related to the computational resources to analyze such volume of data, so it has
been widely studied in the literature, which introduces several approaches in order to provide
accurate and efficient algorithms [1], [3], [4]. In real-time data stream mining, data streams are
processed in an online manner (i.e. real-time processing) so as to guarantee that results are up-
to-date and that queries can be answered in real-time with negligible delay [1], [5]. Current
solutions and studies in data stream sentiment analysis are limited to perform sentiment analysis
in an off-line approach on a sample of stored stream data. While this approach can work in some
cases, it is not applicable in the real-time case. In addition, real-time sentiment analysis tools
such as MOA [5] and RapidMiner [3] exist; however, they are uniprocessor solutions and
cannot be scaled for efficient usage across a network or a cluster. Since in big data scenarios
the volume of data rises drastically after some period of analysis, uniprocessor solutions
perform slower over time. As a result, processing time per instance of data becomes higher
and instances get lost from the stream. This affects the learning curve and accuracy measures,
due to less data being available for training, and can introduce high costs to such solutions.
Sentinel relies on a distributed architecture and distributed learners to address this shortcoming
of the available solutions for real-time sentiment analysis in social media.
12. Discuss briefly about Stock Market Predictions?
A stock market is the aggregation of buyers and sellers of stocks (shares), which represent
ownership claims on businesses which may include securities listed on a public stock exchange
as well as those traded privately. We have seen over the years that people have incurred heavy
losses, which have led to the devastation of lives; hence the need arises for a prediction system
that can be trusted and remain consistent throughout its life cycle. Predicting stock prices is
also an important task of financial time series forecasting, which is of primary interest to stock
investors, stock traders and applied researchers. Precisely predicting stocks is essential for
investors to gain large profits. However, the volatility of the market makes this kind of
prediction highly difficult. We show that Data Mining and Machine Learning can be used
to guide an investor's decisions. The main aim is to build a model with the help of Data Mining
techniques such as k-NN (which can be used for both classification and regression), combined
with Machine Learning techniques like genetic algorithms and SVR, along with sentiment
analysis of social media text, which forecasts stock prices for companies. The system, if
correctly implemented, will help investors and new users to kick-start the investment process
and can provide tangible benefits. The system can be further enhanced by refining the input
parameters and the data considered over time.
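As a rough illustration of the k-NN component mentioned above (not the actual model proposed; the feature choice and all numbers below are made up for the sketch), a toy nearest-neighbour price predictor might look like:

```python
def knn_predict(history, query, k=3):
    """k-nearest-neighbours regression: predict the next price as the
    average outcome of the k most similar past feature vectors."""
    def dist(a, b):
        # Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(history, key=lambda fp: dist(fp[0], query))[:k]
    return sum(price for _, price in nearest) / k

# Toy features: (today's price, yesterday's price); label: tomorrow's price.
history = [((100, 98), 103), ((101, 100), 104), ((99, 101), 98),
           ((50, 52), 49), ((51, 50), 53)]
print(knn_predict(history, (100, 99)))   # ≈ 101.67
```

A real system would combine such a predictor with SVR, genetic-algorithm feature selection, and sentiment scores, as the text describes.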
Unit-IV Answer
all the questions
Part-A
1. What is MapReduce?
A MapReduce is a data processing tool which is used to process the data parallelly in a
distributed form. It was developed in 2004, on the basis of paper titled as "MapReduce:
Simplified Data Processing on Large Clusters," published by Google.
The MapReduce is a paradigm which has two phases, the mapper phase, and the reducer phase.
In the Mapper, the input is given in the form of a key-value pair. The output of the Mapper is fed
to the reducer as input. The reducer runs only after the Mapper is over. The reducer too takes
input in key-value format, and the output of reducer is the final output.
o The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys
will not be unique in this case.
o Using the output of Map, sort and shuffle are applied by the Hadoop architecture. This
sort and shuffle acts on these list of <key, value> pairs and sends out unique keys and a
list of values associated with this unique key <key, list(values)>.
o An output of sort and shuffle sent to the reducer phase. The reducer performs a defined
function on a list of values for unique keys, and Final output <key, value> will be
stored/displayed.
The sort and shuffle occur on the output of Mapper and before the reducer. When the Mapper
task is complete, the results are sorted by key, partitioned if there are multiple reducers, and
then written to disk. Using the input from each Mapper <k2,v2>, we collect all the values for
each unique key k2. This output from the shuffle phase in the form of <k2, list(v2)> is sent as
input to reducer phase.
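The map → sort/shuffle → reduce pipeline described above can be sketched in pure Python with the classic word-count example; the function names are illustrative and no Hadoop cluster is involved:

```python
from collections import defaultdict

# Map: one input line -> list of <key, value> pairs (word, 1).
def mapper(line):
    return [(word, 1) for word in line.split()]

# Sort/shuffle: group all values by unique key -> <key, list(values)>.
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# Reduce: apply a function to each key's value list -> final <key, value>.
def reducer(key, values):
    return key, sum(values)

lines = ["big data big analytics", "big data"]
mapped = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(result)   # {'big': 3, 'data': 2, 'analytics': 1}
```

In real MapReduce the mapper and reducer calls run in parallel on different nodes, and the shuffle moves data between them over the network.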
o It can be used in various application like document clustering, distributed sorting, and
web link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and
mobile environment.
5. What is Hadoop?
Hadoop is an open source software programming framework for storing a large amount of data
and performing the computation. Its framework is based on Java programming with some native
code in C and shell scripts.
6. What is Hive?
Hive is an ETL and Data warehousing tool developed on top of Hadoop Distributed File System
(HDFS). Hive makes job easy for performing operations like
• Data encapsulation
• Ad-hoc queries
• Analysis of huge datasets
7. What is MapR?
MapR was a business software company headquartered in Santa Clara, California. MapR
software provides access to a variety of data sources from a single computer cluster,
including big data workloads such as Apache Hadoop and Apache Spark, a distributed file
system, a multi-model database management system, and event stream processing,
combining analytics in real-time with operational applications. Its technology runs on
both commodity hardware and public cloud computing services. In August 2019, following
financial difficulties, the technology and intellectual property of the company were sold
to Hewlett Packard Enterprise.[3][4]
8. What is S3?
S3 is a cloud object storage service offered by Amazon Web Services (AWS). It allows you to
store and access any amount of data from anywhere on the web. S3 is secure, durable, scalable
and cost-effective
9. What is regulatory science?
Regulatory science is the scientific and technical basis for developing and evaluating regulations
in various industries, especially those involving health or safety. For example, the FDA uses
regulatory science to assess the safety, efficacy, quality, and performance of all FDA-regulated
products. Regulatory science can also involve developing new tools, standards, and approaches
for regulation.
• HDFS is designed for scalability and fault tolerance. It stores data across
multiple machines (nodes) in a cluster, allowing it to handle vast amounts of data.
• Data is distributed in blocks, typically 128MB or 256MB in size. These blocks
are replicated across multiple nodes to ensure data durability and availability.
• HDFS is highly fault-tolerant. It replicates data blocks across multiple nodes (usually
three by default) in the cluster. If a node or a block becomes unavailable, HDFS can still
access the data from a replica.
• The system constantly monitors the health of nodes and can automatically replace failed
nodes with their replicas.
13. What are data write and read patterns?
• HDFS stores data in fixed-size blocks. This block-based approach simplifies data
storage and retrieval.
• It's particularly advantageous for handling large files efficiently, as you can parallelize
the processing of data across the distributed cluster.
15. What is master-slave architecture?
• HDFS promotes data locality, which means it tries to process data on the same
node where it is stored. This reduces data transfer over the network, improving
performance.
• MapReduce, a popular data processing framework in the Hadoop ecosystem, leverages
data locality for efficient processing.
• HDFS is designed for high throughput, allowing for efficient data streaming and
batch processing.
• It can scale horizontally by adding more commodity hardware to the cluster to
accommodate growing data needs.
18. What is interoperability?
• HDFS can be accessed using various programming languages and tools, including
Java, Python, and others.
• Several higher-level tools and frameworks, such as Apache Hive, Apache Pig,
and Apache Spark, integrate seamlessly with HDFS for data processing.
19. What are the use cases of HDFS?
• HDFS is commonly used in Big Data scenarios for storing and processing large
datasets for analytics, machine learning, log analysis, and more.
• It is well-suited for applications that require scalability and fault tolerance, such as web-
scale applications and data lakes.
20. What is data partitioning?
• Sharding divides the dataset into smaller, more manageable partitions called shards.
Each shard contains a subset of the data.
• The distribution of data across shards is typically based on a defined partitioning
key, which can be a specific column or attribute of the data.
Part-B
1. Elaborately explain about MapReduce?
MapReduce is defined as a big data analysis model that processes data sets using a parallel
algorithm on computer clusters, typically Apache Hadoop clusters or cloud systems like Amazon
Elastic MapReduce (EMR) clusters. This section explains the meaning of MapReduce, how it
works, its features, and its applications.
A software framework and programming model called MapReduce is used to process enormous
volumes of data. Map and Reduce are the two stages of the MapReduce program’s operation.
Vast volumes of data are generated in today's data-driven market, due to algorithms and
applications constantly gathering information about individuals, businesses, systems and
organizations.
The tricky part is figuring out how to quickly and effectively digest this vast volume of data
without losing insightful conclusions.
It used to be the case that the only way to access data stored in the Hadoop Distributed File
System (HDFS) was using MapReduce. Other query-based methods are now utilized to obtain
data from the HDFS using structured query language (SQL)-like commands, such as Hive and
Pig. These, however, typically run alongside tasks created using the MapReduce approach.
This is because MapReduce has unique benefits. To speed up processing, MapReduce
executes its logic on the server where the data already sits, rather than transferring the data to
the location of the application or logic.
MapReduce first appeared as a tool for Google to analyze its search results. However, it quickly
grew in popularity thanks to its capacity to split and process terabytes of data in parallel,
producing quicker results.
MapReduce is essential to the operation of the Hadoop framework and a core component. While
“reduce tasks” shuffle and reduce the data, “map tasks” deal with separating and mapping the
data. MapReduce makes concurrent processing easier by dividing petabytes of data into smaller
chunks and processing them in parallel on Hadoop commodity servers. In the end, it collects all
the information from several servers and gives the application a consolidated output.
For example, let us consider a Hadoop cluster consisting of 20,000 affordable commodity servers
containing 256MB data blocks in each. It will be able to process around five terabytes worth of
data simultaneously. Compared to the sequential processing of such a big data set, the usage of
MapReduce cuts down the amount of time needed for processing.
To speed up the processing, MapReduce eliminates the need to transport data to the location
where the application or logic is housed. Instead, it executes the logic directly on the server that
hosts the data itself. Both the accessing and storing of data are done using server disks. Further,
the input data is typically saved in files that may include organized, semi-structured, or
unstructured information. Finally, the output data is similarly saved in the form of files.
The main benefit of MapReduce is that users can scale data processing easily over several
computing nodes. The data processing primitives used in the MapReduce model are mappers and
reducers. Sometimes it is difficult to divide a data processing application into mappers and
reducers. However, scaling an application to run over hundreds, thousands, or tens of thousands
of servers in a cluster is just a configuration modification after it has been written in the
MapReduce manner.
Hadoop is an open-source framework that allows users to store and process big data in a
distributed environment across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage.
This brief tutorial provides a quick introduction to Big Data, MapReduce algorithm, and Hadoop
Distributed File System.
Audience
This tutorial has been prepared for professionals aspiring to learn the basics of Big Data
Analytics using Hadoop Framework and become a Hadoop Developer. Software Professionals,
Analytics Professionals, and ETL developers are the key beneficiaries of this course.
Prerequisites
Before you start proceeding with this tutorial, we assume that you have prior exposure to Core
Java, database concepts, and any of the Linux operating system flavors.
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
This is a brief tutorial that provides an introduction on how to use Apache Hive HiveQL with
Hadoop Distributed File System. This tutorial can be your first step towards becoming a
successful Hadoop Developer with Hive.
Audience
This tutorial is prepared for professionals aspiring to make a career in Big Data Analytics using
Hadoop Framework. ETL developers and professionals who are into analytics in general may as
well use this tutorial to good effect.
Prerequisites
Before proceeding with this tutorial, you need a basic knowledge of Core Java, database
concepts of SQL, the Hadoop file system, and any of the Linux operating system flavors.
MapR is one of the Big Data distributions. It is a complete enterprise distribution for Apache
Hadoop, designed to improve Hadoop's reliability, performance, and ease of use.
Why MapR?
1. High Availability:
MapR provides High Availability features such as self-healing, meaning there is no NameNode
architecture. It has JobTracker High Availability and NFS support. MapR achieves this by
distributing its file system metadata.
2. Disaster Recovery:
MapR provides a mirroring facility which allows users to enable policies and mirror data
automatically, within a multi-node or single-node cluster, and between on-premise and cloud
infrastructure.
3. Record Performance:
MapR set a world performance record at a cost of only $9, compared to an earlier cost of $5M,
at a speed of 54 seconds. It also handles large clusters of up to 2,200 nodes.
4. Consistent Snapshots:
MapR is the only big data distribution that provides consistent, point-in-time recovery, because
of its unique read and write storage architecture.
5. Security:
MapR has its own security system for data protection at the cluster level.
6. Compression:
MapR provides automatic, behind-the-scenes compression, applied to files in the cluster.
9. Enterprise-grade NoSQL:
MapR Ecosystem Packs (MEPs) are released every quarter, with yearly releases as well. A
single version of MapR may support multiple MEPs, but only one at a time.
MapR Ecosystem Packs include familiar Hadoop ecosystem and open-source components such
as Spark and Hive, along with tools like the following:
Collectd
Elasticsearch
Grafana
Fluentd
Kibana
Open TSDB
Sharding is a very important concept that helps the system keep data in different resources
according to the sharding process. The word "shard" means "a small part of a whole". Hence,
sharding means dividing a larger part into smaller parts. In DBMS, sharding is a type of
database partitioning in which a large database is divided or partitioned into smaller data sets
spread across different nodes. These shards are not only smaller, but also faster and hence more
easily manageable.
Need for Sharding:
Consider a very large database whose sharding has not been done. For example, let's take a
database of a college in which all the student records (present and past) of the whole college
are maintained in a single database. It would contain a very large number of records, say
100,000. Now, when we need to find a student in this database, around 100,000 transactions
have to be done each time, which is very costly. Now consider the same college student
records divided into smaller data shards based on years. Each data shard will then have only
around 1,000-5,000 student records. Not only does the database become much more
manageable, but the transaction cost of each query is also reduced by a huge factor, which is
what sharding achieves. Hence this is why sharding is needed.
In a sharded system, the data is partitioned into shards based on a predetermined criterion. For
example, a sharding scheme may divide the data based on geographic location, user ID, or time
period. Once the data is partitioned, it is distributed across multiple servers or nodes. Each
server or node is responsible for storing and processing a subset of the data.
To query data from a sharded database, the system needs to know which shard contains the
required data. This is achieved using a shard key, which is a unique identifier that is used to
map the data to its corresponding shard. When a query is received, the system uses the shard
key to determine which shard contains the required data and then sends the query to the
appropriate server or node.
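The shard-key routing just described can be sketched as follows, assuming a simple hash-mod partitioning scheme; the node names, `shard_for` and `route` are hypothetical illustrations, not any particular database's API:

```python
import hashlib

N_SHARDS = 4
SERVERS = [f"node-{i}" for i in range(N_SHARDS)]   # hypothetical node names

def shard_for(shard_key):
    """Map a shard key (e.g. a user ID) to one of N_SHARDS partitions."""
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

def route(shard_key):
    # The system uses the shard key to find which server holds the data,
    # then sends the query to that server.
    return SERVERS[shard_for(shard_key)]

print(route("user42"))   # always the same node for the same key
```

Because the mapping is deterministic, every query for the same key reaches the same shard; real systems often use consistent hashing instead, so that adding a shard does not remap most keys.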
Features of Sharding:
• Sharding makes the Database smaller
• Sharding makes the Database faster
• Sharding makes the Database much more easily manageable
• Sharding can be a complex operation sometimes
• Sharding reduces the transaction cost of the Database
• Each shard reads and writes its own data.
• Many NoSQL databases offer auto-sharding.
• Failure of one shard doesn't affect the data processing of the other shards.
NoSQL Databases:
Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of
availability, partition tolerance, and speed. Barriers to the greater adoption of NoSQL stores
include the use of low-level query languages (instead of SQL, for instance), lack of ability to
perform ad hoc joins across tables, lack of standardized interfaces, and huge previous
investments in existing relational databases.[10] Most NoSQL stores lack true ACID
transactions, although a few databases have made them central to their designs.
Instead, most NoSQL databases offer a concept of "eventual consistency", in which database
changes are propagated to all nodes "eventually" (typically within milliseconds), so queries for
data might not return updated data immediately or might result in reading data that is not
accurate, a problem known as stale read.[11] Additionally, some NoSQL systems may exhibit
lost writes and other forms of data loss.[12] Some NoSQL systems provide concepts such as
write-ahead logging to avoid data loss.[13] For distributed transaction processing across multiple
databases, data consistency is an even bigger challenge that is difficult for both NoSQL and
relational databases. Relational databases "do not allow referential integrity constraints to span
databases".[14] Few systems maintain both ACID transactions and X/Open XA standards for
distributed transaction processing.[15] Interactive relational databases share conformational relay
analysis techniques as a common feature. [16] Limitations within the interface environment are
overcome using semantic virtualization protocols, such that NoSQL services are accessible to
most operating systems.[17]
Working with Amazon S3 from the console:
• Create a bucket − The Create Bucket dialog box will open. Fill in the required details and
click the Create button. The bucket is created in Amazon S3, and the console displays the
list of buckets and their properties.
• Host a static website − Select the Static Website Hosting option, click the Enable website
hosting radio button, and fill in the required details.
• Upload objects − Click the Add files option, select the files to be uploaded from the system,
then click the Open button. Click the Start Upload button and the files will be uploaded
into the bucket.
• Open/download an object − In the Amazon S3 console, in the Objects & Folders list,
right-click on the object to be opened/downloaded, then select the required option.
• Move an object − Open the location where the object should go, right-click on the
destination folder/bucket and click the Paste Into option.
• Empty a bucket − A confirmation message will appear in a pop-up window. Read it
carefully and click the Empty Bucket button to confirm.
Amazon S3 Features
• Low cost and Easy to Use − Using Amazon S3, the user can store a large amount
of data at very low charges.
• Secure − Amazon S3 supports data transfer over SSL and the data gets encrypted
automatically once it is uploaded. The user has complete control over their data by
configuring bucket policies using AWS IAM.
• Scalable − Using Amazon S3, there need not be any worry about storage
concerns. We can store as much data as we have and access it anytime.
• Higher performance − Amazon S3 is integrated with Amazon CloudFront, which
distributes content to end users with low latency and provides high data transfer
speeds without any minimum usage commitments.
• Integrated with AWS services − Amazon S3 is integrated with AWS services
including Amazon CloudFront, Amazon CloudWatch, Amazon Kinesis, Amazon
RDS, Amazon Route 53, Amazon VPC, AWS Lambda, Amazon EBS, Amazon
DynamoDB, etc.
Now that you are familiar with the term file system, let's begin with HDFS.
HDFS (Hadoop Distributed File System) is the storage layer of a Hadoop cluster. It is mainly
designed to work on commodity hardware devices (devices that are inexpensive), using a
distributed file system design. HDFS is built on the principle of storing data in a few large
blocks rather than in many small blocks. HDFS provides fault tolerance and high availability
to the storage layer and the other devices present in the Hadoop cluster.
HDFS is capable of handling data of high volume, velocity, and variety, which makes Hadoop
work more efficiently and reliably, with easy access to all its components. HDFS stores data in
the form of blocks, where each data block is 128 MB by default. This size is configurable,
meaning you can change it according to your requirements in the hdfs-site.xml file in your
Hadoop directory.
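The block-size override mentioned above can be sketched as an hdfs-site.xml fragment; the 256m value here is only an illustration, not a recommendation:

```xml
<configuration>
  <!-- dfs.blocksize accepts bytes or a size suffix such as 256m -->
  <property>
    <name>dfs.blocksize</name>
    <value>256m</value>
  </property>
</configuration>
```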
HDFS has a master–slave architecture with two main components:
1. NameNode (Master)
2. DataNode (Slave)
1. System Failure: As a Hadoop cluster consists of lots of nodes that are commodity
hardware, node failure is possible. A fundamental goal of HDFS is to detect such failures and
recover from them.
2. Maintaining Large Datasets: As HDFS handles files of sizes ranging from GBs to PBs,
it has to be able to deal with these very large data sets on a single cluster.
3. Moving Data is Costlier than Moving the Computation: If a computational operation is
performed near the location where the data resides, it is much faster; the overall throughput of
the system increases and network congestion is minimized. HDFS is designed around this
assumption.
4. Portability Across Various Platforms: HDFS possesses portability, which allows it to
move across diverse hardware and software platforms.
5. Simple Coherency Model: The Hadoop Distributed File System follows a write-once,
read-many access model for files. A file, once written and closed, should not be changed; data
can only be appended. This assumption helps minimize data coherency issues. MapReduce fits
perfectly with this kind of file model.
6. Scalability: HDFS is designed to be scalable as the data storage requirements increase over
time. It can easily scale up or down by adding or removing nodes to the cluster. This helps to
ensure that the system can handle large amounts of data without compromising performance.
7. Security: HDFS provides several security mechanisms to protect data stored on the cluster.
It supports authentication and authorization mechanisms to control access to data, encryption
of data in transit and at rest, and data integrity checks to detect any tampering or corruption.
8. Data Locality: HDFS aims to move the computation to where the data resides rather than
moving the data to the computation. This approach minimizes network traffic and enhances
performance by processing data on local nodes.
9. Cost-Effective: HDFS can run on low-cost commodity hardware, which makes it a cost-
effective solution for large-scale data processing. Additionally, the ability to scale up or down
as required means that organizations can start small and expand over time, reducing upfront
costs.
10. Support for Various File Formats: HDFS is designed to support a wide range of file
formats, including structured, semi-structured, and unstructured data. This makes it easier to
store and process different types of data using a single system, simplifying data management
and reducing costs.
Social Networks?
Online social networks, such as Facebook, are increasingly utilized by many people. These
networks allow users to publish details about themselves and to connect to their friends. Some of
the information revealed inside these networks is meant to be private. Yet it is possible to use
learning algorithms on released data to predict private information. In this paper, we explore how
to launch inference attacks using released social networking data to predict private information.
We then devise three possible sanitization techniques that could be used in various situations.
Then, we explore the effectiveness of these techniques and attempt to use methods of collective
inference to discover sensitive attributes of the data set. We show that we can decrease the
effectiveness of both local and relational classification algorithms by using the sanitization
methods we described.
SOCIAL networks are online applications that allow their users to connect by means of various
link types. As part of their offerings, these networks allow people to list details about themselves
that are relevant to the nature of the network. For instance, Facebook is a general-use social
network, so individual users list their favorite activities, books, and movies. Conversely,
LinkedIn is a professional network; because of this, users specify details which are related to
their professional life (i.e., reference letters, previous employment, and so on.) Because these
sites gather extensive personal information, social network application providers have a rare
opportunity: direct use of this information could be useful to advertisers for direct marketing.
However, in practice, privacy concerns can prevent these efforts [1]. This conflict between the
desired use of data and individual privacy presents an opportunity for privacy-preserving social
network data mining—that is, the discovery of information and relationships from social network
data without violating privacy.
Privacy concerns of individuals in a social network can be classified into two categories: privacy
after data release, and private information leakage.
Instances of privacy after data release involve the identification of specific individuals in a data
set subsequent to its release to the general public or to paying customers for a specific usage.
Perhaps the most illustrative example of this type of privacy breach (and the repercussions
thereof) is the AOL search data scandal.
10. Write an overview of Big Data Framework?
Frameworks provide structure. The core objective of the Big Data Framework is to provide a
structure for enterprise organisations that aim to benefit from the potential of Big Data. In order
to achieve long-term success, Big Data is more than just the combination of skilled people and
technology – it requires structure and capabilities.
The Big Data Framework was developed because – although the benefits and business cases of
Big Data are apparent – many organizations struggle to embed a successful Big Data practice in
their organization. The structure provided by the Big Data Framework provides an approach for
organizations that takes into account all organizational capabilities of a successful Big Data
practice. All the way from the definition of a Big Data strategy, to the technical tools and
capabilities an organization should have.
Big Data is a people business. Even with the most advanced computers and processors in the
world, organisations will not be successful without the appropriate knowledge and skills. The
Big Data Framework therefore aims to increase the knowledge of everyone who is interested in
Big Data. The modular approach and accompanying certification scheme aims to develop
knowledge about Big Data in a similar structured fashion.
The Big Data framework provides a holistic structure toward Big Data. It looks at the various
components that enterprises should consider while setting up their Big Data organization. Every
element of the framework is of equal importance and organisations can only develop further if
they provide equal attention and effort to all elements of the Big Data framework.
The Big Data framework is a structured approach that consists of six core capabilities that
organisations need to take into consideration when setting up their Big Data organization. The
Big Data Framework is depicted in the figure below:
The Big Data Framework consists of the following six main elements:
Data has become a strategic asset for most organisations. The capability to analyse large data
sets and discern patterns in the data can provide organisations with a competitive advantage.
Netflix, for example, looks at user behaviour in deciding what movies or series to produce.
Alibaba, the Chinese sourcing platform, became one of the global giants by identifying which
suppliers to loan money to and recommend on their platform. Big Data has become Big Business.
In order to achieve tangible results from investments in Big Data, enterprise organisations need a
sound Big Data strategy. How can return on investments be realised, and where to focus effort in
Big Data analysis and analytics? The possibilities to analyse are literally endless and
organisations can easily get lost in the zettabytes of data. A sound and structured Big Data
strategy is the first step to Big Data success.
In order to work with massive data sets, organisations should have the capabilities to store and
process large quantities of data. In order to achieve this, the enterprise should have the
underlying IT infrastructure to facilitate Big Data. Enterprises should therefore have a
comprehensive Big Data architecture to facilitate Big Data analysis. How should enterprises
design and set up their architecture to facilitate Big Data? And what are the requirements from
a storage and processing perspective?
The Big Data Architecture element of the Big Data Framework considers the technical
capabilities of Big Data environments. It discusses the various roles that are present within a Big
Data Architecture and looks at the best practices for design. In line with the vendor-independent
structure of the Framework, this section will consider the Big Data reference architecture of
the National Institute of Standards and Technology (NIST).
The Big Data algorithms element of the framework focuses on the (technical) capabilities of
everyone who aspires to work with Big Data. It aims to build a solid foundation that includes
basic statistical operations and provides an introduction to different classes of algorithms.
In order to make Big Data successful in an enterprise organisation, it is necessary to consider
more than just skills and technology. Processes can help enterprises to focus their direction.
Processes bring structure and measurable steps, and can be effectively managed on a day-to-day
basis. Additionally, processes embed Big Data expertise within the organisation by having
everyone follow similar procedures and steps, embedding it as 'a practice' of the organisation.
Analysis becomes less dependent on individuals, thereby greatly enhancing the chances of
capturing value in the long term.
Big Data functions are concerned with the organisational aspects of managing Big Data in
enterprises. This element of the Big Data framework addresses how organisations can structure
themselves to set up Big Data roles and discusses roles and responsibilities in Big Data
organisations. Organisational culture, organisational structures and job roles have a large
impact on the success of Big Data initiatives. We will therefore review some ‘best practices’ in
setting up enterprise big data
In the Big Data Functions section of the Big Data Framework, the non-technical aspects of Big
Data are covered. You will learn how to set up a Big Data Center of Excellence (BDCoE).
Additionally, it also addresses critical success factors for starting Big Data project in the
organization.
6. Artificial Intelligence
The last element of the Big Data Framework addresses Artificial Intelligence (AI). One of the
major areas of interest in the world today, AI provides a whole world of potential. In this part
of the framework, we address the relation between Big Data and Artificial Intelligence and
outline key characteristics of AI.
Many organisations are keen to start Artificial Intelligence projects, but most are unsure where
to start their journey.
The Big Data Framework takes a functional view of AI in the context of bringing business
benefits to enterprise organisations. The last section of the framework therefore showcases how
AI follows as a logical next step for organisations that have built up the other capabilities of the
Big Data Framework. The last element of the Big Data Framework has been depicted as a
lifecycle on purpose: Artificial Intelligence can continuously learn from the Big Data in the
organization in order to provide long-lasting value.
11. Discuss about Preventing Private Information Inference Attacks on Social Networks?
Nowadays social media is very popular and is used for marketing based on user profiles. For
this, social networking sites share user data with marketing companies, and it is possible for
these third-party companies to misuse users' private data. Social networks are a significant part
of multimedia mobile systems, where users can share their photos, videos, and other media
files. At the same time, the information shared on social media platforms (e.g., user bio, posts,
etc.) usually reveals a lot of private information, which can be mined and misused for malicious
purposes. To tackle privacy concerns, privacy-preserving mechanisms have been adopted by
many social network service providers, e.g., hiding user profiles, anonymizing user identities,
etc. Attributes in user profiles are usually set so that they can be accessed only by friends, in
order to prevent personal information leakage. To examine the effectiveness of current privacy-
protection mechanisms against the inference of hidden attributes, different attacks have been
proposed. Most solutions are based on social network links along with users or their behaviors.
The proposed work is an inference attack prevention model for social networking applications.
To prevent inference attacks, a data sanitization method is applied to the user's profile.
Social networking websites are virtual communities that foster interaction among members of
a group by permitting them to connect with other users, post personal data, and link their
personal profiles to others' profiles. In most cases, membership is attained by registering as a
user of the website. Regularly visiting and interacting with people who use that website makes
one's network stronger. Though many social networking websites are open to anyone, some are
restricted to people who belong to a specific real-world occupation or to people in a certain
age group. Members of social networking websites communicate by posting weblogs, video
and music streams, messages, and chats. Members frequently form smaller communities within
the network. Social networking websites allow members to promote themselves and their
interests by posting individual profiles that contain enough information for others to determine
whether they are interested in associating with that person.
Opponents of social networking claim that it can be used to attack privacy and that it
contributes to stalking behavior. Many people are free with the information they post about
themselves, and these websites are frequently used to investigate social habits and personal
character. Social networks permit users to publish details about themselves and to connect
with their friends. Some of the information revealed inside these networks is meant to be
private. Yet it is possible to use machine learning algorithms on released data to predict private
information. In this paper, we explore how to launch inference attacks using released social
networking data to predict private information.
12. Describe about Applying Regulatory Science and Big Data to Improve Medical
Device Innovation
Understanding how proposed medical devices will interface with humans is a major challenge
that impacts both the design of innovative new devices and approval and regulation of existing
devices. Today, designing and manufacturing medical devices requires extensive and expensive
product cycles. Bench tests and other preliminary analyses are used to understand the range of
anatomical conditions, and animal and clinical trials are used to understand the impact of design
decisions upon actual device success. Unfortunately, some scenarios are impossible to replicate
on the bench, and competitive pressures often accelerate initiation of animal trials without
sufficient understanding of parameter selections. We believe these limitations can be overcome
through advancements in data-driven and simulation-based medical device design and
manufacture, a research topic that draws upon and combines emerging work in the areas of
Regulatory Science and Big Data.
We propose a cross-disciplinary grand challenge to develop and holistically apply new thinking
and techniques in these areas to medical devices in order to improve and accelerate medical
device innovation.
Unit-V
Answer all the questions
Part-A
1. What is R?
R is a programming language that is mainly used for statistical computing and graphics. It was
created by statisticians Ross Ihaka and Robert Gentleman in the 1990s, and it is now supported
by the R Core Team and the R Foundation for Statistical Computing. R is similar to the S
language, which was developed at Bell Laboratories by John Chambers and colleagues.
2. What are Data Frames?
Data Frames
A data frame can have different types of data inside it. While the first column can be character,
the second and third can be numeric or logical. However, within each column, all values should
have the same type of data.
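The mixed-column idea above can be sketched with a small data frame; the column names here are made up for illustration:

```r
# A data frame mixing column types: character, numeric, logical
students <- data.frame(
  name   = c("Asha", "Ravi", "Mala"),  # character column
  marks  = c(78, 91, 85),              # numeric column
  passed = c(TRUE, TRUE, TRUE),        # logical column
  stringsAsFactors = FALSE
)
print(sapply(students, class))  # each column keeps its own type
```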
3. What is classes?
Classes and Objects are basic concepts of Object-Oriented Programming that revolve around
real-life entities. Everything in R is an object. An object is simply a data structure that has
some methods and attributes. A class is just a blueprint or a sketch of these objects. It
represents the set of properties or methods that are common to all objects of one type.
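A minimal sketch of this idea using R's simple S3 class system; the "circle" class and the area() generic are invented here for illustration:

```r
# Define an object carrying a class attribute (S3 style)
c1 <- structure(list(radius = 2), class = "circle")

# A method shared by all objects of class "circle"
area <- function(obj) UseMethod("area")
area.circle <- function(obj) pi * obj$radius^2

print(class(c1))  # the object's class
print(area(c1))   # dispatches to area.circle
```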
4. Give short notes on input/output?
Here are a few of the string manipulation functions available in R’s base packages. We are going
to look at these functions in detail.
The R toupper() function is used to convert all characters of a string to uppercase. Any
symbol, space, or number in the string is ignored while applying this function; only alphabetic
characters are converted. The syntax for using this function is given below:
Syntax: toupper(x)
Parameters: x – Required. The text to be converted.
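A quick illustration of toupper() leaving digits and punctuation untouched:

```r
x <- "r programming 101!"
print(toupper(x))  # only the letters are converted to uppercase
```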
7. What is meant by tolower()?
The strsplit() function in R is used to split the elements of the specified character vector into
substrings according to the separator taken as its parameter.
Syntax: strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes...
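A short strsplit() illustration; note that it returns a list with one character vector per input element:

```r
dates <- c("2024-01-15", "2024-02-20")
parts <- strsplit(dates, split = "-")
print(parts)  # list of character vectors, split on "-"
```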
9. What is nchar()?
The nchar() method in R Programming Language is used to get the number of characters in a
string object.
Syntax: nchar(string), where string is the object. Return: the length of the string.
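nchar() can be tried directly, and it is vectorized over character vectors:

```r
print(nchar("Hadoop"))        # length of a single string
print(nchar(c("R", "HDFS")))  # one length per element
```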
10. Write a short notes on sprintf()?
The sprintf() function in R is a built-in function that prints formatted strings. You can use it to
control the number of digits, alignment, padding, and other aspects of how strings are
displayed. For example, you can use sprintf("%f", x) to format a numeric value x with six digits
after the decimal point. You can also use other format specifiers such as %d for
integers, %s for strings, %e for scientific notation, and more. The sprintf() function is useful for
generating dynamic messages and formatting data for reporting.
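The format specifiers mentioned above can be tried directly:

```r
x <- 3.14159
print(sprintf("%f", x))        # six digits after the decimal point
print(sprintf("%.2f", x))      # restrict to two digits
print(sprintf("%d items", 7L)) # integer substitution
print(sprintf("%e", 12345.6))  # scientific notation
```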
• R is renowned for its powerful capabilities in data analysis and statistical modeling.
It provides a vast array of built-in functions and packages for statistical analysis, hypothesis
testing, regression, and more.
• Users can perform data manipulation, cleansing, and transformation tasks with ease using
R's data manipulation libraries like dplyr and tidyr.
12. What is data visualization?
• R offers extensive data visualization tools, including the popular ggplot2 package,
which allows users to create highly customizable and publication-quality graphs and charts.
• It provides support for various plotting styles, such as scatter plots, bar
charts, histograms, box plots, and heatmaps.
13. What is extensive package ecosystem?
• R provides functions for importing data from a wide range of sources, including
CSV, Excel, SQL databases, and web APIs.
• Exporting results to various formats, such as CSV, Excel, PDF, and graphics files,
is straightforward.
16. What is interactive data analysis?
• R supports interactive data exploration and analysis through the use of graphical user
interfaces (GUIs) like RStudio and Jupyter notebooks.
• Users can explore data, execute code, and visualize results in real time.
R is cross-platform and runs on various operating systems, including Windows, macOS, and
Linux.
19. What is integration and other tools?
R can be integrated with other data science and analytics tools and languages, including Python,
SQL, and tools for big data processing like Apache Spark.
20. What is map phase?
• The first phase of MapReduce is the "Map" phase, where the input data is divided into
smaller chunks, called splits.
• A user-defined function called the "Mapper" is applied to each split independently.
The Mapper takes an input record and emits key-value pairs based on some logic.
• The output of the Mapper is an intermediate set of key-value pairs, which are grouped by
key.
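The map-then-group flow described above can be sketched in R, the language used elsewhere in these notes; this is a word-count toy in which split() simulates the grouping (shuffle) step, not a real MapReduce runtime:

```r
# Input data divided into splits (here, lines of text)
splits <- c("big data big", "data lake")

# Mapper: emit one (word, 1) key-value pair per word in a split
mapper <- function(line) {
  words <- unlist(strsplit(line, " "))
  data.frame(key = words, value = 1)
}

# Apply the mapper to each split independently
pairs <- do.call(rbind, lapply(splits, mapper))

# Group intermediate pairs by key (mimics the shuffle phase)
grouped <- split(pairs$value, pairs$key)
counts <- sapply(grouped, sum)  # a trivial "reduce": sum per key
print(counts)
```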
Part-B
1. Write an overview of R language?
R is a programming language for statistical computing and graphics supported by the R Core
Team and the R Foundation for Statistical Computing. Created by statisticians Ross
Ihaka and Robert Gentleman, R is used among data miners, bioinformaticians and statisticians
for data analysis and developing statistical software.[7]
The core R language is augmented by a large number of extension packages containing
reusable code and documentation.
According to user surveys and studies of scholarly literature databases, R is one of the most
commonly used programming languages in data mining.[8] As of April 2023, R ranks 16th in
the TIOBE index, a measure of programming language popularity, in which the language peaked
in 8th place in August 2020.[9][10]
The official R software environment is an open-source free software environment released as
part of the GNU Project and available under the GNU General Public License. It is written
primarily in C, Fortran, and R itself (partially self-hosting). Precompiled executables are
provided for various operating systems. R has a command line interface.[11] Multiple third-
party graphical user interfaces are also available, such as RStudio, an integrated development
environment, and Jupyter, a notebook interface.
Control statements are expressions used to control the execution and flow of the program
based on the conditions provided in the statements. These structures are used to make a
decision after assessing the variable. In this article, we’ll discuss all the control statements with
the examples.
In R programming, there are 8 types of control statements as follows:
• if condition
• if-else condition
• for loop
• nested loops
• while loop
• repeat and break statement
• return statement
• next statement
if condition
This control structure checks whether the expression provided in parentheses is true. If true,
the statements in braces {} are executed.
Syntax:
if(expression){
statements
....
....
}
Example:
x <- 100
if(x > 10){
  print(paste(x, "is greater than 10"))
}
Output:
[1] "100 is greater than 10"
if-else condition
It is similar to if condition but when the test expression in if condition fails, then statements
in else condition are executed.
Syntax:
if(expression){
statements
....
....
}
else{
statements
....
....
}
Example:
x <- 5
if(x > 10){
  print(paste(x, "is greater than 10"))
} else {
  print(paste(x, "is less than 10"))
}
Output:
[1] "5 is less than 10"
for loop
It is a type of loop or sequence of statements executed repeatedly until exit condition is
reached.
Syntax:
for(value in vector){
statements
....
....
}
Example:
x <- letters[4:10]
for(i in x){
print(i)
}
Output:
[1] "d"
[1] "e"
[1] "f"
[1] "g"
[1] "h"
[1] "i"
[1] "j"
Nested loops
Nested loops are similar to simple loops. Nested means a loop inside another loop. Nested
loops are commonly used to traverse a matrix.
Example:
# Defining matrix
m <- matrix(2:15, 2)
for (r in seq(nrow(m))) {
for (c in seq(ncol(m))) {
print(m[r, c])
}
}
Output:
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 3
[1] 5
[1] 7
[1] 9
[1] 11
[1] 13
[1] 15
while loop
while loop is another kind of loop iterated until a condition is satisfied. The testing expression
is checked first before executing the body of loop.
Syntax:
while(expression){
statement
....
....
}
Example:
x=1
# Print 1 to 5
while(x <= 5){
print(x)
x=x+1
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
repeat and break statement
repeat executes its body again and again until a break statement forces an exit.
Example:
x = 1
# Print 1 to 5
repeat{
  print(x)
  x = x + 1
  if(x > 5){
    break
  }
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
return statement
return statement is used to return the result of an executed function and returns control to the
calling function.
Syntax:
return(expression)
Example:
# A function returning whether its argument is positive, zero, or negative
func <- function(x){
  if(x > 0){
    return("Positive")
  } else if(x == 0){
    return("Zero")
  } else {
    return("Negative")
  }
}
func(1)
func(0)
func(-1)
Output:
[1] "Positive"
[1] "Zero"
[1] "Negative"
next statement
next statement is used to skip the current iteration without executing the further statements and
continues the next iteration cycle without terminating the loop.
Example:
# Defining vector
x <- 1:10
# Skip odd numbers using next
for(i in x){
  if(i %% 2 != 0){
    next
  }
  print(i)
}
Output:
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
Operators
Operators are used to perform operations on variables and values, for example:
10 + 5
R divides the operators into the following groups:
• Arithmetic operators
• Assignment operators
• Comparison operators
• Logical operators
• Miscellaneous operators
R Arithmetic Operators
Arithmetic operators are used with numeric values to perform common mathematical operations:
Operator  Name            Example
+         Addition        x + y
-         Subtraction     x - y
*         Multiplication  x * y
/         Division        x / y
^         Exponent        x ^ y
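The arithmetic operators above can be tried directly:

```r
x <- 10
y <- 3
print(x + y)  # addition
print(x - y)  # subtraction
print(x * y)  # multiplication
print(x / y)  # division
print(x ^ y)  # exponent
```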
R Assignment Operators
Example
my_var <- 3
my_var <<- 3
3 -> my_var
3 ->> my_var
R Comparison Operators
Operator  Description  Example
==        Equal        x == y
!=        Not equal    x != y
R Logical Operators
Operato Description
r
& Element-wise Logical AND operator. It returns TRUE if both elements are TRUE
&& Logical AND operator - Returns TRUE if both statements are TRUE
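The difference between & and && can be seen directly; note that recent R versions require the operands of && to have length one:

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
print(a & b)          # element-wise: one result per pair of elements
print(TRUE && FALSE)  # single logical result
```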
R Miscellaneous Operators
Operator  Description                                Example
:         Creates a series of numbers in a sequence  x <- 1:10
Functions are useful when you want to perform a certain task multiple times. A function
accepts input arguments and produces the output by executing valid R commands that are
inside the function. In R Programming Language when you are creating a function the function
name and the file in which you are creating the function need not be the same and you can
have one or more functions in R.
Creating a Function in R
Functions are created in R by using the command function(). The general structure of the
function file is as follows:
f <- function(arguments){
  statements
}
Note: In the above syntax f is the function name; this means that you are creating a function
named f which takes certain arguments and executes the following statements.
Types of Function in R Language
1. Built-in Function: Built-in functions in R are pre-defined functions that are
available in R programming languages to perform common tasks or operations.
2. User-defined Function: R language allow us to write our own function.
Built-in Function in R Programming Language
Here we will use built-in functions like sum(), max() and min().
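The built-in functions named above can be tried directly:

```r
v <- c(4, 9, 1, 7)
print(sum(v))  # total of all elements
print(max(v))  # largest element
print(min(v))  # smallest element
```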
R provides built-in functions like print(), cat(), etc. but we can also create our own functions.
These functions are called user-defined functions.
Example
R
# A simple R function to check
# whether x is even or odd
evenOdd = function(x){
  if(x %% 2 == 0)
    return("even")
  else
    return("odd")
}
print(evenOdd(4))
print(evenOdd(3))
Output
[1] "even"
[1] "odd"
R Function Example – Single Input Single Output
Now create a function in R that will take a single input and gives us a single output.
Following is an example to create a function that calculates the area of a circle which takes in
the arguments the radius. So, to create a function, name the function as “areaOfCircle” and the
arguments that are needed to be passed are the “radius” of the circle.
R
# A simple R function to
# calculate area of a circle
areaOfCircle = function(radius){
  area = pi*radius^2
  return(area)
}
print(areaOfCircle(2))
Output
12.56637
R Function Example – Multiple Input Multiple Output
Now create a function in R Language that will take multiple inputs and gives us multiple
outputs using a list.
The functions in R Language can take multiple input objects but return only one object as
output. This is, however, not a limitation, because you can create a list of all the outputs you
want, and once the list is created you can access its elements to get the required outputs.
Let us consider this example to create a function "Rectangle" which takes "length" and
"width" of the rectangle and returns the area and perimeter of that rectangle. Since R Language
can return only one object, we create one object which is a list that contains "area" and
"perimeter" and return the list.
R
# A simple R function returning
# multiple outputs in a list
Rectangle = function(length, width){
  area = length * width
  perimeter = 2 * (length + width)
  # pack both results into a named list
  result = list("Area" = area, "Perimeter" = perimeter)
  return(result)
}
resultList = Rectangle(2, 3)
print(resultList["Area"])
print(resultList["Perimeter"])
Output
$Area
[1] 6
$Perimeter
[1] 10
Inline Functions in R Programming Language
Sometimes creating an R script file, loading it, and executing it is a lot of work when you want
to create just a very small function. So, what we can do in this kind of situation is use an inline
function.
To create an inline function you have to use the function command with the argument x and
then the expression of the function.
Example
R
# A simple R program to
# demonstrate the inline function
f = function(x) x^2*4+x/3
print(f(4))
print(f(-2))
print(0)
Output
65.33333
15.33333
0
Passing Arguments to Functions in R Programming Language
There are several ways you can pass the arguments to the function:
• Case 1: Generally in R, the arguments are passed to the function in the same order
as in the function definition.
• Case 2: If you do not want to follow any order what you can do is you can pass the
arguments using the names of the arguments in any order.
• Case 3: If the arguments are not passed the default values are used to execute the
function.
Now, let us see the examples for each of these cases in the following R code:
R
# A simple R program to demonstrate
# passing arguments to a function

# A simple R program to demonstrate
# lazy evaluation of functions

# This'll throw an error
print(Cylinder(5,
Output
Error in print(radius) : argument "radius" is missing, with no default
Other Built-in Functions in R
Mathematical Functions
cos(), sin(), and tan() calculate a number's cosine, sine, and tangent.
Statistical Functions
Data Manipulation Functions
File Input/Output Functions
In this tutorial, you will learn everything about environment and scope in R programming with
the help of examples.
In order to write functions in a proper way and avoid unusual errors, we need to know the
concept of environment and scope in R.
R Programming Environment
An environment can be thought of as a collection of objects (functions, variables, etc.). An
environment is created when we first fire up the R interpreter.
The top level environment available to us at the R command prompt is the global environment,
called R_GlobalEnv.
We can use the ls() function to show what variables and functions are defined in the current
environment. Moreover, we can use the environment() function to get the current environment.
a <- 2
ls()
environment()
Output
<environment: R_GlobalEnv>
In the above example, we can see that a, b and f are in the R_GlobalEnv environment.
Notice that x (in the argument of the function) is not in this global environment. When we define
a function, a new environment is created.
Here, the function f() creates a new environment inside the global environment.
Actually an environment has a frame, which has all the objects defined, and a pointer to the
enclosing (parent) environment.
Hence, x is in the frame of the new environment created by the function f. This environment will
also have a pointer to R_GlobalEnv.
f <- function(f_x){
  g <- function(g_x){
    print("Inside g")
    print(environment())
    print(ls())
  }
  g(5)
  print("Inside f")
  print(environment())
  print(ls())
}
f(6)
environment()
Output
R Programming Scope
In R, there are two main types of variables: global variables and local variables.
Global Variables
Global variables are those variables which exist throughout the execution of a program. They
can be changed and accessed from any part of the program.
Local Variables
On the other hand, local variables are those variables which exist only within a certain part of
a program, like a function.
In the above program the variable c is called a local variable.
If we assign a value to a variable inside the function inner_func(), the change will only be local
and cannot be accessed outside the function.
This is the same even if the names of the global and local variables match.
Output
[1] 30
[1] 20
[1] 10
Here, the outer_func() function is defined, and within it, a local variable a is assigned the
value 20.
Inside outer_func(), there is an inner_func() function defined. The inner_func() function also
has its own local variable a, which is assigned the value 30.
When inner_func() is called within outer_func(), it prints the value of its local variable a (30).
Then outer_func() continues executing and prints the value of its own local variable a (20).
Outside the functions, a global variable a is assigned the value 10; printing a at the top level
gives this global value (10).
Inside a function, global variables can be read, but when we try to assign to one, a new local
variable is created instead. To make assignments to global variables, the superassignment
operator, <<-, is used.
When using this operator within a function, it searches for the variable in the parent
environment frame; if it is not found there, the search continues upward through the enclosing
environments. If the variable is still not found, it is created and assigned at the global level.
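The code for this example is missing from the notes; a sketch consistent with the described output (the names outer_func and inner_func follow the text):

```r
outer_func <- function() {
  inner_func <- function() {
    a <<- 30        # no local a: search outer_func's frame, then global
    print(a)
  }
  inner_func()
  print(a)          # finds the a created at the global level
}
outer_func()
print(a)            # the global a created by the superassignment
```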
Output
[1] 30
[1] 30
[1] 30
When the statement a <<- 30 is encountered within inner_func(), it looks for the variable a in
the environment of outer_func(). When that search fails, it searches in R_GlobalEnv.
Since a is not defined in this global environment either, it is created and assigned there.
Recursion, in the simplest terms, is a type of looping technique. It exploits the basic working of
functions in R.
Recursive Function in R:
Recursion is when a function calls itself. This forms a loop: every time the function is
called, it calls itself again, and this technique is known as recursion. Recursive functions
use this idea to perform iterative tasks; the repeated self-calls act as a loop. These
kinds of functions need a stopping (base) condition so that they can stop looping continuously.
Recursive functions call themselves. They break down the problem into smaller components.
The function() calls itself within the original function() on each of the smaller components.
After this, the results will be put together to solve the original problem.
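The factorial function rec_fac used below is not defined anywhere in the notes; a minimal sketch:

```r
# Recursive factorial: the base case at x == 1 stops the recursion
rec_fac <- function(x) {
  if (x == 1) {
    return(1)
  }
  x * rec_fac(x - 1)
}
```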
rec_fac(5)
Output:
[1] 120
Here, rec_fac(5) calls rec_fac(4), which then calls rec_fac(3), and so on until the input
argument x, has reached 1. The function returns 1 and is destroyed. The return value is
multiplied by the argument value and returned. This process continues until the first function
call returns its output, giving us the final result.
Example: Sum of Series Using Recursion
Recursion in R is most useful for finding the sum of self-repeating series. In this example, we
will find the sum of squares of a given series of numbers: Sum = 1^2 + 2^2 + … + N^2.
Example:
R
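The code for this example is missing from the notes; a recursive sketch that yields 385 for N = 10:

```r
# Sum of squares 1^2 + 2^2 + ... + n^2, computed recursively
sum_sq <- function(n) {
  if (n == 1) {
    return(1)
  }
  n * n + sum_sq(n - 1)
}
sum_sq(10)   # 385
```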
Output:
[1] 385
Output:
[1] 15
In this example, the sum_n function recursively decreases n until it reaches 1, which is the
base case of the recursion, adding the current value of n to the sum of the first n-1 values.
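A sketch of the sum_n function described here (its definition did not survive in the notes):

```r
# Sum of the first n natural numbers, computed recursively
sum_n <- function(n) {
  if (n == 1) {
    return(1)
  }
  n + sum_n(n - 1)
}
sum_n(5)   # 15
```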
Output:
[1] 1024
In this example, the exp_n function recursively multiplies the base by itself until n reaches 0,
which is the base case of the recursion.
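A sketch of the exp_n function described here (its definition is also missing from the notes):

```r
# base raised to the power n, computed recursively
exp_n <- function(base, n) {
  if (n == 0) {
    return(1)
  }
  base * exp_n(base, n - 1)
}
exp_n(2, 10)   # 1024
```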
Types of Recursion in R
1. Direct Recursion: The recursion that is direct involves a function calling itself
directly. This kind of recursion is the easiest to understand.
2. Mutual Recursion: Multiple functions that call each other repeatedly make up
mutual recursion. To complete a task, each function depends on the others.
3. Nested Recursion: Nested recursion happens when one recursive function calls
another recursively while passing the output of the first call as an argument. The
arguments of one recursion are nested inside this one.
4. Structural Recursion: Recursion that is based on the structure of the data is known
as structural recursion. It entails segmenting a complicated data structure into
smaller pieces and processing each piece separately.
Replacement Functions
A replacement function is a function whose name ends in <- and that is used on the left-hand
side of an assignment. For example, the statement
cutoff(x) <- 65
is evaluated by R as
x <- `cutoff<-`(x, value = 65)
That is, the replacement function receives the original object and the assigned value, and
whatever it returns replaces the object. Built-in examples include names(x) <- ...,
dim(x) <- ... and levels(x) <- ....
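A small worked example of a user-defined replacement function; the body of `cutoff<-` below is an assumption for illustration, since the notes only show the call:

```r
# A replacement function: its name ends in <- and its
# last argument must be named 'value'
`cutoff<-` <- function(x, value) {
  x[x > value] <- value   # cap every element at 'value'
  x
}

x <- c(10, 80, 30, 90)
cutoff(x) <- 65           # really: x <- `cutoff<-`(x, value = 65)
x                         # 10 65 30 65
```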
Vectors
To combine a list of items into a vector, use the c() function and separate the items by a comma.
In the example below, we create a vector variable called fruits that combines strings:
Example
# Vector of strings
fruits <- c("banana", "apple", "cherry")
fruits
# Vector with numerical values in a sequence
numbers <- 1:10
numbers
You can also create numerical values with decimals in a sequence, but note that if the last
element does not belong to the sequence, it is not used:
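The decimal example itself is missing from the notes; a minimal sketch:

```r
# Decimals in a sequence: steps of 1 starting at 1.5
numbers <- 1.5:3.6
numbers    # 1.5 2.5 3.5  (3.6 is dropped: it is not a full step away)
```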
Example
# Vector of logical values
log_values <- c(TRUE, FALSE, TRUE, FALSE)
log_values
Vector Length
To find out how many items a vector has, use the length() function:
Example
fruits <- c("banana", "apple", "orange")
length(fruits)
Sort a Vector
To sort items in a vector alphabetically or numerically, use the sort() function:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
sort(fruits)   # sort a string vector alphabetically
numbers <- c(13, 3, 5, 7, 20, 2)
sort(numbers)  # sort a numeric vector numerically
Access Vectors
You can access the vector items by referring to its index number inside brackets []. The first item
has index 1, the second item has index 2, and so on:
Example
fruits <- c("banana", "apple", "orange")
fruits[1]      # access the first item
You can also access multiple elements by referring to different index positions with
the c() function:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
fruits[c(1, 3)]   # access the first and third item
You can also use negative index numbers to access all items except the ones specified:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
fruits[c(-1)]     # access all items except the first
Change an Item
To change the value of a specific item, refer to its index number:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
fruits[1] <- "pear"
# Print fruits
fruits
Repeat Vectors
To repeat vectors, use the rep() function:
Example
Repeat each value:
repeat_each <- rep(c(1, 2, 3), each = 3)
repeat_each
Example
Repeat the sequence of the vector:
repeat_times <- rep(c(1, 2, 3), times = 3)
repeat_times
Example
Repeat each value independently:
repeat_indepent <- rep(c(1, 2, 3), times = c(5, 2, 1))
repeat_indepent
Generate Sequenced Vectors
One of the examples on top showed you how to create a vector with numerical values in a
sequence with the : operator. To make bigger or smaller steps in a sequence, use the seq()
function:
Example
numbers <- seq(from = 0, to = 100, by = 20)
numbers
Note: The seq() function has three parameters: from is where the sequence starts, to is where the
sequence stops, and by is the interval of the sequence.
The data structure is a particular way of organizing data in a computer so that it can be used
effectively. The idea is to reduce the space and time complexities of different tasks. Data
structures in R programming are tools for holding multiple values. The two most important
data structures in R are Arrays and Matrices.
Arrays in R
Arrays are data storage objects in R containing more than or equal to 1 dimension. Arrays can
contain only a single data type. The array() function is an in-built function which takes input
as a vector and arranges them according to dim argument. Array is an iterable object, where
the array elements are indexed, accessed and modified individually. Operations on array can be
performed with similar structures and dimensions. Uni-dimensional arrays are called vectors in
R. Two-dimensional arrays are called matrices.
Syntax:
array(data, dim = c(r, c, m), dimnames = list(r.names, c.names, m.names))
Parameters:
data: a vector of values
dim: the dimensions, i.e. m matrices of the specified number of rows r and columns c
dimnames: the names for the dimensions
Example:
R
# creating a vector
vector1 <- c("A", "B", "C")
# declaring a character array
uni_array <- array(vector1)
print("Uni-Dimensional Array")
print(uni_array)
Output:
[1] "Uni-Dimensional Array"
[1] "A" "B" "C"
Matrices in R
Syntax:
matrix(data, nrow, ncol, byrow)
Parameters:
data: contain a vector of similar data type elements.
nrow: number of rows.
ncol: number of columns.
byrow: by default matrices are filled in column-wise order; setting byrow = TRUE fills the
matrix row-wise instead.
Example:
R
A = matrix(
  # Taking sequence of elements
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  # No of rows and columns
  nrow = 3, ncol = 3,
  # By default matrices are in column-wise order;
  # this parameter decides how to arrange the matrix
  byrow = TRUE
)
print(A)
Output:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
Arrays vs Matrices
Arrays:
• Can contain one or more dimensions.
• A single vector arranged into the specified dimensions.
• The array() function can be used to create a matrix by specifying the third dimension to be 1.
Matrices:
• Contain exactly two dimensions in a table-like structure.
• Comprise multiple equal-length vectors stacked together in a table.
• The matrix() function can create at most a 2-dimensional array.
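The last point of the comparison can be illustrated with a short sketch (variable names are illustrative):

```r
# array() acting like matrix(): third dimension set to 1
a <- array(1:6, dim = c(2, 3, 1))
m <- matrix(1:6, nrow = 2, ncol = 3)
a[, , 1]    # the same 2 x 3 column-wise arrangement as m
```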
A list in R is a generic object consisting of an ordered collection of objects. Lists are one-
dimensional, heterogeneous data structures. The list can be a list of vectors, a list of matrices, a
list of characters and a list of functions, and so on.
A list is a vector but with heterogeneous data elements. A list in R is created with the use
of list() function. R allows accessing elements of an R list with the use of the index value. In
R, the indexing of a list starts with 1 instead of 0 like in other programming languages.
Creating a List
To create a List in R you need to use the function called list(). In other words, a list is a
generic vector containing other objects. To illustrate how a list looks, we take an example here.
We want to build a list of employees with the details. So for this, we want attributes such as ID,
employee name, and the number of employees.
Example:
R
# R program to create a list of employees
empId <- c(1, 2, 3, 4)
empName <- c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp <- 4
empList <- list(empId, empName, numberOfEmp)
print(empList)
Output:
[[1]]
[1] 1 2 3 4
[[2]]
[1] "Debi" "Sandeep" "Subham" "Shiba"
[[3]]
[1] 4
Accessing components of a list
We can access components of an R list in two ways.
• Access components by names: All the components of a list can be named and we
can use those names to access the components of the R list using the dollar
command.
Example:
R
# R program to access components of a list
empId <- c(1, 2, 3, 4)
empName <- c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp <- 4
empList <- list(
  "ID" = empId,
  "Names" = empName,
  "Total Staff" = numberOfEmp
)
print(empList)
Output:
$ID
[1] 1 2 3 4
$Names
[1] "Debi" "Sandeep" "Subham" "Shiba"
$`Total Staff`
[1] 4
# Accessing a single component by its name
print(empList$Names)
Output:
[1] "Debi" "Sandeep" "Subham" "Shiba"
Example:
R
# R program to edit components of a list
# (assumes empList from the previous example)
cat("Before editing:\n")
print(empList[c("Names", "Total Staff")])
# add a new employee and update the staff count
empList$Names <- c(empList$Names, "Kamala")
empList$`Total Staff` <- 5
cat("After editing:\n")
print(empList[c("Names", "Total Staff")])
Output:
Before editing:
$Names
[1] "Debi" "Sandeep" "Subham" "Shiba"
$`Total Staff`
[1] 4
After editing:
$Names
[1] "Debi" "Sandeep" "Subham" "Shiba" "Kamala"
$`Total Staff`
[1] 5
Concatenation of lists
Two R lists can be concatenated using the c() function. So, when we want to concatenate two
lists, we pass both of them to c().
Syntax:
list = c(list, list1)
list = the original list
list1 = the new list
Example:
R
# R program to concatenate a new list to an existing list
# (assumes empList from the previous examples)
list1 <- list("Rahul", "Raj")
cat("Before concatenation of the new list\n")
print(empList)
empList <- c(empList, list1)
cat("After concatenation of the new list\n")
print(empList)
Output:
Before concatenation of the new list
$ID
[1] 1 2 3 4
$Names
[1] "Debi" "Sandeep" "Subham" "Shiba"
$`Total Staff`
[1] 4
# R program to delete components of a list
# (assumes empList from the previous examples)
cat("Before deletion the list is\n")
print(empList)
# delete the ID and Total Staff components by assigning NULL
empList$ID <- NULL
empList$`Total Staff` <- NULL
cat("After deletion the list is\n")
print(empList)
Output:
Before deletion the list is
$ID
[1] 1 2 3 4
$Names
[1] "Debi" "Sandeep" "Subham" "Shiba"
$`Total Staff`
[1] 4
After deletion the list is
$Names
[1] "Debi" "Sandeep" "Subham" "Shiba"
Merging Lists
Two lists can be merged into one with the c() function:
R
lst1 <- list(1, 2, 3)
lst2 <- list("Sun", "Mon", "Tue")
merged_list <- c(lst1, lst2)
print(merged_list)
Output:
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] "Sun"
[[5]]
[1] "Mon"
[[6]]
[1] "Tue"
Converting List to Vector
Here we are going to convert the R list to vector, for this we will create a list first and then
unlist the list into the vector.
R
# Create a list
lst <- list(1:5)
print(lst)
# Convert the list to a vector
vec <- unlist(lst)
print(vec)
Output:
[[1]]
[1] 1 2 3 4 5
[1] 1 2 3 4 5
R List to matrix
We will create matrices using matrix() function in R programming. Another function that will
be used is unlist() function to convert the lists into a vector.
R
# Defining list
lst1 <- list(list(1, 2, 3),
             list(4, 5, 6))
# Print list
cat("The list is:\n")
print(lst1)
cat("Class:", class(lst1), "\n")
# Convert the list to a matrix:
# unlist() flattens the list, matrix() reshapes it
mat <- matrix(unlist(lst1), nrow = 2, byrow = TRUE)
# Print matrix
cat("\nAfter conversion to matrix:\n")
print(mat)
cat("Class:", class(mat), "\n")
Output:
The list is:
[[1]]
[[1]][[1]]
[1] 1
[[1]][[2]]
[1] 2
[[1]][[3]]
[1] 3
[[2]]
[[2]][[1]]
[1] 4
[[2]][[2]]
[1] 5
[[2]][[3]]
[1] 6
Class: list

After conversion to matrix:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
Class: matrix array
R is a programming language for statistical computing and graphics supported by the R Core
Team and the R Foundation for Statistical Computing. Created by statisticians Ross
Ihaka and Robert Gentleman, R is used among data
miners, bioinformaticians and statisticians for data analysis and developing statistical
software.[7]
The core R language is augmented by a large number of extension packages containing
reusable code and documentation.
According to user surveys and studies of scholarly literature databases, R is one of the most
commonly used programming languages in data mining.[8] As of April 2023, R ranks 16th in
the TIOBE index, a measure of programming language popularity, in which the language peaked
in 8th place in August 2020.[9][10]
A data structure is a particular way of organizing data in a computer so that it can be used
effectively. The idea is to reduce the space and time complexities of different tasks. Data
structures in R programming are tools for holding multiple values.
R’s base data structures are often organized by their dimensionality (1D, 2D, or nD) and
whether they’re homogeneous (all elements must be of the identical type) or heterogeneous
(the elements are often of various types). This gives rise to the six data structures which are
most frequently utilized in data analysis.
The most essential data structures used in R include:
• Vectors
• Lists
• Dataframes
• Matrices
• Arrays
• Factors
Vectors
A vector is an ordered collection of basic data types of a given length. The key thing here is
that all the elements of a vector must be of the identical data type, i.e. vectors are
homogeneous data structures. Vectors are one-dimensional data structures.
Example:
R
# Vector of strings
fruits <- c("banana", "apple", "cherry")
print(fruits)
Lists
A list is a generic object consisting of an ordered collection of objects. Lists are heterogeneous
data structures. These are also one-dimensional data structures. A list can be a list of vectors,
list of matrices, a list of characters and a list of functions and so on.
Example:
R
# R program to create a list of employees
empId <- c(1, 2, 3, 4)
empName <- c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp <- 4
empList <- list(empId, empName, numberOfEmp)
print(empList)
Output:
[[1]]
[1] 1 2 3 4
[[2]]
[1] "Debi" "Sandeep" "Subham" "Shiba"
[[3]]
[1] 4
Dataframes
Dataframes are generic data objects of R which are used to store the tabular data. Dataframes
are the foremost popular data objects in R programming because we are comfortable in seeing
the data within the tabular form.
• A data-frame must have column names and every row should have a unique name.
• Each column must have the identical number of items.
• Each item in a single column must be of the same data type.
• Different columns may have different data types.
To create a data frame we use the data.frame() function.
Example:
R
# Create a dataframe of employees
emp.data <- data.frame(
  Name = c("Debi", "Sandeep", "Subham"),
  Age = c(23, 41, 32)
)
print(emp.data)
Matrices
A matrix is a rectangular arrangement of numbers in rows and columns. In a matrix, as we
know rows are the ones that run horizontally and columns are the ones that run vertically.
Matrices are two-dimensional, homogeneous data structures.
Now, let’s see how to create a matrix in R. To create a matrix in R you need to use the function
called matrix. The arguments to this matrix() are the set of elements in the vector.
You have to pass how many numbers of rows and how many numbers of columns you want to
have in your matrix and this is the important point you have to remember that by default,
matrices are in column-wise order.
Example:
R
A = matrix(
  # Taking sequence of elements
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  # No of rows and columns
  nrow = 3, ncol = 3,
  # By default matrices are in column-wise order;
  # this parameter decides how to arrange the matrix
  byrow = TRUE
)
print(A)
Output:
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
Arrays
Arrays are the R data objects which store the data in more than two dimensions. Arrays are n-
dimensional data structures. For example, if we create an array of dimensions (2, 3, 3) then it
creates 3 rectangular matrices each with 2 rows and 3 columns. They are homogeneous data
structures.
Now, let’s see how to create arrays in R. To create an array in R you need to use the function
called array(). The arguments to this array() are the set of elements in vectors and you have to
pass a vector containing the dimensions of the array.
Example:
R
A = array(
  # Taking sequence of elements
  c(1, 2, 3, 4, 5, 6, 7, 8),
  # Two matrices, each with two rows and two columns
  dim = c(2, 2, 2)
)
print(A)
Output:
,,1
[,1] [,2]
[1,] 1 3
[2,] 2 4
,,2
[,1] [,2]
[1,] 5 7
[2,] 6 8
Factors
Factors are the data objects which are used to categorize the data and store it as levels. They
are useful for storing categorical data. They can store both strings and integers. They are useful
to categorize unique values in columns like “TRUE” or “FALSE”, or “MALE” or “FEMALE”,
etc.. They are useful in data analysis for statistical modeling.
Now, let’s see how to create factors in R. To create a factor in R you need to use the function
called factor(). The argument to this factor() is the vector.
Example:
R
# Creating a factor of genders
fac <- factor(c("Male", "Female", "Male", "Male", "Female", "Male", "Female"))
print(fac)
Output:
[1] Male Female Male Male Female Male Female
Levels: Female Male
4) What is Bootstrapping?
It is a method of sample reuse. The main idea is to use the observed sample to estimate the
population distribution.
Three forms of bootstrapping:
• Non-parametric (re-sampling)
• Semi-parametric (adding noise)
• Parametric (simulation)
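The three forms above can be illustrated with a minimal non-parametric bootstrap sketch in R; the data vector x is made up for illustration:

```r
# Non-parametric bootstrap: resample the observed sample with
# replacement to estimate the sampling distribution of the mean
set.seed(42)
x <- c(5, 7, 8, 4, 6, 9, 5, 7)
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
mean(boot_means)   # bootstrap estimate of the mean
sd(boot_means)     # bootstrap standard error of the mean
```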
Unit- V
1) How to assign a string to a variable and check the length of a string in R language?
Assigning a string to a variable is done with the variable followed by the <- operator and
the string.
Eg: str <- "Hello"
str # print the value of str
To find the number of characters in a string, use the nchar() function.
Eg: str <- "hello world"
nchar(str) # returns 11