
DS4015-BIG DATA ANALYTICS

Master of Computer Applications -(BARATH.S/AP)


DS4015 BIG DATA ANALYTICS

COURSE OBJECTIVES:
• To understand the basics of big data analytics
• To understand the search methods and visualization
• To learn mining data streams
• To learn frameworks
• To gain knowledge on R language
UNIT I INTRODUCTION TO BIG DATA 9
Introduction to Big Data Platform – Challenges of Conventional Systems - Intelligent data analysis
–Nature of Data - Analytic Processes and Tools - Analysis Vs Reporting - Modern Data Analytic
Tools- Statistical Concepts: Sampling Distributions - Re-Sampling - Statistical Inference -
Prediction Error.
UNIT II SEARCH METHODS AND VISUALIZATION 9
Search by simulated Annealing – Stochastic, Adaptive search by Evaluation – Evaluation
Strategies –Genetic Algorithm – Genetic Programming – Visualization – Classification of Visual
Data Analysis Techniques – Data Types – Visualization Techniques – Interaction techniques –
Specific Visual data analysis Techniques
UNIT III MINING DATA STREAMS 9
Introduction To Streams Concepts – Stream Data Model and Architecture - Stream Computing -
Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream –
Estimating Moments – Counting Oneness in a Window – Decaying Window - Real time Analytics
Platform(RTAP) Applications - Case Studies - Real Time Sentiment Analysis, Stock Market
Predictions
UNIT IV FRAMEWORKS 9
MapReduce – Hadoop, Hive, MapR – Sharding – NoSQL Databases - S3 - Hadoop Distributed File
Systems – Case Study - Preventing Private Information Inference Attacks on Social Networks –
Grand Challenge: Applying Regulatory Science and Big Data to Improve Medical Device
Innovation
UNIT V R LANGUAGE 9
Overview, Programming structures: Control statements -Operators -Functions -Environment and
scope issues -Recursion -Replacement functions, R data structures: Vectors -Matrices and arrays -
Lists -Data frames -Classes, Input/output, String manipulations

COURSE OUTCOMES:
CO1: Understand the basics of big data analytics
CO2: Ability to use Hadoop, MapReduce Framework.
CO3: Ability to identify the areas for applying big data analytics for increasing the business
outcome.
CO4: Gain knowledge on R language
CO5: Contextually integrate and correlate large amounts of information to gain faster insights.
TOTAL: 45 PERIODS


DS4015 BIG DATA ANALYTICS

UNIT I
Part A
1. What is Big Data Platform?

A big data platform acts as an organized storage medium for large amounts of data. Big data
platforms utilize a combination of data management hardware and software tools to store
aggregated data sets, usually onto the cloud.

2. List the characteristics of Big Data Platform?

• Volume
• Veracity
• Variety
• Value
• Velocity

3. What is meant by scalability?

Scalability in big data refers to the ability of a system to expand and accommodate a growing influx
of information without compromising its integrity or performance. A scalable data platform
utilizes added hardware or software to increase output and storage of data, and accommodates
rapid changes in the growth of data, either in traffic or volume. Data scalability is important
for any successful business operation today, allowing organizations to handle an ever-increasing
amount of data easily and efficiently.

4. Write briefly about speed (velocity)?

Velocity refers to the speed with which data is generated. High-velocity data is generated at
such a pace that it requires distinct (distributed) processing techniques. An example of data
generated with high velocity would be Twitter messages or Facebook posts.

5. Write a short note on storage?

The data, which comes in structured, semi-structured, and unstructured forms, is collected from
multiple sources across web, mobile, and the cloud. It is then stored in a repository—a data lake
or data warehouse —in preparation to be processed.


6. What is meant by data integration?

Data integration combines data from multiple sources into a unified, consistent view. It typically involves the following steps:
• Extract data from various sources
• Store data in an appropriate fashion
• Transform and integrate data with analytics
• Orchestrate and Use/Load data

7. What is security?

Big data analytics in security is the use of advanced analytical techniques on large-scale data sets
to identify and address potential cybersecurity threats. It involves the ability to gather,
analyze, visualize and draw insights from massive amounts of digital information. It can help
predict and stop cyber attacks by detecting anomalies and patterns, and it works together with
security technologies and sensors to improve the cyber defence posture of organizations.

8. What are the applications of big data analytics?

1. Tracking Customer Spending Habit, Shopping Behavior:

2. Recommendation:

3. Smart Traffic System:

4. Secure Air Traffic System:

5. Auto Driving Car:

6. Virtual Personal Assistant Tool:

7. IoT:

8. Education Sector:

9. Energy Sector:

10. Media and Entertainment Sector:


9. What are the opportunities and challenges in big data?

1. Lower costs: Across sectors such as healthcare, retail, production, and manufacturing,
Big Data solutions help reduce costs. ...
2. New innovations and business opportunities: Analytics gives a lot of insight into trends
and customer preferences. ...
3. Business proliferation: ...
4. Identifying the big data to use: ...
5. Making Big Data Analytics fast: ...

10. Explain briefly about data reporting and analysis?

• Reporting provides data. ...
• Reporting just provides the data that is asked for, while analysis provides the information
or the answer that is actually needed.
• Reporting is done in a standardized manner, while analysis can be customized.

11. What is data-driven decision making?

Data-driven decision making means that organizations make informed decisions based on large
volumes of data rather than intuition alone, improving business strategies and operations.

12. Write briefly about competitive advantage?

Companies that can effectively harness Big Data gain a competitive edge by identifying trends,
customer preferences, and market opportunities.

13. What is innovation?

Big Data platforms facilitate innovation by enabling the development of advanced analytics,
machine learning models, and artificial intelligence applications.

14. What is cost efficiency?

Big Data platforms can reduce the cost of data storage and processing through scalable, distributed
architectures.


15. What is variety?

Data comes in various formats, including structured data (e.g., databases), unstructured data
(e.g., text and multimedia), and semi-structured data (e.g., XML and JSON).

16. What is data ingestion?

This component involves collecting data from various sources, such as databases, logs, IoT
devices, and social media. Tools like Apache Kafka and Flume are commonly used for real-time
data ingestion.

17. What is data storage?

Big Data platforms offer scalable and distributed storage solutions capable of handling the large
volume of data. Examples include Hadoop Distributed File System (HDFS) and cloud-based
storage services like Amazon S3.

18. What is processing?

To extract valuable insights, data needs to be processed. Big Data platforms support batch
processing (e.g., Apache Hadoop) and stream processing (e.g., Apache Spark) for real-time
analytics.
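To make the batch-processing model concrete, here is a minimal, single-machine sketch of the map-shuffle-reduce idea written in base R. It is only a conceptual illustration of what frameworks like Hadoop do at cluster scale; the documents and all names are illustrative assumptions, not any framework's API.

# A toy MapReduce-style word count in base R (conceptual sketch only).
docs <- c("big data needs distributed processing",
          "stream processing handles data in real time")
# Map: emit one token per word in every document.
pairs <- unlist(lapply(docs, function(d) strsplit(tolower(d), "\\s+")[[1]]))
# Shuffle: group identical keys (words) together.
groups <- split(rep(1, length(pairs)), pairs)
# Reduce: sum the counts for each word.
counts <- sapply(groups, sum)
print(sort(counts, decreasing = TRUE))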

19. What is data analysis and visualization?

Once data is processed, it can be analyzed using various tools and frameworks like Apache Hive,
Apache Pig, or machine learning libraries. Visualization tools like Tableau or Power BI help in
presenting insights.

20. What is data governance and security?

Data governance ensures data quality, compliance, and security. Access control, encryption, and
auditing are essential components of data security in Big Data platforms.


Part B

1. Describe about Big Data Platform examples?

BIG DATA EXAMPLES TO KNOW

• Marketing: forecast customer behavior and product strategies.

• Transportation: assist in GPS navigation, traffic and weather alerts.

• Government and public administration: track tax, defense and public health data.

• Business: streamline management operations and optimize costs.

• Healthcare: access medical records and accelerate treatment development.

• Cybersecurity: detect system vulnerabilities and cyber threats.

Big Data Examples in Marketing

Big data and marketing go hand-in-hand, as businesses harness consumer information to forecast
market trends, buyer habits and other company behaviors. All of this helps businesses determine
what products and services to prioritize.
Big Data Examples in Transportation

Navigation apps and databases, whether used by car drivers or airplane pilots, frequently rely on
big data analytics to get users safely to their destinations. Insights into routes, travel time and
traffic are pulled from several data points and provide a look at travel conditions and vehicle
demands in real time.

Big Data Examples in Government

To stay on top of citizen needs and other executive duties, governments may look toward big
data analytics. Big data helps to compile and provide insights into suggested legislation,
financial procedure and local crisis data, giving authorities an idea of where to best delegate
resources.
Big Data Examples in Business

Succeeding in business means companies have to keep track of multiple moving parts — like
sales, finances, operations — and big data helps to manage it all. Using data analytics,
professionals can follow real-time revenue information, customer demands and managerial tasks


to not only run their organization but also continually optimize it.

Big Data Examples in Healthcare

When it comes to medical cases, healthcare professionals may use big data to determine the best
treatment. Patterns and insights can be drawn from millions of patient data records, which guide
healthcare workers in providing the most relevant remedies for patients and how to best advance
drug development.

Big Data Examples in Cybersecurity

As cyber threats and data security concerns persist, big data analytics are used behind the scenes
to protect customers every day. By reviewing multiple web patterns at once, big data can help
identify unusual user behavior or online traffic and defend against cyber attacks before they even
start. It can also prioritize concurrent breaches, map out multipart attacks and identify potential
root causes of security issues.

2. Briefly explain the challenges of conventional systems?

Big data has revolutionized the way businesses operate, but it has also presented a
number of challenges for conventional systems. Here are some of the challenges
faced by conventional systems in handling big data:

Big data is a term used to describe the large amount of data that can be stored and
analyzed by computers. Big data is often used in business, science and government.
Big Data has been around for several years now, but it's only recently that people
have started realizing how important it is for businesses to use this technology in
order to improve their operations and provide better services to customers. A lot of
companies have already started using big data analytics tools because they realize
how much potential there is in utilizing these systems effectively!

However, while there are many benefits associated with using such systems -
including faster processing times as well as increased accuracy - there are also some
challenges involved with implementing them correctly.


Challenges of Conventional System in big data

• Scalability
• Speed
• Storage
• Data Integration
• Security

Scalability
A common problem with conventional systems is that they can't scale. As the
amount of data increases, so does the time it takes to process and store it. This
can cause bottlenecks and system crashes, which are not ideal for businesses
looking to make quick decisions based on their data.
Conventional systems also lack flexibility in how they handle new types of information--for
example, adding another column (columns are like fields) or row (rows are like records) may
require rewriting much of your code from scratch.
Speed
Speed is a critical component of any data processing system. Speed is important
because it allows you to:
• Process and analyze your data faster, which means you can make better-
informed decisions about how to proceed with your business.
• Make more accurate predictions about future events based on past
performance.
Storage
The amount of data being created and stored is growing exponentially, with
estimates that it will reach 44 zettabytes by 2020. That's a lot of storage space!

The problem with conventional systems is that they don't scale well as you add
more data. This leads to huge amounts of wasted storage space and lost information
due to corruption or security breaches.
Data Integration
The challenges of conventional systems in big data are numerous. Data
integration is one of the biggest challenges, as it requires a lot of time and
effort to combine different sources into a single database. This is especially true
when you're trying to integrate data from multiple sources with different
schemas and formats.

Another challenge is errors and inaccuracies in analysis due to a lack of understanding of what
exactly happened during an event or transaction. For example, if there was an error while
transferring money from one bank account to another, then there would be no way for us to know
what actually happened unless someone tells us about it later on (which may not happen).
Security
Security is a major challenge for enterprises that depend on conventional
systems to process and store their data. Traditional databases are designed to
be accessed by trusted users within an organization, but this makes it difficult
to ensure that only authorized people have access to sensitive information.

Security measures such as firewalls, passwords and encryption help protect against
unauthorized access and attacks by hackers who want to steal data or disrupt
operations. But these security measures have limitations: They're expensive; they
require constant monitoring and maintenance; they can slow down performance if
implemented too extensively; and they often don't prevent breaches altogether
because there's always some way around them (such as through phishing emails).

Conventional systems are not equipped for big data. They were designed for a
different era, when the volume of information was much smaller and more
manageable. Now that we're dealing with huge amounts of data, conventional
systems are struggling to keep up. Conventional systems are also expensive and
time-consuming to maintain; they require constant maintenance and upgrades in
order to meet new demands from users who want faster access speeds and more
features than ever before.
• Disk Capacity
– 1990 – 20MB
– 2000 - 1GB
– 2010 – 1TB
• Disk Latency (speed of reads and writes) – not much improvement in last 7-10 years,
currently around 70 – 80MB / sec


How long will it take to read 1 TB of data?

At 80 MB/sec:
• 1 disk - 3.4 hours
• 10 disks - 20 min
• 100 disks - 2 min
• 1000 disks - 12 sec
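A quick back-of-the-envelope check of these figures in R, assuming 1 TB = 10^6 MB (decimal convention) and a sustained 80 MB/sec per disk:

# Time to read 1 TB at 80 MB/sec, spread evenly over n disks.
tb_in_mb  <- 1e6                      # 1 TB expressed in MB (assumption)
rate_mb_s <- 80                       # sustained read rate per disk
disks     <- c(1, 10, 100, 1000)
seconds   <- tb_in_mb / (rate_mb_s * disks)
data.frame(disks   = disks,
           hours   = round(seconds / 3600, 2),
           minutes = round(seconds / 60, 1),
           seconds = round(seconds, 0))
# roughly: 1 disk ~ 3.5 h, 10 disks ~ 21 min, 100 disks ~ 2 min, 1000 disks ~ 12.5 sec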
What do we care about when we process data?
• Handle partial hardware failures without going down:
– If a machine fails, we should be able to switch over to a standby machine
– If a disk fails – use RAID or a mirror disk
• Able to recover from major failures:
– Regular backups
– Logging
– Mirror the database at a different site
• Capability:
– Increase capacity without restarting the whole system
– More computing power should translate into faster processing
• Result consistency:
– Answers should be consistent (independent of something failing) and returned in a reasonable
amount of time

3.Explain about intelligent data analysis?

Intelligent Data Analysis provides a forum for the examination of issues related to the research
and applications of Artificial Intelligence techniques in data analysis across a variety of
disciplines. These techniques include (but are not limited to): all areas of data visualization,
data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering,
database mining techniques, tools and applications, use of domain knowledge in data analysis,
big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic,
statistical pattern recognition, knowledge filtering, and post-processing. In particular, papers are
preferred
that discuss the development of new AI-related data analysis architectures, methodologies, and
techniques and their applications to various domains.

Intelligent Data Analysis (IDA) is one of the most important approaches in the field of data
mining, which attracts great concerns from the researchers. Based on the basic principles of IDA
and the features of datasets that IDA handles, the development of IDA is briefly summarized

from three aspects, i.e., algorithm principle, the scale and type of the dataset. Moreover, the

challenges facing the IDA in big data environment are analyzed from four views, including big
data management, data collection, data analysis, and application pattern. It is also cleared that in
order to extract more values from data, the further development of IDA should combine
practical applications and theoretical researches together.

4.Discuss briefly about Nature of data?

The Nature of Data

That’s a pretty broad title, but, really, what we’re talking about here are some fundamentally
different ways to treat data as we work with it. This topic can seem academic, but it is relevant for
web analysts specifically and researchers broadly. This topic turns out to be pretty darn important
when it comes time to apply statistical operations and perform model building and testing.

So, we have to start with the basics: the nature of data. There are four types of data:
• Nominal
• Ordinal
• Interval
• Ratio

Each offers a unique set of characteristics, which impacts the type of analysis that can be
performed.

The distinction between the four types of scales centers on three different characteristics:
1. The order of responses – whether it matters or not
2. The distance between observations – whether it matters or is interpretable
3. The presence or inclusion of a true zero
Nominal Scales

Nominal scales measure categories and have the following characteristics:


• Order: The order of the responses or observations does not matter.
• Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is
not the same as a 2 and 3.
• True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.


Consider traffic source (or last touch channel) as an example in which visitors reach our site
through a mutually exclusive channel, or last point of contact. These channels would include:
1. Paid Search
2. Organic Search
3. Email
4. Display

(This list looks artificially short, but the logic and interpretation would remain the same for nine
channels or for 99 channels.)

If we want to know that each channel is simply somehow different, then we could count the
number of visits from each channel. Those counts can be considered nominal in nature.

5. What are analytics processes and tools? Explain.

Best Analytic Processes and Big Data Tools

Big data is the storage and analysis of large data sets. These are complex data sets which can be
both structured or unstructured. They are so large that it is not possible to work on them with
traditional analytical tools. These days, organizations are realising the value they get out of big
data analytics and hence they are deploying big data tools and processes to bring more efficiency
in their work environment. They are willing to hire good big data analytics professionals at a
good salary. In order to be a big data analyst, you should get acquainted with big data first and
get certification by enrolling yourself in analytics courses online.
Top 5 Big Data Tools
There are many big data tools and processes being utilised by companies these days. These are
used in the processes of discovering insights and supporting decision making. The top big data
tools used these days are open source data tools, data visualization tools, sentiment tools, data
extraction tools and databases. Some of the best used big data tools are mentioned below –
1. R-Programming
R is a free open source software programming language and a software environment for
statistical computing and graphics. It is used by data miners for developing statistical software
and data analysis. It has become a highly popular tool for big data in recent years.
2. Datawrapper


It is an online data visualization tool for making interactive charts. You upload your data file in
CSV, PDF or Excel format, or paste it directly into the field. Datawrapper then generates a
visualization in the form of a bar chart, line chart, map, etc. It can be embedded into any other
website as well. It is easy to use and produces visually effective charts.
3. Tableau Public
Tableau is another popular big data tool. It is simple and very intuitive to use. It communicates
the insights of the data through data visualisation. Through Tableau, an analyst can check a
hypothesis and explore the data before starting to work on it extensively.
4. Content Grabber
Content Grabber is a data extraction tool. It is suitable for people with advanced programming
skills. It is web crawling software. Businesses can use it to extract content and save it in a
structured format. It offers editing and debugging facilities, among many others, for later
analysis. The market is full of big data tools these days. These tools help unlock the power that
big data provides to business processes. By choosing the tools carefully, a company can
increase its efficiency in its operations.

6. Explain about analysis vs reporting ?

Analytics and reporting can help businesses transform data into actionable insights,
identify customer behavior patterns, measure each department’s performance, and
improve operational efficiency.

And this is just the tip of the iceberg.

However, while these two terms are often used interchangeably, they represent different
approaches to understanding and communicating data.

Reporting involves gathering data and presenting it in a structured way, whereas analytics
is using data to identify patterns and gain insights to inform future decision-making.

Think of it as a nurse (reporting) and doctor (analytics).

A nurse takes vital signs, records symptoms, and reports this information to the doctor.
The doctor then uses this information to diagnose the patient’s condition and develop a
treatment plan.

But how many companies actually know the difference between analytics and reporting?
And do they have dedicated roles for both areas?

We conducted a survey with 22 respondents to find the answers to these questions (and a
few more you’ll want to stick around for).

Let’s check it out.



• What is Analytics vs Reporting?

• Importance of Analytics and Reporting

• 4 Key Differences between Reporting and Analytics

• Streamline Both Your Reporting and Analytics with Databox

What is Analytics vs. Reporting?

Analytics and reporting both represent ways of understanding and communicating data,
but they do it differently.

Reporting is the process of collecting data and presenting it in a structured and easy-to-
understand manner, often in the form of charts, tables, or graphs. It’s important to have a
well-defined process and present data accurately, to prevent any misinterpretations.

Reports usually provide information on past performances and KPIs like sales figures,
website traffic, or customer demographics. Depending on what you want to focus on,
there are several types of reports (e.g. financial report, sales report, marketing report,
etc.).

This process allows stakeholders, executives, managers, investors, or regulators to quickly and
easily access and understand important information like financial results and operational metrics.

Analytics, on the other hand, involves using data to draw insights and make informed
decisions. It goes beyond simply looking at what has happened in the past and instead
aims to answer questions about why something happened and what might happen in the
future.

Modern analytics tools also leverage complex data analysis techniques, such as predictive
modeling, data mining, and machine learning, to uncover hidden insights and trends in
the data. The purpose of analytics is to help managers and executives make informed
decisions that will drive the business forward.

Overall, analytics answers why something is happening based on the data, whereas
reporting tells what is happening.

Because these two terms represent different processes, companies should employ
different people for both areas – data analysts and reporting analysts.

We asked our respondents whether they have reporting analysts in the company and most
of them answered “Yes”.


We also asked those who have reporting analysts about how long they’ve had them on the
team. Most respondents have had them for between 1-3 years.

As for data analysts, most respondents have 2-3 data analysts in the organization.


Importance of Analytics and Reporting

Analytics and reporting both play a critical role in modern business operations.

Because of the vast amounts of data they collect, businesses need effective analytics and
reporting processes to leverage the information and make strategic decisions. It helps
them optimize operations, improve efficiency, reduce costs, and deliver better customer
experiences.

Together, these two processes provide a comprehensive view of a business’s operations, and using
only one of them can cost you relevant insights. Both are critical for businesses
to thrive in today’s competitive environment.

PRO TIP: How Well Are Your Marketing KPIs Performing?

Like most marketers and marketing managers, you want to know how your efforts are
translating into results each month. How is your website performing? How well are you
converting traffic into leads and customers? Which marketing channels are performing
best? How does organic search compare to paid campaigns and to previous months? You
might have to scramble to put all of this together in a single report, but now you can
have it all at your fingertips in a single Databox dashboard.

Our Monthly Marketing Performance Dashboard includes data from Google Analytics 4
and HubSpot Marketing with key performance metrics like:

1. Website sessions, new users, and new leads. Basic engagement data from
your website. How much traffic? How many new visitors? How many lead
conversions?

2. Lead generation vs goal. Did you reach your goal for lead conversion
for the month, quarter, or year? If not, by how much did you miss?

3. Overall marketing performance. A summary list of the main KPIs for your
website: sessions, contacts, leads, customers, bounce rate, avg. session
duration, pages/session, and pageviews.

4. Email response. Overall, how effective were your email campaigns, measured by email opens?

5. Blog post traffic. How much traffic did your blog attract during a certain
period?

6. New contacts by source. Which sources drove the highest number of new contacts?

7. Visits and contacts by source. How did your sources compare by both
sessions and new contacts in a certain period of time?


Now you can benefit from the experience of our Google Analytics and HubSpot
Marketing experts, who have put together a plug-and-play Databox template that contains
all the essential metrics for monitoring and analyzing your website traffic and its sources,
lead generation, and more. It’s simple to implement and start using as a standalone
dashboard or in marketing reports, and best of all, it’s free!

You can easily set it up in just a few clicks – no coding required.

To set up the dashboard, follow these 3 simple steps:

Step 1: Get the template

Step 2: Connect your HubSpot and Google Analytics 4 accounts with Databox.

Step 3: Watch your dashboard populate in seconds.

4 Key Differences between Reporting and Analytics

We already touched briefly on some of the main differences between data analytics and
reporting, but we also wanted to do a deep dive into each one individually and show you
some interesting things our respondents pointed out.

The four key differences between reporting and analytics are:

• Differences in underlying purposes and use cases

• Differences in the way data is presented

• Differences in goals

• Differences in the process

Differences in Underlying Purposes and Use Cases

To begin with, analytics and reporting both serve different purposes. If you’re looking to
get an answer to ‘what’s happening’ you need data reporting.

However, if you already have data reports (in simple words: organized and summarized
data) and you need to find out the answer to ‘what now,’ you need to dive into analytics
(and analytics dashboards).

Technically speaking, reporting is a subdivision of analytics and you can’t have analytics
without reporting, but analytics goes a bit further and is generally a more complex
process.

As VisualFizz’s Marissa Ryan puts it, “Reporting is simply a means of making an observation
about an occurrence. While that, of course, is an important step, reporting doesn’t necessarily
provide direction, guidance, or anything actionable.”


“Analytics looks at the incoming data reports, looks for patterns, delivers insights, and
guides actionable marketing decisions,” Ryan explains.

In short, “reporting is for observation. Analytics is for actions.”


Looking at reporting and analytics this way shows us they’re dependent on each other.

If you want actionable insights or recommendations from raw data, you’ll first need to
organize and format it – which is what reporting takes care of.

Similarly, reporting without analytics is useless at its core. Because then you have an idea
of what’s happening based on the data gathered, but no way to interpret it into actionable
takeaways to execute.

With this in mind, it’s apparent that their use cases drastically differ.

Sean Carrigan of MobileQubes adds that “analytics is useful for ad hoc interpretation of
data to answer specific questions related to user behavior, trends, etc. so that
improvements can be implemented.

Reporting provides data related to what is happening and is processed in a standardized format
on a repeatable schedule… But it is only fully valuable when it is followed with proper and
insightful analytics,” concludes Carrigan.


Differences in the Way Data is Presented

Since reporting is about formatting and making data easy to understand, it’s more
presentation-oriented than analytics. It typically relies on showcasing data in charts,
graphs, and other visually appealing formats.

The focus is on summarizing key metrics and performance indicators so that shareholders
and managers can easily grasp the information.

On the other hand, analytics outputs are generally in the form of documented insights,
recommended actions and strategies, forecasts, ad hoc reports, summary reports, and
dashboards.

Eden Cheng from PeopleFinderFree adds that “reporting is utilized to drag details from
the raw data, in the leading form of easy-to-read dashboards of valuable graphs.
Therefore, via reporting, data is carefully arranged and summarized in seamlessly
digestible ways.”

Cheng also mentions that “analytics is one step ahead of reporting and enables you to
question and discover variable data.”


Difference in Goals

The primary goal of reporting is to provide a standardized, high-level overview of key metrics and
performance indicators. It’s used to monitor the health of the business, track progress toward
goals, and identify areas that may require further investigation.

On the other hand, analytics is focused on exploring and understanding data in greater
detail to uncover insights and opportunities for improvement. The goal of analytics is to
identify patterns, relationships, and trends within the data that may not be immediately
visible in standard reports. Analytics tools are designed to provide users with the ability
to ask more complex questions, test hypotheses, and gain a deeper understanding of the
data.

Alina Clark of Cocodoc agrees and adds that “the goal of reporting is to change data from
its raw form, which is unintelligible and hard to understand, into an easy-to-visualize
format. The end result of any reporting system is to make the analysis as easy as possible.

At the same time, analytics churns through the data, draws out the problems, and provides
the solution while at it. Any data analysis that doesn’t look at the three stages (problems-
solutions-conclusions) fails to achieve the intended goals in most instances.”

Put simply, the goal of reporting is to organize and summarize data, while the purpose of
analytics is to interpret it and deliver actionable recommendations.

Differences in the Process

Building a report and preparing for data analytics both involve a different step-by-step
process.

To build a report, you need to:


• Outline the purpose of the report and the business requirement

• Gather relevant data from your different sources

• Translate the data into a format that can be analyzed and presented

• Develop and design a dashboard or report format that meets the needs of
the audience

• Present the data in a clear and concise manner

• Provide real-time reporting

• Share the report with your audience and gather feedback


As for data analytics, the steps involved are:


• Define a problem and develop a data hypothesis

• Collect and clean data from relevant sources

• Develop and test analytical models to read the data and extract insights

• Use trend and pattern analysis, and data visualization techniques to communicate
the results

• Make decisions and create strategies based on the insights and recommendations

7. Describe about modern data analytic tools?

1. Apache Hadoop:
1. Apache Hadoop is a Java-based, free, open-source software framework used as a big data
analytics tool.
2. It helps in the effective storage of a huge amount of data in a storage place known as
a cluster.
3. It runs in parallel on a cluster and also has the ability to process huge data across all
nodes in it.
4. There is a storage system in Hadoop popularly known as the Hadoop Distributed File
System (HDFS), which helps to split a large volume of data and distribute it
across the many nodes present in a cluster.
2. KNIME:
1. The KNIME analytics platform is one of the leading open solutions for data-
driven innovation.
2. This tool helps in discovering the potential hidden in a huge volume of data; it also
mines for fresh insights and predicts new futures.
3. OpenRefine:
1. OpenRefine is one of the most efficient tools for working on messy and large volumes
of data.
2. It includes cleansing data and transforming that data from one format to another.
3. It helps to explore large data sets easily.
4. Orange:
1. Orange is a well-known open-source data visualization tool that supports data analysis
for beginners as well as experts.
2. This tool provides interactive workflows with a large toolbox for building them, which
helps in analyzing and visualizing data.

5. RapidMiner:
1. The RapidMiner tool operates using visual programming and is capable of
manipulating, analyzing and modelling data.
2. RapidMiner makes data science teams more productive by providing an
open-source platform for all their jobs, such as machine learning, data preparation, and
model deployment.
6. R-programming:
1. R is a free open source software programming language and a software environment
for statistical computing and graphics.
2. It is used by data miners for developing statistical software and data analysis.
3. It has become a highly popular tool for big data in recent years.
7. Datawrapper:
1. It is an online data visualization tool for making interactive charts.
2. It uses data files in CSV, PDF or Excel format.
3. Datawrapper generates visualizations in the form of bar charts, line charts, maps, etc.
They can be embedded into any other website as well.
8. Tableau:
1. Tableau is another popular big data tool. It is simple and very intuitive to use.
2. It communicates the insights of the data through data visualization.
3. Through Tableau, an analyst can check a hypothesis and explore the data
before starting to work on it extensively.

8. Explain about sampling distributions?

Severe class imbalance between majority and minority classes in Big Data can bias the
predictive performance of Machine Learning algorithms toward the majority (negative) class.
Where the minority (positive) class holds greater value than the majority (negative) class and the
occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to
adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six
sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely
investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-
Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark
framework. The first case study is based on a Medicare fraud detection dataset. The second case
study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test
data from a separate source (POST dataset). Results from the Medicare case study are not
conclusive regarding the best sampling approach using Area Under the Receiver Operating Characteristic
Curve and Geometric Mean performance metrics. However, it should be noted that the Random
Undersampling approach performs adequately in the first case study. For the SlowlorisBig case
study, Random Undersampling convincingly outperforms the other five sampling approaches
(Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1 ,
SMOTE-borderline2 , ADAptive SYNthetic) when measuring performance with Area Under the
Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its
classification performance in both case studies, Random Undersampling is the best choice as it
results in models with a significantly smaller number of samples, thus reducing computational
burden and training time.
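As a concrete illustration of the Random Undersampling approach discussed above, here is a minimal R sketch on a synthetic data set. The column names, the 1:100 imbalance ratio and the 1:1 target ratio are illustrative assumptions, not taken from the case studies.

# Minimal sketch of Random Undersampling on a toy imbalanced data set.
set.seed(42)
n_pos <- 100; n_neg <- 10000
df <- data.frame(
  label = c(rep(1, n_pos), rep(0, n_neg)),
  x1    = c(rnorm(n_pos, mean = 2), rnorm(n_neg, mean = 0))
)
pos <- df[df$label == 1, ]
neg <- df[df$label == 0, ]
# Keep every minority (positive) sample; randomly draw an equal number of majority samples.
neg_sampled <- neg[sample(nrow(neg), nrow(pos)), ]
balanced <- rbind(pos, neg_sampled)
table(balanced$label)   # now 100 positives and 100 negatives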
9. Discuss briefly about resampling?

This work addresses the problem of low learning-algorithm accuracy caused by the serious
imbalance of big data in the Internet of Things, and proposes a bidirectional self-adaptive
resampling algorithm for imbalanced big data. Based on the sizes of data sets and imbalance ratios
inputted by the user, the algorithm processes the data using a combination of oversampling for the
minority class and distribution-sensitive undersampling for the majority class.

This paper proposes a new distribution-sensitive resampling algorithm. According to the
distribution of samples, the majority and minority samples are divided into different categories,
and different processing methods are adopted for samples with different distribution
characteristics. The algorithm makes the sample set after resampling keep the same
characteristics as the original data set as much as possible. The algorithm emphasizes the
importance of boundary samples, that is, the samples at the boundary between the majority and
minority classes are more important to the learning algorithm than other samples. The boundary
minority samples are copied, and the boundary majority samples are reserved. A real-world
application is introduced at the end, which shows that compared with existing imbalanced data
resampling algorithms, this algorithm improves the accuracy of the learning algorithm, especially
the accuracy and recall rate of the minority class.
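A minimal R sketch of the bidirectional idea (oversample the minority class with replacement, undersample the majority class). The toy data and the target size per class are illustrative assumptions; this is not the paper's distribution-sensitive algorithm itself.

# Bidirectional resampling: oversample minority with replacement, undersample majority.
set.seed(1)
df <- data.frame(
  label = c(rep(1, 200), rep(0, 5000)),        # imbalanced toy labels
  x1    = rnorm(5200)
)
target_per_class <- 1000                        # user-chosen size per class (assumption)
minority <- df[df$label == 1, ]
majority <- df[df$label == 0, ]
minority_up   <- minority[sample(nrow(minority), target_per_class, replace = TRUE), ]
majority_down <- majority[sample(nrow(majority), target_per_class), ]
resampled <- rbind(minority_up, majority_down)
table(resampled$label)                          # 1000 of each class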

10. What are the types of modern data analytics tools? Explain.


• Data management tools, such as Apache Hadoop, Cassandra, and Qubole, that
store and process large amounts of data.
• Data mining tools, such as KNIME, RapidMiner, and Wolfram Alpha, that
extract patterns and insights from data.

• Data visualization tools, such as Tableau Public, Google Fusion Tables, and
NodeXL, that present data in graphical or interactive forms.
• Data analysis techniques, such as in-memory analytics, predictive analytics, and text
mining, that apply algorithms and models to data.
11. Write in detail about statistical inference?

The need for new methods to deal with big data is a common theme in most scientific fields,
although its definition tends to vary with the context. Statistical ideas are an essential part of this,
and as a partial response, a thematic program on statistical inference, learning and models in big
data was held in 2015 in Canada, under the general direction of the Canadian Statistical Sciences
Institute, with major funding from, and most activities located at, the Fields Institute for
Research in Mathematical Sciences. This paper gives an overview of the topics covered,
describing challenges and strategies that seem common to many different areas of application
and including some examples of applications to make these challenges and strategies more
concrete.

Big data provides big opportunities for statistical inference, but perhaps even bigger challenges,
especially when compared with the analysis of carefully collected, usually smaller, sets of data.
From January to June 2015, the Canadian Statistical Sciences Institute organised a thematic
program on Statistical Inference, Learning and Models in Big Data. It became apparent within
the first two weeks of the program that a number of common issues arose in quite different
practical settings. This paper arose from an attempt to distil these common themes from the
presentations and discussions that took place during the thematic program.

Scientifically, the program emphasised the roles of statistics, computer science and mathematics
in obtaining scientific insight from big data. Two complementary strands were introduced:
cross- cutting, or foundational, research that underpins analysis, and domain-specific research
that focused on particular application areas. The former category included machine learning,
statistical inference, optimisation, network analysis and visualisation. Topic-specific workshops
addressed problems in health policy, social policy, environmental science, cyber-security and
social networks. These divisions are not rigid, of course, as foundational and application areas
are part of a feedback cycle in which each inspires developments in the other. Some very


important application areas where big data is fundamental were not able to be the subject of
focused workshops, but many of these applications did feature in individual presentations. The
program started with an opening conference that gave an overview of the topics of the 6-month
program. All the talks presented at the Fields Institute are available online through FieldsLive.
Technological advances enable us to collect more and more data—already in 2012 it was
estimated that data collection was growing at 50% per year (Lohr, 2012). While there is much
important research underway in improving the processing, storing and rapid accessing of records
(Chen et al., 2014; Schadt et al., 2010), this program emphasised the challenges for modelling,
inference and analysis.
Two striking features of the presentations during the program were the breadth of topics and
the remarkable commonalities that emerged across this broad range. We attempt here to
illustrate many of the common issues and solutions that arose and provide a picture of the status
of big data research in the statistical sciences. While the report is largely inspired by the series
of talks at the opening conference and related references, it also touches on issues that arose
throughout the program.
12. Explain about prediction error?

• In regression analysis, it’s a measure of how well the model predicts the
response variable.
• In classification (machine learning), it’s a measure of how well samples
are classified to the correct category.
Sometimes, the term is used informally to mean exactly what it means in plain English (you’ve
made some predictions, and there are some errors). In regression, the terms “prediction error” and
“residuals” are sometimes used synonymously. Therefore, check the author’s intent before
assuming they mean something specific (like the mean squared prediction error).

Mean Squared Prediction Error (MSPE)


MSPE summarizes the predictive ability of a model. Ideally, this value should be close to zero,
which means that your predictor is close to the true value. The concept is similar to Mean
Squared Error (MSE), which is a measure of how well an estimator estimates a parameter (or
how close a regression line is to a set of points). The difference is that while MSE measures an
estimator’s fit, the MSPE measures a predictor’s fit – that is, how well it predicts the true value.
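A minimal R sketch of these quantities using a simple linear model; the data, the 150/50 split and the model formula are illustrative assumptions.

# Fit on a training set, then measure squared prediction error on held-out data.
set.seed(7)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 3 + 2 * dat$x + rnorm(n, sd = 2)
train <- dat[1:150, ]
test  <- dat[151:200, ]
fit <- lm(y ~ x, data = train)
mse_train <- mean(residuals(fit)^2)            # in-sample fit (MSE)
pred <- predict(fit, newdata = test)
mspe <- mean((test$y - pred)^2)                # out-of-sample prediction error (MSPE)
rmse <- sqrt(mspe)                             # root-mean-square prediction error
c(MSE = mse_train, MSPE = mspe, RMSE = rmse)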
Quantifying Prediction Errors

Prediction error can be quantified in several ways, depending on where you’re using it. In
general, you can analyze the behavior of prediction error with bias and variance (Johari, n.d.).

In statistics, the root-mean-square error (RMSE) aggregates the magnitudes of prediction errors.
The Rao-Blackwell theory can estimate prediction error as well as improve the efficiency of
initial estimators.

In machine learning, cross-validation (CV) assesses prediction error and trains the prediction
rule. A second method, the bootstrap, begins by estimating the prediction rule’s sampling
distribution (or the sampling distribution’s parameters); it can also quantify prediction error and
other aspects of the prediction rule.
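A minimal sketch of k-fold cross-validation in base R for estimating the prediction error of the same kind of linear model; the number of folds and the toy data are illustrative assumptions.

# 5-fold cross-validation estimate of mean squared prediction error for lm().
set.seed(7)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 3 + 2 * dat$x + rnorm(n, sd = 2)
k <- 5
folds <- sample(rep(1:k, length.out = n))       # random fold assignment
fold_mspe <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = dat[folds != i, ])   # train on k-1 folds
  pred <- predict(fit, newdata = dat[folds == i, ])
  mean((dat$y[folds == i] - pred)^2)            # error on the held-out fold
})
mean(fold_mspe)                                 # cross-validated prediction error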


Unit-II
Answer all the questions
Part-A
1. What is meant by probabilistic techniques?

Probabilistic techniques use data structures that give approximate answers with mathematically
bounded error while using far less memory and time than exact methods. Examples of probabilistic
data structures are as follows: membership query (Bloom filter, counting Bloom filter, quotient
filter, cuckoo filter); cardinality (linear counting, probabilistic counting, LogLog, HyperLogLog,
HyperLogLog++); frequency (count sketch, count-min sketch); similarity (LSH, MinHash,
SimHash), and others.
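A minimal sketch of one such structure, a Bloom filter, in base R. The bit-array size and the two toy hash functions are illustrative assumptions chosen for readability rather than for good false-positive rates.

# Toy Bloom filter: a bit vector plus two simple hash functions.
m <- 64                                          # number of bits (assumption)
bits <- rep(FALSE, m)
# Two crude string hashes based on character codes (illustrative only).
h1 <- function(s) (sum(utf8ToInt(s)) %% m) + 1
h2 <- function(s) (sum(utf8ToInt(s) * seq_along(utf8ToInt(s))) %% m) + 1
# Add elements by setting their hashed bit positions.
for (w in c("hadoop", "spark", "kafka")) bits[c(h1(w), h2(w))] <- TRUE
# Membership query: all bits set means "probably present"; any bit clear means "definitely absent".
maybe_contains <- function(s) all(bits[c(h1(s), h2(s))])
maybe_contains("spark")    # TRUE (present, or possibly a false positive)
maybe_contains("storm")    # FALSE means definitely absent; TRUE would be a false positive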

2. What is meant by traveling salesman problem?

Big data is one of the most discussed topics in business and information technology today, and
most research fields are moving toward using big data tools to leverage the huge amounts of data
available. The traveling salesman problem is one of the problems whose search space grows
factorially (n!) with the input size; therefore it is important to find algorithms that can solve
instances with a large number of cities in feasible time and within the available memory.

This article introduces two proposed algorithms that solve the traveling salesman problem by
first clustering the points using one of three methods – k-means, Gaussian Mixture Model, and
Self-Organizing Map – and selecting the best of them for the proposed algorithms. The proposed
algorithms arrange the cities (points) into chromosomes for a Genetic Algorithm after clustering
the big data, reducing the problem and solving each cluster separately based on the divide-and-
conquer concept. The two proposed algorithms were tested on different numbers of points; the
nearest-points algorithm solved a traveling salesman problem with 2 million points.
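A minimal R sketch of the divide-and-conquer idea: cluster the cities first, then solve each cluster separately, here with a simple nearest-neighbour heuristic instead of a full Genetic Algorithm. The point count, the number of clusters and the heuristic are illustrative assumptions, not the article's algorithm.

# Cluster the cities, then build a cheap tour inside each cluster (divide and conquer).
set.seed(3)
cities <- matrix(runif(2000), ncol = 2)            # 1,000 random cities in the unit square
clusters <- kmeans(cities, centers = 10)$cluster   # step 1: k-means clustering
nearest_neighbour_tour <- function(pts) {          # step 2: cheap tour within one cluster
  n <- nrow(pts)
  visited <- 1
  remaining <- seq_len(n)[-1]
  while (length(remaining) > 0) {
    last <- pts[tail(visited, 1), ]
    d <- sqrt((pts[remaining, 1] - last[1])^2 + (pts[remaining, 2] - last[2])^2)
    nxt <- remaining[which.min(d)]
    visited <- c(visited, nxt)
    remaining <- setdiff(remaining, nxt)
  }
  visited                                          # visiting order of the cluster's cities
}
tours <- lapply(split(seq_len(nrow(cities)), clusters),
                function(idx) idx[nearest_neighbour_tour(cities[idx, , drop = FALSE])])
length(tours)                                      # one partial tour per cluster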

3.Define stochastic?

Stochastic means governed by randomness; a stochastic method or process involves random
variables. The field has evolved in response to other developments in statistics, notably time series
and sequential analysis, and to applications in artificial intelligence, economics and engineering.
Its resurgence in the big data era has led to new advances in both the theory and applications of
this microcosm of statistics and data science.

4.Write about the classification of stochastic processes?


Stochastic processes have four main types – non-stationary stochastic processes, stationary
stochastic processes, discrete-time stochastic processes, and continuous-time stochastic processes.

5.What are the types of stochastic processes?

Stochastic processes can be grouped into various categories based on their mathematical
properties. The most common types of stochastic processes include:

• Random walks
• Martingales
• Markov processes
• Lévy processes
• Gaussian processes
Other types of stochastic processes include non-stationary stochastic processes, stationary
stochastic processes, discrete-time stochastic processes, and continuous-time stochastic
processes

6.What is data streaming?

Data streaming is the process of transmitting, ingesting, and processing data continuously
rather than in batches. It is used to deliver real-time information to users and help them make
better decisions. Big data streaming is a process in which large streams of real-time data are
processed to extract insights and useful trends out of it. Data streaming is a key capability for
organizations that need to act on data as soon as it arrives.

7.What is meant by event driven architecture?

Data Lakes have evolved from the batch-based, large scale ingestion platforms to becoming
event-driven as the need for data “now” becomes more and more important. Capturing all
enterprise and external data in one place is now a commodity service. Doing that, plus
providing the data up to the hour and even minute it’s available is the new capability that
enterprises are targeting to continue identifying insights and monetizing their data capabilities.

But the prospect of building out a real-time architecture can be quite overwhelming for the
enterprise with no past experience in this space. Or even for those who do, but are working to
upgrade their technology stack and make a pivot to more modern tools. Below are some of the


considerations one should make when looking to make the move to a real-time or event-driven
architecture.

8. Write a short note on heavy intermediate processing?


Big data has shown phenomenal growth over the past decade, and its widespread application by
businesses as a growth catalyst continues to deliver positive results. The scale of data is
massive, and the volume, velocity and variety of data call for more efficient processing to make
it machine-ready. Although there are a multitude of ways to extract data, such as public
APIs, custom web scraping services, internal data sources, etc., there will always remain the
need to do some pre-processing to make the data perfectly suitable for business applications.


Pre-processing of data involves a set of key tasks that demand extensive computational
infrastructure and this in turn will make way for better results from your big data strategy.
Moreover, cleanliness of the data would determine the reliability of your analysis and this should
be given high priority while plotting your data strategy.

Data pre-processing techniques

Since the extracted data tend to be imperfect, with redundancies and errors, data pre-processing
techniques are an absolute necessity. The bigger the data sets, the more complex the mechanisms
needed to process them before analysis and visualization. Pre-processing prepares the data and
makes the analysis feasible while improving the effectiveness of the results.
Following are some of the crucial steps involved in data pre-processing.

Data cleansing

Cleansing the data is usually the first step in data processing and is done to remove the
unwanted elements as well as to reduce the size of the data sets, which will make it easier for the
algorithms to analyze it. Data cleansing is typically done by using instance reduction techniques.

Instance reduction helps reduce the size of the data set without compromising the quality of
insights that can be extracted from the data. It removes instances and generates new ones to
make the data set compact. There are two major instance reduction algorithms:

Instance selection:

Instance selection is used to identify the best examples from a very large data set with many
instances in order to curate them as the input for the analytics system. It aims to select a subset of
the data that can act as a replacement for the original data set while completely fulfilling the
goal. It will also remove redundant instances and noise.

Instance generation:
Instance generation methods involve replacing the original data with artificially generated data
in order to fill regions in the domain of an issue with no representative examples in the master
data. A common approach is to relabel examples that appear to belong to wrong class labels.


Instance generation thus makes the data clean and ready for the analysis algorithm.

Tools you can use: Drake, DataWrangler, OpenRefine
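A minimal R sketch of instance reduction on a toy data frame: removing exact duplicates and then selecting a class-balanced subset of instances. The data, column names and subset size are illustrative assumptions, and this simple random selection stands in for a full instance-selection algorithm.

# Toy instance reduction: deduplicate, then keep a small stratified subset per class.
set.seed(11)
df <- data.frame(
  feature = round(rnorm(1000), 1),
  label   = sample(c("a", "b"), 1000, replace = TRUE)
)
df <- df[!duplicated(df), ]                        # drop exact duplicate instances
keep_per_class <- 50                               # illustrative subset size
selected <- do.call(rbind, lapply(split(df, df$label), function(grp) {
  grp[sample(nrow(grp), min(keep_per_class, nrow(grp))), ]
}))
nrow(selected)                                     # reduced data set ready for analysis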

Data normalization

Normalization improves the integrity of the data by adjusting the distributions. In simple words,
it normalizes each row to have a unit norm. The norm is specified by parameter p which
denotes the p-norm used. Some popular methods are:

StandardScaler: Standardizes each feature so that it has zero mean and unit standard deviation.

MinMaxScaler: Uses two parameters to normalize each feature to a specific range – upper and
lower bound.

ElementwiseProduct: Uses a scalar multiplier to scale every feature.

Tools you can use: Table analyzer, BDNA
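A minimal R sketch of the two most common scalings named above, applied to a toy feature. The data are an illustrative assumption, and the expressions mirror the idea of StandardScaler and MinMaxScaler rather than any specific library's implementation.

# Standardization (zero mean, unit variance) and min-max scaling of a numeric feature.
x <- c(12, 7, 30, 18, 25, 9)
standardized <- (x - mean(x)) / sd(x)              # StandardScaler-style scaling
min_max <- (x - min(x)) / (max(x) - min(x))        # MinMaxScaler-style scaling to [0, 1]
round(standardized, 2)
round(min_max, 2)
# Base R's scale() gives the same standardization: as.numeric(scale(x))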

Data transformation

If a data set happens to be too large in the number of instances or predictor variables, the
dimensionality problem arises. This is a critical issue that obstructs the functioning of most
data mining algorithms and increases the cost of processing. There are two popular methods for
data transformation by dimensionality reduction – Feature Selection and Space Transformation.

Feature selection: It is the process of spotting and eliminating as much unnecessary information
as possible. FS can be used to significantly reduce the probability of accidental correlations in
learning algorithms that could degrade their generalization capabilities. FS will also cut the
search space occupied by features, thus making the process of learning and mining faster. The
ultimate goal is to derive a subset of features from the original problem that describes it well.

Space transformations: Space transformations work similarly to feature selection. However,
instead of selecting the most valuable features, space transformation techniques create a new
set of features by combining the original ones. Such a combination can be made to obey
certain criteria. Space transformation techniques ultimately aim to exploit non-linear relations
among the variables.

Tools you can use: Talend, Pentaho
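A minimal R sketch of both ideas on a toy data set: dropping near-constant and highly correlated features (a simple form of feature selection), followed by principal component analysis as a space transformation. The data and the thresholds are illustrative assumptions.

# Dimensionality reduction: crude feature selection followed by PCA (space transformation).
set.seed(5)
X <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100, sd = 0.001))
X$d <- X$a + rnorm(100, sd = 0.01)   # feature almost identical to 'a'
# Feature selection: drop near-constant features, then one of each highly correlated pair.
X1 <- X[, sapply(X, sd) > 0.01]
corr <- abs(cor(X1))
drop <- colnames(X1)[apply(upper.tri(corr) & corr > 0.95, 2, any)]
X2 <- X1[, setdiff(colnames(X1), drop)]
# Space transformation: project the remaining features onto principal components.
pc <- prcomp(X2, scale. = TRUE)
summary(pc)                          # variance explained by each new combined feature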

Missing values imputation

One of the common assumptions with big data is that the data set is complete. In fact, most data
sets have missing values that are often overlooked. Missing values are data points that haven’t been
extracted or stored due to budget restrictions, a faulty sampling process or other limitations in the
data extraction process. Missing values are not something to be ignored, as they could skew your
results.
Fixing the missing values issue is challenging. Handling it without utmost care could easily lead
to complications in data handling and wrong conclusions.
There are some relatively effective approaches to tackle the missing values problem. Discarding
the instances that might contain missing values is the common one but it’s not very effective as
it could lead to bias in the statistical analyses. Apart from this, discarding critical information is
not a good idea. A better and more effective method is to use maximum likelihood procedures to
model the probability functions of the data while also considering the factors that could have
induced the missingness. Machine learning techniques are so far the most effective solution to
the missing values problem.
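
For a concrete picture of missing values imputation, here is a minimal sketch with scikit-learn's SimpleImputer; using the column mean is an illustrative choice, and the model-based or maximum likelihood approaches mentioned above are usually preferable for serious analyses.

Python

# Minimal sketch of missing-value imputation using the per-column mean.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 7.0],
              [np.nan, 8.0],
              [3.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)   # NaNs replaced by the mean of their column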
Noise identification

Data gathering is not always perfect, but data mining algorithms usually assume it to be. Since
data with noise can seriously affect the quality of the results, tackling this issue is crucial.
Noise can affect the input features, the output, or both. Noise found in the input is called
attribute noise, whereas noise that creeps into the output is referred to as class noise. If noise is
present in the output, the issue is very serious and the bias in the results will be high.
There are two popular approaches to removing noise from data sets. If the noise has affected
the labelling of instances, data polishing methods are used to eliminate it. The other method uses
noise filters that identify and remove noisy instances from the data; this does not require
modification of the data mining technique.
Minimizing the pre-processing tasks

Preparing the data for your data analysis algorithm can involve many more processes depending
on the application's unique demands. However, basic processes like cleansing, deduplication and
normalization can be avoided in most cases if you choose the right source for data extraction. It's
highly unlikely that a raw source will give you clean data. As far as web data extraction is
concerned, a managed web scraping service like PromptCloud can give you clean data that is
ready to be plugged into your analytics system.

9. What is ensemble analysis related to a large volume of data?

Ensemble analysis

Ensemble data analysis, roughly termed multi-data-set analysis or multi-algorithm analysis, is
performed on the whole data set or a large volume of data. Big data are argued to be the whole
data set, collected without any particular sampling purpose. What is the whole set?
Approximately, it may include resampled data, labeled and unlabeled data, and prior and
posterior data. It is known that the term “ensemble” appears at least in the context of ensemble
learning in machine learning, ensemble systems in statistical mechanics, and ensemble Kalman
filtering in data assimilation.
10.Shortly discuss about association analysis related to unknown data sampling?
Usually, big data are collected without special sampling strategies. Normally, data producers are
quite different from data users, so the cause-effect relations hidden in observational data are not
clear to specific data users. Set theory, i.e., the theory about members and their relations in a set,
is general enough to deal with data analysis and problem solving. In a sense, the relation among
set members corresponds to the association in big data. Association analysis is critical to
multi-source, multi-type, and multi-domain data analysis. Typically, association analysis is
exemplified by association rule algorithms in data mining (Agrawal and Srikant 1994), data
association in target tracking, and link analysis in networks.

11. What is structured data?

• Definition: Structured data is highly organized and follows a specific, predefined format.
It is typically found in relational databases and consists of rows and columns.
• Examples: Examples include data stored in SQL databases, spreadsheets, and CSV files.
Structured data can represent customer records, sales transactions, financial data,
and more.
• Characteristics:
• Data is organized into tables with well-defined schemas.
• Easy to query and analyze using SQL or similar query languages.
• Suitable for traditional business intelligence and reporting.


12.What is semi structured data?

Semi-Structured Data:
• Definition: Semi-structured data is partially organized but does not conform to a rigid schema.
It may contain tags or markers that separate elements and impose some hierarchy.
• Examples: JSON (JavaScript Object Notation), XML (eXtensible Markup Language), and
NoSQL databases.
• Characteristics:
• Flexible and adaptable to evolving data structures.
• Allows for nested or hierarchical data representation.
• Requires specialized tools for querying and analysis, often using languages like JSONPath or XPath.

13.What is unstructured data?

Unstructured Data:
• Definition : Unstructured data lacks a predefined structure or format and is typically not
organized in a database-like manner. It includes text, images, videos, audio, and more.
• Examples: Social media posts, emails, documents (e.g., PDFs and Word documents),
multimedia content, and sensor data are all examples of unstructured data.
• Characteristics:
• No fixed schema or format, making it challenging to analyze using traditional methods.
• Requires advanced techniques like natural language processing (NLP) and
machine learning for analysis.
• Valuable for sentiment analysis, content categorization, and image recognition.
14.What is data visualization techniques?

• Charts and Graphs: Visual representations like bar charts, line graphs, scatter plots, and
pie charts are commonly used to display patterns and trends in data.
• Heatmaps: Heatmaps use color intensity to represent data values, making it easy to
identify hotspots or concentration in large datasets.

• Geospatial Visualization: Mapping data onto geographic maps to reveal spatial patterns,
like geographical information systems (GIS) and location-based data.
• Treemaps: Treemaps display hierarchical data structures, such as folder structures or
organizational hierarchies, using nested rectangles.

15.What is image analysis techniques?

• Image Processing: Techniques like image filtering, segmentation, and edge detection are
used to process and enhance visual data.
• Object Detection: Identifying and locating objects within images or videos, often using
deep learning models like Convolutional Neural Networks (CNNs).
• Image Classification: Categorizing images into predefined classes or labels, commonly
used in applications like content moderation and image tagging.
• Image Recognition: Going beyond classification to recognize specific objects, scenes, or
patterns within images.
16.What is video analysis techniques?

• Video Summarization: Creating concise summaries of long videos by selecting key
frames or moments of interest.
• Object Tracking: Monitoring and tracking the movement of objects within videos,
critical for surveillance and object behavior analysis.
• Action Recognition: Identifying and categorizing human actions or movements in video
sequences.
• Video Analytics: Combining video data with other contextual data sources to gain deeper
insights, such as analyzing customer behavior in retail stores.
17.What is text and document visualization techniques?

• Word Clouds: Visual representations of word frequency, where word size
corresponds to its occurrence in the text.
• Topic Modeling: Identifying themes or topics within a collection of
text documents and visualizing their relationships.
• Sentiment Analysis Visualization: Representing sentiment scores and emotional
tones in text data.
• Document Clustering: Grouping similar documents together for easier
exploration and analysis.
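
To make the word cloud idea concrete, the sketch below computes the word frequencies that a word-cloud renderer would map to font sizes. The sample text is made up; in practice the input would be a large corpus of documents or comments.

Python

# Computing the word frequencies that drive a word cloud.
import re
from collections import Counter

text = "great service great staff terrible wait long wait great prices"
words = re.findall(r"[a-z']+", text.lower())
freqs = Counter(words)

# A word-cloud renderer maps frequency to font size; here we just print the counts.
for word, count in freqs.most_common(5):
    print(word, count)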


18.What is network visualization techniques?

• Graph Visualization: Visualizing networks and relationships between entities using
node-link diagrams or matrix representations.
• Social Network Analysis: Analyzing and visualizing connections and interactions in
social networks.
• Flowcharts and Sankey Diagrams: Representing complex processes and the flow of
information or resources.
19.What is temporal data visualization techniques?

• Time Series Plots: Visualizing data changes over time, useful for monitoring trends and
patterns.
• Gantt Charts: Displaying project timelines and scheduling information.
• Calendar Heatmaps: Visualizing data patterns across days, weeks, or months on
a calendar grid.
20.What is interactive and dynamic visualization?

• Interactive Dashboards: Building interactive visualizations and dashboards that allow
users to explore data and make real-time decisions.
• Animation: Creating animations to show changes in data over time, which can aid in
storytelling and understanding dynamic data.


Part-B
1. Explain about adaptive search by evaluation?
Random search algorithms are very useful for simulation optimization, because they are
relatively easy to implement and typically find a “good” solution quickly. One drawback is that
strong convergence results to a global optimum require strong assumptions on the structure of
the problem.

This chapter begins by discussing optimization formulations for simulation optimization that
combines expected performance with a measure of variability, or risk. It then summarizes
theoretical results for several adaptive random search algorithms (including pure adaptive search,
hesitant adaptive search, backtracking adaptive search and annealing adaptive search) that
converge in probability to a global optimum on ill-structured problems. More importantly, the
complexity of these adaptive random search algorithms is linear in dimension, on average.
While it is not possible to exactly implement stochastic adaptive search with the ideal linear
performance, this chapter describes several algorithms utilizing a Markov chain Monte Carlo
sampler known as hit-and-run that approximate stochastic adaptive search. The first optimization
algorithm discussed that uses hit-and-run is called improving hit-and-run, and it has polynomial
complexity, on average, for a class of convex problems. Then a simulated annealing algorithm
and a population based algorithm, both using hit-and-run as the candidate point generator, are
described. A variation to hit-and-run that can handle mixed continuous/integer feasible regions,
called pattern hit-and-run, is described. Pattern hit-and-run retains the same convergence results
to a target distribution as hit-and-run on continuous domains. This broadly extends the class of
the optimization problems for these algorithms to mixed continuous/integer feasible regions.
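
To give a feel for the hit-and-run idea, here is a simplified sketch of improving hit-and-run on a box-constrained continuous problem. It is not the exact algorithm from the literature: the objective function, the box bounds and the iteration budget are illustrative assumptions, and real implementations handle general convex feasible regions.

Python

# Simplified sketch of improving hit-and-run on a box-constrained problem.
import numpy as np

def improving_hit_and_run(f, lo, hi, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi)                 # random feasible starting point
    best = f(x)
    for _ in range(n_iter):
        d = rng.normal(size=x.shape)
        d /= np.linalg.norm(d)              # random direction on the unit sphere
        # Feasible step interval [t_min, t_max] such that lo <= x + t*d <= hi.
        with np.errstate(divide="ignore"):
            t1, t2 = (lo - x) / d, (hi - x) / d
        t_min = np.max(np.minimum(t1, t2))
        t_max = np.min(np.maximum(t1, t2))
        t = rng.uniform(t_min, t_max)       # uniform point on the feasible chord
        candidate = x + t * d
        value = f(candidate)
        if value < best:                    # keep only improving points
            x, best = candidate, value
    return x, best

# Example: minimize a simple quadratic over the box [-5, 5]^3.
f = lambda v: float(np.sum((v - 1.5) ** 2))
x_best, f_best = improving_hit_and_run(f, lo=np.full(3, -5.0), hi=np.full(3, 5.0))
print(x_best, f_best)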

2. What is meant by evaluation strategies?

In this age of technology, big data has become prominent as the world's new currency. The term
big data does not refer to a particular framework, language or technology; it is essentially a
problem statement. In the current era, a huge number of IoT-enabled devices generate data, and
this data arrives from different sources in enormous amounts. As data increases exponentially
every year, traditional systems for storing and processing it have become incapable of handling
it. In this digital world, data is generated automatically by the online interactions of big data
applications. Big data is used in the evaluation of emerging forms of information. In the last two
years, data has grown at an enormous speed compared with the previous twenty years, and
human life has become heavily dependent on IoT. This section presents the overall changes and
growth in big data analytics evaluation in recent years. Innovations in technology and the greater
affordability of digital devices with internet access have created a new world of data called big
data. The data captured by enterprises, driven by the rise of IoT and multimedia, has produced an
overwhelming amount of data in either structured or unstructured format. Data that is too big to
process is also too big to transfer anywhere, so it is the analytical program that needs to be moved
to the data (not the data to the program), and this is only possible with cloud computing.


3. Describe about genetic algorithm?

Big data analysis using the genetic algorithm: The field of information theory refers to big data
as datasets whose rate of increase is exponentially high, so that in a small span of time it becomes
very painful to analyze them using typical data mining tools. Such data sets result from the daily
capture of stock exchange data, credit card usage trends, insurance cross-line capture, health care
services, etc. In real time these data sets keep increasing and, with the passage of time, create
complex scenarios. Thus typical data mining tools need to be empowered by computationally
efficient and adaptive techniques to increase their degree of efficiency. Using GA over data
mining creates robust, computationally efficient and adaptive systems. In the past there have
been several studies on data mining using statistical techniques; the statistics that have
contributed heavily are ANOVA, ANCOVA, the Poisson distribution, random indicator
variables, etc. The biggest drawback of any statistical tactic lies in its tuning: with the exponential
explosion of data, this tuning takes more and more time and inversely affects the throughput.
Also, due to their static nature, complex hidden patterns are often left out. The idea here is to use
genes to mine out data with great efficiency, and to show how this mined data can be effectively
used for different purposes. Rather than sticking to the general notion of probabilities, the concept
of expectations is used, with the theory of expectations modified to achieve the desired results.
Any data set comprises three main components: the constants, the variables and the variants. The
constants comprise data that practically remains unaltered in a given span of time. The variables
change with time, while in the case of variants it is not clear whether they will behave as
constants or variables. So, taking this as the first step, we have three sets, each containing the
respective data as stated. We then calculate the expectancy of each datum inside the data set.
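
As a concrete illustration of how a genetic algorithm searches, the sketch below evolves a population of bit strings on a toy "one-max" objective. It is a generic GA skeleton rather than the expectation-based approach described above; the population size, mutation rate and fitness function are illustrative assumptions.

Python

# Generic genetic-algorithm skeleton on bit strings (toy "one-max" fitness).
import random

GENES, POP, GENERATIONS = 30, 50, 100
fitness = lambda ind: sum(ind)                     # maximize the number of 1s

def crossover(a, b):
    cut = random.randint(1, GENES - 1)             # single-point crossover
    return a[:cut] + b[cut:]

def mutate(ind, rate=0.02):
    return [g ^ 1 if random.random() < rate else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]               # truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

print(fitness(max(population, key=fitness)))       # approaches GENES as evolution proceeds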

4. Discuss briefly about genetic programming?

A new algorithm called multi-objective genetic programming (MOGP) for complex civil
engineering systems. The proposed technique effectively combines the model structure selection
ability of a standard genetic programming with the parameter estimation power of classical
regression, and it simultaneously optimizes both the complexity and goodness-of-fit in a system
through a non-dominated sorting algorithm. The performance of MOGP is illustrated by
modeling a complex civil engineering problem: the time-dependent total creep of concrete. A
Big Data is used for the model development so that the proposed concrete creep model—referred
to as a “genetic programming based creep model” or “G-C model” in this study—is valid for
both normal and high strength concrete with a wide range of structural properties. The G-C
model is then compared with currently accepted creep prediction models. The G-C model
obtained by MOGP is simple, straightforward to use, and provides more accurate predictions
than other prediction models.
Introduction

Different techniques can be used for modeling nonlinear systems in structural engineering, and
the models obtained from these techniques can be broadly categorized into two groups:
phenomenological (or knowledge-based) and behavioral. Phenomenological models consider the
physical laws governing the system (such as energy, momentum, etc.). In these models, the
structure of the system should be selected by the model developer based on the physical laws,
which requires prior knowledge about the system. Due to the complexity of many structural
engineering systems/phenomena (such as modeling of concrete shrinkage and creep), it is not
always possible to derive such models. In contrast to phenomenological models, behavioral
models can be easily developed by finding the relationships between input variables and outputs
for a set of experimental data without considering the physical theories. For developing

behavioral models, no prior knowledge is needed about the mechanism or fundamental theory
that produced the experimental data. Therefore, behavioral modeling techniques can be used for
approximate modeling of many structural engineering systems [1], [2].

While behavioral models can be advantageous, many behavioral models require the user to pre-
specify/hypothesize the formulation structure of the model. In other words, behavioral
techniques optimize the unknown coefficients of a pre-defined formulation structure. In
particular, regression analysis is a commonly used technique for developing behavioral models.
Although this technique can be used for developing both linear and nonlinear models, it has a
strong sensitivity to outliers and can exhibit large model errors due to the idealization of
complex processes, approximation, and averaging widely varying prototype conditions [3], [4].
Furthermore, for linear regressions, the least square estimate of unknown parameters can be
obtained analytically, while nonlinear regressions typically use an iterative optimization
procedure to estimate the unknown parameters, which requires the user to provide starting
values. Failure in defining the appropriate starting values can lead to convergence problems or
finding the local minimum rather than a global minimum in the optimization process. Therefore,
using traditional techniques such as regression analysis cannot guarantee that a reliable and
accurate behavioral model will be obtained, particularly for complex nonlinear engineering
systems.

In recent years, more advanced computer-aided pattern-recognition and data-classification


techniques, such as artificial neural networks (ANNs) and support vector machines (SVMs),
have been used to develop behavioral models in various civil engineering problems (e.g., [5],
[6], [7], [8]). ANN discovers patterns and approximates relationships in data based on a
supervised learning algorithm, a form of regression that relies on the inputs and outputs of a
training set [9].

Although ANNs are generally successful in prediction, they are only appropriate to use as part of
a computer program, not for the development of practical prediction equations. In addition, ANN
requires data to be initially normalized based on the suitable activation function and the best
network architecture to be determined by the user, and it can have a complex structure and a high
potential for over-fitting [10]. SVMs, on the other hand, are one of the efficient kernel-based
methods that can solve a convex constrained quadratic programming (CCQP) problem to find a
set of parameters. However, selecting the appropriate kernel in SVM can be a challenge, and the
results are not transparent [11].

One powerful technique for developing nonlinear behavioral models in the case of complex
optimization problems is genetic programming (GP) [12]. GP is a specialized subset of genetic
algorithms (GAs) [13], which are based on the principles of genetics and natural selection. GP
and its variants have been successfully used for solving a number of different civil engineering
problems (e.g., [14], [15]). Multi-gene genetic programming (MGGP) is a robust variant of GP
that combines the ability of the standard GP in constructing the model structure with the
capability of traditional regression in parameter estimation. In this technique, each symbolic
model (and each member of the GP population) is a weighted linear combination of low order
non-linear transformations of the input variables. In contrast to standard symbolic regression,
MGGP allows the evolution of accurate and relatively compact mathematical models. Even
when large numbers of input variables are used, this technique can automatically select the most
contributed variables in the model, formulate the structure of the model, and solve the
coefficients in the regression equation [16], [17], [18], [19]. Therefore, unlike other techniques
such as traditional regression analysis or ANN, there is no need in the MGGP technique for the
user to pre-define the formulation structure of the model or select any existing form of the
relationship for optimization [3], [4], which makes it more practical for complex optimization

problems. Recent studies also show that compared to other novel computer-based techniques

such as SVM and particle swarm model selection, GP shows better performance in problems
having high dimensionality and large training sets [20].

Typically, standard GP algorithms (including MGGP) will optimize only one objective in the
model development process: maximizing the goodness-of-fit to the training data. The main
drawback of using a single objective in the optimization process is that the developed models can
become overly complex. In other words, minimizing the complexity of the developed models
should be another important objective to be considered. In this study, a new algorithm called
multi-objective genetic programming (MOGP) is developed. MOGP is an extension of standard
GP algorithms that can simultaneously solve for two competing objectives (i.e. maximizing the
goodness-of-fit and minimizing the model complexity). By performing multi-gene symbolic
regression via MOGP, one can develop parsimonious and accurate data-based models for
complex engineering systems.
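
To make the two competing objectives concrete, the sketch below applies a minimal Pareto (non-dominated) filter to candidate models scored by error and complexity, which is the heart of the non-dominated sorting step mentioned above. The candidate scores are made-up numbers; MOGP itself additionally evolves the model structures.

Python

# Minimal Pareto filter for two objectives to be minimized:
# model error (1 - goodness-of-fit) and model complexity.
def dominates(a, b):
    """a dominates b if it is no worse in both objectives and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]

# (error, complexity) pairs for hypothetical candidate models
models = [(0.10, 12), (0.08, 20), (0.15, 5), (0.09, 25), (0.12, 8)]
print(pareto_front(models))   # (0.09, 25) is dominated by (0.08, 20) and drops out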

5. Elaborately explain about visualization?

Big data visualization techniques — charts, maps, interactive content, infographics, motion
graphics, scatter plots, regression lines, timelines, for example — enable companies' decision-
makers to get results by better understanding their processes and stakeholders. Visualization
software supports multiple sources and high volumes of raw data to provide instant analysis of
facts, trends, and patterns. Big data visualization is a remarkably powerful business capability.

According to IBM, every day 2.5 quintillion bytes of data are created from social media,
sensors, webpages, and all kinds of management systems, and businesses use this data to control
their processes.

By revealing correlations between thousands of variables available in the big data world, these
technologies can present massive amounts of data in an understandable way, which means big
data visualization initiatives combine IT and management projects.

6. Explain about classification of visual data analysis techniques?

These types include:
• Temporal: data is linear and one-dimensional
• Hierarchical: visualizes ordered groups within a larger group
• Network: visualizes the connections between datasets
• Multidimensional: the contrast of the temporal type
• Geospatial: involves geospatial or spatial maps
• Miscellaneous: other types of visualizations
There is no clear consensus on the boundaries between these fields, but broadly speaking,
scientific visualization deals with data that has a natural geometric structure, while information
visualization handles abstract, non-spatial data.

7. What is meant by data types? Explain it.

Types of Big Data


Structured Data

• Structured data can be crudely defined as the data that resides in a fixed field within
a record.
• It is the type of data most familiar to us in everyday life, for example: birthday, address.
• A certain schema binds it, so all the data has the same set of properties. Structured
data is also called relational data. It is split into multiple tables to enhance the
integrity of the data by creating a single record to depict an entity. Relationships are
enforced by the application of table constraints.
• The business value of structured data lies within how well an organization can
utilize its existing systems and processes for analysis purposes.

Sources of structured data


A Structured Query Language (SQL) is needed to bring the data together. Structured data is
easy to enter, query, and analyze. All of the data follows the same format. However, forcing a
consistent structure also means that any alteration of the data is difficult, as each record has to be
updated to adhere to the new structure. Examples of structured data include numbers, dates,
strings, etc. The business data of an e-commerce website can be considered to be structured
data.
Name     Class   Section   Roll No   Grade
Geek 1   11      A         1         A
Geek 2   11      A         2         B
Geek 3   11      A         3         A
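
To show how structured data like the table above is queried, the sketch below loads the same rows into an in-memory SQLite database with Python and runs a simple SQL query; the table and column names are taken from the example, everything else is illustrative.

Python

# Loading the structured-data example above into SQLite and querying it with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, class INTEGER, section TEXT, roll_no INTEGER, grade TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [("Geek 1", 11, "A", 1, "A"), ("Geek 2", 11, "A", 2, "B"), ("Geek 3", 11, "A", 3, "A")],
)

# SQL makes querying structured data straightforward.
for row in conn.execute("SELECT name, roll_no FROM students WHERE grade = 'A'"):
    print(row)        # ('Geek 1', 1) and ('Geek 3', 3)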

Cons of Structured Data


1. Structured data can only be leveraged in cases of predefined functionalities. This
means that structured data has limited flexibility and is suitable for certain specific
use cases only.
2. Structured data is stored in a data warehouse with rigid constraints and a definite
schema. Any change in requirements would mean updating all of that structured
data to meet the new needs. This is a massive drawback in terms of resource and
time management.

Semi-Structured Data

• Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly organized into
rows and columns like that in a spreadsheet. However, there are some features like
key-value pairs that help in discerning the different entities from each other.
• Since semi-structured data doesn’t need a structured query language, it is commonly
called NoSQL data.
• A data serialization language is used to exchange semi-structured data across
systems that may even have varied underlying infrastructure.
• Semi-structured content is often used to store metadata about a business process but
it can also include files containing machine instructions for computer programs.
• This type of information typically comes from external sources such as social media
platforms or other web-based data feeds.


Semi-Structured Data
Data is created in plain text so that different text-editing tools can be used to draw valuable
insights. Due to a simple format, data serialization readers can be implemented on hardware
with limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write memory-based data in files, transit,
store, and parse. The sender and the receiver don’t need to know about the other system. As
long as the same serialization language is used, the data can be understood by both systems
comfortably. There are three predominantly used Serialization languages.

1. XML– XML stands for eXtensible Markup Language. It is a text-based markup language
designed to store and transport data. XML parsers can be found in almost all popular
development platforms. It is human and machine-readable. XML has definite standards for
schema, transformation, and display. It is self-descriptive. Below is an example of a
programmer’s details in XML.
XML

<ProgrammerDetails>
<FirstName>Jane</FirstName>
<LastName>Doe</LastName>
<CodingPlatforms>
<CodingPlatform Type="Fav">GeeksforGeeks</CodingPlatform>
<CodingPlatform Type="2ndFav">Code4Eva!</CodingPlatform>
<CodingPlatform Type="3rdFav">CodeisLife</CodingPlatform>
</CodingPlatforms>
</ProgrammerDetails>

<!--The 2ndFav and 3rdFav Coding Platforms are imaginative because Geeksforgeeks
is the best!-->

XML expresses the data using tags (text within angular brackets) to shape the data (for ex:
FirstName) and attributes (For ex: Type) to feature the data. However, being a verbose and
voluminous language, other formats have gained more popularity.


2. JSON– JSON (JavaScript Object Notation) is a lightweight open-standard file format for
data interchange. JSON is easy to use and uses human/machine-readable text to store and
transmit data objects.
Javascript

{
  "firstName": "Jane",
  "lastName": "Doe",
  "codingPlatforms": [
    { "type": "Fav", "value": "Geeksforgeeks" },
    { "type": "2ndFav", "value": "Code4Eva!" },
    { "type": "3rdFav", "value": "CodeisLife" }
  ]
}
This format isn’t as formal as XML. It’s more like a key/value pair model than a formal data
depiction. Javascript has inbuilt support for JSON. Although JSON is very popular amongst
web developers, non-technical personnel find it tedious to work with JSON due to its heavy
dependence on JavaScript and structural characters (braces, commas, etc.)
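
To show how such semi-structured JSON is consumed programmatically, here is a brief sketch that parses the example above with Python's standard json module; the variable names are arbitrary.

Python

# Parsing the JSON example above with Python's built-in json module.
import json

raw = """
{
  "firstName": "Jane",
  "lastName": "Doe",
  "codingPlatforms": [
    {"type": "Fav", "value": "Geeksforgeeks"},
    {"type": "2ndFav", "value": "Code4Eva!"},
    {"type": "3rdFav", "value": "CodeisLife"}
  ]
}
"""
record = json.loads(raw)
print(record["firstName"])                                   # Jane
print([p["value"] for p in record["codingPlatforms"]])       # platform names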

3. YAML– YAML is a user-friendly data serialization language. Figuratively, it stands


for YAML Ain’t Markup Language. It is adopted by technical and non-technical handlers all
across the globe owing to its simplicity. The data structure is defined by line separation and
indentation, which reduces the dependency on structural characters. YAML is extremely
comprehensible, and its popularity is a result of its human and machine readability.

YAML example
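
The notes point to a YAML example here, but the snippet itself is not included in the text. A minimal sketch, mirroring the programmer's details used in the XML and JSON examples above, could look like this:

YAML

# The same programmer's details in YAML: structure comes from line
# separation and indentation rather than braces or tags.
firstName: Jane
lastName: Doe
codingPlatforms:
  - type: Fav
    value: Geeksforgeeks
  - type: 2ndFav
    value: Code4Eva!
  - type: 3rdFav
    value: CodeisLife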

A product catalog organized by tags is an example of semi-structured data.


Unstructured Data

• Unstructured data is the kind of data that doesn’t adhere to any definite schema or
set of rules. Its arrangement is unplanned and haphazard.
• Photos, videos, text documents, and log files can be generally considered
unstructured data. Even though the metadata accompanying an image or a video
may be semi-structured, the actual data being dealt with is unstructured.
• Additionally, Unstructured data is also known as “dark data” because it cannot be
analyzed without the proper software tools.

Un-structured Data

8. Describe about visualization techniques?

Big data visualization makes a difference. John Tukey, a celebrated mathematician and
researcher, once said: “The greatest value of a picture is when it forces us to notice what we
never expected to see.” And our data visualization team couldn't agree more. Visualization
allows business users to look beyond individual data records and easily identify dependencies
and correlations hidden inside large data sets. Here are examples of how big data analysis
results can look with and without well-implemented data visualization.

Example 1: Analysis of industrial data. In some cases, the maintenance team can skip the
‘looking for insights’ part and just get notified by the analytical system that part 23 at machine
245 is likely to break down. Nevertheless, the maintenance team is unlikely to be satisfied with
instant alerts only. They should be proactive, not just reactive, in their work, and for that they
need to know dependencies and trends. Big data visualization helps them get the required
insights. For example, if the maintenance team would like to understand the connections between
machinery failures and certain events that trigger them, they should look at connectivity charts
for insights.

Example 2: Analysis of social comments. Imagine a retailer operating nationwide. One customer
may visit their store and post on Facebook: “Guys, if you haven't bought Christmas presents yet,
go to [the retailer's name].” Another customer may share on Twitter: “I hate New Year time! I've
never seen lines that long! I wasted an hour at [the retailer's name] today. And the staff was
rude. Hate this place!” The third customer may post on Instagram: “Look what a gorgeous
reindeer sweater I bought at [the retailer's name]!” The company's customer base is 20+ million.
It would be impossible for the retailer to browse all over the internet in search of all the comments
and reviews and try to get insights just by scrolling through and reading them. To have these
tasks automated, companies resort to sentiment analysis. And to get instant insights into the
analysis results, they apply big data visualization. For example, word clouds demonstrate the
frequency of the words used: the higher the frequency, the bigger a word's font. So, if the biggest
words are hate, awful, terrible, failed, and their likes, it's high time to react.

Example 3: Analysis of customer behavior. Companies use a similar scenario to analyze customer
behavior. They strive to implement big data solutions that allow gathering detailed data about
purchases in brick-and-mortar and online stores, browsing history and engagement, GPS data and
data from the customer mobile app, calls to the support center and more. Registering billions of
events daily, a company is unable to identify the trends in customer behavior if they have just
multiple records at their disposal. With big data visualization, ecommerce retailers, for instance,
can easily notice the change in demand for a particular product based on page views. They can
also understand the peak times when visitors make most of their purchases, as well as look at the
share of coupon redemption, etc.

Most frequently used big data visualization techniques. Earlier, we studied with practical
examples how companies can benefit from big data visualization; now we'll give an overview of
the most widely used data visualization techniques.

Symbol maps: The symbols on such maps differ in size, which makes them easy to compare.
Imagine a US manufacturer who has launched a new brand recently. The manufacturer is
interested to know which regions liked the brand in particular. To achieve this, they can use a
map with symbols representing the number of customers who liked the product (left a positive
comment in social media, rated the new product high in a customer survey, etc.).

Line charts: Line charts allow looking at the behavior of one or several variables over time and
identifying trends. In traditional BI, line charts can show sales, profit and revenue development
for the last 12 months. When working with big data, companies can use this visualization
technique to track total application clicks by week, the average number of complaints to the call
center by month, etc.

Pie charts: Pie charts show the components of a whole. Companies that work with both
traditional and big data may use this technique to look at customer segments or market shares.
The difference lies in the sources from which these companies take raw data for the analysis.

Bar charts: Bar charts allow comparing the values of different variables. In traditional BI,
companies can analyze their sales by category, the costs of marketing promotions by channel,
etc. When analyzing big data, companies can look at visitors' engagement with their website's
multiple pages, the most frequent pre-failure cases on the shop floor and more.

Heat maps: Heat maps use colors to represent data. A user may encounter a heat map in Excel
that highlights sales in the best performing store with green and in the worst performing with
red. If a retailer is interested to know the most frequently visited aisles in the store, they will also
use a heat map of their sales floor. In this case, the retailer will analyze big data, such as data
from a video surveillance system.

How to avoid mistakes related to big data visualization?
The main purpose of big data visualization is to provide business users with insights. Choosing
the right visualization tool among the variety of options on the market


(Microsoft Power BI, Tableau, QlikView, and Sisense are just a few of the product names) and
applying the right techniques to create uncluttered and intuitive dashboards may be a more
complicated task than it seems. If you feel that you need assistance with this, you can involve big
data consultants to help you choose the most suitable visualization solution and/or customize it.
Read more at https://www.scnsoft.com/blog/big-data-visualization-techniques
9. Explain about search by simulated annealing?

What is simulated annealing?

Imagine a cost surface with several local minima: gradient descent gets stuck at a local minimum
if it starts near one and cannot go on to reach the global minimum. In cases like these, simulated
annealing proves useful.

Simulated annealing is an algorithm based on the physical annealing process used in metallurgy.
During physical annealing, the metal is heated up until it reaches its annealing temperature and
then is gradually cooled down to change it into the desired shape. It is based on the principle that
the molecular structure of the metal is weak when it is hot and can be changed easily, whereas
when it cools down it becomes hard, and changing the shape of the metal becomes difficult.

Simulated annealing has a probabilistic way of moving around in a search space and is used for
optimizing model parameters. It mimics physical annealing as a temperature parameter is used
here too.


The higher the temperature, the more likely the algorithm is to accept a worse solution. This
expands the search space, unlike gradient descent, and allows the algorithm to travel down an
apparently unpromising path. This promotes exploration.
The lower the temperature, the less likely the algorithm is to accept a worse solution. This tells
the algorithm that once it is in the right part of the search space, it does not need to search other
parts and should instead focus on finding the global maximum by converging. This promotes
exploitation.
The main difference between a greedy search and simulated annealing is that the greedy search
always goes for the best option, whereas simulated annealing accepts a worse solution with some
probability (based on the Boltzmann distribution).

Algorithm
For a function h(•) we are trying to maximize, the steps for simulated annealing algorithm is as
follows:

1. Start by generating an initial solution x.

2. Set the initial temperature t=t0 where t0 > 0.

3. For the n number of iterations i=1,2,...,n , loop through the following steps until the
termination condition is reached:

• Sample out θ ~ g(θ) where g is a symmetric distribution.


• The new candidate solution becomes x' = x ± θ.
• We find the difference in the objective between our old and new solutions
(Δh = h(x') − h(x)) and calculate the acceptance probability p using this difference and the
current temperature ti. This is the probability with which we accept or reject the candidate
solution:

p = exp(Δh / ti)

• If Δh is greater than zero, our new solution is better and we accept it. If it is less than zero,
we generate a random number u ~ U(0,1) and accept the new solution x' if u ≤ p.
• We then reduce the temperature t using a temperature reduction function α.
Temperature reduction functions such as t = t − α or t = t * α may be used here.
The termination conditions here may be achieving a particular temperature or a performance
threshold.

Note that if the temperature is high, say maybe a 100, then the probability that we are going to
accept the candidate solution comes out to be high, when we substitute it in the formula. As the
temperature becomes closer to 0, the algorithm functions like the greedy hill climbing algorithm.
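
A direct translation of these steps into Python is sketched below. The objective function, the Gaussian proposal g, the geometric cooling schedule and all numeric parameters are illustrative assumptions rather than part of the algorithm description above.

Python

# Simulated annealing for maximizing h(x), following the steps above.
import math
import random

def simulated_annealing(h, x0, t0=100.0, alpha=0.95, n_iter=1000):
    x, t = x0, t0
    for _ in range(n_iter):
        theta = random.gauss(0.0, 1.0)        # symmetric proposal g(θ)
        x_new = x + theta                     # candidate x' = x ± θ
        dh = h(x_new) - h(x)                  # Δh = h(x') - h(x)
        if dh > 0 or random.random() <= math.exp(dh / t):
            x = x_new                         # accept better, or worse with probability p
        t *= alpha                            # geometric temperature reduction
    return x

# Example: maximize a bumpy 1-D function with many local maxima.
h = lambda x: -0.1 * x * x + math.sin(3 * x)
best = simulated_annealing(h, x0=10.0)
print(best, h(best))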


Advantages and disadvantages


Advantages of Simulated Annealing
• Simulated annealing is easy to code and use.

• It does not rely on restrictive properties of the model and hence is versatile.

• It can deal with noisy data and highly non-linear models.

• Provides optimal solution for many problems and is robust.


Disadvantages of Simulated Annealing
• A lot of parameters have to be tuned as it is metaheuristic.

• The precision of the numbers used in its implementation has a significant effect on
the quality of results.

• There is a tradeoff between the quality of result and the time taken for the algorithm
to run.

10. What is meant by stochastic? Explain it.

In probability theory and related fields, a stochastic (/stəˈkæstɪk/) or random process is
a mathematical object usually defined as a sequence of random variables, where the index of the
sequence has the interpretation of time. Stochastic processes are widely used as mathematical
models of systems and phenomena that appear to vary in a random manner. Examples include the
growth of a bacterial population, an electrical current fluctuating due to thermal noise, or the
movement of a gas molecule.[1][4][5] Stochastic processes have applications in many disciplines
such as biology,[6] chemistry,[7] ecology,[8] neuroscience,[9] physics,[10] image processing, signal
processing,[11] control theory,[12] information theory,[13] computer science,[14] and
telecommunications.[15] Furthermore, seemingly random changes in financial
markets have motivated the extensive use of stochastic processes in finance.[16][17][18]
Applications and the study of phenomena have in turn inspired the proposal of new stochastic
processes. Examples of such stochastic processes include the Wiener process or Brownian
motion process,[a] used by Louis Bachelier to study price changes on the Paris Bourse,[21] and
the Poisson process, used by A. K. Erlang to study the number of phone calls occurring in a
certain period of time.[22] These two stochastic processes are considered the most important and
central in the theory of stochastic processes,[1][4][23] and were discovered repeatedly and
independently, both before and after Bachelier and Erlang, in different settings and countries.[21][24]

The term random function is also used to refer to a stochastic or random process,[25][26] because a
stochastic process can also be interpreted as a random element in a function space.[27][28] The
terms stochastic process and random process are used interchangeably, often with no
specific mathematical space for the set that indexes the random variables.[27][29] But often these
two terms are used when the random variables are indexed by the integers or an interval of


the real line.[5][29] If the random variables are indexed by the Cartesian plane or some higher-
dimensional Euclidean space, then the collection of random variables is usually called a random
field instead.[5][30] The values of a stochastic process are not always numbers and can be vectors
or other mathematical objects.[5][28]
Based on their mathematical properties, stochastic processes can be grouped into various
categories, which include random walks,[31] martingales,[32] Markov processes,[33] Lévy
processes,[34] Gaussian processes,[35] random fields,[36] renewal processes, and branching processes.[37] The
study of stochastic processes uses mathematical knowledge and techniques
from probability, calculus, linear algebra, set theory, and topology[38][39][40] as well as branches of
mathematical analysis such as real analysis, measure theory, Fourier analysis, and functional
analysis.[41][42][43] The theory of stochastic processes is considered to be an important contribution
to mathematics[44] and it continues to be an active topic of research for both theoretical reasons
and applications.[45][46][47]

11. Describe about interaction techniques?

This section reviews the state-of-the-art interaction techniques of visualization authoring tools.
Visualization tools tend to help users in the creation, exploration, or presentation of
visualizations; they also allow users to craft expressive designs or extract data from
visualizations. The review presents the interaction techniques integrated into the tools for these
five high-level goals. We cover the tools for each goal and summarize how a sequence of
independent interaction techniques leads to the goal. We also discuss how well researchers have
evaluated the usability and intuitiveness of interaction techniques, aiming to reflect on the
strengths and weaknesses of the evaluations. To that end, from the perspective of human
cognition, we review the goals, procedures, and findings of the evaluations. Principally, human
cognition is engaged when users perform tasks in a tool; the interaction techniques bridge the gap
between human cognition and the goals users want to achieve from the tool. To sum up, this
review presents a novel triad 'goals-interaction techniques-cognition' taxonomy. Besides, the
review suggests the need for further work to enhance tools and understand users.

12. Discuss briefly about specific visual data analysis techniques?

Big data visualization techniques — charts, maps, interactive content, infographics, motion
graphics, scatter plots, regression lines, timelines, for example — enable companies' decision-
makers to get results by better understanding their processes and stakeholders. Visualization
software supports multiple sources and high volumes of raw data to provide instant analysis of
facts, trends, and patterns.

Big data visualization is a remarkably powerful business capability.

According to IBM, every day 2.5 quintillion bytes of data are created from social media,
sensors, webpages, and all kinds of management systems, and businesses use this data to control
their processes.


By revealing correlations between thousands of variables available in the big data world, these
technologies can present massive amounts of data in an understandable way, which means big
data visualization initiatives combine IT and management projects.

In this article, we will address data and how its visual representation should move together to
ensure it is effectively employed.

You will see the following topics:

• What is Big Data visualization?


• Why is it important to have a good method of visualization?
• What are the types of Big Data visualization?
• What are the main tools for Big Data visualization?

What is Big Data visualization?

A defining characteristic of Big Data is volume.

Today’s companies collect and store vast amounts of information that would take years for
a human to read and understand.

Visualization resources rely on powerful tools to interpret raw data and process it to
generate visual representations that allow humans to take in and understand enormous
amounts of data in a few minutes.

Big data visualization describes data of almost any type — numbers, trigonometric functions,
linear algebra, geometric, basic, or statistical algorithms — in a visual format — coding,
reporting analytics, graphical interaction — that makes it easy to understand and interpret.

Thus, it goes far beyond typical graphs, bubble plots, histograms, pie, and donut charts to more
complex representations like heat maps and box and whisker plots, enabling decision-makers to
explore data sets to identify correlations or unexpected patterns.

Why is it important to have a good method of visualization?

The amount of data is growing every year thanks to the Internet and innovations such as
operational systems, sensors, and the Internet of Things.

The problem for companies is that data is only useful if valuable insights can be
extracted from large amounts of raw data and read by those who can analyze them — data
literacy in near real time.

Big Data visualization techniques are important because they:

• Enable decision-makers to understand what the amount of data means very quickly;
• Capture trends — the use of appropriate techniques can make it easy to recognize this
information;
• Reveal patterns — identify correlations and unexpected connections that could not be
found with specific questions; and
• Provide a highly effective way to communicate any insights that surface to others.

What are the types of Big Data visualization?

Big Data visualization provides a relevant suite of techniques for gaining a qualitative
understanding.

We described the basic types below.

Charts

Charts use elements to match the values of variables and compare multiple components, showing
the relationship between data points.

• Line chart — the comparable elements are lines, which help to analyze the peaks and falls
of a variable along an axis, such as sales volume over a period.
• Pie and donut charts — they are used to compare parts of a whole, such as the
components of one category. The angle and the arc of each sector correspond to
the illustrated value, and the distance from the center indicates their importance.
• Bar chart — each value is displayed by a bar, either vertical or horizontal. It is
not ideal when values are very close to each other.


Plots

Plots help to visualize data sets in 2D or 3D. They can be:

• Scatter (X-Y) plot — shows the mutual variation of two data items (axis X and Y).
• Bubble plot — it has the same scatter plot concept, but the markers are bubbles.
The main difference is the bubble size, the third measure that represents another
variable.
• Histogram plot — represents the distribution of a variable by showing how frequently its values fall within specific intervals.


Maps

Maps make it possible to position data points on different objects and areas, such as layouts,
geographical maps, and building projects. They could be heat maps or a dot distribution map.

Big data also pushes companies to find new ways of data visualization — semi-structured and
unstructured data require new visualization techniques. You can try some of the ones below to
address these challenges.

Kernel density estimation

If we do not have enough knowledge about the amount and the distribution of the data, it can
best be visualized with this big data visualization technique, which estimates and represents the
probability distribution function of the data.
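
A brief sketch of kernel density estimation with SciPy is shown below; the synthetic bimodal sample and the evaluation grid are illustrative.

Python

# Kernel density estimation of an unknown distribution with SciPy.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(3, 1.0, 500)])

kde = gaussian_kde(sample)                 # fit the density estimator
grid = np.linspace(-5, 7, 200)
density = kde(grid)                        # estimated probability density over the grid

print(grid[np.argmax(density)])            # location of the highest estimated peak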

Box and whisker plot

It shows the distribution of massive data, often to understand the outliers in the data in a
graphical display of five statistics:

• Minimum;
• Lower quartile;
• Median;
• Upper quartile; and
• Maximum.
Extreme values are represented by whiskers that extend out from the edges of the box.

Word clouds

It represents the frequency of a word within a body of the text: the bigger the word, the more
relevant it is.

Network diagrams

It represents relationships as nodes and ties, for example to analyze social networks or map
product sales across geographic areas.


Correlation matrices

They are used to summarize data and serve as input and output for advanced analyses, allowing
quick identification of relationships between variables with fast response times.


What are the main tools for Big Data visualization?

Big data visualization tools need to support multiple data sources and high volumes of data and
provide instant analysis. Users can better understand information through designs and dashboards
that help discover correlations, trends, and patterns in data. The main tools to build a decision-making
platform are:

• Visual.ly
• Power BI
• Sisense
• Periscope Data
• Zoho Analytics
• IBM Cognos Analytics
• Tableau Desktop
• Qlik solution — QlikSense and QlikView
• Microsoft PowerBI
• Oracle Visual Analyzer
• FineReport.
Visual.ly is a new way to think about content creation and data visualization for your
company — capture more relevant information with visuals to deliver better content faster.

By using charts, maps, interactive content, infographics, motion graphics, explaining videos,
histograms, scatter plots, regression lines, timelines, treemaps, and word clouds, the
Visual.ly platform reaches more details from data to leverage businesses’ results and
generate better opportunities for brands.


Unit-III
Answer all the questions
Part-A
1. What is Data stream?

Data streaming is the process of transmitting, ingesting, and processing data continuously
rather than in batches. It is used to deliver real-time information to users and help them make
better decisions. Big data streaming is a process in which large streams of real-time data are
processed to extract insights and useful trends out of it. Data streaming is a key capability for
organizations that want to generate analytic results in real time.
2. What is meant transactional data stream?

Transactional data, when used right, can be a key source of business intelligence. For instance,
in big data analytics, transactional data is vital to understand peak transaction volume, peak
ingestion rates, and peak data arrival rates.
3. Write short notes on measurement data stream?

Continuous queries can be used for monitoring, alerting, security, personalization, etc. Data
streams can be either transactional (i.e., log interactions between entities, such as credit card
purchases, web clickstreams, phone calls), or measurement (i.e., monitor evolution of entity
states, such as physical phenomena, road traffic, temperature, network).
4. What are the examples in data stream?

Examples
Some real-life examples of streaming data include use cases in every industry, including real-
time stock trades, up-to-the-minute retail inventory management, social media feeds, multiplayer
game interactions, and ride-sharing apps.

For example, when a passenger calls Lyft, real-time streams of data join together to create a
seamless user experience. Through this data, the application pieces together real-time location
tracking, traffic stats, pricing, and real-time traffic data to simultaneously match the rider with
the best possible driver, calculate pricing, and estimate time to destination based on both real-
time and historical data.

In this sense, streaming data is the first step for any data-driven organization, fueling big data
ingestion, integration, and real-time analytics.

5. What are the characteristics of Data streams?

Characteristics of Data Streams:
1. Large volumes of continuous data, possibly infinite.
2. Steady changing and requires a fast, real-time response.
3. Data stream captures nicely our data processing needs of today.
4. Random access is expensive, so only single-scan algorithms can be used (each item can be examined only once).

5. Store only the summary of the data seen so far.


6. Most stream data are at a pretty low level or multidimensional in nature, and so need
multilevel and multidimensional treatment.


6. Shortly write about the applications of data streams?

1. Fraud detection
2. Real-time goods trading
3. Consumer enterprise analytics
4. Monitoring and reporting on internal IT systems

7. List out the advantages of data streams?

• This data is helpful in improving sales
• Helps in recognizing errors early
• Helps in minimizing costs
• It provides the details needed to react swiftly to risk

8. List out the disadvantages of data streams?

• Lack of security of data in the cloud
• Dependence on the cloud provider (vendor lock-in)
• Off-premises storage of data introduces the possibility of disconnection

9. What are the types of stream processing?

Stream processing encompasses dataflow programming, reactive programming, and


distributed data processing. Stream processing systems aim to expose parallel processing for
data streams and rely on streaming algorithms for efficient implementation.
10. What are the applications of stream processing?

Stream processing is used by organizations in various industries to keep up with data from
billions of “things”. Stream processing is useful in use cases where we can detect a problem and
have a reasonable response to improve the outcome. Following are some of the use cases:
• Algorithmic Trading
• Stock Market Surveillance
• Smart Patient Care
• Monitoring a production line
Stream processing architectures help simplify the data management tasks required to consume,
process and publish the data securely and reliably.

11.What is event time vs processing time?


• Event Time: This refers to the actual time when an event occurred or when the data was
generated. Event time is essential for analyzing data accurately, especially when events
are not processed immediately.
• Processing Time: Processing time is the time at which data is ingested and processed by
the stream processing system. It may lag behind event time, and handling these time
disparities is critical in stream processing.
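To make the distinction concrete, here is a minimal Python sketch; the event format, field names,
and the three-second lag are illustrative assumptions, not taken from any particular framework.

import time

def process(event):
    # Processing time: when the stream processor actually handles the event.
    processing_time = time.time()
    # Event time: when the event occurred at the source (carried inside the event).
    lag_seconds = processing_time - event["event_time"]
    return {**event, "processing_time": processing_time, "lag_seconds": lag_seconds}

# A sensor reading that occurred three seconds before it reaches the processor.
reading = {"sensor": "t-17", "value": 21.4, "event_time": time.time() - 3.0}
print(process(reading)["lag_seconds"])   # roughly 3 seconds of lag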


12. What is windowing?

• Definition: Windowing in stream processing involves dividing the continuous data
stream into finite, discrete time intervals or windows. These windows are used for
aggregation and analysis, enabling the examination of data within specific time frames.
• Types of Windows: There are various window types, including tumbling windows (non-
overlapping), sliding windows (partially overlapping), and session windows (dynamically
defined by event patterns).
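As an illustration, a minimal Python sketch of tumbling (non-overlapping) windows follows; the
events, the 60-second window size, and the sum aggregation are assumptions made for the example.

from collections import defaultdict

def tumbling_window_sums(events, window_size=60):
    # events: iterable of (event_time_seconds, value) pairs.
    windows = defaultdict(float)
    for event_time, value in events:
        # Each event falls into exactly one non-overlapping window.
        window_start = (event_time // window_size) * window_size
        windows[window_start] += value
    return dict(windows)

events = [(0, 1.0), (30, 2.0), (61, 4.0), (125, 8.0)]
print(tumbling_window_sums(events))   # {0: 3.0, 60: 4.0, 120: 8.0}

A sliding window would instead assign each event to several overlapping windows, and a session
window would close only after a gap in activity.
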
13. What is stateful stream processing?
• Stream processing applications often need to maintain state over time. This can
involve aggregating data, tracking patterns, or maintaining context.
• Stateful processing is essential for tasks like fraud detection (tracking
transaction history), user session management, and maintaining rolling averages.
14. What is scalability and fault tolerance?
• Stream processing systems need to be scalable to handle high-volume data streams
and fault-tolerant to handle hardware failures gracefully.
• Scaling can involve parallelizing processing across multiple nodes or containers.
15. What is complex event processing?
CEP is a subset of stream processing that focuses on identifying and acting upon complex
patterns or sequences of events in real time. It's used in applications like stock trading, supply
chain management, and network monitoring.
16. What is event?
Events are individual data points or records that represent something happening in the real world.
Events can be generated by various sources, such as sensors, applications, devices, or user
interactions.
17. What is immutability?
Once an event is generated, it is considered immutable. This means that events are not updated
or changed; instead, new events may be generated to reflect changes or updates.

18. What is data sources?


Data sources are the origins of data streams. These can include IoT devices, sensors, log files,
web applications, social media feeds, and more. Data sources produce events and send them to
the stream processing system.

19. What is ingestion layer?


Data Ingestion: This layer handles the collection and ingestion of data from various sources
into the stream processing system. Ingestion techniques can include Kafka, Apache Flume, or
custom connectors.
20. What is schema validation?
Ensures that incoming events conform to a predefined schema or structure.


Part-B
1. Explain about stream concepts?

Introduction to stream concepts :


A data stream is a continuous, ordered (implicitly by arrival time or explicitly by timestamp)
sequence of items. It is not feasible to control the order in which items arrive, nor is it
feasible to store the stream locally in its entirety.
The volumes of data are enormous, and items arrive at a high rate.
Types of Data Streams :
• Data stream –
A data stream is a (possibly unbounded) sequence of tuples. Each tuple is comprised of a set of
attributes, similar to a row in a database table.
• Transactional data stream –
It logs interactions between entities:
1. Credit card – purchases by consumers from merchants
2. Telecommunications – phone calls by callers to the dialed parties
3. Web – accesses by clients of information at servers
• Measurement data streams –
1. Sensor Networks – a physical natural phenomenon, road traffic
2. IP Network – traffic at router interfaces
3. Earth climate – temperature, humidity level at weather stations
Examples of Stream Sources-
1. Sensor Data –
In navigation systems, sensor data is used. Imagine a temperature sensor floating
about in the ocean, sending back to the base station a reading of the surface
temperature each hour. The data generated by this sensor is a stream of real
numbers. We have 3.5 terabytes arriving every day, and we certainly need to think
about what can be kept for continuous processing and what can only be archived.

2. Image Data –
Satellites frequently send down to Earth streams containing many terabytes of
images per day. Surveillance cameras generate images with lower resolution than
satellites, but there can be numerous cameras, each producing a stream of images at
intervals of one second.

3. Internet and Web Traffic –


A switching node in the middle of the internet receives streams of IP packets from
many inputs and routes them to its outputs. Websites receive streams of
heterogeneous types. For example, Google receives a hundred million search
queries per day.
Characteristics of Data Streams :
1. Large volumes of continuous data, possibly infinite.
2. Steadily changing and requiring a fast, real-time response.
3. The data stream model captures today's data processing needs well.
4. Random access is expensive, so single-scan algorithms are preferred.
5. Store only the summary of the data seen so far.
6. Most stream data arrive at a fairly low level of abstraction or are multidimensional in nature,
and so need multilevel and multidimensional treatment.


Applications of Data Streams :


1. Fraud detection
2. Real-time trading of goods
3. Customer engagement
4. Monitoring and reporting on internal IT systems
Advantages of Data Streams :
• Streaming data is helpful in improving sales
• Helps in recognizing errors early
• Helps in minimizing costs
• It provides the detail needed to react swiftly to risk
Disadvantages of Data Streams :
• Lack of security of data in the cloud
• Dependence on the cloud provider (vendor lock-in)
• Off-premises storage of data introduces the possibility of disconnection

2. Discuss briefly about Stream Data Model and Architecture?

Before we get to streaming data architecture, it is vital that you first understand streaming data.
Streaming data is a general term used to describe data that is generated continuously at high
velocity and in large volumes.
A stream data source is characterized by continuous time-stamped logs that document events in
real time.
Examples include a sensor reporting the current temperature, or a user clicking a link on a web
page. Stream data sources include:

• Server and security logs

• Clickstream data from websites and apps

• IoT sensors

• Real-time advertising platforms

Therefore, a streaming data architecture is a dedicated network of software components capable


of ingesting and processing copious amounts of stream data from many sources. Unlike
conventional data architecture solutions, which focus on batch reading and writing, a streaming
data architecture ingests data as it is generated in its raw form, stores it, and may incorporate
different components for real-time data processing and manipulation.

An effective streaming architecture must account for the distinctive characteristics of data
streams which tend to generate copious amounts of structured and semi-structured data that
requires ETL and pre-processing to be useful.

Due to its complexity, stream processing cannot be solved with one ETL tool or database. That’s
why organizations need to adopt solutions consisting of multiple building blocks that can be
combined with data pipelines within the organization’s data architecture.


Although stream processing was initially considered a niche technology, it is hard to find a
modern business that does not have an eCommerce site, an online advertising strategy, an app, or
products enabled by IoT.

Each of these digital assets generates real-time event data streams, thus fueling the need to
implement a streaming data architecture capable of handling powerful, complex, and real-time
analytics.

3. Describe about Stream Computing?

Stream Computing
The stream processing computational paradigm consists of assimilating data readings from
collections of software or hardware sensors in stream form (i.e., as an infinite series of tuples),
analyzing the data, and producing actionable results, possibly in stream format as well.

In a stream processing system, applications typically act as continuous queries, ingesting data
continuously, analyzing and correlating the data, and generating a stream of results.
Applications are represented as data-flow graphs composed of operators interconnected by
streams. The individual operators implement algorithms for data
analysis, such as parsing, filtering, feature extraction, and classification. Such algorithms are
typically single-pass because of the high data rates of external feeds (e.g., market information
from stock exchanges, environmental sensors readings from sites in a forest, etc.).
Stream processing applications are usually constructed to identify new information by
incrementally building models and assessing whether new data deviates from model predictions
and, thus, is interesting in some way. For example, in a financial engineering application, one
might be constructing pricing models for options on securities, while at the same time detecting
mispriced quotes, from a live stock market feed. In such an application, the predictive model
itself might be refined as more market data and other data sources become available (e.g., a feed
with weather predictions, estimates on fuel prices, or headline news).
Streams applications may consist of dozens to hundreds of analytic operators, deployed on
production systems hosting many other potentially interconnected stream applications,
distributed over a large number of processing nodes.
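A toy sketch of such a data-flow graph, written as a chain of single-pass Python generators, is
shown below; the operators (parse, filter, flag outliers), the record format, and the fixed
threshold are illustrative assumptions rather than part of any real stream-processing product.

def parse(lines):
    # Parse raw feed lines into records (one pass, no buffering).
    for line in lines:
        symbol, price = line.split(",")
        yield {"symbol": symbol, "price": float(price)}

def filter_symbol(records, wanted):
    # Keep only the records for the symbol of interest.
    for record in records:
        if record["symbol"] == wanted:
            yield record

def flag_outliers(records, threshold):
    # Flag quotes that deviate from a (here, fixed) model prediction.
    for record in records:
        record["outlier"] = record["price"] > threshold
        yield record

feed = ["ABC,10.5", "XYZ,99.0", "ABC,250.0"]
for result in flag_outliers(filter_symbol(parse(feed), "ABC"), threshold=100.0):
    print(result)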


4. What is meant by Sampling Data in a Stream?

Data arrives as a sequence of items, sometimes continuously and at high speed. We cannot store
all of the items in main memory, and we cannot read the stream again (or reading it again has a
cost). We therefore abstract the data to a particular feature of interest, the data field we call
the label.

The data: we have a set of n labels Σ, and our input is a stream s = x1, x2, x3, . . . , xm,
where each xi ∈ Σ. Take into account that sometimes we do not know the length of the stream in
advance.

Goal: compute a function of the stream, e.g., the median, the number of distinct elements, or the
longest increasing sequence. Because the whole stream cannot be stored, we keep a sample of the
stream and estimate such functions from the sample.
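One standard way to sample under these constraints (not named in the passage above, but widely
used) is reservoir sampling, which keeps a uniform random sample of k items in a single pass even
when the stream length m is unknown. A minimal Python sketch:

import random

def reservoir_sample(stream, k):
    # Maintain a uniform random sample of k items using O(k) memory and one pass.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)        # pick j uniformly from 0..i
            if j < k:
                reservoir[j] = item         # keep the new item with probability k/(i+1)
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))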

5. Explain about Filtering Streams?

Another common process on streams is selection, or filtering. We want to accept those tuples in
the stream that meet a criterion. Accepted tuples are passed to another process as a stream, while
other tuples are dropped. If the selection criterion is a property of the tuple that can be calculated
(e.g., the first component is less than 10), then the selection is easy to do. The problem becomes
harder when the criterion involves lookup for membership in a set. It is especially hard, when
that set is too large to store in main memory. In this section, we shall discuss the technique
known as “Bloom filtering” as a way to eliminate most of the tuples that do not meet the
criterion.
A Motivating Example
Again let us start with a running example that illustrates the problem and what we can do about
it. Suppose we have a set S of one billion allowed email addresses – those that we will allow
through because we believe them not to be spam. The stream consists of pairs: an email address


and the email itself. Since the typical email address is 20 bytes or more, it is not reasonable to
store S in main memory. Thus, we can either use disk accesses to determine whether or not to let
through any given stream element, or we can devise a method that requires no more main
memory than we have available, and yet will filter most of the undesired stream elements.
Suppose for argument’s sake that we have one gigabyte of available main memory. In the
technique known as Bloom filtering, we use that main memory as a bit array. In this case, we
have room for eight billion bits, since one byte equals eight bits. Devise a hash function h from
email addresses to eight billion buckets. Hash each member of S to a bit, and set that bit to 1. All
other bits of the array remain 0. Since there are one billion members of S, approximately 1/8th of
the bits will be 1. The exact fraction of bits set to 1 will be slightly less than 1/8th, because it is
possible that two members of S hash to the same bit. We shall discuss the exact fraction of 1’s in
Section 4.3.3. When a stream element arrives, we hash its email address. If the bit to which that
email address hashes is 1, then we let the email through. But if the email address hashes to a 0,
we are certain that the address is not in S, so we can drop this stream element. Unfortunately,
some spam email will get through. Approximately 1/8th of the stream elements whose email
address is not in S will happen to hash to a bit whose value is 1 and will be let through.
Nevertheless, since the majority of emails are spam (about 80% according to some reports),
eliminating 7/8th of the spam is a significant benefit. Moreover, if we want to eliminate every
spam, we need only check for membership in S those good and bad emails that get through the
filter. Those checks will require the use of secondary memory to access S itself. There are also
other options, as we shall see when we study the general Bloom-filtering technique.
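A minimal Python sketch of the idea described above follows; it uses a bit array and hashing (here
via hashlib), and the array size and email addresses are illustrative assumptions. Real Bloom
filters normally use several hash functions, whereas the passage's example uses one.

import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes=1):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)      # the bit array, initially all 0

    def _positions(self, item):
        # Derive num_hashes bit positions from the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)     # set the bit to 1

    def might_contain(self, item):
        # False means "definitely not in S"; True means "probably in S".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

allowed = BloomFilter(num_bits=80_000, num_hashes=1)
allowed.add("alice@example.com")
print(allowed.might_contain("alice@example.com"))    # True
print(allowed.might_contain("spammer@example.com"))  # almost certainly False
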
6. Elaborately explain about Counting Distinct Elements in a Stream?

The count-distinct problem is the problem of finding the number of distinct elements in a data
stream with repeated elements. One way to solve this problem is to create a map and store the
elements in the map with their frequency as the value; because duplicate keys cannot exist in a
map data structure, all keys inserted into the map will be distinct. Finally, the size of the map
gives the number of distinct elements present in the given input. There are also streaming
algorithms, such as Recordinality, that estimate the number of distinct elements approximately
when the stream is too large to store exactly.
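The exact, map-based approach described above can be sketched in a few lines of Python (the sample
stream is illustrative); it works only while the map of distinct keys still fits in memory, which
is exactly why estimating algorithms are needed for very large streams.

from collections import Counter

def count_distinct_exact(stream):
    frequency = Counter(stream)     # key -> frequency; duplicates collapse onto one key
    return len(frequency)           # the number of keys is the number of distinct elements

print(count_distinct_exact(["a", "b", "a", "c", "b", "a"]))   # 3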

7. Discuss briefly about Estimating Moments?


8. What is meant by Counting Oneness in a Window?

Suppose we have a window of length N on a binary stream. We want at all times to be able
to answer queries of the form “how many 1’s are there in the last k bits?” for any k≤ N. For
this purpose we use the DGIM algorithm.

The basic version of the algorithm uses O(log² N) bits to represent a window of N bits, and
allows us to estimate the number of 1's in the window with an error of no more than 50%.

To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has
timestamp 1, the second has timestamp 2, and so on.

Since we only need to distinguish positions within the window of length N, we shall represent
timestamps modulo N, so they can be represented by log2 N bits. If we also store the total
number of bits ever seen in the stream (i.e., the most recent timestamp) modulo N, then we can
determine from a timestamp modulo N where in the current window the bit with that timestamp
is.

We divide the window into buckets, consisting of:

1. The timestamp of its right (most recent) end.

2. The number of 1’s in the bucket. This number must be a power of 2, and we refer to
the number of 1’s as the size of the bucket.

To represent a bucket, we need log2 N bits to represent the timestamp (modulo N) of its right
end. To represent the number of 1’s we only need log2 log2 N bits. The reason is that we know
this number i is a power of 2, say 2j , so we can represent i by coding j in binary. Since j is at
most log2 N, it requires log2 log2 N bits. Thus, O(logN) bits suffice to represent a bucket.


There are six rules that must be followed when representing a stream by buckets.
• The right end of a bucket is always a position with a 1.

• Every position with a 1 is in some bucket.

• No position is in more than one bucket.

• There are one or two buckets of any given size, up to some maximum size.

• All sizes must be a power of 2.

• Buckets cannot decrease in size as we move to the left (back in time).
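A compact, hedged sketch of the DGIM bookkeeping described above is given below in Python. It keeps
buckets as (right-end timestamp, size) pairs, allows at most two buckets of each size, merges the
two oldest of any three equal-sized buckets, and answers a query by counting only half of the
oldest bucket in range. It is a teaching sketch under those assumptions, not a production
implementation.

class DGIM:
    def __init__(self, window_size):
        self.window = window_size
        self.time = 0
        self.buckets = []                      # (right-end timestamp, size), newest first

    def add(self, bit):
        self.time += 1
        # Drop buckets that have fallen out of the window of length N.
        self.buckets = [(t, s) for (t, s) in self.buckets if t > self.time - self.window]
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            self._merge()

    def _merge(self):
        # Whenever three buckets of the same size exist, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets):
            a, b, c = self.buckets[i], self.buckets[i + 1], self.buckets[i + 2]
            if a[1] == b[1] == c[1]:
                self.buckets[i + 1] = (b[0], b[1] + c[1])   # merged bucket keeps the newer right end
                del self.buckets[i + 2]
            else:
                i += 1

    def count_ones(self, k):
        # Estimate the number of 1's among the last k bits (k <= window size).
        total, oldest_size = 0, 0
        for t, size in self.buckets:
            if t > self.time - k:
                total += size
                oldest_size = size             # ends as the oldest bucket still in range
            else:
                break
        return total - oldest_size // 2        # count only half of the oldest bucket

dgim = DGIM(window_size=1000)
for bit in (1, 0, 1, 1, 0, 1, 1, 1):
    dgim.add(bit)
print(dgim.count_ones(4))    # estimate of 1's among the last 4 bits (true answer: 3)
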

9. Describe about Decaying Window?

A decaying window is a concept in big data that assigns more weight to recent elements. The
technique computes a smooth aggregation of all the 1's ever seen in the stream, with decaying
weights: the further back an element appears in the stream, the less weight it is given. The
decaying window algorithm allows you to identify the most popular elements in an incoming data
stream, while discounting any random spikes or spam requests that might have temporarily boosted
an element's count.
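A minimal Python sketch of an exponentially decaying window follows; the decay constant c, the
click stream, and the naive per-element rescaling loop (real implementations usually decay scores
lazily) are all illustrative assumptions.

def decayed_counts(stream, c=1e-3):
    scores = {}
    for item in stream:
        # Every existing score decays by a factor of (1 - c) ...
        for key in scores:
            scores[key] *= (1 - c)
        # ... and the item just seen gets an extra weight of 1.
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores

clicks = ["A", "A", "B", "A", "B", "B", "B"]
scores = decayed_counts(clicks)
print(max(scores, key=scores.get))   # "B": the most recently popular element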


10. Explain about Real time Analytics Platform(RTAP) Applications?

Real-time Analytics Platform (RTAP) Applications can be broken down into smaller, easier to
understand parts as follows:
1. What is Real-time Analytics? Real-time analytics is the analysis of data as soon as it
enters the system, allowing it to be acted upon immediately.
2. What is a Real-time Analytics Platform (RTAP)? A real-time analytics platform is a
tool that enables organizations to extract valuable information and trends from real-time
data. It helps in measuring data from a business point of view in real time, further making
the best use of the data. An ideal RTAP would help in analyzing the data, correlating it,
and predicting outcomes on a real-time basis.
3. What are the benefits of a Real-time Analytics Platform (RTAP)? RTAPs help in
managing and processing data, leading to timely decision-making. RTAPs connect data
sources for better analytics and visualization, and they help organizations in tracking
things in real time, thus helping them in the decision-making process.
4. What are some real-life applications of real-time analytics?
• Crisis Management: Real-time analytics can be used to monitor social media and news
feeds to detect and respond to crises quickly.
• Increased Company Vision: Real-time analytics can help organizations to identify trends
and patterns in their data, leading to better decision-making and increased company
vision.
• Quicker and Less Costly Changes: Real-time analytics can help organizations to identify
and respond to changes in their data quickly, leading to quicker and less costly changes.
• Personalized Marketing: Real-time analytics can be used to analyze customer data and
provide personalized marketing experiences.
• Fraud Detection: Real-time analytics can be used to detect fraudulent activities, such as
credit card fraud, in real time.

5. What are some real-time analytics tools for data analytics? Some widely used
RTAPs include Apache Spark Streaming, a big data platform for data stream analytics in
real time, and Cisco Connected Streaming Analytics.
In summary, Real-time Analytics Platform (RTAP) applications are tools that enable
organizations to extract valuable information and trends from real-time data, leading to timely
decision-making and increased company vision. They can be used for various applications, such
as crisis management, personalized marketing, and fraud detection. Some widely used RTAPs
include Apache Spark Streaming and Cisco Connected Streaming Analytics.

11. Describe about Real Time Sentiment Analysis?

Big data trend has enforced the data-centric systems to have continuous fast data streams. In
recent years, real-time analytics on stream data has formed into a new research field, which aims
to answer queries about “what-is-happening-now” with a negligible delay. The real challenge
with real-time stream data processing is that it is impossible to store instances of data, and
therefore online analytical algorithms are utilized. To perform real-time analytics, pre-processing
of data should be performed in a way that only a short summary of stream is stored in main
memory. In addition, due to high speed of arrival, average processing time for each instance of
data should be in such a way that incoming instances are not lost without being captured. Lastly,
the learner needs to provide high analytical accuracy measures. Sentinel is a distributed system
written in Java that aims to solve this challenge by performing both the processing and the
learning in distributed form. Sentinel is built on top of Apache Storm, a distributed computing
platform. Sentinel's learner, the Vertical Hoeffding Tree, is a parallel decision-tree learning
algorithm based on the VFDT, able to perform parallel classification in distributed
environments. Sentinel also uses SpaceSaving to keep a summary of the data stream and stores
this summary in a synopsis data structure. The application of Sentinel to the Twitter Public
Stream API is shown and the results are discussed.
In recent years, stream data is generated at an increasing rate. The main sources of stream data
are mobile applications, sensor applications, measurements in network monitoring and traffic
management, log records or click-streams in web exploring, manufacturing processes, call detail
records, email, blogging, twitter posts, Facebook statuses, search queries, finance data, credit
card transactions, news, emails, Wikipedia updates [5]. On the other hand, with growing
availability of opinion-rich resources such as personal blogs and micro blogging platforms
challenges arise as people now use such systems to express their opinions. The knowledge of
real-time sentiment analysis of social streams helps to understand what social media users think
or express "right now". Applying real-time sentiment analysis to social streams brings many
opportunities: data-driven marketing (customers' immediate response to a campaign), immediate
prevention of disasters, handling business crises such as Toyota's crisis in 2010 or the swine
flu epidemic in 2009, and tracking debates in social media. Real-time sentiment analysis can be applied
in almost all domains of business and industry. Data stream mining is the informational
structure extraction as models and patterns from continuous and evolving data streams.


Traditional methods of data analysis require the data to be stored and then processed off-line
using complex algorithms that make several passes over data. However in principles, data
streams are infinite, and data is generated with high rates and therefore it cannot be stored in
main memory. Different challenges arise in this context: storage, querying and mining. The
latter is mainly related to the computational resources to analyze such volume of data, so it has
been widely studied in the literature, which introduces several approaches in order to provide
accurate and efficient algorithms [1], [3], [4]. In real-time data stream mining, data streams are
processed in an online manner (i.e. real-time processing) so as to guarantee that results are up-
to-date and that queries can be answered in real-time with negligible delay [1], [5]. Current
solutions and studies in data stream sentiment analysis are limited to perform sentiment analysis
in an off-line approach on a sample of stored stream data. While this approach can work in some
cases, it is not applicable in the real-time case. In addition, real-time sentiment analysis tools
such as MOA [5] and RapidMiner [3] exist, however they are uniprocessor solutions and they
cannot be scaled for an efficient usage in a network nor a cluster. Since in big data scenarios, the
volume of data rises drastically after some period of analysis, this causes uniprocessor solutions
to perform slower over time. As a result, processing time per instance of data becomes higher
and instances get lost in a stream. This affects the learning curve and accuracy measures due to
less available data for training and can introduce high costs to such solutions. Sentinel relies
on a distributed architecture and distributed learners to address this shortcoming of the
available solutions for real-time sentiment analysis in social media.
12. Discuss briefly about Stock Market Predictions?

A stock market is the aggregation of buyers and sellers of stocks (shares), which represent
ownership claims on businesses which may include securities listed on a public stock exchange
as well as those traded privately. We have seen through the years that people have incurred high
losses which have led to devastations of lives and hence a need for prediction system arises
which can be trusted and consistent throughout the life cycle. Also predicting stock prices is an
important task of financial time series forecasting, which is of primary interest to stock
investors, stock traders and applied researchers. Precisely predicting stocks is essential for
investors to gain enormous profits. However, the volatility of the market makes this kind of
prediction highly difficult. We show that data mining and machine learning can be used to guide
an investor's decisions. The main aim is to build a model with the help of data mining
techniques such as kNN, which can be used for classification and regression, combined with
machine learning techniques like genetic algorithms and SVR, along with sentiment analysis of
social media text, to forecast stock prices for companies. The system, if correctly implemented,
will help investors and new users kick-start the investment process and can provide considerable
benefits. The system can be enhanced by refining the input parameters and the data considered
over time.


Unit-IV
Answer all the questions
Part-A

1. What is MapReduce?

MapReduce is a data processing tool which is used to process data in parallel in a distributed
form. It was developed in 2004, on the basis of a paper titled "MapReduce: Simplified Data
Processing on Large Clusters," published by Google.

The MapReduce is a paradigm which has two phases, the mapper phase, and the reducer phase.
In the Mapper, the input is given in the form of a key-value pair. The output of the Mapper is fed
to the reducer as input. The reducer runs only after the Mapper is over. The reducer too takes
input in key-value format, and the output of reducer is the final output.

2. What are the Steps in Map Reduce?

Steps in Map Reduce

o The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys
will not be unique in this case.
o Using the output of Map, sort and shuffle are applied by the Hadoop architecture. This
sort and shuffle acts on these list of <key, value> pairs and sends out unique keys and a
list of values associated with this unique key <key, list(values)>.
o An output of sort and shuffle sent to the reducer phase. The reducer performs a defined
function on a list of values for unique keys, and Final output <key, value> will be
stored/displayed.
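The three steps above can be mimicked in plain Python with a toy word-count example; the sketch
below is illustrative and runs on a single machine, whereas a real MapReduce job would distribute
the map and reduce tasks across a cluster.

from collections import defaultdict

def map_phase(document):
    # Map: emit a <key, value> pair of (word, 1) for every word.
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Sort and shuffle: group all values belonging to the same unique key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)          # <key, list(values)>
    return grouped

def reduce_phase(grouped):
    # Reduce: apply a function (here, sum) to the list of values for each key.
    return {key: sum(values) for key, values in grouped.items()}

pairs = map_phase("big data is big")
print(reduce_phase(shuffle_phase(pairs)))   # {'big': 2, 'data': 1, 'is': 1}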

3. What is sort and shuffle?

The sort and shuffle occur on the output of Mapper and before the reducer. When the Mapper
task is complete, the results are sorted by key, partitioned if there are multiple reducers, and
then written to disk. Using the input from each Mapper <k2,v2>, we collect all the values for
each unique key k2. This output from the shuffle phase in the form of <k2, list(v2)> is sent as
input to reducer phase.

4. What are the usage of map reduce?

o It can be used in various application like document clustering, distributed sorting, and
web link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.

o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and
mobile environment.

5. What is hadoop?

Hadoop is an open source software programming framework for storing a large amount of data
and performing the computation. Its framework is based on Java programming with some native
code in C and shell scripts.
6. What is Hive?

Hive is an ETL and Data warehousing tool developed on top of Hadoop Distributed File System
(HDFS). Hive makes job easy for performing operations like

• Data encapsulation
• Ad-hoc queries
• Analysis of huge datasets

7. What is MapR?

MapR was a business software company headquartered in Santa Clara, California. MapR
software provides access to a variety of data sources from a single computer cluster,
including big data workloads such as Apache Hadoop and Apache Spark, a distributed file
system, a multi-model database management system, and event stream processing,
combining analytics in real-time with operational applications. Its technology runs on
both commodity hardware and public cloud computing services. In August 2019, following
financial difficulties, the technology and intellectual property of the company were sold
to Hewlett Packard Enterprise.[3][4]
8. What is S3?

S3 is a cloud object storage service offered by Amazon Web Services (AWS). It allows you to
store and access any amount of data from anywhere on the web. S3 is secure, durable, scalable
and cost-effective
9. What is simulation?

A simulation is the imitation of the operation of a real-world process or system over


time. Simulations require the use of models; the model represents the key characteristics or
behaviors of the selected system or process, whereas the simulation represents the evolution of
the model over time.
10. What is regulatory science?

Regulatory science is the scientific and technical basis for developing and evaluating regulations
in various industries, especially those involving health or safety. For example, the FDA uses
regulatory science to assess the safety, efficacy, quality, and performance of all FDA-regulated
products. Regulatory science can also involve developing new tools, standards, and approaches
for regulation.


11. What is distributed and scalable storage?

• HDFS is designed for scalability and fault tolerance. It stores data across
multiple machines (nodes) in a cluster, allowing it to handle vast amounts of data.
• Data is distributed in blocks, typically 128MB or 256MB in size. These blocks
are replicated across multiple nodes to ensure data durability and availability.

12. What is data reliability and fault tolerance?

• HDFS is highly fault-tolerant. It replicates data blocks across multiple nodes (usually
three by default) in the cluster. If a node or a block becomes unavailable, HDFS can still
access the data from a replica.
• The system constantly monitors the health of nodes and can automatically replace failed
nodes with their replicas.
13. What is data write and read patterns?

• HDFS is optimized for write-once, read-many-times patterns. It's well-suited for


scenarios where you need to store large volumes of data and perform batch processing
or analytics on it.
• Appending data to existing files is supported, but modifying existing data within a file
is not efficient.
14. What is block based storage?

• HDFS stores data in fixed-size blocks. This block-based approach simplifies data
storage and retrieval.
• It's particularly advantageous for handling large files efficiently, as you can parallelize
the processing of data across the distributed cluster.
15. What is master slave architecture?

• HDFS follows a master-slave architecture.


• The NameNode is the master server, responsible for managing metadata and
coordinating data access. It stores information about the structure and locations of all files and
directories.
• DataNodes are the slave servers that store the actual data blocks and report to
the NameNode about block health and availability.
16. What is data locality?

• HDFS promotes data locality, which means it tries to process data on the same
node where it is stored. This reduces data transfer over the network, improving
performance.
• MapReduce, a popular data processing framework in the Hadoop ecosystem, leverages
data locality for efficient processing.


17. What is high throughput and scalability?

• HDFS is designed for high throughput, allowing for efficient data streaming and
batch processing.
• It can scale horizontally by adding more commodity hardware to the cluster to
accommodate growing data needs.
18. What is interoperability?

• HDFS can be accessed using various programming languages and tools, including
Java, Python, and others.
• Several higher-level tools and frameworks, such as Apache Hive, Apache Pig,
and Apache Spark, integrate seamlessly with HDFS for data processing.
19. What is use cases?

• HDFS is commonly used in Big Data scenarios for storing and processing large
datasets for analytics, machine learning, log analysis, and more.
• It is well-suited for applications that require scalability and fault tolerance, such as web-
scale applications and data lakes.
20. What is data partitioning?

• Sharding divides the dataset into smaller, more manageable partitions called shards.
Each shard contains a subset of the data.
• The distribution of data across shards is typically based on a defined partitioning
key, which can be a specific column or attribute of the data.
Part-B
1. Elaborately explain about map reduce?

MapReduce is defined as a big data analysis model that processes data sets using a parallel
algorithm on computer clusters, typically Apache Hadoop clusters or cloud systems like Amazon
Elastic MapReduce (EMR) clusters. This article explains the meaning of MapReduce, how it
works, its features, and its applications.
MapReduce is a big data analysis model that processes data sets using a parallel algorithm on
computer clusters, typically Apache Hadoop clusters or cloud systems like Amazon Elastic
MapReduce (EMR) clusters.
A software framework and programming model called MapReduce is used to process enormous
volumes of data. Map and Reduce are the two stages of the MapReduce program’s operation.
Vast volumes of data are generated in today’s data-driven market due to algorithms and
applications constantly gathering information about individuals, businesses, systems and
Organizations.

The tricky part is figuring out how to quickly and effectively digest this vast volume of data
without losing insightful conclusions.
It used to be the case that the only way to access data stored in the Hadoop Distributed File
System (HDFS) was using MapReduce. Other query-based methods are now utilized to obtain
data from the HDFS using structured query language (SQL)-like commands, such as Hive and
Pig. These, however, typically run alongside tasks created using the MapReduce approach.
This is so because MapReduce has unique benefits. To speed up processing, MapReduce
executes logic (illustrated above) on the server where the data already sits, rather than
transferring the data to the location of the application or logic.
MapReduce first appeared as a tool for Google to analyze its search results. However, it quickly
grew in popularity thanks to its capacity to split and process terabytes of data in parallel,
producing quicker results.
MapReduce is essential to the operation of the Hadoop framework and a core component. While
“reduce tasks” shuffle and reduce the data, “map tasks” deal with separating and mapping the
data. MapReduce makes concurrent processing easier by dividing petabytes of data into smaller
chunks and processing them in parallel on Hadoop commodity servers. In the end, it collects all
the information from several servers and gives the application a consolidated output.
For example, let us consider a Hadoop cluster consisting of 20,000 affordable commodity servers
containing 256MB data blocks in each. It will be able to process around five terabytes worth of
data simultaneously. Compared to the sequential processing of such a big data set, the usage of
MapReduce cuts down the amount of time needed for processing.
To speed up the processing, MapReduce eliminates the need to transport data to the location
where the application or logic is housed. Instead, it executes the logic directly on the server home
to the data itself. Both the accessing of data and its storing are done using server disks. Further,
the input data is typically saved in files that may include organized, semi-structured, or
unstructured information. Finally, the output data is similarly saved in the form of files.
The main benefit of MapReduce is that users can scale data processing easily over several
computing nodes. The data processing primitives used in the MapReduce model are mappers and
reducers. Sometimes it is difficult to divide a data processing application into mappers and
reducers. However, scaling an application to run over hundreds, thousands, or tens of thousands
of servers in a cluster is just a configuration modification after it has been written in the


MapReduce manner.

2. Explain about hadoop?

Hadoop is an open-source framework that allows to store and process big data in a distributed
environment across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local computation and
storage.
This brief tutorial provides a quick introduction to Big Data, MapReduce algorithm, and Hadoop
Distributed File System.
Audience
This tutorial has been prepared for professionals aspiring to learn the basics of Big Data
Analytics using Hadoop Framework and become a Hadoop Developer. Software Professionals,
Analytics Professionals, and ETL developers are the key beneficiaries of this course.
Prerequisites
Before you start proceeding with this tutorial, we assume that you have prior exposure to Core
Java, database concepts, and any of the Linux operating system flavors.

3. What is Hive?Explain it.

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
This is a brief tutorial that provides an introduction on how to use Apache Hive HiveQL with
Hadoop Distributed File System. This tutorial can be your first step towards becoming a
successful Hadoop Developer with Hive.
Audience
This tutorial is prepared for professionals aspiring to make a career in Big Data Analytics using
Hadoop Framework. ETL developers and professionals who are into analytics in general may as
well use this tutorial to good effect.
Prerequisites
Before proceeding with this tutorial, you need a basic knowledge of Core Java, Database
concepts of SQL, Hadoop File system, and any of Linux operating system flavors.
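As a hedged illustration, the sketch below queries Hive from Python; it assumes the PyHive client
is installed, that HiveServer2 is reachable on localhost:10000, and that a table named
clickstream exists (none of these are stated in the text above). HiveQL itself closely resembles
SQL.

from pyhive import hive   # assumed third-party client, not mentioned in the text above

conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()
cursor.execute("""
    SELECT page, COUNT(*) AS hits      -- HiveQL: SQL-like query over data stored in HDFS
    FROM clickstream                   -- hypothetical table
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cursor.fetchall():
    print(page, hits)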

4. Discuss briefly about MapR?

MapR is one of the Big Data Distribution. It is a complete enterprise distribution for Apache
Hadoop which is designed to improve Hadoop’s reliability, performance, and ease of use.

Why MapR?
1. High Availability:

MapR provides high availability features such as self-healing, which means there is no single
NameNode architecture.

It also has JobTracker high availability and NFS support. MapR achieves this by distributing its
file system metadata.

2. Disaster Recovery:

MapR provides a mirroring facility which allows users to enable policies and mirror data. It
mirrors data automatically within a multi-node or single-node cluster, and between on-premises
and cloud infrastructure.

3. Record Performance:

MapR holds a world performance record, achieved at a cost of only $9 compared to an earlier cost
of $5M, at a speed of 54 seconds, and it handles large clusters of around 2,200 nodes.

4. Consistent Snapshots:

MapR is the only big data distribution which provides a consistent, point in time recovery
because of its unique read and writes storage architecture.

5. Complete Data Protection:

MapR has own security system for data protection in cluster level.

6. Compression:

MapR provides automatic behind the scenes compression to data. It applies compression
automatically to files in the cluster.

7. Unbiased Open Source:

MapR is a completely unbiased open-source distribution.

8. Real Multitenancy, Including YARN

9. Enterprise-grade NoSQL

10. Read and Write file system:

MapR has Read and Write file system.

MapR Ecosystem Packs (MEP):


The “MapR Ecosystem” is the set of open-source that is included in the MapR Platform, and the
“pack” means a bundled set of MapR Ecosystem projects with specific versions.

Mostly MapR Ecosystem Packs are released in every quarter and yearly also

A single version of MapR may support multiple MEPs, but only one at a time.

As in the familiar Hadoop ecosystem, open-source components such as Spark and Hive are included;
a MapR Ecosystem Pack also bundles tools such as the following:

Collectd

Elasticsearch

Grafana

Fluentd

Kibana

Open TSDB

5. Describe about sharding?

Sharding is a very important concept that helps the system to keep data in different resources
according to the sharding process. The word “Shard” means “a small part of a whole“. Hence
Sharding means dividing a larger part into smaller parts. In DBMS, Sharding is a type of
DataBase partitioning in which a large database is divided or partitioned into smaller data and
different nodes. These shards are not only smaller, but also faster and hence easily
manageable.
Need for Sharding:
Consider a very large database whose sharding has not been done. For example, let’s take a
DataBase of a college in which all the student’s records (present and past) in the whole college
are maintained in a single database. It would contain a very large number of records, say
100,000. Now, whenever we need to find a student in this database, around 100,000 records may
have to be scanned, which is very costly. Now consider the same college student records divided
into smaller data shards based on year. Each data shard will then have only around 1,000-5,000
student records. Not only does the database become much more manageable, but the per-query cost
is also reduced by a huge factor. This is why sharding is needed.

How does Sharding work?

In a sharded system, the data is partitioned into shards based on a predetermined criterion. For
example, a sharding scheme may divide the data based on geographic location, user ID, or time
period. Once the data is partitioned, it is distributed across multiple servers or nodes. Each
server or node is responsible for storing and processing a subset of the data.
To query data from a sharded database, the system needs to know which shard contains the
required data. This is achieved using a shard key, which is a unique identifier that is used to
map the data to its corresponding shard. When a query is received, the system uses the shard
key to determine which shard contains the required data and then sends the query to the
appropriate server or node.
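A minimal Python sketch of hash-based sharding with a shard key follows; the shard count, the key,
and the record values are illustrative assumptions. Range-based sharding (for example, by year,
as in the college example above) would map key ranges to shards instead of hashing.

import hashlib

NUM_SHARDS = 4   # illustrative shard count

def shard_for(user_id: str) -> int:
    # Hash the shard key and map it onto one of the shards.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

records = [("u1001", "Alice"), ("u1002", "Bob"), ("u1003", "Carol")]
for user_id, name in records:
    print(f"{user_id} ({name}) -> shard {shard_for(user_id)}")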

Features of Sharding:
• Sharding makes the Database smaller
• Sharding makes the Database faster
• Sharding makes the Database much more easily manageable
• Sharding can be a complex operation sometimes
• Sharding reduces the transaction cost of the Database
• Each shard reads and writes its own data.
• Many NoSQL databases offer auto-sharding.
• Failure of one shard doesn't affect the data processing of other shards.

Benefits of Sharding:

1. Improved Scalability: Sharding allows the system to scale horizontally by adding


more servers or nodes as the data grows. This improves the system’s capacity to
handle large volumes of data and requests.
2. Increased Performance: Sharding distributes the data across multiple servers or
nodes, which improves the system’s performance by reducing the load on each
server or node. This results in faster response times and better throughput.
3. Fault Tolerance: Sharding provides a degree of fault tolerance as the system can
continue to function even if one or more servers or nodes fail. This is because the
data is replicated across multiple servers or nodes, and if one fails, the others can
continue to serve the requests.
4. Reduced Costs: Sharding allows the system to scale horizontally, which can be
more cost-effective than scaling vertically by upgrading hardware. This is because
horizontal scaling can be done using commodity hardware, which is typically less
expensive than high-end servers.
6. Briefly explain about NoSQL Databases?

A NoSQL (originally referring to "non-SQL" or "non-relational")[1] database provides a


mechanism for storage and retrieval of data that is modeled in means other than the tabular
relations used in relational databases. Such databases have existed since the late 1960s, but the
name "NoSQL" was only coined in the early 21st century,[2] triggered by the needs of Web
2.0 companies.[3][4] NoSQL databases are increasingly used in big data and real-time
web applications.[5]
NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like
query languages or sit alongside SQL databases in polyglot-persistent architectures.[6][7]
Motivations for this approach include simplicity of design, simpler "horizontal"
scaling to clusters of machines (which is a problem for relational databases),[2] finer control
over availability, and limiting the object-relational impedance mismatch.[8] The data structures
used by NoSQL databases (e.g. key–value pair, wide column, graph, or document) are
different
from those used by default in relational databases, making some operations faster in NoSQL.
The particular suitability of a given NoSQL database depends on the problem it must solve.
Sometimes the data structures used by NoSQL databases are also viewed as "more flexible" than
relational database tables.[9]


Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of
availability, partition tolerance, and speed. Barriers to the greater adoption of NoSQL stores
include the use of low-level query languages (instead of SQL, for instance), lack of ability to
perform ad hoc joins across tables, lack of standardized interfaces, and huge previous
investments in existing relational databases.[10] Most NoSQL stores lack true ACID
transactions, although a few databases have made them central to their designs.
Instead, most NoSQL databases offer a concept of "eventual consistency", in which database
changes are propagated to all nodes "eventually" (typically within milliseconds), so queries for
data might not return updated data immediately or might result in reading data that is not
accurate, a problem known as stale read.[11] Additionally, some NoSQL systems may exhibit
lost writes and other forms of data loss.[12] Some NoSQL systems provide concepts such as
write- ahead logging to avoid data loss.[13] For distributed transaction processing across multiple
databases, data consistency is an even bigger challenge that is difficult for both NoSQL and
relational databases. Relational databases "do not allow referential integrity constraints to span
databases".[14] Few systems maintain both ACID transactions and X/Open XA standards for
distributed transaction processing.[15] Interactive relational databases share conformational relay
analysis techniques as a common feature. [16] Limitations within the interface environment are
overcome using semantic virtualization protocols, such that NoSQL services are accessible to
most operating systems.[17]
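As a hedged illustration of a document-oriented NoSQL store, the sketch below uses pymongo; it
assumes the driver is installed and a MongoDB server is running locally, and the database,
collection, and field names are made up for the example. Note the absence of a fixed schema and
of joins.

from pymongo import MongoClient   # assumed driver, not part of the text above

client = MongoClient("mongodb://localhost:27017")
users = client["shop"]["users"]           # database "shop", collection "users" (hypothetical)

# Documents need no predefined schema; each is a flexible key-value structure.
users.insert_one({"_id": "u42", "name": "Alice", "tags": ["prime", "beta"]})
users.update_one({"_id": "u42"}, {"$push": {"tags": "early-adopter"}})
print(users.find_one({"tags": "prime"}))  # query by a value inside an array field
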

7. What is S3?Explain it.

Amazon S3 (Simple Storage Service) is a scalable, high-speed, low-cost web-based service


designed for online backup and archiving of data and application programs. It allows you to
upload, store, and download any type of file up to 5 TB in size. This service allows the subscribers to
access the same systems that Amazon uses to run its own web sites. The subscriber has control
over the accessibility of data, i.e. privately/publicly accessible.
How to Configure S3?
Following are the steps to configure a S3 account.
Step 1− Open the Amazon S3 console using this link
− https://console.aws.amazon.com/s3/home
Step 2 − Create a Bucket using the following steps.
• A prompt window will open. Click the Create Bucket button at the bottom of the page.


 Create a Bucket dialog box will open. Fill the required details and click the Create button.

 The bucket is created successfully in Amazon S3. The console displays the list of buckets
and its properties.


 Select the Static Website Hosting option. Click the radio button Enable website hosting
and fill the required details.

Step 3 − Add an Object to a bucket using the following steps.


• Open the Amazon S3 console using the following link
− https://console.aws.amazon.com/s3/home
• Click the Upload button.

 Click the Add files option. Select those files which are to be uploaded from the system
and then click the Open button.


 Click the start upload button. The files will get uploaded into the bucket.
To open/download an object − In the Amazon S3 console, in the Objects & Folders list, right-
click on the object to be opened/downloaded. Then, select the required object.

How to Move S3 Objects?


Following are the steps to move S3 objects.
step 1 − Open Amazon S3 console.
step 2 − Select the files & folders option in the panel. Right-click on the object that is to be
moved and click the Cut option.


step 3 − Open the location where we want this object. Right-click on the folder/bucket where the
object is to be moved and click the Paste into option.

How to Delete an Object?


Step 1 − Open Amazon S3.
Step 2 − Select the files & folders option in the panel. Right-click on the object that is to be
deleted. Select the delete option.
Step 3 − A pop-up window will open for confirmation. Click Ok.


How to Empty a Bucket?


Step 1 − Open Amazon S3 console.
Step 2 − Right-click on the bucket that is to be emptied and click the empty bucket option.

Step 3 − A confirmation message will appear on the pop-up window. Read it carefully and click
the Empty bucket button to confirm.


Amazon S3 Features
• Low cost and Easy to Use − Using Amazon S3, the user can store a large amount
of data at very low charges.
• Secure − Amazon S3 supports data transfer over SSL and the data gets encrypted
automatically once it is uploaded. The user has complete control over their data by
configuring bucket policies using AWS IAM.
• Scalable − Using Amazon S3, there need not be any worry about storage
concerns. We can store as much data as we have and access it anytime.
• Higher performance − Amazon S3 is integrated with Amazon CloudFront, that
distributes content to the end users with low latency and provides high data
transfer speeds without any minimum usage commitments.
• Integrated with AWS services − Amazon S3 integrated with AWS services
include Amazon CloudFront, Amazon CLoudWatch, Amazon Kinesis, Amazon
RDS, Amazon Route 53, Amazon VPC, AWS Lambda, Amazon EBS, Amazon
Dynamo DB, etc.
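The same console operations can also be scripted. The hedged sketch below uses the AWS SDK for
Python (boto3) and assumes that credentials are already configured; the bucket and file names are
placeholders, not values from the text above.

import boto3

s3 = boto3.client("s3")

# Create a bucket (outside us-east-1 a LocationConstraint must also be supplied).
s3.create_bucket(Bucket="my-example-bucket")

# Add an object to the bucket, read it back, then delete it.
s3.upload_file("report.csv", "my-example-bucket", "reports/report.csv")
s3.download_file("my-example-bucket", "reports/report.csv", "report-copy.csv")
s3.delete_object(Bucket="my-example-bucket", Key="reports/report.csv")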

8. Explain about Hadoop Distributed File System?

Now that you are familiar with the term file system, let's begin with HDFS.
HDFS (Hadoop Distributed File System) is utilized for storage in a Hadoop cluster. It is mainly
designed to work on commodity hardware devices (devices that are inexpensive), following a
distributed file system design. HDFS is designed in such a way that it favors storing data in
large blocks rather than storing many small data blocks. HDFS in Hadoop provides fault tolerance
and high availability to the storage layer and the other devices present in that Hadoop cluster.
HDFS is capable of handling large data with high volume, velocity, and variety, which makes
Hadoop work more efficiently and reliably, with easy access to all its components. HDFS stores
the data in the form of blocks, where the size of each data block is 128MB. This size is
configurable, meaning you can change it according to your requirements in the hdfs-site.xml file
in your Hadoop directory.
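A back-of-the-envelope sketch of the block arithmetic, assuming the default 128 MB block size and
the default replication factor of 3 mentioned later in this unit (the file size is an example
value):

import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    # Number of fixed-size blocks the file is split into (the last block may be partial).
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Raw storage consumed across the cluster once every block is replicated.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, raw_storage_mb

blocks, raw = hdfs_footprint(1000)   # a 1 GB (1000 MB) file
print(blocks, "blocks,", raw, "MB of raw storage across the cluster")   # 8 blocks, 3000 MB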

Some Important Features of HDFS(Hadoop Distributed File System)

• It’s easy to access the files stored in HDFS.


• HDFS also provides high availability and fault tolerance.
• Provides scalability to scale nodes up or down as per our requirements.
• Data is stored in a distributed manner, i.e. various DataNodes are responsible for
storing the data.
• HDFS provides replication, because of which there is no fear of data loss.
• HDFS provides high reliability as it can store data in the range of petabytes.
• HDFS has in-built servers in Name node and Data Node that helps them to easily
retrieve the cluster information.
• Provides high throughput.


HDFS Storage Daemon’s

As we all know, Hadoop works on the MapReduce algorithm, which follows a master-slave
architecture; HDFS has a NameNode and DataNodes that work in a similar pattern.

1. NameNode(Master)
2. DataNode(Slave)

1. NameNode: NameNode works as a Master in a Hadoop cluster that Guides the


Datanode(Slaves). Namenode is mainly used for storing the Metadata i.e. nothing but the data
about the data. Meta Data can be the transaction logs that keep track of the user’s activity in a
Hadoop cluster.
Meta Data can also be the name of the file, size, and the information about the location(Block
number, Block ids) of Datanode that Namenode stores to find the closest DataNode for Faster
Communication. Namenode instructs the DataNodes with the operation like delete, create,
Replicate, etc.
As our NameNode is working as a Master it should have a high RAM or Processing power in
order to Maintain or Guide all the slaves in a Hadoop cluster. Namenode receives heartbeat
signals and block reports from all the slaves i.e. DataNodes.
2. DataNode: DataNodes work as slaves. DataNodes are mainly utilized for storing the data
in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The
more DataNodes your Hadoop cluster has, the more data can be stored, so it is advised
that DataNodes have a high storage capacity to hold a large number of file blocks.
Datanode performs operations like creation, deletion, etc. according to the instruction provided
by the NameNode.

Objectives and Assumptions Of HDFS

1. System Failure: As a Hadoop cluster consists of lots of nodes built from commodity
hardware, node failure is possible, so a fundamental goal of HDFS is to detect such failures
and recover from them.
2. Maintaining Large Datasets: As HDFS handles files of sizes ranging from GB to PB, it
has to be robust enough to deal with these very large data sets on a single cluster.
3. Moving Data is Costlier than Moving the Computation: If the computational operation is
performed near the location where the data is present, it is much faster; the overall
throughput of the system increases and network congestion is minimized, which is a good
assumption.
4. Portable Across Various Platforms: HDFS possesses portability, which allows it to switch
across diverse hardware and software platforms.
5. Simple Coherency Model: The Hadoop Distributed File System uses a write-once-read-many
access model for files. A file, once written and closed, should not be changed; data can only be
appended. This assumption helps to minimize data coherency issues. MapReduce fits
perfectly with this kind of file model.
6. Scalability: HDFS is designed to be scalable as the data storage requirements increase over
time. It can easily scale up or down by adding or removing nodes to the cluster. This helps to
ensure that the system can handle large amounts of data without compromising performance.
7. Security: HDFS provides several security mechanisms to protect data stored on the cluster.
It supports authentication and authorization mechanisms to control access to data, encryption
of data in transit and at rest, and data integrity checks to detect any tampering or corruption.
8. Data Locality: HDFS aims to move the computation to where the data resides rather than
moving the data to the computation. This approach minimizes network traffic and enhances
performance by processing data on local nodes.
9. Cost-Effective: HDFS can run on low-cost commodity hardware, which makes it a cost-
effective solution for large-scale data processing. Additionally, the ability to scale up or down
as required means that organizations can start small and expand over time, reducing upfront
costs.
10. Support for Various File Formats: HDFS is designed to support a wide range of file
formats, including structured, semi-structured, and unstructured data. This makes it easier to
store and process different types of data using a single system, simplifying data management
and reducing costs.

9. Describe about Preventing Private Information Inference Attacks on

Social Networks?
Online social networks, such as Facebook, are increasingly utilized by many people. These
networks allow users to publish details about themselves and to connect to their friends. Some of
the information revealed inside these networks is meant to be private. Yet it is possible to use
learning algorithms on released data to predict private information. In this paper, we explore how
to launch inference attacks using released social networking data to predict private information.
We then devise three possible sanitization techniques that could be used in various situations.

Then, we explore the effectiveness of these techniques and attempt to use methods of collective
inference to discover sensitive attributes of the data set. We show that we can decrease the
effectiveness of both local and relational classification algorithms by using the sanitization
methods we described.
SOCIAL networks are online applications that allow their users to connect by means of various
link types. As part of their offerings, these networks allow people to list details about themselves
that are relevant to the nature of the network. For instance, Facebook is a general-use social
network, so individual users list their favorite activities, books, and movies. Conversely,
LinkedIn is a professional network; because of this, users specify details which are related to
their professional life (i.e., reference letters, previous employment, and so on.) Because these
sites gather extensive personal information, social network application providers have a rare
opportunity: direct use of this information could be useful to advertisers for direct marketing.
However, in practice, privacy concerns can prevent these efforts [1]. This conflict between the
desired use of data and individual privacy presents an opportunity for privacy-preserving social
network data mining—that is, the discovery of information and relationships from social network
data without violating privacy.
Privacy concerns of individuals in a social network can be classified into two categories: privacy
after data release, and private information leakage.
Instances of privacy after data release involve the identification of specific individuals in a data
set subsequent to its release to the general public or to paying customers for a specific usage.
Perhaps the most illustrative example of this type of privacy breach (and the repercussions
thereof) is the AOL search data scandal.
10. Write an overview of Big Data Framework?

Frameworks provide structure. The core objective of the Big Data Framework is to provide a
structure for enterprise organisations that aim to benefit from the potential of Big Data. In order
to achieve long-term success, Big Data is more than just the combination of skilled people and
technology – it requires structure and capabilities.

The Big Data Framework was developed because – although the benefits and business cases of
Big Data are apparent – many organizations struggle to embed a successful Big Data practice in
their organization. The structure provided by the Big Data Framework provides an approach for
organizations that takes into account all organizational capabilities of a successful Big Data
practice. All the way from the definition of a Big Data strategy, to the technical tools and
capabilities an organization should have.

The main benefits of applying a Big Data framework include:


1. The Big Data Framework provides a structure for organisations that want to start
with Big Data or aim to develop their Big Data capabilities further.
2. The Big Data Framework includes all organisational aspects that should be taken
into account in a Big Data organization.
3. The Big Data Framework is vendor independent. It can be applied to any
organization regardless of choice of technology, specialisation or tools.
4. The Big Data Framework provides a common reference model that can be used
across departmental functions or country boundaries.
5. The Big Data Framework identifies core and measurable capabilities in each of its
six domains so that the organization can develop over time.

Big Data is a people business. Even with the most advanced computers and processors in the
world, organisations will not be successful without the appropriate knowledge and skills. The
Big Data Framework therefore aims to increase the knowledge of everyone who is interested in
Big Data. The modular approach and accompanying certification scheme aims to develop
knowledge about Big Data in a similar structured fashion.

The Big Data framework provides a holistic structure toward Big Data. It looks at the various
components that enterprises should consider while setting up their Big Data organization. Every
element of the framework is of equal importance and organisations can only develop further if
they provide equal attention and effort to all elements of the Big Data framework.

The Structure of the Big Data Framework

The Big Data framework is a structured approach that consists of six core capabilities that
organisations need to take into consideration when setting up their Big Data organization. The
Big Data Framework is depicted in the figure below:

The Big Data Framework consists of the following six main elements:

1. Big Data Strategy

Data has become a strategic asset for most organisations. The capability to analyse large data
sets and discern patterns in the data can provide organisations with a competitive advantage.
Netflix, for example, looks at user behaviour in deciding what movies or series to produce.
Alibaba, the Chinese sourcing platform, became one of the global giants by identifying which
suppliers to lend money to and recommend on its platform. Big Data has become Big Business.

In order to achieve tangible results from investments in Big Data, enterprise organisations need a
sound Big Data strategy. How can return on investments be realised, and where to focus effort in
Big Data analysis and analytics? The possibilities to analyse are literally endless and
organisations can easily get lost in the zettabytes of data. A sound and structured Big Data
strategy is the first step to Big Data success.

2. Big Data Architecture

In order to work with massive data sets, organisations should have the capabilities to store and
process large quantities of data. In order to achieve this, the enterprise should have the
underlying IT infrastructure to facilitate Big Data. Enterprises should therefore have a
comprehensive Big Data architecture to facilitate Big Data analysis. How should enterprises
design and set up their architecture to facilitate Big Data? And what are the requirements from
a storage and processing perspective?

The Big Data Architecture element of the Big Data Framework considers the technical
capabilities of Big Data environments. It discusses the various roles that are present within a Big
Data Architecture and looks at the best practices for design. In line with the vendor-independent
structure of the Framework, this section will consider the Big Data reference architecture of
the National Institute of Standards and Technology (NIST).

3. Big Data Algorithms

A fundamental capability of working with data is to have a thorough understanding of statistics
and algorithms. Big Data professionals therefore need to have a solid background in statistics
and algorithms to deduce insights from data. Algorithms are unambiguous specifications of how
to solve a class of problems. Algorithms can perform calculations, data processing and
automated reasoning tasks. By applying algorithms to large volumes of data, valuable
knowledge and insights can be obtained.

The Big Data algorithms element of the framework focuses on the (technical) capabilities of
everyone who aspires to work with Big Data. It aims to build a solid foundation that includes
basic statistical operations and provides an introduction to different classes of algorithms.

4. Big Data Processes

In order to make Big Data successful in an enterprise organization, it is necessary to consider
more than just skills and technology. Processes can help enterprises focus their direction.
Processes bring structure and measurable steps and can be effectively managed on a day-to-day
basis. Additionally, processes embed Big Data expertise within the organization by following
similar procedures and steps, embedding it as 'a practice' of the organization. Analysis becomes
less dependent on individuals, thereby greatly enhancing the chances of capturing value in
the long term.

5. Big Data Functions

Big Data functions are concerned with the organisational aspects of managing Big Data in
enterprises. This element of the Big Data framework addresses how organisations can structure
themselves to set up Big Data roles and discusses roles and responsibilities in Big Data
organisations. Organisational culture, organisational structures and job roles have a large
impact on the success of Big Data initiatives. We will therefore review some ‘best practices’ in
setting up enterprise big data

In the Big Data Functions section of the Big Data Framework, the non-technical aspects of Big
Data are covered. You will learn how to set up a Big Data Center of Excellence (BDCoE).
Additionally, it also addresses critical success factors for starting Big Data projects in the
organization.
6. Artificial Intelligence

The last element of the Big Data Framework addresses Artificial Intelligence (AI). One of the
major areas of interest in the world today, AI provides a whole world of potential. In this part
of the framework, we address the relation between Big Data and Artificial Intelligence and
outline key characteristics of AI.

Many organisations are keen to start Artificial Intelligence projects, but most are unsure where
to start their journey.

The Big Data Framework takes a functional view of AI in the context of bringing business
benefits to enterprise organisations. The last section of the framework therefore showcases how
AI follows as a logical next step for organisations that have built up the other capabilities of the
Big Data Framework. The last element of the Big Data Framework has been depicted as a
lifecycle on purpose: Artificial Intelligence can continuously learn from the Big Data in
the organization in order to provide long-lasting value.

11. Discuss about Preventing Private Information Inference Attacks on Social Networks?

Nowadays social media is becoming very popular and is used for marketing based on users'
profiles. For this purpose, social networking sites share user data with marketing companies,
and it is possible that these third-party companies misuse users' private data. Social networks
are a significant factor in multimedia mobile systems, where users can share their photos,
videos and other media files. On the other hand, the information shared on social media
platforms (e.g., user bio, posts, etc.) usually reveals a lot of users' private information, which
can be mined and misused for malicious reasons. To tackle privacy concerns, privacy-preserving
mechanisms have been adopted by many social network service providers, e.g. hiding user
profiles, anonymizing user identities, etc. As a result, attributes in user profiles are usually set
so that they can be accessed only by friends, in order to prevent the outflow of personal
information. To assess the effectiveness of current privacy-protecting mechanisms, different
attacks that infer these hidden attributes have been proposed. Most solutions are based on the
social networking links between users or on their behaviour. The proposed work is an inference
attack prevention model for social networking applications. To prevent inference attacks, a data
sanitization method is applied to the user's profile.

Social networking websites are virtual communities that foster interaction among the members
of a group by permitting them to connect with other users, post personal data and link their
personal profiles to others' profiles. In most cases, membership is attained by registering as a
user of the website's web community. Regularly visiting and interacting with people who use
that website makes one's network stronger. Many social networking websites are open to
anyone, while some are restricted to a specific real-world occupation or open only to people in
a certain age group. Members of social networking websites communicate by posting weblogs,
video and music streams, messages and chats. Members frequently form smaller communities
within the network. Social networking websites allow members to promote themselves and their
interests by posting individual profiles that contain enough information for others to determine
whether they are interested in associating with that person.

Opponents of social networking claim that it can be used to breach privacy and that it
encourages intrusive behaviour. Because many people are generous with the information they
post about themselves, these websites are frequently used to investigate a person's social
habits and character. Social networks permit users to publish details about themselves and to
connect with their friends. Some of the information revealed inside these networks is meant to
be private, yet it is possible to use machine learning algorithms on released data to predict that
private information. In this paper, we explore how to launch inference attacks to predict private
information by using released social networking data.

12. Describe about Applying Regulatory Science and Big Data to Improve Medical

Device Innovation
Understanding how proposed medical devices will interface with humans is a major challenge
that impacts both the design of innovative new devices and approval and regulation of existing
devices. Today, designing and manufacturing medical devices requires extensive and expensive
product cycles. Bench tests and other preliminary analyses are used to understand the range of
anatomical conditions, and animal and clinical trials are used to understand the impact of design
decisions upon actual device success. Unfortunately, some scenarios are impossible to replicate
on the bench, and competitive pressures often accelerate initiation of animal trials without
sufficient understanding of parameter selections. We believe these limitations can be overcome
through advancements in data-driven and simulation based medical device design and
manufacture, a research topic that draws upon and combines emerging work in the areas of
Regulatory Science and Big Data.
We propose a cross disciplinary grand challenge to develop and holistically apply new thinking
and techniques in these areas to medical devices in order to improve and accelerate medical
device innovation.

Unit-V
Answer all the questions
Part-A
1. What is R?

R is a programming language that is mainly used for statistical computing and graphics. It was
created by statisticians Ross Ihaka and Robert Gentleman in the 1990s, and it is now supported
by the R Core Team and the R Foundation for Statistical Computing. R is similar to the S
language, which was developed at Bell Laboratories by John Chambers and colleagues.
2. What is Data frames?

Data Frames

Data frames are data displayed in the format of a table.

Data frames can hold different types of data. For example, the first column can be character
while the second and third are numeric or logical. However, within a column, every value
should have the same type.
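
For instance, the following short sketch (the column names are made up purely for illustration)
builds a data frame whose columns hold character, numeric and logical data:

# Each column holds a single type, but the columns can differ from one another
employees <- data.frame(
  name   = c("Asha", "Ravi", "Meena"),   # character column
  age    = c(28, 34, 41),                # numeric column
  active = c(TRUE, FALSE, TRUE)          # logical column
)
print(employees)
str(employees)   # shows the type of each column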

3. What is classes?

Classes and Objects are basic concepts of Object-Oriented Programming that revolve around
the real-life entities. Everything in R is an object. An object is simply a data structure that has
some methods and attributes. A class is just a blueprint or a sketch of these objects. It
represents the set of properties or methods that are common to all objects of one type.
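
As a small illustrative sketch (using R's simple S3 system; the class name "student" is invented
for this example), an object is just a data structure tagged with a class, and a method can then
be shared by all objects of that class:

# Create an object (a list) and tag it with a class
s <- list(name = "Asha", marks = c(78, 85, 90))
class(s) <- "student"

# A print method common to all objects of class "student"
print.student <- function(x, ...) {
  cat("Student:", x$name, "| average mark:", mean(x$marks), "\n")
}

print(s)   # dispatches to print.student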
4. Give short notes on input/output?

R reads input from the console with functions such as readline() and scan(), and from files
with functions such as read.csv() and read.table(). Output is written to the console with
print() and cat(), and to files with functions such as write.csv() and write.table().
5. What is meant by string manipulations?

String Manipulation in R Programming

Here are a few of the string manipulation functions available in R’s base packages. We are going
to look at these functions in detail.

1. The nchar function


2. The toupper function
3. The tolower function
4. The substr function
5. The grep function
6. The paste function
7. The strsplit function
8. The sprintf function
9. The cat function
10. The sub function

6. What is meant by toupper()?

The R toupper() function is used to convert all characters of a string to uppercase. Any
symbol, space, or number in the string is ignored while applying this function; only alphabetic
characters are converted.
Syntax: toupper(x)
Parameters: x – Required. The text to be converted.
7. What is meant by tolower()?

tolower() method in R programming is used to convert the uppercase letters of string to


lowercase string.
Syntax: tolower(s)
Return: Returns the lowercase string.
8. Shortly discuss about strsplit()?

The strsplit() function in the R programming language is used to split the elements of the
specified character vector into substrings according to the split pattern given as its parameter.
Syntax: strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
9. What is nchar()?

The nchar() method in the R programming language is used to get the number of characters in a
string object.
Syntax: nchar(string), where string is a character object.
Return: Returns the length of the string.
10. Write a short notes on sprintf()?

The sprintf() function in R is a built-in function that returns formatted strings. You can use it to
control the number of digits, alignment, padding, and other aspects of how values are
displayed. For example, you can use sprintf("%f", x) to format a numeric value x with six digits
after the decimal point. You can also use other format specifiers such as %d for integers, %s for
strings, %e for scientific notation, and more. The sprintf() function is useful for generating
dynamic messages and formatting data for reporting.
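
As a brief illustration of the string manipulation functions listed above (the input strings are
arbitrary examples):

s <- "Big Data Analytics"

nchar(s)                        # 18 characters
toupper(s)                      # "BIG DATA ANALYTICS"
tolower(s)                      # "big data analytics"
substr(s, 1, 3)                 # "Big"
strsplit(s, " ")                # splits into "Big" "Data" "Analytics"
paste("Course:", s)             # "Course: Big Data Analytics"
sprintf("Score: %.2f%%", 92.5)  # "Score: 92.50%"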

11. What is Data analysis and statistics?

• R is renowned for its powerful capabilities in data analysis and statistical modeling.
It provides a vast array of built-in functions and packages for statistical analysis, hypothesis
testing, regression, and more.
• Users can perform data manipulation, cleansing, and transformation tasks with ease using
R's data manipulation libraries like dplyr and tidyr.
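
A minimal base-R sketch of these capabilities, using the built-in mtcars data set (the particular
model chosen here is only illustrative):

mean(mtcars$mpg)             # descriptive statistic
t.test(mtcars$mpg, mu = 20)  # simple hypothesis test

# Linear regression of fuel efficiency on weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)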
12. What is data visualization?

• R offers extensive data visualization tools, including the popular ggplot2 package,
which allows users to create highly customizable and publication-quality graphs and charts.
• It provides support for various plotting styles, such as scatter plots, bar
charts, histograms, box plots, and heatmaps.
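
A small sketch, assuming the ggplot2 package is installed, of a customizable scatter plot:

library(ggplot2)

# Scatter plot of car weight against fuel efficiency
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel efficiency vs. weight", x = "Weight (1000 lbs)", y = "Miles per gallon")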
13. What is extensive package ecosystem?

• R boasts a rich ecosystem of user-contributed packages available through the


Comprehensive R Archive Network (CRAN) and other repositories. These packages extend
R's capabilities for various specialized tasks and domains.
• Users can find packages for machine learning (e.g., caret, randomForest), data
manipulation (e.g., dplyr, tidyr), and domain-specific tasks (e.g., bioconductor for
bioinformatics).
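
A minimal sketch of installing and loading a CRAN package (dplyr is used here only as an
example):

install.packages("dplyr")   # download from CRAN (done once)
library(dplyr)              # load the package in the current session

# Use a function from the package
mtcars %>% filter(mpg > 25)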
14. What is reproducible research?

• R promotes reproducible research by allowing users to create scripts and documents


(e.g., R Markdown) that integrate code, data, and documentation in a single, shareable format.
• This makes it easier to collaborate, share findings, and ensure the transparency and
reproducibility of data analyses.
15. What is data import and export?

• R provides functions for importing data from a wide range of sources, including
CSV, Excel, SQL databases, and web APIs.
• Exporting results to various formats, such as CSV, Excel, PDF, and graphics files,
is straightforward.
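
A brief sketch of common import/export calls in base R (the file names are hypothetical):

# CSV and tab-separated files
sales <- read.csv("sales.csv")
write.table(sales, "sales.tsv", sep = "\t", row.names = FALSE)

# R's own binary format, convenient for saving intermediate results
saveRDS(sales, "sales.rds")
sales2 <- readRDS("sales.rds")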
16. What is interactive data analysis?

• R supports interactive data exploration and analysis through the use of graphical user
interfaces (GUIs) like RStudio and Jupyter notebooks.
• Users can explore data, execute code, and visualize results in real time.

17. What is scripting and programming?

• R is a full-fledged programming language with control structures like loops,


conditionals, and functions, making it highly flexible and suitable for complex data analysis
tasks.
• Users can write custom functions and scripts to automate repetitive tasks.
18. What is cross platform compatibility?

R is cross-platform and runs on various operating systems, including Windows, macOS, and
Linux.
19. What is integration and other tools?

R can be integrated with other data science and analytics tools and languages, including Python,
SQL, and tools for big data processing like Apache Spark.
20. What is map phase?

• The first phase of MapReduce is the "Map" phase, where the input data is divided into
smaller chunks, called splits.
• A user-defined function called the "Mapper" is applied to each split independently.
The Mapper takes an input record and emits key-value pairs based on some logic.
• The output of the Mapper is an intermediate set of key-value pairs, which are grouped by
key.
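
A tiny R sketch of the Map idea for a word-count job; this only illustrates the concept (splits, a
Mapper emitting key-value pairs, grouping by key) and is not Hadoop's actual API:

# Mapper: take one input record (a line) and emit (word, 1) key-value pairs
map_line <- function(line) {
  words <- unlist(strsplit(tolower(line), "[^a-z]+"))
  words <- words[words != ""]
  lapply(words, function(w) list(key = w, value = 1))
}

# One "split" of input records
split1 <- c("Big data needs big storage", "MapReduce maps then reduces")
intermediate <- unlist(lapply(split1, map_line), recursive = FALSE)

# Group the intermediate pairs by key, as the framework does before the Reduce phase
keys <- sapply(intermediate, function(p) p$key)
grouped <- split(sapply(intermediate, function(p) p$value), keys)
grouped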
Part-B
1. Write an overview of R language?

R is a programming language for statistical computing and graphics supported by the R Core
Team and the R Foundation for Statistical Computing. Created by statisticians Ross
Ihaka and Robert Gentleman, R is used among data miners, bioinformaticians and statisticians
for data analysis and developing statistical software.[7]
The core R language is augmented by a large number of extension packages containing
reusable code and documentation.
According to user surveys and studies of scholarly literature databases, R is one of the most
commonly used programming languages in data mining.[8] As of April 2023, R ranks 16th in
the TIOBE index, a measure of programming language popularity, in which the language peaked
in 8th place in August 2020.[9][10]
The official R software environment is an open-source free software environment released as
part of the GNU Project and available under the GNU General Public License. It is written
primarily in C, Fortran, and R itself (partially self-hosting). Precompiled executables are
provided for various operating systems. R has a command line interface.[11] Multiple third-
party graphical user interfaces are also available, such as RStudio, an integrated development
environment, and Jupyter, a notebook interface.

2. Describe briefly about control statements?

Control statements are expressions used to control the execution and flow of the program
based on the conditions provided in the statements. These structures are used to make a
decision after assessing the variable. In this article, we’ll discuss all the control statements with
the examples.
In R programming, there are 8 types of control statements as follows:
• if condition
• if-else condition
• for loop
• nested loops
• while loop
• repeat and break statement
• return statement
• next statement
if condition
This control structure checks whether the expression provided in parentheses is true or not. If
true, the statements inside the braces {} are executed.
Syntax:
if(expression){
statements
....
....
}
Example:

x <- 100

if(x > 10){


print(paste(x, "is greater than 10"))
}

Output:
[1] "100 is greater than 10"
if-else condition
It is similar to if condition but when the test expression in if condition fails, then statements
in else condition are executed.
Syntax:
if(expression){
statements
....

....
}
else{
statements
....
....
}
Example:

x <- 5

# Check value is less than or greater than 10


if(x > 10){
print(paste(x, "is greater than 10"))
}else{
print(paste(x, "is less than 10"))
}

Output:
[1] "5 is less than 10"
for loop
It is a type of loop or sequence of statements executed repeatedly until exit condition is
reached.
Syntax:
for(value in vector){
statements
....
....
}
Example:

x <- letters[4:10]

for(i in x){

print(i)
}

Output:
[1] "d"
[1] "e"
[1] "f"
[1] "g"
[1] "h"
[1] "i"
[1] "j"
Nested loops
Nested loops are similar to simple loops. Nested means loops inside loop. Moreover, nested
loops are used to manipulate the matrix.
Example:

# Defining matrix
m <- matrix(2:15, 2)

for (r in seq(nrow(m))) {
for (c in seq(ncol(m))) {
print(m[r, c])
}
}

Output:
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14

[1] 3
[1] 5
[1] 7
[1] 9
[1] 11
[1] 13
[1] 15
while loop
A while loop is another kind of loop that iterates as long as a condition is satisfied. The test
expression is checked first, before executing the body of the loop.
Syntax:
while(expression){
statement
....
....
}
Example:

x=1

# Print 1 to 5
while(x <= 5){
print(x)
x=x+1
}

Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

repeat loop and break statement


repeat is a loop which can be iterated many number of times but there is no exit condition to
come out from the loop. So, break statement is used to exit from the loop. break statement can
be used in any type of loop to exit from the loop.
Syntax:
repeat {
statements
....
....
if(expression) {
break
}
}
Example:

x = 1

# Print 1 to 5
repeat{
  print(x)
  x = x + 1
  if(x > 5){
    break
  }
}

Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

return statement
return statement is used to return the result of an executed function and returns control to the
calling function.
Syntax:
return(expression)
Example:

# Checks value is either positive, negative or zero


func <- function(x){
if(x > 0){
return("Positive")
}else if(x < 0){
return("Negative")
}else{
return("Zero")
}
}

func(1)
func(0)
func(-1)
Output:
[1] "Positive"
[1] "Zero"
[1] "Negative"
next statement
next statement is used to skip the current iteration without executing the further statements and
continues the next iteration cycle without terminating the loop.
Example:

# Defining vector
x <- 1:10

# Print even numbers


for(i in x){
if(i%%2 != 0){
next #Jumps to next loop
}
print(i)
}

Output:
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10

3. Discuss briefly about operators?

Operators

Operators are used to perform operations on variables and values.

In the example below, we use the + operator to add together two

values: Example

10 + 5

R divides the operators in the following groups:

• Arithmetic operators

• Assignment operators

• Comparison operators

• Logical operators

• Miscellaneous operators

R Arithmetic Operators

Arithmetic operators are used with numeric values to perform common mathematical operations:

Operator   Name   Example

+ Addition x+y

- Subtraction x-y

* Multiplication x*y

/ Division x/y

^ Exponent x^y

%% Modulus (Remainder from division) x %% y

%/% Integer Division x%/%y

R Assignment Operators

Assignment operators are used to assign values to variables:

Example
my_var <- 3

my_var <<- 3

3 -> my_var

3 ->> my_var

my_var # print my_var


Note: <<- is a global assigner. You will learn more about this in the Global Variable chapter.

It is also possible to reverse the direction of the assignment operator.

x <- 3 is equal to 3 -> x

R Comparison Operators

Comparison operators are used to compare two values:

Operator   Name   Example

== Equal x == y

!= Not equal x != y

> Greater than x>y

< Less than x<y

>= Greater than or equal to x >= y

<= Less than or equal to x <= y

R Logical Operators

Logical operators are used to combine conditional statements:

Operator   Description

&	Element-wise logical AND operator. It returns TRUE if both elements are TRUE

&&	Logical AND operator. Returns TRUE if both statements are TRUE

|	Element-wise logical OR operator. It returns TRUE if one of the statements is TRUE

||	Logical OR operator. It returns TRUE if one of the statements is TRUE

!	Logical NOT. Returns FALSE if the statement is TRUE

R Miscellaneous Operators

Miscellaneous operators are used to manipulate data:

Operator   Description                                    Example

:          Creates a series of numbers in a sequence      x <- 1:10

%in%       Finds out if an element belongs to a vector    x %in% y

%*%        Matrix multiplication                          x <- Matrix1 %*% Matrix2

4. What is functions?Explain it.

Functions are useful when you want to perform a certain task multiple times. A function
accepts input arguments and produces the output by executing valid R commands that are
inside the function. In R Programming Language when you are creating a function the function
name and the file in which you are creating the function need not be the same and you can
have one or more functions in R.

Creating a Function in R

Functions are created in R by using the command function(). The general structure of the
function file is as follows:

Functions in R Programming

Note: In the above syntax f is the function name, this means that you are creating a function
with name f which takes certain arguments and executes the following statements.
Types of Function in R Language
1. Built-in Function: Built-in functions in R are pre-defined functions that are
available in R programming languages to perform common tasks or operations.
2. User-defined Function: R language allow us to write our own function.
Built-in Function in R Programming Language
Here we will use built-in functions like sum(), max() and min().
R

# Find sum of numbers 4 to 6.


print(sum(4:6))

# Find max of numbers 4 to 6.
print(max(4:6))

# Find min of numbers 4 to 6.
print(min(4:6))
Output
[1] 15
[1] 6
[1] 4

User-defined Functions in R Programming Language



R provides built-in functions like print(), cat(), etc. but we can also create our own functions.
These functions are called user-defined functions.
Example
R

# A simple R function to check
# whether x is even or odd

evenOdd = function(x){
if(x %% 2 == 0)
return("even")
else
return("odd")
}

print(evenOdd(4))
print(evenOdd(3))
Output
[1] "even"
[1] "odd"
R Function Example – Single Input Single Output
Now create a function in R that will take a single input and gives us a single output.
Following is an example to create a function that calculates the area of a circle which takes in
the arguments the radius. So, to create a function, name the function as “areaOfCircle” and the
arguments that are needed to be passed are the “radius” of the circle.
R

# A simple R function to
# calculate area of a circle

areaOfCircle = function(radius){
area = pi*radius^2
return(area)
}

print(areaOfCircle(2))
Output
12.56637
R Function Example – Multiple Input Multiple Output
Now create a function in R Language that will take multiple inputs and gives us multiple
outputs using a list.
Functions in the R language take multiple input objects but return only one object as output.
This is, however, not a limitation, because you can create a list of all the outputs you want to
return; once the list is created you can access its elements and get the answers you want.



Let us consider this example: create a function "Rectangle" which takes the "length" and
"width" of the rectangle and returns the area and perimeter of that rectangle. Since the R
language can return only one object, we create one object, a list that contains "area" and
"perimeter", and return the list.

# A simple R function to calculate


# area and perimeter of a rectangle

Rectangle = function(length, width){


area = length * width
perimeter = 2 * (length + width)

# create an object called result which is


# a list of area and perimeter
result = list("Area" = area, "Perimeter" = perimeter)
return(result)
}

resultList = Rectangle(2, 3)
print(resultList["Area"])

print(resultList["Perimeter"])
Output
$Area
[1] 6

$Perimeter
[1] 10
Inline Functions in R Programming Language

Sometimes creating an R script file, loading it, executing it is a lot of work when you want to
just create a very small function. So, what we can do in this kind of situation is an inline
function.

To create an inline function you have to use the function command with the argument x and
then the expression of the function.

Example
R

# A simple R program to
# demonstrate the inline function

f = function(x) x^2*4+x/3

print(f(4))
print(f(-2))
print(0)
Output
65.33333
15.33333
0
Passing Arguments to Functions in R Programming Language
There are several ways you can pass the arguments to the function:
• Case 1: Generally in R, the arguments are passed to the function in the same order
as in the function definition.
• Case 2: If you do not want to follow any order what you can do is you can pass the
arguments using the names of the arguments in any order.
• Case 3: If the arguments are not passed the default values are used to execute the
function.
Now, let us see the examples for each of these cases in the following R code:
R

# A simple R program to demonstrate
# passing arguments to a function

Rectangle = function(length=5, width=4){


area = length * width
return(area)
}
# Case 1:
print(Rectangle(2, 3))
# Case 2:
print(Rectangle(width = 8, length = 4))
# Case 3:
print(Rectangle())
Output
6
32
20

Lazy Evaluations of Functions in R Programming Language


In R the functions are executed in a lazy fashion. When we say lazy what it means is if some
arguments are missing the function is still executed as long as the execution does not involve
those arguments.
Example
In the function “Cylinder” given below. There are defined three-argument “diameter”, “length”
and “radius” in the function and the volume calculation does not involve this argument
“radius” in this calculation. Now, when you pass this argument “diameter” and “length” even
though you are not passing this “radius” the function will still execute because this radius is
not used in the calculations inside the function.
Let’s illustrate this in an R code given below:
R

# A simple R program to demonstrate
# lazy evaluation of functions

Cylinder = function(diameter, length, radius){
  volume = pi*diameter^2*length/4
  return(volume)
}

# This'll execute because the
# radius is not used in the
# calculations inside the function.
print(Cylinder(5, 10))
Output
196.3495
If you do not pass the argument and then use it in the definition of the function it will throw an
error that this “radius” is not passed and it is being used in the function definition.
Example
R

# A simple R program to demonstrate
# lazy evaluation of functions

Cylinder = function(diameter, length, radius){
  volume = pi*diameter^2*length/4
  print(radius)
  return(volume)
}

# This'll throw an error
print(Cylinder(5, 10))

Output
Error in print(radius) : argument "radius" is missing, with no default
Other Built-in Functions in R
Functions Syntax

Mathematical Functions

a. abs() calculates a number’s absolute value.

b. sqrt() calculates a number’s square root.

c. round() rounds a number to the nearest integer.

d. exp() calculates a number’s exponential value

Functions Syntax

e. log() which calculates a number’s natural logarithm.

f. cos(), sin(), and tan() calculates a number’s cosine, sine, and tang.

Statistical Functions

a. mean() A vector’s arithmetic mean is determined by the mean() function.

b. median() A vector’s median value is determined by the median() function.

c. cor() calculates the correlation between two vectors.

d. var(), sd()   var() calculates the variance of a vector, and sd() calculates the standard
deviation of a vector.

Data Manipulation
Functions

a. unique() returns the unique values in a vector.

b. subset() subsets a data frame based on conditions.

c. aggregate() groups data according to a grouping variable.

d. order() uses ascending or descending order to sort a vector.

File Input/Output
Functions

a. read.csv() reads information from a CSV file.

b. write.csv() writes information to a CSV file.

c. read.table() reads information from a tabular file.

d. write.table() creates a tabular file with data.

5. Explain about environment and scope issues?

R Environment and Scope

In this tutorial, you will learn everything about environment and scope in R programming with
the help of examples.

In order to write functions in a proper way and avoid unusual errors, we need to know the
concept of environment and scope in R.

R Programming Environment

An environment can be thought of as a collection of objects (functions, variables etc.). An
environment is created when we first start the R interpreter.
The top-level environment available to us at the R command prompt is the global environment,
called R_GlobalEnv.
We can use the ls() function to show what variables and functions are defined in the current
environment. Moreover, we can use the environment() function to get the current environment.

Example of environment() function

# assign values to the variables a and b

a <- 2
b <- 5

# define a function and assign the value 0 to the parameter x

f <- function(x) x<-0

# ls() function list the objects in the current working environment

ls()

# retrieve the environment where a function is defined

environment()

Output

[1] "a" "b" "f"

<environment: R_GlobalEnv>

In the above example, we can see that a, b and f are in the R_GlobalEnv environment.

Notice that x (in the argument of the function) is not in this global environment. When we define
a function, a new environment is created.

Here, the function f() creates a new environment inside the global environment.

Actually an environment has a frame, which has all the objects defined, and a pointer to the
enclosing (parent) environment.

Hence, x is in the frame of the new environment created by the function f. This environment will
also have a pointer to R_GlobalEnv.

Example: Cascading of environments

f <- function(f_x){
  g <- function(g_x){
    print("Inside g")
    print(environment())
    print(ls())
  }
  g(5)
  print("Inside f")
  print(environment())
  print(ls())
}
f(6)
environment()

Output

[1] "Inside g"


<environment: 0x5649bd74dec8>
[1] "g_x"
[1] "Inside f"
<environment: 0x5649bd7471d0>
[1] "f_x" "g"
<environment: R_GlobalEnv>

In the above example, we have defined two nested functions: f and g.


The g() function is defined inside the f() function. When the f() function is called, it creates a
local variable g and defines the g() function within its own environment.
The g() function prints "Inside g", displays its own environment using environment(), and lists
the objects in its environment using ls().
After that, the f() function prints "Inside f", displays its own environment using environment(),
and lists the objects in its environment using ls().

R Programming Scope

In R programming, scope refers to the accessibility or visibility of objects (variables, functions,


etc.) within different parts of your code.

In R, there are two main types of variables: global variables and local variables.

Let's consider an example:

a <- 10            # global variable

outer_func <- function(){
  b <- 20          # local to outer_func
  inner_func <- function(){
    c <- 30        # local to inner_func
  }
}

Global Variables

Global variables are those variables which exist throughout the execution of a program. It can be
changed and accessed from any part of the program.

However, global variables also depend upon the perspective of a function.

For example, in the above example, from the perspective of inner_func(),


both a and b are global variables.
However, from the perspective of outer_func(), b is a local variable and only a is a global
variable. The variable c is completely invisible to outer_func().

Local Variables

On the other hand, local variables are those variables which exist only within a certain part of a
program, such as a function, and are released when the function call ends.
In the above program the variable c is called a local variable.
If we assign a value to a variable within the function inner_func(), the change is only local and
cannot be accessed outside the function.
This is also the same even if the names of a global variable and a local variable match.

For example, if we have a function as below.

outer_func <- function(){
  a <- 20
  inner_func <- function(){
    a <- 30
    print(a)
  }
  inner_func()
  print(a)
}
outer_func()
a <- 10
print(a)

Output

[1] 30
[1] 20
[1] 10

Here, the outer_func() function is defined, and within it, a local variable a is assigned the
value 20.

Inside outer_func(), there is an inner_func() function defined. The inner_func() function also
has its own local variable a, which is assigned the value 30.
When inner_func() is called within outer_func(), it prints the value of its local variable a (30).
Then, outer_func() continues executing and prints the value of its local variable a (20).
Outside the functions, a global variable a is assigned the value 10. The code then prints the
value of the global variable a (10).

Accessing global variables

Global variables can be read inside a function, but when we try to assign to one, a new local
variable is created instead.
To make assignments to global variables, the superassignment operator, <<-, is used.
When using this operator within a function, it searches for the variable in the parent environment
frame; if it is not found there, the search continues upward through the enclosing environments.
If the variable is still not found, it is created and assigned at the global level.

outer_func <- function(){
  inner_func <- function(){
    a <<- 30
    print(a)
  }
  inner_func()
  print(a)
}
outer_func()
print(a)

Output

[1] 30
[1] 30
[1] 30

When the statement a <<- 30 is encountered within inner_func(), it looks for the
variable a in the outer_func() environment.
When the search fails, it searches in R_GlobalEnv.
Since a is not defined in this global environment either, it is created and assigned there, and it
is now referenced and printed from within inner_func() as well as outer_func().


6. Elaborately explain about recursion?

Recursion, in the simplest terms, is a type of looping technique. It exploits the basic working of
functions in R.
Recursive Function in R:
Recursion is when the function calls itself. This forms a loop, where every time the function is
called, it calls itself again and again and this technique is known as recursion. Since the loops
increase the memory we use the recursion. The recursive function uses the concept of recursion
to perform iterative tasks they call themselves, again and again, which acts as a loop. These
kinds of functions need a stopping condition so that they can stop looping continuously.
Recursive functions call themselves. They break down the problem into smaller components.
The function() calls itself within the original function() on each of the smaller components.
After this, the results will be put together to solve the original problem.

Example: Factorial using Recursion in R


R

rec_fac <- function(x){
  if(x == 0 || x == 1){
    return(1)
  }
  else{
    return(x * rec_fac(x - 1))
  }
}

rec_fac(5)

Output:
[1] 120
Here, rec_fac(5) calls rec_fac(4), which then calls rec_fac(3), and so on until the input
argument x, has reached 1. The function returns 1 and is destroyed. The return value is
multiplied by the argument value and returned. This process continues until the first function
call returns its output, giving us the final result.

Example: Sum of Series Using Recursion
Recursion in R is most useful for finding the sum of self-repeating series. In this example, we
will find the sum of squares of a given series of numbers: Sum = 1^2 + 2^2 + … + N^2
Example:

R

sum_series <- function(vec){
  if(length(vec) <= 1){
    return(vec^2)
  }
  else{
    return(vec[1]^2 + sum_series(vec[-1]))
  }
}
series <- c(1:10)
sum_series(series)

Output:
[1] 385

sum_n <- function(n) {
  if (n == 1) {
    return(1)
  } else {
    return(n + sum_n(n - 1))
  }
}

# Test the sum_n function


sum_n(5)

Output:
[1] 15

In this example, the sum_n function recursively reduces n until it reaches 1, which is the base
case of the recursion, adding the current value of n to the sum of the first n-1 values.

exp_n <- function(base, n) {
  if (n == 0) {
    return(1)
  } else {
    return(base * exp_n(base, n - 1))
  }
}

# Test the exp_n function


exp_n(4, 5)

Output:
[1] 1024
In this example, the base case of the recursion is represented by the exp_n function, which
recursively multiplies the base by itself n times until n equals 0.

Key Features of R Recursion


• The use of recursion, often, makes the code shorter and it also looks clean.
• It is a simple solution for a few cases.
• It expresses in a function that calls itself.
Applications of Recursion in R
• Recursive functions are used in many efficient programming techniques like
dynamic programming or divide-and-conquer algorithms.
• In dynamic programming, for both top-down as well as bottom-up approaches,
recursion is vital for performance.
• In divide-and-conquer algorithms, we divide a problem into smaller sub-problems
that are easier to solve. The output is then built back up to the top. Recursion has a
similar process, which is why it is used to implement such algorithms.
• In its essence, recursion is the process of breaking down a problem into many
smaller problems, these smaller problems are further broken down until the problem
left is trivial. The solution is then built back up piece by piece.

Types of Recursion in R
1. Direct Recursion: The recursion that is direct involves a function calling itself
directly. This kind of recursion is the easiest to understand.

2. Indirect Recursion: An indirect recursion is a series of function calls in which one


function calls another, which in turn calls the original function.

3. Mutual Recursion: Multiple functions that call each other repeatedly make up
mutual recursion. To complete a task, each function depends on the others.

4. Nested Recursion: Nested recursion happens when one recursive function calls
another recursively while passing the output of the first call as an argument. The
arguments of one recursion are nested inside of this one.

5. Structural Recursion: Recursion that is based on the structure of the data is known
as structural recursion. It entails segmenting a complicated data structure into
smaller pieces and processing each piece separately.
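
As a small sketch of indirect/mutual recursion (the is_even/is_odd pair is a standard textbook
illustration, not part of the syllabus example set):

# Two functions that call each other until the base case n == 0 is reached
is_even <- function(n) if (n == 0) TRUE  else is_odd(n - 1)
is_odd  <- function(n) if (n == 0) FALSE else is_even(n - 1)

is_even(10)   # TRUE
is_odd(7)     # TRUE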

7. What is meant by replacement functions?

A replacement function is a function whose name ends in <- and that is used on the left-hand
side of an assignment. When R sees a call such as cutoff(x) <- 65, it rewrites it internally as
x <- `cutoff<-`(x, value = 65): the function receives the object and the value being assigned,
and whatever it returns replaces the original object. For example:

"cutoff<-" <- function(x, value){
  x[x > value] <- Inf
  x
}

and then we call cutoff with:

cutoff(x) <- 65

Here every element of x larger than 65 is replaced by Inf, and the modified vector is assigned
back to x.
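
A short usage sketch: the explicit backtick call shows what R does behind the scenes, and
names<- is an example of a built-in replacement function that works the same way:

x <- c(40, 70, 90)
cutoff(x) <- 65          # R evaluates this as: x <- `cutoff<-`(x, value = 65)
x                        # 40 Inf Inf

v <- 1:3
names(v) <- c("a", "b", "c")   # built-in replacement function
v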

8. Describe about Vectors?

Vectors

A vector is simply a list of items that are of the same type.

To combine the list of items to a vector, use the c() function and separate the items by a comma.

In the example below, we create a vector variable called fruits, that combines strings:

Example
fruits <- c("banana", "apple", "orange")
fruits

Example
# Vector with numerical values in a sequence
numbers <- 1:10

numbers

You can also create numerical values with decimals in a sequence, but note that if the last
element does not belong to the sequence, it is not used:

Example
# Vector of logical values
log_values <- c(TRUE, FALSE, TRUE, FALSE)

log_values

Vector Length

To find out how many items a vector has, use the length() function:

Example
fruits <- c("banana", "apple", "orange")

length(fruits)

Sort a Vector

To sort items in a vector alphabetically or numerically, use the sort() function:

Example

fruits <- c("banana", "apple", "orange", "mango", "lemon")


numbers <- c(13, 3, 5, 7, 20, 2)

sort(fruits) # Sort a string


sort(numbers) # Sort numbers

Access Vectors

You can access the vector items by referring to its index number inside brackets []. The first item
has index 1, the second item has index 2, and so on:

Example

fruits <- c("banana", "apple", "orange")

# Access the first item (banana)
fruits[1]

You can also access multiple elements by referring to different index positions with
the c() function:

Example

fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Access the first and third item (banana and orange)
fruits[c(1, 3)]

You can also use negative index numbers to access all items except the ones specified:

Example:

fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Access all items except for the first item
fruits[c(-1)]
Change an Item

To change the value of a specific item, refer to the index number:

Example

fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Change "banana" to "pear"


fruits[1] <- "pear"

# Print fruits
fruits

Repeat Vectors

To repeat vectors, use the rep() function:

Example
Repeat each value:

repeat_each <- rep(c(1,2,3), each = 3)

repeat_each

Example
Repeat the sequence of the vector:

repeat_times <- rep(c(1,2,3), times = 3)

repeat_times

Example
Repeat each value independently:

repeat_indepent <- rep(c(1,2,3), times = c(5,2,1))

repeat_indepent

Generating Sequenced Vectors

One of the examples on top, showed you how to create a vector with numerical values in a

sequence with the : operator:

Example

numbers <- 1:10

numbers

To make bigger or smaller steps in a sequence, use the seq() function:

Example

numbers <- seq(from = 0, to = 100, by = 20)

numbers

Note: The seq() function has three parameters: from is where the sequence starts, to is where the
sequence stops, and by is the interval of the sequence.

9. Briefly describe about matrices and arrays?

The data structure is a particular way of organizing data in a computer so that it can be used
effectively. The idea is to reduce the space and time complexities of different tasks. Data
structures in R programming are tools for holding multiple values. The two most important
data structures in R are Arrays and Matrices.

Arrays in R

Arrays are data storage objects in R containing more than or equal to 1 dimension. Arrays can
contain only a single data type. The array() function is an in-built function which takes input
as a vector and arranges them according to dim argument. Array is an iterable object, where
the array elements are indexed, accessed and modified individually. Operations on array can be
performed with similar structures and dimensions. Uni-dimensional arrays are called vectors in
R. Two-dimensional arrays are called matrices.

Syntax:
array(array1, dim = c (r, c, m), dimnames = list(c.names, r.names, m.names))
Parameters:
array1: a vector of values
dim: contains the number of matrices, m of the specified number of rows and columns
dimnames: contain the names for the dimensions

Example:

R

# R program to illustrate an array

# creating a vector
vector1 <- c("A", "B", "C")

# declaring a character array
uni_array <- array(vector1)
print("Uni-Dimensional Array")
print(uni_array)

# creating another vector
vector <- c(1:12)

# declaring 2 numeric multi-dimensional
# arrays with size 2x3
multi_array <- array(vector, dim = c(2, 3, 2))
print("Multi-Dimensional Array")
print(multi_array)

Output:

[1] "Uni-Dimensional Array"


[1] "A" "B" "C"
[1] "Multi-Dimensional Array"

,,1

[,1] [,2] [,3]


[1,] 1 3 5
[2,] 2 4 6

,,2

[,1] [,2] [,3]


[1,] 7 9 11
[2,] 8 10 12

Matrices in R

Matrix in R is a table-like structure consisting of elements arranged in a fixed number of rows


and columns. All the elements belong to a single data type. R contains an in-built
function matrix() to create a matrix. Elements of a matrix can be accessed by providing
indexes of rows and columns. The arithmetic operation, addition, subtraction, and
multiplication can be performed on matrices with the same dimensions. Matrices can be easily
converted to data frames and CSVs.

Syntax:
matrix(data, nrow, ncol, byrow)
Parameters:
data: contain a vector of similar data type elements.
nrow: number of rows.
ncol: number of columns.
byrow: By default matrices are in column-wise order. So this parameter decides how to
arrange the matrix

Example:

R

# R program to illustrate a matrix

A = matrix(
  # Taking a sequence of elements
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),

  # Number of rows and columns
  nrow = 3, ncol = 3,

  # By default matrices are in column-wise order,
  # so this parameter decides how to arrange the matrix
  byrow = TRUE
)

print(A)

Output:

[,1] [,2] [,3]


[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9


Arrays vs Matrices

Arrays:
• Can contain one or more dimensions.
• Array is a homogeneous data structure.
• It is a single vector arranged into the specified dimensions.
• The array() function can be used to create a matrix by specifying the third dimension to be 1.
• Arrays are a superset of matrices.
• Limited set of collection-based operations.
• Mostly intended for storage of data.

Matrices:
• Contain exactly 2 dimensions in a table-like structure.
• Matrix is also a homogeneous data structure.
• It comprises multiple equal-length vectors stacked together in a table.
• The matrix() function can be used to create at most a 2-dimensional array.
• Matrices are a subset, a special case of arrays where the number of dimensions is two.
• Wide range of collection operations possible.
• Mostly intended for data transformation.
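A small sketch (illustrative, not from the original notes) confirming the relationship described
above: an array whose third dimension is 1 behaves like a matrix once that dimension is dropped.

a <- array(1:6, dim = c(3, 2, 1))     # a 3x2x1 array
m <- matrix(1:6, nrow = 3, ncol = 2)  # a 3x2 matrix

identical(a[, , 1], m)  # TRUE - dropping the third dimension gives the matrix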

10. Discuss briefly about lists?

A list in R is a generic object consisting of an ordered collection of objects. Lists are one-
dimensional, heterogeneous data structures. The list can be a list of vectors, a list of matrices, a
list of characters and a list of functions, and so on.
A list is a vector but with heterogeneous data elements. A list in R is created with the use
of list() function. R allows accessing elements of an R list with the use of the index value. In
R, the indexing of a list starts with 1 instead of 0 like in other programming languages.
Creating a List
To create a List in R you need to use the function called “list()”. In other words, a list is a
generic vector containing other objects. To illustrate how a list looks, we take an example here.

We want to build a list of employees with the details. So for this, we want attributes such as ID,
employee name, and the number of employees.

Example:
R

# R program to create a List

# The first attribute is a numeric vector
# containing the employee IDs
empId = c(1, 2, 3, 4)

# The second attribute is the employee name,
# which is a character vector
empName = c("Debi", "Sandeep", "Subham", "Shiba")

# The third attribute is the number of employees,
# which is a single numeric variable
numberOfEmp = 4

# Combine these three different data types into a list
# containing the details of employees using list()
empList = list(empId, empName, numberOfEmp)

print(empList)
Output:
[[1]]
[1] 1 2 3 4

[[2]]
[1] "Debi" "Sandeep" "Subham" "Shiba"

[[3]]
[1] 4
Accessing components of a list
We can access components of an R list in two ways.
• Access components by names: All the components of a list can be named and we
can use those names to access the components of the R list using the dollar ($)
operator.

Example:
R

# R program to access
# components of a list

# Creating a list by naming all its components

empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "Total Staff" = numberOfEmp
)
print(empList)

# Accessing a top-level component by name
cat("Accessing name components using $ command\n")
print(empList$Names)

Output:
$ID
[1] 1 2 3 4

$Names
[1] "Debi" "Sandeep" "Subham" "Shiba"

$`Total Staff`
[1] 4

Accessing name components using $ command


[1] "Debi" "Sandeep" "Subham" "Shiba"
• Access components by indices: We can also access the components of an R list
using indices. To access the top-level components of an R list we use the double
bracket operator “[[ ]]”, and to access the lower or inner-level components of an R
list we use the single bracket “[ ]” together with the double bracket operator “[[ ]]”.
Example:
R

# R program to access
# components of a list

# Creating a list by naming all its components


empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "Total Staff" = numberOfEmp
)
print(empList)

# Accessing a top-level component by index
cat("Accessing name components using indices\n")
print(empList[[2]])

# Accessing an inner-level component by index
cat("Accessing Sandeep from name using indices\n")
print(empList[[2]][2])

# Accessing another inner-level component by index
cat("Accessing 4 from ID using indices\n")
print(empList[[1]][4])
Output:
$ID
[1] 1 2 3 4

$Names
[1] "Debi" "Sandeep" "Subham" "Shiba"

$`Total Staff`
[1] 4

Accessing name components using indices


[1] "Debi" "Sandeep" "Subham" "Shiba"
Accessing Sandeep from name using indices
[1] "Sandeep"
Accessing 4 from ID using indices
[1] 4

Modifying components of a list


An R list can also be modified by accessing its components and replacing them with the values
you want.


Example:

# R program to edit
# components of a list

# Creating a list by naming all its components

empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "Total Staff" = numberOfEmp
)
cat("Before modifying the list\n")
print(empList)

# Modifying the top-level component
empList$`Total Staff` = 5

# Modifying inner-level components
empList[[1]][5] = 5
empList[[2]][5] = "Kamala"

cat("After modifying the list\n")
print(empList)


Output:
Before modifying the list
$ID
[1] 1 2 3 4

$Names
[1] "Debi" "Sandeep" "Subham" "Shiba"

$`Total Staff`
[1] 4

After modifying the list


$ID
[1] 1 2 3 4 5


$Names
[1] "Debi" "Sandeep" "Subham" "Shiba" "Kamala"

$`Total Staff`
[1] 5
Concatenation of lists
Two R lists can be concatenated using the concatenation function c(). So, when we want to
concatenate two lists we pass them both to the concatenation operator.
Syntax:
list = c(list, list1)
list = the original list
list1 = the new list
Example:
R

# R program to concatenate lists

# Creating a list by naming all its components

empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "Total Staff" = numberOfEmp
)
cat("Before concatenation of the new list\n")
print(empList)

# Creating another vector
empAge = c(34, 23, 18, 45)

# Concatenation using the concatenation operator c()
# (empName and empAge are combined, producing a character vector)
empList = c(empName, empAge)

cat("After concatenation of the new list\n")
print(empList)


Output:
Before concatenation of the new list

$ID
[1] 1 2 3 4

$Names
[1] "Debi" "Sandeep" "Subham" "Shiba"

$`Total Staff`
[1] 4

After concatenation of the new list


[1] "Debi" "Sandeep" "Subham" "Shiba" "34" "23" "18" "45"
Deleting components of a list
To delete components of an R list, we first access those components by index and then place a
negative sign before the index, which indicates that the component should be deleted.
Example:
R

# R program to delete
# components of a list

# Creating a list by naming all its components

empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
  "ID" = empId,
  "Names" = empName,
  "Total Staff" = numberOfEmp
)
cat("Before deletion the list is\n")
print(empList)

# Deleting a top-level component
cat("After Deleting Total staff components\n")
print(empList[-3])

# Deleting an inner-level component
cat("After Deleting sandeep from name\n")
print(empList[[2]][-2])


Output:
Before deletion the list is
$ID
[1] 1 2 3 4

$Names
[1] "Debi" "Sandeep" "Subham" "Shiba"

$`Total Staff`
[1] 4

After Deleting Total staff components


$ID
[1] 1 2 3 4

$Names
[1] "Debi" "Sandeep" "Subham" "Shiba"

After Deleting sandeep from name


[1] "Debi" "Subham" "Shiba"
Merging list
We can merge R lists by combining all the lists into a single list using c().
R

# Create two lists.
lst1 <- list(1, 2, 3)
lst2 <- list("Sun", "Mon", "Tue")

# Merge the two lists.
new_list <- c(lst1, lst2)

# Print the merged list.
print(new_list)


Output:
[[1]]

[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] "Sun"

[[5]]
[1] "Mon"

[[6]]
[1] "Tue"
Converting List to Vector
Here we convert an R list to a vector: we create a list first and then unlist it into a vector.
R

# Create lists.
lst <- list(1:5)
print(lst)

# Convert the lists to vectors.
vec <- unlist(lst)

print(vec)
Output:
[[1]]
[1] 1 2 3 4 5

[1] 1 2 3 4 5


R List to matrix

We will create the matrix using the matrix() function in R. Another function that will be used is
unlist(), which converts the list into a vector first.
R

# Defining list
lst1 <- list(list(1, 2, 3),
             list(4, 5, 6))

# Print list
cat("The list is:\n")
print(lst1)
cat("Class:", class(lst1), "\n")

# Convert list to matrix
mat <- matrix(unlist(lst1), nrow = 2, byrow = TRUE)

# Print matrix
cat("\nAfter conversion to matrix:\n")
print(mat)
cat("Class:", class(mat), "\n")
Output:
The list is:
[[1]]
[[1]][[1]]
[1] 1

[[1]][[2]]
[1] 2

[[1]][[3]]
[1] 3

[[2]]
[[2]][[1]]
[1] 4


[[2]][[2]]
[1] 5

[[2]][[3]]
[1] 6

Class: list

After conversion to matrix:


[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Class: matrix

11. Describe briefly about R programming language?

R is a programming language for statistical computing and graphics supported by the R Core
Team and the R Foundation for Statistical Computing. Created by statisticians Ross Ihaka and
Robert Gentleman, R is used among data miners, bioinformaticians and statisticians for data
analysis and developing statistical software.[7] The core R language is augmented by a large
number of extension packages containing reusable code and documentation.

According to user surveys and studies of scholarly literature databases, R is one of the most
commonly used programming languages in data mining.[8] As of April 2023, R ranks 16th in
the TIOBE index, a measure of programming language popularity, in which the language peaked
in 8th place in August 2020.[9][10]

The official R software environment is an open-source free software environment released as
part of the GNU Project and available under the GNU General Public License. It is written
primarily in C, Fortran, and R itself (partially self-hosting). Precompiled executables are
provided for various operating systems. R has a command line interface.[11] Multiple third-
party graphical user interfaces are also available, such as RStudio, an integrated development
environment, and Jupyter, a notebook interface.


12. Discuss briefly about R data structures?

A data structure is a particular way of organizing data in a computer so that it can be used
effectively. The idea is to reduce the space and time complexities of different tasks. Data
structures in R programming are tools for holding multiple values.

R’s base data structures are often organized by their dimensionality (1D, 2D, or nD) and
whether they’re homogeneous (all elements must be of the identical type) or heterogeneous
(the elements are often of various types). This gives rise to the six data types which are most
frequently utilized in data analysis.
The most essential data structures used in R include:
• Vectors
• Lists
• Dataframes
• Matrices
• Arrays
• Factors

Vectors

A vector is an ordered collection of basic data types of a given length. The key point is that all
the elements of a vector must be of the same data type, i.e. vectors are homogeneous data
structures. Vectors are one-dimensional data structures.

Example:
R

# R program to illustrate Vector

# Vectors (ordered collection of same data type)
X = c(1, 3, 5, 7, 8)

# Printing those elements in console
print(X)
Output:
[1] 1 3 5 7 8

Lists

A list is a generic object consisting of an ordered collection of objects. Lists are heterogeneous
data structures. These are also one-dimensional data structures. A list can be a list of vectors,
list of matrices, a list of characters and a list of functions and so on.


Example:
R

# R program to illustrate a List

# The first attribute is a numeric vector
# containing the employee IDs,
# created using the 'c' command here
empId = c(1, 2, 3, 4)

# The second attribute is the employee name,
# which is a character vector
empName = c("Debi", "Sandeep", "Subham", "Shiba")

# The third attribute is the number of employees,
# which is a single numeric variable
numberOfEmp = 4

# Combine these three different data types into a list
# containing the details of employees using list()
empList = list(empId, empName, numberOfEmp)

print(empList)

Output:
[[1]]
[1] 1 2 3 4

[[2]]
[1] "Debi" "Sandeep" "Subham" "Shiba"

[[3]]
[1] 4

Dataframes
Dataframes are generic data objects of R which are used to store tabular data. Dataframes are
among the most popular data objects in R programming because we are comfortable seeing
data in tabular form.

They are two-dimensional, heterogeneous data structures.


These are lists of vectors of equal lengths.

Data frames have the following constraints placed upon them:

• A data-frame must have column names and every row should have a unique name.
• Each column must have the identical number of items.
• Each item in a single column must be of the same data type.
• Different columns may have different data types.
To create a data frame we use the data.frame() function.

Example:
R

# R program to illustrate dataframe

# A vector which is a character vector
Name = c("Amiya", "Raj", "Asish")

# A vector which is a character vector
Language = c("R", "Python", "Java")

# A vector which is a numeric vector
Age = c(22, 25, 45)

# To create a dataframe use the data.frame() command
# and pass each of the vectors we have created as arguments
df = data.frame(Name, Language, Age)
print(df)
Output:
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45

Matrices
A matrix is a rectangular arrangement of numbers in rows and columns. In a matrix, as we
know rows are the ones that run horizontally and columns are the ones that run vertically.
Matrices are two-dimensional, homogeneous data structures.


Now, let’s see how to create a matrix in R. To create a matrix in R you use the matrix()
function. Its arguments are the vector of elements, the number of rows, and the number of
columns you want in your matrix. An important point to remember is that, by default, matrices
are filled in column-wise order.
Example:
R

# R program to illustrate a matrix

A = matrix(
  # Taking a sequence of elements
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),

  # Number of rows and columns
  nrow = 3, ncol = 3,

  # By default matrices are in column-wise order,
  # so this parameter decides how to arrange the matrix
  byrow = TRUE
)

print(A)

Output:
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9

Arrays

Arrays are the R data objects which can store data in two or more dimensions. Arrays are n-
dimensional data structures. For example, if we create an array of dimensions (2, 3, 3) then it
creates 3 rectangular matrices, each with 2 rows and 3 columns. Arrays are homogeneous data
structures.

Now, let’s see how to create arrays in R. To create an array in R you need to use the function
called array(). The arguments to this array() are the set of elements in vectors and you have to
pass a vector containing the dimensions of the array.


Example:
R

# R program to illustrate an array

A = array(
  # Taking a sequence of elements
  c(1, 2, 3, 4, 5, 6, 7, 8),

  # Creating two rectangular matrices,
  # each with two rows and two columns
  dim = c(2, 2, 2)
)

print(A)
Output:
,,1
[,1] [,2]
[1,] 1 3
[2,] 2 4
,,2
[,1] [,2]
[1,] 5 7
[2,] 6 8

Factors
Factors are the data objects which are used to categorize data and store it as levels. They are
useful for storing categorical data. They can store both strings and integers. They are useful for
categorizing unique values in columns like “TRUE” or “FALSE”, or “MALE” or “FEMALE”,
etc. They are useful in data analysis for statistical modeling.

Now, let’s see how to create factors in R. To create a factor in R you need to use the function
called factor(). The argument to this factor() is the vector.


Example:
R

# R program to illustrate factors

# Creating factor using factor()


fac = factor(c("Male", "Female", "Male",
"Male", "Female", "Male", "Female"))

print(fac)

Output:
[1] Male Female Male Male Female Male Female
Levels: Female Male


ANNA UNIVERSITY IMPORTANT QUESTIONS[LAST YEARS]

1) Define Big Data Analytics?


 Big data means a "large volume of data". Analytics is not a tool or technology; it is a field of
computer science based on statistics and machine learning.
 Generally, Big Data Analytics means applying analytic techniques to large volumes of data and
breaking the problem into simpler ones.

2) What is the purpose of IDA?


 Intelligent Data Analysis (IDA) is concerned with finding models for, or structure in, data.
 Concerned with algorithms.
 Concerned with computer-intensive methods (resampling, Bayesian methods).

3) Define Nature of Data?


 There are three kinds:
*Numerical Data
=> Data is measurable (height, weight, amount).
=> Numbers can be discrete (7, 8) or continuous (160 cm, 34.6784).
*Text Data
=> Represents words and documents. Challenges: missing data, mis-recorded data, unclean data.
*Image Data
=> Graphical representation with high resolution.

4) What is Bootstrapping?
 It is a method of sample reuse.
 The main idea is to use the observed sample to estimate the population distribution
(a small R sketch is given below).
Three forms of bootstrapping:
 Non-Parametric (re-sampling)
 Semi-Parametric (adding noise)
 Parametric (simulation)
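A minimal non-parametric bootstrap sketch in R (illustrative only; the sample values are
assumed, not taken from the notes):

set.seed(1)
x <- c(5.1, 4.9, 6.2, 5.8, 5.5, 6.0)                            # observed sample
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))  # resample with replacement
quantile(boot_means, c(0.025, 0.975))                           # bootstrap interval for the mean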


5) Difference between Sampling and Re-Sampling.


Sampling:
*It is the probability distribution of a statistic obtained from a large number of samples.
*The techniques are random sampling and non-random sampling.

Re-Sampling:
*It is a method consisting of repeatedly drawing samples from the original sample.
*The techniques are bootstrapping, jackknifing and permutation tests.

6) Difference between Analysis and Reporting?


Analysis:
*Transforms information into insights.
*Purpose: the focus is on insights.
*Context: must tell a story.
*Output: pull (the user extracts information).

Reporting:
*Transforms data into information.
*Purpose: make data easy for the company to understand.
*Context: no context.
*Output: push (information is pushed to the user).

Unit – II

1) Define Simulated Annealing?


 It is motivated by physical annealing and is a metaheuristic for approximating the global
optimum of a given function.
 It is a probabilistic technique that allows downward (worsening) steps. It is a global
optimization technique.

2) What is stochastic search algorithm ?


 It is a search process driven by a random probability distribution.
 A stochastic process means that for each observation at a certain time there is a certain
probability of getting a certain outcome.
 In general, the probability depends on what was obtained in previous observations (the more
observations are made, the better we can predict the outcome).

3) Define Evaluation Strategies?


 It is an optimization technique based on ideas of evolution.
 It belongs to the evolutionary computation family of artificial intelligence methods.


4) Define Visualization and the tools used in it?


 Technique used for creating images, diagrams or animations to communicate a message.
 Data Visualization – Graphical representation of data.

Tools used in Visualization :


*Datameer *Tableau *Hunk *Plotly *Jaspersoft.

5) List out the types of view used in user interface techniques ?


 Mainly three types of view used in user interface techniques are
* Computation View
* User View
* Design View
Unit – III

1) Define Stream and issues in Stream Processing?


 Stream – a sequence of flowing data.
 Stream concept – two assumptions:
*Data arrives in a stream – it must be stored or processed immediately, otherwise it is lost.
*Data arrives rapidly – it is not feasible to store everything and interact with it later.
Issues in Stream Processing:
 Streams often deliver elements very rapidly.
 Each element must be processed in real time or it will be lost.

2) What is the purpose of Bloom Filtering?


 It is used to check whether an element is present in a set.
 It uses main memory as a bit array (e.g. space for 8 billion bits).
 Hash function – used to hash each member of S to a bit (that bit is set to 1; all other bits in the
array are 0).
 In the spam-filtering example: 1 – allow the mail, 0 – drop the mail (a small R sketch follows
below).
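A minimal Bloom filter sketch in R (illustrative only; the bit-array size and the two simple
modular hash functions are assumptions, not the notes' construction):

m <- 64
bits <- rep(0L, m)                        # bit array kept in main memory
h1 <- function(x) (x %% m) + 1            # first hash (R indices start at 1)
h2 <- function(x) ((3 * x + 7) %% m) + 1  # second hash

add <- function(x) bits[c(h1(x), h2(x))] <<- 1L             # set the hashed bits to 1
maybe_in_S <- function(x) all(bits[c(h1(x), h2(x))] == 1L)  # 1 -> allow, 0 -> drop

S <- c(10, 25, 99)            # e.g. trusted sender ids
invisible(lapply(S, add))

maybe_in_S(25)   # TRUE  -> allow (may be a false positive)
maybe_in_S(11)   # FALSE -> definitely not in S, drop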
3) Define Decaying Window?
 It allows you to identify the most popular recurring elements in an incoming data stream.
 In a decaying window, you assign a weight to every element of the incoming data stream.
 Calculate the aggregate sum for each distinct element by adding all the weights assigned
to that element.
 The element with the highest total score is listed as trending or popular (see the sketch below).
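A minimal decaying-window sketch in R (illustrative only; the decay constant and the toy
stream are assumptions):

c_decay <- 0.01
scores <- numeric(0)                       # aggregate score per distinct element

update <- function(scores, element) {
  scores <- scores * (1 - c_decay)         # decay every existing score
  if (is.na(scores[element])) scores[element] <- 0
  scores[element] <- scores[element] + 1   # weight 1 for the newly arrived element
  scores
}

stream <- c("a", "b", "a", "a", "c", "a", "b")
for (x in stream) scores <- update(scores, x)

names(which.max(scores))                   # most popular ("trending") element: "a"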


4) List out the Stream Sampling Techniques?


 The problem is obtaining a sample from a window (the sample size is smaller than the number of
elements in the window).
 Three types of window:
*Stationary Window – whose end points are at specified positions.
*Sliding Window – whose end points move forward.
*Generalized Window – the stream consists of a sequence of transactions (insert, delete).

5) Define RTAP and its application?


 It consists of dynamic analysis and reporting based on data as it enters the system.
 It has the capacity to use all the available enterprise data and resources when they are
needed.
 Also called Real Time Data Analytics (or) Real Time Data Integration.
Applications:
 Fraud detection systems for online transactions.
 Log analysis for understanding usage pattern.
6) Estimate the number of buckets and 1’s for a given input stream bit
101011000101110110010110 and add new stream bits 10101011.
 New bits are always appended to the right of the previous data.
 Combined stream: 101011000101110110010110 10101011, so N = 32.
 When a new bit 1 arrives: 101011000101110110010110 10101011 1
 Grouping the 1's from right to left into DGIM buckets (at most two buckets of each size) gives
buckets covering 8, 4, 4, 2 and 1 ones.
Therefore, 5 buckets are formed.

Unit- IV

1) How Map Reduce works?


 As the name suggests, the reduce phase takes place after the map phase is completed.
 First is the map job – a block of data is read and processed to produce key-value pairs as
intermediate output.
 Second is the reduce job – the output of the mappers is the input to the reducer; the reducer
receives key-value pairs from multiple map jobs and aggregates them (an illustrative R analogy
follows below).
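An illustrative word-count analogy in plain R (not actual Hadoop code; the input lines are
assumed): the map step emits (word, 1) pairs and the reduce step sums the counts per key.

lines <- c("big data", "data analytics", "big data analytics")

# Map phase: each line is split into words and emits (key = word, value = 1)
mapped <- unlist(lapply(strsplit(lines, " "),
                        function(ws) setNames(rep(1, length(ws)), ws)))

# Shuffle + Reduce phase: group the pairs by key and sum the values per word
reduced <- tapply(mapped, names(mapped), sum)
reduced   # analytics 2, big 2, data 3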


2) What is Hive? List out its advantages?


 Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
 It resides on top of Hadoop to summarize Big Data and makes querying and analyzing
easy.
Advantages:
 Fast – Hive is designed to quickly handle petabytes of data using batch processing.
 Hive provides a familiar, SQL-like interface.
 Hive is easy to distribute and scale based on your needs.

3) What is MapR ? List out its features.


 It is a distributed data platform for artificial intelligence and analytics that enables the
enterprise to apply data modeling to business processes with the goal of increasing revenue,
reducing cost and minimizing risk.
Features:
 It provides a distributed file system used for data storage and maintenance.
 Encryption of data transmitted to, from and within a cluster.

4) Define Hadoop and its component?


 Hadoop is an open source framework that is used to efficiently store and process large
datasets ranging in size from gigabytes to petabytes of data.
Components:
 HDFS => Hive => Map Reduce => Yarn => Flume

Unit- V

1) How to assign a string to a variable and check the length of a string in R-language ?
 Assigning a string to a variable is done with the variable followed by the <- operator and
the string.
Eg: str <- "Hello"
str # print the value of str
 To find the number of characters in a string, use the nchar() function.
Eg: str <- "hello world"
nchar(str)

2) List out the types of operators used R-language with an example?


 Arithmetic Operators ( +, -, *, /, ^, %%, %/% )
 Assignment Operators ( <- )
 Comparison Operators ( ==, !=, >, <, >=, <= )
 Logical Operators ( &, &&, |, ||, ! )
 Miscellaneous Operators ( :, %in%, %*% )
Eg: X <- 20
Y <- 4
Z <- X + Y # Z will be 24
