
A Technical Seminar Report

On
BIG DATA
Submitted by

A. KEERTHANA (18W91A05C7)

Under the Esteemed Guidance of


MD. REHEMAN PASHA

Professor, Department of CSE


To

JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY


HYDERABAD

In partial fulfillment of the requirements for award of degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
2021-22

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

MALLA REDDY INSTITUTE OF ENGINEERING AND TECHNOLOGY
(AUTONOMOUS)
DECLARATION

I hereby declare that the technical seminar report entitled “BIG DATA”, submitted to Malla Reddy
Institute of Engineering and Technology, affiliated to Jawaharlal Nehru Technological University
Hyderabad (JNTUH), for the award of the degree of Bachelor of Technology in Computer Science
& Engineering, is the result of an original industry-oriented seminar carried out by me.
It is further declared that this seminar report, or any part thereof, has not been previously submitted
to any University or Institute for the award of a degree or diploma.

A. KEERTHANA

18W91A05C7
MALLA REDDY INSTITUTE OF
ENGINEERING & TECHNOLOGY
(Sponsored by Malla Reddy Educational Society)
Accredited by NBA, Affiliated to JNTU, Hyd, Maisammaguda,
Dhulapally (post via Hakimpet), Sec’Bad-500 014.

Department of Computer Science and Engineering

BONAFIDE CERTIFICATE

This is to certify that this is the bonafide record of the Seminar titled “BIG DATA”, submitted
by A. KEERTHANA (18W91A05C7) of B. Tech in partial fulfillment of the requirements
for the degree of Bachelor of Technology in Computer Science & Engineering, Department of
Computer Science & Engineering, and that it has not been submitted for the award of any other
degree of this institution.

SEMINAR SUPERVISOR HOD

EXTERNAL EXAMINER PRINCIPAL


ACKNOWLEDGEMENT

First and foremost, I am grateful to the Principal Dr. M. ASHOK, for providing me with all the
resources in the college to make my Seminar a success. I thank him for his valuable suggestions
at the time of seminars which encouraged me to give my best in the Seminar.

I would like to express my gratitude to Dr. ANANTHA RAMAN G.R, Head of the Department of
Computer Science and Engineering, for his support and valuable suggestions during the dissertation
work.

I would like to express my gratitude to my Seminar coordinator Dr. B DHANALAXMI for her
support and valuable suggestions during the dissertation work.

I offer my sincere gratitude to my internal guide Md. REHEMAN PASHA, Professor of the
Department of Computer Science and Engineering, who has supported me throughout this Seminar
with his patience and valuable suggestions.

I would also like to thank all the supporting staff of the Department of Computer Science and
Engineering and all other departments who have been helpful, directly or indirectly, in making the
Seminar a success.

I am extremely grateful to my parents, whose blessings and prayers gave me the strength to
complete the Seminar.

A. KEERTHANA

18W91A05C7
INDEX

S.NO. CONTENT

1. Introduction

2. Architecture and components


• Data sources
• Data storage
• Batch Processing
• Real-time message ingestion
• Stream processing
• Analytical data store
• Analytical and reporting
• Orchestration

3. Technology
• Hadoop
• Data Mining
• Text Mining
• Predictive Analytics

4. Applications

5. Characteristics

6. Benefits

7. Challenges
• Big
• Constantly Changing and Updating
• Overwhelming Variety
• Messy

8. Future of Big Data


9. Conclusion
10. References
ABSTRACT

Big data refers to massive and complex data sets whose scale spans huge quantities of data, data
management capabilities, social media analytics, and real-time data. Big data analytics is the
process of examining these large amounts of heterogeneous digital data. Big data concerns data
volumes and large data sets measured in terms of terabytes or petabytes. This report presents the
5 Vs that characterize big data, along with the techniques and technologies used to handle it.

The challenges include capturing, analysis, storage, searching, sharing, visualization, transfer,
and privacy violations. Big data can neither be worked upon using traditional SQL queries nor
stored in a relational database management system (RDBMS). However, a wide variety of scalable
database tools and techniques has evolved. Hadoop, an open-source distributed data processing
framework, is one of the most prominent and well-known solutions. NoSQL databases, such as
MongoDB, provide non-relational alternatives for storage.
1. INTRODUCTION

Big Data may well be the Next Big Thing in the IT world. Big data burst upon the scene in the first
decade of the 21st century. The first organizations to embrace it were online and startup firms.
Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning.
Like many new information technologies, big data can bring about dramatic cost reductions,
substantial improvements in the time required to perform a computing task, or new product and
service offerings.

‘Big Data’ is similar to ‘small data’, but bigger in size; having bigger data requires different
approaches: techniques, tools, and architecture. The aim is to solve new problems, or old problems
in a better way. Big Data generates value from the storage and processing of very large quantities
of digital information that cannot be analyzed with traditional computing techniques.

• Walmart handles more than 1 million customer transactions every hour.

• Facebook handles 40 billion photos from its user base.

• Decoding the human genome originally took 10 years to process; now it can be achieved in one
week.

WHY BIG DATA

The growth of Big Data is driven by:

- Increase of storage capacities

- Increase of processing power

- Availability of data (of different types)

- Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been
created in the last two years alone.
• Facebook generates 10 TB of data daily.
• Twitter generates 7 TB of data daily.
• IBM claims 90% of today’s stored data was generated in just the last two years.
2. ARCHITECTURE AND COMPONENTS

A big data architecture is designed to handle the ingestion, processing, and analysis of data that is
too large or complex for traditional database systems. The threshold at which organizations enter
into the big data realm differs, depending on the capabilities of the users and their tools. For some,
it can mean hundreds of gigabytes of data, while for others it means hundreds of terabytes. As
tools for working with big datasets advance, so does the meaning of big data. More and more, this
term relates to the value you can extract from your data sets through advanced analytics, rather
than strictly the size of the data, although in these cases they tend to be quite large.

Over the years, the data landscape has changed. What you can do, or are expected to do, with data
has changed. The cost of storage has fallen dramatically, while the means by which data is
collected keeps growing. Some data arrives at a rapid pace, constantly demanding to be collected
and observed. Other data arrives more slowly, but in very large chunks, often in the form of
decades of historical data. You might be facing an advanced analytics problem, or one that requires
machine learning. These are challenges that bigdata architectures seek to solve.

Big data solutions typically involve one or more of the following types of workload:

• Batch processing of big data sources at rest.


• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.

Consider big data architectures when you need to:

• Store and process data in volumes too large for a traditional database.
• Transform unstructured data for analysis and reporting.
• Capture, process, and analyze unbounded streams of data in real time, or with low latency.

Components of a big data architecture


The following diagram shows the logical components that fit into a big data architecture.
Individual solutions may not contain every item in this diagram.

Most big data architectures include some or all of the following components:

Data sources:

All big data solutions start with one or more data sources. Examples include:

-Application data stores, such as relational databases.

-Static files produced by applications, such as web server log files.

-Real-time data sources, such as IoT devices.


Data storage:

Data for batch processing operations is typically stored in a distributed file store that can hold high
volumes of large files in various formats. This kind of store is often called a data lake. Options for
implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.

Batch processing:

Because the data sets are so large, often a big data solution must process data files using long-
running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these
jobs involve reading source files, processing them, and writing the output to new files. Options
include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom
Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an
HDInsight Spark cluster.
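The filter/aggregate/write pattern described above can be sketched in plain Python. A real solution would run the equivalent logic as a distributed Spark or MapReduce job; the record fields used here (`store`, `amount`) are illustrative assumptions:

```python
from collections import defaultdict

def batch_job(records):
    """Filter out malformed records, then aggregate totals per key."""
    totals = defaultdict(float)
    for rec in records:
        # Filter: skip records missing the fields we need.
        if "store" not in rec or "amount" not in rec:
            continue
        # Aggregate: running sum of sales per store.
        totals[rec["store"]] += rec["amount"]
    return dict(totals)

source = [
    {"store": "A", "amount": 10.0},
    {"store": "B", "amount": 5.5},
    {"store": "A", "amount": 2.5},
    {"malformed": True},
]
print(batch_job(source))  # {'A': 12.5, 'B': 5.5}
```

In a cluster setting, the filter and the per-key aggregation would be distributed across many machines, but the shape of the job is the same: read source records, process them, write the output.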

Real-time message ingestion:

If the solution includes real-time sources, the architecture must include a way to capture and store
real-time messages for stream processing. This might be a simple data store, where incoming
messages are dropped into a folder for processing. However, many solutions need a message
ingestion store to act as a buffer for messages, and to support scale-out processing, reliable delivery,
and other message queuing semantics. This portion of a streaming architecture is often referred to
as stream buffering. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
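The buffering role of an ingestion store can be illustrated with a minimal in-memory sketch. Real systems such as Kafka or Event Hubs add durability, partitioning, and delivery guarantees far beyond this toy:

```python
from collections import deque

class MessageBuffer:
    """Minimal in-memory stand-in for a message ingestion buffer.

    Producers publish faster than consumers read; the buffer decouples
    the two. Here, the oldest messages are dropped once capacity is
    exceeded (real brokers instead persist messages to disk).
    """
    def __init__(self, capacity):
        self.queue = deque(maxlen=capacity)

    def publish(self, msg):
        self.queue.append(msg)

    def consume(self):
        # Return the oldest buffered message, or None if empty.
        return self.queue.popleft() if self.queue else None

buf = MessageBuffer(capacity=2)
for m in ("m1", "m2", "m3"):   # producer runs ahead of the consumer
    buf.publish(m)
print(buf.consume())  # m2 -- m1 was dropped when capacity was exceeded
```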

Stream processing:

After capturing real-time messages, the solution must process them by filtering, aggregating, and
otherwise preparing the data for analysis. The processed stream data is then written to an output
sink. Azure Stream Analytics provides a managed stream processing service based on perpetually
running SQL queries that operate on unbounded streams. You can also use open source Apache
streaming technologies like Storm and Spark Streaming in an HDInsight cluster.
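A minimal sketch of the filter-then-aggregate step, using a Python generator over fixed windows of three readings; the window size and the "negative values are invalid" rule are illustrative assumptions:

```python
def process_stream(events, window=3):
    """Filter incoming events, then emit an aggregate per fixed window."""
    batch = []
    for value in events:
        if value < 0:          # filter: drop invalid readings
            continue
        batch.append(value)
        if len(batch) == window:
            yield sum(batch) / window   # aggregate: window average
            batch = []

readings = [4, -1, 6, 8, 2, 2, 2]
print(list(process_stream(readings)))  # [6.0, 2.0]
```

Engines like Stream Analytics or Spark Streaming apply the same idea continuously over unbounded input, writing each window's result to an output sink instead of yielding it.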
Analytical data store:

Many big data solutions prepare data for analysis and then serve the processed data in a structured
format that can be queried using analytical tools. The analytical data store used to serve these
queries can be a Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency
NoSQL technology such as HBase, or an interactive Hive database that provides a metadata
abstraction over data files in the distributed data store. Azure Synapse Analytics provides a
managed service for large-scale, cloud-based data warehousing. HDInsight supports Interactive
Hive, HBase, and Spark SQL, which can also be used to serve data for analysis.
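Serving processed data through SQL can be illustrated with Python's built-in sqlite3 standing in for the analytical store; the `daily_sales` schema is a made-up example:

```python
import sqlite3

# sqlite3 stands in for the analytical data store; a real solution would
# point the same kind of query at a warehouse or Hive/Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# Analysts query the processed, structured data with ordinary SQL.
rows = conn.execute(
    "SELECT region, SUM(total) FROM daily_sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 150.0), ('south', 80.0)]
```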

Analysis and reporting:

The goal of most big data solutions is to provide insights into the data through analysis and
reporting. To empower users to analyze the data, the architecture may include a data modeling
layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services.
It might also support self-service BI, using the modeling and visualization technologies in
Microsoft Power BI or Microsoft Excel. Analysis and reporting can also take the form of
interactive data exploration by data scientists or data analysts. For these scenarios, many Azure
services support analytical notebooks, such as Jupyter, enabling these users to leverage their
existing skills with Python or R. For large-scale data exploration, you can use Microsoft R Server,
either standalone or with Spark.

Orchestration:

Most big data solutions consist of repeated data processing operations, encapsulated in workflows,
that transform source data, move data between multiple sources and sinks, load the processed data
into an analytical data store, or push the results straight to a report or dashboard. To automate these
workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie
and Sqoop.
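The orchestration idea — running dependent processing steps in the right order — can be sketched as a tiny dependency-ordered workflow runner. Real orchestrators like Data Factory or Oozie add scheduling, retries, and monitoring; the task names here are invented:

```python
def run_workflow(tasks, dependencies):
    """Run tasks in dependency order, like a minimal orchestrator.
    `tasks` maps name -> callable; `dependencies` maps name -> prerequisites."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for dep in dependencies.get(name, []):
            run(dep)           # ensure prerequisites ran first
        tasks[name]()          # execute the task itself
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

log = []
tasks = {
    "load": lambda: log.append("loaded"),
    "transform": lambda: log.append("transformed"),
    "ingest": lambda: log.append("ingested"),
}
deps = {"load": ["transform"], "transform": ["ingest"]}
print(run_workflow(tasks, deps))  # ['ingest', 'transform', 'load']
```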
3. TECHNOLOGY

Analytics comprises various technologies that help you get the most valuable information from the
data.

Hadoop
Hadoop is an open-source framework widely used to store large amounts of data and run various
applications on clusters of commodity hardware. It has become a key big data technology because
of the constant increase in the variety and volume of data, and its distributed computing model
provides faster access to data.
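Hadoop's MapReduce model can be illustrated with the classic word count, here as plain single-machine Python; Hadoop itself would distribute the map and reduce phases across the cluster:

```python
from collections import Counter
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in a line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each word.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data is big", "data is everywhere"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

On a cluster, mappers run in parallel on different file blocks and the framework shuffles all pairs with the same word to one reducer — the code's structure mirrors that split.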

Data Mining
Once the data is stored in the data management system, you can use data mining techniques to
discover patterns for further analysis and to answer complex business questions. With data mining,
repetitive and noisy data can be removed, leaving only the relevant information, which accelerates
the pace of making informed decisions.
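As a toy illustration of pattern discovery with noise removal, the sketch below counts item pairs that co-occur in purchase "baskets" — the core idea behind Apriori-style frequent-itemset mining. The basket data and the support threshold are invented:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support=2):
    """Toy frequent-itemset mining: keep item pairs that appear together
    in at least `min_support` transactions."""
    counts = Counter()
    for items in transactions:
        # set() removes noisy duplicates within one basket.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["bread", "milk", "milk"],      # noisy duplicate entry
    ["bread", "milk", "eggs"],
    ["bread", "eggs"],
]
print(frequent_pairs(baskets))  # {('bread', 'milk'): 2, ('bread', 'eggs'): 2}
```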

Text Mining
With text mining, we can analyze text data from the web, such as comments and likes from social
media, and other text-based sources like email; for example, we can identify whether a mail is
spam. Text mining uses technologies like machine learning and natural language processing to
analyze large amounts of data and discover various patterns.
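The spam example can be sketched with a deliberately naive keyword score; a real text-mining system would use a trained model (e.g. naive Bayes) rather than this fixed, assumed keyword list:

```python
def spam_score(text, spam_words=("free", "winner", "prize", "click")):
    """Naive text-mining sketch: score a mail by the fraction of its
    words that match a spam keyword list."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!") in spam_words)
    return hits / len(words)

mail = "Click now free prize inside"
print(spam_score(mail))  # 0.6 -- 3 of 5 words are spam keywords
```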

Predictive Analytics
Predictive analytics uses data, statistical algorithms, and machine learning techniques to identify
future outcomes based on historical data. It’s all about providing the best future outcomes so that
organizations can feel confident in their current business decisions.
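A minimal predictive-analytics example: fit a line to historical data with ordinary least squares, then use it to predict a future value. The monthly sales figures are invented:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b -- the simplest predictive
    model: learn from historical data, then extrapolate."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Historical data: monthly sales growing roughly linearly.
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]
a, b = fit_line(months, sales)
print(a * 5 + b)  # 18.0 -- predicted sales for month 5
```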
4. APPLICATIONS

The term Big Data refers to large amounts of complex and unprocessed data. Nowadays, companies
use Big Data to make business more informative and to support business decisions by enabling
data scientists, analytical modelers, and other professionals to analyze large volumes of
transactional data. Big data is the valuable and powerful fuel that drives the large IT industries of
the 21st century, and it is a spreading technology used in every business sector.

Travel and tourism are major users of Big Data. It enables forecasting of travel facility requirements
at multiple locations, improving business through dynamic pricing, and much more.

The financial and banking sectors use big data technology extensively. Big data analytics
helps banks understand customer behaviour on the basis of investment patterns, shopping trends,
motivation to invest, and inputs obtained from personal or financial backgrounds.

Big data has started making a massive difference in the healthcare sector: with the help of
predictive analytics, medical professionals and healthcare personnel can now provide personalized
healthcare to individual patients.
Telecommunications and the multimedia sector are also major users of Big Data, with zettabytes
of data generated every day; handling data at this scale requires big data technologies.

The government and military also use the technology at high rates. Consider the figures the
government keeps on record: in the military, a single fighter plane needs to process petabytes
of data.
Government agencies use Big Data to run agencies, manage utilities, deal with traffic jams, and
counter crimes such as hacking and online fraud.

Aadhar Card:

The government has a record of 1.21 billion citizens. This vast data is analyzed and stored to find
things such as the number of youth in the country, and schemes are built to target the maximum
population. Such data cannot be stored in a traditional database, so it is stored and analyzed using
Big Data analytics tools.

E-commerce is also an application of Big Data. Maintaining relationships with customers is
essential for the e-commerce industry. E-commerce websites use Big Data for marketing ideas to
retail merchandise to customers, manage transactions, and implement better, more innovative
strategies to improve business.

o Amazon:
o Amazon is a tremendous e-commerce website dealing with lots of traffic daily. But when
there is a pre-announced sale on Amazon, traffic increases rapidly and may crash the
website. To handle this type of traffic and data, Amazon uses Big Data, which helps in
organizing and analyzing the data for future use.

Social Media is the largest data generator. Statistics show that around 500+ terabytes of fresh
data are generated from social media daily, particularly on Facebook. The data mainly contains
videos, photos, message exchanges, etc. A single activity on a social media site generates a great
deal of stored data, which gets processed when required. Because the stored data runs to terabytes
(TB), processing it takes a lot of time, and Big Data is the solution to this problem.
5. CHARACTERISTICS

Big Data consists of amounts of data too large to be processed by traditional data storage or
processing units. It is used by many multinational companies to process data and run the business
of many organizations. The data flow would exceed 150 exabytes per day before replication.

There are five Vs of Big Data that explain its characteristics:

o Volume
o Veracity
o Variety
o Value
o Velocity
The name Big Data itself relates to enormous size. Volume refers to the vast amounts of data
generated daily from many sources, such as business processes, machines, social media platforms,
networks, human interactions, and many more.

Facebook alone can generate approximately a billion messages, 4.5 billion recorded presses of the
"Like" button, and more than 350 million new post uploads each day. Big data technologies can
handle such large amounts of data.

Big Data can be structured, unstructured, or semi-structured, collected from different sources. In
the past, data was collected only from databases and spreadsheets, but these days data comes in
an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc. The data is
categorized as below:

a. Structured data:

Structured data has a well-defined schema with all the required columns. It is in tabular
form and is stored in a relational database management system.

b. Semi-structured:

In semi-structured data, the schema is not rigidly defined, e.g., JSON, XML, CSV, TSV,
and email. Such data carries some organizing markers but is not confined to the fixed
tables of a relational system.

c. Unstructured Data:

All the unstructured files, such as log files, audio files, and image files, are included in
unstructured data. Some organizations have a lot of data available, but they do not know
how to derive value from it since the data is raw.

d. Quasi-structured Data:
The data format contains textual data with inconsistent data formats that are formatted
with effort and time with some tools.

Example:

Web server logs: the log file is created and maintained by the server and contains a list of
activities.
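Extracting structure from such a quasi-structured log line can be sketched with a regular expression; the Apache-style log format shown is an illustrative assumption:

```python
import re

# A hypothetical access log line in a common Apache-like layout.
line = '127.0.0.1 - - [10/Oct/2021:13:55:36] "GET /index.html HTTP/1.1" 200 2326'

# The regex pulls structured fields out of the quasi-structured text.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)
m = pattern.match(line)
print(m.group("path"), m.group("status"))  # /index.html 200
```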

Veracity refers to how reliable the data is. There are many ways to filter or translate the data, and
veracity is about being able to handle and manage data efficiently. Reliable data is also essential
for business development.

For example, Facebook posts with hashtags.

Value is an essential characteristic of big data. What matters is not merely the data that we process
or store, but the valuable and reliable data that we store, process, and analyze.

Velocity plays an important role compared to the others. Velocity refers to the speed at which data
is created in real time. It covers the speed of incoming data sets, the rate of change, and activity
bursts. A primary aspect of Big Data is to provide demanded data rapidly. Big data velocity deals
with the speed at which data flows from sources such as application logs, business processes,
networks, social media sites, sensors, mobile devices, etc.

6. BENEFITS

• Fast forward to the present: technologies like Hadoop give you the scale and flexibility to
store data before you know how you are going to process it.

• Real-time big data isn’t just about storing petabytes or exabytes of data in a data warehouse;
it’s about the ability to make better decisions and take meaningful actions at the right time.

• Technologies such as MapReduce, Hive, and Impala enable you to run queries without changing
the data structures underneath.

• Recent research finds that organizations are using big data to target customer-centric
outcomes, tap into internal data, and build a better information ecosystem.
• Big Data is already an important part of the $64 billion database and data analytics market.
• It offers commercial opportunities of a scale comparable to enterprise software in the late
1980s, the Internet boom of the 1990s, and the social media explosion of today.
7. CHALLENGES

1. BIG

• Lots of raw data to store and analyze

• Expensive, requiring significant computing investment

2. CONSTANTLY CHANGING AND UPDATING

• Data is constantly changing and fluctuating, so systems built to handle it have to be
adaptive

3. OVERWHELMING VARIETY

• Difficult to determine which sources of data are useful

4. MESSY

• Not able to be analyzed quickly

• Data must be cleaned first


8. FUTURE OF BIG DATA

Big Data is commonly associated with other buzzwords like Machine Learning, Data
Science, AI, and Deep Learning. Since these fields require data, Big Data will continue
to play a huge role in improving the current models we have now and will allow for
advancements in research. Take Tesla, for example: each Tesla car with self-driving is
at the same time training Tesla’s AI model, continually improving it with each mistake.
This huge siphoning of data, along with a team of talented engineers, is what makes
Tesla the best at the self-driving game.

As data continues to expand and grow, cloud storage providers like AWS, Microsoft
Azure, and Google Cloud will dominate the storage of big data, giving companies room
for scalability and efficiency. It also means more and more people will be hired to handle
this data, which translates to more job opportunities for “data officers” who manage a
company’s databases.

The future of Big Data also has its dark side: as you know, many tech companies are
facing heat from governments and the public over issues of privacy and data. Laws
that govern people’s rights to their data will make data collection more restricted, albeit
more honest. In the same vein, the proliferation of data online also exposes us to
cyberattacks, so data security will be incredibly important.
9. CONCLUSION

Big Data is a game-changer. Many organizations are using more analytics to drive strategic actions
and offer a better customer experience. A slight change in the efficiency or smallest savings can
lead to a huge profit, which is why most organizations are moving towards big data.

The availability of Big data, low-cost commodity hardware, and new information management
and analytic software have produced a unique moment in the history of data analysis. The
convergence of these trends means that we have the capabilities required to analyse astonishing
datasets quickly and cost-effectively for the first time in history. These capabilities are neither
theoretical nor trivial. They represent a genuine leap forward and a clear opportunity to realize
enormous gains in terms of efficiency, productivity, revenue and profitability.
10. REFERENCES

www.google.com

www.wikipedia.com

www.studymafia.org
