EVOLUTION OF TECHNOLOGY
Telephone --------> Mobile/Android
Bulky desktop ----> FD/HDD/Cloud
Car --------------> Smart car
WHY?
Because these devices and services now generate data continuously:
Phone data
Self-driving car data
Smart ACs
Social media
E-commerce sites such as Amazon and Flipkart
Big Data is a term for collections of data sets so large and complex that they become difficult to
process using on-hand database management tools or traditional data processing applications.
Big Data is commonly characterized by five Vs:
Volume
Variety
Velocity
Value
Veracity
What is Data?
The quantities, characters, or symbols on which operations are performed by a computer, which
may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical,
or mechanical recording media.
The New York Stock Exchange generates about one terabyte of new trade data per day.
Social media: Statistics show that 500+ terabytes of new data are ingested into the
databases of the social media site Facebook every day. This data is mainly generated
through photo and video uploads, message exchanges, comments, etc.
Big Data can be classified into three types:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured'
data. Over time, computer science has achieved great success in developing techniques for
working with such data (where the format is known in advance) and deriving value from it.
Nowadays, however, we are foreseeing issues as the size of such data grows to a huge extent,
with typical sizes in the range of multiple zettabytes. Looking at these figures, one can easily
understand why the name Big Data was given and imagine the challenges involved in its storage
and processing.
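As a small sketch of the idea, the fixed-format records below are read with Python's standard csv module; because the schema is known in advance, querying is straightforward. The field names and values are invented for illustration:

```python
import csv
import io

# Structured data: every record follows the same fixed format,
# so the schema (field names and order) is known in advance.
raw = io.StringIO(
    "emp_id,name,department,salary\n"
    "1001,Asha,Engineering,72000\n"
    "1002,Ravi,Sales,54000\n"
)

reader = csv.DictReader(raw)
rows = list(reader)

# Because the format is fixed, queries are straightforward.
engineers = [r["name"] for r in rows if r["department"] == "Engineering"]
print(engineers)  # ['Asha']
```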
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to
being huge in size, unstructured data poses multiple challenges when it comes to processing it to
derive value. A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of
data available to them but, unfortunately, they don't know how to derive value from it, since this
data is in its raw, unstructured form.
Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data appears structured in
form, but it is not defined by, for example, a table definition in a relational DBMS. An example
of semi-structured data is data represented in an XML file.
Please note that web application data, which is unstructured, consists of log files, transaction
history files, etc. OLTP systems are built to work with structured data, wherein data is stored in
relations (tables).
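In an XML document like the one sketched below, tags give the data structure, but no fixed relational schema is declared up front, and records need not be uniform. The element names are illustrative; the parsing uses Python's standard library:

```python
import xml.etree.ElementTree as ET

# Semi-structured data: tags describe the structure, but there is no
# fixed schema, so records may differ from one another.
xml_doc = """
<employees>
  <employee id="1001"><name>Asha</name><dept>Engineering</dept></employee>
  <employee id="1002"><name>Ravi</name></employee>
</employees>
"""

root = ET.fromstring(xml_doc)
# The second employee has no <dept> element; findtext returns None.
names = [e.findtext("name") for e in root.findall("employee")]
depts = [e.findtext("dept") for e in root.findall("employee")]
print(names)  # ['Asha', 'Ravi']
print(depts)  # ['Engineering', None]
```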
(i) Volume – The name Big Data itself is related to a size which is enormous. Size of data plays a
very crucial role in determining value out of data. Also, whether a particular data can actually be
considered as a Big Data or not, is dependent upon the volume of data. Hence, 'Volume' is one
characteristic which needs to be considered while dealing with Big Data.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data considered by
most of the applications. Nowadays, data in the form of emails, photos, videos, monitoring
devices, PDFs, audio, etc. are also being considered in the analysis applications. This variety of
unstructured data poses certain issues for storage, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks, and social media sites, sensors, Mobile devices, etc. The
flow of data is massive and continuous.
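Velocity can be made concrete with a small sketch: a sliding-window counter that reports how many events arrived in the last minute. The event stream here is synthetic:

```python
from collections import deque

# A sliding-window counter: keep only event timestamps from the last
# `window` seconds, so the count reflects the current arrival rate.
class RateCounter:
    def __init__(self, window=60):
        self.window = window
        self.events = deque()

    def record(self, timestamp):
        self.events.append(timestamp)
        # Drop events that have fallen out of the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()

    def rate(self):
        return len(self.events)

counter = RateCounter(window=60)
# Synthetic stream: events at t = 0, 10, 30, 65 seconds.
for t in [0, 10, 30, 65]:
    counter.record(t)
print(counter.rate())  # 3 -> the event at t=0 has expired from the window
```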
(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
The ability to process Big Data brings multiple benefits. For example, traditional customer
feedback systems are being replaced by new systems designed with Big Data technologies; in
these new systems, Big Data and natural language processing technologies are used to read and
evaluate consumer responses.
Big Data technologies can be used for creating a staging area or landing zone for new data before
identifying what data should be moved to the data warehouse. In addition, such integration of
Big Data technologies and data warehouse helps an organization to offload infrequently accessed
data.
Spark: We can write Spark programs to process the data; with Spark we can also process live
streams of data.
To analyze such large volumes of data, Big Data analytics applications enable big data analysts,
data scientists, predictive modelers, statisticians, and other analytics professionals to analyze the
growing volume of structured and unstructured data. The analysis is performed using specialized
software tools and applications. Using these tools, various data operations can be performed,
such as data mining, text mining, predictive analysis, and forecasting; these processes are
performed separately and are part of high-performance analytics. Big Data analytics tools and
software enable an organization to process large amounts of data and provide meaningful
insights that support better business decisions in the future.
Hadoop
Hadoop is an open-source framework widely used to store large amounts of data and run
applications on clusters of commodity hardware. It has become a key Big Data technology
because of the constant increase in the variety and volume of data, and its distributed computing
model provides fast access to that data.
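Hadoop's processing model, MapReduce, can be sketched in plain Python: map each record to key/value pairs, shuffle the pairs by key, then reduce each group to a result. This illustrates the programming model only, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key into a single result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "data cluster data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 3, 'cluster': 2}
```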
Data Mining
Once the data is stored in a data management system, you can use data mining techniques to
discover patterns that are used for further analysis and to answer complex business questions.
With data mining, repetitive and noisy data can be removed, leaving only the relevant
information, which accelerates the pace of making informed decisions.
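A toy sketch of the cleaning step described above: removing duplicate and noisy records before looking for patterns. The records and the noise rule are invented for illustration:

```python
# Hypothetical raw records: contain a duplicate and a noisy (empty) value.
records = [
    {"user": "asha", "purchase": "laptop"},
    {"user": "asha", "purchase": "laptop"},   # duplicate record
    {"user": "ravi", "purchase": ""},          # noisy: missing value
    {"user": "ravi", "purchase": "phone"},
]

seen = set()
cleaned = []
for rec in records:
    if not rec["purchase"]:        # drop noisy records
        continue
    key = (rec["user"], rec["purchase"])
    if key in seen:                # drop repeated records
        continue
    seen.add(key)
    cleaned.append(rec)

print(cleaned)
# [{'user': 'asha', 'purchase': 'laptop'}, {'user': 'ravi', 'purchase': 'phone'}]
```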
Text Mining
With text mining, we can analyze text data from the web, such as comments and likes from
social media, and other text-based sources such as email; for instance, we can identify whether a
mail is spam. Text mining uses technologies such as machine learning and natural language
processing to analyze large amounts of text and discover patterns.
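Real text mining uses the machine learning and NLP techniques just mentioned; as a deliberately simplified stand-in, the sketch below flags a mail as spam when it contains enough known spam keywords. The keyword list and threshold are invented:

```python
# Illustrative keyword list; a real system would learn these from data.
SPAM_KEYWORDS = {"winner", "free", "prize", "urgent"}

def looks_like_spam(message):
    # Normalize: split into words, strip punctuation, lowercase.
    words = {w.strip(".,!?").lower() for w in message.split()}
    # Threshold of 2 matches is arbitrary, chosen for the example.
    return len(words & SPAM_KEYWORDS) >= 2

print(looks_like_spam("URGENT! You are a winner, claim your prize"))  # True
print(looks_like_spam("Meeting moved to 3 pm tomorrow"))              # False
```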
Predictive Analytics
Predictive analytics uses data, statistical algorithms, and machine learning techniques to identify
likely future outcomes based on historical data. The goal is to provide the best assessment of
what will happen in the future, so that organizations can feel confident in their current business
decisions.
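A minimal instance of the idea: fitting a trend line to historical data by ordinary least squares and extrapolating one step ahead. The sales figures are invented:

```python
# Historical data: month number -> sales (invented figures).
months = [1, 2, 3, 4, 5]
sales  = [10.0, 12.0, 14.0, 16.0, 18.0]

# Ordinary least squares for y = a*x + b, computed by hand.
n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) / \
    sum((x - mean_x) ** 2 for x in months)
b = mean_y - a * mean_x

# Predict the next month from the fitted trend.
prediction = a * 6 + b
print(prediction)  # 20.0
```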
Using these kinds of data, organizations can derive patterns and provide better customer service.
Conclusion
Big Data is a game-changer. Many organizations are using more analytics to drive strategic
actions and offer a better customer experience. A slight improvement in efficiency or the
smallest savings can lead to huge profit, which is why most organizations are moving towards
Big Data.
Data architecture
Traditional data systems use a centralized database architecture, in which large and complex
problems are solved by a single computer system. A centralized architecture is costly and
ineffective for processing large amounts of data. Big Data is based on a distributed database
architecture, in which a large problem is solved by dividing the data into several smaller blocks.
The solution is then computed by several different computers in a given network, which
communicate with each other to find the solution to the problem. The distributed database
provides better computing power at a lower price, and also improves performance compared
with the centralized database system. This is because the centralized architecture is based on
mainframes, which are not as economical as the microprocessors used in a distributed database
system. The distributed database also has more computational power than the centralized
database system used to manage traditional data.
Types of data
Traditional database systems are based on structured data, i.e., traditional data is stored in fixed
formats or fields in a file. Examples of systems for structured data include relational database
management systems (RDBMS) and spreadsheets, which only answer questions about what
happened. A traditional database only provides insight into a problem at a small scale. However,
in order to enhance an organization's ability to gain more insight into the data, and also to learn
about metadata, unstructured data is used. Big Data uses semi-structured and unstructured data,
improving the variety of data gathered from different sources such as customers, audiences, or
subscribers. After collection, Big Data transforms this data into knowledge-based information.
Volume of data
A traditional database can store only relatively small amounts of data, ranging from gigabytes to
terabytes. Big Data, however, helps to store and process large amounts of data, consisting of
hundreds of terabytes or petabytes of data and beyond. Storing this massive amount of data on
distributed, commodity hardware reduces the overall cost of storing data and helps in providing
business intelligence.
Data schema
Big Data uses a dynamic schema for data storage. Both unstructured and structured information
can be stored, and any schema can be used, since the schema is applied only after a query is
issued. Big Data is stored in raw form, and the schema is applied only when the data is to be read
(schema-on-read). This approach is beneficial in preserving the information present in the data.
The traditional database, by contrast, is based on a fixed schema that is static in nature: the
schema is applied during write operations (schema-on-write), and once the data is saved the
schema cannot easily be changed.
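Schema-on-read can be sketched as follows: raw records are stored exactly as they arrived, and a schema (field names, types, defaults) is applied only when a query reads them. The records and schema here are illustrative:

```python
import json

# Raw data stored exactly as it arrived: no schema enforced at write time.
raw_lines = [
    '{"sensor": "t1", "temp": "21.5"}',
    '{"sensor": "t2", "temp": "19.0", "unit": "C"}',
]

# The schema is applied only now, at read time (schema-on-read).
def read_with_schema(line):
    rec = json.loads(line)
    return {
        "sensor": str(rec["sensor"]),
        "temp": float(rec["temp"]),        # type cast applied on read
        "unit": rec.get("unit", "C"),      # default filled in on read
    }

readings = [read_with_schema(line) for line in raw_lines]
print(readings[0])  # {'sensor': 't1', 'temp': 21.5, 'unit': 'C'}
```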
Data relationship
In a traditional database system, relationships between data items can be explored easily, as the
amount of information stored is small. Big Data, however, contains massive, voluminous data,
which increases the difficulty of figuring out the relationships between data items.
Scaling
Scaling refers to the demand for resources and servers required to carry out a computation. Big
Data is based on a scale-out architecture, under which distributed approaches to computing are
employed across more than one server, so the computational load is shared across the system
rather than borne by a single machine. Achieving scalability in a traditional database, by
contrast, is very difficult, because the traditional database runs on a single server and requires
expensive servers to scale up.
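The scale-out idea can be sketched in a few lines: split the data into chunks and hand each chunk to a separate worker, with worker threads standing in for separate servers in a cluster. Adding workers changes how the load is divided, not the result:

```python
from concurrent.futures import ThreadPoolExecutor

# Scale-out sketch: partition the data, compute partial results in
# parallel workers, then combine them.
data = list(range(1, 1001))

def partial_sum(chunk):
    # Each "server" computes its share of the work.
    return sum(chunk)

def scaled_sum(data, workers):
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

# Same answer regardless of how many workers share the load.
print(scaled_sum(data, workers=4))  # 500500
```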