Unit – 1: Introduction to Big data
……………………………………………………………………………………………………………………………..
1. Data:
In the pursuit of knowledge, data is a collection of discrete values that convey information, describing quantity,
quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further
interpreted. A datum is an individual state in a set of data.
Data classification is the process of organizing data into relevant categories so that it can be used or applied
more efficiently. Classification makes data easier for the user to retrieve. It is important for data security and
compliance, and for meeting different business or personal objectives. It is also a major requirement because
data must be easily retrievable within a specific period of time.
1. Structured Data:
Structured data is created using a fixed schema and is maintained in tabular format. The elements in structured
data are addressable, which makes analysis effective. It includes all data that can be stored in an SQL database
in tabular format. It is the simplest form of data to process and manage.
Examples –
Consider relational data: a university has to maintain a record of its students, such as the name, ID, address,
and email of each student. To store these records, a relational schema and a table with one column per field
are used.
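The student record described above can be sketched as a relational table. The example below is only an illustration using Python's built-in sqlite3 module; the table name, column names and sample rows are assumptions made for the sketch and do not come from the original notes.

```python
import sqlite3

# Hypothetical sample rows: (student_id, name, address, email)
SAMPLE_ROWS = [
    (1, "Asha", "Guntur", "asha@example.edu"),
    (2, "Ravi", "Vijayawada", "ravi@example.edu"),
]


def build_student_table(rows):
    """Create an in-memory database with a fixed-schema student table."""
    conn = sqlite3.connect(":memory:")
    # The schema is fixed up front: every row has the same four columns.
    conn.execute(
        """
        CREATE TABLE student (
            student_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            address    TEXT,
            email      TEXT
        )
        """
    )
    conn.executemany("INSERT INTO student VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = build_student_table(SAMPLE_ROWS)
# Every element is addressable by row (student_id) and column (name).
names = [row[0] for row in conn.execute("SELECT name FROM student ORDER BY student_id")]
print(names)
```

Because the schema is fixed, every record can be queried in the same way, which is what makes structured data easy to analyze.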
2. Unstructured Data:
Unstructured data is data that does not follow a pre-defined standard or any organized format. This kind of
data does not fit a relational database, because a relational database expects data in a pre-defined, organized
form. Unstructured data is nevertheless very important in the big data domain, and there are many platforms
for managing and storing it, such as NoSQL databases.
Examples –
3. Semi-Structured Data:
Semi-structured data is information that does not reside in a relational database but has some organizational
properties that make it easier to analyze. With some processing it can be stored in a relational database,
although this is very hard for some kinds of semi-structured data; semi-structured formats exist to ease storage.
Example –
XML data.
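As a rough illustration of why XML counts as semi-structured, the sketch below parses a small XML fragment with Python's standard xml.etree.ElementTree module. The tags give the data some organization, but records need not carry the same fields. The element names and values are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: the second student has no <email> element.
XML_TEXT = """<students>
  <student id="1"><name>Asha</name><email>asha@example.edu</email></student>
  <student id="2"><name>Ravi</name></student>
</students>"""


def students_from_xml(xml_text):
    """Turn each <student> element into a dict; missing fields become None."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.findall("student"):
        records.append(
            {
                "id": node.get("id"),
                "name": node.findtext("name"),
                "email": node.findtext("email"),  # None when the tag is absent
            }
        )
    return records


records = students_from_xml(XML_TEXT)
print(records)
```

Storing such records in a relational table requires a decision about missing fields, for example a nullable column, which is the extra processing the text refers to.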
The main goal of organizing data is to arrange it in a form that makes it readily available to the users. Its
basic features are as follows.
• Homogeneity – The data items in a particular group should be similar to each other.
• Clarity – There must be no confusion in the positioning of any data item in a particular group.
• Stability – The set of data items must be stable, i.e. a new investigation should not change the
classification of the same set.
• Elasticity – One should be able to change the basis of classification as the purpose of classification changes.
Not all data is of good quality, and poor-quality data is of limited use. In order to fully realize the benefits
of data, it has to be of high quality. This means that one should look for certain characteristics in the data.
These are:
1. Data should be precise, which means it should contain accurate information. Precision saves the user
both time and money.
2. Data should be relevant and meet the requirements of the user. Hence the legitimacy of the data
should be checked before it is considered for use.
3. Data should be consistent and reliable. False data is worse than incomplete data or no data at all.
4. Data should be complete when it is used. In today's world of dynamic data, information is not
complete at all times; however, at the time of its use, the data has to be comprehensive and complete
in its current form.
5. High-quality data is unique to the requirement of the user. Moreover, it is easily accessible and can
be processed further with ease.
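Some of these characteristics can be checked mechanically. The following is a minimal sketch, not a complete data-quality tool: it counts records with missing required fields (completeness) and repeated records (uniqueness). The function name, field names and sample records are assumptions chosen for illustration.

```python
def quality_report(records, required_fields):
    """Count incomplete and duplicate records in a list of dicts."""
    # A record is incomplete if any required field is missing or empty.
    incomplete = [
        r for r in records
        if any(r.get(field) in (None, "") for field in required_fields)
    ]

    # A record is a duplicate if its required-field values were seen before.
    seen = set()
    duplicates = []
    for r in records:
        key = tuple(r.get(field) for field in required_fields)
        if key in seen:
            duplicates.append(r)
        else:
            seen.add(key)

    return {
        "total": len(records),
        "incomplete": len(incomplete),
        "duplicates": len(duplicates),
    }


# Hypothetical sample data.
sample = [
    {"id": 1, "email": "a@example.edu"},
    {"id": 2, "email": ""},               # incomplete: empty email
    {"id": 1, "email": "a@example.edu"},  # duplicate of the first record
]
report = quality_report(sample, ["id", "email"])
print(report)
```

Checks for precision and relevance depend on the application and usually cannot be automated this simply.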
Data that is very large in size is called Big Data. Normally we work with data of size MB (Word documents, Excel
sheets) or at most GB (movies, code), but data in petabytes, i.e. of the order of 10^15 bytes, is called Big Data.
It is often stated that almost 90% of today's data has been generated in the past 3 years.
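To make the difference in scale concrete, the sketch below expresses a byte count in decimal units, where 1 PB is 10^15 bytes as stated above. The helper name and the sample sizes are assumptions for illustration only.

```python
# Decimal (SI) units, largest first: 1 PB = 10**15 bytes.
UNITS = [
    ("PB", 10**15),
    ("TB", 10**12),
    ("GB", 10**9),
    ("MB", 10**6),
    ("KB", 10**3),
]


def human_size(num_bytes):
    """Format a byte count using the largest unit that fits."""
    for name, size in UNITS:
        if num_bytes >= size:
            return f"{num_bytes / size:.1f} {name}"
    return f"{num_bytes} B"


print(human_size(700_000_000))                # a file in the MB range
print(human_size(2_500_000_000_000_000))      # a data set in the PB range
```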
• Social networking sites: Facebook, Google and LinkedIn all generate huge amounts of data on a
day-to-day basis, as they have billions of users worldwide.
• Weather stations: Weather stations and satellites produce very large amounts of data, which are stored and
processed to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans
accordingly; for this they store the data of their millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through their daily
transactions.
1. Velocity: Data is increasing at a very fast rate. It is estimated that the volume of data doubles
every 2 years.
2. Variety: Nowadays data is not only stored in rows and columns. Data is structured as well as unstructured.
Log files and CCTV footage are unstructured data. Data that can be saved in tables, such as the
transaction data of a bank, is structured data.
3. Volume: The amount of data we deal with is very large, of the order of petabytes.
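The doubling estimate under Velocity implies exponential growth: if the volume doubles every T years, then after t years it is V0 * 2^(t/T). The short sketch below only works out this arithmetic; the starting volume is an arbitrary assumption, and the 2-year doubling period is the estimate quoted above, not a measured value.

```python
def projected_volume(initial_pb, years, doubling_period_years=2.0):
    """Volume after `years` if it doubles every `doubling_period_years`."""
    return initial_pb * 2 ** (years / doubling_period_years)


# Starting from a hypothetical 1 PB, ten years means five doublings.
volume_after_10_years = projected_volume(1.0, 10)
print(volume_after_10_years)
```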
o Perhaps the most frequent challenge in big data efforts is the inaccessibility of data sets from
external sources.
o Sharing data includes the need for inter- and intra-institutional legal documents.
o Data in a company's information system has to be available in an accurate, complete and timely
manner if it is to be used to make accurate decisions in time.
o Privacy and security are another very important challenge with Big Data. This challenge has sensitive,
conceptual, technical as well as legal significance.
o Most organizations are unable to maintain regular checks because of the large amounts of data
generated. However, security checks and observation should be performed in real time, because
that is most beneficial.
o Some information about a person, when combined with large external data sets, may reveal facts
that the person considers private and would not want the data owner to know.
o Some organizations collect information about people in order to add value to their
business. They do this by deriving insights into people's lives that those people are unaware of.
3 © www.anuupdates.org Prepared by D.Venkata Reddy M.Tech(Ph.D), UGC NET, AP SET Qualified
3. Analytical Challenges:
o Big data poses some huge analytical challenges, which raise questions such as how to deal with a
problem if the data volume gets too large.
o The large amounts of data on which this type of analysis is to be done can be structured
(organized data), semi-structured (semi-organized data) or unstructured (unorganized data).
There are two techniques through which decision making can be done:
4. Technical challenges:
o Quality of data:
▪ Collecting and storing a large amount of data comes at a cost. Big companies, business
leaders and IT leaders always want large data storage.
▪ For better results and conclusions, Big Data focuses on storing quality data rather than
irrelevant data.
▪ This raises the questions of how to ensure that data is relevant, how much data is
enough for decision making, and whether the stored data is accurate.
o Fault tolerance:
▪ Fault tolerance is another technical challenge. Fault-tolerant computing is extremely
hard and involves intricate algorithms.
▪ Newer technologies like cloud computing and big data are designed so that whenever a
failure occurs, the damage done stays within an acceptable threshold, that is, the whole
task does not have to begin from scratch.
o Scalability:
▪ Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led
towards cloud computing.
▪ This raises challenges such as how to run and execute various jobs so that the goal of
each workload is achieved cost-effectively.
▪ It also requires dealing with system failures in an efficient manner. This again raises the
big question of what kinds of storage devices are to be used.
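The fault-tolerance goal described above, that a failed task should not restart from scratch, is commonly approached with checkpointing. The sketch below is a simplified single-machine illustration and not how any particular big data framework implements it. The function name, the JSON checkpoint format and the `fail_at` parameter (used only to simulate a crash) are assumptions.

```python
import json
import os
import tempfile


def process_with_checkpoint(items, checkpoint_path, work, fail_at=None):
    """Apply `work` to each item, saving progress after every item."""
    start, results = 0, []
    # Resume from the last saved position if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
        start, results = state["next_index"], state["results"]

    for i in range(start, len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")  # stands in for a crash
        results.append(work(items[i]))
        # Record how far we got so a restart can skip finished items.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + 1, "results": results}, f)
    return results


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
data = [1, 2, 3, 4]
calls = []  # records which items were actually processed


def square(x):
    calls.append(x)
    return x * x


try:
    process_with_checkpoint(data, path, square, fail_at=2)  # fails on the third item
except RuntimeError:
    pass

resumed = process_with_checkpoint(data, path, square)  # picks up at the third item
print(resumed, calls)
```

Each item is processed exactly once across the two runs, so the damage from the failure is limited to the item that was in progress.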
Big Data analytics is indeed a revolution in the field of Information Technology. The use of data analytics
by companies is increasing every year. Big data has the characteristics of high variety, volume, and
velocity. Big Data involves the use of analytics techniques like machine learning, data mining, natural language
processing, and statistics. With the help of big data, multiple operations can be performed on a single platform.
You can store terabytes of data, pre-process it, analyze it and visualize it with the help of a couple of
big data tools.
Data is extracted, prepared and blended to provide analysis for the businesses. Large enterprises and
multinational organizations use these techniques widely these days in different ways.
Big data analytics helps organizations work with their data efficiently and use that data to identify new
opportunities. Different techniques and algorithms can be applied to make predictions from data. Multiple business
strategies can be applied for the future success of the company, which leads to smarter business moves, more
efficient operations and higher profits.
The following are the three main reasons why Big data is so important and efficient.
• Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data.
• Faster, better decision making. With the speed of Hadoop and in-memory analytics, combined with the
ability to analyze new sources of data, businesses are able to analyze information immediately and make
decisions based on what they've learned.
• New products and services. With the ability to gauge customer needs and satisfaction through analytics
comes the power to give customers what they want.
Big Data analytics is flexible enough to be applied in many other fields as well. With the widespread use of
big data, there has been enormous growth in multiple industries. Some of them are:
• Banking
• Technology
• Manufacturing
In the banking sector especially, big data tools have been integrated with existing systems. Multiple operations
can be performed on transactional data. Moreover, tools like Apache Hive let users query their data and
get results in a very short period of time. A user can tune the query engine to get better query
performance.
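Apache Hive is queried with an SQL-like language, so the kind of question asked of transactional data can be illustrated with ordinary SQL. The sketch below uses Python's built-in sqlite3 module purely as a local stand-in; it is not Hive and says nothing about Hive's performance. The table, columns and amounts are hypothetical.

```python
import sqlite3

# Hypothetical transactions: (account_id, amount)
TRANSACTIONS = [
    ("A1", 120.0),
    ("A2", 75.5),
    ("A1", 30.0),
    ("A2", 24.5),
]

conn = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
conn.execute("CREATE TABLE transactions (account_id TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)", TRANSACTIONS)

# Aggregate query: total amount per account.
query = """
    SELECT account_id, SUM(amount)
    FROM transactions
    GROUP BY account_id
"""
totals = dict(conn.execute(query))
print(totals)
```

In a real deployment the same style of aggregate query would run over far larger tables distributed across many machines.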
The use of big data has also increased in the educational sector. There are new options for research and
analysis using data analytics. The insights provided by big data analytics tools help in understanding the needs
of customers better.
With huge interest and investment in Big Data technologies, professionals with big data analytics skills
are in huge demand. Fields like Data Analytics and Data Engineering are among the most valued
nowadays. IT executives, business analysts and software developers are learning big data tools and techniques to
grow with the market of jobs and opportunities. Some big data tools are based on Python and Java,
so they are easier to pick up for programmers who already work in these languages. Moreover, users who know
how to pre-process data and have skills like data cleaning can easily learn big data analysis tools and analytics.
With the help of visualization tools like Power BI, QlikView, Tableau etc., a user can easily analyze the data
and present a new marketing strategy.
In different domains of industry, the nature of the job differs and so do the requirements of the industry.
Since analytics is emerging in every field, the workforce needs are equally enormous. The job titles may
include Big Data Analyst, Big Data Engineer, Business Intelligence Consultant, Solution
• Tools: Tableau, Hadoop
• Benefit: Cost savings
• Applied fields: the banking industry, the food industry, social media, healthcare, the gaming sector,
entertainment, retail and wholesale, etc.