
Introduction

Big data refers to data sets that are difficult to capture, manage and analyze effectively
using conventional database management software. It consists of structured, unstructured and
semi-structured data, much of which cannot be stored in a table format. As per the latest survey
conducted by the International Data Corporation (IDC), social sites are creating a huge amount
of data: 500 million tweets are sent on Twitter, 6 billion searches are done on Google, 3.6
billion likes are given on Instagram and 5.75 billion likes on Facebook. Big data analytics is a
form of advanced analytics, which involves complex applications with elements such as
predictive models, statistical algorithms and what-if analysis powered by high-performance
analytics systems. The process of capturing or collecting Big data is known as datafication.
By large or huge datasets, or big data, we mean anything from a petabyte (1 PB = 1,000 TB)
to an exabyte (1 EB = 1,000 PB) of data.

History of data management- Evolution of Big Data


The term big data was popularized by O’Reilly Media in 2005. It marks a new stage of data
evolution, characterized by the enormous velocity, variety and volume of data. Velocity implies
the speed at which data flows into an organization, variety refers to the varied forms of data,
and volume defines the amount of data an organization has to deal with. Big data technology
is used to process, store and analyze this data. Big data came into the picture because of
benefits such as:
 Cost efficiency.
 Faster and better decision making.
 Catering to customer needs through analysis.
Some of the sectors which use big data are as follows:
 Banking
 Government
 Health care
 Education
 Manufacturing
 Retail

Structuring Big Data:


Structuring big data means arranging the available data in a manner such that it becomes easy
to study, analyze and derive conclusions from it.
In daily life, we come across questions such as the following:
 Why is structuring required?
 How can we use the vast amount of data and information we come across?
 Which news article should we read out of the thousands we come across?
 How do we choose a book from the millions available on our favorite sites or stores?
 How do we keep updated about new events, sports, inventions and discoveries taking
place across the globe?

1|Page
Today, answers can be found using information processing systems. These systems can
analyze and structure large amounts of data, specifically what we searched for, what we
looked at and how long we remained at a particular page or website.
This helps in understanding user behavior, requirements and preferences in order to make
personalized recommendations for every individual.
When a user regularly visits or purchases from an online shopping site such as eBay, then
each time that person logs in, the system can present a recommended list of products that may
interest the user on the basis of earlier purchases or searches, thus presenting a specially
customized recommendation set for every user. This is the power of big data analytics.
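The recommendation idea described above can be sketched with a simple co-occurrence count. The users, products and weighting scheme below are invented for illustration; real recommenders use far more sophisticated collaborative filtering:

```python
from collections import Counter

# Hypothetical purchase histories; names and products are made up.
purchase_history = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse", "headphones"},
    "carol": {"laptop", "monitor"},
}

def recommend(user, history, top_n=2):
    """Recommend items that similar users bought but this user has not."""
    own = history[user]
    counts = Counter()
    for other, items in history.items():
        if other == user:
            continue
        # Weight other users by how many items they share with this user.
        overlap = len(own & items)
        if overlap:
            for item in items - own:
                counts[item] += overlap
    return [item for item, _ in counts.most_common(top_n)]

print(recommend("carol", purchase_history))  # 'mouse' ranks first
```

The same counting idea, applied to searches and page visits instead of purchases, is what lets a site build a customized list for each user.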

Types of Data:
Data is obtained primarily from the following types of sources:
 Internal sources - organizational or enterprise data.
 External sources - social data.
A comparison between internal and external sources of data:

Data source: Internal
Definition: Provides structured or organized data that originates from within the enterprise
and helps run the business.
Examples of sources: Customer Relationship Management (CRM), Enterprise Resource
Planning (ERP), customer details, products and sales data.
Application: The current data in the operational system is used to support the daily business
operations of an organization.

Data source: External
Definition: Provides unstructured or unorganized data that originates from the external
environment of an organization.
Examples of sources: Business partners, syndicate data suppliers, the Internet, government,
market research organizations.
Application: This data is often analyzed to understand entities mostly external to the
organization, such as customers, competitors, the market and the environment.

Big data comprises:


 Structured data.
 Unstructured data.
 Semi-structured data.
In a real-world scenario, the unstructured data is typically larger in volume than the
structured and semi-structured data; approximately 70% to 80% of data is in unstructured
form.

 Structured data: Data that has a defined, repeating pattern. This pattern makes it
easier for any program to sort, read and process the data. Processing structured data is
much easier and faster than processing data without any specific repeating pattern.
Structured data is:
 Organized data in a predefined format.
 Stored in tabular form.
 Data that resides in fixed fields within a record or file.
 Formatted data that has entities and their attributes mapped.
 Used to query and report against predetermined data types.
Some sources of structured data are:
 Relational databases, in the form of tables.
 Flat files, in the form of records.
 Multidimensional databases, mainly used in data warehouse technology.
 Legacy databases.

Customer ID   Name    Product ID   City        State
12365         Smith   241          Graz        Styria
23658         Jack    365          Wolfsberg   Carinthia
32456         Kady    421          Enns        Upper Austria
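Because structured data resides in fixed fields, it can be queried and reported on directly. The sketch below loads the sample table above into an in-memory SQLite database; the table and column names are our own choice:

```python
import sqlite3

# Load the sample customer table into an in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customers
                (customer_id INTEGER, name TEXT, product_id INTEGER,
                 city TEXT, state TEXT)""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [(12365, "Smith", 241, "Graz", "Styria"),
     (23658, "Jack", 365, "Wolfsberg", "Carinthia"),
     (32456, "Kady", 421, "Enns", "Upper Austria")],
)

# Fixed fields make querying against predetermined data types straightforward.
rows = conn.execute(
    "SELECT name, city FROM customers WHERE state = ?", ("Styria",)
).fetchall()
print(rows)  # [('Smith', 'Graz')]
```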

 Unstructured data: A set of data that might not have any logical or repeating
pattern. Unstructured data:
 Includes metadata, i.e., additional information related to the data.
 Comprises inconsistent data, such as data obtained from files, social media websites,
satellites, etc.
 Consists of data in different formats, such as e-mails, text, audio, video or images.
Some sources of unstructured data are:
 Text both internal and external to an organization – documents, logs, survey results,
feedback and e-mails from both within and across the organization.
 Social media – data obtained from social networking platforms including YouTube,
Facebook, Twitter, LinkedIn and Flickr.
 Mobile data – text messages and location information.
About 80% of enterprise data consists of unstructured content.
Some of the challenges associated with unstructured data are as follows:
 Identifying the unstructured data that can be processed.
 Sorting, organizing and arranging unstructured data in different sets and formats.
 Combining and linking unstructured data in a more structured format to derive
logical conclusions out of the available information.
 Cost, in terms of the storage space and human resources (data analysts and scientists)
needed to deal with the exponential growth of unstructured data.

Unstructured data is also generated from files that often share the same names and
extensions. For example, video files are generally stored with the extension .mp4 or .3gp,
whereas audio files have the extension .wav or .mp3. As different files of the same category
can have the same file name in different sources, a name and an extension alone do not help
in data identification, classification or even basic searches.
 Semi-structured data: Also known as schema-less or self-describing data. It is
data that is not stored consistently in the rows and columns of a database.
Some sources of semi-structured data are:
 File systems, such as web data in the form of cookies.
 Data exchange formats, such as JavaScript Object Notation (JSON) data.
Sl. No.   Name                                   E-mail
1         Sam Jacobs                             smj@xyz.com
2         First name: David, Last name: Brown    Davidb@xyz.com
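The two records above, written as JSON, show why such data is called self-describing: both describe a person, but the name field is a plain string in one record and a nested object in the other, so the schema travels with the data. The field names below mirror the table and are otherwise arbitrary:

```python
import json

# Semi-structured records: same entity, inconsistent shape for "name".
records = json.loads("""
[
  {"sl_no": 1, "name": "Sam Jacobs", "email": "smj@xyz.com"},
  {"sl_no": 2,
   "name": {"first": "David", "last": "Brown"},
   "email": "Davidb@xyz.com"}
]
""")

def display_name(record):
    """Normalize the inconsistent name field to a single string."""
    name = record["name"]
    if isinstance(name, dict):
        return f'{name["first"]} {name["last"]}'
    return name

print([display_name(r) for r in records])  # ['Sam Jacobs', 'David Brown']
```

Code that consumes semi-structured data must handle such shape differences itself, which is exactly what a fixed relational schema would have prevented.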

Elements of big data:


According to Gartner, data is growing at the rate of 59% every year. This growth can be
depicted in terms of the following 4 Vs:
 Volume.
 Velocity.
 Variety.
 Veracity.

 Volume: The amount of data generated by organizations or individuals. Today, the
volume of data in most organizations is approaching exabytes. Some experts predict
the volume of data to reach zettabytes in the coming years, and organizations are
doing their best to handle this ever-increasing volume. For example, according to
IBM, over 2.7 zettabytes of data is present in the digital universe today, and every
minute over 571 new websites are created. IDC estimates that by 2020, online
business transactions will reach up to 450 billion per day. Even by conservative
estimates, the total data stored on the Internet, including images, videos, audio, etc.,
has crossed 1 yottabyte. The exact size of the Internet will never be known!

 Velocity: The rate at which data is generated, captured and shared. Enterprises can
capitalize on data only if it is captured and shared in real time. Information processing
systems such as CRM and ERP face problems with data which keeps adding up but
cannot be processed quickly. These systems can attend to data in batches every few
hours; however, even this time lag causes the data to lose its importance, as new data
is constantly being generated. For example, eBay analyzes around 5 million
transactions per day in real time to detect and prevent frauds arising from the use of
PayPal.
Some sources of high-velocity data are:
 IT devices, including routers, switches, firewalls, etc., which constantly generate
valuable data.
 Social media, including Facebook posts, tweets and other activities, which create
huge amounts of data that must be analyzed instantly because its value degrades
quickly with time.
 Portable devices, including mobile phones, PDAs, etc., which also generate data at a
high speed.

 Variety: Data is generated from different types of sources – internal, external, social
and behavioral – and comes in different formats, such as images, text, videos, etc.
Even a single source can generate data in varied formats; for example, GPS devices
and social networking sites such as Facebook produce data of all types, including
text, images and videos.

 Veracity: The uncertainty of data, i.e., whether the obtained data is correct or
consistent. Out of the huge amount of data that is generated in almost every process,
only the data that is correct and consistent can be used for further analysis. Data,
when processed, becomes information; however, a lot of effort goes into processing
the data. Big data, especially in its unstructured and semi-structured forms, is messy
in nature, and it takes a good amount of time and expertise to clean that data and
make it suitable for analysis.

BIG DATA ANALYTICS


Big data is a field that deals with ways to systematically analyze and extract information
from, or otherwise manage, data sets that are too large or complex to be handled by
conventional data processing application software. Big data challenges include capturing
data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating,
information privacy and data sourcing. When we handle big data, we may not sample but
simply observe and track what happens. Thus, big data often includes data with sizes that
exceed the capacity of conventional software to process within an acceptable time and at an
acceptable cost.
Current usage of the term big data tends to refer to the use of predictive analytics, user
behavior analytics, or certain other advanced data analytics methods that extract value from
data, and sometimes to a particular size of data set. Scientists, business executives, medical
practitioners, advertisers and governments alike regularly meet challenges with large data
sets in areas including Internet search, fintech, urban informatics and business informatics.
There are three types of analytics:

o Descriptive Analytics: The most prevalent form of analytics, which serves as a base
for advanced analytics. It draws on databases to provide information on the trends of past
or current business events that can help managers, planners, leaders, etc. to develop a road
map for future actions. It performs an in-depth analysis of data to reveal details such as
the frequency of events, operation costs and the underlying reasons for failures, and it
helps in identifying the root cause of a problem.

o Predictive Analytics: Concerned with understanding and predicting the future. It
predicts near-future probabilities and trends and supports what-if analysis. In this analysis
we use statistics, data mining techniques and machine learning to make predictions about
the future.

o Prescriptive Analytics: This analysis is based on complex data obtained from descriptive
and predictive analyses. Using optimization techniques, prescriptive analytics determines
the best alternative to minimize or maximize some objective in finance, marketing and
many other areas.
Example: If we must find the best way of shipping goods from a factory to a destination
in order to minimize costs, we will use prescriptive analytics. The data, available in
abundance, can then be streamlined for growth and expansion in technology as well as
business.
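The shipping example reduces, in its simplest form, to choosing the alternative that minimizes total cost. The carriers and cost figures below are invented for illustration; real prescriptive analytics would optimize over far larger decision spaces:

```python
# Hypothetical shipping options from a factory to a destination.
routes = [
    {"carrier": "road", "cost_per_kg": 0.50, "fixed_cost": 200},
    {"carrier": "rail", "cost_per_kg": 0.30, "fixed_cost": 450},
    {"carrier": "air",  "cost_per_kg": 2.10, "fixed_cost": 900},
]

def cheapest_route(weight_kg):
    """Return the route with the lowest total cost for a shipment weight."""
    return min(routes, key=lambda r: r["fixed_cost"] + r["cost_per_kg"] * weight_kg)

# Road wins for light shipments (200 + 0.50*100 = 250),
# rail for heavy ones (450 + 0.30*5000 = 1950).
print(cheapest_route(100)["carrier"])   # road
print(cheapest_route(5000)["carrier"])  # rail
```

The prescriptive step is the `min` over alternatives; descriptive and predictive analytics supply the cost estimates it consumes.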

Analytical approaches:
Analysis is examination. Once analytical skills are learned, they can be applied to many
situations simply by maintaining a questioning attitude and following a scientific method.
Often, more questions are produced when answers are obtained: some lead to significant
discoveries and others lead nowhere. This section defines a few approaches demonstrating
how analyses of big data can be performed.

Approach: Predictive analysis
Possible evaluations:
 How can a business use its available data for predictive and real-time analysis
across its different domains?
 How can a business benefit from its unstructured enterprise data?

Approach: Behavioural analysis
Possible evaluations: How will a business leverage complex data in order to create new
models for:
 Decreasing business costs?
 Converting an audience into customers?
 Improving overall customer satisfaction?

Approach: Data interpretation
Possible evaluations:
 Which data should be analysed for new product innovation?
Advantages of Big Data Analytics:
There are numerous advantages to performing Big Data analytics in real time. Mistakes
within the organization become known instantly; new strategies can be implemented to
improve service dramatically; fraud can be detected the moment it occurs, saving costs; and
better sales insights help the business keep up with customer trends.
Example:
In a manufacturing unit, data analytics can improve the functions of the following
processes:

o Procurement: To find which suppliers are more efficient and cost-effective in
delivering products on time.
o Product development: To draw on innovative product and service formats and
designs for enhancing the development process and coming up with demanded
products.
o Manufacturing: To identify machinery and process variations that may be indicators
of quality problems.
o Distribution: To enhance the supply chain activities and standardize optimal
inventory level.
o Marketing: To identify which marketing campaigns will be the most effective in
driving and engaging customers and understanding customer behaviours and channel
behavior.
o Price management: To optimize prices based on the analysis of external factors.
o Merchandising: To improve merchandise breakdown on the basis of current buying
patterns and increase inventory levels.
o Sales: To optimize the assignment of sales resources and accounts, product mix, and
other operations.
o Store operations: To adjust inventory levels based on predicted buying patterns,
study of demographics, weather, key events and other factors.
o Human resources: To find out the characteristics and behaviours of successful and
effective employees.

Challenges:

o Data Procurement: There is a gigantic amount of information for an engineer to store,
gathered from various sites as well as from primary and secondary sources. A well-
designed system must be in place to retrieve that data in a snap when required.

o Data Quality and Integration: When such a huge amount of information is stored,
there are high chances of data being redundant and even unauthentic at times. Big data
systems contain a lot of redundant information, which creates confusion and adds cost,
and ultimately leads to misleading conclusions and untrustworthy results.

o Governance: This is an issue which every business faces. One must be authentic and
justified under the law, and every nation has different terms and conditions which one
must adhere to.

o Data Segmentation: There are times when an organization needs to partition its data
based on various parameters, such as gender, age, income group, location or budget;
other approaches include customer segmentation, market segmentation, product
segmentation and so forth. The segmentation method is chosen from decision tree
techniques, CART or regression-based methods. This separation is tedious and takes a
long time to carry out.

o Data Modelling: Even if you have the data, without the skill sets to interlink everything
and come to a conclusion, your data is useless.

o Business Intelligence: Mastering this area is a huge task. There is always a possibility
of missing out on some factor, so that the measures taken do not give the desired
results.

CAREERS IN BIG DATA


The market today needs plenty of talented and qualified people who can use their expertise
to help organizations deal with Big data.
Most jobs in Big Data are from companies that can be categorized into the following four
broad buckets:
 Big Data technology drivers, e.g., Google, IBM, Salesforce.
 Big Data product companies, e.g., Oracle.
 Big Data services companies, e.g., EMC.
 Big Data analytics companies, e.g., Splunk.
These companies deal in various domains such as retail, manufacturing, information,
finance, and consumer electronics. The distribution of Big Data hiring across these domains,
as per the Big Data Analytics 2014 report, is shown below:

[Pie chart: share of Big Data hiring by industry – Professional, Scientific, and Technical
Services; Information; Manufacturing; Retail Trade; Sustainability, Waste Management and
Remediation Services; Finance and Insurance; Wholesale Trade; Educational Services; Other
Services (except Public Administration); Accommodation and Food Services; Health Care
and Social Assistance; Real Estate, Rentals and Leasing; Construction; Transportation and
Warehousing; Public Administration; Management of Companies and Enterprises; Arts,
Entertainment and Recreation; Mining, Quarrying, and Oil and Gas Extraction; Utilities;
Agriculture, Fishing and Hunting.]
The most common job titles in Big Data include:
 Big Data analyst
 Data scientist
 Big Data developer
 Big Data administrator
 Big Data engineer

Skills Required
Big Data professionals can have various educational backgrounds such as
econometrics, physics, biostatistics, computer science, applied mathematics, or engineering.
Data scientists mostly possess a master’s degree or Ph.D., because it is a senior position often
reached after considerable experience in dealing with data. Developers generally prefer
implementing Big Data solutions by using Hadoop and its components.

Technical Skills

 A Big Data analyst should possess technical skills such as knowledge of natural
language processing, statistical analysis, analytical tools, machine learning, and conceptual
and predictive modelling.
 A Big Data developer should possess programming skills such as Java, Hadoop, Hive,
HBase and HQL, along with an understanding of HDFS, MapReduce, ZooKeeper, Flume,
and Sqoop.
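The MapReduce model a Hadoop developer works with can be illustrated in miniature. This single-process sketch only mimics the map, shuffle and reduce phases that Hadoop distributes across a cluster over HDFS:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data needs big tools", "big clusters process data"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle_phase(pairs))
print(counts["big"])  # 3
```

In a real Hadoop job the mapper and reducer run as separate tasks on different nodes, but the contract — independent maps, a grouping shuffle, and an aggregating reduce — is exactly this.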

Soft Skills

Organizations look for professionals who possess good logical and analytical skills, along
with good communication skills and an affinity toward strategic business thinking.
The preferred soft skills requirements for a Big Data professional are:
 Strong written and verbal communication skills.
 Analytical ability.
 Basic understanding of how a business works.

Future of Big Data


In today’s competitive world, the need for Big Data is evident. If leaders and economies
want exemplary growth and wish to generate value for all their stakeholders, Big Data has to
be embraced and used extensively to:
 Allow the storage and use of transactional data in digital form.
 Provide more specific information.
 Refine analytics that can improve decision making.
 Classify customers for providing customized products and services based on buying
patterns.

Most organizations today consider data and information to be their most valuable and
differentiated asset. By analysing this data effectively, organizations worldwide are now
finding new ways to compete and emerge as leaders in their fields to improve decision
making and enhance their productivity and performance. At the same time, the volume and
variety of data are increasing at an immense rate every day. The global phenomenon of
using Big Data to gain business value and competitive advantage will only continue to grow,
as will the opportunities associated with it.
Research conducted by MGI and McKinsey’s Business Technology Office suggests that the
use of Big Data is most likely to become a key basis of competition for individual firms for
success and growth and strengthening consumer surplus, production growth, and innovation.

The Future of Big Data – Moving from Big Data 1.0 to 2.0

The future of Big Data is not about numeric data points but about asking deeper questions
and finding out why consumers make the decisions they do.
Today, clients often ask about the future of big data and what the next step is: how can we
leverage data on an even deeper level in order to extract meaningful consumer insights that
go beyond where we are now? Most of the standard answers are about the ability to get data
and insights in real time and from more devices than ever. It’s time we move beyond
structured data and into the prime time of text analytics.
For us, the easiest way to get started with Big Data 2.0 is to focus on the unstructured data
we collect every day. This can be reviews, customer support emails, community forums, or
even your own CRM systems. The simplest way to look at this data is through a process
called text analytics.
Text analytics is a fairly straightforward process that breaks down into the following steps:
 Transforming & pre-processing – cleaning and formatting the data to make it easier
to read.
 Enrichment – enhancing the data by adding additional data points.
 Processing – performing specific analyses and classifications on the data.
 Frequencies & analysis – evaluating the results and translating them into numerical
indicators.
 Mining – the actual extraction of information.
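The steps above can be sketched on a pair of invented customer reviews. This toy walk-through skips enrichment and uses only word frequencies; real pipelines add far richer processing:

```python
import re
from collections import Counter

# Unstructured input: invented customer reviews.
reviews = [
    "Great battery life, GREAT screen!!",
    "Terrible battery. battery died in a day...",
]

# Transforming & pre-processing: lowercase and strip punctuation.
cleaned = [re.sub(r"[^a-z\s]", "", r.lower()) for r in reviews]

# Processing: tokenize into words.
tokens = [word for review in cleaned for word in review.split()]

# Frequencies & analysis: turn text into numerical indicators.
freq = Counter(tokens)

# Mining: extract the information we care about.
print(freq.most_common(2))  # [('battery', 3), ('great', 2)]
```

Even this tiny pipeline surfaces the signal in the text: customers are talking mostly about the battery.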

BIG DATA IN SOCIAL NETWORKING


Almost all organisations collect and collate relevant data in various forms, such as consumer
feedback, inputs from retailers and suppliers, current market trends, etc. The information thus derived
is used by the management to take major organisational decisions. An organisation generally must
spend huge amounts to collect data and information. Collecting and maintaining a pool of data and
information is just a waste of resources unless logical conclusions and business insights can be
derived from it. This is where Big Data analytics comes into the picture.
USE OF BIG DATA IN SOCIAL NETWORKING
Social network data refers to the data generated from people socializing on social media. On a social
networking site, you will find different people constantly adding and updating comments, statuses,
preferences, etc. All these activities generate large amounts of data. Analyzing and mining such large
volumes of data reveals business trends with respect to the wants, preferences, likes and dislikes of a
wide audience. This data can be segregated based on age groups, locations, and genders for the
purpose of analysis. Based on the information extracted, organizations design products and services
specific to people’s needs.

Every minute of the day, YouTube users upload 72 hours of new video, Google receives 2
lakh search queries, Facebook users share 2.5 million pieces of content, and Twitter users
send over 3 lakh tweets.

Social network analysis is the analysis performed on data obtained from social media. As this data is
generated in huge volumes, it results in the formation of a Big Data pool.
It is not difficult to keep track of a thousand users, but it becomes difficult when there are one million
direct connections between those thousand users, and another one billion connections when friends of
friends are taken into consideration. Extracting, obtaining, and analysing data from every such point
of connection is the challenge faced by social network analysis.
Social media analytics is nowadays used for online reputation management, crisis management, lead
generation, and brand checks, to measure campaign performance and much more.

The following are areas in which decision-making processes are influenced by social network data:
 Business Intelligence
 Marketing
 Product design and development
Business Intelligence
Business intelligence is a data analytics process that converts raw datasets into meaningful information
by using different techniques and tools for boosting business performance. Such a system allows a
company to collect, store, access, and analyze data for adding value to decision making.
The data generated from different social media platforms is analyzed to gain important business
insights. “Social customer relationship management data” is the latest catchphrase used these days to
describe this type of data. Such data analysis helps change the perspective with which an organization
values its customers. Instead of valuing a single customer, organizations can now calculate the value
of the entire network that is influenced by that customer.
Some organizations reward their influential customers with discounts and offers, and these customers
in turn keep on spreading a positive brand image of the organization. Social networking sites such as
LinkedIn or Facebook can obtain insights on the advertisements that most users prefer. This is
achieved by designing advertisements based on interests, likes, and preferences that customers as well
as their circle of friends, contacts, and colleagues have personally opted for.
MARKETING
Today, the preferences of consumers have changed due to their busy schedules. They no longer have
the time to read newspapers thoroughly, watch all the TV commercials, or go through all the emails
they receive in their inbox. In today’s competitive scenario, marketers aim to deliver what consumers
want by using interactive communication across digital channels such as email, mobile, social media,
and the web.
These channels, in turn, generate the social data required to provide insights based upon the brand
preferences of a target audience, the tone of its voice, the other brands it discusses, its interests, and
other information. Conducting social network analysis of this data can generate very useful and
meaningful business insights that may help organisations to take timely and informed decisions.
When a comparison is made between the efforts spent on the social media marketing platform and the
e-mail marketing platform, it is found that the marketing efforts spent on the former yield more
returns as compared to the latter.
Affiliate marketing is a reward-based marketing structure, where an affiliated company uses its own
marketing effort to drive customers to another company and, in turn, is rewarded by the benefiting
company. Websites such as couponmountain.com earn multimillion-dollar revenues each year by
doing affiliate marketing for the brands they promote.
PRODUCT DESIGN AND DEVELOPMENT

A system that can represent sentiment as data with a high degree of accuracy provides the
client a means to access information on a social platform. Being able to measure sentiments more
meticulously is of great value when designing a product or service. Brands must understand the
importance of the demographic information they receive to devise better targeted products and
programs.
Sentiment analysis refers to a computer programming technique used to analyze human emotions,
attitudes, and views across popular social networks, including Facebook, Twitter, and blogs. The
technique requires analytic skills as well as advanced computing skills. However, this technique is
still evolving, and the full potential of sentiment analysis is yet to be explored by marketers and other
business professionals. Most organizations today simply rely on the number of likes, tweets, and
comments, instead of actually studying the quality of the sentiments expressed in the conversations.
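A minimal illustration of sentiment analysis is a lexicon lookup. The word lists and posts below are invented; production systems rely on much larger lexicons or machine-learned models that handle negation, sarcasm and context:

```python
# Tiny illustrative sentiment lexicon (an assumption, not a real resource).
POSITIVE = {"love", "great", "excellent", "happy", "good"}
NEGATIVE = {"hate", "terrible", "awful", "bad", "broken"}

def sentiment(post):
    """Score a post: positive word count minus negative word count."""
    words = post.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = ["I love this brand, great service",
         "Terrible support, my order arrived broken"]
print([sentiment(p) for p in posts])  # [2, -2]
```

Scoring the text of conversations like this, rather than merely counting likes and comments, is the step most organizations have yet to take.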
USE OF BIG DATA IN PREVENTING FRAUDULENT ACTIVITIES
A fraud can be defined as the false representation of facts, leading to the concealment or distortion of
truth. Frauds that occur in financial institutions, such as banks, insurance companies, and healthcare
companies, or that involve any type of monetary transaction, as in the retail industry, are called
financial frauds. In such cases, online retailers, such as Amazon and eBay, tend to incur huge
expenses and losses.
The following are some of the most common types of financial frauds:
 Credit Card Fraud – This type of fraud is common these days and is related to the use of
credit card facilities. In an online shopping transaction, the online retailer cannot see the
actual user of the card, and therefore the valid owner of the card cannot be verified. It is
quite likely that a fake or stolen card is used in the transaction.

 Exchange or Return Policy Fraud – An online retailer always has a policy allowing the
exchange and return of goods, and sometimes people take advantage of this policy. These
people buy a product online, use it, and then return it, claiming they are not satisfied with
the product.

 Personal Information Fraud – In this type of fraud, people obtain the login information of a
customer, log in to the customer’s account, purchase a product online, and then change the
delivery address to a different location. The actual customer keeps calling the retailer to
refund the amount, as he or she has not made the transaction. Once the transaction is
proved fraudulent, the retailer has to refund the amount to the customer.

PREVENTING FRAUD USING BIG DATA ANALYTICS


In order to deal with huge amounts of data and gain meaningful business insights, organizations need
to apply Big Data analytics. Analyzing Big Data allows organizations to:
 Keep track of and process huge volumes of data.
 Differentiate between real and fraudulent entries.
 Identify new methods of fraud and add them to the list of fraud-prevention checks.
 Verify whether a product has been delivered to the valid recipient.
 Determine the location of the customer and the time when the product was delivered.
 Check the listings of popular retail sites, such as eBay, to find whether the product is up for
sale somewhere else.
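A rule-based screen of the kind described above can be sketched as follows. The order fields, thresholds and data are assumptions made for illustration, not any retailer's actual rules; real systems combine many such checks with learned models:

```python
from datetime import datetime, timedelta

def fraud_flags(order, recent_orders):
    """Return a list of reasons an order looks suspicious."""
    flags = []
    # Delivery address differs from the account address, as in
    # personal information fraud.
    if order["delivery_address"] != order["account_address"]:
        flags.append("address mismatch")
    # The same card used many times in a short window, a stolen-card pattern.
    window = order["time"] - timedelta(hours=1)
    burst = [o for o in recent_orders
             if o["card"] == order["card"] and o["time"] >= window]
    if len(burst) >= 3:
        flags.append("card used too often")
    return flags

now = datetime(2024, 1, 1, 12, 0)
history = [{"card": "4111", "time": now - timedelta(minutes=m)}
           for m in (5, 20, 40)]
order = {"card": "4111", "time": now,
         "delivery_address": "12 New St", "account_address": "9 Old Rd"}
print(fraud_flags(order, history))
```

At Big Data scale, the interesting part is running such checks in real time over millions of transactions, which is where the streaming and analytics infrastructure discussed earlier comes in.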

Summary
This topic helped us learn about Big Data – the big buzzword of today’s IT industry. We
discussed the common features, sources and types of Big Data, along with the four Vs of Big
Data: Volume, Velocity, Variety and Veracity. We saw where Big Data is used across
various domains, familiarized ourselves with the professional opportunities available in a
Big Data career path and with the importance of conducting Big Data analytics in a business
context, and, towards the end, learned about using Big Data to prevent fraudulent activities.

