Big data refers to data sets that are difficult to capture, manage and analyze effectively
using current database management software. It consists of structured, unstructured and
semi-structured data, much of which cannot be stored in a table format. As per the latest survey
conducted by the International Data Corporation (IDC), social sites are creating a huge amount
of data: 500 million tweets are sent on Twitter, 6 billion searches are done on Google, and
there are 3.6 billion likes on Instagram and 5.75 billion likes on Facebook. Big data analytics
is a form of advanced analytics that involves complex applications with elements such as
predictive models, statistical algorithms and what-if analysis, powered by high-performance
analytics systems. The process of capturing or collecting big data is known as datafication.
By large or huge datasets, or big data, we mean anything from a petabyte (1 PB = 1000 TB) to an
exabyte (1 EB = 1000 PB) of data.
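The unit relationships above can be checked with a few lines of arithmetic. This is only a small sketch; the function names are made up for illustration, and the decimal factors are the ones quoted in the text.

```python
# Decimal storage units as quoted in the text: 1 PB = 1000 TB, 1 EB = 1000 PB.
TB_PER_PB = 1000
PB_PER_EB = 1000


def petabytes_to_terabytes(pb):
    """Convert petabytes to terabytes."""
    return pb * TB_PER_PB


def exabytes_to_terabytes(eb):
    """Convert exabytes to terabytes by going through petabytes."""
    return eb * PB_PER_EB * TB_PER_PB


print(petabytes_to_terabytes(1))  # 1000
print(exabytes_to_terabytes(1))   # 1000000
```

So the upper end of the range, one exabyte, corresponds to a million terabytes.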
Currently, this data is handled by information processing systems. These systems can analyze
and structure a large amount of data, specifically what we searched for, what we looked at and
how long we remained on a particular page or website.
It helps in understanding user behaviors, requirements and preferences to make personalized
recommendations for every individual.
When a user regularly visits or purchases from online shopping sites such as eBay, each time the
person logs in the system can present a recommended list of products that may interest the user
on the basis of earlier purchases or searches, thus presenting a specially customized
recommendation set for every user. This is the power of big data analytics.
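As a rough illustration of recommending products from earlier purchases, the sketch below scores unseen products by how many purchases a user shares with other users. The user names, products and scoring rule are assumptions made for this example; it is not how eBay or any real site works.

```python
from collections import Counter

# Hypothetical purchase histories; names and products are made up.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse", "monitor"},
    "carol": {"mouse", "mouse pad"},
}


def recommend(user, history, top_n=3):
    """Suggest products bought by users who share purchases with `user`."""
    own = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(own & items)   # shared purchases act as a similarity score
        for item in items - own:     # only suggest products the user has not bought
            scores[item] += overlap
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked if score > 0][:top_n]


print(recommend("carol", purchases))  # ['laptop', 'keyboard', 'monitor']
```

A production system would work with far larger histories and would also use searches and browsing time, as described above.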
Types of Data:
Data is obtained primarily from the following types of sources:
Internal sources - organizational or enterprise data.
External sources - social data.
Comparison between internal and external sources of data:
Data source: Internal
Definition: Provides structured or organized data that originates from within the enterprise
and helps run business.
Examples of sources: Customer Relationship Management (CRM); Enterprise Resource Planning
(ERP); customer details; products and sales data.
Application: The current data in the operational system is used to support daily business
operations of an organization.

Data source: External
Definition: Provides unstructured or unorganized data that originates from the external
environment of an organization.
Examples of sources: Business partners; syndicate data suppliers; Internet; government; market
research organizations.
Application: This data is often analyzed to understand the entities mostly external to the
organization, like customers, competitors, market and environment.
Structured data: Data that has a defined repeating pattern. This pattern makes it
easier for any program to sort, read and process the data. Processing structured data is
much easier and faster than processing data without any specific repeating patterns.
Organized data in a predefined format.
Stored in tabular form.
Data that resides in fixed fields within a record or file.
Formatted data that has entities and their attributes mapped.
Used to query and report against predetermined data types.
Some sources of structured data are:
Relational databases in the form of tables.
Flat files in the form of records.
Multidimensional databases mainly used in data warehouse technology.
Legacy databases.
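Because structured data sits in fixed fields with predetermined types, it can be queried and reported on directly. A minimal sketch using Python's built-in sqlite3 module is given below; the table name, columns and rows are hypothetical.

```python
import sqlite3

# In-memory relational database with a fixed, predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT NOT NULL, region TEXT NOT NULL, amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("pen", "north", 120.0),
        ("pen", "south", 80.0),
        ("book", "north", 300.0),
    ],
)

# Report total sales per product against the predetermined columns.
totals = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(totals)  # [('book', 300.0), ('pen', 200.0)]
conn.close()
```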
Unstructured data: A set of data that might not have any logical or repeating
patterns.
It consists of metadata, i.e., the additional information related to data.
It comprises inconsistent data, such as data obtained from files, social media websites,
satellites, etc.
It consists of data in different formats, such as e-mails, text, audio, video or images.
Some sources of unstructured data are:
Text both internal and external to an organization – documents, logs, survey results,
feedback and e-mails from both within and across the organization.
Social media – data obtained from social networking platforms including YouTube,
Facebook, Twitter, LinkedIn, and Flickr.
Mobile data – text messages and location information.
About 80% of enterprise data consists of unstructured content.
Some of the challenges associated with unstructured data are as follows:
Identifying the unstructured data that can be processed.
Sorting, organizing and arranging unstructured data in different sets and formats.
Combining and linking unstructured data in a more structured format to derive any
logical conclusions out of the available information.
Costs in terms of the storage space and human resources (data analysts and scientists)
needed to deal with the exponential growth of unstructured data.
Unstructured data is also generated from files that often have the same name and extension.
For example, video files are generally stored with the extension .mp4 or .3gp, whereas audio
files have the extension .wav or .mp3. As different files of the same category can have the
same file name in different sources, a name and an extension alone do not help in data
identification, classification or even basic searches.
Semi-structured data: Also known as schema-less or self-describing structure. It is
data that is stored inconsistently in the rows and columns of a database.
Some sources of semi-structured data are:
File systems, such as web data in the form of cookies.
Data exchange formats, such as JavaScript Object Notation (JSON) data.
Sl. No.   Name                 E-mail
1         Sam Jacobs           smj@xyz.com
2         First name: David    Davidb@xyz.com
          Last name: Brown
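The two records in the table can be written as JSON to show what "self-describing" means in practice: each record names its own fields, and the fields differ between records. The sketch below, with field names chosen only for illustration, brings both records into one consistent shape.

```python
import json

# The same two records as in the table; the second describes the name differently.
raw = """[
  {"sl_no": 1, "name": "Sam Jacobs", "email": "smj@xyz.com"},
  {"sl_no": 2, "first_name": "David", "last_name": "Brown", "email": "Davidb@xyz.com"}
]"""


def full_name(record):
    """Return a single name string whichever way the record stores it."""
    if "name" in record:
        return record["name"]
    first = record.get("first_name", "")
    last = record.get("last_name", "")
    return (first + " " + last).strip()


records = json.loads(raw)
normalized = [
    {"sl_no": r["sl_no"], "name": full_name(r), "email": r["email"]} for r in records
]
print(normalized)
```

After this step both records have the same three fields and could be stored in a table like structured data.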
Velocity: The rate at which data is generated, captured and shared. Enterprises can
capitalize on data only if it is captured and shared in real time. Information processing
systems such as CRM and ERP face problems with data that keeps adding up but cannot be
processed quickly. These systems are able to attend to data in batches every few hours;
however, even this time lag causes the data to lose its importance, as new data is
constantly being generated. For example, eBay analyzes around 5 million
transactions per day in real time to detect and prevent frauds arising from the use of
PayPal.
Some sources of high-velocity data are:
IT devices, including routers, switches, firewalls, etc., which constantly generate valuable
data.
Social media, including Facebook posts, tweets and other social media activities, which create
a huge amount of data that has to be analyzed instantly because its value degrades quickly
with time.
Portable devices, including mobiles, PDAs, etc., which also generate data at a high speed.
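To contrast with batch processing every few hours, the sketch below handles each event the moment it arrives and keeps a running count of events in the last 60 seconds. The class name, window length and arrival times are assumptions made for illustration.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts the events seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        """Record one event and return how many events are still in the window."""
        self.timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Discard events that are now too old to matter.
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps)


counter = SlidingWindowCounter(window_seconds=60)
arrivals = [0, 10, 20, 75, 80]  # hypothetical arrival times in seconds
counts = [counter.add(t) for t in arrivals]
print(counts)  # [1, 2, 3, 2, 2]
```

Each count is available immediately, so a sudden burst can be acted on while the data still has value.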
Variety: Data is generated from different types of sources, such as internal, external,
social and behavioral, and comes in different formats, such as images, text, videos, etc.
Even a single source can generate data in varied formats; for example, GPS and social
networking sites such as Facebook produce data of all types, including text, images,
videos, etc.
Veracity: The uncertainty of data, i.e., whether the obtained data is correct or
consistent. Out of the huge amount of data that is generated in almost every process,
only the data that is correct and consistent can be used for further analysis. Data, when
processed, becomes information; however, a lot of effort goes into processing the data.
Big data, especially in its unstructured and semi-structured forms, is messy in nature,
and it takes a good amount of time and expertise to clean that data and make it
suitable for analysis.
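A small sketch of such cleaning is given below: records that are incorrect (an implausible age, a malformed e-mail) are dropped, and inconsistent duplicates are merged. The field names and validity rules are assumptions for this example only.

```python
def is_valid(record):
    """A record is usable only if its fields are present and plausible."""
    email = record.get("email") or ""
    age = record.get("age")
    return "@" in email and isinstance(age, int) and 0 < age < 120


def clean(records):
    """Keep valid records and drop duplicates that differ only in case or spacing."""
    seen = set()
    cleaned = []
    for record in records:
        if not is_valid(record):
            continue
        key = record["email"].strip().lower()  # normalise before comparing
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"email": key, "age": record["age"]})
    return cleaned


raw_records = [
    {"email": "smj@xyz.com", "age": 34},
    {"email": "not-an-email", "age": 29},    # incorrect e-mail
    {"email": "davidb@xyz.com", "age": -5},  # implausible age
    {"email": "SMJ@xyz.com ", "age": 34},    # inconsistent duplicate
]
cleaned = clean(raw_records)
print(cleaned)  # [{'email': 'smj@xyz.com', 'age': 34}]
```

Only one of the four raw records survives, which shows how much of the effort goes into cleaning rather than analysis.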
o Descriptive Analytics: It is the most prevalent form of analytics, and it serves as a base
for advanced analytics. It draws on a database to provide information on the trends of past or
current business events that can help managers, planners, leaders, etc. to develop a road
map for future actions. It performs an in-depth analysis of data to reveal details such as
the frequency of events, operation costs, and the underlying reasons for failures. It helps in
identifying the root cause of a problem.
o Predictive Analytics: It is about understanding and predicting the future. It predicts
near-future probabilities and trends and helps in what-if analysis. In this analysis we use
statistics, data mining techniques, and machine learning to forecast the future.
o Prescriptive Analytics: This analysis is based on complex data obtained from descriptive
and predictive analyses. Using optimization techniques, prescriptive analytics
determines the best alternative to minimize or maximize some objective in finance,
marketing, and many other areas.
Example: If we must find the best way of shipping goods from a factory to a destination so as
to minimize costs, we use prescriptive analytics. The data, which is available in abundance,
can be streamlined for growth and expansion in technology as well as business.
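The shipping example can be sketched as a cheapest-route search. The network, the location names and the costs below are hypothetical, and Dijkstra's algorithm is used here only as one well-known optimization technique; the text does not say which method a real system would use.

```python
import heapq

# Hypothetical shipping network: cost of moving goods between locations.
costs = {
    "factory": {"port": 4, "rail_hub": 2},
    "rail_hub": {"port": 1, "warehouse": 7},
    "port": {"warehouse": 3},
    "warehouse": {"destination": 1},
    "destination": {},
}


def cheapest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total_cost, path), or (inf, []) if unreachable."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, step_cost in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + step_cost, neighbour, path + [neighbour]))
    return float("inf"), []


print(cheapest_route(costs, "factory", "destination"))
# (7, ['factory', 'rail_hub', 'port', 'warehouse', 'destination'])
```

The direct factory-to-port leg looks shorter, but going through the rail hub is cheaper overall, which is the kind of recommendation prescriptive analytics is meant to produce.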
Analytical approaches:
Analytics is analysis. Once analytical skills are learned, they can be applied to
numerous situations simply by having a questioning attitude and following the
scientific method. Often, more questions are generated when answers are obtained, some
leading to significant discoveries and others proving far less useful. This part gives a
foundation that defines analytics and big data, and then details a few approaches showing
how analyses can be performed.
Behavioural analysis: How will a business leverage complex data in order to create new
models for the following?
Decreasing business costs.
Converting an audience to a customer.
Improving overall customer satisfaction.
Data interpretation: Which data should be analysed for new product innovation?
Advantages of Big data Analytics:
There are numerous advantages of processing Big Data analytics in real time. Errors within
the organization are known quickly. New strategies can be implemented to improve service
dramatically. Fraud can be detected the moment it occurs, which brings cost savings. There
are better sales insights, and it becomes easier to keep up with customer trends.
Example:
In a manufacturing unit, data analytics can improve the functions of the following
processes:
Challenges:
o Data Quality and Integration: When such a huge amount of data is stored, there is a
high chance of the data being redundant and, on occasion, even inauthentic. In big data
systems there is a lot of duplicated data, which only creates confusion and added cost.
This eventually leads to misleading conclusions and unreliable results.
o Governance: This is an issue that every business faces. A business should be authentic and
compliant with the law. Every country has its own terms and conditions, which one must
adhere to.
o Data Segmentation: There are times when a real estate office needs to segment its
data based on various parameters, like gender, age, income group, location and budget.
Different ways of doing this include customer segmentation, market segmentation, product
segmentation and so forth. The segmentation is chosen based on a decision tree technique,
CART or a regression-based method. This separation is tedious and takes a long time to sort.
o Data modelling: Even if you have data, without the skill sets to interlink everything and
come to a conclusion, your data is useless.
o Business Intelligence: Mastering this area is a huge task. There is always a chance of
missing some factors, so that the measures taken do not give the desired results.
Figure: Pie chart titled "Sales", with one segment per industry sector (percentage values
not recoverable): Professional, Scientific, and Technical Services; Information;
Manufacturing; Retail Trade; Sustainability, Waste Management and Remediation Services;
Finance and Insurance; Wholesale Trade; Educational Services; Other Services (except Public
Administration); Accommodation and Food Services; Health Care and Social Assistance; Real
Estate, Rental and Leasing; Construction; Transportation and Warehousing; Public
Administration; Management of Companies and Enterprises; Arts, Entertainment and Recreation;
Mining, Quarrying and Oil and Gas Extraction; Utilities; Agriculture, Fishing and Hunting.
The most common job titles in Big Data include:
Big Data analyst
Data scientist
Big Data developer
Big Data administrator
Big Data engineer
Skills Required
Big Data professionals can have various educational backgrounds, such as
econometrics, physics, biostatistics, computer science, applied mathematics, or engineering.
Data scientists mostly possess a master’s degree or Ph.D. because it is a senior position and
often achieved after considerable experience in dealing with the data. Developers generally
prefer implementing Big Data by using Hadoop and its components.
Technical Skills
A Big Data analyst should possess technical skills such as knowledge of natural
language processing, statistical analysis, analytical tools, machine learning, and conceptual
and predictive modelling.
A Big Data developer should possess programming skills in Java, Hadoop, Hive,
HBase, and HQL, and an understanding of HDFS, MapReduce, ZooKeeper, Flume, and Sqoop.
Soft Skills
Organizations look for professionals who possess good logical and analytical skills, with
good communication skills and an affinity toward strategic business thinking.
The preferred soft skills requirements for a Big Data professional are:
Strong written and verbal communication skills.
Analytical ability.
Basic understanding of how a business works.
Most organizations today consider data and information to be their most valuable and
differentiated asset. By analysing this data effectively, organizations worldwide are now
finding new ways to compete and emerge as leaders in their fields to improve decision
making and enhance their productivity and performance. At the same time, the volume and
variety of data are also increasing at an immense rate every day. The global phenomenon of
using Big Data to gain business value and competitive advantage will only continue to grow
as will the opportunities associated with it.
Research conducted by MGI and McKinsey’s Business Technology Office suggests that the
use of Big Data is likely to become a key basis of competition for individual firms,
underpinning success and growth and strengthening consumer surplus, production growth, and
innovation. The future of Big Data is not about numeric data points but instead about asking
the deeper questions and finding out why consumers make the decisions they do.
Today, clients often ask about the future of big data and what the next step is; how can we
leverage data on an even deeper level in order to extract meaningful
consumer insights that go beyond where we are now? Most of the standard answers are
around the ability to get data and insights in real time and from more devices than ever. It’s
time we move beyond structured data and into the prime time of text analytics.
For us, the easiest way to get started with Big Data 2.0 is to focus on the unstructured data
we collect every day. This can be reviews, customer support emails, community forums, or
even your own CRM systems. The simplest way to look at this data is through a process
called text analytics.
Text analytics is a fairly straightforward process that breaks down as follows:
Transforming & pre-processing - Cleaning and formatting the data to make it easier to
read.
Enrichment - Enhancing the data by adding additional data points.
Processing - Performing specific analyses and classifications on the data.
Frequencies & analysis - Evaluation of the results and transition into numerical
indicators.
Mining - Actual extraction of information.
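The five steps above can be sketched on two short customer reviews. The reviews, the stop-word list and the "source" tag are all made up for illustration, and each function is only a stand-in for what a real text analytics tool would do at that step.

```python
import re
from collections import Counter

# Hypothetical unstructured input: two short customer reviews.
reviews = [
    "Great battery life!!  Delivery was SLOW.",
    "Battery died after a week. Support was great.",
]

STOP_WORDS = {"a", "after", "was", "the"}  # tiny illustrative list


def preprocess(text):
    """Transforming & pre-processing: lowercase, strip punctuation, split into words."""
    return re.findall(r"[a-z]+", text.lower())


def enrich(tokens, source):
    """Enrichment: attach an additional data point (here, where the text came from)."""
    return {"tokens": tokens, "source": source}


def process(doc):
    """Processing: keep only the content words."""
    return [t for t in doc["tokens"] if t not in STOP_WORDS]


def frequencies(docs):
    """Frequencies & analysis: turn the text into numerical indicators."""
    return Counter(word for doc in docs for word in doc)


docs = [process(enrich(preprocess(r), "review")) for r in reviews]
counts = frequencies(docs)

# Mining: extract the terms that customers mention most often.
print(counts.most_common(2))
```

Here "great" and "battery" each appear twice, so they surface as the most discussed terms across the two reviews.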
Figure: YouTube users upload 72 hours of new video; Twitter users send over 3 lakh tweets.
Social Network Analysis is the analysis performed on the data obtained from social media. As the data
is generated in huge volumes, it results in the formation of a Big Data pool.
It is not difficult to keep track of a thousand users, but it becomes difficult when it comes to one million
direct connections between these thousand users, and another one billion connections when friends of
friends are taken into consideration. Extracting, obtaining, and analysing data from every single point
of connection is the challenge faced by social network analysis.
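The figures of one million and one billion can be reproduced with simple counting, under the assumption (made here only for the arithmetic) that every user is connected to every other user and that direction matters.

```python
def direct_connections(users):
    """Ordered pairs (a, b) with a != b: every user linked to every other user."""
    return users * (users - 1)


def friend_of_friend_paths(users):
    """Ordered chains a -> b -> c made of three distinct users."""
    return users * (users - 1) * (users - 2)


n = 1000
print(direct_connections(n))      # 999000, roughly one million
print(friend_of_friend_paths(n))  # 997002000, roughly one billion
```

The number of users grows by nothing, yet each extra hop multiplies the connections to analyse by about a thousand.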
Social media analytics is nowadays used for online reputation management, crisis management, lead
generation, and brand checks to measure campaign reports, and much more.
The following are the areas in which decision-making processes are influenced by social network data:
Business Intelligence
Marketing
Product design and development
Business Intelligence
Business intelligence is a data analytics process that converts a raw dataset into meaningful information by
using different techniques and tools for boosting business performance. This system allows a
company to collect, store, access, and analyse data for adding value to decision making.
The data generated from different social media is analyzed to gain important business insights.
Social customer relationship management data is the latest catch phrase used these days to describe
this type of data. Such a data analysis helps in changing the perspective of an organization while
valuing its customers. Instead of valuing a single customer, organizations can now calculate the value
of the entire network that is influenced by that customer.
Some organizations reward their influential customers with discounts and offers, and these customers
in turn keep on spreading a positive brand image of the organization. Social networking sites such as
LinkedIn or Facebook can obtain insights on the advertisements that most users prefer. This is
achieved by designing advertisements based on interests, likes, and preferences that customers as well
as their circle of friends, contacts, and colleagues have personally opted for.
MARKETING
Today, the preferences of consumers have changed due to their busy schedules. They no longer have
the time to read newspapers thoroughly, watch all the TV commercials, or go through all the emails
they receive in their inbox. In today’s competitive scenario, marketers aim to deliver what consumers
want by using interactive communication across digital channels such as email, mobile, social, and the
web.
These channels, in turn, generate the social data required to provide insights based upon the brand
preferences of a target audience, the tone of its voice, the other brands it discusses, its interests, and
other information. Conducting social network analysis of this data can generate very useful and
meaningful business insights that may help organisations to take timely and informed decisions.
When a comparison is made between the efforts spent on the social media marketing platform and the
e-mail marketing platform, it is found that the marketing efforts spent on the former yield more
returns as compared to the latter.
Affiliate marketing is a reward-based marketing structure, where an affiliated company uses its own
marketing effort to bring in customers for another company and, in turn, is rewarded by the company
that benefits. Websites such as couponmountain.com earn multimillion-dollar revenues a year by
doing affiliate marketing for the brands they promote.
PRODUCT DESIGN AND DEVELOPMENT
A system that is able to represent a sentiment as data with a high degree of accuracy provides the
client a means to access information on a social platform. To be able to measure sentiments more
meticulously is of great value while designing a product or service. Brands must understand the
importance of the demographic information they receive to devise better-targeted products and
programs.
Sentiment analysis refers to a computer programming technique to analyse human emotions,
attitudes, and views across popular social networks, including Facebook, Twitter, and blogs. The
technique requires analytic skills as well as advanced computing skills. However, this technique is
still evolving, and the full potential of sentiment analysis is yet to be explored by marketers and other
business professionals. Most organizations today simply rely on the number of likes, tweets, and
comments, instead of actually studying the quality of the sentiments expressed in the conversations.
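To show the difference between counting comments and studying their sentiment, the sketch below scores each post with a tiny word list. The lexicon and the posts are hypothetical, and this word-counting approach is only the simplest possible stand-in for the still-evolving techniques described above.

```python
import re

# Tiny hypothetical lexicon; real systems use far larger, curated resources.
POSITIVE = {"love", "great", "good", "excellent"}
NEGATIVE = {"hate", "bad", "poor", "terrible"}


def sentiment_score(text):
    """Positive words minus negative words found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for w in words if w in POSITIVE)
    negatives = sum(1 for w in words if w in NEGATIVE)
    return positives - negatives


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


posts = [
    "I love this phone, the camera is great",
    "Terrible battery and poor support",
    "It arrived on Monday",
]
labels = [label(p) for p in posts]
print(labels)  # ['positive', 'negative', 'neutral']
```

All three posts would count equally as "comments", but only the labels reveal that one of them is a complaint.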
USE OF BIG DATA IN PREVENTING FRAUDULENT ACTIVITIES
A fraud can be defined as the false representation of facts, leading to concealment or distortion of the
truth. Frauds that occur frequently in financial institutions, such as banks and insurance and healthcare
companies, or that involve any type of monetary transaction, as in the retail industry, are called financial
frauds. In such cases, online retailers such as Amazon and eBay tend to incur huge expenses and losses.
The following are some of the most common types of financial frauds:
Credit Card Fraud – This type of fraud is common these days and is related to the use of
credit card facilities. In an online shopping transaction, the online retailer cannot see the
authentic user of the card, and therefore the valid owner of the card cannot be verified. It is
quite likely that a fake or a stolen card is used in the transaction.
Exchange or Return Policy Fraud – An online retailer always has a policy allowing the
exchange and return of goods, and sometimes people take advantage of this policy. These
people buy a product online, use it, and then return it on the grounds that they are not
satisfied with the product.
Personal Information Fraud – In this type of fraud, people obtain the login information of a
customer and then log in to the customer’s account, purchase a product online, and then
change the delivery address to a different location. The actual customer keeps on calling the
retailer to refund the amount, as he or she has not made the transaction. Once the transaction is
proved fraudulent, the retailer has to refund the amount to the customer.
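A very simple way to act on the patterns above is a set of rules checked on every order. The sketch below flags an order when the delivery address is new for the account or the amount is far above the customer's usual spending; the field names, the threshold factor and the sample data are assumptions for illustration, not a real retailer's rules.

```python
def flag_suspicious(order, known_addresses, usual_max_amount, factor=3.0):
    """Return the list of reasons why an order looks suspicious (empty if none)."""
    reasons = []
    if order["delivery_address"] not in known_addresses:
        reasons.append("new delivery address")
    if order["amount"] > factor * usual_max_amount:
        reasons.append("unusually large amount")
    return reasons


# Hypothetical account history.
known_addresses = {"12 Park Street"}
usual_max_amount = 100.0

normal_order = {"delivery_address": "12 Park Street", "amount": 80.0}
odd_order = {"delivery_address": "99 Harbour Road", "amount": 950.0}

print(flag_suspicious(normal_order, known_addresses, usual_max_amount))  # []
print(flag_suspicious(odd_order, known_addresses, usual_max_amount))
# ['new delivery address', 'unusually large amount']
```

Flagged orders could then be held for verification before shipping, so that the fraud is caught before a refund becomes necessary.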
Summary
This topic introduced Big Data, the big buzzword of today’s IT industry. We discussed some
common features and sources of Big Data, along with the various types of Big Data. We also
discussed the four V’s of Big Data, that is, Volume, Velocity, Variety and Veracity. We saw
where Big Data is used across various domains, became familiar with the professional
opportunities available in the Big Data career path, and considered the importance of
conducting Big Data analytics in a business context. Towards the end, we learned about the
use of Big Data in preventing fraudulent activities.