SOURCES OF DATA
The defining characteristics of Big Data (volume, velocity, and variety) were articulated by Doug Laney around 2001; he was then at META Group, which Gartner later acquired.
Activity 1: Data Scenarios
1. An ecommerce site gets thousands of transactions and millions of clicks (“events”) in a day.
2. The HR team is sitting on the last ten years of employee attrition data, trying to figure out how it can be used.
3. A retail store gets an average of 5,000 footfalls a day, 50% of which convert into transactions.
4. A hospital and diagnostic healthcare provider serves 10,000 patients in a month across hundreds of disease categories.
Big Data @ Amazon
Data Explosion !!
1. Data is created constantly, and at an ever-increasing rate.
4. Merely keeping up with this huge influx of data is difficult, but substantially more
challenging is analysing vast amounts of it, especially when it does not conform to
traditional notions of data structure, to identify meaningful patterns and extract
useful information.
5. These challenges of the data deluge present the opportunity to transform business,
government, science, and everyday life.
Changing Face of Data
• Big Data refers to the exponential growth and availability of data, both structured and unstructured, driven by the Internet and
fast-growing technologies.
Drivers of Big Data
3. Video surveillance, such as the thousands of video cameras spread across a city
4. Mobile devices, which provide geospatial location data of the users, as well as metadata about text
messages, phone calls, and application usage on smart phones
5. Smart devices, which provide sensor-based collection of information from smart electric grids, smart
buildings, and many other public and industry infrastructures
6. Non-traditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS
navigation systems, and seismic processing
Examples of Data generated
• Social media and genetic sequencing are among the fastest-growing sources of
Big Data and are examples of non-traditional data sources being used for analysis.
• In 2012 Facebook users posted 700 status updates per second worldwide, which
can be leveraged to deduce latent interests or political views of users and show
relevant ads.
• Facebook can also construct social graphs to analyze which users are connected
to each other as an interconnected network.
• Big Data can be applied to real-time fraud detection, complex competitive
analysis, call centre optimization, consumer sentiment analysis, intelligent traffic
management, and smart power grid management, to name only a few applications.
• Genetic sequencing and human genome mapping provide a detailed
understanding of genetic makeup and lineage.
• The health care industry can predict which illnesses a person is likely to get in
their lifetime and take steps to avoid these maladies or reduce their impact through
the use of personalized medicine and treatment.
• Pharmaceutical companies can use this data when developing medications,
heightening awareness of the risks of specific drug treatments.
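The social-graph point in the list above can be illustrated with a small sketch. It is not how Facebook builds its graph; it only shows the basic idea of storing users as nodes and friendships as edges, then asking whether two users are connected. The user names and friendship pairs are made up for illustration.

```python
from collections import defaultdict, deque

# Made-up friendship pairs; each pair is treated as an undirected edge.
friendships = [("ana", "ben"), ("ben", "chen"), ("dev", "eli")]

# Adjacency list: each user maps to the set of users they are friends with.
graph = defaultdict(set)
for a, b in friendships:
    graph[a].add(b)
    graph[b].add(a)

def connected(graph, start, goal):
    """Breadth-first search: is there a chain of friendships from start to goal?"""
    seen = {start}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if user == goal:
            return True
        # .get avoids creating empty entries for unknown users.
        for friend in graph.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return False

print(connected(graph, "ana", "chen"))  # ana -> ben -> chen
print(connected(graph, "ana", "dev"))   # no chain links these two groups
```

At real social-network scale the same idea would run on a distributed graph store rather than an in-memory dictionary.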
Different Sources of Data
The “BIG” Dilemma!
Volume
Velocity
Variety
Variability
Veracity
Visualisation
Value
Characteristics of Big Data – The Seven Vs
Volume
• Volume is how much data we have – what used to be measured in Gigabytes (GB) is now measured in
Zettabytes (ZB) or even Yottabytes (YB). The IoT (Internet of Things) is creating exponential
growth in data, and the volume of data is projected to keep growing significantly in the coming years.
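To get a feel for the jump in scale from Gigabytes to Zettabytes, here is a minimal sketch that converts between decimal (SI) storage units, where each unit is 1000 times the previous one. The `convert` helper is a hypothetical name used only for this illustration.

```python
# Decimal (SI) storage units, each 1000 times larger than the one before it.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def convert(value, from_unit, to_unit):
    """Convert a quantity between decimal storage units."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * (1000 ** steps)

gb_per_zb = convert(1, "ZB", "GB")
print(f"1 ZB = {gb_per_zb:,} GB")
```

One Zettabyte works out to a trillion Gigabytes, and a Yottabyte is a thousand times larger again.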
Velocity
• Velocity is the speed at which data is processed and becomes accessible. I remember the days of
nightly batches; now, if it’s not real-time, it’s usually not fast enough.
Variety
• Variety describes one of the biggest challenges of big data. The data can be unstructured and can include
many different types, from XML to video to SMS. Organizing the data in a meaningful
way is no simple task, especially when the data itself changes rapidly.
Variability
Variability is different from variety. A coffee shop may offer 6 different blends of coffee, but if you get the same blend
every day and it tastes different every day, that is variability. The same is true of data: if the meaning is constantly
changing, it can have a huge impact on your data homogenization.
Veracity
Veracity is all about making sure the data is accurate, which requires processes to keep the bad data from accumulating
in your systems. The simplest example is contacts that enter your marketing automation system with false names and
inaccurate contact information. How many times have you seen Mickey Mouse in your database? It’s the classic
“garbage in, garbage out” challenge.
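A veracity check like the one described above can be sketched as a simple filter applied before contacts enter the system. This is only an illustration: the blocklist, the loose email pattern, and the sample contacts are all assumptions, and a real marketing automation system would use its own validation rules.

```python
import re

# Hypothetical blocklist of obviously fake names; a real system would maintain its own.
FAKE_NAMES = {"mickey mouse", "donald duck", "test test"}

# Deliberately loose pattern: something@something.something, with no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid_contact(contact):
    """Reject contacts with a missing or blocklisted name, or a malformed email."""
    name = contact.get("name", "").strip().lower()
    email = contact.get("email", "").strip()
    if not name or name in FAKE_NAMES:
        return False
    return bool(EMAIL_PATTERN.match(email))

# Made-up sample records.
contacts = [
    {"name": "Asha Rao", "email": "asha.rao@example.com"},
    {"name": "Mickey Mouse", "email": "mickey@example.com"},
    {"name": "Ravi Kumar", "email": "not-an-email"},
]

clean = [c for c in contacts if is_valid_contact(c)]
print([c["name"] for c in clean])
```

Only the first record passes: the second is caught by the name blocklist and the third by the email check.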
Visualization
Visualization is critical in today’s world. Using charts and graphs to visualize large amounts of complex data is much
more effective in conveying meaning than spreadsheets and reports chock-full of numbers and formulas.
Value
Value is the end game. After addressing volume, velocity, variety, variability, veracity, and visualization – which takes a
lot of time, effort and resources – you want to be sure your organization is getting value from the data.
INTELLIGENCE
The 3 traits:
• Dynamic
• Interpretable
• Ready for action
ANALYTICS
The 3 types:
• Descriptive
• Predictive
• Prescriptive
• Structurally sound data has many advantages: it can be added seamlessly to a relational
database and is easily searchable with the simplest search operations or algorithms.
Unstructured data, by contrast, is a nightmare for designers, who must connect its random
strands to the existing meaningful data and present the result as a structure.
Semi-structured Data
Product_Id    Product_Name    Product_Price
1             Pen             INR 5
2             Paper           INR 10
• A structurally sound entity has a well-defined arrangement, an easy-to-understand structure, and a comprehensible
hierarchy.
• Such data can be added seamlessly to a relational database and is easily searchable with the simplest search
operations or algorithms.
• Structured data is closer to machine language than unstructured data.
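As a minimal sketch of why structured data is easy to store and search, the product table above can be loaded into an in-memory SQLite database and queried with one line of SQL. The table and column names (`product`, `product_price_inr`) are chosen here for illustration only.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    "product_id INTEGER PRIMARY KEY, "
    "product_name TEXT, "
    "product_price_inr INTEGER)"
)

# The two rows from the product table above.
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "Pen", 5), (2, "Paper", 10)],
)

# Because every row has the same well-defined columns, searching is a single query.
rows = conn.execute(
    "SELECT product_name, product_price_inr FROM product WHERE product_price_inr > ?",
    (5,),
).fetchall()
print(rows)
conn.close()
```

The query returns only the products priced above INR 5, with no custom parsing needed.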
Unstructured Data
Unstructured data generally has no organizing structure, and Big Data technologies use
different ways to add structure to this data. A typical example of unstructured data is a
heterogeneous data source containing a combination of simple text files, images, videos, etc.
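The text above does not name specific techniques for adding structure, so the following is just one simple possibility: using regular expressions to pull fields out of free text. The sample notes, the patterns, and the `extract` helper are all made up for illustration.

```python
import re

# Made-up free-text notes standing in for unstructured input.
notes = [
    "Order 1043 shipped to Pune on 2023-04-12",
    "Customer called about order 2210, refund issued 2023-05-02",
    "General feedback: great service!",
]

ORDER = re.compile(r"order\s+(\d+)", re.IGNORECASE)
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract(note):
    """Turn one free-text note into a record with named fields (None if absent)."""
    order = ORDER.search(note)
    date = DATE.search(note)
    return {
        "order_id": int(order.group(1)) if order else None,
        "date": date.group(0) if date else None,
        "text": note,
    }

records = [extract(n) for n in notes]
for record in records:
    print(record["order_id"], record["date"])
```

Notes that mention no order or date still produce a record, with the missing fields left as `None`, so the output has a uniform shape that downstream tools can query.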
The Power of Big Data
• Big Data can bring “big value” to almost every aspect of our lives.
• Technologically, Big Data is bringing about changes in our lives because it allows diverse and
heterogeneous data to be fully integrated and analyzed to help us make decisions.
• Today, with Big Data technology, thousands of data points from seemingly unrelated areas can help support
important decisions.
Links : https://www.youtube.com/watch?v=-Gj93L2Qa6c
Big Data Eco-System
Who Uses Big Data?
1. Banking
2. Government
3. Education
4. Healthcare
5. Manufacturing
6. Retail
SOURCES OF DATA
SECONDARY DATA
PRIMARY DATA
MERITS
• Easy and quick
• Economical
• Reliable data available
• Absence of interviewee’s bias
• Convenience
• Suitable to small firms
DEMERITS
• May not be exactly as per needs
• Needs modification
• Testing required
• Too much dependence undesirable
• Secondary method
• Lacks practical-orientation
INTERNAL SOURCE vs EXTERNAL SOURCE
Coverage
• Internal: Limited coverage, as they relate to the company only.
• External: Wide coverage, as they are varied in character.
Reliability
• Internal: Internal sources are more reliable, as they supply accurate data. Verification of data is not required.
• External: External sources may not supply accurate data. Naturally, verification of data before actual use is necessary.
Availability
• Internal: Internal sources are easily available, and without any extra cost.
• External: External sources are not easily available. Money is required to be spent on them.