and techniques, from gaining insight heretofore unavailable, to significantly reducing the cost and/or time necessary to achieve business benefits. Also covered is the newfound organizational ability to analyze data generated by devices and social media, as well as other unstructured and semi-structured data.

What stands out amid these abundant capability and benefit claims is the lack of a universally accepted definition of big data. According to IDC, “Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.”
In other words, this definition incorporates all data types managed by next-generation systems that must scale to handle ever-increasing user workloads and data volumes.

On the other hand, McKinsey & Co. defines big data as, “Datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.”
This suggests that big data’s size is relative to the effectiveness of the technology that handles it and that what constitutes big data today will not likely be big data tomorrow.

All this said, there is little wonder why a wide spectrum of big data approaches, and big data results, exists.
Most gurus consider the Apache Software Foundation’s Hadoop technology stack as the quintessential big data platform (see Figure 1). This stack actually comprises a small number of components and does not completely address key issues pertaining to real-time analytics, data security and operations. Customers frequently select one of the commercially available solutions to address these issues.

The problem is that all the leading big data solution vendors are still scrambling to fill operational, visualization and information discovery gaps while also planning major product changes needed over the next six months to a year.

A hidden message is written between the lines of this big data story. Consider this:
The capabilities available in the multitude of commercial solutions vary significantly between channels and continue to diverge.
The current Hadoop stack is aimed at batch processing and is not tailored for real-time processing.
Various tool sets are evolving rapidly and dramatically, and this technological progression is expected to continue.

cognizant 20-20 insights
Components of Apache’s Hadoop Platform
Sqoop: relational database data collector
Flume | Chukwa: log data collector
Hadoop MapReduce: distributed processing framework
HDFS: Hadoop distributed file system
R: statistics
Mahout: machine learning
Pig: data flow
Hive: data warehouse
Oozie: workflow
ZooKeeper: coordination
Ambari: provisioning, managing and monitoring Hadoop clusters
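
The batch orientation of the Hadoop stack noted earlier follows from the MapReduce programming model at its core: a job reads its input, groups the intermediate results by key, and only then produces output. The following is a minimal single-process sketch of that model in Python, assuming a simple word-count job. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_batch_job`) and the sample input are hypothetical, chosen for illustration only; this is not Hadoop's API and it does not distribute work across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key; this consumes the entire map output."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def run_batch_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the whole input."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input standing in for a log file.
    sample_lines = [
        "error disk full",
        "warning disk slow",
        "error network down",
    ]
    print(run_batch_job(sample_lines))
```

In this sketch the reduce step cannot begin until every map result has been grouped, so no answer is available until the whole input has been processed. That is one way to see why the model suits batch workloads better than real-time ones.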