Introduction to Data Science

What is Data?
Data is the big word of our era: the amount of data that we are creating keeps growing. If you think about how much data humanity generated from prehistoric times up until 2003, we now generate a comparable amount in a very short span of time, and the real question is what we can do with it. Almost everything we do generates data: sending an email, liking a photo, booking a cab, applying a filter to a picture, making a purchase. Event data of this kind includes clickstreams, transactions, application logs, sensor readings, and social media activity. At the same time, the capacity of hard drives has grown enormously while the cost per byte has fallen, so storing all of this data has become feasible.

Data does not have to be big to be challenging. Data can be messy, incomplete, or hard to obtain, and even small data sets require careful collection and interpretation.

Some guiding principles when working with data:
- You do not want to improve the data as such; you want to improve the process and the decisions made with it. In the end it is the process and the decisions that matter, not the data itself.
- Not all problems are data problems.
- The goal is to extract value from data.
- Data can be collected from various sources and comes in structured, semi-structured, and unstructured formats.

What is Data Science?
Data science is a combination of multiple disciplines that uses statistics, data analysis, and machine learning to analyze data and to extract knowledge and insights from it. In short, data science is about how we collect, store, and analyze data, and how we turn it into value.

Data science supports several kinds of analysis, including:
- Predictive analytics: what will happen next?
- Pattern discovery: finding patterns that may be hidden in the data.

Why is Data Science Needed?
Data science is used in many industries in the world today, for example banking, consultancy, healthcare, and entertainment.

How Does a Data Scientist Work?
Data science work draws on several areas, including statistics, machine learning, programming, algorithms, mathematics, and databases. A data scientist must find patterns in data and must stay up to date with recent developments in the field. Data science is basically about collecting, analyzing, and interpreting data so that the data can be categorized and put to use.

Where is Data Science Needed?
1. Internet search
2. Digital advertisement
3. Recommender systems

Types of analytics used in industry:
a) Descriptive analytics: answers questions about what happened. These techniques summarize large datasets to describe outcomes to stakeholders.
b) Diagnostic analytics: answers questions about why something happened, for example by identifying anomalies in the data.
c) Predictive analytics: answers questions about what will happen in the future, using techniques such as machine learning and regression.
d) Prescriptive analytics: answers questions about what should be done next.

Big Data vs Data Science
- Big data: processing huge volumes of data that traditional systems cannot handle. Application areas include e-commerce, security, sales, telecommunication, and advertisement. Tools include Hadoop, Spark, and Flink.
- Data science: analyzing data to understand it and generate insights for decision making.
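Predictive analytics can be made concrete with a tiny example: fit a simple least-squares line to past observations and use it to forecast the next value. The numbers and the "sales" scenario are invented for illustration, and a real project would also validate the model on held-out data:

```python
from statistics import mean

# Past observations: (month, sales) pairs -- invented data.
xs = [1, 2, 3, 4, 5]
ys = [10.0, 12.0, 14.0, 16.0, 18.0]

# Descriptive step: summarize what happened.
print("average sales:", mean(ys))  # 14.0

# Predictive step: fit y = a + b*x by ordinary least squares.
x_bar, y_bar = mean(xs), mean(ys)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

prediction = a + b * 6  # what will happen next month?
print("forecast for month 6:", prediction)  # 20.0
```

The descriptive step answers "what happened", while the fitted line answers "what will happen next", which is exactly the split between descriptive and predictive analytics described above.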
A typical statistical modeling workflow:
- State a hypothesis.
- Collect data.
- Determine which variables to explain and which ones to predict.
- Select possible models that might be useful.
- Implement the models.
- Evaluate the models and select the best one.
- Validate the selected model.

What is Data Science? (summary)
- A combination of mathematics, statistics, and programming.
- The context in which data is being generated.
- Ways of capturing data and the quality of the data captured.
- The ability to look at data differently.
- Large amounts of data from various sources that traditional data-processing systems cannot deal with.

What is Data Analytics? (summary)
- Discovering useful patterns in raw data.
- Supporting decision making.
- Involves inspecting, cleansing, and transforming data.
- Uses qualitative and quantitative techniques.

Big data is characterized by volume, variety, and velocity. A familiar example of data-driven recommendation: if you search for headphones today, you will see advertisements for headphones for the next few days. All of this is driven by the explosion of user-generated and stored data.

What is Big Data?
The most common myth associated with Big Data is that it is just about the size or volume of data. Actually, it is not just about the "big" amounts. Big Data refers to the large amounts of data pouring in from various data sources in different formats. Even previously there was huge data being stored in databases, but because of the varied nature of this data, the traditional relational database systems are incapable of handling it. Big Data is much more than a collection of datasets with different formats; it is an important asset which can be used to obtain enumerable benefits.

The three different formats of big data are:
1. Structured: Organised data format with a fixed schema. Ex: RDBMS
2. Semi-Structured: Partially organised data which does not have a fixed format. Ex: XML, JSON
3. Unstructured: Unorganised data with an unknown schema. Ex: audio and video files, etc.

Characteristics of Big Data
The five V's of Big Data are commonly cited, but as the data keeps evolving, so do the V's. Several more V's have developed gradually over time:
- Validity: correctness of data
- Volatility: tendency to change in time
- Vulnerability: vulnerable to breach or attacks
- Visualization: visualizing meaningful usage of data

Big data can be described by the following characteristics: Volume, Variety, Velocity, and Variability.

(i) Volume — The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining value out of data. Also, whether particular data can actually be considered Big Data or not depends on the volume of data. Hence, "Volume" is one characteristic which needs to be considered while dealing with Big Data solutions.

(ii) Variety — The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.

(iii) Velocity — The term "velocity" refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Variability — This refers to the inconsistency which can be shown by the data at times, thus hampering the process of being able to handle and manage the data effectively.

The capacity of hard disk drives keeps growing while the cost per byte keeps decreasing, which is one reason so much data can now be stored. At the same time, the velocity of data, the speed at which it is generated and moves, keeps increasing. Data does not have to be big to be challenging: even modest data sets can be hard to obtain, clean, and interpret, and the value lies in the questions we ask of the data and the decisions it supports.
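The three data formats described earlier (structured, semi-structured, unstructured) can be made concrete with a small sketch; the sample records and field names are invented for illustration:

```python
import csv
import io
import json

# Structured: fixed schema, every record has the same fields (like an RDBMS table).
structured = io.StringIO("id,name,amount\n1,alice,20\n2,bob,35\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing but no fixed schema; records may differ in shape.
semi = '[{"id": 1, "name": "alice"}, {"id": 2, "tags": ["new", "vip"]}]'
records = json.loads(semi)

# Unstructured: raw bytes with no schema known up front (audio, video, free text).
unstructured = b"\x00\x01\x02"  # e.g. the first bytes of a media file

print(rows[0]["name"])         # structured fields are addressed by schema
print(records[1].get("name"))  # semi-structured fields may be missing -> None
```

The practical difference: structured data can be queried by column directly, semi-structured data has to be probed defensively (fields may be absent), and unstructured data needs specialized processing before any querying is possible.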
Unit I: Introduction to Data Science and Big Data

What is Data?
Data is everywhere and part of our daily lives in more ways than most of us realize. The amount of data that we create is growing exponentially. According to estimates, in 2021 there will be 74 zettabytes of generated data, and that figure is expected to double by 2024.

What is Data Science?
Dealing with unstructured and structured data, data science is a field that comprises everything that is related to data cleansing, preparation, and analysis. Data science is the combination of statistics, mathematics, programming, problem-solving, capturing data in ingenious ways, the ability to look at things differently, and the activity of cleansing, preparing, and aligning data. This umbrella term includes various techniques that are used when extracting insights and information from data.

What is Big Data?
Big data refers to significant volumes of data that cannot be processed effectively with the traditional applications that are currently used. The processing of big data begins with raw data that isn't aggregated and is most often impossible to store in the memory of a single computer. A buzzword that is used to describe immense volumes of data, both unstructured and structured, big data can inundate a business on a day-to-day basis.
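Because such data often cannot fit in a single machine's memory, it is processed incrementally rather than all at once. Below is a minimal sketch of chunked (out-of-core) aggregation; the file contents and chunk size are invented for illustration:

```python
import os
import tempfile

def sum_amounts(path, chunk_lines=2):
    """Aggregate a large file without loading it all into memory:
    read a fixed number of lines at a time and fold them into a running total."""
    total = 0
    with open(path) as f:
        while True:
            chunk = [f.readline() for _ in range(chunk_lines)]
            chunk = [line for line in chunk if line]  # drop EOF blanks
            if not chunk:
                break
            total += sum(int(line.strip()) for line in chunk)
    return total

# Tiny stand-in for a "big" file: one number per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("10\n20\n30\n40\n50\n")
    path = tmp.name

result = sum_amounts(path)
print(result)  # 150
os.remove(path)
```

Only `chunk_lines` lines are ever held in memory at once, which is the same idea that lets big data systems process files far larger than RAM.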
Big data is used to analyze insights, which can lead to better decisions and strategic business moves.

Gartner provides the following definition of big data: "Big data is high-volume, and high-velocity or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."

Skills Required to Become a Data Scientist
- Education: 88 percent have master's degrees and 46 percent have PhDs.
- R coding: In-depth knowledge of SAS or R. For data science, R is generally preferred.
- Python coding: Python is the most common coding language used in data science, along with Java, Perl, and C/C++.
- Hadoop platform: Although not always a requirement, knowing the Hadoop platform is still preferred for the field. Having some experience in Hive or Pig is also beneficial.
- SQL database/coding: Although NoSQL and Hadoop have become a significant part of data science, it is still preferred if you can write and execute complex queries in SQL.
- Working with unstructured data: It is essential that a data scientist can work with unstructured data, whether on social media, video feeds, or audio.

Skills Required to Become a Big Data Specialist
- Analytical skills: These skills are essential for making sense of data, and determining which data is relevant when creating reports and looking for solutions.
- Creativity: You need to have the ability to create new methods to gather, interpret, and analyze a data strategy.
- Mathematics and statistical skills: Good, old-fashioned "number crunching" is also necessary, be it in data science, data analytics, or big data.
- Computer science: Computers are the backbone of every data strategy. Programmers will have a constant need to come up with algorithms to process data into insights.
- Business skills: Big data professionals will need to have an understanding of the business objectives that are in place, as well as the underlying processes that drive the growth of the business and its profits.
What is Data Analytics?
Data analytics is the science of examining raw data to reach certain conclusions. Data analytics involves applying an algorithmic or mechanical process to derive insights, and running through several data sets to look for meaningful correlations. It is used in several industries, which enables organizations and data analytics companies to make more informed decisions, as well as verify and disprove existing theories or models. The focus of data analytics lies in inference, which is the process of deriving conclusions that are solely based on what the researcher already knows.

Now, let us move to applications of data science, big data, and data analytics.

Applications of Data Science

Internet Search
Search engines make use of data science algorithms to deliver the best results for search queries in seconds.

Digital Advertisements
The entire digital marketing spectrum uses data science algorithms, from display banners to digital billboards. This is the main reason that digital ads have higher click-through rates than traditional advertisements.

Recommender Systems
Recommender systems not only make it easy to find relevant products among billions of available products, but they also add a lot to the user experience. Many companies use such systems to promote their products and suggestions in accordance with the user's demands and relevance of information. The recommendations are based on the user's previous search results.

Applications of Big Data

Big Data for Financial Services
Retail banks, private wealth management advisories, funds, and institutional investment banks all use big data for their financial services. The common problem among them all is the massive amounts of multi-structured data living in multiple disparate systems, which big data can solve.
As such, big data is used in several ways, including:
- Customer analytics
- Compliance analytics
- Fraud analytics
- Operational analytics

Big Data in Communications
Gaining new subscribers, retaining customers, and expanding within current subscriber bases are top priorities for telecommunication service providers. The solutions to these challenges lie in the ability to combine and analyze the masses of customer-generated data and machine-generated data that is being created every day.

Big Data for Retail
Whether it's a brick-and-mortar company or an online retailer, the answer to staying in the game and being competitive is understanding the customer better. This requires the ability to analyze all the disparate data sources that companies deal with every day, including weblogs, customer transaction data, social media, store-branded credit card data, and loyalty program data.

Applications of Data Analytics

Healthcare
The main challenge for hospitals is to treat as many patients as they efficiently can, while also providing a high quality of care. Instrument and machine data are increasingly being used to track and optimize patient flow, treatment, and the equipment used in hospitals. It is estimated that a one percent efficiency gain could yield more than $63 billion in global healthcare savings by leveraging software from data analytics companies.

Travel
Data analytics can optimize the buying experience through mobile/weblog and social media data analysis. Travel websites can gain insights into the customer's preferences. Products can be upsold by correlating current sales to the subsequent browsing increase in browse-to-buy conversions via customized packages and offers. Data analytics that is based on social media data can also deliver personalized travel recommendations.

Gaming
Data analytics helps in collecting data to optimize and spend within and across games.
Gaming companies are also able to learn more about what their users like.

Energy Management
Most firms are using data analytics for energy management, including smart-grid management, energy optimization, energy distribution, and building automation in utility companies. The application here is centered on the controlling and monitoring of network devices and dispatch crews, as well as managing service outages. Utilities gain the ability to integrate millions of data points on network performance, which gives engineers the opportunity to use the analytics to monitor the network.

(Infographic: Data Analytics (DA) is the process of examining data sets. Tools and languages include Tableau Public; average annual salary around $69,845.)

Big Data Analytics
Now that I have told you what Big Data is and how it is being generated exponentially, let me present a very interesting example of how Starbucks, one of the leading coffeehouse chains, is making use of this Big Data. I came across an article by Forbes which reported how Starbucks made use of this technology to analyse the preferences of their customers, to enhance and personalize their experience. They analysed their members' coffee-buying habits, along with their preferred drinks and what time of day they usually order. So, even when people visit a "new" Starbucks location, that store's point-of-sale system is able to identify the customer through their smartphone and give the barista their preferred order. In addition, based on ordering preferences, their app will suggest new products that the customers might be interested in trying. This, my friends, is what we call Big Data Analytics.

Big Data Applications
These are some of the domains that Big Data applications have revolutionized:
- Entertainment: Netflix and Amazon use it to make show and movie recommendations to their users.
- Insurance: Uses this technology to predict illness and accidents, and to price products accordingly.
- Driverless Cars: Google's driverless cars collect about one gigabyte of data per second. These experiments require more and more data for their successful execution.
- Education: Opting for big-data-powered technology as a learning tool instead of traditional lecture methods enhances the learning of students and also aids teachers in tracking their performance better.
- Automobile: Rolls Royce has embraced this technology by fitting hundreds of sensors into its engines and propulsion systems, which record every tiny detail about their operation. The changes in data in real time are reported to engineers, who will decide the best course of action, such as scheduling maintenance or dispatching engineering teams, should the problem require it.
- Government: A very interesting use case is in the field of politics, to analyse patterns and influence election results. Cambridge Analytica Ltd. is one such organisation which completely drives on data to change audience behaviour and plays a major role in the electoral process.

Scope of Big Data
- Numerous job opportunities: The career opportunities pertaining to the field of Big Data include Big Data Analyst, Big Data Engineer, Big Data Solution Architect, etc. According to IBM, 59% of all Data Science and Analytics (DSA) job demand is in Finance and Insurance, Professional Services, and IT.
- Rising demand for analytics professionals: An article by Forbes reveals that "IBM predicts demand for Data Scientists will soar by 28%". By 2020, the number of jobs for all US data professionals will increase by 364,000 openings to 2,720,000, according to IBM.
- Salary aspects: Forbes reported that employers are willing to pay a premium of $8,736 above median bachelor's and graduate-level salaries,
with successful applicants earning a starting salary of $80,265.
- Adoption of Big Data analytics: There has been immense growth in the usage of big data analytics across the world.

The above image depicts the growing market revenue of Big Data, in billion U.S. dollars, from the year 2011 to 2027. So that was all in this blog, and I hope it was helpful.

Big Data Infrastructure
Making the most of big data requires not just having the right big data analytics tools and processes in place, but also optimizing your big data infrastructure. How can you do that? Read on for tips about common problems that arise in data infrastructure, and how to solve them.

What is big data infrastructure?
Big data infrastructure is what it sounds like: the IT infrastructure that hosts your "big data." (Keep in mind that what constitutes big data depends on a lot of factors; the data need not be enormous in size to qualify as "big.")

More specifically, big data infrastructure entails the tools and agents that collect data, the network that transfers it, the software systems and physical storage media that store it, the application environments that host the analytics tools that analyze it, and the backup or archive infrastructure that backs it up after analysis is complete.

Lots of things can go wrong with these various components. Below are the most common problems you may experience that delay or prevent you from transforming big data into value.

Slow storage media
Disk I/O bottlenecks are one common source of delays in data processing. Fortunately, there are some tricks that you can use to minimize their impact. One solution is to upgrade your data infrastructure to solid-state disks (SSDs), which typically run faster. Alternatively, you could use in-memory data processing, which is much faster than relying on conventional storage. SSDs and in-memory storage are more costly, of course, especially when you use them at scale.
But that does not mean you can't take advantage of them strategically in a cost-effective way. Consider deploying SSDs or in-memory data processing for workloads that require the highest speed, but sticking with conventional storage where the benefits of faster I/O won't outweigh the costs.

Lack of scalability
If your data infrastructure can't increase in size as your data needs grow, it will undercut your ability to turn data into value. At the same time, of course, you don't want to maintain substantially more big data infrastructure than you need today just so that it's there for the future. Otherwise, you will be paying for infrastructure you're not currently using, which is not a good use of money.

One way to help address this challenge is to deploy big data workloads in the cloud, where you can increase the size of your infrastructure virtually instantaneously when you need it, without paying for it when you don't. If you prefer not to shift all of your big data workloads to the cloud, you might also consider keeping most workloads on-premise, but having a cloud infrastructure set up and ready to handle "spillover" workloads when they arise, at least until you can create a new on-premise infrastructure to handle them permanently.

Slow network connectivity
If your data is large in size, transferring it across the network can take time, especially if network transfers require using the public internet, where bandwidth tends to be much more limited than it is on internal company networks. Paying for more bandwidth is one way to mitigate this problem, but that will only get you so far (and it will cost you). A better approach is to architect your big data infrastructure in a way that minimizes the amount of data transfer that needs to occur over the network. You could do this by, for example, using cloud-based analytics tools to analyze data that is collected in the cloud, rather than downloading that data to an on-premise location first.
(The same logic applies in reverse: if your data is born or collected on-premise, analyze it there.)

Sub-optimal data transformation
Getting data from the format in which it is born into the format that you need to analyze it or share it with others can be very tricky. Most applications structure data in ways that work best for them, with little consideration of how well those structures work for other applications or contexts.

This is why data transformation is so important. Data transformation allows you to convert data from one format to another. When done incorrectly, which means manually and in ways that do not control for data quality, data transformation can quickly cause more trouble than it is worth. But when you automate data transformation and ensure the quality of the resulting data, you maximize your data infrastructure's ability to meet your big data needs, no matter how your infrastructure is constructed.

Volume
Volume refers to the amount of data to be stored. The volume of data keeps increasing as large data sources, such as websites and portals, are used more and more frequently.

Data velocity
Velocity refers to the speed with which data is generated, distributed, and collected. This velocity is increasing as the number of connected devices and IoT sensors grows, so data essentially streams in continuously. Big data velocity deals with the pace at which data flows in, and the business value lies in keeping up with the amount of data generated by applications and devices.

Data explosion
What has led to this explosive growth of data? One answer is innovation. Innovation has transformed the way we engage in business, provide services, and the associated measurement of value and profitability.
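The automated data transformation with quality control described earlier can be sketched in miniature; the record format and the quality rule below are invented for illustration:

```python
import csv
import io
import json

def transform(json_text):
    """Convert semi-structured JSON records into a flat CSV table,
    rejecting records that fail a simple quality check."""
    records = json.loads(json_text)
    good, bad = [], []
    for rec in records:
        # Quality rule (illustrative): every record must carry an integer id
        # and a numeric amount; anything else is quarantined, not silently kept.
        if isinstance(rec.get("id"), int) and isinstance(rec.get("amount"), (int, float)):
            good.append(rec)
        else:
            bad.append(rec)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", "amount"])
    writer.writeheader()
    writer.writerows({"id": r["id"], "amount": r["amount"]} for r in good)
    return out.getvalue(), bad

csv_text, rejected = transform('[{"id": 1, "amount": 9.5}, {"id": "x", "amount": null}]')
print(csv_text)
print(len(rejected))
```

Automating both the conversion and the quality gate is what keeps a transformation pipeline from quietly propagating bad records downstream.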
Three fundamental trends that shaped the data world in the last few years are business model transformation, globalization, and personalization of services.

Business model transformation
Fundamental business models have been transformed by globalization and connectivity. Companies have moved from being product oriented to being service oriented, where the value of the organization in its customers' view is measured by service effectiveness and not by product usefulness. What this transformation mandates of every business is the need to produce more data in terms of products and services to cater to each segment and channel of customers, and to consume as much data from each customer touch point, including social media, surveys, forums, direct feedback, call centers, competitive market research, and much more. This trend exists across business-to-business (B2B), business-to-business-to-consumer (B2B2C), and business-to-consumer-to-consumer (B2C2C) models.

The amount of data produced and consumed by every organization today exceeds what the same organization produced prior to the business transformation. The fundamental data that is central to the business remains, and the supporting data that was needed but not available or accessible previously now exists and is accessible through multiple channels. This is where the volume equation of data exploding into Big Data comes into play.

Globalization
Globalization is a key trend that has drastically changed the commerce of the world, from manufacturing to customer service. Globalization has also changed the variety and formats of data.

Personalization of services
A business transformation's maturity index is measured by the extent of personalization of services and the value perceived by its customers from such transformation. This model is one of the primary causes of the velocity of data that is generated.
New sources of data
With technology tipping points in the last decade, we now have data floating around in social media, mobile devices, sensor networks, and new media more than ever before. Along with this data comes the question of how to process it from a Business Intelligence and Analytics perspective. The emergence of newer business models and the aggressive growth of technology capabilities over the last decade or more have paved the way for integrating all of the data across the enterprise into one holistic platform, to create a meaningful and contextualized business decision support platform.

These trends have added complexities in terms of processes, and at the same time created the need for acquiring the data required by the processes, which can provide critical insights into areas that were never possible before.

Technology has also evolved in the last 20 years to provide the capability to generate data at extreme scales:
- Advances in mobile technology
- Large-scale data processing networks
- Commoditization of hardware
- Virtualization
- Cloud computing
- Open-source software

Data volume
Data volume is characterized by the amount of data that is generated continuously. Different data types come in different sizes. For example, a blog text is a few kilobytes; voice calls or video files are a few megabytes; sensor data, machine logs, and clickstream data can be in gigabytes. Traditionally, employees generated data. Today, for a given organization, customers, partners, competitors, and anyone else can generate data.
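A back-of-the-envelope sketch of how these per-record sizes add up to large daily volumes; the event counts and record sizes below are invented for illustration:

```python
# Rough daily-volume estimate for a hypothetical organization.
KB, MB, GB = 1024, 1024**2, 1024**3

sources = {
    # source: (events per day, approx. bytes per event)
    "blog posts":  (200,        5 * KB),
    "voice calls": (10_000,     3 * MB),
    "clickstream": (50_000_000, 500),  # many tiny events add up to gigabytes
}

total = sum(count * size for count, size in sources.values())
for name, (count, size) in sources.items():
    print(f"{name:>12}: {count * size / GB:8.2f} GB/day")
print(f"{'total':>12}: {total / GB:8.2f} GB/day")
```

The point of the exercise: a few kilobytes per record is negligible until it is multiplied by millions of daily events, which is exactly how clickstream and sensor sources come to dominate an organization's volume.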
Machine data
- Application logs
- Scanners and monitoring equipment (for example, X-ray machines and body scanners), commercial devices

Clickstream data
- Website clicks logged as users navigate through webpages; every click and visit can be logged, along with interactions with newsletters and other online services, and this data provides input for analysis and recommendation.

External or third-party data
- For example, weather data.

Emails

Data velocity
Velocity can be defined as the speed and direction of motion of an object; constant velocity of an object is motion at constant speed and direction.

With the advent of Big Data, understanding the velocity of data is extremely important. The basic reason for this arises from the fact that in the early days of data processing, we used to analyze data in batches, acquired over time. Typically, data is broken into fixed-size chunks and processed through different layers from source to target, and the end result is stored in a data warehouse for further use in reporting and analysis. This data processing technique in batches or micro-batches works well when the flow of input data is at a fixed rate and results are used for analysis with all process delays. The scalability and throughput of the data processing architecture is maintained due to the fixed size of the batches.

In the case of Big Data, the data streams in a continuous fashion, and the result sets are useful when the acquisition and processing delays are short. Here is where the need becomes critical for an ingestion and processing engine that can work at extremely scalable speeds on extremely volatile sizes of data in a relatively minimal amount of time.
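The fixed-size batching idea above can be sketched as follows; the event stream and batch size are invented for illustration:

```python
from typing import Iterable, Iterator, List

def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a continuous stream of events into fixed-size chunks,
    the way a batch/micro-batch pipeline ingests data."""
    batch: List[int] = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# A stand-in for a continuous source (e.g. sensor readings).
events = range(1, 8)  # 1..7
batches = list(micro_batches(events, batch_size=3))
print(batches)  # [[1, 2, 3], [4, 5, 6], [7]]
totals = [sum(b) for b in batches]
print(totals)   # [6, 15, 7]
```

The fixed batch size is what keeps throughput predictable; a true streaming engine would instead emit each event's effect as soon as it arrives, trading that predictability for lower latency.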
Let us look at some examples of data velocity:
• Amazon, Facebook, Yahoo, and Google
• Sensor data
• Mobile networks
• Social media

Big Data Processing Architectures

INTRODUCTION

Data processing has been a complex subject to deal with since the primitive days of computing. The underlying reason for this stems from the fact that complexity is induced from the instrumentation of data rather than the movement of data. Instrumentation of data requires a complete understanding of the data, the need to maintain consistency of processing (if the data set is broken into multiple pieces), the need to integrate multiple data sets through the processing cycles to maintain the integrity of the data, and the need to complete associated computations within the same processing cycle. The instrumentation of transactional data has been a challenge considering the discrete nature of the data, and the magnitude of the problem amplifies with the increase in the size of the data. This problem has been handled in multiple ways within the RDBMS-based ecosystem for online transaction processing (OLTP) and data warehousing, but the solutions cannot be extended to the Big Data situation. How do we deal with processing Big Data? Taking distributed processing, storage, neural networks, multiprocessor architectures, and object-oriented concepts, combined with Internet data processing techniques, there are several approaches that have been architected for processing Big Data.

Data processing revisited

Data processing can be defined as the collection, processing, and management of data resulting in information generation to end consumers. Broadly, the different cycles of activities in data processing can be described as shown in Figure 3.1. Transactional data processing follows this life cycle, as the data is first analyzed and modeled.
The data collected is structured in nature and discrete in volume, since the entire process is predefined based on known requirements. Other areas of data management, like quality and cleansing, are a nonissue, as they are handled in the source systems as a part of the process. Data warehouse data processing follows similar patterns as transaction data processing; the key difference is that the volume of data to be processed varies depending on the source that is processed. Before we move on to Big Data processing, let us discuss the techniques and challenges in data processing.

Data processing techniques

There are two fundamental styles of data processing that have been accepted as de facto standards:

• Centralized processing. In this architecture all the data is collected to a single centralized storage area and processed by a single computer with often very large architectures in terms of memory, processor, and storage. Centralized processing architectures evolved with transaction processing and are well suited for small organizations with one location of service. Centralized processing requires minimal resources both from people and system perspectives, and it is very successful when the collection and consumption of data occurs at the same location.

• Distributed processing. In this architecture data and its processing are distributed across geographies or data centers, and processing of data is localized with the federation of the results into a centralized storage. Distributed architectures evolved to overcome the limitations of centralized processing, where all the data needed to be collected to one central location and results were available in one central location. There are several architectures of distributed processing:

• Client-server.
In this architecture the client does all the data collection and presentation, while the server does the processing and management of data. This was the most popular form of data management in the 1980s, and this architecture is still in use across small and midsize businesses.

• Three-tier architecture. With client-server architecture the client machines needed to be connected to a server machine, thus mandating finite states and introducing latencies and overhead in terms of data to be carried between clients and servers. To increase processing efficiency and reduce redundancy while increasing reusability, client-server architecture evolved into three-tier systems, where the client's processing logic was moved to a middle tier of services, thereby freeing the client from having to be tethered to the server. This evolution allowed scalability of each layer, but the overall connectedness of the different layers limited the performance of the overall system. This is predominantly the architecture of analytical and business intelligence applications.

[FIGURE 3.1 Data processing cycles: Collect (capturing transaction data, collection of data from subsystems, data collection on forms and portals); Process (classify, transform, sort/merge, calculations, summarize, advanced compute); Store (storage, retrieval, archival, governance); Present (format, present, manage, generate).]

• n-tier architecture. n-tier or multitier architecture is where clients, middleware, applications, and servers are isolated into tiers. With this architecture any tier can be scaled independent of the others. Web applications use this type of architecture approach.

• Cluster architecture. Refers to machines that are connected in a network architecture (software or hardware) to work closely together to process data or compute requirements in parallel.
Each machine in a cluster is associated with a task that is processed locally, and the result sets are collected to a master server that returns them back to the user.

• Peer-to-peer architecture. This is a type of architecture where there are no dedicated servers and clients; instead, all the processing responsibilities are allocated among all machines, known as peers. Each machine can perform the role of a client or server or just process data.

• Hub and spoke

• Federated

Data processing infrastructure challenges

Basic data processing architecture (computational units) as shown in Figure 3.2 has remained the same from the days of punch cards to modern computing architectures. The following sections outline the four distinct areas that have evolved and yet prove challenging.

[FIGURE 3.2 Computational units of data processing: storage (including storage for intermediate result set processing), transport, processing, and speed/throughput.]

Transportation

One of the issues that has always confronted the data world is moving data between different systems and then storing it or loading it into memory for manipulation. This continuous movement of data has been one of the reasons that structured data processing evolved to be restrictive in nature, where the data had to be transported between the compute and storage layers. The continuous improvement in network technologies could not solve the problem, though it enabled the bandwidth of the transport layers to become much bigger and more scalable. Later, we discuss how different processing architectures evolved and how they were designed to take this data transport as one of the primary design requirements.

Processing

Processing data requires the ability to combine some form of logic and mathematical computes together in one cycle of operation. This area can be further divided into the following:

• CPU or processor.
Computer processing units have evolved a long way from the early 1970s to today. With each generation the computing speed and processing power have increased, leading to more processing capabilities, software layers with access to wider memory, and accelerated architecture evolution.

• Memory. While the storage of data to disk for offline processing proved the need for storage evolution and data management, equally important was the need to store data in perishable formats in memory for compute and processing. Memory has become cheaper and faster, and with the evolution of processor capability, the amount of memory that can be allocated to a system, and then to a process within a system, has changed significantly.

• Software. Another core data processing component is the software used to develop the programs to transform and process the data. Software across different layers, from operating systems to programming languages, has evolved generationally and even leapfrogged hardware evolution in some cases. In its lowest form, the software translates sequenced instruction sets into machine language that is used to process data with the infrastructure layers of CPU + memory + storage. Programming languages that have evolved over time have harvested the infrastructure evolution to improve the speed and scalability of the system. Operating systems like Linux have opened the doors of innovation, allowing enterprises to develop additional capabilities on the base software platform to leverage the entire infrastructure and processing architecture improvements.

Speed or throughput

The biggest continuing challenge is the speed or throughput of data processing. Speed is a combination of various architecture layers: hardware, software, networking, and storage. Each layer has its own limitations and, in combination, these limitations have challenged the overall throughput of data processing.
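Because throughput is bounded by the slowest layer, a quick estimate shows why moving data so often dominates end-to-end time. The link speeds, data sizes, and the protocol efficiency factor below are illustrative assumptions:

```python
# How long does it take just to MOVE data between systems?
# Bandwidth figures and the efficiency factor are illustrative assumptions.
def transfer_hours(data_gb, link_gbps, efficiency=0.7):
    """Wall-clock hours to move `data_gb` gigabytes over a link of
    `link_gbps` gigabits/second, discounted by a protocol-overhead
    efficiency factor."""
    gigabits = data_gb * 8              # bytes -> bits
    effective_gbps = link_gbps * efficiency
    return gigabits / effective_gbps / 3600

for link in (1, 10, 100):               # 1 GbE, 10 GbE, 100 GbE
    print(f"10 TB over {link:>3} Gb/s link: {transfer_hours(10_000, link):6.1f} h")
```

Even on a fast network, a 10 TB transfer costs hours on commodity links, which is exactly why the architectures discussed next treat data transport as a primary design requirement rather than an afterthought.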
Data processing challenges continue to exist in the infrastructure architecture layers as an ecosystem, though the underlying software, processor, memory, storage, and network components have all evolved independently. In the world of database processing and data management this is a significant problem, both from a value and a financial perspective. In the next section we discuss the architectures that were initially developed as shared-everything architecture, the problems that were solved as transaction processing on these platforms, and the newer evolution to shared-nothing architecture that has given us the appliance platform, which provides the unlimited scalability that was always lacking in the world of data and its management.

Shared-everything and shared-nothing architectures

Data processing is an intense workload that can either scale dramatically or severely underperform and crash. The key to both scenarios stems from the underlying infrastructure architecture, based on which a particular data architecture's performance can be predicted. Two popular data processing infrastructure architectures that are regarded as industry standards are shared-everything and shared-nothing architectures. Application software such as CRM, ERP, SCM, and transaction processing requires software that can drive performance. Web applications require an architecture that is scalable and flexible. Data warehousing requires an infrastructure platform that is robust and scalable. Based on the nature of the data and the type of processing, shared-everything architectures are suited for applications, while shared-nothing architecture lends itself to data warehouse and web applications.

Shared-everything architecture

Shared-everything architecture refers to a system architecture where all resources are shared, including storage, memory, and the processor (Figure 3.3).
The biggest disadvantage of this architecture is its limited scalability. Two variations of shared-everything architecture are symmetric multiprocessing (SMP) and distributed shared memory (DSM).

In the SMP architecture, all the processors share a single pool of memory for read/write access concurrently and uniformly without latency. Sometimes this is referred to as uniform memory access (UMA) architecture. The drawback of SMP architecture is that when multiple processors are present and share a single system bus, the bandwidth for simultaneous memory access becomes choked; therefore, the scalability of such a system is very limited.

The DSM architecture addresses the scalability problem by providing multiple pools of memory for processors to use. In the DSM architecture, the latency to access memory depends on the relative distances of the processors and their dedicated memory pools. This architecture is also referred to as a nonuniform memory access (NUMA) architecture.

Both SMP and DSM architectures have been deployed for many transaction processing systems, where the transactional data is small in size and has a short burst cycle of resource requirements. Data warehouses have been deployed on the shared-everything architecture for many years, and due to the intrinsic architecture limitations, the direct impact has been on cost and performance. Analytical applications and Big Data cannot be processed on a shared-everything architecture.

Shared-nothing architecture

Shared-nothing architecture is a distributed computing architecture where multiple systems (called nodes) are networked to form a scalable system (Figure 3.4).
Each node has its own private memory, disks, and storage devices, and operates independently of any other node in the configuration, thus isolating any resource sharing and the associated contention. The flexibility of the architecture is its scalability. This is the underlying architecture for data warehouse appliances and large-scale data processing. The extensibility and near-infinite scalability of this architecture makes it the platform architecture for Internet and web applications. The key feature of shared-nothing architecture is that the operating system, not the application server, owns responsibility for controlling and sharing hardware resources. In a shared-nothing architecture, a system can assign dedicated applications, or partition its data among the different nodes, to handle a particular task. Shared-nothing architectures enable the creation of a self-contained architecture where the infrastructure and the data coexist in dedicated layers.

Big Data processing

Big Data is neither structured, nor does it have a finite state and volume. We have seen examples of the different formats and sources of data that need to be processed as Big Data. The processing complexities in Big Data include the following:

1. Data volume: the amount of data generated every day, both within and outside the organization.
   • Internal data includes memos, contracts, analyst reports, competitive research, financial statements, emails, call center data, supplier data, vendor data, customer data, and confidential and sensitive data including HR and legal.
   • External data includes articles, videos, blogs, analyst reviews, forums, social media, sensor networks, and mobile data.
2. Data variety: the different formats of data that are generated by different sources.
   • Excel spreadsheets and the associated formulas
   • Documents
   • Blogs and microblogs
   • Videos, images, and audio
   • Multilingual data
   • Mobile, sensor, and radio-frequency identification (RFID) data
3.
Data ambiguity: the complexity of the data and the ambiguity associated with it in terms of metadata.
   • Comma-separated values (CSV) files may or may not contain header rows
   • Word documents have multiple formats (e.g., legal documents for patients versus pharmaceuticals at a hospital)
   • Sensor data from mobile versus RFID networks
   • Microblog data from Twitter versus data from Facebook
4. Data velocity: the speed of data generation.
   • Sensor networks
   • Mobile devices
   • Social media
   • YouTube broadcasts
   • Streaming services such as Netflix and Hulu
   • Corporate documents and systems
   • Patient networks

Due to the very characteristics of Big Data, processing data of different types and volumes on traditional architectures like symmetric multiprocessing (SMP) or massively parallel processing (MPP) platforms, which are more transaction prone and disk oriented, cannot provide the required scalability, throughput, and flexibility. The biggest problem with Big Data is its uncertainty, and the biggest advantage of Big Data is its nonrelational format.

The data processing life cycle for Big Data differs from transactional data (Figure 3.5). In a traditional environment, you first analyze the data and create a set of requirements, which leads to data discovery and data model creation, and then a database structure is created to process the data. The resulting architecture is very efficient from the perspective of write performance, as data's finite shape, structure, and state are loaded in the end state.

Big Data widely differs in its processing cycle. The data is first collected and loaded to a target platform, then a metadata layer is applied to the data, and a data structure for the content is created. Once the data structure is applied, the data is then transformed and analyzed.
The end result from the process is then ready for analysis and downstream consumption.

Data processing architecture requirements:
• Data model-less architecture
• Near-real-time data collection
• Microbatch processing
• Minimal data transformation
• Efficient data reads
• Multipartition capability
• Store results in a file system or DBMS (not relational)
• Share data across multiple processing points

Infrastructure requirements:
• Linear scalability
• High throughput
• Fault tolerance
• Auto recovery
• High degree of parallelism
• Distributed data processing
• Programming language interface

The key element that is not required for Big Data is a relational database to provide the backend platform for data processing. Interestingly, the architecture and infrastructure requirements for Big Data processing are closely aligned with web application architecture. Furthermore, there are several data processing techniques on file-based architectures, including the operating systems, that have matured over the last 30 years. Combining these techniques, a highly scalable and performant platform can be designed and deployed.

To design an efficient infrastructure and processing architecture, we need to understand the dataflow for processing Big Data. A high-level overview of Big Data processing is shown in Figure 3.6. There are four distinct stages of processing, and each stage's requirement for infrastructure remains the same. Let us look at the processing that occurs in each stage.

[FIGURE 3.6 Big Data processing flow: gather data, load data, transform data, extract data; from the landing zone and ingestion process through discovery, analytics, database integration, operational reporting, and raw data extracts.]

• Gather data. In this stage, the data is received from different sources and loaded to a file system called the landing zone or landing area. Typically, the data is sorted into subdirectories based on the data type. Any file modifications like naming or extension changes can be completed in this stage.

• Load data. In this stage, the data is loaded with the application of metadata (this is the stage where you apply a structure to the data for the first time) and readied for transformation. The loading process breaks down the large input into small chunks of files. A catalog of the files is created and the associated metadata for the catalog is processed for each file. In this stage one can also partition the data horizontally or vertically depending on the user and processing requirements.

• Transform data. In this stage the data is transformed by applying business rules and processing the contents. This stage has multiple steps to execute and can quickly become complex to manage. The processing steps at each stage produce intermediate results that can be stored for later examination. The results from this stage typically are a few keys of metadata and associated metrics (a key-value pair).

• Extract data. In this stage the result data set can be extracted for further processing, including analytics, operational reporting, data warehouse integration, and visualization purposes.

Based on the dataflow as described here, let us revisit the infrastructure and data processing architecture requirements as they relate to Big Data.

Linear scalability. Earlier in this chapter, we discussed the challenges of data processing infrastructure in terms of storage, memory, and processor. In a traditional system's architecture, when you add infrastructure to achieve scalability, the scaling is not 100% linear and therefore becomes expensive. For example, when you add 1 TB of storage, you get about half of that usable space with a RAID configuration.
