
Unit I - INTRODUCTION TO DATA SCIENCE


1.1 Introduction:
Data science is a collection of techniques used to extract value from data. It has become an essential tool for any organization that collects, stores, and processes data as part of its operations. Data science techniques rely on finding useful patterns, connections, and relationships within data. Data science is also commonly referred to as knowledge discovery, machine learning, predictive analytics, and data mining. However, each term has a slightly different connotation depending on the context. Data science is all about using data to solve problems. The problem could be decision making, such as identifying which email is spam and which is not.
Traditionally, the data that we had was mostly structured and small in size, and could be analyzed by using simple Business Intelligence tools. Data in the traditional systems is mostly structured, but today most of the data is unstructured or semi-structured. The data trends in Fig 1.1 show that by 2020, more than 80% of the data will be unstructured.

Fig 1.1: Structured & Unstructured Data

1.2 Lifecycle of Data Science:


A data science life cycle is an iterative set of steps you take to deliver a data science project or product. Because every data science project and team are different, every specific data science life cycle is different. However, most data science projects tend to flow through the same general life cycle. The life cycle of data science is shown in Fig 1.2.
The life cycle of data science involves the following steps:
1. Discovery
2. Data Preparation
3. Model Planning
4. Model Building
5. Operationalize
6. Communicate results

Fig 1.2: Life Cycle of Data Science

Discovery: Before beginning the project, it is important to understand the various specifications, requirements, priorities and required budget. It is necessary to possess the required resources in terms of people, technology, time and data to support the project. In this phase, frame the business problem and formulate initial hypotheses (IH) to test.
Data preparation: In this phase, prepare an analytical sandbox in which analytics can be performed for the entire duration of the project. Explore, preprocess and condition the data prior to modeling. Further, perform ETLT (extract, transform, load and transform) to get data into the sandbox.
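The ETLT steps can be sketched as follows. This is only an illustration under stated assumptions: the raw CSV text, the table name purchases and the use of an in-memory SQLite database as the sandbox are all hypothetical choices, not part of the original notes.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
raw_csv = "customer,amount\n alice ,10.5\nBOB,4\nalice,2.5\n"

# Extract: read rows from the source.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform (first pass): clean names and convert amounts to numbers.
cleaned = [(r["customer"].strip().lower(), float(r["amount"])) for r in rows]

# Load: write the cleaned rows into the sandbox (an in-memory SQLite database here).
sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
sandbox.executemany("INSERT INTO purchases VALUES (?, ?)", cleaned)

# Transform (second pass): aggregate inside the sandbox.
totals = sandbox.execute(
    "SELECT customer, SUM(amount) FROM purchases GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The second transform runs inside the sandbox, which is what distinguishes ETLT from plain ETL in this sketch.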
Model planning: In this phase, determine the methods and techniques to draw the relationships between variables. These relationships will set the base for the algorithms which are to be implemented in the next phase. Apply Exploratory Data Analytics (EDA) using various statistical formulas and visualization tools.
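A first EDA pass often consists of summary statistics and a correlation check between two variables. The sketch below assumes a small hypothetical dataset (ad_spend and sales are made-up names and values) and computes the Pearson correlation by hand.

```python
import statistics

# Hypothetical sample: advertising spend and sales for six periods.
ad_spend = [10.0, 12.0, 15.0, 18.0, 20.0, 25.0]
sales = [40.0, 44.0, 52.0, 60.0, 63.0, 78.0]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Basic descriptive statistics for one variable.
summary = {
    "mean": statistics.fmean(sales),
    "median": statistics.median(sales),
    "stdev": statistics.stdev(sales),
}
r = pearson(ad_spend, sales)
print(summary)
print(f"correlation(ad_spend, sales) = {r:.3f}")
```

A correlation close to 1 suggests a strong linear relationship, which would be one input into choosing a modeling technique in the next phase.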

Tools commonly used for model planning include R, SQL Analysis Services, and SAS/ACCESS.

Model building: In this phase, develop datasets for training and testing purposes. Consider whether the existing tools will suffice for running the models or whether a more robust environment is needed. Analyze various learning techniques like classification, association and clustering to build the model.
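Developing training and testing datasets usually means splitting the available records at random. A minimal sketch, assuming a 25% test fraction and a fixed seed (both arbitrary choices made here for illustration):

```python
import random

def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of the records and split it into training and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records: (record id, class label).
records = [(i, i % 2) for i in range(20)]
train, test = train_test_split(records)
print(len(train), len(test))
```

Keeping the test records out of model training is what allows them to stand in for new, unseen data when the model is evaluated.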

Tools commonly used for model building include SAS Enterprise Miner, WEKA, SPSS Modeler, MATLAB, Alpine Miner, and Statistica.

Operationalize: In this phase, deliver final reports, briefings, code and technical documents. In addition, sometimes a pilot project is also implemented in a real-time production environment. This will provide a clear picture of the performance and other related constraints on a small scale before full deployment.
Communicate results: Now it is important to evaluate whether the goal planned in the first phase has been achieved. So, in the last phase, identify all the key findings, communicate them to the stakeholders and determine if the results of the project are a success or a failure based on the criteria developed in Phase 1.

1.3 AI, Machine Learning and Data Science:


Artificial intelligence, machine learning, and data science are all related to each other. However, these three fields are distinct depending on the context. Artificial intelligence is about giving machines the capability of mimicking human behavior, particularly cognitive functions. The relationship between AI & Machine Learning is shown in Fig 1.3. Examples would be: facial recognition, automated driving, and sorting mail based on postal code. In some cases, machines have far exceeded human capabilities.

There is quite a range of techniques that fall under artificial intelligence: natural language processing, decision science, bias, vision, robotics, and others. Learning is an important part of human capability. In fact, many other living organisms can learn.


Machine learning, which can be considered either a sub-field or one of the tools of artificial intelligence, provides machines with the capability of learning from experience. Experience for machines comes in the form of data. Data that is used to teach machines is called training data. Machine learning turns the traditional programming model upside down.
Fig 1.3: AI & Machine Learning

A program, a set of instructions to a computer, transforms input signals into output signals using predetermined rules and relationships. Machine learning algorithms, also called "learners", take both the known input and output to figure out a model for the program which converts input to output. For example, many organizations like social media platforms, review sites, or forums are required to moderate posts and remove abusive content.
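The contrast between a traditional program and a learner can be sketched with a toy moderation rule. Everything here is hypothetical: the single feature (a count of flagged words per post), the training pairs, and the one-threshold learner are assumptions chosen to keep the example small.

```python
# Traditional program: the rule is written by hand.
def handwritten_rule(flagged_words):
    return flagged_words >= 5  # threshold chosen by the programmer

# Learner: the rule (here, a single threshold) is derived from known
# input/output pairs. The training data below is hypothetical.
training_data = [(0, False), (1, False), (2, False), (3, True), (4, True), (6, True)]

def learn_threshold(examples):
    """Return the threshold that misclassifies the fewest training examples."""
    candidates = sorted({x for x, _ in examples})

    def errors(t):
        return sum((x >= t) != label for x, label in examples)

    return min(candidates, key=errors)

threshold = learn_threshold(training_data)

def learned_rule(flagged_words):
    return flagged_words >= threshold

print(threshold, learned_rule(3), handwritten_rule(3))
```

In the first function the programmer supplies the rule; in the second, the known inputs and outputs supply it.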

Data science is the business application of machine learning, artificial intelligence, and other quantitative fields like statistics, visualization, and mathematics. It is an interdisciplinary field that extracts value from data. Data science relies heavily on machine learning and is sometimes called data mining. Examples of data science use cases are: a recommendation engine that can recommend movies for a particular user, a fraud alert model that detects fraudulent credit card transactions, finding customers who will most likely churn next month, or predicting revenue for the next quarter.

1.4 What is Data Science?

Data science starts with data, which can range from a simple array of a few numeric observations to a complex matrix of millions of observations with thousands of variables. Data science utilizes certain specialized computational methods in order to discover meaningful and useful structures within a dataset. The discipline of data science coexists and is closely associated with a number of related areas such as database systems, data engineering, visualization, data analysis, experimentation, and business intelligence (BI).

Fig 1.4: Machine Learning

Data science involves extracting patterns, building models, combining disciplines and learning from data, which are described here.
Extracting Meaningful Patterns: Knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns or relationships within a dataset in order to make important decisions. Data science involves inference and iteration of many different hypotheses. One of the key aspects of data science is the process of generalization of patterns from a dataset. The generalization should be valid, not just for the dataset used to observe the pattern, but also for new unseen data. The ultimate objective of data science is to find potentially useful conclusions that can be acted upon by the users of the analysis.
Building Representative Models: In statistics, a model is the representation of a relationship between variables in a dataset. It describes how one or more variables in the data are related to other variables. Modeling is a process in which a representative abstraction is built from the observed dataset. For example, based on credit score, income level, and requested loan amount, a model can be developed to determine the interest rate of a loan. For this task, previously known observational data including credit score, income level, loan amount, and interest rate are needed. Once the representative model is created, it can be used to predict the value of the interest rate, based on all the input variables.
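The loan example can be sketched with a deliberately simplified model. As an assumption made for brevity, only one input variable (credit score) is used instead of all three, the historical records are invented, and the model is a straight line fitted by ordinary least squares.

```python
# Hypothetical past loans: (credit score, interest rate in percent).
history = [(600, 9.0), (650, 8.0), (700, 7.0), (750, 6.0), (800, 5.0)]

def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(history)

def predict_rate(credit_score):
    """Use the fitted model to predict the interest rate for a new applicant."""
    return intercept + slope * credit_score

print(round(predict_rate(720), 2))
```

The fitted slope and intercept are the representative model: a compact abstraction of the observed records that can be applied to a credit score not seen before.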
Combination of Statistics, Machine Learning, and Computing: In the pursuit of extracting useful and relevant information from large datasets, data science borrows computational techniques from the disciplines of statistics, machine learning, experimentation, and database theories. The algorithms used in data science originate from these disciplines but have since evolved to adopt more diverse techniques such as parallel computing, evolutionary computing, linguistics, and behavioral studies.

Learning Algorithms: The application of sophisticated learning algorithms for extracting useful patterns from data differentiates data science from traditional data analysis techniques. Many of these algorithms were developed in the past few decades and are a part of machine learning and artificial intelligence. Some algorithms are based on the foundations of Bayesian probabilistic theories and regression analysis, originating from hundreds of years ago.
These iterative algorithms automate the process of searching for an optimal solution for a given data problem. Based on the problem, data science is classified into tasks such as classification, association analysis, clustering, and regression. Each data science task uses specific learning algorithms like decision trees, neural networks, k-nearest neighbors (k-NN), and k-means clustering, among others.
Associated Fields: While data science covers a wide set of techniques, applications, and disciplines, there are a few associated fields that data science heavily relies on. The techniques used in the steps of a data science process and in conjunction with the term "data science" are:

• Descriptive Statistics
• Exploratory Visualization
• Dimensional Slicing
• Hypothesis Testing
• Data Engineering
• Business Intelligence
1.5 Cases for Data Science:
Traditional analysis techniques like dimensional slicing, hypothesis testing, and descriptive statistics can only go so far in information discovery. A paradigm is needed to manage the massive volume of data, explore the inter-relationships of thousands of variables, and deploy machine learning algorithms to deduce optimal insights from datasets.
A set of frameworks, tools, and techniques are needed to intelligently assist humans to process all these data and extract valuable information. Data science is one such paradigm that can handle large volumes with multiple attributes and deploy complex algorithms to search for patterns from data. Each key motivation for using data science techniques is explored below.
Volume: The sheer volume of data captured by organizations is exponentially increasing. The rapid decline in storage costs and advancements in capturing every transaction and event, combined with the business need to extract as much leverage as possible using data, creates a strong motivation to store more data than ever. As data become more granular, the need to use large volume data to extract information increases. A rapid increase in the volume of data exposes the limitations of current analysis methodologies. In a few implementations, the time to create generalization models is critical and data volume plays a major part in determining the time frame of development and deployment.
Dimensions: The three characteristics of the Big Data phenomenon are high volume, high velocity, and high variety. The variety of data relates to the multiple types of values (numerical, categorical), formats of data (audio files, video files), and the application of the data (location coordinates, graph data). Every single record or data point contains multiple attributes or variables to provide context for the record. For example, every user record of an ecommerce site can contain attributes such as products viewed, products purchased, user demographics, frequency of purchase, click stream, etc.

1.6 Data Science Classification:


Data science problems can be broadly categorized into supervised or unsupervised learning models. Supervised or directed data science tries to infer a function or relationship based on labeled training data and uses this function to map new unlabeled data. Supervised techniques predict the value of the output variables based on a set of input variables. To do this, a model is developed from a training dataset where the values of input and output are previously known. The model generalizes the relationship between the input and output variables and uses it to predict for a dataset where only input variables are known. The output variable that is being predicted is also called a class label or target variable. Supervised data science needs a sufficient number of labeled records to learn the model from the data. Unsupervised or undirected data science uncovers hidden patterns in unlabeled data. In unsupervised data science, there are no output variables to predict. The objective of this class of data science techniques is to find patterns in data based on the relationship between data points themselves. An application can employ both supervised and unsupervised learners.
Data science problems can also be classified into tasks such as: classification, regression, association analysis, clustering, anomaly detection, recommendation engines, feature selection, time series forecasting, deep learning, and text mining, as illustrated in Fig 1.5. Classification and regression techniques predict a target variable based on input variables. The prediction is based on a generalized model built from a previously known dataset. In regression tasks, the output variable is numeric. Classification tasks predict output variables which are categorical or polynomial. Deep learning is a more sophisticated artificial neural network that is increasingly used for classification and regression problems.
Clustering is the process of identifying the natural groupings in a dataset. For example, clustering is helpful in finding natural clusters in customer datasets, which can be used for market segmentation. Since this is unsupervised data science, it is up to the end user to investigate why these clusters are formed in the data and generalize the uniqueness of each cluster. In retail analytics, it is common to identify pairs of items that are purchased together, so that specific items can be bundled or placed next to each other. This task is called market basket analysis or association analysis, which is commonly used in cross selling.
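Finding pairs of items that are purchased together can be sketched by counting how often each pair co-occurs across baskets. The transactions and the 0.4 support cut-off below are hypothetical; real association analysis algorithms such as Apriori add pruning steps that are not shown here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each is the set of items in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]

# Count every unordered pair of items that appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of baskets that contain both items of the pair.
support = {pair: count / len(transactions) for pair, count in pair_counts.items()}
frequent_pairs = sorted(pair for pair, s in support.items() if s >= 0.4)
print(frequent_pairs)
```

Pairs that clear the support cut-off are candidates for bundling or adjacent placement.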

Fig 1.5: Data Science Classification

Recommendation engines are the systems that recommend items to the users based on individual user preference. Anomaly or outlier detection identifies the data points that are significantly different from other data points in a dataset. Credit card transaction fraud detection is one of the most prolific applications of anomaly detection. Time series forecasting is the process of predicting the future value of a variable based on its past values. Text mining is a data science application where the input data is text, which can be in the form of documents, messages, emails, or web pages. To aid the data science on text data, the text files are first converted into document vectors where each unique word is an attribute. Once the text file is converted to document vectors, standard data science tasks such as classification, clustering, etc., can be applied. Feature selection is a process in which attributes in a dataset are reduced to a few attributes that really matter. Unsupervised techniques provide an increased understanding of the dataset and hence, are sometimes called descriptive data science.
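The conversion of text into document vectors can be sketched as a simple word count per document. The three short documents are hypothetical, and tokenizing by lowercasing and splitting on whitespace is a simplifying assumption; practical text mining usually adds steps such as stop word removal.

```python
from collections import Counter

# Hypothetical documents.
documents = ["free prize inside", "meeting at noon", "free free meeting"]

# Vocabulary: one attribute per unique word, in sorted order.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})

def to_vector(doc):
    """Return the word counts of a document, aligned with the vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in documents]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document is now a fixed-length numeric record, so the same classification or clustering techniques used on tabular data can be applied to it.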

As an example of how both unsupervised and supervised data science can be combined in an application, consider the following scenario. In marketing analytics, clustering can be used to find the natural clusters in customer records. Each customer is assigned a cluster label at the end of the clustering process. A labeled customer dataset can now be used to develop a model that assigns a cluster label for any new customer record with a supervised classification technique.
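The marketing scenario can be sketched in two steps. The customer records, the choice of two clusters, the seeding of centroids with the first and last record, and the use of a 1-nearest-neighbour classifier are all assumptions made for this illustration; any clustering and classification techniques could take their place.

```python
import math

# Hypothetical customer records: (annual spend in thousands, visits per month).
customers = [(1.0, 2.0), (1.5, 1.0), (2.0, 2.5), (8.0, 9.0), (9.0, 8.0), (8.5, 9.5)]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # limiting condition: centroids stopped moving
        centroids = new_centroids
    return labels

# Step 1 (unsupervised): cluster, seeding the centroids with two records.
cluster_labels = kmeans(customers, [customers[0], customers[-1]])

# Step 2 (supervised): 1-nearest-neighbour classifier trained on the labeled records.
def classify(record):
    nearest = min(range(len(customers)), key=lambda i: math.dist(record, customers[i]))
    return cluster_labels[nearest]

print(cluster_labels, classify((1.2, 1.8)), classify((9.0, 9.0)))
```

The cluster labels produced in step 1 play the role of the class label in step 2, which is what lets a supervised learner be applied to data that started out unlabeled.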
1.7 Data Science Algorithms:
An algorithm is a logical step-by-step procedure for solving a problem. In data science, it is the blueprint for how a particular data problem is solved. Many of the learning algorithms are recursive, where a set of steps are repeated many times until a limiting condition is met. Some algorithms also contain a random variable as an input and are aptly called randomized algorithms. A classification task can be solved using many different learning algorithms such as decision trees, artificial neural networks, k-NN, and even some regression algorithms. The choice of which algorithm to use depends on the type of dataset, objective, structure of the data, presence of outliers, available computational power, number of records, number of attributes, and so on. It is up to the data science practitioner to decide which algorithm to use by evaluating the performance of multiple algorithms. There have been hundreds of algorithms developed in the last few decades to solve data science problems.
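Evaluating the performance of multiple algorithms can be sketched by scoring each candidate on the same held-out test set. The labeled records are hypothetical, and the two candidates (a majority-class baseline and 1-nearest-neighbour) are chosen only because they are short to write; accuracy is one of several possible performance measures.

```python
import math
from collections import Counter

# Hypothetical labeled records: ((feature_1, feature_2), class label).
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"), ((8, 8), "b"), ((9, 8), "b")]
test = [((2, 2), "a"), ((8, 9), "b"), ((9, 9), "b")]

def majority_class(train_set):
    """Baseline: always predict the most common training label."""
    label = Counter(lbl for _, lbl in train_set).most_common(1)[0][0]
    return lambda features: label

def nearest_neighbour(train_set):
    """1-NN: predict the label of the closest training record."""
    def predict(features):
        _, label = min(train_set, key=lambda rec: math.dist(features, rec[0]))
        return label
    return predict

def accuracy(model, test_set):
    """Fraction of test records whose predicted label matches the known label."""
    return sum(model(x) == y for x, y in test_set) / len(test_set)

scores = {
    "majority_class": accuracy(majority_class(train), test),
    "1-NN": accuracy(nearest_neighbour(train), test),
}
best = max(scores, key=scores.get)
print(scores, best)
```

Including a trivial baseline makes it easier to judge whether a more elaborate algorithm is actually learning something from the data.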
Data science algorithms can be implemented by custom-developed computer programs in almost any computer language. This obviously is a time-consuming task. In order to focus the appropriate amount of time on data and algorithms, data science tools or statistical programming tools, like R, RapidMiner, Python, SAS Enterprise Miner, etc., which can implement these algorithms with ease, can be leveraged. These data science tools offer a library of algorithms as functions, which can be interfaced through programming code or configured through graphical user interfaces.
