
User Manual for Big Data Integration and Analysis

VIDEO:
https://www.youtube.com/watch?v=qlHG55S2K7g



Table of Contents
1. Introduction
  1.1 Why Integration & Analyzing Data
  1.2 Apache Hadoop
  1.3 Apache Spark
  1.4 Authorized use permission
2. System Summary
  2.1 System Configuration
  2.2 Architecture Diagram
3. Integration Job Specification Language
  3.1 IJSL Grammar
  3.2 Examples
4. Application
  4.1 Running it
  4.2 Settings
  4.3 Data Integration
    4.3.1 Specifying values manually
      4.3.1.1 Transformation Operations
      4.3.1.2 Restriction operation
      4.3.1.3 Output files
    4.3.2 Loading an existing IJSL script
    4.3.3 Writing an IJSL script
  4.4 Data Analysis
    4.4.1 Load files from Hadoop File System
    4.4.2 Selecting files from Hadoop File System
    4.4.3 Displaying field names from selected files
    4.4.4 Input query
    4.4.5 Sample results
5. Possible flows for Data Integration
  5.1 Specifying values manually
  5.2 Loading an existing IJSL script
  5.3 Writing an IJSL script
6. Possible flows for Data Analysis
7. References



1. Introduction
These days, data streams from each and every activity of daily life: from phones, credit cards, televisions and computers; from sensor-equipped buildings, GPS, trains, buses, planes, bridges, factories, and so on. The data flows so fast that the total accumulation of the past two years, a zettabyte, dwarfs the prior record of human civilization. This huge amount of data is very important because it contains a lot of useful information, but given the volume, velocity and variety of the data, cleaning and analyzing big data is a big challenge.

A real example of such a challenge can be seen in e-commerce companies, such as Amazon, which hold huge amounts of customer-related data. This data is crucial to any company, which is why they are ready and eager to spend a big portion of their budget on analyzing it so they can, among other things, establish solid predictive models. For e-commerce companies, one concern is to list the products most sought by their customers so they can predict which class of customers (by age, for example) tends to buy what in which period of time. [1][2]

1.1 Why Integration & Analyzing Data?
Extracting, cleaning, and loading data are three core steps in the Data Integration process. Data reshaping programs are difficult to write because of their complexity, but they are required because each analytic tool expects data in a very specific form, and getting the data into that form typically requires a whole series of cleaning, normalization, reformatting, integration, and restructuring operations.
Analyzing Big Data is used to find meaning and discover hidden relationships in Big Data. Technological advances in the storage, processing, and analysis of Big Data include the rapidly decreasing cost of storage and CPU power in recent years; the flexibility and cost-effectiveness of datacenters and cloud computing for elastic computation and storage; and the development of new frameworks such as Hadoop and Spark, which allow users to take advantage of these distributed computing systems storing large quantities of data through flexible parallel processing. [1], [3], [4]

1.2 Apache Hadoop
Apache Hadoop [5] is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
The core of Apache Hadoop consists of:
Storage part: Hadoop Distributed File System (HDFS).
Processing part: Hadoop MapReduce.
Hadoop splits files into large blocks and distributes them among the nodes in the cluster. To process the data, Hadoop MapReduce transfers packaged code to the nodes to process in parallel, based on the data each node needs to process. Hadoop MapReduce, alongside built-in solutions like Hive and Pig, has been used widely and intensively for years in batch processing chains. One such chain is ETL(-like) data integration programs.
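
The map, shuffle, and reduce steps described above can be illustrated with a small single-process sketch. It does not use the Hadoop API; the function names and the sample records (an ID and a city per line) are hypothetical and only mimic how each block is mapped independently before the results are grouped and reduced.

```python
from collections import defaultdict

def map_phase(block):
    """Emit a (city, 1) pair for every record in one block of lines."""
    for line in block:
        fields = line.split(",")
        yield fields[1], 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

def run_job(blocks):
    # Each block stands in for the part of a file held by one node.
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle(pairs))

blocks = [
    ["1,Bonn", "2,Paris"],  # hypothetical block on node 1
    ["3,Bonn"],             # hypothetical block on node 2
]
counts = run_job(blocks)
```

In a real cluster the map calls would run on the nodes that store each block, and only the grouped intermediate pairs would travel over the network.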

1.3 Apache Spark
Apache Spark [6] is a cluster computing platform designed to be fast and general purpose. On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is very important in interactive computations, for instance, where responses are needed within a few seconds. This includes querying datasets and running iterative programs, such as the ones found in machine learning. Spark therefore comes with native in-memory data processing that greatly speeds up these types of computation.



1.4 Authorized use permission
This system is designed for a lab project study in Enterprise Information Systems.

2. System Summary
2.1 System Configuration
Ubuntu 14.04
Hadoop 2.6
Spark 1.4
Scala 2.1
Maven
Eclipse IDE
Java JDK 1.8

2.2 Architecture Diagram

Figure 1: Architecture of Integration and Analysis of Data [7].



3. Integration Job Specification Language

We propose a language with which users can describe the integration job. We name it the Integration Job Specification Language, IJSL for short.

3.1 IJSL Grammar
An IJSL script consists of keywords and parameters.

KEYWORDS
There are 10 keywords, plus the placeholder EMPTY (all of which can be written in uppercase or lowercase). For clarity, we write them in uppercase throughout this manual:
INPUTFILE, OUTPUTFILE, SEPARATOR, PROJECTEDCOLUMNS, PROJECTEDNAMES, MERGE, SPLIT, CASE, FORMATDATES, RESTRICTION, EMPTY

OBLIGATORY          OPTIONAL
INPUTFILE           MERGE
OUTPUTFILE          SPLIT
SEPARATOR           CASE
PROJECTEDCOLUMNS    FORMATDATES
PROJECTEDNAMES      RESTRICTION

Table 2: Some keywords are obligatory and others optional.

PARAMETERS
Parameters are defined with the following structure:
Par1|Par2|...
If we don't want to define parameters on an optional keyword, we use the keyword EMPTY.
Example:
MERGE EMPTY
or just don't write that statement.
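
As a rough illustration of this structure, the sketch below splits one statement into its keyword and its pipe-separated parameters. It is not the application's parser: `parse_statement` is a hypothetical helper, and it assumes the keyword and the parameters sit on one line separated by a single space.

```python
def parse_statement(line):
    """Split one IJSL statement into (KEYWORD, [parameters]).

    Assumption: keyword and parameters are separated by one space.
    """
    line = line.lstrip().rstrip("\r\n")
    keyword, _, rest = line.partition(" ")
    keyword = keyword.upper()  # keywords may be written in either case
    if rest.strip() == "" or rest.strip().upper() == "EMPTY":
        return keyword, []     # EMPTY means "no parameters"
    if keyword == "SEPARATOR":
        # The separator may itself be the pipe character, so do not split it.
        return keyword, [rest.strip()]
    return keyword, rest.split("|")
```

For example, `parse_statement("projectedcolumns 1|3")` yields the keyword in uppercase together with the two column indexes as strings, and `parse_statement("MERGE EMPTY")` yields an empty parameter list.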



INPUTFILE
Defines the input CSV file name.
Example:
INPUTFILE Customers.csv

OUTPUTFILE
Defines the output CSV file name.
Example:
OUTPUTFILE CustomersOutput.csv

SEPARATOR
Defines the delimiter of the input file. It can only be one character.
The most commonly used ones are comma (,), semicolon (;), pipe (|), and caret (^).
Example:
SEPARATOR ,

PROJECTEDCOLUMNS
Defines the indexes of the columns to be projected (with 0 being the first index).
Part 1: First projected column.
Part 2: Second projected column.
...
Part N: Nth projected column.
Example:
PROJECTEDCOLUMNS 1|3

PROJECTEDNAMES
Defines the names of the columns to be projected.
Part 1: First projected column name.
Part 2: Second projected column name.
...
Part N: Nth projected column name.
Example:
PROJECTEDNAMES Name|City
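
To make the zero-based indexing concrete, here is a minimal sketch of what projecting columns 1 and 3 and naming them Name and City amounts to. The `project` helper and the sample row are hypothetical; the application itself runs this step on the cluster rather than with Python's csv module.

```python
import csv
import io

def project(rows, indexes):
    """Keep only the columns at the given zero-based indexes, in that order."""
    return [[row[i] for i in indexes] for row in rows]

# Hypothetical pipe-separated input with a header line.
source = "Customer_ID|Name|Address|City\n7|Ann Lee|Main St 1|Bonn\n"
rows = list(csv.reader(io.StringIO(source), delimiter="|"))
header, data = rows[0], rows[1:]

projected = project(data, [1, 3])  # PROJECTEDCOLUMNS 1|3
new_header = ["Name", "City"]      # PROJECTEDNAMES Name|City
```

Index 0 would select Customer_ID, so `1|3` picks the second and fourth columns of the input.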



[MERGE]
Defines the indexes of the two columns to be merged and the merge character.
Part 1: First column index.
Part 2: Second column index.
Part 3: Merge character.
Example:
MERGE 0|1|

[SPLIT]
Defines the index of the column to be split and the split character.
Part 1: Column index.
Part 2: Split character.
Example:
SPLIT 2|_
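
The effect of these two statements on a single row can be sketched as follows. The manual does not state where the merged column is placed or how repeated split characters are handled, so both behaviours below are assumptions, and the helper names are hypothetical.

```python
def merge_columns(row, first, second, char):
    """Join two columns with `char`.

    Assumption: the result replaces column `first` and column `second` is dropped.
    """
    out = list(row)
    out[first] = row[first] + char + row[second]
    del out[second]
    return out

def split_column(row, index, char):
    """Split one column into two.

    Assumption: the split happens at the first occurrence of `char`.
    """
    left, _, right = row[index].partition(char)
    return row[:index] + [left, right] + row[index + 1:]
```

With a blank space as merge character, `merge_columns(["Ann", "Lee", "Bonn"], 0, 1, " ")` joins the first two values into one column, and `split_column(["7", "Ann_Lee", "Bonn"], 1, "_")` reverses that kind of join.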

[CASE]
Defines the index of the column whose case is changed to upper or lower.
Part 1: Column index.
Part 2: 0 for UPPERCASE, 1 for LOWERCASE.
Example:
CASE 1|0

[FORMATDATES]
Defines the index of the date column to be formatted.
Part 1: Column index.
Part 2: One of DD/MM/YYYY, MM/DD/YYYY, YYYY/MM/DD.
Example:
FORMATDATES 3|MM/DD/YYYY
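
A small sketch of the case and date-format operations follows. The manual does not say how the format of the incoming dates is detected, so this sketch takes the source format as an explicit argument; that parameter, the `PATTERNS` table, and the function names are assumptions made for illustration.

```python
from datetime import datetime

# The three supported date formats mapped to strptime/strftime patterns.
PATTERNS = {
    "DD/MM/YYYY": "%d/%m/%Y",
    "MM/DD/YYYY": "%m/%d/%Y",
    "YYYY/MM/DD": "%Y/%m/%d",
}

def change_case(row, index, mode):
    """Upper-case the column when mode is "0", lower-case it otherwise."""
    out = list(row)
    out[index] = row[index].upper() if mode == "0" else row[index].lower()
    return out

def format_date(value, target, source="DD/MM/YYYY"):
    """Re-write a date string from the (assumed) source format to the target."""
    parsed = datetime.strptime(value, PATTERNS[source])
    return parsed.strftime(PATTERNS[target])
```

For instance, `format_date("25/12/2015", "MM/DD/YYYY")` swaps day and month, while `change_case(["7", "Ann"], 1, "0")` upper-cases the second column only.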

[RESTRICTION]
Defines the index of the column to be restricted (filtered), the operator used, and the value.
Part 1: Column index.
Part 2: For numeric values: =, <>, >, <, >=, <=. For textual values: EQUAL, NOTEQUAL, CONTAINS.
Part 3: Value.
Example:
RESTRICTION 0|>=|20|0|<|50
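
The example statement carries two restrictions in one flat parameter list. The sketch below groups the parameters into triples and keeps a row only when every restriction holds, which matches the "greater than or equal to 20 and less than 50" reading of the example. The helper names and sample rows are hypothetical, and treating numeric values as floats is an assumption.

```python
import operator

NUMERIC = {"=": operator.eq, "<>": operator.ne, ">": operator.gt,
           "<": operator.lt, ">=": operator.ge, "<=": operator.le}
TEXTUAL = {"EQUAL": lambda a, b: a == b,
           "NOTEQUAL": lambda a, b: a != b,
           "CONTAINS": lambda a, b: b in a}

def parse_restrictions(params):
    """Group a flat parameter list into (index, operator, value) triples."""
    return [(int(params[i]), params[i + 1], params[i + 2])
            for i in range(0, len(params), 3)]

def keep(row, restrictions):
    """Return True when the row satisfies every restriction."""
    for index, op, value in restrictions:
        if op in NUMERIC:
            if not NUMERIC[op](float(row[index]), float(value)):
                return False
        elif not TEXTUAL[op.upper()](row[index], value):
            return False
    return True

restrictions = parse_restrictions("0|>=|20|0|<|50".split("|"))
rows = [["10", "Ann"], ["20", "Bob"], ["49", "Cy"], ["50", "Di"]]
kept = [row for row in rows if keep(row, restrictions)]
```

Only the rows whose first column lies in the half-open range from 20 to 50 remain in `kept`.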
3.2 Examples
Our example CSV input file is called Customers.csv and has the following header with 6 columns:
Customer_ID|Name|Address|City|ZipCode|Phone

A. We want as output 3 columns (Customer_ID, Address, City) where Customer_ID values are greater than or equal to 20 and less than 50.

INPUTFILE Customers.csv
OUTPUTFILE Output.csv
SEPARATOR |
PROJECTEDCOLUMNS 0|2|3
PROJECTEDNAMES Customer_ID|Address|City
RESTRICTION 0|>=|20|0|<|50

B. We want as output all the columns, but with some of them renamed: Customer_ID -> ID, ZipCode -> Postal_Code. We also change the Name column to capital letters.

INPUTFILE Customers.csv
OUTPUTFILE Output.csv
SEPARATOR |
PROJECTEDCOLUMNS 0|1|2|3|4|5
PROJECTEDNAMES ID|Name|Address|City|Postal_Code|Phone
MERGE EMPTY
SPLIT EMPTY
CASE 1|0
FORMATDATES EMPTY
RESTRICTION EMPTY

C. We want as output only the column Name, but split into two columns, First_Name and Last_Name.

INPUTFILE Customers1.csv
OUTPUTFILE Output.csv
SEPARATOR |
PROJECTEDCOLUMNS 1|1000
PROJECTEDNAMES First_Name|Last_Name
MERGE EMPTY
SPLIT 1|
CASE EMPTY
FORMATDATES EMPTY
RESTRICTION EMPTY
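
Before running a script it can be useful to check that all obligatory keywords are present. The following sketch does that for example A; `missing_keywords` is a hypothetical helper, not part of the application, and it assumes one statement per line with the keyword first.

```python
OBLIGATORY = ["INPUTFILE", "OUTPUTFILE", "SEPARATOR",
              "PROJECTEDCOLUMNS", "PROJECTEDNAMES"]

def missing_keywords(script):
    """Return the obligatory keywords that do not appear in the script."""
    present = {
        line.split(None, 1)[0].upper()  # first word of each non-empty line
        for line in script.splitlines()
        if line.strip()
    }
    return [keyword for keyword in OBLIGATORY if keyword not in present]

script_a = """INPUTFILE Customers.csv
OUTPUTFILE Output.csv
SEPARATOR |
PROJECTEDCOLUMNS 0|2|3
PROJECTEDNAMES Customer_ID|Address|City
RESTRICTION 0|>=|20|0|<|50"""

problems = missing_keywords(script_a)
```

An empty list means the script names every obligatory keyword; optional keywords such as MERGE may be absent without being reported.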



4. Application
To install the application, unzip the file llama.zip into the path
/home/hduser/Desktop/
It will copy the following files:
llama.jar
llama.jpg
Settings

4.1 Running it
On a Linux terminal:
Locate the path where Spark is installed, in this case
/home/miguel/Downloads/spark-1.4.1-bin-hadoop2.6/

and execute the following command:
./bin/spark-submit --class eis.lab.groupb.MainDriver --master spark://master:7077 /home/hduser/Desktop/llama.jar

Options:
--class: Class that contains the main function.
--master: Location of the Spark cluster. It should have the value local if we want to use Spark locally.
Finally, we have to indicate the path where the jar is located.
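
If the launch needs to be scripted, the command can be assembled as an argument list. This is only a sketch: `build_submit_command` is a hypothetical helper, and it assumes the standard spark-submit options `--class` and `--master` with the class name and paths given in this section.

```python
def build_submit_command(master, jar_path, main_class="eis.lab.groupb.MainDriver"):
    """Build the spark-submit argument list for the application jar."""
    return ["./bin/spark-submit",
            "--class", main_class,  # class that contains the main function
            "--master", master,     # cluster URL, or "local" for a local run
            jar_path]

cluster = build_submit_command("spark://master:7077", "/home/hduser/Desktop/llama.jar")
local = build_submit_command("local", "/home/hduser/Desktop/llama.jar")
```

Either list could then be passed to `subprocess.run` with `cwd` set to the Spark installation directory, since the executable path is relative to it.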



4.2 Settings
The application has the following list of settings, as shown in Figure 2. To edit them, go to the Settings tab and the following form will appear.

Figure 2

Integration Module
URI                  URI where the HDFS is
Input files path     Path for the input files
Temp files path      Path for the temp files
Output files path    Path for the output files

Analysis Module
URI                  URI where the HDFS is
Input files path     Path for the input files
Output files path    Path for the output files



4.3 Data Integration
In this section we describe the Data Integration phase.
There are three possible ways to do an integration job:
Specifying values manually.
Loading an existing IJSL script.
Writing an IJSL script.

4.3.1 Specifying values manually
In Figure 3:
1) Select the first tab, named Integration of Data.
2) Select a file from the list of files already uploaded on HDFS.
3) Click the Select button.
4) Define the separator character in use; the default value is comma (,).

Figure 3

In Figure 4, a list of Desired columns will appear, from which:
1) Select the needed columns for the integration.
2) Click the Select button.



Figure 4

In Figure 5, three sections appear, corresponding respectively to Transformation Operations, Restriction Operations, and Output files. The first two (labeled 3 and 4) are optional; the last (labeled 5) is obligatory.



Figure 5

4.3.1.1 Transformation Operations
The software offers four transformation operations using MapReduce (merge columns, split columns, change the case of columns, and format date columns), plus one operation to rename headers.
In Figure 6, for merging:
1) Select the Merge tab in Transformation Operations.
2) Select the two columns that you want to merge.
3) Define the merging character; the default is a blank space.
4) Write the name of the new column in the text box located after the equals sign (=).
5) Click the Proceed button.

Figure 6

In Figure 7, for splitting:
1) Select the Split tab.
2) Select the name of the column to be split.
3) Write the names of the two new columns after the equals sign (=).
4) Define the splitting character; the default is a blank space.
5) Click Proceed.

Figure 7

In Figure 8, for casing:
1) Select the Casing tab.
2) Select the desired column.
3) Select Upper Case or Lower Case.
4) Click Proceed.



Figure 8

In Figure 9, for formatting:
1) Select the Formatting tab.
2) Select the desired column to format; it must be a date column.
3) Select the desired date format for that column. Three date formats are supported: DD/MM/YYYY, MM/DD/YYYY, and YYYY/MM/DD.
4) Click Proceed.

Figure 9

In Figure 10, for renaming columns:
1) Select the Rename tab.
2) Select the column to be renamed.
3) Write the new name.
4) Click Proceed.
It is also possible to rename more columns by clicking the + button.

Figure 10



4.3.1.2 Restriction operation
In the Restriction Operations section, we can restrict the output according to some criteria.
For numeric values: =, <>, >, <, >=, <=
For textual values: EQUAL, NOTEQUAL, CONTAINS.
In Figure 11:
1) Select the desired column.
2) Select the comparison operator.
3) Write the value to compare against.
4) Click the Proceed button.
It is also possible to add more restrictions by clicking the + button.

Figure 11

4.3.1.3 Output files
In Figure 12, we define the output files.
1) Select the desired columns.
2) Write the desired name for the output file.
3) Press the Run button.
It is also possible to generate more files by clicking the + button.

Figure 12



After the integration job is done, a form opens, as shown in Figure 13. If you want to save the job as an IJSL script, click Yes; otherwise click No.

Figure 13

In Figure 14, a dialog will appear:
1) Write the name of the file.
2) Click Save.

Figure 14



4.3.2 Loading an existing IJSL script
You can also load an existing IJSL script file and run it.
In Figures 15 and 16:
1) Click the Load button.
2) Select the file.
3) Click the Load button.

Figure 15

Figure 16



In Figure 17:
Verify the IJSL script and modify it if needed.
1) Click Run.

Figure 17



4.3.3 Writing an IJSL script
The structure of the grammar for the IJSL script is explained in Section 3.
In Figures 18 and 19:
1) Click Write.
2) Write the IJSL script.
3) Click Run.

Figure 18



Figure 19

In Figure 20, we can also save our new IJSL script:
1) Click Save.
2) Write the name of the file.
3) Click Save.



Figure 20

4.4 Data Analysis
In this section we describe the Data Analysis phase.

4.4.1 Load files from Hadoop File System
The system will load the list of files that are currently in the input directory specified in the Settings file, as shown in Figure 21.

Figure 21



4.4.2 Selecting files from Hadoop File System
In Figure 22:
1) Define the separator (default is comma).
2) Select at least one file.
3) Click the Add button.

Figure 22

4.4.3 Displaying field names from selected files
The table will be registered in Spark with the name of the file, and it will be displayed along with its field names to help the user formulate queries, as shown in Figure 23.

Figure 23

4.4.4 Input query
In Figure 24, a text area will be displayed for the user:
1) Input the desired query.
2) Input a name under which to save the results in HDFS.
3) Click the Execute button.

Figure 24
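
To get a feel for the kind of query that can be entered, the sketch below runs a similar SQL statement locally. It deliberately uses Python's built-in sqlite3 module as a stand-in for Spark SQL, so it shows only the shape of the query, not the application's behaviour. The rule that the table name is the file name without its extension, the column names, and the sample rows are all assumptions.

```python
import sqlite3

def table_name_for(file_name):
    """Assumed rule: the table is named after the file, minus its extension."""
    return file_name.rsplit(".", 1)[0]

conn = sqlite3.connect(":memory:")
table = table_name_for("Customers.csv")

# Hypothetical fields and rows standing in for a file selected from HDFS.
conn.execute(f"CREATE TABLE {table} (Customer_ID INTEGER, Name TEXT, City TEXT)")
conn.executemany(
    f"INSERT INTO {table} VALUES (?, ?, ?)",
    [(1, "Ann", "Bonn"), (2, "Bob", "Paris"), (3, "Cy", "Bonn")],
)

query = f"SELECT City, COUNT(*) AS customers FROM {table} GROUP BY City ORDER BY City"
results = conn.execute(query).fetchall()
```

The same SELECT text, written against the table and field names shown in the application, is the sort of input the query text area expects.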

4.4.5 Sample results
Only a small portion of the results is shown inside the Results box; the rest is saved to disk, as shown in Figure 25.

Figure 25

5. Possible flows for Data Integration
All possible flows for the execution of one instance are listed here. If a desired task is not listed here, it may be possible to perform it in two executions. For example, for Merge and Split: first do a Merge job and later do a Split job.



5.1 Specifying values manually
1. Input file -> Desired columns -> Output files.
2. Input file -> Desired columns -> Restriction operations -> Output files.
3. Input file -> Desired columns -> Transformation operations [MERGE] -> Output file*.
4. Input file -> Desired columns -> Transformation operations [SPLIT] -> Output file*.
5. Input file -> Desired columns -> Transformation operations [CASING] -> Output files.
6. Input file -> Desired columns -> Transformation operations [CASING] -> Transformation operations [FORMATTING] -> Output files.
7. Input file -> Desired columns -> Transformation operations [CASING] -> Restriction operations -> Output file*.
8. Input file -> Desired columns -> Transformation operations [CASING] -> Transformation operations [FORMATTING] -> Restriction operations -> Output file.
9. Input file -> Desired columns -> Transformation operations [FORMATTING] -> Output file.
10. Input file -> Desired columns -> Transformation operations [FORMATTING] -> Restriction operations -> Output file.
11. Input file -> Desired columns -> Transformation operations [RENAME] -> Output file.
12. Input file -> Desired columns -> Transformation operations [RENAME] -> Restriction operations -> Output file.

5.2 Loading an existing IJSL script
1. Load -> Run.

5.3 Writing an IJSL script
1. Write -> Run.
2. Write -> Save -> Run.

6. Possible flows for Data Analysis
1. List files -> Select files -> Display table names and fields -> Input query -> Save file -> Display partial results



7. References
[1] http://www.oracle.com/technetwork/database/options/advancedanalytics/bigdataanalyticswpoaa1930891.pdf
[2] http://harvardmagazine.com/2014/03/whybigdataisabigdeal
[3] http://uscisii2.github.io/papers/knoblock13sbd.pdf
[4] https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Big_Data_Analytics_for_Security_Intelligence.pdf
[5] https://en.wikipedia.org/wiki/Apache_Hadoop
[6] http://cdn.oreillystatic.com/oreilly/booksamplers/9781449358624_sampler.pdf
[7] http://www.glennklockwood.com/data-intensive/hadoop/mapreduce-workflow.png