
ITCS 6161/ITCS 8162: Knowledge Discovery in Databases

Assignment Instructions

Instructions:

Software required:
1. Putty: http://www.putty.org/
2. WinSCP: https://winscp.net/eng/download.php
3. Oracle Virtual Box: https://www.virtualbox.org/
4. Cloudera: http://www.cloudera.com/downloads/quickstart_vms/5-8.html
For detailed description on how to install Cloudera, watch this video:
https://www.youtube.com/watch?v=L0lPPC5qeyU
By default, the Cloudera VM comes with Eclipse and the Hadoop packages installed, which you can
use to write MapReduce programs. The Cloudera VM runs a single-node cluster. Use Cloudera to test
your code on small inputs; for large inputs, use the DSBA cluster. Once you are confident that your
code works correctly, run it on the cluster.

******************************************************************************
To log in to the DSBA Hadoop cluster, follow the instructions below:

TASK – 1: Logging into Hadoop cluster and running simple commands


1. Log in to Hadoop via an FTP client (to copy data and view files):

   Open your FTP client (WinSCP).
   Choose Session | New Session.
   Set the file protocol to SFTP.
   Host name: dsba-hadoop.uncc.edu
   Type your username and password.
   Click Save and check the Save Password checkbox.

2. Log in to dsba-hadoop.uncc.edu via PuTTY (to run commands).


3. Run sample text processing on ListOfInputActionRules using the GREP command.
ListOfInputActionRules is a text file containing one action rule per line.

For example:

(a, a1->a2) ^ (c = c2) -> (f, f1->f0) [2, 50%]
(a, a1->a3) ^ (b, ->b1) -> (f, f1->f0) [3, 75%]
(a, a1->a3) ^ (c = c2) -> (f, f1->f0) [1, 80%]
(a, ->a3) ^ (b, b2->b1) -> (f, f1->f0) [3, 50%]

4. To move files from the client to the cluster, use the following command:
hadoop fs -put path-of-the-file-in-client path-of-the-destination-folder-in-cluster

5. Run the following GREP command on ListOfInputActionRules to return all lines of text
(action rules) which contain the word 'a1':

hadoop org.apache.hadoop.examples.Grep input-path-of-ListOfInputActionRules-file path-of-destination-folder ".*a1.*"

NOTE: The destination folder must not exist before running this command. To
remove a folder, use the following command:
hadoop fs -rm -r path-of-the-folder

6. To get the output folder back to the client, use the following command:

hadoop fs -get path-of-the-output-folder-in-cluster path-of-the-folder-in-client
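Steps 4-6 together form a put / process / get cycle. A sketch of the whole sequence follows, with placeholder paths you would replace with your own; the `run` wrapper only prints each command so this reads as a dry run (remove the `echo` to actually execute it on the cluster):

```shell
# Hypothetical names -- substitute your own paths.
INPUT_LOCAL=ListOfInputActionRules.txt              # file on the client
INPUT_HDFS=actionrules/ListOfInputActionRules.txt   # destination in HDFS
OUTPUT_HDFS=grep-output                             # must NOT exist yet (see the NOTE above)

run() { echo "$@"; }   # dry run: print each command instead of executing it

run hadoop fs -put "$INPUT_LOCAL" "$INPUT_HDFS"                                    # step 4
run hadoop org.apache.hadoop.examples.Grep "$INPUT_HDFS" "$OUTPUT_HDFS" '.*a1.*'   # step 5
run hadoop fs -get "$OUTPUT_HDFS" ./grep-output                                    # step 6
```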

7. Repeat steps 4-6 for the Mammals book text file and return all lines of text which contain the
word "mammal". Download the Mammals book text file here:

http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/03_MammalsBook_Text_34848.txt.utf8.txt

For TASK-2 and TASK-3, use Mammals book as an input file.

TASK – 2: Running WordCount


Read the "MapReduce Tutorial" at
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Basic procedure to follow when executing a MapReduce program on a Hadoop cluster:
1. The inputs must be transferred from the local system to HDFS.
2. The JAR file can stay on the client side (i.e., the machine WinSCP connects from).
The output of a MapReduce program is written to HDFS and can then be transferred back to
the local system.

To understand how MapReduce works, see the following link along with its examples:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (for Hadoop version 1.0)
All basic HDFS commands can be found here:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
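To build intuition for what WordCount computes, the map / shuffle / reduce phases can be mimicked with a classic shell pipeline: tr plays the role of the map (emit one word per line), sort plays the shuffle (group identical keys together), and uniq -c plays the reduce (sum each group). This is only an analogy, not part of the assignment:

```shell
printf 'the cat sat on the mat\nthe cat\n' |
  tr -s ' ' '\n' |   # "map": split input into one word per line
  sort |             # "shuffle": bring identical words together
  uniq -c            # "reduce": count each group of identical words
```

Running this shows, for example, that "the" occurs 3 times and "cat" occurs 2 times in the sample input.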

1. Create a new Java project in Eclipse on the Cloudera VM.
2. Cloudera's Eclipse contains a sample MapReduce project that includes all the
   required MapReduce .jar files. Import all of those .jar files into your project.
3. Copy WordCount v2.0 from https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
   into your project.
4. Export your project as a .jar file.
5. Run the .jar file on the cluster and produce the output.
6. Use the following command to run the .jar file:
   hadoop jar path-of-the-jar-file path-of-input-folder path-of-output-folder
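Concretely, assuming the exported JAR is named wordcount.jar and the Mammals book is already in HDFS under mammals-input (both names are hypothetical), the run and a quick inspection of the result might look like this; the `run` wrapper just prints the commands as a dry run (remove the `echo` to execute them on the cluster):

```shell
run() { echo "$@"; }   # dry run: print each command instead of executing it

# Output folder must not pre-exist, same as with the Grep example.
run hadoop jar wordcount.jar mammals-input wordcount-output
# Inspect the result in place; the glob is quoted so HDFS expands it, not the shell.
run hadoop fs -cat 'wordcount-output/part-*'
```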

TASK – 3: Running modified version of WordCount

1. Download the .jar file from
   https://github.com/Keval17/Hadoop-Map-Reduce-with-Modified-Map-function-for-efficient-word-counts
2. Save it on the client.
3. Run it and produce the output.

TASK – 4: Write-up comparing the results of TASK-2 and TASK-3

Submit all your source code, all your outputs and output files, and a comparison write-up for
TASK-2 and TASK-3. We need the following outputs:
1. GREP command output for the ListOfInputActionRules file
2. GREP command output for the Mammals book
3. WordCount v2.0 output
4. Modified WordCount output
