Integration of R with Hadoop

When it comes to statistical analysis, R is one of the most preferred options, and by integrating it with Hadoop we can successfully use it for Big Data analytics. In this post, we will walk step by step through integrating R with Hadoop and will perform various operations on HDFS from the R console.

RHadoop is a collection of three R packages that bring large-scale data operations to the R environment. Each of these packages offers different Hadoop features:

1. Rhdfs

2. Rmr

3. Rhbase

Rhdfs:

Rhdfs is an R package that provides basic connectivity to the Hadoop Distributed File System. With it, R programmers can browse, read, write, and modify files stored in HDFS from within R. The Rhdfs package calls the HDFS API in the backend to operate on the data sources stored in HDFS. This package needs to be installed only on the node that will run the R client.
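As an illustration, once the setup described below is complete, reading and writing an HDFS file from R might look like the following minimal sketch; the /tmp/sample.txt path is a hypothetical example:

library(rhdfs)
hdfs.init()

# Write raw bytes to a (hypothetical) file in HDFS
out <- hdfs.file("/tmp/sample.txt", "w")
hdfs.write(charToRaw("hello from R\n"), out)
hdfs.close(out)

# Read the bytes back and convert them to a character string
inp <- hdfs.file("/tmp/sample.txt", "r")
rawToChar(hdfs.read(inp, n = 1024L))
hdfs.close(inp)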

Rmr:

Rmr is an R package that allows R developers to perform statistical analysis in R via Hadoop's MapReduce functionality on a Hadoop cluster. With the help of this package, the job of an R programmer is reduced to dividing the application logic into map and reduce phases and submitting it with the Rmr methods. Rmr then calls the Hadoop Streaming MapReduce API with several job parameters, such as the input directory, output directory, mapper, and reducer, to run the R MapReduce job over the Hadoop cluster. This package needs to be installed on every node in the cluster.
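To illustrate, here is a minimal sketch of a toy job written with the rmr2 package (the current name under which Rmr is distributed); it assumes rmr2 is installed and the HADOOP_CMD and HADOOP_STREAMING environment variables described below are set:

library(rmr2)

# Write a small vector of integers into HDFS
small.ints <- to.dfs(1:1000)

# Map each integer to its square; this toy job needs no reducer
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)
)

# Fetch the resulting key/value pairs back into the R session
from.dfs(result)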

Rhbase:

Rhbase is an R interface for operating on Hadoop's HBase data source, which is stored across the distributed network and accessed via a Thrift server. The Rhbase package is designed with several methods for initialization, read/write, and table manipulation operations.

In this post, we will focus on the Rhdfs package, which provides the basic connectivity to the Hadoop Distributed File System. Before delving deeper, let's look at how to set up RHadoop.

Steps for Setting up RHadoop:

The prerequisites for installing RHadoop are Hadoop and R, so the first steps are installing Java and Hadoop, and then installing R. Assuming they are already installed, let's get started with the setup process.

Installing the Required Packages:

We require several R packages to be installed for connecting R with Hadoop. The list of packages is as follows:

- rJava
- RJSONIO
- itertools
- digest
- Rcpp
- httr
- functional
- devtools
- plyr
- reshape2

We will discuss installing all of these packages in two different ways.

1. Using install.packages() from the R console:

install.packages(c('rJava', 'RJSONIO', 'itertools', 'digest', 'Rcpp', 'httr', 'functional', 'devtools', 'plyr', 'reshape2'), dependencies = TRUE, repos = 'http://cran.rstudio.com/')

2. Downloading the packages and installing through R CMD:

Download the required packages from the below link:

https://drive.google.com/open?id=0B5dejdhAYHztRkgzbGZOeUdXdVE

After downloading the packages, extract them and use the below command:

unzip Rhadoop_packages.zip

To install these packages, we will be using R CMD:

R CMD INSTALL <package name>

Now we will install rJava.

Note: Before installing rJava, we should set the JAVA_HOME path and log in to R with sudo privileges. Refer to the below command:

sudo R CMD INSTALL rJava_0.9-6.tar.gz

We need to follow the same command to install all the other required packages:

sudo R CMD INSTALL <package name>

Note: Before installing rhdfs, we should set the HADOOP_CMD environment variable. For accessing HDFS, we should also start the Hadoop daemons and make sure all of them are up; you can check the files in HDFS from the command line to verify.
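For example, the environment variables can be set from within an R session before rhdfs is loaded. The paths below are assumptions and must be adjusted to your own Hadoop installation; HADOOP_STREAMING is shown as well because Rmr (rmr2) requires it:

# Assumed install locations; adjust to match your cluster
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar")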

Now we will access HDFS from the R console. First, log in to the R console.

Set the environment variables and load the required rhdfs package. After loading the rhdfs package, we should initiate the connection using hdfs.init().

Accessing HDFS through the R console:

Listing the files in the HDFS root directory:

hdfs.ls('/')

To get the HDFS default configurations used for this connection, use:

hdfs.defaults("conf")

File manipulation:

- hdfs.mkdir: Used to create a new directory in HDFS:

hdfs.mkdir('/new_dir')

- hdfs.put: Used to copy files from the local filesystem to HDFS:

hdfs.put('localfile source', 'hdfs destination')

- hdfs.move: Used to move a file from one HDFS directory to another:

hdfs.move('/test_file', '/new_dir/')

- hdfs.rename: Used to rename a file stored in HDFS from R:

hdfs.rename('/new_dir/test_file', '/new_dir/test_file1')

- hdfs.chmod: Used to change the permissions of a file:

hdfs.chmod('/Wc.txt', permissions = '777')

- hdfs.delete: Used to delete an HDFS file or directory from R:

hdfs.delete("/RHadoop")
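Putting these file-manipulation commands together, a short end-to-end sequence might look like the following sketch; the local and HDFS paths are hypothetical:

# Create a directory, copy in a local file, then reorganize it
hdfs.mkdir("/new_dir")
hdfs.put("/home/user/test_file", "/")       # assumed local path
hdfs.move("/test_file", "/new_dir/")        # move into the new directory
hdfs.rename("/new_dir/test_file", "/new_dir/test_file1")
hdfs.chmod("/new_dir/test_file1", permissions = "777")
hdfs.ls("/new_dir")                         # confirm the result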

Hope this blog helped you in learning how to integrate R with Hadoop. Keep visiting our site for more updates on Big Data and other technologies.