
Introduction to R

Welcome to the statistical world of the R language. This chapter gives a basic introduction to R with the
help of a few case studies. R provides an environment for data management and statistical analysis.
Although this environment is often perceived as less friendly than point-and-click statistical software,
demand for it is growing quickly in both academia and industry, and it is likely to play a central role in
the future of statistical analysis in both.

1.1 Benefits of R

The benefits of R for a research analyst include:

• It is free software that provides statistical analysis techniques and graphics facilities.
• R has an excellent built-in help system.
• R is open source, which means that its developers make the source code available to everyone.
Anyone can therefore contribute improvements to the software.
• R consists of a base package with many built-in statistical functions. It can be extended by
downloading additional packages that add specific functionality to the base software.
• R is a programming language, so analysts with programming experience tend to find it comfortable to use.
• The Comprehensive R Archive Network (CRAN) is central to using R. It is the place from which you
download the software and the packages you want to install.

Disadvantages of R compared with other software

• It has a limited graphical interface.
• There is no commercial support.
• Working in R means writing code, so students must learn to deal with syntax issues.

The R project was started by Robert Gentleman and Ross Ihaka of the Statistics Department of the
University of Auckland in the early 1990s, and R was released as free software in 1995. It quickly gained a
widespread audience. It is currently maintained by the R core development team, an international team of
volunteer developers. The R project web page http://www.r-project.org is the main site for information on R.
It gives directions for obtaining the software, accompanying packages and other sources of documentation.

2. Starting R

There are three windows in R:

1. Console
2. Editor
3. Graphics

The console is the main window, where you run commands and see their results. The editor is where you
write and save scripts. The graphics window displays the graphs produced by plotting commands.

>, + and #
The symbol “>” is called the prompt. Lines that begin with it are typed by the user, and all other lines are
output produced by R. Type a command after the “>” and press Enter to have R execute it. If a command is
incomplete at the end of a line, R displays “+” as the continuation prompt on the next line. The symbol “#”
starts a comment: everything from “#” to the end of the line is ignored by R.
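
A minimal sketch of comments and continuation lines, assuming the code is typed into the console. The prompts “>” and “+” are printed by R itself, so they are left out here; the object names total and long_sum are made up for illustration.

```r
# Everything after '#' on a line is a comment and is ignored by R.
total <- 2 + 5   # a comment may also follow a command

# The first line below ends with '+', so the expression is incomplete.
# In the console, R shows the continuation prompt '+' on the second line
# to signal that it is waiting for the rest of the command.
long_sum <- 5 + 10 +
  20 - 15 + 25

total      # 7
long_sum   # 45
```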

Objects and Functions: Commands in R are generally made up of two parts:

• Objects

• Functions

An object is anything created in R, such as a single variable or a collection of variables. Functions
perform operations; many are built into R and more come from packages. The result of a function is stored
in an object with the assignment operator <-, that is, object <- function(...).
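
As a small illustration of objects and functions, the sketch below stores the result of a function call in an object. The names scores and avg are chosen only for this example; the four values are the first four performance scores from Exercise 3.

```r
# 'scores' is an object; c() is the function that creates the vector.
scores <- c(57, 78, 57, 46)

# 'avg' is an object that holds the value returned by the function mean().
avg <- mean(scores)

avg             # 59.5
class(scores)   # "numeric"
ls()            # lists the objects created so far in this session
```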

Typing less: You can save a lot of typing in R. Each command you enter is stored in a history. The up arrow
key steps backwards through this history and the down arrow key steps forwards, so earlier commands can be
recalled and edited. The left and right arrow keys move the cursor within the current line.

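Exercise 1: Using R as a calculator. Evaluate the following mathematical expressions in R.

# Using R as a calculator: the usual operator precedence applies

2 + 5
5 + 10 + 20 - 15 + 25
15 + 10 / 2 - 15 / 3 * 5
(15 + 20 / 2 - 50 / 10) * 5

# Store values in objects and reuse them in a formula

x <- 18
y <- 15
z <- 5

a <- pi * 2^5 - sqrt(x) + log10(y) - log(z)   # log10() is base 10, log() is the natural log
a

# Built-in mathematical functions

15 - 17 * 2 / 3 - 20
abs(15 - 17 * 2 / 3 - 20)   # absolute value
factorial(5)                # 5! = 120
log10(2)
log(2)
exp(0.6931472)              # exp() is the exponential function, the inverse of log()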

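# Exercise: formula for the net present value (NPV)

c0 <- 110000   # initial investment (cash outflow today)
c1 <- 50000    # cash inflow in period 1
c2 <- 30000    # cash inflow in period 2
c3 <- 40000    # cash inflow in period 3
k <- 0.10      # discount rate per period
npv <- c1 / (1 + k)^1 + c2 / (1 + k)^2 + c3 / (1 + k)^3 - c0
npv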
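
The same NPV calculation can be written once as a function and reused for any number of cash flows. This is a sketch only: the function name npv_fn and its argument names are hypothetical, and it assumes each cash flow arrives at the end of periods 1, 2, 3 and so on.

```r
# Hypothetical helper: NPV of cash flows received at the end of
# periods 1, 2, ..., discounted at a constant rate per period.
npv_fn <- function(initial_outlay, cash_flows, rate) {
  periods <- seq_along(cash_flows)               # 1, 2, 3, ...
  sum(cash_flows / (1 + rate)^periods) - initial_outlay
}

# Same figures as in the exercise above
npv_fn(initial_outlay = 110000,
       cash_flows = c(50000, 30000, 40000),
       rate = 0.10)
```

With the figures of the exercise the result is negative (about -9699.47), so the discounted inflows do not cover the initial outlay at a 10% discount rate.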

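# Formula for the futures price: the spot price compounded continuously at rate r for time t

spotprice <- 2000   # current spot price
r <- 0.12           # interest rate (continuously compounded)
t <- 0.5            # time to maturity
Futureprice <- spotprice * exp(r * t)
Futureprice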

Exercise 2: The following dataset gives, for a few companies, the current stock price, the earnings per
share (EPS), the average PE ratio of the industry the company belongs to and the type of industry.

Company       Sector       Current Price  EPS    Industry PE  Industry Type
DLF           Real estate  90             2.52   29           Manufacturing
SBI           Banking      160            147    11           Service
HDFC          Banking      1000           36     25           Service
Bharti        Telecom      326            19     14           Service
Reliance      Petroleum    1020           69     18           Manufacturing
Infosys       IT           1120           185    23           Service
BHEL          Infra        110            13     17           Manufacturing
Ranbaxy       Pharma       450            9.78   22.57        Manufacturing
Tata Steel    Metal        250            75.41  10.33        Manufacturing
L&T           Infra        1200           62.86  17.39        Manufacturing

You are required to perform the following activities in R:

1. Create a folder “r data” on the desktop and set it as your default working folder.
2. Identify and install a package of your choice. Use require() to load the package.
3. Enter the given data for all the variables using the c() function.
4. Combine all the variables into a dataset using the data.frame() function.
5. Calculate the PE ratio of each company and include it in the dataset.
6. Check the type of each variable (numeric or character).
7. Add the data for the following five companies:

Company       Sector       Current Price  EPS    Industry PE  Industry Type
Gati          Logistic     110            1.96   97           Service
Yes Bank      Banking      710            57     17           Service
Vedanta       Metal        75             8      11           Manufacturing
Maruti        Auto         29             0.1    7            Manufacturing
Apollo Tyres  Auto         161            16     10           Manufacturing

8. Calculate the descriptive statistics (mean, median, mode, standard deviation, variance, minimum and
maximum) of the variables.
9. Export the new dataset to the default folder in .csv format.
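
One possible way to work through the nine steps is sketched below. Several things here are assumptions rather than part of the exercise: the object and column names (stocks, price, eps and so on), the output file name stocks.csv, the helper functions stat_mode and describe, and the use of a temporary folder instead of the desktop so that the sketch runs on any machine. Base R has no function for the statistical mode, so a small helper is defined for it.

```r
# Step 1: create the working folder and make it the working directory.
# A temporary location is used here; on your machine use the desktop path.
work_dir <- file.path(tempdir(), "r data")
dir.create(work_dir, showWarnings = FALSE)
old_dir <- setwd(work_dir)

# Step 2: install a package once with install.packages("<package name>"),
# then load it. 'stats' ships with R, so nothing is installed in this sketch.
require(stats)

# Step 3: enter each variable with c()
company <- c("DLF", "SBI", "HDFC", "Bharti", "Reliance", "Infosys",
             "BHEL", "Ranbaxy", "Tata Steel", "L&T")
sector <- c("Real estate", "Banking", "Banking", "Telecom", "Petroleum", "IT",
            "Infra", "Pharma", "Metal", "Infra")
price <- c(90, 160, 1000, 326, 1020, 1120, 110, 450, 250, 1200)
eps <- c(2.52, 147, 36, 19, 69, 185, 13, 9.78, 75.41, 62.86)
industry_pe <- c(29, 11, 25, 14, 18, 23, 17, 22.57, 10.33, 17.39)
industry_type <- c("Manufacturing", "Service", "Service", "Service",
                   "Manufacturing", "Service", "Manufacturing",
                   "Manufacturing", "Manufacturing", "Manufacturing")

# Step 4: combine the variables into a data frame
stocks <- data.frame(company, sector, price, eps, industry_pe, industry_type,
                     stringsAsFactors = FALSE)

# Step 5: PE ratio = current price / EPS
stocks$pe <- stocks$price / stocks$eps

# Step 6: check the type of each variable
str(stocks)
sapply(stocks, class)

# Step 7: add the five new companies
new_rows <- data.frame(
  company = c("Gati", "Yes Bank", "Vedanta", "Maruti", "Apollo Tyres"),
  sector = c("Logistic", "Banking", "Metal", "Auto", "Auto"),
  price = c(110, 710, 75, 29, 161),
  eps = c(1.96, 57, 8, 0.1, 16),
  industry_pe = c(97, 17, 11, 7, 10),
  industry_type = c("Service", "Service", "Manufacturing", "Manufacturing",
                    "Manufacturing"),
  stringsAsFactors = FALSE
)
new_rows$pe <- new_rows$price / new_rows$eps
stocks <- rbind(stocks, new_rows)

# Step 8: descriptive statistics of the numeric variables.
# stat_mode() returns the most frequent value (the first one in case of ties).
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
describe <- function(x) {
  c(mean = mean(x), median = median(x), mode = stat_mode(x),
    sd = sd(x), var = var(x), min = min(x), max = max(x))
}
sapply(stocks[c("price", "eps", "industry_pe", "pe")], describe)

# Step 9: export to the working folder, then restore the old directory
write.csv(stocks, "stocks.csv", row.names = FALSE)
setwd(old_dir)
```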

Exercise 3 (Descriptive statistics in R): The HR manager of ABC Ltd wants to analyse the performance scores and the
retention levels of the company's employees. She collected data on 40 employees for six variables: gender, age,
performance score, education background, monthly income and time spent in the company (referred to as retention in
the tasks below). The dataset is given below:

Employee Code  Gender  Age (in years)  Performance Score  Education Background  Monthly Income  Time Spent in the Company
1              M       25              57                 BSc                   29000           2
2              M       27              78                 Bcom                  30000           2
3              M       36              57                 Btech                 43000           7
4              F       43              46                 BE                    56000           6
5              F       36              59                 BA                    67000           5
6              F       28              65                 BSc                   76000           3
7              F       23              67                 BA                    15000           4
8              M       35              73                 Bcom                  72000           5
9              F       34              49                 BE                    52000           4
10             M       45              63                 Btech                 65000           7
11             F       34              68                 Btech                 61000           6
12             F       43              62                 BA                    89000           5
13             M       42              75                 BSc                   87000           6
14             M       25              56                 Bcom                  39000           2
15             F       43              64                 Bcom                  73000           4
16             F       42              69                 BSc                   76000           3
17             M       33              75                 BA                    30000           3
18             M       26              65                 BE                    28000           2
19             M       27              52                 Btech                 39000           3
20             F       24              99                 BA                    19000           3
21             F       25              56                 BA                    20000           4
22             M       26              87                 BE                    32000           3
23             F       35              67                 BA                    48000           9
24             F       36              52                 BSc                   52000           5
25             M       25              91                 BSc                   26000           2
26             M       36              78                 BE                    54000           7
27             F       38              50                 BA                    71000           6
28             F       39              72                 Btech                 39000           6
29             M       35              69                 Btech                 41000           5
30             M       36              61                 Btech                 43000           5
31             F       31              66                 Bcom                  20000           6
32             F       34              89                 Bcom                  45000           4
33             F       35              59                 BE                    51000           3
34             M       32              60                 BA                    48000           2
35             M       71              65                 Btech                 120000          15
36             M       29              59                 Btech                 49000           4
37             F       32              72                 BSc                   50000           7
38             M       36              63                 BE                    48000           5
39             F       48              56                 BE                    51000           8
40             M       39              73                 BSc                   63000           3

Perform the following analysis in R for the given dataset.

1. Import the data (in .csv format) into R using the read.csv() function, examine the data and create any two
subsets of it.
2. Convert the data frame into a matrix. Study the dataset using head(), names() and View(). Analyse the summary of
each variable.
3. Calculate the descriptive statistics (mean, median, mode, standard deviation, variance, minimum, maximum,
skewness and kurtosis) of the variables age, performance score, retention and monthly income.
4. It is later found that the performance score of the 4th employee was recorded incorrectly. The actual performance
score is 78. Correct the data.
5. (a) What is the monthly income of the 19th employee?
(b) Display the performance scores of all employees except the 10th.
(c) Display the ages of only the 7th, 9th and 15th employees.

6. (a) Which employees have a performance score equal to 80?
(b) Which employees have a performance score greater than 70?
(c) Which employees have a performance score less than 70?

7. Make frequency distribution tables of the gender and education background of the employees.

8. Find the univariate outliers in the variables performance score and monthly income using box plots.

9. Use the barplot(), hist() and pie() functions to plot the performance scores of the employees.

10. Test whether the variables performance score (ps), age, monthly income (mi) and retention (rt) are normally
distributed.

11. Draw a bivariate plot of the variables age and monthly income (using the plot() function).

12. (a) Test the null hypothesis that the average monthly income of the employees of the company is Rs 50000.
(b) Test the null hypothesis that the average performance score is the same for employees of all education
backgrounds.
13. Analyse the correlation between the age, performance score, monthly income and retention of the employees.

14. Run and analyse the following regression model:

$\text{Performance score} = \alpha + \beta_1 \times \text{Age} + \beta_2 \times \text{Retention}$
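
The tasks of Exercise 3 could be approached along the following lines. This is a sketch under stated assumptions, not a model answer: the column names (code, ps, mi, rt), the file name employees.csv, the helpers stat_mode and describe, and the choice of tests (Shapiro-Wilk for normality, a one-sample t-test, one-way ANOVA) are all choices made for this example. Skewness and kurtosis are left out because base R has no functions for them; an add-on package would be needed.

```r
# Data from the table of 40 employees, entered as vectors
gender <- c("M","M","M","F","F","F","F","M","F","M", "F","F","M","M","F","F","M","M","M","F",
            "F","M","F","F","M","M","F","F","M","M", "F","F","F","M","M","M","F","M","F","M")
age <- c(25,27,36,43,36,28,23,35,34,45, 34,43,42,25,43,42,33,26,27,24,
         25,26,35,36,25,36,38,39,35,36, 31,34,35,32,71,29,32,36,48,39)
ps <- c(57,78,57,46,59,65,67,73,49,63, 68,62,75,56,64,69,75,65,52,99,
        56,87,67,52,91,78,50,72,69,61, 66,89,59,60,65,59,72,63,56,73)
edu <- c("BSc","Bcom","Btech","BE","BA","BSc","BA","Bcom","BE","Btech",
         "Btech","BA","BSc","Bcom","Bcom","BSc","BA","BE","Btech","BA",
         "BA","BE","BA","BSc","BSc","BE","BA","Btech","Btech","Btech",
         "Bcom","Bcom","BE","BA","Btech","Btech","BSc","BE","BE","BSc")
mi <- 1000 * c(29,30,43,56,67,76,15,72,52,65, 61,89,87,39,73,76,30,28,39,19,
               20,32,48,52,26,54,71,39,41,43, 20,45,51,48,120,49,50,48,51,63)
rt <- c(2,2,7,6,5,3,4,5,4,7, 6,5,6,2,4,3,3,2,3,3,
        4,3,9,5,2,7,6,6,5,5, 6,4,3,2,15,4,7,5,8,3)

# 1. Save as CSV (hypothetical file name), import it and make two subsets
csv_file <- file.path(tempdir(), "employees.csv")
write.csv(data.frame(code = 1:40, gender, age, ps, edu, mi, rt),
          csv_file, row.names = FALSE)
emp <- read.csv(csv_file, stringsAsFactors = FALSE)
females <- subset(emp, gender == "F")
over_40 <- emp[emp$age > 40, ]

# 2. Matrix conversion and a first look (View(emp) works in interactive sessions)
emp_matrix <- as.matrix(emp)
head(emp)
names(emp)
summary(emp)

# 3. Descriptive statistics; stat_mode() returns the most frequent value
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
describe <- function(x) {
  c(mean = mean(x), median = median(x), mode = stat_mode(x),
    sd = sd(x), var = var(x), min = min(x), max = max(x))
}
sapply(emp[c("age", "ps", "rt", "mi")], describe)

# 4. Correct the performance score of the 4th employee
emp$ps[4] <- 78

# 5. Indexing: one element, all but one element, selected elements
emp$mi[19]
emp$ps[-10]
emp$age[c(7, 9, 15)]

# 6. Logical conditions: which() returns the matching row numbers
which(emp$ps == 80)
which(emp$ps > 70)
which(emp$ps < 70)

# 7. Frequency tables
table(emp$gender)
table(emp$edu)

# 8, 9 and 11. Plots. pdf(NULL) sends them to a null device so the sketch
# also runs in batch mode; drop pdf(NULL) and dev.off() to see the plots.
pdf(NULL)
boxplot(emp$ps)
boxplot(emp$mi)
barplot(emp$ps, names.arg = emp$code)
hist(emp$ps)
pie(emp$ps, labels = emp$code)
plot(emp$age, emp$mi, xlab = "Age", ylab = "Monthly income")
dev.off()
boxplot.stats(emp$ps)$out   # values flagged as outliers by the box plot rule
boxplot.stats(emp$mi)$out

# 10. Shapiro-Wilk normality test for each variable
lapply(emp[c("ps", "age", "mi", "rt")], shapiro.test)

# 12. One-sample t-test and one-way ANOVA
t.test(emp$mi, mu = 50000)
summary(aov(ps ~ edu, data = emp))

# 13. Correlation matrix
cor(emp[c("age", "ps", "mi", "rt")])

# 14. Regression of performance score on age and retention
model <- lm(ps ~ age + rt, data = emp)
summary(model)
```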
