Assignment No 8
Name: Renu Tamsekar
C number: C22020111255
Date: 12/04/24
Aim: Data Analysis and Visualization using R programming
Brief introduction to R programming
R is an open-source programming language that is widely used as statistical software and as a data
analysis tool. The system comprises two parts: the R language itself (which is what most
people mean when they talk about R) and a run-time environment. It is an interpreted language,
which means that code is executed statement by statement, typically through a command-line
interpreter. Unlike languages such as Python and Java, R is generally not regarded as a
general-purpose programming language. Instead, it is considered a domain-specific language (DSL),
meaning its functions are designed for a specific domain, here statistical computing and graphics.
It is equipped with a large set of functions that enable data visualization, so users can analyze
data, model it as required, and then create graphics. In addition to the language's built-in
graphical functions, numerous add-on packages extend these capabilities.
Features:
1. Comprehensive Statistical Analysis Toolkit: It provides a vast standard library and a
comprehensive set of tools for statistical analysis to perform linear and nonlinear modeling,
classical statistical tests, time-series analysis, classification, clustering, etc.
2. Graphical Capabilities: R excels in its graphical capabilities, enabling users to create
high-quality plots, including scatter plots, line graphs, histograms, bar charts, and more. The base
graphics system in R allows for fine control over these graphics, while additional packages like
ggplot2 offer advanced plotting options.
3. Data Manipulation and Storage: It includes powerful facilities for data manipulation.
Packages like dplyr and data.table allow for efficient manipulation of data sets, including filtering,
selecting, rearranging, and aggregating data. R handles a variety of data types and works
efficiently with data sets that fit in memory.
4. Programming Features: It supports both procedural and object-oriented programming,
featuring functions, loops, conditional statements, and user-defined recursive functions. It also
supports S3 and S4 classes for more advanced object-oriented programming.
5. Package Ecosystem: The Comprehensive R Archive Network (CRAN), a repository of over
16,000 packages, extends R’s base functionality. These packages cover a wide range of statistical,
graphical, and data manipulation tasks.
6. Interoperability: It can interface with other programming languages. It can call and be
called from C/C++, Java, and Python, making it versatile in multi-language development projects.
7. Environment and IDE Support: It can be used in various environments, from the command
line to several feature-rich IDEs like RStudio, which provide tools for scripting, debugging, and
managing projects.
8. Platform Independence: It is cross-platform, running on Windows, macOS, and various
Unix/Linux flavors, making it versatile for all users regardless of their operating system.
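The programming features listed under point 4 can be illustrated with a minimal sketch. The names fact, new_flower, and print.flower are hypothetical and chosen only for this example; they are not part of the assignment code.

```r
# Recursive user-defined function (hypothetical name: fact)
fact <- function(n) {
  if (n <= 1) return(1)   # base case stops the recursion
  n * fact(n - 1)         # recursive call
}

# Minimal S3 class: a constructor that tags a list with a class attribute
new_flower <- function(species, petal_length) {
  structure(list(species = species, petal_length = petal_length),
            class = "flower")
}

# S3 method: print() dispatches here for objects of class "flower"
print.flower <- function(x, ...) {
  cat("Flower:", x$species, "with petal length", x$petal_length, "cm\n")
  invisible(x)
}

f <- new_flower("setosa", 1.4)
print(f)   # calls print.flower
fact(5)    # 120
```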
Implementation details:
Basics of R programming
1. Creating data variables and data vectors
2. Creating a data frame
3. Working with the above data frame
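The three steps above might look like the following sketch. The variable names and the student values are made up for illustration and are not taken from the original assignment.

```r
# 1. Scalar variables and vectors (illustrative values)
student_name <- "Asha"      # character
marks <- 82.5               # numeric
passed <- TRUE              # logical

names_vec <- c("Asha", "Ravi", "Meera")   # c() combines values into a vector
marks_vec <- c(82.5, 67.0, 91.0)

# 2. A data frame combines equal-length vectors as columns
students <- data.frame(name = names_vec, marks = marks_vec)

# 3. Working with the data frame
students$passed <- students$marks >= 70   # add a derived logical column
nrow(students)                            # number of rows
mean(students$marks)                      # average of one column
students[students$passed, ]               # keep only rows where passed is TRUE
```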
Dataset used: iris (the Iris flower data set built into R)
Analysis, visualization, and outputs:
str() function: compactly displays the internal structure of an R object.
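A minimal call on the iris data set, assuming the default datasets package is available (it is attached in a standard R session):

```r
data(iris)   # iris ships with base R in the datasets package
str(iris)    # 150 observations of 5 variables: 4 numeric columns and 1 factor (Species)
```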
head() function: returns the first n rows of the dataset.
head(x, n): x is the object to inspect (for example, a data frame) and n is the number of rows
to print. Here we print the first 3 rows of the dataset.
tail(x, n): displays the last n rows of the dataset.
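Both calls can be sketched on iris as follows; when n is omitted, head() and tail() default to 6 rows.

```r
head(iris, 3)   # first 3 rows
tail(iris, 3)   # last 3 rows
head(iris)      # n is omitted, so the default of 6 rows is used
```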
summary(): returns the following for each numeric column of the given data.
Min: the minimum value; 1st Qu: the 1st quartile (25th percentile); Median: the median value;
Mean: the arithmetic mean; 3rd Qu: the 3rd quartile (75th percentile); Max: the maximum value.
For a factor column such as Species, it returns the count of each level.
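A short sketch of summary() on the whole data frame and on a single column. For a numeric column the result holds six values, including the mean.

```r
summary(iris)                      # every column; Species shows counts per level
s <- summary(iris$Sepal.Length)    # one numeric column
names(s)                           # labels of the six reported statistics
```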
sd(): returns the sample standard deviation of a numeric vector.
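For example, sd() can be applied to one column or, with sapply(), to every numeric column of iris. The last line is only a cross-check that sd() equals the square root of var().

```r
sd(iris$Sepal.Length)              # standard deviation of one column
col_sd <- sapply(iris[, 1:4], sd)  # standard deviation of each numeric column
col_sd
sqrt(var(iris$Sepal.Length))       # same value as sd(iris$Sepal.Length)
```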
Visualizations:
Pie chart
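A pie chart of the species counts could be drawn as in this sketch. The title and colours are arbitrary choices, not the ones used in the original output.

```r
species_counts <- table(iris$Species)   # number of flowers per species
pie(species_counts,
    main = "Share of each species in iris",
    col = c("tomato", "skyblue", "lightgreen"))
```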
Bar plot
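One possible bar plot, assuming the quantity of interest is the mean petal length per species (the original plot may have shown a different variable):

```r
# Mean petal length for each species
mean_petal <- tapply(iris$Petal.Length, iris$Species, mean)
barplot(mean_petal,
        main = "Mean petal length by species",
        xlab = "Species",
        ylab = "Mean petal length (cm)",
        col = "steelblue")
```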
Histogram
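A histogram of a single numeric column can be sketched as follows. The choice of Sepal.Length and of breaks = 10 is an assumption for illustration; R treats breaks as a suggestion and may pick a nearby number of bins.

```r
h <- hist(iris$Sepal.Length,
          breaks = 10,
          main = "Distribution of sepal length",
          xlab = "Sepal length (cm)",
          col = "lightblue")
h$counts   # number of observations in each bin
```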
Scatter plot
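A scatter plot of two numeric columns, coloured by species, might look like this. The pair of variables is chosen only as an example.

```r
plot(iris$Petal.Length, iris$Petal.Width,
     col = as.integer(iris$Species),   # one colour per species (1, 2, 3)
     pch = 19,
     xlab = "Petal length (cm)",
     ylab = "Petal width (cm)",
     main = "Petal width vs. petal length")
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)
```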
pairs() function: part of base R (graphics package); it produces a matrix of scatter plots.
It takes a data frame or matrix as an argument and draws a scatter plot for each pair of
variables in it.
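A minimal call on the four numeric columns of iris. Colouring the points by species is an optional addition for readability.

```r
num_cols <- iris[, 1:4]                 # drop the Species factor
pairs(num_cols,
      col = as.integer(iris$Species),   # one colour per species
      pch = 19,
      main = "Scatter plot matrix of iris")
```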
psych package: an R package that provides various functions for psychological research and
data analysis. It contains tools for data visualization, factor analysis, reliability analysis,
correlation analysis, and more.
pairs.panels(): draws a scatter plot matrix (SPLOM), with bivariate scatter plots below the
diagonal, histograms on the diagonal, and the Pearson correlations above the diagonal. It is
useful for descriptive statistics of small data sets.
cex.cor: sets the size of the correlation text. If only cex is specified, it changes the
character size; if cex.cor is also specified, cex controls the point size and cex.cor controls
the size of the correlation text.
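A hedged sketch of pairs.panels() with cex and cex.cor, assuming the psych package is installed. The requireNamespace() guard only keeps the script from failing when it is not, and the argument values are illustrative.

```r
# install.packages("psych")   # run once if psych is not installed

# Pearson correlations that pairs.panels() prints above the diagonal
pearson <- cor(iris[, 1:4], method = "pearson")
round(pearson, 2)

if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(iris[, 1:4],
                      method = "pearson",     # correlation shown above the diagonal
                      hist.col = "lightblue", # colour of the diagonal histograms
                      cex = 0.8,              # point size, because cex.cor is also given
                      cex.cor = 1.2)          # size of the correlation text
}
```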
DAAG: an R package ("Data Analysis and Graphics Data and Functions") that provides data sets
and functions.
data(): loads the specified data sets, or lists the available data sets.
$: this operator is used to access a specific column of a data frame by name.
complete.cases(): returns a logical vector that is TRUE for each row with no missing values.
Use it to remove rows containing missing values from a vector, matrix, or data frame.
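The three items above can be combined in one short sketch. The original does not say which DAAG data set was used, so this example swaps in airquality, a built-in data set that contains missing values and needs no extra package.

```r
data(airquality)                     # load a built-in data set that contains NA values
ozone <- airquality$Ozone            # $ extracts one column as a vector
sum(is.na(ozone))                    # number of missing ozone readings

keep <- complete.cases(airquality)   # TRUE for rows with no NA in any column
clean <- airquality[keep, ]          # keep only the complete rows

nrow(airquality)                     # rows before filtering
nrow(clean)                          # rows after filtering
anyNA(clean)                         # FALSE: no missing values remain
```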
Conclusion: Comparison of R programming and Python for data analysis

| R Programming | Python |
|---|---|
| R is a language and environment for statistical programming, which includes statistical computing and graphics. | Python is a general-purpose programming language used for data analysis and scientific computing. |
| Very efficient for in-memory data analysis, but can struggle with very large data sets. | Performs well with large datasets, especially with libraries like NumPy and Pandas. |
| Superior graphical capabilities, especially for complex visualizations. | Good visualization libraries (Matplotlib, Seaborn) but generally considered less powerful than R's. |
| Built-in data handling capabilities are excellent, particularly for complex statistical operations. R's native data.frame offers comprehensive statistical analysis tools. | It relies heavily on libraries like Pandas for data manipulation, which provides powerful tools for data handling similar to R's data frames. |
| Superior graphical capabilities with base graphics and ggplot2, allowing for advanced statistical plots. | Strong visualization tools like Matplotlib and Seaborn; however, they are considered less specialized for statistical tasks compared to R. |
| Extensive range of built-in statistical tests and models, excelling in statistical methodology. CRAN hosts packages tailored specifically for statistical analysis. | Python has libraries such as SciPy and StatsModels for statistical tests, but these are generally seen as less comprehensive than R's offerings. |