What is Data?
Data is facts and statistics collected together for reference or analysis: things known or assumed as
facts, forming the basis of reasoning or calculation.
What is Data Science?
Data science is an interdisciplinary field concerned with the processes and systems used to extract
knowledge or insights from data in various forms, either structured or unstructured. It is a continuation
of data analysis fields such as statistics, machine learning, data mining, and predictive analytics, and
is similar to Knowledge Discovery in Databases (KDD).
A data scientist can handle raw data using the latest technologies and techniques, perform the
necessary analysis, and present the acquired knowledge to associates in an informative way.
Programming Languages used for Data Science
S.No | Languages | Industrial Usage % | Where? (Industries)
1 | R - Language | 60.9 % | Productive Industry, Performance Monitoring, Enterprise Deployment, Tech Support - CRM, Maintenance, Training and Consulting, Solutions
2 | Python | 35.8 % | Yahoo Groups, Google, Zope Corp., Ultraseek, Gaming Industries etc. It is an embedding and scripting language for various testing / building / deployment / monitoring frameworks, building scripts, system monitoring and logging tools etc.
3 | SQL | 12.4 % | Data Entry and Data Mapping (manipulate data and produce reports)
4 | | 8.8 % | Java-based frameworks, statistical modeling
5 | Unix - Shell Script/AWK/SED | 8.5 % |
6 | Matlab | 8.5 % |
7 | | 6.3 % |
8 | Scala | 5.9 % |
9 | GO (Golang) - Google | 5.2 % |
Modeling and scripting of data
To create data models, data management frameworks (HDFS) for large volumes of data, statistical model preparation using MapReduce etc.
S.No | Languages | Specialization
1 | SAS | Basic procedures and data management; statistical analysis; graphics and presentation; econometrics and time series analysis; quality control; clinical trial analysis etc.; business intelligence; data management; predictive analytics; multivariate analysis
2 | SPSS | Survey authoring and deployment (IBM SPSS Data Collection); data mining (IBM SPSS Modeler); text analytics; collaboration and deployment (batch and automated scoring services)
3 | MATLAB | Multi-type data support: sensor, image, video, telemetry, binary, and other real-time formats. Mainly focuses on machine learning, neural networks, statistics.
4 | GNU Octave | Primarily intended for numerical computations. Octave helps in solving linear and nonlinear problems numerically and in performing other numerical experiments using a language that is mostly compatible with MATLAB.
5 | Ruby | Supports multiple programming paradigms, including functional, object-oriented, and imperative. It also has a dynamic type system.
6 | Scala |
7 | Julia |
8 | R - Language |
9 | Python |
10 | SQL |
11 | Java Scripting |
12 | C/C++ |
13 | |
14 | Perl |
To retrieve the desired data, the user presents a set of criteria in the form of a query. The
Database Management System (DBMS), the software that manages the database, then selects the
requested data from the database. The retrieved data may be stored in a file, printed, or viewed
on the screen.
A query language, such as Structured Query Language (SQL), is used to prepare the queries.
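The query-and-retrieve cycle described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the in-memory database, the employees table, its columns, and the helper query_by_department are hypothetical names invented for the example, not part of the original material.

```python
import sqlite3

# Hypothetical in-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Analytics", 72000.0),
        ("Ben", "Support", 41000.0),
        ("Chen", "Analytics", 65000.0),
    ],
)


def query_by_department(connection, department):
    """Present the criteria as a SQL query; the DBMS selects the matching rows."""
    cursor = connection.execute(
        "SELECT name, salary FROM employees "
        "WHERE department = ? ORDER BY salary DESC",
        (department,),
    )
    return cursor.fetchall()


# The retrieved data is viewed on the screen here; it could equally be written to a file.
rows = query_by_department(conn, "Analytics")
for name, salary in rows:
    print(name, salary)
```

The criteria are passed as a bound parameter (the ? placeholder) rather than pasted into the query string, which is the usual way to keep user-supplied values separate from the SQL text.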
Tools
Platform
Scope: Object Oriented Python - Basic Structured Query Language - Data Models and Relational SQL - Many-to-Many Relationships in SQL - Databases and Visualization
Python
Relational databases
3
4
Spark
IBM SQL/DS
Sybase SQL Server
Scope: Hive / HBase for real-time data storage and retrieval - SQL on Hadoop - MapReduce programming using R / Python / Java - HDFS storage - MapReduce design patterns
It is an advanced version of MapReduce - Resilient Distributed Datasets and DataFrames - Spark application programming (Spark Shell / PySpark Shell / Java) - Spark libraries - Spark configuration, monitoring and tuning
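The MapReduce programming model mentioned in the scope above can be illustrated without a cluster. The following is a single-process sketch of the map, shuffle, and reduce phases for a word count, written in plain Python; it does not use Hadoop or Spark, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence on an in-memory list of strings."""
    return reduce_phase(shuffle_phase(map_phase(documents)))


counts = word_count(["big data big insight", "data science"])
print(counts)
```

In a real Hadoop or Spark job the map and reduce steps would run in parallel across many machines and the shuffle would move data between them; here all three run in one process purely to show the shape of the computation.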