
Prerequisite Data Science Skills for Interns

Statistics and Mathematics


1. Descriptive Statistics
1. Mean, Median, Mode
2. Variance and Standard Deviation
2. Probability
1. Bernoulli Trials & Probability Mass Function
2. Central Limit Theorem
3. Normal Distribution
4. Bayes' Theorem (Precision, Recall, Positive Predictive Value, Negative
Predictive Value, Confusion Matrix, ROC Curve)
3. Inferential Statistics
1. Confidence Interval
2. Hypothesis Testing
3. Correlation
4. Linear Algebra
1. Vectors
2. Matrices
3. Transpose of a matrix
4. The inverse of a matrix
5. The determinant of a matrix
6. Dot product
7. Eigenvalues
8. Eigenvectors
5. Multivariate Calculus
1. Functions of several variables
2. Derivatives and gradients
3. Step function, Sigmoid function, Logit function, ReLU (Rectified Linear
Unit) function
4. Cost function
5. Plotting of functions
6. Minimum and Maximum values of a function
6. Optimization Methods
1. Cost function/Objective function
2. Likelihood function
3. Error function
4. Gradient Descent Algorithm and its variants (e.g. Stochastic Gradient
Descent Algorithm)
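
As an illustration of how the cost function, gradient, and gradient descent topics above fit together, here is a minimal sketch in plain Python. It fits a line y = w*x + b by batch gradient descent on a mean squared error cost. The function names, toy data, learning rate, and step count are assumptions chosen for the example, not part of the syllabus.

```python
def mse_cost(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b over the data."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by batch gradient descent on the MSE cost."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the MSE cost with respect to w and b.
        grad_w = 2.0 / n * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = 2.0 / n * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Step against the gradient to reduce the cost.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print("w =", round(w, 3), "b =", round(b, 3), "cost =", mse_cost(w, b, xs, ys))
```

Stochastic gradient descent follows the same update rule but estimates the gradient from one sample (or a small batch) per step instead of the full data set.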

Programming Skills
• Basic Python Programming with functional programming practices
• NumPy for Numeric and Linear Algebra Operations
• Pandas for Data Handling and Statistical Operations
• Matplotlib and Seaborn for Data Visualization
• Scikit-learn for Machine Learning Algorithms
• Excel knowledge is an added advantage
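
To give a rough idea of what "functional programming practices" in basic Python can look like, the sketch below computes a mean and a population variance with `map`, `filter`, and `functools.reduce`. The helper names and the sample list are hypothetical and only serve the illustration.

```python
from functools import reduce


def mean(values):
    """Arithmetic mean, written as a fold over the list."""
    return reduce(lambda acc, v: acc + v, values, 0.0) / len(values)


def variance(values):
    """Population variance: the mean of squared deviations from the mean."""
    m = mean(values)
    return mean(list(map(lambda v: (v - m) ** 2, values)))


# Illustrative sample data.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# filter keeps only the elements for which the predicate is true.
even_scores = list(filter(lambda v: v % 2 == 0, scores))

print(mean(scores), variance(scores), even_scores)
```

The functions are pure: they take their input as arguments and return new values without modifying `scores`.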

Machine Learning Algorithms

1. Supervised Learning (Continuous Variable Prediction)


• Basic regression
• Multiple regression analysis
• Regularized regression
2. Supervised Learning (Discrete Variable Prediction)
• Logistic Regression Classifier
• Support Vector Machine Classifier
• K-nearest neighbor (KNN) Classifier
• Decision Tree Classifier
• Random Forest Classifier
3. Unsupervised Learning
• K-means clustering algorithm
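
As one concrete instance of the classifiers listed above, here is a minimal sketch of a K-nearest neighbor (KNN) classifier written with the standard library only, so the mechanics stay visible. In a project, the Scikit-learn implementation named under Programming Skills would normally be used instead. The function name `knn_predict` and the toy points are assumptions for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Two small, well-separated groups (illustrative values only).
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.2, 1.4)))
print(knn_predict(train_points, train_labels, (6.4, 6.2)))
```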

General Project Timeline (90 Days)

Sr. No. Task Duration (Days)
1. Business Understanding 3
• Understand the business process
• Define and frame the business problem
• Define the business objective
• Formulate the milestones
2. Data Collection and Understanding 5
• Collect the data from different sources
• Understand the important features
• Identify independent and dependent variables
3. Exploratory Data Analysis 15
• Variable Identification
• Univariate Data Analysis
• Bivariate Data Analysis
• Multivariate Data Analysis
4. Data Preprocessing and Wrangling 10
• Missing values identification
• Scaling using Normalization, Standardization
• Outlier Detection using Boxplot, IQR, Z-Score
• Treatment of special values and obvious inconsistencies
• Feature imputation using Hot-Deck, Cold-Deck,
Mean-substitution, and Linear regression methods
5. Feature Engineering and Baseline Model Training 8
• Discretization - Continuous Features and
Categorical Features
• Reframe Numerical Quantities - Scale all
variables to one unit
• Crossing - Generate new features from existing
data
• Train the baseline model and check the
performance with feature engineering and
without feature engineering
6. Feature Selection and Baseline Model Training 15
• Correlation
• Dimensionality Reduction - PCA
• Feature Importance Methods
o Filter based
▪ Correlation
▪ Chi-Square Test
▪ ANOVA
o Wrapper based
▪ Forward Selection
▪ Backward Selection
▪ Recursive Feature Elimination
o Embedded methods
▪ L1 Regularization
▪ L2 Regularization
• Model Training Comparison across Feature
selection variants
7. Data Sampling and Model Selection 15
• Data Sampling technique
o Random Sampling - train-test split
o Stratified Sampling - Stratified k-fold
o Cross Validation
• Model Selection
o Linear Models
o Non-Linear Models
o Tree based Models
8. Hyperparameter Tuning of the Model 12
• Grid Search with cross validation
• Random Search with cross validation
• Bayesian Search with cross validation
9. Model Integration 7
• Using Flask API
• Using Streamlit
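
The scaling and outlier-detection items in step 4 of the timeline above can be sketched as follows. This is a minimal standard-library version, assuming a single numeric column held in a list. The function names, the 1.5 x IQR fence, and the z-score threshold of 3 are common conventions used here as assumptions, not requirements from the plan. In practice the same steps are usually done with Pandas and Scikit-learn.

```python
import statistics


def min_max_normalize(values):
    """Normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: subtract the mean, divide by the population std deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def iqr_outliers(values, factor=1.5):
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, standardize(values)) if abs(z) > threshold]


# Illustrative column with one obviously extreme value.
ages = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]

print(iqr_outliers(ages))
print(zscore_outliers(ages))
print(min_max_normalize(ages)[:3])
```

Note that an extreme value inflates the standard deviation, so on small samples the z-score rule can miss outliers that the IQR rule catches.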
