
Data Science Interview Questions and Answers

Table of Contents
STATISTICS ........................................................................................................................................................... 6
Q1. WHAT IS THE CENTRAL LIMIT THEOREM AND WHY IS IT IMPORTANT? ........................................................................ 6
Q2. WHAT IS SAMPLING? HOW MANY SAMPLING METHODS DO YOU KNOW? ................................................................... 7
Q3. WHAT IS THE DIFFERENCE BETWEEN TYPE I AND TYPE II ERRORS? ................................................................................ 9
Q4. WHAT IS LINEAR REGRESSION? WHAT DO THE TERMS P-VALUE, COEFFICIENT, AND R-SQUARED VALUE MEAN? WHAT IS THE
SIGNIFICANCE OF EACH OF THESE COMPONENTS? ................................................................................................................. 9
Q5. WHAT ARE THE ASSUMPTIONS REQUIRED FOR LINEAR REGRESSION? ........................................................................ 10
Q6. WHAT IS A STATISTICAL INTERACTION? .............................................................................................................. 10
Q7. WHAT IS SELECTION BIAS? .............................................................................................................................. 11
Q8. WHAT IS AN EXAMPLE OF A DATA SET WITH A NON-GAUSSIAN DISTRIBUTION? .......................................................... 11
DATA SCIENCE .................................................................................................................................................... 12
Q1. WHAT IS DATA SCIENCE? LIST THE DIFFERENCES BETWEEN SUPERVISED AND UNSUPERVISED LEARNING. ......................... 12
Q2. WHAT IS SELECTION BIAS? ............................................................................................................................. 12
Q3. WHAT IS BIAS-VARIANCE TRADE-OFF? ............................................................................................................... 12
Q4. WHAT IS A CONFUSION MATRIX? ..................................................................................................................... 13
Q5. WHAT IS THE DIFFERENCE BETWEEN “LONG” AND “WIDE” FORMAT DATA?............................................................... 14
Q6. WHAT DO YOU UNDERSTAND BY THE TERM NORMAL DISTRIBUTION? ...................................................................... 15
Q7. WHAT IS CORRELATION AND COVARIANCE IN STATISTICS?...................................................................................... 15
Q8. WHAT IS THE DIFFERENCE BETWEEN POINT ESTIMATES AND CONFIDENCE INTERVALS? ................................................ 16
Q9. WHAT IS THE GOAL OF A/B TESTING? ............................................................................................................... 16
Q10. WHAT IS P-VALUE? ....................................................................................................................................... 16
Q11. IN ANY 15-MINUTE INTERVAL, THERE IS A 20% PROBABILITY THAT YOU WILL SEE AT LEAST ONE SHOOTING STAR. WHAT IS THE
PROBABILITY THAT YOU SEE AT LEAST ONE SHOOTING STAR IN THE PERIOD OF AN HOUR? ........................................................... 16
Q12. HOW CAN YOU GENERATE A RANDOM NUMBER BETWEEN 1 – 7 WITH ONLY A DIE? .................................................... 17
Q13. A CERTAIN COUPLE TELLS YOU THAT THEY HAVE TWO CHILDREN, AT LEAST ONE OF WHICH IS A GIRL. WHAT IS THE
PROBABILITY THAT THEY HAVE TWO GIRLS? ....................................................................................................................... 17
Q14. A JAR HAS 1000 COINS, OF WHICH 999 ARE FAIR AND 1 IS DOUBLE HEADED. PICK A COIN AT RANDOM AND TOSS IT 10
TIMES. GIVEN THAT YOU SEE 10 HEADS, WHAT IS THE PROBABILITY THAT THE NEXT TOSS OF THAT COIN IS ALSO A HEAD? ................. 17
Q15. WHAT DO YOU UNDERSTAND BY STATISTICAL POWER OF SENSITIVITY AND HOW DO YOU CALCULATE IT? ......................... 18
Q16. WHY IS RE-SAMPLING DONE? ......................................................................................................................... 18
Q17. WHAT ARE THE DIFFERENCES BETWEEN OVER-FITTING AND UNDER-FITTING? ............................................................ 19
Q18. HOW TO COMBAT OVERFITTING AND UNDERFITTING? ......................................................................................... 19
Q19. WHAT IS REGULARIZATION? WHY IS IT USEFUL? .................................................................................................. 20
Q20. WHAT IS THE LAW OF LARGE NUMBERS? .......................................................................................................... 20
Q21. WHAT ARE CONFOUNDING VARIABLES? ........................................................................................................... 20
Q22. WHAT ARE THE TYPES OF BIASES THAT CAN OCCUR DURING SAMPLING? ............................................................... 20
Q23. WHAT IS SURVIVORSHIP BIAS? ........................................................................................................................ 20
Q24. WHAT IS SELECTION BIAS? WHAT IS UNDER COVERAGE BIAS? ............................................................................... 21
Q25. EXPLAIN HOW A ROC CURVE WORKS. .............................................................................................................. 21
Q26. WHAT IS TF/IDF VECTORIZATION? .................................................................................................................. 22
Q27. WHY DO WE GENERALLY USE A SOFT-MAX (OR SIGMOID) NON-LINEARITY FUNCTION AS THE LAST OPERATION IN A NETWORK? WHY
RELU IN AN INNER LAYER?............................................................................................................................................ 22
DATA ANALYSIS.................................................................................................................................................. 23
Q1. PYTHON OR R – WHICH ONE WOULD YOU PREFER FOR TEXT ANALYTICS? ................................................................. 23
Q2. HOW DOES DATA CLEANING PLAY A VITAL ROLE IN THE ANALYSIS? ........................................................................... 23
Q3. DIFFERENTIATE BETWEEN UNIVARIATE, BIVARIATE AND MULTIVARIATE ANALYSIS........................................................ 23
Q4. EXPLAIN STAR SCHEMA. ................................................................................................................................. 23
Q5. WHAT IS CLUSTER SAMPLING? ........................................................................................................................ 23

Steve Nouri Ravit Jain


Q6. WHAT IS SYSTEMATIC SAMPLING? ................................................................................................................... 24
Q7. WHAT ARE EIGENVECTORS AND EIGENVALUES? .................................................................................................. 24
Q8. CAN YOU CITE SOME EXAMPLES WHERE A FALSE POSITIVE IS MORE IMPORTANT THAN A FALSE NEGATIVE?.......................... 24
Q9. CAN YOU CITE SOME EXAMPLES WHERE A FALSE NEGATIVE IS MORE IMPORTANT THAN A FALSE POSITIVE? AND VICE VERSA? .... 24
Q10. CAN YOU CITE SOME EXAMPLES WHERE BOTH FALSE POSITIVE AND FALSE NEGATIVES ARE EQUALLY IMPORTANT? ............. 25
Q11. CAN YOU EXPLAIN THE DIFFERENCE BETWEEN A VALIDATION SET AND A TEST SET? .................................................... 25
Q12. EXPLAIN CROSS-VALIDATION. .......................................................................................................................... 25
MACHINE LEARNING .......................................................................................................................................... 27
Q1. WHAT IS MACHINE LEARNING? ....................................................................................................................... 27
Q2. WHAT IS SUPERVISED LEARNING? .................................................................................................................... 27
Q3. WHAT IS UNSUPERVISED LEARNING? ................................................................................................................ 27
Q4. WHAT ARE THE VARIOUS ALGORITHMS? ............................................................................................................ 27
Q5. WHAT IS ‘NAIVE’ IN NAIVE BAYES?.................................................................................................................. 28
Q6. WHAT IS PCA? WHEN DO YOU USE IT?............................................................................................................. 29
Q7. EXPLAIN SVM ALGORITHM IN DETAIL. ............................................................................................................... 30
Q8. WHAT ARE THE SUPPORT VECTORS IN SVM? ...................................................................................................... 31
Q9. WHAT ARE THE DIFFERENT KERNELS IN SVM? .................................................................................................... 32
Q10. WHAT ARE THE MOST KNOWN ENSEMBLE ALGORITHMS? ...................................................................................... 32
Q11. EXPLAIN DECISION TREE ALGORITHM IN DETAIL. .................................................................................................. 32
Q12. WHAT ARE ENTROPY AND INFORMATION GAIN IN THE DECISION TREE ALGORITHM? .................................................... 33
Gini Impurity and Information Gain - CART ....................................................................................................... 34
Entropy and Information Gain – ID3.................................................................................................................. 37
Q13. WHAT IS PRUNING IN A DECISION TREE? ............................................................................................................ 41
Q14. WHAT IS LOGISTIC REGRESSION? STATE AN EXAMPLE WHEN YOU HAVE USED LOGISTIC REGRESSION RECENTLY. ................ 41
Q15. WHAT IS LINEAR REGRESSION?........................................................................................................................ 42
Q16. WHAT ARE THE DRAWBACKS OF THE LINEAR MODEL? ......................................................................................... 43
Q17. WHAT IS THE DIFFERENCE BETWEEN REGRESSION AND CLASSIFICATION ML TECHNIQUES? ........................................... 43
Q18. WHAT ARE RECOMMENDER SYSTEMS? ............................................................................................................. 43
Q19. WHAT IS COLLABORATIVE FILTERING? AND CONTENT-BASED FILTERING? ................................................................... 44
Q20. HOW CAN OUTLIER VALUES BE TREATED? ........................................................................................................... 44
Q21. WHAT ARE THE VARIOUS STEPS INVOLVED IN AN ANALYTICS PROJECT? ..................................................................... 45
Q22. DURING ANALYSIS, HOW DO YOU TREAT MISSING VALUES? .................................................................................... 45
Q23. HOW WILL YOU DEFINE THE NUMBER OF CLUSTERS IN A CLUSTERING ALGORITHM? ..................................................... 45
Q24. WHAT IS ENSEMBLE LEARNING? ...................................................................................................................... 48
Q25. DESCRIBE IN BRIEF ANY TYPE OF ENSEMBLE LEARNING. ......................................................................................... 49
Bagging ............................................................................................................................................................. 49
Boosting............................................................................................................................................................. 49
Q26. WHAT IS A RANDOM FOREST? HOW DOES IT WORK? ........................................................................................... 50
Q27. HOW DO YOU WORK TOWARDS A RANDOM FOREST? ......................................................................................... 51
Q28. WHAT CROSS-VALIDATION TECHNIQUE WOULD YOU USE ON A TIME SERIES DATA SET? ................................................ 52
Q29. WHAT IS A BOX-COX TRANSFORMATION? ......................................................................................................... 53
Q30. HOW REGULARLY MUST AN ALGORITHM BE UPDATED? ....................................................................................... 53
Q31. IF YOU HAVE 4GB OF RAM IN YOUR MACHINE AND YOU WANT TO TRAIN YOUR MODEL ON A 10GB DATA SET, HOW
WOULD YOU GO ABOUT THIS PROBLEM? HAVE YOU EVER FACED THIS KIND OF PROBLEM IN YOUR MACHINE LEARNING/DATA SCIENCE
EXPERIENCE SO FAR? .................................................................................................................................................... 53

DEEP LEARNING ................................................................................................................................................. 55


Q1. WHAT DO YOU MEAN BY DEEP LEARNING? ........................................................................................................ 55
Q2. WHAT IS THE DIFFERENCE BETWEEN MACHINE LEARNING AND DEEP LEARNING? ........................................................ 55
Q3. WHAT, IN YOUR OPINION, IS THE REASON FOR THE POPULARITY OF DEEP LEARNING IN RECENT TIMES? .......................... 56
Q4. WHAT IS REINFORCEMENT LEARNING? .............................................................................................................. 56
Q5. WHAT ARE ARTIFICIAL NEURAL NETWORKS? ...................................................................................................... 57

Q6. DESCRIBE THE STRUCTURE OF ARTIFICIAL NEURAL NETWORKS. ............................................................................. 57
Q7. HOW ARE WEIGHTS INITIALIZED IN A NETWORK? ............................................................................................... 57
Q8. WHAT IS THE COST FUNCTION? ....................................................................................................................... 58
Q9. WHAT ARE HYPERPARAMETERS? ..................................................................................................................... 58
Q10. WHAT WILL HAPPEN IF THE LEARNING RATE IS SET INACCURATELY (TOO LOW OR TOO HIGH)? ................................... 58
Q11. WHAT IS THE DIFFERENCE BETWEEN EPOCH, BATCH, AND ITERATION IN DEEP LEARNING? ......................................... 58
Q12. WHAT ARE THE DIFFERENT LAYERS IN A CNN? .................................................................................................. 58
Convolution Operation ...................................................................................................................................... 60
Pooling Operation ............................................................................................................................................. 62
Classification ..................................................................................................................................................... 63
Training ............................................................................................................................................................. 64
Testing ............................................................................................................................................................... 65
Q13. WHAT IS POOLING IN A CNN, AND HOW DOES IT WORK? .................................................................................. 65
Q14. WHAT ARE RECURRENT NEURAL NETWORKS (RNNS)? ........................................................................................ 65
Parameter Sharing ............................................................................................................................................ 67
Deep RNNs ......................................................................................................................................................... 68
Bidirectional RNNs ............................................................................................................................................. 68
Recursive Neural Network ................................................................................................................................. 69
Encoder Decoder Sequence to Sequence RNNs ................................................................................................. 70
LSTMs ................................................................................................................................................................ 70
Q15. HOW DOES AN LSTM NETWORK WORK? ......................................................................................................... 70
Recurrent Neural Networks ............................................................................................................................... 71
The Problem of Long-Term Dependencies ......................................................................................................... 72
LSTM Networks.................................................................................................................................................. 73
The Core Idea Behind LSTMs ............................................................................................................................. 74
Q16. WHAT IS A MULTI-LAYER PERCEPTRON (MLP)? ................................................................................................. 75
Q17. EXPLAIN GRADIENT DESCENT. ......................................................................................................................... 76
Q18. WHAT ARE EXPLODING GRADIENTS? .................................................................................................................. 77
Solutions ............................................................................................................................................................ 78
Q19. WHAT ARE VANISHING GRADIENTS? .................................................................................................................. 78
Solutions ............................................................................................................................................................ 79
Q20. WHAT IS BACK PROPAGATION? EXPLAIN HOW IT WORKS. ................................................................................... 79
Q21. WHAT ARE THE VARIANTS OF BACK PROPAGATION? ............................................................................................ 79
Q22. WHAT ARE THE DIFFERENT DEEP LEARNING FRAMEWORKS? .................................................................................. 81
Q23. WHAT IS THE ROLE OF THE ACTIVATION FUNCTION? ............................................................................................ 81
Q24. NAME A FEW MACHINE LEARNING LIBRARIES FOR VARIOUS PURPOSES..................................................................... 81
Q25. WHAT IS AN AUTO-ENCODER? ........................................................................................................................ 81
Q26. WHAT IS A BOLTZMANN MACHINE? ................................................................................................................. 82
Q27. WHAT IS DROPOUT AND BATCH NORMALIZATION? ............................................................................................. 83
Q28. WHY IS TENSORFLOW THE MOST PREFERRED LIBRARY IN DEEP LEARNING? ............................................................. 83
Q29. WHAT DO YOU MEAN BY TENSOR IN TENSORFLOW? .......................................................................................... 83
Q30. WHAT IS THE COMPUTATIONAL GRAPH? ........................................................................................................... 83
Q31. HOW IS LOGISTIC REGRESSION DONE? ............................................................................................................... 83
MISCELLANEOUS ................................................................................................................................................ 84
Q1. EXPLAIN THE STEPS IN MAKING A DECISION TREE. ................................................................................................. 84
Q2. HOW DO YOU BUILD A RANDOM FOREST MODEL? ................................................................................................ 84
Q3. DIFFERENTIATE BETWEEN UNIVARIATE, BIVARIATE, AND MULTIVARIATE ANALYSIS....................................................... 85
Univariate .......................................................................................................................................................... 85
Bivariate ............................................................................................................................................................ 85
Multivariate ....................................................................................................................................................... 85
Q4. WHAT ARE THE FEATURE SELECTION METHODS USED TO SELECT THE RIGHT VARIABLES? .............................................. 86
Filter Methods ................................................................................................................................................... 86

Wrapper Methods ............................................................................................................................................. 86
Q5. IN YOUR CHOICE OF LANGUAGE, WRITE A PROGRAM THAT PRINTS THE NUMBERS RANGING FROM ONE TO 50. BUT FOR
MULTIPLES OF THREE, PRINT "FIZZ" INSTEAD OF THE NUMBER AND FOR THE MULTIPLES OF FIVE, PRINT "BUZZ." FOR NUMBERS WHICH
ARE MULTIPLES OF BOTH THREE AND FIVE, PRINT "FIZZBUZZ." .............................................................................................. 86
Q6. YOU ARE GIVEN A DATA SET CONSISTING OF VARIABLES WITH MORE THAN 30 PERCENT MISSING VALUES. HOW WILL YOU
DEAL WITH THEM?....................................................................................................................................................... 87
Q7. FOR THE GIVEN POINTS, HOW WILL YOU CALCULATE THE EUCLIDEAN DISTANCE IN PYTHON? ........................................ 87
Q8. WHAT IS DIMENSIONALITY REDUCTION AND WHAT ARE ITS BENEFITS? ..................................................................... 87
Q9. HOW WILL YOU CALCULATE EIGENVALUES AND EIGENVECTORS OF THE FOLLOWING 3X3 MATRIX? ................................. 88
Q10. HOW SHOULD YOU MAINTAIN A DEPLOYED MODEL? ............................................................................................ 88
Q11. HOW CAN TIME-SERIES DATA BE DECLARED AS STATIONARY? ................................................................................. 88
Q12. 'PEOPLE WHO BOUGHT THIS ALSO BOUGHT...' RECOMMENDATIONS SEEN ON AMAZON ARE A RESULT OF WHICH ALGORITHM? .... 89
Q13. WHAT IS A GENERATIVE ADVERSARIAL NETWORK?.............................................................................................. 89
Q14. YOU ARE GIVEN A DATASET ON CANCER DETECTION. YOU HAVE BUILT A CLASSIFICATION MODEL AND ACHIEVED AN ACCURACY
OF 96 PERCENT. WHY SHOULDN'T YOU BE HAPPY WITH YOUR MODEL PERFORMANCE? WHAT CAN YOU DO ABOUT IT? ................... 90
Q15. BELOW ARE THE EIGHT ACTUAL VALUES OF THE TARGET VARIABLE IN THE TRAIN FILE. WHAT IS THE ENTROPY OF THE TARGET
VARIABLE? [0, 0, 0, 1, 1, 1, 1, 1] .................................................................................................................................. 90
Q16. WE WANT TO PREDICT THE PROBABILITY OF DEATH FROM HEART DISEASE BASED ON THREE RISK FACTORS: AGE, GENDER, AND
BLOOD CHOLESTEROL LEVEL. WHAT IS THE MOST APPROPRIATE ALGORITHM FOR THIS CASE? CHOOSE THE CORRECT OPTION: ........... 90
Q17. AFTER STUDYING THE BEHAVIOR OF A POPULATION, YOU HAVE IDENTIFIED FOUR SPECIFIC INDIVIDUAL TYPES THAT ARE
VALUABLE TO YOUR STUDY. YOU WOULD LIKE TO FIND ALL USERS WHO ARE MOST SIMILAR TO EACH INDIVIDUAL TYPE. WHICH
ALGORITHM IS MOST APPROPRIATE FOR THIS STUDY?.......................................................................................................... 90
Q18. YOU HAVE RUN THE ASSOCIATION RULES ALGORITHM ON YOUR DATASET, AND THE TWO RULES {BANANA, APPLE} => {GRAPE}
AND {APPLE, ORANGE} => {GRAPE} HAVE BEEN FOUND TO BE RELEVANT. WHAT ELSE MUST BE TRUE? CHOOSE THE RIGHT ANSWER: .. 90
Q19. YOUR ORGANIZATION HAS A WEBSITE WHERE VISITORS RANDOMLY RECEIVE ONE OF TWO COUPONS. IT IS ALSO POSSIBLE THAT
VISITORS TO THE WEBSITE WILL NOT RECEIVE A COUPON. YOU HAVE BEEN ASKED TO DETERMINE IF OFFERING A COUPON TO WEBSITE
VISITORS HAS ANY IMPACT ON THEIR PURCHASE DECISIONS. WHICH ANALYSIS METHOD SHOULD YOU USE?.................................... 91
Q20. WHAT ARE THE FEATURE VECTORS? .................................................................................................................. 91
Q21. WHAT IS ROOT CAUSE ANALYSIS? ..................................................................................................................... 91
Q22. DO GRADIENT DESCENT METHODS ALWAYS CONVERGE TO SIMILAR POINTS? ............................................................. 91
Q23. WHAT ARE THE MOST POPULAR CLOUD SERVICES USED IN DATA SCIENCE? .............................................................. 91
Q24. WHAT IS A CANARY DEPLOYMENT? .................................................................................................................. 92
Q25. WHAT IS A BLUE GREEN DEPLOYMENT? ............................................................................................................ 93

