
Big Data Analytics and Optimization

Certificate Program in Engineering Excellence
Certificate in Accelerated Engineering
M.Tech. (GITAM University) Applied Computer Science and Technology

ESSENTIALS OF APPLIED PREDICTIVE ANALYTICS
STATISTICAL MODELING FOR PREDICTIVE ANALYTICS IN ENGINEERING AND BUSINESS
EFFECTIVE DECISION MAKING: OPTIMIZATION, SIMULATION AND STATISTICAL METHODS
ENGINEERING BIG DATA WITH R AND HADOOP ECOSYSTEM
TEXT MINING AND SOCIAL MEDIA ANALYTICS
METHODS AND ALGORITHMS IN MACHINE LEARNING
ADVANCED TOPICS IN MACHINE LEARNING
ARCHITECTING DATA ANALYTICS SOLUTIONS IN THE REAL WORLD


CSE 7301c Essentials of Applied Predictive Analytics

This five-day module teaches the complete data analytics lifecycle in an applied, hands-on manner. It describes a data-rich business environment and works through a few semi-real-world problems that can be solved in five days. It starts with exploring data, using data visualization as an analytics technique, and data pre-processing. It then moves on to designing and implementing predictive models for a variety of business applications, and covers important aspects of analyzing the quality of a model. Finally, the latest trends in reporting results are discussed. While one or two business cases are used as anchoring themes during the program, general applicability is emphasized throughout. At the end of the program, participants are able to answer business questions such as: which existing customers are likely to buy a new product; which customers are most likely to default on a loan or an insurance payment; and, if a customer buys Product A, which other products can be recommended to him or her.

This course thoroughly trains candidates on the following techniques:
A framework for solving analytics problems
Pre-processing techniques: Graphical visualization; Handling missing values; Data standardization
Introduction to two important data mining techniques: Decision Trees and Association Rules
A thorough introduction to solving analytics problems using R
Model selection using K-fold validation

Day 1
Introduction: Big picture of Data Sciences
Understanding the business case and defining a solution framework
Getting the data into the R environment: Reading data as a Data frame, Matrix, Vector and List
Visualization: Various plots and their purpose (Scatter, Bar, Pie, Box, Histograms, and Surface and Contour graphs)
Pre-processing the data: Binning; Normalizing; Imputation; Removing noise and outliers
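The product-recommendation question above is commonly framed as an association rule of the form "A implies B", scored by its support and confidence. The course works in R; the following is only a minimal Python sketch of those two measures. The function names and the tiny basket data are invented for illustration and do not come from the course material.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical baskets, invented purely for illustration.
baskets = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
]

print(support(baskets, {"A", "B"}))       # share of baskets containing both A and B
print(confidence(baskets, {"A"}, {"B"}))  # how often B appears when A is bought
```

A rule is usually considered for recommendation only when both values clear chosen thresholds; those thresholds are a modeling decision, not something this sketch fixes.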

Day 2
Data pre-processing (continued)
Traps and errors: Confusion Matrix; Analyzing False Positives and False Negatives from a problem perspective; Different error measures used in forecasting
Model selection: K-fold validation
Introduction to Decision Trees and their structure
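As a rough illustration of two Day 2 topics, the sketch below counts the cells of a binary confusion matrix and generates K-fold train/validation index splits. It is written in plain Python rather than R, and the helper names `confusion_counts` and `k_fold_splits` are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


def k_fold_splits(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Rows are assigned to folds round-robin; real data would normally be
    shuffled first.
    """
    for fold in range(k):
        held_out = list(range(fold, n_rows, k))
        held_out_set = set(held_out)
        train = [i for i in range(n_rows) if i not in held_out_set]
        yield train, held_out


actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
print(confusion_counts(actual, predicted))
for train, validation in k_fold_splits(6, 3):
    print(train, validation)
```

Whether a false positive or a false negative is the costlier error depends on the business problem, which is why the counts are kept separate instead of being collapsed into a single accuracy figure.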


Day 3
Construction of Decision Trees through simplified examples; Choosing the "best" attribute at each non-leaf node; Entropy; Information Gain
Generalizing Decision Trees; Information Content and Gain Ratio; Dealing with numerical variables; Other measures of randomness
Issues in Inductive learning: Curse of Dimensionality, Overfitting, Bias-Variance tradeoff
Pruning a Decision Tree; Cost as a consideration; Unwrapping Trees as rules

Day 4
A mathematical model for association analysis; Large itemsets; Association Rules
Apriori: Constructs large itemsets with minsup by iterations
Interestingness of discovered Association Rules; Examples; Association Analysis vs. Classification
Using Association Rules to compare stores; Dissociation Rules; Sequential Analysis Using Association Rules

Day 5
The first four days cover enough techniques and processes for handling a complex analytics problem. On the last day, all of it is brought together into a coherent story.
Data visualization and Story-telling: Anatomy of a graph
Animated graphs, BI dashboards and the latest trends in data visualization
Industry exposure: A webinar by an industry expert about how they are using analytics in the real world
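Choosing the "best" attribute at a decision tree node is usually done by comparing information gain, which is the drop in entropy after splitting on that attribute. The sketch below shows the calculation in plain Python; the toy rows and attribute names are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of `labels` minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical training rows, invented for illustration.
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
]
labels = ["play", "play", "stay", "stay"]

for attribute in ("outlook", "windy"):
    print(attribute, information_gain(rows, labels, attribute))
```

In this toy data "outlook" separates the classes perfectly while "windy" tells us nothing, so a tree builder based on information gain would split on "outlook" first.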


CSE 7302c Statistical Modeling for Predictive Analytics in Engineering and Business
This six-day module is aimed at teaching how to think like a statistician. "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write," wrote H. G. Wells in the year 1895. That day and age has arrived with Data Analytics going mainstream ("For Today's Graduate, Just One Word: Statistics"). This course teaches this very important and essential skill. Broadly, the following aspects are covered:
Studying the data systematically and gaining intuition about variables and their inter-relationships
Applied statistical methods to extract hidden relations and patterns from the data

By the end of the course, participants will be able to answer questions such as: what will the price of a commodity be at a future point in time; and, if a sample of 100 components has a dimension of 100 nanometers, what can be said about the dimension of the population of 100,000 components. Data sets from the Retail, Finance, Manufacturing and Healthcare industries are used to explain the concepts. This course thoroughly trains candidates on the following techniques:
Probability distribution analysis, Correlations and Chi-Square testing
Linear regression, Multilinear regression and Logistic regression
Clustering
Time series analysis
Non-parametric statistics

From a tools perspective, you will gain confidence with tools like R and Excel for creating meaningful and information-rich dashboards.

Day 1
Computing the properties of an attribute: Central tendencies and dispersion (Mean, Median, Mode, Range, Variance, Standard Deviation); Expectations of a Variable; Moment Generating Functions
Describing an attribute: Probability distributions (Discrete and Continuous); Bernoulli, Binomial, Multinomial and Poisson distributions
Describing the relationship between attributes: Covariance; Correlation; Chi-Square

Day 2
Describing a single variable (continued): Weibull, Geometric, Negative Binomial, Gamma and Exponential distributions; Special emphasis on Normal distribution; Central Limit Theorem
Inferential statistics: How to learn about the population from a sample and vice versa; Sampling distributions; Confidence Intervals; Hypothesis Testing
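Learning about a population from a sample typically ends in an interval estimate. The sketch below computes a confidence interval for a mean using the normal approximation justified by the Central Limit Theorem. It is plain Python rather than R, the measurements are invented, and for small samples a t-based interval would be the more appropriate choice.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    half_width = z * stdev(sample) / math.sqrt(n)
    centre = mean(sample)
    return centre - half_width, centre + half_width


# Hypothetical measurements (nanometers) for 100 sampled components.
sample = [99.0, 101.0] * 50
low, high = mean_confidence_interval(sample)
print(low, high)
```

The interval narrows in proportion to the square root of the sample size, so quadrupling the number of sampled components roughly halves its width.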


Days 3 to 6
Multivariate normal distributions
Types of clusters; Different clustering methods; K-Means; K-Medoids
Iterative distance-based clustering; Dealing with discrete values in K-Means
Constructing a hierarchical clustering using K-Means
Regression (Linear, Multivariate Regression) in forecasting
Analyzing and interpreting regression results
Logistic Regression
Trend analysis and Time Series
Cyclical and Seasonal analysis; Box-Jenkins method
Smoothing; Moving averages; Auto-correlation; ARIMA
Holt-Winters method; GARCH
VaR; Applications of Time Series in financial markets
Non-parametric statistics
ANOVA
Survival analysis in equipment operations
Industry exposure: A webinar by an industry expert about how they use statistical data analysis in the real world
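Two of the smoothing topics listed for time series, moving averages and simple exponential smoothing, fit in a few lines. This is a plain Python sketch with hypothetical function names; methods such as Holt-Winters add trend and seasonal components on top of the exponential idea and are not shown here.

```python
def moving_average(series, window):
    """Trailing moving average; the first window-1 points produce no output."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing, seeded with the first observation."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


print(moving_average([1, 2, 3, 4, 5], window=3))
print(exponential_smoothing([10, 20, 30], alpha=0.5))
```

A larger `alpha` makes the smoothed series react faster to recent values, while a smaller one damps noise at the cost of lagging behind genuine shifts.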


CSE 7303c Effective Decision Making: Optimization, Simulation and Statistical Methods
This module is designed to enhance your decision-making capabilities when confronted with strategic choices. You learn techniques for turning real-world problems into mathematical models. It teaches three classes of models: Optimization, Simulation and Statistical. The application areas originate from problems in finance, marketing and operations. At the end of the program, you will be able to answer questions such as: should I outsource a service or do it in-house; how to optimize a supply chain; and how to price a product when faced with demand uncertainty. This course thoroughly trains students in the following techniques:
Multi-criteria decision analysis
Linear, Integer, Binary and Quadratic programming
Data envelopment analysis; Goal and multi-objective modeling
Genetic Algorithms
Simulations in decision analysis: Monte Carlo and Markov Chain methods
Game theory and strategy

From a tools perspective, this course trains you to build your own R code, and you are provided with R code for a host of the problems mentioned above. The course is anchored on a large financial and mutual fund company, and techniques for solving a variety of problems it faces are provided.

Day 1
Introduction to the business problem
Multi-criteria decision making for the CEO: Scientific decision making; Value of information
Analytic hierarchy process
Strategy and game theory in analytics and decision analysis

Day 2
A system for advising clients on the right investment (a COO's problem): Linear programming: Applications, Graphical analysis, Sensitivity and Duality analyses
Worked-out examples in helping customers identify the right portfolio, planning cash transport and employee assignment
Comparing the performance of various offices: A CMO's problem and data envelopment analysis
Setting up a new office in a different city: Goal programming and Multi-objective programming


Day 3
Goal programming and Multi-objective programming (continued)
Minimizing the risk (a CRO's problem): Graphical representation of Maxima, Minima, Points of inflection and Saddle points in single and multivariable functions
Derivative, Gradient and Hessian; Optimization with constraints; Lagrange multipliers
Quadratic programming formulation and applications in portfolio analytics

Day 4
Minimizing travel costs: Solving non-convex problems
Monte Carlo essentials and making quick estimates
Markov Chains and generating samples from complex scenarios
Metropolis-Hastings algorithms; Simulated Annealing; Minimizing travel distance of the mutual funds
Genetic Algorithms: The algorithm and the process

Day 5
Representing data for a Genetic Algorithm
Why and how do Genetic Algorithms work?
Industry exposure: A webinar by an industry expert about how they are using analytics in the real world
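A quick Monte Carlo estimate of the kind mentioned for this module can be sketched as follows: simulate uncertain demand many times and average the resulting profit. This is plain Python, not the R code supplied with the course, and every number here (the uniform demand range of 50 to 150 units, the price and the unit cost) is an assumption made up for illustration.

```python
import random


def simulate_expected_profit(stock, price, unit_cost, n_trials=100_000, seed=0):
    """Average profit over simulated demand scenarios.

    Demand is assumed uniform between 50 and 150 units (illustrative only).
    Unsold stock is assumed to have no salvage value.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total = 0.0
    for _ in range(n_trials):
        demand = rng.uniform(50, 150)
        units_sold = min(demand, stock)
        total += price * units_sold - unit_cost * stock
    return total / n_trials


print(simulate_expected_profit(stock=100, price=10, unit_cost=6))
```

Under these assumptions the exact expected profit can be worked out by hand as 10 × 87.5 − 6 × 100 = 275, so the simulated figure should land close to that. The value of the simulation approach is that it still works when the demand model is too messy for a closed-form answer.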


CSE 7304c Engineering Big Data with R and Hadoop Ecosystem

Companies collect and store large amounts of data during daily transactions. This data is both structured and unstructured. The volume of data being collected has grown from megabytes to terabytes in the past few years and continues to grow at an exponential pace. Its very large size, lack of structure and pace of growth characterize Big Data. To analyze long-term trends and patterns in the data and provide actionable intelligence to managers, this data needs to be consolidated and processed using specialized techniques; those techniques form the core of the module. The use cases for the program are "analyzing a customer in near real-time" as applied in the Retail, Banking, Airlines, Telecom or Gaming industries. At the end of the program, participants will be able to set up a Hadoop cluster and write a MapReduce program that uses pre-built libraries to solve typical CRM data mining tasks such as building recommendation engines. This course thoroughly trains candidates on the following techniques:
HQL querying and Pig Latin scripting (with a focus on statistical analysis)
Hadoop and MapReduce methods of programming
Columnar (NoSQL) databases

From a tools perspective, this course introduces you to Hadoop. You will learn one of the most powerful combinations in Big Data, viz., R and Hadoop. In addition, all the essential content required to build powerful Big Data processing applications and to acquire Hadoop certifications is covered in the course. The emphasis is neither on abstract theory nor on mindless coding; concepts and real-world programming techniques are emphasized.

Day 1
Big Data: An Introduction
Parallel and Distributed Computing
Hadoop: An overview
Installing and starting to play with Hadoop

Day 2

On this day, the course gives an exciting motivation for learning Big Data. Common and special algorithms are taught in the context of a specific business problem, and the Hadoop Ecosystem is introduced.
Linux and Java refresher
Algorithms for real-world problems well-suited to Hadoop - Standard algorithms: Sorting, Searching, Indexing, Concurrent Algorithms
Hadoop usage in the real world
HDFS Architecture
Hadoop Ecosystem I: HBase, Hive, Pig, Chukwa, Avro, Flume and Zookeeper
Demo: Data analysis using Hive and Pig


Day 3
During the main part of the course, you will learn the fundamental concepts of MapReduce, explained in detail.
Introduction to MapReduce
Programming methodologies and paradigms in MapReduce
Understanding the concepts of Graph Algorithms and PageRank
Beyond basics: The flow; APIs; Driver; Mapper; Reducer
Demo: Compiling and running basic Java MapReduce code; Hadoop configuration parameters and logs
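The flow from mapper through shuffle to reducer can be imitated in a single Python process, which may help before reading real Hadoop code. The sketch below is a word count under that simplification. It does not use the Hadoop API, and the function names are hypothetical stand-ins for the roles that the Mapper, the framework's shuffle phase and the Reducer play in an actual job.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Driver: wire mapper, shuffle and reducer together."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


print(run_job(["big data", "Big Hadoop data"]))
```

On a cluster the mapper and reducer calls run on different machines and the shuffle moves data over the network, but the contract between the three stages is the same as in this toy version.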

Day 4
On this day, you will learn the practical aspects of working with MapReduce.
Choice of Map-side and Reduce-side Joins; Secondary Sort
PageRank in MapReduce
Practical aspects of MapReduce implementation; Streaming
Demo: Hadoop streaming; A more realistic MapReduce code walk-through and execution
Hadoop Ecosystem II: Sqoop, Mahout, Whirr, Hama and Oozie
Demos on Hadoop Ecosystem: Sqoop, Mahout
R-Hadoop: An overview
Demo: R-Hadoop: RHDFS

Day 5
The last day covers Hadoop certification aspects and hands-on assignments.
Overview of Hadoop certifications
Hands-on in-class assignment in which students can use MapReduce, Hive, Pig, RHadoop or Streaming to code a new problem


CSE 7206c Text Mining and Social Media Analytics

This module teaches two of the most important applications of analytics in high-tech industries.
Text mining: Unstructured data comprises more than 80% of the stored business information (primarily as text). This helped text mining emerge as a leading-edge technology. This module describes practical techniques for text mining, including pre-processing (tokenization, part-of-speech tagging), document clustering and classification, information retrieval, search and sentiment extraction in a business context.
Predictive modeling with social network data: Social network mining is extremely useful in targeted marketing, on-line advertising and fraud detection. The course teaches how incorporating social media analysis can help improve the performance of predictive models.

By the end of the course, you will be able to answer questions such as how to classify or tag a document into a category, and how to rank some people in a network as more likely customers than others. In terms of techniques, the course teaches:
Text pre-processing
Bag-of-words and Text Similarity measures
PageRank; Neighbor analysis in predictive modeling

This course uses packages like R, WEKA and R-Hadoop to demonstrate real-world examples.

Day 1
Unstructured vs. semi-structured data; Fundamentals of information retrieval
Properties of words; Vector space models; Creating Term-Document (TxD) matrices; Similarity measures
Low-level processes (Sentence Splitting; Tokenization; Part-of-Speech Tagging; Stemming; Chunking)
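In a vector space model each document becomes a vector of term counts (one column of a term-document matrix), and documents are compared with a similarity measure such as the cosine. The sketch below uses plain Python instead of the R or WEKA tooling named above, and the whitespace tokenizer is a deliberate simplification that skips stemming and the other low-level steps.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag-of-words vector for one document (naive whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = term_counts("big data")
doc2 = term_counts("big deal")
print(cosine_similarity(doc1, doc2))
```

Cosine similarity ignores document length, which is why it is often preferred over raw dot products when documents vary widely in size.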

Day 2
Text classification and feature selection: How to use the Naive Bayes classifier for text classification
Evaluating the accuracy of text mining systems
Sentiment Analysis

Day 3
Fundamentals of web search
A detailed analysis of PageRank
PageRank in social network analysis
Analyzing social networks for targeted marketing and fraud detection
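The core of PageRank is a repeated redistribution of rank along links. Below is a minimal power-iteration sketch in plain Python. The damping factor of 0.85 and the fixed iteration count are conventional illustrative choices, not values taken from the course, and the three-node graph is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a graph given as {node: [targets]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank evenly.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical graph: a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

The same computation applies when nodes are people in a social network instead of web pages: a person linked to by well-ranked neighbors receives a higher score.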


Day 4
Natural Language Analysis
Discussion of text mining tools and applications
Industry exposure: A webinar by an industry expert about how they are using analytics in the real world


CSE 7305c Methods and Algorithms in Machine Learning

This module discusses the principles and ideas underlying the current practice of data mining and introduces a powerful set of useful data analytics tools (such as K-Nearest Neighbors, Neural Networks, etc.). Real-world business problems are used for practice. In addition, for each of the techniques, both the traditional approach and the Big Data approach are taught. At the end of the course, the student will be able to answer questions such as which technique is likely to work in which situations, how to handle fraud detection and how to recognize handwriting. From a techniques perspective, the student learns:
Bayesian analysis and Naive Bayes classifier
Neural Networks
K-Nearest Neighbors
Association Rules; Dimensionality reduction using Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
Ensemble and Hybrid methods

A fictitious courier company is taken as an example and issues faced in this industry are solved.

Days 1 to 4
Business problem and solution architecture
Motivation for Neural Networks and their applications
Perceptron and Single Layer Neural Network, and hand calculations
Learning in a Neural Net: Back propagation and conjugate gradient techniques
Application of Neural Nets in Face and Digit Recognition
Self Organizing Maps (SOM)
Computational geometry; Voronoi diagrams
K-Nearest Neighbor method
Wilson editing and triangulations
K-Nearest Neighbors in collaborative filtering, digit recognition
Representing data in a matrix form; Bases and thinking of attributes as bases; Orthogonality and Orthonormality; Linear independence of axes
Transformation matrices and eigenvectors as transformation matrices
Principal Component Analysis (PCA)
Singular Value Decomposition (SVD) and applications in Association Rules and Latent Semantic Indexing (LSI)
Probability fundamentals
Bayes Theorem and its applications
Becoming instinctively Bayesian
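The K-Nearest Neighbor method listed above classifies a new point by a majority vote among the closest training points. The following plain Python sketch uses Euclidean distance. The two-dimensional points and the "near"/"far" labels are invented for illustration, and the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data, invented for illustration.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["near", "near", "near", "far", "far", "far"]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

Because every prediction scans the whole training set, practical implementations rely on spatial index structures or on editing techniques such as Wilson editing to shrink the set of stored points.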


Day 5
Naive Bayes classifier
Ensemble and Hybrid models: AdaBoost and Random Forests
Industry exposure: A webinar by an industry expert about how they are using analytics in the real world


CSE 7108c Advanced Topics in Machine Learning

This module discusses the most advanced data mining techniques such as Support Vector Machines (SVM), Bayesian Belief Nets, Expectation Maximization and Reinforcement Learning. This is suited for those interested in getting into an R&D lab of a product company or a PhD program in machine learning.

Day 1
Linear learning machines and Kernel methods in learning
VC (Vapnik-Chervonenkis) dimension; Shattering power of models
Algorithm of Support Vector Machines (SVM)

Day 2
Bayesian Belief Nets
Expectation Maximization

Day 3
Reinforcement Learning and Adaptive Control
Applications of machine learning to robotic control, data mining, autonomous navigation, bioinformatics and speech recognition
R&D exposure: A webinar by a senior scientist about the cutting-edge developments in analytics


CSE 7107c Architecting Data Analytics Solutions in the Real World

OK! The rubber meets the road! It is competition and fun time. You will actually architect an entire solution (actually 2!). This module also helps bring all the concepts learnt in other modules into perspective, helping students provide end-to-end solutions to business problems. Students are divided into groups of approximately four each. They are given a real-world problem with insufficient information. They are required to conduct interviews, obtain the information, design a solution, and come up with an implementation plan.

Days 1, 2 and 3
The students get the problem a day prior to the start of this module. Each team works through the problem, and comes up with a solution architecture and effort estimates. In addition, there are at least two presentations by industry experts from consulting, insurance, retail, services and/or financial industries.