Data Science @ Globant Playbook

Latest Revision: 2020-02-19
Current Maintainer: juanjose.lopez@globant.com

Disclaimer

This document attempts to give guidance on what is expected from a Data Scientist at Globant. By necessity, it consists of broad definitions, references, and guidelines, as this kind of endeavor cannot be reduced to a completely defined, mechanically repeatable, automatable task. It is intended to be a living document that changes as the position evolves within the industry and at Globant itself, but it should provide solid ground on which to build your career.

Contents

Role of a Data Scientist
  Scope of Work
  Attitude
Data Engineering
Seniority
  Overall description
  Junior
  Semi Senior
  Senior
  Ambassador? - Name TBD
Agility
Career path
  Development Phase
  Technical Leadership
  Higher Echelons
Selected references
  Code and Practice
  Theory
  General
  Sources

Role of a Data Scientist

The following is a description of the broad tasks and intents that a Data Scientist should perform to be considered effective at their work. While depth of results in any given dimension is commendable, failing to cover the rest means that the overall objective is not met.

Too many data scientists do not worry about "why am I looking at this data?" or "what data should I be looking at, and how?", but only about what algorithm to use. That is a sign of immaturity.

Job Description

• understand, guide and define the actionable output and required decision support to be obtained from a model
• make the working hypotheses explicit and mathematize (axiomatize) them, to either validate or reject them through modeling
• analyze data sources for completeness of data semantics and gaps
• develop the process to assess information quality and ready the data to be fed into a mathematical model
• tune / calibrate / train the model for the required accuracy - generality trade-off
• extract actionable output from the model
• analyze, interpret and present output
• build the required analytical workflow for modelling - system wide exempted
• circle back to a refined problem definition to keep improving performance and actionability related to business strategy and capabilities

Important note:
• Testing is part of the job. Not only in the "training" sense, but unit, integration, component and so on.
• Code review is about style, "what if", and peer discussion of ideas and scenarios. It is a tool for feedback and learning. Encourage it, seek it. And, by its definition as it applies to us, it does not require finished work.

Attitude

One of the anchoring traits of a Data Scientist at Globant is the attitude and approach shown during project development. Some of the attitudinal elements are present in the scope of work, but there are further considerations.

• We are not code monkeys. Hence we don't behave like them.
• Expecting a sharply defined task and then to be left alone to develop it doesn't work.
• Isolating oneself from the overall objective of the project doesn't work.
• We care. We partner. We challenge. Take part in the vision of the project and make it your own. You are a member of the team, and that implies that the outcome is not alien to you.
• We want the best for the project. If needed, we can propose, challenge and advise proactively.
• Respect is non-negotiable. We aid, we partner, but the overall ownership lies with the client, and we must get on board. Should there be any doubt or gray area, consider that we are adults and behave like one.
• A can-do attitude beats technical prowess most of the time.
• Theory of Constraints - all the way. To give the best of ourselves, always work on improving the limiting factor:
  ◦ If you have time to spare, you need to start generating more ideas and options.
  ◦ If you don't have enough time, you are either overcommitting, not scoping the user stories enough, or going deeper than the current user story entails.
  ◦ If no technique is good enough, then the problem could be ill-defined. It should be redefined.
• Rockstars will die of overdose, ninjas will get stabbed in the back. Teams will defend Sparta.

Data Engineering

We define the practice of Data Engineering as the general design, development and evolution of data products. A data product is any piece of code that makes it possible to handle, consume, understand and make decisions about data. In short, it is the practice of enabling value to be obtained from data.

Within Globant, the profile that implements Data Products is called the Data Engineer. Within that profile, there are areas of focus and specialization, like Data Architecture, Data Warehousing, Data Visualization and Data Science. They all form a continuum in which data is acquired, handled, processed, understood through consumption, and enhanced or improved through prediction.

As with many definitions contained herein, strict boundaries are not only arbitrary and hard to define, but detrimental to the overall proficiency of any professional and of the Artificial Intelligence and Big Data Studio. Each profile has a focus and specialization, but they all share a common core of knowledge, which grows larger with increased seniority, and they should work together seamlessly on the perceived frontiers of each specialization.

Seniority

Given the fluid nature of the scope of work, tightly defined boundaries for seniority levels are out of the question. There are nonetheless some basic distinctions that define the necessary, albeit not sufficient, requirements to perform at each seniority level. Knowledge beyond the base-level requirements is valuable, but the defining criterion rests on good, ample core knowledge, experience and soft skills.
Overall description

The following table summarizes the fundamental skills expected at each level.

Seniority | Technical | Soft | Practice / Studio
Jr | Math and Stats knowledge, basic machine learning | Owning the task and driving requests for help |
Ssr | Basic set of algorithms in some platform, focus on data analysis | Given the task, executing correctly with independence, short experience as DS | Participates in presales, aids in content generation
Sr | Classic ML+DL, Agile, software development | Defining the right task, proactive in seeking feedback, effective client facing, firm experience as DS | Interviews, leads presales, content generation
Ambassador? (name TBD) | Architecture of modules and data flow | Coordinate project vision and definition, sizable experience as DS | Guardian of relevance, acts as ambassador and PR agent to the world, interviews

Junior

As a junior Data Scientist, you should know, be able to explain, or respond to the following:

Statistics
- Big samples vs small samples? Pros and cons
- Explain metrics of central tendency and dispersion
  - Mean, mode, median
  - Variance / standard deviation, L1 distances, Kurtosis
  - When do you prefer a median to the mean?
  - When is standard deviation not a good indicator of a ± range?
- Explain the following:
  - Histogram
  - Box Plot
  - Dispersion Plot
  - P-value
- How do you test a hypothesis?
- What does the cumulative distribution function tell you?
- What does "multimodal distribution" mean and imply?
- Name some distribution functions and the logic from which they arise
  - Hint: Gaussian, LogNormal, Weibull, Gumbel, Uniform, Triangular, Gamma, Beta, Poisson, Binomial
- Name some different sampling procedures:
  - Hint: Random, stratified, Poisson disc, others

Math
- Matrix multiplication, identity, inversion, determinant, chain of products and equations
- Derivatives, integrals, limits, perimeter length of arbitrary figures in n-d space

Code
- Python
  - Data types and differences / use cases
    - list, dict, set, tuple, numpy.ndarray, pandas.DataFrame
  - What is aliasing? Why is it dangerous when unchecked?
  - What is the use of __name__ == "__main__"?
  - What is an environment? Virtual environment? Conda? Why use them?
- pandas / numpy / scikit
- Matplotlib / Seaborn

Development
- What does it mean that a task is done?
  - Hint: Unit + integration test, reproducible research, literate programming
- What do you do if you don't understand / get stuck on a task? How do you proceed?
- What is agile development?

Semi Senior

As a semi senior Data Scientist, you should know, be able to explain, or respond to the following:

Algorithms
- Linear regression
- Logistic Regression
- Decision Tree
- Random Forest
- k-Means
- KNN
- Shallow NN
- SVM

Statistics / Math
- Central Limit Theorem
  - Explain
  - Why is it useful?
- Cross validation
  - Explain the CV technique, test and training. Why do we do that?
- Confusion Matrix / Accuracy / Precision / Recall
- Gradient descent
  - How does it work?
  - Soft requirement on convexity? When does it not matter?
- What is a convolution?
- Hypothesis generation
- Error analysis
- Explain how the autocorrelation of a positive-trend line can be negative
- ANOVA, multivariate analysis, PCA / Factor Analysis, population tests and comparison

Machine Learning
- Algorithm customization
  - What do you do if you are trying to do regression and you have heteroskedasticity?
  - How do you avoid overfitting?
  - How do you add non-linearities to regression?
  - How do you add classes to regression?
- Variable selection
- Handling "too many features"
  - Is that a problem? When? What would you do?
- How do you choose k on k-means?

Development
- What is the difference between "checkout", "add", "commit" and "branch" in git?
- What are the 4 main parts of an SQL statement?
- Python
  - Flask, Eventlet / Gunicorn
  - REST, WSGI
  - Bokeh / Plotly / other
  - Proper Object Oriented Programming
  - Building Libraries
  - Literate programming
  - Reproducible Research
  - Error catching, context managers, generators and comprehensions, reusable methods and classes, some code optimization and profiling

Senior

As a senior Data Scientist, you should know, be able to explain, or respond to the following:

General AI knowledge
- Fields of AI
  - Machine Learning
  - Simulation
    - Discrete Event Simulation
    - Agent Based Modelling
    - System Dynamics / Numerical Partial Derivative Systems
  - Optimization
    - Mixed Integer Programming and variations
  - Computer Vision / Signal Processing
  - Natural Language Processing

Deep Learning
- Layers:
  - Fully Connected, Convolutional, Recurrent, Dropout, Pooling, Batch Normalization
- Learning schemas
  - SGD, adaptive models, batch size / learning rate tradeoff
- Objective
  - Loss functions (L-based distance, cross-entropy, others)
- Evaluation
  - Bias / Variance tradeoff
  - Epoch graph vs Dataset Size graphs
- Error analysis
  - What is it? What options do you have?
  - What are the options when an NN does not give good results? Why?
  - How do you avoid the infamous racial bias in object detection (detecting a family of African-Americans as apes)?
- What does Word2Vec do? How does it work?
  - Define "embeddings"
  - Why "negative sampling"?
- Pros and cons when modeling sequences
  - LSTM
  - 1D convolution
- Explain differences, and how the approaches differ, e.g. with "objects"
  - Detection
  - Semantic segments
  - Instance segments

Machine Learning
- Class imbalance: What is the issue?
  - How would you solve it?
- Explain pros and cons of the different possible models based on the final objective of the model and possible side-effects, without resorting to "try it out and see" on the data
  - E.g. without having the data (so "trying it out" is not an option), why would you choose, or what would you take into account to choose, among the following models?
    - Logistic
    - Tree
    - Forest
    - NN
    - k-NN
- Variable Normalization / feature scaling
- Is linear separability a requirement before using logistic regression? Or is it desirable?
  - Pros and cons against SVM

NLP
- Explain TF-IDF
- LDA
- Bag of Words

Optimization
- Solving speed on a MIP
  - If you want to speed up solving time, do you change the objective function, variables, or constraints?
  - Explain the effect of each
- What is the trade-off of heuristics in optimization models?

Development
- How would you handle an agile environment for doing data science?
- Why don't I just ask the people instead of modelling preferences or predicting behavior?
  - How does the world work? -> make hypotheses explicit to validate through data
  - What do I not know about the world? -> discover through data
  - What are the uncharted territories? -> use data as an exploration
- The art of breaking up the project into User Stories
- How to generate a plethora of hypotheses and interpretations from a set of data / visualizations / problem statement / business context

Ambassador? - Name TBD

Going above and beyond, managing larger projects and teams, dealing competently with the broader data engineer role.

Communication and presentation skills become critical. Be ready to engage: peers, clients, conferences, etc.

Care about how the client will grow, not only the project or account. That makes us different.

Do not limit yourself to your skill, but see problems and vision through the client's eyes.

We are not salespeople, but we have an insight no one else will have. We team up with them to reach people.
We should become our client's consigliere (Guibert dixit). It's not selling, it's conveying what we can do and what we like to do. It's saying "yes, and" or "no, but".

Communicate the vision.

Mentoring:
• Passive guide
• Not responsible for the mentee's progress
• Part of the usual tasks, like presales or generating material, and should be conducted likewise

Treat people as future peers. Respect, empathy and consideration are always relevant.

Agility

The term, and the methodologies typically accepted nowadays for software development, are often a hard match for a discipline like data science, which hinges on research and exploration, with results that are unforeseeable at the beginning of an iteration. That much is true. It will not, however, be used as an excuse for not being "more agile" in the way we work.

• Plans are useless but planning is indispensable. The aim is to get the structure of what we want to build and the road to get there, not the exact result and date.
• What can you do "for tomorrow EOD" that still delivers something of value? How does that help the business? Iterative and incremental!
• Quick wins sound like burning paper instead of coal. Try to turn them into "early returns" on which to build the next phase.

Think of agile as gradient descent. A little bit, many times over, lets you make more appropriate decisions than chasing the perfect option on a sparse number of more "wicked" problems.

Agility serves us as a discipline enforcer: it keeps us from going too deep when that is not warranted, stops us from just browsing around, and makes sure we keep focus on the expected value to be delivered rather than on abstract algorithmia.

The key discipline for us, then, is how to break down our work into manageable chunks that are still meaningful. Every story should have a deliverable of sorts. It could be a working version (preferably), a document with the analysis and actionable conclusions that derive from it, a report on the plan to follow, or anything else that creativity allows us.
But it needs to deliver value, either as software or gained knowledge.

In the same vein that a minimum viable product (MVP) represents the minimum set of features so that the software delivers value to its users, we have to think about the minimum work required to gain a new insight or functionality. Any large research and development effort can be broken down into a myriad of these smaller units, many of them parallelizable to leverage a larger team and more diverse skill sets.

But we have to be agile about agility. As a practice, this is an expertise we hold. Consult with the team, and we'll learn and develop it together. Agility is also about plasticity, not becoming ossified in our behaviors and ideas.

Career path

There are different growth dimensions to consider within Globant, suited to each profile's preferences and interests.

Development Phase

This implies growing in seniority in all three areas (technical, soft skills and practice involvement). Up to that point, the expectation is that project performance takes priority, followed by all the other components. The way to grow is to build solid knowledge of the field, leveraging projects to deepen the experience and proposing new avenues to improve further. It is expected that profiles will rely on more senior peers to help them grow in a consultative / mentorship manner.

A higher seniority means the ability to handle larger, more complex projects and clients, discover and apply newer techniques, and spread that knowledge.

Technical Leadership

Above the Senior level, there is the option (though highly regarded in terms of growth) of Technical Leadership of a project: taking on the responsibility of delivering great value as a whole, leading a team and handling a client. This requires a mixture of soft and hard skills.
Higher Echelons

Beyond the top development seniority, there are two paths to choose from:
• Subject Matter Expertise
• Technical Manager / Director

While both entail responsibilities over projects and teams, the SME path implies technical leadership: handling the most complex technical problems that can arise in a project, acting as consultant to several projects, developing techniques and delivering exceedingly superior value on projects.

The managing (TM/TD) path entails being an owner and representative of the value we can offer to clients, peers and Globant as a whole, which means taking ownership of teams, initiatives, career paths, what / how / why something is offered to clients, and why they would want to work with us at all. It implies connecting the technical capacities with the business need.

Selected references

This section contains a minimal set of selected references to give a jumpstart. They are not considered "complete" in and of themselves, nor necessarily canon, but they are nonetheless useful resources.

You can always check the AI path at Globant Campus for a comprehensive list of resources validated by the practice.

Code and Practice
• Python machine learning
• Machine Learning Algorithms
• Buildi 7 ings ithe

Theory
• Khan Academy
• ASA's statement on p-values
• CrashCourse: p-values
• Machine Learning - Andrew Ng - Coursera
• GCP related
  ◦ Machine Learning Crash Course
  ◦ Other courses linked from there to complement

General
• Globant's AI Manifesto
• Introduction to Data Science
  ◦ Overall idea about the role and discipline
• Harnessing the Power of AI
  ◦ Business oriented introduction to AI
• Tech Lead Maturity Program
  ◦ Understand the roles and responsibilities within projects
• Becoming a Solution Owner
  ◦ Understanding presales
• The importance of Visualizations
  ◦ Study: Charts change hearts and minds better than words do

Sources
• Others to come
