
We are now at 86 questions.

These are mostly open-ended questions, to assess the technical horizontal knowledge of a senior candidate for a rather high level position, e.g. director.
1. What is the biggest data set that you processed, and how did you process it, what were the results?
2. Tell me two success stories about your analytic or computer science projects. How was lift (or success) measured?
3. What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
4. What is: collaborative filtering, n-grams, map reduce, cosine distance?
5. How to optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?
6. How would you come up with a solution to identify plagiarism?
7. How to detect individual paid accounts shared by multiple users?
8. Should click data be handled in real time? Why? In which contexts?
9. What is better: good data or good models? And how do you define "good"? Is there a universal good model? Are there any models that are definitely not so good?
10. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation?
11. How do you handle missing data? What imputation techniques do you recommend?
12. What is your favorite programming language / vendor? Why?
13. Tell me 3 things positive and 3 things negative about your favorite statistical software.
14. Compare SAS, R, Python, Perl.
15. What is the curse of big data?
16. Have you been involved in database design and data modeling?
17. Have you been involved in dashboard creation and metric selection? What do you think about Birt?
18. What features of Teradata do you like?
19. You are about to send one million emails (marketing campaign). How do you optimize delivery? How do you optimize response? Can you optimize both separately? (answer: not really)

20. Toad or Brio or any other similar clients are quite inefficient to query Oracle databases. Why? What would you do to increase speed by a factor 10, and be able to handle far bigger outputs?
21. How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in an SQL-powered RDBMS?
22. What are hash table collisions? How are they avoided? How frequently do they happen?
23. How to make sure a mapreduce application has good load balance? What is load balance?
24. Examples where mapreduce does not work? Examples where it works very well? What are the security issues involved with the cloud? What do you think of EMC's solution offering a hybrid approach - both internal and external cloud - to mitigate the risks and offer other advantages (which ones)?
25. Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
26. Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
27. Have you been working with white lists? Positive rules? (In the context of fraud or spam detection)
28. What is star schema? Lookup tables?
29. Can you perform logistic regression with Excel? (yes) How? (use linest on log-transformed data) Would the result be good? (Excel has numerical issues, but it's very interactive)
30. Have you optimized code or algorithms for speed: in SQL, Perl, C++, Python etc.? How, and by how much?
31. Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Depends on the context?
32. Define: quality assurance, six sigma, design of experiments. Give examples of good and bad designs of experiments.
33. What are the drawbacks of general linear model? Are you familiar with alternatives (Lasso, ridge regression, boosted trees)?
34. Do you think 50 small decision trees are better than a large one? Why?
35. Is actuarial science not a branch of statistics (survival analysis)? If not, how so?
36. Give examples of data that does not have a Gaussian distribution, nor lognormal. Give examples of data that has a very chaotic distribution.

37. Why is mean square error a bad measure of model performance? What would you suggest instead?
38. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?
39. What is sensitivity analysis? Is it better to have low sensitivity (that is, great robustness) and low predictive power, or the other way around? How to perform good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?
40. Compare logistic regression w. decision trees, neural networks. How have these technologies been vastly improved over the last 15 years?
41. Do you know / have you used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with? When is full data better than reduced data or sample?
42. How would you build non parametric confidence intervals, e.g. for scores? (see the AnalyticBridge theorem)
43. Are you familiar either with extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
44. What is root cause analysis? How to identify a cause vs. a correlation? Give examples.
45. How would you define and measure the predictive power of a metric?
46. How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set - the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?
47. How to create a keyword taxonomy?
48. What is a Botnet? How can it be detected?
49. Any experience with using APIs? Programming APIs? Google or Amazon APIs? AaaS (Analytics as a service)?
50. When is it better to write your own code than using a data science software package?
51. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimensions in a chart (or in a video)?
52. What is POC (proof of concept)?

53. What types of clients have you been working with: internal, external, sales / finance / marketing / IT people? Consulting experience? Dealing with vendors, including vendor selection and testing?
54. Are you familiar with software life cycle? With IT project life cycle - from gathering requests to maintenance?
55. What is a cron job?
56. Are you a lone coder? A production guy (developer)? Or a designer (architect)?
57. Is it better to have too many false positives, or too many false negatives?
58. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
59. How does Zillow's algorithm work? (to estimate the value of any home in US)
60. How to detect bogus reviews, or bogus Facebook accounts used for bad purposes?
61. How would you create a new anonymous digital currency?
62. Have you ever thought about creating a startup? Around which idea / concept?
63. Do you think that typed login / password will disappear? How could they be replaced?
64. Have you used time series models? Cross-correlations with time lags? Correlograms? Spectral analysis? Signal processing and filtering techniques? In which context?
65. Which data scientists do you admire most? Which startups?
66. How did you become interested in data science?
67. What is an efficiency curve? What are its drawbacks, and how can they be overcome?
68. What is a recommendation engine? How does it work?
69. What is an exact test? How and when can simulations help us when we do not use an exact test?
70. What do you think makes a good data scientist?
71. Do you think data science is an art or a science?
72. What is the computational complexity of a good, fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points - each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table in the first place?
73. Give a few examples of "best practices" in data science.
74. What could make a chart misleading, difficult to read or interpret? What features should a useful chart have?
75. Do you know a few "rules of thumb" used in statistical or computer science? Or in business analytics?
76. What are your top 5 predictions for the next 20 years?
77. How do you immediately know when statistics published in an article (e.g. newspaper) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject? For instance, what do you think about the official monthly unemployment statistics regularly discussed in the press? What could make them more accurate?
78. Testing your analytic intuition: look at these three charts. Two of them exhibit patterns. Which ones? Do you know that these charts are called scatter-plots? Are there other ways to visually represent this type of data?
79. You design a robust non-parametric statistic (metric) to replace correlation or R square, that (1) is independent of sample size, (2) always between -1 and +1, and (3) based on rank statistics. How do you normalize for sample size? Write an algorithm that computes all permutations of n elements. How do you sample permutations (that is, generate tons of random permutations) when n is large, to estimate the asymptotic distribution for your newly created metric? You may use this asymptotic distribution for normalizing your metric. Do you think that an exact theoretical distribution might exist, and therefore, we should find it, and use it rather than wasting our time trying to estimate the asymptotic distribution using simulations?
80. More difficult, technical question related to previous one. There is an obvious one-to-one correspondence between permutations of n elements and integers between 1 and n! Design an algorithm that encodes an integer less than n! as a permutation of n elements. What would be the reverse algorithm, used to decode a permutation and transform it back into a number? Hint: An intermediate step is to use the factorial number system representation of an integer. Feel free to check this reference online to answer the question. Even better, feel free to browse the web to find the full answer to the question (this will test the candidate's ability to quickly search online and find a solution to a problem without spending hours reinventing the wheel).
81. How many "useful" votes will a Yelp review receive? My answer: Eliminate bogus accounts (read this article), or competitor reviews (how to detect them: use taxonomy to classify users, and location - two Italian restaurants in same Zip code could badmouth each other and write great comments for themselves). Detect fake likes: some companies (e.g. FanMeNow.com) will charge you to produce fake accounts and fake likes. Eliminate prolific users who like everything, those who hate everything. Have a blacklist of keywords to filter fake reviews. See if IP address or IP block of reviewer is in a blacklist such as "Stop Forum Spam". Create honeypot to catch fraudsters. Also watch out for disgruntled employees badmouthing their former employer. Watch out for 2 or 3 similar comments posted the same day by 3 users regarding a company that receives very few reviews. Is it a brand new company? Add more weight to trusted users (create a category of trusted users). Flag all reviews that are identical (or nearly identical) and come from same IP address or same user. Create a metric to measure distance between two pieces of text (reviews). Create a review or reviewer taxonomy. Use hidden decision trees to rate or score review and reviewers.
82. What did you do today? Or what did you do this week / last week?
83. What/when is the latest data mining book / article you read? What/when is the latest data mining conference / webinar / class / workshop / training you attended? What/when is the most recent programming skill that you acquired?
84. What are your favorite data science websites? Who do you admire most in the data science community, and why? Which company do you admire most?
85. What/when/where is the last data science blog post you wrote?
86. In your opinion, what is data science? Machine learning? Data mining?
87. Who are the best people you recruited and where are they today?
88. Can you estimate and forecast sales for any book, based on Amazon public data? Hint: read this article.
89. What's wrong with this picture?
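
For the permutation-encoding question above (the one whose hint mentions the factorial number system), one possible answer can be sketched as follows. This is only an illustration under stated assumptions, not the author's reference solution: integers are taken in the range 0 to n! - 1 (shift by one for the 1 to n! convention), permutations are over the elements 0 to n-1, and the function names are hypothetical.

```python
from math import factorial


def int_to_permutation(k, n):
    """Encode integer k (assumed 0 <= k < n!) as a permutation of 0..n-1."""
    if not 0 <= k < factorial(n):
        raise ValueError("k must satisfy 0 <= k < n!")
    remaining = list(range(n))
    perm = []
    # Peel off factorial-base digits from the most significant one down.
    for i in range(n - 1, -1, -1):
        digit, k = divmod(k, factorial(i))
        # Each digit selects one of the elements not used so far.
        perm.append(remaining.pop(digit))
    return perm


def permutation_to_int(perm):
    """Reverse step: recover the integer from a permutation of 0..n-1."""
    n = len(perm)
    remaining = list(range(n))
    k = 0
    for i, value in enumerate(perm):
        # The digit is the rank of this value among the unused elements.
        digit = remaining.index(value)
        k += digit * factorial(n - 1 - i)
        remaining.pop(digit)
    return k
```

Both functions run in O(n^2) time because of the list `pop` and `index` calls, which is usually acceptable for an interview-sized n. For the related sampling question, drawing many random permutations when n is large does not require this encoding at all: `random.shuffle` from the Python standard library shuffles a list in place.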