Baran Lab Group Meeting
Statistics in Synthetic Chemistry
Marcus Farmer, 09/02/2017

Central goal of this group meeting
"To provide a brief summary of statistical methodology that has found applications in organic chemistry, as well as to provide a forward-looking perspective on potential applications that have yet to be realized. Emphasis will be placed on how to utilize statistical methodology so that the aforementioned potential applications may be readily sought after by organic chemists."

Select applications of statistics in areas related to synthesis

Clinical Trials
What's the difference between probability and proportion?
How sure can I be that 900 people is a sufficient sample size to study the impact of this drug on a disease?
How do I determine what dose to conduct this new clinical trial at?

Biostatistics/Genomics
How do I automatically classify genetic variations?
Is there a gene that influences a person's response to a drug that we didn't realize before?

To synthesis
[Figure: company logos, including Netflix; the others are not legible]

What is statistics?
"Statistics is a science in my opinion, and it is no more a branch of mathematics than are physics, chemistry, and economics; for if its methods fail the test of experience - not the test of logic - they are discarded."
John Tukey, Founding Chairman of Princeton's Statistics Department

What are statistical parameters?
Statistical parameters are theoretical values that we can only approximate by using statistics. Probability is a statistical parameter that can be estimated by the ratio of occurrences, i.e. the proportion.
Confidence intervals are used to declare ranges that, at a stated level of confidence, are expected to contain the true value of the statistical parameter.

What is statistical power?
Statistical power is the probability that our test will detect a difference between two statistics when a real difference exists.

What is correlation and how should we use it properly?
Correlation should not be mistaken for causation; it is merely the description of an apparent relationship between variables.

What is Bayes' theorem and why should I care?
Bayes' theorem details how we should update our beliefs about the probability of an event given new information on the system.

Where can I learn more about statistics?
Penn State's online World Campus is a fantastic place to start.
Link: https://onlinecourses.science.psu.edu/statprogram/

Design of Experiments (DoE): Principles and Theory

Types of Experiments

One Variable at a Time (OVAT)
Hold the other variables constant and vary X; pick the optimal X. Vary Y; pick the optimum. Vary Z; pick the optimum.
This is the traditional optimization scheme used by organic chemists. This example has 9 unique reactions, but are we sure that we found the optimal conditions? [The note that follows is not legible in the source.]

Factorial design ($2^{3}$) and fractional factorial design ($2^{3-1}$)
A fractional factorial design removes experiments from the full factorial design.
General form of a factorial experiment: $L^{F-P}$, where L = levels, F = factors, and P = number used to reduce complexity.

Response Surface Methodology
Factorial designs are great for quick experiments and will tell you if there are any interactions. However, they won't necessarily let you model the shape of the "response surface." Response surface methodology tackles this by teaching you how to design experiments that allow you to model the "response surface" and approximate its curvature.

Central Composite Design (most popular response surface methodology)
Apply a $2^{F}$ factorial design plus star points and center points. Vary X, Y, and Z simultaneously, then use the mapping to predict optimal conditions.
The center points are commonly replicated N times so that you can estimate the variation of your response surface.

[Figure: DoE vs OVAT displayed in 2-D (function space is 3D)]

What will you need?
Any of the following software will be helpful.
DoE Software: Design Expert (Stat-Ease Inc.), MODDE (Umetrics), DoE Fusion PRO (S-Matrix Corp), STAVEX (Aicos), Minitab (Minitab Inc.), JMP (SAS)
Free DoE Software: R (packages can be found on CRAN; requires programming skills)

[Figure: OVAT and DoE panels plotted against temperature at concentrations of 0.1 M, 0.25 M, and 0.5 M]

Where to learn more?
Penn State Online STAT 503 (Free Notes!)

Design of Experiments (DoE): A Case Study of a GSK Process Optimization
Experimental design: 1) half factorial; a further design is listed as "(central composite, 30 experiments)".
[Scheme, partly legible: an undesired pathway is shown with the question "Mechanism?" and the annotation "oxocarbenium formation"; one plot axis is temperature]

Design of Experiments: A Merck Case Study
[Scheme, partly legible: 5 eq. TsOH, 25 eq. TFA, MeCN, 70 °C; HTE: 236 reactions, 3 days total; H2SO4 in MeCN/H2O (90:10), 70 °C: 75% yield, 100% conversion; product 50-68%; DoE optimization: 49 reactions, central composite; H2SO4, MeCN, 4 vol% H2O, 70 °C; both acids required for global deprotection; improved alternatives: octanethiol, MeCN, 20 °C, 95% isolated]

Key lessons from this work
1) HTS and DoE work together
2) HTS is good for discrete variables
3) DoE is great for continuous variables
4) DoE, once again, provides valuable insight into variable interactions that guide process decisions
Kuethe, Tellers, Org. Proc. Res. Dev. 2009, 13, 471

Statistical Modeling of Catalysts by Parameterization
[Scheme, partly legible: chromium chloride, ligand, propargyl bromide, TEA, THF; 75% yield, 92% ee]
E = Hammett σ value, S = Charton ν value
Eq. 1) $\Delta\Delta G^{\ddagger} = -1.20 + 1.22E + 2.84S - 0.85S^{2} - 3.79ES + 1.25ES^{2}$

What is the difference between theoretical and empirical models?
Theoretical Models
1) Derived from first principles and don't rely on experimental data
Empirical Models
1) Require a set of data to fit the model to
2) Mathematical constants do not need to have any meaning

E = Hammett σ value, B1 = minimum width, B5 = maximum width
Eq. 2) $\Delta\Delta G^{\ddagger}$ = -0.696 + 1.380[?] - 0.962 B5 - 2.705[?] + 1.736 E·B5 (the variables in the second and fourth terms are not legible in the source)

What are some examples of empirical models?
Response surface models (RSM, previously discussed)
Linear Free Energy Relationships (LFERs)

Example of a LFER: The Hammett Equation
$-RT\ln(K/K_0) = \Delta G = \frac{A}{d^{2}}\left(\frac{B_1}{D} + B_2\right) \;\Rightarrow\; \log(K/K_0) = \sigma\rho$
$\sigma = -\frac{A}{2.303R}$ (constant, changes with substituent)
$\rho = \frac{1}{Td^{2}}\left(\frac{B_1}{D} + B_2\right)$ (constant, changes with reaction)
K = rate or equilibrium constant
K0 = reference rate or equilibrium constant
R = gas constant
T = temperature
D = dielectric constant for solvent
d = distance between the substituent and reaction site
Key idea: there are parameters which are unique to the substituent and reaction, respectively.

[Figure: model comparison for a series of fluorinated acids and ammonium ions; predicted vs measured free energy (kcal/mol)]

Multivariate LFERs: Parameters
Why is parameterization of organic molecules important?
By nature, molecules and their substituents are discrete variables. Discrete variables are useful for classification, but not for regression. Parameterization [...] is achieved at the cost of some information.

What drives selectivity? Hammett alternatives?
[Scheme, partly legible: Ar = 4-MeC6H4 gives A:B = 1:12; the ratio for Ar = 4-CF3C6H4 is not legible]
Sigman, Nat. Chem. 2012
Sterimol parameters are more information rich than Charton values.
[Figures: predicted vs measured ΔΔG‡ (kcal/mol) for the Charton and Sterimol models; fitted line equations not legible]

Multivariate LFERs: Mechanism
[Scheme, partly legible: competing pathways to the desired insertion product and the Heck product; catalysts 2b, 2c, 2e (rac); DIPEA]
[Figure: predicted vs measured ΔΔG‡ (kcal/mol); fitted line y = 0.96x + 0.01; two fitted ΔΔG‡ equations whose terms are not legible]
Cation-π interaction
The requirement for both the Dπ and Eπ terms indicates that stacking is likely involved in determining the [not legible].
Sigman

[Scheme, partly legible: Pd(dba) catalyst in CHCl3 with catalysts 2a-2e]
Can multivariate correlations be used to study this reaction?
Sigman, Toste, J. Am. Chem. Soc. 2017
[Figure: predicted vs measured ΔΔG‡ (kcal/mol); points labeled 2a, 2d and "2,6-unsub."]
Catalyst sterics can control product ratio, suggesting that 2 is involved in the deprotonation event.

What interactions dictate the enantioselectivity?
Diarylation using 2a; vary Ar.
$\Delta\Delta G^{\ddagger} = 0.05 + 0.59\,\mathrm{NBO} - 0.42\,\mathrm{SE} - 0.88\,\mathrm{NBO}\cdot\mathrm{SE}$ (subscripts on NBO and SE are not legible in the source)
17 datapoints
[Figure: predicted vs measured ΔΔG‡ (kcal/mol); training set (14), validation set (6)]

Multivariate LFERs: Prediction
Sigman, J. Am. Chem. Soc. 2018, 137, 18668
[Workflow, partly legible: training set and validation set; computationally screen ligands; experimentally evaluated; new optimal conditions]

[Figure residue, partly legible: reaction classification by a logistic regression model; "apply predictions to electronic notebook"; "> 77% accuracy"; "data (98,226 ...)"; table of reaction classifications; "Can you?"]
[Figure residue, partly legible: recall = 93%; NameRxn; "Problem with patent data"; predicted reaction types include Stille, Sonogashira, Bromo Suzuki, Chloro Suzuki, Fluoro N-Arylation, and Chloro N-Arylation; probability = 70%]

Full mechanism predicted (each elementary step was in top 3)
Baldi, J. Chem. Inf. Model. 2012, 52, 2526
Ranking predictor workflow

Machine Learning in Synthetic Chemistry
Jensen, ACS Cent. Sci. 2017, 3, 434

Workflow
Reactions from patents, between 1976 & 2013: 1,122,662
Heuristic-driven template extraction: 140,284 extracted templates
Reduce to those templates with >50 examples: 1,689 filtered templates
Apply the templates to 15,000 examples (algorithm): 5,335,669 possible reactions
Split data
Train neural network

Results
Accurately predicted major product: 68.5% of validation data (average undergrad?)
Major product identified in top X predictions: X = 3, 84.8%; X = 5, 89.4%

Note 1: While the validation data was not used during the model training, the data used for this protocol was limited to reactions where at least one of the templates could [remainder not legible].
Note 2: When the 1,689 templates used in this study were applied to a random set of 15,000 reactions, the actual product was found in 76% of examples. This arises from the templates with less than 50 reported examples.

Observations
(No reagents are missing; patent data isn't clean.)

Examples of predicted reactions
[Schemes: a correct prediction with Prob. = 98.8, Rank = 1; an incorrect prediction]

Data splits
70% training
10% validation
[Remaining split entries are not legible.]
[Scheme residue: the incorrect prediction example has Rank = 190; its probability is only partly legible]

Machine Learning in Synthetic Chemistry

Workflow
Reaxys: undisclosed sample size (millions), undisclosed data gathering algorithm
Filter reactions by: A + B + C → D
Learn rules based on bond changes and neighboring atoms
Save rules if they occur A) >50 times, B) >100 times, C) >25,000 times
Rule counts: A) 17,370 rules; B) 8,720 rules; a further set of 137 rules (label not legible)
Filter by rules: 4,900,000 reactions
Filter by 103 hand-written rules by an organic chemist: 3,000,000 reactions
Split and train datasets using 5-fold cross-validation, 70:20:10 Train:Test:Validate
Models: logistic regression (LR) models and an artificial neural network (NN)

Accuracy Metrics
Retrosynthesis: Accuracy LR 31%; Accuracy NN 62%; MRR LR 0.41; MRR NN 0.75
Forward: Accuracy LR 41%; Accuracy NN 77%; MRR LR 0.49; MRR NN 0.85

Forward Synthesis
[Schemes not legible]
Interesting observation: of two possible products, one was assigned probability = 99% and the other probability = 1%. This selectivity was learned, not programmed.

Retrosynthesis
(The retro was in top 10 pred.)
[Scheme not legible]
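The accuracy and MRR (mean reciprocal rank) figures quoted above both come from the rank at which a model places the true answer among its predictions. The sketch below is a minimal illustration of how such metrics can be computed. It is not the code used in the cited papers: the function names and the toy ranks are made up for this example, and it assumes each model returns an ordered candidate list in which a missing true answer scores zero.

```python
from typing import Hashable, List, Optional, Sequence


def rank_of_truth(ranked: Sequence[Hashable], truth: Hashable) -> Optional[int]:
    """Return the 1-based position of the true answer, or None if absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == truth:
            return position
    return None


def top_k_accuracy(ranks: List[Optional[int]], k: int) -> float:
    """Fraction of examples whose true answer is ranked k or better."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Average of 1/rank; an example whose true answer is missing adds 0."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


if __name__ == "__main__":
    # Hypothetical ranks of the true product for five test reactions.
    toy_ranks = [1, 2, None, 1, 4]
    print("top-1 accuracy:", top_k_accuracy(toy_ranks, 1))
    print("top-3 accuracy:", top_k_accuracy(toy_ranks, 3))
    print("MRR:", mean_reciprocal_rank(toy_ranks))
```

MRR rewards near misses (a true answer at rank 2 still contributes 0.5), which is why it can be noticeably higher than top-1 accuracy for the same model.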
