Forensic Science International: Genetics 65 (2023) 102870

Contents lists available at ScienceDirect
Forensic Science International: Genetics
journal homepage: www.elsevier.com/locate/fsigen

Recent advances in Forensic DNA Phenotyping of appearance, ancestry and age*

Manfred Kayser, Wojciech Branicki, Walther Parson, Christopher Phillips

Department of Genetic Identification, Erasmus MC University Medical Center Rotterdam, Rotterdam, the Netherlands
Institute of Zoology and Biomedical Research, Jagiellonian University, Kraków, Poland
Institute of Forensic Research, Kraków, Poland
Institute of Legal Medicine, Medical University of Innsbruck, Innsbruck, Austria
Forensic Genetics Unit, Institute of Forensic Sciences, University of Santiago de Compostela, Spain

Keywords: Forensic DNA Phenotyping

Forensic DNA Phenotyping (FDP) comprises the prediction of a person's externally visible characteristics regarding appearance, biogeographic ancestry and age from DNA of crime scene samples, to provide investigative leads to help find unknown perpetrators that cannot be identified with forensic STR-profiling. In recent years, FDP has advanced considerably in all of its three components, which we summarize in this review article. Appearance prediction from DNA has broadened beyond eye, hair and skin color to additionally comprise other traits such as eyebrow color, freckles, hair structure, hair loss in men, and tall stature. Biogeographic ancestry inference from DNA has progressed from continental ancestry to sub-continental ancestry detection and the resolving of co-ancestry patterns in genetically admixed individuals. Age estimation from DNA has widened beyond blood to more somatic tissues such as saliva and bones as well as new markers and tools for semen. Technological progress has allowed forensically suitable DNA technology with largely increased multiplex capacity for the simultaneous analysis of hundreds of DNA predictors with targeted massively parallel sequencing (MPS). Forensically validated MPS-based FDP tools for predicting from crime scene DNA i) several appearance traits, ii) multi-regional ancestry, iii) several appearance traits together with multi-regional ancestry, and iv) age from different tissue types, are already available.
Despite recent advances that will likely increase the impact of FDP in criminal casework in the near future, moving reliable appearance, ancestry and age prediction from crime scene DNA to the level of detail and accuracy police investigators may desire requires further intensified scientific research together with technical developments and forensic validations, as well as the necessary funding.

1. Introduction

Forensic DNA Phenotyping (FDP) refers to the prediction of a person's externally visible characteristics (EVCs) regarding appearance, biogeographic ancestry and age from DNA extracted from human biological samples collected at crime scenes [1]. FDP is applied in criminal cases where no match in forensic STR-profiling is found, because the sample donor is unknown to the investigative authorities and thus their STR-profile is unavailable for comparative matching [2]. FDP aims to provide investigative leads to help find unknown perpetrators of crime by reducing the number of potential suspects to the group of people that match the EVC information predicted from the crime scene DNA. FDP therefore allows focused police investigation based on information obtained directly from the evidential material. Because current FDP tools cannot deliver appearance on the individual-specific level, which also is unlikely to become possible in the foreseeable future, FDP applications are always followed by forensic STR-profiling for final individual identification. Because appearance, ancestry and age by themselves describe a person's EVCs, and since some appearance traits depend on certain biogeographic ancestries and/or a certain age range, the combination of all three FDP components is the most informative way to find unknown perpetrators with the help of DNA.
Therefore, it is recommended to combine DNA-based appearance, ancestry and age prediction in forensic practice, provided the legal situation allows [2]. FDP does not come without ethical, societal and legal implications, as described elsewhere [5]. In recent years, several countries have revised their forensic DNA legislation to allow and regulate FDP, while in some other countries FDP is allowed without law specifications [2].

* This paper is dedicated to our esteemed colleague and dear friend Peter M. Schneider, who sadly passed away in September 2022; Forensic DNA Phenotyping is one of the many forensic genetic areas he had impacted with his 35 years' work in the field.
Corresponding author: M. Kayser.
https://doi.org/10.1016/j.fsigen.2023.102870
Received 17 February 2023; Accepted 4 April 2023; Available online 6 April 2023
1872-4973/© 2023 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

The final practical success of FDP in criminal casework depends on the level of detail, accuracy and reliability with which appearance, ancestry and age can be predicted from crime scene DNA. Another factor is the frequency of the predicted EVC feature in the region where the crime was committed, assuming the criminal is local. Predicted EVC features that are less frequent in the region where the crime happened will help more in the police investigation to find the unknown perpetrator (provided the perpetrator comes from that region) than common ones. However, this does not mean that predicting common features has no value, because it allows the exclusion of individuals not matching these common features, for instance members of minority groups [2]. Practical success of FDP also depends on how the FDP outcome is finally utilized during the police investigation. An effective way in cases with unknown male perpetrators is to combine FDP with patrilineal familial search, where only those men who match the FDP outcome are approached for voluntary participation in the Y-STR profiling. This combined approach allows focusing on a smaller number of volunteers than DNA mass screening or DNA dragnets without FDP would. The success of this combined approach is exemplified by the murder and rape case of Milica van Doorn in the Netherlands [5].

FDP tools generally consist of two components: i) a forensically validated multiplex genotyping tool for analyzing all predictive DNA markers in the crime scene sample, based on a forensically suitable DNA technology allowing low quality and low quantity DNA analysis, and ii) a prediction tool based on a validated prediction model for obtaining probabilities in appearance and ancestry prediction, and for estimating the age from the epigenetic data, established with the multiplex genotyping tool from the crime scene DNA. Average prediction accuracy estimates available from the validation of the prediction models indicate whether a prediction model/tool is accurate enough for practical application. However, even models with lower average accuracy can yield high probabilities in certain individuals, albeit in fewer individuals than with more accurate models. In practical casework application, current appearance DNA prediction tools deliver probabilities of trait categories for all appearance traits for which DNA predictors are included in the genotyping tools. Thus far, appearance DNA prediction in forensic applications purely reflects categorical prediction, as the science of continuous appearance prediction from genetic data is not yet advanced enough or is troubled by the very large number of DNA predictors needed. The estimation of a probability is mirrored in the inference of biogeographic ancestry from crime scene DNA, where typically a likelihood ratio (LR) framework is applied.
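As an illustration of what a likelihood ratio expresses in ancestry inference, the following minimal sketch compares the probability of one SNP profile under two reference populations. It assumes Hardy-Weinberg proportions and independent SNPs; the allele frequencies, the profile and all function names are hypothetical and are not taken from any of the tools reviewed here.

```python
from math import prod


def genotype_probability(allele_count: int, allele_freq: float) -> float:
    """Hardy-Weinberg probability of carrying 0, 1 or 2 copies of an allele."""
    p, q = allele_freq, 1.0 - allele_freq
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}[allele_count]


def profile_likelihood(genotypes, allele_freqs):
    """Probability of the whole profile, assuming the SNPs are independent."""
    return prod(genotype_probability(g, f) for g, f in zip(genotypes, allele_freqs))


def likelihood_ratio(genotypes, freqs_pop_a, freqs_pop_b):
    """LR > 1 favours population A, LR < 1 favours population B."""
    return profile_likelihood(genotypes, freqs_pop_a) / profile_likelihood(genotypes, freqs_pop_b)


# Hypothetical three-SNP profile (allele copies per SNP) and hypothetical
# allele frequencies in two reference populations.
profile = [2, 1, 0]
freqs_pop_a = [0.9, 0.5, 0.1]
freqs_pop_b = [0.1, 0.5, 0.9]

lr = likelihood_ratio(profile, freqs_pop_a, freqs_pop_b)
print(f"LR (A vs. B) = {lr:.0f}")
```

The sketch also shows why the reference data matter: the LR is computed entirely from the reference allele frequencies, so a different reference sample gives a different LR for the same profile.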
In DNA-based age prediction, the prediction model delivers an age estimate for which the error is given as the average error of the prediction model. Thus, FDP not only provides information on the unknown sample donor's most likely category of all appearance traits, the most likely geographic region of biogeographic ancestry, and the most likely age, but also on the errors of these DNA-based predictions. This marks an advantage over eyewitness descriptions (when these are available), which are known to be highly subjective and prone to change over time [6], and for which the error in any specific case is completely unknown. Based on the magnitude of the probabilities or LRs obtained in any one case, police investigators can decide what weight to give the generated FDP information in the investigation.

FDP always relies on reference data, used either directly or indirectly: directly, as with most biogeographic ancestry prediction tools, which perform inference analysis of the case sample data alongside reference population sample data; indirectly, as with appearance and age prediction, where the reference data are used in the prediction models to obtain the probabilities, but not by the prediction tools directly. Therefore, reference data should always be described together with the prediction outcome when FDP results are reported to the investigative authorities.

In recent years there have been significant advances in FDP in all three components: appearance, ancestry and age prediction from crime scene DNA, which we summarize in this review. For earlier achievements in this field, we refer to the previous review articles on the forensically motivated prediction of appearance and ancestry published in 2015 [7,8] and age published 2016-17 [9,10]. A key element for the improved FDP solutions was the increased number of DNA predictors that have become available over recent years.
However, this number went beyond the limits of the multiplex capacities of the forensic DNA technologies previously used in FDP tools. Advances in targeted MPS technologies demonstrated increased multiplex capacity compared to all previously used forensic DNA technologies, which together with its sensitivity and reliability makes MPS the key technology for FDP purposes. The multiplex capacity of targeted MPS is highest for SNPs, used for appearance and ancestry prediction, compared to DNA methylation markers, used for age prediction. In the last years, significant progress has been made in the implementation of MPS technologies in the forensic workflow for all types of forensic purposes, including FDP [11]. Various MPS-based multiplex genotyping tools have been developed recently for predicting from crime scene DNA i) several appearance traits combined, ii) multi-regional biogeographic ancestry, iii) several appearance traits combined with multi-regional ancestry, and iv) age from different tissues, which provide improved FDP solutions, as we discuss in this review.

However, there are critical points that apply to many of the recent and earlier studies on forensically motivated DNA-based appearance, ancestry and age prediction. First, the sample size of the dataset used for discovering the predictive DNA markers was often small, which leads to uncertainty in the predictive value of the markers used. Second, datasets used for building and validating the prediction model were also small in scale and not independent from each other and/or the marker discovery dataset. Applying the same datasets for the different steps in the prediction modelling can result in overestimation of the prediction accuracy, especially when the same dataset is used for all three steps. Using the same (large) dataset for model building and model validation by performing cross-validation is a valid option as long as such a dataset was not additionally used for marker discovery, but using independent datasets for all three steps is always better. Third, many of the reported prediction models were not made available as prediction tools, which prevents others, besides the reporting authors, from applying them in practice. Fourth, many articles that describe prediction markers and models do not provide a dedicated multiplex genotyping tool, which hinders practical applications. Fifth, when multiplex genotyping tools were reported, they often lacked forensic validation studies, which prevents applications in forensic practice.

An optimum approach for developing and validating FDP tools includes: i) DNA predictors ascertained from a large dataset not used in prediction modelling, ii) a prediction model built in a large independent dataset and validated in a large independent dataset, while the model validation delivers high-enough prediction accuracies for appearance and ancestry and low-enough error for age, iii) the developed prediction model is made available for practical applications as a prediction tool, iv) a single multiplex genotyping tool is developed for all DNA predictors used in the prediction model(s) based on a technology suitable for analyzing low quantity and low quality DNA, v) the multiplex genotyping tool has undergone forensic validation and successfully passed the major validation steps, and vi) the forensically validated multiplex genotyping tool is made available together with the prediction tool, thereby allowing practical FDP applications in forensic casework. However, in reality, many studies did not meet several or all of these points, which limits FDP applications.
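Points i) and ii) above amount to keeping three non-overlapping sample sets for marker discovery, model building and model validation. A minimal sketch of such a split is given below; the function name, the split fractions and the sample identifiers are hypothetical and serve only to illustrate the independence requirement.

```python
import random


def split_three_ways(sample_ids, fractions=(0.4, 0.4, 0.2), seed=0):
    """Shuffle once, then cut into discovery / building / validation sets.

    Each sample ends up in exactly one set, so no sample used for marker
    discovery can inflate the accuracy measured in validation.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_discovery = int(len(ids) * fractions[0])
    n_building = int(len(ids) * fractions[1])
    discovery = ids[:n_discovery]
    building = ids[n_discovery:n_discovery + n_building]
    validation = ids[n_discovery + n_building:]
    return discovery, building, validation


# Hypothetical cohort of 1000 sample identifiers.
discovery, building, validation = split_three_ways(range(1000))

# The three sets must not share any sample.
assert not set(discovery) & set(building)
assert not set(discovery) & set(validation)
assert not set(building) & set(validation)
print(len(discovery), len(building), len(validation))
```

In practice the independence is usually achieved with separate cohorts rather than one shuffled cohort; the sketch only makes the non-overlap condition explicit.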
The Visible Attributes through GEnomics (VISAGE) Consortium and Project, which worked on improving, integrating, implementing, disseminating and assessing the societal and ethical implications of Forensic DNA Phenotyping on appearance, biogeographic ancestry and age (http://www.visage-h2020.eu), considered all the above raised points when designing, developing, and forensically validating the VISAGE Toolbox.

2. Recent progress in predicting appearance traits from crime scene DNA

Developments in forensically motivated DNA prediction of appearance traits published up to 2014-15 are summarized elsewhere [7]. At that time, categorical eye and hair color prediction from crime scene DNA had become established with several predictive DNA marker sets, multiplex genotyping tools (some with forensic validation), and statistical prediction models (some with prediction tools) [7]. Skin color DNA prediction was emerging, while other appearance traits were not predictable from DNA due to strong limitations in the genetic knowledge available for these traits at that time [7]. Since 2015, DNA prediction of skin color became more established [12,15], and more predictive DNA markers for more appearance traits were discovered, such as for eyebrow color, freckles, hair structure, hair loss in men, tall stature, and grey hair. In the following sections, we summarize the recent advances in appearance DNA prediction for these recently added traits after we provide a brief update on advances in eye, hair, and skin color DNA prediction. For other appearance traits, such as ear morphology, facial hair traits and facial shape, the number of reported associated DNA variants has increased over the last years, but the phenotypic variance they explain is not large enough for developing FDP tools as of yet, which we summarize in one section.
In the last section of this chapter, we discuss the newly emerging field of epigenetic prediction of externally visible characteristics or habits that are determined by the exposure to external factors such as tobacco smoking.

2.1. Recent advances in eye color and hair color DNA prediction

The main steps in the establishment of eye and hair color DNA prediction from crime scene DNA were accomplished prior to 2015 as described before [7]. The recent years have seen many validation studies of these previously established eye and hair color DNA prediction tools in different populations from the same and different continents, including admixed groups, and with different statistical approaches including machine learning methods. Discussing these here would go beyond the practical length of this review.

In 2018, the IrisPlex model for eye color prediction and the HIrisPlex model for hair color prediction were both revised by increasing the model-underlying reference data, respectively [25]. The updated IrisPlex model is now based on close to 9500 samples and yields prediction accuracies, expressed as cross-validated AUC (area under the receiver operating characteristic curve), of 0.95, 0.94 and 0.74 for brown, blue, and intermediate eye color, respectively (Table 1).
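Because AUC is the accuracy measure used throughout this review, a short sketch of what it quantifies may help: the probability that a randomly chosen individual who has the trait category receives a higher predicted probability than a randomly chosen individual who does not. The function below is a plain pairwise implementation written for illustration; the predicted probabilities and labels are hypothetical and do not come from the IrisPlex reference data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities of "blue" for six individuals,
# with 1 = observed blue eye color and 0 = any other eye color.
blue_probability = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
is_blue = [1, 1, 0, 1, 0, 0]

example_auc = auc(blue_probability, is_blue)
print(f"AUC = {example_auc:.3f}")
```

An AUC of 0.5 corresponds to random prediction and 1.0 to perfect separation, which is the scale on which the values quoted in this review should be read.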
Notably, the relatively low AUC for intermediate eye color understates the ability to predict non-blue and non-brown eye color with IrisPlex, which is possible by deviating from concluding the most likely eye color from the category with the highest probability, as typically done for blue and brown eye color. In many cases, IrisPlex allows concluding intermediate eye color from similarly high probabilities obtained for the blue and the brown eye color categories. The updated HIrisPlex model is based on close to 1900 samples and gives cross-validated AUCs of 0.92, 0.83, 0.80 and 0.72 for red, black, blond and brown hair color, respectively (Table 1). The updated IrisPlex and HIrisPlex models are publicly available for practical use as prediction tools via the Erasmus MC HIrisPlex website at https://hirisplex.erasmusmc.nl. Both prediction tools available via the HIrisPlex website are based on dynamic IrisPlex and HIrisPlex prediction models, thereby allowing missing data depending on which SNP is missing in the incomplete profiles obtained from low quality and/or quantity crime scene DNA. Non-dynamic versions of the updated IrisPlex and HIrisPlex models are implemented in the VISAGE Software for Appearance, Ancestry and Age prediction from DNA (VISAGE Software, Table 1), which uses as input the MPS data generated with the VISAGE Enhanced Tool for Appearance and Ancestry and the two VISAGE Enhanced Tools for Age [15,16]. The VISAGE Software is available for expert forensic genetic practitioners via the European Network of Forensic Science Institutes (ENFSI).

The IrisPlex and HIrisPlex SNPs can be analyzed with the forensically validated IrisPlex and HIrisPlex multiplex genotyping tools, each based on a single SNaPshot assay [17-19]. They are included together with autosomal ancestry-informative SNPs and Y-chromosomal SNPs in a hybridization capture-based MPS assay [20] (Table 1). They are part of the forensically validated commercial MPS-based ForenSeq DNA Signature Prep Kit (Verogen) together with genetic markers for biogeographic ancestry and other forensic purposes [21] (Table 1). The commercial Universal Analysis Software (Verogen) for data analysis allows, amongst other things, generating eye and hair color probabilities from the data established with the kit. Notably, the prediction parameters used by this commercial prediction tool are the ones from the first IrisPlex and HIrisPlex models, which were based on much smaller datasets and deliver lower accuracies than the updated current models available via the HIrisPlex webtool (Table 1). The IrisPlex and HIrisPlex SNPs can also be analyzed together with additional skin color predicting SNPs of the HIrisPlex-S system with several MPS tools (see below under skin color).

The last years have also seen significant progress in genetically understanding eye and hair color variation more completely via outcomes of genome-wide association studies (GWASs) with enlarged sample size and consequently increased statistical power. Published in 2018 [22], the International Visible Trait Genetics (VisiGen) Consortium performed a GWAS on hair color in almost 300,000 Europeans and identified 124 significantly associated genetic loci, of which 111 were not previously known for hair color. Next to the GWAS, the authors tested the predictive value of the newly identified SNPs in a total of < 15,000 Europeans from two cohorts. Based on 252 of the 258 independently associated SNPs discovered in their GWAS together with 18 available HIrisPlex SNPs, and based on data from the two cohorts combined (split 80 %:20 % for model building and testing), AUCs of 0.86 for red, 0.86 for black, 0.74 for blond and 0.68 for brown were achieved, while the incomplete 18-SNP HIrisPlex model gave 0.85, 0.78, 0.67 and 0.62, respectively, in the same data [22].
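The two sets of AUCs just quoted can be compared category by category with simple arithmetic. The sketch below uses only the values given in the text; reading the size of the extended model as 252 SNPs against the 18-SNP model is an assumption made here for the comparison, and the variable names are hypothetical.

```python
# AUCs quoted in the text for the extended GWAS-based model and for the
# incomplete 18-SNP HIrisPlex model, evaluated in the same data.
extended_model = {"red": 0.86, "black": 0.86, "blond": 0.74, "brown": 0.68}
hirisplex_18 = {"red": 0.85, "black": 0.78, "blond": 0.67, "brown": 0.62}

# Per-category AUC gain of the extended model, rounded to two decimals.
auc_gain = {
    category: round(extended_model[category] - hirisplex_18[category], 2)
    for category in extended_model
}

# Assumption: the extended model is counted as 252 SNPs versus 18 SNPs.
extra_snps = 252 - 18
fold_increase = 252 / 18

for category, gain in auc_gain.items():
    print(f"{category}: +{gain:.2f} AUC")
print(f"{extra_snps} additional SNPs ({fold_increase:.0f}-fold)")
```

Setting the gain per category against the number of additional markers makes the trade-off between prediction accuracy and required multiplex capacity explicit.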
Thus, a prediction accuracy increase was achieved for all four hair color categories, which, however, was relatively small for the price of using 234 (14-fold) more SNPs in the model, with AUC increases of 0.08 for black, 0.07 for blond, 0.06 for brown, and 0.01 for red hair [22]. A dedicated multiplex genotyping tool for the SNP predictors and a prediction tool were not made available by the authors, who used GWAS data from SNP microarrays. Provided enough multiplexing capacity of the genotyping method, the hair color SNP predictors identified by Hysi et al. should be considered in the development of future FDP tools.

In 2021, the VisiGen Consortium published a GWAS on eye color in 195,000 Europeans and identified 61 significantly associated genetic loci, of which 50 were previously unknown for eye color. Although the eye color predictive value of the identified SNPs was not tested in this study, the authors quantified the amount of eye color variation explained by these SNPs, which was 53 % (95 % confidence interval 45-61 %) of the total eye color variation in their study population. Also published in 2021, Kukla-Bartoszek et al. performed a whole-exome sequencing study in a (for association studies very small) number of 150 Polish subjects and reported 27 new candidate SNPs for eye color. Testing 137 newly and previously discovered eye color SNPs in 849 Polish samples enabled the authors to develop new predictive models for eye color following a two-step modelling approach. In the first step, AIC, BIC and LASSO methods were applied for marker selection, and in the second step, regression models were built based on the selected markers.
The regression model based on LASSO-selected markers consisted of 10 SNPs and that based on BIC-selected markers only had 4 SNPs; both [...]

Table 1
MPS-based tools for predicting appearance traits from crime scene DNA.
Columns: Tool name; Appearance traits; Appearance markers; Composition, MPS technology; Forensic validation; Prediction models with reference data and prediction accuracy; Prediction tool; References.
Notes: MPS tools are listed in time-wise order of publication. Forensic validation is meant in the sense of published forensic validation studies of the MPS tool, including testing of sensitivity, specificity, [...]
[Table body not recoverable from the extracted text.]

[...] 8500 European samples and identified six significantly associated genetic loci, one of which had not been previously linked with eyebrow color or other human pigmentation traits. In addition to the GWAS, the authors used all identified significantly associated SNPs for prediction modelling of eyebrow color by building the prediction model in >8000 Europeans and validating it in 70 subjects.
Their best model based on 25 SNPs achieved AUCs of 0.7, 0.67 and 0.62 for blond, black and brown eyebrow color, respectively (red eyebrow color was excluded due to the extremely small sample size in the study). Dedicated multiplex genotyping and prediction tools for eyebrow color were not made available by the authors, who used GWAS data from SNP microarrays. However, the eyebrow color predictive SNPs reported by Peng et al. are included in the forensically validated VISAGE-ET-AA MPS tool (Table 1), and the eyebrow color prediction model is implemented in the VISAGE Software (Table 1). Given the medium-level accuracy achieved by the current model, future efforts should focus on identifying additional independently predictive SNPs for eyebrow color and consider them in developing future FDP tools.

2.4. Freckle DNA prediction

[...] model for freckles, for which the same dataset of 458 Spanish subjects was used for prediction model building and model validation. Their final prediction model was based on three SNPs from three genes plus the compound R/r markers from MC1R. The model achieved a [...] 190 samples gave an AUC of 0.81. Dedicated multiplex genotyping and prediction tools for freckles were not made available by the authors. In 2019, Kukla-Bartoszek et al. [37], with support by the VISAGE Project, published the second prediction model for freckles. In the study, the authors first screened 115 DNA variants from 46 genes previously associated with pigmentation traits in 960 Polish samples to identify genetic freckle predictors. They used the same dataset for prediction model building and model validation. Their 2-category model for presence vs. absence of freckles based on 12 variables achieved a cross-validated AUC of 0.75. The 3-category model based on 14 variables revealed cross-validated AUCs of 0.79 for heavy freckling, 0.66 for medium freckling, and 0.79 for absence of freckles.
As variables in their prediction models, the authors considered SNPs, compound MC1R R/r markers, sex, SNP-SNP interactions, and sex-SNP interactions. This model achieved a prediction accuracy increase of 0.085 AUC compared to the previous model from Hernando et al. Dedicated multiplex genotyping and prediction tools for freckles were not made available by the authors. However, the freckle predictive SNPs reported by Kukla-Bartoszek et al. for their 2-category model are included in the forensically validated VISAGE-ET-AA MPS tool (Table 1), and the freckles prediction model is implemented in the VISAGE Software (Table 1). Based on the medium-level AUC achieved by the current models, future efforts should focus on identifying additional independently predictive SNPs for freckles and consider them in future FDP tools.

2.5. Grey hair DNA prediction

In 2016, a GWAS on hair phenotypes including head hair greying was published [8] based on over 6000 Latin Americans, which discovered one significantly associated genetic locus for grey hair harboring IRF4, a known pigmentation gene. In 2020, Pospiech et al. [9] published an association study based on whole-exome sequencing data in a (for association testing very small) number of 180 Polish samples, followed by targeted MPS of 378 newly identified exonic and literature-based SNPs in over 849 Polish subjects. The authors used the same dataset for prediction model building and model validation. Their 2-category prediction model for presence vs. absence of hair greying based on ten SNPs, age and sex achieved a cross-validated AUC of 0.87, while their 3-category model based on twelve SNPs, age, and sex achieved cross-validated AUCs of 0.86 for no, 0.79 for mild, and 0.88 for severe hair greying. However, the authors reported that their SNP predictors explained only 10 % of the hair greying variation in their study population, while age explained 48 % and sex > 5 %. Thus, age alone was responsible for most of the prediction accuracy these models achieved. Dedicated multiplex genotyping and prediction tools for grey hair were not made available by the authors. Future efforts will need to concentrate on finding more independently predicting SNPs for hair greying to develop more accurate prediction models and tools. Moreover, because of its strong age dependency, hair greying should additionally be investigated via epigenome-wide association studies (EWAS) to find DNA methylation sites associated with grey hair that may serve as epigenetic predictors for grey hair, and to consider them in future epigenetic FDP tools.

2.6. Hair shape DNA prediction

In 2015, Pospiech et al. [40], as part of the EUROFORGEN-NoE Consortium, reported the first prediction model for hair shape based on three SNPs in 528 Polish samples, using the same dataset for prediction model building and model validation. Based on the different methods used, their models achieved cross-validated AUCs between 0.589 and 0.688 for straight vs. non-straight hair. Dedicated multiplex genotyping and prediction tools for hair shape were not made available by the authors. In 2018, Liu et al. published a GWAS on hair shape in almost 29,000 individuals from different continental populations, which identified 12 significantly associated genetic loci, of which 8 were not previously involved in hair shape.
The authors reported a prediction model for hair shape based on 14 SNPs and sex, which was built from over 6000 Europeans that were part of the discovery GWAS and achieved a cross-validated AUC of 0.66, while the external validation in almost 1000 independent Europeans gave 0.64. Dedicated multiplex genotyping and prediction tools for hair shape were not made available by the authors, who used GWAS data from SNP microarrays. As part of the EUROFORGEN-NoE Consortium, Pospiech et al. [2] published in 2018 what is currently the most comprehensive prediction model for hair shape. For model building, the authors used data from over 9600 European and non-European subjects previously used by Liu et al. [41] and tested 90 candidate SNPs. Model validation employed an independent dataset of nearly 2500 European and non-European samples. The best 2-category prediction model for straight vs. non-straight hair based on 32 SNPs, sex and age achieved an AUC of 0.7 for Europeans and non-Europeans combined. A considerably higher AUC of 0.80 was obtained for non-Europeans (N = 277) compared to 0.68 for Europeans (N = 2138). For combined Europeans and non-Europeans, the best 3-category model based on 33 SNPs, mostly overlapping with those in the 2-category model, yielded AUCs of 0.68, 0.46 and 0.62 for straight, wavy and curly hair shape, respectively, without sex and age. For non-Europeans and Europeans separately, the AUCs were 0.8, 0.61, 0.74 and 0.67, 0.46, 0.6, respectively. The increased prediction accuracy in non-Europeans is explained by a strong SNP predictor in the EDAR gene, for which the predictive allele does not exist in Europeans. Dedicated multiplex genotyping and prediction tools for hair shape were not made available by the authors. However, the SNP predictors of the 2-category model of hair shape prediction by Pospiech et al. are included in the forensically validated VISAGE-ET-AA MPS tool (Table 1), and the hair shape prediction model is implemented in the VISAGE Software (Table 1).
The medium-level AUC currently achieved for hair shape prediction should prompt future efforts to identify additional independently predictive SNPs and consider them in the development of future FDP tools.

2.7. Male hair loss DNA prediction

In 2015/16, the first two genetic prediction models for hair loss in men, or male pattern baldness (MPB), were independently published. Marcinska et al. [43], as part of the EUROFORGEN-NoE Consortium, tested 50 MPB-associated SNPs for their predictive value in > 600 Europeans from different European populations. The authors reported a 5-SNP model and an extended 20-SNP model, both built in 30 samples and validated in independent 300 samples. Their 20-SNP 2-category model achieved an AUC of 0.66 for no baldness vs. significant baldness without considering age, while when selecting males over 50 years of age, the AUC increased to 0.76. Dedicated multiplex genotyping and prediction tools for MPB were not made available by the authors. In a study carried out in parallel, Liu et al. [44] tested 25 SNPs previously associated with MPB for their predictive value in 2455 samples from three European populations. Prediction modelling was done separately in three cohorts and yielded overlapping sets of 6-14 SNP predictors and age as predictor. The same data were used for model building and validation. In an early-onset enriched MPB dataset (N = 727), the best 2-category model based on 14 SNPs achieved a cross-validated AUC of 0.741 for baldness vs. no baldness. In a population-based dataset (N = 1161), the best 2-category model based on 11 SNPs plus age achieved a cross-validated AUC of 0.711, while in an independent smaller population-based dataset (N = 567) the AUC was lower with 0.685 based on a smaller number of 6 SNPs plus age. Dedicated multiplex genotyping and prediction tools for MPB were not made available by the authors, who used GWAS data from SNP microarrays.
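The AUC values quoted throughout this chapter can be read as the probability that a randomly chosen individual with the trait receives a higher predicted score than a randomly chosen individual without it. The following is a minimal Python sketch of that rank-based definition. The function name and the example scores are hypothetical and are not taken from any of the cited studies, which would typically rely on dedicated statistical packages and cross-validation.

```python
def auc(scores_with_trait, scores_without_trait):
    """Rank-based AUC: share of (with, without) pairs in which the
    individual with the trait gets the higher score; ties count as 0.5."""
    wins = 0.0
    for s_with in scores_with_trait:
        for s_without in scores_without_trait:
            if s_with > s_without:
                wins += 1.0
            elif s_with == s_without:
                wins += 0.5
    return wins / (len(scores_with_trait) * len(scores_without_trait))


# Hypothetical predicted probabilities of "baldness" from some model.
bald = [0.9, 0.7, 0.6, 0.4]
not_bald = [0.8, 0.5, 0.3, 0.2]

example_auc = auc(bald, not_bald)
print(f"AUC = {example_auc:.2f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is why values of 0.6 to 0.75 are described in the text as low to medium accuracy.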
In 2017, Hagenaars et al. [45] reported an MPB prediction model based on 331 SNPs they identified via GWAS in 40,000 samples from the UK Biobank (UKBB) study they also used for model building, which was validated in > 12,000 independent UKBB males. The authors reported AUCs of 0.78 for no MPB vs. severe MPB, 0.68 for no MPB vs. moderate MPB, and 0.61 for no MPB vs. slight MPB without considering age, and 0.79, 0.70, and 0.61, respectively, when considering age. Dedicated multiplex genotyping and prediction tools for MPB were not made available by the authors, who used GWAS data from SNP microarrays.

In 2022, Chen et al. [46], with support by the VISAGE Project, published a genetic prediction model for MPB that currently represents the most data-supported model available for the genetic prediction of MPB or any other appearance trait, because they used large and independent datasets for all different analytical steps. Based on the associated SNPs reported by Hagenaars et al. [45], the authors identified 117 SNP predictors in over 95,900 UKBB males that largely overlapped with those previously used by Hagenaars et al. for discovering the MPB association of these SNPs. Based on these 117 SNPs from 85 genetic loci, they built MPB prediction models with different methods in over 100,000 independent UKBB males and validated them in over 26,000 independent UKBB males. The reported AUCs were similar across methods, in the range of 0.725-0.728 for severe, 0.631-0.635 for moderate, 0.598-0.602 for slight, and 0.708-0.711 for no hair loss with age, and slightly lower without. Two-category prediction of any versus no hair loss gave AUCs of 0.690-0.711 with age and slightly lower without. Additional external validation in an early-onset enriched MPB dataset (N = 991) showed improved prediction accuracy without considering age, such as an AUC of 0.830 for no vs. any hair loss.
Dedicated multiplex genotyping and prediction tools for MPB were not made available by the authors, who used GWAS data from SNP microarrays. However, the MPB-predictive SNPs identified by Chen et al. are included in the forensically validated VISAGE-ET-AA MPS tool (Table 1), and the male hair loss prediction model is implemented in the VISAGE Software (Table 1). A recent GWAS on MPB [47] in over 200,000 UKBB males identified 624 significantly associated genetic loci, which in the future should be tested for their predictive value, provided the necessary independent datasets for prediction model building and validation become available.

2.8. Body height DNA prediction

In 2014, Liu et al. [48] used 180 previously height-associated SNPs in a cohort of > 10,300 Europeans enriched with 770 very tall individuals for building and validating a prediction model for tall stature, which achieved a cross-validated AUC of 0.75 for tall vs. non-tall stature. A dedicated multiplex genotyping tool for the SNP predictors and a prediction tool were not made available by the authors, who used GWAS data from SNP microarrays. In 2019, the same group [49], with support by the VISAGE Project, published an update on the genetic predictability of tall stature by using the same cohort for testing the predictive value of 697 height-associated SNPs that were identified in a previous GWAS on over 250,000 subjects published in 2014 [50]. Based on 689 available SNPs, the model achieved a cross-validated AUC of 0.79 for tall vs. non-tall prediction, representing an AUC increase of 0.04 for the price of 509 (3.8-fold) more SNPs. The authors also demonstrated that the most informative subset of 412 SNPs achieved an AUC of 0.76, i.e., a 0.01 AUC increase for 232 (2.3-fold) more SNPs. Of note, these two models for genetic prediction of tall stature have almost no value for predicting continuous height, as indicated by the obtained correlations between genetically predicted height and observed height of R²
= 0.12 for the 180-SNP model and 0.21 for the 689-SNP model [49]. Provided enough multiplexing capacity of the applied genotyping method, tall stature prediction should be included in future FDP tools.

Such studies indicate the enormous difficulty in predicting normal height from DNA, which is caused by the large genetic complexity of height together with the minimal effect size, in the millimeter and submillimeter range, that associated SNPs have on body height. This problem can only be overcome by including a very large number of SNPs in the prediction model, i.e., via genomic prediction. A rare example for genomic appearance prediction is the study by Lello et al. published in 2018 [51], which presented prediction of continuous body height using > 690,000 SNPs in > 460,000 UKBB samples. Based on 9000 samples not previously used for prediction model building and with their best 100,000 SNP predictors, the authors reported a correlation between predicted height and observed height of R² < 0.7, where a subset of 20,000 SNPs were described as optimal height predictors. Dedicated multiplex genotyping and prediction tools for height were not made available by the authors, who used GWAS data from SNP microarrays. In 2022, Yengo et al. [52] published what currently is the largest GWAS on body height, in 5.4 million individuals, which revealed over 12,000 significantly associated SNPs from over 7200 genetic loci accounting for 40 % of height variation in their European study population. Reliable genotyping of many thousands of SNPs from the low-quality and low-quantity DNA typically obtained from crime scene samples is expected to be challenging with currently available targeted MPS technologies and chemistries. Targeted MPS involving hybridization capture enrichment drastically increases the number of simultaneously analyzable SNPs and has been used for many years in the field of ancient DNA for hundreds of thousands of SNPs.
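The trade-off between marker number and accuracy for the tall stature models discussed above can be checked with simple arithmetic. The short Python sketch below recomputes the extra SNPs, fold increase and AUC gain relative to the 180-SNP model from the figures quoted in the text; the variable and function names are illustrative only.

```python
# Figures quoted in the text for the tall stature prediction models.
BASE_SNPS, BASE_AUC = 180, 0.75
larger_models = {
    "689-SNP": (689, 0.79),
    "412-SNP subset": (412, 0.76),
}


def compare_to_base(n_snps, auc_value):
    """Return (extra SNPs, fold increase in SNPs, AUC gain) vs. the 180-SNP model."""
    extra_snps = n_snps - BASE_SNPS
    fold_increase = round(n_snps / BASE_SNPS, 1)
    auc_gain = round(auc_value - BASE_AUC, 2)
    return extra_snps, fold_increase, auc_gain


results = {name: compare_to_base(*spec) for name, spec in larger_models.items()}
for name, (extra, fold, gain) in results.items():
    print(f"{name}: {extra} more SNPs ({fold}-fold), AUC gain {gain}")
```

The numbers make the diminishing return explicit: several hundred additional markers buy only a few hundredths of AUC, which is relevant when multiplex capacity is limited.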
Recently, capture-based targeted MPS has started to be used for thousands of SNPs for forensic purposes [3]. In 2021, Tillmar et al. [1] published the Forensic Capture Enrichment (FORCE) panel involving several thousands of SNPs for different forensic purposes including appearance (pigmentation) and ancestry prediction.

2.9. Recent genetic progress on appearance traits not yet applicable for DNA prediction

Since 2015, several GWASs were published that have improved and broadened our knowledge on the genetic basis of human appearance, also regarding additional traits not discussed above, such as ear morphology, facial hair traits, and facial shape. However, the number and effect sizes of the SNPs associated with these traits are too small, so that they explain too little of the phenotypic variance; hence, these traits are not applicable for FDP purposes yet. Therefore, a clear need exists for more GWASs based on larger sample sizes to identify more small-effect SNPs that serve as DNA predictors, which may eventually provide practically useful prediction accuracies and, if so, to include them in future FDP tools in case the multiplex capacity of the genotyping method allows.

In 2015, Adhikari et al. [5] published a GWAS of ear morphology in 5000 Latin Americans and reported seven significantly associated genetic loci for different ear phenotypes including earlobe attachment and others. In 2017, a large GWAS on earlobe attachment in nearly 75,000 individuals identified 49 significantly associated genetic loci for whether the earlobe is free hanging, partially, or fully attached [95].
A large GWAS on multiple ear morphology traits is currently underway by a group including one of the authors. In 2016, Adhikari et al. published a GWAS on scalp and facial hair features in the same Latin American cohort and found significant associations with scalp hair shape and balding, hair greying (see above), mono-eyebrow, beard thickness, hair color and eyebrow thickness at 18 genetic loci, of which 10 were novel findings [58]. In 2018, Wu et al. [6] reported two additional significantly associated genetic loci involved in eyebrow thickness identified via GWAS in < 3000 Chinese. A GWAS on eyebrow thickness in Europeans is currently underway by a group including one of the authors. In 2022, Pospiech et al. [57] published a candidate SNP association study and a prediction study on several hair-related phenotypes considering 240 SNPs in 999 Polish samples, demonstrating clear evidence of pleiotropy and epistasis in the genetics of hair traits. The reported prediction models achieved low to medium cross-validated AUCs for hairiness in females (0.69-0.76) and males (0.51-0.59), mono-eyebrow (0.62-0.70), eyebrow thickness in males (0.5-0.63), head hair thickness (0.6-0.63), and head hair density (0.56-0.69).

Recent progress has also been made in increasing the genetic knowledge underlying phenotypic variation of facial shape, building on the first two facial GWASs published before 2015 [58,59], summarized elsewhere [7]. In 2016, three GWASs on facial shape were published. Adhikari et al. [60] identified four significantly associated genetic loci, mostly for the nose, in >6200 Latin American subjects. Cole et al. [61] identified two genetic loci significantly associated with measures of facial size in >3500 African children that were replicated in <2400 African children. Shaffer et al. [62] reported seven genetic loci significantly associated with different facial shape measurements in >3000 European subjects. In 2018, Claes et al. [63] published a GWAS on facial shape in
>2300 European subjects that identified 38 significantly associated genetic loci, of which 15 were reported to replicate in an independent European sample of >1700; four of the reported loci were novel. In a follow-up study published in 2021 [64], this group applied their phenotyping approach to an enlarged sample set of >8200 European subjects and reported a large number of 203 significantly associated genetic loci, of which 3 were located in regions not previously known to be involved in facial development or diseases with facial manifestation. In a parallel study published in 2019, Xiong et al. [65], on behalf of the VisiGen Consortium, performed a GWAS on 78 facial shape phenotypes, obtained from 13 facial landmarks placed mostly automatically on 3-dimensional digital facial images using dedicated computer vision methods, in > 10,100 European subjects with replication in an additional >7500 Europeans and non-Europeans. The authors identified 24 significantly associated genetic loci, of which 17 were novel. In 2021, Bonfante et al. [66] published a face GWAS in more than 6000 Latin Americans and reported significant associations at 32 genetic loci, of which 9 were previously unidentified. In 2022, Zhang et al. [67] published a face GWAS in Chinese based on nearly 7000 samples for discovery and over 2700 for replication, which revealed 166 significantly associated genetic loci, of which 62 were not previously involved in facial variation. At the end of 2022, Xiong et al. [68] published a new method for combining GWASs of multiple traits (C-GWAS) and presented its first application to facial shape, which identified 56 significantly associated genetic loci, of which 17 were not involved in facial variation before.

In their 2019 paper, Xiong et al. [65] reported on the quantification of the facial phenotypic variance genetically explained by the SNPs they identified in their GWAS.
A multiple regression analysis conditioning on the effects of the lead SNPs from 24 genetic loci identified 31 SNPs with significant independent effects on sex- and age-adjusted facial distance phenotypes, each of which explained less than 1 % of the phenotypic facial variation and all together 4.62 %. In their 2022 paper, Xiong et al. [68] performed polygenic risk score analysis based on their C-GWAS findings using 57 significantly face-associated SNPs, which explained on average 2.28 % and up to 4.51 % of sex- and age-adjusted facial variance. These very small proportions of facial variance explained by face-associated SNPs illustrate the major problem of moving from face GWASs identifying facial SNPs to predicting faces with SNPs. Therefore, scientific publications that claim to predict human faces from genetic data [69-71], and companies that provide commercial service testing on predicting faces from crime scene DNA, must be viewed very critically [72,73].

2.10. DNA prediction of externally visible characteristics induced by exposure to external factors

A recent newcomer in the field of (extended) FDP is the epigenetic prediction of externally visible characteristics (or habits) that are determined by individual interaction with external factors, such as tobacco smoke, which is summarized elsewhere [74]. Its combination with genetic prediction of appearance and ancestry (see chapter 2) and the epigenetic prediction of age (see chapter 3) is expected to characterize the externals of an unknown person from DNA in a more complete way. Several recent epigenome-wide association studies (EWASs) revealed DNA methylation sites showing significant association with (non-)consumption of substances such as tobacco, alcohol, coffee, tea, cocaine, heroin, etc. The first epigenetic prediction models for smoking and drinking were already reported, of which we mention here the two most recent ones. In 2019, Maas et al.
[75] published an epigenetic prediction model for smoking habits based on 13 DNA methylation sites that was built and internally validated on 3764 samples and delivered cross-validated AUCs of 0.925 for current smokers, 0.83 for never smokers, and 0.706 for former smokers, while external validation in 1608 independent samples gave 0.914, 0.781, and 0.699, respectively. In 2021, Maas et al. [76] reported epigenetic prediction models for alcohol consumption based on different sets of DNA methylation markers that were built in 2883 samples and, for heavy and at-risk drinkers vs. light and non-drinkers, delivered mean (across marker sets) cross-validated AUCs of 0.67-0.68, and 0.6-0.7 in the external validation in 1794 independent samples. Continued progress in understanding the impact of environmental exposure on the human epigenome will likely lead to increasingly accurate epigenetic prediction models for the named and other environmentally determined externally visible habits.

3. Recent progress in inferring biogeographic ancestry from crime scene DNA

In this chapter, we review recent advances in biogeographic ancestry (BGA) inference from forensic DNA published since the 2015 review of this topic [8]. We will describe the key elements of recently reported forensic BGA tools, which comprise marker selection, genotyping multiplex design, and statistical analysis of the resulting data. Ancestry informative DNA marker (AIM) selection must compile a suitable panel for a defined set of population differentiations. The consequent statistical regime applied to the genotype data must be able to predict BGA using reference population datasets, and ideally should also have the capacity to detect co-ancestry in individuals with admixed backgrounds.
Because the resolution of BGA inferences from DNA largely depends on the number of AIMs used, and targeted MPS has the largest multiplex capacity of all DNA technologies currently available for forensic DNA analysis, we concentrate here on forensic BGA tools based exclusively on targeted MPS. Recent efforts in developing such tools have concentrated on autosomal SNPs as the AIMs of choice, but there is increasing interest in the ancestry informativeness of autosomal microhaplotypes (MHs), i.e., combinations of closely sited SNPs in short sequences, which are readily detected with the single-strand sequencing of MPS. Insertion/deletion polymorphisms (Indels) for forensic BGA tools were reviewed previously [8] and have not developed further during the last ten years. Recent progress has also been made in targeted MPS tools for hundreds of Y-chromosome SNPs, many of which are ancestry-informative for a male's paternal lineage [77], and for complete mitochondrial genome analysis that provides the maximum level of maternal BGA information [78]. Since we focus here on biparental BGA inferred with autosomal AIMs, we will only mention such markers if they are part of MPS tools that focus on autosomal AIMs. Autosomal STRs used in forensic DNA profiling for individual genetic identification can give viable population differentiations, especially when the entire sequence information is available, as with targeted MPS (Fig. 5 of [79]). However, this form of variation has less power than autosomal AIM-SNPs, so STR tests have not been adapted specifically for BGA and therefore are not discussed further (but see section 6.2 of [8]).

3.1. AIM SNP panels in forensic MPS tools for BGA

Several MPS tools have been developed since 2015 that are either dedicated to forensic BGA analysis or include DNA markers for BGA combined with those for other forensic applications.
Each of the MPS tools discussed below is summarized in Table 2.

3.1.1. AIMs in commercial MPS-based forensic BGA tools

The compact set of 56 AIMs developed by the Kidd lab at Yale University was described in [1] and has been the subject of several studies since [80,81], but this panel is only a portion of two larger-scale SNP sets for MPS analysis: i) MPSplex (developed jointly by QIAGEN and ICMP, but not yet commercially available), combining more than 1400 SNPs and MHs for individual identification but adding the 56 AIMs to provide BGA inference for missing persons identification [82]; and ii) Verogen's ForenSeq DNA Signature Prep Kit (FDSPK), comprising SNPs and STRs for individual identification in 'primer pool A', with multiplex extension possible using 'primer pool B', which adds eye and hair color predictive SNPs and AIM SNPs [83,84]. As well as being smaller in scale than other MPS-based AIM panels, the 56 Kiddlab AIM SNPs in FDSPK do not benefit from a comprehensive statistical system, as the universal analysis software of FDSPK (UAS) just runs a simplified principal component analysis (PCA) based on limited 1000 Genomes Phase I reference populations (now superseded by more comprehensive 1000 Genomes Phase III datasets, but not used by UAS), and distance-to-nearest-centroid calculations (see Section 5 and Fig. 6 of [8]). Therefore, users make BGA assessments based on scrutinizing the position of the forensic sample in relation to reference clusters in the PCA plot, but without accompanying statistical output. To compensate for the lack of likelihood ratio (LR) analysis in UAS, more complete reference population data from 1000 Genomes Phase III and CEPH human genome diversity panel (HGDP-CEPH) samples are available in Snipper for the same AIM SNPs (http://mathgene.usc.es/snipper/forensic_mps_aims.html).
Snipper analysis provides similar 2D PCA but with more extensive reference data than used by UAS, plus coupled LR calculations, both of which can analyze customizable combinations of reference populations. Adapting reference data in this way can fine-tune the populations compared to evaluate the forensic sample with more limited ranges of ancestries (e.g., comparing African, European, Middle East and South Asian data alone). Snipper also handles missing genotypes effectively by highlighting the relative informativeness of those AIMs dropping out, by listing the SNPs in order of informativeness and marking absent genotypes in red. FROG-kb, the Kidd lab's own population database for SNPs useful in the forensic setting including for BGA, also offers comprehensive sets of population data and the means to perform likelihood-based ancestry calculations on multiple SNP profiles [8]. Both MPSplex and FDSPK MPS assays constitute forensic ancestry tools but use a relatively small portion of the total genotypes that the test generates, and for detailed statistical analysis the user must bring in complementary analysis systems in Snipper or FROG-kb to infer BGA.

Table 2
MPS-based tools for inferring biogeographic ancestry from crime scene DNA.
[Table body not recoverable from the extracted text. Legible column headings: Name; Ancestry; Composition; MPS; Forensic validation; Prediction approach and prediction tool; Original publication; Validation publication.]
MPS tools are listed in the order of publication. * Commercial tool, which lacks an original scientific article describing it. AmpliSeq primer pool commercially available from Thermo Fisher Scientific; QIAseq primer pool commercially available from QIAGEN. MHs: autosomal microhaplotypes consisting of several SNPs in close physical proximity. 5 groups: distinguishing Sub-Saharan African, European, East Asian, Oceanian and Native American population groups. Some markers were not selected for ancestry but for individual identification purposes; because of their number they also provide BGA information.

Of the 56 AIM SNPs described above, 55 were combined with 128 from Seldin's own forensic panel (the Kiddlab SNP rs1919550 is redundant as it is completely linked to rs12498138 selected independently by Seldin), creating the Thermo Fisher Scientific (TFS) Precision ID Ancestry Panel (PIAP), which represents the first commercial MPS test dedicated wholly to BGA inference [86].
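To illustrate the kind of likelihood-based comparison that tools such as Snipper and FROG-kb offer, the following minimal Python sketch computes a likelihood ratio for a SNP profile under two reference populations. It assumes Hardy-Weinberg genotype proportions and independence between SNPs, and it skips missing genotypes. All function names, SNP labels and allele frequencies are hypothetical; this is not the implementation used by either tool.

```python
import math


def genotype_prob(allele_count, allele_freq):
    """Hardy-Weinberg probability of carrying 0, 1 or 2 copies of an allele
    whose population frequency is allele_freq (assumed strictly between 0 and 1)."""
    p, q = allele_freq, 1.0 - allele_freq
    return (q * q, 2.0 * p * q, p * p)[allele_count]


def log10_likelihood(profile, freqs):
    """Sum of log10 genotype probabilities over typed SNPs, assuming independence."""
    total = 0.0
    for snp, allele_count in profile.items():
        if allele_count is None:  # missing genotype (drop-out): skip this SNP
            continue
        total += math.log10(genotype_prob(allele_count, freqs[snp]))
    return total


def likelihood_ratio(profile, freqs_a, freqs_b):
    """How many times more likely the profile is in population A than in population B."""
    return 10 ** (log10_likelihood(profile, freqs_a) - log10_likelihood(profile, freqs_b))


# Hypothetical allele frequencies in two reference populations.
pop_a = {"snp1": 0.9, "snp2": 0.2, "snp3": 0.5}
pop_b = {"snp1": 0.1, "snp2": 0.8, "snp3": 0.5}
# Hypothetical profile: allele copies per SNP, None marks a missing genotype.
profile = {"snp1": 2, "snp2": 0, "snp3": None}

lr = likelihood_ratio(profile, pop_a, pop_b)
print(f"LR (A vs. B) = {lr:.0f}")
```

Working in log space avoids numerical underflow when hundreds of AIMs are multiplied together, and it makes the contribution of each SNP to the overall ratio easy to inspect.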
With 18 SNPs common to both panels, 165 autosomal AIM SNPs are genotyped in a single MPS assay. TFS provides an ancestry analysis plug-in for use with PIAP, which forms part of the Converge software suite and performs LR tests using population data to generate a ranked list of likelihoods for individual populations and/or continental regions. Likelihood comparisons also enable an estimate of co-ancestry proportions, which are used to evaluate individual ratios of admixture components using simulation-based modelling (described below). Limitations of this commercial BGA tool include: i) a requirement to link to the allele frequencies held in FROG-kb via the ALFRED database to make the likelihood calculations; although these are quite extensive in geographic scope, this limits a user's flexibility to apply specific population comparisons; and ii) an emphasis on sub-population comparisons, which do not always accurately reflect a forensic DNA donor's ancestry (e.g., likelihood ratios less than 100 that suggest 'more likely Japanese than Chinese Han') and which give the impression that the test can reliably make such distinctions. The reporting forensic scientist can moderate this statistical output to reduce the risk of overstated accuracy but may also be tempted to provide the investigators with ancestry information that has too high a classification error rate when the compared sub-populations are closely related and/or lack geographic separation.

3.1.2. AIMs in non-commercial MPS-based forensic BGA tools

Dedicated MPS-based forensic AIM panels which are not currently commercially developed are, in order of publication: the 126-SNP EUROFORGEN Global AIMs (gAIMs) [87], a 154-SNP combined BGA-pigmentation phenotype predictive panel [27], the 164-marker MAPlex panel [88], the 111-SNP EUROFORGEN NAME panel [89], the PhenoTrivium combined ancestry-appearance panel [90], the 153-SNP VISAGE-BT-AA tool [28-30,91], and the 524-SNP VISAGE-ET-AA tool (counting all SNPs for appearance and ancestry) [14,92].
Available statistical data analysis tools such as Snipper and GenoGeographer (see below) can be applied to data generated by all these MPS tools. Snipper provides LR and PCA analysis frameworks, extendable to STRUCTURE-based genetic cluster analysis using the same population data input. Applying LR, PCA and STRUCTURE analyses simultaneously and to the same chosen reference population datasets is advocated, as this enables complementary approaches to ancestry inference and analysis of admixture in individuals with co-ancestry. The VISAGE-ET-AA panel has a more closely integrated statistical analysis package designed to combine the age and appearance predictions with those for BGA in a single workflow, the VISAGE Software.

The combined BGA-pigmentation phenotyping MPS test from Bulbul and Filoglu [27] combined two ancestry panels (55 of the Kiddlab AIMs with 65 additional 'SWA' AIMs previously developed to differentiate populations from Southwest Asia, here considered equivalent to the 'Middle East' region) with the HIrisPlex-S SNPs for eye, hair and skin color prediction [15]. The authors tested the panel for MPS performance, gauging sequence coverage, genotype concordance and ability to analyze mixtures. The panel provided improved ancestry predictive performance when differentiating Southwest Asian (Middle East), European and South Asian populations compared with using the 55 AIMs alone.

The AIM selection for MAPlex focused on variation in Asia-Pacific populations, while preserving the differentiation power across all main global population groups obtained with the markers of gAIMs. It also extended gAIMs' geographic scope by adding South Asia as a targeted sub-continental population group. The original remit of MAPlex was to enhance the sub-continental differentiation within East Asia in addition to the neighboring Oceania and South Asia regions.
Although MAPlex did this successfully [88], the ability to differentiate well-defined populations in all these regions has been the main goal of studies assembling smaller dedicated panels [93-95], exemplified by the Japaneseplex panel [96]. Although Japaneseplex is small in scale and lacks a specific genotyping assay design, this study demonstrated the potential for compiling AIMs dedicated to differentiating one population from others in the same continental group. Such custom assays can potentially be applied in a "nested" approach, i.e., where an initial continent-wide inference is refined by follow-up tests targeting a specific region or population.

The EUROFORGEN NAME panel is a stand-alone MPS tool developed to enhance the gAIMs panel for improved differentiation of European individuals from those of North African, Middle East, and South Asian populations [89]. Therefore, it extends the existing gAIMs ancestry test, which did not originally aim to differentiate these neighboring populations from Europe. Although the success NAME obtained in differentiating North African/Middle East populations from other closely related populations was limited, when gAIMs and NAME markers are combined, Middle East populations, but not North Africans, were distinguishable from Europeans and South Asians (see Fig. 4 of [89]). Studies to select the best AIMs for specific population differentiations were, until recently, hindered by being restricted to a pool of 650,000 candidate SNPs from the Stanford analyses using 650 K microarray SNP genotyping of the HGDP-CEPH diversity panel (detailed in [8]). As discussed below, 929 unrelated HGDP-CEPH panel samples have now undergone whole genome sequencing, which identifies several million SNPs and Indels per individual [97], making it more straightforward in the future to identify and compile AIMs informative for HGDP-CEPH populations additional to those of 1000 Genomes (namely, Native American, Oceanian and Middle East population groups).
One other benefit of the NAME panel initiative was to find the most informative and best performing loci, in MPS sequencing terms. Consequently, 29 SNPs from the NAME panel were adopted for the VISAGE-ET-AA tool [92], representing almost 28 % of the total autosomal AIMs assembled in that assay.

The PhenoTrivium panel, developed by Diepenbroek et al. in 2020 [90], combines 163 of the 165 PIAP ancestry SNPs and the 41 HIrisPlex-S eye, hair and skin color SNPs into a single AmpliSeq-based panel but, importantly, also includes 120 lineage-specific Y-SNPs. The paper reporting the development of PhenoTrivium also evaluated forensic performance by applying sequence thresholds of a minimum of 100 reads (50 for Y-SNPs), and 95 %/5 % homozygote or 65 %/35 % heterozygote allele read frequency balance. The forensic sensitivity of PhenoTrivium was assessed with dilution series DNA and postmortem blood and bone samples. Fully concordant SNP profiles were obtained down to 125 pg input DNA, with just SNP rs1470608 failing to reach sequence coverage thresholds. The downstream ancestry analyses used the Converge software (likelihood calculations and bootstrapping runs with the admixture analysis algorithm) on samples with known admixture patterns (self-declared parental and grandparental combinations), representing a comprehensive evaluation of the analysis of co-ancestry with this software. Converge-based bootstrapping co-ancestry analyses were also compared to those of Snipper and FROG-kb, indicating that Converge handles admixture component prediction well.

The AIM panel of the VISAGE-BT-AA tool [91] was designed to match the MPS multiplex scales of the gAIMs, MPSplex and PIAP panels to ensure good sequencing performance with forensic DNA but, more importantly, to begin the process of combining FDP and ancestry SNPs in one MPS genotyping assay.
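The read-depth and allele-balance thresholds quoted above for PhenoTrivium can be pictured as a simple genotype-calling rule. The Python sketch below is one possible reading of those thresholds, not the published pipeline: it assumes that a SNP needs at least 100 reads (50 for Y-SNPs), is called homozygous when the major allele reaches 95 % of reads, heterozygous when the major allele does not exceed 65 %, and is left uncalled in between. The function name and return labels are hypothetical.

```python
def call_genotype(reads_allele_1, reads_allele_2,
                  min_reads=100, hom_min=0.95, het_max=0.65):
    """Classify a biallelic SNP from its read counts.

    Assumed interpretation of the quoted thresholds:
    - fewer than min_reads total reads: no call
    - major allele fraction >= hom_min (95 %/5 %): homozygote
    - major allele fraction <= het_max (65 %/35 %): heterozygote
    - anything in between: no call because of allele imbalance
    """
    total = reads_allele_1 + reads_allele_2
    if total < min_reads:
        return "no call (low coverage)"
    major_fraction = max(reads_allele_1, reads_allele_2) / total
    if major_fraction >= hom_min:
        return "homozygote"
    if major_fraction <= het_max:
        return "heterozygote"
    return "no call (imbalanced)"


# Hypothetical read counts for illustration.
print(call_genotype(98, 2))                 # well-covered, one allele dominates
print(call_genotype(55, 45))                # balanced alleles
print(call_genotype(80, 20))                # neither clearly homozygous nor balanced
print(call_genotype(36, 24, min_reads=50))  # lower read threshold, as quoted for Y-SNPs
```

Leaving an explicit "no call" zone between the two balance thresholds trades a few missing genotypes for fewer wrong ones, which matters when downstream likelihood calculations multiply evidence across many SNPs.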
The combination of 115 AIMs with the 41 pigmentation-predictive SNPs of HIrisPlex-S (with SNPs rs16891982, rs1426654 and rs12913832 shared for BGA and pigmentation prediction purposes) means the VISAGE-BT-AA tool has the smallest number of AIMs of any forensic MPS ancestry test. As it was developed to differentiate South Asians in addition to the five main continental population groups, only the most differentiating markers were retained. A small proportion of previously used AIMs were excluded to preserve space for optimum South Asian-informative SNPs taken from the previous Eurasiaplex [98] and NAME [89] panel development work. Therefore, the final AIM panel of the VISAGE-BT-AA tool consisted of 57 gAIMs, 7 Kiddlab 56 SNPs, 12 NAME SNPs, 5 SNPs from PIAP not in the 56 Kiddlab panel, and 19 South Asian-informative SNPs from Eurasiaplex [98]. As well as 3 tri-allelic SNPs from gAIMs, a further 12 tri-allelic SNPs were added to the VISAGE-BT-AA AIM panel to provide scope for mixed DNA analysis and to improve the differentiation of six of the seven target population groups [91]. The careful selection process for all the above AIMs is underlined by the ancestry prediction success of VISAGE-BT-AA, where cross-validation of the reference population samples gives 100 % classification success for all groups apart from Middle East, which has 80.6 % success, with the biggest proportion of error (17.2 %) coming from samples misclassified as European [91]. Selecting AIMs for the VISAGE-BT-AA that are not only the most ancestry-informative but also have proven performance in MPS has benefited the sensitivity and reliability of the VISAGE-BT-AA tool based on different MPS chemistries and instruments [28.30]. For instance, Xavier et al. [25] reported for the VISAGE-BT-AA tool 100 % genotype call rates with 0.1 ng input DNA, and in DNA with up to 240 min of sonication degradation, with good MPS performance across the whole range of SNPs in the test. Although the primer pool for the VISAGE-BT-AA is available as a community panel from TFS, the entire VISAGE-BT-AA MPS tool is not commercially available as of yet.

Given the MPS performance success of the VISAGE-BT-AA tool, the fact that multiplex space could be shared between AIMs and phenotype-predictive SNPs without reducing ancestry informativeness, and because a total of 153 SNPs is likely to be well below the multiplex limits of targeted MPS while preserving the necessary sensitivity and reliability, for the VISAGE-ET-AA tool it was decided to increase the multiplex size almost three-fold relative to VISAGE-BT-AA. Although the VISAGE-ET-AA tool [14] has a markedly increased number of appearance-predictive SNPs for ? traits (see chapter 1), it also has a more complex combination of AIM SNPs from autosomes, X-chromosome, Y-chromosome, and autosomal microhaplotypes in order to improve analysis of co-ancestry patterns in individuals with admixed backgrounds. Despite a reduced total number of autosomal AIM SNPs (115 in VISAGE-BT-AA to 104 in VISAGE-ET-AA), the augmented autosomal SNP panel of the VISAGE-ET-AA tool more than doubles the Middle East-informative markers [92] originally chosen for the VISAGE-BT-AA tool (from 12 to 28). Consequently, it more efficiently distinguishes Middle East populations from those of Europe, Africa and South Asia, with the benefit that this differentiation power extends to the majority of samples from North African and East African populations when such comparisons are made.
The VISAGE-ET-AA tool additionally includes 87 Y-SNPs and 16 X-SNPs to provide a distinct method for obtaining extra detail about co-ancestry patterns identified in males with admixed backgrounds. To test how effectively Y- and X-SNPs can do this, a supplementary analysis system was developed for the X-SNP data and tested on genotypes compiled from six admixed populations of 1000 Genomes, and urban and rural Brazilian samples sequenced with the VISAGE-ET-AA tool [92]. Y-SNP data from the male Brazilian test samples were analyzed using haplotype designations based on a core 859 Y-SNP dataset defining 640 haplogroups [77]. In all eight admixed populations, X-SNP data was used to evaluate the possible ancestry of each male sample's X chromosome, by co-analysing these samples in PCA with African, European and Native American male reference population data for the 16 X-SNPs only. Clearly separated PCA clusters were formed by the reference samples from each of the admixture contributor populations, and when the test points were positioned in these reference clusters, the X chromosome ancestry was inferred to be the same. The four 1000 Genomes admixed American populations gave varied patterns ranging from 10 % to 50 % unassigned (i.e., positioned between clusters), but the two 1000 Genomes African American populations gave more clearly delineated patterns, with 64 % of X chromosomes African in ASW, 87.5 % in ACB; 20 % European in ASW, 8 % in ACB; and only 16 % unassigned in ASW, 9 % in ACB. The two Brazilian samples from urban and rural regions had the benefit of Y-SNP pattern comparisons (1000 Genomes Y data was incomplete), which allowed analysis of sex-biased admixture patterns that contrasted between the two samples. Urban Brazilians had a high proportion of European X chromosomes (62 %) and few African (19 %), whereas rural Brazilians had the inverse pattern: 17 % European X chromosomes and 61 % African.
Applying an independent X-SNP and Y-SNP ancestry inference regime in parallel to that of autosomal SNP genotype analysis allows a degree of extra detail to be obtained for individuals in which co-ancestry has been detected. A detailed description of the AIM set of the VISAGE-ET-AA is currently underway [97].

The gAIMs, MAPlex, VISAGE-BT-AA tool and VISAGE-ET-AA tool AIM panels were all built on the principle of balancing cumulative population-specific Divergence values (termed I_n in [87]) across all the population differentiations the panels were designed to make (Fig. 4 in [87]; Fig. 4 in [85]). This principle was extensively explored in the 2015 review [8], and such AIM selection and balancing steps applied to the forensic AIM panels were based on methods developed for the LACE ancestry panel used in genomics studies [99]. The benefits from balancing an AIM set in this way include i) more efficient analysis of co-ancestry in admixed individuals; ii) equilibrated likelihood statistics in most populations, reducing the bias towards particular populations; and iii) more robustness to the statistical effects of missing genotypes. As more worldwide population groups are differentiated, the balancing process becomes more difficult to achieve and it will not be possible to ensure balanced I_n for comparisons of more closely related groups, e.g., Europeans vs South Asians vs Middle East populations. Nevertheless, balancing I_n pop values for the five major continental population groups of Africans, Europeans, East Asians, Americans and Oceanians represents a key step in the assembly of panels differentiating these groups. To evaluate the benefit of attempting to balance I_n as much as possible, direct comparisons of co-ancestry patterns in samples from the six admixed 1000 Genomes populations, using STRUCTURE analysis, were made for both VISAGE Appearance and Ancestry tools.
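To make the balancing idea concrete, the sketch below computes a per-marker divergence value and sums it across a panel. It assumes that the divergence measure corresponds to the informativeness-for-assignment statistic of Rosenberg and colleagues; that correspondence, the function names, and the toy frequencies are assumptions made here and are not taken from the panels discussed in this review.

```python
import math


def informativeness(freqs):
    """Informativeness for assignment of one marker (assumed definition).

    freqs[i][j] is the frequency of allele j in population i.
    Value is 0 when all populations share the same allele frequencies and
    ln(K) when K populations are each fixed for a different allele.
    """
    k = len(freqs)
    n_alleles = len(freqs[0])
    total = 0.0
    for j in range(n_alleles):
        mean_p = sum(pop[j] for pop in freqs) / k
        if mean_p > 0:
            total -= mean_p * math.log(mean_p)
        for pop in freqs:
            if pop[j] > 0:
                total += pop[j] * math.log(pop[j]) / k
    return total


def cumulative_informativeness(panel):
    """Sum the per-marker values over a panel (list of freqs tables)."""
    return sum(informativeness(marker) for marker in panel)


# Toy two-population panel: one fixed difference plus one uninformative SNP.
toy_panel = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.3, 0.7], [0.3, 0.7]],
]
panel_total = cumulative_informativeness(toy_panel)
```

Balancing, in this reading, would mean choosing markers so that the cumulative value is similar for each population comparison a panel is meant to resolve, rather than maximizing it for one comparison alone.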
Inferred co-ancestry proportions in these samples were compared between those estimated using the Affymetrix Human Origins array (comprising 572,000 SNPs) and the VISAGE-BT-AA tool, with sample-to-sample correlations giving r² values above 0.8 in all populations and co-ancestries, apart from African/American co-ancestry in Puerto Ricans (Fig. 4 of [25]). The same analyses were made for the VISAGE-ET-AA tool [92], but the study compared the co-ancestry patterns to those from the full genome-wide SNP datasets obtained directly from 1000 Genomes (comprising several million SNPs, parsed to remove <0.05 minor allele frequency variants) [100]. The VISAGE-ET-AA tool correlation analyses showed very similar values to those obtained with the VISAGE-BT-AA tool, an important finding given the 26 % reduction in the autosomal AIM SNPs in VISAGE-ET (from 103 in BT to 76 markers in ET, excluding those extra SNPs added to ET to differentiate Middle East populations). Additional independent evaluations of the ability of forensic MPS ancestry panels to differentiate populations in comparison to much larger SNP sets were made by Resutik et al. [101] in a study that analyzed a larger dataset of CEPH, SGDP and EGDP genotypes combined with those of 1000 Genomes, for the SNPs of the VISAGE-BT-AA tool, MAPlex, and PIAP, compared to a 100,000-SNP dataset from the Affymetrix Human Origins array. Computing the G similarity score and area under the precision-recall curve (AUC-PR) to assess similarities in STRUCTURE cluster memberships between the forensic AIMs panels and the 100,000-SNP set showed all three forensic panels had similar performance, with the VISAGE-BT-AA tool giving a marginally better match to the large-scale SNP set STRUCTURE patterns at inferred cluster values above six.

3.1.3. Autosomal microhaplotypes as emerging forensic AIMs

MAPlex was the first forensic assay to combine binary SNPs, multiple-allele SNPs (i.e., tri-allelic SNPs catalogued in [82]; tetra-allelic SNPs in [102]) and microhaplotypes (MHs) based on closely spaced autosomal SNPs [109]. A large proportion of autosomal MHs show sufficient population differentiation to be viable AIMs in their own right. The use of MHs in forensic BGA assays was reviewed elsewhere [10] and the population differentiation capabilities of a 65-MH panel were explored in [105] (a subset of 130 Kidd lab MH loci with high I_n values, but lacking MPS designs). The study by Chen et al. [105] indicated African-European vs East Asian differentiation can be achieved with only ten MHs. Therefore, given their widely recorded capacity to efficiently detect mixed-source DNA, MHs justify inclusion in MPS-based BGA panels and tools. MAPlex (8s) incorporated 22 MHs; 13 shortened versions of loci from 10 MHs the Kiddlab identified [10], while the VISAGE-ET-AA tool includes 21 ancestry-informative MHs [2]. Although not all autosomal MHs make suitable AIMs, the ability to detect mixed-source DNA more efficiently than binary SNPs justifies their inclusion in MPS panels for forensic use. The same rationale applies to multiple-allele SNPs, although a detailed study of relative I_n values (amongst four divergence metrics) comparing MH, binary and tri-allelic AIM SNPs [107] indicated that binary SNPs were more informative than tri-allelic SNPs marker-for-marker. When MPS panels of autosomal MHs are very large, collective ancestry informativeness can be very high, even if loci are primarily selected for individual identification (i.e., with low population divergence of haplotype frequencies in each MH marker), due to the summarized effect of many markers. For instance, an MPS panel designed for individual identification based on 113
MHs [108] had sufficient ancestry informativeness to allow differentiation of Middle East, North African, European, and South Asian populations with population subset STRUCTURE analyses in follow-up studies [109].

Initial explorations of the haplotype frequencies of the 113 autosomal MHs developed by de la Puente et al. [108] demonstrated the potential of these markers to infer the BGA of contributors in simple 2-way mixed-source DNA [109]. This potential to move beyond identification of simple-mixture DNA contributors from their haplotype patterns, to additionally inferring their BGA, was studied in a limited but promising pilot study of the 21 MH loci in the VISAGE-ET-AA tool [92]. Simple mixtures of two control DNAs with African and European ancestries were made at 1:1, 3:1 and 9:1 ratios, and the sequence patterns from just the 21 MHs were collected and analyzed to attempt de-convolution from contrasting sequences and read coverage levels [92]. The contrasts in haplotype frequencies in Africans and Europeans in the MHs meant that 11/21 loci showed three-haplotype combinations and 5/21 four-haplotypes, which in the 3:1 and 9:1 ratios were sufficiently distinct in sequence reads to be assignable to the major and minor contributors. When the identified haplotypes assigned to each contributor were run in STRUCTURE (which easily handles multi-allele markers), applying 1000 Genomes African, European and East Asian reference population data for the 21 MHs, BGA was successfully inferred in the 3:1 and 9:1 mixtures, where sequence ratios were sufficiently contrasted.
Although a limited pilot study and based on deconvolution intended to be restricted to the three main population groups of African, European, and East Asian individuals only, these findings indicate that MHs can be successfully used to add inferred BGA to the information obtained from sequence analysis of simple 2-way mixed DNA [2]. It is also a reasonably secure inference to expect the majority of the 21 MH loci used to have multiple-haplotype patterns. Using known haplotype frequencies, simulated haplotype patterns that could be expected from mixed DNA can establish the likelihood of being able to de-convolute mixtures of individuals from the three population groups, but this pilot focused on the ability to interpret the sequence output of the 21 MH markers in VISAGE-ET-AA, which simulations are unable to model.

3.2. Expansion of population reference data for forensic BGA inference

Population frequency data on the AIM SNPs used in forensic BGA tools are required for two purposes: i) as a marker selection dataset for the initial identification and validation of the AIM SNPs, by selecting SNPs with large allele frequency differences between different worldwide populations, and ii) as a reference dataset for obtaining the final BGA inference outcome for the DNA sample in question, based on the results of the BGA genotyping tool applied to the case sample. In 2016, Soundararajan et al. [110] argued for establishing a common set of forensic AIMs that could be agreed upon and developed within a collaborative framework.
The main motivation for suggesting this initiative was the poor geographic coverage of population diversity sample sets used as reference data for BGA inference, such as the HGDP-CEPH panel (the most widely used sample set, covered in [5]) and, using the study authors' description, a largely 'empty matrix' of non-overlapping AIMs selections amongst 21 forensic panels. Agreement on a universal set of AIMs for forensic BGA DNA testing would bring the benefits of broadened regional coverage by sampling many geographic gaps (e.g., Remote Oceanian, Native American, etc.) and more detailed population inferences. Although this initiative did not progress, it is beneficial to exclude redundant, i.e., non-overlapping AIMs, but also SNPs that duplicate divergence within single genomic regions. It is noteworthy that Kidd lab's 56-SNP panel has four SNP pairs closely sited in genes: EDAR, HCLS1-GOLGB1, ADH1B and ALDH2 (rs3827760-rs260690; rs1919550-rs12498138; rs1229984-rs3811801; rs2238151-rs671, respectively), despite LR calculations for this panel in FROG-kb and Snipper assuming allelic independence. Some regional variation remains uncharted and not properly represented. For instance, much of Remote Oceania is known to have quite different allelic diversity patterns to Near Oceania, due to differences in the human population history of both regions, but Remote Oceania lacks population data completely, while for Near Oceania it is very sparse. Native Americans from North, Middle and South America are similarly under-represented in the available population data, as well as populations from Middle East regions. Another major factor is that AIMs highly informative for a particular population comparison may not amplify efficiently in an MPS multiplex assay or show complex flanking region sequences hampering reliable alignment.
Therefore, the forensic community remains reliant on an accumulating bank of genome-wide variant data from a large number of worldwide populations. The advantage of public genomic data based on whole-genome sequencing data is that they establish catalogs of all SNPs detected, making selection of future AIMs open and flexible. For some time, SNPs considered as AIMs have centered on 1000 Genomes Phase III data (summarized in Fig. 2 of [8]), which were generated with low coverage (@-3X) whole-genome sequencing studies [100,111]. In recent years, the total number of compiled SNP catalogs (VCF files) from public WGS data has increased markedly. The Simons Foundation Genome Diversity Project (SGDP) [112] and the Estonian Genome Diversity Project (EGDP) [113] both address the under-sampling of certain regions but rely on a strategy of 2-3 samples per population. Although such data can provide a regional/global overview, they are not useful as population data because of insufficient SNP allele frequency estimation (both projects' data scope is reviewed in [114]). Hence, such sources are not useful as population reference data for (forensic) BGA DNA testing and are also of limited value for selecting AIM SNPs in the first place. In contrast, the most extensive human variant catalog assembled to date, gnomAD (genome aggregation database) [115], has large sample sizes of several thousand individuals, but lacks detailed population definitions, with loosely defined population identifiers such as Hispanic, non-Finnish European and 'Other', so is of limited use as a reference population database. Unlike 1000 Genomes, the parallel gnomAD project database holds summary allele frequencies for each population, so it is impractical to use this data for reference purposes with current forensic ancestry inference statistical analyses that rely on individual SNP genotype data, unless SNP allele frequencies can be placed directly into LR calculation (possible in Snipper at: http://mathgene.usc.es/snipper/frequencies_new.html).

Two recent initiatives at the 1000 Genomes Project have been particularly useful here. First, the completion of whole-genome sequencing of the HGDP-CEPH panel samples [97] adds 929 complete genomes to the 2504 Phase III genomes already sequenced. Snipper has compiled these HGDP-CEPH data for the gAIMs, FDSPK, PIAP, MAPlex and VISAGE-BT BGA tools. The HGDP-CEPH whole-genome variant catalogs expand reference population data for regions not previously covered for these tools, notably (Native) America, Middle East, and Oceania (although just two small population samples from Papua New Guinea). Snipper provides a flexible method to adjust which reference population data are compared by user-defined grouping selected to best match the observed patterns. For instance, an indication of European-Middle East-South Asian co-ancestry can focus on just these reference populations in PCA, which expands the 2D space, reducing the overlap between reference clouds. In the second 1000 Genomes project, the original 2504 Phase III samples have been re-sequenced at higher coverage levels (an average 30X) by the New York Genome Center [116]. The data generated from higher average sequence coverage markedly improves the quality of many of the original variant calls, which often changes the genotypes for a significant proportion of 1000 Genomes samples in certain SNPs. The importance of reliable variant calling is illustrated here by the example of SNP rs3857620, reported by Zhao et al. [117] to have a South Asian-indicative A-allele found at much lower frequencies in other population groups and therefore suggested to be highly informative for South Asian-European differentiations.
The 1000 Genomes Phase III data listed online (http://www.ensembl.org/Homo_sapiens/Info/Index) has an A-allele frequency of 0.461 in South Asians and 0.008 in Europeans at this SNP, but A-allele frequencies from the 30X high sequence coverage analysis are < 0.001 in all populations, including South Asians. So when more accurate SNP genotype calls are made from ten-fold increased levels of sequencing coverage, the rs3857620 SNP variation is uninformative for all population comparisons. In the revised population reference datasets in Snipper, all 1000 Genomes data now compiles the 30X high coverage variant calls released in 2020, with the added advantage that some AIMs now have data not previously available in Phase III listings, e.g., PIAP AIM rs10954737.

It is important to re-emphasize the point made in a previous discussion paper [2] and in the introduction section, that the ancestry inferences made of an unknown DNA donor from forensic BGA testing are entirely dependent on the reference population data available to the BGA analysis of the forensic sample. As mentioned before, this notion applies to all three components of FDP, but is especially important for BGA prediction, where population reference data are used directly in BGA prediction tools, while in appearance and age prediction reference data are applied to the prediction models implemented in the prediction tools. Therefore, in BGA prediction the reference data and their limitations impact the prediction outcome more directly than in appearance and age prediction. It is thus important to communicate the reference population dataset used for such inferences and its limitations in the final report sent to the investigating authorities.

3.3. Statistical approaches used by MPS-based forensic BGA tools

Most forensic BGA tools rely on likelihood-based methods to predict a donor's most probable geographic region of biological, i.e., genetic, ancestry [15,119].
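As a minimal sketch of the likelihood-based approach, the example below computes the likelihood of a binary-SNP profile in two reference populations and the resulting LR. It assumes Hardy-Weinberg proportions and independence between markers; the function names, the frequency clamp, and the toy allele frequencies are hypothetical and are not drawn from any of the tools discussed here.

```python
import math


def log_likelihood(profile, allele_freqs, eps=1e-6):
    """Natural-log likelihood of a SNP profile in one population.

    profile[i]      : copies of the reference allele at SNP i (0, 1 or 2)
    allele_freqs[i] : reference-allele frequency at SNP i in the population
    Assumes Hardy-Weinberg proportions and independent markers.
    """
    total = 0.0
    for copies, p in zip(profile, allele_freqs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0) for unobserved alleles
        q = 1.0 - p
        genotype_prob = {2: p * p, 1: 2.0 * p * q, 0: q * q}[copies]
        total += math.log(genotype_prob)
    return total


def likelihood_ratio(profile, freqs_pop1, freqs_pop2):
    """LR of population 1 versus population 2 for the given profile."""
    return math.exp(log_likelihood(profile, freqs_pop1)
                    - log_likelihood(profile, freqs_pop2))


# Toy reference-allele frequencies for three SNPs (illustrative values only).
pop_a = [0.9, 0.8, 0.5]
pop_b = [0.1, 0.2, 0.5]

lr_single_origin = likelihood_ratio([2, 2, 1], pop_a, pop_b)
lr_mixed_alleles = likelihood_ratio([2, 0, 1], pop_a, pop_b)
```

With these toy values the first profile gives an LR of about 1296 in favor of population A, whereas the second profile, which carries alleles typical of both populations, gives an LR of only about 5.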
In individuals with significant levels of co-ancestry from families or populations with admixed genetic backgrounds, the likelihood approach tends to break down, as alleles indicative of multiple ancestries are present. Varied proportions of indicative alleles markedly reduce the LR comparing the two most likely origins, as both divisor and numerator will have relatively high probabilities [119]. For this reason, model-based genetic clustering methods such as STRUCTURE [120,121] and ADMIXTURE [122] are often used to identify and examine the distribution of genetic clusters in an individual's DNA. Matching genetic clusters of unknown and reference samples provides a direct pointer to a person's co-ancestry from the ratio of multiple clusters when present at high proportions. Such patterns can indicate historical population-scale admixture, which is typically found in continental margin regions, e.g., North Africa, instead of recent admixture in the individual's direct family. Apart from the PCA-only ancestry analysis module in the FDSPK UAS software, Bayes likelihood comparison models have predominated in the downstream analysis of AIM SNP genotypes from PIAP, as well as gAIMs, MAPlex and VISAGE-BT-AA custom panels. Usefully, FROG-kb allows input of 55 of the 56 FDSPK AIMs to compute likelihood-based predictions of ancestry [60]. UAS and FROG-kb were formally compared by Sharma et al. [8], finding a 12 % error rate in UAS (32/260 donors with self-declared ancestry) vs 9.8 % in FROG-kb, with most UAS error in South Asians, a population without reference data in the UAS PCA module. For the VISAGE-ET-AA, a probability-based approach without LR implementation was applied and implemented in the VISAGE Software, which provides probability estimates for all geographic regions considered.

3.3.1. Snipper and PIAP HID SNP Genotyper likelihood analyses

As described above and in detail previously [1], likelihood analysis forms the core statistical approach used in Snipper [118]. The commonly used 'multiple profiles' Snipper option (http://mathgene.usc.es/snipper/analysismultipleprofiles.html) provides the most flexibility, where reference/unknown SNP profiles are distinguished by two end-column labels and profiles can be re-arranged into multiple input data worksheets in Excel. This portal allows adjustment for non-independence (option: 'Hardy Weinberg principle not applicable') and generates 2D PCA plots for principal components (PCs): PC1 vs PC2; PC1 vs PC3; and PC2 vs PC3, with black 'unknown' points linked to each prediction likelihood for casework SNP profiles. For every new forensic MPS ancestry panel published, Snipper compiles full 30X high coverage 1000 Genomes/CEPH population data for use with the multiple profiles portal. This data currently consists of a representative population or, to best match numbers, a set of populations for each continental group (Phase III: African Yoruba; CEPH European from Utah; East Asian Han Chinese; South Asian Gujarati from Houston, plus HGDP-CEPH Oceanian Papua New Guinean; Middle East Emirati, Saudi and Yemeni populations from the whole-genome sequencing studies of Almarri et al. [123]; HGDP-CEPH American, comprising five native populations). A separate dataset of SNP profiles lists the above Europeans, South and East Asians with HGDP-CEPH Middle East (four Israeli Arab populations) and HGDP-CEPH North Africans (Mozabite Algerian), to enable LR analysis and PCA of Eurasian population subsets. Combined 1000 Genomes and HGDP-CEPH samples total 2535, from 69 non-overlapping populations (Japanese, Tuscan and Nigerian Yoruba populations are common to both sets but have different samples).
A further 486 admixed population samples of the 1000 Genomes are listed in a separate worksheet, as well as 130 SGDP samples and 402 EGDP samples in the final worksheet, to enable 'test profiles' to be introduced in customized analyses.

The HID SNP Genotyper ancestry plugin of the Torrent Suite™ Software from Thermo Fisher Scientific (HSG-TSS) analyses PIAP genotypes and provides LR values based on 51 widely distributed populations from seven 'root populations' comprising: Africa; America; East Asia; Europe; South Asia; Southwest Asia (i.e., Middle East); and Oceania; used for admixture analysis. The population likelihood calculations create a ranked list of values using LRs of all population comparisons, so a typical example could be that an East Asian individual has highest likelihoods for Ami (Taiwanese Aboriginal), Japanese HapMap, Japanese, Korean, as the four highest LRs. A confidence value is given to predictions based on likelihoods obtained, allowing the user to exercise caution when relatively low likelihoods are returned. The extent to which users adopt one or more of the population-specific LRs to assign ancestry more precisely than root population inference is not known. Several studies have specifically discussed this interpretative choice [86,95,118,119,128] and, as a sensible rule of thumb, the best interpretation is to consider that an individual originates from any listed population showing an LR of 10 or less. The HSG-TSS admixture analysis algorithm uses a bootstrapping system to estimate co-ancestry proportions in individuals with admixture. Admixture proportions are estimated based on a maximum likelihood approach comparing the seven root populations, with bootstrapping replication runs analyzing a different subset of PIAP SNPs in each replication to capture uncertainty in the estimations.
Co-ancestry estimates use the average of the bootstrapping replications for each population, which are presented as a percentage of each contributing population with the corresponding likelihood.

The informative study by Jin et al. [124] focused on the admixture analyses made by the HSG-TSS algorithm, using $4 admixed donors with self-declared co-ancestries, plus 648 single-ancestry donors. Results indicated generalized root population inferences for single-ancestry donors had =99 % reliability (683 predictions matched self-declarations), but admixed donors had more inaccurate predictions when the co-ancestry patterns in these individuals were complex, or their co-ancestries were from closely related populations, e.g., Europe vs Southwest Asia. In a similar HSG-TSS study with fewer test samples, Al-Asfi et al. [125] found concordant co-ancestry predictions for 4/11 admixed donors, whereas 7/11 were given more co-ancestry components by HSG-TSS than was known or declared. This study looked in more detail at the individual population inferences based on ranked LRs and found 22/36 single-ancestry donors had a prediction matching their true ancestry amongst the top five population-specific inferences. Both studies advised caution with forensic casework analyses that try to give data in the report which goes beyond a simple single-ancestry inference with an accompanying high confidence score, although simple parental co-ancestry predictions (i.e., at or around 50:50 ratios) for divergent root populations were the most reliable.

3.3.2. GenoGeographer

GenoGeographer is a data analysis software developed by Tvedebrink et al. [126] to gauge the similarity or dissimilarity of MPS AIM SNP profiles and the reference population data used to make BGA inferences.
GenoGeographer delivers a likelihood ratio test (LRT, distinct from LR) recording the absolute concordance between an AIM SNP profile and a population (computing a z-score) rather than a relative measure of the profile's likelihood in two populations (as in LR analyses discussed above). LRT analysis adjusts for the possibility in forensic casework that no reference population is appropriate for the donor's true origin, so all null hypotheses are rejected when no relevant reference data can make a reliable inference. Mogensen et al. [127] gauged the efficiency of GenoGeographer using the PIAP and adding Greenlander and Somali population reference data to the widely used continental population groups of Africa, North Africa, Middle East, Europe, South/Central Asia, and East Asia obtained from FROG-kb, treating Greenlander/Somali profiles as unknowns. GenoGeographer marked 22.4 % of test profiles as not assignable to any reference population used, while ~84 % of the remaining 77.6 % were correctly assigned to either Greenlander or Somali. In contrast, conventional LR analyses gave 78.1 % correct and 21.9 % incorrect assignments. Overall, GenoGeographer was able to reduce the error rate threefold by using the z-score to exercise caution when its value was above a certain level, which prevented erroneous assignments being made.

In a similar, but simplified manner, Snipper allows the cross-validation of the population reference datasets compiled for forensic MPS BGA tools, and when the likelihoods generated are ranked in a series of LR plots the unknown profile position can be superimposed to gauge whether its LR is within the value ranges observed or is an outlier (i.e., reliable vs unreliable LRs). The lowest values in each reference population, or the LR values seen in incorrect assignments, allow an inference threshold to be set to reduce the risk of incorrect assignments from below-threshold values.
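The thresholding step can be sketched as follows. This is an illustration under stated assumptions, not Snipper's implementation: the function name `assign_population`, the use of log10 likelihoods as input, and the example threshold of 1000 (log10 = 3) are choices made here for clarity.

```python
from typing import Dict, Optional, Tuple


def assign_population(
    log10_likelihoods: Dict[str, float],
    log10_threshold: float = 3.0,  # assumed example: LR of 1000
) -> Tuple[Optional[str], float]:
    """Report the best-supported population only if its LR clears a threshold.

    The LR compares the most likely population against the second most
    likely one. Returns (population or None, log10 LR).
    """
    if len(log10_likelihoods) < 2:
        raise ValueError("at least two reference populations are required")

    ranked = sorted(log10_likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    (best_pop, best_ll), (_, second_ll) = ranked[0], ranked[1]
    log10_lr = best_ll - second_ll

    if log10_lr >= log10_threshold:
        return best_pop, log10_lr
    return None, log10_lr  # below threshold: no assignment reported


# Hypothetical log10 likelihoods for one casework profile.
confident = assign_population({"EUR": -10.0, "SAS": -14.5, "AFR": -30.0})
ambiguous = assign_population({"EUR": -10.0, "SAS": -12.0, "AFR": -30.0})
```

A population-specific variant would simply replace the single `log10_threshold` with a lookup keyed on the top-ranked population.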
Fig. 2 of [128] shows an example of such plots, applying a universal LR threshold of 1000 (i.e., no assignment reported for LR below '1000 times more likely one population than another'). Population-specific thresholds can also be set when some populations are less divergent than others in the reference data used.

3.3.3. Combining different types of AIM in BGA prediction frameworks and tools

The introduction of MHs as AIMs has highlighted the problem of collecting suitable population reference data for multiple-allele markers with many low frequency alleles/haplotypes, therefore requiring much larger sample sizes to properly estimate frequencies. Since sampling of Oceanian and American populations remains scant, it is difficult to use MH loci without significant inaccuracy in the calculation of likelihoods for under-represented population groups using these markers. Furthermore, haplotypes need to be counted rather than estimating allele frequencies from single SNP genotypes, so are dependent on large sample sizes to be representative of the actual haplotype frequencies in the population. Consequently, Snipper has so far failed to adapt LR analysis of panels combining SNPs and MHs, as the underlying variation needs to be estimated in different ways. Nevertheless, as with STR data, STRUCTURE can accept both types of markers as joint input, applying suitable numerical transforms to haplotypes (e.g., AAA=111; ACA=121, etc.). It should be remembered that PCA analysis only works with binary SNP data, precluding tri-allelic and MH variation, and it is recommended to apply more appropriate methods of multi-dimensional scaling able to handle binary/multiple-allele SNP and MH data. Scripts are available to execute principal coordinates analysis (PCoA) [129] and neighbor joining tree plots (see Fig. 8C in [82] and Figs. 3-4 in [109]) to better represent genetic distances in 2D space using all the variation in a panel.
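The numerical transform of haplotypes mentioned above can be sketched in a few lines. The codes A=1 and C=2 follow the example in the text (AAA=111; ACA=121); the codes G=3 and T=4, the function name, and the use of -9 for missing data are assumptions made here, and the missing-data code would need to match whatever value is configured in the STRUCTURE run.

```python
from typing import Optional

# A=1 and C=2 follow the example in the text; G and T codes are assumed.
BASE_CODES = {"A": "1", "C": "2", "G": "3", "T": "4"}
MISSING_CODE = -9  # assumed missing-data value; must match the run settings


def encode_haplotype(haplotype: Optional[str]) -> int:
    """Convert a microhaplotype string such as 'ACA' into an integer allele code."""
    if not haplotype:
        return MISSING_CODE
    return int("".join(BASE_CODES[base] for base in haplotype.upper()))


# One individual's two haplotypes at a single MH locus, encoded as two alleles.
locus_alleles = [encode_haplotype("AAA"), encode_haplotype("ACA")]
```

Because each distinct haplotype maps to a distinct integer, binary SNPs and MHs can sit side by side in the same input table as ordinary multi-allele markers.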
Lastly, experiences with MAPlex and MPSplex indicate that LRs are often reduced when MH likelihoods are combined with SNP likelihoods. This suggests that MH data in such mixed panels are best reserved for monitoring and de-convoluting mixed DNA, with the potential to infer each contributor's BGA [109].

3.4. The continuing challenge of assessing BGA in cases of genetic admixture and co-ancestry

GenoGeographer is designed primarily to address the problem of genetically admixed individuals analyzed with single-ancestry reference data [225]. Genetically admixed individuals with co-ancestry from different geographic regions present a problem for likelihood-based ancestry tests, producing reduced values [13], which are then not reliable enough indicators of ancestry for reporting to investigators. Pereira et al. used STRUCTURE rather than GenoGeographer to study Brazilian population samples and families (sibs have similar co-ancestry ratios and predictably combine their parents' ancestral backgrounds) as a testbed for population and individual admixture measurement, respectively [10]. Brazilian populations are known to have co-ancestry from European, African, and Native American populations. As would be expected, the larger the AIMs panel (ranging from 46 Indels to 164 PIAP SNPs/210 AIMs combined), the less variable the co-ancestry estimates obtained, with a more marked effect on consistency of individual co-ancestry estimates than population-level estimates. For an individual suspect in a crime case, accurately estimating co-ancestry for genetically admixed individuals is more difficult if i) the admixture contributors have lower than average divergence; ii) the AIMs panel is small; and iii) there is insufficiently balanced J; pop levels, given the AIM set and reference populations used (see Section 3.1.2.).
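Why admixture reduces likelihood-based values, and how a co-ancestry proportion can be estimated instead, can be shown with a deliberately simplified Python sketch. It is not STRUCTURE (which uses Bayesian clustering): it assumes only two source populations, biallelic SNPs in Hardy-Weinberg equilibrium with frequencies strictly between 0 and 1, and a plain grid search for the maximum-likelihood proportion; all names are hypothetical.

```python
import math

def profile_log_likelihood(genotypes, freqs):
    """Log-likelihood of reference-allele counts (0, 1, 2) given allele frequencies."""
    total = 0.0
    for g, f in zip(genotypes, freqs):
        probs = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]
        total += math.log(probs[g])
    return total

def log10_lr(genotypes, freqs_a, freqs_b):
    """log10 likelihood ratio of population A versus population B."""
    diff = profile_log_likelihood(genotypes, freqs_a) - profile_log_likelihood(genotypes, freqs_b)
    return diff / math.log(10)

def estimate_coancestry(genotypes, freqs_a, freqs_b, steps=100):
    """Grid-search estimate of the proportion of ancestry from population A.

    The individual's allele frequency at each SNP is modelled as the mixture
    a * freq_A + (1 - a) * freq_B, and the a with the highest likelihood is returned.
    """
    best_a, best_ll = 0.0, -math.inf
    for i in range(steps + 1):
        a = i / steps
        mixed = [a * pa + (1 - a) * pb for pa, pb in zip(freqs_a, freqs_b)]
        ll = profile_log_likelihood(genotypes, mixed)
        if ll > best_ll:
            best_a, best_ll = a, ll
    return best_a
```

In a toy case with 40 SNPs at frequency 0.95 in population A and 0.05 in population B, a fully heterozygous profile yields a log10 LR of zero (no support for either population), while the grid search returns a co-ancestry proportion of 0.5. With fewer or less divergent markers the likelihood surface is flatter, in line with the dependence on panel size and divergence noted above.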
The above points suggest that reliable analysis of complex genetic admixture remains difficult to achieve with the current forensic BGA tools, even those with expanded autosomal SNP numbers currently available for tools based on targeted MPS. However, there are several factors that can help both the interpretation of allelic variation detected in the tested individual and the design of future AIM sets, which were considered by the VISAGE Consortium when designing the AIM panel of the VISAGE-ET-AA. First, with existing MPS panels it should be possible to report co-ancestry in the SNP profile analyzed and infer contributing population groups when simple, balanced parental admixture is detected. The study by de la Puente et al. [109] assessing the capabilities of a large-scale MH panel for ancestry inference even suggests that the co-ancestry components of a mixed DNA contributor with an admixed background are detectable, without the expected confounding effects of 'double mixtures' (Fig. 5 in [109]). Second, when an individual originating from a continental margin population has identical patterns of admixture to those of individuals with admixed family history (i.e., North Africans have the same patterns as European-African admixed individuals), it is important to consider both possibilities in the ancestry report. A set of reasonable and robust guidelines for reporting individual co-ancestry has been proposed by Jin et al. [124], and readers are encouraged to consider these as a framework for building their own reporting guidelines, based on the AIMs panel used and its limitations for measuring co-ancestry (admixed 1000 Genomes population data for all published MPS BGA panels and tools are available in Snipper for this purpose). The third factor is the most important to consider as BGA MPS panels continue to expand in the number of AIM SNPs and become more sophisticated.
It should be increasingly easy to include non-autosomal marker data either from parallel MPS assays or within the same enlarged MPS multiplex. In this way, the paternal and maternal lineages can be compared with the autosomal data in admixed males. The FDSPK multiplex combines 56 AIM SNPs and 24 Y-STRs, but to our current knowledge these data have not been combined in any forensic ancestry studies. Likewise, the TFS Precision ID Identity Panel has 34 upper Y-clade SNPs, which, although a separate MPS multiplex, adds patrilineal analysis to PIAP data. A separate multiplex MPS tool for simultaneous analysis of 859 Y-SNPs allowing the inference of 640 Y-haplogroups has been published [77] and is available for high-resolution Y-haplogrouping, allowing detailed paternal ancestry inference. As the increased copy number of mtDNA makes the balancing of mitochondrial sequence with autosomal and/or Y-chromosomal sequences within the same MPS tool challenging, future efforts need to find out if such a combined MPS tool can be developed for forensic applications. In any case, separate whole mitogenome MPS tools are available for separate analysis of maternal ancestry [7], including a commercial solution [131]. As described above, the VISAGE-ET-AA tool already combines autosomal AIM SNPs with Y-SNPs for simultaneous bi-parental and paternal ancestry inference, although genotyping 87 Y-SNPs limits the level of paternal BGA inference and will need to be increased in future FDP tools. Similarly, X-SNPs provide a reasonably informative substitute for mtDNA sequence analysis for the inference of X-chromosome ancestry in a large proportion of males with recent admixture.

3.5. Ancestry inference with genealogy scale SNP genotyping

Although MPS has been the key technological development in forensic DNA analysis since 2015, arguably the recent rapid evolution of investigative genetic genealogy methods based on genomic data obtained with SNP microarrays or whole-genome sequencing has also changed our attitude to what could be possible in the future [152]. SNP microarray-level genotyping is generally not suitable for forensic material [153], although there can be crime scene samples that contain DNA of high enough quantity and quality to even allow successful SNP analysis using whole-genome sequencing [4]. Hence, a middle ground between hundreds of thousands of genome-wide SNPs present on microarrays and current MPS tools containing up to hundreds of AIM SNPs is a potential future way to address many of the issues covered above, particularly improving complex co-ancestry analysis. The medium-scale MPS-based Kintelligence SNP test from Verogen, intended for long-range familial searching using dedicated portions of the GEDmatch database (termed GEDmatch PRO), could also provide much ancestry information when combined with the 1000 Genomes/HGDP-CEPH reference populations that can now be readily compiled in Snipper. In combination with evolving data analysis regimes, such panels of several thousand SNPs could potentially provide greater detail, a wider range of sub-continental population differentiations and more refined co-ancestry analysis. In this way, an unidentified contact trace could be simultaneously analyzed for relatives in GEDmatch and have detailed ancestry inferences made. Dedicated tools with many thousands of AIM and appearance SNPs may be developed in the future, for which targeted MPS methods involving capture enrichment appear more promising than amplification-based targeted MPS methods, especially for degraded DNA samples. In 2021, the FORCE panel
based on capture MPS was published [51], designed to be an all-in-one tool for 5,422 SNPs for investigative genetic genealogy and other forensic purposes. The scale of the FORCE panel allows the combination of several thousands of SNPs with several distinct forensic purposes, with 4069 kinship identity SNPs, 241 BGA SNPs, 41 HIrisPlex-S SNPs for eye, hair, and skin color (4 overlapping BGA SNPs), 240 X-SNPs, and 823 Y-chromosome SNPs. The BGA SNPs in FORCE were compiled from the merged PIAP and VISAGE BT-AA panels. While in their paper the authors describe ancestry resolution with the FORCE panel based on 5 continental groups, the enlarged number of autosomal BGA SNPs in combination with 829 of the Y-SNPs from Ralf et al. [77] may allow FORCE to provide enhanced BGA inference compared to the smaller MPS-based BGA tools described in this review, which, however, remains to be formally assessed.

4. Recent progress in predicting age from crime scene DNA

Obtaining investigative leads from crime scene DNA within the concept of FDP gains extra power when the prediction of age is included together with appearance and ancestry. This is not only because age per se allows the characterization of a person, but also because the expression of certain appearance traits depends on age, and for some appearance traits, age is used as a predictor in the genetic prediction models (Table 1). While various approaches for age estimation have been considered in forensic molecular biology, only DNA methylation (DNAm) analysis has proved to be sufficiently accurate to provide a practical solution for forensic applications. The first papers mentioning the forensic usefulness of epigenetic age estimation appeared in 2011 and presented the first DNAm markers potentially useful for this purpose [195,196]. Since then, numerous studies have been published on epigenetic age prediction, many of them relevant to the forensic field.
Progress on the implementation of DNA methylation markers for forensic age prediction has been summarized in several previous review articles [9,10,197]. In this part of the review, we focus on summarizing the most recent advances in forensically relevant epigenetic age prediction published since these last review articles on this topic in 2016-17.

In recent years, several DNA methylation-based age estimation tools suitable for forensic applications have emerged, as discussed below and summarized in Table 3, in addition to those published earlier and summarized elsewhere [10]. Notably, early research showed that age estimated using DNA methylation markers has reduced accuracy in the elderly, which is related to a global decrease in the stability of DNA methylation with advanced age [10]. This problem is a consequence of inter-individual variability in the rate of aging, which is more evident in the elderly and can be due to both hereditary DNA variants, environmental factors influencing the rate of DNAm progression, and stochastic effects. The inclusion of additional markers may reduce this problem. A study by Cho et al. suggested that the accuracy of age estimation based on DNA methylation in ELOVL2, C1orf132, TRIM59, KLF14, and FHL2 could be increased in elderly people by adding information beyond
