Computers in Biology and Medicine 117 (2020) 103621
Extension of the Quantum Universal Exchange Language to precision medicine and drug lead discovery. Preliminary example studies using the mitochondrial genome
Barry Robson a,b,*
a Ingine Inc., Delaware, USA
b The Dirac Foundation, Oxfordshire, UK
ARTICLE INFO
Keywords: Data analytics; Data mining; Genomics; Bioinformatics; Universal exchange language; Quantum mechanics; Dirac notation; Inference net; Hyperbolic Dirac net; Bayes net; Clinical decision support
ABSTRACT
The Quantum Universal Exchange Language (Q-UEL) based on Dirac notation and algebra from quantum mechanics, along with its associated data mining and Hyperbolic Dirac Net (HDN) for probabilistic inference, has proven to be a useful architectural principle for knowledge management, analysis and prediction systems in medicine. It has been described in several papers; its extension to clinical genomics and precision medicine is described here. Two use cases are studied: (a) bioinformatics in clinical decision support, especially for risk of type 2 diabetes using mitochondrial patient DNA sequences, and (b) bioinformatics and computational biology (conformational) research examples related to drug discovery involving the recently discovered class of mitochondrial derived peptides (MDPs). MDPs were surprising when first discovered as coded in small open reading frames (sORFs), and are emerging as having a fundamental role in metabolic control, longevity and disease. This project originally represented a language specification study relating to what information related to genomics is essential or useful to carry, and what processing will be needed. However, novel aspects introduced or discovered include the HDN-like neural nets and their use, along with more established methods, for prediction of type 2 diabetes, and in particular for proposals for over 80 natural MDPs, most of which had not previously been described at the time of the study, as potential drug lead targets. Also, use of many medical records with simulated joining of mtDNA as performance tests led to some insightful observations regarding the behavior of HDN predictions where independent factors are involved.
1. Introduction and review
1.1. Background
The DNA sequences of a patient and protein sequences deduced from them increasingly provide a crucial source of information for the implementation of a personalized medicine tailored specifically to that patient [1]. The required support discipline for analyzing DNA, RNA, and protein sequences and structures by computer is well-known as bioinformatics, first established as a set of protein sequencing support, comparison, and prediction software tools as early as 1965–1975 when protein sequencing, but not routine nucleic acid sequencing, was available. For example, see Ref. [2] that describes the above historical developments particularly in regard to peptide and protein structure and function, which are themes in the present paper. Prediction algorithms related to those in more general use today were probably first developed in bioinformatics and protein science, albeit framed in a Bayesian information theoretic approach (see Chapter 9 of Ref. [2] for review). They can be shown applicable to clinical decision support systems (CDSS) (see Table 8.1, Prediction of Congestive Heart Failure, and associated discussion in Ref. [1]). However, they remained largely confined to the bioinformatics domain, and the Bayes Net (BN) in the form due to Pearl [3] became popular for more general applications, followed by neural nets as discussed in this paper. Probabilistic inference nets, of which the BN is the well-known example, are said to work top down by pre-assigning weights as probabilities based on prior knowledge, including that obtained by data mining. In contrast, neural nets are said to work bottom up by comparing prediction with observation and calibrating weights by minimizing the discrepancy; there is no use of preformed knowledge in the usual sense. More recently, to overcome
* Ingine Inc., Delaware, USA.

E-mail address: barryrobson@ingine.com.
https://doi.org/10.1016/j.compbiomed.2020.103621
Received 10 October 2019; Received in revised form 12 January 2020; Accepted 12 January 2020
Available online 20 January 2020
0010-4825/© 2020 Elsevier Ltd. All rights reserved.
certain limitations in the BN discussed below for use in CDSS, the present author proposed the Hyperbolic Dirac Net (HDN) [4,5] based on quantum mechanical principles [6]. The HDN is just one of several tools used in the present paper, but of general importance throughout is the associated Q-UEL language for probabilistic semantics and interoperability in medicine, as follows.
1.2. Q-UEL and the context and purpose of the present paper
The essential features of Q-UEL were captured in a paper in 2013 [7] and reflected the general specification in Q-UEL manuals around that time. Several special specifications of Q-UEL for different medical domains and use cases have been developed over the past seven years in many Q-UEL papers, examples of which are cited in context below. These papers did not attempt in any depth a special specification for management of knowledge in clinical genomics, but some such standard is needed [8]. A special specification of any knowledge representation language like XML [9] or Q-UEL is required when applied to a new domain. Similarly, HL7 Version 3 and descendent versions are particular embodiments or uses of XML for medical records and healthcare communications; they must obey XML rules, but took considerable effort to develop [1]. In the present paper, Q-UEL is extended from recent clinical medicine, which made relatively little use of knowledge of a patient’s DNA, to include genomics and bioinformatics [1]. When a specification relates to something like Q-UEL, which is an interoperability, knowledge management, automation and reasoning language [7], a specification study takes on the character of artificial intelligence research. Formally, any specification, general or special, is properly a description that is precise, defines the behavior of the system, itself involves formal semantics and reasoning laws, and requires an understanding of the problem. A special specification study for extension of an interoperability language to a new domain must also determine what information and transformations of it are absolutely required for that domain, and whether any additional information, perhaps redundant information in another form, will enhance efficiency and facilitate use. The account given in the present paper is not as rigorous as that suggests. Like the above paper that first introduced Q-UEL in significant detail [7], it is not the specification manual itself, but rather describes the work through examples, highlighting general principles and observations that are of more general scientific interest.
Most standards for giving structure to information, including XML and HL7 Version 3, have origins in specifications in earlier efforts with different names or guises. Q-UEL similarly has its roots in two earlier descriptions that can be considered as specifications of a kind. Q-UEL is an acronym for Quantum Universal Exchange Language, because (a) it follows closely the form and algebra of Dirac’s braket notation (or bracket notation), a standard for representing probabilistic knowledge and inference from it in quantum mechanics (QM) since the 1930s, and because (b) it was a response to a call by the (US) President’s Council of Advisors on Science and Technology (PCAST) for a universal exchange language for medicine (UEL) (see Ref. [7] for discussion). The Dirac notation, although mathematically rigorous, was slightly slack in its demands for specific format on certain points, primarily in regard to features called attributes in the present paper, and discussed extensively below. As a specification the PCAST requirements were vaguer still, although the kind of capabilities required were outlined in significant detail. The motivation for the PCAST report was that a UEL was seen as needed to overcome lack of interoperability in medical information technology [7]. Arguably, the interoperability situation has not in practice greatly improved since, and certainly links between routine clinical medicine and bioinformatics are poor (see Ref. [8], especially Chapter 9, for independent appraisals of this). Q-UEL was also motivated by the desire to make the Semantic Web (SW) [9] probabilistic [7].
1.3. Context of the present study in relation to previous major efforts
Q-UEL is not a major international standards effort, and currently, researchers tend to look to linked data formats based on SW technologies. These include Resource Description Framework (RDF), Web Ontology Language (OWL), Simple Knowledge Organization System (SKOS), and SPARQL [9]. Others look to the original XML, often also considered as a tool of the Semantic Web (SW), but in any case attractive for linking information on the World Wide Web (WWW). The PRISMA approach to structured systematic reviews in medicine and for epidemiology [9] should be mentioned because Q-UEL has also been used for that purpose. However, for certain interoperability aspects of PRISMA, it was found that an XML solution was more useful [9]. Bayes nets [3] and their causal model [10] have also been expressed in XML (e.g. see Ref. [7] for review). In contrast, PROforma is another and very different language that has been recommended for clinical decision support systems, and is based on Backus Naur Form (BNF) [11]. It appears to be moderately straightforward to convert PROforma to legal Q-UEL, though the converse has not yet been completed, and results will be reported elsewhere. Q-UEL owes most to XML, although unlike XML, its main function is not as a markup language (see Section 1.4). Q-UEL was inspired by the coincidental similarities between Dirac notation and XML, and borrows some XML format and ideas, but it should be kept in mind that Q-UEL is not XML. The use of hyphens in tag names such as Q-UEL-PATIENT-DNA is even encouraged to make sure it is not misread as XML (where they are illegal). The major differences are discussed in Section 1.4.
None of the main tools of the SW are extensively intrinsically probabilistic, although there have been diverse proposals [7], many of which are based on the BN [3] expressed in XML [9]. That is unfortunate. Not only does uncertainty abound in medicine but probabilities, or measures based on them, are needed as the basis of important metrics in evidence based medicine (and in epidemiology and public health generally). In effect, a physician is concerned with the probabilities that particular diseases are currently the best interpretation of the symptoms and clinical laboratory results (differential diagnosis), with the risk of specified diseases occurring in the future (risk factors, prognoses), with most probable causes (etiology), and with finding the therapeutic intervention that provides the probable opportunities for success [1]. The pharmaceutical industry is concerned with selecting for development the drug candidates that will most probably succeed. Genomics usually introduces a whole new probabilistic element [1]. Common inherited diseases often represent a complex interaction of many genes, involving control systems affected by past environment and therapeutic interventions. While for some diseases like sickle cell anemia and cystic fibrosis detection of one inherited mutation in each copy of DNA from the parents is a very persuasive indicator, mutations such as BRCA1 and BRCA2 represent a risk, not a guarantee, of breast cancer, and absence of these mutations does not guarantee avoidance. As discussed further in Section 1.5 below, such probabilistic outcome is similarly most commonly the case for other cancers, and cardiovascular, diabetic, behavioral, and many other disorders.
The Bayes Net is probabilistic, but as a basis for probabilistic knowledge in the SW and clinical decision support it has other limitations. It is possible that the total adoption of BN-based methods as a standard for medicine has been impeded by controversy over the interpretation of BNs as a Directed Acyclic Graph (DAG) of probabilistic knowledge, originally justified by a causal model [10] that is subject to some criticism (e.g. Refs. [4,5,12]). The DAG used to represent a net of probabilistic knowledge seems inconsistent with querying by a set (not ordered list) of tuples (e.g. Ref. [13]), with classical statistics and its adage that “Correlation does not imply causation” [14], the physics of bidirectional transitions in nature (e.g. Ref. [15]), and not least with the spirit of the WWW and SW that emphasize connectedness and a general “Web of Data consisting of all sorts of data types and forms” [9]. The HDN allows the Bidirectional General Graph (BGG) [5]. Q-UEL should
also not be confused with an ingenious but now obsolete query language called QUEL [13]. It was not probabilistic, though Q-UEL by similar name honors it because in hindsight there are common principles based on the tuple calculus developed by Codd, inventor of the relational data base [13]. Similarly in Q-UEL one can imagine a set of column names (headers), associated with an item or column of data below it, the association being called an attribute by analogy with the similar appearance of XML’s attributes.
1.4. Comparison of Q-UEL and XML
Q-UEL is sometimes presented as an XML extension, with modifications to XML to facilitate application to probabilistic semantics. The differences are primarily the introduction of the ‘|’ delimiter that relates to relationships, explicit or implicit operators such as logical operators between attributes, and optionally more elaborate attributes with ontological structure using an attribute metadata language (AML) [7]. Q-UEL can be used like XML as a markup language, with start < … | and end | … > tags surrounding sections of text [7], but this is rare, primarily to represent a first step in Q-UEL and XML interconversion, and most tags are stand-alone tags. The most prominent kind of XML-like tag in Q-UEL corresponds to Dirac’s bra-operator-ket.
< subject expression | relationship operator expression | object expression >
However, Q-UEL differs most fundamentally from XML in having algebraic and arithmetic force, as do the corresponding elements in Dirac notation, such that its XML-like tags can behave as variables in a software application. The most important values are probabilities, actually pairs of probabilities that are expressed as a complex number [7] as described later below, for the following reason. Attributes in Q-UEL appear as arguments in logical or other expressions, which are related to each other, in a mode analogous to a natural language. Although it can be rendered in valid XML, it is then relatively inelegant, and a means of making algebraic and arithmetic interpretation would need to be available to the XML representing it. Building on Dirac appears to provide a more natural and extensive approach than Codd’s. The above is a scalar variable that can be related to probability (Section 2.1) but also an expression, with a < … | bra row, | … > ket column, as the braket, and a relationship part which is an operator or matrix. These entities and the braket < … | … > where the ‘|’ can be read as logical IF, and the ketbra | … >< … | (an operator) also appear as valid Q-UEL tags. Unlike in XML, such tags can appear inside other tags, and this is easily demonstrated to be consistent with Dirac’s method, though this requires a deep discussion of the adjoint operation that is not required in this paper. The present paper also relates this to the neural net. Although Q-UEL is currently used as a local architectural principle for knowledge management in AI, the present paper is more characteristic of early Q-UEL studies envisaging a future “Thinking Web”. Here this means interoperability of a local clinical system with bioinformatics tools and data already on the web, and developed by others. Of course for privacy and security these tools and data could be reproduced in a cloud or behind a firewall generally; the principles would stay the same.
1.5. Precision medicine
Precision medicine (e.g. Ref. [8]) is usually portrayed as a kind of personalized and preventative medicine in which probabilities and prediction methods and some understanding of the impact of a patient’s DNA play an important role. This presents challenges to many current physicians. None of the topics discussed so far above as yet appear as core features on most nationally approved medical syllabuses, except perhaps as glimpses in a basic teaching of evidence based medicine, epidemiology and biostatistics. This is all of some concern because a physician takes ultimate responsibility in using a CDSS. The omissions in courses are understandable because of the large amount of medical information that a student must already absorb [1], but a grasp of bioinformatics stands out as needed because of the rise of translational research, capable of delivering new genomics discoveries that impact diagnosis, best therapy, and risk on a daily, even real time, basis [1]. Nonetheless, even bioinformatics would not be essential, were it not for the fact that the focus of personalized medicine today is increasingly on DNA and protein sequences. This is arguably a characteristic feature of precision medicine because it contrasts with the older or separate term genomic medicine that has largely been associated with the idea of ordering tests for highly localized inherited mutations, typically single base changes [1,8]. These differences were originally called single nucleotide polymorphisms (SNiPs), later seen as genomic biomarkers. Detecting the presence of such in the patient’s DNA is traditionally ordered to be done by a lab, and is reported and used in much the same way as any test. However, while SNiPs are the most common kinds of mutation, this does not mean that a single SNiP is always alone responsible for a disease. Admittedly, there are perhaps some 9000 diseases like sickle cell anemia and cystic fibrosis discussed above for which a single genomic localized biomarker (in those cases from each parent) is typically sufficient evidence [1,8]. But most of these are relatively rare while, as introduced in Section 1.3 above, many very common diseases such as heart disease and stroke, cancers, type 2 diabetes, and behavioral disorders have a complex polygenic basis in which many genes interact with each other and with influences from the environment. This requires bioinformatics, and the need is pressing. The cost of routine extensive genome sequencing is becoming cheaper than an MRI scan, but more importantly still for immediate application as discussed in the present paper, an increasing number of patients have already had their mitochondrial and Y-chromosomal DNA sequenced because of their interests in genealogy, i.e. for discovering less immediate family relationships and prehistoric genetic origins. Notably, mitochondrial DNA (mtDNA) is easy to sequence and lots of sequences are available because of its relatively very small size and multiple copies per cell [16]. Not least, it plays a key general role in health and disease. See Sections 1.7 and 1.8.
1.6. Integrative bioinformatics
In the early decades of bioinformatics the tools were rather disconnected, requiring some knowledge of computer science and communications such as file transfer protocols. It still requires some expertise today. Making the use of bioinformatics more digestible is the aim of integrative bioinformatics. The present paper is also a study in that discipline, since Q-UEL tags facilitate automation. Arguably the most popular freely available and user-friendly bioinformatics integration effort has been the Biology Workbench (recently hosted at the San Diego Supercomputer Center) that comprised a comprehensive set of bioinformatics tools and data bases in a unified intuitive workbench environment. Unfortunately it has been out of action for some time, purportedly for lack of funding [17], giving an added motivation for the present study. An industrial version, the Bioinformatics Workbench developed for the holding company forming Craig Venter’s Celera Genomics that produced the first draft of the human genome, was particularly well integrated with the storage and provenance of incoming large amounts of DNA sequences [18], but it is not freely publicly available in that form. Other workbench-like efforts include those by CLC, Taverna, Characterization visual Laboratory, Genomics Virtual and many others, but to the author’s experience and understanding they are either more limited or not free. Still freely available public bioinformatics sites include, for example, the European Bioinformatics Institute (EBI), European Molecular Biology Laboratory (EMBL), National Center for Biotechnology Information (NCBI), Broad Institute of Harvard and MIT, ExPASy Bioinformatics Resource Portal, the DNA Databank of Japan (DDBJ), and for mitochondrial research Mitochondrial Disease Sequence Data Resource (MseqDR) Consortium, and particularly for present
purposes MITOMAP and MITOMASTER and others discussed in context below. All these and others would benefit from unification in a Q-UEL approach. Currently, there is no single integrative surface as in the Biology Workbench, and even that lacked some of the more clinically related tools. Professional users use them all largely by explicitly uploading and downloading files and typically cutting and pasting bioinformatics data between webpages for diverse tools and systems. The present paper explores use of Q-UEL based techniques to bypass this difficulty, or at least to make it invisible.
1.7. Choice of focus on mitochondrial DNA in the present study
It is increasingly the case that DNA sequences are being added to the patient record but mitochondrial DNA (mtDNA) sequences have had a head start because, as discussed in Section 1.5, many patients have them already because of their genealogical interests. Mitochondria are organelles in eukaryotic cells that reside in the cytoplasm outside the nucleus, responsible for the bulk of ATP production [16]. Deduced as being of prokaryotic (bacterial) origin before the rise of multicellular organisms, the mitochondrion represents an extreme accepted symbiont with its own double stranded mitochondrial DNA of circa 16,570 base pairs. It represents a part of the human genome that (except in rare cases) is inherited from the mitochondria of the mother (as Y chromosomes are inherited from the mammalian father) [16]. Obtaining mtDNA is easier than for Y and other chromosomes because there may be 1000 to 2000 mitochondria in some cells, making up a fifth of the cell volume, and each may contain several mtDNA copies. mtDNA has for some years been of taxonomic and diagnostic [16], biological and biomedical research [19,20], forensic [21,22], and genealogical [23] interest, but precision medicine has so far tended to neglect mtDNA [8], perhaps because mtDNA research is enjoying a revival that presents some mysteries and surprises. The number of publications on mitochondria has recently exploded and overtaken those for each of the other cell components taken individually. Yet the picture is now clarifying, showing how mtDNA expression plays a central role in metabolic control and aging, responding to physiological stress, oxidative stress, and testosterone, while involved in many disease states, both rare and common, such as heart disease and stroke, cancers, obesity, and type 2 diabetes. Recent mitochondrial research adds Alzheimer’s, Parkinson’s and Huntington’s diseases, and epilepsy [18,24]. mtDNA thus spans the main concerns in popular health culture. Importantly for the pharmaceutical industry, the mechanisms are also increasingly better understood as to how mitochondria communicate richly with each other and the rest of the cell to control housekeeping functions, modulate synaptic transmission within the brain, release molecules that contribute to oncogenic transformation, trigger inflammatory responses systemically, and influence the regulation of complex physiological systems and aging.
1.8. Brief review of probabilistic aspects of mitochondrial genomics
A probabilistic approach has long been required in mitochondrial genomics, not least because of its genealogical and forensic applications. In this paper, no sequencing errors are presumed: conventional DNA sequencing has been fine-tuned to achieve read-lengths of up to ~1,000 bp per section with per-base error rates as low as probability 10⁻⁵. Aided by reference to standard well-studied mtDNA sequences to focus attention on possible anomalies, reliability of sequences is a reasonable initial assumption. There may be many independent mutations in one individual, resulting in heteroplasmy (mtDNA diversity) with each mutation found in about 1–2% of all the mitochondrial genomes, but the average sequence usually reflects the inherited sequence in healthy tissues. The chance of picking two individuals at random in the world and finding an identical mtDNA is estimated to be 1 in several trillions [21,22] (although a conservative random match probability of one in a billion has become standard for UK court reporting purposes [22]). Any two contemporary people will normally have identical mtDNA if they are related by an unbroken maternal lineage in which mutations have not occurred. The accepted (inherited) mutation rate is about 1–4% per generation. One has a 50% chance of sharing a common maternal ancestor within the last 5 generations (circa 125 years), and a 95% chance of a full sequence match for 22 generations (circa 550 years) [23]. An individual with no living recent relatives could have a unique mtDNA sequence. Mutations associated with serious diseases may not be passed down through many generations [24]. For many years differences were reported primarily in just two hypervariable regions as “hotspots” for mutation: HVR1 and HVR2. These determine one’s haplogroup, which is still used as a classifier today. These regions are more sensitive for genealogical and forensic purposes, although in forensics percentage probabilities of variously 0.1%–1% are deliberately used as conservative estimates of significant match (e.g. Refs. [25–27]). Probabilistic considerations in linking mtDNA to disease concern (a) heteroplasmy that increases in diseased cells and especially in cancer, (b) the polygenic nature of the more common diseases, (c) involvement of mutations in nuclear DNA (nDNA), and the fact that mtDNA mutation rates are some 10 times higher than in nDNA, (d) estimated probabilities of maternal versus paternal transmission of diseases, and (e) metabolic regulation effects (e.g. Refs. [28,29]). In the near future, we should expect that probabilities for diagnosis and perhaps prediction of disease might be linked even more precisely to Mitochondrial Derived Peptides (MDPs) [19,20], inherited mutations in them, and to their expression levels. MDPs appear to be responsible for much of the crosstalk between mitochondria and the rest of the cell. At the time of writing some three general types of MDPs are said to be known: humanin, MOTS-c, and a diverse class SHLP1-6. The first two classes are down-regulated in type 2 diabetes [30] but all MDPs might have a distinct action associated with distinct important diseases. It suggests the design of peptidomimetic drugs, i.e. designs that introduce, for example, D-amino acids [2], possibly leading to more traditional small drug “molecules in a pill”. As an example of use in drug discovery research, Q-UEL uses bioinformatics to identify potential new MDPs in the present paper, so laying the basis for a study of inherited mutations affecting MDP structure and expression, and linking mtDNA genomics more tightly to disease prediction.
1.9. Other work related to use of Q-UEL in the present paper
In the author’s understanding and opinion at time of writing, there are no published efforts by other workers that closely relate to the HDN and Q-UEL. At least, they relate no more closely than the Bayes Net and XML as reviewed in Ref. [7]. It was found that neural nets using related algebra [31] have been used for some time as discussed below, but most of the closely related efforts are by the present author and collaborators (e.g. Refs. [32–36]). That also includes aspects concerned with information theory for generalizing the approach to allow use of sparse data [32] that had early origins in bioinformatics [37], later extended to biomedical data-mining [38–40]. See also Refs. [1,2]. The present approach in genomics makes much use of links to bioinformatics tools developed by other workers, given at appropriate points throughout this paper. Other efforts in genomics and bioinformatics that inspired some ideas in the present work include the Mitomap and Mitomaster systems provided by the Center for Mitochondrial and Epigenomic Medicine at the Children’s Hospital of Philadelphia [42], experience with the Genomic Messaging System [43], and genomic knowledge gathered by the MARPLE system [44]. Mitomap and Mitomaster, and the XTRACTOR component of MARPLE, were also used directly. In developing the system for integration of the above and the stored knowledge required by them, brain models developed by neuroscience were helpful [45] (see Section 4.1). As to areas of application, many other workers have of course carried out investigative work on MDPs as indicated above (see also Refs. [46,47]), as have studies on relationships between mtDNA mutations and type 2 diabetes (T2D) [48–52].
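As a rough illustration of the sequencing-reliability assumption made in Section 1.8, the following sketch estimates how many miscalled bases a single pass over the mitochondrial genome might be expected to contain. It uses the genome length (circa 16,570 bp) and the best-case per-base error rate (10⁻⁵) quoted there. It assumes, purely for illustration, that errors at different bases are independent and that each base is read only once; the function names are hypothetical and are not part of Q-UEL.

```python
import math

# Figures quoted in Section 1.8; treating them as exact is an assumption.
MTDNA_LENGTH_BP = 16_570   # approximate length of human mtDNA
PER_BASE_ERROR = 1e-5      # best-case per-base error probability


def expected_errors(length_bp: int, per_base_error: float) -> float:
    """Expected number of miscalled bases, assuming independent errors."""
    return length_bp * per_base_error


def prob_error_free(length_bp: int, per_base_error: float) -> float:
    """Probability that every base is called correctly in a single pass."""
    # log1p keeps numerical precision when per_base_error is very small.
    return math.exp(length_bp * math.log1p(-per_base_error))


if __name__ == "__main__":
    mean_errors = expected_errors(MTDNA_LENGTH_BP, PER_BASE_ERROR)
    p_clean = prob_error_free(MTDNA_LENGTH_BP, PER_BASE_ERROR)
    print(f"expected errors per genome: {mean_errors:.4f}")
    print(f"P(error-free single pass):  {p_clean:.4f}")
```

Under these simplifying assumptions the expected count is about 0.17 miscalled bases per genome, and roughly 85% of single-pass reads would be error-free. Redundant coverage and comparison with well-studied reference sequences, as noted in Section 1.8, would be expected to improve on this, which is consistent with treating sequence reliability as a reasonable initial assumption.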
1.10. Other work using Clifford calculus and related algebras

A broader body of work by other authors is relevant if one focuses on a particular characteristic feature of the mathematics behind HDN and Q-UEL, namely the use of a kind of algebra that belongs to the Clifford calculus [53]. Primarily this affects the way that probabilities are represented and managed in Q-UEL (see Theory Section 2). Dirac rediscovered it in a particular algebraic form now called the Clifford-Dirac algebra. In general, algebras of this kind involve imaginary numbers which follow the anticommutative law ab = −ba, create a new imaginary number on multiplication, ab = c, and for which a square such as aa, bb or cc is either +1 or −1 (and in Grassmann algebras 0) as defined. It is one such imaginary number called h in Q-UEL, such that hh = +1, that is of interest for enabling Dirac notation to be used for clinical decision support, an idea that appears to go back to a speculative paper by the present author [54]. Normally one tends to think of quantum mechanics, including Dirac’s brakets and the associated algebra, as i-complex, based on the familiar imaginary number i as the square root of minus one, i.e. ii = −1. A number with the properties of h has been rediscovered by many authors many times under various names and guises (perhaps 20 or more, e.g. Ref. [6]), including by Dirac: it appears in the Clifford-Dirac algebra. The original discovery was almost certainly by Cockle in 1848–1849 [56,57]. Recently it has most often been called the split-complex number or hyperbolic number. Without any reference to theoretical physics, several neural networks have used related algebra that has been found efficient [58–62]. This is possibly in large part because the so-called XOR problem can be solved in a single h-complex neuron [62]. Neural nets represent a different, “bottom up”, approach as described earlier, but arguably the commonality of the algebra gives some credibility to the HDN and Q-UEL. The notion of a purely or extensively h-complex QM has only fairly recently been emphasized, by Khrennikov (e.g. Refs. [63,64]), who also expected his interpretation of an h-complex QM to be applicable to a theory of mind [65], but did not develop a specific computational artificial-intelligence-style approach based on h. Indeed, while such efforts are emerging, they are still rather few (e.g. Refs. [66,67]).

1.11. Other work relating to probabilistic semantics

There is also a body of work related to probabilistic semantics, which for Q-UEL plays an important part not only in automated inference but also in extending it to new domains such as bioinformatics that use different ontologies. The term “probabilistic semantics” was assigned to Q-UEL by the present author but is certainly not original, since Google obtains 36,900 hits on that term at the time of writing. However, descriptions of probabilistic semantics meaning essentially the same thing as in Q-UEL are less common than might be expected. The Wikipedia entry on that term [68] at the time of writing is short and in reasonable accord with the Q-UEL notion. It simply notes that one of the most severe limitations of the Semantic Web is its inability to deal with uncertain knowledge, and that probabilistic semantics extends the current semantic technology to overcome that limitation but can describe only those uncertainties that can be quantified, namely they cannot model conceptual uncertainty [68]. It cites just one author [69]. Nonetheless, Q-UEL research has for some years been attempting to map Dirac notation to semantics and natural language by such means (e.g. Ref. [70]), and Dirac himself believed that his system was applicable to many aspects of human thought (e.g. Ref. [71]). Philosophers and theoreticians have explored certain aspects which relate to probabilistic semantics (e.g. Refs. [72–76]), but there seems to be a significant gap between such efforts and what is needed for the SW and decision support that embraces uncertainty. Predoiu and Stuckenschmidt note that there is still as yet no settled opinion on how to describe probability and render the Semantic Web probabilistic [77] (again, also see Ref. [7] for review). Many authors have been of the opinion that a fuzzy logic approach is an appropriate compromise (e.g. Ref. [78]). In Q-UEL “probabilistic semantics” essentially means “reasoning using statements about the world and the probabilities attached to them which are degrees of truth, certainty, scope, prevalence, reliability, or lack of vagueness of the statements.” More precisely and pragmatically, a probability is ultimately based on an estimate of the information in a statement, in a quantitative information-theoretic sense, relevant to drawing a particular conclusion including identifying common meanings in different ontologies, or making a particular decision or prediction [32,39,70]. In principle at least, that is quantifiable by experiments with any method based on probabilistic semantics by studying its performance on test data with known outcomes. This latter information-theoretic definition impacts the former involving “degrees of truth” etc. when the available evidence is limited and probabilities cannot be assigned so objectively. Then there are defaults and preferences regarding assignment of preliminary probability values that are characteristic of Q-UEL, one of which is based on the position that statements are initially assertions awaiting refutation by contrary evidence in accordance with the work of the philosopher Popper (see Ref. [70] for discussion, and brief review in Section 4.4 below). In theory, these preliminary probability values could still change in special cases or when Q-UEL enters a new knowledge domain. In the present study it was noted that the probabilities can sometimes be updated and upgraded in a reasonable way to more informative values based on prior semi-quantitative knowledge of the population distribution of mitochondrial DNA and approximate sizes of populations, as discussed in Results Section 4. Much work with DNA, as well as in evidence based medicine and public health generally, involves the notion of probability as a matter of scope and proportion in a similar way. Any less quantitative or fuzzy logic approach [79] to semantics involving uncertainty would clearly not be appropriate if applied in such cases [33].

2. Theory

2.1. Use of Dirac braket notation

For the theoretical basis of Q-UEL and associated Hyperbolic Dirac Nets, see Refs. [4–7,32–37]. This Theory Section is confined to those aspects directly applicable to the paper, descriptions required to explain new features (notably the relationship to the neural net) and to mtDNA research. The value of the basic Dirac braket forms the basis of the other entities in the Dirac notation and in Q-UEL. In Q-UEL it is computed and/or interpreted as follows, along with some important equivalents, for the h-complex case. Note that “complex conjugate” means the consequence of changing the sign of the imaginary part, as in QM.

<A|B> (braket)
= ½[P(A|B) + P(B|A)] + h½[P(A|B) − P(B|A)] (h-complex Hermitian commutator)
= ½[P(“All B are A”) + P(“All A are B”)] + h½[P(“All B are A”) − P(“All A are B”)]
= P(“Some A are B”) + h½[P(“All B are A”) − P(“All A are B”)]
= Part_existential + h Part_universal (interpretation as probabilistic categorical logic)
= (½[P(A|B) + P(B|A)] − h½[P(A|B) − P(B|A)])* (effect of the complex conjugate)
= <B|A>* (effect of the complex conjugate)
= <A| if |B> (braket in preferred semantic triple bra-operator-ket format)
= <B| if |A>* (effect of the complex conjugate because if is taken as a Hermitian operator)
= <A Pfwd:=P(A|B) | B Pbwd:=P(B|A)> (transmitted/stored Q-UEL tag format)
= <B Pfwd:=P(B|A) | A Pbwd:=P(A|B)>* (effect of the complex conjugate)
= ιP(A|B) + ι*P(B|A) (spinor projector form, ι = ½(1 + h) and ι* = ½(1 − h))
= (ι*P(A|B) + ιP(B|A))* (effect of the complex conjugate)
= {P(A|B), P(B|A)} (probability dual, usually used to assign a constant, e.g. {0.91, 0.23})
= {P(B|A), P(A|B)}* (effect of the complex conjugate) (1)

In considering interconversion of Q-UEL with other languages it is often useful to consider the format Pfwd(A|B) simply meaning P(A|B), but extending that in a natural way, note that Pbwd(B|A) = Pfwd(A|B) and Pbwd(A|B) = Pfwd(B|A). The sense in which the above remains QM is discussed in Section 5.4, but some operational similarities and differences should be noted.¹

¹ In QM, the spinor projector form also uses h or an equivalent entity that squares to 1 (e.g. γ5). Many of the other expressions above would be valid for the i-complex case characteristic of much QM, although then the probability dual {…, …} is less directly deduced by a recipe for observable probabilities due to Dirac. It can be shown to remain valid in the h-complex case, but here the above essentially definitional approach is sufficient. Also in traditional i-complex QM, one would replace the empirical probabilities by normalized exponentials of i-complex expressions of what is, in effect, an information measure (it is this which results in QM’s characteristic periodic, i.e. wave, function). The probability values required for Q-UEL can be obtained by data mining methods that analogously result in information measures of which exponentials are then taken (see Section 4.5). The use case of mtDNA also makes relatively easy some probability estimates made using prior semi-quantitative knowledge as discussed in Results Section 4.

2.2. Vector-matrix algebraic properties and HDN-like neural nets

Vectors and matrices in Dirac notation have been recognized for some time in Q-UEL as important for probability distributions of medical measurements, but they also make an important connection with HDN-like neural nets that have not previously been described. Neural nets used by the present author and collaborators for comparison purposes with HDN predictions [36] were essentially standard. A Dirac bra <…| is a row vector as follows.

<A| = [<A|Ψ_0>, <A|Ψ_1>, <A|Ψ_2>, … <A|Ψ_n>] (2)

Here [<Ψ_0>, <Ψ_1>, <Ψ_2>, … <Ψ_n>] is expressed as a vector of wave functions, or distribution description of a wave function, or of probability amplitudes of quantum states generally. The Q-UEL interpretations are of practical value as discussed below. The corresponding ket is the column vector.

|B> = [<Ψ_0|B>, <Ψ_1|B>, <Ψ_2|B>, … <Ψ_n|B>]T (3)

The T indicates the row-column transpose. Dirac’s braket can be seen as the scalar result of an algebraic expression containing vectors: <A|B> = <A||B> = <A|.|B>. The dot operator is not usually written in physics, but it is added here for clarity. The above and the following apply both in Q-UEL and QM. Note that <A| = |A>* (by definition, a bra is the complex conjugate of a ket), |B> = <B|* (similarly a ket is the complex conjugate of a bra), and operator = |A><B| (outer product, a particular kind of matrix as an operator). While many authors write <A| = |A>†, it seems clear from Dirac’s writings that the < and > delimiters already provide the transpose, so only the complex conjugate * is required. As in QM, the most useful Q-UEL operators are Hermitian, which means that they are unchanged under complex conjugation and transposition, as implied in the following: <A| operator |B> = <B| operator |A>* = <B| operator* |A>. It is usually convenient to use a class of h-complex forms called zipper vectors and matrices in which positive and negative imaginary parts alternate, zipper-like [33]. For example for any real values a, b, c, d, etc.,

|A> = [ιa, ι*b, ιc, ι*d, …] (4)

<A| = [ι*a, ιb, ι*c, ιd, …]T (5)

The wave function or universal quantum state Ψ, with the distribution Ψ_0, Ψ_1, Ψ_2, …, is taken in Q-UEL as something on which all the other states and events etc. considered are strongly dependent; in Q-UEL in the medical setting that is usefully the age and sex of the patient. Intriguingly, however, in HDN-like neural nets, the distribution Ψ_0, Ψ_1, Ψ_2, … in bra and ket vectors then represents the states with unknown weights to find by optimization, i.e. by the training or learning process. Neural nets have been fairly extensively used by researchers in bioinformatics, probably because they are well suited to handling sequences. Unfortunately, the introduction of hidden layers of arbitrary states with weights optimized by feedback to produce best predictions leads to weights that are poorly understood. To facilitate interoperability at a deeper level, one wishes to make hybrid HDN and HDN-like neural nets, or introduce initial knowledge as weights prior to training, not least for efficiency, better laying out of the final weights, and avoiding entrapment in local false solutions. HDN-like neural nets are HDNs with arbitrary nodes in layers, and remain h-complex.

The overall activation function of a hyperbolic neuron is considered as f(z) = f(x) + h f(y). As in any neural net, a set of activation nodes α_0^(0), α_1^(0), α_2^(0), α_3^(0), … α_n^(0) represents the input layer, say L_0, acting on node say 0 with value α_0^(1) in a first hidden layer 1 with a total weight w_{0,0} α_0^(0) + w_{0,1} α_1^(0) + w_{0,2} α_2^(0) + w_{0,3} α_3^(0) + … + w_{0,n} α_n^(0). There is also a bias vector say β_0 which (in the traditional neural nets) has been used with the sigmoidal renormalization function f(z) = σ(z) = e^z/(e^z + 1). In Deep Learning neural nets each of the layers (including input and output as special cases) corresponds to a vector. Prior to the “firing” of the neurons, one may write

|L_{i+1}> = C_i |L_i> + |B_i> (6)

Here |L_{i+1}> and |L_i> correspond to two consecutive layers, and C_i is the connection matrix, assigning values to the connection between every node in L_i with every node in L_{i+1}. B_i is a bias or threshold vector corresponding to β_i above. The relation to the HDN is seen in that the usual task of parameterizing the values of the vector and matrix elements, characteristic of neural nets, now becomes one of assigning <Ψ_0|L_i>, <Ψ_1|L_i>, <Ψ_2|L_i>, … <Ψ_n|L_i> and <Ψ_0|B_i>, <Ψ_1|B_i>, <Ψ_2|B_i>, … <Ψ_n|B_i> or functions of them as elements of the vectors and matrix in order to optimize prediction.

Forming inner products <L_{i+1}|L_i> is legal, but so that vectors may correspond to those used in semantic HDN inference nets with a Hamiltonian relator between bra and ket, the orthogonal case should apply such that inner products are zero. The form required to overcome this is <L_{i+1}|N_i|L_i> using some neural net operator N_i in such a way that it describes a probability of how the state described by a layer L_i influences the immediate output state L_{i+1}, and vice versa. By a chain of such expressions a bidirectional net overall may be expressed,

<L_m|N_overall|L_0> = … <L_3|N_2|L_2><L_2|N_1|L_1><L_1|N_0|L_0> (7)

It is however vector elements such as <Ψ_0|L_i>, <Ψ_1|L_i>, <Ψ_2|L_i>, … <Ψ_n|L_i> and <Ψ_0|B_i>, <Ψ_1|B_i>, <Ψ_2|B_i>, … <Ψ_n|B_i> etc. in their logarithmic forms (see Section 2.7) that are of direct operational interest as knowledge elements prior to firing. However, the sigmoid transformation e^z/(e^z + 1) that sets the response is by no means the only choice of conversion of results to the 0 … 1 range. Consistent with the information theoretic basis discussed below,

|L_{i+1}> = exp(−(<L_i|C_i / <L_i||U> + <B_{i+1}|)) (8)

Due to the idempotent property, exp{x, y} = {exp(x), exp(y)} for any real values x, y. Note the <L_i||U> for normalization; here |U> is a unit
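The h-complex arithmetic used in Section 2 can be made concrete with a short sketch. The following Python is illustrative only and is not Q-UEL code: the class `HComplex` and the helper `braket` are hypothetical names, and the sketch assumes nothing beyond the definitions of Eq. (1), namely hh = +1, conjugation as a sign change of the h part, ι = ½(1 + h) and ι* = ½(1 − h). It checks that <A|B> = <B|A>*, that the spinor projector form reproduces the braket, and that the exponential acts component-wise on a dual {x, y}, as stated for Eq. (8).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class HComplex:
    """Hyperbolic (split-complex) number re + h*im, with h*h = +1."""
    re: float
    im: float

    def __add__(self, other):
        return HComplex(self.re + other.re, self.im + other.im)

    def __mul__(self, other):
        # (a + hb)(c + hd) = (ac + bd) + h(ad + bc), because hh = +1
        return HComplex(self.re * other.re + self.im * other.im,
                        self.re * other.im + self.im * other.re)

    def scale(self, r):
        # Multiply by a real number r
        return HComplex(self.re * r, self.im * r)

    def conj(self):
        # Complex conjugate: change the sign of the h part
        return HComplex(self.re, -self.im)

    def dual(self):
        # Dual {x, y}: the coefficients of iota and iota*
        return (self.re + self.im, self.re - self.im)

    @staticmethod
    def from_dual(x, y):
        return HComplex(0.5 * (x + y), 0.5 * (x - y))

    def exp(self):
        # exp(a + hb) = e^a (cosh b + h sinh b)
        s = math.exp(self.re)
        return HComplex(s * math.cosh(self.im), s * math.sinh(self.im))


IOTA = HComplex(0.5, 0.5)        # iota  = (1 + h)/2
IOTA_STAR = HComplex(0.5, -0.5)  # iota* = (1 - h)/2


def braket(p_a_given_b, p_b_given_a):
    """<A|B> built from the two conditional probabilities, as in Eq. (1)."""
    return HComplex.from_dual(p_a_given_b, p_b_given_a)


ab = braket(0.91, 0.23)          # the example dual {0.91, 0.23} from the text
ba = braket(0.23, 0.91)
spinor = IOTA.scale(0.91) + IOTA_STAR.scale(0.23)
print("<A|B>         =", ab, "dual", ab.dual())
print("<B|A>*        =", ba.conj())
print("spinor form   =", spinor)
print("exp of a dual =", HComplex.from_dual(0.3, -1.2).exp().dual())
```

Because ι and ι* are idempotent and mutually annihilating (ιι = ι, ι*ι* = ι*, ιι* = 0), any function with a power series acts on the two components of the dual separately, which is the property the exponential in the sketch relies on.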
vector composed of elements 1.

3. Methods

3.1. System used and overall workflow

It is in large part the role of the Q-UEL language in precision medicine and genomics that is of interest in this study. In general, Q-UEL has the following roles, to which comment on the importance of the role in the present project is added.

(i) Represent the “tags” conveying knowledge and logic, rendered as probabilistic statements whenever possible. They are stored in the Knowledge Representation Store (KRS). Q-UEL is a system that “learns through experience” as well as through data mining, meaning that any Q-UEL generated by any process can be kept as potentially of interest as knowledge for future use. In the present project, this was particularly so for roles (v) and (vii) below.
(ii) Provide the AI particularly by guiding the automatic construction and evolution of the inference net called the Hyperbolic Dirac Net. This is one of the main uses of Q-UEL tags in the KRS, whether for semi-manually [32,70] or automatically [35] constructed HDNs, or for answering multiple choice questions [44]. All had some use in this study, but DiracSmash was of particular importance, for structured data mining and automatic odds inference net construction [35].
(iii) Respond to queries to compute probabilities and odds as risk factors, PICO etc. for measures in EBM, epidemiology, insurance, finance. DiracSmash [32] in particular generates not only tags with tag value attributes for two conditional probabilities that may for example represent risk factors, but also odds from which odds ratios (OR) can immediately be constructed. This is important here because odds ratios are commonly used in genetic studies for clinical genomics as described in Results Section 4, allowing easy comparison of predictions and experimental studies.
(iv) Represent medical records and messages and other kinds of medical or other data and information (i.e. written in Q-UEL itself). This was less important here for medical records: those used were available in two versions, Q-UEL patient record format (e.g. Ref. [35]) and comma-separated value files (essentially spreadsheets) that could also be read directly by most applications. However, this record and message representation role of Q-UEL remains important in the present study not least because Q-UEL-PATIENT-DNA tags had to be designed and generated.
(v) Act as a “hub language” for interoperability, and interconversion and joining of diverse medical data. After being used to enhance interoperability by carrying information between Q-UEL applications and/or the use of the Internet, the Q-UEL tags continued to represent knowledge of potentially more general interest that is useful to retain in the knowledge representation store (KRS), as in role (i). Currently, converters play a key role in exchanging information between Q-UEL and tools on the web, usually via the HTML underlying the web pages for the tools, primarily because Q-UEL is not widely accepted as an interoperability language at this time, so there is no internal capability on public servers to receive and process it. Application to a new domain such as genomics and bioinformatics also requires new converters.
(vi) Act as a programming language. Q-UEL is being developed as a programming language as was illustrated in a simple way in the Q-UEL application POPPER [70]. However, in the present study the Q-UEL application DiracSmash [35] intrinsically managed most of the programming language features that would be required for data mining and inference net construction.
(vii) Provide in single large tags reports such as statistical summaries and Systematic Review. See Section 4.9, but in the present study use primarily consisted of querying and directly examining the XTRACT tags generated in the automated surfing process [45], rather than semiautomatic generation of systematic review tags as reports [33]. This more limited use is still typically more efficient than interactive use of a standard search engine and, importantly, the knowledge extracted during the surfing process is preserved in the KRS, again as mentioned in regard to role (i).

Fig. 1 describes the main flows of information from the broad perspective of the system as an intelligent agent interacting with the Internet. The distribution into three main conceptual components in this particular way along with converters is in some part a compromise. If Q-UEL were widely accepted as a Universal Exchange Language (UEL) as its name suggests, and used to extend the Internet’s currently emerging Semantic Web to a probabilistic Thinking Web [7], then (except for security and privacy concerns which can be overcome by a PCAST encryption and disaggregation model [41]) one would need to consider only the human user and the Internet. Of course, human user, local computation and local files, and access to the Internet, always make up a popular three-component option. As it is, Q-UEL is playing the role of an architectural principle with A.I. flavor as the basis of the local system (albeit potentially enterprise-wide, and between collaborators). The local system is then usefully seen as an “artificial brain”, borrowing some of the concepts of processing, working, short term and long term memory from neuroscience [45]. From that perspective, the local system is the inner or mental world and the outer world is the Internet. At least, this is the model used in the present project, but with an important caveat. The “brain” is only currently considered as an artificially semi-intelligent system in which the human user such as a physician makes up the rest of the required intelligence by replacing, via an interface, what neuroscience would consider as “the central executive”, i.e. the authority responsible for the control and regulation of cognitive processes, taking account of longer term planning, focusing system attention, and making working memory and long-term memory work together. Later, some of these tasks may also be automated. Working memory can be considered as all those tags that have been selected for current HDN construction, placed on a file called the HDNtags file (e.g. Ref. [35]). For example, such a file is built by the Hitlist (interdependent factor list) and Wishlist (independent factor list) by module DiracSmash [35]. Medium term memory can be considered as all the tags gathered so far in relationship to a current project, and in simple cases that may be represented by tags generated or selected on the basis of the Hitlist, which could result in a number of tags several orders of magnitude larger than those generated by Hitlist and Wishlist acting together. Long term memory is represented by a separate but accessible storage archive; it includes tags that pass certain curation tests [44].

In the present study, the role of the central “model brain” figure in Fig. 1 is currently played by the BioIngine, which is more general in application and has been industrialized for management and mining of Big Data [32–36,44]. The arrows in this part of Fig. 1 all represent information transmitted which is represented by Q-UEL, and the MEMORY is also represented by Q-UEL tags. To be so in the left side of Fig. 1, i.e. on the INTERNET, would, as noted above, require wide acceptance of Q-UEL as a Universal Exchange Language. Q-UEL could be used in interactions in the INTERFACE for the PHYSICIAN to the right hand side, but Q-UEL has as yet no particular advantage here over conventional methods. In the BioIngine these are usually represented by programs or short scripts in the SCALA language. The modules that Q-UEL connects in the BioIngine are also coded in the SCALA language to facilitate distributed processing of Big Data. However, because the amount of data available for the current study was small, much work was done in algorithmically equivalent Perl research code on a laptop. All neural net codes are research codes in Perl at this time. In each interaction with the Internet, the request for a particular tool invokes a link to a web page on a server,
Fig. 1. The System used in the Present Project can be considered as Based on Brain Models from Neuroscience. See text and ref [45].
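As a hedged illustration of role (iii) in Section 3.1 (two conditional probabilities, and odds from which an odds ratio can be constructed), the following sketch works directly from a 2 × 2 table of counts. It uses the four T3394C/T2D counts listed in Section 3.2. The function names are hypothetical, no default or prior probabilities are applied, and it is not DiracSmash or the author's code.

```python
from fractions import Fraction

# Counts given in Section 3.2 for the 74 mtDNA sequences
counts = {
    ("T3394C", "T2D"): 8,
    ("normal", "T2D"): 30,
    ("T3394C", "no T2D"): 3,
    ("normal", "no T2D"): 33,
}


def conditional(counts, variant, outcome):
    """Return (P(outcome | variant), P(variant | outcome)) from raw counts."""
    joint = counts[(variant, outcome)]
    n_variant = sum(n for (v, _), n in counts.items() if v == variant)
    n_outcome = sum(n for (_, o), n in counts.items() if o == outcome)
    return Fraction(joint, n_variant), Fraction(joint, n_outcome)


def odds(counts, variant, outcome, other_outcome):
    """Odds of outcome versus other_outcome among carriers of variant."""
    return Fraction(counts[(variant, outcome)], counts[(variant, other_outcome)])


def odds_ratio(counts, variant="T3394C", reference="normal",
               outcome="T2D", other_outcome="no T2D"):
    """Crude (unadjusted) odds ratio of variant versus reference."""
    return (odds(counts, variant, outcome, other_outcome)
            / odds(counts, reference, outcome, other_outcome))


p_fwd, p_bwd = conditional(counts, "T3394C", "T2D")
print("P(T2D | T3394C) =", p_fwd, " P(T3394C | T2D) =", p_bwd)
print("crude odds ratio =", odds_ratio(counts), "=", float(odds_ratio(counts)))
```

For these counts the sketch gives P(T2D | T3394C) = 8/11, P(T3394C | T2D) = 8/38 and a crude odds ratio of (8 × 33)/(3 × 30) = 44/15, about 2.93. These are unadjusted sample values and need not coincide with the estimates reported in Results Section 4.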
and the data to be submitted to that web page. This is easy to do manually, although Q-UEL control tags for driving processes can also be used [32]. It is the design and specification and description of tags carrying the data component of the above query that is important in the present study. In the first pass of the outer cycle used for clinical purposes, such tags will typically be Q-UEL-PATIENT-DNA tags (Results Section 4.1). It is the HTML behind the web page that is directly matched and edited, not the text in the surface display.

3.2. Data sources

Use was made of the present author’s collection originating in his earlier genomic and proteomic studies of mtDNA-disease relationships (e.g. Ref. [39]), but those related largely to monogenic, less common mitochondrial diseases, and the status of the patients with respect to common and more complex diseases such as type 2 diabetes (T2D) was not clear. The data mainly used comprised 74 mtDNA sequences relating to presence, or absence so far, of T2D, including mtDNA consented for use and in part from the FamilyTreeDNA data base [23] (which is genealogical and by itself provides no data as to health or disease). Whatever the source, consent is by the persons from whom the original DNA sample was maintained. Only two full sequences have so far been consented for public view of significant lengths of mtDNA sequence. For reasons discussed in Results Section 4, interest quickly focused on the relationship between inherited mutation T3394C and type 2 diabetes (T2D). Here 8 patients had the T3394C marker and T2D, 30 had the normal T at that location and T2D, 3 patients had the T3394C marker and no T2D, and 33 had the normal T at that location and no T2D. The above may be considered as a “training set”. As a check on both assignments and the reasonableness of disease predictions, and to show interoperability with Web resources with the flavor of similar predictions, use was also made of MITOMAP [42] with its MITOMASTER tool. It is a comprehensive on-line resource with curated datasets of mitochondrial DNA (mtDNA) rearrangements, with circa 10,300 expert annotated mitochondrial DNA variants, including circa 580 with disease, including T2D, information. Of course agreement with MITOMASTER is by no means assured because many other factors might influence whether or not a patient is manifesting T2D. A further option is the use of data and tools in the Mitochondrial Disease Sequence Data Resource (MSeqDR) Consortium data base in preliminary development, but useful insight for interpretations can be obtained from such web sites, as follows.

3.3. Extraction of knowledge from Internet text

In contrast to use of converters to access tools at specific sites (see below), XTRACTOR [7,44] can be directed at diverse initial sites because it was designed for unstructured data mining of the Web (here meaning of natural language text), as automated surfing and knowledge harvesting from the Web. It keeps results initially in the form of XTRACT tags, which it is said to “auto-spawn” explosively as it follows through surfing on links and reference citations with links. The restriction is that it will not alone enter, for example, DNA sequences. It can enter Google and Wikipedia queries to start off, and continue to do so itself, in the surfing process. Most comments made about mitochondria and mtDNA in the present text, and many references, were obtained by this automated webpage searching approach. There is some associated natural language processing to represent extracted knowledge in a canonical form which can easily be converted to semantic triples. Example XTRACT tags and use of MARPLE [44] are described in Results Section 4.6. In practice, in the present study, XTRACTOR use tends to go hand-in-hand with alignment work to explore patient features in a larger context.

3.4. Converters for interaction of Q-UEL with bioinformatics websites

At present, interaction with webpages for specific public domain bioinformatics tools is less smart than in the case of XTRACTOR above
which is controlled by MARPLE, although sufficient for a specification study. The reason is as follows. “Traditionally” and prior to Q-UEL one might write a simple converter using regular (match-and-edit) expressions to identify a known string in the webpage HTML <input type="text" name="authority" value="Genomic analyst" SIZE="3000" MAXSIZE="50">, and ensure that data is sent to the appropriate link for the tool upload address by a <FORM METHOD=POST ACTION="upload script link"> having tested it by e.g. <FORM METHOD="POST" ACTION="mailto:robsonb@ …."> [43]. However, contemporary sites show considerable diversity that mostly needs to be tackled one-by-one. Not all converters have been completed at this time and so use was sometimes semi-manual, but it was assured by inspection that the converters were feasible for inputting and/or outputting appropriate tags. Extraction from the author’s own internal mtDNA collections did not have this problem since it involved more standardized off-line web pages intended for clinical genomic analysts based on the author’s earlier genomic messaging system (GMS) [43] (Q-UEL tags now entirely replace the original GMS language).

(A Q-UEL patient mtDNA tag so generated is discussed as a design result in Section 4.1.) In the workflow it is typically followed by alignment of those sequences (e.g. Section 3.3). In particular, to create a useful permanent knowledge package about the mtDNA of a patient that is stored in the Knowledge Representation Store (KRS) for future use, the patient mtDNA tag is further reprocessed to a final form using alignment so that inherited mutations as substitutions of base pairs with respect to a standard reference sequence are included. For alignment of a DNA or protein sequence in general, clinical geneticists use well known bioinformatics tools such as BLAST [8] ideally followed by ClustalW, for subsequent detailed alignment of two or more sequences. Consequently, Q-UEL links to these tools and others such as SIXPACK for translation of DNA to amino acid sequences were studied, particularly (but not solely) at https://blast.ncbi.nlm.nih.gov/Blast.cgi, https://www.ebi.ac.uk/, https://www.expasy.org/resources and https://www.uniprot.org/. Sites used are quoted in Results Section 4, in context. Tools at the Japanese Institute of Genetics, the Philippine Genome Center, and the Broad Institute were considered but not used in this study. A Q-UEL application interface to https://en.wikipedia.org/wiki/List_of_open-source_bioinformatics_software was a useful hub access to some lesser known open source codes, but generally downloading and installing open source, while a feasible approach in theory, was not done. The author’s own in-house methods for alignment and HDN-based secondary structure prediction provided some advantages, but emphasis in this study was access to tools by other workers and freely accessible public web sites. However, HDNs used in the preliminary prediction and risk assessment studies were mostly odds-based HDNs for the dual {predictive odds, likelihood ratio} built using DiracSmash [35]. Some molecular mechanics and dynamics modeling techniques briefly mentioned below used the author’s own codes because codes of the nature required or preferred were not available as suitable free Web options.

3.5. Prediction of mitochondrial derived peptides

Note that the mitochondrion has a slightly different genetic code that was commonly used in the present kind of study, but there are also products of mtDNA genes produced in the cytoplasm by standard genetic code that may differ slightly and need studying [17]. In theory, despite the above complication, digital translation of DNA or RNA to a protein amino acid sequence remains the simplest of the bioinformatics tasks for the present study because the genetic code is conceptually simple, precise, well known and easy to code, e.g. ACG codes for threonine, GAA for glutamate, and so on. However, the problem is that not every translation as a putative string of amino acid residues is necessarily realized as a peptide or protein biologically. To use that as a tool for predicting protein-coding genes as open reading frames (ORFs) requires further considerations and is somewhat more sophisticated. This sophistication presents particular difficulties for the present study because of the interest in the unexpectedly small Mitochondrial Derived Peptides (MDPs). Gene-finders have been fine tuned for features of globular proteins and so may be unsuitable for MDPs. http://www.softberry.com/berry.phtml is a promising gene-finder website but for free academic use it allows only a few searches per day, and so is less suitable for an essentially automatic Q-UEL based interaction. Three methods were used. The first “Simple short ORF” approach is simply elucidation of all 6 reading frames using the human mitochondrial genetic code, and looking for DNA sections of a specified length range (typically 63–105 base pairs) starting with a start codon (methionine) and ending with a stop codon. ORFfinder at www.ncbi.nlm.nih.gov was mainly used in this case. The second was an in-house tool and was initially the same HDN prediction technique used for secondary structure prediction in proteins [4,35], but note that the start and end of an open reading frame is more crisply defined than the start and end of, say, an α-helix. Subsequently, a more elaborate form used MDP amino acid sequence features as discussed below. The third used the HDN-like neural net and the same more elaborate input data, but this had greater challenges because prior knowledge of the above kind cannot be introduced (at least not in the current algorithm, although an experimental approach under development is expected to be able to use that strategy). The last two tools depended on data mining and training respectively on sets of similar short peptides from different species or highly homologous sections to the three currently known types of MDP, Humanin, MOTS-c and SHLP1-6 (Section 4.9), so for example the set would include MAPRGFSCLLLLTSEIDLPVKR Humanin 1 (human), MAAGGFGCLLLLISEIDLSMRR Humanin-like peptide 7 (Piliocolobus tephrosceles), MATQGFSCLLLSISETDLSMKR Humanin-like peptide 4 (Pongo abelii) and so on.

4. Results

The study obtained several kinds of descriptive and quantitative results, ranging from Q-UEL tags designed for genomics and bioinformatics through some new techniques in the form of Q-UEL software applications, up to use of all that in example use cases that in some cases are believed to contain some new scientific information. The following is a summary guideline. Note that more attention is naturally given to research than clinical use because clinical use is more rigidly set by well-defined guidelines and clinical pathways. Original research is more open and less well defined. Note though that the practice of evidence based medicine by physicians also requires that the physician do some research-like study in regard to patients for which diagnosis and best therapy is less clear, e.g. using the PICO method [33].

A. Features of Principal Types of Q-UEL tags designed in the study.
#4.1. Designed Patient DNA Tag.
#4.2. Designed Additional and Alternative Alignment Attributes.
#4.3. Example of Designed Knowledge Tag.
#4.4. Assignment of Appropriate Default Probability Values.
#4.5. Assignment of Probability Values Based on Counting.
#4.6. Tags Expressing Translation into Peptide and Protein Sequences.
B. Studies on the Early Processing of Patient Genomic Data Using Q-UEL.
#4.7. Initial Clinical and Research Use Case Examples.
#4.8. Studies with Contrived Clinical-Genomic Data and Conclusions for Real Data.
#4.9. Semi-automated Systematic Review with Associated Alignment Studies.
C. Clinical Use Cases
#4.10. Prediction Use Case. Example involving Mutation T3394C.
#4.11. Further Analysis Regarding Type 2 Diabetes and Mitochondrial DNA.
D. Research Use Cases
#4.12. Research Use Case. Initial Studies and Observations on Humanin.
#4.13. Research Use Case. Predicting Other Mitochondrial Derived Peptides.
#4.14. Research Use Case. Validating Mitochondrial Polypeptide Products.
#4.15. Research Use Case. Conformational Studies on MDPs.

A further Section 4.16 addresses interoperability and interconversion with other approaches, with Q-UEL in a role as a potential hub language.

4.1. Designed patient DNA tag

Recall that the primary function of the present study is exploration of the “special specification” of Q-UEL for clinical genomics and bioinformatics. The Q-UEL-PATIENT-DNA tag shown immediately below is arguably the most important design, and even the data used for original biomedical research will comprise an archive of such tags in the near future. A large part of the sequence has been removed for brevity in this paper, and also the SNiPs/genomic biomarkers in the protein coding region, as indicated by the ellipsis ‘…’ used twice below. That is, the ellipses were not in the tag as used. The DNA sequence was in actuality of length 16,571 base pairs covering several pages of representation (3 pages of Arial font size 8). Large Q-UEL tags of this size are optional and not unusual outside of genomics, e.g. a more traditional patient medical record broken down into attributes, usually displayed one per line and playing the role of multiple tags and interspersed text in XML.
The following features should be noted.

(i) Source Code Attribute and Program Language. In the tag name, considered as a special case of an attribute usually concerned with provenance, it is indicated that the tag was generated from converters (interacting with web pages) using software written in the Perl language. Q-UEL algorithms have been investigated in Mathematica, Python, and Q-UEL itself. In particular SCALA is used in the BioIngine for Big Data analytics. Perl is used here in its original role as “Pattern Extraction and Report Language”, and remains a favored approach for converters and related software.
(ii) Used and Purpose Attributes. These attributes relate to provenance regarding the tag’s position as a chunk of information in the workflow. The ‘used’ attribute value https://www.ebi.ac.uk/Tools/msa/muscle/ is provenance and describes the web page that was used to generate the tag. The ‘purpose’ attribute in the tag relates to its actual or intended subsequent use. If it is automatically to access the web page, however, it requires an extension to include the Q-UEL application that accesses the HTML code underlying the web page.
(iii) Privacy and Security Features. Regarding the importance of privacy and security in genomics, note that in two cases in the above example, the strings “DNA:=‘human mitochondrial chromosome’=” and “‘match code’:=(match:=0, difference:=1)=”, the simple equals sign ‘=’ is used. When Q-UEL applications encrypt values after the rightmost ‘:=’ in a branch of an attribute, this is encrypted and the attribute as a whole is subjected to PCAST disaggregation [41].
(iv) List of Variants with Respect to Reference DNA. Note strings C16189T etc., in the above example tag in Section 3.1 above. C16189T means that in the patient a T has replaced an original C at locus 16189 in the reference sequence. The reference sequence most commonly used at present is the revised CRS, designated rCRS, deposited in the GenBank NCBI database under accession number NC_012920. Such variants may be available in source data but are readily generated by alignment given a reference sequence. Though these can be computed from other information on the tag, they contain the kind of biomarker information familiar to clinical and research genomicists, and save a great deal of time if this information would otherwise need to be computed for many tags.
(v) Match code. This is usually used for neural net input but there are other applications. A binary match/no match comparison is typically made with the standard human mtDNA Cambridge Reference Sequence (rCRS). The ellipsis “…” missing out a large section of sequence in the above tag is only for brevity here. The match code could be deduced from the list of the above biomarkers if complete (knowing the reference sequence), but again having it ready saves a great deal of time in processing large numbers of tags. There is the implication, but no guarantee, that the reference sequence is the “normal” or “healthy” case, and no guarantee that the substitution contributes to a significant disease state, and an ambiguity, since a substitution of any one base pair could be by any of three others. Nonetheless purines (A,G) to purines and pyrimidines to pyrimidines (C,T), called “transitions”, are approximately twice as common as purine-to-pyrimidine or vice versa. In protein coding regions,
amino acid physicochemical properties tend to be conserved. The loss of the more detailed molecular description is not expected to have great impact on average, in the current state of the art. When and where it does start to become important, the tags almost always contain the DNA sequence. If there are concerns about neutral substitutions in a protein coding sequence, then it is likely best to use a one letter code amino acid sequence in any event, or a binary coding of that in which the 0s and 1s relate meaningfully to amino acid properties [43].

By reference to DNA, such representations represent a focus on molecular ethnicity which is arguably fairer than terms such as “Hispanic”, “Oriental” and “African” and certainly far more relevant to assessment of disease risks and application of pharmacogenomics for most effective treatment [1]. The present study is intended to be extensible to all patient DNA, i.e. including that in the nucleus, so the following example non-mitochondrial DNA attribute, also illustrating some other options for tag content, may be noted.

4.2. Design of additional and alternative alignment attributes

Note the ‘modified for replacements’ attribute in the example above, which is clearly one way of giving a condensed representation of an alignment between two DNA sequences. Alignments are not only important for determining the patient’s genomic biomarkers as “inherited mutations”, actually the difference with respect to a standard reference DNA, but for comparison of DNA and protein sequences of patients, and even with other species, for research purposes. The “problem” is that there are a variety of alternative and more accepted ways to represent alignments that bioinformatics researchers like, and one of these may be the user’s favorite. Hence it was decided that the special specification should allow various alternative alignment attributes to be optionally used, for example:
Note that part of the choice relates to various summarizing match codes such as ‘CLUSTALW match code’ above. The following example uses the binary match code, prominent in the present study, as the summary of a specified alignment suitable for, for example, predictions by neural nets.

4.3. Example of designed knowledge tag

The following is a Q-UEL knowledge tag relating to genomics that was stored in the Knowledge Representation store (KRS) at the start of this study.
The inclusion of a relationship operator is preferred, here ‘is associated with’, which is symmetrical and implies that the Pfwd and Pbwd values are the same. Since the tag is an assertion with some definitional character, the probability values are reasonably both 1. Pfwd and Pbwd are not required to be shown because this is the Q-UEL default (see below). Note here that one speaks of a single probability for a tag if Pfwd and Pbwd are the same value, following the rule for duals that {x, x} = x.

4.4. Assignment of appropriate default probability values

A default value of 1 is assumed for Pfwd if not specified, and similarly for Pbwd (and also for assoc). This was the case in the Q-UEL-PATIENT-DNA tag above. A finding of the present study was that there is no need to override the above initial default for uses in genomics. The choice of 1 reflects or implies a parameter setting prior = 0 as discussed in Section 4.5. This default is called the primary default. The particular choice of 1 essentially expresses an initial state of actual or elected ignorance or irrelevance, which may seem counterintuitive. However, it is in accordance with Popper’s principle of assertion and refutation, information theory that states I(X) = -log P(X) (so P(X) = 1 for I(X) = 0), consistency with Bayes’ rule, and the fact that if a piece of knowledge as a Q-UEL tag is unknown or omitted from a purely multiplicative inference net, it is as if present with probability 1 [7,32–39]. The default 1 also follows from Eqn. (9) below for the use of data mining in the limiting case of having no data to mine, if prior = 0 is set. The most useful choice of subsequent probability assignments reflects the fact that the mtDNA may not be unique to the specified person. Secondary defaults or values are probability values that do not yet reflect a specific statistical or data mining study, or other kind of approach involving counting, but do reflect common sense and some prior knowledge. They may change the default of 1 depending on the case. Consider the simplified braket < patient:=Alice | mtDNA > relating patient Alice to a specified mtDNA sequence. If Alice and mtDNA are both identified without error and misinterpretation, and Alice has no close living relatives, then the mtDNA uniquely identifies her and the imaginary part, i.e. ½ [P(Alice | mtDNA) - P(mtDNA | Alice)], vanishes and P(Alice | mtDNA) = P(mtDNA | Alice) = 1. That is, the probability dual is {1,1} = 1. If Alice has a twin and no other siblings or close relatives and we do not know which of them the mtDNA comes from, P(Alice | mtDNA) = 0.5 and P(mtDNA | Alice) = 1, i.e. the dual {0.5,1}; the real part ½ [P(Alice | mtDNA) + P(mtDNA | Alice)] is then 0.75 and the imaginary part ½ [P(Alice | mtDNA) - P(mtDNA | Alice)] is -0.25. Secondary defaults extend this to introduce the effect of a sample size N even if it is possible to establish a specific count, i.e. something is seen so many times out of N by data analytics. In a forensic case of an Agatha Christie style murder mystery in a stately home, for example,

Pfwd:=P(person in Alice’s line of maternal descent in family’s stately home | mtDNA) = ~1
Pfwd:=P(person in Alice’s family’s stately home | mtDNA) = 0.1 (a very rough estimate)
Pfwd:=P(anyone in England | mtDNA) = 1.9 x 10^-3
Pfwd:=P(anyone in United Kingdom | mtDNA) = 1.5 x 10^-3
Pfwd:=P(anyone in Europe | mtDNA) = 1.35 x 10^-4
Pfwd:=P(anyone in World | mtDNA) = 1.3 x 10^-5

Some indication should of course be given as to what population the number corresponds. Typically this is done in the tag name attribute, i.e. in the annotation attached to the tag name. The following attribute is
said to represent an independent attribute or to reside in an independent context because its specification as a set of data and any selection from it is subject to the user’s choice.
Compare the following, referred to as a dependent attribute that resides in a dependent context because, operationally at least, all automatically emerges as a result of data analysis outside or partly outside user control, importantly including the metadata name (here “County”).

4.5. Assignment of probability values based on structured data mining

Tertiary assignments are not defaults, but represent objective probability values estimated from observations and counts by data mining or other data analytic techniques. There are nonetheless choices to be made for any new domain, and some require consideration of fairly subtle aspects. They include choices that are impacted by the above consideration of prior probabilities, as follows. Data may be sparse, and inevitably it is also the case for probabilities containing many factors (attributes) even in Big Data. What really matters is the information content because, as in a court of law, inference collated from a large number of items of weak evidence could overthrow a decision that is made without it. The theory of expected information based on Bayesian principles was still found to be a satisfactory approach. It was originally developed for bioinformatics [37], although it has been further developed for use in Q-UEL and HDN calculations (e.g. Ref. [32]). An example would be mtDNA found at a crime scene and then thought of as introduced to the data base, e.g. as follows.

P(mtDNA | Alice) = e^[ζ(s = 1, n(mtDNA, database) + n(mtDNA, source)) - ζ(s = 1, n(database) + n(mtDNA, source))]    (9)

Whatever the approach, in Q-UEL it usually involves the partially summated zeta function ζ(s, n) = 1 + 2^-s + 3^-s + … + n^-s with the parameter s usually being set at s = 1, or a closely related z function [32]. Probabilities converge to classical values for ample data, i.e. P(A|B) → n(A, B)/n(B) where n is the number of observations as n(A|B) increases. As the amount of knowledge approaches zero, it is affected by
any prior knowledge. Prior knowledge can be included by considering or computing virtual frequencies of occurrence and adding that to any actual counts n [32,37–40]. The required frequencies are readily calculated in conjunction with the sample size N to give the prior probability that is wanted. The virtual frequency added represents the value called “prior”, and it was found that a choice of prior = 0 is also suitable for genomics, consistent with Section 4.4. Details of the sample and the calculation and comparisons, including the choice of prior values, are usually given on a tag as an aspect of provenance. The following kind of data mining output tag was already available in Q-UEL and was readily adapted to genomics [32].
This tag was produced in the present study but, because the amount of mtDNA-disease data was small, it used the kind of contrived input data discussed in Section 4.8, but in this case from a socioeconomic health study [36] appropriate to inclusion of disease prevalence. Here the sample size (500) was useful as not too small but also not too large, so illustrating the above points as to the relationships between classical and zeta probability estimates: note that the values are approaching each other within approximately 4–5%. It is important that a simple specification of prior:=0 on the tag means that all the virtual frequencies are 0, that this is uniformly applied to calculation of all probabilities, and that this overrides any other choices in the flow of information that led to this tag. It indicates the conclusion of a further choice that had to be made because, theoretically, the secondary defaults of Section 4.5 are prior probabilities, and one might expect that they would be introduced (retained) as added virtual frequencies. While it remains an option, it is not done in the genomics case. Largely this is because in evidence based medicine, the probabilities estimated are usually considered as proportions from an actual sample, used to indicate prevalence, incidence and risk. They have a formal definition based on counting, and they are not expected to contain some kind of Bayesian prior belief. Note also that the uniform application of prior = 0 for the virtual frequencies behind all probabilities essentially relates to what is commonly called a Dirichlet prior density, although the standard choice in probability theory actually corresponds to a negative virtual frequency, of -1 [37]. However, this choice justified by Dirichlet theory is somewhat undesirable from an evidence based medicine perspective and counterintuitive from an information theory perspective [37–40]. In the present genomics study it would imply a blurring of the distinction between zero and one occurrences, the latter being an important distinction because it means that the possibility is existentially qualified, i.e. that the event is possible, which implies a rather large leap of belief from not knowing if it is possible.

4.6. Tags expressing translation into peptide and protein sequences

When a gene codes for a protein, initial estimates of severity of effect on the protein help establish probabilities that it is responsible for a disease suspected to be associated with changes to the protein. This is because not all base changes lead to an amino acid change, and not all amino acid changes will have drastic effect. Tables and diagrams of conservative and non-conservative substitutions from which probabilities can be derived are given in Ref. [2] and most protein textbooks. In the present study, amino acid sequences derived from the open reading frames (ORFs) are classified as coding for polypeptides as follows: “oligopeptides” (2–20 amino acids), “peptide-miniproteins” (from 21 to an adjustable number, usually set at 35), and otherwise as “proteins”. The standard IUPAC one-letter code for amino acids is used. With the adjustable parameter set arbitrarily large, an example is as follows.
The patient’s open reading frame identified above has 100% identity with NADH (ubiquinone) dehydrogenase subunit 5 [Homo sapiens] Sequence ID: AAK17360.1 (mitochondrial) in the blast.ncbi.nlm.nih.gov data base when queried on blast.ncbi.nlm.nih.gov (i.e. queried by BLAST). There is a commonly cited sequence at https://www.uniprot.org/uniprot/Q9B1R0 that at blast.ncbi.nlm.nih.gov matches 100% to NADH dehydrogenase subunit 5 [Homo sapiens] Sequence ID: ASS29244.1 (mitochondrial). With respect to AAK17360.1 above, however, it shows an N-terminal (leading end) MLVQLQMKV extension and a C-terminal (finishing end) MTYSPEQSQLQYMHQQTMFNQ deletion.

4.7. Initial clinical and research use case examples

Having generated a Q-UEL-PATIENT tag (Section 4.1), the next step is generation of a tag describing the disease associations and risks that those mutations may represent for the patient. The Q-UEL-DIRACMINER-KMETHOD-3-FACTOR–CHF–SURVEY tag in Section 4.5 above is such an example, but it relates to various populations, not a specific patient. Various public web tools now exist to facilitate that step, so here Q-UEL plays a more passive interoperability role by interacting with the HTML underlying the webpage for the tool via its converters. However, this also forms the basis of a more elaborate study in which one may be able to attach improved probabilities to the mtDNA-disease relationships, apply it to a larger body of patients, and possibly discover new significant mtDNA-disease relationships in that process. For example, using the MITOMASTER tool from that resource at https://www.mitomap.org/foswiki/bin/view/MITOMASTER/WebHome, and transferring the mtDNA for patient 75229 with type 2 diabetes (T2D) in FASTA format, returned a web page indicating “LHON/Diabetes/CPT
deficiency/high altitude adaptation” (recall LHON as Leber’s hereditary optic neuropathy), as indicated by the presence of T3394C. The page www.mitomap.org/foswiki/bin/view/MITOMAP/ClinicalPhenotypes1#Abbreviations displays clinical phenotypes (non-LHON) associated with mtDNA polypeptide gene mutations reported in the literature that could be automatically inserted into host sequences for test purposes.
Appropriate tags similar to that shown in Section 4.5 can be applied to individual patients by use of DiracMiner, and they will then have similar form to the tag example shown there. In the present study, however, DiracSmash was predominantly used for assessing probabilities and risks for individual patients. It generates tags for odds HDN construction (hence the additional odds attributes Ofwd and Obwd - see below). The process of generating associations is usually much faster with DiracSmash (hence the name “Smash”), so these tags are used immediately for inference net construction, and not stored long term or transmitted over the Internet. Typically, therefore, the option is taken for generating the simplest possible appearance sufficient for the inference task and any required curation, uncluttered by web management and provenance detail.

< ‘type 2 diabetes’:=‘Y’ Pfwd:=0.40 Ofwd:=0.67 | if | ‘genomic biomarker’:=mtDNA.T3394C and ‘age(years)’:=‘gt65’ and Gender:=‘M’ and Obesity:=‘Y’ Pbwd:=0.03 Obwd:=1.19 >

For a female over 65, the 0.4 becomes a 0.3 probability. As the amount of data is relatively small, such tags as directly generated above may also be seen as initial guidelines and be curated and refined. Probabilities in particular may be adjusted by interaction with a human expert aided by use of XTRACTOR to obtain supporting information on the Web. Such study shows that Pbwd is reasonably much lower than Pfwd because many potentially causative mutations are found in the type 2 diabetic groups. For example, C1310T, A1438G, A3243G, A8348G, C8393T, C8478T, T8551C, T14577C, and T14709C are known or suspected to be implicated. Literature studies with XTRACTOR indicated that T3394C alone showed only mild defect in glucose-stimulated insulin secretion, but hyperglycemia appears when strengthened by factors such as age and obesity. Even Pfwd, however, is low compared with that for e.g. sickle cell anemia and cystic fibrosis mutations from both parents that guarantee severe disease, and LHON in which three mtDNA mutations have probabilities of 0.80–0.95 for having the disease. Although other mutations along with the T3394C variant can also lead to LHON with a high probability, these mutations were not in the private collection. For T2D, there is some indication that there are fewer other potential serious disease risks, if any, as for patient 75229. In contrast an example non-diabetic patient 4640 carried another LHON/Insulin Resistance/possible adaptive high altitude variant T4218C at rCRS reference position 4216 as T4216C, a variant for vomiting syndrome with migraine, a variant for PD protective factor/longevity/altered cell pH/metabolic syndrome/breast cancer risk/LS risk/ADHD/cognitive decline/SCA2 age of onset, and variant for LHON/Increased MS risk/higher frequency in PD-ADS. Nonetheless, that this patient carried A10398G (rCRS position), associated with longevity, shows that not all variants are “bad news” and many of the above disorders may emerge simply as diseases that tend to occur in old age. A fuller data mining of a larger data set is expected to be able to test these ideas.

4.8. Studies with contrived clinical-genomic data and conclusions for real data

While a great deal of tag design and specification can be done with a small preliminary amount of data, tests on larger amounts are important for some use cases. Tests on large contrived data are sufficient for tag design, but they also revealed some important principles about trends and relationships between values that are not necessarily obvious (Table 1). The medical data record is real, and the genomic feature can be real, but the joining is artificial; however it uses an adjustable relationship factor X as described below rather than random. Preliminary prediction and risk assessment studies used real medical record data (first 100,000 records, then repeated for 500,000 and 667,000 for appreciation of convergence) and adding a hypothetical mutation called “heart attack variant” HAV, entered along with ‘Year of birth’:=<‘1981’, ‘Gender’:= as a requirement for the interdependent necessary factors specified in a “Hitlist” (a logical AND list).
Such constructed models can be described in generalized statements which allow preliminary probability assignments. If the presence of the genomic biomarker is assigned at random to records except that a fraction a with the variant have congestive heart failure (CHF) and 1-a do not, then value X = a/(1-a) is a multiplicative adjustment factor to values indicated with X in Table 1. The actual numbers in Table 1 represent a study in which X = 1. Values in italic bold font are of broadest clinical interest, conditional on the mutation, age, and male for all the available data used. These are useful for disease-variant association studies. In the absence of ability to evaluate full probabilities, multiplying by X for a variant will give estimates of odds. In Table 1, to explore CHF in response to other serious conditions, the HDN was built as described in Methods Section 3.1. That includes use of a query list called “Wishlist” (in this case of diseases better called “Shortlist”) of strong individually sufficient independent factors. Recall that it is a logical inclusive OR list for building block tag selection, but a logical RAND or randomly associated AND list as far as HDN results are concerned, so setting the prediction model defined in terms of the independencies (for estimating probabilities with many factors, all inference nets such as Bayes Nets assume certain independencies). For studies on type 2 diabetes (T2D) below, that independency was not assumed. It comprised ‘Pulmonary circulation disorder’:=‘Y’, ‘Renal failure’:=‘Y’, ‘Diabetes complicated’:=‘Y’, ‘Valvular disease’:=‘Y’. These are generally held to be amongst the comorbidities or antecedents of CHF.
The overall approach is a perturbation method since the odds conditional on the Hitlist of interdependent factors alone (see Methods Section 3.1) could be computed exactly, providing a correction factor for
Table 1
Steps in the DiracSmash HDN calculation of predictive odds and likelihood ratios.

Each step lists three rows: 100K records (Perl code), 500K records (Scala code), and 667K records (Perl code).
The prior odds calculation is called step [0]: PrO = 0.1163 (100K records, Perl code), PrO = 0.0731 (500K records, Scala code), PrO = 0.0424 (667K records, Perl code).
Hitlist: e.g. demographic factors. Wishlist: e.g. clinical factors.
Odds forward: predictive odds. Odds backward: likelihood ratio.
Real: h-complex real part of HDN, to be plotted as ½ [odds fwd + odds bwd].
Imaginary: h-complex imaginary part of HDN, to be plotted as ½ [odds bwd - odds fwd].
Slope: LINEARITY CHECK, slope of vector from (0,0) to this (real part, imaginary part) in h-complex space, e.g. 0.9260/0.7268 = 1.2741 (first row).

Step / Records    Odds forward   Odds backward   Real        Imaginary   Slope

[1] Exact result for inter-dependent factors (Hitlist)
    100K          0.1992 X       1.6529 X        0.9260 X    0.7268 X    1.2741
    500K          0.1064 X       1.4552 X        0.7808 X    0.6744 X    1.1578
    667K          0.0622 X       1.4671 X        0.7646 X    0.7024 X    1.0886

[2] HDN result for inter-dependent factors (Hitlist)
    100K          0.7409 X       6.3700 X        4.1742 X    3.5554 X    1.1740
    500K          0.5685 X       7.7798 X        4.1741 X    3.6056 X    1.1577
    667K          0.5668 X       13.3791 X       6.9729 X    6.4061 X    1.0885

[3] HDN result for inter-dependent + independent factors (Hitlist + Wishlist)
    100K          1.4143 X       12.1601 X       7.5150 X    5.3792 X    1.3970
    500K          1.0188 X       13.9421 X       7.4804 X    6.4616 X    1.1630
    667K          1.3219 X       31.2030 X       16.2624 X   12.8123 X   1.2693

[4] HDN result for inter-dependent + independent factors corrected by perturbation method
    100K          0.3669 X       3.1553 X        1.7611 X    1.3942 X    1.2632
    500K          0.1907 X       2.6077 X        1.2992 X    1.2080 X    1.0755
    667K          0.1145 X       3.4216 X        1.7681 X    1.6535 X    1.0693
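The derived columns of Table 1 follow arithmetically from the two odds columns, using the definitions given in the column headers. The following is a minimal sketch in Python of that arithmetic only; it is not the DiracSmash code, and the function names `h_complex` and `prior_odds_slope` are hypothetical names introduced here for illustration. The second function assumes the prior-odds form of the slope described in the text of Section 4.8, i.e. ½ [1 + PrO] divided by ½ [1 - PrO]. Inputs are taken from Table 1 with X = 1.

```python
def h_complex(odds_fwd, odds_bwd):
    """Return (real, imaginary, slope) per the Table 1 column definitions.

    real      = 1/2 [odds fwd + odds bwd]
    imaginary = 1/2 [odds bwd - odds fwd]
    slope     = real / imaginary (the linearity check column)
    """
    real = 0.5 * (odds_fwd + odds_bwd)
    imag = 0.5 * (odds_bwd - odds_fwd)
    return real, imag, real / imag


def prior_odds_slope(prior_odds):
    """Slope implied by the prior odds alone: 1/2 [1 + PrO] / 1/2 [1 - PrO].

    Assumption: this is the form described in the Section 4.8 text.
    """
    return (1.0 + prior_odds) / (1.0 - prior_odds)


if __name__ == "__main__":
    # Step [1], 100K records: odds forward 0.1992, odds backward 1.6529
    real, imag, slope = h_complex(0.1992, 1.6529)
    print(f"real={real:.4f} imag={imag:.4f} slope={slope:.4f}")

    # Prior odds listed under step [0] in Table 1
    for pro in (0.1163, 0.0731, 0.0424):
        print(f"PrO={pro}: slope={prior_odds_slope(pro):.4f}")
```

Comparing the output of `prior_odds_slope` with the final column of Table 1 is one way to see how far each row's slope is determined by the prior odds alone.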
the full calculation. Because the factors of interest were rather sparse in the data and because of approximations made, and because the Wishlist factors represented a severe perturbation, it would be more correct to say that the likelihood ratio overall lay between 3.4216 X and 31.2030 X in Table 1, a considerable spread in this case, but importantly both predicting CHF as ‘yes’ by being considerably greater than 1 in each case. DiracSmash partly overcomes difficulties in this kind of result by generating simplified models from these results and optimizing their accuracy, sensitivity and specificity using ROC methods [35]. In this case, 90% accuracy, 68% specificity, and 92% specificity were obtained for the 667,000 records. Table 1 shows that the slope of the “vector”, the value in the final column in Table 1, does not vary greatly, and that is typically so even if X is not independent of the prediction target, here CHF. In this case the slope is determined largely by the prior odds for congestive heart disorder, e.g. 0.0731 for 100,000 records and 0.0424 for 667,000 records, its real part ½ [1 + PrO] divided by imaginary part ½ [1 - PrO] giving 1.1579 for 100,000 records and 0.9858 for 667,000. This kind of near-independence provides a general check on any such kind of calculation, which is useful because several assumptions are made even in using completely real data, including the perturbation method along with estimated probabilities based on zeta function for 667,000 records. Note that medical records and in particular hospital records are subject to Berkson bias in that they highlight sick people; the prevalence of CHF in the US in recent decades would put the odds at around 0.01. Finally, note that the ratio between the values (e.g. 2.6077 X) calculated on the above independency assumption and as obtained by data mining (say 3.5492) is of interest in its own right as a measure of interdependency between X and the other factors.

4.9. Semi-automated systematic review with associated alignment studies

A first step in setting out on a more elaborate genomic research study, or to help a physician with a more complex clinical case in accord with the dictates of evidence based medicine [1,33], is systematic review of available knowledge such as published scientific literature [33].
MARPLE that can, in response to the multiple choice answers in turn,
autosurf and spawn XTRACT knowledge tags [45] produced the following when executed in the multiple choice answer medical licensing examination mode. While direct queries are also possible in a single run, multiple choice questions are a convenient way to get the probability distribution over a number of suspected possibilities (the option list could be of much greater length), e.g. as follows.

Equal probabilities for several options (here, C, D, E) are indicative of the fact that knowledge tags concerning the medical effects of inherited mutations are currently sparse in the Knowledge Representation Store (KRS). It is extensions of the present pilot study that will correct, refine and greatly extend that knowledge. However, the state of the KRS as it is at the stage of the study already supported the initial interest in type 2 diabetes (T2D), demonstrated that MARPLE input and output can be inserted into the current Q-UEL tag-based workflow, and initiated prediction studies discussed below. The XTRACTOR component [46] that underlies MARPLE can also be used directly as an efficient means of detecting relevant web pages and extracting content for semi-automated systematic review [33], and the knowledge tags from this enrich the KRS. In the studies on humanin, for example, the following tag was obtained.

The initial zero in e.g. [0https://en.wikipedia.org/wiki/Micropeptide] indicates that the uniform resource locator is a direct link in the text. If it had been a number 1, 2, 3, etc., this would relate to references (if present) of that number below the text in the Web page, if they too have links. The natural language text or structured text in other format is parsed, and sometimes automatically restructured, to facilitate decomposition into several semantic triple subject-relationship-object or linear semantic multiple (LSM) forms [44]. Some studies were carried out on situations that could give rise to tissue stress to which mitochondria can respond with signaling by MDPs, for example,

XTRACTOR was able to uncover a good deal of information about T3394C because it has been well studied for being associated with Leber’s hereditary optic neuropathy, carnitine palmitoyltransferase deficiency, and Alzheimer’s and Parkinson’s disease, amongst others. However, it appears to require another mutation such as A3397G to also be found for these more serious diseases to occur. To understand that kind of logic of gene expression in disease, and explore how mutations expressed as amino acid changes in the same protein can interact in particularly complex ways (sometimes reinforcing and sometimes compensating each other’s effects on protein function), it is necessary to work a lot with DNA and particularly protein sequences. It was found useful to develop means of extracting DNA and particularly amino acid residue sequences from text. One particularly popular format (FASTA) uses a delimiter starting ‘>’. It was re-rendered as &gt to prevent interference with Q-UEL tag delimiters.

XTRACTOR produced XTRACT tags showing that most of the relevant studies by other authors involved type 2 diabetic patients, or subjects living at high altitude, or both. Haplotype CEE15319 includes northeastern Indian populations with T2D, ANA92115 is derived from part of a study enriched pathogenic mutations in Tibetan highlanders. In the case of ADI79705, after the Last Glacial Maximum approximately 26500 to 19000 years ago, improved climate allowed humans to re-colonize the high latitude regions, and the ADI79705 sequence corresponds to the M9a'b haplogroup that is relatively concentrated in Tibet. In European and US patients the T3394C marker presumably reflects ancient origins of the maternal line of descent in North Eastern India and Tibet. T3394C lies in the protein product of the mtND1 gene coding for NADH dehydrogenase subunit 1 (ND1) in Complex I of the mitochondrial electron transport chain, and translates to an H (histidine) at amino acid residue position 30. Replacing the H by Y (tyrosine) in a query sequence, which corresponds to the amino acid residue at that position in the Cambridge reference sequence such that one may refer to a Y30H mutation, there was a 100% match with NADH dehydrogenase subunit 1 [Homo sapiens] Sequence ID: ACA80576.3. That and related sequences with tyrosine at that locus tend to reflect studies on DNA phylogeny ultimately in Eastern and Western Slavs. This tyrosine to histidine substitution would not be considered a highly conservative one due to the uncharged nature of tyrosine and the partial charge (due to the pK ionization value) of the histidine. Tags including alignment of DNA combined with its translation as an amino acid sequence are useful in the KRS, but can be complex to examine visually. In comparing two patients in the present study with and without T2D in the home data base used in the present study, as shown in standard interface display, the following initial part of the gene for subunit 1 shows mutation T3394C (the underlined and bold H in the second row, as the 30th amino acid residue in the protein).

Note the start of the gene with respect to above base pair numbering as corresponding to 3304 in the Cambridge reference sequence rCRS. The use of the term “norm” is to indicate that we might think of this as the “normal” mtDNA although of course this would be a misleading practice. As a cross-check, the above sequence for the patient with T2D and the T3394C marker has a 99% (318/319 residue) match with NADH dehydrogenase subunit 1 [Homo sapiens] Sequence ID: ACA80576.3, which matches the rest of the sequence but retains the original Y (tyrosine) at position 30. Some extended studies were carried out. There are a large number of 99% matches, e.g. 291/292, for partial sequences that include the same Y to H mutation. There is bias inherent, however, in the publicly available sequences reflecting researchers’ interests, notably in the high altitude mutation.

4.10. Prediction use case. Example involving mutation T3394C

No inherited mutations in the patient mtDNA data used have as yet been successful in predicting type 2 diabetes (T2D), with the important exception that use of mtDNA data for subunit 1 of the dehydrogenase did predict that disease. Nuclear DNA may of course play a more important role in other cases. By “computer experiments” making various substitutions including decoys, and by the studies described in Section 4.6, T3394C was identified early as the principal natural “clue” to the prediction method, i.e. it contained most of the information, which was the origin of the focus on T3394C in the present studies.

There is certainly still value in first using the simple approach of the more traditional genomic medicine which looks at the association between single genomic biomarkers and diseases, as opposed to precision medicine that considers other mutations and whole sequences (see Section 1.5). A priori, as is the case for many inherited diseases, even a single marker can turn out to be a strong signal. For this purpose, clinical genomics researchers have recently tended to report odds ratios. Family
studies by other workers have shown that patients with G3316A and T3394C have odds ratios for T2D of 5.2 and 3.2 respectively, compared to the sample taken from the same social and environmental background [49]. The 74 patient data set used in the present study gave an odds ratio of 2.93 for the association between T3394C and T2D.

In general there was qualitative agreement between such a sample odds ratio (SOR), as a likelihood ratio computed from the data itself, as was done to obtain the above values, and the diagnostic odds ratio (DOR) as a measure of the effectiveness of a diagnostic test or prediction. The DOR is defined as the ratio of the odds of the test being positive if the subject has a disease relative to the odds of the test being positive if the subject does not have the disease. For example, an HDN (a) computes a SOR from the probabilities deduced from the sample HDN, and also (b) predicts say disease or not disease if the likelihood ratio is greater than 1 or less respectively (subject to so-called ROC tuning or optimization by a kind of decision constant [32]), from which the DOR can be computed. The DOR is the only option for a neural net, because it does not itself compute odds or probabilities intrinsically but uses arbitrarily dispersed weights. Simply predicting T2D or not T2D from the presence or absence of the marker reduces prediction to a simple evidence based medicine calculation, and SOR and DOR correspond. The more interesting study is the effect of including further information in a more complex multifactor calculation, which is discussed shortly below. Even for the above simpler case, however, one may obtain some better indication of predictive capability by distinguishing a training set used to obtain the SOR from the test set used to obtain the DOR, and in the case of small data, by the technique of “jackknifing”, i.e. simply omitting each patient used for prediction in each use of the training set. Leaving the example use case patient 75229 in the “training set” (meaning here the data-mined set) gave predictive odds of 2.67 and likelihood ratio of 1.53, and taking the patient out of the “training set” as a test case gave 2.33 and 1.34 respectively. The fact that the likelihood ratios (and indeed in this case all these numbers) are greater than 1, combined with the fact that the T3394C marker is present, are all that are needed to predict T2D for that patient on this simple likelihood basis. The above gives 0.70–0.75 for sensitivity depending on whether the test patient is or is not removed from the training set, and similarly a specificity of 0.5–0.55. The former is promising but the latter is essentially indicative of a random relationship, at least in the present case. It highlights the importance of obtaining other information that would help exclude T2D.

In the present study, testing the effect of inclusion of more factors could, of course, only be a test that there are no further significant factors in the available mtDNA data that had T2D information. After high-dimensional (multifactor) unsupervised data mining by DiracSmash, a similar odds ratio of 2.8 was obtained, an important finding if only because one would expect it to increase if there were other positively impacting factors in the available mtDNA with T2D information. Using MITOMASTER as the test set (see Methods Section 3.1), an odds ratio of 2.5 for the HDN and 3.0 for the HDN-like neural net was obtained for predicting T2D from T3394C. Although this supports the above, it is a consistency check at best. If it is assumed that insulin resistance is interpreted only as a risk factor for T2D, i.e. not observed T2D per se, that metabolic syndrome is not considered as necessarily T2D per se, and that “diabetes”, which could be T1D, is considered as T2D, then the predictive agreement is expressible as 2.5. For example, Patient 16325 with T3394C had T2D and was predicted as LHON/Diabetes/CPT
deficiency/high altitude adaptation.

The HDN-like neural net using the final results with the training set as test set gave a DOR of 3.4 and similar results have been obtained for a variety of methods of condensing the input sequence information. Because these studies are very preliminary as far as correctness of predictions and the genomic basis of T2D are concerned, it is the ability of Q-UEL to deliver and the neural net to receive appropriate input for genomics and bioinformatics, and not least the efficiency of the method, which are perhaps of greatest interest. The project included studies to compare the efficiency of different approaches including the stacking of information believed more important toward the front of the input strings presented to the neural net and training on progressively longer input strings [5]. The learning (training) time for neural nets took typically 3 h on an older Lenovo E430 with i3-2350M CPU at 2.3 GHz but that depended on the variety of numbers of nodes and layers explored with slight modifications and on different machines, which will be described elsewhere. More importantly, the time persistently took about two thirds to half that required by the corresponding more standard purely real-valued neural net. While that result is certainly useful and promising, it is disappointing in that something like a four-fold speed-up was expected. It may be that the slight modifications made to try and enhance performance in an otherwise more standard neural net [5] are inappropriate when also used for the h-complex neural net (see Discussion Section 5.3).

4.11. Further analysis regarding Type 2 diabetes and mitochondrial DNA

Amongst those accepted mtDNA mutations known or believed in the literature to be associated with T2D, and found in Web literature with the benefit of XTRACTOR, are C1310T, A1382C, G1438A, A1202G, A3252G, A3256T, A3264C, A3271C, T3290C, C3303T, G3316A, T3394C, T4216C, A8296G, A8344G, G11778A, C12258A, T14577C, T14709C (though these may occur in non-diabetics). Two of these, G3316A and T3394C, occur in the mtDNA ND1 gene. Some researchers identify T4216C (Y304H) as a T2D risk. A few more mutations were also found by XTRACTOR in the literature as potentially related to T2D (e.g. Y30H, Y43C, I273V). The following fairly common mutations potentially concerned with T2D are worthy of note as being in the vicinity of, but not within, the mtDNA ND1 gene: A3252G, A3256T, A3264C, A3271C, T3290C, C3303T. Common natural variants in the amino acid sequence of the mtDNA ND1 gene product exemplified below that are not necessarily associated with T2D or a serious disease state (though a few have been indicated as disease-related), are A4T, Y30C, M31T, M31V, A52T, T87A, T168A, S205P, E214K, Y255C, Y277C, L285P, L288P, Y302H (sometimes reported as Y304H). None of the mutations mentioned in this paragraph showed as strong predictors in the current study, and nor did any other amino acid residues in the sequence, although one might for example expect that the neural net approach would be influenced in some way by the rest of the sequence. The full sequence for the above diabetic patient 75229 is as follows, in which these natural variants are underlined and in bold.

The reason for little predictive power of the HDN and HDN-like neural net for T2D, apart from that inherent in the T3394C (Y30H) marker, is ultimately due to the small sample of patient sources which was also likely biased by the kinds of research and medical interests that gave rise to them. Such small numbers are not unusual in some kinds of medical research but once predictive results are in one can see that the sample is somewhat atypical (unrepresentative) of what one might expect from the knowledge gathered by XTRACTOR. The study did not strongly pick up the known G3316A (A4T) and T4216C (Y304H) markers for T2D because the majority of subjects in the data available for the study that did not have T2D also had these markers. In the example case of the two patients with mtDNA ND1 gene used in the alignment earlier above, the T2D patient did not have the latter accepted mutation but the non-diabetic patient did.

In the case of the T4216C (Y304H) the MITOMAP tool MITOMASTER identified 144 references at the time of the present study but only 3 appeared to focus in depth on T2D, perhaps suggesting that the evidence is considered rather weak by the scientific community as a general association between this marker and diabetes. In addition to the above, there was rather little variation in the above gene sequence within the data that were actually available. As noted above, the sequence for the example patient with T2D and the T3394C marker has an exact match using BLAST with mtDNA ND1 gene sequence ID: ACA80576.3 except for the T3394C (Y30H) mutation and the above-mentioned T4216C (Y304H) mutation. A large number of BLAST matches at the 99% level represent partial sequences that include the same Y to H mutation.

4.12. Research use case. Initial studies and observations on humanin

Certain types of basic research are not so sensitive to having only a small data base of DNA sequences. Even a single sequence can be informative. As noted in Section 1.10 and elsewhere, several small mitochondrial derived peptides (MDPs) are currently suspected to exist as unusually small protein products of genes and several are being discovered (e.g. Refs. [19,20]), the first discovered being humanin. The humanin gene is located at region starting 2633 in the Cambridge reference (rCRS) and encodes a 24 amino acid peptide. The nucleotide bases of the humanin gene also have the separate function of being a part of the ribosomal RNA 16S molecule, as noted in the XTRACT tag in Section 3.5. With the adjustable parameter set to 21–35 to focus on small open reading frames (sORFs) for potential MDPs, and in the first example requesting a richer report on the clinical and genomic features, the result is as follows.

Humanin falls by definition into the peptide-miniprotein class of particular interest here, that are primarily responsible for signaling between the mitochondria and the host cell cytosol. As the above tag indicates, the reference sequence of active humanin in standard IUPAC one-letter code is MAPRGFSCLLLLTSEIDLPVKRRA [20], though the last three residues at the C-terminal are expressed only in the cytoplasmic genetic code translation of the exported mitochondrial mRNA, and omitted in the mitochondrial translation. These C-terminal residues are accepted as non-essential because both 21 and 24-amino acid long peptides have indistinguishable intracellular and extracellular effects. In the course of the studies for the present paper using methods [6,43,44], described in Section 3.2, the following were amongst those humanin sequences found.

MAPRGFSCLLLLTSEIDLPVKRRA* - most quoted humanin sequence on the web.
MAPRGFSCLLLLTSEIDLPVKRRA* - next most quoted humanin sequence on the web.
MAPRGFSCLLLLTSEIDLPVK* - Revised Cambridge Reference Sequence (rCRS) of the Human Mitochondrial DNA (Genbank NC_012920.1) mitochondrial genetic code translation.
MAPRGFSCLLLLTSEIDLPVKRRA* - Revised Cambridge Reference Sequence (rCRS), standard cytoplasmic translation.
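The difference between the mitochondrial and the standard cytoplasmic translation of the same reading frame can be illustrated by a minimal Python sketch. It is not the Q-UEL implementation, and it rests on several assumptions made here for illustration only. The codon tables are the standard genetic code and the vertebrate mitochondrial code, which differ in four codons (AGA and AGG as stop, ATA as methionine, TGA as tryptophan). The DNA string `ILLUSTRATIVE_ORF` is a hand-made back-translation of the published 24-residue peptide with a TAA stop, and is not claimed to be the rCRS nucleotide sequence. The function names, and the reading of the 21–35 parameter as a peptide length in residues, are likewise hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

# Vertebrate mitochondrial code: four codons are reassigned.
MITO = dict(STANDARD, AGA="*", AGG="*", ATA="M", TGA="W")


def translate(dna, table):
    """Translate from the first base, stopping at the first stop codon."""
    peptide = []
    for i in range(0, len(dna) - 2, 3):
        residue = table[dna[i:i + 3]]
        if residue == "*":
            break
        peptide.append(residue)
    return "".join(peptide)


def is_sorf_length(peptide, shortest=21, longest=35):
    """Assumed reading of the adjustable 21-35 parameter: length in residues."""
    return shortest <= len(peptide) <= longest


# Hypothetical coding sequence: a back-translation of the published peptide,
# NOT the rCRS nucleotide sequence.
ILLUSTRATIVE_ORF = (
    "ATG GCT CCA CGA GGG TTC AGC TGT CTC TTA CTT TTA ACC AGT GAA "
    "ATT GAC CTG CCC GTG AAG AGG CGG GCA TAA"
).replace(" ", "")

mito_peptide = translate(ILLUSTRATIVE_ORF, MITO)
cyto_peptide = translate(ILLUSTRATIVE_ORF, STANDARD)
# The same frame with TGA in place of TAA as the final codon.
cyto_variant = translate(ILLUSTRATIVE_ORF[:-3] + "TGA", STANDARD)

print(mito_peptide, len(mito_peptide), is_sorf_length(mito_peptide))
print(cyto_peptide, len(cyto_peptide), is_sorf_length(cyto_peptide))
print(cyto_variant == cyto_peptide)
```

Under these assumptions the mitochondrial table terminates at the AGG codon and gives the 21-residue form, while the standard table reads through to the 24-residue form; exchanging TAA for TGA as the final codon leaves the cytoplasmic product unchanged, since both are stop codons in the standard code.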
The following were found from the present study as mtDNA tags.

MAPRGFSCLLLLTSEIDLPVK* - patient 66115 mitochondrial genetic code translation.
MAPRGFSCLLLLTSEIDLPVK* - patient 75229 mitochondrial genetic code translation.
MAPRGFSCLLLLTSEIDLPVKRRA* - patient 66115 cytoplasmic translation.
MAPRGFSCLLLLTSEIDLPVKRRA* - patient 75229 cytoplasmic translation.

It is known that the humanin sequence is quite well conserved: only about one in some 200 patients has a humanin mtDNA sequence affected by mutations. Whether those that are so affected have medical consequences is an area of research interest in the present studies. It depends on what one considers as the ORF. For example, patient 75229 has type 2 diabetes (T2D) and hypertension, and patient 66115 has no T2D and no hypertension. The DNA sequence was identical in both these cases, but with the interesting exception of the cytoplasmic stop codon, and corresponding to the above translation as follows.

The asterisk * is the stop codon. The first line is the sequence as translated in the mitochondrion. The second line is the sequence as translated in the cytoplasm. Of the three DNA sequences, the first is from the mtDNA of patient 66115, the second from patient 75229, and the third from the Cambridge reference sequence. Patient 66115 has a single accepted mutation with respect to the other sequences, a G replacing an A (A2760G), but both TAA and TGA are stop codons in human cytoplasmic translation. This kind of mutation would typically be considered as degenerate by having no effect, but that is not entirely clear because there may be as yet unknown control mechanisms associated with the gene terminus.

4.13. Research use case. Predicting other mitochondrial derived peptides

There are three types of MDP known at the time of writing, each classified and named after the following by their homology with representative members: humanin, MOTS-c (mitochondrial open reading frame 12S rRNA-c), and members of the ORF8:2953:3054 SHLP1-related 1–6 group (Small Humanin-like Peptides 1–6). The best known members are as follows.

MAPRGFSCLLLLTSEIDLPVK - humanin.
MRWQEMGYIFYPRKLR - MOTS-c.
MCHWAGGASNTGDARGDVFGKQAG - SHLP1.
MGVKFFTLSTRFFPSVQRAVPLWTNS - SHLP2.
MLGYNFSSFPCGTISIAPGFNFYRLYFIWVNGLAKVVW - SHLP3.
MLEVMFLVNRRGKICRVPFTFFNLSL - SHLP4.
MYCSEVGFCSEVAPTEIFNAGLVV - SHLP5.
MLDQDIPMVQPLLKVRLFND - SHLP6.

Humanin was found in all the patients in the present data albeit with the exception of the stop codon variation noted above. Human variants are known, e.g. P3A, S7A, C8A, L9A, L12A, T13A, S14A and P19A, but because humanin has important functions such as protecting neuronal cells, variations may have fairly serious consequences. The recently discovered MOTS-c was also found and is used in the more detailed work-through later below, also noting that variants of less serious consequence are known. In predicting the more varied SHLP1-6 MDPs in particular, there appear to be more mutations with (a) no guarantee that mutations will lead to the standard reference MDP being predicted, and (b) where found it is often not predicted with exactly the same ORF as the standard reference MDP. However, they are left in as predictions because, a priori, that ORF could be the reading
frame in that patient, possibly even with more than one translation. For example:

The current standard view is that at least some 3000 proteins and peptides are directly involved with the human mitochondrion, though the majority of those known are not coded there. It is conceivable that the number of MDPs that at least occasionally exist might on balance be of the order of, say, 500–1000, as follows, even if many turn out to have similar (redundant) functions (which are useful clues to drug designers). In theory, with three reading frames on each strand of mtDNA, there is the capacity for perhaps 3000 or more peptides of MDP size encoded in mtDNA. This is a statement based on the length of the DNA, not the likely open reading frames, although it becomes more reasonable if we include biologically active peptides produced by cleavage of larger polypeptides, which is the better known usual case, or allow for occasional use of starting codons that do not also code for methionine. Focusing only on the ORFs (standard and small), we might expect an upper limit of some 660 based on the percentage of methionine found in human mitochondrial proteins at 6% (rather higher than the rest of the human organism at about 1.8%), but exceptions to the starting codon also coding for methionine are indeed known and might, a priori, be more abundant for a new class of recently discovered entities such as MDPs. It is known that the tRNA specific for methionine (tRNAmet) of human mitochondria contains a formyl-cytosine at the wobble position of the anticodon to facilitate its binding to RNA’s AUG, but also AUA, and there are instances known of AUU. Ignoring the last presumed rarer case, a direct count of ATG and ATA and their anticodons indicates there are still enough starting codons for over 1000 reading frames. There are 85 that satisfy small open reading frame requirements as found in this study, as given in Appendix 1. There the original dates of the finding are kept for provenance, but best matches change: see the comments in Conclusions Section 6.3. Even these do not include several possible overlaps, and are the genes favored by present methods, which may nonetheless be taken as indicators of locations of one or more overlapping MDP genes. The potential problem for predicting MDPs as new polypeptide products from mtDNA is that oligopeptides less than about 35 amino acid residues in length, and certainly of 20 or less, are not traditionally thought of as capable of being compact globular structures and will lack the amino acid residue sequence features characteristic of that. Short peptides are usually presumed to exist in a flexible random coil in aqueous solution, i.e. a statistical-mechanical distribution of many conformations. While biologically active peptides of short lengths are certainly well known, they are indirectly gene products, being derived by cleavage of larger proteins. Short polypeptides that cannot be globular and are directly gene products would have been missed by smart gene finder algorithms in the past.

In the first MDP gene prediction method in Methods Section 3.3 the simple reasonable presumption is that any predicted polypeptide as a direct gene product of circa 12 amino acids or more is actually made (though it might be switched off in many cells and at many times) until proven otherwise. The new terminology in discovering short polypeptide gene products is to locate additional sORFs (short open reading frames) in the mtDNA. MOTS-c peptide was amongst those predicted in this study and turned out to be a recently discovered MDP [20] of 16 amino acids in length, and biologically active. MOTS-c is of considerable current interest because it has dramatic effects on obesity and insulin resistance [47]. The presence of an ORF is recognized as exemplified in the following patient comparison of MOTS-c genes (underlined).

The gene sequence underlined has the following reading frame.

The corresponding Q-UEL tag for this MDP as found in the KRS is as follows.

Using the BLAST tool at https://blast.ncbi.nlm.nih.gov/Blast.cgi, it is identified as known and specifically as named MOTS-c by the following BLAST report: RecName: Full = Mitochondrial-derived peptide MOTS-c; AltName: Full = Mitochondrial open reading frame of the 12S rRNA-c [Homo sapiens]. There is 100% match. Variants also occur in MOTS-c. For example, a variant of AAA as CAA that replaces the K by Q is common in the Asian population, and it has been postulated as an explanation for the high longevity of Japanese people [20,47]:

In addition to the known MDPs above, 85 further candidate MDPs were identified, almost all unknown or unproven as protein products. For example.

At the time of the study up to October 1 2019, no significantly similar sequences for the above were found by the BLAST routine at https://blast.ncbi.nlm.nih.gov/Blast.cgi. Thus far, the prediction methods used to predict MDP genes (Methods Section 3.3) rather remarkably produced the same MDPs. The reason, it is suspected, is that humanin and MOTS-c in the data had amino acid sequences conserved throughout while the SHLP1-6 group was diverse. Evidently, the start and stop codons stood out as the dominant clues in predictions. There is no evidence at the time of the study to reject any of the 85 new proposals. Despite being entered earlier into the KRS (March 2018), they are still expected to be valid as the essential features of the potential ORFs have not changed in the data. However, as described also in Conclusions Section 6.3, there is frequent appearance of new sequences to compare. That also depends on the data base searched and scoring systems. A best BLAST match on October 1 2019 for the above was a hypothetical protein from bacterium Lachnoclostridium phytofermentans with a SHLP1 match of 67.74%, and on December 22 2019 putative proline-rich receptor-like protein kinase of Arabidopsis thaliana, a small flowering plant, was found at https://www.ebi.ac.uk/Tools/services/web/ using database UniProtKB/Swiss-Prot (the manually annotated section of UniProtKB) with a similar match but higher bit-based score. The most interesting hits are of course for
vertebrates, and once found as an actual known MDP, e.g. MLESMSTMGFTTSMLD and QDIMVQPLLKVRLFND, the best match found naturally almost always stays as such. Even so, if similar sequences found are not experimentally identified MDPs, they can represent intriguing matches with sections in globular proteins, e.g.

Here the segment from above NFIFWR-YALLTVTPQLTHYF matches fairly suggestively with NFIAWRLFPLLTLTPQLVLYF of RNA (N(6)-L-threonylcarbamoyladenosine(37)-C(2))-methylthiotransferase MtaB [Chloroflexi bacterium]. Although many such matches might be considered coincidental, they are of increased significance when both from a vertebrate, and particularly a human, mtDNA. The relatively small size of the mitochondrial genome increases somewhat the probability of common origin of similar short sequences within it, perhaps not least because MDPs are short. That means that there is in effect a significant Bayesian prior probability that there may be some evolutionary common origin. Humanin itself MAPFHFWVPEVTQGTPLTSGLLLLTWQ matches with a section MAPRGF----------SCLLLLTSE of ORF37: 4472:5515 NADH dehydrogenase (oxidoreductase) subunit 2. Many of the 85 additional predicted MDPs and some other less likely proposals for MDPs suggestively match with other DNA sources of prokaryotic origin, bacteria or even chloroplast proteins, but sometimes with similar function to mitochondrial proteins, recalling the likely prokaryotic origin of mitochondria and chloroplasts. The predicted MDP of sequence MSSSKPHLSPPWLSSPDEATS has a 76% match with section MASSKP-LSPPSLSSP--ATS (ATP synthase subunit b’ in the chloroplast of flowering plant Morus notabilis), and secondly a 72% match with section EGTKQSAWTQAHTFYFTP (NADH dehydrogenase subunit 4 of Rhinogobius cliffordpopei, a ray-finned fish). In many such cases the sequence resolution is obscure and strengthening the case for a meaningful match with prior belief seems unreasonable.

4.14. Research use case. Validating mitochondrial polypeptide products

Validation of a protein means that the protein product of the
suspected gene has actually been observed in some way, and ideally by phospholipase A2 enzymatic activity is found in cytosolic, microsomal
its extraction and characterization. A failure so far to see the protein is and mitochondrial fractions after stresses such as renal ischemia, and
not definitive of nonexistence because expression may be in some way calcium-independent phospholipase A2 localizes in and protects rat
“switched off”, and peptides may be short lived when expressed. Vali mitochondria during exposure to staurosporine, a potent inhibitor of
dation of the predicted MDPs is nonetheless obviously important, and phospholipid/calcium-dependent protein kinase inhibitor involved in
while experimental studies on MDPs are still in early stages, the apoptosis (“cell suicide”).
following approach has been shown to be interoperable with the Q-UEL In the following example, a progesterone receptor in the mitochon
system. By “polypeptide” here is meant large globular proteins as well as drion seems unlikely, but truncated progesterone receptor (PR-M) lo
small polypeptide (often called oligopeptide) products. The current Q- calizes to the mitochondrion and controls cellular respiration.
UEL system was used to search available text and more structured re Regarding the following, the human p53 protein is a nuclear gene
sources on the web to find evidence that any of the above putative MDPs mutated in approximately 50% of human cancers, but it is known to
are actually produced by the mitochondrion. Unfortunately, none of the involved with p53-mediated mitochondria-dependent apoptosis.
952 tags capturing this knowledge have related to the above predicted Other examples, some expected and others less so, are as follows.
MDPs, but that might simply suggest that the appropriate experimental Recall that in each case, it does not necessarily mean that the pro
evidence, for or against each peptide, is not yet available. Either way, teins are actually mtDNA products. A match with an ORF in mtDNA and
the approach remains of more general interest. There is a significant evidence of physical existence of the product is required.
challenge in experimental verification of which proteins are real and
truly of mitochondrial origin. The inner mitochondrial membrane alone
contains more than 150 different proteins and smaller polypeptides that 4.15. Research use case. Conformational studies on MDPs
are intrinsic to it. In practice, the assignment of many proteins as
physically associated with mitochondria is a matter of degree. That is As a basis for design of modified peptides and more traditional small
because many proteins coded in the nucleus bind to the mitochondrion molecule “tablet” drugs, one wishes to know more about the three
more weakly than others. Many are imported by the mitochondrion, and dimensional structure of the MDPS, and ideally that structure when the
many are associated only in cases of cell stress or disease. All this varies MDP binds to its protein receptor target. A final drug candidate is then a
in tissues and species: some 600–700 distinct types of protein are found small molecule with a similar van der Waal’s and electrostatic binding
in a typical example experimental study on cardiac mitochondria, while surface of the MDP, including corresponding polar/nonpolar features. In
in rat mitochondria, close to 1000 types of proteins have been reported. the absence of an experimental X-ray structure for the complex, re
The top few tags of the 952 collated proteins and peptides, believed searchers would wish to do conformational energy or molecular dy
by researchers to be present within the mitochondrion or at least tran namics computations [2] for MDPs and analogues bound to the target
siently with its membrane (whether coded there or not), are shown receptor. Features of the structure of general mitochondrial import re
below. This data extracted into and intermediate file Vali ceptor TOM20 that binds to MDP-like peptides is known, but the targets
datedMitoProteins.txt derived from a standard data base at http://lifese for biological response to most MDPs do not appear to be known at the
rv.bgu.ac.il/wb/jeichler/MPA/(as indicated on the tag in the “source” time of writing. Again in the spirit of using Q-UEL approach with con
attribute). The reason for the interim file ValidatedMitoProteins.txt is as verters to interact with public freely bioinformatics tools web pages in
follows. Normally knowledge is extracted directly from HTML on web the present study, secondary structure prediction HDN-based methods
pages in the form of Q-UEL XTRACT tags [43,44], which are subse already developed in Q-UEL tag context by the present author [35] were
quently processed into standard Q-UEL knowledge tags in the knowl not used, but GOR4, a version of the GOR method first developed as the
edge representation store. In the present case, the interim file preserves information theoretic approach [2,37] by the present author, is avail
original source content essentially verbatim after relative simple able on a server (https://npsa-prabi.ibcp.fr/cgi-bin/secpred_gor4.pl),
tidying. It retains the information should the source web site be down, and was used. There was also employment of the author’s molecular
and can be curated, annotated, and extended if required, and is finally mechanics and molecular dynamics tools (e.g. Ref. [2], and see Section
converted to standard Q-UEL tag knowledge format. 6.1 below). All these peptide and protein structure prediction methods
Note also. were in good accord with each other, so use of GOR4 below illustrates
The use of such tags in workflow includes the following. XTRACTOR the findings well. This is not least because, modeled in vacuo or in a
was triggered by the superoxide dismutase entry and generated a purely water environment, an MDP molecule has not one fixed structure
number of XTRACT tags, e.g. as follows. although each amino acid residue predominently “wobbles around” a
It is the tetrameric manganese superoxide dismutase or SOD2 protein conformational preference for its backbone and sidechain angles. There
product that is usually considered as mitochondrial, and consequently a is also relevant experimental data (e.g. Ref. [46]). In aqueous solution it
copper-zinc based superoxide dismutase associated with the mitochon is known from experiments that turn-like structures involving residues
drion seem surprising. However, XTRACTOR also found that the copper- G5-L10 and G15-L18 indicating nascent helix nascent helix, and in less
zinc SOD1 is partially localized in the mitochondrial matrix, consistent polar environments it forms a helix spanning G5-L18 (e.g. ref [46).
with the source list of validated proteins. Noting the characteristic tetra-leucine LLLL core of humanin, and the
A second example of a validated protein tag is as follows. strong helix forming tendency of that sequence in globular proteins [2],
It is only fairly recently that researchers provided the first demon an α-helix content or at least tendency to it (that might be realized on
stration that Rad51, which repairs double stranded breaks in DNA, is binding to the receptor) is not too surprising.
also operative in the mitochondrion. In the following further example, a Sequences homologous to humanin in known proteins that might
phospholipase A2 like protein is known to localize in the mitochondria give clues as to the corresponding MDP conformations include a section
in, for example, fungi, but not in humans. However, enhanced of NADH dehydrogenase subunit 2, MAPFHFWV
PEVTQGTPLTSGLLLLTWQ, of which TPLTSGLLLLTWQKL APISI
constitutes an experimentally verified membrane spanning region (residues 122–142) which is helix rich. The GOR secondary structure prediction tool at https://npsa-prabi.ibcp.fr/cgi-bin/secpred_gor4.pl predicts a primarily coil structure for humanin free in solution (consistent with observation in aqueous solution) and for all the 85 predicted MDPs (Appendix), and most often the overall CECEC motif of coil – extended chain – coil – extended chain – coil, including for humanin. That is, there is notably no h (helix) predicted, but only the coil (c) and extended structure (e). For humanin and TPLTSGLLLLTWQKLAPISIMY, shown along with the alignment, the prediction is.
The CECEC motif might suggest the e-sections pairing to form a hairpin pleated sheet, but this is not the experimental evidence [46]. Leucine L is a strong α-helix former so the helical conformation on binding to a target is expected, of which nonpolar solvents are models (albeit lacking specificity). For a peptide in isolation from covalent or binding attachment to a larger sequence, it takes the 12 residue LLLLLLLLLLLL to predict helix content: ccchhhhhhhhc, using the GOR4 method at the above site. A strong initiator and terminator “capping” amino acid residue can strengthen a helix but predictions still require a minimum of 11 leucine residues, ELLLLLLLLLLLK, to generate ccchhhhhhceec. This would be of weak relevance alone, but the above is all reasonably consistent with experimental data for similar circa 12-residue peptides in aqueous or weakly nonpolar solution (e.g. Ref. [2]). But none of the 85 predicted MDPs generated were found to contain even LLL, and several like the example in the last tag above are rich in helix-breaking amino acids, notably the very strong helix-breaker proline P. The other well-verified MDP is MOTS-c, MRWQEMGYIFYPRKLR, which contains few α-helix forming residues, but gives a predicted secondary structure picture ccccccceeeecceec similar to that of humanin. Similarly, those predicted MDPs with more extensive proline and other helix-breaker content give a reminiscent CECCEC picture near the C-terminal end, but are primarily coil. For example, MLPSYKPQSTSSLPSPFPTASTAQHFL yields ccccccccccccccccccccceeceec. It is unlikely that polypeptides so rich in α-helix breakers such as proline (P) and serine (S) will adopt an α-helical conformation at the receptor. Several predicted MDPs do contain a reasonably helix-forming run of amino acid residues such as ALLI. However, in such cases there is often a detectable similarity to humanin and the prediction still favors the CECEC motif.
The LLLL and LLL pattern in proteins with homologous sections may not be purely to do with preferred conformation. By having what may seem a simple hydrophobic sidechain, leucine plays a number of surprisingly important roles in recognition, especially in the immune system. According to knowledge harvested by XTRACTOR, the calmodulin-
binding tetra-leucine LLLL motif of KCNE4 is responsible for association with Kv1.3. The voltage-dependent potassium channel Kv1.3 regulates leukocyte proliferation, activation, and apoptosis, and Kv1.3 function is central for the immune system response. Tri-leucine LLL motifs are involved in the mechanism of internalization of Interleukin-13-bound Interleukin-13Rα2. Di-leucine LL motifs are known to be involved in trans Golgi sorting, lysosomal targeting, and internalization of a number of proteins. The insulin receptor contains four di-leucine pairs in its cytoplasmic domain.

4.16. Some studies on interoperability in precision medicine

While the above studies are of interest as the basis of a local system architecture (notably the BioIngine), Q-UEL was also originally proposed as an interoperability and hub language [7]. Consequently, it is of interest how it might also share DNA and bioinformatics information by translation back and forth between other languages or systems in which a language is the key theme. Simpler to achieve, but still useful, goals are whether Q-UEL can capture some of the knowledge held in other languages, or be expressed in ways that are readily stored in Big Data Base systems.

4.16.1. Clinical pathways
PROforma [79,80] was developed at Cancer Research UK and can receive new information and is also developed for genetic risk assessment, so it is easy to represent essential genomic features, but it has to the author’s knowledge no special facilities (keywords etc.) for bioinformatics results. PROforma is an action-oriented language for clinical pathways. Its task model has a generic task and four task types: plan, decision, action and enquiry. These have been readily rendered as Q-UEL attributes in a large Q-UEL-PROFORMA tag for thyroid disorders kindly given by PROforma’s developer Professor John Fox at Oxford University. The original PROforma file of 10377 lines (586.9 kilobytes) became a Q-UEL-PROFORMA tag file of 8472 lines (550.5 kilobytes). No criticism of PROforma is implied here; the differences are primarily due to different ways of representing the ontological structure, the layout of that, and use of aesthetic line spacing. On the negative side for Q-UEL, Q-UEL applications will not of course (as yet) respond to the items as PROforma applications do. They will, however, acquire some of the knowledge that PROforma captured about clinical pathways, and in return help analyze (and potentially extend) the relations between the various issue-to-response aspects represented in PROforma.

4.16.2. Data base languages
Some studies have concerned re-expressing Q-UEL tags generated by structured data mining [32,35], including actual or contrived DNA sequences, as data base languages. Recall (Section 1.3) that while Q-UEL is quite distinct from the ingenious but now obsolete database language QUEL [13], there are some common principles. Currently in Q-UEL, Knowledge Representation Stores composed of millions of Q-UEL tags can be easily managed and searched by regular expression methods, particularly those using match-and-edit heterogeneous automata [41] which also have an affinity to Q-UEL and can be rendered as such. Originally developed to handle chemical formulae which themselves imply complex graph structures [41], they are even more efficient for DNA, RNA, protein, and other structures that are purely linear. However, the ability to have Q-UEL content expressed and stored using other more standard data bases and public data base languages is of great interest. There are circa 20 types of semantic graph data base appropriate for use with Q-UEL, of which GRAKN [83] is a promising newcomer. Like the obsolete QUEL, GRAKN also rests on semantic graphs and tuple calculus developed by Codd, inventor of the relational data base [13]. Semantic graphs focus on edges as relationships and are more efficient on traversing the edge (chaining through many relationships). There are no special problems found for genomics and DNA representation because, like Q-UEL, graph data bases such as ref [83] are general and can and have readily accommodated DNA sequences etc. However, there are differences if the user is to handle probabilistic aspects of ontological matters. In conversion of Q-UEL to GRAKN language, a frame or schema (e.g. Ref. [82]) in such data base approaches corresponds to a class of simple Q-UEL tag of form < subject attribute Pfwd:=x | if | object attribute Pbwd:=y > without values yet assigned. If the schema is to be treated like a variable to which a probability is assigned, this can be done for two related schemas, one with Pfwd(A|B) and one with Pbwd(A|B), but if the system is to be used generally beyond Q-UEL’s interests it seems desirable to have, in addition, corresponding schemas for Pbwd(B|A) and Pfwd(B|A). This is because the data base system does not know that Pfwd(A|B) = Pbwd(B|A) and Pbwd(A|B) = Pfwd(B|A).

4.16.3. Medical records and messaging
Some preliminary findings may be described in relation to genomic information in languages for representing clinical records and clinical messaging. DiracMiner [32] optionally generates medical records in simple Q-UEL whenever it mines a medical record or messaging file which is expressed in CSV (comma separated value) format [32]. Systems whose main theme is communicating DNA sequences, such as the Genomic Messaging System Language GMSL [43], are not amenable to CSV rendering in that way; GMSL is instead concerned with annotation, data and executables that can be embedded at the appropriate points in the sequence of characters for the DNA itself, and this poses no problem for DNA sequences in the attributes in Q-UEL. However, at present embedded data will simply appear as annotation and Q-UEL will not automatically respond to embedded executable codes, though they could be downloaded and executed as programs. Interconversion between Q-UEL and XML [9] is straightforward because Q-UEL can be considered as an XML extension. It is harder in the Q-UEL to XML direction, particularly if the user exploits the Q-UEL option of more elaborate attribute structures expressing ontologies. XML attributes must be created to hold the Q-UEL relationship expressions (or they can be embedded between XML tags), and in XML only logical AND is a plausible interpretation between attributes, but DNA in attributes poses no additional problems. For several digital patient record and medical messaging languages, some useful degree of interconversion with Q-UEL is possible with converters [7], but again carrying DNA information causes no additional problems. To carry genomic content from XML-based HL7 to Q-UEL Version 3 in a way that would be agreeable to most stakeholders is facilitated because HL7 has been extended to accommodate the standard terminology Genetic Variation Model [7]. Also GMS notation [43] can provide an alternative approach to the above (see Chapter 5 of ref [1]).

5. Discussion

5.1. Implications for precision medicine

These preliminary studies demonstrate that Q-UEL can be extended from more traditional medicine to precision medicine with genomic content. With the structures of new tags required now specified, it is now possible to move on to diverse, more advanced, studies, especially predictions by HDNs and HDN-like neural nets based on larger data. The deeper examination of the proposed MDPs as a basis for novel pharmaceuticals will be of particular interest to the author and collaborators. However, a question of interest to the scientific community is whether Q-UEL, or at least its core features such as the HDN, provides significant advantages over other approaches. The answer in other medical domains has been asserted to be yes, based on a variety of arguments and use case studies described in the published papers, probably the most central of which relate to the demonstrable advantages of the HDN [4–6] over the Bayes Net and similar probabilistic inference net methods. As HDN provides the theory of Q-UEL and Q-UEL provides the building blocks and means of HDN construction, they are considered inseparable, at least in the author’s strategy. It remains, however, that most of the
observations made in the present study are also potentially important, not just for Q-UEL, but for any proposed interoperability, ontology or probabilistic semantics language working with genomics. It is as yet unclear whether one can reasonably say that there are fundamental advantages in using Q-UEL in the genomics and bioinformatics domain if one addresses that domain alone, but inclusion of knowledge from genomics should be seen as part of the Q-UEL vision for a thinking web for medicine as a whole [7].

5.2. Biomedical findings in the present study

Predictions by Bayes Nets, HDNs and neural nets arguably can generate new knowledge at least in the sense that the diagnosis or prediction of, say, a disease for a patient may be unexpected or unsure. Somewhat similarly, new knowledge can also potentially be generated from XTRACT tags by probabilistic deductive reasoning by some Q-UEL applications [44,70]. As far as biomedical knowledge is concerned, however, predictions of relationships between genomic biomarker (variant, inherited mutation) T3394C and type 2 diabetes were tests of methodology but did not produce knowledge new to science. It is amongst several mutations in NADH Ubiquinone Oxidoreductase that have also been of interest to researchers for some time (e.g. Refs. [49–52]). In contrast, some of the MDPs proposed and listed in the Appendix might yet turn out to be real and previously unknown to science. In many cases, the question of whether a finding by Q-UEL is discovery or rediscovery is not so clear. A reviewer suggested that a key findings discussion might be helpful to appreciate findings of the kind “XTRACTOR also found that the copper-zinc SOD1 is partially localized in the mitochondrial matrix …” (Section 4.14). Most often, XTRACTOR is used to aid the kind of research that a researcher should do in an initial review of the literature [33], or that a physician should do according to the practices of evidence-based medicine [33,44]. As far as can be ascertained for the present study, none of the insight provided by XTRACT tags represented knowledge new to science, with a possible exception discussed below. One may still ask if at least the user’s examination of the collective assembly of XTRACT tags suggested new ideas. This is arguably the case in preliminary study and explorative side studies. XTRACTOR was useful for explorative study in certain cases such as when one XTRACT tag mentions a genomic biomarker associated with a list of diseases, and another XTRACT tag mentions another mutation with another list, when the disease lists overlap extensively. This provided useful insights and may as yet lead to new knowledge. To approach this more automatically, MARPLE/HDNstudent that uses XTRACTOR can be used in a slightly different way to link, say, biomarkers in a “question” with different biomarkers in candidate “answers” by automatically finding XTRACT or other Q-UEL tags that illustrate such overlap [44]. However, experience with this suggests the value of modified forms of such software to make the most of the underlying algorithmic features in genomics and bioinformatics, as will be reported elsewhere.
One line of investigative study using XTRACTOR that may have helped discover something of a more general biomedical nature is as follows. Note first that T3394C is already accepted by authorities as a likely example of a founder mutation, i.e. a mutation that has survived natural selection by conferring an advantage, probably at some stage in prehistory, for example the ability to hunt in a state of starvation, and to function well at high altitude. These ideas have been well explored by others (e.g. Ref. [51]). Also known is that risk for metabolic syndrome including T2D has been shown to be reduced at high altitude [52]. But because the mutation occurs in a protein involved in NADH:ubiquinone oxidoreduction which reduces ubiquinone to ubiquinol to carry electrons in the electron transport chain, and because statin drugs inhibit mevalonate synthesis and hence also ubiquinone (coenzyme Q) production, the present author is exploring the possibility that diabetes type 2 might be triggered by use of statins such as Lipitor. If so, a potential simple therapy for the effects of T3394C is use of oral supplements of coenzyme Q, fairly readily available in pharmacies and health food shops. No previous discussion of that matter was found that specified T3394C and statins at the time of writing, but popular dietary websites do commonly list coenzyme Q as a supplement for a plethora of diseases including diabetes, so the extent of any novelty is at this time unclear. Such cases argue in favor of the XTRACTOR approach for biomedical research: the extent to which knowledge is new or is well known is not always well defined, and a matter of degree.

5.3. Hyperbolic neural networks

The present work introduced HDN-like neural nets (of course any advantage shown is unlikely to be limited to genomics). Again, the work is preliminary, but more conclusions can be drawn as to their likely prospects than one might expect because, as noted in the Introduction and Theory Section 2.2, the useful relation to Dirac’s quantum mechanics (and hence consistency with Q-UEL) may be new information but the use of an h-complex neural net is not new (e.g. Refs. [31,57–62]). In early HDN and Q-UEL papers this work was missed because HDN origins came from a different direction [54] and because of the huge variety of names and guises for h that occur in the literature [6]. It was quickly realized that it added credibility to the HDN as a predictive method even though the HDN is different and novel. The distinction between neural nets and inference nets like the Bayes Net is that neural nets are said to work “bottom up” by calibrating weights to bring output and observation into agreement, while Bayes Nets and HDNs work top down by pre-assigning weights as probabilities based on prior knowledge. The “only” but important common feature of HDNs and h-complex neural nets is in the attraction of using h-complex values a + hb. They contain information that models twice the dimensionality of corresponding real values. As shown as early as 1993, that means that the decision boundary of a complex-valued neuron consists of two hypersurfaces and divides a decision region into four equal sections [61]. An h-complex neural net can solve the XOR problem, i.e. can learn to respond to a pattern containing X or Y but not both, in a single neuron [62]. From this one may expect that any given number of nodes will do a more efficient job than the same number using weights that are real values. As discussed in Results, replacing the real values in the neural net by h-complex values (effectively pairs of numbers) only reduced learning time by between a third and a half, when from the above one might expect more. A possible reason is that our neural nets, while essentially standard, did incorporate some features that may have been counterproductive in the h-complex case [5]. Notably, modifications included particular ways of using prior probabilities and of starting with expected important features ranked toward the beginning of the record as input, and then progressively lengthening the binary representation from the beginning [5]. This is being explored. The previous studies [31,57–62] did not seek to make any significant link to QM, and this may be more important than first appears because the attempt to do so within the Q-UEL framework appears insightful and might promise significant advances. Previously, Q-UEL sought to represent the universal quantum state or wave function Ψ by an empirically chosen, useful feature, with which diverse medical data had a strong association. This was typically gender combined with age or age group. Now the principal difference between the HDN and the HDN-like neural net is that Ψ, which represents the weights, is found by optimization, i.e. by learning. The main effort regarding neural nets in the present paper was to show their relationship with the HDN and Q-UEL, to describe that relationship in terms of Dirac notation, and to lay a basis for Q-UEL delivering prior knowledge to neural net algorithms. The importance of this is to make learning more efficient, avoid entrapment in local optima, and lead to a more meaningful dispersal of weight across the nodes that can be related to probability. Proof that this is the case awaits results of studies currently being initiated. For the present, the binary 0/1 “match code” proved to help efficiency by significantly reducing the size of input, in contrast with training on whole DNA sequences. Although the ‘1’ could be any one of three possible bases out of AGCT, certain transitions are favored, and the indication that there is a difference and potentially a defect seems sufficient information for an initial study.
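The binary match code can be illustrated with a minimal sketch. It assumes that the code is formed position by position against an aligned reference sequence of the same length, with 0 for a base identical to the reference and 1 for any of the three other bases. The function names `match_code` and `is_transition` and the toy sequences are hypothetical and for illustration only; this is not the implementation used in the study. The second helper flags whether a difference is a transition (A<->G between purines, or C<->T between pyrimidines).

```python
# Hypothetical sketch of a 0/1 match code: 0 = same as reference, 1 = differs.
# The equal-length alignment and the toy sequences are assumptions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def match_code(reference: str, sample: str) -> str:
    """Return a 0/1 string marking the positions where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        "0" if r == s else "1"
        for r, s in zip(reference.upper(), sample.upper())
    )


def is_transition(ref_base: str, alt_base: str) -> bool:
    """True for A<->G or C<->T substitutions (transitions), False otherwise."""
    pair = {ref_base.upper(), alt_base.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


reference = "ACGTTAGC"  # toy reference fragment (assumed)
sample = "ACGCTAGA"     # toy patient fragment (assumed)
print(match_code(reference, sample))  # 00010001
print(is_transition("T", "C"))        # True
```

In this encoding a variant such as T3394C, a T to C substitution and hence a transition, would contribute a single 1 at its position, so the input per base shrinks from a four-letter symbol to one bit at the cost of not recording which alternative base is present.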
5.4. Is Q-UEL quantum mechanics?

It is not usual for the author and collaborators to use the word “quantum” in the title of a paper. It has seemed somewhat brash to do so, and in the past the term has often been inappropriately applied by others in the literature outside of physics. Even the use of the Q declared as “Quantum” in Q-UEL was simply intended to reflect the use of Dirac notation and mathematics of theoretical physics for broader use, as Dirac suspected would be possible (e.g. Ref. [71]). It is simply the effectiveness in applications in medicine (and potentially other disciplines outside of physics) that ultimately matters. However, Dirac’s system as developed for QM continues relentlessly to provide guidelines, insights, and useful tools, of which the HDN-like neural net introduced in this study, at least with further development, may emerge as an example. Even if seen as an h-complex neural net without starting from preassigned probabilities as planned, this now intellectually links Q-UEL, the HDN and probabilistic semantics to the neural net and Deep learning domains. It is this kind of development that facilitates the interoperability both between ontologies and between probability theories [7] to which Q-UEL aspires.
Our approach may reasonably be considered as QM seen more broadly, but that still seems within QM as Dirac saw it. The present author found h by considering what aspects of Dirac’s system would lead to classical probability results [54], first defining a special multiplication operator @ for i-complex numbers. Like many others since Cockle [55,56], Dirac had also rediscovered h, but it is not at first obvious because he used it primarily not as a new imaginary number but in the form of matrices (γ matrices) and linear operators (e.g. σ). Dirac’s γ0 (γtime) and spinor projectors are notably already representations of h-complex values, as are ι and ι* in Q-UEL (Eqn. (1)), written as ½(1 + γ5) and ½(1 - γ5) in physics. Here γ5 has the same hyperbolic property γ5γ5 = +1. There are several flavors of h as well as i, γ5 = -iγ0γ1γ2γ3. Here γ1γ2γ3 relating to space are flavors of i. Dirac’s probability amplitudes of brakets etc. can thus be both h and i complex overall: exp(hi x) = exp(-ih x) = cos(x) + hi sin(x) is still periodic, a wave function, although a purely h-complex exponential is a hyperbolic one: exp(h x) = cosh(x) + h sinh(x). The Q-UEL system could also be described as the result of a Lorentz rotation i → h of Schrödinger’s i-complex wave mechanics, arguably relating to “collapse of the wave function”, i.e. it becomes classical [54]. For example, <p|x> for momentum p = mx and position x leads to a term in exp(-mx²/tħ) for mass m, distance x, time t and Planck’s reduced constant ħ, which is a normal (Gaussian) distribution resembling diffusion. Q-UEL does not illustrate quantization, at least not extensively, but that arises with the wave properties responsible for “quantum weirdness”. Eigenvalues as a result of Hamiltonian-like operators do have their analogues in Q-UEL, but this is for future discussion by relating to expected experimental values other than probabilities, but note that eigenvalues of h as +1 and -1 (and 1 and 0 for the spinor projectors) determine the two values in the probability dual of Q-UEL as eigensolutions. The key questions are whether we still inherit the full machinery of QM, and whether we find that useful. The argument is that it allows the Bidirectional General Graph for inference, and a probabilistic semantics closely related to natural language. Some features that appear different in such practical applications are also found in certain important situations in QM. Subject-verb-relationships follow bra-operator-ket rules but are always related to probabilities; nonetheless this is permissible for any measurements based on 0, 1 outcomes and projection and related operators in QM. Probably the major unfamiliar step in putting the results of Dirac’s system to use is in forming Hermitian matrices and related vectors with purely h-complex elements, but at least one approach, by zipper vectors and corresponding matrices [33] (see also Eqns (9) and (10) above) is straightforward. In general, theoretical physics still remains the guideline for Q-UEL.

6. Conclusions

6.1. General observations

Q-UEL can be readily formulated for the genomics and bioinformatics domains without departing from its general specification, but some new attributes with rather different features were required in order to carry DNA and protein sequences. Since Q-UEL tags were designed to be read by medical services if much of the IT infrastructure breaks down in a disaster, as well as for ease of development and maintenance [7], sequences and alignments also needed to be displayed appropriately. It is likely that analogous findings will apply to any interoperability or reasoning language, including for a future Thinking Web [7] beyond the Semantic Web [9] that is currently being developed. On the one hand, the above conclusions, while important to establish, are not surprising, and the changes required to the current repertoire of Q-UEL tags are not dramatic. On the other hand, the features and their uses are numerous and fairly complex and require some bioinformatics experience, so their documentation with associated observations should be useful to workers interested in knowledge management systems in medicine. There may of course be future refinement or corrections, and the present study is the first pass at the special specification of Q-UEL in the genomics and bioinformatics domain. Of broader interest are new algorithms or software applications that use the tags, and in the present case that primarily means the HDN-like neural net, and also any findings or predictions of the kind discussed in Section 5.2 and regarding MDPs in Section 6.3 below which may have novel, and ultimately testable, aspects. Once biomedical research is chosen as a use case, it raises the question of which discoveries may actually be novel. In Discussion Section 5.2 it was noted that it is not always an easy question, and a matter of degree (at least not without very deep analysis of the kind that occupies the patent office [88]). One approach, which is readily automated, is to consider whether using all the associated items from data mining as a query to Medline produces few or many hits: few hits suggest novelty [89]. Also, the publications represented by those few hits can also suggest biomedical and population health mechanisms and processes discovered by researchers that justify and explain the associations [89]. In the present study, however, for those findings that appeared to be particularly interesting, evidence of discovery was either very clear, or notably absent at time of writing. See Section 6.3 for a special comment on this in regard to proposed new MDPs.

6.2. Interoperability, discovery, extensibility and future work

The developer of an interoperability language faces a dichotomy, seeking to promote the scientific and technical advantages, and hoping for new discoveries, but at the same time arguing that it can accommodate well, and ideally make equally good use of, many alternative approaches by other workers. Nevertheless, whatever is included, one might seek to argue that the whole is greater than the sum of the parts embraced. In that, two points arise. The first is that while the HDN and Q-UEL may appear unusual and special, that approach is taken because the use of the h-complex number enhances interoperability: it leads to a broader approach that includes Bayes Nets and other technologies as variously subgraphs and subsets. Consequently, Q-UEL embraces these established approaches well and can serve them just as well. The second is somewhat converse to the above, being that maintaining consistency with the broader, fundamental principles for at least some of the tools added, rather than simply just plugging in established software, may also be important. For example, the introduction of the HDN-like neural net extends the Q-UEL/HDN idea and provides a deeper useful relationship to Q-UEL probability theory. Future efforts will include using Q-UEL to introduce prior knowledge into compatible h-complex neural nets [57–62] prior to training, alongside linking precision medicine to
current Q-UEL efforts in regard to tissue biobanks [34], and linking medical data to modeling peptides and proteins [2,81,87] as a path to drug discovery [81] as discussed below in Section 6.3. Research into MDPs and identification of potential new MDPs may represent basic drug discovery for a new class of pharmaceutical candidates. Another next step in the Q-UEL project will be to link related computational drug candidate selection methods and techniques for modeling the proteins with which they interact and their mode of interaction with them (e.g. Ref. [81]). Another future step of great interest will be the examination of mtDNA sequences from many patients to see what inherited mutations are found in MDPs and how these might relate to health and disease, but it is intriguing that this could in principle yield suggestive results even prior to the above validation.

As also regards future work, one may ask how extensible other features in the present paper may be outside genomics and bioinformatics, and whether they are worth progressing for more general use. Many are obviously specific to those domains, but some may also be specific while seeming more general. For example, while HDN-like nets are likely to be generally useful, the use as input of a binary 0/1 assignment, in which (say) 1 represents any change from a standard reference, may be more restricted. There may be concerns about that approach even within genomics (these can easily be addressed in various ways, e.g. see Results Section 4.1), but the idea seems reasonable because one is usually initially simply testing whether any mutation matters, i.e. whether that observed in a patient or donor is a suspect for causing or helping cause a disease. Outside genomics, however, the specific nature of any kind of difference may matter much more at the outset. Despite that, exploring various ways of presenting only the most relevant features to neural nets will be of obvious general interest.

6.3. The mitochondrial derived peptides

Of dominant interest in the present research use cases were the potentially new mitochondrial signaling factors (the MDPs). The fact that they are peptides holds great promise for drug discovery and design. That is because the ability to synthesize and even design active, quite large peptides and proteins that are wholly or partly composed of D-amino acids has been feasible for some years [82–86], as a first step to achieving oral tablets. Various prediction methods were used here, but many MDPs might be found by changing standard criteria for the open reading frame (ORF), particularly to allow shorter length. The emphasis has been on the use of Q-UEL, or a language like it, to facilitate such work. Nonetheless, it remains that some may be discoveries because these were not found in the literature by the author at the time of writing. Also most did not have amino acid sequence matches at that time to peptides on public databases that were immediately identifiable as actual or potential MDPs. Indeed, most matches (but not all) are unlikely evolutionary relatives as discussed in Results Section 4.13. Best matches can of course change. Q-UEL-ORF-PROTEIN tags in the supplementary data Appendix 1 carry best sequence match information at the time of the 2018 study, this being their proper provenance within Q-UEL, and there have been no drastic revisions, but best matches change frequently. For example, the putative MDP MAACLMLVPFDRGDLEGELTGTGMLACVILL originally found a partial match with a K(+)-transporting ATPase subunit F of Lechevalieria aerocolonigenes, a bacterium, and the original dates are retained as provenance, but as of 12/22/2019 a complete match was found, except for isoleucine I in place of the initial methionine M, with Hypothetical protein CINCED_3A015187 in Cinara cedri, an insect (a conifer aphid), using BLASTP at https://blast.ncbi.nlm.nih.gov/Blast.cg. See also Results Section 4.13. Ideally, of course, one should first be assured that the predicted MDPs actually do exist. It is to that end that the list of predictions of potential MDPs is given in the Appendix, for interested researchers to consider and find published, or even provide, experimental validation.

6.4. The general importance of the mitochondrial DNA choice

Mitochondrial DNA was chosen for initial genomics studies because of its relative simplicity in terms of small size and because many patients are having their mtDNA sequenced because of genealogical interests, but mtDNA is no "poor cousin" of the DNA in the cell nucleus. It is important in its own right because of its fundamental relation to metabolism and major diseases. Efforts to collate access to worldwide mtDNA data should benefit the community through forensic science as well as through medical genetics [90]. New techniques based on in vitro fertilization (IVF) have emerged in the past few years and have the potential to prevent the transmission of serious mtDNA diseases via mitochondrial donation [91], and this also hints at direct engineering of mtDNA to avoid disease, alleviate subclinical issues, enhance robust health and promote longevity. These are all sweeping claims for such a small part of human DNA, but they are by no means unique to the present author. Currently, possibly no other sections of human DNA compressed into so few base pairs have so much diverse impact on human health and affairs, providing interesting and timely test cases for development of systems for management and productive use of biomedical knowledge.

Declaration of competing interest

This paper is provided to the community to promote the more general applications of the thinking of Professor Paul A. M. Dirac to human and animal medicine in accordance with the charter of The Dirac Foundation, to emphasize the advantages and simplicity of the basic form of the Hyperbolic Dirac Net, to encourage its use, and to propose at least some of the principles of the associated Q-UEL, a universal exchange language for medicine, as a basis for a standard for interoperability. These mathematical and engineering principles are used, amongst many others in an integrated way, in the algorithms and internal architectural features of the BioIngine.com, a distributed system developed by Ingine Inc. DE for the mining of, and inference from, Very Big Data for commercial purposes.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.compbiomed.2020.103621.

References

[1] B. Robson, O.K. Baek, The Engines of Hippocrates. From the Dawn of Medicine to Medical and Pharmaceutical Informatics, Wiley, 2009.
[2] B. Robson, J. Garnier, Introduction to Proteins and Protein Engineering, Elsevier, 1988.
[3] J. Pearl, Bayesian networks: a model of self-activated memory for evidential reasoning (UCLA technical report CSD-850017), in: Proceedings of the 7th Conference of the Cognitive Science Society, University of California, Irvine, CA, 1985, pp. 329–334.
[4] B. Robson, Hyperbolic Dirac nets for medical decision support. Theory, methods, and comparison with BNs, Comput. Biol. Med. 51 (2014) 183–197.
[5] B. Robson, Bidirectional general graphs for inference. Principles and implications for medicine, Comput. Biol. Med. 108 (2019) 382–399.
[6] S. Deckelman, B. Robson, Split-complex numbers and Dirac bra-kets, Commun. Inf. Syst. 14 (3) (2015) 135–149.
[7] B. Robson, P.T. Caruso, U.J. Balis, Suggestions for a web based universal exchange and inference language for medicine, Comput. Biol. Med. 43 (12) (2013) 2297–2310.
[8] G.S. Ginsburg, H.F. Willard, Genomic and Precision Medicine. Foundations, Translation, and Implementation, Elsevier Inc., 2017.
[9] M. Alloghani, D. Al-Jumeily, A. Hussain, A.J. Aljaaf, J. Mustafina, M. Khalaf, S.Y. Tan, The XML and semantic web: a systematic review on technologies, international conference Big data analytics, data mining and computational intelligence (last accessed 9/18/2019), http://www.iadisportal.org/digital-library/the-xml-and-semantic-web-a-systematic-review-on-technologies, 2019.
[10] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, 2000.
[11] D.R. Sutton, J. Fox, The syntax and semantics of the PROforma guideline modeling language, J. Am. Med. Inform. Assoc. (5) (2003) 433–443. Cartwright.
[12] A.P. Dawid, Beware of the DAG, Workshop and Conference Proceedings, J. Mach. Learn. Res. 6 (2008) 59–86. NIPS Workshop on Causality.
[13] Wikipedia, QUEL query languages (last accessed 10/5/2019), https://en.wikipedia.org/wiki/QUEL_query_languages.
[14] Wikipedia, Correlation does not imply causation (last accessed 11/25/2018), https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation.
[15] K.J. Laidler, M.C. King, Development of transition-state theory, J. Phys. Chem. 87 (15) (1987) 2657–2664.
[16] I.E. Scheffler, Mitochondria, Wiley, 2007.
[17] San Diego supercomputer center biology workbench (last accessed 7/18/2029), http://workbench.sdsc.edu/.
[18] D. Smith, B. Robson, High Throughput Insight: Web-Based Collections of Bioinformatics Tools Catalyses Scientific Inquiry into Subtle Aspects of Gene Structure and Function, IBC Library Series B, 1998.
[19] C. Lee, K. Yen, P. Cohen, Humanin: a harbinger of mitochondrial-derived peptides? Trends Endocrinol. Metab. 24 (5) (2013) 222–228.
[20] C. Lee, K.H. Kim, P. Cohen, MOTS-c: a novel mitochondrial-derived peptide regulating muscle and fat metabolism, Free Radic. Biol. Med. 100 (2016) 182–187.
[21] B.S. Weir, The rarity of DNA profiles, Ann. Appl. Stat. 1 (2) (2007) 358–370.
[22] The Crown Prosecution Service, DNA-17 profiling (last accessed 3/7/2018), https://www.cps.gov.uk/legal-guidance/dna-17-profiling.
[23] Family Tree DNA (US), Learning center (last accessed 3/7/2018), https://www.familytreedna.com/.
[24] H. Chial, J. Craig, mtDNA and mitochondrial diseases, Nat. Educ. 1 (1) (2008) 217.
[25] J. Knight, Bearing False Witness - Chance Matches Are Much More Likely with mtDNA Tests, New Scientist, 2019 (last accessed 9/2/2019), https://www.newscientist.com/article/mg15721231-800-bearing-false-witness-chance-matches-are-much-more-likely-with-mtdna-tests/.
[26] F.A. Kaestle, R.A. Kittles, A.L. Roth, E.J. Ungvarsky, Database limitations on the evidentiary value of forensic mitochondrial DNA evidence, Berkeley Law School repository, Winter 12-1-2006, https://scholarship.law.berkeley.edu/cgi/viewcontent.cgi?referer=https://www.google.co.uk/&httpsredir=1&article=3544&context=facpubs, 2006.
[27] D.H. Kaye, G. Sensabaugh, Reference manual on DNA identification evidence, in: Reference Manual on Scientific Evidence, third ed., National Academies Press, Washington, 2011, pp. 129–210.
[28] G.S. Gorman, A.M. Schaefer, Y. Ng, N. Gomez, E.L. Blakely, C.L. Alston, C. Feeney, R. Horvath, P. Yu-Wai-Man, P.F. Chinnery, R.W. Taylor, D.M. Turnbull, R. McFarland, Prevalence of nuclear and mitochondrial DNA mutations related to adult mitochondrial disease, Ann. Neurol. 77 (5) (2015) 753–759.
[29] M. Ramanjaneya, I. Bettahi, J. Jerobin, P. Chandra, C.A. i Khalil, M. Skarulis, S.L. Atkin, A.-B. Abou-Samra, Mitochondrial-derived peptides are down regulated in diabetes subjects, 31 May, Front. Endocrinol. (2019) (last accessed 9/2/2019), https://www.frontiersin.org/articles/10.3389/fendo.2019.00331/full.
[30] P.A.M. Dirac, A new notation for quantum mechanics, Math. Proc. Camb. Philos. Soc. 35 (3) (1939) 416.
[31] M. Kobayashi, Hyperbolic Hopfield neural networks, IEEE Trans. Neural Netw. Learn. Syst. 24 (2) (2013) 335–341.
[32] B. Robson, S. Boray, Implementation of a web based universal exchange and inference language for medicine. Sparse data, probabilities and inference in data mining of clinical data repositories, Comput. Biol. Med. 66 (2015) 82–102.
[33] B. Robson, Studies in using a universal exchange and inference language for evidence based medicine. Semi-automated learning and reasoning for PICO methodology, systematic review, and environmental epidemiology, Comput. Biol. Med. 79 (2016) 299–323.
[34] B. Robson, S. Boray, Studies of the role of a smart web for precision medicine supported by biobanking, personalized medicine, FTG, Pers. Med. 13 (2016) 4.
[35] B. Robson, S. Boray, Studies in the extensively automatic construction of large odds-based inference networks from structured data. Examples from medical, bioinformatics, and health insurance claims data, Comput. Biol. Med. 95 (2018) 147–166.
[36] B. Robson, S. Boray, Studies in the use of data mining, prediction algorithms, and a universal exchange and inference language in the analysis of socioeconomic health data, Comput. Biol. Med. 112 (2019) 103369, https://doi.org/10.1016/j.compbiomed.2019.103369.
[37] B. Robson, Analysis of the code relating sequence to conformation in globular proteins: theory and application of expected information, Biochem. J. 141 (1974) 853–867.
[38] B. Robson, Clinical and pharmacogenomic data mining: 3. Zeta theory as a general tactic for clinical bioinformatics, J. Proteome Res. 4 (2) (2005) 445–455.
[39] B. Robson, Clinical and pharmacogenomic data mining. 1. The generalized theory of expected information and application to the development of tools, J. Proteome Res. 2 (2003) 283–301.
[40] B. Robson, The dragon on the gold: myths and realities for data mining in Biotechnology using digital and molecular libraries, J. Proteome Res. 3 (6) (2004) 1113–1119.
[41] B. Robson, T.P. Caruso, U.G.J. Balis, Suggestions for a web based universal exchange and inference language for medicine. Continuity of patient care with PCAST disaggregation, Comput. Biol. Med. 56 (2014) 51–66.
[42] FOSWIKII (last accessed 7/9/2019), https://www.mitomap.org/foswiki/bin/view/MITOMASTER/WebHome.
[43] B. Robson, R. Mushlin, Genomic messaging system for information-based personalized medicine with clinical and proteome research applications, J. Proteome Res. 3 (5) (2004) 930–948.
[44] B. Robson, S. Boray, Data-mining to build a knowledge representation store for clinical decision support. Studies on curation and validation based on machine performance in multiple choice medical licensing examinations, Comput. Biol. Med. 73 (2015) 71–93.
[45] R.C. Malenka, E.J. Nestler, S.E. Hyman, Chapter 13: higher cognitive function and behavioral control, in: A. Sydor, R. Y (Eds.), Molecular Neuropharmacology: A Foundation for Clinical Neuroscience, second ed., McGraw-Hill Medical, NY, 2009, pp. 313–321.
[46] D. Benaki, C. Zikos, E. Livaniou, M. Vlassi, E. Mikros, M. Pelecanou, Solution structure of humanin, a peptide against Alzheimer's disease-related neurotoxicity, Biochem. Biophys. Res. Commun. 329 (1) (2005) 152–160.
[47] C. Lee, J. Zeng, B.G. Drew, T. Sallam, A. Martin-Montalvo, J. Wan, S.-J. Kim, H. Mehta, A.L. Hevener, R. de Cabo, P. Cohen, The mitochondrial-derived peptide MOTS-c promotes metabolic homeostasis and reduces obesity and insulin resistance, Cell Metabol. 21 (3) (2015) 443–454.
[48] M. Tawata, M. Ohtaka, E. Iwase, Y. Ikegishi, K. Aida, T. Onaya, New mitochondrial DNA homoplasmic mutations associated with Japanese patients with type 2 diabetes, Diabetes 47 (2) (1998) 276–277.
[49] A.D. Pranoto, The association of mitochondrial DNA mutation G3316a and T3394c with diabetes mellitus, Folia Medica Indonesiana 41 (2005) 1.
[50] D.-L. Tang, X. Zhou, X. Li, L. Zhao, Variation of mitochondrial gene and the association with type 2 diabetes mellitus in a Chinese population, Diabetes Res. Clin. Pract. 73 (1) (2006) 77–82.
[51] L. Kang, H.-X. Zheng, F. Chen, S. Yan, K. Liu, Z. Qin, L. Liu, Z. Zhao, L. Li, X. Wang, Y. He, L. Jin, mtDNA lineage expansions in Sherpa population suggest adaptive evolution in Tibetan highlands, Mol. Biol. Evol. 30 (12) (2013) 2579–2587.
[52] A. Lopez-Pascual, M. Bes-Rastrollo, C. Sayón-Orea, A. Perez-Cornago, J. Díaz-Gutiérrez, J.J. Pons, M.A. Martínez-González, P. González-Muniesa, J.A. Martínez, Living at a geographically higher elevation is associated with lower risk of metabolic syndrome: prospective analysis of the SUN cohort, Front. Physiol. (2017), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5209344/.
[53] K. Gürlebeck, W. Sprössig, Quaternionic and Clifford Calculus for Physicists and Engineers, John Wiley & Sons, 1997.
[54] B. Robson, The new physician as unwitting quantum mechanic: is adapting Dirac's inference system best practice for personalized medicine, genomics and proteomics? J. Proteome Res. 6 (8) (2007) 3114–3126.
[55] J. Cockle, On certain functions resembling quaternions and on a new imaginary in algebra, London-Dublin-Edinburgh Philosophical Magazine 3 (33) (1848) 435–439.
[56] J. Cockle, On a new imaginary in algebra, Philos. Mag. 34 (1849) 37–47.
[57] S. Buchholz, G. Sommer, in: S.-I. Amari, C.L.M. Giles, M. Gori, V. Piuri (Eds.), A Hyperbolic Multilayer Perceptron, IEEE Computer Society Press, 2000, p. 129.
[58] T. Nitta, S. Buchholz, On the decision boundaries of hyperbolic neurons, in: Proceedings of the 2008 International Joint Conference on Neural Networks, IJCNN, 2008.
[59] R.S. Savitha, S. Suresh, S. Sundararajan, P. Saratchandran, A new learning algorithm with logarithmic performance index for complex-valued neural networks, Neurocomputing 72 (2009) 16–18.
[60] Y. Kuroe, T. Shinpei, H. Iima, Models of Hopfield-type Clifford neural networks and their energy functions – hyperbolic and dual valued networks, Lect. Notes Comput. Sci. 7062 (2011) 560.
[61] T. Nitta, An analysis of the fundamental structure of complex-valued neurons, Neural Process. Lett. 12 (3) (1993) 239–246.
[62] T. Nitta, Solving the XOR problem and the detection of symmetry using a single complex-valued neuron, Neural Netw. 16 (8) (2003) 1101.
[63] A.Y. Khrennikov, Hyperbolic quantum mechanics, Adv. Appl. Clifford Algebras 13 (2003).
[64] A.Y. Khrennikov, Contextual Approach to Quantum Formalism, Springer, Netherlands, 2009.
[65] A.Y. Khrennikov, On quantum-like probabilistic structure of mental information, Open Syst. Inf. Dyn. 11 (3) (2004) 267.
[66] J. Kunegis, G. Gröner, T. Gottron, On-line dating recommender systems, the split complex number approach, (Like/Dislike, similar/dissimilar) (last accessed 06.01.14), http://userpages.uni-koblenz.de/~kunegis/paper/kunegis-online-dating-recommendersystems-the-split-complex-number-approach.pdf.
[67] S.E. Haupt, A. Pasini, C. Marzban (Eds.), Artificial Intelligence Methods in the Environmental Sciences, Springer Science & Business Media, 2008.
[68] Wikipedia (last accessed 12/7/2019), https://en.wikipedia.org/wiki/Probabilistic_semantics.
[69] S.F. Pileggi, Probabilistic semantics, international conference on computational science (ICCS 2016), Procedia Comput. Sci. 80 (2016) 1834–1845.
[70] B. Robson, POPPER a simple programming language for probabilistic semantic inference in medicine, Comput. Biol. Med. 56 (2014) 107–123.
[71] P.A.M. Dirac, Nobel prize banquet speech (last accessed 10/15/2019), http://www.nobelprize.org/nobel_prizes/physics/laureates/1933/dirac-speech.html, 1933.
[72] J.G. Petrunić, Conceptions of Continuity: William Kingdon Clifford's Empirical Conception of Continuity in Mathematics, 1868-1879.
[73] C. Musès, Hypernumber, Ann. N. Y. Acad. Sci. 138 (1967) 10.
[74] J. van Eijck, S. Lappin, Probabilistic semantics for natural language, http://www.dcs.kcl.ac.uk/staff/lappin/nasslli/nasslli2012/vanEijck-lappin_probabilistic_semantics11.pdf, 2012.
[75] N.D. Goodman, D. Lassiter, Probabilistic semantics and pragmatics: uncertainty in language and thought, https://web.stanford.edu/~ngoodman/papers/Goodman-HCS-final.pdf, 2015.
[76] D. Clarke, B. Keller, Efficiency in ambiguity: two models of probabilistic semantics for natural language, in: Proceedings of the 11th International Conference on Computational Semantics, Association for Computational Linguistics, London, 2015, pp. 129–139.
[77] L. Predoiu, H. Stuckenschmidt, Probabilistic models for the SW – a survey (last accessed 04.29.10), http://ki.informatik.uni-mannheim.de/fileadmin/publication/Predoiu08Survey, 2009.
[78] M. Zongmin, Z. Fu Zhang, Y. li, C. Jingwei, Fuzzy Knowledge Management for the Semantic Web, Volume 306 of Studies in Fuzziness and Soft Computing, Springer, 2013.
[79] Open Clinical (last accessed 1/7/2020), http://www.openclinical.org/gmm_proforma.html, 2013.
[80] J. Fox, V. Patkar, R. Thomson, Decision support for healthcare: the PROforma evidence base, Inf. Prim. Care 14 (2006) 49–54.
[81] B. Robson, R. Dettinger, A. Peters, S.K.P. Boyer, Drug discovery using very large numbers of patents: general strategy with extensive use of match and edit operations, J. Comput. Aided Mol. Des. 25 (5) (2011) 427.
[82] Graken Ltd, Building intelligent systems starts at the database (last accessed 1/7/2020), https://www.grakn.ai/, 2019.
[83] B. Robson, Beyond proteins, Trends Biotechnol. 17 (8) (1999) 311–315; B. Robson, Doppelganger proteins as drug leads, Nat. Biotechnol. 14 (1996) 892–893.
[84] G.M. Figliozzi, M.A. Siani, L.E. Canne, B. Robson, R.J. Simon, Chemical synthesis and activity of D, superoxide dismutase, Protein Sci. 5 (suppl. 1) (1996) 72, 15.
[85] M.A. Siani, U. Hirotsugu, W. Gong, D.A. Thompson, G.G. Brown, J.M. Wang, Chemically synthesized SDF-1α analogue, N33A, is a potent chemotactic agent for CXCR4/Fusin/LESTR-expressing human leukocytes, J. Biol. Chem. 272 (40) (1997) 24966–24970.
[86] B. Robson, Pseudoproteins: non-protein protein-like machines, the sixth foresight conference on molecular nanotechnology (last accessed 9/72019), https://foresight.org/Conferences/MNT6/Abstracts/Robson/index.html, 1998.
[87] B. Robson, A. Vaithilingham, Protein folding revisited, in: Progress in Molecular Biology and Translational Science, Vol 84: Molecular Biology of Protein Folding, Elsevier Press/Academic Press, 2008, pp. 161–202.
[88] B. Robson, The concept of novel compositions of matter. A theoretical analysis, Intellectual Property Rights, Intel Prop Rights 1 (2013) 108.
[89] I.M. Mullins, M.S. Siadaty, J. Lyman, K. Scully, G.T. Garrett, G. Miller, R. Muller, B. Robson, C. Apte, S. Weiss, I. Rigoutsos, D. Platt, S. Cohen, Data mining and clinical data repositories: insights from a 667,000 patient data set, Comput. Biol. Med. 36 (12) (2006) 1351.
[90] L. Prieto, B. Zimmermann, A. Goios, A. Rodriguez, G.G. Paneto, C. Alves, A. Alonso, C. Fridman, S. Cardoso, G. Lima, M.J. Anjos, M.R. Whittle, M. Montesino, R.M.B. Cicarelli, A.M. Rocha, C. Albarrán, M.M. de Pancorbo, The GHEP–EMPOP collaboration on mtDNA population data - a new resource for forensic casework, Forensic Sci. Int. Genet. 5 (2) (2011) 146–151.
[91] G.S. Gorman, J.P. Grady, Y. Ng, A.M. Schaefer, R.J. McNally, P.F. Chinnery, P. Yu-Wai-Man, M. Herbert, R.W. Taylor, R. McFarland, D.M. Turnbull, Mitochondrial donation - how many women could benefit? N. Engl. J. Med. 372 (2015) 150130091413004.

Barry Robson BSc(Hons) PhD DSc, Professor Emeritus Medicine (Evidence Based Medicine, Epidemiology & Biostatistics) was for five years Chief Scientific Officer, IBM Global Healthcare, Pharmaceutical, and Life Sciences and, prior to that, for six years the Strategic Advisor at IBM Global Research Headquarters (T. J. Watson Research Centre). For those 11 years he held the prestigious title of IBM Distinguished Engineer. According to Barry's two page biography written by journalist Brendan Horton in Nature (389, 418–420, 1997), Barry was a pioneer in bioinformatics, protein modelling, and computer-aided drug design. He is the recipient of several honours including the Asklepios Award for Outstanding Vision in Science and Technology at the Future of Health Technology Congress at M.I.T. in 2002. He has helped start up several other companies or divisions in the UK and USA. Barry continues as CEO of The Dirac Foundation in the UK, and Distinguished Scientist (Admin.) at the University of Wisconsin-Stout Department of Mathematics, Statistics, and Computer Science. He is also cofounder of Ingine Inc., Delaware USA, a medical A.I. company. While continuing to work for, and then collaborate with, IBM, he was also University Research Director and Professor of EBM, Biostatistics and Epidemiology at St. Matthew's University School of Medicine, which he helped establish in its earlier days in the Cayman Islands. Barry also holds a Harvard-Macy Course Certification in the Business of Medical Education. Immediately prior to joining IBM in 1998, he was hired as Principal Scientist at MDL Information Systems in California to help put together the technology for the multimillion sale of a bioinformatics system to the holding company forming Craig Venter's Celera Genomics that produced the first draft of the human genome. Prior to that, he was CSO of Gryphon Sciences (later Gryphon Pharmaceuticals) in South San Francisco, California, a bio-nanotechnology ultrastructural chemistry start-up largely held and then acquired by SmithKline Beecham. Before moving to the US, Barry was the scientific founder of Proteus International plc in the UK, designing and leading the development of the PROMETHEUS Expert System and its underlying GLOBAL Expert System, bioinformatics and simulation language for drug, vaccine, and diagnostic discovery. It sold for the equivalent of $9.4 million to the pharmaceutical industry in the mid-1990s. At Proteus, he also led the team that used the above Expert System to invent and patent several diagnostics and vaccines including the Mad Cow disease diagnostic subsequently marketed worldwide by Abbott. He has over 300 scientific publications in Nature, Science, J. Mol. Biol., Biochemical J., including some 50 patents and two books: "The Engines of Hippocrates. From the Dawn of Medicine to Medical and Pharmaceutical Informatics" (Robson and Baek, 2009, Wiley, 600 pages) and "Introduction to Proteins and Protein Engineering" (B. Robson and J. Garnier, 1984, 1988, Elsevier, 700 pages). He has contributed to several reports to governments (EU, US, Denmark) including Panels of the National Innovation Initiative for "Innovate America" published by The Council on Competitiveness, Washington D.C. (2004) as a whitepaper to the President of the United States. He was also an advisor in relation to a major scientific computer-aided drug design collaboration and network for Peter Feinstein Consultants between US scientists and the Russian Science City Arzamas. For some five years, Barry was a Nature "News and Views" Correspondent on biomolecules. He was Visiting Scholar Stanford University School of Medicine 1997–1998, Professorial Lecturer Mount Sinai NYC during part of his period at IBM Corporation, and held visiting positions and professorships in INRA and U. Paris-Sud France, and at the Technical University of Copenhagen under Sir Rodney Cotterill, as well as a postdoctoral position at Oxford (Wolfson College) under Sir David Phillips while Lecturer and Reader in Biochemistry at the University of Manchester.