Lives of the Scientist

Genetic Basis of Differentiation

Nostoc

NH3 N2

NH3
Events in time and space . . .
. . . driven by patterned gene expression
Genetic Basis of Differentiation
How?

Environmental Signal        Developmental Response
NH3
ATP
histidine
P    P

Histidine Kinase        Response Regulator
NpR3010                 ???

AATAAAGCTTTACAAACCAAA
CTCTGGCTTCAATTGTGTAAC
CCAAGCTTTGATTCTTTCCTCT
GTTAAATCGGATTGATTATCTT
CATCAAGGGCAAGACCTACAA
ATTTACCATCACGAACAGCTT
TAGACTCACTGAATTCATAAC
CTTCTGTAGGCCAATAGCCAA
CTGTTTCACCACCATTTTCTGA
Genes Functionally Related to His Kinase
Histidine Kinase
Nostoc punctiforme
NpR3010

Anabaena PCC 7120

Trichodesmium

Synechocystis PCC 6803

. . . (13 total) Find similar genes
Conserved Blast
>npun_22dec03_Contig1_revised_geneNpR3010
MWHIQDSIITLSNHNQYLTFYKNQVKNPERFCRNVNQFDSQIDFVSCDIL
ELKDGRFFEQYSKPLRLAEEIIGTVWSFRDITESQQAKEENRRIIQQEKQ
LAEDRAYFTSMIFHEFRNPLNIISYSTSLLKRHSHHWSEEKKLQCLQNLQ
TAVEQINQFTDEVLIIESVEAGKLQYELKPIDLNLFCREVLAEMSLYTKG
ASQFLLFQNK*
>npun_22dec03_Contig1_revised_geneNpR3008
LSPYLEACCLRISASVSYQRAAEDIEYLTGVEVSKSVQQRLVHRQNFELP
QVESTVEELSVDGGNIRIRTIKGQVCDWKGYKATCLHEKQAIAASFQENS
LVIDWVKSQSIAPILTCLGDGHDGIWNIVRDFAPEHQRREVLDWFHLMEN
LHKIGGSNQRLNQAKILLWQGKVDDAIAVFADCQLKQAFNFCTYLEKHRH
RIVNYQYYQAEQICSIGSGAIESTVKQIDRRTKISGAQWKSDNVPQVLAQ
RQSLSQWINLCSLNKNWDAPMKSSVERLSDYPVAR*
A new family of proteins?!
A type of transposase?
TRANSPOSON

transposase

...ATTTCTCTAGAAAGGCTGAAGGGGGGACAAGCACCCGAAAGCCTTTGTGCT...
...TAAAGAGATCTTTCCGACTTCCCCCCTGTTCGTGGGCTTTCGGAAACACGA...

...ATACAGTCAGCTTTATAGGCTTCATGTCGCCCCTTCAGCTAGAAAGGTACATA...
...TATGTCAGTCGAAATATCCGAAGTACAGCGGGGAAGTCGATCTTTCCATGTAT...
Is NpR3008 a transposase?
Observation

* Photos courtesy of www.webshots.com and Peter Smallwood
Filters: Information reducers
Squirrel filter
Filters: Information reducers
Molecular filter
Filters: Information reducers
Sequence filter

TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG
AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT
TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC
TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC
GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC
CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA
TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA
AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG

CTCCGTAAAC CTCTAAC...
AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA
TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG
CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC
GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG
GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT
CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA
ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA
TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT
CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA
CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG
AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC
CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA
TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT
How do Biologists use Bioinformation?

What genes are in my organism?

TCTACTTATA TTCAATCCAC AGGGCTACAC
AAGAGTCTGT TGAATGAACA CATACATGGT
TTCTGTCTGC TCTGACCTCT GGCAGCTTTC
TGGATTTCGG AACTCTAGCC TGCCCCACTC
GAACCTTAGT GACTTCTGCT ATACCAAAGT
CTCCGTAAAC CTCTAACATG ATGTCAGCAA
TGAATAAACT TTGTTAAAGG TACAAATGAA
AAGAGTTTAA AGTTAAAAAC GAATTGCAGT
AAACCTGTAT GGTTACATGA ACTGCCTAAA
TTATATATTT TAAGAAATTA ATTGCAATTA
CCCCAGCTGT CATTAAAAAG AGGCAAATAC
GACAGCACTG ACCCTCAAGA AGGCACCGGC
GCTGAAATTC CGCTGAGAGC AGAGTGGTAC
CCCTGCACCA GGTCTTTCCT GTGGGCACTG
ATGAATGACT GAACGAACGA TTGAATGAAA

Gene finder
Interpolated Markov model

Candidate genes        Predicted genes

Conform to standard model
Challenge accepted beliefs
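The first filtering step on this slide, turning raw sequence into candidate genes, can be sketched in a few lines. This is a minimal illustration, assuming a candidate is simply an ATG-to-stop open reading frame on the forward strand; the function name and the `min_codons` parameter are hypothetical. It does not implement the interpolated Markov model, which a real gene finder would use to decide which candidates become predicted genes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def candidate_orfs(dna, min_codons=3):
    """Return (start, end) pairs for forward-strand ATG...stop stretches.

    Coordinates are 0-based and end is exclusive. Spaces in the input
    (as in the 10-letter blocks on the slide) are ignored.
    """
    dna = dna.upper().replace(" ", "")
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None  # look for the next candidate in this frame
    return sorted(found)

print(candidate_orfs("CCATGAAATTTTAGCC"))
```

Raising `min_codons` is itself a filter: it discards short candidates, including any short genes that happen to be real.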
Filters are powerful

Globin

Highly filtered output
• Easy to grasp
• High-level insights
Filters Constrain New Discovery

Globin

Highly filtered output
• Easy to grasp
• High-level insights

Unfiltered output
• Confusing
• Basic insights
Filters are tempting

Globin
The Death of Science
Current State of Affairs
1. Need high-level filters
2. Need access to raw phenomena
AATAAAGCTTTACAAACCAAACTCTGGCTTCAAT
GTGTAACCCAAGCTTTGATTCTTTCCTCTGTTAAA
TCGGATTGATTATCTTCATCAAGGGCAAGACCTA
CAAATTTACCATCACGAACAGCTTTAGACTCACT
AATTCATAACCTTCTGTAGGCCAATAGCCAACTG
TTCACCACCATTTTCTGAAATTTTTTCCTCTAGAA
ACCGCAACACTATCACCACCAAACTCCTTCTGAA
TATTTCTGATTCAGTTTGGGTATTGCCTGTTTGAG
Current State of Affairs
1. Need high-level filters
2. Need access to raw phenomena
3. Need ability to build new tools

ASSIGN K12-set FROM Gene-finder (K12-DNA)
ASSIGN O157-set FROM Gene-finder (O157-DNA)
CONSIDER EACH protein IN O157-set
WHEN Constituent-of (K12-set, protein) = FALSE
COLLECT protein
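The pseudocode above amounts to a set difference. A minimal Python sketch, assuming each gene finder run has already been reduced to a collection of protein identifiers and that membership means an exact identifier match (a real comparison would more likely rely on sequence similarity). The function name and the toy identifiers are hypothetical, not output from any gene finder.

```python
def proteins_unique_to(query_set, reference_set):
    """Return proteins present in query_set but absent from reference_set, sorted."""
    reference = set(reference_set)
    return sorted(p for p in set(query_set) if p not in reference)

# Toy stand-ins for Gene-finder (K12-DNA) and Gene-finder (O157-DNA).
k12_set = {"protA", "protB", "protC"}
o157_set = {"protA", "protB", "protC", "toxin1", "toxin2"}

print(proteins_unique_to(o157_set, k12_set))
```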
We need…

Biologists . . .

. . . and Programmers
Current State of Affairs
1. Need high-level filters
2. Need access to raw phenomena
3. Need ability to build new tools

Need biologist programmers
AATAAAGCTTTACAAACCAAA
CTCTGGCTTCAATTGTGTAACC
CAAGCTTTGATTCTTTCCTCTG
TTAAATCGGATTGATTATCTTC
ATCAAGGGCAAGACCTACAAA
TTTACCATCACGAACAGCTTT
GARYGACTCACTGAATTCLAR
ATAACCTTCTGTAGGCCASON
ATAGCCAACTGTTTCACCACC
TATTCAAAATGAATTATATCGGTAACTTTAGTACAGAAAATGACGTTAAGA
ATATCTGCAACTTTAAACCTGAATGATATTATTATTGGCGGGCCTCCATGCCAG
GGATTTAGTATTGCTGGGCCAGCCCAAAEALAVGIASTCCTAAAGATCCTAGAAATG
GTTTAGAATTTTCATCAACTTTGCACAATGGATAAAATTTCTTGAACCTAAAGCGTTTGTC
ATGGAAAACGTGAATTCAAAAGGATTGCTATCAAGGAAAAATGCAGAAGGTTTTAAAGTTATAG
ATATTATTAAGAAAACATTTGGAATTCGAGAACTTGGTTATTTTGTCGAAGTATGGGTTTTAAATGCTG
CGGAATATGGCATTCCGCAAATTAGAGAACGGAATTCGATTTTTATTGTTGGCAATAAAAAAGGTAAAGTACT
AGGTATTCCTAAAAAAACACATTCTCTGCAATTTTTAAGAATTCGATTTAAATAGGTCTCAATTATCGATCTTCGATGAT
ATGAGTATTATACCTGCACTAACTTTGTGGGACGCAATATCAGACTTACGAATTCGACAGAACTTAATGCGCGTGAAGGAAGTGAA
GAGCAACCCTATCATTTAAAACCTCAAAATACTTATCAGACTTGGGCTAGAAATGGTAGTGGAATTCGATACGCTTTACAATCATGTTGCAAT
GGAACATTCTGACCGTTTAGTAGAACGTTTCCGGCATATAAAATGGGGTGAATCCAGTTCGGATGTATCTAAAGAAGAATTCGACATGGAGCTAGACGACGT
AGTGGTAATGGTGAATTATCAAACAAATCATATGATCAGAATAATCGCCGTTTAAATCCTCATAAACCGGAATTCGAATTCTCACACTATTGCTGCGTCATTCTATGCTAATTTTG
TCCATCCTTTTCAACATCGAAATTTAACAGCCCGTGAAGGAGCTAGAATCCAATCTTTTCCAGATAACTATAGATTTTTTGGAAAAGAATTCGAATTCAAACTGTCGTATCTCATAAACTATTGCATCGA
GAAGAAAGATTTGATGAAAAATTTCTTTGTCAATATAATCAAATCGGTAATGCTGTACCCCCTCTTCTCGCTAAAGTAATTGCACATCATCTTCTAGAGAAATTAGGAATTCGAATTCAGTTATGCCAACAACTGATAGAAATCCTCTA
GTGCATGGATCAAATCTTGAACAAAAAGAGAATCATCGTACAAAATACAGAGATACTGAAAGCAGGACTTTCCTTAGAGAAATCAGAACTGAATATGACAAATGGCATAAAGCAAATATGAACCTGGAATTCGAATTCGAGTTGGACCAAAATCAGAAATTACTGACCA
AGATGATTCAATTATTACTCAAAGAGTGGAACTTCTCACTAAATATAAAGATTTTTTAGATCAGCAGCATTATGCAGAAAAATTTGATTCAAGATCCAACCTTCATTCTAGTGTTTTAGAGACCATTTATAAAGTAAATCTTTAGACGACTAGACGACGTAGCGAATTCGAATTCGAATTCATAATACGAGTCATAACGGCATATA
TG
GCAGCCTCACTCATTTCTGGGAGACGCTCATAATCCTTACTGAGACGACGGTACTGGTTTAACCAGCCAAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCTGAAATTCTTGATTAGTACGCCGGATTACCTCAACATGAGCTTGAATCATCAGCCAAACAGAGAGCGCAAATTTATCACCGTCATAGCCGGAATCAACCCAGATGACTTGAATTCGAATTCGAATTCGAACAACTTTTTCCAGTAATTCTGGAC
GCTCTTCTAACAGTTCCATCAAAGTATAGGCGGCAAGTAATCTTTCTCCAGCATTTGCTTCACTTACAACCACTTTTAACAAAAGTCCCAGACTATCAACCAAAGTTTGCCGCTTTCGTCCTTTTACCTTCTTGCCACCATCAAAACCGTACACATCCCCCTTTTTTCAGTCGTTTTTACCGACTGGCTGTCTGCCGCGATCGCCGTGGGTTGAGTTGACTTCCCCATTTTTTGACGAACTTGATCGCGCAAAGTATGATTCATTTCAGTTGAACTAGGAGGAAAATCCCCTGGAAGCATATCCCACTGAATTCGAATTCGAATTCGAATTCGAATTCGAC
Why hasn’t this happened?
Part of a bioinformatics program written in C

if (pcInFile == NULL) pfInFile = stdin;
else pfInFile = fopen(pcInFile, "r");
pfOutFile = fopen( pcOutFile, "w" );
if (pfInFile == NULL) { fprintf( stderr, "ERROR opening %s\n", pcInFile ); exit(1); }

if (pfOutFile == NULL) { fprintf( stderr, "ERROR opening %s\n", pcOutFile ); exit(1); }

fputc( fgetc(pfInFile), pfOutFile ); /* deal with first '>' in file */
for ( ; ; )
{
    if (processIdentifier( pfInFile, pfOutFile )) { }
    else { break; }
    if (processSequence( pfInFile, pfOutFile )) { }
    else { break; }
}
fclose( pfInFile );
fclose( pfOutFile );
Why hasn’t this happened?
Part of a bioinformatics program written in Perl

# Return every match of $pattern in the string, overlapping ones included, as "[start..end)"
sub match_positions {
    my $pattern;
    local $_;
    ($pattern, $_) = @_;
    my @results;
    local $matchStart;
    my $instrumentedPattern = qr/(?{ $matchStart = pos() })$pattern/;
    while (/$instrumentedPattern/g) {
        my $nextStart = pos();
        push @results, "[$matchStart..$nextStart)";
        pos() = $matchStart+1;
    }
    return @results;
}
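For readers who do not speak Perl, the routine above appears to collect every place a pattern matches, including overlapping matches, and to report each as a half-open interval. A rough Python equivalent is sketched below under that reading; it is an illustration of the idea, not a line-by-line port, and edge cases such as anchors may behave differently.

```python
import re

def match_positions(pattern, text):
    """Return "[start..end)" for every match of pattern in text, overlaps included."""
    regex = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        match = regex.search(text, pos)
        if match is None:
            break
        results.append(f"[{match.start()}..{match.end()})")
        # Resume one character after the match start so overlapping hits are found.
        pos = match.start() + 1
    return results

print(match_positions("ATA", "ATATATA"))
```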
Why hasn’t this happened?

Biologists will not come to programming

Programming must come to biologists
BioLingua
Genetic Basis of Differentiation

Environmental Signal Developmental Response
NH3
P

Histidine Kinase Response Regulator

NpR3010 ???
Genetic Basis of Differentiation
NpR3010
RR HK-upstream HK HK-downstream
BioLingua
<1>> (GENES-DESCRIBED-BY "response regulator" IN Npun)
:: (#$Npun.NpF0304 #$Npun.NpR0355 #$Npun.NpR0450 #$Npun.NpF0484
#$Npun.NpR0589 #$Npun.NpF0832 #$Npun.NpF0906 #$Npun.NpR0956
#$Npun.NpF1084 #$Npun.NpF1085 #$Npun.NpR1109 #$Npun.NpF1184
#$Npun.NpF1278 #$Npun.NpR1450 #$Npun.NpF1453 #$Npun.NpF1516
#$Npun.NpR1633 #$Npun.NpR1678 #$Npun.NpR1683 #$Npun.NpR1688
#$Npun.NpF1776 #$Npun.NpR1779 #$Npun.NpF1800 #$Npun.NpR1903
#$Npun.NpR2091 #$Npun.NpF2162 #$Npun.NpR2263 #$Npun.NpF2346
#$Npun.NpF2364 #$Npun.NpR2420 #$Npun.NpR2902 #$Npun.NpF2972
#$Npun.NpR3053 #$Npun.NpF3084 #$Npun.NpR3197 #$Npun.NpR3241
#$Npun.NpF3659 #$Npun.NpF3676 #$Npun.NpR3733 #$Npun.NpF3829
#$Npun.NpR3907 #$Npun.NpR3959 #$Npun.NpF3972 #$Npun.NpR4101
#$Npun.NpR4160 #$Npun.NpR4165 #$Npun.NpF4214 #$Npun.NpR4435
#$Npun.NpF4460 #$Npun.NpR4503 #$Npun.NpR4743 #$Npun.NpR4768
#$Npun.NpF4909 #$Npun.NpR5015 #$Npun.NpF5034 #$Npun.NpF5044
#$Npun.NpR5135 #$Npun.NpR5136 #$Npun.NpR5316 #$Npun.NpF5361
#$Npun.NpF5636 #$Npun.NpF5682 #$Npun.NpF5759 #$Npun.NpF5763
#$Npun.NpF5788 #$Npun.NpR6014 #$Npun.NpR6015 #$Npun.NpR6228
#$Npun.NpF6321 #$Npun.NpR6360 #$Npun.NpF6363 #$Npun.pNpAF075
#$Npun.pNpBR039 #$Npun.pNpBF139 #$Npun.pNpBF146 #$Npun.pNpBR169
#$Npun.pNpBR170 #$Npun.pNpBF205 #$Npun.pNpEF003)

<2>> (GENE-UPSTREAM-OF NpF0304)
:: #$Npun.NpF0303
<3>> (GENES-UPSTREAM-OF (RESULT 1))
:: (#$Npun.NpF0303 #$Npun.NpF0356 #$Npun.NpF0451 #$Npun.NpF0483
#$Npun.NpR0590 #$Npun.NpF0831 #$Npun.NpF0905 #$Npun.NpF0957
#$Npun.NpR1083 #$Npun.NpF1084 #$Npun.NpR1110 #$Npun.NpF1183
#$Npun.NpF1277 #$Npun.NpR1451 #$Npun.NpR1452 #$Npun.NpR1515
#$Npun.NpF1634 #$Npun.NpR1679 #$Npun.NpF1684 #$Npun.NpR1689
#$Npun.NpF1775 #$Npun.NpF1780 #$Npun.NpF1799 #$Npun.NpR1904
#$Npun.NpR2092 #$Npun.NpF2161 #$Npun.NpR2264 #$Npun.NpR2345
#$Npun.NpF2363 #$Npun.NpR2421 #$Npun.NpR2903 #$Npun.NpR2971
#$Npun.NpR3054 #$Npun.NpR3083 #$Npun.NpR3198 #$Npun.NpF3242
#$Npun.NpR3658 #$Npun.NpF3675 #$Npun.NpR3734 #$Npun.NpR3828
#$Npun.NpF3908 #$Npun.NpR3960 #$Npun.NpF3971 #$Npun.NpF4102
#$Npun.NpR4161 #$Npun.NpF4166 #$Npun.NpR4213 #$Npun.NpR4436
#$Npun.NpF4459 #$Npun.NpR4504 #$Npun.NpR4744 #$Npun.NpR4769
#$Npun.NpR4908 #$Npun.NpF5016 #$Npun.NpF5033 #$Npun.NpF5043
#$Npun.NpR5136 #$Npun.NpF5137 #$Npun.NpF5317 #$Npun.NpF5360
#$Npun.NpR5635 #$Npun.NpF5681 #$Npun.NpF5758 #$Npun.NpR5762
#$Npun.NpR5787 #$Npun.NpR6015 #$Npun.NpR6016 #$Npun.NpR6229
#$Npun.NpR6320 #$Npun.NpF6361 #$Npun.NpF6362 #$Npun.pNpAF074
#$Npun.pNpBR040 #$Npun.pNpBF138 #$Npun.pNpBF145 #$Npun.pNpBR170
#$Npun.pNpBR171 #$Npun.pNpBR204 #$Npun.pNpER002)
<4>> (DESCRIPTIONS-OF *)
:: ("two-component sensor histidine kinase [Nostoc sp. PCC 7120] gi|25531611|p
"unknown protein [Nostoc sp. PCC 7120] gi|25534386|pir||AH1981 hypothetical p
"tmRNA-binding protein [Nostoc sp. PCC 7120] gi|22096164|sp|Q8YM70|SSRP_ANASP
"GTP-binding protein era homolog"
"unknown protein [Nostoc sp. PCC 7120] gi|25533156|pir||AF2229 hypothetical p
"ORF_ID:tlr0160~similar to ferredoxin [Thermosynechococcus elongatus BP-1]
"hypothetical protein [Nostoc sp. PCC 7120] gi|25367067|pir||AH2295 hypotheti
"two-component hybrid sensor and regulator [Nostoc sp. PCC 7120] gi|25532444|
"hypothetical protein [Nostoc sp. PCC 7120] gi|25358966|pir||AG2158 hypotheti
"two-component response regulator [Nostoc sp. PCC 7120] gi|25533086|pir||AF21
"probable two-component sensor histidine kinase [Gloeobacter violaceus] gi|35
"phytochrome-like protein [Tolypothrix sp. PCC 7601]"
"two-component sensor histidine kinase [Nostoc sp. PCC 7120] gi|25530471|pir|
NIL NIL NIL
"hypothetical protein [Nostoc sp. PCC 7120] gi|25535333|pir||AI2179 hypotheti
NIL
"unknown protein [Nostoc sp. PCC 7120] gi|25535440|pir||AI2275 hypothetical p
"transcriptional regulator [Nostoc sp. PCC 7120] gi|25302898|pir||AB2544 tran
"similar to two-component sensor histidine kinase [Nostoc sp. PCC 7120] gi|25
"putative gluconolactonase precursor [Sinorhizobium meliloti] gi|25369832|pir
"similar to two-component sensor histidine kinase [Nostoc sp. PCC 7120] gi|25
"hypothetical protein [Nostoc sp. PCC 7120] gi|25530521|pir||AC1903 hypotheti
. . .
BioLingua
<5>> (DEFINE RR-class AS
(GENES-DESCRIBED-BY "response regulator" IN Npun)
DISPLAY off)
:: "List of length 79 suppressed"
<6>> (DEFINE HK-class AS
(GENES-DESCRIBED-BY "histidine kinase" IN Npun)
DISPLAY off)
:: "List of length 89 suppressed"
<7>> (DEFINE HK-upstream AS
(GENES-UPSTREAM-OF HK-class) DISPLAY off)
:: "List of length 89 suppressed"
<8>> (DEFINE HK-downstream AS
(GENES-DOWNSTREAM-OF HK-class) DISPLAY off)
:: "List of length 89 suppressed"
<9>> (DEFINE HK-adjacent AS
(UNION-OF (HK-upstream HK-downstream)) DISPLAY off)
:: "List of length 178 suppressed"
<10>>(INTERSECTION-OF (HK-adjacent RR-class))
::
22 elements in INTERSECTION
> (#$Npun.pNpBF205 #$Npun.pNpBF139 #$Npun.NpR6228 #$Npun.NpR5316
#$Npun.NpF4214 #$Npun.NpF3676 #$Npun.NpF3084 #$Npun.NpR3053
#$Npun.NpR1779 #$Npun.NpR0589 #$Npun.NpF0304 #$Npun.NpR1109
#$Npun.NpF1278 #$Npun.NpF1776 #$Npun.NpF1800 #$Npun.NpR2420
#$Npun.NpR2902 #$Npun.NpR3197 #$Npun.NpR4503 #$Npun.NpF5763
#$Npun.NpF6363 #$Npun.pNpBF146)

<11>>(DEFINE RR-candidates AS (SET-DIFFERENCE RR-class (RESULT 10))
DISPLAY off)
:: "List of length 57 suppressed"
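The logic of steps 5 through 11 is plain set arithmetic: 79 response regulators, minus the 22 that sit next to a histidine kinase, leaves 57 candidates. A minimal Python sketch of the same reasoning on toy data follows. The gene names, descriptions, and helper functions are hypothetical, and adjacency here means the immediate neighbor in a linear gene list, ignoring strand, which may differ from how GENES-UPSTREAM-OF and GENES-DOWNSTREAM-OF are defined.

```python
def genes_described_by(text, descriptions):
    """Return the genes whose description contains the given text."""
    return {gene for gene, desc in descriptions.items() if text in desc}

def adjacent_genes(genes, gene_order):
    """Return the immediate neighbors of each gene in a linear gene order."""
    position = {gene: i for i, gene in enumerate(gene_order)}
    neighbors = set()
    for gene in genes:
        i = position[gene]
        if i > 0:
            neighbors.add(gene_order[i - 1])
        if i < len(gene_order) - 1:
            neighbors.add(gene_order[i + 1])
    return neighbors

# Toy genome: six genes in chromosomal order, with made-up annotations.
gene_order = ["g1", "g2", "g3", "g4", "g5", "g6"]
descriptions = {
    "g1": "potassium-dependent ATPase",
    "g2": "two-component sensor histidine kinase",
    "g3": "two-component response regulator",
    "g4": "hypothetical protein",
    "g5": "two-component response regulator",
    "g6": "hypothetical protein",
}

rr_class = genes_described_by("response regulator", descriptions)
hk_class = genes_described_by("histidine kinase", descriptions)
hk_adjacent = adjacent_genes(hk_class, gene_order)
paired = hk_adjacent & rr_class          # regulators next to a kinase
rr_candidates = rr_class - paired        # regulators with no kinase neighbor

print(sorted(rr_candidates))
```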
<12>>
Genes Functionally Related to His Kinase
Histidine Kinase
Nostoc punctiforme
NpR3010

Anabaena PCC 7120

Trichodesmium

Synechocystis PCC 6803

. . . (13 total) Find similar genes
Conserved
BioLingua
<12>>(CONTEXT-OF NpF0304)
::
(<- #$Npun.NpR0302 potassium-dependent ATPase sub) 523
(-> #$Npun.NpF0303 two-component sensor histidine) 85
(-> #$Npun.NpF0304 two-component response regulat) 473
(-> #$Npun.NpF0305 hypothetical protein glr0895 [) 85
(<- #$Npun.NpR0306 primosomal protein N' [Nostoc )
> (#$Npun.NpR0302 #$Npun.NpF0303 #$Npun.NpF0304 #$Npun.NpF0305
#$Npun.NpR0306)

<13>>(ALL-ORTHOLOGS-OF *)
:: ((#$S7942.sef0159 #$Npun.NpR0302 #$Gvi.glr0573 #$A29413.Av?3368
#$A7120.all3154)
(#$S6803.sll1590 #$Npun.NpF0303 #$Gvi.gll0572 #$A29413.Av?1247
#$A7120.alr3155)
(#$S6803.sll1592 #$P9313.PMT1405 #$Npun.NpF0304 #$Gvi.gll0571
#$A29413.Av?1248 #$A7120.alr3156)
(#$Tery.Te?7017 #$Npun.NpF0305 #$Cwat.Cw?3050)
(#$Tery.Te?2243 #$TeBP1.tll0415 #$S6803.sll0270 #$S8102.SynW1782
#$S7942.sef1895 #$PRO1375.Pro0497 #$P9313.PMT1271 #$PMED4.PMM0497
#$Npun.NpR0306 #$Gvi.gll0025 #$Cwat.Cw?3016 #$A29413.Av?5206
#$A7120.all4248))
<14>>
A new family of proteins?!
A type of transposase?
TRANSPOSON

transposase

Is NpR3008 a transposase?
BioLingua
<14>>(DEFINE extended-NpR3008 AS
(SEQUENCE-OF NpR3008 FROM -700 TO-END +700)
DISPLAY off)
:: "Results suppressed"
<15>> (BLAST extended-NpR3008 Npun)
:: Query Q-Start Q-End Subject S-Start S-End E-value
1. "Seq 1" 1 2258 #$Npun.chromosome 3706846 3704589 0.0
2. "Seq 1" 293 1511 #$Npun.chromosome 4008429 4009647 0.0
3. "Seq 1" 293 1512 #$Npun.chromosome 7932036 7930817 0.0
4. "Seq 1" 293 1510 #$Npun.chromosome 4228111 4229328 0.0
5. "Seq 1" 293 1510 #$Npun.chromosome 3971285 3972502 0.0
6. "Seq 1" 293 1510 #$Npun.chromosome 4027833 4029050 0.0
7. "Seq 1" 293 1511 #$Npun.chromosome 2121987 2123204 0.0
8. "Seq 1" 293 1510 #$Npun.chromosome 2136737 2135521 0.0
9. "Seq 1" 397 1510 #$Npun.chromosome 2030748 2031861 0.0
10. "Seq 1" 1537 2258 #$Npun.pNpB 42015 42737 4.6d-83
11. "Seq 1" 1331 1420 #$Npun.chromosome 8036134 8036045 1.8d-8
12. "Seq 1" 1319 1385 #$Npun.chromosome 5915424 5915358 2.7d-4
13. "Seq 1" 1319 1385 #$Npun.chromosome 2577387 2577453 2.7d-4
> (#$Temp27 #$Temp28 #$Temp29 #$Temp30 #$Temp31 #$Temp32 #$Temp33
#$Temp34 #$Temp35 #$Temp36 #$Temp37 #$Temp38 #$Temp39)
<16>> (FOR-EACH hit IN *
AS (subj S-start)
= (GET-ELEMENTS (subject Subject-start) FROM hit)
AS start = (- S-start 15)
AS end = (+ S-start 40)
AS left-end = (SEQUENCE-OF subj FROM start TO end)
COLLECT left-end)
::
> ("TACGCTCTATCTTCAGCAAGTTGTTTTTCTTGCTGTATAATTCGGCGATTCTCTTC"
"AAAGAAACGCTAGAGGGGTGCATCCCAGTTTTTATTATTCCAAAACAAATAAATAA"
"AAACTGGGATGCACCCCTTATTAATGCTCTTTGGAGTCAATACTAATTTTGCCAAA"
"TACCTTTGTGATAGGGGGTGCATCCCAGTTTTTATTATTCCAAAACAAATAAATAA"
"AAATTAGTTTATTATGGGTGCATCCCAGTTTTTATTATTCCAAAACAAATAAATAA"
"CACCGATTCACTAATGGGTGCATCCCAGTTTTTATTATTCCAAAACAAATAAATAA"
"ACTATTGTAGAGACTGGGTGCATCCCAGTTTTTATTATTCCAAAACAAATAAATAA"
. . .
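Step 16 cuts a fixed window around the start of each BLAST hit: 15 positions before the subject start through 40 after it, which gives the 56-letter strings shown. A minimal Python sketch of that window arithmetic, assuming 1-based inclusive coordinates and a subject sequence held as a plain string; the function name and the clamping at sequence ends are assumptions. The sketch ignores hits on the reverse strand (where S-Start is greater than S-End), which would need reverse complementing.

```python
def left_end(subject_seq, s_start, before=15, after=40):
    """Return subject_seq from s_start - before to s_start + after (1-based, inclusive)."""
    start = max(1, s_start - before)               # clamp at the first position
    end = min(len(subject_seq), s_start + after)   # clamp at the last position
    return subject_seq[start - 1:end]

# Toy subject sequence of 100 letters, not real genome data.
toy_subject = "ACGT" * 25
window = left_end(toy_subject, 20)
print(len(window))
```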
BioLingua
<17>>(ALIGNMENT-OF * LINE-LENGTH 60 SEGMENT-LENGTH 60)
::
Seq 4 1 TACCTTTGT-GATAGGGGGTGCATCCCAGTTTTTATTAT--TCCAAAACAAATAAATAA---
Seq 7 1 -ACTATTGTAGAGACTGGGTGCATCCCAGTTTTTATTAT--TCCAAAACAAATAAATAA---
Seq 2 1 -AAAGAAACGCTAGAGGGGTGCATCCCAGTTTTTATTAT--TCCAAAACAAATAAATAA---
Seq 5 1 AAATTAGTTTATTA-TGGGTGCATCCCAGTTTTTATTAT--TCCAAAACAAATAAATAA---
Seq 6 1 -CACCGATTCACTAATGGGTGCATCCCAGTTTTTATTAT--TCCAAAACAAATAAATAA---
Seq 8 1 ----------AAACTGGGATGCA-CCCAGTCTCTACAATAGTTCTAGA-GAACACATAACGT
Seq 3 1 ----------AAACTGGGATGCACCCC--TTATTAATGCTCTTTGGAGTCAATAC-TAATTT
Seq 9 1 -----------CATTGTCGCCCCTTGAAGTCATCAAGAC-----TAGGTGTATCAATGACTC
Seq 12 1 ------------------GTTCAGCTTGGTAATAGCTGTAGTTAATAATGCGAGAGCGATGT
Seq 1 1 ---------TACGCTCTATCTTCAGCAAGTTGTTTTTCT--TGCTGTATAATTCGGCGATTC
Seq 10 1 --------------GGTCGGGAAATTGCGAGATTATTCAGTGGCGAAGTAGTGGGAGAACTA
Seq 11 1 ------------TTGAACAAATTTGTTCGTGGAAATGGTAATTGGAAATTTGCTGCGGAATG
Seq 13 1 ------------ATTATTAACTACAGCTATTACCAAGCTGAACAACTGTGTTCTATTGGTTC
consensus 1
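The consensus row carries no letters in the slide text. One common way to derive such a row is a majority vote in each alignment column; the sketch below shows that approach under the assumption that all rows have equal length and that a column needs more than half of its rows to agree. The function name and the threshold are hypothetical, and this is not necessarily how ALIGNMENT-OF computes its consensus.

```python
from collections import Counter

def consensus(aligned_rows, threshold=0.5):
    """Return a per-column majority consensus; unresolved columns become spaces."""
    out = []
    for column in zip(*aligned_rows):
        base, count = Counter(column).most_common(1)[0]
        # Keep the base only if it is not a gap and a clear majority agrees.
        if base != "-" and count / len(column) > threshold:
            out.append(base)
        else:
            out.append(" ")
    return "".join(out)

# Toy alignment, not taken from the slide.
print(consensus(["ACGT-", "ACGTA", "TCGAA"]))
```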
Genetic Basis of Differentiation
Nostoc +
Anabaena
NH3 N2

NH3
Not Synechocystis, Trichodesmium,…
BioLingua
<18>>(DEFINE diff-cb AS (Npun Avar A7120) DISPLAY off)
:: "List of length 3 suppressed"
<19>>(DEFINE non-diff-cb AS
(REMOVE-FROM-SET *loaded-organisms* diff-cb) DISPLAY off)
:: "List of length 10 suppressed"

<20>>(DEFINE diff-cb-specific AS
(COMMON-ORTHOLOGS-OF diff-cb NOT-IN non-diff-cb) DISPLAY off)
:: "List of length 661 suppressed"
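Step 20 asks for ortholog groups shared by every differentiating cyanobacterium and absent from all the others. A minimal Python sketch of that filter, assuming each ortholog group is a list of identifiers in the Organism.gene form used in the session; the function name, the gene names, and the two non-differentiating organisms chosen are illustrative only.

```python
def groups_specific_to(ortholog_groups, required, excluded):
    """Keep groups with a member in every required organism and none in excluded ones."""
    required, excluded = set(required), set(excluded)
    kept = []
    for group in ortholog_groups:
        organisms = {gene.split(".", 1)[0] for gene in group}
        if required <= organisms and organisms.isdisjoint(excluded):
            kept.append(group)
    return kept

diff_cb = ["Npun", "Avar", "A7120"]
non_diff_cb = ["S6803", "Tery"]

# Toy ortholog groups with made-up gene names.
groups = [
    ["Npun.geneA", "Avar.geneA", "A7120.geneA"],                 # specific
    ["Npun.geneB", "Avar.geneB", "A7120.geneB", "S6803.geneB"],  # also elsewhere
    ["Npun.geneC", "A7120.geneC"],                               # missing Avar
]

diff_cb_specific = groups_specific_to(groups, diff_cb, non_diff_cb)
print(diff_cb_specific)
```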
BioLingua
• Provides knowledge in an accessible form
• Provides tools accessed in a common way
• Provides results that can be manipulated
• Provides a programming language that speaks to biologists
The Death of Science
Credits
West Coast VCU
- Jeff Shrager - Austin Hess
- JP Massar - James Mastros
- Mike Travers - Sarah Cousins
- Yue Zhao

BioLingua: http://ramsites.net/~biolingua/help
Jeff Elhai: Center for the Study of Biological Complexity
Virginia Commonwealth University
Phone: 828-0794
E-mail: ElhaiJ@VCU.Edu