A fast Perl script for finding DNA Palindromes
Jules J. Berman Ph.D., M.D.
File created 8/8/02 for submission to:
APIII
Pittsburgh, PA
Oct 2-4, 2002
The contents of the abstract are Public Domain
No warranties apply
Background: Palindromes (strings that read the same forward and backward) have biologically significant roles that may be exploitable in gene annotation. The literature describes the role of palindromic sequences as transcription-binding sites and their association with gene amplification and genetic instability. Surprisingly, the term DNA palindrome is used to refer to several sequence types, none of which conforms to the meaning of palindrome used in written language. Palindromes can refer to back-to-back DNA segments, each the reverse-complement of the other (e.g. GAATTC, AAGCTT, and GCCCGGGC). These strings do not actually read the same forwards and backwards, but their relation to palindromes is obvious. Another palindrome definition allows for spacer regions, with one or more non-palindromic bases separating the reverse-complement sequences. Spacer-separated reverse-complement sequences would permit looped strand self-annealing. By any definition, DNA palindromes are unusual sequences characteristic of a given gene and potentially useful for gene annotation. Although the literature refers to software for finding DNA palindromes, there seems to be no discussion of flexibility (i.e. choosing a preferred definition of palindrome), speed or code efficiency. Furthermore, most or all existing programs seem to work on single gene sequence inputs and do not actually index multiple occurrences of palindromes in large (genomic) sequence datasets.
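The back-to-back definition can be checked directly. The following is a minimal sketch, not part of the submitted script; the subroutine names reverse_complement and is_rc_palindrome are hypothetical. It assumes uppercase input containing only the bases C, A, G and T.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverse complement of an uppercase DNA string (hypothetical helper name).
sub reverse_complement {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;   # reverse the order of the bases
    $rc =~ tr/CAGT/GTCA/;           # replace each base with its complement
    return $rc;
}

# Returns 1 when the string equals its own reverse complement, otherwise 0.
sub is_rc_palindrome {
    my ($seq) = @_;
    return $seq eq reverse_complement($seq) ? 1 : 0;
}

# The three examples given in the Background paragraph.
for my $seq (qw(GAATTC AAGCTT GCCCGGGC)) {
    printf "%-8s plain reverse: %-8s reverse complement: %-8s\n",
        $seq, scalar reverse($seq), reverse_complement($seq);
}
```

For each of the three examples, the reverse complement equals the original string while the plain reverse does not, which is the sense in which these sequences are called palindromes.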
Design: An algorithm using pre-compiled regular expressions was designed to find several different types of palindromic sequences. The uses and modifications of the basic algorithm will be discussed.
Results: A Perl script that can find the most computationally complex type of palindrome, reverse-complement non-repeating palindromes with an intervening spacer region, follows. The example script operates on a given uninterrupted sequence of CAGT combinations, held in the file "sample".
This algorithm can be modified to find any type of sequence palindrome, indexing all found palindromes and parsing through gene sequence data of any length. A modified script found all simple sequence-line palindromes in a 34+ MByte sequence file, in just 411 seconds.
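One way such a modification might look is sketched here, with the arm lengths and the maximum spacer length passed as parameters. The subroutine find_palindromes and its argument order are assumptions made for illustration and do not appear in the submitted script. Unlike a /g loop, which resumes scanning after the end of each match, this sketch tests every start position, so it may report more palindromes and will run more slowly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns "arm*spacer*arm" strings for arm lengths
# $min_arm..$max_arm and spacers of 0..$max_spacer bases.
sub find_palindromes {
    my ($seq, $min_arm, $max_arm, $max_spacer) = @_;
    my @found;
    for my $n ($min_arm .. $max_arm) {
        # Test every start position that leaves room for an arm of length $n.
        for my $pos (0 .. length($seq) - $n) {
            my $arm = substr($seq, $pos, $n);
            next unless $arm =~ /\A[CAGT]+\z/;    # skip arms containing N etc.
            my $rc = scalar reverse $arm;         # build the reverse complement
            $rc =~ tr/CAGT/GTCA/;
            my $rest = substr($seq, $pos + $n);   # text following the arm
            # Accept the reverse complement after a spacer of 0..$max_spacer bases.
            if ($rest =~ /\A([CAGT]{0,$max_spacer})\Q$rc\E/) {
                push @found, "$arm*$1*$rc";
            }
        }
    }
    return @found;
}

# Spacer-separated definition: arms of 5 to 20 bases, spacer of up to 15 bases.
print "$_\n" for find_palindromes("AGTATTATACT", 5, 20, 15);

# Back-to-back definition: a maximum spacer of 0 allows no intervening bases.
print "$_\n" for find_palindromes("GAATTC", 3, 3, 0);
```

Setting the maximum spacer to 0 restricts the search to back-to-back reverse-complement segments; a larger value admits the looped, spacer-separated type.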
Conclusion: Palindromes can be quickly retrieved from large sequence datasets and used to annotate genes. Palindrome annotation is one of many ways of linking raw sequence data to gene signature data, biological feature data, and (ultimately) pathologic lesions.
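#!/usr/bin/perl
# Find reverse-complement palindromes with a 0-15 base spacer in the file "sample".
$filename = "sample";
open (TEXT, $filename) || die "Cannot open $filename";
$line = " ";
$count = 0;
# Pre-compile one regular expression per arm length (5 to 20 bases).
for $n (5..20)
{
    $re = qr/[CAGT]{$n}/;
    $regexes[$n-5] = $re;
}
# Read at most 1000 lines from the input file.
NEXTLINE: while ($count < 1000)
{
    $line = <TEXT>;
    $count++;
    foreach my $value (@regexes)
    {
        $start = 0;    # set to 1 once a palindrome of this arm length is found
        while ($line =~ /$value/g)
        {
            $endline = $';    # text following the candidate arm
            $match = $&;      # the candidate arm itself
            # Build the reverse complement of the arm.
            $revmatch = reverse($match);
            $revmatch =~ tr/CAGT/GTCA/;
            # Look for the reverse complement after a spacer of 0 to 15 bases.
            if ($endline =~ /^([CAGT]{0,15})($revmatch)/)
            {
                $start = 1;
                $palindrome = $match . "*" . $1 . "*" . $2;
                $palhash{$palindrome}++;
            }
        }
        # No palindrome at this arm length: skip the longer arms for this line.
        if ($start == 0)
        {
            next NEXTLINE;
        }
    }
}
close TEXT;
# Print each palindrome with the number of times it was found.
while (($key, $value) = each (%palhash))
{
    print "$key => $value\n";
}
exit;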
Input of sample.pl (line breaks omitted from original file)
ATGAGCGAAGAAAGCTTATTCGAGTCTTCTCCACAGAAGATGGAGTACGAAATTACAAAC
TACTCAGAAAGACATACAGAACTTCCAGGTCATTTCATTGGCCTCAATACAGTAGATAAA
CTAGAGGAGTCCCCGTTAAGGGACTTTGTTAAGAGTCACGGTGGTCACACGGTCATATCC
AAGATCCTGATAGCAAATAATGGTATTGCCGCCGTGAAAGAAATTAGATCCGTCAGAAAA
TGGGCATACGAGACGTTCGGCGATGACAGAACCGTCCAATTCGTCGCCATGGCCACCCCA
GAAGATCTGGAGGCCAACGCAGAATATATCCGTATGGCCGATCAATACATTGAAGTGCCA
GGTGGTACTAATAATAACAACTACGCTAACGTAGACTTGATCGTAGACATCGCCGAAAGA
GCAGACGTAGACGCCGTATGGGCTGGCTGGGGTCACGCCTCCGAGAATCCACTATTGCCT
GAAAAATTGTCCCAGTCTAAGAGGAAAGTCATCTTTATTGGGCCTCCAGGTAACGCCATG
AGGTCTTTAGGTGATAAAATCTCCTCTACCATTGTCGCTCAAAGTGCTAAAGTCCCATGT
ATTCCATGGTCTGGTACCGGTGTTGACACCGTTCACGTGGACGAGAAAACCGGTCTGGTC
TCTGTCGACGATGACATCTATCAAAAGGGTTGTTGTACCTCTCCTGAAGATGGTTTACAA
AAGGCCAAGCGTATTGGTTTTCCTGTCATGATTAAGGCATCCGAAGGTGGTGGTGGTAAA
GGTATCAGACAAGTTGAACGTGAAGAAGATTTCATCGCTTTATACCACCAGGCAGCCAAC
GAAATTCCAGGCTCCCCCATTTTCATCATGAAGTTGGCCGGTAGAGCGCGTCACTTGGAA
GTTCAACTGCTAGCAGATCAGTACGGTACAAATATTTCCTTGTTCGGTAGAGACTGTTCC
GTTCAGAGACGTCATCAAAAAATTATCGAAGAAGCACCAGTTACAATTGCCAAGGCTGAA
ACATTTCACGAGATGGAAAAGGCTGCCGTCAGACTGGGGAAACTAGTCGGTTATGTCTCT
GCCGGTACCGTGGAGTATCTATATTCTCATGATGATGGAAAATTCTACTTTTTAGAATTG
AACCCAAGATTACAAGTCGAGCATCCAACAACGGAAATGGTCTCCGGTGTTAACTTACCT
GCAGCTCAATTACAAATCGCTATGGGTATCCCTATGCATAGAATAAGTGACATTAGAACT
TTATATGGTATGAATCCTCATTCTGCCTCAGAAATCGATTTCGAATTCAAAACTCAAGAT
GCCACCAAGAAACAAAGAAGACCTATTCCAAAGGGTCATTGTACCGCTTGTCGTATCACA
TCAGAAGATCCAAACGATGGATTCAAGCCATCGGGTGGTACTTTGCATGAACTAAACTTC
CGTTCTTCCTCTAATGTTTGGGGTTACTTCTCCGTGGGTAACAATGGTAATATTCACTCC
TTTTCGGACTCTCAGTTCGGCCATATTTTTGCTTTTGGTGAAAATAGACAAGCTTCCAGG
AAACACATGGTTGTTGCCCTGAAGGAATTGTCCATTAGGGGTGATTTCAGAACTACTGTG
GAATACTTGATCAAACTTTTGGAAACTGAAGATTTCGAGGATAACACTATTACCACCGGT
TGGTTGGACGATTTGATTACTCATAAAATGACCGCTGAAAAGCCTGATCCAACTCTTGCC
GTCATTTGCGGTGCCGCTACAAAGGCTTTCTTAGCATCTGAAGAAGCCCGCCACAAGTAT
ATCGAATCCTTACAAAAGGGACAAGTTCTATCTAAAGACCTACTGCAAACTATGTTCCCT
GTAGATTTTATCCATGAGGGTAAAAGATACAAGTTCACCGTAGCTAAATCCGGTAATGAC
CGTTACACATTATTTATCAATGGTTCTAAATGTGATATCATACTGCGTCAACTATCTGAT
GGTGGTCTTTTGATTGCCATAGGCGGTAAATCGCATACCATCTATTGGAAAGAAGAAGTT
GCTGCTACAAGATTATCCGTTGACTCTATGACTACTTTGTTGGAAGTTGAAAACGATCCA
ACCCAGTTGCGTACTCCATCCCCTGGTAAATTGGTTAAATTCTTGGTGGAAAATGGTGAA
CACATTATCAAGGGCCAACCATATGCAGAAATTGAAGTTATGAAAATGCAAATGCCTTTG
GTTTCTCAAGAAAATGGTATCGTCCAGTTATTAAAGCAACCTGGTTCTACCATTGTTGCA
GGTGATATCATGGCTATTATGACTCTTGACGATCCATCCAAGGTCAAGCACGCTCTACCA
TTTGAAGGTATGCTGCCAGATTTTGGTTCTCCAGTTATCGAAGGAACCAAACCTGCCTAT
AAATTCAAGTCATTAGTGTCTACTTTGGAAAACATTTTGAAGGGTTATGACAACCAAGTT
ATTATGAACGCTTCCTTGCAACAATTGATAGAGGTTTTGAGAAATCCAAAACTGCCTTAC
TCAGAATGGAAACTACACATCTCTGCTTTACATTCAAGATTGCCTGCTAAGCTAGATGAA
CAAATGGAAGAGTTAGTTGCACGTTCTTTGAGACGTGGTGCTGTTTTCCCAGCTAGACAA
TTAAGTAAATTGATTGATATGGCCGTGAAGAATCCTGAATACAACCCCGACAAATTGCTG
GGCGCCGTCGTGGAACCATTGGCGGATATTGCTCATAAGTACTCTAACGGGTTAGAAGCC
CATGAACATTCTATATTTGTCCATTTCTTGGAAGAATATTACGAAGTTGAAAAGTTATTC
AATGGTCCAAATGTTCGTGAGGAAAATATCATTCTGAAATTGCGTGATGAAAACCCTAAA
GATCTAGATAAAGTTGCGCTAACTGTTTTGTCTCATTCGAAAGTTTCAGCGAAGAATAAC
CTGATCCTAGCTATCTTGAAACATTATCAACCATTGTGCAAGTTATCTTCTAAAGTTTCT
GCCATTTTCTCTACTCCTCTACAACATATTGTTGAACTAGAATCTAAGGCTACCGCTAAG
GTCGCTCTACAAGCAAGAGAAATTTTGATTCAAGGCGCTTTACCTTCGGTCAAGGAAAGA
ACTGAACAAATTGAACATATCTTAAAATCCTCTGTTGTGAAGGTTGCCTATGGCTCATCC
AATCCAAAGCGCTCTGAACCAGATTTGAATATCTTGAAGGACTTGATCGATTCTAATTAC
GTTGTGTTCGATGTTTTACTTCAATTCCTAACCCATCAAGACCCAGTTGTGACTGCTGCA
GCTGCTCAAGTCTATATTCGTCGTGCTTATCGTGCTTACACCATAGGAGATATTAGAGTT
CACGAAGGTGTCACAGTTCCAATTGTTGAATGGAAATTCCAACTACCTTCAGCTGCGTTC
TCCACCTTTCCAACTGTTAAATCTAAAATGGGTATGAACAGGGCTGTTTCTGTTTCAGAT
TTGTCATATGTTGCAAACAGTCAGTCATCTCCGTTAAGAGAAGGTATTTTGATGGCTGTG
GATCATTTAGATGATGTTGATGAAATTTTGTCACAAAGTTTGGAAGTTATTCCTCGTCAC
CAATCTTCTTCTAACGGACCTGCTCCTGATCGTTCTGGTAGCTCCGCATCGTTGAGTAAT
GTTGCTAATGTTTGTGTTGCTTCTACAGAAGGTTTCGAATCTGAAGAGGAAATTTTGGTA
AGGTTGAGAGAAATTTTGGATTTGAATAAGCAGGAATTAATCAATGCTTCTATCCGTCGT
ATCACATTTATGTTCGGTTTTAAAGATGGGTCTTATCCAAAGTATTATACTTTTAACGGT
CCAAATTATAACGAAAATGAAACAATTCGTCACATTGAGCCGGCTTTGGCCTTCCAACTG
GAATTAGGAAGATTGTCCAACTTCAACATTAAACCAATTTTCACTGATAATAGAAACATC
CATGTCTACGAAGCTGTTAGTAAGACTTCTCCATTGGATAAGAGATTCTTTACAAGAGGT
ATTATTAGAACGGGTCATATCCGTGATGACATTTCTATTCAAGAATATCTGACTTCTGAA
GCTAACAGATTGATGAGTGATATATTGGATAATTTAGAAGTCACCGACACTTCAAATTCT
GATTTGAATCATATCTTCATCAACTTCATTGCGGTGTTTGATATCTCTCCAGAAGATGTC
GAAGCCGCCTTCGGTGGTTTCTTAGAAAGATTTGGTAAGAGATTGTTGAGATTGCGTGTT
TCTTCTGCCGAAATTAGAATCATCATCAAAGATCCTCAAACAGGTGCCCCAGTACCATTG
CGTGCCTTGATCAATAACGTTTCTGGTTATGTTATCAAAACAGAAATGTACACCGAAGTC
AAGAACGCAAAAGGTGAATGGGTATTTAAGTCTTTGGGTAAACCTGGATCCATGCATTTA
AGACCTATTGCTACTCCTTACCCTGTTAAGGAATGGTTGCAACCAAAACGTTATAAGGCA
CACTTGATGGGTACCACATATGTCTATGACTTCCCAGAATTATTCCGCCAAGCATCGTCA
TCCCAATGGAAAAATTTCTCTGCAGATGTTAAGTTAACAGATGATTTCTTTATTTCCAAC
GAGTTGATTGAAGATGAAAACGGCGAATTAACTGAGGTGGAAAGAGAACCTGGTGCCAAC
GCTATTGGTATGGTTGCCTTTAAGATTACTGTAAAGACTCCTGAATATCCAAGAGGCCGT
CAATTTGTTGTTGTTGCTAACGATATCACATTCAAGATCGGTTCCTTTGGTCCACAAGAA
GACGAATTCTTCAATAAGGTTACTGAATATGCTAGAAAGCGTGGTATCCCAAGAATTTAC
TTGGCTGCAAACTCAGGTGCCAGAATTGGTATGGCTGAAGAGATTGTTCCACTATTTCAA
GTTGCATGGAATGATGCTGCCAATCCGGACAAGGGCTTCCAATACTTATACTTAACAAGT
GAAGGTATGGAAACTTTAAAGAAATTTGACAAAGAAAATTCTGTTCTCACTGAACGTACT
GTTATAAACGGTGAAGAAAGATTTGTCATCAAGACAATTATTGGTTCTGAAGATGGGTTA
GGTGTCGAATGTCTACGTGGATCTGGTTTAATTGCTGGTGCAACGTCAAGGGCTTACCAC
GATATCTTCACTATCACCTTAGTCACTTGTAGATCCGTCGGTATCGGTGCTTATTTGGTT
CGTTTGGGTCAAAGAGCTATTCAGGTCGAAGGCCAGCCAATTATTTTAACTGGTGCTCCT
GCAATCAACAAAATGCTGGGTAGAGAAGTTTATACTTCTAACTTACAATTGGGTGGTACT
CAAATCATGTATAACAACGGTGTTTCACATTTGACTGCTGTTGACGATTTAGCTGGTGTA
GAGAAGATTGTTGAATGGATGTCTTATGTTCCAGCCAAGCGTAATATGCCAGTTCCTATC
TTGGAAACTAAAGACACATGGGATAGACCAGTTGATTTCACTCCAACTAATGATGAAACT
TACGATGTAAGATGGATGATTGAAGGTCGTGAGACTGAAAGTGGATTTGAATATGGTTTG
TTTGATAAAGGGTCTTTCTTTGAAACTTTGTCAGGATGGGCCAAAGGTGTTGTCGTTGGT
AGAGCCCGTCTTGGTGGTATTCCACTGGGTGTTATTGGTGTTGAAACAAGAACTGTCGAG
AACTTGATTCCTGCTGATCCAGCTAATCCAAATAGTGCTGAAACATTAATTCAAGAACCT
GGTCAAGTTTGGCATCCAAACTCCGCCTTCAAGACTGCTCAAGCTATCAATGACTTTAAC
AACGGTGAACAATTGCCAATGATGATTTTGGCCAACTGGAGAGGTTTCTCTGGTGGTCAA
CGTGATATGTTCAACGAAGTCTTGAAGTATGGTTCGTTTATTGTTGACGCATTGGTGGAT
TACAAACAACCAATTATTATCTATATCCCACCTACCGGTGAACTAAGAGGTGGTTCATGG
GTTGTTGTCGATCCAACTATCAACGCTGACCAAATGGAAATGTATGCCGACGTCAACGCT
AGAGCTGGTGTTTTGGAACCACAAGGTATGGTTGGTATCAAGTTCCGTAGAGAAAAATTG
CTGGACACCATGAACAGATTGGATGACAAGTACAGAGAATTGAGATCTCAATTATCCAAC
AAGAGTTTGGCTCCAGAAGTACATCAGCAAATATCCAAGCAATTAGCTGATCGTGAGAGA
GAACTATTGCCAATTTACGGACAAATCAGTCTTCAATTTGCTGATTTGCACGATAGGTCT
TCACGTATGGTGGCCAAGGGTGTTATTTCTAAGGAACTGGAATGGACCGAGGCACGTCGT
TTCTTCTTCTGGAGATTGAGAAGAAGATTGAACGAAGAATATTTGATTAAAAGGTTGAGC
CATCAGGTAGGCGAAGCATCAAGATTAGAAAAGATCGCAAGAATTAGATCGTGGTACCCT
GCTTCAGTGGACCATGAAGATGATAGGCAAGTCGCAACATGGATTGAAGAAAACTACAAA
ACTTTGGACGATAAACTAAAGGGTTTGAAATTAGAGTCATTCGCTCAAGACTTAGCTAAA
AAGATCAGAAGCGACCATGACAATGCTATTGATGGATTATCTGAAGTTATCAAGATGTTA
TCTACCGATGATAAAGAAAAATTGTTGAAGACTTTGAAATAA
Output of sample.pl
(* separates the spacer region from the flanking palindromic regions)
C:\FTP>perl sample.pl
CTTTG*TCAGGATGGGC*CAAAG => 1
AGTAT*T*ATACT => 1
GAAATC**GATTTC => 1
AGTTT*GGCATCC*AAACT => 1
CCTTA*CCCTGT*TAAGG => 1
CTTCT*GGAGATTGAGA*AGAAG => 1
GAAAT*CG*ATTTC => 1
TTCTG*GTTATGTTATCAAAA*CAGAA => 1
CTTCC*AACTGGAATTA*GGAAG => 1
TGGAA*A*TTCCA => 1
TTCAAAT*TCTG*ATTTGAA => 1
CTTGAC*GATCCATCCAAG*GTCAAG => 1
AAGTT*TATACTTCT*AACTT => 1
TTCTT*CTTCTGGAGATTGAG*AAGAA => 1
TTCTTCT*GGAGATTG*AGAAGAA => 1
AACGAA*GTCTTGAAGTATGG*TTCGTT => 1
ACGAA*AATGAAACAA*TTCGT => 1
TTTCA*CTCCAACTAATGA*TGAAA => 1
TCTTCT*CCAC*AGAAGA => 1
AAGAA*GACGAA*TTCTT => 1
TGAGA**TCTCA => 1
AATTT*TGGTAAGGTTGAGAG*AAATT => 1
GCAAC*CTGGTTCTACCATT*GTTGC => 1
CTACG*CTAA*CGTAG => 1
AATTC*TACTTTTTA*GAATT => 1
AATTCTA*CTTTT*TAGAATT => 1
TATTC*AA*GAATA => 1
TTGAC*GATCCATCCAAG*GTCAA => 1
TTCAA*ATTCTGAT*TTGAA => 1
TTCTTC*TTCTGGAGATTGA*GAAGAA => 1
TTTTA*TCCATGAGGG*TAAAA => 1
TTCTG*CCT*CAGAA => 1
ATTGG*TGGATTACAAACAA*CCAAT => 1
TCTAAC*GG*GTTAGA => 1
GATGG*ATTCAAG*CCATC => 1
GTTTGG*CAT*CCAAAC => 1
CTTCT*CCAC*AGAAG => 1
The following script operates on the GBEST1.SEQ file downloaded
9/4/02 from:
ftp.ncbi.nih.gov/genbank (login anonymous guest)
-r--r--r-- 1 ftp anonymous 20515576 Aug 30 20:47 gbest1.seq.gz
This file, when expanded, is 230,687,626 bytes.
Additional information on the file, and the first locus entry on the
file is:
GBEST1.SEQ Genetic Sequence Data Bank
August 15 2002
NCBI-GenBank Flat File Release 131.0
EST Sequences (Part 1)
68775 loci, 26546702 bases, from 68775 reported sequences
LOCUS AA000001 474 bp mRNA linear EST 17-OCT-1996
DEFINITION zd84h07.s1 Soares_fetal_heart_NbHH19W Homo sapiens cDNA clone
IMAGE:347389 3', mRNA sequence.
ACCESSION AA000001
VERSION AA000001.1 GI:1392161
KEYWORDS EST.
SOURCE human.
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 474)
AUTHORS Hillier,L., Clark,N., Dubuque,T., Elliston,K., Hawkins,M., Holman
,M., Hultman,M., Kucaba,T., Le,M., Lennon,G., Marra,M., Parsons,J.,
Rifkin,L., Rohlfing,T., Soares,M., Tan,F., Trevaskis,E., Waterston
,R., Williamson,A., Wohldmann,P. and Wilson,R.
TITLE The WashU-Merck EST Project
JOURNAL Unpublished (1995)
COMMENT Contact: Wilson RK
Washington University School of Medicine
4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
Tel: 314 286 1800
Fax: 314 286 1810
Email: est@watson.wustl.edu
This clone is available royalty-free through LLNL ; contact the
IMAGE Consortium (info@image.llnl.gov) for further information.
Insert Length: 2259 Std Error: 0.00
Seq primer: mob.REGA+ET
High quality sequence stop: 403.
FEATURES Location/Qualifiers
source 1..474
/organism="Homo sapiens"
/db_xref="GDB:1272764"
/db_xref="taxon:9606"
/clone="IMAGE:347389"
/clone_lib="Soares_fetal_heart_NbHH19W"
/sex="unknown"
/dev_stage="19 weeks"
/lab_host="DH10B (ampicillin resistant)"
/note="Organ: heart; Vector: pT7T3D (Pharmacia) with a
modified polylinker; Site_1: Not I; Site_2: Eco RI; 1st
strand cDNA was primed with a Not I - oligo(dT) primer [5'
TGTTACCAATCTGAAGTGGGAGCGGCCGCATCTTTTTTTTTTTTTTTTTT 3'],
double-stranded cDNA was size selected, ligated to Eco RI
adapters (Pharmacia), digested with Not I and cloned into
the Not I and Eco RI sites of a modified pT7T3 vector
(Pharmacia). Library went through one round of
normalization to a Cot = 5. Library constructed by
M.Fatima Bonaldo. This library was constructed from the
same fetus as the fetal lung library, Soares fetal lung
NbHL19W."
BASE COUNT 140 a 105 c 115 g 107 t 7 others
ORIGIN
1 taantgagat ctaggtatta acctgctgtc tagcgaaaac tagtcactaa gtcctggcct
61 gagagatacc cacatttcct ttagaacaaa cagaactaat acctgtgtac atttctgaga
121 gcctgatgtg tgagtcctta aaatgtagac cttgcaggag gcttagacct cagtttcacc
181 taatgcatgt ggaggaaatg gaggtgagaa tagtcacctg aagagtgcaa gcgctccagc
241 tccagcacac acactcttcc ctgggcagca ggaaaaggag gtaacaagga cttgggctga
301 catctgaagc actangctaa tgtgcctggt agaggggagc ctcaggaagn cacaagatgg
361 tcattccacc tngtagctgt ccacaaacct gaggtttcca catcgttttt aaagggcaca
421 gtgggcaaat gtgncaaggc agaaaaccaa taaccatttc aagggntcac ttgn
//
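A record such as the one above can be reduced to a locus identifier and a bare uppercase sequence before palindromes are sought. The following is a minimal sketch under stated assumptions: the subroutine name parse_record is hypothetical, and the embedded record is an abbreviated excerpt of the entry above with assumed column spacing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: extract the locus ID and the sequence from one
# GenBank flat-file record. Returns an empty list if either is missing.
sub parse_record {
    my ($record) = @_;
    return unless $record =~ /^LOCUS\s+(\S+)/m;
    my $locus = $1;
    # Capture everything after the ORIGIN line, up to the "//" terminator.
    return unless $record =~ /^ORIGIN.*?\n(.*?)(?:^\/\/|\z)/ms;
    my $seq = $1;
    $seq =~ s/[\d\s]+//g;    # drop position numbers, blanks and newlines
    return ($locus, uc $seq);
}

# Abbreviated excerpt of the first locus entry (most lines omitted).
my $record = <<'END';
LOCUS       AA000001      474 bp    mRNA    linear   EST 17-OCT-1996
ACCESSION   AA000001
ORIGIN
        1 taantgagat ctaggtatta
//
END

my ($locus, $seq) = parse_record($record);
print "$locus => $seq\n";
```

Bases reported as n are kept in the sequence, so a later search step still has to decide whether to allow them in an arm or a spacer.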
The Perl script that produces palindromes for the loci is shown below.
Note that the name of the gbest1 file used in the script is "gbest1.seq".
The output file is called gbpal and is 3,839,652 bytes in length.
A sample from the output file (gbpal) is:
AA027885 =>
AGACA*TTCTTCCCAG*TGTCT
CAAAA*TG*TTTTG
AAAAA*AAAAAAGCAGCAAAA*TTTTT
AAAAA*AGCAGCAAAA*TTTTT
AA027886 =>
AGGTT*TAAA*AACCT
AGGTTT*AA*AAACCT
CTCTTG**CAAGAG
AA027887 =>
TTCAG*GCTA*CTGAA
AAAGA*AA*TCTTT
AA027888 =>
AAGTG*GA*CACTT
TTCTG*TGTT*CAGAA
TTTTT*TTA*AAAAA
TTTTTT*T*AAAAAA
AA057972 =>
TGCAG*GCCGCTCAGGATAAG*CTGCA
GCCGC*TCAGGATAAGCTGCA*GCGGC
AA057973 =>
TAGAT*CATT*ATCTA
AA057974 =>
AGACA*GGGCCCGCACCCC*TGTCT
GAGGC*CCAACCAGCCCGCT*GCCTC
GGGAG*CGTGCC*CTCCC
GACAGG*GCCCGCACC*CCTGTC
AGACAGG*GCCCGCACC*CCTGTCT
AA057975 =>
CTGGG*GA*CCCAG
CCACC*ATTGACCT*GGTGG
CCACCA*TTGACC*TGGTGG
AA057976 =>
CCGGG*GGTCCAGCCAAGCCA*CCCGG
GTGCG*ATG*CGCAC
AATCC*AAAGAAGC*GGATT
AA057977 =>
GGGCC*TGGCAGGTGCT*GGCCC
TGGGC*ATGAGGAGCGCGCG*GCCCA
AA057978 =>
CTGGC*TTTGTGAACCACGT*GCCAG
GGCTA*GTGAGG*TAGCC
TTTAA*CTTGACA*TTAAA
CCTGGC*TTTGTGAACCACGT*GCCAGG
GGGCTA*GTGAGG*TAGCCC
AA057979 =>
TATCG*CTTCACGGCCCC*CGATA
GCACC*TGCAGGTTT*GGTGC
AA027890 =>
CCAGA*CCCAGGG*TCTGG
CCTCC*AGCCTCTTTCTCCCT*GGAGG
CCAGAC*CCAGG*GTCTGG
GACCTCC*AGCCTCTTTCTCCCT*GGAGGTC
AA027893 =>
GCCGC*A*GCGGC
AA027894 =>
GCTTG*CACACTACAGT*CAAGC
AA027895 =>
AATCA*T*TGATT
AGTCAA*TCATTGA*TTGACT
AGTCAAT*CATTG*ATTGACT
AA027896 =>
TCTCT*TCC*AGAGA
CCACA*C*TGTGG
AA027897 =>
CCTGT*ATT*ACAGG
GGGTC*AG*GACCC
CCTGT*CTTGAG*ACAGG
AAAAT*CCAGTCACAAT*ATTTT
CCTGTC*TTGA*GACAGG
CCTGTCT*TG*AGACAGG
AA027898 =>
GGGGT*TGCCCATAAATCAA*ACCCC
ATTTT*TTAC*AAAAT
AA027899 =>
TGTGT*TGCCTGTA*ACACA
GTAAC*ACAAAAT*GTTAC
ACTGG*CA*CCAGT
AA057982 =>
TCTGT*T*ACAGA
AA057983 =>
CATGT*GAGGAGTCATGA*ACATG
TTGCC*TGTGTCA*GGCAA
CATTT*C*AAATG
AA057984 =>
GAGCT*TCTGCAGTACT*AGCTC
AA057985 =>
TGAGA*C*TCTCA
AGGTC*ACGAT*GACCT
CAGGTG*CAGTCACT*CACCTG
AA057986 =>
TTGGT**ACCAA
TTTGT*CAATTCT*ACAAA
CAATT*CTACAA*AATTG
TAAAT**ATTTA
CCCAAG*AACAGG*CTTGGG
AA057987 =>
CTCCT*GGAAGCAG*AGGAG
GGATG*AAC*CATCC
GCCTC*TAAAGAGCTT*GAGGC
CTGGA*GC*TCCAG
GTATCA*AGGAGACCAGTC*TGATAC
AA057988 =>
GAGCA*CCCGAC*TGCTC
ATGGT*GGCTCACA*ACCAT
AAGAGC*ACCCGACT*GCTCTT
AA057991 =>
AATAT*AATAT*ATATT
AATAT**ATATT
CAGAG**CTCTG
ACAGAG**CTCTGT
AA057992 =>
GGTGA*AGAAGCC*TCACC
AA057993 =>
ATTCA*GATGTCCAACTTGA*TGAAT
TTGAT*GA*ATCAA
CTCAT*CACTGATCTGCTAA*ATGAG
TGGTAA*AAGATACGACCCG*TTACCA
AA057994 =>
AGGGG*AAGATTAGAAACTG*CCCCT
GGTGA*GATCCTGCCTGGA*TCACC
GATCC*TGCCT*GGATC
CTCAG*GTCAG*CTGAG
TGATG*AGTGACTCT*CATCA
CTCAG*CCTG*CTGAG
GCTCTG*GC*CAGAGC
GATGAG*TGACT*CTCATC
TCTCAG*CCTG*CTGAGA
ATCTCAG*CCTG*CTGAGAT
AA057995 =>
TCCAT*TTACCGGATGCATTT*ATGGA
AA057996 =>
ACCTG*CTG*CAGGT
CTGCA*GGTATAGG*TGCAG
AA057997 =>
ATCAT*CCACTC*ATGAT
TCAGC*AGTTAC*GCTGA
AA057998 =>
TCAGA*AAAGGGATGGAG*TCTGA
AACTG*CTGTTTTA*CAGTT
ATGTT*CTG*AACAT
AA057999 =>
CTTCT*ACCC*AGAAG
GCTGG*A*CCAGC
TCATG*GCTATACT*CATGA
AA020600 =>
GTAGA*CGAAAACACGGTT*TCTAC
CAGGG*GAAG*CCCTG
AA020601 =>
GGGGG*GTCTGCTCGAGG*CCCCC
GGGGC*GGGTTGTCGCCGCA*GCCCC
GCAGG*AGGCGAGGCGGAC*CCTGC
AA020602 =>
CTGAT*CC*ATCAG
AGATG**CATCT
AA020603 =>
CCCGC*C*GCGGG
TGTTT*TAATT*AAACA
GCCGC*NGCGGAGAAACC*GCGGC
GCGGG*ACGCCGCCCGCC*CCCGC
AGGCCG*CNGCGGAGAAACCG*CGGCCT
AA020604 =>
ACCAG*CACCGAGTA*CTGGT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
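#!/usr/bin/perl
# Find spacer-separated reverse-complement palindromes for each locus in gbest1.seq.
$/ = "\/\/";    # read one GenBank record (terminated by "//") at a time
$filename = "gbest1.seq";
open (TEXT, $filename) || die "Cannot open $filename";
open (OUT, ">gbpal") || die "Cannot open gbpal";
$line = " ";
$count = 0;
# Pre-compile one regular expression per arm length (5 to 20 bases).
for $n (5..20)
{
    $re = qr/[CAGT]{$n}/;
    $regexes[$n-5] = $re;
}
NEXTLINE: while ($line ne "")
{
    $line = <TEXT>;
    $count++;
    # Extract the locus identifier from the LOCUS line.
    $line =~ /LOCUS +([A-Z0-9]*) +/o;
    $locusid = $1;
    # Everything after ORIGIN is the numbered sequence listing.
    $line =~ /ORIGIN/o;
    $code = $';
    $code =~ s/[\d \n]*//g;    # strip position numbers, spaces and newlines
    $code = uc($code);
    #print "$locusid\n";
    foreach my $value (@regexes)
    {
        $start = 0;    # set to 1 once a palindrome of this arm length is found
        while ($code =~ /$value/g)
        {
            $endline = $';    # text following the candidate arm
            $match = $&;      # the candidate arm itself
            # Build the reverse complement of the arm.
            $revmatch = reverse($match);
            $revmatch =~ tr/CAGT/GTCA/;
            # The spacer (0 to 15 bases) may include unresolved bases (N).
            if ($endline =~ /^([CAGTN]{0,15})($revmatch)/)
            {
                $start = 1;
                $palindrome = $match . "*" . $1 . "*" . $2;
                # Append the palindrome to the list kept for this locus.
                $palhash{$locusid} = "$palhash{$locusid}\n$palindrome";
            }
        }
        # No palindrome at this arm length: skip the longer arms for this record.
        if ($start == 0)
        {
            next NEXTLINE;
        }
    }
}
close TEXT;
# Write each locus followed by its palindromes, one per line.
while (($key, $value) = each (%palhash))
{
    print OUT "\n$key =>$value\n";
}
exit;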