A fast Perl script for finding DNA Palindromes


Jules J. Berman Ph.D., M.D.
File created 8/8/02 for submission to:
APIII
Pittsburgh, PA
Oct 2-4, 2002
The contents of the abstract are Public Domain
No warranties apply

Background: Palindromes (strings that read the same forward and backward) have biologically significant roles that may be exploitable in gene annotation. The literature describes the role of palindromic sequences as transcription-binding sites and their association with gene amplification and genetic instability. Surprisingly, the term "DNA palindrome" is used to refer to several sequence types, none of which conforms to the meaning of palindrome used in written language. It can refer to back-to-back DNA segments, each the reverse complement of the other (e.g. GAATTC, AAGCTT, and GCCCGGGC). These strings do not actually read the same forward and backward, but their relation to palindromes is obvious. Another definition allows for a spacer region of one or more non-palindromic bases separating the reverse-complement segments. Spacer-separated reverse-complement sequences would permit looped strand self-annealing. By any definition, DNA palindromes are unusual sequences characteristic of a given gene and potentially useful for gene annotation. Although the literature refers to software for finding DNA palindromes, there seems to be no discussion of flexibility (i.e. choosing a preferred definition of palindrome), speed, or code efficiency. Furthermore, most or all existing programs seem to work on single gene sequence inputs and do not actually index multiple occurrences of palindromes in large (genomic) sequence datasets.
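
As a minimal sketch of the first definition above (back-to-back reverse-complement segments), the following Perl fragment tests whether a sequence equals its own reverse complement. The subroutine names reverse_complement and is_rc_palindrome are illustrative assumptions and do not appear in the scripts of this abstract.

```perl
use strict;
use warnings;

# Return the reverse complement of an uppercase DNA string.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;      # scalar assignment, so the string is reversed
    $rc =~ tr/ACGT/TGCA/;       # swap each base for its complement
    return $rc;
}

# True (1) when the sequence is identical to its own reverse complement.
sub is_rc_palindrome {
    my ($seq) = @_;
    return $seq eq reverse_complement($seq) ? 1 : 0;
}

# The three examples from the text, plus one sequence that should fail.
for my $seq (qw(GAATTC AAGCTT GCCCGGGC GATTACA)) {
    printf "%-8s %s\n", $seq, is_rc_palindrome($seq) ? "yes" : "no";
}
```

The three examples quoted in the Background pass this test; an arbitrary string such as GATTACA does not.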

Design: An algorithm using pre-compiled regular expressions was designed to find several different types of palindromic sequences. The uses and modifications of the basic algorithm will be discussed.

Results: A Perl script that can find the most computationally complex type of palindrome, reverse-complement non-repeating palindromes with an intervening spacer region, follows. The example script operates on a given uninterrupted sequence of the bases C, A, G, and T, held in the file "sample". This algorithm can be modified to find any type of sequence palindrome, indexing all palindromes found and parsing through gene sequence data of any length. A modified script found all simple sequence-line palindromes in a 34+ MByte sequence file in just 411 seconds.

Conclusion: Palindromes can be quickly retrieved from large sequence datasets and used to annotate genes. Palindrome annotation is one of many ways of linking raw sequence data to gene signature data, biological feature data, and (ultimately) pathologic lesions.

#!/usr/bin/perl
$filename = "sample";
open (TEXT, $filename) || die "Cannot open $filename: $!";
$line = " ";
$count = 0;
# Pre-compile one regular expression per arm length (5 to 20 bases)
for $n (5..20)
   {
   $re = qr/[CAGT]{$n}/;
   $regexes[$n-5] = $re;
   }
NEXTLINE: while ($count < 1000)
   {
   $line = <TEXT> ;
   $count++;
   foreach my $value (@regexes)
      {
      $start = 0;
      while ($line =~ /$value/g)
         {
         $endline = $';                 # text following the matched left arm
         $match = $&;                   # the left arm itself
         $revmatch = reverse($match);
         $revmatch =~ tr/CAGT/GTCA/;    # complement the reversed arm
         if ($endline =~ /^([CAGT]{0,15})($revmatch)/)
            {
            $start = 1;
            $palindrome = $match . "*" . $1 . "*" . $2;
            $palhash{$palindrome}++;
            }
         }
      if ($start == 0)
         {
         # nothing found at this arm length: move on to the next line of input
         goto NEXTLINE;
         }
      }
   }
close TEXT;
while(($key, $value) = each (%palhash))
   {
   print "$key => $value\n";
   }
exit;
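
The search can also be packaged as a small subroutine. The sketch below is an exhaustive variant under stated assumptions: the name spaced_palindromes and its parameters are illustrative, and it tests a window at every start position, whereas the /g loop in the script resumes each search after the end of the previous match. It may therefore report hits that the script does not.

```perl
use strict;
use warnings;

# Hypothetical helper: return "left*spacer*right" strings found in $seq, where
# right is the reverse complement of left, each arm is $min..$max bases long,
# and the spacer between the arms is 0..$gap bases long.
sub spaced_palindromes {
    my ($seq, $min, $max, $gap) = @_;
    my @hits;
    for my $n ($min .. $max) {
        for my $pos (0 .. length($seq) - $n) {
            my $left = substr($seq, $pos, $n);
            next unless $left =~ /^[ACGT]+$/;          # skip windows holding N etc.
            (my $right = reverse $left) =~ tr/ACGT/TGCA/;
            my $rest = substr($seq, $pos + $n);        # text after the left arm
            if ($rest =~ /^([ACGT]{0,$gap})\Q$right\E/) {
                push @hits, "$left*$1*$right";
            }
        }
    }
    return @hits;
}

# Short test sequence chosen for illustration.
print "$_\n" for spaced_palindromes("GAAATCGATTTC", 5, 6, 15);
```

The same "*" notation as the script is used, so hits can be compared directly against its output.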


Input to sample.pl (the line breaks shown here are not present in the original file)


ATGAGCGAAGAAAGCTTATTCGAGTCTTCTCCACAGAAGATGGAGTACGAAATTACAAAC
TACTCAGAAAGACATACAGAACTTCCAGGTCATTTCATTGGCCTCAATACAGTAGATAAA
CTAGAGGAGTCCCCGTTAAGGGACTTTGTTAAGAGTCACGGTGGTCACACGGTCATATCC
AAGATCCTGATAGCAAATAATGGTATTGCCGCCGTGAAAGAAATTAGATCCGTCAGAAAA
TGGGCATACGAGACGTTCGGCGATGACAGAACCGTCCAATTCGTCGCCATGGCCACCCCA
GAAGATCTGGAGGCCAACGCAGAATATATCCGTATGGCCGATCAATACATTGAAGTGCCA
GGTGGTACTAATAATAACAACTACGCTAACGTAGACTTGATCGTAGACATCGCCGAAAGA
GCAGACGTAGACGCCGTATGGGCTGGCTGGGGTCACGCCTCCGAGAATCCACTATTGCCT
GAAAAATTGTCCCAGTCTAAGAGGAAAGTCATCTTTATTGGGCCTCCAGGTAACGCCATG
AGGTCTTTAGGTGATAAAATCTCCTCTACCATTGTCGCTCAAAGTGCTAAAGTCCCATGT
ATTCCATGGTCTGGTACCGGTGTTGACACCGTTCACGTGGACGAGAAAACCGGTCTGGTC
TCTGTCGACGATGACATCTATCAAAAGGGTTGTTGTACCTCTCCTGAAGATGGTTTACAA
AAGGCCAAGCGTATTGGTTTTCCTGTCATGATTAAGGCATCCGAAGGTGGTGGTGGTAAA
GGTATCAGACAAGTTGAACGTGAAGAAGATTTCATCGCTTTATACCACCAGGCAGCCAAC
GAAATTCCAGGCTCCCCCATTTTCATCATGAAGTTGGCCGGTAGAGCGCGTCACTTGGAA
GTTCAACTGCTAGCAGATCAGTACGGTACAAATATTTCCTTGTTCGGTAGAGACTGTTCC
GTTCAGAGACGTCATCAAAAAATTATCGAAGAAGCACCAGTTACAATTGCCAAGGCTGAA
ACATTTCACGAGATGGAAAAGGCTGCCGTCAGACTGGGGAAACTAGTCGGTTATGTCTCT
GCCGGTACCGTGGAGTATCTATATTCTCATGATGATGGAAAATTCTACTTTTTAGAATTG
AACCCAAGATTACAAGTCGAGCATCCAACAACGGAAATGGTCTCCGGTGTTAACTTACCT
GCAGCTCAATTACAAATCGCTATGGGTATCCCTATGCATAGAATAAGTGACATTAGAACT
TTATATGGTATGAATCCTCATTCTGCCTCAGAAATCGATTTCGAATTCAAAACTCAAGAT
GCCACCAAGAAACAAAGAAGACCTATTCCAAAGGGTCATTGTACCGCTTGTCGTATCACA
TCAGAAGATCCAAACGATGGATTCAAGCCATCGGGTGGTACTTTGCATGAACTAAACTTC
CGTTCTTCCTCTAATGTTTGGGGTTACTTCTCCGTGGGTAACAATGGTAATATTCACTCC
TTTTCGGACTCTCAGTTCGGCCATATTTTTGCTTTTGGTGAAAATAGACAAGCTTCCAGG
AAACACATGGTTGTTGCCCTGAAGGAATTGTCCATTAGGGGTGATTTCAGAACTACTGTG
GAATACTTGATCAAACTTTTGGAAACTGAAGATTTCGAGGATAACACTATTACCACCGGT
TGGTTGGACGATTTGATTACTCATAAAATGACCGCTGAAAAGCCTGATCCAACTCTTGCC
GTCATTTGCGGTGCCGCTACAAAGGCTTTCTTAGCATCTGAAGAAGCCCGCCACAAGTAT
ATCGAATCCTTACAAAAGGGACAAGTTCTATCTAAAGACCTACTGCAAACTATGTTCCCT
GTAGATTTTATCCATGAGGGTAAAAGATACAAGTTCACCGTAGCTAAATCCGGTAATGAC
CGTTACACATTATTTATCAATGGTTCTAAATGTGATATCATACTGCGTCAACTATCTGAT
GGTGGTCTTTTGATTGCCATAGGCGGTAAATCGCATACCATCTATTGGAAAGAAGAAGTT
GCTGCTACAAGATTATCCGTTGACTCTATGACTACTTTGTTGGAAGTTGAAAACGATCCA
ACCCAGTTGCGTACTCCATCCCCTGGTAAATTGGTTAAATTCTTGGTGGAAAATGGTGAA
CACATTATCAAGGGCCAACCATATGCAGAAATTGAAGTTATGAAAATGCAAATGCCTTTG
GTTTCTCAAGAAAATGGTATCGTCCAGTTATTAAAGCAACCTGGTTCTACCATTGTTGCA
GGTGATATCATGGCTATTATGACTCTTGACGATCCATCCAAGGTCAAGCACGCTCTACCA
TTTGAAGGTATGCTGCCAGATTTTGGTTCTCCAGTTATCGAAGGAACCAAACCTGCCTAT
AAATTCAAGTCATTAGTGTCTACTTTGGAAAACATTTTGAAGGGTTATGACAACCAAGTT
ATTATGAACGCTTCCTTGCAACAATTGATAGAGGTTTTGAGAAATCCAAAACTGCCTTAC
TCAGAATGGAAACTACACATCTCTGCTTTACATTCAAGATTGCCTGCTAAGCTAGATGAA
CAAATGGAAGAGTTAGTTGCACGTTCTTTGAGACGTGGTGCTGTTTTCCCAGCTAGACAA
TTAAGTAAATTGATTGATATGGCCGTGAAGAATCCTGAATACAACCCCGACAAATTGCTG
GGCGCCGTCGTGGAACCATTGGCGGATATTGCTCATAAGTACTCTAACGGGTTAGAAGCC
CATGAACATTCTATATTTGTCCATTTCTTGGAAGAATATTACGAAGTTGAAAAGTTATTC
AATGGTCCAAATGTTCGTGAGGAAAATATCATTCTGAAATTGCGTGATGAAAACCCTAAA
GATCTAGATAAAGTTGCGCTAACTGTTTTGTCTCATTCGAAAGTTTCAGCGAAGAATAAC
CTGATCCTAGCTATCTTGAAACATTATCAACCATTGTGCAAGTTATCTTCTAAAGTTTCT
GCCATTTTCTCTACTCCTCTACAACATATTGTTGAACTAGAATCTAAGGCTACCGCTAAG
GTCGCTCTACAAGCAAGAGAAATTTTGATTCAAGGCGCTTTACCTTCGGTCAAGGAAAGA
ACTGAACAAATTGAACATATCTTAAAATCCTCTGTTGTGAAGGTTGCCTATGGCTCATCC
AATCCAAAGCGCTCTGAACCAGATTTGAATATCTTGAAGGACTTGATCGATTCTAATTAC
GTTGTGTTCGATGTTTTACTTCAATTCCTAACCCATCAAGACCCAGTTGTGACTGCTGCA
GCTGCTCAAGTCTATATTCGTCGTGCTTATCGTGCTTACACCATAGGAGATATTAGAGTT
CACGAAGGTGTCACAGTTCCAATTGTTGAATGGAAATTCCAACTACCTTCAGCTGCGTTC
TCCACCTTTCCAACTGTTAAATCTAAAATGGGTATGAACAGGGCTGTTTCTGTTTCAGAT
TTGTCATATGTTGCAAACAGTCAGTCATCTCCGTTAAGAGAAGGTATTTTGATGGCTGTG
GATCATTTAGATGATGTTGATGAAATTTTGTCACAAAGTTTGGAAGTTATTCCTCGTCAC
CAATCTTCTTCTAACGGACCTGCTCCTGATCGTTCTGGTAGCTCCGCATCGTTGAGTAAT
GTTGCTAATGTTTGTGTTGCTTCTACAGAAGGTTTCGAATCTGAAGAGGAAATTTTGGTA
AGGTTGAGAGAAATTTTGGATTTGAATAAGCAGGAATTAATCAATGCTTCTATCCGTCGT
ATCACATTTATGTTCGGTTTTAAAGATGGGTCTTATCCAAAGTATTATACTTTTAACGGT
CCAAATTATAACGAAAATGAAACAATTCGTCACATTGAGCCGGCTTTGGCCTTCCAACTG
GAATTAGGAAGATTGTCCAACTTCAACATTAAACCAATTTTCACTGATAATAGAAACATC
CATGTCTACGAAGCTGTTAGTAAGACTTCTCCATTGGATAAGAGATTCTTTACAAGAGGT
ATTATTAGAACGGGTCATATCCGTGATGACATTTCTATTCAAGAATATCTGACTTCTGAA
GCTAACAGATTGATGAGTGATATATTGGATAATTTAGAAGTCACCGACACTTCAAATTCT
GATTTGAATCATATCTTCATCAACTTCATTGCGGTGTTTGATATCTCTCCAGAAGATGTC
GAAGCCGCCTTCGGTGGTTTCTTAGAAAGATTTGGTAAGAGATTGTTGAGATTGCGTGTT
TCTTCTGCCGAAATTAGAATCATCATCAAAGATCCTCAAACAGGTGCCCCAGTACCATTG
CGTGCCTTGATCAATAACGTTTCTGGTTATGTTATCAAAACAGAAATGTACACCGAAGTC
AAGAACGCAAAAGGTGAATGGGTATTTAAGTCTTTGGGTAAACCTGGATCCATGCATTTA
AGACCTATTGCTACTCCTTACCCTGTTAAGGAATGGTTGCAACCAAAACGTTATAAGGCA
CACTTGATGGGTACCACATATGTCTATGACTTCCCAGAATTATTCCGCCAAGCATCGTCA
TCCCAATGGAAAAATTTCTCTGCAGATGTTAAGTTAACAGATGATTTCTTTATTTCCAAC
GAGTTGATTGAAGATGAAAACGGCGAATTAACTGAGGTGGAAAGAGAACCTGGTGCCAAC
GCTATTGGTATGGTTGCCTTTAAGATTACTGTAAAGACTCCTGAATATCCAAGAGGCCGT
CAATTTGTTGTTGTTGCTAACGATATCACATTCAAGATCGGTTCCTTTGGTCCACAAGAA
GACGAATTCTTCAATAAGGTTACTGAATATGCTAGAAAGCGTGGTATCCCAAGAATTTAC
TTGGCTGCAAACTCAGGTGCCAGAATTGGTATGGCTGAAGAGATTGTTCCACTATTTCAA
GTTGCATGGAATGATGCTGCCAATCCGGACAAGGGCTTCCAATACTTATACTTAACAAGT
GAAGGTATGGAAACTTTAAAGAAATTTGACAAAGAAAATTCTGTTCTCACTGAACGTACT
GTTATAAACGGTGAAGAAAGATTTGTCATCAAGACAATTATTGGTTCTGAAGATGGGTTA
GGTGTCGAATGTCTACGTGGATCTGGTTTAATTGCTGGTGCAACGTCAAGGGCTTACCAC
GATATCTTCACTATCACCTTAGTCACTTGTAGATCCGTCGGTATCGGTGCTTATTTGGTT
CGTTTGGGTCAAAGAGCTATTCAGGTCGAAGGCCAGCCAATTATTTTAACTGGTGCTCCT
GCAATCAACAAAATGCTGGGTAGAGAAGTTTATACTTCTAACTTACAATTGGGTGGTACT
CAAATCATGTATAACAACGGTGTTTCACATTTGACTGCTGTTGACGATTTAGCTGGTGTA
GAGAAGATTGTTGAATGGATGTCTTATGTTCCAGCCAAGCGTAATATGCCAGTTCCTATC
TTGGAAACTAAAGACACATGGGATAGACCAGTTGATTTCACTCCAACTAATGATGAAACT
TACGATGTAAGATGGATGATTGAAGGTCGTGAGACTGAAAGTGGATTTGAATATGGTTTG
TTTGATAAAGGGTCTTTCTTTGAAACTTTGTCAGGATGGGCCAAAGGTGTTGTCGTTGGT
AGAGCCCGTCTTGGTGGTATTCCACTGGGTGTTATTGGTGTTGAAACAAGAACTGTCGAG
AACTTGATTCCTGCTGATCCAGCTAATCCAAATAGTGCTGAAACATTAATTCAAGAACCT
GGTCAAGTTTGGCATCCAAACTCCGCCTTCAAGACTGCTCAAGCTATCAATGACTTTAAC
AACGGTGAACAATTGCCAATGATGATTTTGGCCAACTGGAGAGGTTTCTCTGGTGGTCAA
CGTGATATGTTCAACGAAGTCTTGAAGTATGGTTCGTTTATTGTTGACGCATTGGTGGAT
TACAAACAACCAATTATTATCTATATCCCACCTACCGGTGAACTAAGAGGTGGTTCATGG
GTTGTTGTCGATCCAACTATCAACGCTGACCAAATGGAAATGTATGCCGACGTCAACGCT
AGAGCTGGTGTTTTGGAACCACAAGGTATGGTTGGTATCAAGTTCCGTAGAGAAAAATTG
CTGGACACCATGAACAGATTGGATGACAAGTACAGAGAATTGAGATCTCAATTATCCAAC
AAGAGTTTGGCTCCAGAAGTACATCAGCAAATATCCAAGCAATTAGCTGATCGTGAGAGA
GAACTATTGCCAATTTACGGACAAATCAGTCTTCAATTTGCTGATTTGCACGATAGGTCT
TCACGTATGGTGGCCAAGGGTGTTATTTCTAAGGAACTGGAATGGACCGAGGCACGTCGT
TTCTTCTTCTGGAGATTGAGAAGAAGATTGAACGAAGAATATTTGATTAAAAGGTTGAGC
CATCAGGTAGGCGAAGCATCAAGATTAGAAAAGATCGCAAGAATTAGATCGTGGTACCCT
GCTTCAGTGGACCATGAAGATGATAGGCAAGTCGCAACATGGATTGAAGAAAACTACAAA
ACTTTGGACGATAAACTAAAGGGTTTGAAATTAGAGTCATTCGCTCAAGACTTAGCTAAA
AAGATCAGAAGCGACCATGACAATGCTATTGATGGATTATCTGAAGTTATCAAGATGTTA
TCTACCGATGATAAAGAAAAATTGTTGAAGACTTTGAAATAA



Output of sample.pl
(* separates the spacer region from the flanking palindromic regions)
C:\FTP>perl sample.pl
CTTTG*TCAGGATGGGC*CAAAG => 1
AGTAT*T*ATACT => 1
GAAATC**GATTTC => 1
AGTTT*GGCATCC*AAACT => 1
CCTTA*CCCTGT*TAAGG => 1
CTTCT*GGAGATTGAGA*AGAAG => 1
GAAAT*CG*ATTTC => 1
TTCTG*GTTATGTTATCAAAA*CAGAA => 1
CTTCC*AACTGGAATTA*GGAAG => 1
TGGAA*A*TTCCA => 1
TTCAAAT*TCTG*ATTTGAA => 1
CTTGAC*GATCCATCCAAG*GTCAAG => 1
AAGTT*TATACTTCT*AACTT => 1
TTCTT*CTTCTGGAGATTGAG*AAGAA => 1
TTCTTCT*GGAGATTG*AGAAGAA => 1
AACGAA*GTCTTGAAGTATGG*TTCGTT => 1
ACGAA*AATGAAACAA*TTCGT => 1
TTTCA*CTCCAACTAATGA*TGAAA => 1
TCTTCT*CCAC*AGAAGA => 1
AAGAA*GACGAA*TTCTT => 1
TGAGA**TCTCA => 1
AATTT*TGGTAAGGTTGAGAG*AAATT => 1
GCAAC*CTGGTTCTACCATT*GTTGC => 1
CTACG*CTAA*CGTAG => 1
AATTC*TACTTTTTA*GAATT => 1
AATTCTA*CTTTT*TAGAATT => 1
TATTC*AA*GAATA => 1
TTGAC*GATCCATCCAAG*GTCAA => 1
TTCAA*ATTCTGAT*TTGAA => 1
TTCTTC*TTCTGGAGATTGA*GAAGAA => 1
TTTTA*TCCATGAGGG*TAAAA => 1
TTCTG*CCT*CAGAA => 1
ATTGG*TGGATTACAAACAA*CCAAT => 1
TCTAAC*GG*GTTAGA => 1
GATGG*ATTCAAG*CCATC => 1
GTTTGG*CAT*CCAAAC => 1
CTTCT*CCAC*AGAAG => 1
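
Each output line above has the form left*spacer*right => count. As a rough sanity check, the right arm of every hit should be the reverse complement of the left arm. The sketch below tests that property for a single line; the subroutine name hit_is_consistent is an illustrative assumption and is not part of sample.pl.

```perl
use strict;
use warnings;

# Hypothetical checker for one output line such as "GAAAT*CG*ATTTC => 1".
# Returns 1 if the right arm is the reverse complement of the left arm,
# 0 if it is not, and undef if the line does not have the expected shape.
sub hit_is_consistent {
    my ($line) = @_;
    my ($left, $spacer, $right) = $line =~ /^([ACGT]+)\*([ACGTN]*)\*([ACGT]+)/
        or return;
    (my $rc = reverse $left) =~ tr/ACGT/TGCA/;
    return $rc eq $right ? 1 : 0;
}

# Two lines copied from the output listing.
for my $line ("CTTTG*TCAGGATGGGC*CAAAG => 1", "TGAGA**TCTCA => 1") {
    my $status = hit_is_consistent($line) ? "consistent" : "inconsistent";
    print "$status: $line\n";
}
```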


The following script operates on the GBEST1.SEQ file downloaded 9/4/02 from:
ftp.ncbi.nih.gov/genbank (login anonymous guest)
-r--r--r-- 1 ftp anonymous 20515576 Aug 30 20:47 gbest1.seq.gz
This file, when expanded, is 230,687,626 bytes.

The header information from the file, followed by its first locus entry, is:

GBEST1.SEQ          Genetic Sequence Data Bank
                         August 15 2002

               NCBI-GenBank Flat File Release 131.0

                      EST Sequences (Part 1)

   68775 loci,    26546702 bases, from   68775 reported sequences

LOCUS       AA000001                 474 bp    mRNA    linear   EST 17-OCT-1996
DEFINITION  zd84h07.s1 Soares_fetal_heart_NbHH19W Homo sapiens cDNA clone
            IMAGE:347389 3', mRNA sequence.
ACCESSION   AA000001
VERSION     AA000001.1  GI:1392161
KEYWORDS    EST.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 474)
  AUTHORS   Hillier,L., Clark,N., Dubuque,T., Elliston,K., Hawkins,M., Holman
            ,M., Hultman,M., Kucaba,T., Le,M., Lennon,G., Marra,M., Parsons,J.,
            Rifkin,L., Rohlfing,T., Soares,M., Tan,F., Trevaskis,E., Waterston
            ,R., Williamson,A., Wohldmann,P. and Wilson,R.
  TITLE     The WashU-Merck EST Project
  JOURNAL   Unpublished (1995)
COMMENT     Contact: Wilson RK
            Washington University School of Medicine
            4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
            Tel: 314 286 1800
            Fax: 314 286 1810
            Email: est@watson.wustl.edu
            This clone is available royalty-free through LLNL ; contact the
            IMAGE Consortium (info@image.llnl.gov) for further information.
            Insert Length: 2259   Std Error: 0.00
            Seq primer: mob.REGA+ET
            High quality sequence stop: 403.
FEATURES    Location/Qualifiers
     source 1..474
            /organism="Homo sapiens"
            /db_xref="GDB:1272764"
            /db_xref="taxon:9606"
            /clone="IMAGE:347389"
            /clone_lib="Soares_fetal_heart_NbHH19W"
            /sex="unknown"
            /dev_stage="19 weeks"
            /lab_host="DH10B (ampicillin resistant)"
            /note="Organ: heart; Vector: pT7T3D (Pharmacia) with a
            modified polylinker; Site_1: Not I; Site_2: Eco RI; 1st
            strand cDNA was primed with a Not I - oligo(dT) primer [5'
            TGTTACCAATCTGAAGTGGGAGCGGCCGCATCTTTTTTTTTTTTTTTTTT 3'],
            double-stranded cDNA was size selected, ligated to Eco RI
            adapters (Pharmacia), digested with Not I and cloned into
            the Not I and Eco RI sites of a modified pT7T3 vector
            (Pharmacia). Library went through one round of
            normalization to a Cot = 5. Library constructed by
            M.Fatima Bonaldo. This library was constructed from the
            same fetus as the fetal lung library, Soares fetal lung
            NbHL19W."
BASE COUNT      140 a    105 c    115 g    107 t      7 others
ORIGIN
  1 taantgagat ctaggtatta acctgctgtc tagcgaaaac tagtcactaa gtcctggcct
 61 gagagatacc cacatttcct ttagaacaaa cagaactaat acctgtgtac atttctgaga
121 gcctgatgtg tgagtcctta aaatgtagac cttgcaggag gcttagacct cagtttcacc
181 taatgcatgt ggaggaaatg gaggtgagaa tagtcacctg aagagtgcaa gcgctccagc
241 tccagcacac acactcttcc ctgggcagca ggaaaaggag gtaacaagga cttgggctga
301 catctgaagc actangctaa tgtgcctggt agaggggagc ctcaggaagn cacaagatgg
361 tcattccacc tngtagctgt ccacaaacct gaggtttcca catcgttttt aaagggcaca
421 gtgggcaaat gtgncaaggc agaaaaccaa taaccatttc aagggntcac ttgn
//
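
For palindrome searching, only two parts of such a record matter: the identifier on the LOCUS line and the bases that follow ORIGIN. The fragment below is a minimal sketch of that extraction, assuming the record is already held in a string; the subroutine name parse_record is hypothetical, and the test record is abbreviated from the entry above.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the locus identifier and the uppercase sequence
# out of one GenBank flat-file record held in a string.
sub parse_record {
    my ($record) = @_;
    my ($locus) = $record =~ /^LOCUS\s+(\S+)/m;
    my ($body)  = $record =~ /^ORIGIN[^\n]*\n(.*)/ms;
    return unless defined $locus && defined $body;
    $body =~ s/[\d\s\/]//g;     # drop position numbers, whitespace, and the // terminator
    return ($locus, uc $body);
}

# Abbreviated record: the LOCUS line and first sequence line of AA000001.
my $record = <<'END';
LOCUS       AA000001                 474 bp    mRNA    linear   EST 17-OCT-1996
ORIGIN
  1 taantgagat ctaggtatta acctgctgtc tagcgaaaac tagtcactaa gtcctggcct
//
END

my ($locus, $seq) = parse_record($record);
print "$locus: ", length($seq), " bases\n";
```

Ambiguous bases (n) are kept in the sequence, so any search over it has to decide how to treat them.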


The Perl script that lists the palindromes found in each locus is shown below.
Note that the name of the gbest1 file used in the script is "gbest1.seq".
The output file is called gbpal and is 3,839,652 bytes in length.

A sample from the output file (gbpal) is:


AA027885 =>
AGACA*TTCTTCCCAG*TGTCT
CAAAA*TG*TTTTG
AAAAA*AAAAAAGCAGCAAAA*TTTTT
AAAAA*AGCAGCAAAA*TTTTT

AA027886 =>
AGGTT*TAAA*AACCT
AGGTTT*AA*AAACCT
CTCTTG**CAAGAG

AA027887 =>
TTCAG*GCTA*CTGAA
AAAGA*AA*TCTTT

AA027888 =>
AAGTG*GA*CACTT
TTCTG*TGTT*CAGAA
TTTTT*TTA*AAAAA
TTTTTT*T*AAAAAA

AA057972 =>
TGCAG*GCCGCTCAGGATAAG*CTGCA
GCCGC*TCAGGATAAGCTGCA*GCGGC

AA057973 =>
TAGAT*CATT*ATCTA

AA057974 =>
AGACA*GGGCCCGCACCCC*TGTCT
GAGGC*CCAACCAGCCCGCT*GCCTC
GGGAG*CGTGCC*CTCCC
GACAGG*GCCCGCACC*CCTGTC
AGACAGG*GCCCGCACC*CCTGTCT

AA057975 =>
CTGGG*GA*CCCAG
CCACC*ATTGACCT*GGTGG
CCACCA*TTGACC*TGGTGG

AA057976 =>
CCGGG*GGTCCAGCCAAGCCA*CCCGG
GTGCG*ATG*CGCAC
AATCC*AAAGAAGC*GGATT

AA057977 =>
GGGCC*TGGCAGGTGCT*GGCCC
TGGGC*ATGAGGAGCGCGCG*GCCCA

AA057978 =>
CTGGC*TTTGTGAACCACGT*GCCAG
GGCTA*GTGAGG*TAGCC
TTTAA*CTTGACA*TTAAA
CCTGGC*TTTGTGAACCACGT*GCCAGG
GGGCTA*GTGAGG*TAGCCC

AA057979 =>
TATCG*CTTCACGGCCCC*CGATA
GCACC*TGCAGGTTT*GGTGC

AA027890 =>
CCAGA*CCCAGGG*TCTGG
CCTCC*AGCCTCTTTCTCCCT*GGAGG
CCAGAC*CCAGG*GTCTGG
GACCTCC*AGCCTCTTTCTCCCT*GGAGGTC

AA027893 =>
GCCGC*A*GCGGC

AA027894 =>
GCTTG*CACACTACAGT*CAAGC

AA027895 =>
AATCA*T*TGATT
AGTCAA*TCATTGA*TTGACT
AGTCAAT*CATTG*ATTGACT

AA027896 =>
TCTCT*TCC*AGAGA
CCACA*C*TGTGG

AA027897 =>
CCTGT*ATT*ACAGG
GGGTC*AG*GACCC
CCTGT*CTTGAG*ACAGG
AAAAT*CCAGTCACAAT*ATTTT
CCTGTC*TTGA*GACAGG
CCTGTCT*TG*AGACAGG

AA027898 =>
GGGGT*TGCCCATAAATCAA*ACCCC
ATTTT*TTAC*AAAAT

AA027899 =>
TGTGT*TGCCTGTA*ACACA
GTAAC*ACAAAAT*GTTAC
ACTGG*CA*CCAGT

AA057982 =>
TCTGT*T*ACAGA

AA057983 =>
CATGT*GAGGAGTCATGA*ACATG
TTGCC*TGTGTCA*GGCAA
CATTT*C*AAATG

AA057984 =>
GAGCT*TCTGCAGTACT*AGCTC

AA057985 =>
TGAGA*C*TCTCA
AGGTC*ACGAT*GACCT
CAGGTG*CAGTCACT*CACCTG

AA057986 =>
TTGGT**ACCAA
TTTGT*CAATTCT*ACAAA
CAATT*CTACAA*AATTG
TAAAT**ATTTA
CCCAAG*AACAGG*CTTGGG

AA057987 =>
CTCCT*GGAAGCAG*AGGAG
GGATG*AAC*CATCC
GCCTC*TAAAGAGCTT*GAGGC
CTGGA*GC*TCCAG
GTATCA*AGGAGACCAGTC*TGATAC

AA057988 =>
GAGCA*CCCGAC*TGCTC
ATGGT*GGCTCACA*ACCAT
AAGAGC*ACCCGACT*GCTCTT

AA057991 =>
AATAT*AATAT*ATATT
AATAT**ATATT
CAGAG**CTCTG
ACAGAG**CTCTGT

AA057992 =>
GGTGA*AGAAGCC*TCACC

AA057993 =>
ATTCA*GATGTCCAACTTGA*TGAAT
TTGAT*GA*ATCAA
CTCAT*CACTGATCTGCTAA*ATGAG
TGGTAA*AAGATACGACCCG*TTACCA

AA057994 =>
AGGGG*AAGATTAGAAACTG*CCCCT
GGTGA*GATCCTGCCTGGA*TCACC
GATCC*TGCCT*GGATC
CTCAG*GTCAG*CTGAG
TGATG*AGTGACTCT*CATCA
CTCAG*CCTG*CTGAG
GCTCTG*GC*CAGAGC
GATGAG*TGACT*CTCATC
TCTCAG*CCTG*CTGAGA
ATCTCAG*CCTG*CTGAGAT

AA057995 =>
TCCAT*TTACCGGATGCATTT*ATGGA

AA057996 =>
ACCTG*CTG*CAGGT
CTGCA*GGTATAGG*TGCAG

AA057997 =>
ATCAT*CCACTC*ATGAT
TCAGC*AGTTAC*GCTGA

AA057998 =>
TCAGA*AAAGGGATGGAG*TCTGA
AACTG*CTGTTTTA*CAGTT
ATGTT*CTG*AACAT

AA057999 =>
CTTCT*ACCC*AGAAG
GCTGG*A*CCAGC
TCATG*GCTATACT*CATGA

AA020600 =>
GTAGA*CGAAAACACGGTT*TCTAC
CAGGG*GAAG*CCCTG

AA020601 =>
GGGGG*GTCTGCTCGAGG*CCCCC
GGGGC*GGGTTGTCGCCGCA*GCCCC
GCAGG*AGGCGAGGCGGAC*CCTGC

AA020602 =>
CTGAT*CC*ATCAG
AGATG**CATCT

AA020603 =>
CCCGC*C*GCGGG
TGTTT*TAATT*AAACA
GCCGC*NGCGGAGAAACC*GCGGC
GCGGG*ACGCCGCCCGCC*CCCGC
AGGCCG*CNGCGGAGAAACCG*CGGCCT

AA020604 =>
ACCAG*CACCGAGTA*CTGGT


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#!/usr/bin/perl
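$/ = "\/\/";   # read one GenBank record (terminated by //) at a time
$filename = "gbest1.seq";
open (TEXT, $filename) || die "Cannot open $filename: $!";
open (OUT, ">gbpal") || die "Cannot open gbpal: $!";
$line = " ";
$count = 0;
# Pre-compile one regular expression per arm length (5 to 20 bases)
for $n (5..20)
   {
   $re = qr/[CAGT]{$n}/;
   $regexes[$n-5] = $re;
   }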
NEXTLINE: while ($line ne "")
   {
   $line = <TEXT>;
   $count++;
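   $line =~ /LOCUS +([A-Z0-9]*) +/o;
   $locusid = $1;                # identifier from the LOCUS line
   $line =~ /ORIGIN/o;
   $code = $';                   # everything after ORIGIN is sequence text
   $code =~ s/[\d \n]*//g;       # strip position numbers, spaces, and line breaks
   $code = uc($code);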
   #print "$locusid\n";
   foreach my $value (@regexes)
      {
      $start = 0;
      while ($code =~ /$value/g)
         {
         $endline = $';
         $match = $&;
         $revmatch = reverse($match);
         $revmatch =~ tr/CAGT/GTCA/;
         if ($endline =~ /^([CAGTN]{0,15})($revmatch)/)
            {
            $start = 1;
            $palindrome = $match . "*" . $1 . "*" . $2;
            $palhash{$locusid} = "$palhash{$locusid}\n$palindrome";
            }
         }
      if ($start == 0)
         {
         goto NEXTLINE;
         }
      }
   }
close TEXT;
while(($key, $value) = each (%palhash))
   {
   print OUT "\n$key =>$value\n";
   }
exit;
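
The script above sets the input record separator to "//". A slightly stricter alternative, sketched here under the assumption that every record ends with a line holding only "//" and that the file uses Unix line endings, is to include the newline in the separator so that a "//" occurring inside a record cannot split it. The subroutine name read_records and the record identifiers EXAMPLE1 and EXAMPLE2 are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read every record from an open filehandle, using the
# "//" terminator line as the input record separator.
sub read_records {
    my ($fh) = @_;
    local $/ = "//\n";          # assumes Unix line endings
    my @records;
    while (defined(my $record = <$fh>)) {
        # keep only chunks that actually contain a LOCUS line
        push @records, $record if $record =~ /^LOCUS\s/m;
    }
    return @records;
}

# Demonstration on an in-memory file holding two made-up, abbreviated records.
my $data = "LOCUS       EXAMPLE1\nORIGIN\n  1 taantgagat\n//\n"
         . "LOCUS       EXAMPLE2\nORIGIN\n  1 ctaggtatta\n//\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;
print scalar(@records), " records read\n";
```

Using local keeps the change to the record separator confined to the subroutine, so other reads in the same program are unaffected.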