File Coverage

Bio/Search/BlastUtils.pm
Criterion Covered Total %
statement 3 171 1.7
branch 0 88 0.0
condition 0 21 0.0
subroutine 1 6 16.6
pod 4 4 100.0
total 8 290 2.7


line stmt bran cond sub pod time code
1             =head1 NAME
2              
3             Bio::Search::BlastUtils - Utility functions for Bio::Search:: BLAST objects
4              
5             =head1 SYNOPSIS
6              
7             # This module is just a collection of subroutines, not an object.
8              
9             See L.
10              
11             =head1 DESCRIPTION
12              
13             The BlastUtils.pm module is a collection of subroutines used primarily by
14             Bio::Search::Hit::BlastHit objects for some of the additional
15             functionality, such as HSP tiling. Right now, the BlastUtils is just a
16             collection of methods, not an object, and it's tightly coupled to
17             Bio::Search::Hit::BlastHit. A goal for the future is to generalize it
18             to work based on the Bio::Search interfaces, then it can work with any
19             objects that implements them.
20              
21             =head1 AUTHOR
22              
23             Steve Chervitz Esac@bioperl.orgE
24              
25             =cut
26              
27             #'
28              
29             package Bio::Search::BlastUtils;
30 3     3   16 use Bio::Root::Version;
  3         4  
  3         24  
31              
32              
33             =head2 tile_hsps
34              
35             Usage : tile_hsps( $sbjct );
36             : This is called automatically by Bio::Search::Hit::BlastHit
37             : during object construction or
38             : as needed by methods that rely on having tiled data.
39             Purpose : Collect statistics about the aligned sequences in a set of HSPs.
40             : Calculates the following data across all HSPs:
41             : -- total alignment length
42             : -- total identical residues
43             : -- total conserved residues
44             Returns : n/a
45             Argument : A Bio::Search::Hit::BlastHit object
46             Throws : n/a
47             Comments :
48             : This method is *strongly* coupled to Bio::Search::Hit::BlastHit
49             : (it accesses BlastHit data members directly).
50             : TODO: Re-write this to the Bio::Search::Hit::HitI interface.
51             :
52             : This method performs more careful summing of data across
53             : all HSPs in the Sbjct object. Only HSPs that are in the same strand
54             : and frame are tiled. Simply summing the data from all HSPs
55             : in the same strand and frame will overestimate the actual
56             : length of the alignment if there is overlap between different HSPs
57             : (often the case).
58             :
59             : The strategy is to tile the HSPs and sum over the
60             : contigs, collecting data separately from overlapping and
61             : non-overlapping regions of each HSP. To facilitate this, the
62             : HSP.pm object now permits extraction of data from sub-sections
63             : of an HSP.
64             :
65             : Additional useful information is collected from the results
66             : of the tiling. It is possible that sub-sequences in
67             : different HSPs will overlap significantly. In this case, it
68             : is impossible to create a single unambiguous alignment by
69             : concatenating the HSPs. The ambiguity may indicate the
70             : presence of multiple, similar domains in one or both of the
71             : aligned sequences. This ambiguity is recorded using the
72             : ambiguous_aln() method.
73             :
74             : This method does not attempt to discern biologically
75             : significant vs. insignificant overlaps. The allowable amount of
76             : overlap can be set with the overlap() method or with the -OVERLAP
77             : parameter used when constructing the Blast & Sbjct objects.
78             :
79             : For a given hit, both the query and the sbjct sequences are
80             : tiled independently.
81             :
82             : -- If only query sequence HSPs overlap,
83             : this may suggest multiple domains in the sbjct.
84             : -- If only sbjct sequence HSPs overlap,
85             : this may suggest multiple domains in the query.
86             : -- If both query & sbjct sequence HSPs overlap,
87             : this suggests multiple domains in both.
88             : -- If neither query & sbjct sequence HSPs overlap,
89             : this suggests either no multiple domains in either
90             : sequence OR that both sequences have the same
91             : distribution of multiple similar domains.
92             :
93             : This method can deal with the special case of when multiple
94             : HSPs exactly overlap.
95             :
96             : Efficiency concerns:
97             : Speed will be an issue for sequences with numerous HSPs.
98             :
99             Bugs : Currently, tile_hsps() does not properly account for
100             : the number of non-tiled but overlapping HSPs, which becomes a problem
101             : as overlap() grows. Large values overlap() may thus lead to
102             : incorrect statistics for some hits. For best results, keep overlap()
103             : below 5 (DEFAULT IS 2). For more about this, see the "HSP Tiling and
104             : Ambiguous Alignments" section in L.
105              
106             See Also : L<_adjust_contigs>(), L
107              
108             =cut
109              
110             #--------------
111             sub tile_hsps {
112             #--------------
113 0     0 1   my $sbjct = shift;
114              
115 0           $sbjct->{'_tile_hsps'} = 1;
116 0           $sbjct->{'_gaps_query'} = 0;
117 0           $sbjct->{'_gaps_sbjct'} = 0;
118              
119             ## Simple summation scheme. Valid if there is only one HSP.
120 0 0 0       if((defined($sbjct->{'_n'}) and $sbjct->{'_n'} == 1) or $sbjct->num_hsps == 1) {
      0        
121 0           my $hsp = $sbjct->hsp;
122 0           $sbjct->{'_length_aln_query'} = $hsp->length('query');
123 0           $sbjct->{'_length_aln_sbjct'} = $hsp->length('sbjct');
124 0           $sbjct->{'_length_aln_total'} = $hsp->length('total');
125 0           ($sbjct->{'_totalIdentical'},$sbjct->{'_totalConserved'}) = $hsp->matches();
126 0           $sbjct->{'_gaps_query'} = $hsp->gaps('query');
127 0           $sbjct->{'_gaps_sbjct'} = $hsp->gaps('sbjct');
128              
129             # print "_tile_hsps(): single HSP, easy stats.\n";
130 0           return;
131             } else {
132             # print STDERR "Sbjct: _tile_hsps: summing multiple HSPs\n";
133 0           $sbjct->{'_length_aln_query'} = 0;
134 0           $sbjct->{'_length_aln_sbjct'} = 0;
135 0           $sbjct->{'_length_aln_total'} = 0;
136 0           $sbjct->{'_totalIdentical'} = 0;
137 0           $sbjct->{'_totalConserved'} = 0;
138             }
139              
140             ## More than one HSP. Must tile HSPs.
141             # print "\nTiling HSPs for $sbjct\n";
142 0           my($hsp, $qstart, $qstop, $sstart, $sstop);
143 0           my($frame, $strand, $qstrand, $sstrand);
144 0           my(@qcontigs, @scontigs);
145 0           my $qoverlap = 0;
146 0           my $soverlap = 0;
147 0           my $max_overlap = $sbjct->{'_overlap'};
148              
149 0           foreach $hsp ($sbjct->hsps()) {
150             # printf " HSP: %s\n%s\n",$hsp->name, $hsp->str('query');
151             # printf " Length = %d; Identical = %d; Conserved = %d; Conserved(1-10): %d",$hsp->length, $hsp->length(-TYPE=>'iden'), $hsp->length(-TYPE=>'cons'), $hsp->length(-TYPE=>'cons',-START=>0,-STOP=>10);
152 0           ($qstart, $qstop) = $hsp->range('query');
153 0           ($sstart, $sstop) = $hsp->range('sbjct');
154 0           $frame = $hsp->frame('hit');
155 0 0         $frame = -1 unless defined $frame;
156 0           ($qstrand, $sstrand) = $hsp->strand;
157              
158 0           my ($qgaps, $sgaps) = $hsp->gaps();
159 0           $sbjct->{'_gaps_query'} += $qgaps;
160 0           $sbjct->{'_gaps_sbjct'} += $sgaps;
161              
162 0           $sbjct->{'_length_aln_total'} += $hsp->length;
163             ## Collect contigs in the query sequence.
164 0           $qoverlap = &_adjust_contigs('query', $hsp, $qstart, $qstop, \@qcontigs, $max_overlap, $frame, $qstrand);
165              
166             ## Collect contigs in the sbjct sequence (needed for domain data and gapped Blast).
167 0           $soverlap = &_adjust_contigs('sbjct', $hsp, $sstart, $sstop, \@scontigs, $max_overlap, $frame, $sstrand);
168              
169             ## Collect overall start and stop data for query and sbjct over all HSPs.
170 0 0         if(not defined $sbjct->{'_queryStart'}) {
171 0           $sbjct->{'_queryStart'} = $qstart;
172 0           $sbjct->{'_queryStop'} = $qstop;
173 0           $sbjct->{'_sbjctStart'} = $sstart;
174 0           $sbjct->{'_sbjctStop'} = $sstop;
175             } else {
176 0 0         $sbjct->{'_queryStart'} = ($qstart < $sbjct->{'_queryStart'} ? $qstart : $sbjct->{'_queryStart'});
177 0 0         $sbjct->{'_queryStop'} = ($qstop > $sbjct->{'_queryStop'} ? $qstop : $sbjct->{'_queryStop'});
178 0 0         $sbjct->{'_sbjctStart'} = ($sstart < $sbjct->{'_sbjctStart'} ? $sstart : $sbjct->{'_sbjctStart'});
179 0 0         $sbjct->{'_sbjctStop'} = ($sstop > $sbjct->{'_sbjctStop'} ? $sstop : $sbjct->{'_sbjctStop'});
180             }
181             }
182              
183             ## Collect data across the collected contigs.
184              
185             # print "\nQUERY CONTIGS:\n";
186             # print " gaps = $sbjct->{'_gaps_query'}\n";
187              
188             # TODO: Account for strand/frame issue!
189             # Strategy: collect data on a per strand+frame basis and save the most significant one.
190 0           my (%qctg_dat);
191 0           foreach(@qcontigs) {
192             # print " query contig: $_->{'start'} - $_->{'stop'}\n";
193             # print " iden = $_->{'iden'}; cons = $_->{'cons'}\n";
194 0           ($frame, $strand) = ($_->{'frame'}, $_->{'strand'});
195 0           $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_->{'stop'} - $_->{'start'} + 1;
196 0           $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_->{'iden'};
197 0           $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_->{'cons'};
198 0           $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand;
199             }
200              
201             # Find longest contig.
202 0           my @sortedkeys = reverse sort { $qctg_dat{ $a }->{'length_aln_query'} <=> $qctg_dat{ $b }->{'length_aln_query'} } keys %qctg_dat;
  0            
203              
204             # Save the largest to the sbjct:
205 0           my $longest = $sortedkeys[0];
206 0           $sbjct->{'_length_aln_query'} = $qctg_dat{ $longest }->{'length_aln_query'};
207 0           $sbjct->{'_totalIdentical'} = $qctg_dat{ $longest }->{'totalIdentical'};
208 0           $sbjct->{'_totalConserved'} = $qctg_dat{ $longest }->{'totalConserved'};
209 0           $sbjct->{'_qstrand'} = $qctg_dat{ $longest }->{'qstrand'};
210              
211             ## Collect data for sbjct contigs. Important for gapped Blast.
212             ## The totalIdentical and totalConserved numbers will be the same
213             ## as determined for the query contigs.
214              
215             # print "\nSBJCT CONTIGS:\n";
216             # print " gaps = $sbjct->{'_gaps_sbjct'}\n";
217              
218 0           my (%sctg_dat);
219 0           foreach(@scontigs) {
220             # print " sbjct contig: $_->{'start'} - $_->{'stop'}\n";
221             # print " iden = $_->{'iden'}; cons = $_->{'cons'}\n";
222 0           ($frame, $strand) = ($_->{'frame'}, $_->{'strand'});
223 0           $sctg_dat{ "$frame$strand" }->{'length_aln_sbjct'} += $_->{'stop'} - $_->{'start'} + 1;
224 0           $sctg_dat{ "$frame$strand" }->{'frame'} = $frame;
225 0           $sctg_dat{ "$frame$strand" }->{'sstrand'} = $strand;
226             }
227              
228 0           @sortedkeys = reverse sort { $sctg_dat{ $a }->{'length_aln_sbjct'} <=> $sctg_dat{ $b }->{'length_aln_sbjct'} } keys %sctg_dat;
  0            
229              
230             # Save the largest to the sbjct:
231 0           $longest = $sortedkeys[0];
232              
233 0           $sbjct->{'_length_aln_sbjct'} = $sctg_dat{ $longest }->{'length_aln_sbjct'};
234 0           $sbjct->{'_frame'} = $sctg_dat{ $longest }->{'frame'};
235 0           $sbjct->{'_sstrand'} = $sctg_dat{ $longest }->{'sstrand'};
236              
237 0 0         if($qoverlap) {
    0          
238 0 0         if($soverlap) { $sbjct->ambiguous_aln('qs');
  0            
239             # print "\n*** AMBIGUOUS ALIGNMENT: Query and Sbjct\n\n";
240             }
241 0           else { $sbjct->ambiguous_aln('q');
242             # print "\n*** AMBIGUOUS ALIGNMENT: Query\n\n";
243             }
244             } elsif($soverlap) {
245 0           $sbjct->ambiguous_aln('s');
246             # print "\n*** AMBIGUOUS ALIGNMENT: Sbjct\n\n";
247             }
248              
249             # Adjust length based on BLAST flavor.
250 0           my $prog = $sbjct->algorithm;
251 0 0         if($prog eq 'TBLASTN') {
    0          
    0          
252 0           $sbjct->{'_length_aln_sbjct'} /= 3;
253             } elsif($prog eq 'BLASTX' ) {
254 0           $sbjct->{'_length_aln_query'} /= 3;
255             } elsif($prog eq 'TBLASTX') {
256 0           $sbjct->{'_length_aln_query'} /= 3;
257 0           $sbjct->{'_length_aln_sbjct'} /= 3;
258             }
259             }
260              
261              
262              
263             =head2 _adjust_contigs
264              
265             Usage : n/a; called automatically during object construction.
266             Purpose : Builds HSP contigs for a given BLAST hit.
267             : Utility method called by _tile_hsps()
268             Returns :
269             Argument :
270             Throws : Exceptions propagated from Bio::Search::Hit::BlastHSP::matches()
271             : for invalid sub-sequence ranges.
272             Status : Experimental
273             Comments : This method does not currently support gapped alignments.
274             : Also, it does not keep track of the number of HSPs that
275             : overlap within the amount specified by overlap().
276             : This will lead to significant tracking errors for large
277             : overlap values.
278              
279             See Also : L(), L
280              
281             =cut
282              
283             #-------------------
284             sub _adjust_contigs {
285             #-------------------
286 0     0     my ($seqType, $hsp, $start, $stop, $contigs_ref, $max_overlap, $frame, $strand) = @_;
287              
288 0           my $overlap = 0;
289 0           my ($numID, $numCons);
290              
291             # print STDERR "Testing $seqType data: HSP (${\$hsp->name}); $start, $stop, strand=$strand, frame=$frame\n";
292 0           foreach(@$contigs_ref) {
293             # print STDERR " Contig: $_->{'start'} - $_->{'stop'}, strand=$_->{'strand'}, frame=$_->{'frame'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n";
294              
295             # Don't merge things unless they have matching strand/frame.
296 0 0 0       next unless ($_->{'frame'} == $frame and $_->{'strand'} == $strand);
297              
298             ## Test special case of a nested HSP. Skip it.
299 0 0 0       if($start >= $_->{'start'} and $stop <= $_->{'stop'}) {
300             # print STDERR "----> Nested HSP. Skipping.\n";
301 0           $overlap = 1;
302 0           next;
303             }
304              
305             ## Test for overlap at beginning of contig.
306 0 0 0       if($start < $_->{'start'} and $stop > ($_->{'start'} + $max_overlap)) {
307             # print STDERR "----> Overlaps beg: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n";
308             # Collect stats over the non-overlapping region.
309 0           eval {
310             ($numID, $numCons) = $hsp->matches(-SEQ =>$seqType,
311             -START =>$start,
312 0           -STOP =>$_->{'start'}-1);
313             };
314 0 0         if($@) { warn "\a\n$@\n"; }
  0            
315             else {
316 0           $_->{'start'} = $start; # Assign a new start coordinate to the contig
317 0           $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
318 0           $_->{'cons'} += $numCons;
319 0           $overlap = 1;
320             }
321             }
322              
323             ## Test for overlap at end of contig.
324 0 0 0       if($stop > $_->{'stop'} and $start < ($_->{'stop'} - $max_overlap)) {
325             # print STDERR "----> Overlaps end: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n";
326             # Collect stats over the non-overlapping region.
327 0           eval {
328             ($numID,$numCons) = $hsp->matches(-SEQ =>$seqType,
329 0           -START =>$_->{'stop'},
330             -STOP =>$stop);
331             };
332 0 0         if($@) { warn "\a\n$@\n"; }
  0            
333             else {
334 0           $_->{'stop'} = $stop; # Assign a new stop coordinate to the contig
335 0           $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
336 0           $_->{'cons'} += $numCons;
337 0           $overlap = 1;
338             }
339             }
340 0 0         $overlap && do {
341             # print STDERR " New Contig data:\n";
342             # print STDERR " Contig: $_->{'start'} - $_->{'stop'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n";
343 0           last;
344             };
345             }
346             ## If there is no overlap, add the complete HSP data.
347 0 0         !$overlap && do {
348             # print STDERR "No overlap. Adding new contig.\n";
349 0           ($numID,$numCons) = $hsp->matches(-SEQ=>$seqType);
350 0           push @$contigs_ref, {'start'=>$start, 'stop'=>$stop,
351             'iden'=>$numID, 'cons'=>$numCons,
352             'strand'=>$strand, 'frame'=>$frame};
353             };
354 0           $overlap;
355             }
356              
357             =head2 get_exponent
358              
359             Usage : &get_exponent( number );
360             Purpose : Determines the power of 10 exponent of an integer, float,
361             : or scientific notation number.
362             Example : &get_exponent("4.0e-206");
363             : &get_exponent("0.00032");
364             : &get_exponent("10.");
365             : &get_exponent("1000.0");
366             : &get_exponent("e+83");
367             Argument : Float, Integer, or scientific notation number
368             Returns : Integer representing the exponent part of the number (+ or -).
369             : If argument == 0 (zero), return value is "-999".
370             Comments : Exponents are rounded up (less negative) if the mantissa is >= 5.
371             : Exponents are rounded down (more negative) if the mantissa is <= -5.
372              
373             =cut
374              
375             #------------------
376             sub get_exponent {
377             #------------------
378 0     0 1   my $data = shift;
379              
380 0           my($num, $exp) = split /[eE]/, $data;
381              
382 0 0         if( defined $exp) {
    0          
    0          
383 0 0         $num = 1 if not $num;
384 0 0         $num >= 5 and $exp++;
385 0 0         $num <= -5 and $exp--;
386             } elsif( $num == 0) {
387 0           $exp = -999;
388             } elsif( not $num =~ /\./) {
389 0           $exp = CORE::length($num) -1;
390             } else {
391 0           $exp = 0;
392 0 0         $num .= '0' if $num =~ /\.$/;
393 0           my ($c);
394 0           my $rev = 0;
395 0 0         if($num !~ /^0/) {
396 0           $num = reverse($num);
397 0           $rev = 1;
398             }
399 0           do { $c = chop($num);
  0            
400 0 0         $c == 0 && $exp++;
401             } while( $c ne '.');
402              
403 0 0 0       $exp = -$exp if $num == 0 and not $rev;
404 0 0         $exp -= 1 if $rev;
405             }
406 0           return $exp;
407             }
408              
409             =head2 collapse_nums
410              
411             Usage : @cnums = collapse_nums( @numbers );
412             Purpose : Collapses a list of numbers into a set of ranges of consecutive terms:
413             : Useful for condensing long lists of consecutive numbers.
414             : EXPANDED:
415             : 1 2 3 4 5 6 10 12 13 14 15 17 18 20 21 22 24 26 30 31 32
416             : COLLAPSED:
417             : 1-6 10 12-15 17 18 20-22 24 26 30-32
418             Argument : List of numbers sorted numerically.
419             Returns : List of numbers mixed with ranges of numbers (see above).
420             Throws : n/a
421              
422             See Also : L
423              
424             =cut
425              
426             #------------------
427             sub collapse_nums {
428             #------------------
429             # This is probably not the slickest connectivity algorithm, but will do for now.
430 0     0 1   my @a = @_;
431 0           my ($from, $to, $i, @ca, $consec);
432            
433 0           $consec = 0;
434 0           for($i=0; $i < @a; $i++) {
435 0 0         not $from and do{ $from = $a[$i]; next; };
  0            
  0            
436 0 0         if($a[$i] == $a[$i-1]+1) {
437 0           $to = $a[$i];
438 0           $consec++;
439             } else {
440 0 0         if($consec == 1) { $from .= ",$to"; }
  0            
441 0 0         else { $from .= $consec>1 ? "\-$to" : ""; }
442 0           push @ca, split(',', $from);
443 0           $from = $a[$i];
444 0           $consec = 0;
445 0           $to = undef;
446             }
447             }
448 0 0         if(defined $to) {
449 0 0         if($consec == 1) { $from .= ",$to"; }
  0            
450 0 0         else { $from .= $consec>1 ? "\-$to" : ""; }
451             }
452 0 0         push @ca, split(',', $from) if $from;
453              
454 0           @ca;
455             }
456              
457              
458             =head2 strip_blast_html
459              
460             Usage : $boolean = &strip_blast_html( string_ref );
461             : This method is exported.
462             Purpose : Removes HTML formatting from a supplied string.
463             : Attempts to restore the Blast report to enable
464             : parsing by Bio::SearchIO::blast.pm
465             Returns : Boolean: true if string was stripped, false if not.
466             Argument : string_ref = reference to a string containing the whole Blast
467             : report containing HTML formatting.
468             Throws : Croaks if the argument is not a scalar reference.
469             Comments : Based on code originally written by Alex Dong Li
470             : (ali@genet.sickkids.on.ca).
471             : This method does some Blast-specific stripping
472             : (adds back a '>' character in front of each HSP
473             : alignment listing).
474             :
475             : THIS METHOD IS VERY SENSITIVE TO BLAST FORMATTING CHANGES!
476             :
477             : Removal of the HTML tags and accurate reconstitution of the
478             : non-HTML-formatted report is highly dependent on structure of
479             : the HTML-formatted version. For example, it assumes that first
480             : line of each alignment section (HSP listing) starts with a
481             : anchor tag. This permits the reconstruction of the
482             : original report in which these lines begin with a ">".
483             : This is required for parsing.
484             :
485             : If the structure of the Blast report itself is not intended to
486             : be a standard, the structure of the HTML-formatted version
487             : is even less so. Therefore, the use of this method to
488             : reconstitute parsable Blast reports from HTML-format versions
489             : should be considered a temorary solution.
490              
491             =cut
492              
493             #--------------------
494             sub strip_blast_html {
495             #--------------------
496             # This may not best way to remove html tags. However, it is simple.
497             # it won't work under following conditions:
498             # 1) if quoted > appears in a tag (does this ever happen?)
499             # 2) if a tag is split over multiple lines and this method is
500             # used to process one line at a time.
501            
502 0     0 1   my ($string_ref) = shift;
503              
504 0 0         ref $string_ref eq 'SCALAR' or
505             croak ("Can't strip HTML: ".
506 0           "Argument is should be a SCALAR reference not a ${\ref $string_ref}\n");
507              
508 0           my $str = $$string_ref;
509 0           my $stripped = 0;
510              
511             # Removing "" and adding the '>' character for
512             # HSP alignment listings.
513 0 0         $str =~ s/(\A|\n)]+> ?/>/sgi and $stripped = 1;
514              
515             # Removing all "<>" tags.
516 0 0         $str =~ s/<[^>]+>| //sgi and $stripped = 1;
517              
518             # Re-uniting any lone '>' characters.
519 0 0         $str =~ s/(\A|\n)>\s+/\n\n>/sgi and $stripped = 1;
520              
521 0           $$string_ref = $str;
522 0           $stripped;
523             }
524              
525              
526             1;
527              
528