| line |
stmt |
bran |
cond |
sub |
pod |
time |
code |
|
1
|
|
|
|
|
|
|
=head1 NAME |
|
2
|
|
|
|
|
|
|
|
|
3
|
|
|
|
|
|
|
Bio::Search::BlastUtils - Utility functions for Bio::Search:: BLAST objects |
|
4
|
|
|
|
|
|
|
|
|
5
|
|
|
|
|
|
|
=head1 SYNOPSIS |
|
6
|
|
|
|
|
|
|
|
|
7
|
|
|
|
|
|
|
# This module is just a collection of subroutines, not an object. |
|
8
|
|
|
|
|
|
|
|
|
9
|
|
|
|
|
|
|
See L. |
|
10
|
|
|
|
|
|
|
|
|
11
|
|
|
|
|
|
|
=head1 DESCRIPTION |
|
12
|
|
|
|
|
|
|
|
|
13
|
|
|
|
|
|
|
The BlastUtils.pm module is a collection of subroutines used primarily by |
|
14
|
|
|
|
|
|
|
Bio::Search::Hit::BlastHit objects for some of the additional |
|
15
|
|
|
|
|
|
|
functionality, such as HSP tiling. Right now, the BlastUtils is just a |
|
16
|
|
|
|
|
|
|
collection of methods, not an object, and it's tightly coupled to |
|
17
|
|
|
|
|
|
|
Bio::Search::Hit::BlastHit. A goal for the future is to generalize it |
|
18
|
|
|
|
|
|
|
to work based on the Bio::Search interfaces, then it can work with any |
|
19
|
|
|
|
|
|
|
objects that implements them. |
|
20
|
|
|
|
|
|
|
|
|
21
|
|
|
|
|
|
|
=head1 AUTHOR |
|
22
|
|
|
|
|
|
|
|
|
23
|
|
|
|
|
|
|
Steve Chervitz Esac@bioperl.orgE |
|
24
|
|
|
|
|
|
|
|
|
25
|
|
|
|
|
|
|
=cut |
|
26
|
|
|
|
|
|
|
|
|
27
|
|
|
|
|
|
|
#' |
|
28
|
|
|
|
|
|
|
|
|
29
|
|
|
|
|
|
|
package Bio::Search::BlastUtils; |
|
30
|
3
|
|
|
3
|
|
16
|
use Bio::Root::Version; |
|
|
3
|
|
|
|
|
4
|
|
|
|
3
|
|
|
|
|
24
|
|
|
31
|
|
|
|
|
|
|
|
|
32
|
|
|
|
|
|
|
|
|
33
|
|
|
|
|
|
|
=head2 tile_hsps |
|
34
|
|
|
|
|
|
|
|
|
35
|
|
|
|
|
|
|
Usage : tile_hsps( $sbjct ); |
|
36
|
|
|
|
|
|
|
: This is called automatically by Bio::Search::Hit::BlastHit |
|
37
|
|
|
|
|
|
|
: during object construction or |
|
38
|
|
|
|
|
|
|
: as needed by methods that rely on having tiled data. |
|
39
|
|
|
|
|
|
|
Purpose : Collect statistics about the aligned sequences in a set of HSPs. |
|
40
|
|
|
|
|
|
|
: Calculates the following data across all HSPs: |
|
41
|
|
|
|
|
|
|
: -- total alignment length |
|
42
|
|
|
|
|
|
|
: -- total identical residues |
|
43
|
|
|
|
|
|
|
: -- total conserved residues |
|
44
|
|
|
|
|
|
|
Returns : n/a |
|
45
|
|
|
|
|
|
|
Argument : A Bio::Search::Hit::BlastHit object |
|
46
|
|
|
|
|
|
|
Throws : n/a |
|
47
|
|
|
|
|
|
|
Comments : |
|
48
|
|
|
|
|
|
|
: This method is *strongly* coupled to Bio::Search::Hit::BlastHit |
|
49
|
|
|
|
|
|
|
: (it accesses BlastHit data members directly). |
|
50
|
|
|
|
|
|
|
: TODO: Re-write this to the Bio::Search::Hit::HitI interface. |
|
51
|
|
|
|
|
|
|
: |
|
52
|
|
|
|
|
|
|
: This method performs more careful summing of data across |
|
53
|
|
|
|
|
|
|
: all HSPs in the Sbjct object. Only HSPs that are in the same strand |
|
54
|
|
|
|
|
|
|
: and frame are tiled. Simply summing the data from all HSPs |
|
55
|
|
|
|
|
|
|
: in the same strand and frame will overestimate the actual |
|
56
|
|
|
|
|
|
|
: length of the alignment if there is overlap between different HSPs |
|
57
|
|
|
|
|
|
|
: (often the case). |
|
58
|
|
|
|
|
|
|
: |
|
59
|
|
|
|
|
|
|
: The strategy is to tile the HSPs and sum over the |
|
60
|
|
|
|
|
|
|
: contigs, collecting data separately from overlapping and |
|
61
|
|
|
|
|
|
|
: non-overlapping regions of each HSP. To facilitate this, the |
|
62
|
|
|
|
|
|
|
: HSP.pm object now permits extraction of data from sub-sections |
|
63
|
|
|
|
|
|
|
: of an HSP. |
|
64
|
|
|
|
|
|
|
: |
|
65
|
|
|
|
|
|
|
: Additional useful information is collected from the results |
|
66
|
|
|
|
|
|
|
: of the tiling. It is possible that sub-sequences in |
|
67
|
|
|
|
|
|
|
: different HSPs will overlap significantly. In this case, it |
|
68
|
|
|
|
|
|
|
: is impossible to create a single unambiguous alignment by |
|
69
|
|
|
|
|
|
|
: concatenating the HSPs. The ambiguity may indicate the |
|
70
|
|
|
|
|
|
|
: presence of multiple, similar domains in one or both of the |
|
71
|
|
|
|
|
|
|
: aligned sequences. This ambiguity is recorded using the |
|
72
|
|
|
|
|
|
|
: ambiguous_aln() method. |
|
73
|
|
|
|
|
|
|
: |
|
74
|
|
|
|
|
|
|
: This method does not attempt to discern biologically |
|
75
|
|
|
|
|
|
|
: significant vs. insignificant overlaps. The allowable amount of |
|
76
|
|
|
|
|
|
|
: overlap can be set with the overlap() method or with the -OVERLAP |
|
77
|
|
|
|
|
|
|
: parameter used when constructing the Blast & Sbjct objects. |
|
78
|
|
|
|
|
|
|
: |
|
79
|
|
|
|
|
|
|
: For a given hit, both the query and the sbjct sequences are |
|
80
|
|
|
|
|
|
|
: tiled independently. |
|
81
|
|
|
|
|
|
|
: |
|
82
|
|
|
|
|
|
|
: -- If only query sequence HSPs overlap, |
|
83
|
|
|
|
|
|
|
: this may suggest multiple domains in the sbjct. |
|
84
|
|
|
|
|
|
|
: -- If only sbjct sequence HSPs overlap, |
|
85
|
|
|
|
|
|
|
: this may suggest multiple domains in the query. |
|
86
|
|
|
|
|
|
|
: -- If both query & sbjct sequence HSPs overlap, |
|
87
|
|
|
|
|
|
|
: this suggests multiple domains in both. |
|
88
|
|
|
|
|
|
|
: -- If neither query & sbjct sequence HSPs overlap, |
|
89
|
|
|
|
|
|
|
: this suggests either no multiple domains in either |
|
90
|
|
|
|
|
|
|
: sequence OR that both sequences have the same |
|
91
|
|
|
|
|
|
|
: distribution of multiple similar domains. |
|
92
|
|
|
|
|
|
|
: |
|
93
|
|
|
|
|
|
|
: This method can deal with the special case of when multiple |
|
94
|
|
|
|
|
|
|
: HSPs exactly overlap. |
|
95
|
|
|
|
|
|
|
: |
|
96
|
|
|
|
|
|
|
: Efficiency concerns: |
|
97
|
|
|
|
|
|
|
: Speed will be an issue for sequences with numerous HSPs. |
|
98
|
|
|
|
|
|
|
: |
|
99
|
|
|
|
|
|
|
Bugs : Currently, tile_hsps() does not properly account for |
|
100
|
|
|
|
|
|
|
: the number of non-tiled but overlapping HSPs, which becomes a problem |
|
101
|
|
|
|
|
|
|
: as overlap() grows. Large values overlap() may thus lead to |
|
102
|
|
|
|
|
|
|
: incorrect statistics for some hits. For best results, keep overlap() |
|
103
|
|
|
|
|
|
|
: below 5 (DEFAULT IS 2). For more about this, see the "HSP Tiling and |
|
104
|
|
|
|
|
|
|
: Ambiguous Alignments" section in L. |
|
105
|
|
|
|
|
|
|
|
|
106
|
|
|
|
|
|
|
See Also : L<_adjust_contigs>(), L |
|
107
|
|
|
|
|
|
|
|
|
108
|
|
|
|
|
|
|
=cut |
|
109
|
|
|
|
|
|
|
|
|
110
|
|
|
|
|
|
|
#-------------- |
|
111
|
|
|
|
|
|
|
sub tile_hsps { |
|
112
|
|
|
|
|
|
|
#-------------- |
|
113
|
0
|
|
|
0
|
1
|
|
my $sbjct = shift; |
|
114
|
|
|
|
|
|
|
|
|
115
|
0
|
|
|
|
|
|
$sbjct->{'_tile_hsps'} = 1; |
|
116
|
0
|
|
|
|
|
|
$sbjct->{'_gaps_query'} = 0; |
|
117
|
0
|
|
|
|
|
|
$sbjct->{'_gaps_sbjct'} = 0; |
|
118
|
|
|
|
|
|
|
|
|
119
|
|
|
|
|
|
|
## Simple summation scheme. Valid if there is only one HSP. |
|
120
|
0
|
0
|
0
|
|
|
|
if((defined($sbjct->{'_n'}) and $sbjct->{'_n'} == 1) or $sbjct->num_hsps == 1) { |
|
|
|
|
0
|
|
|
|
|
|
121
|
0
|
|
|
|
|
|
my $hsp = $sbjct->hsp; |
|
122
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_query'} = $hsp->length('query'); |
|
123
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_sbjct'} = $hsp->length('sbjct'); |
|
124
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_total'} = $hsp->length('total'); |
|
125
|
0
|
|
|
|
|
|
($sbjct->{'_totalIdentical'},$sbjct->{'_totalConserved'}) = $hsp->matches(); |
|
126
|
0
|
|
|
|
|
|
$sbjct->{'_gaps_query'} = $hsp->gaps('query'); |
|
127
|
0
|
|
|
|
|
|
$sbjct->{'_gaps_sbjct'} = $hsp->gaps('sbjct'); |
|
128
|
|
|
|
|
|
|
|
|
129
|
|
|
|
|
|
|
# print "_tile_hsps(): single HSP, easy stats.\n"; |
|
130
|
0
|
|
|
|
|
|
return; |
|
131
|
|
|
|
|
|
|
} else { |
|
132
|
|
|
|
|
|
|
# print STDERR "Sbjct: _tile_hsps: summing multiple HSPs\n"; |
|
133
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_query'} = 0; |
|
134
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_sbjct'} = 0; |
|
135
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_total'} = 0; |
|
136
|
0
|
|
|
|
|
|
$sbjct->{'_totalIdentical'} = 0; |
|
137
|
0
|
|
|
|
|
|
$sbjct->{'_totalConserved'} = 0; |
|
138
|
|
|
|
|
|
|
} |
|
139
|
|
|
|
|
|
|
|
|
140
|
|
|
|
|
|
|
## More than one HSP. Must tile HSPs. |
|
141
|
|
|
|
|
|
|
# print "\nTiling HSPs for $sbjct\n"; |
|
142
|
0
|
|
|
|
|
|
my($hsp, $qstart, $qstop, $sstart, $sstop); |
|
143
|
0
|
|
|
|
|
|
my($frame, $strand, $qstrand, $sstrand); |
|
144
|
0
|
|
|
|
|
|
my(@qcontigs, @scontigs); |
|
145
|
0
|
|
|
|
|
|
my $qoverlap = 0; |
|
146
|
0
|
|
|
|
|
|
my $soverlap = 0; |
|
147
|
0
|
|
|
|
|
|
my $max_overlap = $sbjct->{'_overlap'}; |
|
148
|
|
|
|
|
|
|
|
|
149
|
0
|
|
|
|
|
|
foreach $hsp ($sbjct->hsps()) { |
|
150
|
|
|
|
|
|
|
# printf " HSP: %s\n%s\n",$hsp->name, $hsp->str('query'); |
|
151
|
|
|
|
|
|
|
# printf " Length = %d; Identical = %d; Conserved = %d; Conserved(1-10): %d",$hsp->length, $hsp->length(-TYPE=>'iden'), $hsp->length(-TYPE=>'cons'), $hsp->length(-TYPE=>'cons',-START=>0,-STOP=>10); |
|
152
|
0
|
|
|
|
|
|
($qstart, $qstop) = $hsp->range('query'); |
|
153
|
0
|
|
|
|
|
|
($sstart, $sstop) = $hsp->range('sbjct'); |
|
154
|
0
|
|
|
|
|
|
$frame = $hsp->frame('hit'); |
|
155
|
0
|
0
|
|
|
|
|
$frame = -1 unless defined $frame; |
|
156
|
0
|
|
|
|
|
|
($qstrand, $sstrand) = $hsp->strand; |
|
157
|
|
|
|
|
|
|
|
|
158
|
0
|
|
|
|
|
|
my ($qgaps, $sgaps) = $hsp->gaps(); |
|
159
|
0
|
|
|
|
|
|
$sbjct->{'_gaps_query'} += $qgaps; |
|
160
|
0
|
|
|
|
|
|
$sbjct->{'_gaps_sbjct'} += $sgaps; |
|
161
|
|
|
|
|
|
|
|
|
162
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_total'} += $hsp->length; |
|
163
|
|
|
|
|
|
|
## Collect contigs in the query sequence. |
|
164
|
0
|
|
|
|
|
|
$qoverlap = &_adjust_contigs('query', $hsp, $qstart, $qstop, \@qcontigs, $max_overlap, $frame, $qstrand); |
|
165
|
|
|
|
|
|
|
|
|
166
|
|
|
|
|
|
|
## Collect contigs in the sbjct sequence (needed for domain data and gapped Blast). |
|
167
|
0
|
|
|
|
|
|
$soverlap = &_adjust_contigs('sbjct', $hsp, $sstart, $sstop, \@scontigs, $max_overlap, $frame, $sstrand); |
|
168
|
|
|
|
|
|
|
|
|
169
|
|
|
|
|
|
|
## Collect overall start and stop data for query and sbjct over all HSPs. |
|
170
|
0
|
0
|
|
|
|
|
if(not defined $sbjct->{'_queryStart'}) { |
|
171
|
0
|
|
|
|
|
|
$sbjct->{'_queryStart'} = $qstart; |
|
172
|
0
|
|
|
|
|
|
$sbjct->{'_queryStop'} = $qstop; |
|
173
|
0
|
|
|
|
|
|
$sbjct->{'_sbjctStart'} = $sstart; |
|
174
|
0
|
|
|
|
|
|
$sbjct->{'_sbjctStop'} = $sstop; |
|
175
|
|
|
|
|
|
|
} else { |
|
176
|
0
|
0
|
|
|
|
|
$sbjct->{'_queryStart'} = ($qstart < $sbjct->{'_queryStart'} ? $qstart : $sbjct->{'_queryStart'}); |
|
177
|
0
|
0
|
|
|
|
|
$sbjct->{'_queryStop'} = ($qstop > $sbjct->{'_queryStop'} ? $qstop : $sbjct->{'_queryStop'}); |
|
178
|
0
|
0
|
|
|
|
|
$sbjct->{'_sbjctStart'} = ($sstart < $sbjct->{'_sbjctStart'} ? $sstart : $sbjct->{'_sbjctStart'}); |
|
179
|
0
|
0
|
|
|
|
|
$sbjct->{'_sbjctStop'} = ($sstop > $sbjct->{'_sbjctStop'} ? $sstop : $sbjct->{'_sbjctStop'}); |
|
180
|
|
|
|
|
|
|
} |
|
181
|
|
|
|
|
|
|
} |
|
182
|
|
|
|
|
|
|
|
|
183
|
|
|
|
|
|
|
## Collect data across the collected contigs. |
|
184
|
|
|
|
|
|
|
|
|
185
|
|
|
|
|
|
|
# print "\nQUERY CONTIGS:\n"; |
|
186
|
|
|
|
|
|
|
# print " gaps = $sbjct->{'_gaps_query'}\n"; |
|
187
|
|
|
|
|
|
|
|
|
188
|
|
|
|
|
|
|
# TODO: Account for strand/frame issue! |
|
189
|
|
|
|
|
|
|
# Strategy: collect data on a per strand+frame basis and save the most significant one. |
|
190
|
0
|
|
|
|
|
|
my (%qctg_dat); |
|
191
|
0
|
|
|
|
|
|
foreach(@qcontigs) { |
|
192
|
|
|
|
|
|
|
# print " query contig: $_->{'start'} - $_->{'stop'}\n"; |
|
193
|
|
|
|
|
|
|
# print " iden = $_->{'iden'}; cons = $_->{'cons'}\n"; |
|
194
|
0
|
|
|
|
|
|
($frame, $strand) = ($_->{'frame'}, $_->{'strand'}); |
|
195
|
0
|
|
|
|
|
|
$qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_->{'stop'} - $_->{'start'} + 1; |
|
196
|
0
|
|
|
|
|
|
$qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_->{'iden'}; |
|
197
|
0
|
|
|
|
|
|
$qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_->{'cons'}; |
|
198
|
0
|
|
|
|
|
|
$qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand; |
|
199
|
|
|
|
|
|
|
} |
|
200
|
|
|
|
|
|
|
|
|
201
|
|
|
|
|
|
|
# Find longest contig. |
|
202
|
0
|
|
|
|
|
|
my @sortedkeys = reverse sort { $qctg_dat{ $a }->{'length_aln_query'} <=> $qctg_dat{ $b }->{'length_aln_query'} } keys %qctg_dat; |
|
|
0
|
|
|
|
|
|
|
|
203
|
|
|
|
|
|
|
|
|
204
|
|
|
|
|
|
|
# Save the largest to the sbjct: |
|
205
|
0
|
|
|
|
|
|
my $longest = $sortedkeys[0]; |
|
206
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_query'} = $qctg_dat{ $longest }->{'length_aln_query'}; |
|
207
|
0
|
|
|
|
|
|
$sbjct->{'_totalIdentical'} = $qctg_dat{ $longest }->{'totalIdentical'}; |
|
208
|
0
|
|
|
|
|
|
$sbjct->{'_totalConserved'} = $qctg_dat{ $longest }->{'totalConserved'}; |
|
209
|
0
|
|
|
|
|
|
$sbjct->{'_qstrand'} = $qctg_dat{ $longest }->{'qstrand'}; |
|
210
|
|
|
|
|
|
|
|
|
211
|
|
|
|
|
|
|
## Collect data for sbjct contigs. Important for gapped Blast. |
|
212
|
|
|
|
|
|
|
## The totalIdentical and totalConserved numbers will be the same |
|
213
|
|
|
|
|
|
|
## as determined for the query contigs. |
|
214
|
|
|
|
|
|
|
|
|
215
|
|
|
|
|
|
|
# print "\nSBJCT CONTIGS:\n"; |
|
216
|
|
|
|
|
|
|
# print " gaps = $sbjct->{'_gaps_sbjct'}\n"; |
|
217
|
|
|
|
|
|
|
|
|
218
|
0
|
|
|
|
|
|
my (%sctg_dat); |
|
219
|
0
|
|
|
|
|
|
foreach(@scontigs) { |
|
220
|
|
|
|
|
|
|
# print " sbjct contig: $_->{'start'} - $_->{'stop'}\n"; |
|
221
|
|
|
|
|
|
|
# print " iden = $_->{'iden'}; cons = $_->{'cons'}\n"; |
|
222
|
0
|
|
|
|
|
|
($frame, $strand) = ($_->{'frame'}, $_->{'strand'}); |
|
223
|
0
|
|
|
|
|
|
$sctg_dat{ "$frame$strand" }->{'length_aln_sbjct'} += $_->{'stop'} - $_->{'start'} + 1; |
|
224
|
0
|
|
|
|
|
|
$sctg_dat{ "$frame$strand" }->{'frame'} = $frame; |
|
225
|
0
|
|
|
|
|
|
$sctg_dat{ "$frame$strand" }->{'sstrand'} = $strand; |
|
226
|
|
|
|
|
|
|
} |
|
227
|
|
|
|
|
|
|
|
|
228
|
0
|
|
|
|
|
|
@sortedkeys = reverse sort { $sctg_dat{ $a }->{'length_aln_sbjct'} <=> $sctg_dat{ $b }->{'length_aln_sbjct'} } keys %sctg_dat; |
|
|
0
|
|
|
|
|
|
|
|
229
|
|
|
|
|
|
|
|
|
230
|
|
|
|
|
|
|
# Save the largest to the sbjct: |
|
231
|
0
|
|
|
|
|
|
$longest = $sortedkeys[0]; |
|
232
|
|
|
|
|
|
|
|
|
233
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_sbjct'} = $sctg_dat{ $longest }->{'length_aln_sbjct'}; |
|
234
|
0
|
|
|
|
|
|
$sbjct->{'_frame'} = $sctg_dat{ $longest }->{'frame'}; |
|
235
|
0
|
|
|
|
|
|
$sbjct->{'_sstrand'} = $sctg_dat{ $longest }->{'sstrand'}; |
|
236
|
|
|
|
|
|
|
|
|
237
|
0
|
0
|
|
|
|
|
if($qoverlap) { |
|
|
|
0
|
|
|
|
|
|
|
238
|
0
|
0
|
|
|
|
|
if($soverlap) { $sbjct->ambiguous_aln('qs'); |
|
|
0
|
|
|
|
|
|
|
|
239
|
|
|
|
|
|
|
# print "\n*** AMBIGUOUS ALIGNMENT: Query and Sbjct\n\n"; |
|
240
|
|
|
|
|
|
|
} |
|
241
|
0
|
|
|
|
|
|
else { $sbjct->ambiguous_aln('q'); |
|
242
|
|
|
|
|
|
|
# print "\n*** AMBIGUOUS ALIGNMENT: Query\n\n"; |
|
243
|
|
|
|
|
|
|
} |
|
244
|
|
|
|
|
|
|
} elsif($soverlap) { |
|
245
|
0
|
|
|
|
|
|
$sbjct->ambiguous_aln('s'); |
|
246
|
|
|
|
|
|
|
# print "\n*** AMBIGUOUS ALIGNMENT: Sbjct\n\n"; |
|
247
|
|
|
|
|
|
|
} |
|
248
|
|
|
|
|
|
|
|
|
249
|
|
|
|
|
|
|
# Adjust length based on BLAST flavor. |
|
250
|
0
|
|
|
|
|
|
my $prog = $sbjct->algorithm; |
|
251
|
0
|
0
|
|
|
|
|
if($prog eq 'TBLASTN') { |
|
|
|
0
|
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
252
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_sbjct'} /= 3; |
|
253
|
|
|
|
|
|
|
} elsif($prog eq 'BLASTX' ) { |
|
254
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_query'} /= 3; |
|
255
|
|
|
|
|
|
|
} elsif($prog eq 'TBLASTX') { |
|
256
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_query'} /= 3; |
|
257
|
0
|
|
|
|
|
|
$sbjct->{'_length_aln_sbjct'} /= 3; |
|
258
|
|
|
|
|
|
|
} |
|
259
|
|
|
|
|
|
|
} |
|
260
|
|
|
|
|
|
|
|
|
261
|
|
|
|
|
|
|
|
|
262
|
|
|
|
|
|
|
|
|
263
|
|
|
|
|
|
|
=head2 _adjust_contigs |
|
264
|
|
|
|
|
|
|
|
|
265
|
|
|
|
|
|
|
Usage : n/a; called automatically during object construction. |
|
266
|
|
|
|
|
|
|
Purpose : Builds HSP contigs for a given BLAST hit. |
|
267
|
|
|
|
|
|
|
: Utility method called by _tile_hsps() |
|
268
|
|
|
|
|
|
|
Returns : |
|
269
|
|
|
|
|
|
|
Argument : |
|
270
|
|
|
|
|
|
|
Throws : Exceptions propagated from Bio::Search::Hit::BlastHSP::matches() |
|
271
|
|
|
|
|
|
|
: for invalid sub-sequence ranges. |
|
272
|
|
|
|
|
|
|
Status : Experimental |
|
273
|
|
|
|
|
|
|
Comments : This method does not currently support gapped alignments. |
|
274
|
|
|
|
|
|
|
: Also, it does not keep track of the number of HSPs that |
|
275
|
|
|
|
|
|
|
: overlap within the amount specified by overlap(). |
|
276
|
|
|
|
|
|
|
: This will lead to significant tracking errors for large |
|
277
|
|
|
|
|
|
|
: overlap values. |
|
278
|
|
|
|
|
|
|
|
|
279
|
|
|
|
|
|
|
See Also : L(), L |
|
280
|
|
|
|
|
|
|
|
|
281
|
|
|
|
|
|
|
=cut |
|
282
|
|
|
|
|
|
|
|
|
283
|
|
|
|
|
|
|
#------------------- |
|
284
|
|
|
|
|
|
|
sub _adjust_contigs { |
|
285
|
|
|
|
|
|
|
#------------------- |
|
286
|
0
|
|
|
0
|
|
|
my ($seqType, $hsp, $start, $stop, $contigs_ref, $max_overlap, $frame, $strand) = @_; |
|
287
|
|
|
|
|
|
|
|
|
288
|
0
|
|
|
|
|
|
my $overlap = 0; |
|
289
|
0
|
|
|
|
|
|
my ($numID, $numCons); |
|
290
|
|
|
|
|
|
|
|
|
291
|
|
|
|
|
|
|
# print STDERR "Testing $seqType data: HSP (${\$hsp->name}); $start, $stop, strand=$strand, frame=$frame\n"; |
|
292
|
0
|
|
|
|
|
|
foreach(@$contigs_ref) { |
|
293
|
|
|
|
|
|
|
# print STDERR " Contig: $_->{'start'} - $_->{'stop'}, strand=$_->{'strand'}, frame=$_->{'frame'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n"; |
|
294
|
|
|
|
|
|
|
|
|
295
|
|
|
|
|
|
|
# Don't merge things unless they have matching strand/frame. |
|
296
|
0
|
0
|
0
|
|
|
|
next unless ($_->{'frame'} == $frame and $_->{'strand'} == $strand); |
|
297
|
|
|
|
|
|
|
|
|
298
|
|
|
|
|
|
|
## Test special case of a nested HSP. Skip it. |
|
299
|
0
|
0
|
0
|
|
|
|
if($start >= $_->{'start'} and $stop <= $_->{'stop'}) { |
|
300
|
|
|
|
|
|
|
# print STDERR "----> Nested HSP. Skipping.\n"; |
|
301
|
0
|
|
|
|
|
|
$overlap = 1; |
|
302
|
0
|
|
|
|
|
|
next; |
|
303
|
|
|
|
|
|
|
} |
|
304
|
|
|
|
|
|
|
|
|
305
|
|
|
|
|
|
|
## Test for overlap at beginning of contig. |
|
306
|
0
|
0
|
0
|
|
|
|
if($start < $_->{'start'} and $stop > ($_->{'start'} + $max_overlap)) { |
|
307
|
|
|
|
|
|
|
# print STDERR "----> Overlaps beg: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n"; |
|
308
|
|
|
|
|
|
|
# Collect stats over the non-overlapping region. |
|
309
|
0
|
|
|
|
|
|
eval { |
|
310
|
|
|
|
|
|
|
($numID, $numCons) = $hsp->matches(-SEQ =>$seqType, |
|
311
|
|
|
|
|
|
|
-START =>$start, |
|
312
|
0
|
|
|
|
|
|
-STOP =>$_->{'start'}-1); |
|
313
|
|
|
|
|
|
|
}; |
|
314
|
0
|
0
|
|
|
|
|
if($@) { warn "\a\n$@\n"; } |
|
|
0
|
|
|
|
|
|
|
|
315
|
|
|
|
|
|
|
else { |
|
316
|
0
|
|
|
|
|
|
$_->{'start'} = $start; # Assign a new start coordinate to the contig |
|
317
|
0
|
|
|
|
|
|
$_->{'iden'} += $numID; # and add new data to #identical, #conserved. |
|
318
|
0
|
|
|
|
|
|
$_->{'cons'} += $numCons; |
|
319
|
0
|
|
|
|
|
|
$overlap = 1; |
|
320
|
|
|
|
|
|
|
} |
|
321
|
|
|
|
|
|
|
} |
|
322
|
|
|
|
|
|
|
|
|
323
|
|
|
|
|
|
|
## Test for overlap at end of contig. |
|
324
|
0
|
0
|
0
|
|
|
|
if($stop > $_->{'stop'} and $start < ($_->{'stop'} - $max_overlap)) { |
|
325
|
|
|
|
|
|
|
# print STDERR "----> Overlaps end: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n"; |
|
326
|
|
|
|
|
|
|
# Collect stats over the non-overlapping region. |
|
327
|
0
|
|
|
|
|
|
eval { |
|
328
|
|
|
|
|
|
|
($numID,$numCons) = $hsp->matches(-SEQ =>$seqType, |
|
329
|
0
|
|
|
|
|
|
-START =>$_->{'stop'}, |
|
330
|
|
|
|
|
|
|
-STOP =>$stop); |
|
331
|
|
|
|
|
|
|
}; |
|
332
|
0
|
0
|
|
|
|
|
if($@) { warn "\a\n$@\n"; } |
|
|
0
|
|
|
|
|
|
|
|
333
|
|
|
|
|
|
|
else { |
|
334
|
0
|
|
|
|
|
|
$_->{'stop'} = $stop; # Assign a new stop coordinate to the contig |
|
335
|
0
|
|
|
|
|
|
$_->{'iden'} += $numID; # and add new data to #identical, #conserved. |
|
336
|
0
|
|
|
|
|
|
$_->{'cons'} += $numCons; |
|
337
|
0
|
|
|
|
|
|
$overlap = 1; |
|
338
|
|
|
|
|
|
|
} |
|
339
|
|
|
|
|
|
|
} |
|
340
|
0
|
0
|
|
|
|
|
$overlap && do { |
|
341
|
|
|
|
|
|
|
# print STDERR " New Contig data:\n"; |
|
342
|
|
|
|
|
|
|
# print STDERR " Contig: $_->{'start'} - $_->{'stop'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n"; |
|
343
|
0
|
|
|
|
|
|
last; |
|
344
|
|
|
|
|
|
|
}; |
|
345
|
|
|
|
|
|
|
} |
|
346
|
|
|
|
|
|
|
## If there is no overlap, add the complete HSP data. |
|
347
|
0
|
0
|
|
|
|
|
!$overlap && do { |
|
348
|
|
|
|
|
|
|
# print STDERR "No overlap. Adding new contig.\n"; |
|
349
|
0
|
|
|
|
|
|
($numID,$numCons) = $hsp->matches(-SEQ=>$seqType); |
|
350
|
0
|
|
|
|
|
|
push @$contigs_ref, {'start'=>$start, 'stop'=>$stop, |
|
351
|
|
|
|
|
|
|
'iden'=>$numID, 'cons'=>$numCons, |
|
352
|
|
|
|
|
|
|
'strand'=>$strand, 'frame'=>$frame}; |
|
353
|
|
|
|
|
|
|
}; |
|
354
|
0
|
|
|
|
|
|
$overlap; |
|
355
|
|
|
|
|
|
|
} |
|
356
|
|
|
|
|
|
|
|
|
357
|
|
|
|
|
|
|
=head2 get_exponent |
|
358
|
|
|
|
|
|
|
|
|
359
|
|
|
|
|
|
|
Usage : &get_exponent( number ); |
|
360
|
|
|
|
|
|
|
Purpose : Determines the power of 10 exponent of an integer, float, |
|
361
|
|
|
|
|
|
|
: or scientific notation number. |
|
362
|
|
|
|
|
|
|
Example : &get_exponent("4.0e-206"); |
|
363
|
|
|
|
|
|
|
: &get_exponent("0.00032"); |
|
364
|
|
|
|
|
|
|
: &get_exponent("10."); |
|
365
|
|
|
|
|
|
|
: &get_exponent("1000.0"); |
|
366
|
|
|
|
|
|
|
: &get_exponent("e+83"); |
|
367
|
|
|
|
|
|
|
Argument : Float, Integer, or scientific notation number |
|
368
|
|
|
|
|
|
|
Returns : Integer representing the exponent part of the number (+ or -). |
|
369
|
|
|
|
|
|
|
: If argument == 0 (zero), return value is "-999". |
|
370
|
|
|
|
|
|
|
Comments : Exponents are rounded up (less negative) if the mantissa is >= 5. |
|
371
|
|
|
|
|
|
|
: Exponents are rounded down (more negative) if the mantissa is <= -5. |
|
372
|
|
|
|
|
|
|
|
|
373
|
|
|
|
|
|
|
=cut |
|
374
|
|
|
|
|
|
|
|
|
375
|
|
|
|
|
|
|
#------------------ |
|
376
|
|
|
|
|
|
|
sub get_exponent { |
|
377
|
|
|
|
|
|
|
#------------------ |
|
378
|
0
|
|
|
0
|
1
|
|
my $data = shift; |
|
379
|
|
|
|
|
|
|
|
|
380
|
0
|
|
|
|
|
|
my($num, $exp) = split /[eE]/, $data; |
|
381
|
|
|
|
|
|
|
|
|
382
|
0
|
0
|
|
|
|
|
if( defined $exp) { |
|
|
|
0
|
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
383
|
0
|
0
|
|
|
|
|
$num = 1 if not $num; |
|
384
|
0
|
0
|
|
|
|
|
$num >= 5 and $exp++; |
|
385
|
0
|
0
|
|
|
|
|
$num <= -5 and $exp--; |
|
386
|
|
|
|
|
|
|
} elsif( $num == 0) { |
|
387
|
0
|
|
|
|
|
|
$exp = -999; |
|
388
|
|
|
|
|
|
|
} elsif( not $num =~ /\./) { |
|
389
|
0
|
|
|
|
|
|
$exp = CORE::length($num) -1; |
|
390
|
|
|
|
|
|
|
} else { |
|
391
|
0
|
|
|
|
|
|
$exp = 0; |
|
392
|
0
|
0
|
|
|
|
|
$num .= '0' if $num =~ /\.$/; |
|
393
|
0
|
|
|
|
|
|
my ($c); |
|
394
|
0
|
|
|
|
|
|
my $rev = 0; |
|
395
|
0
|
0
|
|
|
|
|
if($num !~ /^0/) { |
|
396
|
0
|
|
|
|
|
|
$num = reverse($num); |
|
397
|
0
|
|
|
|
|
|
$rev = 1; |
|
398
|
|
|
|
|
|
|
} |
|
399
|
0
|
|
|
|
|
|
do { $c = chop($num); |
|
|
0
|
|
|
|
|
|
|
|
400
|
0
|
0
|
|
|
|
|
$c == 0 && $exp++; |
|
401
|
|
|
|
|
|
|
} while( $c ne '.'); |
|
402
|
|
|
|
|
|
|
|
|
403
|
0
|
0
|
0
|
|
|
|
$exp = -$exp if $num == 0 and not $rev; |
|
404
|
0
|
0
|
|
|
|
|
$exp -= 1 if $rev; |
|
405
|
|
|
|
|
|
|
} |
|
406
|
0
|
|
|
|
|
|
return $exp; |
|
407
|
|
|
|
|
|
|
} |
|
408
|
|
|
|
|
|
|
|
|
409
|
|
|
|
|
|
|
=head2 collapse_nums |
|
410
|
|
|
|
|
|
|
|
|
411
|
|
|
|
|
|
|
Usage : @cnums = collapse_nums( @numbers ); |
|
412
|
|
|
|
|
|
|
Purpose : Collapses a list of numbers into a set of ranges of consecutive terms: |
|
413
|
|
|
|
|
|
|
: Useful for condensing long lists of consecutive numbers. |
|
414
|
|
|
|
|
|
|
: EXPANDED: |
|
415
|
|
|
|
|
|
|
: 1 2 3 4 5 6 10 12 13 14 15 17 18 20 21 22 24 26 30 31 32 |
|
416
|
|
|
|
|
|
|
: COLLAPSED: |
|
417
|
|
|
|
|
|
|
: 1-6 10 12-15 17 18 20-22 24 26 30-32 |
|
418
|
|
|
|
|
|
|
Argument : List of numbers sorted numerically. |
|
419
|
|
|
|
|
|
|
Returns : List of numbers mixed with ranges of numbers (see above). |
|
420
|
|
|
|
|
|
|
Throws : n/a |
|
421
|
|
|
|
|
|
|
|
|
422
|
|
|
|
|
|
|
See Also : L |
|
423
|
|
|
|
|
|
|
|
|
424
|
|
|
|
|
|
|
=cut |
|
425
|
|
|
|
|
|
|
|
|
426
|
|
|
|
|
|
|
#------------------ |
|
427
|
|
|
|
|
|
|
sub collapse_nums { |
|
428
|
|
|
|
|
|
|
#------------------ |
|
429
|
|
|
|
|
|
|
# This is probably not the slickest connectivity algorithm, but will do for now. |
|
430
|
0
|
|
|
0
|
1
|
|
my @a = @_; |
|
431
|
0
|
|
|
|
|
|
my ($from, $to, $i, @ca, $consec); |
|
432
|
|
|
|
|
|
|
|
|
433
|
0
|
|
|
|
|
|
$consec = 0; |
|
434
|
0
|
|
|
|
|
|
for($i=0; $i < @a; $i++) { |
|
435
|
0
|
0
|
|
|
|
|
not $from and do{ $from = $a[$i]; next; }; |
|
|
0
|
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
436
|
0
|
0
|
|
|
|
|
if($a[$i] == $a[$i-1]+1) { |
|
437
|
0
|
|
|
|
|
|
$to = $a[$i]; |
|
438
|
0
|
|
|
|
|
|
$consec++; |
|
439
|
|
|
|
|
|
|
} else { |
|
440
|
0
|
0
|
|
|
|
|
if($consec == 1) { $from .= ",$to"; } |
|
|
0
|
|
|
|
|
|
|
|
441
|
0
|
0
|
|
|
|
|
else { $from .= $consec>1 ? "\-$to" : ""; } |
|
442
|
0
|
|
|
|
|
|
push @ca, split(',', $from); |
|
443
|
0
|
|
|
|
|
|
$from = $a[$i]; |
|
444
|
0
|
|
|
|
|
|
$consec = 0; |
|
445
|
0
|
|
|
|
|
|
$to = undef; |
|
446
|
|
|
|
|
|
|
} |
|
447
|
|
|
|
|
|
|
} |
|
448
|
0
|
0
|
|
|
|
|
if(defined $to) { |
|
449
|
0
|
0
|
|
|
|
|
if($consec == 1) { $from .= ",$to"; } |
|
|
0
|
|
|
|
|
|
|
|
450
|
0
|
0
|
|
|
|
|
else { $from .= $consec>1 ? "\-$to" : ""; } |
|
451
|
|
|
|
|
|
|
} |
|
452
|
0
|
0
|
|
|
|
|
push @ca, split(',', $from) if $from; |
|
453
|
|
|
|
|
|
|
|
|
454
|
0
|
|
|
|
|
|
@ca; |
|
455
|
|
|
|
|
|
|
} |
|
456
|
|
|
|
|
|
|
|
|
457
|
|
|
|
|
|
|
|
|
458
|
|
|
|
|
|
|
=head2 strip_blast_html |
|
459
|
|
|
|
|
|
|
|
|
460
|
|
|
|
|
|
|
Usage : $boolean = &strip_blast_html( string_ref ); |
|
461
|
|
|
|
|
|
|
: This method is exported. |
|
462
|
|
|
|
|
|
|
Purpose : Removes HTML formatting from a supplied string. |
|
463
|
|
|
|
|
|
|
: Attempts to restore the Blast report to enable |
|
464
|
|
|
|
|
|
|
: parsing by Bio::SearchIO::blast.pm |
|
465
|
|
|
|
|
|
|
Returns : Boolean: true if string was stripped, false if not. |
|
466
|
|
|
|
|
|
|
Argument : string_ref = reference to a string containing the whole Blast |
|
467
|
|
|
|
|
|
|
: report containing HTML formatting. |
|
468
|
|
|
|
|
|
|
Throws : Croaks if the argument is not a scalar reference. |
|
469
|
|
|
|
|
|
|
Comments : Based on code originally written by Alex Dong Li |
|
470
|
|
|
|
|
|
|
: (ali@genet.sickkids.on.ca). |
|
471
|
|
|
|
|
|
|
: This method does some Blast-specific stripping |
|
472
|
|
|
|
|
|
|
: (adds back a '>' character in front of each HSP |
|
473
|
|
|
|
|
|
|
: alignment listing). |
|
474
|
|
|
|
|
|
|
: |
|
475
|
|
|
|
|
|
|
: THIS METHOD IS VERY SENSITIVE TO BLAST FORMATTING CHANGES! |
|
476
|
|
|
|
|
|
|
: |
|
477
|
|
|
|
|
|
|
: Removal of the HTML tags and accurate reconstitution of the |
|
478
|
|
|
|
|
|
|
: non-HTML-formatted report is highly dependent on structure of |
|
479
|
|
|
|
|
|
|
: the HTML-formatted version. For example, it assumes that first |
|
480
|
|
|
|
|
|
|
: line of each alignment section (HSP listing) starts with a |
|
481
|
|
|
|
|
|
|
: anchor tag. This permits the reconstruction of the |
|
482
|
|
|
|
|
|
|
: original report in which these lines begin with a ">". |
|
483
|
|
|
|
|
|
|
: This is required for parsing. |
|
484
|
|
|
|
|
|
|
: |
|
485
|
|
|
|
|
|
|
: If the structure of the Blast report itself is not intended to |
|
486
|
|
|
|
|
|
|
: be a standard, the structure of the HTML-formatted version |
|
487
|
|
|
|
|
|
|
: is even less so. Therefore, the use of this method to |
|
488
|
|
|
|
|
|
|
: reconstitute parsable Blast reports from HTML-format versions |
|
489
|
|
|
|
|
|
|
: should be considered a temorary solution. |
|
490
|
|
|
|
|
|
|
|
|
491
|
|
|
|
|
|
|
=cut |
|
492
|
|
|
|
|
|
|
|
|
493
|
|
|
|
|
|
|
#-------------------- |
|
494
|
|
|
|
|
|
|
sub strip_blast_html { |
|
495
|
|
|
|
|
|
|
#-------------------- |
|
496
|
|
|
|
|
|
|
# This may not best way to remove html tags. However, it is simple. |
|
497
|
|
|
|
|
|
|
# it won't work under following conditions: |
|
498
|
|
|
|
|
|
|
# 1) if quoted > appears in a tag (does this ever happen?) |
|
499
|
|
|
|
|
|
|
# 2) if a tag is split over multiple lines and this method is |
|
500
|
|
|
|
|
|
|
# used to process one line at a time. |
|
501
|
|
|
|
|
|
|
|
|
502
|
0
|
|
|
0
|
1
|
|
my ($string_ref) = shift; |
|
503
|
|
|
|
|
|
|
|
|
504
|
0
|
0
|
|
|
|
|
ref $string_ref eq 'SCALAR' or |
|
505
|
|
|
|
|
|
|
croak ("Can't strip HTML: ". |
|
506
|
0
|
|
|
|
|
|
"Argument is should be a SCALAR reference not a ${\ref $string_ref}\n"); |
|
507
|
|
|
|
|
|
|
|
|
508
|
0
|
|
|
|
|
|
my $str = $$string_ref; |
|
509
|
0
|
|
|
|
|
|
my $stripped = 0; |
|
510
|
|
|
|
|
|
|
|
|
511
|
|
|
|
|
|
|
# Removing "" and adding the '>' character for |
|
512
|
|
|
|
|
|
|
# HSP alignment listings. |
|
513
|
0
|
0
|
|
|
|
|
$str =~ s/(\A|\n)]+> ?/>/sgi and $stripped = 1; |
|
514
|
|
|
|
|
|
|
|
|
515
|
|
|
|
|
|
|
# Removing all "<>" tags. |
|
516
|
0
|
0
|
|
|
|
|
$str =~ s/<[^>]+>| //sgi and $stripped = 1; |
|
517
|
|
|
|
|
|
|
|
|
518
|
|
|
|
|
|
|
# Re-uniting any lone '>' characters. |
|
519
|
0
|
0
|
|
|
|
|
$str =~ s/(\A|\n)>\s+/\n\n>/sgi and $stripped = 1; |
|
520
|
|
|
|
|
|
|
|
|
521
|
0
|
|
|
|
|
|
$$string_ref = $str; |
|
522
|
0
|
|
|
|
|
|
$stripped; |
|
523
|
|
|
|
|
|
|
} |
|
524
|
|
|
|
|
|
|
|
|
525
|
|
|
|
|
|
|
|
|
526
|
|
|
|
|
|
|
1; |
|
527
|
|
|
|
|
|
|
|
|
528
|
|
|
|
|
|
|
|