line |
stmt |
bran |
cond |
sub |
pod |
time |
code |
1
|
|
|
|
|
|
|
package Bio::ToolBox::db_helper; |
2
|
|
|
|
|
|
|
our $VERSION = '1.69'; |
3
|
|
|
|
|
|
|
|
4
|
|
|
|
|
|
|
=head1 NAME |
5
|
|
|
|
|
|
|
|
6
|
|
|
|
|
|
|
Bio::ToolBox::db_helper - helper interface to various database formats |
7
|
|
|
|
|
|
|
|
8
|
|
|
|
|
|
|
=head1 DESCRIPTION |
9
|
|
|
|
|
|
|
|
10
|
|
|
|
|
|
|
In most cases, this module does not need to be used directly. The |
11
|
|
|
|
|
|
|
methods available to L and L |
12
|
|
|
|
|
|
|
provide convenient access to the methods described here. |
13
|
|
|
|
|
|
|
|
14
|
|
|
|
|
|
|
These are helper subroutines to work with relevant databases that can be |
15
|
|
|
|
|
|
|
accessed through BioPerl modules. These include the L |
16
|
|
|
|
|
|
|
relational database as well as L, L, |
17
|
|
|
|
|
|
|
L, L, L, and |
18
|
|
|
|
|
|
|
L databases. The primary functions |
19
|
|
|
|
|
|
|
included opening connections to these databases, verifying features found |
20
|
|
|
|
|
|
|
within the database, generating lists of features or genomic intervals for |
21
|
|
|
|
|
|
|
subsequent analysis, and collecting features and/or scores within regions |
22
|
|
|
|
|
|
|
of interest. Collected scores may be summarized using a variety of statistical |
23
|
|
|
|
|
|
|
methods. |
24
|
|
|
|
|
|
|
|
25
|
|
|
|
|
|
|
When collecting scores or features, the data may hosted in a variety of |
26
|
|
|
|
|
|
|
formats and locations. These include: |
27
|
|
|
|
|
|
|
|
28
|
|
|
|
|
|
|
=over 4 |
29
|
|
|
|
|
|
|
|
30
|
|
|
|
|
|
|
=item SeqFeature database |
31
|
|
|
|
|
|
|
|
32
|
|
|
|
|
|
|
Genomic feature annotation, including genes, transcripts, exons, etc, |
33
|
|
|
|
|
|
|
may be stored in a L database. These |
34
|
|
|
|
|
|
|
databases are backed by either a relational database, including |
35
|
|
|
|
|
|
|
MySQL, PostGreSQL, or SQLite. Small GFF3 files may also be loaded in |
36
|
|
|
|
|
|
|
an in-memory database. These are typically loaded from GFF3 |
37
|
|
|
|
|
|
|
files; see the adapter documentation for more information. Dataset |
38
|
|
|
|
|
|
|
scores, such as from microarray, may also be stored as the score |
39
|
|
|
|
|
|
|
value in the source GFF file. |
40
|
|
|
|
|
|
|
|
41
|
|
|
|
|
|
|
References to local, binary, indexed files may also be included as |
42
|
|
|
|
|
|
|
attributes to features stored in the database. This is legacy support |
43
|
|
|
|
|
|
|
from the L, and should |
44
|
|
|
|
|
|
|
not be used in preference to other methods. Supported files include the |
45
|
|
|
|
|
|
|
legacy binary wig files (F<.wib>, see L) using the |
46
|
|
|
|
|
|
|
C attribute, or bigWig files using the C or C |
47
|
|
|
|
|
|
|
attribute. The attribute values must be full paths. |
48
|
|
|
|
|
|
|
|
49
|
|
|
|
|
|
|
=item BigWig files |
50
|
|
|
|
|
|
|
|
51
|
|
|
|
|
|
|
BigWig files (F<.bw> and F<.bigwig>) are compressed, binary, indexed |
52
|
|
|
|
|
|
|
versions of text wig files and may be accessed either locally or remotely. |
53
|
|
|
|
|
|
|
They support extremely fast score retrieval from any genomic location of |
54
|
|
|
|
|
|
|
any size without sacrificing resolution (spatial and numeric). See |
55
|
|
|
|
|
|
|
L for more information. |
56
|
|
|
|
|
|
|
BigWig files are supported by either the L or the |
57
|
|
|
|
|
|
|
L adapter, based on either L |
58
|
|
|
|
|
|
|
or the L libraries, |
59
|
|
|
|
|
|
|
respectively; see their respective documentation for more information. |
60
|
|
|
|
|
|
|
|
61
|
|
|
|
|
|
|
=item Directory of BigWig files |
62
|
|
|
|
|
|
|
|
63
|
|
|
|
|
|
|
A directory containing two or more BigWig files is assembled into a |
64
|
|
|
|
|
|
|
BigWigSet, allowing for metadata, such as strand, to be associated with |
65
|
|
|
|
|
|
|
BigWig files. Additional metadata beyond the filename may be included in |
66
|
|
|
|
|
|
|
text file F within the directory. See the |
67
|
|
|
|
|
|
|
L adapter documentation for more information. When using |
68
|
|
|
|
|
|
|
L adapters, BigWigSet support is natively supported by |
69
|
|
|
|
|
|
|
the BioToolBox package. |
70
|
|
|
|
|
|
|
|
71
|
|
|
|
|
|
|
=item BigBed files |
72
|
|
|
|
|
|
|
|
73
|
|
|
|
|
|
|
BigBed files are compressed, binary, indexed versions of text BED files. See |
74
|
|
|
|
|
|
|
L for more information. |
75
|
|
|
|
|
|
|
Both local and remote files may be accessed. BigBed files are supported by |
76
|
|
|
|
|
|
|
either the L or the L adapter, based on either L |
77
|
|
|
|
|
|
|
or the L libraries, |
78
|
|
|
|
|
|
|
respectively; see their respective documentation for more information. |
79
|
|
|
|
|
|
|
|
80
|
|
|
|
|
|
|
=item Bam files |
81
|
|
|
|
|
|
|
|
82
|
|
|
|
|
|
|
Bam files (F<.bam>) are compressed, binary, indexed versions of the text SAM file, |
83
|
|
|
|
|
|
|
or sequence alignment map. They are used with next generation sequencing |
84
|
|
|
|
|
|
|
technologies. They support individual alignment retrieval as well as |
85
|
|
|
|
|
|
|
read depth coverage. Two different Bam file adapters are supported. |
86
|
|
|
|
|
|
|
|
87
|
|
|
|
|
|
|
The L adapter is an interface to the Bam (or Cram) file. |
88
|
|
|
|
|
|
|
This is based on the modern HTSlib C library (version E= 1.0), successor |
89
|
|
|
|
|
|
|
to the original samtools library. See L for |
90
|
|
|
|
|
|
|
more information. |
91
|
|
|
|
|
|
|
|
92
|
|
|
|
|
|
|
The L adapter is an older interface to the Bam alignment |
93
|
|
|
|
|
|
|
file. This is based on the samtools C library version E= 0.1.19. While |
94
|
|
|
|
|
|
|
this adapter is still supported, the above should be used for new installs. |
95
|
|
|
|
|
|
|
|
96
|
|
|
|
|
|
|
=item Cram files |
97
|
|
|
|
|
|
|
|
98
|
|
|
|
|
|
|
Cram files (F<.cram>) are similar to Bam files, but are much smaller due to |
99
|
|
|
|
|
|
|
only storing sequence differences for each alignment. As such, they require |
100
|
|
|
|
|
|
|
an indexed, reference fasta to regenerate the complete alignment. Cram files |
101
|
|
|
|
|
|
|
are only supported by the L adapter. |
102
|
|
|
|
|
|
|
|
103
|
|
|
|
|
|
|
=item USeq files |
104
|
|
|
|
|
|
|
|
105
|
|
|
|
|
|
|
USeq files (F<.useq>) are compressed, binary, indexed files that support |
106
|
|
|
|
|
|
|
multiple information types, including region intervals, BED type |
107
|
|
|
|
|
|
|
annotations, or wig type scores distributed across the genome. They |
108
|
|
|
|
|
|
|
support rapid, random access across the genome and are comparable to |
109
|
|
|
|
|
|
|
both BigWig and BigBed files. See L |
110
|
|
|
|
|
|
|
for more information. Files may only be local. These files are supported |
111
|
|
|
|
|
|
|
by the L adapter. |
112
|
|
|
|
|
|
|
|
113
|
|
|
|
|
|
|
=item Fasta files |
114
|
|
|
|
|
|
|
|
115
|
|
|
|
|
|
|
Fasta files (F<.fa> or F<.fasta>) may be opened. Fasta files are indexed to |
116
|
|
|
|
|
|
|
rapidly and randomly fetch genomic sequence. Three different adapters are |
117
|
|
|
|
|
|
|
available for indexing the fasta file. |
118
|
|
|
|
|
|
|
|
119
|
|
|
|
|
|
|
The L adapter is preferentially used, and requires a |
120
|
|
|
|
|
|
|
F<.fai> index file. |
121
|
|
|
|
|
|
|
|
122
|
|
|
|
|
|
|
The L adapter may alternatively used, and also requires a |
123
|
|
|
|
|
|
|
F<.fai> index file. |
124
|
|
|
|
|
|
|
|
125
|
|
|
|
|
|
|
The older style L adapter is much slower, but will index |
126
|
|
|
|
|
|
|
either a single genomic fasta file or a directory of individual chromosome |
127
|
|
|
|
|
|
|
or contig fasta files. |
128
|
|
|
|
|
|
|
|
129
|
|
|
|
|
|
|
Additionally, genomic sequence stored in a L |
130
|
|
|
|
|
|
|
annotation database may also be used. |
131
|
|
|
|
|
|
|
|
132
|
|
|
|
|
|
|
=back |
133
|
|
|
|
|
|
|
|
134
|
|
|
|
|
|
|
While the functions within this library may appear to be simply a rehashing of |
135
|
|
|
|
|
|
|
the methods and functions in L and other modules, they |
136
|
|
|
|
|
|
|
either provide a simpler function to often used database methodologies or are |
137
|
|
|
|
|
|
|
designed to work intimately with the BioToolBox data format file and data structures |
138
|
|
|
|
|
|
|
(see L). One key advantage to these functions is the ability |
139
|
|
|
|
|
|
|
to work with datasets that are stranded (transcriptome data, for example). |
140
|
|
|
|
|
|
|
|
141
|
|
|
|
|
|
|
Historically, this module was initially written to use L for |
142
|
|
|
|
|
|
|
database usage. It has since been re-written to use L |
143
|
|
|
|
|
|
|
database schema. Additional functionality has also been added for other databases |
144
|
|
|
|
|
|
|
and file formats. |
145
|
|
|
|
|
|
|
|
146
|
|
|
|
|
|
|
Complete usage and examples for the functions are provided below. |
147
|
|
|
|
|
|
|
|
148
|
|
|
|
|
|
|
=head1 USAGE |
149
|
|
|
|
|
|
|
|
150
|
|
|
|
|
|
|
Call the module at the beginning of your perl script and include the module |
151
|
|
|
|
|
|
|
names to export. None are exported by default. |
152
|
|
|
|
|
|
|
|
153
|
|
|
|
|
|
|
use Bio::ToolBox::db_helper qw( |
154
|
|
|
|
|
|
|
get_new_feature_list |
155
|
|
|
|
|
|
|
get_db_feature |
156
|
|
|
|
|
|
|
); |
157
|
|
|
|
|
|
|
|
158
|
|
|
|
|
|
|
This will export the indicated subroutine names into the current namespace. |
159
|
|
|
|
|
|
|
Their usage is detailed below. |
160
|
|
|
|
|
|
|
|
161
|
|
|
|
|
|
|
=head1 EXPORTED SUBROUTINES |
162
|
|
|
|
|
|
|
|
163
|
|
|
|
|
|
|
=over 4 |
164
|
|
|
|
|
|
|
|
165
|
|
|
|
|
|
|
=item C<$BAM_ADAPTER> variable |
166
|
|
|
|
|
|
|
|
167
|
|
|
|
|
|
|
=item C<$BIG_ADAPTER> variable |
168
|
|
|
|
|
|
|
|
169
|
|
|
|
|
|
|
These variables control which Bam and bigWig/bigBed adapters are used |
170
|
|
|
|
|
|
|
during execution in installations where both adapters are present. These |
171
|
|
|
|
|
|
|
variables are not set initially; adapters are chosen automatically when an |
172
|
|
|
|
|
|
|
adapter is requested during execution and modules are evaluated for loading. |
173
|
|
|
|
|
|
|
|
174
|
|
|
|
|
|
|
=item use_bam_adapter |
175
|
|
|
|
|
|
|
|
176
|
|
|
|
|
|
|
=item use_big_adapter |
177
|
|
|
|
|
|
|
|
178
|
|
|
|
|
|
|
These are convenience methods for explicitly setting and retrieving the |
179
|
|
|
|
|
|
|
C<$BAM_ADAPTER> and C<$BIG_ADAPTER> variables, respectively. |
180
|
|
|
|
|
|
|
|
181
|
|
|
|
|
|
|
Optionally pass the value to set. It will always return the value being used. |
182
|
|
|
|
|
|
|
Values include the following: |
183
|
|
|
|
|
|
|
|
184
|
|
|
|
|
|
|
=over 4 |
185
|
|
|
|
|
|
|
|
186
|
|
|
|
|
|
|
=item Bam adapter |
187
|
|
|
|
|
|
|
|
188
|
|
|
|
|
|
|
* hts Bio::DB::HTS (default if available) |
189
|
|
|
|
|
|
|
* sam Bio::DB::Sam |
190
|
|
|
|
|
|
|
* none no adapters allowed |
191
|
|
|
|
|
|
|
|
192
|
|
|
|
|
|
|
=item Big adapter |
193
|
|
|
|
|
|
|
|
194
|
|
|
|
|
|
|
* big Bio::DB::Big (default if available) |
195
|
|
|
|
|
|
|
* ucsc Bio::DB::BigWig and Bio::DB::BigBed |
196
|
|
|
|
|
|
|
* none no adapters allowed |
197
|
|
|
|
|
|
|
|
198
|
|
|
|
|
|
|
=back |
199
|
|
|
|
|
|
|
|
200
|
|
|
|
|
|
|
=item open_db_connection |
201
|
|
|
|
|
|
|
|
202
|
|
|
|
|
|
|
my $db_name = 'cerevisiae'; |
203
|
|
|
|
|
|
|
my $db = open_db_connection($db_name); |
204
|
|
|
|
|
|
|
|
205
|
|
|
|
|
|
|
my $file = '/path/to/file.bam'; |
206
|
|
|
|
|
|
|
my $db = open_db_connection($file); |
207
|
|
|
|
|
|
|
|
208
|
|
|
|
|
|
|
# within a forked child process |
209
|
|
|
|
|
|
|
# reusing the same variable to simplify code |
210
|
|
|
|
|
|
|
# pass second true value to avoid cache |
211
|
|
|
|
|
|
|
$db = open_db_connection($file, 1); |
212
|
|
|
|
|
|
|
|
213
|
|
|
|
|
|
|
This is a simplified, generalized method that will open a connection to a |
214
|
|
|
|
|
|
|
database or indexed file using any one of a number of different |
215
|
|
|
|
|
|
|
BioPerl-style adapters. For local and remote files, the appropriate |
216
|
|
|
|
|
|
|
adapter is determined by the file extension. Relational database |
217
|
|
|
|
|
|
|
connection and authentication information is checked in a configuration |
218
|
|
|
|
|
|
|
file. Once identified, the appropriate Perl module adapter is loaded |
219
|
|
|
|
|
|
|
during run time automatically, saving the user from knowing in advance |
220
|
|
|
|
|
|
|
to C |
221
|
|
|
|
|
|
|
|
222
|
|
|
|
|
|
|
Pass the name of a relational database or the path of the database file to |
223
|
|
|
|
|
|
|
the subroutine. The opened database object is returned. If it fails, then |
224
|
|
|
|
|
|
|
an error message should be generated and nothing is returned. |
225
|
|
|
|
|
|
|
|
226
|
|
|
|
|
|
|
B If you are forking your perl process, B re-open your |
227
|
|
|
|
|
|
|
database objects in the child process, and pass a second true value |
228
|
|
|
|
|
|
|
to avoid using the cached database object. By default, opened databases are |
229
|
|
|
|
|
|
|
cached to improve efficiency, but this will be disastrous when crossing forks. |
230
|
|
|
|
|
|
|
|
231
|
|
|
|
|
|
|
=over 4 |
232
|
|
|
|
|
|
|
|
233
|
|
|
|
|
|
|
=item SeqFeature Store database |
234
|
|
|
|
|
|
|
|
235
|
|
|
|
|
|
|
Provide the name of the relational database to open. Parameters for |
236
|
|
|
|
|
|
|
connecting to a relational database are stored in the BioToolBox |
237
|
|
|
|
|
|
|
configuration file, F<.biotoolbox.cfg>. These include database adaptors, |
238
|
|
|
|
|
|
|
user name, password, etc. Further information regarding the configuration |
239
|
|
|
|
|
|
|
file may be found in L. |
240
|
|
|
|
|
|
|
|
241
|
|
|
|
|
|
|
For SQLite databases, provide the path to the F<.sqlite> or F<.db> file. |
242
|
|
|
|
|
|
|
|
243
|
|
|
|
|
|
|
For in-memory databases, provide the path to the F<.gff> or F<.gff3> file. |
244
|
|
|
|
|
|
|
The larger the file, the more memory and processing time is consumed as |
245
|
|
|
|
|
|
|
the file is parsed into memory. This may be fine for small model organisms |
246
|
|
|
|
|
|
|
like yeast, but not for vertebrate annotation. |
247
|
|
|
|
|
|
|
|
248
|
|
|
|
|
|
|
=item Bam file database |
249
|
|
|
|
|
|
|
|
250
|
|
|
|
|
|
|
Provide the path to the F<.bam> file. This may be a local file, or a remote |
251
|
|
|
|
|
|
|
URL (generally supported by the adapter). If a local F<.bam.bai> index file |
252
|
|
|
|
|
|
|
is not present, it will automatically be built prior to opening; this may |
253
|
|
|
|
|
|
|
fail if the bam file is not sorted. Remote files are not indexed. |
254
|
|
|
|
|
|
|
|
255
|
|
|
|
|
|
|
=item Cram file database |
256
|
|
|
|
|
|
|
|
257
|
|
|
|
|
|
|
Provide the path to the local F<.cram> file. Currently, supplementary fasta |
258
|
|
|
|
|
|
|
files are not supported, so the Cram file header must point to a valid |
259
|
|
|
|
|
|
|
reference. Cram files will be indexed automatically if an index is not available. |
260
|
|
|
|
|
|
|
|
261
|
|
|
|
|
|
|
=item BigWig file database |
262
|
|
|
|
|
|
|
|
263
|
|
|
|
|
|
|
Provide the path to a local F<.bw> or F<.bigwig> file, or URL to a |
264
|
|
|
|
|
|
|
remote file. |
265
|
|
|
|
|
|
|
|
266
|
|
|
|
|
|
|
=item BigWigSet database |
267
|
|
|
|
|
|
|
|
268
|
|
|
|
|
|
|
A local or remote directory of one or more BigWig files that can treated |
269
|
|
|
|
|
|
|
collectively as a database. A special text file may be included in the |
270
|
|
|
|
|
|
|
directory to assign metadata to each BigWig file, including attributes such |
271
|
|
|
|
|
|
|
as type, source, display name, strand, etc. See L for |
272
|
|
|
|
|
|
|
more information on the formatting of the metadata file. If a metadata |
273
|
|
|
|
|
|
|
file is not included (it's not necessary), then the filenames are used |
274
|
|
|
|
|
|
|
as the name and type. Strand may be parsed from the filenames if the |
275
|
|
|
|
|
|
|
basename ends in F<.f> and F<.r>. |
276
|
|
|
|
|
|
|
|
277
|
|
|
|
|
|
|
=item BigBed database |
278
|
|
|
|
|
|
|
|
279
|
|
|
|
|
|
|
Provide the path to a local F<.bb> or F<.bigbed> file, or URL to a |
280
|
|
|
|
|
|
|
remote file. |
281
|
|
|
|
|
|
|
|
282
|
|
|
|
|
|
|
=item USeq database |
283
|
|
|
|
|
|
|
|
284
|
|
|
|
|
|
|
Provide the path to a local F<.bb> or F<.bigbed> file, or URL to a |
285
|
|
|
|
|
|
|
remote file. |
286
|
|
|
|
|
|
|
|
287
|
|
|
|
|
|
|
=item Fasta database |
288
|
|
|
|
|
|
|
|
289
|
|
|
|
|
|
|
Provide the path to a local fasta file for rapidly fetching genomic |
290
|
|
|
|
|
|
|
sequence. If the fasta file is not indexed, then it will be indexed |
291
|
|
|
|
|
|
|
automatically upon opening. |
292
|
|
|
|
|
|
|
|
293
|
|
|
|
|
|
|
=back |
294
|
|
|
|
|
|
|
|
295
|
|
|
|
|
|
|
=item get_db_name |
296
|
|
|
|
|
|
|
|
297
|
|
|
|
|
|
|
my $name = get_db_name($db); |
298
|
|
|
|
|
|
|
|
299
|
|
|
|
|
|
|
This method will attempt to get the name of an opened Database |
300
|
|
|
|
|
|
|
object if, for some reason, it's unknown, e.g. the user only provided an |
301
|
|
|
|
|
|
|
opened db object. This only works for some databases, at least those I've |
302
|
|
|
|
|
|
|
bothered to track down and find a usable API call to use. |
303
|
|
|
|
|
|
|
|
304
|
|
|
|
|
|
|
=item get_dataset_list |
305
|
|
|
|
|
|
|
|
306
|
|
|
|
|
|
|
my $db_name = 'cerevisiae'; |
307
|
|
|
|
|
|
|
my @types = get_dataset_list($db_name); |
308
|
|
|
|
|
|
|
|
309
|
|
|
|
|
|
|
This method will retrieve a list of the available features stored in the |
310
|
|
|
|
|
|
|
database and returns an array of the features' types. |
311
|
|
|
|
|
|
|
|
312
|
|
|
|
|
|
|
Pass either the name of the database or an established database object. |
313
|
|
|
|
|
|
|
Supported databases include both L and |
314
|
|
|
|
|
|
|
L databases. |
315
|
|
|
|
|
|
|
|
316
|
|
|
|
|
|
|
For L databases, the type is represented as |
317
|
|
|
|
|
|
|
"C", corresponding to the third and second GFF columns, |
318
|
|
|
|
|
|
|
respectively. The types are sorted alphabetically first by source, |
319
|
|
|
|
|
|
|
then by method. |
320
|
|
|
|
|
|
|
|
321
|
|
|
|
|
|
|
For L databases, the C, C, C, |
322
|
|
|
|
|
|
|
or C attribute may be used, in that respective order of |
323
|
|
|
|
|
|
|
availability. The list is sorted alphabetically. |
324
|
|
|
|
|
|
|
|
325
|
|
|
|
|
|
|
=item verify_or_request_feature_types |
326
|
|
|
|
|
|
|
|
327
|
|
|
|
|
|
|
# verify a dataset, either file or database type |
328
|
|
|
|
|
|
|
my $db_name = 'cerevisiae'; |
329
|
|
|
|
|
|
|
my $dataset = 'microaray_data'; |
330
|
|
|
|
|
|
|
my @features = verify_or_request_feature_types( |
331
|
|
|
|
|
|
|
db => $db_name, |
332
|
|
|
|
|
|
|
feature => $dataset, |
333
|
|
|
|
|
|
|
); |
334
|
|
|
|
|
|
|
unless (@features) { |
335
|
|
|
|
|
|
|
die " no valid feature provided!\n"; |
336
|
|
|
|
|
|
|
} |
337
|
|
|
|
|
|
|
|
338
|
|
|
|
|
|
|
# select a list of features |
339
|
|
|
|
|
|
|
my $db_name = 'cerevisiae'; |
340
|
|
|
|
|
|
|
my @types = (); # user not provided |
341
|
|
|
|
|
|
|
@types = verify_or_request_feature_types( |
342
|
|
|
|
|
|
|
db => $db_name, |
343
|
|
|
|
|
|
|
feature => $types, |
344
|
|
|
|
|
|
|
); |
345
|
|
|
|
|
|
|
# user will be promoted to select from database list |
346
|
|
|
|
|
|
|
if (@types) { |
347
|
|
|
|
|
|
|
print "using types " . join(", ", @types); |
348
|
|
|
|
|
|
|
} |
349
|
|
|
|
|
|
|
else { |
350
|
|
|
|
|
|
|
die " no valid features selected!\n"; |
351
|
|
|
|
|
|
|
} |
352
|
|
|
|
|
|
|
|
353
|
|
|
|
|
|
|
This subroutine will process a list of feature types or data sources to be |
354
|
|
|
|
|
|
|
used for data or feature collection. There are two modes of action. If a |
355
|
|
|
|
|
|
|
list was provided, it will process and verify the list. If no list was |
356
|
|
|
|
|
|
|
provided, then the user will be prompted to select from a list of available |
357
|
|
|
|
|
|
|
feature types in the provided database. |
358
|
|
|
|
|
|
|
|
359
|
|
|
|
|
|
|
The provided list may be a list of feature types or a list of single |
360
|
|
|
|
|
|
|
file data sources, including Bam, bigWig, or bigBed files. If the list |
361
|
|
|
|
|
|
|
includes feature types, they are verified as existing in the provided |
362
|
|
|
|
|
|
|
database. If the list includes file names, the local files are checked |
363
|
|
|
|
|
|
|
and the names prefixed with "C" for use downstream. |
364
|
|
|
|
|
|
|
|
365
|
|
|
|
|
|
|
If no list was provided, then a list of available feature types in the |
366
|
|
|
|
|
|
|
provided database will be presented to the user, and the user prompted |
367
|
|
|
|
|
|
|
to make a selection. One or more types may be selected, and a single |
368
|
|
|
|
|
|
|
item may be enforced if requested. The response is filtered through |
369
|
|
|
|
|
|
|
the parse_list method from L, so a mix of single |
370
|
|
|
|
|
|
|
numbers or a range of numbers may be accepted. The responses are then |
371
|
|
|
|
|
|
|
validated. |
372
|
|
|
|
|
|
|
|
373
|
|
|
|
|
|
|
Two types of databases are supported: L and |
374
|
|
|
|
|
|
|
L databases. For SeqFeature Store databases, the |
375
|
|
|
|
|
|
|
type is comprised of "C", corresponding to the third and |
376
|
|
|
|
|
|
|
second GFF columns. For BigWigSet databases, types correspond to either |
377
|
|
|
|
|
|
|
the C, C, C, or C attributes, |
378
|
|
|
|
|
|
|
in that order of availability. |
379
|
|
|
|
|
|
|
|
380
|
|
|
|
|
|
|
For feature types or files that pass validation, they are returned as a |
381
|
|
|
|
|
|
|
list. Types of files that do not pass validation are printed to C. |
382
|
|
|
|
|
|
|
|
383
|
|
|
|
|
|
|
To use this subroutine, pass an array with the following keys |
384
|
|
|
|
|
|
|
and values. Not every key is required. |
385
|
|
|
|
|
|
|
|
386
|
|
|
|
|
|
|
=over 4 |
387
|
|
|
|
|
|
|
|
388
|
|
|
|
|
|
|
=item db |
389
|
|
|
|
|
|
|
|
390
|
|
|
|
|
|
|
The name of the database or a reference to an established BioPerl |
391
|
|
|
|
|
|
|
database object. A L or |
392
|
|
|
|
|
|
|
L database are typically used. Only required when |
393
|
|
|
|
|
|
|
no features are provided. |
394
|
|
|
|
|
|
|
|
395
|
|
|
|
|
|
|
=item feature |
396
|
|
|
|
|
|
|
|
397
|
|
|
|
|
|
|
Pass either a single dataset name as a scalar or an anonymous array |
398
|
|
|
|
|
|
|
reference of a list of dataset names. These may have been provided as |
399
|
|
|
|
|
|
|
a command line option and need to be verified. If nothing is passed, |
400
|
|
|
|
|
|
|
then a list of possible datasets will be presented to the user to be |
401
|
|
|
|
|
|
|
chosen. |
402
|
|
|
|
|
|
|
|
403
|
|
|
|
|
|
|
=item prompt |
404
|
|
|
|
|
|
|
|
405
|
|
|
|
|
|
|
Provide a phrase to be prompted to the user to help in selecting |
406
|
|
|
|
|
|
|
datasets from the list. If none is provided, a generic prompt will be |
407
|
|
|
|
|
|
|
used. |
408
|
|
|
|
|
|
|
|
409
|
|
|
|
|
|
|
=item single |
410
|
|
|
|
|
|
|
|
411
|
|
|
|
|
|
|
A Boolean value (1 or 0) indicating whether only a single dataset is |
412
|
|
|
|
|
|
|
allowed when selecting datasets from a presented list. If true, only |
413
|
|
|
|
|
|
|
one dataset choice is accepted. If false, one or more dataset choices |
414
|
|
|
|
|
|
|
are allowed. Default is false. |
415
|
|
|
|
|
|
|
|
416
|
|
|
|
|
|
|
=item limit |
417
|
|
|
|
|
|
|
|
418
|
|
|
|
|
|
|
Optionally provide a word or regular expression that matches the |
419
|
|
|
|
|
|
|
feature type (C only; C, if present, is ignored). |
420
|
|
|
|
|
|
|
For example, provide "geneEmrna" to only present gene and mRNA |
421
|
|
|
|
|
|
|
features to the user. This is only applicable when a user must select |
422
|
|
|
|
|
|
|
from a database list. The default is to list all available feature |
423
|
|
|
|
|
|
|
types. |
424
|
|
|
|
|
|
|
|
425
|
|
|
|
|
|
|
=back |
426
|
|
|
|
|
|
|
|
427
|
|
|
|
|
|
|
=item check_dataset_for_rpm_support($dataset, [$cpu]) |
428
|
|
|
|
|
|
|
|
429
|
|
|
|
|
|
|
# count the total number of alignments |
430
|
|
|
|
|
|
|
my $dataset = '/path/to/file.bam'; |
431
|
|
|
|
|
|
|
my $total_count = check_dataset_for_rpm_support($dataset); |
432
|
|
|
|
|
|
|
|
433
|
|
|
|
|
|
|
# use multithreading |
434
|
|
|
|
|
|
|
my $total_count = check_dataset_for_rpm_support($dataset, $cpu); |
435
|
|
|
|
|
|
|
|
436
|
|
|
|
|
|
|
This subroutine will check a dataset for RPM, or Reads Per Million mapped, |
437
|
|
|
|
|
|
|
support. Only two types of database files support this, Bam files and |
438
|
|
|
|
|
|
|
BigBed files. If the dataset is either one of these, or the name of a |
439
|
|
|
|
|
|
|
database feature which points to one of these files, then it will |
440
|
|
|
|
|
|
|
calculate the total number of mapped alignments (Bam file) or features |
441
|
|
|
|
|
|
|
(BigBed file). It will return this total number. If the dataset does |
442
|
|
|
|
|
|
|
not support RPM (because it is not a Bam or BigBed file, for example), |
443
|
|
|
|
|
|
|
then it will return undefined. |
444
|
|
|
|
|
|
|
|
445
|
|
|
|
|
|
|
Pass this subroutine one or two values. The first is the name of the |
446
|
|
|
|
|
|
|
dataset. Ideally it should be validated using verify_or_request_feature_types() |
447
|
|
|
|
|
|
|
and have an appropriate prefix (file, http, or ftp). |
448
|
|
|
|
|
|
|
|
449
|
|
|
|
|
|
|
For multi-threaded execution pass a second value, the number of CPU cores |
450
|
|
|
|
|
|
|
available to count Bam files. This will speed up counting bam files |
451
|
|
|
|
|
|
|
considerably. The default is 2 for environments where L |
452
|
|
|
|
|
|
|
is installed, or 1 where it is not. |
453
|
|
|
|
|
|
|
|
454
|
|
|
|
|
|
|
=item get_new_feature_list |
455
|
|
|
|
|
|
|
|
456
|
|
|
|
|
|
|
my $Data = Bio::ToolBox::Data->new; |
457
|
|
|
|
|
|
|
my $db_name = 'cerevisiae'; |
458
|
|
|
|
|
|
|
my %data = get_new_feature_list( |
459
|
|
|
|
|
|
|
data => $Data, |
460
|
|
|
|
|
|
|
db => $db_name, |
461
|
|
|
|
|
|
|
features => 'genes', |
462
|
|
|
|
|
|
|
); |
463
|
|
|
|
|
|
|
|
464
|
|
|
|
|
|
|
This subroutine will generate a new feature list collected from an annotation |
465
|
|
|
|
|
|
|
database, such as a L database. It will populate |
466
|
|
|
|
|
|
|
an empty L object data table with a list of the collected |
467
|
|
|
|
|
|
|
features. Columns include the Primary ID, Name, and feature Type. |
468
|
|
|
|
|
|
|
|
469
|
|
|
|
|
|
|
The subroutine is passed an array containing three required arguments. The keys |
470
|
|
|
|
|
|
|
include |
471
|
|
|
|
|
|
|
|
472
|
|
|
|
|
|
|
=over 4 |
473
|
|
|
|
|
|
|
|
474
|
|
|
|
|
|
|
=item data |
475
|
|
|
|
|
|
|
|
476
|
|
|
|
|
|
|
An empty L object. This is required. |
477
|
|
|
|
|
|
|
|
478
|
|
|
|
|
|
|
=item db |
479
|
|
|
|
|
|
|
|
480
|
|
|
|
|
|
|
The name of the database or a reference to an established database object. |
481
|
|
|
|
|
|
|
Currently, this must be a L database. |
482
|
|
|
|
|
|
|
This is required. |
483
|
|
|
|
|
|
|
|
484
|
|
|
|
|
|
|
=item features |
485
|
|
|
|
|
|
|
|
486
|
|
|
|
|
|
|
A scalar value containing a name representing the feature type(s) of |
487
|
|
|
|
|
|
|
feature(s) to collect. The type may be a C or C. |
488
|
|
|
|
|
|
|
It may be a single element or a comma-delimited list. This is required. |
489
|
|
|
|
|
|
|
|
490
|
|
|
|
|
|
|
=item chrskip |
491
|
|
|
|
|
|
|
|
492
|
|
|
|
|
|
|
Provide a regular-expression compatible string for skipping features from |
493
|
|
|
|
|
|
|
specific chromosomes, for example mitochondrial or unmapped contigs. |
494
|
|
|
|
|
|
|
|
495
|
|
|
|
|
|
|
=back |
496
|
|
|
|
|
|
|
|
497
|
|
|
|
|
|
|
Status messages, including the number of features found and kept, are always |
498
|
|
|
|
|
|
|
printed to C. |
499
|
|
|
|
|
|
|
|
500
|
|
|
|
|
|
|
=item get_new_genome_list |
501
|
|
|
|
|
|
|
|
502
|
|
|
|
|
|
|
my $Data = Bio::ToolBox::Data->new(); |
503
|
|
|
|
|
|
|
my $db_name = 'cerevisiae'; |
504
|
|
|
|
|
|
|
my $window_size = 500; |
505
|
|
|
|
|
|
|
my $step_size = 250; |
506
|
|
|
|
|
|
|
my %data = get_new_genome_list( |
507
|
|
|
|
|
|
|
data => $Data, |
508
|
|
|
|
|
|
|
db => $db_name, |
509
|
|
|
|
|
|
|
win => $window_size, |
510
|
|
|
|
|
|
|
step => $step_size, |
511
|
|
|
|
|
|
|
); |
512
|
|
|
|
|
|
|
|
513
|
|
|
|
|
|
|
This subroutine will generate a new list of genomic windows or intervals. |
514
|
|
|
|
|
|
|
The genome is split into intervals of a specified size with the specified |
515
|
|
|
|
|
|
|
step size. |
516
|
|
|
|
|
|
|
|
517
|
|
|
|
|
|
|
The subroutine is passed an array containing the required arguments. |
518
|
|
|
|
|
|
|
|
519
|
|
|
|
|
|
|
=over 4 |
520
|
|
|
|
|
|
|
|
521
|
|
|
|
|
|
|
=item data |
522
|
|
|
|
|
|
|
|
523
|
|
|
|
|
|
|
An empty L object. This is required. |
524
|
|
|
|
|
|
|
|
525
|
|
|
|
|
|
|
=item db |
526
|
|
|
|
|
|
|
|
527
|
|
|
|
|
|
|
The name of the database or a reference to an established database |
528
|
|
|
|
|
|
|
object. Any BioPerl adapter, including those described in the L, |
529
|
|
|
|
|
|
|
are supported. Required. |
530
|
|
|
|
|
|
|
|
531
|
|
|
|
|
|
|
=item win |
532
|
|
|
|
|
|
|
|
533
|
|
|
|
|
|
|
A scalar value containing an integer representing the size of the |
534
|
|
|
|
|
|
|
window in basepairs. The default value is 500 bp. |
535
|
|
|
|
|
|
|
|
536
|
|
|
|
|
|
|
=item step |
537
|
|
|
|
|
|
|
|
538
|
|
|
|
|
|
|
A scalar value containing an integer representing the step size for |
539
|
|
|
|
|
|
|
advancing the window across the genome. The default is the window size. |
540
|
|
|
|
|
|
|
|
541
|
|
|
|
|
|
|
=item chrskip |
542
|
|
|
|
|
|
|
|
543
|
|
|
|
|
|
|
=item skip |
544
|
|
|
|
|
|
|
|
545
|
|
|
|
|
|
|
=item exclude |
546
|
|
|
|
|
|
|
|
547
|
|
|
|
|
|
|
Provide a regular expression compatible string for excluding specific |
548
|
|
|
|
|
|
|
chromosomes or classes of chromosomes, such as the mitochondrial or |
549
|
|
|
|
|
|
|
unmapped contigs. |
550
|
|
|
|
|
|
|
|
551
|
|
|
|
|
|
|
=back |
552
|
|
|
|
|
|
|
|
553
|
|
|
|
|
|
|
Status messages are printed to C. |
554
|
|
|
|
|
|
|
|
555
|
|
|
|
|
|
|
=item get_db_feature |
556
|
|
|
|
|
|
|
|
557
|
|
|
|
|
|
|
my $db = open_db_connection('annotation.db'); |
558
|
|
|
|
|
|
|
my $seqfeature = get_db_feature( |
559
|
|
|
|
|
|
|
db => $db, |
560
|
|
|
|
|
|
|
type => 'gene', |
561
|
|
|
|
|
|
|
name => 'ABC1', |
562
|
|
|
|
|
|
|
); |
563
|
|
|
|
|
|
|
|
564
|
|
|
|
|
|
|
This subroutine will retrieve a specific feature from a L |
565
|
|
|
|
|
|
|
database for subsequent analysis, manipulation, and/or score retrieval using the |
566
|
|
|
|
|
|
|
L methods. It relies upon unique information to pull out a |
567
|
|
|
|
|
|
|
single, unique feature. |
568
|
|
|
|
|
|
|
|
569
|
|
|
|
|
|
|
Several attributes may be used to pull out the feature, including the feature's |
570
|
|
|
|
|
|
|
unique database primary ID, name andEor aliases, and GFF type (C |
571
|
|
|
|
|
|
|
andEor C |
572
|
|
|
|
|
|
|
of features with their unique identifiers. |
573
|
|
|
|
|
|
|
|
574
|
|
|
|
|
|
|
The C attribute is preferentially used as it provides the best |
575
|
|
|
|
|
|
|
performance. However, it is not always portable between databases or even a |
576
|
|
|
|
|
|
|
database re-load. In that case, the C and C are used to identify |
577
|
|
|
|
|
|
|
potential features. Note that the C may not always be unique in the |
578
|
|
|
|
|
|
|
database. In this case, the addition of aliases may help. If all else fails, a new |
579
|
|
|
|
|
|
|
feature list should be generated. |
580
|
|
|
|
|
|
|
|
581
|
|
|
|
|
|
|
To get a feature, pass an array of arguments. |
582
|
|
|
|
|
|
|
|
583
|
|
|
|
|
|
|
=over 4 |
584
|
|
|
|
|
|
|
|
585
|
|
|
|
|
|
|
=item db |
586
|
|
|
|
|
|
|
|
587
|
|
|
|
|
|
|
The name of the L database or a reference |
588
|
|
|
|
|
|
|
to an established database object. |
589
|
|
|
|
|
|
|
|
590
|
|
|
|
|
|
|
=item id |
591
|
|
|
|
|
|
|
|
592
|
|
|
|
|
|
|
Provide the C tag. In the L |
593
|
|
|
|
|
|
|
database schema this is a (usually) non-portable, unique identifier |
594
|
|
|
|
|
|
|
specific to a database. It provides the fastest lookup. |
595
|
|
|
|
|
|
|
|
596
|
|
|
|
|
|
|
=item name |
597
|
|
|
|
|
|
|
|
598
|
|
|
|
|
|
|
A scalar value representing the feature C. Aliases may be |
599
|
|
|
|
|
|
|
appended with semicolon delimiters. |
600
|
|
|
|
|
|
|
|
601
|
|
|
|
|
|
|
=item type |
602
|
|
|
|
|
|
|
|
603
|
|
|
|
|
|
|
Provide the feature type, which is typically expressed as |
604
|
|
|
|
|
|
|
C. Alternatively, provide just the C only. |
605
|
|
|
|
|
|
|
|
606
|
|
|
|
|
|
|
=back |
607
|
|
|
|
|
|
|
|
608
|
|
|
|
|
|
|
While it is possible to identify features with any two attributes |
609
|
|
|
|
|
|
|
(or possibly just name or ID), the best performance is obtained with |
610
|
|
|
|
|
|
|
all three together. The first SeqFeature object is returned if multiple |
611
|
|
|
|
|
|
|
are found. |
612
|
|
|
|
|
|
|
|
613
|
|
|
|
|
|
|
=item get_segment_score |
614
|
|
|
|
|
|
|
|
615
|
|
|
|
|
|
|
my $score = get_segment_score( |
616
|
|
|
|
|
|
|
$seqfeature->seq_id, # chromosome |
617
|
|
|
|
|
|
|
$seqfeature->start, # start |
618
|
|
|
|
|
|
|
$seqfeature->end, # end |
619
|
|
|
|
|
|
|
$seqfeature->strand, # strand |
620
|
|
|
|
|
|
|
'sense', # strandedness |
621
|
|
|
|
|
|
|
'mean', # method |
622
|
|
|
|
|
|
|
0, # return type |
623
|
|
|
|
|
|
|
$db, # database |
624
|
|
|
|
|
|
|
'scores.bw' # datasets |
625
|
|
|
|
|
|
|
); |
626
|
|
|
|
|
|
|
|
627
|
|
|
|
|
|
|
Generic method for collecting score(s) from a database. This is the |
628
|
|
|
|
|
|
|
primary interface to all of the database handling packages, including |
629
|
|
|
|
|
|
|
Bam, bigWig, bigBed, and USeq files, as well as SeqFeature databases. |
630
|
|
|
|
|
|
|
By passing common parameters here, the appropriate database or file |
631
|
|
|
|
|
|
|
handling packages are loaded during run time and the appropriate |
632
|
|
|
|
|
|
|
methods are called. Essentially, this is where the magic happens. |
633
|
|
|
|
|
|
|
|
634
|
|
|
|
|
|
|
Pass the method an array of nine (or more) parameters. The parameters |
635
|
|
|
|
|
|
|
are not keyed as a hash, and the order is explicit. However, see the |
636
|
|
|
|
|
|
|
L module for named index positions. |
637
|
|
|
|
|
|
|
The order is as follows. |
638
|
|
|
|
|
|
|
|
639
|
|
|
|
|
|
|
=over 4 |
640
|
|
|
|
|
|
|
|
641
|
|
|
|
|
|
|
=item * 0 Chromosome |
642
|
|
|
|
|
|
|
|
643
|
|
|
|
|
|
|
String representing chromosome or sequence ID. |
644
|
|
|
|
|
|
|
|
645
|
|
|
|
|
|
|
=item * 1 Start |
646
|
|
|
|
|
|
|
|
647
|
|
|
|
|
|
|
Start coordinate (integer, 1-based) |
648
|
|
|
|
|
|
|
|
649
|
|
|
|
|
|
|
=item * 2 Stop |
650
|
|
|
|
|
|
|
|
651
|
|
|
|
|
|
|
Stop or end coordinate (integer, 1-based) |
652
|
|
|
|
|
|
|
|
653
|
|
|
|
|
|
|
=item * 3 Strand |
654
|
|
|
|
|
|
|
|
655
|
|
|
|
|
|
|
BioPerl style strand value, including -1, 0, and 1. |
656
|
|
|
|
|
|
|
|
657
|
|
|
|
|
|
|
=item * 4 Strandedness |
658
|
|
|
|
|
|
|
|
659
|
|
|
|
|
|
|
A string indicating strandedness: sense, antisense, all. |
660
|
|
|
|
|
|
|
|
661
|
|
|
|
|
|
|
=item * 5 Method |
662
|
|
|
|
|
|
|
|
663
|
|
|
|
|
|
|
A string representing the method of data collection. |
664
|
|
|
|
|
|
|
|
665
|
|
|
|
|
|
|
Current acceptable methods include mean, median, sum, min, max, |
666
|
|
|
|
|
|
|
stddev, count, ncount (named count), and pcount (precise count). |
667
|
|
|
|
|
|
|
See the individual db_helper sub classes for more details. Not all |
668
|
|
|
|
|
|
|
adapters support all methods. |
669
|
|
|
|
|
|
|
|
670
|
|
|
|
|
|
|
=item * 6 Return type |
671
|
|
|
|
|
|
|
|
672
|
|
|
|
|
|
|
An integer (0, 1, 2) representing the type of value to return. |
673
|
|
|
|
|
|
|
|
674
|
|
|
|
|
|
|
A return type of 0 indicates a single score should be returned, |
675
|
|
|
|
|
|
|
calculated using the specified method. A return type of 1 is an |
676
|
|
|
|
|
|
|
array or array reference of all scores found. A return type of 2 |
677
|
|
|
|
|
|
|
is a hash or hash reference of coordinate => score. |
678
|
|
|
|
|
|
|
|
679
|
|
|
|
|
|
|
=item * 7 Dataset database |
680
|
|
|
|
|
|
|
|
681
|
|
|
|
|
|
|
This is either a L or a L |
682
|
|
|
|
|
|
|
database containing multiple score dataset features. If your dataset |
683
|
|
|
|
|
|
|
does not contain multiple different feature types, pass C. |
684
|
|
|
|
|
|
|
|
685
|
|
|
|
|
|
|
Pass either a string containing the name of database (which will be opened) |
686
|
|
|
|
|
|
|
or an already opened database object. |
687
|
|
|
|
|
|
|
|
688
|
|
|
|
|
|
|
=item * 8 Dataset |
689
|
|
|
|
|
|
|
|
690
|
|
|
|
|
|
|
This is the verified dataset from which the scores will be passed. The |
691
|
|
|
|
|
|
|
dataset should be verified using the L |
692
|
|
|
|
|
|
|
subroutine. For indexed file datasets, e.g. F<.bam> or F<.bw> files, |
693
|
|
|
|
|
|
|
this will the verified local path or URL. For a database, this will be |
694
|
|
|
|
|
|
|
a database type, sucha as C or C. |
695
|
|
|
|
|
|
|
or the path to a data file, e.g. Bam, bigWig, bigBed, or USeq file. |
696
|
|
|
|
|
|
|
|
697
|
|
|
|
|
|
|
If multiple datasets are to be combined, simply append them to the |
698
|
|
|
|
|
|
|
array (index 9, 10, etc). |
699
|
|
|
|
|
|
|
|
700
|
|
|
|
|
|
|
=back |
701
|
|
|
|
|
|
|
|
702
|
|
|
|
|
|
|
The returned item is dependent on the value of the return type code. |
703
|
|
|
|
|
|
|
|
704
|
|
|
|
|
|
|
=item calculate_score |
705
|
|
|
|
|
|
|
|
706
|
|
|
|
|
|
|
my $scores = get_segment_score($chr,$start,$stop,$strand, |
707
|
|
|
|
|
|
|
'sense','ncount',1,undef,'file:/path/to/dataset.bam'); |
708
|
|
|
|
|
|
|
my $score = calculate_score('ncount', $scores); |
709
|
|
|
|
|
|
|
|
710
|
|
|
|
|
|
|
This subroutine will perform the math to calculate a single score |
711
|
|
|
|
|
|
|
from an array of values. Current acceptable methods include mean, |
712
|
|
|
|
|
|
|
median, sum, min, max, stddev, count, ncount, and pcount. |
713
|
|
|
|
|
|
|
|
714
|
|
|
|
|
|
|
Pass the subroutine two items: the name of the mathematical method |
715
|
|
|
|
|
|
|
enumerated above, and an array reference of the values to work on. |
716
|
|
|
|
|
|
|
A single score will be returned. |
717
|
|
|
|
|
|
|
|
718
|
|
|
|
|
|
|
=item get_chromosome_list |
719
|
|
|
|
|
|
|
|
720
|
|
|
|
|
|
|
my $db = open_db_connection('cerevisiae'); |
721
|
|
|
|
|
|
|
# get all chromosomes in the database |
722
|
|
|
|
|
|
|
my @chromosomes = get_chromosome_list($db); |
723
|
|
|
|
|
|
|
foreach (@chromosomes) { |
724
|
|
|
|
|
|
|
my $name = $_->[0]; |
725
|
|
|
|
|
|
|
my $length = $_->[1]; |
726
|
|
|
|
|
|
|
print "chromosome $name is $length bp\n"; |
727
|
|
|
|
|
|
|
} |
728
|
|
|
|
|
|
|
|
729
|
|
|
|
|
|
|
This subroutine will collect a list of chromosomes or reference sequences |
730
|
|
|
|
|
|
|
in a Bio::DB database and return the list along with their sizes in bp. |
731
|
|
|
|
|
|
|
Many BioPerl-based databases are supported, including |
732
|
|
|
|
|
|
|
L, L, L, |
733
|
|
|
|
|
|
|
L, L, L, |
734
|
|
|
|
|
|
|
L, or any others that support the C method. |
735
|
|
|
|
|
|
|
L adapters for big files are also supported, though see note |
736
|
|
|
|
|
|
|
below. See the L subroutine for more information. |
737
|
|
|
|
|
|
|
|
738
|
|
|
|
|
|
|
Pass the subroutine either the name of the database, the path to the |
739
|
|
|
|
|
|
|
database file, or an opened database object. |
740
|
|
|
|
|
|
|
|
741
|
|
|
|
|
|
|
Optionally pass a second value, a regular expression compatible string |
742
|
|
|
|
|
|
|
for skipping specific chromosomes or chromosome classes, such as mitochondrial |
743
|
|
|
|
|
|
|
or unmapped contigs. The default is to return all chromosomes. |
744
|
|
|
|
|
|
|
|
745
|
|
|
|
|
|
|
The subroutine will return an array, with each element representing each |
746
|
|
|
|
|
|
|
chromosome or reference sequence in the database. Each element is an anonymous |
747
|
|
|
|
|
|
|
array of two elements, the chromosome name and length in bp. |
748
|
|
|
|
|
|
|
|
749
|
|
|
|
|
|
|
B: L databases for bigWig and bigBed files do |
750
|
|
|
|
|
|
|
not return chromosomes in the original order as the file, and are returned |
751
|
|
|
|
|
|
|
in a sorted manner that may not be the original order. |
752
|
|
|
|
|
|
|
|
753
|
|
|
|
|
|
|
=item low_level_bam_coverage |
754
|
|
|
|
|
|
|
|
755
|
|
|
|
|
|
|
my $coverage = low_level_bam_coverage($sam, $tid, $start, $stop); |
756
|
|
|
|
|
|
|
|
757
|
|
|
|
|
|
|
This is a convenience method for running the low level bam coverage method. |
758
|
|
|
|
|
|
|
Since both L and L bam file adapters are |
759
|
|
|
|
|
|
|
supported, and each has slight variation in the API syntax, this method helps |
760
|
|
|
|
|
|
|
to abstract the actual method and use the appropriate syntax depending on |
761
|
|
|
|
|
|
|
which adapter is loaded. It is best if the C<$sam> object was opened using the |
762
|
|
|
|
|
|
|
L method, or that C<$BAM_ADAPTER> is set. |
763
|
|
|
|
|
|
|
|
764
|
|
|
|
|
|
|
B that this is the B coverage method based on the index object, |
765
|
|
|
|
|
|
|
and not the similarly named high level API method. Read the adapter |
766
|
|
|
|
|
|
|
documentation for proper usage. |
767
|
|
|
|
|
|
|
|
768
|
|
|
|
|
|
|
=item low_level_bam_fetch |
769
|
|
|
|
|
|
|
|
770
|
|
|
|
|
|
|
my $success = low_level_bam_fetch($sam, $tid, $start, $stop, $callback, $data); |
771
|
|
|
|
|
|
|
|
772
|
|
|
|
|
|
|
This is a convenience method for running the low level bam fetch method. |
773
|
|
|
|
|
|
|
Since both L and L bam file adapters are |
774
|
|
|
|
|
|
|
supported, and each has slight variation in the API syntax, this method helps |
775
|
|
|
|
|
|
|
to abstract the actual method and use the appropriate syntax depending on |
776
|
|
|
|
|
|
|
which adapter is loaded. It is best if the C<$sam> object was opened using the |
777
|
|
|
|
|
|
|
L method, or that C<$BAM_ADAPTER> is set. |
778
|
|
|
|
|
|
|
|
779
|
|
|
|
|
|
|
B that this is the B fetch method based on the index object, |
780
|
|
|
|
|
|
|
and not the similarly named high level API method. Read the adapter |
781
|
|
|
|
|
|
|
documentation for proper usage. |
782
|
|
|
|
|
|
|
|
783
|
|
|
|
|
|
|
=item get_genomic_sequence |
784
|
|
|
|
|
|
|
|
785
|
|
|
|
|
|
|
my $sequence = get_genomic_sequence($db, $chrom, $start, $stop); |
786
|
|
|
|
|
|
|
|
787
|
|
|
|
|
|
|
This is a convenience wrapper function for fetching genomic sequence from |
788
|
|
|
|
|
|
|
three different supported fasta database adapters, which, of course, all |
789
|
|
|
|
|
|
|
have different APIs, including the L, L, |
790
|
|
|
|
|
|
|
and L adapters, as well as L |
791
|
|
|
|
|
|
|
and possibly other BioPerl databases. |
792
|
|
|
|
|
|
|
|
793
|
|
|
|
|
|
|
Pass the opened database object, chromosome, start, and stop coordinates. This |
794
|
|
|
|
|
|
|
assumes BioPerl standard 1-base coordinates. Only the forward sequence is |
795
|
|
|
|
|
|
|
retrieved. The sequence is returned as a simple string. |
796
|
|
|
|
|
|
|
|
797
|
|
|
|
|
|
|
=back |
798
|
|
|
|
|
|
|
|
799
|
|
|
|
|
|
|
=head1 INTERNAL SUBROUTINES |
800
|
|
|
|
|
|
|
|
801
|
|
|
|
|
|
|
These are not intended for normal general consumption but are documented here |
802
|
|
|
|
|
|
|
so that at least some of us know what is going on. |
803
|
|
|
|
|
|
|
|
804
|
|
|
|
|
|
|
=over 4 |
805
|
|
|
|
|
|
|
|
806
|
|
|
|
|
|
|
=item _lookup_db_method |
807
|
|
|
|
|
|
|
|
808
|
|
|
|
|
|
|
Internal method for determining which database adapter and method call |
809
|
|
|
|
|
|
|
should be used given a set of parameters. This result is cached for |
810
|
|
|
|
|
|
|
future data queries (likely hundreds or thousands of repeated calls). |
811
|
|
|
|
|
|
|
The method is stored in the global hash C<%DB_METHODS>; |
812
|
|
|
|
|
|
|
|
813
|
|
|
|
|
|
|
Pass the parameter array reference from L. This |
814
|
|
|
|
|
|
|
should load the appropriate db_helper sub module during run time. |
815
|
|
|
|
|
|
|
|
816
|
|
|
|
|
|
|
=item _load_helper_module |
817
|
|
|
|
|
|
|
|
818
|
|
|
|
|
|
|
Subroutine to load and import a db_helper module during run time. |
819
|
|
|
|
|
|
|
Sets the appropriate global variable if successful. |
820
|
|
|
|
|
|
|
|
821
|
|
|
|
|
|
|
=item _load_bam_helper_module |
822
|
|
|
|
|
|
|
|
823
|
|
|
|
|
|
|
Subroutine to determine which Bam adapter db_helper module to load, |
824
|
|
|
|
|
|
|
either the older samtools-based adapter or the newer HTSlib-based adapter. |
825
|
|
|
|
|
|
|
Uses the global and exportable variable C<$BAM_ADAPTER> to set a |
826
|
|
|
|
|
|
|
preference for which adapter to use. Use 'sam' or 'hts' or some |
827
|
|
|
|
|
|
|
string containing these names. Do B change this variable after |
828
|
|
|
|
|
|
|
loading the helper module; it will not do what you think it will do. |
829
|
|
|
|
|
|
|
If both are available and a preference is not set then the hts helper |
830
|
|
|
|
|
|
|
is the default. |
831
|
|
|
|
|
|
|
|
832
|
|
|
|
|
|
|
=back |
833
|
|
|
|
|
|
|
|
834
|
|
|
|
|
|
|
=cut |
835
|
|
|
|
|
|
|
|
836
|
3
|
|
|
3
|
|
16
|
use strict; |
|
3
|
|
|
|
|
6
|
|
|
3
|
|
|
|
|
89
|
|
837
|
|
|
|
|
|
|
require Exporter; |
838
|
3
|
|
|
3
|
|
11
|
use Carp qw(carp cluck croak confess); |
|
3
|
|
|
|
|
4
|
|
|
3
|
|
|
|
|
148
|
|
839
|
3
|
|
|
3
|
|
1136
|
use Module::Load; # for dynamic loading during runtime |
|
3
|
|
|
|
|
2491
|
|
|
3
|
|
|
|
|
15
|
|
840
|
3
|
|
|
3
|
|
124
|
use List::Util qw(min max sum0 uniq); |
|
3
|
|
|
|
|
5
|
|
|
3
|
|
|
|
|
166
|
|
841
|
3
|
|
|
3
|
|
1040
|
use Statistics::Lite qw(median range stddevp); |
|
3
|
|
|
|
|
3542
|
|
|
3
|
|
|
|
|
157
|
|
842
|
3
|
|
|
3
|
|
1104
|
use Bio::ToolBox::db_helper::constants; |
|
3
|
|
|
|
|
6
|
|
|
3
|
|
|
|
|
145
|
|
843
|
3
|
|
|
3
|
|
1168
|
use Bio::ToolBox::utility; |
|
3
|
|
|
|
|
6
|
|
|
3
|
|
|
|
|
20771
|
|
844
|
|
|
|
|
|
|
|
845
|
|
|
|
|
|
|
# check values for dynamically loaded helper modules |
846
|
|
|
|
|
|
|
# these are loaded only when needed during runtime to avoid wasting resources |
847
|
|
|
|
|
|
|
our $BAM_OK = 0; |
848
|
|
|
|
|
|
|
our $BIGBED_OK = 0; |
849
|
|
|
|
|
|
|
our $BIGWIG_OK = 0; |
850
|
|
|
|
|
|
|
our $SEQFASTA_OK = 0; |
851
|
|
|
|
|
|
|
our $USEQ_OK = 0; |
852
|
|
|
|
|
|
|
our $BAM_ADAPTER = undef; # preference for which bam adapter to use |
853
|
|
|
|
|
|
|
our $BIG_ADAPTER = undef; |
854
|
|
|
|
|
|
|
|
855
|
|
|
|
|
|
|
# define reusable variables |
856
|
|
|
|
|
|
|
my %TOTAL_READ_NUMBER; # for rpm calculations |
857
|
|
|
|
|
|
|
my $PRIMARY_ID_WARNING; # for out of date primary IDs |
858
|
|
|
|
|
|
|
my %OPENED_DB; # cache for some opened Bio::DB databases |
859
|
|
|
|
|
|
|
my %DB_METHODS; # cache for database score methods |
860
|
|
|
|
|
|
|
|
861
|
|
|
|
|
|
|
# score calculators |
862
|
|
|
|
|
|
|
my %SCORE_CALCULATOR_SUB = ( |
863
|
|
|
|
|
|
|
'mean' => sub { |
864
|
|
|
|
|
|
|
my $s = shift; |
865
|
|
|
|
|
|
|
return sum0(@$s)/(scalar(@$s) || 1); |
866
|
|
|
|
|
|
|
}, |
867
|
|
|
|
|
|
|
'sum' => sub { |
868
|
|
|
|
|
|
|
my $s = shift; |
869
|
|
|
|
|
|
|
return sum0(@$s); |
870
|
|
|
|
|
|
|
}, |
871
|
|
|
|
|
|
|
'median' => sub { |
872
|
|
|
|
|
|
|
my $s = shift; |
873
|
|
|
|
|
|
|
return '.' unless scalar(@$s); |
874
|
|
|
|
|
|
|
return median(@$s); |
875
|
|
|
|
|
|
|
}, |
876
|
|
|
|
|
|
|
'min' => sub { |
877
|
|
|
|
|
|
|
my $s = shift; |
878
|
|
|
|
|
|
|
return '.' unless scalar(@$s); |
879
|
|
|
|
|
|
|
return min(@$s); |
880
|
|
|
|
|
|
|
}, |
881
|
|
|
|
|
|
|
'max' => sub { |
882
|
|
|
|
|
|
|
my $s = shift; |
883
|
|
|
|
|
|
|
return '.' unless scalar(@$s); |
884
|
|
|
|
|
|
|
return max(@$s); |
885
|
|
|
|
|
|
|
}, |
886
|
|
|
|
|
|
|
'count' => sub { |
887
|
|
|
|
|
|
|
my $s = shift; |
888
|
|
|
|
|
|
|
return scalar(@$s); |
889
|
|
|
|
|
|
|
}, |
890
|
|
|
|
|
|
|
'pcount' => sub { |
891
|
|
|
|
|
|
|
my $s = shift; |
892
|
|
|
|
|
|
|
return scalar(@$s); |
893
|
|
|
|
|
|
|
}, |
894
|
|
|
|
|
|
|
'ncount' => sub { |
895
|
|
|
|
|
|
|
# Convert names into unique counts |
896
|
|
|
|
|
|
|
my $s = shift; |
897
|
|
|
|
|
|
|
my %name2count; |
898
|
|
|
|
|
|
|
foreach my $n (@$s) { |
899
|
|
|
|
|
|
|
if (ref($n) eq 'ARRAY') { |
900
|
|
|
|
|
|
|
# this is likely from a ncount indexed hash |
901
|
|
|
|
|
|
|
foreach (@$n) { |
902
|
|
|
|
|
|
|
$name2count{$_} += 1; |
903
|
|
|
|
|
|
|
} |
904
|
|
|
|
|
|
|
} |
905
|
|
|
|
|
|
|
else { |
906
|
|
|
|
|
|
|
$name2count{$n} += 1; |
907
|
|
|
|
|
|
|
} |
908
|
|
|
|
|
|
|
} |
909
|
|
|
|
|
|
|
return scalar(keys %name2count); |
910
|
|
|
|
|
|
|
}, |
911
|
|
|
|
|
|
|
'range' => sub { |
912
|
|
|
|
|
|
|
# the range value is 'min-max' |
913
|
|
|
|
|
|
|
my $s = shift; |
914
|
|
|
|
|
|
|
return '.' unless scalar(@$s); |
915
|
|
|
|
|
|
|
return range(@$s); |
916
|
|
|
|
|
|
|
}, |
917
|
|
|
|
|
|
|
'stddev' => sub { |
918
|
|
|
|
|
|
|
# we are using the standard deviation of the population, |
919
|
|
|
|
|
|
|
# since these are the only scores we are considering |
920
|
|
|
|
|
|
|
my $s = shift; |
921
|
|
|
|
|
|
|
return '.' unless scalar(@$s); |
922
|
|
|
|
|
|
|
return stddevp(@$s); |
923
|
|
|
|
|
|
|
}, |
924
|
|
|
|
|
|
|
'rpm' => sub { |
925
|
|
|
|
|
|
|
confess " The rpm methods have been deprecated due to complexity and " . |
926
|
|
|
|
|
|
|
"the variable way of calculating the value. Collect counts and " . |
927
|
|
|
|
|
|
|
"calculate your preferred way.\n"; |
928
|
|
|
|
|
|
|
}, |
929
|
|
|
|
|
|
|
'rpkm' => sub { |
930
|
|
|
|
|
|
|
confess " The rpm methods have been deprecated due to complexity and " . |
931
|
|
|
|
|
|
|
"the variable way of calculating the value. Collect counts and " . |
932
|
|
|
|
|
|
|
"calculate your preferred way.\n"; |
933
|
|
|
|
|
|
|
} |
934
|
|
|
|
|
|
|
); |
935
|
|
|
|
|
|
|
|
936
|
|
|
|
|
|
|
|
937
|
|
|
|
|
|
|
# Exported names |
938
|
|
|
|
|
|
|
our @ISA = qw(Exporter); |
939
|
|
|
|
|
|
|
our @EXPORT = qw(); |
940
|
|
|
|
|
|
|
our @EXPORT_OK = qw( |
941
|
|
|
|
|
|
|
$BAM_ADAPTER |
942
|
|
|
|
|
|
|
$BIG_ADAPTER |
943
|
|
|
|
|
|
|
use_bam_adapter |
944
|
|
|
|
|
|
|
use_big_adapter |
945
|
|
|
|
|
|
|
open_db_connection |
946
|
|
|
|
|
|
|
get_dataset_list |
947
|
|
|
|
|
|
|
verify_or_request_feature_types |
948
|
|
|
|
|
|
|
check_dataset_for_rpm_support |
949
|
|
|
|
|
|
|
get_new_feature_list |
950
|
|
|
|
|
|
|
get_new_genome_list |
951
|
|
|
|
|
|
|
validate_included_feature |
952
|
|
|
|
|
|
|
get_db_feature |
953
|
|
|
|
|
|
|
get_segment_score |
954
|
|
|
|
|
|
|
calculate_score |
955
|
|
|
|
|
|
|
get_chromosome_list |
956
|
|
|
|
|
|
|
low_level_bam_coverage |
957
|
|
|
|
|
|
|
low_level_bam_fetch |
958
|
|
|
|
|
|
|
get_genomic_sequence |
959
|
|
|
|
|
|
|
); |
960
|
|
|
|
|
|
|
|
961
|
|
|
|
|
|
|
# The true statement |
962
|
|
|
|
|
|
|
1; |
963
|
|
|
|
|
|
|
|
964
|
|
|
|
|
|
|
|
965
|
|
|
|
|
|
|
### Open a connection to the SeqFeature Store MySQL database |
966
|
|
|
|
|
|
|
|
967
|
|
|
|
|
|
|
sub use_bam_adapter { |
968
|
0
|
|
0
|
0
|
1
|
0
|
my $a = shift || undef; |
969
|
0
|
0
|
|
|
|
0
|
$BAM_ADAPTER = $a if $a; |
970
|
0
|
|
|
|
|
0
|
return $BAM_ADAPTER; |
971
|
|
|
|
|
|
|
} |
972
|
|
|
|
|
|
|
|
973
|
|
|
|
|
|
|
|
974
|
|
|
|
|
|
|
sub use_big_adapter { |
975
|
0
|
|
0
|
0
|
1
|
0
|
my $a = shift || undef; |
976
|
0
|
0
|
|
|
|
0
|
$BIG_ADAPTER = $a if $a; |
977
|
0
|
|
|
|
|
0
|
return $BIG_ADAPTER; |
978
|
|
|
|
|
|
|
} |
979
|
|
|
|
|
|
|
|
980
|
|
|
|
|
|
|
|
981
|
|
|
|
|
|
|
sub open_db_connection { |
982
|
8
|
|
|
8
|
1
|
12
|
my $database = shift; |
983
|
8
|
|
50
|
|
|
20
|
my $no_cache = shift || 0; |
984
|
8
|
50
|
|
|
|
12
|
unless ($database) { |
985
|
|
|
|
|
|
|
# cluck 'no database name passed!'; |
986
|
0
|
|
|
|
|
0
|
return; |
987
|
|
|
|
|
|
|
} |
988
|
|
|
|
|
|
|
|
989
|
|
|
|
|
|
|
# first check if it is a database reference |
990
|
8
|
|
|
|
|
14
|
my $db_ref = ref $database; |
991
|
8
|
100
|
|
|
|
25
|
if ($db_ref =~ /DB|big::BigWigSet/) { |
992
|
|
|
|
|
|
|
# the provided database is already an open database object |
993
|
|
|
|
|
|
|
# nothing to open, return as is |
994
|
2
|
|
|
|
|
5
|
return $database; |
995
|
|
|
|
|
|
|
} |
996
|
|
|
|
|
|
|
|
997
|
|
|
|
|
|
|
# check to see if we have already opened it |
998
|
6
|
100
|
66
|
|
|
23
|
if (exists $OPENED_DB{$database} and not $no_cache) { |
999
|
|
|
|
|
|
|
# return the cached database connection |
1000
|
|
|
|
|
|
|
# but NOT if user explicitly requested no cached databases |
1001
|
|
|
|
|
|
|
# DO NOT reuse database objects if you have forked!!! Bad things happen |
1002
|
5
|
|
|
|
|
11
|
return $OPENED_DB{$database}; |
1003
|
|
|
|
|
|
|
} |
1004
|
|
|
|
|
|
|
|
1005
|
|
|
|
|
|
|
# skip parsed databases |
1006
|
1
|
50
|
|
|
|
3
|
return if $database =~ /^Parsed:/; # we don't open parsed annotation files |
1007
|
|
|
|
|
|
|
|
1008
|
|
|
|
|
|
|
# remove file prefix, just in case |
1009
|
1
|
|
|
|
|
2
|
$database =~ s/^file://; |
1010
|
|
|
|
|
|
|
|
1011
|
|
|
|
|
|
|
|
1012
|
|
|
|
|
|
|
### Attempt to open the database |
1013
|
|
|
|
|
|
|
# we go through a series of checks to determine if it is remote, local, |
1014
|
|
|
|
|
|
|
# an indexed big data file, SQLite file, etc |
1015
|
|
|
|
|
|
|
# when all else fails, try to open a SQL connection |
1016
|
1
|
|
|
|
|
2
|
my $db; |
1017
|
|
|
|
|
|
|
my $error; |
1018
|
|
|
|
|
|
|
|
1019
|
|
|
|
|
|
|
# check if it is a remote file |
1020
|
1
|
50
|
|
|
|
20
|
if ($database =~ /^(?:https?|ftp)/i) { |
|
|
50
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
1021
|
|
|
|
|
|
|
|
1022
|
|
|
|
|
|
|
# a remote Bam database |
1023
|
0
|
0
|
|
|
|
0
|
if ($database =~ /\.bam$/i) { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
1024
|
|
|
|
|
|
|
# open using Bam adaptor |
1025
|
0
|
0
|
|
|
|
0
|
_load_bam_helper_module() unless $BAM_OK; |
1026
|
0
|
0
|
|
|
|
0
|
if ($BAM_OK) { |
1027
|
0
|
|
|
|
|
0
|
$db = open_bam_db($database); |
1028
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1029
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open remote Bam file" . |
1030
|
|
|
|
|
|
|
" '$database'! $!\n"; |
1031
|
|
|
|
|
|
|
} |
1032
|
|
|
|
|
|
|
} |
1033
|
|
|
|
|
|
|
else { |
1034
|
0
|
|
|
|
|
0
|
$error = " Bam database cannot be loaded because\n" . |
1035
|
|
|
|
|
|
|
" Bio::DB::Sam or Bio::DB::HTS is not installed\n"; |
1036
|
|
|
|
|
|
|
} |
1037
|
|
|
|
|
|
|
} |
1038
|
|
|
|
|
|
|
|
1039
|
|
|
|
|
|
|
# a remote BigBed database |
1040
|
|
|
|
|
|
|
elsif ($database =~ /\.(?:bb|bigbed)$/i) { |
1041
|
|
|
|
|
|
|
# open using BigBed adaptor |
1042
|
0
|
0
|
|
|
|
0
|
$BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK; |
1043
|
0
|
0
|
|
|
|
0
|
if ($BIGBED_OK) { |
1044
|
0
|
|
|
|
|
0
|
$db = open_bigbed_db($database); |
1045
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1046
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open remote BigBed file" . |
1047
|
|
|
|
|
|
|
" '$database'! $!\n"; |
1048
|
|
|
|
|
|
|
} |
1049
|
|
|
|
|
|
|
} |
1050
|
|
|
|
|
|
|
else { |
1051
|
0
|
|
|
|
|
0
|
$error = " BigBed database cannot be loaded because\n" . |
1052
|
|
|
|
|
|
|
" Bio::DB::Big or Bio::DB::BigBed is not installed\n"; |
1053
|
|
|
|
|
|
|
} |
1054
|
|
|
|
|
|
|
} |
1055
|
|
|
|
|
|
|
|
1056
|
|
|
|
|
|
|
# a remote BigWig database |
1057
|
|
|
|
|
|
|
elsif ($database =~ /\.(?:bw|bigwig)$/i) { |
1058
|
|
|
|
|
|
|
# open using BigWig adaptor |
1059
|
0
|
0
|
|
|
|
0
|
$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK; |
1060
|
0
|
0
|
|
|
|
0
|
if ($BIGWIG_OK) { |
1061
|
0
|
|
|
|
|
0
|
$db = open_bigwig_db($database); |
1062
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1063
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open remote BigWig file" . |
1064
|
|
|
|
|
|
|
" '$database'! $!\n"; |
1065
|
|
|
|
|
|
|
} |
1066
|
|
|
|
|
|
|
} |
1067
|
|
|
|
|
|
|
else { |
1068
|
0
|
|
|
|
|
0
|
$error = " BigWig database cannot be loaded because\n" . |
1069
|
|
|
|
|
|
|
" Bio::DB::Big or Bio::DB::BigWig is not installed\n"; |
1070
|
|
|
|
|
|
|
} |
1071
|
|
|
|
|
|
|
} |
1072
|
|
|
|
|
|
|
|
1073
|
|
|
|
|
|
|
# a remote useq file |
1074
|
|
|
|
|
|
|
elsif ($database =~ /\.useq$/) { |
1075
|
|
|
|
|
|
|
# uh oh! remote useq files are not supported |
1076
|
0
|
|
|
|
|
0
|
$error = " ERROR: remote useq files are not supported!\n"; |
1077
|
|
|
|
|
|
|
} |
1078
|
|
|
|
|
|
|
|
1079
|
|
|
|
|
|
|
# a remote fasta file??? |
1080
|
|
|
|
|
|
|
elsif ($database =~ /\.fa(?:sta)?$/i) { |
1081
|
|
|
|
|
|
|
# uh oh! remote fasta files are not supported |
1082
|
0
|
|
|
|
|
0
|
$error = " ERROR: remote fasta files are not supported!\n"; |
1083
|
|
|
|
|
|
|
} |
1084
|
|
|
|
|
|
|
|
1085
|
|
|
|
|
|
|
# a presumed remote directory, presumably of bigwig files |
1086
|
|
|
|
|
|
|
else { |
1087
|
|
|
|
|
|
|
# open using BigWigSet adaptor |
1088
|
0
|
0
|
|
|
|
0
|
$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig') |
1089
|
|
|
|
|
|
|
unless $BIGWIG_OK; |
1090
|
0
|
0
|
|
|
|
0
|
if ($BIGWIG_OK) { |
1091
|
0
|
|
|
|
|
0
|
$db = open_bigwigset_db($database); |
1092
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1093
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open presumed remote " . |
1094
|
|
|
|
|
|
|
"BigWigSet directory '$database'! $!\n"; |
1095
|
|
|
|
|
|
|
} |
1096
|
|
|
|
|
|
|
} |
1097
|
|
|
|
|
|
|
else { |
1098
|
0
|
|
|
|
|
0
|
$error = " Presumed BigWigSet database cannot be loaded because\n" . |
1099
|
|
|
|
|
|
|
" Bio::DB::BigWigSet is not installed\n"; |
1100
|
|
|
|
|
|
|
} |
1101
|
|
|
|
|
|
|
} |
1102
|
|
|
|
|
|
|
|
1103
|
|
|
|
|
|
|
} |
1104
|
|
|
|
|
|
|
|
1105
|
|
|
|
|
|
|
# a local existing file |
1106
|
|
|
|
|
|
|
elsif (-f $database) { |
1107
|
|
|
|
|
|
|
|
1108
|
|
|
|
|
|
|
# a Bam database |
1109
|
1
|
50
|
|
|
|
17
|
if ($database =~ /\.bam$/i) { |
|
|
50
|
|
|
|
|
|
|
|
50
|
|
|
|
|
|
|
|
50
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
1110
|
|
|
|
|
|
|
# open using appropriate bam adaptor |
1111
|
0
|
0
|
|
|
|
0
|
_load_bam_helper_module() unless $BAM_OK; |
1112
|
0
|
0
|
|
|
|
0
|
if ($BAM_OK) { |
1113
|
0
|
|
|
|
|
0
|
undef $@; |
1114
|
0
|
|
|
|
|
0
|
$db = open_bam_db($database); |
1115
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1116
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local Bam file" . |
1117
|
|
|
|
|
|
|
" '$database'! $@\n"; |
1118
|
|
|
|
|
|
|
} |
1119
|
|
|
|
|
|
|
} |
1120
|
|
|
|
|
|
|
else { |
1121
|
0
|
|
|
|
|
0
|
$error = " Bam database cannot be loaded because\n" . |
1122
|
|
|
|
|
|
|
" Bio::DB::Sam or Bio::DB::HTS is not installed\n"; |
1123
|
|
|
|
|
|
|
} |
1124
|
|
|
|
|
|
|
} |
1125
|
|
|
|
|
|
|
|
1126
|
|
|
|
|
|
|
# a BigBed database |
1127
|
|
|
|
|
|
|
elsif ($database =~ /\.(?:bb|bigbed)$/i) { |
1128
|
|
|
|
|
|
|
# open using BigBed adaptor |
1129
|
0
|
0
|
|
|
|
0
|
$BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK; |
1130
|
0
|
0
|
|
|
|
0
|
if ($BIGBED_OK) { |
1131
|
0
|
|
|
|
|
0
|
undef $@; |
1132
|
0
|
|
|
|
|
0
|
$db = open_bigbed_db($database); |
1133
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1134
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local BigBed file" . |
1135
|
|
|
|
|
|
|
" '$database'! $@\n"; |
1136
|
|
|
|
|
|
|
} |
1137
|
|
|
|
|
|
|
} |
1138
|
|
|
|
|
|
|
else { |
1139
|
0
|
|
|
|
|
0
|
$error = " BigBed database cannot be loaded because\n" . |
1140
|
|
|
|
|
|
|
" Big::DB::Big or Bio::DB::BigBed is not installed\n"; |
1141
|
|
|
|
|
|
|
} |
1142
|
|
|
|
|
|
|
} |
1143
|
|
|
|
|
|
|
|
1144
|
|
|
|
|
|
|
# a BigWig database |
1145
|
|
|
|
|
|
|
elsif ($database =~ /\.(?:bw|bigwig)$/i) { |
1146
|
|
|
|
|
|
|
# open using BigWig adaptor |
1147
|
0
|
0
|
|
|
|
0
|
$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK; |
1148
|
0
|
0
|
|
|
|
0
|
if ($BIGWIG_OK) { |
1149
|
0
|
|
|
|
|
0
|
undef $@; |
1150
|
0
|
|
|
|
|
0
|
$db = open_bigwig_db($database); |
1151
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1152
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local BigWig file" . |
1153
|
|
|
|
|
|
|
" '$database'! $@\n"; |
1154
|
|
|
|
|
|
|
} |
1155
|
|
|
|
|
|
|
} |
1156
|
|
|
|
|
|
|
else { |
1157
|
0
|
|
|
|
|
0
|
$error = " BigWig database cannot be loaded because\n" . |
1158
|
|
|
|
|
|
|
" Big::DB::Big or Bio::DB::BigWig is not installed\n"; |
1159
|
|
|
|
|
|
|
} |
1160
|
|
|
|
|
|
|
} |
1161
|
|
|
|
|
|
|
|
1162
|
|
|
|
|
|
|
# a useq database |
1163
|
|
|
|
|
|
|
elsif ($database =~ /\.useq$/i) { |
1164
|
|
|
|
|
|
|
# open using USeq adaptor |
1165
|
1
|
50
|
|
|
|
35
|
$USEQ_OK = _load_helper_module('Bio::ToolBox::db_helper::useq') |
1166
|
|
|
|
|
|
|
unless $USEQ_OK; |
1167
|
1
|
50
|
|
|
|
3
|
if ($USEQ_OK) { |
1168
|
1
|
|
|
|
|
2
|
$db = open_useq_db($database); |
1169
|
1
|
50
|
|
|
|
3
|
unless ($db) { |
1170
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local useq file" . |
1171
|
|
|
|
|
|
|
" '$database'! $!\n"; |
1172
|
|
|
|
|
|
|
} |
1173
|
|
|
|
|
|
|
} |
1174
|
|
|
|
|
|
|
else { |
1175
|
0
|
|
|
|
|
0
|
$error = " Useq database cannot be loaded because\n" . |
1176
|
|
|
|
|
|
|
" Bio::DB::USeq is not installed\n"; |
1177
|
|
|
|
|
|
|
} |
1178
|
|
|
|
|
|
|
} |
1179
|
|
|
|
|
|
|
|
1180
|
|
|
|
|
|
|
# a Fasta File |
1181
|
|
|
|
|
|
|
elsif ($database =~ /\.fa(?:sta)?(?:\.gz)?$/i) { |
1182
|
|
|
|
|
|
|
# first try a modern fai indexed adapter |
1183
|
0
|
0
|
|
|
|
0
|
_load_bam_helper_module() unless $BAM_OK; |
1184
|
0
|
0
|
|
|
|
0
|
if ($BAM_OK) { |
1185
|
0
|
|
|
|
|
0
|
$db = open_indexed_fasta($database); |
1186
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1187
|
0
|
|
|
|
|
0
|
$error .= " ERROR: could not open indexed fasta file '$database'!\n"; |
1188
|
|
|
|
|
|
|
} |
1189
|
|
|
|
|
|
|
} |
1190
|
|
|
|
|
|
|
else { |
1191
|
|
|
|
|
|
|
# open using the old slow BioPerl Fasta adaptor |
1192
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
1193
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
1194
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
1195
|
0
|
|
|
|
|
0
|
$db = open_fasta_db($database); |
1196
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1197
|
0
|
|
|
|
|
0
|
$error .= " ERROR: could not open fasta file '$database'!\n"; |
1198
|
0
|
0
|
|
|
|
0
|
if (-e "$database\.index") { |
1199
|
0
|
|
|
|
|
0
|
$error .= " Try deleting $database\.index and try again\n"; |
1200
|
|
|
|
|
|
|
} |
1201
|
|
|
|
|
|
|
} |
1202
|
|
|
|
|
|
|
} |
1203
|
|
|
|
|
|
|
else { |
1204
|
0
|
|
|
|
|
0
|
$error .= " Fasta file could not be loaded because neither" . |
1205
|
|
|
|
|
|
|
" Bio::DB::HTS, Bio::DB::Sam, or Bio::DB::Fasta is installed\n"; |
1206
|
|
|
|
|
|
|
} |
1207
|
|
|
|
|
|
|
} |
1208
|
|
|
|
|
|
|
} |
1209
|
|
|
|
|
|
|
|
1210
|
|
|
|
|
|
|
# a gff3 or sqlite database |
1211
|
|
|
|
|
|
|
elsif ($database =~ /\.(?:gff3?|gff3?\.gz|db|sqlite)$/) { |
1212
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
1213
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
1214
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
1215
|
0
|
|
|
|
|
0
|
$db = open_store_db($database); |
1216
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1217
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not load SeqFeature database file '$database'!\n"; |
1218
|
|
|
|
|
|
|
} |
1219
|
|
|
|
|
|
|
} |
1220
|
|
|
|
|
|
|
else { |
1221
|
0
|
|
|
|
|
0
|
$error .= " Module Bio::DB::SeqFeature::Store is required to load " . |
1222
|
|
|
|
|
|
|
"GFF and database files\n"; |
1223
|
|
|
|
|
|
|
} |
1224
|
|
|
|
|
|
|
} |
1225
|
|
|
|
|
|
|
|
1226
|
|
|
|
|
|
|
# a cram database |
1227
|
|
|
|
|
|
|
elsif ($database =~ /\.cram$/i) { |
1228
|
|
|
|
|
|
|
# open using HTS bam adaptor only |
1229
|
0
|
|
0
|
|
|
0
|
$BAM_ADAPTER ||= 'hts'; |
1230
|
0
|
0
|
|
|
|
0
|
if ($BAM_ADAPTER eq 'sam') { |
1231
|
0
|
|
|
|
|
0
|
$error .= " ERROR: Only HTS adapters support Cram files!\n"; |
1232
|
|
|
|
|
|
|
} |
1233
|
0
|
0
|
|
|
|
0
|
_load_bam_helper_module() unless $BAM_OK; |
1234
|
0
|
0
|
|
|
|
0
|
if ($BAM_OK) { |
1235
|
0
|
|
|
|
|
0
|
undef $@; |
1236
|
0
|
|
|
|
|
0
|
$db = open_bam_db($database); |
1237
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1238
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local Cram file" . |
1239
|
|
|
|
|
|
|
" '$database'! $@\n"; |
1240
|
|
|
|
|
|
|
} |
1241
|
|
|
|
|
|
|
} |
1242
|
|
|
|
|
|
|
else { |
1243
|
0
|
|
|
|
|
0
|
$error = " Cram database cannot be loaded because\n" . |
1244
|
|
|
|
|
|
|
" Bio::DB::HTS is not installed\n"; |
1245
|
|
|
|
|
|
|
} |
1246
|
|
|
|
|
|
|
} |
1247
|
|
|
|
|
|
|
|
1248
|
|
|
|
|
|
|
# something unrecognized? |
1249
|
|
|
|
|
|
|
else { |
1250
|
0
|
|
|
|
|
0
|
$error .= " ERROR! Cannot identify database type based on extension for $database!\n"; |
1251
|
|
|
|
|
|
|
} |
1252
|
|
|
|
|
|
|
} |
1253
|
|
|
|
|
|
|
|
1254
|
|
|
|
|
|
|
# a directory, presumably of bigwig files |
1255
|
|
|
|
|
|
|
elsif (-d $database) { |
1256
|
|
|
|
|
|
|
# try opening using the BigWigSet adaptor |
1257
|
0
|
0
|
|
|
|
0
|
$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK; |
1258
|
0
|
0
|
|
|
|
0
|
if ($BIGWIG_OK) { |
1259
|
0
|
|
|
|
|
0
|
$db = open_bigwigset_db($database); |
1260
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1261
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local BigWigSet " . |
1262
|
|
|
|
|
|
|
"directory '$database'!\n"; |
1263
|
0
|
|
|
|
|
0
|
$error .= " Does directory contain bigWig .bw files?\n"; |
1264
|
|
|
|
|
|
|
} |
1265
|
|
|
|
|
|
|
} |
1266
|
|
|
|
|
|
|
else { |
1267
|
0
|
|
|
|
|
0
|
$error = " Presumed BigWigSet database cannot be loaded because\n" . |
1268
|
|
|
|
|
|
|
" Bio::DB::Big or Bio::DB::BigWigSet is not installed\n"; |
1269
|
|
|
|
|
|
|
} |
1270
|
|
|
|
|
|
|
|
1271
|
|
|
|
|
|
|
# try opening with the Fasta adaptor |
1272
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1273
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
1274
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
1275
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
1276
|
0
|
|
|
|
|
0
|
$db = open_fasta_db($database); |
1277
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1278
|
0
|
|
|
|
|
0
|
$error .= " ERROR: could not open fasta directory '$database'!\n"; |
1279
|
0
|
|
|
|
|
0
|
$error .= " Does directory contain fasta files? If it contains a" . |
1280
|
|
|
|
|
|
|
" directory.index file,\n try deleting it and try again.\n"; |
1281
|
|
|
|
|
|
|
} |
1282
|
|
|
|
|
|
|
} |
1283
|
|
|
|
|
|
|
else { |
1284
|
0
|
|
|
|
|
0
|
$error .= " Module Bio::DB::Fasta is required to open presumed fasta " . |
1285
|
|
|
|
|
|
|
"directory\n"; |
1286
|
|
|
|
|
|
|
} |
1287
|
|
|
|
|
|
|
} |
1288
|
|
|
|
|
|
|
} |
1289
|
|
|
|
|
|
|
|
1290
|
|
|
|
|
|
|
# otherwise assume the name of a SeqFeature::Store database in the configuration |
1291
|
|
|
|
|
|
|
else { |
1292
|
|
|
|
|
|
|
# attempt to open using information from the configuration |
1293
|
|
|
|
|
|
|
# using default connection information as necessary |
1294
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
1295
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
1296
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
1297
|
0
|
|
|
|
|
0
|
$db = open_store_db($database); |
1298
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1299
|
0
|
|
|
|
|
0
|
$error .= " unable to open relational database named '$database'\n"; |
1300
|
|
|
|
|
|
|
} |
1301
|
|
|
|
|
|
|
} |
1302
|
|
|
|
|
|
|
else { |
1303
|
0
|
|
|
|
|
0
|
$error .= " Module Bio::DB::SeqFeature::Store is required to connect " . |
1304
|
|
|
|
|
|
|
"to databases\n"; |
1305
|
|
|
|
|
|
|
} |
1306
|
|
|
|
|
|
|
} |
1307
|
|
|
|
|
|
|
|
1308
|
|
|
|
|
|
|
# conditional return |
1309
|
1
|
50
|
|
|
|
3
|
if ($db) { |
1310
|
|
|
|
|
|
|
# cache the opened connection for use later |
1311
|
1
|
50
|
|
|
|
3
|
$OPENED_DB{$database} = $db unless $no_cache; |
1312
|
|
|
|
|
|
|
|
1313
|
|
|
|
|
|
|
# return as appropriate either both object and name or just object |
1314
|
1
|
|
|
|
|
3
|
return $db; |
1315
|
|
|
|
|
|
|
} |
1316
|
|
|
|
|
|
|
else { |
1317
|
0
|
|
|
|
|
0
|
print STDERR $error; |
1318
|
0
|
|
|
|
|
0
|
return; |
1319
|
|
|
|
|
|
|
} |
1320
|
|
|
|
|
|
|
} |
1321
|
|
|
|
|
|
|
|
1322
|
|
|
|
|
|
|
|
1323
|
|
|
|
|
|
|
### Retrieve a database name from an db object |
1324
|
|
|
|
|
|
|
sub get_db_name { |
1325
|
0
|
|
|
0
|
1
|
0
|
my $db = shift; |
1326
|
0
|
0
|
|
|
|
0
|
return unless $db; |
1327
|
0
|
|
|
|
|
0
|
my $db_ref = ref($db); |
1328
|
0
|
0
|
|
|
|
0
|
return $db unless $db_ref; # presumption that non-object is just a database name |
1329
|
0
|
|
|
|
|
0
|
my $db_name; |
1330
|
0
|
0
|
|
|
|
0
|
if ($db_ref =~ /^Bio::DB::SeqFeature::Store/) { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
1331
|
|
|
|
|
|
|
# a SeqFeature database, using any DBI adapter |
1332
|
0
|
|
|
|
|
0
|
$db_name = $db->{'dbh'}->{'name'}; |
1333
|
|
|
|
|
|
|
# dig through the object internals to identify the original |
1334
|
|
|
|
|
|
|
# name of the database |
1335
|
|
|
|
|
|
|
# this should be relatively well documented through DBI |
1336
|
|
|
|
|
|
|
# but could break in the future since it's not official API |
1337
|
|
|
|
|
|
|
} |
1338
|
|
|
|
|
|
|
elsif ($db_ref eq 'Bio::DB::Sam') { |
1339
|
|
|
|
|
|
|
# a Samtools Bam database |
1340
|
|
|
|
|
|
|
# old versions <=1.39 don't have the bam_path method, so it's easier |
1341
|
|
|
|
|
|
|
# to explicitly grab it from the object internals |
1342
|
0
|
|
|
|
|
0
|
$db_name = $db->{bam_path}; |
1343
|
|
|
|
|
|
|
} |
1344
|
|
|
|
|
|
|
elsif ($db_ref eq 'Bio::DB::HTS') { |
1345
|
|
|
|
|
|
|
# a HTSlib Bam database |
1346
|
0
|
|
|
|
|
0
|
$db_name = $db->hts_path; |
1347
|
|
|
|
|
|
|
} |
1348
|
|
|
|
|
|
|
# determining the database name from other sources is |
1349
|
|
|
|
|
|
|
# either not possible or not easy, so won't bother unless |
1350
|
|
|
|
|
|
|
# there is a really really good need |
1351
|
0
|
|
|
|
|
0
|
return $db_name; |
1352
|
|
|
|
|
|
|
} |
1353
|
|
|
|
|
|
|
|
1354
|
|
|
|
|
|
|
|
1355
|
|
|
|
|
|
|
### Retrieve a list of the microrarray data sets from the db |
1356
|
|
|
|
|
|
|
sub get_dataset_list { |
1357
|
|
|
|
|
|
|
|
1358
|
0
|
|
|
0
|
1
|
0
|
my $database = shift; |
1359
|
0
|
|
|
|
|
0
|
my $use_all_features = shift; |
1360
|
|
|
|
|
|
|
|
1361
|
|
|
|
|
|
|
# Open a db connection |
1362
|
0
|
|
|
|
|
0
|
my $db = open_db_connection($database); |
1363
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1364
|
0
|
|
|
|
|
0
|
carp 'no database connected!'; |
1365
|
0
|
|
|
|
|
0
|
return; |
1366
|
|
|
|
|
|
|
} |
1367
|
0
|
|
|
|
|
0
|
my $db_ref = ref $db; |
1368
|
|
|
|
|
|
|
|
1369
|
|
|
|
|
|
|
# process the database types, according to the type of database |
1370
|
0
|
|
|
|
|
0
|
my @types; |
1371
|
|
|
|
|
|
|
|
1372
|
|
|
|
|
|
|
# a SeqFeature database |
1373
|
0
|
0
|
|
|
|
0
|
if ($db_ref =~ /^Bio::DB::SeqFeature::Store/) { |
|
|
0
|
|
|
|
|
|
1374
|
|
|
|
|
|
|
|
1375
|
|
|
|
|
|
|
# collect the database types, |
1376
|
|
|
|
|
|
|
# sort first by source, then by method |
1377
|
|
|
|
|
|
|
@types = ( |
1378
|
|
|
|
|
|
|
map $_->[2], |
1379
|
0
|
0
|
|
|
|
0
|
sort { ($a->[0] cmp $b->[0]) or ($a->[1] cmp $b->[1]) } |
|
0
|
|
|
|
|
0
|
|
1380
|
|
|
|
|
|
|
map [$_->source, $_->method, $_], |
1381
|
|
|
|
|
|
|
$db->types |
1382
|
|
|
|
|
|
|
); |
1383
|
|
|
|
|
|
|
} |
1384
|
|
|
|
|
|
|
|
1385
|
|
|
|
|
|
|
# a BigWigSet database |
1386
|
|
|
|
|
|
|
elsif ($db_ref =~ /BigWigSet/i) { |
1387
|
|
|
|
|
|
|
|
1388
|
|
|
|
|
|
|
# get the metadata |
1389
|
0
|
|
|
|
|
0
|
my $metadata = $db->metadata; |
1390
|
|
|
|
|
|
|
|
1391
|
|
|
|
|
|
|
# collect |
1392
|
0
|
|
|
|
|
0
|
foreach my $file (keys %{ $metadata }) { |
|
0
|
|
|
|
|
0
|
|
1393
|
|
|
|
|
|
|
# get the type for each file |
1394
|
0
|
|
|
|
|
0
|
my ($primary, $type, $name); |
1395
|
|
|
|
|
|
|
|
1396
|
|
|
|
|
|
|
# get the appropriate tags |
1397
|
0
|
|
|
|
|
0
|
foreach my $attribute (keys %{ $metadata->{$file} } ) { |
|
0
|
|
|
|
|
0
|
|
1398
|
0
|
0
|
|
|
|
0
|
if ($attribute =~ m/^type/i) { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
1399
|
0
|
|
|
|
|
0
|
$type = $metadata->{$file}{$attribute}; |
1400
|
|
|
|
|
|
|
} |
1401
|
|
|
|
|
|
|
elsif ($attribute =~ m/name/i) { |
1402
|
0
|
|
|
|
|
0
|
$name = $metadata->{$file}{$attribute}; |
1403
|
|
|
|
|
|
|
} |
1404
|
|
|
|
|
|
|
elsif ($attribute =~ m/^primary_tag|method$/i) { |
1405
|
0
|
|
|
|
|
0
|
$primary = $metadata->{$file}{$attribute}; |
1406
|
|
|
|
|
|
|
} |
1407
|
|
|
|
|
|
|
} |
1408
|
|
|
|
|
|
|
|
1409
|
|
|
|
|
|
|
# store the type |
1410
|
0
|
|
0
|
|
|
0
|
push @types, $type || $primary || $name; |
1411
|
|
|
|
|
|
|
} |
1412
|
|
|
|
|
|
|
|
1413
|
|
|
|
|
|
|
# sort the types in alphabetical order |
1414
|
|
|
|
|
|
|
# and discard duplicate types - which may occur with stranded entries |
1415
|
0
|
|
|
|
|
0
|
@types = uniq(sort {$a cmp $b} @types); |
|
0
|
|
|
|
|
0
|
|
1416
|
|
|
|
|
|
|
} |
1417
|
|
|
|
|
|
|
|
1418
|
|
|
|
|
|
|
# some other database |
1419
|
|
|
|
|
|
|
else { |
1420
|
0
|
|
|
|
|
0
|
carp " no dataset lists for database type $db_ref!\n"; |
1421
|
|
|
|
|
|
|
} |
1422
|
|
|
|
|
|
|
|
1423
|
|
|
|
|
|
|
# finish |
1424
|
0
|
|
|
|
|
0
|
return @types; |
1425
|
|
|
|
|
|
|
} |
1426
|
|
|
|
|
|
|
|
1427
|
|
|
|
|
|
|
|
1428
|
|
|
|
|
|
|
|
1429
|
|
|
|
|
|
|
### Process and verify a dataset |
1430
|
|
|
|
|
|
|
sub verify_or_request_feature_types { |
1431
|
|
|
|
|
|
|
|
1432
|
|
|
|
|
|
|
# Retrieve passed values |
1433
|
1
|
|
|
1
|
1
|
2
|
my %args = @_; # the passed argument values as a hash reference |
1434
|
|
|
|
|
|
|
|
1435
|
|
|
|
|
|
|
# Check for single option |
1436
|
1
|
|
50
|
|
|
5
|
$args{'single'} ||= 0; |
1437
|
|
|
|
|
|
|
|
1438
|
|
|
|
|
|
|
# Collect the datasets |
1439
|
1
|
|
|
|
|
1
|
my @datasets; |
1440
|
1
|
|
50
|
|
|
3
|
$args{'feature'} ||= undef; |
1441
|
1
|
50
|
|
|
|
4
|
if (ref $args{'feature'} eq 'ARRAY') { |
|
|
50
|
|
|
|
|
|
1442
|
|
|
|
|
|
|
# an anonymous array of datasets |
1443
|
0
|
|
|
|
|
0
|
@datasets = @{ $args{'feature'} }; |
|
0
|
|
|
|
|
0
|
|
1444
|
|
|
|
|
|
|
} |
1445
|
|
|
|
|
|
|
elsif (defined $args{'feature'}) { |
1446
|
1
|
|
|
|
|
2
|
push @datasets, $args{'feature'}; |
1447
|
|
|
|
|
|
|
} |
1448
|
|
|
|
|
|
|
# else it is null and @datasets remains empty, prompting user input |
1449
|
|
|
|
|
|
|
|
1450
|
|
|
|
|
|
|
# feature limits |
1451
|
|
|
|
|
|
|
# this is a regex to limit the feature types presented to the user |
1452
|
|
|
|
|
|
|
# otherwise everything in the database is presented to the user |
1453
|
1
|
|
50
|
|
|
12
|
my $limit = $args{'limit'} ||= undef; |
1454
|
|
|
|
|
|
|
|
1455
|
|
|
|
|
|
|
|
1456
|
|
|
|
|
|
|
# Open database object and collect features |
1457
|
1
|
|
50
|
|
|
4
|
$args{'db'} ||= undef; |
1458
|
1
|
50
|
|
|
|
3
|
my $db = $args{'db'} ? open_db_connection( $args{'db'} ) : undef; |
1459
|
|
|
|
|
|
|
|
1460
|
|
|
|
|
|
|
# Initialize main output arrays |
1461
|
1
|
|
|
|
|
3
|
my @good_datasets; |
1462
|
|
|
|
|
|
|
my @bad_datasets; |
1463
|
1
|
|
|
|
|
0
|
my %db_features; # hash for features found in the database, use when needed |
1464
|
|
|
|
|
|
|
|
1465
|
|
|
|
|
|
|
# Check provided datasets |
1466
|
1
|
50
|
|
|
|
2
|
if (@datasets) { |
1467
|
|
|
|
|
|
|
|
1468
|
|
|
|
|
|
|
# check for multiple comma-delimited datasets |
1469
|
1
|
|
|
|
|
1
|
my @list_to_check; |
1470
|
1
|
|
|
|
|
2
|
foreach my $item (@datasets) { |
1471
|
1
|
50
|
|
|
|
3
|
if ($item =~ /,/) { |
1472
|
|
|
|
|
|
|
# this one has a comma, therefore it has more than dataset |
1473
|
0
|
|
|
|
|
0
|
push @list_to_check, split(/,/, $item); |
1474
|
|
|
|
|
|
|
} |
1475
|
|
|
|
|
|
|
else { |
1476
|
|
|
|
|
|
|
# a singleton |
1477
|
1
|
|
|
|
|
4
|
push @list_to_check, $item; |
1478
|
|
|
|
|
|
|
} |
1479
|
|
|
|
|
|
|
} |
1480
|
|
|
|
|
|
|
|
1481
|
|
|
|
|
|
|
# now verify the datasets |
1482
|
1
|
|
|
|
|
2
|
foreach my $dataset (@list_to_check) { |
1483
|
|
|
|
|
|
|
|
1484
|
|
|
|
|
|
|
# check for a remote file |
1485
|
1
|
50
|
|
|
|
9
|
if ($dataset =~ /^(?: http | ftp) .+ \. (?: bam | bw | bb) $/xi) { |
|
|
50
|
|
|
|
|
|
1486
|
|
|
|
|
|
|
# a remote file |
1487
|
|
|
|
|
|
|
# assume it is good, no verification here though |
1488
|
|
|
|
|
|
|
# it will either work or won't work |
1489
|
0
|
|
|
|
|
0
|
push @good_datasets, $dataset; |
1490
|
|
|
|
|
|
|
} |
1491
|
|
|
|
|
|
|
|
1492
|
|
|
|
|
|
|
# a local file |
1493
|
|
|
|
|
|
|
elsif ($dataset =~ /\.(?:bam|bw|bigwig|bb|bigbed|useq)$/i) { |
1494
|
|
|
|
|
|
|
# presume we have a local indexed data file |
1495
|
|
|
|
|
|
|
|
1496
|
|
|
|
|
|
|
# user may have requested two or more files to be merged |
1497
|
|
|
|
|
|
|
# these should be combined with an ampersand |
1498
|
|
|
|
|
|
|
# check each one |
1499
|
1
|
|
|
|
|
2
|
my @files; |
1500
|
1
|
|
|
|
|
3
|
foreach my $file (split /\&/, $dataset) { |
1501
|
1
|
50
|
|
|
|
17
|
if (-e $file) { |
1502
|
|
|
|
|
|
|
# file exists |
1503
|
1
|
|
|
|
|
4
|
push @files, "file:$file"; |
1504
|
|
|
|
|
|
|
} |
1505
|
|
|
|
|
|
|
else { |
1506
|
|
|
|
|
|
|
# file doesn't exist! can't use this set of files |
1507
|
0
|
|
|
|
|
0
|
@files = (); |
1508
|
0
|
|
|
|
|
0
|
last; |
1509
|
|
|
|
|
|
|
} |
1510
|
|
|
|
|
|
|
} |
1511
|
1
|
50
|
|
|
|
3
|
if (@files) { |
1512
|
1
|
|
|
|
|
10
|
push @good_datasets, join("&", @files); |
1513
|
|
|
|
|
|
|
} |
1514
|
|
|
|
|
|
|
else { |
1515
|
0
|
|
|
|
|
0
|
push @bad_datasets, $dataset; |
1516
|
|
|
|
|
|
|
} |
1517
|
|
|
|
|
|
|
} |
1518
|
|
|
|
|
|
|
|
1519
|
|
|
|
|
|
|
# a feature type in a database |
1520
|
|
|
|
|
|
|
else { |
1521
|
|
|
|
|
|
|
# must be a database feature type |
1522
|
|
|
|
|
|
|
|
1523
|
|
|
|
|
|
|
# check for database features hash |
1524
|
0
|
0
|
|
|
|
0
|
unless (%db_features) { |
1525
|
0
|
0
|
|
|
|
0
|
if ($db) { |
1526
|
|
|
|
|
|
|
# collect the database features as needed |
1527
|
|
|
|
|
|
|
# I want both method:source and method in the hash |
1528
|
0
|
|
|
|
|
0
|
foreach my $type ( get_dataset_list($db) ) { |
1529
|
0
|
|
|
|
|
0
|
my ($method, $source) = split /:/, $type; |
1530
|
0
|
0
|
|
|
|
0
|
if ($source) { |
1531
|
0
|
|
|
|
|
0
|
$db_features{$type} = 1; |
1532
|
0
|
|
|
|
|
0
|
$db_features{$method} = 1; |
1533
|
|
|
|
|
|
|
} |
1534
|
|
|
|
|
|
|
else { |
1535
|
0
|
|
|
|
|
0
|
$db_features{$type} = 1; |
1536
|
|
|
|
|
|
|
} |
1537
|
|
|
|
|
|
|
} |
1538
|
|
|
|
|
|
|
|
1539
|
|
|
|
|
|
|
# verify |
1540
|
0
|
0
|
|
|
|
0
|
unless (%db_features) { |
1541
|
0
|
|
|
|
|
0
|
carp " provided database has no feature types " . |
1542
|
|
|
|
|
|
|
"to verify dataset(s) against!\n"; |
1543
|
|
|
|
|
|
|
} |
1544
|
|
|
|
|
|
|
} |
1545
|
|
|
|
|
|
|
else { |
1546
|
|
|
|
|
|
|
# we need a database |
1547
|
0
|
|
|
|
|
0
|
carp " unable to verify dataset without database"; |
1548
|
|
|
|
|
|
|
} |
1549
|
|
|
|
|
|
|
} |
1550
|
|
|
|
|
|
|
|
1551
|
|
|
|
|
|
|
# validate the given dataset |
1552
|
|
|
|
|
|
|
# user may have requested two or more features to be merged |
1553
|
|
|
|
|
|
|
# these should be combined with an ampersand |
1554
|
|
|
|
|
|
|
# check each one |
1555
|
0
|
|
|
|
|
0
|
my $check = 0; |
1556
|
0
|
|
|
|
|
0
|
foreach my $d (split /&/, $dataset) { |
1557
|
|
|
|
|
|
|
# validate this dataset |
1558
|
0
|
0
|
|
|
|
0
|
if (exists $db_features{$d}) { |
1559
|
0
|
|
|
|
|
0
|
$check++; |
1560
|
|
|
|
|
|
|
} |
1561
|
|
|
|
|
|
|
else { |
1562
|
|
|
|
|
|
|
# at least one of these is not good, fail check |
1563
|
0
|
|
|
|
|
0
|
$check = 0; |
1564
|
0
|
|
|
|
|
0
|
last; |
1565
|
|
|
|
|
|
|
} |
1566
|
|
|
|
|
|
|
} |
1567
|
|
|
|
|
|
|
|
1568
|
|
|
|
|
|
|
# report the verification |
1569
|
0
|
0
|
|
|
|
0
|
if ($check) { |
1570
|
0
|
|
|
|
|
0
|
push @good_datasets, $dataset; |
1571
|
|
|
|
|
|
|
} |
1572
|
|
|
|
|
|
|
else { |
1573
|
0
|
|
|
|
|
0
|
push @bad_datasets, $dataset; |
1574
|
|
|
|
|
|
|
} |
1575
|
|
|
|
|
|
|
} |
1576
|
|
|
|
|
|
|
} |
1577
|
|
|
|
|
|
|
} |
1578
|
|
|
|
|
|
|
|
1579
|
|
|
|
|
|
|
# User must select datasets |
1580
|
|
|
|
|
|
|
else { |
1581
|
|
|
|
|
|
|
# dataset not specified |
1582
|
|
|
|
|
|
|
# present the dataset list to the user and get an answer |
1583
|
|
|
|
|
|
|
|
1584
|
|
|
|
|
|
|
# get the dataset list |
1585
|
0
|
0
|
|
|
|
0
|
if ($db) { |
1586
|
|
|
|
|
|
|
# collect the database features as needed |
1587
|
0
|
|
|
|
|
0
|
my $i = 1; |
1588
|
0
|
|
|
|
|
0
|
foreach my $type ( get_dataset_list($db) ) { |
1589
|
0
|
0
|
|
|
|
0
|
if ($limit) { |
1590
|
0
|
|
|
|
|
0
|
my ($p, $s) = split /:/, $type; # split into primary_tag and source |
1591
|
|
|
|
|
|
|
# only keep those types that match our limiter |
1592
|
0
|
0
|
|
|
|
0
|
next unless $p =~ /$limit/i; |
1593
|
|
|
|
|
|
|
} |
1594
|
0
|
|
|
|
|
0
|
$db_features{$i} = $type; |
1595
|
0
|
|
|
|
|
0
|
$i++; |
1596
|
|
|
|
|
|
|
} |
1597
|
|
|
|
|
|
|
|
1598
|
|
|
|
|
|
|
# verify |
1599
|
0
|
0
|
|
|
|
0
|
unless (%db_features) { |
1600
|
0
|
|
|
|
|
0
|
carp " provided database has no feature types " . |
1601
|
|
|
|
|
|
|
"to collect!\n"; |
1602
|
0
|
|
|
|
|
0
|
return; |
1603
|
|
|
|
|
|
|
} |
1604
|
|
|
|
|
|
|
} |
1605
|
|
|
|
|
|
|
else { |
1606
|
|
|
|
|
|
|
# we need a database |
1607
|
0
|
|
|
|
|
0
|
carp " no database provided from which to collect features!\n"; |
1608
|
0
|
|
|
|
|
0
|
return; |
1609
|
|
|
|
|
|
|
} |
1610
|
|
|
|
|
|
|
|
1611
|
|
|
|
|
|
|
# present the list |
1612
|
0
|
|
|
|
|
0
|
print "\n These are the available data sets in the database:\n"; |
1613
|
0
|
|
|
|
|
0
|
foreach (sort {$a <=> $b} keys %db_features) { |
|
0
|
|
|
|
|
0
|
|
1614
|
|
|
|
|
|
|
# print out the list of microarray data sets |
1615
|
0
|
|
|
|
|
0
|
print " $_\t$db_features{$_}\n"; |
1616
|
|
|
|
|
|
|
} |
1617
|
|
|
|
|
|
|
|
1618
|
|
|
|
|
|
|
# prompt the user |
1619
|
0
|
|
0
|
|
|
0
|
$args{'prompt'} ||= undef; |
1620
|
0
|
0
|
|
|
|
0
|
if ($args{'prompt'}) { |
1621
|
|
|
|
|
|
|
# provided custom prompt |
1622
|
0
|
|
|
|
|
0
|
print $args{'prompt'}; |
1623
|
|
|
|
|
|
|
} |
1624
|
|
|
|
|
|
|
else { |
1625
|
|
|
|
|
|
|
# generic prompt |
1626
|
0
|
0
|
|
|
|
0
|
if ($args{'single'}) { |
1627
|
0
|
|
|
|
|
0
|
print " Enter one number for the data set or feature "; |
1628
|
|
|
|
|
|
|
} |
1629
|
|
|
|
|
|
|
else { |
1630
|
0
|
|
|
|
|
0
|
print " Enter the number(s) or range of the data set(s) or" . |
1631
|
|
|
|
|
|
|
" feature(s) "; |
1632
|
|
|
|
|
|
|
} |
1633
|
|
|
|
|
|
|
} |
1634
|
|
|
|
|
|
|
|
1635
|
|
|
|
|
|
|
# get answer from the user |
1636
|
0
|
|
|
|
|
0
|
my $answer = ; |
1637
|
0
|
|
|
|
|
0
|
chomp $answer; |
1638
|
0
|
|
|
|
|
0
|
my @answer_list = parse_list($answer); |
1639
|
|
|
|
|
|
|
|
1640
|
|
|
|
|
|
|
# take the first one if requested |
1641
|
0
|
0
|
|
|
|
0
|
if ($args{'single'}) { |
1642
|
0
|
0
|
|
|
|
0
|
unless (scalar @answer_list == 1) { |
1643
|
0
|
|
|
|
|
0
|
splice(@answer_list, 1); |
1644
|
|
|
|
|
|
|
} |
1645
|
|
|
|
|
|
|
} |
1646
|
|
|
|
|
|
|
|
1647
|
|
|
|
|
|
|
# verify the answer list |
1648
|
0
|
|
|
|
|
0
|
foreach my $answer (@answer_list) { |
1649
|
|
|
|
|
|
|
|
1650
|
|
|
|
|
|
|
# check for merged datasets |
1651
|
0
|
0
|
|
|
|
0
|
if ($answer =~ /&/) { |
1652
|
|
|
|
|
|
|
# a merged dataset |
1653
|
0
|
|
|
|
|
0
|
my @list = split /&/, $answer; |
1654
|
0
|
|
|
|
|
0
|
my $check = 1; |
1655
|
|
|
|
|
|
|
|
1656
|
|
|
|
|
|
|
# check all are good |
1657
|
0
|
|
|
|
|
0
|
foreach (@list) { |
1658
|
0
|
0
|
|
|
|
0
|
unless (exists $db_features{$_}) { |
1659
|
0
|
|
|
|
|
0
|
$check = 0; |
1660
|
|
|
|
|
|
|
} |
1661
|
|
|
|
|
|
|
} |
1662
|
|
|
|
|
|
|
|
1663
|
|
|
|
|
|
|
# if all are good |
1664
|
0
|
0
|
|
|
|
0
|
if ($check) { |
1665
|
|
|
|
|
|
|
push @good_datasets, |
1666
|
0
|
|
|
|
|
0
|
join( "&", map { $db_features{$_} } @list); |
|
0
|
|
|
|
|
0
|
|
1667
|
|
|
|
|
|
|
} |
1668
|
|
|
|
|
|
|
else { |
1669
|
0
|
|
|
|
|
0
|
push @bad_datasets, $answer; |
1670
|
|
|
|
|
|
|
} |
1671
|
|
|
|
|
|
|
} |
1672
|
|
|
|
|
|
|
|
1673
|
|
|
|
|
|
|
else { |
1674
|
|
|
|
|
|
|
# a single dataset |
1675
|
|
|
|
|
|
|
# check if it is good |
1676
|
|
|
|
|
|
|
|
1677
|
0
|
0
|
|
|
|
0
|
if (exists $db_features{$answer}) { |
1678
|
0
|
|
|
|
|
0
|
push @good_datasets, $db_features{$answer}; |
1679
|
|
|
|
|
|
|
} |
1680
|
|
|
|
|
|
|
else { |
1681
|
0
|
|
|
|
|
0
|
push @bad_datasets, $db_features{$answer}; |
1682
|
|
|
|
|
|
|
} |
1683
|
|
|
|
|
|
|
} |
1684
|
|
|
|
|
|
|
} |
1685
|
|
|
|
|
|
|
} |
1686
|
|
|
|
|
|
|
|
1687
|
|
|
|
|
|
|
# Print bad results |
1688
|
1
|
50
|
|
|
|
3
|
if (@bad_datasets) { |
1689
|
0
|
|
|
|
|
0
|
print " The following datasets could not be verified:\n"; |
1690
|
0
|
|
|
|
|
0
|
foreach (@bad_datasets) { |
1691
|
0
|
|
|
|
|
0
|
print " $_\n"; |
1692
|
|
|
|
|
|
|
} |
1693
|
|
|
|
|
|
|
} |
1694
|
|
|
|
|
|
|
|
1695
|
|
|
|
|
|
|
# Return good results |
1696
|
1
|
50
|
|
|
|
3
|
if ($args{'single'}) { |
1697
|
0
|
|
|
|
|
0
|
return $good_datasets[0]; |
1698
|
|
|
|
|
|
|
} |
1699
|
|
|
|
|
|
|
else { |
1700
|
1
|
|
|
|
|
4
|
return @good_datasets; |
1701
|
|
|
|
|
|
|
} |
1702
|
|
|
|
|
|
|
} |
1703
|
|
|
|
|
|
|
|
1704
|
|
|
|
|
|
|
|
1705
|
|
|
|
|
|
|
### Process and verify a dataset |
1706
|
|
|
|
|
|
|
sub check_dataset_for_rpm_support { |
1707
|
|
|
|
|
|
|
|
1708
|
|
|
|
|
|
|
# get passed dataset and databases |
1709
|
0
|
|
|
0
|
1
|
0
|
my $dataset = shift; |
1710
|
0
|
|
|
|
|
0
|
my $cpu = shift; |
1711
|
|
|
|
|
|
|
|
1712
|
|
|
|
|
|
|
# Calculate the total number of reads |
1713
|
|
|
|
|
|
|
# this uses the global variable $rpkm_read_sum |
1714
|
0
|
|
|
|
|
0
|
my $rpm_read_sum; |
1715
|
|
|
|
|
|
|
|
1716
|
0
|
0
|
|
|
|
0
|
if (exists $TOTAL_READ_NUMBER{$dataset}) { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
1717
|
|
|
|
|
|
|
# this dataset has already been summed |
1718
|
|
|
|
|
|
|
# no need to do it again |
1719
|
0
|
|
|
|
|
0
|
$rpm_read_sum = $TOTAL_READ_NUMBER{$dataset}; |
1720
|
|
|
|
|
|
|
} |
1721
|
|
|
|
|
|
|
|
1722
|
|
|
|
|
|
|
elsif ($dataset =~ /\.bam$/) { |
1723
|
|
|
|
|
|
|
# a bam file dataset |
1724
|
|
|
|
|
|
|
|
1725
|
0
|
0
|
|
|
|
0
|
_load_bam_helper_module() unless $BAM_OK; |
1726
|
0
|
0
|
|
|
|
0
|
if ($BAM_OK) { |
1727
|
|
|
|
|
|
|
# Bio::ToolBox::db_helper::bam was loaded ok |
1728
|
|
|
|
|
|
|
# sum the number of reads in the dataset |
1729
|
0
|
|
|
|
|
0
|
$rpm_read_sum = sum_total_bam_alignments($dataset, 0, 0, $cpu); |
1730
|
|
|
|
|
|
|
} |
1731
|
|
|
|
|
|
|
else { |
1732
|
0
|
|
|
|
|
0
|
carp " Bam support is not available! " . |
1733
|
|
|
|
|
|
|
"Is Bio::DB::Sam or Bio::DB::HTS installed?\n"; |
1734
|
0
|
|
|
|
|
0
|
return; |
1735
|
|
|
|
|
|
|
} |
1736
|
|
|
|
|
|
|
} |
1737
|
|
|
|
|
|
|
|
1738
|
|
|
|
|
|
|
elsif ($dataset =~ /\.bb$/) { |
1739
|
|
|
|
|
|
|
# a bigbed file dataset |
1740
|
|
|
|
|
|
|
|
1741
|
0
|
0
|
|
|
|
0
|
$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed') unless |
1742
|
|
|
|
|
|
|
$BIGBED_OK; |
1743
|
0
|
0
|
|
|
|
0
|
if ($BIGBED_OK) { |
1744
|
|
|
|
|
|
|
# Bio::ToolBox::db_helper::bigbed was loaded ok |
1745
|
|
|
|
|
|
|
# sum the number of features in the dataset |
1746
|
0
|
|
|
|
|
0
|
$rpm_read_sum = sum_total_bigbed_features($dataset); |
1747
|
|
|
|
|
|
|
} |
1748
|
|
|
|
|
|
|
else { |
1749
|
0
|
|
|
|
|
0
|
carp " BigBed support is not available! " . |
1750
|
|
|
|
|
|
|
"Is Bio::DB::BigBed installed?\n"; |
1751
|
0
|
|
|
|
|
0
|
return; |
1752
|
|
|
|
|
|
|
} |
1753
|
|
|
|
|
|
|
} |
1754
|
|
|
|
|
|
|
|
1755
|
|
|
|
|
|
|
else { |
1756
|
|
|
|
|
|
|
# some other non-supported dataset |
1757
|
0
|
|
|
|
|
0
|
return; |
1758
|
|
|
|
|
|
|
} |
1759
|
|
|
|
|
|
|
|
1760
|
|
|
|
|
|
|
# return the sum value if we've made it this far |
1761
|
0
|
|
|
|
|
0
|
$TOTAL_READ_NUMBER{$dataset} = $rpm_read_sum; |
1762
|
0
|
|
|
|
|
0
|
return $rpm_read_sum; |
1763
|
|
|
|
|
|
|
} |
1764
|
|
|
|
|
|
|
|
1765
|
|
|
|
|
|
|
|
1766
|
|
|
|
|
|
|
### Generate a new list of features |
1767
|
|
|
|
|
|
|
sub get_new_feature_list { |
1768
|
|
|
|
|
|
|
|
1769
|
|
|
|
|
|
|
# Retrieve passed values |
1770
|
0
|
|
|
0
|
1
|
0
|
my %args = @_; # the passed argument values |
1771
|
|
|
|
|
|
|
|
1772
|
|
|
|
|
|
|
# check data object |
1773
|
0
|
|
0
|
|
|
0
|
my $data = $args{data} || undef; |
1774
|
0
|
0
|
|
|
|
0
|
unless ($data) { |
1775
|
0
|
|
|
|
|
0
|
confess "must pass a 'data' key and Bio::ToolBox::Data object!"; |
1776
|
0
|
|
|
|
|
0
|
return; |
1777
|
|
|
|
|
|
|
} |
1778
|
0
|
0
|
|
|
|
0
|
unless (ref($data) eq 'Bio::ToolBox::Data') { |
1779
|
0
|
|
|
|
|
0
|
confess 'must pass a Bio::ToolBox::Data object!'; |
1780
|
0
|
|
|
|
|
0
|
return; |
1781
|
|
|
|
|
|
|
} |
1782
|
|
|
|
|
|
|
|
1783
|
|
|
|
|
|
|
# Open a db connection |
1784
|
0
|
|
0
|
|
|
0
|
$args{'db'} ||= undef; |
1785
|
0
|
|
|
|
|
0
|
my $db = open_db_connection($args{'db'}); |
1786
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1787
|
0
|
|
|
|
|
0
|
carp 'no database connected!'; |
1788
|
0
|
|
|
|
|
0
|
return; |
1789
|
|
|
|
|
|
|
} |
1790
|
|
|
|
|
|
|
|
1791
|
|
|
|
|
|
|
# Verify a SeqFeature::Store database |
1792
|
0
|
|
|
|
|
0
|
my $db_ref = ref $db; |
1793
|
0
|
0
|
|
|
|
0
|
unless ($db_ref =~ /^Bio::DB::SeqFeature::Store/) { |
1794
|
0
|
|
|
|
|
0
|
carp "Database type $db_ref doesn't support generating feature lists!\n"; |
1795
|
0
|
|
|
|
|
0
|
return; |
1796
|
|
|
|
|
|
|
} |
1797
|
|
|
|
|
|
|
|
1798
|
|
|
|
|
|
|
# Features to search for |
1799
|
0
|
|
0
|
|
|
0
|
my $searchFeature = $args{'feature'} || $args{'features'} || undef; |
1800
|
0
|
0
|
|
|
|
0
|
unless ($searchFeature) { |
1801
|
0
|
|
|
|
|
0
|
carp "no search feature types passed!"; |
1802
|
0
|
|
|
|
|
0
|
return; |
1803
|
|
|
|
|
|
|
} |
1804
|
0
|
|
|
|
|
0
|
my @classes = split(',', $searchFeature); # it may or may not be a list |
1805
|
|
|
|
|
|
|
|
1806
|
|
|
|
|
|
|
# chromosomes to skip |
1807
|
0
|
|
0
|
|
|
0
|
my $chr_exclude = $args{'chrskip'} || undef; |
1808
|
|
|
|
|
|
|
|
1809
|
|
|
|
|
|
|
# Add table columns |
1810
|
0
|
|
|
|
|
0
|
my $pid_i = $data->add_column('Primary_ID'); |
1811
|
0
|
|
|
|
|
0
|
my $name_i = $data->add_column('Name'); |
1812
|
0
|
|
|
|
|
0
|
my $type_i = $data->add_column('Type'); |
1813
|
|
|
|
|
|
|
|
1814
|
|
|
|
|
|
|
# Set up the database iterator |
1815
|
0
|
|
|
|
|
0
|
print " Searching for $searchFeature features\n"; |
1816
|
0
|
0
|
|
|
|
0
|
my $iterator = $db->get_seq_stream( |
1817
|
|
|
|
|
|
|
-types => scalar @classes ? \@classes : $searchFeature, |
1818
|
|
|
|
|
|
|
); |
1819
|
0
|
0
|
|
|
|
0
|
unless ($iterator) { |
1820
|
|
|
|
|
|
|
# there should be some features found in the database |
1821
|
0
|
|
|
|
|
0
|
carp "could not get feature iterator for database"; |
1822
|
0
|
|
|
|
|
0
|
return; |
1823
|
|
|
|
|
|
|
} |
1824
|
|
|
|
|
|
|
|
1825
|
|
|
|
|
|
|
# Walk through the collected features |
1826
|
0
|
|
|
|
|
0
|
my $total_count = 0; # total found features |
1827
|
0
|
|
|
|
|
0
|
while (my $feature = $iterator->next_seq) { |
1828
|
0
|
|
|
|
|
0
|
$total_count++; |
1829
|
|
|
|
|
|
|
|
1830
|
|
|
|
|
|
|
# skip genes from excluded chromosomes |
1831
|
0
|
0
|
0
|
|
|
0
|
next if (defined $chr_exclude and $feature->seq_id =~ $chr_exclude); |
1832
|
|
|
|
|
|
|
|
1833
|
|
|
|
|
|
|
# Record the feature information |
1834
|
|
|
|
|
|
|
# in the B::DB::SF::S database, the primary_ID is a number unique to the |
1835
|
|
|
|
|
|
|
# the specific database, and is not portable between databases |
1836
|
0
|
|
|
|
|
0
|
$data->add_row( [ |
1837
|
|
|
|
|
|
|
$feature->primary_id, |
1838
|
|
|
|
|
|
|
$feature->display_name, |
1839
|
|
|
|
|
|
|
$feature->type, |
1840
|
|
|
|
|
|
|
] ); |
1841
|
|
|
|
|
|
|
} |
1842
|
|
|
|
|
|
|
|
1843
|
|
|
|
|
|
|
# print result of search |
1844
|
0
|
|
|
|
|
0
|
printf " Found %s features in the database\n", format_with_commas($total_count); |
1845
|
0
|
|
|
|
|
0
|
printf " Kept %s features.\n", format_with_commas($data->last_row); |
1846
|
|
|
|
|
|
|
|
1847
|
|
|
|
|
|
|
# return the new data structure |
1848
|
0
|
|
|
|
|
0
|
return 1; |
1849
|
|
|
|
|
|
|
} |
1850
|
|
|
|
|
|
|
|
1851
|
|
|
|
|
|
|
|
1852
|
|
|
|
|
|
|
### Generate a new list genomic windows |
1853
|
|
|
|
|
|
|
sub get_new_genome_list { |
1854
|
|
|
|
|
|
|
|
1855
|
|
|
|
|
|
|
# Collect the passed arguments |
1856
|
1
|
|
|
1
|
1
|
5
|
my %args = @_; |
1857
|
|
|
|
|
|
|
|
1858
|
|
|
|
|
|
|
# check data object |
1859
|
1
|
|
50
|
|
|
3
|
my $data = $args{data} || undef; |
1860
|
1
|
50
|
|
|
|
4
|
unless ($data) { |
1861
|
0
|
|
|
|
|
0
|
confess "must pass a 'data' key and Bio::ToolBox::Data object!"; |
1862
|
|
|
|
|
|
|
} |
1863
|
1
|
50
|
|
|
|
3
|
unless (ref($data) eq 'Bio::ToolBox::Data') { |
1864
|
0
|
|
|
|
|
0
|
confess 'must pass a Bio::ToolBox::Data object!'; |
1865
|
|
|
|
|
|
|
} |
1866
|
|
|
|
|
|
|
|
1867
|
|
|
|
|
|
|
# Open a db connection |
1868
|
1
|
|
50
|
|
|
3
|
$args{'db'} ||= undef; |
1869
|
1
|
|
|
|
|
3
|
my $db = open_db_connection($args{'db'}); |
1870
|
1
|
50
|
|
|
|
3
|
unless ($db) { |
1871
|
0
|
|
|
|
|
0
|
carp 'no database connected!'; |
1872
|
0
|
|
|
|
|
0
|
return; |
1873
|
|
|
|
|
|
|
} |
1874
|
|
|
|
|
|
|
|
1875
|
|
|
|
|
|
|
# Determine win and step sizes |
1876
|
1
|
|
50
|
|
|
3
|
$args{'win'} ||= 500; # hard coded default size |
1877
|
1
|
|
33
|
|
|
5
|
$args{'step'} ||= $args{'win'}; |
1878
|
|
|
|
|
|
|
|
1879
|
|
|
|
|
|
|
|
1880
|
|
|
|
|
|
|
# Prepare data structures |
1881
|
1
|
|
|
|
|
5
|
my $chr_i = $data->add_column('Chromosome'); |
1882
|
1
|
|
|
|
|
3
|
my $start_i = $data->add_column('Start'); |
1883
|
1
|
|
|
|
|
66
|
my $stop_i = $data->add_column('Stop'); |
1884
|
1
|
|
|
|
|
7
|
$data->metadata($start_i, 'win', $args{'win'}); |
1885
|
1
|
|
|
|
|
3
|
$data->metadata($start_i, 'step', $args{'step'}); |
1886
|
|
|
|
|
|
|
|
1887
|
|
|
|
|
|
|
# Collect the chromosomes |
1888
|
1
|
|
50
|
|
|
6
|
my $chr_exclude = $args{'skip'} || $args{'chrskip'} || undef; |
1889
|
1
|
|
|
|
|
5
|
my @chromosomes = get_chromosome_list($db, $chr_exclude); |
1890
|
1
|
50
|
|
|
|
36
|
unless (@chromosomes) { |
1891
|
0
|
|
|
|
|
0
|
carp " no sequence IDs were found in the database!\n"; |
1892
|
0
|
|
|
|
|
0
|
return; |
1893
|
|
|
|
|
|
|
} |
1894
|
|
|
|
|
|
|
|
1895
|
|
|
|
|
|
|
# Load exclusion intervals |
1896
|
1
|
|
|
|
|
2
|
my %exclusion_tree; |
1897
|
1
|
|
50
|
|
|
7
|
$args{exclude} ||= $args{blacklist} || undef; |
|
|
|
33
|
|
|
|
|
1898
|
1
|
50
|
33
|
|
|
6
|
if (exists $args{exclude} and defined $args{exclude}) { |
1899
|
0
|
0
|
|
|
|
0
|
if (ref($args{exclude}) eq 'Bio::ToolBox::Data') { |
1900
|
0
|
0
|
|
|
|
0
|
if (_load_helper_module('Set::IntervalTree')) { |
1901
|
|
|
|
|
|
|
# iterate through the data object of intervals |
1902
|
|
|
|
|
|
|
# prepare an Interval Tree for each chromosome |
1903
|
|
|
|
|
|
|
# and load the exclusion interval into it |
1904
|
|
|
|
|
|
|
$args{exclude}->iterate( sub { |
1905
|
0
|
|
|
0
|
|
0
|
my $row = shift; |
1906
|
0
|
0
|
|
|
|
0
|
unless (exists $exclusion_tree{$row->seq_id} ) { |
1907
|
0
|
|
|
|
|
0
|
$exclusion_tree{ $row->seq_id } = Set::IntervalTree->new(); |
1908
|
|
|
|
|
|
|
} |
1909
|
0
|
|
|
|
|
0
|
$exclusion_tree{ $row->seq_id }->insert(1, $row->start - 1, $row->end); |
1910
|
0
|
|
|
|
|
0
|
} ); |
1911
|
|
|
|
|
|
|
} |
1912
|
|
|
|
|
|
|
else { |
1913
|
0
|
|
|
|
|
0
|
carp " Set::IntervalTree must be installed to use exclusion intervals!"; |
1914
|
|
|
|
|
|
|
} |
1915
|
|
|
|
|
|
|
} |
1916
|
|
|
|
|
|
|
else { |
1917
|
0
|
|
|
|
|
0
|
confess " Exclusion data must be a Bio::ToolBox::Data object!"; |
1918
|
|
|
|
|
|
|
} |
1919
|
|
|
|
|
|
|
} |
1920
|
|
|
|
|
|
|
|
1921
|
|
|
|
|
|
|
# Collect the genomic windows |
1922
|
1
|
|
|
|
|
31
|
print " Generating $args{win} bp windows in $args{step} bp increments\n"; |
1923
|
1
|
|
|
|
|
6
|
foreach (@chromosomes) { |
1924
|
|
|
|
|
|
|
|
1925
|
|
|
|
|
|
|
# name and length as sub-array in each element |
1926
|
1
|
|
|
|
|
1
|
my ($chr, $length) = @{$_}; |
|
1
|
|
|
|
|
3
|
|
1927
|
1
|
|
50
|
|
|
5
|
my $Tree = $exclusion_tree{$chr} || undef; |
1928
|
|
|
|
|
|
|
|
1929
|
|
|
|
|
|
|
# iterate through chromosome |
1930
|
1
|
|
|
|
|
4
|
for (my $start = 1; $start <= $length; $start += $args{step}) { |
1931
|
122
|
|
|
|
|
127
|
my $end = $start + $args{win} - 1; |
1932
|
122
|
100
|
|
|
|
143
|
if ($end > $length) { |
1933
|
|
|
|
|
|
|
# fix end to the length of the chromosome |
1934
|
1
|
|
|
|
|
2
|
$end = $length; |
1935
|
|
|
|
|
|
|
} |
1936
|
|
|
|
|
|
|
# skip if excluded |
1937
|
122
|
50
|
33
|
|
|
152
|
next if ($Tree and scalar( @{ $Tree->fetch($start - 1, $end) } ) >= 1); |
|
0
|
|
|
|
|
0
|
|
1938
|
122
|
|
|
|
|
184
|
$data->add_row( [ $chr, $start, $end] ); |
1939
|
|
|
|
|
|
|
} |
1940
|
|
|
|
|
|
|
} |
1941
|
1
|
|
|
|
|
27
|
print " Kept " . $data->{'last_row'} . " windows.\n"; |
1942
|
|
|
|
|
|
|
|
1943
|
|
|
|
|
|
|
# Return the data structure |
1944
|
1
|
|
|
|
|
9
|
return 1; |
1945
|
|
|
|
|
|
|
} |
1946
|
|
|
|
|
|
|
|
1947
|
|
|
|
|
|
|
|
1948
|
|
|
|
|
|
|
sub get_db_feature { |
1949
|
0
|
|
|
0
|
1
|
0
|
my %args = @_; |
1950
|
|
|
|
|
|
|
|
1951
|
|
|
|
|
|
|
# Open a db connection |
1952
|
0
|
|
0
|
|
|
0
|
$args{db} ||= undef; |
1953
|
0
|
0
|
|
|
|
0
|
unless ($args{db}) { |
1954
|
0
|
|
|
|
|
0
|
croak 'no database provided for getting a feature!'; |
1955
|
|
|
|
|
|
|
} |
1956
|
0
|
|
|
|
|
0
|
my $db = open_db_connection($args{db}); |
1957
|
0
|
|
|
|
|
0
|
my $db_ref = ref $db; |
1958
|
|
|
|
|
|
|
|
1959
|
|
|
|
|
|
|
# get the name of the feature |
1960
|
0
|
|
0
|
|
|
0
|
my $name = $args{'name'} || undef; |
1961
|
|
|
|
|
|
|
|
1962
|
|
|
|
|
|
|
# check for values and internal nulls |
1963
|
0
|
0
|
|
|
|
0
|
$args{'id'} = exists $args{'id'} ? $args{'id'} : undef; |
1964
|
0
|
|
0
|
|
|
0
|
$args{'type'} ||= undef; |
1965
|
0
|
0
|
0
|
|
|
0
|
undef $name if $name and $name eq '.'; |
1966
|
0
|
0
|
0
|
|
|
0
|
undef $args{'id'} if $args{'id'} and $args{'id'} eq '.'; |
1967
|
0
|
0
|
0
|
|
|
0
|
undef $args{'type'} if $args{'type'} and $args{'type'} eq '.'; |
1968
|
|
|
|
|
|
|
|
1969
|
|
|
|
|
|
|
# quick method for feature retrieval |
1970
|
0
|
0
|
0
|
|
|
0
|
if (defined $args{'id'} and $db->can('fetch')) { |
1971
|
|
|
|
|
|
|
# we have a unique primary_id to identify the feature |
1972
|
|
|
|
|
|
|
# usually this pulls out the feature directly |
1973
|
0
|
|
0
|
|
|
0
|
my $feature = $db->fetch($args{'id'}) || undef; |
1974
|
|
|
|
|
|
|
|
1975
|
|
|
|
|
|
|
# check that the feature we pulled out is what we want |
1976
|
0
|
0
|
|
|
|
0
|
my $check = $feature ? 1 : 0; |
1977
|
0
|
0
|
|
|
|
0
|
if ($check) { |
1978
|
0
|
0
|
0
|
|
|
0
|
$check = 0 if (defined $name and $feature->display_name ne $name); |
1979
|
0
|
0
|
0
|
|
|
0
|
$check = 0 if (defined $args{'type'} and $feature->type ne $args{'type'}); |
1980
|
|
|
|
|
|
|
} |
1981
|
|
|
|
|
|
|
|
1982
|
|
|
|
|
|
|
# return if things match up as best we can tell |
1983
|
0
|
0
|
|
|
|
0
|
if ($check) { |
1984
|
0
|
|
|
|
|
0
|
return $feature; |
1985
|
|
|
|
|
|
|
} |
1986
|
|
|
|
|
|
|
else { |
1987
|
|
|
|
|
|
|
# the primary_ids are out of date |
1988
|
0
|
0
|
|
|
|
0
|
unless ($PRIMARY_ID_WARNING) { |
1989
|
0
|
|
|
|
|
0
|
warn "CAUTION: Some primary IDs in Input file appear to be out of date\n"; |
1990
|
0
|
|
|
|
|
0
|
$PRIMARY_ID_WARNING++; |
1991
|
|
|
|
|
|
|
} |
1992
|
|
|
|
|
|
|
} |
1993
|
|
|
|
|
|
|
} |
1994
|
|
|
|
|
|
|
|
1995
|
|
|
|
|
|
|
# otherwise use name and type |
1996
|
0
|
0
|
|
|
|
0
|
return unless $name; # otherwise db will return all features! Not what we want! |
1997
|
|
|
|
|
|
|
my @features = $db->features( |
1998
|
|
|
|
|
|
|
-name => $name, # we assume this name will be unique |
1999
|
|
|
|
|
|
|
-aliases => 0, |
2000
|
0
|
|
|
|
|
0
|
-type => $args{'type'}, |
2001
|
|
|
|
|
|
|
); |
2002
|
|
|
|
|
|
|
|
2003
|
|
|
|
|
|
|
# if none are found |
2004
|
0
|
0
|
|
|
|
0
|
unless (@features) { |
2005
|
|
|
|
|
|
|
# try again with aliases enabled |
2006
|
|
|
|
|
|
|
@features = $db->features( |
2007
|
|
|
|
|
|
|
-name => $name, |
2008
|
|
|
|
|
|
|
-aliases => 1, |
2009
|
0
|
|
|
|
|
0
|
-type => $args{'type'}, |
2010
|
|
|
|
|
|
|
); |
2011
|
|
|
|
|
|
|
} |
2012
|
0
|
0
|
0
|
|
|
0
|
unless (@features and $name =~ /[;,\|]/) { |
2013
|
|
|
|
|
|
|
# I used to append aliases to the end of the name in old versions of biotoolbox |
2014
|
|
|
|
|
|
|
# crazy, I know, but just in case this happened, let's try breaking them apart |
2015
|
0
|
|
|
|
|
0
|
my $name2 = (split(/\s*[;,\|]\s*/, $name))[0]; |
2016
|
|
|
|
|
|
|
# multiple names present using common delimiters ;,| |
2017
|
|
|
|
|
|
|
# take the first name only, assume others are aliases that we don't need |
2018
|
|
|
|
|
|
|
@features = $db->features( |
2019
|
|
|
|
|
|
|
-name => $name2, |
2020
|
|
|
|
|
|
|
-aliases => 1, |
2021
|
0
|
|
|
|
|
0
|
-type => $args{'type'}, |
2022
|
|
|
|
|
|
|
); |
2023
|
|
|
|
|
|
|
} |
2024
|
|
|
|
|
|
|
|
2025
|
|
|
|
|
|
|
# check the number of features returned |
2026
|
0
|
0
|
|
|
|
0
|
if (scalar @features > 1) { |
|
|
0
|
|
|
|
|
|
2027
|
|
|
|
|
|
|
# there should only be one feature found |
2028
|
|
|
|
|
|
|
# if more are returned, there's redundant or duplicated data in the db |
2029
|
|
|
|
|
|
|
|
2030
|
|
|
|
|
|
|
# first check whether addition of aliases may improve accuracy |
2031
|
0
|
0
|
|
|
|
0
|
if ($args{'name'} =~ /;/) { |
2032
|
|
|
|
|
|
|
# we have presumed aliases in the name value |
2033
|
0
|
|
|
|
|
0
|
my $check = $args{'name'}; |
2034
|
0
|
|
|
|
|
0
|
$check =~ s/\s*;\s*/;/g; # strip spaces just in case |
2035
|
|
|
|
|
|
|
|
2036
|
|
|
|
|
|
|
# check each feature and try to match name and aliases |
2037
|
0
|
|
|
|
|
0
|
my @candidates; |
2038
|
0
|
|
|
|
|
0
|
foreach my $f (@features) { |
2039
|
0
|
|
|
|
|
0
|
my $f_name = join(';', $f->display_name, ($f->get_tag_values('Alias'))); |
2040
|
0
|
0
|
|
|
|
0
|
if ($check eq $f_name) { |
2041
|
|
|
|
|
|
|
# found one |
2042
|
0
|
|
|
|
|
0
|
push @candidates, $f; |
2043
|
|
|
|
|
|
|
} |
2044
|
|
|
|
|
|
|
} |
2045
|
|
|
|
|
|
|
|
2046
|
0
|
0
|
|
|
|
0
|
if (scalar @candidates == 1) { |
|
|
0
|
|
|
|
|
|
2047
|
|
|
|
|
|
|
# we found it! |
2048
|
0
|
|
|
|
|
0
|
return shift @candidates; |
2049
|
|
|
|
|
|
|
} |
2050
|
|
|
|
|
|
|
elsif (scalar @candidates > 1) { |
2051
|
|
|
|
|
|
|
# hopefully we've improved a little bit? |
2052
|
0
|
|
|
|
|
0
|
@features = @candidates; |
2053
|
|
|
|
|
|
|
} |
2054
|
|
|
|
|
|
|
} |
2055
|
|
|
|
|
|
|
|
2056
|
|
|
|
|
|
|
# warn the user, this should be fixed |
2057
|
|
|
|
|
|
|
printf " Found %s %s features named '$name' in the database! Using first one.\n", |
2058
|
0
|
|
|
|
|
0
|
scalar(@features), $args{'type'}; |
2059
|
|
|
|
|
|
|
} |
2060
|
|
|
|
|
|
|
elsif (!@features) { |
2061
|
0
|
|
|
|
|
0
|
printf " Found no %s features named '$name' in the database!\n", $args{'type'}; |
2062
|
0
|
|
|
|
|
0
|
return; |
2063
|
|
|
|
|
|
|
} |
2064
|
|
|
|
|
|
|
|
2065
|
|
|
|
|
|
|
# done |
2066
|
0
|
|
|
|
|
0
|
return shift @features; |
2067
|
|
|
|
|
|
|
} |
2068
|
|
|
|
|
|
|
|
2069
|
|
|
|
|
|
|
|
2070
|
|
|
|
|
|
|
### Get a dataset score for a single region |
2071
|
|
|
|
|
|
|
sub get_segment_score { |
2072
|
|
|
|
|
|
|
# parameters passed as an array |
2073
|
|
|
|
|
|
|
# we will be passing this array on as a reference to the appropriate |
2074
|
|
|
|
|
|
|
# imported helper subroutine |
2075
|
|
|
|
|
|
|
# chromosome, start, stop, strand, strandedness, method, return type, db, dataset |
2076
|
12
|
50
|
|
12
|
1
|
23
|
confess "incorrect number of parameters passed!" unless scalar @_ == 9; |
2077
|
|
|
|
|
|
|
|
2078
|
|
|
|
|
|
|
# check the database |
2079
|
12
|
100
|
66
|
|
|
41
|
$_[DB] = open_db_connection($_[DB]) if ($_[DB] and not ref($_[DB])); |
2080
|
|
|
|
|
|
|
|
2081
|
|
|
|
|
|
|
# check for combined datasets |
2082
|
12
|
50
|
|
|
|
49
|
if ($_[DATA] =~ /&/) { |
2083
|
0
|
|
|
|
|
0
|
push @_, (split '&', pop @_); |
2084
|
|
|
|
|
|
|
} |
2085
|
|
|
|
|
|
|
|
2086
|
|
|
|
|
|
|
# determine method |
2087
|
|
|
|
|
|
|
# yes, we're only checking the first dataset, but they should all |
2088
|
|
|
|
|
|
|
# be the same type |
2089
|
12
|
|
66
|
|
|
71
|
my $db_method = $DB_METHODS{$_[METH]}{$_[RETT]}{$_[DB]}{$_[DATA]} || |
2090
|
|
|
|
|
|
|
_lookup_db_method(\@_); |
2091
|
|
|
|
|
|
|
|
2092
|
|
|
|
|
|
|
# return type values |
2093
|
|
|
|
|
|
|
# 0 = calculate score |
2094
|
|
|
|
|
|
|
# 1 = score array |
2095
|
|
|
|
|
|
|
# 2 = hash position scores |
2096
|
|
|
|
|
|
|
# immediately return calculated score if appropriate |
2097
|
12
|
100
|
|
|
|
24
|
if ($_[RETT] > 0) { |
2098
|
|
|
|
|
|
|
# immediately return either indexed hash or array of scores |
2099
|
5
|
|
|
|
|
7
|
return &{$db_method}(\@_); |
|
5
|
|
|
|
|
11
|
|
2100
|
|
|
|
|
|
|
} |
2101
|
|
|
|
|
|
|
else { |
2102
|
|
|
|
|
|
|
# calculate a score |
2103
|
7
|
|
|
|
|
9
|
my $scores = &{$db_method}(\@_); |
|
7
|
|
|
|
|
16
|
|
2104
|
|
|
|
|
|
|
# this might be an array reference of scores that need to be combined |
2105
|
|
|
|
|
|
|
# or it could be a single scalar score which just needs to be returned |
2106
|
7
|
50
|
|
|
|
18
|
if (ref($scores)) { |
2107
|
7
|
|
|
|
|
18
|
return calculate_score($_[METH], $scores); |
2108
|
|
|
|
|
|
|
} |
2109
|
|
|
|
|
|
|
else { |
2110
|
0
|
0
|
|
|
|
0
|
if (not defined $scores) { |
2111
|
0
|
0
|
|
|
|
0
|
return 0 if $_[METH] =~ /count|sum/; |
2112
|
0
|
|
|
|
|
0
|
return '.'; |
2113
|
|
|
|
|
|
|
} |
2114
|
0
|
|
|
|
|
0
|
return $scores; |
2115
|
|
|
|
|
|
|
} |
2116
|
|
|
|
|
|
|
} |
2117
|
|
|
|
|
|
|
} |
2118
|
|
|
|
|
|
|
|
2119
|
|
|
|
|
|
|
|
2120
|
|
|
|
|
|
|
sub calculate_score { |
2121
|
7
|
|
|
7
|
1
|
12
|
my ($method, $scores) = @_; |
2122
|
7
|
|
50
|
|
|
14
|
$scores ||= []; # just in case |
2123
|
7
|
50
|
|
|
|
14
|
if (exists $SCORE_CALCULATOR_SUB{$method}) { |
2124
|
7
|
|
|
|
|
11
|
return &{$SCORE_CALCULATOR_SUB{$method}}($scores); |
|
7
|
|
|
|
|
17
|
|
2125
|
|
|
|
|
|
|
} |
2126
|
|
|
|
|
|
|
else { |
2127
|
0
|
|
|
|
|
0
|
confess " unrecognized method '$method'!"; |
2128
|
|
|
|
|
|
|
} |
2129
|
|
|
|
|
|
|
} |
2130
|
|
|
|
|
|
|
|
2131
|
|
|
|
|
|
|
|
2132
|
|
|
|
|
|
|
sub get_chromosome_list { |
2133
|
|
|
|
|
|
|
|
2134
|
|
|
|
|
|
|
# options |
2135
|
2
|
|
|
2
|
1
|
297
|
my $database = shift; |
2136
|
2
|
|
50
|
|
|
8
|
my $chr_exclude = shift || undef; |
2137
|
|
|
|
|
|
|
|
2138
|
|
|
|
|
|
|
# Open a db connection |
2139
|
2
|
|
|
|
|
5
|
my $db = open_db_connection($database); |
2140
|
2
|
50
|
|
|
|
5
|
unless ($db) { |
2141
|
0
|
|
|
|
|
0
|
carp 'no database connected!'; |
2142
|
0
|
|
|
|
|
0
|
return; |
2143
|
|
|
|
|
|
|
} |
2144
|
|
|
|
|
|
|
|
2145
|
|
|
|
|
|
|
# Check for BigWigSet database |
2146
|
|
|
|
|
|
|
# these need to be handled a little differently |
2147
|
2
|
50
|
|
|
|
7
|
if (ref($db) =~ /BigWigSet/) { |
2148
|
|
|
|
|
|
|
# BigWigSet databases are the only databases that don't |
2149
|
|
|
|
|
|
|
# support the seq_ids method |
2150
|
|
|
|
|
|
|
# instead we have to look at one of the bigwigs in the set |
2151
|
0
|
|
|
|
|
0
|
my $bw_file = ($db->bigwigs)[0]; |
2152
|
0
|
|
|
|
|
0
|
$db = open_db_connection($bw_file); |
2153
|
|
|
|
|
|
|
} |
2154
|
|
|
|
|
|
|
|
2155
|
|
|
|
|
|
|
|
2156
|
|
|
|
|
|
|
# Collect chromosome lengths |
2157
|
|
|
|
|
|
|
# we want to maintain the original order of the chromosomes so we |
2158
|
|
|
|
|
|
|
# won't be putting it into a hash |
2159
|
|
|
|
|
|
|
# instead an array of arrays |
2160
|
2
|
|
|
|
|
4
|
my @chrom_lengths; |
2161
|
|
|
|
|
|
|
|
2162
|
|
|
|
|
|
|
# Database specific approaches to collecting chromosomes |
2163
|
|
|
|
|
|
|
# I used to have one generic approach, but idiosyncrasies and potential |
2164
|
|
|
|
|
|
|
# bugs make me use different approaches for better consistency |
2165
|
2
|
|
|
|
|
3
|
my $type = ref($db); |
2166
|
|
|
|
|
|
|
|
2167
|
|
|
|
|
|
|
# SeqFeature::Store |
2168
|
2
|
50
|
33
|
|
|
33
|
if ($type =~ /^Bio::DB::SeqFeature::Store/) { |
|
|
50
|
33
|
|
|
|
|
|
|
50
|
|
|
|
|
|
|
|
50
|
|
|
|
|
|
|
|
50
|
|
|
|
|
|
|
|
50
|
|
|
|
|
|
|
|
50
|
|
|
|
|
|
2169
|
0
|
|
|
|
|
0
|
for my $chr ($db->seq_ids) { |
2170
|
|
|
|
|
|
|
|
2171
|
|
|
|
|
|
|
# check for excluded chromosomes |
2172
|
0
|
0
|
0
|
|
|
0
|
next if (defined $chr_exclude and $chr =~ /$chr_exclude/i); |
2173
|
|
|
|
|
|
|
|
2174
|
|
|
|
|
|
|
# get chromosome size |
2175
|
0
|
|
|
|
|
0
|
my ($seqf) = $db->get_features_by_name($chr); |
2176
|
0
|
0
|
|
|
|
0
|
my $length = $seqf ? $seqf->length : 0; |
2177
|
|
|
|
|
|
|
|
2178
|
|
|
|
|
|
|
# store |
2179
|
0
|
|
|
|
|
0
|
push @chrom_lengths, [ $chr, $length ]; |
2180
|
|
|
|
|
|
|
} |
2181
|
|
|
|
|
|
|
} |
2182
|
|
|
|
|
|
|
|
2183
|
|
|
|
|
|
|
# libBigWig Bigfile, including both bigWig and bigBed |
2184
|
|
|
|
|
|
|
elsif ($type eq 'Bio::DB::Big::File') { |
2185
|
|
|
|
|
|
|
# this doesn't follow the typical BioPerl convention |
2186
|
|
|
|
|
|
|
# it's a hash, so randomly sorted!!!!! Never will be in same order as file!!!! |
2187
|
0
|
|
|
|
|
0
|
my $chroms = $db->chroms(); |
2188
|
|
|
|
|
|
|
|
2189
|
0
|
|
|
|
|
0
|
my @list; |
2190
|
0
|
|
|
|
|
0
|
foreach (values %$chroms) { |
2191
|
|
|
|
|
|
|
# check for excluded chromosomes |
2192
|
0
|
0
|
0
|
|
|
0
|
next if (defined $chr_exclude and $_->{name} =~ /$chr_exclude/i); |
2193
|
0
|
|
|
|
|
0
|
push @list, [$_->{name}, $_->{length}]; |
2194
|
|
|
|
|
|
|
} |
2195
|
|
|
|
|
|
|
# sort consistently by a sane system |
2196
|
0
|
|
|
|
|
0
|
@chrom_lengths = sane_chromo_sort(@list); |
2197
|
|
|
|
|
|
|
} |
2198
|
|
|
|
|
|
|
|
2199
|
|
|
|
|
|
|
# UCSC kent Bigfile |
2200
|
|
|
|
|
|
|
elsif ($type eq 'Bio::DB::BigWig' or $type eq 'Bio::DB::BigBed') { |
2201
|
0
|
|
|
|
|
0
|
foreach my $chr ($db->seq_ids) { |
2202
|
|
|
|
|
|
|
|
2203
|
|
|
|
|
|
|
# check for excluded chromosomes |
2204
|
0
|
0
|
0
|
|
|
0
|
next if (defined $chr_exclude and $chr =~ /$chr_exclude/i); |
2205
|
|
|
|
|
|
|
|
2206
|
|
|
|
|
|
|
# get chromosome size |
2207
|
0
|
|
|
|
|
0
|
my $length = $db->length($chr); |
2208
|
|
|
|
|
|
|
|
2209
|
|
|
|
|
|
|
# store |
2210
|
0
|
|
|
|
|
0
|
push @chrom_lengths, [ $chr, $length ]; |
2211
|
|
|
|
|
|
|
} |
2212
|
|
|
|
|
|
|
} |
2213
|
|
|
|
|
|
|
|
2214
|
|
|
|
|
|
|
# Samtools Bam |
2215
|
|
|
|
|
|
|
elsif ($type eq 'Bio::DB::Sam' or $type eq 'Bio::DB::HTS') { |
2216
|
0
|
|
|
|
|
0
|
for my $tid (0 .. $db->n_targets - 1) { |
2217
|
|
|
|
|
|
|
# each chromosome is internally represented in the bam file as |
2218
|
|
|
|
|
|
|
# a numeric target identifier |
2219
|
|
|
|
|
|
|
# we can easily convert this to an actual sequence name |
2220
|
|
|
|
|
|
|
# we will force the conversion to go one chromosome at a time |
2221
|
|
|
|
|
|
|
|
2222
|
|
|
|
|
|
|
# sequence info |
2223
|
0
|
|
|
|
|
0
|
my $chr = $db->target_name($tid); |
2224
|
0
|
|
|
|
|
0
|
my $length = $db->target_len($tid); |
2225
|
|
|
|
|
|
|
|
2226
|
|
|
|
|
|
|
# check for excluded chromosomes |
2227
|
0
|
0
|
0
|
|
|
0
|
next if (defined $chr_exclude and $chr =~ /$chr_exclude/i); |
2228
|
|
|
|
|
|
|
|
2229
|
|
|
|
|
|
|
# store |
2230
|
0
|
|
|
|
|
0
|
push @chrom_lengths, [ $chr, $length ]; |
2231
|
|
|
|
|
|
|
} |
2232
|
|
|
|
|
|
|
} |
2233
|
|
|
|
|
|
|
|
2234
|
|
|
|
|
|
|
# Fast through modern HTS fai index |
2235
|
|
|
|
|
|
|
elsif ($type eq 'Bio::DB::HTS::Faidx') { |
2236
|
0
|
|
|
|
|
0
|
for my $chr ($db->get_all_sequence_ids) { |
2237
|
|
|
|
|
|
|
# check for excluded |
2238
|
0
|
0
|
0
|
|
|
0
|
next if (defined $chr_exclude and $chr =~ /$chr_exclude/i); |
2239
|
|
|
|
|
|
|
|
2240
|
|
|
|
|
|
|
# get length and store it |
2241
|
0
|
|
|
|
|
0
|
my $length = $db->length($chr); |
2242
|
0
|
|
|
|
|
0
|
push @chrom_lengths, [$chr, $length]; |
2243
|
|
|
|
|
|
|
} |
2244
|
|
|
|
|
|
|
} |
2245
|
|
|
|
|
|
|
|
2246
|
|
|
|
|
|
|
# Fasta through old BioPerl adapter |
2247
|
|
|
|
|
|
|
elsif ($type eq 'Bio::DB::Fasta') { |
2248
|
0
|
|
|
|
|
0
|
for my $chr ($db->get_all_ids) { |
2249
|
|
|
|
|
|
|
|
2250
|
|
|
|
|
|
|
# check for excluded chromosomes |
2251
|
0
|
0
|
0
|
|
|
0
|
next if (defined $chr_exclude and $chr =~ /$chr_exclude/i); |
2252
|
|
|
|
|
|
|
|
2253
|
|
|
|
|
|
|
# get chromosome size |
2254
|
0
|
|
|
|
|
0
|
my $seq = $db->get_Seq_by_id($chr); |
2255
|
0
|
0
|
|
|
|
0
|
my $length = $seq ? $seq->length : 0; |
2256
|
|
|
|
|
|
|
|
2257
|
|
|
|
|
|
|
# store |
2258
|
0
|
|
|
|
|
0
|
push @chrom_lengths, [ $chr, $length ]; |
2259
|
|
|
|
|
|
|
} |
2260
|
|
|
|
|
|
|
} |
2261
|
|
|
|
|
|
|
|
2262
|
|
|
|
|
|
|
# other Bioperl adapter????? |
2263
|
|
|
|
|
|
|
elsif ($db->can('seq_ids')) { |
2264
|
2
|
|
|
|
|
8
|
foreach my $chr ($db->seq_ids) { |
2265
|
|
|
|
|
|
|
|
2266
|
|
|
|
|
|
|
# check for excluded chromosomes |
2267
|
2
|
50
|
33
|
|
|
20
|
next if (defined $chr_exclude and $chr =~ /$chr_exclude/i); |
2268
|
|
|
|
|
|
|
|
2269
|
|
|
|
|
|
|
# generate a segment representing the chromosome |
2270
|
|
|
|
|
|
|
# due to fuzzy name matching, we may get more than one back |
2271
|
2
|
|
|
|
|
18
|
my @segments = $db->segment($chr); |
2272
|
|
|
|
|
|
|
# need to find the right one |
2273
|
2
|
|
|
|
|
247
|
my $segment; |
2274
|
2
|
|
|
|
|
12
|
while (@segments) { |
2275
|
2
|
|
|
|
|
4
|
$segment = shift @segments; |
2276
|
2
|
50
|
|
|
|
8
|
last if $segment->seq_id eq $chr; |
2277
|
|
|
|
|
|
|
} |
2278
|
|
|
|
|
|
|
|
2279
|
|
|
|
|
|
|
# check segment |
2280
|
2
|
50
|
|
|
|
25
|
unless ($segment) { |
2281
|
0
|
|
|
|
|
0
|
carp " No genome segment for seq_id $chr!!?? skipping\n"; |
2282
|
0
|
|
|
|
|
0
|
next; |
2283
|
|
|
|
|
|
|
} |
2284
|
|
|
|
|
|
|
|
2285
|
|
|
|
|
|
|
# get the chromosome length |
2286
|
2
|
|
|
|
|
10
|
my $length = $segment->length; |
2287
|
|
|
|
|
|
|
|
2288
|
|
|
|
|
|
|
# store |
2289
|
2
|
|
|
|
|
49
|
push @chrom_lengths, [ $chr, $length ]; |
2290
|
|
|
|
|
|
|
} |
2291
|
|
|
|
|
|
|
} |
2292
|
|
|
|
|
|
|
|
2293
|
|
|
|
|
|
|
else { |
2294
|
0
|
|
|
|
|
0
|
carp " Unable to identify chromosomes in database type '$type'. Try using a different database file or adapter\n"; |
2295
|
0
|
|
|
|
|
0
|
return; |
2296
|
|
|
|
|
|
|
} |
2297
|
|
|
|
|
|
|
|
2298
|
|
|
|
|
|
|
# Return |
2299
|
2
|
50
|
|
|
|
6
|
unless (@chrom_lengths) { |
2300
|
0
|
|
|
|
|
0
|
carp " no chromosome sequences identified in database!\n"; |
2301
|
0
|
|
|
|
|
0
|
return; |
2302
|
|
|
|
|
|
|
} |
2303
|
2
|
|
|
|
|
6
|
return @chrom_lengths; |
2304
|
|
|
|
|
|
|
} |
2305
|
|
|
|
|
|
|
|
2306
|
|
|
|
|
|
|
|
2307
|
|
|
|
|
|
|
sub low_level_bam_fetch { |
2308
|
0
|
0
|
|
0
|
1
|
0
|
confess "incorrect number of parameters passed!" unless scalar @_ == 6; |
2309
|
0
|
|
|
|
|
0
|
my ($sam, $tid, $start, $stop, $callback, $data) = @_; |
2310
|
|
|
|
|
|
|
# run the the low level bam fetch based on which adapter is being used |
2311
|
0
|
0
|
|
|
|
0
|
unless ($BAM_ADAPTER) { |
2312
|
0
|
0
|
|
|
|
0
|
$BAM_ADAPTER = ref($sam) =~ /hts/i ? 'hts' : 'sam'; |
2313
|
|
|
|
|
|
|
} |
2314
|
0
|
0
|
|
|
|
0
|
if ($BAM_ADAPTER eq 'hts') { |
|
|
0
|
|
|
|
|
|
2315
|
|
|
|
|
|
|
# using Bio::DB::HTS |
2316
|
0
|
|
|
|
|
0
|
my $index = $sam->hts_index; |
2317
|
0
|
0
|
|
|
|
0
|
return unless $index; |
2318
|
0
|
|
|
|
|
0
|
return $index->fetch($sam->hts_file, $tid, $start, $stop, $callback, $data); |
2319
|
|
|
|
|
|
|
} |
2320
|
|
|
|
|
|
|
elsif ($BAM_ADAPTER eq 'sam') { |
2321
|
|
|
|
|
|
|
# using Bio::DB::Sam |
2322
|
0
|
|
|
|
|
0
|
my $index = $sam->bam_index; |
2323
|
0
|
0
|
|
|
|
0
|
return unless $index; |
2324
|
0
|
|
|
|
|
0
|
return $index->fetch($sam->bam, $tid, $start, $stop, $callback, $data); |
2325
|
|
|
|
|
|
|
} |
2326
|
|
|
|
|
|
|
else { |
2327
|
0
|
|
|
|
|
0
|
confess "no bam adapter loaded!\n"; |
2328
|
|
|
|
|
|
|
} |
2329
|
|
|
|
|
|
|
} |
2330
|
|
|
|
|
|
|
|
2331
|
|
|
|
|
|
|
|
2332
|
|
|
|
|
|
|
sub low_level_bam_coverage { |
2333
|
0
|
0
|
|
0
|
1
|
0
|
confess "incorrect number of parameters passed!" unless scalar @_ == 4; |
2334
|
0
|
|
|
|
|
0
|
my ($sam, $tid, $start, $stop) = @_; |
2335
|
|
|
|
|
|
|
# run the the low level bam coverage based on which adapter is being used |
2336
|
0
|
0
|
|
|
|
0
|
unless ($BAM_ADAPTER) { |
2337
|
0
|
0
|
|
|
|
0
|
$BAM_ADAPTER = ref($sam) =~ /hts/i ? 'hts' : 'sam'; |
2338
|
|
|
|
|
|
|
} |
2339
|
0
|
0
|
|
|
|
0
|
if ($BAM_ADAPTER eq 'hts') { |
|
|
0
|
|
|
|
|
|
2340
|
|
|
|
|
|
|
# using Bio::DB::HTS |
2341
|
0
|
|
|
|
|
0
|
my $index = $sam->hts_index; |
2342
|
0
|
0
|
|
|
|
0
|
return unless $index; |
2343
|
0
|
|
|
|
|
0
|
return $index->coverage($sam->hts_file, $tid, $start, $stop); |
2344
|
|
|
|
|
|
|
} |
2345
|
|
|
|
|
|
|
elsif ($BAM_ADAPTER eq 'sam') { |
2346
|
|
|
|
|
|
|
# using Bio::DB::Sam |
2347
|
0
|
|
|
|
|
0
|
my $index = $sam->bam_index; |
2348
|
0
|
0
|
|
|
|
0
|
return unless $index; |
2349
|
0
|
|
|
|
|
0
|
return $index->coverage($sam->bam, $tid, $start, $stop); |
2350
|
|
|
|
|
|
|
} |
2351
|
|
|
|
|
|
|
else { |
2352
|
0
|
|
|
|
|
0
|
confess "no bam adapter loaded!\n"; |
2353
|
|
|
|
|
|
|
} |
2354
|
|
|
|
|
|
|
} |
2355
|
|
|
|
|
|
|
|
2356
|
|
|
|
|
|
|
|
2357
|
|
|
|
|
|
|
sub get_genomic_sequence { |
2358
|
0
|
0
|
|
0
|
1
|
0
|
confess "incorrect number of parameters passed!" unless scalar @_ == 4; |
2359
|
0
|
|
|
|
|
0
|
my ($db, $chrom, $start, $stop) = @_; |
2360
|
|
|
|
|
|
|
|
2361
|
|
|
|
|
|
|
# check database |
2362
|
0
|
|
|
|
|
0
|
my $type = ref($db); |
2363
|
0
|
0
|
|
|
|
0
|
$db = open_db_connection($db) if not $type; |
2364
|
|
|
|
|
|
|
|
2365
|
|
|
|
|
|
|
# return sequence based on the type of database adapter we're using |
2366
|
0
|
0
|
|
|
|
0
|
if ($type eq 'Bio::DB::HTS::Faidx') { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
2367
|
0
|
|
|
|
|
0
|
return $db->get_sequence_no_length("$chrom:$start-$stop"); |
2368
|
|
|
|
|
|
|
} |
2369
|
|
|
|
|
|
|
elsif ($type eq 'Bio::DB::Sam::Fai') { |
2370
|
0
|
|
|
|
|
0
|
return $db->fetch("$chrom:$start-$stop"); |
2371
|
|
|
|
|
|
|
} |
2372
|
|
|
|
|
|
|
elsif ($db->can('seq')) { |
2373
|
|
|
|
|
|
|
# BioPerl database including Bio::DB::Fasta and Bio::DB::SeqFeature::Store |
2374
|
0
|
|
|
|
|
0
|
return $db->seq($chrom, $start, $stop); |
2375
|
|
|
|
|
|
|
} |
2376
|
|
|
|
|
|
|
else { |
2377
|
0
|
|
|
|
|
0
|
confess "unsupported database $type for collecting genomic sequence!\n"; |
2378
|
|
|
|
|
|
|
} |
2379
|
|
|
|
|
|
|
} |
2380
|
|
|
|
|
|
|
|
2381
|
|
|
|
|
|
|
|
2382
|
|
|
|
|
|
|
### deprecated methods |
2383
|
|
|
|
|
|
|
# just in case |
2384
|
|
|
|
|
|
|
sub validate_included_feature { |
2385
|
0
|
|
|
0
|
0
|
0
|
confess "validate_included_feature() is no longer a valid method. " . |
2386
|
|
|
|
|
|
|
"Please update your script to the current API.\n"; |
2387
|
|
|
|
|
|
|
} |
2388
|
|
|
|
|
|
|
|
2389
|
|
|
|
|
|
|
|
2390
|
|
|
|
|
|
|
### Internal subroutine to retrieve the scores from an established region object |
2391
|
|
|
|
|
|
|
sub _lookup_db_method { |
2392
|
|
|
|
|
|
|
# parameters passed as an array reference |
2393
|
9
|
|
|
9
|
|
14
|
my $param = shift; |
2394
|
|
|
|
|
|
|
# determine the appropriate score method |
2395
|
9
|
|
|
|
|
10
|
my $score_method; |
2396
|
9
|
50
|
|
|
|
31
|
if ($param->[DATA] =~ /^file|http|ftp/) { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
2397
|
|
|
|
|
|
|
# collect the data according to file type |
2398
|
|
|
|
|
|
|
|
2399
|
|
|
|
|
|
|
# BigWig Data file |
2400
|
9
|
50
|
|
|
|
69
|
if ($param->[DATA] =~ /\.(?:bw|bigwig)$/i) { |
|
|
50
|
|
|
|
|
|
|
|
50
|
|
|
|
|
|
|
|
50
|
|
|
|
|
|
2401
|
|
|
|
|
|
|
# file is in bigwig format |
2402
|
|
|
|
|
|
|
# get the dataset scores using Bio::ToolBox::db_helper::bigwig |
2403
|
|
|
|
|
|
|
# this uses either Bio::DB::BigWig or Bio::DB::Big |
2404
|
|
|
|
|
|
|
|
2405
|
|
|
|
|
|
|
# check that we have bigwig support |
2406
|
0
|
0
|
|
|
|
0
|
$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK; |
2407
|
0
|
0
|
|
|
|
0
|
if ($BIGWIG_OK) { |
2408
|
0
|
0
|
|
|
|
0
|
if ($param->[RETT] == 2) { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
2409
|
0
|
|
|
|
|
0
|
$score_method = \&collect_bigwig_position_scores; |
2410
|
|
|
|
|
|
|
} |
2411
|
|
|
|
|
|
|
elsif ($param->[RETT] == 1) { |
2412
|
0
|
|
|
|
|
0
|
$score_method = \&collect_bigwig_scores; |
2413
|
|
|
|
|
|
|
} |
2414
|
|
|
|
|
|
|
elsif ($param->[METH] =~ /min|max|mean/) { |
2415
|
0
|
|
|
|
|
0
|
$score_method = \&collect_bigwig_score; |
2416
|
|
|
|
|
|
|
} |
2417
|
|
|
|
|
|
|
elsif ($param->[METH] =~ /sum|count/) { |
2418
|
|
|
|
|
|
|
# difference between ucsc and libBigWig libraries |
2419
|
0
|
0
|
|
|
|
0
|
$score_method = $BIG_ADAPTER eq 'ucsc' ? |
2420
|
|
|
|
|
|
|
\&collect_bigwig_score : \&collect_bigwig_scores; |
2421
|
|
|
|
|
|
|
} |
2422
|
|
|
|
|
|
|
else { |
2423
|
0
|
|
|
|
|
0
|
$score_method = \&collect_bigwig_scores; |
2424
|
|
|
|
|
|
|
} |
2425
|
|
|
|
|
|
|
} |
2426
|
|
|
|
|
|
|
else { |
2427
|
0
|
|
|
|
|
0
|
croak " BigWig support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?"; |
2428
|
|
|
|
|
|
|
} |
2429
|
|
|
|
|
|
|
} |
2430
|
|
|
|
|
|
|
|
2431
|
|
|
|
|
|
|
# BigBed Data file |
2432
|
|
|
|
|
|
|
elsif ($param->[DATA] =~ /\.(?:bb|bigbed)$/i) { |
2433
|
|
|
|
|
|
|
# data is in bigbed format |
2434
|
|
|
|
|
|
|
# get the dataset scores using Bio::ToolBox::db_helper::bigbed |
2435
|
|
|
|
|
|
|
# this uses either Bio::DB::BigBed or Bio::DB::Big |
2436
|
|
|
|
|
|
|
|
2437
|
|
|
|
|
|
|
# check that we have bigbed support |
2438
|
0
|
0
|
|
|
|
0
|
$BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK; |
2439
|
0
|
0
|
|
|
|
0
|
if ($BIGBED_OK) { |
2440
|
0
|
0
|
|
|
|
0
|
if ($param->[RETT] == 2) { |
2441
|
0
|
|
|
|
|
0
|
$score_method = \&collect_bigbed_position_scores; |
2442
|
|
|
|
|
|
|
} |
2443
|
|
|
|
|
|
|
else { |
2444
|
0
|
|
|
|
|
0
|
$score_method = \&collect_bigbed_scores; |
2445
|
|
|
|
|
|
|
} |
2446
|
|
|
|
|
|
|
} |
2447
|
|
|
|
|
|
|
else { |
2448
|
0
|
|
|
|
|
0
|
croak " BigBed support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?"; |
2449
|
|
|
|
|
|
|
} |
2450
|
|
|
|
|
|
|
} |
2451
|
|
|
|
|
|
|
|
2452
|
|
|
|
|
|
|
# BAM data file |
2453
|
|
|
|
|
|
|
elsif ($param->[DATA] =~ /\.bam$/i) { |
2454
|
|
|
|
|
|
|
# data is in bam format |
2455
|
|
|
|
|
|
|
# get the dataset scores using Bio::ToolBox::db_helper::bam |
2456
|
|
|
|
|
|
|
# this uses the Bio::DB::Sam or Bio::DB::HTS adaptor |
2457
|
|
|
|
|
|
|
|
2458
|
|
|
|
|
|
|
# check that we have Bam support |
2459
|
0
|
0
|
|
|
|
0
|
_load_bam_helper_module() unless $BAM_OK; |
2460
|
0
|
0
|
|
|
|
0
|
if ($BAM_OK) { |
2461
|
0
|
|
|
|
|
0
|
$score_method = \&collect_bam_scores; |
2462
|
|
|
|
|
|
|
} |
2463
|
|
|
|
|
|
|
else { |
2464
|
0
|
|
|
|
|
0
|
croak " Bam support is not enabled! " . |
2465
|
|
|
|
|
|
|
"Is Bio::DB::HTS or Bio::DB::Sam installed?\n"; |
2466
|
|
|
|
|
|
|
} |
2467
|
|
|
|
|
|
|
} |
2468
|
|
|
|
|
|
|
|
2469
|
|
|
|
|
|
|
# USeq Data file |
2470
|
|
|
|
|
|
|
elsif ($param->[DATA] =~ /\.useq$/i) { |
2471
|
|
|
|
|
|
|
# data is in useq format |
2472
|
|
|
|
|
|
|
# get the dataset scores using Bio::ToolBox::db_helper::useq |
2473
|
|
|
|
|
|
|
# this uses the Bio::DB::USeq adaptor |
2474
|
|
|
|
|
|
|
|
2475
|
|
|
|
|
|
|
# check that we have bigbed support |
2476
|
9
|
50
|
|
|
|
21
|
$USEQ_OK = _load_helper_module('Bio::ToolBox::db_helper::useq') |
2477
|
|
|
|
|
|
|
unless $USEQ_OK; |
2478
|
9
|
50
|
|
|
|
13
|
if ($USEQ_OK) { |
2479
|
9
|
100
|
|
|
|
13
|
if ($param->[RETT] == 2) { |
2480
|
4
|
|
|
|
|
8
|
$score_method = \&collect_useq_position_scores; |
2481
|
|
|
|
|
|
|
} |
2482
|
|
|
|
|
|
|
else { |
2483
|
5
|
|
|
|
|
11
|
$score_method = \&collect_useq_scores; |
2484
|
|
|
|
|
|
|
} |
2485
|
|
|
|
|
|
|
} |
2486
|
|
|
|
|
|
|
else { |
2487
|
0
|
|
|
|
|
0
|
croak " USeq support is not enabled! " . |
2488
|
|
|
|
|
|
|
"Is Bio::DB::USeq installed?\n"; |
2489
|
|
|
|
|
|
|
} |
2490
|
|
|
|
|
|
|
} |
2491
|
|
|
|
|
|
|
|
2492
|
|
|
|
|
|
|
# Unsupported Data file |
2493
|
|
|
|
|
|
|
else { |
2494
|
0
|
|
|
|
|
0
|
confess sprintf " Unsupported or unrecognized file type for file '%s'!\n", |
2495
|
|
|
|
|
|
|
$param->[DATA]; |
2496
|
|
|
|
|
|
|
} |
2497
|
|
|
|
|
|
|
} |
2498
|
|
|
|
|
|
|
|
2499
|
|
|
|
|
|
|
|
2500
|
|
|
|
|
|
|
### BigWigSet database |
2501
|
|
|
|
|
|
|
elsif (ref($param->[DB]) =~ m/BigWigSet/) { |
2502
|
|
|
|
|
|
|
# calling features from a BigWigSet database object |
2503
|
|
|
|
|
|
|
# this uses either Bio::DB::BigWig or Bio::DB::Big |
2504
|
|
|
|
|
|
|
|
2505
|
|
|
|
|
|
|
# check that we have bigwig support |
2506
|
|
|
|
|
|
|
# duh! we should, we probably opened the stupid database! |
2507
|
0
|
0
|
|
|
|
0
|
$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK; |
2508
|
0
|
0
|
|
|
|
0
|
croak " BigWigSet support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?" |
2509
|
|
|
|
|
|
|
unless $BIGWIG_OK; |
2510
|
|
|
|
|
|
|
|
2511
|
|
|
|
|
|
|
# the data collection depends on the method |
2512
|
0
|
0
|
|
|
|
0
|
if ($param->[RETT] == 2) { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
2513
|
0
|
|
|
|
|
0
|
$score_method = \&collect_bigwigset_position_scores; |
2514
|
|
|
|
|
|
|
} |
2515
|
|
|
|
|
|
|
elsif ($param->[RETT] == 1) { |
2516
|
0
|
|
|
|
|
0
|
$score_method = \&collect_bigwigset_scores; |
2517
|
|
|
|
|
|
|
} |
2518
|
|
|
|
|
|
|
elsif ($param->[METH] =~ /min|max|mean|sum|count/) { |
2519
|
|
|
|
|
|
|
# difference between ucsc and libBigWig libraries |
2520
|
|
|
|
|
|
|
# the ucsc library works with summaries and we can handle multiple of these |
2521
|
|
|
|
|
|
|
# but the big adapter doesn't |
2522
|
0
|
0
|
|
|
|
0
|
$score_method = $BIG_ADAPTER eq 'ucsc' ? |
2523
|
|
|
|
|
|
|
\&collect_bigwigset_score : \&collect_bigwigset_scores; |
2524
|
|
|
|
|
|
|
} |
2525
|
|
|
|
|
|
|
else { |
2526
|
0
|
|
|
|
|
0
|
$score_method = \&collect_bigwigset_scores; |
2527
|
|
|
|
|
|
|
} |
2528
|
|
|
|
|
|
|
} |
2529
|
|
|
|
|
|
|
|
2530
|
|
|
|
|
|
|
|
2531
|
|
|
|
|
|
|
### BioPerl style database |
2532
|
|
|
|
|
|
|
elsif (ref($param->[DB]) =~ m/^Bio::DB/) { |
2533
|
|
|
|
|
|
|
# a BioPerl style database, including Bio::DB::SeqFeature::Store |
2534
|
|
|
|
|
|
|
# most or all Bio::DB databases support generic get_seq_stream() methods |
2535
|
|
|
|
|
|
|
# that return seqfeature objects, which we can use in a generic sense |
2536
|
|
|
|
|
|
|
|
2537
|
|
|
|
|
|
|
# check that we have support |
2538
|
|
|
|
|
|
|
# duh! we should, we probably opened the stupid database! |
2539
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
2540
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
2541
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
2542
|
|
|
|
|
|
|
# get the dataset scores using Bio::ToolBox::db_helper::seqfasta |
2543
|
|
|
|
|
|
|
# check that we support methods |
2544
|
0
|
0
|
|
|
|
0
|
unless ($param->[DB]->can('get_seq_stream')) { |
2545
|
0
|
|
|
|
|
0
|
confess sprintf "unsupported database! cannot use %s as it does not support get_seq_stream method or equivalent", |
2546
|
|
|
|
|
|
|
ref($param->[DB]); |
2547
|
|
|
|
|
|
|
} |
2548
|
0
|
|
|
|
|
0
|
$score_method = \&collect_store_scores; |
2549
|
|
|
|
|
|
|
} |
2550
|
|
|
|
|
|
|
else { |
2551
|
0
|
|
|
|
|
0
|
croak " SeqFeature Store support is not enabled! " . |
2552
|
|
|
|
|
|
|
"Is BioPerl and Bio::DB::SeqFeature::Store properly installed?\n"; |
2553
|
|
|
|
|
|
|
} |
2554
|
|
|
|
|
|
|
} |
2555
|
|
|
|
|
|
|
|
2556
|
|
|
|
|
|
|
|
2557
|
|
|
|
|
|
|
### Some other database? |
2558
|
|
|
|
|
|
|
else { |
2559
|
0
|
0
|
|
|
|
0
|
confess "no recognizeable dataset provided!" unless $param->[DATA]; |
2560
|
0
|
0
|
|
|
|
0
|
confess "no database passed!" unless $param->[DB]; |
2561
|
0
|
|
|
|
|
0
|
confess "something went wrong! parameters: @$param"; |
2562
|
|
|
|
|
|
|
} |
2563
|
|
|
|
|
|
|
|
2564
|
|
|
|
|
|
|
|
2565
|
|
|
|
|
|
|
### Cache and return the results |
2566
|
9
|
|
|
|
|
27
|
$DB_METHODS{$param->[METH]}{$param->[RETT]}{$param->[DB]}{$param->[DATA]} = |
2567
|
|
|
|
|
|
|
$score_method; |
2568
|
9
|
|
|
|
|
19
|
return $score_method; |
2569
|
|
|
|
|
|
|
} |
2570
|
|
|
|
|
|
|
|
2571
|
|
|
|
|
|
|
|
2572
|
|
|
|
|
|
|
sub _load_helper_module { |
2573
|
1
|
|
|
1
|
|
3
|
my $class = shift; |
2574
|
1
|
|
|
|
|
2
|
my $success = 0; |
2575
|
1
|
|
|
|
|
1
|
eval { |
2576
|
1
|
|
|
|
|
4
|
load $class; # using Module::Load to dynamically load modules |
2577
|
1
|
|
|
|
|
36
|
$class->import; # must be done particularly for my modules |
2578
|
1
|
|
|
|
|
3
|
$success = 1; |
2579
|
|
|
|
|
|
|
}; |
2580
|
1
|
|
|
|
|
2
|
return $success; |
2581
|
|
|
|
|
|
|
} |
2582
|
|
|
|
|
|
|
|
2583
|
|
|
|
|
|
|
|
2584
|
|
|
|
|
|
|
sub _load_bam_helper_module { |
2585
|
0
|
0
|
|
0
|
|
|
if ($BAM_ADAPTER) { |
2586
|
0
|
0
|
|
|
|
|
if ($BAM_ADAPTER =~ /sam/i) { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
2587
|
0
|
|
|
|
|
|
$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::bam'); |
2588
|
0
|
|
|
|
|
|
$BAM_ADAPTER = 'sam'; # for internal consistency |
2589
|
|
|
|
|
|
|
} |
2590
|
|
|
|
|
|
|
elsif ($BAM_ADAPTER =~ /hts/i) { |
2591
|
0
|
|
|
|
|
|
$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::hts'); |
2592
|
0
|
|
|
|
|
|
$BAM_ADAPTER = 'hts'; # for internal consistency |
2593
|
|
|
|
|
|
|
} |
2594
|
|
|
|
|
|
|
elsif ($BAM_ADAPTER =~ /none/i) { |
2595
|
|
|
|
|
|
|
# basically for testing purposes, don't use a module |
2596
|
0
|
|
|
|
|
|
return 0; |
2597
|
|
|
|
|
|
|
} |
2598
|
|
|
|
|
|
|
else { |
2599
|
|
|
|
|
|
|
# unrecognized |
2600
|
0
|
|
|
|
|
|
$@ = 'unrecognized'; |
2601
|
|
|
|
|
|
|
} |
2602
|
0
|
0
|
|
|
|
|
if (not $BAM_OK) { |
2603
|
0
|
|
|
|
|
|
print "Requested '$BAM_ADAPTER' adapter could not be loaded: $@\n"; |
2604
|
|
|
|
|
|
|
} |
2605
|
|
|
|
|
|
|
} |
2606
|
|
|
|
|
|
|
else { |
2607
|
|
|
|
|
|
|
# try hts first, then sam |
2608
|
|
|
|
|
|
|
# be sure to set BAM_ADAPTER upon success |
2609
|
0
|
|
|
|
|
|
$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::hts'); |
2610
|
0
|
0
|
|
|
|
|
if ($BAM_OK) { |
2611
|
0
|
|
|
|
|
|
$BAM_ADAPTER = 'hts'; |
2612
|
|
|
|
|
|
|
} |
2613
|
|
|
|
|
|
|
else { |
2614
|
0
|
|
|
|
|
|
$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::bam'); |
2615
|
0
|
0
|
|
|
|
|
$BAM_ADAPTER = 'sam' if $BAM_OK; |
2616
|
|
|
|
|
|
|
} |
2617
|
|
|
|
|
|
|
} |
2618
|
0
|
|
|
|
|
|
return $BAM_OK; |
2619
|
|
|
|
|
|
|
} |
2620
|
|
|
|
|
|
|
|
2621
|
|
|
|
|
|
|
|
2622
|
|
|
|
|
|
|
sub _load_bigwig_helper_module { |
2623
|
0
|
0
|
|
0
|
|
|
if ($BIG_ADAPTER) { |
2624
|
0
|
0
|
|
|
|
|
if ($BIG_ADAPTER =~ /ucsc|kent/i) { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
2625
|
0
|
|
|
|
|
|
$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig'); |
2626
|
0
|
|
|
|
|
|
$BIG_ADAPTER = 'ucsc'; # for internal consistency |
2627
|
|
|
|
|
|
|
} |
2628
|
|
|
|
|
|
|
elsif ($BIG_ADAPTER =~ /big/i) { |
2629
|
0
|
|
|
|
|
|
$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::big'); |
2630
|
0
|
|
|
|
|
|
$BIGBED_OK = $BIGWIG_OK; # bigbed too! |
2631
|
0
|
|
|
|
|
|
$BIG_ADAPTER = 'big'; # for internal consistency |
2632
|
|
|
|
|
|
|
} |
2633
|
|
|
|
|
|
|
elsif ($BIG_ADAPTER =~ /none/i) { |
2634
|
|
|
|
|
|
|
# basically for testing purposes, don't use a module |
2635
|
0
|
|
|
|
|
|
return 0; |
2636
|
|
|
|
|
|
|
} |
2637
|
|
|
|
|
|
|
else { |
2638
|
|
|
|
|
|
|
# unrecognized |
2639
|
0
|
|
|
|
|
|
$@ = 'unrecognized'; |
2640
|
|
|
|
|
|
|
} |
2641
|
0
|
0
|
|
|
|
|
if (not $BIGWIG_OK) { |
2642
|
0
|
|
|
|
|
|
print "Requested '$BIG_ADAPTER' adapter could not be loaded: $@\n"; |
2643
|
|
|
|
|
|
|
} |
2644
|
|
|
|
|
|
|
} |
2645
|
|
|
|
|
|
|
else { |
2646
|
|
|
|
|
|
|
# we have to try each one out |
2647
|
|
|
|
|
|
|
# try the modern big adapter first |
2648
|
0
|
|
|
|
|
|
$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::big'); |
2649
|
0
|
0
|
|
|
|
|
if ($BIGWIG_OK) { |
2650
|
|
|
|
|
|
|
# success! |
2651
|
0
|
|
|
|
|
|
$BIGBED_OK = $BIGWIG_OK; # bigbed too! |
2652
|
0
|
0
|
|
|
|
|
$BIG_ADAPTER = 'big' if $BIGWIG_OK; |
2653
|
|
|
|
|
|
|
} |
2654
|
|
|
|
|
|
|
else { |
2655
|
0
|
|
|
|
|
|
$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig'); |
2656
|
0
|
0
|
|
|
|
|
$BIG_ADAPTER = 'ucsc' if $BIGWIG_OK; |
2657
|
|
|
|
|
|
|
} |
2658
|
|
|
|
|
|
|
} |
2659
|
0
|
|
|
|
|
|
return $BIGWIG_OK; |
2660
|
|
|
|
|
|
|
} |
2661
|
|
|
|
|
|
|
|
2662
|
|
|
|
|
|
|
sub _load_bigbed_helper_module { |
2663
|
0
|
0
|
|
0
|
|
|
if ($BIG_ADAPTER) { |
2664
|
0
|
0
|
|
|
|
|
if ($BIG_ADAPTER =~ /ucsc|kent/i) { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
2665
|
0
|
|
|
|
|
|
$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed'); |
2666
|
0
|
|
|
|
|
|
$BIG_ADAPTER = 'ucsc'; # for internal consistency |
2667
|
|
|
|
|
|
|
} |
2668
|
|
|
|
|
|
|
elsif ($BIG_ADAPTER =~ /big/i) { |
2669
|
0
|
|
|
|
|
|
$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::big'); |
2670
|
0
|
|
|
|
|
|
$BIGWIG_OK = $BIGBED_OK; # bigwig too! |
2671
|
0
|
|
|
|
|
|
$BIG_ADAPTER = 'big'; # for internal consistency |
2672
|
|
|
|
|
|
|
} |
2673
|
|
|
|
|
|
|
elsif ($BIG_ADAPTER =~ /none/i) { |
2674
|
|
|
|
|
|
|
# basically for testing purposes, don't use a module |
2675
|
0
|
|
|
|
|
|
return 0; |
2676
|
|
|
|
|
|
|
} |
2677
|
|
|
|
|
|
|
elsif ($BAM_ADAPTER =~ /\w+/) { |
2678
|
|
|
|
|
|
|
# unrecognized |
2679
|
0
|
|
|
|
|
|
$@ = 'unrecognized'; |
2680
|
|
|
|
|
|
|
} |
2681
|
0
|
0
|
|
|
|
|
if (not $BIGWIG_OK) { |
2682
|
0
|
|
|
|
|
|
print "Requested '$BIG_ADAPTER' adapter could not be loaded: $@\n"; |
2683
|
|
|
|
|
|
|
} |
2684
|
|
|
|
|
|
|
} |
2685
|
|
|
|
|
|
|
else { |
2686
|
|
|
|
|
|
|
# we have to try each one out |
2687
|
|
|
|
|
|
|
# try the modern big adapter first |
2688
|
0
|
|
|
|
|
|
$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::big'); |
2689
|
0
|
0
|
|
|
|
|
if ($BIGBED_OK) { |
2690
|
|
|
|
|
|
|
# success! |
2691
|
0
|
|
|
|
|
|
$BIGWIG_OK = $BIGBED_OK; # bigwig too! |
2692
|
0
|
0
|
|
|
|
|
$BIG_ADAPTER = 'big' if $BIGBED_OK; |
2693
|
|
|
|
|
|
|
} |
2694
|
|
|
|
|
|
|
else { |
2695
|
0
|
|
|
|
|
|
$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed'); |
2696
|
0
|
0
|
|
|
|
|
$BIG_ADAPTER = 'ucsc' if $BIGBED_OK; |
2697
|
|
|
|
|
|
|
} |
2698
|
|
|
|
|
|
|
} |
2699
|
0
|
|
|
|
|
|
return $BIGBED_OK; |
2700
|
|
|
|
|
|
|
} |
2701
|
|
|
|
|
|
|
|
2702
|
|
|
|
|
|
|
=head1 AUTHOR |
2703
|
|
|
|
|
|
|
|
2704
|
|
|
|
|
|
|
Timothy J. Parnell, PhD |
2705
|
|
|
|
|
|
|
Howard Hughes Medical Institute |
2706
|
|
|
|
|
|
|
Dept of Oncological Sciences |
2707
|
|
|
|
|
|
|
Huntsman Cancer Institute |
2708
|
|
|
|
|
|
|
University of Utah |
2709
|
|
|
|
|
|
|
Salt Lake City, UT, 84112 |
2710
|
|
|
|
|
|
|
|
2711
|
|
|
|
|
|
|
This package is free software; you can redistribute it and/or modify |
2712
|
|
|
|
|
|
|
it under the terms of the Artistic License 2.0. |
2713
|
|
|
|
|
|
|
|