package Bio::ToolBox::db_helper;

our $VERSION = '1.69';

=head1 NAME

Bio::ToolBox::db_helper - helper interface to various database formats

=head1 DESCRIPTION

In most cases, this module does not need to be used directly. The
methods available to L<Bio::ToolBox::Data> and L<Bio::ToolBox::Data::Feature>
provide convenient access to the methods described here.

These are helper subroutines to work with relevant databases that can be
accessed through BioPerl modules. These include the L<Bio::DB::SeqFeature::Store>
relational database as well as L<Bio::DB::HTS>, L<Bio::DB::Sam>,
L<Bio::DB::BigWig>, L<Bio::DB::BigWigSet>, L<Bio::DB::BigBed>, and
L<Bio::DB::USeq> databases. The primary functions
include opening connections to these databases, verifying features found
within the database, generating lists of features or genomic intervals for
subsequent analysis, and collecting features and/or scores within regions
of interest. Collected scores may be summarized using a variety of statistical
methods.

When collecting scores or features, the data may be hosted in a variety of
formats and locations. These include:

=over 4

=item SeqFeature database

Genomic feature annotation, including genes, transcripts, exons, etc,
may be stored in a L<Bio::DB::SeqFeature::Store> database. These
databases are backed by a relational database, such as
MySQL, PostGreSQL, or SQLite. Small GFF3 files may also be loaded in
an in-memory database. These are typically loaded from GFF3
files; see the adapter documentation for more information. Dataset
scores, such as from microarray, may also be stored as the score
value in the source GFF file.

References to local, binary, indexed files may also be included as
attributes to features stored in the database. This is legacy support
from the L<GBrowse genome browser|http://gmod.org/wiki/GBrowse>, and should
not be used in preference to other methods. Supported files include the
legacy binary wig files (F<.wib>, see L<Bio::Graphics::Wiggle>) using the
C<wigfile> attribute, or bigWig files using the C<wigfile> or C<bigwigfile>
attribute. The attribute values must be full paths.

=item BigWig files

BigWig files (F<.bw> and F<.bigwig>) are compressed, binary, indexed
versions of text wig files and may be accessed either locally or remotely.
They support extremely fast score retrieval from any genomic location of
any size without sacrificing resolution (spatial and numeric). See
L<http://genome.ucsc.edu/goldenPath/help/bigWig.html> for more information.
BigWig files are supported by either the L<Bio::DB::Big> or the
L<Bio::DB::BigWig> adapter, based on either the external C<libBigWig>
or the UCSC libraries,
respectively; see their respective documentation for more information.

=item Directory of BigWig files

A directory containing two or more BigWig files is assembled into a
BigWigSet, allowing for metadata, such as strand, to be associated with
BigWig files. Additional metadata beyond the filename may be included in
a text file, F<metadata.txt>, within the directory. See the
L<Bio::DB::BigWigSet> adapter documentation for more information. When using
L<Bio::DB::Big> adapters, BigWigSet support is provided natively by
the BioToolBox package.

=item BigBed files

BigBed files are compressed, binary, indexed versions of text BED files. See
L<http://genome.ucsc.edu/goldenPath/help/bigBed.html> for more information.
Both local and remote files may be accessed. BigBed files are supported by
either the L<Bio::DB::Big> or the L<Bio::DB::BigBed> adapter, based on either
the external C<libBigWig> or the UCSC libraries,
respectively; see their respective documentation for more information.

=item Bam files

Bam files (F<.bam>) are compressed, binary, indexed versions of the text SAM file,
or sequence alignment map. They are used with next generation sequencing
technologies. They support individual alignment retrieval as well as
read depth coverage. Two different Bam file adapters are supported.

The L<Bio::DB::HTS> adapter is an interface to the Bam (or Cram) file.
This is based on the modern HTSlib C library (version E<gt>= 1.0), successor
to the original samtools library. See L<http://www.htslib.org> for
more information.

The L<Bio::DB::Sam> adapter is an older interface to the Bam alignment
file. This is based on the samtools C library version E<lt>= 0.1.19. While
this adapter is still supported, the above should be used for new installs.

=item Cram files

Cram files (F<.cram>) are similar to Bam files, but are much smaller due to
only storing sequence differences for each alignment. As such, they require
an indexed, reference fasta to regenerate the complete alignment. Cram files
are only supported by the L<Bio::DB::HTS> adapter.

=item USeq files

USeq files (F<.useq>) are compressed, binary, indexed files that support
multiple information types, including region intervals, BED type
annotations, or wig type scores distributed across the genome. They
support rapid, random access across the genome and are comparable to
both BigWig and BigBed files. See L<http://useq.sourceforge.net>
for more information. Files may only be local. These files are supported
by the L<Bio::DB::USeq> adapter.

=item Fasta files

Fasta files (F<.fa> or F<.fasta>) may be opened. Fasta files are indexed to
rapidly and randomly fetch genomic sequence. Three different adapters are
available for indexing the fasta file.

The L<Bio::DB::HTS::Faidx> adapter is preferentially used, and requires a
F<.fai> index file.

The L<Bio::DB::Sam::Fai> adapter may alternatively be used, and also requires a
F<.fai> index file.

The older style L<Bio::DB::Fasta> adapter is much slower, but will index
either a single genomic fasta file or a directory of individual chromosome
or contig fasta files.

Additionally, genomic sequence stored in a L<Bio::DB::SeqFeature::Store>
annotation database may also be used.

=back

While the functions within this library may appear to be simply a rehashing of
the methods and functions in L<Bio::DB::SeqFeature::Store> and other modules, they
either provide a simpler interface to often-used database methodologies or are
designed to work intimately with the BioToolBox data format file and data structures
(see L<Bio::ToolBox::Data>). One key advantage to these functions is the ability
to work with datasets that are stranded (transcriptome data, for example).

Historically, this module was initially written to use L<Bio::DB::GFF> for
database usage. It has since been re-written to use the L<Bio::DB::SeqFeature::Store>
database schema. Additional functionality has also been added for other databases
and file formats.

Complete usage and examples for the functions are provided below.

=head1 USAGE

Call the module at the beginning of your perl script and include the module
names to export. None are exported by default.

    use Bio::ToolBox::db_helper qw(
        get_new_feature_list
        get_db_feature
    );

This will export the indicated subroutine names into the current namespace.
Their usage is detailed below.

=head1 EXPORTED SUBROUTINES

=over 4

=item C<$BAM_ADAPTER> variable

=item C<$BIG_ADAPTER> variable

These variables control which Bam and bigWig/bigBed adapters are used
during execution in installations where both adapters are present. These
variables are not set initially; adapters are chosen automatically when an
adapter is requested during execution and modules are evaluated for loading.

=item use_bam_adapter

=item use_big_adapter

These are convenience methods for explicitly setting and retrieving the
C<$BAM_ADAPTER> and C<$BIG_ADAPTER> variables, respectively.

Optionally pass the value to set. It will always return the value being used.
Values include the following:

=over 4

=item Bam adapter

  * hts     Bio::DB::HTS (default if available)
  * sam     Bio::DB::Sam
  * none    no adapters allowed

=item Big adapter

  * big     Bio::DB::Big (default if available)
  * ucsc    Bio::DB::BigWig and Bio::DB::BigBed
  * none    no adapters allowed

=back

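As an illustrative sketch (not from the module's own synopsis), an adapter
may be pinned before any files are opened; the values follow the tables
above, and both adapter families are assumed to be installed:

    use Bio::ToolBox::db_helper qw(
        use_bam_adapter
        use_big_adapter
        open_db_connection
    );

    # explicitly request the legacy adapters; each setter
    # returns the value now in effect
    use_bam_adapter('sam');
    use_big_adapter('ucsc');

    # subsequent connections are made with the requested adapters
    my $db = open_db_connection('/path/to/file.bam');
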
=item open_db_connection

    my $db_name = 'cerevisiae';
    my $db = open_db_connection($db_name);

    my $file = '/path/to/file.bam';
    my $db = open_db_connection($file);

    # within a forked child process
    # reusing the same variable to simplify code
    # pass second true value to avoid cache
    $db = open_db_connection($file, 1);

This is a simplified, generalized method that will open a connection to a
database or indexed file using any one of a number of different
BioPerl-style adapters. For local and remote files, the appropriate
adapter is determined by the file extension. Relational database
connection and authentication information is checked in a configuration
file. Once identified, the appropriate Perl module adapter is loaded
during run time automatically, saving the user from knowing in advance
to C<use> the appropriate module in the script.

Pass the name of a relational database or the path of the database file to
the subroutine. The opened database object is returned. If it fails, then
an error message should be generated and nothing is returned.

B<Important!> If you are forking your perl process, B<always> re-open your
database objects in the child process, and pass a second true value
to avoid using the cached database object. By default, opened databases are
cached to improve efficiency, but this will be disastrous when crossing forks.

=over 4

=item SeqFeature Store database

Provide the name of the relational database to open. Parameters for
connecting to a relational database are stored in the BioToolBox
configuration file, F<.biotoolbox.cfg>. These include database adaptors,
user name, password, etc. Further information regarding the configuration
file may be found in L<Bio::ToolBox::db_helper::config>.

For SQLite databases, provide the path to the F<.sqlite> or F<.db> file.

For in-memory databases, provide the path to the F<.gff> or F<.gff3> file.
The larger the file, the more memory and processing time is consumed as
the file is parsed into memory. This may be fine for small model organisms
like yeast, but not for vertebrate annotation.

=item Bam file database

Provide the path to the F<.bam> file. This may be a local file, or a remote
URL (generally supported by the adapter). If a local F<.bam.bai> index file
is not present, it will automatically be built prior to opening; this may
fail if the bam file is not sorted. Remote files are not indexed.

=item Cram file database

Provide the path to the local F<.cram> file. Currently, supplementary fasta
files are not supported, so the Cram file header must point to a valid
reference. Cram files will be indexed automatically if an index is not available.

=item BigWig file database

Provide the path to a local F<.bw> or F<.bigwig> file, or URL to a
remote file.

=item BigWigSet database

A local or remote directory of one or more BigWig files that can be treated
collectively as a database. A special text file may be included in the
directory to assign metadata to each BigWig file, including attributes such
as type, source, display name, strand, etc. See L<Bio::DB::BigWigSet> for
more information on the formatting of the metadata file. If a metadata
file is not included (it's not necessary), then the filenames are used
as the name and type. Strand may be parsed from the filenames if the
basename ends in F<.f> or F<.r>.

=item BigBed database

Provide the path to a local F<.bb> or F<.bigbed> file, or URL to a
remote file.

=item USeq database

Provide the path to a local F<.useq> file. Remote files are not
supported.

=item Fasta database

Provide the path to a local fasta file for rapidly fetching genomic
sequence. If the fasta file is not indexed, then it will be indexed
automatically upon opening.

=back

=item get_db_name

    my $name = get_db_name($db);

This method will attempt to get the name of an opened Database
object if, for some reason, it's unknown, e.g. the user only provided an
opened db object. This only works for some databases, at least those I've
bothered to track down and find a usable API call to use.

=item get_dataset_list

    my $db_name = 'cerevisiae';
    my @types = get_dataset_list($db_name);

This method will retrieve a list of the available features stored in the
database and returns an array of the features' types.

Pass either the name of the database or an established database object.
Supported databases include both L<Bio::DB::SeqFeature::Store> and
L<Bio::DB::BigWigSet> databases.

For L<Bio::DB::SeqFeature::Store> databases, the type is represented as
"C<primary_tag:source>", corresponding to the third and second GFF columns,
respectively. The types are sorted alphabetically first by source,
then by method.

For L<Bio::DB::BigWigSet> databases, the C<type>, C<method>, C<primary_tag>,
or C<name> attribute may be used, in that respective order of
availability. The list is sorted alphabetically.

=item verify_or_request_feature_types

    # verify a dataset, either file or database type
    my $db_name = 'cerevisiae';
    my $dataset = 'microarray_data';
    my @features = verify_or_request_feature_types(
        db      => $db_name,
        feature => $dataset,
    );
    unless (@features) {
        die " no valid feature provided!\n";
    }

    # select a list of features
    my $db_name = 'cerevisiae';
    my @types; # user did not provide any
    @types = verify_or_request_feature_types(
        db      => $db_name,
        feature => \@types,
    );
    # user will be prompted to select from a database list
    if (@types) {
        print "using types " . join(", ", @types);
    }
    else {
        die " no valid features selected!\n";
    }

This subroutine will process a list of feature types or data sources to be
used for data or feature collection. There are two modes of action. If a
list was provided, it will process and verify the list. If no list was
provided, then the user will be prompted to select from a list of available
feature types in the provided database.

The provided list may be a list of feature types or a list of single
file data sources, including Bam, bigWig, or bigBed files. If the list
includes feature types, they are verified as existing in the provided
database. If the list includes file names, the local files are checked
and the names prefixed with "C<file:>" for use downstream.

If no list was provided, then a list of available feature types in the
provided database will be presented to the user, and the user prompted
to make a selection. One or more types may be selected, and a single
item may be enforced if requested. The response is filtered through
the parse_list method from L<Bio::ToolBox::utility>, so a mix of single
numbers or a range of numbers may be accepted. The responses are then
validated.

Two types of databases are supported: L<Bio::DB::SeqFeature::Store> and
L<Bio::DB::BigWigSet> databases. For SeqFeature Store databases, the
type is comprised of "C<primary_tag:source>", corresponding to the third and
second GFF columns. For BigWigSet databases, types correspond to either
the C<type>, C<method>, C<primary_tag>, or C<name> attributes,
in that order of availability.

Feature types or files that pass validation are returned as a
list. Types or files that do not pass validation are printed to C<STDOUT>.

To use this subroutine, pass an array with the following keys
and values. Not every key is required.

=over 4

=item db

The name of the database or a reference to an established BioPerl
database object. A L<Bio::DB::SeqFeature::Store> or
L<Bio::DB::BigWigSet> database is typically used. Only required when
no features are provided.

=item feature

Pass either a single dataset name as a scalar or an anonymous array
reference of a list of dataset names. These may have been provided as
a command line option and need to be verified. If nothing is passed,
then a list of possible datasets will be presented to the user to be
chosen.

=item prompt

Provide a phrase to be prompted to the user to help in selecting
datasets from the list. If none is provided, a generic prompt will be
used.

=item single

A Boolean value (1 or 0) indicating whether only a single dataset is
allowed when selecting datasets from a presented list. If true, only
one dataset choice is accepted. If false, one or more dataset choices
are allowed. Default is false.

=item limit

Optionally provide a word or regular expression that matches the
feature type (C<primary_tag> only; C<source>, if present, is ignored).
For example, provide "gene|mrna" to only present gene and mRNA
features to the user. This is only applicable when a user must select
from a database list. The default is to list all available feature
types.

=back

=item check_dataset_for_rpm_support($dataset, [$cpu])

    # count the total number of alignments
    my $dataset = '/path/to/file.bam';
    my $total_count = check_dataset_for_rpm_support($dataset);

    # use multithreading
    my $total_count = check_dataset_for_rpm_support($dataset, $cpu);

This subroutine will check a dataset for RPM, or Reads Per Million mapped,
support. Only two types of database files support this, Bam files and
BigBed files. If the dataset is either one of these, or the name of a
database feature which points to one of these files, then it will
calculate the total number of mapped alignments (Bam file) or features
(BigBed file). It will return this total number. If the dataset does
not support RPM (because it is not a Bam or BigBed file, for example),
then it will return undefined.

Pass this subroutine one or two values. The first is the name of the
dataset. Ideally it should be validated using verify_or_request_feature_types()
and have an appropriate prefix (file, http, or ftp).

For multi-threaded execution pass a second value, the number of CPU cores
available to count Bam files. This will speed up counting bam files
considerably. The default is 2 for environments where L<Parallel::ForkManager>
is installed, or 1 where it is not.

=item get_new_feature_list

    my $Data = Bio::ToolBox::Data->new;
    my $db_name = 'cerevisiae';
    my %data = get_new_feature_list(
        data     => $Data,
        db       => $db_name,
        features => 'genes',
    );

This subroutine will generate a new feature list collected from an annotation
database, such as a L<Bio::DB::SeqFeature::Store> database. It will populate
an empty L<Bio::ToolBox::Data> object data table with a list of the collected
features. Columns include the Primary ID, Name, and feature Type.

The subroutine is passed an array containing three required arguments. The keys
include

=over 4

=item data

An empty L<Bio::ToolBox::Data> object. This is required.

=item db

The name of the database or a reference to an established database object.
Currently, this must be a L<Bio::DB::SeqFeature::Store> database.
This is required.

=item features

A scalar value containing a name representing the feature type(s) of
feature(s) to collect. The type may be a C<primary_tag> or C<primary_tag:source>.
It may be a single element or a comma-delimited list. This is required.

=item chrskip

Provide a regular-expression compatible string for skipping features from
specific chromosomes, for example mitochondrial or unmapped contigs.

=back

Status messages, including the number of features found and kept, are always
printed to C<STDOUT>.

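As a sketch, the optional C<chrskip> key may be combined with the three
required keys above; for example, to drop mitochondrial and unplaced
contigs (the chromosome name pattern shown is illustrative):

    my $Data = Bio::ToolBox::Data->new;
    my %data = get_new_feature_list(
        data     => $Data,
        db       => 'cerevisiae',
        features => 'gene',
        chrskip  => 'chrM|chrUn',  # regex of chromosomes to skip
    );
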
=item get_new_genome_list

    my $Data = Bio::ToolBox::Data->new();
    my $db_name = 'cerevisiae';
    my $window_size = 500;
    my $step_size = 250;
    my %data = get_new_genome_list(
        data => $Data,
        db   => $db_name,
        win  => $window_size,
        step => $step_size,
    );

This subroutine will generate a new list of genomic windows or intervals.
The genome is split into intervals of a specified size with the specified
step size.

The subroutine is passed an array containing the required arguments.

=over 4

=item data

An empty L<Bio::ToolBox::Data> object. This is required.

=item db

The name of the database or a reference to an established database
object. Any BioPerl adapter, including those described in the L</DESCRIPTION>,
is supported. Required.

=item win

A scalar value containing an integer representing the size of the
window in basepairs. The default value is 500 bp.

=item step

A scalar value containing an integer representing the step size for
advancing the window across the genome. The default is the window size.

=item chrskip

=item skip

=item exclude

Provide a regular expression compatible string for excluding specific
chromosomes or classes of chromosomes, such as the mitochondrial or
unmapped contigs.

=back

Status messages are printed to C<STDOUT>.

=item get_db_feature

    my $db = open_db_connection('annotation.db');
    my $seqfeature = get_db_feature(
        db   => $db,
        type => 'gene',
        name => 'ABC1',
    );

This subroutine will retrieve a specific feature from a L<Bio::DB::SeqFeature::Store>
database for subsequent analysis, manipulation, and/or score retrieval using the
L<Bio::ToolBox::Data::Feature> methods. It relies upon unique information to pull out a
single, unique feature.

Several attributes may be used to pull out the feature, including the feature's
unique database primary ID, name and/or aliases, and GFF type (C<primary_tag>
and/or C<source>). The L</get_new_feature_list> subroutine will generate a list
of features with their unique identifiers.

The C<primary_id> attribute is preferentially used as it provides the best
performance. However, it is not always portable between databases or even a
database re-load. In that case, the C<name> and C<type> are used to identify
potential features. Note that the C<name> may not always be unique in the
database. In this case, the addition of aliases may help. If all else fails, a new
feature list should be generated.

To get a feature, pass an array of arguments.

583
|
|
|
|
|
|
|
=over 4 |

=item db

The name of the L database or a reference
to an established database object.

=item id

Provide the C tag. In the L
database schema this is a (usually) non-portable, unique identifier
specific to a database. It provides the fastest lookup.

=item name

A scalar value representing the feature C. Aliases may be
appended with semicolon delimiters.

=item type

Provide the feature type, which is typically expressed as
C. Alternatively, provide just the C only.

=back

While it is possible to identify features with any two attributes
(or possibly just name or ID), the best performance is obtained with
all three together. The first SeqFeature object is returned if multiple
are found.
=item get_segment_score

    my $score = get_segment_score(
        $seqfeature->seq_id,    # chromosome
        $seqfeature->start,     # start
        $seqfeature->end,       # end
        $seqfeature->strand,    # strand
        'sense',                # strandedness
        'mean',                 # method
        0,                      # return type
        $db,                    # database
        'scores.bw'             # datasets
    );

Generic method for collecting score(s) from a database. This is the
primary interface to all of the database handling packages, including
Bam, bigWig, bigBed, and USeq files, as well as SeqFeature databases.
By passing common parameters here, the appropriate database or file
handling packages are loaded during run time and the appropriate
methods are called. Essentially, this is where the magic happens.

Pass the method an array of nine (or more) parameters. The parameters
are not keyed as a hash, and the order is explicit. However, see the
L module for named index positions.
The order is as follows.
=over 4

=item * 0 Chromosome

String representing the chromosome or sequence ID.

=item * 1 Start

Start coordinate (integer, 1-based).

=item * 2 Stop

Stop or end coordinate (integer, 1-based).

=item * 3 Strand

BioPerl-style strand value, including -1, 0, and 1.

=item * 4 Strandedness

A string indicating strandedness: sense, antisense, or all.

=item * 5 Method

A string representing the method of data collection.

Current acceptable methods include mean, median, sum, min, max,
stddev, count, ncount (named count), and pcount (precise count).
See the individual db_helper sub classes for more details. Not all
adapters support all methods.

=item * 6 Return type

An integer (0, 1, 2) representing the type of value to return.

A return type of 0 indicates a single score should be returned,
calculated using the specified method. A return type of 1 is an
array or array reference of all scores found. A return type of 2
is a hash or hash reference of coordinate => score.

=item * 7 Dataset database

This is either a L or a L
database containing multiple score dataset features. If your dataset
does not contain multiple different feature types, pass C.

Pass either a string containing the name of the database (which will be
opened) or an already opened database object.

=item * 8 Dataset

This is the verified dataset from which the scores will be collected. The
dataset should be verified using the L
subroutine. For indexed file datasets, e.g. F<.bam> or F<.bw> files,
this will be the verified local path or URL. For a database, this will be
a database type, such as C or C,
or the path to a data file, e.g. a Bam, bigWig, bigBed, or USeq file.

If multiple datasets are to be combined, simply append them to the
array (index 9, 10, etc).

=back

The returned item is dependent on the value of the return type code.
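The three return type codes map onto three Perl data shapes. The following
mock sketch illustrates those shapes only; the C<%coord2score> data is
invented and this is not the actual C<get_segment_score> implementation:

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# mock scores keyed by coordinate, standing in for a real dataset query
my %coord2score = ( 100 => 1.5, 150 => 2.5, 200 => 3.0 );
my @scores = values %coord2score;

# return type 0: a single score calculated with the method (here, mean)
my $single = sum0(@scores) / ( scalar(@scores) || 1 );

# return type 1: an array reference of all scores found
my $all = [@scores];

# return type 2: a hash reference of coordinate => score
my $indexed = { %coord2score };
```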
=item calculate_score

    my $scores = get_segment_score($chr, $start, $stop, $strand,
        'sense', 'ncount', 1, undef, 'file:/path/to/dataset.bam');
    my $score = calculate_score('ncount', $scores);

This subroutine will perform the math to calculate a single score
from an array of values. Current acceptable methods include mean,
median, sum, min, max, stddev, count, ncount, and pcount.

Pass the subroutine two items: the name of the mathematical method
enumerated above, and an array reference of the values to work on.
A single score will be returned.
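The difference between the C<count> and C<ncount> methods is whether
duplicate values are tallied or collapsed to unique names. A self-contained
sketch of that distinction, assuming the behavior described above (this is
not the module's actual code):

```perl
use strict;
use warnings;

# read names overlapping a region; duplicates arise when one read
# overlaps the region at more than one position
my @names = qw(read1 read2 read1 read3 read2);

# count: every value is tallied
my $count = scalar(@names);    # 5

# ncount: unique names only, so one read is counted exactly once
my %seen;
$seen{$_} += 1 for @names;
my $ncount = scalar( keys %seen );    # 3
```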
=item get_chromosome_list

    my $db = open_db_connection('cerevisiae');
    # get all chromosomes in the database
    my @chromosomes = get_chromosome_list($db);
    foreach (@chromosomes) {
        my $name   = $_->[0];
        my $length = $_->[1];
        print "chromosome $name is $length bp\n";
    }

This subroutine will collect a list of chromosomes or reference sequences
in a Bio::DB database and return the list along with their sizes in bp.
Many BioPerl-based databases are supported, including
L, L, L,
L, L, L,
L, or any others that support the C method.
L adapters for big files are also supported, though see the note
below. See the L subroutine for more information.

Pass the subroutine either the name of the database, the path to the
database file, or an opened database object.

Optionally pass a second value, a regular expression compatible string
for skipping specific chromosomes or chromosome classes, such as mitochondrial
or unmapped contigs. The default is to return all chromosomes.

The subroutine will return an array, with each element representing a
chromosome or reference sequence in the database. Each element is an anonymous
array of two elements: the chromosome name and its length in bp.

B<Note>: L databases for bigWig and bigBed files do
not return chromosomes in the same order as in the file; they are returned
in a sorted order that may differ from the original.
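Since each returned element is an anonymous C<[name, length]> array, the
optional skip pattern amounts to a filter on the name element. A sketch with
made-up chromosomes and lengths:

```perl
use strict;
use warnings;

# what get_chromosome_list conceptually returns: [name, length] pairs
my @chromosomes = (
    [ 'chrI',  230218 ],
    [ 'chrII', 813184 ],
    [ 'chrM',  85779 ],
);

# apply a skip pattern against the name element
my $skip = qr/chrM/;
my @kept = grep { $_->[0] !~ $skip } @chromosomes;

foreach (@kept) {
    my ( $name, $length ) = @$_;
    printf "chromosome %s is %d bp\n", $name, $length;
}
```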
=item low_level_bam_coverage

    my $coverage = low_level_bam_coverage($sam, $tid, $start, $stop);

This is a convenience method for running the low level bam coverage method.
Since both L and L bam file adapters are
supported, and each has slight variation in the API syntax, this method helps
to abstract the actual method and use the appropriate syntax depending on
which adapter is loaded. It is best if the C<$sam> object was opened using the
L method, or that C<$BAM_ADAPTER> is set.

B<Note> that this is the B<low level> coverage method based on the index object,
and not the similarly named high level API method. Read the adapter
documentation for proper usage.
=item low_level_bam_fetch

    my $success = low_level_bam_fetch($sam, $tid, $start, $stop, $callback, $data);

This is a convenience method for running the low level bam fetch method.
Since both L and L bam file adapters are
supported, and each has slight variation in the API syntax, this method helps
to abstract the actual method and use the appropriate syntax depending on
which adapter is loaded. It is best if the C<$sam> object was opened using the
L method, or that C<$BAM_ADAPTER> is set.

B<Note> that this is the B<low level> fetch method based on the index object,
and not the similarly named high level API method. Read the adapter
documentation for proper usage.
=item get_genomic_sequence

    my $sequence = get_genomic_sequence($db, $chrom, $start, $stop);

This is a convenience wrapper function for fetching genomic sequence from
three different supported fasta database adapters, which, of course, all
have different APIs, including the L, L,
and L adapters, as well as L
and possibly other BioPerl databases.

Pass the opened database object, chromosome, start, and stop coordinates. This
assumes BioPerl standard 1-base coordinates. Only the forward sequence is
retrieved. The sequence is returned as a simple string.

=back
=head1 INTERNAL SUBROUTINES

These are not intended for general consumption but are documented here
so that at least some of us know what is going on.

=over 4

=item _lookup_db_method

Internal method for determining which database adapter and method call
should be used given a set of parameters. This result is cached for
future data queries (likely hundreds or thousands of repeated calls).
The method is stored in the global hash C<%DB_METHODS>.

Pass the parameter array reference from L. This
should load the appropriate db_helper sub module during run time.

=item _load_helper_module

Subroutine to load and import a db_helper module during run time.
Sets the appropriate global variable if successful.

=item _load_bam_helper_module

Subroutine to determine which Bam adapter db_helper module to load,
either the older samtools-based adapter or the newer HTSlib-based adapter.
Uses the global and exportable variable C<$BAM_ADAPTER> to set a
preference for which adapter to use. Use 'sam' or 'hts' or some
string containing these names. Do B<not> change this variable after
loading the helper module; it will not do what you think it will do.
If both are available and a preference is not set, then the hts helper
is the default.

=back
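The run-time loading these helpers perform is built on L<Module::Load>. A
minimal sketch of the pattern, using the core C<POSIX> module as a stand-in
for a db_helper submodule (the C<$ok> flag plays the role of the module's
C<$..._OK> globals):

```perl
use strict;
use warnings;
use Module::Load;    # core module for loading packages by name at run time

# load a module by name at run time, as _load_helper_module does;
# POSIX stands in for a Bio::ToolBox::db_helper submodule here
my $module = 'POSIX';
my $ok = eval { load $module; 1 } ? 1 : 0;

# on success a helper would set its corresponding $..._OK flag
print "loaded $module\n" if $ok;
```

Wrapping C<load> in C<eval> turns a missing optional dependency into a flag
check rather than a fatal error.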
=cut

use strict;
require Exporter;
use Carp qw(carp cluck croak confess);
use Module::Load;    # for dynamic loading during runtime
use List::Util qw(min max sum0 uniq);
use Statistics::Lite qw(median range stddevp);
use Bio::ToolBox::db_helper::constants;
use Bio::ToolBox::utility;
# check values for dynamically loaded helper modules
# these are loaded only when needed during runtime to avoid wasting resources
our $BAM_OK      = 0;
our $BIGBED_OK   = 0;
our $BIGWIG_OK   = 0;
our $SEQFASTA_OK = 0;
our $USEQ_OK     = 0;
our $BAM_ADAPTER = undef;    # preference for which bam adapter to use
our $BIG_ADAPTER = undef;

# define reusable variables
my %TOTAL_READ_NUMBER;       # for rpm calculations
my $PRIMARY_ID_WARNING;      # for out of date primary IDs
my %OPENED_DB;               # cache for some opened Bio::DB databases
my %DB_METHODS;              # cache for database score methods
# score calculators
my %SCORE_CALCULATOR_SUB = (
    'mean' => sub {
        my $s = shift;
        return sum0(@$s) / ( scalar(@$s) || 1 );
    },
    'sum' => sub {
        my $s = shift;
        return sum0(@$s);
    },
    'median' => sub {
        my $s = shift;
        return '.' unless scalar(@$s);
        return median(@$s);
    },
    'min' => sub {
        my $s = shift;
        return '.' unless scalar(@$s);
        return min(@$s);
    },
    'max' => sub {
        my $s = shift;
        return '.' unless scalar(@$s);
        return max(@$s);
    },
    'count' => sub {
        my $s = shift;
        return scalar(@$s);
    },
    'pcount' => sub {
        my $s = shift;
        return scalar(@$s);
    },
    'ncount' => sub {
        # convert names into unique counts
        my $s = shift;
        my %name2count;
        foreach my $n (@$s) {
            if ( ref($n) eq 'ARRAY' ) {
                # this is likely from a ncount indexed hash
                foreach (@$n) {
                    $name2count{$_} += 1;
                }
            }
            else {
                $name2count{$n} += 1;
            }
        }
        return scalar( keys %name2count );
    },
    'range' => sub {
        # the range value is 'min-max'
        my $s = shift;
        return '.' unless scalar(@$s);
        return range(@$s);
    },
    'stddev' => sub {
        # we are using the standard deviation of the population,
        # since these are the only scores we are considering
        my $s = shift;
        return '.' unless scalar(@$s);
        return stddevp(@$s);
    },
    'rpm' => sub {
        confess " The rpm methods have been deprecated due to complexity and " .
            "the variable way of calculating the value. Collect counts and " .
            "calculate your preferred way.\n";
    },
    'rpkm' => sub {
        confess " The rpm methods have been deprecated due to complexity and " .
            "the variable way of calculating the value. Collect counts and " .
            "calculate your preferred way.\n";
    }
);

# Exported names
our @ISA       = qw(Exporter);
our @EXPORT    = qw();
our @EXPORT_OK = qw(
    $BAM_ADAPTER
    $BIG_ADAPTER
    use_bam_adapter
    use_big_adapter
    open_db_connection
    get_dataset_list
    verify_or_request_feature_types
    check_dataset_for_rpm_support
    get_new_feature_list
    get_new_genome_list
    validate_included_feature
    get_db_feature
    get_segment_score
    calculate_score
    get_chromosome_list
    low_level_bam_coverage
    low_level_bam_fetch
    get_genomic_sequence
);

# The true statement
1;

### Open a connection to the SeqFeature Store MySQL database

sub use_bam_adapter {
    my $a = shift || undef;
    $BAM_ADAPTER = $a if $a;
    return $BAM_ADAPTER;
}

sub use_big_adapter {
    my $a = shift || undef;
    $BIG_ADAPTER = $a if $a;
    return $BIG_ADAPTER;
}

sub open_db_connection {
    my $database = shift;
    my $no_cache = shift || 0;
    unless ($database) {
        # cluck 'no database name passed!';
        return;
    }

    # first check if it is a database reference
    my $db_ref = ref $database;
    if ( $db_ref =~ /DB|big::BigWigSet/ ) {
        # the provided database is already an open database object
        # nothing to open, return as is
        return $database;
    }

    # check to see if we have already opened it
    if ( exists $OPENED_DB{$database} and not $no_cache ) {
        # return the cached database connection
        # but NOT if user explicitly requested no cached databases
        # DO NOT reuse database objects if you have forked!!! Bad things happen
        return $OPENED_DB{$database};
    }

    # skip parsed databases
    return if $database =~ /^Parsed:/;    # we don't open parsed annotation files

    # remove file prefix, just in case
    $database =~ s/^file://;

    ### Attempt to open the database
    # we go through a series of checks to determine if it is remote, local,
    # an indexed big data file, SQLite file, etc
    # when all else fails, try to open a SQL connection
    my $db;
    my $error;

    # check if it is a remote file
    if ( $database =~ /^(?:https?|ftp)/i ) {

        # a remote Bam database
        if ( $database =~ /\.bam$/i ) {
            # open using Bam adaptor
            _load_bam_helper_module() unless $BAM_OK;
            if ($BAM_OK) {
                $db = open_bam_db($database);
                unless ($db) {
                    $error = " ERROR: could not open remote Bam file" .
                        " '$database'! $!\n";
                }
            }
            else {
                $error = " Bam database cannot be loaded because\n" .
                    " Bio::DB::Sam or Bio::DB::HTS is not installed\n";
            }
        }

        # a remote BigBed database
        elsif ( $database =~ /\.(?:bb|bigbed)$/i ) {
            # open using BigBed adaptor
            $BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK;
            if ($BIGBED_OK) {
                $db = open_bigbed_db($database);
                unless ($db) {
                    $error = " ERROR: could not open remote BigBed file" .
                        " '$database'! $!\n";
                }
            }
            else {
                $error = " BigBed database cannot be loaded because\n" .
                    " Bio::DB::Big or Bio::DB::BigBed is not installed\n";
            }
        }

        # a remote BigWig database
        elsif ( $database =~ /\.(?:bw|bigwig)$/i ) {
            # open using BigWig adaptor
            $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
            if ($BIGWIG_OK) {
                $db = open_bigwig_db($database);
                unless ($db) {
                    $error = " ERROR: could not open remote BigWig file" .
                        " '$database'! $!\n";
                }
            }
            else {
                $error = " BigWig database cannot be loaded because\n" .
                    " Bio::DB::Big or Bio::DB::BigWig is not installed\n";
            }
        }

        # a remote useq file
        elsif ( $database =~ /\.useq$/ ) {
            # uh oh! remote useq files are not supported
            $error = " ERROR: remote useq files are not supported!\n";
        }

        # a remote fasta file???
        elsif ( $database =~ /\.fa(?:sta)?$/i ) {
            # uh oh! remote fasta files are not supported
            $error = " ERROR: remote fasta files are not supported!\n";
        }

        # a presumed remote directory, presumably of bigwig files
        else {
            # open using BigWigSet adaptor
            $BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig')
                unless $BIGWIG_OK;
            if ($BIGWIG_OK) {
                $db = open_bigwigset_db($database);
                unless ($db) {
                    $error = " ERROR: could not open presumed remote " .
                        "BigWigSet directory '$database'! $!\n";
                }
            }
            else {
                $error = " Presumed BigWigSet database cannot be loaded because\n" .
                    " Bio::DB::BigWigSet is not installed\n";
            }
        }

    }

    # a local existing file
    elsif ( -f $database ) {

        # a Bam database
        if ( $database =~ /\.bam$/i ) {
            # open using appropriate bam adaptor
            _load_bam_helper_module() unless $BAM_OK;
            if ($BAM_OK) {
                undef $@;
                $db = open_bam_db($database);
                unless ($db) {
                    $error = " ERROR: could not open local Bam file" .
                        " '$database'! $@\n";
                }
            }
            else {
                $error = " Bam database cannot be loaded because\n" .
                    " Bio::DB::Sam or Bio::DB::HTS is not installed\n";
            }
        }

        # a BigBed database
        elsif ( $database =~ /\.(?:bb|bigbed)$/i ) {
            # open using BigBed adaptor
            $BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK;
            if ($BIGBED_OK) {
                undef $@;
                $db = open_bigbed_db($database);
                unless ($db) {
                    $error = " ERROR: could not open local BigBed file" .
                        " '$database'! $@\n";
                }
            }
            else {
                $error = " BigBed database cannot be loaded because\n" .
                    " Bio::DB::Big or Bio::DB::BigBed is not installed\n";
            }
        }

        # a BigWig database
        elsif ( $database =~ /\.(?:bw|bigwig)$/i ) {
            # open using BigWig adaptor
            $BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
            if ($BIGWIG_OK) {
                undef $@;
                $db = open_bigwig_db($database);
                unless ($db) {
                    $error = " ERROR: could not open local BigWig file" .
                        " '$database'! $@\n";
                }
            }
            else {
                $error = " BigWig database cannot be loaded because\n" .
                    " Bio::DB::Big or Bio::DB::BigWig is not installed\n";
            }
        }

        # a useq database
        elsif ( $database =~ /\.useq$/i ) {
            # open using USeq adaptor
            $USEQ_OK = _load_helper_module('Bio::ToolBox::db_helper::useq')
                unless $USEQ_OK;
            if ($USEQ_OK) {
                $db = open_useq_db($database);
                unless ($db) {
                    $error = " ERROR: could not open local useq file" .
                        " '$database'! $!\n";
                }
            }
            else {
                $error = " Useq database cannot be loaded because\n" .
                    " Bio::DB::USeq is not installed\n";
            }
        }

        # a Fasta File
        elsif ( $database =~ /\.fa(?:sta)?(?:\.gz)?$/i ) {
            # first try a modern fai indexed adapter
            _load_bam_helper_module() unless $BAM_OK;
            if ($BAM_OK) {
                $db = open_indexed_fasta($database);
                unless ($db) {
                    $error .= " ERROR: could not open indexed fasta file '$database'!\n";
                }
            }
            else {
                # open using the old slow BioPerl Fasta adaptor
                $SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
                    unless $SEQFASTA_OK;
                if ($SEQFASTA_OK) {
                    $db = open_fasta_db($database);
                    unless ($db) {
                        $error .= " ERROR: could not open fasta file '$database'!\n";
                        if ( -e "$database\.index" ) {
                            $error .= " Try deleting $database\.index and try again\n";
                        }
                    }
                }
                else {
                    $error .= " Fasta file could not be loaded because neither" .
                        " Bio::DB::HTS, Bio::DB::Sam, nor Bio::DB::Fasta is installed\n";
                }
            }
        }
|
1209
|
|
|
|
|
|
|
|
|
1210
|
|
|
|
|
|
|
# a gff3 or sqlite database |
|
1211
|
|
|
|
|
|
|
elsif ($database =~ /\.(?:gff3?|gff3?\.gz|db|sqlite)$/) { |
|
1212
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
|
1213
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
|
1214
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
|
1215
|
0
|
|
|
|
|
0
|
$db = open_store_db($database); |
|
1216
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
|
1217
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not load SeqFeature database file '$database'!\n"; |
|
1218
|
|
|
|
|
|
|
} |
|
1219
|
|
|
|
|
|
|
} |
|
1220
|
|
|
|
|
|
|
else { |
|
1221
|
0
|
|
|
|
|
0
|
$error .= " Module Bio::DB::SeqFeature::Store is required to load " . |
|
1222
|
|
|
|
|
|
|
"GFF and database files\n"; |
|
1223
|
|
|
|
|
|
|
} |
|
1224
|
|
|
|
|
|
|
} |
|
1225
|
|
|
|
|
|
|
|
|
1226
|
|
|
|
|
|
|
# a cram database |
|
1227
|
|
|
|
|
|
|
elsif ($database =~ /\.cram$/i) { |
|
1228
|
|
|
|
|
|
|
# open using HTS bam adaptor only |
|
1229
|
0
|
|
0
|
|
|
0
|
$BAM_ADAPTER ||= 'hts'; |
|
1230
|
0
|
0
|
|
|
|
0
|
if ($BAM_ADAPTER eq 'sam') { |
|
1231
|
0
|
|
|
|
|
0
|
$error .= " ERROR: Only HTS adapters support Cram files!\n"; |
|
1232
|
|
|
|
|
|
|
} |
|
1233
|
0
|
0
|
|
|
|
0
|
_load_bam_helper_module() unless $BAM_OK; |
|
1234
|
0
|
0
|
|
|
|
0
|
if ($BAM_OK) { |
|
1235
|
0
|
|
|
|
|
0
|
undef $@; |
|
1236
|
0
|
|
|
|
|
0
|
$db = open_bam_db($database); |
|
1237
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
|
1238
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local Cram file" . |
|
1239
|
|
|
|
|
|
|
" '$database'! $@\n"; |
|
1240
|
|
|
|
|
|
|
} |
|
1241
|
|
|
|
|
|
|
} |
|
1242
|
|
|
|
|
|
|
else { |
|
1243
|
0
|
|
|
|
|
0
|
$error = " Cram database cannot be loaded because\n" . |
|
1244
|
|
|
|
|
|
|
" Bio::DB::HTS is not installed\n"; |
|
1245
|
|
|
|
|
|
|
} |
|
1246
|
|
|
|
|
|
|
} |
|
1247
|
|
|
|
|
|
|
|
|
1248
|
|
|
|
|
|
|
# something unrecognized? |
|
1249
|
|
|
|
|
|
|
else { |
|
1250
|
0
|
|
|
|
|
0
|
$error .= " ERROR! Cannot identify database type based on extension for $database!\n"; |
|
1251
|
|
|
|
|
|
|
} |
|
1252
|
|
|
|
|
|
|
} |
|
1253
|
|
|
|
|
|
|
|
|
1254
|
|
|
|
|
|
|
# a directory, presumably of bigwig files |
|
1255
|
|
|
|
|
|
|
elsif (-d $database) { |
|
1256
|
|
|
|
|
|
|
# try opening using the BigWigSet adaptor |
|
1257
|
0
|
0
|
|
|
|
0
|
$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK; |
|
1258
|
0
|
0
|
|
|
|
0
|
if ($BIGWIG_OK) { |
|
1259
|
0
|
|
|
|
|
0
|
$db = open_bigwigset_db($database); |
|
1260
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
|
1261
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local BigWigSet " . |
|
1262
|
|
|
|
|
|
|
"directory '$database'!\n"; |
|
1263
|
0
|
|
|
|
|
0
|
$error .= " Does directory contain bigWig .bw files?\n"; |
|
1264
|
|
|
|
|
|
|
} |
|
1265
|
|
|
|
|
|
|
} |
|
1266
|
|
|
|
|
|
|
else { |
|
1267
|
0
|
|
|
|
|
0
|
$error = " Presumed BigWigSet database cannot be loaded because\n" . |
|
1268
|
|
|
|
|
|
|
" Bio::DB::Big or Bio::DB::BigWigSet is not installed\n"; |
|
1269
|
|
|
|
|
|
|
} |
|
1270
|
|
|
|
|
|
|
|
|
1271
|
|
|
|
|
|
|
# try opening with the Fasta adaptor |
|
1272
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
|
1273
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
|
1274
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
|
1275
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
|
1276
|
0
|
|
|
|
|
0
|
$db = open_fasta_db($database); |
|
1277
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
|
1278
|
0
|
|
|
|
|
0
|
$error .= " ERROR: could not open fasta directory '$database'!\n"; |
|
1279
|
0
|
|
|
|
|
0
|
$error .= " Does directory contain fasta files? If it contains a" . |
|
1280
|
|
|
|
|
|
|
" directory.index file,\n try deleting it and try again.\n"; |
|
1281
|
|
|
|
|
|
|
} |
|
1282
|
|
|
|
|
|
|
} |
|
1283
|
|
|
|
|
|
|
else { |
|
1284
|
0
|
|
|
|
|
0
|
$error .= " Module Bio::DB::Fasta is required to open presumed fasta " . |
|
1285
|
|
|
|
|
|
|
"directory\n"; |
|
1286
|
|
|
|
|
|
|
} |
|
1287
|
|
|
|
|
|
|
} |
|
1288
|
|
|
|
|
|
|
} |
|
1289
|
|
|
|
|
|
|
|
|
1290
|
|
|
|
|
|
|
# otherwise assume the name of a SeqFeature::Store database in the configuration |
|
1291
|
|
|
|
|
|
|
else { |
|
1292
|
|
|
|
|
|
|
# attempt to open using information from the configuration |
|
1293
|
|
|
|
|
|
|
# using default connection information as necessary |
|
1294
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
|
1295
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
|
1296
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
|
1297
|
0
|
|
|
|
|
0
|
$db = open_store_db($database); |
|
1298
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
|
1299
|
0
|
|
|
|
|
0
|
$error .= " unable to open relational database named '$database'\n"; |
|
1300
|
|
|
|
|
|
|
} |
|
1301
|
|
|
|
|
|
|
} |
|
1302
|
|
|
|
|
|
|
else { |
|
1303
|
0
|
|
|
|
|
0
|
$error .= " Module Bio::DB::SeqFeature::Store is required to connect " . |
|
1304
|
|
|
|
|
|
|
"to databases\n"; |
|
1305
|
|
|
|
|
|
|
} |
|
1306
|
|
|
|
|
|
|
} |
|
1307
|
|
|
|
|
|
|
|
|
1308
|
|
|
|
|
|
|
# conditional return |
|
1309
|
1
|
50
|
|
|
|
3
|
if ($db) { |
|
1310
|
|
|
|
|
|
|
# cache the opened connection for use later |
|
1311
|
1
|
50
|
|
|
|
3
|
$OPENED_DB{$database} = $db unless $no_cache; |
|
1312
|
|
|
|
|
|
|
|
|
1313
|
|
|
|
|
|
|
# return as appropriate either both object and name or just object |
|
1314
|
1
|
|
|
|
|
3
|
return $db; |
|
1315
|
|
|
|
|
|
|
} |
|
1316
|
|
|
|
|
|
|
else { |
|
1317
|
0
|
|
|
|
|
0
|
print STDERR $error; |
|
1318
|
0
|
|
|
|
|
0
|
return; |
|
1319
|
|
|
|
|
|
|
} |
|
1320
|
|
|
|
|
|
|
} |
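
# Example usage of open_db_connection (an illustrative sketch; the file
# names below are hypothetical). The optional second argument corresponds
# to the $no_cache flag seen above, which skips caching the connection:
#
#   my $db = open_db_connection('annotation.sqlite');   # cached in %OPENED_DB
#   my $bw = open_db_connection('scores.bw', 1);        # opened fresh, not cached
#   die "unable to open database!\n" unless $db;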

### Retrieve a database name from a db object
sub get_db_name {
	my $db = shift;
	return unless $db;
	my $db_ref = ref($db);
	return $db unless $db_ref; # presumption that a non-object is just a database name
	my $db_name;
	if ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {
		# a SeqFeature database, using any DBI adapter
		$db_name = $db->{'dbh'}->{'name'};
		# dig through the object internals to identify the original
		# name of the database
		# this should be relatively well documented through DBI
		# but could break in the future since it's not official API
	}
	elsif ($db_ref eq 'Bio::DB::Sam') {
		# a Samtools Bam database
		# old versions <= 1.39 don't have the bam_path method, so it's easier
		# to explicitly grab it from the object internals
		$db_name = $db->{bam_path};
	}
	elsif ($db_ref eq 'Bio::DB::HTS') {
		# a HTSlib Bam database
		$db_name = $db->hts_path;
	}
	# determining the database name from other sources is
	# either not possible or not easy, so won't bother unless
	# there is a really, really good need
	return $db_name;
}

### Retrieve a list of the microarray data sets from the db
sub get_dataset_list {

	my $database = shift;
	my $use_all_features = shift;

	# Open a db connection
	my $db = open_db_connection($database);
	unless ($db) {
		carp 'no database connected!';
		return;
	}
	my $db_ref = ref $db;

	# process the database types, according to the type of database
	my @types;

	# a SeqFeature database
	if ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {

		# collect the database types,
		# sort first by source, then by method
		@types = (
			map $_->[2],
			sort { ($a->[0] cmp $b->[0]) or ($a->[1] cmp $b->[1]) }
			map [$_->source, $_->method, $_],
			$db->types
		);
	}

	# a BigWigSet database
	elsif ($db_ref =~ /BigWigSet/i) {

		# get the metadata
		my $metadata = $db->metadata;

		# collect
		foreach my $file (keys %{ $metadata }) {
			# get the type for each file
			my ($primary, $type, $name);

			# get the appropriate tags
			foreach my $attribute (keys %{ $metadata->{$file} } ) {
				if ($attribute =~ m/^type/i) {
					$type = $metadata->{$file}{$attribute};
				}
				elsif ($attribute =~ m/name/i) {
					$name = $metadata->{$file}{$attribute};
				}
				elsif ($attribute =~ m/^primary_tag|method$/i) {
					$primary = $metadata->{$file}{$attribute};
				}
			}

			# store the type
			push @types, $type || $primary || $name;
		}

		# sort the types in alphabetical order
		# and discard duplicate types - which may occur with stranded entries
		@types = uniq(sort {$a cmp $b} @types);
	}

	# some other database
	else {
		carp " no dataset lists for database type $db_ref!\n";
	}

	# finish
	return @types;
}
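
# Example usage of get_dataset_list (illustrative; the database name is
# hypothetical). For SeqFeature::Store databases the returned list is
# composed of method:source type strings:
#
#   my @types = get_dataset_list('annotation.sqlite');
#   # @types might contain entries such as 'gene:ensGene' or 'mRNA'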

### Process and verify a dataset
sub verify_or_request_feature_types {

	# Retrieve passed values
	my %args = @_; # the passed argument values as a hash

	# Check for single option
	$args{'single'} ||= 0;

	# Collect the datasets
	my @datasets;
	$args{'feature'} ||= undef;
	if (ref $args{'feature'} eq 'ARRAY') {
		# an anonymous array of datasets
		@datasets = @{ $args{'feature'} };
	}
	elsif (defined $args{'feature'}) {
		push @datasets, $args{'feature'};
	}
	# else it is null and @datasets remains empty, prompting user input

	# feature limits
	# this is a regex to limit the feature types presented to the user
	# otherwise everything in the database is presented to the user
	my $limit = $args{'limit'} ||= undef;

	# Open database object and collect features
	$args{'db'} ||= undef;
	my $db = $args{'db'} ? open_db_connection( $args{'db'} ) : undef;

	# Initialize main output arrays
	my @good_datasets;
	my @bad_datasets;
	my %db_features; # hash for features found in the database, use when needed

	# Check provided datasets
	if (@datasets) {

		# check for multiple comma-delimited datasets
		my @list_to_check;
		foreach my $item (@datasets) {
			if ($item =~ /,/) {
				# this one has a comma, therefore it has more than one dataset
				push @list_to_check, split(/,/, $item);
			}
			else {
				# a singleton
				push @list_to_check, $item;
			}
		}

		# now verify the datasets
		foreach my $dataset (@list_to_check) {

			# check for a remote file
			if ($dataset =~ /^(?: http | ftp) .+ \. (?: bam | bw | bb) $/xi) {
				# a remote file
				# assume it is good, no verification here though
				# it will either work or won't work
				push @good_datasets, $dataset;
			}

			# a local file
			elsif ($dataset =~ /\.(?:bam|bw|bigwig|bb|bigbed|useq)$/i) {
				# presume we have a local indexed data file

				# user may have requested two or more files to be merged
				# these should be combined with an ampersand
				# check each one
				my @files;
				foreach my $file (split /\&/, $dataset) {
					if (-e $file) {
						# file exists
						push @files, "file:$file";
					}
					else {
						# file doesn't exist! can't use this set of files
						@files = ();
						last;
					}
				}
				if (@files) {
					push @good_datasets, join("&", @files);
				}
				else {
					push @bad_datasets, $dataset;
				}
			}

			# a feature type in a database
			else {
				# must be a database feature type

				# check for database features hash
				unless (%db_features) {
					if ($db) {
						# collect the database features as needed
						# I want both method:source and method in the hash
						foreach my $type ( get_dataset_list($db) ) {
							my ($method, $source) = split /:/, $type;
							if ($source) {
								$db_features{$type} = 1;
								$db_features{$method} = 1;
							}
							else {
								$db_features{$type} = 1;
							}
						}

						# verify
						unless (%db_features) {
							carp " provided database has no feature types " .
								"to verify dataset(s) against!\n";
						}
					}
					else {
						# we need a database
						carp " unable to verify dataset without database";
					}
				}

				# validate the given dataset
				# user may have requested two or more features to be merged
				# these should be combined with an ampersand
				# check each one
				my $check = 0;
				foreach my $d (split /&/, $dataset) {
					# validate this dataset
					if (exists $db_features{$d}) {
						$check++;
					}
					else {
						# at least one of these is not good, fail check
						$check = 0;
						last;
					}
				}

				# report the verification
				if ($check) {
					push @good_datasets, $dataset;
				}
				else {
					push @bad_datasets, $dataset;
				}
			}
		}
	}

	# User must select datasets
	else {
		# dataset not specified
		# present the dataset list to the user and get an answer

		# get the dataset list
		if ($db) {
			# collect the database features as needed
			my $i = 1;
			foreach my $type ( get_dataset_list($db) ) {
				if ($limit) {
					my ($p, $s) = split /:/, $type; # split into primary_tag and source
					# only keep those types that match our limiter
					next unless $p =~ /$limit/i;
				}
				$db_features{$i} = $type;
				$i++;
			}

			# verify
			unless (%db_features) {
				carp " provided database has no feature types " .
					"to collect!\n";
				return;
			}
		}
		else {
			# we need a database
			carp " no database provided from which to collect features!\n";
			return;
		}

		# present the list
		print "\n These are the available data sets in the database:\n";
		foreach (sort {$a <=> $b} keys %db_features) {
			# print out the list of data sets
			print "  $_\t$db_features{$_}\n";
		}

		# prompt the user
		$args{'prompt'} ||= undef;
		if ($args{'prompt'}) {
			# provided custom prompt
			print $args{'prompt'};
		}
		else {
			# generic prompt
			if ($args{'single'}) {
				print " Enter one number for the data set or feature   ";
			}
			else {
				print " Enter the number(s) or range of the data set(s) or" .
					" feature(s)   ";
			}
		}

		# get answer from the user
		my $answer = <STDIN>;
		chomp $answer;
		my @answer_list = parse_list($answer);

		# take the first one if requested
		if ($args{'single'}) {
			unless (scalar @answer_list == 1) {
				splice(@answer_list, 1);
			}
		}

		# verify the answer list
		foreach my $answer (@answer_list) {

			# check for merged datasets
			if ($answer =~ /&/) {
				# a merged dataset
				my @list = split /&/, $answer;
				my $check = 1;

				# check all are good
				foreach (@list) {
					unless (exists $db_features{$_}) {
						$check = 0;
					}
				}

				# if all are good
				if ($check) {
					push @good_datasets,
						join( "&", map { $db_features{$_} } @list);
				}
				else {
					push @bad_datasets, $answer;
				}
			}

			else {
				# a single dataset
				# check if it is good
				if (exists $db_features{$answer}) {
					push @good_datasets, $db_features{$answer};
				}
				else {
					push @bad_datasets, $answer;
				}
			}
		}
	}

	# Print bad results
	if (@bad_datasets) {
		print " The following datasets could not be verified:\n";
		foreach (@bad_datasets) {
			print "   $_\n";
		}
	}

	# Return good results
	if ($args{'single'}) {
		return $good_datasets[0];
	}
	else {
		return @good_datasets;
	}
}
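
# Example usage of verify_or_request_feature_types (illustrative; the
# database and feature names are hypothetical):
#
#   my @features = verify_or_request_feature_types(
#       db      => 'annotation.sqlite',
#       feature => 'gene,mRNA',     # comma-delimited list is split and verified
#   );
#   my $one = verify_or_request_feature_types(
#       db     => 'annotation.sqlite',
#       single => 1,                # no feature given: prompt user for one type
#   );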

### Check a dataset for RPM support
sub check_dataset_for_rpm_support {

	# get passed dataset and cpu number
	my $dataset = shift;
	my $cpu = shift;

	# Calculate the total number of reads
	# sums are cached in the global hash %TOTAL_READ_NUMBER
	my $rpm_read_sum;

	if (exists $TOTAL_READ_NUMBER{$dataset}) {
		# this dataset has already been summed
		# no need to do it again
		$rpm_read_sum = $TOTAL_READ_NUMBER{$dataset};
	}

	elsif ($dataset =~ /\.bam$/) {
		# a bam file dataset

		_load_bam_helper_module() unless $BAM_OK;
		if ($BAM_OK) {
			# Bio::ToolBox::db_helper::bam was loaded ok
			# sum the number of reads in the dataset
			$rpm_read_sum = sum_total_bam_alignments($dataset, 0, 0, $cpu);
		}
		else {
			carp " Bam support is not available! " .
				"Is Bio::DB::Sam or Bio::DB::HTS installed?\n";
			return;
		}
	}

	elsif ($dataset =~ /\.bb$/) {
		# a bigbed file dataset

		$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed')
			unless $BIGBED_OK;
		if ($BIGBED_OK) {
			# Bio::ToolBox::db_helper::bigbed was loaded ok
			# sum the number of features in the dataset
			$rpm_read_sum = sum_total_bigbed_features($dataset);
		}
		else {
			carp " BigBed support is not available! " .
				"Is Bio::DB::BigBed installed?\n";
			return;
		}
	}

	else {
		# some other non-supported dataset
		return;
	}

	# return the sum value if we've made it this far
	$TOTAL_READ_NUMBER{$dataset} = $rpm_read_sum;
	return $rpm_read_sum;
}
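
# Example usage of check_dataset_for_rpm_support (illustrative; the bam file
# name is hypothetical). The returned total alignment count serves as the
# denominator when normalizing a raw count to reads per million (RPM):
#
#   my $total = check_dataset_for_rpm_support('alignments.bam', 4);
#   my $rpm   = $total ? $count * 1_000_000 / $total : undef;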

### Generate a new list of features
sub get_new_feature_list {

	# Retrieve passed values
	my %args = @_; # the passed argument values

	# check data object
	my $data = $args{data} || undef;
	unless ($data) {
		confess "must pass a 'data' key and Bio::ToolBox::Data object!";
		return;
	}
	unless (ref($data) eq 'Bio::ToolBox::Data') {
		confess 'must pass a Bio::ToolBox::Data object!';
		return;
	}

	# Open a db connection
	$args{'db'} ||= undef;
	my $db = open_db_connection($args{'db'});
	unless ($db) {
		carp 'no database connected!';
		return;
	}

	# Verify a SeqFeature::Store database
	my $db_ref = ref $db;
	unless ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {
		carp "Database type $db_ref doesn't support generating feature lists!\n";
		return;
	}

	# Features to search for
	my $searchFeature = $args{'feature'} || $args{'features'} || undef;
	unless ($searchFeature) {
		carp "no search feature types passed!";
		return;
	}
	my @classes = split(',', $searchFeature); # it may or may not be a list

	# chromosomes to skip
	my $chr_exclude = $args{'chrskip'} || undef;

	# Add table columns
	my $pid_i  = $data->add_column('Primary_ID');
	my $name_i = $data->add_column('Name');
	my $type_i = $data->add_column('Type');

	# Set up the database iterator
	print "   Searching for $searchFeature features\n";
	my $iterator = $db->get_seq_stream(
		-types => scalar @classes ? \@classes : $searchFeature,
	);
	unless ($iterator) {
		# there should be some features found in the database
		carp "could not get feature iterator for database";
		return;
	}

	# Walk through the collected features
	my $total_count = 0; # total found features
	while (my $feature = $iterator->next_seq) {
		$total_count++;

		# skip features from excluded chromosomes
		next if (defined $chr_exclude and $feature->seq_id =~ $chr_exclude);

		# Record the feature information
		# in the B::DB::SF::S database, the primary_ID is a number unique to
		# the specific database, and is not portable between databases
		$data->add_row( [
			$feature->primary_id,
			$feature->display_name,
			$feature->type,
		] );
	}

	# print result of search
	printf "  Found %s features in the database\n", format_with_commas($total_count);
	printf "  Kept %s features.\n", format_with_commas($data->last_row);

	# return the new data structure
	return 1;
}
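
# Example usage of get_new_feature_list (illustrative; assumes an existing
# Bio::ToolBox::Data object and a hypothetical database name):
#
#   my $success = get_new_feature_list(
#       data    => $Data,               # a Bio::ToolBox::Data object
#       db      => 'annotation.sqlite',
#       feature => 'gene',
#       chrskip => 'chrM|scaffold',     # optional chromosome-skip regex
#   );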
|
1850
|
|
|
|
|
|
|
|
|
1851
|
|
|
|
|
|
|
|
|
1852
|
|
|
|
|
|
|
### Generate a new list genomic windows |
|
1853
|
|
|
|
|
|
|
sub get_new_genome_list {

	# Collect the passed arguments
	my %args = @_;

	# check data object
	my $data = $args{data} || undef;
	unless ($data) {
		confess "must pass a 'data' key and Bio::ToolBox::Data object!";
	}
	unless (ref($data) eq 'Bio::ToolBox::Data') {
		confess 'must pass a Bio::ToolBox::Data object!';
	}

	# Open a db connection
	$args{'db'} ||= undef;
	my $db = open_db_connection($args{'db'});
	unless ($db) {
		carp 'no database connected!';
		return;
	}

	# Determine win and step sizes
	$args{'win'}  ||= 500; # hard coded default size
	$args{'step'} ||= $args{'win'};

	# Prepare data structures
	my $chr_i   = $data->add_column('Chromosome');
	my $start_i = $data->add_column('Start');
	my $stop_i  = $data->add_column('Stop');
	$data->metadata($start_i, 'win', $args{'win'});
	$data->metadata($start_i, 'step', $args{'step'});

	# Collect the chromosomes
	my $chr_exclude = $args{'skip'} || $args{'chrskip'} || undef;
	my @chromosomes = get_chromosome_list($db, $chr_exclude);
	unless (@chromosomes) {
		carp " no sequence IDs were found in the database!\n";
		return;
	}

	# Load exclusion intervals
	my %exclusion_tree;
	$args{exclude} ||= $args{blacklist} || undef;
	if (exists $args{exclude} and defined $args{exclude}) {
		if (ref($args{exclude}) eq 'Bio::ToolBox::Data') {
			if (_load_helper_module('Set::IntervalTree')) {
				# iterate through the data object of intervals
				# prepare an Interval Tree for each chromosome
				# and load the exclusion interval into it
				$args{exclude}->iterate( sub {
					my $row = shift;
					unless (exists $exclusion_tree{$row->seq_id} ) {
						$exclusion_tree{ $row->seq_id } = Set::IntervalTree->new();
					}
					$exclusion_tree{ $row->seq_id }->insert(1, $row->start - 1, $row->end);
				} );
			}
			else {
				carp " Set::IntervalTree must be installed to use exclusion intervals!";
			}
		}
		else {
			confess " Exclusion data must be a Bio::ToolBox::Data object!";
		}
	}

	# Collect the genomic windows
	print " Generating $args{win} bp windows in $args{step} bp increments\n";
	foreach (@chromosomes) {

		# name and length as sub-array in each element
		my ($chr, $length) = @{$_};
		my $Tree = $exclusion_tree{$chr} || undef;

		# iterate through chromosome
		for (my $start = 1; $start <= $length; $start += $args{step}) {
			my $end = $start + $args{win} - 1;
			if ($end > $length) {
				# fix end to the length of the chromosome
				$end = $length;
			}
			# skip if excluded
			next if ($Tree and scalar( @{ $Tree->fetch($start - 1, $end) } ) >= 1);
			$data->add_row( [ $chr, $start, $end ] );
		}
	}
	print " Kept " . $data->{'last_row'} . " windows.\n";

	# Return the data structure
	return 1;
}

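# Example usage of get_new_genome_list() (an illustrative sketch; the database
# file name, window sizes, and exclusion pattern below are hypothetical):
#
#   my $Data = Bio::ToolBox::Data->new();
#   get_new_genome_list(
#       data    => $Data,
#       db      => 'genome.fa',
#       win     => 1000,
#       step    => 500,
#       chrskip => 'chrM',
#   ) or die "unable to generate genome window list";
#
# On success the passed Data object gains Chromosome, Start, and Stop columns
# populated with overlapping 1 kb windows spaced every 500 bp.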
sub get_db_feature {
	my %args = @_;

	# Open a db connection
	$args{db} ||= undef;
	unless ($args{db}) {
		croak 'no database provided for getting a feature!';
	}
	my $db = open_db_connection($args{db});
	my $db_ref = ref $db;

	# get the name of the feature
	my $name = $args{'name'} || undef;

	# check for values and internal nulls
	$args{'id'} = exists $args{'id'} ? $args{'id'} : undef;
	$args{'type'} ||= undef;
	undef $name if $name and $name eq '.';
	undef $args{'id'} if $args{'id'} and $args{'id'} eq '.';
	undef $args{'type'} if $args{'type'} and $args{'type'} eq '.';

	# quick method for feature retrieval
	if (defined $args{'id'} and $db->can('fetch')) {
		# we have a unique primary_id to identify the feature
		# usually this pulls out the feature directly
		my $feature = $db->fetch($args{'id'}) || undef;

		# check that the feature we pulled out is what we want
		my $check = $feature ? 1 : 0;
		if ($check) {
			$check = 0 if (defined $name and $feature->display_name ne $name);
			$check = 0 if (defined $args{'type'} and $feature->type ne $args{'type'});
		}

		# return if things match up as best we can tell
		if ($check) {
			return $feature;
		}
		else {
			# the primary_ids are out of date
			unless ($PRIMARY_ID_WARNING) {
				warn "CAUTION: Some primary IDs in Input file appear to be out of date\n";
				$PRIMARY_ID_WARNING++;
			}
		}
	}

	# otherwise use name and type
	return unless $name; # otherwise db will return all features! Not what we want!
	my @features = $db->features(
		-name    => $name, # we assume this name will be unique
		-aliases => 0,
		-type    => $args{'type'},
	);

	# if none are found
	unless (@features) {
		# try again with aliases enabled
		@features = $db->features(
			-name    => $name,
			-aliases => 1,
			-type    => $args{'type'},
		);
	}
	unless (@features and $name =~ /[;,\|]/) {
		# I used to append aliases to the end of the name in old versions of biotoolbox
		# crazy, I know, but just in case this happened, let's try breaking them apart
		my $name2 = (split(/\s*[;,\|]\s*/, $name))[0];
		# multiple names present using common delimiters ;,|
		# take the first name only, assume others are aliases that we don't need
		@features = $db->features(
			-name    => $name2,
			-aliases => 1,
			-type    => $args{'type'},
		);
	}

	# check the number of features returned
	if (scalar @features > 1) {
		# there should only be one feature found
		# if more are returned, there's redundant or duplicated data in the db

		# first check whether addition of aliases may improve accuracy
		if ($args{'name'} =~ /;/) {
			# we have presumed aliases in the name value
			my $check = $args{'name'};
			$check =~ s/\s*;\s*/;/g; # strip spaces just in case

			# check each feature and try to match name and aliases
			my @candidates;
			foreach my $f (@features) {
				my $f_name = join(';', $f->display_name, ($f->get_tag_values('Alias')));
				if ($check eq $f_name) {
					# found one
					push @candidates, $f;
				}
			}

			if (scalar @candidates == 1) {
				# we found it!
				return shift @candidates;
			}
			elsif (scalar @candidates > 1) {
				# hopefully we've improved a little bit?
				@features = @candidates;
			}
		}

		# warn the user, this should be fixed
		printf " Found %s %s features named '$name' in the database! Using first one.\n",
			scalar(@features), $args{'type'};
	}
	elsif (!@features) {
		printf " Found no %s features named '$name' in the database!\n", $args{'type'};
		return;
	}

	# done
	return shift @features;
}


### Get a dataset score for a single region
sub get_segment_score {
	# parameters passed as an array
	# we will be passing this array on as a reference to the appropriate
	# imported helper subroutine
	# chromosome, start, stop, strand, strandedness, method, return type, db, dataset
	confess "incorrect number of parameters passed!" unless scalar @_ == 9;

	# check the database
	$_[DB] = open_db_connection($_[DB]) if ($_[DB] and not ref($_[DB]));

	# check for combined datasets
	if ($_[DATA] =~ /&/) {
		push @_, (split '&', pop @_);
	}

	# determine method
	# yes, we're only checking the first dataset, but they should all
	# be the same type
	my $db_method = $DB_METHODS{$_[METH]}{$_[RETT]}{$_[DB]}{$_[DATA]} ||
		_lookup_db_method(\@_);

	# return type values
	# 0 = calculate score
	# 1 = score array
	# 2 = hash position scores
	# immediately return calculated score if appropriate
	if ($_[RETT] > 0) {
		# immediately return either indexed hash or array of scores
		return &{$db_method}(\@_);
	}
	else {
		# calculate a score
		my $scores = &{$db_method}(\@_);

		# this might be an array reference of scores that need to be combined
		# or it could be a single scalar score which just needs to be returned
		if (ref($scores)) {
			return calculate_score($_[METH], $scores);
		}
		else {
			if (not defined $scores) {
				return 0 if $_[METH] =~ /count|sum/;
				return '.';
			}
			return $scores;
		}
	}
}

sub calculate_score {
	my ($method, $scores) = @_;
	$scores ||= []; # just in case
	if (exists $SCORE_CALCULATOR_SUB{$method}) {
		return &{$SCORE_CALCULATOR_SUB{$method}}($scores);
	}
	else {
		confess " unrecognized method '$method'!";
	}
}

sub get_chromosome_list {

	# options
	my $database = shift;
	my $chr_exclude = shift || undef;

	# Open a db connection
	my $db = open_db_connection($database);
	unless ($db) {
		carp 'no database connected!';
		return;
	}

	# Check for BigWigSet database
	# these need to be handled a little differently
	if (ref($db) =~ /BigWigSet/) {
		# BigWigSet databases are the only databases that don't
		# support the seq_ids method
		# instead we have to look at one of the bigwigs in the set
		my $bw_file = ($db->bigwigs)[0];
		$db = open_db_connection($bw_file);
	}

	# Collect chromosome lengths
	# we want to maintain the original order of the chromosomes so we
	# won't be putting it into a hash
	# instead an array of arrays
	my @chrom_lengths;

	# Database specific approaches to collecting chromosomes
	# I used to have one generic approach, but idiosyncrasies and potential
	# bugs make me use different approaches for better consistency
	my $type = ref($db);

	# SeqFeature::Store
	if ($type =~ /^Bio::DB::SeqFeature::Store/) {
		for my $chr ($db->seq_ids) {

			# check for excluded chromosomes
			next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);

			# get chromosome size
			my ($seqf) = $db->get_features_by_name($chr);
			my $length = $seqf ? $seqf->length : 0;

			# store
			push @chrom_lengths, [ $chr, $length ];
		}
	}

	# libBigWig Bigfile, including both bigWig and bigBed
	elsif ($type eq 'Bio::DB::Big::File') {
		# this doesn't follow the typical BioPerl convention
		# it's a hash, so randomly sorted!!!!! Never will be in same order as file!!!!
		my $chroms = $db->chroms();

		my @list;
		foreach (values %$chroms) {
			# check for excluded chromosomes
			next if (defined $chr_exclude and $_->{name} =~ /$chr_exclude/i);
			push @list, [ $_->{name}, $_->{length} ];
		}
		# sort consistently by a sane system
		@chrom_lengths = sane_chromo_sort(@list);
	}

	# UCSC kent Bigfile
	elsif ($type eq 'Bio::DB::BigWig' or $type eq 'Bio::DB::BigBed') {
		foreach my $chr ($db->seq_ids) {

			# check for excluded chromosomes
			next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);

			# get chromosome size
			my $length = $db->length($chr);

			# store
			push @chrom_lengths, [ $chr, $length ];
		}
	}

	# Samtools Bam
	elsif ($type eq 'Bio::DB::Sam' or $type eq 'Bio::DB::HTS') {
		for my $tid (0 .. $db->n_targets - 1) {
			# each chromosome is internally represented in the bam file as
			# a numeric target identifier
			# we can easily convert this to an actual sequence name
			# we will force the conversion to go one chromosome at a time

			# sequence info
			my $chr    = $db->target_name($tid);
			my $length = $db->target_len($tid);

			# check for excluded chromosomes
			next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);

			# store
			push @chrom_lengths, [ $chr, $length ];
		}
	}

	# Fasta through modern HTS fai index
	elsif ($type eq 'Bio::DB::HTS::Faidx') {
		for my $chr ($db->get_all_sequence_ids) {
			# check for excluded
			next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);

			# get length and store it
			my $length = $db->length($chr);
			push @chrom_lengths, [ $chr, $length ];
		}
	}

	# Fasta through old BioPerl adapter
	elsif ($type eq 'Bio::DB::Fasta') {
		for my $chr ($db->get_all_ids) {

			# check for excluded chromosomes
			next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);

			# get chromosome size
			my $seq = $db->get_Seq_by_id($chr);
			my $length = $seq ? $seq->length : 0;

			# store
			push @chrom_lengths, [ $chr, $length ];
		}
	}

	# other BioPerl adapter?????
	elsif ($db->can('seq_ids')) {
		foreach my $chr ($db->seq_ids) {

			# check for excluded chromosomes
			next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);

			# generate a segment representing the chromosome
			# due to fuzzy name matching, we may get more than one back
			my @segments = $db->segment($chr);
			# need to find the right one
			my $segment;
			while (@segments) {
				$segment = shift @segments;
				last if $segment->seq_id eq $chr;
			}

			# check segment
			unless ($segment) {
				carp " No genome segment for seq_id $chr!!?? skipping\n";
				next;
			}

			# get the chromosome length
			my $length = $segment->length;

			# store
			push @chrom_lengths, [ $chr, $length ];
		}
	}

	else {
		carp " Unable to identify chromosomes in database type '$type'. Try using a different database file or adapter\n";
		return;
	}

	# Return
	unless (@chrom_lengths) {
		carp " no chromosome sequences identified in database!\n";
		return;
	}
	return @chrom_lengths;
}

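# Example usage of get_chromosome_list() (an illustrative sketch; the fasta
# file name and exclusion pattern are hypothetical). The return value is a
# list of [name, length] array references, and the optional second argument
# is a case-insensitive regular expression of sequence names to skip:
#
#   foreach my $pair ( get_chromosome_list('genome.fa', 'chrM|scaffold') ) {
#       my ($name, $length) = @{$pair};
#       printf "%s: %d bp\n", $name, $length;
#   }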
sub low_level_bam_fetch {
	confess "incorrect number of parameters passed!" unless scalar @_ == 6;
	my ($sam, $tid, $start, $stop, $callback, $data) = @_;

	# run the low level bam fetch based on which adapter is being used
	unless ($BAM_ADAPTER) {
		$BAM_ADAPTER = ref($sam) =~ /hts/i ? 'hts' : 'sam';
	}
	if ($BAM_ADAPTER eq 'hts') {
		# using Bio::DB::HTS
		my $index = $sam->hts_index;
		return unless $index;
		return $index->fetch($sam->hts_file, $tid, $start, $stop, $callback, $data);
	}
	elsif ($BAM_ADAPTER eq 'sam') {
		# using Bio::DB::Sam
		my $index = $sam->bam_index;
		return unless $index;
		return $index->fetch($sam->bam, $tid, $start, $stop, $callback, $data);
	}
	else {
		confess "no bam adapter loaded!\n";
	}
}

sub low_level_bam_coverage {
	confess "incorrect number of parameters passed!" unless scalar @_ == 4;
	my ($sam, $tid, $start, $stop) = @_;

	# run the low level bam coverage based on which adapter is being used
	unless ($BAM_ADAPTER) {
		$BAM_ADAPTER = ref($sam) =~ /hts/i ? 'hts' : 'sam';
	}
	if ($BAM_ADAPTER eq 'hts') {
		# using Bio::DB::HTS
		my $index = $sam->hts_index;
		return unless $index;
		return $index->coverage($sam->hts_file, $tid, $start, $stop);
	}
	elsif ($BAM_ADAPTER eq 'sam') {
		# using Bio::DB::Sam
		my $index = $sam->bam_index;
		return unless $index;
		return $index->coverage($sam->bam, $tid, $start, $stop);
	}
	else {
		confess "no bam adapter loaded!\n";
	}
}

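# Example usage of the low-level bam helpers (an illustrative sketch; assumes
# an opened Bio::DB::HTS or Bio::DB::Sam object and a valid numeric target ID
# with zero-based coordinates, per the low-level adapter conventions):
#
#   my $coverage = low_level_bam_coverage($sam, $tid, $start, $stop);
#   # $coverage is typically an array reference of per-position read depths
#
#   low_level_bam_fetch($sam, $tid, $start, $stop, sub {
#       my ($alignment, $callback_data) = @_;
#       # inspect each overlapping alignment here
#   }, \my %callback_data);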
sub get_genomic_sequence {
	confess "incorrect number of parameters passed!" unless scalar @_ == 4;
	my ($db, $chrom, $start, $stop) = @_;

	# check database
	my $type = ref($db);
	$db = open_db_connection($db) if not $type;

	# return sequence based on the type of database adapter we're using
	if ($type eq 'Bio::DB::HTS::Faidx') {
		return $db->get_sequence_no_length("$chrom:$start-$stop");
	}
	elsif ($type eq 'Bio::DB::Sam::Fai') {
		return $db->fetch("$chrom:$start-$stop");
	}
	elsif ($db->can('seq')) {
		# BioPerl database including Bio::DB::Fasta and Bio::DB::SeqFeature::Store
		return $db->seq($chrom, $start, $stop);
	}
	else {
		confess "unsupported database $type for collecting genomic sequence!\n";
	}
}


### deprecated methods
# just in case
sub validate_included_feature {
	confess "validate_included_feature() is no longer a valid method. " .
		"Please update your script to the current API.\n";
}


### Internal subroutine to retrieve the scores from an established region object
sub _lookup_db_method {
	# parameters passed as an array reference
	my $param = shift;

	# determine the appropriate score method
	my $score_method;
	if ($param->[DATA] =~ /^file|http|ftp/) {

		# collect the data according to file type

		# BigWig Data file
		if ($param->[DATA] =~ /\.(?:bw|bigwig)$/i) {
			# file is in bigwig format
			# get the dataset scores using Bio::ToolBox::db_helper::bigwig
			# this uses either Bio::DB::BigWig or Bio::DB::Big

			# check that we have bigwig support
			$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
			if ($BIGWIG_OK) {
				if ($param->[RETT] == 2) {
					$score_method = \&collect_bigwig_position_scores;
				}
				elsif ($param->[RETT] == 1) {
					$score_method = \&collect_bigwig_scores;
				}
				elsif ($param->[METH] =~ /min|max|mean/) {
					$score_method = \&collect_bigwig_score;
				}
				elsif ($param->[METH] =~ /sum|count/) {
					# difference between ucsc and libBigWig libraries
					$score_method = $BIG_ADAPTER eq 'ucsc' ?
						\&collect_bigwig_score : \&collect_bigwig_scores;
				}
				else {
					$score_method = \&collect_bigwig_scores;
				}
			}
			else {
				croak " BigWig support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?";
			}
		}

		# BigBed Data file
		elsif ($param->[DATA] =~ /\.(?:bb|bigbed)$/i) {
			# data is in bigbed format
			# get the dataset scores using Bio::ToolBox::db_helper::bigbed
			# this uses either Bio::DB::BigBed or Bio::DB::Big

			# check that we have bigbed support
			$BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK;
			if ($BIGBED_OK) {
				if ($param->[RETT] == 2) {
					$score_method = \&collect_bigbed_position_scores;
				}
				else {
					$score_method = \&collect_bigbed_scores;
				}
			}
			else {
				croak " BigBed support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?";
			}
		}

		# BAM data file
		elsif ($param->[DATA] =~ /\.bam$/i) {
			# data is in bam format
			# get the dataset scores using Bio::ToolBox::db_helper::bam
			# this uses the Bio::DB::Sam or Bio::DB::HTS adaptor

			# check that we have Bam support
			_load_bam_helper_module() unless $BAM_OK;
			if ($BAM_OK) {
				$score_method = \&collect_bam_scores;
			}
			else {
				croak " Bam support is not enabled! " .
					"Is Bio::DB::HTS or Bio::DB::Sam installed?\n";
			}
		}

		# USeq Data file
		elsif ($param->[DATA] =~ /\.useq$/i) {
			# data is in useq format
			# get the dataset scores using Bio::ToolBox::db_helper::useq
			# this uses the Bio::DB::USeq adaptor

			# check that we have useq support
			$USEQ_OK = _load_helper_module('Bio::ToolBox::db_helper::useq')
				unless $USEQ_OK;
			if ($USEQ_OK) {
				if ($param->[RETT] == 2) {
					$score_method = \&collect_useq_position_scores;
				}
				else {
					$score_method = \&collect_useq_scores;
				}
			}
			else {
				croak " USeq support is not enabled! " .
					"Is Bio::DB::USeq installed?\n";
			}
		}

		# Unsupported Data file
		else {
			confess sprintf " Unsupported or unrecognized file type for file '%s'!\n",
				$param->[DATA];
		}
	}

	### BigWigSet database
	elsif (ref($param->[DB]) =~ m/BigWigSet/) {
		# calling features from a BigWigSet database object
		# this uses either Bio::DB::BigWig or Bio::DB::Big

		# check that we have bigwig support
		# duh! we should, we probably opened the stupid database!
		$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
		croak " BigWigSet support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?"
			unless $BIGWIG_OK;

		# the data collection depends on the method
		if ($param->[RETT] == 2) {
			$score_method = \&collect_bigwigset_position_scores;
		}
		elsif ($param->[RETT] == 1) {
			$score_method = \&collect_bigwigset_scores;
		}
		elsif ($param->[METH] =~ /min|max|mean|sum|count/) {
			# difference between ucsc and libBigWig libraries
			# the ucsc library works with summaries and we can handle multiple of these
|
2521
|
|
|
|
|
|
|
# but the big adapter doesn't |
|
2522
|
0
|
0
|
|
|
|
0
|
$score_method = $BIG_ADAPTER eq 'ucsc' ? |
|
2523
|
|
|
|
|
|
|
\&collect_bigwigset_score : \&collect_bigwigset_scores; |
|
2524
|
|
|
|
|
|
|
} |
|
2525
|
|
|
|
|
|
|
else { |
|
2526
|
0
|
|
|
|
|
0
|
$score_method = \&collect_bigwigset_scores; |
|
2527
|
|
|
|
|
|
|
} |
|
2528
|
|
|
|
|
|
|
} |
|
2529
|
|
|
|
|
|
|
|
|
2530
|
|
|
|
|
|
|
|
|
2531
|
|
|
|
|
|
|
### BioPerl style database |
|
2532
|
|
|
|
|
|
|
elsif (ref($param->[DB]) =~ m/^Bio::DB/) { |
|
2533
|
|
|
|
|
|
|
# a BioPerl style database, including Bio::DB::SeqFeature::Store |
|
2534
|
|
|
|
|
|
|
# most or all Bio::DB databases support generic get_seq_stream() methods |
|
2535
|
|
|
|
|
|
|
# that return seqfeature objects, which we can use in a generic sense |
|
2536
|
|
|
|
|
|
|
|
|
2537
|
|
|
|
|
|
|
# check that we have support |
|
2538
|
|
|
|
|
|
|
# duh! we should, we probably opened the stupid database! |
|
2539
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
|
2540
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
|
2541
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
|
2542
|
|
|
|
|
|
|
# get the dataset scores using Bio::ToolBox::db_helper::seqfasta |
|
2543
|
|
|
|
|
|
|
# check that we support methods |
|
2544
|
0
|
0
|
|
|
|
0
|
unless ($param->[DB]->can('get_seq_stream')) { |
|
2545
|
0
|
|
|
|
|
0
|
confess sprintf "unsupported database! cannot use %s as it does not support get_seq_stream method or equivalent", |
|
2546
|
|
|
|
|
|
|
ref($param->[DB]); |
|
2547
|
|
|
|
|
|
|
} |
|
2548
|
0
|
|
|
|
|
0
|
$score_method = \&collect_store_scores; |
|
2549
|
|
|
|
|
|
|
} |
|
2550
|
|
|
|
|
|
|
else { |
|
2551
|
0
|
|
|
|
|
0
|
croak " SeqFeature Store support is not enabled! " . |
|
2552
|
|
|
|
|
|
|
"Is BioPerl and Bio::DB::SeqFeature::Store properly installed?\n"; |
|
2553
|
|
|
|
|
|
|
} |
|
2554
|
|
|
|
|
|
|
} |
|
2555
|
|
|
|
|
|
|
|
|
2556
|
|
|
|
|
|
|
|
|
2557
|
|
|
|
|
|
|
### Some other database? |
|
2558
|
|
|
|
|
|
|
else { |
|
2559
|
0
|
0
|
|
|
|
0
|
confess "no recognizeable dataset provided!" unless $param->[DATA]; |
|
2560
|
0
|
0
|
|
|
|
0
|
confess "no database passed!" unless $param->[DB]; |
|
2561
|
0
|
|
|
|
|
0
|
confess "something went wrong! parameters: @$param"; |
|
2562
|
|
|
|
|
|
|
} |
|
2563
|
|
|
|
|
|
|
|
|
2564
|
|
|
|
|
|
|
|
|
2565
|
|
|
|
|
|
|
### Cache and return the results |
|
2566
|
9
|
|
|
|
|
33
|
$DB_METHODS{$param->[METH]}{$param->[RETT]}{$param->[DB]}{$param->[DATA]} = |
|
2567
|
|
|
|
|
|
|
$score_method; |
|
2568
|
9
|
|
|
|
|
25
|
return $score_method; |
|
2569
|
|
|
|
|
|
|
} |
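The caching step above resolves a collector subroutine once per combination of request parameters, stores the code reference in a nested hash, and returns the cached reference on repeat lookups. A minimal, self-contained sketch of that memoized-dispatch pattern follows; the collector subroutines and `%METHOD_CACHE` hash here are illustrative stand-ins, not the real Bio::ToolBox collectors or `%DB_METHODS`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Illustrative stand-ins for the real score collectors
sub collect_example_scores          { return 'scores' }
sub collect_example_position_scores { return 'position_scores' }

my %METHOD_CACHE;    # analogous in spirit to %DB_METHODS above

sub lookup_score_method {
    my ( $method, $rett ) = @_;

    # return the cached code reference if this combination was resolved before
    return $METHOD_CACHE{$method}{$rett}
        if exists $METHOD_CACHE{$method}{$rett};

    # otherwise resolve, cache, and return
    my $code = $rett == 2
        ? \&collect_example_position_scores
        : \&collect_example_scores;
    $METHOD_CACHE{$method}{$rett} = $code;
    return $code;
}

printf "%s\n", lookup_score_method( 'mean', 2 )->();    # position_scores
printf "%s\n", lookup_score_method( 'mean', 1 )->();    # scores
```

The payoff is that the full dispatch chain runs only once per distinct request; every later call with the same parameters is a single hash lookup.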
|

sub _load_helper_module {
	my $class = shift;
	my $success = 0;
	eval {
		load $class;      # using Module::Load to dynamically load modules
		$class->import;   # must be done explicitly for the Bio::ToolBox helper modules
		$success = 1;
	};
	return $success;
}
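The helper above wraps `Module::Load`'s `load` and the subsequent `import` in an `eval` block so that a missing or broken module reports failure instead of killing the program. The same pattern can be exercised standalone against stock modules (the module names probed below are just examples):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Module::Load;    # the same loader used by _load_helper_module() above

# Attempt to load a class at run time; return 1 on success, 0 on failure.
sub try_load {
    my $class   = shift;
    my $success = 0;
    eval {
        load $class;       # dies if the module is missing or fails to compile
        $class->import;    # bring in any exported symbols
        $success = 1;
    };
    return $success;
}

print try_load('List::Util')       ? "List::Util loaded\n" : "List::Util missing\n";
print try_load('No::Such::Module') ? "unexpected\n"        : "No::Such::Module missing\n";
```

Because the `eval` traps the `die`, the caller gets a clean boolean and can decide how to degrade, which is exactly how the adapter-probing subroutines below use it.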
|

sub _load_bam_helper_module {
	if ($BAM_ADAPTER) {
		if ($BAM_ADAPTER =~ /sam/i) {
			$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::bam');
			$BAM_ADAPTER = 'sam'; # for internal consistency
		}
		elsif ($BAM_ADAPTER =~ /hts/i) {
			$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::hts');
			$BAM_ADAPTER = 'hts'; # for internal consistency
		}
		elsif ($BAM_ADAPTER =~ /none/i) {
			# basically for testing purposes, don't use a module
			return 0;
		}
		else {
			# unrecognized
			$@ = 'unrecognized';
		}
		if (not $BAM_OK) {
			print "Requested '$BAM_ADAPTER' adapter could not be loaded: $@\n";
		}
	}
	else {
		# try hts first, then sam
		# be sure to set BAM_ADAPTER upon success
		$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::hts');
		if ($BAM_OK) {
			$BAM_ADAPTER = 'hts';
		}
		else {
			$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::bam');
			$BAM_ADAPTER = 'sam' if $BAM_OK;
		}
	}
	return $BAM_OK;
}
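The selection logic above honors an explicit `$BAM_ADAPTER` preference when the caller set one, and otherwise probes candidates in order (hts, then sam). That "preference, else first loadable" policy can be factored into a small generic helper; this is only a sketch, and `pick_adapter` plus the candidate names passed to it are hypothetical, standing in for Bio::DB::HTS and Bio::DB::Sam.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Module::Load;

# Return the first class in @candidates that loads cleanly, honoring an
# explicit $preference when one is given; return undef if nothing loads.
sub pick_adapter {
    my ( $preference, @candidates ) = @_;
    my @order = $preference ? ($preference) : @candidates;
    for my $class (@order) {
        # eval traps the die from an unloadable module and moves on
        return $class if eval { load $class; 1 };
    }
    return undef;
}

# With no preference set, probing skips the unloadable candidate
my $got = pick_adapter( undef, 'No::Such::Adapter', 'List::Util' );
print defined $got ? "picked $got\n" : "no adapter available\n";
```

Note that when a preference is given, only that one candidate is tried, matching the behavior above where a requested adapter that fails to load is reported rather than silently replaced.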
|

sub _load_bigwig_helper_module {
	if ($BIG_ADAPTER) {
		if ($BIG_ADAPTER =~ /ucsc|kent/i) {
			$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig');
			$BIG_ADAPTER = 'ucsc'; # for internal consistency
		}
		elsif ($BIG_ADAPTER =~ /big/i) {
			$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
			$BIGBED_OK = $BIGWIG_OK; # bigbed too!
			$BIG_ADAPTER = 'big'; # for internal consistency
		}
		elsif ($BIG_ADAPTER =~ /none/i) {
			# basically for testing purposes, don't use a module
			return 0;
		}
		else {
			# unrecognized
			$@ = 'unrecognized';
		}
		if (not $BIGWIG_OK) {
			print "Requested '$BIG_ADAPTER' adapter could not be loaded: $@\n";
		}
	}
	else {
		# we have to try each one out
		# try the modern big adapter first
		$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
		if ($BIGWIG_OK) {
			# success!
			$BIGBED_OK = $BIGWIG_OK; # bigbed too!
			$BIG_ADAPTER = 'big';
		}
		else {
			$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig');
			$BIG_ADAPTER = 'ucsc' if $BIGWIG_OK;
		}
	}
	return $BIGWIG_OK;
}
|

sub _load_bigbed_helper_module {
	if ($BIG_ADAPTER) {
		if ($BIG_ADAPTER =~ /ucsc|kent/i) {
			$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed');
			$BIG_ADAPTER = 'ucsc'; # for internal consistency
		}
		elsif ($BIG_ADAPTER =~ /big/i) {
			$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
			$BIGWIG_OK = $BIGBED_OK; # bigwig too!
			$BIG_ADAPTER = 'big'; # for internal consistency
		}
		elsif ($BIG_ADAPTER =~ /none/i) {
			# basically for testing purposes, don't use a module
			return 0;
		}
		else {
			# unrecognized
			$@ = 'unrecognized';
		}
		if (not $BIGBED_OK) {
			print "Requested '$BIG_ADAPTER' adapter could not be loaded: $@\n";
		}
	}
	else {
		# we have to try each one out
		# try the modern big adapter first
		$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
		if ($BIGBED_OK) {
			# success!
			$BIGWIG_OK = $BIGBED_OK; # bigwig too!
			$BIG_ADAPTER = 'big';
		}
		else {
			$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed');
			$BIG_ADAPTER = 'ucsc' if $BIGBED_OK;
		}
	}
	return $BIGBED_OK;
}
|
2701
|
|
|
|
|
|
|
|
|
2702
|
|
|
|
|
|
|
=head1 AUTHOR |
|
2703
|
|
|
|
|
|
|
|
|
2704
|
|
|
|
|
|
|
Timothy J. Parnell, PhD |
|
2705
|
|
|
|
|
|
|
Howard Hughes Medical Institute |
|
2706
|
|
|
|
|
|
|
Dept of Oncological Sciences |
|
2707
|
|
|
|
|
|
|
Huntsman Cancer Institute |
|
2708
|
|
|
|
|
|
|
University of Utah |
|
2709
|
|
|
|
|
|
|
Salt Lake City, UT, 84112 |
|
2710
|
|
|
|
|
|
|
|
|
2711
|
|
|
|
|
|
|
This package is free software; you can redistribute it and/or modify |
|
2712
|
|
|
|
|
|
|
it under the terms of the Artistic License 2.0. |
|
2713
|
|
|
|
|
|
|
|