package Bio::ToolBox::db_helper;
our $VERSION = '1.68';

=head1 NAME

Bio::ToolBox::db_helper - helper interface to various database formats

=head1 DESCRIPTION

In most cases, this module does not need to be used directly. The
methods available to L<Bio::ToolBox::Data> and L<Bio::ToolBox::Data::Feature>
provide convenient access to the methods described here.

These are helper subroutines to work with relevant databases that can be
accessed through BioPerl modules. These include the L<Bio::DB::SeqFeature::Store>
relational database as well as L<Bio::DB::HTS>, L<Bio::DB::Sam>,
L<Bio::DB::BigWig>, L<Bio::DB::BigBed>, L<Bio::DB::USeq>, and
L<Bio::DB::Fasta> databases. The primary functions
include opening connections to these databases, verifying features found
within the database, generating lists of features or genomic intervals for
subsequent analysis, and collecting features and/or scores within regions
of interest. Collected scores may be summarized using a variety of statistical
methods.

When collecting scores or features, the data may be hosted in a variety of
formats and locations. These include:

=over 4

=item SeqFeature database

Genomic feature annotation, including genes, transcripts, exons, etc.,
may be stored in a L<Bio::DB::SeqFeature::Store> database. These
databases are backed by a relational database, such as
MySQL, PostgreSQL, or SQLite. Small GFF3 files may also be loaded in
an in-memory database. These are typically loaded from GFF3
files; see the adapter documentation for more information. Dataset
scores, such as from microarray, may also be stored as the score
value in the source GFF file.

References to local, binary, indexed files may also be included as
attributes to features stored in the database. This is legacy support
from the GBrowse genome browser infrastructure, and should
not be used in preference to other methods. Supported files include the
legacy binary wig files (F<.wib>, see L<Bio::Graphics::Wiggle>) using the
C<wigfile> attribute, or bigWig files using the C<wigfile> or C<bigwigfile>
attribute. The attribute values must be full paths.

=item BigWig files

BigWig files (F<.bw> and F<.bigwig>) are compressed, binary, indexed
versions of text wig files and may be accessed either locally or remotely.
They support extremely fast score retrieval from any genomic location of
any size without sacrificing resolution (spatial and numeric). See
L<http://genome.ucsc.edu/goldenPath/help/bigWig.html> for more information.
BigWig files are supported by either the L<Bio::DB::Big> or the
L<Bio::DB::BigWig> adapter, based on either the libBigWig
or the UCSC kent C libraries,
respectively; see their respective documentation for more information.

=item Directory of BigWig files

A directory containing two or more BigWig files is assembled into a
BigWigSet, allowing for metadata, such as strand, to be associated with
BigWig files. Additional metadata beyond the filename may be included in
a text file F<metadata.txt> within the directory. See the
L<Bio::DB::BigWigSet> adapter documentation for more information. When using
L<Bio::DB::Big> adapters, BigWigSet support is provided natively by
the BioToolBox package.

=item BigBed files

BigBed files are compressed, binary, indexed versions of text BED files. See
L<http://genome.ucsc.edu/goldenPath/help/bigBed.html> for more information.
Both local and remote files may be accessed. BigBed files are supported by
either the L<Bio::DB::Big> or the L<Bio::DB::BigBed> adapter, based on either
the libBigWig or the UCSC kent C libraries,
respectively; see their respective documentation for more information.

=item Bam files

Bam files (F<.bam>) are compressed, binary, indexed versions of the text SAM file,
or sequence alignment map. They are used with next generation sequencing
technologies. They support individual alignment retrieval as well as
read depth coverage. Two different Bam file adapters are supported.

The L<Bio::DB::HTS> adapter is an interface to the Bam (or Cram) file.
This is based on the modern HTSlib C library (version E<gt>= 1.0), successor
to the original samtools library. See L<Bio::DB::HTS> for
more information.

The L<Bio::DB::Sam> adapter is an older interface to the Bam alignment
file. This is based on the samtools C library version E<lt>= 0.1.19. While
this adapter is still supported, the above should be used for new installs.

=item Cram files

Cram files (F<.cram>) are similar to Bam files, but are much smaller due to
only storing sequence differences for each alignment. As such, they require
an indexed, reference fasta to regenerate the complete alignment. Cram files
are only supported by the L<Bio::DB::HTS> adapter.

=item USeq files

USeq files (F<.useq>) are compressed, binary, indexed files that support
multiple information types, including region intervals, BED type
annotations, or wig type scores distributed across the genome. They
support rapid, random access across the genome and are comparable to
both BigWig and BigBed files. See L<http://useq.sourceforge.net>
for more information. Files may only be local. These files are supported
by the L<Bio::DB::USeq> adapter.

=item Fasta files

Fasta files (F<.fa> or F<.fasta>) may be opened. Fasta files are indexed to
rapidly and randomly fetch genomic sequence. Three different adapters are
available for indexing the fasta file.

The L<Bio::DB::HTS::Faidx> adapter is preferentially used, and requires a
F<.fai> index file.

The L<Bio::DB::Sam::Fai> adapter may alternatively be used, and also requires a
F<.fai> index file.

The older style L<Bio::DB::Fasta> adapter is much slower, but will index
either a single genomic fasta file or a directory of individual chromosome
or contig fasta files.

Additionally, genomic sequence stored in a L<Bio::DB::SeqFeature::Store>
annotation database may also be used.

=back

While the functions within this library may appear to be simply a rehashing of
the methods and functions in L<Bio::DB::SeqFeature::Store> and other modules, they
either provide a simpler interface to often-used database methodologies or are
designed to work intimately with the BioToolBox data format file and data structures
(see L<Bio::ToolBox::Data>). One key advantage to these functions is the ability
to work with datasets that are stranded (transcriptome data, for example).

Historically, this module was initially written to use L<Bio::DB::GFF> for
database usage. It has since been re-written to use the L<Bio::DB::SeqFeature::Store>
database schema. Additional functionality has also been added for other databases
and file formats.

Complete usage and examples for the functions are provided below.

=head1 USAGE

Call the module at the beginning of your perl script and include the module
names to export. None are exported by default.

  use Bio::ToolBox::db_helper qw(
    get_new_feature_list
    get_db_feature
  );

This will export the indicated subroutine names into the current namespace.
Their usage is detailed below.

=head1 EXPORTED SUBROUTINES

=over 4

=item C<$BAM_ADAPTER> variable

=item C<$BIG_ADAPTER> variable

These variables control which Bam and bigWig/bigBed adapters are used
during execution in installations where both adapters are present. These
variables are not set initially; adapters are chosen automatically when an
adapter is requested during execution and modules are evaluated for loading.

=item use_bam_adapter

=item use_big_adapter

These are convenience methods for explicitly setting and retrieving the
C<$BAM_ADAPTER> and C<$BIG_ADAPTER> variables, respectively.

Optionally pass the value to set. It will always return the value being used.
Values include the following:

=over 4

=item Bam adapter

  * hts    Bio::DB::HTS (default if available)
  * sam    Bio::DB::Sam
  * none   no adapters allowed

=item Big adapter

  * big    Bio::DB::Big (default if available)
  * ucsc   Bio::DB::BigWig and Bio::DB::BigBed
  * none   no adapters allowed

=back
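
For example, a script that prefers the older UCSC-style adapters might set
them explicitly before opening any files. This is a minimal sketch; it
assumes both adapter sets are actually installed.

  use Bio::ToolBox::db_helper qw(use_bam_adapter use_big_adapter);

  # request specific adapters before any database or file is opened
  use_bam_adapter('sam');   # use Bio::DB::Sam for Bam files
  use_big_adapter('ucsc');  # use Bio::DB::BigWig and Bio::DB::BigBed

  # called without an argument, the current value is simply returned
  my $current_bam = use_bam_adapter();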

=item open_db_connection

  my $db_name = 'cerevisiae';
  my $db = open_db_connection($db_name);

  my $file = '/path/to/file.bam';
  my $db = open_db_connection($file);

  # within a forked child process
  # reusing the same variable to simplify code
  # pass second true value to avoid cache
  $db = open_db_connection($file, 1);

This is a simplified, generalized method that will open a connection to a
database or indexed file using any one of a number of different
BioPerl-style adapters. For local and remote files, the appropriate
adapter is determined by the file extension. Relational database
connection and authentication information is checked in a configuration
file. Once identified, the appropriate Perl module adapter is loaded
during run time automatically, saving the user from knowing in advance
to C<use> the appropriate module.

Pass the name of a relational database or the path of the database file to
the subroutine. The opened database object is returned. If it fails, then
an error message should be generated and nothing is returned.

B<IMPORTANT:> If you are forking your perl process, B<always> re-open your
database objects in the child process, and pass a second true value
to avoid using the cached database object. By default, opened databases are
cached to improve efficiency, but this will be disastrous when crossing forks.

=over 4

=item SeqFeature Store database

Provide the name of the relational database to open. Parameters for
connecting to a relational database are stored in the BioToolBox
configuration file, F<.biotoolbox.cfg>. These include database adaptors,
user name, password, etc. Further information regarding the configuration
file may be found in L<Bio::ToolBox::db_helper::config>.

For SQLite databases, provide the path to the F<.sqlite> or F<.db> file.

For in-memory databases, provide the path to the F<.gff> or F<.gff3> file.
The larger the file, the more memory and processing time is consumed as
the file is parsed into memory. This may be fine for small model organisms
like yeast, but not for vertebrate annotation.

=item Bam file database

Provide the path to the F<.bam> file. This may be a local file, or a remote
URL (generally supported by the adapter). If a local F<.bam.bai> index file
is not present, it will automatically be built prior to opening; this may
fail if the bam file is not sorted. Remote files are not indexed.

=item Cram file database

Provide the path to the local F<.cram> file. Currently, supplementary fasta
files are not supported, so the Cram file header must point to a valid
reference. Cram files will be indexed automatically if an index is not available.

=item BigWig file database

Provide the path to a local F<.bw> or F<.bigwig> file, or URL to a
remote file.

=item BigWigSet database

A local or remote directory of one or more BigWig files that can be treated
collectively as a database. A special text file may be included in the
directory to assign metadata to each BigWig file, including attributes such
as type, source, display name, strand, etc. See L<Bio::DB::BigWigSet> for
more information on the formatting of the metadata file. If a metadata
file is not included (it's not necessary), then the filenames are used
as the name and type. Strand may be parsed from the filenames if the
basename ends in F<.f> and F<.r>.

=item BigBed database

Provide the path to a local F<.bb> or F<.bigbed> file, or URL to a
remote file.

=item USeq database

Provide the path to a local F<.useq> file. As noted above, remote USeq
files are not supported.

=item Fasta database

Provide the path to a local fasta file for rapidly fetching genomic
sequence. If the fasta file is not indexed, then it will be indexed
automatically upon opening.

=back
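
The forking caveat above might be handled with L<Parallel::ForkManager>
along these lines. This is a sketch only; the file path is hypothetical,
and the per-chunk work is elided.

  use Parallel::ForkManager;
  use Bio::ToolBox::db_helper qw(open_db_connection);

  my $file = '/path/to/file.bam';
  my $pm = Parallel::ForkManager->new(4);
  foreach my $chunk (1 .. 4) {
      $pm->start and next;   # parent moves on to the next chunk
      # child process: re-open the connection, bypassing the cache
      my $db = open_db_connection($file, 1);
      # ... collect data from $db for this chunk ...
      $pm->finish;
  }
  $pm->wait_all_children;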

=item get_db_name

  my $name = get_db_name($db);

This method will attempt to get the name of an opened Database
object if, for some reason, it's unknown, e.g. the user only provided an
opened db object. This only works for some databases, at least those for
which I've bothered to track down a usable API call.

=item get_dataset_list

  my $db_name = 'cerevisiae';
  my @types = get_dataset_list($db_name);

This method will retrieve a list of the available features stored in the
database and returns an array of the features' types.

Pass either the name of the database or an established database object.
Supported databases include both L<Bio::DB::SeqFeature::Store> and
L<Bio::DB::BigWigSet> databases.

For L<Bio::DB::SeqFeature::Store> databases, the type is represented as
"C<primary_tag:source>", corresponding to the third and second GFF columns,
respectively. The types are sorted alphabetically first by source,
then by method.

For L<Bio::DB::BigWigSet> databases, the C<type>, C<method>, C<primary_tag>,
or C<name> attribute may be used, in that respective order of
availability. The list is sorted alphabetically.

=item verify_or_request_feature_types

  # verify a dataset, either file or database type
  my $db_name = 'cerevisiae';
  my $dataset = 'microarray_data';
  my @features = verify_or_request_feature_types(
      db      => $db_name,
      feature => $dataset,
  );
  unless (@features) {
      die " no valid feature provided!\n";
  }

  # select a list of features
  my $db_name = 'cerevisiae';
  my @types = (); # user not provided
  @types = verify_or_request_feature_types(
      db      => $db_name,
      feature => \@types,
  );
  # user will be prompted to select from database list
  if (@types) {
      print "using types " . join(", ", @types);
  }
  else {
      die " no valid features selected!\n";
  }

This subroutine will process a list of feature types or data sources to be
used for data or feature collection. There are two modes of action. If a
list was provided, it will process and verify the list. If no list was
provided, then the user will be prompted to select from a list of available
feature types in the provided database.

The provided list may be a list of feature types or a list of single
file data sources, including Bam, bigWig, or bigBed files. If the list
includes feature types, they are verified as existing in the provided
database. If the list includes file names, the local files are checked
and the names prefixed with "C<file:>" for use downstream.

If no list was provided, then a list of available feature types in the
provided database will be presented to the user, and the user prompted
to make a selection. One or more types may be selected, and a single
item may be enforced if requested. The response is filtered through
the parse_list method from L<Bio::ToolBox::utility>, so a mix of single
numbers or a range of numbers may be accepted. The responses are then
validated.

Two types of databases are supported: L<Bio::DB::SeqFeature::Store> and
L<Bio::DB::BigWigSet> databases. For SeqFeature Store databases, the
type is comprised of "C<primary_tag:source>", corresponding to the third and
second GFF columns. For BigWigSet databases, types correspond to either
the C<type>, C<method>, C<primary_tag>, or C<name> attributes,
in that order of availability.

Feature types or files that pass validation are returned as a list.
Those that do not pass validation are reported to C<STDOUT>.

To use this subroutine, pass an array with the following keys
and values. Not every key is required.

=over 4

=item db

The name of the database or a reference to an established BioPerl
database object. A L<Bio::DB::SeqFeature::Store> or
L<Bio::DB::BigWigSet> database is typically used. Only required when
no features are provided.

=item feature

Pass either a single dataset name as a scalar or an anonymous array
reference of a list of dataset names. These may have been provided as
a command line option and need to be verified. If nothing is passed,
then a list of possible datasets will be presented to the user to be
chosen.

=item prompt

Provide a phrase to be prompted to the user to help in selecting
datasets from the list. If none is provided, a generic prompt will be
used.

=item single

A Boolean value (1 or 0) indicating whether only a single dataset is
allowed when selecting datasets from a presented list. If true, only
one dataset choice is accepted. If false, one or more dataset choices
are allowed. Default is false.

=item limit

Optionally provide a word or regular expression that matches the
feature type (C<primary_tag> only; C<source>, if present, is ignored).
For example, provide "gene|mrna" to only present gene and mRNA
features to the user. This is only applicable when a user must select
from a database list. The default is to list all available feature
types.

=back

=item check_dataset_for_rpm_support($dataset, [$cpu])

  # count the total number of alignments
  my $dataset = '/path/to/file.bam';
  my $total_count = check_dataset_for_rpm_support($dataset);

  # use multithreading
  my $total_count = check_dataset_for_rpm_support($dataset, $cpu);

This subroutine will check a dataset for RPM, or Reads Per Million mapped,
support. Only two types of database files support this: Bam files and
BigBed files. If the dataset is either one of these, or the name of a
database feature which points to one of these files, then it will
calculate the total number of mapped alignments (Bam file) or features
(BigBed file). It will return this total number. If the dataset does
not support RPM (because it is not a Bam or BigBed file, for example),
then it will return undefined.

Pass this subroutine one or two values. The first is the name of the
dataset. Ideally it should be validated using verify_or_request_feature_types()
and have an appropriate prefix (file, http, or ftp).

For multi-threaded execution pass a second value, the number of CPU cores
available to count Bam files. This will speed up counting bam files
considerably. The default is 2 for environments where L<Parallel::ForkManager>
is installed, or 1 where it is not.
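
The returned total is typically turned into a scaling factor for
normalizing raw counts, as in this sketch:

  my $total_count = check_dataset_for_rpm_support($dataset);
  if (defined $total_count) {
      # multiply each raw count by this factor to obtain RPM values
      my $rpm_factor = 1_000_000 / $total_count;
      printf "%d alignments, scaling factor %.6f\n",
          $total_count, $rpm_factor;
  }
  else {
      warn "dataset does not support RPM normalization\n";
  }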

=item get_new_feature_list

  my $Data = Bio::ToolBox::Data->new;
  my $db_name = 'cerevisiae';
  my %data = get_new_feature_list(
      data     => $Data,
      db       => $db_name,
      features => 'genes',
  );

This subroutine will generate a new feature list collected from an annotation
database, such as a L<Bio::DB::SeqFeature::Store> database. It will populate
an empty L<Bio::ToolBox::Data> object's data table with a list of the collected
features. Columns include the Primary ID, Name, and feature Type.

The subroutine is passed an array containing three required arguments. The keys
include

=over 4

=item data

An empty L<Bio::ToolBox::Data> object. This is required.

=item db

The name of the database or a reference to an established database object.
Currently, this must be a L<Bio::DB::SeqFeature::Store> database.
This is required.

=item features

A scalar value containing a name representing the feature type(s) of
feature(s) to collect. The type may be a C<primary_tag> or C<primary_tag:source>.
It may be a single element or a comma-delimited list. This is required.

=item chrskip

Provide a regular-expression compatible string for skipping features from
specific chromosomes, for example mitochondrial or unmapped contigs.

=back

Status messages, including the number of features found and kept, are always
printed to C<STDOUT>.
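
Once populated, the table can be walked with the L<Bio::ToolBox::Data> row
iterator, for example in this sketch (the chromosome skip pattern is
hypothetical):

  get_new_feature_list(
      data     => $Data,
      db       => $db_name,
      features => 'gene',
      chrskip  => 'chrM|chrUn',  # hypothetical skip pattern
  );
  $Data->iterate( sub {
      my $row = shift;
      printf "%s\t%s\n", $row->name, $row->type;
  } );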

=item get_new_genome_list

  my $Data = Bio::ToolBox::Data->new();
  my $db_name = 'cerevisiae';
  my $window_size = 500;
  my $step_size = 250;
  my %data = get_new_genome_list(
      data => $Data,
      db   => $db_name,
      win  => $window_size,
      step => $step_size,
  );

This subroutine will generate a new list of genomic windows or intervals.
The genome is split into intervals of a specified size with the specified
step size.

The subroutine is passed an array containing the required arguments.

=over 4

=item data

An empty L<Bio::ToolBox::Data> object. This is required.

=item db

The name of the database or a reference to an established database
object. Any BioPerl adapter, including those described above, is
supported. Required.

=item win

A scalar value containing an integer representing the size of the
window in basepairs. The default value is 500 bp.

=item step

A scalar value containing an integer representing the step size for
advancing the window across the genome. The default is the window size.

=item chrskip

=item skip

=item exclude

Provide a regular expression compatible string for excluding specific
chromosomes or classes of chromosomes, such as the mitochondrial or
unmapped contigs.

=back

Status messages are printed to C<STDOUT>.
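
The generated windows can be consumed the same way as a feature list, as in
this sketch using the L<Bio::ToolBox::Data> row iterator:

  get_new_genome_list(
      data => $Data,
      db   => $db_name,
      win  => 500,
      step => 250,
  );
  $Data->iterate( sub {
      my $row = shift;
      # each row represents one genomic interval
      printf "%s:%d-%d\n", $row->seq_id, $row->start, $row->end;
  } );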
554
|
|
|
|
|
|
|
|
555
|
|
|
|
|
|
|
=item get_db_feature |
556
|
|
|
|
|
|
|
|
557
|
|
|
|
|
|
|
my $db = open_db_connection('annotation.db'); |
558
|
|
|
|
|
|
|
my $seqfeature = get_db_feature( |
559
|
|
|
|
|
|
|
db => $db, |
560
|
|
|
|
|
|
|
type => 'gene', |
561
|
|
|
|
|
|
|
name => 'ABC1', |
562
|
|
|
|
|
|
|
); |
563
This subroutine will retrieve a specific feature from a L<Bio::DB::SeqFeature::Store>
database for subsequent analysis, manipulation, and/or score retrieval using the
L<Bio::ToolBox::Data> methods. It relies upon unique information to pull out a
single, unique feature.

Several attributes may be used to pull out the feature, including the feature's
unique database primary ID, name and/or aliases, and GFF type (C<primary_tag>
and/or C<source>). The C<get_new_feature_list> subroutine will generate a list
of features with their unique identifiers.

The C<primary_id> attribute is preferentially used, as it provides the best
performance. However, it is not always portable between databases or even a
database re-load. In that case, the C<name> and C<type> are used to identify
potential features. Note that the C<name> may not always be unique in the
database; in this case, the addition of aliases may help. If all else fails, a
new feature list should be generated.

To get a feature, pass an array of arguments.

=over 4

=item db

The name of the L<Bio::DB::SeqFeature::Store> database or a reference
to an established database object.

=item id

Provide the C<primary_id> tag. In the L<Bio::DB::SeqFeature::Store>
database schema this is a (usually) non-portable, unique identifier
specific to a database. It provides the fastest lookup.

=item name

A scalar value representing the feature C<display_name>. Aliases may be
appended with semicolon delimiters.

=item type

Provide the feature type, which is typically expressed as
C<primary_tag:source>. Alternatively, provide just the C<primary_tag> only.

=back

While it is possible to identify features with any two attributes
(or possibly just name or ID), the best performance is obtained with
all three together. The first SeqFeature object is returned if multiple
are found.
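As a sketch of the attribute list described above (the database contents,
names, and types here are illustrative placeholders, assuming a key/value
argument style as with the other subroutines in this module):

	# hypothetical usage sketch: retrieve one unique feature
	my $feature = get_db_feature(
		db   => $db,          # opened database object or name
		id   => $primary_id,  # fastest, but not portable between re-loads
		name => 'YAL069W',    # display name; aliases joined by semicolons
		type => 'gene:SGD',   # primary_tag:source, or just primary_tag
	);
	printf "%s at %s:%d-%d\n", $feature->display_name, $feature->seq_id,
		$feature->start, $feature->end if $feature;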
612
|
|
|
|
|
|
|
|
613
|
|
|
|
|
|
|
=item get_segment_score |
614
|
|
|
|
|
|
|
|
615
|
|
|
|
|
|
|
my $score = get_segment_score( |
616
|
|
|
|
|
|
|
$seqfeature->seq_id, # chromosome |
617
|
|
|
|
|
|
|
$seqfeature->start, # start |
618
|
|
|
|
|
|
|
$seqfeature->end, # end |
619
|
|
|
|
|
|
|
$seqfeature->strand, # strand |
620
|
|
|
|
|
|
|
'sense', # strandedness |
621
|
|
|
|
|
|
|
'mean', # method |
622
|
|
|
|
|
|
|
0, # return type |
623
|
|
|
|
|
|
|
$db, # database |
624
|
|
|
|
|
|
|
'scores.bw' # datasets |
625
|
|
|
|
|
|
|
); |
626
|
|
|
|
|
|
|
|
627
|
|
|
|
|
|
|
Generic method for collecting score(s) from a database. This is the |
628
|
|
|
|
|
|
|
primary interface to all of the database handling packages, including |
629
|
|
|
|
|
|
|
Bam, bigWig, bigBed, and USeq files, as well as SeqFeature databases. |
630
|
|
|
|
|
|
|
By passing common parameters here, the appropriate database or file |
631
|
|
|
|
|
|
|
handling packages are loaded during run time and the appropriate |
632
|
|
|
|
|
|
|
methods are called. Essentially, this is where the magic happens. |
633
|
|
|
|
|
|
|
|
634
|
|
|
|
|
|
|
Pass the method an array of nine (or more) parameters. The parameters |
635
|
|
|
|
|
|
|
are not keyed as a hash, and the order is explicit. However, see the |
636
|
|
|
|
|
|
|
L module for named index positions. |
637
|
|
|
|
|
|
|
The order is as follows. |
638
|
|
|
|
|
|
|
|
639
|
|
|
|
|
|
|
=over 4 |
640
|
|
|
|
|
|
|
|
641
|
|
|
|
|
|
|
=item * 0 Chromosome |
642
|
|
|
|
|
|
|
|
643
|
|
|
|
|
|
|
String representing chromosome or sequence ID. |
644
|
|
|
|
|
|
|
|
645
|
|
|
|
|
|
|
=item * 1 Start |
646
|
|
|
|
|
|
|
|
647
|
|
|
|
|
|
|
Start coordinate (integer, 1-based) |
648
|
|
|
|
|
|
|
|
649
|
|
|
|
|
|
|
=item * 2 Stop |
650
|
|
|
|
|
|
|
|
651
|
|
|
|
|
|
|
Stop or end coordinate (integer, 1-based) |
652
|
|
|
|
|
|
|
|
653
|
|
|
|
|
|
|
=item * 3 Strand |
654
|
|
|
|
|
|
|
|
655
|
|
|
|
|
|
|
BioPerl style strand value, including -1, 0, and 1. |
656
|
|
|
|
|
|
|
|
657
|
|
|
|
|
|
|
=item * 4 Strandedness |
658
|
|
|
|
|
|
|
|
659
|
|
|
|
|
|
|
A string indicating strandedness: sense, antisense, all. |
660
|
|
|
|
|
|
|
|
661
|
|
|
|
|
|
|
=item * 5 Method |
662
|
|
|
|
|
|
|
|
663
|
|
|
|
|
|
|
A string representing the method of data collection. |
664
|
|
|
|
|
|
|
|
665
|
|
|
|
|
|
|
Current acceptable methods include mean, median, sum, min, max, |
666
|
|
|
|
|
|
|
stddev, count, ncount (named count), and pcount (precise count). |
667
|
|
|
|
|
|
|
See the individual db_helper sub classes for more details. Not all |
668
|
|
|
|
|
|
|
adapters support all methods. |
669
|
|
|
|
|
|
|
|
670
|
|
|
|
|
|
|
=item * 6 Return type |
671
|
|
|
|
|
|
|
|
672
|
|
|
|
|
|
|
An integer (0, 1, 2) representing the type of value to return. |
673
|
|
|
|
|
|
|
|
674
|
|
|
|
|
|
|
A return type of 0 indicates a single score should be returned, |
675
|
|
|
|
|
|
|
calculated using the specified method. A return type of 1 is an |
676
|
|
|
|
|
|
|
array or array reference of all scores found. A return type of 2 |
677
|
|
|
|
|
|
|
is a hash or hash reference of coordinate => score. |
678
|
|
|
|
|
|
|
|
679
|
|
|
|
|
|
|
=item * 7 Dataset database |
680
|
|
|
|
|
|
|
|
681
|
|
|
|
|
|
|
This is either a L or a L |
682
|
|
|
|
|
|
|
database containing multiple score dataset features. If your dataset |
683
|
|
|
|
|
|
|
does not contain multiple different feature types, pass C. |
684
|
|
|
|
|
|
|
|
685
|
|
|
|
|
|
|
Pass either a string containing the name of database (which will be opened) |
686
|
|
|
|
|
|
|
or an already opened database object. |
687
|
|
|
|
|
|
|
|
688
|
|
|
|
|
|
|
=item * 8 Dataset |
689
|
|
|
|
|
|
|
|
690
|
|
|
|
|
|
|
This is the verified dataset from which the scores will be passed. The |
691
|
|
|
|
|
|
|
dataset should be verified using the L |
692
|
|
|
|
|
|
|
subroutine. For indexed file datasets, e.g. F<.bam> or F<.bw> files, |
693
|
|
|
|
|
|
|
this will the verified local path or URL. For a database, this will be |
694
|
|
|
|
|
|
|
a database type, sucha as C or C. |
695
|
|
|
|
|
|
|
or the path to a data file, e.g. Bam, bigWig, bigBed, or USeq file. |
696
|
|
|
|
|
|
|
|
697
|
|
|
|
|
|
|
If multiple datasets are to be combined, simply append them to the |
698
|
|
|
|
|
|
|
array (index 9, 10, etc). |
699
|
|
|
|
|
|
|
|
700
|
|
|
|
|
|
|
=back |
701
|
|
|
|
|
|
|
|
702
|
|
|
|
|
|
|
The returned item is dependent on the value of the return type code. |
703
|
|
|
|
|
|
|
|
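The other two return types can be sketched the same way (coordinates and
file name here are illustrative; a F<.bw> file needs no dataset database,
so C<undef> is passed at index 7):

	# return type 1: all individual scores found in the interval,
	# useful for custom statistics
	my $all_scores = get_segment_score(
		'chrI', 1000, 2000, 1, 'all', 'mean', 1, undef, 'scores.bw'
	);

	# return type 2: coordinate => score pairs, handy for building
	# positional profiles across the interval
	my $pos2score = get_segment_score(
		'chrI', 1000, 2000, 1, 'all', 'mean', 2, undef, 'scores.bw'
	);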

=item calculate_score

	my $scores = get_segment_score($chr, $start, $stop, $strand,
		'sense', 'ncount', 1, undef, 'file:/path/to/dataset.bam');
	my $score = calculate_score('ncount', $scores);

This subroutine will perform the math to calculate a single score
from an array of values. Currently acceptable methods include mean,
median, sum, min, max, stddev, count, ncount, and pcount.

Pass the subroutine two items: the name of the mathematical method
enumerated above, and an array reference of the values to work on.
A single score will be returned.

=item get_chromosome_list

	my $db = open_db_connection('cerevisiae');
	# get all chromosomes in the database
	my @chromosomes = get_chromosome_list($db);
	foreach (@chromosomes) {
		my $name   = $_->[0];
		my $length = $_->[1];
		print "chromosome $name is $length bp\n";
	}

This subroutine will collect a list of chromosomes or reference sequences
in a Bio::DB database and return the list along with their sizes in bp.
Many BioPerl-based databases are supported, including
L<Bio::DB::SeqFeature::Store>, L<Bio::DB::Fasta>, L<Bio::DB::Sam>,
L<Bio::DB::HTS>, L<Bio::DB::BigWig>, L<Bio::DB::BigWigSet>,
L<Bio::DB::BigBed>, or any others that support the C<seq_ids> method.
L<Bio::DB::Big> adapters for big files are also supported, though see the
note below. See the L<open_db_connection> subroutine for more information.

Pass the subroutine either the name of the database, the path to the
database file, or an opened database object.

Optionally pass a second value, a regular expression compatible string
for skipping specific chromosomes or chromosome classes, such as mitochondrial
or unmapped contigs. The default is to return all chromosomes.

The subroutine will return an array, with each element representing a
chromosome or reference sequence in the database. Each element is an anonymous
array of two elements: the chromosome name and its length in bp.

B<Note>: L<Bio::DB::Big> databases for bigWig and bigBed files do
not return chromosomes in the original file order; they are returned
in a sorted manner that may not be the original order.

=item low_level_bam_coverage

	my $coverage = low_level_bam_coverage($sam, $tid, $start, $stop);

This is a convenience method for running the low level bam coverage method.
Since both L<Bio::DB::Sam> and L<Bio::DB::HTS> bam file adapters are
supported, and each has slight variation in the API syntax, this method helps
to abstract the actual method and use the appropriate syntax depending on
which adapter is loaded. It is best if the C<$sam> object was opened using the
L<open_db_connection> method, or that C<$BAM_ADAPTER> is set.

B<Note> that this is the B<low level> coverage method based on the index
object, and not the similarly named high level API method. Read the adapter
documentation for proper usage.
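A sketch of a typical call (assuming the Bio::DB::Sam-style header
C<parse_region> method, which converts a region string into a numeric
target ID and 0-based coordinates; the file name is illustrative):

	my $sam = open_db_connection('alignments.bam');  # illustrative file
	# translate a region string into target id plus 0-based coordinates
	my ($tid, $start, $stop) = $sam->header->parse_region('chrI:1000-2000');
	my $coverage = low_level_bam_coverage($sam, $tid, $start, $stop);
	# $coverage is an array reference of per-base read depth
	printf "max depth: %d\n", (sort { $b <=> $a } @$coverage)[0];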

=item low_level_bam_fetch

	my $success = low_level_bam_fetch($sam, $tid, $start, $stop, $callback, $data);

This is a convenience method for running the low level bam fetch method.
Since both L<Bio::DB::Sam> and L<Bio::DB::HTS> bam file adapters are
supported, and each has slight variation in the API syntax, this method helps
to abstract the actual method and use the appropriate syntax depending on
which adapter is loaded. It is best if the C<$sam> object was opened using the
L<open_db_connection> method, or that C<$BAM_ADAPTER> is set.

B<Note> that this is the B<low level> fetch method based on the index
object, and not the similarly named high level API method. Read the adapter
documentation for proper usage.
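A hypothetical sketch of the callback style: each overlapping alignment is
passed to the callback together with the shared C<$data> reference, here
used to count mapped alignments (file name and region are placeholders):

	my $sam = open_db_connection('alignments.bam');  # illustrative file
	my ($tid, $start, $stop) = $sam->header->parse_region('chrI:1000-2000');
	my $data = { count => 0 };
	my $callback = sub {
		my ($alignment, $data) = @_;
		# skip unaligned reads, count the rest
		$data->{count}++ unless $alignment->unmapped;
	};
	low_level_bam_fetch($sam, $tid, $start, $stop, $callback, $data);
	print "$data->{count} mapped alignments\n";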

=item get_genomic_sequence

	my $sequence = get_genomic_sequence($db, $chrom, $start, $stop);

This is a convenience wrapper function for fetching genomic sequence from
three different supported fasta database adapters, which, of course, all
have different APIs: the L<Bio::DB::HTS::Faidx>, L<Bio::DB::Sam::Fai>,
and L<Bio::DB::Fasta> adapters, as well as L<Bio::DB::SeqFeature::Store>
and possibly other BioPerl databases.

Pass the opened database object, chromosome, start, and stop coordinates. This
assumes BioPerl standard 1-based coordinates. Only the forward sequence is
retrieved. The sequence is returned as a simple string.

=back

=head1 INTERNAL SUBROUTINES

These are not intended for general consumption but are documented here
so that at least some of us know what is going on.

=over 4

=item _lookup_db_method

Internal method for determining which database adapter and method call
should be used given a set of parameters. The result is cached for
future data queries (likely hundreds or thousands of repeated calls).
The method is stored in the global hash C<%DB_METHODS>.

Pass the parameter array reference from L<get_segment_score>. This
should load the appropriate db_helper sub module during run time.

=item _load_helper_module

Subroutine to load and import a db_helper module during run time.
Sets the appropriate global variable if successful.

=item _load_bam_helper_module

Subroutine to determine which Bam adapter db_helper module to load:
either the older samtools-based adapter or the newer HTSlib-based adapter.
Uses the global and exportable variable C<$BAM_ADAPTER> to set a
preference for which adapter to use. Use 'sam' or 'hts', or some
string containing these names. Do B<not> change this variable after
loading the helper module; it will not do what you think it will do.
If both are available and a preference is not set, then the hts helper
is the default.

=back

=cut

use strict;
require Exporter;
use Carp qw(carp cluck croak confess);
use Module::Load; # for dynamic loading during runtime
use List::Util qw(min max sum0);
use Statistics::Lite qw(median range stddevp);
use Bio::ToolBox::db_helper::constants;
use Bio::ToolBox::utility;

# check values for dynamically loaded helper modules
# these are loaded only when needed during runtime to avoid wasting resources
our $BAM_OK      = 0;
our $BIGBED_OK   = 0;
our $BIGWIG_OK   = 0;
our $SEQFASTA_OK = 0;
our $USEQ_OK     = 0;
our $BAM_ADAPTER = undef; # preference for which bam adapter to use
our $BIG_ADAPTER = undef;

# define reusable variables
our %total_read_number;  # for rpm calculations
our $primary_id_warning; # for out of date primary IDs
our %OPENED_DB;          # cache for some opened Bio::DB databases
our %DB_METHODS;         # cache for database score methods

# Exported names
our @ISA       = qw(Exporter);
our @EXPORT    = qw();
our @EXPORT_OK = qw(
	$BAM_ADAPTER
	$BIG_ADAPTER
	use_bam_adapter
	use_big_adapter
	open_db_connection
	get_dataset_list
	verify_or_request_feature_types
	check_dataset_for_rpm_support
	get_new_feature_list
	get_new_genome_list
	validate_included_feature
	get_db_feature
	get_segment_score
	calculate_score
	get_chromosome_list
	low_level_bam_coverage
	low_level_bam_fetch
	get_genomic_sequence
);

# The true statement
1;


### Open a connection to the SeqFeature Store MySQL database

sub use_bam_adapter {
	my $a = shift || undef;
	$BAM_ADAPTER = $a if $a;
	return $BAM_ADAPTER;
}


sub use_big_adapter {
	my $a = shift || undef;
	$BIG_ADAPTER = $a if $a;
	return $BIG_ADAPTER;
}

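# For example (a sketch; the file name is a placeholder), a caller can pin
# the adapter preference before the first database is opened:
#
#	use Bio::ToolBox::db_helper qw(use_bam_adapter open_db_connection);
#	use_bam_adapter('sam'); # or 'hts'; set before the helper module loads
#	my $sam = open_db_connection('alignments.bam');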
905
|
|
|
|
|
|
|
sub open_db_connection { |
906
|
6
|
|
|
6
|
1
|
12
|
my $database = shift; |
907
|
6
|
|
50
|
|
|
23
|
my $no_cache = shift || 0; |
908
|
6
|
50
|
|
|
|
13
|
unless ($database) { |
909
|
|
|
|
|
|
|
# cluck 'no database name passed!'; |
910
|
0
|
|
|
|
|
0
|
return; |
911
|
|
|
|
|
|
|
} |
912
|
|
|
|
|
|
|
|
913
|
|
|
|
|
|
|
# first check if it is a database reference |
914
|
6
|
|
|
|
|
12
|
my $db_ref = ref $database; |
915
|
6
|
100
|
|
|
|
20
|
if ($db_ref =~ /DB|big::BigWigSet/) { |
916
|
|
|
|
|
|
|
# the provided database is already an open database object |
917
|
|
|
|
|
|
|
# nothing to open, return as is |
918
|
1
|
|
|
|
|
3
|
return $database; |
919
|
|
|
|
|
|
|
} |
920
|
|
|
|
|
|
|
|
921
|
|
|
|
|
|
|
# check to see if we have already opened it |
922
|
5
|
100
|
66
|
|
|
20
|
if (exists $OPENED_DB{$database} and not $no_cache) { |
923
|
|
|
|
|
|
|
# return the cached database connection |
924
|
|
|
|
|
|
|
# but NOT if user explicitly requested no cached databases |
925
|
|
|
|
|
|
|
# DO NOT reuse database objects if you have forked!!! Bad things happen |
926
|
4
|
|
|
|
|
10
|
return $OPENED_DB{$database}; |
927
|
|
|
|
|
|
|
} |
928
|
|
|
|
|
|
|
|
929
|
|
|
|
|
|
|
# skip parsed databases |
930
|
1
|
50
|
|
|
|
3
|
return if $database =~ /^Parsed:/; # we don't open parsed annotation files |
931
|
|
|
|
|
|
|
|
932
|
|
|
|
|
|
|
|
933
|
|
|
|
|
|
|
### Attempt to open the database |
934
|
|
|
|
|
|
|
# we go through a series of checks to determine if it is remote, local, |
935
|
|
|
|
|
|
|
# an indexed big data file, SQLite file, etc |
936
|
|
|
|
|
|
|
# when all else fails, try to open a SQL connection |
937
|
1
|
|
|
|
|
2
|
my $db; |
938
|
|
|
|
|
|
|
my $error; |
939
|
|
|
|
|
|
|
|
940
|
|
|
|
|
|
|
# check if it is a remote file |
941
|
1
|
50
|
|
|
|
11
|
if ($database =~ /^(?:https?|ftp)/i) { |
|
|
50
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
942
|
|
|
|
|
|
|
|
943
|
|
|
|
|
|
|
# a remote Bam database |
944
|
0
|
0
|
|
|
|
0
|
if ($database =~ /\.bam$/i) { |
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
945
|
|
|
|
|
|
|
# open using Bam adaptor |
946
|
0
|
0
|
|
|
|
0
|
_load_bam_helper_module() unless $BAM_OK; |
947
|
0
|
0
|
|
|
|
0
|
if ($BAM_OK) { |
948
|
0
|
|
|
|
|
0
|
$db = open_bam_db($database); |
949
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
950
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open remote Bam file" . |
951
|
|
|
|
|
|
|
" '$database'! $!\n"; |
952
|
|
|
|
|
|
|
} |
953
|
|
|
|
|
|
|
} |
954
|
|
|
|
|
|
|
else { |
955
|
0
|
|
|
|
|
0
|
$error = " Bam database cannot be loaded because\n" . |
956
|
|
|
|
|
|
|
" Bio::DB::Sam or Bio::DB::HTS is not installed\n"; |
957
|
|
|
|
|
|
|
} |
958
|
|
|
|
|
|
|
} |
959
|
|
|
|
|
|
|
|
960
|
|
|
|
|
|
|
# a remote BigBed database |
961
|
|
|
|
|
|
|
elsif ($database =~ /\.(?:bb|bigbed)$/i) { |
962
|
|
|
|
|
|
|
# open using BigBed adaptor |
963
|
0
|
0
|
|
|
|
0
|
$BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK; |
964
|
0
|
0
|
|
|
|
0
|
if ($BIGBED_OK) { |
965
|
0
|
|
|
|
|
0
|
$db = open_bigbed_db($database); |
966
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
967
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open remote BigBed file" . |
968
|
|
|
|
|
|
|
" '$database'! $!\n"; |
969
|
|
|
|
|
|
|
} |
970
|
|
|
|
|
|
|
} |
971
|
|
|
|
|
|
|
else { |
972
|
0
|
|
|
|
|
0
|
$error = " BigBed database cannot be loaded because\n" . |
973
|
|
|
|
|
|
|
" Bio::DB::Big or Bio::DB::BigBed is not installed\n"; |
974
|
|
|
|
|
|
|
} |
975
|
|
|
|
|
|
|
} |
976
|
|
|
|
|
|
|
|
977
|
|
|
|
|
|
|
# a remote BigWig database |
978
|
|
|
|
|
|
|
elsif ($database =~ /\.(?:bw|bigwig)$/i) { |
979
|
|
|
|
|
|
|
# open using BigWig adaptor |
980
|
0
|
0
|
|
|
|
0
|
$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK; |
981
|
0
|
0
|
|
|
|
0
|
if ($BIGWIG_OK) { |
982
|
0
|
|
|
|
|
0
|
$db = open_bigwig_db($database); |
983
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
984
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open remote BigWig file" . |
985
|
|
|
|
|
|
|
" '$database'! $!\n"; |
986
|
|
|
|
|
|
|
} |
987
|
|
|
|
|
|
|
} |
988
|
|
|
|
|
|
|
else { |
989
|
0
|
|
|
|
|
0
|
$error = " BigWig database cannot be loaded because\n" . |
990
|
|
|
|
|
|
|
" Bio::DB::Big or Bio::DB::BigWig is not installed\n"; |
991
|
|
|
|
|
|
|
} |
992
|
|
|
|
|
|
|
} |
993
|
|
|
|
|
|
|
|
994
|
|
|
|
|
|
|
# a remote useq file |
995
|
|
|
|
|
|
|
elsif ($database =~ /\.useq$/) { |
996
|
|
|
|
|
|
|
# uh oh! remote useq files are not supported |
997
|
0
|
|
|
|
|
0
|
$error = " ERROR: remote useq files are not supported!\n"; |
998
|
|
|
|
|
|
|
} |
999
|
|
|
|
|
|
|
|
1000
|
|
|
|
|
|
|
# a remote fasta file??? |
1001
|
|
|
|
|
|
|
elsif ($database =~ /\.fa(?:sta)?$/i) { |
1002
|
|
|
|
|
|
|
# uh oh! remote fasta files are not supported |
1003
|
0
|
|
|
|
|
0
|
$error = " ERROR: remote fasta files are not supported!\n"; |
1004
|
|
|
|
|
|
|
} |
1005
|
|
|
|
|
|
|
|
1006
|
|
|
|
|
|
|
# a presumed remote directory, presumably of bigwig files |
1007
|
|
|
|
|
|
|
else { |
1008
|
|
|
|
|
|
|
# open using BigWigSet adaptor |
1009
|
0
|
0
|
|
|
|
0
|
$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig') |
1010
|
|
|
|
|
|
|
unless $BIGWIG_OK; |
1011
|
0
|
0
|
|
|
|
0
|
if ($BIGWIG_OK) { |
1012
|
0
|
|
|
|
|
0
|
$db = open_bigwigset_db($database); |
1013
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1014
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open presumed remote " . |
1015
|
|
|
|
|
|
|
"BigWigSet directory '$database'! $!\n"; |
1016
|
|
|
|
|
|
|
} |
1017
|
|
|
|
|
|
|
} |
1018
|
|
|
|
|
|
|
else { |
1019
|
0
|
|
|
|
|
0
|
$error = " Presumed BigWigSet database cannot be loaded because\n" . |
1020
|
|
|
|
|
|
|
" Bio::DB::BigWigSet is not installed\n"; |
1021
|
|
|
|
|
|
|
} |
1022
|
|
|
|
|
|
|
} |
1023
|
|
|
|
|
|
|
|
1024
|
|
|
|
|
|
|
} |
1025
|
|
|
|
|
|
|
|
1026
|
|
|
|
|
|
|
# check for a known file type |
1027
|
|
|
|
|
|
|
elsif ($database =~ /gff|bw|bb|bam|useq|db|sqlite|fa|fasta|bigbed|bigwig|cram/i) { |
1028
|
|
|
|
|
|
|
|
1029
|
|
|
|
|
|
|
# remove prefix, just in case |
1030
|
1
|
|
|
|
|
2
|
$database =~ s/^file://; |
1031
|
|
|
|
|
|
|
|
1032
|
|
|
|
|
|
|
# first check that it exists |
1033
|
1
|
50
|
|
|
|
27
|
if (-e $database) { |
1034
|
|
|
|
|
|
|
|
1035
|
|
|
|
|
|
|
# a Bam database |
1036
|
1
|
50
|
|
|
|
14
|
if ($database =~ /\.bam$/i) { |
|
|
50
|
|
|
|
|
|
|
|
50
|
|
|
|
|
|
|
|
50
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
1037
|
|
|
|
|
|
|
# open using appropriate bam adaptor |
1038
|
0
|
0
|
|
|
|
0
|
_load_bam_helper_module() unless $BAM_OK; |
1039
|
0
|
0
|
|
|
|
0
|
if ($BAM_OK) { |
1040
|
0
|
|
|
|
|
0
|
undef $@; |
1041
|
0
|
|
|
|
|
0
|
$db = open_bam_db($database); |
1042
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1043
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local Bam file" . |
1044
|
|
|
|
|
|
|
" '$database'! $@\n"; |
1045
|
|
|
|
|
|
|
} |
1046
|
|
|
|
|
|
|
} |
1047
|
|
|
|
|
|
|
else { |
1048
|
0
|
|
|
|
|
0
|
$error = " Bam database cannot be loaded because\n" . |
1049
|
|
|
|
|
|
|
" Bio::DB::Sam or Bio::DB::HTS is not installed\n"; |
1050
|
|
|
|
|
|
|
} |
1051
|
|
|
|
|
|
|
} |
1052
|
|
|
|
|
|
|
|
1053
|
|
|
|
|
|
|
# a BigBed database |
1054
|
|
|
|
|
|
|
elsif ($database =~ /\.(?:bb|bigbed)$/i) { |
1055
|
|
|
|
|
|
|
# open using BigBed adaptor |
1056
|
0
|
0
|
|
|
|
0
|
$BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK; |
1057
|
0
|
0
|
|
|
|
0
|
if ($BIGBED_OK) { |
1058
|
0
|
|
|
|
|
0
|
undef $@; |
1059
|
0
|
|
|
|
|
0
|
$db = open_bigbed_db($database); |
1060
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1061
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local BigBed file" . |
1062
|
|
|
|
|
|
|
" '$database'! $@\n"; |
1063
|
|
|
|
|
|
|
} |
1064
|
|
|
|
|
|
|
} |
1065
|
|
|
|
|
|
|
else { |
1066
|
0
|
|
|
|
|
0
|
$error = " BigBed database cannot be loaded because\n" . |
1067
|
|
|
|
|
|
|
" Big::DB::Big or Bio::DB::BigBed is not installed\n"; |
1068
|
|
|
|
|
|
|
} |
1069
|
|
|
|
|
|
|
} |
1070
|
|
|
|
|
|
|
|
1071
|
|
|
|
|
|
|
# a BigWig database |
1072
|
|
|
|
|
|
|
elsif ($database =~ /\.(?:bw|bigwig)$/i) { |
1073
|
|
|
|
|
|
|
# open using BigWig adaptor |
1074
|
0
|
0
|
|
|
|
0
|
$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK; |
1075
|
0
|
0
|
|
|
|
0
|
if ($BIGWIG_OK) { |
1076
|
0
|
|
|
|
|
0
|
undef $@; |
1077
|
0
|
|
|
|
|
0
|
$db = open_bigwig_db($database); |
1078
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1079
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local BigWig file" . |
1080
|
|
|
|
|
|
|
" '$database'! $@\n"; |
1081
|
|
|
|
|
|
|
} |
1082
|
|
|
|
|
|
|
} |
1083
|
|
|
|
|
|
|
else { |
1084
|
0
|
|
|
|
|
0
|
$error = " BigWig database cannot be loaded because\n" . |
1085
|
|
|
|
|
|
|
" Big::DB::Big or Bio::DB::BigWig is not installed\n"; |
1086
|
|
|
|
|
|
|
} |
1087
|
|
|
|
|
|
|
} |
1088
|
|
|
|
|
|
|
|
1089
|
|
|
|
|
|
|
# a useq database |
1090
|
|
|
|
|
|
|
elsif ($database =~ /\.useq$/i) { |
1091
|
|
|
|
|
|
|
# open using USeq adaptor |
1092
|
1
|
50
|
|
|
|
44
|
$USEQ_OK = _load_helper_module('Bio::ToolBox::db_helper::useq') |
1093
|
|
|
|
|
|
|
unless $USEQ_OK; |
1094
|
1
|
50
|
|
|
|
3
|
if ($USEQ_OK) { |
1095
|
1
|
|
|
|
|
4
|
$db = open_useq_db($database); |
1096
|
1
|
50
|
|
|
|
7
|
unless ($db) { |
1097
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local useq file" . |
1098
|
|
|
|
|
|
|
" '$database'! $!\n"; |
1099
|
|
|
|
|
|
|
} |
1100
|
|
|
|
|
|
|
} |
1101
|
|
|
|
|
|
|
else { |
1102
|
0
|
|
|
|
|
0
|
$error = " Useq database cannot be loaded because\n" . |
1103
|
|
|
|
|
|
|
" Bio::DB::USeq is not installed\n"; |
1104
|
|
|
|
|
|
|
} |
1105
|
|
|
|
|
|
|
} |
1106
|
|
|
|
|
|
|
|
1107
|
|
|
|
|
|
|
# a Fasta File |
1108
|
|
|
|
|
|
|
elsif ($database =~ /\.fa(?:sta)?$/i) { |
1109
|
|
|
|
|
|
|
# first try a modern fai indexed adapter |
1110
|
0
|
0
|
|
|
|
0
|
_load_bam_helper_module() unless $BAM_OK; |
1111
|
0
|
0
|
|
|
|
0
|
if ($BAM_OK) { |
1112
|
0
|
|
|
|
|
0
|
$db = open_indexed_fasta($database); |
1113
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1114
|
0
|
|
|
|
|
0
|
$error .= " ERROR: could not open indexed fasta file '$database'!\n"; |
1115
|
|
|
|
|
|
|
} |
1116
|
|
|
|
|
|
|
} |
1117
|
|
|
|
|
|
|
else { |
1118
|
|
|
|
|
|
|
# open using the old slow BioPerl Fasta adaptor |
1119
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
1120
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
1121
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
1122
|
0
|
|
|
|
|
0
|
$db = open_fasta_db($database); |
1123
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1124
|
0
|
|
|
|
|
0
|
$error .= " ERROR: could not open fasta file '$database'!\n"; |
1125
|
0
|
0
|
|
|
|
0
|
if (-e "$database\.index") { |
1126
|
0
|
|
|
|
|
0
|
$error .= " Try deleting $database\.index and try again\n"; |
1127
|
|
|
|
|
|
|
} |
1128
|
|
|
|
|
|
|
} |
1129
|
|
|
|
|
|
|
} |
1130
|
|
|
|
|
|
|
else { |
1131
|
0
|
|
|
|
|
0
|
$error .= " Fasta file could not be loaded because neither" . |
1132
|
|
|
|
|
|
|
" Bio::DB::HTS, Bio::DB::Sam, or Bio::DB::Fasta is installed\n"; |
1133
|
|
|
|
|
|
|
} |
1134
|
|
|
|
|
|
|
} |
1135
|
|
|
|
|
|
|
} |
1136
|
|
|
|
|
|
|
|
1137
|
|
|
|
|
|
|
# a gff3 or sqlite database |
1138
|
|
|
|
|
|
|
elsif ($database =~ /\.(?:gff3?|gff3?\.gz|db|sqlite)$/) { |
1139
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
1140
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
1141
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
1142
|
0
|
|
|
|
|
0
|
$db = open_store_db($database); |
1143
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1144
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not load SeqFeature database file '$database'!\n"; |
1145
|
|
|
|
|
|
|
} |
1146
|
|
|
|
|
|
|
} |
1147
|
|
|
|
|
|
|
else { |
1148
|
0
|
|
|
|
|
0
|
$error .= " Module Bio::DB::SeqFeature::Store is required to load " . |
1149
|
|
|
|
|
|
|
"GFF and database files\n"; |
1150
|
|
|
|
|
|
|
} |
1151
|
|
|
|
|
|
|
} |
1152
|
|
|
|
|
|
|
|
1153
|
|
|
|
|
|
|
# a cram database |
1154
|
|
|
|
|
|
|
elsif ($database =~ /\.cram$/i) { |
1155
|
|
|
|
|
|
|
# open using HTS bam adaptor only |
1156
|
0
|
|
0
|
|
|
0
|
$BAM_ADAPTER ||= 'hts'; |
1157
|
0
|
0
|
|
|
|
0
|
if ($BAM_ADAPTER eq 'sam') { |
1158
|
0
|
|
|
|
|
0
|
$error .= " ERROR: Only HTS adapters support Cram files!\n"; |
1159
|
|
|
|
|
|
|
} |
1160
|
0
|
0
|
|
|
|
0
|
_load_bam_helper_module() unless $BAM_OK; |
1161
|
0
|
0
|
|
|
|
0
|
if ($BAM_OK) { |
1162
|
0
|
|
|
|
|
0
|
undef $@; |
1163
|
0
|
|
|
|
|
0
|
$db = open_bam_db($database); |
1164
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1165
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local Cram file" . |
1166
|
|
|
|
|
|
|
" '$database'! $@\n"; |
1167
|
|
|
|
|
|
|
} |
1168
|
|
|
|
|
|
|
} |
1169
|
|
|
|
|
|
|
else { |
1170
|
0
|
|
|
|
|
0
|
$error = " Cram database cannot be loaded because\n" . |
1171
|
|
|
|
|
|
|
" Bio::DB::HTS is not installed\n"; |
1172
|
|
|
|
|
|
|
} |
1173
|
|
|
|
|
|
|
} |
1174
|
|
|
|
|
|
|
|
1175
|
|
|
|
|
|
|
# something unrecognized? |
1176
|
|
|
|
|
|
|
else { |
1177
|
0
|
|
|
|
|
0
|
$error .= " ERROR! Cannot identify database type for $database!\n"; |
1178
|
|
|
|
|
|
|
} |
1179
|
|
|
|
|
|
|
} |
1180
|
|
|
|
|
|
|
|
1181
|
|
|
|
|
|
|
# file does not exist or can be read |
1182
|
|
|
|
|
|
|
else { |
1183
|
|
|
|
|
|
|
# file does not exist |
1184
|
0
|
|
|
|
|
0
|
$error = " ERROR: file '$database' does not exist!\n"; |
1185
|
|
|
|
|
|
|
} |
1186
|
|
|
|
|
|
|
} |
1187
|
|
|
|
|
|
|
|
1188
|
|
|
|
|
|
|
# a directory, presumably of bigwig files |
1189
|
|
|
|
|
|
|
elsif (-d $database) { |
1190
|
|
|
|
|
|
|
# try opening using the BigWigSet adaptor |
1191
|
0
|
0
|
|
|
|
0
|
$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK; |
1192
|
0
|
0
|
|
|
|
0
|
if ($BIGWIG_OK) { |
1193
|
0
|
|
|
|
|
0
|
$db = open_bigwigset_db($database); |
1194
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1195
|
0
|
|
|
|
|
0
|
$error = " ERROR: could not open local BigWigSet " . |
1196
|
|
|
|
|
|
|
"directory '$database'!\n"; |
1197
|
0
|
|
|
|
|
0
|
$error .= " Does directory contain bigWig .bw files?\n"; |
1198
|
|
|
|
|
|
|
} |
1199
|
|
|
|
|
|
|
} |
1200
|
|
|
|
|
|
|
else { |
1201
|
0
|
|
|
|
|
0
|
$error = " Presumed BigWigSet database cannot be loaded because\n" . |
1202
|
|
|
|
|
|
|
" Bio::DB::Big or Bio::DB::BigWigSet is not installed\n"; |
1203
|
|
|
|
|
|
|
} |
1204
|
|
|
|
|
|
|
|
1205
|
|
|
|
|
|
|
# try opening with the Fasta adaptor |
1206
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1207
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
1208
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
1209
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
1210
|
0
|
|
|
|
|
0
|
$db = open_fasta_db($database); |
1211
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1212
|
0
|
|
|
|
|
0
|
$error .= " ERROR: could not open fasta directory '$database'!\n"; |
1213
|
0
|
|
|
|
|
0
|
$error .= " Does directory contain fasta files? If it contains a" . |
1214
|
|
|
|
|
|
|
" directory.index file,\n try deleting it and try again.\n"; |
1215
|
|
|
|
|
|
|
} |
1216
|
|
|
|
|
|
|
} |
1217
|
|
|
|
|
|
|
else { |
1218
|
0
|
|
|
|
|
0
|
$error .= " Module Bio::DB::Fasta is required to open presumed fasta " . |
1219
|
|
|
|
|
|
|
"directory\n"; |
1220
|
|
|
|
|
|
|
} |
1221
|
|
|
|
|
|
|
} |
1222
|
|
|
|
|
|
|
} |
1223
|
|
|
|
|
|
|
|
1224
|
|
|
|
|
|
|
# unrecognized real file |
1225
|
|
|
|
|
|
|
elsif (-e $database) { |
1226
|
|
|
|
|
|
|
# file exists, I just don't recognize the extension |
1227
|
0
|
|
|
|
|
0
|
$error = " File '$database' type and/or extension is not recognized\n"; |
1228
|
|
|
|
|
|
|
} |
1229
|
|
|
|
|
|
|
|
1230
|
|
|
|
|
|
|
|
1231
|
|
|
|
|
|
|
# otherwise assume the name of a SeqFeature::Store database in the configuration |
1232
|
|
|
|
|
|
|
else { |
1233
|
|
|
|
|
|
|
# attempt to open using information from the configuration |
1234
|
|
|
|
|
|
|
# using default connection information as necessary |
1235
|
0
|
0
|
|
|
|
0
|
$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta') |
1236
|
|
|
|
|
|
|
unless $SEQFASTA_OK; |
1237
|
0
|
0
|
|
|
|
0
|
if ($SEQFASTA_OK) { |
1238
|
0
|
|
|
|
|
0
|
$db = open_store_db($database); |
1239
|
0
|
0
|
|
|
|
0
|
unless ($db) { |
1240
|
0
|
|
|
|
|
0
|
$error .= " unable to open relational database named '$database'\n"; |
1241
|
|
|
|
|
|
|
} |
1242
|
|
|
|
|
|
|
} |
1243
|
|
|
|
|
|
|
else { |
1244
|
0
|
|
|
|
|
0
|
$error .= " Module Bio::DB::SeqFeature::Store is required to connect " . |
1245
|
|
|
|
|
|
|
"to databases\n"; |
1246
|
|
|
|
|
|
|
} |
1247
|
|
|
|
|
|
|
} |
1248
|
|
|
|
|
|
|
|
1249
|
|
|
|
|
|
|
# conditional return |
1250
|
1
|
50
|
|
|
|
3
|
if ($db) { |
1251
|
|
|
|
|
|
|
# cache the opened connection for use later |
1252
|
1
|
50
|
|
|
|
5
|
$OPENED_DB{$database} = $db unless $no_cache; |
1253
|
|
|
|
|
|
|
|
1254
|
|
|
|
|
|
|
# return as appropriate either both object and name or just object |
1255
|
1
|
|
|
|
|
3
|
return $db; |
1256
|
|
|
|
|
|
|
} |
1257
|
|
|
|
|
|
|
else { |
1258
|
0
|
|
|
|
|
0
|
print STDERR $error; |
1259
|
0
|
|
|
|
|
0
|
return; |
1260
|
|
|
|
|
|
|
} |
1261
|
|
|
|
|
|
|
} |
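
The fall-through logic above means C<open_db_connection> accepts many kinds of database arguments: a configured SeqFeature::Store database name, an indexed file path, or a directory. A minimal usage sketch, assuming the module is installed (the file and database names below are hypothetical placeholders):

```perl
#!/usr/bin/env perl
# Sketch: opening different database types with open_db_connection().
use strict;
use warnings;
use Bio::ToolBox::db_helper qw(open_db_connection);

# a SeqFeature::Store database named in the biotoolbox configuration
my $db = open_db_connection('annotation_db')
	or die "could not open database\n";

# an indexed file works too: bam, bigWig, bigBed, fasta, or a BigWigSet directory
my $bam = open_db_connection('alignments.bam');

# a true second argument skips the connection cache, which is useful
# when forking child processes that each need their own handle
my $fresh = open_db_connection('alignments.bam', 1);
```

On failure the accumulated C<$error> text is printed to STDERR and an undefined value is returned, so callers should always check the return value.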


### Retrieve a database name from a db object
sub get_db_name {
	my $db = shift;
	return unless $db;
	my $db_ref = ref($db);
	return $db unless $db_ref; # presumption that non-object is just a database name
	my $db_name;
	if ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {
		# a SeqFeature database, using any DBI adapter
		$db_name = $db->{'dbh'}->{'name'};
			# dig through the object internals to identify the original
			# name of the database
			# this should be relatively well documented through DBI
			# but could break in the future since it's not official API
	}
	elsif ($db_ref eq 'Bio::DB::Sam') {
		# a Samtools Bam database
		# old versions <=1.39 don't have the bam_path method, so it's easier
		# to explicitly grab it from the object internals
		$db_name = $db->{bam_path};
	}
	elsif ($db_ref eq 'Bio::DB::HTS') {
		# a HTSlib Bam database
		$db_name = $db->hts_path;
	}
	# determining the database name from other sources is
	# either not possible or not easy, so won't bother unless
	# there is a really really good need
	return $db_name;
}
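
C<get_db_name> is the practical inverse of C<open_db_connection>: it recovers a re-openable name or path from an opened object. A short sketch, assuming an installed module (file names are placeholders):

```perl
use strict;
use warnings;
use Bio::ToolBox::db_helper qw(open_db_connection get_db_name);

# 'alignments.bam' is a hypothetical placeholder file
my $db = open_db_connection('alignments.bam') or die "cannot open\n";

# recover the original name/path from the opened object;
# handy for recording the data source in output file metadata
my $name = get_db_name($db);
print "data came from $name\n";

# plain strings pass straight through unchanged
print get_db_name('annotation_db'), "\n";
```

Note the caveat in the code above: for SeqFeature::Store connections the name is dug out of the DBI handle internals, which is not official API and could change.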


### Retrieve a list of the microarray data sets from the db
sub get_dataset_list {

	my $database = shift;
	my $use_all_features = shift;

	# Open a db connection
	my $db = open_db_connection($database);
	unless ($db) {
		carp 'no database connected!';
		return;
	}
	my $db_ref = ref $db;

	# process the database types, according to the type of database
	my @types;

	# a SeqFeature database
	if ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {

		# collect the database types,
		# sort first by source, then by method
		@types = (
			map $_->[2],
			sort { ($a->[0] cmp $b->[0]) or ($a->[1] cmp $b->[1]) }
			map [$_->source, $_->method, $_],
			$db->types
		);
	}

	# a BigWigSet database
	elsif ($db_ref =~ /BigWigSet/i) {

		# get the metadata
		my $metadata = $db->metadata;

		# collect
		foreach my $file (keys %{ $metadata }) {
			# get the type for each file
			my ($primary, $type, $name);

			# get the appropriate tags
			foreach my $attribute (keys %{ $metadata->{$file} } ) {
				if ($attribute =~ m/^primary_tag|method$/i) {
					$primary = $metadata->{$file}{$attribute};
				}
				elsif ($attribute =~ m/^type/i) {
					$type = $metadata->{$file}{$attribute};
				}
				elsif ($attribute =~ m/^display_name/i) {
					$name = $metadata->{$file}{$attribute};
				}
			}

			# store the type
			push @types, $type || $primary || $name;
		}

		# sort the types in alphabetical order
		@types = sort {$a cmp $b} @types;
	}

	# some other database
	else {
		carp " no dataset lists for database type $db_ref!\n";
	}

	# finish
	return @types;
}
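
The subroutine above enumerates whatever score datasets a database advertises. A minimal sketch of calling it directly (the database name is a placeholder):

```perl
use strict;
use warnings;
use Bio::ToolBox::db_helper qw(get_dataset_list);

# 'annotation_db' stands in for a SeqFeature::Store database
# or a BigWigSet directory
my @types = get_dataset_list('annotation_db');
die "no feature types found\n" unless @types;

# SeqFeature::Store types stringify as primary_tag:source
foreach my $type (@types) {
	print "$type\n";
}
```

For SeqFeature::Store databases the list is ordered by source then method, as the Schwartzian transform in the code shows; for BigWigSets it is a plain alphabetical sort of the metadata types.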



### Process and verify a dataset
sub verify_or_request_feature_types {

	# Retrieve passed values
	my %args = @_; # the passed argument values as a hash reference

	# Check for single option
	$args{'single'} ||= 0;

	# Collect the datasets
	my @datasets;
	$args{'feature'} ||= undef;
	if (ref $args{'feature'} eq 'ARRAY') {
		# an anonymous array of datasets
		@datasets = @{ $args{'feature'} };
	}
	elsif (defined $args{'feature'}) {
		push @datasets, $args{'feature'};
	}
	# else it is null and @datasets remains empty, prompting user input

	# feature limits
	# this is a regex to limit the feature types presented to the user
	# otherwise everything in the database is presented to the user
	my $limit = $args{'limit'} ||= undef;

	# Open database object and collect features
	$args{'db'} ||= undef;
	my $db = $args{'db'} ? open_db_connection( $args{'db'} ) : undef;

	# Initialize main output arrays
	my @good_datasets;
	my @bad_datasets;
	my %db_features; # hash for features found in the database, use when needed

	# Check provided datasets
	if (@datasets) {

		# check for multiple comma-delimited datasets
		my @list_to_check;
		foreach my $item (@datasets) {
			if ($item =~ /,/) {
				# this one has a comma, therefore it has more than one dataset
				push @list_to_check, split(/,/, $item);
			}
			else {
				# a singleton
				push @list_to_check, $item;
			}
		}

		# now verify the datasets
		foreach my $dataset (@list_to_check) {

			# check for a remote file
			if ($dataset =~ /^(?: http | ftp) .+ \. (?: bam | bw | bb) $/xi) {
				# a remote file
				# assume it is good, no verification here though
				# it will either work or won't work
				push @good_datasets, $dataset;
			}

			# a local file
			elsif ($dataset =~ /\.(?:bam|bw|bigwig|bb|bigbed|useq)$/i) {
				# presume we have a local indexed data file

				# user may have requested two or more files to be merged
				# these should be combined with an ampersand
				# check each one
				my @files;
				foreach my $file (split /\&/, $dataset) {
					if (-e $file) {
						# file exists
						push @files, "file:$file";
					}
					else {
						# file doesn't exist! can't use this set of files
						@files = ();
						last;
					}
				}
				if (@files) {
					push @good_datasets, join("&", @files);
				}
				else {
					push @bad_datasets, $dataset;
				}
			}

			# a feature type in a database
			else {
				# must be a database feature type

				# check for database features hash
				unless (%db_features) {
					if ($db) {
						# collect the database features as needed
						# I want both method:source and method in the hash
						foreach my $type ( get_dataset_list($db) ) {
							my ($method, $source) = split /:/, $type;
							if ($source) {
								$db_features{$type} = 1;
								$db_features{$method} = 1;
							}
							else {
								$db_features{$type} = 1;
							}
						}

						# verify
						unless (%db_features) {
							carp " provided database has no feature types " .
								"to verify dataset(s) against!\n";
						}
					}
					else {
						# we need a database
						carp " unable to verify dataset without database";
					}
				}

				# validate the given dataset
				# user may have requested two or more features to be merged
				# these should be combined with an ampersand
				# check each one
				my $check = 0;
				foreach my $d (split /&/, $dataset) {
					# validate this dataset
					if (exists $db_features{$d}) {
						$check++;
					}
					else {
						# at least one of these is not good, fail check
						$check = 0;
						last;
					}
				}

				# report the verification
				if ($check) {
					push @good_datasets, $dataset;
				}
				else {
					push @bad_datasets, $dataset;
				}
			}
		}
	}

	# User must select datasets
	else {
		# dataset not specified
		# present the dataset list to the user and get an answer

		# get the dataset list
		if ($db) {
			# collect the database features as needed
			my $i = 1;
			foreach my $type ( get_dataset_list($db) ) {
				if ($limit) {
					my ($p, $s) = split /:/, $type; # split into primary_tag and source
					# only keep those types that match our limiter
					next unless $p =~ /$limit/i;
				}
				$db_features{$i} = $type;
				$i++;
			}

			# verify
			unless (%db_features) {
				carp " provided database has no feature types " .
					"to collect!\n";
				return;
			}
		}
		else {
			# we need a database
			carp " no database provided from which to collect features!\n";
			return;
		}

		# present the list
		print "\n These are the available data sets in the database:\n";
		foreach (sort {$a <=> $b} keys %db_features) {
			# print out the list of available data sets
			print "  $_\t$db_features{$_}\n";
		}

		# prompt the user
		$args{'prompt'} ||= undef;
		if ($args{'prompt'}) {
			# provided custom prompt
			print $args{'prompt'};
		}
		else {
			# generic prompt
			if ($args{'single'}) {
				print " Enter one number for the data set or feature   ";
			}
			else {
				print " Enter the number(s) or range of the data set(s) or" .
					" feature(s)   ";
			}
		}

		# get answer from the user
		my $answer = <STDIN>;
		chomp $answer;
		my @answer_list = parse_list($answer);

		# take the first one if requested
		if ($args{'single'}) {
			unless (scalar @answer_list == 1) {
				splice(@answer_list, 1);
			}
		}

		# verify the answer list
		foreach my $answer (@answer_list) {

			# check for merged datasets
			if ($answer =~ /&/) {
				# a merged dataset
				my @list = split /&/, $answer;
				my $check = 1;

				# check all are good
				foreach (@list) {
					unless (exists $db_features{$_}) {
						$check = 0;
					}
				}

				# if all are good
				if ($check) {
					push @good_datasets,
						join( "&", map { $db_features{$_} } @list);
				}
				else {
					push @bad_datasets, $answer;
				}
			}

			else {
				# a single dataset
				# check if it is good

				if (exists $db_features{$answer}) {
					push @good_datasets, $db_features{$answer};
				}
				else {
					push @bad_datasets, $answer;
				}
			}
		}
	}

	# Print bad results
	if (@bad_datasets) {
		print " The following datasets could not be verified:\n";
		foreach (@bad_datasets) {
			print "    $_\n";
		}
	}

	# Return good results
	if ($args{'single'}) {
		return $good_datasets[0];
	}
	else {
		return @good_datasets;
	}
}
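
This subroutine is typically called by the tool scripts rather than by end users; either mode can be sketched as below, assuming an installed module (database and dataset names are placeholders):

```perl
use strict;
use warnings;
use Bio::ToolBox::db_helper qw(verify_or_request_feature_types);

# verify a user-supplied dataset against the database; if no
# 'feature' value is given, the sub prints the numbered list of
# database types and prompts interactively on STDIN instead
my @datasets = verify_or_request_feature_types(
	db      => 'annotation_db',        # placeholder database name
	feature => 'data_scores',          # placeholder type to verify
	prompt  => " Enter the dataset to score  ",
	single  => 0,                      # allow more than one selection
);
die "dataset could not be verified\n" unless @datasets;
```

Remote bam/bigWig/bigBed URLs are passed through unverified, local indexed files are checked for existence, and anything else is matched against the database's C<method> and C<method:source> types; ampersand-joined entries are treated as datasets to be merged.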



### Process and verify a dataset for read depth normalization
sub check_dataset_for_rpm_support {

	# get passed dataset and databases
	my $dataset = shift;
	my $cpu = shift;

	# Calculate the total number of reads
	# this uses the global variable %total_read_number as a cache
	my $rpm_read_sum;

	if (exists $total_read_number{$dataset}) {
		# this dataset has already been summed
		# no need to do it again
		$rpm_read_sum = $total_read_number{$dataset};
	}

	elsif ($dataset =~ /\.bam$/) {
		# a bam file dataset

		_load_bam_helper_module() unless $BAM_OK;
		if ($BAM_OK) {
			# Bio::ToolBox::db_helper::bam was loaded ok
			# sum the number of reads in the dataset
			$rpm_read_sum = sum_total_bam_alignments($dataset, 0, 0, $cpu);
		}
		else {
			carp " Bam support is not available! " .
				"Is Bio::DB::Sam or Bio::DB::HTS installed?\n";
			return;
		}
	}

	elsif ($dataset =~ /\.bb$/) {
		# a bigbed file dataset

		$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed') unless
			$BIGBED_OK;
		if ($BIGBED_OK) {
			# Bio::ToolBox::db_helper::bigbed was loaded ok
			# sum the number of features in the dataset
			$rpm_read_sum = sum_total_bigbed_features($dataset);
		}
		else {
			carp " BigBed support is not available! " .
				"Is Bio::DB::BigBed installed?\n";
			return;
		}
	}

	else {
		# some other non-supported dataset
		return;
	}

	# return the sum value if we've made it this far
	$total_read_number{$dataset} = $rpm_read_sum;
	return $rpm_read_sum;
}
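
A caller would use the returned total to build a reads-per-million scale factor. A sketch, assuming an installed module with Bam support (the file name is a placeholder):

```perl
use strict;
use warnings;
use Bio::ToolBox::db_helper qw(check_dataset_for_rpm_support);

# 'alignments.bam' is a placeholder; only .bam and .bb datasets
# are supported, anything else returns undef
my $total = check_dataset_for_rpm_support('alignments.bam', 4); # 4 forks
if ($total) {
	printf "%d total alignments; RPM scale factor %.6f\n",
		$total, 1_000_000 / $total;
}
else {
	warn "dataset does not support RPM normalization\n";
}
```

Because the total is cached in C<%total_read_number>, repeated calls for the same dataset within one process are free after the first summation.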



### Generate a new list of features
sub get_new_feature_list {

	# Retrieve passed values
	my %args = @_; # the passed argument values

	# check data object
	my $data = $args{data} || undef;
	unless ($data) {
		confess "must pass a 'data' key and Bio::ToolBox::Data object!";
		return;
	}
	unless (ref($data) eq 'Bio::ToolBox::Data') {
		confess 'must pass a Bio::ToolBox::Data object!';
		return;
	}

	# Open a db connection
	$args{'db'} ||= undef;
	my $db = open_db_connection($args{'db'});
	unless ($db) {
		carp 'no database connected!';
		return;
	}

	# Verify a SeqFeature::Store database
	my $db_ref = ref $db;
	unless ($db_ref =~ /^Bio::DB::SeqFeature::Store/) {
		carp "Database type $db_ref doesn't support generating feature lists!\n";
		return;
	}

	# Features to search for
	my $searchFeature = $args{'feature'} || $args{'features'} || undef;
	unless ($searchFeature) {
		carp "no search feature types passed!";
		return;
	}
	my @classes = split(',', $searchFeature); # it may or may not be a list

	# chromosomes to skip
	my $chr_exclude = $args{'chrskip'} || undef;

	# Add table columns
	my $pid_i  = $data->add_column('Primary_ID');
	my $name_i = $data->add_column('Name');
	my $type_i = $data->add_column('Type');

	# Set up the database iterator
	print " Searching for $searchFeature features\n";
	my $iterator = $db->get_seq_stream(
		-types => scalar @classes ? \@classes : $searchFeature,
	);
	unless ($iterator) {
		# there should be some features found in the database
		carp "could not get feature iterator for database";
		return;
	}

	# Walk through the collected features
	my $total_count = 0; # total found features
	while (my $feature = $iterator->next_seq) {
		$total_count++;

		# skip genes from excluded chromosomes
		next if (defined $chr_exclude and $feature->seq_id =~ $chr_exclude);

		# Record the feature information
		# in the B::DB::SF::S database, the primary_ID is a number unique to
		# the specific database, and is not portable between databases
		$data->add_row( [
			$feature->primary_id,
			$feature->display_name,
			$feature->type,
		] );
	}

	# print result of search
	printf " Found %s features in the database\n", format_with_commas($total_count);
	printf " Kept %s features.\n", format_with_commas($data->last_row);

	# return the new data structure
	return 1;
}
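
In most cases C<Bio::ToolBox::Data-E<gt>new> calls this for you, per the DESCRIPTION above; a direct sketch looks like this (names are placeholders):

```perl
use strict;
use warnings;
use Bio::ToolBox::Data;
use Bio::ToolBox::db_helper qw(get_new_feature_list);

# an empty Data table for the sub to fill with three columns:
# Primary_ID, Name, and Type
my $Data = Bio::ToolBox::Data->new();
my $success = get_new_feature_list(
	data     => $Data,
	db       => 'annotation_db',   # a SeqFeature::Store database
	features => 'gene',            # or a comma list, e.g. 'gene,ncRNA'
	chrskip  => 'chrM|contig',     # optional regex of chromosomes to skip
);
die "no features collected\n" unless $success;
```

Keep in mind the caveat noted in the code: the stored Primary_ID values are unique only to the database they came from and are not portable between databases.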



### Generate a new list of genomic windows
sub get_new_genome_list {

	# Collect the passed arguments
	my %args = @_;

	# check data object
	my $data = $args{data} || undef;
	unless ($data) {
		confess "must pass a 'data' key and Bio::ToolBox::Data object!";
		return;
	}
	unless (ref($data) eq 'Bio::ToolBox::Data') {
		confess 'must pass a Bio::ToolBox::Data object!';
		return;
	}

	# Open a db connection
	$args{'db'} ||= undef;
	my $db = open_db_connection($args{'db'});
	unless ($db) {
		carp 'no database connected!';
		return;
	}

	# Determine win and step sizes
	$args{'win'} ||= 500; # hard coded default size
	$args{'step'} ||= $args{'win'};

	# Prepare data structures
	my $chr_i   = $data->add_column('Chromosome');
	my $start_i = $data->add_column('Start');
	my $stop_i  = $data->add_column('Stop');
	$data->metadata($start_i, 'win', $args{'win'});
	$data->metadata($start_i, 'step', $args{'step'});

	# Collect the chromosomes
	my $chr_exclude = $args{'skip'} || $args{'chrskip'} || undef;
	my @chromosomes = get_chromosome_list($db, $chr_exclude);
	unless (@chromosomes) {
		carp " no sequence IDs were found in the database!\n";
		return;
	}

	# Collect the genomic windows
	print " Generating $args{win} bp windows in $args{step} bp increments\n";
	foreach (@chromosomes) {

		# name and length as sub-array in each element
		my ($chr, $length) = @{$_};

		for (my $start = 1; $start <= $length; $start += $args{step}) {
			my $end = $start + $args{win} - 1;
			if ($end > $length) {
				# fix end to the length of the chromosome
				$end = $length;
			}
			$data->add_row( [ $chr, $start, $end ] );
		}
	}
	print " Kept " . $data->{'last_row'} . " windows.\n";

	# Return the data structure
	return 1;
}
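
Like the feature-list generator, this is normally reached through C<Bio::ToolBox::Data-E<gt>new>; a direct sketch (names are placeholders):

```perl
use strict;
use warnings;
use Bio::ToolBox::Data;
use Bio::ToolBox::db_helper qw(get_new_genome_list);

# an empty Data table for the sub to fill with
# Chromosome, Start, and Stop columns
my $Data = Bio::ToolBox::Data->new();
my $success = get_new_genome_list(
	data => $Data,
	db   => 'annotation_db',   # any database with chromosome information
	win  => 1000,              # window size in bp (default 500)
	step => 500,               # step size; defaults to the window size
);
die "no windows generated\n" unless $success;
```

With C<step> smaller than C<win> the windows overlap; the last window on each chromosome is truncated to the chromosome length, as the loop above shows.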


sub get_db_feature {
	my %args = @_;

	# Open a db connection
	$args{db} ||= undef;
	unless ($args{db}) {
		croak 'no database provided for getting a feature!';
	}
	my $db = open_db_connection($args{db});
	my $db_ref = ref $db;

	# get the name of the feature
	my $name = $args{'name'} || undef;

	# check for values and internal nulls
	$args{'id'} = exists $args{'id'} ? $args{'id'} : undef;
	$args{'type'} ||= undef;
	undef $name         if $name and $name eq '.';
	undef $args{'id'}   if $args{'id'} and $args{'id'} eq '.';
	undef $args{'type'} if $args{'type'} and $args{'type'} eq '.';

	# quick method for feature retrieval
	if (defined $args{'id'} and $db->can('fetch')) {
		# we have a unique primary_id to identify the feature
		# usually this pulls out the feature directly
		my $feature = $db->fetch($args{'id'}) || undef;

		# check that the feature we pulled out is what we want
		my $check = $feature ? 1 : 0;
		if ($check) {
			$check = 0 if (defined $name and $feature->display_name ne $name);
			$check = 0 if (defined $args{'type'} and $feature->type ne $args{'type'});
		}

		# return if things match up as best we can tell
		if ($check) {
			return $feature;
		}
		else {
			# the primary_ids are out of date
			unless ($primary_id_warning) {
				warn "CAUTION: Some primary IDs in Input file appear to be out of date\n";
				$primary_id_warning++;
			}
		}
	}

	# otherwise use name and type
	return unless $name; # otherwise db will return all features! Not what we want!
	my @features = $db->features(
		-name    => $name, # we assume this name will be unique
		-aliases => 0,
		-type    => $args{'type'},
	);

	# if none are found
	unless (@features) {
		# try again with aliases enabled
		@features = $db->features(
			-name    => $name,
			-aliases => 1,
			-type    => $args{'type'},
		);
	}
	unless (@features and $name =~ /[;,\|]/) {
		# I used to append aliases to the end of the name in old versions of biotoolbox
		# crazy, I know, but just in case this happened, let's try breaking them apart
		my $name2 = (split(/\s*[;,\|]\s*/, $name))[0];
		# multiple names present using common delimiters ;,|
		# take the first name only, assume others are aliases that we don't need
		@features = $db->features(
			-name    => $name2,
			-aliases => 1,
			-type    => $args{'type'},
		);
	}

	# check the number of features returned
	if (scalar @features > 1) {
		# there should only be one feature found
		# if more are returned, there's redundant or duplicated data in the db

		# first check whether addition of aliases may improve accuracy
		if ($args{'name'} =~ /;/) {
			# we have presumed aliases in the name value
			my $check = $args{'name'};
			$check =~ s/\s*;\s*/;/g; # strip spaces just in case

			# check each feature and try to match name and aliases
			my @candidates;
			foreach my $f (@features) {
				my $f_name = join(';', $f->display_name, ($f->get_tag_values('Alias')));
				if ($check eq $f_name) {
					# found one
					push @candidates, $f;
				}
			}

			if (scalar @candidates == 1) {
				# we found it!
				return shift @candidates;
			}
			elsif (scalar @candidates > 1) {
				# hopefully we've improved a little bit?
				@features = @candidates;
			}
		}

		# warn the user, this should be fixed
		printf " Found %s %s features named '$name' in the database! Using first one.\n",
			scalar(@features), $args{'type'};
	}
	elsif (!@features) {
|
|
|
|
0
|
printf " Found no %s features named '$name' in the database!\n", $args{'type'}; |
1974
|
0
|
|
|
|
|
0
|
return; |
1975
|
|
|
|
|
|
|
} |
1976
|
|
|
|
|
|
|
|
1977
|
|
|
|
|
|
|
# done |
1978
|
0
|
|
|
|
|
0
|
return shift @features; |
1979
|
|
|
|
|
|
|
} |
1980
|
|
|
|
|
|
|
|
1981
|
|
|
|
|
|
|
|
1982
|
|
|
|
|
|
|
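The name-and-type lookup above can be exercised directly against any Bio::DB::SeqFeature::Store database. A minimal sketch; the database file and feature name here are hypothetical:

```perl
use Bio::ToolBox::db_helper qw(open_db_connection);

# hypothetical SQLite SeqFeature::Store database loaded from GFF3
my $db   = open_db_connection('annotation.sqlite');
my @hits = $db->features(
	-name    => 'YAL030W',  # display name to search for
	-aliases => 1,          # also match values of the Alias attribute
	-type    => 'gene',     # restrict to this primary_tag:source type
);
printf "found %d feature(s)\n", scalar @hits;
```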
### Get a dataset score for a single region
sub get_segment_score {
	# parameters passed as an array
	# we will be passing this array on as a reference to the appropriate
	# imported helper subroutine
	# chromosome, start, stop, strand, strandedness, method, return type, db, dataset
	confess "incorrect number of parameters passed!" unless scalar @_ == 9;

	# check the database
	$_[DB] = open_db_connection($_[DB]) if ($_[DB] and not ref($_[DB]));

	# check for combined datasets
	if ($_[DATA] =~ /&/) {
		push @_, (split '&', pop @_);
	}

	# determine method
	# yes, we're only checking the first dataset, but they should all
	# be the same type
	my $db_method = $DB_METHODS{$_[METH]}{$_[RETT]}{$_[DB]}{$_[DATA]} ||
		_lookup_db_method(\@_);

	# return type values
	# 0 = calculate score
	# 1 = score array
	# 2 = hash position scores
	# immediately return calculated score if appropriate
	if ($_[RETT] > 0) {
		# immediately return either indexed hash or array of scores
		return &{$db_method}(\@_);
	}
	else {
		# calculate a score
		my $scores = &{$db_method}(\@_);
		# this might be an array reference of scores that need to be combined
		# or it could be a single scalar score which just needs to be returned
		if (ref($scores)) {
			return calculate_score($_[METH], $scores);
		}
		else {
			if (not defined $scores) {
				return 0 if $_[METH] =~ /count|sum/;
				return '.';
			}
			return $scores;
		}
	}
}

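A hedged usage sketch of the nine-positional-parameter call above, assuming a bigWig file dataset (the path is hypothetical); with return type 0 a single summary value is returned:

```perl
use Bio::ToolBox::db_helper qw(get_segment_score);

my $mean = get_segment_score(
	'chr1',      # chromosome
	10_001,      # start
	20_000,      # stop
	0,           # strand
	0,           # strandedness: collect from all strands
	'mean',      # method handled by calculate_score()
	0,           # return type: 0 = single calculated score
	undef,       # database, not needed for a file dataset
	'file:/path/to/scores.bw',  # hypothetical dataset file
);
```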
sub calculate_score {
	my ($method, $scores) = @_;
	$scores ||= []; # just in case

	# calculate a score based on the method
	if ($method eq 'mean') {
		return sum0(@$scores)/(scalar(@$scores) || 1);
	}
	elsif ($method eq 'sum') {
		return sum0(@$scores);
	}
	elsif ($method eq 'median') {
		return '.' unless scalar(@$scores);
		return median(@$scores);
	}
	elsif ($method eq 'min') {
		return '.' unless scalar(@$scores);
		return min(@$scores);
	}
	elsif ($method eq 'max') {
		return '.' unless scalar(@$scores);
		return max(@$scores);
	}
	elsif ($method eq 'count' or $method eq 'pcount') {
		return scalar(@$scores);
	}
	elsif ($method eq 'ncount') {
		# Convert names into unique counts
		my %name2count;
		foreach my $s (@$scores) {
			if (ref($s) eq 'ARRAY') {
				# this is likely from a ncount indexed hash
				foreach (@$s) {
					$name2count{$_} += 1;
				}
			}
			else {
				$name2count{$s} += 1;
			}
		}
		return scalar(keys %name2count);
	}
	elsif ($method eq 'range') {
		# the range value is 'min-max'
		return '.' unless scalar(@$scores);
		return range(@$scores);
	}
	elsif ($method eq 'stddev') {
		# we are using the standard deviation of the population,
		# since these are the only scores we are considering
		return '.' unless scalar(@$scores);
		return stddevp(@$scores);
	}
	elsif ($method =~ /rpk?m/) {
		confess " The rpm methods have been deprecated due to complexity and " .
			"the variable way of calculating the value. Collect counts and " .
			"calculate your preferred way.\n";
	}
	else {
		# somehow bad method snuck past our checks
		confess " unrecognized method '$method'!";
	}
}

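The 'ncount' branch above reduces to a simple hash tally of unique names, which can be illustrated standalone:

```perl
# names of overlapping alignments; duplicates represent the same read
my @scores = qw(read1 read2 read1 read3 read2);

# same tally as the ncount branch: one hash key per unique name
my %name2count;
$name2count{$_} += 1 for @scores;
my $ncount = scalar keys %name2count;
print "$ncount\n";  # 3 unique names
```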
sub get_chromosome_list {

	# options
	my $database = shift;
	my $chr_exclude = shift || undef;

	# Open a db connection
	my $db = open_db_connection($database);
	unless ($db) {
		carp 'no database connected!';
		return;
	}

	# Check for BigWigSet database
	# these need to be handled a little differently
	if (ref($db) =~ /BigWigSet/) {
		# BigWigSet databases are the only databases that don't
		# support the seq_ids method
		# instead we have to look at one of the bigwigs in the set
		my $bw_file = ($db->bigwigs)[0];
		$db = open_db_connection($bw_file);
	}

	# Collect chromosome lengths
	# we want to maintain the original order of the chromosomes so we
	# won't be putting it into a hash
	# instead an array of arrays
	my @chrom_lengths;

	# Database specific approaches to collecting chromosomes
	# I used to have one generic approach, but idiosyncrasies and potential
	# bugs make me use different approaches for better consistency
	my $type = ref($db);

	# SeqFeature::Store
	if ($type =~ /^Bio::DB::SeqFeature::Store/) {
		for my $chr ($db->seq_ids) {

			# check for excluded chromosomes
			next if (defined $chr_exclude and $chr =~ /$chr_exclude/i);

			# get chromosome size
			my ($seqf) = $db->get_features_by_name($chr);
			my $length = $seqf ? $seqf->length : 0;

			# store
			push @chrom_lengths, [ $chr, $length ];
		}
	}

	# libBigWig Bigfile, including both bigWig and bigBed
	elsif ($type eq 'Bio::DB::Big::File') {
		# this doesn't follow the typical BioPerl convention
		# it's a hash, so randomly sorted!!!!! Never will be in same order as file!!!!
		my $chroms = $db->chroms();
		# so we'll sort the chromosomes by decreasing length
		# this is common for a lot of genomes anyway, except for yeast
		foreach (
			sort { $b->[1] <=> $a->[1] }
			map { [$_->{name}, $_->{length}] }
			values %$chroms
		) {
			# check for excluded chromosomes
			next if (defined $chr_exclude and $_->[0] =~ /$chr_exclude/i);
			push @chrom_lengths, $_;
		}
	}

	# UCSC kent Bigfile
	elsif ($type eq 'Bio::DB::BigWig' or $type eq 'Bio::DB::BigBed') {
		foreach my $chr ($db->seq_ids) {

			# check for excluded chromosomes
			next if (defined $chr_exclude and $chr =~ $chr_exclude);

			# get chromosome size
			my $length = $db->length($chr);

			# store
			push @chrom_lengths, [ $chr, $length ];
		}
	}

	# Samtools Bam
	elsif ($type eq 'Bio::DB::Sam' or $type eq 'Bio::DB::HTS') {
		for my $tid (0 .. $db->n_targets - 1) {
			# each chromosome is internally represented in the bam file as
			# a numeric target identifier
			# we can easily convert this to an actual sequence name
			# we will force the conversion to go one chromosome at a time

			# sequence info
			my $chr = $db->target_name($tid);
			my $length = $db->target_len($tid);

			# check for excluded chromosomes
			next if (defined $chr_exclude and $chr =~ $chr_exclude);

			# store
			push @chrom_lengths, [ $chr, $length ];
		}
	}

	# Fasta through modern HTS fai index
	elsif ($type eq 'Bio::DB::HTS::Faidx') {
		for my $chr ($db->get_all_sequence_ids) {
			# check for excluded
			next if (defined $chr_exclude and $chr =~ $chr_exclude);

			# get length and store it
			my $length = $db->length($chr);
			push @chrom_lengths, [$chr, $length];
		}
	}

	# Fasta through old BioPerl adapter
	elsif ($type eq 'Bio::DB::Fasta') {
		for my $chr ($db->get_all_ids) {

			# check for excluded chromosomes
			next if (defined $chr_exclude and $chr =~ $chr_exclude);

			# get chromosome size
			my $seq = $db->get_Seq_by_id($chr);
			my $length = $seq ? $seq->length : 0;

			# store
			push @chrom_lengths, [ $chr, $length ];
		}
	}

	# other Bioperl adapter?????
	elsif ($db->can('seq_ids')) {
		foreach my $chr ($db->seq_ids) {

			# check for excluded chromosomes
			next if (defined $chr_exclude and $chr =~ $chr_exclude);

			# generate a segment representing the chromosome
			# due to fuzzy name matching, we may get more than one back
			my @segments = $db->segment($chr);
			# need to find the right one
			my $segment;
			while (@segments) {
				$segment = shift @segments;
				last if $segment->seq_id eq $chr;
			}

			# check segment
			unless ($segment) {
				carp " No genome segment for seq_id $chr!!?? skipping\n";
				next;
			}

			# get the chromosome length
			my $length = $segment->length;

			# store
			push @chrom_lengths, [ $chr, $length ];
		}
	}

	else {
		carp " Unable to identify chromosomes in database type '$type'. Try using a different database file or adapter\n";
		return;
	}

	# Return
	unless (@chrom_lengths) {
		carp " no chromosome sequences identified in database!\n";
		return;
	}
	return @chrom_lengths;
}

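Typical use of the ordered name/length list returned above; the file path and exclusion pattern here are hypothetical:

```perl
use Bio::ToolBox::db_helper qw(get_chromosome_list);

# skip mitochondrial, random, and unplaced contigs
my @chroms = get_chromosome_list('/path/to/alignments.bam', 'chrM|random|chrUn');
foreach my $pair (@chroms) {
	my ($name, $length) = @$pair;
	printf "%s\t%d\n", $name, $length;
}
```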
sub low_level_bam_fetch {
	confess "incorrect number of parameters passed!" unless scalar @_ == 6;
	my ($sam, $tid, $start, $stop, $callback, $data) = @_;
	# run the low level bam fetch based on which adapter is being used
	unless ($BAM_ADAPTER) {
		$BAM_ADAPTER = ref($sam) =~ /hts/i ? 'hts' : 'sam';
	}
	if ($BAM_ADAPTER eq 'hts') {
		# using Bio::DB::HTS
		my $index = $sam->hts_index;
		return unless $index;
		return $index->fetch($sam->hts_file, $tid, $start, $stop, $callback, $data);
	}
	elsif ($BAM_ADAPTER eq 'sam') {
		# using Bio::DB::Sam
		my $index = $sam->bam_index;
		return unless $index;
		return $index->fetch($sam->bam, $tid, $start, $stop, $callback, $data);
	}
	else {
		confess "no bam adapter loaded!\n";
	}
}

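A sketch of a fetch callback for the subroutine above, assuming an opened Bam database (the path is hypothetical); the callback receives each overlapping alignment plus the shared data reference:

```perl
use Bio::ToolBox::db_helper qw(open_db_connection low_level_bam_fetch);

my $sam  = open_db_connection('/path/to/alignments.bam');  # hypothetical file
my %data = (count => 0);
my $callback = sub {
	my ($alignment, $data) = @_;
	$data->{count} += 1;  # tally every alignment in the interval
};
# $tid is the numeric target id; low-level coordinates are 0-based
low_level_bam_fetch($sam, 0, 999, 2000, $callback, \%data);
printf "%d alignments\n", $data{count};
```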
sub low_level_bam_coverage {
	confess "incorrect number of parameters passed!" unless scalar @_ == 4;
	my ($sam, $tid, $start, $stop) = @_;
	# run the low level bam coverage based on which adapter is being used
	unless ($BAM_ADAPTER) {
		$BAM_ADAPTER = ref($sam) =~ /hts/i ? 'hts' : 'sam';
	}
	if ($BAM_ADAPTER eq 'hts') {
		# using Bio::DB::HTS
		my $index = $sam->hts_index;
		return unless $index;
		return $index->coverage($sam->hts_file, $tid, $start, $stop);
	}
	elsif ($BAM_ADAPTER eq 'sam') {
		# using Bio::DB::Sam
		my $index = $sam->bam_index;
		return unless $index;
		return $index->coverage($sam->bam, $tid, $start, $stop);
	}
	else {
		confess "no bam adapter loaded!\n";
	}
}

sub get_genomic_sequence {
	confess "incorrect number of parameters passed!" unless scalar @_ == 4;
	my ($db, $chrom, $start, $stop) = @_;

	# check database
	my $type = ref($db);
	$db = open_db_connection($db) if not $type;

	# return sequence based on the type of database adapter we're using
	if ($type eq 'Bio::DB::HTS::Faidx') {
		return $db->get_sequence_no_length("$chrom:$start-$stop");
	}
	elsif ($type eq 'Bio::DB::Sam::Fai') {
		return $db->fetch("$chrom:$start-$stop");
	}
	elsif ($db->can('seq')) {
		# BioPerl database including Bio::DB::Fasta and Bio::DB::SeqFeature::Store
		return $db->seq($chrom, $start, $stop);
	}
	else {
		confess "unsupported database $type for collecting genomic sequence!\n";
	}
}

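Retrieving sequence through the adapter dispatch above, assuming an indexed Fasta file (the path is hypothetical):

```perl
use Bio::ToolBox::db_helper qw(open_db_connection get_genomic_sequence);

my $fasta = open_db_connection('/path/to/genome.fa');  # hypothetical file
my $seq   = get_genomic_sequence($fasta, 'chr1', 1, 100);
print length($seq), " bp retrieved\n";
```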
### deprecated methods
# just in case
sub validate_included_feature {
	confess "validate_included_feature() is no longer a valid method. " .
		"Please update your script to the current API.\n";
}

### Internal subroutine to retrieve the scores from an established region object
sub _lookup_db_method {
	# parameters passed as an array reference
	my $param = shift;
	# determine the appropriate score method
	my $score_method;
	if ($param->[DATA] =~ /^file|http|ftp/) {
		# collect the data according to file type

		# BigWig Data file
		if ($param->[DATA] =~ /\.(?:bw|bigwig)$/i) {
			# file is in bigwig format
			# get the dataset scores using Bio::ToolBox::db_helper::bigwig
			# this uses either Bio::DB::BigWig or Bio::DB::Big

			# check that we have bigwig support
			$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
			if ($BIGWIG_OK) {
				if ($param->[RETT] == 2) {
					$score_method = \&collect_bigwig_position_scores;
				}
				elsif ($param->[RETT] == 1) {
					$score_method = \&collect_bigwig_scores;
				}
				elsif ($param->[METH] =~ /min|max|mean/) {
					$score_method = \&collect_bigwig_score;
				}
				elsif ($param->[METH] =~ /sum|count/) {
					# difference between ucsc and libBigWig libraries
					$score_method = $BIG_ADAPTER eq 'ucsc' ?
						\&collect_bigwig_score : \&collect_bigwig_scores;
				}
				else {
					$score_method = \&collect_bigwig_scores;
				}
			}
			else {
				croak " BigWig support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?";
			}
		}

		# BigBed Data file
		elsif ($param->[DATA] =~ /\.(?:bb|bigbed)$/i) {
			# data is in bigbed format
			# get the dataset scores using Bio::ToolBox::db_helper::bigbed
			# this uses either Bio::DB::BigBed or Bio::DB::Big

			# check that we have bigbed support
			$BIGBED_OK = _load_bigbed_helper_module() unless $BIGBED_OK;
			if ($BIGBED_OK) {
				if ($param->[RETT] == 2) {
					$score_method = \&collect_bigbed_position_scores;
				}
				else {
					$score_method = \&collect_bigbed_scores;
				}
			}
			else {
				croak " BigBed support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?";
			}
		}

		# BAM data file
		elsif ($param->[DATA] =~ /\.bam$/i) {
			# data is in bam format
			# get the dataset scores using Bio::ToolBox::db_helper::bam
			# this uses the Bio::DB::Sam or Bio::DB::HTS adaptor

			# check that we have Bam support
			_load_bam_helper_module() unless $BAM_OK;
			if ($BAM_OK) {
				$score_method = \&collect_bam_scores;
			}
			else {
				croak " Bam support is not enabled! " .
					"Is Bio::DB::Sam installed?\n";
			}
		}

		# USeq Data file
		elsif ($param->[DATA] =~ /\.useq$/i) {
			# data is in useq format
			# get the dataset scores using Bio::ToolBox::db_helper::useq
			# this uses the Bio::DB::USeq adaptor

			# check that we have useq support
			$USEQ_OK = _load_helper_module('Bio::ToolBox::db_helper::useq')
				unless $USEQ_OK;
			if ($USEQ_OK) {
				if ($param->[RETT] == 2) {
					$score_method = \&collect_useq_position_scores;
				}
				else {
					$score_method = \&collect_useq_scores;
				}
			}
			else {
				croak " USeq support is not enabled! " .
					"Is Bio::DB::USeq installed?\n";
			}
		}

		# Unsupported Data file
		else {
			confess sprintf " Unsupported or unrecognized file type for file '%s'!\n",
				$param->[DATA];
		}
	}

	### BigWigSet database
	elsif (ref($param->[DB]) =~ m/BigWigSet/) {
		# calling features from a BigWigSet database object
		# this uses either Bio::DB::BigWig or Bio::DB::Big

		# check that we have bigwig support
		# duh! we should, we probably opened the stupid database!
		$BIGWIG_OK = _load_bigwig_helper_module() unless $BIGWIG_OK;
		croak " BigWigSet support is not enabled! Is Bio::DB::Big or Bio::DB::BigFile installed?"
			unless $BIGWIG_OK;

		# the data collection depends on the method
		if ($param->[RETT] == 2) {
			$score_method = \&collect_bigwigset_position_scores;
		}
		elsif ($param->[RETT] == 1) {
			$score_method = \&collect_bigwigset_scores;
		}
		elsif ($param->[METH] =~ /min|max|mean|sum|count/) {
			# difference between ucsc and libBigWig libraries
			# the ucsc library works with summaries and we can handle multiple of these
			# but the big adapter doesn't
			$score_method = $BIG_ADAPTER eq 'ucsc' ?
				\&collect_bigwigset_score : \&collect_bigwigset_scores;
		}
		else {
			$score_method = \&collect_bigwigset_scores;
		}
	}

	### BioPerl style database
	elsif (ref($param->[DB]) =~ m/^Bio::DB/) {
		# a BioPerl style database, including Bio::DB::SeqFeature::Store
		# most or all Bio::DB databases support generic get_seq_stream() methods
		# that return seqfeature objects, which we can use in a generic sense

		# check that we have support
		# duh! we should, we probably opened the stupid database!
		$SEQFASTA_OK = _load_helper_module('Bio::ToolBox::db_helper::seqfasta')
			unless $SEQFASTA_OK;
		if ($SEQFASTA_OK) {
			# get the dataset scores using Bio::ToolBox::db_helper::seqfasta
			# check that we support methods
			unless ($param->[DB]->can('get_seq_stream')) {
				confess sprintf "unsupported database! cannot use %s as it does not support get_seq_stream method or equivalent",
					ref($param->[DB]);
			}
			$score_method = \&collect_store_scores;
		}
		else {
			croak " SeqFeature Store support is not enabled! " .
				"Is BioPerl and Bio::DB::SeqFeature::Store properly installed?\n";
		}
	}

	### Some other database?
	else {
		confess "no recognizable dataset provided!" unless $param->[DATA];
		confess "no database passed!" unless $param->[DB];
		confess "something went wrong! parameters: @$param";
	}

	### Cache and return the results
	$DB_METHODS{$param->[METH]}{$param->[RETT]}{$param->[DB]}{$param->[DATA]} =
		$score_method;
	return $score_method;
}

sub _load_helper_module {
	my $class = shift;
	my $success = 0;
	eval {
		load $class; # using Module::Load to dynamically load modules
		$class->import; # must be done particularly for my modules
		$success = 1;
	};
	return $success;
}

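The eval-wrapped Module::Load pattern above degrades gracefully when a module is missing; a self-contained sketch of the same idiom using a core module:

```perl
use Module::Load;  # core module for runtime loading

my $success = 0;
eval {
	load 'List::Util';         # dynamically load at runtime
	List::Util->import('sum'); # pull in the wanted symbol
	$success = 1;
};
print $success ? "loaded\n" : "failed to load: $@";
```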
sub _load_bam_helper_module {
	if ($BAM_ADAPTER) {
		if ($BAM_ADAPTER =~ /sam/i) {
			$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::bam');
			$BAM_ADAPTER = 'sam'; # for internal consistency
		}
		elsif ($BAM_ADAPTER =~ /hts/i) {
			$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::hts');
			$BAM_ADAPTER = 'hts'; # for internal consistency
		}
		elsif ($BAM_ADAPTER =~ /none/i) {
			# basically for testing purposes, don't use a module
			return 0;
		}
		else {
			# unrecognized
			$@ = 'unrecognized';
		}
		if (not $BAM_OK) {
			print "Requested '$BAM_ADAPTER' adapter could not be loaded: $@\n";
		}
	}
	else {
		# try hts first, then sam
		# be sure to set BAM_ADAPTER upon success
		$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::hts');
		if ($BAM_OK) {
			$BAM_ADAPTER = 'hts';
		}
		else {
			$BAM_OK = _load_helper_module('Bio::ToolBox::db_helper::bam');
			$BAM_ADAPTER = 'sam' if $BAM_OK;
		}
	}
	return $BAM_OK;
}
sub _load_bigwig_helper_module {
	if ($BIG_ADAPTER) {
		if ($BIG_ADAPTER =~ /ucsc|kent/i) {
			$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig');
			$BIG_ADAPTER = 'ucsc'; # for internal consistency
		}
		elsif ($BIG_ADAPTER =~ /big/i) {
			$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
			$BIGBED_OK = $BIGWIG_OK; # bigbed too!
			$BIG_ADAPTER = 'big'; # for internal consistency
		}
		elsif ($BIG_ADAPTER =~ /none/i) {
			# basically for testing purposes, don't use a module
			return 0;
		}
		else {
			# unrecognized
			$@ = 'unrecognized';
		}
		if (not $BIGWIG_OK) {
			print "Requested '$BIG_ADAPTER' adapter could not be loaded: $@\n";
		}
	}
	else {
		# we have to try each one out
		# try the modern big adapter first
		$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
		if ($BIGWIG_OK) {
			# success!
			$BIGBED_OK = $BIGWIG_OK; # bigbed too!
			$BIG_ADAPTER = 'big' if $BIGWIG_OK;
		}
		else {
			$BIGWIG_OK = _load_helper_module('Bio::ToolBox::db_helper::bigwig');
			$BIG_ADAPTER = 'ucsc' if $BIGWIG_OK;
		}
	}
	return $BIGWIG_OK;
}
sub _load_bigbed_helper_module {
	if ($BIG_ADAPTER) {
		if ($BIG_ADAPTER =~ /ucsc|kent/i) {
			$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed');
			$BIG_ADAPTER = 'ucsc'; # for internal consistency
		}
		elsif ($BIG_ADAPTER =~ /big/i) {
			$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
			$BIGWIG_OK = $BIGBED_OK; # bigwig too!
			$BIG_ADAPTER = 'big'; # for internal consistency
		}
		elsif ($BIG_ADAPTER =~ /none/i) {
			# basically for testing purposes, don't use a module
			return 0;
		}
		else {
			# unrecognized
			$@ = 'unrecognized';
		}
		if (not $BIGBED_OK) {
			print "Requested '$BIG_ADAPTER' adapter could not be loaded: $@\n";
		}
	}
	else {
		# we have to try each one out
		# try the modern big adapter first
		$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::big');
		if ($BIGBED_OK) {
			# success!
			$BIGWIG_OK = $BIGBED_OK; # bigwig too!
			$BIG_ADAPTER = 'big' if $BIGBED_OK;
		}
		else {
			$BIGBED_OK = _load_helper_module('Bio::ToolBox::db_helper::bigbed');
			$BIG_ADAPTER = 'ucsc' if $BIGBED_OK;
		}
	}
	return $BIGBED_OK;
}
=head1 AUTHOR

	Timothy J. Parnell, PhD
	Howard Hughes Medical Institute
	Dept of Oncological Sciences
	Huntsman Cancer Institute
	University of Utah
	Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify
it under the terms of the Artistic License 2.0.