File Coverage

Bio/DB/GenBank.pm

Criterion	Covered	Total	%
statement	13	17	76.4
branch	0	2	0.0
condition			n/a
subroutine	5	6	83.3
pod	2	2	100.0
total	20	27	74.0

line	stmt	bran	sub	pod	time	code
1						#
2						# BioPerl module for Bio::DB::GenBank
3						#
4						# Please direct questions and support issues to <bioperl-l@bioperl.org>
5						#
6						# Cared for by Aaron Mackey
7						#
8						# Copyright Aaron Mackey
9						#
10						# You may distribute this module under the same terms as perl itself
11						#
12						# POD documentation - main docs before the code
13						#
14						# Added LWP support - Jason Stajich 2000-11-6
15						# completely reworked by Jason Stajich 2000-12-8
16						# to use WebDBSeqI
17
18						# Added batch entrez back when determined that new entrez cgi will
19						# essentially work (there is a limit to the number of characters in a
20						# GET request so I am not sure how we can get around this). The NCBI
21						# Batch Entrez form has changed some and it does not support retrieval
22						# of text only data. Still should investigate POST-ing (tried and
23						# failed) a message to the entrez cgi to get around the GET
24						# limitations.
25
26						=head1 NAME
27
28						Bio::DB::GenBank - Database object interface to GenBank
29
30						=head1 SYNOPSIS
31
32						use Bio::DB::GenBank;
33						$gb = Bio::DB::GenBank->new();
34
35						$seq = $gb->get_Seq_by_id('J00522'); # Unique ID, not always the LOCUS ID
36
37						# or ...
38
39						$seq = $gb->get_Seq_by_acc('J00522'); # Accession Number
40						$seq = $gb->get_Seq_by_version('J00522.1'); # Accession.version
41						$seq = $gb->get_Seq_by_gi('405830'); # GI Number
42
43						# get a stream via a query string
44						my $query = Bio::DB::Query::GenBank->new
45						(-query =>'Oryza sativa[Organism] AND EST',
46						-reldate => '30',
47						-db => 'nucleotide');
48						my $seqio = $gb->get_Stream_by_query($query);
49
50						while( my $seq = $seqio->next_seq ) {
51						print "seq length is ", $seq->length,"\n";
52						}
53
54						# or ... best when downloading very large files, prevents
55						# keeping all of the file in memory
56
57						# also don't want features, just sequence so let's save bandwidth
58						# and request Fasta sequence
59						$gb = Bio::DB::GenBank->new(-retrievaltype => 'tempfile' ,
60						-format => 'Fasta');
61						$seqio = $gb->get_Stream_by_acc(['AC013798', 'AC021953'] );
62						while( my $clone = $seqio->next_seq ) {
63						print "cloneid is ", $clone->display_id, " ",
64						$clone->accession_number, "\n";
65						}
66						# note that get_Stream_by_version is not implemented
67
68						# don't want the entire sequence or more options
69						my $gb = Bio::DB::GenBank->new(-format => 'Fasta',
70						-seq_start => 100,
71						-seq_stop => 200,
72						-strand => 1,
73						-complexity => 4);
74						my $seqi = $gb->get_Stream_by_query($query);
75
76
77						=head1 DESCRIPTION
78
79						Allows the dynamic retrieval of L<Bio::Seq> sequence objects from the
80						GenBank database at NCBI, via an Entrez query.
81
82						WARNING: Please do B<NOT> spam the Entrez web server with multiple
83						requests. NCBI offers Batch Entrez for this purpose.
84
85						Note that when querying for GenBank accessions starting with 'NT_' you
86						will need to call $gb-E<gt>request_format('fasta') beforehand, because
87						in GenBank format (the default) the sequence part will be left out
88						(the reason is that NT contigs are annotation with references to
89						clones rather than sequence data).
90
91						Some work has been done to automatically detect and retrieve whole NT_
92						clones when the data is in that format (NCBI RefSeq clones). Prior to
93						bioperl 1.6 the behavior was to retrieve these from EBI,
94						but now these are retrieved directly from NCBI. The older behavior can
95						be regained by setting the 'redirect_refseq' flag to a value
96						evaluating to TRUE.
97
98						=head2 Running
99
100						Alternate methods are described at
101						L
102
103						NOTE: strand should be 1 for plus or 2 for minus.
104
105						Complexity: gi is often a part of a biological blob, containing other
106						gis
107
108						complexity regulates the display:
109						0 - get the whole blob
110						1 - get the bioseq for gi of interest (default in Entrez)
111						2 - get the minimal bioseq-set containing the gi of interest
112						3 - get the minimal nuc-prot containing the gi of interest
113						4 - get the minimal pub-set containing the gi of interest
114
115						'seq_start' and 'seq_stop' will not work when setting complexity to
116						any value other than 1. 'strand' works for any setting other than a
117						complexity of 0 (whole blob); when you try this with a GenBank return
118						format nothing happens, whereas using FASTA works but causes display
119						problems with the other sequences in the blob. As Tao Tao says from
120						NCBI, "Better left it out or set it to 1."
121
122						=head1 FEEDBACK
123
124						=head2 Mailing Lists
125
126						User feedback is an integral part of the evolution of this and other
127						Bioperl modules. Send your comments and suggestions preferably to one
128						of the Bioperl mailing lists. Your participation is much appreciated.
129
130						bioperl-l@bioperl.org - General discussion
131						http://bioperl.org/wiki/Mailing_lists - About the mailing lists
132
133						=head2 Support
134
135						Please direct usage questions or support issues to the mailing list:
136
137						I<bioperl-l@bioperl.org>
138
139						rather than to the module maintainer directly. Many experienced and
140						responsive experts will be able to look at the problem and quickly
141						address it. Please include a thorough description of the problem
142						with code and data examples if at all possible.
143
144						=head2 Reporting Bugs
145
146						Report bugs to the Bioperl bug tracking system to help us keep track
147						of the bugs and their resolution. Bug reports can be submitted via the
148						web:
149
150						https://github.com/bioperl/bioperl-live/issues
151
152						=head1 AUTHOR - Aaron Mackey, Jason Stajich
153
154						Email amackey@virginia.edu
155						Email jason@bioperl.org
156
157						=head1 APPENDIX
158
159						The rest of the documentation details each of the
160						object methods. Internal methods are usually
161						preceded with a _
162
163						=cut
164
165						# Let the code begin...
166
167						package Bio::DB::GenBank;
168	3		3		766	use strict;
	3				3
	3				84
169	3		3		13	use vars qw(%PARAMSTRING $DEFAULTFORMAT $DEFAULTMODE);
	3				4
	3				150
170
171	3		3		9	use base qw(Bio::DB::NCBIHelper);
	3				4
	3				1003
172						BEGIN {
173	3		3		5	$DEFAULTMODE = 'single';
174	3				4	$DEFAULTFORMAT = 'gbwithparts';
175	3				273	%PARAMSTRING = (
176						'batch' => { 'db' => 'nucleotide',
177						'usehistory' => 'n',
178						'tool' => 'bioperl'},
179						'query' => { 'usehistory' => 'y',
180						'tool' => 'bioperl',
181						'retmode' => 'text'},
182						'gi' => { 'db' => 'nucleotide',
183						'usehistory' => 'n',
184						'tool' => 'bioperl',
185						'retmode' => 'text'},
186						'version' => { 'db' => 'nucleotide',
187						'usehistory' => 'n',
188						'tool' => 'bioperl',
189						'retmode' => 'text'},
190						'single' => { 'db' => 'nucleotide',
191						'usehistory' => 'n',
192						'tool' => 'bioperl',
193						'retmode' => 'text'},
194						'webenv' => {
195						'query_key' => 'querykey',
196						'WebEnv' => 'cookie',
197						'db' => 'nucleotide',
198						'usehistory' => 'n',
199						'tool' => 'bioperl',
200						'retmode' => 'text'},
201						);
202						}
203
204						# new is in NCBIHelper
205
206						# helper method to get db specific options
207
208						=head2 new
209
210						Title : new
211						Usage : $gb = Bio::DB::GenBank->new(@options)
212						Function: Creates a new genbank handle
213						Returns : a new Bio::DB::GenBank object
214						Args : -delay number of seconds to delay between fetches (3s)
215
216						NOTE: There are other options that are used internally. By NCBI policy, this
217						module introduces a 3s delay between fetches. If you are fetching multiple genbank
218						ids, it is a good idea to use get
219
220						=cut
221
222						=head2 get_params
223
224						Title : get_params
225						Usage : my %params = $self->get_params($mode)
226						Function: Returns key,value pairs to be passed to NCBI database
227						for either 'batch' or 'single' sequence retrieval method
228						Returns : a key,value pair hash
229						Args : 'single' or 'batch' mode for retrieval
230
231						=cut
232
233						sub get_params {
234	0		0	1	0	my ($self, $mode) = @_;
235						return defined $PARAMSTRING{$mode} ?
236	0	0			0	%{$PARAMSTRING{$mode}} : %{$PARAMSTRING{$DEFAULTMODE}};
	0				0
	0				0
237						}
238
239						# from Bio::DB::WebDBSeqI from Bio::DB::RandomAccessI
240
241						=head1 Routines Bio::DB::WebDBSeqI from Bio::DB::RandomAccessI
242
243						=head2 get_Seq_by_id
244
245						Title : get_Seq_by_id
246						Usage : $seq = $db->get_Seq_by_id('ROA1_HUMAN')
247						Function: Gets a Bio::Seq object by its name
248						Returns : a Bio::Seq object
249						Args : the id (as a string) of a sequence
250						Throws : "id does not exist" exception
251
252						=head2 get_Seq_by_acc
253
254						Title : get_Seq_by_acc
255						Usage : $seq = $db->get_Seq_by_acc($acc);
256						Function: Gets a Seq object by accession numbers
257						Returns : a Bio::Seq object
258						Args : the accession number as a string
259						Note : For GenBank, this just calls the same code for get_Seq_by_id().
260						Caveat: this normally works, but in rare cases simply passing the
261						accession can lead to odd results, possibly due to unsynchronized
262						NCBI ID servers. Using get_Seq_by_version() is slightly better, but
263						using the unique identifier (GI) and get_Seq_by_id is the most
264						consistent
265						Throws : "id does not exist" exception
266
267						=head2 get_Seq_by_gi
268
269						Title : get_Seq_by_gi
270						Usage : $seq = $db->get_Seq_by_gi('405830');
271						Function: Gets a Bio::Seq object by gi number
272						Returns : A Bio::Seq object
273						Args : gi number (as a string)
274						Throws : "gi does not exist" exception
275
276						=head2 get_Seq_by_version
277
278						Title : get_Seq_by_version
279						Usage : $seq = $db->get_Seq_by_version('X77802.1');
280						Function: Gets a Bio::Seq object by sequence version
281						Returns : A Bio::Seq object
282						Args : accession.version (as a string)
283						Note : Caveat: this normally works, but using the unique identifier (GI) and
284						get_Seq_by_id is the most consistent
285						Throws : "acc.version does not exist" exception
286
287						=head1 Routines implemented by Bio::DB::NCBIHelper
288
289						=head2 get_Stream_by_query
290
291						Title : get_Stream_by_query
292						Usage : $seq = $db->get_Stream_by_query($query);
293						Function: Retrieves Seq objects from Entrez 'en masse', rather than one
294						at a time. For large numbers of sequences, this is far superior
295						to get_Stream_by_[id/acc]().
296						Example :
297						Returns : a Bio::SeqIO stream object
298						Args : $query : An Entrez query string or a
299						Bio::DB::Query::GenBank object. It is suggested that you
300						create a Bio::DB::Query::GenBank object and get the entry
301						count before you fetch a potentially large stream.
302
303						=cut
304
305						=head2 get_Stream_by_id
306
307						Title : get_Stream_by_id
308						Usage : $stream = $db->get_Stream_by_id( [$uid1, $uid2] );
309						Function: Gets a series of Seq objects by unique identifiers
310						Returns : a Bio::SeqIO stream object
311						Args : $ref : a reference to an array of unique identifiers for
312						the desired sequence entries
313
314						=head2 get_Stream_by_acc
315
316						Title : get_Stream_by_acc
317						Usage : $seq = $db->get_Stream_by_acc([$acc1, $acc2]);
318						Function: Gets a series of Seq objects by accession numbers
319						Returns : a Bio::SeqIO stream object
320						Args : $ref : a reference to an array of accession numbers for
321						the desired sequence entries
322						Note : For GenBank, this just calls the same code for get_Stream_by_id()
323
324						=cut
325
326						=head2 get_Stream_by_gi
327
328						Title : get_Stream_by_gi
329						Usage : $seq = $db->get_Stream_by_gi([$gi1, $gi2]);
330						Function: Gets a series of Seq objects by gi numbers
331						Returns : a Bio::SeqIO stream object
332						Args : $ref : a reference to an array of gi numbers for
333						the desired sequence entries
334						Note : For GenBank, this just calls the same code for get_Stream_by_id()
335
336						=head2 get_Stream_by_batch
337
338						Title : get_Stream_by_batch
339						Usage : $seq = $db->get_Stream_by_batch($ref);
340						Function: Retrieves Seq objects from Entrez 'en masse', rather than one
341						at a time.
342						Example :
343						Returns : a Bio::SeqIO stream object
344						Args : $ref : either an array reference, a filename, or a filehandle
345						from which to get the list of unique ids/accession numbers.
346
347						NOTE: This method is redundant and deprecated. Use get_Stream_by_id()
348						instead.
349
350						=head2 get_request
351
352						Title : get_request
353						Usage : my $url = $self->get_request
354						Function: Builds the HTTP request for the given qualifiers
355						Returns : an HTTP::Request object
356						Args : %qualifiers = a hash of qualifiers (ids, format, etc)
357
358						=cut
359
360						=head2 default_format
361
362						Title : default_format
363						Usage : my $format = $self->default_format
364						Function: Returns default sequence format for this module
365						Returns : string
366						Args : none
367
368						=cut
369
370						sub default_format {
371	1		1	1	2	return $DEFAULTFORMAT;
372						}
373
374						1;
375						__END__
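
The table above shows get_params (lines 233-237) with no statement coverage and neither side of its branch exercised. The following is a minimal sketch of how that lookup-with-fallback behaves, not the module's own test. It assumes a trimmed-down local copy of %PARAMSTRING and a hypothetical standalone function, params_for_mode, so that it runs without BioPerl installed. It also adds a defined check on $mode, which the original does not have, to avoid a warning when no mode is passed.

```perl
use strict;
use warnings;

# Assumption: a reduced copy of the module's parameter table, kept local
# so this sketch does not need Bio::DB::GenBank to be installed.
my $DEFAULTMODE = 'single';
my %PARAMSTRING = (
    batch  => { db => 'nucleotide', usehistory => 'n', tool => 'bioperl' },
    single => { db => 'nucleotide', usehistory => 'n', tool => 'bioperl',
                retmode => 'text' },
);

# Hypothetical stand-in for get_params: same ternary shape, no $self.
# Returns the parameters for $mode, or those of the default mode when
# $mode is missing or not a key of the table.
sub params_for_mode {
    my ($mode) = @_;
    return ( defined $mode && defined $PARAMSTRING{$mode} )
        ? %{ $PARAMSTRING{$mode} }
        : %{ $PARAMSTRING{$DEFAULTMODE} };
}

my %batch   = params_for_mode('batch');      # known mode: first branch
my %unknown = params_for_mode('no_such');    # unknown mode: fallback branch

print join( ',', map { "$_=$batch{$_}" } sort keys %batch ),     "\n";
print join( ',', map { "$_=$unknown{$_}" } sort keys %unknown ), "\n";
```

Calling the real method once with a known mode and once with an unknown one would, under the same assumptions, cover both sides of the branch reported as 0 of 2.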