File Coverage

Bio/SeqIO/msout.pm
Criterion Covered Total %
statement 212 223 95.0
branch 54 62 87.1
condition 14 21 66.6
subroutine 29 30 96.6
pod 19 22 86.3
total 328 358 91.6


line stmt bran cond sub pod time code
1             # POD documentation - main docs before the code
2              
3             =head1 NAME
4              
5             Bio::SeqIO::msout - input stream for output by Hudson's ms
6              
7             =head1 SYNOPSIS
8              
9             Do not use this module directly. Use it via the Bio::SeqIO class.
10              
11             =head1 DESCRIPTION
12              
13             ms ( Hudson, R. R. (2002) Generating samples under a Wright-Fisher neutral
14             model. Bioinformatics 18:337-8 ) can be found at
15             http://home.uchicago.edu/~rhudson1/source/mksamples.html.
16              
17             Currently, this object can be used to read output from ms into seq objects.
18             However, because bioperl has no support for haplotypes created using an infinite
19             sites model (where '1' identifies a derived allele and '0' identifies an
20             ancestral allele), the sequences returned by msout are coded using A, T, C and
21             G. To decode the bases, use the sequence conversion table (a hash) returned by
22             get_base_conversion_table(). In the table, 4 and 5 are used when the ancestry is
23             unclear. This should not ever happen when creating files with ms, but it will be
24             used when creating msOUT files from a collection of seq objects ( To be added
25             later ). Alternatively, use get_next_hap() to get a string with 1's and 0's
26             instead of a seq object.
27              
28             =head2 Mapping to Finite Sites
29              
30             This object can now also be used to map haplotypes created using an infinite sites
31             model to sequences of arbitrary finite length. See set_n_sites() for more detail.
32             Thanks to Filipe G. Vieira for the idea and code.
33              
34             =head1 FEEDBACK
35              
36             =head2 Mailing Lists
37              
38             User feedback is an integral part of the evolution of this and other
39             Bioperl modules. Send your comments and suggestions preferably to the
40             Bioperl mailing list. Your participation is much appreciated.
41              
42             bioperl-l@bioperl.org - General discussion
43             http://bioperl.org/wiki/Mailing_lists - About the mailing lists
44              
45             =head2 Reporting Bugs
46              
47             Report bugs to the Bioperl bug tracking system to help us keep track
48             of the bugs and their resolution. Bug reports can be submitted via the
49             web:
50              
51             https://github.com/bioperl/bioperl-live/issues
52              
53             =head1 AUTHOR - Warren Kretzschmar
54              
55             This module was written by Warren Kretzschmar
56              
57             email: wkretzsch@gmail.com
58              
59             This module grew out of a parser written by Aida Andres.
60              
61             =head1 COPYRIGHT
62              
63             =head2 Public Domain Notice
64              
65             This software/database is ``United States Government Work'' under the
66             terms of the United States Copyright Act. It was written as part of
67             the authors' official duties for the United States Government and thus
68             cannot be copyrighted. This software/database is freely available to
69             the public for use without a copyright notice. Restrictions cannot
70             be placed on its present or future use.
71              
72             Although all reasonable efforts have been taken to ensure the accuracy
73             and reliability of the software and data, the National Human Genome
74             Research Institute (NHGRI) and the U.S. Government does not and cannot
75             warrant the performance or results that may be obtained by using this
76             software or data. NHGRI and the U.S. Government disclaims all
77             warranties as to performance, merchantability or fitness for any
78             particular purpose.
79              
80             =head1 METHODS
81              
82             =cut
83              
84             package Bio::SeqIO::msout;
85 1     1   8 use version;
  1         4  
  1         9  
86             our $API_VERSION = qv('1.1.8');
87              
88 1     1   84 use strict;
  1         1  
  1         18  
89 1     1   3 use base qw(Bio::SeqIO); # This ISA Bio::SeqIO object
  1         2  
  1         302  
90 1     1   216 use Bio::Seq::SeqFactory;
  1         2  
  1         1715  
91              
92             =head2 Methods for Internal Use
93              
94             =head3 _initialize
95              
96             Title : _initialize
97             Usage : $stream = Bio::SeqIO::msOUT->new($infile)
98             Function: extracts basic information about the file.
99             Returns : Bio::SeqIO object
100             Args : no_og, gunzip, gzip, n_sites
101             Details :
102             - include 'no_og' flag if the last population of an msout file contains
103             only one haplotype and you don't want the last haplotype to be
104             treated as the outgroup ( suggested when reading data created by ms ).
105             - including 'n_sites' (positive integer) causes all output haplotypes to be
106             mapped to a sequence of length 'n_sites'. See set_n_sites() for more details.
107              
108             =cut
109              
110             sub _initialize {
111 12     12   30 my ( $self, @args ) = @_;
112 12         37 $self->SUPER::_initialize(@args);
113              
114 12 50       30 unless ( defined $self->sequence_factory ) {
115 12         53 $self->sequence_factory( Bio::Seq::SeqFactory->new() );
116             }
117 12         37 my ($no_og) = $self->_rearrange( [qw(NO_OG)], @args );
118 12         36 my ($n_sites) = $self->_rearrange( [qw(N_SITES)], @args );
119              
120 12         134 my %initial_values = (
121             RUNS => undef,
122             N_SITES => undef,
123             SEGSITES => undef,
124             SEEDS => [],
125             MS_INFO_LINE => undef,
126             TOT_RUN_HAPS => undef,
127             POPS => [],
128             NEXT_RUN_NUM => undef, # What run is the next hap from? undef = EOF
129             LAST_READ_HAP_NUM => undef, # What did we just read from
130             LAST_HAPS_RUN_NUM => undef,
131             LAST_READ_POSITIONS => [],
132             LAST_READ_SEGSITES => undef,
133             BUFFER_HAP => undef,
134             NO_OUTGROUP => $no_og,
135             BASE_CONVERSION_TABLE_HASH_REF => {
136             'A' => 0,
137             'T' => 1,
138             'C' => 4,
139             'G' => 5,
140             },
141             );
142 12         47 foreach my $key ( keys %initial_values ) {
143 180         212 $self->{$key} = $initial_values{$key};
144             }
145 12         44 $self->set_n_sites($n_sites);
146              
147             # If the filehandle is defined open it and read a few lines
148 12 50       33 if ( ref( $self->{_filehandle} ) eq 'GLOB' ) {
149 12         25 $self->_read_start();
150 12         44 return $self;
151             }
152              
153             # Otherwise throw a warning
154             else {
155 0         0 $self->throw(
156             "No filehandle defined. Please define a file handle through -file when calling msout with Bio::SeqIO"
157             );
158             }
159             }
160              
161             =head3 _read_start
162              
163             Title : _read_start
164             Usage : $stream->_read_start()
165             Function: reads from the filehandle $stream->{_filehandle} all information up to the first haplotype (sequence). Closes the filehandle if all lines have been read.
166             Returns : void
167             Args : none
168              
169             =cut
170              
171             sub _read_start {
172 12     12   14 my $self = shift;
173              
174 12         22 my $fh_IN = $self->{_filehandle};
175              
176             # get the first five lines and parse for important info
177 12         28 my ( $ms_info_line, $seeds ) = $self->_get_next_clean_hap( $fh_IN, 2, 1 );
178              
179 12         59 my @ms_info_line = split( /\s+/, $ms_info_line );
180              
181 12         16 my ( $tot_pops, @pop_haplos );
182              
183             # Parsing the ms header line
184 12         13 shift @ms_info_line;
185 12         16 my $tot_run_haps = shift @ms_info_line;
186 12         17 my $runs = shift @ms_info_line;
187 12         13 my $segsites;
188              
189 12         28 foreach my $word ( 0 .. $#ms_info_line ) {
190 63 100       96 if ( $ms_info_line[$word] eq '-I' ) {
    100          
191 9         16 $tot_pops = $ms_info_line[ $word + 1 ];
192 9         29 for my $pop_num ( 1 .. $tot_pops ) {
193 27         40 push @pop_haplos, $ms_info_line[ $word + 1 + $pop_num ];
194             }
195              
196             # if @pop_haplos contains a non-digit, then there is an error in the msinfo line.
197 9 50 33     40 if ( !defined $pop_haplos[-1] || $pop_haplos[-1] =~ /\D/ ) {
198 0         0 $self->throw(
199             "Incorrect number of populations in the ms info line (after the -I specifier)"
200             );
201             }
202             }
203             elsif ( $ms_info_line[$word] eq '-s' ) {
204 9         16 $segsites = $ms_info_line[ $word + 1 ];
205             }
206 45         49 else { next; }
207             }
208              
209 12 100       23 unless (@pop_haplos) { @pop_haplos = ($tot_run_haps) }
  3         6  
210              
211 12         36 my @seeds = split( /\s+/, $seeds );
212              
213             # Save ms info data
214 12         16 $self->{RUNS} = $runs;
215 12         20 $self->{SEGSITES} = $segsites;
216 12         23 $self->{SEEDS} = \@seeds;
217 12         16 $self->{MS_INFO_LINE} = $ms_info_line;
218 12         21 $self->{TOT_RUN_HAPS} = $tot_run_haps;
219 12         23 $self->{POPS} = [@pop_haplos];
220              
221 12         26 return;
222             }
223              
224             =head2 Methods to Access Data
225              
226             =head3 get_segsites
227              
228             Title : get_segsites
229             Usage : $segsites = $stream->get_segsites()
230             Function: returns the number of segsites in the msOUT file (according to the msOUT header line's -s option), or the current run's segsites if -s was not specified in the command line (in this case the number of segsites varies from run to run).
231             Returns : scalar
232             Args : NONE
233              
234             =cut
235              
236             sub get_segsites {
237 7     7 1 3296 my $self = shift;
238 7 100       18 if ( defined $self->{SEGSITES} ) {
239 4         11 return $self->{SEGSITES};
240             }
241             else {
242 3         7 return $self->get_current_run_segsites;
243             }
244             }
245              
246             =head3 get_current_run_segsites
247              
248             Title : get_current_run_segsites
249             Usage : $segsites = $stream->get_current_run_segsites()
250             Function: returns the number of segsites in the run of the last read
251             haplotype (sequence).
252             Returns : scalar
253             Args : NONE
254              
255             =cut
256              
257             sub get_current_run_segsites {
258 138     138 1 2946 my $self = shift;
259 138         232 return $self->{LAST_READ_SEGSITES};
260             }
261              
262             =head3 get_n_sites
263              
264             Title : get_n_sites
265             Usage : $n_sites = $stream->get_n_sites()
266             Function: Gets the number of total sites (variable or not) to be output.
267             Returns : scalar if n_sites option is defined at call time of new()
268             Args : NONE
269             Note :
270             WARNING: Final sequence length might not be equal to n_sites if n_sites is
271             too close to number of segregating sites in the msout file.
272              
273             =cut
274              
275             sub get_n_sites {
276 134     134 1 2843 my ($self) = @_;
277              
278 134         184 return $self->{N_SITES};
279             }
280              
281             =head3 set_n_sites
282              
283             Title : set_n_sites
284             Usage : $n_sites = $stream->set_n_sites($value)
285             Function: Sets the number of total sites (variable or not) to be output.
286             Returns : 1 on success; throws an error if $value is not a positive integer or undef
287             Args : positive integer
288             Note :
289             WARNING: Final sequence length might not be equal to n_sites if it is
290             too close to number of segregating sites.
291             - n_sites needs to be at least as large as the number of segsites of
292             the next haplotype returned
293             - n_sites may also be set to undef, in which case haplotypes are returned
294             under the infinite sites model assumptions.
295              
296             =cut
297              
298             sub set_n_sites {
299 14     14 1 512 my ( $self, $value ) = @_;
300              
301             # make sure $value is a positive integer if it is defined
302 14 100       28 if ( defined $value ) {
303 7 100 66     57 $self->throw(
304             "first argument needs to be a positive integer. argument supplied: $value"
305             ) unless ( $value =~ m/^\d+$/ && $value > 0 );
306             }
307 13         21 $self->{N_SITES} = $value;
308              
309 13         17 return 1;
310             }
311              
312             =head3 get_runs
313              
314             Title : get_runs
315             Usage : $runs = $stream->get_runs()
316             Function: returns the number of runs in the msOUT file (according to the
317             msinfo line)
318             Returns : scalar
319             Args : NONE
320              
321             =cut
322              
323             sub get_runs {
324 7     7 1 3041 my $self = shift;
325 7         20 return $self->{RUNS};
326             }
327              
328             =head3 get_Seeds
329              
330             Title : get_Seeds
331             Usage : @seeds = $stream->get_Seeds()
332             Function: returns an array of the seeds used in the creation of the msOUT file.
333             Returns : array
334             Args : NONE
335             Details : In older versions, ms used three seeds. Newer versions of ms seem to
336             use only one (longer) seed. This function will return all the seeds
337             found.
338              
339             =cut
340              
341             sub get_Seeds {
342 7     7 1 3143 my $self = shift;
343 7         7 return @{ $self->{SEEDS} };
  7         25  
344             }
345              
346             =head3 get_Positions
347              
348             Title : get_Positions
349             Usage : @positions = $stream->get_Positions()
350             Function: returns an array of the names of each segsite of the run of the last
351             read hap.
352             Returns : array
353             Args : NONE
354             Details : The Positions may or may not vary from run to run depending on the
355             options used with ms.
356              
357             =cut
358              
359             sub get_Positions {
360 69     69 1 2308 my $self = shift;
361 69         72 return @{ $self->{LAST_READ_POSITIONS} };
  69         189  
362             }
363              
364             =head3 get_tot_run_haps
365              
366             Title : get_tot_run_haps
367             Usage : $number_of_haps_per_run = $stream->get_tot_run_haps()
368             Function: returns the number of haplotypes (sequences) in each run of the msOUT
369             file ( according to the msinfo line ).
370             Returns : scalar >= 0
371             Args : NONE
372             Details : This number should not vary from run to run.
373              
374             =cut
375              
376             sub get_tot_run_haps {
377 7     7 1 2803 my $self = shift;
378 7         18 return $self->{TOT_RUN_HAPS};
379             }
380              
381             =head3 get_ms_info_line
382              
383             Title : get_ms_info_line
384             Usage : $ms_info_line = $stream->get_ms_info_line()
385             Function: returns the header line of the msOUT file.
386             Returns : scalar
387             Args : NONE
388              
389             =cut
390              
391             sub get_ms_info_line {
392 7     7 1 2836 my $self = shift;
393 7         17 return $self->{MS_INFO_LINE};
394             }
395              
396             =head3 tot_haps
397              
398             Title : tot_haps
399             Usage : $number_of_haplotypes_in_file = $stream->tot_haps()
400             Function: returns the number of haplotypes (sequences) in the msOUT file.
401             Information gathered from msOUT header line.
402             Returns : scalar
403             Args : NONE
404              
405             =cut
406              
407             sub get_tot_haps {
408 12     12 0 12 my $self = shift;
409 12         43 return ( $self->{TOT_RUN_HAPS} * $self->{RUNS} );
410             }
411              
412             =head3 get_Pops
413              
414             Title : get_Pops
415             Usage : @pops = $stream->pops()
416             Function: returns an array of population sizes (order taken from the -I flag in
417             the msOUT header line). This array will include the last hap even if
418             it looks like an outgroup.
419             Returns : array of scalars > 0
420             Args : NONE
421              
422             =cut
423              
424             sub get_Pops {
425 91     91 1 3425 my $self = shift;
426 91         86 return @{ $self->{POPS} };
  91         251  
427             }
428              
429             =head3 get_next_run_num
430              
431             Title : get_next_run_num
432             Usage : $next_run_number = $stream->next_run_num()
433             Function: returns the number of the ms run that the next haplotype (sequence)
434             will be taken from (starting at 1). Returns undef if the complete
435             file has been read.
436             Returns : scalar > 0 or undef
437             Args : NONE
438              
439             =cut
440              
441             sub get_next_run_num {
442 162     162 1 6574 my $self = shift;
443 162         220 return $self->{NEXT_RUN_NUM};
444             }
445              
446             =head3 get_last_haps_run_num
447              
448             Title : get_last_haps_run_num
449             Usage : $last_haps_run_number = $stream->get_last_haps_run_num()
450             Function: returns the number of the ms run that the last haplotype (sequence)
451             was taken from (starting at 1). Returns undef if no hap has been
452             read yet.
453             Returns : scalar > 0 or undef
454             Args : NONE
455              
456             =cut
457              
458             sub get_last_haps_run_num {
459 128     128 1 139 my $self = shift;
460 128         158 return $self->{LAST_HAPS_RUN_NUM};
461             }
462              
463             =head3 get_last_read_hap_num
464              
465             Title : get_last_read_hap_num
466             Usage : $last_read_hap_num = $stream->get_last_read_hap_num()
467             Function: returns the number (starting with 1) of the last haplotype read from
468             the ms file
469             Returns : scalar >= 0
470             Args : NONE
471             Details : 0 means that no haplotype has been read yet. Is reset to 0 every run.
472              
473             =cut
474              
475             sub get_last_read_hap_num {
476 321     321 1 3497 my $self = shift;
477 321         388 return $self->{LAST_READ_HAP_NUM};
478             }
479              
480             =head3 outgroup
481              
482             Title : outgroup
483             Usage : $outgroup = $stream->outgroup()
484             Function: returns '1' if the msOUT stream has an outgroup. Returns '0'
485             otherwise.
486             Returns : '1' or '0'
487             Args : NONE
488             Details : This method will return '1' only if the last population in the msOUT
489             file contains only one haplotype. If the last population is not an
490             outgroup then create the msOUT object using 'no_og' as input flag.
491             Also, return 0, if the run has only one population.
492              
493             =cut
494              
495             sub outgroup {
496 6     6 1 2636 my $self = shift;
497 6         15 my @pops = $self->get_Pops;
498 6 100 66     41 if ( $pops[$#pops] == 1 && !defined $self->{NO_OUTGROUP} && @pops > 1 ) {
      66        
499 4         11 return 1;
500             }
501 2         4 else { return 0; }
502             }
503              
504             =head3 get_next_haps_pop_num
505              
506             Title : get_next_haps_pop_num
507             Usage : ($next_haps_pop_num, $num_haps_left_in_pop) = $stream->get_next_haps_pop_num()
508             Function: First return value is the population number (starting with 1) the
509             next hap will come from. The second return value is the number of haps
510             left to read in the population from which the next hap will come.
511             Returns : (scalar > 0, scalar > 0)
512             Args : NONE
513              
514             =cut
515              
516             sub get_next_haps_pop_num {
517 39     39 1 42 my $self = shift;
518 39         53 my $last_read_hap = $self->get_last_read_hap_num;
519 39         61 my @pops = $self->get_Pops;
520              
521 39         86 foreach my $pop_num ( 0 .. $#pops ) {
522 55 100       106 if ( $last_read_hap < $pops[$pop_num] ) {
523 37         112 return ( $pop_num + 1, $pops[$pop_num] - $last_read_hap );
524             }
525 18         25 else { $last_read_hap -= $pops[$pop_num] }
526             }
527              
528             # In this case we're at the beginning of the next run
529 2         5 return ( 1, $pops[0] );
530             }
531              
532             =head3 get_next_seq
533              
534             Title : get_next_seq
535             Usage : $seq = $stream->get_next_seq()
536             Function: reads and returns the next sequence (haplotype) in the stream
537             Returns : Bio::Seq object or void if end of file
538             Args : NONE
539             Note : This function is included only to conform to convention. The
540             returned Bio::Seq object holds a halpotype in coded form. Use the hash
541             returned by get_base_conversion_table() to convert 'A', 'T', 'C', 'G'
542             back into 1,2,4 and 5. Use get_next_hap() to retrieve the halptoype as
543             a string of 1,2,4 and 5s instead.
544              
545             =cut
546              
547             sub get_next_seq {
548 139     139 1 5944 my $self = shift;
549 139         191 my $seqstring = $self->get_next_hap;
550              
551 135 100       194 return unless ($seqstring);
552              
553             # Used to create unique ID;
554 128         184 my $run = $self->get_last_haps_run_num;
555              
556             # Converting numbers to letters so that the haplotypes can be stored as a
557             # seq object
558 128         167 my $rh_base_conversion_table = $self->get_base_conversion_table;
559 128         124 foreach my $base ( keys %{$rh_base_conversion_table} ) {
  128         308  
560 512         4244 $seqstring =~ s/($rh_base_conversion_table->{$base})/$base/g;
561             }
562              
563             # Fill in non-variable positions
564 128         300 my $segsites = $self->get_current_run_segsites;
565 128         197 my $n_sites = $self->get_n_sites;
566 128 100       233 if ( defined($n_sites) ) {
567              
568             # make sure that n_sites is at least as large
569             # as segsites for each run. Throw an exception otherwise.
570 63 100       138 $self->throw( "n_sites:\t$n_sites"
571             . "\nsegsites:\t$segsites"
572             . "\nrun:\t$run"
573             . "\nn_sites needs to be at least the number of segsites of every run"
574             ) unless $segsites <= $n_sites;
575              
576 62         67 my $seq_len = 0;
577 62         60 my @seq;
578 62         94 my @pos = $self->get_Positions;
579 62         130 for ( my $i = 0 ; $i <= $#pos ; $i++ ) {
580 434         603 $pos[$i] *= $n_sites;
581 434         624 push( @seq, "A" x ( $pos[$i] - 1 - $seq_len ) );
582 434         427 $seq_len += length( $seq[-1] );
583 434         489 push( @seq, substr( $seqstring, $i, 1 ) );
584 434         574 $seq_len += length( $seq[-1] );
585             }
586 62         90 push( @seq, "A" x ( $n_sites - $seq_len ) );
587 62         290 $seqstring = join( "", @seq );
588             }
589              
590 127         171 my $last_read_hap = $self->get_last_read_hap_num;
591 127         229 my $id = 'Hap_' . $last_read_hap . '_Run_' . $run;
592 127 100       355 my $description =
593             "Segsites $segsites;"
594             . " Positions "
595             . ( defined $n_sites ? $n_sites : $segsites ) . ";"
596             . " Haplotype $last_read_hap;"
597             . " Run $run;";
598 127         260 my $seq = $self->sequence_factory->create(
599             -seq => $seqstring,
600             -id => $id,
601             -desc => $description,
602             -alphabet => q(dna),
603             -direct => 1,
604             );
605              
606 127         299 return $seq
607              
608             }
609              
610             =head3 next_seq
611              
612             Title : next_seq
613             Usage : $seq = $stream->next_seq()
614             Function: Alias to get_next_seq()
615             Returns : Bio::Seq object or void if end of file
616             Args : NONE
617             Note : This function is only included for convention. It calls get_next_seq().
618             See get_next_seq() for details.
619              
620             =cut
621              
622             sub next_seq {
623 0     0 1 0 my $self = shift;
624 0         0 return $self->get_next_seq();
625             }
626              
627             =head3 get_next_hap
628              
629             Title : get_next_hap
630             Usage : $hap = $stream->next_hap()
631             Function: reads and returns the next sequence (haplotype) in the stream.
632             Returns undef if all sequences in stream have been read.
633             Returns : Haplotype string (e.g. '110110000101101045454000101'
634             Args : NONE
635             Note : Use get_next_seq() if you want the halpotype returned as a
636             Bio::Seq object.
637              
638             =cut
639              
640             sub get_next_hap {
641 148     148 1 2464 my $self = shift;
642              
643             # Let's figure out how many haps to read from the input file so that
644             # we get back to the beginning of the next run.
645              
646 148         159 my $end_run = 0;
647 148 100       279 if ( $self->{TOT_RUN_HAPS} == $self->{LAST_READ_HAP_NUM} + 1 ) {
648 23         26 $end_run = 1;
649             }
650              
651             # Setting last_haps_run_num
652 148         184 $self->{LAST_HAPS_RUN_NUM} = $self->get_next_run_num;
653              
654 148         182 my $last_read_hap = $self->get_last_read_hap_num;
655             my ($seqstring) =
656 148         242 $self->_get_next_clean_hap( $self->{_filehandle}, 1, $end_run );
657 146 100 100     263 if ( !defined $seqstring && $last_read_hap < $self->get_tot_haps ) {
658 2         7 $self->throw(
659             "msout file has only $last_read_hap hap(s), which is less than indicated in msinfo line ( "
660             . $self->get_tot_haps
661             . " )" );
662             }
663              
664 144         204 return $seqstring;
665             }
666              
667             =head3 get_next_pop
668              
669             Title : get_next_pop
670             Usage : @seqs = $stream->next_pop()
671             Function: reads and returns all the remaining sequences (haplotypes) in the
672             population of the next sequence. Returns an empty list if no more
673             haps remain to be read in the stream
674             Returns : array of Bio::Seq objects
675             Args : NONE
676              
677             =cut
678              
679             sub get_next_pop {
680 18     18 1 7626 my $self = shift;
681              
682             # Let's figure out how many haps to read from the input file so that
683             # we get back to the beginning of the next run.
684              
685 18         36 my @pops = $self->get_Pops;
686 18         21 my @seqs; # holds Bio::Seq objects to return
687              
688             # Determine number of the pop that the next hap will be taken from
689 18         27 my ( $next_haps_pop_num, $haps_to_pull ) = $self->get_next_haps_pop_num;
690              
691             # If $haps_to_pull == 0, then we need to pull the whole population
692 18 50       40 if ( $haps_to_pull == 0 ) {
693 0         0 $haps_to_pull = $pops[ $next_haps_pop_num - 1 ];
694             }
695              
696 18         28 for ( 1 .. $haps_to_pull ) {
697 50         68 my $seq = $self->get_next_seq;
698 50 100       77 next unless defined $seq;
699              
700             # Add Population number information to description
701 47         128 $seq->display_id(" Population number $next_haps_pop_num;");
702 47         83 push @seqs, $seq;
703             }
704              
705 18         59 return @seqs;
706             }
707              
708             =head3 next_run
709              
710             Title : next_run
711             Usage : @seqs = $stream->next_run()
712             Function: reads and returns all the remaining sequences (haplotypes) in the ms
713             run of the next sequence. Returns an empty list if all haps have been
714             read from the stream.
715             Returns : array of Bio::Seq objects
716             Args : NONE
717              
718             =cut
719              
720             sub get_next_run {
721 21     21 0 9223 my $self = shift;
722              
723             # Let's figure out how many haps to read from the input file so that
724             # we get back to the beginning of the next run.
725              
726 21         49 my ( $next_haps_pop_num, $haps_to_pull ) = $self->get_next_haps_pop_num;
727 21         25 my @seqs;
728              
729 21         28 my @pops = $self->get_Pops;
730              
731 21         39 foreach ( $next_haps_pop_num .. $#pops ) {
732 20         34 $haps_to_pull += $pops[$_];
733             }
734              
735             # Read those haps from the input file
736             # Next hap read will be the first hap of the first pop of the next run.
737              
738 21         31 for ( 1 .. $haps_to_pull ) {
739 75         111 my $seq = $self->get_next_seq;
740 71 100       117 next unless defined $seq;
741              
742 68         106 push @seqs, $seq;
743             }
744              
745 17         76 return @seqs;
746             }
747              
748             =head2 Methods to Retrieve Constants
749              
750             =head3 base_conversion_table
751              
752             Title : get_base_conversion_table
753             Usage : $table_hash_ref = $stream->get_base_conversion_table()
754             Function: returns a reference to a hash. The keys of the hash are the letters '
755             A','T','G','C'. The values associated with each key are the value that
756             each letter in the sequence of a seq object returned by a
757             Bio::SeqIO::msout stream should be translated to.
758             Returns : reference to a hash
759             Args : NONE
760             Synopsis:
761            
762             # retrieve the Bio::Seq object's sequence
763             my $haplotype = $seq->seq;
764            
765             # need to convert all letters to their corresponding numbers.
766             foreach my $base (keys %{$rh_base_conversion_table}){
767             $haplotype =~ s/($base)/$rh_base_conversion_table->{$base}/g;
768             }
769            
770             # $haplotype is now an ms style haplotype. (e.g. '100101101455')
771            
772             =cut
773              
774             sub get_base_conversion_table {
775 135     135 0 2732 my $self = shift;
776 135         166 return $self->{BASE_CONVERSION_TABLE_HASH_REF};
777             }
778              
779             ##############################################################################
780             ## subs for internal use only
781             ##############################################################################
782              
783             sub _get_next_clean_hap {
784              
785             #By Warren Kretzschmar
786              
787             # return the next non-empty line from file handle (chomped line)
788             # skipps to the next run if '//' is encountered
789 160     160   214 my ( $self, $fh, $times, $end_run ) = @_;
790 160         179 my @data;
791              
792 160 50       272 unless ( ref($fh) eq q(GLOB) ) { return; }
  0         0  
793              
794 160 50 33     414 unless ( defined $times && $times > 0 ) {
795 0         0 $times = 1;
796             }
797              
798 160 100       222 if ( defined $self->{BUFFER_HAP} ) {
799 28         46 push @data, $self->{BUFFER_HAP};
800 28         34 $self->{BUFFER_HAP} = undef;
801 28         30 $self->{LAST_READ_HAP_NUM}++;
802 28         30 $times--;
803             }
804              
805 160         252 while ( 1 <= $times-- ) {
806              
807             # Find next clean line
808 144         448 my $data = <$fh>;
809 144 100       220 last if !defined($data);
810 134         162 chomp $data;
811 134         361 while ( $data !~ /./ ) {
812 28         26 $data = <$fh>;
813 28         58 chomp $data;
814             }
815              
816             # If the next run is encountered here, then we have a programming
817             # or format error
818 134 50       218 if ( $data eq '//' ) { $self->throw("'//' found when not expected\n") }
  0         0  
819              
820 134         150 $self->{LAST_READ_HAP_NUM}++;
821 134         264 push @data, $data;
822             }
823              
824 160 100       221 if ($end_run) {
825 35         70 $self->_load_run_info($fh);
826             }
827              
828 158         269 return (@data);
829             }
830              
831             sub _load_run_info {
832              
833 35     35   55 my ( $self, $fh ) = @_;
834              
835 35         78 my $data = <$fh>;
836              
837             # getting rid of excess newlines
838 35   100     135 while ( defined($data) && $data !~ /./ ) {
839 28         99 $data = <$fh>;
840             }
841              
842             # In this case we are at EOF
843 35 100       54 if ( !defined($data) ) { $self->{NEXT_RUN_NUM} = undef; return; }
  5         8  
  5         8  
844              
845 30         68 while ( $data !~ /./ ) {
846 0         0 $data = <$fh>;
847 0         0 chomp $data;
848             }
849              
850 30         47 chomp $data;
851              
852             # If the next run is encountered, then skip to the next hap and save it in
853             # the buffer.
854 30 100       47 if ( $data eq '//' ) {
855 28         33 $self->{NEXT_RUN_NUM}++;
856 28         33 $self->{LAST_READ_HAP_NUM} = 0;
857 28         56 for ( 1 .. 3 ) {
858 84         137 $data = <$fh>;
859 84         164 while ( $data !~ /./ ) {
860 20         27 $data = <$fh>;
861 20         32 chomp $data;
862             }
863 84         99 chomp $data;
864              
865 84 100       152 if ( $_ eq '1' ) {
    100          
866 28         117 my @sites = split( /\s+/, $data );
867 28         67 $self->{LAST_READ_SEGSITES} = $sites[1];
868             }
869             elsif ( $_ eq '2' ) {
870 28         175 my @positions = split( /\s+/, $data );
871 28         40 shift @positions;
872 28         70 $self->{LAST_READ_POSITIONS} = \@positions;
873             }
874             else {
875 28 50       41 if ( !defined($data) ) {
876 0         0 $self->throw("run $self->{NEXT_RUN_NUM} has no haps./n");
877             }
878 28         58 $self->{BUFFER_HAP} = $data;
879             }
880             }
881             }
882             else {
883 2         8 $self->throw(
884             "'//' not encountered when expected. There are more haplos in one of the msOUT runs than advertised in the msinfo line."
885             );
886             }
887              
888             }
889             1;