File Coverage

blib/lib/AI/Calibrate.pm
Criterion Covered Total %
statement 77 80 96.2
branch 11 12 91.6
condition 3 3 100.0
subroutine 12 13 92.3
pod 3 6 50.0
total 106 114 92.9


line stmt bran cond sub pod time code
1             package AI::Calibrate;
2              
3 3     3   87024 use 5.008008;
  3         13  
  3         130  
4 3     3   18 use strict;
  3         7  
  3         268  
5 3     3   18 use warnings;
  3         11  
  3         151  
6              
7 3     3   17 use vars qw($VERSION);
  3         7  
  3         751  
8             $VERSION = "1.5";
9              
10             require Exporter;
11              
12             our @ISA = qw(Exporter);
13              
14             # This allows declaration:
15             # use AI::Calibrate ':all';
16             # If you do not need this, moving things directly into @EXPORT or @EXPORT_OK
17             # will save memory.
18             our %EXPORT_TAGS = (
19             'all' => [
20             qw(
21             calibrate
22             score_prob
23             print_mapping
24             )
25             ]
26             );
27              
28             our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
29             our @EXPORT = qw( );
30              
31 3     3   19 use constant DEBUG => 0;
  3         5  
  3         261  
32              
33             # Structure slot names
34 3     3   16 use constant SCORE => 0;
  3         5  
  3         130  
35 3     3   15 use constant PROB => 1;
  3         5  
  3         4175  
36              
37             =head1 NAME
38              
39             AI::Calibrate - Perl module for producing probabilities from classifier scores
40              
41             =head1 SYNOPSIS
42              
43             use AI::Calibrate ':all';
44             ... train a classifier ...
45             ... test classifier on $points ...
46             $calibrated = calibrate($points);
47              
48             =head1 DESCRIPTION
49              
50             Classifiers usually return some sort of an instance score with their
51             classifications. These scores can be used as probabilities in various
52             calculations, but first they need to be I<calibrated>. Naive Bayes, for
53             example, is a very useful classifier, but the scores it produces are usually
54             "bunched" around 0 and 1, making these scores poor probability estimates.
55             Support vector machines have a similar problem. Both classifier types should
56             be calibrated before their scores are used as probability estimates.
57              
58             This module calibrates classifier scores using a method called the Pool
59             Adjacent Violators (PAV) algorithm. After you train a classifier, you take a
60             (usually separate) set of test instances and run them through the classifier,
61             collecting the scores assigned to each. You then supply this set of instances
62             to the calibrate function defined here, and it will return a set of ranges
63             mapping from a score range to a probability estimate.
64              
65             For example, assume you have the following set of instance results from your
66             classifier. Each result is of the form C<[ASSIGNED_SCORE, TRUE_CLASS]>:
67              
68             my $points = [
69             [.9, 1],
70             [.8, 1],
71             [.7, 0],
72             [.6, 1],
73             [.55, 1],
74             [.5, 1],
75             [.45, 0],
76             [.4, 1],
77             [.35, 1],
78             [.3, 0 ],
79             [.27, 1],
80             [.2, 0 ],
81             [.18, 0],
82             [.1, 1 ],
83             [.02, 0]
84             ];
85              
86             If you then call calibrate($points), it will return this structure:
87              
88             [
89             [.9, 1 ],
90             [.7, 3/4 ],
91             [.45, 2/3 ],
92             [.3, 1/2 ],
93             [.2, 1/3 ],
94             [.02, 0 ]
95             ]
96              
97             This means that, given a SCORE produced by the classifier, you can map the
98             SCORE onto a probability like this:
99              
100             SCORE >= .9          prob = 1
101             .9  > SCORE >= .7    prob = 3/4
102             .7  > SCORE >= .45   prob = 2/3
103             .45 > SCORE >= .3    prob = 1/2
104             .3  > SCORE >= .2    prob = 1/3
105             .2  > SCORE          prob = 0
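
The interval lookup implied by this table is a simple walk down the boundaries. As an independent illustration (a sketch in Python, not the module's own code; the module's Perl implementation is the score_prob function documented later, and the function name score_to_prob here is hypothetical):

```python
def score_to_prob(mapping, score):
    """Walk [boundary, prob] pairs, sorted by descending boundary, and
    return the probability of the first range the score falls into."""
    for boundary, prob in mapping:
        if score >= boundary:
            return prob
    return 0  # below the lowest boundary: probability estimate is zero

# The mapping produced by the example above:
mapping = [[.9, 1], [.7, 3/4], [.45, 2/3], [.3, 1/2], [.2, 1/3], [.02, 0]]
```

For instance, `score_to_prob(mapping, .5)` falls into the `.7 > SCORE >= .45` range and returns 2/3.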
106              
107             For a realistic example of classifier calibration, see the test file
108             t/AI-Calibrate-NB.t, which uses the AI::NaiveBayes1 module to train a Naive
109             Bayes classifier, then calibrates it using this module.
110              
111             =cut
112              
113             =head1 FUNCTIONS
114              
115             =over 4
116              
117             =item B<calibrate>
118              
119             This is the main calibration function. The calling form is:
120              
121             my $calibrated = calibrate( $data, $sorted);
122              
123             $data looks like: C<[ [score, class], [score, class], [score, class]...]>
124             Each score is a number. Each class is either 0 (negative class) or 1
125             (positive class).
126              
127             $sorted is a boolean (0 by default) indicating whether the data are already
128             sorted by score. Unless this is set to 1, calibrate() will sort the data
129             itself.
130              
131             Calibrate returns a reference to an ordered list of references:
132              
133             [ [score, prob], [score, prob], [score, prob] ... ]
134              
135             Scores will be in descending numerical order. See the DESCRIPTION section for
136             how this structure is interpreted. You can pass this structure to the
137             B<score_prob> function, along with a new score, to get a probability.
138              
139             =cut
140              
141             sub calibrate {
142 27     27 1 11188 my($data, $sorted) = @_;
143              
144 27         35 if (DEBUG) {
145             print "Original data:\n";
146             for my $pair (@$data) {
147             my($score, $prob) = @$pair;
148             print "($score, $prob)\n";
149             }
150             }
151              
152             # Copy the data over so PAV can clobber the PROB field
153 27         331 my $new_data = [ map([@$_], @$data) ];
154              
155             # If not already sorted, sort data decreasing by score
156 27 100       95 if (!$sorted) {
157 25         4045 $new_data = [ sort { $b->[SCORE] <=> $a->[SCORE] } @$new_data ];
  843         1257  
158             }
159              
160 27         80 PAV($new_data);
161              
162 27         30 if (DEBUG) {
163             print("After PAV, vector is:\n");
164             print_vector($new_data);
165             }
166              
167 27         33 my(@result);
168 27         29 my( $last_prob, $last_score);
169              
170 27         309 push(@$new_data, [-1e10, 0]);
171              
172 27         55 for my $pair (@$new_data) {
173 382         321 print "Seeing @$pair\n" if DEBUG;
174 382         465 my($score, $prob) = @$pair;
175 382 100 100     1429 if (defined($last_prob) and $prob < $last_prob) {
176 114         192 print("Pushing [$last_score, $last_prob]\n") if DEBUG;
177 114         271 push(@result, [$last_score, $last_prob] );
178             }
179 382         396 $last_prob = $prob;
180 382         534 $last_score = $score;
181             }
182              
183 27         197 return \@result;
184             }
185              
186              
187             sub PAV {
188 27     27 0 38 my ( $result ) = @_;
189              
190 27         112 for ( my $i = 0; $i < @$result - 1; $i++ ) {
191 329 100       1064 if ( $result->[$i][PROB] < $result->[ $i + 1 ][PROB] ) {
192 167         456 $result->[$i][PROB] =
193             ( $result->[$i][PROB] + $result->[ $i + 1 ][PROB] ) / 2;
194 167         400 $result->[ $i + 1 ][PROB] = $result->[$i][PROB];
195 167         136 print "Averaging elements $i and ", $i + 1, "\n" if DEBUG;
196              
197 167         355 for ( my $j = $i - 1; $j >= 0; $j-- ) {
198 344 100       740 if ( $result->[$j][PROB] < $result->[ $i + 1 ][PROB] ) {
199 179         264 my $d = ( $i + 1 ) - $j + 1;
200 179         291 flatten( $result, $j, $d );
201             }
202             else {
203 165         428 last;
204             }
205             }
206             }
207             }
208             }
209              
210             sub print_vector {
211 0     0 0 0 my($vec) = @_;
212 0         0 for my $pair (@$vec) {
213 0         0 print join(", ", @$pair), "\n";
214             }
215             }
216              
217              
218             sub flatten {
219 179     179 0 218 my ( $vec, $start, $len ) = @_;
220 179         188 if (DEBUG) {
221             print "Flatten called on vec, $start, $len\n";
222             print "Vector before: \n";
223             print_vector($vec);
224             }
225              
226 179         198 my $sum = 0;
227 179         344 for my $i ( $start .. $start + $len-1 ) {
228 781         1138 $sum += $vec->[$i][PROB];
229             }
230 179         278 my $avg = $sum / $len;
231 179         166 print "Sum = $sum, avg = $avg\n" if DEBUG;
232 179         280 for my $i ( $start .. $start + $len -1) {
233 781         1157 $vec->[$i][PROB] = $avg;
234             }
235 179         539 if (DEBUG) {
236             print "Vector after: \n";
237             print_vector($vec);
238             }
239             }
240              
241             =item B
242              
243             This is a simple utility function that takes the structure returned by
244             B<calibrate>, along with a new score, and returns the probability estimate.
245             Example calling form:
246              
247             $p = score_prob($calibrated, $score);
248              
249             Once you have a trained, calibrated classifier, you could imagine using it
250             like this:
251              
252             $calibrated = calibrate( $calibration_set );
253             print "Input instances, one per line:\n";
254             while (<>) {
255             chomp;
256             my(@fields) = split;
257             my $score = classifier(@fields);
258             my $prob = score_prob($calibrated, $score);
259             print "Estimated probability: $prob\n";
260             }
261              
262             =cut
263              
264             sub score_prob {
265 11     11 1 4319 my($calibrated, $score) = @_;
266              
267 11         16 my $last_prob = 1.0;
268              
269 11         20 for my $tuple (@$calibrated) {
270 32         41 my($bound, $prob) = @$tuple;
271 32 100       161 return $prob if $score >= $bound;
272 23         38 $last_prob = $prob;
273             }
274             # If we drop off the end, probability estimate is zero
275 2         5 return 0;
276             }
277              
278              
=item B<print_mapping>
280              
281             This is a simple utility function that takes the structure returned by
282             B<calibrate> and prints out a simple list of lines describing the mapping
283             created.
284              
285             Example calling form:
286              
287             print_mapping($calibrated);
288              
289             Sample output:
290              
291             1.00 > SCORE >= 1.00 prob = 1.000
292             1.00 > SCORE >= 0.71 prob = 0.667
293             0.71 > SCORE >= 0.39 prob = 0.000
294             0.39 > SCORE >= 0.00 prob = 0.000
295              
296             These ranges are not necessarily compressed/optimized, as this sample output
297             shows.
298              
299             =back
300              
301             =cut
302             sub print_mapping {
303 2     2 1 4713 my($calibrated) = @_;
304 2         6 my $last_bound = 1.0;
305 2         5 for my $tuple (@$calibrated) {
306 10         22 my($bound, $prob) = @$tuple;
307 10         147 printf("%0.3f > SCORE >= %0.3f prob = %0.3f\n",
308             $last_bound, $bound, $prob);
309 10         31 $last_bound = $bound;
310             }
311 2 50       13 if ($last_bound != 0) {
312 2         25 printf("%0.3f > SCORE >= %0.3f prob = %0.3f\n",
313             $last_bound, 0, 0);
314             }
315             }
316              
317             =head1 DETAILS
318              
319             The PAV algorithm is conceptually straightforward. Given a set of training
320             cases ordered by the scores assigned by the classifier, it first assigns a
321             probability of one to each positive instance and a probability of zero to each
322             negative instance, and puts each instance in its own group. It then looks, at
323             each iteration, for adjacent violators: adjacent groups whose probabilities
324             locally increase rather than decrease. When it finds such groups, it pools
325             them and replaces their probability estimates with the average of the group's
326             values. It continues this process of averaging and replacement until the
327             entire sequence is monotonically decreasing. The result is a sequence of
328             instances, each of which has a score and an associated probability estimate,
329             which can then be used to map scores into probability estimates.
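
The pooling step described above can be sketched compactly with a stack of blocks. This is an independent illustration in Python, not the module's Perl implementation (which uses repeated in-place averaging); the stack-based form happens to be the linear-time variant mentioned under BUGS:

```python
def pav_decreasing(labels):
    """Pool Adjacent Violators for a non-increasing fit: given 0/1 class
    labels ordered by descending classifier score, return one pooled
    probability estimate per instance."""
    # Each block is [sum_of_labels, count]; block means must not increase.
    blocks = []
    for y in labels:
        blocks.append([float(y), 1])
        # Pool while the newest block's mean exceeds its predecessor's
        # (an "adjacent violator"); pooling may cascade leftward.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    # Expand each block back to one estimate per instance.
    return [s / n for s, n in blocks for _ in range(n)]
```

Fed the class labels from the DESCRIPTION example (in descending score order), this reproduces the pooled groups 1, 1, 3/4, 2/3, 1/2, 1/3, 0.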
330              
331             For further information on the PAV algorithm, you can read the section in my
332             paper referenced below.
333              
334             =head1 EXPORT
335              
336             This module exports three functions: calibrate, score_prob and print_mapping.
337              
338             =head1 BUGS
339              
340             None known. This implementation is straightforward but inefficient (its time
341             is O(n^2) in the length of the data series). A linear time algorithm is
342             known, and in a later version of this module I'll probably implement it.
343              
344             =head1 SEE ALSO
345              
346             The AI::NaiveBayes1 perl module.
347              
348             My paper "PAV and the ROC Convex Hull" has a good discussion of the PAV
349             algorithm, including examples:
350             L
351              
352             If you want to read more about the general issue of classifier calibration,
353             here are some good papers, which are freely available on the web:
354              
355             I<"Transforming classifier scores into accurate multiclass probability estimates">
356             by Bianca Zadrozny and Charles Elkan
357              
358             I<"Predicting Good Probabilities With Supervised Learning">
359             by A. Niculescu-Mizil and R. Caruana
360              
361              
362             =head1 AUTHOR
363              
364             Tom Fawcett, E<lt>tom.fawcett@gmail.comE<gt>
365              
366             =head1 COPYRIGHT AND LICENSE
367              
368             Copyright (C) 2008-2012 by Tom Fawcett
369              
370             This library is free software; you can redistribute it and/or modify
371             it under the same terms as Perl itself, either Perl version 5.8.8 or,
372             at your option, any later version of Perl 5 you may have available.
373              
374             =cut
375             1;