File Coverage

blib/lib/Sys/Binmode.pm
Criterion   Covered  Total  %
statement   9        9      100.0
branch      n/a
condition   n/a
subroutine  4        4      100.0
pod         n/a
total       13       13     100.0


line stmt bran cond sub pod time code
1             package Sys::Binmode;
2              
3 13     13   1134091 use strict;
  13         148  
  13         377  
4 13     13   69 use warnings;
  13         26  
  13         2265  
5              
6             our $VERSION = '0.05';
7              
8             =encoding utf-8
9              
10             =head1 NAME
11              
12             Sys::Binmode - A fix for Perl’s system call character encoding
13              
14             =begin html
15              
16             Coverage Status
17              
18             =end html
19              
20             =head1 SYNOPSIS
21              
22             use Sys::Binmode;
23              
24             my $foo = "\xff";
25             $foo .= "\x{100}";
26             chop $foo;
27              
28             # Prints a single octet (0xFF) and a newline:
29             print $foo, $/;
30              
31             # In Perl 5.32 this may print the same single octet, or it may
32             # print UTF-8-encoded U+00FF. With Sys::Binmode, though, it always
33             # gives the single octet, just like print:
34             exec 'echo', $foo;
35              
36             =head1 DESCRIPTION
37              
38             tl;dr: Use this module in B<all> new code.
39              
40             =head1 BACKGROUND
41              
42             Ideally, a Perl application doesn’t need to know how the interpreter stores
43             a given string internally. Perl can thus store any Unicode code point while
44             still optimizing for size and speed when storing “bytes-compatible”
45             strings—i.e., strings whose code points all lie below 256. Perl’s
46             “optimized” string storage format is faster and less memory-hungry, but it
47             can only store code points 0-255. The “unoptimized” format, on the other
48             hand, can store any Unicode code point.
49              
50             Of course, Perl doesn’t I<always> optimize “bytes-compatible” strings;
51             Perl can also, if
52             it wants, store such strings “unoptimized” (i.e., in Perl’s internal
53             “loose UTF-8” format). For code points 0-127 (ASCII printables,
54             controls, and DEL) there’s actually no
55             difference between the two forms, but for 128-255 the formats differ. (cf.
56             L<perlunicode/The "Unicode Bug">) This means that anything that reads
57             Perl’s internals B<must> differentiate between the two forms in order to
58             use the string correctly.
59              
60             Alas, that differentiation doesn’t always happen. When it doesn’t, Perl
61             outputs code points 128-255 differently depending on whether the
62             containing string is “optimized” or not.
63              
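The two storage forms can be observed from plain Perl. The following is a minimal sketch, assuming an ASCII-based platform; it uses only the core functions utf8::upgrade and utf8::is_utf8, plus the bytes pragma solely to peek at the internal buffer size. The variable names are illustrative.

```perl
use strict;
use warnings;

# One character, U+00E9, stored in each of Perl's two internal forms.
my $downgraded = "\xe9";
my $upgraded   = "\xe9";
utf8::upgrade($upgraded);    # changes the storage form, not the string's value

# Perl code sees the same string either way.
my $equal = ( $downgraded eq $upgraded ) ? 1 : 0;

# The internal flag is what tells the two forms apart.
my $flag_downgraded = utf8::is_utf8($downgraded) ? 1 : 0;
my $flag_upgraded   = utf8::is_utf8($upgraded)   ? 1 : 0;

# Size of each internal buffer. The bytes pragma is discouraged in real code;
# it appears here only to make the internals visible.
my ( $size_downgraded, $size_upgraded ) = do {
    use bytes;
    ( length($downgraded), length($upgraded) );
};

printf "equal=%d flags=%d/%d buffer sizes=%d/%d\n",
    $equal, $flag_downgraded, $flag_upgraded,
    $size_downgraded, $size_upgraded;
```

The strings compare equal, yet their internal buffers hold 1 and 2 bytes respectively. Any code that hands the raw buffer to the OS without checking the flag therefore sends different bytes for the "same" string.
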
64             Remember, though: Perl applications I<should not care> about
65             Perl’s string storage internals like optimized/unoptimized. (This is why,
66             for example, the L<bytes>
67             pragma is discouraged.) The catch, though, is that without that knowledge,
68             B<the application can’t know what it actually says>
69             B<to the outside world!>
70              
71             Thus, applications must either monitor Perl’s string-storage internals
72             or accept unpredictable behavior, both of which are categorically bad.
73              
74             (Perl’s documentation calls the “unoptimized” format “upgraded”, while
75             it calls the “optimized” format “downgraded”. The rest of this document
76             will favor Perl’s terms.)
77              
78             =head1 HOW THIS MODULE (PARTLY) FIXES THE PROBLEM
79              
80             This module provides predictable behavior for Perl’s built-in functions by
81             downgrading all strings before giving them to the operating system. It’s
82             equivalent to (but faster than!) calling
83             C<utf8::downgrade()> (cf. L<utf8>) on all arguments of your system calls.
84              
85             Predictable behavior is B<always> a good thing; ergo, you should
86             use this module in B<all> new code.
87              
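For comparison, here is a rough sketch of doing the same downgrade by hand. It is not this module's implementation; downgraded_copies is a hypothetical helper name, and the only real API used is the core function utf8::downgrade, which dies if a string contains a code point above 255.

```perl
use strict;
use warnings;

# Hypothetical helper: returns downgraded copies of its arguments and
# dies if any argument contains a code point above 255.
sub downgraded_copies {
    my @copies = @_;
    utf8::downgrade($_) for @copies;    # dies with "Wide character ..." on failure
    return @copies;
}

my $arg = "\xff";
utf8::upgrade($arg);    # simulate a string that was upgraded along the way

my ($safe) = downgraded_copies($arg);

# The copy is stored downgraded; the caller's string is left untouched.
my $safe_is_upgraded = utf8::is_utf8($safe) ? 1 : 0;
my $arg_is_upgraded  = utf8::is_utf8($arg)  ? 1 : 0;

# system 'echo', $safe;    # would now pass the single octet 0xFF
```

Doing this before every call is easy to forget, which is the case for applying the behavior lexically instead.
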
88             =head1 CAVEAT: CHARACTER ENCODING
89              
90             If you apply this module injudiciously to existing code you may see
91             exceptions or character corruption where previously things worked fine.
92              
93             This can
94             happen if you’ve neglected to encode one or more strings before
95             sending them to the OS. Without Sys::Binmode, Perl sends upgraded
96             strings to the OS in UTF-8 encoding. In essence, it’s an implicit
97             UTF-8 auto-encode, which is kind of nice, except that it depends on
98             Perl’s internals, which are unpredictable. Sys::Binmode removes
99             that implicit UTF-8 auto-encode, which of course will break things
100             that need it.
101              
102             The fix is to apply an explicit UTF-8 encode prior to the system call
103             that throws the error. This is what we should do I<anyway>;
104             Sys::Binmode just enforces that better.
105              
106             =head2 Example: The L<utf8> Pragma
107              
108             The widely-used L<utf8> pragma particularly exemplifies this problem.
109              
110             If you have code like this:
111              
112             use utf8;
113              
114             mkdir "épée";
115              
116             … then adding this module will change your program’s behavior in ways you’ll
117             probably dislike.
118              
119             Consider the string C<épée>. Without the C<utf8> pragma (but assuming that
120             the code I<is> actually written in UTF-8) this is 6
121             characters because the two C<é>s are 2 bytes each (so 2 + 1 + 2 + 1),
122             and without the C<utf8> pragma each byte in a string constant becomes its own
123             character, even if multiple bytes make up a single UTF-8 character. Since
124             nothing upgrades that string on its way to
125             C<mkdir>, the OS will receive the intended 6 bytes and create a directory
126             with a UTF-8-encoded name.
127              
128             I<With> C<use utf8>, though, C<épée> is B<4> characters, not 6, because
129             this string is now UTF-8-decoded. Those 4 characters all lie beneath 256,
130             so the string is still bytes-compatible. Thus, if you C<print> that string
131             you’ll get 4 bytes of Latin-1, which probably B<isn’t> what you want.
132              
133             C<mkdir>, though, I<likely> still creates a directory with a 6-byte (UTF-8)
134             name. This happens when Perl itself stores C<épée> in upgraded (i.e.,
135             “unoptimized”) form. If that’s the case, that means Perl’s I<internal> buffer
136             of C<épée> is still the 6 bytes of UTF-8, even though to the Perl
137             I<application> it’s a 4-character string. Perl’s C<mkdir> doesn’t care
138             about characters, though; it just gives Perl’s internal buffer to the
139             OS’s create-directory function. So by violating its own abstraction, Perl
140             happens to achieve something that is I<often> useful.
141              
142             There are still two problems, though:
143              
144             =over
145              
146             =item * 1. Inconsistency: C<print> sends 4 bytes to the OS while
147             C<mkdir> (again, I<likely>) outputs 6.
148              
149             =item * 2. Uncertainty: C<épée> I<could> be stored downgraded rather than
150             upgraded, which would cause C<mkdir> to send 4 bytes instead.
151              
152             =back
153              
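The byte and character counts discussed above can be checked directly. This is a small sketch, assuming an ASCII-based platform; the string is written with escapes so the result does not depend on the source file's encoding, and the bytes pragma is used only to expose the internal buffer size.

```perl
use strict;
use warnings;
use Encode ();

# The 6 UTF-8 bytes of "épée", written as escapes.
my $bytes = "\xc3\xa9p\xc3\xa9e";

# Decoding yields the 4-character string that the utf8 pragma would produce.
my $chars = Encode::decode( 'UTF-8', $bytes );

my $byte_count = length $bytes;
my $char_count = length $chars;

# Force each storage form on a copy; Perl code still sees the same string.
my $down = $chars;
utf8::downgrade($down);
my $up = $chars;
utf8::upgrade($up);

# Internal buffer sizes: what a naive system call would hand to the OS.
my ( $down_size, $up_size ) = do {
    use bytes;
    ( length($down), length($up) );
};

print "bytes=$byte_count chars=$char_count ",
    "downgraded buffer=$down_size upgraded buffer=$up_size\n";
```

The downgraded copy occupies 4 bytes (Latin-1) and the upgraded copy 6 bytes (UTF-8), even though both compare equal as Perl strings.
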
154             C<print>’s outputting of 4 bytes here is actually the B<correct> behavior
155             because it doesn’t depend on whether Perl stores the string upgraded or
156             downgraded. Sys::Binmode extends that correct behavior to C<mkdir> and
157             other such Perl commands.
158              
159             Of course, in the end, we want C<mkdir> to receive 6 bytes of UTF-8, not
160             4 bytes of Latin-1. To achieve that, just do as you normally do with
161             C<print>: encode your string before you give it to the OS.
162              
163             use utf8;
164             use Encode;
165              
166             mkdir encode("UTF-8", "épée");
167              
168             This is what your code should look like, regardless of Sys::Binmode;
169             the omitted encoding step was a bug that Perl’s own abstraction-violation
170             bug I<may> have obscured for you. Sys::Binmode fixes Perl’s bug,
171             which makes you fix your own bug, too.
172              
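One reason the explicit encode is the robust fix: its result does not depend on how Perl happens to store the input. The sketch below illustrates that with the core Encode module; the variable names are illustrative and the mkdir call is left commented out.

```perl
use strict;
use warnings;
use Encode ();

# "épée" as 4 characters, written with escapes (U+00E9 is é).
my $downgraded = "\xe9p\xe9e";
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);

# Encoding operates on characters, so the storage form cannot matter.
my $from_downgraded = Encode::encode( 'UTF-8', $downgraded );
my $from_upgraded   = Encode::encode( 'UTF-8', $upgraded );

my $same   = ( $from_downgraded eq $from_upgraded ) ? 1 : 0;
my $length = length $from_upgraded;

# mkdir $from_upgraded;    # either result names the same 6-byte directory
```

Both results are the same 6-byte string, so the directory name no longer depends on Perl's internals.
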
173             =head2 Non-POSIX Operating Systems (e.g., Windows)
174              
175             In a POSIX operating system, an application’s communication with the
176             OS happens entirely through byte strings. Thus, treating all
177             OS-destined strings as byte strings is good and natural.
178              
179             In Windows, though, things are weirder. For example, Windows
180             exposes multiple APIs for creating a directory, and the one Perl uses (as of
181             5.32, anyway) only accepts code points 0-255. In this context Sys::Binmode
182             doesn’t I<break> anything, but it does reinforce one of Perl’s unfortunate
183             limitations on Windows.
184              
185             Sys::Binmode is a good idea anywhere that Perl sends byte strings to the OS.
186             For now, as far as I know, that’s everywhere that Perl runs. If that’s not
187             true, please file a bug.
188              
189             =head1 WHERE ELSE THIS PROBLEM CAN APPEAR
190              
191             The unpredictable-behavior problem that this module fixes in core Perl is
192             also common in CPAN’s XS modules due to rampant
193             use of L<SvPV|perlapi/SvPV> and
194             variants. SvPV is basically Perl’s L<bytes> pragma in C: it gives
195             you the string’s
196             internal bytes with no regard for what those bytes represent. This, of course,
197             is problematic for the same reason why the L<bytes> pragma is. XS authors
198             I<generally> should prefer
199             L<SvPVbyte|perlapi/SvPVbyte>
200             or L<SvPVutf8|perlapi/SvPVutf8> in lieu of
201             SvPV unless the C code in question handles Perl’s encoding abstraction.
202              
203             Note in particular that, as of Perl 5.32, the default XS typemap converts
204             scalars to C C<char *> and C<const char *> via an SvPV variant. This means
205             that any module that uses that conversion logic also has this problem.
206             So XS authors should also avoid the default typemap for such conversions.
207             (Again, though, use of the default typemap in this context is regrettably
208             commonplace.)
209              
210             Before Perl 5.18 this problem also affected %ENV. 5.18 introduced
211             an auto-downgrade when setting %ENV similar to what this module does.
212              
213             =head1 LEXICAL SCOPING
214              
215             If, for some reason, you I<want> Perl’s unpredictable default behavior,
216             you can disable this module for a given block via
217             C<no Sys::Binmode>, thus:
218              
219             use Sys::Binmode;
220              
221             system 'echo', $foo; # predictable/sane/happy
222              
223             {
224              
225             # You should probably explain here why you’re doing this.
226             no Sys::Binmode;
227              
228             system 'echo', $foo; # nasal demons
229             }
230              
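Lexical scoping of this kind is generally built on Perl's compile-time hint hash, %^H. The following pure-Perl sketch shows the general technique only; Demo::Hint and its key are hypothetical names, and nothing here is taken from this module's XS code. The BEGIN blocks stand in for the use and no statements.

```perl
use strict;
use warnings;

package Demo::Hint;    # hypothetical pragma-like package, for illustration only

# A constant sub is defined at compile time, so import() can use it
# even when called from a BEGIN block.
sub _KEY () { 'Demo::Hint/enabled' }

sub import   { $^H{ _KEY() } = 1;    return }
sub unimport { delete $^H{ _KEY() }; return }

# True if the code that called this function was compiled with the hint on.
sub enabled {
    my $hints = ( caller(0) )[10];
    return ( $hints && $hints->{ _KEY() } ) ? 1 : 0;
}

package main;

my @seen;
{
    BEGIN { Demo::Hint->import }    # what "use Demo::Hint" would do
    push @seen, Demo::Hint::enabled();
    {
        BEGIN { Demo::Hint->unimport }    # what "no Demo::Hint" would do
        push @seen, Demo::Hint::enabled();
    }
    push @seen, Demo::Hint::enabled();
}
push @seen, Demo::Hint::enabled();

print "@seen\n";    # prints "1 0 1 0"
```

The hint is restored automatically when the inner block ends, which is why disabling the behavior inside a block does not leak into the surrounding code.
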
231             =head1 AFFECTED BUILT-INS
232              
233             =over
234              
235             =item * C, C, and C
236              
237             =item * C and C
238              
239             =item * File tests (e.g., C<-e>) and the following:
240             C, C, C, C, C,
241             C, C, C, C, C, C, C,
242             C, C, C, C, C,
243             C, C
244              
245             =item * C, C, C, and C (last argument)
246              
247             =item * C
248              
249             =back
250              
251             =head2 Omissions
252              
253             =over
254              
255             =item * C already does as Sys::Binmode would make it do.
256              
257             =item * C
258             but since it’s a performance-sensitive call where upgraded strings are
259             unlikely, this library doesn’t wrap it.
260              
261             =back
262              
263             =head1 KNOWN ISSUES
264              
265             L<autodie> creates functions named after the built-ins they wrap in the
266             namespace of the module that C<use>s it. Those functions lack
267             the compiler “hint” that tells Sys::Binmode to do its work; thus,
268             they don’t get Sys::Binmode’s behavior.
269             C<CORE::*> functions will still have Sys::Binmode, but of course they won’t
270             throw exceptions.
271              
272             =head1 TODO
273              
274             =over
275              
276             =item * C and the System V IPC functions aren’t covered here.
277             If you’d like them, ask.
278              
279             =item * There’s room for optimization, if that’s gainful.
280              
281             =item * Ideally this behavior should be in Perl’s core distribution.
282              
283             =item * Even more ideally, Perl should adopt this behavior as I<default>.
284             Maybe someday!
285              
286             =back
287              
288             =cut
289              
290             #----------------------------------------------------------------------
291              
292             require XSLoader;
293             XSLoader::load(__PACKAGE__, $VERSION);
294              
295             sub import {
296 18     18   390 $^H{ _HINT_KEY() } = 1;
297              
298 18         16122 return;
299             }
300              
301             sub unimport {
302 1     1   1560 delete $^H{ _HINT_KEY() };
303             }
304              
305             #----------------------------------------------------------------------
306              
307             =head1 ACKNOWLEDGEMENTS
308              
309             Thanks to Leon Timmermans (LEONT) and Paul Evans (PEVANS) for some
310             debugging and design help.
311              
312             =head1 LICENSE & COPYRIGHT
313              
314             Copyright 2021 Gasper Software Consulting. All rights reserved.
315              
316             This library is licensed under the same license as Perl.
317              
318             =cut
319              
320             1;