I’m writing a lambda calculus interpreter for fun and practice. I got iostreams to properly tokenize identifiers by adding a ctype facet which defines punctuation as whitespace:
struct token_ctype : ctype<char> {
    mask t[ table_size ];
    token_ctype()
        : ctype<char>( t ) {
        for ( size_t tx = 0; tx < table_size; ++ tx ) {
            t[tx] = isalnum( tx )? alnum : space;
        }
    }
};
(classic_table() would probably be cleaner but that doesn’t work on OS X!)
And then swap the facet in when I hit an identifier:
locale token_loc( in.getloc(), new token_ctype );
…
locale const &oldloc = in.imbue( token_loc );
in.unget() >> token;
in.imbue( oldloc );
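For anyone who wants to try the char version in isolation, here is a minimal self-contained sketch. It makes a few assumptions that are not in my interpreter: split_identifiers is a hypothetical helper, the input comes from an istringstream, and the facet stays imbued for the whole read instead of being swapped in per identifier.

```cpp
#include <cctype>
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Same idea as the facet above: every non-alphanumeric character is whitespace.
struct token_ctype : std::ctype<char> {
    mask t[table_size];
    token_ctype() : std::ctype<char>(t) {
        for (std::size_t i = 0; i < table_size; ++i)
            t[i] = std::isalnum(static_cast<int>(i)) ? alnum : space;
    }
};

// Hypothetical helper: returns the identifiers found in text.
std::vector<std::string> split_identifiers(const std::string& text) {
    std::istringstream in(text);
    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(in.getloc(), new token_ctype));
    std::vector<std::string> out;
    for (std::string tok; in >> tok; )
        out.push_back(tok);
    return out;
}
```

Calling split_identifiers("(\\x.foo bar)") should yield the three identifiers x, foo and bar, because the parentheses, backslash and dot are all skipped as whitespace.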
There seems to be surprisingly little lambda calculus code on the Web. Most of what I’ve found so far is full of Unicode λ characters, so I thought I’d try adding Unicode support.
But ctype<wchar_t> works completely differently from ctype<char>. There is no master table; instead there are four virtual classification methods: two overloads of do_is, plus do_scan_is and do_scan_not. So I did this:
struct token_ctype : ctype< wchar_t > {
    typedef ctype<wchar_t> base;

    // Single character: punctuation and λ also count as whitespace.
    bool do_is( mask m, char_type c ) const {
        return base::do_is(m,c)
            || ( (m&space) && ( base::do_is(punct,c) || c == L'λ' ) );
    }

    // Bulk classification: add the space bit to punctuation and λ.
    const char_type* do_is
        (const char_type* lo, const char_type* hi, mask* vec) const {
        base::do_is(lo,hi,vec);
        for ( mask *vp = vec; lo != hi; ++ vp, ++ lo ) {
            if ( (*vp & punct) || *lo == L'λ' ) *vp |= space;
        }
        return hi;
    }

    // First character matching m. The base call must be qualified;
    // an unqualified do_scan_is here would recurse forever.
    const char_type *do_scan_is
        (mask m, const char_type* lo, const char_type* hi) const {
        if ( m & space ) m |= punct;
        hi = base::do_scan_is(m,lo,hi);
        if ( m & space ) hi = find( lo, hi, L'λ' );
        return hi;
    }

    // First character not matching m. Check lo != hi before
    // dereferencing, so we never read one past the end.
    const char_type *do_scan_not
        (mask m, const char_type* lo, const char_type* hi) const {
        if ( m & space ) {
            m |= punct;
            while ( ( lo = base::do_scan_not(m,lo,hi) ) != hi && *lo == L'λ' )
                ++ lo;
            return lo;
        }
        return base::do_scan_not(m,lo,hi);
    }
};
The code is WAY less elegant. It does better express the notion that only punctuation is additional whitespace, but that would’ve been fine in the original had I had classic_table.
Is there a simpler way to do this? Do I really need all those overloads? (Testing showed do_scan_not is extraneous here, but I’m thinking more broadly.) Am I abusing facets in the first place? Is the above even correct? Would it be better style to implement less logic?
(It’s been a year with no substantive answer, and I’ve learned a lot about iostreams in the meantime…)
The custom facet exists exclusively to serve the string extraction operator in >> token. That operator is defined in terms of use_facet< ctype< wchar_t > >( in.getloc() ).is( ios::space, c ) “for the next available input character c.” (§21.3.7.9) ctype::is is simply a stub for ctype::do_is, so it would seem that do_is is sufficient.
Nevertheless, recent versions of the GCC standard library do implement operator>> in terms of scan_is. The catch is that do_scan_is is then implemented as a series of calls to do_is, virtual dispatch and all. The header file describes do_scan_is as a hook for user optimization.
So, it would seem that the as-if rule shelters an implementation that only provides the first override.
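To make that concrete, here is a minimal sketch of a facet that overrides only the single-character do_is. The names single_override_ctype and split_wide are hypothetical, and the sketch assumes the library routes operator>> through is() or through a do_scan_is that falls back to do_is, as described above; I have not checked every implementation.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Only the single-character do_is is overridden; everything else is inherited.
struct single_override_ctype : std::ctype<wchar_t> {
    typedef std::ctype<wchar_t> base;
    bool do_is(mask m, char_type c) const override {
        return base::do_is(m, c)
            || ((m & space) && (base::do_is(punct, c) || c == L'\u03bb'));
    }
};

// Hypothetical helper: extracts identifiers from a wide string.
std::vector<std::wstring> split_wide(const std::wstring& text) {
    std::wistringstream in(text);
    // The locale takes ownership of the facet.
    in.imbue(std::locale(in.getloc(), new single_override_ctype));
    std::vector<std::wstring> out;
    for (std::wstring tok; in >> tok; )
        out.push_back(tok);
    return out;
}
```

With the input L"(\u03bbx.foo bar)" this should produce x, foo and bar, with λ and the punctuation consumed as whitespace.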
Note that the second override, which retrieves mask values, is the odd one out. It could be implemented in terms of the first, by inefficiently building the mask bit by bit. In GCC it is implemented in terms of system calls, inefficiently building the mask bit by bit with 15 calls per character. This seems to sacrifice both performance and compatibility. Fortunately it seems nobody uses it.
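As an illustration of building the mask bit by bit, here is a rough sketch of the second override written purely in terms of the first. The names bitwise_ctype and classify are hypothetical, and the list of bits covers only the basic categories from ctype_base; it is meant to show the idea, not to be fast.

```cpp
#include <locale>
#include <string>
#include <vector>

struct bitwise_ctype : std::ctype<wchar_t> {
    typedef std::ctype<wchar_t> base;

    // First override: punctuation and λ also count as whitespace.
    bool do_is(mask m, char_type c) const override {
        return base::do_is(m, c)
            || ((m & space) && (base::do_is(punct, c) || c == L'\u03bb'));
    }

    // Second override: ask the first override about each bit in turn.
    const char_type* do_is(const char_type* lo, const char_type* hi,
                           mask* vec) const override {
        static const mask bits[] = { space, print, cntrl, upper, lower,
                                     alpha, digit, punct, xdigit };
        for (; lo != hi; ++lo, ++vec) {
            *vec = mask();
            for (mask b : bits)
                if (do_is(b, *lo)) *vec |= b;
        }
        return hi;
    }
};

// Hypothetical helper: returns the mask of every character in s.
std::vector<std::ctype_base::mask> classify(const std::wstring& s) {
    std::locale loc(std::locale::classic(), new bitwise_ctype);
    std::vector<std::ctype_base::mask> out(s.size());
    if (!s.empty())
        std::use_facet<std::ctype<wchar_t>>(loc)
            .is(s.data(), s.data() + s.size(), out.data());
    return out;
}
```

Every character costs one virtual call per bit here, which is exactly the inefficiency described above.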
Anyway, this is all well and good, but simply writing a tokenizer using istreambuf_iterator<wchar_t> is easier, far more extensible, and simplifies exception handling.
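A minimal sketch of what I mean, under some assumptions: tokenize is a hypothetical function, identifiers are runs of alphanumerics as classified by iswalnum, and every other non-space character (including λ) becomes a one-character token. A real interpreter would return typed tokens rather than strings.

```cpp
#include <cwctype>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical tokenizer: identifiers are alphanumeric runs, whitespace is
// skipped, and anything else (λ, parentheses, dots) is a single-char token.
std::vector<std::wstring> tokenize(std::wistream& in) {
    const wchar_t lambda = L'\u03bb';
    std::vector<std::wstring> tokens;
    std::istreambuf_iterator<wchar_t> it(in), end;
    while (it != end) {
        wchar_t c = *it;
        if (c != lambda && std::iswalnum(c)) {
            // Collect a whole identifier.
            std::wstring id;
            while (it != end && *it != lambda && std::iswalnum(*it)) {
                id += *it;
                ++it;
            }
            tokens.push_back(id);
        } else if (std::iswspace(c)) {
            ++it;  // skip whitespace
        } else {
            tokens.push_back(std::wstring(1, c));
            ++it;
        }
    }
    return tokens;
}
```

Unlike the facet approach, this keeps the punctuation and λ as tokens in their own right, so the parser never has to go back to the stream to find out what separated two identifiers.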