I want to implement my own tweet compressor. Basically, it replaces common letter sequences with single Unicode characters to save space. However, I’m stuck on some Unicode issues.
Here’s my script:
#!/usr/bin/env perl
use warnings;
use strict;
print tweet_compress('cc ms ns ps in ls fi fl ffl ffi iv ix vi oy ii xi nj/, "\. " ,", "'),"\n";
sub tweet_compress {
    my $tweet = shift;
    $tweet =~ s/\. ?$//;
    my @orig = ( qw/cc ms ns ps in ls fi fl ffl ffi iv ix vi oy ii xi nj/, ". " ,", ");
    my @new = qw/㏄ ㎳ ㎱ ㎰ ㏌ ʪ ﬁ ﬂ ﬄ ﬃ ⅳ ⅸ ⅵ ѹ ⅱ ⅺ ǌ . ,/;
    $tweet =~ s/$orig[$_]/$new[$_]/g for 0 .. $#orig;
    return $tweet;
}
But this prints junk out at the terminal:
?.?.?.?.?.?.?.f.?.f?.?.?.?.?.?.?.nj/."\..,"."
What am I doing wrong?
Two issues.
First, you have Unicode characters in your source code. Make sure you save the file as UTF-8 and enable the use utf8 pragma, so that Perl reads the literals as characters rather than raw bytes.
Also, if you intend to run this program from a console, make sure the console can handle Unicode. The Windows command prompt, at least by default, cannot and will show ? regardless of whether your data is correct. I ran this on Mac OS with Terminal set to handle UTF-8.
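To illustrate the first point, here is a minimal sketch of how the utf8 pragma changes what Perl sees in a literal. It assumes the file is saved as UTF-8; the variable names are illustrative only, and the binmode call is an addition here so that print encodes its output as UTF-8.

```perl
use strict;
use warnings;
use utf8;                              # read literals in this file as UTF-8 characters
binmode(STDOUT, ':encoding(UTF-8)');   # assumption: the terminal expects UTF-8

# With the pragma on, the literal is one character.
my $chars = length('㏄');

# The pragma is lexically scoped, so switching it off inside a block
# makes Perl treat the same literal as its three raw bytes.
my $bytes = do { no utf8; length('㏄') };

print "characters: $chars, bytes: $bytes\n";
```

Without the output layer, printing such characters usually triggers a "Wide character in print" warning.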
Second, the “.” in your @orig list is interpolated into the regular expression as is, where it means “any single character” and gives you wrong results. You need to escape it (for example with quotemeta or \Q...\E) before using it in the pattern. I’ve modified the program a little to make it work.
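The modified program is not shown here, so the following is only a sketch of how the two fixes might be combined, not necessarily the version the answer had in mind. It assumes the replacements are meant to be the single-character ligatures (ﬁ, ﬂ, ﬄ, ﬃ, ǌ). It also makes two changes beyond the answer: ffl and ffi are moved ahead of fi and fl, and the "." and "," replacements are taken out of qw to avoid a warning.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                              # the source contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');   # assumption: the terminal expects UTF-8

sub tweet_compress {
    my $tweet = shift;
    $tweet =~ s/\. ?$//;               # drop a trailing period

    # Assumption: longer sequences (ffl, ffi) are listed before fi and fl,
    # otherwise the "fi" inside "ffi" is consumed before "ffi" can match.
    my @orig = (qw/cc ms ns ps in ls ffl ffi fi fl iv ix vi oy ii xi nj/, '. ', ', ');
    my @new  = (qw/㏄ ㎳ ㎱ ㎰ ㏌ ʪ ﬄ ﬃ ﬁ ﬂ ⅳ ⅸ ⅵ ѹ ⅱ ⅺ ǌ/, '.', ',');

    for my $i (0 .. $#orig) {
        # quotemeta escapes regex metacharacters, so "." matches a literal dot.
        my $pattern = quotemeta $orig[$i];
        $tweet =~ s/$pattern/$new[$i]/g;
    }
    return $tweet;
}

print tweet_compress('office, cc. ok.'), "\n";
```

With the escaping in place, a string such as 'ab cd' passes through unchanged, whereas the unescaped ". " pattern would have replaced the "b " with a period.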