I’m trying to create a method that provides “best effort” parsing of decimal inputs in cases where I do not know which of these two mutually exclusive ways of writing numbers the end-user is using:
- “.” as thousands separator and “,” as decimal separator
- “,” as thousands separator and “.” as decimal separator
The method is implemented as parse_decimal(..) in the code below. Furthermore, I’ve defined 20 test cases that show how the heuristics of the method should work.
While the code below passes the tests, it is quite horrible and unreadable. I’m sure there is a more compact and readable way to implement the method, possibly including smarter use of regexes.
My question is simply: Given the code below and the test-cases, how would you improve parse_decimal(…) to make it more compact and readable while still passing the tests?
Clarifications:
- Clarification #1: As pointed out in the comments, the case ^\d{1,3}[\.,]\d{3}$ is ambiguous in that one cannot determine logically which character is used as thousands separator and which is used as a decimal separator. In ambiguous cases we’ll simply assume that US-style decimals are used: “,” as thousands separator and “.” as decimal separator.
- Clarification #2: If you believe that any of the test cases is wrong, then please state which of the tests should be changed and how.
The code in question including the test cases:
#!/usr/bin/perl -wT
use strict;
use warnings;
use Test::More tests => 20;
ok(&parse_decimal("1,234,567") == 1234567);
ok(&parse_decimal("1,234567") == 1.234567);
ok(&parse_decimal("1.234.567") == 1234567);
ok(&parse_decimal("1.234567") == 1.234567);
ok(&parse_decimal("12,345") == 12345);
ok(&parse_decimal("12,345,678") == 12345678);
ok(&parse_decimal("12,345.67") == 12345.67);
ok(&parse_decimal("12,34567") == 12.34567);
ok(&parse_decimal("12.34") == 12.34);
ok(&parse_decimal("12.345") == 12345);
ok(&parse_decimal("12.345,67") == 12345.67);
ok(&parse_decimal("12.345.678") == 12345678);
ok(&parse_decimal("12.34567") == 12.34567);
ok(&parse_decimal("123,4567") == 123.4567);
ok(&parse_decimal("123.4567") == 123.4567);
ok(&parse_decimal("1234,567") == 1234.567);
ok(&parse_decimal("1234.567") == 1234.567);
ok(&parse_decimal("12345") == 12345);
ok(&parse_decimal("12345,67") == 12345.67);
ok(&parse_decimal("1234567") == 1234567);
sub parse_decimal($) {
    my $input = shift;
    $input =~ s/[^\d,\.]//g;
    if ($input !~ /[,\.]/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d+\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d+,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d+\.\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d+,\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d{4},\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d{4}\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d{3}$/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d{3}$/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } else {
        return &parse_with_separators($input, '.', ',');
    }
}
sub parse_with_separators($$$) {
    my $input = shift;
    my $decimal_separator = shift;
    my $thousand_separator = shift;
    my $output = $input;
    $output =~ s/\Q${thousand_separator}\E//g;
    $output =~ s/\Q${decimal_separator}\E/./g;
    return $output;
}
The idea in these problems is to look at the code and figure out where you typed anything twice. When you see that, work to remove it. My program handles everything in your test data, and I don’t have to repeat program logic structures to do it. That lets me focus on the data rather than program flow.
First, let’s clean up your tests. You really have a set of pairs that you want to test, so let’s put them into a data structure. You can add or remove items from the data structure as you like, and the tests will automatically adjust:
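The listing for this step is not reproduced here, so what follows is only a sketch of what such a data structure could look like: an array of [input, expected] pairs built from the 20 ok(...) lines in the question. The variable name @tests is an assumption made for illustration.

```perl
use strict;
use warnings;

# Each entry is [ input string => expected numeric value ], copied from the
# 20 test lines in the question. The name @tests is illustrative only.
my @tests = (
    [ '1,234,567'  => 1234567  ],
    [ '1,234567'   => 1.234567 ],
    [ '1.234.567'  => 1234567  ],
    [ '1.234567'   => 1.234567 ],
    [ '12,345'     => 12345    ],
    [ '12,345,678' => 12345678 ],
    [ '12,345.67'  => 12345.67 ],
    [ '12,34567'   => 12.34567 ],
    [ '12.34'      => 12.34    ],
    [ '12.345'     => 12345    ],
    [ '12.345,67'  => 12345.67 ],
    [ '12.345.678' => 12345678 ],
    [ '12.34567'   => 12.34567 ],
    [ '123,4567'   => 123.4567 ],
    [ '123.4567'   => 123.4567 ],
    [ '1234,567'   => 1234.567 ],
    [ '1234.567'   => 1234.567 ],
    [ '12345'      => 12345    ],
    [ '12345,67'   => 12345.67 ],
    [ '1234567'    => 1234567  ],
);

# The test count can now be derived from the data instead of hard-coded.
printf "%d test pairs\n", scalar @tests;
```

Adding or removing a case is then a one-line change to the list, and a test plan can use scalar @tests rather than a literal 20.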
Now that you have it in a data structure, your long line of tests reduces to a short foreach loop:
The parse_number routine likewise condenses into this simple code. Your trick is to find out what you are doing over and over again in the source and not do that. Instead of trying to figure out weird calling conventions and long chains of conditionals, I normalize the data. I figure out which cases are odd, then turn them into not-odd cases. In this code, I condense all of the knowledge about the separators into a handful of regexes and return one of two possible lists to show me what the thousands separator and decimal separator are. Once I have that, I remove the thousands separator completely and make the decimal separator the full stop. As I find more cases, I merely add a regex that returns true for that case:
This is the sort of thing I talk about in the dynamic subroutines chapter of Mastering Perl. Although I won’t go into it here, I would probably turn that series of regexes into a pipeline of some sort and use a grep.
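The approach described above (a handful of regexes that pick one of two separator lists, followed by normalization and a foreach loop over the test pairs) might be sketched as follows. This is a reconstruction under assumptions, not the original listing: the helper name get_separators is hypothetical, the routine is called parse_decimal to match the question's tests, and the rule order was chosen so that the question's 20 cases pass. Only a few of the pairs are repeated here to keep the sketch short.

```perl
use strict;
use warnings;
use Test::More;

# Return (thousands separator, decimal separator) for a cleaned-up string.
# One regex per family of cases; the first one that matches decides.
sub get_separators {
    my ($s) = @_;
    my @us = ( ',', '.' );    # as in 12,345.67
    my @eu = ( '.', ',' );    # as in 12.345,67

    return @eu if $s =~ /\..*,/;                 # dot before comma
    return @us if $s =~ /,.*\./;                 # comma before dot
    return @eu if $s =~ /\..*\./;                # repeated dot: grouping
    return @us if $s =~ /,.*,/;                  # repeated comma: grouping
    return @eu if $s =~ /\A\d{1,3}\.\d{3}\z/;    # lone dot shaped like grouping
    return @us if $s =~ /\A\d{1,3},\d{3}\z/;     # lone comma shaped like grouping
    return @eu if $s =~ /,/;                     # any other lone comma: decimal
    return @us;                                  # everything else: US style
}

sub parse_decimal {
    my ($input) = @_;
    $input =~ s/[^\d,.]//g;                      # keep digits and separators only
    my ( $thousands, $decimal ) = get_separators($input);
    $input =~ s/\Q$thousands\E//g;               # remove the thousands separator
    $input =~ s/\Q$decimal\E/./;                 # decimal separator becomes '.'
    return $input;
}

# A subset of the question's pairs; the full list is looped over the same way.
my @tests = (
    [ '1,234,567' => 1234567  ],
    [ '1,234567'  => 1.234567 ],
    [ '12.345'    => 12345    ],
    [ '12.345,67' => 12345.67 ],
    [ '12,345.67' => 12345.67 ],
    [ '1234,567'  => 1234.567 ],
    [ '12.34'     => 12.34    ],
    [ '1234567'   => 1234567  ],
);

for my $pair (@tests) {
    my ( $input, $expected ) = @$pair;
    cmp_ok( parse_decimal($input), '==', $expected, "parse_decimal('$input')" );
}

done_testing();
```

A new case is handled by adding one more return line to get_separators; parse_decimal and the loop stay untouched.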
This is just the part of the program that passes your tests. I’d add another step to verify that the number is an expected format to deal with dirty data, but that’s not so hard and is just a simple matter of programming.
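As a rough illustration of such a verification step, here is a minimal sketch that rejects strings which are not runs of digits joined by single "." or "," characters. What counts as an "expected format" is an assumption here, and the name looks_like_number_string is hypothetical; real input (currency symbols, spaces, signs) would need a looser or different rule.

```perl
use strict;
use warnings;

# Hypothetical pre-check: accept only digit runs joined by single '.' or ','
# characters, with a digit at both ends. Returns 1 or 0.
sub looks_like_number_string {
    my ($s) = @_;
    return $s =~ /\A\d+(?:[.,]\d+)*\z/ ? 1 : 0;
}

# Try a few candidates, including some dirty ones.
for my $candidate ( '12,345.67', '12,,345', ',5', 'abc' ) {
    printf "%-10s %s\n", $candidate,
        looks_like_number_string($candidate) ? 'accepted' : 'rejected';
}
```

A check like this would run on the input before the separators are guessed, so that dirty data is reported rather than silently turned into a number.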