I’m trying to create a method that provides best effort parsing of decimal inputs

Asked: May 12, 2026 (2026-05-12T14:32:02+00:00)

I’m trying to create a method that provides “best effort” parsing of decimal inputs in cases where I do not know which of these two mutually exclusive ways of writing numbers the end-user is using:

“.” as thousands separator and “,” as decimal separator
“,” as thousands separator and “.” as decimal separator

The method is implemented as parse_decimal(…) in the code below. I’ve also defined 20 test cases that show how the method’s heuristics should work.

While the code below passes the tests, it is quite horrible and unreadable. I’m sure there is a more compact and readable way to implement the method, possibly with smarter use of regexes.

My question is simply: Given the code below and the test-cases, how would you improve parse_decimal(…) to make it more compact and readable while still passing the tests?

Clarifications:

Clarification #1: As pointed out in the comments, the case ^\d{1,3}[\.,]\d{3}$ is ambiguous: one cannot logically determine which character is the thousands separator and which is the decimal separator. In ambiguous cases we’ll simply assume that US-style decimals are used: “,” as thousands separator and “.” as decimal separator.
Clarification #2: If you believe that any of the test cases is wrong, please state which tests should be changed and how.
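
To make the ambiguous pattern from Clarification #1 concrete, here is a minimal sketch that only flags such inputs. The name is_ambiguous is a hypothetical helper for illustration; it is not part of the code in question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original code): true when the input
# is one to three digits, a single separator, and exactly three digits, so
# the separator could be read as either a thousands or a decimal separator.
sub is_ambiguous {
    my $input = shift;
    return $input =~ /^\d{1,3}[.,]\d{3}$/ ? 1 : 0;
}

for my $input ( "12,345", "12.345", "12.34", "1234.567" ) {
    printf "%-9s %s\n", $input,
        is_ambiguous( $input ) ? "ambiguous" : "unambiguous";
}
```

With this pattern, "12,345" and "12.345" are flagged as ambiguous, while "12.34" (two trailing digits) and "1234.567" (four ungrouped leading digits) are not.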

The code in question including the test cases:

#!/usr/bin/perl -wT

use strict;
use warnings;
use Test::More tests => 20;

ok(&parse_decimal("1,234,567") == 1234567);
ok(&parse_decimal("1,234567") == 1.234567);
ok(&parse_decimal("1.234.567") == 1234567);
ok(&parse_decimal("1.234567") == 1.234567);
ok(&parse_decimal("12,345") == 12345);
ok(&parse_decimal("12,345,678") == 12345678);
ok(&parse_decimal("12,345.67") == 12345.67);
ok(&parse_decimal("12,34567") == 12.34567);
ok(&parse_decimal("12.34") == 12.34);
ok(&parse_decimal("12.345") == 12345);
ok(&parse_decimal("12.345,67") == 12345.67);
ok(&parse_decimal("12.345.678") == 12345678);
ok(&parse_decimal("12.34567") == 12.34567);
ok(&parse_decimal("123,4567") == 123.4567);
ok(&parse_decimal("123.4567") == 123.4567);
ok(&parse_decimal("1234,567") == 1234.567);
ok(&parse_decimal("1234.567") == 1234.567);
ok(&parse_decimal("12345") == 12345);
ok(&parse_decimal("12345,67") == 12345.67);
ok(&parse_decimal("1234567") == 1234567);

sub parse_decimal($) {
    my $input = shift;
    $input =~ s/[^\d,\.]//g;
    if ($input !~ /[,\.]/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d+\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d+,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d+\.\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d+,\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d{4},\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d{4}\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d{3}$/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d{3}$/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } else {
        return &parse_with_separators($input, '.', ',');
    }
}

sub parse_with_separators($$$) {
    my $input = shift;
    my $decimal_separator = shift;
    my $thousand_separator = shift;
    my $output = $input;
    $output =~ s/\Q${thousand_separator}\E//g;
    $output =~ s/\Q${decimal_separator}\E/./g;
    return $output;
}

1 Answer

Editorial Team · Answer 1 · 2026-05-12T14:32:02+00:00

The idea in these problems is to look at the code and figure out where you typed anything twice. When you see that, work to remove it. My program handles everything in your test data, and I don’t have to repeat program logic structures to do it. That lets me focus on the data rather than program flow.

First, let’s clean up your tests. You really have a set of pairs that you want to test, so let’s put them into a data structure. You can add or remove items from the data structure as you like, and the tests will automatically adjust:

use Test::More 'no_plan';

my @pairs = (
     #  input        expected
    [ "1,234,567",  1234567  ],
    [ "1,234567",   1.234567 ],
    [ "1.234.567",  1234567  ],
    [ "1.234567",   1.234567 ],
    [ "12,345",     12345    ],
    [ "12,345,678", 12345678 ],
    [ "12,345.67",  12345.67 ],
    [ "12,34567",   12.34567 ],
    [ "12.34",      12.34    ],
    [ "12.345",     12345    ],  # odd case!
    [ "12.345,67",  12345.67 ],
    [ "12.345.678", 12345678 ],
    [ "12.34567",   12.34567 ],
    [ "123,4567",   123.4567 ],
    [ "123.4567",   123.4567 ],
    [ "1234,567",   1234.567 ],
    [ "1234.567",   1234.567 ],
    [ "12345",      12345    ],
    [ "12345,67",   12345.67 ],
    [ "1234567",    1234567  ],
);

Now that you have it in a data structure, your long line of tests reduces to a short foreach loop:

foreach my $pair ( @pairs ) {
     my( $original, $expected ) = @$pair;
     my $got = parse_number( $original );
     is( $got, $expected, "$original translates to $expected" );
     }

The parse_number routine likewise condenses into this simple code. Your trick is to find out what you are doing over and over again in the source and not do that. Instead of trying to figure out weird calling conventions and long chains of conditionals, I normalize the data. I figure out which cases are odd, then turn them into not-odd cases. In this code, I condense all of the knowledge about the separators into a handful of regexes and return one of two possible lists to show me what the thousands separator and decimal separator are. Once I have that, I remove the thousands separator completely and make the decimal separator the full stop. As I find more cases, I merely add a regex that returns true for that case:

sub parse_number
    {
    my $string = shift;

    my( $separator, $decimal ) = do {
        local $_ = $string;
        if( 
            /\.\d\d\d\./           || # two dots
            /\.\d\d\d,/            || # dot before comma
            /,\d{4,}/              || # comma with many following digits
            /\d{4,},/              || # comma with many leading digits
            /^\d{1,3}\.\d\d\d\z/   || # odd case of 123.456
            0
            )
            { qw( . , ) }
        else { qw( , . ) }      
        };

    $string =~ s/\Q$separator//g;
    $string =~ s/\Q$decimal/./;

    $string;
    }

This is the sort of thing I talk about in the dynamic subroutines chapter of Mastering Perl. Although I won’t go into it here, I would probably turn that series of regexes into a pipeline of some sort and use a grep.
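
As a rough sketch of what that grep-based variant might look like (this is one possible reading of the idea, not code from the book): the regexes move into an array, and grep counts how many of them match. The names @european_patterns and parse_number_grep are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical table: patterns suggesting "." groups thousands and ","
# marks the decimals. These are the same regexes used in parse_number.
my @european_patterns = (
    qr/\.\d\d\d\./,            # two dots
    qr/\.\d\d\d,/,             # dot before comma
    qr/,\d{4,}/,               # comma with many following digits
    qr/\d{4,},/,               # comma with many leading digits
    qr/^\d{1,3}\.\d\d\d\z/,    # odd case of 123.456
);

sub parse_number_grep {
    my $string = shift;

    # grep in scalar context returns the number of patterns that matched
    my $matches = grep { $string =~ $_ } @european_patterns;

    my( $separator, $decimal ) = $matches ? ( '.', ',' ) : ( ',', '.' );

    $string =~ s/\Q$separator//g;   # drop every thousands separator
    $string =~ s/\Q$decimal/./;     # make the decimal separator a full stop

    return $string;
}

print parse_number_grep( "12.345,67" ), "\n";
```

Adding a new case then means adding one line to the array, and the subroutine itself stays untouched.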

This is just the part of the program that passes your tests. I’d add another step to verify that the number is an expected format to deal with dirty data, but that’s not so hard and is just a simple matter of programming.
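
One way such a verification step could look, as a sketch only: check the normalized string after the separators have been rewritten, and reject anything that is not plain digits with at most one full stop. The name checked_number is hypothetical, and what to do with rejected input (undef here) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical post-check: takes the string produced by the parsing step
# and returns it unchanged if it is digits with an optional fractional
# part; returns undef for anything else.
sub checked_number {
    my $normalized = shift;
    return undef unless defined $normalized;
    return $normalized =~ /\A\d+(?:\.\d+)?\z/ ? $normalized : undef;
}

for my $candidate ( "12345.67", "1.234567,89", "12a45", "1234567" ) {
    my $result = checked_number( $candidate );
    printf "%-12s %s\n", $candidate, defined $result ? "ok" : "rejected";
}
```

A string such as "1.234567,89", which is what a mixed-up input can turn into after separator rewriting, is rejected because a comma is still present.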
