Editorial Team
Asked: May 15, 2026

I have a Python script that we’re using to parse CSV files with user-entered

I have a Python script that we’re using to parse CSV files containing user-entered phone numbers, so there are quite a few odd formats and errors. We need to parse these numbers into their separate components, as well as fix some common entry errors.

Our phone numbers are for Sydney or Melbourne (Australia), or Auckland (New Zealand), given in international format.

Our standard Sydney number looks like:

+61(2)8328-1972

We have the international prefix +61, followed by a single digit area code in brackets, 2, followed by the two halves of the local component, separated by a hyphen, 8328-1972.

Melbourne numbers simply have 3 instead of 2 in the area code, e.g.

+61(3)8328-1972

The Auckland numbers are similar, but they have a 7-digit local component (3 then 4 numbers), instead of the normal 8 digits.

+64(9)842-1000

We also have matches for a number of common errors. I’ve separated the regex expressions into their own class.

import re

class PhoneNumberFormats:
    """Provides compiled regex objects for different phone number formats. We put these in their own class for performance reasons - there's no point recompiling the same pattern for each Employee"""
    standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})')
    extra_zero = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})')
    missing_hyphen = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4})(?P<local_second_half>\d{4})')
    space_instead_of_hyphen = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4}) (?P<local_second_half>\d{4})')

We have one for standard_format numbers, then others for various common error cases, e.g. putting an extra zero before the area code (02 instead of 2), or missing hyphens in the local component (e.g. 83281972 instead of 8328-1972), etc.

We then call these from cascaded if/elifs:

def clean_phone_number(self):
    """Perform some rudimentary checks and corrections, to make sure numbers are in the right format.
    Numbers should be in the form 0XYYYYYYYY, where X is the area code, and Y is the local number."""
    if not self.telephoneNumber:
        self.PHFull = ''
        self.PHFull_message = 'Missing phone number.'
    else:
        if PhoneNumberFormats.standard_format.search(self.telephoneNumber):
            result = PhoneNumberFormats.standard_format.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = ''
        elif PhoneNumberFormats.extra_zero.search(self.telephoneNumber):
            result = PhoneNumberFormats.extra_zero.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = 'Extra zero in area code - ask user to remediate.'
        elif PhoneNumberFormats.missing_hyphen.search(self.telephoneNumber):
            result = PhoneNumberFormats.missing_hyphen.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = 'Missing hyphen in local component - ask user to remediate.'
        elif PhoneNumberFormats.space_instead_of_hyphen.search(self.telephoneNumber):
            result = PhoneNumberFormats.space_instead_of_hyphen.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = 'Space instead of hyphen in local component - ask user to remediate.'
        else:
            self.PHFull = ''
            self.PHFull_message = 'Number didn\'t match recognised format. Original text is: ' + self.telephoneNumber

My aim is to make the matching as tight as possible, yet still at least catch the common errors.

There are a number of problems with what I’ve done above though:

  1. I’m using \d{3,4} to match the first half of the local component. Ideally, however, we only want to accept a 3-digit first half if it’s a New Zealand number (i.e. starts with +64(9)). That way, we can flag Sydney/Melbourne numbers that are missing a digit. I could separate out auckland_number into its own regex pattern in PhoneNumberFormats; however, that means it wouldn’t catch a New Zealand number combined with the error cases (extra_zero, missing_hyphen, space_instead_of_hyphen). So unless I recreate versions of them just for Auckland, like auckland_extra_zero, which seems pointlessly repetitive, I can’t see how to address this easily.
  2. We don’t pick up combinations of errors, e.g. if they have an extra zero and a missing hyphen, we won’t pick this up. Is there an easy way to do this using regex, without explicitly creating permutations of the different errors?

I’d like to address the above two issues, and hopefully tighten it up a bit to catch anything that I’ve missed. Is there a smarter way to do what I’ve attempted to do above?

Cheers,
Victor

Additional Comments:

The following is just to provide some context:

This script is for a global company, with one office in Sydney, one in Melbourne and one in Auckland.

The numbers come from an internal Active Directory listing of employees (i.e. it’s not a customer listing, but our own office phones).

Hence, we’re not looking for a general Australian phone number matching script; rather, we’re looking for a general script to parse numbers from three specific offices. Generally, it’s only the last 4 digits that should differ.

Mobile phones aren’t required.

The script is designed to parse a CSV dump of the Active Directory, and reformat the numbers into an acceptable format for another program (QuickComm)

This program is from an external vendor, and requires numbers in the exact format that I’ve produced in the code above. That’s why the numbers are spat out like 0283433422.

The script I’ve written can’t change the records, it only works on a CSV dump of them – the records are stored in Active Directory, and the only way to access them to get them fixed is to email the employee and ask them to login and change their own records.

So this script is run by a PA, to produce the output required by this program. She/he will also get a list of people who have incorrectly formatted numbers, hence the messages about asking the user to remediate. In theory, there should only be a small number of these. We then email/ring these employees, asking them to fix their records. The script is run once a month (numbers may change), and we also need to flag new employees who manage to enter their records incorrectly.

@John Macklin: Are you recommending I scrap regexes, and just try to pull specific-position digits out of the string?

I was looking for a way to catch the common error cases, in combinations (e.g. space instead of hyphen, combined with an extra zero), but is this not easily feasible?

1 Answer

  1. Editorial Team
    Added an answer on May 15, 2026 at 1:54 pm

    Don’t use complicated regexes. Delete EVERYTHING except digits; non-digits are error-prone cruft. If the third digit is 0, delete it.
    Expect 61 followed by a valid AUS area code ([23478] for generality; NB 4 is for mobiles) and then 8 digits,
    or 64 followed by a valid NZL area code (whatever that is) and then 7 digits. Anything else is bad. In the good stuff, insert the +()- at the appropriate places.
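The digits-only approach described above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `normalise`, the `LOCAL_LENGTH` table, and the message strings are assumptions made for illustration, and the table only covers the three offices mentioned in the question.

```python
import re

# Assumed rules, taken from the question: Sydney (2) and Melbourne (3) have
# 8-digit local numbers, Auckland (9) has 7. Extend the table as needed.
LOCAL_LENGTH = {("61", "2"): 8, ("61", "3"): 8, ("64", "9"): 7}


def normalise(raw):
    """Return (national_number, message); national_number is '' on failure."""
    digits = re.sub(r"\D", "", raw)       # drop everything except digits
    message = ""
    if len(digits) > 2 and digits[2] == "0":
        digits = digits[:2] + digits[3:]  # drop the stray zero before the area code
        message = "Extra zero in area code."
    country, area, local = digits[:2], digits[2:3], digits[3:]
    # Unknown country/area pairs give None here, which never equals a length.
    if LOCAL_LENGTH.get((country, area)) != len(local):
        return "", "Unrecognised number: " + raw
    return "0" + area + local, message
```

Because punctuation is discarded before any checking, combinations such as an extra zero plus a space instead of a hyphen need no separate pattern, and the per-country length check flags a Sydney number that is one digit short. Note that trailing letters are also discarded, so this sketch is more forgiving than a strict pattern match.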

    By the way (1) area code 2 is for the whole of NSW+ACT, not just Sydney, 3 is for VIC+TAS (2) lots of people these days don’t have landlines, just mobiles, and people tend to retain the same mobile phone number longer than they maintain the same landline phone number or the same postal address, so mobile phone number is great for fuzzy matching customer records — so I’m more than a little curious why you don’t include them.

    The following tell you all you ever wanted to know, plus a whole lot more, about the Australian and New Zealand phone numbering schemes.

    Comment on the regexes:

    (1) You are using the search method with a “^” prefix. Using the match method with no prefix is somewhat less inelegant.

    (2) You don’t seem to be checking for trailing rubbish in your phone number field:

    >>> import re
    >>> standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})')
    >>> m = standard_format.search("+61(3)1234-567890whoopsie")
    >>> m.groups()
    ('61', '3', '1234', '5678')
    

    You may like to (a) end some of your regexes with \Z (NOT $) so that they don’t match OK when there is trailing rubbish or (b) introduce another group to catch trailing rubbish.
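To illustrate points (1) and (2), here is a small sketch comparing an unanchored pattern, one ending in `$`, and one ending in `\Z`, all used with `match` and no `^` prefix. The variable names (`BODY`, `unanchored`, `dollar`, `strict`) are made up for this example; the pattern body is the question's standard_format.

```python
import re

BODY = (r"\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)"
        r"(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})")

unanchored = re.compile(BODY)        # no end anchor at all
dollar = re.compile(BODY + r"$")     # $ also matches just before a final newline
strict = re.compile(BODY + r"\Z")    # \Z matches only at the very end

rubbish = "+61(3)1234-567890whoopsie"
newline = "+61(3)1234-5678\n"

print(bool(unanchored.match(rubbish)))        # True: trailing rubbish is ignored
print(bool(dollar.match(rubbish)))            # False
print(bool(dollar.match(newline)))            # True: the newline slips through
print(bool(strict.match(newline)))            # False
print(bool(strict.match("+61(3)1234-5678")))  # True
```

The difference between `$` and `\Z` only shows up when the field ends with a newline, which can happen with data read line by line from a file.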

    and a social engineering comment: Have you yet tested the user reaction to a staff member carrying out this directive: “Space instead of hyphen in local component – ask user to remediate”? Can’t the script just fix it and carry on?

    and some comments on the code:

    the self.PHFull code

    (a) is terribly repetitive (if you must have regexes put them in a list with corresponding action codes and error messages and iterate over the list)

    (b) is the same for “error” cases as for standard cases (so why are you asking the users to “remediate”???)

    (c) throws away the country code and substitutes a 0, i.e. your standard +61(2)1234-5678 is being kept as 0212345678. Aarrgghhh! Even if you have the country stored with the address, that’s no good if an NZer migrates to Aus and the address gets updated but not the phone number. And please don’t say that you are relying on the current non-overlap of area codes (no NZ customers outside the Auckland area???).
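Point (a) could look something like the sketch below: the patterns live in a list alongside their messages, and one loop replaces the if/elif cascade. It is written as a standalone function rather than the question's method, and the helper `pattern` and the list name `FORMATS` are assumptions made for illustration. Each pattern ends in `\Z` and is used with `match`, as suggested earlier.

```python
import re


def pattern(zero, sep):
    """Build one format regex; `zero` and `sep` are the parts that vary."""
    return re.compile(
        r"\+(?P<intl_prefix>\d{2})\(" + zero + r"(?P<area_code>\d)\)"
        + r"(?P<local_first_half>\d{3,4})" + sep
        + r"(?P<local_second_half>\d{4})\Z"
    )


# (compiled regex, message) pairs, tried in order; the first match wins.
FORMATS = [
    (pattern("", "-"), ""),
    (pattern("0", "-"), "Extra zero in area code - ask user to remediate."),
    (pattern("", ""), "Missing hyphen in local component - ask user to remediate."),
    (pattern("", " "), "Space instead of hyphen in local component - ask user to remediate."),
]


def clean_phone_number(raw):
    """Return (PHFull, PHFull_message) for one raw phone number string."""
    if not raw:
        return "", "Missing phone number."
    for regex, message in FORMATS:
        m = regex.match(raw)
        if m:
            number = ("0" + m.group("area_code") + m.group("local_first_half")
                      + m.group("local_second_half"))
            return number, message
    return "", "Number didn't match recognised format. Original text is: " + raw
```

Adding a new error case then means adding one entry to the list rather than another elif branch.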

    Update after full story revealed

    Keep it SIMPLE for both you and the staff. Instructions to staff using Active Directory should be (depending on which office) “Fill in +61(2)9876-7 followed by your 3-digit extension number”. If they can’t get that right after a couple of attempts, it’s time they got the DCM.

    So you use one regex per office, filling in the constant part, so that say the SYD offices have numbers of the form +61(2)9876-7ddd you use the regex r"\+61\(2\)9876-7\d{3,3}\Z". If a regex matches, then you remove all non-digits and use "0" + the_digits[2:] for the next app. If no regexes match, send a rocket.
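A rough sketch of the one-regex-per-office idea follows. Only the Sydney prefix comes from the example above; the Melbourne and Auckland prefixes, the dictionary name `OFFICE_PATTERNS`, and the function name `to_quickcomm` are hypothetical placeholders to be replaced with the real office details.

```python
import re

# Only the Sydney prefix is from the example above; the other two prefixes
# are made-up placeholders.
OFFICE_PATTERNS = {
    "SYD": re.compile(r"\+61\(2\)9876-7\d{3}\Z"),
    "MEL": re.compile(r"\+61\(3\)9123-4\d{3}\Z"),  # hypothetical prefix
    "AKL": re.compile(r"\+64\(9\)842-1\d{3}\Z"),   # hypothetical prefix
}


def to_quickcomm(raw):
    """Return the 0-prefixed national number, or None if no office matches."""
    for regex in OFFICE_PATTERNS.values():
        if regex.match(raw):
            digits = re.sub(r"\D", "", raw)  # remove all non-digits
            return "0" + digits[2:]          # swap the country code for a 0
    return None
```

Anything that returns `None` goes on the list of records to chase up.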
