I have an address class that uses a regular expression to parse the house number, street name, and street type from the first line of an address. This code is generally working well, but I’m posting here to share with the community and to see if anyone has suggestions for improvement.
Note: The STREETTYPES and QUADRANTS constants contain all of the relevant street types and quadrants, respectively.
I’ve included a subset here:
    private const string STREETTYPES = @"ALLEY|ALY|ANNEX|AX|ARCADE|ARC|AVENUE|AV|AVE|BAYOU|BYU|BEACH|...";
    private const string QUADRANTS = "N|NORTH|S|SOUTH|E|EAST|W|WEST|NE|NORTHEAST|NW|NORTHWEST|SE|SOUTHEAST|SW|SOUTHWEST";
HouseNumber, Quadrant, StreetName, and StreetType are all properties on the class.
    private void Parse(string line1)
    {
        HouseNumber = string.Empty;
        Quadrant = string.Empty;
        StreetName = string.Empty;
        StreetType = string.Empty;

        if (!String.IsNullOrEmpty(line1))
        {
            // Strip periods so that "N." and "St." match the plain abbreviations.
            // Replace returns a new string, so no explicit copy is needed.
            string noPeriodsLine1 = line1.Replace(".", "");

            // (?ix): ignore case and ignore whitespace inside the pattern.
            string addressParseRegEx = @"(?ix)
                ^
                \s*
                (?:
                    (?<housenumber>\d+)
                    (?:(?:\s+|-)(?<quadrant>" + QUADRANTS + @"))?
                    (?:(?:\s+|-)(?<streetname>\S+(?:\s+\S+)*?))??
                    (?:(?:\s+|-)(?<quadrant>" + QUADRANTS + @"))?
                    (?:(?:\s+|-)(?<streettype>" + STREETTYPES + @"))?
                    (?:(?:\s+|-)(?<streettypequalifier>(?!(?:" + QUADRANTS + @"))(?:\d+|\S+)))?
                    (?:(?:\s+|-)(?<streettypequadrant>(" + QUADRANTS + @")))??
                    (?:(?:\s+|-)(?<suffix>(?:ste|suite|po\sbox|apt)\s*\S*))?
                |
                    (?:(?:po|postoffice|post\s+office)\s+box\s+(?<postofficebox>\S+))
                )
                \s*
                $
                ";

            Match match = Regex.Match(noPeriodsLine1, addressParseRegEx);
            if (match.Success)
            {
                HouseNumber = match.Groups["housenumber"].Value;

                // Prefer the quadrant found near the street name; fall back to the trailing one.
                Quadrant = (string.IsNullOrEmpty(match.Groups["quadrant"].Value))
                    ? match.Groups["streettypequadrant"].Value
                    : match.Groups["quadrant"].Value;

                if (match.Groups["streetname"].Captures.Count > 1)
                {
                    foreach (Capture capture in match.Groups["streetname"].Captures)
                    {
                        StreetName += capture.Value + " ";
                    }
                    StreetName = StreetName.Trim();
                }
                else
                {
                    StreetName = (string.IsNullOrEmpty(match.Groups["streetname"].Value))
                        ? match.Groups["streettypequalifier"].Value
                        : match.Groups["streetname"].Value;
                }

                StreetType = match.Groups["streettype"].Value;

                // If the matched street type is found, use the abbreviated version,
                // especially for credit bureau calls.
                // StreetTypes (defined elsewhere in the class) maps a street type to its abbreviation.
                string streetTypeAbbreviation;
                if (StreetTypes.TryGetValue(StreetType.ToUpper(), out streetTypeAbbreviation))
                {
                    StreetType = streetTypeAbbreviation;
                }
            }
        }
    }
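To make the behaviour of the named groups easier to follow, here is a much smaller, self-contained sketch of the same idea. It is not the full pattern above: the class name `AddressParseSketch`, the shortened `StreetTypes` and `Quadrants` lists, and the tuple return type are all assumptions made for illustration, and it only handles the house number, an optional leading quadrant, the street name, and an optional street type.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressParseSketch
{
    // Hypothetical, heavily shortened lists; the real constants hold many more entries.
    private const string StreetTypes = "AVENUE|AVE|AV|ROAD|RD|STREET|ST";
    private const string Quadrants = "NE|NW|SE|SW|N|S|E|W";

    // (?ix): ignore case and ignore whitespace inside the pattern.
    private static readonly Regex Line1Regex = new Regex(
        @"(?ix)
          ^ \s*
          (?<housenumber>\d+)
          (?:\s+(?<quadrant>" + Quadrants + @"))?
          \s+(?<streetname>\S+(?:\s+\S+)*?)
          (?:\s+(?<streettype>" + StreetTypes + @"))?
          \s* $");

    public static (string HouseNumber, string Quadrant, string StreetName, string StreetType)
        Parse(string line1)
    {
        // Strip periods first, as the full method does, so "St." matches "ST".
        Match match = Line1Regex.Match((line1 ?? string.Empty).Replace(".", ""));
        if (!match.Success)
        {
            return (string.Empty, string.Empty, string.Empty, string.Empty);
        }

        // A named group that did not participate in the match has an empty Value.
        return (match.Groups["housenumber"].Value,
                match.Groups["quadrant"].Value,
                match.Groups["streetname"].Value,
                match.Groups["streettype"].Value);
    }
}
```

With this reduced pattern, `Parse("123 N. Main St.")` gives house number `123`, quadrant `N`, street name `Main`, and street type `St`, while `Parse("456 Oak Hill Road")` gives street name `Oak Hill` and street type `Road`. The lazy `*?` in the street name group is what lets the street type be split off the end: the name only grows by another word when the rest of the pattern cannot match otherwise.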
I don’t know what country you’re in, but if you’re in the USA and want to spend some money on address validation, you can buy related USPS products here. And here is a good place to find free word lists from the USPS for expected words and abbreviations. I’m sure similar pages are available for other countries.
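As a rough sketch of how such a word list could feed the parser, the snippet below keeps the words and their abbreviations in one dictionary and derives the regex alternation from its keys, so the two cannot drift apart. The class name `StreetTypeList`, the method `BuildAlternation`, and the handful of sample entries are assumptions for illustration only; a real list would be loaded from whatever file you download.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StreetTypeList
{
    // Hypothetical sample entries mapping each accepted spelling to its abbreviation.
    // Case-insensitive keys avoid the need to call ToUpper() before a lookup.
    public static readonly Dictionary<string, string> Abbreviations =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "ALLEY", "ALY" }, { "ALY", "ALY" },
            { "AVENUE", "AVE" }, { "AV", "AVE" }, { "AVE", "AVE" },
            { "STREET", "ST" }, { "ST", "ST" },
        };

    // Joins the words into a regex alternation, longest first so that
    // "AVENUE" is tried before "AVE" and "AV". Each word is escaped.
    public static string BuildAlternation(IEnumerable<string> words)
    {
        return string.Join("|", words
            .OrderByDescending(w => w.Length)
            .ThenBy(w => w, StringComparer.Ordinal)
            .Select(Regex.Escape));
    }
}
```

Calling `StreetTypeList.BuildAlternation(StreetTypeList.Abbreviations.Keys)` on the sample entries produces `AVENUE|STREET|ALLEY|ALY|AVE|AV|ST`, which can be concatenated into the pattern in place of a hand-maintained constant. `Regex.Escape` also escapes white space and `#`, which matters when the pattern is compiled with the `x` option.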