The Archive Base

Editorial Team
Asked: May 19, 2026 (2026-05-19T17:42:22+00:00)

I am fairly experienced with regular expressions, but I am having some difficulty with a current application involving disjunction.

My situation is this: I need to separate an address into its component parts based on a regular expression match on the “identifier elements” of the address. A comparable English example would be words like “state”, “road”, or “boulevard”, if we wrote these out in our addresses. Imagine we have an address like the following, where (and this would never happen in English) we specified the identifier type after each name:

United States COUNTRY California STATE San Francisco CITY Mission STREET 345 NUMBER

(Where the words in CAPS are what I have called “identifiers”).

We want to parse it into:

United States COUNTRY
California STATE
San Francisco CITY
Mission STREET
345 NUMBER

OK, this is certainly contrived for English, but here’s the catch: I am working with Chinese data, where in fact this style of identifier specification happens all the time. An example below:


云南-省 ; 丽江-市 ; 古城-区 ; 西安-街 ; 杨春-巷 ;
Yunnan-Province ; LiJiang-City ; GuCheng-District ; Xi'An-Street ; Yangchun-Alley

This is easy enough: a lazy match followed by the candidate identifier names, combined into a disjunctive list.

For China, the following are the “province-level” entities:


省 (Province) ,
自治区 (Autonomous Region) ,
市 (Municipality)

So my regex so far looks like this:


(.+?(?:(?:省)|(?:自治区)|(?:市)))

I have a series of these, in order to account for different portions of the address. The next level, corresponding to cities, for instance, is:


(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

So to match a province entity followed by a city entity:


(.+?(?:(?:省)|(?:自治区)|(?:市)))(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

With named capture groups:

(?<Province>.+?(?:(?:省)|(?:自治区)|(?:市)))(?<City>.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

For the above, this yields:

$+{Province} = 云南省
$+{City} = 丽江市
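
As a rough illustration of the two-level pattern above, here is a minimal Python sketch. Python is an assumption on my part (the `$+{...}` notation in the question suggests Perl); Python spells named groups `(?P<name>...)`, and the redundant inner `(?:...)` wrappers are dropped, but the alternations are the ones listed in the question. The input is the hand-segmented example with the separators removed.

```python
import re

# Province-level and city-level identifiers, as listed in the question.
PROVINCE = r"(?P<Province>.+?(?:省|自治区|市))"
CITY = r"(?P<City>.+?(?:地区|自治州|市|盟))"
address_re = re.compile(PROVINCE + CITY)

# Raw (unsegmented) form of: 云南-省 ; 丽江-市 ; 古城-区 ; 西安-街 ; 杨春-巷
raw = "云南省丽江市古城区西安街杨春巷"

m = address_re.match(raw)
parts = m.groupdict()
print(parts)  # {'Province': '云南省', 'City': '丽江市'}
```

Each lazy `.+?` grows one character at a time and stops at the first position where one of its identifiers matches, which is why the two levels come out separately here.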

This is all good and well, and gets me pretty far. The problem, however, is when I try to account for identifiers that can be a substring of other identifiers. A common street-level entity, for instance, is “村委会”, which means village organizing committee. In the set of addresses I wish to separate, not every address has this written out in full. In fact, I find “村委” and just plain “村” as well.

The problem? If I have a pure disjunction of these elements, we have the following:


(?<Street>.+?(?:(?:村委会)|(?:村委)|(?:村)))

What happens, though, is that if you have an entity 保定-村委会 (Baoding Village organizing committee), this lazy regex stops at 村 and calls it a day, orphaning our poor 委会 because 村 is one of the potential disjunctive elements.

Imagine an English equivalent like the following:

(?<Animal>.+?(?:(?:Cat)|(?:Elephant)|(?:CatElephant)|(?:City)))

We have two input strings:
1. “crap catelephant crap city”, where we wanted “crap catelephant” and “crap city”
2. “crap catelephant city”, where we wanted “crap cat” and “elephant city”
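
To make the first case concrete, here is a small Python sketch of what the pattern above does on input 1. Two assumptions: the engine is Python's `re` (named groups are written `(?P<name>...)`), and matching is case-insensitive, since the pattern is capitalized while the inputs are lowercase.

```python
import re

# Same alternation order as in the question: Cat comes before CatElephant.
animal_re = re.compile(
    r"(?P<Animal>.+?(?:Cat|Elephant|CatElephant|City))",
    re.IGNORECASE,  # assumption: the pattern and inputs differ only in case
)

text = "crap catelephant crap city"
pieces = [m.group("Animal") for m in animal_re.finditer(text)]
print(pieces)  # ['crap cat', 'elephant crap city']
```

The first match ends at "cat" because `Cat` is tried before `CatElephant`, and the leftover "elephant" is swallowed by the lazy prefix of the second match.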

Ah, the solution, you say, is to make the pre-identifier capture greedy. But! There are entities that have the same identifier but are not at the same level.

Take 市 for example. It means simply “city”. But in China, there are county-level, province-level, and municipality-level cities. If this character occurred twice in the string, especially in two adjacent entities, a greedy search would run on to the last 市 and incorrectly tag both entities as one. As in the following:


广东-省 ; 江门-市 ; 开平-市 ; 三埠-区 ; 石海管-区
Guangdong-province ; Jiangmen-City ; Kaiping-City ; Sanbu-District ; Shihaiguan-District

(Note, as above, this has been hand-segmented. The raw data would simply have a string of concatenated characters)

The match for a greedy search would be

江门市开平市

This is wrong, as the two adjacent entities should be separated into their constituent parts. One is a provincial-level city, the other a county-level city.
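
The difference can be reproduced with a short Python sketch. Python's `re` is an assumption here, and the city-level identifier list is reused from earlier in the question; the only change between the two patterns is `.+?` versus `.+` in the city group.

```python
import re

# Raw form of: 广东-省 ; 江门-市 ; 开平-市 ; 三埠-区 ; 石海管-区
raw = "广东省江门市开平市三埠区石海管区"

province = r"(?P<Province>.+?(?:省|自治区|市))"
lazy_city = r"(?P<City>.+?(?:地区|自治州|市|盟))"   # stops at the first 市
greedy_city = r"(?P<City>.+(?:地区|自治州|市|盟))"  # backtracks from the end to the last 市

lazy = re.match(province + lazy_city, raw).group("City")
greedy = re.match(province + greedy_city, raw).group("City")

print(lazy)    # 江门市
print(greedy)  # 江门市开平市
```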

Back to the original point (and thank you for reading this far): is there a way to put a weighting on disjunctive entities? I would want the regex to find the highest “weighted” identifier first: 村委会 instead of simple 村, for example, or “catelephant” instead of just “cat”. In preliminary experiments, the regex engine apparently proceeds left to right through the disjunctive list when looking for a match. Is this a valid assumption to make? Should I put the most frequently occurring identifiers first in the disjunctive list?

If I have lost anyone with Chinese-related details, I apologize, and can further clarify if needed. The example really doesn’t have to be Chinese–I think more generally it is a question about the mechanics of the regex disjunctive match — in what order does it preference the disjunctive entities, and how does it decide when to “call it a day” in the context of a lazy search?

In a way, is there some sort of middle ground between lazy and greedy searches? Find the smallest bit you can find before the longest / highest weighted disjunctive entity? Be lazy, but put in that little bit of extra effort if you can for the sake of thoroughness?
(Incidentally, my work philosophy in college?)

1 Answer

Editorial Team
Added an answer on May 19, 2026 at 5:42 pm

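How alternation is handled depends on the particular regular expression engine. In most engines (including Perl’s, which is a backtracking engine) alternation is ordered: the engine tries the left-most alternative first and moves on to the next one only if that alternative, or the rest of the pattern after it, fails. For example, /(cat|catelephant)/ on its own will never match catelephant, because cat always succeeds first. The solution is to reorder the alternatives so that the most specific (longest) one comes first, for example /(catelephant|cat)/ or 村委会|村委|村. Ordering by frequency does not help; what matters is that no alternative appears before a longer alternative it is a prefix of.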
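
A minimal sketch of that reordering, in Python (an assumption; the same ordering rule applies in Perl). The helper `build_alternation` is a hypothetical name, not part of any library: it sorts the identifiers longest first before joining them, which is one simple way to approximate the "weighting" asked about.

```python
import re

def build_alternation(identifiers):
    """Hypothetical helper: join identifiers into one group, longest first."""
    ordered = sorted(identifiers, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"

# Ordered alternation: the shorter, earlier alternative wins.
short_first = re.compile(r"(cat|catelephant)")
print(short_first.search("catelephant").group(1))  # cat

# Shortest identifier listed first: the lazy match stops at 村.
naive = re.compile("(?P<Street>.+?(?:村|村委|村委会))")
print(naive.match("保定村委会").group("Street"))  # 保定村

# Longest identifier first: the full 村委会 is captured.
street = re.compile("(?P<Street>.+?" + build_alternation(["村", "村委", "村委会"]) + ")")
print(street.match("保定村委会").group("Street"))  # 保定村委会
```

The lazy prefix still stops at the first position where any identifier can match; the ordering only decides which identifier is taken at that position. That keeps adjacent entities such as 江门市 and 开平市 separate while still preferring 村委会 over 村.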
