The Archive Base

The Archive Base Latest Questions

Editorial Team
Asked: June 17, 2026 (2026-06-17T11:37:18+00:00)

We are currently using [^a-zA-Z0-9] in Java’s replaceAll function to strip special characters from

We are currently using [^a-zA-Z0-9] in Java’s replaceAll function to strip special characters from a string. It has come to our attention that we need to allow hyphen(s) when they are mixed with number(s).

Examples in which the hyphens should not be matched (they are kept):

  • 1-2-3
  • -1-23-4562
  • --1---2--3---4-
  • --9--a--7
  • 425-12-3456

Examples in which the hyphens should be matched (they are stripped):

  • --a--b--c
  • wal-mart

We think we have formulated a regex that meets the latter criteria, using this SO question as a reference, but we have no idea how to combine it with the original regex [^a-zA-Z0-9].

We want to do this to a Lucene search string because of the way Lucene’s standard tokenizer works when indexing:

Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
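
The rule quoted above boils down to a per-token check: does the token contain a digit? Here is a minimal Python sketch of that check, run against the examples from the question. The helper name `keeps_hyphens` is made up for illustration and is not part of Lucene, and the example tokens are written with plain ASCII hyphens.

```python
import re

def keeps_hyphens(token):
    """Return True when the token contains a digit, so its hyphens should stay."""
    return re.search(r'[0-9]', token) is not None

# Examples taken from the question, written with ASCII hyphens
examples = ['1-2-3', '-1-23-4562', '--9--a--7', '425-12-3456',
            '--a--b--c', 'wal-mart']

for token in examples:
    action = 'keep hyphens' if keeps_hyphens(token) else 'strip hyphens'
    print(token, '->', action)
```

Only the last two examples come out as "strip hyphens", which matches the two lists in the question.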

1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 11:37 am (2026-06-17T11:37:19+00:00)

    You can’t do this with a single regex. (Well… maybe in Perl.)

    (edit: Okay, you can do it with variable-length negative lookbehind, which it appears Java can (almost uniquely!) do; see Cyborgx37’s answer. Regardless, imo, you shouldn’t do this with a single regex. :))
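
For comparison, Python's standard `re` module only accepts fixed-width lookbehind, so the lookbehind trick mentioned above does not carry over directly. A single substitution pass with a callback gets a similar effect. This is a sketch of a different technique, not the lookbehind regex itself. The names `strip_special` and `_clean` are illustrative, and like the original `replaceAll` call it also removes whitespace.

```python
import re

# A token is a run of letters, digits, and hyphens; anything else is "special".
TOKEN_OR_SPECIAL = re.compile(r'[-a-zA-Z0-9]+|[^-a-zA-Z0-9]+')

def _clean(match):
    text = match.group(0)
    if not re.match(r'[-a-zA-Z0-9]', text):
        return ''                  # run of special characters: drop it
    if re.search(r'[0-9]', text):
        return text                # token contains a digit: keep its hyphens
    return text.replace('-', '')   # no digit: strip the hyphens as well

def strip_special(s):
    """Strip special characters, keeping hyphens only in tokens with a digit."""
    return TOKEN_OR_SPECIAL.sub(_clean, s)

print(strip_special('wal-mart 1-2-3'))
```

The per-token decision lives in ordinary code rather than in the pattern, which is easier to read and adjust than one large regex.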

    What you can do is split the string into words and deal with each word individually. My Java is pretty terrible so here is some hopefully-sensible Python:

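    import re

    # Precompile the patterns
    contains_digit = re.compile(r'[0-9]')
    not_wordlike = re.compile(r'[^a-zA-Z0-9]')
    not_wordlike_or_hyphen = re.compile(r'[^-a-zA-Z0-9]')

    string = 'wal-mart 1-2-3'  # sample input

    # Split on anything that's not a letter, number, hyphen, or dot;
    # a dot only splits when it is followed by whitespace
    words = re.split(r'(?:[^-.a-zA-Z0-9]|[.]\s)+', string)

    stripped_words = []
    for word in words:
        # A token counts as a product number when it contains any digit,
        # which matches the tokenizer rule quoted in the question
        if '-' in word and not contains_digit.search(word):
            # Hyphenated word with no digit: strip the hyphens as well
            stripped_word = not_wordlike.sub('', word)
        else:
            # Product number, or no hyphen at all; allow dashes
            stripped_word = not_wordlike_or_hyphen.sub('', word)

        if stripped_word:  # skip empty pieces left over from the split
            stripped_words.append(stripped_word)

    # Hand this string to Lucene; printed here so the snippet runs on its own
    print(' '.join(stripped_words))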
    

    When I run this with 'wal-mart 1-2-3', I get back 'walmart 1-2-3'.

    But honestly, the above code reproduces most of what the Lucene tokenizer is already doing. I think you’d be better off just copying StandardTokenizer into your own project and modifying it to do what you want.

© 2021 The Archive Base. All Rights Reserved