The Archive Base

Editorial Team
Asked: May 14, 2026

I have a large text file I am reading from and I need to


I have a large text file I am reading from, and I need to find out how many times certain words come up, for example the word the. I’m doing this line by line, and each line is a string.

I need to make sure that I only count legitimate occurrences of "the"; the "the" inside "other" should not count. This means I need to use regular expressions in some way. What I have tried so far is this:

numSpace += line.split("[^a-z]the[^a-z]").length;  

I realize the regular expression may not be correct at the moment, but I also tried without it, just looking for occurrences of the word the, and I got wrong numbers too. I was under the impression that this would split the string into an array, and that the number of times it was split would be the number of times the word appears in the string. I would be grateful for any ideas.

Update:
Given some ideas, I’ve come up with this:

numThe += line.split("[^a-zA-Z][Tt]he[^a-zA-Z]", -1).length - 1;

Though still getting some strange numbers. I was able to get an accurate general count (without the regular expression), now my issue is with the regexp.

1 Answer

  1. Editorial Team
    Added an answer on May 14, 2026 at 6:05 am

    Using split to count isn’t the most efficient, but if you insist on doing that, the proper way is this:

    haystack.split(needle, -1).length - 1

    If you don’t pass a limit of -1, split uses its default limit of 0, which discards trailing empty strings and throws off your count.

    From the API:

    The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. […] If n is zero then […] trailing empty strings will be discarded.

    You also need to subtract 1 from the length of the array, because N occurrences of the delimiter splits the string into N+1 parts.
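A minimal sketch of the effect of the limit argument, assuming a comma as the delimiter; the class and method names (SplitLimitDemo, countWithSplit) are made up for illustration:

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Counts delimiter matches via split. The limit of -1 keeps trailing
    // empty strings, and N matches produce N + 1 pieces, hence the "- 1".
    static int countWithSplit(String haystack, String needleRegex) {
        return haystack.split(needleRegex, -1).length - 1;
    }

    public static void main(String[] args) {
        String s = "a,b,,";
        // Default limit 0: the two trailing empty strings are dropped.
        System.out.println(Arrays.toString(s.split(",")));
        // Limit -1: all four pieces are kept.
        System.out.println(Arrays.toString(s.split(",", -1)));
        System.out.println(countWithSplit(s, ",")); // 3
    }
}
```

With the default limit the array has length 2, so the same arithmetic would report 1 comma instead of 3.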


    As for the regex itself (i.e. the needle), you can put \b, the word-boundary anchor, on both sides of the word. If you allow the word to contain metacharacters (e.g. counting occurrences of "$US"), you may want to pass it through Pattern.quote.
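As a rough illustration of combining \b with Pattern.quote (the helper name countWord is hypothetical, and the match is case-sensitive):

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Builds \bWORD\b with the word quoted so that regex metacharacters
    // in it are treated literally, then counts matches via split.
    static int countWord(String line, String word) {
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return line.split(regex, -1).length - 1;
    }

    public static void main(String[] args) {
        // "other" contains "the" but has no word boundary around it.
        System.out.println(countWord("the other the", "the")); // 2
        // Quoted, the dot in "a.b" no longer matches the x in "axb".
        System.out.println(countWord("a.b axb a.b", "a.b"));   // 2
    }
}
```

One caveat: \b only matches between a word character and a non-word character, so for a needle that begins or ends with a symbol, such as "$US", the boundary on that side behaves differently and a lookaround may be a better fit.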


    I’ve come up with this:

    numThe += line.split("[^a-zA-Z][Tt]he[^a-zA-Z]", -1).length - 1;
    

    Though still getting some strange numbers. I was able to get an accurate general count (without the regular expression), now my issue is with the regexp.

    Now the issue is that you’re not counting [Tt]he that appears as the first or last word, because the regex says that it has to be preceded/followed by some character, something that matches [^a-zA-Z] (that is, your match must be of length 5!). You’re not allowing the case where there isn’t a character at all!

    You can try something like this instead:

    "(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)"
    

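    This isn’t the most concise solution, but it handles the first and last word. Because the pattern consumes the characters around the word, two occurrences separated by a single character (e.g. "the the") are counted only once.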
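The difference between the two patterns can be checked with a small sketch; the sample line and the names ORIGINAL, ANCHORED and count are illustrative only:

```java
class EdgeCaseDemo {
    // Pattern from the question: requires a non-letter on both sides.
    static final String ORIGINAL = "[^a-zA-Z][Tt]he[^a-zA-Z]";
    // Variant that also accepts the start or end of the line.
    static final String ANCHORED = "(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)";

    static int count(String line, String regex) {
        return line.split(regex, -1).length - 1;
    }

    public static void main(String[] args) {
        String line = "The cat saw the other the";
        // Only the middle " the " has a character on both sides.
        System.out.println(count(line, ORIGINAL)); // 1
        // The first and last words are now counted as well.
        System.out.println(count(line, ANCHORED)); // 3
    }
}
```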

    Something like this (using negative lookarounds) also works:

    "(?<![a-zA-Z])[Tt]he(?![a-zA-Z])"
    

    This has the benefit of matching just [Tt]he, without any extra characters around it like your previous solution did. This is relevant in case you actually want to process the tokens returned by split, because the delimiter in this case isn’t "stealing" anything from the tokens.
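A small sketch of what "not stealing" means in practice, using a lookaround pattern that requires no letter on either side of the word; the names LOOKAROUND, CONSUMING, tokens and count are hypothetical:

```java
class LookaroundDemo {
    // Zero-width checks: no letter directly before or after the word.
    static final String LOOKAROUND = "(?<![a-zA-Z])[Tt]he(?![a-zA-Z])";
    // Consuming variant, for comparison: it eats the neighbouring characters.
    static final String CONSUMING = "(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)";

    static String[] tokens(String line, String regex) {
        return line.split(regex, -1);
    }

    static int count(String line, String regex) {
        return tokens(line, regex).length - 1;
    }

    public static void main(String[] args) {
        String line = "the the other";
        // The lookaround matches only the word, so the spaces stay in the tokens.
        System.out.println(count(line, LOOKAROUND)); // 2
        // The consuming pattern uses up the shared space and misses the second word.
        System.out.println(count(line, CONSUMING));  // 1
    }
}
```

With the lookaround pattern the tokens are "", " " and " other", so every character that is not part of a match is still available for further processing.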


    Non-split

    Though using split to count is rather convenient, it isn’t the most efficient (e.g. it does all kinds of work to return strings that you then discard). And because you’re counting line by line, the pattern would also be recompiled and thrown away on every line.

    A more efficient way would be to take the same regex as before, compile it once with Pattern.compile, and then count with the usual while (matcher.find()) count++;
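One way this could look, assuming the lookaround form of the regex and lines supplied as an Iterable; the class and method names are placeholders, not part of any API:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherCountDemo {
    // Compiled once and reused for every line.
    private static final Pattern THE =
            Pattern.compile("(?<![a-zA-Z])[Tt]he(?![a-zA-Z])");

    // Counts matches in a single line without building any token array.
    static int countInLine(String line) {
        Matcher matcher = THE.matcher(line);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Sums the per-line counts, e.g. for lines read from a file.
    static int countInLines(Iterable<String> lines) {
        int total = 0;
        for (String line : lines) {
            total += countInLine(line);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countInLines(
                Arrays.asList("The cat saw the other the", "the the", "other"))); // 5
    }
}
```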
