The Archive Base Latest Questions

Editorial Team
Asked: May 15, 2026

I’m building a process which extracts data from 6 csv-style files and two poorly laid out .txt reports and builds output CSVs, and I’m fully aware that there’s going to be some overhead searching through all that whitespace thousands of times, but I never anticipated converting about 50,000 records would take 12 hours.

Excerpt of my manual matching code (I know it’s horrible that I use lists of tokens like that, but it was the best thing I could think of):

public static String lookup(Pattern tokenBefore,
                             List<String> tokensAfter)
{
    String result = null;

    while(_match(tokenBefore)) { // block until all input is read
        if(id.hasNext())
        {
            result = id.next(); // capture the  next token that matches

            if(_matchImmediate(tokensAfter)) // try to match tokensAfter to this result
                return result;
        } else
            return null; // end of file; no match
    }

    return null; // no matches
}

private static boolean _match(List<String> tokens)
{
    return _match(tokens, true);
}

private static boolean _match(Pattern token)
{
    if(token != null)
    {
        return (id.findWithinHorizon(token, 0) != null);
    } else {
        return false;
    }
}

private static boolean _match(List<String> tokens, boolean block)
{
    if(tokens != null && !tokens.isEmpty()) {
        if(id.findWithinHorizon(tokens.get(0), 0) == null)
            return false;

        for(int i = 1; i <= tokens.size(); i++)
        {
            if (i == tokens.size()) { // matches all tokens
                return true;
            } else if(id.hasNext() && !id.next().matches(tokens.get(i))) {
                break; // break to blocking behaviour
            }
        }
    } else {
        return true; // empty list always matches
    }

    if(block)
        return _match(tokens); // loop until we find something or nothing
    else
        return false; // return after just one attempted match
}

private static boolean _matchImmediate(List<String> tokens)
{
    if(tokens != null) {

        for(int i = 0; i <= tokens.size(); i++)
        {
            if (i == tokens.size()) { // matches all tokens
                return true;
            } else if(!id.hasNext() || !id.next().matches(tokens.get(i))) {
                return false; // doesn't match, or end of file
            }
        }

        return false; // we have some serious problems if this ever gets called
    } else {
        return true; // empty list always matches
    }
}

Basically, I'm wondering how I would work in an efficient string search (Boyer-Moore or similar). My Scanner id is scanning a java.lang.String; I figured buffering the file in memory would reduce I/O, since the search here is performed thousands of times on a relatively small file. The performance increase compared to scanning a BufferedReader(FileReader(File)) was probably less than 1%, and the process still looks to be taking a LONG time.

I've also traced execution, and the slowness of my overall conversion process is definitely between the first and last line of the lookup method. So much so, in fact, that I ran a shortcut process to count the number of occurrences of various identifiers in the .csv-style files (I use 2 lookup methods, this is just one of them), and that process finished indexing approximately 4 different identifiers for 50,000 records in less than a minute. Compared to 12 hours, that's instant.

Some notes (updated 6/6/2010):

  1. I still need the pattern-matching behaviour for tokenBefore.
  2. All ID numbers I need don’t necessarily start at a fixed position in a line, but it’s guaranteed that after the ID token is the name of the corresponding object.
  3. I would ideally want to return a String, not the start position of the result as an int or something.

Anything to help me out, even if it saves 1ms per search, will help, so all input is appreciated. Thank you!


Usage scenario 1: I have a list of objects in file A which, in the old-style system, have an ID number that is not in file A. It is, however, POSSIBLY in another csv-style file (file B), or possibly in a .txt report (file C). Both of those files also contain a bunch of other information which is not useful here. So file B needs to be searched for the object's full name (1 token, since it would reside within the second column of any given line), and the first column of that line should then be the ID number. If that doesn't work, we have to split the search token by whitespace into separate tokens and then search file C for those tokens as well.

Generalised code:

String field;
for (/* each record in file A */)
{
    /* construct the rest of this object from file A info */
    // now to find the ID, if we can
    List<String> objectName = new ArrayList<String>(1);
    objectName.add(Pattern.quote(thisObject.fullName));
    field = lookup(objectSearchToken, objectName); // search file B
    if(field == null) // not found in file B
    {
        lookupReset(false); // initialise scanner to check file C
        objectName.clear(); // not using the full name

        String[] tokens = thisObject.fullName.split(id.delimiter().pattern());
        for(String s : tokens)
            objectName.add(Pattern.quote(s));

        field = lookup(objectSearchToken, objectName); // search file C
        lookupReset(true); // back to file B
    } else {
        /* found it, file B specific processing here */
    }

    if(field != null) // found it in B or C
        thisObject.ID = field;
}

The objectName tokens are all uppercase words with possible hyphens or apostrophes in them, separated by spaces (a person’s name).
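Because file B keeps the full name in its second column and the ID in its first, one option worth measuring is to read file B once into a map and replace the per-record scan with a hash lookup. The following is only a sketch under assumptions that are not stated in the question: the class name `NameIndex` is hypothetical, columns are assumed to be separated by plain commas with no quoted fields, and the first ID seen for a name wins.

```java
import java.util.HashMap;
import java.util.Map;

final class NameIndex {
    /**
     * Builds a name -> ID map from the content of file B in a single pass.
     * Assumption: comma-separated columns, ID in column 1, full name in column 2.
     */
    static Map<String, String> build(String fileBContent) {
        Map<String, String> idByName = new HashMap<>();
        for (String line : fileBContent.split("\\r?\\n")) {
            String[] cols = line.split(",", 3); // only the first two columns matter here
            if (cols.length >= 2) {
                // putIfAbsent keeps the first ID seen, like a scan that stops at the first hit
                idByName.putIfAbsent(cols[1].trim(), cols[0].trim());
            }
        }
        return idByName;
    }
}
```

With such a map, the file B branch of the loop becomes a single `idByName.get(thisObject.fullName)` call, and only the names missing from the map would need the slower file C search.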

As per aioobe’s answer, I have pre-compiled the regex for my constant search tokens, which in this case is just \r\n. The speedup noticed was about 20x in another one of the processes, where I compiled [0-9]{1,3}\\.[0-9]%|\r\n|0|[A-Z'-]+, although it was not noticed in the above code with \r\n. Working along these lines, it has me wondering:

Would it be better for me to match \r\n[^ ] if the only usable matches will be on lines beginning with a non-space character anyway? It may reduce the number of _match executions.
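One detail to watch with `\r\n[^ ]` is that `findWithinHorizon` consumes everything the pattern matches, so the first character of the line would be swallowed before the following `next()` call. A lookahead avoids that. The sketch below is an illustration only: the class and method names are hypothetical, and it assumes the Scanner uses its default whitespace delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

final class LineStartTokens {
    // A line break followed by a non-space character. The lookahead checks
    // that character without consuming it, so next() still sees the whole token.
    private static final Pattern LINE_START = Pattern.compile("\\r\\n(?=[^ ])");

    /** Collects the first token of every line that does not begin with a space. */
    static List<String> firstTokens(String content) {
        List<String> tokens = new ArrayList<>();
        try (Scanner in = new Scanner(content)) {
            while (in.findWithinHorizon(LINE_START, 0) != null && in.hasNext()) {
                tokens.add(in.next());
            }
        }
        return tokens;
    }
}
```

Note that the very first line of the input is never preceded by `\r\n`, so it would need separate handling if it can contain an ID.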

Another possible optimisation is this: concatenate all tokensAfter, and put a (.*) beforehand. It would reduce the number of regexes (all of which are literal anyway) that would be compiled by about 2/3, and also hopefully allow me to pull out the text from that grouping instead of keeping a “potential token” from every line with an ID on it. Is that also worth doing?
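A rough sketch of that idea follows, with some liberties that should be treated as assumptions: it captures a single non-whitespace token with `(\S+)` rather than `(.*)`, it separates tokens with spaces or tabs only so that a match cannot span two lines, and the names `CombinedLookup`, `compileFor` and `lookup` are made up for the illustration.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class CombinedLookup {
    /** One pattern per name: a captured token, then every name token in order. */
    static Pattern compileFor(List<String> nameTokens) {
        String name = nameTokens.stream()
                .map(Pattern::quote)                      // name tokens are literals
                .collect(Collectors.joining("[ \\t]+"));
        // (?!\S) stops SMITH from matching the start of SMITH-JONES
        return Pattern.compile("(\\S+)[ \\t]+" + name + "(?!\\S)");
    }

    /** Returns the token immediately before the name, or null if there is no match. */
    static String lookup(CharSequence content, Pattern combined) {
        Matcher m = combined.matcher(content);
        return m.find() ? m.group(1) : null;
    }
}
```

This compiles one regex per name instead of one per token, and the ID comes back directly as a String from the capture group. Whether it is actually faster than the token-by-token version would have to be measured on the real files.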

The above situation could be resolved if I could get java.util.Scanner to return the token previous to the current one after a call to findWithinHorizon.
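`java.util.Scanner` does not step backwards, but `Scanner.match()` returns the `MatchResult` of the last scanning operation, so a capture group placed in the pattern passed to `findWithinHorizon` can play the role of the previous token. A minimal sketch, assuming the ID and the name are separated by spaces or tabs on the same line (the class and method names are hypothetical):

```java
import java.util.Scanner;
import java.util.regex.Pattern;

final class ScannerGroupLookup {
    /**
     * Finds literalName in the scanner's remaining input and returns the
     * token directly before it, or null when the name does not occur.
     */
    static String tokenBefore(Scanner in, String literalName) {
        Pattern p = Pattern.compile("(\\S+)[ \\t]+" + Pattern.quote(literalName));
        if (in.findWithinHorizon(p, 0) == null) {
            return null; // no match in the rest of the input
        }
        return in.match().group(1); // group 1 of the match just found
    }
}
```

In a real run the pattern would be compiled once per name outside the method rather than on every call.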

1 Answer

  1. Editorial Team
    Added an answer on May 15, 2026 at 4:37 am

    Something to start with: Every single time you run id.next().matches(tokens.get(i)) the following code is executed:

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    return m.matches();
    

    Compiling a regular expression is non-trivial and you should consider compiling the patterns once and for all in your program:

    pattern[i] = Pattern.compile(tokens.get(i));
    

    And then simply invoke something like

    pattern[i].matcher(str).matches()
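Applied to the question's `_matchImmediate`, this could look roughly like the sketch below. The class and method names are hypothetical and the Scanner is passed in as a parameter instead of being the static `id` field; only the compile-once idea is taken from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

final class PrecompiledTokens {
    /** Compiles every regex once, up front, instead of on every matches() call. */
    static List<Pattern> compileAll(List<String> regexes) {
        List<Pattern> compiled = new ArrayList<>(regexes.size());
        for (String regex : regexes) {
            compiled.add(Pattern.compile(regex));
        }
        return compiled;
    }

    /** True if the next tokens from the scanner match the patterns in order. */
    static boolean matchImmediate(Scanner in, List<Pattern> tokens) {
        for (Pattern token : tokens) {
            if (!in.hasNext() || !token.matcher(in.next()).matches()) {
                return false; // mismatch, or end of input
            }
        }
        return true; // all tokens matched (an empty list always matches)
    }
}
```

The list returned by `compileAll` can be built once per name and reused for every attempt inside the lookup loop.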
    