The Archive Base

Editorial Team
Asked: June 7, 2026


I’ve got a problem at work that requires me to import some MASSIVE tab-separated values files (think 8-15 GB .txt files) into my PostgreSQL DB, but I’ve run into a problem with the way the data was formatted in the first place. Unfortunately we cannot get the data in a better format, and as delivered, some backslashes appear in the data and each one causes a return/new line.

So there are lines (tab-delimited rows of data) that get chopped up into multiple lines, where the last character of line n is a \ and the first character of line n+1 is a tab. Usually line n is broken up into 1-3 additional lines (e.g. line n ends in a “\”, lines n+1 and n+2 start with a tab and end with a “\”, and line n+3 starts with a tab).
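To make the format concrete, here is a made-up sample (not real customer data) written as a Python string, with a quick check of which physical lines are continuations. The field values and the names `broken` and `flags` are invented for illustration.

```python
# Made-up sample: one logical row whose text contained two backslashes,
# so the logger split it into three physical lines, followed by a normal row.
broken = (
    "id1\tfoo\tsome text\\\n"   # line n ends with a backslash
    "\tmore text\\\n"           # line n+1 starts with a tab and ends with a backslash
    "\tlast part\tbar\n"        # line n+2 starts with a tab
    "id2\tbaz\tqux\n"           # an ordinary, unbroken row
)

physical_lines = broken.splitlines()

# For each physical line: (starts with a tab?, ends with a backslash?)
flags = [(line.startswith("\t"), line.endswith("\\")) for line in physical_lines]

for line, (is_continuation, is_cut) in zip(physical_lines, flags):
    print(repr(line),
          "continuation" if is_continuation else "start",
          "cut" if is_cut else "complete")
```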

I need to write a script that can work with these huge files (this will run on a Linux server with 192 GB of RAM). It should look for the lines that begin with a tab, remove the return before them (and the “\” wherever it exists), and save the text file.

To recap, the customer’s logging program splits the original line N into lines n, n+1, and sometimes n+2 and n+3 (depending on how many \ characters appear in line N), and I need to write a Python script to recreate the original line N.


1 Answer

    Editorial Team, added an answer on June 7, 2026 at 10:25 am

    This is based on @user665637’s good answer.

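    #!/usr/bin/python

    import re
    import sys

    # A trailing backslash (plus any whitespace, including the line ending) marks a cut line.
    pat_incomplete = re.compile(r'\\\s*$')
    # A leading tab marks a continuation of the previous line.
    pat_indented = re.compile(r'^\t')

    try:
        _, fname_in, fname_out = sys.argv
    except ValueError:
        print("Usage: python line_joiner.py <input_filename> <output_filename>")
        sys.exit(1)

    # The file is read one line at a time, so memory use stays small even for huge inputs.
    with open(fname_in) as in_f, open(fname_out, "w") as out_f:
        lines = iter(in_f)
        try:
            # Prime s with the first line so the loop always has something to flush.
            line = next(lines)
            s = pat_incomplete.sub('', line)
        except StopIteration:
            print("Input file did not contain any data")
            sys.exit(2)

        for line in lines:
            line = pat_incomplete.sub('', line)
            if pat_indented.match(line):
                # Continuation: drop any line ending still on s, drop the leading tab, and join.
                s = s.rstrip('\r\n') + pat_indented.sub('', line)
            else:
                # New record: write out the finished one and start accumulating again.
                out_f.write(s)
                s = line
        out_f.write(s)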
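To try the joining logic without touching any files, the loop can be wrapped in a generator and fed a small in-memory sample. This is only a sketch of the same approach: `join_continuations` and `sample` are hypothetical names, the sample rows are made up, and the `rstrip` before joining reflects my reading of the question (any line that starts with a tab gets glued onto the previous one).

```python
import re

pat_incomplete = re.compile(r'\\\s*$')  # trailing backslash plus line ending
pat_indented = re.compile(r'^\t')       # leading tab marks a continuation line

def join_continuations(lines):
    """Yield logical lines, merging tab-led continuation lines into the line before."""
    lines = iter(lines)
    try:
        s = pat_incomplete.sub('', next(lines))
    except StopIteration:
        return  # empty input: nothing to yield
    for line in lines:
        line = pat_incomplete.sub('', line)
        if pat_indented.match(line):
            # Assumption: the leading tab is an artifact of the split, so it is dropped.
            s = s.rstrip('\r\n') + pat_indented.sub('', line)
        else:
            yield s
            s = line
    yield s

# Made-up input: one row split into three physical lines, then a normal row.
sample = [
    "a\tb\tfirst\\\n",
    "\tsecond\\\n",
    "\tthird\tc\n",
    "d\te\tf\n",
]
result = list(join_continuations(sample))
print(result)
```

Note that the pieces are concatenated with neither the backslash nor the tab between them. If the leading tab turns out to be a real field separator in your data, keep it by appending `line` instead of `pat_indented.sub('', line)`.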
    

    Changes:

    • Uses “raw strings” for the regular expressions, which are easier to read.

    • Takes an output filename from the command-line arguments and writes to that file. Prints a message and exits if the user provides the wrong number of arguments. When we unpack sys.argv to get the arguments, we use the convention of using the variable name _ for parts we don’t care about.

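    • Does not strip line endings from complete lines, so the output file will have the same sort of line endings as the input file. One caveat: on Python 3, text mode translates line endings on read and write unless you pass newline='' to both open() calls. (When joining lines, of course it strips the line endings to do the join.)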

    • Does not filter out blank lines from the input. It’s a little bit tricky, but by making an iterator and calling next() on it, it gets the first input line before starting the loop; thus s starts out with a valid value instead of None, and we don’t have to test it each time to see whether to print it or not. The original if lastLine: test, on an input line that was stripped, would not only protect against the initial None value of lastLine but would also filter out any blank lines from the input.

    • If you have to use this with Python 3.0 or Python 2.6, you can’t have a with statement that does two open() calls; but you can just turn it into two nested with statements, each of which does a single open().
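For the last point, the nested form might look like the sketch below. `copy_lines` is a hypothetical helper that just copies lines, standing in for the joining loop, and the temporary file names are made up for the self-check.

```python
import os
import tempfile

def copy_lines(fname_in, fname_out):
    """Copy a file line by line using two nested with statements, one open() each."""
    with open(fname_in) as in_f:
        with open(fname_out, "w") as out_f:
            for line in in_f:
                out_f.write(line)

# Small self-check with temporary files created on the fly.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "in.txt")
dst = os.path.join(tmp_dir, "out.txt")
with open(src, "w") as f:
    f.write("one\ttwo\nthree\tfour\n")

copy_lines(src, dst)

with open(dst) as f:
    copied = f.read()
print(copied == "one\ttwo\nthree\tfour\n")
```

Both files are still closed automatically, inner one first, when the blocks exit.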
