The Archive Base

Editorial Team
Asked: June 2, 2026

I am using Python to generate an ASCII file composed of very long lines.

I am using Python to generate an ASCII file composed of very long lines. This is one example line (let’s say line 100 in the file, ‘[…]’ are added by me to shorten the line):

{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,479 8,485 1,[...]}

If I open the ASCII file that I generated with ipython:

f = open('myfile','r')
print(repr(f.readlines()[99]))

I do obtain the expected line printed correctly (‘[…]’ are added by me to shorten the line):

'{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,479 8,485 1,[...]}\n'

However, if I open this file with the program that is supposed to read it, that program raises an exception, complaining about an unexpected pair after 478 1.
So I tried opening the file with vim. Vim also shows no problem, but if I copy the line as displayed by vim and paste it into another text editor (in my case TextMate), this is the line that I obtain (‘[…]’ are added by me to shorten the line):

{6 1,14 1,[...],264 1,270      2,274 2,[...],478 1,4     79 8,485 1,[...]}

This line indeed has a problem after the pair 478 1.
I tried to generate my lines in different ways (concatenating, with cStringIO, …), but I always obtain this result. When using the cStringIO, for example, the lines are generated as in the following (even though I tried to change this, as well, with no luck):

def _construct_arff(self,attributes,header,data_rows):
  """Create the string representation of a Weka ARFF file.
     *attributes* is a dictionary with attribute_name:attribute_type
       (e.g., 'num_of_days':'NUMERIC')
     *header* is a list of the attributes sorted
       (e.g., ['age','name','num_of_days'])
     *data_rows* is a list of lists with the values, sorted as in the header
       (e.g., [ [88,'John',465],[77,'Bob',223]]"""

  arff_str = cStringIO.StringIO()
  arff_str.write('@relation %s\n' % self.relation_name)

  for idx,att_name in enumerate(header):
    try:
      name = att_name.replace("\\","\\\\").replace("'","\\'")
      arff_str.write("@attribute '%s' %s\n" % (name,attributes[att_name]))
    except UnicodeEncodeError:
      arff_str.write('@attribute unicode_err_%s %s\n' 
                     % (idx,attributes[att_name]))

  arff_str.write('@data\n')
  for data_row in data_rows:
    row = []
    for att_idx,att_name in enumerate(header):
      att_type = attributes[att_name]
      value = data_row[att_idx]
      # numeric attributes can be sparse: None and zeros are not written
      if ((not att_type == constants.ARRF_NUMERIC)
          or not ((value == None) or value == 0)):
        row.append('%s %s' % (att_idx,value))
    arff_str.write('{' + (','.join(row)) + '}\n')
  return arff_str.getvalue()

UPDATE: As you can see from the code above, the function transforms a given set of data to a special arff file format. I noticed that one of the attributes I was creating contained numbers as strings (e.g., ‘1’, instead of 1). By forcing these numbers into integers:

features[name] = int(value)

I recreated the arff file successfully. However, I don’t see how the value can have an impact on the formatting of *att_idx*, which is always an integer, as also pointed out by @JohnMachin and @gnibbler (thanks for your answers, btw). So, even though my code runs now, I still don’t see why this happens. How can the value, if not properly converted to int, influence the formatting of something else?

This file contains the wrongly formatted version.

1 Answer

  1. Editorial Team
    Added an answer on June 2, 2026 at 8:23 am

    The built-in function repr is your friend. It will show you unambiguously what you have in your file.

    Do this:

    f = open('myfile','r')
    print(repr(f.readlines()[99]))
    

    and edit your question to show the result.
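As a small illustration of why repr helps here: printing a line directly lets the terminal or editor interpret tabs and other control characters, while repr spells them out as escapes. The sample string below is invented for the example (it is not taken from the asker's file), so treat it as a sketch of the technique rather than a diagnosis.

```python
# Hypothetical sample line; the embedded tab and carriage return are
# made up for illustration and do not come from the real file.
line = "{6 1,270\t2,4\r79 8}\n"

# repr shows control characters as visible escape sequences.
shown = repr(line)
print(shown)

# List every character outside printable ASCII together with its offset
# (the trailing newline is skipped because it is expected).
suspects = [(i, ch) for i, ch in enumerate(line[:-1]) if not " " <= ch <= "~"]
print(suspects)
```

If the suspects list comes back empty for a line that another program rejects, the problem is more likely in the values themselves than in hidden characters.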

    Update: As to how it got there, it is impossible to tell, because it cannot have been generated by the code that you showed. The value 37 should be a value of att_idx, which comes from enumerate() and so must be an int. You are formatting this int with %s … 37 can’t become 3rubbish7. The loop should also emit att_idx values in order (0, 1, 2, and so on), yet many values are missing from your output and there is nothing conditional inside your loop.
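A minimal sketch of that point, assuming the pair is built with the same '%s %s' expression as in the question (format_pair is a hypothetical helper name used only for this example): the type of the value does not change how the index is rendered.

```python
def format_pair(att_idx, value):
    # Same "%s %s" formatting as the inner loop in the question.
    return '%s %s' % (att_idx, value)

# The index comes from enumerate(), so it is always an int.
as_int = format_pair(37, 1)     # value stored as an integer
as_str = format_pair(37, '1')   # value stored as a string

print(as_int)
print(as_str)
```

Both calls produce the same text, so a string value on its own cannot split or pad the index that precedes it.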

    Please show us the code that you actually ran.

    Update:

    And again, this code won’t run:

    for idx,att_name in enumerate(header):
        arff_str.write("@attribute '%s' %s\n" % (name,attributes[att_name]))
    

    because name is not defined; you probably mean att_name.

    Perhaps we can short-circuit all this stuffing about: post a copy of your output file (zipped if it’s huge) on the web somewhere so that we can see for ourselves what might be disturbing its consumers. Please do edit your question to say which line(s) exhibit(s) the problem.

    By the way, you say some of the data is string rather than integer, and the problem goes away if you coerce the data to int by doing features[name] = int(value) … what is ‘features’?? What is ‘name’??

    Are any of those strings unicode instead of str?

    Update 2 (after bad file posted on net)

    No info was supplied on which line(s) exhibit(s) the problem. As it turned out, no lines exhibited the described problem with attribute 479. I wrote this checking script:

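    from __future__ import print_function  # lets the script run on Python 2 and 3
    import re, sys
    # sample data line:
    # {40 1,101 3,319 2,375 2,525 2,530 bug}
    # Looks like all data lines end in ",530 bug}" or ",530 other}"
    # pattern1: a fully well-formed data line
    pattern1 = r"\{(?:\d+ \d+,)*\d+ \w+\}$"
    matcher1 = re.compile(pattern1).match
    # pattern2: the longest well-formed prefix of "index value," pairs
    pattern2 = r"\{(?:\d+ \d+,)*"
    matcher2 = re.compile(pattern2).match
    # bad_atts: an index followed by whitespace and then no value
    bad_atts = re.compile(r"\D\d+\s+\W").findall
    got_data = False
    with open(sys.argv[1], "r") as f:
        for lino, line in enumerate(f, 1):
            # skip the header; data lines start after the '@data' marker
            if not got_data:
                got_data = line.startswith('@data')
                continue
            if not matcher1(line):
                print()
                print(lino, repr(line))
                m = matcher2(line)
                if m:
                    print("OK up to offset", m.end())
                    print(bad_atts(line))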
    

    Sample output (wrapped at column 80):

    581 '{2 1,7 1,9 1,12 1,13 1,14 1,15 1,16 1,17 1,18 1,21 1,22 1,24 1,25 1,26 1,27
     1,29 1,32 1,33 1,36 1,39 1,40 1,44 1,48 1,49 1,50 1,54 1,57 1,58 1,60 1,67 1,68
     1,69 1,71 1,74 1,75 1,76 1,77 1,80 1,88 1,93 1,101 ,103 6,104 2,109 20,110 3,11
    2 2,114 1,119 17,120 4,124 39,128 5,137 1,138 1,139 1,162 1,168 1,172 18,175 1,1
    76 6,179 1,180 1,181 2,185 2,187 9,188 8,190 1,193 1,195 2,196 4,197 1,199 3,201
     3,202 4,203 5,206 1,207 2,208 1,210 2,211 1,212 5,213 1,215 2,216 3,218 2,220 2
    ,221 3,225 8,226 1,233 1,241 4,242 1,248 5,254 2,255 1,257 4,258 4,260 1,266 1,2
    68 1,269 3,270 2,271 5,273 1,276 1,277 1,280 1,282 1,283 11,285 1,288 1,289 1,29
    6 8,298 1,299 1,303 1,304 11,306 5,308 1,309 8,310 1,315 3,316 1,319 11,320 5,32
    1 11,322 2,329 1,342 2,345 1,349 1,353 2,355 2,358 3,359 1,362 1,367 2,368 1,369
     1,373 2,375 9,377 1,381 4,382 1,383 3,387 1,388 5,395 2,397 2,400 1,401 7,407 2
    ,412 1,416 1,419 2,421 2,422 1,425 2,427 1,431 1,433 7,434 1,435 1,436 2,440 1,4
    49 1,454 2,455 1,460 3,461 1,463 1,467 1,470 1,471 2,472 7,477 2,478 11,479 31,4
    82 6,485 7,487 1,490 2,492 16,494 2,495 1,497 1,499 1,501 1,502 1,503 1,504 11,5
    06 3,510 2,515 1,516 2,517 3,518 1,522 4,523 2,524 1,525 4,527 2,528 7,529 3,530
     bug}\n'
    OK up to offset 203
    [',101 ,']
    
    709 '{101 ,124 2,184 1,188 1,333 1,492 3,500 4,530 bug}\n'
    OK up to offset 1
    ['{101 ,']
    

    So it looks like the attribute with att_idx == 101 can sometimes contain the empty string ''. You need to sort out how this attribute is to be treated. It would help your thinking if you unwound this Byzantine code:

      if ((not att_type == constants.ARRF_NUMERIC)
          or not ((value == None) or value == 0)):
    

    Aside: that “expletive deleted” code won’t run as posted; the constant should be spelled ARFF, not ARRF.

    into:

    if value or att_type != constants.ARFF_NUMERIC:
    

    or maybe just if value: which will filter out all of None, 0, and "". Note that att_idx == 101 corresponds to the attribute “priority” which is given a STRING type in the ARFF file header:

    [line 103] @attribute 'priority' STRING
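To make the difference between these conditions concrete, here is a hedged comparison on a few sample values. The constant values 'NUMERIC' and 'STRING' are stand-ins for whatever the asker's constants module defines, and the three function names are hypothetical.

```python
ARFF_NUMERIC = 'NUMERIC'  # assumed stand-in for constants.ARFF_NUMERIC
ARFF_STRING = 'STRING'    # assumed type name for the 'priority' attribute

def keep_original(att_type, value):
    # Condition from the question, with the ARFF spelling fixed
    # and `is None` in place of `== None`.
    return (not att_type == ARFF_NUMERIC) or not (value is None or value == 0)

def keep_rewritten(att_type, value):
    # The unwound form suggested above.
    return bool(value or att_type != ARFF_NUMERIC)

def keep_truthy(att_type, value):
    # The plain `if value:` form; att_type is ignored.
    return bool(value)

cases = [
    (ARFF_NUMERIC, 0),
    (ARFF_NUMERIC, None),
    (ARFF_NUMERIC, 3),
    (ARFF_STRING, ''),     # the empty 'priority' value seen at index 101
    (ARFF_STRING, 'bug'),
]
for att_type, value in cases:
    print(att_type, repr(value),
          keep_original(att_type, value),
          keep_rewritten(att_type, value),
          keep_truthy(att_type, value))
```

On these cases the first two conditions agree, and both still write the empty string for a STRING attribute; only the plain truthiness test drops it. Which behaviour is right depends on how the consumer is meant to treat an empty priority.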
    

    By the way, your statement about features[name] = int(value) “fixing” the problem is very suspicious; int("") raises an exception.
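A quick check of that last claim, with a hypothetical helper (to_int_or_none is not from the asker's code) showing one way to make the conversion tolerate empty strings:

```python
def to_int_or_none(text):
    # int('') raises ValueError, so map unparseable text to None
    # instead of letting the exception propagate.
    try:
        return int(text)
    except ValueError:
        return None

# Confirm that the bare conversion really fails on an empty string.
try:
    int("")
    empty_raises = False
except ValueError:
    empty_raises = True

print(empty_raises)
print(to_int_or_none("1"), to_int_or_none(""))
```

So if the coercion ran without errors, the empty value at index 101 was presumably never passed through int() in the first place.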

    It may help you to read the warning at the end of this wiki section about sparse ARFF files.
