The Archive Base

Editorial Team
Asked: May 25, 2026


I would like to determine the tab width used in source files indented with spaces.
This is not hard for files with particularly regular indentation, where the leading spaces are only used for indentation, always in multiples of the tab width, and with indentation increasing one level at a time.
But many files will have some departure from this sort of regular indentation, generally for some form of vertical alignment. I’m thus looking for a good heuristic to estimate what tab width was used, allowing some possibility for irregular indentation.

The motivation for this is writing an extension for the SubEthaEdit editor. SubEthaEdit unfortunately doesn’t make the tab width available for scripting, so I’m going to guess at it based on the text.

A suitable heuristic should:

  • Perform well enough for interactive use. I don’t imagine this will be a problem, and just a portion of the text can be used if need be.
  • Be language independent.
  • Return the longest suitable tab width. For example, any file with a tab width of four spaces could also be a file with two-space tabs, if every indentation was actually by twice as many levels. Clearly, four spaces would be the right choice.
  • Always get it right if the indentation is completely regular.
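
For the completely regular case, one possible baseline (a sketch, not something proposed in the question) is to take the greatest common divisor of all leading-space counts. It returns the longest width consistent with every line, but a single vertically aligned line can collapse the result to 1, which is why a more tolerant heuristic is needed. The function name `regular_tab_width` is hypothetical.

```python
import math
import re
from functools import reduce


def regular_tab_width(text):
    """Return the GCD of all leading-space counts, or 0 if no line is indented.

    Assumes indentation uses spaces only and is completely regular.
    """
    indents = []
    for line in text.splitlines():
        # One or more leading spaces followed by a non-whitespace character.
        match = re.match(r"( +)\S", line)
        if match:
            indents.append(len(match.group(1)))
    # math.gcd(0, n) == n, so 0 is a safe starting value.
    return reduce(math.gcd, indents, 0)
```

With indents of 4, 8, and 4 this gives 4; adding one line indented by 5 spaces for alignment drops the answer to 1.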

Some simplifying factors:

  • At least one line can be assumed to be indented.
  • The tab width can be assumed to be at least two spaces.
  • It’s safe to assume that indentation is done with spaces only. It’s not that I have anything against tabs—quite the contrary, I’ll check first if there are any tabs used for indentation and handle it separately. This does mean that indentation mixing tabs and spaces might not be handled properly, but I don’t consider it important.
  • It may be assumed that there are no lines containing only whitespace.
  • Not all languages need to be handled correctly. For example, success or failure with languages like Lisp and Go would be completely irrelevant, since they’re not normally indented by hand.
  • Perfection is not required. The world isn’t going to end if a few lines occasionally need to be manually adjusted.

What approach would you take, and what do you see as its advantages and disadvantages?

If you want to provide working code in your answer, the best approach is probably to use a shell script that reads the source file from stdin and writes the tab width to stdout. Pseudocode or a clear description in words would be just fine, too.

Some Results

To compare strategies, we can apply them to files in the standard libraries shipped with language distributions, since these presumably follow the standard indentation for the language. I’ll consider the Python 2.7 and Ruby 1.8 libraries (system framework installs on Mac OS X 10.7), which have expected tab widths of 4 and 2, respectively. Files that have lines beginning with tab characters, or that have no lines beginning with at least two spaces, are excluded.

Python:

                     Right  None  Wrong
Mode:                 2523     1    102
First:                2169     1    456
No-long (12):         2529     9     88
No-long (8):          2535    16     75
LR (changes):         2509     1    116
LR (indent):          1533     1   1092
Doublecheck (10):     2480    15    130
Doublecheck (20):     2509    15    101

Ruby:

                     Right  None  Wrong
Mode:                  594    29     51
First:                 578     0     54
No-long (12):          595    29     50
No-long (8):           597    29     48
LR (changes):          585     0     47
LR (indent):           496     0    136
Doublecheck (10):      610     0     22
Doublecheck (20):      609     0     23

In these tables, “Right” means the language-standard tab width was determined, “Wrong” means a non-zero tab width not equal to the language-standard width, and “None” means a zero tab width or no answer. The strategies are:

  • “Mode” selects the most frequently occurring change in indentation.
  • “First” takes the indentation of the first indented line.
  • “No-long” is FastAl’s strategy of excluding lines with large indentation and taking the mode; the number indicates the maximum allowed indent change.
  • “LR” is Patrick87’s strategy based on linear regression, with variants based on the change in indentation between lines and on the absolute indentation of lines.
  • “Doublecheck” (couldn’t resist the pun!) is Mark’s modification of FastAl’s strategy, restricting the possible tab width and checking whether half the modal value also occurs frequently; the two rows use different thresholds for selecting the smaller width.
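
As a rough illustration of the two simplest strategies, “First” and “Mode”, here is a minimal Python sketch. It is not the code used to produce the tables. The function names are hypothetical, and counting de-indents by absolute value is an assumption, since the description does not say whether only increases were counted.

```python
from collections import Counter


def leading_spaces(text):
    """Leading-space count of every non-blank line (0 for unindented lines)."""
    return [len(line) - len(line.lstrip(" "))
            for line in text.splitlines() if line.strip()]


def first_strategy(text):
    """Indentation of the first indented line, or 0 if there is none."""
    return next((n for n in leading_spaces(text) if n > 0), 0)


def mode_strategy(text):
    """Most frequent non-zero change in indentation between consecutive lines."""
    indents = leading_spaces(text)
    # Assumption: indents and de-indents are pooled via the absolute difference.
    changes = Counter(abs(b - a) for a, b in zip(indents, indents[1:]) if a != b)
    return changes.most_common(1)[0][0] if changes else 0
```

Both return 0 when nothing is indented, which corresponds to the “None” column.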

1 Answer

    Editorial Team
    Added an answer on May 25, 2026 at 2:55 am

    Your choices are (realistically) 2, 3, 4, 5, 6, 7, or 8.

    I’d scan the first 50-100 lines or so using something like what @FastAl suggested. I’d probably lean toward just blindly pulling the run of spaces from the front of any row with text and taking the length of that whitespace string. Left-trimming lines and running length twice seems like a waste if you have regex available. Also, I’d use the absolute difference, e.g. System.Math.Abs(indent - previndent), so you get de-indent data too. The regex would be this:

    row.matches('^( +)[^ ]') # grab all the spaces from line start to non-space.
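
A minimal Python sketch of that scan, under a few assumptions: the name `indent_change_counts` and the `max_lines` parameter are hypothetical, and the pattern uses `( *)` rather than `( +)` so that unindented lines register as indent 0 instead of being skipped.

```python
import re
from collections import Counter

# Leading spaces followed by any non-space character.
INDENT_RE = re.compile(r"^( *)[^ ]")


def indent_change_counts(lines, max_lines=100):
    """Count absolute indentation changes over the first max_lines lines."""
    counts = Counter()
    prev = None
    for line in lines[:max_lines]:
        match = INDENT_RE.match(line.rstrip("\n"))
        if match is None:  # empty line or spaces only: ignore it
            continue
        indent = len(match.group(1))
        if prev is not None and indent != prev:
            # abs() keeps de-indents as evidence too.
            counts[abs(indent - prev)] += 1
        prev = indent
    return counts
```

The returned counter maps each observed change in indentation to how often it occurred.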
    

    Once you’ve got a statistic for which of the 7 options has the highest count, run with it as the first guess. For 8, 6, and 4 you should check whether there is also a significant count (2nd place, or over 10%, or some other cheapo heuristic) for the smaller widths that divide them: 4 and 2 for 8, 3 for 6, and 2 for 4. If there are a lot of 12s (or 9s), that might also hint that 4 (or 3) is a better choice than 8 (or 6). Dropping or adding more than 2 levels at a time (usually collapsed ending brackets) is super rare.
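
One way to read that selection step, as a hedged sketch: the name `pick_tab_width`, the 10% default threshold, preferring the smallest significant divisor, and breaking ties toward the larger width are all assumptions. The hint from 12s and 9s is left out to keep it short.

```python
def pick_tab_width(counts, threshold=0.10):
    """Pick a tab width from a mapping of indent change -> occurrence count."""
    # Only widths 2..8 are treated as realistic candidates.
    candidates = {width: counts.get(width, 0) for width in range(2, 9)}
    total = sum(candidates.values())
    if total == 0:
        return 0
    # First guess: the most frequent width; ties go to the larger width.
    guess = max(candidates, key=lambda width: (candidates[width], width))
    # Smaller widths to double-check, in ascending order.
    smaller = {8: (2, 4), 6: (3,), 4: (2,)}
    for width in smaller.get(guess, ()):
        if candidates[width] >= threshold * total:
            return width
    return guess
```

For example, counts of 30 for 8 and 20 for 4 give 4, because 4 clears the threshold even though 8 is the mode.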

    Irrelevant mumbling

    The one problem I see is that old .c code in particular has this nasty pattern going on in it:

    code level 0
    /* Fancy comments get weird spacing because there 
     * is an extra space beyond the *
     * looks like one space!
     */
      code indent (2 spaces)
      /* Fancy comments get weird spacing because there 
       * is an extra space beyond the *
       * looks like three spaces!
       */
    
    code level 0
      code indent (2 spaces)
      /* comment at indent level 1
         With no stars you wind up with 2 spaces + 3 spaces.
      */
    

    Yuck. I don’t know how you deal with comment standards like that. For C-like code you might have to handle comments specially in version 2.0, but I would just ignore it for now.

    Your final issue is dealing with lines that don’t match your assumptions. My suggestion would be to “tab” them to depth and then leave the extra spaces in place. If you have to correct them, I’d round to the nearest level like this: rowtabdepth = ceiling((rowspacecount - (tabwidth/2)) / tabwidth)
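
Both options can be sketched in a few lines of Python. The function names are hypothetical. The second function transcribes the formula directly, which rounds to the nearest level and rounds exact halves down.

```python
import math


def keep_extra_spaces(space_count, tab_width):
    """Tab to depth and keep the remainder: returns (depth, leftover spaces)."""
    return divmod(space_count, tab_width)


def row_tab_depth(space_count, tab_width):
    """Nearest indentation level; an exact half-level rounds down."""
    return math.ceil((space_count - tab_width / 2) / tab_width)
```

With a tab width of 4, a row with 7 leading spaces is either depth 1 plus 3 extra spaces, or corrected to depth 2; a row with 6 leading spaces is corrected to depth 1.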
