Editorial Team
Asked: June 14, 2026

I’m using an open source method that parses the html text into an NSString.


The resulting strings have large amounts of white space between the first couple of paragraphs, but only one blank line between subsequent paragraphs. An example of the output is logged under EDIT below.

Below is the method I’m calling. I’ve changed only two lines of the code: for stopCharacters and newLineAndWhitespaceCharacters, I removed \n from the character set, because when it was included the entire text came out as one long paragraph.

- (NSString *)stringByConvertingHTMLToPlainText {

    // Pool
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    // Character sets
    NSCharacterSet *stopCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@"< \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];
    NSCharacterSet *newLineAndWhitespaceCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@" \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];
    NSCharacterSet *tagNameCharacters = [NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"];

    // Scan and find all tags
    NSMutableString *result = [[NSMutableString alloc] initWithCapacity:self.length];
    NSScanner *scanner = [[NSScanner alloc] initWithString:self];
    [scanner setCharactersToBeSkipped:nil];
    [scanner setCaseSensitive:YES];
    NSString *str = nil, *tagName = nil;
    BOOL dontReplaceTagWithSpace = NO;
    do {

        // Scan up to the start of a tag or whitespace
        if ([scanner scanUpToCharactersFromSet:stopCharacters intoString:&str]) {
            [result appendString:str];
            str = nil; // reset
        }

        // Check if we've stopped at a tag/comment or whitespace
        if ([scanner scanString:@"<" intoString:NULL]) {

            // Stopped at a comment or tag
            if ([scanner scanString:@"!--" intoString:NULL]) {

                // Comment
                [scanner scanUpToString:@"-->" intoString:NULL];
                [scanner scanString:@"-->" intoString:NULL];

            } else {

                // Tag - remove and replace with space unless it's
                // a closing inline tag then dont replace with a space
                if ([scanner scanString:@"/" intoString:NULL]) {

                    // Closing tag - replace with space unless it's inline
                    tagName = nil; dontReplaceTagWithSpace = NO;
                    if ([scanner scanCharactersFromSet:tagNameCharacters intoString:&tagName]) {
                        tagName = [tagName lowercaseString];
                        dontReplaceTagWithSpace = ([tagName isEqualToString:@"a"] ||
                                                   [tagName isEqualToString:@"b"] ||
                                                   [tagName isEqualToString:@"i"] ||
                                                   [tagName isEqualToString:@"q"] ||
                                                   [tagName isEqualToString:@"span"] ||
                                                   [tagName isEqualToString:@"em"] ||
                                                   [tagName isEqualToString:@"strong"] ||
                                                   [tagName isEqualToString:@"cite"] ||
                                                   [tagName isEqualToString:@"abbr"] ||
                                                   [tagName isEqualToString:@"acronym"] ||
                                                   [tagName isEqualToString:@"label"]);
                    }

                    // Replace tag with string unless it was an inline
                    if (!dontReplaceTagWithSpace && result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "];

                }

                // Scan past tag
                [scanner scanUpToString:@">" intoString:NULL];
                [scanner scanString:@">" intoString:NULL];

            }

        } else {

            // Stopped at whitespace - replace all whitespace and newlines with a space
            if ([scanner scanCharactersFromSet:newLineAndWhitespaceCharacters intoString:NULL]) {
                if (result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "]; // Dont append space to beginning or end of result
            }

        }

    } while (![scanner isAtEnd]);

    // Cleanup
    [scanner release];

    // Decode HTML entities and return
    NSString *retString = [[result stringByDecodingHTMLEntities] retain];
    [result release];

    // Drain
    [pool drain];

    // Return
    return [retString autorelease];

}

EDIT:

Here is the NSLog output of the string. I pasted only the first few paragraphs.

Mitt Romney spent the past six years running for president. After his loss to President Barack Obama, he'll have to chart a different course.  


 His initial plan: spend time with his family. He has five sons and 18 grandchildren, with a 19th on the way.  






 "I don't look at postelection to be a time of regrouping. Instead it's a time of forward focus," Romney told reporters aboard his plane Tuesday evening as he returned to Boston after the final campaign stop of his political career. "I have, of course, a family and life important to me, win or lose."  

 The most visible member of that family — wife Ann Romney — says neither she nor her husband will seek political office again.  

etc….

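// Log the codes of the 25 characters that follow "chart a different course." (the phrase itself is 25 characters long)
for (int j = 25; j < 50; j++) {
    // characterAtIndex: returns a unichar, so keep the full 16-bit value instead of truncating it to char
    unichar test = [completeTrimmed characterAtIndex:([completeTrimmed rangeOfString:@"chart a different course."].location + j)];

    NSLog(@"%d", test);
}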

2012-11-11 17:15:57.668 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 72
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 115
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 110
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 116
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 97
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 108
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 112
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 108
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 97

1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 1:55 pm

    I tried the code from the question, and this is how I fixed it:

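    // Decode entities, then collapse runs of spaces/tabs and runs of newlines
    NSString *retString = [result stringByDecodingHTMLEntities];
    retString = [retString stripDuplicateCharactersInSet:[NSCharacterSet whitespaceCharacterSet] withString:@" "];
    retString = [retString stripDuplicateCharactersInSet:[NSCharacterSet newlineCharacterSet] withString:@"\n"];

    // Retain only the final string so it survives [pool drain] and nothing leaks
    [retString retain];
    [result release];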
    

    I declared a category method on NSString:

    - (NSString *)stripDuplicateCharactersInSet:(NSCharacterSet *)characterSet withString:(NSString *)joiningString;
    

    The implementation is as follows,

    - (NSString *)stripDuplicateCharactersInSet:(NSCharacterSet *)characterSet withString:(NSString *)joiningString {
    
        NSMutableString *originalStr = [NSMutableString string];
    
        if (!self) {
            return nil;
        }
    
        NSArray *componentsArray = [self componentsSeparatedByCharactersInSet:characterSet];
    
        int counter = 0;
        for (NSString *stringComponent in componentsArray) {
    
            counter ++;
    
            if ((stringComponent) && ([stringComponent length] > 0) && (![stringComponent isEqualToString:@" "]) && ((![stringComponent isEqualToString:@"\n"]) || (![joiningString isEqualToString:@"\n"]))) {
    
                if ([componentsArray count] == counter) {
                    [originalStr appendFormat:@"%@", stringComponent];                
                } else {
                    [originalStr appendFormat:@"%@%@", stringComponent, joiningString];
                }
            }
        }
    
        return originalStr;
    }
    

    Add the above method to the NSString+HTML.m file as a category on NSString. In the HTML you gave, whitespace and newline characters were interleaved many times over, so stripping newlines alone did not work. The code above therefore removes duplicates in two passes, first whitespace and then newlines: each pass splits the string on the given character set, skips components that are empty or consist only of a space or a newline, and appends the remaining components to the result string.
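To see how the mixing happens, here is a minimal sketch that reproduces only the whitespace branch of the question's loop, with no tag handling. The helper name `ScanWithStopSet` and the sample input are assumptions made for illustration; they are not part of the original code. With \n left out of the stop set, `scanUpToCharactersFromSet:` copies each newline into the result verbatim, and every run of spaces around it is added back as its own single space.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: a cut-down version of the question's scan loop.
// Text outside stopSet is copied as-is; each run of stopSet characters
// becomes one space (except at the very start or end of the input).
static NSString *ScanWithStopSet(NSString *input, NSCharacterSet *stopSet) {
    NSMutableString *result = [NSMutableString string];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    [scanner setCharactersToBeSkipped:nil];

    while (![scanner isAtEnd]) {
        NSString *chunk = nil;

        // Anything not in stopSet is appended unchanged, newlines included
        // when "\n" is not a member of stopSet.
        if ([scanner scanUpToCharactersFromSet:stopSet intoString:&chunk]) {
            [result appendString:chunk];
        }

        // A run of stopSet characters collapses to a single space.
        if ([scanner scanCharactersFromSet:stopSet intoString:NULL]) {
            if (result.length > 0 && ![scanner isAtEnd]) {
                [result appendString:@" "];
            }
        }
    }
    return result;
}
```

With the stop set `@" \t\r"`, the input `@"one.  \n  \n two"` comes back as `@"one. \n \n two"`: the newlines survive, each separated by a space. With `@" \t\r\n"` the same input collapses to `@"one. two"`, which matches the "one long paragraph" behaviour described in the question.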

    Alternatively, you can try this:

    // Retain only the final string so it survives [pool drain] and nothing leaks
    NSString *retString = [[[result stringByDecodingHTMLEntities] stripDuplicateNewlineCharacters] retain];
    [result release];
    

    The method is defined as,

    - (NSString *)stripDuplicateNewlineCharacters {
    
        NSMutableString *originalStr = [NSMutableString string];
    
        if (!self) {
            return nil;
        }
    
        NSArray *componentsArray = [self componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];
    
        int counter = 0;
        for (NSString *stringComponent in componentsArray) {
    
            counter ++;
    
            // Collapse runs of spaces; one replacement pass only halves a run, so repeat until none remain
            while ([stringComponent rangeOfString:@"  "].location != NSNotFound) {
                stringComponent = [stringComponent stringByReplacingOccurrencesOfString:@"  " withString:@" "];
            }
    
            if ((stringComponent) && ([stringComponent length] > 0) && (![stringComponent isEqualToString:@" "]) && (![stringComponent isEqualToString:@"\n"])) {
    
                if ([componentsArray count] == counter) {
                    [originalStr appendFormat:@"%@", stringComponent];
                } else {
                    [originalStr appendFormat:@"%@\n", stringComponent];
                }
            }
        }
    
        return originalStr;
    }
    

    In this case, the duplicate white spaces are removed in the method itself while removing new line characters.
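Another option is to treat spaces and newlines together and collapse each mixed run in a single scan. The following is a minimal sketch under stated assumptions, not part of NSString+HTML: the function name `CollapseWhitespaceRuns` is hypothetical, and the rule it applies (a run that contains a newline becomes one "\n", any other run becomes one space, leading and trailing runs are dropped) is one possible choice rather than the only sensible one.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: collapses every run of whitespace and newline
// characters in one pass over the string.
static NSString *CollapseWhitespaceRuns(NSString *input) {
    NSCharacterSet *blank = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSCharacterSet *newlines = [NSCharacterSet newlineCharacterSet];
    NSMutableString *result = [NSMutableString stringWithCapacity:input.length];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    [scanner setCharactersToBeSkipped:nil];

    while (![scanner isAtEnd]) {
        NSString *text = nil;
        NSString *run = nil;

        // Copy the next stretch of non-blank text unchanged.
        if ([scanner scanUpToCharactersFromSet:blank intoString:&text]) {
            [result appendString:text];
        }

        // Consume the whole blank run that follows, whatever its mix.
        if ([scanner scanCharactersFromSet:blank intoString:&run]) {
            // Drop runs at the very start or very end of the input.
            if (result.length == 0 || [scanner isAtEnd]) {
                continue;
            }
            // A run containing any newline becomes one paragraph break;
            // otherwise it becomes one space.
            BOOL hasNewline = [run rangeOfCharacterFromSet:newlines].location != NSNotFound;
            [result appendString:(hasNewline ? @"\n" : @" ")];
        }
    }
    return result;
}
```

Applied to a string shaped like the logged output, for example `@"course.  \n  \n  \n His initial plan:   spend time.  \n"`, this sketch returns `@"course.\nHis initial plan: spend time."`. If a blank line between paragraphs is wanted, appending `@"\n\n"` instead of `@"\n"` would be the only change needed.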

© 2021 The Archive Base. All Rights Reserved