The Archive Base

Editorial Team
Asked: May 23, 2026

I’m having a really hard time with this one, EDIT: I’m putting this edit


I’m having a really hard time with this one.

EDIT: I’m putting this edit at the top. If anyone wants to read the original problem further down, you are very welcome. I am starting to solve this really hard issue but have run into a new problem. The approach I thought of is to return the whole long HTML page split into paragraphs (“p” tags). Up to here everything is working: when I do an assert False, I get everything as I want it. Then, in the template, I loop over the list I sent in the response, and for each value (a paragraph) I create, for now, a div (a page in the book). Here is the problem: I am getting every paragraph three times! Code below.

Output of the assert (part of it):
<p style="text-align: center;">
<span style="font-size:24px;"><strong><u>The Ten Foot Stop</u></strong></span></p>,
<p  style="margin-bottom: 0.2in; text-align: center;">
<span style="font-size:18px;"><font style="font-size: 7pt;">NEWS AND OCCASIONAL ITEMS 
ABOUT THE MEDICAL ASPECTS OF SCUBA DIVING.<br />
POSTED BY ERN CAMPBELL, MD</font></span></p>

template:
{% for article_page in article_pages %}
    {% if article_page %} <!-- don't show an empty paragraph -->
       {{ article_page|safe }}
    {% endif %}
{% endfor %}

This is what shows up in the page:
[The Ten Foot Stop, The Ten Foot Stop, The Ten Foot Stop]
<!-- first paragraph has: The Ten Foot Stop -->

From here on is my original post with the full description of the issue:
I have a very long HTML-like string (no head or body, but it has tags, styles, img tags and everything else in it), and I need to split it into smaller strings by number of words. Each piece needs to fit into a div of a certain size: say every 165 words, more or less, or even better, split to a certain height so that it fits the div size, but I think the second option is much more complicated.

The problem I am having, after trying everything including BeautifulSoup and other methods, is that I can’t find a way to split the string while keeping the tags intact. If I have a style tag, for example, that starts at the 160th character and runs to the 170th, the second page (div) will treat the styles as regular text. As far as I saw, BeautifulSoup only closes “bad” tags; it doesn’t reopen the tags for the “bad” text in the second, third and later divs.

I also thought about using truncate_html_words from text.py, but as the name implies, it only truncates to a number of words and doesn’t save the rest of the text for the next page (or am I wrong?).

Does anyone have an idea how to do this?

OK, I’m starting to figure this out slowly. I will publish it when it is done, since I think people need this kind of thing. As a first step, I broke the HTML string up by tags (in my case, every HTML “p” tag). The next step: how do I count the text, and only the text, in a tag? (PS: the tag might have child tags that wrap the text, and might have multiple child tags, e.g.:

  • a
  • bcd

which needs to return a count of only 2, i.e. two words in the tag.)

10x,
Erez
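
For the last step in the question (counting only the text inside a tag, however many child tags wrap it), here is a minimal sketch using just Python's standard-library html.parser. The names TextWordCounter, count_words and BLOCK_TAGS are made up for illustration and the list of block-level tags is an assumption, so adjust it to the markup you actually have.

```python
from html.parser import HTMLParser

# Assumed set of tags that separate words visually; extend as needed.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "br", "td", "tr", "h1", "h2", "h3"}


class TextWordCounter(HTMLParser):
    """Collects text nodes only; tag names and attributes are never counted."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # A block-level boundary acts like whitespace between words.
        if tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        self.chunks.append(data)

    def word_count(self):
        # Inline tags add no separator, so <b>wo</b>rd is still one word.
        return len("".join(self.chunks).split())


def count_words(html_fragment):
    """Return the number of whitespace-separated words in the visible text."""
    counter = TextWordCounter()
    counter.feed(html_fragment)
    counter.close()
    return counter.word_count()


if __name__ == "__main__":
    print(count_words("<ul><li>a</li><li><b>bcd</b></li></ul>"))
```

If BeautifulSoup is already in use, taking the tag's text and splitting it on whitespace should give a similar count; the sketch above just avoids the extra dependency.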


1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 1:30 am

    Try starting small: define for yourself some sane, limited number of cases that you want to handle (like breaking on <p> tags, just showing alt strings in place of images, and no divs), and see how that works. Then see if you want to tackle image sizing, or just show a hotspot for the user to select to see the image. Then the biggie is detecting divs. Start with just unnested divs, and get things working so that as you break up <p>s, you carry forward the current div’s formatting. Then add nesting with a stack of formatting directives, pushing onto and popping off the stack as you encounter <div> and </div> tags.

    But while your beginnings are simple, I would not be surprised if before long you find you are on the way to developing a complete browser.

    • repagination of text within screen size constraints
    • must handle modal style and formatting tags
    • must handle embedded images of varying size, presumably wrapping text around them

    You didn’t mention needing support for tables. If anchor tags with hrefs are defined, are these supposed to act as clickable hotspots? And God help you if you have to do something meaningful with JavaScript.

    While you are carving off your simple starting point, see just how broad the end product requirements/expectations will have to be. If you start adding tables, frames, fonts, complex style directives, then you are essentially reinventing the web browser. At that point, try to inject some sanity back into the discussion – you are just one person and writing a browser is not a weekend task. Try to get the requirements down to a constrained set of supported tags. Alternatively, look into publicly available/open source browser engines (such as Chromium), which you might be able to adapt, especially in light of your simplified subset of features.
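
The stack idea from this answer can be sketched roughly as follows, using only Python's standard-library html.parser. This is an illustration under stated assumptions, not a finished paginator: Paginator, paginate, words_per_page and VOID_TAGS are hypothetical names, the input is assumed to be reasonably well-formed, and pages are cut by word count only (no height, image or table handling). At each page break the currently open tags are closed, innermost first, and then reopened with their original attributes at the top of the next page.

```python
import re
from html import escape
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Paginator(HTMLParser):
    def __init__(self, words_per_page):
        super().__init__(convert_charrefs=True)
        self.words_per_page = words_per_page
        self.pages = []      # finished pages (HTML strings)
        self.current = []    # pieces of the page being built
        self.count = 0       # words on the current page so far
        self.stack = []      # (tag name, raw start tag) of open elements

    def break_page(self):
        # Close whatever is open, innermost first, and finish the page.
        for tag, _raw in reversed(self.stack):
            self.current.append("</%s>" % tag)
        self.pages.append("".join(self.current))
        # Reopen the same tags, outermost first, on the next page.
        self.current = [raw for _tag, raw in self.stack]
        self.count = 0

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # keeps style/class attributes as written
        self.current.append(raw)
        if tag not in VOID_TAGS:
            self.stack.append((tag, raw))

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />: emit it, nothing to track.
        self.current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only matching closes are honoured; stray ones are ignored.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.current.append("</%s>" % tag)

    def handle_data(self, data):
        # Split into words and whitespace runs, keeping the whitespace.
        for piece in re.split(r"(\s+)", data):
            if not piece:
                continue
            if not piece.isspace():
                if self.count == self.words_per_page:
                    self.break_page()
                self.count += 1
            # Entities were decoded by the parser, so escape text again.
            self.current.append(escape(piece, quote=False))


def paginate(html_text, words_per_page):
    parser = Paginator(words_per_page)
    parser.feed(html_text)
    parser.close()
    if parser.current:
        parser.break_page()  # flush the last, partially filled page
    return parser.pages


if __name__ == "__main__":
    for page in paginate('<div class="x"><p>one two <b>three four</b> five</p></div>', 2):
        print(page)
```

One known rough edge: a tag opened right before a page break is left as an empty element at the end of that page (for example an empty <b></b>). It is harmless in most browsers, but a cleanup pass may be worthwhile.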

