For a blog like project, I want to get the first few paragraphs, headers,

Asked: May 17, 2026 (2026-05-17T18:44:06+00:00)

For a blog-like project, I want to extract the first few paragraphs, headers, lists, or whatever else fits within a given number of characters from a Markdown-generated HTML fragment, to display as a summary.

So if I have

<h1>hello world</h1>
<p>Lets say these are 100 chars</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>

Assume I want the summary to cover the text within the first 150 characters. It does not have to be exact: I could just take the first 150 characters, including tags, and go on from there, but that would probably leave artifacts at the tail that are harder to handle. The result should give me the h1, the p and the ul, but not the final p (which would be truncated). If the first element alone has more than 150 characters, I would take the full first element.

How could I do this? With XPath or a regex? I am a bit out of ideas on this…

Edit

First I want to give a big THANK YOU to all of you who replied!

While I got really great answers in this thread, I found it much easier to hook in before the Markdown interpreter kicks in: take the first n text blocks separated by \r\n\r\n and pass just those on for Markdown rendering.

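  class String
    # Returns the leading Markdown blocks (separated by \r\n\r\n) whose
    # combined length, separators included, stays within +length+.
    def summarize_md(length)
      blocks = split(/\r\n\r\n/)
      sum = ""
      blocks.each do |block|
        # Stop before the block that would exceed the limit.
        break if sum.length + block.length > length
        sum += "#{block}\r\n\r\n"
      end
      sum
    end
  end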

While one could probably reduce this code to a one-liner, it is still much simpler and more CPU-friendly than any of the proposed solutions.
Anyway, since my question could be read as if the HTML were the starting point (and not the Markdown text), I’ll just accept the first answer… I hope that’s fair…
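
The pre-render idea from the edit above can also be written as a standalone method instead of a String patch. This is only a sketch under a few assumptions: the name summarize_markdown is hypothetical, blocks are taken to be separated by one or more blank lines with either \n or \r\n endings, and the first block is always kept even when it alone exceeds the limit (as the question asks for).

```ruby
# Hypothetical helper, not taken from the original post.
# Keeps leading Markdown blocks while their combined length fits in `limit`.
def summarize_markdown(source, limit)
  # Split on blank lines, accepting both Unix and Windows line endings.
  blocks = source.strip.split(/(?:\r?\n){2,}/)
  kept = []
  total = 0
  blocks.each do |block|
    # Always keep the first block; afterwards stop before exceeding the limit.
    break if !kept.empty? && total + block.length > limit
    kept << block
    total += block.length
  end
  kept.join("\n\n")
end

md = "# hello world\n\nfirst paragraph\r\n\r\n* a list item\n\nsome other text"
puts summarize_markdown(md, 45)
```

The returned string can then be handed to whichever Markdown renderer the project already uses.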

1 Answer

Editorial Team · Answer 1 · 2026-05-17T18:44:07+00:00

XPath is the most robust and flexible option. Here’s a sample app:

require 'rubygems'
require 'nokogiri'

html = <<End
<h1>hello world</h1>
<p>Lets say these are 100 chars.......................................................................</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
End

LIMIT = 150
summary = ""

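# Parse the HTML and walk every text node in document order.
doc = Nokogiri::HTML.parse(html)
doc.xpath('//text()').each do |node|
  text = node.text
  # Stop before the text node that would reach the limit.
  break if summary.length + text.length >= LIMIT
  summary << text
end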

puts summary
puts summary.length

The XPath expression //text() simply selects all the text nodes in the document. If you want to be more specific about which elements you are interested in, you can narrow the expression.
