The Archive Base

Editorial Team
Asked: May 23, 2026

I imagine this is common enough that it’s a solved problem, but being a bit of a newbie with Loofah and Nokogiri I haven’t found the solution yet.

I’m using Loofah, an HTML scrubber library that wraps Nokogiri, to scrub some HTML text for display. However, that text sometimes happens to contain things like e-mail addresses between < and > characters, for example, < foo@domain.com >. Loofah treats that as an HTML or XML tag and strips it from the text.

Is there a way to prevent this from happening while still doing a good job of scrubbing away the actual tags?

Edit: Here’s a failing test case:

require 'test/unit'
require 'test/unit/ui/console/testrunner'
require 'nokogiri'

MAGICAL_REGEXP = /<([^(?:\/|!\-\-)].*)>/

def filter_html(content)
  # Current approach in a gist: We capture content enclosed in angle brackets.
  # Then, we check if the excerpt right after the opening bracket is a valid HTML
  # tag. If it's not, we substitute the matched content (which is the captured
  # content enclosed in angle brackets) for the captured content enclosed in
  # the HTML entities for the angle brackets. This does not work with nested
  # HTML tags, since regular expressions are not meant for this.

  content.to_s.gsub(MAGICAL_REGEXP) do |excerpt|
    capture = $1
    Nokogiri::HTML::ElementDescription[capture.split(/[<> ]/).first] ? excerpt : "&lt;#{capture}&gt;"
  end
end

class HTMLTest < Test::Unit::TestCase
  def setup
    @raw_html = <<-EOS
<html>
<foo@bar.baz>
<p><foo@<b class="highlight">bar</b>.baz></p>
<p>
<foo@<b class="highlight">bar</b>.baz>
</p>
< don't erase this >
</html>
EOS

    @filtered_html = <<-EOS
<html>
&lt;foo@bar.baz&gt;
<p>&lt;foo@<b class="highlight">bar</b>.baz&gt;</p>
<p>
&lt;foo@<b class="highlight">bar</b>.baz&gt;
</p>
&lt; don't erase this &gt;
</html>
EOS
  end

  def test_filter_html
    assert_equal(@filtered_html, filter_html(@raw_html))
  end
end

# Can you make this test pass?
Test::Unit::UI::Console::TestRunner.run(HTMLTest)

We’re currently using some pretty evil regex hackery to try and accomplish this, but as the comment above states, it doesn’t work for tags “nested” inside non-tags. And we actually want to preserve the <b class="highlight"> elements as well.

The sample above isn’t using Loofah, but the application itself does in other places, so it wouldn’t be hard to add it here. We’re just not sure which configuration options we should use, if any.
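For anyone wondering why the regex in the test case above falls over on nested input, here is a minimal sketch that runs it against a single line. The sample line is made up for illustration, and only the standard library is needed. Note that `[^(?:\/|!\-\-)]` is a character class, not a lookahead, so it only rejects one character from the set `( ? : / | ! - )`.

```ruby
# The regex from the failing test case, copied verbatim.
MAGICAL_REGEXP = /<([^(?:\/|!\-\-)].*)>/

# Illustrative input: a real tag wrapping an e-mail address in angle brackets.
line = '<p><foo@bar.baz></p>'

# scan returns one array of captures per match; flatten gives the captures.
captures = line.scan(MAGICAL_REGEXP).flatten

# The greedy `.*` runs to the last `>` on the line, so the whole line is one
# match and the e-mail address is never examined on its own.
puts captures.inspect

# The tag-name check only looks at the first token of the capture. Here that
# token is "p", a valid HTML element, so the line would be left unescaped.
first_token = captures.first.split(/[<> ]/).first
puts first_token
```

So the address is swallowed by the match for the surrounding `<p>`, which is the failure described in the comment in `filter_html`.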

1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 3:04 am

    The main issue was real HTML tags nested inside non-tag angle brackets, for example <foo@<b class="highlight">bar</b>.baz>, which Nokogiri mangles completely. We solved it by temporarily removing those HTML tags, escaping the non-HTML angle brackets, and then putting the HTML tags back. It sounds a little hackish, but it works for our case. Our first goal was escaping e-mail addresses enclosed in angle brackets, but this approach should (supposedly) work for any kind of text.

    # Does not run on ruby 1.9
    
    require 'test/unit'
    require 'test/unit/ui/console/testrunner'
    require 'nokogiri'
    require 'securerandom'
    
    def filter_html(content)
      # Used to mark highlighted words.
      random_hex = SecureRandom.hex(6)
    
      # Remove highlighting.
      highlighted_terms = []
      without_highlighting = content.to_s.gsub(/<b class="highlight">(.*?)<\/b>/) do |match|
        highlighted_terms << $1
        "highlight-#{random_hex}:#{$1}"
      end
    
      # Escape non-HTML angle brackets.
      escaped_content = without_highlighting.to_s.gsub(/<(?:\s*\/)?([^!\-\-].*?)>/) do |excerpt|
        capture = $1
        tag = capture.split(/[^a-zA-Z1-6]/).reject(&:empty?).first
        !!Nokogiri::HTML::ElementDescription[tag] ? excerpt : "&lt;#{capture}&gt;"
      end
    
      # Add highlighting back.
      highlighted_terms.uniq.each do |term|
        # Escape the term so regex metacharacters in highlighted text match literally.
        escaped_content.gsub!(/highlight-#{random_hex}:(#{Regexp.escape(term)})/) do |match|
          "<b class=\"highlight\">#{$1}</b>"
        end
      end
    
      escaped_content
    end
    
    class HTMLTest < Test::Unit::TestCase
      def setup
        @raw_html = <<-EOS
          <html>
            <foo@bar.baz>
            <p><foo@<b class="highlight">bar</b>.baz></p>
            <p>
              <foo@<b class="highlight">bar</b>.baz>
            </p>
            <    don't erase this   >
          </html>
        EOS
    
        @filtered_html = <<-EOS
          <html>
            &lt;foo@bar.baz&gt;
            <p>&lt;foo@<b class="highlight">bar</b>.baz&gt;</p>
            <p>
              &lt;foo@<b class="highlight">bar</b>.baz&gt;
            </p>
            &lt;    don't erase this   &gt;
          </html>
        EOS
      end
    
      def test_filter_html
        assert_equal(@filtered_html, filter_html(@raw_html))
      end
    end
    
    # It passes!
    Test::Unit::UI::Console::TestRunner.run(HTMLTest)
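
A possible variation on the same remove, escape, restore idea, as a rough sketch rather than a drop-in replacement. It parks each highlight tag behind an indexed placeholder, so restoring does not depend on matching the highlighted text again. `KNOWN_TAGS` and `escape_non_tags` are hypothetical names: the list is a small hand-written stand-in for `Nokogiri::HTML::ElementDescription`, used so the sketch runs with the standard library only. It also assumes a tag name must be followed by whitespace, `/`, or the end of the tag, which is stricter than the code above.

```ruby
require 'securerandom'

# Hypothetical whitelist standing in for Nokogiri::HTML::ElementDescription.
KNOWN_TAGS = %w[html head body div span p a b i em strong br ul ol li].freeze

def escape_non_tags(content)
  token = SecureRandom.hex(6)
  parked = []

  # 1. Park each highlight tag (and each </b>) behind an indexed placeholder.
  without_tags = content.to_s.gsub(/<b class="highlight">|<\/b>/) do |tag|
    parked << tag
    "[[#{token}:#{parked.size - 1}]]"
  end

  # 2. Escape every <...> span whose name is not a known tag.
  escaped = without_tags.gsub(/<([^<>]*)>/) do |excerpt|
    inner = Regexp.last_match(1)
    next excerpt if inner.start_with?('!') # leave comments and doctypes alone

    # A tag name must be followed by whitespace, "/", or the end of the span.
    name = inner[/\A\/?([a-zA-Z][a-zA-Z0-9]*)(?=[\s\/]|\z)/, 1]
    KNOWN_TAGS.include?(name.to_s.downcase) ? excerpt : "&lt;#{inner}&gt;"
  end

  # 3. Put the parked tags back by index.
  escaped.gsub(/\[\[#{token}:(\d+)\]\]/) { parked[Regexp.last_match(1).to_i] }
end

puts escape_non_tags('<p><foo@<b class="highlight">bar</b>.baz></p>')
```

Because the name check looks at what follows the name, something like `<a@b.com>` is escaped rather than treated as an `<a>` tag. The trade-off is that the whitelist has to be maintained by hand unless you swap the Nokogiri lookup back in.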
    