Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 529643
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 13, 20262026-05-13T09:06:48+00:00 2026-05-13T09:06:48+00:00

Is there an accepted way to deal with regular expressions in Ruby 1.9 for

  • 0

Is there an accepted way to deal with regular expressions in Ruby 1.9 for which the encoding of the input is unknown? Let’s say my input happens to be UTF-16 encoded:

x  = "foo<p>bar</p>baz"
y  = x.encode('UTF-16LE')
re = /<p>(.*)<\/p>/

x.match(re) 
=> #<MatchData "<p>bar</p>" 1:"bar">

y.match(re)
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)

My current approach is to use UTF-8 internally and re-encode (a copy of) the input if necessary:

if y.methods.include?(:encode)  # Ruby 1.8 compatibility
  if y.encoding.name != 'UTF-8'
    y = y.encode('UTF-8')
  end
end

y.match(/<p>(.*)<\/p>/u)
=> #<MatchData "<p>bar</p>" 1:"bar">

However, this feels a little awkward to me, and I wanted to ask if there’s a better way to do it.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-13T09:06:48+00:00Added an answer on May 13, 2026 at 9:06 am

    As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?

    Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.

    # Utility function to make transcoding the regex simpler.
    def get_regex(pattern, encoding='ASCII', options=0)
      Regexp.new(pattern.encode(encoding),options)
    end
    
    
    
      # Inside code looping through lines of input.
      # The variables 'regex' and 'line_encoding' should be initialized previously, to
      # persist across loops.
      if line.methods.include?(:encoding)  # Ruby 1.8 compatibility
        if line.encoding != last_encoding
          regex = get_regex('<p>(.*)<\/p>',line.encoding,16) # //u = 00010000 option bit set = 16
          last_encoding = line.encoding
        end
      end
      line.match(regex)
    

    In the pathological case (where the input encoding changes every line) this would be just as slow, since you’re re-encoding the regex every single time through the loop. But in 99.9% of situations where the encoding is constant for an entire file of hundreds or thousands of lines, this will result in a vast reduction in re-encoding.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

There doesn't seem to be an accepted way of sending down a header parameter
Is there a way to convert from System.Windows.Forms.HtmlElement to mshtml.IHTMLElemenet3? --edit after answer accepted
I know how I use these terms, but I'm wondering if there are accepted
Is there a widely accepted class for dealing with URLs in PHP? Things like:
Let's say I have a report page that can be customized via Javascript. Say
Is there a way to test if a collection is already initialized? try-catch only?
There is a conversion process that is needed when migrating Visual Studio 2005 web
There are two weird operators in C#: the true operator the false operator If
There are two popular closure styles in javascript. The first I call anonymous constructor
There is previous little on the google on this subject other than people asking

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.