I wrote a ruby script to process a large amount of documents and use

Asked: May 24, 2026 (2026-05-24T06:27:31+00:00)

I wrote a Ruby script to process a large number of documents, and I use the following regex to extract URIs from a document's string representation:

#Taken from: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
URI_REGEX = /
(                           # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                # URL protocol and colon
    (?:
      \/{1,3}                        # 1-3 slashes
      |                             #   or
      [a-z0-9%]                     # Single letter or digit or '%'
    )
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}\/  # looks like domain name followed by a slash
  )
  (?:                           # One or more:
    [^\s()<>]+                      # Run of non-space, non-()<>
    |                               #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                           # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)/xi

It works well for 99.9 percent of all documents, but the script always hangs when it encounters the following token in one of the documents:

token = "synsem:local:cat:(subcat:SubMot,adjuncts:Adjs,subj:Subj),"

I am using the standard Ruby regexp operator, token =~ URI_REGEX, and I don't get any exception or error message.

First I tried to solve the problem by wrapping the regex evaluation in a Timeout::timeout block, but this degrades performance too much.

Any other ideas on how to solve this problem?

1 Answer

Editorial Team · Answer 1 · 2026-05-24T06:27:31+00:00

Why reinvent the wheel?

require 'uri'
uri_list = URI.extract("Text containing URIs.")
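
Note that URI.extract is marked obsolete in current Ruby documentation, although it is still available. If you use it, the optional second argument restricts matching to specific schemes. The sketch below assumes you only want http and https links; extract_http_uris is a hypothetical helper name, not part of the library. Without the scheme list, any "scheme:" prefix, possibly including the token from the question, may be treated as a URI.

```ruby
require 'uri'

# Hypothetical helper: return only the http(s) URIs found in the text.
# The scheme list is an assumption; adjust it to the schemes you need.
def extract_http_uris(text)
  URI.extract(text, %w[http https])
end

# The token from the question contains no http(s) scheme.
token = "synsem:local:cat:(subcat:SubMot,adjuncts:Adjs,subj:Subj),"
p extract_http_uris(token)

# Ordinary prose with two links.
p extract_http_uris("Visit http://example.com/docs and https://www.ruby-lang.org/en/ for details")
```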

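If you need to keep the custom pattern, a likely cause of the hang is catastrophic backtracking. This is an assumption based on the pattern's structure, not something confirmed in the question. In the balanced-parens sub-pattern, a quantified run sits inside a quantified group, so when the match fails after the closing parenthesis (here because of the trailing comma), the engine may retry every way of splitting the text between the parentheses. A minimal sketch of one possible fix wraps that run in an atomic group, (?>...), so it is never re-split. SAFER_URI_REGEX is a hypothetical name, and the inner groups are made non-capturing for brevity.

```ruby
# Variant of the question's pattern; only the balanced-parens parts differ.
SAFER_URI_REGEX = /
(                                   # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                    # URL protocol and colon
    (?:
      \/{1,3}                       # 1-3 slashes
      |                             #   or
      [a-z0-9%]                     # single letter or digit or '%'
    )
    |                               #   or
    www\d{0,3}[.]                   # "www.", "www1.", "www2." and so on
    |                               #   or
    [a-z0-9.\-]+[.][a-z]{2,4}\/     # looks like a domain name followed by a slash
  )
  (?:                               # One or more:
    [^\s()<>]+                      # run of non-space, non-paren, non-angle chars
    |                               #   or
    \((?:(?>[^\s()<>]+)|\([^\s()<>]+\))*\)  # balanced parens, runs made atomic
  )+
  (?:                               # End with:
    \((?:(?>[^\s()<>]+)|\([^\s()<>]+\))*\)  # balanced parens, runs made atomic
    |                               #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]  # not a space or one of these punct chars
  )
)/xi

token = "synsem:local:cat:(subcat:SubMot,adjuncts:Adjs,subj:Subj),"
# String#[] with a regex and a group index returns that capture, or nil.
puts token[SAFER_URI_REGEX, 1].inspect
puts "See http://example.com/foo_(bar) now"[SAFER_URI_REGEX, 1].inspect
```

With this variant the token should return promptly, but the pattern still accepts it as URL-like because of its "synsem:" prefix. Filter by scheme afterwards if that is not what you want.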