After considerable search on SO and Google, I resort to posting a new question.

Asked: June 14, 2026 (2026-06-14T14:01:49+00:00)

After considerable searching on SO and Google, I am resorting to posting a new question. I am working in TextWrangler, trying to compose a regular expression that gives me the shortest matches of a multi-line pattern.

Basically,

ہے\tVM

is the string I am looking for (a word in Arabic script, separated by a tab character from its part-of-speech tag). What makes it difficult is that I want to find every individual sentence containing that string, without a match running across sentence boundaries. Here is what I have so far:

/(<Sentence id='\d+'>(?:[^<]|<(?!\/Sentence>))*ہے\tVM(?:[^<]|<(?!\/Sentence>))*<\/Sentence>)/
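Spelled out with Perl's /x flag, the pattern above reads roughly as follows. This is only a sketch: the sample text and the names $sentence_re and sentences_with_token are made up for illustration, and it assumes each sentence is wrapped in <Sentence id='N'> ... </Sentence> exactly as in the pattern.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals

my $word = "ہے";

# Same pattern as above, laid out with /x so each part can be labelled.
my $sentence_re = qr{
    <Sentence[ ]id='\d+'>            # opening tag
    (?: [^<] | <(?!/Sentence>) )*    # any text that stays inside this sentence
    $word \t VM                      # the token, a tab, and its tag
    (?: [^<] | <(?!/Sentence>) )*    # remainder of the same sentence
    </Sentence>                      # closing tag
}x;

# Hypothetical helper: return every sentence in the input that contains the token.
sub sentences_with_token {
    my ($input) = @_;
    my @found = $input =~ /$sentence_re/g;
    return @found;
}

# Made-up sample: sentence 1 lacks the token, sentence 2 contains it.
my $sample = "<Sentence id='1'>\nوہ\tPRP\n</Sentence>\n"
           . "<Sentence id='2'>\nیہ\tPRP\nہے\tVM\n</Sentence>\n";

my @hits = sentences_with_token($sample);
printf "%d matching sentence(s)\n", scalar @hits;
```

Because the lookahead stops each repeated group at the next </Sentence>, a match cannot begin in one sentence and end in another, so only the second sentence in the sample is returned.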

The files I am looking at are encoded in CML, so part of my question is whether anyone is aware of a CML parser for Mac.

Another obvious alternative is to write a Perl script; here again, I would be thankful for any advice pointing to a simple solution.

My current script is:

use open ':encoding(utf8)';
use Encode;
binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");

my $word = Encode::decode_utf8("ہے");

my @files = glob("*.posn");

foreach my $file (@files) {
    open FILE, "<$file" or die "Error opening file $file ($!)";
    my $file = do {local $/; <FILE>};
    close FILE or die $!;
    if ($file =~ /(<Sentence id='\d+'>(?:[^<]|<(?!\/Sentence>))*$word\tVM(?:[^<]|<(?!\/Sentence>))*<\/Sentence>)/g) {
            print STDOUT "$1\n\n\n\n";
            push(@matches, "$1\n\n");
            }
}

open(OUTPUT, ">matches.txt");
print OUTPUT "@matches";
close(OUTPUT);

1 Answer

Editorial Team · Answer 1 · 2026-06-14T14:01:50+00:00

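Your input may contain more than one matching sentence per file, so search for all of them. In your script the match is tested inside if (...), which is scalar context, so only the first match in each file ends up in $1. Evaluating the match in list context with /g returns every match. Note also that @matches is never declared in your script.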

I believe your code should look like this:

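use strict;
use warnings;
use open ':encoding(utf8)';   # default UTF-8 layer for files opened below
use Encode;

binmode(STDOUT, ":utf8");
binmode(STDIN,  ":utf8");

# No "use utf8" here, so the literal is raw bytes and has to be decoded.
my $word = Encode::decode_utf8("ہے");
my @files = glob("*.posn");
my @matches = ();

foreach my $file (@files) {
  open my $fh, '<', $file or die "Error opening file $file ($!)";
  my $content = do { local $/; <$fh> };   # slurp the whole file
  close $fh or die $!;
  # List context plus /g returns every matching sentence, not just the first.
  my @occurrences = $content =~ /<Sentence id='\d+'>(?:[^<]|<(?!\/Sentence>))*$word\tVM(?:[^<]|<(?!\/Sentence>))*<\/Sentence>/g;
  print STDOUT "$_\n\n\n\n" for (@occurrences);
  push (@matches, "$_\n\n") for (@occurrences);
}

open my $out, '>', 'matches.txt' or die "Error opening matches.txt ($!)";
print $out @matches;   # printing the list avoids the spaces "@matches" would insert
close $out or die $!;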
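
To see the difference between the two ways of evaluating the match in isolation, here is a minimal sketch. The sample text and the helper names first_match and all_matches are invented for this illustration and are not part of your script.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals

# Made-up input: sentences 1 and 3 contain the token, sentence 2 does not.
my $text = join "\n",
    "<Sentence id='1'>\nیہ\tPRP\nہے\tVM\n</Sentence>",
    "<Sentence id='2'>\nوہ\tPRP\nتھا\tVM\n</Sentence>",
    "<Sentence id='3'>\nکتاب\tNN\nہے\tVM\n</Sentence>";

my $word = "ہے";
my $re = qr/<Sentence id='\d+'>(?:[^<]|<(?!\/Sentence>))*$word\tVM(?:[^<]|<(?!\/Sentence>))*<\/Sentence>/;

# Scalar context, as inside "if (...)": only one match is reported.
sub first_match {
    my ($input) = @_;
    return $input =~ /($re)/ ? $1 : undef;
}

# List context with /g: every non-overlapping match is returned.
sub all_matches {
    my ($input) = @_;
    my @found = $input =~ /$re/g;
    return @found;
}

my @all = all_matches($text);
printf "list context found %d sentences\n", scalar @all;   # prints 2 for this sample
```

With this sample, first_match returns only sentence 1, while all_matches returns sentences 1 and 3.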
