I’m trying to use regular expressions to remove certain blocks of coding from a

Asked: May 19, 2026

I’m trying to use regular expressions to remove certain blocks of code from a text file. So far, most of my regular expressions have worked. However, I have two questions:

1) Whenever I remove a chunk of text, blank space is left where the text used to be, rather than the text simply being removed.
An example of my regex code is:

$file =~ s/<ul(.*)>//gi;

This removes all lines with the basic format <ul...>, which is what I want. However, as mentioned above, it replaces the tag and everything it contains with blank space, and I was wondering how to stop that.

2) Certain regular expressions that should work don’t seem to. For instance, I want to remove

<script type="text/javascript"> 

function getCookies() { return ""; }

</script>

I have tried various regexes, but nothing seems to remove these lines. For instance:

$file =~ s/<script type(.*)<\/script>//gi;

This removes the <script type...> and </script> tags, but leaves the

function getCookies() { return ""; }

…intact. I’m unsure why this happens, and I would very much like to fix it. How can I do that? Any help on either of these two questions would be immensely helpful!

Edit: Sorry all, I’m using Perl!
Also: I just tried using

$file =~ /<script type(.*)<\/script>/sgi

…as well as /msgi, but neither worked unfortunately. Both the <script type> and </script> tags were removed, but for some reason the

function getCookies() { return ""; }

…section stayed. Here is my entire code, including all regex:

use strict;
use warnings;

my $firstarg;
if ($ARGV[0]){
  $firstarg = $ARGV[0];
}

open (DATA, $ARGV[1]);
my $file = do {local $/; <DATA>};

$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
$file =~ s/<head>//gi;
$file =~ s/<\/head>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<\link>//gi;
$file =~ s/CDM(.*)\;//gi;
$file =~ s/<\!(.*)->//gi;
$file =~ s/<body(.*)>//gi;
$file =~ s/<\/body>//gi;
$file =~ s/<div(.*)>//gi;
$file =~ s/<\/div>//gi;
$file =~ s/function(.*)>//gi;
$file =~ s/<noscript>//gi;
$file =~ s/<\/noscript>//gi;
$file =~ s/<a(.*)>//gi;
$file =~ s/<\/a>//gi;
$file =~ s/<ul(.*)>//gi;
$file =~ s/<\/ul>//gi;
$file =~ s/<li(.*)>//gi;
$file =~ s/<\/li>//gi;
$file =~ s/<form(.*)>//gi;
$file =~ s/<\/form>//gi;
$file =~ s/<iframe(.*)>//gi;
$file =~ s/<\/iframe>//gi;
$file =~ s/<select(.*)>//gi;
$file =~ s/<\/select>//gi;
$file =~ s/<textarea(.*)>//gi;
$file =~ s/<\/textarea>//gi;
$file =~ s/<b>//gi;
$file =~ s/<\/b>//gi;
$file =~ s/<H1>//gi;
$file =~ s/<H2>//gi;
$file =~ s/<H3>//gi;
$file =~ s/<H4>//gi;
$file =~ s/<H5>//gi;
$file =~ s/<H6>//gi;
$file =~ s/<\/H1>//gi;
$file =~ s/<\/H2>//gi;
$file =~ s/<\/H3>//gi;
$file =~ s/<\/H4>//gi;
$file =~ s/<\/H5>//gi;
$file =~ s/<\/H6>//gi;
$file =~ s/<option(.*)>//gi;
$file =~ s/<\/option>//gi;
$file =~ s/<p>//gi;
$file =~ s/<\/p>//gi;
$file =~ s/<span(.*)>//gi;
$file =~ s/<\/span>//gi;
$file =~ s/<!doctype(.*)>//gi;
$file =~ s/<base(.*)>//gi;
$file =~ s/<br>//gi;
$file =~ s/<hr>//gi;
$file =~ s/<img(.*)>//gi;
$file =~ s/<input(.*)>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<meta(.*)>//gi;
$file =~ s/<script type(.*)<\/script>//gi;
print $file;

Ok, now that I deleted the <script> regex that was causing one problem, another has been created – using:

$file =~ s/<script type(.*)<\/script>//gi;

removes everything inside the first <script ...> element, but not the tag itself, nor the other occurrences of the tag throughout the file. Using:

$file =~ s/<script type(.*)<\/script>//mgi;

results in the exact same thing. Using:

$file =~ s/<script type(.*)<\/script>//sgi;

results in the printing of several new line characters, but no other text, same for /msgi.
Urgh, the problems never end… 🙁

NEW EDIT: I would like to apologize for posting a question about parsing HTML using regex. I realize that there is a rather large backlash within the programming community against this practice (or attempted practice, since it seems to fail more often than not). However, I am unfortunately forced to use regex to parse selected HTML files, ones from which it should be possible to remove the majority, if not all, of the HTML tags. I am not allowed to use a module, despite this being the most obvious and simplest answer.

1 Answer

Editorial Team · Answer 1 · 2026-05-19T13:49:36+00:00

If you are not allowed to use anything but Perl regular expressions, then you could adapt the following code, which strips HTML tags from text:

#!/usr/bin/perl -w
use strict;
use warnings;

$_ = do { local $/; <DATA> };

# see http://www.perlmonks.org/?node_id=161281
# ALGORITHM:
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags that requires a corresponding
#           end tag, plus everything up to that end tag
#       or if \s or ="
#           then skip to next "
#           else [^>]
#   >
s{
  <               # open tag
  (?:             # open group (A)
    (!--) |       #   comment (1) or
    (\?) |        #   another comment (2) or
    (?i:          #   open group (B) for /i
      (           #     one of start tags
        SCRIPT |  #     for which
        APPLET |  #     all content
        OBJECT |  #     up to the matching
        STYLE     #     end tag
      )           #     must be skipped (3)
    ) |           #   close group (B), or
    ([!/A-Za-z])  #   one of these chars, remember in (4)
  )               # close group (A)
  (?(4)           # if previous case is (4)
    (?:           #   open group (C)
      (?!         #     and next is not : (D)
        [\s=]     #       \s or "="
        ["`']     #       with open quotes
      )           #     close (D)
      [^>] |      #     and not close tag or
      [\s=]       #     \s or "=" with
      `[^`]*` |   #     something in quotes ` or
      [\s=]       #     \s or "=" with
      '[^']*' |   #     something in quotes ' or
      [\s=]       #     \s or "=" with
      "[^"]*"     #     something in quotes "
    )*            #   repeat (C) 0 or more times
  |               # else (if previous case is not (4))
    .*?           #   minimum of any chars
  )               # end if previous char is (4)
  (?(1)           # if comment (1)
    (?<=--)       #   wait for "--"
  )               # end if comment (1)
  (?(2)           # if another comment (2)
    (?<=\?)       #   wait for "?"
  )               # end if another comment (2)
  (?(3)           # if one of tags-containers (3)
    </            #   wait for end
    (?i:\3)       #   of this tag
    (?:\s[^>]*)?  #   skip junk to ">"
  )               # end if (3)
  >               # tag closed
 }{}gsx;         # STRIP THIS TAG

print;

__END__
<html><title>remove script, ul</title>
<script type="text/javascript"> 

function getCookies() { return ""; }

</script>
<body>
<ul><li>1
<li>2
<p>paragraph

Output

remove script, ul


1
2
paragraph
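
For readers who only need the two ideas the question asks about, here is a much smaller sketch. It is not the regex above and is less robust (for example, it assumes no ">" appears inside a quoted attribute value, which the longer regex handles). The helper name strip_simple is made up for illustration. It shows why /s plus a lazy ".*?" is needed for multi-line <script> blocks, why "[^>]*" is safer than a greedy ".*" for single tags, and one way to drop the blank lines that removed tags leave behind.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): strips <script> blocks,
# then any remaining tags, then the empty lines left behind.
sub strip_simple {
    my ($html) = @_;

    # /s lets "." match newlines; the lazy ".*?" stops at the first
    # </script> instead of running on to the last one in the file.
    $html =~ s{<script\b[^>]*>.*?</script\s*>}{}gis;

    # "[^>]*" cannot run past the end of the current tag, unlike ".*".
    # Assumption: no ">" inside quoted attribute values.
    $html =~ s{<[^>]*>}{}g;

    # Drop lines that are now empty or contain only spaces/tabs.
    $html =~ s{^[ \t]*\n}{}mg;

    return $html;
}

my $sample = <<'HTML';
<html><title>remove script, ul</title>
<script type="text/javascript">

function getCookies() { return ""; }

</script>
<body>
<ul><li>1
<li>2
<p>paragraph
HTML

print strip_simple($sample);
```

On this sample the sketch prints the same four text lines as the output above, but without the blank lines, because the last substitution removes them.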

NOTE: This regex doesn’t work for nested tag-containers e.g.:

<!DOCTYPE html>
<meta charset="UTF-8">
<title>Nested &lt;object> example</title>
<body>
<object data="uri:here">fallback content for uri:here
  <object data="uri:another">uri:another fallback
  </object>!!!this text should be striped too!!!
</object>

Output

Nested &lt;object> example

!!!this text should be striped too!!!
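
The reason for this limitation can be reproduced in isolation. The sketch below is a simplified stand-in for the container branch of the regex above, not the regex itself, and the helper name strip_container is made up for illustration. A lazy match ends at the first closing tag it meets, so the inner </object> closes the outer element and the rest of the outer content survives.

```perl
use strict;
use warnings;

# Hypothetical helper: removes <tag ...>...</tag> using a lazy match,
# the same strategy the long regex uses for SCRIPT/APPLET/OBJECT/STYLE.
sub strip_container {
    my ($html, $tag) = @_;
    $html =~ s{<\Q$tag\E\b[^>]*>.*?</\Q$tag\E\s*>}{}gis;
    return $html;
}

my $nested = '<object>outer<object>inner</object>tail</object>';

# The match runs from the first <object> to the FIRST </object>,
# so "tail" and the outer closing tag are left over.
print strip_container($nested, 'object'), "\n";   # prints: tail</object>
```

In the full regex the leftover </object> is then stripped as an ordinary tag, which is why only the text after the inner element remains in the output above.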

Don’t parse HTML with regexes. Use an HTML parser or a tool built on top of one, e.g. HTML::Parser:

#!/usr/bin/perl -w
use strict;
use warnings;

use HTML::Parser ();

HTML::Parser->new(
    ignore_elements => ["script"],
    ignore_tags => ["ul"],
    default_h => [ sub { print shift }, 'text'],
    )->parse_file(\*DATA) or die "error: $!\n";

__END__
<html><title>remove script, ul</title>
<script type="text/javascript"> 

function getCookies() { return ""; }

</script>
<body>
<ul><li>1
<li>2
<p>paragraph

Output

<html><title>remove script, ul</title>

<body>
<li>1
<li>2
<p>paragraph
