I have an interesting problem. I wrote the following perl script to recursively loop


Asked: May 14, 2026 (2026-05-14T03:40:47+00:00)


I have an interesting problem. I wrote the following Perl script to recursively walk a directory and, in every HTML file, do the following to the URLs in img, script, and a tags:

Convert the entire url to lowercase
Replace spaces and %20 with underscores

The script works great except when an image tag is wrapped in an anchor tag. Is there a way to modify the current script so that it also rewrites the links in nested tags that are not on separate lines? Basically, if I have <a href="..."><img src="..."></a>, the script changes only the link in the anchor tag and skips the img tag.

#!/usr/bin/perl

use File::Find;

$input="/var/www/tecnew/";

sub process {
        if (-T and m/\.html?$/i) {
                #print "htm/html: $_\n";

                open(FILE,"+<$_") or die "couldn't open file $!\n";
                $out = '';
                while(<FILE>) {
                        $cur_line = $_;
                        if($cur_line =~ m/<a.*>/i) {
                                print "cur_line (unaltered) $cur_line\n";
                                $cur_line =~ /(^.* href=\")(.+?)(\".*$)/i;
                                $beg = $1;
                                $link = html_clean($2);
                                $end = $3;
                                $cur_line = $beg.$link.$end;
                                print "cur_line (altered) $cur_line\n";

                        }
                        if($cur_line =~ m/(<img.*>|<script.*>)/i) {
                                print "cur_line (unaltered) $cur_line\n";
                                $cur_line =~ /(^.* src=\")(.+?)(\".*$)/i;
                                $beg = $1;
                                $link = html_clean($2);
                                $end = $3;
                                $cur_line = $beg.$link.$end;
                                print "cur_line (altered) $cur_line\n";
                        }
                        $out .= $cur_line;

                }
                seek(FILE, 0, 0) or die "can't seek to start of file: $!";
                print(FILE $out) or die "can't print to file: $!";
                truncate(FILE, tell(FILE)) or die "can't truncate file: $!";
                close(FILE) or die "can't close file: $!";
        }
}

find(\&process, $input);

sub html_clean {
        my($input_string) = @_;
        $input_string = lc($input_string);
        $input_string =~ s/%20|\s/_/g;
        return $input_string; 
}


1 Answer

Editorial Team · Answer 1 · 2026-05-14T03:40:47+00:00


Have you considered using a real parser instead of regexps? Regexps are not well suited to parsing HTML, and several tags sharing one line is exactly the kind of input where they break down. Consider using a parser like HTML::Parser.
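If pulling in a parser is not an option, a smaller change is to stop matching one attribute per line and instead substitute every href/src value globally. The following is only a sketch of that regex approach, not a parser: it assumes attribute values are always double-quoted (as the regexes in the question do), it does not check which tag an attribute belongs to, and rewrite_links is a hypothetical helper name. html_clean is copied from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same cleanup as html_clean in the question:
# lowercase the URL, then turn %20 and whitespace into underscores.
sub html_clean {
    my ($input_string) = @_;
    $input_string = lc($input_string);
    $input_string =~ s/%20|\s/_/g;
    return $input_string;
}

# Hypothetical helper: rewrite every double-quoted href/src value in a line.
# \K (Perl 5.10+) keeps the 'href="' / 'src="' prefix out of the replaced text,
# /g repeats the substitution for each attribute on the line,
# /e evaluates the replacement as Perl code so html_clean can be called.
sub rewrite_links {
    my ($line) = @_;
    $line =~ s/\b(?:href|src)\s*=\s*"\K([^"]*)/html_clean($1)/gie;
    return $line;
}

print rewrite_links(qq{<a href="My%20Page.HTML"><img src="Big Pic.JPG"></a>\n});
# prints: <a href="my_page.html"><img src="big_pic.jpg"></a>
```

In the original loop, this would replace both if blocks with a single `$cur_line = rewrite_links($cur_line);`. It still misses single-quoted or unquoted attributes and tags split across lines, which is where a parser earns its keep.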
