I’m parsing the sourcecode of many websites, an entire huge web with thousands of


Asked: May 27, 2026


I’m parsing the source code of many websites, an entire huge web with thousands of pages. Now I want to search it with Perl: I want to find the number of occurrences of a keyword.

To fetch the web pages I use curl and pipe the output to “grep -c”, which doesn’t work because it counts matching lines rather than individual occurrences, so I want to use Perl. Can Perl be used for the whole job, including fetching the pages?

E.g.

cat RawJSpiderOutput.txt | grep parsed | awk -F " " '{print $2}' | xargs -I replaceStr curl replaceStr?myPara=en | perl -lne '$c++while/myKeywordToSearchFor/g;END{print$c}'

Explanation: The text file above contains both usable and unusable URLs. With “grep parsed” I select the lines with usable URLs. With awk I pick the 2nd column, which contains the bare URL. So far so good. Now to the question: with curl I fetch the source (appending a parameter, too) and pipe the whole source code of each page to Perl in order to count the occurrences of “myKeywordToSearchFor”. I would love to do all of this in Perl only, if that is possible.

Thanks!


1 Answer

Editorial Team · Answer 1 · 2026-05-27T11:08:29+00:00

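This uses Perl only (untested):

use strict;
use warnings;

use File::Fetch;

my $count = 0;    # start at 0 so a run without matches prints 0
open my $SPIDER, '<', 'RawJSpiderOutput.txt' or die $!;
while (my $line = <$SPIDER>) {
    next unless $line =~ /parsed/;        # keep only the usable URLs
    my $url = (split ' ', $line)[1];      # second whitespace-separated column
    $url .= '?myPara=en';
    my $ff = File::Fetch->new(uri => $url);
    # fetch returns the path of the downloaded file, or false on failure
    my $fetched = $ff->fetch or die $ff->error;
    open my $FETCHED, '<', $fetched or die $!;
    while (<$FETCHED>) {
        # count every occurrence, not just every matching line
        $count++ while /myKeyword/g;
    }
    close $FETCHED;
    unlink $fetched;
}
print "$count\n";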
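
If writing each page to disk is not wanted, the same idea can be sketched with HTTP::Tiny, which has shipped with Perl since 5.14 and returns the page body in memory. This is only a sketch under assumptions: the helper names count_occurrences and usable_urls are made up for illustration, the keyword is treated as a literal string rather than a regex, failed requests are skipped with a warning instead of aborting, and https URLs need IO::Socket::SSL installed.

```perl
use strict;
use warnings;
use HTTP::Tiny;    # in the Perl core since 5.14

# Hypothetical helper: count every occurrence of a literal keyword,
# including several occurrences on the same line.
sub count_occurrences {
    my ($text, $keyword) = @_;
    my $n = () = $text =~ /\Q$keyword\E/g;
    return $n;
}

# Hypothetical helper: return the second whitespace-separated field of
# each line containing "parsed" (the grep + awk steps of the pipeline).
sub usable_urls {
    my (@lines) = @_;
    return grep { defined }
           map  { (split ' ', $_)[1] }
           grep { /parsed/ } @lines;
}

# Only touch the network when a spider file is given on the command line.
if (@ARGV) {
    my ($spider_file, $keyword) = @ARGV;
    $keyword = 'myKeywordToSearchFor' unless defined $keyword;

    open my $in, '<', $spider_file or die "Cannot open $spider_file: $!";
    my @urls = usable_urls(<$in>);
    close $in;

    my $http  = HTTP::Tiny->new(timeout => 30);
    my $total = 0;
    for my $url (@urls) {
        my $res = $http->get("$url?myPara=en");
        unless ($res->{success}) {
            warn "Skipping $url: $res->{status} $res->{reason}\n";
            next;
        }
        $total += count_occurrences($res->{content}, $keyword);
    }
    print "$total\n";
}
```

Saved as, say, count_keyword.pl (a hypothetical file name), it could be run as: perl count_keyword.pl RawJSpiderOutput.txt myKeywordToSearchFor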
