While parsing page no. 22 of http://sfdoccentral.symantec.com/sf/5.1/linux/pdf/vxfs_admin.pdf , I am able to parse all

Asked: May 23, 2026 (2026-05-23T18:01:16+00:00)

While parsing page 22 of http://sfdoccentral.symantec.com/sf/5.1/linux/pdf/vxfs_admin.pdf, I am able to parse all the words except mount_vxfs, apparently because its encoding and/or font differs from that of the normal plain text.
Please see the attached PDF page for details.

Here is my code:

`#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
my $file_name = "vxfs_admin_51sp1_lin.pdf";
my $pdf = CAM::PDF->new($file_name);
my $no_pages = $pdf->numPages();
print "$no_pages\n";
# pages are numbered from 1 to $no_pages inclusive
for (my $i = 1; $i <= $no_pages; $i++) {
    my $page = $pdf->getPageText($i);
    # for page no. 22
    # if ($i == 22) {
    print $page;
    # }
}`

Editorial Team · Answer 1 · 2026-05-23T18:01:16+00:00

PDF doesn’t store the semantic text that you read but rather uses character codes which map to glyphs (the painted characters) in a particular font. Often, however, the code-glyph mapping matches common character sets (such as ISO-8859-1 or UTF-8) so that the codes are human-readable. That’s the case for all of the text you have been able to parse, although sometimes the odd character, mostly punctuation, is also “wrong”.

Unfortunately, the text for “mount_vxfs” in your document is encoded completely differently, which results in apparent garbage. If you’re curious, you can see what’s really there by replacing getPageText() with getPageContent() in your code.
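As a rough sketch of that substitution, the script below prints the raw content stream of a single page. It assumes CAM::PDF is installed; the script name and the command-line arguments are only placeholders for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

# Return the raw content stream of one page: the drawing operators plus the
# character codes handed to the text-showing operators (Tj, TJ).
sub page_content {
    my ($file_name, $page_no) = @_;
    my $pdf = CAM::PDF->new($file_name)
        or die "Cannot open $file_name: $CAM::PDF::errstr\n";
    return $pdf->getPageContent($page_no);
}

# Hypothetical usage: perl page_content.pl vxfs_admin_51sp1_lin.pdf 22
print page_content(@ARGV) if @ARGV == 2;
```

In the output, look at the strings in front of the Tj and TJ operators: for ordinary text they are readable, while for the problem word they should show the unreadable codes.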

In order to convert the PDF text back to meaningful characters, PDF readers have to jump through hoops with a number of conversion tables (including the so-called CMaps). Because this is a lot of programming work, many simpler libraries opt not to implement them. That’s the case with CAM::PDF.
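To make the idea of a conversion table concrete, here is a minimal sketch. The character codes and the table are invented for illustration and are not taken from the fonts in vxfs_admin.pdf; a real reader would build such a table from the font's encoding information or its ToUnicode CMap, and would also have to deal with multi-byte codes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical code-to-character table, standing in for what a PDF reader
# derives from a font's ToUnicode CMap. The codes are made up for this demo.
my %to_unicode = (
    0x01 => 'm', 0x02 => 'o', 0x03 => 'u', 0x04 => 'n', 0x05 => 't',
    0x06 => '_', 0x07 => 'v', 0x08 => 'x', 0x09 => 'f', 0x0A => 's',
);

# Translate a string of one-byte character codes into readable text.
# Codes missing from the table become U+FFFD (the replacement character).
sub decode_codes {
    my ($table, $bytes) = @_;
    return join '', map { $table->{ ord($_) } // "\x{FFFD}" } split //, $bytes;
}

# What the content stream might hold for the word, as raw bytes.
my $raw = "\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0A";

binmode STDOUT, ':encoding(UTF-8)';
print decode_codes(\%to_unicode, $raw), "\n";    # prints "mount_vxfs"
```

A library that skips this lookup step simply hands back the raw codes, which is why the word comes out as garbage.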

If you’re just interested in parsing the text (not editing it), the following technique is something I use with success:

Obtain xpdf (http://foolabs.com/xpdf) or Poppler (http://poppler.freedesktop.org/). Poppler is a newer fork of xpdf. If you’re using *nix, there will be a package available.
Use the command-line tool ‘pdftotext’ to extract the text from a file, either page-wise or all at once.

Example:

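#!/usr/bin/perl
use strict;
use warnings;
use English qw(-no_match_vars);
my $file_name = "vxfs_admin.pdf";

# List-form pipe open: no shell is involved, so the file name needs no quoting.
open my $text_fh, '-|', '/usr/bin/pdftotext', '-layout', '-q', $file_name, '-'
    or die "Cannot run pdftotext: $OS_ERROR\n";
local $INPUT_RECORD_SEPARATOR = "\f";    # slurp a whole page at a time
while (my $page_text = <$text_fh>) {
    # this is here only for demo purposes
    print $page_text if $INPUT_LINE_NUMBER == 19;    # record number = page number
}
close $text_fh or warn "pdftotext exited with status $CHILD_ERROR\n";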

(Note: The document I retrieved using your link is slightly different; the offending bit is on page 19 instead.)
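If only one page is needed, pdftotext can also be asked for just that page with its -f (first page) and -l (last page) options, instead of reading every page and filtering. The following is a sketch under a few assumptions: pdftotext is found on your PATH, and the sub names and script name are my own, chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the pdftotext argument list for a single page. -f and -l select the
# first and last page to convert; the final "-" sends the text to stdout.
sub pdftotext_command {
    my ($file_name, $page_no) = @_;
    return ('pdftotext', '-f', $page_no, '-l', $page_no,
            '-layout', '-q', $file_name, '-');
}

# Run pdftotext and return the text of the requested page.
sub extract_page {
    my ($file_name, $page_no) = @_;
    my @command = pdftotext_command($file_name, $page_no);
    # List-form open bypasses the shell, so the file name needs no quoting.
    open my $text_fh, '-|', @command
        or die "Cannot run pdftotext: $!\n";
    local $/;                       # read the whole output in one go
    my $text = <$text_fh>;
    close $text_fh or die "pdftotext failed (exit status $?)\n";
    return $text;
}

# Hypothetical usage: perl extract_page.pl vxfs_admin.pdf 19
print extract_page(@ARGV) if @ARGV == 2;
```

For a whole document the loop in the example above is the better fit, since it starts pdftotext only once.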
