The Archive Base

Editorial Team
Asked: June 3, 2026 (2026-06-03T10:16:15+00:00)

I’m new to the Zend Framework so my apologies if I’m missing something simple.

I’m new to the Zend Framework so my apologies if I’m missing something simple. However, I would have thought that code taken directly from the documentation would work. Instead I’m getting an uncaught exception.

Fatal error:  Uncaught exception 'Zend_Pdf_Exception' with message 'Cross-reference streams are not supported yet.' in C:\xampp\php\zend\library\Zend\Pdf\Parser.php:318
Stack trace:
#0 C:\xampp\php\zend\library\Zend\Pdf\Parser.php(460): Zend_Pdf_Parser->_loadXRefTable('116')
#1 C:\xampp\php\zend\library\Zend\Pdf.php(318): Zend_Pdf_Parser->__construct('PDF/Current...', Object(Zend_Pdf_ElementFactory_Proxy), true)
#2 C:\xampp\php\zend\library\Zend\Pdf.php(267): Zend_Pdf->__construct('PDF/Current...', NULL, true)
#3 C:\xampp\htdocs\test\test.php(7): Zend_Pdf::load('PDF/Current...')
#4 {main}
  thrown in C:\xampp\php\zend\library\Zend\Pdf\Parser.php on line 318

I’ve been reading around for a possible solution, but have had little luck. The most similar question I found does not solve my problem, and it’s years old. From what I’ve read there and in other sources, PDF versions 1.4 and older should work fine, but that is not the case here: my PDFs are all version 1.4, so I’m not sure how accurate that post is anyway. The code works for the PDF included in the demo, but not for any of the existing files I’m trying to use. I would upload the PDFs, but they are all confidential.

I’m only trying to get the metadata, but I am not even able to load the document. I started using a framework so I wouldn’t have to create my own parser. If there is a simpler way to do this, or if someone can shed some light on this, I would be much obliged.

Edit: for clarification, I’ve tried both methods from the linked documentation page. Neither works.

1 Answer

  1. Editorial Team
    Added an answer on June 3, 2026 at 10:16 am (2026-06-03T10:16:15+00:00)

    I ended up having to create my own parser for this. If anyone finds this and has further suggestions, or questions about how I did it, just add a comment.

    Solution

    I’m not going to upload the whole code, as it’s really long, very messy, and inefficient. I’ve grown a bit as a developer since the initial post and have been meaning to go back and take another swing at it. So I’ll use this post to explain what I have, point out some of the problems and solutions I have found, and comment on how to make it more efficient. Hopefully this will make things easier for you, and hopefully it will inspire me to make some changes. Disclaimer: it has been months since I last looked at this code, so don’t expect me to remember everything. However, I was pretty good about documenting my code and findings (for once), so what I’m not remembering is mostly minor.

    The most important thing I can tell you is to look at the raw XML, take notes, and compare a few of your files. Adobe apparently couldn’t make up their mind when creating the metadata syntax, so you will end up adding multiple checks for all the different revisions (I’ll give an example later). Actually finding the metadata in the document is pretty easy. Adobe gives you a nice set of begin/end tags, so you just iterate over the document until you find them. Here’s a cleaned-up and generalized sample from one of the PDFs I’m parsing.

    <?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
    <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04        ">
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
            <rdf:Description rdf:about=""
                xmlns:dc="http://purl.org/dc/elements/1.1/">
                <dc:format>application/pdf</dc:format>
                <dc:title>
                    <rdf:Alt>
                        <rdf:li xml:lang="x-default">Title of Document</rdf:li>
                    </rdf:Alt>
                </dc:title>
                <dc:creator>
                    <rdf:Seq>
                        <rdf:li>Creator of Document (Not author)</rdf:li>
                    </rdf:Seq>
                </dc:creator>
                <dc:description>
                    <rdf:Alt>
                        <rdf:li xml:lang="x-default">Short description</rdf:li>
                    </rdf:Alt>
                </dc:description>
            </rdf:Description>
            <rdf:Description rdf:about=""
                xmlns:xmp="http://ns.adobe.com/xap/1.0/">
                <xmp:CreateDate>2004-01-27T16:36:09Z</xmp:CreateDate>
                <xmp:CreatorTool>FrameMaker 7.0</xmp:CreatorTool>
                <xmp:ModifyDate>2012-02-20T15:55:19Z</xmp:ModifyDate>
            </rdf:Description>
            <rdf:Description rdf:about=""
                xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
                <pdf:Producer>Acrobat Distiller 9.4.5 (Windows)</pdf:Producer>
            </rdf:Description>
            <rdf:Description rdf:about=""
                xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
                <xmpMM:DocumentID>uuid:4eae0fcf-f493-4773-9473-f81c7491e8aa</xmpMM:DocumentID>
                <xmpMM:InstanceID>uuid:98209926-ba98-4ac7-a5f7-050050048f5d</xmpMM:InstanceID>
            </rdf:Description>
        </rdf:RDF>
    </x:xmpmeta>
    <?xpacket end="w"?>
    

    The best way to view the raw XML is to download Notepad++ (any similar text editor will do) and open the PDFs in it. The first thing you will see is the PDF version, “%PDF-1.4” in this case, followed by a lot of confusing-looking characters. Ignore those, but note the PDF version. Notice the “xpacket” tags in the sample above: they are what you need to look for every time you want to find the metadata. Just Ctrl+F for “xmpmeta”; the first occurrence should be your metadata. A word of caution: don’t attempt to use password-protected documents. Everything in them is obfuscated, including the metadata, which means PHP can’t read it either. I believe there is an option to allow reading the metadata in password-protected PDFs, but I can’t remember for sure, nor do I know whether it actually works for PHP.

    Just as you can Ctrl+F to find the metadata in Notepad++, you can do the same thing in PHP with fgets() and a while loop. Something I didn’t do, but which would probably be a good idea, is to determine which end of the document to start from. The location isn’t universal across PDF versions, but files of the same version seem to place it similarly. For instance, in PDF 1.4 the metadata appears to sit closer to the bottom of the document, while in PDF 1.6 it is closer to the top. Again, you can check the PDF version from the first line. Reading the document with PHP should be pretty simple to set up, so I’m going to skip this bit of code. I will point out, though, that it is a good idea to quit the loop once you have found the entire metadata block: this is a very processing-intensive operation, so you’ll want to save time where you can. I would also suggest only running this on groups of 10-20 files at a time, fewer if the documents are large. Setting up a caching system helped me quite a bit with timeout errors.
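
The reading loop is skipped above, so here is a minimal sketch of how it might look. It is not the original solution's code: the function names `readPdfVersion()` and `extractXmpPacket()` are made up for illustration, and the sketch assumes the XMP packet is stored uncompressed and unencrypted in the file. It stops reading as soon as the closing `xpacket` instruction has been seen, and it starts recording at the marker itself rather than on the following line, so a packet that shares a line with other data is still captured whole.

```php
<?php
// Hypothetical helpers; names and behaviour are assumptions for illustration only.

// Returns the version from the "%PDF-x.y" header on the first line, or null.
function readPdfVersion(string $path): ?string
{
    if (!is_readable($path)) {
        return null;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }
    $firstLine = fgets($handle);
    fclose($handle);

    if ($firstLine !== false && preg_match('/^%PDF-(\d+\.\d+)/', $firstLine, $m)) {
        return $m[1];
    }
    return null;
}

// Returns the first complete XMP packet found in the file, or null.
function extractXmpPacket(string $path): ?string
{
    if (!is_readable($path)) {
        return null;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }

    $packet = null;
    $buffer = '';
    $recording = false;

    while (($line = fgets($handle)) !== false) {
        if (!$recording) {
            $start = strpos($line, '<?xpacket begin');
            if ($start === false) {
                continue; // still in front of the metadata
            }
            $recording = true;
            $line = substr($line, $start); // drop anything before the marker
        }
        $buffer .= $line;

        // Look for the closing processing instruction and its terminator.
        $end = strpos($buffer, '<?xpacket end');
        if ($end !== false) {
            $close = strpos($buffer, '?>', $end);
            if ($close !== false) {
                $packet = substr($buffer, 0, $close + 2);
                break; // quit early: the rest of the file is not needed
            }
        }
    }

    fclose($handle);
    return $packet;
}
```

A call such as `extractXmpPacket('some.pdf')` (the path is a placeholder) returns the packet as a string ready for the clean-up steps, or `null` when no complete packet is found. Searching the whole buffer on every line is simple rather than fast; for very large packets it may be worth searching only the newly appended part.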

    Once you’ve got the metadata in a string, you’ll want to clean it up a bit. First, make sure the metadata is wrapped in a single root node so that the XML parser can read it. There were a couple of instances where it wasn’t. The easiest way to fix this is to add a common wrapper, and I would suggest using the most common one available to you. For me, that was the “xmpmeta” tag with an inner “rdf” wrapper. Ensuring that every metadata block starts the same way is important for navigating the document. There might be a better way of doing this, but this works and isn’t too inefficient (at least now, after I removed the two loops).

    if(strpos($xmlstr, 'xmpmeta') === FALSE) {
        // stripos() is case-insensitive, so the usual "rdf:RDF" spelling is matched too
        if(stripos($xmlstr, 'rdf:rdf') === FALSE) { $xmlstr = "<rdf>$xmlstr</rdf>"; }
        $xmlstr = "<xmpmeta>$xmlstr</xmpmeta>";
    }
    

    Afterwards you are going to want to remove the namespaces. I tried using them, but it’s kind of hard to do so when the URLs keep changing between implementations and you don’t know for sure which ones you have. Besides, it was already starting to run slowly, and all that extra XML parsing would only have made it worse. It was much simpler to remove them.

    // Strip the known prefixes ("rdf:", "dc:", ...) from element and attribute names
    $nodesToRemove = array('rdf', 'pdf', 'xap', 'xapMM', 'xmp', 'xmpMM', 'dc', 'x');
    foreach($nodesToRemove as $remove) { $xmlstr = str_replace("$remove:", '', $xmlstr); }
    // Drop the namespace declarations: double-quoted first, then single-quoted
    $xmlstr = preg_replace('/xmlns[^=]*="[^"]*"/i', '', $xmlstr);
    $xmlstr = preg_replace("/xmlns[^=]*='[^']*'/i", '', $xmlstr);
    
    // Second pass with DOM: remove any namespace declarations that survived
    $dom = new DOMDocument();
    $dom->loadXML($xmlstr);
    $sxe = simplexml_import_dom($dom);
    $root = $dom->documentElement;
    $namespaces = $sxe->getDocNamespaces(TRUE);
    
    foreach($namespaces as $prefix => $uri) {
        $root->removeAttributeNS($uri, $prefix);
        $root->removeAttribute("xmlns:$prefix");
    }
    
    // _removeNS() is a method of the surrounding class that is not shown in this post
    if($root->hasChildNodes()) {
        foreach($root->childNodes as $element) {
            if ($element->nodeType != XML_TEXT_NODE) {
                $this->_removeNS($element, $namespaces);
            }
        }
    }
    

    The $nodesToRemove list might be a little different for you; those are just the namespaces I ran across. Note: I was having issues where the order in which the prefixes were removed mattered. I’m not sure why, but it would remove the “xmp” from “xmpMM” and I would be stuck with an “MM” namespace. The code above doesn’t appear to have that issue, so I’m not sure it still applies, but be wary just in case. Either way, it isn’t hard to fix: have PHP sort the list and then reverse it. The regular expressions remove the namespace declarations, including the default one. I tried a number of different approaches, but this was the only one I could find that worked consistently. There’s probably a way to combine those two expressions, but I’m completely lost when it comes to regex, and my attempts just left it broken. I’m not sure why I then remove the namespaces again with DOM. It appears to be one of my more recent attempts at cleaning this up, but it comes from a working solution, so it doesn’t hurt (at least not functionally). The first part, apart from the regex, can probably be removed and replaced with the DOM solution, though I’ve not verified this. It’s still necessary to remove the default namespace declarations before loading the string into XML, because the XML parsers do not consider the “xmlns” attribute to be an actual attribute. The only reason the prefixed version, “xmlns:$prefix”, works is that those are not treated as “xmlns” attributes but as “xmlns:$prefix” attributes. Subtleties.
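
The two points left open above, the removal order and the pair of regular expressions, could be handled along the following lines. This is a hedged sketch, not the code from the original solution: `stripNamespaces()` is a hypothetical name, and it has only been reasoned through against the sample packet shown earlier, not against the full range of files the answer describes. The prefix list is reverse-sorted as suggested, a single pattern covers both quote styles, and prefixes are stripped only where they begin an element or attribute name.

```php
<?php
// Hypothetical helper sketching the clean-up step; not the original solution's code.
function stripNamespaces(string $xml, array $prefixes): string
{
    // Reverse-sorted, so "xmpMM" is tried before "xmp" and "x".
    rsort($prefixes);
    $quoted = array_map(function ($prefix) {
        return preg_quote($prefix, '/');
    }, $prefixes);

    // One pattern for xmlns="..." and xmlns:prefix='...': the back-reference \1
    // requires the closing quote to be the same character as the opening one.
    $xml = preg_replace('/\s+xmlns(?::[\w.-]+)?\s*=\s*(["\']).*?\1/is', '', $xml);

    // Strip a prefix only where it starts an element or attribute name,
    // that is, directly after "<", "</" or whitespace.
    return preg_replace('/(<\/?|\s)(?:' . implode('|', $quoted) . '):/', '$1', $xml);
}
```

Called with the same list as above, `stripNamespaces($xmlstr, array('rdf', 'pdf', 'xap', 'xapMM', 'xmp', 'xmpMM', 'dc', 'x'))` turns `<rdf:li xml:lang="x-default">` into `<li xml:lang="x-default">`. One caveat: text content in which whitespace is directly followed by one of the prefixes and a colon would be altered as well, so the result is worth spot-checking.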

    Don’t be like me. Don’t try to support every version of PDF ever created. It CAN’T be done. Well… it probably can, but it’s more hassle than it’s worth. Luckily for me, these were all in-house documents, so when I reached my limit and was tired of tweaking the parser only to break something else, or lose compatibility I previously had, I just had those last few documents converted. Find the most common versions and handle those, then the next most common and set up conditions for those, and so on. Once you get to the point where only a few are left, have them updated, or just announce that you don’t support that version, especially if the documents are older. There is no sense in adding functionality that will only ever be used for a handful of documents. One of the big problems I remember is that the “xpacket” was not always on its own line. Sometimes it shared a line with a few metadata tags. This caused “missing” data, because I did not start recording the metadata until after the “xpacket” was found. It seemed like a simple fix, but it uncovered a whole lot of issues, so I ended up scrapping support for that revision altogether and having those files updated. Luckily those were the last 3-4 files.

    Once you have cleaned the metadata, then you are ready to parse it as XML. For example, here’s how I get the description.

    function getDescription($xml) {
        $return = 'Error: Metadata could not be retrieved';//Return value if metadata can not be parsed
    
        $sxe = new SimpleXMLElement($xml);
    
        $xpath = array(
            '//description/Alt/li',
            '//Description/Alt/li',
            '//xmpmeta/RDF/*[last()]',
            //'//Description/description',
        );
        foreach($xpath as $pattern) {
            $temp = $sxe->xpath($pattern);
    
            if( ! empty($temp)) {
                $return = isset($temp[0]->description) ? $temp[0]->description : $temp[0];
                break;
            }
        }
    
        //Return value if description was not found in metadata
        return empty($return) ? 'Error: Metadata "description" could not be retrieved' : strval($return);
    }
    

    There are a few things to note about this. The first is the array of XPath expressions. These are the multiple conditions I was talking about earlier. You may also notice the commented-out expression. That’s one I am either still working on compatibility for or have given up on. I don’t remember; it’s been a while since I’ve had to look at this, and no one has complained about errors, so I’m assuming it’s not an issue. Another thing to notice is the number of deviations for just this ONE field. The metadata format changed quite a bit between revisions, and sometimes reverted. So you have to check for each case, make sure there were no other deviations, and then add any further conditions that turn up. Something to look into would be keeping a separate parser per version and loading the proper one, which may cut down on inefficiency. Looking back on this now, perhaps the easier way would have been to look up the specification for each revision, but I ended up doing this mostly through trial and error. So, while this works for me, there may be things I missed because they weren’t an issue in any of my documents. The other thing to note is how similar the tags are between the revisions. I wasn’t, and still am not, all that great with advanced XPath, so maybe there is a better way to do this. I don’t know.
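
On the question of whether XPath offers a better way: one option, sketched here under assumptions, is to match on `local-name()`, which ignores the namespace prefix and URI entirely, so the packet can be queried without stripping the namespaces first. This is an alternative technique, not what the solution above does. `findXmpField()` is a hypothetical name, `$field` is assumed to be a trusted element name such as `description` or `Producer`, and the sketch only covers values stored as elements, not values stored as attributes of `rdf:Description`.

```php
<?php
// Hypothetical helper; an alternative to stripping namespaces, not the original code.
function findXmpField(string $xmp, string $field): ?string
{
    // Collect parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $sxe = simplexml_load_string($xmp);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($sxe === false) {
        return null; // not well-formed XML
    }

    $queries = array(
        // Value wrapped in a container: <dc:description><rdf:Alt><rdf:li>...
        sprintf('//*[local-name()="%s"]//*[local-name()="li"]', $field),
        // Plain element: <pdf:Producer>...</pdf:Producer>
        sprintf('//*[local-name()="%s"]', $field),
    );

    foreach ($queries as $query) {
        $nodes = $sxe->xpath($query);
        if (!empty($nodes)) {
            $value = trim((string) $nodes[0]);
            if ($value !== '') {
                return $value;
            }
        }
    }

    return null; // field not present in this packet
}
```

Because `local-name()` is case-sensitive, asking for `description` does not match the `rdf:Description` wrapper. Returning `null` instead of an error string leaves it to the caller to decide how a missing field is reported.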

    I hope this helps somewhat. I know it’s given me a few ideas. If you have any other specific questions, let me know.

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base
