I’m creating a website where the admin uploads documents that are available only to paid members. However, I do want search engines to crawl and index the documents so that they appear in search results. The documents include DOC, DOCX and PDF files.
For example, I have a document that contains this text: “the quick brown fox jumped over the lazy dog”. Now someone Googles “brown fox”. Assuming my site ranks well enough, I would want the document to appear in the Google results. When the user clicks on the result, I want them to land on a page, instead of the document itself, that shows a preview of the text with a link to become a member and view the full document.
My plan is to save the preview of the document into the database when the document is uploaded, so that the preview page is easily visible and crawlable. For the full document, the only option I could come up with is to allow it to be crawled directly. But I think that if I allow search engines to crawl it, I’ll be giving access to regular users as well. And if I use htaccess to keep the documents from being accessed directly, then I’m shutting the crawlers out too.
I also considered extracting all of the document text and putting it in the database, but I read somewhere that it is very difficult to distinguish between a user and a spider, and that relying on user agents is a bad idea because they are very easy to spoof.
So I’m confused as to how I should go about this. Any help will be appreciated.
Thank you in advance!
No, not possible. Any user can pretend to be a search engine by changing their User-Agent.
You could do IP-address-based restrictions, or heuristic-based detection, but you’re likely to accidentally block crawlers.
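To make the IP-address-based option concrete, here is a minimal Python sketch of the reverse-then-forward DNS check that Google documents for verifying Googlebot. The function name `is_verified_crawler`, the injectable lookup parameters, and the suffix list are assumptions for illustration; check the suffixes against each search engine's own documentation. As written it only handles IPv4, and a failed lookup is treated as "not a crawler", which is exactly how a real crawler can end up blocked by accident.

```python
import socket

# Assumed hostname suffixes for Google's crawlers; verify against the
# search engine's current documentation before relying on them.
CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` reverse-resolves to a crawler hostname that
    forward-resolves back to the same IP.

    The two lookup callables are hypothetical hooks so the DNS step can
    be replaced (for caching or testing); by default the standard
    `socket` functions are used.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        # Step 1: IP -> hostname (PTR record).
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(CRAWLER_DOMAINS):
            return False
        # Step 2: hostname -> IPs; the original IP must be among them,
        # otherwise the PTR record could have been forged.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses.
        return False
```

The User-Agent header is never consulted here, so spoofing it does not help a visitor pass the check. DNS lookups are slow, so in practice the result would need to be cached per IP.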
Perhaps you should give users a number of free page views per day, or consider a different method of monetization.
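The free-page-views idea could be sketched roughly as follows. This is an in-memory illustration only: the class name `FreeViewMeter`, the default limit of 3, and the `visitor_id` key (a cookie value or IP address, for instance) are all assumptions, and a real site would keep the counters in its database or a cache instead.

```python
from collections import defaultdict
from datetime import date


class FreeViewMeter:
    """Counts full-document views per visitor per calendar day."""

    def __init__(self, daily_limit=3):
        self.daily_limit = daily_limit
        # Maps (visitor_id, day) -> number of views already granted.
        self._counts = defaultdict(int)

    def allow_view(self, visitor_id, today=None):
        """Return True and record the view if the visitor still has
        free views left today; otherwise return False."""
        key = (visitor_id, today or date.today())
        if self._counts[key] >= self.daily_limit:
            return False
        self._counts[key] += 1
        return True
```

When `allow_view` returns False, the site would show the preview page with the membership link instead of the full document. Because crawlers and anonymous visitors are treated the same way, nothing needs to distinguish a user from a spider.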