I tried to use this script pdfmeat to get data about papers from google

Question

Asked: May 24, 2026 (2026-05-24T09:39:14+00:00)

I tried to use the pdfmeat script to get data about papers from Google Scholar.

The script works very well on my PC, but when I run it on my server I get no results. My server is very probably on Google Scholar's blacklist, since I get an error (a redirect to a page asking me to solve a CAPTCHA):

$ wget scholar.google.com
--2011-08-08 04:52:19--  http://scholar.google.com/
Resolving scholar.google.com... 72.14.204.147, 72.14.204.99, 72.14.204.103, ...
Connecting to scholar.google.com|72.14.204.147|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.google.com/sorry/?continue=http://scholar.google.com/ [following]
--2011-08-08 04:52:24--  http://www.google.com/sorry/?continue=http://scholar.google.com/
Resolving www.google.com... 74.125.93.147, 74.125.93.99, 74.125.93.103, ...
Connecting to www.google.com|74.125.93.147|:80... connected.
HTTP request sent, awaiting response... 503 Service Unavailable
2011-08-08 04:52:24 ERROR 503: Service Unavailable.

Then I found that wget has an option, --execute "http_proxy=urltoproxy". I tried it:

wget -e "http_proxy=oneHttpProxy" scholar.google.com

and I could save the index.html from Google Scholar.

Then I tried to do the same in pdfmeat.py, but I don't get any results there either.

This is the code:

def getWebdata(self, link, referer='http://scholar.google.com'):
    useragent = 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100214 Ubuntu/9.10 (karmic) Firefox/3.5.8'
    c_web = 'wget --execute "http_proxy=oneHttpProxy" -qO- --user-agent="%s" --load-cookies="%s" "%s" --referer="%s"' % (useragent, WGET_COOKIEFILE, link, referer) 
    c_out = os.popen(c_web)
    c_txt = c_out.read()
    c_out.close()
    if re.search("We're sorry", c_txt) or re.search("please type the characters", c_txt):
        self.logger.critical("scholar captcha")
        if not self.options.quiet:
            print "PDFMEAT: scholar captcha!"
        sys.exit()
    self.logger.debug("getwebdata excerpt: %s" % (re.sub("\n", " ", c_txt[0:255])))
    self.queryLog.append("getwebdata excerpt: %s" % (re.sub("\n", " ", c_txt[0:255])))
    return c_txt

The script uses the os module. The original function does not have the --execute option for wget.

Thanks in advance

1 Answer

Editorial Team · Answer 1 · 2026-05-24T09:39:15+00:00

Have you tried just setting the http_proxy environment variable?

So:

$ export http_proxy="oneHttpProxy"

$ python pdfmeat.py ….
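
If you would rather not depend on the shell that launches the script, the same idea can be applied inside Python by handing the variable to wget's environment directly. The following is only a minimal sketch, not pdfmeat's actual code: the names build_wget_call and get_webdata, and the proxy URL in the usage note, are hypothetical, and it assumes Python 3 with wget on the PATH. It passes the arguments as a list instead of building one shell string, so quoting problems in the command line cannot interfere with the proxy setting.

```python
import os
import subprocess


def build_wget_call(link, proxy, cookiefile, useragent,
                    referer="http://scholar.google.com"):
    """Return (argv, env) for a wget call routed through `proxy`.

    Hypothetical helper: the proxy is supplied through the http_proxy
    environment variable, which wget reads, instead of --execute.
    """
    argv = [
        "wget", "-qO-",                    # quiet, write the page to stdout
        "--user-agent=" + useragent,
        "--load-cookies=" + cookiefile,
        "--referer=" + referer,
        link,
    ]
    env = dict(os.environ)                 # copy, so the parent env is untouched
    env["http_proxy"] = proxy
    return argv, env


def get_webdata(link, proxy, cookiefile, useragent):
    """Run wget through the proxy and return the page text."""
    argv, env = build_wget_call(link, proxy, cookiefile, useragent)
    result = subprocess.run(argv, env=env, stdout=subprocess.PIPE, check=False)
    return result.stdout.decode("utf-8", errors="replace")
```

A call would then look like get_webdata(link, "http://proxy.example:3128", WGET_COOKIEFILE, useragent), where the proxy URL is a placeholder. The returned text can be checked for the captcha page in the same way the original function does.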
