I’ve deployed a Django application on Heroku. The application by itself works fine. I can run commands such as heroku run python project/manage.py syncdb and heroku run python project/manage.py shell, and they work well.
My Django project uses the Python web scraping library Scrapy. Scrapy comes with a command, scrapy crawl abc, which scrapes the websites I have defined in the Scrapy application. When I run a command such as scrapy crawl spidername on my local machine, the application scrapes the data and copies it to my database. However, when I run the same command on Heroku from a sub-directory of my project directory (heroku run scrapy crawl spidername), nothing happens.
I don’t see anything in the Heroku logs that points to where I’m going wrong:
2012-01-26T15:45:38+00:00 heroku[run.1]: State changed from created to starting
2012-01-26T15:45:43+00:00 app[run.1]: Awaiting client
2012-01-26T15:45:43+00:00 app[run.1]: Starting process with command `project/spiderMainDir scrapy crawl spidername`
2012-01-26T15:45:44+00:00 heroku[run.1]: State changed from starting to up
2012-01-26T15:45:46+00:00 heroku[run.1]: State changed from up to complete
2012-01-26T15:45:46+00:00 heroku[run.1]: Process exited
Some additional information:
My scrapy app calls pipelines.py to save the scraped items to the database. In the pipelines.py file, this is what I’ve written to invoke the Django settings so that I can import my models and save data to the database from the scrapy application.
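import os
import sys

# Three levels up from this file; this directory is added to the import path.
PROJECT_PATH = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(PROJECT_PATH)
# Tell Django which settings module to load before any models are imported.
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'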
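To see what the three nested os.path.dirname calls resolve to, here is a small sketch. The directory layout in the example path is an assumption made for illustration, not the actual project layout, and climb is a hypothetical helper.

```python
import posixpath


def climb(path, levels):
    """Apply dirname `levels` times, mirroring the nested calls in pipelines.py."""
    for _ in range(levels):
        path = posixpath.dirname(path)
    return path


# Hypothetical location of pipelines.py on the dyno (assumed layout).
pipelines_file = "/app/project/spiderMainDir/spiderMainDir/pipelines.py"

# Three levels up is the directory that gets appended to sys.path.
project_path = climb(pipelines_file, 3)
print(project_path)
```

For DJANGO_SETTINGS_MODULE = 'settings' to resolve, settings.py has to be importable from a directory on sys.path, such as the one computed here.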
Any pointers on where exactly I am going wrong? How do I execute the scrapy command on Heroku so that my application can scrape an external website and save that data to the database? Isn’t heroku run command the way external commands are run on Heroku?
I’m answering my own question because I discovered what the problem was. For some reason, Heroku was not able to find scrapy when I executed the command from a sub-directory rather than the top-level directory.
The command heroku run ... is generally run from the top-level directory. For my project, which uses scrapy, I had to go to a sub-directory and run the scrapy command from there (this is how scrapy is designed). This wasn’t working on Heroku. So I opened the Heroku bash by typing heroku run bash to see what was going on. When I ran the scrapy command from the top-level directory, Heroku recognized it, but when I went to a sub-directory, it failed to recognize the scrapy command. I suppose there is some problem related to the path. From the sub-directory, I had to specify the complete path to scrapy (~/bin/scrapy crawl spidername) to be able to execute it.
To run the scrapy command without going to the Heroku bash manually each time, my workaround was to create a shell script, bin/scrapy.sh, under the bin directory of my top-level directory and push the changes to Heroku.
After this was done, I could execute $ heroku run scrapy.sh crawl spidername from my local bash. I suppose it’s not the best solution, but it works.
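A wrapper of this kind has to do two things: switch to the spider sub-directory and call scrapy by its full path. The sketch below does the same in Python rather than shell, so it is not the script from the answer. The path ~/bin/scrapy comes from the answer; the spider directory ~/project/spiderMainDir is an assumption inferred from the log line in the question, and build_command and run_scrapy are hypothetical helper names.

```python
import os
import subprocess

# Assumed locations: scrapy in ~/bin (from the answer) and the Scrapy project
# in project/spiderMainDir under the app root (inferred from the log line).
SCRAPY_BIN = os.path.join("~", "bin", "scrapy")
SPIDER_DIR = os.path.join("~", "project", "spiderMainDir")


def build_command(args, scrapy_bin=SCRAPY_BIN):
    """Return the argv list that calls scrapy by its full path."""
    return [os.path.expanduser(scrapy_bin)] + list(args)


def run_scrapy(args, spider_dir=SPIDER_DIR):
    """Run scrapy from inside the spider directory and return its exit code."""
    return subprocess.call(build_command(args), cwd=os.path.expanduser(spider_dir))
```

Calling run_scrapy(["crawl", "spidername"]) would then correspond to running ~/bin/scrapy crawl spidername from inside the sub-directory.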