I really need help finishing this task, since it is related to my research and I'm new to Python and Scrapy.
*The task is to select every input field (type = text, password, or file) and store its id in a back-end database, along with the link of the page the input belongs to.*
My code to select the input fields:
    def parse_item(self, response):
        self.log('%s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = IsaItem()
        item['response_fld'] = response.url
        item['text_input'] = hxs.select("//input[(@id or @name) and (@type = 'text')]/@id").extract()
        item['pass_input'] = hxs.select("//input[(@id or @name) and (@type = 'password')]/@id").extract()
        item['file_input'] = hxs.select("//input[(@id or @name) and (@type = 'file')]/@id").extract()
        return item
Database pipeline code:
    class SQLiteStorePipeline(object):
        def __init__(self):
            self.conn = sqlite3.connect('./project.db')
            self.cur = self.conn.cursor()
        def process_item(self, item, spider):
            self.cur.execute("insert into inputs ( input_name) values(?)" , (item['text_input'][0] ), )
            self.cur.execute("insert into inputs ( input_name) values(?)" , (item['pass_input'][0] ,))
            self.cur.execute("insert into inputs ( input_name) values(?)" ,(item['file_input'][0] , ))
            self.cur.execute("insert into links (link) values(?)", (item['response_fld'][0], ))
            self.conn.commit()
            return item
But I still get an error like this:
    self.cur.execute("insert into inputs ( input_name) values(?)" , (item['text_input'][0] ), )
    exceptions.IndexError: list index out of range
or the database stores only the first letter!
Database links table
╔════════════════╗
║ links ║
╠════════════════╣
║ id │input ║
╟──────┼─────────╢
║ 1 │ t ║
╟──────┼─────────╢
║ 2 │ t ║
╚══════╧═════════╝
Note: it should be "tbPassword" or "tbUsername".
Output from the JSON file:
{"pass_input": ["tbPassword"], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/Signup.aspx", "text_input": ["tbUsername"]}
{"pass_input": [], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/default.aspx", "text_input": []}
{"pass_input": ["tbPassword"], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/login.aspx", "text_input": ["tbUsername"]}
{"pass_input": [], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/Comments.aspx?id=0", "text_input": []}
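You are getting IndexError because you try to get the first item in the list, which is sometimes empty. Indexing a string with [0] likewise returns only its first character, which is likely why single letters end up in the database. I would do it like this.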
The spider:
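The following is a minimal sketch of the idea, not a definitive implementation. It keeps the item fields from the question (`response_fld`, `text_input`, `pass_input`, `file_input`) and stores each extracted list whole, so nothing is indexed in the spider. It is written as a plain function returning a dict so it can be read on its own; in the project it would be the body of the spider's `parse_item` method and would fill an `IsaItem`. The `INPUT_TYPES` mapping is a name introduced here for illustration, and `response.xpath(...)` assumes a Scrapy version where responses expose that method in place of `HtmlXPathSelector`.

```python
# Sketch only: in a real project this would be the spider's parse_item method.
# INPUT_TYPES is a hypothetical helper mapping item field -> HTML input type.
INPUT_TYPES = {
    "text_input": "text",
    "pass_input": "password",
    "file_input": "file",
}


def parse_item(response):
    """Collect the ids of text, password and file inputs found on one page."""
    # A plain dict stands in for IsaItem here; the keys are the same.
    item = {"response_fld": response.url}
    for field, input_type in INPUT_TYPES.items():
        xpath = "//input[(@id or @name) and @type='%s']/@id" % input_type
        # Keep the whole list, which may be empty; do not index it with [0].
        item[field] = response.xpath(xpath).extract()
    return item
```

Pages without any matching input simply produce empty lists, as in the JSON output from the question.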
The pipeline:
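A hedged sketch of a pipeline that avoids both problems: it loops over each list instead of taking `[0]`, so empty lists are skipped without an IndexError, and it passes `response_fld` as the whole string rather than its first character. The `db_path` parameter and the `INPUT_FIELDS` tuple are additions made here for illustration; the `inputs` and `links` tables are assumed to already exist with the columns used in the question.

```python
import sqlite3


class SQLiteStorePipeline(object):
    # Item fields that hold lists of input ids (names taken from the question).
    INPUT_FIELDS = ("text_input", "pass_input", "file_input")

    def __init__(self, db_path="./project.db"):
        # db_path is a hypothetical parameter; the default matches the question.
        self.conn = sqlite3.connect(db_path)
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # response_fld is a plain string, so it is bound as-is (no [0]).
        self.cur.execute("insert into links (link) values (?)",
                         (item["response_fld"],))
        for field in self.INPUT_FIELDS:
            # Looping over the list handles zero, one or many ids per page.
            for input_id in item[field]:
                # The trailing comma makes this a one-element tuple.
                self.cur.execute("insert into inputs (input_name) values (?)",
                                 (input_id,))
        self.conn.commit()
        return item
```

Note that this stores a link row for every page, including pages with no inputs; if that is not wanted, the link insert could be skipped when all three lists are empty.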