I am trying to scrape form field IDs using Beautiful Soup like this
for link in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
if link.has_key('id'):
print link['id']
Lets us assume that it returns something like
username
email
password
passwordagain
terms
button_register
I would like to write this into Sqlite3 DB.
What I will be doing down the line in my application is… Use these form fields’ IDs and try to do a POST may be. The problem is.. there are plenty of sites like this whose form field IDs I have scraped. So the relation is like this…
Domain1 - First list of Form Fields for this Domain1
Domain2 - Second list of Form Fields for this Domain2
.. and so on
What I am unsure here is… How should I design my column for this kind of purpose? Will it be OK if I just create a table with two columns – say
COL 1 - Domain URL (as TEXT)
COL 2 - List of Form Field IDs (as TEXT)
One thing to be remembered is… Down the line in my application I will need to do something like this…
Pseudocode
If Domain is "http://somedomain.com":
For ever item in the COL2 (which is a list of form field ids):
Assign some set of values to each of the form fields & then make a POST request
Can any one guide, please?
EDITed on 22/07/2011 – Is My Below Database Design Correct?
I have decided to have a solution like this. What do you guys think?
I will be having three tables like below
Table 1
Key Column (Auto Generated Integer) - Primary Key
Domain as TEXT
Sample Data would be something like:
1 http://url1.com
2 http://url2.com
3 http://url3.com
Table 2
Domain (Here I will be using the Key Number from Table 1)
RegLink - This will have the registeration link (as TEXT)
Form Fields (as Text)
Sample Data would be something like:
1 http://url1.com/register field1
1 http://url1.com/register field2
1 http://url1.com/register field3
2 http://url2.com/register field1
2 http://url2.com/register field2
2 http://url2.com/register field3
3 http://url3.com/register field1
3 http://url3.com/register field2
3 http://url3.com/register field3
Table 3
Domain (Here I will be using the Key Number from Table 1)
Status (as TEXT)
User (as TEXT)
Pass (as TEXT)
Sample Data would be something like:
1 Pass user1 pass1
2 Fail user2 pass2
3 Pass user3 pass3
Do you think this table design is good? Or are there any improvements that can be made?
Try this to get the ids:
And this should show you how to save them, load them, and do something to each. This uses a single table and just inserts one row for each field for each domain. It’s the simplest solution, and perfectly adequate for a relatively small number of rows of data.