I’m trying to Parse the following HTML pages using BeautifulSoup (I’m going to parse


Editorial Team

Asked: 2026-06-06T20:33:47+00:00


I’m trying to parse the following HTML pages using BeautifulSoup (I’m going to parse a large batch of pages).

I need to save all of the fields on every page, but the fields can change from page to page.

Here is an example of a page: Page 1
and a page with a different field order: Page 2

I’ve written the following code to parse the page.

import requests
from bs4 import BeautifulSoup

PTiD = 7680560

url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=" + str(PTiD) + ".PN.&OS=PN/" + str(PTiD) + "&RS=PN/" + str(PTiD)

# The prefetch argument was removed from requests; the body is downloaded by default.
res = requests.get(url)

raw_html = res.content

print "Parser Started.. "

bs_html = BeautifulSoup(raw_html, "lxml")

#Initialize all the Search Lists
fonts = bs_html.find_all('font')
para = bs_html.find_all('p')
bs_text = bs_html.find_all(text=True)
onlytext = [x for x in bs_text if x != '\n' and x != ' ']

#Initialize the Indexes
AppNumIndex = onlytext.index('Appl. No.:\n')
FiledIndex = onlytext.index('Filed:\n  ')
InventorsIndex = onlytext.index('Inventors: ')
AssigneeIndex = onlytext.index('Assignee:')
ClaimsIndex = onlytext.index('Claims')
DescriptionIndex = onlytext.index(' Description')
CurrentUSClassIndex = onlytext.index('Current U.S. Class:')
CurrentIntClassIndex = onlytext.index('Current International Class: ')
PrimaryExaminerIndex = onlytext.index('Primary Examiner:')
AttorneyOrAgentIndex = onlytext.index('Attorney, Agent or Firm:')
RefByIndex = onlytext.index('[Referenced By]')

#~~Title~~
for a in fonts:
    # has_attr() replaces the deprecated has_key() in bs4
    if a.has_attr('size') and a['size'] == '+1':
        d_title = a.string
print "title: " + d_title

#~~Abstract~~~
d_abstract = para[0].string
print "abstract: " + d_abstract

#~~Assignee Name~~
d_assigneeName = onlytext[AssigneeIndex +1]
print "as name: " + d_assigneeName

#~~Application number~~
d_appNum = onlytext[AppNumIndex + 1]
print "ap num: " + d_appNum

#~~Application date~~
d_appDate = onlytext[FiledIndex + 1]
print "ap date: " + d_appDate

#~~ Patent Number~~
d_PatNum = onlytext[0].split(':')[1].strip()
print "patnum: " + d_PatNum

#~~Issue Date~~
d_IssueDate = onlytext[10].strip('\n')
print "issue date: " + d_IssueDate

#~~Inventors Name~~
d_InventorsName = ''
for x in range(InventorsIndex+1, AssigneeIndex, 2):
    d_InventorsName += onlytext[x]
print "inv name: " + d_InventorsName

#~~Inventors City~~
d_InventorsCity = ''

for x in range(InventorsIndex+2, AssigneeIndex, 2):
    d_InventorsCity += onlytext[x].split(',')[0].strip().strip('(') + ','

d_InventorsCity = d_InventorsCity.strip(',').strip().strip(')')
print "inv city: " + d_InventorsCity

#~~Inventors State~~
d_InventorsState = ''
for x in range(InventorsIndex+2, AssigneeIndex, 2):
    d_InventorsState += onlytext[x].split(',')[1].strip(')').strip() + ','

d_InventorsState = d_InventorsState.strip(',').strip()
print "inv state: " + d_InventorsState

#~~ Assignee City ~~
# The text looks like "(City, ST)", so the city is the part before the comma
d_AssigneeCity = onlytext[AssigneeIndex + 2].split(',')[0].strip('\n').strip().strip('(')
print "asign city: " + d_AssigneeCity

#~~ Assignee State~~
d_AssigneeState = onlytext[AssigneeIndex + 2].split(',')[1].strip().strip('\n').strip(')')
print "asign state: " + d_AssigneeState

#~~Current US Class~~
d_CurrentUSClass = ''

for x in range(CurrentUSClassIndex + 1, CurrentIntClassIndex):
    d_CurrentUSClass += onlytext[x]
print "cur us class: " + d_CurrentUSClass

#~~ Current Int Class~~
d_CurrentIntlClass = onlytext[CurrentIntClassIndex +1]
print "cur intl class: " + d_CurrentIntlClass

#~~~Primary Examiner~~~
d_PrimaryExaminer = onlytext[PrimaryExaminerIndex +1]
print "prim ex: " + d_PrimaryExaminer

#~~d_AttorneyOrAgent~~
d_AttorneyOrAgent = onlytext[AttorneyOrAgentIndex +1]
print "agent: " + d_AttorneyOrAgent

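#~~ Referenced by ~~
d_ReferencedBy = ''
# Cap the scan at the end of the list to avoid an IndexError on short pages
for x in range(RefByIndex + 2, min(RefByIndex + 400, len(onlytext))):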
    if (('Foreign' in onlytext[x]) or ('Primary' in onlytext[x])):
        break
    else:
        d_ReferencedBy += onlytext[x]
print "ref by: " + d_ReferencedBy

#~~Claims~~
d_Claims = ''

for x in range(ClaimsIndex , DescriptionIndex):
    d_Claims += onlytext[x]
print "claims: " + d_Claims

I put all the text from the page into a list (using BeautifulSoup’s find_all(text=True)). Then I find the indexes of the field names, walk the list from each of those positions, and append the items to a string until I reach the next field’s index.

When I tried the code on several different pages, I noticed that the way the text is split into list items changes, so I can’t find the field indexes in the list.
For example, I search for the index of ‘123’, and on some pages it appears in the list as ‘12’, ‘3’.

Can you think of any other way to parse the page that would be generic?

Thanks.


1 Answer

Editorial Team · Answer 1 · 2026-06-06T20:33:48+00:00

If you are using BeautifulSoup and the DOM is <p>123</p>, find_all(text=True) will give you ['123'].

However, if the DOM is <p>12<b>3</b></p>, which means the same thing as the previous one, BeautifulSoup will give you ['12', '3'].

You could find exactly which tag is stopping you from getting the complete ['123'], and ignore or remove that tag first.

Some rough code showing how to remove the <b> tag:

import re
html = '<p>12<b>3</b></p>'
# Match opening and closing <b> tags only; \b keeps <br> and <body> from matching
reExp = r'</?b\b[^<>]*>'
print(re.sub(reExp, '', html))  # <p>123</p>
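
Stripping tags with a regular expression can be fragile on real pages. As an alternative, here is a minimal Python 3 sketch that uses only the standard library's html.parser rather than BeautifulSoup. It joins text pieces that are separated only by inline tags such as <b>, so '12' and '3' come back as one item. The class name TextChunks, the helper text_chunks, and the contents of INLINE_TAGS are assumptions made up for this illustration, not part of any library.

```python
from html.parser import HTMLParser

# Assumption: tags treated as "inline", i.e. they must not split a value.
INLINE_TAGS = {"b", "i", "u", "em", "strong", "font", "span", "a"}


class TextChunks(HTMLParser):
    """Collect text chunks, merging text separated only by inline tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []    # finished chunks
        self._current = []  # pieces of the chunk being built

    def _flush(self):
        # Close the chunk in progress and keep it if it is not blank.
        text = "".join(self._current).strip()
        if text:
            self.chunks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        # Any non-inline tag ends the current chunk.
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def text_chunks(html):
    """Return the list of merged text chunks found in an HTML string."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(text_chunks("<p>12<b>3</b></p><p>Filed:</p>"))  # ['123', 'Filed:']
```

The resulting list can be searched for field labels in the same way as onlytext in the question. If you stay with BeautifulSoup, calling get_text() on the enclosing tag similarly returns the combined text of its children.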

For extracting values by pattern, you could use this:

import re
# your_html is assumed to hold the page source as a string
patterns = r'<TD align=center>(?P<VALUES_TO_FIND>.*?)</TD>'
print(re.findall(patterns, your_html))
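
For pages where the fields appear in a different order, one possible approach is to locate each label wherever it occurs and slice up to the next label that is actually present, instead of assuming a fixed sequence. The Python 3 sketch below works on a plain list of text items like onlytext in the question. The helper name extract_fields, the LABELS list (a subset of the labels from the question, with whitespace stripped), and the sample pages with made-up values are all assumptions for illustration.

```python
# Labels to look for; a subset of those in the question, whitespace stripped.
LABELS = ["Inventors:", "Assignee:", "Appl. No.:", "Filed:"]


def extract_fields(text_items, labels=LABELS):
    """Map each label found to the items between it and the next label."""
    # Normalise whitespace so 'Appl. No.:\n' and 'Appl. No.:' compare equal.
    cleaned = [item.strip() for item in text_items if item.strip()]
    # Positions of the labels that are present on this page, in page order.
    positions = sorted(
        (cleaned.index(label), label) for label in labels if label in cleaned
    )
    fields = {}
    for n, (start, label) in enumerate(positions):
        # A field ends where the next label starts, or at the end of the list.
        end = positions[n + 1][0] if n + 1 < len(positions) else len(cleaned)
        fields[label] = cleaned[start + 1:end]
    return fields


# Two hypothetical pages with the same kind of data in a different order.
page_a = ["Inventors: ", "Doe; John", "(Austin, TX)",
          "Assignee:", "Acme Corp.", "Appl. No.:\n", "11/000,001"]
page_b = ["Appl. No.:\n", "11/000,001", "Assignee:", "Acme Corp."]

print(extract_fields(page_a))
print(extract_fields(page_b))
```

Missing labels are simply absent from the result, so a page without a given field does not raise an error. Note that the last label found collects everything up to the end of the list, so in practice the label list should include whatever heading follows the final field of interest.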
