I need to read some large files (from 50k to 100k lines), structured in groups separated by empty lines. Each group start at the same pattern ‘No.999999999 dd/mm/yyyy ZZZ’. Here´s some sample data.
No.813829461 16/09/1987 270
Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)
C.N.P.J./C.I.C./N INPI : 16404287000155
Procurador: MARCELLO DO NASCIMENTONo.815326777 28/12/1989 351
Tit.SIGLA SISTEMA GLOBO DE GRAVACOES AUDIO VISUAIS LTDA (BR/RJ)
C.N.P.J./C.I.C./NºINPI : 34162651000108
Apres.: Nominativa ; Nat.: De Produto
Marca: TRIO TROPICAL
Clas.Prod/Serv: 09.40
*DEFERIDO CONFORME RESOLUÇÃO 123 DE 06/01/2006, PUBLICADA NA RPI 1829, DE 24/01/2006.
Procurador: WALDEMAR RODRIGUES PEDRANo.900148764 11/01/2007 LD3
Tit.TIARA BOLSAS E CALÇADOS LTDA
Procurador: Marcia Ferreira Gomes
*Escritório: Marcas Marcantes e Patentes Ltda
*Exigência Formal não respondida Satisfatoriamente, Pedido de Registro de Marca considerado inexistente, de acordo com Art. 157 da LPI
*Protocolo da Petição de cumprimento de Exigência Formal: 810080140197
I wrote some code that´s parsing it accordingly. There´s anything that I can improve, to improve readability or performance? Here´s what I come so far:
import re, pprint class Despacho(object): ''' Class to parse each line, applying the regexp and storing the results for future use ''' regexp = { re.compile(r'No.([\d]{9}) ([\d]{2}/[\d]{2}/[\d]{4}) (.*)'): lambda self: self._processo, re.compile(r'Tit.(.*)'): lambda self: self._titular, re.compile(r'Procurador: (.*)'): lambda self: self._procurador, re.compile(r'C.N.P.J./C.I.C./N INPI :(.*)'): lambda self: self._documento, re.compile(r'Apres.: (.*) ; Nat.: (.*)'): lambda self: self._apresentacao, re.compile(r'Marca: (.*)'): lambda self: self._marca, re.compile(r'Clas.Prod/Serv: (.*)'): lambda self: self._classe, re.compile(r'\*(.*)'): lambda self: self._complemento, } def __init__(self): ''' 'complemento' is the only field that can be multiple in a single registry ''' self.complemento = [] def _processo(self, matches): self.processo, self.data, self.despacho = matches.groups() def _titular(self, matches): self.titular = matches.group(1) def _procurador(self, matches): self.procurador = matches.group(1) def _documento(self, matches): self.documento = matches.group(1) def _apresentacao(self, matches): self.apresentacao, self.natureza = matches.groups() def _marca(self, matches): self.marca = matches.group(1) def _classe(self, matches): self.classe = matches.group(1) def _complemento(self, matches): self.complemento.append(matches.group(1)) def read(self, line): for pattern in Despacho.regexp: m = pattern.match(line) if m: Despacho.regexp[pattern](self)(m) def process(rpi): ''' read data and process each group ''' rpi = (line for line in rpi) group = False for line in rpi: if line.startswith('No.'): group = True d = Despacho() if not line.strip() and group: # empty line - end of block yield d group = False d.read(line) arquivo = open('rm1972.txt') # file to process for desp in process(arquivo): pprint.pprint(desp.__dict__) print('--------------')
That is pretty good. Below some suggestions, let me know if you like’em: