I’m trying to write a Python script to extract lines from a file. The file is a text dump of Python suds output.
I want to:
- strip all characters except words and numbers. I don’t want any “\n”, “[”, “]”, “{”, “=”, etc. characters.
- find a section where it starts with “ArrayOf_xsd_string”
- remove the next line “item[] =” from the result
- grab the remaining 6 lines and create a dictionary based on the unique number on the fifth line (123456, 234567, 345678) using this number as the key and the remaining lines as the values (pardon my ignorance if I’m not explaining this in pythonic terminology)
- output the results to a file
Data in file is a list:
[(ArrayOf_xsd_string){
item[] =
"001",
"ABCD",
"1234",
"wordy type stuff",
"123456",
"more stuff, etc",
}, (ArrayOf_xsd_string){
item[] =
"002",
"ABCD",
"1234",
"wordy type stuff",
"234567",
"more stuff, etc",
}, (ArrayOf_xsd_string){
item[] =
"003",
"ABCD",
"1234",
"wordy type stuff",
"345678",
"more stuff, etc",
}]
I tried doing a re.compile and here is my poor attempt at the code:
import re, string

f = open('data.txt', 'rb')
linelist = []
for line in f:
    line = re.compile('[\W_]+')
    line.sub('', string.printable)
    linelist.append(line)
print linelist

newlines = []
for line in linelist:
    mylines = line.split()
    if re.search(r'\w+', 'ArrayOf_xsd_string'):
        newlines.append([next(linelist) for _ in range(6)])
print newlines
I’m a Python newbie and haven’t found any results on Google or Stack Overflow for how to extract a specific number of lines after finding specific text. Any help is most appreciated.
Please ignore my code as I am taking “shots in the dark” 🙂
Here is what I’d like to see as the results:
123456: 001,ABCD,1234,wordy type stuff,more stuff etc
234567: 002,ABCD,1234,wordy type stuff,more stuff etc
345678: 003,ABCD,1234,wordy type stuff,more stuff etc
I hope that helps with trying to interpret my flawed code.
Several suggestions on your code:
Stripping all non-alphanumeric characters is unnecessary and wastes time; there is no need to build linelist at all. Are you aware you can simply use plain old string.find("ArrayOf_xsd_string") or re.search(...)?
Then as to your regex: _ is a word character, so it is matched by \w, not \W, and [\W_]+ is only needed if you want underscores stripped too. More importantly, line = re.compile(...) rebinds line to the compiled pattern, which overwrites the line you just read, and the result of line.sub(...) is thrown away.
Here’s my version, which reads the file directly, and also handles multiple matches:
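A minimal sketch of that approach, written for Python 3 (the question's code is Python 2). The names parse_records, store, clean, format_record and convert are hypothetical, as is the output path passed to convert. It assumes every record puts its unique number in the fifth quoted item, as in the sample data, and that punctuation inside an item (the comma in "more stuff, etc") should be dropped.

```python
import re

MARKER = "ArrayOf_xsd_string"
# One quoted item per line, e.g. "wordy type stuff", (trailing comma optional).
ITEM_RE = re.compile(r'^\s*"(.*)",?\s*$')
KEY_INDEX = 4  # assumption: the unique number is the fifth item of each record


def clean(item):
    """Drop everything except word characters and spaces inside an item."""
    return re.sub(r"[^\w ]", "", item)


def store(records, entries):
    """File a finished record under its key; ignore empty or short records."""
    if entries and len(entries) > KEY_INDEX:
        key = entries[KEY_INDEX]
        records[key] = entries[:KEY_INDEX] + entries[KEY_INDEX + 1:]


def parse_records(lines):
    """Return {unique number: [remaining items]} for every marked section."""
    records = {}
    entries = None  # stays None until the first marker line is seen
    for line in lines:
        if MARKER in line:
            # A marker line closes the previous record and opens a new one.
            store(records, entries)
            entries = []
            continue
        match = ITEM_RE.match(line)
        if match and entries is not None:
            entries.append(clean(match.group(1)))
        # Lines such as "item[] =" or "}]" match nothing and are skipped.
    store(records, entries)  # flush the last record
    return records


def format_record(key, entries):
    return "%s: %s" % (key, ",".join(entries))


def convert(in_path, out_path):
    """Read the dump line by line and write one 'key: items' line per record."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for key, entries in parse_records(src).items():
            dst.write(format_record(key, entries) + "\n")
```

Something like convert('data.txt', 'out.txt') would then process the whole file in one pass. A plain substring test (MARKER in line) is enough to spot the start of a section, so the regex is only used to pull the text out of the quotes.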
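And the output is exactly what you wanted (that’s what ','.join(entries) was for):
123456: 001,ABCD,1234,wordy type stuff,more stuff etc
234567: 002,ABCD,1234,wordy type stuff,more stuff etc
345678: 003,ABCD,1234,wordy type stuff,more stuff etc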