I have PDB (text) files in a directory. I would like to print the number of subunits in each PDB file.
- Read all lines in a PDB file that start with `ATOM`.
- The fifth column of the `ATOM` line contains `A`, `B`, `C`, `D`, etc.
- If it contains only `A`, the number of subunits is 1. If it contains `A` and `B`, the number of subunits is 2. If it contains `A`, `B`, and `C`, the number of subunits is 3.
1kg2.pdb file
ATOM 1363 N ASN A 258 82.149 -23.468 9.733 1.00 57.80 N
ATOM 1364 CA ASN A 258 82.494 -22.084 9.356 1.00 62.98 C
ATOM 1395 C MET B 196 34.816 -51.911 11.750 1.00 49.79 C
ATOM 1396 O MET B 196 35.611 -52.439 10.963 1.00 47.65 O
1uz3.pdb file
ATOM 1384 O ARG A 260 80.505 -20.450 15.420 1.00 22.10 O
ATOM 1385 CB ARG A 260 78.980 -18.077 15.207 1.00 36.88 C
ATOM 1399 SD MET B 196 34.003 -52.544 16.664 1.00 57.16 S
ATOM 1401 N ASP C 197 34.781 -50.611 12.007 1.00 44.30 N
2b69.pdb file
ATOM 1393 N MET B 196 33.300 -54.017 12.033 1.00 46.46 N
ATOM 1394 CA MET B 196 33.782 -52.714 12.566 1.00 49.99 C
desired output
pdb_id subunits
1kg2 2
1uz3 3
2b69 1
How can I do this with awk, Python or Biopython?
You can use an array to record every value seen in the fifth column; the number of distinct keys in the array is then the number of subunits.
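Since the question also asks about Python, here is a minimal sketch of the same "record every value seen" idea, using a set in place of an awk array. The function name `count_subunits` is made up for illustration, and the code assumes whitespace-separated fields as in the samples above. Real PDB files are fixed-width (the chain ID sits in column 22), so splitting on whitespace can misread lines where neighbouring fields run together.

```python
def count_subunits(lines):
    """Count distinct chain IDs on ATOM lines.

    Assumes the chain ID is the fifth whitespace-separated field,
    as in the sample lines from the question.
    """
    chains = set()
    for line in lines:
        fields = line.split()
        # Only ATOM records with at least five fields are considered.
        if len(fields) >= 5 and fields[0] == "ATOM":
            chains.add(fields[4])
    return len(chains)


sample = [
    "ATOM 1363 N ASN A 258 82.149 -23.468 9.733 1.00 57.80 N",
    "ATOM 1364 CA ASN A 258 82.494 -22.084 9.356 1.00 62.98 C",
    "ATOM 1395 C MET B 196 34.816 -51.911 11.750 1.00 49.79 C",
]
print(count_subunits(sample))  # prints 2 (chains A and B)
```

Because the set stores each chain ID only once, a file that contains only chain `B` (like 2b69 above) still counts as 1 subunit.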
Edit: Using gawk 4.x, you can use the `ENDFILE` pattern to generate the required output as soon as each file has been read.
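For comparison, a Python sketch of the per-file reporting: in Python the "end of file" step is simply the point where the loop over one file finishes, so no special hook is needed. The helper names `chain_ids` and `report`, and the `*.pdb` glob pattern, are assumptions for illustration and not part of the gawk approach. As before, the code assumes the chain ID is the fifth whitespace-separated field.

```python
from pathlib import Path


def chain_ids(path):
    """Return the set of chain IDs found on ATOM lines of one file."""
    chains = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 5 and fields[0] == "ATOM":
                chains.add(fields[4])
    return chains


def report(directory):
    """Return (pdb_id, subunit_count) pairs for every *.pdb file, sorted by name."""
    rows = []
    for pdb in sorted(Path(directory).glob("*.pdb")):
        # The file has been read completely here: the analogue of ENDFILE.
        rows.append((pdb.stem, len(chain_ids(pdb))))
    return rows


if __name__ == "__main__":
    print("pdb_id subunits")
    for pdb_id, count in report("."):
        print(pdb_id, count)
```

Run from the directory that holds the PDB files, this prints one header line followed by one `pdb_id count` line per file, in the same layout as the desired output in the question.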