I have a question about sorting data by multiple columns. I'm a beginner at this and am wondering how I can sort by one column and then by another without losing the ordering of the first column. I have a file of tab-separated data with three columns: an ID in the first column, and the start and end positions in the second and third columns. Most IDs appear only once. Occasionally, however, there are multiple entries for the same ID. These need to remain grouped together (without a space separating them from the next entry, unless it has a different ID). The data is already sorted by the first column, but I need to sort it numerically by starting position (second column) while preserving the original ordering of the IDs. Like this:
Current format:
PITG_00129 606 1436
PITG_00130 1 987
PITG_00132 2 1321
PITG_00133 4464 11708
PITG_00133 1 2946
PITG_00133 4081 4515
Desired format:
PITG_00129 606 1436
PITG_00130 1 987
PITG_00132 2 1321
PITG_00133 1 2946
PITG_00133 4081 4515
PITG_00133 4464 11708
You can do this pretty easily in Python. First, you need to read your data in a proper format:
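A minimal sketch of that reading step. The helper names `parse_line` and `read_rows` are made up for illustration, and it assumes every non-blank line has exactly three whitespace-separated fields:

```python
def parse_line(line):
    """Turn 'PITG_00133 4464 11708' into ('PITG_00133', 4464, 11708)."""
    # split() with no argument splits on any whitespace, so tabs work too
    gene_id, start, end = line.split()
    return gene_id, int(start), int(end)


def read_rows(path):
    """Read the file at `path` into a list of (id, start, end) tuples."""
    with open(path) as handle:
        # skip blank lines so they don't break the three-field unpacking
        return [parse_line(line) for line in handle if line.strip()]
```

You would call it as `rows = read_rows("data.txt")`, where `data.txt` is just a placeholder for your own file name.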
This will turn each line into a tuple which will sort lexicographically. Since your strings (the first column) are set up in an easily sorted manner, we don’t need to worry about them. The second and third columns just need to be converted to integers to make them sort properly.
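To make the tuple sorting concrete, here is a small sketch using the sample rows from the question, typed in by hand rather than read from a file. Writing the result back out tab-separated is my assumption about the format you want:

```python
rows = [
    ("PITG_00129", 606, 1436),
    ("PITG_00130", 1, 987),
    ("PITG_00132", 2, 1321),
    ("PITG_00133", 4464, 11708),
    ("PITG_00133", 1, 2946),
    ("PITG_00133", 4081, 4515),
]

# Tuples compare element by element: ID first, then start, then end.
sorted_rows = sorted(rows)

# Why the int conversion matters: as strings, "4464" sorts before "606".
as_strings = sorted(["606", "4464"])   # ['4464', '606']
as_numbers = sorted([606, 4464])       # [606, 4464]

for row in sorted_rows:
    print("\t".join(str(field) for field in row))
```

Because the ID is the first element of each tuple, rows with the same ID stay next to each other and only their relative order changes.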
Here’s another implementation to preserve blank lines between groups:
This uses itertools to chunk up the file based on the empty lines and sorts those groups individually before writing them back out.
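A sketch of how that chunking could look with `itertools.groupby`. The function names and the file names in the usage note are placeholders, and it assumes each non-blank line has exactly three whitespace-separated fields:

```python
from itertools import groupby


def parse_line(line):
    """Split a line into (id, start, end) with numeric positions."""
    gene_id, start, end = line.split()
    return gene_id, int(start), int(end)


def sort_blocks(lines):
    """Yield lines with each blank-line-delimited block sorted on its own."""
    # groupby yields alternating runs of blank and non-blank lines
    for is_blank, block in groupby(lines, key=lambda line: not line.strip()):
        if is_blank:
            # pass the separator lines through unchanged
            yield from block
        else:
            for row in sorted(parse_line(line) for line in block):
                yield "\t".join(str(field) for field in row) + "\n"


def sort_file(src, dst):
    """Read `src`, sort each block, and write the result to `dst`."""
    with open(src) as infile, open(dst, "w") as outfile:
        outfile.writelines(sort_blocks(infile))
```

For example, `sort_file("data.txt", "sorted.txt")` would write a sorted copy and leave the input file untouched.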
Here’s the output: