I have a large file (100 million lines of tab-separated values, about 1.5 GB in size). What is the fastest known way to sort this based on one of the fields?
I have tried Hive. I would like to see if this can be done faster using Python.
Have you considered using the *nix sort program? In raw terms, it’ll probably be faster than most Python scripts.
Use -t $'\t' to specify that it’s tab-separated, -k n to specify the field, where n is the field number, and -o outputfile if you want to output the result to a new file.
Example:
    sort -t $'\t' -k 4 -o sorted.txt input.txt
Will sort input.txt on its 4th field, and output the result to sorted.txt.
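If the sort has to be started from Python anyway, one option is to hand the work to the same sort program through subprocess. The following is only a minimal sketch under a few assumptions: a POSIX-style sort is on the PATH, plain byte-order comparison (LC_ALL=C) is acceptable, and the function name sort_tsv_by_field is made up for illustration. It writes the key as 4,4 rather than 4, because a bare -k 4 key runs from field 4 to the end of the line.

```python
import os
import subprocess


def sort_tsv_by_field(input_path, output_path, field, numeric=False):
    """Sort a tab-separated file on one 1-based field using the system sort.

    Hypothetical helper: it only wraps the external sort command.
    """
    # "N,N" restricts the key to field N alone; "-k N" by itself would
    # use everything from field N to the end of the line.
    key = f"{field},{field}"
    if numeric:
        key += "n"  # compare the field as a number instead of as text

    # Assumption: byte-order comparison is good enough. LC_ALL=C skips
    # locale-aware collation, which is usually faster.
    env = dict(os.environ, LC_ALL="C")

    # Python passes a literal tab as the separator, so the shell-only
    # $'\t' quoting is not needed here.
    subprocess.run(
        ["sort", "-t", "\t", "-k", key, "-o", output_path, input_path],
        check=True,  # raise CalledProcessError if sort fails
        env=env,
    )


if __name__ == "__main__":
    # Same effect as the command-line example, limited to field 4.
    sort_tsv_by_field("input.txt", "sorted.txt", 4)
```

Pass numeric=True when the field holds numbers; without it the field is compared as text, so "10" sorts before "2".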