Hi, suppose I have a tab-separated file like this (fields are separated by tabs):
Name    ID    Country    GPA
Tom     id1   USA        3.4
Jon     id2   Canada
Amy           UK         3.0
Kevin   id4   Scotland
Kris                     3.1
Here the density of Name is 1.0, that is, 100% (no fields missing).
The density of ID is 0.6, that is, 60% (2 of 5 fields missing).
The density of Country is 0.8.
The density of GPA is also 0.6.
How can I compute this for a file using Python? I also need an algorithm that is efficient and fast, since I need to do this for thousands of files totaling more than 40 GB. MapReduce code would also work.
Thanks in advance 🙂
1 Answer
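One possible approach, as a minimal sketch rather than a definitive solution: stream each file line by line with the standard csv module, count the non-empty fields per column (the "map" step), then sum the counts across files and divide by the total number of rows (the "reduce" step). The function names count_file and density are made up for illustration. The sketch assumes every file starts with the same header row, that a field made only of whitespace counts as missing, and that the files are UTF-8 encoded.

```python
import csv
from collections import Counter


def count_file(path):
    """Map step: return (header, row_count, non-empty count per column) for one file."""
    filled = Counter()
    rows = 0
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: treat quote characters as ordinary data (assumption about the files)
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header is None:          # completely empty file
            return [], 0, filled
        for row in reader:
            if not row:             # skip blank lines
                continue
            rows += 1
            # zip stops at the shorter sequence, so fields missing at the
            # end of a short row are simply not counted as filled
            for name, value in zip(header, row):
                if value.strip():
                    filled[name] += 1
    return header, rows, filled


def density(paths):
    """Reduce step: merge the per-file counts and divide by the total row count."""
    total_rows = 0
    total_filled = Counter()
    columns = []
    for path in paths:
        header, rows, filled = count_file(path)
        for name in header:
            if name not in columns:
                columns.append(name)
        total_rows += rows
        total_filled.update(filled)
    if total_rows == 0:
        return {}
    return {name: total_filled[name] / total_rows for name in columns}
```

Because each file is read once and only a handful of counters are kept in memory, memory use should stay flat regardless of file size. The per-file counts are independent of each other, so count_file could probably be run in parallel (for example with multiprocessing.Pool.map) or used as the mapper in a MapReduce job, with the summing done in the reducer.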