I would like to read lines from a text file and build a distance matrix based on Wu-Palmer distance between the words. Eg:
House Grass Boat Cat
House x y .. ..
Grass x1 y1 .. ..
Boat x2 y2 .. ..
Cat x3 y3 .. ..
I would like to know if there are any existing functions I can use in Python to read lines from a text file and output them as the rows and columns of the distance matrix?
If your input is simply whitespace-delimited words then you can easily iterate through them like this:
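A minimal sketch of that loop, assuming a UTF-8 text file; the helper name read_words and the path you pass in are placeholders of mine, not anything standard:

```python
def read_words(path):
    """Collect the unique whitespace-delimited words found in a text file."""
    words = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # str.split() with no arguments splits on any run of whitespace
            # and drops the trailing newline along the way.
            words.update(line.split())
    return words
```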
The use of a set ensures that each word is only ever recorded once, which sounded like what you were after.
If your input is running English text then things become a little harder because you want to catch things like “I’d”. You should also decide whether to class hyphenated words (e.g. “part-time”) as a single word; my example here does, but it’s easy to change. Much as I’m not a fan of them, this is somewhere where regular expressions are actually quite useful:
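Something along these lines should do it. The names WORD_RE and extract_words are just illustrative, and I've assumed plain ASCII apostrophes in the input (a typographic ’ would need adding to the character class):

```python
import re

# A word starts with a letter and continues with letters, digits,
# underscores, apostrophes or hyphens. The hyphen is placed last in the
# character class so that it is taken literally rather than as a range.
# The leading \b stops the pattern from matching the tail of a token
# such as "2nd".
WORD_RE = re.compile(r"\b[a-zA-Z][a-zA-Z0-9_'-]*")


def extract_words(text):
    """Return the set of unique words found in a block of running text."""
    return set(WORD_RE.findall(text))
```

To treat hyphenated words as separate words instead, drop the hyphen from the character class.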
This will create a set of words, where a word is any group of one or more characters from the set [a-zA-Z0-9_'-] whose first character is a letter.
After this, you can calculate the distance between each pair of words easily:
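Here is a rough sketch of that step. I'm assuming NLTK with its WordNet corpus is installed, that "distance" means 1 minus the Wu-Palmer similarity, and that each word's first synset is a good enough stand-in for the word; all three are my assumptions, and the function names are made up for the example:

```python
from itertools import combinations_with_replacement


def wup_distance(word_a, word_b):
    """Assumed metric: 1 - Wu-Palmer similarity of each word's first synset."""
    # Imported lazily so the rest of the sketch works without NLTK.
    from nltk.corpus import wordnet

    synsets_a = wordnet.synsets(word_a)
    synsets_b = wordnet.synsets(word_b)
    if not synsets_a or not synsets_b:
        return None  # at least one word is not in WordNet
    similarity = synsets_a[0].wup_similarity(synsets_b[0])
    return None if similarity is None else 1.0 - similarity


def build_distances(words, distance=wup_distance):
    """Build nested dictionaries so that distances[a][b] is the distance."""
    distances = {word: {} for word in words}
    # Each unordered pair is computed once (including a word with itself)
    # and stored in both directions, which assumes the metric is symmetric.
    for a, b in combinations_with_replacement(sorted(words), 2):
        value = distance(a, b)
        distances[a][b] = value
        distances[b][a] = value
    return distances
```

Passing the metric in as an argument means you can swap in a different way of choosing synsets without touching the loop.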
There’s probably a cleaner data structure than the nested dictionaries here, but it’s simple enough that I think that would suffice.
Finally, you can output a tab-delimited matrix something like this:
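A possible version, assuming the nested-dictionary layout described above (distances[row][column]); format_matrix is a name I've picked for illustration:

```python
def format_matrix(distances):
    """Render nested distance dictionaries as a tab-delimited matrix."""
    words = sorted(distances)
    # Header row: an empty corner cell followed by the column labels.
    lines = ["\t".join([""] + words)]
    for row in words:
        cells = [str(distances[row][col]) for col in words]
        lines.append("\t".join([row] + cells))
    return "\n".join(lines)
```

You can then print the result or write it to a file.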
If you wanted something closer to a pretty-formatted output matrix (i.e. where each column is automatically sized according to its contents) then that’s still not hard per se, but it’s a little fiddly and requires rather more code.
As an aside, if you want to read or write files in CSV format then take a look at the Python csv module; it handles tedious things like quoting for you.
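For instance, a small sketch that writes the same matrix as CSV text. It again assumes the nested-dictionary layout, and matrix_to_csv is a hypothetical helper name:

```python
import csv
import io


def matrix_to_csv(distances):
    """Return the distance matrix as CSV text, quoted where necessary."""
    words = sorted(distances)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # Header row: an empty corner cell followed by the column labels.
    writer.writerow([""] + words)
    for row in words:
        writer.writerow([row] + [distances[row][col] for col in words])
    return buffer.getvalue()
```

To write straight to a file, pass csv.writer a file opened with newline="" instead of the StringIO buffer.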
Was that the sort of thing you were after?