I have a method that takes addresses from the web, so there are many known errors like:
123 Awesome St, Pleasantville, NY, Get Directions
Which I want to be:
123 Awesome St, Pleasantville, NY
Is there a web service or Python library that can help with this? It’s fine for us to start creating a list of items like “, Get Directions” or a more generalized version of that, but I thought there might be a helper library for this kind of textual analysis.
If the address contains one of those bad strings, walk backwards till you find another non-whitespace character. If the character is one of your separators, say “,” or “:”, drop everything from that character onwards. If it’s a different character, drop everything after that character.

Make a list of known bad strings. Then, you could take that list and use it to build a gigantic regex and use re.sub().

This is a naive solution, and isn’t going to be particularly performant, but it does give you a clean way of adding known bad strings, by adding them to a file called .badstrings or similar and building the list from them.

Note that if you make bad choices about what these bad strings are, you will break the algorithm. But it should work for the simple cases you describe in the comments.
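For illustration, here is a rough sketch of both ideas: one big regex fed to re.sub(), and the walk-backwards approach. Everything named here is an assumption made for the example: the BAD_STRINGS list (including its second entry), the SEPARATORS string, and the function names build_pattern, clean_with_regex, and clean_by_walking_back. It also assumes each address is a single line and that everything after a bad string can be thrown away.

```python
import re

# Hypothetical list of known bad strings. In practice this could be read
# from a file such as .badstrings, one entry per line.
BAD_STRINGS = ["Get Directions", "View Map"]

# Assumed separators that may sit between the address and a bad string.
SEPARATORS = ",:"


def build_pattern(bad_strings, separators=SEPARATORS):
    """Compile one regex: optional separator, a bad string, then the rest of the line."""
    if not bad_strings:
        # An empty alternation would match everywhere and wipe the whole address.
        raise ValueError("bad_strings must not be empty")
    alternatives = "|".join(re.escape(s) for s in bad_strings)
    separator_class = "[" + re.escape(separators) + "]"
    return re.compile(
        r"\s*" + separator_class + r"?\s*(?:" + alternatives + r").*$",
        re.IGNORECASE,
    )


def clean_with_regex(address, pattern):
    """Remove the bad string, any separator right before it, and everything after it."""
    return pattern.sub("", address).strip()


def clean_by_walking_back(address, bad_strings, separators=SEPARATORS):
    """Same cleanup without a regex (case-sensitive, unlike the pattern above)."""
    for bad in bad_strings:
        idx = address.find(bad)
        if idx == -1:
            continue
        # Walk backwards over whitespace to the previous non-whitespace character.
        end = idx
        while end > 0 and address[end - 1].isspace():
            end -= 1
        # If that character is a separator, drop it too; otherwise keep it.
        if end > 0 and address[end - 1] in separators:
            end -= 1
        address = address[:end]
    return address.rstrip()


if __name__ == "__main__":
    raw = "123 Awesome St, Pleasantville, NY, Get Directions"
    print(clean_with_regex(raw, build_pattern(BAD_STRINGS)))
    print(clean_by_walking_back(raw, BAD_STRINGS))
```

For the example address in the question, both functions should return 123 Awesome St, Pleasantville, NY. An address that contains none of the bad strings passes through unchanged.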
EDIT: Something like this is what I mean:
which outputs: