How to replace unicode values using re in Python ?
I’m looking for something like this:
line.replace('Ã','')
line.replace('¢','')
line.replace('â','')
Or is there any way which will replace all the non-ASCII characters from a file. Actually I converted PDF file to ASCII, where I’m getting some non-ASCII characters [e.g. bullets in PDF]
Please help me.
Edit after feedback in comments.
Another solution would be to check the numeric value of each character and see if they are under 128, since ascii goes from 0 – 127. Like so:
Here’s an altered version of
jd‘s answer with benchmarks:Output first solution using a
strstring as input:Output first solution using a
unicodestring as input:Output second solution using a
strstring as input:Output second solution using a
unicodestring as input:Conclusion
Encoding is the faster solution and encoding the string is less code; Thus the better solution.