Is there a standard/common way to give compiler-style error messages that point to a line and column when the input is in a Unicode format?
For example, a very common format for compiler error messages is:
“filename:line_number:column_number: error message”, e.g.:
- (From GCC):
bad.c:1:10: syntax error, unexpected STRING
- (From a custom tool):
input.dat:45:3: expected String_Literal, found ';'
This is unambiguous when the input uses a fixed-width 8-bit encoding, such as ISO-8859-1. But when the input is Unicode (UTF-8, UTF-16, etc.), what does (or should) “column” mean? Which byte? Which code point? Which grapheme? Is there any tool that sets a precedent by choosing one of these?
A column should count non-combining Unicode code points. Both halves of a surrogate pair (in UTF-16) should share a column, and a combining diacritical mark should share a column with the base character it modifies. The same may apply to other non-spacing code points as well.
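As a rough illustration of that rule, here is a minimal Python sketch. The function name `display_column` and the choice to treat general categories `Mn` and `Me` as "non-spacing" are my own assumptions, not something taken from GCC or any other tool. Python strings are already sequences of code points, so a surrogate pair never shows up as two separate items here; in a UTF-16 based language you would have to merge the pair yourself.

```python
import unicodedata

# Assumption: "non-spacing" means general category Mn (nonspacing mark)
# or Me (enclosing mark). Other tools may draw this line differently.
NON_SPACING_CATEGORIES = {"Mn", "Me"}


def display_column(line: str, index: int) -> int:
    """Return the 1-based column of the code point at line[index].

    Non-spacing code points share the column of the preceding base
    character. A non-spacing code point at the very start of the line
    has no base to attach to, so it is given column 1.
    """
    column = 0
    for ch in line[: index + 1]:
        is_non_spacing = unicodedata.category(ch) in NON_SPACING_CATEGORIES
        if column == 0 or not is_non_spacing:
            column += 1
    return column


if __name__ == "__main__":
    line = "e\u0301x = 1"  # 'e' + COMBINING ACUTE ACCENT, then "x = 1"
    # The accent shares column 1 with 'e', so 'x' lands in column 2.
    print(f"input.dat:1:{display_column(line, 2)}: unexpected 'x'")
```

This does not handle tabs, double-width East Asian characters, or multi-code-point emoji sequences, so treat it as a starting point rather than a complete answer to the grapheme question.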