In a recent question it was noted that on OSX running sed on a

Question


Asked: May 16, 2026 (2026-05-16T03:59:24+00:00)


In a recent question it was noted that on OS X, running sed on a non-ASCII file gave strange results. For instance, suppose you run the following (/usr/bin/cal is just an arbitrary binary file):

sed 's/[^A-Z]//g' /usr/bin/cal

sed will remove all of the printable characters other than A-Z, but many non-printable characters remain. If, however, you run

LANG='' sed 's/[^A-Z]//g' /usr/bin/cal

only A-Z (and newlines) are output. Why?

Normally LANG=en_US.UTF-8. What is going on? I cannot see any way that the output of sed could be considered correct in UTF-8. Is it broken, or is there some notion of "working" that I do not understand?

I know that the OS X sed conforms to POSIX and is therefore different from the beloved GNU sed.


1 Answer

Editorial Team · Answer 1 · 2026-05-16T03:59:25+00:00

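Binary data, such as the contents of /usr/bin/cal, is not UTF-8, so it will confuse any code that reads it as if it were. In particular, any byte with the high bit set (that is, a value >= 128) is interpreted as part of a multi-byte sequence representing a single character, so a run of such bytes that happens to form a valid sequence is matched by [^A-Z] as one character and elided from the output. Not every sequence of high-bit bytes is valid UTF-8, however, so things get quite confused: bytes that cannot be decoded as a character probably do not match the pattern at all and pass through unchanged. That would explain why some non-printable characters remain but (possibly) not others. With LANG set to the empty string, and assuming LC_ALL and LC_CTYPE are not set, sed presumably falls back to the byte-oriented C locale, in which every byte is a single character, so everything outside A-Z is removed.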
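If you do need to run sed over bytes that are not valid UTF-8, one option is to force the C locale for that single command. The following is a minimal sketch, not a definitive recipe: the sample bytes and the temporary file are made up for illustration, and it uses LC_ALL rather than LANG because LC_ALL takes precedence over LANG and LC_CTYPE if those happen to be set.

```shell
# Build a small sample: ASCII letters plus two bytes with the high bit set.
# Octal \377 and \200 are not valid UTF-8 on their own (illustrative data only).
sample=$(mktemp)
printf 'AB\377cd\200EF\n' > "$sample"

# Force the byte-oriented C locale for this one command. Every byte is then a
# single character, so [^A-Z] matches the high-bit bytes as well. The g flag
# applies the substitution to every match on the line, not just the first.
result=$(LC_ALL=C sed 's/[^A-Z]//g' "$sample")
echo "$result"

rm -f "$sample"
```

Only the uppercase letters should survive here. Running the same sed command without LC_ALL=C in a UTF-8 locale may leave the two high-bit bytes in place, depending on the sed implementation.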

In short: if you want to use text-oriented tools on binary data, don’t.
