I was wondering how stackoverflow parses all sorts of different code and identifies keywords,

Question 1

I was wondering how Stack Overflow parses all sorts of different code and identifies keywords, special characters, whitespace formatting, etc. It seems to do this for most languages, and I’ve noticed it is even sophisticated enough to understand how the pieces it parses relate to each other, like so:

String mystring1 = "inquotes"; //incomment
String mystring2 = "inquotes//incomment";
String mystring3 = //incomment"inquotes";

Many IDEs do this also. How is this done?

Edit: Further explanation: I am not asking about the parsing of the text itself. My question is about what comes after that part: is there something like a universal XML Schema, or a cross-language format hierarchy, that describes which strings are keywords and which characters denote comments, text strings, logical operators, etc.? Or must I become a syntax guru for every language I wish to parse accurately?

Editorial Team · Answer 1 · 2026-05-16T08:38:29+00:00

To really have your IDE/compiler/interpreter “understand” and colorize code, you’ll need to parse it and pull out the different syntactic parts. The classic reference for this is the Dragon Book, “Compilers: Principles, Techniques, and Tools.” You can see some of the difficulty in constructs like this:

i+++++i;

or

list<list<hash<list<int>,hash<int,<list>>>>>;
//or just matching parens
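
As a rough illustration of why those two constructs are awkward, here is a minimal sketch of a greedy ("maximal munch") tokenizer in Python. The `OPERATORS` table and the `tokenize` function are hypothetical names made up for this example, and the operator set is deliberately tiny; it is not how any particular compiler or the SO highlighter is implemented.

```python
# Operators to recognize; longer spellings come first so the scanner
# always takes the longest match (the "maximal munch" rule).
OPERATORS = ["++", ">>", "+", "<", ">"]


def tokenize(text):
    """Split text into identifiers/numbers, known operators, and single characters."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isalnum() or ch == "_":
            # Consume a whole identifier or number.
            j = i
            while j < len(text) and (text[j].isalnum() or text[j] == "_"):
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            for op in OPERATORS:
                if text.startswith(op, i):
                    tokens.append(op)
                    i += len(op)
                    break
            else:
                # Anything else becomes a one-character token.
                tokens.append(ch)
                i += 1
    return tokens


print(tokenize("i+++++i;"))         # ['i', '++', '++', '+', 'i', ';']
print(tokenize("list<list<int>>"))  # ['list', '<', 'list', '<', 'int', '>>']
```

The greedy scanner reads `i+++++i` as `i ++ ++ + i`, never as `i++ + ++i`, and it fuses the two closing angle brackets into a single `>>` token. Deciding what those tokens actually mean is left to the parser, which is where the real work starts.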

Doing this properly is a hard problem. Some languages, like Java, make this easier than others, such as C and C++ (which both have standards) or Ruby (which for a long time had no formal spec and relied on the implementation as the spec). However, if you only want to do a few bits of highlighting, you can skip large parts of the grammar and get an 80% solution much more easily. I suspect that the SO engine knows about strings and a few different types of comments, and that this does well enough for its purpose.
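
A minimal sketch of what such an 80% solution might look like, in Python: a small state machine that only knows about double-quoted strings and `//` line comments. The function name `classify` and the span labels are assumptions invented for this example; this is a guess at the general approach, not the actual SO engine.

```python
def classify(line):
    """Split one line into (kind, text) spans: 'code', 'string', or 'comment'."""
    spans = []
    kind, start, i = "code", 0, 0
    while i < len(line):
        ch = line[i]
        if kind == "code":
            if ch == '"':
                # A quote outside a string opens a string literal.
                if i > start:
                    spans.append(("code", line[start:i]))
                kind, start = "string", i
            elif line.startswith("//", i):
                # Outside a string, // turns the rest of the line into a comment.
                if i > start:
                    spans.append(("code", line[start:i]))
                spans.append(("comment", line[i:]))
                return spans
        else:  # inside a string literal
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                spans.append(("string", line[start:i + 1]))
                kind, start = "code", i + 1
        i += 1
    if start < len(line):
        # Trailing code, or an unterminated string.
        spans.append((kind, line[start:]))
    return spans


for sample in (
    'String mystring1 = "inquotes"; //incomment',
    'String mystring2 = "inquotes//incomment";',
    'String mystring3 = //incomment"inquotes";',
):
    print(classify(sample))
```

Run on the three lines from the question, this treats `//` inside the quotes as part of the string and treats the quotes after `//` as part of the comment, which is all the question's example needs. It knows nothing about keywords, block comments, character literals, or multi-line constructs.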

The gap between 80% and 100% is one reason that most IDEs have syntax highlighting for C++, but C++ refactoring support has historically lagged far behind (Visual C++ shipped without it for years). For highlighting, a few mistakes are probably OK. When you’re refactoring, you need to really understand variable scope across different namespaces, plus all sorts of pointer-related details.
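
To make the scope point concrete, here is a small sketch in Python rather than C++, since Python's standard `ast` module makes the contrast short to show. It compares a purely textual rename with one that is restricted to a single function. `SOURCE`, `naive_rename`, `RenameInFunction`, and `scoped_rename` are hypothetical names for this example, and `ast.unparse` assumes Python 3.9 or later.

```python
import ast
import re

SOURCE = '''
def total(items):
    count = 0
    for item in items:
        count += item
    return count

def length(items):
    count = 0
    for _ in items:
        count += 1
    return count
'''


def naive_rename(source, old, new):
    """Textual rename: touches every matching word, regardless of scope."""
    return re.sub(rf"\b{re.escape(old)}\b", new, source)


class RenameInFunction(ast.NodeTransformer):
    """Rename a variable only inside the function called func_name."""

    def __init__(self, func_name, old, new):
        self.func_name, self.old, self.new = func_name, old, new
        self.inside = False

    def visit_FunctionDef(self, node):
        if node.name == self.func_name:
            self.inside = True
            self.generic_visit(node)
            self.inside = False
        return node  # other functions are returned untouched

    def visit_Name(self, node):
        if self.inside and node.id == self.old:
            node.id = self.new
        return node


def scoped_rename(source, func_name, old, new):
    tree = RenameInFunction(func_name, old, new).visit(ast.parse(source))
    return ast.unparse(tree)


print(naive_rename(SOURCE, "count", "running_sum"))
print(scoped_rename(SOURCE, "total", "count", "running_sum"))
```

The textual version renames `count` in both functions, while the tree-based version changes only `total` and leaves `length` alone. Even this is far from a real refactoring tool: it ignores nested scopes, globals, and attributes, and `ast.unparse` discards comments and the original formatting.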
