What would be the best way to detect what programming language is used in a snippet of code?
I think that the method used in spam filters would work very well. You split the snippet into words. Then you compare the occurrences of these words with their occurrences in known snippets, and compute, for every language X you're interested in, the probability that the snippet is written in X. The language with the highest probability is your guess.
http://en.wikipedia.org/wiki/Bayesian_spam_filtering
If you have the basic mechanism then it’s very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.
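To make the idea concrete, here is a minimal sketch of such a detector. It is not the code from my forum software: the class name `LanguageDetector`, the identifier-only tokenizer, the add-one smoothing and the assumption that every language is equally likely up front are all my own choices for illustration, and the two training snippets are tiny made-up examples.

```python
import math
import re
from collections import Counter, defaultdict


class LanguageDetector:
    """Naive-Bayes-style language guesser (illustrative sketch)."""

    def __init__(self):
        # language -> Counter of how often each word was seen in training
        self.word_counts = defaultdict(Counter)
        # language -> total number of words seen in training
        self.total_words = Counter()

    @staticmethod
    def tokenize(code):
        # Assumption: only identifier-like words matter; punctuation is ignored.
        return re.findall(r"[A-Za-z_]\w*", code)

    def train(self, language, code):
        words = self.tokenize(code)
        self.word_counts[language].update(words)
        self.total_words[language] += len(words)

    def scores(self, snippet):
        # Vocabulary size is needed for add-one (Laplace) smoothing.
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        words = self.tokenize(snippet)
        result = {}
        for language, counts in self.word_counts.items():
            denom = self.total_words[language] + len(vocab)
            # Sum of log-probabilities; a uniform prior over languages is assumed.
            result[language] = sum(
                math.log((counts[w] + 1) / denom) for w in words
            )
        return result

    def detect(self, snippet):
        scores = self.scores(snippet)
        return max(scores, key=scores.get)


detector = LanguageDetector()
detector.train("C#", 'using System; class Program { static void Main() '
                     '{ System.Console.WriteLine("hi"); } }')
detector.train("Ruby", 'def greet(name)\n  puts "hi #{name}"\nend')

print(detector.detect('puts "hello"'))
print(detector.detect('System.Console.WriteLine("x")'))
```

Adding a language is just another `train` call with some code in that language. With real training data you would feed it whole files rather than one-liners.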
I’ve actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases:
Let me find the code.
I couldn’t find the code so I made a new one. It’s a bit simplistic but it works for my tests. Currently if you feed it much more Python code than Ruby code it’s likely to say that this code:
is Python code (although it really is Ruby). This is because Python has a "def" keyword too. So if it has seen 1000x "def" in Python and 100x "def" in Ruby then it may still say Python even though "puts" and "end" are Ruby-specific. You could fix this by keeping track of the number of words seen per language and dividing by that somewhere (or by feeding it equal amounts of code in each language).
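Here is a small sketch of that imbalance and of the "divide by the words seen per language" fix. The scoring is deliberately simplified (it just adds up counts, which may not be exactly what the detector does), and apart from the 1000 vs. 100 "def" figures all the counts and the word list are made-up numbers for illustration.

```python
from collections import Counter

# Hypothetical training counts: far more Python than Ruby was fed in.
counts = {
    "Python": Counter({"def": 1000, "return": 800, "import": 400}),
    "Ruby": Counter({"def": 100, "puts": 60, "end": 120}),
}


def raw_score(words, language):
    # Adds up raw occurrence counts, so the bigger corpus dominates.
    return sum(counts[language][w] for w in words)


def normalized_score(words, language):
    # Divides by the total words seen for that language, so each count
    # becomes a relative frequency that is comparable across languages.
    total = sum(counts[language].values())
    return sum(counts[language][w] / total for w in words)


def guess(words, score):
    return max(counts, key=lambda language: score(words, language))


snippet_words = ["def", "puts", "end"]  # words from a Ruby-looking snippet

print(guess(snippet_words, raw_score))         # Python
print(guess(snippet_words, normalized_score))  # Ruby
```

With raw counts, the 1000 "def" occurrences in Python outweigh everything Ruby has (100 + 60 + 120 = 280). After dividing by the per-language totals, "puts" and "end" get the weight they deserve and the guess flips to Ruby.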