Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 1064047
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 16, 20262026-05-16T18:52:20+00:00 2026-05-16T18:52:20+00:00

this is a follow up to my recent question ( Code for identifying programming

  • 0

this is a follow up to my recent question ( Code for identifying programming language in a text file ). I’m really thankful for all the answers I got, it helped me very much. My code for this task is complete and it works fairly well – quick and reasonably accurate.

The method i used is the following: i have a “learning” perl script that identifies most frequently used words in a language by doing a word histogram over a set of sample files. These data are then loaded by the c++ program which then checks the given text and accumulates score for each language based on found words and then simply checks which language accumulated the highest score.

Now i would like to make it even better and work a bit on the quality of identification. The problem is I often get “unknown” as result (many languages accumulate a small score, but none anything bigger than my threshold). After some debugging, research etc i found out that this is probably due to the fact, that all words are considered equal. This means that seeing a “#include” for example has the same effect as seeing a “while” – both of which indicate that it might be c/c++ (i’m now ignoring the fact that “while” is used in many other languages), but of course in larger .cpp files there might be a ton of “while” but most of the time only a few “#include”.

So the fact that a “#include” is more important is ignored, because i could not come up with a good way how to identify if a word is more important than another. Now bear in mind that the script which creates the data is fairly stupid, its only a word histogram and for every chosen word it assigns a score of 1. It does not even look at the words (so if there is a “#&|?/” in a file very often it might get chosen as a good word).

Also i would like to have the data creation part fully automated, so nobody should have to look at the data and alter them, change scores, change words etc. All the “brainz” should be in the script and the cpp program.

Does somebody have a suggestion how to identify keywords, or more generally, important words? Some things that might help: i have the number of occurences of each word and the number of total words (so a ratio may be calculated). I have also thought about wiping out characters like ;, etc. since the histogram script often puts for example “continue;” in the result, but the important word is “continue”. Last note: all checks for equality are done for exact match – no substrings, case sensitive. This is mainly because of speed, but substrings might help (or hurt, i dont know)…

NOTE: thanks all who bothered to answer, you helped me a lot.

My work with this is almost finished so i will describe what i did to get good results.

1) Get a decent training set, about 30-50 files per language from various sources to avoid coding style bias
2) Write a perl script that does a word histogram. Implement blacklist and whitelist (more about it below)
3) add bogus words to blacklist, like “license”, “the” etc. These are often found at the start of file in license information.
4) add about five most important words per language to the whitelist. These are words that are found in most source code of a given language, but are not frequent enough to get into the histogram. For example for C/C++ i had: #include, #define, #ifdef, #ifndef and #endif in the whitelist.
5) Emphasize the start of a file, so give more points to words found in the first 50-100 lines
6) when doing the word histogram, tokenize the file using @words = split(/[\s\(\){}\[\];.,=]+/, $_); This should be ok for most languages i think (gives me the best results). For each language, have about 10-20 most frequent words in the final results.
7) When the histogram is complete, remove all words that are found in the blacklist and add all those that are found in the whitelist
8) Write a program which processes a text file in the same way as the script – tokenize using the same rules. If a word is found in the histogram data, add points to the right language. Words in the histogram which correspond to only one language should add more points, those which belong to multiple languages should add less.

Comments are welcome. Currently on about 1000 text files i get 80 unknowns (mostly on extremely short files – mainly javascript with just one or two lines). About 20 files are recognized wrong. Size of the files is about 11kB ranging from 100 bytes to 100 kBytes (almost 11MB total). It takes one second to process them all, which is good enough for me.

  • 1 1 Answer
  • 1 View
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-16T18:52:20+00:00Added an answer on May 16, 2026 at 6:52 pm

    I think you’re approaching this from the wrong viewpoint. From your description, it sounds like you are building a classifier. A good classifier needs to discriminate between different classes; it doesn’t need to precisely estimate the correspondence between the input and the most likely class.

    Practically: your classifier doesn’t need to assess precisely how close to C++ a certain input is; it merely needs to determine if the input is more like C than C++. This makes your work a lot easier – most of your current “unknown” cases will be close to one or two languages, even though they don’t exceed your basic threshold.

    Now, once you realize this, you will also see what training your classifier needs: not some random aspect of the sample files, but what sets two languages apart. Hence, when you have parsed your C samples, and your C++ samples, you will see that #include does not set them apart. However, class and template will be far more common in C++. On the other hand, #include does distinguish between C++ and Java.

    There are of course other aspects besides keywords that you can use. For instance, the most obvious would be the frequency of {, and ; is similarly distinguishing. Another very useful feature for your classifier would be the comment tokens for the different languages. The basic problem of course would be automatically identifying them. Again, hardcoding //, /*, ', --, # and ! as pseudo-keywords would help.

    This also identifies another classification rule: SQL will often have -- at the beginning of a line, whereas in C it will often appear somewhere else. Thus it may be useful for your classifier to take the context into account as well.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

This is a follow up question to a previous post on saving recent searches
I have this follow code in my javascript. I call this function when the
I really don't understand this. I downloaded the recent version of xCode 3.2.5 and
This question is a follow-up to my previous question , where I tried to
As a follow-up to my recent question about .NET Compact Framework debugging, I am
This is sort of a follow-up to this question . I want to know
This question is a continuation of the same effort to code my first cte
Hello guys in my code i'm getting this stack: Traceback (most recent call last):
I follow this tutorial online exactly but somehow it's giving me errors. Saying there
I follow this rule but some of my colleagues disagree with it and argue

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.