Editorial Team
Asked: May 24, 2026


I want to split a C file into tokens, not for compiling but for analysis. I feel like this should be pretty straightforward, and I tried looking online for a ready-made tokens.l (or something similar) file for flex with all the C grammar already defined, but couldn’t find anything. Are there any such grammars floating around, or am I going about this all wrong?

    Editorial Team
    Added an answer on May 24, 2026 at 1:59 pm

    Yes, there’s at least one around.

    Edit:

    Since there are a few issues that one doesn’t handle, it may be worth looking at some hand-written lexing code I wrote several years ago. It basically handles only phases 1, 2 and 3 of translation. If you define DIGRAPH, it also turns on some code to translate C++ digraphs. If memory serves, it does that earlier in translation than it really should, but you probably don’t want it in any case. It does not even attempt to recognize anywhere close to all tokens; mostly it separates the source into comments, character literals, string literals, and pretty much everything else. On the other hand, it does handle trigraphs, line splicing, etc.
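    To make the first two phases concrete, here is a minimal sketch that applies trigraph replacement (phase 1) and line splicing (phase 2) to an in-memory string rather than a file. The names `trigraph_value` and `phases_1_and_2` are made up for this illustration and are not part of the code in this answer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Maps the third character of a trigraph to its replacement, or returns 0
 * if that character does not complete a trigraph. */
static char trigraph_value(char c) {
    switch (c) {
        case '(':  return '[';
        case ')':  return ']';
        case '/':  return '\\';
        case '\'': return '^';
        case '<':  return '{';
        case '>':  return '}';
        case '!':  return '|';
        case '-':  return '~';
        case '=':  return '#';
        default:   return 0;
    }
}

/* Hypothetical helper: copies `src` to `dst`, replacing trigraphs (phase 1)
 * and then deleting backslash-newline pairs (phase 2).  `dst` needs room for
 * strlen(src) + 1 bytes, because the output is never longer than the input. */
void phases_1_and_2(const char *src, char *dst) {
    size_t n, i = 0, o = 0;

    assert(src != NULL && dst != NULL);
    n = strlen(src);

    /* Phase 1: trigraph replacement, copying into dst. */
    while (i < n) {
        char t = 0;
        if (src[i] == '?' && i + 2 < n && src[i + 1] == '?')
            t = trigraph_value(src[i + 2]);
        if (t) {
            dst[o++] = t;
            i += 3;
        } else {
            dst[o++] = src[i++];
        }
    }
    dst[o] = '\0';

    /* Phase 2: line splicing, done in place on the phase 1 result. */
    n = o;
    i = 0;
    o = 0;
    while (i < n) {
        if (dst[i] == '\\' && i + 1 < n && dst[i + 1] == '\n')
            i += 2;
        else
            dst[o++] = dst[i++];
    }
    dst[o] = '\0';
}
```

    Running the two phases as separate passes keeps the order explicit: a trigraph that produces a backslash right before a newline still gets spliced, because splicing only sees the phase 1 result.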

    I suppose I should also add that this leaves conversion of the platform’s line-ending character into a new-line to the underlying implementation by opening the file in translated (text) mode. Under most circumstances, that’s probably the right thing to do, but if you want to produce something like a cross-compiler where your source files have a different line-ending sequence than is normal for this host, you might have to change that.
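    If you do need to handle foreign line endings yourself, one possible approach is to open the file in binary mode and normalize as you read. The sketch below assumes the only sequences of interest are "\r\n" and a lone '\r'; `read_normalized` is a hypothetical name, not something the code in this answer defines.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one character from a stream opened in binary
 * mode ("rb") and maps both "\r\n" and a lone '\r' to '\n'. */
int read_normalized(FILE *fp) {
    int ch;

    assert(fp != NULL);
    ch = fgetc(fp);

    if (ch == '\r') {
        int next = fgetc(fp);
        /* A lone CR: the character after it belongs to the next line,
         * so put it back. */
        if (next != '\n' && next != EOF)
            ungetc(next, fp);
        return '\n';
    }
    return ch;
}
```

    Something like this could stand in for the `fgetc` call in `read_char`, with the file opened using "rb" instead of "r".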

    First the header that defines the external interface to all this stuff:

    /* get_src.h */   
    #ifndef GET_SRC_INCLUDED
    #define GET_SRC_INCLUDED
    
    #include <stdio.h>
    
    #ifdef __cplusplus
    extern "C" {
    #endif
    
    /* This is the size of the largest token we'll attempt to deal with.  If
     * you want to deal with bigger tokens, change this, and recompile
     * get_src.c.  Note that an entire comment is treated as a single token,
     * so long comments could overflow this.  In case of an overflow, the
     * entire comment will be read as a single token, but the part larger
     * than this will not be stored.
     */
    #define MAX_TOKEN_SIZE 8192
    
    /* `last_token' will contain the text of the most recently read token (comment,
     * string literal, or character literal).
     */
    extern char last_token[];
    
    /* This is the maximum number of characters that can be put back into a
     * file opened with parse_fopen or parse_ffopen.
     */
    #define MAX_UNGETS 5
    
    #include <limits.h>
    
    typedef struct {
        FILE *file;
        char peeks[MAX_UNGETS];
        int last_peek;
    } PFILE;
    
    /* Some codes we return to indicate having found various items in the
     * source code.  ERROR is returned to indicate a newline found in the
     * middle of a character or string literal, a file that ends inside a
     * comment, or a character literal that contains more than two characters.
     *
     * Note that this starts at INT_MIN, the most negative number available
     * in an int.  This keeps these symbols from conflicting with any
     * characters read from the file.  However, one of these could
     * theoretically conflict with EOF.  EOF is usually -1, and these are far
     * more negative than that.  However, officially EOF can be any value
     * less than 0...
     */
    enum {
        ERROR = INT_MIN,
        COMMENT,
        CHAR_LIT,
        STR_LIT
    };
    
    /* Opens a file for parsing and returns a pointer to a structure which
     * can be passed to the other functions in the parser/lexer to identify
     * the file being worked with.
     */
    PFILE *parse_fopen(char const *name);
    
    /* This corresponds closely to fdopen - it takes a FILE * as its
     * only parameter, creates a PFILE structure identifying that file, and
     * returns a pointer to that structure.
     */
    PFILE *parse_ffopen(FILE *stream);
    
    /* Corresponds to fclose.
     */
    int parse_fclose(PFILE *stream);
    
    /* Returns characters from `stream' read as C source code.  String
     * literals, character literals and comments are each returned as a
     * single code from those above.  Each run of whitespace within a line
     * is returned as a single space character; a newline is returned as
     * '\n'.
     */
    int get_source(PFILE *stream);
    
    /* Basically, these two work just like the normal versions of the same,
     * with the minor exception that unget_character can unget more than one
     * character.
     */
    int get_character(PFILE *stream);
    void unget_character(int ch, PFILE *stream);
    
    #ifdef __cplusplus
    }
    #endif
    
    #endif
    

    And then the implementation of all that:

    /* get_src.c */
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    #include <stdlib.h>
    
    #define GET_SOURCE
    #include "get_src.h"
    
    static size_t current = 0;
    
    char last_token[MAX_TOKEN_SIZE];
    
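    PFILE *parse_fopen(char const *name) {
    
        PFILE *temp = malloc(sizeof(PFILE));
    
        if ( NULL != temp ) {
            temp->file = fopen(name, "r");
            if ( NULL == temp->file ) {
                /* report a failed open the same way as a failed allocation */
                free(temp);
                return NULL;
            }
            memset(temp->peeks, 0, sizeof(temp->peeks));
            temp->last_peek = 0;
        }
        return temp;
    }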
    
    PFILE *parse_ffopen(FILE *file) {
    
        PFILE *temp = malloc(sizeof(PFILE));
    
        if ( NULL != temp) {
            temp->file = file;
            memset(temp->peeks, 0, sizeof(temp->peeks));
            temp->last_peek = 0;
        }
        return temp;
    }
    
    int parse_fclose(PFILE *stream) {
    
        int retval = fclose(stream->file);
    
        free(stream);
        return retval;
    }
    
    static void addchar(int ch) {
    /* adds the passed character to the end of `last_token' */
    
        if ( current < sizeof(last_token) -1 )
            last_token[current++] = (char)ch;
    
        if ( current == sizeof(last_token)-1 )
            last_token[current] = '\0';
    }
    
    static void clear(void) {
    /* clears the previous token and starts building a new one. */
        current = 0;
    }
    
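    static int read_char(PFILE *stream) {
        /* convert through unsigned char so bytes above 127 can't come back
         * as negative values and be mistaken for EOF or a token code */
        if ( stream->last_peek > 0 )
            return (unsigned char)stream->peeks[--stream->last_peek];
        return fgetc(stream->file);
    }
    
    void unget_character(int ch, PFILE * stream) {
        /* like ungetc, pushing back EOF is a no-op; the next read reaches
         * end of file again on its own */
        if ( EOF == ch )
            return;
        if ( (size_t)stream->last_peek < sizeof(stream->peeks) )
            stream->peeks[stream->last_peek++] = (char)ch;
    }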
    
    static int check_trigraph(PFILE *stream) {
    /* Checks for trigraphs and returns the equivalent character if there
     * is one.  Expects that the leading '?' of the trigraph has already
     * been read before this is called.
     */
    
        int ch;
    
        if ( '?' != (ch=read_char(stream))) {
            unget_character(ch, stream);
            return '?';
        }
    
        ch = read_char(stream);
    
        switch( ch ) {
            case '(':   return '[';
            case ')':   return ']';
            case '/':   return '\\';
            case '\'':  return '^';
            case '<':   return '{';
            case '>':   return '}';
            case '!':   return '|';
            case '-':   return '~';
            case '=':   return '#';
            default:
                /* Not a trigraph.  The put-back buffer is last-in,
                 * first-out, so push the later character first to have
                 * the second '?' read back before it.
                 */
                unget_character(ch, stream);
                unget_character('?', stream);
                return '?';
        }
    }
    
    #ifdef DIGRAPH
    static int check_digraph(PFILE *stream, int first) {
    /* Checks for a digraph.  The first character of the digraph is
     * transmitted as the second parameter, as there are several possible
     * first characters of a digraph.
     */
    
        int ch = read_char(stream);
    
        switch(first) {
            case '<':
                if ( '%' == ch )
                    return '{';
                if ( ':' == ch )
                    return '[';
                break;
            case ':':
                if ( '>' == ch )
                    return ']';
                break;
            case '%':
                if ( '>' == ch )
                    return '}';
                if ( ':' == ch )
                    return '#';
                break;
        }
    
    /* If it's not one of the specific combos above, return the characters
     * separately and unchanged by putting the second one back into the
     * stream, and returning the first one as-is.
     */
        unget_character(ch, stream);
        return first;
    }
    #endif
    
    
    static int get_char(PFILE *stream) {
    /* Gets a single character from the stream with any trigraphs or digraphs converted 
     * to the single character represented. Note that handling digraphs this early in
     * translation isn't really correct (and shouldn't happen in C at all).
     */
        int ch = read_char(stream);
    
        if ( ch == '?' )
            return check_trigraph(stream);
    
    #ifdef DIGRAPH
        if (( ch == '<' || ch == ':' || ch == '%' ))
            return check_digraph(stream, ch);
    #endif
    
        return ch;
    }
    
    int get_character(PFILE *stream) {
    /* gets a character from `stream'.  Any run of whitespace within a line
     * is returned as a single space; a newline is returned as '\n'.
     * Escaped new-lines are "eaten" here as well.
     */
        int ch = get_char(stream);
    
        /* handle line splicing */
        if (ch == '\\') {
            int next = get_char(stream);
            if (next == '\n')
                return get_character(stream);   /* carry on with the next line */
            /* an ordinary backslash: put back what followed it */
            unget_character(next, stream);
            return '\\';
        }
    
        if ( !isspace(ch) )
            return ch;
    
        /* If it's a space, skip over consecutive white-space */
        while (isspace(ch) && ('\n' != ch))
            ch = get_char(stream);
    
        if ('\n' == ch)
            return ch;
    
        /* Then put the non-ws character back */
        unget_character(ch, stream);
    
        /* and return a single space character... */
        return ' ';
    }
    
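    static int read_char_lit(PFILE *stream) {
    /* This is used internally by `get_source' (below) - it expects the
     * opening quote of a character literal to have already been read and
     * returns CHAR_LIT, or ERROR if there's a newline or end of file before
     * a close quote is found, or if the character literal contains more
     * than two characters.  A backslash and the character after it count
     * as one character; each further digit of an octal or hex escape counts
     * separately, so a literal such as '\x41' is reported as ERROR.
     */
    
        int ch;
        int i;
    
        clear();
        addchar('\'');
    
        /* Read all the way to the closing quote, so an over-long literal
         * doesn't leave a stray quote behind in the stream.
         */
        for (i=0; '\'' != ( ch = get_char(stream)); i++) {
    
            if ( ch == '\n' || ch == EOF )
                return ERROR;
    
            addchar(ch);
    
            if (ch == '\\' ) {
                ch = get_char(stream);
                if ( ch == '\n' || ch == EOF )
                    return ERROR;
                addchar(ch);
            }
        }
        addchar('\'');
        addchar('\0');
    
        if ( i > 2 )
            return ERROR;
    
        return CHAR_LIT;
    }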
    
    static int read_str_lit(PFILE *stream) {
    /* Used internally by get_source.  Expects the opening quote of a string
     * literal to have already been read.  Returns STR_LIT, or ERROR if an
     * unescaped newline or end of file is found before the close quote.
     */
    
        int ch;
    
        clear();
        addchar('"');
    
        while ( '"' != ( ch = get_char(stream))) {
    
            if ( '\n' == ch || EOF == ch )
                return ERROR;
    
            addchar(ch);
    
            if( ch == '\\' ) {
                ch = read_char(stream);
                addchar(ch);
            }
    
        }
    
        addchar('"');
        addchar('\0');
    
        return STR_LIT;
    }
    
    static int read_comment(PFILE *stream) {
    /* Reads a comment from stream into `last_token'.  Assumes the leading
     * '/' has already been read.  Handles C++ single line comments as well
     * as normal C comments.  Returns COMMENT, '/' if no comment starts
     * here, or ERROR if the file ends inside a C comment.
     */
        int ch;
        int prev = 0;
    
        clear();
    
        ch = get_char(stream);
    
        /* Handle a single line comment.  Stop at end of file as well as at
         * a newline, so a final line without a newline can't loop forever.
         */
        if ('/' == ch) {
            addchar('/');
            addchar('/');
    
            while ( '\n' != ( ch = get_char(stream)) && EOF != ch )
                addchar(ch);
    
            addchar('\0');
            return COMMENT;
        }
    
        if ('*' != ch ) {
            unget_character(ch, stream);
            return '/';
        }
    
        addchar('/');
        addchar('*');
    
        /* Copy characters until a '*' is immediately followed by a '/'.
         * Tracking the previous character also handles comments that end
         * in a run of asterisks.
         */
        for (;;) {
            ch = get_char(stream);
            if (EOF == ch)
                return ERROR;
            addchar(ch);
            if ('*' == prev && '/' == ch)
                break;
            prev = ch;
        }
    
        addchar('\0');
    
        return COMMENT;
    }
    
    int get_source(PFILE *stream) {
    /* reads and returns a single "item" from the stream.  An "item" is a
     * comment, a literal or a single character after trigraph and possible
     * digraph substitution has taken place.
     */
    
        int ch = get_character(stream);
    
        switch(ch) {
            case '\'':
                return read_char_lit(stream);
            case '"':
                return read_str_lit(stream);
            case '/':
                return read_comment(stream);
            default:
                return ch;
        }
    }
    
    #ifdef TEST
    
    int main(int argc, char **argv)  {
        PFILE *f;
        int ch;
    
        if (argc != 2) {
            fprintf(stderr, "Usage: get_src <filename>\n");
            return EXIT_FAILURE;
        }
    
        if (NULL==(f= parse_fopen(argv[1]))) {
            fprintf(stderr, "Unable to open: %s\n", argv[1]);
            return EXIT_FAILURE;
        }
    
        while (EOF!=(ch=get_source(f))) 
            if (ch < 0) 
                printf("\n%s\n", last_token);
            else
                printf("%c", ch);
        parse_fclose(f);
        return 0;       
    }
    
    #endif
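
    Since the code above leaves "pretty much everything else" as single characters, a further pass is needed to group those into tokens. Below is a rough sketch of such a pass working on a string. It is deliberately simplified: `next_token` and the `tok_kind` names are hypothetical, multi-character operators such as `->` or `<<=` come back one character at a time, and numbers are matched loosely (a digit followed by letters, digits, '.' or '_'), so exponent signs as in `1e+5` are not kept together.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum tok_kind { TOK_END, TOK_IDENT, TOK_NUMBER, TOK_PUNCT };

/* Hypothetical helper: reads one token starting at *pos, copies its text
 * into `out` (at most outsz - 1 characters are stored) and advances *pos
 * past the token. */
enum tok_kind next_token(const char **pos, char *out, size_t outsz) {
    const char *p;
    size_t len = 0;
    enum tok_kind kind;

    assert(pos != NULL && *pos != NULL && out != NULL && outsz > 0);
    p = *pos;

    /* Skip leading whitespace. */
    while (isspace((unsigned char)*p))
        p++;

    if (*p == '\0') {
        kind = TOK_END;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        /* Identifier or keyword: letter or '_' then letters, digits, '_'. */
        kind = TOK_IDENT;
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (len + 1 < outsz)
                out[len++] = *p;
            p++;
        }
    } else if (isdigit((unsigned char)*p)) {
        /* Loose number match; covers forms like 42, 0x1F, 3.14, 10UL. */
        kind = TOK_NUMBER;
        while (isalnum((unsigned char)*p) || *p == '.' || *p == '_') {
            if (len + 1 < outsz)
                out[len++] = *p;
            p++;
        }
    } else {
        /* Anything else is returned as a one-character punctuator. */
        kind = TOK_PUNCT;
        if (len + 1 < outsz)
            out[len++] = *p;
        p++;
    }

    out[len] = '\0';
    *pos = p;
    return kind;
}
```

    Calling it in a loop until it returns TOK_END walks through a line such as `x1 = 0x1F;` as an identifier, a punctuator, a number and a punctuator.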
    

    I’m not sure how easy it would be to integrate that into a Flex-based lexer, though. I seem to recall Flex has some sort of hook to define what it uses to read a character, but I’ve never tried to use it, so I can’t say much more about it (and ultimately can’t even say with any certainty that it exists).
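
    For what it’s worth, the Flex manual describes a `YY_INPUT(buf, result, max_size)` macro that can be redefined to control where the scanner gets its input. The sketch below shows the kind of buffer-filling function such a macro could hand off to. The names `char_reader`, `fill_buffer` and `string_reader` are invented for this illustration, and the wiring into an actual .l file is an untested assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical reader type: any function that returns the next character
 * of input, or EOF when there is none left. */
typedef int (*char_reader)(void *ctx);

/* Fills `buf` with up to `max_size` characters pulled from `next`.
 * Returns the number of characters stored; 0 means end of input. */
size_t fill_buffer(char *buf, size_t max_size, char_reader next, void *ctx) {
    size_t n = 0;

    assert(buf != NULL && next != NULL);

    while (n < max_size) {
        int ch = next(ctx);
        if (ch == EOF)
            break;
        buf[n++] = (char)ch;
    }
    return n;
}

/* Example reader for demonstration: ctx points to a `const char *` cursor
 * that is advanced one character per call. */
int string_reader(void *ctx) {
    const char **cursor = ctx;

    if (**cursor == '\0')
        return EOF;
    return (unsigned char)*(*cursor)++;
}
```

    In a scanner, a reader wrapping `get_character` would probably be the one to plug in, since its results are plain characters; the negative codes returned by `get_source` (COMMENT, CHAR_LIT, STR_LIT) cannot be stored in a character buffer.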
