I was tasked with creating a word frequency analysis program that reads the contents of a text file and produces output like the following example:
SUMMARY:
27340 words
2572 unique words
WORD FREQUENCIES (TOP 10):
the 1644
and 872
to 729
a 632
it 595
she 553
i 545
of 514
said 462
you 411
I attempted to create a program to achieve such an output. I’m very new to C programming, so although it works to a certain extent, there are probably a lot of efficiency issues / flaws. Here is what I wrote so far:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#define MAX_WORD 32
#define MAX_TEXT_LENGTH 10000
// ===========================================
// STRUCTURE
//============================================
typedef struct word {
char *str; /* Stores the word */
int freq; /* Stores the frequency */
struct word *pNext; /* Pointer to the next word counter in the list */
} Word;
// ===========================================
// FUNCTION PROTOTYPES
//============================================
int getNextWord(FILE *fp, char *buf, int bufsize); /* Given function to get words */
void addWord(char *pWord); /* Adds a word to the list or updates existing word */
void show(Word *pWordcounter); /* Outputs a word and its count of occurrences */
Word* createWordCounter(char *word); /* Creates a new WordCounter structure */
// ===========================================
// GLOBAL VARIABLES
//============================================
Word *pStart = NULL; /* Pointer to first word counter in the list */
int totalcount = 0; /* Total amount of words */
int uniquecount = 0; /* Amount of unique words */
// ===========================================
// MAIN
//============================================
int main () {
/* File pointer */
FILE * fp;
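/* Read text from here */
fp = fopen("./test.txt","r");
/* Stop early if the file could not be opened */
if (fp == NULL) {
perror("./test.txt");
return 1;
}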
/* buf to hold the words */
char buf[MAX_WORD];
/* Size of buf, so getNextWord never writes past its end */
int size = sizeof(buf);
/* Pointer to Word counter */
Word *pCounter = NULL;
/* Read all words from text file */
while (getNextWord(fp, buf, size)) {
/* Add the word to the list */
addWord(buf);
/* Increment the total words counter */
totalcount++;
}
/* Loop through list and figure out the number of unique words */
pCounter = pStart;
while(pCounter != NULL)
{
uniquecount++;
pCounter = pCounter->pNext;
}
/* Print Summary */
printf("\nSUMMARY:\n\n");
printf(" %d words\n", totalcount); /* Print total words */
printf(" %d unique words\n", uniquecount); /* Print unique words */
/* List the words and their counts */
pCounter = pStart;
while(pCounter != NULL)
{
show(pCounter);
pCounter = pCounter->pNext;
}
printf("\n");
/* Free the allocated memory*/
pCounter = pStart;
while(pCounter != NULL)
{
free(pCounter->str);
pStart = pCounter;
pCounter = pCounter->pNext;
free(pStart);
}
/* Close file */
fclose(fp);
return 0;
}
// ===========================================
// FUNCTIONS
//============================================
void show(Word *pWordcounter)
{
/* output the word and its count */
printf("\n%-30s %5d", pWordcounter->str,pWordcounter->freq);
}
void addWord(char *word)
{
Word *pCounter = NULL;
Word *pLast = NULL;
if(pStart == NULL)
{
pStart = createWordCounter(word);
return;
}
/* If the word is in the list, increment its count */
pCounter = pStart;
while(pCounter != NULL)
{
if(strcmp(word, pCounter->str) == 0)
{
++pCounter->freq;
return;
}
pLast = pCounter;
pCounter = pCounter->pNext;
}
/* Word is not in the list, add it */
pLast->pNext = createWordCounter(word);
}
Word* createWordCounter(char *word)
{
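Word *pCounter = NULL;
pCounter = (Word*)malloc(sizeof(Word));
/* Give up if the node cannot be allocated */
if (pCounter == NULL)
exit(EXIT_FAILURE);
pCounter->str = (char*)malloc(strlen(word)+1);
/* Give up if the copy of the word cannot be allocated */
if (pCounter->str == NULL)
exit(EXIT_FAILURE);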
strcpy(pCounter->str, word);
pCounter->freq = 1;
pCounter->pNext = NULL;
return pCounter;
}
int getNextWord(FILE *fp, char *buf, int bufsize) {
char *p = buf;
int c; /* int, not char, so EOF can be told apart from real characters */
//skip all non-word characters
do {
c = fgetc(fp);
if (c == EOF)
return 0;
} while (!isalpha(c));
//read word chars
do {
if (p - buf < bufsize - 1)
*p++ = tolower(c);
c = fgetc(fp);
} while (isalpha(c));
//finalize word
*p = '\0';
return 1;
}
It displays the summary correctly: the total number of words and the number of unique words are both correct. It then lists every unique word found in the file along with the correct number of occurrences.
What I need to do now (and what I’m having a lot of trouble with) is sort my linked list by number of occurrences in descending order. On top of that, it should only display the top 10 words and not all of them (this should be doable once I have the linked list sorted).
I know the code itself is very inefficient, but my primary concern for now is just to get the correct output.
If anybody can help me out with a sorting algorithm, or at least point me in the right direction it would be greatly appreciated.
Thank you.
This idea might be a little ambitious for a beginning C programmer, but it is always a good idea to be aware of the functions in the standard library. If you know how big your linked list is, you can use malloc to allocate space for an array holding the same data. Then you can use qsort to sort the data for you. Functions malloc and qsort are frequently used members of the standard C library.
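As a rough illustration of that suggestion, here is a minimal sketch. It assumes the same Word struct as in the question; the names compareByFreqDesc, sortedByFreq and showTop are made up for this example and are not part of the original program. Instead of copying the data, it fills a malloc'd array with pointers to the existing list nodes and lets qsort order those pointers by descending frequency, so the list itself is left untouched.

```c
#include <stdio.h>
#include <stdlib.h>

/* Same layout as the Word struct in the question */
typedef struct word {
    char *str;
    int freq;
    struct word *pNext;
} Word;

/* qsort comparator (hypothetical name): larger freq sorts first.
   Each array element is a pointer to a list node. */
int compareByFreqDesc(const void *a, const void *b)
{
    const Word *wa = *(const Word * const *)a;
    const Word *wb = *(const Word * const *)b;
    return (wa->freq < wb->freq) - (wa->freq > wb->freq);
}

/* Hypothetical helper: counts the nodes, allocates an array of node
   pointers, and sorts it by descending frequency. Stores the node
   count in *pCount. Returns NULL if the list is empty or malloc
   fails. The caller frees the array, not the nodes it points to. */
Word **sortedByFreq(Word *pHead, size_t *pCount)
{
    Word **arr;
    Word *p;
    size_t n = 0, i = 0;

    for (p = pHead; p != NULL; p = p->pNext)
        n++;
    *pCount = n;
    if (n == 0)
        return NULL;

    arr = malloc(n * sizeof *arr);
    if (arr == NULL)
        return NULL;

    for (p = pHead; p != NULL; p = p->pNext)
        arr[i++] = p;

    qsort(arr, n, sizeof *arr, compareByFreqDesc);
    return arr;
}

/* Hypothetical helper: prints at most `limit` of the most frequent words. */
void showTop(Word *pHead, size_t limit)
{
    size_t count = 0, i;
    Word **arr = sortedByFreq(pHead, &count);

    if (arr == NULL)
        return;
    for (i = 0; i < count && i < limit; i++)
        printf("%s %d\n", arr[i]->str, arr[i]->freq);
    free(arr);
}
```

With these helpers in place, the loop in main that calls show on every node could be replaced by a single call such as showTop(pStart, 10). Note that qsort makes no promise about the relative order of words with equal counts, so ties may come out in any order.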