The Archive Base

Editorial Team
Asked: May 10, 2026 (2026-05-10T21:21:03+00:00)

I’m a non-computer science student doing a history thesis that involves determining the frequency


I’m a non-computer science student doing a history thesis that involves determining the frequency of specific terms in a number of texts and then plotting these frequencies over time to determine changes and trends. While I have figured out how to determine word frequencies for a given text file, I am dealing with a (relatively, for me) large number of files (>100) and, for consistency’s sake, would like to limit the words included in the frequency count to a specific set of terms (sort of like the opposite of a ‘stop list’).

This should be kept very simple. In the end, all I need is the frequencies of the specific words for each text file I process, preferably in spreadsheet format (a tab-delimited file), so that I can then create graphs and visualizations using that data.

I use Linux day-to-day, am comfortable using the command line, and would love an open-source solution (or something I could run with WINE). That is not a requirement, however.

I see two ways to solve this problem:

  1. Find a way to strip out all the words in a text file EXCEPT for the pre-defined list and then do the frequency count from there, or:
  2. Find a way to do a frequency count using just the terms from the pre-defined list.

Any ideas?

1 Answer

  1. Added an answer on May 10, 2026 at 9:21 pm

    I would go with the second idea. Here is a simple Perl program that will read a list of words from the first file provided and print a count of each word in the list from the second file provided in tab-separated format. The list of words in the first file should be provided one per line.

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $word_list_file = shift;
    my $process_file = shift;

    my %word_counts;

    # Open the word list file, read a line at a time, remove the newline,
    # add it to the hash of words to track, initialize the count to zero
    open(my $words_fh, '<', $word_list_file) or die "Failed to open list file: $!\n";
    while (<$words_fh>) {
      chomp;
      # Store words in lowercase for case-insensitive match
      $word_counts{lc($_)} = 0;
    }
    close($words_fh);

    # Read the text file one line at a time, break the text up into words
    # based on word boundaries (\b), iterate through each word incrementing
    # the word count in the word hash if the word is in the hash
    open(my $text_fh, '<', $process_file) or die "Failed to open process file: $!\n";

    while (<$text_fh>) {
      chomp;
      while ( /-$/ ) {
        # If the line ends in a hyphen, remove the hyphen and
        # continue reading lines until we find one that doesn't
        chop;
        my $next_line = <$text_fh>;
        last unless defined $next_line;
        chomp $next_line; # Drop the newline so the hyphen test sees the real line end
        $_ .= $next_line;
      }

      my @words = split /\b/, lc; # Split the lower-cased version of the string
      foreach my $word (@words) {
        $word_counts{$word}++ if exists $word_counts{$word};
      }
    }
    close($text_fh);

    # Print each word in the hash in alphabetical order along with the
    # number of times encountered, delimited by tabs (\t). Double quotes
    # are required so that the variables, \t and \n are interpolated.
    foreach my $word (sort keys %word_counts) {
      print "$word\t$word_counts{$word}\n";
    }

    If the file words.txt contains:

    linux
    frequencies
    science
    words

    And the file text.txt contains the text of your post, the following command:

    perl analyze.pl words.txt text.txt 

    will print:

    frequencies     3
    linux   1
    science 1
    words   3

    Note that breaking on word boundaries using \b may not work the way you want in all cases. For example, if your text files contain words that are hyphenated across lines, you will need to do something a little more intelligent to match them. In that case, you could check whether the last character in a line is a hyphen and, if it is, remove the hyphen and read another line before splitting the line into words.

    Edit: Updated version that handles words case-insensitively and handles hyphenated words across lines.

    Note that if the text contains hyphenated words, some broken across lines and some not, this won’t find them all, because it only removes hyphens at the end of a line. In that case you may want to remove all hyphens and match words after the hyphens are removed (the entries in your word list would then also need to be written without hyphens). You can do this by adding the following line right before the split function:

    s/-//g; 
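
    Since the question mentions more than 100 files, here is a rough sketch of how the same counting could be wrapped to process many files in one run and print one spreadsheet row per file. This is only an illustration under some assumptions: the script name analyze_all.pl and the helper names count_terms, slurp and frequency_rows are made up for this example, each file is read into memory whole, and words split across lines are rejoined by deleting every hyphen that is immediately followed by a line break (a simpler variant of the line-by-line loop above).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often each term occurs in a string of text (case-insensitive).
# Returns a hash reference mapping each lower-cased term to its count.
sub count_terms {
    my ($terms, $text) = @_;
    my %counts;
    $counts{lc $_} = 0 for @$terms;

    $text =~ s/-\r?\n//g;    # rejoin words hyphenated across a line break
    foreach my $word (split /\b/, lc $text) {
        $counts{$word}++ if exists $counts{$word};
    }
    return \%counts;
}

# Read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Failed to open $path: $!\n";
    local $/;    # no record separator: the next read returns the whole file
    my $text = <$fh>;
    close($fh);
    return defined $text ? $text : '';
}

# Build tab-separated rows: a header row, then one row per file holding
# the file name followed by the count of each term (terms sorted A-Z).
sub frequency_rows {
    my ($terms, @files) = @_;
    my @sorted = sort { $a cmp $b } map { lc($_) } @$terms;
    my @rows = (join("\t", 'file', @sorted));
    foreach my $file (@files) {
        my $counts = count_terms(\@sorted, slurp($file));
        push @rows, join("\t", $file, map { $counts->{$_} } @sorted);
    }
    return @rows;
}

# Only run when a word list and at least one text file are given.
if (@ARGV >= 2) {
    my ($word_list_file, @files) = @ARGV;
    my @terms = grep { length($_) } split /\r?\n/, slurp($word_list_file);
    print "$_\n" for frequency_rows(\@terms, @files);
}
```

    Assuming the texts sit in a directory called texts (also just an example), it could be run as perl analyze_all.pl words.txt texts/*.txt > counts.tsv, and the resulting file should open directly in a spreadsheet with one column per term.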