I’m attempting to build a n-gram language model based on the top 100K words

Question

0

Asked: May 26, 20262026-05-26T21:11:02+00:00 2026-05-26T21:11:02+00:00

I’m attempting to build a n-gram language model based on the top 100K words

0

I’m attempting to build a n-gram language model based on the top 100K words found in the english language wikipedia dump. I’ve already extracted out the plain text with a modified XML parser written in Java, but need to convert it to a vocab file.

In order to do this, I found a perl script that is said to do the job, but lacks instructions on how to execute. Needless to say, I’m a complete newbie to Perl and this is the first time I’ve encountered a need for its usage.

When I run this script, I’m getting an Out of Memory Error when using this on a 7.2GB text file on two separate dual core machines with 4GB RAM and runnung Ubuntu 10.04 and 10.10.

When I contacted the author, he said this script ran fine on a MacBook Pro with 4GB RAM, and the total in-memory usage was about 78 MB when executed on a 6.6GB text file with perl 5.12. The author also said that the script reads the input file line by line and creates a hashmap in memory.

The script is:

#! /usr/bin/perl

use FindBin;
use lib "$FindBin::Bin";

use strict;
require 'english-utils.pl';

## Create a list of words and their frequencies from an input corpus document
## (format: plain text, words separated by spaces, no sentence separators)

## TODO should words with hyphens be expanded? (e.g. three-dimensional)

my %dict;
my $min_len = 3;
my $min_freq = 1;

while (<>) {

    chomp($_);
    my @words = split(" ", $_);

    foreach my $word (@words) {

        # Check validity against regexp and acceptable use of apostrophe

        if ((length($word) >= $min_len) && ($word =~ /^[A-Z][A-Z\'-]+$/) 
        && (index($word,"'") < 0 || allow_apostrophe($word))) {
            $dict{$word}++;
        }
    }

}

# Output words which occur with the $min_freq or more often

foreach my $dictword (keys %dict) {
    if ( $dict{$dictword} >= $min_freq ) {
        print $dictword . "\t" . $dict{$dictword} . "\n";
    }
}

I’m executing this script from the command line via mkvocab.pl corpus.txt

The included extra script is simply a regex script to test the placement of apostrophe’s and whether they match English grammar rules.

I thought the memory leak was due to the different versions, as 5.10 was installed on my machine. So I upgraded to 5.14, but the error still persists. According to free -m, I have approximately 1.5GB free memory on my system.

As I am completely unfamiliar with the syntax and structure of language, can you point out the problem areas along with why the issue exists and how to fix it.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-26T21:11:03+00:00

Loading a 7,2Gb file into a hash could be possible if there is some repetition in the words, e.g. the occurs 17,000 times, etc. It seems to be rather a lot, though.

Your script assumes that the lines in the file are appropriately long. If your file does not contain line breaks, you will load the whole file into memory in $_, then double that memory load with split, and then add quite a whole lot more into your hash. Which would strain any system.

One idea may be to use space " " as your input record separator. It will do approximately what you are already doing with split, except that it will leave other whitespace characters alone, and will not trim excess whitespace as prettily. For example:

$/ = " ";
while (<>) {
    for my $word ( split ) {  # avoid e.g. "foo\nbar" being considered one word
        if (
              (length($word) >= $min_len) &&
              ($word =~ /^[A-Z][A-Z\'-]+$/) &&
              (index($word,"'") < 0 || allow_apostrophe($word))
        ) {
            $dict{$word}++;
        }
    }
}

This will allow even very long lines to be read in bite size chunks, assuming you do have spaces between the words (and not tabs or newlines).

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m attempting to build a n-gram language model based on the top 100K words

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply