I need to extract the 10 most frequent words from a text using a

Question

Asked: June 4, 2026 (2026-06-04T11:05:35+00:00)

I need to extract the 10 most frequent words from a text using a pipe (plus any additional Python scripts as needed); the output should be a block of all-caps words separated by spaces.
The pipe needs to accept text from any external source: I’ve managed to get it working on .txt files, but I also need to be able to pass in a URL and have it do the same thing with that page.

I have the following code:

alias words="tr a-z A-Z | tr -cs \"A-Z'\" ' ' | tr ' ' '\012' | sort | uniq -c | sort -rn | head -n 10 | awk '{printf \"%s \", \$2}END{print \"\"}'"

which, with cat hamlet.txt | words gives me:

TO THE AND A  'TIS THAT OR OF IS

To make it more complicated, I need to exclude any ‘function’ words: these are ‘non-lexical’ words like ‘a’, ‘the’, ‘of’, ‘is’, any pronouns (I, you, him), and any prepositions (there, at, from).

I need to be able to type htmlstrip http://www.google.com.au | words and have it print out like the above.

For the URL-opening:
The python script I’m trying to figure out (let’s call it htmlstrip) strips any tags from the text, leaving only ‘human readable’ text. This should be able to open any given URL, but I can’t figure out how to get this to work.
What I have so far:

import re
import urllib2
filename = raw_input('File name: ')
filehandle = open(filename)
html = filehandle.read()

f = urllib2.urlopen('http://') #???
print f.read()

text = [ ]
inTag = False


for ch in html:
    if ch == '<':
        inTag = True
    if not inTag:
        text.append(ch)
    if ch == '>':
        inTag = False

print ''.join(text)

I know this is both incomplete and probably incorrect – any guidance would really be appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-06-04T11:05:37+00:00

You can use scrape.py and regular expressions like this:

#!/usr/bin/env python

from scrape import s
import sys, re

if len(sys.argv) < 2:
    print "Usage: words.py url"
    sys.exit(1) # non-zero exit status signals the missing argument

s.go(sys.argv[1]) # fetch content
text = s.doc.text # extract readable text
text = re.sub(r"\W+", " ", text) # collapse each run of non-word characters into a single space
print text

And then just:
./words.py http://whatever.com
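
If you would rather not depend on scrape.py, something similar can probably be done with the standard library alone. The following is only a rough sketch, and it assumes Python 3 (so urllib.request rather than the urllib2 used in the question). The names TextExtractor, strip_tags and read_source are made up for illustration. It skips the contents of script and style elements, which the character-by-character loop in the question would let through.

```python
#!/usr/bin/env python3
"""htmlstrip (sketch): print the human-readable text of a URL or local file."""
import sys
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Text runs are joined with a space so adjacent blocks do not fuse;
    # the cost is that a word split by inline tags is also split here.
    return " ".join(" ".join(parser.parts).split())


def read_source(source):
    """Fetch a URL, or read a local file if the argument is not http(s)."""
    if source.startswith(("http://", "https://")):
        with urllib.request.urlopen(source) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    with open(source, encoding="utf-8", errors="replace") as handle:
        return handle.read()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        print(strip_tags(read_source(sys.argv[1])))
    else:
        print("Usage: htmlstrip URL_OR_FILE", file=sys.stderr)
```

Saved as an executable file named htmlstrip on your PATH, this should let you run htmlstrip http://www.google.com.au | words as described in the question.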

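For excluding the function words, one option is to do the counting in Python rather than in the shell pipe. This is a minimal sketch, again assuming Python 3. FUNCTION_WORDS is a deliberately short, hypothetical stop list that you would need to extend, and top_words is an illustrative name rather than anything from an existing library.

```python
import re
from collections import Counter

# Hypothetical, incomplete stop list: articles, pronouns, prepositions, etc.
FUNCTION_WORDS = frozenset("""
    a an the and or but of to in on at from by for with as is are was were be
    been am i you he she it we they me him her us them my your his its our
    their this that these those there not
""".upper().split())


def top_words(text, n=10, stop_words=FUNCTION_WORDS):
    """Return the n most frequent upper-cased words, skipping stop words."""
    # Keep letters and apostrophes so contractions stay in one piece.
    words = re.findall(r"[A-Z']+", text.upper())
    # Trim stray quote marks, e.g. 'TIS becomes TIS.
    words = (word.strip("'") for word in words)
    counts = Counter(w for w in words if w and w not in stop_words)
    return [word for word, _ in counts.most_common(n)]


sample = "To be, or not to be: that is the question. Question everything, question!"
print(" ".join(top_words(sample, n=2)))  # prints: QUESTION EVERYTHING
```

To use it at the end of the pipe, you could pass sys.stdin.read() to top_words and print the joined result, in place of the sort | uniq -c | sort part of the alias.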