
The Archive Base

Editorial Team
Asked: June 9, 2026

I have a task of creating a script which takes a huge text file

I have a task of creating a script which takes a huge text file as input. It then needs to find all words and the number of times each occurs, and create a new file in which each line shows a unique word and its number of occurrences.

As an example take a file with this content:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor 
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud 
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure
dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.   
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt 
mollit anim id est laborum.

I need to create a file which looks like this:

1 AD
1 ADIPISICING
1 ALIQUA
...
1 ALIQUIP
1 DO
2 DOLOR
2 DOLORE
...

For this I wrote a script using tr, sort and uniq:

#!/bin/sh
INPUT=$1
OUTPUT=$2
if [ -a $INPUT ]
then
    tr '[:space:][\-_?!.;\:]' '\n' < $INPUT | 
        tr -d '[:punct:][:special:][:digit:]' |
        tr '[:lower:]' '[:upper:]' |
        sort |
        uniq -c > $OUTPUT
fi   

What this does is split the text into words, using whitespace as the delimiter. If a word contains any of -_?!.;: I break it into words again. I remove punctuation, special characters and digits, and convert the entire string to uppercase. Once this is done I sort it and pass it through uniq to get it into the format I want.

Now I downloaded the Bible in txt format and used it as the input. Timing this I got:

scripts|$ time ./text-to-word.sh text.txt b     
./text-to-word.sh text.txt b  16.17s user 0.09s system 102% cpu 15.934 total

I did the same with a Python script:

import re
from collections import Counter
from itertools import chain
import sys

file = open(sys.argv[1])

c = Counter()

for line in file.readlines():
    c.update([re.sub('[^a-zA-Z]', '', l).upper()
            for l in chain(*[re.split('[-_?!.;:]', word)
                    for word in line.split()])])

file2 = open('output.txt', 'w')
for key in sorted(c):
    file2.write(key + ' ' + str(c[key]) + '\n')

When I executed the script I got:

scripts|$ time python text-to-word.py text.txt
python text-to-word.py text.txt  7.23s user 0.04s system 97% cpu 7.456 total

As you can see, it ran in 7.23s compared to the shell script, which ran in 16.17s. I have tried with bigger files and Python always seems to win. I have a few questions about the scenario above:

  1. Why is the Python script faster given that the shell commands are written in C? I do realize the shell script may not be the optimum one.
  2. How can I improve the shell script?
  3. Can I improve the Python script?

To be clear, I am not comparing Python to shell scripts. I am not trying to start a flame war, and I do not need answers claiming that some other language is faster. Using the UNIX philosophy of piping small commands together to do a task, how do I make the shell script faster?

1 Answer

  1. Editorial Team
    Added an answer on June 9, 2026 at 9:53 pm

    An important point here is probably inter-process I/O. The Python script has all data in memory, so no I/O happens while it processes the data.

    Also note that Python isn’t slow as such. Most functionality in Python is implemented in C.
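To make the point about C-implemented functionality concrete, here is a minimal sketch that pushes the per-word work into `re` and `collections.Counter` rather than into nested Python-level loops. The function name `count_words_streaming` is made up for this illustration. Note that it swaps in a simpler tokenisation than the question's script: every run of ASCII letters counts as a word, so an apostrophe splits a word instead of being deleted from it.

```python
import re
from collections import Counter

# One compiled pattern: a word is any run of ASCII letters.
# (Assumption: this is simpler than the question's split-then-strip rules.)
WORD = re.compile(r"[A-Za-z]+")


def count_words_streaming(lines):
    """Count uppercase words from any iterable of lines, e.g. an open file."""
    counts = Counter()
    for line in lines:
        # findall and Counter.update do the per-word work; the only
        # Python-level loop left is the one over lines.
        counts.update(WORD.findall(line.upper()))
    return counts


if __name__ == "__main__":
    sample = ["Lorem ipsum\n", "lorem-IPSUM dolor.\n"]
    print(count_words_streaming(sample))
```

Passing an open file object instead of the result of `readlines()` lets the lines be read one at a time, so only the counter has to stay in memory.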

    The shell script has to start five processes connected by four pipes, so essentially the whole text is written to a pipe and read back from it four times.

    There might be a way to make the Python script a bit faster: You can read the whole text into a single string, then remove all punctuation, split words and then count them:

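    import re
    import sys
    from collections import Counter

    with open(sys.argv[1]) as file:
        text = file.read()
    # Keep '-' last in the class so it is a literal, not a ';'-to-'_' range
    # (that range would also delete every uppercase letter).
    text = re.sub(r'[.,:;_-]', '', text)
    text = text.upper()
    words = text.split()  # splits on runs of whitespace, no empty strings
    c = Counter(words)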
    

    That would avoid the overhead of several nested loops.
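The whole-text idea can be taken a step further so that it follows the question's rules more closely and produces the requested "count WORD" lines. This is only a sketch under assumptions: the names `count_words` and `format_report` are hypothetical, the separator set is copied from the question's script, and unlike the original script it drops tokens that end up empty (for example pure numbers).

```python
import re
from collections import Counter

# Characters the question treats as word separators, besides whitespace.
SEPARATORS = re.compile(r"[-_?!.;:]")
# Anything that is neither an ASCII letter nor whitespace is deleted.
NON_LETTERS = re.compile(r"[^A-Za-z\s]")


def count_words(text):
    """Count uppercase words in one string using two regex passes."""
    text = SEPARATORS.sub(" ", text)   # break words at the separators
    text = NON_LETTERS.sub("", text)   # strip digits and other punctuation
    return Counter(text.upper().split())


def format_report(counts):
    """Return one 'count WORD' line per word, sorted by word."""
    return "".join(f"{n} {word}\n" for word, n in sorted(counts.items()))


if __name__ == "__main__":
    print(format_report(count_words("Dolor sit, dolor-sit! don't 42")), end="")
```

Each regex pass and the final `split()` handle the whole string in a single call, so the only Python-level loop that remains is the one that formats the output lines.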

    As for the shell script: You should try to reduce the number of processes. The three tr processes could probably be replaced with one call to sed.

© 2021 The Archive Base. All Rights Reserved