The Archive Base

Editorial Team
Asked: June 5, 2026

A directory D contains a few thousand e-mails in the .eml format. Some are plain text, others come from Outlook, others have an ASCII header with HTML/MIME content, and so on. A dictionary file F contains a list of interesting words, one per line (e.g. red\nblue\ngreen\n…), to look for in the files underneath D. D has a large number of subfolders but contains no files other than the .eml files mentioned above. A list of top recurring words should be made to these specifications:

  • For every interesting word, report how many times it occurs and where. If it occurs multiple times within a file, it should be reported multiple times for that file. Reporting an occurrence means reporting a tuple (L,P) of integers, where L is the line number from the top of the e-mail source and P is the position, within that line, of the start of the occurrence.

This would build both an index to refer to the different occurrences and a summary of the most frequently occurring interesting words.

The output should be on a single output file and the format is not strictly defined, provided the information above is included: interesting words, number of times each interesting word occurs and where it does -> file/line/start-position.

This is not a homework exercise but actual text analysis I would like to run on a fairly large dataset. The challenge is choosing the right tool for filtering efficiently. An iterative approach (the Cartesian product of words, e-mails, etc.) is too slow; it would be better to filter for multiple words at once on each line of each file.

I have experimented with building a regex of alternatives from the list of interesting words, w1|w2|w3|…, compiling it and running it over each line of each e-mail, but it is still slow, especially when I need to check for multiple occurrences within a single line.

Example:

E-mail E has a line containing the text:

^ … blah … red apples … blue blueberries … red, white and blue flag.$\n

The regex correctly reports red(2) and blue(2), but it is slow when using the real, very large dictionary of interesting words.

Another approach I have tried is to use an SQLite database: dump tokens into it as they are parsed, including (column,position) information for each entry, and simply query the result at the end. Batch inserts with an appropriately sized in-memory buffer help a lot, but increase complexity.

I have not experimented with data parallelisation yet as I am not sure tokens/parsing are the right thing to do in the first place. Maybe a tree of letters would be more suitable?

I am interested in solutions using the following, in order of preference:

  • Bash/GNU CLI tools (esp. something parallelisable through GNU ‘parallel’ for CLI-only execution)
  • Python (NLP?)
  • C/C++

No Perl as I don’t understand it, unfortunately.

1 Answer

  1. Editorial Team
    Added an answer on June 5, 2026 at 3:37 pm

    I assume you can create/find an eml-to-text converter. Then this is fairly close to what you want:

    find . -type f | parallel --tag 'eml-to-text {} | grep -o -n -b -f /tmp/list_of_interesting_words'
    

    The output is not formatted 100% how you want it:

    filename \t line no : byte no (from start of file) : word
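
    If the position within the line is needed rather than the byte offset from the start of the file, the grep output can be post-processed. The following is only a sketch under several assumptions: GNU grep, an awk that counts bytes under LC_ALL=C, newline-terminated lines, a non-empty input file, and dictionary words that contain no ':'. The function name to_line_pos is made up for illustration, and -F/-w are used to match fixed whole words. The reported position is a 1-based byte column.

    ```shell
    #!/usr/bin/env bash
    # to_line_pos FILE WORDS  (hypothetical helper)
    # Prints "line<TAB>position<TAB>word" for every whole-word match in FILE,
    # where position is the 1-based byte column within that line.
    to_line_pos() {
      local file=$1 words=$2
      # grep -o -n -b prints "line:byte:word"; with -o the byte offset is that
      # of the match itself, counted from the start of the file (0-based).
      LC_ALL=C grep -F -w -o -n -b -f "$words" "$file" |
      LC_ALL=C awk -F: '
        # First input (the e-mail text): remember the byte offset at which
        # each line starts. The +1 accounts for the newline character.
        NR == FNR { start[FNR] = off; off += length($0) + 1; next }
        # Second input (grep output on stdin): column = byte - line start + 1.
        { printf "%d\t%d\t%s\n", $1, $2 - start[$1] + 1, $3 }
      ' "$file" -
    }
    ```

    Called as to_line_pos some_mail.txt /tmp/list_of_interesting_words, this reads the file twice (once for grep, once for the line offsets), which is a simplicity trade-off rather than a tuned design.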

    If you have many interesting words, the ‘-f’ in grep is slow to start up, so if you can create an unpacked version of your maildir, you can make parallel start grep fewer times (the directory /tmp/unpacked must exist before running this):

    find . -type f | parallel 'eml-to-text {} >/tmp/unpacked/{#}'
    find /tmp/unpacked -type f | parallel -X grep -H -o -n -b -f /tmp/list_of_interesting_words
    

    Since the time complexity of grep -f is worse than linear, you may want to chop up /tmp/list_of_interesting_words into smaller blocks:

    cat /tmp/list_of_interesting_words | parallel --pipe --block 10k --files > /tmp/blocks_of_words
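
    A similar chopping step can be sketched with coreutils split, cutting by number of lines rather than by bytes. This is an alternative illustration, not the method above: the helper name split_words, the output directory and the default block size of 1000 lines are all assumptions.

    ```shell
    #!/usr/bin/env bash
    # split_words WORDS OUTDIR [LINES]  (hypothetical helper)
    # Cuts WORDS into files of at most LINES lines each and prints the names
    # of the resulting files, one per line.
    split_words() {
      local words=$1 outdir=$2 lines=${3:-1000}
      mkdir -p "$outdir"
      # split names the pieces block.aa, block.ab, ... inside OUTDIR
      split -l "$lines" "$words" "$outdir/block."
      printf '%s\n' "$outdir"/block.*
    }
    ```

    For example, split_words /tmp/list_of_interesting_words /tmp/word_blocks > /tmp/blocks_of_words would write a list of block file names, one per line.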
    

    And then process the blocks and the files in parallel:

    find /tmp/unpacked -type f | parallel -j1 -I ,, parallel --arg-file-sep // -X grep -H -o -n -b -f ,, {} // - :::: /tmp/blocks_of_words
    

    This output is formatted like:

    filename : line no : byte no (from start of file) : word

    To have it grouped by word instead of filename pipe the result through sort:

    ... | sort -k4 -t: > index.by.word
    

    To count the frequency:

    ... | sort -k4 -t: | tee index.by.word | awk -F: '{print $4}' | uniq -c
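
    To turn the index into a summary of the most frequent words, the counts can be sorted in descending order. A minimal sketch, assuming an index file in the "filename:line:byte:word" layout described above and file names that contain no ':'; the function name top_words is made up for illustration.

    ```shell
    #!/usr/bin/env bash
    # top_words INDEX  (hypothetical helper)
    # Reads "filename:line:byte:word" lines and prints "word<TAB>count",
    # most frequent word first; ties are ordered alphabetically.
    top_words() {
      LC_ALL=C cut -d: -f4 "$1" |        # keep only the word column
        LC_ALL=C sort | uniq -c |        # count identical words
        LC_ALL=C sort -k1,1nr -k2,2 |    # highest count first
        awk '{ print $2 "\t" $1 }'       # drop the padding uniq -c adds
    }
    ```

    Running top_words index.by.word | head -n 20 would then list the twenty most frequent interesting words.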
    

    The good news is that this should be rather fast, and I doubt you will be able to achieve the same speed using Python.

    Edit:

    grep -F is way faster at starting, and you will want -w for grep (so the word ‘gram’ does not match ‘diagrams’); this will also avoid the temporary files and is probably reasonably fast:

    find . -type f | parallel --tag 'eml-to-text {} | grep -F -w -o -n -b -f /tmp/list_of_interesting_words' | sort -k3 -t: | tee index.by.word | awk -F: '{print $3}' | uniq -c
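
    The effect of -w can be checked on a single line of text. A small sketch, assuming GNU grep; the helper name count_matches is made up for illustration.

    ```shell
    #!/usr/bin/env bash
    # count_matches WORD TEXT [extra grep options...]  (hypothetical helper)
    # Prints how many times the fixed string WORD is matched in TEXT.
    count_matches() {
      local word=$1 text=$2
      shift 2
      # -o prints each match on its own line, so wc -l counts the matches
      printf '%s\n' "$text" | grep -F -o "$@" -e "$word" | wc -l | tr -d ' '
    }
    ```

    With the text 'two diagrams and one gram', count_matches gram 'two diagrams and one gram' prints 2, because the ‘gram’ inside ‘diagrams’ is matched as well, while count_matches gram 'two diagrams and one gram' -w prints 1.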
    