A directory D contains a few thousand e-mails in the .eml format. Some e-mails


Asked: June 5, 2026 (2026-06-05T15:37:56+00:00)


A directory D contains a few thousand e-mails in the .eml format. Some e-mails are plain text, others come from Outlook, others have an ASCII header and HTML/MIME content, and so on. A dictionary file F contains a list of interesting words, one per line (e.g. red\nblue\ngreen\n…), to look for in the files underneath the D directory. The D directory has a large number of subfolders but no files other than the above-mentioned .eml files. A list of top recurring words should be made with these specifications:

For every interesting word, information should be provided on how many times it occurs and where. If it occurs multiple times within a file, it should be reported multiple times for that file. Reporting an occurrence means reporting a tuple (L,P) of integers, where L is the line number from the top of the e-mail source and P is the position, within that line, of the start of the occurrence.

This would build both an index to refer to the different occurrences and a summary of the most frequently occurring interesting words.

The output should be on a single output file and the format is not strictly defined, provided the information above is included: interesting words, number of times each interesting word occurs and where it does -> file/line/start-position.

This is not a homework exercise but actual text analysis I would like to make of a fairly large dataset. The challenge I am having is that of choosing the right tool for filtering efficiently. An iterative approach, Cartesian product of words/emails/etc, is too slow and it would be desirable to combine multiple word filtering for each line of each file.

I have experimented building a regex of alternatives from the list of interesting words, w1|w2|w3|…, compiling that and running it through each line of each e-mail but it’s still slow, especially when I need to check multiple occurrences within a single line.

Example:

E-mail E has a line containing the text:

^ … blah … red apples … blue blueberries … red, white and blue flag.$\n

the regex correctly reports red(2) and blue(2) but it’s slow when using the real, very large dictionary of interesting words.

Another approach I have tried is:

use a Sqlite database to dump tokens to as they are parsed, including (column,position) information for each entry, and just querying the output at the end. Batch inserts help a lot, with the appropriate in-memory buffer, but increase complexity.

I have not experimented with data parallelisation yet as I am not sure tokens/parsing are the right thing to do in the first place. Maybe a tree of letters would be more suitable?

I am interested in solutions in, in order of preference:

Bash/GNU CLI tools (esp. something parallelisable through GNU ‘parallel’ for CLI-only execution)
Python (NLP?)
C/C++

No Perl as I don’t understand it, unfortunately.


1 Answer

Editorial Team · Answer 1 · 2026-06-05T15:37:58+00:00

I assume you can create/find an eml-to-text converter. Then this is fairly close to what you want:

find . -type f | parallel --tag 'eml-to-text {} | grep -o -n -b -f /tmp/list_of_interesting_words'

The output is not formatted 100% how you want it:

filename \t line no : byte no (from start of file) : word
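Note that -b gives the byte offset from the start of the file, while the question asks for the position within the line. One possible way to convert, as a rough sketch only: let awk record the byte offset at which every line starts and subtract it. The helper name to_line_col is made up for illustration; the sketch assumes GNU grep, interesting words that contain no ':' and columns counted in bytes (hence LC_ALL=C).

```shell
# Sketch: turn grep's "line:byte-offset:word" into "line:column:word".
# to_line_col is a hypothetical helper name, not part of grep or parallel.
# Assumptions: GNU grep, no ':' inside the words, columns counted in bytes.
to_line_col() {  # usage: to_line_col TEXT_FILE WORD_LIST
  LC_ALL=C grep -F -w -o -n -b -f "$2" "$1" |
  LC_ALL=C awk -F: '
    # First input (the text file): remember the byte offset where each line starts.
    NR == FNR { start[FNR] = off; off += length($0) + 1; next }
    # Second input (grep output on stdin): column = offset - line start + 1.
    { print $1 ":" ($2 - start[$1] + 1) ":" $3 }
  ' "$1" -
}

# Small demonstration on throw-away data.
tmp=$(mktemp -d)
printf 'red\nblue\n' > "$tmp/words"
printf 'x red\nblue blueberries red\n' > "$tmp/mail.txt"
to_line_col "$tmp/mail.txt" "$tmp/words"
rm -rf "$tmp"
```

The columns are 1-based. If the e-mails contain multi-byte characters, byte columns and character columns will differ.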

If you have many interesting words the ‘-f’ in grep is slow to start up, so if you can create an unpacked version of your maildir you can make parallel start grep fewer times:

find . -type f | parallel 'eml-to-text {} >/tmp/unpacked/{#}'
find /tmp/unpacked -type f | parallel -X grep -H -o -n -b -f /tmp/list_of_interesting_words

Since the time complexity of grep -f is worse than linear, you may want to chop up /tmp/list_of_interesting_words into smaller blocks:

cat /tmp/list_of_interesting_words | parallel --pipe --block 10k --files > /tmp/blocks_of_words

And then process the blocks and the files in parallel:

find /tmp/unpacked -type f | parallel -j1 -I ,, parallel --arg-file-sep // -X grep -H -o -n -b -f ,, {} // - :::: /tmp/blocks_of_words
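If the nested parallel call is hard to follow, the same block idea can be sketched serially with coreutils split and a plain loop. This is only an illustration of the chopping step, not a replacement for the parallel version: grep_in_blocks is a made-up helper name, the block size is whatever you pass in, and -F -w are added on the assumption that the words are fixed strings to be matched as whole words.

```shell
# Sketch: split the word list into blocks and run one grep per block.
# grep_in_blocks is a hypothetical helper name; it runs serially.
grep_in_blocks() {  # usage: grep_in_blocks WORD_LIST LINES_PER_BLOCK FILE...
  words=$1; per_block=$2; shift 2
  blocks=$(mktemp -d)
  # One file per block: block.aa, block.ab, ...
  split -l "$per_block" "$words" "$blocks/block."
  for b in "$blocks"/block.*; do
    [ -e "$b" ] || continue          # empty word list: nothing to do
    grep -F -w -H -o -n -b -f "$b" "$@"
  done
  rm -rf "$blocks"
}

# Small demonstration on throw-away data.
tmp=$(mktemp -d)
printf 'red\nblue\ngreen\n' > "$tmp/words"
printf 'red green\n' > "$tmp/a.txt"
printf 'blue red\n' > "$tmp/b.txt"
grep_in_blocks "$tmp/words" 2 "$tmp/a.txt" "$tmp/b.txt"
rm -rf "$tmp"
```

The matches come out grouped by block rather than by file, so a sort is still needed afterwards.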

This output is formatted like:

filename : line no : byte no (from start of file) : word

To have it grouped by word instead of filename pipe the result through sort:

... | sort -k4 -t: > index.by.word

To count the frequency:

... | sort -k4 -t: | tee index.by.word | awk -F: '{print $4}' | uniq -c
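The uniq -c output is ordered by word, not by frequency. For a "top recurring words" summary you can sort the counts once more. A minimal sketch, assuming the index has the filename:line:byte:word layout shown above and that filenames contain no ':'; top_words is a made-up helper name.

```shell
# Sketch: list the most frequent words from an index file.
# top_words is a hypothetical helper name.
# Assumption: index lines look like filename:line:byte:word, no ':' in filenames.
top_words() {  # usage: top_words INDEX_FILE [N]   (N defaults to 10)
  cut -d: -f4 "$1" | sort | uniq -c |
    sort -k1,1nr -k2,2 | head -n "${2:-10}" |
    awk '{ print $1, $2 }'           # strip the padding uniq -c adds
}

# Small demonstration on a throw-away index.
tmp=$(mktemp -d)
cat > "$tmp/index.by.word" <<'EOF'
a.txt:2:6:blue
b.txt:7:90:blue
b.txt:3:40:green
a.txt:1:2:red
a.txt:2:23:red
b.txt:1:0:red
EOF
top_words "$tmp/index.by.word" 2
rm -rf "$tmp"
```

Each output line is the count followed by the word, most frequent first.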

The good news is that this should be rather fast, and I doubt you will be able to achieve the same speed using Python.

Edit:

grep -F is way faster at starting, and you will want -w for grep (so the word ‘gram’ does not match ‘diagrams’); this will also avoid the temporary files and is probably reasonably fast:

find . -type f | parallel --tag 'eml-to-text {} | grep -F -w -o -n -b -f /tmp/list_of_interesting_words' | sort -k3 -t: | tee index.by.word | awk -F: '{print $3}' | uniq -c
