A directory D contains a few thousand e-mails in the .eml format. Some e-mails are plain text, others come from Outlook, others have an ASCII header and HTML/MIME content, and so on. There is a dictionary file F containing a list of interesting words, one per line (e.g. red\nblue\ngreen\n…), to look for in the files underneath the D directory. The D directory has a large number of subfolders but no files other than the above-mentioned .eml files. A list of top recurring words should be made with these specifications:
- For every interesting word, information should be provided on how many times it occurs and where. If it occurs multiple times within a file, it should be reported multiple times for that file. Reporting an occurrence means reporting a tuple (L,P) of integers, where L is the line number from the top of the e-mail source and P is the position, within that line, of the start of the occurrence.
This would build both an index to refer to the different occurrences and a summary of the most frequently occurring interesting words.
The output should go to a single output file. The format is not strictly defined, provided the information above is included: interesting words, number of times each interesting word occurs and where it does -> file/line/start-position.
This is not a homework exercise but a real text analysis I would like to run on a fairly large dataset. The challenge I am having is choosing the right tool for filtering efficiently. An iterative approach (the Cartesian product of words/emails/etc.) is too slow, so it would be desirable to check for multiple words at once on each line of each file.
I have experimented with building a regex of alternatives from the list of interesting words, w1|w2|w3|…, compiling it and running it over each line of each e-mail, but it’s still slow, especially when I need to check for multiple occurrences within a single line.
Example:
E-mail E has a line containing the text:
^ … blah … red apples … blue blueberries … red, white and blue flag.$\n
The regex correctly reports red(2) and blue(2), but it’s slow when using the real, very large dictionary of interesting words.
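For reference, the regex approach described above can be sketched in Python roughly as follows. This is only an illustration, not the code actually used: the function names are made up, word boundaries (`\b`) are assumed so that "blueberries" does not count as "blue", L is taken to be 1-based and P 0-based.

```python
import re
from collections import Counter

def compile_word_pattern(words):
    # One alternation of all interesting words, wrapped in word boundaries.
    # Longest-first ordering lets a longer word win over its own prefix.
    ordered = sorted(set(words), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b")

def find_occurrences(lines, pattern):
    """Yield (word, L, P): 1-based line number L, 0-based column P (assumed conventions)."""
    for line_no, line in enumerate(lines, start=1):
        # finditer reports every non-overlapping match on the line
        for match in pattern.finditer(line):
            yield match.group(0), line_no, match.start()

pattern = compile_word_pattern(["red", "blue", "green"])
sample = ["... blah ... red apples ... blue blueberries ... red, white and blue flag."]
hits = list(find_occurrences(sample, pattern))
counts = Counter(word for word, _, _ in hits)
print(counts)
```

With a very large dictionary the alternation itself becomes the bottleneck, which matches the slowdown described here.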
Another approach I have tried is to use an SQLite database to dump tokens into as they are parsed, including (column,position) information for each entry, and then just query the result at the end. Batch inserts with an appropriate in-memory buffer help a lot, but they increase complexity.
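A minimal sketch of the buffered-insert idea, using Python's standard `sqlite3` module. The table layout (`hits` with word, file, line, pos) and the function names are assumptions made for illustration, not the schema actually used.

```python
import sqlite3

def store_hits(conn, hits, batch_size=10_000):
    """Insert (word, file, line, pos) tuples in batches instead of one by one."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(word TEXT, file TEXT, line INTEGER, pos INTEGER)"
    )
    buffer = []
    for row in hits:
        buffer.append(row)
        if len(buffer) >= batch_size:
            # Flush the in-memory buffer with a single executemany call
            conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", buffer)
            buffer.clear()
    if buffer:
        conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", buffer)
    conn.commit()

def top_words(conn, limit=10):
    """Return (word, count) pairs, most frequent first."""
    return conn.execute(
        "SELECT word, COUNT(*) AS n FROM hits "
        "GROUP BY word ORDER BY n DESC, word LIMIT ?",
        (limit,),
    ).fetchall()

conn = sqlite3.connect(":memory:")  # a file path would be used for real data
store_hits(conn, [("red", "a.eml", 1, 5), ("blue", "a.eml", 1, 20), ("red", "b.eml", 3, 0)], batch_size=2)
print(top_words(conn))
```

The extra complexity mentioned above is mostly the buffer bookkeeping; the final query stays a plain `GROUP BY`.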
I have not experimented with data parallelisation yet as I am not sure tokens/parsing are the right thing to do in the first place. Maybe a tree of letters would be more suitable?
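To make the "tree of letters" idea concrete, here is a rough sketch of a trie built from the dictionary and walked from each word start in a line. It is an illustration under assumptions (whole-word matches only, 0-based positions, hypothetical function names); a full Aho-Corasick automaton would avoid restarting the walk at every position.

```python
def build_trie(words):
    """Build a nested-dict trie; the empty-string key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = word
    return root

def scan_line(line, trie):
    """Return (word, P) for every whole-word dictionary hit; P is 0-based."""
    hits = []
    n = len(line)
    for start in range(n):
        # Only start a walk at the beginning of a word
        if start > 0 and line[start - 1].isalnum():
            continue
        node = trie
        i = start
        while i < n and line[i] in node:
            node = node[line[i]]
            i += 1
            # Accept only if the dictionary word ends where the text word ends
            if "" in node and (i == n or not line[i].isalnum()):
                hits.append((node[""], start))
    return hits

trie = build_trie(["red", "blue", "green", "gram"])
print(scan_line("red apples, blue blueberries, red", trie))
```

The work per line depends on the line length and the longest dictionary word rather than on the number of dictionary words, which is the property the regex alternation lacks.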
I am interested in solutions in, in order of preference:
- Bash/GNU CLI tools (esp. something parallelisable through GNU ‘parallel’ for CLI-only execution)
- Python (NLP?)
- C/C++
No Perl as I don’t understand it, unfortunately.
I assume you can create/find an eml-to-text converter. Then this is fairly close to what you want:
The output is not formatted 100% how you want it:
filename \t line no : byte no (from start of file) : word
If you have many interesting words, the ‘-f’ in grep is slow to start up, so if you can create an unpacked version of your maildir you can make parallel start grep fewer times:
Since the time complexity of grep -f is worse than linear, you may want to chop up /tmp/list_of_interesting_words into smaller blocks:
And then process the blocks and the files in parallel:
This output is formatted like:
filename : line no : byte no (from start of file) : word
To have it grouped by word instead of filename, pipe the result through sort:
To count the frequency:
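As a rough Python equivalent of the grouping and counting steps, the sketch below reads result lines and builds both the per-word index and the frequency table. It assumes each line has exactly the colon-separated form `filename:line:byte:word` with no surrounding spaces and no colon inside the word; the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict

def summarise(lines):
    """Group 'filename:line:byte:word' records by word and count them."""
    index = defaultdict(list)
    for raw in lines:
        # rsplit from the right so a colon inside the filename is tolerated
        filename, line_no, byte_no, word = raw.rstrip("\n").rsplit(":", 3)
        index[word].append((filename, int(line_no), int(byte_no)))
    counts = Counter({word: len(locations) for word, locations in index.items()})
    return index, counts

sample_output = [
    "mail/a.eml:3:120:red\n",
    "mail/a.eml:3:141:blue\n",
    "mail/b:1.eml:7:15:red\n",
]
index, counts = summarise(sample_output)
for word, n in counts.most_common():
    print(word, n, index[word])
```

For very large result sets, streaming the lines from a file object instead of holding a list keeps memory bounded by the size of the index.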
The good news is that this should be rather fast, and I doubt you will be able to achieve the same speed using Python.
Edit:
grep -F is way faster at starting, and you will want -w for grep (so the word ‘gram’ does not match ‘diagrams’); this will also avoid the temporary files and is probably reasonably fast: