I’m trying to search across a large collection of text files (12k+) in Mathematica 8. So far, I’ve been able to plot the total number of times that a word appears (e.g. the word “love” appears 5,000 times across those 12k files). However, I’m running into difficulty determining the number of files in which “love” appears once, which might be only 1,000 files, with the word repeating several times in others.
I’m finding the documentation on FindList, streams, RecordSeparators, etc. a bit murky. Is there a way to set it up so it finds an incidence of a term once in a file and then moves on to the next?
Example of filelist:
{"89001.txt", "89002.txt", "89003.txt", "89004.txt", "89005.txt", "89006.txt", "89007.txt", "89008.txt", "89009.txt", "89010.txt", "89011.txt", "89012.txt", "89013.txt", "89014.txt", "89015.txt", "89016.txt", "89017.txt", "89018.txt", "89019.txt", "89020.txt", "89021.txt", "89022.txt", "89023.txt", "89024.txt"}
The following returns all of the lines containing “love” across every file. Is there a way to return only the first incidence of “love” in each file before moving on to the next one?
FindList[filelist, "love"]
Thanks so much. This is my first post and I’m largely learning Mathematica through peer/supervisory help, online tutorials, and the documentation.
In addition to Daniel’s answer, you also seem to be asking for a list of files where the word only occurs once. To do that, I’d continue to run FindList across all the files. Then, reduce the results to single lines only.
But, this doesn’t eliminate the cases where there is more than one occurrence in a single line. To do that, you could use StringCount and only accept instances where it is 1. The RegularExpression passed to StringCount specifies that “love” must be a distinct word using the word boundary marker (\\b), so that words like “lovely” won’t be included.
Edit: It appears that FindList, when passed a list of files, returns a flattened list, so you can’t determine which item goes with which file. For instance, if you have 3 files, and they contain the word “love” 0, 1, and 2 times, respectively, you’d get a single flat list of matching lines with nothing to mark where one file’s results end and the next file’s begin, which is clearly not useful. To overcome this, you’ll have to process each file individually, and that is best done via Map (/@), after which the rest of the above code works as expected.
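The per-file approach described above could be sketched roughly as follows. This is a minimal sketch, not the original code: it assumes filelist is the list of file-name strings from the question, and the names word, perFile, single, and once are made up for illustration.

```mathematica
(* Assumption: filelist is a list of file-name strings, as in the question *)

(* "love" as a whole word; \\b is the regex word-boundary marker *)
word = RegularExpression["\\blove\\b"];

(* Run FindList on each file separately, giving one sublist of lines per file *)
perFile = FindList[#, "love"] & /@ filelist;

(* Keep only the files that produced exactly one matching line *)
single = Select[perFile, Length[#] == 1 &];

(* Of those, keep the ones whose single line has the whole word exactly once *)
once = Select[single, StringCount[First[#], word] == 1 &];

(* Number of files that passed both filters *)
Length[once]
```

Note that FindList here matches the plain substring "love", so a file with one line containing "love" and another containing "lovely" produces two lines and is dropped by the first filter. If that matters, the WordSearch option of FindList may be worth a look.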
But, if you want to associate the results with a file name, you have to change it a little: pair each file name with the FindList results for that file, which returns a list of {file name, matching lines} pairs.
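A minimal sketch of pairing each file name with its results, assuming filelist holds the file names from the question. The variable name lines matches the one used in the text; the exact original code is not known, so treat this as one possible way to write it.

```mathematica
(* Assumption: filelist is a list of file-name strings *)

(* Each datum is {file name, list of matching lines} *)
lines = {#, FindList[#, "love"]} & /@ filelist;

(* Keep data whose second element holds exactly one matching line *)
lines = Select[lines, Length[#[[2]]] == 1 &];

(* Keep data where that line contains the whole word exactly once *)
lines = Select[lines,
   StringCount[First[#[[2]]], RegularExpression["\\blove\\b"]] == 1 &];

(* Result has the shape {{"name1.txt", {"matching line"}}, {"name2.txt", {"matching line"}}, ...} *)
```

Here #[[2]] picks out the list of matching lines in each datum, so both Select steps test the results while leaving the file name attached.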
To extract the file names, you simply type lines[[All, 1]]. Note, in order to Select on the properties you wanted, I used Part ([[ ]]) to specify the second element in each datum, and the same goes for extracting the file names.
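To illustrate how Part picks elements out of such a list, here is a small example on made-up data. The file names and line contents below are hypothetical and exist only to show the indexing.

```mathematica
(* Hypothetical toy data in the {file name, matching lines} form *)
lines = {{"a.txt", {"I love tea"}}, {"b.txt", {"love is all"}}};

(* First element of every datum: the file names *)
lines[[All, 1]]   (* {"a.txt", "b.txt"} *)

(* Second element of the first datum: its matching lines *)
lines[[1, 2]]     (* {"I love tea"} *)

(* Number of files that survived the filtering *)
Length[lines]     (* 2 *)
```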