I’m trying to make an app, in C# .NET 4.0 and WPF, that indexes:
- File Names (“taskmgr.exe”)
- File Descriptions (“Windows Task Manager”)
- Absolute Parent Directory (“C:\Windows\System32”)
on all the harddrives of a user’s computer.
I’m not indexing the contents of the files – just the file names/paths/descriptions. Also note that I’m only indexing practical files of extensions .DOC, .MP3, .EXE, .CS, .CPP. I won’t be indexing extension-less files, custom extensions, DLLs, or others of that sort.
I’m completely new to Lucene, and I’ve read a couple of beginner articles on how to design the document/index structure.
I was thinking that each file could be its own Lucene document, with the three fields listed above stored as three key-value pairs. Is this recommended? Is going with Lucene the right choice? Would searching for a file name be real-time (that is, could the result list expand and contract dynamically, without much lag, across all the possible file names and paths)?
If any statistics are needed, my harddrive is 450 GB, and I have 681,014 total (all extensions) files and 165,732 folders.
It’s all the same to Lucene; the question is: what would your users want? If they search for “task”, should it match any file which has it in either the name, description or parent directory? If so, then this should probably be all one field.
Will your users want to be more specific (e.g. filename:task)? If so, then you will need separate fields.
As an aside: you probably want to use Solr. It’s easier to set up, and prevents some common pitfalls.
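To make the one-field versus separate-fields choice concrete, here is a minimal sketch that does both: each file becomes one document with three separate fields, plus a combined catch-all field. It assumes Lucene.Net 3.0.3 (which targets .NET 4.0); the field names ("filename", "description", "directory", "all") and the helper BuildDocument are made up for illustration, and the in-memory RAMDirectory is only there to keep the example short. Treat it as a starting point rather than a recommended design.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class FileIndexSketch
{
    // Hypothetical helper: one document per file, with three separate
    // fields and one combined "all" field for unqualified searches.
    static Document BuildDocument(string fileName, string description, string directory)
    {
        var doc = new Document();
        doc.Add(new Field("filename", fileName, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("description", description, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("directory", directory, Field.Store.YES, Field.Index.ANALYZED));
        // The catch-all field is searchable but not stored, to avoid duplicating data.
        doc.Add(new Field("all", fileName + " " + description + " " + directory,
                          Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }

    static void Main()
    {
        // Assumption: an in-memory index for brevity; a real app would
        // more likely use an on-disk directory.
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        using (var writer = new IndexWriter(directory, analyzer,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            writer.AddDocument(BuildDocument(
                "taskmgr.exe", "Windows Task Manager", @"C:\Windows\System32"));
        }

        // Queries without a field prefix go to the "all" field;
        // "filename:..." restricts the search to the file name only.
        var parser = new QueryParser(Version.LUCENE_30, "all", analyzer);

        using (var searcher = new IndexSearcher(directory, true))
        {
            foreach (var text in new[] { "task", "filename:task*" })
            {
                TopDocs hits = searcher.Search(parser.Parse(text), 10);
                Console.WriteLine("{0}: {1} hit(s)", text, hits.TotalHits);
                foreach (ScoreDoc hit in hits.ScoreDocs)
                {
                    Console.WriteLine("  " + searcher.Doc(hit.Doc).Get("filename"));
                }
            }
        }
    }
}
```

The second query uses a trailing wildcard because the analyzer may keep a name like taskmgr.exe as a single token, in which case a bare filename:task would not match it. For search-as-you-type over file names, prefix queries of this kind are one option worth measuring on your own data.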