A customer needs a document management system and I'm gathering information about this.
I know about SharePoint and Alfresco, but in this case I'm evaluating what it would take to build one from scratch, so please refrain from suggesting either of them (we are evaluating them separately; this is all about developing, not deploying an existing solution).
These are the requirements:
- There is a very specific requirement for the legal management of documents that is particular to our local government, but apart from this:
- Operation similar to Google Docs from the end user's point of view
- Needs to store data for 200+ end users (UPDATE: it is really 700+ end users)
- Mainly Office documents, PDF, and text. I already have plain-text extraction from these binary files working.
- No wiki, no portal creation, barely any workflow (and only a very simple one); it is just file management
- Central repository, shared across the company, integrated with Active Directory
- Fast searching
- Transparent desktop integration
- Web interface
- Multiplatform, if possible
So, these are the things I have off the top of my head:
- Storage: I know that SharePoint stores everything in the DB (Alfresco too?). That is a nightmare, IMHO. I prefer to put the metadata in a DB and the files on disk.
I'm thinking about requiring ZFS in this case and leveraging its capabilities for versioning, snapshots, and scaling. Or maybe using git as the storage backend (would git work well?)
So, where can I learn more about how to handle a large pool of documents, on ZFS or any regular file system? For example, how to lay out the folder structure for easy management, fast responses, easy backup, etc.
- Metadata: I'm thinking of a regular DB here, but I wonder whether there is more merit in storing everything in Lucene (I have some experience with Lucene, but I worry because Lucene can't be federated, right?).
If I use a search engine as the metadata database I can save some work (no need for a second pass for indexing), but a regular database engine is more standard.
- Tech: I will probably build this with Django, PyLucene, and PostgreSQL, and do the shell integration for Windows (I have no problem doing that).
I would appreciate any hints or information on how to implement this solution properly.
Personally, I find the “similar to Google Docs” and “Transparent desktop integration” requirements a bit vague. But judging from the question, you are more concerned about the backend and document storage, and are leaning toward an open source stack (with AD integration)?
Anyway, I'm using KnowledgeTree as our document management system. In their implementation, all files reside in a directory on the file system, and the database keeps track of the path, the corresponding metadata, access logs, and versioning information. They basically keep several versions of the same file when a document is updated, which I think is a fair enough idea implementation-wise, considering Microsoft Office documents were mostly binary (up through Office 2003).
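To make that layout concrete, here is a minimal sketch of the "files on disk, metadata in a database" idea. It is not KnowledgeTree's actual schema: the table names, column names, and the `open_db`, `create_document`, `add_version`, and `latest_path` helpers are all my own assumptions, and I use SQLite only so the example is self-contained (you mentioned PostgreSQL, where the same tables would work).

```python
import hashlib
import sqlite3
from pathlib import Path

# Hypothetical schema: one row per document, one row per stored version.
SCHEMA = """
CREATE TABLE IF NOT EXISTS document (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS document_version (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES document(id),
    version INTEGER NOT NULL,
    path TEXT NOT NULL,      -- location relative to the storage root
    sha256 TEXT NOT NULL,    -- checksum, useful for integrity checks
    UNIQUE (document_id, version)
);
"""


def open_db(db_path=":memory:"):
    """Open the metadata database and create the tables if needed."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def create_document(conn, title):
    """Register a new document and return its id."""
    cur = conn.execute("INSERT INTO document (title) VALUES (?)", (title,))
    conn.commit()
    return cur.lastrowid


def add_version(conn, root, document_id, data):
    """Write `data` as a new file under `root` and record it in the DB."""
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM document_version"
        " WHERE document_id = ?",
        (document_id,),
    ).fetchone()
    version = row[0] + 1
    rel_path = Path(str(document_id)) / f"v{version}.bin"
    full_path = Path(root) / rel_path
    full_path.parent.mkdir(parents=True, exist_ok=True)
    full_path.write_bytes(data)  # older versions are never overwritten
    conn.execute(
        "INSERT INTO document_version (document_id, version, path, sha256)"
        " VALUES (?, ?, ?, ?)",
        (document_id, version, rel_path.as_posix(),
         hashlib.sha256(data).hexdigest()),
    )
    conn.commit()
    return version


def latest_path(conn, document_id):
    """Return the relative path of the newest version, or None."""
    row = conn.execute(
        "SELECT path FROM document_version WHERE document_id = ?"
        " ORDER BY version DESC LIMIT 1",
        (document_id,),
    ).fetchone()
    return row[0] if row else None
```

Every update becomes a new file plus a new row, so rolling back is just a matter of pointing at an older row. A real system would also need access logs and locking around the version counter, which I have left out.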
You may want to find out how many documents they currently have and how many they expect to flow into the system on a daily basis. (Or, from a different point of view, the kinds of documents they plan to store will generally give you hints about the load your server is supposed to handle.)
My guess is that you could most likely get away with a setup of local file systems plus a database storing the metadata, unless you are sure the system is expected to handle a massive load of documents on a daily basis (imagine being Flickr for documents 😉 ).
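On your folder-layout question: one approach I would consider (my own suggestion, not something I know KnowledgeTree to do) is to name each stored file by a hash of its content and spread the files over nested subdirectories, so that no single directory grows huge. A rough sketch, where `sharded_path` and its `levels` and `width` parameters are hypothetical names:

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(data: bytes, levels: int = 2, width: int = 2) -> PurePosixPath:
    """Build a relative path like 'ab/cd/abcd...' from the SHA-256 of `data`.

    The first `levels` chunks of `width` hex characters become directory
    names; the full digest is used as the file name.
    """
    digest = hashlib.sha256(data).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(*parts, digest)
```

With the defaults this gives 256 x 256 = 65,536 leaf directories, and identical content always maps to the same path, so duplicates are stored once. The trade-off is that the paths mean nothing to a human, so the database has to carry the real file names.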