I’m currently working on a ‘simple’ photo system with MongoDB, using a replica set and GridFS.
The principle is simple: I store a lot of photos in GridFS, the client knows the filename, and I retrieve the file from that filename.
Does GridFS index the filename? I hope so, but I couldn’t find it written down in any official doc.
My stats are :
{
"ns" : "photos.socialphotos.files",
"count" : 758086,
"size" : 168295128,
"avgObjSize" : 222.00004748801587,
"storageSize" : 220647424,
"numExtents" : 15,
"nindexes" : 2,
"lastExtentSize" : 43311104,
"paddingFactor" : 1,
"flags" : 1,
"totalIndexSize" : 125084624,
"indexSizes" : {
"_id_" : 22925504,
"filename_1_uploadDate_1" : 102159120
},
"ok" : 1
}
EDIT: after running reIndex() on the collections, I reclaimed 30 GB, but it’s still way too high.
My indexes are :
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "photos.socialphotos.files",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"filename" : 1,
"uploadDate" : 1
},
"ns" : "photos.socialphotos.files",
"name" : "filename_1_uploadDate_1"
}
Keys per index:
"keysPerIndex" : {
"photos.socialphotos.files.$_id_" : 758086,
"photos.socialphotos.files.$filename_1_uploadDate_1" : 758086
}
I never use the _id_ index, as I don’t store the _id anywhere. Is it OK to remove it?
The total index size is 125084624 bytes. Does that mean I should have almost all my photos in RAM? That seems a bit strange.
Additional questions:
- Statistics: mongostat covers the basics. Is there another good tool for monitoring, or do I have to create my own?
- Faults: I see a LOT of faults (around 100 per second) when I’m doing lots of inserts, and nothing on the console. Where should I investigate?
- Connection pool with Java/Tomcat: I’m using a simple Tomcat webapp that connects to MongoDB. Would you recommend opening a new connection to MongoDB for each request (I guess not), keeping a reference to the Mongo object as a singleton (with a Holder, for example), or using a good pool? I didn’t find a standard one.
Thank you very much !
To address your questions:
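1) When you initialize a GridFS collection using the Java driver, that driver will automatically create indexes on the .files and the .chunks collections. The “filename_1_uploadDate_1” index in your output is the one it created on .files, so lookups by filename are indexed.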
2) MongoDB requires that you have an ‘_id’ field and a unique ‘_id’ index, so you cannot remove it. The default ‘_id’ is only 12 bytes long, so there’s really no significant overhead from having it present.
Reference: http://www.mongodb.org/display/DOCS/Object+IDs
3) The stats on the “filename_1_uploadDate_1” index only indicate the size of the index. This index contains only the contents of the filename and uploadDate fields; it does not contain any of the photo data itself. You want to have the active portion of the index fit in RAM for performance reasons.
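To make the point about index size concrete, here is a rough back-of-the-envelope calculation using only the numbers from your stats output. The class and method names are made up for illustration, and dividing index size by key count gives only an average that includes B-tree overhead, not an exact per-key size.

```java
// Rough per-entry cost of each index, using the figures from the collection stats.
final class IndexSizeEstimate {

    /** Average bytes per index entry: total index size divided by number of keys. */
    static double bytesPerEntry(long indexSizeBytes, long keyCount) {
        return (double) indexSizeBytes / keyCount;
    }

    public static void main(String[] args) {
        long keyCount = 758_086L;               // keysPerIndex (same for both indexes)
        long idIndexBytes = 22_925_504L;        // indexSizes._id_
        long filenameIndexBytes = 102_159_120L; // indexSizes.filename_1_uploadDate_1

        System.out.printf("_id_: %.1f bytes per entry%n",
                bytesPerEntry(idIndexBytes, keyCount));
        System.out.printf("filename_1_uploadDate_1: %.1f bytes per entry%n",
                bytesPerEntry(filenameIndexBytes, keyCount));
    }
}
```

That works out to roughly 30 bytes per entry for the _id_ index and roughly 135 bytes per entry for the filename index, which is in line with an index that stores a filename string and a date per file rather than any image data.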
4) If you want to have advanced statistics and monitoring, enroll your system in the free MMS monitoring system provided by 10gen. For more information, start here: https://mms.10gen.com/help/
5) Page faults are normal when loading in new data. MongoDB uses memory-mapped files, so every time you write to a new location within the data file, the OS will need to fault in that page.
For more information about memory mapped files, look here: http://docs.mongodb.org/manual/faq/storage/
6) The MongoDB Java driver provides its own connection pool. Unless you’re doing a really high-performance application, you’re probably best off using the Mongo object as a singleton.
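Since you mentioned a Holder, here is a minimal sketch of the initialization-on-demand holder idiom for sharing one client across all Tomcat request threads. To keep it runnable without the driver on the classpath, `PooledClient` is a hypothetical stand-in for the driver's Mongo object, and `ClientHolder` is likewise a name chosen for this example; in your webapp you would construct the real Mongo object in its place.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Gives every caller the same client instance, created lazily on first use.
final class ClientHolder {
    private ClientHolder() {}

    // The JVM initializes this nested class once, thread-safely, on first access,
    // so no explicit locking is needed.
    private static final class Holder {
        static final PooledClient INSTANCE = new PooledClient();
    }

    static PooledClient get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        System.out.println(ClientHolder.get() == ClientHolder.get()); // prints true
    }
}

// Hypothetical stand-in for the driver's Mongo object (not part of the MongoDB driver).
final class PooledClient {
    // Counts constructions so the single-instance behavior can be checked.
    static final AtomicInteger CREATED = new AtomicInteger();

    PooledClient() {
        CREATED.incrementAndGet();
    }
}
```

Each servlet or request handler would then call `ClientHolder.get()` instead of opening its own connection, leaving the pooling to the driver.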