I created a table to insert all the documents of my application. It is a simple table (let’s call it DOC_DATA) that has 3 fields: DOC_ID, FileSize, Data. Data is varbinary(max).
I then have many tables (CUSTOMERS_DOCUMENTS, EMPLOYEES_DOCUMENTS, …) that contain other data (like “document description”, “Created by”, “Customer ID” …). My real case is not exactly like this, but the example makes the question easier to explain. All these tables have a FK to DOC_DATA.DOC_ID.
When the user searches for a customer document he will run a query similar to this:
select CD.*, DD.FileSize
from DOC_DATA DD
join CUSTOMERS_DOCUMENTS CD ON CD.DOC_ID = DD.DOC_ID
My question is: will the performance of this query be bad because it also reads a field from a table that is potentially huge (the DOC_DATA table can contain many GB of data), or is this not a problem?
The alternative solution is to put the FileSize field in all the main tables (CUSTOMERS_DOCUMENTS, EMPLOYEES_DOCUMENTS, …). Of course a join has a small impact on performance. I am not asking whether to join or not in general, but whether to join a HUGE table when I am not interested in its HUGE fields.
Please note: I am not designing a new system, I am maintaining a legacy system, so here I am not discussing which is the best design in general, but just which is the best option in this case.
I see no reason why the performance of your query would suffer due to the presence of those large columns. Performance issues would come up when you read that data, specifically when you require the database engine to return the document itself, but you are (of course) not doing so in this query.
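If you want to confirm this on your own data rather than take my word for it, one way is to turn on I/O statistics and run the query. This is only a sketch using the table names from the question; the figures appear in the Messages tab in SSMS.

```sql
-- Report page reads per table for the statements that follow.
SET STATISTICS IO ON;

-- The query from the question: FileSize is selected, Data is not.
SELECT CD.*, DD.FileSize
FROM DOC_DATA DD
JOIN CUSTOMERS_DOCUMENTS CD ON CD.DOC_ID = DD.DOC_ID;

-- Stop reporting I/O statistics for this session.
SET STATISTICS IO OFF;
```

In the output line for DOC_DATA, the "lob logical reads" and "lob physical reads" counters show how many large-object pages were touched. If the Data column is not being read, I would expect those counters to stay at zero, while "logical reads" covers the ordinary in-row pages.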
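Internally, for the various (max) data types (varchar(max), nvarchar(max), varbinary(max)), SQL Server moves values that do not fit in the row to a separate set of LOB pages and keeps only a small pointer (16 bytes or so) in the row. With the default table settings, small values (up to about 8,000 bytes, when they fit in the row) stay in the row itself, but documents are normally far larger than that. Thus, if you’re not reading that column, those pages do not need to be accessed, and you don’t incur the disk I/O hit.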
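One caveat, as far as I know: whether small (max) values are kept in the row or always pushed to separate pages is controlled by the table option "large value types out of row". The sketch below checks the current setting and, if you decide you want it, forces off-row storage. It assumes the table is named DOC_DATA as in the question and lives in the default schema.

```sql
-- 0 = small (max) values may be stored in the row (the default);
-- 1 = (max) values are always stored off-row, with a pointer in the row.
SELECT name, large_value_types_out_of_row
FROM sys.tables
WHERE name = N'DOC_DATA';

-- Optional: always store the Data column off-row.
EXEC sp_tableoption N'DOC_DATA', 'large value types out of row', 1;
```

Changing the option does not rewrite existing rows right away; values already stored in-row are only moved when they are next updated. On a legacy table that mostly holds real documents it probably makes little difference, so treat this as something to test, not a required change.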