
The Archive Base Latest Questions

Editorial Team
Asked: May 13, 2026 at 21:41 (UTC)

Background: Professional tools developer. SQL/DB amateur.

Setup: .Net 3.5 winforms app talking to MS SQL Server 2008.

Scenario: I am populating a database with information extracted from a large quantity of files. This amounts to about 60M records, each of which has an arbitrarily sized message associated with it. My original plan was for an nvarchar(max) field in the record to hold the messages; however, after performing a test run on a subset of the data, this was going to make the database too large (extrapolating to an unacceptable 113GB).

Running a few queries on this initial test data set (1.3GB database), I discovered that there was a significant amount of message duplication, and that we could use this to shrink the message data to about one sixth. I’ve tried and thought of a few approaches to achieve this, but none are satisfactory. I’ve searched around for a few days now, but either a) there doesn’t seem to be a good answer (unlikely), or b) I don’t know how to express what I need well enough (more likely).

Approaches considered/tried:

  1. Bulk insertion of messages into records with a nvarchar(max) field. – found to have too much redundancy.
  2. Stick with this message column but find a way to get the database to ‘compress’ the messages. – no idea how to do this.
  3. Add a message table for unique messages, keyed on an ID that the main record(s) ‘point’ at. – whilst working in principle, implementing the uniqueness turns out to be painful and suffers from slowdown as more messages are added.
  4. Perform duplicate removal on the client. – requires that all the messages are fetched to the client for each population session. This doesn’t scale as they would need to fit in memory.
  5. Add an extra (indexed) hash column to the message table and submit the messages with a corresponding (locally generated) hash value. Search on this to narrow down the messages that actually need testing against. – complicated, there must be a better way.
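The hash column in approach 5 is less complicated than it may sound. As a rough client-side sketch in Python (the function name and encoding choice are illustrative, not from the original post; on the SQL Server side the equivalent would be an indexed binary column):

```python
import hashlib


def message_hash(message: str) -> bytes:
    """Fixed-length digest of a message, usable as an indexed lookup hint.

    The digest is only a hint: two different messages could in principle
    collide, so an exact text comparison is still needed after the hash
    lookup narrows down the candidate rows.
    """
    # utf-16-le mirrors how nvarchar data is stored, so the same bytes
    # could be hashed on the server side as well (an assumption, not a
    # requirement -- any consistent encoding works).
    return hashlib.sha256(message.encode("utf-16-le")).digest()
```

Identical messages always produce identical digests, so an index on the hash column reduces the existence check to (almost always) a single candidate row.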

This third approach amounts to the creation of a string dictionary table. After a couple of iterations on this idea I ended up with the following:

  1. The database has a message table which maps an (auto-assigned) int ID primary key to an nvarchar(max) message.
  2. The client batches up messages and submits multiple records for insertion to a stored procedure.
  3. The stored procedure iterates through the batch of incoming records, and for each message:

    i. The message dictionary table is checked (SELECT) for an existing instance of the message.

    ii. If found, remember the ID of the existing message.

    iii. If not found, insert a new message record, remembering the ID of the new record (OUTPUT).

  4. The IDs for all the messages (old and new) are returned as an output result set from the procedure.

  5. The client generates the main table records with entries (int foreign keys) for the messages filled in with the IDs returned from the procedure.
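In-memory, the per-message loop in step 3 amounts to classic string interning. A minimal Python model of the flow (purely illustrative; the real implementation is the stored procedure described above, and the class and method names here are invented for the sketch):

```python
class MessageDictionary:
    """Models the message table: unique messages keyed by auto-assigned int IDs."""

    def __init__(self):
        self._by_text = {}   # message text -> ID (the SELECT in step 3.i)
        self._next_id = 1    # the auto-assigned int ID primary key

    def intern_batch(self, messages):
        """Return the ID for each message, inserting new messages as needed.

        IDs come back in submission order, matching the output result
        set of step 4.
        """
        ids = []
        for msg in messages:
            if msg in self._by_text:           # steps 3.i/3.ii: found existing
                ids.append(self._by_text[msg])
            else:                              # step 3.iii: insert, remember ID
                self._by_text[msg] = self._next_id
                ids.append(self._next_id)
                self._next_id += 1
        return ids
```

For example, `intern_batch(["a", "b", "a"])` yields `[1, 2, 1]`: the duplicate maps to the existing ID rather than a new row.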

Issues:

  1. The search for existing messages gets slower and slower as the number of messages grows, becoming the limiting factor.
  2. I’ve tried indexing (UNIQUE) the message column, but you can’t index an nvarchar(max) column.
  3. I’ve looked at the Full Text Search capabilities of MS SQL Server 2008, but this seems like overkill to me.
  4. I’ve thought about trying to MERGE in batches of messages, but I can’t see a way to easily get the corresponding list of IDs (old and new, in the right order) to give back to the client.

It seems to me that I’m trying to achieve some sort of normalisation of my data, but from what I understand of database design, this is more like ‘row normalisation’ than proper normalisation, which is about ‘column normalisation’. I’m surprised that this isn’t something needed all over the place, with corresponding support already available.

And so, my question is: What is the right approach here?

Any help greatly appreciated.

Sam


1 Answer

Editorial Team
Added an answer on May 13, 2026 at 9:41 pm

    There are two practical aspects to (and reasons for) normalization: sensibility of the arrangement of the data (and the corresponding maintenance boon) and performance.

    Regarding sensibility, one issue that you need to consider, at least from an abstract DB design perspective, is whether or not the data is truly duplicated. While you may have two messages that have identical data, they may not represent the “same thing” in reality. The real question is: Does the fact that two messages share the same text make them the same message? In other words, assuming that message A and message B share the same text, would you want a change in message A to be reflected in message B?

    If your answer is “yes”, then your string dictionary is the right approach. If no, then you don’t really have duplicate data, just data that looks the same but isn’t.

    From a performance perspective, I’d think that the string dictionary with the additional message hash would be the best approach; I don’t think this is really as complicated as you consider it to be. Standard hashing algorithms are available in virtually every language (including T-SQL), and I wouldn’t consider the possibility of collisions or even distribution of hash values to be terribly important in this scenario, since you’re only using it as a “hint” to speed up the execution of a query.
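    The hash-as-hint lookup the answer describes can be sketched as a two-step check: narrow by digest, then confirm with an exact text comparison. A small Python model (the class is hypothetical; in SQL Server the digest could come from HASHBYTES over the message, with an index on the resulting binary column):

```python
import hashlib


class HashedMessageStore:
    """Message dictionary with a hash 'hint' column.

    The hash narrows the search to a handful of candidate rows; an
    exact comparison then guards against (rare) hash collisions, so
    the distribution quality of the hash never affects correctness,
    only speed.
    """

    def __init__(self):
        self._by_hash = {}   # digest -> list of (id, message): models the hash index
        self._next_id = 1

    @staticmethod
    def _digest(message: str) -> bytes:
        return hashlib.sha256(message.encode("utf-8")).digest()

    def get_or_add(self, message: str) -> int:
        candidates = self._by_hash.setdefault(self._digest(message), [])
        for mid, text in candidates:     # almost always 0 or 1 entries
            if text == message:          # exact compare: the hash is only a hint
                return mid
        mid = self._next_id
        self._next_id += 1
        candidates.append((mid, message))
        return mid
```

    Because the hash is a hint rather than the answer, even a truncated digest would work; it just trades index size against the number of candidates the exact compare has to scan.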


© 2021 The Archive Base. All Rights Reserved