
The Archive Base Latest Questions

Editorial Team
Asked: May 13, 2026 at 21:41 (UTC)

Background: Professional tools developer. SQL/DB amateur.

Setup: .Net 3.5 winforms app talking to MS SQL Server 2008.

Scenario: I am populating a database with information extracted from a large quantity of files. This amounts to about 60M records, each of which has an arbitrarily sized message associated with it. My original plan was for an nvarchar(max) field in the record to hold the messages; however, after performing a test run on a subset of the data, this was going to make the database too large (extrapolating to an unacceptable 113GB).

Running a few queries on this initial test data set (1.3GB database), I discovered that there was a significant amount of message duplication, and that we could use this to shrink the message data to about one sixth. I’ve tried and thought of a few approaches to achieve this, but none are satisfactory. I’ve searched around for a few days now, but either a) there doesn’t seem to be a good answer (unlikely), or b) I don’t know how to express what I need well enough (more likely).

Approaches considered/tried:

  1. Bulk insertion of messages into records with a nvarchar(max) field. – found to have too much redundancy.
  2. Stick with this message column but find a way to get the database to ‘compress’ the messages. – no idea how to do this.
  3. Add a message table for unique messages, keyed on an ID that the main record(s) ‘point’ at. – whilst working in principle, implementing the uniqueness turns out to be painful and suffers from slowdown as more messages are added.
  4. Perform duplicate removal on the client. – requires that all the messages are fetched to the client for each population session. This doesn’t scale as they would need to fit in memory.
  5. Add an extra (indexed) hash column to the message table and submit the messages with a corresponding (locally generated) hash value. Search on this to narrow down the messages that actually need testing against. – complicated, there must be a better way.
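The hash column in approach 5 is less complicated than it may sound. As a rough client-side sketch in Python (the function name and encoding choice are illustrative, not from the original post; on the SQL Server side the equivalent would be an indexed binary column):

```python
import hashlib


def message_hash(message: str) -> bytes:
    """Fixed-length digest of a message, usable as an indexed lookup hint.

    The digest is only a hint: two different messages could in principle
    collide, so an exact text comparison is still needed after the hash
    lookup narrows down the candidate rows.
    """
    # utf-16-le mirrors how nvarchar data is stored, so the same bytes
    # could be hashed on the server side as well (an assumption, not a
    # requirement -- any consistent encoding works).
    return hashlib.sha256(message.encode("utf-16-le")).digest()
```

Identical messages always produce identical digests, so an index on the hash column reduces the existence check to (almost always) a single candidate row.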

This third approach amounts to the creation of a string dictionary table. After a couple of iterations on this idea I ended up with the following:

  1. The database has a message table which maps an (auto-assigned) int ID primary key to an nvarchar(max) message.
  2. The client batches up messages and submits multiple records for insertion to a stored procedure.
  3. The stored procedure iterates through the batch of incoming records, and for each message:

    i. The message dictionary table is checked (SELECT) for an existing instance of the message.

    ii. If found, remember the ID of the existing message.

    iii. If not found, insert a new message record, remembering the ID of the new record (OUTPUT).

  4. The IDs for all the messages (old and new) are returned as an output result set from the procedure.

  5. The client generates the main table records with entries (int foreign keys) for the messages filled in with the IDs returned from the procedure.
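In-memory, the per-message loop in step 3 amounts to classic string interning. A minimal Python model of the flow (purely illustrative; the real implementation is the stored procedure described above, and the class and method names here are invented for the sketch):

```python
class MessageDictionary:
    """Models the message table: unique messages keyed by auto-assigned int IDs."""

    def __init__(self):
        self._by_text = {}   # message text -> ID (the SELECT in step 3.i)
        self._next_id = 1    # the auto-assigned int ID primary key

    def intern_batch(self, messages):
        """Return the ID for each message, inserting new messages as needed.

        IDs come back in submission order, matching the output result
        set of step 4.
        """
        ids = []
        for msg in messages:
            if msg in self._by_text:           # steps 3.i/3.ii: found existing
                ids.append(self._by_text[msg])
            else:                              # step 3.iii: insert, remember ID
                self._by_text[msg] = self._next_id
                ids.append(self._next_id)
                self._next_id += 1
        return ids
```

For example, `intern_batch(["a", "b", "a"])` yields `[1, 2, 1]`: the duplicate maps to the existing ID rather than a new row.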

Issues:

  1. The search for existing messages gets slower and slower as the number of messages grows, becoming the limiting factor.
  2. I’ve tried indexing (UNIQUE) the message column, but you can’t index an nvarchar(max) column.
  3. I’ve looked at the Full Text Search capabilities of MS SQL Server 2008, but this seems like overkill to me.
  4. I’ve thought about trying to MERGE in batches of messages, but I can’t see a way to easily get the corresponding list of IDs (old and new, in the right order) to give back to the client.

It seems to me that I’m trying to achieve some sort of normalisation of my data, but from what I understand of database design, this is more like ‘row normalisation’ than proper normalisation, which is about ‘column normalisation’. I’m surprised that this isn’t something needed all over the place, with corresponding support already available.

And so, my question is: What is the right approach here?

Any help greatly appreciated.

Sam


1 Answer

Editorial Team
Added an answer on May 13, 2026 at 9:41 pm

    There are two practical aspects to (and reasons for) normalization: sensibility of the arrangement of the data (and the corresponding maintenance boon) and performance.

    Regarding sensibility, one issue that you need to consider, at least from an abstract DB design perspective, is whether or not the data is truly duplicated. While you may have two messages that have identical data, they may not represent the “same thing” in reality. The real question is: Does the fact that two messages share the same text make them the same message? In other words, assuming that message A and message B share the same text, would you want a change in message A to be reflected in message B?

    If your answer is “yes”, then your string dictionary is the right approach. If no, then you don’t really have duplicate data, just data that looks the same but isn’t.

    From a performance perspective, I’d think that the string dictionary with the additional message hash would be the best approach; I don’t think this is really as complicated as you consider it to be. Standard hashing algorithms are available in virtually every language (including T-SQL), and I wouldn’t consider the possibility of collisions or even distribution of hash values to be terribly important in this scenario, since you’re only using it as a “hint” to speed up the execution of a query.
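    The hash-as-hint lookup the answer describes can be sketched as a two-step check: narrow by digest, then confirm with an exact text comparison. A small Python model (the class is hypothetical; in SQL Server the digest could come from HASHBYTES over the message, with an index on the resulting binary column):

```python
import hashlib


class HashedMessageStore:
    """Message dictionary with a hash 'hint' column.

    The hash narrows the search to a handful of candidate rows; an
    exact comparison then guards against (rare) hash collisions, so
    the distribution quality of the hash never affects correctness,
    only speed.
    """

    def __init__(self):
        self._by_hash = {}   # digest -> list of (id, message): models the hash index
        self._next_id = 1

    @staticmethod
    def _digest(message: str) -> bytes:
        return hashlib.sha256(message.encode("utf-8")).digest()

    def get_or_add(self, message: str) -> int:
        candidates = self._by_hash.setdefault(self._digest(message), [])
        for mid, text in candidates:     # almost always 0 or 1 entries
            if text == message:          # exact compare: the hash is only a hint
                return mid
        mid = self._next_id
        self._next_id += 1
        candidates.append((mid, message))
        return mid
```

    Because the hash is a hint rather than the answer, even a truncated digest would work; it just trades index size against the number of candidates the exact compare has to scan.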


© 2021 The Archive Base. All Rights Reserved