My project, when running, collects a large number of text blocks (about 20K; the largest I have seen is about 200K) in a short span of time and stores them in a relational database. Each text block is relatively small, averaging about 15 short lines (about 300 characters). The current implementation is in C# (VS2008), .NET 3.5, and the backend DBMS is Microsoft SQL Server 2005.
Performance and storage are both important concerns for the project, but performance takes priority over storage. I am looking for answers to these:
- Should I compress the text before storing it in the DB, or let SQL Server worry about compacting the storage?
- What would be the best compression algorithm/library for this context in terms of performance? Currently I just use the standard GZip in the .NET Framework.
- Do you know any best practices for dealing with this? I welcome outside-the-box suggestions as long as they can be implemented in the .NET Framework. (It is a big project and this requirement is only a small part of it.)
EDITED: I will keep adding to this to clarify points raised
- I don’t need text indexing or searching on this text. I just need to be able to retrieve each block at a later stage, by its primary key, for display as a text block.
- I have a working solution implemented as above, and SQL Server has no issue at all handling it. This program will run quite often and needs to work with a large data context, so you can imagine the size will grow very rapidly; hence every optimization I can do will help.
The strings are, on average, 300 characters each. That’s either 300 or 600 bytes, depending on Unicode settings. Let’s say you use a varchar(4000) column and use (on average) 300 bytes each.
Then you have up to 200,000 of these to store in a database.
That’s less than 60 MB of storage. In the land of databases, that is, quite frankly, peanuts. 60 GB of storage is what I’d call a “medium” database.
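For anyone who wants to redo that arithmetic with their own numbers, here is a minimal sketch. It is in Python rather than the asker's C#, purely for brevity, and the function name `estimated_storage_mb` is made up for illustration. It counts only the raw text payload and ignores row, page, and index overhead, so treat the result as a rough lower bound.

```python
def estimated_storage_mb(rows, avg_chars, bytes_per_char):
    """Raw text payload in megabytes (10**6 bytes).

    Assumption: ignores per-row, per-page and index overhead.
    """
    return rows * avg_chars * bytes_per_char / 1_000_000


if __name__ == "__main__":
    # 1 byte per character for varchar, 2 bytes per character for nvarchar.
    for label, width in (("varchar", 1), ("nvarchar", 2)):
        size = estimated_storage_mb(200_000, 300, width)
        print(label, size, "MB")
```

Even the two-bytes-per-character case stays in the low hundreds of megabytes for the largest run mentioned in the question.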
At this point in time, even thinking about compression is premature optimization. SQL Server can handle this amount of text without breaking a sweat. Barring any system constraints that you haven’t mentioned, I would not concern myself with any of this until and unless you actually start to see performance problems – and even then it will likely be the result of something else, like a poor indexing strategy.
And compressing certain kinds of data, especially very small amounts of data (and 300 bytes is definitely small), can actually sometimes yield worse results. You could end up with “compressed” data that is actually larger than the original data. I’m guessing that most of the time, the compressed size will probably be very close to the original size.
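If you want to check this against your own data rather than take my word for it, a quick measurement along these lines should do. This is a hedged sketch in Python, not the asker's C# code; `gzip_sizes` and the sample strings are invented for illustration. The gzip container adds a fixed header and trailer (18 bytes in total) on top of the deflate stream, which is why very short inputs come out larger. Exact sizes will differ between gzip implementations, so run the equivalent test with the .NET GZip class on real samples before drawing conclusions.

```python
import gzip


def gzip_sizes(text):
    """Return (original_bytes, compressed_bytes) for a UTF-8 encoded string."""
    raw = text.encode("utf-8")
    packed = gzip.compress(raw)
    # Sanity check: the compressed form must round-trip to the original.
    assert gzip.decompress(packed) == raw
    return len(raw), len(packed)


if __name__ == "__main__":
    # Hypothetical samples; substitute real text blocks from your application.
    samples = {
        "tiny": "OK",
        "one short line": "Connection timed out after 30 seconds.",
        "repetitive block": "status=ok; retry=0\n" * 15,
    }
    for name, text in samples.items():
        original, compressed = gzip_sizes(text)
        print(f"{name}: {original} bytes -> {compressed} bytes")
```

How much a typical 300-byte block shrinks depends heavily on how repetitive the text is, so measure a representative batch rather than a single string.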
SQL Server 2008 can perform page-level compression, which would be a somewhat more useful optimization, but you’re on SQL Server 2005. So no, definitely don’t bother trying to compress individual values or rows; it’s not going to be worth the effort and may actually make things worse.