It is frequently advised to choose database field sizes to be as narrow as possible. I am wondering to what degree this applies to SQL Server 2005 VARCHAR columns: Storing 10-letter English words in a VARCHAR(255) field will not take up more storage than in a VARCHAR(10) field.
Are there other reasons to restrict the size of VARCHAR fields to stick as closely as possible to the size of the data? I’m thinking of
- Performance: Is there an advantage to using a smaller n when selecting, filtering and sorting on the data?
- Memory, including on the application side (C++)?
- Style/validation: How important do you consider restricting column size to force nonsensical data imports to fail (such as 200-character surnames)?
- Anything else?
Background: I help data integrators with the design of data flows into a database-backed system. They have to use an API that restricts their choice of data types. For character data, only VARCHAR(n) with n <= 255 is available; CHAR, NCHAR, NVARCHAR and TEXT are not. We’re trying to lay down some “good practices” rules, and the question has come up whether there is a real detriment to using VARCHAR(255) even for data where real maximum sizes will never exceed 30 bytes or so.
Typical data volumes for one table are 1-10 million records with up to 150 attributes. Query performance (SELECT, frequently with extensive WHERE clauses) and application-side retrieval performance are paramount.
Data integrity. This is by far the most important reason. If you create a column called Surname that is 255 characters, you will likely get more than surnames. You’ll get first name, last name, middle name. You’ll get their favorite pet. You’ll get “Alice in the Accounting Department with the Triangle hair”. In short, you will make it easy for users to use the column as a notes/surname column. You want the cap to impede users who try to put something other than a surname into that column. If you have a column that calls for a specific length (e.g. a US tax identifier is nine characters) but the column is varchar(255), other developers will wonder what is going on, and you will likely get crap data as well.
Indexing and row limits. In SQL Server a row is limited to 8060 bytes, IIRC. Lots of fat non-varchar(max) columns with lots of data can quickly exceed that limit. In addition, index keys are capped at 900 bytes in width, IIRC. So, if you wanted to index on your surname column and some others that contain lots of data, you could exceed this limit.
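To put rough numbers on the index point, here is a minimal Python sketch that adds up declared VARCHAR widths and compares the total to the 900-byte figure quoted above. It assumes one byte per character and ignores any per-column overhead; the function names and the two column lists are hypothetical, chosen only for illustration.

```python
# Index key cap as quoted in the answer above (bytes).
INDEX_KEY_LIMIT_BYTES = 900


def max_key_width(declared_lengths):
    """Worst-case key width in bytes.

    Assumption: 1 byte per VARCHAR character, no per-column overhead.
    """
    return sum(declared_lengths)


def fits_index_key(declared_lengths, limit=INDEX_KEY_LIMIT_BYTES):
    """True if the worst-case key width stays within the cap."""
    return max_key_width(declared_lengths) <= limit


# Hypothetical four-column index, e.g. surname, first name, city, street.
wide = [255, 255, 255, 255]   # everything declared as VARCHAR(255)
narrow = [30, 30, 40, 60]     # widths sized to the expected data

print(max_key_width(wide), fits_index_key(wide))      # 1020 False
print(max_key_width(narrow), fits_index_key(narrow))  # 160 True
```

With everything declared as VARCHAR(255), four columns are already enough to push the worst-case key past the cap, even if the actual data never comes close to it.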
Reporting and external systems. As a report designer you must assume that if a column is declared with a max length of 255, it could have 255 characters. If the user can do it, they will do it. Thus, saying “It probably won’t have more than 30 characters” is not even remotely the same as saying “It cannot have more than 30 characters”. Never rely on the former. As a report designer, you have to work around the possibility that users will enter a bunch of data into a column. That means either truncating the values (and if that is the case, why have the additional space available?) or using CanGrow to make a lovely mess of a report. Either way, you make it harder for other developers to understand the intent of the column if the column size is so far out of whack with the actual data being stored.