I am creating a page where people can post articles. When the user posts an article, it shows up on a list, like the related questions on Stack Overflow (when you add a new question). It’s fairly simple.
My problem is that I have two types of users: 1) unregistered private users and 2) companies.
The unregistered users need to type in their name, email and phone, whereas the company users just need to type in their company name and password. Fairly simple.
I need to reduce the excess database usage and try to optimize the database and build the tables effectively.
Now to the problem at hand:
So I have one table with the information about the companies, ID (guid), Name, email, phone etc.
I was thinking about making one table called articles that contained ArticleID, Headline, Content and Publishing date.
One table with the information about the unregistered users, ID, their name, email and phone.
How do I tie the articles table to the company/unregistered users tables? Is it a good idea to add an integer column that takes one of two values (1 = unregistered user, 2 = company), plus a second column holding the ID of the specific user or company? It looks like that would need a lot of extra code to query the database. What about performance? How could I then return the article along with the contact information? It should also be possible to return all the articles from a specific company.
So Table company would be:
ID (guid), company name, phone, email, password, street, zip, country, state, www, description, contact person and a few more that I don't have here right now.
Table Unregistered user:
ID (guid), name, phone, email
Table article:
ID (int/guid/short guid), headline, content, published date, is_company, id_to_user
Is there a better approach?
The qualities I am looking for are: performance, ease of querying and ease of maintenance (adding new fields, indexes etc.).
Theory
The problem you describe is known in data modeling theory as table inheritance. In Martin Fowler’s book the solutions are:
Single Table Inheritance
Class Table Inheritance
Concrete Table Inheritance
So from a theory and industry practice point of view all three solutions are acceptable: one table Posters with NULLable columns (i.e. single table inheritance), three tables Posters, Companies and Persons (i.e. class table inheritance), or two tables Companies and Persons (i.e. concrete table inheritance).
Now, to pros and cons.
Cost of NULL columns
The record structure is discussed in Inside the Storage Engine: Anatomy of a record:
So if you have at least one NULLable column, you pay the cost of the NULL bitmap in each record, at least 3 bytes. But the cost is identical whether you have 1 or 8 such columns! The 9th NULLable column will add a byte to the NULL bitmap in each record. The formula is described in Estimating the Size of a Clustered Index: 2 + ((Num_Cols + 7) / 8), keeping only the integer part of the division.
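As a quick check of that formula, here is a minimal Python sketch. The function name null_bitmap_bytes is made up for this illustration, and the division is treated as integer division. Note that, as far as I recall, Microsoft's documentation defines Num_Cols as the total number of columns in the table, so treat the column count you pass in accordingly.

```python
def null_bitmap_bytes(num_cols: int) -> int:
    """Bytes taken by the NULL bitmap part of a record.

    Implements 2 + ((Num_Cols + 7) / 8), keeping only the integer
    part of the division. `null_bitmap_bytes` is a hypothetical
    helper name used only for this illustration.
    """
    if num_cols < 1:
        raise ValueError("num_cols must be at least 1")
    return 2 + (num_cols + 7) // 8


if __name__ == "__main__":
    # 1 through 8 columns cost the same; the 9th adds one byte.
    for n in (1, 8, 9, 16, 17):
        print(n, null_bitmap_bytes(n))
```

The step from 8 to 9 columns is where the extra byte appears, which matches the point made above.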
Performance Driving Factor
In a database system there is really only one factor that drives performance: the amount of data scanned. That is, how large the records scanned by a query plan are, and how many records it has to scan. So to improve performance you need to make the records narrower and scan fewer of them.
Now in order to analyze these criteria, there is something missing in your post: the prevalent data access pattern, i.e. the most common query that the database will be hit with. This is driven by how you display your posts on the site. Consider these possible approaches:
Posts front page: like SO, a page of recent posts with header, excerpt, time posted and basic author information (name, gravatar). To display this page you need to join Posts with authors, but you only need the author name and gravatar. Both single table inheritance and class table inheritance would work, but concrete table inheritance would fail, because such a query cannot afford conditional joins (i.e. joining the articles to either Companies or Persons); a query like that will be less than optimal.
Posts per author: users have to log in first and then they’ll see their own posts (this is common for non-public, post-oriented sites; think incident tracking, for instance). For such a design, all three table inheritance schemes would work.
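To make the two access patterns concrete, here is a rough sketch of a class table inheritance layout, using Python's built-in sqlite3 module as a stand-in for a real server. All table and column names (posters, companies, articles, poster_id and so on) and the sample rows are assumptions made up for this illustration, not a recommended final schema. Since unregistered users only have name, email and phone, this sketch keeps those in posters and adds no separate persons table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Common columns shared by every kind of poster (assumed names).
CREATE TABLE posters (
    poster_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    email     TEXT,
    phone     TEXT
);
-- Company-only columns live in their own table, keyed by poster_id.
CREATE TABLE companies (
    poster_id     INTEGER PRIMARY KEY REFERENCES posters(poster_id),
    password_hash TEXT NOT NULL,
    street        TEXT,
    www           TEXT
);
CREATE TABLE articles (
    article_id   INTEGER PRIMARY KEY,
    poster_id    INTEGER NOT NULL REFERENCES posters(poster_id),
    headline     TEXT NOT NULL,
    content      TEXT NOT NULL,
    published_at TEXT NOT NULL
);
CREATE INDEX ix_articles_poster ON articles (poster_id, published_at);
""")

conn.executemany(
    "INSERT INTO posters (poster_id, name, email, phone) VALUES (?, ?, ?, ?)",
    [(1, "Acme Ltd", "info@acme.example", "555-0100"),
     (2, "Jane Doe", "jane@example.com", "555-0101")],
)
conn.execute(
    "INSERT INTO companies (poster_id, password_hash, street, www) "
    "VALUES (1, 'not-a-real-hash', '1 Main St', 'acme.example')"
)
conn.executemany(
    "INSERT INTO articles (article_id, poster_id, headline, content, published_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [(1, 1, "Acme launch", "...", "2024-01-02"),
     (2, 2, "Bike for sale", "...", "2024-01-03"),
     (3, 1, "Acme hiring", "...", "2024-01-04")],
)

# Front page: one unconditional join, reading only the narrow posters table.
front_page = conn.execute(
    "SELECT a.headline, p.name FROM articles AS a "
    "JOIN posters AS p ON p.poster_id = a.poster_id "
    "ORDER BY a.published_at DESC LIMIT 2"
).fetchall()

# Posts per author: no join needed at all.
acme_posts = [row[0] for row in conn.execute(
    "SELECT headline FROM articles WHERE poster_id = ? "
    "ORDER BY published_at DESC", (1,)
)]

print(front_page)
print(acme_posts)
```

The front page query never touches the wide companies table, which is the point of splitting the common columns out. With concrete table inheritance the same query would have to join articles against either Companies or Persons depending on a type flag.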
Conclusion
There are some general performance considerations (i.e. narrow the data), but the critical information is missing: how you are going to query the data, i.e. your access pattern. The data model has to be optimized for that access pattern.
PS
Needless to say, don’t use guids for ids. Unless you’re building a distributed system, they are a horrible choice for reasons of excessive width. Fragmentation is also a potential problem, but that can be alleviated by use of sequential guids.
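To put a rough number on the width argument, here is a small Python sketch comparing raw key sizes. The helper key_bytes and the row count are invented for this illustration, and the figures ignore all per-row and per-page overhead, so read them as a lower bound rather than a real size estimate.

```python
import struct
import uuid

GUID_BYTES = len(uuid.uuid4().bytes)   # a GUID/UUID is 128 bits
INT_BYTES = struct.calcsize("<i")      # a 32-bit integer


def key_bytes(rows: int, key_width: int, copies: int = 1) -> int:
    """Raw bytes spent on key values alone (hypothetical helper).

    `copies` is the number of places each key value is stored,
    e.g. the table itself plus each index or foreign key column
    that repeats it.
    """
    return rows * key_width * copies


if __name__ == "__main__":
    rows = 1_000_000  # assumed table size, for illustration only
    print("guid:", key_bytes(rows, GUID_BYTES, copies=3))
    print("int: ", key_bytes(rows, INT_BYTES, copies=3))
```

Every extra place the key is repeated multiplies the difference, which is why a wide key hurts more than its 16 bytes suggest.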