This is a long question that I do not know how to summarize… I

Question

0

Asked: May 18, 20262026-05-18T08:06:05+00:00 2026-05-18T08:06:05+00:00

This is a long question that I do not know how to summarize… I

0

This is a long question that I do not know how to summarize…

I have a table that I need to read from that has financial data with almost a billion records of detailed data in it. I cannot change the structure of this table, I am merely a consumer of it. This table has columns such as a transaction data, a bunch of attribute columns with Int data in them describing the transaction (not named Attribute1-20, just named that way for simplicity below) and then an Amount column.

TABLE: FinancialData
COLUMNS:
  Id (BigInt IDENTITY)
  TransactionId (Int FK)
  TransactionDate (DateTime)
  Attribute1 (Int)
  Attribute2 (Int)
  .
  .
  Attribute20 (Int)
  Amount (Decimal)

I have a process that needs to summarize this FinancialData table into 2 database tables (one a header table and another a detail table with aggregated amounts) for a user-defined time-line so that a snapshot of the data can be used by other processes. The header table contains one record per user-defined time-line (snapshot) and the detail table contains aggregated amount records across all attributes of the FinancialData table.

TABLE: FinancialHeader 
COLUMNS:
  Id (Int IDENTITY)
  BeginTransactionDate (DateTime)
  EndTransactionData (DateTime)


TABLE: FinancialDetail
COLUMNS:
 Id (Int IDENTITY)
 FinancialHeaderId (Int FK)
 Attribute1 (Int)
 Attribute2 (Int)
 .
 .
 Attribute20 (Int)
 Amount (Decimal)

To give an example of the process, say there are 20 million records in the FinancialData table with a TransactionDate between 1/1/2010 and 6/30/2010 with many redundant attributes (however they would have different TransactionId values). If I were to summarize this data in the FinancialHeader and FinancialDetail tables above, I would create one FinancialHeader record with a BeginTransactionDate of 1/1/2010 and an EndTransactionDate of 6/30/2010 and then multiple FinancialDetail records that are child records to the header. The FinancialDetail table aggregates the 20 million records from FinancialData basically contains a grouping of the unique values of Attribute1 – Attribute20 along with a SUM(Amount) to track the total amount for those attributes. Typically 20 million records in the FinancialData table would contain about 10,000 unique combinations of attributes, which would then yield 10,000 records in the FinancialDetail table with an aggregated amount. So in my example there would be 1 FinancialHeader record and roughly 10,000 FinancialDetail records created during the process.

The question I have relates to storing 20 columns worth of unique combinations of attribute data… this “snapshot” process I am talking about can be ran over and over again by the user any number of times for various date ranges to essentially store amounts for that period in time. So what happens is the FinancialDetail table tends to have a lot of data in it even though it is aggregate data. What I don’t like is the fact that there are 20 columns in the FinancialDetail table I created that I feel may be wasting space. What I was thinking may be a better approach is to store the each unique combination of attributes into a row in yet another table, say called FinancialAttribute, that contains an Id column that can then be used as a look up mechanism for the FinancialDetail table. So the FinancialAttribute table would look like this:

TABLE: FinancialAttribute
COLUMNS: 
  Id (Int IDENTITY)
  Attribute1 (Int)
  Attribute2 (Int)
  .
  .
  Attribute2 (Int)

And the FinancialDetail table would be modified into this:

TABLE: FinancialDetail
COLUMNS:
 Id (Int IDENTITY)
 FinancialHeaderId (Int FK)
 FinancialAttributeId (Int FK)
 Amount (Decimal)

Is this a pretty common pattern to deal with aggregation across multiple columns/attributes? Or am I thinking about this in completely the wrong way? I need to store the data from the FinancialData table into my own local copy somehow though because there are several downstream processes that need to process or report on these user-defined time-line snapshots of this financial information.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-18T08:06:05+00:00

Wordy question, wordy answer! 😉

I’m no data warehousing expert, so I’m not familiar with the patterns (and anti-patterns) in that area. I’m speaking as just a DB dev who has maybe done something similar.

In my case, we take snapshots from large source tables of prescription drug info. The snapshots are used for downstream analysis and reporting. The users specify snapshot criteria, like date and drug type, which usually affects 2m (vs your 20m) records. This typically compiles down to 120k (vs. your 10k) records. Snapshots are kept indefinitely, as source tables change over time and are NOT historical. I share your concerns about snapshots pulling and storing redundant information.

Your question – are you doing something dumb? Is there a better way?

Conceptually speaking, it’s pretty apparent that your factoring is “safe”. By this, I mean it’s a straightforward transformation that’s obviously correct, and it’s pretty obvious how to map the factored version back into the original with little pain. From that perspective (conceptual ease), I think it has merits.

As for impact, I’d consider the expected table sizes. My assumptions are:

Avg snapshot records = Total unique Attribute combos per snapshot = 10k
Total num snapshots over time = 10k
Total unique Attribute combos over time = 100k
Amount column precision is < 20 digits

So:

FinancialDetail (orig)
Column   | Type     | Avg Size
-------------------------------
ID       | int      | 4
HeaderID | int      | 4
Amount   | decimal  | 9
A1 - A20 | int x 20 | 80
-------------------------------
Total:                97
Expected num rows:    100m
Total expected size:  9GB


FinancialDetail (new)
Column   | Type     | Avg Size
-------------------------------
ID       | int      | 4
HeaderID | int      | 4
AttribID | int      | 4
Amount   | decimal  | 9
-------------------------------
Total:                21
Expected num rows:    100m
Total expected size:  2GB

FinancialAttribute (new)
Column   | Type     | Avg Size
-------------------------------
ID       | int      | 4
A1 - A20 | int x 20 | 80
-------------------------------
Total:                84
Expected num rows:    100k
Total expected size:  8MB

If my assumptions are in the ball park (and my math right), you could be saving 78% on space. This doesn’t include space for indexes or table fill slack, so the actual table sizes will be higher.

Does saving 7GB matter?

9GB is easily managed by a modern disk, and is nothing compared to your master table.
The data is (probably) put back together in memory on a query anyway, so maybe no savings over the wire or in caching.
Query IO should be a little better, if seaching by Attributes
Backup / restore time should be better
Time spent reindexing, updating statistics, etc. should be better
(Covering) indexes based on Attributes are much smaller
The query time to insert into your factored tables will go up

You can make your own call on this but it seems to me that your factoring could be worthwhile if space is the #1 concern, even if it’s not technically the mostest optimalest solution.

Speaking of efficiency…

If you somehow managed to optimize your Attribute space down to 0 bytes, you’d only save another 0.09% off the original. So I wouldn’t futher optimize for space there.

On the other hand, simply dropping FinancialDetail.ID and using PK on (HeaderID, AttributeID) would save you 4.1% off the original. (Assumption: You don’t have FKs pointing to this table.)

As far as if there’s a better way – I don’t know. It would depend on how many snapshots you get, how your snapshots are used, and how fast it needs to be.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

This is a long question that I do not know how to summarize… I

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply