I am not sure how to solve this problem:
We import order information from a variety of online vendors (Amazon, Newegg, etc.). Each vendor has its own terminology and structure for orders, which we have mirrored into a database. The data imports into the database with no issues; the problem I am faced with is writing a method that will extract the required fields from the database, regardless of the schema.
For instance assume we have the following structures:
Newegg structure:
"OrderNumber" integer NOT NULL, -- The Order Number
"InvoiceNumber" integer, -- The invoice number
"OrderDate" timestamp without time zone, -- Create date.
Amazon structure:
"amazonOrderId" character varying(25) NOT NULL, -- Amazon's unique, displayable identifier for an order.
"merchant-order-id" integer DEFAULT 0, -- A unique identifier optionally supplied for the order by the Merchant.
"purchase-date" timestamp with time zone, -- The date the order was placed.
How can I select these items and place them into a temporary table for me to query against?
The temporary table could look like:
"OrderNumber" character varying(25) NOT NULL,
"TransactionId" integer,
"PurchaseDate" timestamp with time zone
I understand that some of the databases represent an order number with an integer and others a character varying; to handle that I plan on casting the datatypes to String values.
Does anyone have a suggestion for me to read about that will help me figure this out?
I don’t need an exact answer, just a nudge in the right direction.
The data will be consumed by Java, so if any particular Java classes will help, feel free to suggest them.
First, you can create a VIEW to provide this functionality. You can query this view like any other table.
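
As a rough sketch, such a view might look like the following. The table names newegg_order and amazon_order and the view name v_order are assumptions, as is the mapping of "InvoiceNumber" and "merchant-order-id" to transaction_id; adjust all of them to your schema.

```sql
-- Hypothetical table names: newegg_order, amazon_order.
-- Column names and types are taken from the structures in the question.
CREATE VIEW v_order AS
SELECT 'newegg'::text                 AS source          -- tags the origin of each row
     , "OrderNumber"::text            AS order_nr        -- integer cast to text
     , "InvoiceNumber"                AS transaction_id  -- assumed mapping
     , "OrderDate" AT TIME ZONE 'UTC' AS purchase_date   -- timestamp -> timestamptz
FROM   newegg_order
UNION  ALL
SELECT 'amazon'::text
     , "amazonOrderId"::text
     , "merchant-order-id"                               -- assumed mapping
     , "purchase-date"                                   -- already timestamptz
FROM   amazon_order;

-- Query the view like any other table:
SELECT *
FROM   v_order
WHERE  order_nr = '1234'
AND    source   = 'newegg';
```

UNION ALL is used rather than UNION because rows from different sources never need de-duplication, and it avoids the extra sort or hash step.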
The source column is necessary if the order_nr is not unique. How else would you guarantee unique order numbers over different sources?
A timestamp without time zone is ambiguous in a global context. It’s only good in connection with its time zone. If you mix timestamp and timestamptz, you need to place the timestamp at a certain time zone with the AT TIME ZONE construct to make this work. For more explanation read this related answer.
I use UTC as time zone, you might want to provide a different one. A simple cast "OrderDate"::timestamptz would assume your current time zone. AT TIME ZONE applied to a timestamp results in timestamptz. That’s why I did not add another cast.
While you can, I advise not to use camel-case identifiers in PostgreSQL ever. Avoids many kinds of possible confusion. Note the lower case identifiers (without the now unnecessary double-quotes) I supplied.
Don’t use varchar(25) as type for the order_nr. Just use text without arbitrary length modifier if it has to be a string. If all order numbers consist of digits exclusively, integer or bigint would be faster.
Performance
One way to make this fast would be to materialize the view. I.e., write the result into a (temporary) table:
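
A minimal sketch of that step, assuming a view named v_order (hypothetical name) that exposes the columns source and order_nr, and a temp table named tmp_order (also hypothetical):

```sql
-- Write the view's result into a temporary table.
CREATE TEMP TABLE tmp_order AS
SELECT * FROM v_order;

-- The primary key creates a unique index on (order_nr, source)
-- and marks both columns NOT NULL.
ALTER TABLE tmp_order ADD PRIMARY KEY (order_nr, source);

-- Autovacuum does not process temporary tables, so collect
-- planner statistics manually.
ANALYZE tmp_order;
```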
You need an index. In my example, the primary key constraint provides the index automatically.
If your tables are big, make sure you have enough temporary buffers to handle this in RAM before you create the temp table. Else it will actually slow you down.
The temp_buffers setting has to be changed before the first use of temporary objects in your session. Don’t set it high globally, just for your session. A temp table is dropped automatically at the end of your session anyway.
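
For illustration, the session-level setting could look like this. The value is an arbitrary placeholder, not a recommendation:

```sql
-- Must run before the session first touches any temporary table;
-- afterwards the setting can no longer be changed for this session.
SET temp_buffers = '500MB';  -- illustrative value only, size it to your data
```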
To get an estimate how much RAM you need, create the table once and measure:
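
One way to measure, assuming the temp table is called tmp_order (a hypothetical name):

```sql
-- Total on-disk size of the table including its indexes and TOAST data,
-- formatted as a human-readable string.
SELECT pg_size_pretty(pg_total_relation_size('tmp_order'));
```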
More on object sizes under this related question on dba.SE.
All the overhead only pays off if you have to process a number of queries within one session. For other use cases there are other solutions. If you know the source table at the time of the query, it would be much faster to direct your query to the source table instead. If you don’t, I would question the uniqueness of your order_nr once more. If it is, in fact, guaranteed to be unique you can drop the column source I introduced.
For only one or a few queries, it might be faster to use the view instead of the materialized view.
I would also consider a plpgsql function that queries one table after the other until the record is found. Might be cheaper for a couple of queries, considering the overhead. Indexes for every table needed of course.
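
A sketch of such a function, under the same assumptions as before: the table names newegg_order and amazon_order and the function name f_order are hypothetical, and the output column mapping is a guess at what you need.

```sql
CREATE OR REPLACE FUNCTION f_order(_order_nr text)
  RETURNS TABLE (source text, order_nr text
               , transaction_id integer, purchase_date timestamptz)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   -- Try the first source. For this lookup to use an index, an expression
   -- index on ("OrderNumber"::text) would be needed.
   RETURN QUERY
   SELECT 'newegg'::text, n."OrderNumber"::text
        , n."InvoiceNumber", n."OrderDate" AT TIME ZONE 'UTC'
   FROM   newegg_order n
   WHERE  n."OrderNumber"::text = _order_nr;

   -- RETURN QUERY sets FOUND to true if it returned at least one row.
   IF FOUND THEN
      RETURN;
   END IF;

   -- Not found yet: try the next source.
   RETURN QUERY
   SELECT 'amazon'::text, a."amazonOrderId"::text
        , a."merchant-order-id", a."purchase-date"
   FROM   amazon_order a
   WHERE  a."amazonOrderId" = _order_nr;
END
$func$;

-- Usage:
SELECT * FROM f_order('1234');
```

The function stops at the first source that has a match, so it only makes sense if an order number found in an earlier table should take precedence or is known to be unique across sources.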
Also, if you stick to text or varchar for your order_nr, consider COLLATE "C" for it.
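
The "C" collation compares strings byte by byte instead of applying locale rules, which is typically cheaper for identifiers that need no linguistic ordering. A short sketch, assuming a table tmp_order (hypothetical name) with a text column order_nr:

```sql
-- Set the collation on the column itself (rewrites the column
-- and rebuilds any indexes on it):
ALTER TABLE tmp_order ALTER COLUMN order_nr TYPE text COLLATE "C";

-- Or apply it per expression in a single query:
SELECT * FROM tmp_order ORDER BY order_nr COLLATE "C";
```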