Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 729341
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 14, 20262026-05-14T06:47:10+00:00 2026-05-14T06:47:10+00:00

Background: I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP. I

  • 0

Background:

I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.

I need to extract data from it on a semi real-time basis (some-one is bound to ask what semi real-time means and the answer is as frequently as I reasonably can but I will be pragmatic, as a benchmark lets say we are hoping for every 15min) and feed it into a data-warehouse.

How much data? At peak times we are talking approx 80-100k rows per min hitting the OLTP side, off-peak this will drop significantly to 15-20k. The most frequently updated rows are ~64 bytes each but there are various tables etc so the data is quite diverse and can range up to 4000 bytes per row. The OLTP is active 24×5.5.

Best Solution?

From what I can piece together the most practical solution is as follows:

  • Create a TRIGGER to write all DML activity to a rotating CSV log file
  • Perform whatever transformations are required
  • Use the native DW data pump tool to efficiently pump the transformed CSV into the DW

Why this approach?

  • TRIGGERS allow selective tables to be targeted rather than being system wide + output is configurable (i.e. into a CSV) and are relatively easy to write and deploy. SLONY uses similar approach and overhead is acceptable
  • CSV easy and fast to transform
  • Easy to pump CSV into the DW

Alternatives considered ….

  • Using native logging (http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html). Problem with this is it looked very verbose relative to what I needed and was a little trickier to parse and transform. However it could be faster as I presume there is less overhead compared to a TRIGGER. Certainly it would make the admin easier as it is system wide but again, I don’t need some of the tables (some are used for persistent storage of JMS messages which I do not want to log)
  • Querying the data directly via an ETL tool such as Talend and pumping it into the DW … problem is the OLTP schema would need tweaked to support this and that has many negative side-effects
  • Using a tweaked/hacked SLONY – SLONY does a good job of logging and migrating changes to a slave so the conceptual framework is there but the proposed solution just seems easier and cleaner
  • Using the WAL

Has anyone done this before? Want to share your thoughts?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-14T06:47:10+00:00Added an answer on May 14, 2026 at 6:47 am

    Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, then you will get much much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the last key value from the last extraction (0 if first extraction.) This incremental, decoupled approach avoids introducing trigger latency in the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to NULL so it gets a new value and gets picked by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?

    I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source.) The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.

    Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code everytime you upgrade Postgres) I would not necessarily use the WALs either — that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly — or even statement logging — since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine.) Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you’d get a comparable performance hit (in terms of thrashing filesystem cache, mostly) from shipping WALs to a different machine as from running the incremental SELECT.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Background In a PostgreSQL 9.0 database, there are various tables that have many-to-many relationships.
Background: we have an application that generates reports from HTML (that may or may
BACKGROUND: I have a list of posts that need to be sorted their 'votes'
I have a backend process that maintains state in a PostgreSQL database, which needs
We have a dedicated database server that runs PostgreSQL 8.3 on linux debian. The
Background I have a web app that will create an image from user input.
Background I have a function that takes a config object as an argument. Within
Background : I have a few different threads which each need to write to
Background: I have a WebPart that makes use of the SPCalendarView control and in
Background I have a dimension table that has a single record for each day.

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.