Several pieces of software I’m maintaining make direct connections to remote databases to get the data they need to operate. In the past, this was not a problem. However, clients now want functionality that requires queries returning massive amounts of historical data, and network latency is becoming a real problem.
My first approach was to keep the software that queries the RDBMS exactly the same, but just point it at localhost. Then simply build a slave directly on the client computer (laptop/netbook/etc.) and presto, it’s super fast again because there are no network calls.
The very obvious problem is that this isn’t what replication is for. It’s really easy to corrupt or break a slave, especially on machines that are frequently rebooted (sometimes unexpectedly), like the laptops and netbooks my software runs on. And since we have zero privileges on the client machine, a broken slave is out of the question. I personally love replication, but there’s always a lot of human intervention when things break, so it doesn’t fit here.
Is there some pre-existing alternative here that’s robust? I was thinking about a system where a large dump is rebuilt at install time. Then my C#.NET service fills in the gap from the last update until current time whenever it has a network connection.
It won’t retroactively apply updates like a slave; it won’t do anything a slave does. It will only add new rows from an ever-growing remote host. These limitations are well within the bounds of acceptable. The appeal is that this .NET “rdbms manager” could be really small, thus minimizing the places where errors can occur, which seems like a good swap for all the unneeded replication functionality I am giving up.
Am I missing something here, or is there a better alternative? Thanks.
Regarding writing your own, you could definitely do that since your requirements are so narrow. If they’re very unlikely to change, it may even be better. You’d have to take care with the design, of course, so that any interruptions simply result in re-attempting later.
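To make the "interruptions simply result in re-attempting later" point concrete, here is a minimal sketch of an append-only catch-up loop. It is written in Python with sqlite3 rather than C# and MySQL purely so it is self-contained. The table name history, its id and payload columns, and the functions catch_up and make_db are all hypothetical. It also assumes every row has a monotonically increasing integer id, which the question does not state.

```python
import sqlite3


def catch_up(remote, local, table="history", batch_size=500):
    """Copy rows with an id above the local high-water mark, one batch at a time.

    `table` is assumed to be a trusted constant (it is interpolated into SQL).
    Returns the number of rows copied during this call.
    """
    copied = 0
    while True:
        # The high-water mark is derived from the local data itself, so a run
        # that was cut off mid-way needs no extra bookkeeping to resume.
        (last_id,) = local.execute(
            f"SELECT COALESCE(MAX(id), 0) FROM {table}"
        ).fetchone()
        rows = remote.execute(
            f"SELECT id, payload FROM {table} WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return copied
        # `with local` commits the whole batch or rolls it back on error,
        # so a crash never leaves a half-written batch behind.
        with local:
            local.executemany(
                f"INSERT INTO {table} (id, payload) VALUES (?, ?)", rows
            )
        copied += len(rows)


def make_db(rows):
    """Demo helper: an in-memory database pre-filled with (id, payload) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, payload TEXT)")
    db.executemany("INSERT INTO history VALUES (?, ?)", rows)
    db.commit()
    return db
```

The caller would wrap catch_up in whatever retry policy suits it: if the connection drops, the exception propagates, nothing partial is committed, and the next attempt picks up from the last fully written batch.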
As for stuff already out there for highly configurable data synchronization, I’ve been using SymmetricDS. It’s very resilient to interruptions and works well with slow connections. Since you specify MySQL, it would only work with version 5 and up, because it is based on triggers. But it’s an option to consider.
A bit on SymmetricDS configuration, because I really can’t answer your comment briefly:
Aside from a properties file that gives the service information like port, database driver and connection info, registration node url, self url, etc., the configuration for what to replicate and where to send it is all in the database (default table prefix sym_). Even most of the stuff you can put in the properties file can be put in the sym_parameter table.
All replication config is done at the registration node (usually also the central/top-tier node). Changes are transmitted just like changes to data, with child nodes re-syncing their triggers automatically. I’m going to tersely go through a very basic config for a 2-tier setup (central and stores), 1 table bi-directional. I won’t get into the nodes, registration, initial loads, or other management, though.
The following statements are pretty simple. If you just read through them, it’s apparent that they are part of defining the relationship between nodes of the ‘central’ group (or tier) and nodes of the ‘stores’ group. The routers are a key part of the replication config and define how captured data events are routed, and to where. Each has identity in the name here because the default is to use the table’s primary keys and to send to all nodes of the target group.
The following is where we get into the specifics about the table we want to replicate. A channel is used to isolate groups of tables. If there’s a problem batching data events for something in one channel, it doesn’t affect other channels. You can also suspend or ignore batching for entire channels. The trigger entry simply says, “I want to capture data events from this table”, and the sync_on_incoming_batch value of 1 is special because that is what will allow a change at a store to be replicated to central and then down to all the other stores. Then you create trigger/router associations to complete the relationship between capturing data events and sending those events to other nodes: one for sending changes from store to central, and one for the other way.
There are a number of columns on these tables that I don’t show that allow you very fine control over the replication. All tables and columns are described in Appendix A of the user manual, though.
It’s not too hard to install, either, just a bit manual when you’re learning it. I create the configurations we use for clients, but I made silent-install scripts for the techs to use to get a client going in a couple steps. Another script starts an initial load of the client’s database uploading to the central database (if vice-versa, I do that at the central database).
You could silently install Java and SymmetricDS (it does come with a way to install it as a Windows service). Each node must have a unique id, so you’d have to partially generate the properties file, including the information for connecting to the local database (the manual talks about what privileges are needed, I think).
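As a rough illustration of "partially generate the properties file", an installer step could look something like the sketch below. The function build_node_properties and the store- id format are hypothetical. The property keys follow SymmetricDS 3.x naming as I remember it, so treat them as an assumption: older releases use different keys, and the manual for your version is the authority.

```python
import uuid


def build_node_properties(registration_url, db_url, db_user, db_password,
                          node_id=None, group="stores"):
    """Return (node_id, properties_text) for one client installation.

    Key names are an assumption based on SymmetricDS 3.x; verify them
    against the documentation for the version you actually deploy.
    """
    # A random suffix gives each machine its own id without needing central
    # to hand one out first.
    node_id = node_id or f"store-{uuid.uuid4().hex[:12]}"
    props = {
        "engine.name": node_id,
        "group.id": group,
        "external.id": node_id,
        "registration.url": registration_url,
        "db.driver": "com.mysql.jdbc.Driver",
        "db.url": db_url,
        "db.user": db_user,
        "db.password": db_password,
    }
    # Java .properties files treat a backslash as an escape character,
    # so double any that appear in values.
    lines = [f"{key}={str(value).replace(chr(92), chr(92) * 2)}"
             for key, value in props.items()]
    return node_id, "\n".join(lines) + "\n"
```

The installer would write the returned text to the engine's properties file and keep the node_id around, since central needs to know it if you are not using open registration.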
You could have open registration at the central database so that any machine can register; otherwise central must have entries in sym_node and sym_node_security for enabling registration for a known node_id before the node attempts to register.
You can go ahead with the idea of having an initial script of database data run by the installer into the client’s local database. When you do an initial load from central down to the node, it will update existing rows, or insert if not found. However, the trigger/router associations have an initial_load_select column: you can define a select statement to limit the data sent to only what you know is not in the installation script.
Getting central to start an initial load from a remote client installation might need the assistance of another service running at central that the installation can send requests to; the service then makes the change to the central database to start that initial load. I don’t know yet of a way for a node to request an initial load from the parent node. Such a service could also easily facilitate registration if you don’t want to use open registration (the installer sends the node_id, and the service inserts 2 rows to enable registration).