The main strategies to read a table from a database in Ascend are:
- Full Resync (single process or in parallel)
- Incremental Column
- Table Snapshot (single process or in parallel)
In this guide, we'll discuss the differences among these strategies, using a Postgres Database as an example.
With Full Resync, all the data in a table is loaded and stored in a single Ascend partition. When the Read Connector is refreshed, the full table is read in again in a single process, and any downstream Transform components will process the dataset.
Note: In order to read two tables, two connectors are needed, but they can reuse the same connection.
At refresh time a Full Resync is performed in parallel. The developer provides a partitioning column that is of type time, date, or int and this column will be used by Spark to parallelize the loading of the dataset. Spark will create an RDD (to run parallel operations) for each of: (upper bound - lower bound)/(max number of connections). The Spark executors each operate on an RDD, creating parallelism. This replication strategy keeps the simplicity of a full resync but the parallelism enables performance gains and avoids out of memory issues.
On every refresh, Ascend will create a new partition that represents all the records that have changed since the previous refresh. You provide the column to use to determine which records changed and it must be of type integer, timestamp, or datetime.
For example, with a date column, if the existing partitions only have data up until 2021-08-26, the read component will look for data where the incremental column has a value greater than 2021-08-26, and load that data into a new partition.
Note 2: When connectors are refreshed, only new or updated partitions are marked for execution. Hence, transform blocks process only these partitions. In your downstream processing of these partitions, you can combine the records to take the latest, create a dataset that tracks change over time, etc.
Data from Write-Ahead Logs (WALs) can also be read in through a Read Connector. Support for this method will vary by database.
At refresh time, one new partition is created for all the new WAL entries since the last refresh. For Postgres and other database systems, the WAL will provide logs for all tables, so the logs will need to be filtered down in a Transform component for the specific table you need logs for. A tutorial that explains how to use this functionality can be found here:
Note: To use CDC, additional steps to setup the database are needed. For instance, in https://developer.ascend.io/docs/postgresql, we see how to create a replica slot for Postgres.
Table snapshots are used when you want to keep the historical values. This strategy is useful for keeping a full copy of the table on every refresh and maintaining historical integrity. Table snapshots are also used to slowly build changing dimension tables. This is helpful for tables where CDC or incremental loads are not possible, but you need to have the historical values. In this example, you would take a table snapshot for every load, creating a copy of the data prior to refresh.
In Ascend, a Full Resync is performed when a component refreshes. The result is stored in a new partition and the snapshot time is stored in the field
At refresh time a Full Resync Parallel is performed. This strategy creates the same output as Table Snapshot but performs the snapshotting in parallel. The parallel configuration and behavior is the same as in the Full Resync in Parallel replication strategy.
Updated about 1 month ago