The main strategies to read a table from a database in Ascend are:
- Full Resync (single process or in parallel)
- Incremental Column
- Table Snapshot (single process or in parallel)
In this guide, we'll discuss the differences among these strategies, using a Postgres Database as an example.
With Full Resync, all the data in a table is loaded and stored in a single Ascend partition.
When the Read Connector is refreshed, the full table is read in again in a single process, and any downstream Transform components will process the dataset.
Note: In order to read 2 tables, 2 connectors are needed, but they can reuse the same connection.
At refresh time a Full Resync is performed in parallel. The developer provides a partitioning column that is of type time, date, or int and this column will be used by Spark to parallelize the loading of the dataset.
Spark will create an RDD (to run parallel operations) for each of: (upper bound - lower bound)/(max number of connections). The Spark executors each operate on an RDD, creating parallelism.
This replication strategy keeps the simplicity of a full resync but the parallelism enables performance gains and avoids out of memory issues.
On every refresh, Ascend will create a new partition that represents all the records that have changed since the previous refresh. You provide the column to use to determine which records changed and it must be of type integer, timestamp, or datetime.
For example, with a date column, if the existing partitions only have data up until 2021-08-26, the read component will look for data where the incremental column has a value greater than 2021-08-26, and load that data into a new partition.
Note 2: When connectors are refreshed, only new or updated partitions are marked for execution. Hence, transform blocks process only these partitions. In your downstream processing of these partitions, you can combine the records to take the latest, create a dataset that tracks change over time, etc.
Data from Write-Ahead Logs (WALs) can also be read in through a Read Connector. Support for this method will vary by database.
At refresh time, 1 new partition is created for all the new WAL entries since the last refresh. For Postgres and other database systems, the WAL will provide logs for all tables, so the logs will need to be filtered down in a Transform component for the specific table you need logs for. A tutorial that explains how to use this functionality can be found here:
Note: To use CDC, additional steps to setup the database are needed. For instance, in https://developer.ascend.io/docs/postgres, we see how to create a replica slot for Postgres.
At refresh time a Full Resync is performed.
The result is stored in a new partition and the snapshot time is stored in the field
This strategy is useful to keep a full copy of the table on every refresh and maintain historical integrity or build slowly changing dimension tables.
At refresh time a Full Resync Parallel is performed.
This strategy creates the same output as Table Snapshot but performs the snapshotting in parallel. The parallel configuration and behavior is the same as in the Full Resync in Parallel replication strategy.
Updated 12 months ago