Google Cloud Pub/Sub Read Connector

Create a Google Cloud Pub/Sub Read Connector

Now that we have created a Pub/Sub Connection, we can create a Read Connector to ingest data from our Pub/Sub stream. Select the connection you created, or any other Pub/Sub connection in the list. Click "View" to view, edit, or delete the connection, or click "Use" to create a connector from it.

You will then see a list of topics within your Google Cloud project. Select the topic you want your connector to subscribe to and press CONFIRM.

Connector Info

  • Name (required): The name used to identify this connector.

  • Description (optional): Description of what data this connector will read.

  • BROWSE CONNECTION: Click this button to view the topics within your Google Cloud project. Select the topic you want to ingest and press CONFIRM.

Connector Configuration

  • Project (Override): Use this field to connect to topics in a different project that your credentials can access. If left blank, the connector connects only to topics within the GCP project entered when creating the Pub/Sub connection.
  • Topic (required): The topic you want to subscribe to. This field is filled in automatically with the topic you selected in the previous step; you may also fill it out manually.
  • Number of Sub-Partitions (required): Not to be confused with Ascend partitions, this value represents the number of Spark partitions used to read messages in parallel.
  • Approx. Message Count Per Sub-Partition (required): The approximate number of messages to read with each sub-partition.

📘

Managing throughput with Pub/Sub Read Connectors

The two fields "Number of Sub-Partitions" and "Approx. Message Count Per Sub-Partition" can be used to manage the total number of messages read in on each refresh.

For example, let's say you expect to read in 1,000,000 messages with each refresh. You can set your connector to have 10 sub-partitions, and read in 100,000 messages per sub-partition. Ascend will then create 10 Spark partitions, and each partition will continue to read messages until it has read 100,000 messages, or until there are no more messages in the topic to pull. By specifying multiple sub-partitions, Spark can distribute this work across multiple executors for faster reading.
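Expressed as arithmetic, the two settings simply bound the per-refresh message count. A rough sketch (the variable names below are illustrative, not Ascend identifiers):

```python
# A rough sketch of the throughput ceiling implied by these two settings.
# Variable names are illustrative, not Ascend identifiers.

num_sub_partitions = 10             # "Number of Sub-Partitions"
msgs_per_sub_partition = 100_000    # "Approx. Message Count Per Sub-Partition"

# Upper bound on messages ingested in a single refresh; each Spark partition
# stops early if the topic runs out of messages first.
max_messages_per_refresh = num_sub_partitions * msgs_per_sub_partition
print(max_messages_per_refresh)  # 1000000
```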

Generate Schema

Once you click the GENERATE SCHEMA button, the parser will create a schema.

NOTE: Every Pub/Sub Connector will have the same schema. To learn more about the schema of Pub/Sub messages, click here.
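For reference, the standard fields of a Pub/Sub message can be inspected directly with Google's official Python client. A minimal sketch, using placeholder project and subscription names; it illustrates the underlying message structure, not Ascend's ingestion code:

```python
# A minimal sketch using the official google-cloud-pubsub client to pull one
# message and print its standard fields. "my-project" and "my-subscription"
# are placeholders.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 1}
)
for received in response.received_messages:
    msg = received.message
    print(msg.message_id)        # unique ID assigned by Pub/Sub
    print(msg.publish_time)      # server-side publish timestamp
    print(msg.data)              # message payload (bytes)
    print(dict(msg.attributes))  # optional key/value metadata
    # Acknowledge so the message is not redelivered to this subscription.
    subscriber.acknowledge(
        request={"subscription": subscription_path, "ack_ids": [received.ack_id]}
    )
```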

Refresh Schedule

The refresh schedule specifies how often Ascend checks the data location to see if there's new data. Ascend will automatically kick off the corresponding big data jobs once new or updated data is discovered.

Component Pausing

Update the status of the read connector by marking it Running to keep it active, or Paused to stop it from running.

Processing Priority (optional)

When resources are constrained, Processing Priority will be used to determine which components to schedule first.

Higher priority numbers are scheduled before lower ones. Increasing the priority on a component also causes all its upstream components to be prioritized higher. Negative priorities can be used to postpone work until excess capacity becomes available.

How Reading Works

Subscriptions

For each read connector you create, Ascend creates a subscription in your GCP project named ascend_subscription_<UUID>. This is the subscription your connector uses to read messages.
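If you want to confirm this subscription from your side, you can list the subscriptions in the project with Google's official Python client. A minimal sketch, assuming a placeholder project ID:

```python
# A minimal sketch of how you could verify the connector's subscription
# exists, using the official google-cloud-pubsub client.
# "my-project" is a placeholder project ID.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()

for subscription in subscriber.list_subscriptions(
    request={"project": "projects/my-project"}
):
    # Read connector subscriptions follow the ascend_subscription_<UUID> pattern.
    if "ascend_subscription_" in subscription.name:
        print(subscription.name)
```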

Snapshots

At each refresh, the connector creates a snapshot of the subscription before reading in any messages. Once the messages have been read in and successfully persisted in Ascend, the snapshot is deleted. This ensures that, if there are any read failures, the subsequent run or retry can seek to that snapshot and read in the messages that were missed. To learn more about snapshots with Pub/Sub, click here.
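The snapshot-then-read pattern described above can be illustrated with Google's official Python client. A minimal sketch with placeholder resource names; Ascend performs these steps internally, so this only shows the underlying Pub/Sub API calls:

```python
# A minimal sketch of the snapshot-then-read pattern described above, using
# the official google-cloud-pubsub client. Resource names are placeholders,
# and Ascend manages these calls itself.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")
snapshot_path = subscriber.snapshot_path("my-project", "my-snapshot")

# 1. Capture the subscription's unacknowledged-message state before reading.
subscriber.create_snapshot(
    request={"name": snapshot_path, "subscription": subscription_path}
)

try:
    # 2. ...pull messages and persist them (elided)...
    pass
except Exception:
    # 3. On a read failure, rewind the subscription to the snapshot so the
    #    next run or retry receives the missed messages again.
    subscriber.seek(
        request={"subscription": subscription_path, "snapshot": snapshot_path}
    )
    raise
else:
    # 4. Once messages are safely persisted, the snapshot is no longer needed.
    subscriber.delete_snapshot(request={"snapshot": snapshot_path})
```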