Component Logs

Ascend components retain the Spark logs per partition of data processed. The UI surfaces these logs under the "Partitions" tab, where developers can view them directly or download the log files. Developers can also fetch or download logs programmatically through the Ascend SDK.

Logs Availability

Ascend manages Spark logs for Read Connectors and Spark transformations (SQL, PySpark). Write Connectors and Read Connectors (Legacy) do not currently support logging.

Ascend also surfaces logs for in-progress Spark jobs, so you can view them while a job is running and refresh to see the most up-to-date output.

Accessing Logs from the UI

  1. Locate a component on the Dataflow that you wish to view logs for.
  2. Open the component detail view.
  3. Navigate to the "Partitions" tab.
  4. Under the "Logs" column, click on either View to open the logs in a new browser tab or Download to download a folder of the log files.

Instrumenting Logs in PySpark

Developers can instrument their own logging in a PySpark transform and have the log statements appear in the partition logs. This logging must go through the Ascend logging interface in order to add the correct logging labels and route to the correct partition.

A small example:

from pyspark.sql import DataFrame, SparkSession
from typing import List
import ascend.log as log  # Ascend's logging interface

def transform(spark_session: SparkSession, inputs: List[DataFrame], credentials=None) -> DataFrame:
    df = inputs[0]
    log.info("I am logging!")  # appears in this component's partition logs
    return df

The log module provides functions compatible with the Python glog package. For a full reference of the ascend.log module, see the Ascend Log Module Reference below.

The logger stores the labels it needs in thread-local variables. By default, logs from the Spark driver are collected, but logs from the executors are not. You can propagate the labels to executors by using get_log_label and set_log_label and threading the label through your code, as sketched below. However, instead of logging from Spark executors, it is often better to append a result column to the record outputs to keep log volume constrained.
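
The sketch below shows one way to thread the label from the driver to the executors. It assumes that get_log_label and set_log_label are exposed by the same ascend.log module and that they return and accept a single label value; check the SDK reference for the exact signatures before relying on this pattern.

from pyspark.sql import DataFrame, SparkSession
from typing import List
import ascend.log as log

def transform(spark_session: SparkSession, inputs: List[DataFrame], credentials=None) -> DataFrame:
    df = inputs[0]
    label = log.get_log_label()  # assumed helper: capture the label on the driver

    def process_partition(rows):
        log.set_log_label(label)  # assumed helper: re-attach the label on the executor
        count = 0
        for row in rows:
            count += 1
            yield row
        log.info(f"processed {count} rows in this partition")

    return spark_session.createDataFrame(df.rdd.mapPartitions(process_partition), df.schema)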

Ascend Log Module Reference

Reference | Description | Example
debug | Function to log with level DEBUG. | debug("debug statement")
info | Function to log with level INFO. | info("info statement")
warning | Function to log with level WARNING. | warning("warning statement")
warn | Function to log with level WARNING. | warn("warning statement")
error | Function to log with level ERROR. | error("error statement")
exception | Function to log with level EXCEPTION. | exception("exception statement")
fatal | Function to log with level FATAL. | fatal("fatal statement")
log | Function to log that takes the logging level as its first argument and the log message as its second. Log levels are found in the native Python logging module. | log(logging.INFO, "info statement")
setLevel | Function to set the severity threshold of messages to emit. Logs at or above this threshold will be emitted. Log levels are found in the native Python logging module. | setLevel(logging.INFO)
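
A brief illustration of a few of these calls, written as lines you might place in the body of the transform shown earlier (the log levels come from Python's standard logging module):

import logging
import ascend.log as log

log.setLevel(logging.DEBUG)              # emit DEBUG and above
log.debug("computed row counts")
log.warning("input partition is empty")
log.log(logging.INFO, "finished step 2")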

Logs in Blob Storage

Ascend also writes logs to the default bucket for your environment. Currently, logs are uploaded every minute, although some delay can occur between when a log line is produced and when it is uploaded. This applies regardless of which blob storage service backs your environment (for example, AWS S3, Google Cloud Storage, or Azure Blob Storage).

Global Logs

To find the global log information for a given job, check the logs for the same cluster_id where job_id=none in the time range of interest. The bucket path looks like this:
logs/spark/year=<year>/month=<month>/day=<day>/cluster_id=<spark_cluster_id>/job_id=none/.

Ex.: gs://ascend-io-demo-gcp-default/logs/spark/year=2022/month=08/day=18/cluster_id=sparkworker-1660782753-4d171028/job_id=none/exec-1_2022-08-18-00-30_0.json.gz

📘 Note: JSON objects are written on a single line. We've expanded the two examples below so we can describe the log message.

{
  "log": "[72.034s][info   ][gc            ] GC(37) Concurrent Cycle 174.831ms",
  "pod_id": "sparkworker-1660782753-4d171028",
  "job_id": "none",
  "role": "driver",
  "level": "UNKNOWN",
  "timestamp": "2022-08-18T00:33:47.994+0000",
  "cluster_id": "sparkworker-1660782753-4d171028"
}

Key | Description
log | The original log message.
pod_id | The Kubernetes pod that produced the log.
job_id | Object path. For global logs, this always has a value of "none".
role | The role of the pod in the Spark cluster, such as driver or an executor (for example, exec-1).
level | If the log is parseable, this is the level it was parsed to; otherwise UNKNOWN.
timestamp | The timestamp assigned to the log when it arrived at the collector.
cluster_id | The ID of the Spark cluster. It is also the prefix of the pod_id.
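
Because each line in a downloaded log file is a standalone JSON object, the files are straightforward to inspect locally. A minimal sketch, assuming you have already copied one of the gzipped files out of the bucket (the file name below is taken from the example above):

import gzip
import json

path = "exec-1_2022-08-18-00-30_0.json.gz"  # a log file copied from the bucket

with gzip.open(path, "rt") as f:
    for line in f:                 # one JSON object per line
        entry = json.loads(line)
        print(entry["timestamp"], entry["level"], entry["log"])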

Specific Job Logs

Specific job logs have additional keys that are specific to a component. The bucket path looks like this:
logs/spark/year=<year>/month=<month>/day=<day>/cluster_id=<spark_cluster_id>/job_id=<job_id>/exec-<timestamp>.

Ex.: gs://ascend-io-demo-gcp-default/logs/spark/year=2022/month=08/day=18/cluster_id=sparkworker-1660782753-4d171028/job_id=5d45774a-2fd6-4dae-8e29-81af7d44483c/driver_2022-08-18-00-30_0.json.gz

{
  "log": "22/08/18 00:33:33 WARN udf_1f8bc55eba5f49f5bb02d9f381838668.py:4 example udf log",
  "pod_id": "sparkworker-1660782753-4d171028-exec-1",
  "job_id": "5d45774a-2fd6-4dae-8e29-81af7d44483c",
  "role": "exec-1",
  "level": "WARN",
  "timestamp": "2022-08-18T00:33:34.975+0000",
  "cluster_id": "sparkworker-1660782753-4d171028",
  "data_service_id": "_ascend",
  "dataflow_id": "test_graph_deploy_1660157043467288064",
  "component_id": "PySpark_Reduction_with_Inner_Join"
}

Job logs have the same JSON keys as global logs with the following differences and additions:

Key | Description
job_id | Unlike global logs, this key has a specific value (the ID of the job).
data_service_id | The ID of the Data Service.
dataflow_id | The ID of the Dataflow.
component_id | The ID of the Component.
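
These extra keys make it easy to narrow a log file down to a single component. A small sketch, using the file name and component ID from the example above:

import gzip
import json

path = "driver_2022-08-18-00-30_0.json.gz"  # a job log file copied from the bucket

with gzip.open(path, "rt") as f:
    entries = [json.loads(line) for line in f]

# Keep only the entries produced by one component (ID taken from the example above)
for entry in entries:
    if entry.get("component_id") == "PySpark_Reduction_with_Inner_Join":
        print(entry["timestamp"], entry["level"], entry["log"])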

Ascend-Hosted Environments (AWS Only)

For Ascend-hosted environments, we currently write Spark logs to an AWS S3 bucket. To access the logs, contact support to configure permissions; this requires that you have an AWS account.

  1. First, we need the ID of the main AWS account you want to access logs from. We will take that Account ID and provision access for it.
  2. When access has been provisioned, we'll send you a role ARN and a logs bucket name. The ARN will look like this (or similar):
    arn:aws:iam::<ASCEND_ENVIRONMENT_ACCOUNT_ID>:role/ascend-io-<subdomain>-logs-read

Once permissions are configured, you can use any method of AWS role assumption to access the logs. An easy way is to add an entry to your AWS config file like the one below, where role_arn is the ARN we provided. Once that profile entry is added, you can use aws s3 commands to list and copy files out of the bucket.

[profile aws-demo-logs]
region = <ASCEND_ENVIRONMENT_REGION>
role_arn = arn:aws:iam::<ASCEND_ENVIRONMENT_ACCOUNT_ID>:role/<ASCEND_LOGS_ACCESS_ROLE_NAME>
source_profile = <YOUR_SOURCE_PROFILE>
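
If you prefer to script the access, any AWS SDK that honors the shared config file can assume the role through the same profile. A minimal boto3 sketch, assuming the profile above and using a placeholder for the bucket name Ascend provides:

import boto3

# The profile name matches the config entry above; the bucket name is a placeholder
session = boto3.Session(profile_name="aws-demo-logs")
s3 = session.client("s3")

response = s3.list_objects_v2(Bucket="<ASCEND_LOGS_BUCKET_NAME>", Prefix="logs/spark/")
for obj in response.get("Contents", []):
    print(obj["Key"])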

🚧 For enterprise security customers on AWS and Azure, origin-based access controls apply to all buckets, including the logs bucket. To access logs from the bucket, please send Ascend a request with the IPs (or, in AWS, the VPC Endpoints) that you would like to permit access from.