Blob Storage Read Connector Parsers

Some Read Connectors offer a choice of parsers. Of these, the following parsers have additional required or optional properties.

CSV Parser

The CSV parser provides a range of customizable parameters. Each parser for a Read Connector has specific parameters.

CSV Parser Fields

All fields available for the CSV parser are optional. We'll use the following text example to discuss each field:

@Album released in 1967 
artist,album,track,verse
The Beatles, Magical Mystery Tour, "Hello, Goodbye", "You say, /"Yes/", I say, /"No/" !/I say, /"Yes/", 
but I may mean, /"No/"!/"
Field: Description
FIELD DELIMITER: The character that separates fields within a record.

Ex. Delimiter = ,
FILES HAVE A HEADER ROW: Tells the parser to use the first line as the column names in the schema. Selected by default. Deselect it if the first row does not contain column names.

Ex. Header Row = artist,album,track,verse
COMMENT PREFIX: A single character or symbol. Lines that begin with this character are skipped. Disabled by default.

Ex. @Album released in 1967
(In this example, the comment prefix is @.)
QUOTE CHARACTER: A single character used to quote values so that the field delimiter can be part of a value. To turn off quoting, set the value to an empty string.

Ex. track = "Hello, Goodbye"
(In this example, the quote character is ".)
ESCAPE CHARACTER: A single character used for escaping quotes inside an already quoted value.

Ex. verse = "You say, /"Yes/", I say, /"No/" !/I say, /"Yes/",
but I may mean, /"No/"!/"

(In this example, the escape character is /.)
ENABLE MULTILINE VARIABLES: This checkbox tells the parser that the ingested data may contain multi-line fields. In most CSV files, each record sits on its own line, with fields separated by a delimiter, but a quoted field can sometimes span several lines.

Ex. verse = "You say, /"Yes/", I say, /"No/" !/I say, /"Yes/",
but I may mean, /"No/"!/"
(In this example, the verse is continued on a new line.)
DATE FORMAT: Format of date type fields in the CSV file. See Datetime Patterns for Formatting and Parsing.
TIMESTAMP FORMAT: Format of timestamp type fields in the CSV file. Specify the timestamp format if known. See Datetime Patterns for Formatting and Parsing.
ENCODING: The encoding used to decode the CSV files. Defaults to UTF-8, the most common encoding. One known exception is files created on Windows, which can require UTF-16 instead.
LINE SEPARATOR: The line separator to use for parsing. Defaults to \n (the newline character).
CHAR TO ESCAPE QUOTE ESCAPING: A single character used to escape the escape character itself, so that the escape character is kept as a literal character.

Ex. verse = "You say, /"Yes/", I say, /"No/" !/I say, /"Yes/",
but I may mean, /"No/"!/"
(In this example, the lyrics use a forward slash to mark the addition of background lyrics. Placing an exclamation mark (!) before the forward slash preserves the forward slash in the string. The character to escape quote escaping is !.)
MODE: Sets the mode for handling corrupt records during parsing. Mode options are PERMISSIVE, DROPMALFORMED, and FAILFAST.

PERMISSIVE: when the parser encounters a corrupted record, it puts the malformed string into the field configured by columnNameOfCorruptRecord (default column name is '_corrupt_record') and sets the malformed fields to null. To keep corrupt records, manually add a string column with the name given by columnNameOfCorruptRecord to the user-defined schema. If the schema does not define this column, corrupt records are dropped.

DROPMALFORMED: ignores corrupted records entirely. This mode is unsupported in the CSV built-in functions.

FAILFAST: throws an exception when processing corrupted records.
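
To make the field settings above concrete, here is a minimal sketch that uses Python's standard csv module rather than Ascend's own parser. The function name parse_csv and its parameters are hypothetical, and skipinitialspace is an assumption needed because the sample has a space after each comma. The sample is simplified: the !/ sequence is left out, since the csv module has no counterpart to CHAR TO ESCAPE QUOTE ESCAPING, and comment lines are filtered by hand because the module has no comment option.

```python
import csv
import io

# Simplified version of the sample text (no "!/" sequence).
SAMPLE = (
    '@Album released in 1967\n'
    'artist,album,track,verse\n'
    'The Beatles, Magical Mystery Tour, "Hello, Goodbye", '
    '"You say, /"Yes/", I say, /"No/",\n'
    'but I may mean, /"No/""\n'
)


def parse_csv(text, delimiter=",", quote='"', escape="/",
              comment_prefix="@", has_header=True):
    """Hypothetical helper mirroring the CSV parser fields."""
    # COMMENT PREFIX: drop lines that start with the prefix.
    # Simplification: a continuation line of a multi-line field that
    # happens to start with the prefix would be dropped as well.
    lines = (
        line for line in io.StringIO(text)
        if not (comment_prefix and line.startswith(comment_prefix))
    )
    reader = csv.reader(
        lines,
        delimiter=delimiter,    # FIELD DELIMITER
        quotechar=quote,        # QUOTE CHARACTER
        escapechar=escape,      # ESCAPE CHARACTER
        doublequote=False,      # quotes are escaped, not doubled
        skipinitialspace=True,  # assumption: ignore the space after a delimiter
    )
    # Quoted fields may span lines, so the reader joins them into one row.
    rows = list(reader)
    if has_header:
        # FILES HAVE A HEADER ROW: the first row supplies the column names.
        header, rows = rows[0], rows[1:]
        return [dict(zip(header, row)) for row in rows]
    return rows


records = parse_csv(SAMPLE)
print(records[0]["track"])  # Hello, Goodbye
```

The verse field comes back as a single string containing a newline, which is the behavior that ENABLE MULTILINE VARIABLES describes.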

Excel Parser

All fields for the Excel parser are optional.

Excel Parser Fields

Field: Description
NUMBER OF HEADER ROWS: The number of header rows in the sheet being parsed. If no number is given, Ascend assumes the first row is the header row.
SHEET NAME: The name of the sheet to use. If no sheet is selected, the first sheet of the Excel file is used.

JSON Parser

All fields for the JSON parser are optional.

JSON Parser Fields

Field: Description
DATE FORMAT: Format of date type fields in the JSON file. See Datetime Patterns for Formatting and Parsing.
TIMESTAMP FORMAT: Format of timestamp type fields in the JSON file. Specify the timestamp format if known. See Datetime Patterns for Formatting and Parsing.
ENCODING: The encoding of the JSON file. Defaults to UTF-8.
LINE SEPARATOR: The line separator to use for parsing. When multiline is disabled, this field is required for input files that are not UTF-8 encoded.
ENABLE MULTILINE VALUES: Parses one record per file, where the record may span multiple lines. When enabled, each file is processed as a whole instead of being parallelized by line, which may increase processing time.
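
The difference between line-by-line parsing and multiline parsing can be sketched with Python's standard json module. This is an illustration under stated assumptions, not Ascend's implementation: parse_json and its parameters are hypothetical names chosen to mirror the fields above.

```python
import json


def parse_json(data, encoding="utf-8", multiline=False, line_separator="\n"):
    """Hypothetical helper mirroring the JSON parser fields."""
    # ENCODING: decode the raw bytes first.
    text = data.decode(encoding)
    if multiline:
        # ENABLE MULTILINE VALUES: the whole file is one record,
        # which may span many lines.
        return [json.loads(text)]
    # Otherwise every non-empty line is parsed as its own record.
    return [
        json.loads(line)
        for line in text.split(line_separator)
        if line.strip()
    ]


# Line-delimited input: two records, one per line.
jsonl = b'{"artist": "The Beatles", "year": 1967}\n{"artist": "The Kinks", "year": 1967}\n'
print(len(parse_json(jsonl)))  # 2

# Pretty-printed input: one record spread over several lines.
pretty = b'{\n  "artist": "The Beatles",\n  "year": 1967\n}'
print(parse_json(pretty, multiline=True))
```

Parsing the pretty-printed input with multiline disabled fails in this sketch, because no single line is a complete JSON document.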

Python Parser

Python Parser Fields

Field (Required/Optional): Description
PARSER CODE INTERFACE (Required): Currently available option: Byte Stream with Callback Function.
CODE FINGERPRINT STRATEGY (Required): Currently available option: Automatic content-based fingerprint.
CODE (Required): See below for a code sample highlighting the required parameters.
ADDITIONAL PIP PACKAGES TO INSTALL (Optional): Any additional pip packages to install.

Below is a code sample included within Ascend's Python Parser Function.

# required interface: parser_function(reader, on_next)
def parser_function(reader, on_next):
  # read the first chunk of up to 1024 bytes
  chunk = reader.read(1024)
  while chunk != b"":
    # callback with the parsed record;
    # in this case, up to 1024 bytes per 'chunk', encoded in hexadecimal
    on_next({'chunk': chunk.hex()})
    # read the next chunk
    chunk = reader.read(1024)
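
A parser function like the sample above can be exercised locally before it is pasted into the CODE field. The sketch below assumes, based only on the sample, that reader is a binary file-like object with a read(n) method and that on_next accepts one dictionary per record. The run_locally helper is hypothetical and not part of Ascend.

```python
import io


def parser_function(reader, on_next):
    # Same logic as the sample: emit hex-encoded chunks of up to 1024 bytes.
    chunk = reader.read(1024)
    while chunk != b"":
        on_next({"chunk": chunk.hex()})
        chunk = reader.read(1024)


def run_locally(data):
    """Hypothetical harness: feed bytes to the parser and collect records."""
    records = []
    # io.BytesIO stands in for the byte stream; list.append is the callback.
    parser_function(io.BytesIO(data), records.append)
    return records


records = run_locally(b"\x00" * 1500)
print(len(records))  # 2
```

Here 1500 bytes of input produce two records, one for the first 1024 bytes and one for the remaining 476.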