Request a Trial!
The sample dataflow mentioned in this document is only available to trial users. You can request a trial of the Ascend platform here!
- SQL + PySpark: Modular transforms in both languages with complete lineage and dependency management.
- PySpark ML: Sophisticated data science and analytics techniques with minimal code and easy repeatability.
- Time-Based Partitioning & Scale: Ingesting time-series data allows for efficient updates of only new data, even on the order of petabytes or billions of rows.
- Data Comparison: Joining transforms makes data differences easy to analyze, debug, and visualize.
Data-driven business decisions require efficient pattern detection on real-time, large-scale data sources. In our sample dataflow, imagine that we are a hypothetical IoT company that ingests about half a million rows of IoT device usage and weather data every ten minutes, amounting to roughly a couple of petabytes every week. The goal of this company's data platform and data science teams is to detect patterns within this live dataflow, providing a foundation for business and engineering decisions around reducing IoT device energy consumption.
- Full Automation
Ascend provides full automation that requires no code for complicated scheduling, alerting, and maintenance. Furthermore, Ascend's automation framework automatically retries when data sources such as public API endpoints return transient errors.
- Zero Infrastructure
Ascend fully manages all infrastructure, so our users never have to provision or maintain servers and storage.
- Cross Platform
Ascend brings together data from different clouds, APIs, and data systems for users to explore with either SQL or PySpark. Furthermore, the Ascend backend can be hosted on any major public cloud, including AWS, GCP, and Azure.
This dataflow has a single read connector that connects to an S3 bucket which is being updated with new data in real time. The read connector itself refreshes every hour to pull in new data from the S3 bucket. With Ascend's intelligent partitioning and updates, partitions are created based on the organization of files in the S3 bucket. These are time-based partitions such that each partition contains time-series data for a ten-minute interval. Whenever the read connector refreshes to update itself with new data, only the new partitions are pulled. Existing partitions that have already been loaded into Ascend do not need to be touched. This keeps the dataflow efficient: any data that has not changed since the last update is never recomputed or reloaded.
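As a rough sketch of the idea in plain Python (the key layout and file names below are hypothetical, not Ascend's actual implementation), a time-partitioned refresh only pulls the partitions it has not seen before:

```python
from datetime import datetime, timedelta

def new_partition_keys(existing_keys, available_keys):
    """Return only the partition keys not yet loaded, so a refresh
    skips data that has already been ingested."""
    existing = set(existing_keys)
    return sorted(k for k in available_keys if k not in existing)

def ten_minute_keys(start, count):
    """Hypothetical S3-style keys, one file per ten-minute interval."""
    return [
        (start + timedelta(minutes=10 * i)).strftime("iot/%Y/%m/%d/%H%M.json")
        for i in range(count)
    ]

start = datetime(2021, 6, 1, 0, 0)
loaded = ten_minute_keys(start, 6)     # partitions already in Ascend
available = ten_minute_keys(start, 8)  # bucket now has two new intervals
to_pull = new_partition_keys(loaded, available)
print(to_pull)  # only the two newest ten-minute partitions
```

Only the two most recent ten-minute files are fetched on refresh; the six previously loaded partitions are left untouched.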
| Read Connector | Description |
| --- | --- |
| Live IoT and Weather | Pulling from S3, this read connector contains data relating to IoT device usage in a single house, as well as the weather around that house at a specific moment in time. The schema includes energy usage for these IoT devices in kilowatts, along with relevant weather information. |
The business goal for this data source is to process the data with a popular clustering algorithm, k-means. We choose to test two different model-training flows:
- Using all relevant columns of weather and IoT device energy usage data.
- Using all relevant columns of weather, IoT device energy usage, and solar energy generation data.
| Transforms and Data Feeds | Description |
| --- | --- |
| Usage, Weather | We filter for relevant features from our read connector. These feature columns are associated with energy consumption and weather. For energy consumption data, we convert the kilowatts to a percentage of the total energy used by a single household. |
| Usage, Weather, Solar | We filter for relevant features from our read connector. These feature columns are associated with energy consumption, weather, and energy production. For energy consumption data, we convert the kilowatts to a percentage of the total energy used by a single household. |
| [DF] Usage, Weather | We publish the data from the Usage, Weather transform as a data feed. |
| [DF] Usage, Weather, Solar | We publish the data from the Usage, Weather, Solar transform as a data feed. |
| K-Means Cluster | Using PySpark, we train a k-means clustering model on the features from our upstream transform. The number of clusters and other hyperparameters are all adjustable, per the PySpark documentation. |
| K-Means Cluster w/ Solar | Using PySpark, we train a k-means clustering model on the features from our upstream transform. The number of clusters and other hyperparameters are all adjustable, per the PySpark documentation. |
| [DF] Clusters | We publish the data from the K-Means Cluster transform as a data feed. |
| [DF] Clusters w/ Solar | We publish the data from the K-Means Cluster w/ Solar transform as a data feed. |
| Cluster Info | We perform some analysis over our cluster information with SQL. For example, we calculate the count of records assigned to each cluster during training, and we average each feature per cluster to approximate relative cluster centers. |
| Cluster Info w/ Solar | We perform some analysis over our cluster information with SQL. For example, we calculate the count of records assigned to each cluster during training, and we average each feature per cluster to approximate relative cluster centers. |
Now that we have two separate cluster training models, it would be helpful to compare the two and understand how adding solar energy related data to one of them has changed the cluster centers and cluster sizes. With Ascend, we can just join two of our transforms together and perform some quick analysis!
| Transforms | Description |
| --- | --- |
| Clusters Diff | We perform a full outer join on both cluster-information transforms. At the same time, we retrieve the differences between all shared columns of the two transforms. |
| Order By Diff Count | This transform reorders the cluster differences by the absolute value of the number of records that differ per cluster. |
| Energy Diff Totals | We filter our cluster-center differences down to just the energy features, then sum the differences per column to see the overall direction and magnitude of change for each energy-related feature. |
| Weather Diff Totals | We filter our cluster-center differences down to just the weather features, then sum the differences per column to see the overall direction and magnitude of change for each weather-related feature. |
Now that you have been introduced to this dataflow, we encourage you to continue to build on it and play around with its content in your own environment. You can easily bring this entire sample dataflow into your own data service with our Importing & Exporting feature (requires account access)!
Here are some ideas of things that you can do further:
- Connect to one of the data feeds from the sample dataflow and try doing some cluster training or analysis of your own.
- Explore the variance in the differences between the two sets of clusters.
- Answer the question: Which IoT device or room consumes the largest percentage of energy across all households?
- Answer the question: What is the correlation between rain and solar power generated?
Feel free to send an email to [email protected] if you have any additional questions, comments, or suggestions for this dataflow!