Ascend Developer Hub

IoT Device and Weather Analysis

Walkthrough of the IoT Device and Weather Analysis sample dataflow

📘

Request a Trial!

The sample dataflow mentioned in this document is only available to trial users. You can request a trial of the Ascend platform here!

Highlighted Features

SQL + PySpark: Modular transforms in both languages with complete lineage and dependency management.
PySpark ML: Sophisticated data science and analytics techniques with minimal code and easy repeatability.
Time-Based Partitioning & Scale: Ingesting time-series data allows efficient updates of only new data, even at the scale of petabytes of data or billions of rows.
Data Comparison: Joining transforms makes data differences easy to analyze, debug, and visualize.

Introduction

Data-driven business decisions require efficient pattern detection on real-time, large-scale data sources. In our sample dataflow, imagine that we are a hypothetical IoT company that ingests about half a million rows of IoT device usage and weather data every ten minutes, on the order of a couple of petabytes of data every week. The goal of this IoT company's data platform and data science teams is to detect patterns within this live dataflow in order to provide a foundation for business and engineering decisions around reducing IoT device energy consumption.

Why Ascend?

  • Full Automation
    Ascend provides full automation that requires no code for complicated scheduling, alerting, and maintenance. Furthermore, Ascend's automation framework automatically retries when data sources return transient errors.
  • Zero Infrastructure
    Ascend fully manages all infrastructure, so users never have to operate servers or storage themselves.
  • Cross Platform
    Ascend brings together data from different clouds, APIs, and data systems for users to explore with either SQL or PySpark. Furthermore, the Ascend backend can be hosted on any major public cloud, such as AWS, GCP, or Azure.

Dataflow Walkthrough

Read Connector

This dataflow has a single read connector that connects to an S3 bucket which is updated with new data in real time. The read connector refreshes every hour to pull new data from the S3 bucket. With Ascend's intelligent partitioning and updates, partitions are created based on the organization of files in the S3 bucket. These are time-based partitions: each partition contains time-series data for a ten-minute interval. Whenever the read connector refreshes, only the new partitions are pulled; existing partitions that have already been loaded into Ascend are not touched. As a result, the dataflow never recomputes or reloads data that has not changed since the last update.
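The incremental-update behavior described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Ascend's internal API: each file maps to a ten-minute partition key, and a refresh pulls only keys that have not been loaded yet.

```python
# Illustrative sketch (not Ascend's API) of time-based partition refresh:
# each record timestamp maps to a ten-minute partition key, and a refresh
# pulls only the partitions that were not loaded on a previous run.

def partition_key(timestamp_minutes: int) -> int:
    """Map an event timestamp (minutes since epoch) to a 10-minute partition."""
    return timestamp_minutes - (timestamp_minutes % 10)

def partitions_to_pull(s3_partitions, loaded_partitions):
    """Return only the partitions that appeared since the last refresh."""
    return sorted(set(s3_partitions) - set(loaded_partitions))

loaded = {0, 10, 20}                                    # already in Ascend
listing = [partition_key(t) for t in (5, 14, 27, 31)]   # current S3 listing
print(partitions_to_pull(listing, loaded))  # -> [30]
```

Existing partitions never appear in the result, which is exactly why unchanged data is never re-pulled.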

| Read Connector | Description |
| --- | --- |
| Live IoT and Weather | Pulling from S3, this read connector contains data relating to IoT device usage in a single house, as well as the weather around that house at a specific moment in time. The schema of the records contains energy usage for these IoT devices in units of kilowatts, along with relevant weather information. |

Clustering Transforms and Data Feeds

The business goal for our data source is to process the data through a popular clustering algorithm implemented in PySpark MLlib: k-means||. We choose to test two different model-training flows:

  1. Using all relevant columns of weather and IoT device energy usage data.
  2. Using all relevant columns of weather, IoT device energy usage, and solar energy generation data.
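The actual transforms use PySpark MLlib's k-means|| implementation. As a rough illustration of what the clustering step computes, here is a toy pure-Python Lloyd's iteration over a hand-made two-feature dataset (all values are made up; MLlib's initialization and scale differ):

```python
# Toy k-means (Lloyd's iterations) illustrating the clustering step.
# The real transforms use PySpark MLlib's k-means|| on the full dataset.
import math

def kmeans(points, k, iters=20):
    centers = points[:k]  # naive init (MLlib uses the k-means|| scheme)
    for _ in range(iters):
        # assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # recompute centers as cluster means (keep old center if cluster empty)
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

# hypothetical (usage_pct, temperature) feature vectors
usage_weather = [(0.1, 20.0), (0.2, 21.0), (0.8, 5.0), (0.9, 4.0)]
print(kmeans(usage_weather, k=2))
```

Running the same fit twice, once with and once without solar columns in the feature vectors, mirrors the two training flows listed above.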
| Transforms and Data Feeds | Description |
| --- | --- |
| Usage, Weather | We filter for relevant features from our read connector. These feature columns are associated with energy consumption and weather. For energy consumption data, we convert the kilowatts to a percentage of the total energy used by a single household. |
| Usage, Weather, Solar | We filter for relevant features from our read connector. These feature columns are associated with energy consumption, weather, and energy production. For energy consumption data, we convert the kilowatts to a percentage of the total energy used by a single household. |
| [DF] Usage, Weather | We have published the data from the Usage, Weather transform as a data feed so that you can explore the data in your own data service! |
| [DF] Usage, Weather, Solar | We have published the data from the Usage, Weather, Solar transform as a data feed so that you can explore the data in your own data service! |
| K-Means Cluster | Using PySpark, we can train a k-means cluster model on the features from our upstream transform. The number of clusters and other hyperparameters are all adjustable as per the PySpark documentation. |
| K-Means Cluster w/ Solar | Using PySpark, we can train a k-means cluster model on the features from our upstream transform. The number of clusters and other hyperparameters are all adjustable as per the PySpark documentation. |
| [DF] Clusters | We have published the data from the K-Means Cluster transform as a data feed so that you can explore the data (with cluster assignments) in your own data service! |
| [DF] Clusters w/ Solar | We have published the data from the K-Means Cluster w/ Solar transform as a data feed so that you can explore the data (with cluster assignments) in your own data service! |
| Cluster Info | We can perform some analysis over our cluster information with some SQL. For example, here we calculate the count of records assigned to each cluster during training, as well as the averages of each feature per cluster for relative cluster centers. |
| Cluster Info w/ Solar | We can perform some analysis over our cluster information with some SQL. For example, here we calculate the count of records assigned to each cluster during training, as well as the averages of each feature per cluster for relative cluster centers. |
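The Cluster Info transforms are plain SQL aggregations. Here is a hedged sketch of that kind of query, run against an in-memory SQLite table for self-containment; the column names (`prediction`, `usage_pct`, `temperature`) are illustrative, not the dataflow's real schema:

```python
# Sketch of a Cluster Info-style SQL aggregation: count records per
# cluster and average each feature as a relative cluster center.
# Column names are illustrative; the real transform runs SQL in Ascend.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clusters (prediction INT, usage_pct REAL, temperature REAL)")
conn.executemany(
    "INSERT INTO clusters VALUES (?, ?, ?)",
    [(0, 0.10, 20.0), (0, 0.20, 21.0), (1, 0.80, 5.0)],
)
rows = conn.execute("""
    SELECT prediction,
           COUNT(*)         AS record_count,    -- cluster size
           AVG(usage_pct)   AS avg_usage_pct,   -- relative cluster center
           AVG(temperature) AS avg_temperature
    FROM clusters
    GROUP BY prediction
    ORDER BY prediction
""").fetchall()
print(rows)
```

The same `COUNT(*)`/`AVG(...)` shape, grouped by the cluster assignment column, is all the transform needs.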

Cluster Diff Transforms

Now that we have two separate cluster training models, it would be helpful to compare the two and understand how adding solar energy related data to one of them has changed the cluster centers and cluster sizes. With Ascend, we can just join two of our transforms together and perform some quick analysis!

| Transforms | Description |
| --- | --- |
| Clusters Diff | We perform a full outer join on both cluster information transforms. At the same time, we retrieve the differences between all shared columns of the two transforms. |
| Order By Diff Count | This transform simply reorders the cluster differences by the absolute value of the number of records that differ per cluster. |
| Energy Diff Totals | We can filter our cluster center differences for just the energy features. We then sum up the differences per column to see the overall direction and magnitude of change per energy-related feature. |
| Weather Diff Totals | We can filter our cluster center differences for just the weather features. We then sum up the differences per column to see the overall direction and magnitude of change per weather-related feature. |
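The diff logic can be sketched in plain Python: full-outer-join the two cluster summaries on cluster id, subtract the shared columns, then order clusters by the absolute count difference. Column names and values here are illustrative; the real transforms are SQL joins inside Ascend.

```python
# Sketch of Clusters Diff + Order By Diff Count: full outer join two
# cluster summaries on cluster id and take per-column differences.
# All names and numbers below are made up for illustration.

base  = {0: {"count": 120, "usage_pct": 0.15}, 1: {"count": 80,  "usage_pct": 0.70}}
solar = {0: {"count": 95,  "usage_pct": 0.18}, 2: {"count": 105, "usage_pct": 0.55}}

diff = {}
for cid in sorted(set(base) | set(solar)):      # full outer join on cluster id
    b, s = base.get(cid, {}), solar.get(cid, {})
    diff[cid] = {
        col: s.get(col, 0) - b.get(col, 0)      # solar-model value minus base
        for col in set(b) | set(s)
    }

# Order By Diff Count: sort clusters by |count difference|, largest first
ordered = sorted(diff, key=lambda cid: abs(diff[cid]["count"]), reverse=True)
print(ordered)  # -> [2, 1, 0]
```

Summing a single column of `diff` across clusters gives the Energy/Weather Diff Totals view of overall direction and magnitude per feature.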

Next Steps

Now that you have been introduced to this dataflow, we encourage you to continue to build on it and play around with its content in your own environment. You can easily bring this entire sample dataflow into your own data service with our Importing & Exporting (requires account access) feature!

Here are some ideas of things that you can do further:

  • Connect to one of the data feeds from the sample dataflow and try doing some cluster training or analysis of your own.
  • See if you can spot the variance in differences between the two sets of clusters.
  • Answer the question: Which IoT device or room consumes the largest percentage of energy across all households?
  • Answer the question: What is the correlation between rain and solar power generated?

Questions or Suggestions

Feel free to send an email to [email protected] if you have any additional questions, comments, or suggestions for this dataflow!

Updated 8 months ago
