
How We Built Real-time Dashboards Using AWS Kinesis Analytics

Knowing what users are doing on your websites in real time provides insights you can act on without waiting for delayed batch processing of clickstream data. In the AWS world, the entire process is simplified through AWS Kinesis. This article is a walk-through of how we used AWS Kinesis to build real-time dashboards that give a full overview of what is happening on our client’s sites at times of peak traffic.

The Challenge

Our client’s domains, which are spread across more than 20 countries, serve millions of users. The domains are e-commerce sites, so naturally there are certain times of the year, usually during holidays or big annual discounts, when traffic reaches its peak because people tend to go on a shopping spree.

For one of these shopping rampages, the client wanted to track and have a full overview of what is happening on its sites in real-time. For reference, here are some of the events that were requested for tracking:

  • the number of active users
  • the number of clicks

Architectural tools we already use

Needless to say, we track the users, with all their events and clicks, at all times. The difference is that we don’t have the final data in real time; we process it in batches, and the earliest it can be delivered is with a latency of one hour. Such latency wasn’t acceptable here, because we wanted no latency at all – we needed real-time analytics for the incoming events.

As for the tracking and processing part, we’re already using a palette of AWS services, including Lambdas, Kinesis Streams, EMR for data processing, S3 for data storage, Redshift as an analytical database, etc.

Reasonably, the solution had to reside in the vast AWS service collection – one that would be built on top of the tools already in place, without interfering with or compromising the existing services in use and the quality of their outputs.

The Solution and key steps

AWS offers real-time analytics with Kinesis Data Analytics. Since we are already using Kinesis Streams and Kinesis Firehose to deliver the ingested data to S3, we enthusiastically went with the idea of using this off-the-shelf service, compatible with the services we already use.

How does Kinesis Analytics work?


A Kinesis Analytics application can be attached to either a Data Stream or a Delivery Stream (a.k.a. Firehose), both of which are part of the AWS Kinesis toolbox. This means a Kinesis Analytics application can peek into all the data that travels through these streams, and with a few queries you can extract valuable information from the real-time events and write it to a safe place for later reading.

There are a few types of data windowing, and they come in quite handy (a sketch of a tumbling window follows the list). The data can be windowed by:

  • a fixed time or row count interval (Sliding Window)
  • time-based windows that open and close at regular intervals (Tumbling Window)
  • keyed time-based windows that allow multiple overlapping windows, to handle late or out-of-order data (Stagger Windows)
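
For illustration, here is a minimal sketch of a tumbling-window aggregation in Kinesis Analytics SQL. SOURCE_SQL_STREAM_001 is KA’s default name for the in-application input stream, while event_type is a hypothetical column, not one from our actual schema:

```sql
-- Minimal sketch: count events per type in 60-second tumbling windows.
-- "event_type" is a hypothetical column from an assumed input schema.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "event_type"  VARCHAR(32),
    "event_count" INTEGER
);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "event_type", COUNT(*) AS "event_count"
    FROM "SOURCE_SQL_STREAM_001"
    -- STEP truncates ROWTIME to the interval, so each window closes
    -- and emits its counts every 60 seconds.
    GROUP BY "event_type",
             STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
```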

Note: The KA application doesn’t affect or distort the original output from the Data Streams or the Firehoses. It just shows stalker affinities toward streaming data.


Step 1: Defining the data sources

The necessary data is produced by multiple sources. Two of those sources send data to two different Kinesis Data Streams, and a single source sends data to a Kinesis Firehose.

In other words, Kinesis Analytics says:

_“No problem, I can work with both Kinesis Data Streams and Kinesis Firehose.”_

We attached a dedicated KA application to each of our streams; only one stream had two KA applications on it, because we needed the same stream to feed two separate pieces of logic.

Step 2: Choosing the runtime

There are two different ways to process the data when creating the Kinesis Analytics application:

  • SQL
  • Apache Flink

We opted for simplicity and chose SQL rather than a complete Flink application. Instead of setting up a Flink project, managing proper connectors and deploying it, we simply wrote queries right on top of the incoming data.

With SQL, you can modify the queries and run them directly in the application; you don’t need to redeploy the whole application every time you make a change, as opposed to a Flink application, where a JAR has to be built and deployed to the Kinesis Analytics application. Another handy feature for the KA user is the live output displayed below the query as the data comes in (although we still have some doubts about the accuracy of this preview).
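
To give a flavor of that workflow, below is a minimal sketch of the kind of query you iterate on directly in the console – a sliding-window variant that keeps a rolling per-type event count over the last minute. The stream and column names are again illustrative, not our production code:

```sql
-- Sketch: rolling per-type event count over the last minute,
-- re-emitted for every incoming row (a sliding window).
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "event_type"      VARCHAR(32),
    "events_last_min" INTEGER
);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "event_type",
           COUNT(*) OVER W1 AS "events_last_min"
    FROM "SOURCE_SQL_STREAM_001"
    -- Each row sees an aggregate over the preceding minute of rows
    -- that share its event_type.
    WINDOW W1 AS (
        PARTITION BY "event_type"
        RANGE INTERVAL '1' MINUTE PRECEDING
    );
```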

Step 3: Defining the schema

After the source of the Kinesis Analytics application is specified, the schema of the incoming records has to be defined. Bonus points go to Kinesis Analytics for its ability to read nested _JSON_s and extract the fields out of those _JSON_s.

While taking advantage of the latter, we faced a setback in this phase: one value inside the JSON record is always sent as a stringified JSON, and the Kinesis Analytics application is not able to extract fields out of that string.

A potential solution to this challenge is to use a Lambda function to pre-process the incoming records before they reach the KA application (Lambda pre-processing is a feature offered by KA itself). This approach worked well when we simulated the whole process with mock data, but once we started using real production data, everything would run fine for a couple of minutes and then simply stop – proving that this approach can be unreliable.

Eventually, we had to remove the pre-processing Lambda, so we decided to just parse the stringified JSON inside the SQL script with a regex and extract the needed information. It worked.
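
As an illustration of this workaround, here is a hedged sketch using KA’s built-in REGEX_LOG_PARSE function. The payload column, its contents, and the pattern are hypothetical; the real regex was specific to the client’s payload:

```sql
-- Sketch: extract a field from a stringified-JSON column with a regex.
-- "payload" is a hypothetical VARCHAR column holding something like
-- '{"page":"/checkout","userId":"u-123"}'.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("page" VARCHAR(256));

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    -- REGEX_LOG_PARSE exposes each capture group as COLUMN1, COLUMN2, ...
    SELECT STREAM T.REC.COLUMN1
    FROM (
        SELECT STREAM REGEX_LOG_PARSE("payload",
                   '.*"page":"([^"]*)".*') AS REC
        FROM "SOURCE_SQL_STREAM_001"
    ) AS T;
```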

Step 4: Output

The results from the queries end up in an S3 bucket via Kinesis Firehose streams. Each KA output has its own folder, and the Firehoses attached to the KA applications across all AWS regions put the results into a single place.

The Firehoses deliver the data as Parquet files, which keeps storage costs low and makes the output well suited to querying with Athena. For that purpose, we set up Glue tables describing the structure of the output files.

Each of the outputs has a separate table, partitioned by year, month, day and hour.
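
An hour of KA output can then be inspected in Athena with a query along these lines; the database, table, and column names here are hypothetical:

```sql
-- Sketch: query one hour of KA output in Athena, filtering on the
-- partition columns so only the relevant Parquet files are scanned.
SELECT event_type,
       SUM(event_count) AS total_events
FROM analytics.ka_click_counts   -- hypothetical Glue table
WHERE year = '2021' AND month = '06' AND day = '07' AND hour = '12'
GROUP BY event_type
ORDER BY total_events DESC;
```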

Step 5: Templating the infrastructure

All the resources used in the analytics part are created with CloudFormation templates. We have multiple AWS regions to cover (with the same resources in each region), so the templates eased deploying the resources in every region.

This also makes it easy to kill all the resources by removing the CloudFormation stack, and to recreate them at a later stage when we need to run these analytics again.

Step 6: Usage

This data is intended to feed (with a bit of enrichment from other Redshift tables) the real-time dashboards for analytics in Tableau.


Ultimately, using Redshift Spectrum, the data from KA is queried from S3, joined with other Redshift tables for enrichment, and used to build descriptive dashboards.
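
In practice, this boils down to Spectrum queries along these lines; the external schema, table names, join key, and IAM role below are all illustrative, not our production setup:

```sql
-- Sketch: expose the Glue catalog to Redshift as an external schema...
CREATE EXTERNAL SCHEMA IF NOT EXISTS ka_spectrum
FROM DATA CATALOG
DATABASE 'analytics'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';  -- hypothetical role

-- ...then join the S3-backed KA output with an internal Redshift table.
SELECT c.country_name,
       SUM(k.event_count) AS clicks
FROM ka_spectrum.ka_click_counts AS k   -- external (S3/Parquet) table
JOIN dim_country AS c                   -- internal Redshift table
  ON c.domain = k.domain
WHERE k.year = '2021' AND k.month = '06'
GROUP BY c.country_name;
```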

Result and future directions

This experience shows that, despite our questionable relationship with KA during development, Kinesis Analytics can in fact be quite handy for getting the results we wanted to achieve.

The ability to write a query, hit run, and see the results right below it adds a big plus to KA’s value, even though it has its drawbacks: you have to wait for the SQL validation, then for the application to start running, feed the source data at the right time, and get the results only after a few minutes. The wait is long even for minor changes, and at the time we didn’t find this “patience test” particularly charming.

Another downside we experienced was that sometimes, if you change the source schema after you create the whole KA application, it just quits and stops working, without giving you a reason why.

However, what we found useful is that you can analyze real-time streaming data with just a few clicks using KA and SQL – a universal, ubiquitous language that most people on our team (and on most data teams) understand. This is in fact quite the advantage, and it should definitely weigh in on the decision of whether or not to choose KA for real-time analytics.

Final words

If we were to answer the question of whether Kinesis Analytics was sufficient for what we aimed to achieve, we would simply reply:

“Yes, it was quite the ride, but in the end, we successfully completed our goal.”

We’d also add that if we had to do something similar in the future we would definitely take this solution into consideration, but our decision would also very much depend on the use case.

Tanja Zlatanovska

Jun 07, 2021

Category

Article

Technologies

AWS, Kinesis Analytics, DevOps
