Building Data Lake from Affiliate Network Transactional Data.jpg

Building Data Lake from Affiliate Network Transactional Data

Data lakes have become the cornerstone of many big data initiatives, just as they offer easier and more flexible options to scale when working with big amounts of data that are being generated at a high velocity – such as web, sensor, or app activity data. Since these types of data sources have become increasingly prevalent, interest in data lakes has also been growing at a rapid pace.

The client is a company that constantly seeks ways to improve and enhance its products and improve user experience. When they approached us, we were focused on finding “a new solution to an old problem”.


On an hourly basis, the client collects transactional data from more than 50 Affiliate Networks (AN), using 70 different API’s. The number of new data sources and the data volume on already existing data sources is significantly increasing. The constant requests from the business side to introduce new data fields into the reporting are also piling up.

In a transactional DB, historical data is stored in one database. Even a simple analysis of how the transaction is changed is slow or inefficient. And when the client requests to do some ad-hoc data analysis that is not present in the DWH, considering that in the reports we only have unified transactional data from all ANs, this way turns out to be costly, time-consuming, and often incorrect.

On top of that, there are infrastructure limitations that we constantly need to track in order to be able to fulfill the client’s urging needs. Rather than spending the time productively, more time is spent on migrations to bigger servers and upgrading the existing software to a newer version.

Why we opted for Data Lake?

Building a data lake is the next logical step. Using this approach, we can store a large amount of structured, semi-structured, and unstructured data. This means we will have the history for all of the transactions and every field related to them, even if we don’t use them in the reports.

Also, the cost of keeping all of the unstructured and structured data is low and does not require any maintenance on our end. The introduction of serverless architecture for fetching the data and putting it in the Data Lake is also a possibility.

The serverless approach means that we can focus on our core product development, instead of worrying about managing and operating servers or runtimes, either in the cloud or on-premises.

We don’t have to know the structure of the data we want to store. The Data Lake gives you the ability to understand what data is in the lake through crawling, cataloging, and indexing of data. In the case of new business requirements or changes in the existing ones, we can easily go back in history and reload all of the data in the reports according to the new requirements.

The Data Lake allows us to run ad-hoc reports or analyses without the need to move our data to a separate analytics system. Having all of the data in its original form and in a structured format gives us the opportunity to run machine learning algorithms and predictive analytics and enables the client to better understand its data.

Approach and solution

In order to build the Data Lake, we introduce an entirely new serverless architecture for the entire process using AWS:

Building Data Lake from Affiliate Network Transactional Data   2.png

In the diagram above, on the left, we see the process for a specific Affiliate Network, and on the right, we see how all of the data from all of the Affiliate Networks are combined together. Let’s explain how it’s done starting from left to right.

The CloudWatch Events Rule triggers the Affiliate Network API Requests Lambda, which is used for creating the URL’s and making the Requests from the Affiliate Network API. The metadata for creating the URL’s are stored on S3 as a single JSON object per Affiliate Network. We can get up to a few thousand API Requests for a single Affiliate Network, depending on the number of accounts we have for that particular Affiliate Network, the requested period of time, and the specific API limitations.

We use a single Lambda function for each request, where the requests are called in parallel and asynchronously and we don’t have to worry about managing the server. The response with the transactional data from each Lambda is stored on S3.

Each file from the request when saved on S3 invokes another Lambda where the data is transformed and cleaned and after that, it is stored on S3 in Parquet format. After the transactions are stored on S3 in Parquet format, we catalog the structure for each Affiliate Network and use Athena for these data queries. Additionally, the transaction is transformed into a separate structure and sent to the Kinesis Firehose service which accepts transactional records from all the Affiliate Networks.

Firehose stores the customer transaction structured data on S3 in Parquet format for all Affiliate Networks. Here we have all versions of each transaction that are imported during their lifecycle. If we want to further analyze the lifecycle of these transactions, we can also use Athena. The last step is transforming the data with the Glue Job and storing them in the Redshift table.


Setting up Data Lake in place gives the client the advantage to:

  • Run the requests/transformation in parallel required at any scale, in contrast to being limited to the server capacity.
  • Now, each request is independent.
  • We also have a history of each transaction stored in S3 for minimum cost and if the business logic changes in the future we can always use these files to recreate the process.
  • In contrast to having a limited number of columns, unified for all the Affiliate Networks, where only the standard Reporting can apply, now the data on S3 is cataloged and can be used for additional analysis, using all the offered columns per Affiliate Network, with AWS Athena using standard SQL or additionally with Redshift Spectrum along with the tables in the DataWarehouse.
  • In contrast to using Aurora DB, now we use AWS S3 for storage.
  • In contrast to using Pentaho Server on EC2 for transformation, now we use
    AWS Lambda and Glue.
  • Pricing is based on the number of executions and not on pre-purchased compute capacity.
  • Instead of being focused on implementing, maintaining, debugging, and 24×7 monitoring of the infrastructure, now the client can fully focus on its application.

Tanja Zlatanovska

Jun 11, 2020




AWSData LakeBusiness Intelligence

Let's talk.

By submitting this, you agree to our Terms and Conditions & Privacy Policy