Data lakes have become the cornerstone of many big data initiatives, as they offer easier and more flexible options for scaling when working with large volumes of data generated at high velocity – such as web, sensor, or app activity data. As these types of data sources have become increasingly prevalent, interest in data lakes has been growing at a rapid pace.
The client is a company that constantly seeks ways to enhance its products and improve the user experience. When they approached us, we focused on finding “a new solution to an old problem”.
On an hourly basis, the client collects transactional data from more than 50 Affiliate Networks (ANs), using 70 different APIs. Both the number of new data sources and the data volume of the existing ones are increasing significantly. Constant requests from the business side to introduce new data fields into the reporting are also piling up.
All historical data is stored in a single transactional database, so even a simple analysis of how a transaction changes over time is slow and inefficient. And when the client requests ad-hoc analysis of data that is not present in the DWH – given that the reports contain only the unified transactional data from all ANs – the process turns out to be costly, time-consuming, and often incorrect.
On top of that, there are infrastructure limitations that we constantly need to track in order to fulfill the client’s urgent needs. Rather than being spent productively, much of our time goes into migrating to bigger servers and upgrading the existing software to newer versions.
Building a data lake is the next logical step. Using this approach, we can store large amounts of structured, semi-structured, and unstructured data. This means we keep the history of all transactions and every field related to them, even the fields we don’t currently use in the reports.
Also, the cost of keeping all of the structured and unstructured data is low and does not require any maintenance on our end. Introducing a serverless architecture for fetching the data and loading it into the Data Lake is also a possibility.
The serverless approach means that we can focus on our core product development, instead of worrying about managing and operating servers or runtimes, either in the cloud or on-premises.
We don’t have to know the structure of the data we want to store. The Data Lake gives us the ability to understand what data is in the lake through crawling, cataloging, and indexing. In the case of new business requirements, or changes to existing ones, we can easily go back in history and reload all of the data into the reports according to the new requirements.
The Data Lake allows us to run ad-hoc reports or analyses without the need to move our data to a separate analytics system. Having all of the data in its original form and in a structured format gives us the opportunity to run machine learning algorithms and predictive analytics and enables the client to better understand its data.
In order to build the Data Lake, we introduced an entirely new serverless architecture on AWS:
In the diagram above, on the left, we see the process for a specific Affiliate Network, and on the right, we see how the data from all of the Affiliate Networks is combined. Let’s explain how it’s done, from left to right.
The CloudWatch Events Rule triggers the Affiliate Network API Requests Lambda, which creates the URLs and makes the requests to the Affiliate Network API. The metadata for creating the URLs is stored on S3 as a single JSON object per Affiliate Network. A single Affiliate Network can generate up to a few thousand API requests, depending on the number of accounts we have for it, the requested period of time, and the specific API limitations.
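The URL-building step can be sketched roughly as follows. The metadata shape (a `base_url` plus a list of accounts with API keys) and the query-parameter names are illustrative assumptions, not the client’s actual format:

```python
from urllib.parse import urlencode


def build_request_urls(metadata, start_date, end_date):
    """Build one API request URL per affiliate-network account.

    `metadata` mirrors the per-network JSON object stored on S3
    (hypothetical shape: {"base_url": ..., "accounts": [{"api_key": ...}]}).
    """
    urls = []
    for account in metadata["accounts"]:
        params = {
            "api_key": account["api_key"],
            "start_date": start_date,
            "end_date": end_date,
        }
        urls.append(f'{metadata["base_url"]}?{urlencode(params)}')
    return urls
```

With several accounts per network and a sliding time window, this is how one scheduled Lambda run fans out into hundreds or thousands of request URLs.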
We use a single Lambda invocation for each request; the requests run in parallel and asynchronously, and we don’t have to worry about managing any servers. The response with the transactional data from each Lambda is stored on S3.
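A minimal sketch of this fan-out, assuming a fetch Lambda named `an-fetch-transactions` (the function name and event shape are ours, for illustration). `InvocationType="Event"` is what makes each call asynchronous:

```python
import json


def build_invoke_payloads(urls):
    """One JSON payload per request URL; each becomes one async Lambda event."""
    return [json.dumps({"url": url}) for url in urls]


def fan_out(urls, function_name="an-fetch-transactions"):
    """Invoke the fetch Lambda once per URL, fire-and-forget.

    InvocationType='Event' queues the invocation and returns immediately,
    so thousands of API requests run in parallel with no servers to manage.
    """
    import boto3  # imported here so the module loads without AWS installed

    client = boto3.client("lambda")
    for payload in build_invoke_payloads(urls):
        client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous: don't wait for the result
            Payload=payload,
        )
```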
Each response file saved on S3 invokes another Lambda, where the data is transformed and cleaned and then stored on S3 in Parquet format. Once the transactions are stored in Parquet, we catalog the structure for each Affiliate Network and use Athena to query the data. Additionally, each transaction is transformed into a separate, unified structure and sent to the Kinesis Firehose service, which accepts transactional records from all the Affiliate Networks.
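The mapping into the unified structure might look like the sketch below. The field names are invented for illustration (the client’s real schema is different), and the Firehose delivery stream name is likewise an assumption:

```python
import json


def to_unified_record(raw, network):
    """Map one raw AN response row onto the unified transaction structure.

    Field names here are illustrative, not the client's real schema.
    """
    return {
        "network": network,
        "transaction_id": str(raw["id"]),
        "amount": float(raw.get("sale_amount", 0.0)),
        "commission": float(raw.get("commission", 0.0)),
        "status": raw.get("status", "pending").lower(),
    }


def send_to_firehose(record, stream="customer-transactions"):
    """Push one unified record to the shared Kinesis Firehose stream."""
    import boto3  # imported lazily so the module loads without AWS installed

    firehose = boto3.client("firehose")
    firehose.put_record(
        DeliveryStreamName=stream,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
```

In the real pipeline, the same cleaned records are also written back to S3 as Parquet before being sent downstream.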
Firehose stores the structured customer transaction data for all Affiliate Networks on S3 in Parquet format. Here we have every version of each transaction imported during its lifecycle, and if we want to analyze that lifecycle further, we can again use Athena. The last step is transforming the data with a Glue Job and storing it in a Redshift table.
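A lifecycle query against the Firehose-delivered Parquet data could be submitted to Athena like this; the database, table, and column names are assumptions, not the client’s actual Glue catalog:

```python
# Illustrative Athena query: all imported versions of one transaction,
# ordered by when each version arrived in the lake.
LIFECYCLE_SQL = """
SELECT transaction_id,
       status,
       amount,
       imported_at
FROM   data_lake.customer_transactions
WHERE  transaction_id = '101'
ORDER  BY imported_at
"""


def run_lifecycle_query(database="data_lake", output="s3://query-results/"):
    """Submit the query to Athena and return the query execution id."""
    import boto3  # imported lazily so the module loads without AWS installed

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=LIFECYCLE_SQL,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )
    return response["QueryExecutionId"]
```

Because Athena reads the Parquet files in place, no data has to be moved into a separate analytics system for this kind of ad-hoc question.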
Setting up the Data Lake gives the client the following advantages:
Tanja Zlatanovska
Jun 11, 2020