Serverless Streaming Data Loader, Solution for Errors When Redshift Is In Maintenance Mode.jpg

Solution for Errors When Redshift Is In Maintenance Mode

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud, which enables you to use your data to acquire new insights for your business and customers. At Intertec, we use Amazon Redshift to prepare reporting on all available data sources and host our own Data Warehouse.

In the following article, we will take a closer look at one of the many data sources we manage.

This particular data source is a collection of application logs, which we process and then display the usage statistics in a dashboard. The data shown in the dashboard resides in Redshift and is updated as often as new data arrives.

However, after the implementation process was done, we realized that we miss data for a specific time frame every week. To determine what is going on, we conducted an investigation and learned that the missing data is a result of the unavailability of the Redshift cluster at that particular time frame.

Current architecture

Serverless Streaming Data Loader, Solution for Errors When Redshift Is In Maintenance Mode 2.png

As shown on the image above, a microservice feeds application logs to a Kinesis firehose, which collects and dumps them on S3 every 300 seconds or 15MB.

From there a lambda function is configured to trigger whenever there is a new file created in the specified S3 bucket. The lambda function then runs a COPY command on the file and stores it in the Redshift table.

Having this in mind what we did next?

The business goal was to present the application usage statistics in a dashboard in (almost) real-time.

We know that Redshift has a scheduled maintenance window for 30 minutes each week, meaning if there are any upgrades in those 30 minutes the cluster is unreachable from the lambda function. This causes data to go missing in the dashboard, leading to incomplete results.

Our solution?

Store a list of the files that failed somewhere, to process them later. Since the maintenance window of our Redshift cluster is on Sundays from 05:30 – 06:00, we will try and load all files that failed after 06:00 every Sunday.

How was the implementation executed?

The existing lambda performing the load is extended to write all failures in a DynamoDB table:

try:
    cur = con.cursor()
    cur.execute("""
        COPY command...
    """)
    cur.close()

except Exception as e:
    dynamodb = boto3.client('dynamodb')
    dynamodb.put_item(TableName='copy-errors', 
            Item={
                    'file_name':{'S': file_name},
                    'error':{'S': e },
                    'timestamp':{'S': datetime.strftime(datetime.now(),"%Y-%m-%d %H:%M:%S")}
                }
    )

The lambda function running after the scheduled maintenance period loops through the items in the DynamoDB table and creates a manifest file that holds the list of files to be loaded:

for i in response['Items']:
    file_list.append({'url': ''+i['file_name']})
    data = {"entries": file_list}
    s3.put_object(Bucket='bucket',Key=’manifest_file.manifest',Body=json.dumps(data))

Then a copy command runs for this manifest file. If there are no errors, the items are removed from DynamoDB:

for file in raw_file_list:
    table.delete_item(Key={
        'file_name': file
    })

The lambda is scheduled using cron to execute just after the maintenance window. This information is found in the AWS console. The maintenance window is also changed per cluster:

Serverless Streaming Data Loader, Solution for Errors When Redshift Is In Maintenance Mode 3.png

Architecture diagram after including DynamoDB

Serverless Streaming Data Loader, Solution for Errors When Redshift Is In Maintenance Mode 4.png

To conclude

Our engineers continuously strive to introduce new ways to help companies and enterprise clients untangle the complex issues that always emerge during their digital transformation journey.

In this particular case, the lambda function could not have been set up to wait for the cluster to become available, and we had to find a way to somehow postpone the execution. This simple implementation of error handling in serverless applications is just one approach when you have limited resources. That being said, what would be your approach?

If you are looking to be part of a team that tackles new challenges every day and is constantly working on finding new ways of untangling complex issues, then make sure you check out our current job openings.

Tanja Zlatanovska

Sep 16, 2020

Category

Article

Technologies

Amazon Redshift

Let's talk.

Name
Email
Phone
By submitting this, you agree to our Terms and Conditions & Privacy Policy