There are a lot of benefits of data scalability. The size and the variety of data that enterprises have to deal with have become more complex and larger.
Traditional relational databases provide certain benefits, but they are not suitable to handle big and various data. That is when data lake products started gaining popularity, and since then, more companies introduced lake solutions as part of their data infrastructure. As the demand for the data solutions increased, cloud companies like AWS also jumped in and began providing managed data lake solutions with AWS Athena and S3. These services have powerful and convenient features. However, they are not perfect for all users and use cases. In this article, we will discuss shortcomings of indexing in Athena and S3 and how we can deal with them.
AWS Athena and S3
AWS Athena and S3 are separate services. AWS Athena is a query service that allows users to analyze data in S3 using standard SQL syntax. Athena is serverless and managed by AWS. Athena and other AWS serverless services have a similar pricing structure – it lets you pay only for what you use. S3 is one of the first-generation services of AWS. You can store different types of files and use them like cloud storage. Both combined, you use SQL to query what’s stored in S3.
Limits of Athena
Although Athena has great features and provides cost benefits, as you use it, you will find some limitations of Athena.
When you use Athena, the computation resources to run your queries are not something you can control. When you execute an Athena query, a request goes to the shared queue that comes from all Athena users in your region and AWS processes the requested query sequentially. This means when you execute a query in a busy time, you will have to wait longer to get your query processed and result back. Under this environment, you can not guarantee consistent performance, which can have a negative impact on service agreement with your customers.
In traditional relational database engines, users can plan indexing to improve performance. However, Athena does not use indexing by default. When you run a query, Athena goes to the targeted S3 bucket and starts opening each file until it meets the requests of your query. For example, when the data is located at the last file, your query will take longer than when you can find your data from the first scanned file. It might not make much difference when your data size is small. However, when your data is big, this makes a big difference. To mitigate this performance issue, AWS recommends partitioning.
You can improve query performance by partitioning your data. However, partitioning also has limits, and it is not easy to use. You have to carefully decide based on which column you want to partition. When you choose a wrong column, re-partitioning can make you move the entire data into a new bucket location, alter the table to refer to the new bucket location, and then delete the old data.
Because Athena uses the data storage that works like a file system, it does not allow you to update or delete at a row or a column level. Alternatively, you can run CTAS (Create Table AS) or INSERT INTO query. However, when you use them, you can only create up to 100 partitions in a destination table. That may sound large enough. Depending on what base column you use for partitioning, that limit can be reached unexpectedly fast.
How to improve indexing
When there is a problem, it becomes an opportunity. Since Athena is one of the most popular data lake query services, many users experience these problems and companies develop solutions to eliminate the inconvenience and performance issues. When it is hard to overcome shortcomings within AWS, people sometimes look outside to find a solution.
For the indexing and partitioning limitations of AWS, users could consider Varada’s big data indexing technology; it automatically indexes columns according to workload demands. Their indexing data breaks data, across any column, into nano blocks and then automatically selects the most efficient index for each nano-block considering data content and structure. In the back-end, their machine-learning optimization tools monitor cluster performance and data usage to detect bottlenecks and query performances. When it finds an optimization opportunity, it automatically applies improvements.
The result is a faster query result and optimized cost. This source shares performance comparisons across different metrics. One noticeable difference is the first experiment. The query was to find a specific ID and between specific time ranges as below.
... FROM demo_trips.trips_data WHERE rider_id = 3380311 AND t_hour between 7 AND 10
The result showed that Athena took 40.96 seconds and 132.0GB scanned while Varada took 0.57 and 245KB scanned.
The result tells you that depending on your partition, there can be a massive difference. In data engineering, besides partitioning, there are many areas to be taken care of. If engineers have to manage partitioning, it can slow down other important tasks. When you have data lake infrastructure in AWS, relying on a third party solution like Varada is something you can consider.