Cookies help us display personalized product recommendations and ensure you have great shopping experience.

By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData CollectiveSmartData Collective
  • Analytics
    AnalyticsShow More
    image fx (67)
    Improving LinkedIn Ad Strategies with Data Analytics
    9 Min Read
    big data and remote work
    Data Helps Speech-Language Pathologists Deliver Better Results
    6 Min Read
    data driven insights
    How Data-Driven Insights Are Addressing Gaps in Patient Communication and Equity
    8 Min Read
    pexels pavel danilyuk 8112119
    Data Analytics Is Revolutionizing Medical Credentialing
    8 Min Read
    data and seo
    Maximize SEO Success with Powerful Data Analytics Insights
    8 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-25 SmartData Collective. All Rights Reserved.
Reading: Data-Centric Firms Address Athena Shortcomings with Smart Indexing
Share
Notification
Font ResizerAa
SmartData CollectiveSmartData Collective
Font ResizerAa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > Data-Centric Firms Address Athena Shortcomings with Smart Indexing
Big DataExclusive

Data-Centric Firms Address Athena Shortcomings with Smart Indexing

You have to take some Athena shortcomings into consideration as a data-driven business, so these guidelines will help you out.

Ryan Kh
Ryan Kh
7 Min Read
dealing with data limitations with athena
Shutterstock Photo License - Billion Photos
SHARE

There are a lot of benefits of data scalability. The size and the variety of data that enterprises have to deal with have become more complex and larger.

Contents
AWS Athena and S3Limits of AthenaShared resourcesIndexing capabilitiesPartition limitsHow to improve indexingWrapping up

Traditional relational databases provide certain benefits, but they are not suitable to handle big and various data. That is when data lake products started gaining popularity, and since then, more companies introduced lake solutions as part of their data infrastructure. As the demand for the data solutions increased, cloud companies like AWS also jumped in and began providing managed data lake solutions with AWS Athena and S3. These services have powerful and convenient features. However, they are not perfect for all users and use cases. In this article, we will discuss shortcomings of indexing in Athena and S3 and how we can deal with them.

AWS Athena and S3

AWS Athena and S3 are separate services. AWS Athena is a query service that allows users to analyze data in S3 using standard SQL syntax. Athena is serverless and managed by AWS. Athena and other AWS serverless services have a similar pricing structure – it lets you pay only for what you use. S3 is one of the first-generation services of AWS. You can store different types of files and use them like cloud storage. Both combined, you use SQL to query what’s stored in S3.

Limits of Athena

Although Athena has great features and provides cost benefits, as you use it, you will find some limitations of Athena.

More Read

Investigating the Relationship Between Big Data and Road Safety
Investigating the Relationship Between Big Data and Road Safety
Dr Gates was right, or how I learned to stop worrying and love the spam
SAS Global Conference 2009
SOA is necessary for agility but not sufficient
Hybrid Systems Integrations for Right Sized ERP

Shared resources

When you use Athena, the computation resources to run your queries are not something you can control. When you execute an Athena query, a request goes to the shared queue that comes from all Athena users in your region and AWS processes the requested query sequentially. This means when you execute a query in a busy time, you will have to wait longer to get your query processed and result back. Under this environment, you can not guarantee consistent performance, which can have a negative impact on service agreement with your customers.

Indexing capabilities

In traditional relational database engines, users can plan indexing to improve performance. However, Athena does not use indexing by default. When you run a query, Athena goes to the targeted S3 bucket and starts opening each file until it meets the requests of your query. For example, when the data is located at the last file, your query will take longer than when you can find your data from the first scanned file. It might not make much difference when your data size is small. However, when your data is big, this makes a big difference. To mitigate this performance issue, AWS recommends partitioning.

Partition limits

You can improve query performance by partitioning your data. However, partitioning also has limits, and it is not easy to use. You have to carefully decide based on which column you want to partition. When you choose a wrong column, re-partitioning can make you move the entire data into a new bucket location, alter the table to refer to the new bucket location, and then delete the old data.

Because Athena uses the data storage that works like a file system, it does not allow you to update or delete at a row or a column level. Alternatively, you can run CTAS (Create Table AS) or INSERT INTO query. However, when you use them, you can only create up to 100 partitions in a destination table. That may sound large enough. Depending on what base column you use for partitioning, that limit can be reached unexpectedly fast.

How to improve indexing

When there is a problem, it becomes an opportunity. Since Athena is one of the most popular data lake query services, many users experience these problems and companies develop solutions to eliminate the inconvenience and performance issues. When it is hard to overcome shortcomings within AWS, people sometimes look outside to find a solution.

For the indexing and partitioning limitations of AWS, users could consider Varada’s big data indexing technology; it automatically indexes columns according to workload demands. Their indexing data breaks data, across any column, into nano blocks and then automatically selects the most efficient index for each nano-block considering data content and structure. In the back-end, their machine-learning optimization tools monitor cluster performance and data usage to detect bottlenecks and query performances. When it finds an optimization opportunity, it automatically applies improvements.

The result is a faster query result and optimized cost. This source shares performance comparisons across different metrics. One noticeable difference is the first experiment. The query was to find a specific ID and between specific time ranges as below.

...
FROM
	demo_trips.trips_data
WHERE
	rider_id = 3380311
AND    t_hour between 7 AND 10

The result showed that Athena took 40.96 seconds and 132.0GB scanned while Varada took 0.57 and 245KB scanned.

Wrapping up

The result tells you that depending on your partition, there can be a massive difference. In data engineering, besides partitioning, there are many areas to be taken care of. If engineers have to manage partitioning, it can slow down other important tasks. When you have data lake infrastructure in AWS, relying on a third party solution like Varada is something you can consider.

TAGGED:athenadata-driven businessdata-driven organizations
Share This Article
Facebook Pinterest LinkedIn
Share
ByRyan Kh
Follow:
Ryan Kh is an experienced blogger, digital content & social marketer. Founder of Catalyst For Business and contributor to search giants like Yahoo Finance, MSN. He is passionate about covering topics like big data, business intelligence, startups & entrepreneurship. Email: ryankh14@icloud.com

Follow us on Facebook

Latest News

image fx (2)
Monitoring Data Without Turning into Big Brother
Big Data Exclusive
image fx (71)
The Power of AI for Personalization in Email
Artificial Intelligence Exclusive Marketing
image fx (67)
Improving LinkedIn Ad Strategies with Data Analytics
Analytics Big Data Exclusive Software
big data and remote work
Data Helps Speech-Language Pathologists Deliver Better Results
Analytics Big Data Exclusive

Stay Connected

1.2kFollowersLike
33.7kFollowersFollow
222FollowersPin

You Might also Like

data analytics can help with payment collections
Analytics

Using Data Analytics to Optimize Your Cash Collection Approach

10 Min Read
big data and social media analytics
Analytics

Data Analytics and Social Media: Twin Pillars in the Evolution of Business

8 Min Read
data-driven lead generation
Analytics

Five Proven Lead Generation Strategies to Merge with Data Analytics

8 Min Read
Business,Analytics,(ba),Technology,Using,Big,Data,,Cloud,Computing,And
Analytics

What to Consider When Choosing a Masters in Business Analytics

15 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

ai in ecommerce
Artificial Intelligence for eCommerce: A Closer Look
Artificial Intelligence
AI and chatbots
Chatbots and SEO: How Can Chatbots Improve Your SEO Ranking?
Artificial Intelligence Chatbots Exclusive

Quick Link

  • About
  • Contact
  • Privacy
Follow US
© 2008-25 SmartData Collective. All Rights Reserved.
Go to mobile version
Welcome Back!

Sign in to your account

Username or Email Address
Password

Lost your password?