According to a 2021 Accenture report on cloud trends, “the cloud is more than an efficient storage solution — it’s a unique platform for generating data and innovative solutions to leverage that data.” The data lake is no exception to this rule. Given that data lakes offer the most affordable and efficient way to handle vast, complex datasets, moving your data lake to the cloud is the next logical step.
Modern data lakes can support high-performance query engines, allowing users direct access to both raw and transformed data from any source or format. If you don’t have a data lake yet, implementing a cloud data lake should be your focus. Cloud-based solutions offer elastic scalability, agility, up to 40% lower total cost of ownership, improved operational efficiency and the ability to innovate rapidly.
However, to get the most out of your cloud data lake solution, you’ll want to make sure your cloud data lake stack is analytics-ready, enabling you to turn your data into a strategic competitive advantage.
Here are six critical questions to consider when choosing your cloud data lake stack for your business and making sure it is analytics-ready.
1. Which Public Cloud Platform?
Choosing the right cloud platform provider can be a daunting task, but you can’t go wrong with the big three: AWS, Azure and Google Cloud Platform. Each offers its own massively scalable object storage solution, data lake orchestration solution, and managed Spark, Presto and Hadoop services. AWS Lake Formation, in particular, provides a wizard-style interface over various pieces of the Amazon Web Services ecosystem that lets organizations easily build a data lake. Azure Data Lake is the competitor to AWS Lake Formation and relies heavily on the Hadoop architecture; as with AWS, it is also centered on its storage layer. Similarly, Google Cloud Storage is the backend storage mechanism driving data lakes built on Google Cloud Platform.
2. Which Cloud Data Lake Solution Has the Most Optimal Storage?
When it comes to object storage, AWS, Azure and Google all compete on usability and price, with Azure generally the cheapest and Google the most expensive on average.
AWS S3 storage offers rich functionality, has been around the longest, and many applications have been developed to run on it. It is highly scalable and available, and can be made redundant across a number of availability zones. S3 offers three main storage classes (Standard, Infrequent Access and Glacier), with lower storage costs but higher read/write costs as availability decreases. S3 also supports automatic object versioning, where each version is individually addressable so it can be retrieved at any time.
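As a rough illustration of how these S3 features are wired up, here is a minimal boto3 sketch that enables versioning and tiers older data down to Infrequent Access and Glacier. The bucket name and prefix are hypothetical, and valid AWS credentials are assumed.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake-bucket"  # hypothetical bucket name

# Turn on automatic object versioning for the bucket.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Move raw objects to cheaper storage classes as they age:
# Standard-IA after 30 days, Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```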
Azure Blob Storage offers three classes of storage (Hot, Cool and Archive) that differ mainly in price, with lower storage costs but additional read and write charges for data that is infrequently or rarely accessed. Additionally, Azure Blob Storage can be integrated with Azure Search, allowing users to search the contents of stored documents, including PDF, Word, PowerPoint and Excel files. Although Azure provides some level of versioning by allowing users to snapshot blobs, unlike AWS it is not automatic.
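For comparison, a minimal sketch with the azure-storage-blob SDK that takes a manual snapshot and demotes a blob to the Cool tier. The connection string, container and blob names are hypothetical.

```python
from azure.storage.blob import BlobServiceClient

# Hypothetical connection string, container and blob names.
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="datalake", blob="raw/events/2021-06-01.json")

# Take a point-in-time snapshot (Azure's manual alternative to automatic versioning).
snapshot = blob.create_snapshot()
print("Snapshot taken at:", snapshot["snapshot"])

# Move rarely accessed data to the cheaper Cool tier.
blob.set_standard_blob_tier("Cool")
```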
As with the other cloud vendors, Google Cloud Storage offers a set of storage classes (Standard, Nearline, Coldline and Archive). These classes are grouped by availability and access frequency, with less frequently accessed storage being much cheaper. And like AWS, Google supports automatic object versioning.
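The equivalent setup on Google Cloud Storage, sketched with the google-cloud-storage client: enable versioning and add a lifecycle rule that moves objects to Nearline after 30 days. The bucket name is hypothetical and application default credentials are assumed.

```python
from google.cloud import storage

client = storage.Client()  # assumes application default credentials
bucket = client.get_bucket("my-data-lake-bucket")  # hypothetical bucket name

# Enable automatic object versioning, comparable to S3's behavior.
bucket.versioning_enabled = True

# Transition objects to the cheaper Nearline class after 30 days.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)

bucket.patch()  # persist both changes to the bucket
```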
3. Which Open Source Query Engine and Data Virtualization Technology?
There are many open source and commercial tools to choose from. The most popular ones are:
Originally built at Facebook, Presto and Trino (formerly known as PrestoSQL) are distributed ANSI SQL query engines that work with many business intelligence tools and are capable of querying petabytes of data. While Presto was built to solve for speed and cost-efficiency of data access at massive scale, Trino has been expanded by Presto’s founders to accommodate a much broader variety of customers and analytics use cases. Why are Presto/Trino the leading candidates? Both are user-friendly options with good performance, high interoperability and a strong community. They enable data to be accessed from different data sources within a single query, can combine data from multiple sources, support many data stores and data formats, and have many connectors, including Hive, Phoenix, PostgreSQL, MySQL and Kafka.
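To make the federation point concrete, here is a minimal sketch using the open source trino Python client. The coordinator host, catalog names and tables are hypothetical, and it assumes Hive and PostgreSQL connectors are already configured on the cluster.

```python
import trino

# Connect to a Trino coordinator; host and user are hypothetical.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# A single query that joins raw events in the data lake (hive catalog)
# with customer records in an operational database (postgresql catalog).
cur.execute("""
    SELECT c.segment, count(*) AS clicks
    FROM hive.web.click_events e
    JOIN postgresql.crm.customers c ON e.customer_id = c.id
    WHERE e.event_date >= DATE '2021-01-01'
    GROUP BY c.segment
    ORDER BY clicks DESC
""")
for row in cur.fetchall():
    print(row)
```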
Apache Drill is an open source distributed query engine that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source counterpart of Google’s Dremel system, which is also available as an infrastructure service called Google BigQuery. Drill uses Apache Arrow for in-memory computation and Calcite for query parsing and optimization, but it has never enjoyed wide adoption, mainly because of its inherent performance and concurrency limitations. Some products based on Drill attempt to overcome these limitations. Drill shares many features with Presto/Trino, including support for many data stores, nested data and rapidly evolving structures. Drill doesn’t require a schema definition, which can let malformed data slip through, and its throttling functionality may limit concurrent queries.
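As an illustration of Drill’s schema-on-read approach, the sketch below posts a SQL query to Drill’s REST endpoint with the requests library. The host, the s3.logs workspace and the file path are hypothetical assumptions, and a Drill cluster with an S3 storage plugin is presumed to exist.

```python
import requests

# Drill exposes a REST endpoint on port 8047; host and path are hypothetical.
DRILL_URL = "http://drill.example.com:8047/query.json"

# Drill can query raw files in place without a predefined schema,
# e.g. JSON logs sitting in object storage mounted as a workspace.
payload = {
    "queryType": "SQL",
    "query": "SELECT level, count(*) AS n "
             "FROM s3.logs.`app/2021/06/*.json` "
             "GROUP BY level",
}

resp = requests.post(DRILL_URL, json=payload, timeout=60)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
```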
Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. SQL is supported by using Spark SQL. Queries are executed by a distributed in-memory computation engine on top of structured and semi-structured data sets. Spark works with multiple data formats but is more general in its application and supports a wide range of workloads such as data transformation, ML, batch queries, iterative algorithms, streaming, etc. Spark has seen less adoption for interactive queries than Presto/Trino or Drill.
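A minimal PySpark sketch for comparison: read Parquet files from object storage and run an interactive Spark SQL aggregation. The bucket path and column names are hypothetical, and the appropriate cloud storage connector (S3A, ABFS or GCS) is assumed to be configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-query").getOrCreate()

# Register raw Parquet files from the lake as a temporary SQL view.
# The bucket path is hypothetical.
events = spark.read.parquet("s3a://my-data-lake-bucket/curated/events/")
events.createOrReplaceTempView("events")

# Run an interactive aggregation with Spark SQL.
daily = spark.sql("""
    SELECT event_date, event_type, count(*) AS total
    FROM events
    GROUP BY event_date, event_type
    ORDER BY event_date
""")
daily.show(20)
```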
4. Should You Run Your Own Cloud Data Lake or Use a Managed Solution?
With third-party managed analytics services, enterprises can start using data analytics quickly — and let their provider deal with all the hassle of storing and managing the data. A third-party managed solution also allows users throughout the company to quickly run unlimited queries without having to wait on the DevOps team to allocate resources. However, as adoption and query volume grow, spending balloons dramatically.
And at the end of the day, every managed analytics solution becomes another data silo with its own data flows to manage. Unified access control, audit trails, data lineage, discovery and governance become complex, requiring custom integrations and creating vendor lock-in. That’s the double-edged sword of “quick-start” managed solutions that CIOs need to be aware of, so they can prepare to shift to more economical in-house managed DataOps programs for the cost and control advantages they offer in the long term.
5. The Cloud Data Lake Analytics Stack: Which Use Cases Does It Best Suit? How To Maximize Results?
The cloud data lake analytics stack dramatically improves speed for ad hoc queries, dashboards and reports. It enables you to operationalize all your data and run existing BI tools on lower-cost data lakes without compromising performance or data quality while avoiding costly delays when adding new data sources and reports.
Keep in mind that if you want to serve as many use cases as possible and shift your workloads to the cloud data lake, you’ll want to avoid data silos. Instead, ensure your stack is analytics-ready with workload observability and acceleration capabilities, so it can be easily integrated with niche analytics technologies such as text analytics for folder and log analysis. A solution with integrated text analytics can be used by data teams to run text search at petabyte scale directly on the data lake for marketing, IT, and cybersecurity use cases (and more).
The cloud data lake analytics stack can be used for a wide range of analytics use cases, but a few, in particular, prove to deliver the highest level of ROI. These include:
- Advanced analytics support for data scientists and business intelligence data analysts to provision and experiment with data
- Near real-time analysis of streaming and IoT data
- Rapid anomaly and threat detection over massive amounts of data via the security data lake, a modern and agile alternative for traditional SIEM/SOC platforms
6. How Can You Achieve the Optimal Query Performance and Price Balance?
According to a recent report by Mordor Intelligence, the Data Lakes Market was valued at USD 3.74 billion in 2020 and is expected to reach USD 17.60 billion by 2026, at a CAGR of 29.9% over the forecast period 2021-2026. Improving performance and cost efficiencies are critical driving forces behind this massive adoption.
As analytics use cases grow in demand across almost every business unit, data teams are still constantly struggling with balancing performance and costs. Unfortunately, manual query prioritization and performance optimization are time-consuming, lack scalability and often result in heavy DataOps.
Data teams and end users no longer need to compromise on performance in order to achieve agility, speed and cost-effectiveness. To do so effectively, data teams should leverage strategies and technologies that can autonomously accelerate queries using advanced techniques such as big data indexing. The key element for success is eliminating manual DataOps work. Autonomous query acceleration gives users control over the performance and cost of their cloud data lake analytics, while workload-level observability gives DataOps teams an open view of how data is being used across the entire organization so they can focus resources on business priorities.
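To show why data layout matters for the performance/cost balance, here is a minimal PySpark sketch of partition pruning, a simpler, manual relative of the big data indexing described above (the autonomous tools apply related ideas without hand-tuning). Paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layout-demo").getOrCreate()

# Hypothetical paths; write the table partitioned by event_date so the
# engine can skip whole directories ("partition pruning") at query time.
events = spark.read.parquet("s3a://my-data-lake-bucket/raw/events/")
(events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3a://my-data-lake-bucket/curated/events_by_date/"))

# This filter now reads only the 2021-06-01 partition instead of the full table.
pruned = (spark.read.parquet("s3a://my-data-lake-bucket/curated/events_by_date/")
               .where("event_date = '2021-06-01'"))
print(pruned.count())
```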
CIOs who lead their data teams in this direction can have the best of both worlds. They will find that implementing these techniques can make all the difference while widely expanding data lake architecture across their entire organization.