According to a 2021 Accenture report on cloud trends, “the cloud is more than an efficient storage solution — it’s a unique platform for generating data and innovative solutions to leverage that data.” The data lake is no exception to this rule. Given that data lakes offer the most affordable and efficient way to handle vast, complex datasets, moving your data lake to the cloud is the next logical step.
Modern data lakes can support high-performance query engines, allowing users direct access to both raw and transformed data from any source or format. If you don’t have a data lake yet, implementing a cloud data lake should be your focus. Cloud-based solutions offer elastic scalability, agility, up to 40% lower total cost of ownership, improved operational efficiency and the ability to innovate rapidly.
However, to get the most out of your cloud data lake solution, you’ll want to make sure your cloud data lake stack is analytics-ready, enabling you to turn your data into a strategic competitive advantage.
Here are six critical questions to consider when choosing your cloud data lake stack for your business and making sure it is analytics-ready.
1. Which Public Cloud Platform?
Choosing the right cloud platform provider can be a daunting task, but you can’t go wrong with the big three: AWS, Azure and Google Cloud Platform. Each offers its own massively scalable object storage solution, data lake orchestration solution, and managed Spark, Presto and Hadoop services. AWS Lake Formation, in particular, provides a wizard-style interface over various pieces of the Amazon Web Services ecosystem that lets organizations easily build a data lake. Azure Data Lake is the competitor to AWS Lake Formation and relies heavily on the Hadoop architecture; as with AWS, it is also centered on its storage layer. Similarly, Google Cloud Storage is the backend storage mechanism driving data lakes built on Google Cloud Platform.
2. Which Cloud Data Lake Solution Has the Most Optimal Storage?
When it comes to object storage, AWS, Azure and Google all compete on usability and price, with Azure generally the cheapest and Google the most expensive on average.
AWS S3 storage offers rich functionality, has been around the longest, and many applications have been developed to run on it. It is highly scalable and available, and can be made redundant across a number of availability zones. S3 offers three main storage classes (Standard, Infrequent Access and Glacier), with lower storage costs but higher read/write costs as availability decreases. S3 also supports automatic object versioning, where each version is individually addressable so it can be retrieved at any time.
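As a rough illustration of how these S3 features are wired up, here is a minimal boto3 sketch that enables versioning and tiers older data down to Infrequent Access and Glacier. The bucket name and prefix are hypothetical, and valid AWS credentials are assumed.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake-bucket"  # hypothetical bucket name

# Turn on automatic object versioning for the bucket.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Move raw objects to cheaper storage classes as they age:
# Standard-IA after 30 days, Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```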
Azure Blob Storage offers three classes of storage (Hot, Cool and Archive) that differ mainly in price, with lower storage costs but additional read and write charges for data that is infrequently or rarely accessed. Additionally, Azure Blob Storage can be integrated with Azure Search, allowing users to search the contents of stored documents, including PDF, Word, PowerPoint and Excel files. Although Azure provides some level of versioning by allowing users to snapshot blobs, unlike AWS it is not automatic.
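For comparison, a minimal sketch with the azure-storage-blob SDK that takes a manual snapshot and demotes a blob to the Cool tier. The connection string, container and blob names are hypothetical.

```python
from azure.storage.blob import BlobServiceClient

# Hypothetical connection string, container and blob names.
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="datalake", blob="raw/events/2021-06-01.json")

# Take a point-in-time snapshot (Azure's manual alternative to automatic versioning).
snapshot = blob.create_snapshot()
print("Snapshot taken at:", snapshot["snapshot"])

# Move rarely accessed data to the cheaper Cool tier.
blob.set_standard_blob_tier("Cool")
```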
As with the other cloud vendors, Google Cloud Storage offers a set of storage classes (Standard, Nearline, Coldline and Archive). These classes are grouped by availability and access frequency, with less frequently accessed storage being much cheaper. And like AWS, Google supports automatic object versioning.
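The equivalent setup on Google Cloud Storage, sketched with the google-cloud-storage client: enable versioning and add a lifecycle rule that moves objects to Nearline after 30 days. The bucket name is hypothetical and application default credentials are assumed.

```python
from google.cloud import storage

client = storage.Client()  # assumes application default credentials
bucket = client.get_bucket("my-data-lake-bucket")  # hypothetical bucket name

# Enable automatic object versioning, comparable to S3's behavior.
bucket.versioning_enabled = True

# Transition objects to the cheaper Nearline class after 30 days.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)

bucket.patch()  # persist both changes to the bucket
```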
3. Which Open Source Query Engine and Data Virtualization Technology?
There are many open source and commercial tools to choose from. The most popular ones are:
Originally built at Facebook, Presto and Trino (formerly known as PrestoSQL) are distributed ANSI SQL query engines that work with many business intelligence tools and are capable of querying petabytes of data. While Presto was built to solve for speed and cost-efficiency of data access at massive scale, Trino has been expanded by Presto’s founders to accommodate a much broader variety of customers and analytics use cases. Why are Presto/Trino the leading candidates? Both are user-friendly options with good performance, high interoperability and a strong community. They enable data to be accessed from different data sources within a single query, can combine data from multiple sources, support many data stores and data formats, and have many connectors, including Hive, Phoenix, PostgreSQL, MySQL and Kafka.
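To make the federation point concrete, here is a minimal sketch using the open source trino Python client. The coordinator host, catalog names and tables are hypothetical, and it assumes Hive and PostgreSQL connectors are already configured on the cluster.

```python
import trino

# Connect to a Trino coordinator; host and user are hypothetical.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# A single query that joins raw events in the data lake (hive catalog)
# with customer records in an operational database (postgresql catalog).
cur.execute("""
    SELECT c.segment, count(*) AS clicks
    FROM hive.web.click_events e
    JOIN postgresql.crm.customers c ON e.customer_id = c.id
    WHERE e.event_date >= DATE '2021-01-01'
    GROUP BY c.segment
    ORDER BY clicks DESC
""")
for row in cur.fetchall():
    print(row)
```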
Apache Drill is an open source distributed query engine that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source counterpart of Google’s Dremel system, which is also available as an infrastructure service called Google BigQuery. Drill uses Apache Arrow for in-memory computation and Calcite for query parsing and optimization, but it has never enjoyed wide adoption, mainly because of its inherent performance and concurrency limitations. Some products based on Drill attempt to overcome these limitations. Drill shares many features with Presto/Trino, including support for many data stores, nested data and rapidly evolving structures. Drill doesn’t require a schema definition, which can let malformed data slip through, and its throttling functionality may limit concurrent queries.
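As an illustration of Drill’s schema-on-read approach, the sketch below posts a SQL query to Drill’s REST endpoint with the requests library. The host, the s3.logs workspace and the file path are hypothetical assumptions, and a Drill cluster with an S3 storage plugin is presumed to exist.

```python
import requests

# Drill exposes a REST endpoint on port 8047; host and path are hypothetical.
DRILL_URL = "http://drill.example.com:8047/query.json"

# Drill can query raw files in place without a predefined schema,
# e.g. JSON logs sitting in object storage mounted as a workspace.
payload = {
    "queryType": "SQL",
    "query": "SELECT level, count(*) AS n "
             "FROM s3.logs.`app/2021/06/*.json` "
             "GROUP BY level",
}

resp = requests.post(DRILL_URL, json=payload, timeout=60)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
```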
Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. SQL is supported by using Spark SQL. Queries are executed by a distributed in-memory computation engine on top of structured and semi-structured data sets. Spark works with multiple data formats but is more general in its application and supports a wide range of workloads such as data transformation, ML, batch queries, iterative algorithms, streaming, etc. Spark has seen less adoption for interactive queries than Presto/Trino or Drill.
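A minimal PySpark sketch for comparison: read Parquet files from object storage and run an interactive Spark SQL aggregation. The bucket path and column names are hypothetical, and the appropriate cloud storage connector (S3A, ABFS or GCS) is assumed to be configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-query").getOrCreate()

# Register raw Parquet files from the lake as a temporary SQL view.
# The bucket path is hypothetical.
events = spark.read.parquet("s3a://my-data-lake-bucket/curated/events/")
events.createOrReplaceTempView("events")

# Run an interactive aggregation with Spark SQL.
daily = spark.sql("""
    SELECT event_date, event_type, count(*) AS total
    FROM events
    GROUP BY event_date, event_type
    ORDER BY event_date
""")
daily.show(20)
```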
4. Should You Run Your Own Cloud Data Lake or Use a Managed Solution?
With third-party managed analytics services, enterprises can start using data analytics quickly — and let their provider deal with all the hassle of storing and managing the data. A third-party managed solution also allows users throughout the company to quickly run unlimited queries without having to wait on the DevOps team to allocate resources. However, as adoption and query volume grow, spending balloons dramatically.
And at the end of the day, every managed analytics solution becomes another data silo with its own data flows to manage. Unified access control, audit trails, data lineage, discovery and governance become complex, requiring custom integrations and creating vendor lock-in. That’s the double-edged sword of “quick-start” managed solutions that CIOs need to be aware of, so they can prepare to shift to more economical in-house managed DataOps programs for the cost and control advantages they offer in the long term.
5. The Cloud Data Lake Analytics Stack: Which Use Cases Does It Best Suit? How To Maximize Results?
The cloud data lake analytics stack dramatically improves speed for ad hoc queries, dashboards and reports. It enables you to operationalize all your data and run existing BI tools on lower-cost data lakes without compromising performance or data quality while avoiding costly delays when adding new data sources and reports.
Keep in mind that if you want to serve as many use cases as possible and shift your workloads to the cloud data lake, you’ll want to avoid data silos. Instead, ensure your stack is analytics-ready with workload observability and acceleration capabilities, so it can be easily integrated with niche analytics technologies such as text analytics for folder and log analysis. A solution with integrated text analytics can be used by data teams to run text search at petabyte scale directly on the data lake for marketing, IT, and cybersecurity use cases (and more).
The cloud data lake analytics stack can be used for a wide range of analytics use cases, but a few, in particular, prove to deliver the highest level of ROI. These include:
- Advanced analytics support for data scientists and business intelligence data analysts to provision and experiment with data
- Near real-time analysis of streaming and IoT data
- Rapid anomaly and threat detection over massive amounts of data via the security data lake, a modern and agile alternative for traditional SIEM/SOC platforms
6. How Can You Achieve the Optimal Query Performance and Price Balance?
According to a recent report by Mordor Intelligence, the Data Lakes Market was valued at USD 3.74 billion in 2020 and is expected to reach USD 17.60 billion by 2026, at a CAGR of 29.9% over the forecast period 2021-2026. Improving performance and cost efficiencies are critical driving forces behind this massive adoption.
As analytics use cases grow in demand across almost every business unit, data teams are still constantly struggling with balancing performance and costs. Unfortunately, manual query prioritization and performance optimization are time-consuming, lack scalability and often result in heavy DataOps.
Data teams and end users no longer need to compromise on performance in order to achieve agility, speed and cost-effectiveness. To do so effectively, data teams should leverage strategies and technologies that can autonomously accelerate queries using advanced techniques such as big data indexing. The key element for success is eliminating manual DataOps work. Autonomous query acceleration gives users control over the performance and cost of their cloud data lake analytics, while workload-level observability gives DataOps teams an open view of how data is being used across the entire organization so they can focus resources on business priorities.
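To show why data layout matters for the performance/cost balance, here is a minimal PySpark sketch of partition pruning, a simpler, manual relative of the big data indexing described above (the autonomous tools apply related ideas without hand-tuning). Paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layout-demo").getOrCreate()

# Hypothetical paths; write the table partitioned by event_date so the
# engine can skip whole directories ("partition pruning") at query time.
events = spark.read.parquet("s3a://my-data-lake-bucket/raw/events/")
(events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3a://my-data-lake-bucket/curated/events_by_date/"))

# This filter now reads only the 2021-06-01 partition instead of the full table.
pruned = (spark.read.parquet("s3a://my-data-lake-bucket/curated/events_by_date/")
               .where("event_date = '2021-06-01'"))
print(pruned.count())
```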
CIOs who lead their data teams in this direction can have the best of both worlds. They will find that implementing these techniques can make all the difference while widely expanding data lake architecture across their entire organization.