Enterprise data infrastructures have come a long way since Relational Software released Oracle, the first commercially available relational database management system (RDBMS), and companies began migrating away from paper-based data records. Soon after that, the first enterprise data warehouse was deployed, which consolidated databases in one location, and the journey towards a data-driven world was underway. However, no one anticipated the additional complexity that this journey would place upon the IT stack as enterprises reached for the promise of data-driven decision making without understanding the implications of rapidly advancing database technology.
On the heels of the data warehouse came open-source databases in support of web applications. Soon after, data volumes grew to the point where a single machine could no longer handle the analytical workload. That, in turn, led to massively parallel processing (MPP) databases, and the business world began using the term “big data.”
Eventually, costs skyrocketed, compounded by rigid IT infrastructures that lacked the flexibility needed to support data science and analytics. Enter Hadoop, which began to bring order to it all and created what would turn out to be its lasting legacy: the data lake.
Today, data lakes are flowing to the cloud, and enterprises are looking to separate storage and compute to gain better flexibility and control over costs and performance. Unfortunately, IT is still supporting legacy database systems from every era of that history, a Frankensteinian collection of on-prem hardware, data warehouses, and data swamps. Migrating enterprise data to a cloud-based data lake is a cumbersome, complicated task, often resulting in disruption to business operations and angry analysts. Further, how can IT migrate successfully without finding itself locked into a whole host of new technology vendors?
There is a way to ease the pain of data migrations, ETL (extract, transform, load) jobs and vendor lock-in: optionality.
Creating Optionality in Three Steps
Data engineers and administrators must keep three key things in mind when architecting a data infrastructure for the long run:
- Embrace the separation of storage and compute
- Commit to open-data formats
- Utilize abstraction to future-proof the architecture
Let’s examine each of these steps.
1. Separating Storage and Compute
Historically, the state of the art for configuring servers within a data center was to give each one its own memory, CPU, and hard disk, with data stored locally alongside the compute. The pros: some performance advantage, thanks to the minimal latency of data never traversing the network. The con is a large one: IT has to buy hardware with enough power to handle what it anticipates will be peak usage. As a consequence, more hardware and compute are purchased than is needed most of the time, leaving expensive investments under-utilized.
By leveraging the separation available in a cloud deployment model, you gain the flexibility to scale storage and compute independently of one another, which translates into cost savings and performance efficiency. The benefits include paying only for what is actually used, gaining greater control over performance and cost, greatly reducing data duplication and data loading, and making the same data accessible to multiple platforms.
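As a concrete illustration, here is a minimal sketch of decoupled storage and compute using PyArrow. The bucket name and dataset path are hypothetical, and it assumes AWS credentials are available in the environment and that the data is already stored as Parquet. The point is that the bucket is sized and billed on its own, while the process running the scan is the compute, and it can be as small or as large as the workload requires.

```python
import pyarrow.dataset as ds

# The data lives in object storage ("analytics-lake" is a hypothetical bucket),
# not on this machine's disk, so storage capacity grows independently of the
# compute that happens to be querying it today.
events = ds.dataset("s3://analytics-lake/events/", format="parquet")

# Only the columns needed for this question are pulled over the network;
# everything else stays in inexpensive object storage.
table = events.to_table(columns=["user_id", "event_type"])
print(table.num_rows)
```

The same bucket could just as easily be scanned by a hundred-node cluster tomorrow, without copying a single byte.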
2. Open-Data Formats
One of Hadoop’s contributions to big data analytics is a set of open-data formats optimized for high performance, specifically Avro, ORC, and Parquet. These three file formats revolutionized the data landscape. ORC and Parquet are columnar and read-optimized (Avro is row-oriented and better suited to write-heavy pipelines), so data can be stored in a file system, like HDFS, or object storage, like S3, and still deliver lightning-fast performance when analyzed with SQL.
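To make the columnar advantage concrete, here is a small, self-contained PyArrow sketch (the file and column names are purely illustrative): a query that aggregates amounts by region never needs to read the order ID column at all.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a three-column table to a Parquet file on local disk.
table = pa.table({
    "order_id": [1, 2, 3],
    "region": ["us-east", "eu-west", "us-east"],
    "amount": [120.0, 85.5, 42.0],
})
pq.write_table(table, "orders.parquet")

# Because Parquet stores each column separately, an engine answering
# SUM(amount) GROUP BY region can skip order_id entirely.
subset = pq.read_table("orders.parquet", columns=["region", "amount"])
print(subset.to_pydict())
```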
Even better, you can access these formats from a variety of tools. One common pattern is to use Spark for data science activities, like training machine learning models, while using Presto for BI and SQL use cases. Both access the same open file formats, so no transformation is necessary.
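Here is a hedged sketch of that Spark-plus-Presto pattern. The S3 path, table names, and catalog are hypothetical, and it assumes a Spark session with the s3a:// connector on its classpath plus a Presto/Trino catalog already pointed at the same location.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-open-format").getOrCreate()

# Data science side: Spark writes model features as plain Parquet files in S3.
features = spark.createDataFrame(
    [(1, 0.42), (2, 0.77)], ["customer_id", "churn_score"]
)
features.write.mode("overwrite").parquet("s3a://analytics-lake/churn_features/")

# BI side: Presto/Trino can query the very same files with SQL, with no copy
# or transformation step in between, along the lines of:
#   SELECT customer_id, churn_score
#   FROM hive.analytics.churn_features
#   WHERE churn_score > 0.5;
```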
Historically, relational database management systems have leveraged proprietary storage formats, and extracting data from them involves significant engineering effort and cost, complicating cloud migration strategies and locking you into obsolete architectures. Open-source file formats are the key to avoiding data lock-in, and data lock-in is the worst kind of vendor lock-in.
3. Abstraction: The Secret to a Future-Proofed Architecture
Just the idea of having to perform a massive data migration and determine the “how and when” of getting disparate data stores into the cloud without disrupting analysts’ productivity is enough to give IT teams a migraine. Analysts want just one thing: uninterrupted access to their data. They couldn’t care less where the data lives. The third step to creating optionality is deploying an abstraction layer between users and their data. The abstraction layer goes by many names: consumption layer, query federation, data virtualization, semantic layer, and query fabric.
The abstraction layer takes an SQL query as input and executes it as quickly as possible, wherever the underlying data happens to live. This layer should be MPP in design, highly scalable, and able to push down predicates and column projections, bringing into memory only what is necessary. As a result, analysts can access their data anywhere, without an ETL job or other data movement disrupting their work. An abstraction layer gives IT the time and flexibility to move data to the cloud at its own pace, without hindering analytics operations.
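As one possible sketch of such a layer in action, the snippet below uses the open-source Trino (formerly PrestoSQL) Python client to run a single federated query. The coordinator hostname, catalogs, and table names are all assumptions: it presumes a running Trino cluster with a “hive” catalog over the S3 data lake and a “postgresql” catalog over a legacy on-prem database.

```python
import trino

# Connect to an assumed Trino coordinator; the engine, not the analyst,
# knows where each table physically lives.
conn = trino.dbapi.connect(
    host="trino.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="analytics",
)
cur = conn.cursor()

# One SQL statement joins lake data with a legacy database. The engine pushes
# the date predicate and the column projections down to each source, so only
# the necessary rows and columns ever travel over the network.
cur.execute("""
    SELECT o.region, SUM(o.amount) AS revenue
    FROM hive.analytics.orders AS o
    JOIN postgresql.public.customers AS c
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= DATE '2023-01-01'
    GROUP BY o.region
""")
for row in cur.fetchall():
    print(row)
```

If a table later moves from the on-prem database to the S3 data lake, the catalog configuration changes but the analyst’s SQL does not, which is exactly the optionality this layer is meant to buy.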
By following these three steps, any enterprise can create options for itself, which is a critical benefit when it comes to architecting an infrastructure that can withstand the test of time and whatever may get thrown at it in the future. It’s always good to have options. IT sleeps better at night when it has them.