The African kudu antelope has vertical stripes, symbolic of the columnar data store in the Apache Kudu project.

Cloudera began working on Kudu in late 2012 to bridge the gap between the Hadoop Distributed File System (HDFS) and the HBase database, and to take advantage of newer hardware. It donated Kudu and its accompanying query engine, Impala, to the Apache Software Foundation late last year; Kudu recently was named a top-level project.

“The idea was to build a storage system for the Hadoop ecosystem that was for mixed workloads,” explained Todd Lipcon, vice president of Apache Kudu and a software engineer at Cloudera.

HDFS is good for analytics where you're batch-loading large amounts of data, but you typically don't update that data. It's just a batch of transactions that took place yesterday, or weblogs, or something like that, he said.

Meanwhile, HBase basically took the other side of this trade-off. It’s much more real-time. It’s really good for streaming. You can write data quickly with low latency. You can randomly read that data. You can look up an individual record. You have the ability to update as well.

“When we found HBase was becoming really popular, we found that some users had a mix of the two. They weren’t just doing this online random access, they also had some analytics. Conversely, other customers started with analytics, then said, ‘Hey wait. With this analytics workload, it’d be really nice to start ingesting it in a streaming fashion instead of just batch. Occasionally I have data corrections – updates or deletes.’ So we had a lot of people in the community and customers at Cloudera who were kind of between a rock and a hard place,” he said.


They needed the capabilities of HBase but the analytics performance of HDFS, and they were building really complicated systems to achieve that.

[Image: Kudu vs. Parquet comparison]

“They would keep some amount of the data in HBase, like the most recent data, then have these background processes to export that data into HDFS, then have to synchronize the two systems to keep them up to date with the data. All that complexity was hampering people’s ability to adopt the ecosystem because you no longer had one storage system. You had to manage two, you had to learn about two, they had different APIs and you’re building a lot of this extra tooling to keep the data synchronized between the two.

“We decided to try to build a happy medium for the two use cases. And even if the happy medium probably won’t be optimal for either of the use cases, at least it will be one system that can do both use cases reasonably well. Of course, when you build a specialized system, it will be better than a general one. So there are specific use cases where HDFS will be more relevant and many use cases where HBase will be relevant.

“So basically, some people are willing to take some tradeoff on performance for simplicity of deployment and simplicity of application development,” he said.

Sort of Familiar

Cloudera stated its initial design goals for Kudu as:

  • Strong performance for both scan and random access.
  • High CPU and I/O efficiency.
  • The ability to update data in place.
  • The ability to support active-active replicated clusters that span multiple data centers in different parts of the world.

Kudu is a storage system for tables of structured data. Its tables look like those in SQL relational databases, each with a primary key made up of one or more columns; that key enforces uniqueness and acts as an index for efficient updates and deletes.


Tables are made up of logical subsets of data called tablets, similar to partitions in relational database systems. Kudu replicates these tablets to multiple commodity hardware nodes using the Raft consensus algorithm, which ensures that every write is persisted by at least two nodes before the client request is acknowledged, protecting against data loss from a machine failure.
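To make the table and tablet model concrete, here is a minimal sketch using the Kudu Java client. The master address, table name, columns, and partitioning choices are all hypothetical, and the package names assume a 1.x-era client:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class CreateMetricsTable {
  public static void main(String[] args) throws Exception {
    // Connect to the Kudu master (hypothetical address; 7051 is the default RPC port).
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      // Two columns form a composite primary key; key columns are declared first.
      List<ColumnSchema> columns = new ArrayList<>();
      columns.add(new ColumnSchema.ColumnSchemaBuilder("host", Type.STRING).key(true).build());
      columns.add(new ColumnSchema.ColumnSchemaBuilder("ts", Type.INT64).key(true).build());
      columns.add(new ColumnSchema.ColumnSchemaBuilder("metric", Type.DOUBLE).build());
      Schema schema = new Schema(columns);

      // Hash-partition rows by host into four tablets, each replicated
      // to three nodes via Raft consensus.
      CreateTableOptions options = new CreateTableOptions()
          .addHashPartitions(Arrays.asList("host"), 4)
          .setNumReplicas(3);

      client.createTable("metrics", schema, options);
    } finally {
      client.shutdown();
    }
  }
}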

There are Java, C++, and Python APIs for “NoSQL”-style access to individual rows, and the same APIs can be used for batch access for machine learning or analytics.
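As a hedged sketch of that row-oriented path, again with the Java client and the hypothetical metrics table from the previous sketch:

import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;

public class InsertMetric {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("metrics");
      KuduSession session = client.newSession();

      // Build a single-row insert; updates and deletes work the same
      // way via table.newUpdate() and table.newDelete().
      Insert insert = table.newInsert();
      PartialRow row = insert.getRow();
      row.addString("host", "web01");
      row.addLong("ts", System.currentTimeMillis());
      row.addDouble("metric", 0.85);

      session.apply(insert);
      session.close(); // flushes any buffered operations
    } finally {
      client.shutdown();
    }
  }
}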

You can stream real-time data in using the Java client, and then process it immediately using Spark, Impala, or MapReduce, and transparently join Kudu tables with data stored in HDFS or HBase.
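On the analytics side, a batch read through the same Java client is, roughly, a columnar scan. Spark, Impala, and MapReduce ship their own integrations, so this is only a sketch of the underlying scan API, with a projection and a hypothetical predicate:

import java.util.Arrays;

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduPredicate;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;

public class ScanMetrics {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("metrics");

      // Project only the columns the analysis needs, and push the
      // predicate down to the tablet servers.
      KuduScanner scanner = client.newScannerBuilder(table)
          .setProjectedColumnNames(Arrays.asList("host", "metric"))
          .addPredicate(KuduPredicate.newComparisonPredicate(
              table.getSchema().getColumn("metric"),
              KuduPredicate.ComparisonOp.GREATER, 0.5))
          .build();

      // Iterate over batches of rows as they stream back from the tablets.
      while (scanner.hasMoreRows()) {
        RowResultIterator batch = scanner.nextRows();
        while (batch.hasNext()) {
          RowResult result = batch.next();
          System.out.println(result.getString("host") + " " + result.getDouble("metric"));
        }
      }
      scanner.close();
    } finally {
      client.shutdown();
    }
  }
}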

Rajan Chandras, director of data architecture and strategy at NYU Langone Medical Center, has called Kudu/Impala potential game changers as a full-fledged alternative to the Hive/MapReduce/HDFS stack.

1.0 Coming Soon

While Kudu has good integration with Impala, the two are not tightly coupled, Lipcon said.

“You can choose to use SQL with Impala, which is the one Cloudera has been focusing on, but you can also choose to use SparkSQL. The Kudu project has people working on Apache Drill. We’re happy to integrate with any SQL engine – the more the better. For our success, we want to work with as many SQL engines and analytic engines as possible.”

The software is still in the pre-1.0 release phase, though a number of organizations are already using Kudu in production. Those users, however, are tightly integrated with the development community: the types of cutting-edge users who compile their own Kudu binaries from source.

“Our goal, of course, is to make it a generally usable thing for any enterprise,” Lipcon said. “That’s the 1.0 milestone we’re marching toward, and we’re hoping that will happen sometime in September. It will include more quality assurance and a couple more features that are important for stability and reliability of the system. Fixing some bugs, improving high availability, knowing: Can you really run this thing 24/7 365 and have really great uptime?”


Feature Image: “n130_w1150” by Biodiversity Heritage Library, licensed under CC BY-SA 2.0.