
March 30, 2022 by Phu Nguyen

Update Pachyderm Challenges Hadoop with Containerized Data Lakes


This article, filed under Data Science, looks at how Pachyderm challenges Hadoop with containerized data lakes.

Key Summary

This InApps.net article, published in 2022, explores Pachyderm’s container-based alternative to Hadoop for big data analytics. Written with an informative, technical tone, it aligns with InApps Technology’s mission to cover data science and software development trends, offering an accessible overview of modern data lake solutions.

Key Points:

Context: Hadoop’s reliance on Java and HDFS limits flexibility, prompting Pachyderm to leverage Docker, Kubernetes, and CoreOS for a more versatile, containerized data processing platform.
Core Insight: Pachyderm replaces Hadoop’s HDFS with its Pachyderm File System and MapReduce with Pachyderm Pipelines, emphasizing reproducibility, data provenance, and collaboration via a Git-inspired model.
Key Features:
- Containerized Stack: Uses Docker for flexible tool integration, allowing any programming language or tool, unlike Hadoop’s Java-centric approach.
- Git-Like Repositories: Enables reusable, well-documented analysis pipelines, fostering team collaboration and reducing redundant development.
- Customer Impact: Fogger, a sensor data platform, adopted Pachyderm for its low learning curve and seamless integration into containerized infrastructure, simplifying data processing for industrial applications.
Outcome: Pachyderm’s containerized approach offers a scalable, user-friendly alternative to Hadoop, enabling faster, more flexible big data analytics with broad applicability, as demonstrated by Fogger’s use case.

This article reflects InApps.net’s focus on innovative data science and software development, providing an inclusive, practical overview of Pachyderm’s disruption of traditional big data frameworks.

What’s New

The two components Pachyderm developed for the stack are a file system and a pipeline system.

Pachyderm Pipelines is a system for stringing containers together and doing data analysis with them. You create a containerized program with the tools of your choice that reads and writes to the local filesystem. Pachyderm uses a FUSE volume to inject data into the container, then automatically replicates the container, showing each copy a different chunk of data. This technique enables Pachyderm to scale any code you write to process massive data sets in parallel, according to Zwicker. It doesn’t require Java at all: If it fits in a container, you can use it for data analysis.
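As a rough illustration of that model, the sketch below shows a program that only reads and writes local files, plus a small driver that mimics handing each copy a different chunk of data. It is not Pachyderm code: the function names (count_words, run_sharded), the directory layout and the use of threads in place of replicated containers are all assumptions made for this example.

```python
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def count_words(input_dir: Path, output_file: Path) -> None:
    """The 'containerized program': reads local files, writes a local file."""
    counts = Counter()
    for path in sorted(input_dir.iterdir()):
        if path.is_file():
            counts.update(path.read_text().split())
    lines = [f"{word}\t{n}" for word, n in sorted(counts.items())]
    output_file.write_text("\n".join(lines))


def run_sharded(chunks: List[Path], out_dir: Path) -> Counter:
    """Hypothetical stand-in for the platform: one worker per chunk, then combine."""
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = [out_dir / f"part-{i}.tsv" for i in range(len(chunks))]
    with ThreadPoolExecutor() as pool:
        # Each worker sees only its own chunk directory (assumed layout).
        list(pool.map(count_words, chunks, outputs))
    total = Counter()
    for out in outputs:
        for line in out.read_text().splitlines():
            word, n = line.split("\t")
            total[word] += int(n)
    return total


def demo() -> Counter:
    """Build two small chunks in a temporary directory and process them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        chunk_a, chunk_b = root / "chunk-a", root / "chunk-b"
        chunk_a.mkdir()
        chunk_b.mkdir()
        (chunk_a / "log1.txt").write_text("error ok ok")
        (chunk_b / "log2.txt").write_text("ok error warn")
        return run_sharded([chunk_a, chunk_b], root / "out")


if __name__ == "__main__":
    print(demo())
```

The point of the design is that count_words knows nothing about distribution. It could be swapped for a program in any language, as long as it reads from one directory and writes to another.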

Pachyderm File System is a distributed file system that draws inspiration from git, providing version control over all the data. It’s the core data layer that delivers data to containers. The data is stored in generic object storage such as Amazon’s S3, Google Cloud Storage or the open source Ceph file system. And like Apple’s Time Machine, it provides historical snapshots of how your data looked at different points in time.
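To make the idea of git-style snapshots over data more concrete, here is a toy in-memory sketch. It is not how Pachyderm File System is implemented: the VersionedStore class, its method names and the choice to copy every file into every commit are assumptions made purely for illustration.

```python
import hashlib
import json
from typing import Dict, List, Optional


class VersionedStore:
    """Toy store: every commit is an immutable snapshot of all files."""

    def __init__(self) -> None:
        self._commits: Dict[str, dict] = {}
        self.head: Optional[str] = None

    def commit(self, changes: Dict[str, str]) -> str:
        """Apply changes on top of the current head and record a new snapshot."""
        files = self.checkout(self.head) if self.head else {}
        files.update(changes)
        # The commit id is derived from the parent id and the file contents.
        payload = json.dumps({"parent": self.head, "files": files}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self._commits[commit_id] = {"parent": self.head, "files": files}
        self.head = commit_id
        return commit_id

    def checkout(self, commit_id: str) -> Dict[str, str]:
        """Return a copy of the files exactly as they looked at that commit."""
        return dict(self._commits[commit_id]["files"])

    def history(self) -> List[str]:
        """Walk parent links from head back to the first commit."""
        ids, current = [], self.head
        while current is not None:
            ids.append(current)
            current = self._commits[current]["parent"]
        return ids


store = VersionedStore()
first = store.commit({"readings.csv": "t,temp\n0,20.5\n"})
second = store.commit({"readings.csv": "t,temp\n0,20.5\n1,21.0\n"})
```

After the second commit, store.checkout(first) still returns the data as it looked originally, which is the property the Time Machine comparison describes.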

“It lets you see how things have changed; it lets people work together,” Zwicker said. “It allows people to not only collaborate on code but on data. One data scientist can build a data set, and another can fork it and build off of it, then merge the results back with the original one. This is something that has been completely missing from the data science tools out there.”

There’s no shortage of technologies, including Spark, Pig, Hive and others, that are considered alternatives to MapReduce, the processing layer in Hadoop.

“We think the existence of all those tools is an indication that MapReduce was the wrong idea to begin with. It was an overly constraining way of analyzing,” Zwicker said.

“What Hadoop found was that MapReduce could do a bunch of stuff, but they needed to invent other things on top of it, like Pig and Hive and those things,” Zwicker said. “Hadoop has something kind of like what we do, which is called Hadoop Streaming, but it’s a very second-class citizen that’s added afterward rather than us having our containerized workload be the core layer that everybody uses.”

Doliner adds that Spark, Hive and other tools are all still built on top of core pieces of the Hadoop infrastructure, such as ZooKeeper, YARN and HDFS, which are among Hadoop’s weaknesses.

Docker Was ‘Aha’ Moment

Doliner and Zwicker founded the San Francisco-based company in 2014 and participated in Y Combinator in early 2015. It has raised $2 million from Data Collective, Blumberg Capital, Foundation Capital, and others.

It might appear nakedly ambitious to state one’s plans to replace Hadoop, but the founders contend they have the only company building something totally new.

“If you look at what [the others] are building, all of it is still the same Hadoop primitives repackaged in some way. We’ve believed from very early on that the problem isn’t that Hadoop isn’t packaged in the right way, but that Hadoop has inherent flaws,” Doliner said.

The company started out before Docker was released. The founders knew early on that they wanted to build a replacement for Hadoop, and then saw an early demo of Docker at their former employer, RethinkDB.

“That was the ‘aha’ moment,” Zwicker said. “We knew Hadoop was going to be replaced and saw containers are the perfect tool to do it. We knew they were going to create this whole ecosystem we could use to replace it. When we put all of that together, that is when things really started working for us.”

Adds Doliner: “We’re not just saying, ‘Hey containers are a hot new technology. Let’s take everything and shove it in a container’ and all of a sudden that’s a new product.”

Because the company was early to the container movement, it has all been evolving together, he said.

One of the key benefits of Pachyderm, they say, is that it doesn’t take the large team with specific expertise that Hadoop requires to be productive. That was an attractive feature for its customer Fogger, according to CEO Kamil Kozak.

Fogger makes a software platform for processing sensor data from industrial installations such as solar farms and wind turbines. Its Fog Computing platform allows data processing on small Linux boxes close to the machines and pushes the data over a peer-to-peer network to a central cloud hub. It uses Pachyderm for local data processing on its way to the cloud.

“At Fogger, we believe that containers are redefining infrastructure and that they will be used in all types of deployments,” Kozak said.

“Pachyderm has a very well-designed technological stack. We love the idea of map/reduce pipelines built with containers and a simple Git-like triggering system.

“We were evaluating having to build our own solution in-house or using something like Hadoop/Spark when I stumbled across Pachyderm. We chose Pachyderm because the learning curve and infrastructure overhead for Hadoop/Spark was significantly harder than Pachyderm; it just fit seamlessly into our containerized stack,” he said.

Containers allow Fogger to build data-processing algorithms in any programming language, “which simplifies our lives drastically as we don’t have to learn any new technology other than Pachyderm CLI itself,” Kozak said.

Feature image via Pixabay, licensed under CC0.

InApps is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.

Source: InApps.net


