Mountain View, Calif.-based Dremio emerged from stealth on Wednesday, aiming to make data analytics self-service. It’s a goal similar to Qubole’s, though the two startups are taking different approaches.
Essentially, Dremio aims to eliminate the middle layers and the work involved between the user and the data stores, including traditional ETL, data warehouses, cubes and aggregation tables.
Two-year-old Dremio’s founders are Tomer Shiran, former vice president of product at MapR, and Jacques Nadeau, who ran the distributed systems team at MapR. Both have been active in open source: Shiran founded the Apache Drill project, and Nadeau is the creator and Project Management Committee (PMC) chair of Apache Arrow.
“We created the company because we believe there’s a massive opportunity for disruption here,” Shiran explained. “Think about what Amazon was able to do for application developers… Ten years ago, if you were an application developer, you were really reliant on IT to go buy and set up resources for you. Amazon created a solution that put developers in the driver’s seat. It gave developers the ability to get their own resources and their own hardware, and they can do it almost instantaneously, in a minute.”
That’s what Dremio aims to do for business analysts and data scientists.
The company has raised more than $15 million from Lightspeed Venture Partners and Redpoint. Its management team includes big data and open source leaders from Hortonworks, Mesosphere and MongoDB.
Connecting Directly
Dremio connects to all of an organization’s data sources, including data lakes and databases, and takes care of everything in the middle. Its Arrow-based execution engine uses columnar in-memory processing to execute queries that run on a single source or across data in different sources.
It also optimizes the data itself, similar to the way Google optimizes data in various data structures so that search queries can be very fast, Shiran said. Dremio calls these data structures “Reflections.”
And it has a user interface much like Google Docs, but for data sets rather than documents. Users can see and explore the data themselves. They can create new data sets through live data curation, and they can interact with the data visually or through SQL.
“Everything under the hood is standard SQL, and more technical users can do anything in the power of SQL. You can create new data sets, share them with colleagues. There’s an entire data catalog in there.
“Then with a click of a button, you can launch any of these BI tools, connect it to the Dremio cluster already and start playing with the data inside Tableau without extracting any data. There are no copies of data. All the data sets and curation inside Dremio are all virtual. It’s all done at the logical layer. All the current solutions are based on data copies, and Dremio is the opposite of that,” he said.
Because the major BI tools are based on SQL, Dremio forms a bridge between those tools and NoSQL databases such as MongoDB, automatically learning the implicit schema of various systems even when they don’t have an explicit schema.
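To make that concrete, here is a minimal sketch of what the bridge looks like from a client’s point of view, using Python’s pyodbc module. The ODBC DSN and the dotted dataset paths (`mongo.sales.orders`, `postgres.crm.customers`) are hypothetical placeholders, not names from Dremio’s documentation.

```python
# A minimal sketch: querying Dremio over ODBC with standard SQL.
# The DSN name and the dataset paths below are hypothetical.
import pyodbc

# Connect to a Dremio coordinator through a preconfigured ODBC DSN.
conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cursor = conn.cursor()

# One standard SQL statement joins a schemaless MongoDB collection
# with a relational table; Dremio infers the Mongo schema on read.
cursor.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM   mongo.sales.orders o
    JOIN   postgres.crm.customers c ON o.customer_id = c.id
    GROUP  BY c.name
    ORDER  BY total DESC
""")

for name, total in cursor.fetchall():
    print(name, total)
```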
“It’s kind of what Splunk did with logs,” Shiran explained. “It wasn’t that people weren’t analyzing logs before, but they were using a lot of command-line tools and loading logs into relational databases — it was just a lot of manual work. Splunk designed a solution specifically for log analytics and made it so you don’t have to glue together all these tools in order to analyze your logs.”
Standard SQL
Dremio is designed to scale from one server to thousands of servers in a single cluster. It can be deployed on Hadoop or on dedicated hardware. With Hadoop, the company recommends deploying Dremio on the Hadoop cluster itself so that cached data stays local to the raw data.
There are two roles in the Dremio cluster:
- Coordinators, which plan and coordinate query execution, manage metadata and serve the UI.
- Executors, which process queries.
Deploying coordinators on edge nodes lets external applications such as BI tools connect to them. Coordinators use YARN to provision compute capacity in the cluster, eliminating the need for manual deployment. The company recommends one executor on each Hadoop node in the cluster.
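As a rough illustration of that split, each node’s role is declared in its Dremio configuration file. The sketch below follows the `services` pattern used in `dremio.conf`; treat the exact keys as an assumption, since they can vary by version.

```
# dremio.conf on an edge (coordinator) node; keys are an assumption
services: {
  coordinator.enabled: true,  # plan queries, manage metadata, serve the UI
  executor.enabled: false     # this node does not process query fragments
}

# dremio.conf on each Hadoop worker node
services: {
  coordinator.enabled: false,
  executor.enabled: true      # process queries close to the data
}
```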
Dremio, in effect, is an extension of the founders’ open source work. Drill is a single SQL engine that can query and join data from myriad systems. Dremio uses Apache Arrow (columnar in memory) and Apache Parquet (columnar on disk) for high-performance columnar storage and execution.
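To give a feel for what those two formats provide, here is a small, self-contained Python sketch using the pyarrow library: an Arrow table held column by column in memory, written to Parquet on disk, and read back one column at a time. The file name and values are illustrative only.

```python
# A small sketch of the two columnar formats Dremio builds on:
# Apache Arrow (in memory) and Apache Parquet (on disk).
import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table: values are stored column by column in memory,
# so an engine can scan only the columns a query actually touches.
table = pa.table({
    "customer_id": [1, 2, 3],
    "amount": [19.99, 5.00, 42.50],
})

# Persist the same columnar layout to disk as Parquet...
pq.write_table(table, "orders.parquet")

# ...and read back a single column without touching the others.
amounts = pq.read_table("orders.parquet", columns=["amount"])
print(amounts.column("amount").to_pylist())  # [19.99, 5.0, 42.5]
```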
Dremio looks like a single, high-performance relational database to any tool. You just send standard SQL queries. Meanwhile, Dremio automatically optimizes the physical organization of your data for different workloads in a cache, or it queries your data sources directly when you need access to live datasets.
It uses a persistent cache that can live on HDFS, MapR-FS, cloud storage such as S3, or direct-attached storage (DAS). The cache size can exceed that of physical memory, an architecture that enables Dremio to cache more data at a lower cost, producing a higher cache hit ratio compared to traditional memory-only architectures, according to the company.
It also offers native query push-downs. Instead of performing full table scans for all queries, Dremio pushes processing down into the underlying data sources. It rewrites SQL into the native query language of each data source, such as Elasticsearch, MongoDB and HBase, and optimizes processing for file systems such as Amazon S3 and HDFS.
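As a conceptual illustration (the dataset path and field names here are hypothetical), a filtered aggregation written in SQL does not force Dremio to pull every record out of the source:

```python
# Conceptual push-down sketch; the dataset path and fields are hypothetical.
# The client always writes standard SQL:
sql = """
    SELECT level, COUNT(*) AS n
    FROM   elastic.logs.app_events
    WHERE  event_time > TIMESTAMP '2017-07-01 00:00:00'
    GROUP  BY level
"""
# Rather than scanning every document, Dremio rewrites the filter and
# aggregation into a native Elasticsearch request (roughly, a range
# filter plus a terms aggregation), so the work happens inside the
# source and only the grouped results come back over the wire.
```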
Its Data Graph preserves a complete view of the flow of data. Companies have full visibility into how data is accessed, transformed, joined, and shared across all sources and all analytical environments.
Open Source Model
Dremio comes in an open source Community edition and an Enterprise edition. The Enterprise edition includes connectivity to enterprise data sources such as IBM DB2, as well as security and governance capabilities.
It can run on-premises or in the cloud. There are advantages to running Dremio in the cloud; for example, you can store Reflections, the optimized data structures, directly on S3, Shiran said.
“It’s a fully managed cache and you can scale your compute capacity independent of that. Say after a Black Friday, you need more analytics capacity, you spin up a few more Dremio instances, and you spin it down when you don’t need it,” he said.
Feature Image: Dremio co-founders Jacques Nadeau (right) and Tomer Shiran (Dremio).
InApps is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Dremio.