One of the Linux Foundation’s newest projects, called Delta Lake, aims to ensure the reliability of data across data lakes at massive scale. These Big Data systems are most commonly used for machine learning and data science, but also for business intelligence, visualization and reporting.

With multiple people working with data in a data lake at the same time, it’s easy for problems like incomplete transactions or multiple simultaneous updates to bring the quality of the data into question.

Delta Lake was built by Databricks, the company founded by the creators of Apache Spark. Though initially built atop Spark, it now also supports other open source Big Data systems.

“Delta Lake enables you to add a transactional layer on top of your existing data lake. Now that you have transactions on top of it, you can make sure you have reliable, high-quality data, and you can do all kinds of computations on it. You can, in fact, mix batch and streaming. … Because the data is reliable, it’s OK to have someone streaming in data while someone else is in batch reading it,” Ali Ghodsi, co-founder and CEO of Databricks, explained at Spark+AI Summit Europe.

Delta Lake provides ACID transactions, snapshot isolation, data versioning and rollback, as well as schema enforcement to better handle schema changes and data type changes.
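To make that concrete, here is a minimal sketch using PySpark and the delta-spark Python package. The session configuration follows the project’s documented setup; the table path and column names are purely illustrative. The first write commits atomically, and a second write with a mismatched schema is rejected by schema enforcement.

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Initial write: the commit either lands completely or not at all (atomicity).
spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"]) \
    .write.format("delta").save("/tmp/delta/events")

# A write whose schema does not match the table is rejected (schema enforcement),
# unless the change is explicitly allowed, for example with the mergeSchema option.
try:
    spark.createDataFrame([(3, "carol", "x")], ["id", "name", "extra"]) \
        .write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as err:
    print("Write rejected by schema enforcement:", err)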


Transactional Support

Databricks open sourced the technology in April under the Apache 2.0 license.

Companies using it in production include Viacom, Edmunds, Riot Games and McGraw Hill. Alibaba, Booz Allen Hamilton, Intel and Starburst Data are collaborating with Databricks to add support for Apache Hive, Apache NiFi and Presto.

There are other ways to add transactional support to data lakes. Cloudera’s Project Ozone takes a similar tack, and there’s Hive for HDFS-based storage.

It’s not a storage system per se; it sits atop existing storage, such as HDFS or cloud object stores like Amazon S3 and Azure Blob Storage, providing a bridge between on-premises and cloud storage systems.

It can read from any storage system that supports Apache Spark’s data sources and can write to Delta Lake, which stores data in Apache Parquet format. All transactions made on Delta Lake tables are stored directly to disk.
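As a sketch of that flow, and assuming the Spark session from the earlier example, any source Spark can read can be rewritten as a Delta table, and an existing Parquet directory can be converted in place. The paths below are placeholders.

# Any Spark data source (JSON, CSV, JDBC, Parquet, ...) can be written out as Delta.
raw = spark.read.json("/data/raw/clicks")
raw.write.format("delta").mode("overwrite").save("/data/delta/clicks")

# An existing Parquet directory can also be converted in place; this only adds
# the transaction log next to the Parquet files already on disk.
from delta.tables import DeltaTable
DeltaTable.convertToDelta(spark, "parquet.`/data/parquet/clicks`")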

Central to Delta Lake is the transaction log, a central repository that tracks all changes that users make. It records every change as a JSON file, in the order it was made. If someone makes a change and later deletes it, a record of both remains, which simplifies auditing.

The log provides atomicity: only transactions that execute fully and completely are recorded, which ensures the trustworthiness of the data.
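The log can be inspected from Spark as table history. The short sketch below, which assumes the session and table path used in the earlier examples, lists each committed version, when it was made and what operation produced it.

from delta.tables import DeltaTable

events = DeltaTable.forPath(spark, "/data/delta/clicks")
# Every committed change, including ones that were later undone, stays in the
# log, so history() serves as a straightforward audit trail.
events.history().select("version", "timestamp", "operation").show(truncate=False)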

Optimistic Protocol

Just as multiple people can work on a jigsaw puzzle by tackling different areas of it, Delta Lake is designed to enable multiple people to work on the data at once without stepping on each others’ toes.

When dealing with petabytes of data, most likely those users will be working on different parts of the data. If, for instance, two changes do happen simultaneously, it relies on optimistic concurrency control, a protocol in which the data remains unlocked, to settle the matter.
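In practice, the losing writer in such a race receives a conflict error and can simply retry against the new snapshot of the table. The sketch below, again assuming the session and table from the earlier examples, shows one way to do that with the conflict exceptions exposed by the delta-spark Python package; the retry policy itself is purely illustrative.

from delta.tables import DeltaTable
from delta.exceptions import ConcurrentAppendException, ConcurrentDeleteReadException

table = DeltaTable.forPath(spark, "/data/delta/clicks")
for attempt in range(3):
    try:
        # Conflicts only arise if another commit touched the same files this
        # operation read; otherwise both writers succeed without locking.
        table.delete("event_date = '2019-10-01'")
        break
    except (ConcurrentAppendException, ConcurrentDeleteReadException):
        # Another writer won the race; retry against the table's new snapshot.
        continue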

It also offers a “time travel” or data-versioning feature, enabling users to query a table as it existed at a specific point in time. Every 10 commits to the transaction log, Delta Lake saves a checkpoint file in Parquet format. Those checkpoints enable Spark to skip ahead to the most recent one, which reflects the state of the table at that point, rather than replaying every JSON commit.
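A brief sketch of what that looks like from PySpark, assuming the earlier session and a placeholder path: older snapshots can be read back either by version number or by timestamp.

# Read the table as it existed at version 5 ...
v5 = spark.read.format("delta").option("versionAsOf", 5).load("/data/delta/clicks")

# ... or as it existed at a given point in time.
oct1 = (spark.read.format("delta")
        .option("timestampAsOf", "2019-10-01")
        .load("/data/delta/clicks"))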


Delta Lake supports two isolation levels: Serializable and WriteSerializable. WriteSerializable, which is stronger than snapshot isolation, offers the best combination of availability and performance and is the default. The strongest level, Serializable, ensures that the serial sequence of writes matches exactly the one shown in the table’s history.
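The isolation level is a per-table setting. As a sketch, with an illustrative path, the stricter level can be requested through the delta.isolationLevel table property.

spark.sql("""
    ALTER TABLE delta.`/data/delta/clicks`
    SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')
""")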

The Linux Foundation is a sponsor of InApps Technology.

Image by DreamyArt from Pixabay.