What is Delta Lake in Databricks?
With Delta Lake version 3.0, Databricks has created a new universal format to resolve compatibility issues
Delta Lake is an open source framework designed by Databricks that allows businesses to manage and analyze the data they generate in real time.
Taking the form of a transactional storage layer, it provides a foundation for storing data and tables on Data Lakehouses – which Databricks refers to as an open data management architecture. This allows for data warehousing and machine learning operations to be applied directly onto a data lake.
It’s the default storage format for all Databricks operations and an open protocol, with the organization completely open sourcing Delta Lake at the launch of Delta Lake 2.0 in June 2022.
With Delta Lake 3.0, unveiled at the Databricks Data + AI Summit 2023, Databricks has introduced a new universal format – UniForm – that solves a longstanding compatibility issue in the way businesses store metadata in their data lakes. It arrives alongside a bid to remove data silos by addressing connector fragmentation, as well as performance improvements.
What is Delta Lake?
Introduced as an open source project in 2019, Delta Lake was created to ensure the data that powers the real-time insights organizations crave is genuinely reliable. It was billed at the time as Databricks’ most important innovation, ahead even of the Apache Spark engine for data processing.
The need for Delta Lake has emerged from the way businesses have traditionally managed data in massive and unrefined data lakes. Because these repositories contain both structured and unstructured data, at different scales, any operations running on top of this data – such as data analytics or machine learning – might not be optimized.
Databricks designed Delta Lake to transform these messy data lakes into cleaner ‘Delta Lakes’ with higher-quality data. This ultimately ensures any additional processes or operations performed on top generate much better insights, even when performed in real time.
“Nearly every company has a data lake they are trying to gain insights from, but data lakes have proven to lack data reliability,” said Databricks cofounder and CEO, Ali Ghodsi. “Delta Lake has eliminated these challenges for hundreds of enterprises. By making Delta Lake open source, developers will be able to easily build reliable data lakes and turn them into ‘Delta Lakes’.”
Delta Lake competes with a number of other services on the market that perform similar functions, including Apache Hudi, Azure Data Lake, and Snowflake.
How does Delta Lake work?
Based in the cloud, Delta Lake is an extension to Apache Spark that allows organizations to create Delta tables by default, whether they’re working in Spark or SQL. The open source platform extends the classic Parquet data file format with a file-based transaction log, which enables additional functionality.
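As a rough illustration, the sketch below creates a Delta table locally with PySpark and the open source delta-spark package; the table path and column names are placeholders, and writing the table produces Parquet data files alongside a _delta_log directory holding the transaction log.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Configure a local Spark session with the Delta Lake extensions enabled.
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Saving with format("delta") writes Parquet data files plus a _delta_log/
# directory containing the file-based transaction log. The path is a placeholder.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/events")
```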
Delta Lake features include providing atomicity, consistency, isolation and durability (ACID) transactions on Spark, as well as scalable metadata handling. The former ensures there’s never inconsistent data, while the latter takes advantage of Spark’s distributed processing power to handle masses of metadata at once.
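To make the ACID point concrete, here is a hedged sketch of an atomic upsert using Delta Lake’s MERGE API, reusing the `spark` session and the placeholder /tmp/events table assumed in the previous snippet.

```python
from delta.tables import DeltaTable

# New rows to merge into the table created in the previous snippet.
updates = spark.createDataFrame([(2, "bobby"), (3, "carol")], ["id", "name"])
target = DeltaTable.forPath(spark, "/tmp/events")

# The merge commits atomically to the transaction log: concurrent readers see
# either the previous snapshot or the new one, never a partially applied mix.
(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```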
Organizations can also take advantage of streaming and batch unification – by allowing both types of data to land in the same sink. This comes alongside some quality-of-life features such as rollbacks and historical audit trails, as well as the capability to reproduce machine learning experiments.
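The snippet below sketches two of those features against the same assumed table: reading an earlier version of the data (the time travel behind audit trails and reproducible experiments), and pointing a streaming write at the same path a batch job writes to. The paths and the use of Spark’s built-in rate source are illustrative only.

```python
# Time travel: read the table as it looked at an earlier version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")
v0.show()

# Streaming and batch unification: a streaming query appends to the very same
# Delta path that batch jobs write to. The "rate" source stands in for a real stream.
stream = (
    spark.readStream.format("rate").load()
    .selectExpr("value AS id", "CAST(value AS STRING) AS name")
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/events_checkpoint")
    .start("/tmp/events")
)
```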
It comes with a variety of open source connectors, including Apache Flink, Presto and Trino, alongside standalone readers/writers that let clients in Python, Ruby and Rust write directly to Delta Lake without needing a massive data engine like Apache Spark.
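By way of example, the standalone deltalake Python package (the delta-rs bindings) can read and write a Delta table with no Spark engine involved; the local path below is a placeholder.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Write and read a small Delta table without Spark.
write_deltalake("/tmp/small_table", pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}))
print(DeltaTable("/tmp/small_table").to_pandas())
```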
In 2021, Databricks launched Delta Sharing, a protocol that lets different companies share massive data sets securely and in real time. It aims to break vendor lock-in and dismantle the data silos created by proprietary data formats and the computing resources required to read data. Users can access shared data through Pandas, Tableau, Presto, and other platforms without needing proprietary systems.
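A recipient might consume a share with the open delta-sharing Python client along these lines; the profile file and the share/schema/table names are placeholders standing in for credentials and paths issued by a data provider.

```python
import delta_sharing

# Credentials file issued by the data provider, plus a share/schema/table path.
profile_file = "config.share"
table_url = f"{profile_file}#my_share.my_schema.my_table"

# Load the shared table straight into a pandas DataFrame.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```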
What’s new in Delta Lake 3.0?
With the latest iteration of Delta Lake, Databricks is hoping to end the ‘format wars’ that have long posed challenges for storing and managing data in data lakes.
Most data (99%) is stored in data lakes using the Parquet format, but organizations must choose one of several competing standards to store all of the metadata: Delta Lake, Apache Iceberg, and Apache Hudi. Each format handles metadata differently, and the three are fundamentally incompatible with one another. Once enterprises choose a format, they’re stuck using it for everything.
With Delta Lake 3.0, Databricks has created the universal format (UniForm), which serves as a unification of the three standards. Customers who use a Databricks data lake will find their data stored in the Parquet format, as usual, alongside three different versions of the metadata. This means the Delta Lake data can be read as if it were stored in either Iceberg or Hudi.
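In practice, UniForm is switched on per table via a table property. The hypothetical example below follows the property name used in the Delta Lake 3.0 UniForm documentation for generating Iceberg metadata; the table name is a placeholder, and the exact property and any additional compatibility settings should be checked against the docs for your version.

```python
# Create a Delta table with UniForm enabled so Iceberg clients can read its metadata.
spark.sql("""
    CREATE TABLE uniform_demo (id BIGINT, name STRING)
    USING DELTA
    TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg')
""")
```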
“The metadata is very important, and if you get the metadata wrong, you can’t actually access this stuff,” says Databricks CEO Ali Ghodsi, speaking at the Databricks Data + AI Summit 2023. “Since all three projects are open source, we just went and understood exactly how to do it in each of them, and now inside Databricks, when we create data, we create data for all three.”
Also new in the latest iteration of Delta Lake is the Delta Kernel, which addresses connector fragmentation by ensuring connectors are built against a core library that implements Delta specifications. This alleviates the burden of having to update Delta connectors with each new version or protocol change. With one API, developers can keep connectors up-to-date and ensure the latest innovations are pushed out as soon as they’re ready.
The final new addition is Delta Liquid Clustering, which boosts the performance of reads and writes. A flexible data layout technique, liquid clustering marks a departure from the decades-old Hive-style table partitioning system, which uses a fixed layout. It means organizations can take advantage of cost-efficient data clustering as their data lakes grow over time.
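As a sketch, liquid clustering is declared with a CLUSTER BY clause instead of a fixed PARTITIONED BY layout; the table and column names below are hypothetical, and this assumes a runtime that supports Delta Lake 3.x liquid clustering.

```python
# Create a table that uses liquid clustering rather than hive-style partitions.
spark.sql("""
    CREATE TABLE clustered_events (event_date DATE, user_id BIGINT, payload STRING)
    USING DELTA
    CLUSTER BY (event_date, user_id)
""")

# Clustering keys can be changed later without rewriting the existing layout.
spark.sql("ALTER TABLE clustered_events CLUSTER BY (user_id)")
```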
Keumars Afifi-Sabet is a writer and editor who specialises in the public sector, cyber security, and cloud computing. He first joined ITPro as a staff writer in April 2018 and eventually became its Features Editor. Though a regular contributor to other tech sites in the past, these days you will find Keumars on LiveScience, where he runs its Technology section.