Version 2.0 of the Delta Lake project is completely open source, the development team announced at the Data+AI Summit 2022.
Databricks handed its Delta Lake project over to the Linux Foundation three years ago. At this year's Data+AI Summit conference, held at the end of June 2022 in San Francisco and virtually online, the developers behind Delta Lake presented version 2.0 of the software and announced that with this release they are offering the entire project as open source.
Delta Lake – a project with many contributors and companies
In a comprehensive blog post, the development team highlights the possibilities of Delta Lake: companies should be able to build data lakehouses that enable data warehousing and machine learning directly on Delta Lake. The developers describe the project's growth over the past three years as positive: according to them, the project currently has more than 190 contributors from over 70 organizations, around two thirds of whom do not work for Databricks. By the Linux Foundation's definition, a contributor is anyone connected to the project through code activity (commits/PRs/changes) or by helping to find and fix bugs.
Data Sharing
A major benefit of Delta Lake, according to the development team, is the Delta Sharing feature (launched back in 2021), which makes it easier to share data as well as to read shared data from other Delta tables. It introduced an open protocol for the real-time exchange of large amounts of data, intended to enable secure sharing of data across product boundaries. Data consumers can connect directly to the shared data via pandas, Tableau, Presto, Trino, or dozens of other systems. The only prerequisite is that the tools implement the protocol; no proprietary systems are required – including Databricks' own.
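The open-source delta-sharing Python client identifies a shared table with a URL of the form `<profile-file>#<share>.<schema>.<table>`, where the profile file holds the server endpoint and credentials. A minimal stdlib sketch of composing and parsing such a URL (the share, schema, and table names here are invented for illustration):

```python
def sharing_table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Compose a delta-sharing table URL: <profile-file>#<share>.<schema>.<table>."""
    return f"{profile_path}#{share}.{schema}.{table}"

def parse_sharing_table_url(url: str) -> tuple[str, str, str, str]:
    """Split a table URL back into profile path, share, schema, and table."""
    profile, _, coords = url.partition("#")
    share, schema, table = coords.split(".")
    return profile, share, schema, table

# Hypothetical share coordinates for illustration.
url = sharing_table_url("config.share", "sales", "emea", "orders")
# With the real client, this URL would be passed to
# delta_sharing.load_as_pandas(url) to fetch the table as a pandas DataFrame.
```

The point of the format is that any tool implementing the protocol can resolve such a URL against the sharing server, with no proprietary client required.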
The software also has a rich ecosystem of direct connectors such as Flink, Presto, and Trino, which allow Delta Lake tables to be read and written directly by the most popular engines without Apache Spark. Thanks to contributions from employees of Scribd and Back Market, programmers can also use Delta Rust – a foundational Delta Lake library in the Rust programming language that allows Python, Rust, and Ruby developers to read and write Delta tables without a big data framework.
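What makes such Spark-free connectors possible is that a Delta table is just Parquet data files plus a `_delta_log` directory of JSON commit files, named by zero-padded version number. A simplified stdlib sketch of that naming convention and a pared-down commit action (real commits carry more fields, such as statistics and partition values):

```python
import json

def commit_filename(version: int) -> str:
    # Delta commit files use 20-digit, zero-padded version numbers,
    # e.g. _delta_log/00000000000000000003.json for the fourth commit.
    return f"{version:020d}.json"

# A simplified "add file" action as it might appear on one line of a commit.
add_action = {
    "add": {
        "path": "part-00000.snappy.parquet",  # hypothetical data file name
        "size": 1024,
        "dataChange": True,
    }
}
line = json.dumps(add_action)  # commits are newline-delimited JSON actions
```

Any library that can list files and parse JSON – in Rust, Python, or Ruby – can replay this log to reconstruct the table's current state, which is exactly what Delta Rust does.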
With the 2.0 release comes the promise that all Delta Lake APIs will be made available as open source. According to the development team, this applies in particular to the performance optimizations and features of the Delta Engine such as Z-Order, Change Data Feed, Dynamic Partition Overwrites, and Dropped Columns.
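To illustrate one of these features: Z-Order clustering sorts rows so that records with nearby values in several columns end up in the same files, letting queries skip irrelevant files. The underlying idea is bit interleaving (a Morton code); the sketch below shows the concept for two integer columns and is not Databricks' actual implementation:

```python
def z_order_key(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of two column values into one Morton code.

    Sorting rows by this key keeps records with nearby (x, y) values
    close together on disk, which is the idea behind Z-Order clustering.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x contributes the even bits
        key |= ((y >> i) & 1) << (2 * i + 1)  # y contributes the odd bits
    return key
```

In Delta Lake itself the feature is exposed declaratively, via an `OPTIMIZE ... ZORDER BY (...)` statement, rather than by computing keys by hand.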