Delta Lake is an open-source framework under the Linux Foundation used to build Lakehouse architectures. A related project is Delta Sharing, an open protocol for the secure real-time exchange of large datasets. Databricks provides production-grade implementations of the projects under delta-io, including Databricks Delta Sharing. Understanding the open-source foundation can provide a deeper understanding of the enterprise offering.
The Current State of Secure Data Sharing
Data sharing is the ability to make the same data available to one or many internal and/or external stakeholders. There are three basic approaches to data sharing: legacy and custom solutions, closed-source commercial applications, and cloud object storage.
Homegrown data-sharing solutions based on email, SFTP and APIs have the advantage of being vendor-agnostic and flexible. Many companies opted for a commercial solution rather than dedicating the resources to build in-house; this simplifies installation and maintenance and allows data sharing with others on the same platform. Cloud object storage offers strong scalability, availability and durability characteristics.
Data movement is an issue with both commercial and homegrown solutions, since multiple copies of the data add complexity around consistency and integrity. Scalability is also an issue for on-premises homegrown and commercial solutions. Cloud solutions can be complex to secure and govern, since assigning permissions and managing access is typically done through IAM policies. Both homegrown and cloud solutions impose ETL overhead on the consumer.
Delta Sharing
The Delta Sharing project seeks to address issues around data movement, scalability, security and governance complexity by defining an open protocol for secure real-time exchange of large datasets among internal and external stakeholders. Delta Sharing provides a protocol, a reference server and connectors.
The core of Delta Sharing is a REST protocol that provides secure access to datasets stored in Amazon S3, Azure Data Lake Storage (ADLS) or Google Cloud Storage (GCS). The protocol specification defines the concepts of a share, a schema, a table, a recipient and a sharing server. A share is a logical grouping of one or more schemas that can be shared with one or more recipients. A schema is a logical grouping of tables. A table is a Delta Lake table or view. A recipient is a principal that has access to shared tables.
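To make these concepts concrete, the sketch below walks the share/schema/table hierarchy with plain REST calls. The endpoint URL, bearer token and the `items` response field follow the published protocol specification, but the server address and token here are placeholders; treat the spec as the authority on exact request and response shapes.

```python
import requests

# Placeholder values: substitute your sharing server's endpoint and the
# bearer token the data provider issued to you (the recipient).
ENDPOINT = "https://sharing.example.com/delta-sharing"
HEADERS = {"Authorization": "Bearer <recipient-bearer-token>"}

# List the shares this recipient can see (GET /shares).
shares = requests.get(f"{ENDPOINT}/shares", headers=HEADERS).json()

for share in shares.get("items", []):
    share_name = share["name"]
    # List schemas in a share (GET /shares/{share}/schemas) ...
    schemas = requests.get(
        f"{ENDPOINT}/shares/{share_name}/schemas", headers=HEADERS
    ).json()
    for schema in schemas.get("items", []):
        # ... and tables in a schema
        # (GET /shares/{share}/schemas/{schema}/tables).
        tables = requests.get(
            f"{ENDPOINT}/shares/{share_name}/schemas/{schema['name']}/tables",
            headers=HEADERS,
        ).json()
        for table in tables.get("items", []):
            print(share_name, schema["name"], table["name"])
```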
A Delta Sharing Reference Server is a server that implements the Delta Sharing Protocol. Databricks offers a managed Delta Sharing service, but the server can also be run from pre-built packages or pre-built Docker images. This reference implementation can be deployed on any major cloud provider to share existing tables in Delta Lake or Apache Parquet format.
There are connectors for both Python and Apache Spark. The Delta Sharing Python Connector loads shared tables as pandas DataFrames, while the Apache Spark Connector loads shared tables through SQL, Python (as PySpark), Java, Scala and R.
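For example, the Python connector works from a small JSON profile file distributed by the data provider, containing the sharing endpoint and a bearer token. A minimal sketch, assuming the `delta-sharing` PyPI package is installed; the profile path and the share, schema and table names are placeholders:

```python
import delta_sharing

# Profile file distributed by the data provider: a small JSON document
# holding the sharing endpoint and the recipient's bearer token.
profile = "/path/to/provider.share"  # placeholder path

# Discover what the provider has shared with you.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Table URL format: <profile-path>#<share>.<schema>.<table>
table_url = f"{profile}#my_share.my_schema.my_table"  # placeholder names

# Load the shared table as a pandas DataFrame.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())

# With PySpark and the Spark connector available, the same table can be
# loaded as a Spark DataFrame instead:
# spark_df = delta_sharing.load_as_spark(table_url)
```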
Next Steps
The Delta Sharing Reference Server is a reference implementation of the Delta Sharing Protocol. Note that this server is not a complete implementation of a secure web server and should sit behind a secure proxy if it will be exposed to the public. Use the pre-built packages to install the server on the cloud provider of your choice. There are specific configuration steps to give the server access to tables on each cloud storage service, as sketched below. For simplicity's sake, you can use Parquet tables for initial use cases.
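To make the configuration step concrete, the sketch below writes a minimal server config file from Python. The share, schema and table names, the S3 location and the port are all placeholders, and the key names follow the config format documented in the delta-sharing repository, which should be treated as the authority on supported options:

```python
from pathlib import Path

# A minimal delta-sharing-server-config.yaml. All names, the S3 location
# and the port are placeholders; consult the delta-sharing repository for
# the full set of supported options (e.g. authorization, pagination).
SERVER_CONFIG = """\
version: 1
shares:
- name: "my_share"
  schemas:
  - name: "my_schema"
    tables:
    - name: "my_table"
      # Parquet or Delta Lake table; cloud credentials are supplied via
      # the usual cloud-provider mechanisms, not in this file.
      location: "s3a://<bucket>/<prefix>/my_table"
host: "localhost"
port: 8080
endpoint: "/delta-sharing"
"""

Path("delta-sharing-server-config.yaml").write_text(SERVER_CONFIG)
```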
This implementation is best suited to a proof of concept, but it will allow your organisation to explore possible use cases for Delta Sharing, such as line-of-business (LOB) and B2B sharing and data monetization, with a minimal investment. Commercial offerings like Databricks add further optimizations, and knowing how to implement the solution from scratch can help when making architectural decisions.