Integrating SAP and Databricks has typically required a lot of glue: set up the SAP Data Hub environment, connect to the SAP data, build a pipeline with the Pipeline Modeler, configure the Streaming Analytics Service, set up Kafka or MQTT, and receive the streaming data in Databricks with Spark Streaming. Most of these intermediate steps required custom code. And these are just the technical aspects of replicating data from point A to point B, skipping over governance, MDM, CDC, and all the other details around duplicate data. Meanwhile, your business sponsors are waiting not-so-patiently for any results. With the release of SAP Datasphere, there is an opportunity to enable a business data fabric by unifying the two platforms: SAP Datasphere delivers bi-directional integration with the Databricks Lakehouse.

Why Does SAP Need a Lakehouse?

A data lakehouse architecture combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses. Data lakehouses enable business intelligence (BI) and machine learning (ML) on all data. Lakehouses typically follow a medallion architecture, with a bronze layer for raw data, a silver layer for deduplicated and validated data, and a gold layer of refined and aggregated data. Data scientists typically work in the silver layer, while BI tools typically access the gold layer. Databricks is the current leader in the lakehouse space.
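To make the layers concrete, here is a minimal PySpark sketch of a medallion flow on Databricks. The landing path, the bronze/silver/gold schemas, and the table and column names are illustrative assumptions, and spark is the session a Databricks notebook provides:

```python
# A minimal medallion sketch (PySpark + Delta on Databricks).
# Assumes the bronze/silver/gold schemas exist; all names are illustrative.
from pyspark.sql import functions as F

# Bronze: land raw data as-is, stamped with an ingestion time.
bronze = (spark.read.json("/mnt/landing/events/")  # hypothetical landing path
          .withColumn("_ingested_at", F.current_timestamp()))
bronze.write.format("delta").mode("append").saveAsTable("bronze.raw_events")

# Silver: deduplicate and validate the raw records.
silver = (spark.table("bronze.raw_events")
          .dropDuplicates(["event_id"])
          .filter(F.col("event_ts").isNotNull()))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.events")

# Gold: refine and aggregate for BI consumption.
gold = (spark.table("silver.events")
        .groupBy("country", F.to_date("event_ts").alias("event_date"))
        .agg(F.count("*").alias("event_count")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_events")
```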

SAP does offer a cloud-based data lake that can be compared to the Databricks offering. SAP HANA Cloud is a column-oriented, multi-model, and scalable in-memory database. Its JSON document store offers the possibility of processing semi-structured data, but there are few mechanisms for processing unstructured data. A Data Lake component augments SAP HANA Cloud: it includes a relational engine based on SAP IQ and a hyperscaler-based file container called Data Lake Files. There are still gaps between what an SAP data lake architecture can provide and what could be possible if SAP integrated with other tools.

SAP users would certainly benefit from direct data access across SAP and non-SAP sources and direct connectivity to BI tools, while SAP architects would be interested in support for multi-cloud environments and access to open-source and commercial tools, all enabled by a simplified security model. A simplified data governance system coupled with no data redundancy makes for a powerful offering. SAP seems to have learned from its Hadoop past and is choosing to partner with industry leaders in areas outside of its core business expertise. SAP Datasphere is optimized for SAP and focuses on SAP's strong business perspective. A data fabric approach removes the requirement that all data be integrated into SAP itself for processing; instead, it relies on orchestrating enterprise data independently of which system the data is stored in.

Connect; Don’t Migrate

Connecting SAP Datasphere and Databricks involves a JDBC connection. In Databricks, we're going to build a Delta Live Tables pipeline whose output tables SAP Datasphere can query. Go to Data Science and Engineering, select Compute, open an all-purpose compute resource, and select Advanced options to find the JDBC connection details. You will also need to create a personal access token from User Settings. Within SAP Datasphere, install the Data Provisioning (DP) Agent on a virtual machine and copy the Databricks JDBC driver jar into the agent's /camel/lib directory. Now you can navigate to your space in SAP Datasphere and create a connection to Databricks using a Generic JDBC connector, supplying the Databricks JDBC URL gathered earlier and the personal access token.
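As a sketch of the Databricks side, a Delta Live Tables pipeline can publish a curated table for SAP Datasphere to query over JDBC. The upstream table silver.sales and its columns are hypothetical, and the code is assumed to run inside a DLT pipeline in the workspace:

```python
# A minimal Delta Live Tables sketch; silver.sales and its columns are
# hypothetical, and this is assumed to run inside a DLT pipeline.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Curated sales totals exposed to SAP Datasphere over JDBC")
def gold_sales():
    return (spark.table("silver.sales")
            .groupBy("region")
            .agg(F.sum("amount").alias("total_amount")))
```

On the SAP Datasphere side, the Generic JDBC connection needs the Databricks JDBC URL. With recent versions of the Databricks JDBC driver (class com.databricks.client.jdbc.Driver), the URL typically takes a form like the one below, where the placeholders come from the cluster's Advanced options and the personal access token is passed as the password; the exact parameters depend on the driver version:

```
jdbc:databricks://<server-hostname>:443;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>
```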

Conclusion

The SAP Datasphere platform is an upgrade to SAP's Data Warehouse Cloud, most notably characterized by its openness in partnering with industry leaders. Collibra's data governance tools, Confluent's streaming data platform, DataRobot's machine learning capabilities, and Databricks' Lakehouse represent the first strategic partnerships. In this post, we briefly explored the significant shift in data intelligence capabilities that becomes available simply by removing the technical burden of migrating data into SAP for analytics and instead performing the analytics in place.