- Azure Databricks is an Apache Spark-based analytics platform with streamlined workflows and an interactive workspace.
- It enables collaboration between data scientists, data engineers, and business analysts.

- For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Kafka, Event Hubs, or IoT Hub.
- This data lands in a data lake for long-term persistent storage, in Azure Blob Storage or Azure Data Lake Storage.
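
As a rough illustration of the last step, the sketch below shows how data that has landed in the lake might be read back into Spark. It assumes Azure Data Lake Storage Gen2 (addressed through `abfss://` URIs) and JSON files; the container, account, and path names are hypothetical placeholders, and `adls_uri` and `load_raw_events` are helper names invented for this example.

```python
def adls_uri(container: str, account: str, path: str) -> str:
    """Build an abfss:// URI for a path in Azure Data Lake Storage Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def load_raw_events(spark, container="raw", account="examplelake", path="events/"):
    """Batch-read JSON files that an ingestion job has landed in the lake.

    `spark` is an existing SparkSession (one is predefined in Databricks
    notebooks). The default names are hypothetical, and access to the storage
    account is assumed to be configured on the cluster already.
    """
    return spark.read.json(adls_uri(container, account, path))
```

In a notebook, `load_raw_events(spark)` would return a DataFrame over the landed files, provided the placeholder names are replaced with real ones.
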
Apache Spark-based Analytics Platform
- Azure Databricks includes the full set of open-source Apache Spark cluster technologies and capabilities.
- Spark in Azure Databricks includes the following components:

- Spark SQL and DataFrames: Spark SQL is the Spark module for working with structured data. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
- Streaming: Real-time data processing and analysis for analytical and interactive applications. Integrates with HDFS, Flume, and Kafka.
- MLlib: Machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
- GraphX: Graphs and graph computation for a broad scope of use cases from cognitive analytics to data exploration.
- Spark Core API: Includes support for R, SQL, Python, Scala, and Java.
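
To make the Spark SQL and DataFrames component more concrete, here is a minimal PySpark sketch that builds a DataFrame with named columns and queries it with SQL, much like a relational table. The sample rows, the `people` view name, and the `average_age_by_team` function are made up for illustration.

```python
# Hypothetical sample data; names and values are illustrative only.
ROWS = [
    ("alice", "engineering", 34),
    ("bob", "analytics", 45),
    ("carol", "engineering", 29),
]
COLUMNS = ["name", "team", "age"]
QUERY = "SELECT team, AVG(age) AS avg_age FROM people GROUP BY team ORDER BY team"


def average_age_by_team():
    """Aggregate the sample rows with Spark SQL and return plain Python tuples."""
    # Imported here so this module can be loaded without a Spark installation.
    from pyspark.sql import SparkSession

    # In a Databricks notebook a session named `spark` already exists;
    # getOrCreate() reuses it, and creates a local session elsewhere.
    spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

    # A DataFrame: a distributed collection of rows organized into named columns.
    people = spark.createDataFrame(ROWS, COLUMNS)

    # Register the DataFrame as a temporary view so SQL can refer to it by name.
    people.createOrReplaceTempView("people")

    result = spark.sql(QUERY)
    return [(row["team"], row["avg_age"]) for row in result.collect()]
```

The same aggregation could be written with the DataFrame API instead, for example `people.groupBy("team").avg("age")`; the SQL form is used here only because it mirrors the relational-table analogy.
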
Apache Spark in Azure Databricks
Azure Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that includes:
- Fully managed Spark clusters
- An interactive workspace for exploration and visualization
- A platform for powering your favorite Spark-based applications
Databricks Runtime