- Azure Data Factory is a managed cloud service built for complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
- You can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores.
- Data Factory can process and transform the ingested data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.
- You can publish output data to data stores such as Azure SQL Data Warehouse.
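As a concrete starting point, here is a minimal sketch of creating a data factory with the `azure-mgmt-datafactory` Python SDK. The subscription ID, resource group, and factory name are placeholder values, and authenticating with DefaultAzureCredential assumes your environment is already set up for it (CLI login, managed identity, or service-principal variables).

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholder values -- substitute your own subscription, group, and names.
subscription_id = "<subscription-id>"
rg_name = "my-resource-group"
df_name = "my-data-factory"

# Build a management client; DefaultAzureCredential resolves credentials
# from the surrounding environment.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the data factory itself.
df = adf_client.factories.create_or_update(rg_name, df_name, Factory(location="eastus"))
print(df.provisioning_state)
```

The later sketches in this section reuse `adf_client`, `rg_name`, and `df_name` from here.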
Connect and Collect
- The first step in building an information production system is to connect to all the required sources of data and processing, such as SaaS services, databases, file shares, and FTP web services.
- The next step is to move the data as needed to a centralized location for subsequent processing.
- With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis (see the sketch after this list).
- For example, you can collect data in Azure Data Lake Store and transform the data later by using an Azure Data Lake Analytics compute service.
- You can also collect data in Azure Blob storage and transform it later by using an Azure HDInsight Hadoop cluster.
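To make the Copy Activity concrete, here is a minimal sketch using the same Python SDK. It assumes the `adf_client`, `rg_name`, and `df_name` from the earlier sketch, and that two Blob-storage datasets with the hypothetical names `BlobDatasetIn` and `BlobDatasetOut` have already been defined in the factory.

```python
from azure.mgmt.datafactory.models import (
    BlobSink, BlobSource, CopyActivity, DatasetReference, PipelineResource,
)

# References to datasets assumed to already exist in the factory;
# the names are placeholders for illustration.
ds_in = DatasetReference(type="DatasetReference", reference_name="BlobDatasetIn")
ds_out = DatasetReference(type="DatasetReference", reference_name="BlobDatasetOut")

# A Copy Activity that moves data from the source dataset to the sink dataset.
copy_activity = CopyActivity(
    name="CopyFromBlobToBlob",
    inputs=[ds_in],
    outputs=[ds_out],
    source=BlobSource(),
    sink=BlobSink(),
)

# Wrap the activity in a pipeline and deploy it to the factory.
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "CopyPipeline", pipeline)
```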
Transform and Enrich
- After data is present in a centralized data store in the cloud, process or transform the collected data by using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning.
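As one example of such a transformation step, the sketch below adds an HDInsight Hive activity that runs a Hive script against an existing cluster. The linked-service names and the script path are hypothetical, and it assumes the cluster and the storage account holding the script are already registered in the factory, along with the `adf_client` from the earlier sketch.

```python
from azure.mgmt.datafactory.models import (
    HDInsightHiveActivity, LinkedServiceReference, PipelineResource,
)

# Hypothetical linked services assumed to be registered already: one for the
# HDInsight cluster, one for the storage account that holds the Hive script.
hdi_ls = LinkedServiceReference(type="LinkedServiceReference", reference_name="MyHDInsightCluster")
storage_ls = LinkedServiceReference(type="LinkedServiceReference", reference_name="MyBlobStorage")

# Run a Hive script on the cluster to transform the collected data.
hive_activity = HDInsightHiveActivity(
    name="TransformWithHive",
    linked_service_name=hdi_ls,
    script_path="scripts/transform.hql",  # placeholder path in Blob storage
    script_linked_service=storage_ls,
)

pipeline = PipelineResource(activities=[hive_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "TransformPipeline", pipeline)
```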
Publish
- After the raw data has been refined into a business-ready consumable form, load the data into Azure SQL Data Warehouse, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users can point to from their business intelligence tools (as sketched below).
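Publishing can itself be expressed as a Copy Activity with a warehouse sink. A minimal sketch, assuming the `adf_client` from earlier and two pre-defined datasets with the hypothetical names `RefinedBlobDataset` and `SqlDwDataset` (the latter pointing at a SQL Data Warehouse table):

```python
from azure.mgmt.datafactory.models import (
    BlobSource, CopyActivity, DatasetReference, PipelineResource, SqlDWSink,
)

# Placeholder dataset names; both are assumed to exist in the factory already.
refined = DatasetReference(type="DatasetReference", reference_name="RefinedBlobDataset")
dw_table = DatasetReference(type="DatasetReference", reference_name="SqlDwDataset")

# Copy the refined data from Blob storage into a SQL Data Warehouse table.
publish_activity = CopyActivity(
    name="PublishToWarehouse",
    inputs=[refined],
    outputs=[dw_table],
    source=BlobSource(),
    sink=SqlDWSink(),
)

adf_client.pipelines.create_or_update(
    rg_name, df_name, "PublishPipeline", PipelineResource(activities=[publish_activity])
)
```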
Monitor
- After you have successfully built and deployed your data integration pipeline, providing business value from refined data, monitor the scheduled activities and pipelines for success and failure rates.
- Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, and Log Analytics.
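For programmatic monitoring, the same SDK exposes pipeline-run and activity-run queries. A minimal sketch, assuming the `adf_client` and the `CopyPipeline` from the earlier sketches:

```python
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import RunFilterParameters

# Trigger a run of the pipeline and capture its run ID.
run = adf_client.pipelines.create_run(rg_name, df_name, "CopyPipeline", parameters={})

# Check the run's overall status (Queued, InProgress, Succeeded, Failed, ...).
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
print(pipeline_run.status)

# Query the individual activity runs within a recent time window.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow() + timedelta(days=1),
)
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    rg_name, df_name, run.run_id, filters
)
for act in activity_runs.value:
    print(act.activity_name, act.status)
```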
Top-level Concepts
Pipeline
- A data factory might have one or more pipelines.
- A pipeline is a logical grouping of activities that performs a unit of work.
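To illustrate the grouping, the sketch below puts two activities into a single pipeline and chains them so the transformation only starts after the copy succeeds. It reuses the `copy_activity` and `hive_activity` objects from the earlier sketches; all names are placeholders.

```python
from azure.mgmt.datafactory.models import ActivityDependency, PipelineResource

# Make the Hive transformation depend on successful completion of the copy.
hive_activity.depends_on = [
    ActivityDependency(activity="CopyFromBlobToBlob", dependency_conditions=["Succeeded"])
]

# One pipeline grouping both activities into a single unit of work.
pipeline = PipelineResource(activities=[copy_activity, hive_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "IngestAndTransform", pipeline)
```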