This repository contains hands-on projects and resources for working with key Azure cloud data tools, including Databricks, Data Factory, and Synapse. It is organized into folders that follow the typical workflow for setting up, developing, and orchestrating data pipelines and dataflows in Azure.
- Azure setup/
- Contains instructions and resources for configuring your Azure environment. This includes setting up resource groups, storage accounts, and security prerequisites needed before building data solutions.
- Databricks/
- Resources, notebooks, and guides for working with Azure Databricks. Use this folder to find Databricks-specific setup steps, demonstration notebooks, and integration tips with other Azure tools.
- Azure dataflow/
- Video tutorials, sample dataflows, and documentation for building and managing dataflows within Azure Data Factory. This folder includes:
  - Azure dataflow 1 - Transformations Join Filter Sink_a.mp4: Video demonstration showing how to create, join, filter, and sink (write) data in a dataflow.
  - README.md: Additional documentation for the dataflow tutorials.
- Databricks/
- Exploratory data analysis (EDA) and querying of embedded data in JSON files using Spark SQL
1. Azure Setup
- Prepare your Azure environment by creating a resource group and storage account (a scripted sketch follows below).
- Set up permissions and authentication (e.g., via Azure Active Directory).
- Deploy Azure Data Factory and/or Databricks workspace as needed.
[Azure setup - creating Blob containers in Azure]
[Azure setup - creating the SQL database cost]
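The same setup can also be scripted. Below is a minimal sketch using the Azure SDK for Python (azure-identity, azure-mgmt-resource, azure-mgmt-storage) to create a resource group and a storage account; the subscription ID, resource names, and region are placeholders, not values taken from this repository.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.storage import StorageManagementClient

credential = DefaultAzureCredential()
subscription_id = "<your-subscription-id>"  # placeholder

# Create (or update) a resource group
resource_client = ResourceManagementClient(credential, subscription_id)
resource_client.resource_groups.create_or_update(
    "rg-data-demo", {"location": "westeurope"}
)

# Create a general-purpose v2 storage account (long-running operation)
storage_client = StorageManagementClient(credential, subscription_id)
poller = storage_client.storage_accounts.begin_create(
    "rg-data-demo",
    "stdatademo001",
    {
        "location": "westeurope",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
    },
)
account = poller.result()
print(account.name, account.provisioning_state)
```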
2. Creating a Dataflow in Azure Data Factory
- Start a new Dataflow: In Azure Data Factory, navigate to the Author tab and create a new Dataflow.
- Add Source(s): Define the datasets you want to ingest (e.g., CSVs from Blob Storage, SQL tables).
- Apply Transformations:
- Join: Combine multiple sources using a join transformation.
- Filter: Use filter transformations to remove unwanted data based on conditions.
- Other Transformations: Aggregate, derive columns, or perform lookups as necessary for your use case.
- Configure Sink: Define where the transformed data should be written (e.g., another Blob Storage container, SQL database).
- Debug and Preview: Use the debug mode to preview data at each step and validate your logic (a code sketch of the same join, filter, and sink logic follows below).
[Azure dataflow - Complete Dataflow]
[Azure dataflow - Creating Pipeline of the Dataflow]
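Mapping Data Flows are built visually, but they execute on Spark, so the join, filter, and sink steps above map directly onto code. The PySpark sketch below shows the equivalent logic; the file paths and column names (customer_id, order_total) are hypothetical and used only for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataflow-sketch").getOrCreate()

# Source(s): two hypothetical CSV datasets landed in a mounted Blob Storage container
orders = spark.read.option("header", True).csv("/mnt/raw/orders.csv")
customers = spark.read.option("header", True).csv("/mnt/raw/customers.csv")

# Join: combine the two sources on a shared key
joined = orders.join(customers, on="customer_id", how="inner")

# Filter: keep only rows that satisfy a condition
filtered = joined.filter(F.col("order_total").cast("double") > 100)

# Sink: write the transformed result to a curated location
filtered.write.mode("overwrite").parquet("/mnt/curated/orders_enriched")
```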
3. Creating a Data Pipeline
- Pipeline Creation: In Data Factory, create a new pipeline.
- Add Dataflow Activity: Drag your dataflow into the pipeline as an activity.
- Set up Triggers and Parameters: Schedule the pipeline or parameterize it for dynamic execution.
- Monitor & Manage: Use Data Factory monitoring tools to track pipeline runs, diagnose errors, and optimize performance.
[Azure pipeline - pipeline Container to SQL database]
[Azure pipeline - creating the SQL database connection in ADStudio]
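Pipelines can also be run and monitored programmatically. The sketch below uses the azure-mgmt-datafactory package to trigger a pipeline run with a parameter and poll its status; the resource group, factory, pipeline, and parameter names are placeholders, not the ones used in this repository.

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<your-subscription-id>")

# Trigger a run of the pipeline that wraps the dataflow activity
run = adf_client.pipelines.create_run(
    resource_group_name="rg-data-demo",
    factory_name="adf-data-demo",
    pipeline_name="pl_dataflow_demo",
    parameters={"inputContainer": "raw"},
)

# Poll the run status until it reaches a terminal state
while True:
    status = adf_client.pipeline_runs.get(
        "rg-data-demo", "adf-data-demo", run.run_id
    ).status
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(30)

print(f"Pipeline run {run.run_id} finished with status: {status}")
```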
4. Databricks
- Load Data & Register SQL View
- Inspect Schema & Row Counts
- Null Value Analysis
- Descriptive Stats & Percentiles
- Duplicates Check
- Visualizations (Histograms, Boxplots); a PySpark sketch of these EDA steps follows below
[Azure Databricks - EDA in JSON file on Apache Spark SQL]
[Azure Databricks - EDA in embedded JSON with Spark SQL]
[Azure Databricks - EDA subquery count Spark SQL]
[Azure Databricks - create new Compute Cluster]
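A condensed, notebook-style sketch of these EDA steps is shown below. It assumes a Databricks notebook where `spark` is already defined; the file path and column names (amount, customer.country) are hypothetical.

```python
from pyspark.sql import functions as F

# Load a hypothetical JSON file with nested (embedded) fields and register a SQL view
df = spark.read.option("multiLine", True).json("/mnt/raw/events.json")
df.createOrReplaceTempView("events")

# Inspect schema and row counts
df.printSchema()
print("rows:", df.count())

# Null value analysis: count of nulls per column
df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).show()

# Descriptive stats and percentiles (assumes a numeric column named 'amount')
df.describe().show()
print(df.approxQuantile("amount", [0.25, 0.5, 0.75], 0.01))

# Duplicates check
print("duplicates:", df.count() - df.dropDuplicates().count())

# Querying an embedded field ('customer.country') with Spark SQL
spark.sql("""
    SELECT customer.country AS country, COUNT(*) AS n
    FROM events
    GROUP BY customer.country
    ORDER BY n DESC
""").show()
```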
- Review the Azure setup/ folder to prepare your Azure environment.
- Watch the video(s) in Azure dataflow/ for practical demonstrations of building dataflows.
- Explore the Databricks/ folder for additional advanced analytics and processing scenarios.
- Combine these components to understand the extent of the data skills covered in this repository.
- This repository is intended for educational and demonstration purposes.
- Each folder includes readme files and/or videos to guide you through the specific steps and best practices for that tool or process.