This repository contains hands-on projects and resources for working with key Azure cloud data tools, including Databricks, Data Factory, and Synapse. It is organized into folders that follow the typical workflow for setting up, developing, and orchestrating data pipelines and dataflows in Azure.
- Azure setup/
- Contains instructions and resources for configuring your Azure environment. This includes setting up resource groups, storage accounts, and security prerequisites needed before building data solutions.
- Databricks/
- Resources, notebooks, and guides for working with Azure Databricks. Use this folder to find Databricks-specific setup steps, demonstration notebooks, and integration tips with other Azure tools.
- Azure dataflow/
- Video tutorials, sample dataflows, and documentation for building and managing dataflows within Azure Data Factory. This folder includes:
  - Azure dataflow 1 - Transformations Join Filter Sink_a.mp4: Video demonstration showing how to create, join, filter, and sink (write) data in a dataflow.
  - README.md: Additional documentation for the dataflow tutorials.
- Databricks/
- Exploratory data analysis (EDA) and querying of embedded data in JSON files using Spark SQL
1. Azure Setup
- Prepare your Azure environment by creating a resource group and storage account (a scripted sketch follows below).
- Set up permissions and authentication (e.g., via Azure Active Directory).
- Deploy Azure Data Factory and/or Databricks workspace as needed.
[Azure setup - creating Blob containers in Azure]
[Azure setup - creating the SQL database cost]
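The same setup can also be scripted. Below is a minimal sketch using the Azure SDK for Python (azure-identity, azure-mgmt-resource, azure-mgmt-storage) to create a resource group and a storage account; the subscription ID, resource names, and region are placeholders, not values taken from this repository.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.storage import StorageManagementClient

credential = DefaultAzureCredential()
subscription_id = "<your-subscription-id>"  # placeholder

# Create (or update) a resource group
resource_client = ResourceManagementClient(credential, subscription_id)
resource_client.resource_groups.create_or_update(
    "rg-data-demo", {"location": "westeurope"}
)

# Create a general-purpose v2 storage account (long-running operation)
storage_client = StorageManagementClient(credential, subscription_id)
poller = storage_client.storage_accounts.begin_create(
    "rg-data-demo",
    "stdatademo001",
    {
        "location": "westeurope",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
    },
)
account = poller.result()
print(account.name, account.provisioning_state)
```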
2. Creating a Dataflow in Azure Data Factory
- Start a new Dataflow: In Azure Data Factory, navigate to the Author tab and create a new Dataflow.
- Add Source(s): Define the datasets you want to ingest (e.g., CSVs from Blob Storage, SQL tables).
- Apply Transformations:
- Join: Combine multiple sources using a join transformation.
- Filter: Use filter transformations to remove unwanted data based on conditions.
- Other Transformations: Aggregate, derive columns, or perform lookups as necessary for your use case.
- Configure Sink: Define where the transformed data should be written (e.g., another Blob Storage container, SQL database).
- Debug and Preview: Use the debug mode to preview data at each step and validate your logic (a code sketch of the same join, filter, and sink logic follows below).
[Azure dataflow - Complete Dataflow]
[Azure dataflow - Creating Pipeline of the Dataflow]
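Mapping Data Flows are built visually, but they execute on Spark, so the join, filter, and sink steps above map directly onto code. The PySpark sketch below shows the equivalent logic; the file paths and column names (customer_id, order_total) are hypothetical and used only for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataflow-sketch").getOrCreate()

# Source(s): two hypothetical CSV datasets landed in a mounted Blob Storage container
orders = spark.read.option("header", True).csv("/mnt/raw/orders.csv")
customers = spark.read.option("header", True).csv("/mnt/raw/customers.csv")

# Join: combine the two sources on a shared key
joined = orders.join(customers, on="customer_id", how="inner")

# Filter: keep only rows that satisfy a condition
filtered = joined.filter(F.col("order_total").cast("double") > 100)

# Sink: write the transformed result to a curated location
filtered.write.mode("overwrite").parquet("/mnt/curated/orders_enriched")
```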
3. Creating a Data Pipeline
- Pipeline Creation: In Data Factory, create a new pipeline.
- Add Dataflow Activity: Drag your dataflow into the pipeline as an activity.
- Set up Triggers and Parameters: Schedule the pipeline or parameterize it for dynamic execution.
- Monitor & Manage: Use Data Factory monitoring tools to track pipeline runs, diagnose errors, and optimize performance.
[Azure pipeline - pipeline Container to SQL database]
[Azure pipeline - creating the SQL database connection in ADStudio]
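Pipelines can also be run and monitored programmatically. The sketch below uses the azure-mgmt-datafactory package to trigger a pipeline run with a parameter and poll its status; the resource group, factory, pipeline, and parameter names are placeholders, not the ones used in this repository.

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<your-subscription-id>")

# Trigger a run of the pipeline that wraps the dataflow activity
run = adf_client.pipelines.create_run(
    resource_group_name="rg-data-demo",
    factory_name="adf-data-demo",
    pipeline_name="pl_dataflow_demo",
    parameters={"inputContainer": "raw"},
)

# Poll the run status until it reaches a terminal state
while True:
    status = adf_client.pipeline_runs.get(
        "rg-data-demo", "adf-data-demo", run.run_id
    ).status
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(30)

print(f"Pipeline run {run.run_id} finished with status: {status}")
```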
4. Databricks
- Load Data & Register SQL View
- Inspect Schema & Row Counts
- Null Value Analysis
- Descriptive Stats & Percentiles
- Duplicates Check
- Visualizations (Histograms, Boxplots); a PySpark sketch of these EDA steps follows below
[Azure Databricks - EDA in JSON file on Apache Spark SQL]
[Azure Databricks - EDA in embedded JSON with Spark SQL]
[Azure Databricks - EDA subquery count Spark SQL]
[Azure Databricks - create new Compute Cluster]
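A condensed, notebook-style sketch of these EDA steps is shown below. It assumes a Databricks notebook where `spark` is already defined; the file path and column names (amount, customer.country) are hypothetical.

```python
from pyspark.sql import functions as F

# Load a hypothetical JSON file with nested (embedded) fields and register a SQL view
df = spark.read.option("multiLine", True).json("/mnt/raw/events.json")
df.createOrReplaceTempView("events")

# Inspect schema and row counts
df.printSchema()
print("rows:", df.count())

# Null value analysis: count of nulls per column
df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).show()

# Descriptive stats and percentiles (assumes a numeric column named 'amount')
df.describe().show()
print(df.approxQuantile("amount", [0.25, 0.5, 0.75], 0.01))

# Duplicates check
print("duplicates:", df.count() - df.dropDuplicates().count())

# Querying an embedded field ('customer.country') with Spark SQL
spark.sql("""
    SELECT customer.country AS country, COUNT(*) AS n
    FROM events
    GROUP BY customer.country
    ORDER BY n DESC
""").show()
```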
- Review the Azure setup/ folder to prepare your Azure environment.
- Watch the video(s) in Azure dataflow/ for practical demonstrations of building dataflows.
- Explore the Databricks/ folder for additional advanced analytics and processing scenarios.
- Combine these components to understand the extent of the data skills covered in this repository.
- This repository is intended for educational and demonstration purposes.
- Each folder includes readme files and/or videos to guide you through the specific steps and best practices for that tool or process.