Data Cleaning in Databricks

The document outlines various data cleaning techniques in Databricks, including removing duplicates, filtering rows, filling null values, trimming strings, type casting, renaming columns, dropping columns, splitting columns, and merging columns. Each technique is illustrated with code examples and the resulting DataFrame changes. The document is authored by Shwetank Singh from GritSetGrow - GSGLearn.com.

Uploaded by

ram prashanth

Removing Duplicate Rows

id  first_name  last_name  email              gender  ip_address
1   Karita      Sendley    [email protected]  Female  114.8.222.223
2   Forbes      Wardel     [email protected]  Male    224.93.24.171
2   Forbes      Wardel     [email protected]  Male    224.93.24.171

df = df.dropDuplicates()

id  first_name  last_name  email              gender  ip_address
1   Karita      Sendley    [email protected]  Female  114.8.222.223
2   Forbes      Wardel     [email protected]  Male    224.93.24.171

Shwetank Singh
GritSetGrow - GSGLearn.com
Filtering Rows

id  first_name  last_name  email              gender  ip_address
1   Karita      Sendley    [email protected]  Female  114.8.222.223
2   Forbes      Wardel     [email protected]  Male    224.93.24.171
3   Nesta       Beamond    [email protected]  Female  124.97.188.174

df = df.filter(df.id > 2)

id  first_name  last_name  email              gender  ip_address
3   Nesta       Beamond    [email protected]  Female  124.97.188.174

Filling or Replacing Null Values

id  first_name  last_name  email              gender  ip_address
1   Karita      Sendley    [email protected]  Female  114.8.222.223
2   Forbes      Wardel     NULL               Male    224.93.24.171
3   Nesta       Beamond    [email protected]  Female  124.97.188.174

df = df.na.fill(value="unknown", subset=["email"])

id  first_name  last_name  email              gender  ip_address
1   Karita      Sendley    [email protected]  Female  114.8.222.223
2   Forbes      Wardel     unknown            Male    224.93.24.171
3   Nesta       Beamond    [email protected]  Female  124.97.188.174

Trimming Strings

(___ marks extra spaces around a value.)

id  first_name  last_name  email              gender  ip_address
1   Karita      Sendley    [email protected]  Female  114.8.222.223
2   Forbes ___  Wardel     [email protected]  Male    224.93.24.171
3   ___Nesta    Beamond    [email protected]  Female  124.97.188.174

from pyspark.sql.functions import trim

df = df.withColumn("first_name", trim(df.first_name))

id  first_name  last_name  email              gender  ip_address
1   Karita      Sendley    [email protected]  Female  114.8.222.223
2   Forbes      Wardel     [email protected]  Male    224.93.24.171
3   Nesta       Beamond    [email protected]  Female  124.97.188.174

Type Casting

id  first_name  last_name  email              gender  age
1   Karita      Sendley    [email protected]  Female  35
2   Forbes      Wardel     [email protected]  Male    45
3   Nesta       Beamond    [email protected]  Female  23

df = df.withColumn("age", df["age"].cast("integer"))

id  first_name  last_name  email              gender  age
1   Karita      Sendley    [email protected]  Female  35
2   Forbes      Wardel     [email protected]  Male    45
3   Nesta       Beamond    [email protected]  Female  23

The displayed values do not change; the cast changes the data type of the age column to integer.

Renaming Columns

id  first_name  last_name  email              gender  age
1   Karita      Sendley    [email protected]  Female  35
2   Forbes      Wardel     [email protected]  Male    45
3   Nesta       Beamond    [email protected]  Female  23

df = df.withColumnRenamed("id", "cust_id")

cust_id  first_name  last_name  email              gender  age
1        Karita      Sendley    [email protected]  Female  35
2        Forbes      Wardel     [email protected]  Male    45
3        Nesta       Beamond    [email protected]  Female  23

Dropping Columns

id  first_name  last_name  email              gender  age
1   Karita      Sendley    [email protected]  Female  35
2   Forbes      Wardel     [email protected]  Male    45
3   Nesta       Beamond    [email protected]  Female  23

df = df.drop("age")

id  first_name  last_name  email              gender
1   Karita      Sendley    [email protected]  Female
2   Forbes      Wardel     [email protected]  Male
3   Nesta       Beamond    [email protected]  Female

Splitting Columns

id  full_name       email              gender
1   Karita Sendley  [email protected]  Female
2   Forbes Wardel   [email protected]  Male
3   Nesta Beamond   [email protected]  Female

from pyspark.sql.functions import split

# split() returns an array column; select its elements by index
name_parts = split(df["full_name"], " ")
df = df.select("id", name_parts.getItem(0).alias("first_name"),
               name_parts.getItem(1).alias("last_name"), "email", "gender")

id  first_name  last_name  email              gender
1   Karita      Sendley    [email protected]  Female
2   Forbes      Wardel     [email protected]  Male
3   Nesta       Beamond    [email protected]  Female

Merging Columns

id  first_name  last_name  email              gender  age
1   Karita      Sendley    [email protected]  Female  35
2   Forbes      Wardel     [email protected]  Male    45
3   Nesta       Beamond    [email protected]  Female  23

from pyspark.sql.functions import concat_ws

df = (df.withColumn("full_name", concat_ws(" ", df["first_name"], df["last_name"]))
        .drop("first_name", "last_name"))

id  email              gender  age  full_name
1   [email protected]  Female  35   Karita Sendley
2   [email protected]  Male    45   Forbes Wardel
3   [email protected]  Female  23   Nesta Beamond
