0% found this document useful (0 votes)
84 views9 pages

Data Cleaning in Databricks

The document outlines various data cleaning techniques in Databricks, including removing duplicates, filtering rows, filling null values, trimming strings, type casting, renaming columns, dropping columns, splitting columns, and merging columns. Each technique is illustrated with code examples and the resulting DataFrame changes. The document is authored by Shwetank Singh from GritSetGrow - GSGLearn.com.

Uploaded by

ram prashanth
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
84 views9 pages

Data Cleaning in Databricks

The document outlines various data cleaning techniques in Databricks, including removing duplicates, filtering rows, filling null values, trimming strings, type casting, renaming columns, dropping columns, splitting columns, and merging columns. Each technique is illustrated with code examples and the resulting DataFrame changes. The document is authored by Shwetank Singh from GritSetGrow - GSGLearn.com.

Uploaded by

ram prashanth
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 9

Data Cleaning in Databricks

Removing Duplicate Rows


id first_name last_name email gender ip_address

1 Karita Sendley [email protected] Female 114.8.222.223

2 Forbes Wardel [email protected] Male 224.93.24.171

2 Forbes Wardel [email protected] Male 224.93.24.171

df = df.dropDuplicates()

id first_name last_name email gender ip_address

1 Karita Sendley [email protected] Female 114.8.222.223

2 Forbes Wardel [email protected] Male 224.93.24.171

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Cleaning in Databricks
Filtering Rows
id first_name last_name email gender ip_address

1 Karita Sendley [email protected] Female 114.8.222.223

2 Forbes Wardel [email protected] Male 224.93.24.171

3 Nesta Beamond [email protected] Female 124.97.188.174

df = df.filter(df.id> 2)

id first_name last_name email gender ip_address

1 Karita Sendley [email protected] Female 114.8.222.223

2 Forbes Wardel [email protected] Male 224.93.24.171

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Cleaning in Databricks
Filling or Replacing Null Values
id first_name last_name email gender ip_address

1 Karita Sendley [email protected] Female 114.8.222.223

2 Forbes Wardel NULL Male 224.93.24.171

3 Nesta Beamond [email protected] Female 124.97.188.174

df = df.na.fill(value="unknown", subset=
["email"])
id first_name last_name email gender ip_address

1 Karita Sendley [email protected] Female 114.8.222.223

2 Forbes Wardel unknown Male 224.93.24.171

3 Nesta Beamond [email protected] Female 124.97.188.174

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Cleaning in Databricks
Trimming Strings
id first_name last_name email gender ip_address

1 Karita Sendley [email protected] Female 114.8.222.223

2 Forbes ___ Wardel [email protected] Male 224.93.24.171

3 ___Nesta Beamond [email protected] Female 124.97.188.174

from pyspark.sql.functions import trim


df = df.withColumn("first_name", trim(df.first_name))

id first_name last_name email gender ip_address

1 Karita Sendley [email protected] Female 114.8.222.223

2 Forbes Wardel [email protected] Male 224.93.24.171

3 Nesta Beamond [email protected] Female 124.97.188.174

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Cleaning in Databricks
Type Casting
id first_name last_name email gender age

1 Karita Sendley [email protected] Female 35

2 Forbes Wardel [email protected] Male 45

3 Nesta Beamond [email protected] Female 23

df = df.withColumn("age", df["age"].cast("integer"))

id first_name last_name email gender age

1 Karita Sendley [email protected] Female 35

2 Forbes Wardel [email protected] Male 45

3 Nesta Beamond [email protected] Female 23

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Cleaning in Databricks
Renaming Columns
id first_name last_name email gender age

1 Karita Sendley [email protected] Female 35

2 Forbes Wardel [email protected] Male 45

3 Nesta Beamond [email protected] Female 23

df = df.withColumnRenamed("id", "cust_id")

cust_id first_name last_name email gender age

1 Karita Sendley [email protected] Female 35

2 Forbes Wardel [email protected] Male 45

3 Nesta Beamond [email protected] Female 23

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Cleaning in Databricks
Dropping Columns
id first_name last_name email gender age

1 Karita Sendley [email protected] Female 35

2 Forbes Wardel [email protected] Male 45

3 Nesta Beamond [email protected] Female 23

df = df.drop("age")

cust_id first_name last_name email gender

1 Karita Sendley [email protected] Female

2 Forbes Wardel [email protected] Male

3 Nesta Beamond [email protected] Female

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Cleaning in Databricks
Splitting Columns
id full_name email gender

1 Karita Sendley [email protected] Female

2 Forbes Wardel [email protected] Male

3 Nesta Beamond [email protected] Female

from pyspark.sql.functions import split


df = df.withColumn("full_name", split(df["full_name"],
" ")).select("full_name.*")

cust_id full_name[0] full_name[1] email gender

1 Karita Sendley [email protected] Female

2 Forbes Wardel [email protected] Male

3 Nesta Beamond [email protected] Female

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Cleaning in Databricks
Merging Columns
id first_name last_name email gender age

1 Karita Sendley [email protected] Female 35

2 Forbes Wardel [email protected] Male 45

3 Nesta Beamond [email protected] Female 23

from pyspark.sql.functions import concat_ws


df = df.withColumn("full_name",
concat_ws(" ", df["first_name"], df["last_name"]))

id full_name email gender age

1 Karita Sendley [email protected] Female 35

2 Forbes Wardel [email protected] Male 45

3 Nesta Beamond [email protected] Female 23

Shwetank Singh
GritSetGrow - GSGLearn.com

You might also like