PySpark DataFrames in Databricks

Below is a concise reference guide for working with PySpark DataFrames in Databricks:

1. Importing Required Libraries

You typically need to import the necessary modules to work with PySpark:

from pyspark.sql import SparkSession
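
Beyond SparkSession, a couple of submodules come up frequently. This is an optional, illustrative set of imports used in some of the sketches below:

from pyspark.sql import functions as F  # column expressions and aggregates
from pyspark.sql.types import StructType, StructField, StringType, IntegerType  # explicit schemas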

2. Creating a SparkSession

A SparkSession is the entry point for programming Spark with the Dataset and DataFrame API. In Databricks notebooks a preconfigured session is already available as the variable spark, but you can create or retrieve one explicitly as follows:

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()
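
Once a session exists, you can also build a small DataFrame directly from local data, which is handy for quick tests. A minimal sketch with illustrative column names:

# Create a DataFrame from a list of tuples with named columns
data = [("Alice", 34), ("Bob", 45)]
people_df = spark.createDataFrame(data, ["name", "age"])
people_df.printSchema()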

3. Reading Data

You can read data from various sources into a DataFrame using the spark.read interface:

df = spark.read.format("csv") \
.option("header", "true") \
.load("dbfs:/path/to/csv/file.csv")

4. Displaying Data

Databricks provides a convenient way to display DataFrames using the display() function:

display(df)
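
Outside notebooks, or when display() is not available, the standard DataFrame methods work as well:

df.show(5)          # print the first 5 rows as text
df.printSchema()    # print column names and types
print(df.count())   # number of rows (triggers a job)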

5. Operations and Transformations

Perform transformations on DataFrames such as selecting, filtering, aggregating, and joining. Transformations are lazy and return a new DataFrame, so capture the result:

# Selecting columns
selected_df = df.select("column1", "column2")

# Filtering rows
filtered_df = df.filter(df["column1"] > 10)

# Aggregating
aggregated_df = df.groupBy("column1").agg({"column2": "sum"})

# Joining two DataFrames on a common key column
joined_df = df1.join(df2, "key_column")

6. Writing Data

Write a DataFrame to various destinations such as CSV, JSON, Parquet, or JDBC:

df.write.format("parquet") \
.mode("overwrite") \
.save("dbfs:/path/to/parquet/file")

7. SQL Queries

You can run SQL queries against a DataFrame by registering it as a temporary view:

df.createOrReplaceTempView("temp_table")
result = spark.sql("SELECT * FROM temp_table WHERE column1 > 10")
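
The result of spark.sql is itself a DataFrame, so it can be displayed or transformed further:

display(result)                                            # render in the notebook
top = result.orderBy("column1", ascending=False).limit(10)  # keep the 10 largest values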

This reference provides a quick overview of commonly used operations for working with PySpark DataFrames in Databricks. For more detail and advanced functionality, refer to the official documentation or explore Databricks-specific features and optimizations.