Databricks: Create a table in Databricks using an external PostgreSQL data source

To create a table in Databricks using an external PostgreSQL data source, you can use the CREATE TABLE SQL statement with the USING clause to specify the data source. Here’s a basic example:

CREATE TABLE your_table_name
USING jdbc
OPTIONS (
  url 'jdbc:postgresql://your_postgresql_host:port/your_database',
  dbtable 'your_table_in_postgresql',
  user 'your_username',
  password 'your_password'
);

In this SQL statement:

  • your_table_name is the name you want to assign to your table in Databricks.
  • jdbc specifies that you’re using the JDBC data source.
  • url is the JDBC connection URL for your PostgreSQL database.
  • dbtable is the name of the table in your PostgreSQL database that you want to create a Databricks table from.
  • user is the username to connect to your PostgreSQL database.
  • password is the password associated with the username.

Replace the placeholders (your_...) with your actual values.

Make sure you have the appropriate JDBC driver installed in your Databricks cluster. You can upload the JDBC driver JAR file to your cluster’s storage or use Maven coordinates if the driver is available on Maven repositories.

Here’s an example using Maven coordinates for the PostgreSQL JDBC driver:

CREATE TABLE your_table_name
USING jdbc
OPTIONS (
  url 'jdbc:postgresql://your_postgresql_host:port/your_database',
  dbtable 'your_table_in_postgresql',
  user 'your_username',
  password 'your_password',
  driver 'org.postgresql.Driver'
);

org.postgresql.Driver is the standard driver class name for the PostgreSQL JDBC driver; replace it only if you are using a different JDBC driver.

After running this SQL statement in a Databricks notebook or SQL cell, the table your_table_name will be registered in Databricks. Its schema is read from the specified PostgreSQL table, and queries against it are sent to PostgreSQL over JDBC, so results always reflect the current data in the source table; the data is not copied into Databricks.

Databricks: Using COALESCE and SPLIT

COALESCE – This command is popular among many different SQL dialects. We can use it to replace NULL values. For all NULL values in the Description column, COALESCE() will replace the null with a value you include in the function. In this case, the value is "Misc". For more information about COALESCE, check the documentation.
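As a minimal sketch of the pattern described above (the table name retailData is hypothetical; Description and the "Misc" fallback come from the text):

```sql
-- Replace NULL values in Description with the literal 'Misc';
-- non-NULL values pass through unchanged.
SELECT
  COALESCE(Description, "Misc") AS Description
FROM retailData;
```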

SPLIT – This command splits a string value around a specified character and returns an array. An array is a list of values that you can access by position. In this case, the forward slash (“/”) is the character we use to split the data. The first value in the array is the month. This list is zero-indexed, so the index of the first position is 0. Since we want to pull out the first value as the month, we indicate the value like this: SPLIT(InvoiceDate, "/")[0] and rename the column to month. The day is the second value, and its index is 1.

The third SPLIT is different. Remember that our InvoiceDate column is a string that includes a date and time. Each part of the date is separated by a forward slash, but between the date and the time, there is only a space. Line 10 contains a nested SPLIT function that splits the string on a space delimiter.

SPLIT(InvoiceDate, " ")[0] drops the time from the string and leaves the date intact. Then, we split that value on the forward slash delimiter.
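Putting the pieces together, the pattern described above might look like this (the table name retailData is hypothetical; InvoiceDate comes from the text):

```sql
-- InvoiceDate is a string such as '12/1/2010 8:26'.
SELECT
  SPLIT(InvoiceDate, "/")[0] AS month,               -- first element (index 0)
  SPLIT(InvoiceDate, "/")[1] AS day,                 -- second element (index 1)
  SPLIT(SPLIT(InvoiceDate, " ")[0], "/")[2] AS year  -- drop the time, then take the third element
FROM retailData;
```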

Databricks: Sample the table

The TABLESAMPLE clause allows you to return a set number of rows or a certain percentage of the data. In the cell directly below this one, we show that TABLESAMPLE can be used to access a specific number of rows. In the following cell, we show that it can be used to access a given percentage of the data. Please note, however, that any table display is limited to 1,000 rows. If the percentage of data you request returns more than 1,000 rows, only the first 1,000 will show.
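The two forms described above can be sketched like this (the table name retailData is hypothetical):

```sql
-- Return exactly 5 rows from the table.
SELECT * FROM retailData TABLESAMPLE (5 ROWS);

-- Return roughly 2 percent of the rows, ordered by InvoiceDate.
SELECT * FROM retailData TABLESAMPLE (2 PERCENT) ORDER BY InvoiceDate;
```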

The sample that displays 2 percent of the table is also ordered by the InvoiceDate. This shows off a formatting issue in the date column that we will have to fix later on. Take a moment and see if you can predict what we might need to change in the way the InvoiceDate is written.

Databricks: Common Table Expressions

Common Table Expressions (CTEs) are supported in Spark SQL. A CTE provides a temporary result set that you can then use in a SELECT statement. CTEs differ from temporary views in that they cannot be referenced beyond the scope of the single query that defines them. In this case, we will use a CTE to get a closer look at the nested data without writing a new table or view. CTEs use the WITH clause to start defining the expression.

Notice that after we explode the source column, we can access individual properties in the value field by using dot notation with the property name.
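A minimal sketch of the pattern, assuming source is a map column whose values are structs (the table and field names below are hypothetical):

```sql
-- Define a temporary result set with WITH, explode the map into
-- key/value rows, then reach into the struct with dot notation.
WITH exploded AS (
  SELECT explode(source) FROM rawEvents
)
SELECT
  key,
  value.rating,        -- dot notation into the struct
  value.timeRecorded
FROM exploded;
```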

Databricks: Create temporary view using ROUND and AVG and GROUP BY month

We will create a temporary view so that we can easily refer to the data we want to include in our visualization. For this data, we can investigate whether there are any patterns in the ratings when grouped by month. To do that, we use the ROUND() and AVG() functions to calculate the average rating and limit it to 3 decimal places. Then, we extract the month from the timeRecorded column after casting it as a timestamp. The AVG() is calculated over the course of a month, as specified in the GROUP BY clause.

CREATE OR REPLACE TEMPORARY VIEW ratingsByMonth AS
SELECT
  ROUND(AVG(rating), 3) AS avgRating,
  month(CAST(timeRecorded AS timestamp)) AS month
FROM
  movieRatings
GROUP BY
  month;