The os module in Python

The os module in Python provides a portable way to interact with the operating system, including Linux. While it doesn’t cover every aspect of Linux system administration, it offers functions for common operations like file and directory manipulation, process management, and environment variables. Below are some of the key functions in the os module:

  1. File and Directory Operations:
    • os.getcwd(): Get the current working directory.
    • os.chdir(path): Change the current working directory to the specified path.
    • os.listdir(path='.'): Return a list of the entries in the directory given by path.
    • os.mkdir(path): Create a directory named path.
    • os.makedirs(path): Recursive directory creation function.
    • os.remove(path): Remove (delete) the file path.
    • os.rmdir(path): Remove (delete) the directory path.
  2. Process Management:
    • os.system(command): Execute the command in a subshell.
    • os.spawn*(): Functions for spawning a new process.
    • os.fork(): Fork a child process.
    • os.kill(pid, sig): Send a signal to the process pid.
  3. Environment Variables:
    • os.environ: Dictionary containing the environment variables.
    • os.getenv(var, default=None): Get an environment variable, optionally returning a default value if the variable is not set.
  4. Miscellaneous:
    • os.path: Submodule for common pathname manipulations.
    • os.name: Name of the operating system dependent module imported (e.g., 'posix' on Linux, 'nt' on Windows).
    • os.utime(path, times=None): Set the access and modified times of the file specified by path.
  5. Permissions:
    • os.chmod(path, mode): Change the mode (permissions) of path to the numeric mode.
    • os.access(path, mode): Test whether the current user can access path with the given mode (e.g., os.R_OK, os.W_OK).
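As a quick sketch of several of these calls working together (the directory and file names here are made up, and everything happens inside a throwaway temporary directory):

```python
import os
import tempfile

# Work inside a throwaway directory so nothing real is touched.
base = tempfile.mkdtemp()
os.chdir(base)
print(os.getcwd())                    # the temporary directory

os.makedirs("reports/2024")           # recursive directory creation
with open("reports/2024/notes.txt", "w") as f:
    f.write("hello")

print(sorted(os.listdir("reports")))  # ['2024']

# Environment variables: os.getenv returns the default when unset.
home = os.getenv("HOME", "/tmp")

os.remove("reports/2024/notes.txt")   # delete the file
os.rmdir("reports/2024")              # delete the now-empty directories
os.rmdir("reports")
```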

Remember, the os module provides basic functionality. For more advanced operations, you might need other modules like subprocess or shutil. Additionally, for system administration tasks on Linux, modules like subprocess, sys, shutil, socket, and multiprocessing are often used in conjunction with os.

Find and replace text within a file on Linux

To find and replace text within a file on Linux, you can use command-line tools such as sed or awk. Here’s how you can do it using sed, which is one of the most commonly used tools for this purpose:

sed -i 's/old_text/new_text/g' filename

Explanation:

  • -i: This option edits files in place. If you want to create a backup of the original file, you can use -i.bak instead, which will create a backup with the extension .bak.
  • 's/old_text/new_text/g': This is the substitution command in sed. It tells sed to substitute old_text with new_text globally (i.e., all occurrences).
  • filename: Replace this with the name of the file you want to modify.

For example, if you want to replace all occurrences of “hello” with “world” in a file named example.txt, you would use:

sed -i 's/hello/world/g' example.txt

Make sure to use caution when using the -i option, as it directly modifies the file. Always double-check the results to ensure they are as expected. If you want to preview the changes before actually modifying the file, you can omit the -i option and redirect the output to a different file:

sed 's/old_text/new_text/g' filename > new_filename

This command will perform the replacement and save the modified content to a new file called new_filename, leaving the original file unchanged.
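The same in-place substitution can be done from Python when sed isn’t convenient. This is a minimal sketch that assumes the file fits comfortably in memory (the function name and file names are placeholders, not a standard API):

```python
from pathlib import Path

def replace_in_file(path, old_text, new_text):
    """Replace every occurrence of old_text with new_text, in place.

    Reads the whole file into memory, so it is only suitable for
    files that fit comfortably in RAM (the common case for configs).
    """
    p = Path(path)
    p.write_text(p.read_text().replace(old_text, new_text))

# The same substitution as `sed -i 's/hello/world/g' example.txt`
Path("example.txt").write_text("hello there, hello again\n")
replace_in_file("example.txt", "hello", "world")
print(Path("example.txt").read_text())  # world there, world again
```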

Databricks: Create a table in Databricks using an external PostgreSQL data source

To create a table in Databricks using an external PostgreSQL data source, you can use the CREATE TABLE SQL statement with the USING clause to specify the data source. Here’s a basic example:

CREATE TABLE your_table_name
USING jdbc
OPTIONS (
  url 'jdbc:postgresql://your_postgresql_host:port/your_database',
  dbtable 'your_table_in_postgresql',
  user 'your_username',
  password 'your_password'
);

In this SQL statement:

  • your_table_name is the name you want to assign to your table in Databricks.
  • jdbc specifies that you’re using the JDBC data source.
  • url is the JDBC connection URL for your PostgreSQL database.
  • dbtable is the name of the table in your PostgreSQL database that you want to create a Databricks table from.
  • user is the username to connect to your PostgreSQL database.
  • password is the password associated with the username.

Replace the placeholders (your_...) with your actual values.

Make sure you have the appropriate JDBC driver installed in your Databricks cluster. You can upload the JDBC driver JAR file to your cluster’s storage or use Maven coordinates if the driver is available on Maven repositories.

Here’s an example using Maven coordinates for the PostgreSQL JDBC driver:

CREATE TABLE your_table_name
USING jdbc
OPTIONS (
  url 'jdbc:postgresql://your_postgresql_host:port/your_database',
  dbtable 'your_table_in_postgresql',
  user 'your_username',
  password 'your_password',
  driver 'org.postgresql.Driver'
);

Replace org.postgresql.Driver with the appropriate driver class name for your PostgreSQL JDBC driver.

After running this SQL statement in a Databricks notebook or SQL cell, the table your_table_name will be created in Databricks, and queries against it will read the schema and data from the specified table in your PostgreSQL database.

Databricks: Using COALESCE and SPLIT

COALESCE – This function is available in many different SQL dialects. We can use it to replace NULL values. For all NULL values in the Description column, COALESCE() will substitute a value you include in the function. In this case, the value is "Misc". For more information about COALESCE, check the documentation.

SPLIT – This function splits a string value around a specified character and returns an array. An array is a list of values that you can access by position. In this case, the forward slash (“/”) is the character we use to split the data. The first value in the array is the month. The list is zero-indexed, so the index of the first position is 0. Since we want to pull out the first value as the month, we write SPLIT(InvoiceDate, "/")[0] and rename the column month. The day is the second value, and its index is 1.

The third SPLIT is different. Remember that our InvoiceDate column is a string that includes a date and a time. Each part of the date is separated by a forward slash, but between the date and the time, there is only a space. Line 10 contains a nested SPLIT function that splits the string on a space delimiter.

SPLIT(InvoiceDate, " ")[0] -> drops the time from the string and leaves the date intact. Then, we split that value on the forward slash delimiter.
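The same slicing is easy to check outside Spark. In plain Python, the equivalent string operations look like this (the sample InvoiceDate value is made up, but its shape mirrors the column described above):

```python
# A value shaped like the InvoiceDate column: date, a space, then time.
invoice_date = "12/1/2010 8:26"

# SPLIT(InvoiceDate, "/")[0] -> the month (arrays are zero-indexed)
month = invoice_date.split("/")[0]
# SPLIT(InvoiceDate, "/")[1] -> the day
day = invoice_date.split("/")[1]

# Nested split: drop the time first, then split the date on "/"
date_only = invoice_date.split(" ")[0]  # "12/1/2010"
parts = date_only.split("/")            # ['12', '1', '2010']
print(month, day, parts)

# COALESCE analogue: replace a NULL (None) description with "Misc"
description = None
print(description if description is not None else "Misc")  # Misc
```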

Databricks: Sample the table

The TABLESAMPLE clause allows you to return a number of rows or a certain percentage of the data. In the cell directly below this one, we show that TABLESAMPLE can be used to access a specific number of rows. In the following cell, we show that it can be used to access a given percentage of the data. Please note, however, that any table display is limited to 1,000 rows. If the percentage of data you request returns more than 1,000 rows, only the first 1,000 will show.

The sample that displays 2 percent of the table is also ordered by the InvoiceDate. This exposes a formatting issue in the date column that we will have to fix later on. Take a moment and see if you can predict how we might need to change the way the InvoiceDate is written.
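For a fixed-size sample, the idea behind TABLESAMPLE (n ROWS) can be approximated in any SQL engine. Here is a small illustration using Python’s built-in sqlite3, which has no TABLESAMPLE of its own, so ORDER BY RANDOM() with LIMIT stands in for it (the table and data are invented):

```python
import sqlite3

# A tiny in-memory table standing in for the invoices data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER)")
conn.executemany("INSERT INTO invoices VALUES (?)",
                 [(i,) for i in range(100)])

# TABLESAMPLE (5 ROWS) analogue: a random fixed-size sample.
sample = conn.execute(
    "SELECT id FROM invoices ORDER BY RANDOM() LIMIT 5").fetchall()
print(len(sample))  # 5
```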

Databricks: Common Table Expressions

Common Table Expressions (CTE) are supported in Spark SQL. A CTE provides a temporary result set which you can then use in a SELECT statement. These are different from temporary views in that they cannot be used beyond the scope of a single query. In this case, we will use the CTE to get a closer look at the nested data without writing a new table or view. CTEs use the WITH clause to start defining the expression.

Notice that after we explode the source column, we can access individual properties in the value field by using dot notation with the property name.
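The WITH pattern itself can be tried in any engine that supports CTEs. Here is a minimal illustration using Python’s built-in sqlite3 (SQLite, not Spark, so there is no explode here, and the table and column names are invented):

```python
import sqlite3

# A tiny in-memory table to demonstrate the WITH clause.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (source TEXT, value INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("web", 3), ("web", 5), ("mobile", 2)])

rows = conn.execute("""
    WITH web_events AS (  -- temporary result set, scoped to this query
        SELECT value FROM events WHERE source = 'web'
    )
    SELECT SUM(value) FROM web_events
""").fetchall()
print(rows)  # [(8,)]
```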

Databricks: Create temporary view using ROUND and AVG and GROUP BY month

We will create a temporary view so that we can easily refer to the data we want to include in our visualization. For this data, we can investigate whether there are any patterns in the ratings when grouped by month. To do that, we use the ROUND() and AVG() functions to calculate the average rating and limit it to 3 decimal places. Then, we extract the month from the timeRecorded column after casting it as a timestamp. The AVG() is calculated over the course of a month, as specified in the GROUP BY clause.

CREATE OR REPLACE TEMPORARY VIEW ratingsByMonth AS
SELECT
  ROUND(AVG(rating), 3) AS avgRating,
  month(CAST(timeRecorded AS timestamp)) AS month
FROM
  movieRatings
GROUP BY
  month;
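The same ROUND/AVG/GROUP BY pattern runs in Python’s built-in sqlite3, which is handy for checking the aggregation logic. SQLite has no month() function, so strftime('%m', ...) stands in for it, and the sample ratings below are invented:

```python
import sqlite3

# A tiny stand-in for the movieRatings table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movieRatings (rating REAL, timeRecorded TEXT)")
conn.executemany(
    "INSERT INTO movieRatings VALUES (?, ?)",
    [(4.0, "2021-01-05"), (3.0, "2021-01-20"), (5.0, "2021-02-10")])

rows = conn.execute("""
    SELECT ROUND(AVG(rating), 3) AS avgRating,
           strftime('%m', timeRecorded) AS month
    FROM movieRatings
    GROUP BY month
    ORDER BY month
""").fetchall()
print(rows)  # [(3.5, '01'), (5.0, '02')]
```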