Databricks: PySpark DataFrames in Databricks:

Posted on March 31, 2024 by Arturo Gutierrez Loza

Below is a concise reference guide for working with PySpark DataFrames in Databricks:

1. Importing Required Libraries

You typically need to import the necessary modules to work with PySpark:

from pyspark.sql import SparkSession

2. Creating a SparkSession

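A SparkSession is the entry point to programming Spark with the Dataset and DataFrame API. Databricks notebooks already provide one as the variable spark, so you only need to create it yourself when running outside a notebook: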

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

3. Reading Data

You can read data from various sources into a DataFrame using the spark.read interface:

df = spark.read.format("csv") \
    .option("header", "true") \
    .load("dbfs:/path/to/csv/file.csv")

4. Displaying Data

Databricks provides a convenient way to display DataFrames using the display() function:

display(df)

5. Operations and Transformations

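DataFrames support operations such as selecting, filtering, aggregating, and joining. Each transformation returns a new DataFrame and is evaluated lazily, so nothing runs until an action such as display() or a write is called: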

# Selecting columns
df.select("column1", "column2")

# Filtering
df.filter(df["column1"] > 10)

# Aggregating
df.groupBy("column1").agg({"column2": "sum"})

# Joining
df1.join(df2, "key_column")

6. Writing Data

Write a DataFrame to various formats and destinations such as CSV, JSON, Parquet, or JDBC:

df.write.format("parquet") \
    .mode("overwrite") \
    .save("dbfs:/path/to/parquet/file")

7. SQL Queries

You can run SQL queries against a DataFrame by registering it as a temporary view and querying that view with spark.sql():

df.createOrReplaceTempView("temp_table")
result = spark.sql("SELECT * FROM temp_table WHERE column1 > 10")

This reference provides a quick overview of commonly used operations and functionalities for working with PySpark DataFrames in Databricks. For more detailed information and advanced functionalities, you can refer to the official documentation or explore Databricks-specific features and optimizations.

Basic Rules of Discrete Probability

Posted on February 25, 2024 by Arturo Gutierrez Loza

In this reading, we’ll introduce discrete and continuous probability, walk through basic probability notation, and describe a few common rules of discrete probability.

Types of Probability

When working with probability in the real world, it’s common to see probability broken down into two categories: discrete probability and continuous probability.

Discrete Probability

Discrete probability deals with discrete variables – that is, variables that have countable values, like integers. Examples of discrete variables include the number of fish in a lake and the number of hobbies that an adult in the U.S. enjoys.

The probability of a discrete variable describes how likely each specific value of that variable is to occur. For example, we could ask for the probability of there being exactly 142 fish in the lake, or the probability of a particular person enjoying exactly 4 hobbies.

Each possible value of these discrete variables has its own respective non-zero probability of occurring in a dataset.

Continuous Probability

Continuous probability deals with continuous variables – that is, variables that have infinite and uncountable values. Examples of continuous variables include an individual’s weight and how long it takes to run a kilometer.

Unlike discrete probability, continuous probability is usually described over ranges of values rather than single values. Because a continuous variable has infinitely many possible outcomes, the probability of any one exact value, like 2.0000001, is effectively zero. Only an interval, such as a weight between 70 and 71 kg, has a meaningful probability.
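The difference between an exact value and a range can be sketched with a small simulation. This is only an illustration under an assumed model: the uniform spread of kilometer times between 4 and 8 minutes, the seed, and the variable names are all hypothetical and not taken from real data.

```python
import random

rng = random.Random(42)  # seeded so the run is repeatable

# Hypothetical model: kilometer times spread uniformly between 4 and 8 minutes.
samples = [rng.uniform(4.0, 8.0) for _ in range(100_000)]

# Share of samples that equal one exact value versus share inside an interval.
exact_share = sum(1 for t in samples if t == 5.0000001) / len(samples)
range_share = sum(1 for t in samples if 5.0 <= t <= 6.0) / len(samples)

print(f"share equal to exactly 5.0000001: {exact_share}")
print(f"share between 5 and 6 minutes:    {range_share:.3f}")
```

Under this assumed model, an exact value essentially never shows up, while the one-minute interval should capture roughly a quarter of the samples.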

We’ll cover more on continuous probability later in this course, but we’ll focus on discrete probability for now.

Probability Notation

In order to learn the basics of discrete probability, it’s important to understand the basics of probability notation.

To symbolize the probability of a discrete event occurring, we use the following notation: P(A) = 0.5

This reads as “the probability of Event A occurring is equal to 0.5.” This can be interpreted as a 50 percent chance of Event A occurring.

Let’s consider a common real-world example: flipping a coin. Below, we write the probability of a flipped standard coin landing heads-up.

P(Heads) = 0.5

To shorten this, we will commonly represent the outcome (Heads) with a single letter. In this case, we are shortening Heads to a capital H.

P(H) = 0.5

Let’s take a look at an example with more possible outcomes: rolling a standard six-sided die. Each of the six outcomes is equally likely to occur when the die is rolled.

P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 or about 0.167.

It’s important to note that these examples are theoretical. They are based on what we know about coins and dice, rather than on actual observations recorded from the real world. In reality, the observed proportion of times an event occurs will most likely not match the theoretical probability exactly. For example, we might roll a 4 slightly more often, or slightly less often, than 1 in 6 times. So based on recorded data, we might find:

P(4) = (number of rolls where 4 occurred) / (total number of rolls) = 3/20 = 0.15
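The gap between theoretical and recorded probability can be explored with a short simulation. The sketch below assumes a fair die modelled with Python’s random module; the function name empirical_probability, the seed, and the roll counts are illustrative choices, not part of the original reading.

```python
import random
from fractions import Fraction


def empirical_probability(target, num_rolls, seed=0):
    """Estimate P(target) by simulating rolls of a fair six-sided die."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    hits = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) == target)
    return hits / num_rolls


theoretical = Fraction(1, 6)
for n in (20, 2_000, 200_000):
    estimate = empirical_probability(4, n)
    print(f"{n:>7} rolls: P(4) is about {estimate:.4f} "
          f"(theoretical {float(theoretical):.4f})")
```

With only 20 rolls the estimate can sit noticeably away from 1/6, and it generally moves closer as the number of rolls grows.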

However, whether we are working with theoretical or recorded probabilities, the following must be true:

The probability of each outcome must be between 0 and 1: 0 ≤ P(x) ≤ 1.
The sum of the probabilities of each outcome must equal 1: ∑ P(x) = 1.
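Both conditions are easy to check in code. The following is a minimal sketch: the helper is_valid_distribution is a hypothetical name, and the recorded counts are made up for illustration (chosen so that 4 appears in 3 of 20 rolls, matching the example above).

```python
import math
from fractions import Fraction


def is_valid_distribution(probabilities):
    """Return True if every value is in [0, 1] and the values sum to 1."""
    values = list(probabilities.values())
    in_range = all(0 <= p <= 1 for p in values)
    sums_to_one = math.isclose(float(sum(values)), 1.0)
    return in_range and sums_to_one


# Theoretical fair die: each face has probability 1/6.
theoretical_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Hypothetical recorded counts from 20 rolls (invented for illustration).
counts = {1: 4, 2: 3, 3: 4, 4: 3, 5: 2, 6: 4}
total = sum(counts.values())
recorded_die = {face: Fraction(n, total) for face, n in counts.items()}

print(is_valid_distribution(theoretical_die))
print(is_valid_distribution(recorded_die))
print(is_valid_distribution({"H": 0.7, "T": 0.7}))  # sums to 1.4, so invalid
```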

Common Discrete Probability Rules

When we work with discrete probability, we’re often interested in more complex questions than the likelihood of a single event occurring in a single trial or draw. Sometimes, we’re interested in the probability that at least one of a set of possibly overlapping events occurs, or in the probability that multiple events occur simultaneously.

In order to answer these questions, we need to understand some of the basic rules of discrete probability. Two common rules are the additive rule and the multiplicative rule.

Additive Rule

The additive rule states that for two events in a single-trial probability experiment, the probability of either event occurring is equal to the sum of their individual probabilities, as long as those events are mutually exclusive.

Consider the following theoretical examples from our coin and die examples:

P(H or T) = P(H) + P(T) = 0.5 + 0.5 = 1

P(1 or 4) = P(1) + P(4) = ⅙ + ⅙ = ⅓

This rule can be generally stated as:

P(A or B) = P(A) + P(B)

The above rule holds when the events are mutually exclusive, meaning that both events cannot occur at once: a single coin flip can only bring up heads or tails, not both, and a single roll of a die cannot land on both the number 1 and the number 4.

But what if both events can occur? Let’s look at an example using a standard deck of cards. Assume that we’re interested in the probability that a single drawn card is a king, denoted by K, or is of the heart suit, denoted by H.

The probabilities of each of these single events are below:

P(K) = 4/52

P(H) = 13/52

If these events were mutually exclusive, meaning there’s no card that is simultaneously a king and of a heart suit, we’d just add P(K) and P(H) together. However, there is a single card that is a king and of a heart suit.

Its probability is denoted below:

P(K and H) = 1/52

In order to correctly calculate the probability of a single drawn card being a king or of the heart suit, we need to subtract the probability of drawing the king of hearts card. This is because that card is counted as both part of P(K) and P(H) – we’re simply making sure to not count it twice.

P(K or H) = P(K) + P(H) – P(K and H) = 4/52 + 13/52 – 1/52 = 16 / 52

This rule can be generally stated as:

P(A or B) = P(A) + P(B) – P(A and B)

The additive rule helps us find the probability that at least one of a set of events occurs. When the events are not mutually exclusive, we need to subtract the probability that they both occur.
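The king-or-heart calculation can be checked by enumerating a 52-card deck and counting. This is a sketch under the assumption of one card drawn uniformly at random; the names DECK, probability, is_king, and is_heart are illustrative.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def probability(event):
    """P(event) for one card drawn uniformly at random from DECK."""
    favorable = sum(1 for card in DECK if event(card))
    return Fraction(favorable, len(DECK))


def is_king(card):
    return card[0] == "K"


def is_heart(card):
    return card[1] == "hearts"


p_k = probability(is_king)
p_h = probability(is_heart)
p_k_and_h = probability(lambda card: is_king(card) and is_heart(card))
p_k_or_h = probability(lambda card: is_king(card) or is_heart(card))

# Counting directly and applying the additive rule give the same answer.
print(p_k_or_h)                            # 4/13, the reduced form of 16/52
print(p_k + p_h - p_k_and_h == p_k_or_h)   # True
```

Fraction reduces results automatically, so 16/52 is displayed as 4/13.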

Multiplicative Rule

The multiplicative rule states that the probability of two events both occurring is equal to the probability of the first event occurring multiplied by the probability of the second event occurring, as long as those events are independent.

Let us combine our coin and die examples by using the multiplicative rule to compute the probability that a flip of the coin lands heads-up and the die lands with the number 4 facing up.

P(H and 4) = P(H) * P(4) = ½ * ⅙ = 1/12

This rule can be generally stated as:

P(A and B) = P(A) * P(B)

The above rule holds when the two events are independent of one another. The outcome of flipping a coin has absolutely no impact on the outcome of rolling a die.

However, if the events are not independent of one another, then the multiplicative rule states that the probability of two events both occurring is equal to the probability of the first event occurring multiplied by the probability of the second event occurring given that the first event occurred. This last condition accounts for the dependence of the two events.

This rule can be generally stated as:

P(A and B) = P(A) * P(B|A)

Let’s consider our playing card example. Assume that you want to know the probability of drawing a card that is both a king and of the hearts suit. In order to determine this, you need to know the probability of a card being a king and the probability of the card being of the hearts suit given that it is a king.

These probabilities are denoted below:

P(K) = 4/52

P(H|K) = ¼

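The general form of the multiplicative rule holds whether or not the events are independent, so we multiply these two probabilities by one another to compute P(K and H). In a standard deck, P(H|K) = ¼ happens to equal P(H) = 13/52, so rank and suit are in fact independent here and the simpler form gives the same result.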

P(K and H) = P(K) * P(H|K) = 4/52 * ¼ = 1/52

The multiplicative rule can be used to help us understand the probability of two events occurring. Depending on whether the set of events are independent from one another, we might use the probability of the second event occurring if the first event has occurred, too.
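Both forms of the multiplicative rule can be verified by enumerating sample spaces. The sketch below assumes every outcome is equally likely; the helpers probability and conditional are hypothetical names, and conditional simply restricts the sample space to outcomes where the given event holds.

```python
from fractions import Fraction
from itertools import product


def probability(sample_space, event):
    """P(event) when every outcome in sample_space is equally likely."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))


def conditional(sample_space, event, given):
    """P(event | given): keep only the outcomes where `given` holds."""
    restricted = [outcome for outcome in sample_space if given(outcome)]
    return probability(restricted, event)


# Coin flip and die roll together: 2 * 6 = 12 equally likely outcomes.
coin_and_die = list(product("HT", range(1, 7)))
p_h_and_4 = probability(coin_and_die, lambda outcome: outcome == ("H", 4))
print(p_h_and_4)  # 1/12

# Standard deck of 52 (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))


def is_king(card):
    return card[0] == "K"


def is_heart(card):
    return card[1] == "hearts"


p_k = probability(deck, is_king)
p_h_given_k = conditional(deck, is_heart, is_king)
print(p_k * p_h_given_k)  # 1/52
```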

Python Flask programming reference sites

Posted on January 16, 2024 by Arturo Gutierrez Loza

Official Flask Documentation:
- Flask Documentation
  - The official documentation provides comprehensive information about Flask, including installation, quickstart guide, and detailed explanations of Flask features and concepts.
Flask GitHub Repository:
- Flask GitHub Repository
  - The Flask source code is available on GitHub. You can explore the repository to understand the implementation details and contribute to the Flask project.
Flask Quickstart Guide:
- Flask Quickstart
  - The quickstart guide is a great starting point for beginners. It covers the basic steps to create a simple Flask application.
Flask Mega-Tutorial by Miguel Grinberg:
- Flask Mega-Tutorial
  - This tutorial by Miguel Grinberg is a comprehensive guide to building a full-featured web application with Flask. It covers a wide range of topics and is suitable for both beginners and intermediate learners.
Real Python Flask Tutorials:
- Real Python Flask Tutorials
  - Real Python offers a variety of tutorials covering Flask, from basic concepts to more advanced topics. The tutorials include video content and written guides.
Flask Web Development Book by Miguel Grinberg:
- Flask Web Development Book
  - Miguel Grinberg’s book “Flask Web Development” provides in-depth coverage of Flask, including building web applications, handling databases, and more.
Flask by Example Series on PyBites:
- Flask by Example
  - PyBites offers a Flask by Example series, which guides you through building Flask applications step by step.
Awesome Flask:
- Awesome Flask
  - The Awesome Flask GitHub repository is a curated list of Flask resources, including extensions, tutorials, and tools.
Flask-WTF Documentation (WTForms):
- Flask-WTF Documentation
  - If you are working with web forms in Flask, the documentation for Flask-WTF, which integrates Flask with WTForms, is a valuable resource.
Explore Flask:
- Explore Flask
  - Explore Flask is a free online book that covers Flask concepts and provides practical examples.

Remember to check the official Flask documentation for the most up-to-date and accurate information. Additionally, exploring community forums, such as the Flask community on Stack Overflow, can be helpful for getting answers to specific questions.

Simple example using Python’s unittest module to demonstrate basic unit testing.

Posted on January 9, 2024 by Arturo Gutierrez Loza

This is a simple example that uses Python’s unittest module to demonstrate basic unit testing. We’ll create a couple of simple functions and write test cases to ensure their correctness.

Step 1: Create a Python Module

Create a file named math_operations.py with the following content:

# math_operations.py
def add_numbers(a, b):
    return a + b

def multiply_numbers(a, b):
    return a * b

Step 2: Write Unit Tests

Create another file named test_math_operations.py to write unit tests for the math_operations module:

# test_math_operations.py
import unittest
from math_operations import add_numbers, multiply_numbers

class TestMathOperations(unittest.TestCase):

    def test_add_numbers(self):
        result = add_numbers(3, 7)
        self.assertEqual(result, 10)

    def test_multiply_numbers(self):
        result = multiply_numbers(3, 4)
        self.assertEqual(result, 12)

if __name__ == '__main__':
    unittest.main()

Step 3: Run the Tests

In the terminal or command prompt, navigate to the directory containing your Python files (math_operations.py and test_math_operations.py). Run the following command:

python -m unittest test_math_operations.py

This command will discover and run the tests in test_math_operations.py. If everything is correct, you should see an output indicating that all tests passed.

Example Output:

..
----------------------------------------------------------------------
Ran 2 tests in 0.001s

OK

The unittest module executed two tests (test_add_numbers and test_multiply_numbers), and both passed successfully.

Feel free to modify the functions and test cases to explore more features of the unittest module. Unit testing is a crucial aspect of software development, helping ensure that individual components of your code work as expected.
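As one possible next step, the sketch below uses two more features of unittest: subTest, which reports each input separately inside one test method, and assertRaises, which checks that an error is raised. To keep the snippet self-contained, add_numbers is redefined inline rather than imported from math_operations.py, and the chosen inputs are illustrative. The tests are run programmatically with TextTestRunner, which is an alternative to the python -m unittest command shown earlier.

```python
import unittest


def add_numbers(a, b):
    # Same behaviour as add_numbers in math_operations.py, repeated here
    # so the snippet runs on its own.
    return a + b


class TestAddNumbers(unittest.TestCase):

    def test_add_various_inputs(self):
        # Each (a, b, expected) case is reported separately on failure.
        cases = [(3, 7, 10), (-1, 1, 0), (0, 0, 0), (2.5, 0.5, 3.0)]
        for a, b, expected in cases:
            with self.subTest(a=a, b=b):
                self.assertEqual(add_numbers(a, b), expected)

    def test_add_mismatched_types_raises(self):
        # Adding a string and an integer is a TypeError in Python.
        with self.assertRaises(TypeError):
            add_numbers("a", 1)


# Build and run the suite in code instead of from the command line.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAddNumbers)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

The returned result object exposes wasSuccessful() and testsRun, which can be handy when tests are triggered from another script.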

Installing and using Pylint example

Posted on January 9, 2024 by Arturo Gutierrez Loza

Pylint is a widely used tool for static code analysis in Python. It helps identify potential issues, style violations, and other code quality concerns. Here’s a simple example of installing and using Pylint:

Step 1: Install Pylint

You can install Pylint using the package manager pip. Open your terminal or command prompt and run:

pip install pylint

Step 2: Create a Python Script

Let’s create a simple Python script for demonstration purposes. Create a file named example.py with the following content:

# example.py
def add_numbers(a, b):
    result = a + b
    return result

num1 = 5
num2 = 10
sum_result = add_numbers(num1, num2)
print(f"The sum of {num1} and {num2} is: {sum_result}")

Step 3: Run Pylint

In the terminal or command prompt, navigate to the directory where your example.py file is located. Run the following command:

pylint example.py

Pylint will analyze your Python script and provide a report with suggestions, warnings, and other information related to code quality.

Step 4: Review the Pylint Report

After running the pylint command, you’ll see an output similar to the following:

************* Module example
example.py:1:0: C0114: Missing module docstring (missing-module-docstring)
example.py:1:0: C0103: Argument name "a" doesn't conform to snake_case naming style (invalid-name)
...

The report includes various messages indicating potential issues in your code. Each message has a code (e.g., C0114) that corresponds to a specific type of warning or error.
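One way to respond to messages like these is to add docstrings and use more descriptive names. The revision of example.py below is only a sketch: the docstring wording, the argument names first and second, and the main() wrapper are illustrative choices, and the exact messages Pylint reports depend on its version and configuration.

```python
"""Example script: add two numbers and print the result."""


def add_numbers(first, second):
    """Return the sum of first and second."""
    return first + second


def main():
    """Run the demonstration with two sample values."""
    num1 = 5
    num2 = 10
    sum_result = add_numbers(num1, num2)
    print(f"The sum of {num1} and {num2} is: {sum_result}")


if __name__ == "__main__":
    main()
```

Moving the script logic into main() keeps the sample values out of module scope, where Pylint’s default naming rules may treat them as constants.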

Optional: Customize Pylint Configuration

You can create a Pylint configuration file (e.g., .pylintrc) in your project directory to customize Pylint’s behavior. This file allows you to ignore specific warnings, define naming conventions, and more.

Now you’ve installed and used Pylint to analyze a simple Python script. You can integrate Pylint into your development workflow to ensure code quality and adherence to coding standards.

Arturo Gutiérrez Loza Blog

