Code Organization

Following standard Python3 code organization practices will make our code easier to read, both for other developers and for our future selves looking back to see what we did. After going through this module, students should be able to:

  • Organize code into main() functions

  • Import functions into other scripts without executing the main() block

  • Write functions in a generalizable way so they are reusable

  • Use a shebang in their Python3 scripts to make them executable

Main Function

In many Python programs, you will find the developer has organized their code into a main() function. Then, they will only call the main() function if the variable __name__ is equal to the string '__main__'. For example:

def main():
    # application code goes here
    pass

if __name__ == '__main__':
    main()

If this script is executed on the command line directly, then the internal variable __name__ will be set to the string '__main__'. The conditional evaluates as True and the main() function is called.

If this script is instead imported into another script, say, to reuse some of the functions defined within, then the internal variable __name__ will instead be set to the name of the module (the file name without the .py extension). Thus, the main() function is not called, but other functions defined in this script would be available.
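The difference between the two cases can be observed directly. The sketch below is illustrative only: it writes a small throwaway module (the file name demo_script.py and the variables NAME_SEEN and MAIN_CALLED are hypothetical), then loads it once as if it were run from the command line and once as an import, using the standard library's runpy and importlib modules.

```python
import importlib.util
import runpy
import tempfile
from pathlib import Path

# Hypothetical module contents used only for this demonstration
DEMO_SOURCE = '''
MAIN_CALLED = False

def main():
    global MAIN_CALLED
    MAIN_CALLED = True

NAME_SEEN = __name__

if __name__ == '__main__':
    main()
'''

def observe_name() -> dict:
    """Load the demo module both ways and report what each one saw."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / 'demo_script.py'
        path.write_text(DEMO_SOURCE)

        # Case 1: behaves like running `python3 demo_script.py`
        as_script = runpy.run_path(str(path), run_name='__main__')

        # Case 2: behaves like `import demo_script` from another script
        spec = importlib.util.spec_from_file_location('demo_script', path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

    return {
        'script': (as_script['NAME_SEEN'], as_script['MAIN_CALLED']),
        'import': (module.NAME_SEEN, module.MAIN_CALLED),
    }

results = observe_name()
print(results)
# {'script': ('__main__', True), 'import': ('demo_script', False)}
```

When run as a script, the module sees __name__ equal to '__main__' and main() is called. When imported, it sees its own module name and main() is skipped.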

Consider the code below for analyzing the Meteorite Landings:

import json
from pydantic import BaseModel

class MeteoriteLanding(BaseModel):
    name: str
    id: int
    class_name: str
    mass: int
    lat: float
    long: float

def compute_average_mass(landings: list[MeteoriteLanding]) -> float:
    total_mass = 0.
    for ml in landings:
        total_mass += ml.mass
    return (total_mass / len(landings))

def check_hemisphere(ml: MeteoriteLanding) -> str:
    location = ''
    if (ml.lat > 0):
        location = 'Northern'
    else:
        location = 'Southern'
    if (ml.long > 0):
        location = f'{location} & Eastern'
    else:
        location = f'{location} & Western'
    return(location)

with open('Meteorite_Landings_Simple.json', 'r') as f:
    ml_data = json.load(f)

landings = [MeteoriteLanding(**item) for item in ml_data["meteorite_landings"]]

print(compute_average_mass(landings))

for ml in landings:
    print(check_hemisphere(ml))

To reorganize this code, we would put the file read operation and the two function calls into a main function:

import json
from pydantic import BaseModel

class MeteoriteLanding(BaseModel):
    name: str
    id: int
    class_name: str
    mass: int
    lat: float
    long: float

def compute_average_mass(landings: list[MeteoriteLanding]) -> float:
    total_mass = 0.
    for ml in landings:
        total_mass += ml.mass
    return (total_mass / len(landings))

def check_hemisphere(ml: MeteoriteLanding) -> str:
    location = ''
    if (ml.lat > 0):
        location = 'Northern'
    else:
        location = 'Southern'
    if (ml.long > 0):
        location = f'{location} & Eastern'
    else:
        location = f'{location} & Western'
    return(location)

def main():
    # Read the dataset, build the models, and print the results
    with open('Meteorite_Landings_Simple.json', 'r') as f:
        ml_data = json.load(f)

    landings = [MeteoriteLanding(**ml) for ml in ml_data["meteorite_landings"]]

    print(compute_average_mass(landings))

    for ml in landings:
        print(check_hemisphere(ml))

if __name__ == '__main__':
    main()

Let’s put this code in a module called ml_data_analysis.py.

If this code is imported into another Python3 script, that other script will have access to the compute_average_mass() and check_hemisphere() functions, but it will not execute the code in the main() function.

EXERCISE

Write a new script that imports the above code, assuming it is saved in a file called ml_data_analysis.py.

Try executing this new script with and without protecting the imported code in a main() function. How do the outputs differ?

Tip

The main function does not have to be called literally main(). But, if someone else is reading your code, calling it main() will certainly help orient the reader.

Refactoring

Refactoring is when you reorganize your code while preserving its original behavior. Refactoring code is analogous to factoring in mathematics. For example:

f(x) = x^2 + x can be written as f(x) = x(x + 1)

or in the opposite direction:

f(x) = x(x + 1) -> f(x) = x^2 + x

The expression changes, but the result does not.

In software engineering, we refactor our code so that it is better organized, more readable, and easier to reason about.
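The same idea carries over to code. As a minimal sketch (the function names and the sample masses here are illustrative, not part of the course code), the two functions below are written differently but return the same result for the same input:

```python
def average_mass_loop(masses: list[float]) -> float:
    """Original form: accumulate the total with an explicit loop."""
    total = 0.0
    for mass in masses:
        total += mass
    return total / len(masses)

def average_mass_builtin(masses: list[float]) -> float:
    """Refactored form: same behavior, expressed with the built-in sum()."""
    return sum(masses) / len(masses)

sample_masses = [21, 720, 107000]  # made-up values for illustration

# The expression changed, but the result did not
assert average_mass_loop(sample_masses) == average_mass_builtin(sample_masses)
```

A check like the final assert is a simple way to gain confidence that a refactor preserved behavior.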

EXERCISE

Before we start refactoring, take another look at the code above and ask yourself the following:

  1. Can I succinctly describe what this code is doing?

  2. Can we reorganize this code in some way that improves its readability and our ability to reason about what it does?

Note

Let the following software development principle guide your thinking:

Single Responsibility Principle (SRP): A function, class, or module should have a single, well-defined job or responsibility.
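As a rough illustration of the principle (all names here are hypothetical, and the data is inlined so the sketch runs on its own), compare one function that parses, computes, and formats with a version that gives each job its own function:

```python
import json

# Made-up data for illustration only
RAW = '{"meteorite_landings": [{"name": "A", "mass": 10}, {"name": "B", "mass": 30}]}'

def do_everything(raw: str) -> str:
    """Three jobs in one place: parse, compute, and format."""
    records = json.loads(raw)['meteorite_landings']
    average = sum(r['mass'] for r in records) / len(records)
    return f'average mass: {average}'

def parse_landings(raw: str) -> list[dict]:
    """Job 1: turn JSON text into a list of records."""
    return json.loads(raw)['meteorite_landings']

def average_mass(records: list[dict]) -> float:
    """Job 2: compute a statistic from the records."""
    return sum(r['mass'] for r in records) / len(records)

def format_report(average: float) -> str:
    """Job 3: present the result."""
    return f'average mass: {average}'

report = format_report(average_mass(parse_landings(RAW)))
print(report)  # average mass: 20.0
```

Both versions produce the same string, but the second lets each piece be reused, tested, or replaced independently.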

First, let’s put our pydantic model in its own module called models.py:

models.py
from pydantic import BaseModel

class MeteoriteLanding(BaseModel):
    name: str
    id: int
    class_name: str
    mass: int
    lat: float
    long: float

Next, let’s consider the compute_average_mass and check_hemisphere functions. We can see from the types in the function signatures that both of these functions are tightly coupled to the MeteoriteLanding class. This is an indication that we should probably package them with or near the MeteoriteLanding class.

To keep it simple, let’s just put them in the models.py module.

models.py
from pydantic import BaseModel

class MeteoriteLanding(BaseModel):
    name: str
    id: int
    class_name: str
    mass: int
    lat: float
    long: float

def compute_average_mass(landings: list[MeteoriteLanding]) -> float:
    total_mass = 0.
    for ml in landings:
        total_mass += ml.mass
    return (total_mass / len(landings))

def check_hemisphere(ml: MeteoriteLanding) -> str:
    location = ''
    if (ml.lat > 0):
        location = 'Northern'
    else:
        location = 'Southern'
    if (ml.long > 0):
        location = f'{location} & Eastern'
    else:
        location = f'{location} & Western'
    return(location)

Now we must fix the imports in our ml_data_analysis.py module.

import json

from models import MeteoriteLanding, compute_average_mass, check_hemisphere

def main():
    with open('Meteorite_Landings_Simple.json', 'r') as f:
        ml_data = json.load(f)

    landings = [MeteoriteLanding(**ml) for ml in ml_data["meteorite_landings"]]

    print(compute_average_mass(landings))

    for ml in landings:
        print(check_hemisphere(ml))

if __name__ == '__main__':
    main()

This results in a cleaner, easier-to-read application that is simple to reason about.

Q: What does this program do?

A: It loads a meteorite landings dataset, computes the average mass of the meteorites, determines which hemispheres they landed in, and prints the results to stdout.

Now run the ml_data_analysis file with uv run python ml_data_analysis.py and see that it runs the same as before.

Intermediate Pydantic & Complex Datasets

In the earlier examples, we downloaded and used the Meteorite_Landings_Simple.json dataset. This dataset was modified to simplify our introduction to the pydantic library. The original dataset contains JSON objects with keys that are not compatible with Python’s syntax for class attributes (for example, mass (g) contains a space and parentheses).

Now that we have an initial grasp of pydantic, let’s work with the original dataset.

On your VM, run the following command in the directory that contains your ml_data_analysis.py.

[coe332-vm]$ wget https://raw.githubusercontent.com/TACC/coe-332-sp26/main/docs/unit02/sample-data/Meteorite_Landings.json

You should now have a file called Meteorite_Landings.json with the unmodified JSON inside.

{
    "meteorite_landings": [
        {
            "name": "Ruiz",
            "id": "10001",
            "recclass": "L5",
            "mass (g)": "21",
            "reclat": "50.775",
            "reclong": "6.08333",
            "GeoLocation": "(50.775, 6.08333)"
        },
        {
            "name": "Beeler",
            "id": "10002",
            "recclass": "H6",
            "mass (g)": "720",
            "reclat": "56.18333",
            "reclong": "10.23333",
            "GeoLocation": "(56.18333, 10.23333)"
        },
        ...
    ]
}

There are a few things about this new dataset that should catch your attention.

First, notice the data types of the values for each property. They are all strings! We know from our previous work that pydantic is smart enough to coerce these values into the correct Python type (provided such coercion is possible).

Second, we have different keys in this dataset:

  • mass (g) instead of mass

  • reclat instead of lat

  • reclong instead of long

  • recclass instead of class_name

  • and finally an additional GeoLocation field with what looks like a tuple of the lat and long wrapped in a string.

Let’s now modify our pydantic model to handle this new dataset.

models.py
from pydantic import BaseModel, Field, model_validator


# Defined first so MeteoriteLanding can reference it in an annotation
class GeoLocation(BaseModel):
    lat: float
    long: float


class MeteoriteLanding(BaseModel):
    name: str
    id: int
    mass: int = Field(alias="mass (g)")
    class_name: str = Field(alias="recclass")
    location: GeoLocation

    @model_validator(mode="before")
    @classmethod
    def preprocess_inputs(cls, values):
        # Build the nested location from the flat reclat/reclong keys
        values["location"] = {
            "lat": values["reclat"],
            "long": values["reclong"],
        }
        return values

...

Let’s break down our modifications to the models.py module.

First, we created aliases for certain properties of our data using pydantic’s Field function. Field annotates a pydantic model attribute with some metadata. Here, the alias metadata helps pydantic understand which key in the input data corresponds to which model attribute.

  • The value at key mass (g) will be used with the attribute mass

  • The value at key recclass will be used with the attribute class_name

Second, we added a GeoLocation class that contains a lat and long.

And finally, we made use of pydantic’s model_validator and Python’s classmethod decorators to instruct pydantic on how to construct the input data. We will learn more about decorators in detail in a later unit. With mode="before", the method decorated by model_validator runs on the raw input before pydantic validates the individual fields. It allows us to manipulate the incoming data, run validation logic, and much more. Here, it builds the nested location value:

  • The value at key reclat will be used with the attribute location.lat

  • The value at key reclong will be used with the attribute location.long

If you’re observant, you may have noticed that we completely ignored the "GeoLocation" property in the data. This is duplicate data on each object that we do not need. By default, pydantic will safely ignore any additional properties unless you tell it otherwise.
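To see these pieces working together, the sketch below feeds the first record from the JSON above through the model. It assumes pydantic v2 is installed and repeats the model definitions so that it can run on its own; the variable names record and ml are only for illustration.

```python
from pydantic import BaseModel, Field, model_validator

class GeoLocation(BaseModel):
    lat: float
    long: float

class MeteoriteLanding(BaseModel):
    name: str
    id: int
    mass: int = Field(alias="mass (g)")
    class_name: str = Field(alias="recclass")
    location: GeoLocation

    @model_validator(mode="before")
    @classmethod
    def preprocess_inputs(cls, values):
        # Runs on the raw input dict before field validation
        values["location"] = {"lat": values["reclat"], "long": values["reclong"]}
        return values

# First record from the dataset shown above (all values are strings)
record = {
    "name": "Ruiz",
    "id": "10001",
    "recclass": "L5",
    "mass (g)": "21",
    "reclat": "50.775",
    "reclong": "6.08333",
    "GeoLocation": "(50.775, 6.08333)",
}

ml = MeteoriteLanding(**record)

print(ml.mass, type(ml.mass).__name__)    # 21 int
print(ml.class_name)                      # L5
print(ml.location.lat, ml.location.long)  # 50.775 6.08333
```

The aliased keys land on mass and class_name, the string values are coerced to int and float, and the unused keys (reclat, reclong, GeoLocation) do not appear on the resulting model.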

Note

You can make your model more strict by adding a model config attribute on your model class:

import pydantic

class MyModel(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid")
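A small sketch of the difference, assuming pydantic v2 (the class names LooseModel and StrictModel and the sample record are hypothetical):

```python
import pydantic

class LooseModel(pydantic.BaseModel):
    # Default behavior: unknown keys are ignored
    name: str

class StrictModel(pydantic.BaseModel):
    # Unknown keys cause validation to fail
    model_config = pydantic.ConfigDict(extra="forbid")
    name: str

record = {"name": "Ruiz", "GeoLocation": "(50.775, 6.08333)"}

loose = LooseModel(**record)  # succeeds; the extra key is dropped

try:
    StrictModel(**record)
    strict_rejected = False
except pydantic.ValidationError:
    strict_rejected = True

print(loose.name, strict_rejected)  # Ruiz True
```

Note that forbidding extras on a model that relies on a "before" validator, like the MeteoriteLanding model above, would also reject the leftover reclat and reclong keys unless the validator removes them.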

Since we nested our lat and long properties inside the GeoLocation model, we need to modify the check_hemisphere function in the models.py module to access that properly.

models.py
...

def check_hemisphere(ml: MeteoriteLanding) -> str:
    location = ''
    if (ml.location.lat > 0):
        location = 'Northern'
    else:
        location = 'Southern'
    if (ml.location.long > 0):
        location = f'{location} & Eastern'
    else:
        location = f'{location} & Western'
    return(location)

...

Now change the file name opened in main() to Meteorite_Landings.json, run the ml_data_analysis.py file, and you will see that we have the same output as before.

Shebang

A “shebang” is a line at the top of your script that defines what interpreter should be used to run the script when treated as a standalone executable. You will often see these used in Python, Perl, Bash, C shell, and a number of other scripting languages. In our case, we want to use the following shebang, which should appear on the first line of our Python3 scripts:

#!/usr/bin/env python3

The env command simply figures out which python3 executable appears first in your PATH, and uses that to execute the script. We usually use that form instead of, e.g., #!/usr/bin/python3.8 because the location of the Python3 executable may differ from machine to machine, whereas the location of env will not.

Next, you also need to make the script executable using the Linux command chmod:

[coe332-vm]$ chmod u+x ml_data_analysis.py

That enables you to call the Python3 code within as a standalone executable without invoking the interpreter on the command line:

[coe332-vm]$ ./ml_data_analysis.py

This is helpful to lock in a Python version (e.g. Python3) for a script that may be executed on multiple different machines or in various environments.
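The same two steps (add the shebang, then set the execute bit) can also be done from Python, which can help clarify what chmod u+x actually changes. This is only a sketch; the helper name write_executable_script and the file name hello.py are made up for illustration.

```python
import stat
import tempfile
from pathlib import Path

def write_executable_script(path: Path, body: str) -> None:
    """Write a Python script with an env-style shebang and set the owner's execute bit."""
    path.write_text('#!/usr/bin/env python3\n' + body)
    # Equivalent to `chmod u+x`: add execute permission for the owner only
    path.chmod(path.stat().st_mode | stat.S_IXUSR)

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / 'hello.py'  # hypothetical file name
    write_executable_script(script, "print('hello')\n")

    first_line = script.read_text().splitlines()[0]
    owner_can_execute = bool(script.stat().st_mode & stat.S_IXUSR)

print(first_line)         # #!/usr/bin/env python3
print(owner_can_execute)  # True
```

On a Linux system, a file prepared this way could then be started as ./hello.py without naming the interpreter on the command line.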

Other Tips

As our Python3 scripts become longer and more complex, we should put more thought into how the different contents of the script are ordered. As a rule of thumb, try to organize the different sections of your Python3 code into this order:

# Shebang

# Imports

# Global variables / constants

# Class definitions

# Function definitions

# Main function definition

# Call to main function
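Put together, a script following this order might look like the sketch below. The constant, class, and function names are invented for illustration and are not part of the course code.

```python
#!/usr/bin/env python3

# Imports
from dataclasses import dataclass

# Global variables / constants
GRAMS_PER_KILOGRAM = 1000

# Class definitions
@dataclass
class Sample:
    name: str
    mass_g: float

# Function definitions
def mass_in_kilograms(sample: Sample) -> float:
    """Convert a sample's mass from grams to kilograms."""
    return sample.mass_g / GRAMS_PER_KILOGRAM

# Main function definition
def main():
    sample = Sample(name='Ruiz', mass_g=21)
    print(f'{sample.name}: {mass_in_kilograms(sample)} kg')

# Call to main function
if __name__ == '__main__':
    main()
```

Reading from top to bottom, everything a function needs (imports, constants, classes) is defined before it is used, and the entry point is easy to find at the end of the file.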

Other general tips for writing code that is easy to read can be found in the PEP 8 Style Guide, including:

  • Use four spaces per indentation level (no tabs)

  • Limit lines to 79 characters, wrap and indent where needed

  • Avoid extraneous whitespace unless it improves readability

  • Be consistent with naming variables and functions

    • Classes are usually CapitalWords

    • Constants are usually ALL_CAPS

    • Functions and variables are usually lowercase_with_underscores

    • Consistency is the key

  • Use functions to improve organization and reduce redundancy

  • Document and comment your code

Note

Beyond individual Python3 scripts, there is a lot more to learn about organizing projects which may consist of many files. We will get into this later in the semester.

Additional Resources