Code formatting, styling and linting#

Here’s one scenario all of us have been in before, most likely countless times: we want to run an analysis and get a script from a colleague that should do exactly what we want. Sounds great, eh? However, upon opening it, our joy starts to diminish drastically: the entire script, many hundreds of lines long, is basically unintelligible. There are no comments, a vast number of apparently random variables, huge blocks of code, etc. Long story short: re-using and/or adapting this script will be a lot of work and maybe not even possible.

The same holds true for code you might find online in repositories or even tutorials, as well as the worst case: your own code from a while ago. You wrote this once and now you don’t understand a single thing that’s going on. How is anyone supposed to understand the content of files like finalanalysis_final_2.py or runthisbefore_monday.py?

It’s just annoying, frustrating and not FAIR. So, isn’t there anything one can do about that?

Say no more and join us for a basic introduction to what’s called code hygiene or linting, i.e. formatting and styling your code to address the challenges outlined above.

However, before we start, here’s a little glimpse into how deeply these principles are baked into Python:

import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Here’s what we will explore in this session:

and we’re going to start with a short practice!

Task for y’all!

Within the materials you downloaded, i.e. materials/CI_CD, you will find a Python script called emd_clust_cont_loss_sim.py (you can also directly download it here). Open it in VScode, go through it and write down any problems you notice and/or might run into. Please make sure to save your notes!

You have 10 min.

After the 10 min are over, please head over to this slido poll and add a few of your notes.

Guidelines for Code Styling#

Style guidelines differ between organisations, languages, and over time. Even the Python style guide, Python Enhancement Proposal 8 (PEP 8), has had numerous revisions since it was released in 2001. You must choose a framework that is best for your purposes, be they for your benefit or the benefit of others. It is also important to remain consistent (and not consistently inconsistent)!

Style guidelines include advice for file naming, variable naming, use of comments, and whitespace and bracketing.

In the following, we will explore some basics of the Python Enhancement Proposal 8 (PEP 8). You can browse the full set of guidelines on the PEP8 website:

File Naming#

The Center for Open Science has some useful suggestions for the naming of files, particularly ensuring that they are readable for both humans and machines. This includes avoiding special characters (@£$%) and spaces, using underscores (“_”) to delimit pieces of information, and using dashes (“-”) to join the words within one piece of information. They also suggest dating or numbering files and avoiding words like FINAL (or FINAL-FINAL). Do any of those things sound familiar?

The dating suggestion is the long format YYYY-MM-DD, followed by the name of the file, and the version number. This results in automatic, chronological order. For example, have a quick look at the difference between these files:

Final analysis @ home.py

2024-02-12_myproject_analysis-descriptive-statistics.py

One clearly has more information and FAIR-ness than the other, with the “workload” of naming the file being actually the same.
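A naming scheme like this is also easy to generate programmatically. The following is a minimal sketch, assuming a hypothetical helper called `build_file_name` (it is not part of any library) that follows the underscore and dash roles described above:

```python
from datetime import date


def build_file_name(project, description, day, extension="py"):
    """Return a file name of the form YYYY-MM-DD_project_description.ext.

    Hypothetical helper for illustration: underscores separate the pieces
    of information, dashes join the words within one piece.
    """
    # Replace spaces within a piece by dashes and lower-case everything.
    parts = [
        piece.strip().lower().replace(" ", "-")
        for piece in (project, description)
    ]
    # date.isoformat() yields the long YYYY-MM-DD format, which sorts
    # chronologically when file names are sorted alphabetically.
    return f"{day.isoformat()}_{'_'.join(parts)}.{extension}"


file_name = build_file_name(
    "myproject", "analysis descriptive statistics", date(2024, 2, 12)
)
print(file_name)
# 2024-02-12_myproject_analysis-descriptive-statistics.py
```

Passing the date explicitly (instead of calling `date.today()` inside the function) keeps the helper predictable and easy to test.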

And here’s another fun fact to motivate you: spaces in file names can create huge problems when it comes to finding and accessing files (e.g. you wouldn’t be able to work with the first file in bash without quoting or escaping its name, as spaces separate arguments there).
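To see why spaces cause trouble, here is a small sketch that uses Python’s standard library module `shlex`, which splits a command line roughly the way a POSIX shell would. The command lines themselves are made up for illustration:

```python
import shlex

# How a POSIX shell would split an unquoted command line into arguments.
unquoted = shlex.split("python Final analysis @ home.py")
print(unquoted)
# ['python', 'Final', 'analysis', '@', 'home.py']

# Quoting keeps the file name together, but is easy to forget.
quoted = shlex.split('python "Final analysis @ home.py"')
print(quoted)
# ['python', 'Final analysis @ home.py']

# A name without spaces needs no special treatment.
safe = shlex.split("python 2024-02-12_myproject_analysis.py")
print(safe)
# ['python', '2024-02-12_myproject_analysis.py']
```

In the unquoted case, the interpreter would be asked to run a file called `Final`, which is not what was intended.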

A note on version control: while it definitely won’t hurt to add some form of versioning to the file name, it is generally recommended to use a dedicated version control system like git and ideally a form of continuous integration, which we will explore a bit in the last session of this workshop.

Variable Naming#

Remember your math classes? There, variables are often unimaginatively named “x”, “y”, and “z”. This brevity is probably because teachers (understandably) do not want to repeatedly write long variable names on the board. In coding, however, you have the freedom to name your variables anything you like. This can be useful for representing the flow of your script.

For example, instead of using

x = pd.read_csv("my_df.csv")
y = x[["my_column_1", "my_column_2"]]

to indicate a loaded DataFrame, i.e. x, and a subset of that DataFrame, i.e. y, you could (and should) use:

df_loaded = pd.read_csv("my_df.csv")
df_loaded_sub = df_loaded[["my_column_1", "my_column_2"]]

to have much more intelligible variable names that also indicate the processing flow.

Furthermore, while you could name variables whatever you want, there are also a few exceptions and guidelines, which we will explore next.

Naming exceptions#

Variable names in Python can contain the alphanumerical characters a-z, A-Z, 0-9 and the underscore _. They must start with a letter or an underscore, not with a digit, and cannot contain spaces.

By convention, variable names start with a lower-case letter, and Class names start with a capital letter.

Thus, variable names like

1DF
DF clean

are not possible and instead should be

df_1
df_clean
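If you are unsure whether a name is syntactically valid, you can ask Python itself. The sketch below uses the built-in string method `str.isidentifier()`; the list of candidate names is just an example:

```python
candidates = ["1DF", "DF clean", "df_1", "df_clean", "_private"]

# str.isidentifier() checks the syntax rules only: allowed characters
# and no leading digit.
validity = {name: name.isidentifier() for name in candidates}

for name, is_valid in validity.items():
    print(f"{name!r}: {is_valid}")
# '1DF': False
# 'DF clean': False
# 'df_1': True
# 'df_clean': True
# '_private': True
```

Note that `isidentifier()` does not know about reserved keywords, so a name like `class` passes this check even though it cannot be used as a variable name.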

In addition, there are a number of Python keywords that cannot be used as variable names. In Python 3, these keywords are:

False, None, True, and, as, assert, async, await, break, class, continue, def, del, elif, else, except, finally, for, from, global, if, import, in, is, lambda, nonlocal, not, or, pass, raise, return, try, while, with, yield
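The exact set of keywords depends on the Python version you are running, so it can be handy to look it up programmatically. Here is a minimal sketch using the standard library module `keyword`; the helper `is_usable_name` is a hypothetical name chosen for this example:

```python
import keyword

# The complete keyword list of the interpreter that runs this code.
print(keyword.kwlist)


def is_usable_name(name):
    """Return True if name is a valid identifier and not a keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_usable_name("class"))     # False: reserved keyword
print(is_usable_name("df_clean"))  # True
print(is_usable_name("print"))     # True in Python 3, see note below
```

In Python 3, `print` is a built-in function rather than a keyword, so it is technically allowed as a variable name. Doing so would hide the built-in function, though, so it is best avoided.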

Naming conventions#

For clarity and readability, choosing a set of naming conventions for your variables is useful. There is a large variety, and some people can be quite vocal about which one is ‘correct’ (pick one that is right for you!). These include:

  • CamelCase

  • lowerCamelCase

  • Underscore_Methods

  • Mixed_Case_with_Underscores

  • lowercase

However, it is important to choose one style and stick to it. For example,

ThisIs Because_SwitchingbetweenDifferentformats is.difficult to read.

is rather suboptimal and hard to parse, while

Where_as if_you stick_to one_style, your_code will_be easier_to_follow!
WhereAs IfYou StickTo OneStyle, YourCodeWillBeEasierToFollow!

is much more intelligible and easier to follow.
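If you inherit code that mixes styles, renaming can be partly automated. The following is a rough sketch of a converter from CamelCase or lowerCamelCase to lowercase with underscores; `to_snake_case` is a hypothetical helper, and it deliberately ignores edge cases such as acronyms or names that already contain underscores:

```python
import re


def to_snake_case(name):
    """Convert a CamelCase or lowerCamelCase name to snake_case.

    Illustrative helper only: consecutive capitals such as "HTTPServer"
    and names that already contain underscores are not handled specially.
    """
    # Insert an underscore before every capital letter that is not at
    # the start of the name, then lower-case the result.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


print(to_snake_case("TargetImage"))     # target_image
print(to_snake_case("lowerCamelCase"))  # lower_camel_case
print(to_snake_case("rating"))          # rating
```

In practice, the rename function of your IDE is usually the safer choice, because it updates all usages of a variable at once.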

Writing Human Readable Code#

Writing clear, well commented, readable and re-usable code benefits not only you but the community (or audience) that you are developing it for. This may be your lab, external collaborators, stakeholders, or you might be writing open source software for global distribution! Whatever scale you work at, readability counts!

Here are a few aspects to consider when making your code easy to read by others.

Line Length#

There is some agreement on the maximum length of lines of code. PEP8 suggests a maximum of 79 characters per line. This means that the lines can easily fit on a screen, and multiple coding windows can be opened side by side. It is argued that if your line is any longer than this, then your function is too complex and should be separated! For example:

columns_select = ['participant_id', 'age', 'group', 'left-handed', 'session', 'TargetImage', 'rating', 'feedback']

would be too long and should be changed to:

columns_select = ['participant_id', 'age', 'group', 'left-handed',
                  'session', 'TargetImage', 'rating', 'feedback']
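Linters perform this check for you, but the idea behind it is simple. Here is a minimal sketch, assuming a hypothetical helper `find_long_lines` that reports every line of a piece of source code exceeding the limit:

```python
MAX_LINE_LENGTH = 79  # PEP 8 limit for lines of code


def find_long_lines(source, max_length=MAX_LINE_LENGTH):
    """Return (line number, length) pairs for lines exceeding max_length."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > max_length
    ]


# Line 1 is the long version from above, lines 2-3 the wrapped version.
example = (
    "columns_select = ['participant_id', 'age', 'group', 'left-handed', "
    "'session', 'TargetImage', 'rating', 'feedback']\n"
    "columns_select = ['participant_id', 'age', 'group', 'left-handed',\n"
    "                  'session', 'TargetImage', 'rating', 'feedback']\n"
)

long_lines = find_long_lines(example)
print(long_lines)  # only line 1 is reported
```

To check a whole script, you could read it with `pathlib.Path("my_script.py").read_text()` (the file name is a placeholder) and pass the result to the function.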

Commenting#

Comments have been described as “Love letters to your future self” by Jon Peirce, creator of PsychoPy. Comments can be single-line or multi-line, block or inline. PEP 8 recommends that block comments be written as complete sentences; earlier versions also asked for two spaces after a sentence-ending period and pointed to a dated style guide (Strunk and White). Fortunately the Elements of Style no longer ‘requires’ an unfair emphasis on masculine pronouns. Inline comments should be used sparingly. Still, keeping clear and concise comments not only allows you to keep track of the decisions you have made, what particular functions do, and what variables are used, it also allows other people to see your thought processes. The syntax for comments varies between programming languages. In Python, a hash sign # is used for single-line, multi-line and inline comments. Triple-quoted strings (''' or """) are often used for block comments as well, although strictly speaking they are string literals rather than comments.

Comment Type     Syntax Example               Use Case
Single-Line      # This is a comment          Brief explanations
Multi-Line       # Line 1\n# Line 2           Longer comments, block explanations
Triple-Quote     """Multi-line comment"""     Block explanations
Docstring        def func(): """Docstring"""  Documentation for classes/functions
Inline           x = 10  # Inline comment     Clarifying specific code lines
Commenting Code  # code to disable            Temporarily disable code lines

Just compare these two examples, at first without comments:

dpgmm = BayesianGaussianMixture(
    n_components=10,
    covariance_type='full',
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,
    random_state=42,
)
dpgmm.fit(representations_during_training[3])
dump(dpgmm, './outputs/models/dpgmm.joblib')

and now with comments:

# The following block defines a Bayesian Gaussian Mixture clustering,
# fits it to the test data and then saves the fitted estimator.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    covariance_type='full',
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,
    random_state=42,
)  # define Bayesian Gaussian Mixture clustering
dpgmm.fit(representations_during_training[3])  # fit it to the test data
dump(dpgmm, './outputs/models/dpgmm.joblib')  # save the fitted estimator

Immediately clearer, eh? While it might seem like extra work, adding a few descriptive comments that explain what’s going on (and why) will make all the difference later on.
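Comments and docstrings complement each other: the docstring tells users what a function does, while comments explain decisions inside the code. Here is a small, self-contained sketch; the function `standardize` is made up for this example and is not taken from the workshop script:

```python
def standardize(values):
    """Return the z-scores of a sequence of numbers.

    Each value has the mean subtracted and is divided by the population
    standard deviation, so the result has mean 0 and standard deviation 1.
    """
    mean = sum(values) / len(values)
    # Population variance (divide by n, not n - 1), chosen here only to
    # keep the example short.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    return [(v - mean) / std for v in values]


# Unlike comments, the docstring is available at runtime, e.g. via help().
print(standardize.__doc__.splitlines()[0])
# Return the z-scores of a sequence of numbers.

z_scores = standardize([1, 2, 3])
print(z_scores)
```

The comment inside the function records why a particular choice was made, which is exactly the kind of information that is hard to reconstruct a year later.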

Indentation#

Python uses whitespace to define code blocks. The whitespace at the beginning of a line is its indentation. Consecutive lines that are indented with the same amount of leading whitespace belong to the same code block and are run together. In other words: indentation is part of the syntax in Python, which is one of the major differences from other programming languages such as Matlab.

In Python, we usually use four spaces per indentation level.

Let’s see what that means:

i_hope_this_is_over_soon = "yes"
n_sections_left = 2

Each such set of statements is called a block, meaning that the lines/variable assignments will be run together.
What happens when we introduce a “wrong” indentation?

i_hope_this_is_over_soon = "yes"
    n_sections_left = 2
  Cell In[3], line 2
    n_sections_left = 2
    ^
IndentationError: unexpected indent

NB: you have most likely seen this already when writing loops, where the lines inside the loop have to have the right indentation, as you will otherwise run into errors. The same holds true for splitting code across lines.
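The following sketch illustrates both points: how indentation decides which lines belong to a loop, and how the error from above can be triggered in a controlled way. It uses the built-in `compile()`, which parses code without running it; the variable names are only examples:

```python
total = 0
for value in [1, 2, 3]:
    total += value  # indented: runs once per loop iteration
print(total)  # not indented: runs once, after the loop
# 6

# compile() parses code without running it, so we can trigger the
# error from the example above and inspect it safely.
broken = 'i_hope_this_is_over_soon = "yes"\n    n_sections_left = 2\n'
try:
    compile(broken, "<example>", "exec")
    error_name = None
except IndentationError as error:
    error_name = type(error).__name__
    print(error_name, "-", error.msg)
# IndentationError - unexpected indent
```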

Code Styling Tools#

As mentioned earlier, there are some automatic tools that you can use to lint your code against existing guidelines. These range from plugins for IDEs and packages that "spell-check" your style to scripts that automatically format your code for you.

You can use one or the other or both, that’s up to you. Just make sure you use at least one. However, we definitely recommend giving the "spell-check" approach a try, as it teaches you a lot of code formatting and styling and thus helps you to incorporate these guidelines into your everyday coding.

Automatic formatting tools#

There are a few common automatic formatting tools for python. Most likely, you will see:

These tools will automatically change your code to adhere to certain guidelines, e.g. by adding spaces around operators and removing unnecessary whitespace. They are also consistent, so that the code you and your collaborators work on will look the same once it has been formatted. They do not change what the code does. This can reduce the time spent making the above changes to the code by hand.
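The claim that formatting does not change what the code does can be checked with Python’s standard library. The sketch below compares the abstract syntax trees of a messy and a tidy version of the same snippet. Please note that the tidy version was written by hand to mimic what a formatter might produce; it is not the actual output of any specific tool:

```python
import ast

# The same logic, once badly formatted and once following PEP 8.
messy = "def add( a,b ):\n  return a+b\nresult=add( 1,2 )\n"
tidy = "def add(a, b):\n    return a + b\n\n\nresult = add(1, 2)\n"

# ast.dump() leaves out line numbers and column offsets by default,
# so two sources that differ only in layout produce the same dump.
same_structure = ast.dump(ast.parse(messy)) == ast.dump(ast.parse(tidy))
print(same_structure)
# True
```

Both versions are parsed into the same structure, which is why a formatter can safely rewrite the layout of your code.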

Spell-checks in IDEs#

As mentioned before, there’s also the option to have “spell-checks” in IDEs, which basically indicate guideline violations like a writing tool would indicate typos. Exactly how and what they mark as a guideline violation depends on the “spell-check” option of your choice.

Usually, these “spell-checks” are implemented through IDE integrations and/or add-ons of the above-mentioned formatting tools (and others), e.g. Black, autopep8 and Flake8. As with the tools themselves, you can set different options and change their configuration concerning what should be considered, and thus indicated as, a guideline violation.

In order to add them to your IDE, just search for the respective integration and/or add-on. VScode is of course no exception to that and supports the mentioned tools and others. Just open the Extensions tab, enter the search term and then click install.


Here’s an example of how the Flake8 extension in VScode would highlight guideline violations in our emd_clust_cont_loss_sim.py script from the beginning.


Summary and task(s)#

Again, there’s much more to linting than what we have discussed here. Just have a look at the guidelines and use one of the mentioned tools and/or IDE integrations to get used to them, and you will adapt your default way of writing code in no time!

However, to give you a first idea of how this would work and combine different aspects of what we have discussed so far, we would like to present you with a set of practice tasks.

Task for y’all!

Remember the script you checked at the beginning? You already went through it once, but now it’s time to check things again based on what we have briefly explored. Just imagine that a colleague gave this script to you, or that you wrote it a year ago, and now you want to bring it up to code (get it?). Please note that you should of course adhere to and integrate the guidelines from line 1 onwards when you start coding. However, as our example is quite long and we don’t have too much time, please limit your clean code endeavors to the first 100 lines.

  1. Version control 1: Create a new GitHub repository called clean_reproducible_code_nm_example and add the initial version of the emd_clust_cont_loss_sim.py script to it.

  2. Commenting: Please add comments to all parts of the code, blocks and in-line

  3. Guidelines: Please adapt the code to adhere to PEP8 guidelines via

    3.1 Use a VScode extension to find and fix guideline violations
    3.2 Use an automatic formatting tool to check if everything is fixed

  4. Version control 2: Please commit the changes you have made and push them to your GitHub repository via a pull request

  5. Code review: Please assign one of the other participants as a code reviewer to your pull request and conduct the code review that has been assigned to you.

You have 40 min.

We included solutions to the task below but please try it yourself first.