Code formatting, styling and linting#
Here’s one scenario all of us have been in before, most likely countless times: we want to run an analysis and get a script from a colleague that should do exactly what we want. Sounds great, eh? However, upon opening it, our joy starts to diminish drastically: the entire script, many hundreds of lines long, is basically unintelligible. There are no comments, a vast number of apparently random variables, huge blocks of code, etc. Long story short: re-using and/or adapting this script will be a lot of work and maybe not even possible.
The same holds true for code you might find online in repositories or even tutorials, as well as the worst case: your own code from a while ago. You wrote this once and now you don’t understand a single thing that’s going on. How is anyone supposed to understand the content of files like finalanalysis_final_2.py or runthisbefore_monday.py?
It’s just annoying, frustrating and not FAIR. So, isn’t there anything one can do about that?
Say no more and join us for a basic introduction into what’s called code hygiene or linting, i.e. formatting and styling your code to address the above-outlined challenges.
However, before we start, here’s a little glimpse into how deep these principles are baked into Python:
import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Here’s what we will explore in this session:
and we’re going to start with a short practice!
Task for y’all!
Within the materials you downloaded, i.e. materials/CI_CD, you will find a Python script called emd_clust_cont_loss_sim.py (you can also directly download it here). Open it in VScode, go through it and write down any problems you notice and/or might run into. Please make sure to save your notes!
You have 10 min.
After the 10 min are over, please head over to this slido poll and add a few of your notes.
Guidelines for Code Styling#
Style guidelines differ between organisations, languages, and over time. Even the Python style guide, Python Enhancement Proposal 8 (PEP 8), has had numerous revisions since it was released in 2001. You must choose a framework that is best for your purposes, be they for your benefit or the benefit of others. It is also important to remain consistent (and not consistently inconsistent)!
Style guidelines include advice for file naming, variable naming, use of comments, and whitespace and bracketing.
In the following, we will explore some basics of the Python Enhancement Proposal 8 (PEP 8). You can browse the full set of guidelines on the PEP8 website.
File Naming#
The Center for Open Science has some useful suggestions for the naming of files, particularly ensuring that they are readable for both humans and machines. This includes avoiding special characters (@£$%), using underscores (“_”) to delimit pieces of information, and using dashes (“-”) instead of spaces to join the words within one piece of information. They also suggest dating or numbering files and avoiding words like FINAL (or FINAL-FINAL). Do any of those things sound familiar?
The dating suggestion is the long format YYYY-MM-DD, followed by the name of the file and the version number. This results in automatic chronological order. For example, have a quick look at the difference between these files:
Final analysis @ home.py
2024-02-12_myproject_analysis-descriptive-statistics.py
One clearly has more information and FAIR-ness than the other, with the “workload” of naming the file being actually the same.
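As a minimal sketch of the naming scheme above, the following snippet composes such file names and shows that plain alphabetical sorting then yields chronological order. The helper `build_file_name` as well as the projects, descriptions and dates are made up for illustration only.

```python
from datetime import date


def build_file_name(day, project, description, extension="py"):
    """Compose '<YYYY-MM-DD>_<project>_<description>.<extension>'.

    Hypothetical helper: underscores delimit the pieces of information,
    dashes join the words within the description.
    """
    description_part = "-".join(description.lower().split())
    return f"{day.isoformat()}_{project}_{description_part}.{extension}"


# made-up example files, deliberately not in chronological order
file_names = [
    build_file_name(date(2024, 2, 12), "myproject",
                    "analysis descriptive statistics"),
    build_file_name(date(2023, 11, 30), "myproject", "preprocessing"),
    build_file_name(date(2024, 1, 5), "myproject", "quality control"),
]

# alphabetical sorting is chronological thanks to the YYYY-MM-DD prefix
sorted_file_names = sorted(file_names)
for file_name in sorted_file_names:
    print(file_name)
```

Because the year comes first and months and days are zero-padded, no date parsing is needed to keep the files in order.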
And here’s another fun fact to motivate you: spaces in file names can create huge problems regarding finding and accessing them (e.g. you wouldn’t be able to work with the first file in bash without quoting or escaping its name, as spaces separate arguments there).
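To get a feeling for the problem without leaving Python, here is a small sketch using `shlex.split` from the standard library, which splits a string using shell-like rules. The `python ...` command lines are only example strings and are never executed.

```python
import shlex

# shlex.split uses shell-like syntax, roughly how bash splits a command
unquoted = shlex.split("python Final analysis @ home.py")
quoted = shlex.split('python "Final analysis @ home.py"')
no_spaces = shlex.split(
    "python 2024-02-12_myproject_analysis-descriptive-statistics.py")

print(unquoted)   # the file name is torn into four separate arguments
print(quoted)     # quoting keeps the file name together
print(no_spaces)  # no spaces, so no quoting is needed
```

The unquoted variant would make the program look for files called `Final`, `analysis`, `@` and `home.py`, none of which exist.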
A note on version control: while it definitely won’t hurt to add some form of versioning to the file name, it is generally recommended to use a dedicated version control system like git and ideally a form of continuous integration, which we will explore a bit in the last session of this workshop.
Variable Naming#
Remember your math classes? There, variables are often unimaginatively named “x”, “y”, and “z”. This brevity is probably because teachers (understandably) do not want to repeatedly write long variable names on the board. In coding, however, you have the freedom to name your variables anything you like. This can be useful for representing the flow of your script.
For example, instead of using
import pandas as pd
x = pd.read_csv("my_df.csv")
y = x[["my_column_1", "my_column_2"]]
to indicate a loaded DataFrame, i.e. x, and a subset of that DataFrame, i.e. y, you could (and should) use:
df_loaded = pd.read_csv("my_df.csv")
df_loaded_sub = df_loaded[["my_column_1", "my_column_2"]]
to have much more intelligible variable names that also indicate the processing flow.
Furthermore, while you could name variables whatever you want, there are also a few exceptions and guidelines, which we will explore next.
Naming exceptions#
Variable names in Python can contain the alphanumerical characters a-z, A-Z, 0-9 and the underscore _ (Python 3 additionally accepts many non-ASCII letters). Normal variable names must start with a letter or an underscore, not with a digit.
By convention, variable names start with a lower-case letter, and Class names start with a capital letter.
Thus, variable names like
1DF
DF clean
are not possible and instead should be
df_1
df_clean
In addition, there are a number of Python keywords that cannot be used as variable names. In Python 3, these keywords are:
False, None, True, and, as, assert, async, await, break, class, continue, def, del, elif, else, except, finally, for, from, global, if, import, in, is, lambda, nonlocal, not, or, pass, raise, return, try, while, with, yield
(exec and print were keywords in Python 2 only.)
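If you are unsure whether a name is allowed, you can let Python check it for you. The following sketch combines `str.isidentifier()` with the standard library module `keyword`; the helper `is_valid_variable_name` and the candidate names are only illustrative.

```python
import keyword


def is_valid_variable_name(name):
    """Return True if `name` is syntactically allowed as a variable name."""
    return name.isidentifier() and not keyword.iskeyword(name)


# candidate names, mostly taken from the examples in this section
candidates = ["1DF", "DF clean", "df_1", "df_clean", "class", "_hidden"]
name_checks = {name: is_valid_variable_name(name) for name in candidates}

for name, is_valid in name_checks.items():
    print(f"{name!r}: {'ok' if is_valid else 'not allowed'}")

# the full list of reserved keywords of the running Python version
print(keyword.kwlist)
```

Note that this only checks what is syntactically possible, not whether a name is a good choice.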
Naming conventions#
For clarity and readability, choosing a set of naming conventions for your variables is useful. There is a large variety, and some people can be quite vocal about which one is ‘correct’ (pick one that is right for you!). These include:
CamelCase
lowerCamelCase
Underscore_Methods
Mixed_Case_with_Underscores
lowercase
However, it is important to choose one style and stick to it. For example,
ThisIs Because_SwitchingbetweenDifferentformats is.difficult to read.
is rather suboptimal and hard to parse, while
Where_as if_you stick_to one_style, your_code will_be easier_to_follow!
WhereAs IfYou StickTo OneStyle, YourCodeWillBeEasierToFollow!
is much more intelligible and easier to follow.
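For Python specifically, PEP 8 recommends lowercase words separated by underscores for variables and functions, CapWords for classes, and upper case with underscores for constants. Here is a minimal sketch of how that looks in practice; all names (`MAX_RATING`, `RatingSession`, etc.) are made up for illustration.

```python
MAX_RATING = 7  # constant: upper case with underscores


class RatingSession:  # class: CapWords
    """Collect the ratings of one participant (hypothetical example)."""

    def __init__(self, participant_id):
        self.participant_id = participant_id  # variable: lowercase
        self.ratings = []

    def add_rating(self, rating):  # method: lowercase_with_underscores
        if not 1 <= rating <= MAX_RATING:
            raise ValueError(f"rating must be between 1 and {MAX_RATING}")
        self.ratings.append(rating)

    def mean_rating(self):
        return sum(self.ratings) / len(self.ratings)


first_session = RatingSession(participant_id="sub-01")
for rating in (3, 5, 7):
    first_session.add_rating(rating)
print(first_session.mean_rating())
```

Once the convention is fixed, the casing alone tells a reader whether a name refers to a constant, a class, or a function.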
Writing Human Readable Code#
Writing clear, well commented, readable and re-usable code benefits not only you but the community (or audience) that you are developing it for. This may be your lab, external collaborators, stakeholders, or you might be writing open source software for global distribution! Whatever scale you work at, readability counts!
Here are a few aspects to consider when making your code easy to read by others.
Line Length#
There is some agreement on the length of lines of code. PEP8 suggests a maximum of 79 characters per line. This means that the lines can easily fit on a screen, and multiple coding windows can be opened side by side. It is argued that if your line is any longer than this, then your function is too complex and should be separated! For example:
columns_select = ['participant_id', 'age', 'group', 'left-handed', 'session', 'TargetImage', 'rating', 'feedback']
would be too long and should be changed to:
columns_select = ['participant_id', 'age', 'group', 'left-handed', 'session',
                  'TargetImage', 'rating', 'feedback']
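Counting characters by hand is tedious, so here is a small sketch that flags over-long lines in a piece of source code. The function `find_long_lines` is a hypothetical helper, not part of any linting tool, and the two strings simply hold the example from above as text.

```python
def find_long_lines(source, max_length=79):
    """Return (line number, length) for every line longer than max_length."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > max_length
    ]


one_line = (
    "columns_select = ['participant_id', 'age', 'group', 'left-handed', "
    "'session', 'TargetImage', 'rating', 'feedback']"
)
two_lines = (
    "columns_select = ['participant_id', 'age', 'group', 'left-handed', "
    "'session',\n"
    "                  'TargetImage', 'rating', 'feedback']"
)

print(find_long_lines(one_line))   # the single long line is flagged
print(find_long_lines(two_lines))  # nothing is flagged
```

Splitting inside the square brackets works because Python lets a statement continue over several lines as long as a bracket is still open.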
Commenting#
Comments have been described as “Love letters to your future self” by Jon Peirce, creator of PsychoPy. Comments can be single/multi-line, blocked or inline. The PEP8 guidelines have firm suggestions that block comments should be full sentences, have two spaces following a period, and follow a dated style guide (Strunk and White). Fortunately the Elements of Style no longer ‘requires’ an unfair emphasis on masculine pronouns. Whereas inline comments should be used sparingly, keeping clear and concise comments not only allows you to keep track of the decisions you have made, what particular functions do, and what variables are used, it also allows other people to see your thought processes. The syntax for comments varies with programming languages.
In Python, a hash sign # is used for single/multi-line and inline comments and ''' for blocked comments.
Comment Type | Syntax Example | Use Case |
---|---|---|
Single-Line | | Brief explanations |
Multi-Line | | Longer comments, block explanations |
Triple-Quote | | Block explanations |
Docstring | | Documentation for classes/functions |
Inline | | Clarifying specific code lines |
Commenting Code | | Temporarily disable code lines |
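A short sketch of how the comment types listed in the table above can look in actual code. The function `mean_age` and its content are made up purely to have something to comment on.

```python
def mean_age(ages):
    """Return the mean of a list of ages.

    This docstring documents the function and is shown by help(mean_age).
    """
    # Single-line comment: brief explanation of the next step.
    total = sum(ages)

    # Multi-line comment: every line starts with a hash sign and the
    # lines together form a longer block explanation.
    n_ages = len(ages)

    '''
    Triple-quoted block: strictly speaking an unused string literal,
    but sometimes used like a block comment.
    '''
    return total / n_ages  # inline comment: clarifies this specific line


# Commenting code: the following line is temporarily disabled.
# print(mean_age([]))
print(mean_age([20, 30, 40]))
```

Only the docstring is kept by Python as documentation (in `mean_age.__doc__`); everything after a hash sign is ignored when the code runs.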
Just compare these two examples, at first without comments:
dpgmm = BayesianGaussianMixture(
    n_components=10, covariance_type='full',
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1, random_state=42)
dpgmm.fit(representations_during_training[3])
dump(dpgmm, './outputs/models/dpgmm.joblib')
and now with comments:
# The subsequent block defines a Bayesian Gaussian Mixture clustering,
# fits it to the test data and then saves the fitted estimator.
dpgmm = BayesianGaussianMixture(
    n_components=10, covariance_type='full',
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,
    random_state=42)  # define Bayesian Gaussian Mixture clustering
dpgmm.fit(representations_during_training[3])  # fit it to the test data
dump(dpgmm, './outputs/models/dpgmm.joblib')  # save the fitted estimator
Immediately more clear, eh? While it might seem like extra workload, adding a few descriptive comments to describe what’s going on (and why) will make all the difference later on.
Indentation#
Python uses whitespace to define code blocks. The whitespace at the beginning of a line is the indentation. This means that lines that are indented with the same number of leading spaces or tabs belong to the same code block and are run together. In other words: the indentation is part of the syntax in Python and one of the major distinctions from other programming languages such as Matlab.
Usually in Python we use four spaces for the indentation of code blocks.
Let’s see what that means:
i_hope_this_is_over_soon = "yes"
n_sections_left = 2
Each such set of statements is called a block, meaning that the lines/variable assignments will be run together.
What happens when we introduce a “wrong” indentation?
i_hope_this_is_over_soon = "yes"
    n_sections_left = 2

  Cell In[3], line 2
    n_sections_left = 2
    ^
IndentationError: unexpected indent
NB: you have most likely seen this already when working with any form of loop: the parts inside the loop(s) have to have the right indentation, as you will otherwise run into errors. The same holds true for splitting code across lines.
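To make the loop case concrete, here is a small sketch that lets Python check two versions of the same loop. The source code is stored as text and passed to the built-in `compile`, so that the broken variant can be inspected without crashing the example; the helper `has_valid_indentation` and all variable names are made up for illustration.

```python
# Source code is stored as text here so that the broken variant can be
# inspected without crashing this example.
correct_source = (
    "for section in range(2):\n"
    "    sections_left = 2 - section\n"
    "    print(sections_left)\n"
)
broken_source = (
    "for section in range(2):\n"
    "sections_left = 2 - section\n"  # loop body is not indented
)


def has_valid_indentation(source):
    """Return True if `source` compiles without an IndentationError."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError:
        return False
    return True


print(has_valid_indentation(correct_source))
print(has_valid_indentation(broken_source))

# Splitting code across lines: inside brackets, a statement may continue
# on the next line and the continuation is indented for readability.
total_sections = sum([
    1,
    2,
])
```

Inside open brackets the amount of indentation is not enforced by Python, which is why style guides such as PEP 8 give recommendations for it.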
Code Styling Tools#
As mentioned earlier, there are some automatic tools that you can use to lint your code to existing guidelines. These range from plugins for IDEs and packages that "spell-check" your style to scripts that automatically lint for you.
You can use one or the other or both, that’s up to you. Just make sure you use at least one.
However, we definitely recommend giving the "spell-check" approach a try, as it teaches you a lot about code formatting and styling and thus helps you to incorporate these guidelines into your everyday coding.
Automatic formatting tools#
There are a few common automatic formatting tools for Python. Most likely, you will see:
These tools will automatically change your code to adhere to certain guidelines, like spaces around operators and removing unnecessary whitespace. They are also consistent, so that the code that you and your collaborators work on will look the same once it is formatted. They do not change what the code does. This can reduce the time spent making the above changes to the code.
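The claim that formatting changes the layout but not the behaviour can be illustrated with the standard library module `ast`: two pieces of source code that differ only in whitespace are parsed into the same abstract syntax tree. In this sketch, the "formatted" string was written by hand and is not the output of any particular tool, and `same_syntax_tree` is a hypothetical helper.

```python
import ast

unformatted = "x=1+2\nresult = [ x,x *2 ]\n"
hand_formatted = "x = 1 + 2\nresult = [x, x * 2]\n"
changed_logic = "x = 1 + 3\nresult = [x, x * 2]\n"


def same_syntax_tree(source_a, source_b):
    """Compare the abstract syntax trees, which ignore the code layout."""
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


# only whitespace differs, so the trees are identical
print(same_syntax_tree(unformatted, hand_formatted))
# a changed number is a change in logic, so the trees differ
print(same_syntax_tree(unformatted, changed_logic))
```

This is also a handy sanity check after a large automatic reformatting of your own scripts.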
Spell-checks in IDEs#
As mentioned before, there’s also the option to have “spell-checks” in IDEs, which basically indicate guideline violations like a writing tool would indicate typos. Exactly how and what they mark as a guideline violation depends on the “spell-check” option of your choice.
Usually, these “spell-checks” are implemented through IDE integrations and/or add-ons of the above mentioned formatting tools (and others), e.g. Black, Autopep8 and Flake8. As with the tools themselves, you can set different options and change their configuration concerning what should be considered a guideline violation and thus be indicated.
In order to add them to your IDE, just search for the respective integration and/or add-on. VScode is of course no exception to that and supports the mentioned tools and others. Just open the Extensions tab, enter the search term and then click install.
Here’s an example of how the Flake8 extension in VScode would highlight guideline violations in our emd_clust_cont_loss_sim.py script from the beginning.
[Screenshot: Flake8 extension in VScode highlighting guideline violations in emd_clust_cont_loss_sim.py]
Summary and task(s)#
Again, there’s much more to linting than what we have discussed here. Just have a look at the mentioned tools and/or IDE integrations and use one of them to get used to it, and you will automatically adapt your code writing defaults in no time!
However, to give you a first idea of how this would work and combine different aspects of what we have discussed so far, we would like to present you with a set of practice tasks.
Task for y’all!
Remember the script you checked at the beginning? You already went through it once, but now it’s time to check things again based on the things we have briefly explored. Just imagine that a colleague gave this script to you or you wrote it a year ago and now you want to bring it up to code (get it?). Please note that you should of course adhere to and integrate the guidelines from line 1 onwards when you start coding. However, as our example is quite long and we don’t have too much time, please limit your clean code endeavors to the first 100 lines.
1. Version control 1: Create a new GitHub repository called clean_reproducible_code_nm_example and add the initial version of the emd_clust_cont_loss_sim.py script to it.
2. Commenting: Please add comments to all parts of the code, blocks and in-line.
3. Guidelines: Please adapt the code to adhere to PEP8 guidelines via
   3.1 Use a VScode extension to find and fix guideline violations
   3.2 Use an automatic formatting tool to check if everything is fixed
4. Version control 2: Please commit the changes you have made and push them to your GitHub repository via a pull request.
5. Code review: Please assign one of the other participants as a code reviewer to your pull request and conduct the review for your assigned code review.
You have 40 min.
We included solutions to the task below but please try it yourself first.
Example solutions
Here’s how you could implement the above outlined tasks. However, there are of course many different solutions to some of the tasks and we just provided some examples.
1. Version control 1:
   Go to GitHub and click New Repository on the left. Assign it a name (as outlined), add a README and choose a license.
   If everything worked out, you should see your new repo.
   Clone the repository to your local machine (don’t forget to exchange your_user_name with your respective GitHub username) and cd into it:
       git clone https://github.com/your_user_name/clean_reproducible_code_nm_example
       cd clean_reproducible_code_nm_example
   Copy/move the original emd_clust_cont_loss_sim.py script to the newly created local repository (don’t forget to exchange /path/to/script with the path where you stored the emd_clust_cont_loss_sim.py script on your machine):
       mv /path/to/script .
   Commit the changes to your repository via:
       git add emd_clust_cont_loss_sim.py
       git commit -m "add initial version of script"
   and then push the changes to your GitHub repository:
       git push
   If everything worked out, you should now see the script in your GitHub repository.
2. Commenting:
   Apply the aspects we briefly talked about in this session.
3. Guidelines:
   3.1 Install the Flake8 extension and go through the code searching for marked problems and address them.
   3.2 Run autopep8 and flake8 to fix and check the code.
   Using autopep8 you can address some formatting problems automatically:
       autopep8 /path/to/script --in-place
   After that, you can use flake8 to automatically check for all still existing errors:
       flake8 /path/to/script
   Most likely, you will still get a lot of errors, especially if you only went through the first 100 lines as suggested. However, that’s no problem for now and we will address this in a later session.
4. Version control 2: Please stage and commit the changes you have made and push them to your GitHub repository via
       git add emd_clust_cont_loss_sim.py
       git commit -m "apply code formatting"
       git push
   If everything worked out, you should see the updated file in your GitHub repository.
   Additionally, due to version control, you can check exactly what changed.
Again, some of these tasks could be addressed in a different manner. The solutions provided here are just examples.