New tool: countparticles

In my previous post, I wrote that I would not bother writing a Python command-line program to simply count particles in each class in a run_data.star file from RELION, because it is straightforward to do with AWK (and it probably runs faster on large files). I changed my mind and made a new tool (also installable with pip and citable with doi:10.5281/zenodo.4139778).

I still love the AWK solution for many reasons: it is indeed straightforward, it runs fast even on large files, and there was no boilerplate code to write to handle file input. Most importantly, it consists of only one file: drop it anywhere in your $PATH, make it executable, and it will work on any system that has AWK, which means everywhere, since AWK is “mandatory” in the sense of IEEE Std 1003.1-2008. The only problem is that star files from RELION change between versions, so the data relevant for this counting is not always stored in the same column. And AWK can only refer to a column by index, not by name. Changing the column number in the AWK script is trivial, but when the script produces nonsensical output or no output at all while I am trying to make sense of data, this kind of limitation gets frustrating quickly.
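To make the index-versus-name problem concrete, here is a minimal sketch in plain Python (standard library only). It is not how countparticles works. It only illustrates that a star file loop declares its column labels in a header, so a label such as `_rlnClassNumber` can be looked up by name and mapped to whatever position it has in that particular file. The function name `count_by_label` and the sample rows are hypothetical, and the parser assumes whitespace-separated values with no quoted strings.

```python
from collections import Counter


def count_by_label(lines, label="_rlnClassNumber"):
    """Count rows per value of `label`, found by name in the loop header.

    Simplified sketch: assumes whitespace-separated values without quoted
    strings, and resets the column map at every data block or loop.
    """
    columns = {}        # label -> zero-based position, in order of declaration
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        if line.startswith("data_") or line == "loop_":
            columns = {}                  # a new block or loop starts
            continue
        if line.startswith("_"):
            # Header line such as "_rlnClassNumber #2": keep the label only.
            columns[line.split()[0]] = len(columns)
            continue
        fields = line.split()
        index = columns.get(label)
        if index is not None and index < len(fields):
            counts[fields[index]] += 1    # values stay strings
    return counts


# Made-up example: the class number is the second column here.
example = [
    "data_",
    "loop_",
    "_rlnCoordinateX #1",
    "_rlnClassNumber #2",
    "10.0 1",
    "20.0 2",
    "30.0 2",
]
print(count_by_label(example))
```

The same call keeps working if the class number moves to another column, which is exactly what the hard-coded index in the AWK script cannot do.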

The obvious solution was to write a Python program, because the starfile library produces pandas DataFrames, and these in turn can refer to a column by its name as defined in the star file. This works regardless of the column’s numerical index, so it doesn’t break when a new version of RELION produces star files in which the relevant column has a new index. The downside is having to manage a Python installation… Luckily conda makes this manageable, but it now seems like a way over-engineered solution to the simple problem of counting lines by groups in a file… so I also added an option to display a bar graph representing the counts.
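The core of the idea fits in a few lines. The following is a minimal sketch, not the actual code of countparticles. It assumes that `starfile.read()` returns either a single DataFrame or a dictionary of DataFrames with a `particles` entry (as for star files with a separate optics block), and that the class column is exposed as `rlnClassNumber`, without the leading underscore. The function names `read_particles` and `count_by_class` are hypothetical.

```python
import pandas as pd


def read_particles(path):
    """Load the particle table from a star file (assumed layout, see above)."""
    import starfile  # third-party; imported here so counting works without it

    data = starfile.read(path)
    if isinstance(data, dict):
        # Assumption: multi-block files keep particles under this key.
        data = data["particles"]
    return data


def count_by_class(particles: pd.DataFrame,
                   column: str = "rlnClassNumber") -> pd.Series:
    """Return the number of rows per class, sorted by class number."""
    # The column is selected by name, so its position in the file is irrelevant.
    return particles[column].value_counts().sort_index()


# Made-up table standing in for the contents of a star file.
example = pd.DataFrame({
    "rlnCoordinateX": [10.0, 20.0, 30.0, 40.0],
    "rlnClassNumber": [1, 2, 2, 3],
})
print(count_by_class(example))
```

On a real file this would be called as `count_by_class(read_particles("run_data.star"))`. For the bar graph, one option is `count_by_class(...).plot.bar()`, which needs matplotlib installed. That is only one way to draw it, and not necessarily the one the tool uses.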