The Footprint Of Bioinformatics

95 kg of CO2?

To simulate a virus for 0.1 seconds!

What is the environmental impact of Bioinformatics?

A good rule of thumb is that about 20-40% of a data center's energy is needed for the servers themselves, about 40% of its energy is used for cooling, and about 20% of its downtime is due to overheating.

 

What causes the footprint?


On the one hand, there is the electricity needed to run our code and our servers. This is also what the numbers above refer to: doi.org/10.1093/molbev/msac034

Computer hardware footprints

However, the other aspect is what you can feel with your hands, the hardware: the hard drives, processors, keyboards and screens …

Which of the two is more significant depends on the product. For small devices such as mobile phones, the hardware aspect can be worse, especially due to the shipping of components around the world.

What can I do?

1. Measure your own footprint

To become aware of your own footprint and to have a starting point, quantifying your impact is helpful.

Head over to green-algorithms and find an easy-to-use tool to get started!

You just need to enter information such as the runtime, the number of cores, where you run your jobs, and a few other details.
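To make those inputs concrete, here is a minimal sketch of the kind of estimate such calculators perform. All constants below (per-core power, memory power, PUE, grid carbon intensity) are illustrative placeholders, not the calibrated values the real green-algorithms tool uses:

```python
# Rough sketch of the kind of estimate carbon calculators perform.
# All constants are illustrative placeholders, not calibrated values.

def estimate_footprint_kg(runtime_h, n_cores,
                          power_per_core_w=12.0,      # assumed core power draw
                          memory_gb=16.0,             # memory requested
                          power_per_gb_w=0.37,        # assumed memory power draw
                          pue=1.67,                   # data-centre overhead (cooling etc.)
                          intensity_g_per_kwh=475.0): # assumed grid carbon intensity
    """Return an order-of-magnitude estimate of emissions in kg CO2e."""
    # Power drawn by cores and memory, scaled by the data centre's
    # power usage effectiveness (PUE) to account for cooling overhead.
    power_w = (n_cores * power_per_core_w + memory_gb * power_per_gb_w) * pue
    energy_kwh = power_w * runtime_h / 1000.0
    # Convert energy into emissions via the grid's carbon intensity.
    return energy_kwh * intensity_g_per_kwh / 1000.0

print(f"{estimate_footprint_kg(runtime_h=24, n_cores=16):.2f} kg CO2e")
```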

2. Be careful

Indeed, such tools are great for estimating your footprint. However, take the results with a grain of salt: unless you know exactly how the electricity you consume is produced, and the actual running time of each and every process, you will not arrive at precise numbers.

Also, do not forget that estimating the footprint of your hardware is a challenge in itself …

3. Take action

You cannot avoid a footprint entirely. Thus, we have listed below some helpful examples and steps that should offer some inspiration:

Concrete Examples

Lessons From Neuroimaging

Souter and colleagues published the paper “Ten recommendations for reducing the carbon footprint of research computing in human neuroimaging”, in which they give some general advice on how to make computing more sustainable. Here are our top three insights:

1. You do not need to store all the data

They found that up to 96% of fMRIPrep output data can be considered unnecessary for subsequent analysis: “Only 0.23 GB, 4.0% of the total size, corresponded to files intended for use in subsequent statistical analysis (see Fig. 3).” On their GitHub, they provide source code to remove this data.

Obviously, be cautious with deleting data!

Pro Tip: Regularly remove files that you do not need, and plan ahead where, and for how long, you will store files. Involve everyone, including technicians, to make sure your plan is communicated well.
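Their actual cleanup code lives on their GitHub; the sketch below is only a generic, hypothetical illustration of the idea: define a keep-list of file patterns needed for statistical analysis, list everything else as removable, and inspect the result before deleting anything.

```python
# Hypothetical sketch (not the authors' published script): list output
# files that do NOT match a keep-list of glob patterns, so they can be
# reviewed and then removed.
from pathlib import Path

# Patterns for files needed in subsequent statistical analysis;
# adjust these to your own pipeline before trusting the result!
KEEP_PATTERNS = ["*desc-preproc_bold.nii.gz", "*confounds*.tsv"]

def find_removable(output_dir, keep_patterns=KEEP_PATTERNS):
    root = Path(output_dir)
    keep = set()
    for pattern in keep_patterns:
        keep.update(root.rglob(pattern))
    return [f for f in root.rglob("*") if f.is_file() and f not in keep]

# Dry run: print candidates instead of deleting them.
for f in find_removable("derivatives/fmriprep"):
    print(f)
```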

2. Make use of existing preprocessed data

Acquiring new data not only causes a footprint through the acquisition itself; that data also has to be preprocessed …

It is becoming more common to share data online, and a few projects dedicate themselves to collecting all available data. Definitely check out whether this is also the case for your field of research!

In their publication, Table 1 provides an overview of open-access neuroimaging projects and data repositories. Of note, there are also general-purpose repositories such as the OSF.


3. Reduce preprocessing and analysis where possible

Often, acquired data is subsequently used for various purposes. This also means that in “standard” settings, all of this data is preprocessed, for example by smoothing or denoising. However, if you know what you will do downstream, you can save a lot of time by preprocessing only according to your needs.

In a subsequent paper they are about to publish, “Measuring and reducing the carbon footprint of fMRI preprocessing in fMRIPrep”, they share some concrete examples.

They worked with an application called fMRIPrep. Although it comes with a certain variability in pipelines, there is a lot that can be optimized if you have a clear downstream goal.

Examples from the paper:

One tool, FreeSurfer, reconstructs surfaces from multiple images, including different basic MRI pulse sequences (e.g., T1w & T2w). If it is not needed, disabling it can reduce running times and emissions by more than 45%. Enabling the ‘sloppy’ mode (a preprocessing mode normally used for testing pipelines, which uses low-quality registration) reduced emissions and duration by more than 40%. Finally, as they describe: “Low memory mode, which attempts to reduce memory usage at the cost of increasing disk usage in the working directory, reduced both emissions and duration by 6%. This had no impact on preprocessing performance, producing identical output to baseline”.
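Translated into an actual call, the options might be combined as below. This is only a sketch: the flag names (--fs-no-reconall, --sloppy, --low-mem) follow fMRIPrep's documentation, but verify them against the version you run.

```python
# Sketch of an fMRIPrep invocation using the options discussed above.
# "bids_dir" and "out_dir" are placeholders for your actual paths.
import subprocess

cmd = [
    "fmriprep", "bids_dir", "out_dir", "participant",
    "--fs-no-reconall",  # skip FreeSurfer surface reconstruction (>45% saving reported)
    "--low-mem",         # trade disk usage for memory (~6% saving reported)
    # "--sloppy",        # low-quality registration; intended for testing only!
]
subprocess.run(cmd, check=True)
```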

You Might NOT Need All Your Data

Deep learning is a powerful method; however, it relies heavily on vast amounts of data.

Interestingly, in a study on pest recognition for crops, a method called Embedding Range Judgment (ERJ), which operates in the feature space, was proposed and tested through numerous comparative experiments. The results indicated that, for some recognition tasks, a smaller quantity of high-quality data can achieve performance similar to using all available data. Notably, good performance was already reached with just 40% of the data, with only marginal enhancements seen thereafter (40%: 0.87; 50%: 0.91).

You can have a look yourself:

Toward Sustainability: Trade-Off Between Data Quality and Quantity in Crop Pest Recognition
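The paper's ERJ method works in the embedding space of a trained model. As a loose, hypothetical illustration of the general idea (not the authors' algorithm), one could rank samples by how close they sit to their class centroid in feature space and keep only the closest fraction:

```python
# Loose illustration of quality-based data selection (NOT the paper's
# exact ERJ algorithm): keep the fraction of samples whose embeddings
# lie closest to their class centroid.
import numpy as np

def select_subset(embeddings, labels, keep_fraction=0.4):
    keep_idx = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        n_keep = max(1, int(keep_fraction * len(idx)))
        keep_idx.extend(idx[np.argsort(dists)[:n_keep]])
    return np.array(sorted(keep_idx))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))     # stand-in embeddings
y = rng.integers(0, 10, size=1000)  # stand-in labels
print(len(select_subset(X, y)), "of 1000 samples kept")
```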

Tips & Tricks

Tools And Initiatives

Existing tools for checking the footprint of your computational work:

Python Packages: Carbontracker, CodeCarbon (see the sketch below), Cumulator, Tracarbon, Experiment Impact Tracker
Checking the impact of cloud services such as AWS: Cloud Carbon Footprint
R Packages: Carbonr
However, there are also more specific trackers, such as the one built into fMRIPrep for neuroimaging.
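As an example of how lightweight these trackers can be, here is a minimal CodeCarbon sketch; see the package documentation for configuration details, and note the workload function is just a placeholder:

```python
# Minimal CodeCarbon usage sketch: the tracker estimates emissions from
# measured/estimated power draw and your local grid's carbon intensity.
from codecarbon import EmissionsTracker

def run_my_analysis():
    # placeholder workload standing in for your real pipeline
    return sum(i * i for i in range(10_000_000))

tracker = EmissionsTracker()
tracker.start()
run_my_analysis()
emissions_kg = tracker.stop()  # returns estimated kg CO2e
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2e")
```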

There is a Python package called CATS (the Climate Aware Task Scheduler) that schedules jobs so that emissions are minimized. Of course, this only works with information from electricity grid providers (and the corresponding forecasts).
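CATS itself is a ready-made tool; below is just a generic sketch of the underlying idea, with a made-up intensity forecast: slide the job's duration over the forecast and start where the average carbon intensity is lowest.

```python
# Generic sketch of carbon-aware scheduling (not the CATS implementation):
# given an hourly forecast of grid carbon intensity, pick the start hour
# that minimizes the job's average intensity.
forecast = [320, 300, 250, 180, 150, 170, 240, 310]  # made-up gCO2e/kWh values
job_hours = 3

def mean_intensity(start):
    window = forecast[start:start + job_hours]
    return sum(window) / len(window)

best_start = min(range(len(forecast) - job_hours + 1), key=mean_intensity)
print(f"Start at hour {best_start}, avg {mean_intensity(best_start):.0f} gCO2e/kWh")
```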

Visit “the software sustainability network”! Whether you are seeking resources or want to assess the sustainability of your code, they will be able to help you.

There is an initiative that calls for more transparent and comprehensive sharing of data about our footprints.

In short, they propose defining the scope of the analysis, collecting the relevant emission sources, and estimating the footprint of each source. Everything from experiments, infrastructure, commuting, procurement, and waste disposal should be included. These data should then be made available as a table of carbon emissions.

On this website, you will find an example table, and if a template would help you, you can find it here.

Henderson et al. propose an interesting idea: creating leaderboards for both performance and efficiency. They also suggest that “underlying frameworks should by default use the most energy-efficient settings possible”. In addition, they offer the experiment-impact-tracker, which collects 13 parameters relevant for pinpointing emissions through a simple change to your code.
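According to the project's README, instrumenting an experiment takes only a few lines; treat the snippet below as a sketch and verify the API against the current release:

```python
# Sketch following the experiment-impact-tracker README; check the
# current release before relying on exact names and signatures.
from experiment_impact_tracker.compute_tracker import ImpactTracker

def train_model():
    # placeholder standing in for your real experiment
    return sum(range(1_000_000))

tracker = ImpactTracker("impact_logs/")  # directory for collected metrics
tracker.launch_impact_monitor()          # starts the background monitor
train_model()
```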

Additional Considerations

Whether a faster language (or faster code) is also more efficient, and thus greener, depends. If the power draw were constant, then yes: shorter runtime would mean less energy. Although that is a good rule of thumb, optimizing your code does not always make it “greener”, and there is no single “best” coding language, hence no “greenest” one. C, C++, and Rust are heavily optimized and rather efficient, but the comparison is not perfect: Python may not score as well in such benchmarks, yet these assessments often leave out the use of packages that call compiled code under the hood.
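To see why the “language” framing is too coarse, compare the same computation in pure Python and via the NumPy package; runtime here is only a proxy for energy, which additionally depends on power draw:

```python
# Runtime comparison of the same computation in pure Python and NumPy.
# Runtime is only a proxy for energy, which also depends on power draw.
import time
import numpy as np

N = 1_000_000

t0 = time.perf_counter()
total = sum(i * i for i in range(N))  # interpreted Python loop
t_python = time.perf_counter() - t0

a = np.arange(N, dtype=np.int64)
t0 = time.perf_counter()
total_np = int(np.dot(a, a))          # vectorized, compiled code
t_numpy = time.perf_counter() - t0

assert total == total_np
print(f"pure Python: {t_python:.3f} s, NumPy: {t_numpy:.4f} s")
```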

However, there is no real reason not to make your code more efficient.

You can have a look at that or this publication to dive a bit deeper into the intricacies.

The source of electricity significantly impacts the carbon footprint of bioinformatics research. Opting for renewable sources like hydroelectric power can drastically reduce environmental impact compared to coal-based energy. Note that differences can be more than 10-fold!
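As a back-of-the-envelope illustration, using approximate lifecycle intensity medians often cited from the IPCC (around 820 gCO2e/kWh for coal and 24 for hydro; treat these as rough assumptions):

```python
# Back-of-the-envelope: the same energy use with two electricity sources.
# Intensities are approximate lifecycle medians often cited from the IPCC.
energy_kwh = 100.0
for source, g_per_kwh in [("coal", 820.0), ("hydro", 24.0)]:
    print(f"{source}: {energy_kwh * g_per_kwh / 1000:.1f} kg CO2e")
# coal: 82.0 kg vs hydro: 2.4 kg, a more than 30-fold difference
```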

The ambient temperature of a researcher’s country plays a vital role in determining energy consumption. In hotter climates, cooling down computational systems becomes more energy-intensive, as air conditioning or cooling systems are often required to maintain optimal operating temperatures.

The choice between central processing units (CPUs) and graphics processing units (GPUs) can influence energy consumption and carbon emissions in bioinformatics research. While CPUs are versatile and suitable for a wide range of tasks, GPUs excel at parallel processing, offering higher computational performance for certain bioinformatics applications. It is hard to generalize anything, but here is one study for you.

Choosing the right carbon tracker is crucial. There have been reports of researchers experiencing issues; for example, measured impacts varied strongly during preprocessing for no intelligible reason. In that case, the issue was apparently a lack of hardware isolation: the tracker captured energy use from other jobs on the same node.

We can no longer live without modern computing. However, a single Google search uses as much energy as lighting a bulb for about 25 seconds.

Ease as a design principle often poses a challenge for sustainability. In other words: if it is there = it is just a click = it is fast = I will use it.

With reference to research, training a language model with a number of parameters similar to our beloved ChatGPT (GPT-3) caused about 15,000 kg of CO2e. For AlphaFold, training allegedly took 4 tons of CO2e.

Thus, let us become aware that our actions often cause a (significant) footprint without us realizing it, and that opportunities for change exist and are impactful indeed.

Further Blog Posts & Articles Worth Reading

Teamwork makes the dream work

An inspiring innovation was also introduced in the fMRI preprocessing tool fMRIPrep. This feature was co-created with CodeCarbon and allows users to calculate the carbon emissions produced when they run the preprocessing on their data. The user simply turns the feature on with a special command, specifying which country they are in, and can then see the impact of their run.
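As a sketch, enabling the integration might look like the call below; the flag names (--track-carbon, --country-code) follow the feature's description, but confirm them for your fMRIPrep version:

```python
# Sketch: enabling fMRIPrep's CodeCarbon-based emission tracking.
# "bids_dir"/"out_dir" are placeholders; verify flag names for your version.
import subprocess

subprocess.run([
    "fmriprep", "bids_dir", "out_dir", "participant",
    "--track-carbon",         # turn on emission tracking
    "--country-code", "DEU",  # the country whose grid you run on
], check=True)
```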

How much data should be shared?

The FAIR principles (findable, accessible, interoperable, and reusable) are a well-known set of guidelines that advocate for sharing data, partly to make automatic discovery of such data (by machines and software) possible. However, given the huge amounts of data generated in neuroscience, but also in genetic sequencing, one big question for the future is how to share as much data as necessary but as little as possible.

Of note, 32% of emissions were due to keeping nodes powered on regardless of whether code was running. How to deal with these variations in activity is another open question for optimization.

When technology is faster than us

In the future, science will heavily rely on big data, and sharing this data openly is crucial. But there is a challenge ahead, called “dark data.” This data just sits around unused, taking up space and energy in computer systems.

For instance, when researchers leave a project or delete their profiles, their data might still linger on servers, unused and forgotten.  Another reason for dark data is sloppy labeling or formatting.

Schembera & Durán, who proposed the term “dark data”, suggest having dedicated officers to deal with these issues, although they point out that we also need new ways of standardizing how data is handled, teaching users how to manage data properly, and setting clear rules for keeping and accessing data.


What makes a good measure?

Measuring sustainability is tricky because in science we are used to finding universal truths. For example, a metabolic pathway works similarly everywhere. The same idea applies to personalized medicine: it is tough to develop because everyone's biology is different.

However, we try to achieve this by using carbon dioxide equivalents (CO2e) as a common measure. Potentially great for science, but not very effective for science communication – is it?

So why not use other measures, e.g., electricity used, and translate it into how many times you could charge your phone? Because different places have different carbon footprints even for the same amount of electricity used.

In an article, it was suggested that we measure CO2 emissions in terms of how many trees it takes to absorb that CO2 over a certain time (= tree-months). It is a great step forward, but it is hard to visualize how many trees are needed, or what life will be like in the future. If someone asks you how many trees the forest closest to you has, or how your life will look in two years, answering is difficult.

Maybe the solution lies in something more tangible, and still green: money. People understand money better than abstract concepts like kilograms of gases. So, translating carbon emissions into costs might be the most effective measure in the end. But we would need a carbon tax to go for it.
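For illustration, here is how the 95 kg from the top of this post would translate, using rough, commonly cited conversion assumptions (about 22 kg of CO2 absorbed per tree per year, and a carbon price of about 80 EUR per tonne; both are assumptions, not authoritative values):

```python
# Translating kg of CO2e into more tangible units; both conversion
# factors below are rough assumptions, not authoritative values.
KG_PER_TREE_MONTH = 22.0 / 12  # assuming ~22 kg CO2 absorbed per tree per year
EUR_PER_KG = 80.0 / 1000       # assuming a carbon price of ~80 EUR per tonne

emissions_kg = 95.0            # the virus-simulation example from the intro
print(f"{emissions_kg / KG_PER_TREE_MONTH:.0f} tree-months")
print(f"{emissions_kg * EUR_PER_KG:.2f} EUR at the assumed carbon price")
```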

Want More?

Join us and learn more in our online talks, events and educational material we share regularly!

Meet our friends and become part of their community of Bioinformaticians!