Things covered here:
- What are packages, dependencies, and environments?
- What is Conda/Mamba and why use them?
- Understanding Conda channels
- How to install and configure Mamba (locally and on the HPC)
- Creating and managing environments
- Working with different tool versions
- Sharing environments with YAML files
When working with a programming language such as Python that can do almost anything, you may wonder how this is possible. The Python download is only about 25 MB, so how can everything be included in such a small package? The answer is that it is not. Python, like many other programming languages, relies on external libraries or packages to be able to do almost anything. You can see this as soon as you start programming: after learning the very basics, you often learn how to import something into your script or session.
A bit further into your programming career you may notice that many packages do not do everything on their own. Instead, they depend on other packages for their functionality. For example, the SciPy package is used for numerical routines. To avoid reinventing the wheel, it makes use of other packages, such as NumPy (numerical Python), Matplotlib (plotting) and many more. So we say that NumPy and Matplotlib are dependencies of SciPy.
Many packages are being developed all the time, which produces different versions of each package. During development a function call may change, and functionality may be added or removed. If one package depends on another, this can create issues. It is therefore important to know not only that, e.g., SciPy depends on NumPy and Matplotlib, but also that it depends on NumPy version >= 1.6 and Matplotlib version >= 1.1. NumPy version 1.5 would not be sufficient in this case.
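A version constraint such as ">= 1.6" boils down to a version comparison. The following is a minimal sketch of such a check in shell, not how Conda or Mamba resolve dependencies internally. The helper name version_ge and the "installed" value are made up for illustration, and the sketch assumes a sort that supports the -V (version sort) option, as GNU coreutils does.

```shell
# Return success (exit code 0) if version $1 is >= version $2.
# Assumes `sort -V` (version sort) is available, e.g. from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical installed version compared with the minimum from the text.
installed="1.5"
required="1.6"
if version_ge "$installed" "$required"; then
    status="ok"
else
    status="too old"
fi
echo "numpy $installed (needs >= $required): $status"
```

Version sort compares the numeric parts, so 1.10 counts as newer than 1.6, which a plain alphabetical comparison would get wrong.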
When you start programming you may not use many packages yet, and installation may be straightforward. But for most people there comes a time when one version of a package, or of the programming language itself, is no longer enough. You may find an older tool that depends on an older version of your programming language (e.g. Python 2.7), while many of your other tools depend on a newer version (e.g. Python 3.6). You could start up another computer or virtual machine to run the other version of the language, but this is not very handy, since you may want to use the tools together in a workflow later on. Environments are one solution to this problem. Nowadays there are several environment management systems that follow a similar idea: instead of using multiple computers or virtual machines to run different versions of the same package, you install packages in isolated environments.
An environment management system solves a number of problems commonly encountered by scientists.
An environment management system enables you to set up a new, project specific software environment containing specific Python versions as well as the versions of additional packages and required dependencies that are all mutually compatible.
A good package management system greatly simplifies the process of installing software by…
Installing software is hard. Installing scientific software is often even more challenging. To minimize the burden of installing and updating software, (data) scientists often install the software packages they need for their various projects system-wide.
Installing software system-wide has a number of drawbacks:
Put differently, installing software system-wide creates complex dependencies between your research projects that shouldn’t really exist!
Rather than installing software system-wide, wouldn’t it be great if we could install software separately for each research project?
Conda is an open-source package and environment management system that runs on Windows, macOS and Linux.
Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because Conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.
Mamba is a reimplementation of conda that offers the same functionality but with significant performance improvements. Written in C++, Mamba solves package dependencies much faster than conda, which becomes especially noticeable when working with complex environments or large package installations. It is a “drop-in” replacement for conda, meaning that in most commands where you would use conda you can use mamba instead (e.g. mamba install instead of conda install).
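Because the two tools share most subcommands, a script can choose whichever one is available. The snippet below is a small sketch of that idea; the variable name pkg_mgr is an assumption made for this example, and the install command is left as a comment so that nothing is actually installed.

```shell
# Prefer mamba when it is on PATH, otherwise fall back to conda.
if command -v mamba >/dev/null 2>&1; then
    pkg_mgr="mamba"
elif command -v conda >/dev/null 2>&1; then
    pkg_mgr="conda"
else
    pkg_mgr=""
fi

if [ -n "$pkg_mgr" ]; then
    echo "Using $pkg_mgr"
    # The same subcommand works with either tool, for example:
    #   "$pkg_mgr" install samtools=1.22
else
    echo "Neither mamba nor conda was found on PATH"
fi
```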
defaults (Anaconda-managed):
- main and r repositories from Anaconda Inc.
conda-forge:
bioconda:
At NBI, we need to comply with Anaconda’s licensing requirements while maintaining access to all the bioinformatics and scientific computing tools we need. Therefore our setup is the following:
Step 1: Install Miniforge
Download and install Miniforge from: https://github.com/conda-forge/miniforge
For macOS (if wget is not installed, curl -L -O with the same URL does the same job):
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
Follow the installation prompts. When asked if you want to initialise Miniforge3, answer “yes”. This will modify your shell configuration file (e.g., ~/.bashrc or ~/.zshrc) to make conda/mamba commands available.
After installation, restart your terminal or run:
source ~/.bashrc
# or if using zsh
source ~/.zshrc
Step 2: Add the bioconda channel
conda config --add channels bioconda
Step 3: Switch to the Prefix channel mirror for open-source channels
conda config --set channel_alias "https://repo.prefix.dev"
Step 4: Verify your configuration
conda config --show channels
channels:
- bioconda
- conda-forge
conda config --show channel_alias
channel_alias: https://repo.prefix.dev
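The order of the channel list matters: channels nearer the top take priority. As a rough sketch, the channel names can be pulled out of that output with a small shell helper. The function name list_channels is invented for this example, and the sample text is copied from the expected output above rather than produced by running conda.

```shell
# Print channel names, highest priority first, from
# `conda config --show channels`-style text read on stdin.
list_channels() {
    sed -n 's/^[[:space:]]*-[[:space:]]*//p'
}

# Sample text copied from the expected output above. With conda installed,
# `conda config --show channels | list_channels` could be used instead.
sample='channels:
  - bioconda
  - conda-forge'

first_channel=$(printf '%s\n' "$sample" | list_channels | head -n 1)
echo "Highest-priority channel: $first_channel"
```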
HPC Installation: Key differences for NBI HPC users
In this tutorial we walk through a local Mamba installation; however, there are some slight differences between a local installation and installing on the NBI HPC:
- Installing without conda init to keep your .bashrc clean
- Using source activate and source deactivate instead of mamba activate and mamba deactivate
For the HPC installation, replace Step 1 above with the following:
Step 1:
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-Linux-x86_64.sh -b -p ~/mamba
rm Miniforge3-$(uname)-$(uname -m).sh
ln -s ~/mamba/bin/activate ~/mamba/condabin/activate
ln -s ~/mamba/bin/deactivate ~/mamba/condabin/deactivate
export PATH=$PATH:/hpc-home/$USER/mamba/condabin
echo "export PATH=\$PATH:/hpc-home/$USER/mamba/condabin" >> ~/.bash_profile
These commands will:
- download the Mamba installation script
- install Mamba into ~/mamba without enabling conda init
- delete the installation script (you will no longer need it)
- make the mamba, activate and deactivate commands accessible to your shell both right now and at every login (via the line added to ~/.bash_profile)
Not enabling conda init is good practice in HPC environments: keeping your ~/.bashrc file free of additions you didn't manually make yourself can prevent unexpected and difficult-to-diagnose issues with personal environments interfering with software execution (interactively or via Slurm).
The main consequence of the lack of conda init is that mamba activate and mamba deactivate, the most widely used commands to activate and deactivate Mamba environments (which you may see in other online tutorials/instructions), will not work. Instead, you will use source activate and source deactivate, which do the same thing.
Throughout the rest of this tutorial, whenever you see mamba activate or mamba deactivate, HPC users should use source activate and source deactivate instead.
Why the difference on the HPC?
On your local machine, enabling conda init is convenient and won't interfere with other systems. On the HPC, keeping ~/.bashrc clean prevents conflicts with cluster-wide software modules.
Steps 2-4:
Follow the same configuration steps as above (adding bioconda channel, setting channel alias, and verifying configuration).
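One thing to watch with the HPC Step 1 above: it appends an export line to ~/.bash_profile, so running it a second time would add the same line again. A possible guard, sketched here under the assumption that an exact-line match is good enough, is to append only when the line is missing. The helper name append_once and the path in the example are hypothetical, and the demonstration uses a temporary file rather than your real profile.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrated on a temporary file standing in for ~/.bash_profile.
profile=$(mktemp)
path_line='export PATH=$PATH:$HOME/mamba/condabin'   # hypothetical location
append_once "$path_line" "$profile"
append_once "$path_line" "$profile"   # second call changes nothing
count=$(grep -cxF -- "$path_line" "$profile")
echo "Matching lines in profile: $count"
rm -f "$profile"
```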
Note: Avoid installing packages into your base Mamba environment
Mamba has a default environment called base that includes a Python installation and some core system libraries and dependencies of Mamba. It is a “best practice” to avoid installing additional packages into your base software environment. Additional packages needed for a new project should always be installed into a newly created Mamba environment.
A Mamba environment is a directory that contains a specific collection of packages that you have installed. For example, you may be working on a research project that requires samtools version 1.22 and its dependencies, while another environment associated with a finished project has samtools 1.6 (perhaps because version 1.6 was the most current version of samtools at the time the project finished). If you change one environment, your other environments are not affected. You can easily activate or deactivate environments, which is how you switch between them.
For a list of all commands, take a look at Conda general commands (remember wherever you can use conda you can replace it with mamba).
Always specify a version number for each package you wish to install
To make your results more reproducible, and to make it easier for research colleagues to recreate your Mamba environments on their machines, it is a “best practice” to always explicitly specify the version number for each package that you install into an environment. If you are not sure exactly which version of a package you want to use, you can see what versions are available using the mamba search command.
mamba search samtools
# Let's create an environment with samtools v1.22 and check the version after install
mamba create --name samtools-1.22-env samtools=1.22
mamba activate samtools-1.22-env
samtools --version
# Let's deactivate before creating and activating the next environment
mamba deactivate
# Let's create an environment with samtools v1.6 and check the version after install
mamba create --name samtools-1.6-env samtools=1.6
mamba activate samtools-1.6-env
samtools --version
mamba deactivate
When we activate an environment, we can see the currently active environment in brackets like this:
(samtools-1.6-env) username@laptop
Remember when we learned about the PATH variable? When you activate a Mamba environment, Mamba temporarily modifies your PATH to point to the binaries in that specific environment. Let’s see this in action:
# Check PATH before activating an environment
echo $PATH | tr ":" "\n"
# Activate an environment and check PATH again
mamba activate samtools-1.22-env
echo $PATH | tr ":" "\n"
mamba deactivate
Notice that the environment’s bin directory is now at the front of PATH! This is how Mamba ensures you’re using the correct version of each tool: by controlling which directories are searched first in your PATH!
We can check what environments we have installed and their locations (whose bin subdirectories get prepended to PATH on activation) by running the mamba env list command:
mamba env list
base /Users/<USERNAME>/miniforge3
samtools-1.22-env /Users/<USERNAME>/miniforge3/envs/samtools-1.22-env
samtools-1.6-env /Users/<USERNAME>/miniforge3/envs/samtools-1.6-env
Each environment is a separate directory, and when you activate one, Mamba adds that environment’s bin subdirectory to the front of your PATH.
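The PATH mechanism described above can be reproduced without Mamba. The sketch below builds two throwaway directories that each contain a script called mytool (a made-up name, as are the env-1.6 and env-1.22 directory names) and shows that whichever bin directory comes first in PATH decides which copy runs. It only illustrates the lookup order; real activation does more than edit PATH.

```shell
# Build two throwaway "environments", each with its own bin/mytool script.
# (mytool and the directory names are made up for this demonstration.)
demo_root=$(mktemp -d)
for ver in 1.6 1.22; do
    mkdir -p "$demo_root/env-$ver/bin"
    printf '#!/bin/sh\necho "mytool %s"\n' "$ver" > "$demo_root/env-$ver/bin/mytool"
    chmod +x "$demo_root/env-$ver/bin/mytool"
done

original_path=$PATH

# "Activate" the first environment by putting its bin directory first.
PATH="$demo_root/env-1.6/bin:$original_path"
found_first=$(mytool)

# Switch: start from the original PATH and prepend the other bin directory.
PATH="$demo_root/env-1.22/bin:$original_path"
found_second=$(mytool)

# "Deactivate": restore the original PATH and clean up.
PATH=$original_path
echo "$found_first / $found_second"
rm -rf "$demo_root"
```

The same command name resolves to a different file each time, and restoring the original PATH plays the role of deactivation.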
You could also create an environment and install multiple packages by listing the packages that you wish to install. We can either create a new environment or we can modify an existing environment. Let’s try both ways.
We can always check which packages are installed in the current environment by running the mamba list command after activating it.
# Let's find our tool that we want to install
mamba search bcftools
# Let's create a new environment with both tools
mamba create --name test-env samtools=1.22 bcftools=1.22
mamba activate test-env
mamba list
mamba deactivate
We could also install the same tool into one of our existing environments, by activating the environment and then using the mamba install command:
mamba activate samtools-1.6-env
mamba list
# Let's add bcftools version 1.22 into this environment
mamba install bcftools=1.22
# Check that it was successful
mamba list
For more complex projects or when sharing your work with collaborators, it’s best practice to define your environment in a YAML file. This makes your computational environment fully reproducible and easy to share.
Let’s export one of the environments we already created to see what a YAML file looks like:
# Activate the test-env we created earlier
mamba activate test-env
# Export to a YAML file
mamba env export --from-history > test-env.yml
# View the exported file
cat test-env.yml
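Notice that, because we used the --from-history flag, the YAML file lists only the tools you explicitly asked for (samtools and bcftools) and not their dependencies, which keeps the file short and portable across operating systems. Without the flag, the export would also list every dependency with its exact version and build. Let’s move to CLIMB notebooks and see how we would recreate this environment from this YAML file.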
Once logged into CLIMB, import the yaml file and simply run:
mamba env create -f test-env.yml
mamba env list
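An environment file can also be written by hand instead of exported. The sketch below writes a minimal one containing the two pinned tools from test-env. The file name, its location in the temporary directory and the channel list are assumptions for illustration, not output copied from mamba env export.

```shell
# Write a minimal environment file by hand (name and location are arbitrary).
env_file="${TMPDIR:-/tmp}/handwritten-env.yml"
cat > "$env_file" <<'EOF'
name: test-env
channels:
  - bioconda
  - conda-forge
dependencies:
  - samtools=1.22
  - bcftools=1.22
EOF

# Count the entries that pin a version with "=".
pinned=$(grep -c '^  - .*=' "$env_file")
echo "$pinned pinned packages in $env_file"
```

A file like this can then be passed to mamba env create -f in the same way as the exported one.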
This training course was adapted from the Carpentries Introduction to Conda for (Data) Scientists Course.