OMA

3.1 Obtaining data and getting set up

FastOMA is a software tool for inferring homology information from your own genomes, including Hierarchical Orthologous Groups (HOGs). It takes as input the protein sequences of each species in FASTA format, together with a species tree.

For Modules 2-4, you will need to use our GitPod instance.

In this exercise, we will run FastOMA standalone to infer orthology information for five yeast species. The proteomes of the five species are already provided in the GitPod environment, at /workspace/SIBBiodiversityBioinformatics2023/Module3_FastOMA/working_dir/in_folder/proteome.

The other input needed by FastOMA is the species tree. In our case, the species tree in Newick format is provided in the GitPod workspace at Module3_FastOMA/working_dir/in_folder/species_tree.nwk. It is as follows:

(((Yarrowia_lipolytica:1,Saccharomyces_cerevisiae:1)Saccharomycetales:1,(Neosartorya_fumigata:1,Sclerotinia_sclerotiorum:1)leotiomyceta:1)Saccharomyceta:1,Schizosaccharomyces_pombe:1)Ascomycota;

The FastOMA software is already installed, and you should be able to use it after logging into your GitPod workspace.

Optional (If you are not using GitPod)

If you want to install FastOMA on your own system, follow the installation instructions on the FastOMA GitHub page (https://github.com/DessimozLab/fastoma).

If you want to download the proteomes on your own system, check out the following hint:

Instructions for downloading proteomes

The UniProt database provides the proteomes of many species. You can download the reference proteomes of the following species from UniProt by clicking on “Download one protein sequence per gene (FASTA)”:

Schizosaccharomyces pombe (Fission yeast) https://www.uniprot.org/proteomes/UP000002485
Sclerotinia sclerotiorum (White mold) https://www.uniprot.org/proteomes/UP000001312/
Yarrowia lipolytica https://www.uniprot.org/proteomes/UP000001300
Saccharomyces cerevisiae https://www.uniprot.org/proteomes/UP000002311/
Neosartorya fumigata (Aspergillus fumigatus) https://www.uniprot.org/proteomes/UP000002530/

Right-click on “Download one protein sequence per gene (FASTA)” and copy the link. Then use wget to download the file and gunzip to decompress it. For example, for Schizosaccharomyces pombe:

wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000002485/UP000002485_284812.fasta.gz
gunzip -k UP000002485_284812.fasta.gz

1. In what format are the proteome files?

FASTA format

2. How many proteins are there in the Schizosaccharomyces pombe proteome?

Each record in a FASTA file starts with a header line beginning with ">". You can count the records in the FASTA file by counting those header lines: grep -c "^>" in_folder/proteome/Schizosaccharomyces_pombe.fa

There are 5122 proteins in this FASTA file.
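
The same count can be taken for all five proteomes at once. The function below is a minimal sketch; the name count_fasta_records is hypothetical, and it assumes the proteome files use the .fa extension, as Schizosaccharomyces_pombe.fa does.

```shell
# Print "<record count><TAB><file>" for every FASTA file in a folder.
# Assumption: proteome files end in .fa, as in in_folder/proteome.
count_fasta_records() {
    local folder="$1" fasta
    for fasta in "$folder"/*.fa; do
        # Skip the unexpanded pattern when the folder holds no .fa files.
        [ -e "$fasta" ] || continue
        # -c counts matching lines; ^ restricts ">" to the start of a header line.
        printf '%s\t%s\n' "$(grep -c '^>' "$fasta")" "$fasta"
    done
}
```

For this exercise it would be called as: count_fasta_records in_folder/proteome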

3. How many leaves are in the species tree? For how many species does the species tree provide evolutionary information?

The answer to both questions is 5. You can visualize the species tree using the phylo.io website.
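
The leaves can also be listed on the command line. The sketch below relies on a property of the Newick format: a leaf label directly follows "(" or ",", whereas an internal node label follows ")". The function names are hypothetical, and the second function assumes that each leaf should correspond to a file named <leaf>.fa in the proteome folder, which is the naming used in this exercise.

```shell
# List the leaf names of a Newick tree, one per line.
# A leaf label directly follows "(" or ","; internal labels follow ")".
newick_leaves() {
    grep -oE '[(,][^(),:;]+' "$1" | tr -d '(,'
}

# Print every leaf that has no <leaf>.fa file in the proteome folder.
# Assumption: proteome files are named after the tree leaves.
missing_proteomes() {
    local tree="$1" folder="$2" leaf
    newick_leaves "$tree" | while read -r leaf; do
        [ -f "$folder/$leaf.fa" ] || echo "$leaf"
    done
}
```

Running newick_leaves in_folder/species_tree.nwk | wc -l counts the leaves, and missing_proteomes in_folder/species_tree.nwk in_folder/proteome prints nothing when every leaf has a proteome file.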

3.2 Running FastOMA

The FastOMA algorithm runs in three main steps:

Mapping input proteins to HOGs in the OMA database using OMAmer (see Module 2)
Inferring gene families based on the OMAmer results
Orthology inference in the form of Hierarchical Orthologous Groups (HOGs)

These steps are executed by our highly parallelized pipeline, implemented in Nextflow. FastOMA reports its output in OrthoXML, the standard format for HOGs. For more information on HOGs, see Module 1 and page 4 of Zahn-Zabal et al., F1000, 2020.

First, change directory to Module3_FastOMA/working_dir/, which contains the folder in_folder.

cd /workspace/SIBBiodiversityBioinformatics2023/Module3_FastOMA/working_dir/

Then, check whether Nextflow is installed on your system by running nextflow -h. Now we can use the command line to run FastOMA on the five proteomes in in_folder/proteome, using the species tree in in_folder/species_tree.nwk.

1. What is the command line to run FastOMA?

Check the FastOMA GitHub page.

nextflow FastOMA_light.nf --input_folder in_folder --output_folder out_folder

Execute the above command to run FastOMA.
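
Before launching the pipeline, it can help to confirm that both inputs are in place. The function below is a minimal sketch of such a check; the name check_fastoma_inputs is hypothetical, and the .fa extension is an assumption taken from the proteome files used in this exercise.

```shell
# Verify that the input folder holds a species tree and at least one proteome.
# Prints a message on stderr and returns non-zero at the first problem found.
check_fastoma_inputs() {
    local in_dir="$1"
    # -s is true only for a file that exists and is not empty.
    if [ ! -s "$in_dir/species_tree.nwk" ]; then
        echo "missing or empty: $in_dir/species_tree.nwk" >&2
        return 1
    fi
    # Assumption: proteomes are stored as .fa files in the proteome subfolder.
    if ! ls "$in_dir"/proteome/*.fa >/dev/null 2>&1; then
        echo "no .fa files in $in_dir/proteome" >&2
        return 1
    fi
    echo "inputs look complete"
}
```

It can be chained with the run command so that the pipeline starts only when the check passes: check_fastoma_inputs in_folder && nextflow FastOMA_light.nf --input_folder in_folder --output_folder out_folder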

2. Where is the output OrthoXML file?

Check the directory where in_folder is present.

It is in out_folder, the directory passed to --output_folder.

3.3 Interpreting the results

Recall that Orthologous Groups are groups of strict orthologs, with at most 1 representative per species. Hierarchical Orthologous Groups are groups of orthologs and paralogs, defined at each taxonomic level.

The output of FastOMA includes two folders (hogmap and OrthologousGroupsFasta) and three files (OrthologousGroups.tsv, rootHOGs.tsv and output_hog.orthoxml).

The hogmap folder includes the output of OMAmer (Module 2); each file corresponds to an input proteome. The folder OrthologousGroupsFasta includes FASTA files, and all proteins inside each FASTA file are orthologous to each other. These could be used as gene markers for species tree inference (Module 3).

1. How many Orthologous Groups are there?

You can count the number of FASTA files in the folder OrthologousGroupsFasta.

6773
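
One way to count the files is sketched below. The function name count_group_files is hypothetical; it counts only regular files directly inside the folder, so any subfolders are ignored.

```shell
# Count the regular files directly inside a folder.
# Each FASTA file in OrthologousGroupsFasta corresponds to one Orthologous Group.
count_group_files() {
    # -maxdepth 1 stops find from descending into subfolders;
    # tr removes the padding that some wc implementations add.
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}
```

From within the output folder, it would be called as: count_group_files OrthologousGroupsFasta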

2. How many genes in total are present in all Orthologous Groups?

Genes are encoded in the OrthoXML file as elements of the form <geneRef id="1002001760"/>. We need to count the number of lines containing "geneRef": grep geneRef output_hog.orthoxml | wc -l

There are 22606 genes in the groups.
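
Counting lines gives the number of genes only if each geneRef element sits on its own line. The sketch below counts the elements themselves, so it does not depend on how the XML is laid out; the function name count_gene_refs is hypothetical.

```shell
# Count <geneRef .../> elements in an OrthoXML file.
# -o prints each match on its own line, so several elements on
# one line of the file are all counted.
count_gene_refs() {
    grep -o '<geneRef ' "$1" | wc -l | tr -d ' '
}
```

From within the output folder, it would be called as: count_gene_refs output_hog.orthoxml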

Orthologous Groups which have a representative gene in every species could be considered as the core genome.

3. How many Orthologous Groups include one representative gene for each species?

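Count how many rows in OrthologousGroups.tsv contain five genes. The genes in a row are separated by commas, so a row with five genes has exactly four commas. You can count such rows with this command: sed 's/[^,]//g' OrthologousGroups.tsv | awk '{ print length }' | grep -cx "4". The -x option makes grep match the whole line, so that counts such as 14 are not included. Because an Orthologous Group has at most one gene per species, a group of five genes has one representative in each of the five species.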

There are 1618 Orthologous Groups having five genes.
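
The same count can be written with awk alone. This is a minimal sketch that assumes, as in the hint, that commas appear only as separators between the genes of a row; the function name count_groups_of_size is hypothetical. A row without any comma, such as a header line if the file has one, is treated as a group of size 1.

```shell
# Count the rows of a TSV file that list exactly n comma-separated genes.
# Usage: count_groups_of_size FILE N
count_groups_of_size() {
    # gsub returns the number of replacements it made, which here is the
    # number of commas in the row; n genes are separated by n - 1 commas.
    awk -v n="$2" '{ if (gsub(/,/, ",") + 1 == n) c++ } END { print c + 0 }' "$1"
}
```

For the five species of this exercise, it would be called as: count_groups_of_size OrthologousGroups.tsv 5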

4. How many Root HOGs are in the HOG file?

Each line in the output file denotes a gene family. After running, check the end of the file rootHOGs.tsv. Note that the indexing starts from 0.

There are 6793 rootHOGs (gene families) in this file.

5. Consider the gene “60S ribosomal protein L15-A” in Schizosaccharomyces pombe with protein ID: RL15A_SCHPO. How many proteins are in the gene family (for these 5 species of interest)?

Find the corresponding line in the rootHOGs.tsv using grep.

There are 7 proteins in this family.

6. Which genes are orthologous to the gene A7EQW0_SCLS1?

You can use grep on the OrthologousGroups.tsv.

'tr|Q4WEI0|Q4WEI0_ASPFU', 'tr|A7EQW0|A7EQW0_SCLS1', 'sp|P32468|CDC12_YEAST', 'tr|Q6C7L3|Q6C7L3_YARLI', 'sp|P48009|SPN4_SCHPO'
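
To print the members of a group one per line, the matching row can be split into its identifiers. The sketch below assumes the identifiers have the UniProt form shown in the answer above (sp|ACCESSION|NAME or tr|ACCESSION|NAME); the function name group_members is hypothetical, and the query is matched as a plain substring of the row.

```shell
# Print every UniProt-style identifier found on the rows of a TSV file
# that mention the query identifier, one identifier per line.
# Usage: group_members FILE QUERY_ID
group_members() {
    # -F treats the query as a fixed string rather than a pattern;
    # -o prints each matching identifier on its own line.
    grep -F -- "$2" "$1" | grep -oE '(sp|tr)\|[^|]+\|[A-Za-z0-9_]+'
}
```

The orthologs of a gene are then the other members of its row: group_members OrthologousGroups.tsv A7EQW0_SCLS1 | grep -v A7EQW0_SCLS1. The same function can count the proteins of a gene family: group_members rootHOGs.tsv RL15A_SCHPO | wc -l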
