How was the ortholog database in EcoOmicsDB created?

jess.ewald · July 20, 2022, 6:25pm

EcoOmicsDB was created to expose the details of the protein ortholog databases used by the Seq2Fun algorithm. To create a comprehensive ortholog database, all protein-coding genes (n = 13,057,389) from 687 organisms, which cover all major phyla of eukaryotes, were downloaded from KEGG using KEGGREST (version 1.34.0). Protein FASTA files for each species were submitted to OrthoFinder (version 2.5.4) for classification of genes into ortholog groups. OrthoFinder (parameters: t = 56, a = 25) was run on a server with 56 threads and 504 GB of RAM, and it took about 10 days to finish ortholog grouping for all the organisms.

Since OrthoFinder tries to find groups of sequences with a common ancestor, there is no restriction on ortholog group size, and some groups are too large to be useful for RNA-seq quantification. To solve this, each of the top 10,000 ortholog groups was split using the following steps:

1. Sequences analyzed to create a phylogenetic tree using FastTree
2. Phylogenetic tree converted to cophenetic distance matrix
3. K-means clustering used to split sequences into optimal “k” groups. “k” defined as (# sequences/# species)*2.
4. New ortholog groups defined by each group from k-means clustering.
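
The splitting steps above can be illustrated with a small, self-contained Python sketch. It is not the code used to build EcoOmicsDB: the real trees came from FastTree, whereas the toy tree here is hand-written, and every function name is hypothetical. The sketch assumes that "cophenetic distance" means the branch-length path between two leaves, that "k" is rounded to a whole number, and that k-means is seeded with a deterministic farthest-point rule. None of these choices is stated in the post.

```python
def choose_k(n_sequences, n_species):
    """k = (# sequences / # species) * 2, as in the post.
    Rounding and the minimum of 1 are assumptions."""
    return max(1, round(n_sequences / n_species * 2))


def root_distances(node, parent):
    """Map each ancestor of `node` (and the node itself) to its distance from `node`."""
    out, total = {node: 0.0}, 0.0
    while node in parent:
        node, length = parent[node]
        total += length
        out[node] = total
    return out


def cophenetic_matrix(leaves, parent):
    """Pairwise leaf distances: the path through the most recent common ancestor."""
    paths = {leaf: root_distances(leaf, parent) for leaf in leaves}
    return [
        [min(paths[a][n] + paths[b][n] for n in paths[a] if n in paths[b])
         for b in leaves]
        for a in leaves
    ]


def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(rows, k, iterations=50):
    """Plain Lloyd's k-means on the rows of the distance matrix."""
    # Farthest-point seeding keeps the example deterministic (an assumption).
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centers)))
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(r, centers[j])) for r in rows]
        new_centers = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's old center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


# Toy tree: child -> (parent, branch length). Two clades of two sequences each.
parent = {
    "a1": ("A", 0.1), "a2": ("A", 0.1),
    "b1": ("B", 0.1), "b2": ("B", 0.1),
    "A": ("root", 1.0), "B": ("root", 1.0),
}
leaves = ["a1", "a2", "b1", "b2"]

matrix = cophenetic_matrix(leaves, parent)
k = choose_k(n_sequences=len(leaves), n_species=4)  # (4 / 4) * 2 = 2
labels = kmeans(matrix, k)

# Each k-means cluster becomes a new, smaller ortholog group.
groups = {}
for leaf, label in zip(leaves, labels):
    groups.setdefault(label, []).append(leaf)
print(groups)
```

In this toy case the two clades end up in separate groups. With the formula from the post, a group of 300 sequences drawn from 100 species would be split into (300 / 100) * 2 = 6 groups.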

Information from each sequence was collapsed to generate a single symbol, description, KEGG pathway, and GO term annotation for each ortholog group. Phylogenetic groups of organisms, which are based on the NCBI taxonomy system, were retrieved using KEGGREST to create sub-group databases. All of this information can be queried and retrieved using EcoOmicsDB.

padimitriu · July 13, 2023, 10:59pm

Thank you for this great tool. According to the explanation above, as well as the ExpressAnalyst paper, protein-coding genes were retrieved from KEGG using KEGGREST. Could you explain how you worked around the fact that keggGet() (assuming you used that function) can only return 10 results per request?

Many thanks,
Pedro.

jess.ewald · July 21, 2023, 6:06pm

The person who did this has moved on to another position and is not monitoring these posts. I’m not sure if that’s the function he used. In the past, we have sometimes used a script to repeatedly call functions until all results were retrieved, and then compiled everything.
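
A minimal sketch of that kind of batching script is shown below, written in Python for illustration. It is not the script used for EcoOmicsDB, and KEGGREST itself is an R package, so the fetch step is a stand-in: `fetch_batch` is a hypothetical callable that takes at most 10 IDs and returns one record per ID, and the gene IDs are made up. In practice it would wrap whatever performs the real request, such as a call to keggGet().

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_all(ids, fetch_batch, batch_size=10):
    """Call `fetch_batch` once per batch and compile all results in order."""
    results = []
    for batch in chunked(list(ids), batch_size):
        results.extend(fetch_batch(batch))
    return results


def fake_fetch(batch):
    """Stand-in for the real request; enforces the 10-entry limit."""
    assert len(batch) <= 10
    return [f"record for {entry_id}" for entry_id in batch]


# Hypothetical IDs, used only to show the batching behaviour.
gene_ids = [f"hsa:{n}" for n in range(1, 26)]
records = fetch_all(gene_ids, fake_fetch)
print(len(records))  # 25 records, fetched in batches of 10, 10, and 5
```

For millions of genes, a real script would probably also need pauses between requests and retries on failed batches, which this sketch leaves out.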