EcoOmicsDB was created to expose the details of the protein ortholog databases used by the Seq2Fun algorithm. To create a comprehensive ortholog database, all protein-coding genes (n = 13,057,389) from 687 organisms, covering all major phyla of eukaryotes, were downloaded from KEGG using KEGGREST (version 1.34.0). Protein FASTA files for each species were submitted to OrthoFinder (version 2.5.4) to classify genes into ortholog groups. OrthoFinder (parameters: t = 56, a = 25) was run on a server with 56 threads and 504 GB of RAM, and ortholog grouping for all organisms took about 10 days.
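As a rough illustration of the OrthoFinder run described above, the sketch below assembles a command line in Python. It assumes that the reported parameters t and a correspond to OrthoFinder's `-t` (parallel sequence-search threads) and `-a` (parallel analysis threads) options, that the executable is available as `orthofinder`, and that the FASTA files sit in a hypothetical directory named `proteomes/`. The exact invocation used to build the database is not given here.

```python
def orthofinder_command(fasta_dir, search_threads=56, analysis_threads=25):
    """Build an OrthoFinder command line as a list of arguments.

    Assumption: t = 56 and a = 25 in the text map to the -t and -a options.
    """
    return [
        "orthofinder",                   # executable name is an assumption
        "-f", str(fasta_dir),            # directory of per-species protein FASTA files
        "-t", str(search_threads),       # parallel sequence-search threads
        "-a", str(analysis_threads),     # parallel analysis threads
    ]


# "proteomes/" is a hypothetical directory name used only for illustration.
cmd = orthofinder_command("proteomes/")
print(" ".join(cmd))

# To actually launch the run (not executed here):
# import subprocess
# subprocess.run(cmd, check=True)
```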
Because OrthoFinder tries to find groups of sequences with a common ancestor, it places no restriction on ortholog group size, and some groups are too large to be useful for RNA-seq quantification. To solve this, each of the top 10,000 ortholog groups was split using the following steps:
- Sequences were analyzed with FastTree to create a phylogenetic tree.
- The phylogenetic tree was converted to a cophenetic distance matrix.
- K-means clustering was used to split the sequences into “k” groups, with “k” defined as (# sequences / # species) × 2.
- Each group from k-means clustering was defined as a new ortholog group.
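The last two steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code. It assumes the cophenetic distance matrix has already been computed from the FastTree tree, treats each row of that matrix as a sequence's feature vector, rounds “k” to the nearest integer, and uses a deterministic farthest-point initialization. All function names are hypothetical.

```python
def choose_k(n_sequences, n_species):
    """k = (# sequences / # species) * 2, as in the text.

    Rounding and clamping to [1, n_sequences] are assumptions.
    """
    k = round(n_sequences / n_species * 2)
    return max(1, min(k, n_sequences))


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_split(dist_matrix, k, max_iter=100):
    """Cluster sequences with Lloyd's k-means on rows of the distance matrix."""
    rows = [[float(v) for v in row] for row in dist_matrix]
    # Farthest-point initialization (an assumption, chosen for determinism).
    centers = [rows[0][:]]
    while len(centers) < k:
        far = max(rows, key=lambda r: min(_sq_dist(r, c) for c in centers))
        centers.append(far[:])
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: _sq_dist(r, centers[j])) for r in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_group(seq_ids, labels):
    """Define one new ortholog group per k-means cluster."""
    groups = {}
    for seq_id, lab in zip(seq_ids, labels):
        groups.setdefault(lab, []).append(seq_id)
    return groups


# Toy cophenetic distances: two tight clades that are far from each other.
D = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]
k = choose_k(n_sequences=4, n_species=4)
new_groups = split_group(["s1", "s2", "s3", "s4"], kmeans_split(D, k))
print(k, new_groups)
```

Clustering on rows of the distance matrix is one common way to apply k-means to tree-derived distances; the source does not state which representation was used.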
Information from each sequence was collapsed to generate a single symbol, description, KEGG pathway, and GO term annotation for each ortholog group. Phylogenetic groups of organisms were retrieved from KEGGREST to create sub-group databases, which are based on the NCBI taxonomy system. All of this information can be queried and retrieved using EcoOmicsDB.
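The collapsing rule is not specified here, so the following sketch shows only one plausible approach: taking the most frequent non-empty value of each annotation field across the sequences in a group. The majority-vote rule, the field names, and the record layout are all assumptions made for illustration.

```python
from collections import Counter


def collapse_field(values):
    """Return the most frequent non-empty value, or None if all are missing.

    Majority vote is an assumed rule, not the documented one.
    """
    present = [v for v in values if v]
    if not present:
        return None
    return Counter(present).most_common(1)[0][0]


def collapse_group(records,
                   fields=("symbol", "description", "kegg_pathway", "go_term")):
    """Collapse per-sequence annotations into one annotation per ortholog group."""
    return {f: collapse_field(r.get(f) for r in records) for f in fields}


# Hypothetical per-sequence annotations for one ortholog group.
records = [
    {"symbol": "TP53", "description": "tumor protein p53", "go_term": "GO:0006915"},
    {"symbol": "TP53", "description": "tumor protein p53", "go_term": None},
    {"symbol": "tp53", "description": "cellular tumor antigen p53", "go_term": "GO:0006915"},
]
print(collapse_group(records))
```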