UniRef annotation of ORFs
=========================

UniRef is a catalog of UniProt protein sequence clusters.

- Website: https://www.uniprot.org/help/uniref
- Citation: Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, UniProt Consortium.
  UniRef clusters: a comprehensive and scalable alternative for improving
  sequence similarity searches. Bioinformatics. 2015 Mar 15;31(6):926-32.

UniRef release 2022_01 was used to annotate the ORFs.

- Source: https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2022_01/uniref/uniref2022_01.tar.gz

DIAMOND v2.0.15 was used to perform sequence alignment. Commands:

```
diamond makedb --threads # --in db.faa --db db

diamond blastp --threads # --db db --query input.faa --out output.m8 --id 90 \
  --subject-cover 80 --query-cover 80 --index-chunks 1 --max-target-seqs 1
```

Sequence alignment was performed on UniRef90 and UniRef50 separately. Results
were merged, prioritizing the former over the latter (i.e., an ORF was assigned
to a UniRef50 entry only when it could not be assigned to a UniRef90 entry).

Statistics:

- Number of annotated ORFs: 38,173,561
- Number of UniRef entries: 30,425,102

Database files:

- orf-to-uniref.map: Mapping of ORFs to UniRef entries (unique).
- uniref_name.txt: Names (descriptions) of UniRef entries.
- idmaps/: Mapping of UniRef entries to external databases, including: common
  gene name, NCBI gene ID, RefSeq protein, BioCyc, eggNOG, OMA, OrthoDB, PATRIC,
  and STRING.

Note: Mappings of UniRef entries to GO terms can be found under go/.

Collapsing order:

```
ORF > UniRef > GO (see go/)
        v
  other databases
```
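
Example: merging UniRef90 and UniRef50 hits

The merge rule described above (UniRef90 first, UniRef50 only as a fallback) can be sketched in Python as follows. This is an illustrative sketch, not the script that was actually used to build orf-to-uniref.map. It assumes the DIAMOND output is in the default BLAST tabular format, where column 1 is the query (ORF) ID and column 2 is the subject (UniRef) ID. The function names `best_hits` and `merge_hits` and all IDs in the example are hypothetical.

```python
def best_hits(m8_lines):
    """Collect {orf_id: uniref_id} from tab-separated DIAMOND output lines.

    Assumes column 1 is the query and column 2 is the subject. If a query
    appears more than once, only its first row is kept.
    """
    hits = {}
    for line in m8_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hits.setdefault(fields[0], fields[1])
    return hits


def merge_hits(uniref90, uniref50):
    """Merge two {orf_id: uniref_id} maps, preferring UniRef90 assignments."""
    merged = dict(uniref50)   # start from the fallback assignments
    merged.update(uniref90)   # UniRef90 overrides UniRef50 where both exist
    return merged


# Hypothetical example rows (only the first three columns are shown).
u90 = ["orf1\tUniRef90_A0A000\t95.0", "orf2\tUniRef90_B0B000\t91.2"]
u50 = ["orf2\tUniRef50_C0C000\t90.5", "orf3\tUniRef50_D0D000\t92.0"]

merged = merge_hits(best_hits(u90), best_hits(u50))
# orf2 has hits in both searches and keeps its UniRef90 assignment;
# orf3 has no UniRef90 hit and falls back to UniRef50.
```

In practice the two inputs would be the lines of the output.m8 files from the separate UniRef90 and UniRef50 searches, for example `best_hits(open(path))` for each file.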