UniRef annotation of ORFs
======

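UniRef (UniProt Reference Clusters) is a catalog of protein sequence clusters
built from UniProt at several sequence identity levels, including UniRef90 and
UniRef50.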

 - Website: https://www.uniprot.org/help/uniref

 - Citation: Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, UniProt Consortium.
   UniRef clusters: a comprehensive and scalable alternative for improving
   sequence similarity searches. Bioinformatics. 2015 Mar 15;31(6):926-32.

UniRef release 2022_01 was used to annotate the ORFs.

 - Source: https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2022_01/uniref/uniref2022_01.tar.gz

DIAMOND v2.0.15 was used to perform sequence alignment. Commands:

```
# Replace N with the number of CPU threads to use.
diamond makedb --threads N --in db.faa --db db
diamond blastp --threads N --db db --query input.faa --out output.m8 --id 90 \
  --subject-cover 80 --query-cover 80 --index-chunks 1 --max-target-seqs 1
```

Sequence alignment was performed on UniRef90 and UniRef50 separately. Results
were merged, prioritizing the former over the latter (i.e., an ORF was assigned
to a UniRef50 entry only when it could not be assigned to a UniRef90 entry).
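
The merge rule above can be sketched in Python as follows. This is a minimal illustration, not the script that was actually used. It assumes DIAMOND's default tabular output (query ID in column 1, subject ID in column 2). The function names and the toy IDs are hypothetical.

```python
import io


def read_best_hits(handle):
    """Return {ORF ID: UniRef ID} from DIAMOND tabular (m8) lines."""
    hits = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        # With --max-target-seqs 1 there should be one hit per query;
        # setdefault keeps the first one if duplicates ever occur.
        hits.setdefault(query, subject)
    return hits


def merge_hits(hits90, hits50):
    """Merge two hit tables, preferring UniRef90 over UniRef50."""
    merged = dict(hits50)   # start from the UniRef50 assignments
    merged.update(hits90)   # UniRef90 assignments overwrite them
    return merged


# Toy inputs standing in for the two DIAMOND output files (IDs are made up).
m8_uniref90 = io.StringIO("orf1\tUniRef90_A\t95.2\norf2\tUniRef90_B\t91.0\n")
m8_uniref50 = io.StringIO("orf2\tUniRef50_C\t93.0\norf3\tUniRef50_D\t90.5\n")

merged = merge_hits(read_best_hits(m8_uniref90), read_best_hits(m8_uniref50))
for orf, entry in sorted(merged.items()):
    print(orf, entry, sep="\t")
```

In this toy example, `orf2` has a hit in both searches and keeps its UniRef90 assignment, while `orf3` has no UniRef90 hit and falls back to UniRef50.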

Statistics:

 - Number of annotated ORFs: 38,173,561
 - Number of UniRef entries: 30,425,102

Database files:

 - orf-to-uniref.map: Mapping of ORFs to UniRef entries (unique, i.e., one
   UniRef entry per ORF).

 - uniref_name.txt: Names (descriptions) of UniRef entries.

 - idmaps/: Mapping of UniRef entries to external databases, including: common
   gene name, NCBI gene ID, RefSeq protein, BioCyc, eggNOG, OMA, OrthoDB,
   PATRIC, and STRING.

Note: Mappings of UniRef entries to GO terms can be found under go/.

Collapsing order:

```
  ORF > UniRef > GO (see go/)
          v
  other databases
```
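
As a rough sketch of how the collapsing order might be applied to a table of per-ORF counts, consider the Python example below. It assumes that orf-to-uniref.map is a two-column, tab-separated file (ORF ID, UniRef ID), which is not stated in this document. It also assumes that each source maps to a single target and that unmapped features are dropped. The function names, counts, and IDs are all hypothetical.

```python
import io
from collections import defaultdict


def read_map(handle):
    """Read a two-column, tab-separated mapping (assumed format)."""
    mapping = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        source, target = line.split("\t")[:2]
        mapping[source] = target
    return mapping


def collapse(counts, mapping):
    """Sum the counts of all sources that map to the same target."""
    collapsed = defaultdict(int)
    for source, count in counts.items():
        target = mapping.get(source)
        if target is not None:  # assumption: unmapped sources are dropped
            collapsed[target] += count
    return dict(collapsed)


# Step 1: ORF > UniRef (toy stand-in for orf-to-uniref.map).
orf_map = read_map(
    io.StringIO("orf1\tUniRef90_A\norf2\tUniRef90_A\norf3\tUniRef50_D\n")
)
orf_counts = {"orf1": 5, "orf2": 3, "orf3": 2, "orf4": 7}
uniref_counts = collapse(orf_counts, orf_map)

# Step 2: UniRef > another database (toy stand-in for a file under idmaps/).
uniref_to_other = {"UniRef90_A": "geneX", "UniRef50_D": "geneX"}
other_counts = collapse(uniref_counts, uniref_to_other)

print(uniref_counts)
print(other_counts)
```

Mappings in which one UniRef entry corresponds to several targets (as is common for GO terms) would need an explicit policy for dividing or duplicating counts, which this sketch does not cover.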