Flexible protein database based on amino acid k-mers

kAAmer’s database engine is based on log-structured merge-tree (LSM-tree) key-value (KV) stores [11]. LSM-trees are used in data-intensive operations such as web indexing [12,13], social networks [14] and online games [15,16]. kAAmer uses Badger [17], an efficient implementation in Golang (https://golang.org/) of a WiscKey KV store [16]. WiscKey’s LSM-tree design is optimized for solid-state drives (SSDs) and separates keys from values to minimize data movement when building the key-value store. kAAmer therefore performs best on modern hardware such as SSDs, which provide high input/output operations per second (IOPS), and scales well to use cases where many requests are sent simultaneously. A kAAmer database includes three KV stores (see Fig. 1a): one to provide protein information (protein store) and two to enable search functionality (k-mer store and combination store). The k-mer store contains all the 7-mers found in the sequence dataset, each mapped to a key in the combination store, which holds only the unique combinations of proteins that contain those k-mers. The fixed k-mer size of 7 was chosen to fit in 4 bytes and to keep the database size manageable while providing good specificity on protein targets. The k-merized design of a kAAmer database gives search tasks an attractive simplicity: a search yields the exact count of 7-mers shared between a protein query and every target in a protein database. This strategy is not guaranteed to return the same homologous targets that would be obtained with alignment or HMM searching and is therefore less suitable for remote homology searching.
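The three-store layout and the 4-byte k-mer key can be illustrated with a minimal in-memory sketch. It is not kAAmer's actual implementation: plain Python dictionaries stand in for the Badger stores, and the base-20 packing, the SHA-1-derived combination key, the counting of distinct query 7-mers, and all function names are assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
K = 7  # fixed k-mer size; 20**7 = 1,280,000,000 < 2**32, so 4 bytes suffice


def encode_kmer(kmer):
    """Pack a 7-mer into 4 bytes as a base-20 integer (assumed encoding)."""
    value = 0
    for aa in kmer:
        value = value * 20 + AA_INDEX[aa]
    return value.to_bytes(4, "big")


def kmers(seq):
    """Yield every overlapping 7-mer made only of standard residues."""
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if all(aa in AA_INDEX for aa in kmer):
            yield kmer


def build_stores(proteins):
    """Build toy protein, k-mer and combination stores from {id: sequence}."""
    protein_store = dict(proteins)
    kmer_to_proteins = defaultdict(set)
    for pid, seq in proteins.items():
        for kmer in kmers(seq):
            kmer_to_proteins[encode_kmer(kmer)].add(pid)

    kmer_store, combination_store = {}, {}
    for key, pids in kmer_to_proteins.items():
        members = tuple(sorted(pids))
        # One combination key per distinct protein set, reused by every
        # k-mer that occurs in exactly that set of proteins.
        combo_key = hashlib.sha1(",".join(members).encode()).digest()[:8]
        combination_store[combo_key] = members
        kmer_store[key] = combo_key
    return protein_store, kmer_store, combination_store


def count_shared_kmers(query, kmer_store, combination_store):
    """Count, per target protein, the distinct query 7-mers it contains."""
    counts = defaultdict(int)
    for kmer in set(kmers(query)):
        combo_key = kmer_store.get(encode_kmer(kmer))
        if combo_key is not None:
            for pid in combination_store[combo_key]:
                counts[pid] += 1
    return dict(counts)
```

With three toy proteins, `build_stores({"p01": "MKTAYIAKQRQ", "p02": "MKTAYIAKQRA", "p03": "GGGGGGGGGG"})` produces seven k-mer keys but only four combination entries, because the four 7-mers common to p01 and p02 all point to the same combination.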

Figure 1

(a) Design of a kAAmer database. Three key-value stores are created within a database (K-mer Store, Combination Store, Protein Store). The colors indicate the combination (hash) values that are reused in the combination store. Proteins are numbered (p01, p02, p03) and k-mers are numbered (k01, k02, …, k08). (b) Protein search speed benchmark. Software includes Blastp (v2.9.0+), Ghostz (v1.0.2), Diamond (v0.9.25) and kAAmer (v0.6) with (-aln) and without (-kmatch) alignment. (c) Protein search precision and recall benchmark with the ECOD database. Blue bars indicate precision and red bars indicate recall. Software includes Blastp (v2.9.0+), Ghostz (v1.0.2), Diamond (v0.9.25) and kAAmer (v0.6) with alignment.

To evaluate the performance and accuracy of kAAmer, we built a speed and sensitivity benchmark against protein families (homology groups) from the ECOD database [18]. We evaluated four software tools: Blastp (v2.9.0+) [3], Ghostz (v1.0.2) [19], Diamond (v0.9.25) [5] and kAAmer (v0.6). Other interesting software that uses web servers for remote analysis, such as Sequenceserver [20] and MMseqs2 [21], is worth mentioning for the functionality it offers. However, we limited our benchmark to the four tools listed above because of their alignment efficiency and computational resource requirements. Note that we tested two modes for Diamond, the most sensitive and the fastest. Similarly, we tested two sensitivity modes with kAAmer, defined by the minimum number of shared k-mers: k10 (at least 10 k-mers shared between protein query and target) and k1 (at least 1 shared k-mer). For kAAmer, each sensitivity mode was tested both without alignment, where the ratio of shared k-mers serves as the scoring function, and with a subsequent alignment step. The purpose of alignment in kAAmer is to improve the scoring metrics while keeping the same set of results as the raw alignment-free method.
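The two sensitivity modes and the alignment-free score can be sketched as a simple filter followed by a ranking. This is an illustrative sketch only: the function name `rank_hits` is hypothetical, and normalising the shared count by the number of query k-mers is an assumption, since the text states only that the ratio of shared k-mers is used as the score.

```python
def rank_hits(shared_counts, n_query_kmers, min_shared=10):
    """Filter targets by shared k-mer count and rank them by shared ratio.

    shared_counts -- {target_id: number of 7-mers shared with the query}
    n_query_kmers -- number of 7-mers in the query (assumed denominator)
    min_shared    -- 10 mimics the k10 mode, 1 mimics the k1 mode
    """
    if n_query_kmers <= 0:
        raise ValueError("the query must contain at least one k-mer")
    hits = [
        (target, shared, shared / n_query_kmers)
        for target, shared in shared_counts.items()
        if shared >= min_shared
    ]
    # Highest ratio first; ties are broken by target id for a stable order.
    return sorted(hits, key=lambda hit: (-hit[2], hit[0]))


counts = {"t1": 42, "t2": 10, "t3": 3}
k10_hits = rank_hits(counts, n_query_kmers=50, min_shared=10)
k1_hits = rank_hits(counts, n_query_kmers=50, min_shared=1)
```

Here `k10_hits` keeps only `t1` and `t2`, whereas `k1_hits` also retains `t3`, which shows how lowering the threshold trades specificity for sensitivity.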

Figure 1c illustrates the results of the ROC curve of the sensitivity benchmark. We observed that Ghostz, Blastp and Diamond in sensitive mode reported, in that order, the highest numbers of true positives independent of false negatives. Next come kAAmer with a minimum of one shared k-mer (k1), Diamond in fast mode, and kAAmer with at least 10 shared k-mers (k10). The ROC curve also shows the difference in precision between the alignment-free mode of kAAmer and its alignment modes. One likely reason is the scoring scheme, which uses the percentage of shared k-mers rather than the bit score obtained from alignment results. Note that the minimum number of k-mer matches is a user-provided option for adjusting the sensitivity of the protein search. We also compared the running time of our database engine with that of the aforementioned software for different query dataset sizes. Thirteen different protein query datasets were randomly and uniquely chosen from the original ECOD database, ranging in size from 1 protein to 50,000 proteins. Figure 1b shows the wall-clock times of the alignment software versus kAAmer for protein homology searches. See the Methods section for the hardware used in the benchmarks. With the largest query dataset (50,000 proteins), kAAmer k10 in alignment-free mode completed the search in 46.5 s, while kAAmer k10 in alignment mode took 390.7 s. With a single shared k-mer (kAAmer k1), the most sensitive mode in kAAmer, run times were 78.8 s without alignment and 966.6 s with alignment. Diamond’s fast mode completed the same task in 64.7 s, while its sensitive mode took 287.7 s. Ghostz gave results similar to Diamond’s sensitive mode, while Blastp was significantly slower than the other software tested. At the maximum number of queries (50,000 proteins), kAAmer in its alignment-free mode achieves performance comparable to Diamond’s fast mode, although the results vary with the minimum number of k-mer matches used. kAAmer’s alignment mode adds overhead that increases the run time. Yet, in combination with a minimum k-mer match of 1, it provides better sensitivity at the expense of speed.
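To put these timings in perspective, the short calculation below converts a subset of the wall-clock times reported for the 50,000-protein query set into throughput and relative slowdown. The dictionary labels and helper names are illustrative; only the numbers come from the text.

```python
N_QUERIES = 50_000  # size of the largest query dataset

# Wall-clock times in seconds, as reported for the 50,000-protein query set.
WALL_CLOCK_S = {
    "kAAmer k10, alignment-free": 46.5,
    "kAAmer k1, alignment-free": 78.8,
    "kAAmer k1, alignment": 966.6,
    "Diamond fast": 64.7,
    "Diamond sensitive": 287.7,
}


def throughput(seconds, n_queries=N_QUERIES):
    """Return the number of query proteins processed per second."""
    return n_queries / seconds


def slowdown(slow, fast, timings=WALL_CLOCK_S):
    """Return how many times longer `slow` ran than `fast`."""
    return timings[slow] / timings[fast]


for name, seconds in sorted(WALL_CLOCK_S.items(), key=lambda item: item[1]):
    print(f"{name:28s} {seconds:7.1f} s  {throughput(seconds):7.0f} proteins/s")

# Cost of adding alignment on top of the k1 k-mer search.
print(f"k1 alignment overhead: {slowdown('kAAmer k1, alignment', 'kAAmer k1, alignment-free'):.1f}x")
```

By this arithmetic, alignment makes the k1 search roughly 12 times slower, and kAAmer k10 without alignment processes on the order of a thousand query proteins per second.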

To consider real-world use cases, we constructed relevant kAAmer databases and investigated their use in typical bacterial genomic analyses. It should be noted that genome annotation and gene identification strongly depend on the quality of the underlying database. What kAAmer offers is the inclusion of protein information in the database, combined with efficient search functionality, to facilitate downstream analyses. We therefore also provide utility scripts to illustrate these use cases. The first use case was to identify antibiotic resistance genes (ARGs) in a bacterial genome and to test kAAmer’s accuracy against other ARG-finding software. For ARG identification, we used the NCBI Bacterial Antimicrobial Resistance Reference Gene Database (v2020-01-06.1) [22] and compared kAAmer results with the ResFinder (v3.2 and database 2019-10-01) [23] and CARD (v5.1.0) [24] software and databases. The query genome is the pan-resistant Pseudomonas aeruginosa strain E6130952 [25]. Table 1 presents the ARGs identified within the query genome by the three software/database combinations tested. For the majority of antibiotic classes, the results are consistent between the three databases. Interestingly, three aminoglycoside genes (aac(6′)-He, ant(2″)-Ia and aacA8) were found only with kAAmer (NCBI-ARG) and ResFinder. On the other hand, many more antibiotic efflux systems are annotated in CARD: the number of efflux proteins identified in E6130952 rises to 36 with CARD, while only 3 were reported by kAAmer (NCBI-ARG) and none by ResFinder. In addition, 2 genes associated with resistance to peptide antibiotics (arnA, basS) and 2 others (soxR, carA) associated with several classes of antibiotics were reported only by CARD. Other use cases tested include genome annotation and metagenome profiling, as discussed in the Methods section.
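A comparison of this kind reduces to set operations on the gene calls of each tool. The sketch below uses only the gene names mentioned in the paragraph above, not the full content of Table 1, and the helper names are hypothetical.

```python
# Partial gene calls per tool, limited to the genes named in the text.
CALLS = {
    "kAAmer (NCBI-ARG)": {"aac(6′)-He", "ant(2″)-Ia", "aacA8"},
    "ResFinder": {"aac(6′)-He", "ant(2″)-Ia", "aacA8"},
    "CARD": {"arnA", "basS", "soxR", "carA"},
}


def exclusive_to(tool, calls=CALLS):
    """Return the genes reported by `tool` and by no other tool."""
    others = set().union(
        *(genes for name, genes in calls.items() if name != tool)
    )
    return calls[tool] - others


def shared_only_by(tools, calls=CALLS):
    """Return the genes reported by every tool in `tools` and by no other."""
    common = set.intersection(*(set(calls[name]) for name in tools))
    others = set().union(
        *(genes for name, genes in calls.items() if name not in tools)
    )
    return common - others


print(sorted(exclusive_to("CARD")))
print(sorted(shared_only_by(["kAAmer (NCBI-ARG)", "ResFinder"])))
```

The same two helpers apply unchanged to complete per-tool gene lists, which makes it easy to see which differences stem from database content rather than from the search engine.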

Table 1 Antibiotic resistance genes identified in the pan-resistant Pseudomonas aeruginosa strain E6130952 with the kAAmer + NCBI-ARG, ResFinder and CARD databases.

In summary, kAAmer introduces a fast and flexible protein database engine that accommodates different genomic analysis use cases. It can be hosted on-premises or in the cloud and queried remotely, while providing a flexible protein annotation scheme. Although it is less suitable for finding distant homology, its k-mer matching functionality makes it well suited to quickly finding close sequence homology, while providing rich annotations on the identified protein targets.

Maria H. Underwood