
Chapter 8. Vector Databases

  • Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
  • by David Spuler and Michael Sharpe

What are Vector Databases?

Vector databases are a modern type of database for semantic lookup. Rather than searching for keywords, a vector database indexes its documents using vector embeddings that represent the semantic meaning of their contents.

The main feature of a vector database is the ability to run a vector-based query. Basically, you are looking for how close vectors are to each other in N-dimensional space. Cosine similarity is a common comparison metric, but other approaches include nearest-neighbor search, Euclidean (least-squares) distance, and more.

The speed of the vector database lookup is obviously important for low inference latency. Each vector also needs to be associated with the actual data or document chunk in the vector database.

What is an Embedding?

An embedding is a numeric vector of about 500 to 5,000 elements. For RAG systems, embeddings are calculated by sending chunks of text to a special model, which returns vector embeddings. The embedding is a representation that works as a summary of the contents of the chunk of text.

Vector embeddings do not need to be restricted to just chunks of text. In other non-RAG use cases, embeddings could be calculated on image data, metric streams, and other types of data.

An embedding in a RAG system intends to capture the semantic concepts in chunks of text. It is the model’s interpretation of the text provided.
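
To make this concrete, here is a minimal sketch of computing embeddings for text chunks using the open-source sentence-transformers library; the model name and chunk texts are only illustrative placeholders, and any embedding model or commercial embeddings API could be substituted:

    # Minimal sketch: compute embedding vectors for text chunks.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small example model (384 dimensions)

    chunks = [
        "Vector databases store embeddings for semantic search.",
        "Cosine similarity measures the angle between two vectors.",
    ]
    embeddings = model.encode(chunks, normalize_embeddings=True)
    print(embeddings.shape)  # (2, 384): one 384-element vector per chunk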

What is a Vector Query?

There are a few mechanisms for querying with a vector as the input. The query is typically implemented as follows:

    1. The user asks a question.

    2. The vector embedding for the question is determined.

    3. That embedding vector is then used to query the vector database.

The vector query is along the lines of: what vectors exist in the database that are closest to the vector of the user query? It’s highly unlikely that an exact match can occur. So, instead of identifying a single match, typically the “closest” vector matches are what is desired. These matches are assumed to correspond to the chunks of text that are most likely to contain the answer to the user’s question. Typically, the top-k best matches are returned (using the k-nearest neighbors algorithm), or the top matches within a specific threshold on the similarity metric.

Vector similarity relies on the computation of a comparison metric. The most common vector comparisons used are:

  • Cosine Similarity
  • Euclidean Distance
  • Dot Product Similarity

All these methods involve calculating the dot product between vectors being compared. Cosine similarity appears to be the most popular mechanism currently used in vector databases.
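
To make these metrics concrete, here is a small NumPy sketch comparing two vectors; note that for unit-normalized vectors, cosine similarity and the dot product produce the same ranking:

    # Sketch of the three common vector comparison metrics using NumPy.
    import numpy as np

    a = np.array([0.1, 0.3, 0.5, 0.2])
    b = np.array([0.2, 0.1, 0.4, 0.4])

    dot = np.dot(a, b)                                       # dot product similarity
    cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine similarity
    euclidean = np.linalg.norm(a - b)                        # Euclidean distance

    print(dot, cosine, euclidean)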

One problem that can occur with vector queries is that the user’s question is often terse and the chunks in the vector database are more detailed. Due to the limited context in the user’s query, the vector embedding calculated from the user’s query may not match the embeddings of the chunks well.

This problem of short user queries can often be remedied by adding a summary of the recent conversation to the question. Another mechanism is to have the LLM generate a short or hypothetical answer to the question, and then use both the user question and that draft answer together to calculate the vector embedding that queries the vector DB. This is the HyDE (Hypothetical Document Embeddings) technique.
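
A rough sketch of the HyDE idea is shown below; llm_short_answer, embed, and vector_db.search are hypothetical placeholders for whatever LLM, embedding model, and vector database the application already uses:

    # Sketch of the HyDE technique: embed the question together with a
    # hypothetical LLM-generated answer, then query the vector database.
    # llm_short_answer(), embed(), and vector_db.search() are placeholders.

    def hyde_query(question, vector_db, top_k=5):
        hypothetical = llm_short_answer(question)     # short draft answer from the LLM
        combined = question + "\n" + hypothetical     # question plus hypothetical answer
        query_vector = embed(combined)                # embedding of the combined text
        return vector_db.search(query_vector, top_k)  # closest chunks in the vector DB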

Choosing a Vector Database

There are various options to consider for vector databases, including both commercial products and open source projects. Some well-known vector databases are:

  • Pinecone
  • Milvus
  • Chroma
  • Weaviate
  • Qdrant
  • FAISS

FAISS is an author’s favorite because it’s all in memory, and is quite good for quick prototyping, too. And it’s written in C++!
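
For example, a minimal in-memory FAISS index takes only a few lines; this sketch uses a flat (exact) inner-product index over random vectors purely to show the API:

    # Minimal FAISS sketch: build an in-memory index and run a top-k query.
    import faiss
    import numpy as np

    d = 384                                    # embedding dimensionality
    vectors = np.random.rand(10000, d).astype("float32")
    faiss.normalize_L2(vectors)                # normalize so inner product equals cosine

    index = faiss.IndexFlatIP(d)               # exact (brute-force) inner-product index
    index.add(vectors)

    query = np.random.rand(1, d).astype("float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)       # top-5 nearest vectors
    print(ids, scores)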

General database systems also have some vector capabilities, including (amongst numerous others):

  • Elastic (Elasticsearch/OpenSearch)
  • Redis
  • Postgres
  • MongoDB
  • Cosmos DB

These databases are all good choices, especially if your solution already uses these technologies for other reasons.

How Vector Searches Work

A vector database is effectively performing two separate steps in sequence:

    1. Text to embeddings vector mapping

    2. Embeddings vector to chunk mapping

The first part is not usually performed by the vector database as such, but by an embeddings model that is typically an external component. The output of this first phase is a vector containing the embedding data.

The second phase is a vector search, using the embeddings vector as the input query over the vector database of chunks. Vector search algorithms are an area of active research, and there are various ways to optimize them. Some of the algorithms for high-dimensionality searching in a vector space include:

  • Vector hashing
  • Locality-Sensitive Hashing (LSH)
  • Approximate Nearest Neighbor (ANN)
  • KD-trees
  • Hypercube
  • FAISS hybrid algorithms

Using different versions of these lookup algorithms is one way to optimize a vector database. However, this is a low-level implementation issue and there are many other approaches for optimizing a vector database.
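
Putting the two phases together, a retrieval step looks roughly like the sketch below, where embed() is the external embedding model, index is the vector index, and a simple dictionary maps vector IDs back to the original chunk text; all of these names are placeholders:

    # Sketch of the two-phase lookup: text -> embedding, then embedding -> chunks.
    # embed() and index.search() stand in for the application's own components.

    chunk_texts = {0: "First document chunk...", 1: "Second document chunk..."}

    def retrieve(question, index, top_k=3):
        query_vector = embed(question)                    # phase 1: text to embedding vector
        scores, ids = index.search(query_vector, top_k)   # phase 2: vector search over chunks
        return [chunk_texts[i] for i in ids[0]]           # map vector IDs back to chunk text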

Vector Database Speed Optimizations

Vector database lookups are an important step in the RAG chain, and can become a performance bottleneck if not properly tuned. The issue here is database tuning, which is a longstanding IT discipline, and most of the LLM inference optimization techniques are not applicable to vector databases as they don’t use LLMs.

The speed of a vector database can be optimized using methods such as:

  • Comparing multiple vector databases
  • Returning fewer chunks
  • Vector database indexes
  • Text cache (inference cache)
  • Dimensionality reduction (shorter vectors)
  • Batch processing
  • Vector search algorithm optimizations

An important point is that the LLM for embeddings can be much smaller than that used for the final RAG answers. Some optimizations include:

  • Embeddings caching (of queries)
  • Using a smaller LLM for embeddings
  • Quantization of embeddings model

Vector databases are like any other database in that the key issue is how their speed scales with the number of records. If you’ve only ingested a few chunks to test things, it’ll fly for sure. But once you’ve ingested the totality of your input documents, the scalability of the vector database (or lack thereof) will be the deciding factor.
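
As a simple example of the embeddings caching idea listed above, identical queries can skip the embedding model entirely; embed_query() is a placeholder for the real embedding call:

    # Sketch of a query-embedding cache: repeated queries reuse the cached vector.
    _embedding_cache = {}

    def cached_embedding(query_text):
        if query_text not in _embedding_cache:
            _embedding_cache[query_text] = embed_query(query_text)  # expensive model call
        return _embedding_cache[query_text]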

Indexing Vector Databases

One way to speed up your vector database is the same as any other database: better indexes. Open-source and commercial vector databases have multiple ways to improve their indexing performance. Some examples:

  • Hierarchical Navigable Small World (HNSW) graph-based indexing
  • Inverted File (IVF) index
  • FAISS hybrid indexes
  • Tree-based indexing (e.g., KD-trees)
  • Product Quantization (PQ) compressed vectors

The brute force indexing mechanism would be a “flat index.” This is effectively a linear scan where every query vector is compared to all vectors in the database. This guarantees 100% accuracy, but it is very slow for large databases.

Do not be so quick to dismiss this method, though. The speed is manageable for fewer than one million records. All of the other indexing methods are Approximate Nearest Neighbor (ANN) mechanisms, where a little accuracy is traded for substantial speedups.
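
Conceptually, a flat index is just the following linear scan, sketched here in NumPy over unit-normalized vectors so that the dot product acts as cosine similarity:

    # Sketch of a brute-force "flat index" scan: compare the query to every vector.
    import numpy as np

    def flat_search(query, vectors, top_k=5):
        scores = vectors @ query               # dot product against every stored vector
        top = np.argsort(-scores)[:top_k]      # indices of the top-k highest scores
        return top, scores[top]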

HNSW indexing. The HNSW method organizes vectors into a hierarchy of graphs (small-world graphs). At the uppermost levels, it is a very coarse and sparse graph of vectors. At each lower level, more vectors (nodes) and more connections (edges) are added; the edges link each vector to its nearest neighbors.

During a query, HNSW starts at the top and traverses down until it gets to the lowest-level, dense parts of the graph where it can then do a linear search. On a database of one million vectors, HNSW may only do a few hundred vector comparison computations. The accuracy is in the range of 95%-100%. The storage of the graphs does require substantially more space than the raw vector list.
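
In FAISS, for example, an HNSW index can be used roughly as follows; the parameter values are illustrative, not tuned recommendations:

    # Sketch of an HNSW index in FAISS (graph-based approximate search).
    import faiss
    import numpy as np

    d = 384
    vectors = np.random.rand(100000, d).astype("float32")

    index = faiss.IndexHNSWFlat(d, 32)         # 32 = graph neighbors per node
    index.hnsw.efConstruction = 200            # build-time search breadth
    index.add(vectors)

    index.hnsw.efSearch = 64                   # query-time breadth (speed/accuracy trade-off)
    query = np.random.rand(1, d).astype("float32")
    distances, ids = index.search(query, 10)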

IVF indexing. The IVF indexing method is based on clustering. First, all similar vectors are grouped into clusters, often using a k-means approach. Each cluster has a centroid and each vector in the cluster is assigned to its centroid.

At query time, the query vector is compared to each centroid to find the candidate clusters, and then a linear search is performed over the vectors within the candidate clusters. Again, this method excels at reducing the search space dramatically. In a database of one million vectors, a thousand clusters are typical, and only ten of those might be the nearest to the query vector. This leads to a search over about one percent of the actual vectors, but the accuracy is still good and this method scales well.
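
An IVF index follows the same pattern but needs a training step to learn the cluster centroids; a rough FAISS sketch, again with illustrative parameters:

    # Sketch of an IVF index in FAISS: cluster first, then search only nearby clusters.
    import faiss
    import numpy as np

    d = 384
    vectors = np.random.rand(100000, d).astype("float32")

    nlist = 1000                                # number of clusters (centroids)
    quantizer = faiss.IndexFlatL2(d)            # used to assign vectors to centroids
    index = faiss.IndexIVFFlat(quantizer, d, nlist)

    index.train(vectors)                        # k-means clustering of the vectors
    index.add(vectors)

    index.nprobe = 10                           # search only the 10 closest clusters
    query = np.random.rand(1, d).astype("float32")
    distances, ids = index.search(query, 10)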

KD-Tree indexing. KD-Trees are also often used for vector database indexing. This is a well-known structure for representing “spatial” data. This is a tree similar to a binary search tree but in higher dimensions. It is well known for nearest-neighbor searches. As can be imagined, just like with a binary tree, huge parts of the tree can be eliminated with a single comparison, resulting in far better performance than a linear search.
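
A quick way to experiment with KD-tree lookups is SciPy's cKDTree; note that KD-trees work best at relatively low dimensionality, so this is only an illustrative sketch:

    # Sketch of KD-tree nearest-neighbor search using SciPy.
    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(10000, 16)          # 16-dimensional points (KD-trees prefer low dimensions)
    tree = cKDTree(points)

    query = np.random.rand(16)
    distances, ids = tree.query(query, k=5)     # the 5 nearest neighbors of the query point
    print(ids, distances)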

Product Quantization. The PQ indexing method takes a different approach. Instead of organizing vectors based on a clustering measure, it simplifies the vectors to allow substantially faster comparisons. The idea is to take the high-dimensional embedding vectors and make them smaller by quantizing them.

One method is to break each vector into disjoint subvectors and quantize each subvector separately, so that each subvector is assigned a short code, called a PQ code, that relates back to the original vector. At query time, the query vector is broken up the same way and its PQ codes are identified. These codes guide the search until a small linear search can be done over the small set of related vectors with matching PQ codes.
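
Product quantization is also available directly in FAISS; the sketch below compresses each vector into 8-bit codes over 48 subvectors (the parameters are illustrative):

    # Sketch of a Product Quantization (PQ) index in FAISS.
    import faiss
    import numpy as np

    d = 384
    vectors = np.random.rand(100000, d).astype("float32")

    m = 48                                      # number of subvectors (must divide d evenly)
    nbits = 8                                   # bits per subvector code (256 centroids each)
    index = faiss.IndexPQ(d, m, nbits)

    index.train(vectors)                        # learn the subvector codebooks
    index.add(vectors)                          # vectors are stored as compressed PQ codes

    query = np.random.rand(1, d).astype("float32")
    distances, ids = index.search(query, 10)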

There are some even more extreme forms of quantization. For example, each element of an embedding is typically 32 bits, representing a number between -1 and 1. Reducing the precision of numeric values is a common technique in LLMs, where a smaller number of bits leads to faster comparisons. The absolute extreme is a binary quantization, where each element is reduced to a single bit (instead of 32 bits). When single bits are involved, numerical calculations become bitwise calculations and can be super-fast, but this can lose a significant amount of accuracy. Moderate approaches with some efficiency improvement but less accuracy degradation include 4-bit, 8-bit, or 16-bit quantization.
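
Binary quantization of an embedding can be sketched in a few lines of NumPy, with the Hamming distance (count of differing bits) standing in for the similarity computation:

    # Sketch of binary quantization: keep only the sign bit of each element.
    import numpy as np

    def binarize(vec):
        return np.packbits(vec > 0)             # 1 bit per element, packed into bytes

    def hamming_distance(code_a, code_b):
        # Count differing bits between two packed bit codes (lower = more similar).
        return np.unpackbits(np.bitwise_xor(code_a, code_b)).sum()

    a = np.random.randn(384)
    b = np.random.randn(384)
    print(hamming_distance(binarize(a), binarize(b)))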

Vector Database Optimization Research

Research papers on techniques to optimize vector databases:

  1. Dr. Ashish Bamania, Jun 18, 2024, Google’s New Algorithms Just Made Searching Vector Databases Faster Than Ever: A Deep Dive into how Google’s ScaNN and SOAR Search algorithms supercharge the performance of Vector Databases, https://levelup.gitconnected.com/googles-new-algorithms-just-made-searching-vector-databases-faster-than-ever-36073618d078
  2. James Jie Pan, Jianguo Wang, Guoliang Li, 21 Oct 2023, Survey of Vector Database Management Systems, https://arxiv.org/abs/2310.14021 https://link.springer.com/article/10.1007/s00778-024-00864-x
  3. Yikun Han, Chunjiang Liu, Pengfei Wang, 18 Oct 2023, A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge, https://arxiv.org/abs/2310.11703
  4. Yunxiao Shi, Xing Zi, Zijing Shi, Haimin Zhang, Qiang Wu, Min Xu, 15 Jul 2024, Enhancing Retrieval and Managing Retrieval: A Four-Module Synergy for Improved Quality and Efficiency in RAG Systems, https://arxiv.org/abs/2407.10670, Code: https://github.com/Ancientshi/ERM4
  5. Harvey Bower, 2024, Debugging RAG Pipelines: Best Practices for High-Performance LLMs, https://www.amazon.com/dp/B0DNWN5RB1 (Great book on debugging problems with RAG architecture accuracy and latency.)
  6. Pinecone, June 2025 (accessed), Similarity Search, Choosing the Right Index, https://www.pinecone.io/learn/series/faiss/vector-indexes/
  7. Milvus, June 2025 (accessed), How does indexing work in a vector DB (IVF, HNSW, PQ, etc.)? https://milvus.io/ai-quick-reference/how-does-indexing-work-in-a-vector-db-ivf-hnsw-pq-etc
  8. M K Pavan Kumar, June 2, 2024, Quantization Techniques in Vector Embeddings — Practical Approach, https://medium.com/stackademic/quantization-techniques-in-vector-embeddings-practical-approach-7f7383767c68

Survey and review papers on vector databases:

  1. James Jie Pan, Jianguo Wang, Guoliang Li, 21 Oct 2023, Survey of Vector Database Management Systems, https://arxiv.org/abs/2310.14021 https://link.springer.com/article/10.1007/s00778-024-00864-x
  2. Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, Min Zhang, 6 Feb 2024 (v2), When Large Language Models Meet Vector Databases: A Survey, https://arxiv.org/abs/2402.01763
  3. Yikun Han, Chunjiang Liu, Pengfei Wang, 18 Oct 2023, A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge, https://arxiv.org/abs/2310.11703
  4. Toni Taipalus, 2024, Vector database management systems: Fundamental concepts, use-cases, and current challenges, Cognitive Systems Research, Volume 85, 101216, ISSN 1389-0417, https://doi.org/10.1016/j.cogsys.2024.101216 https://www.sciencedirect.com/science/article/pii/S1389041724000093

General research papers on vector databases:

  1. Sebastian Bruch, Jan 2024, Foundations of Vector Retrieval, https://arxiv.org/abs/2401.09350 (Extensive 200+ pages review of vector lookup data structures such as LSH and clustering.)
  2. Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang, 1 Jul 2024, Searching for Best Practices in Retrieval-Augmented Generation, https://arxiv.org/abs/2407.01219 Project: https://github.com/FudanDNN-NLP/RAG (Attempts to optimize the entire RAG system, including the various options for different RAG modules in the RAG pipeline, such as optimal methods for chunking, retrieval, embedding models, vector databases, prompt compression, reranking, repacking, summarizers, and other components.)
  3. Dr. Ashish Bamania, Jun 18, 2024, Google’s New Algorithms Just Made Searching Vector Databases Faster Than Ever: A Deep Dive into how Google’s ScaNN and SOAR Search algorithms supercharge the performance of Vector Databases, https://levelup.gitconnected.com/googles-new-algorithms-just-made-searching-vector-databases-faster-than-ever-36073618d078
  4. Chips Ahoy Capital, Jul 02, 2024, Evolution of Databases in the World of AI Apps, https://chipsahoycapital.substack.com/p/evolution-of-databases-in-the-world
  5. Donald Farmer, 8 Aug 2024, 10 top vector database options for similarity searches, https://www.techtarget.com/searchdatamanagement/tip/Top-vector-database-options-for-similarity-searches
  6. Pere Martra, Aug 2024 (accessed), Implementing semantic cache to improve a RAG system with FAISS, https://huggingface.co/learn/cookbook/semantic_cache_chroma_vector_database
  7. Richmond Alake, Apoorva Joshi, Aug 14, 2024, Adding Semantic Caching and Memory to Your RAG Application Using MongoDB and LangChain, MongoDB, https://www.mongodb.com/developer/products/atlas/advanced-rag-langchain-mongodb/
  8. James Jie Pan, Jianguo Wang, Guoliang Li, 21 Oct 2023, Survey of Vector Database Management Systems, https://arxiv.org/abs/2310.14021 https://link.springer.com/article/10.1007/s00778-024-00864-x
  9. Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, Min Zhang, 6 Feb 2024 (v2), When Large Language Models Meet Vector Databases: A Survey, https://arxiv.org/abs/2402.01763
  10. Yikun Han, Chunjiang Liu, Pengfei Wang, 18 Oct 2023, A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge, https://arxiv.org/abs/2310.11703
  11. Toni Taipalus, 2024, Vector database management systems: Fundamental concepts, use-cases, and current challenges, Cognitive Systems Research, Volume 85, 101216, ISSN 1389-0417, https://doi.org/10.1016/j.cogsys.2024.101216 https://www.sciencedirect.com/science/article/pii/S1389041724000093
  12. Zhi Yao, Zhiqing Tang, Jiong Lou, Ping Shen, Weijia Jia, 19 Jun 2024, VELO: A Vector Database-Assisted Cloud-Edge Collaborative LLM QoS Optimization Framework, https://arxiv.org/abs/2406.13399
  13. David Spuler, March 2024, Vector Databases, in Generative AI in C++, https://www.aussieai.com/book/ch6-vector-databases
  14. David Spuler, March 2024, Semantic Caching and Vector Databases, in Generative AI in C++, https://www.aussieai.com/book/ch29-semantic-caching-vector-databases
  15. Chirag Agrawal, Sep 20, 2024, Unlocking the Power of Efficient Vector Search in RAG Applications, https://pub.towardsai.net/unlocking-the-power-of-efficient-vector-search-in-rag-applications-c2e3a0c551d5
  16. F Sundh, Oct 2024, Evaluating the efficacy of modality conversion in vector databases, Bachelor’s Thesis, Computer Science and Engineering, Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, https://www.diva-portal.org/smash/get/diva2:1905628/FULLTEXT01.pdf
  17. Tolga Şakar and Hakan Emekci, 30 October 2024, Maximizing RAG efficiency: A comparative analysis of RAG methods, Natural Language Processing. doi:10.1017/nlp.2024.53, https://www.cambridge.org/core/journals/natural-language-processing/article/maximizing-rag-efficiency-a-comparative-analysis-of-rag-methods/D7B259BCD35586E04358DF06006E0A85 https://www.cambridge.org/core/services/aop-cambridge-core/content/view/D7B259BCD35586E04358DF06006E0A85/S2977042424000530a.pdf/div-class-title-maximizing-rag-efficiency-a-comparative-analysis-of-rag-methods-div.pdf
  18. Sonal Prabhune, Donald J. Berndt, 7 Nov 2024, Deploying Large Language Models With Retrieval Augmented Generation, https://arxiv.org/abs/2411.11895
  19. Matvey Arye, Avthar Sewrathan, 29 Oct 2024, Vector Databases Are the Wrong Abstraction, https://www.timescale.com/blog/vector-databases-are-the-wrong-abstraction/

Vector Search Theory

There are various low-level papers on using hashing and other search optimizations for computations involving vectors and tensors of higher dimensions. One of the main techniques is Locality-Sensitive Hashing (LSH), which is hashing to find vectors that are “close” in n-dimensional space. For example, LSH can also be used to hash vectors for caching vector dot products as an optimization of matrix multiplication.
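
The simplest LSH family for cosine similarity is random hyperplane hashing (sometimes called SimHash): nearby vectors tend to fall on the same side of randomly chosen hyperplanes and therefore receive the same hash bits. A minimal NumPy sketch:

    # Sketch of random-hyperplane LSH: similar vectors tend to hash to the same bucket.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 384                                     # vector dimensionality
    n_bits = 16                                 # number of hash bits (random hyperplanes)
    planes = rng.standard_normal((n_bits, d))

    def lsh_hash(vec):
        return tuple((planes @ vec) > 0)        # one bit per hyperplane: which side the vector falls on

    # Vectors that land in the same bucket are candidates for a "close" match.
    buckets = {}
    for i, vec in enumerate(rng.standard_normal((1000, d))):
        buckets.setdefault(lsh_hash(vec), []).append(i)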

Research papers on vector-level hashing and search optimizations in neural networks:

  1. Gionis, Aristides, Indyk, Piotr, and Motwani, Rajeev, 1999, Similarity search in high dimensions via hashing, In Atkinson, Malcolm P., Orlowska, Maria E., Valduriez, Patrick, Zdonik, Stanley B., and Brodie, Michael L. (eds.), Proceedings of the 25th International Conference on Very Large Data Bases, pp. 518–529. Morgan Kaufmann, 1999, https://dl.acm.org/doi/10.5555/645925.671516
  2. Jaiyam Sharma, Saket Navlakha, Dec 2018, Improving Similarity Search with High-dimensional Locality-sensitive Hashing, https://arxiv.org/abs/1812.01844
  3. A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt, 2015, Practical and optimal LSH for angular distance, in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, (Cambridge, MA, USA), pp. 1225–1233, MIT Press, 2015. https://arxiv.org/abs/1509.02897
  4. M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, 2004, Locality-sensitive hashing scheme based on p-stable distributions, in Proceedings of the Twentieth Annual Symposium on Computational Geometry, (New York, NY, USA), pp. 253–262, ACM, 2004. https://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/p253-datar.pdf
  5. P. Indyk and R. Motwani, 1998, Approximate nearest neighbors: towards removing the curse of dimensionality, in Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613, 1998. https://www.theoryofcomputing.org/articles/v008a014/v008a014.pdf
  6. A. Andoni and P. Indyk, 2006, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, in Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pp. 459–468, 2006. http://web.mit.edu/andoni/www/papers/cSquared.pdf
  7. K. Terasawa and Y. Tanaka, 2007, Spherical LSH for approximate nearest neighbor search on unit hypersphere, in Workshop on Algorithms and Data Structures, pp. 27–38, 2007. https://dl.acm.org/doi/10.5555/2394893.2394899
  8. Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis, 2020, SMYRF: Efficient Attention using Asymmetric Clustering, Advances in Neural Information Processing Systems, 33:6476–6489, 2020. https://arxiv.org/abs/2010.05315 (LSH used in attention algorithms.)
  9. Chinnadhurai Sankar, Sujith Ravi, and Zornitsa Kozareva. 2021. ProFormer: Towards On-Device LSH Projection Based Transformers, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2823– 2828, Online. Association for Computational Linguistics. https://arxiv.org/abs/2004.05801 (LSH used for embedding vectors.)
  10. J Zhang, 2023, Quantization for High-dimensional Data and Neural Networks: Theory and Algorithms, Ph.D. Thesis, University of California, San Diego, https://escholarship.org/content/qt9bd2k7gf/qt9bd2k7gf.pdf
  11. Gonzalez TF, 1985, Clustering to minimize the maximum intercluster distance, Theoretical computer science 38:293–306, PDF: https://sites.cs.ucsb.edu/~teo/papers/TCS-ktmm.pdf (Approximate algorithm for vector clustering.)
  12. Chen, B. and Shrivastava, A., 2018, Densified Winner Take All (WTA) Hashing for Sparse Datasets, Uncertainty in artificial intelligence, http://auai.org/uai2018/proceedings/papers/321.pdf (Uses hashing related to LSH.)
  13. Hackerllama, January 7, 2024, Sentence Embeddings. Introduction to Sentence Embeddings, https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/
  14. M Sponner, B Waschneck, A Kumar , 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys,, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  15. David Spuler, March 2024, Chapter 18. Parallel Data Structures, in Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  16. Yikun Han, Chunjiang Liu, Pengfei Wang, 18 Oct 2023, A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge, https://arxiv.org/abs/2310.11703
  17. Ala-Eddine Benrazek, Zineddine Kouahla, Brahim Farou, Hamid Seridi, Ibtissem Kemouguette, 28 Aug 2024, Efficient k-NN Search in IoT Data: Overlap Optimization in Tree-Based Indexing Structures, https://arxiv.org/abs/2408.16036
  18. David Spuler, March 2024, Vector Hashing, in Generative AI in C++, https://www.aussieai.com/book/ch18-vector-hasing
  19. Duy-Thanh Nguyen, Abhiroop Bhattacharjee, Abhishek Moitra, Priyadarshini Panda, 9 Feb 2023, DeepCAM: A Fully CAM-based Inference Accelerator with Variable Hash Lengths for Energy-efficient Deep Neural Networks, https://arxiv.org/abs/2302.04712
  20. Aditya Desai, Shuo Yang, Alejandro Cuadron, Ana Klimovic, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica, 19 Dec 2024, HashAttention: Semantic Sparsity for Faster Inference, https://arxiv.org/abs/2412.14468
  21. Uri Merhav, Dec 30, 2024, Vector Similarity Search is Hopeless. A pre-baked notion of similarity is inherently flawed and doomed to fail. We can do better, https://urimerhav.medium.com/vector-similarity-search-is-hopeless-7251a855b4bd

Vector Quantization

Vector quantization is a longstanding ML technique that pre-dates all of the Transformer work, so there are many early papers on the topic. Vector quantization is related to other vector methods such as nearest-neighbor search, and is used for the analysis of embedding vectors and semantic similarity, amongst many other applications.

Research papers on vector quantization:

  1. Sebastian Bruch, Jan 2024, Foundations of Vector Retrieval, https://arxiv.org/abs/2401.09350 (Extensive 200+ pages review of vector lookup data structures such as LSH and clustering.)
  2. Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh, Jan 2024, Extreme Compression of Large Language Models via Additive Quantization, https://arxiv.org/abs/2401.06118
  3. W Jiang, P Liu, F Wen, 2017, An improved vector quantization method using deep neural network, AEU, Volume 72, February 2017, Pages 178-183, https://www.sciencedirect.com/science/article/pii/S1434841116313954
  4. Zicheng Liu, Li Wang, Siyuan Li, Zedong Wang, Haitao Lin, Stan Z. Li, 18 Apr 2024 (v2), LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory, https://arxiv.org/abs/2404.11163
  5. Siyuan Li, Zedong Wang, Zicheng Liu, Di Wu, Cheng Tan, Jiangbin Zheng, Yufei Huang, Stan Z. Li, 2 Jun 2024 (v2), VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling, https://arxiv.org/abs/2405.10812
  6. Yunchao Gong, Liu Liu, Ming Yang, Lubomir Bourdev, 18 Dec 2014, Compressing Deep Convolutional Networks using Vector Quantization, https://arxiv.org/abs/1412.6115 (A very early paper on vector quantization of CNNs that has been cited many times.)
  7. Qijiong Liu, Xiaoyu Dong, Jiaren Xiao, Nuo Chen, Hengchang Hu, Jieming Zhu, Chenxu Zhu, Tetsuya Sakai, Xiao-Ming Wu, 6 May 2024, Vector Quantization for Recommender Systems: A Review and Outlook, https://arxiv.org/abs/2405.03110 (Survey paper on vector quantization.)
  8. Yanjun Zhao, Tian Zhou, Chao Chen, Liang Sun, Yi Qian, Rong Jin, 8 Feb 2024, Sparse-VQ Transformer: An FFN-Free Framework with Vector Quantization for Enhanced Time Series Forecasting, https://arxiv.org/abs/2402.05830
  9. James Jie Pan, Jianguo Wang, Guoliang Li, 21 Oct 2023, Survey of Vector Database Management Systems, https://arxiv.org/abs/2310.14021 https://link.springer.com/article/10.1007/s00778-024-00864-x
  10. Or Sharir, Anima Anandkumar, 27 Jul 2023, Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs, https://arxiv.org/abs/2307.14988
  11. David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
  12. Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
  13. Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang, 25 Sep 2024, VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models, https://arxiv.org/abs/2409.17066 https://arxiv.org/pdf/2409.17066
  14. Christopher Fifty, Ronald G. Junkins, Dennis Duan, Aniketh Iger, Jerry W. Liu, Ehsan Amid, Sebastian Thrun, Christopher Ré, 8 Oct 2024, Restructuring Vector Quantization with the Rotation Trick, https://arxiv.org/abs/2410.06424 https://github.com/cfifty/rotation_trick
  15. Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos, 12 Dec 2024, Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries, https://arxiv.org/abs/2412.08890 https://github.com/krafton-ai/lexico (Sparsification of KV cache in prefill, using INT8 and vector lookup in a dictionary of predefined vectors.)
  16. Yuzhuang Xu, Shiyu Ji, Qingfu Zhu, Wanxiang Che, 12 Dec 2024, CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of LLMs, https://arxiv.org/abs/2412.09282 (Vector quantization of low-bit or 1-bit weight vectors, with additional bits for some channels, analogous to combining mixed-precision quantization and/or weight clustering.)
  17. Taehee Jeong, 17 Jan 2025, 4bit-Quantization in Vector-Embedding for RAG, https://arxiv.org/abs/2501.10534 https://github.com/taeheej/4bit-Quantization-in-Vector-Embedding-for-RAG
  18. 18 Jan 2025, LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator, Guoyu Li, Shengyu Ye, Chunyun Chen, Yang Wang, Fan Yang, Ting Cao, Cheng Liu, Mohamed M. Sabry, Mao Yang, https://arxiv.org/abs/2501.10658 (Extremely low-bit quantization below 1-bit (!) with vector quantization to table lookup.)
  19. Zihan Liu, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou, Yue Guan, Cong Guo, Weihao Cui, Yu Feng, Minyi Guo, Yuhao Zhu, Minjia Zhang, Jingwen Leng, Chen Jin, 4 Mar 2025, VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference, https://arxiv.org/abs/2503.02236

 
