GPTCACHE: A LIBRARY FOR CREATING SEMANTIC CACHE FOR LLM QUERIES


CONTENTS

 * Quick Install
 * 🚀 What is GPTCache?
 * 😊 Quick Start
   * dev install
   * example usage
 * 🎓 Bootcamp
 * 😎 What can this help with?
 * 🤔 How does it work?
 * 🤗 Modules
 * 😇 Roadmap
 * 😍 Contributing





Slash Your LLM API Costs by 10x 💰, Boost Speed by 100x ⚡

🎉 GPTCache has been fully integrated with 🦜️🔗 LangChain! Here are detailed
usage instructions.

🐳 The GPTCache server Docker image has been released, which means that any
language will be able to use GPTCache!

📔 This project is undergoing swift development, and as such, the API may be
subject to change at any time. For the most up-to-date information, please refer
to the latest documentation and release notes.


QUICK INSTALL

pip install gptcache


🚀 WHAT IS GPTCACHE?

ChatGPT and various large language models (LLMs) boast incredible versatility,
enabling the development of a wide range of applications. However, as your
application grows in popularity and encounters higher traffic levels, the
expenses related to LLM API calls can become substantial. Additionally, LLM
services might exhibit slow response times, especially when dealing with a
significant number of requests.

To tackle this challenge, we have created GPTCache, a project dedicated to
building a semantic cache for storing LLM responses.


😊 QUICK START

Note:

 * You can quickly try GPTCache and put it into a production environment without
   extensive integration work. However, please note that the repository is still
   under heavy development.

 * By default, only a limited number of libraries are installed to support the
   basic cache functionalities. When you need to use additional features, the
   related libraries will be automatically installed.

 * Make sure that the Python version is 3.8.1 or higher, check: python --version

 * If you encounter issues installing a library due to a low pip version, run:
   python -m pip install --upgrade pip.


DEV INSTALL

# clone GPTCache repo
git clone -b dev https://github.com/zilliztech/GPTCache.git
cd GPTCache

# install the repo
pip install -r requirements.txt
python setup.py install




EXAMPLE USAGE

These examples will help you understand how to use exact and similar matching
with caching. You can also run the examples on Colab. For more examples, refer
to the Bootcamp.

Before running the example, make sure the OPENAI_API_KEY environment variable is
set by executing echo $OPENAI_API_KEY.

If it is not already set, it can be set with export OPENAI_API_KEY=YOUR_API_KEY
on Unix/Linux/macOS systems, or set OPENAI_API_KEY=YOUR_API_KEY on Windows
systems.

> It is important to note that this method is only effective temporarily, so if
> you want a permanent effect, you'll need to modify the environment variable
> configuration file. For instance, on a Mac, you can modify the file located at
> /etc/profile.
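
If you prefer to verify the key from Python before running the examples, a minimal
check looks like this (plain standard-library code, no GPTCache APIs involved):

import os

# Fail early with a clear message if the key is missing from the environment.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the examples.")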


OpenAI API original usage

import os
import time

import openai


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


question = "what's chatgpt"

# OpenAI API original usage
openai.api_key = os.getenv("OPENAI_API_KEY")
start_time = time.time()
response = openai.ChatCompletion.create(
  model='gpt-3.5-turbo',
  messages=[
    {
        'role': 'user',
        'content': question
    }
  ],
)
print(f'Question: {question}')
print("Time consuming: {:.2f}s".format(time.time() - start_time))
print(f'Answer: {response_text(response)}\n')



OpenAI API + GPTCache, exact match cache

> If you ask ChatGPT the exact same question twice, the answer to the second
> question will be obtained from the cache without requesting ChatGPT again.

import time


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']

print("Cache loading.....")

# To use GPTCache, that's all you need
# -------------------------------------------------
from gptcache import cache
from gptcache.adapter import openai

cache.init()
cache.set_openai_key()
# -------------------------------------------------

question = "what's github"
for _ in range(2):
    start_time = time.time()
    response = openai.ChatCompletion.create(
      model='gpt-3.5-turbo',
      messages=[
        {
            'role': 'user',
            'content': question
        }
      ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')



OpenAI API + GPTCache, similar search cache

> After obtaining an answer from ChatGPT in response to several similar
> questions, the answers to subsequent questions can be retrieved from the cache
> without the need to request ChatGPT again.

import time


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']

from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

print("Cache loading.....")

onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    )
cache.set_openai_key()

questions = [
    "what's github",
    "can you explain what GitHub is",
    "can you tell me more about GitHub",
    "what is the purpose of GitHub"
]

for question in questions:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')



OpenAI API + GPTCache, use temperature

> You can always pass a temperature parameter while requesting the API service
> or model.
> 
> The range of temperature is [0, 2]; the default value is 0.0.
> 
> A higher temperature means a higher probability of skipping the cache search
> and requesting the large model directly. When temperature is 2, it will skip
> the cache and send the request to the large model directly for sure. When
> temperature is 0, it will search the cache before requesting the large model
> service.
> 
> The default post_process_messages_func is temperature_softmax. In this case,
> refer to the API reference to learn how temperature affects the output.

import time

from gptcache import cache, Config
from gptcache.manager import manager_factory
from gptcache.embedding import Onnx
from gptcache.processor.post import temperature_softmax
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from gptcache.adapter import openai

cache.set_openai_key()

onnx = Onnx()
data_manager = manager_factory("sqlite,faiss", vector_params={"dimension": onnx.dimension})

cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    post_process_messages_func=temperature_softmax
    )
cache.config = Config(similarity_threshold=0.2)

question = "what's github"

for _ in range(3):
    start = time.time()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=2.0,  # Change temperature here
        messages=[{
            "role": "user",
            "content": question
        }],
    )
    print("Time elapsed:", round(time.time() - start, 3))
    print("Answer:", response["choices"][0]["message"]["content"])



To use GPTCache exclusively, only the following lines of code are required, and
there is no need to modify any existing code.

from gptcache import cache
from gptcache.adapter import openai

cache.init()
cache.set_openai_key()



More Docs:

 * Usage, how to use GPTCache better

 * Features, all features currently supported by the cache

 * Examples, learn how to better customize your cache


🎓 BOOTCAMP

 * GPTCache with LangChain
   
   * QA Generation
   
   * Question Answering
   
   * SQL Chain
   
   * BabyAGI User Guide

 * GPTCache with Llama_index
   
   * WebPage QA

 * GPTCache with OpenAI
   
   * Chat completion
   
   * Language Translation
   
   * SQL Translate
   
   * Twitter Classifier
   
   * Multimodal: Image Generation
   
   * Multimodal: Speech to Text

 * GPTCache with Replicate
   
   * Visual Question Answering

 * GPTCache with Temperature Param
   
   * OpenAI Chat
   
   * OpenAI Image Creation


😎 WHAT CAN THIS HELP WITH?

GPTCache offers the following primary benefits:

 * Decreased expenses: Most LLM services charge fees based on a combination of
   the number of requests and the token count. GPTCache effectively minimizes
   your expenses by caching query results, which in turn reduces the number of
   requests and tokens sent to the LLM service. As a result, you can enjoy a
   more cost-efficient experience when using the service.

 * Enhanced performance: LLMs employ generative AI algorithms to generate
   responses in real-time, a process that can sometimes be time-consuming.
   However, when a similar query is cached, the response time significantly
   improves, as the result is fetched directly from the cache, eliminating the
   need to interact with the LLM service. In most situations, GPTCache can also
   provide superior query throughput compared to standard LLM services.

 * Adaptable development and testing environment: As a developer working on LLM
   applications, you're aware that connecting to LLM APIs is generally
   necessary, and comprehensive testing of your application is crucial before
   moving it to a production environment. GPTCache provides an interface that
   mirrors LLM APIs and accommodates storage of both LLM-generated and mocked
   data (see the sketch after this list). This feature enables you to
   effortlessly develop and test your application, eliminating the need to
   connect to the LLM service.

 * Improved scalability and availability: LLM services frequently enforce rate
   limits, which are constraints that APIs place on the number of times a user
   or client can access the server within a given timeframe. Hitting a rate
   limit means that additional requests will be blocked until a certain period
   has elapsed, leading to a service outage. With GPTCache, you can easily scale
   to accommodate an increasing volume of queries, ensuring consistent
   performance as your application's user base expands.
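
As a rough illustration of the development-and-testing point above, the sketch
below assumes that GPTCache exposes simple put/get helpers in gptcache.adapter.api
for writing mocked answers into the cache and reading them back, and that
get_prompt from gptcache.processor.pre keys the cache on the raw prompt string;
treat these import paths and signatures as assumptions and confirm them against
the API reference.

from gptcache import cache
from gptcache.adapter.api import put, get      # assumed helper API; see the API reference
from gptcache.processor.pre import get_prompt  # assumed pre-processor that keys on the raw prompt

cache.init(pre_embedding_func=get_prompt)

# Store a mocked answer, then read it back without ever calling an LLM service.
put("what is github", "GitHub is a code hosting platform.")  # assumed signature: put(prompt, answer)
print(get("what is github"))                                 # assumed signature: get(prompt)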


🤔 HOW DOES IT WORK?

Online services often exhibit data locality, with users frequently accessing
popular or trending content. Cache systems take advantage of this behavior by
storing commonly accessed data, which in turn reduces data retrieval time,
improves response times, and eases the burden on backend servers. Traditional
cache systems typically utilize an exact match between a new query and a cached
query to determine if the requested content is available in the cache before
fetching the data.

However, using an exact match approach for LLM caches is less effective due to
the complexity and variability of LLM queries, resulting in a low cache hit
rate. To address this issue, GPTCache adopts alternative strategies like semantic
caching. Semantic caching identifies and stores similar or related queries,
thereby increasing cache hit probability and enhancing overall caching
efficiency.

GPTCache employs embedding algorithms to convert queries into embeddings and
uses a vector store for similarity search on these embeddings. This process
allows GPTCache to identify and retrieve similar or related queries from the
cache storage, as illustrated in the Modules section.
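
In simplified terms, a lookup follows the flow sketched below. This is an
illustrative outline rather than GPTCache's actual code path, and every helper
name in it is hypothetical.

# Hypothetical sketch of a semantic-cache lookup; not GPTCache's real internals.
def semantic_lookup(query, embed, vector_store, cache_storage, evaluate, threshold):
    emb = embed(query)                              # 1. turn the query into an embedding
    candidates = vector_store.search(emb, top_k=1)  # 2. nearest-neighbour search
    for candidate in candidates:
        cached = cache_storage.load(candidate.id)   # 3. fetch the cached question/answer pair
        score = evaluate(query, cached.question)    # 4. similarity evaluation
        if score >= threshold:
            return cached.answer                    # cache hit
    return None                                     # cache miss: fall through to the LLM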

Featuring a modular design, GPTCache makes it easy for users to customize their
own semantic cache. The system offers various implementations for each module,
and users can even develop their own implementations to suit their specific
needs.

In a semantic cache, you may encounter false positives during cache hits and
false negatives during cache misses. GPTCache offers three metrics to gauge its
performance, which are helpful for developers to optimize their caching systems:

 * Hit Ratio: This metric quantifies the cache's ability to fulfill content
   requests successfully, compared to the total number of requests it receives.
   A higher hit ratio indicates a more effective cache.

 * Latency: This metric measures the time it takes for a query to be processed
   and the corresponding data to be retrieved from the cache. Lower latency
   signifies a more efficient and responsive caching system.

 * Recall: This metric represents the proportion of queries served by the cache
   out of the total number of queries that should have been served by the cache.
   Higher recall percentages indicate that the cache is effectively serving the
   appropriate content.

A sample benchmark is included as a starting point for assessing the performance
of your semantic cache.
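
As a concrete example of the definitions above: if a benchmark run sends 1,000
requests, 600 of them are answered from the cache, and 750 of them could in
principle have been answered from the cache, the metrics work out as plain
arithmetic (the numbers here are made up for illustration):

total_requests = 1000   # all queries sent during the benchmark
cache_hits     = 600    # queries actually answered from the cache
answerable     = 750    # queries that an ideal cache could have served

hit_ratio = cache_hits / total_requests   # 0.60
recall    = cache_hits / answerable       # 0.80
print(f"hit ratio={hit_ratio:.2f}, recall={recall:.2f}")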


🤗 MODULES

 * LLM Adapter: The LLM Adapter is designed to integrate different LLM models by
   unifying their APIs and request protocols. GPTCache offers a standardized
   interface for this purpose, with current support for ChatGPT integration.
   
   * [x] Support OpenAI ChatGPT API.
   
   * [x] Support langchain.
   
   * [x] Support minigpt4.
   
   * [x] Support Llamacpp.
   
   * [x] Support dolly.
   
   * [ ] Support other LLMs, such as Hugging Face Hub, Bard, Anthropic.
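
   For example, the LangChain integration wraps an existing LangChain LLM so that
   its calls are served through the cache first. The sketch below follows the
   pattern used in the LangChain bootcamp; the import path
   gptcache.adapter.langchain_models and the LangChainLLMs wrapper are
   assumptions to verify against the API reference for your version.

   from langchain.llms import OpenAI

   from gptcache import cache
   from gptcache.adapter.langchain_models import LangChainLLMs  # assumed import path

   cache.init()
   cache.set_openai_key()

   # Wrap a LangChain LLM so its completions go through GPTCache before hitting OpenAI.
   llm = LangChainLLMs(llm=OpenAI(temperature=0))
   print(llm("Tell me a joke"))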

 * Multimodal Adapter (experimental): The Multimodal Adapter is designed to
   integrate different large multimodal models by unifying their APIs and
   request protocols. GPTCache offers a standardized interface for this purpose,
   with current support for image generation and audio transcription
   integrations.
   
   * [x] Support OpenAI Image Create API.
   
   * [x] Support OpenAI Audio Transcribe API.
   
   * [x] Support Replicate BLIP API.
   
   * [x] Support Stability Inference API.
   
   * [x] Support Hugging Face Stable Diffusion Pipeline (local inference).
   
   * [ ] Support other multimodal services or self-hosted large multimodal
     models.
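
   As a sketch of the image-generation path: the adapter's openai module is
   assumed here to mirror the OpenAI SDK's Image.create call, and a production
   setup would use an image-aware cache configuration rather than the default
   text cache shown below (see the image-generation bootcamp for the full setup).

   from gptcache import cache
   from gptcache.adapter import openai  # cached drop-in for the OpenAI SDK

   cache.init()           # NOTE: a real image cache needs image-aware embedding/storage settings
   cache.set_openai_key()

   # Assumed to mirror openai.Image.create from the official SDK.
   response = openai.Image.create(prompt="a white siamese cat", n=1, size="256x256")
   print(response["data"][0]["url"])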

 * Embedding Generator: This module is created to extract embeddings from
   requests for similarity search. GPTCache offers a generic interface that
   supports multiple embedding APIs, and presents a range of solutions to choose
   from.
   
   * [x] Disable embedding. This will turn GPTCache into a keyword-matching
     cache.
   
   * [x] Support OpenAI embedding API.
   
   * [x] Support ONNX with the GPTCache/paraphrase-albert-onnx model.
   
   * [x] Support Hugging Face embedding with transformers, ViTModel,
     Data2VecAudio.
   
   * [x] Support Cohere embedding API.
   
   * [x] Support fastText embedding.
   
   * [x] Support SentenceTransformers embedding.
   
   * [x] Support Timm models for image embedding.
   
   * [ ] Support other embedding APIs.
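
   The ONNX embedding used in the Quick Start can also be exercised on its own;
   to_embeddings and dimension form the small interface the cache relies on (the
   printout below is just a sanity check):

   from gptcache.embedding import Onnx

   onnx = Onnx()  # defaults to the GPTCache/paraphrase-albert-onnx model
   emb = onnx.to_embeddings("what's github")
   # The vector store must be created with exactly this dimension.
   print(onnx.dimension, len(emb))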

 * Cache Storage: Cache Storage is where the response from LLMs, such as
   ChatGPT, is stored. Cached responses are retrieved to assist in evaluating
   similarity and are returned to the requester if there is a good semantic
   match. At present, GPTCache supports SQLite and offers a universally
   accessible interface for extension of this module.
   
   * [x] Support SQLite.
   
   * [x] Support DuckDB.
   
   * [x] Support PostgreSQL.
   
   * [x] Support MySQL.
   
   * [x] Support MariaDB.
   
   * [x] Support SQL Server.
   
   * [x] Support Oracle.
   
   * [ ] Support MongoDB.
   
   * [ ] Support Redis.
   
   * [ ] Support Minio.
   
   * [ ] Support HBase.
   
   * [ ] Support ElasticSearch.
   
   * [ ] Support other storages.
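
   Switching the cache storage backend happens where the data manager is built;
   the sketch below reuses the Quick Start wiring and only swaps the CacheBase
   name. Backends such as MySQL or PostgreSQL need their own connection settings,
   which are omitted here; see the Manager API reference.

   from gptcache.embedding import Onnx
   from gptcache.manager import CacheBase, VectorBase, get_data_manager

   onnx = Onnx()

   # Same wiring as the Quick Start, with the storage name as the only change.
   cache_base = CacheBase("sqlite")   # e.g. "mysql", "postgresql", ... (connection args omitted)
   vector_base = VectorBase("faiss", dimension=onnx.dimension)
   data_manager = get_data_manager(cache_base, vector_base)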

 * Vector Store: The Vector Store module helps find the K most similar requests
   from the input request's extracted embedding. The results can help assess
   similarity. GPTCache provides a user-friendly interface that supports various
   vector stores, including Milvus, Zilliz Cloud, and FAISS. More options will
   be available in the future.
   
   * [x] Support Milvus, an open-source vector database for production-ready
     AI/LLM applications.
   
   * [x] Support Zilliz Cloud, a fully-managed cloud vector database based on
     Milvus.
   
   * [x] Support Milvus Lite, a lightweight version of Milvus that can be
     embedded into your Python application.
   
   * [x] Support FAISS, a library for efficient similarity search and clustering
     of dense vectors.
   
   * [x] Support Hnswlib, header-only C++/python library for fast approximate
     nearest neighbors.
   
   * [x] Support PGVector, open-source vector similarity search for Postgres.
   
   * [x] Support Chroma, the AI-native open-source embedding database.
   
   * [x] Support DocArray, a library for representing, sending and storing
     multi-modal data, perfect for machine learning applications.
   
   * [ ] Support Qdrant.
   
   * [ ] Support Weaviate.
   
   * [ ] Support other vector databases.
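
   The vector store is selected the same way. FAISS (as in the Quick Start) runs
   in-process; for a Milvus or Zilliz Cloud deployment you would swap the name
   and pass connection parameters. The host/port keywords below are assumptions
   based on a default local Milvus, to be verified against the Manager API
   reference.

   from gptcache.embedding import Onnx
   from gptcache.manager import CacheBase, VectorBase, get_data_manager

   onnx = Onnx()

   vector_base = VectorBase("faiss", dimension=onnx.dimension)  # in-process, as in the Quick Start
   # vector_base = VectorBase("milvus", host="localhost", port="19530",
   #                          dimension=onnx.dimension)  # assumed connection keywords

   data_manager = get_data_manager(CacheBase("sqlite"), vector_base)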

 * Cache Manager: The Cache Manager is responsible for controlling the operation
   of both the Cache Storage and Vector Store.
   
   * Eviction Policy: Currently, GPTCache makes decisions about evictions based
     solely on the number of lines. This approach can result in inaccurate
     resource evaluation and may cause out-of-memory (OOM) errors. We are
     actively investigating and developing a more sophisticated strategy.
     
     * [x] Support LRU eviction policy.
     
     * [x] Support FIFO eviction policy.
     
     * [ ] Support more complicated eviction policies.
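
   Eviction is configured on the data manager. In the sketch below, max_size and
   eviction are assumed keyword names for the entry-count limit and the policy
   ("LRU" or "FIFO"); check the Manager API reference for the exact signature and
   defaults.

   from gptcache.embedding import Onnx
   from gptcache.manager import CacheBase, VectorBase, get_data_manager

   onnx = Onnx()

   # Assumed keyword arguments for entry-count-based eviction; verify against the API reference.
   data_manager = get_data_manager(
       CacheBase("sqlite"),
       VectorBase("faiss", dimension=onnx.dimension),
       max_size=1000,    # start evicting once the cache holds more than 1000 entries
       eviction="LRU",   # or "FIFO"
   )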

 * Similarity Evaluator: This module collects data from both the Cache Storage
   and Vector Store, and uses various strategies to determine the similarity
   between the input request and the requests from the Vector Store. Based on
   this similarity, it determines whether a request matches the cache. GPTCache
   provides a standardized interface for integrating various strategies, along
   with a collection of implementations to use. The following similarity
   definitions are currently supported or will be supported in the future:
   
   * [x] The distance we obtain from the Vector Store.
   
   * [x] A model-based similarity determined using the
     GPTCache/albert-duplicate-onnx model from ONNX.
   
   * [x] Exact matches between the input request and the requests obtained from
     the Vector Store.
   
   * [x] Distance represented by applying linalg.norm from numpy to the
     embeddings.
   
   * [ ] BM25 and other similarity measurements.
   
   * [ ] Support other model serving framework such as PyTorch.
   
   Note: Not all combinations of different modules may be compatible with each
   other. For instance, if we disable the Embedding Extractor, the Vector Store
   may not function as intended. We are currently working on implementing a
   combination sanity check for GPTCache.
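
   Choosing a different similarity strategy is a matter of passing another
   evaluator to cache.init. In the sketch below, OnnxModelEvaluation stands in
   for the model-based option built on the GPTCache/albert-duplicate-onnx model;
   its import path is an assumption to confirm in the Similarity Evaluation API
   reference.

   from gptcache import cache
   from gptcache.embedding import Onnx
   from gptcache.manager import CacheBase, VectorBase, get_data_manager
   from gptcache.similarity_evaluation.onnx import OnnxModelEvaluation  # assumed import path

   onnx = Onnx()
   data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))

   cache.init(
       embedding_func=onnx.to_embeddings,
       data_manager=data_manager,
       similarity_evaluation=OnnxModelEvaluation(),  # model-based scoring instead of raw vector distance
   )
   cache.set_openai_key()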


😇 ROADMAP

Coming soon! Stay tuned!


😍 CONTRIBUTING

We are extremely open to contributions, be it through new features, enhanced
infrastructure, or improved documentation.

For comprehensive instructions on how to contribute, please refer to our
contribution guide.







By Zilliz Inc.

© Copyright 2023, Zilliz Inc.

Last updated on May 29, 2023.