Model Routing Quickstart Guide
Our router is trained to optimize for cost and latency while maintaining high response quality. If your query can be adequately answered by GPT-3.5, our router returns the corresponding label so that you can save on the cost of calling GPT-4 without any degradation in quality.
Integrating notdiamond-0001 is super easy. You pass us a list of messages, just like you would for OpenAI's chat completion API, and notdiamond-0001 returns the most suitable model to call along with an estimate of how many tokens your query will cost, so that you don't call a model whose context length is smaller than your message's token count.
You control which model you call, how you call it, and any rate limits you might want to impose.
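Since rate limiting is left entirely to you, here is a minimal sketch of a client-side limiter you might wrap around your model calls. This is an illustrative token-bucket implementation, not part of the notdiamond-0001 API; all names are our own.

```python
import time

class RateLimiter:
    """Minimal token-bucket limiter: allow at most `rate` calls per `per` seconds.

    The clock is injectable so the limiter can be tested deterministically.
    """
    def __init__(self, rate, per, clock=time.monotonic):
        self.rate = rate
        self.per = per
        self.clock = clock
        self.tokens = float(rate)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill tokens proportionally to the time elapsed since the last check.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate / self.per)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

You would check `limiter.allow()` before each model call and back off (or queue) when it returns False.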
Step 1: Sign up
Head over to Not Diamond to sign up. All you need is your email address, or you can simply sign up with your Google or GitHub account.
Step 2: Generate a Not Diamond API key
Once you sign up, you will be greeted with an API key generation page, where you can generate a new API key or delete an existing one.
Step 3: Integration
The notdiamond-0001 modelSelector endpoint can be called with your API key and a list of messages, just as you would with OpenAI:
{
  "messages": [
    {
      "role": "assistant",
      "content": "How can I help you today?"
    },
    {
      "role": "user",
      "content": "Can you write a function that counts from 1 to 10?"
    }
  ]
}
The JSON response you receive will look like this:
{"model": "gpt-3.5", "token_estimate": 32}
With this, you can then route your messages to the appropriate model.
Additionally, our backend tracks the current status of both GPT-3.5 and GPT-4 and never returns a model that is currently unavailable, so your application keeps working. In the event that both GPT-3.5 and GPT-4 are unavailable, you can specify a fallback model of your choice and have it returned instead. If no fallback model is specified, we simply return null in the model field of the response body. Note that when the fallback model is returned, the token count estimate is based on OpenAI's GPT-4 tokenizer; for a more accurate estimate, you may want to use the tokenizer that matches your fallback model.
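If you don't set a fallback model, your code needs to handle the null case itself. A minimal sketch (the helper name and the default are our own, not part of the API):

```python
def resolve_model(response, default_model="gpt-4"):
    """Return the routed model label, substituting a default when the router
    returns null (i.e. both OpenAI models were unavailable and no
    fallback_model was specified in the request)."""
    model = response.get("model")
    return model if model is not None else default_model
```

For example, `resolve_model({"model": None, "token_estimate": 32})` would return `"gpt-4"`, while a normal response passes through unchanged.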
Below is an example of how you might handle routing using notdiamond-0001 in Python, using Claude as a fallback model:
import json
import requests
import openai
from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT
openai.api_key = "YOUR OPENAI API KEY"
anthropic = Anthropic(api_key="YOUR CLAUDE API KEY")
nd_api_key = "YOUR ND API KEY"
fallback_model = "claude-2.1"
url = "https://not-diamond-backend.onrender.com/modelSelector/"
def call_openai(messages, model):
model_response = openai.ChatCompletion.create(model=model, messages=messages)
return model_response.choices[0].message.content
def call_anthropic(messages, model):
context = ""
for msg in messages[1:-1]: # Anthropic expects the first message to come from user
if msg['role'] == 'user':
context += f"{HUMAN_PROMPT} {msg['content']} "
elif msg['role'] == 'assistant':
context += f"{AI_PROMPT} {msg['content']} "
prompt = context + f"{HUMAN_PROMPT} {messages[-1]['content']} {AI_PROMPT}"
response = anthropic.completions.create(model=model, max_tokens_to_sample=300, prompt=prompt)
return response.completion
messages = [
{
"role": "assistant",
"content": "How can I help you today?"
},
{
"role": "user",
"content": "Can you write a function that counts from 1 to 10?"
}
]
payload = json.dumps({"messages": messages, "fallback_model": fallback_model})
headers = {
"Authorization": f"Bearer {nd_api_key}",
"Content-Type": "application/json"
}
response = requests.post(url, headers=headers, data=payload).json()
model = response["model"]
token_estimate = response["token_estimate"]
if model == "gpt-3.5" and token_estimate <= 16384:
response = call_openai(messages, "gpt-3.5-turbo")
elif model == "gpt-4" and token_estimate <= 8192:
response = call_openai(messages, "gpt-4")
elif model == "gpt-4" and token_estimate > 8192:
response = call_openai(messages, "gpt-4-32k")
elif model == "claude-2.1":
response = call_anthropic(messages, fallback_model) # Fallback model
else:
response = call_openai(messages, "gpt-4-32k")
print(f"This question is answered by {model}:\n{response}")
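The if/elif chain above can also be expressed as a small lookup-table helper, which is easier to extend as new model variants appear. This is a hypothetical refactor, not part of the API; the context sizes mirror the ones used in the example:

```python
# Context window sizes matching the thresholds used in the routing example.
CONTEXT_WINDOWS = {"gpt-3.5-turbo": 16384, "gpt-4": 8192, "gpt-4-32k": 32768}

def pick_variant(model, token_estimate):
    """Map the router's label to a concrete model that fits the token estimate."""
    if model == "gpt-3.5" and token_estimate <= CONTEXT_WINDOWS["gpt-3.5-turbo"]:
        return "gpt-3.5-turbo"
    if model == "gpt-4" and token_estimate <= CONTEXT_WINDOWS["gpt-4"]:
        return "gpt-4"
    if model in ("gpt-3.5", "gpt-4"):
        # Largest context window as a last resort, as in the example above.
        return "gpt-4-32k"
    return model  # a fallback label (e.g. "claude-2.1") is used as-is
```

With this helper, the routing block reduces to calling `pick_variant(model, token_estimate)` and dispatching to the matching client.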
Note to beta testers
Our router may take up to 10 seconds to spin up the first time you call it. Subsequent calls should complete in well under 1 second. Once we move out of beta, you will not see any spin-up time.