Model Routing Quickstart Guide

Our router is trained to optimize for cost and latency while maintaining high response quality. If your query can be adequately answered by GPT-3.5, our router will return the corresponding label so that you can save on the cost of GPT-4 without any degradation in quality.

Integrating notdiamond-0001 is straightforward. Pass us a list of messages, just as you would to OpenAI's chat completion API, and notdiamond-0001 will return the most suitable model to call, along with an estimate of the number of tokens in your query so that you don't call a model with a context length smaller than your message's token count.

You control which model you call, how you call it, and any rate limits you might want to impose.

Step 1: Sign up

Head over to notdiamond-0001 to sign up. You only need your email address, or you can sign up with your Google or GitHub account.

Step 2: Generate a Not Diamond API key

Once you have signed up, you will be greeted with an API key generation page, where you can generate a new API key or delete an existing one.

Step 3: Integration

The notdiamond-0001 modelSelector endpoint can be called with your API key by providing a list of messages, just as you would with OpenAI:

{"messages": [
  {
    "role": "assistant",
    "content": "How can I help you today?"
  },
  {
    "role": "user",
    "content": "Can you write a function that counts from 1 to 10?"
  }]
}

The JSON response will look like this:

{"model": "gpt-3.5", “token_estimate”: 32}

With this, you can then route your messages to the appropriate model.
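For instance, you might map the returned label to the concrete model name you will call. The helper below is a minimal sketch: the `choose_model` name and the `"gpt-3.5"` → `"gpt-3.5-turbo"` mapping are illustrative assumptions, not part of the API.

```python
def choose_model(response, fallback="gpt-4"):
    # Illustrative helper: map a routing response body to a concrete
    # model name. The label-to-model mapping here is an assumption;
    # adjust it to the models you actually call.
    label = response.get("model")
    if label is None:
        # No routed model was available and no fallback was specified
        return fallback
    if label == "gpt-3.5":
        return "gpt-3.5-turbo"
    return label

print(choose_model({"model": "gpt-3.5", "token_estimate": 32}))  # gpt-3.5-turbo
```

You can then pass the returned name directly to your model-calling code.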

Additionally, our backend keeps track of the current status of both GPT-3.5 and GPT-4 and never returns a model that is currently unavailable, so your application keeps working. In the event that both GPT-3.5 and GPT-4 are unavailable, you can specify a fallback model of your choice and have it returned when that happens. If no fallback option is specified, we will simply return null in the model field of the response body. Note that when the fallback model is returned, the token count estimate is based on OpenAI's GPT-4 tokenizer; for a more accurate estimate, use the tokenizer that matches your fallback model.
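The token estimate also lets you guard against context overflow before dispatching. Below is a minimal sketch; the function name is illustrative, and the 8,192-token limit reflects the standard gpt-4 context window, with the 32k variant used beyond it.

```python
def pick_gpt4_variant(token_estimate, base_limit=8192):
    # Illustrative helper: route to the 32k-context variant when the
    # estimated token count exceeds the base gpt-4 context window
    return "gpt-4" if token_estimate <= base_limit else "gpt-4-32k"

print(pick_gpt4_variant(32))    # gpt-4
print(pick_gpt4_variant(9000))  # gpt-4-32k
```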

Below is an example of how you might handle routing using notdiamond-0001 in Python, using Claude as a fallback model:

import json
import requests
import openai
from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

openai.api_key = "YOUR OPENAI API KEY"
anthropic = Anthropic(api_key="YOUR CLAUDE API KEY")
nd_api_key = "YOUR ND API KEY"
fallback_model = "claude-2.1"

url = "https://not-diamond-backend.onrender.com/modelSelector/"

def call_openai(messages, model):
  # Call OpenAI's chat completion API and return the assistant's reply text
  model_response = openai.ChatCompletion.create(model=model, messages=messages)
  return model_response.choices[0].message.content

def call_anthropic(messages, model):
  # Anthropic's completion prompts must start with a human turn, so skip the
  # opening assistant message and handle the final user message separately
  context = ""
  for msg in messages[1:-1]:
    if msg['role'] == 'user':
      context += f"{HUMAN_PROMPT} {msg['content']} "
    elif msg['role'] == 'assistant':
      context += f"{AI_PROMPT} {msg['content']} "

  prompt = context + f"{HUMAN_PROMPT} {messages[-1]['content']} {AI_PROMPT}"
  response = anthropic.completions.create(model=model, max_tokens_to_sample=300, prompt=prompt)
  return response.completion

messages = [
  {
    "role": "assistant",
    "content": "How can I help you today?"
  },
  {
    "role": "user",
    "content": "Can you write a function that counts from 1 to 10?"
  }
]

payload = json.dumps({"messages": messages, "fallback_model": fallback_model})
headers = {
  "Authorization": f"Bearer {nd_api_key}",
  "Content-Type": "application/json"
}

response = requests.request("POST", url, headers=headers, data=payload).json()
model = response["model"]
token_estimate = response["token_estimate"]

if model == "gpt-3.5" and token_estimate <= 16384:
  response = call_openai(messages, "gpt-3.5-turbo")
elif model == "gpt-4" and token_estimate <= 8192:
  response = call_openai(messages, "gpt-4")
elif model == "gpt-4" and token_estimate > 8192:
  response = call_openai(messages, "gpt-4-32k")
elif model == "claude-2.1":
  response = call_anthropic(messages, fallback_model) # Fallback model
else:
  response = call_openai(messages, "gpt-4-32k")

print(f"This question is answered by {model}:\n{response}")

🚧

Note to beta testers

Our router may take up to 10 seconds to spin up the first time you call it. Subsequent calls should be well under 1 second. Once we move out of beta, you will not see any spin-up time.