LLM Model Evaluation Public Collection - Feedback or Questions?

We launched a new public collection and template that lets you easily test and evaluate LLMs from OpenAI, Google, Anthropic, and more. Using the collection runner, you can visualize or export the results, making it easy to compare benchmarks across AI providers.

Your feedback is important!
We’d love to hear how you’re using the collection.

  • What do you like about it?
  • What could make it better?
  • How have you been using it?
  • Any questions for us?

You can now test and evaluate OpenAI's o1-mini, o1, and o3-mini! Compare all three using the collection runner JSON below (be sure to read the collection overview for more details).

[
    {
      "name": "OpenAI reasoning",
      "prompt": "What are three compounds we should consider investigating to advance research into new antibiotics? Why should we consider them?",
      "context": "You are a helpful research assistant knowledgeable in the medical field. You respond with concise answers.",
      "temperature": 1,
      "max_tokens": 8000,
      "top_p": 1,
      "models": ["o1-mini", "o3-mini", "o1"],
      "tests": {
        "content_length": 2000,
        "response_time": 5000,
        "prompt_tokens": 1000,
        "completion_tokens": 1000,
        "total_tokens": 5000,
        "tokens_per_second": 100
      }
    }
  ]
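As a rough illustration of what the `tests` object expresses, here is a minimal Python sketch of threshold checking against a model run's metrics. This is not part of the collection itself; it assumes `tokens_per_second` is a minimum (higher is better) and the other values are maximums, and the sample metrics are fabricated.

```python
# Hypothetical sketch: check per-metric thresholds like those in the
# "tests" object above. Assumes "tokens_per_second" is a floor and all
# other thresholds are ceilings.

def check_thresholds(tests: dict, metrics: dict) -> dict:
    """Return a pass/fail result for each configured metric."""
    results = {}
    for name, limit in tests.items():
        value = metrics.get(name)
        if value is None:
            results[name] = False  # metric missing -> fail
        elif name == "tokens_per_second":
            results[name] = value >= limit  # higher is better
        else:
            results[name] = value <= limit  # lower is better
    return results

tests = {
    "content_length": 2000,
    "response_time": 5000,
    "total_tokens": 5000,
    "tokens_per_second": 100,
}

# Example metrics for a single model run (made up for illustration).
metrics = {
    "content_length": 1800,
    "response_time": 4200,
    "total_tokens": 3500,
    "tokens_per_second": 120,
}

print(check_thresholds(tests, metrics))
```

The collection runner applies its own test logic; this sketch only shows how the configured numbers could act as pass/fail limits per model.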

DeepSeek is here! Test and evaluate DeepSeek’s deepseek-reasoner and deepseek-chat LLMs.

The collection runner JSON example below compares DeepSeek's and OpenAI's reasoning models.

[
    {
      "name": "Reasoning",
      "prompt": "What are three compounds we should consider investigating to advance research into new antibiotics? Why should we consider them?",
      "context": "You are a helpful research assistant knowledgeable in the medical field. You respond with concise answers.",
      "temperature": 1,
      "max_tokens": 8000,
      "top_p": 1,
      "models": ["o3-mini", "deepseek-reasoner", "o1"],
      "tests": {
        "content_length": 2000,
        "response_time": 5000,
        "prompt_tokens": 1000,
        "completion_tokens": 1000,
        "total_tokens": 5000,
        "tokens_per_second": 100
      }
    }
  ]