LLM Model Evaluation Public Collection - Feedback or Questions?

We launched a new public collection and template that lets you easily test and evaluate LLMs from OpenAI, Google, Anthropic, and more. Using the collection runner, you can visualize or export the results, making it easy to compare benchmarks across AI providers.

Your feedback is important!
We’d love to hear how you’re using the collection.

  • What do you like about it?
  • What could make it better?
  • How have you been using it?
  • Any questions for us?

You can now test and evaluate OpenAI's o1-mini, o1, and o3-mini! Compare all three using the collection runner JSON below (be sure to read the collection overview for more details).

[
    {
      "name": "OpenAI reasoning",
      "prompt": "What are three compounds we should consider investigating to advance research into new antibiotics? Why should we consider them?",
      "context": "You are a helpful research assistant knowledgeable in the medical field. You respond with concise answers.",
      "temperature": 1,
      "max_tokens": 8000,
      "top_p": 1,
      "models": ["o1-mini", "o3-mini", "o1"],
      "tests": {
        "content_length": 2000,
        "response_time": 5000,
        "prompt_tokens": 1000,
        "completion_tokens": 1000,
        "total_tokens": 5000,
        "tokens_per_second": 100
      }
    }
  ]
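As a rough illustration of what the `tests` object expresses, here is a minimal Python sketch of threshold checking against a model run's metrics. This is not part of the collection itself; it assumes `tokens_per_second` is a minimum (higher is better) and the other values are maximums, and the sample metrics are fabricated.

```python
# Hypothetical sketch: check per-metric thresholds like those in the
# "tests" object above. Assumes "tokens_per_second" is a floor and all
# other thresholds are ceilings.

def check_thresholds(tests: dict, metrics: dict) -> dict:
    """Return a pass/fail result for each configured metric."""
    results = {}
    for name, limit in tests.items():
        value = metrics.get(name)
        if value is None:
            results[name] = False  # metric missing -> fail
        elif name == "tokens_per_second":
            results[name] = value >= limit  # higher is better
        else:
            results[name] = value <= limit  # lower is better
    return results

tests = {
    "content_length": 2000,
    "response_time": 5000,
    "total_tokens": 5000,
    "tokens_per_second": 100,
}

# Example metrics for a single model run (made up for illustration).
metrics = {
    "content_length": 1800,
    "response_time": 4200,
    "total_tokens": 3500,
    "tokens_per_second": 120,
}

print(check_thresholds(tests, metrics))
```

The collection runner applies its own test logic; this sketch only shows how the configured numbers could act as pass/fail limits per model.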

DeepSeek is here! Test and evaluate DeepSeek’s deepseek-reasoner and deepseek-chat LLMs.

The collection runner JSON example below compares DeepSeek's and OpenAI's reasoning models.

[
    {
      "name": "Reasoning",
      "prompt": "What are three compounds we should consider investigating to advance research into new antibiotics? Why should we consider them?",
      "context": "You are a helpful research assistant knowledgeable in the medical field. You respond with concise answers.",
      "temperature": 1,
      "max_tokens": 8000,
      "top_p": 1,
      "models": ["o3-mini", "deepseek-reasoner", "o1"],
      "tests": {
        "content_length": 2000,
        "response_time": 5000,
        "prompt_tokens": 1000,
        "completion_tokens": 1000,
        "total_tokens": 5000,
        "tokens_per_second": 100
      }
    }
  ]