This project was modified from an AWS Skill Builder Bonus Assignment.
Near closing time at CVS. You know what you're looking for, but you're not sure who to ask. Depending on the product, maybe you go to the cashier, maybe the guy restocking the nail polish, or maybe straight to the prescription counter. You have a few follow-up questions, so you want to choose the right worker. Picking the right Foundation Model can be a similar decision. Of course, you're probably in the AWS Bedrock Console, not a CVS.
This quest is a little different. Instead of roaming the aisles and shelves at the convenience store for assistance, we are looking for a Foundation Model to ask very pointed financial questions. We identified three candidates to interview:
- Claude Sonnet (anthropic.claude-3-sonnet-20240229-v1:0)
- Claude Instant (anthropic.claude-instant-v1)
- Titan Express (anthropic.titan-express-v1)
The Python evaluation script (found here) has three worker methods for model evaluation, sketched just after this list:
- invoke_model: names the three foundation model candidates, and then asks the specified financial questions.
- evaluate_models: judges each foundation model's similarity to the ground-truth results.
- calculate_similarity: computes the similarity between a model's answer and the ground truth.
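The full script lives at the link above, but here is a minimal sketch of the three methods. It assumes boto3's Bedrock runtime client, the Anthropic Messages request format (Titan models use a different body shape), and a simple difflib-based similarity; the real script may score similarity differently, e.g. with embeddings.

```python
# Minimal sketch of the evaluation flow, not the actual script.
import json
import time
from difflib import SequenceMatcher

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def invoke_model(model_id: str, question: str) -> tuple[str, float]:
    """Ask one model a question; return the answer and latency in seconds."""
    # Anthropic Messages request body; Titan models expect a different shape.
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "messages": [{"role": "user", "content": question}],
    })
    start = time.time()
    response = bedrock.invoke_model(modelId=model_id, body=body)
    latency = time.time() - start
    answer = json.loads(response["body"].read())["content"][0]["text"]
    return answer, latency

def calculate_similarity(model_answer: str, ground_truth: str) -> float:
    """Score how closely a model's answer matches the ground truth (0 to 1)."""
    return SequenceMatcher(None, model_answer.lower(), ground_truth.lower()).ratio()

def evaluate_models(model_ids: list[str], test_cases: list[dict]) -> dict[str, float]:
    """Run every test case against every model; average the similarity scores."""
    results = {}
    for model_id in model_ids:
        scores = [
            calculate_similarity(invoke_model(model_id, case["question"])[0],
                                 case["ground_truth"])
            for case in test_cases
        ]
        results[model_id] = sum(scores) / len(scores)
    return results
```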
On the trial run, two of the three candidate models were no-shows; it appeared the models had been discontinued. The script output looked like so:

It turns out that the Titan Express model ID was mistyped; Titan is distributed by Amazon, not Anthropic. And Anthropic's Claude Instant has reached end of life, so Claude 3 Haiku was used as a replacement. With those two adjustments, new similarity scores were created:

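For reference, the corrected candidate list looks like this (both replacement IDs are the publicly documented Bedrock identifiers):

```python
# Corrected candidates: Titan Express lives under the amazon. namespace,
# and the retired Claude Instant is replaced by Claude 3 Haiku.
model_ids = [
    "anthropic.claude-3-sonnet-20240229-v1:0",
    "anthropic.claude-3-haiku-20240307-v1:0",  # replaces anthropic.claude-instant-v1
    "amazon.titan-text-express-v1",            # was mistyped as anthropic.titan-express-v1
]
```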
Now that three working models were found, it's time to add more questions for evaluation. Currently there is only one question in the test_cases array:
test_cases = [
    {
        "question": "What is a 401(k) retirement plan?",
        "context": "Financial services",
        "ground_truth": "A 401(k) is a tax-advantaged retirement savings plan offered by employers."
    }
]
Scouring the internet for more basic questions led to the extended set located here. With these new questions in place, the following similarity scores between the Foundation Models were derived:

While both Claude Sonnet and Claude Haiku outperform Titan on similarity score, they use four times as many tokens. And Sonnet slightly outperforms Haiku in similarity to the ground truth, but Sonnet has twice the latency of Haiku.
Model Selection Strategy
The code ranks the foundation models with a weighted strategy: 0.7 of the normalized similarity score plus 0.3 of the normalized latency score. Additionally, the code writes the model selection scores to a JSON file, which will be consumed by AWS AppConfig.
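Here is a minimal sketch of that ranking step. It assumes min-max normalization, a latency term inverted so faster models score higher, and a hypothetical model_selection.json output file name; the actual weighting code may differ.

```python
import json

def rank_models(metrics: dict[str, dict]) -> dict[str, float]:
    """Weighted ranking: 0.7 * normalized similarity + 0.3 * normalized speed.

    metrics maps model_id -> {"similarity": float, "latency": float}.
    """
    sims = [m["similarity"] for m in metrics.values()]
    lats = [m["latency"] for m in metrics.values()]

    def normalize(value, lo, hi):
        return (value - lo) / (hi - lo) if hi > lo else 1.0

    scores = {}
    for model_id, m in metrics.items():
        sim_score = normalize(m["similarity"], min(sims), max(sims))
        # Invert latency so that faster models contribute a higher score.
        speed_score = 1.0 - normalize(m["latency"], min(lats), max(lats))
        scores[model_id] = 0.7 * sim_score + 0.3 * speed_score

    # Persist the selection for AWS AppConfig to consume
    # (file name is an assumption, not the script's actual output path).
    best = max(scores, key=scores.get)
    with open("model_selection.json", "w") as f:
        json.dump({"selected_model": best, "scores": scores}, f, indent=2)
    return scores
```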
The next step is to deploy the CloudFormation stacks across regions. There's a bug in the AWS code: the templates try to create an IAM role with the same name in both regions, and because IAM is a global service, the second stack's role creation fails with an "already exists" error, sinking the whole deployment. I solved it the quickest way: by appending the region to the end of the RoleName.
As a side note, this bug 🐞 illustrates why the AWS Cloud Development Kit (CDK) is useful for managing CloudFormation templates. CDK compiles into CloudFormation templates, so bugs like this can be caught locally at synth time, before deployment.
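For illustration, here is a minimal CDK sketch of the same region-suffix fix; the stack and role names are hypothetical, not the names from the AWS-provided templates.

```python
# Hypothetical CDK stack demonstrating the region-suffix fix: IAM is a
# global service, so a fixed RoleName collides when the same stack is
# deployed to multiple regions.
from aws_cdk import App, Environment, Stack
from aws_cdk import aws_iam as iam
from constructs import Construct

class ModelEvalStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # self.region resolves per deployment, keeping the role name unique.
        iam.Role(
            self, "EvalRole",
            role_name=f"model-eval-role-{self.region}",
            assumed_by=iam.ServicePrincipal("lambda.amazonaws.com"),
        )

app = App()
ModelEvalStack(app, "ModelEvalStack-East", env=Environment(region="us-east-1"))
ModelEvalStack(app, "ModelEvalStack-West", env=Environment(region="us-west-2"))
app.synth()
```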
to be continued…