Blog

Evaluating AI Agents: How Do You Know If Your Agent Is Actually Correct?

April 27, 2026 • #python, #ai, #llm, #evaluation, #langgraph, #claude, #openai, #snowflake, #streamlit

Most AI demos stop at 'look, it generated something.' This post walks through how I built a systematic evaluation pipeline for a customer support agent — using Claude as a judge, a golden test suite, and a Streamlit dashboard to track quality over time.
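As a rough illustration of the golden test suite and judge idea, here is a minimal sketch. It is not the pipeline from the post: `GoldenCase`, `keyword_judge`, `evaluate`, and `toy_agent` are hypothetical names, and the keyword-overlap judge is a deliberately simple stand-in for the place where a model such as Claude would score each answer.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class GoldenCase:
    """One entry in a (hypothetical) golden test suite."""
    question: str
    expected: str


# A judge maps (question, expected, actual) to a score in [0, 1].
# In an LLM-as-judge setup this callable would wrap a model call;
# here it is a plain function so the sketch runs offline.
Judge = Callable[[str, str, str], float]


def keyword_judge(question: str, expected: str, actual: str) -> float:
    """Stand-in judge: fraction of expected words found in the answer."""
    expected_words = set(expected.lower().split())
    if not expected_words:
        return 1.0
    actual_words = set(actual.lower().split())
    return len(expected_words & actual_words) / len(expected_words)


def evaluate(
    agent: Callable[[str], str],
    cases: List[GoldenCase],
    judge: Judge,
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Run the agent on every golden case and aggregate the judge scores."""
    scores = [judge(c.question, c.expected, agent(c.question)) for c in cases]
    if not scores:
        return {"n": 0, "pass_rate": 0.0, "mean_score": 0.0}
    passed = sum(1 for s in scores if s >= threshold)
    return {
        "n": len(scores),
        "pass_rate": passed / len(scores),
        "mean_score": mean(scores),
    }


def toy_agent(question: str) -> str:
    """Hypothetical agent that only knows about password resets."""
    if "password" in question.lower():
        return "Please use the reset link in your email"
    return "I am not sure"


cases = [
    GoldenCase("How do I reset my password?", "use the reset link"),
    GoldenCase("What is the refund window?", "30 days"),
]

report = evaluate(toy_agent, cases, keyword_judge)
print(report)
```

Keeping the judge behind a plain callable means the same `evaluate` loop can be reused when the stand-in is swapped for a model-backed judge, and the aggregate numbers are the kind of values a dashboard could chart across runs.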

About This Site

Personal site of Hien Phan — Lead Data Engineer, AI Engineer and Data Architect with a PhD in Computer Science. Writing on data platforms, AI evaluation, RAG systems, and cloud architecture. Occasional posts in Vietnamese.
