Portfolio

A selection of production-focused AI and data engineering projects, including LLM evaluation frameworks, RAG systems, Snowflake/dbt pipelines, and multi-cloud AI architecture. Full source on GitHub.

Featured Project — AI Evaluation & Observability

Snowflake AI Evaluation

Problem

AI demos tend to look good, but how do you know whether the agent's answers are actually correct? Without systematic evaluation, quality is invisible and regressions go undetected.

Architecture

TPC-H dataset → Snowflake + dbt → LangGraph agents (GPT-4o, Gemini 2.5 Flash) → Claude-as-judge scorer → Evaluation mart (Snowflake) → Streamlit dashboard

Technologies

Python, Snowflake, dbt, LangGraph, GPT-4o, Gemini 2.5 Flash, Claude API, Streamlit

Result

Built a reusable AI evaluation framework that compares multiple agents against a golden test suite, stores evaluation results in Snowflake, and exposes quality metrics through a Streamlit dashboard. In the sample run, GPT-4o scored 9/10 and Gemini 2.5 Flash scored 10/10, with failures traceable to specific test cases.
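The comparison step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Verdict` record, the `summarize` helper, the agent names, and the case IDs are all hypothetical, and the pass/fail flags stand in for whatever the judge model returns. It shows how per-agent scores and the failing test cases could be derived from a flat list of judged answers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One judged answer: which agent, which golden test case, pass or fail."""
    agent: str
    case_id: str
    passed: bool


def summarize(verdicts):
    """Aggregate verdicts into per-agent pass counts and failing case IDs."""
    summary = defaultdict(lambda: {"passed": 0, "total": 0, "failed_cases": []})
    for v in verdicts:
        row = summary[v.agent]
        row["total"] += 1
        if v.passed:
            row["passed"] += 1
        else:
            # Keep the case ID so a failure stays traceable to its test case.
            row["failed_cases"].append(v.case_id)
    return dict(summary)


# Hypothetical verdicts for illustration only; not the project's real results.
verdicts = [
    Verdict("agent_a", "case_01", True),
    Verdict("agent_a", "case_02", False),
    Verdict("agent_b", "case_01", True),
    Verdict("agent_b", "case_02", True),
]

report = summarize(verdicts)
for agent, row in report.items():
    print(f"{agent}: {row['passed']}/{row['total']} failed={row['failed_cases']}")
```

Storing one row per (agent, test case) verdict, rather than only the totals, is what would let a dashboard drill from a score down to the specific failing case.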

Why it matters

Without systematic evaluation, teams cannot measure quality, detect regressions, gauge hallucination risk, or track the effect of model changes over time. This framework makes agent quality measurable and reproducible.

Blog Post GitHub ↗

Multi-Cloud Serverless RAG

Problem

The local RAG pipeline was tied to a single machine and one AI provider, so there was no way to compare the AWS, Azure, and GCP AI stacks on the same workload.

Architecture

Terraform →

AWS: Glue + OpenSearch Serverless + Lambda + Bedrock
Azure: Azure ML + AI Search + Functions + AI Foundry
GCP: Vertex AI + Firestore + Cloud Functions

→ Hugging Face Spaces

Technologies

Python, Terraform, AWS Bedrock, Azure AI Foundry, Vertex AI, OpenSearch Serverless, Azure AI Search, Firestore, Streamlit

Result

One RAG system deployable to AWS, Azure, or GCP with a single terraform apply, live on Hugging Face Spaces with one page per cloud backend.
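One way to keep a single pipeline portable across three backends is a small shared interface. The sketch below is an assumption about how that could look, not the project's actual design: `RagBackend`, `StubBackend`, and `answer` are hypothetical names, and the stub uses a toy word-overlap ranking where a real backend would call the provider's search and model services.

```python
from typing import Protocol


class RagBackend(Protocol):
    """Hypothetical interface that each cloud backend would implement."""
    name: str

    def retrieve(self, question: str, k: int) -> list[str]: ...

    def generate(self, question: str, passages: list[str]) -> str: ...


class StubBackend:
    """Stand-in backend; a real one would call the provider's SDK instead."""

    def __init__(self, name: str, corpus: list[str]):
        self.name = name
        self.corpus = corpus

    def retrieve(self, question: str, k: int) -> list[str]:
        # Toy ranking: count words shared between the question and each passage.
        words = set(question.lower().split())
        ranked = sorted(
            self.corpus,
            key=lambda p: len(words & set(p.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def generate(self, question: str, passages: list[str]) -> str:
        return f"[{self.name}] answer based on {len(passages)} passage(s)"


def answer(backend: RagBackend, question: str, k: int = 2) -> str:
    """Run the same retrieve-then-generate flow against any backend."""
    passages = backend.retrieve(question, k)
    return backend.generate(question, passages)


corpus = [
    "retrieval finds relevant passages",
    "generation writes the final answer",
    "terraform provisions the infrastructure",
]
backends = {name: StubBackend(name, corpus) for name in ("aws", "azure", "gcp")}
answers = {name: answer(b, "how does retrieval work") for name, b in backends.items()}
print(answers)
```

With an interface like this, the UI only needs to pick a backend by name, which fits a layout with one page per cloud.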

Why it matters

Enterprise AI teams rarely operate on a single cloud: they inherit existing infrastructure, face vendor lock-in decisions, or need to compare AI stack costs across providers. Seeing the same pipeline built three ways makes those trade-offs concrete.

Blog Post GitHub ↗

RAG Pipeline

Problem

Keyword search could not surface relevant papers in a collection of 500 arXiv research papers: a paper on 'reducing compute for LLMs' never appears for the query 'efficient LLM training', even though it is exactly what you need.

Architecture

arXiv API → PDF parser + chunker → sentence-transformers embedder → PostgreSQL + pgvector → Claude Sonnet 4.6 generator → Streamlit chat UI
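The chunking and retrieval stages of this pipeline can be illustrated in plain Python. This is a sketch under stated assumptions, not the project's code: `chunk_words`, `cosine_similarity`, and `top_k` are hypothetical helpers, the chunk size and overlap are arbitrary, and the two-dimensional vectors are toy stand-ins for sentence-transformers embeddings. In the described stack the ranking would presumably run inside PostgreSQL, where pgvector's `<=>` operator computes cosine distance.

```python
import math


def chunk_words(text, size=200, overlap=50):
    """Split text into overlapping windows of `size` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=3):
    """Rank (chunk_text, vector) pairs by similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors for illustration; real embeddings have hundreds of dimensions.
indexed = [("x", [1.0, 0.0]), ("y", [0.0, 1.0]), ("z", [1.0, 1.0])]
print(chunk_words("a b c d e f g", size=4, overlap=2))
print([text for _, text in top_k([1.0, 0.0], indexed, k=2)])
```

Overlapping windows reduce the chance that a sentence relevant to a query is split across two chunks and matched by neither.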

Technologies

Python, PostgreSQL, pgvector, sentence-transformers, Claude API, Streamlit, Docker, LangChain

Result

A chat interface that answers natural-language questions with responses grounded in the paper collection and cited sources, deployed locally via Docker.

Why it matters

LLMs answer from training data that can be outdated or wrong for your domain, and they can hallucinate. RAG grounds answers in your own documents, making responses verifiable and traceable to the source.
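Grounding with traceable citations usually comes down to how the retrieved passages are placed in the prompt. The sketch below is one possible shape, not the project's actual prompt: `build_grounded_prompt` and the `title` and `text` keys are hypothetical, and the instruction wording is only an example.

```python
def build_grounded_prompt(question, passages):
    """Build a prompt that restricts the model to numbered, citable sources.

    `passages` is a list of dicts with hypothetical keys 'title' and 'text'.
    """
    if not passages:
        raise ValueError("need at least one retrieved passage")
    sources = "\n".join(
        f"[{i}] {p['title']}: {p['text']}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered sources below. "
        "Cite sources as [n]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


# Placeholder passages for illustration only.
prompt = build_grounded_prompt(
    "Which methods reduce training compute?",
    [
        {"title": "Paper A", "text": "placeholder passage one"},
        {"title": "Paper B", "text": "placeholder passage two"},
    ],
)
print(prompt)
```

Because each source carries a number, a citation such as [2] in the answer can be mapped back to the passage and paper it came from.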

Blog Post GitHub ↗

Hien Phan
