News Feed

Evaluating Skills

LangChain Blog · 2026-03-05 · Evals
By Robert Xu. Recently at LangChain we’ve been building skills to help coding agents like Codex, Claude Code, and Deep Agents CLI work with our ecosystem: namely, LangChain and LangSmith. This is...

Agent Trace Evaluation with TruLens Scorers in MLflow

MLflow Blog · 2026-03-04 · Evals
Score agent plans, tool calls, and reasoning with the TruLens GPA framework through mlflow.genai.evaluate().

LangSmith CLI & Skills

LangChain Blog · 2026-03-04 · Evals
We’re releasing a CLI along with our first set of skills to give AI coding agents expertise in the LangSmith ecosystem. This includes adding tracing to agents, understanding their execution,...

LangChain Skills

LangChain Blog · 2026-03-04 · Evals
We’re releasing our first set of skills to give AI coding agents expertise in the open source LangChain ecosystem. This includes building agents with LangChain, LangGraph, and Deep Agents. On our...

Benchmark Your Way to Better RAG and Agents: Tuning Vector Search with MLflow

MLflow Blog · 2026-03-02 · Evals
High-level summary: problems, approaches, and takeaways for better RAG with MLflow

Evals Skills for Coding Agents

Hamel Husain's Blog · 2026-03-02 · Evals

You don’t know what your agent will do until it’s in production

LangChain Blog · 2026-02-26 · Evals
You can't monitor agents like traditional software. Inputs are infinite, behavior is non-deterministic, and quality lives in the conversations themselves. This article explains what to monitor,...

Multi-turn Evaluation & Simulation: Enhancing AI Observability with MLflow for Chatbots

MLflow Blog · 2026-02-24 · Evals
MLflow 3.10 introduces multi-turn evaluation and conversation simulation so you can score entire conversations, test agent changes with reproducible scenarios, and catch failures that only surface...

Agent Observability Powers Agent Evaluation

LangChain Blog · 2026-02-22 · Evals
You can't build reliable agents without understanding how they reason, and you can't validate improvements without systematic evaluation.

monday Service + LangSmith: Building a Code-First Evaluation Strategy from Day 1

LangChain Blog · 2026-02-18 · Evals
Learn how monday Service developed an eval-driven development framework for their customer-facing service agents.

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments

Hugging Face · 2026-02-12 · Evals

MemAlign: Building Better LLM Judges From Human Feedback With Scalable Memory

MLflow Blog · 2026-02-03 · Evals
Introducing MemAlign, a new framework that aligns LLMs with human feedback via a lightweight dual-memory system, achieving competitive or better quality than state-of-the-art prompt optimizers, at...

Introducing DeepEval, RAGAS, and Phoenix Judges in MLflow

MLflow Blog · 2026-01-29 · Evals
Improve your agents using MLflow's extensive, industry-leading suite of high-quality LLM judges.

Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs

Hugging Face · 2026-01-27 · Evals

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

Hugging Face · 2026-01-21 · Evals

Designing AI-resistant technical evaluations

Anthropic Engineering · 2026-01-21 · Evals
What we learned from three iterations of a performance engineering take-home that Claude keeps beating.

LLM Evals: Everything You Need to Know

Hamel Husain's Blog · 2026-01-15 · Evals

Demystifying evals for AI agents

Anthropic Engineering · 2026-01-09 · Evals
The capabilities that make agents useful also make them difficult to evaluate. The strategies that work across deployments combine techniques to match the complexity of the systems they measure.

Rapidly Prototype and Evaluate Agents with Claude Agent SDK and MLflow

MLflow Blog · 2025-10-15 · Evals
How to quickly prototype an agent using the Claude Agent SDK, then instrument and evaluate it with MLflow

Selecting The Right AI Evals Tool

Hamel Husain's Blog · 2025-10-01 · Evals

AfriMed-QA: Benchmarking large language models for global health

Google Research · 2025-09-24 · Evals

Democratizing AI Safety with RiskRubric.ai

Hugging Face · 2025-09-18 · Evals

Beyond Manually Crafted LLM Judges: Automate Building Domain-Specific Evaluators with MLflow

MLflow Blog · 2025-09-15 · Evals
How to easily create custom evaluators that understand the semantics of your domain and automatically align with human experts

Writing effective tools for agents — with agents

Anthropic Engineering · 2025-09-11 · Evals
Agents are only as effective as the tools we give them. We share how to write high-quality tools and evaluations, and how you can boost performance by using Claude to optimize its tools for itself.

Building and Managing an LLM-based OCR System with MLflow

MLflow Blog · 2025-08-30 · Evals
Building GenAI tools presents a unique set of challenges. As we evaluate accuracy, iterate on prompts, and enable collaboration, we often encounter bottlenecks that slow down our progress toward...

A scalable framework for evaluating health language models

Google Research · 2025-08-26 · Evals

Assessment-focused UIs in MLflow

MLflow Blog · 2025-08-11 · Evals
In MLflow 3.2, we introduced the concept of assessments, which are the quality evaluations and trace annotations that are crucial for understanding and improving your AI applications. With the...

📚 3LM: A Benchmark for Arabic LLMs in STEM and Code

Hugging Face · 2025-08-01 · Evals

Back to The Future: Evaluating AI Agents on Predicting Future Events

Hugging Face · 2025-07-17 · Evals

Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models

Hugging Face · 2025-07-04 · Evals

Inspect AI, An OSS Python Library For LLM Evals

Hamel Husain's Blog · 2025-06-23 · Evals

ScreenSuite - The most comprehensive evaluation suite for GUI Agents!

Hugging Face · 2025-06-06 · Evals

Deeper insights into retrieval augmented generation: The role of sufficient context

Google Research · 2025-05-14 · Evals

Introducing HELMET: Holistically Evaluating Long-context Language Models

Hugging Face · 2025-04-16 · Evals

Trace & Evaluate your Agent with Arize Phoenix

Hugging Face · 2025-02-28 · Evals

DABStep: Data Agent Benchmark for Multi-step Reasoning

Hugging Face · 2025-02-04 · Evals

Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet

Anthropic Engineering · 2025-01-06 · Evals
SWE-bench is an AI evaluation benchmark that assesses a model's ability to complete real-world software engineering tasks.

Benchmarking Language Model Performance on 5th Gen Xeon at GCP

Hugging Face · 2024-12-17 · Evals

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

Hugging Face · 2024-12-04 · Evals

Judge Arena: Benchmarking LLMs as Evaluators

Hugging Face · 2024-11-19 · Evals

Using LLM-as-a-Judge For Evaluation: A Complete Guide

Hamel Husain's Blog · 2024-10-29 · Evals

Introducing Contextual Retrieval

Anthropic Engineering · 2024-09-19 · Evals
For an AI model to be useful in specific contexts, it often needs access to background knowledge.

LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?

Hugging Face · 2024-07-25 · Evals

Our Transformers Code Agent beats the GAIA benchmark 🏅

Hugging Face · 2024-07-01 · Evals

BigCodeBench: The Next Generation of HumanEval

Hugging Face · 2024-06-18 · Evals

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

Hugging Face · 2024-05-24 · Evals

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Hugging Face · 2024-04-19 · Evals

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

Hugging Face · 2024-04-16 · Evals

Your AI Product Needs Evals

Hamel Husain's Blog · 2024-03-29 · Evals

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

Hugging Face · 2024-02-27 · Evals

Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

Hugging Face · 2024-02-20 · Evals

NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

Hugging Face · 2024-02-02 · Evals

Evaluating Language Model Bias with 🤗 Evaluate

Hugging Face · 2022-10-24 · Evals

Very Large Language Models and How to Evaluate Them

Hugging Face · 2022-10-03 · Evals