Google Research
· 2026-03-31
· Evals
Algorithms & Theory
LangChain Blog
· 2026-03-27
· Evals
A practical checklist for agent evaluation: error analysis, dataset construction, grader design, offline & online evals, and production readiness.
LangChain Blog
· 2026-03-26
· Evals
Discover how Kensho, S&P Global’s AI innovation engine, leveraged LangGraph to create its Grounding framework, a unified agentic access layer that solves fragmented financial data retrieval at enterprise scale.
LangChain Blog
· 2026-03-26
· Evals
💡TLDR: The best agent evals directly measure an agent behavior we care about. Here's how we source data, create metrics, and run well-scoped, targeted experiments over time to make agents more...
Hacker News
· 2026-03-24
· Evals
What it is: Knitting is a from-scratch concurrency framework for JavaScript built on shared memory. Instead of manually wiring message protocols, you just call functions across threads. Why I built...
Hugging Face
· 2026-03-24
· Evals
MLflow Blog
· 2026-03-23
· Evals
How to test Claude Code skills using MLflow tracing and LLM judges, and create a self-improvement loop where Claude Code refines its own skills.
MLflow Blog
· 2026-03-18
· Evals
Making your agents work reliably in production requires observability, evaluation, version control, and governance. Learn how MLflow brings all four together as the only complete open source AI platform.
Databricks Blog
· 2026-03-17
· Evals
Retrieval underpins modern AI systems, and the quality of the embedding model determines...
Hacker News
· 2026-03-16
· Evals
Article URL: https://www.youtube.com/watch?v=tz5wALHhhds · Comments URL: https://news.ycombinator.com/item?id=47396503 · Points: 1 · Comments: 0
Hugging Face
· 2026-03-13
· Evals
Hacker News
· 2026-03-12
· Evals
As anyone with an internet connection knows, for the past three years there’s been a lot of buzz about how AI is going to reshape the workforce, and layoffs due to “AI” have already started, the most...
Databricks Blog
· 2026-03-11
· Evals
Databricks is excited to announce the acquisition of Quotient AI, an innovator in...
Hacker News
· 2026-03-08
· Evals
Article URL: https://arxiv.org/abs/2603.03823 · Comments URL: https://news.ycombinator.com/item?id=47295537 · Points: 8 · Comments: 1
Hacker News
· 2026-03-08
· Evals
dlgo is an LLM inference engine written in Go. CPU path has zero dependencies beyond the standard library. GPU path uses Vulkan compute; no CUDA required. I benchmarked it against Ollama using the...
Hacker News
· 2026-03-08
· Evals
I was researching the personal finance market and initially found only a few obvious companies. After digging more, I found a much longer tail of startups I’d never heard of, including products...
Anthropic Engineering
· 2026-03-06
· Evals
Evaluating Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it—raising questions about eval integrity in web-enabled environments.
LangChain Blog
· 2026-03-05
· Evals
By Robert Xu. Recently at LangChain we’ve been building skills to help coding agents like Codex, Claude Code, and Deep Agents CLI work with our ecosystem: namely, LangChain and LangSmith. This is...
MLflow Blog
· 2026-03-04
· Evals
Score agent plans, tool calls, and reasoning with TruLens GPA framework through mlflow.genai.evaluate().
LangChain Blog
· 2026-03-04
· Evals
We’re releasing a CLI along with our first set of skills to give AI coding agents expertise in the LangSmith ecosystem. This includes adding tracing to agents, understanding their execution,...
LangChain Blog
· 2026-03-04
· Evals
We’re releasing our first set of skills to give AI coding agents expertise in the open source LangChain ecosystem. This includes building agents with LangChain, LangGraph, and Deep Agents. On our...
MLflow Blog
· 2026-03-02
· Evals
High-level summary: problems, approaches, and takeaways for better RAG with MLflow
Hamel Husain's Blog
· 2026-03-02
· Evals
Evals Skills for Coding Agents
LangChain Blog
· 2026-02-26
· Evals
You can't monitor agents like traditional software. Inputs are infinite, behavior is non-deterministic, and quality lives in the conversations themselves. This article explains what to monitor,...
MLflow Blog
· 2026-02-24
· Evals
MLflow 3.10 introduces multi-turn evaluation and conversation simulation so you can score entire conversations, test agent changes with reproducible scenarios, and catch failures that only surface...
LangChain Blog
· 2026-02-22
· Evals
You can't build reliable agents without understanding how they reason, and you can't validate improvements without systematic evaluation.
LangChain Blog
· 2026-02-18
· Evals
Learn how monday Service developed an eval-driven development framework for their customer-facing service agents.
Hugging Face
· 2026-02-12
· Evals
MLflow Blog
· 2026-02-03
· Evals
Introducing MemAlign, a new framework that aligns LLMs with human feedback via a lightweight dual-memory system, achieving competitive or better quality than state-of-the-art prompt optimizers, at...
MLflow Blog
· 2026-01-29
· Evals
Improve your agents using MLflow's extensive, industry-leading suite of high-quality LLM judges.
Hugging Face
· 2026-01-27
· Evals
Hugging Face
· 2026-01-21
· Evals
Anthropic Engineering
· 2026-01-21
· Evals
What we learned from three iterations of a performance engineering take-home that Claude keeps beating.
Hamel Husain's Blog
· 2026-01-15
· Evals
LLM Evals: Everything You Need to Know
Anthropic Engineering
· 2026-01-09
· Evals
The capabilities that make agents useful also make them difficult to evaluate. The strategies that work across deployments combine techniques to match the complexity of the systems they measure.
MLflow Blog
· 2025-10-15
· Evals
How to quickly prototype an agent using the Claude Agent SDK then instrument and evaluate it with MLflow
Hamel Husain's Blog
· 2025-10-01
· Evals
Selecting The Right AI Evals Tool
Google Research
· 2025-09-24
· Evals
Generative AI
Hugging Face
· 2025-09-18
· Evals
MLflow Blog
· 2025-09-15
· Evals
How to easily create custom evaluators that understand the semantics of your domain and automatically align with human experts
Anthropic Engineering
· 2025-09-11
· Evals
Agents are only as effective as the tools we give them. We share how to write high-quality tools and evaluations, and how you can boost performance by using Claude to optimize its tools for itself.
MLflow Blog
· 2025-08-30
· Evals
Building GenAI tools presents a unique set of challenges. As we evaluate accuracy, iterate on prompts, and enable collaboration, we often encounter bottlenecks that slow down our progress toward...
Google Research
· 2025-08-26
· Evals
Generative AI
MLflow Blog
· 2025-08-11
· Evals
In MLflow 3.2, we introduced the concept of assessments, which are the quality evaluations and trace annotations that are crucial for understanding and improving your AI applications. With the...
Hugging Face
· 2025-08-01
· Evals
Hugging Face
· 2025-07-17
· Evals
Hugging Face
· 2025-07-04
· Evals
Hamel Husain's Blog
· 2025-06-23
· Evals
Inspect AI, An OSS Python Library For LLM Evals
Hugging Face
· 2025-06-06
· Evals
Google Research
· 2025-05-14
· Evals
Data Mining & Modeling
Hugging Face
· 2025-04-16
· Evals
Hugging Face
· 2025-02-28
· Evals
Hugging Face
· 2025-02-04
· Evals
Anthropic Engineering
· 2025-01-06
· Evals
SWE-bench is an AI evaluation benchmark that assesses a model's ability to complete real-world software engineering tasks.
Hugging Face
· 2024-12-17
· Evals
Hugging Face
· 2024-12-04
· Evals
Hugging Face
· 2024-11-19
· Evals
Hamel Husain's Blog
· 2024-10-29
· Evals
Using LLM-as-a-Judge For Evaluation: A Complete Guide
Anthropic Engineering
· 2024-09-19
· Evals
For an AI model to be useful in specific contexts, it often needs access to background knowledge.
Hugging Face
· 2024-07-25
· Evals
Hugging Face
· 2024-07-01
· Evals
Hugging Face
· 2024-06-18
· Evals
Hugging Face
· 2024-05-24
· Evals
Hugging Face
· 2024-04-19
· Evals
Hugging Face
· 2024-04-16
· Evals
Hamel Husain's Blog
· 2024-03-29
· Evals
Your AI Product Needs Evals
Hugging Face
· 2024-02-27
· Evals
Hugging Face
· 2024-02-20
· Evals
Hugging Face
· 2024-02-02
· Evals
Hugging Face
· 2022-10-24
· Evals
Hugging Face
· 2022-10-03
· Evals
Google Developers Blog
· Evals
To bridge the gap between static model knowledge and rapidly evolving software practices, Google DeepMind developed a "Gemini API developer skill" that provides agents with live documentation and...
Google Developers Blog
· Evals
The newly introduced continuous checkpointing feature in Orbax and MaxText is designed to optimize the balance between reliability and performance during model training, addressing issues with...