Evidently AI

Collaborative AI observability platform

4.0 · 3 reviews · 397 followers
Predictive AI · AI Infrastructure Tools · AI Metrics and Evaluation
Evidently helps evaluate, test and monitor your AI-powered products, from ML-based classifiers to LLM chatbots and agents. Built on top of the leading open-source library with over 20 million downloads: https://github.com/evidentlyai/evidently
Company Info
evidentlyai.com · GitHub
Evidently AI Info
Launched in 2021 · 2 launches
Forum
p/evidently-ai
Awards
Evidently AI was ranked #3 of the day for August 4th, 2021

Similar Products

TensorFlow
An end-to-end open source machine learning platform
4.8 (10 reviews)
Automation tools · AI Infrastructure Tools

Apple
Think Different
4.7 (108 reviews)

Google Cloud Platform
A suite of cloud computing services by Google
5.0 (129 reviews)
Engineering & Development · Web hosting services

Microsoft Azure
Optimize your costs by developing in the cloud.
5.0 (38 reviews)
Cloud Computing Platforms

Best of Machine Learning
A collection of the best resources in Machine Learning & AI
Knowledge base software · AI
This is the 2nd launch from Evidently AI.
Evidently AI

Open-source evaluations and observability for LLM apps

Evidently AI was ranked #5 of the day for August 20th, 2024
Evidently is an open-source framework to evaluate, test and monitor AI-powered apps.

📚 100+ built-in checks, from classification to RAG.
🚦 Both offline evals and live monitoring.
🛠 Easily add custom metrics and LLM judges.
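The bullets above describe offline checks in general terms. A minimal sketch of what one such check might look like, in plain Python; the check, data, and names below are invented for illustration and are not Evidently's built-in metrics:

```python
# Toy illustration of an offline eval run: score each output against a
# simple keyword check and compute a pass rate over a small dataset.

def contains_keyword(output: str, keyword: str) -> bool:
    """A minimal deterministic check: does the answer mention the keyword?"""
    return keyword.lower() in output.lower()

dataset = [
    {"output": "Data drift is a shift in input data.", "keyword": "drift"},
    {"output": "It measures completeness of retrieval.", "keyword": "recall"},
]

results = [contains_keyword(row["output"], row["keyword"]) for row in dataset]
pass_rate = sum(results) / len(results)
print(f"pass rate: {pass_rate:.0%}")  # 1 of 2 rows passes -> "pass rate: 50%"
```

In practice you would run many such checks (deterministic rules, model-based scores, LLM judges) over the same dataset and aggregate the results.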
Free
Launch tags:
Open Source · Developer Tools · Artificial Intelligence
Launch Team / Built With
Michael Seibel, Elena Samuylova, Emeli Dral
GitHub
Hugging Face
ChatGPT by OpenAI


Elena Samuylova · Evidently AI · Maker
Hi Makers! I'm Elena, a co-founder of Evidently AI. I'm excited to share that our open-source Evidently library is stepping into the world of LLMs! 🚀

Three years ago, we started with testing and monitoring for what's now called "traditional" ML: think classification, regression, ranking, and recommendation systems. With over 20 million downloads, we're now bringing our toolset to help evaluate and test LLM-powered products.

As you build an LLM-powered app or feature, figuring out if it's "good enough" can be tricky. Evaluating generative AI is different from traditional software and predictive ML. It lacks clear criteria and labeled answers, making quality more subjective and harder to measure. But there is no way around it: to deploy an AI app to production, you need a way to evaluate it. For instance, you might ask:

- How does the quality compare if I switch from GPT to Claude?
- What will change if I tweak a prompt? Do my previous good answers hold?
- Where is it failing?
- What real-world quality are users experiencing?

It's not just about metrics: it's about the whole quality workflow. You need to define what "good" means for your app, set up offline tests, and monitor live quality. With Evidently, we provide the complete open-source infrastructure to build and manage these evaluation workflows. Here's what you can do:

📚 Pick from a library of metrics or configure custom LLM judges
📊 Get interactive summary reports or export raw evaluation scores
🚦 Run test suites for regression testing
📈 Deploy a self-hosted monitoring dashboard
⚙️ Integrate it with any adjacent tools and frameworks

It's open-source under an Apache 2.0 license. We build it together with the community: I would love to learn how you address this problem and any feedback and feature requests.

Check it out on GitHub: https://github.com/evidentlyai/e..., get started in the docs: http://docs.evidentlyai.com, or join our Discord to chat: https://discord.gg/xZjKRaNp8b.
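The "GPT vs Claude" question in the comment above can be illustrated with a toy side-by-side scoring loop. The metric, variant outputs, and names below are hypothetical stand-ins, not how Evidently implements comparisons:

```python
# Compare two model/prompt variants on the same inputs with a shared
# scoring function, and report the per-example score delta.

def score(answer: str) -> int:
    # Stand-in metric: shorter answers score higher (a toy conciseness score).
    return max(0, 100 - len(answer))

inputs = ["What is Evidently?", "What is an LLM judge?"]
variant_a = ["An open-source evaluation framework.",
             "An LLM that grades other LLM outputs."]
variant_b = ["Evidently is an open-source framework to evaluate, test and monitor AI apps.",
             "A judge model scoring outputs."]

for question, a, b in zip(inputs, variant_a, variant_b):
    delta = score(b) - score(a)
    print(f"{question}: score delta {delta:+d} (B vs A)")
```

A real workflow would swap the toy `score` for the metric that defines "good" for your app, and run it over a curated evaluation dataset rather than two hand-written examples.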
12mo ago
Joseph Abraham · SaaS for Greater Good
@elenasamuylova Congrats on bringing your idea to life! Wishing you a smooth and prosperous journey. How can we best support you on this journey?
12mo ago
Elena Samuylova · Evidently AI · Maker
@kjosephabraham Thanks for the support! We always appreciate any feedback and help in spreading the word. As an open-source tool, it is built together with the community! 🚀
12mo ago
Emeli Dral · Evidently AI · Maker
Hi everyone! I am Emeli, one of the co-founders of Evidently AI. I'm thrilled to share what we've been working on lately with our open-source Python library.

I want to highlight a specific new feature of this launch: LLM judge templates. LLM as a judge is a popular evaluation method where you use an external LLM to review and score the outputs of LLMs. However, one thing we learned is that no LLM app is alike. Your quality criteria are unique to your use case. Even something seemingly generic like "sentiment" will mean something different each time. While we do have templates (it's always great to have a place to start), our primary goal is to make it easy to create custom LLM-powered evaluations.

Here is how it works:

🏆 Define your grading criteria in plain English. Specify what matters to you, whether it's conciseness, clarity, relevance, or creativity.
💬 Pick a template. Pass your criteria to an Evidently template, and we'll generate a complete evaluation prompt for you, including formatting it as JSON and asking the LLM to explain its scores.
▶️ Run evals. Apply these evaluations to your datasets or recent traces from your app.
📊 Get results. Once you set a metric, you can use it across the Evidently framework. You can generate visual reports, run conditional test suites, and track metrics over time on a dashboard. You can track any metric you like, from hallucinations to how well your chatbot follows the brand guidelines.

We plan to expand on this feature, making it easier to add examples to your prompt and adding more templates, such as pairwise comparisons. Let us know what you think!

To check it out, visit our GitHub: https://github.com/evidentlyai/e..., docs: http://docs.evidentlyai.com, or Discord to chat: https://discord.gg/xZjKRaNp8b.
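The four steps above (criteria in plain English, template, run, results) can be sketched as a toy prompt builder. The template string and all names below are invented for illustration; they are not Evidently's actual templates or API:

```python
import json

# Turn plain-English grading criteria into a judge prompt that asks the
# grading LLM for a JSON verdict with a score and an explanation.

def build_judge_prompt(criteria: str, answer: str) -> str:
    return (
        "You are an impartial evaluator.\n"
        f"Criteria: {criteria}\n"
        f"Answer to grade:\n{answer}\n"
        'Respond with JSON: {"score": 0 or 1, "explanation": "..."}'
    )

prompt = build_judge_prompt(
    criteria="The answer must be concise and free of marketing language.",
    answer="Our product is the best-in-class, revolutionary solution!",
)
print(prompt)

# A judge LLM might return something like this, which you then parse:
fake_judge_reply = '{"score": 0, "explanation": "Marketing language present."}'
verdict = json.loads(fake_judge_reply)
```

Asking for a binary score plus an explanation, as the reply format above does, keeps judge outputs machine-parseable and makes it easier to check the judge itself against human labels.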
12mo ago
Emeli Dral · Evidently AI · Maker
@hamza_afzal_butt Thank you so much!
12mo ago
Rod Rivera
Congratulations on the launch, Evidently team! I've always admired Evidently for its comprehensive, all-encompassing approach. I often work with teams who are unsure about what metrics to focus on or how to begin their evaluation process. For those new or unsure where to start:

- What best practices would you recommend?
- Is there a feature that helps beginners 'set things on autopilot' while they're learning the ropes?
- Do you offer any guided workflows or templates for common use cases that could help newcomers get started quickly?

Thanks for your continued innovations in this space!
12mo ago
Elena Samuylova · Evidently AI · Maker
@rorcde Thanks for the support! 🙏🏻

Quickstart: We have a simple example here: https://docs.evidentlyai.com/get.... It will literally take a couple of minutes! We packaged some popular evaluations as presets and general metrics (like detecting denials). However, we generally encourage using your own custom criteria: no LLM app is exactly alike, and the beauty of using LLM as a judge is that you can use your own definitions. We made it super easy to define your custom prompt just by writing your criteria in plain English.

Best practices: That's a huuuge question. Let me try to summarize a few of them:

- Don't skip the evals! Implementing evals can sound complex, so it's tempting to "ship on vibes". But it's much easier to start with a simple evaluation pipeline that you iterate on than to try adding evals to your process later on. So, start simple.
- Make curating an evaluation dataset a part of your process. When it comes to offline evals, the metrics are as important as the data you run them on. Preparing a set of representative, realistic inputs (and, ideally, approved outputs) is a high-value activity that should be part of the process.
- Log everything. On that note, don't miss out on capturing real traces of user conversations. You can then use them for testing, to replay new prompts against them, etc.
- Start with regression testing. This is low-hanging fruit in evals: every time you change a prompt, re-generate new outputs for a set of representative inputs and see what changed (or have peace of mind that nothing did). This is hugely important for the speed of iteration.
- If you use LLM as a judge, start with binary criteria and measure the quality of your judge. It's also easier to test alignment this way.
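The regression-testing practice described above can be sketched in a few lines of plain Python; the stub model and data are hypothetical, standing in for your real LLM app:

```python
# Keep approved outputs for representative inputs, re-run the app after
# a prompt change, and flag anything that changed.

approved = {
    "What is data drift?": "A change in input data distribution over time.",
    "What is recall?": "The share of relevant items that were retrieved.",
}

def new_model(question: str) -> str:
    # Stub for the app after a prompt tweak; in reality this calls your LLM.
    if question == "What is recall?":
        return "Recall measures retrieved relevant items."  # this answer changed
    return approved[question]

changed = {q for q, old in approved.items() if new_model(q) != old}
print(f"{len(changed)} of {len(approved)} answers changed: {sorted(changed)}")
```

Exact string comparison is the simplest possible diff; with non-deterministic LLM outputs you would typically compare metric scores or judge verdicts instead of raw text.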
12mo ago

Evidently AI Launches

Evidently AI: Open-source evaluations and observability for LLM apps

Launched on August 20th, 2024

Evidently AI was ranked #5 of the day for August 20th, 2024
4.0
Based on 3 reviews

Evidently AI is praised for its effectiveness in detecting data drift and alerting users to underperforming models. Users find its analytics and visualizations particularly useful. The tool is highly recommended for those who value observability in their AI pipelines. Overall, Evidently AI is considered a remarkable tool with positive feedback on its performance and utility.

Mariya, Antonis Stellas, Mikhail Rozhkov
Summarized with AI
Reviews

Mariya · 13 reviews
It does a good job at detecting data drift and alerting you when your models are underperforming. The analytics and visualisations are pretty useful on multiple occasions.
10mo ago
Antonis Stellas · 1 review
If you consider observability to be of great value in your AI pipeline, then you should also go for Evidently's tools!
11mo ago
Mikhail Rozhkov · DVC · 2 reviews
I think that Evidently is a remarkable tool! All the best for your future endeavors!
1yr ago