Evaluations are a quantitative way to measure the performance of LLM applications. LLMs can behave unpredictably: even small changes to prompts, models, or inputs can significantly affect results. Evaluations provide a structured way to identify failures, compare versions, and build more reliable AI applications. Running an evaluation in LangSmith requires three key components:
  • Dataset: A set of test inputs (and optionally, expected outputs).
  • Target function: The part of your application you want to test—this might be a single LLM call with a new prompt, one module, or your entire workflow.
  • Evaluators: Functions that score your target function’s outputs.
This quickstart guides you through running a starter evaluation that checks the correctness of LLM responses, using either the LangSmith SDK or UI.
If you prefer to watch a video on getting started with datasets and evaluations, refer to the Video guide.
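If you plan to follow the SDK path, it helps to see how the three components fit together before walking through the UI steps. The skeleton below is a minimal sketch, assuming a recent langsmith Python package; the dataset name, target body, and evaluator body are placeholders that later steps fill in.

```python
from langsmith import evaluate

# 1. Dataset: a set of test inputs and reference outputs, referenced by name.
DATASET_NAME = "Quickstart dataset"  # placeholder name

# 2. Target function: the part of your application you want to test.
def target(inputs: dict) -> dict:
    # Call your prompt, model, or workflow here and return its output.
    return {"output": "..."}

# 3. Evaluator: a function that scores the target's outputs.
def correctness(run, example) -> dict:
    # Compare run.outputs against example.outputs and return a score.
    return {"key": "correctness", "score": 1.0}

# Running an evaluation applies the target to every example in the dataset
# and scores each result with every evaluator.
evaluate(target, data=DATASET_NAME, evaluators=[correctness])
```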

Prerequisites

Before you begin, select the UI or SDK filter for instructions:
  • UI
  • SDK

1. Set workspace secrets

In the LangSmith UI for your workspace, ensure that your OpenAI API key is set as a workspace secret.
  1. Navigate to Settings and then select the Secrets tab.
  2. On the Workspace Secrets page, select Add secret, enter OPENAI_API_KEY as the Key, and paste your API key as the Value.
  3. Select Save secret.
When adding workspace secrets in the LangSmith UI, make sure the secret keys match the environment variable names expected by your model provider.
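If you follow the SDK path instead, set the same keys as environment variables before running your evaluation script. A minimal sketch, assuming the variable names read by current langsmith and openai Python packages:

```python
import os

# LangSmith reads its API key (and an optional tracing flag) from the environment.
os.environ["LANGSMITH_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGSMITH_TRACING"] = "true"  # optional: also trace target runs

# The OpenAI client reads the same variable name used for the workspace secret.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
```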

2. Create a prompt

LangSmith’s Prompt Playground makes it possible to run evaluations over different prompts, models, and model configurations.
  1. In the LangSmith UI, navigate to the Playground under Prompt Engineering.
  2. In the Prompts panel, modify the SYSTEM prompt to:
    Answer the following question accurately:
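If you are testing the same prompt from code rather than the Playground, it becomes the system message of your target function. The sketch below is one way to wrap it, assuming the openai Python package; gpt-4o-mini is an illustrative model choice, not one prescribed by this guide.

```python
from openai import OpenAI

oai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "Answer the following question accurately:"

def target(inputs: dict) -> dict:
    """Target function: takes one dataset example's inputs, returns the model's answer."""
    response = oai_client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; use whichever model you want to test
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"output": response.choices[0].message.content}
```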
    

3. Create a dataset

  1. Click Set up Evaluation, which will open a New Experiment table at the bottom of the page.
  2. In the Select or create a new dataset dropdown, click the + New button to create a new dataset.
    (Screenshot: the Playground with the edited system prompt and the New Experiment panel's dropdown for creating a new dataset.)
  3. Add the following examples to the dataset:
    Inputs → Reference Outputs
    • question: Which country is Mount Kilimanjaro located in? → output: Mount Kilimanjaro is located in Tanzania.
    • question: What is Earth’s lowest point? → output: Earth’s lowest point is the Dead Sea.
  4. Click Save and enter a name to save your newly created dataset.
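The same dataset can also be created with the SDK instead of the UI. A minimal sketch using the langsmith Client; the dataset name is an assumption:

```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

# Create the dataset and add the two question/answer examples shown above.
dataset = client.create_dataset(dataset_name="Quickstart dataset")
client.create_examples(
    inputs=[
        {"question": "Which country is Mount Kilimanjaro located in?"},
        {"question": "What is Earth's lowest point?"},
    ],
    outputs=[
        {"output": "Mount Kilimanjaro is located in Tanzania."},
        {"output": "Earth's lowest point is the Dead Sea."},
    ],
    dataset_id=dataset.id,
)
```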

4. Add an evaluator

  1. Click + Evaluator and select Correctness from the Pre-built Evaluator options.
  2. In the Correctness panel, click Save.
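The pre-built Correctness evaluator is an LLM-as-judge that grades the output against the reference output. On the SDK path you pass your own evaluator functions to evaluate(); the sketch below uses a deliberately naive string check as a stand-in for illustration, not a reimplementation of the pre-built judge:

```python
from langsmith.schemas import Example, Run

def correctness(run: Run, example: Example) -> dict:
    """Score 1.0 if the reference answer text appears in the model's output, else 0.0."""
    predicted = (run.outputs or {}).get("output", "")
    reference = (example.outputs or {}).get("output", "")
    score = 1.0 if reference.strip().lower() in predicted.lower() else 0.0
    return {"key": "correctness", "score": score}
```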

5. Run your evaluation

  1. Select Start at the top right to run your evaluation. This will create an experiment that you can view in full by clicking the experiment name.
    (Screenshot: the full experiment view of the results for the example dataset.)
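On the SDK path, the equivalent of Start is a single evaluate() call that combines the pieces sketched in the earlier steps (the target function, the dataset name, and the correctness evaluator):

```python
from langsmith import evaluate

results = evaluate(
    target,                          # target function from the step 2 sketch
    data="Quickstart dataset",       # dataset name from the step 3 sketch
    evaluators=[correctness],        # evaluator from the step 4 sketch
    experiment_prefix="quickstart-correctness",
)
# evaluate() typically prints a link to the new experiment, which you can open
# in the LangSmith UI to inspect each example's output and score.
```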

Next steps

To learn more about running experiments in LangSmith, read the evaluation conceptual guide.

Video guide

(Embedded video: a walkthrough of datasets and evaluations in LangSmith.)