February 13, 2026

Monitoring Generative AI with Databricks MLflow: Track Token Usage, Cost, and Latency in Real Time

Introduction

Shipping a Generative AI application is only half the battle. Once your users are generating images with a model like Google's Gemini, a whole new set of questions emerges - and they're less about model quality and more about running a sustainable product:

  • Who is using the application, and how often?
  • How long does each generation actually take?
  • And most critically: how much is each request costing us?

Traditional machine learning gives you clear signals like accuracy and loss to optimize against. Generative AI flips that on its head - the metrics that matter most are business metrics: token usage, latency per user, and cost per request. Without visibility into these, you're essentially flying blind.

In this post, we'll walk through how we integrated Databricks Managed MLflow into our Python FastAPI backend to bring that visibility to life - turning every image generation request into a trackable, measurable event.

The Technical Implementation

To keep things clean and maintainable, we built a dedicated TrackingService class within our backend that handles all observability logic in one place. This separation means our core application code stays focused on business logic, while all the MLflow instrumentation lives in a single, testable module. We log data points asynchronously so there's no added latency to the user-facing request.

Let's walk through each part of the implementation.

1. Setup and Initialization

The first step is connecting our application to the Databricks workspace. One of the nicest things about Databricks Managed MLflow is that there's no separate tracking server to spin up or maintain - you simply point the MLflow client at Databricks and it handles the rest.

In the snippet below, we initialize the TrackingService using dependency injection to pull in configuration values. This means our Dev and Prod environments each talk to their own isolated MLflow experiment automatically, with no code changes needed between deployments.


import logging

import mlflow

from config import config

logger = logging.getLogger(__name__)


class TrackingService:
    def __init__(self):
        """
        Initialize the TrackingService with Databricks Managed MLflow.
        """
        if not config.ENABLE_TRACKING:
            logger.info("Tracking is disabled via configuration.")
            return

        try:
            # Directs MLflow to log to the Databricks workspace linked by host/token env vars
            mlflow.set_tracking_uri("databricks")

            # Sets the experiment. If it doesn't exist, it will be created automatically.
            mlflow.set_experiment(config.MLFLOW_EXPERIMENT_NAME)
        except Exception as e:
            logger.error(f"Failed to initialize MLflow: {e}")
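
For context, here is a minimal sketch of how the service might be wired up at application startup. The tracking_service module path is hypothetical, but DATABRICKS_HOST and DATABRICKS_TOKEN are the standard environment variables MLflow uses to authenticate against a Databricks workspace once the tracking URI is set to "databricks":

# Assumed environment, set before the app starts:
#   DATABRICKS_HOST  - the workspace URL, e.g. https://<your-workspace>.cloud.databricks.com
#   DATABRICKS_TOKEN - a personal access token or service principal token
from tracking_service import TrackingService  # hypothetical module path

tracking_service = TrackingService()  # one instance, created at application startup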

2. Tracking What Matters: Tokens & Cost

Of all the things we track, cost is the one that catches teams off guard most often. Unlike a traditional API with a flat rate, Generative AI models charge based on token consumption - you pay for the input tokens (the prompt and any images you send) and separately for the output tokens (what the model generates back).

To make this visible, we wrote a small helper function that calculates the estimated USD cost of every request in real time. It supports both text and image model pricing tiers, with the actual rates pulled from environment-level config so they can be updated without a code change.


def _calculate_cost(self, input_tokens: int, output_tokens: int, model_type: str = "text") -> float:
    """
    Calculate the estimated cost in USD based on token usage.
    """
    cost = 0.0

    # We define these costs in our environment variables/config
    if model_type == "text":
        cost += (input_tokens / 1_000_000) * config.TEXT_INPUT_TOKEN_COST_PER_MILLION
        cost += (output_tokens / 1_000_000) * config.TEXT_OUTPUT_TOKEN_COST_PER_MILLION
    elif model_type == "image":
        # Image models often have different pricing tiers
        cost += (input_tokens / 1_000_000) * config.IMAGE_INPUT_TOKEN_COST_PER_MILLION
        cost += (output_tokens / 1_000_000) * config.IMAGE_OUTPUT_TOKEN_COST_PER_MILLION

    return cost
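
To make the arithmetic concrete, here is a quick worked example with hypothetical per-million rates (the real numbers are whatever you put in config):

# Hypothetical rates: $0.30 per million input tokens, $2.50 per million output tokens
input_tokens, output_tokens = 1_200, 800

cost = (input_tokens / 1_000_000) * 0.30 + (output_tokens / 1_000_000) * 2.50
print(f"${cost:.5f}")  # -> $0.00236 for this single request

A fraction of a cent per request, but multiplied across thousands of generations a day it becomes a number the team actually needs to watch - which is why we log it on every run.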

3. Logging the Run

With initialization and cost calculation in place, the final piece is logging each request as an MLflow run. MLflow gives us three distinct ways to attach data to a run, and we use all three intentionally:

  • Tags are used for categorical identifiers like the user's email - making it easy to filter and group runs by user in the UI.
  • Params capture the settings for that specific request, such as aspect_ratio and prompt_length.
  • Metrics hold the quantitative measurements - processing_time_ms, input_tokens, output_tokens, and estimated_cost_usd.

We also log the raw prompt text as an artifact. This is a small but powerful addition - it means we can go back and qualitatively review exactly what users are asking for, which feeds directly into improving our prompt engineering over time.


def log_image_generation(self, user_email: str, prompt: str, aspect_ratio: str,
                         processing_time_ms: float, input_tokens: int, output_tokens: int):
    try:
        with mlflow.start_run(run_name="generate_image"):
            # 1. Who did it?
            mlflow.set_tag("user_email", user_email)

            # 2. What were the settings?
            mlflow.log_param("aspect_ratio", aspect_ratio)
            mlflow.log_param("prompt_length", len(prompt))

            # 3. Performance Metrics
            mlflow.log_metric("processing_time_ms", processing_time_ms)
            mlflow.log_metric("input_tokens", input_tokens)
            mlflow.log_metric("output_tokens", output_tokens)

            # 4. Financial Metrics
            estimated_cost = self._calculate_cost(input_tokens, output_tokens, model_type="image")
            mlflow.log_metric("estimated_cost_usd", estimated_cost)

            # 5. The Artifact (The actual prompt text)
            mlflow.log_text(prompt, "prompt.txt")
    except Exception as e:
        logger.error(f"Failed to log image generation: {e}")
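
To show how this fits into the request path, here is a sketch of a FastAPI handler calling the service. The route, request model, and call_image_model helper are illustrative placeholders for our actual endpoint and Gemini call, and the logging is shown inline for simplicity - the FAQ below covers moving it to a background task:

import time

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
tracking_service = TrackingService()  # the service from the snippets above

class GenerateRequest(BaseModel):
    prompt: str
    aspect_ratio: str = "1:1"
    user_email: str

@app.post("/generate-image")  # hypothetical route
async def generate_image(req: GenerateRequest):
    start = time.perf_counter()
    # call_image_model() stands in for the real Gemini call; it is assumed to
    # return the generated image plus the token counts from the API response.
    image, input_tokens, output_tokens = call_image_model(req.prompt, req.aspect_ratio)
    processing_time_ms = (time.perf_counter() - start) * 1000

    tracking_service.log_image_generation(
        user_email=req.user_email,
        prompt=req.prompt,
        aspect_ratio=req.aspect_ratio,
        processing_time_ms=processing_time_ms,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
    )
    return {"status": "ok"}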

The Result: Real-time Visibility

Once runs start flowing in, they appear instantly in the Databricks MLflow UI. What you get is a live, sortable table of every single image generation request - who triggered it, how long it took, how many tokens were consumed, and what it cost. No dashboards to build, no queries to write. It's all there out of the box.

The screenshot below shows what this looks like in practice. Each row is one user request, and every column is a metric or tag we logged programmatically.

[Screenshot: Managed MLflow experiment run table in the Databricks UI]

Frequently Asked Questions

Does this approach work with models other than Gemini?

Yes. The TrackingService is model-agnostic - it simply logs whatever token counts and metadata your application passes to it. As long as your model's API returns input and output token counts, the same pattern applies whether you're using Gemini, OpenAI, Anthropic, or any other provider.
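
As an illustration, the only provider-specific step is pulling the token counts off the response object before handing them to the service. A small adapter is one way to keep that mapping in a single place - the attribute names below reflect the Gemini and OpenAI SDKs as we understand them, but they vary between SDK versions, so verify them against your provider's documentation:

def extract_token_counts(response, provider: str) -> tuple[int, int]:
    """
    Map a provider-specific response object to (input_tokens, output_tokens).
    Attribute names are illustrative - confirm them against your SDK's docs.
    """
    if provider == "gemini":
        usage = response.usage_metadata
        return usage.prompt_token_count, usage.candidates_token_count
    if provider == "openai":
        usage = response.usage
        return usage.prompt_tokens, usage.completion_tokens
    raise ValueError(f"Unknown provider: {provider}")

From there, the same log_image_generation call works unchanged.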

What happens if the MLflow logging call fails - does it affect the user?

No. All logging calls are wrapped in try/except blocks, so any failure in the tracking layer is caught and logged as an error without bubbling up to the user-facing request. The core application continues to function normally even if observability is temporarily unavailable.

How do we handle MLflow logging in a high-traffic environment?

For high-throughput scenarios, we recommend running the logging calls asynchronously - offloading them to a background task so they don't block the main request thread. FastAPI's BackgroundTasks is a natural fit for this and requires minimal changes to the existing implementation.
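
For illustration, the change to the hypothetical endpoint sketched earlier is small - the handler takes a BackgroundTasks parameter and queues the logging call instead of running it inline:

from fastapi import BackgroundTasks

@app.post("/generate-image")  # same hypothetical route as before
async def generate_image(req: GenerateRequest, background_tasks: BackgroundTasks):
    start = time.perf_counter()
    image, input_tokens, output_tokens = call_image_model(req.prompt, req.aspect_ratio)
    processing_time_ms = (time.perf_counter() - start) * 1000

    # The logging call is queued and executed after the response is sent,
    # so MLflow latency never adds to the user-facing request time.
    background_tasks.add_task(
        tracking_service.log_image_generation,
        user_email=req.user_email,
        prompt=req.prompt,
        aspect_ratio=req.aspect_ratio,
        processing_time_ms=processing_time_ms,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
    )
    return {"status": "ok"}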

How do we keep token pricing up to date?

Pricing rates are stored in environment-level configuration rather than hardcoded in the application. This means when a provider updates their pricing, you only need to update a config value - no code changes or redeployments required.
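
To make that concrete, here is one minimal way the config module could read those rates from the environment. The env var names simply mirror the attributes used in the snippets above, and the defaults shown are placeholders:

import os

class Config:
    ENABLE_TRACKING = os.getenv("ENABLE_TRACKING", "true").lower() == "true"
    MLFLOW_EXPERIMENT_NAME = os.getenv("MLFLOW_EXPERIMENT_NAME", "/Shared/image-generation")  # placeholder default

    # Pricing in USD per one million tokens - update the env var when the provider changes rates
    TEXT_INPUT_TOKEN_COST_PER_MILLION = float(os.getenv("TEXT_INPUT_TOKEN_COST_PER_MILLION", "0.0"))
    TEXT_OUTPUT_TOKEN_COST_PER_MILLION = float(os.getenv("TEXT_OUTPUT_TOKEN_COST_PER_MILLION", "0.0"))
    IMAGE_INPUT_TOKEN_COST_PER_MILLION = float(os.getenv("IMAGE_INPUT_TOKEN_COST_PER_MILLION", "0.0"))
    IMAGE_OUTPUT_TOKEN_COST_PER_MILLION = float(os.getenv("IMAGE_OUTPUT_TOKEN_COST_PER_MILLION", "0.0"))

config = Config()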

Conclusion

Integrating Databricks Managed MLflow into our backend turned our Generative AI application from a black box into a system we can actually reason about. Every request is now a data point, and that changes how the whole team operates - from engineering to product to finance.

Questions that used to require digging through logs or guesswork now have immediate, concrete answers:

  • "Which user incurred the highest cost yesterday?" - filter by the user_email tag and sum estimated_cost_usd
  • "Are long prompts causing higher latency?" - correlate prompt_length with processing_time_ms
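
Both of these can also be answered programmatically: mlflow.search_runs returns the logged runs as a pandas DataFrame, with our tags, params, and metrics exposed as prefixed columns. A rough sketch:

from datetime import datetime, timedelta

import mlflow

from config import config

# Runs from the last 24 hours (start_time is filtered in epoch milliseconds)
cutoff_ms = int((datetime.now() - timedelta(days=1)).timestamp() * 1000)
runs = mlflow.search_runs(
    experiment_names=[config.MLFLOW_EXPERIMENT_NAME],
    filter_string=f"attributes.start_time > {cutoff_ms}",
)

# Which user incurred the highest cost?
cost_by_user = runs.groupby("tags.user_email")["metrics.estimated_cost_usd"].sum()
print(cost_by_user.sort_values(ascending=False).head())

# Are long prompts causing higher latency? (params come back as strings)
print(runs["params.prompt_length"].astype(float).corr(runs["metrics.processing_time_ms"]))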

The pattern we've described here is intentionally lightweight - a single service class, a few MLflow calls, and no extra infrastructure. But the visibility it unlocks is disproportionately valuable, especially as usage scales and cost management becomes a real operational concern.

If you're building something similar or exploring how to bring observability into your own GenAI stack, we'd love to hear about it. Feel free to reach out to our Cloud & GenAI Solutions team here.
