Introduction
Shipping a Generative AI application is only half the battle. Once your users are generating images with a model like Google's Gemini, a whole new set of questions emerges - and they're less about model quality and more about running a sustainable product:
- Who is using the application, and how often?
- How long does each generation actually take?
- And most critically: how much is each request costing us?
Traditional machine learning gives you clear signals like accuracy and loss to optimize against. Generative AI flips that on its head - the metrics that matter most are business metrics: token usage, latency per user, and cost per request. Without visibility into these, you're essentially flying blind.
In this post, we'll walk through how we integrated Databricks Managed MLflow into our Python FastAPI backend to bring that visibility to life - turning every image generation request into a trackable, measurable event.
The Technical Implementation
To keep things clean and maintainable, we built a dedicated TrackingService class within our backend that handles all observability logic in one place. This separation means our core application code stays focused on business logic, while all the MLflow instrumentation lives in a single, testable module. We also hand the logging off to a background task, so it adds no latency to the user-facing request.
Let's walk through each part of the implementation.
1. Setup and Initialization
The first step is connecting our application to the Databricks workspace. One of the nicest things about Databricks Managed MLflow is that there's no separate tracking server to spin up or maintain - you simply point the MLflow client at Databricks and it handles the rest.
In the snippet below, we initialize the TrackingService using dependency injection to pull in configuration values. This means our Dev and Prod environments each talk to their own isolated MLflow experiment automatically, with no code changes needed between deployments.
import mlflow
import logging

from config import config

logger = logging.getLogger(__name__)


class TrackingService:
    def __init__(self):
        """
        Initialize the TrackingService with Databricks Managed MLflow.
        """
        if not config.ENABLE_TRACKING:
            logger.info("Tracking is disabled via configuration.")
            return

        try:
            # Directs MLflow to log to the Databricks workspace linked by host/token env vars
            mlflow.set_tracking_uri("databricks")
            # Sets the experiment. If it doesn't exist, it will be created automatically.
            mlflow.set_experiment(config.MLFLOW_EXPERIMENT_NAME)
        except Exception as e:
            logger.error(f"Failed to initialize MLflow: {e}")
2. Tracking What Matters: Tokens & Cost
Of all the things we track, cost is the one that catches teams off guard most often. Unlike a traditional API with a flat rate, Generative AI models charge based on token consumption - you pay for the input tokens (the prompt and any images you send) and separately for the output tokens (what the model generates back).
To make this visible, we wrote a small helper function that calculates the estimated USD cost of every request in real time. It supports both text and image model pricing tiers, with the actual rates pulled from environment-level config so they can be updated without a code change.
def _calculate_cost(self, input_tokens: int, output_tokens: int, model_type: str = "text") -> float:
    """
    Calculate the estimated cost in USD based on token usage.
    """
    cost = 0.0
    # We define these costs in our environment variables/config
    if model_type == "text":
        cost += (input_tokens / 1_000_000) * config.TEXT_INPUT_TOKEN_COST_PER_MILLION
        cost += (output_tokens / 1_000_000) * config.TEXT_OUTPUT_TOKEN_COST_PER_MILLION
    elif model_type == "image":
        # Image models often have different pricing tiers
        cost += (input_tokens / 1_000_000) * config.IMAGE_INPUT_TOKEN_COST_PER_MILLION
        cost += (output_tokens / 1_000_000) * config.IMAGE_OUTPUT_TOKEN_COST_PER_MILLION
    return cost
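To make the arithmetic concrete, here is a quick worked example using purely illustrative rates - the real numbers come from config and track the provider's current price list:

# Illustrative rates, not actual Gemini pricing:
#   IMAGE_INPUT_TOKEN_COST_PER_MILLION  = 0.30   (USD per 1M input tokens)
#   IMAGE_OUTPUT_TOKEN_COST_PER_MILLION = 30.00  (USD per 1M output tokens)
#
# For a request that consumed 50,000 input tokens and 10,000 output tokens:
#   input cost:  (50_000 / 1_000_000) * 0.30  = 0.015 USD
#   output cost: (10_000 / 1_000_000) * 30.00 = 0.300 USD
#   estimated_cost_usd                        = 0.315 USD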
3. Logging the Run
With initialization and cost calculation in place, the final piece is logging each request as an MLflow run. MLflow gives us three distinct ways to attach data to a run, and we use all three intentionally:
- Tags are used for categorical identifiers like the user's email - making it easy to filter and group runs by user in the UI.
- Params capture the settings for that specific request, such as aspect_ratio and prompt_length.
- Metrics hold the quantitative measurements - processing_time_ms, input_tokens, output_tokens, and estimated_cost_usd.
We also log the raw prompt text as an artifact. This is a small but powerful addition - it means we can go back and qualitatively review exactly what users are asking for, which feeds directly into improving our prompt engineering over time.
def log_image_generation(self, user_email: str, prompt: str, aspect_ratio: str,
                         processing_time_ms: float, input_tokens: int, output_tokens: int):
    try:
        with mlflow.start_run(run_name="generate_image"):
            # 1. Who did it?
            mlflow.set_tag("user_email", user_email)

            # 2. What were the settings?
            mlflow.log_param("aspect_ratio", aspect_ratio)
            mlflow.log_param("prompt_length", len(prompt))

            # 3. Performance Metrics
            mlflow.log_metric("processing_time_ms", processing_time_ms)
            mlflow.log_metric("input_tokens", input_tokens)
            mlflow.log_metric("output_tokens", output_tokens)

            # 4. Financial Metrics
            estimated_cost = self._calculate_cost(input_tokens, output_tokens, model_type="image")
            mlflow.log_metric("estimated_cost_usd", estimated_cost)

            # 5. The Artifact (The actual prompt text)
            mlflow.log_text(prompt, "prompt.txt")
    except Exception as e:
        logger.error(f"Failed to log image generation: {e}")
The Result: Real-time Visibility
Once runs start flowing in, they appear instantly in the Databricks MLflow UI. What you get is a live, sortable table of every single image generation request - who triggered it, how long it took, how many tokens were consumed, and what it cost. No dashboards to build, no queries to write. It's all there out of the box.
The screenshot below shows what this looks like in practice. Each row is one user request, and every column is a metric or tag we logged programmatically.
Frequently Asked Questions
Does this approach work with models other than Gemini?
Yes. The TrackingService is model-agnostic - it simply logs whatever token counts and metadata your application passes to it. As long as your model's API returns input and output token counts, the same pattern applies whether you're using Gemini, OpenAI, Anthropic, or any other provider.
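In practice the only provider-specific piece is the line that reads the token counts out of the response object. A rough sketch is below - the attribute names reflect the Gemini and OpenAI Python SDKs as we understand them, so double-check them against the SDK version you use:

def extract_token_counts(provider: str, response) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) from a provider response object."""
    if provider == "gemini":
        # Gemini SDKs expose counts on response.usage_metadata
        return (response.usage_metadata.prompt_token_count,
                response.usage_metadata.candidates_token_count)
    if provider == "openai":
        # OpenAI chat completions expose counts on response.usage
        return (response.usage.prompt_tokens, response.usage.completion_tokens)
    raise ValueError(f"Unsupported provider: {provider}")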
What happens if the MLflow logging call fails - does it affect the user?
No. All logging calls are wrapped in try/except blocks, so any failure in the tracking layer is caught and logged as an error without bubbling up to the user-facing request. The core application continues to function normally even if observability is temporarily unavailable.
How do we handle MLflow logging in a high-traffic environment?
The logging calls are offloaded to a background task so they don't block the main request thread. FastAPI's BackgroundTasks is a natural fit for this and requires only minimal changes to the endpoint, as sketched below.
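Reusing the names from the endpoint sketch above (app, ImageRequest, call_gemini, tracking_service), the change is essentially one line - the logging call is scheduled instead of executed inline:

import time

from fastapi import BackgroundTasks


@app.post("/generate")
async def generate(request: ImageRequest, background_tasks: BackgroundTasks):
    start = time.perf_counter()
    image_bytes, input_tokens, output_tokens = await call_gemini(request.prompt, request.aspect_ratio)
    processing_time_ms = (time.perf_counter() - start) * 1000

    # Runs after the response has been sent, so MLflow I/O stays off the hot path
    background_tasks.add_task(
        tracking_service.log_image_generation,
        user_email=request.user_email,
        prompt=request.prompt,
        aspect_ratio=request.aspect_ratio,
        processing_time_ms=processing_time_ms,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
    )
    return {"status": "ok"}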
How do we keep token pricing up to date?
Pricing rates are stored in environment-level configuration rather than hardcoded in the application. This means when a provider updates their pricing, you only need to update a config value - no code changes or redeployments required.
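As a minimal sketch (assuming plain environment variables; the defaults are placeholders, not real prices), the rates used by _calculate_cost could be loaded like this:

import os

# USD per one million tokens; set the real values per environment
TEXT_INPUT_TOKEN_COST_PER_MILLION = float(os.getenv("TEXT_INPUT_TOKEN_COST_PER_MILLION", "0.0"))
TEXT_OUTPUT_TOKEN_COST_PER_MILLION = float(os.getenv("TEXT_OUTPUT_TOKEN_COST_PER_MILLION", "0.0"))
IMAGE_INPUT_TOKEN_COST_PER_MILLION = float(os.getenv("IMAGE_INPUT_TOKEN_COST_PER_MILLION", "0.0"))
IMAGE_OUTPUT_TOKEN_COST_PER_MILLION = float(os.getenv("IMAGE_OUTPUT_TOKEN_COST_PER_MILLION", "0.0"))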
Conclusion
Integrating Databricks Managed MLflow into our backend turned our Generative AI application from a black box into a system we can actually reason about. Every request is now a data point, and that changes how the whole team operates - from engineering to product to finance.
Questions that used to require digging through logs or guesswork now have immediate, concrete answers:
- "Which user incurred the highest cost yesterday?" - filter by the
user_emailtag and sumestimated_cost_usd - "Are long prompts causing higher latency?" - correlate
prompt_lengthwithprocessing_time_ms
The pattern we've described here is intentionally lightweight - a single service class, a few MLflow calls, and no extra infrastructure. But the visibility it unlocks is disproportionately valuable, especially as usage scales and cost management becomes a real operational concern.
If you're building something similar or exploring how to bring observability into your own GenAI stack, we'd love to hear about it. Feel free to reach out to our Cloud & GenAI Solutions team here.
