February 13, 2026

Monitoring Generative AI with Databricks MLflow: Track Token Usage, Cost, and Latency in Real Time

Introduction

Shipping a Generative AI application is only half the battle. Once your users are generating images with a model like Google's Gemini, a whole new set of questions emerges - and they're less about model quality and more about running a sustainable product:

  • Who is using the application, and how often?
  • How long does each generation actually take?
  • And most critically: how much is each request costing us?

Traditional machine learning gives you clear signals like accuracy and loss to optimize against. Generative AI flips that on its head - the metrics that matter most are business metrics: token usage, latency per user, and cost per request. Without visibility into these, you're essentially flying blind.

In this post, we'll walk through how we integrated Databricks Managed MLflow into our Python FastAPI backend to bring that visibility to life - turning every image generation request into a trackable, measurable event.

The Technical Implementation

To keep things clean and maintainable, we built a dedicated TrackingService class within our backend that handles all observability logic in one place. This separation means our core application code stays focused on business logic, while all the MLflow instrumentation lives in a single, testable module. We log data points asynchronously so there's no added latency to the user-facing request.

Let's walk through each part of the implementation.

1. Setup and Initialization

The first step is connecting our application to the Databricks workspace. One of the nicest things about Databricks Managed MLflow is that there's no separate tracking server to spin up or maintain - you simply point the MLflow client at Databricks and it handles the rest.

In the snippet below, we initialize the TrackingService using dependency injection to pull in configuration values. This means our Dev and Prod environments each talk to their own isolated MLflow experiment automatically, with no code changes needed between deployments.


import logging

import mlflow

from config import config

logger = logging.getLogger(__name__)


class TrackingService:
    def __init__(self):
        """
        Initialize the TrackingService with Databricks Managed MLflow.
        """
        if not config.ENABLE_TRACKING:
            logger.info("Tracking is disabled via configuration.")
            return

        try:
            # Directs MLflow to log to the Databricks workspace linked by host/token env vars
            mlflow.set_tracking_uri("databricks")

            # Sets the experiment. If it doesn't exist, it will be created automatically.
            mlflow.set_experiment(config.MLFLOW_EXPERIMENT_NAME)
        except Exception as e:
            logger.error(f"Failed to initialize MLflow: {e}")

2. Tracking What Matters: Tokens & Cost

Of all the things we track, cost is the one that catches teams off guard most often. Unlike a traditional API with a flat rate, Generative AI models charge based on token consumption - you pay for the input tokens (the prompt and any images you send) and separately for the output tokens (what the model generates back).

To make this visible, we wrote a small helper function that calculates the estimated USD cost of every request in real time. It supports both text and image model pricing tiers, with the actual rates pulled from environment-level config so they can be updated without a code change.


def _calculate_cost(self, input_tokens: int, output_tokens: int, model_type: str = "text") -> float:
    """
    Calculate the estimated cost in USD based on token usage.
    """
    cost = 0.0

    # We define these costs in our environment variables/config
    if model_type == "text":
        cost += (input_tokens / 1_000_000) * config.TEXT_INPUT_TOKEN_COST_PER_MILLION
        cost += (output_tokens / 1_000_000) * config.TEXT_OUTPUT_TOKEN_COST_PER_MILLION
    elif model_type == "image":
        # Image models often have different pricing tiers
        cost += (input_tokens / 1_000_000) * config.IMAGE_INPUT_TOKEN_COST_PER_MILLION
        cost += (output_tokens / 1_000_000) * config.IMAGE_OUTPUT_TOKEN_COST_PER_MILLION

    return cost
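
To make the arithmetic concrete with purely illustrative rates: if image input tokens cost $1.00 per million and output tokens $30.00 per million, a request that consumes 1,290 input tokens and 5,160 output tokens works out to (1,290 / 1,000,000) * 1.00 + (5,160 / 1,000,000) * 30.00, or roughly $0.156. Because the real rates live in config, the same function stays accurate whenever pricing changes.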
    

3. Logging the Run

With initialization and cost calculation in place, the final piece is logging each request as an MLflow run. MLflow gives us three distinct ways to attach data to a run, and we use all three intentionally:

  • Tags are used for categorical identifiers like the user's email - making it easy to filter and group runs by user in the UI.
  • Params capture the settings for that specific request, such as aspect_ratio and prompt_length.
  • Metrics hold the quantitative measurements - processing_time_ms, input_tokens, output_tokens, and estimated_cost_usd.

We also log the raw prompt text as an artifact. This is a small but powerful addition - it means we can go back and qualitatively review exactly what users are asking for, which feeds directly into improving our prompt engineering over time.


def log_image_generation(self, user_email: str, prompt: str, aspect_ratio: str,
                         processing_time_ms: float, input_tokens: int, output_tokens: int):
    try:
        with mlflow.start_run(run_name="generate_image"):
            # 1. Who did it?
            mlflow.set_tag("user_email", user_email)

            # 2. What were the settings?
            mlflow.log_param("aspect_ratio", aspect_ratio)
            mlflow.log_param("prompt_length", len(prompt))

            # 3. Performance Metrics
            mlflow.log_metric("processing_time_ms", processing_time_ms)
            mlflow.log_metric("input_tokens", input_tokens)
            mlflow.log_metric("output_tokens", output_tokens)

            # 4. Financial Metrics
            estimated_cost = self._calculate_cost(input_tokens, output_tokens, model_type="image")
            mlflow.log_metric("estimated_cost_usd", estimated_cost)

            # 5. The Artifact (The actual prompt text)
            mlflow.log_text(prompt, "prompt.txt")
    except Exception as e:
        logger.error(f"Failed to log image generation: {e}")

The Result: Real-time Visibility

Once runs start flowing in, they appear instantly in the Databricks MLflow UI. What you get is a live, sortable table of every single image generation request - who triggered it, how long it took, how many tokens were consumed, and what it cost. No dashboards to build, no queries to write. It's all there out of the box.

The screenshot below shows what this looks like in practice. Each row is one user request, and every column is a metric or tag we logged programmatically.

[Screenshot: Managed MLflow experiment view in the Databricks UI]

Frequently Asked Questions

Does this approach work with models other than Gemini?

Yes. The TrackingService is model-agnostic - it simply logs whatever token counts and metadata your application passes to it. As long as your model's API returns input and output token counts, the same pattern applies whether you're using Gemini, OpenAI, Anthropic, or any other provider.
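
For illustration, the call site stays the same regardless of provider - the values below are made up, with the token counts standing in for whatever usage metadata your model's response returns:

tracking_service = TrackingService()

# Token counts come from the provider's response metadata (field names vary by SDK)
tracking_service.log_image_generation(
    user_email="dev@example.com",
    prompt="A watercolor skyline at dusk",
    aspect_ratio="16:9",
    processing_time_ms=2140.0,
    input_tokens=1290,
    output_tokens=5160,
)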

What happens if the MLflow logging call fails - does it affect the user?

No. All logging calls are wrapped in try/except blocks, so any failure in the tracking layer is caught and logged as an error without bubbling up to the user-facing request. The core application continues to function normally even if observability is temporarily unavailable.

How do we handle MLflow logging in a high-traffic environment?

For high-throughput scenarios, we recommend running the logging calls asynchronously - offloading them to a background task so they don't block the main request thread. FastAPI's BackgroundTasks is a natural fit for this and requires minimal changes to the existing implementation.
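
A minimal sketch of that pattern is shown below, assuming a FastAPI app that already holds a TrackingService instance; generate_with_model and the shape of its result are hypothetical stand-ins for your own generation logic:

from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()
tracking_service = TrackingService()

class ImageRequest(BaseModel):
    user_email: str
    prompt: str
    aspect_ratio: str = "1:1"

@app.post("/generate-image")
async def generate_image(request: ImageRequest, background_tasks: BackgroundTasks):
    result = await generate_with_model(request)  # placeholder for your existing generation call

    # Scheduled after the response is sent, so tracking adds no user-facing latency
    background_tasks.add_task(
        tracking_service.log_image_generation,
        user_email=request.user_email,
        prompt=request.prompt,
        aspect_ratio=request.aspect_ratio,
        processing_time_ms=result.processing_time_ms,
        input_tokens=result.input_tokens,
        output_tokens=result.output_tokens,
    )
    return result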

How do we keep token pricing up to date?

Pricing rates are stored in environment-level configuration rather than hardcoded in the application. This means when a provider updates their pricing, you only need to update a config value - no code changes or redeployments required.
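
A minimal sketch of that idea, assuming plain environment variables (a settings library would work just as well - the point is that rates never live in application code):

import os

class Config:
    # USD per one million tokens, overridable per environment without a code change
    TEXT_INPUT_TOKEN_COST_PER_MILLION = float(os.getenv("TEXT_INPUT_TOKEN_COST_PER_MILLION", "0.0"))
    TEXT_OUTPUT_TOKEN_COST_PER_MILLION = float(os.getenv("TEXT_OUTPUT_TOKEN_COST_PER_MILLION", "0.0"))
    IMAGE_INPUT_TOKEN_COST_PER_MILLION = float(os.getenv("IMAGE_INPUT_TOKEN_COST_PER_MILLION", "0.0"))
    IMAGE_OUTPUT_TOKEN_COST_PER_MILLION = float(os.getenv("IMAGE_OUTPUT_TOKEN_COST_PER_MILLION", "0.0"))

config = Config()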

Conclusion

Integrating Databricks Managed MLflow into our backend turned our Generative AI application from a black box into a system we can actually reason about. Every request is now a data point, and that changes how the whole team operates - from engineering to product to finance.

Questions that used to require digging through logs or guesswork now have immediate, concrete answers - the query sketch after this list shows both:

  • "Which user incurred the highest cost yesterday?" - filter by the user_email tag and sum estimated_cost_usd
  • "Are long prompts causing higher latency?" - correlate prompt_length with processing_time_ms

The pattern we've described here is intentionally lightweight - a single service class, a few MLflow calls, and no extra infrastructure. But the visibility it unlocks is disproportionately valuable, especially as usage scales and cost management becomes a real operational concern.

If you're building something similar or exploring how to bring observability into your own GenAI stack, we'd love to hear about it. Feel free to reach out to our Cloud & GenAI Solutions team here.

SonarCloud “Supported Node.js Version Required” Error in GitHub Actions CI [Issue Resolved]

Introduction

CI/CD pipelines are designed to provide confidence. When a pipeline fails, the assumption is that something is genuinely wrong - a test broke, a dependency failed, or a configuration is missing.

Occasionally, however, pipelines fail in ways that don’t immediately make sense.

In our case, SonarCloud began failing in a GitHub Actions CI pipeline even though Node.js was installed, all tests were passing, and the pipeline configuration appeared correct. At first glance, the failure looked like a tooling issue. In reality, it turned out to be a subtle runtime-ordering problem - the kind that’s easy to overlook and difficult to diagnose.

This article walks through what happened, why the issue was misleading, and how we fixed it cleanly without introducing hacks or brittle workarounds.

What is SonarCloud and Why We Use It

SonarCloud is a static code analysis service that continuously inspects code for bugs, security vulnerabilities, and maintainability issues. In our organisation, SonarCloud acts as a quality gate in the CI pipeline, helping ensure that new changes meet agreed-upon code quality standards before they are merged.

For JavaScript and TypeScript projects, SonarCloud performs deep source-code analysis using a JavaScript engine that depends on Node.js. This means that, even though SonarCloud is not building or running the application itself, it still requires a compatible Node.js runtime to analyse the code correctly.

In our CI setup, SonarCloud runs automatically as part of GitHub Actions on pull requests and feature branches. Any failure at this stage blocks the pipeline, making SonarCloud a critical part of our development workflow.

This dependency on Node.js - and when that runtime is active - is the key to understanding the issue we encountered.

CI/CD Setup Context

The issue occurred in a GitHub Actions–based CI pipeline used for a JavaScript/TypeScript application. The pipeline was designed to be straightforward and intentionally modular, with clear separation between development, testing, and quality checks.

At a high level, the CI workflow included:

  • Code checkout
  • Dependency installation using pnpm
  • Linting
  • Multiple test suites
  • SonarCloud analysis as a quality gate

The project used Node.js 18 as the primary runtime for application development and testing. All test commands were explicitly run using this version, and they completed successfully on every execution. From an application perspective, the pipeline was healthy.

SonarCloud was configured to run after tests, analysing the same source code that had just passed validation. The expectation was simple: if the code compiled and tests passed under Node.js 18, static analysis should also succeed.

Importantly, this pipeline existed in the application repository itself. Infrastructure and deployment were handled in a separate repository, where this application was included as a Git submodule. SonarCloud analysis was intentionally scoped only to the application repository to keep CI responsibilities clearly separated.

From a configuration standpoint, everything appeared correct:

  • Node.js was installed
  • The correct version was specified
  • SonarCloud was integrated using the official GitHub Action

Yet despite this, the SonarCloud step failed consistently. That contradiction - a clean pipeline up until the quality gate - is what made this issue both confusing and time-consuming to debug.

The Problem and Error Symptoms

The failure occurred only during the SonarCloud analysis step. Every stage before it - dependency installation, linting, and all test suites - completed successfully.

SonarCloud failed with errors related to its JavaScript analysis engine, including messages suggesting:

  • Unsupported JavaScript features
  • Failure to start the internal analysis process
  • Requirement for a supported Node.js version

This was confusing because Node.js was clearly installed and working. The pipeline explicitly set up Node.js, node -v returned a valid version, and the same runtime had just executed all tests without issue.

From a debugging perspective, this created a contradiction:

  • If Node.js were missing or broken, tests should have failed.
  • If the Node version were incompatible, the pipeline should have failed earlier.

Instead, the failure surfaced only at the SonarCloud stage, with error messages pointing to Node.js but not clearly explaining the underlying problem. This ambiguity made it difficult to immediately determine whether the issue was related to SonarCloud, GitHub Actions, or the CI configuration.

Initial Assumptions and Why They Were Wrong

Given the error messages, the initial assumption was that the Node.js setup was incorrect. This was a reasonable conclusion - SonarCloud explicitly reported problems related to the Node runtime.

The first checks focused on confirming that Node.js was installed correctly:

  • Verifying the Node version in the pipeline
  • Ensuring Node was available in the PATH
  • Confirming that tests were running with the expected runtime

All of these checks passed.

The next assumption was that SonarCloud itself might be misconfigured. Various common fixes were considered, including reinstalling Node.js, forcing a specific executable path, or adding additional environment variables to guide the scanner.

While these approaches seemed promising, they were ultimately addressing symptoms rather than the root cause. Node.js was not missing, and SonarCloud was not misconfigured in an obvious way.

The real problem was not whether Node.js was installed, but which version of Node.js was active at the exact moment SonarCloud started its analysis. This distinction was easy to overlook and was not immediately obvious from the error messages.

Root Cause Summary

The issue was caused by the order in which the CI pipeline steps were executed. Although Node.js 18 was installed and used successfully for running tests, the SonarCloud GitHub Action was triggered before the intended Node.js runtime was fully applied to the runner environment.

As a result, SonarCloud defaulted to a different Node.js version that lacked support for certain modern JavaScript features.

This mismatch led to misleading errors during the analysis phase, even though the application itself was functioning correctly under the expected runtime.

How We Fixed the Issue

To resolve the problem cleanly and ensure consistent runtime behaviour across the pipeline, we made one small adjustment.

Instead of adding overrides, hardcoding paths, or introducing custom scanner settings, we simply ensured that the required Node.js version was active immediately before SonarCloud started.

In practice, this meant keeping Node.js 18 for dependency installation and test execution, and then switching to Node.js 20 just before running the SonarCloud scan.

By doing this, we allowed each stage of the pipeline to use the runtime it expected, without interfering with other steps.

# Run tests with Node 18
- name: Setup Node.js 18
  uses: actions/setup-node@v4
  with:
    node-version: 18

# ... tests run here ...

# Switch runtime right before SonarCloud
- name: Setup Node.js 20
  uses: actions/setup-node@v4
  with:
    node-version: 20

- name: SonarCloud Scan
  uses: SonarSource/sonarcloud-github-action@v2

Once these changes were applied, SonarCloud analysis ran successfully, and the CI pipeline stabilised.

Key Lessons Learned

  • CI tools depend on the active runtime environment, not only on what is installed in the pipeline.
  • The order of execution can significantly change how tools behave.
  • Verifying the runtime in use at each stage is just as important as validating the configuration itself.
  • When failures appear to be code-related, the root cause may actually be environmental.

If you have any questions or need help with CI/CD pipelines, DevOps automation, or cloud solutions, feel free to reach out to our team here.

Big Data Analytics Powered by AI: Platforms, Use Cases, and Enterprise Value

Introduction

Every organization today is surrounded by a vast quantity of data. Logs, transactions, user interactions, and sensor readings are generated continuously. The real challenge is no longer how much data we have, but how quickly and intelligently we can use it.

This is where Big Data Analytics Platforms powered by AI become relevant. By integrating large-scale data processing with Artificial Intelligence and Machine Learning (AI/ML), these platforms help organizations move beyond dashboards and reports to anticipating future events and responding in real time.

Why AI and Big Data Work So Well Together

Big Data platforms are excellent at handling scale, but scale alone doesn't create value. AI adds the intelligence layer: learning from patterns, adapting to change, and making predictions that humans simply can't compute manually.

The Platforms Powering AI-Driven Analytics

Apache Spark: Velocity at Volume

Apache Spark has emerged as a fundamental element of contemporary analytics due to its ability to swiftly and effectively handle large datasets. Its capability to manage batch processing, real-time streams, and machine learning tasks makes it well-suited for predictive analytics.

Teams use Spark to analyze historical data, train models, and even generate near-real-time predictions - whether that's forecasting demand or identifying unusual behavior in transaction data.
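
As a rough sketch of what that looks like in practice - the table paths, columns, and model choice below are entirely hypothetical - a demand-forecasting pipeline in PySpark can be only a few lines:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor

spark = SparkSession.builder.appName("demand-forecast").getOrCreate()

# Hypothetical table of historical sales with pre-engineered features
history = spark.read.parquet("s3://warehouse/sales_history/")

assembler = VectorAssembler(
    inputCols=["day_of_week", "promo_flag", "prior_7day_sales"],
    outputCol="features",
)
model = GBTRegressor(featuresCol="features", labelCol="units_sold").fit(
    assembler.transform(history)
)

# Score the most recent batch to produce near-real-time demand predictions
latest = assembler.transform(spark.read.parquet("s3://warehouse/sales_latest/"))
model.transform(latest).select("store_id", "prediction").show()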

Databricks: A Hub for Data Team Collaboration

Databricks builds on Spark, but focuses on simplifying the entire analytics and AI lifecycle. It brings data engineers, data scientists, and analysts onto a single collaborative platform.

What distinguishes Databricks is its ability to integrate data processing, machine learning, and deployment in a single platform. Rather than managing various tools, teams can concentrate on experimenting, learning, and deploying models more quickly without worrying about infrastructure complexity.

Hadoop: The Core Remains Important

Hadoop might not be the first tool people consider for AI today, yet it continues to play an important role. Many organizations still depend on Hadoop to store years of accumulated historical data.

That historical data is incredibly valuable for training predictive models. In many real-world architectures, Hadoop acts as the backbone for long-term storage, while newer tools like Spark and Databricks handle AI-driven analytics on top of it.

TensorFlow: Bringing Intelligence to the Data

TensorFlow is where advanced AI truly comes alive. It enables teams to build and train machine learning and deep learning models that can learn complex patterns far beyond what traditional analytics can uncover.

From time-series forecasting to image and text analysis, TensorFlow integrates seamlessly with Big Data platforms and cloud infrastructure, allowing models to scale as data grows.
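
To give a sense of that, here is a minimal sketch of a time-series forecasting model in Keras; the window length, layer sizes, and training data are placeholders rather than a recommendation:

import tensorflow as tf

# Predict the next value in a series from a window of 30 past observations
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 1)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# x_train would be shaped (num_windows, 30, 1) and y_train (num_windows, 1),
# typically prepared upstream with Spark or Databricks:
# model.fit(x_train, y_train, epochs=10, batch_size=256)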

Cloud Platforms: Scaling Without Limits

Cloud platforms have dramatically transformed how AI and Big Data workloads are deployed. Teams can focus on solving problems rather than managing servers and clusters.

  • AWS offers powerful services for large-scale data processing and machine learning, making it easier to go from raw data to production-ready models.
  • Azure shines in enterprise environments, offering strong governance, security, and seamless integration with analytics and AI tools.
  • Google Cloud Platform (GCP) brings an AI-first approach, with highly optimized services for analytics and machine learning at scale.

The cloud makes predictive analytics elastic, cost-effective, and globally accessible.

Frequently Asked Questions

FAQ 1: What distinguishes traditional analytics from AI-based Big Data analytics?

Traditional analytics focuses mainly on reports and dashboards that provide descriptive insights into historical data. AI-driven Big Data analytics goes a step further, employing machine learning models to anticipate future events, identify trends automatically, and enable real-time decision-making.

FAQ 2: What makes Apache Spark so popular for predictive analytics?

Fast and scalable, Apache Spark can handle batch and streaming data. It is perfect for training and implementing predictive models on big datasets because of its in-memory processing and integrated machine learning libraries.

FAQ 3: In what ways can Databricks streamline operations for AI and ML?

Databricks offers a single platform where machine learning, data science, and data engineering teams can collaborate. It reduces infrastructure complexity and speeds up the path from raw data to production-ready models.

FAQ 4: How does cloud computing fit into analytics powered by AI?

Cloud systems provide worldwide availability, managed services, and elastic scaling. This makes predictive analytics more affordable and accessible by enabling businesses to run sizable AI workloads without having to maintain on-premise infrastructure.

FAQ 5: What abilities are necessary to operate on Big Data platforms powered by AI?

Data engineering, distributed systems, SQL, Python, machine learning principles, and familiarity with cloud platforms are essential competencies. An understanding of MLOps and model monitoring is also becoming increasingly important.

Conclusion

Platforms for Big Data analytics powered by AI are transforming how organizations perceive data. Rather than using data only to understand past events, companies can now predict outcomes, mitigate risks, and make more informed decisions in real time. Technologies like Apache Spark, Databricks, Hadoop, and TensorFlow, integrated with cloud services such as AWS, Azure, and GCP, form a robust environment where data and intelligence work together.