May 11, 2026

MongoDB Shell (mongosh)

MongoDB Shell Overview

mongosh (MongoDB Shell) is an interactive JavaScript-based shell used to interact with MongoDB databases. It replaces the legacy mongo shell and provides a modern, developer-friendly experience with better usability and scripting capabilities. 

Whether you're debugging data issues, running quick queries, or executing scripts, mongosh is your most powerful direct interface to MongoDB.

When to Use mongosh vs MongoDB Compass

Both mongosh and MongoDB Compass are powerful tools for interacting with MongoDB, but they serve different purposes depending on the task.

MongoDB Compass — Best for Visualization & Exploration

MongoDB Compass is ideal when you want a visual interface to explore your database without writing commands manually.

Use Compass when you need to:

  • Browse collections visually
  • Inspect document structures
  • Quickly test simple queries
  • View indexes and schema information
  • Analyze aggregation pipelines visually
  • Work comfortably with smaller datasets

Compass is especially useful for beginners or during early-stage debugging where seeing the data structure helps more than scripting.

However, Compass has limitations when operations become more complex or repetitive.

mongosh — Best for Power Operations & Automation

mongosh gives developers direct control over MongoDB using JavaScript-based commands and scripts.

Use mongosh when you need to:

  • Perform bulk updates or deletions
  • Run loops and conditional logic
  • Execute migration or cleanup scripts
  • Handle duplicate data removal
  • Automate repetitive database tasks
  • Debug production-level data issues
  • Run advanced aggregation workflows
  • Run commands faster than clicking through a GUI

Unlike Compass, mongosh allows scripting and automation, making it extremely valuable for backend developers and DevOps workflows.

For example:

  • Removing thousands of duplicate records
  • Updating fields across multiple collections
  • Writing one-time migration scripts
  • Performing production hotfixes

These tasks are significantly easier and more efficient in mongosh.
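To make that concrete, here is a minimal sketch of a cleanup script. Collection names, field names, and the connection string are all illustrative; save it as cleanup.js and run it with mongosh "mongodb://localhost:27017/myDatabase" cleanup.js:

// cleanup.js - a bulk update with per-document conditional logic,
// something Compass cannot do
let updated = 0;
db.users.find({ status: { $exists: false } }).forEach(user => {
  // Decide the new value per document
  const status = user.age >= 18 ? "active" : "pending-review";
  db.users.updateOne({ _id: user._id }, { $set: { status: status } });
  updated++;
});
print(`Backfilled status on ${updated} documents`);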

Which One Should You Choose?

In practice, developers often use both tools together:

  • Use Compass for exploration and visualization
  • Use mongosh for execution, automation, and production fixes

Think of Compass as the visual dashboard, while mongosh is the developer power tool for serious database operations.

Connecting to mongosh

Inside MongoDB Compass

  1. Open MongoDB Compass
  2. Connect to your database
  3. Click the >_ MONGOSH tab
  4. Start writing commands

Basic Commands

Switch Database

  use myDatabase

Show Collections

  show collections

Find Data

  db.users.find()

CRUD Operations

Insert

  db.users.insertOne({ name: "John", age: 30 })

Read

  db.users.find({ age: { $gt: 25 } })

Update

  db.users.updateOne(
    { name: "John" },
    { $set: { age: 31 } }
  )

Delete

  db.users.deleteMany({ age: { $lt: 20 } })

Running Aggregation Pipelines

Aggregation pipelines are used to process and transform data in MongoDB.

db.orders.aggregate([
  { $match: { status: "completed" } },
  {
    $group: {
      _id: "$customerId",
      total: { $sum: "$amount" }
    }
  }
])

Real-World Use Case: Removing Duplicates

One of the most common production issues is duplicate data.

Step 1: Identify Duplicates

db.records.aggregate([
  {
    $group: {
      _id: "$id",
      count: { $sum: 1 },
      docs: { $push: "$_id" }
    }
  },
  { $match: { count: { $gt: 1 } } }
])

Step 2: Remove Duplicates (Keep One)

db.records.aggregate([
  {
    $group: {
      _id: "$id",
      ids: { $push: "$_id" },
      count: { $sum: 1 }
    }
  },
  { $match: { count: { $gt: 1 } } }
]).forEach(doc => {
  doc.ids.shift(); // keep first document
  db.records.deleteMany({
    _id: { $in: doc.ids }
  });
});

Creating Indexes

Indexes improve performance and enforce constraints.

Create Unique Index

db.records.createIndex({ id: 1 }, { unique: true })
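
Note that creating a unique index fails if the collection still contains duplicates, which is why the deduplication step above comes first. Once the index exists, MongoDB enforces it on every write (the id value here is illustrative):

db.records.insertOne({ id: 101 })  // succeeds
db.records.insertOne({ id: 101 })  // fails with an E11000 duplicate key error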

View Indexes

db.records.getIndexes()

Best Practices

Always take a backup (for example with mongodump, shown below) before:

  • Bulk delete operations
  • Data migration
  • Index changes
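
A minimal backup command, assuming the mongodump tool is installed and your database is named myDatabase (the output path is illustrative):

mongodump --db=myDatabase --out=/backups/myDatabase-pre-cleanup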

Test queries before execution:

// First check what will be affected
db.users.find({ age: { $lt: 20 } })

Use limits when unsure:

db.users.find().limit(5)

Common Mistakes

  • ❌ Running delete without a filter
    db.users.deleteMany({}) // deletes EVERYTHING
    
  • ❌ Forgetting to create a unique index before merging data
  • ❌ Using the Aggregation tab for scripting (doesn't support loops)


If you have any questions you can reach out to our SharePoint Consulting team here.

May 7, 2026

Building Site-Aware Enterprise AI Agents on Microsoft 365 Using Claude Agent SDK

Introduction

An enterprise running seven department-specific SharePoint intranet sites needed AI that could actually operate within their systems - not just answer questions about them. Here is what a single task looked like before we got involved - and after.

| Before - Sales proposal workflow | After - same task, one sentence |
|---|---|
| Open Pipeline Tracker on SharePoint - find the deal | Type: "Draft a proposal for Meridian Corp's cloud migration deal and send it to their CTO." |
| Switch to Client Contacts - find the CTO's email | Agent queries Pipeline Tracker - pulls deal value, stage, scope notes |
| Open Word - hunt for the proposal template on the shared drive | Agent looks up CTO in Client Contacts - name, email, title |
| Manually fill in deal value, scope, and terms | Branded proposal generated from python-docx template and uploaded to Sales Collateral |
| Save, switch back to SharePoint, upload to Sales Collateral library | Email compose panel opens - pre-filled with CTO's address, subject, and body |
| Open Outlook - type the CTO's email, write the body, attach, send | Review, tweak one line, hit Send |
| 40 minutes · 6 applications · 1 task | One sentence · Five systems · Done |

We assessed the organisation's requirements against Copilot Cowork and Claude Cowork. Both are genuinely capable products - but neither could query custom SharePoint list schemas, connect to a SQL employee database, generate documents from branded templates, or switch to a different specialist persona depending on which department site the user was on. They needed something those products are structurally not built to do.

Same panel. Same backend. Same LLM. Completely different specialist per site, loaded at runtime from a JSON manifest. This is what we built. Here is how:

The Architecture in One Sentence

A single SPFx floating panel on every SharePoint page sends messages to a FastAPI backend, which enters a Claude agentic loop with a filtered set of MCP tools - filtered by which site the user is on, what they are allowed to do, and which department's data model applies. Each site loads its own agent persona, slash commands, and skill files - deep knowledge modules that teach the agent how a specific department's data is structured, what business rules apply, and which workflows exist.

[System Architecture diagram: SPFx panel → FastAPI backend → Claude agentic loop → MCP server → enterprise data sources]

That sentence hides a lot of machinery. Let me unpack it through the folder structure, because the folder structure is the architecture.

The Folder Structure That Makes It Work

Backend/
├── api/
│   ├── main.py                    ← FastAPI app, startup, CORS
│   └── routers/
│       └── agent_chat.py          ← The agentic loop: SSE streaming, tool orchestration
│
├── core/
│   ├── mcp/
│   │   ├── server.py              ← In-process MCP server (58 tool routes)
│   │   ├── builder.py             ← Tool definitions with input schemas
│   │   └── tools/                 ← Tool implementations by connector type
│   ├── auth/                      ← OBO token exchange + certificate auth
│   ├── session/                   ← Session resolver (who, where, what permissions)
│   ├── registry/                  ← Plugin auto-discovery at startup
│   ├── audit/                     ← Write-operation audit logger
│   ├── alerts/handlers/           ← Proactive alert checks (5 built-in)
│   └── notifications/             ← Custom notification rules engine
│
├── connectors/                    ← 9 connector modules
│   ├── sharepoint/                ← List CRUD, library browse, search, upload
│   ├── msgraph/                   ← Users, calendar, mail, Teams channels
│   ├── ems/                       ← Direct SQL: employee lookup, org chart, leave
│   ├── email/                     ← Draft + send via Graph
│   ├── docgen/                    ← python-docx reports, offer letters, proposals
│   ├── azdevops/                  ← Work items, sprints
│   ├── analytics/                 ← NL reports, anomaly detection
│   ├── knowledge/                 ← Federated search, expert finder
│   └── automation/                ← Notification rule CRUD
│
├── plugins/                       ← THIS IS THE KEY DIRECTORY
│   ├── SALES/
│   │   ├── .claude-plugin/
│   │   │   └── plugin.json        ← Manifest: persona, tools, commands, skills, schemas
│   │   ├── agents/                ← Persona markdown files
│   │   ├── commands/              ← Slash command definitions
│   │   └── skills/                ← Domain knowledge modules
│   ├── FINANCE/                   ← Same structure, different specialist
│   ├── PEOPLE/                    ← Same structure, different specialist
│   ├── DELIVERY/                  ← Same structure, different specialist
│   ├── TECHNOLOGY/                ← Same structure, different specialist
│   ├── OPERATIONS/                ← Same structure, different specialist
│   └── CORE/                      ← Same structure, different specialist
│
└── db/migrations/                 ← 6 SQL migrations (sessions, audit, alerts)

Every architectural decision is visible in this tree. Let me walk through the three that matter most.

Decision 1: The Plugin Manifest

When a user opens the Sales site and the panel initialises, the backend resolves their session: who are they (from the OBO token), which site are they on (from the site_url passed by the frontend), and what can they do (from the plugin manifest). The manifest is a single JSON file:

{
  "name": "sales",
  "display_name": "Sales Assistant",
  "entry_agent": "agents/sales-assistant.md",
  "connectors": ["sharepoint", "msgraph", "email", "docgen", "ems",
                 "knowledge", "analytics", "automation"],
  "commands": ["commands/pipeline-report.md", "commands/add-lead.md", ...],
  "skills": ["skills/pipeline-management/SKILL.md",
             "skills/client-engagement/SKILL.md", ...],
  "sharepoint_lists": {
    "lists": {
      "Pipeline Tracker": "columns: OpportunityName, DealValue, PipelineStage, ..."
    }
  }
}

This manifest does five things at session start:

It loads a persona.

The entry_agent points to a markdown file that defines the agent's personality, role, behaviour rules, and domain expertise. The Sales agent is "friendly, professional, and results-driven." The Finance agent is precise and compliance-aware. The People agent is warm and policy-oriented. Same LLM, different specialist.

It filters the tool palette.

The connectors array controls which of the 58 MCP tools appear in Claude's tool list for this session. Sales gets SharePoint, Graph, Email, DocGen, EMS, Knowledge, Analytics, and Automation - but not Azure DevOps. Delivery gets Azure DevOps too. Technology gets everything. The LLM only sees tools it is allowed to use.

It loads skill files.

This is where the deep domain knowledge lives. Each skill is a markdown file that describes how a specific aspect of the department works. The Sales plugin has skills for pipeline management, client engagement, and competitive intelligence. A skill might describe how the Pipeline Tracker list is structured, what each column means, which OData filters produce useful results, what "stalled deal" means in this organisation's context, and what steps the agent should follow for a pipeline review. Skills are loaded into the system prompt based on the site and the user's query - they are the agent's domain training, delivered at runtime through text, not fine-tuning.

It injects list schemas.

The sharepoint_lists block is injected into the system prompt. This is how the agent knows the column is called PipelineStage, not Stage. It knows DealValue is the field for revenue, not Value or Amount. It never guesses. If a column is not in the manifest, the agent cannot reference it.

It registers slash commands.

Each command is a markdown file with trigger phrases, required parameters, the agent to hand off to, and step-by-step instructions. When a user types /pipeline-report, the command's markdown gets injected into the system prompt for that turn.
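
As a flavour of what such a file might contain, here is a hypothetical /pipeline-report definition. The structure and field names below are illustrative, not the actual format; only the list and column names come from the manifest above:

---
name: pipeline-report
triggers: ["/pipeline-report", "pipeline summary"]
agent: agents/sales-assistant.md
parameters:
  - stage   # optional PipelineStage filter
---
Steps:
1. Query the Pipeline Tracker list, filtered by PipelineStage when provided.
2. Group open opportunities by stage and sum DealValue.
3. Flag deals with no recent activity as stalled.
4. Return a summary and offer to draft follow-up emails for stalled deals.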

[Plugin-Per-Site Architecture diagram: 7 sites, auto-discovered, manifest-driven, session-filtered]

The consequence of this design is that adding a new department - an eighth site, a regional hub, a project workspace - requires creating one folder with one JSON manifest and a few markdown files. No code changes. No backend redeployment. The registry discovers it on next restart.

The key insight: The plugin folder is the architecture. Not the LLM, not the framework, not the cloud infrastructure. The entire system's behaviour - which specialist responds, which tools are available, which data is accessible, which commands exist, which domain knowledge the agent draws on - is determined by which plugin.json file gets loaded and which skill files sit alongside it. Everything else is shared plumbing. This is what makes the system maintainable at scale: changing a department's AI behaviour is editing a JSON file and a few markdown documents, not shipping code.

Decision 2: Dual-Auth and the Invisible Session

Authentication is where most custom AI implementations get it wrong. Either they use the user's token for everything (writes are unauditable and tied to individual permissions) or they use an app-only token for everything (losing the identity context of who asked for the action).

We split it:

Reads use the user's delegated OBO token.

When the agent queries a SharePoint list or fetches calendar events, it does so as the user. If the user cannot see a site, neither can the agent. The existing M365 permission model is preserved - no data leakage, no privilege escalation.

Writes use an app-only certificate token.

  • All write operations - list item creation, document upload, email send - execute through a controlled service identity, not the user's token.
  • Every write is routed through an audit logger: who requested it, what changed, tool name, payload, status, and duration.
  • site_url, OBO token, app-only token, and the user's permission set are never parameters the LLM sees - they are injected from the session object before the agentic loop begins.

In the initial version, SharePoint tools accepted site_url as a tool parameter. During early integration testing, Claude hallucinated a URL - it substituted the Sales site URL when a user on the Finance site asked about budget items. The query returned nothing. The agent confidently reported "no budget items found." It was a silent, plausible failure - the worst kind. We refactored site_url out of every tool interface within the day and moved it to session-level injection. The LLM never sees it, never chooses it, never hallucinates it.

If the LLM does not need a value to reason about the task, do not put it in the tool interface. Every parameter visible to the LLM is a surface for hallucination. Session-level injection eliminates that surface.
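
A minimal sketch of that pattern, in JavaScript for brevity (the production backend is Python/FastAPI, and every name below is hypothetical):

// The schema the LLM sees: no site_url, no tokens - nothing it could hallucinate.
const toolSchemas = [{
  name: "sharepoint_get_list_items",
  input_schema: {
    type: "object",
    properties: {
      list_name: { type: "string" },
      odata_filter: { type: "string" },
    },
    required: ["list_name"],
  },
}];

// Stub standing in for the real MCP tool route.
const toolRegistry = {
  sharepoint_get_list_items: async (args) =>
    `GET ${args.site_url}/lists/${args.list_name}?$filter=${args.odata_filter ?? ""}`,
};

// At execution time the backend merges session context into the call.
async function executeTool(session, toolName, modelArgs) {
  const fullArgs = {
    ...modelArgs,
    site_url: session.siteUrl,      // injected from the session, never model-chosen
    access_token: session.oboToken, // delegated token: reads run as the user
  };
  return toolRegistry[toolName](fullArgs);
}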

Decision 3: The Agentic Loop

When a user sends a message, the backend does not call Claude once and return the answer. It enters a loop, sketched in code after the list below:

[Agentic Loop diagram: request lifecycle with tool execution feedback loop]

  • Multi-step tool chains. A single message like "find stalled deals, draft follow-ups, and notify the sales lead" triggers five sequential tool calls - SharePoint OData query, EMS employee lookup, three email draft preparations - each feeding the next decision.
  • Real-time status streaming. Each tool call emits an SSE event to the frontend ("Fetching list data: Pipeline Tracker...", "Looking up employee...") so the user sees the agent working, not a loading spinner.
  • Model is a config variable. Claude Sonnet is the default - fast enough for real-time streaming, capable enough for multi-step orchestration. Swapping models is a single environment variable change; the architecture is not coupled to any provider.
  • We tested Azure OpenAI + Semantic Kernel. The loop runs and tools get called. But in head-to-head testing, Claude produced fewer hallucinated tool parameters, followed complex multi-step prompts more faithfully, and handled branching logic - where a tool result determines whether to call the next tool or skip to a different action - more consistently. That is why it is our default.
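
In code, the shape of that loop looks roughly like this - a JavaScript sketch using Anthropic's SDK (the production loop is Python with SSE streaming; executeTool and the session object carry over as assumptions from the previous sketch, and the model name is illustrative):

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Keep calling the model until it stops asking for tools. Production adds
// SSE status events and the audit logger around each tool execution.
async function agentTurn(session, messages, tools) {
  while (true) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5", // a config variable in production
      max_tokens: 4096,
      tools,                      // the session-filtered palette from the manifest
      messages,
    });

    if (response.stop_reason !== "tool_use") return response; // final answer

    // Execute every tool the model requested, feed results back, loop again.
    messages.push({ role: "assistant", content: response.content });
    const results = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      const output = await executeTool(session, block.name, block.input);
      results.push({
        type: "tool_result",
        tool_use_id: block.id,
        content: JSON.stringify(output),
      });
    }
    messages.push({ role: "user", content: results });
  }
}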

What the Agent Can Actually Do

58 tools across 9 connectors. Here are the ones that matter most to daily operations:

| Connector | What it does | Available on |
|---|---|---|
| SharePoint | Read list items with OData filters, create and update items, browse document libraries, upload files, search across the site. Primary data layer. | All sites |
| EMS | Direct SQL via pyodbc against the HR database: employee lookup, org chart traversal, leave balances, skills search, project assignments, team allocation, capacity planning, budget tracking. No REST wrapper. | All sites |
| Document generation | Branded offer letters, proposals, and reports via python-docx. Budget workbooks via openpyxl. Generated on the backend, uploaded to SharePoint, download card in chat. | Sales, People, Finance, Operations |
| Email | Two-step workflow: agent drafts, human reviews in inline compose panel, then confirms Send. Never auto-sends. | All sites |
| Azure DevOps | Query and create work items, get sprint status. | Delivery, Technology only |
| Proactive alerts | Five APScheduler handlers: stalled deals, budget thresholds, expiring contracts, onboarding gaps, morning brief digest. Push to Teams channels and notification bell before anyone opens a browser. | All sites (handler-specific) |
| MS Graph | Users, calendar, Teams channels, mail integration. | All sites |
| Analytics | Natural language reports, anomaly detection. | All sites |
| Knowledge | Federated search, expert finder. | All sites |

Lessons from the Build

Every implementation surfaces lessons that inform the next one.

SharePoint field name complexity runs deeper than expected.

Display names and internal names diverge in non-obvious ways. "Status" might be stored as BudgetStatus. "Details" might be AnnouncementDetails. The initial deployment surfaced field name mismatches that cost debugging cycles. Our fix - a PowerShell schema export script that generates verified field mappings per site - is now a standard first step in our deployment methodology.

Not every data source needs an API layer.

Our first design had a full REST API wrapper around the SQL employee database. We replaced it with direct pyodbc queries wrapped in the MCP tool contract. Simpler, faster, easier to maintain.

The in-process MCP server will need extraction.

Running the tool server in-process was the right call for initial development - no serialisation overhead, easy debugging. But for production scale, tool execution needs to run independently so long-running operations (document generation, cross-site searches) do not block the API layer. The architecture was designed for this extraction - MCP's HTTP transport makes it a configuration change, not a rewrite - which is planned for the next phase.

Retry logic is now day-one infrastructure.

Claude's API occasionally returns transient errors under load. Without exponential backoff (1s, 2s, 4s), these surface as user-facing failures. Retry logic is now part of our standard agentic infrastructure from the start, based on this experience.
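
The policy itself is small. A minimal sketch, in JavaScript to match the earlier sketches (callClaude and the status codes are stand-ins for the real client call):

// Retry transient failures with exponential backoff: 1s, 2s, 4s, then give up.
async function withRetry(callClaude, maxAttempts = 4) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await callClaude();
    } catch (err) {
      const transient = err.status === 429 || err.status >= 500; // rate limit / server error
      if (!transient || attempt === maxAttempts - 1) throw err;
      const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}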

Production hardening - monitoring, observability, error recovery, and scale testing - deserves its own post. We will publish that next.

A Note on Portability

This case study uses SharePoint as the enterprise platform and Claude as the LLM, but the pattern is platform-agnostic. The same plugin-per-site architecture has been applied with Confluence, custom intranets, and internal portals. The frontend can be any web surface that supports a JavaScript embed - the SPFx panel is one implementation, not a requirement. The connectors, the plugin manifest pattern, the dual-auth model, and the session-scoped tool filtering are all transferable. If your organisation runs on a different stack, the architecture adapts to it.

Conclusion

The plugin-per-site pattern, dual-auth model, session-scoped tool filtering, and in-process MCP server described here form a reusable enterprise architecture - not a one-off implementation. Adding a new department means creating a folder with a JSON manifest and a few markdown skill files. No code changes, no redeployment. The system's behaviour is entirely determined by configuration, not by the codebase.

This architecture was designed and deployed by Binary Republik. It is adaptable to any organisation's departmental structure, data model, and governance requirements - and transferable to any enterprise platform with programmatic APIs.

If you have any questions you can reach out to our AI Consulting team here.

AWS Compute Services Explained: EC2 vs ECS vs EKS

Introduction

Choosing the right AWS compute service isn't just a technical decision - it directly affects your team's velocity, your operational costs, and how quickly your business can respond to change. Pick the wrong one and you'll either pay for complexity you don't need, or find yourself locked into a setup that can't scale with you.

EC2, ECS, and EKS are not competing services. They solve fundamentally different problems at different levels of abstraction. EC2 gives you a raw virtual machine with full infrastructure control. ECS is a managed container platform built natively on AWS - no Kubernetes required. EKS brings the full power of Kubernetes to AWS for teams that need portability and advanced orchestration at scale.

Here is what we cover:

  • What EC2, ECS, and EKS actually are - in plain terms
  • Key differences across abstraction, scaling, cost, and operational overhead
  • What each service costs, with real numbers from AWS's official pricing pages
  • Real-world use cases and business implications for each
  • A decision guide to help you pick the right one

Understanding the Basics

Amazon EC2 - Virtual Machines

EC2 is AWS's virtual machine service. You choose the operating system, CPU, memory, and storage - AWS provisions the server. Think of it as renting a physical server in the cloud. Your team owns everything from that point on: patching, scaling, monitoring, and security. Maximum control, maximum responsibility.

Business implication: EC2 requires engineering time to manage and maintain. That time has a cost. It is best suited for workloads where your team needs deep infrastructure control, or for migrating existing applications that weren't built for containers.

Amazon ECS - Managed Containers

ECS is AWS's native container orchestration service. You define your Docker image, CPU and memory requirements, and scaling rules. AWS handles the rest - scheduling, cluster management, health checks, and integrations with AWS services like Application Load Balancer, IAM, and CloudWatch. No Kubernetes knowledge required.

Business implication: ECS reduces the operational surface area significantly. Smaller teams can run production container workloads without dedicated DevOps headcount. It is AWS's recommended starting point for teams new to containers.

Amazon EKS - Managed Kubernetes

EKS runs Kubernetes on AWS. AWS manages the control plane - the brain of the cluster - while you manage the worker nodes, or offload them to AWS Fargate for a fully serverless setup. Kubernetes is the industry-standard container orchestration platform, and EKS makes it available as a managed service.

Business implication: EKS is the most powerful and flexible option, but it brings the highest operational complexity and cost. It pays off at scale, for teams running complex microservices architectures, or for organisations that want the option to run workloads across multiple cloud providers.

Key Differences at a Glance

| Feature | EC2 | ECS | EKS |
|---|---|---|---|
| Abstraction Level | Low (Infrastructure) | Medium (Platform) | High (Ecosystem) |
| Primary Unit | VM / Server | Container Task | Pod |
| Setup Complexity | High | Low | Very High |
| Scaling Speed | Slow (VM boot) | Fast | Fast |
| OS / Host Access | Full | None | Limited |
| Kubernetes Support | No | No | Yes |
| Multi-cloud Portability | Low | Low | High |
| Operational Overhead | High | Medium | Very High |
| Learning Curve | Low–Medium | Low | High |
| Relative Cost (entry) | Low–Medium | Medium | High |
| Best for | Legacy apps, full control | AWS-native container apps | Complex microservices, multi-cloud |

Cost Breakdown - With Real Numbers

Cost is one of the most common decision factors, and it is also one of the most misunderstood. The sticker price of compute is only part of the picture. Operational overhead - the engineering time spent managing infrastructure - is a real cost that doesn't show up on your AWS bill but does show up on your payroll.

EC2

With EC2, you pay for instance uptime. AWS offers four main pricing models: On-Demand (pay by the second, no commitment), Reserved Instances (commit to 1 or 3 years for up to 72% off On-Demand rates, per official AWS pricing), Savings Plans (flexible commitment-based discounts, AWS's currently recommended approach over Reserved Instances), and Spot Instances (spare capacity at up to 90% off, but AWS can reclaim it with two minutes' notice). Most production workloads that run around the clock should be on Reserved Instances or a Savings Plan - running purely On-Demand for steady-state workloads leaves significant money on the table.

Business implication: EC2 can be the most cost-efficient option for stable, predictable workloads - but only if your team actively manages instance sizing and pricing commitments. Teams that over-provision and forget tend to pay more than they should.

ECS

ECS has no additional management fee. You pay for the underlying compute - either EC2 instances you manage yourself, or AWS Fargate (fully serverless, where AWS manages the infrastructure). With Fargate on ECS, billing is per second based on the vCPU and memory your containers actually use. As a reference, in the US East (N. Virginia) region, Fargate charges approximately $0.04048 per vCPU-hour and $0.004445 per GB-hour for memory, per AWS's official Fargate pricing page. A container running 2 vCPUs and 4 GB of RAM costs roughly $0.099 per hour, or around $72 per month running continuously.
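
Spelled out, using the rates quoted above and approximating a month as 730 hours:

// Fargate cost sketch for a 2 vCPU / 4 GB task in US East (N. Virginia)
const hourly  = 2 * 0.04048 + 4 * 0.004445; // ≈ $0.09874 per hour
const monthly = hourly * 730;               // ≈ $72.08 per month, running continuously
console.log(hourly.toFixed(5), monthly.toFixed(2));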

Business implication: Fargate on ECS is often the most cost-effective choice for bursty, unpredictable, or variable traffic patterns - you pay only for what you use, and there is no idle compute cost. For high-volume, steady-state workloads, EC2-backed ECS can be cheaper.

EKS

EKS has a fixed control plane fee of $0.10 per cluster per hour - approximately $73 per month per cluster - regardless of cluster size or workload, per AWS's official EKS pricing page. This is just the management fee; worker node compute (EC2 or Fargate), storage, load balancers, and data transfer are all billed separately. Teams running dev, staging, and production environments on separate clusters pay this fee three times minimum. Additionally, if a Kubernetes version ages past its 14-month standard support window without being upgraded, the fee jumps to $0.60 per hour - a 6x increase that catches many teams by surprise.

Business implication: EKS is expensive at small scale and cost-efficient at large scale, where advanced resource scheduling and bin-packing justify the overhead. For small to medium workloads, ECS on Fargate will almost always be the cheaper option. The $73/month control plane fee is a fixed entry cost per cluster - for multi-cluster strategies, it adds up quickly.

Real-World Use Cases

Use EC2 when:

  • You are migrating a legacy or monolithic application that was not built for containers
  • Your application requires custom OS-level configuration, kernel parameters, or specific hardware access
  • Your team needs full visibility and control over the underlying server environment
  • You have predictable, steady-state workloads and want to maximise savings through Reserved Instances or Savings Plans

Business implication: EC2 is the right lift-and-shift vehicle. It minimises application changes during migration and gives your team time to modernise at their own pace. The trade-off is that it demands the most ongoing engineering attention.

Use ECS when:

  • You want to run containers on AWS without learning Kubernetes
  • You need fast, reliable deployments with minimal DevOps overhead
  • Your workloads are bursty or variable and Fargate's pay-per-use model suits your traffic patterns
  • Your team is AWS-first and wants tight, native integration with IAM, CloudWatch, and ALB

Business implication: ECS on Fargate is the fastest path to production for container workloads. It requires the least infrastructure expertise, no cluster maintenance, and scales automatically. It is a strong default choice for startups, product teams, and organisations without dedicated platform engineering.

Use EKS when:

  • Your team already uses Kubernetes and has the expertise to operate it
  • You need multi-cloud portability - the ability to run the same workloads on AWS, GCP, or Azure
  • You are running complex microservices that benefit from Kubernetes-native features like Horizontal Pod Autoscaler, custom scheduling, or service meshes
  • You are operating at a scale where advanced resource optimisation (bin-packing, Spot node groups, Karpenter) delivers meaningful cost savings

Business implication: EKS is a long-term platform investment. It requires Kubernetes expertise to operate well - either in-house or through a managed services partner. The payoff is flexibility, portability, and the ability to run sophisticated workloads that outgrow what ECS can offer.

Quick Decision Guide

Two questions cut through most of the noise:

  1. Do you need Kubernetes?
    • Yes - your team uses it already, or you need multi-cloud portability → EKS
    • No → Move to question 2
  2. Do you want containers?
    • Yes → ECS
    • No, or you have a legacy app to migrate → EC2

| Situation | Recommended Service | Why |
|---|---|---|
| Small team, new to containers | ECS (Fargate) | Lowest ops overhead, no cluster to manage |
| Legacy app migration | EC2 | Minimal app changes, full control |
| Startup scaling fast on AWS | ECS (Fargate) | Fast deployments, pay-as-you-go, AWS-native |
| Enterprise microservices | EKS | Advanced orchestration, multi-team platform |
| Existing Kubernetes users | EKS | Familiar tooling, avoid re-platforming cost |
| Multi-cloud strategy | EKS | Kubernetes runs anywhere - avoids AWS lock-in |
| Predictable, high-volume compute | EC2 (Reserved / Savings Plan) | Up to 72% savings over On-Demand with commitment |

Conclusion

EC2, ECS, and EKS are tools for different jobs at different stages of growth. The right choice depends on your team's skills, your application's architecture, your traffic patterns, and your budget - both the AWS bill and the engineering time to manage it.

  • Want simplicity and speed? → ECS on Fargate
  • Have a legacy app to migrate? → EC2
  • Need Kubernetes power and portability? → EKS

ECS on Fargate handles a huge range of production workloads reliably and cost-effectively - and it is far easier to migrate to EKS later than to unwind unnecessary Kubernetes complexity from day one.

If you have any questions, you can reach out to our AWS Cloud Consulting team here.