What I Learned Migrating a Production SaaS Platform to Azure

The Decision Started in a Bathroom

The platform had been running on a hybrid setup for years. The web tier was in Azure. SQL Server, file servers, FTP, and user VMs were on-prem in a colocation facility, connected back to Azure via ExpressRoute. Latency wasn’t an issue. The arrangement worked.

Then we had an outage.

A power supply failed in the data center. The on-call technician, the person whose job it was to respond to exactly this kind of event, was on a bathroom break. By the time he was back at his desk, our customers had already been down long enough to notice. Long enough to be embarrassing. Long enough to make the case that “it works” wasn’t the same as “we’re protected.”

The owner had flirted with the idea of going fully cloud before. The bathroom break is what made it a decision.

This post is what we learned over the next twelve months executing that decision: a production SaaS platform migration with roughly four terabytes of SQL data, completed in a single scheduled maintenance window with no surprises during cutover.

The Starting Point Mattered

A common mistake in migration write-ups is treating “lift and shift” as a single decision. It almost never is. What you’re really deciding, for each piece of your environment, is where it should land.

Here’s where we started:

  • Web tier: already in Azure App Service
  • SQL Server: on-prem, ~4TB, heavily tuned, and over-provisioned (the owner had a habit of throwing hardware at performance problems)
  • File servers and FTP: on-prem
  • User VMs: on-prem
  • Connectivity: ExpressRoute back to Azure

Because the web tier was already in Azure, the cutover for the application itself was largely a configuration change. We were repointing connection strings and updating DNS. The hard problem was the data tier.
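
To make the “configuration change” concrete, here is a minimal Python sketch of repointing a connection string. It assumes an ADO.NET-style string of semicolon-separated key=value pairs. The function name and both host names are hypothetical placeholders, and the parser is deliberately naive: it does not handle quoted values that contain semicolons.

```python
def repoint_connection_string(conn_str: str, new_server: str) -> str:
    """Return conn_str with its server value replaced by new_server."""
    # Common aliases for the server key in ADO.NET-style strings.
    server_keys = {"server", "data source", "address", "addr", "network address"}
    parts = []
    replaced = False
    for part in conn_str.split(";"):
        if not part.strip():
            continue  # skip empty segments such as a trailing ";"
        key, sep, _value = part.partition("=")
        if sep and key.strip().lower() in server_keys:
            parts.append(f"{key.strip()}={new_server}")
            replaced = True
        else:
            parts.append(part.strip())
    if not replaced:
        raise ValueError("no server key found in connection string")
    return ";".join(parts)


# Hypothetical host names, for illustration only.
old = "Server=sql-onprem.example.internal;Database=AppDb;Integrated Security=True"
new = repoint_connection_string(old, "sql-azure.example.internal")
```

Raising when no server key is found is a deliberate choice: a silent no-op during a cutover is worse than a loud failure in a rehearsal.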

If I’d been moving everything from a fully on-prem environment, I’d be writing a different post. Knowing what you have on day one shapes everything that comes next.

Why We Moved Regions at the Same Time

We weren’t just moving to Azure. We were moving regions inside Azure.

The original Azure footprint was in West US, which has no paired region. Region pairs matter for two reasons. Certain Azure services replicate automatically across paired regions, and Microsoft prioritizes pairs for sequenced rollouts and recovery during major incidents. Running production in a region without a pair is a quiet single point of failure you don’t notice until you need DR.

West US 3 was paired with East US, which solved the resiliency problem. It was also closer to Latin America, which mattered because the customer base we were focused on at the time was concentrated in Mexico.

The judgment call was whether to do the region move as a separate phase, which would be safer, or fold it into the migration. We folded it in. The reasoning was that every component we migrated would need to be configured, tested, and validated regardless. Doing that work twice, once for the migration and once for the region move, was twice the change risk. Better to land in the right place the first time.

The Twelve-Month Timeline Was Mostly Preparation

The migration itself was a single maintenance window. The twelve months leading up to it were spent making sure that window was uneventful.

The biggest single workstream was SQL performance tuning. The on-prem SQL servers were over-provisioned, with too many cores and too much memory, because the company’s owner believed in spending money to make performance problems disappear. Replicating that hardware footprint in Azure was financially possible but absurd. It would have meant paying every month for capacity we didn’t actually need.

Our original target was Azure SQL Managed Instance. After enough testing, we accepted that we couldn’t get the performance we needed at a cost we could justify. We pivoted to SQL VMs in Azure, still managed and supported by us, but with the architectural flexibility we needed.

But we still had the over-provisioning problem. The fix was to spend several months tuning the application’s SQL code before the migration. We stood up SQL VMs in Azure, ran comparative performance tests against the on-prem servers, and identified the queries and procedures that were eating capacity. Critically, every code change we made was deployed to the on-prem environment first.
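
The comparison step can be sketched in a few lines of Python. This is an illustration, not a record of the tooling involved. It assumes you have already exported average durations per query from both environments into dictionaries. The function name, the query names, and the 1.2x threshold are all hypothetical.

```python
def find_regressions(onprem_ms, azure_ms, threshold=1.2):
    """Return (query, ratio) pairs where Azure is slower than on-prem
    by more than `threshold` times, worst first."""
    regressions = []
    for name, baseline in onprem_ms.items():
        candidate = azure_ms.get(name)
        if candidate is None or baseline <= 0:
            continue  # no comparable measurement for this query
        ratio = candidate / baseline
        if ratio > threshold:
            regressions.append((name, ratio))
    return sorted(regressions, key=lambda r: r[1], reverse=True)


# Made-up numbers, purely to show the shape of the input and output.
onprem = {"GetInvoices": 100.0, "RebuildReport": 50.0, "LookupUser": 10.0}
azure = {"GetInvoices": 110.0, "RebuildReport": 100.0, "LookupUser": 40.0}
worst_first = find_regressions(onprem, azure)
```

The output is a ranked list of tuning candidates: the queries that suffer most on smaller hardware are the ones worth fixing first.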

This is one of the most important decisions we made: do not stack code changes on top of an infrastructure migration. Every change you make during a migration is a change you’ll have to debug if something goes wrong. By the time we cut over, the application running in Azure was the same application that had been running on-prem for weeks. The only thing that changed during the maintenance window was where it was running.

The SQL Cutover

Four terabytes of SQL data with near-zero downtime sounds harder than it was. The mechanism we used was Always-On Availability Groups spanning on-prem and Azure.

We built a Windows Server Failover Cluster that included both the existing on-prem SQL Server and the new SQL VM in Azure. We added the Azure node as a secondary replica. Always-On synchronized the data over ExpressRoute. Once the secondary was caught up and stable, we failed over. The Azure node became primary, and we removed the on-prem node from the cluster.
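
“Caught up and stable” can be turned into an explicit go/no-go gate. The Python sketch below is one way to express it. It assumes the secondary is running in synchronous-commit mode ahead of a planned failover, and that per-database state has already been read from SQL Server (for example from the sys.dm_hadr_database_replica_states view). The class, the function, and the database names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatabaseReplicaState:
    database: str
    synchronization_state: str  # e.g. "SYNCHRONIZED" or "SYNCHRONIZING"
    log_send_queue_kb: int      # log not yet sent to the secondary
    redo_queue_kb: int          # log received but not yet redone


def ready_for_failover(states):
    """Return (ok, blockers) for a planned failover to the secondary."""
    blockers = []
    if not states:
        blockers.append("no replica states supplied")
    for s in states:
        if s.synchronization_state != "SYNCHRONIZED":
            blockers.append(f"{s.database}: state is {s.synchronization_state}")
        if s.log_send_queue_kb > 0:
            blockers.append(f"{s.database}: {s.log_send_queue_kb} KB of log not yet sent")
        if s.redo_queue_kb > 0:
            blockers.append(f"{s.database}: {s.redo_queue_kb} KB of log not yet redone")
    return (not blockers, blockers)


# Hypothetical snapshot of the Azure secondary before the window.
snapshot = [
    DatabaseReplicaState("AppDb", "SYNCHRONIZED", 0, 0),
    DatabaseReplicaState("ReportsDb", "SYNCHRONIZING", 2048, 0),
]
ok, blockers = ready_for_failover(snapshot)
```

Returning the list of blockers rather than a bare boolean makes the check usable as a checklist item: the person running the window can see exactly which database is holding things up.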

Customers experienced this as a scheduled maintenance window. The actual failover took minutes. The application repointed at the new primary and kept going.

This approach had a few advantages worth naming:

  • No bulk data copy at cutover time. The data was already in Azure when the maintenance window started. We weren’t restoring backups or rushing to copy 4TB across the wire.
  • Reversibility. Until we removed the on-prem node, we could fail back if something went wrong. That option mattered for the team’s confidence going in.
  • Familiar tooling. Always-On is a feature most SQL DBAs already understand. We weren’t introducing migration-specific tools we’d never use again.

The other components were straightforward. The web tier, already in Azure, needed configuration updates to point at the new SQL VM. File servers and FTP transitioned with similarly minimal cutover work. The user VMs moved to Azure Virtual Desktop host pools, but we did that as a separate project after the production cutover. There was no reason to bundle internal infrastructure into a customer-facing migration.

How the Security Program Handled It

We were running an active SOC 2 program through this entire migration. That program could have been a headache. Instead, it was an asset.

Two things made it work.

We treated the migration as a normal project under existing policies. Security reviews. Change management. Documentation. Vendor evaluations. The same procedures that governed every other change to the platform governed this one. A migration isn’t an exception to your security program. It’s exactly the kind of event your security program exists to handle.

We adjusted the SOC 2 audit period. Our audit had been on a twelve-month cycle. With a major infrastructure change mid-cycle, that cycle would have produced an audit covering two materially different environments, half of it on hybrid infrastructure and half of it on Azure. Evidence collection for the auditor would have been a nightmare, and the auditor’s job would have been harder than it needed to be. We shortened the period to six months for the cycle that contained the migration. The audit covering the post-migration environment started cleanly with the new architecture in place.

The migration also gave us an opportunity to revisit security procedures that had been written for a hybrid world. We reviewed each one against a fully-cloud reality and adjusted where it made sense. Some procedures got simpler. A few got tighter. None of it was dramatic. The program was already mature enough that the migration didn’t force a rewrite.

What I’d Change

Two architecture decisions I’d make differently if I were doing this today.

Static Web Apps for the customer-facing portal. At the time, Static Web Apps looked like the natural choice. In practice, the five-custom-domain limit per Static Web App is a real operational constraint when you’re onboarding customers, each of whom expects their own subdomain. We end up provisioning additional resources and updating DNS for each new customer. A static site hosted on a Storage Account with Front Door in front of it would have been simpler and would have scaled cleanly with customer growth.
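
The arithmetic behind that constraint is simple, and a short Python sketch shows how it compounds. The five-domain limit comes from the paragraph above. The function name and the customer counts are hypothetical, and the sketch assumes one custom subdomain per customer.

```python
def static_web_apps_needed(customer_domains: int, domains_per_app: int = 5) -> int:
    """Return how many Static Web App resources are needed to host
    one custom domain per customer, given a per-app domain limit."""
    if customer_domains < 0 or domains_per_app <= 0:
        raise ValueError("counts must be non-negative and the limit positive")
    # Integer ceiling division avoids floating-point rounding.
    return (customer_domains + domains_per_app - 1) // domains_per_app


# Made-up customer counts, for illustration only.
for customers in (5, 6, 23):
    print(customers, "customers ->", static_web_apps_needed(customers), "apps")
```

Every fifth customer crosses a boundary that triggers a new resource plus its DNS work, so onboarding cost comes in steps rather than staying flat.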

Application Gateway with WAF. Front Door with WAF would make more sense for our workload. It’s globally distributed, which fits a SaaS platform with geographically dispersed customers, and it’s cheaper for our usage pattern. Application Gateway is the right choice for some architectures, but in retrospect, ours wasn’t one of them.

Neither of these was a mistake at the time. Both are choices I’d revisit given how the Azure service landscape has matured.

What It Cost

The migration ended up cheaper than the on-prem hybrid arrangement. Two reasons.

Right-sizing the SQL tier. The performance tuning work let us provision SQL VMs that matched our actual workload, not the inflated footprint we’d been carrying on-prem.

Tooling consolidation. Several third-party products we’d been licensing on-prem became redundant in Azure. The most concrete example was our standalone vulnerability scanner. Defender for Cloud covered the same need natively, with deeper integration into the rest of the environment. We retired the third-party product and reduced both license cost and operational overhead.

I’d warn against going into a migration assuming cost savings. They’re possible, but they come from deliberate decisions, like right-sizing, eliminating duplicate tooling, and replacing third-party products with native services. They don’t come from the act of migrating itself. A lift-and-shift to Azure with no architectural rethinking can easily cost more than what you left behind.

What Got Better After Cutover

Two outcomes worth pointing to.

Uptime improved. Not in a way that shows up as a single dramatic statistic, since we weren’t suffering chronic outages on-prem, but in the cumulative reliability of an environment with managed services, zone redundancy, and no dependency on a person being at his desk when a power supply fails.

Monthly maintenance got smoother. We still patch our VMs manually. We haven’t handed that work over to Azure’s automatic update services. But the absence of physical servers makes a real difference. No more waiting on hardware that takes longer to come back up. No more coordination across physical and virtual layers. The environment is consistent, and the patching workflow is too.

What I’d Tell Someone Starting an Azure Migration Today

If you’re about to lead a production migration to Azure, here’s the short version of what twelve months of execution taught me.

  • Plan for the migration to be uneventful. The headline goal isn’t technical heroics during cutover. It’s for the cutover to be the most boring part of the project. We had checklists for every team member for every step. Nothing surprising happened during the maintenance window because everything surprising had already been worked through during planning.
  • Don’t stack changes on top of the migration. If you need to tune SQL, refactor code, change schemas, or update libraries, do it before the cutover. Push those changes into your existing environment first. The application that lands in Azure should be the same application that’s been running in production for weeks.
  • Pick destinations per component, not for the whole environment. Managed Instance is the right destination for some workloads and the wrong one for others. The same goes for App Service vs. AKS, Static Web Apps vs. Storage Accounts, Application Gateway vs. Front Door. Generic patterns are starting points, not answers.
  • Use Always-On for the data cutover if you can. Spanning a cluster across on-prem and Azure gives you a continuously-replicated, reversible cutover path with tools your DBAs already know.
  • Treat the migration as a normal project under your security program. Don’t suspend your controls for the migration. Use them. If your audit cycle is going to span two environments, talk to your auditor about adjusting the period.
  • Don’t assume cloud is cheaper. It can be. It often is. But only if you take the migration as an opportunity to right-size, consolidate, and replace third-party tools with native services. A lift-and-shift with no architectural rethinking is a great way to make Azure look expensive.
  • Keep the team small if you can. Ours was four people: a project lead and architect, a database architect, a DBA, and an IT admin. A small team that owns the project end-to-end will outperform a larger team with split accountability nine times out of ten.

The bathroom-break outage is the kind of thing nobody puts in a quarterly review. But it’s the kind of moment that turns “we should think about this” into “we’re doing it.” Most migrations I’ve seen start with something similar. Not a strategic awakening, but a specific, embarrassing event that made the cost of inaction higher than the cost of action. If you’re sitting on a hybrid setup that mostly works and waiting for a clear signal, this is your reminder that the signal sometimes comes in the form of an empty chair at the data center.