Cloud migration as our service: From GCP to Azure with SRE Power.
Summary.
ProductDock partnered with MOTORS to migrate core services from Google Cloud Platform (GCP) to Microsoft Azure. By combining cloud engineering and SRE expertise, we modernized infrastructure, improved observability, and empowered development teams through automation and collaboration.
About.
What does the client do?
MOTORS is a prominent player in the UK automotive marketplace, offering a powerful car search platform that connects dealerships with millions of potential buyers. The company leverages data-driven solutions, streamlined user experiences, and trusted partnerships to simplify and accelerate the vehicle purchasing process. With a strong focus on digital innovation and performance marketing, MOTORS supports dealers in maximizing their reach and driving sales in a competitive online landscape.
Can you tell us about the project and how long you’ve been working on it?
MOTORS sought to modernize its infrastructure and consolidate its cloud providers, migrating services from Google Cloud Platform (GCP) to Microsoft Azure, and improve its software release processes. ProductDock provided a team of Site Reliability Engineers (SREs) to facilitate this transformation, starting in October 2021.
Initial situation.
What was the initial state?
MOTORS already hosted the main part of the platform in Azure, but some parts still ran in GCP. Engineers developed the software and were also partially responsible for infrastructure and operations. Some releases were done manually. There was an opportunity to streamline the operational process across all teams and reduce operational overhead.
What was the team structure?
As a team setup, we used a mixed team model. We joined MOTORS as part of their existing team and are integrated into a larger cloud team. Our responsibilities include building new and managing existing infrastructure, defining the guidelines (that is, setting the guardrails and best practices for software deployments and observability on the provided cloud infrastructure), as well as supporting the other teams and onboarding and upskilling them so they can use the new platform.
Problem space.
What problems did you want to solve?
MOTORS faced several challenges, including unwanted costs associated with running services on GCP. The migration from GCP to Azure was complex. On top of that, there was a strong desire to improve the reliability and robustness of the whole platform, to establish self-service for developers so they can deploy their applications with one click, and to reduce manual effort.
Solution.
What did you do? What technologies did you use?
The solutions do not depend solely on the technologies or frameworks used. They also depend heavily, especially for an SRE team that functions as a hub for all other teams, on efficient and effective communication. Since we need to be engaged with all teams, we literally need to talk to all other developers: to stay up to date with our colleagues, we hold dailies. Cloud chapter meetings were established alongside tech guild meetings, which take place once or twice a month for knowledge exchange and trust building. During these meetings everyone can ask questions, share ideas on what to do, and talk freely about problems and how to solve them. At MOTORS, it is very important to ensure a safe atmosphere in which people support each other and everyone is fine with proposals being accepted or rejected, because there is trust that such decisions are based on good reasons.
Of course, especially for SRE teams, not everything can be scheduled upfront: if important things pop up, we hold ad hoc meetings with developers. What must be respected, however, is the prioritization of the topics at the top of the company roadmap.
However, ad hoc meetings are not adequate for day-to-day support. To cover that kind of work we organised Level 1 (L1) support, which we rotate weekly, so that the next day the same person can follow up on a problem without a context switch. For L1 support we use a separate Jira board where developers can create tickets instead of approaching us only via Slack. Of course, if more communication is needed, the L1 engineer also checks the support Slack channel and does PR reviews.
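As a rough illustration of the weekly rotation described above (the roster below is hypothetical, not our actual team), a minimal Python sketch could derive the on-duty engineer from the ISO calendar week, so the same person stays on L1 for the whole week:

```python
from datetime import date

# Hypothetical roster of engineers sharing L1 duty; not our actual team.
L1_ROSTER = ["alice", "bob", "carol", "dave"]

def l1_engineer_for(day: date) -> str:
    """Return who is on L1 support for the week containing `day`.

    The rotation advances once per ISO calendar week, so the same
    engineer handles follow-ups the next day without a context switch.
    """
    year, week, _ = day.isocalendar()
    # Mix the year in so the rotation does not restart every January.
    index = (year * 53 + week) % len(L1_ROSTER)
    return L1_ROSTER[index]

if __name__ == "__main__":
    print(f"L1 support this week: {l1_engineer_for(date.today())}")
```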
Having the communication in place is important so we can fully focus on the technology part. We supported MOTORS in introducing a new landing zone in Azure and then migrating the services from GCP to Azure. We followed an Infrastructure as Code (IaC) approach, using Terraform to describe all the necessary infrastructure, including Azure Kubernetes Service (AKS), networking (APIM, Application Gateway, firewalls, etc.), Service Bus, Event Hub, managed databases and more, all of which we continue to manage.
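To give a concrete flavour of that IaC workflow, here is a minimal Python sketch that drives the standard terraform init/plan/apply cycle per component; the directory names are hypothetical placeholders, not MOTORS' actual repository layout:

```python
import subprocess
from pathlib import Path

# Hypothetical component directories; each holds the Terraform code
# for one part of the landing zone (AKS, networking, Service Bus, ...).
COMPONENTS = ["aks", "networking", "service-bus", "event-hub", "databases"]
ROOT = Path("infrastructure")

def terraform(component_dir: Path, *args: str) -> None:
    """Run a terraform command inside the given component directory."""
    subprocess.run(["terraform", f"-chdir={component_dir}", *args], check=True)

def apply_component(name: str) -> None:
    """Standard init -> plan -> apply cycle for a single component."""
    component_dir = ROOT / name
    terraform(component_dir, "init", "-input=false")
    terraform(component_dir, "plan", "-input=false", "-out=tfplan")
    terraform(component_dir, "apply", "-input=false", "tfplan")

if __name__ == "__main__":
    for component in COMPONENTS:
        apply_component(component)
```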
Our workloads, the applications, were running in GKE (Google Kubernetes Engine), so we opted for AKS. AKS offered a managed control plane with automated upgrades, integrated monitoring, and native support for scaling. More importantly, it let our teams shift left on value: each team became responsible for managing its part of the infrastructure, with clear boundaries and minimal dependencies. This led to much more autonomy inside the teams and made them more independent from our support.
Besides the general infrastructure, we paid a lot of attention to the observability of the platform. We collaborated closely with the other development teams to learn about their monitoring and alerting needs, since for us it is essential that teams have insight into their apps and are in charge of their monitoring. The teams therefore handed over the thresholds relevant for their services. We made sure that alerts about infrastructure, applications and databases are pushed to dedicated team Slack channels and PagerDuty. Monitoring is done with Azure Monitor, which collects and stores metrics from Azure resources. Prometheus is used as an additional monitoring system that scrapes data from other sources, while Grafana is used to visualize custom dashboards. All logging was streamlined with Loki. This allowed us to visualise all relevant metrics in a single pane of glass, greatly improving observability for each team.
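In practice this alerting is configured in Prometheus alerting rules, Alertmanager and Azure Monitor rather than in ad hoc scripts, but as a hedged sketch of the underlying idea (the query, threshold and webhook URL are made-up placeholders), a team-provided threshold check could look roughly like this:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values: each team hands over the query and threshold
# that matter for its own service.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
THRESHOLD = 5.0  # errors per second, chosen by the owning team

def query_prometheus(query: str) -> float:
    """Run an instant query against the Prometheus HTTP API and return the first value."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    results = payload["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

def notify_slack(message: str) -> None:
    """Post a simple text message to the team's Slack channel via an incoming webhook."""
    body = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)

if __name__ == "__main__":
    value = query_prometheus(QUERY)
    if value > THRESHOLD:
        notify_slack(f"5xx error rate {value:.2f}/s exceeds threshold {THRESHOLD}/s")
```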
We introduced alerting and monitoring measures for the different workstreams while we created CI/CD pipelines and supported developers in deploying applications to the new production and testing environments. As the landing zone continues to evolve, introducing these technologies makes the infrastructure ever more reliable and easier to revert.
Key technologies we used:
| Before migration | After migration |
|---|---|
| Self-hosted Jenkins + Spinnaker | Azure DevOps |
| Google Kubernetes Engine (GKE) | Azure Kubernetes Service (AKS) |
| Terraform to control everything | |
| New Relic + GCP Monitoring | Prometheus, Grafana, Loki |
| PubSub | Service Bus |
| GCP Application Load Balancer | Application Gateway + API Management |
Results.
What was the outcome?
By leveraging these key technologies we helped MOTORS modernize their infrastructure, migrate to a new cloud platform, and improve their software development processes. The main positive outcomes are:
- After the successful migration of the subsystem to Azure, MOTORS could lower its GCP footprint and save costs.
- We could improve the software release processes for teams with Terraform:
By splitting the state into smaller, team-owned ones, we enabled better separation of concerns and allowed development teams to manage their own infrastructure independently. This reduces the blast radius: misconfigurations stay isolated and don't affect the entire platform, which led to a more stable and resilient system (see the sketch after this list). For a detailed explanation of using separate states with Terraform, see our blog post.
- Enhanced monitoring and alerting capabilities helped the teams better manage their services and improve Mean Time to Recovery (MTTR).
- With a modernized infrastructure, it is now much easier for teams and for MOTORS to scale.
- Establishing communication and support routines increased trust and collaboration between teams.
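As referenced in the Terraform bullet above, the following sketch illustrates (with a hypothetical team layout, not MOTORS' real repository structure) how team-owned states keep failures isolated: each team's stack has its own state and is planned independently, so an error in one stack cannot touch another team's infrastructure:

```python
import subprocess
from pathlib import Path

# Hypothetical layout: one directory, and therefore one Terraform state,
# per team, e.g. teams/search/main.tf, teams/payments/main.tf, ...
# Assumes `terraform init` has already been run in each directory.
TEAM_STACKS = [Path("teams/search"), Path("teams/payments"), Path("teams/dealers")]

def plan_stack(stack: Path) -> bool:
    """Plan one team-owned stack; a failure here cannot touch other teams' state."""
    result = subprocess.run(
        ["terraform", f"-chdir={stack}", "plan", "-input=false", "-detailed-exitcode"],
    )
    # -detailed-exitcode: 0 = no changes, 2 = changes present, 1 = error.
    return result.returncode in (0, 2)

if __name__ == "__main__":
    for stack in TEAM_STACKS:
        status = "ok" if plan_stack(stack) else "FAILED (isolated to this stack)"
        print(f"{stack}: {status}")
```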
What are your lessons learned?
- Clear communication leads to efficient collaboration: Only teams that communicate clearly with each other are able to collaborate effectively. A company's investment in communication, by offering space and time for teams to connect, is essential.
- We learned how trust can be created:
- Sharing knowledge: For instance, we ran workshops on Terraform and Kubernetes to enable other teams to take over one day.
- MOTORS has a blameless culture, so everyone can ask for help without fear.
- And of course, it is essential to be honest with each other about new ideas and solution proposals, as well as about goals and priorities.
- Shift-left approach: Coming from the traditional setup where development teams do the work and then other teams take over testing and releases, we already see the benefits of shifting those phases into earlier development stages. Our teams can identify and handle issues sooner, which speeds up development cycles, improves overall software quality and implicitly reduces costs.
- IaC for complex infrastructure: Infrastructure environments should be managed using Infrastructure as Code (IaC) since it ensures consistency, enables version control, reduces manual errors, and allows for automated, repeatable deployments across different stages and environments.
- There are always trade-offs: We also learned that enabling teams to manage the infrastructure in separate states comes at a cost, since it makes the implementation of the CI/CD pipelines more complex. There is always a need to balance cost considerations against the benefits of technological decisions, be it increased deployment speed versus more complex CI/CD pipelines, or cloud cost savings versus the investment in a cloud migration.