Pre-Migration Vulnerability Scans
Migrating applications to the cloud or modernizing infrastructure, whether the target is a cloud platform, a new data center, or a hybrid environment, is a complex process that requires thorough preparation. While organizations focus on optimizing performance, costs, and scalability, security often takes a backseat, leading to potential risks post-migration. One crucial step before migration is conducting a pre-migration scan to identify security vulnerabilities, licensing risks, and code quality issues. Several tools help in pre-migration scanning, including Black Duck, Coverity, Gitleaks, and Semgrep. In this article, we will explore the role of these tools in migration readiness.
Why Perform a Pre-Migration Scan?
When an application moves from an on-premises environment to the cloud, it interacts with new infrastructures, security models, and compliance regulations. Security scanning tools analyze various aspects of an application, including:
- Source Code: Detects insecure coding practices, injection vulnerabilities, and logic flaws.
- Third-Party Dependencies: Identifies vulnerabilities in open-source libraries and software components.
- Secrets & Credentials: Scans for hardcoded passwords, API keys, and authentication tokens.
- Infrastructure as Code (IaC): Checks for misconfigurations in Terraform, Kubernetes, Docker, and cloud resources.
- Compliance Risks: Ensures adherence to security standards like SOC 2, GDPR, HIPAA, and NIST.
A pre-migration scan helps in:
- Identifying Security Vulnerabilities: Detecting potential security threats before moving to the cloud.
- Ensuring License Compliance: Avoiding open-source license violations.
- Code Quality Assurance: Identifying issues that could lead to performance degradation post-migration.
- Reducing Migration Risks: Understanding potential blockers early in the process.
- Optimizing Performance: Detecting inefficiencies early reduces technical debt.
What to use?
One of the biggest challenges organizations face during migration is understanding where vulnerabilities exist within their application. This is where scanning tools come into play, each addressing a specific aspect of security and compliance.
Take Black Duck, for instance. Many applications rely on open-source components, but these dependencies come with risks. Black Duck helps teams analyze these libraries, identifying outdated dependencies and ensuring compliance with licensing policies. If an application relies heavily on open-source libraries, checking for outdated or vulnerable dependencies should be a priority.
Key Features:
- Detects Open-Source Vulnerabilities: Identifies known CVEs (Common Vulnerabilities and Exposures) in third-party libraries.
- License Compliance Management: Ensures adherence to open-source licenses like GPL, MIT, and Apache.
- Integration with DevOps: Works seamlessly with CI/CD pipelines to automate security checks.
Then there's Coverity, which tackles security flaws hidden in the source code. A migration process should not only move applications but also ensure they are stable and secure in the new environment. Coverity, a Static Application Security Testing (SAST) tool, scans code for potential weaknesses such as SQL injection, cross-site scripting (XSS), or memory leaks. By fixing these defects before migration, teams can prevent costly failures post-deployment.
Key Features:
- Deep Code Analysis: Identifies issues such as buffer overflows, SQL injection, cross-site scripting (XSS), and memory leaks.
- Supports Multiple Languages: Works with C, C++, Java, JavaScript, Python, Go, and more.
- Seamless CI/CD Integration: Can be integrated into GitHub, GitLab, and Azure DevOps workflows.
Another key concern is secrets management. Hardcoded API keys, passwords, and tokens often find their way into repositories, creating a massive security risk. Gitleaks scans Git repositories to detect and eliminate these vulnerabilities before they can be exploited. Imagine pushing an application to the cloud, only to realize that an exposed API key is granting unauthorized access to critical services. By integrating Gitleaks into the pre-migration process, organizations can avoid such missteps.
Key Features:
- Scans for Hardcoded Secrets: Detects sensitive information in commits, branches, and history.
- Pre-Commit Hooks: Prevents secrets from being pushed to Git repositories.
- Customizable Rulesets: Allows teams to define their own secret detection policies.
- Compatible with GitHub & GitLab: Easily integrates with popular version control platforms.
Finally, Semgrep provides a flexible approach to enforcing security best practices. Unlike traditional scanning tools, it allows teams to define custom security rules to catch coding patterns that may lead to vulnerabilities. Whether it's identifying misconfigurations or enforcing secure coding standards, Semgrep adds an extra layer of protection, ensuring applications follow best practices before going live in the cloud.
Comparing the Tools:
Tool | Primary Use Case | Best for | CI/CD Integration
Black Duck | Open-source security & license compliance | Dependency scanning | Yes
Coverity | Static code analysis | Code vulnerabilities | Yes
Gitleaks | Secret & credential scanning | Preventing secret leaks | Yes
Semgrep | Customizable code analysis | Secure coding & policy enforcement | Yes
Integration with the code:
Automation is key to ensuring that security scans are not overlooked or treated as one-time activities. To streamline the process, organizations integrate these scanning tools directly into their Continuous Integration/Continuous Deployment (CI/CD) pipeline, ensuring security checks are part of every development cycle. A typical setup involves defining a pipeline configuration that automates the execution of each tool at various stages; a minimal example of such a step appears at the end of this article. Once the scans are complete, the results are typically stored as JSON reports in pipeline artifacts or logging systems, making it easy to track, analyze, and prioritize issues before proceeding with the migration. By integrating these tools into the CI/CD pipeline, security becomes an automated and continuous process, rather than a last-minute checkpoint.
Challenges in Pre-Migration Security Scanning
- False Positives: Some tools generate excessive alerts, requiring manual verification.
- Lack of Security Awareness: Developers may not be trained to interpret scan results effectively.
- Integration with DevOps: Security scans must fit into existing CI/CD pipelines without slowing down deployments.
- Handling Legacy Code: Older applications may contain security issues that modern tools struggle to assess.
Conclusion
By proactively addressing these challenges and incorporating security scanning into the migration strategy, organizations can minimize risks and ensure a smooth, secure transition to their new environment. However, scanning alone is not enough. Following best practices, such as defining a security baseline, automating security checks in CI/CD pipelines, prioritizing remediation, and securing the migration process, ensures a smooth, risk-free transition.
A secure migration is not just about moving workloads; it's about ensuring that security remains a top priority at every stage. By taking a proactive security approach, organizations can prevent security incidents before they happen, making the migration process safer, smoother, and more resilient.
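To make the CI/CD integration described above concrete, here is a minimal sketch of a single pipeline step. It assumes the Gitleaks and Semgrep CLIs are installed on the build agent (flag names may vary by tool version) and that failing the build on any finding is acceptable; adjust the policy to your own risk tolerance.

```python
"""Minimal pre-migration scan step for a CI/CD job.

Illustrative sketch only: assumes the gitleaks and semgrep CLIs are
installed on the build agent; exact flag names may differ by version.
"""
import json
import pathlib
import subprocess
import sys

ARTIFACTS = pathlib.Path("scan-reports")
ARTIFACTS.mkdir(exist_ok=True)

SCANS = {
    # tool name -> command that writes a JSON report into the artifact folder
    "gitleaks": ["gitleaks", "detect", "--source", ".",
                 "--report-format", "json",
                 "--report-path", str(ARTIFACTS / "gitleaks.json")],
    "semgrep": ["semgrep", "scan", "--config", "auto",
                "--json", "--output", str(ARTIFACTS / "semgrep.json")],
}

failed = False
for tool, cmd in SCANS.items():
    # A non-zero exit code usually means findings (or a scan error).
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(f"{tool}: exit code {result.returncode}")
    failed = failed or result.returncode != 0

# Summarize the JSON artifacts so reviewers can triage before the migration.
for report in sorted(ARTIFACTS.glob("*.json")):
    data = json.loads(report.read_text() or "{}")
    count = len(data) if isinstance(data, list) else len(data.get("results", []))
    print(f"{report.name}: {count} findings")

sys.exit(1 if failed else 0)
```

In a GitHub Actions, GitLab CI, or Azure DevOps pipeline this would run as one step on every pull request, with the scan-reports folder published as a build artifact for tracking and prioritization.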
Azure VMware Solution now available in Korea Central
We are pleased to announce that Azure VMware Solution is now available in Korea Central. Now in 34 Azure regions, Azure VMware Solution empowers you to seamlessly extend or migrate existing VMware workloads to Azure without the cost, effort or risk of re-architecting applications or retooling operations. Azure VMware Solution supports:
- Rapid cloud migration of VMware-based workloads to Azure without refactoring.
- Datacenter exit while maintaining operational consistency for the VMware environment.
- Business continuity and disaster recovery for on-premises VMware environments.
- Attach Azure services and innovate applications at your own pace.
- Includes the VMware technology stack and lets you leverage existing Microsoft licenses for Windows Server and SQL Server.
For updates on current and upcoming region availability, visit the product by region page here.
Streamline migration with new offers and licensing benefits, including a 20% discount. We recently announced the VMware Rapid Migration Plan, where Microsoft provides a comprehensive set of licensing benefits and programs to give you price protection and savings as you migrate to Azure VMware Solution. Azure VMware Solution is a great first step to the cloud for VMware customers, and this plan can help you get there. Learn More
Migration planning of MySQL workloads using Azure Migrate
In our endeavor to increase coverage of OSS workloads in Azure Migrate, we are announcing discovery and modernization assessment of MySQL databases running on Windows and Linux servers. Customers previously had limited visibility into their MySQL workloads and often received generalized VM lift-and-shift recommendations. With this new capability, customers can now accurately identify their MySQL workloads and assess them for right-sizing into Azure Database for MySQL.
MySQL workloads are a cornerstone of the LAMP stack, powering countless web applications with their reliability, performance, and ease of use. As businesses grow, the need for scalable and efficient database solutions becomes paramount. This is where Azure Database for MySQL comes into play. Migrating from on-premises to Azure Database for MySQL offers numerous benefits, including effortless scalability, cost efficiency, enhanced performance, robust security, high availability, and seamless integration with other Azure services. As a fully managed Database-as-a-Service (DBaaS), it simplifies database management, allowing businesses to focus on innovation and growth.
What is Azure Migrate?
Azure Migrate serves as a comprehensive hub designed to simplify the migration journey of on-premises infrastructure, including servers, databases, and web applications, to Azure Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS) targets at scale. It provides a unified platform with a suite of tools and features to help you identify the best migration path, assess Azure readiness, estimate the cost of hosting workloads on Azure, and execute the migration with minimal downtime and risk.
Key features of the MySQL Discovery and Assessment in Azure Migrate
The new MySQL Discovery and Assessment feature in Azure Migrate (Preview) introduces several powerful capabilities:
- Discover MySQL database instances: The tool allows you to discover MySQL instances within your environment efficiently. By identifying critical attributes of these instances, it lays the foundation for a thorough assessment and a strategic migration plan.
- Assessment for Azure readiness: The feature evaluates the readiness of your MySQL database instances to migrate to Azure Database for MySQL – Flexible Server. This assessment considers several factors, including compatibility and performance metrics, to ensure a smooth transition.
- SKU recommendations: Based on the discovered data, the tool recommends the optimal compute and storage configuration for hosting MySQL workloads on Azure Database for MySQL. Furthermore, it provides insights into the associated costs, enabling better financial planning.
How to get started?
To begin using the MySQL Discovery and Assessment feature in Azure Migrate, follow these steps:
1. Create an Azure Migrate Project: Initiate your migration journey by setting up a project in the Azure portal.
2. Configure the Azure Migrate Appliance: Use a Windows-based appliance to discover the inventory of servers and provide guest credentials for discovering the workloads and MySQL credentials to fetch database instances and their attributes.
3. Review Discovered Inventory: Examine the detailed attributes of the discovered MySQL instances.
4. Create an Assessment: Evaluate the readiness and get detailed recommendations for migration to Azure Database for MySQL.
For detailed step-by-step guidance, check out the documentation for discovery and assessment tutorials.
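As a rough illustration of the kind of attributes that discovery and right-sizing depend on, the sketch below queries a MySQL instance for its version and approximate per-schema data size. It uses the mysql-connector-python package; the host and account are placeholders, and the attributes shown are examples for illustration, not the exact set collected by the Azure Migrate appliance.

```python
"""Pre-discovery sanity check for a MySQL instance (illustrative only)."""
import mysql.connector

# Hypothetical host and read-only discovery account.
conn = mysql.connector.connect(
    host="mysql-host.example.com",
    user="migrate_readonly",
    password="********",
)
cur = conn.cursor()

cur.execute("SELECT VERSION()")
print("Server version:", cur.fetchone()[0])

# Approximate on-disk size per schema, useful when sizing the target SKU.
cur.execute(
    "SELECT table_schema, ROUND(SUM(data_length + index_length)/1024/1024, 1) "
    "FROM information_schema.tables GROUP BY table_schema"
)
for schema, size_mb in cur.fetchall():
    print(f"{schema}: {size_mb} MB")

cur.close()
conn.close()
```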
Documentation:
- Discover MySQL databases running in your datacenter
- Assess MySQL database instances for migration to Azure Database for MySQL
Share your feedback!
In summary, the MySQL Discovery and Assessment feature in Azure Migrate enables you to effortlessly discover, assess, and plan your MySQL database migrations to Azure. Try the feature out in public preview and fast-track your migration journey! If you have any queries, feedback or suggestions, please let us know by leaving a comment below or by directly contacting us at AskAzureDBforMySQL@service.microsoft.com. We are eager to hear your feedback and support you on your journey to Azure.
Forward Azure VMware Solution logs anywhere using Azure Logic Apps
Overview As enterprises scale their infrastructure in Microsoft Azure using Azure VMware Solution, gaining real-time visibility into the operational health of their private cloud environment becomes increasingly critical. Whether troubleshooting deployment issues, monitoring security events, or performing compliance audits, centralized logging is a must-have. Azure VMware Solution offers flexible options for exporting syslogs from vCenter Server, ESXi Hosts, and NSX components. While many customers already use Log Analytics or third-party log platforms for visibility, some have unique operational or compliance requirements that necessitate forwarding logs to specific destinations outside the Microsoft ecosystem. With the advent of VMware Cloud Foundation on Azure VMware Solution, customers can now have more choices and can leverage tools like VCF Operations for Logs to monitor, analyze, and troubleshoot their logs. In this post, we’ll show you how to use Azure Logic Apps, Microsoft’s low-code, serverless integration platform, to forward Azure VMware Solution private cloud logs to any log management tool of your choosing. With a newly released workflow template tailored for Azure VMware Solution, you can set this up in minutes—no custom code required. Figure 1. Architectural flow of syslog data from an Azure VMware Solution private cloud to a log management server via Azure Logic Apps Background The Azure VMware Solution and Azure Logic Apps product teams have partnered to deliver a built-in integration that allows Azure VMware Solution customers to forward logs to any syslog-compatible endpoint—whether in Azure, on-premises, or another cloud. This new Logic Apps template is purpose-built for Azure VMware Solution and dramatically simplifies log forwarding. Figure 2. Azure VMware Solution template in Azure Logic Apps template catalog Historically, forwarding logs from Azure VMware Solution required customers to develop custom code or deploy complex workarounds, often involving multiple services and significant manual configuration. These methods not only introduced operational overhead but also made it difficult for platform teams to standardize logging across environments. With this new integration, customers who previously spent days in frustration trying to get their private cloud logs have now done so in under an hour, a massive improvement in both speed and simplicity. This new capability is particularly timely given recent industry changes. Following VMware’s announcement to discontinue the SaaS versions of Aria Operations, including Aria Operations for Logs, many customers have begun exploring alternative solutions for their log management needs. For those looking to use the on-premises alternative of Aria Operations for Logs, the ability to send Azure VMware Solution logs directly from Azure to their self-managed VCF Operations for Logs servers is now possible—with zero custom code. Using Azure Logic Apps, customers can seamlessly bridge their hybrid cloud monitoring environments and avoid gaps in visibility or compliance. This solution empowers Azure VMware Solution customers with more flexibility, shorter time-to-value, and a consistent logging strategy across both legacy and modernized environments. Why Azure Logic Apps? Azure Logic Apps is a powerful, low-code integration platform that enables IT administrators and platform teams to automate workflows and connect services—without having to manage any infrastructure. 
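The Logic Apps template itself requires no code, but to make the data path in Figure 1 concrete, here is a rough Python sketch of the same pattern: read the syslog events that the private cloud streams to Event Hubs and relay them to a syslog-compatible endpoint. The hub name, connection string, and target server are placeholders; this illustrates the flow, not the template's implementation.

```python
"""Conceptual equivalent of the forwarding workflow in Figure 1 (sketch only).

Requires the azure-eventhub package. All names below are placeholders.
"""
import socket
from azure.eventhub import EventHubConsumerClient

EVENTHUB_CONN_STR = "<event-hub-namespace-connection-string>"
EVENTHUB_NAME = "avs-syslog"                          # hypothetical hub for AVS diagnostics
SYSLOG_HOST, SYSLOG_PORT = "logs.contoso.local", 514  # target log management server

def forward(partition_context, event):
    # Each event carries syslog text emitted by vCenter Server, ESXi, or NSX.
    message = event.body_as_str()
    # A new connection per event keeps the sketch simple; a real forwarder
    # would reuse the socket and add retries, formatting, and authentication.
    with socket.create_connection((SYSLOG_HOST, SYSLOG_PORT), timeout=5) as sock:
        sock.sendall(message.encode("utf-8") + b"\n")

client = EventHubConsumerClient.from_connection_string(
    EVENTHUB_CONN_STR, consumer_group="$Default", eventhub_name=EVENTHUB_NAME
)
with client:
    client.receive(on_event=forward, starting_position="-1")  # read from the start
```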
With over 1,400 connectors to Azure services, popular SaaS applications, on-premises APIs, and more, Logic Apps provides a flexible and reliable foundation for routing log data across infrastructure environments. For Azure VMware Solution users, this means you can now easily forward logs from your Azure VMware Solution private cloud to any log management solution—on-premises or in the cloud—without writing custom code. Logic Apps acts as a dynamic "translator" or "dispatcher" in your architecture, listening for logs streamed to Event Hubs and securely forwarding them to your target syslog endpoint with the required formatting, headers, and authentication. This new capability not only accelerates time-to-value for log forwarding but also gives Azure VMware Solution customers the freedom to integrate with the logging platform of their choice—improving visibility, operational efficiency, and compliance in hybrid cloud environments. Future iterations of this integration will include support for Azure Blob Storage as well, another common method Azure VMware Solution customers use to retain and forward their logs.
How to get started
In addition to this blog, check out the links below to learn more about this integration, understand how Azure Logic Apps work, and use the pricing calculator to cost and size Azure Logic Apps. With large enterprise solutions for strategic and major customers, an Azure VMware Solution Architect from Azure, Broadcom, or a Broadcom Partner should be engaged to ensure the solution is correctly sized to deliver business value with the minimum of risk. If you are interested in using Logic Apps with Azure VMware Solution, please use these resources to learn more about the service:
- Detailed instructions on sending logs via Logic Apps: Send VMware syslogs to log management server using Azure Logic Apps - Azure VMware Solution | Microsoft Learn
- An overview of Logic Apps: Overview - Azure Logic Apps | Microsoft Learn
- Pricing calculator: Pricing - Logic Apps | Microsoft Azure
--
Azure VMware Solution is a VMware validated first party Azure service from Microsoft that provides private clouds containing VMware vSphere clusters built from dedicated bare-metal Azure infrastructure. It enables customers to leverage their existing investments in VMware skills and tools, allowing them to focus on developing and running their VMware-based workloads on Azure.
Author Bio
Varun Hariharan is a Senior Product Manager on the Azure VMware Solution team at Microsoft, where he is focusing on observability and workload strategies for customers. His background is in Infrastructure as a Service (IaaS), log management, enterprise software, and DevOps. Kent Weare is a Principal PM Manager on the Azure Logic Apps team at Microsoft, where he is focusing on providing enterprise integration and automation capabilities for customers.
Essentials of Azure and AI project performance and security | New training!
Are you ready to elevate your cloud skills and master the essentials of reliability, security, and performance of Azure and AI projects? Join us for comprehensive training in Microsoft Azure Virtual Training Day events, where you'll gain the knowledge and tools to adopt the cloud at scale and optimize your cloud spend.
Event Highlights:
- Two-Day Agenda: Dive deep into how-to learning on cloud and AI adoption, financial best practices, workload design, environment management, and more.
- Expert Guidance: Learn from industry experts and gain insights into designing with optimization in mind with the Azure Well-Architected Framework and the Cloud Adoption Framework for Azure.
- Hands-On Learning: Participate in interactive sessions and case studies to apply essentials of Azure and AI best practices in real-world scenarios, like reviewing and remediating workloads.
- FinOps in the Era of AI: Discover how to build a culture of cost efficiency and maximize the business value of the cloud with the FinOps Framework, including principles, phases, domains, and capabilities.
Why Attend?
- Build Reliable and Secure Systems: Understand the shared responsibility between Microsoft and its customers to build resilient and secure systems.
- Optimize Cloud Spend: Learn best practices for cloud spend optimization and drive market differentiation through savings.
- Enhance Productivity: Improve productivity, customer experience, and competitive advantage by elevating the resiliency and security of your critical workloads.
Don't miss the opportunity to transform your cloud strategy and take your skills to the next level. Register now and join us for an insightful and engaging virtual training experience! Register today! Aka.ms/AzureEssentialsVTD
Eager to learn before the next event? Dive into our free self-paced training modules:
- Cost efficiency of Azure and AI Projects | on Microsoft Learn
- Resiliency and security of Azure and AI Projects | on Microsoft Learn
- Overview of essential skilling for Azure and AI workloads | on Microsoft Learn
Azure VMware Solution Availability Design Considerations
Azure VMware Solution Design Series Availability Design Considerations Recoverability Design Considerations Performance Design Considerations Security Design Considerations VMware HCX Design with Azure VMware Solution Overview A global enterprise wants to migrate thousands of VMware vSphere virtual machines (VMs) to Microsoft Azure as part of their application modernization strategy. The first step is to exit their on-premises data centers and rapidly relocate their legacy application VMs to the Azure VMware Solution as a staging area for the first phase of their modernization strategy. What should the Azure VMware Solution look like? Azure VMware Solution is a VMware validated first party Azure service from Microsoft that provides private clouds containing VMware vSphere clusters built from dedicated bare-metal Azure infrastructure. It enables customers to leverage their existing investments in VMware skills and tools, allowing them to focus on developing and running their VMware-based workloads on Azure. In this post, I will introduce the typical customer workload availability requirements, describe the Azure VMware Solution architectural components, and describe the availability design considerations for Azure VMware Solution private clouds. In the next section, I will introduce the typical availability requirements of a customer’s workload. Customer Workload Requirements A typical customer has multiple application tiers that have specific Service Level Agreement (SLA) requirements that need to be met. These SLAs are normally named by a tiering system such as Platinum, Gold, Silver, and Bronze or Mission-Critical, Business-Critical, Production, and Test/Dev. Each SLA will have different availability, recoverability, performance, manageability, and security requirements that need to be met. For the availability design quality, customers will normally have an uptime percentage requirement with an availability zone (AZ) or region requirement that defines each SLA level. For example: SLA Name Uptime AZ/Region Gold 99.999% (5.26 min downtime/year) Dual Regions Silver 99.99% (52.6 min downtime/year) Dual AZs Bronze 99.9% (8.76 hrs downtime/year) Single AZ Table 1 – Typical Customer SLA requirements for Availability A typical legacy business-critical application will have the following application architecture: Load Balancer layer: Uses load balancers to distribute traffic across multiple web servers in the web layer to improve application availability. Web layer: Uses web servers to process client requests made via the secure Hypertext Transfer Protocol (HTTPS). Receives traffic from the load balancer layer and forwards to the application layer. Application layer: Uses application servers to run software that delivers a business application through a communication protocol. Receives traffic from the web layer and uses the database layer to access stored data. Database layer: Uses a relational database management service (RDMS) cluster to store data and provide database services to the application layer. Depending upon the availability requirements for the service, the application components could be many and spread across multiple sites and regions to meet the customer SLA. Figure 1 – Typical Legacy Business-Critical Application Architecture In the next section, I will introduce the architectural components of the Azure VMware Solution. Architectural Components The diagram below describes the architectural components of the Azure VMware Solution. 
Figure 2 – Azure VMware Solution Architectural Components Each Azure VMware Solution architectural component has the following function: Azure Subscription: Used to provide controlled access, budget and quota management for the Azure VMware Solution. Azure Region: Physical locations around the world where we group data centers into Availability Zones (AZs) and then group AZs into regions. Azure Resource Group: Container used to place Azure services and resources into logical groups. Azure VMware Solution Private Cloud: Uses VMware software, including vCenter Server, NSX software-defined networking, vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported. Azure VMware Solution Resource Cluster: Uses VMware software, including vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources for customer workloads by scaling out the Azure VMware Solution private cloud. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported. VMware HCX: Provides mobility, migration, and network extension services. VMware Site Recovery: Provides Disaster Recovery automation, and storage replication services with VMware vSphere Replication. Third party Disaster Recovery solutions Zerto DR and JetStream DR are also supported. Dedicated Microsoft Enterprise Edge (D-MSEE): Router that provides connectivity between Azure cloud and the Azure VMware Solution private cloud instance. Azure Virtual Network (VNet): Private network used to connect Azure services and resources together. Azure Route Server: Enables network appliances to exchange dynamic route information with Azure networks. Azure Virtual Network Gateway: Cross premises gateway for connecting Azure services and resources to other private networks using IPSec VPN, ExpressRoute, and VNet to VNet. Azure ExpressRoute: Provides high-speed private connections between Azure data centers and on-premises or colocation infrastructure. Azure Virtual WAN (vWAN): Aggregates networking, security, and routing functions together into a single unified Wide Area Network (WAN). In the next section, I will describe the availability design considerations for the Azure VMware Solution. Availability Design Considerations The architectural design process takes the business problem to be solved and the business goals to be achieved and distills these into customer requirements, design constraints and assumptions. Design constraints can be characterized by the following three categories: Laws of the Land – data and application sovereignty, governance, regulatory, compliance, etc. Laws of Physics – data and machine gravity, network latency, etc. Laws of Economics – owning versus renting, total cost of ownership (TCO), return on investment (ROI), capital expenditure, operational expenditure, earnings before interest, taxes, depreciation, and amortization (EBITDA), etc. Each design consideration will be a trade-off between the availability, recoverability, performance, manageability, and security design qualities. The desired result is to deliver business value with the minimum of risk by working backwards from the customer problem. Design Consideration 1 – Azure Region and AZs: Azure VMware Solution is available in 30 Azure Regions around the world (US Government has 2 additional Azure Regions). 
Select the relevant Azure Regions and AZs that meet your geographic requirements. These locations will typically be driven by your design constraints. Design Consideration 2 – Deployment topology: Select the Azure VMware Solution topology that best matches the uptime and geographic requirements of your SLAs. For very large deployments, it may make sense to have separate private clouds dedicated to each SLA for cost efficiency. The Azure VMware Solution supports a maximum of 12 clusters per private cloud. Each cluster supports a minimum of 3 hosts and a maximum of 16 hosts per cluster. Each private cloud supports a maximum of 96 hosts. VMware vSphere HA provides protection against ESXi host failures and VMware vSphere DRS provides distributed resource management. VMware vSphere Fault Tolerance is not supported by the Azure VMware Solution. These features are preconfigured as part of the managed service and cannot be changed by the customer. VMware vCenter Server, VMware HCX Manager, VMware SRM and VMware vSphere Replication Manager are individual appliances and are protected by vSphere HA. VMware NSX Manager is a cluster of 3 unified appliances that have a VM-VM anti-affinity placement policy to spread them across the hosts of the cluster. The VMware NSX Edge cluster is a pair of appliances that also use a VM-VM anti-affinity placement policy. Topology 1 – Standard: The Azure VMware Solution standard private cloud is deployed within a single AZ in an Azure Region, which delivers an infrastructure SLA of 99.9%. Figure 3 – Azure VMware Solution Private Cloud Standard Topology Topology 2 – Multi-AZ: Azure VMware Solution private clouds in separate AZs per Azure Region. VMware HCX is used to connect private clouds across AZs. Application clustering is required to provide the multi-AZ availability mechanism. The customer is responsible for ensuring their application clustering solution is within the limits of bandwidth and latency between private clouds. This topology will deliver an SLA of greater than 99.9%, however it will be dependent upon the application clustering solution used by the customer. The Azure VMware Solution does not support AZ selection during provisioning. This is mitigated by having separate Azure Subscriptions with quota in each separate AZ. You can open a ticket with Microsoft to configure a Special Placement Policy to deploy your Azure VMware Solution private cloud to a particular AZ per subscription. Figure 4 – Azure VMware Solution Private Cloud Multi-AZ Topology Topology 3 – Stretched: The Azure VMware Solution stretched clusters private cloud is deployed across dual AZs in an Azure Region, which delivers a 99.99% infrastructure SLA. This also includes a third AZ for the Azure VMware Solution witness site. Stretched clusters support policy-based synchronous replication to deliver a recovery point objective (RPO) of zero. It is possible to use placement policies and storage policies to mix SLA levels within stretched clusters, by pinning lower SLA workloads to a particular AZ, which will experience downtime during an AZ failure. This feature is GA and is currently only available in Australia East, West Europe, UK South and Germany West Central Azure Regions. Figure 5 – Azure VMware Solution Private Cloud with Stretched Clusters Topology Topology 4 – Multi-Region: Azure VMware Solution private clouds across Azure regions. VMware HCX is used to connect private clouds across Azure Regions. 
Application clustering is required to provide the multi-region availability mechanism. The customer is responsible for ensuring their application clustering solution is within the limits of bandwidth and latency between private clouds. This topology will deliver an SLA of greater than 99.9%, however it will be dependent upon the application clustering solution used by the customer. An additional enhancement could be using Azure VMware Solution stretched clusters in one or both Azure Regions. Figure 6 – Azure VMware Solution Private Cloud Multi-Region Topology Design Decision 3 – Shared Services or Separate Services Model: The management and control plane cluster (Cluster-1) can be shared with customer workload VMs or be a dedicated cluster for management and control, including customer enterprise services, such as Active Directory, DNS, and DHCP. Additional resource clusters can be added to support customer workload demand. This also includes the option of using separate clusters for each customer SLA. Figure 7 – Azure VMware Solution Shared Services Model Figure 8 – Azure VMware Solution Separate Services Model Design Consideration 4 – SKU type: Three SKU types can be selected for provisioning an Azure VMware Solution private cloud. The smaller AV36 SKU can be used to minimize the impact radius of a failed node. The larger AV36P and AV52 SKUs can be used to run more workloads with less nodes which increases the impact radius of a failed node. The AV36 SKU is widely available in most Azure regions and the AV36P and AV52 SKUs are limited to certain Azure regions. Azure VMware Solution does not support mixing different SKU types within a private cloud (AV64 SKU is the exception). You can check Azure VMware Solution SKU availability by Azure Region here. The AV64 SKU is currently only available for mixed SKU deployments in certain regions. Figure 9 – AV64 Mixed SKU Topology Design Consideration 5 – Placement Policies: Placement policies are used to increase the availability of a service by separating the VMs in an application availability layer across ESXi hosts. When an ESXi failure occurs, it would only impact one VM of a multi-part application layer, which would then restart on another ESXi host through vSphere HA. Placement policies support VM-VM and VM-Host affinity and anti-affinity rules. The vSphere Distributed Resource Scheduler (DRS) is responsible for migrating VMs to enforce the placement policies. To increase the availability of an application cluster, a placement policy with VM-VM anti-affinity rules for each of the web, application and database service layers can be used. Alternatively, VM-Host affinity rules can be used to segment the web, application, and database components to dedicated groups of hosts. The placement policies for stretched clusters can use VM-Host affinity rules to pin workloads to the preferred and secondary sites, if needed. Figure 10 – Azure VMware Solution Placement Policies – VM-VM Anti-Affinity Figure 11 – Azure VMware Solution Placement Policies – VM-Host Affinity Design Consideration 6 – Storage Policies: Table 2 lists the pre-defined VM Storage Policies available for use with VMware vSAN. The appropriate redundant array of independent disks (RAID) and failures to tolerate (FTT) settings per policy need to be considered to match the customer workload SLAs. Each policy has a trade-off between availability, performance, capacity, and cost that needs to be considered. 
The storage policies for stretched clusters include a designation for the dual site (synchronous replication), preferred site and secondary site policies that need to be considered. To comply with the Azure VMware Solution SLA, you are responsible for using an FTT=2 storage policy when the cluster has 6 or more nodes in a standard cluster. You must also retain a minimum slack space of 25% for backend vSAN operations.
Deployment Type | Policy Name | RAID | Failures to Tolerate (FTT) | Site
Standard | RAID-1 FTT-1 | 1 | 1 | N/A
Standard | RAID-1 FTT-2 | 1 | 2 | N/A
Standard | RAID-1 FTT-3 | 1 | 3 | N/A
Standard | RAID-5 FTT-1 | 5 | 1 | N/A
Standard | RAID-6 FTT-2 | 6 | 2 | N/A
Standard | VMware Horizon | 1 | 1 | N/A
Stretched | RAID-1 FTT-1 Dual Site | 1 | 1 | Site mirroring
Stretched | RAID-1 FTT-1 Preferred | 1 | 1 | Preferred
Stretched | RAID-1 FTT-1 Secondary | 1 | 1 | Secondary
Stretched | RAID-1 FTT-2 Dual Site | 1 | 2 | Site mirroring
Stretched | RAID-1 FTT-2 Preferred | 1 | 2 | Preferred
Stretched | RAID-1 FTT-2 Secondary | 1 | 2 | Secondary
Stretched | RAID-1 FTT-3 Dual Site | 1 | 3 | Site mirroring
Stretched | RAID-1 FTT-3 Preferred | 1 | 3 | Preferred
Stretched | RAID-1 FTT-3 Secondary | 1 | 3 | Secondary
Stretched | RAID-5 FTT-1 Dual Site | 5 | 1 | Site mirroring
Stretched | RAID-5 FTT-1 Preferred | 5 | 1 | Preferred
Stretched | RAID-5 FTT-1 Secondary | 5 | 1 | Secondary
Stretched | RAID-6 FTT-2 Dual Site | 6 | 2 | Site mirroring
Stretched | RAID-6 FTT-2 Preferred | 6 | 2 | Preferred
Stretched | RAID-6 FTT-2 Secondary | 6 | 2 | Secondary
Stretched | VMware Horizon | 1 | 1 | Site mirroring
Table 2 – VMware vSAN Storage Policies
Design Consideration 7 – Network Connectivity: Azure VMware Solution private clouds can be connected using IPSec VPN and Azure ExpressRoute circuits, including a variety of Azure Virtual Networking topologies such as Hub-Spoke and Azure Virtual WAN with Azure Firewall and third-party Network Virtualization Appliances. Multiple Azure ExpressRoute circuits can be used to provide redundant connectivity. VMware HCX also supports redundant Network Extension appliances to provide high availability for Layer-2 network extensions. For more information, refer to the Azure VMware Solution networking and interconnectivity concepts. The Azure VMware Solution Cloud Adoption Framework also has example network scenarios that can be considered. And, if you are interested in Azure ExpressRoute design:
- Understanding ExpressRoute private peering to address ExpressRoute resiliency
- ExpressRoute MSEE hairpin design considerations
In the following section, I will describe the next steps that would need to be made to progress this high-level design estimate towards a validated detailed design.
Next Steps
The Azure VMware Solution sizing estimate should be assessed using Azure Migrate. With large enterprise solutions for strategic and major customers, an Azure VMware Solution Solutions Architect from Azure, VMware, or a VMware Partner should be engaged to ensure the solution is correctly sized to deliver business value with the minimum of risk. This should also include an application dependency assessment to understand the mapping between application groups and identify areas of data gravity, application network traffic flows, and network latency dependencies.
Summary
In this post, we took a closer look at the typical availability requirements of a customer workload, the architectural building blocks, and the availability design considerations for the Azure VMware Solution. We also discussed the next steps to continue an Azure VMware Solution design.
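To put the uptime targets from Table 1 and the FTT and slack-space guidance above into perspective, here is a small back-of-envelope sketch. It is illustrative only: the composite-availability formula assumes independent AZ failures, and the capacity estimate assumes simple RAID-1 mirroring (FTT+1 full copies), which does not reflect RAID-5/6 erasure coding and is not an official sizing method.

```python
"""Back-of-envelope helpers for SLA and vSAN capacity figures (illustrative)."""

def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by an uptime SLA (e.g. 99.99)."""
    return (1 - uptime_percent / 100) * 365 * 24 * 60

def composite_availability(az_availability: float, redundant_copies: int) -> float:
    """Availability if the service survives as long as one independent AZ copy is up."""
    return 1 - (1 - az_availability) ** redundant_copies

def usable_capacity_tb(raw_tb: float, ftt: int, slack: float = 0.25) -> float:
    """Rough usable capacity with RAID-1 mirroring (FTT+1 copies) and slack space."""
    return raw_tb / (ftt + 1) * (1 - slack)

print(f"99.9%  uptime -> {allowed_downtime_minutes(99.9):.1f} min/year")
print(f"99.99% uptime -> {allowed_downtime_minutes(99.99):.1f} min/year")
print(f"Two independent AZs at 99.9% -> {composite_availability(0.999, 2):.4%}")
print(f"100 TB raw, RAID-1 FTT=2, 25% slack -> {usable_capacity_tb(100, 2):.1f} TB usable")
```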
If you are interested in the Azure VMware Solution, please use these resources to learn more about the service:
- Homepage: Azure VMware Solution
- Documentation: Azure VMware Solution
- SLA: SLA for Azure VMware Solution
- Azure Regions: Azure Products by Region
- Service Limits: Azure VMware Solution subscription limits and quotas
- Stretched Clusters: Deploy vSAN stretched clusters
- SKU types: Introduction
- Placement policies: Create placement policy
- Storage policies: Configure storage policy
- VMware HCX: Configuration & Best Practices
- GitHub repository: Azure/azure-vmware-solution
- Well-Architected Framework: Azure VMware Solution workloads
- Cloud Adoption Framework: Introduction to the Azure VMware Solution adoption scenario
- Network connectivity scenarios: Enterprise-scale network topology and connectivity for Azure VMware Solution
- Enterprise Scale Landing Zone: Enterprise-scale for Microsoft Azure VMware Solution
- Enterprise Scale GitHub repository: Azure/Enterprise-Scale-for-AVS
- Azure CLI: Azure Command-Line Interface (CLI) Overview
- PowerShell module: Az.VMware Module
- Azure Resource Manager: Microsoft.AVS/privateClouds
- REST API: Azure VMware Solution REST API
- Terraform provider: azurerm_vmware_private_cloud Terraform Registry
Author Bio
René van den Bedem is a Principal Technical Program Manager in the Azure VMware Solution product group at Microsoft. His background is in enterprise architecture with extensive experience across all facets of the enterprise, public cloud, and service provider spaces, including digital transformation and the business, enterprise, and technology architecture stacks. René works backwards from the problem to be solved and designs solutions that deliver business value with the minimum of risk. In addition to being the first quadruple VMware Certified Design Expert (VCDX), he is also a Dell Technologies Certified Master Enterprise Architect, a Nutanix Platform Expert (NPX), and a VMware vExpert.
Link to PPTX Diagrams: azure-vmware-solution/azure-vmware-master-diagrams
VMware HCX Troubleshooting with Azure VMware Solution
Overview VMware HCX is one of the Azure VMware Solution components that generates a large number of service requests from our customers. The Azure VMware Solution product group has worked to cover the most common troubleshooting considerations that you should know about when using VMware HCX with the Azure VMware Solution. Azure VMware Solution is a VMware validated first party Azure service from Microsoft that provides private clouds containing VMware vSphere clusters built from dedicated bare-metal Azure infrastructure. It enables customers to leverage their existing investments in VMware skills and tools, allowing them to focus on developing and running their VMware-based workloads on Azure. VMware HCX is the mobility and migration software used by the Azure VMware Solution to connect remote VMware vSphere environments to the Azure VMware Solution. These remote VMware vSphere environments can be on-premises, co-location or cloud-based instances. Figure 1 – Azure VMware Solution with VMware HCX Service Mesh In the next section, I will introduce the architectural components of the Azure VMware Solution. Architectural Components The diagram below describes the architectural components of the Azure VMware Solution. Figure 2 – Azure VMware Solution Architectural Components Each Azure VMware Solution architectural component has the following function: Azure Subscription: Used to provide controlled access, budget and quota management for the Azure VMware Solution. Azure Region: Physical locations around the world where we group data centers into Availability Zones (AZs) and then group AZs into regions. Azure Resource Group: Container used to place Azure services and resources into logical groups. Azure VMware Solution Private Cloud: Uses VMware software, including vCenter Server, NSX software-defined networking, vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported. Azure VMware Solution Resource Cluster: Uses VMware software, including vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources for customer workloads by scaling out the Azure VMware Solution private cloud. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported. VMware HCX: Provides mobility, migration, and network extension services. VMware Site Recovery: Provides Disaster Recovery automation, and storage replication services with VMware vSphere Replication. Third party Disaster Recovery solutions Zerto DR and JetStream DR are also supported. Dedicated Microsoft Enterprise Edge (D-MSEE): Router that provides connectivity between Azure cloud and the Azure VMware Solution private cloud instance. Azure Virtual Network (VNet): Private network used to connect Azure services and resources together. Azure Route Server: Enables network appliances to exchange dynamic route information with Azure networks. Azure Virtual Network Gateway: Cross premises gateway for connecting Azure services and resources to other private networks using IPSec VPN, ExpressRoute, and VNet to VNet. Azure ExpressRoute: Provides high-speed private connections between Azure data centers and on-premises or colocation infrastructure. Azure Virtual WAN (vWAN): Aggregates networking, security, and routing functions together into a single unified Wide Area Network (WAN). 
In the next section, I will describe the troubleshooting steps you should follow for VMware HCX when used with the Azure VMware Solution. Troubleshooting Considerations Before opening a ticket with Microsoft support, please use the following steps as a checklist to ensure you are not impacted by the most common VMware HCX issues. Troubleshooting Step 1: Download the VMware HCX Connector. Once VMware HCX is deployed on the Azure VMware Solution side, the download for the VMware HCX Connector OVA is in the VMware HCX UI plugin. Under the Administration there is a Request Download Link. The OVA can be copied locally or a download link for the OVA can be selected. Figure 3 – VMware HCX Connector OVA Download Troubleshooting Step 2: Upgrade to HCX Enterprise. Azure VMware Solution comes with an Enterprise license key for VMware HCX. If you have a pre-existing VMware HCX Connector on-prem that is licensed for VMware HCX Advanced, please be sure to upgrade the connector to the Enterprise version. To upgrade VMware HCX navigate to the HCX Connector at https://<hcx_connector_fqdn>:9443, under the Configuration section select Licensing and Activation, edit the current license and enter the VMware HCX enterprise license key obtained from the Azure VMware Solution portal. Verify that the License is showing Enterprise. Figure 4 – VMware HCX Connector License Key Once you have updated the VMware HCX Connector, be sure to update/edit the VMware HCX Compute Profile and Service Mesh to include the updated VMware HCX services that you would like to take advantage of, such as Replicated Assisted vMotion and OS Assisted Migration. OS Assisted Migration is used for migrating and converting Microsoft Hyper-V and RedHat KVM workloads into Azure VMware Solution. Figure 5 – VMware HCX Connector Compute Profile Service Activation Troubleshooting Step 3: Only use the key from the Azure VMware Solution private cloud you are connecting to. When deploying the VMware HCX Connector on-premises, the activation key should come from the Azure VMware Solution you are migrating to. In the Azure portal, an activation Key can be obtained in the Add-Ons section. Simply request an activation key, provide it with a friendly name and map that activation key to the on-premises VMware HCX connector. Figure 6 – VMware HCX Connector License Key Troubleshooting Step 4: Do not use an IPSec VPN. If possible, avoid using an IPSec VPN connection to Azure VMware Solution when migrations with VMware HCX will happen. Migrating with VMware HCX over VPN has been known to cause issues and multiple failures around migrations. Although utilizing VMware HCX via VPN is supported, it is not the recommended way to migrate virtual machines to Azure VMware Solution. One of the biggest caveats of migrating VMs with VMware HCX over VPN is that a separate uplink network profile is needed on-premises. The management network cannot be used as an uplink profile, as the MTU of the uplink profile needs to be adjusted to 1300 to accommodate the IPSec overhead. Note that VMware HCX uses IPSec VPN natively as part of the VMware HCX Service Mesh. Troubleshooting Step 5: Check MTU size within your Network Profile. Be sure to verify the MTU setting on the Network Profiles setup. Within VMware HCX, navigate to the Interconnect section, select Network Profiles and be sure to verify the correct MTU size is being used for each Profile. Be sure to verify this on both ends of the VMware HCX site pair. 
Figure 7 – VMware HCX MTU size in Network Profile
Use this guide of recommended MTU sizes for the Network Profiles in the table below when connecting to Azure VMware Solution.
Connectivity Method | Management | Uplink | Replication | vMotion
Azure ExpressRoute | 1500 | 1500 | 1500 or 9000 | 1500 or 9000
VMware HCX over IPSec VPN | 1500 | 1300 | 1500 or 9000 | 1500 or 9000
Table 1 – VMware HCX Network Profile MTU Sizes
Troubleshooting Step 6: Always keep your VMware HCX versions updated (Connectors, Cloud Manager and Service Meshes).
Before you upgrade VMware HCX, check the VMware product interoperability matrix to ensure the integrated versions of on-premises VMware solution software are supported by the new version of VMware HCX you are going to upgrade to. Updates to VMware HCX are released regularly by VMware. It is the responsibility of the customer to upgrade and maintain VMware HCX on both sides of the Service Mesh (on-premises and Azure VMware Solution). When updating VMware HCX, the VMware HCX Cloud Managers should be updated first. It is recommended to create a backup of the VMware HCX Connector before updating. Backups of the VMware HCX Connector can be done through the VMware HCX manager UI at https://<hcx_connector_fqdn>:9443 with the admin password created at the time of VMware HCX Connector deployment. Under the Administration section, head to the Backups and restore section. Backups can be taken here and scheduled to be taken as well. Optionally, you can take a vSphere snapshot of the VMware HCX Connector on-premises as well.
Figure 8 – VMware HCX Connector Backup & Restore
Updates for the VMware HCX Cloud Managers can be found in the administration section: select your current version and hit the 'Check for Updates' button. If a new version is available, you will be able to download and update to the newest version. Backups of the VMware HCX Cloud Manager are taken automatically each day.
Figure 9 – VMware HCX Upgrades
It should be noted that VMware HCX Service Meshes are updated independently of the VMware HCX Cloud Managers and Connectors. Upon completion of the VMware HCX Cloud Manager and Connector updates, Service Meshes should be updated next. VMware HCX Cloud Managers and Service Meshes should be upgraded in order and together so as not to cause an issue with mixed-mode versions of Managers and Service Meshes. Running mixed-mode versions of VMware HCX Cloud Managers, Connectors, and Service Meshes in production is highly discouraged. You can lose certain features and it often creates issues within the environment.
Figure 10 – VMware HCX Manager Service Mesh Update
During the Service Mesh update process, if Network Extension appliances are deployed, a temporary loss of connectivity will occur while the appliances update. For Network Extension appliances in an HA pair, downtime is approximately a few seconds. Network Extension appliances not in an HA pair will incur downtime of approximately one minute.
Troubleshooting Step 7: On-Premises Network Connectivity and Firewalls.
For VMware HCX to be activated and receive updates, your on-premises firewalls need to allow outbound traffic to port 443 for the following websites:
- https://connect.hcx.vmware.com
- https://hybridity-depot.vmware.com
- https://hcx.<guid>.<region>.avs.azure.com
Your on-premises firewalls will also need to allow outbound traffic to UDP port 4500.
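A quick way to pre-check outbound HTTPS reachability to these endpoints from a machine on the same network segment is a short script such as the following sketch (standard-library Python only). It only tests TCP port 443, so it complements rather than replaces the curl test described next, and it cannot validate the UDP 4500 path.

```python
"""Outbound reachability check for the HCX activation endpoints (sketch only)."""
import socket

ENDPOINTS = [
    "connect.hcx.vmware.com",
    "hybridity-depot.vmware.com",
    # Add your private cloud endpoint from the Azure portal, e.g.:
    # "hcx.<guid>.<region>.avs.azure.com",
]

for host in ENDPOINTS:
    try:
        # Only proves a TCP connection on 443 can be established (not UDP 4500).
        with socket.create_connection((host, 443), timeout=5):
            print(f"{host}: reachable on 443")
    except OSError as exc:
        print(f"{host}: FAILED ({exc})")
```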
Within VMware HCX, UDP port 4500 serves a specific purpose: it allows IPSec VPN communication between VMware HCX components across environments and is essential for communication and data transfer between environments to work. When configuring VMware HCX, you need to ensure that this port is open between your on-premises VMware HCX Connector uplink network profile and the Azure VMware Solution HCX Cloud Manager uplink network profile.
Another common issue we see within VMware HCX is that your on-premises VMware HCX Connector is unable to reach the VMware HCX activation and entitlement website. A simple way to verify your on-premises environment has access to the activation and entitlement website is as follows. SSH into the on-premises VMware HCX Connector and run the curl commands below to verify connectivity:
curl -k -v https://connect.hcx.vmware.com
curl -k -v https://hybridity-depot.vmware.com
A successful connection to the above websites will look like the figure below.
Figure 11 – VMware HCX Connector SSH CURL connectivity test
Troubleshooting Step 8: Diagnostics page on the Service Mesh.
Built into the VMware HCX Service Mesh there is an option to run a diagnostics check on the Service Mesh appliances. This is an effective way to verify the health of your Service Mesh and pinpoint any specific issues the appliances may have. In the VMware HCX Connect user interface, under the Interconnect section, select the Service Mesh you want to run the diagnostics on. Under the "More" link, select Run Diagnostics to perform a health check on the appliances.
Figure 12 – VMware HCX Service Mesh Run Diagnostics
Once the diagnostics test is completed, if there are any issues, a red banner will appear under the Service Mesh name. You can drill down to the specific issues by clicking on the red alert (!) icon.
Figure 13 – VMware HCX Service Mesh Alert
Troubleshooting Step 9: If you are having issues with the source side interface, reboot the VMware HCX Connector.
VMware HCX Connectors may have issues over time. It is recommended to reboot the VMware HCX Connector if it has been up and running for an extended period without a reboot. On the Azure VMware Solution side, customers have the option to reboot the VMware HCX Cloud Manager within Azure VMware Solution through a Run Command in the Azure portal. The option to Force or Hard Reboot the VMware HCX Cloud Manager is also offered. Please use this with caution as it does not check for any active migrations or replications that may be occurring.
Figure 14 – Azure VMware Solution Run Command Restart-HCXManager
Troubleshooting Step 10: Logging into the VMware HCX Cloud Manager directly.
You have the ability to log into the VMware HCX Cloud Manager directly. At times the VMware HCX plugin through your Azure VMware Solution vSphere Client will not be available or will fail to open. You can obtain the IP address of the VMware HCX Cloud Manager in the Azure portal when you are in the Azure VMware Solution resource. In the Add-ons section under "Migration using VMware HCX", the IP address of the VMware HCX Cloud Manager will be listed. It is part of the /22 network you provided when deploying Azure VMware Solution. Access the manager directly at https://<x.x.x.9>:443 or https://hcx.<guid>.<region>.avs.azure.com. The VMware HCX Cloud Manager will always end with a .9 octet.
Figure 15 – VMware HCX Cloud Manager Login
Troubleshooting Step 11: Network Extensions are for temporary migration phases, not for permanent use.
At its core VMware HCX is a migration tool. When using Network Extensions in VMware HCX, it is important to understand that these Network Extensions should be a temporary solution used during the migration process to migrate VMs into Azure VMware Solution with no downtime during the migration. It is best practice to remove the network extensions as soon as the migration waves are completed. Leaving network extensions in place for extended periods of time can cause issues and outages in your environment. Use Network Extensions with caution. Figure 16 – VMware HCX Network Extension Troubleshooting Step 12: If you have Mobility Optimized Networking (MON) enabled, ensure you have the router location set to the correct side. When configuring MON, verify where the default gateway resides. The default gateway will always be located on the source side of the network extension. Primarily, it will reside in the on-premises data center when connecting to Azure VMware Solution. Figure 17 – VMware HCX Mobility Optimized Network (MON) Troubleshooting Step 13: OS Assisted Migration -Sentinel Gateway Appliances. When using VMware HCX OS Assisted Migration, it is important to maintain and manage the VMware HCX Sentinel Gateway Appliance (SGW) at the source site (On-premises). The Sentinel Gateway Appliance is responsible for establishing a forwarding connection with the VMware HCX Sentinel Data Receiver (SDR) on the destination site. Managing and maintaining the Sentinel Gateway appliance’s resources, CPU and memory configuration, is the responsibility of the customer. Next Steps If this has not resolved the VMware HCX issue in your Azure VMware Solution private cloud, please open a Service Request with Microsoft to continue the resolution process. Summary In this post, we described helpful troubleshooting tips when facing some of the most common VMware HCX service issues our customers have with the Azure VMware Solution. If you are interested in the Azure VMware Solution, please use these resources to learn more about the service: Homepage: Azure VMware Solution Documentation: Azure VMware Solution SLA: SLA for Azure VMware Solution Azure Regions: Azure Products by Region VMware Ports and Protocols for HCX VMware HCX - VMware Ports and Protocols VMware Interoperability Matrix Product Interoperability Matrix (vmware.com) VMware HCX: Configuration & Best Practices Design: Availability Design Considerations Design: Recoverability Design Considerations Design: Performance Design Considerations Design: Security Design Considerations GitHub repository: Azure/azure-vmware-solution Well-Architected Framework: Azure VMware Solution workloads Cloud Adoption Framework: Introduction to the Azure VMware Solution adoption scenario Network connectivity scenarios: Enterprise-scale network topology and connectivity for Azure VMware Solution Enterprise Scale Landing Zone: Enterprise-scale for Microsoft Azure VMware Solution Enterprise Scale GitHub repository: Azure/Enterprise-Scale-for-AVS Azure CLI: Azure Command-Line Interface (CLI) Overview PowerShell module: Az.VMware Module Azure Resource Manager: Microsoft.AVS/privateClouds REST API: Azure VMware Solution REST API Terraform provider: azurerm_vmware_private_cloud Terraform Registry Author Bios Ricky Perez is a Senior Cloud Solution Architect in the international Customer Success Unit (iCSU) at Microsoft. His background is in solution architecture with experience in public cloud and core infrastructure services. 
Jason Trammell is a Senior Software Engineer in the Azure VMware Solution engineering group at Microsoft.

Kenyon Hensler is a Principal Technical Program Manager in the Azure VMware Solution product group at Microsoft. His background is in system engineering with experience across all facets of enterprise networking and compute stacks.

René van den Bedem is a Principal Technical Program Manager in the Azure VMware Solution product group at Microsoft. His background is in enterprise architecture with extensive experience across all facets of the enterprise, public cloud & service provider spaces, including digital transformation and the business, enterprise, and technology architecture stacks. René works backwards from the problem to be solved and designs solutions that deliver business value with the minimum of risk. In addition to being the first quadruple VMware Certified Design Expert (VCDX), he is also a Dell Technologies Certified Master Enterprise Architect, a Nutanix Platform Expert (NPX), and a VMware vExpert.

Azure VMware Solution Performance Design Considerations
Azure VMware Solution Design Series Availability Design Considerations Recoverability Design Considerations Performance Design Considerations Security Design Considerations VMware HCX Design with Azure VMware Solution Overview A global enterprise wants to migrate thousands of VMware vSphere virtual machines (VMs) to Microsoft Azure as part of their application modernization strategy. The first step is to exit their on-premises data centers and rapidly relocate their legacy application VMs to the Azure VMware Solution as a staging area for the first phase of their modernization strategy. What should the Azure VMware Solution look like? Azure VMware Solution is a VMware validated first party Azure service from Microsoft that provides private clouds containing VMware vSphere clusters built from dedicated bare-metal Azure infrastructure. It enables customers to leverage their existing investments in VMware skills and tools, allowing them to focus on developing and running their VMware-based workloads on Azure. In this post, I will introduce the typical customer workload performance requirements, describe the Azure VMware Solution architectural components, and describe the performance design considerations for Azure VMware Solution private clouds. In the next section, I will introduce the typical performance requirements of a customer’s workload. Customer Workload Requirements A typical customer has multiple application tiers that have specific Service Level Agreement (SLA) requirements that need to be met. These SLAs are normally named by a tiering system such as Platinum, Gold, Silver, and Bronze or Mission-Critical, Business-Critical, Production, and Test/Dev. Each SLA will have different availability, recoverability, performance, manageability, and security requirements that need to be met. For the performance design quality, customers will normally have CPU, RAM, Storage and Network requirements. This is normally documented for each application and then aggregated into the total performance requirements for each SLA. For example: SLA Name CPU RAM Storage Network Gold Low vCPU:pCore ratio (<1 to 2), Low VM to Host ratio (1-8) No RAM oversubscription (<=1) High Throughput or High IOPS (for a particular I/O size), Low Latency High Throughput, Low Latency Silver Medium vCPU:pCore ratio (3 to 10), Medium VM to Host ratio (9-15) Medium RAM oversubscription ratio (1.1-1.4) Medium Latency Medium Latency Bronze High vCPU:pCore ratio (10-15), High VM to Host ratio (16+) High RAM oversubscription ratio (1.5-2.5) High Latency High Latency Table 1 – Typical Customer SLA requirements for Performance The performance concepts introduced in Table 1 have the following dimensions: CPU: CPU model and speed (this can be important for legacy single threaded applications), number of cores, vCPU to physical core ratios, CPU Ready times. Memory: Random Access Memory size, Input/Output (I/O) speed and latency, oversubscription ratios. Storage: Capacity, Read/Write Input/Output per Second (IOPS) with Input/Output (I/O) size, Read/Write Throughput, Read/Write Input/Output Latency. Network: In/Out Speed, Network Latency (Round Trip Time). A typical legacy business-critical application will have the following application architecture: Load Balancer layer: Uses load balancers to distribute traffic across multiple web servers in the web layer to improve application availability. Web layer: Uses web servers to process client requests made via the secure Hypertext Transfer Protocol (HTTPS). 
Receives traffic from the load balancer layer and forwards to the application layer. Application layer: Uses application servers to run software that delivers a business application through a communication protocol. Receives traffic from the web layer and uses the database layer to access stored data. Database layer: Uses a relational database management service (RDMS) cluster to store data and provide database services to the application layer. The application can also be classified as OLTP or OLAP, which have the following characteristics: Online Transaction Processing (OLTP) is a type of data processing that consists of executing several transactions occurring concurrently. For example, online banking, retail shopping, or sending text messages. OLTP systems tend to have a performance profile that is latency sensitive, choppy CPU demands, with small amounts of data being read and written. Online Analytical Processing (OLAP) is a technology that organizes large business databases and supports complex analysis. It can be used to perform complex analytical queries without negatively impacting transactional systems (OLTP). For example, data warehouse systems, business performance analysis, or marketing analysis. OLAP systems tend to have a performance profile that is latency tolerant, requires large amounts of storage for records processing, has a steady state of CPU, RAM and storage throughput. Depending upon the performance requirements for each service, infrastructure design could be a mix of technologies used to meet the different performance SLAs with cost efficiency. Figure 1 – Typical Legacy Business-Critical Application Architecture In the next section, I will introduce the architectural components of the Azure VMware Solution. Architectural Components The diagram below describes the architectural components of the Azure VMware Solution. Figure 2 – Azure VMware Solution Architectural Components Each Azure VMware Solution architectural component has the following function: Azure Subscription: Used to provide controlled access, budget, and quota management for the Azure VMware Solution. Azure Region: Physical locations around the world where we group data centers into Availability Zones (AZs) and then group AZs into regions. Azure Resource Group: Container used to place Azure services and resources into logical groups. Azure VMware Solution Private Cloud: Uses VMware software, including vCenter Server, NSX software-defined networking, vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported. Azure VMware Solution Resource Cluster: Uses VMware software, including vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources for customer workloads by scaling out the Azure VMware Solution private cloud. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported. VMware HCX: Provides mobility, migration, and network extension services. VMware Site Recovery: Provides Disaster Recovery automation and storage replication services with VMware vSphere Replication. Third party Disaster Recovery solutions Zerto Disaster Recovery and JetStream Software Disaster Recovery are also supported. Dedicated Microsoft Enterprise Edge (D-MSEE): Router that provides connectivity between Azure cloud and the Azure VMware Solution private cloud instance. 
Azure Virtual Network (VNet): Private network used to connect Azure services and resources together. Azure Route Server: Enables network appliances to exchange dynamic route information with Azure networks. Azure Virtual Network Gateway: Cross-premises gateway for connecting Azure services and resources to other private networks using IPSec VPN, ExpressRoute, and VNet to VNet. Azure ExpressRoute: Provides high-speed private connections between Azure data centers and on-premises or colocation infrastructure. Azure Virtual WAN (vWAN): Aggregates networking, security, and routing functions together into a single unified Wide Area Network (WAN).

In the next section, I will describe the performance design considerations for the Azure VMware Solution.

Performance Design Considerations

The architectural design process takes the business problem to be solved and the business goals to be achieved and distills these into customer requirements, design constraints and assumptions. Design constraints can be characterized by the following three categories: Laws of the Land – data and application sovereignty, governance, regulatory, compliance, etc. Laws of Physics – data and machine gravity, network latency, etc. Laws of Economics – owning versus renting, total cost of ownership (TCO), return on investment (ROI), capital expenditure, operational expenditure, earnings before interest, taxes, depreciation, and amortization (EBITDA), etc. Each design consideration will be a trade-off between availability, recoverability, performance, manageability, and security design qualities. The desired result is to deliver business value with the minimum of risk by working backwards from the customer problem.

Design Consideration 1 – Azure Region: Azure VMware Solution is available in 30 Azure Regions around the world (US Government has 2 additional Azure Regions). Select the relevant Azure Regions that meet your geographic requirements. These locations will typically be driven by your design constraints and the required Azure services that will be dependent upon the Azure VMware Solution. For highest throughput and lowest network latency, the Azure VMware Solution and dependent Azure services such as third-party backup/recovery and Azure NetApp Files volumes should be placed in the same Availability Zone in an Azure Region. Unfortunately, the Azure VMware Solution does not have a Placement Policy Group feature to allow Azure services to be automatically deployed in the same Availability Zone. You can open a ticket with Microsoft to configure a Special Placement Policy to deploy your Azure VMware Solution private cloud to a particular AZ to ensure that your Azure services are placed as closely together as possible. In addition, the proximity of the Azure Region to the remote users and applications consuming the service should also be considered for network latency and throughput.

Figure 3 – Azure VMware Solution Availability Zone Placement for Performance
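As a rough way to quantify the latency dimension of this consideration, the sketch below measures the median TCP connection time from a client location to service endpoints in two candidate regions. The host names are placeholders rather than real services, and a TCP connect time is only a proxy for application-level round-trip time; substitute endpoints you actually operate.

```python
import socket
import statistics
import time

def tcp_connect_ms(host: str, port: int = 443, samples: int = 5) -> float:
    """Median TCP connect time to host:port in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

# Placeholder endpoints representing the same service in two candidate regions.
for endpoint in ("app-eastus2.example.com", "app-westeurope.example.com"):
    print(f"{endpoint}: {tcp_connect_ms(endpoint):.1f} ms median connect time")
```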
Design Consideration 2 – SKU type: Table 2 lists the SKU types that can be selected for provisioning an Azure VMware Solution private cloud. Depending upon the workload performance requirements, the AV36 and AV36P nodes can be used for general purpose compute and the AV52 nodes can be used for compute intensive and storage heavy workloads. The AV36 SKU is widely available in most Azure regions and the AV36P and AV52 SKUs are limited to certain Azure regions. Azure VMware Solution does not support mixing different SKU types within a private cloud (the AV64 SKU is the exception). You can check Azure VMware Solution SKU availability by Azure Region here. The AV64 SKU is currently only available for mixed SKU deployments in certain regions.

Figure 4 – AV64 Mixed SKU Topology

Currently, Azure VMware Solution does not have SKUs that support GPU hardware. The Azure VMware Solution does not natively support Auto-Scale, however you can use this Auto-Scale function instead. For more information, refer to SKU types.

SKU Type | Purpose | CPU (Cores/GHz) | RAM (GB) | vSAN Cache Tier (TB, raw) | vSAN Capacity Tier (TB, raw) | Network Interface Cards
AV36 | General Purpose Compute | Dual Intel Xeon Gold 6140 CPUs (Skylake microarchitecture) with 18 cores/CPU @ 2.3 GHz, Total 36 physical cores (72 logical cores with hyperthreading) | 576 | 3.2 (NVMe) | 15.20 (SSD) | 4x 25 Gb/s NICs (2 for management & control plane, 2 for customer traffic)
AV36P | General Purpose Compute | Dual Intel Xeon Gold 6240 CPUs (Cascade Lake microarchitecture) with 18 cores/CPU @ 2.6 GHz / 3.9 GHz Turbo, Total 36 physical cores (72 logical cores with hyperthreading) | 768 | 1.5 (Intel Cache) | 19.20 (NVMe) | 4x 25 Gb/s NICs (2 for management & control plane, 2 for customer traffic)
AV52 | Compute/Storage heavy workloads | Dual Intel Xeon Platinum 8270 CPUs (Cascade Lake microarchitecture) with 26 cores/CPU @ 2.7 GHz / 4.0 GHz Turbo, Total 52 physical cores (104 logical cores with hyperthreading) | 1,536 | 1.5 (Intel Cache) | 38.40 (NVMe) | 4x 25 Gb/s NICs (2 for management & control plane, 2 for customer traffic)
AV64 | General Purpose Compute | Dual Intel Xeon Platinum 8370C CPUs (Ice Lake microarchitecture) with 32 cores/CPU @ 2.8 GHz / 3.5 GHz Turbo, Total 64 physical cores (128 logical cores with hyperthreading) | 1,024 | 3.84 (NVMe) | 15.36 (NVMe) | 1x 100 Gb/s

Table 2 – Azure VMware Solution SKUs

Design Consideration 3 – Deployment topology: Select the Azure VMware Solution topology that best matches the performance requirements of your SLAs. For very large deployments, it may make sense to have separate private clouds dedicated to each SLA for optimum performance. The Azure VMware Solution supports a maximum of 12 clusters per private cloud. Each cluster supports a minimum of 3 hosts and a maximum of 16 hosts per cluster. Each private cloud supports a maximum of 96 hosts. VMware vCenter Server, VMware HCX Manager, VMware SRM and VMware vSphere Replication Manager are individual appliances that run in Cluster-1. VMware NSX Manager is a cluster of 3 unified appliances that have a VM-VM anti-affinity placement policy to spread them across the hosts of the cluster. The VMware NSX Edge cluster is a pair of appliances that also use a VM-VM anti-affinity placement policy. All northbound customer traffic traverses the NSX Edge cluster. All vSAN storage traffic traverses the VLAN-backed Portgroup of the Management vSphere Distributed Switch, which is part of the management and control plane. The management and control plane cluster (Cluster-1) can be shared with customer workload VMs or be a dedicated cluster for management and control, including customer enterprise services, such as Active Directory, DNS, & DHCP. Additional resource clusters can be added to support customer demand. This also includes the option of using dedicated clusters for each customer SLA.

Topology 1 – Mixed: Run mixed SLA workloads in each cluster of the Azure VMware Solution private cloud.
Figure 5 – Azure VMware Solution Mixed Workloads Topology Topology 2 – Dedicated Clusters: Use separate clusters for each SLA in the Azure VMware Solution private cloud. Figure 6 – Azure VMware Solution Dedicated Clusters Topology Topology 3 – Dedicated Private Clouds: Use dedicated Azure VMware Solution private clouds for each SLA for optimum performance. Figure 7 – Azure VMware Solution Dedicated Private Cloud Instances Topology Design Consideration 4 – Network Connectivity: Azure VMware Solution private clouds can be connected using IPSec VPN and Azure ExpressRoute circuits, including a variety of Azure Virtual Networking topologies such as Hub-Spoke and Azure Virtual WAN with Azure Firewall and third-party Network Virtualization Appliances. Azure Public IP connectivity with NSX is also available. From a performance perspective, Azure ExpressRoute and AVS Interconnect should be used instead of Azure Virtual WAN and IPSec VPN. The following design considerations (5-9) elaborate on network performance design. For more information, refer to the Azure VMware Solution networking and interconnectivity concepts. The Azure VMware Solution Cloud Adoption Framework also has example network scenarios that can be considered. Design Consideration 5 – Azure VNet Connectivity: Use FastPath for connecting an Azure VMware Solution private cloud to an Azure VNet for highest throughput and lowest latency. For maximum performance between Azure VMware Solution and Azure native services, a VNet Gateway with the Ultra performance or ErGw3AZ SKU is needed to enable the Fast Path feature when creating the connection. FastPath is designed to improve the data path performance to your VNet. When enabled, FastPath sends network traffic directly to virtual machines in the VNet, bypassing the gateway, resulting in 10 Gbps or higher throughput. For more information, refer to Azure ExpressRoute FastPath. Figure 8 – Azure VMware Solution connected to VNet Gateway with FastPath Design Consideration 6 – Intra-region Connectivity: Use AVS Interconnect for connecting Azure VMware Solution private clouds together in the same Azure Region for the highest throughput and lowest latency. You can select Azure VMware Solution private clouds from another Azure Subscription or Azure Resource Group, the only constraint is it must be in the same Azure Region. A maximum of 10 private clouds can be connected per private cloud instance. For more information, refer to AVS Interconnect. Figure 9 – Azure VMware Solution with AVS Interconnect Design Consideration 7 – Inter-region/On-Premises Connectivity: Use ExpressRoute Global Reach for connecting Azure VMware Solution private clouds together in different Azure Regions or to on-premises vSphere environments for the highest throughput and lowest latency. For more information, refer to Azure VMware Solution network design considerations. Figure 10 – Azure VMware Solution with ExpressRoute Global Reach Figure 11 – Azure VMware Solution with ExpressRoute Global Reach to On-premises vSphere infrastructure Design Consideration 8 – Host Connectivity: Use NSX Multi-Edge to increase the throughput of north/south traffic from the Azure VMware Solution private cloud. This configuration is available for a management cluster (Cluster-1) with four or more nodes. The additional Edge VMs are added to the Edge Cluster and increase the amount of traffic that can be forwarded through the 25Gbps uplinks across the ESXi hosts. This feature needs to be configured by opening an SR. 
For more information, refer to Azure VMware Solution network design considerations. Figure 12 – Azure VMware Solution Multi-Edge with NSX Design Consideration 9 – Internet Connectivity: Use Public IP on the NSX Edge if high speed internet access direct to the Azure VMware Solution private cloud is needed. This allows you to bring an Azure-owned Public IPv4 address range directly to the NSX Edge for consumption. You should configure this public range on a network virtual appliance (NVA) to secure the private cloud. For more information, refer to Internet Connectivity Design Considerations. Figure 13 – Azure VMware Solution Public IP Address with NSX Design Consideration 10 – VM Optimization: Use VM Hardware tuning, and Resource Pools to provide peak performance for workloads. VMware vSphere Virtual Machine Hardware should be optimized for the required performance: vNUMA optimization for CPU and RAM Shares Reservations & Limits Latency Sensitive setting Paravirtual network & storage adapters Multiple SCSI controllers Spread vDisks across SCSI controllers Resource Pools can be used to apply CPU and RAM QoS policies for each SLA running in a mixed cluster. For more information, refer to Performance Best Practices. Design Consideration 11 – Placement Policies: Placement policies can be used to increase the performance of a service by separating the VMs in an application availability layer across ESXi hosts. This allows you to pin workloads to a particular host for exclusive access to CPU and RAM resources. Placement policies support VM-VM and VM-Host affinity and anti-affinity rules. The vSphere Distributed Resource Scheduler (DRS) is responsible for migrating VMs to enforce the placement policies. For more information, refer to Placement Policies. Figure 14 – Azure VMware Solution Placement Policies Design Consideration 12 – External Datastores: Use a first-party or third-party storage solution to offload lower SLA workloads from VMware vSAN into a separate tier of storage. Azure VMware Solution supports attaching Azure NetApp Files as Network File System (NFS) datastores for offloading virtual machine storage from VMware vSAN. This allows the VMware vSAN datastore to be dedicated to Gold SLA virtual machines. Azure VMware Solution also supports the use of Azure Elastic SAN and Pure Cloud Block Stores as attached iSCSI datastores. For more information, refer to Azure NetApp Files datastores. Figure 15 – Azure VMware Solution External Datastores with Azure NetApp Files Design Consideration 13 – Storage Policies: Table 3 lists the pre-defined VM Storage Policies available for use with VMware vSAN. The appropriate redundant array of independent disks (RAID) and failures to tolerate (FTT) settings per policy need to be considered to match the customer workload SLAs. Each policy has a trade-off between availability, performance, capacity, and cost that needs to be considered. The highest performing VM Storage Policy for enterprise workloads is the RAID-1 policy. To comply with the Azure VMware Solution SLA, you are responsible for using an FTT=2 storage policy when the cluster has 6 or more nodes in a standard cluster. You must also retain a minimum slack space of 25% for backend vSAN operations. For more information, refer to Configure Storage Policy. 
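As a back-of-the-envelope illustration of how the storage policy choice and the 25% slack space affect usable capacity (the policy options themselves are listed in Table 3 below), the following sketch applies a nominal space overhead for each RAID/FTT combination to a cluster's raw vSAN capacity. The overhead factors, host count, and per-host capacity are assumptions for illustration only; the calculation ignores deduplication, compression, and on-disk format overheads.

```python
# Nominal logical-to-raw space multipliers for common vSAN policies (assumed values).
RAID_OVERHEAD = {
    "RAID-1 FTT-1": 2.0,    # two full copies
    "RAID-1 FTT-2": 3.0,    # three full copies
    "RAID-5 FTT-1": 4 / 3,  # 3 data + 1 parity
    "RAID-6 FTT-2": 1.5,    # 4 data + 2 parity
}
SLACK = 0.25  # keep 25% of raw capacity free for backend vSAN operations

def usable_capacity_tb(hosts: int, raw_tb_per_host: float, policy: str) -> float:
    """Approximate usable VM capacity after slack space and policy overhead."""
    raw_tb = hosts * raw_tb_per_host
    return raw_tb * (1 - SLACK) / RAID_OVERHEAD[policy]

# Example: a 6-node cluster with 15.2 TB of raw vSAN capacity per host,
# using RAID-6 FTT-2 (an FTT=2 policy, as required for clusters of 6+ nodes).
print(f"{usable_capacity_tb(6, 15.2, 'RAID-6 FTT-2'):.1f} TB usable (approx.)")
```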
Deployment Type Policy Name RAID Failures to Tolerate (FTT) Site Standard RAID-1 FTT-1 1 1 N/A Standard RAID-1 FTT-2 1 2 N/A Standard RAID-1 FTT-3 1 3 N/A Standard RAID-5 FTT-1 5 1 N/A Standard RAID-6 FTT-2 6 2 N/A Standard VMware Horizon 1 1 N/A Table 3 – VMware vSAN Storage Policies Design Consideration 14 – Mobility: VMware HCX can be tweaked to improve throughput and performance. VMware HCX Manager can be upsized through Run Command. The number of network extension (NE) instances can be increased to allow Portgroups to be distributed over instances to increase layer 2 extension (L2E) performance. You can also establish a dedicated Mobility Cluster, accompanied by a dedicated Service Mesh for each distinct workload cluster, thereby increasing mobility performance. The Azure VMware Solution supports a maximum of 10 service meshes per private cloud, this is due to the allocation of the /22 management IP schema. Application Path Resiliency & TCP Flow Conditioning are also options that can be enabled to improve mobility performance. TCP Flow Conditioning dynamically optimizes the segment size for traffic traversing the Network Extension path. Application Path Resiliency technology creates multiple Foo-Over-UDP (FOU) tunnels between the source and destination Uplink IP pair for improved performance, resiliency, and path diversity. For more information, refer to VMware HCX Best Practices. Figure 16 – VMware HCX with Dedicated Mobility Cluster Design Consideration 15 – Anti-Patterns: Try to avoid using these anti-patterns in your performance design. Anti-Pattern 1 – Stretched Clusters: Azure VMware Solution Stretched Clusters should primarily be used to meet a Multi-AZ or Recovery Point Objective of zero requirement. If stretched clusters are used, there will be a write throughput and write latency impact for all synchronous writes using the site mirroring storage policy. For more information, refer to Stretched Clusters. Figure 17 – Azure VMware Solution Private Cloud with Stretched Clusters In the following section, I will describe the next steps that need to be made to progress this high-level design estimate towards a validated detailed design. Next Steps The Azure VMware Solution sizing estimate should be assessed using Azure Migrate. With large enterprise solutions for strategic and major customers, an Azure VMware Solution Solutions Architect from Azure, VMware, or a trusted VMware Partner should be engaged to ensure the solution is correctly sized to deliver business value with the minimum of risk. This should also include an application dependency assessment to understand the mapping between application groups and identify areas of data gravity, application network traffic flows, and network latency dependencies. Summary In this post, we took a closer look at the typical performance requirements of a customer workload, the architectural building blocks, and the performance design considerations for the Azure VMware Solution. We also discussed the next steps to continue an Azure VMware Solution design. 
If you are interested in the Azure VMware Solution, please use these resources to learn more about the service: Homepage: Azure VMware Solution Documentation: Azure VMware Solution SLA: SLA for Azure VMware Solution Azure Regions: Azure Products by Region Service Limits: Azure VMware Solution subscription limits and quotas SKU types: Introduction Storage policies: Configure storage policy VMware HCX: Configuration & Best Practices GitHub repository: Azure/azure-vmware-solution Well-Architected Framework: Azure VMware Solution workloads Cloud Adoption Framework: Introduction to the Azure VMware Solution adoption scenario Network connectivity scenarios: Enterprise-scale network topology and connectivity for Azure VMware Solution Enterprise Scale Landing Zone: Enterprise-scale for Microsoft Azure VMware Solution Enterprise Scale GitHub repository: Azure/Enterprise-Scale-for-AVS Azure CLI: Azure Command-Line Interface (CLI) Overview PowerShell module: Az.VMware Module Azure Resource Manager: Microsoft.AVS/privateClouds REST API: Azure VMware Solution REST API Terraform provider: azurerm_vmware_private_cloud Terraform Registry

Author Bio

René van den Bedem is a Principal Technical Program Manager in the Azure VMware Solution product group at Microsoft. His background is in enterprise architecture with extensive experience across all facets of the enterprise, public cloud, and service provider spaces, including digital transformation and the business, enterprise, and technology architecture stacks. René works backwards from the problem to be solved and designs solutions that deliver business value with the minimum of risk. In addition to being the first quadruple VMware Certified Design Expert (VCDX), he is also a Dell Technologies Certified Master Enterprise Architect, a Nutanix Platform Expert (NPX), and a VMware vExpert.

Link to PPTX Diagrams: azure-vmware-solution/azure-vmware-master-diagrams

Azure VMware Solution Recoverability Design Considerations
Azure VMware Solution Design Series Availability Design Considerations Recoverability Design Considerations Performance Design Considerations Security Design Considerations VMware HCX Design with Azure VMware Solution Overview A global enterprise wants to migrate thousands of VMware vSphere virtual machines (VMs) to Microsoft Azure as part of their application modernization strategy. The first step is to exit their on-premises data centers and rapidly relocate their legacy application VMs to the Azure VMware Solution as a staging area for the first phase of their modernization strategy. What should the Azure VMware Solution look like? Azure VMware Solution is a VMware validated first party Azure service from Microsoft that provides private clouds containing VMware vSphere clusters built from dedicated bare-metal Azure infrastructure. It enables customers to leverage their existing investments in VMware skills and tools, allowing them to focus on developing and running their VMware-based workloads on Azure. In this post, I will introduce the typical customer workload recoverability requirements, describe the Azure VMware Solution architectural components, and describe the recoverability design considerations for Azure VMware Solution private clouds. In the next section, I will introduce the typical recoverability requirements of a customer’s workload. Customer Workload Requirements A typical customer has multiple application tiers that have specific Service Level Agreement (SLA) requirements that need to be met. These SLAs are normally named by a tiering system such as Platinum, Gold, Silver, and Bronze or Mission-Critical, Business-Critical, Production, and Test/Dev. Each SLA will have different availability, recoverability, performance, manageability, and security requirements that need to be met. For the recoverability design quality, customers will normally have an uptime percentage requirement with a recovery point objective (RPO), recovery time objective (RTO), work recovery time (WRT), maximum tolerable downtime (MTD) and a Disaster Recovery Site requirement that defines each SLA level. This is normally documented in the customer’s Business Continuity Plan (BCP). For example: SLA Name Uptime RPO RTO WRT MTD DR Site Gold 99.999% (5.26 min downtime/year) 5 min 3 min 2 min 5 min Yes Silver 99.99% (52.6 min downtime/year) 1 hour 20 min 10 min 30 min Yes Bronze 99.9% (8.76 hrs downtime/year) 4 hours 6 hours 2 hours 8 hours No Table 1 – Typical Customer SLA requirements for Recoverability The recoverability concepts introduced in Table 1 have the following definitions: Recovery Point Objective (RPO): Defines the maximum age of the restored data after a failure. Recovery Time Objective (RTO): Defines the maximum time to restore the service. Work Recovery Time (WRT): Defines how long it takes for the recovered service to be brought online and begin serving customers again. Maximum Tolerable Downtime (MTD): Sum of the RTO and WRT, which is the total time required to recover from a disaster and start serving the business again from the Disaster Recovery Site. This value needs to fit within the downtime value of the SLA for each year. Figure 1 – Recoverability Concepts A typical legacy business-critical application will have the following application architecture: Load Balancer layer: Uses load balancers to distribute traffic across multiple web servers in the web layer to improve application availability. 
Web layer: Uses web servers to process client requests made via the secure Hypertext Transfer Protocol (HTTPS). Receives traffic from the load balancer layer and forwards to the application layer. Application layer: Uses application servers to run software that delivers a business application through a communication protocol. Receives traffic from the web layer and uses the database layer to access stored data. Database layer: Uses a relational database management service (RDMS) cluster to store data and provide database services to the application layer. Depending upon the recoverability requirements for each service, the disaster recovery protection mechanisms could be a mix of manual runbooks and disaster recovery automation solutions with replication and clustering mechanisms connected to many different regions to meet the customer SLAs. Figure 2 – Typical Legacy Business-Critical Application Architecture In the next section, I will introduce the architectural components of the Azure VMware Solution. Architectural Components The diagram below describes the architectural components of the Azure VMware Solution. Figure 3 – Azure VMware Solution Architectural Components Each Azure VMware Solution architectural component has the following function: Azure Subscription: Used to provide controlled access, budget, and quota management for the Azure VMware Solution. Azure Region: Physical locations around the world where we group data centers into Availability Zones (AZs) and then group AZs into regions. Azure Resource Group: Container used to place Azure services and resources into logical groups. Azure VMware Solution Private Cloud: Uses VMware software, including vCenter Server, NSX software-defined networking, vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported. Azure VMware Solution Resource Cluster: Uses VMware software, including vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources for customer workloads by scaling out the Azure VMware Solution private cloud. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported. VMware HCX: Provides mobility, migration, and network extension services. VMware Site Recovery: Provides Disaster Recovery automation and storage replication services with VMware vSphere Replication. Third party Disaster Recovery solutions Zerto Disaster Recovery and JetStream Software Disaster Recovery are also supported. Dedicated Microsoft Enterprise Edge (D-MSEE): Router that provides connectivity between Azure cloud and the Azure VMware Solution private cloud instance. Azure Virtual Network (VNet): Private network used to connect Azure services and resources together. Azure Route Server: Enables network appliances to exchange dynamic route information with Azure networks. Azure Virtual Network Gateway: Cross premises gateway for connecting Azure services and resources to other private networks using IPSec VPN, ExpressRoute, and VNet to VNet. Azure ExpressRoute: Provides high-speed private connections between Azure data centers and on-premises or colocation infrastructure. Azure Virtual WAN (vWAN): Aggregates networking, security, and routing functions together into a single unified Wide Area Network (WAN). In the next section, I will describe the recoverability design considerations for the Azure VMware Solution. 
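Before moving on, a short sketch can make the SLA arithmetic from Table 1 concrete: the annual downtime budget implied by an uptime percentage, and whether a tier's MTD (RTO + WRT) fits within it. The values below are the Gold tier figures from Table 1; treat the helper as an illustrative check rather than a sizing tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(uptime_pct: float) -> float:
    """Annual downtime allowed by an uptime percentage SLA."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

def mtd_fits_budget(uptime_pct: float, rto_min: float, wrt_min: float) -> bool:
    """MTD = RTO + WRT; check that one disaster event fits the annual budget."""
    return (rto_min + wrt_min) <= downtime_budget_minutes(uptime_pct)

# Gold tier from Table 1: 99.999% uptime, RTO of 3 minutes, WRT of 2 minutes.
print(f"{downtime_budget_minutes(99.999):.2f} min/year")  # ~5.26 minutes
print(mtd_fits_budget(99.999, rto_min=3, wrt_min=2))      # True: 5 min MTD fits
```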
Recoverability Design Considerations The architectural design process takes the business problem to be solved and the business goals to be achieved and distills these into customer requirements, design constraints and assumptions. Design constraints can be characterized by the following three categories: Laws of the Land – data and application sovereignty, governance, regulatory, compliance, etc. Laws of Physics – data and machine gravity, network latency, etc. Laws of Economics – owning versus renting, total cost of ownership (TCO), return on investment (ROI), capital expenditure, operational expenditure, earnings before interest, taxes, depreciation, and amortization (EBITDA), etc. Each design consideration will be a trade-off between the availability, recoverability, performance, manageability, and security design qualities. The desired result is to deliver business value with the minimum of risk by working backwards from the customer problem. Design Consideration 1 – Azure Region: Azure VMware Solution is available in 30 Azure Regions around the world (US Government has 2 additional Azure Regions). Select the relevant Azure Regions that meet your geographic requirements. These locations will typically be driven by your design constraints and the required distance the Disaster Recovery Site needs to be from the Primary Site. The Primary Site can be located on-premises, in a co-location or in the public cloud. Figure 4 – Azure VMware Solution Region for Disaster Recovery Design Consideration 2 – Deployment topology: Select the Azure VMware Solution Disaster Recovery Pod topology that best matches the uptime and geographic requirements of your SLAs. For very large deployments, it may make sense to have separate Disaster Recovery Pods (private clouds) dedicated to each SLA for cost efficiency. The management and control plane cluster (Cluster-1) can be shared with customer workload VMs or be a dedicated cluster for management and control, including customer enterprise services, such as Active Directory, DNS, & DHCP. Additional resource clusters can be added to support customer workload demand. This also includes the option of using separate clusters for each customer SLA. The best practice for Disaster Recovery design is to follow a pod architecture where each protected site has a matching private cloud in the Disaster Recovery Azure Region. Complex mesh topologies should be avoided for operational simplicity. The required workload Service Level Agreement values must be mapped to the appropriate Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) and use a naming convention that is easy to understand. For example, Gold, Silver and Bronze or Tier-1, Tier-2 and Tier-3. Each pod should be designated with an SLA capability for operational simplicity. On a smaller scale, the pod concept could be per cluster instead of per private cloud. The Disaster Recovery pods are provisioned to support the necessary replicated storage capacity during steady state. When a disaster is declared, the necessary compute resources will be added to the private cloud. This can be configured automatically using this Auto-Scale function with Azure Automation Accounts and PowerShell Runbooks. Figure 5 – Azure VMware Solution DR Shared Services Figure 6 – Azure VMware Solution Dedicated DR Pods Design Consideration 3 – Disaster Recovery Solution: The Azure VMware Solution supports the following first-party and third-party Disaster Recovery solutions. 
Depending upon your recoverability and cost efficiency requirements, the best solution can be selected from Table 2 below. For cost efficiency, a best effort RPO and RTO can be met using backup replication of daily snapshots to the Disaster Recovery Site or using the Disaster Recovery replication feature of VMware HCX (Solution 4). If these solutions are not viable, you can also consider application, database or message bus clustering as an option. Solution RPO RTO DR Automation 1. VMware Site Recovery 5min – 24hr Minutes Yes, with Protection Groups & Recovery Plans 2. Zerto DR Seconds Minutes Yes, with Virtual Protection Groups (VPGs) 3. JetStream Software DR Seconds Minutes Yes, with Protection Domains, Runbooks & Runbook Groups 4. VMware HCX 5min – 24hr Hours No, manual process only Table 2 – Disaster Recovery Vendor Products Note: Azure Site Recovery can be used to protect Azure VMware Solution but is not listed here since we are describing how to use Azure VMware Solution to protect on-premises VMware vSphere solutions. Solution 1 – VMware Site Recovery supports Disaster Recovery automation with an RPO of 5 minutes to 24 hours with VMware SRM Virtual Appliance, VMware vSphere Replication and VMware vSAN. Currently, using VMware Site Recovery with Azure NetApp Files is not supported. When designing a solution with VMware Site Recovery, these Azure VMware Solution limits should be considered. Figure 7 – Azure VMware Solution with VMware Site Recovery Manager Solution 2 – Zerto Disaster Recovery supports Disaster Recovery automation with an RPO of seconds using continuous replication with the Zerto Virtual Manager (ZVM), Zerto Virtual Replication Appliance (ZVRA) and VMware vSAN. When designing a solution with Zerto Disaster Recovery, this Zerto Architecture Guide should be considered. Figure 8 – Azure VMware Solution with Zerto Disaster Recovery Solution 3 – JetStream Software Disaster Recovery supports Disaster Recovery automation with an RPO of seconds using continuous replication with the JetStream Manager Virtual Appliance (MSA), JetStream DR Virtual Appliance (DRVA) and VMware vSAN. When designing a solution with JetStream Software Disaster Recovery, these JetStream Software resources should be considered. Figure 9 – Azure VMware Solution with JetStream Software Disaster Recovery Solution 4 – VMware HCX Disaster Recovery supports manual Disaster Recovery with an RPO of 5 minutes to 24 hours with VMware HCX Manager, VMware vSphere Replication and VMware vSAN. When designing a solution with VMware HCX, these Azure VMware Solution limits should be considered. Figure 10 – Azure VMware Solution with VMware HCX Disaster Recovery Design Consideration 5 – SKU type: Three SKU types can be selected for provisioning an Azure VMware Solution private cloud. The smaller AV36 SKU can be used at the Disaster Recovery Site to build a pilot light cluster with the minimum storage resources for cost efficiency while the Primary Site can use the larger and more expensive AV36P and AV52 SKUs. The AV36 SKU is widely available in most Azure regions and the AV36P and AV52 SKUs are limited to certain Azure regions. Azure VMware Solution does not support mixing different SKU types within a private cloud (AV64 SKU is the exception). You can check Azure VMware Solution SKU availability by Azure Region here. The AV64 SKU is currently only available for mixed SKU deployments in certain regions. 
Figure 11 – AV64 Mixed SKU Topology Design Consideration 6 – Runbook Application Groups: After the application dependency assessment is complete, this data will be used to create the runbook application groups to ensure that the application SLAs are met during a disaster event. If the application dependency assessment is incomplete, the runbook application groups can be initially designed using the process knowledge from your application architecture team and IT operations. The idea is to ensure each application is captured in a runbook that allows the application to be recovered completely and consistently using the runbook architecture and order of operations. Figure 12 – VMware Site Recovery Application Recovery Plans Design Consideration 7– Storage Policies: Table 3 lists the pre-defined VM Storage Policies available for use with VMware vSAN. The appropriate redundant array of independent disks (RAID) and failures to tolerate (FTT) settings per policy need to be considered to match the customer workload SLAs. Each policy has a trade-off between availability, performance, capacity, and cost that needs to be considered. To comply with the Azure VMware Solution SLA, you are responsible for using an FTT=2 storage policy when the cluster has 6 or more nodes in a standard cluster. You must also retain a minimum slack space of 25% for backend vSAN operations. Deployment Type Policy Name RAID Failures to Tolerate (FTT) Site Standard RAID-1 FTT-1 1 1 N/A Standard RAID-1 FTT-2 1 2 N/A Standard RAID-1 FTT-3 1 3 N/A Standard RAID-5 FTT-1 5 1 N/A Standard RAID-6 FTT-2 6 2 N/A Standard VMware Horizon 1 1 N/A Table 3 – VMware vSAN Storage Policies Design Consideration 8 – Network Connectivity: Azure VMware Solution private clouds can be connected using IPSec VPN and Azure ExpressRoute circuits, including a variety of Azure Virtual Networking topologies such as Hub-Spoke and Virtual WAN with Azure Firewall and third-party Network Virtualization Appliances. For more information, refer to the Azure VMware Solution networking and interconnectivity concepts. The Azure VMware Solution Cloud Adoption Framework also has example network scenarios that can be considered. Design Consideration 9 – Layer 2 Network Extension: VMware HCX can be used to provide Layer 2 network extension functionality to maintain the same IP address schema between sites. Figure 13 – VMware HCX Layer 2 Network Extension with VMware Site Recovery Design Consideration 10 – Anti-Patterns: Try to avoid using these anti-patterns in your recoverability design. Anti-Pattern 1 – Stretched Clusters: Azure VMware Solution Stretched Clusters is the only option for meeting an RPO of 0 requirement. Remember that stretched clusters are considered an availability solution, not disaster recovery, because it is a single fault domain for the management and control plane running in dual Availability Zones (AZs). Azure VMware Solution stretched clusters (GA) currently does not support the VMware Site Recovery add-on. Figure 14 – Azure VMware Solution Private Cloud with Stretched Clusters Anti-Pattern 2 – Ransomware Protection: A Disaster Recovery Automation solution does not provide protection against a ransomware attack. Ransomware protection requires additional security functionality where an isolated and secure area is used to filter through a series of data restores to validate the point in time copy is free from ransomware. This process can take months and it is necessary to access data backups that may be months or years old. 
This is because the ransomware demand for money is merely the end of a long period of reconnaissance by an attacker and every system needs to be checked for active security vulnerabilities and spyware agents. Disaster Recovery Automation assumes that ransomware is not present, and that data corruption has not replicated to the Disaster Recovery Site. That said, some Disaster Recovery Automation vendors now have a Ransomware Protection feature that can be leveraged as part of the solution. In the following section, I will describe the next steps that would need to be made to progress this high-level design estimate towards a validated detailed design. Next Steps The Azure VMware Solution sizing estimate should be assessed using Azure Migrate. With large enterprise solutions for strategic and major customers, an Azure VMware Solution Solutions Architect from Azure, VMware, or a trusted VMware Partner should be engaged to ensure the solution is correctly sized to deliver business value with the minimum of risk. This should also include an application dependency assessment to understand the mapping between application groups and identify areas of data gravity, application network traffic flows, and network latency dependencies. Summary In this post, we took a closer look at the typical recoverability requirements of a customer workload, the architectural building blocks, and the recoverability design considerations for the Azure VMware Solution. We also discussed the next steps to continue an Azure VMware Solution design. If you are interested in the Azure VMware Solution, please use these resources to learn more about the service: Homepage: Azure VMware Solution Documentation: Azure VMware Solution SLA: SLA for Azure VMware Solution Azure Regions: Azure Products by Region Service Limits: Azure VMware Solution subscription limits and quotas VMware Site Recovery: Deploy disaster recovery with VMware Site Recovery Manager Zerto DR: Deploy Zerto disaster recovery on Azure VMware Solution Zerto DR: Architecture Guide JetStream Software DR: Deploy disaster recovery using JetStream DR VMware HCX DR: Deploy disaster recovery using VMware HCX Stretched Clusters (Public Preview): Deploy vSAN stretched clusters SKU types: Introduction Storage policies: Configure storage policy GitHub repository: Azure/azure-vmware-solution Well-Architected Framework: Azure VMware Solution workloads Cloud Adoption Framework: Introduction to the Azure VMware Solution adoption scenario Network connectivity scenarios: Enterprise-scale network topology and connectivity for Azure VMware Solution Enterprise Scale Landing Zone: Enterprise-scale for Microsoft Azure VMware Solution Enterprise Scale GitHub repository: Azure/Enterprise-Scale-for-AVS Azure CLI: Azure Command-Line Interface (CLI) Overview PowerShell module: Az.VMware Module Azure Resource Manager: Microsoft.AVS/privateClouds REST API: Azure VMware Solution REST API Terraform provider: azurerm_vmware_private_cloud Terraform Registry Author Bio René van den Bedem is a Principal Technical Program Manager in the Azure VMware Solution product group at Microsoft. His background is in enterprise architecture with extensive experience across all facets of the enterprise, public cloud, and service provider spaces, including digital transformation and the business, enterprise, and technology architecture stacks. René works backwards from the problem to be solved and designs solutions that deliver business value with the minimum of risk. 
In addition to being the first quadruple VMware Certified Design Expert (VCDX), he is also a Dell Technologies Certified Master Enterprise Architect, a Nutanix Platform Expert (NPX), and a VMware vExpert. Link to PPTX Diagrams: azure-vmware-solution/azure-vmware-master-diagrams