Compute Services: Virtual Machines, Auto Scaling, and Load Balancers
📅 Published: Feb 2026
⏱️ Estimated Reading Time: 18 minutes
🏷️ Tags: Cloud Computing, EC2, Virtual Machines, Auto Scaling, Load Balancing, AWS, Azure, GCP
Introduction: What Are Compute Services?
Compute services are the backbone of cloud computing. They provide the processing power to run your applications—the brains of your infrastructure. Just as a building needs a foundation before you can add walls and a roof, your cloud architecture needs compute before you can add databases, storage, and other services.
When you think about running an application in the cloud, you are fundamentally asking: where will this code execute? The answer is a compute service.
Cloud providers offer several types of compute services:
Virtual Machines — Traditional servers in the cloud
Containers — Lightweight, portable application packages
Serverless — Run code without managing servers
Platform as a Service — Deploy code, let the platform handle everything
This guide focuses on virtual machines, auto scaling, and load balancers—the foundational building blocks of cloud infrastructure.
Virtual Machines: The Foundation
What is a Virtual Machine?
A virtual machine (VM) is a software-based emulation of a physical computer. It runs an operating system and applications just like a physical server, but it shares physical hardware with other virtual machines.
Think of a physical server as an apartment building. Each virtual machine is a separate apartment. The apartments share the building's structure, electricity, and plumbing, but each has its own walls, doors, and locks. What happens in one apartment does not affect the others.
How Virtualization Works
A hypervisor is the software that creates and manages virtual machines. It sits between the physical hardware and the virtual machines, allocating CPU time, memory, storage, and network resources to each VM.
Each VM believes it has its own dedicated hardware. In reality, the hypervisor is sharing physical resources among many VMs.
Virtual Machine Naming Across Providers
| Provider | Service Name | Console Name |
|---|---|---|
| AWS | Amazon EC2 | EC2 Instances |
| Azure | Azure Virtual Machines | VMs |
| Google Cloud | Compute Engine | VM Instances |
Despite different names, the concepts are identical across providers.
EC2 Instance Types (AWS)
AWS organizes EC2 instances into families optimized for different workloads:
| Family | Purpose | Example Types |
|---|---|---|
| General Purpose | Balanced CPU, memory, network | t3, t4g, m5, m6i |
| Compute Optimized | CPU-intensive workloads | c5, c6i |
| Memory Optimized | Memory-intensive workloads | r5, r6i, x1e |
| Storage Optimized | High, sequential disk I/O | i3, d2 |
| Accelerated Computing | GPU, FPGA workloads | p3, p4, g4dn |
Selecting the right size:
Instance names follow a pattern: family, generation, and size, joined as family-and-generation, a period, then the size.
Example: t3.medium
t = family (burstable general purpose)
3 = generation
medium = size
Sizes scale from nano (smallest) to 48xlarge (largest).
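The naming convention above is regular enough to parse programmatically. Here is a small sketch of a helper that splits a type name into its parts; the function name and the returned fields are our own, not part of any AWS SDK:

```python
import re

def parse_instance_type(name: str) -> dict:
    """Split an EC2-style instance type such as 't3.medium' or 'm6i.2xlarge'
    into family, generation, optional attribute suffix, and size."""
    match = re.fullmatch(r"([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)", name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name}")
    family, generation, suffix, size = match.groups()
    return {
        "family": family,               # e.g. 't' = burstable general purpose
        "generation": int(generation),  # e.g. 3
        "suffix": suffix,               # e.g. 'g' (Graviton) or 'i' (Intel)
        "size": size,                   # e.g. 'medium', '2xlarge'
    }

print(parse_instance_type("t3.medium"))
print(parse_instance_type("m6i.2xlarge"))
```

A parser like this is handy in infrastructure scripts that need to group or filter instances by family.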
Azure VM Series
Azure uses a similar family-based naming system:
| Series | Purpose |
|---|---|
| B-series | Burstable, cost-effective for development |
| D-series | General purpose |
| E-series | Memory optimized |
| F-series | Compute optimized |
| G-series | Very large, memory optimized |
| L-series | Storage optimized |
| NC/ND/NV-series | GPU workloads |
Google Cloud Machine Types
GCP categorizes machines by family:
| Family | Purpose |
|---|---|
| N1, N2, N2D | General purpose |
| C2, C2D | Compute optimized |
| M1, M2, M3 | Memory optimized |
| A2 | GPU workloads |
| E2 | Cost-optimized general purpose |
Choosing the Right Virtual Machine
The right VM depends on your workload characteristics:
Web servers — General purpose families work well. The load is typically CPU and memory balanced.
Batch processing — Compute optimized families are appropriate. These workloads need CPU power but can tolerate lower memory ratios.
In-memory databases (Redis, Memcached) — Memory optimized families are essential. These services need large amounts of RAM relative to CPU.
Relational databases and data warehouses (SQL Server, Oracle) — Storage optimized or memory optimized depending on whether I/O or memory is the bottleneck.
Machine learning training — GPU-accelerated families are necessary. Training models requires specialized hardware.
Development and testing — Burstable families like t3 or B-series are cost-effective for workloads that are idle most of the time.
Pricing Models
On-Demand — Pay by the hour or second with no long-term commitment. This is the most flexible but most expensive model for continuous workloads.
Reserved Instances — Commit to using a VM for 1 or 3 years in exchange for significant discounts (up to 70%). This is best for predictable, steady-state workloads.
Spot Instances (AWS) / Spot VMs (GCP, formerly Preemptible VMs) / Spot VMs (Azure, formerly Low Priority VMs) — Access spare capacity at deep discounts (up to 90%) but the cloud provider can reclaim the instance with short notice. This is best for fault-tolerant, stateless, or batch workloads that can be interrupted.
Savings Plans — Flexible pricing model that offers discounts in exchange for committing to a consistent amount of compute usage.
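To make the trade-offs concrete, here is a back-of-the-envelope monthly cost comparison. The hourly rate and discount percentages below are illustrative assumptions, not real prices, which vary by region, instance type, and commitment term:

```python
ON_DEMAND_RATE = 0.10     # hypothetical on-demand $/hour
RESERVED_DISCOUNT = 0.60  # assume a 60% discount for a long-term reservation
SPOT_DISCOUNT = 0.90      # assume a 90% discount off on-demand

HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(rate: float, hours: int = HOURS_PER_MONTH) -> float:
    """Cost of running one instance continuously for a month."""
    return round(rate * hours, 2)

on_demand = monthly_cost(ON_DEMAND_RATE)
reserved = monthly_cost(ON_DEMAND_RATE * (1 - RESERVED_DISCOUNT))
spot = monthly_cost(ON_DEMAND_RATE * (1 - SPOT_DISCOUNT))

print(f"on-demand: ${on_demand}, reserved: ${reserved}, spot: ${spot}")
```

Even with made-up numbers, the shape of the result is the point: a continuously running workload is dramatically cheaper on reservations, and interruptible work is cheaper still on spot.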
Auto Scaling: Handling Variable Demand
Why Auto Scaling Matters
In traditional data centers, you had to predict capacity years in advance. If you guessed too low, your application failed under load. If you guessed too high, you wasted money on idle servers.
Auto scaling solves this problem. It automatically adjusts the number of running instances based on demand.
The Three Components of Auto Scaling
1. Launch Configuration or Template
This defines what to run: the AMI or machine image, instance type, security groups, and user data scripts. It is the blueprint for each instance.
2. Scaling Policies
These define when to scale: add instances when CPU exceeds 70 percent for five minutes, remove instances when CPU drops below 30 percent for ten minutes. Scaling policies can be based on metrics, schedules, or even custom application signals.
3. Scaling Group
This defines where to run: the VPC, subnets, and minimum, maximum, and desired instance counts. The scaling group ensures the right number of instances are running at all times.
How Auto Scaling Works
When you configure auto scaling, you define:
Minimum capacity: The smallest number of instances to keep running. For production, this is often at least 2 for high availability.
Maximum capacity: The largest number of instances allowed. This prevents runaway scaling from bankrupting you.
Desired capacity: The target number of instances. Auto scaling works to maintain this number.
Scaling policies: Rules that adjust desired capacity up or down based on metrics.
When a scaling policy triggers, the auto scaling group launches new instances using the launch template. When instances are no longer needed, they are terminated. The process is fully automatic.
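The core reconciliation step can be sketched in a few lines. This is a simplified model of the bookkeeping, not provider code: clamp the desired count to the min/max bounds, then compute how many instances to launch or terminate:

```python
def reconcile(running: int, desired: int, minimum: int, maximum: int) -> int:
    """Return the number of instances to launch (positive) or
    terminate (negative) to bring the group to its desired capacity,
    clamped to the configured minimum and maximum."""
    target = max(minimum, min(desired, maximum))
    return target - running

# A policy asks for 5 instances, but the group is capped at 4.
print(reconcile(running=2, desired=5, minimum=2, maximum=4))   # launch 2

# A scale-in policy asks for 1, but the floor is 2 for availability.
print(reconcile(running=3, desired=1, minimum=2, maximum=10))  # terminate 1
```

The maximum bound is what prevents runaway scaling, and the minimum bound is what preserves high availability during scale-in.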
Scaling Policies
Simple Scaling — Respond to a single alarm by adding or removing a fixed number of instances. This is the simplest but least flexible approach.
Step Scaling — Different alarm thresholds trigger different scaling actions. For example, CPU at 50 percent adds 1 instance, CPU at 80 percent adds 5 instances.
Target Tracking — You specify a target metric value, and auto scaling maintains that target. For example, keep average CPU at 40 percent. This is the most automated and recommended approach.
Scheduled Scaling — Scale based on predictable patterns. For example, add instances at 8 AM Monday through Friday, remove them at 6 PM.
Predictive Scaling — Machine learning models predict future demand and scale proactively. This is the most advanced option.
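Target tracking can be approximated with a proportional rule: scale capacity in proportion to how far the observed metric is from the target. The real implementation relies on metric alarms, cooldowns, and smoothing, so treat this as a mental model rather than the actual algorithm:

```python
import math

def target_tracking(current_capacity: int, observed_cpu: float,
                    target_cpu: float) -> int:
    """Approximate new capacity so that average CPU lands near the target.
    Rounds up, since under-provisioning is worse than over-provisioning."""
    return math.ceil(current_capacity * observed_cpu / target_cpu)

# 4 instances averaging 80% CPU against a 40% target: roughly double the fleet.
print(target_tracking(4, observed_cpu=80.0, target_cpu=40.0))

# 4 instances averaging 30% CPU against a 40% target: shrink to 3.
print(target_tracking(4, observed_cpu=30.0, target_cpu=40.0))
```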
Auto Scaling Across Providers
| Provider | Service Name | Key Features |
|---|---|---|
| AWS | Auto Scaling Groups | Target tracking, predictive scaling, instance refresh |
| Azure | Virtual Machine Scale Sets | Automatic OS upgrades, zone balancing |
| Google Cloud | Managed Instance Groups | Stateful and stateless instances, rolling updates |
Best Practices for Auto Scaling
Use target tracking scaling policies. They are simpler to configure and maintain than step or simple scaling.
Set appropriate cooldown periods. Cooldown periods prevent scaling actions from overlapping. Too short and you may scale unnecessarily. Too long and you may not respond quickly enough.
Test scaling behavior. Scale up manually during low traffic to ensure new instances join the load balancer correctly. Scale down to ensure connections drain properly.
Monitor scaling activities. CloudWatch logs or equivalent should alert you if scaling is happening too frequently, which indicates your thresholds are too sensitive.
Use lifecycle hooks. Run custom scripts when instances launch or terminate. This is useful for installing software, draining connections before termination, or sending notifications.
Load Balancers: Distributing Traffic
What is a Load Balancer?
A load balancer distributes incoming traffic across multiple targets—virtual machines, containers, or serverless functions. It acts as the front door to your application.
Think of a load balancer as a receptionist at a busy office. When visitors arrive, the receptionist directs them to an available employee. If one employee becomes overwhelmed, the receptionist sends the next visitor to someone else.
Why Load Balancers Are Essential
Load balancers provide three critical benefits:
1. High Availability
If one server fails, the load balancer stops sending traffic to it. Your application continues running even when individual components fail.
2. Scalability
As traffic grows, you add more servers behind the load balancer. The load balancer distributes traffic across all available servers. You can scale without changing how users access your application.
3. SSL Termination
Load balancers can handle SSL certificates, decrypting HTTPS traffic before sending it to your servers. This offloads CPU-intensive encryption work from your application servers.
Types of Load Balancers
Application Load Balancer (Layer 7)
Operates at the application layer. It understands HTTP and HTTPS traffic and can make routing decisions based on URL paths, hostnames, headers, and cookies.
Example: Send requests to /api/* to one group of servers and requests to /static/* to another group.
Network Load Balancer (Layer 4)
Operates at the transport layer. It routes traffic based on IP addresses and TCP/UDP ports. It does not inspect application content.
Network load balancers handle extreme performance requirements—millions of requests per second with ultra-low latency.
Gateway Load Balancer (Layer 3)
Operates at the network layer. It is used for deploying third-party virtual appliances like firewalls, intrusion detection systems, and deep packet inspection.
Load Balancer Components
Listeners
A listener checks for connection requests using a specific protocol and port. You configure listeners to handle incoming traffic. Common listeners are HTTP on port 80 and HTTPS on port 443.
Target Groups
A target group routes requests to registered targets. You register your virtual machines with a target group. Health checks are configured at the target group level.
Health Checks
Health checks determine whether a target is healthy and able to receive traffic. If a health check fails, the load balancer stops sending traffic to that target. Health checks run continuously.
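The state transitions behind health checks are simple to model: a target flips to unhealthy after a run of consecutive failures and back to healthy after a run of consecutive successes. The class and threshold values below are illustrative; real load balancers expose these thresholds as configuration:

```python
class TargetHealth:
    """Track a target's health from a stream of check results."""

    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results contradicting the current state

    def record(self, check_passed: bool) -> bool:
        if check_passed == self.healthy:
            self._streak = 0  # current state confirmed; reset the streak
        else:
            self._streak += 1
            needed = (self.healthy_threshold if not self.healthy
                      else self.unhealthy_threshold)
            if self._streak >= needed:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy

target = TargetHealth()
for passed in [False, False, False]:  # three failed checks in a row
    target.record(passed)
print(target.healthy)  # the load balancer now routes around this target
```

Requiring several consecutive results in each direction prevents a single transient blip from flapping a target in and out of service.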
Rules
Rules define how the load balancer routes requests. For application load balancers, you can route based on path, hostname, HTTP headers, or query parameters.
Load Balancer Algorithms
Round Robin
Requests are distributed evenly across healthy targets in rotation. This is the simplest and most common algorithm.
Least Outstanding Requests
Requests are sent to the target with the fewest active connections. This works well when requests vary significantly in duration.
Least Response Time
Requests are sent to the target with the fastest response time. This requires the load balancer to track performance metrics.
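The first two algorithms are short enough to sketch directly. The target names and connection counts are made up for illustration:

```python
from itertools import cycle

# Round robin: hand out healthy targets in strict rotation.
targets = ["vm-a", "vm-b", "vm-c"]
rotation = cycle(targets)
picks = [next(rotation) for _ in range(5)]
print(picks)  # ['vm-a', 'vm-b', 'vm-c', 'vm-a', 'vm-b']

# Least outstanding requests: pick the target with the fewest
# currently active connections.
active_connections = {"vm-a": 12, "vm-b": 3, "vm-c": 7}

def least_outstanding(conns: dict) -> str:
    return min(conns, key=conns.get)

print(least_outstanding(active_connections))  # vm-b
```

Round robin assumes all requests cost roughly the same; least outstanding requests adapts when some requests run much longer than others, which is why it suits workloads with variable request duration.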
Cross-Provider Comparison
| Feature | AWS | Azure | Google Cloud |
|---|---|---|---|
| Application Load Balancer | Application Load Balancer | Application Gateway | HTTP(S) Load Balancing |
| Network Load Balancer | Network Load Balancer | Load Balancer | TCP/SSL Load Balancing |
| Internal Load Balancer | Internal Load Balancer | Internal Load Balancer | Internal Load Balancing |
| Health Checks | Yes | Yes | Yes |
| SSL Termination | Yes | Yes | Yes |
| Web Application Firewall | AWS WAF | Application Gateway WAF | Cloud Armor |
| Cross-zone balancing | Yes | Yes | Yes |
Putting It All Together: A Complete Architecture
A production-ready web application combines all three services:
Virtual Machines running your application code. These are the workers that actually serve user requests.
Auto Scaling that adds or removes virtual machines based on CPU utilization or request count. During peak hours, more instances run. During quiet hours, fewer instances run.
Load Balancer in front of the auto scaling group. Users connect to the load balancer, which distributes traffic across the healthy instances.
When traffic increases:
CPU utilization rises across instances
Auto scaling policy triggers
New instances launch
Instances register with the load balancer
Load balancer begins sending traffic to new instances
CPU utilization returns to target
When traffic decreases, the process reverses. Instances are removed, but only after they finish handling existing connections.
Real-World Scenarios
Scenario 1: E-commerce Website with Variable Traffic
An online store experiences high traffic during holiday seasons and flash sales, but steady traffic the rest of the year.
Solution:
Use general purpose virtual machines (t3 or m5 families)
Configure auto scaling with target tracking at 60 percent CPU
Set minimum capacity to 2 for availability, maximum to 20 for peak
Use application load balancer to distribute traffic
Configure health checks on the application endpoint
During normal operation, 2 to 3 instances run. During Black Friday, the group scales to 15 instances automatically. After the sale, it scales back down.
Scenario 2: API Service with Consistent Load
A financial services API processes a predictable number of transactions throughout the day. Reliability is critical.
Solution:
Use compute optimized virtual machines (c5 family)
Configure scheduled scaling to match business hours
Reserve capacity with 1-year reservations for cost savings
Use network load balancer for low latency
Deploy across multiple availability zones
The API runs with a consistent number of instances, avoiding the complexity of dynamic scaling while maintaining high availability.
Scenario 3: Batch Processing Workload
A data processing job runs every night, processing large files for several hours.
Solution:
Use storage optimized instances (i3 family) for high disk I/O
Configure spot instances for cost savings (workload is fault-tolerant)
Use auto scaling with step scaling based on queue depth
No load balancer needed (internal processing only)
The batch job uses spot instances at up to a 90 percent discount. If an instance is reclaimed, the job resumes on another instance from its last checkpoint.
Cost Optimization Strategies
Right-size your instances. Most organizations over-provision. Monitor CPU and memory utilization and scale down instances that are consistently underutilized.
Use reservations for steady workloads. If your auto scaling group maintains a minimum of 5 instances 24/7, reserve those 5 instances and use on-demand for the variable portion.
Consider burstable instances for development. T3 or B-series instances can burst to high CPU when needed but cost much less than standard instances.
Use spot instances for fault-tolerant workloads. Batch processing, containerized microservices, and development environments can run on spot instances at massive discounts.
Turn off instances outside business hours. For development environments, use scheduled scaling to reduce capacity to zero overnight and on weekends.
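The savings from an off-hours schedule are easy to estimate. The hourly rate, instance count, and business-hours window below are assumptions for illustration:

```python
HOURLY_RATE = 0.05               # hypothetical on-demand $/hour per instance
INSTANCES = 50
BUSINESS_HOURS_PER_WEEK = 5 * 10  # weekdays, 8 AM to 6 PM
HOURS_PER_WEEK = 168

always_on = INSTANCES * HOURLY_RATE * HOURS_PER_WEEK
scheduled = INSTANCES * HOURLY_RATE * BUSINESS_HOURS_PER_WEEK
savings_pct = round(100 * (1 - scheduled / always_on), 1)

print(f"weekly cost always-on: ${always_on:.2f}, scheduled: ${scheduled:.2f}")
print(f"savings: {savings_pct}%")
```

Running only 50 of 168 weekly hours cuts the bill by roughly 70 percent, which is why scheduled scaling to zero is one of the highest-leverage optimizations for development environments.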
Summary
Compute services form the foundation of cloud architecture:
| Service | Purpose | Key Benefits |
|---|---|---|
| Virtual Machines | Run applications | Control, flexibility, familiar model |
| Auto Scaling | Match capacity to demand | Cost efficiency, reliability |
| Load Balancers | Distribute traffic | High availability, scalability |
When combined, these services create resilient, scalable applications that can handle any level of demand while minimizing cost during quiet periods.
Practice Questions
An e-commerce website expects heavy traffic during holiday sales but moderate traffic otherwise. Which combination of services would you use?
A financial application requires consistent performance and cannot tolerate any downtime. You need to protect against instance failure. How would you design the compute layer?
A batch processing job runs for several hours each night, processing data from a queue. If a job is interrupted, it can restart from the last checkpoint. Which compute model is most cost-effective?
A web application has two distinct components: a public API and an administrative dashboard. The public API gets 95 percent of the traffic. How would you configure the load balancer?
Your development team needs 50 virtual machines for testing. They run during business hours but are idle at night and weekends. What is the most cost-effective approach?
Learn More
Practice compute services, auto scaling, and load balancers with hands-on exercises in our interactive labs:
https://devops.trainwithsky.com/