An Off-Switch for NAT Gateway and ALB

A NAT Gateway charges $0.045 per hour just for existing — $32 per month, per environment, in US East — before you’ve sent a single packet. Other regions cost more. Add data processing charges at $0.045 per GB and the bill climbs further once the environment is in use. Accounting for nights and weekends, a development environment sits idle for roughly two thirds of all available hours. Manual start/stop routines get forgotten. This post shows how to automate that lifecycle: provision the resource at the start of the day, remove it at the end — with no manual steps.
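The arithmetic behind those figures can be sketched in a few lines. The hourly rate is the US East price quoted above; the 30-day month and the 12-hour, five-day schedule are assumptions for illustration, and data processing charges are left out.

```python
HOURLY_RATE_USD = 0.045        # NAT Gateway hourly charge quoted above (US East)
HOURS_PER_MONTH = 30 * 24      # assumption: a 30-day month
HOURS_PER_WEEK = 7 * 24


def monthly_cost(hours_on_per_week: float) -> float:
    """Hourly charges for one NAT Gateway that exists for the given hours each week."""
    fraction_on = hours_on_per_week / HOURS_PER_WEEK
    return HOURLY_RATE_USD * HOURS_PER_MONTH * fraction_on


working_hours = 12 * 5         # assumption: 12 hours a day, Monday to Friday
always_on = monthly_cost(HOURS_PER_WEEK)
scheduled = monthly_cost(working_hours)
idle_share = 1 - working_hours / HOURS_PER_WEEK

print(f"always on: ${always_on:.2f} per month")
print(f"scheduled: ${scheduled:.2f} per month")
print(f"idle share of the week: {idle_share:.0%}")
```

Under these assumptions the scheduled gateway costs a little over a third of the always-on figure, per environment.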

What About Instance Scheduler?

AWS offers Instance Scheduler on AWS — a managed solution that stops and starts resources on a schedule. If your environment consists of EC2 instances or RDS databases, it is the right tool. Tag your resources, define a schedule, done.

Instance Scheduler covers only EC2 and RDS. A NAT Gateway has no start/stop API — it either exists and charges you, or it doesn’t. The same is true for Application Load Balancers and most other networking infrastructure. If the expensive part of your dev environment is a NAT Gateway or an ALB, Instance Scheduler cannot help.

This post introduces a pattern that takes a different approach: instead of a start/stop API, it drives the resource lifecycle through a CloudFormation parameter update. When the parameter changes, CloudFormation provisions or removes the resource. This works for stateless infrastructure — NAT Gateways, ALBs, and similar — that Instance Scheduler cannot reach.

The Stack Stays. The Resource Doesn’t.

The core of the pattern is a CloudFormation Condition that controls whether a resource is provisioned:

# stack.yaml
Parameters:
  SwitchState:
    Type: String
    AllowedValues: [SwitchedOn, SwitchedOff]
    Default: SwitchedOff

Conditions:
  IsOn: !Equals [!Ref SwitchState, SwitchedOn]

Resources:
  MyNatGateway:
    Type: AWS::EC2::NatGateway
    Condition: IsOn
    Properties:
      ...

When you update the stack with SwitchState=SwitchedOn, CloudFormation creates the NAT Gateway. Update it to SwitchedOff and CloudFormation deletes it. The stack itself stays intact throughout — outputs, exports, and all dependent infrastructure. Only the conditional resource appears and disappears.
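As a rough sketch of how that update could be issued from Python, the helper below builds the request and hands it to a CloudFormation client. It assumes a client created with boto3.client("cloudformation") is passed in; build_update_request and toggle_switch are illustrative names, not taken from the repository.

```python
from typing import Any

ALLOWED_STATES = ("SwitchedOn", "SwitchedOff")


def build_update_request(stack_name: str, state: str) -> dict[str, Any]:
    """Build update_stack arguments that change only the SwitchState parameter."""
    if state not in ALLOWED_STATES:
        raise ValueError(f"state must be one of {ALLOWED_STATES}, got {state!r}")
    return {
        "StackName": stack_name,
        # Reuse the deployed template so only the parameter value changes.
        "UsePreviousTemplate": True,
        # A stack with further parameters would also list each of them here
        # with UsePreviousValue set to True.
        "Parameters": [{"ParameterKey": "SwitchState", "ParameterValue": state}],
    }


def toggle_switch(cfn: Any, stack_name: str, state: str) -> str:
    """Start the stack update and return the stack ID.

    cfn is expected to be a boto3 CloudFormation client (an assumption of
    this sketch). The call returns as soon as the update has been accepted,
    not when it has finished.
    """
    response = cfn.update_stack(**build_update_request(stack_name, state))
    return response["StackId"]
```

Because update_stack only starts the update, the caller still has to wait for the stack to reach a terminal status before relying on the resource.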

Keeping the stack in place also matters when something goes wrong: a failed update rolls back to the previous known-good state rather than leaving a partially deleted stack.

The working example uses an SSM parameter as the conditional resource, not a NAT Gateway or ALB. This is a deliberate choice for an audience of engineers experimenting on personal AWS accounts: if something goes wrong, an SSM parameter costs nothing and deletes in seconds, while a NAT Gateway left running can quietly accumulate charges. Engineers deploying this for a team environment will know how to make the substitution, or can ask their agentic coding tools to do so.

A Lambda Durable Function Runs the Schedule

The schedule is straightforward: switch the resource on, wait for CloudFormation to finish, hold for 12 hours, switch off, wait again. The waiting is the hard part — a plain Lambda function cannot pause mid-execution. For long-running orchestration like this, Step Functions is the natural instinct, and it works. An earlier version of this pattern ran on a state machine. The friction was in the CloudFormation polling step: checking whether a stack update has completed requires a Wait, Check, and Choice loop in the state definition. For a workflow that runs straight from start to finish, that is a lot of scaffolding.

Lambda durable functions, announced at re:Invent 2025, give you the same orchestration capability in regular Python code. A durable function can pause — for hours, days, or up to a year — with no compute running, then resume where it left off. One trade-off worth naming: the orchestration logic is now tied to a specific Python runtime version, which will eventually reach end of support and need updating. Step Functions does not have that problem. Worth knowing before you commit. For this use case, simpler code wins:

# lambda/handler.py
@durable_execution
def handler(event: dict, context: DurableContext) -> dict:
    context.step(set_switch_state("SwitchedOn"), name="switch-on")
    context.wait_for_condition(
        check=_check_stack_status,
        config=_wait_for_stack(),
        name="wait-for-on",
    )

    context.wait(duration=Duration.from_seconds(PAUSE_SECONDS), name="timed-pause")

    context.step(set_switch_state("SwitchedOff"), name="switch-off")
    context.wait_for_condition(
        check=_check_stack_status,
        config=_wait_for_stack(),
        name="wait-for-off",
    )

    return {"status": "complete", "stack": STACK_NAME}

Each call does something the others can’t:

context.step() — runs a function as an idempotent unit of work. The result is memoized; if the execution replays after a failure, the step is skipped rather than re-executed.
context.wait() — durable sleep. Lambda is not running during this time.
context.wait_for_condition() — polls a condition function on a timer, suspending between checks. This replaces a boto3 waiter, which blocks Lambda continuously until CloudFormation finishes its update.
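The decision a condition check has to make on each poll can be sketched independently of the durable SDK. The functions below are an assumption about what a helper like _check_stack_status might compute, not the repository's code: they read the stack status through a boto3 CloudFormation client passed in by the caller and map it to "pending", "done", or "failed".

```python
from typing import Any

# Terminal statuses that mean the requested change was applied.
SUCCESS_STATUSES = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}


def classify_stack_status(status: str) -> str:
    """Map a CloudFormation StackStatus string to pending, done, or failed."""
    if status.endswith("_IN_PROGRESS"):
        # Includes the cleanup and rollback phases, which are still moving.
        return "pending"
    if status in SUCCESS_STATUSES:
        return "done"
    # Anything else (for example UPDATE_ROLLBACK_COMPLETE) means the
    # update did not take effect.
    return "failed"


def poll_stack(cfn: Any, stack_name: str) -> str:
    """Fetch the current status once and classify it.

    cfn is expected to be a boto3 CloudFormation client (an assumption of
    this sketch).
    """
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    return classify_stack_status(stack["StackStatus"])
```

A polling loop would keep suspending while the result is "pending", continue on "done", and raise on "failed" so the execution does not hold for 12 hours on a rolled-back stack.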

One detail worth noting: context.step() retries on failure, which means set_switch_state may run more than once. If CloudFormation returns “No updates are to be performed” on a retry — because the previous attempt already applied the change — the function treats that as success rather than an error. Without this, a replay would fail on a step that had already succeeded.
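A minimal sketch of that tolerance, assuming the update is made with a boto3 CloudFormation client: boto3 reports the no-op case as a ClientError whose error code is ValidationError and whose message contains "No updates are to be performed". The names is_noop_update and apply_switch_state are illustrative, not the repository's.

```python
from typing import Any

NO_UPDATE_MESSAGE = "No updates are to be performed"


def is_noop_update(exc: Exception) -> bool:
    """True if exc looks like CloudFormation's 'nothing to change' error."""
    # botocore's ClientError carries the parsed error in its response attribute.
    response = getattr(exc, "response", None) or {}
    error = response.get("Error", {})
    return (
        error.get("Code") == "ValidationError"
        and NO_UPDATE_MESSAGE in error.get("Message", "")
    )


def apply_switch_state(cfn: Any, stack_name: str, state: str) -> str:
    """Request the parameter change, treating 'already applied' as success."""
    try:
        cfn.update_stack(
            StackName=stack_name,
            UsePreviousTemplate=True,
            Parameters=[{"ParameterKey": "SwitchState", "ParameterValue": state}],
        )
    except Exception as exc:  # botocore.exceptions.ClientError when using boto3
        if is_noop_update(exc):
            return "unchanged"
        raise
    return "updating"
```

Any other error is re-raised unchanged, so genuine failures still surface to the step's retry logic.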

SAM Handles the Durable Execution Wiring

Three properties on the SAM function enable durable execution. Two have side effects worth knowing before you deploy:

# lambda-stack.yaml
SwitchFunction:
  Type: AWS::Serverless::Function
  Properties:
    Runtime: python3.14
    Architectures:
      - arm64
    Handler: handler.handler
    CodeUri: lambda/
    AutoPublishAlias: live
    DurableConfig:
      ExecutionTimeout: 86400     # 24 hours — covers 12h pause + overhead
      RetentionPeriodInDays: 7

DurableConfig enables durable execution. AutoPublishAlias: live is required because durable functions cannot be invoked against $LATEST — attempting it fails immediately. SAM publishes a version and creates the alias automatically. EventBridge targets that alias ARN.

Any change to DurableConfig forces Lambda resource replacement, which terminates all in-flight executions. Set ExecutionTimeout generously from the start rather than adjusting it later.

Check the roadmap on AWS Builder Center for current regional availability before deploying — the feature launched at re:Invent 2025 and availability is expanding.

IAM: CloudFormation Inherits the Function’s Role

When a Lambda function calls cfn.update_stack() without a RoleARN, CloudFormation uses the caller’s credentials for all downstream API calls. Here, the caller is the Lambda execution role — so that role needs permissions for every resource the stack manages, not just CloudFormation.

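For the SSM placeholder, that’s ssm:PutParameter and ssm:DeleteParameter. Replace SSM with a NAT Gateway and you’d need ec2:CreateNatGateway, ec2:DeleteNatGateway, and ec2:DescribeNatGateways, plus the Elastic IP permissions such as ec2:AllocateAddress and ec2:ReleaseAddress if the stack also manages the gateway’s address.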

When CloudFormation lacks the resource-level permissions, it fails with a generic GeneralServiceException rather than AccessDeniedException. The stack rolls back immediately with no obvious cause.
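When a rollback like this happens, the stack events are the first place to look. The sketch below assumes a boto3 CloudFormation client is passed in and lists the status reason recorded against each failed resource. The helper names are illustrative, and the reason text may be no more specific than the generic exception described above.

```python
from typing import Any


def failure_reasons(events: list[dict[str, Any]]) -> list[str]:
    """Return 'LogicalId: reason' for every event whose status ends in _FAILED."""
    return [
        f"{event['LogicalResourceId']}: "
        f"{event.get('ResourceStatusReason', 'no reason given')}"
        for event in events
        if event.get("ResourceStatus", "").endswith("_FAILED")
    ]


def explain_rollback(cfn: Any, stack_name: str) -> list[str]:
    """Fetch the most recent page of stack events and pick out the failures.

    cfn is expected to be a boto3 CloudFormation client (an assumption of
    this sketch). Events come back newest first.
    """
    events = cfn.describe_stack_events(StackName=stack_name)["StackEvents"]
    return failure_reasons(events)
```

If the failed resource is the one behind the Condition, a missing permission on the Lambda execution role is a likely candidate.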

The permissions are scoped to the specific SSM path this stack uses:

- Effect: Allow
  Action:
    - ssm:PutParameter
    - ssm:DeleteParameter
  Resource:
    - !Sub "arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:parameter/scheduled-switch/${MainStackName}/*"

The Scheduler Wraps an Existing Resource Stack

The pattern uses two stacks deliberately. The first — stack.yaml — is your conditional resource stack: the dev environment infrastructure you want to toggle. It exists independently and has no knowledge of the scheduler. The second — lambda-stack.yaml — is the scheduler. It takes the name of the first stack as a parameter and wraps around it. You can redeploy, update, or remove the scheduler at any time without touching the resource stack.

Deploy the resource stack first, defaulting to off:

aws cloudformation create-stack \
  --stack-name scheduled-switch-main \
  --template-body file://stack.yaml \
  --parameters ParameterKey=SwitchState,ParameterValue=SwitchedOff \
  --region <your-region>

Then deploy the Lambda stack. SAM handles packaging and upload:

sam build
sam deploy \
  --stack-name scheduled-switch-lambda \
  --region <your-region> \
  --capabilities CAPABILITY_NAMED_IAM \
  --resolve-s3 \
  --parameter-overrides MainStackName=scheduled-switch-main

From this point the scheduler runs by itself, triggered by its EventBridge cron schedule. The full source, including a helper script for manual invocation, is in the repository linked below.

Your Dev Environment, Only When You Need It

Even with flexible hours, a 12-hour working day five days a week covers only a little over a third of the hours in a month (60 of the 168 hours in a week), and idle infrastructure is billed at the same rate as active infrastructure. CloudFormation conditions keep the stack consistent throughout: no orphaned resources, no broken exports, clean rollback on failure. The stack becomes the single place to own your dev environment definition: no Service Catalog, no deployment pipelines, just a parameter and an on/off switch.

Swap out the SSM placeholder for your actual resource, adjust the cron and hold duration to match your team’s hours, and it runs itself.

View the full source code on GitHub

References

Instance Scheduler on AWS — AWS managed solution for scheduling EC2 and RDS start/stop
AWS::Lambda::Function DurableConfig — CloudFormation reference for enabling durable execution on a Lambda function
Lambda durable functions overview — AWS documentation on durable execution primitives and regional availability
CloudFormation conditions — how to use Conditions to conditionally provision resources in a template
AWS CloudFormation service role — explains the default behavior when no RoleARN is specified on a stack operation
Amazon VPC pricing — NAT Gateway hourly and per-GB data processing rates
Amazon EventBridge scheduled rules — cron and rate expressions for EventBridge rules
