Terraform for Your Data Platform: Infrastructure as Code

Managing Snowflake warehouses, AWS S3 buckets, and IAM roles with Terraform — from provider setup and remote state to CI/CD pipelines that plan on PR and apply on merge.

Chris P

Platform · 11 min read

Most data platform infrastructure is managed by clicking through web consoles. A warehouse gets created in Snowflake's UI. An S3 bucket gets configured in the AWS console. IAM roles get hand-crafted by whoever has admin access that day. This works until it does not — someone deletes a role by accident, a staging environment drifts from production, or an auditor asks who changed the warehouse size last Tuesday and nobody can answer.

Infrastructure as Code solves all three problems. Every resource is defined in version-controlled configuration files. Changes go through pull requests with peer review. The full history of every modification is in git. Disaster recovery becomes terraform apply instead of a two-day scramble. For data platforms specifically, Terraform is the right tool because it has mature providers for both cloud infrastructure (AWS, GCP, Azure) and data platform services (Snowflake, Databricks, Confluent).

This guide covers Snowflake provider setup with key-pair authentication, managing warehouses, databases, schemas, and roles, S3 bucket configuration for data lake storage, remote state management, CI/CD with GitHub Actions, and environment separation.

Snowflake Provider Setup

The Snowflake Terraform provider authenticates using key-pair authentication rather than passwords. This is more secure and avoids embedding credentials in your Terraform configuration. Start by creating a service user in Snowflake and assigning it an RSA key pair.

```sql
-- Run this in Snowflake to create the Terraform service user
USE ROLE ACCOUNTADMIN;

CREATE USER IF NOT EXISTS TERRAFORM_SVC
  DEFAULT_ROLE = SYSADMIN
  DEFAULT_WAREHOUSE = WH_TERRAFORM
  MUST_CHANGE_PASSWORD = FALSE
  TYPE = SERVICE;

CREATE ROLE IF NOT EXISTS TERRAFORM_ROLE;
GRANT ROLE SYSADMIN TO ROLE TERRAFORM_ROLE;
GRANT ROLE SECURITYADMIN TO ROLE TERRAFORM_ROLE;
GRANT ROLE TERRAFORM_ROLE TO USER TERRAFORM_SVC;

ALTER USER TERRAFORM_SVC SET RSA_PUBLIC_KEY = 'MIIBIjANBgkqh...';
```

With the service user created, configure the Terraform provider to authenticate with the private key.

```hcl
# providers.tf
terraform {
  required_version = ">= 1.5"

  required_providers {
    snowflake = {
      source  = "Snowflake-Labs/snowflake"
      version = "~> 1.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "snowflake" {
  organization_name = var.snowflake_org
  account_name      = var.snowflake_account
  user              = "TERRAFORM_SVC"
  authenticator     = "JWT"
  private_key       = var.snowflake_private_key
  role              = "TERRAFORM_ROLE"
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = var.environment
      Team        = "data-platform"
    }
  }
}
```

The private key is passed as a variable, never hardcoded. In CI/CD, it comes from a GitHub secret. Locally, it comes from an environment variable or a .tfvars file that is gitignored.
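Generating the key pair is a one-time step. A sketch with `openssl` (filenames are illustrative; Snowflake expects the private key in PKCS#8 format, and the public key body goes into the `ALTER USER ... SET RSA_PUBLIC_KEY` statement above):

```shell
# Generate an unencrypted 2048-bit RSA key pair in PKCS#8 format
# (tf_snowflake_key.* are assumed filenames, not from this article)
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -nocrypt -out tf_snowflake_key.p8
openssl rsa -in tf_snowflake_key.p8 -pubout -out tf_snowflake_key.pub

# Locally, export the key so Terraform reads it as a variable;
# in CI the same value comes from a GitHub secret instead
export TF_VAR_snowflake_private_key="$(cat tf_snowflake_key.p8)"
```

Keep both key files out of git; only the public key ever goes into Snowflake.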

Managing Snowflake Resources

With the provider configured, define your Snowflake warehouses, databases, schemas, and roles. Start with warehouses. Each workload class gets its own warehouse with appropriate sizing and auto-suspend settings.

```hcl
# snowflake_warehouses.tf
resource "snowflake_warehouse" "etl" {
  name              = "WH_ETL_${upper(var.environment)}"
  warehouse_size    = var.environment == "prod" ? "MEDIUM" : "XSMALL"
  auto_suspend      = 120
  auto_resume       = true
  min_cluster_count = 1
  max_cluster_count = var.environment == "prod" ? 3 : 1
  scaling_policy    = "ECONOMY"
  comment           = "ETL batch processing — managed by Terraform"
}

resource "snowflake_warehouse" "bi" {
  name              = "WH_BI_${upper(var.environment)}"
  warehouse_size    = var.environment == "prod" ? "SMALL" : "XSMALL"
  auto_suspend      = 60
  auto_resume       = true
  min_cluster_count = 1
  max_cluster_count = var.environment == "prod" ? 4 : 1
  scaling_policy    = "STANDARD"
  comment           = "BI dashboard queries — managed by Terraform"
}

resource "snowflake_warehouse" "dev" {
  count             = var.environment == "prod" ? 0 : 1
  name              = "WH_DEV_${upper(var.environment)}"
  warehouse_size    = "XSMALL"
  auto_suspend      = 60
  auto_resume       = true
  max_cluster_count = 1
  comment           = "Developer sandbox — managed by Terraform"
}
```

The environment variable controls sizing. Production gets larger warehouses with multi-cluster scaling. Staging and dev get the minimum. The dev warehouse is only created in non-production environments, saving costs.

Next, define databases, schemas, and the role hierarchy. Snowflake's role-based access control is powerful but becomes impossible to manage without code. Terraform makes the permission graph explicit and auditable.

```hcl
# snowflake_databases.tf
resource "snowflake_database" "analytics" {
  name    = "ANALYTICS_${upper(var.environment)}"
  comment = "Primary analytics database — managed by Terraform"
}

resource "snowflake_schema" "bronze" {
  database = snowflake_database.analytics.name
  name     = "BRONZE"
  comment  = "Raw ingested data — immutable audit trail"
}

resource "snowflake_schema" "silver" {
  database = snowflake_database.analytics.name
  name     = "SILVER"
  comment  = "Cleaned and deduplicated data"
}

resource "snowflake_schema" "gold" {
  database = snowflake_database.analytics.name
  name     = "GOLD"
  comment  = "Business-ready metrics and dimensions"
}

# Role hierarchy
resource "snowflake_account_role" "analyst" {
  name    = "ANALYST_${upper(var.environment)}"
  comment = "Read access to gold schema"
}

resource "snowflake_account_role" "engineer" {
  name    = "ENGINEER_${upper(var.environment)}"
  comment = "Read/write access to silver and gold schemas"
}

resource "snowflake_grant_privileges_to_account_role" "analyst_gold_read" {
  account_role_name = snowflake_account_role.analyst.name
  privileges        = ["SELECT"]

  on_schema_object {
    future {
      object_type_plural = "TABLES"
      in_schema          = "${snowflake_database.analytics.name}.${snowflake_schema.gold.name}"
    }
  }
}

resource "snowflake_grant_privileges_to_account_role" "engineer_silver_write" {
  account_role_name = snowflake_account_role.engineer.name
  privileges        = ["SELECT", "INSERT", "UPDATE", "DELETE"]

  on_schema_object {
    future {
      object_type_plural = "TABLES"
      in_schema          = "${snowflake_database.analytics.name}.${snowflake_schema.silver.name}"
    }
  }
}
```

Every role, grant, and schema is defined in code. When a new team member needs access, an engineer adds them to the appropriate role in Terraform and opens a PR. The change is reviewed, approved, and applied through CI. No ad-hoc GRANT statements in a Snowflake worksheet that nobody can trace.
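As a sketch of what such a PR might contain, granting the analyst role to an individual user is one more resource (the user name here is a hypothetical placeholder, not from this article):

```hcl
# Hypothetical example: assigning a user to the analyst role in code,
# so the grant is reviewed in a PR rather than run ad hoc in a worksheet
resource "snowflake_grant_account_role" "jane_analyst" {
  role_name = snowflake_account_role.analyst.name
  user_name = "JANE_DOE"
}
```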

S3 Bucket for Data Lake Storage

The data lake needs an S3 bucket with appropriate lifecycle policies, encryption, and access controls. Raw data lands here before being ingested into Snowflake's bronze layer. Iceberg tables may also store their data and metadata files here.

```hcl
# s3.tf
resource "aws_s3_bucket" "data_lake" {
  bucket = "datalake-${var.environment}-${var.aws_account_id}"

  tags = {
    Purpose = "Data lake storage for bronze/raw data"
  }
}

resource "aws_s3_bucket_versioning" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
    bucket_key_enabled = true
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "archive-old-raw-data"
    status = "Enabled"

    filter {
      prefix = "raw/"
    }

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 365
      storage_class = "GLACIER"
    }
  }

  rule {
    id     = "expire-tmp-files"
    status = "Enabled"

    filter {
      prefix = "tmp/"
    }

    expiration {
      days = 7
    }
  }
}

resource "aws_s3_bucket_public_access_block" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

The lifecycle policy moves raw data to cheaper storage tiers after 90 days and archives to Glacier after a year. Temporary files are automatically cleaned up after seven days. Versioning is enabled so accidental deletions can be recovered. Public access is blocked entirely.

IAM Role for Cross-Account Access

Snowflake accesses your S3 bucket through a storage integration that assumes an IAM role. This role needs read/write access to the bucket and a trust policy that allows Snowflake's AWS account to assume it.

```hcl
# iam.tf
data "aws_iam_policy_document" "snowflake_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = [var.snowflake_aws_iam_user_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = [var.snowflake_storage_integration_external_id]
    }
  }
}

resource "aws_iam_role" "snowflake_access" {
  name               = "snowflake-data-lake-${var.environment}"
  assume_role_policy = data.aws_iam_policy_document.snowflake_assume_role.json

  tags = {
    Purpose = "Snowflake storage integration access"
  }
}

data "aws_iam_policy_document" "data_lake_access" {
  statement {
    sid    = "AllowListBucket"
    effect = "Allow"
    actions = [
      "s3:ListBucket",
      "s3:GetBucketLocation",
    ]
    resources = [aws_s3_bucket.data_lake.arn]
  }

  statement {
    sid    = "AllowObjectAccess"
    effect = "Allow"
    actions = [
      "s3:GetObject",
      "s3:GetObjectVersion",
      "s3:PutObject",
      "s3:DeleteObject",
    ]
    resources = ["${aws_s3_bucket.data_lake.arn}/*"]
  }
}

resource "aws_iam_role_policy" "snowflake_data_lake" {
  name   = "snowflake-data-lake-access"
  role   = aws_iam_role.snowflake_access.id
  policy = data.aws_iam_policy_document.data_lake_access.json
}
```

The trust policy uses an external ID condition, which prevents the confused deputy problem. Snowflake provides both the IAM user ARN and the external ID when you create a storage integration. These values are passed as Terraform variables.
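The Snowflake side of this handshake can also live in Terraform. A sketch (attribute names per the Snowflake-Labs provider; note the bootstrapping wrinkle that Snowflake only reveals the IAM user ARN and external ID after the integration exists, so the role's trust policy is typically filled in on a second apply):

```hcl
# Sketch: storage integration pointing Snowflake at the data lake bucket
resource "snowflake_storage_integration" "data_lake" {
  name                      = "S3_DATA_LAKE_${upper(var.environment)}"
  type                      = "EXTERNAL_STAGE"
  storage_provider          = "S3"
  enabled                   = true
  storage_aws_role_arn      = aws_iam_role.snowflake_access.arn
  storage_allowed_locations = ["s3://${aws_s3_bucket.data_lake.bucket}/"]
}
```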

Remote State with S3 Backend

Terraform state must be stored remotely so your team can collaborate and your CI/CD pipeline can access it. S3 with DynamoDB locking is the standard setup for AWS-based platforms.

```hcl
# backend.tf
terraform {
  backend "s3" {
    bucket         = "terraform-state-data-platform"
    key            = "data-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```

The DynamoDB table provides locking so two engineers cannot run terraform apply simultaneously and corrupt the state. The state file is encrypted at rest in S3. Create the state bucket and DynamoDB table manually before running terraform init — this is the one piece of infrastructure you bootstrap outside of Terraform.
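That bootstrap can itself be a tiny separate Terraform configuration applied once with local state. A sketch (bucket and table names match the backend block above; everything else is an assumption):

```hcl
# bootstrap/main.tf — one-time config for the state bucket and lock table,
# applied with local state before the main project's terraform init
resource "aws_s3_bucket" "tf_state" {
  bucket = "terraform-state-data-platform"
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the key name the S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }
}
```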

For environment separation, use either Terraform workspaces or a directory structure. Workspaces are simpler: terraform workspace select prod switches context. A directory structure (environments/prod/, environments/staging/) gives you more isolation but requires duplicating the backend configuration. For most data platform teams, workspaces are sufficient.

CI/CD with GitHub Actions

The final piece is a CI/CD pipeline that runs terraform plan on pull requests and terraform apply on merge to main. This ensures every infrastructure change is reviewed before it is applied and creates an audit trail in your git history.

```yaml
# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths:
      - 'terraform/**'
  push:
    branches: [main]
    paths:
      - 'terraform/**'

env:
  TF_VAR_snowflake_private_key: ${{ secrets.SNOWFLAKE_PRIVATE_KEY }}
  TF_VAR_snowflake_org: ${{ secrets.SNOWFLAKE_ORG }}
  TF_VAR_snowflake_account: ${{ secrets.SNOWFLAKE_ACCOUNT }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8"

      - name: Terraform Init
        run: terraform init
        working-directory: terraform

      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color -out=tfplan
        working-directory: terraform

      - name: Post plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const output = `${{ steps.plan.outputs.stdout }}`;
            const truncated = output.length > 60000
              ? output.substring(0, 60000) + '\n... truncated'
              : output;
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `## Terraform Plan\n\`\`\`\n${truncated}\n\`\`\``
            });

  apply:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8"

      - name: Terraform Init
        run: terraform init
        working-directory: terraform

      - name: Terraform Apply
        run: terraform apply -auto-approve
        working-directory: terraform
```

The plan job runs on every PR and posts the plan output as a comment. Reviewers can see exactly what will change: which warehouses will be resized, which roles will be modified, which buckets will be created. The apply job runs only on pushes to main, which means it runs only after a PR is approved and merged. The production environment protection rule adds an additional approval gate if you configure it in GitHub.

Environment Separation

For data platforms, the cleanest environment separation uses Terraform workspaces combined with environment-specific variable files. Each environment gets its own .tfvars file with appropriate sizing, and the workspace determines which state file Terraform uses.

```hcl
# variables.tf
variable "environment" {
  type        = string
  description = "Deployment environment (dev, staging, prod)"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "snowflake_org" {
  type      = string
  sensitive = true
}

variable "snowflake_account" {
  type      = string
  sensitive = true
}

variable "snowflake_private_key" {
  type      = string
  sensitive = true
}

variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "aws_account_id" {
  type = string
}

variable "snowflake_aws_iam_user_arn" {
  type        = string
  description = "ARN provided by Snowflake storage integration"
}

variable "snowflake_storage_integration_external_id" {
  type        = string
  description = "External ID from Snowflake storage integration"
}
```

With this setup, deploying to staging is terraform workspace select staging followed by terraform apply -var-file=environments/staging.tfvars. Every resource name includes the environment variable, so there is no collision between environments. The state files are isolated by workspace. And the CI/CD pipeline can target any environment by selecting the appropriate workspace.
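An illustrative `environments/staging.tfvars` might look like this (every value below is a placeholder; the private key is deliberately absent because it arrives via the `TF_VAR_snowflake_private_key` environment variable):

```hcl
# environments/staging.tfvars — placeholder values for illustration
environment    = "staging"
aws_region     = "us-east-1"
aws_account_id = "123456789012"

snowflake_org     = "MYORG"
snowflake_account = "MYACCOUNT"

# Values Snowflake reports for the storage integration (placeholders)
snowflake_aws_iam_user_arn                = "arn:aws:iam::111122223333:user/example"
snowflake_storage_integration_external_id = "MYACCOUNT_SFCRole=2_EXAMPLE"
```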

Importing Existing Resources

Most teams are not starting from scratch. You already have Snowflake warehouses, S3 buckets, and IAM roles that were created manually. Terraform can adopt these existing resources into its state using the import command. This is how you move from console-managed infrastructure to code-managed infrastructure without recreating anything.

```hcl
# imports.tf — one-time import blocks
import {
  to = snowflake_warehouse.etl
  id = "WH_ETL_PROD"
}

import {
  to = snowflake_database.analytics
  id = "ANALYTICS_PROD"
}

import {
  to = aws_s3_bucket.data_lake
  id = "datalake-prod-123456789"
}

import {
  to = aws_iam_role.snowflake_access
  id = "snowflake-data-lake-prod"
}
```

Run terraform plan after adding imports to see whether your Terraform configuration matches the actual state of the resources. If there are differences — for example, the existing warehouse has a different auto_suspend value than your Terraform code — terraform plan will show the drift. Fix the Terraform code to match reality first, then start making intentional changes through PRs.

Common Pitfalls

Three mistakes trip up most teams when terraforming their data platform. First, managing too many resources at once. Start with infrastructure resources (warehouses, buckets, roles) and leave object-level resources (tables, views, pipes) to dbt or other tools. Terraform is the wrong tool for managing individual Snowflake tables — that is what your transformation layer is for.

Second, not using modules for repeated patterns. If you have five warehouses that differ only in size and name, extract a warehouse module and call it five times with different parameters. This keeps your code DRY and makes it easy to apply consistent policies like auto-suspend settings across all warehouses.
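A minimal sketch of that extraction (module path and variable names are illustrative, not from this article):

```hcl
# modules/warehouse/main.tf — reusable warehouse module sketch
variable "name" {
  type = string
}

variable "size" {
  type = string
}

variable "auto_suspend" {
  type    = number
  default = 60 # consistent auto-suspend policy applied to every caller
}

resource "snowflake_warehouse" "this" {
  name           = var.name
  warehouse_size = var.size
  auto_suspend   = var.auto_suspend
  auto_resume    = true
}

# Caller, e.g. in snowflake_warehouses.tf:
# module "wh_bi" {
#   source = "./modules/warehouse"
#   name   = "WH_BI_${upper(var.environment)}"
#   size   = var.environment == "prod" ? "SMALL" : "XSMALL"
# }
```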

Third, forgetting to protect the state file. Your Terraform state contains sensitive information: resource IDs, configuration values, and sometimes secrets. Enable encryption on the S3 state bucket, restrict access to the state file with IAM policies, and never commit the state file to git. The remote backend handles this correctly by default, but verify it during initial setup.

Start with your most critical resources: the production Snowflake warehouses and the S3 bucket. Adopt them into Terraform state with import blocks, as shown above. Then expand to roles, schemas, and lower environments. Within two weeks, your entire data platform infrastructure can be version-controlled, auditable, and recoverable from a single terraform apply command.

Tags

Platform · terraform · infrastructure as code · snowflake · AWS · data platform · CI/CD
