IaC Cleanup: Removing Obsolete Manifests And Secrets


Hey guys! Today, we're diving deep into a critical aspect of Infrastructure as Code (IaC) maintenance: cleaning up those obsolete manifests and secrets. It's like decluttering your digital workspace – essential for security, efficiency, and just plain sanity. Let's get started!

The Case of the Obsolete Manifests and Secrets

So, we've got this situation where our repository is holding onto some old files – specifically, database-cnpg-promoted.yaml, database.yaml, database-rw-pooler.yaml, and database-cluster.yaml. These were super important during the migration from Zalando to CloudNativePG (CNPG), but they're not being used anymore. It's like keeping the training wheels on a bike after you've mastered riding – unnecessary and just getting in the way.

Why is this a problem? Well, keeping these obsolete manifests around can cause confusion. When new team members come on board or when you're troubleshooting, seeing these files might lead you down the wrong path. Plus, it's just messy! We want our IaC to be clean and easy to understand. It’s like having a well-organized toolbox versus a drawer full of random tools – which one would you rather work with?

And it’s not just manifests. We also have old cert-manager CA secrets (mastodon-postgresql-ca.yaml and postgresql-server-cert.yaml) hanging around. CNPG is now issuing its own Certificate Authority (CA) and server certificates, so these old secrets are just creating confusion and taking up space. Think of it as keeping old keys that don’t open any doors anymore – time to toss them!

The key takeaway here is that maintaining a clean and up-to-date IaC repository is crucial for several reasons:

  • Security: Old secrets can be a security risk if they're compromised. Removing them reduces your attack surface.
  • Clarity: A clean repository is easier to understand and maintain, reducing the risk of errors.
  • Efficiency: You'll save time and effort by not having to sort through unnecessary files.

We need to roll up our sleeves and get rid of this digital clutter. This will make our IaC more secure, more understandable, and ultimately, more effective. Let's break down exactly what needs to be done.

Unsafe Configuration Knobs: A Recipe for Disaster

Now, let's talk about some unsafe configuration settings we've spotted in our active cluster manifest (database-cnpg.yaml). It's like leaving the stove on when you leave the house – a potential hazard we need to address ASAP.

The Perils of enableAlterSystem

The first thing we've noticed is that enableAlterSystem is still enabled. Now, the CNPG documentation explicitly warns against using ALTER SYSTEM or enabling this flag. Why? Because changes made this way aren't replicated across the cluster. This can lead to an unpredictable state, which is the last thing you want in a production database environment. Imagine the chaos if different parts of your database cluster are running on different configurations! It’s like trying to conduct an orchestra where each musician is playing a different tune – a cacophony of errors waiting to happen.

Think of your database cluster as a team, and enableAlterSystem as a way for one team member to make changes without telling the others. This can lead to inconsistencies and problems down the line. We want everyone on the same page, following the same rules.
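To see what the safe alternative looks like, here's a minimal sketch of declaring PostgreSQL settings in the Cluster manifest itself, so the operator rolls them out uniformly across all instances. The parameter values below are placeholders, not our real pgtune numbers:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: database-cnpg  # assumed cluster name
spec:
  instances: 3
  # enableAlterSystem is simply omitted: configuration changes go through
  # the manifest, so every instance in the cluster gets the same settings.
  postgresql:
    parameters:
      # Placeholder values -- keep the pgtune values from the real manifest.
      max_connections: "200"
      shared_buffers: "1GB"
```

Because the operator owns the configuration, a change here is applied to the whole cluster in one reconciliation instead of drifting instance by instance.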

The Superuser Dilemma

Next up, we have enableSuperuserAccess: true. This setting exposes the built-in postgres superuser credentials. While it might seem convenient, it's a major security risk. CNPG's API reference even states that disabling this flag blanks the postgres password. The best practice here is that applications should connect using a dedicated role instead of the superuser. It's like giving everyone the master key to the building – risky and unnecessary.

The superuser account should be reserved for administrative tasks only. For regular application access, we should create specific roles with the minimum necessary permissions. This principle of least privilege is a cornerstone of security best practices. It’s like giving each employee a key that only opens the doors they need to access, rather than giving everyone the master key.

What's the fix? We need to remove both enableAlterSystem and enableSuperuserAccess from our active cluster manifest. This will improve the stability and security of our database cluster. It’s like taking those unsafe ingredients out of our recipe – the final dish will be much better (and less likely to explode!).
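As a rough sketch of the safer setup, CNPG's declarative role management can create the dedicated application role for us. The role name mastodon comes from our migration plan, but the password secret name here is hypothetical:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: database-cnpg  # assumed cluster name
spec:
  # Disabling superuser access blanks the postgres password,
  # closing off superuser logins entirely.
  enableSuperuserAccess: false
  managed:
    roles:
      # Dedicated application role following least privilege.
      - name: mastodon
        ensure: present
        login: true
        superuser: false
        passwordSecret:
          name: mastodon-db-credentials  # hypothetical secret name
```

Remember the ordering from the checklist below: the application must already be connecting as the dedicated role before superuser access is switched off.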

Missing Backup Configuration: A Disaster Waiting to Happen

Picture this: your database goes down, and you realize you have no backups. It's a nightmare scenario, right? Well, that's the situation we're in right now – we have no S3 backup configured for our CNPG cluster. This is a critical issue that needs immediate attention. It’s like driving a car without insurance – you might be fine most of the time, but when something goes wrong, you’re in big trouble.

Why Backups Are Non-Negotiable

Backups are the lifeline of any database system. They protect you from data loss due to hardware failures, software bugs, human errors, or even malicious attacks. Without backups, you're essentially playing a high-stakes game of Russian roulette with your data. Think of backups as your safety net – they’re there to catch you when things go wrong.

Our repository currently lacks an ObjectStore resource, and the cluster isn't referencing the Barman Cloud plugin. This means we have no WAL (Write-Ahead Log) archiving or base backups in place. CNPG’s migration guide clearly shows that enabling the Barman Cloud plugin requires defining an ObjectStore and adding a plugin entry in the cluster spec. Without these, we're flying blind. It's like trying to navigate a ship without a map or a compass – you're bound to run aground eventually.

The Barman Cloud Solution

Barman Cloud is a powerful tool for managing backups in CNPG. It allows you to store your backups in object storage (like S3), making them highly available and durable. It also provides features like point-in-time recovery, which allows you to restore your database to a specific point in time. This is incredibly valuable if you need to recover from a data corruption issue or a human error. It’s like having a time machine for your data – you can go back to any point in time and restore your database to that state.

To configure backups, we need to:

  1. Create an ObjectStore resource.
  2. Reference the Barman Cloud plugin in our cluster manifest.
  3. Create a ScheduledBackup resource.

This might sound like a lot, but it's a small price to pay for the peace of mind that comes with knowing your data is safe and sound. It’s like investing in a good security system for your home – it protects your valuable assets and gives you peace of mind.
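Putting those three steps together, a sketch of the resources could look something like this. The cluster name database-cnpg and the destinationPath bucket are assumptions – adjust them to match the real manifest:

```yaml
# 1. The ObjectStore describing where base backups and WAL files go.
apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: database-backup
spec:
  retentionPolicy: "14d"
  configuration:
    endpointURL: https://s3.jorijn.com
    destinationPath: s3://backups/database  # hypothetical bucket/path
    s3Credentials:
      accessKeyId:
        name: database-s3-credentials
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: database-s3-credentials
        key: ACCESS_SECRET_KEY
    wal:
      compression: gzip
---
# 2. The Cluster references the plugin as its WAL archiver.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: database-cnpg  # assumed cluster name
spec:
  plugins:
    - name: barman-cloud.cloudnative-pg.io
      isWALArchiver: true
      parameters:
        barmanObjectName: database-backup
---
# 3. A ScheduledBackup runs base backups Mondays and Thursdays at 06:00.
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: database-backup
spec:
  schedule: "0 0 6 * * 1,4"  # CNPG cron has a leading seconds field
  cluster:
    name: database-cnpg
  method: plugin
  pluginConfiguration:
    name: barman-cloud.cloudnative-pg.io
```

Note the six-field cron expression: unlike standard cron, CNPG's schedule format starts with a seconds field.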

Leftover Certificate References: A Ticking Time Bomb

We've got some lingering references to old certificates in our jobs (pg-amcheck-weekly.yaml and pg-amcheck-monthly.yaml). These jobs are still pointing to the old CA secret (mastodon-postgresql-ca), which doesn't exist anymore since we removed cert-manager. It's like trying to use an old key for a lock that's been changed – it just won't work.

The Problem with Dead References

These leftover references can cause our jobs to fail, or worse, they might use incorrect or outdated certificates. This can lead to security vulnerabilities and other issues. Think of it as having a broken link on your website – it can frustrate users and damage your reputation. In our case, it can break our database maintenance tasks and leave us vulnerable.

The Fix: Point to the New CA

The solution is straightforward: we need to update these jobs to point to the new CNPG-managed CA (database-cnpg-ca). We also need to make sure the referenced key is ca.crt, since that's where CNPG publishes the CA certificate. It’s like updating the address book on your phone – you want to make sure you have the correct contact information.

This is a simple fix, but it's crucial for ensuring our jobs run smoothly and securely. It’s like changing a light bulb – a small task that can make a big difference.
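For reference, the volume definition in the amcheck jobs would end up looking roughly like this (the surrounding Job spec is omitted, and the mount path is illustrative):

```yaml
# Volume in pg-amcheck-weekly.yaml / pg-amcheck-monthly.yaml
volumes:
  - name: db-ca
    secret:
      secretName: database-cnpg-ca  # the CNPG-managed CA secret
      items:
        - key: ca.crt  # CNPG stores the CA certificate under this key
          path: ca.crt
```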

Pod Disruption and Checksum Bug: Preventing Unnecessary Downtime

Let's talk about two important issues that can impact the availability and reliability of our database cluster: pod disruption and a checksum bug.

Pod Disruption Budget (PDB): Protecting Your Primary

First up, we need to make sure our cluster manifest explicitly sets enablePDB. CNPG recommends this to protect the primary database instance from eviction during node drains. A Pod Disruption Budget (PDB) is a Kubernetes feature that limits the number of pods in a replicated application that can be down simultaneously due to voluntary disruptions. Think of it as a safety net for your primary database instance – it ensures that it's not accidentally taken down during maintenance or upgrades.

Without a PDB, your primary database instance could be evicted during a node drain, leading to downtime. This is especially important in a high-availability environment where you want to minimize disruption to your users. It’s like having a backup generator for your house – it ensures that you have power even when the main power grid goes down.
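In the manifest this is a one-line addition to the cluster spec – a minimal sketch, assuming the cluster is named database-cnpg:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: database-cnpg  # assumed cluster name
spec:
  # Have the operator create PodDisruptionBudgets so the primary
  # isn't evicted during voluntary disruptions like node drains.
  enablePDB: true
```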

The Checksum Bug: A CloudNativePG Gotcha

Next, we need to work around a known checksum mismatch bug that affects the cloudnative-pg/postgresql:17.5 image when it uploads backups to S3-compatible storage. This bug can cause backup and WAL uploads to fail with unexpected errors. To fix this, we need to add the following environment variables to our cluster manifest:

  • AWS_REQUEST_CHECKSUM_CALCULATION=when_required
  • AWS_RESPONSE_CHECKSUM_VALIDATION=when_required

These environment variables tell the AWS SDK used for backup uploads to calculate and validate checksums only when an operation actually requires them, which avoids mismatch errors against S3-compatible endpoints that don't support the newer checksum headers. It's like skipping a verification handshake the other side doesn't understand – the transfer completes instead of failing.
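In the cluster manifest, these land under .spec.env, roughly like so:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: database-cnpg  # assumed cluster name
spec:
  env:
    - name: AWS_REQUEST_CHECKSUM_CALCULATION
      value: when_required
    - name: AWS_RESPONSE_CHECKSUM_VALIDATION
      value: when_required
```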

Required Changes: The To-Do List

Okay, guys, let's break down the specific actions we need to take to address these issues. It's like having a checklist before a big trip – we want to make sure we don't forget anything important.

  1. Delete obsolete files:
    • Remove the unused database manifests (database-cnpg-promoted.yaml, database.yaml, database-rw-pooler.yaml, database-cluster.yaml).
    • Remove the old CA secrets (mastodon-postgresql-ca.yaml, postgresql-server-cert.yaml).
    • Clean up their references in all kustomization files.
  2. Drop unsafe settings:
    • In the active cluster manifest:
      • Remove enableAlterSystem.
      • Remove enableSuperuserAccess.
    • Before disabling superuser access, rotate application credentials to use a dedicated role (e.g., mastodon) and update the mastodon-db-url secret.
  3. Add a PodDisruptionBudget:
    • Set enablePDB: true at the top of the cluster spec.
  4. Add checksum env vars:
    • Under .spec.env, add the variables AWS_REQUEST_CHECKSUM_CALCULATION=when_required and AWS_RESPONSE_CHECKSUM_VALIDATION=when_required.
  5. Configure backups:
    • Create an ObjectStore resource named database-backup.
      • Point endpointURL to https://s3.jorijn.com.
      • Reference the new secret database-s3-credentials for the ACCESS_KEY_ID and ACCESS_SECRET_KEY.
      • Set wal.compression: gzip.
      • Supply the additional command arguments for archive chunk size and timeout.
      • Specify a 14-day retentionPolicy.
    • In the cluster manifest, define .spec.plugins with name: barman-cloud.cloudnative-pg.io, isWALArchiver: true, and barmanObjectName: database-backup.
    • Add a ScheduledBackup resource named database-backup that targets the cluster and runs on Mondays and Thursdays at 06:00.
      • Set method: plugin and pluginConfiguration.name to barman-cloud.cloudnative-pg.io.
  6. Update amcheck jobs:
  • In pg-amcheck-weekly.yaml and pg-amcheck-monthly.yaml, change the secretName under the db-ca volume to database-cnpg-ca and make sure the referenced key is ca.crt.
  7. Maintain poolers:
    • Keep the read/write Pooler resources but rename them to match the new cluster name if necessary (e.g., database-pooler-rw).
    • Ensure they still use transaction mode and appropriate max_client_conn/default_pool_size values.
    • Remove any separate pooler manifests that reference the old cluster.
  8. Tune PostgreSQL parameters:
    • Keep the pgtune parameters (max_connections, shared_buffers, etc.) from the current manifest.
  9. Update kustomizations:
    • Add the ObjectStore and ScheduledBackup manifests to the workload kustomization.
    • Remove references to deleted files from all kustomization lists.
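As a rough sketch, the workload kustomization would end up looking something like this after the cleanup – the filenames for the new backup manifests are illustrative:

```yaml
# workload kustomization.yaml (filenames are illustrative)
resources:
  - database-cnpg.yaml
  - database-backup-objectstore.yaml   # new ObjectStore
  - database-backup-scheduled.yaml     # new ScheduledBackup
  - pg-amcheck-weekly.yaml
  - pg-amcheck-monthly.yaml
  # Removed: database.yaml, database-cluster.yaml, database-rw-pooler.yaml,
  # database-cnpg-promoted.yaml, mastodon-postgresql-ca.yaml,
  # postgresql-server-cert.yaml
```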

Conclusion: A Cleaner, Safer, and More Reliable IaC

Alright, guys, that's a wrap! We've covered a lot of ground today, from removing obsolete files to configuring backups and addressing security vulnerabilities. By tackling these issues, we're making our IaC cleaner, safer, and more reliable. It's like giving our digital infrastructure a thorough spring cleaning – the result is a more efficient and secure system.

Remember, maintaining a healthy IaC repository is an ongoing process. It's not a one-time fix, but a continuous effort to keep things organized, secure, and up-to-date. Keep those manifests tidy, rotate those secrets, and always prioritize backups. Your future self (and your team) will thank you for it! It's like taking care of your car – regular maintenance keeps it running smoothly and prevents costly breakdowns down the road.