IaC Cleanup: Removing Obsolete Manifests And Secrets
Hey guys! Today, we're diving deep into a critical aspect of Infrastructure as Code (IaC) maintenance: cleaning up those obsolete manifests and secrets. It's like decluttering your digital workspace – essential for security, efficiency, and just plain sanity. Let's get started!
The Case of the Obsolete Manifests and Secrets
So, we've got this situation where our repository is holding onto some old files – specifically, `database-cnpg-promoted.yaml`, `database.yaml`, `database-rw-pooler.yaml`, and `database-cluster.yaml`. These were super important during the migration from Zalando to CloudNativePG (CNPG), but they're not being used anymore. It's like keeping the training wheels on a bike after you've mastered riding – unnecessary and just getting in the way.
Why is this a problem? Well, keeping these obsolete manifests around can cause confusion. When new team members come on board or when you're troubleshooting, seeing these files might lead you down the wrong path. Plus, it's just messy! We want our IaC to be clean and easy to understand. It’s like having a well-organized toolbox versus a drawer full of random tools – which one would you rather work with?
And it’s not just manifests. We also have old cert-manager CA secrets (`mastodon-postgresql-ca.yaml` and `postgresql-server-cert.yaml`) hanging around. CNPG is now issuing its own Certificate Authority (CA) and server certificates, so these old secrets are just creating confusion and taking up space. Think of it as keeping old keys that don’t open any doors anymore – time to toss them!
The key takeaway here is that maintaining a clean and up-to-date IaC repository is crucial for several reasons:
- Security: Old secrets can be a security risk if they're compromised. Removing them reduces your attack surface.
- Clarity: A clean repository is easier to understand and maintain, reducing the risk of errors.
- Efficiency: You'll save time and effort by not having to sort through unnecessary files.
We need to roll up our sleeves and get rid of this digital clutter. This will make our IaC more secure, more understandable, and ultimately, more effective. Let's break down exactly what needs to be done.
Unsafe Configuration Knobs: A Recipe for Disaster
Now, let's talk about some unsafe configuration settings we've spotted in our active cluster manifest (`database-cnpg.yaml`). It's like leaving the stove on when you leave the house – a potential hazard we need to address ASAP.
The Perils of enableAlterSystem
The first thing we've noticed is that `enableAlterSystem` is still enabled. Now, the CNPG documentation explicitly warns against using `ALTER SYSTEM` or enabling this flag. Why? Because changes made this way aren't replicated across the cluster. This can lead to an unpredictable state, which is the last thing you want in a production database environment. Imagine the chaos if different parts of your database cluster are running on different configurations! It’s like trying to conduct an orchestra where each musician is playing a different tune – a cacophony of errors waiting to happen.

Think of your database cluster as a team, and `enableAlterSystem` as a way for one team member to make changes without telling the others. This can lead to inconsistencies and problems down the line. We want everyone on the same page, following the same rules.
The Superuser Dilemma
Next up, we have `enableSuperuserAccess: true`. This setting exposes the built-in `postgres` superuser credentials. While it might seem convenient, it's a major security risk. CNPG's API reference even states that disabling this flag blanks the `postgres` password. The best practice here is that applications should connect using a dedicated role instead of the superuser. It's like giving everyone the master key to the building – risky and unnecessary.
The superuser account should be reserved for administrative tasks only. For regular application access, we should create specific roles with the minimum necessary permissions. This principle of least privilege is a cornerstone of security best practices. It’s like giving each employee a key that only opens the doors they need to access, rather than giving everyone the master key.
What's the fix? We need to remove both `enableAlterSystem` and `enableSuperuserAccess` from our active cluster manifest. This will improve the stability and security of our database cluster. It’s like taking those unsafe ingredients out of our recipe – the final dish will be much better (and less likely to explode!).
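As a sketch, here's roughly what the relevant excerpt of the active cluster manifest could look like after the change. The cluster name comes from this setup; the instance count and the `postgresql` parameters are placeholders you'd keep from the existing manifest:

```yaml
# database-cnpg.yaml — hedged sketch of the relevant excerpt after cleanup.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: database-cnpg
spec:
  instances: 3                      # assumed; keep whatever the manifest uses today
  # enableSuperuserAccess removed: it defaults to false, which blanks the
  # postgres password. Applications connect with a dedicated role instead.
  postgresql:
    # enableAlterSystem removed: recent CNPG releases default it to false,
    # so ALTER SYSTEM statements are rejected and config stays declarative.
    parameters:
      max_connections: "200"        # illustrative; keep the existing pgtune values
```

Note that `enableSuperuserAccess` sits at the top level of `spec`, while `enableAlterSystem` lives under `spec.postgresql`; simply deleting both lines lets the safe defaults take over.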
Missing Backup Configuration: A Disaster Waiting to Happen
Picture this: your database goes down, and you realize you have no backups. It's a nightmare scenario, right? Well, that's the situation we're in right now – we have no S3 backup configured for our CNPG cluster. This is a critical issue that needs immediate attention. It’s like driving a car without insurance – you might be fine most of the time, but when something goes wrong, you’re in big trouble.
Why Backups Are Non-Negotiable
Backups are the lifeline of any database system. They protect you from data loss due to hardware failures, software bugs, human errors, or even malicious attacks. Without backups, you're essentially playing a high-stakes game of Russian roulette with your data. Think of backups as your safety net – they’re there to catch you when things go wrong.
Our repository currently lacks an `ObjectStore` resource, and the cluster isn't referencing the Barman Cloud plugin. This means we have no WAL (Write-Ahead Logging) archiving or base backups in place. CNPG’s migration guide clearly shows that enabling the Barman Cloud plugin requires defining an `ObjectStore` and adding a plugin entry in the cluster spec. Without these, we're flying blind. It's like trying to navigate a ship without a map or a compass – you're bound to run aground eventually.
The Barman Cloud Solution
Barman Cloud is a powerful tool for managing backups in CNPG. It allows you to store your backups in object storage (like S3), making them highly available and durable. It also provides features like point-in-time recovery, which allows you to restore your database to a specific point in time. This is incredibly valuable if you need to recover from a data corruption issue or a human error. It’s like having a time machine for your data – you can go back to any point in time and restore your database to that state.
To configure backups, we need to:
- Create an `ObjectStore` resource.
- Reference the Barman Cloud plugin in our cluster manifest.
- Create a `ScheduledBackup` resource.
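Putting those three steps together, here's a hedged sketch using the names that appear later in this plan (`database-backup`, `database-s3-credentials`, the `https://s3.jorijn.com` endpoint, gzip WAL compression, a 14-day retention policy, and the Monday/Thursday 06:00 schedule). The bucket path, instance count, and the exact cron format are assumptions worth verifying against the Barman Cloud plugin documentation:

```yaml
# ObjectStore consumed by the Barman Cloud plugin — sketch, not final.
apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: database-backup
spec:
  retentionPolicy: "14d"
  configuration:
    destinationPath: s3://backups/database     # assumed bucket/path
    endpointURL: https://s3.jorijn.com
    s3Credentials:
      accessKeyId:
        name: database-s3-credentials
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: database-s3-credentials
        key: ACCESS_SECRET_KEY
    wal:
      compression: gzip
---
# Excerpt of the Cluster spec wiring in the plugin as the WAL archiver.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: database-cnpg
spec:
  instances: 3                                 # assumed
  plugins:
    - name: barman-cloud.cloudnative-pg.io
      isWALArchiver: true
      parameters:
        barmanObjectName: database-backup
---
# ScheduledBackup for Mondays and Thursdays at 06:00.
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: database-backup
spec:
  schedule: "0 0 6 * * 1,4"    # CNPG cron has a leading seconds field
  cluster:
    name: database-cnpg
  method: plugin
  pluginConfiguration:
    name: barman-cloud.cloudnative-pg.io
```

Once applied, WAL segments are archived continuously and base backups run on the schedule, giving you point-in-time recovery between backups.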
This might sound like a lot, but it's a small price to pay for the peace of mind that comes with knowing your data is safe and sound. It’s like investing in a good security system for your home – it protects your valuable assets and gives you peace of mind.
Leftover Certificate References: A Ticking Time Bomb
We've got some lingering references to old certificates in our jobs (`pg-amcheck-weekly.yaml` and `pg-amcheck-monthly.yaml`). These jobs are still pointing to the old CA secret (`mastodon-postgresql-ca`), which doesn't exist anymore since we removed cert-manager. It's like trying to use an old key for a lock that's been changed – it just won't work.
The Problem with Dead References
These leftover references can cause our jobs to fail, or worse, they might use incorrect or outdated certificates. This can lead to security vulnerabilities and other issues. Think of it as having a broken link on your website – it can frustrate users and damage your reputation. In our case, it can break our database maintenance tasks and leave us vulnerable.
The Fix: Point to the New CA
The solution is straightforward: we need to update these jobs to point to the new CNPG-managed CA (`database-cnpg-ca`). We also need to make sure the key is `ca.crt` instead of `ca.`. It’s like updating the address book on your phone – you want to make sure you have the correct contact information.
This is a simple fix, but it's crucial for ensuring our jobs run smoothly and securely. It’s like changing a light bulb – a small task that can make a big difference.
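In practice this is a small diff to each job's volume definition. This sketch assumes the CA is mounted through a volume named `db-ca`, as the checklist later in this post indicates:

```yaml
# pg-amcheck-weekly.yaml / pg-amcheck-monthly.yaml — hedged sketch of the
# corrected volume block.
volumes:
  - name: db-ca
    secret:
      secretName: database-cnpg-ca    # was: mastodon-postgresql-ca (now deleted)
      items:
        - key: ca.crt                 # CNPG's CA secret exposes the key ca.crt
          path: ca.crt
```

The same change applies to both the weekly and monthly job manifests.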
Pod Disruption and Checksum Bug: Preventing Unnecessary Downtime
Let's talk about two important issues that can impact the availability and reliability of our database cluster: pod disruption and a checksum bug.
Pod Disruption Budget (PDB): Protecting Your Primary
First up, we need to make sure our cluster manifest explicitly sets `enablePDB`. CNPG recommends this to protect the primary database instance from eviction during node drains. A Pod Disruption Budget (PDB) is a Kubernetes feature that limits the number of pods in a replicated application that can be down simultaneously due to voluntary disruptions. Think of it as a safety net for your primary database instance – it ensures that it's not accidentally taken down during maintenance or upgrades.
Without a PDB, your primary database instance could be evicted during a node drain, leading to downtime. This is especially important in a high-availability environment where you want to minimize disruption to your users. It’s like having a backup generator for your house – it ensures that you have power even when the main power grid goes down.
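The setting itself is a one-liner in the cluster spec; a minimal sketch:

```yaml
# Cluster spec excerpt — explicitly opting in to operator-managed PDBs.
spec:
  enablePDB: true   # CNPG creates PodDisruptionBudgets that shield the primary
```

With this set, the operator maintains the PDBs for you; there's no separate `PodDisruptionBudget` manifest to write.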
The Checksum Bug: A CloudNativePG Gotcha
Next, we need to address a known checksum mismatch bug in the cloudnative-pg/postgresql:17.5 image, which can surface as checksum errors when backups are uploaded to S3-compatible object storage. To work around it, we need to add the following environment variables to our cluster manifest:
- `AWS_REQUEST_CHECKSUM_CALCULATION=when_required`
- `AWS_RESPONSE_CHECKSUM_VALIDATION=when_required`
These environment variables tell the AWS SDK used by the backup tooling to calculate and validate request checksums only when an operation actually requires them, which avoids spurious mismatches against S3-compatible endpoints that don't implement the newer checksum scheme. Think of it as matching the error checking to what the storage backend actually supports – transfers succeed, and data is still verified where it matters.
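In the manifest, those variables land under `.spec.env` in the standard Kubernetes env-var form; a minimal sketch:

```yaml
# Cluster spec excerpt — checksum workaround for S3-compatible storage.
spec:
  env:
    - name: AWS_REQUEST_CHECKSUM_CALCULATION
      value: when_required
    - name: AWS_RESPONSE_CHECKSUM_VALIDATION
      value: when_required
```

These are injected into the PostgreSQL pods, where the backup tooling picks them up.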
Required Changes: The To-Do List
Okay, guys, let's break down the specific actions we need to take to address these issues. It's like having a checklist before a big trip – we want to make sure we don't forget anything important.
- Delete obsolete files:
  - Remove the unused database manifests (`database-cnpg-promoted.yaml`, `database.yaml`, `database-rw-pooler.yaml`, `database-cluster.yaml`).
  - Remove the old CA secrets (`mastodon-postgresql-ca.yaml`, `postgresql-server-cert.yaml`).
  - Clean up their references in all kustomization files.
- Drop unsafe settings:
  - In the active cluster manifest:
    - Remove `enableAlterSystem`.
    - Remove `enableSuperuserAccess`.
  - Before disabling superuser access, rotate application credentials to use a dedicated role (e.g., `mastodon`) and update the `mastodon-db-url` secret.
- Add a PodDisruptionBudget:
  - Set `enablePDB: true` at the top of the cluster spec.
- Add checksum env vars:
  - Under `.spec.env`, add the variables `AWS_REQUEST_CHECKSUM_CALCULATION=when_required` and `AWS_RESPONSE_CHECKSUM_VALIDATION=when_required`.
- Configure backups:
  - Create an `ObjectStore` resource named `database-backup`.
    - Point `endpointURL` to `https://s3.jorijn.com`.
    - Reference the new secret `database-s3-credentials` for the `ACCESS_KEY_ID` and `ACCESS_SECRET_KEY`.
    - Set `wal.compression: gzip`.
    - Supply the additional command arguments for archive chunk size and timeout.
    - Specify a 14-day `retentionPolicy`.
  - In the cluster manifest, define `.spec.plugins` with `name: barman-cloud.cloudnative-pg.io`, `isWALArchiver: true`, and `barmanObjectName: database-backup`.
  - Add a `ScheduledBackup` resource named `database-backup` that targets the cluster and runs on Mondays and Thursdays at 06:00.
    - Set `method: plugin` and `pluginConfiguration.name` to `barman-cloud.cloudnative-pg.io`.
- Update amcheck jobs:
  - In `pg-amcheck-weekly.yaml` and `pg-amcheck-monthly.yaml`, change the `secretName` under `db-ca` to `database-cnpg-ca` and ensure the key is `ca.crt` instead of `ca.`.
- Maintain poolers:
  - Keep the read/write Pooler resources but rename them to match the new cluster name if necessary (e.g., `database-pooler-rw`).
  - Ensure they still use transaction mode and appropriate `max_client_conn`/`default_pool_size` values.
  - Remove any separate pooler manifests that reference the old cluster.
- Tune PostgreSQL parameters:
  - Keep the pgtune parameters (`max_connections`, `shared_buffers`, etc.) from the current manifest.
- Update kustomizations:
  - Add the `ObjectStore` and `ScheduledBackup` manifests to the workload kustomization.
  - Remove references to deleted files from all kustomization lists.
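For the pooler step, a renamed read/write `Pooler` might look like the following sketch. The instance count and pool sizes here are illustrative, not taken from the current manifests; keep the values you already have:

```yaml
# database-pooler-rw — hedged sketch of a renamed read/write pooler.
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: database-pooler-rw
spec:
  cluster:
    name: database-cnpg
  instances: 2                    # illustrative
  type: rw
  pgbouncer:
    poolMode: transaction         # keep transaction pooling
    parameters:
      max_client_conn: "200"      # illustrative; keep the existing values
      default_pool_size: "20"
```

Because the pooler references the cluster by name, renaming the cluster without updating this reference is exactly how stale pooler manifests break, which is why the checklist calls out removing any that still point at the old cluster.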
Conclusion: A Cleaner, Safer, and More Reliable IaC
Alright, guys, that's a wrap! We've covered a lot of ground today, from removing obsolete files to configuring backups and addressing security vulnerabilities. By tackling these issues, we're making our IaC cleaner, safer, and more reliable. It's like giving our digital infrastructure a thorough spring cleaning – the result is a more efficient and secure system.
Remember, maintaining a healthy IaC repository is an ongoing process. It's not a one-time fix, but a continuous effort to keep things organized, secure, and up-to-date. Keep those manifests tidy, rotate those secrets, and always prioritize backups. Your future self (and your team) will thank you for it! It's like taking care of your car – regular maintenance keeps it running smoothly and prevents costly breakdowns down the road.