Cost, Control, and Chaos: My Journey of Hosting Databases on K8s for a High-Volume Startup
In the fast-paced world of startups, managing costs and performance is a constant balancing act. Over the past few years, I’ve navigated the complexities of hosting databases for a startup where data volume was high and growing steadily. Starting with managed cloud solutions, transitioning to Aurora, and finally settling on a self-hosted database in a Kubernetes (K8s) cluster, each phase of this journey has taught me valuable lessons. Here’s how it unfolded.
The Problem: High Costs of Managed Solutions
Initially, we relied on managed RDS for our non-production environments, but the costs quickly spiralled out of control. Moving to Aurora seemed like a logical step to reduce expenses while maintaining scalability. However, Aurora turned out to be another misstep — costs were still high, and we lacked the flexibility needed for fine-grained optimizations.
This realization pushed us to explore self-hosting databases in a Kubernetes cluster.
The Solution: Hosting Databases in Kubernetes
Tools Used
Database Management Tools
- pgBackRest — For automated backups and point-in-time recovery.
- PGTune — For generating optimized PostgreSQL configuration settings based on workload.
- PgBouncer — A lightweight connection pooler for PostgreSQL.
- pgaudit — For enabling comprehensive audit logging in PostgreSQL.
Monitoring and Observability Tools
- Prometheus — For monitoring system metrics (CPU, memory, etc.)
- Grafana — For real-time visualization of metrics and setting up alerts.
- Postgres Exporter — For exporting PostgreSQL metrics to Prometheus.
Kubernetes-Specific Tools
- Kubernetes StatefulSets — For deploying stateful applications like databases.
- Kubernetes Secrets — For securely storing sensitive information (e.g., database credentials).
- Network policies — For isolating the database from other deployments.
- Kubernetes Probes — For setting up health checks (readiness and liveness probes).
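To make these building blocks concrete, here is a minimal sketch of how they fit together in a single manifest. This is illustrative only: the name, image tag, storage size, and Secret name are placeholders, not our actual configuration.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          envFrom:
            - secretRef:
                name: postgres-credentials   # credentials kept in a Kubernetes Secret
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi   # placeholder; size to your data volume
```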
Okay, let’s get back to the story.
Hosting a database in Kubernetes required meticulous planning and execution. Here’s how I tackled each challenge:
1. Migration Challenges
- Avoiding Data Errors: Migrating from managed databases to self-hosted solutions required precision. I ensured no errors or duplicate data by performing incremental data dumps and running rigorous data validation scripts.
- Roles and User Management: Transitioning user roles and permissions was critical. I recreated roles, ensuring compatibility with our applications.
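The core of those validation scripts was comparing what landed in the target against the source. A minimal sketch of the idea, with hardcoded dicts standing in for the per-table row counts you would actually pull from each database (via `SELECT count(*)` or `pg_stat_user_tables`):

```python
# Sketch: flag any table whose row count differs between source and target.
# In real use, `source` and `target` come from queries against both databases;
# the dicts below are illustrative stand-ins.

def diff_row_counts(source: dict, target: dict) -> list:
    """Return (table, source_count, target_count) tuples for every mismatch."""
    mismatches = []
    for table, count in source.items():
        if target.get(table) != count:
            mismatches.append((table, count, target.get(table)))
    # Tables that exist only in the target are also suspicious.
    for table in target.keys() - source.keys():
        mismatches.append((table, None, target[table]))
    return mismatches

source = {"users": 10_000, "orders": 52_340}
target = {"users": 10_000, "orders": 52_339}  # one row lost in transit
print(diff_row_counts(source, target))
```

Row counts alone will not catch corrupted values, so in practice this is paired with checksums or spot queries on critical tables.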
2. Performance Tuning
Managing a database in Kubernetes necessitated deep attention to database performance:
- Start Big, Scale Down: I started with high-capacity instances and reduced resources gradually as usage patterns stabilized.
- Monitor Checkpoints and WAL Points: Proper checkpointing reduced disk I/O spikes, and monitoring Write-Ahead Logs (WAL) helped prevent replication lag.
- Choose Throughput and IOPS Wisely: I provisioned storage carefully, balancing cost with performance.
- Optimize Database Parameters: Using tools like PGTune, I optimized critical PostgreSQL settings:
max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 5242kB
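The values above are what a rule-of-thumb calculator produces for roughly a 16 GB node. A rough sketch of those heuristics (illustrative only; PGTune's actual formulas vary by workload type and PostgreSQL version):

```python
# Sketch of common sizing heuristics for a dedicated PostgreSQL node.
# These are rules of thumb, not PGTune's exact algorithm.

def suggest_settings(ram_gb: int, max_connections: int = 200) -> dict:
    ram_kb = ram_gb * 1024 * 1024
    shared_buffers_kb = ram_kb // 4        # ~25% of RAM for the buffer pool
    effective_cache_kb = ram_kb * 3 // 4   # ~75% of RAM assumed cached (incl. OS cache)
    # work_mem: split the remaining RAM across potential concurrent sort/hash
    # operations, with headroom for parallel workers.
    work_mem_kb = (ram_kb - shared_buffers_kb) // (max_connections * 3) // 4
    return {
        "shared_buffers": f"{shared_buffers_kb // (1024 * 1024)}GB",
        "effective_cache_size": f"{effective_cache_kb // (1024 * 1024)}GB",
        "work_mem": f"{work_mem_kb}kB",
    }

print(suggest_settings(16))  # matches the values listed above for a 16 GB node
```

Note that `work_mem` is per sort/hash operation, not per connection, which is why the divisor is deliberately conservative.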
3. Operational Best Practices
- Observability: Monitoring CPU, memory, and storage usage was essential. I relied on tools like Prometheus and Grafana for real-time insights.
- Connection Pooling: PgBouncer helped manage connections efficiently, reducing contention and improving throughput.
- Audit Logging: Using pgaudit, I enabled comprehensive audit logging for better observability and compliance.
- Vacuum and Analyze: I automated vacuuming and analyzing through PostgreSQL’s autovacuum settings to reclaim space and keep query statistics up to date.
- Health Checks: Kubernetes probes ensured the database was always responsive, restarting pods when necessary.
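As a sketch, those health checks can be expressed as exec probes running `pg_isready`; the timings below are illustrative and should be tuned to your workload:

```yaml
# Illustrative container probes for the PostgreSQL container.
livenessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  initialDelaySeconds: 5
  periodSeconds: 5
```

A failed liveness probe restarts the pod; a failed readiness probe only takes it out of the Service endpoints, which is the safer default while the database is recovering or replaying WAL.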
4. Backups and Recovery
- Automated Backups: I set up pgBackRest and WAL-G to automate backups and enable point-in-time recovery.
- Anomaly Alerts: Alerts for unusual activity, such as high WAL generation or long-running queries, helped maintain stability.
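For reference, a pgBackRest setup at its simplest is a small config file plus WAL archiving on the PostgreSQL side. The stanza name, paths, and retention below are placeholders, not our production values:

```ini
# Illustrative pgbackrest.conf
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
start-fast=y

[main]
pg1-path=/var/lib/postgresql/data
```

On the PostgreSQL side, setting `archive_command = 'pgbackrest --stanza=main archive-push %p'` ships each completed WAL segment to the repository, which is what makes point-in-time recovery possible between full backups.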
5. Security Considerations
- Restricted Exposure: The database was never exposed to the internet, and all access was secured through private networking.
- User Permissions: I adhered to the principle of least privilege, carefully managing user roles.
- Secrets Management: Kubernetes Secrets ensured sensitive information like credentials was stored securely.
- Audit Logging: Capturing access and query logs provided both security and operational insights.
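The network isolation piece maps naturally to a NetworkPolicy. A minimal sketch, where the `app: postgres` and `role: backend` labels are illustrative, allowing only labelled application pods to reach port 5432:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-isolation
spec:
  podSelector:
    matchLabels:
      app: postgres        # applies to the database pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: backend   # only application pods may connect
      ports:
        - protocol: TCP
          port: 5432
```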
6. Cost Optimization with Reserved Instances
To further reduce costs, I transitioned to Reserved Instances for the Kubernetes nodes hosting our database workloads. This strategy cut costs significantly while maintaining predictable performance. Combining Reserved Instances with resource right-sizing yielded substantial savings.
7. Upgrades and Maintenance
Upgrading PostgreSQL was a critical aspect of this journey:
- Reading Release Notes: Before every upgrade, I studied release notes to identify breaking changes, such as changes in authentication mechanisms.
- Extension Compatibility: I verified that extensions like pg_stat_statements and pgaudit were compatible with the target PostgreSQL version before upgrading.
- Planned Downtime: Upgrades were carefully scheduled to minimize disruption.
- Testing Upgrades: Each upgrade was first tested in a non-production environment to ensure compatibility.
The Trade-offs
Hosting a database in Kubernetes has its downsides:
- Lack of High Availability: Our setup was not designed for HA, which we accepted as a trade-off for cost savings.
- Operational Overhead: Managing backups, monitoring, and upgrades required a well-defined checklist and standard operating procedures (SOPs).
The Benefits
Despite the challenges, the benefits were significant:
- Cost Savings: We achieved a 70% reduction in costs compared to managed solutions.
- Full Control: Self-hosting allowed us to fine-tune every aspect of the database, from parameters to backup schedules.
- Faster Restarts: Without the constraints of managed environments, database restarts were faster.
- Customizations: We had full access to WAL logs and could implement custom workflows.
Lessons Learned
- Plan Your Migration: Data integrity and role management are critical when transitioning from managed to self-hosted solutions.
- Performance Monitoring is Crucial: Regularly tune parameters, monitor long-running queries, and adjust resources based on load.
- Prioritize Security: Limit exposure and enforce strict user permissions.
- Prepare for Operational Overhead: Develop clear SOPs to manage backups, upgrades, and incidents.
- Embrace Observability: Invest in robust monitoring to stay ahead of potential issues.
- Test Thoroughly: Every migration, upgrade, or configuration change must be tested.
This journey has been a blend of challenges and rewards. Self-hosting a database may not be for everyone, but for our high-volume startup, it provided the cost control and customization we needed to thrive.
Thanks for reading my blog. Feel free to hit me up for any AWS/DevOps/Open Source-related discussions.
Manoj Kumar — LinkedIn.