Consistent Solutions


Blog Posts

K8s Provisioning on AWS with Kubespray


Stand up a cluster with kubespray in AWS with minimal knowledge of AWS, Terraform, or Ansible. First, we'll provision infrastructure in AWS with Terraform. Next we'll set up a bastion and provision a K8s cluster with Ansible. To confirm success, we'll then run kubectl commands to verify things are healthy, then clean up by deleting the cluster.

November 21, 2019



  (0) Provision infrastructure in AWS with terraform
    - create a t2.medium EC2 Ubuntu 16.04 instance in the default VPC of us-east-2 (default VPC or not, ensure it's in a subnet that's reachable from your aws setup)
    - ensure you grant it a pem key (be sure to save it and chmod 400 it!)

Once ssh'd into the machine (let's call it the terraform machine), become root (sudo su) or prefix the following with sudo, then run:
$ apt-get update

$ apt-get install awscli -y

$ apt-get install unzip -y

$ wget https://releases.hashicorp.com/terraform/0.12.2/terraform_0.12.2_linux_amd64.zip

$ unzip terraform_0.12.2_linux_amd64.zip

$ mv terraform /usr/local/bin/

$ git clone https://github.com/kubernetes-sigs/kubespray.git


# update the distro to be Ubuntu in variables.tf
# NOTE: see the instructions in kubespray/contrib/terraform/aws/README.md:
# for Ubuntu 16.04, replace the 'data "aws_ami" "distro"' block in variables.tf with:
data "aws_ami" "distro" {
  most_recent = true

  filter {
    name = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-*"]
  }

  filter {
    name = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"]
}

Following the README instructions, update credentials.tfvars.
Be sure the .pem extension is excluded from the SSH key name value!

$ cd kubespray/contrib/terraform/aws

$ terraform init

$ terraform plan --var-file="credentials.tfvars"

$ terraform apply --var-file="credentials.tfvars"


# NOTE: enter yes to approve the apply
# NOTE: save the terraform output, as we'll need the inventory later
# NOTE: to delete the resources after the next step, run terraform destroy --var-file="credentials.tfvars"

The next step assumes the instances have been set up with terraform.

  (1) Setup Bastion and Run Ansible to Provision the K8s Cluster
    ssh into one of the created bastions and run the following cmds:

$ sudo su

$ apt-get update

$ apt-get install unzip git vim wget python-pip jq -y

$ pip install ansible==2.6.5


Add the pem key for the machines to the bastion as id_rsa and chmod 400 it.

$ git clone https://github.com/kubernetes-sigs/kubespray.git

$ cd kubespray

$ pip install -r requirements.txt


Create kubespray/inventory/hosts from the hosts portion of the terraform output (the [all] section through the load balancer entries); remove the bastion entries and the bastion host group from the hosts file, since we're already running from the bastion.

In kubespray/ansible.cfg, set private_key_file=<path-to-key>/id_rsa

$ ansible-playbook -i ./inventory/hosts ./cluster.yml -e ansible_user=ubuntu -b --become-user=root --flush-cache

      Coffee break :) !

  (2) The cluster has been provisioned.
    Log in and run some kubectl cmds to verify things are healthy (kubectl get nodes, kubectl get po --all-namespaces, etc.)

Once you're satisfied, delete the instances from the terraform machine with the following cmd (from the kubespray/contrib/terraform/aws dir):

$ terraform destroy --var-file="credentials.tfvars"

Helpful Monitoring Tips for your Kubernetes Clusters



November 25, 2019




In most projects, it seems monitoring comes last. It's an afterthought. The focus always seems to be on the tangible benefits, like feature development and what the end user sees. That's great, but as the codebase grows, so does complexity. Every line of code incurs a debt. To monitor that debt effectively, we need observability into the system, with some type of filter that only alerts us on relevant issues so that we can take immediate action. Imagine a car with no gauges for low fluids. As you can imagine, there'd be plenty of frustrated drivers and broken-down vehicles. Similarly, for a healthy software ecosystem, we as engineers need to ensure we're monitoring and alerting effectively.

Now introduce Kubernetes, a container platform that encapsulates operations best practices out of the box (scalability, fault tolerance, disaster recovery, etc.). With the many provisioners available, standing up Kubernetes is becoming easier and almost without thought. But every boat or elevator has a weight limit, and so does a Kubernetes cluster. As more tenants onboard or as the functionality of the workloads grows, so does the resource consumption.

In this article we'll cover some crucial aspects to monitor and alert on in our Kubernetes (or OpenShift) clusters. We'll use PrometheusRule expressions to illustrate tangible, alertable expressions. Keep in mind, though, that the expressions aren't as important as the bigger picture: how to effectively monitor node metrics, disk capacity, and node state in a cloud-agnostic and opensource manner.


Single Pane of Glass for Metrics, Monitors, Alerts - opensource and cloud agnostic

If you've considered Kubernetes, the ability to be cloud agnostic is most likely part of the appeal. Taking that same approach to our metrics, monitors, and alerts, consider Prometheus and AlertManager. If you haven't already added Prometheus to your K8s cluster, check out the Prometheus Operator.

The beauty of K8s is that it lets custom resources be treated like native building blocks with the k8s cli. The custom resource that Prometheus leverages is called a PrometheusRule, which contains the logic for which metrics to alert on; the resulting alert is then sent via AlertManager. Whether or not you're leveraging Prometheus and PrometheusRules, these are some things to keep eyes on in day-2 operations pertaining to compute, storage, and system state: memory and cpu, free disk left on the nodes and persistent volumes, and inoperable node-level state.


Memory/CPU

Every application consumes memory and cpu. With insufficient resources, expect throttled behavior and instability at the least. To get a glimpse into the resource consumption of our worker bee, the k8s node, the entity where our applications live, we can monitor available memory and cpu with the appropriate PrometheusRule expressions: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20 and 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80. The first will enter an alerting state when any node's memory usage surpasses 80% (i.e. less than 20% available); the second will enter an alerting state when any node's cpu usage surpasses 80%. Both would be expressions for PrometheusRules, configurable via the helm chart.

A general rule of thumb is to have cpu/mem consumption average 50-60% of the nodes' capacity, but it ultimately depends on the types and purposes of the workloads that will be running on your k8s nodes (a side tangent: be sure to performance test your apps!). If you know you have different types of workloads, some cpu intensive, some memory intensive, others gpu intensive, be sure to take that into consideration. Resource requests/limits allow the scheduler to place pods on the appropriate nodes, so as you can imagine, understanding your apps' performance needs comes in handy here. Another lesson learned: the system-reserved properties on the kubelet come in handy to ensure the kernel has enough resources reserved when the node is constrained. At the end of the day, placing the alert at a responsible threshold for system-level resources offers the ability to investigate overcommitted nodes and blips in node performance, and the opportunity to take action before things get evicted.
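For reference, here is how those two expressions might be packaged as a PrometheusRule custom resource. This is a minimal sketch: the rule name, labels (which must match your Prometheus Operator's ruleSelector), severities, and durations are illustrative and should be adapted to your setup.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-resource-alerts        # illustrative name
  labels:
    release: prometheus-operator    # must match the Prometheus ruleSelector labels
spec:
  groups:
  - name: node-resources
    rules:
    - alert: NodeMemoryPressure
      # fires when less than 20% of a node's memory is available (usage above 80%)
      expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20
      for: 5m
      labels:
        severity: warning
    - alert: NodeCpuPressure
      # fires when a node's cpu usage averages above 80% over the last 5 minutes
      expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning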


Disk

Disk can be a nightmare when not properly managed. Ever experience the situation where your laptop has exceeded its storage capacity? In some cases it's unrecoverable and requires the system to be wiped and reinstalled. Be sure you're backing up state for your cluster and cluster components (velero, cough cough). That said, when disk drives fill up, things fall over. For example, if Kafka's disk fills up, the fancy realtime ML streaming platform may come to a screeching halt. The Docker daemon is known to leak memory as well as to use unnecessarily high levels of disk. Docker writing to the root disk of the node can cause a node to go unresponsive, especially if docker isn't managing all of its open files appropriately. If etcd isn't properly managed (use compaction!) and disk creeps up on it, say bye-bye etcd. The list goes on. So the goal here is to monitor node-level disk AND application-level disk using a consistent pattern that's cloud agnostic. A beautiful solution is, you guessed it, leveraging metrics exposed to Prometheus. Two expressions come in handy here:

expr: (node:node_filesystem_usage:) * 100 > 80 and expr: sum by(job, persistentvolumeclaim) ((kubelet_volume_stats_capacity_bytes - kubelet_volume_stats_available_bytes) / kubelet_volume_stats_capacity_bytes * 100) > 80. The first expression alerts when any drive on a node exceeds 80% capacity. The second expression alerts if ANY persistentVolume in the cluster exceeds 80%. This is beautiful: the expressions are completely cloud and application agnostic! No need for a separate solution for each type of persistentVolume or each cloud's node drives - I love KISS, keeping it simple [and] succinct.


Node Level Status

For anyone who has at least played with k8s, the message "Node Not Ready" will be all too familiar. When this message pops up, expect applications on the node in question to move around. Depending on the workloads in the cluster and the amount of extra capacity available to absorb a downed node, it could be a domino effect. Two aspects to monitor are (1) when a node is not in the "Ready" state and (2) when the node is unschedulable. Here are the PrometheusRules with the appropriate expressions: expr: sum by (node) (kube_node_status_condition{condition="Ready",status!="true"}) > 0 and expr: sum by (node) (kube_node_spec_unschedulable) > 0. The first will alert when any k8s node is not in the Ready state, e.g. 'Unknown' or 'NotReady'. The second will alert if a node is unschedulable. Note that a node can be healthy and in the "Ready" state, yet unable to receive pods from the scheduler because it is "unschedulable". Any time an automated script fails while cordoning/uncordoning a node, or a system administrator simply forgets to uncordon a node after maintenance in the wee morning hours, this alert will come in handy.


Conclusion

This article covers valuable, cloud-agnostic expressions built on an opensource technology that can empower the administrators of your k8s platform. Even if the PrometheusRule expressions don't make sense at first glance, the concepts are what matter. Consider a means for alerting on node metrics, disk capacity, and node state to gain confidence in your containerized platform of choice. Knowledge is power. Using Kubernetes effectively is empowering; the key word is effectively. By having insight and observability into the day-to-day operations of the platform, you'll be able to maintain a well-managed Kubernetes ecosystem.


AWS - Enlarging EBS Volumes for EC2



November 30, 2019




If the EBS volumes are associated with a pv in k8s that's been created by a storageClass, then check out the allowVolumeExpansion flag. The rest of this article discusses the hidden tech debt of increasing the EBS volume size in the console.

You ever experience a problem enough times you find value in writing it down for future reference?

This is one of those times. When increasing the EBS volume size in the Amazon console (or aws cli), the machine that the EBS volume is attached to won't register the new size. So, to ensure the EC2 instance reflects the actual disk size, here are some commands:

(0) go to the aws console and identify both the machine the EBS volume is attached to and the device name given to the attached EBS volume, e.g. /dev/sda1.

- NOTE: from the EC2 instance's perspective in step (2), the drive named /dev/sda1 in the console will appear as /dev/xvda1

(1) ssh into the instance the EBS volume is attached to

(2) run the command: df -h. The EBS volumes will look something like /dev/xvda1 or /dev/xvd...

(3) identify the EBS volume to increase from the EC2's perspective

(4) for the EBS volume identified, run the command: resize2fs <volume-here>, e.g. resize2fs /dev/xvdbf (if the volume is partitioned, e.g. /dev/xvda1, you may need to grow the partition first with growpart; for XFS filesystems use xfs_growfs instead of resize2fs)

(5) Validate that the new 'Avail' disk size is reflected: df -h | grep <volume-here>, e.g. df -h | grep xvdbf

(6) If applicable, update any automation to ensure it reflects the change in the storage size!


Combatting the K8s PLEG


When it rains, it pours. The PLEG bug has been around from k8s 1.6 to 1.15. Although rare, when it occurs, half the cluster can go down within minutes. Either upgrade your k8s cluster to mitigate the risk of the PLEG, or manage it properly.

December 4, 2019




Over the past several years of managing production k8s clusters, this has always come up at unexpected times. In the handful of times I've seen it, the impact can be catastrophic; in minutes, k8s nodes can fall over like dominoes. Imagine a k8s cluster with applications that provide client health information or perform financial transactions, and half the cluster goes down. What then? Not the most pleasant predicament to be in. Fortunately, the PLEG issue is said to be resolved in the more recent k8s 1.16. I'm always amazed by the k8s community! Unfortunately, that leaves all other clusters still running versions 1.6 to 1.15 exposed to the PLEG issue. As a k8s administrator, I know first hand the difficulty associated with upgrading a cluster in place or migrating stateful workloads to a newer-versioned k8s (or OpenShift) cluster. This article is for those who don't have the option of easily upgrading their cluster. Of course, it's highly recommended to upgrade your cluster, but here's an interim solution until then. TL/DR: the solution I've seen come up at various organizations is adding a script that watches for docker timing out and restarts docker accordingly.


In short, it comes down to a script monitored by systemd, with push-based notifications. The script, let's call it pleg-preventer.sh:

pleg-00


This watches for docker to become unresponsive using a timeout command. If docker takes too long to return any result, the script restarts docker and sends a notification to an alert channel whenever it takes any action. The systemd file, pleg-preventer.service, is as follows:

pleg-01

If leveraging AWS to host your k8s cluster, cloudwatch and sns are an obvious choice for managing custom metrics and forwarding along alerts. If on Google Cloud, Stackdriver. If on-prem, use your notification tool of choice.


As a systems administrator, automation is key. Being able to reproduce results in an idempotent fashion is the ideal scenario. This snippet would be contained in an ansible role that has the pleg-preventer files in its files directory.

pleg-02
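As a rough sketch (the file names, destination paths, and task names are illustrative), the role's tasks might look something like the following, assuming pleg-preventer.sh and pleg-preventer.service sit in the role's files directory:

# tasks/main.yml of the pleg-preventer role (illustrative sketch)
- name: copy the pleg-preventer script
  copy:
    src: pleg-preventer.sh
    dest: /usr/local/bin/pleg-preventer.sh
    mode: "0755"

- name: copy the pleg-preventer systemd unit
  copy:
    src: pleg-preventer.service
    dest: /etc/systemd/system/pleg-preventer.service
    mode: "0644"

- name: enable and start the pleg-preventer service
  systemd:
    name: pleg-preventer
    state: started
    enabled: yes
    daemon_reload: yes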

If one were to add automation for sending to a custom sns topic, consider leveraging jinja and the template module in ansible.


In conclusion, this PLEG issue has been observed from k8s versions 1.6 to 1.15, and the fix has been incorporated into k8s v1.16. Considering the difficulty of upgrading or migrating stateful workloads to a newer cluster (depending on how the cluster has been provisioned), this solution has been vetted on production-grade clusters for a number of years at various institutions and can be leveraged until a v1.16 or later cluster is in place.


Safely Handling PII Data with K8s


Working with sensitive data requires a high moral responsibility towards your customers. Take advantage of what k8s has to offer without relenting on security or privacy of customer PII data.

December 8, 2019




Working with sensitive data requires a high moral responsibility towards your customers. In the world of container platforms, the level of effort to manage dynamic and complex workloads has drastically decreased. A system like k8s consolidates years of platform best practices into a single offering. To leverage those best practices while committing to the security and privacy of your customer data, consider these approaches for encryption at rest, encryption in transit, and properly managing PII data in logs.


Encryption at Rest

If using GCP, this is easy, as all disks are encrypted by default. When using AWS to host your cluster, be sure the EBS volumes for the nodes are encrypted. For the applications residing on k8s (aka pods), make sure the underlying disk is encrypted. If the aws cloud provider isn't being leveraged to dynamically create EBS volumes, highly consider leveraging it to reduce the responsibility on your k8s administrators! The parameter encrypted: "true" can be added to the storageClass to ensure the EBS volumes tied to the pv's are encrypted. K8s assumes IaaS; that said, it is very nice to have stateful workloads such as databases maintained by the cloud provider of choice. If ElastiCache, RDS, or other managed cloud products are leveraged, be sure the underlying drives for those services are encrypted. From an auditing and enforcement perspective, check out cloud custodian to enforce encryption. It's not what you preach, but what you tolerate - stop any configuration mishap early with the proper enforcement tools.
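For example, a storageClass along these lines (a minimal sketch; the class name is illustrative and the other parameters should match your environment) would make every dynamically provisioned EBS-backed pv encrypted:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-encrypted             # illustrative name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  encrypted: "true"               # EBS volumes created from this class are encrypted
reclaimPolicy: Retain
volumeBindingMode: Immediate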


Encryption in Transit

Configuration of ssl traffic within the cluster (e.g. custom certs vs. certs created per cluster, etc.) will depend on the provisioner used. For ingress objects, use the tls property with a k8s secret holding the ssl certificate. One aspect that should be considered is leveraging PodSecurityPolicy to restrict hostPorts. It may be tempting as a developer to open a port to the application via a hostPort to test it without having to create an ingress object and a secret containing the tls certs; HOWEVER, this can bypass tls, exposing a plain http endpoint to the application. As a system administrator, adding a podSecurityPolicy that restricts usage of hostPorts is an extra insurance policy towards enforcing encryption in transit. Similar to the point about externally managed services under encryption at rest, highly consider enforcing encryption-in-transit configuration for any externally managed services with a tool like cloud custodian.
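A podSecurityPolicy in that spirit might look roughly like the following (a sketch, not a drop-in policy: the name and allowed volume types are illustrative, and the policy still has to be granted to service accounts via RBAC to take effect):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-no-host-ports   # illustrative name
spec:
  privileged: false
  hostNetwork: false
  hostIPC: false
  hostPID: false
  # no hostPorts ranges are listed, so host ports are not allowed
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - configMap
  - secret
  - emptyDir
  - persistentVolumeClaim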


Application Logs

In a vanilla k8s cluster, more cases than not, logs are being sent to elasticsearch. Elasticsearch most likely won't have authentication, meaning anyone who's able to hit the elasticsearch/kibana UIs will be able to see all the logs. Thus, at minimum, one should consider fronting the ingress object with an oauth proxy that prompts a user to log in (assuming the elasticsearch stack is hosted within the cluster). A login can appease the need for authentication; authorization, however, is another complexity. With most k8s installers, it's even less likely there will be any form of authorization within elasticsearch out of the box. This means that if an application emits sensitive customer data in its logs (without properly cleaning them), anyone who is able to authenticate and log in to elasticsearch will be able to see the PII data in plain text. Hence the obvious need for properly managing customer PII data. There are third party tools that add a layer of authz to elasticsearch; they'll add additional cost and an unexpected level of complexity for managing logs. For an opensource solution for incorporating authz into elasticsearch, consider opendistro.

For managing PII data in the logs, either manage the data within the app (scrub in memory) or manage the logs with a sidecar container. For the former, have the app properly encrypt, tokenize, or scrub data in memory before emitting to the app's log file(s). Regarding the latter: for a vanilla k8s install, the log aggregator (e.g. fluentd or fluentbit) will be running on all the nodes as a daemonset. The daemonset will have an associated config for handling the logs. If one of your tenant teams needs to add a fluentd filter to modify how their logs are handled, consider having them send the logs to a file other than stdout while running a fluentd sidecar container for their application. The custom log-file logic will be handled exclusively by the sidecar container. This way, the tenant team can manage their own fluentd or fluentbit configuration without impacting the config leveraged by every other tenant. If you know your application emits sensitive information within that log file, consider fluentd plugins like record reformer or the anonymizer to scrub or encrypt data in your logs.

Properly managing and cleaning logs is an important element that's easy to miss. By encrypting or scrubbing sensitive fields emitted in the logs and adding a layer of authn/authz to elasticsearch, we can ensure we have the proper gates and controls in place to protect customer data within the k8s ecosystem.
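To make the sidecar idea concrete, here's a rough sketch of a deployment with a fluentd sidecar reading the app's log file from a shared emptyDir (the names, images, paths, and the configMap holding the tenant's fluentd config are all illustrative assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-log-sidecar            # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-with-log-sidecar
  template:
    metadata:
      labels:
        app: app-with-log-sidecar
    spec:
      containers:
      - name: app
        image: registry.example.com/app:1.0.0   # placeholder image
        volumeMounts:
        - name: app-logs
          mountPath: /var/log/app               # the app writes its log file(s) here instead of stdout
      - name: fluentd-sidecar
        image: fluent/fluentd                   # placeholder; pin the fluentd/fluent-bit image your platform standardizes on
        volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
          readOnly: true
        - name: fluentd-config
          mountPath: /fluentd/etc               # tenant-managed fluent.conf with scrubbing/anonymizer filters
      volumes:
      - name: app-logs
        emptyDir: {}
      - name: fluentd-config
        configMap:
          name: app-fluentd-config              # holds the tenant's fluentd configuration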


Conclusion

In a tech world of ever evolving tools, k8s and containers are a hot commodity. A tool like k8s is revolutionizing platforms and becoming a main staple in organizations across all industries. To leverage the beauty of k8s and container platforms while ensuring a commitment to customer privacy and security, proper procedures and guardrails should be adopted. This article addresses encryption within the cluster and cloud provider levels, enforcement of encryption configuration, and a few ideas for properly managing PII data in application logs.


Velero Tips - Testing & Operationalizing K8s Disaster Recovery


To back up K8s or OpenShift clusters, as well as the underlying persistent volumes of the applications running within the cluster, consider Velero. In this article, we'll focus on testing and operationalizing features of Velero from a system administrator's perspective.

December 12, 2019




In k8s or openshift, etcd contains the state of the cluster. For disaster recovery (DR) purposes, one can back up etcd using etcdctl commands; however velero (previously known as heptio ark) offers this functionality plus backups and restores of persistent volumes in an agnostic way.
What this article isn't: basic setup docs or basic cli usage. For that, see the stable velero helm chart and the velero cli docs.
What this article will focus on: setup and testing of velero backup and restore from a system administrator's perspective.


Helpful tips before we get started:

- to leverage the backups and restore of underlying disk, see the volumeSnapshotLocation property in the velero helm values file.

- if leveraging velero to back up and restore to other clusters, the CLUSTERNAME property will come in handy.
    - info here for v1.1.0. Be sure to double check with the version being used!

- if you don't have a tool like kiam that restricts pod permissions, then check it out. Otherwise, be sure to grant the velero pod least-privileged permissions
    - here is the permissions info for v1.1.0. Be sure to double check with the version being used!

- to back up AWS EFS, one will need restic: issue here

- upgrade to latest velero version for backup optimizations (v1.2.0 as of time of writing)

- once velero is up and running, consider adding a backup schedule with cron AND alerts should any backup fail


Stateful application for testing purposes

(assumes a namespace of test-mysql and storageclass of AWS EBS with gp2 type)

Note: Depending on where you're testing velero, the storageclass specs will change. Below are (a) the yaml for deploying mysql and (b) commands for populating the database for the testing scenarios

(a) THE YAML:


---
apiVersion: v1
kind: Namespace
metadata:
  name: test-mysql

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
parameters:
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Retain
volumeBindingMode: Immediate

---
apiVersion: v1
kind: Service
metadata:
  name: wordpress-mysql
  namespace: test-mysql
  labels:
    app: wordpress
spec:
  ports:
    - port: 3306
  selector:
    app: wordpress
    tier: mysql
  clusterIP: None

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: test-mysql
  name: mysql-pv-claim
  labels:
    app: wordpress
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: gp2

---
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: wordpress-mysql
  namespace: test-mysql
  labels:
    app: wordpress
spec:
  selector:
    matchLabels:
      app: wordpress
      tier: mysql
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: wordpress
        tier: mysql
    spec:
      containers:
      - image: mysql:5.6
        name: mysql
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-pass
              key: password
        ports:
        - containerPort: 3306
          name: mysql
        volumeMounts:
        - name: mysql-persistent-storage
          mountPath: /var/lib/mysql
      volumes:
      - name: mysql-persistent-storage
        persistentVolumeClaim:
          claimName: mysql-pv-claim

---
apiVersion: v1
kind: Secret
metadata:
  name: mysql-pass
  namespace: test-mysql
  labels:
    app: wordpress
type: Opaque
data:
  username: YWRtaW4=
  password: YWRtaW4=

(b) kubectl exec into the created mysql pod and run the following:
    - connect to mysql: mysql -u root -p, then type the password: admin
    - create the database (only needed once): CREATE DATABASE sampledb;
    - use the database once created: USE sampledb;
    - create the table: CREATE TABLE IF NOT EXISTS equipment ( equip_id int(5) NOT NULL AUTO_INCREMENT, type varchar(50) DEFAULT NULL, install_date DATE DEFAULT NULL, color varchar(20) DEFAULT NULL, working bool DEFAULT NULL, location varchar(250) DEFAULT NULL, PRIMARY KEY(equip_id) );
    - insert a row into the table: INSERT INTO equipment (type, install_date, color, working, location) VALUES ("Slide", Now(), "blue", 1, "Southwest Corner");
    - verify the data: SELECT * FROM equipment;
EXPECT TO SEE:

+----------+-------+--------------+-------+---------+------------------+
| equip_id | type  | install_date | color | working | location         |
+----------+-------+--------------+-------+---------+------------------+
|        1 | Slide | 2017-03-24   | blue  |       1 | Southwest Corner |
+----------+-------+--------------+-------+---------+------------------+
1 row in set (0.00 sec)


Testing scenarios to consider

(0) backup whole cluster, delete some namespaces and recover (not testing the deletion of the underlying disk for any pv's)
    - velero create backup whole-cluster-test --snapshot-volumes
    - delete namespaces or resources to recover, e.g. kubectl delete ns istio, kubectl delete ns test-mysql
    - velero restore create whole-backup-expected --from-backup whole-cluster-test or velero restore create specific-namespace --from-backup whole-cluster-test --include-namespaces=<desired-namespace>
    - verify the deleted resources are now restored
(1) backup with specific label/namespace, delete the k8s resources, recover with label/namespace
    - tip to start with: velero create backup test-selector-backup --selector app=wordpress
(2) backup EBS, delete the app and EBS, recover, and verify the EBS volume has the proper tags AS WELL AS that the data is preserved, using the cmds from (b)
    - tip to start with: velero backup create mysql-persistent --selector app=wordpress --snapshot-volumes
(3) create a schedule and verify a backup has been created from the cron (an equivalent Schedule manifest is sketched after this list)
    - velero create schedule whole-cluster-backup --schedule "00 7 * * *"
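For reference, the schedule from (3) could also be declared as a Schedule custom resource. A minimal sketch, assuming velero was installed into the velero namespace (the name and ttl are illustrative):

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: whole-cluster-backup
  namespace: velero            # the namespace velero was installed into
spec:
  schedule: "00 7 * * *"       # daily at 07:00
  template:
    includedNamespaces:
    - "*"
    snapshotVolumes: true
    ttl: 720h0m0s              # keep each backup for 30 days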


Conclusion

This article outlines tips for leveraging velero (formerly ark) to its full potential. It also offers common test scenarios, in particular, a scenario to successfully back up and restore a stateful application with velero.


Day 2 Operations with K8s Best Practices


Whether you're someone just getting into k8s, someone who has just provisioned a cluster, or a seasoned k8s veteran looking for ideas on improving their k8s setup - this article goes over best practices.

December 17, 2019




So you've either created a cluster and are thinking "now what?", you're looking into setting up k8s and aren't sure of the next goals once the cluster is up, or you've been running a k8s cluster in production and are wondering how to improve your k8s ecosystem. To take your k8s platform to the next level, or simply to consider best practices with k8s, review the options below for enhancing your k8s cluster with cluster alertables, improved security, CI/CD offerings to your tenants, and general best practices.

Cluster Level

- rbac
- HA and enough instances/resources for control plane
    - etcd will handle up to (n-1)/2 failures
- horizontal scaling in place
- tuned to expected load, with resource guard rails in place: resourcequotas and limits (see the sketch after this list)
- for cluster automation, use infrastructure-as-code with an auditing tool, and alert on any differences
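As a reference for the resource guard rails mentioned above, a per-tenant-namespace ResourceQuota and LimitRange might look roughly like this (a sketch; the namespace name and all numbers are illustrative and should come from your own capacity planning):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a            # illustrative tenant namespace
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "10"

---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-defaults
  namespace: tenant-a
spec:
  limits:
  - type: Container
    default:                     # applied when a container omits limits
      cpu: 500m
      memory: 512Mi
    defaultRequest:              # applied when a container omits requests
      cpu: 100m
      memory: 128Mi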


Security
- privileges w/ kiam or equivalent
- network policies
- only ssh into nodes through vpn or bastion
- pod security policies
- authn/authz for all applications
    - authn with something like oauth proxy
    - authz may not be supported for a given app
- encrypt k8s secrets

Dashboard
- remove the skip button
- add a ttl (as a parameter)
- lock down permissions with rbac

Disaster Recovery
- velero backups
- outline strategy for failover

CI/CD
- [repeatable] pipeline in place for tenants
- consider a gitops process with helmRelease objects

Logging
- ensure enough resources for elasticsearch (es) based on retention
    - memory and storage, as es is a memory hog!
    - add a cleaner (see elasticsearch-curator helm chart)
- capture audit log
    - alertables from suspicious activity

Alertables
- cpu/mem
- node notReady or other resources not up
- pv/storage size available
- offer process for tenants to add their own alertables for their pods

K8s Logging with FedRAMP and FISMA High Reqs


Here are the constraints, logging config, and lessons learned from building a k8s platform that obtained an ATO satisfying FedRAMP and FISMA High requirements in AWS GovCloud.

December 24, 2019




Must-haves
- 180 days of queryable logs
- long-term storage of logs (usage of S3 and Glacier)
- audit logs, system-level logs, and application logs captured

This assumes a few things:
- audit logging enabled on the apiserver (configurable on the apiserver; most likely an option with your provisioner of choice)
    - a config via kubespray:

  # audit support
  kubernetes_audit: true
  # path to audit log file
  audit_log_path: /var/log/audit/kube-apiserver-audit.log

- a fluentd container with the ability to write to S3, see https://github.com/bfraz/fluentd-s3 for an example
    - an instance profile or usage of kiam for fluentd to write to S3
- rsyslog on all nodes, writing to /var/log/rsyslog.log
- a drop-in config for fluentd via the elastic-stack helm chart
    - assumes the audit log file is at a particular location, which can be overridden


Lessons learned
- based on retention, give ample memory, cpu, and disk resources to elasticsearch (needs will grow as more tenants onboard)
    - IF at all possible, keep elasticsearch external to the cluster or isolate it via taints/tolerations.
       - IF elasticsearch goes down and fluentd/fluentbit isn't able to write to it, logs will fill up disk. If the logs are written to the root volume of the node, nodes can go unresponsive due to filled-up disks. This could cause a cluster outage.
- use elasticsearch curator to remove any unneeded logs

Fluentd Config for the elastic-stack Helm Chart

    configMaps:
      ...
      system.input.conf: |-
        <source>
          @type tail
          # audit log path of kube-apiserver
          path /var/log/kubernetes/audit/kube-apiserver-audit.log
          pos_file /var/log/kubernetes/audit.pos
          format json
          time_key time
          time_format %Y-%m-%dT%H:%M:%S.%N%z
          tag audit
        </source>
        <source>
          @type tail
          @id syslog-k8s
          path /var/log/rsyslog.log
          pos_file /var/log/rsyslog.pos
          <parse>
            @type regexp
            expression /^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$/
            time_format "%b %d %H:%M:%S"
          </parse>
          tag rsyslog
        </source>
      output.conf: |
        # Enriches records with Kubernetes metadata
        <filter audit>
          @id kube_api_audit_normalize
          @type record_transformer
          auto_typecast false
          enable_ruby true
          <record>
            host "#{ENV['K8S_NODE_NAME']}"
            responseObject ${record["responseObject"].nil? ? "none": record["responseObject"].to_json}
            requestObject ${record["requestObject"].nil? ? "none": record["requestObject"].to_json}
            origin kubernetes-api-audit
          </record>
        </filter>

        <match **>
          @type copy
          <store>
            @id elasticsearch
            @type elasticsearch
            ...
          </store>
          <store>
            @type s3
            <instance_profile_credentials>
            </instance_profile_credentials>
            s3_bucket <some-s3-bucket-here>
            s3_region <aws-region-here>
            path logs
            <buffer>
              @type file
              path /var/log/fluentd-s3/s3
              timekey 3600
              timekey_wait 5m
              chunk_limit_size 256m
            </buffer>
            <secondary>
              @type file
              path /var/log/fluentd-s3/s3/some_failed_records
            </secondary>
            time_slice_format %Y%m%d%H%M
          </store>
        </match>


Lessons Learned from FedRAMP High + AWS GovCloud + K8s + FISMA High Reqs


In building out one of the first known FedRAMP K8s clusters on AWS with FISMA High requirements, here is a snapshot of the patterns leveraged in the design and implementation of the platform.

December 27, 2019




Encryption At Rest and In transit

- See our blog on handling PII data with K8s


Logging with RETENTION
- Logging at system level, k8s level, network level and audit level
- Kept logs in elasticsearch with a retention of 180 days, with a backup in S3 with lifecycle policy to Glacier for long-term storage
- See our blog on FedRAMP High with FISMA High Logging Reqs on K8s

Backup and DR
- velero backups of cluster resources and storage level backups
- see our blog on velero


RBAC with LDAP integration - authn and authz
- ldap policies around credentials: password reset after so many days
- separation of concerns with who can deploy, who can develop code, who has admin/edit access to k8s, etc
- authn to all apps in k8s with ui's
- authz to applicable applications within k8s

Onboarding Plan for Tenants - what's a platform with no tenants?
- repeatable plan for monitoring/alertables, logging, CI/CD, etc
- check out our container checklist blog!

Security, Security, Security
- scanning containers for security exploits - consider a tool like twistlock, sysdig, etc
- vpn and bastion to access cluster level resources
- network policies, psps, oh-my

Container Checklist - for Onboarding or New Image Releases


Ever wondered what the steps would be for onboarding an app to K8s? This checklist offers considerations for any new app or for any new feature to an existing app on Kubernetes.

January 1, 2020




Here are questions and considerations to keep in mind when either adding a new containerized app or a new feature to an existing app:

Purpose / Use Case
- What is the use case this new app or new feature of the app offers to customers?
- Any particular technologies or architectures this is dependent on, e.g. streaming, batch, api-gateway, etc.?

Storage
- What kind of storage requirements does your app have?
    - Block based, object based, file based?
    - Read-write many?
    - Size specs?
    - Does it have to be HA?
       - Will backups with velero be sufficient?
    - Is a database or cache used?
       - If so, see networking section
       - If so, who is expected to maintain the database or cache?

Resources
- what's the expected mem/cpu/gpu needed for your app?
- does your app have limits and requests? (see the sketch after this section)
- be sure to performance test your app to have these baselines!
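As a reference for the requests/limits question, a deployment might carry something like the following (a sketch; the name, image, and numbers are purely illustrative and should come from performance testing):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                                    # illustrative
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0.0  # placeholder image
        resources:
          requests:                 # what the scheduler uses for placement
            cpu: 250m
            memory: 256Mi
          limits:                   # hard ceilings enforced at runtime
            cpu: "1"
            memory: 512Mi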

Networking
- does your app require access to any entity outside the cluster?
    - if so, how so: inbound, outbound, or both?
- does your app have a ui?
    - if so, what's the authn/authz for the app?
    - talk to your system administrator for how ssl certificates are managed for endpoints.

DR
- what is the expected SLA of the app?
- in case of failure, what are the remediation steps?
    - are they manual, or automated?
       - if manual, are they clear and concise?
    - who is responsible for failures?
- will you be managing your own backups?

Permissions
- what special permissions does your app need?
    - talk with your system administrator for how roles are managed, e.g. kiam, etc.

CICD
- talk to your system administrator about the CICD process, e.g. helmRelease crds with flux, argo cd, jenkins, etc
- how is configuration managed?
- where is the image stored, e.g. custom registry?

Logging

- does the container write logs to stdout and send errors to stderr?
- does the app handle PII data?
    - If so, does it emit any in the logs? See our PII blog
- is the level of logging (e.g. debug, info, etc) configurable from outside the app?


Metrics
- does your app have a means for liveness and readiness probes?
- does your app have a means of a ServiceMonitor? (see the sketch after this section)
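If the cluster runs the Prometheus Operator, a ServiceMonitor for the app might look roughly like this (a sketch; the labels must match your Prometheus serviceMonitorSelector and your app's Service, and the port name assumes the Service exposes a named metrics port):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                         # illustrative
  labels:
    release: prometheus-operator       # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app                      # matches the labels on the app's Service
  endpoints:
  - port: metrics                      # named port on the Service exposing /metrics
    interval: 30s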

Alertables
- do you have alerts set up for any failures of the app?
- who will be responsible for alerts?

Container Specs
- can your app handle unexpected termination without loss of data?
- is your app expected to scale?
    - If so, vertically or horizontally?
    - Has the app been performance tested to know a baseline for scaling?

EKS Basics and Automated Approach


Review the basics of EKS and learn about an automated approach for provisioning EKS as well as the ecosystem of tools that reside on it, from external-dns to aws-alb-ingress-controller for dynamically provisioning AWS resources in support of applications on EKS.

February 1, 2020




Broadly, EKS is a managed Kubernetes offering on AWS. It manages the control plane of the cluster across multiple AZs and offers flexibility in where the workloads will run, such as on Fargate, EKS Managed Nodes, or EC2. If convenience and quickness of scale are preferred, then Fargate is recommended. If the ability to customize is desired, then EC2 is the go-to. Over the past several years, I've personally worked with various organizations in locked-down environments to provision Kubernetes and optimize workloads across various machines. Managed Kubernetes services lower the bar in complexity and cost, allowing organizations of any size to leverage the power and resiliency of Kubernetes. This article addresses basic concepts along with an automated solution for utilizing AWS EKS.


Before We Begin

For a k8s overview, check out the official Kubernetes documentation. Here’s a great illustration of the various components:

k8s-illustration
Logging, RBAC, & Networking on EKS

Streamlined auditing and logging with CloudTrail and CloudWatch: keep track of calls to the API server via CloudTrail, and leverage the fluentd-cloudwatch plugin to send all logs to CloudWatch. Both offer long-term storage to S3 (where logs can be shipped to Glacier after some amount of time).


There is an AWS IAM Authenticator which allows IAM roles and users to map to K8s RBAC users and groups within the cluster. Although convenient, to stay more cloud-provider agnostic, check out tools like kube2iam and kiam. Kube2iam has been known to not clean up after itself, which could make kiam the more noteworthy option. However, if cost is a concern, consider kube2iam, as a few extra nodes solely for kiam servers won't be required (architecture for running kiam on EKS here).


As far as networking goes, EKS leverages the familiar AWS VPC/subnet concepts. For tighter controls, deploy the worker nodes in private subnets and expose ingress objects via a public-facing load balancer.


Alleviated Costs

The obvious cost of running a cluster is the hourly bill for the control plane, even more if etcd is running on separate machines. A HUGE hidden cost is upgrades and maintenance. With EKS, minor version upgrades can be handled for you, and control plane upgrades can be rolled back in case of failures. See the AWS docs on this for further details. Why pay a premium and endure a steeper learning curve for employees/contractors when managing or deploying Kubernetes, when a managed service reduces the complexity and lowers the bar of deployment?


Automated Solution

A key aspect of devops is automation. An example that leverages terraform to provision an EKS cluster with EC2 worker nodes, and ansible playbooks to provision the tooling that resides on the cluster, can be found here: (https://github.com/consistent-solutions/eks-with-day-two-operations). The terraform provisions the VPC with private and public subnets, then deploys the EKS cluster into the private subnets. It offers a bastion and a workstation; the workstation is set up to interact with the cluster without having to publicly expose the control plane. The ansible leverages helm v3 (no tiller) and community helm charts to deploy kube2iam for pod-level IAM permissions, external-dns for dynamically managing ingress cnames, fluentd-cloudwatch for sending container logs to cloudwatch, velero for backing up cluster workloads and persistent volumes to S3 at regular intervals, prometheus-operator (for deploying grafana, prometheus, and alertmanager for monitoring/alerting), AWS Cognito for authentication to ingress objects, AWS EBS for application stateful workloads, and the aws-alb-ingress-controller to dynamically expose ingress objects with AWS ALBs. This solution offers a native terraform/ansible approach that can be extended to other environments (qa, sandbox, prod, etc). The repo README contains step-by-step instructions for setup and teardown.
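To give a flavor of how those pieces fit together, an ingress picked up by the aws-alb-ingress-controller and external-dns might look roughly like the following (a sketch; the hostname, service name, and annotation values are illustrative and depend on how the controllers were configured):

apiVersion: networking.k8s.io/v1beta1   # extensions/v1beta1 on older clusters
kind: Ingress
metadata:
  name: my-app                          # illustrative
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    external-dns.alpha.kubernetes.io/hostname: my-app.example.com   # external-dns creates this record
spec:
  rules:
  - host: my-app.example.com
    http:
      paths:
      - path: /*                        # ALB ingress controller path convention
        backend:
          serviceName: my-app
          servicePort: 80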


Conclusion

By leveraging EKS, we can reduce monetary and operational costs associated with running a Kubernetes cluster while keeping the tooling relatively agnostic. To have a reproducible means for the K8s cluster as well as the ecosystem of tooling that resides on it, we can leverage an automated modular approach like the example demonstrated. For further questions or special considerations, feel free to reach us via the contact us section below.


A Well-Managed Docker Daemon


The docker daemon has been known to quietly consume disk and other resources over time. While developing with docker, I’ve had to clean up docker resources every few months. Here are several tips that have come in handy to get back in the game.

March 8, 2020




While developing with docker, ever seen the message: 'FATAL: could not write lock file ____.pid' or 'No space left on device'? Here are a few commands to consider invoking to clear up your system. Note that this will remove docker resources, so be sure you have a means of accurately reproducing [stateful] containers.


System Prune

This will remove all stopped containers, unused networks, dangling images, and dangling build cache (for docker version 1.13 or later):
$ docker system prune


Stop & Remove All Docker Containers

Before copying and pasting this, keep in mind it will stop currently running containers. If you only wish to remove dangling resources, then skip this.

$ docker stop $(docker ps -a -q)
$ docker rm $(docker ps -a -q)


Remove All Dangling Volumes

$ docker volume rm `docker volume ls -q -f dangling=true`


Delete All Local Images

$ docker rmi -f $(docker images -a -q)


Conclusion

The above commands will free up disk and other resources. If you’re worried about docker being destructive on your k8s nodes, check out this k8s monitoring blog for prometheus disk monitoring tips. Before thinking 'there's no way docker can be destructive', there are plenty of ways it can wreak havoc: from filling up disk, to not properly closing open files on the OS, to not properly reporting to the kubelet (causing PLEG issues for k8s versions 1.6 to 1.15). I hope this helps! For any other questions, feel free to reach out via the contact us form.


Distributed Analytics with Jupyter Notebooks on GCP


Empower your Data Scientists with Jupyter, K8s and Dask on GKE. This post demonstrates how to quickly and reliably spin-up a distributed analytics enabled infrastructure for Jupyter notebooks. In the example to follow, the Jupyter notebook performs basic ETL with BigQuery and Google Compute Storage (GCS) via a combination of GCP Python APIs and distributed data processing functionality of Dask.

June 24, 2020




Jupyter notebooks are great for basic ETL, exploratory data analysis and model development. However, unless the notebook is being leveraged from a distributed environment and with "out-of-core" enabled data processing libraries, like Dask, the “Bigness” of your Big Data analytics will be severely limited.


In other words, Python data processing libraries, like Numpy, Pandas or Scikit-learn, are limited to datasets that fit into the amount of memory available from the machine they are running on. Also, single thread execution per Python interpreter is imposed by the Python GIL. To deal with these limitations, Dask provides two important extensions of functionality. First, it extends the Numpy Array and the Pandas Dataframe into distributed collections, where the data within a collection can be spread out among multiple cluster nodes or instances. Second, Dask (using DAGs) creates a Task Graph for parallel execution of computations (pure python functions) and distributed data transport (to nodes/instances) that is then executed incrementally (releasing memory along the way) by dynamically scheduling tasks to Dask workers running as pods in your cluster.


In short, from a Jupyter Notebook, leverage Dask on GKE for Python-based analytics at scale via distributed computation and out-of-core algorithms.


Enough talk, let’s get started!


  1. First, you’ll need a GCP Account and Project.
  2. Next, clone https://github.com/consistent-solutions/gcp-dask-jupyter.git
  3. Follow the instructions outlined in gcp-dask-jupyter/README.md

    1. NOTE: For simplicity, use a combination of Google Cloud Shell and GCP Web UI for the following.

      PROCESS SUMMARY, as outlined in README.md:

      1. Create a GCP Account, create a Project and activate Billing (not the GCP Free Tier).
      2. Clone the Gruntwork repo, create terraform.tfvars config file and Terraform that cloud infrastructure!
      3. As the infrastructure is being provisioned, create a new Dataset within your project from the BigQuery web console and copy over a Table from the public Dataset to the one in your project.
      4. Add additional IAM roles to 2 existing GCP Service Accounts.
        • gcs/bigquery access from jupyter (ETL from Jupyter Notebook)
        • gcs bucket access from the cluster (Dask workers accessing bucket csv data)
      5. Install Helm and Dask (via Helm)
      6. Access Dask UIs via (Google Cloud Shell) Web Preview by using kubectl port-forward
      7. Use kubectl cp to copy service account credential file to (Dask) jupyter ui pod for access to GCS and BigQuery from the jupyter notebook using GCP Python APIs.
      8. Create a GCS bucket for csv data loaded from BigQuery for Dask to work with.
      9. Open Web Preview (via Google Cloud Shell) for the jupyter notebook ui and the scheduler ui.
      10. Clone gcp-dask-jupyter and drag and drop a local copy of jupyter-gcs-bq.ipynb into filesystem explorer (left side) of jupyter ui
      11. RUN THE NOTEBOOK!


      Component Breakdown:

      • GKE Infrastructure
        • Gruntwork
          • “...to quickly spin up a GKE Public Cluster, you can run the example that is in the root of this repo.”
      • Jupyter Notebook
        • Google BigQuery
          • Copy Table from public Dataset to a newly created private Dataset and Table.
          • google.cloud.bigquery
            • Run queries against BigQuery Tables.
        • Google Cloud Storage
          • Storage of BigQuery Table data as csv files before loading into Dask Dataframe for data processing.
          • gcsfs
            • Move BigQuery Table data to GCS bucket as multiple csv files.
        • Pandas
          • “pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.”
          • Needed for Dask.
        • Dask
          • Distributed computation with underlying Pandas, Numpy or Scikit-learn.
          • dask.distributed.Client
          • dask.dataframe
            • Using read_csv, loads GCS bucket csv file data into a Dask dataframe.


      Notebook Summary: The Jupyter Notebook, jupyter-gcs-bq.ipynb first installs needed Python dependencies including google.cloud.bigquery, gcsfs, pandas, dask.distributed.Client and dask.dataframe. Import statements follow after dependencies, including an os.environ call to set Google application credentials environment variable for Jupyter access permissions to GCS and Bigquery. The first cell of interest performs a query against a public Dataset/Table directly from BigQuery. Following that is the true meat and potatoes of the notebook. To begin, the next cell instantiates a Client for Dask, providing a scheduler for upcoming operations. The next cell is basic ETL. It copies data (the public Dataset/Table used a bit ago) from BigQuery to a GCS bucket as multiple csv files. After a confirmation/success message, the next cell lazily loads the csv file data into a Dask Dataframe. The final cell kicks off Dask computation.
      NOTE: Switch Web Preview to scheduler ui port to watch as Dask workers get it done!


      Conclusion:

      At the end of the day Data Scientists should be able to focus on what they do best, Analytics and Data Exploration. They shouldn't have to muck around with infrastructure and the like. Empower your Data Scientists by providing them with automated infrastructure and workload provisioning on Kubernetes. Empower them by providing a secure, resilient and repeatable environment supported by a community-driven ecosystem of tooling and best practices. Empower them by providing them with the tooling that they know and love, and abstract away everything else as much as possible. Trust me, they'll thank you for it!


PostgreSQL Performance Tuning


In almost any app, data is essential. Therefore the uptime and performance of the database for managing said data is crucial. Let's dive into the parameters to get the most out of our PostgreSQL database.

July 16, 2020




In almost any app, data is essential. Therefore the uptime and performance of the database for managing said data is crucial. The types of workloads will ultimately drive the type of storage mechanism used as well as the appropriate fine-tunings. In this article, we'll explore the settings for fine-tuning PostgreSQL.


TL/DR: Check out pgtune as a reference; then, considering your workload and system specs, adjust accordingly.


shared_buffers

The amount of dedicated memory on the server PostgreSQL will use for caching data. You can think of this as a temporary memory space used to generate result sets. It's typically set to 25% of memory. If the entire database can fit in the cache, then a higher percentage would be beneficial, as this reduces the amount of reading from disk. A higher value is ideal for heavy read workloads, yet can be detrimental to write-heavy workloads, since all the shared_buffers contents need to be processed (flushed) as data changes. Changing this value requires restarting the PostgreSQL instance.


effective_cache_size

effective_cache_size is an estimate of the memory available for disk caching and influences the query planner. If set too low, it can prevent indices from being used when they would be beneficial. A conservative go-to value is 50% of memory, which is what AWS RDS PostgreSQL sets by default. However, a higher value like 75% may be more beneficial on dedicated db server(s). Consider this value as total memory minus what the OS and other apps/processes on the server consume.


max_connections

The general rule is to keep this low. A few hundred connections, depending on the size of the database, is low. The more connections at once, the more overall memory consumption; and the higher the number of connections, the less memory that can be allotted per connection. When you have a high number of connections, watch the overall memory and the work_mem parameter in relation to it. work_mem can make operations like sorting faster; however, if both work_mem and the number of connections are high, you may be at risk of exceeding database memory.


Example calculation of max_connections:
  The default AWS RDS parameter group has: LEAST({DBInstanceClassMemory/9531392},5000)
     with 1 GiB of memory, this yields 1024*1024*1024 / 9531392 ≈ 112 connections
     so for a db.m4.2xlarge with 32 GiB of memory, max_connections would be ~3600
     and for a db.m4.4xlarge with 64 GiB of memory, max_connections would be capped at 5000

maintenance_work_mem

Specifies the total memory available for maintenance operations, such as VACUUM, CREATE INDEX, ALTER TABLE, etc. Be careful not to set this value too high, since each autovacuum worker can use up to this much memory, effectively multiplying it by autovacuum_max_workers.
General recommendation for maintenance_work_mem:

  1. Set value 10% of system memory, up to 1GB
  2. If having VACUUM problems, consider setting it higher


checkpoint_completion_target

Target of checkpoint completion, as a fraction of the total time between checkpoints. This is typically 0.5 by default. In theory, 0.9 will give a smoother disk I/O curve over time: a value of 0.9 means aim to finish the checkpoint writes when 90% of the time until the next checkpoint has passed, thus spreading the checkpoint writes over a larger amount of time and avoiding spikes that bottleneck performance. One can set this roughly with (checkpoint_timeout - 2min) / checkpoint_timeout, so a checkpoint_timeout of 20 minutes would give a checkpoint_completion_target of 0.9. If in doubt, start with 0.5, observe, and increase it incrementally by 0.1 or 0.2 (up to 0.9) while continuing to observe.


wal_buffers

The amount of shared memory used for WAL data that hasn’t yet been written to disk. The useful range is around 64kB to 16MB; anything larger won’t be very beneficial. The default is typically -1, which equates to 1/32 (~3%) of shared_buffers. This default should be reasonable. Setting it to at least a few MB can improve write performance on a busy server. Note that a change takes effect upon restart.


random_page_cost

This is the cost the planner assigns to a random page read. The typical default is 4. If using HDD, a value of four is justified; however, if using SSD, a setting of 1.1-2 is acceptable. Note that if the database fits purely in memory, it would make sense to set random_page_cost = 1 = seq_page_cost.
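
A hedged example for an SSD-backed instance (the exact value is an assumption to validate against your own query plans):

ALTER SYSTEM SET random_page_cost = 1.1;  -- assumes SSD storage; leave at 4 for HDD
SELECT pg_reload_conf();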


effective_io_concurrency

This is the number of concurrent disk I/O operations that PostgreSQL can run simultaneously. Currently, the setting only impacts bitmap heap scans. For reference, with an index-only scan, PostgreSQL reads from the index without touching the table heap; for bitmap heap scans, PostgreSQL leverages the index to identify which rows to load from disk. Here's a great article on benchmarking the performance of this parameter.
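
A hedged starting point for SSD storage; the value below is an assumption commonly used as a benchmark baseline, not a recommendation from this post:

ALTER SYSTEM SET effective_io_concurrency = 200;  -- assumed SSD; benchmark before and after
SELECT pg_reload_conf();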


work_mem

Memory used for sorting as well as joining tables with merge-join and hash-join operations.

In relation to observations in logs:

  1. If you notice an “out of memory” in PG logs → either decrease work_mem or increase instance size.
  2. If you notice “temporary files” in the recent logs → consider increasing work_mem

In relation to short vs long running queries:
  1. If there are many short running queries running frequently → decrease work_mem
  2. If there are few active queries at a time w/complex sorts and joins → increase work_mem (see the sketch below)
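
A hedged sketch: rather than raising work_mem globally, it can be raised per session for a known heavy query, and temporary-file logging can reveal queries that spill to disk (the thresholds here are assumptions):

SET work_mem = '256MB';                    -- session-only bump for a heavy report query
ALTER SYSTEM SET log_temp_files = '10MB';  -- log queries spilling more than 10MB to temp files
SELECT pg_reload_conf();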


min_wal_size & max_wal_size

The write-ahead log (WAL) of the db is the accounting of modified records over time. One can reproduce the database state by replaying the WAL. A database of several hundred GBs can produce TBs of WAL per day. That’s a lot of log to replay in a recovery scenario, so a checkpoint is taken at intervals. Checkpoints should be balanced so they don’t incur much I/O load, yet aren’t spaced so far apart that recovery would take too long should a disaster happen. The parameters to adjust here are checkpoint_timeout, min_wal_size, and max_wal_size. A checkpoint happens either every checkpoint_timeout or when max_wal_size is about to be hit. The default min_wal_size is typically very conservative.
This post has great recommendations for adjusting the two params (a hedged example follows the list):

  1. If you know your database write traffic is “bursty” with long periods of inactivity followed by furious writes, increase min_wal_size.
  2. If you do bulk loads larger than 1GB, increase max_wal_size to the size of a bulk load.
  3. If you write more than 1GB of data every 10 minutes, increase max_wal_size.
  4. If you get “Checkpoints occurring too frequently” error messages, try increasing both parameters incrementally.
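
A hedged example for a write-heavy instance; the sizes are assumptions to tune against your own WAL volume rather than recommendations from this post:

ALTER SYSTEM SET min_wal_size = '1GB';
ALTER SYSTEM SET max_wal_size = '4GB';
SELECT pg_reload_conf();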

max_parallel_workers_per_gather

max_parallel_workers_per_gather sets the maximum number of parallel workers a single Gather (or Gather Merge) node can launch. Keep in mind this can easily double to quadruple the typical resource consumption. If you’re looking to really get the most out of your database, this will come in handy; however, be sure you understand typical traffic and workloads over time in relation to resource consumption before dialing this up.
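
A hedged sketch of raising it for a single analytics-heavy session instead of globally (the value is an assumption; watch CPU and memory):

SHOW max_parallel_workers_per_gather;
SET max_parallel_workers_per_gather = 4;  -- session-only; still bounded overall by max_parallel_workers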


Conclusion

In short, you may not be seeing the true potential of your PostgreSQL database with standard defaults. Before fine-tuning, be sure you understand traffic patterns in relation to database workloads, as well as the read vs write intensive nature of the workloads themselves. It is always recommended to have observability of the database metrics, and ideally a performance testing strategy for new app features or traffic patterns. Once the database metrics are observable and the workloads understood, then consider fine-tuning the database. A great start would be leveraging pgtune. That said, it may be wise to revisit fine-tunings upon evolving needs and observations. For any further questions or feedback, feel free to reach out!


GKE Day 2 Operations & Best Practices


Out of the box, GKE provides arguably the most robust and feature-rich managed Kubernetes offering out there. What follows is a demo of GKE Day 2 Operations, using a combination of Terraform and Ansible automation for infrastructure and k8s workload provisioning.

August 3, 2020

(Montgomery Woods State Natural Reserve)

GKE Day 2 Operations & Best Practices

August 3, 2020


When it comes to managed k8s, no hosted offering shines quite like Google Kubernetes Engine. With a fully managed control plane, including security patches and upgrades handled behind the scenes, automatic cluster/node upgrades, node auto-repair, cluster autoscaling, and node health monitoring via the node problem detector (once enabled), GKE has many features baked in, particularly those that relieve the greatest pain points. The control plane, for instance, is the most difficult aspect of Kubernetes management, yet on GKE, control plane security, patch management, availability, and cost are abstracted away from the customer's concerns.


Metric/Log Export to Stackdriver

Another out-of-the-box advantage gained with Google Kubernetes Engine is Kubernetes Engine Monitoring via Cloud Operations for GKE. For instance, "...Cloud Operations for GKE dashboard displays a summary line for each Kubernetes resource by default." Additionally, cluster components and their subcomponents can be drilled into for more granular details from three different perspectives: Infrastructure, Workloads, and Services. The Cloud Operations for GKE dashboard also includes a timeline of incidents from alerting violations. Cloud Operations for GKE also provides Kubernetes observability, but only after creating a Workspace. A Workspace is "...for monitoring resources contained in one or more Google Cloud projects or AWS accounts. A Workspace accesses metric data from its monitored projects, but the metric data remains in those projects." Once a Workspace is created and your project is associated with it as its host project, the Metrics Explorer can be used to explore available metrics, and alerting policies can be managed via the web console (or the API). Additional built-in functionality includes cluster workloads that handle metric/log export to Stackdriver and metrics-based autoscaling. Log/metric export consists of fluentd-gke (logs) and gke-metrics-agent (metrics), while metrics-based autoscaling is enabled by a built-in Deployment of the metrics server.


These additional features provide immediate value by facilitating Day 2 Operations out of the box.

The following example demonstrates Day 2 Operations, addressing the concerns of (storage-agnostic) workload state backup and disaster recovery with Velero, k8s RBAC, DNS management with ExternalDNS, and SSL cert management via Cert-Manager. ExternalDNS and Cert-Manager make a 2048 game externally available with a custom domain and SSL certs. Observability through automation is demonstrated by leveraging Terraform to create GKE-based Alerting Policies and a Notification Channel.


To provision GCP infra:

  1. First, you’ll need a GCP Account and Project.
  2. Next, clone https://github.com/consistent-solutions/terra-gke-jumpstart.git
  3. Follow the instructions outlined in terra-gke-jumpstart/README.md

    1. NOTE: For simplicity, use a combination of Google Cloud Shell and GCP Web UI for the following.

      NOTE: GKE-based Alerts require a Workspace, which currently requires manual creation.

      PROCESS SUMMARY, as outlined in README.md:

      1. Create a GCP Account, create a Project, activate Billing (not the GCP Free Tier), and enable the required Google Cloud APIs.
      2. Clone the terra-gke-jumpstart repo, create the terraform.tfvars config file, and copy over the GKE Alerts and Notification Channel terraform.
      3. Provision the infrastructure with terraform init, plan, and apply, then update kubeconfig and verify kubectl access to your cluster.
      4. Clone gke-with-day-two-operations and run the ansible playbook!


      With the GCP infrastructure provisioned, the following cluster-level features provide a mature foundational setup: an ecosystem of tools that delivers nothing less than a Day 2 capable platform.


      Component Breakdown:

      • GKE Infrastructure
        • Gruntwork
          • “...to quickly spin up a GKE Public Cluster, you can run the example that is in the root of this repo.”
        • Cluster Level
          • k8s RBAC
          • 2048 Game
          • ExternalDNS
            • "Keeps zones synchronized with Ingresses and Services (of Type=LoadBalancer) in various cloud providers."
          • Cert-Manager
            • "x509 certificate management for Kubernetes" for Ingress resources via Let's Encrypt (with DNS01 Challenge).
          • Velero
            • Disaster Recovery. Supported backup scenarios:
              • Entire cluster
              • By label
              • By namespace
              • Via cron schedule
              • Workload migration
      • GCP (Cloud Monitoring) Workspace



      Conclusion:

      Day 2 Operations means bootstrapping Kubernetes with the workloads and tooling needed to provide resilience by leveraging observability in your cluster. With GKE as a foundation, one can implement Day 2 Operations sooner, realizing greater overall reliability and resilience to failure, and ultimately delivering valuable agility at scale, the right way, from Day 1.


Managing PostgreSQL Bloat


Bloat significantly slows down database operations. Any time a record is updated or deleted, a dead tuple is created. The space used by a dead tuple is known as bloat. Learn how to identify and mitigate bloat.

September 8, 2020


Managing PostgreSQL Bloat

September 8, 2020


In PostgreSQL, when a record is deleted, it’s not really deleted: it still takes up space as a dead tuple. The space taken by dead tuples that are no longer needed is known as bloat. Bloat can greatly impact resource utilization and can significantly slow the database if not properly managed. First identify the offending bloat, VACUUM where appropriate to reclaim storage, and fine-tune the database for ongoing maintenance.


Identify the Troublesome Tables

A query to identify the tables with most dead tuples:


SELECT relname AS TableName,
n_live_tup AS LiveTuples,
n_dead_tup AS DeadTuples
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

To identify tables with a lot of bloat, see this query. The last three lines can be adjusted to refine the search criteria of pct_bloat or the amount of mb_bloat, as well as to sort by mb_bloat.


Cleanup Time

The ideal state is to properly tune the database parameters so bloat stays under control. To clean up immediate bloat, one would choose between the VACUUM and VACUUM FULL operations. In short, VACUUM FULL rewrites the table to reclaim disk space, but it takes an exclusive lock that blocks reads and writes to the table(s) while it runs. VACUUM FULL, or an alternative that doesn’t block such as pg_repack, should be considered if there is significant bloat. The next best thing is a plain VACUUM: it doesn’t return disk space to the operating system, but it does remove dead tuples so the space can be reused. Regardless of which command is invoked, be sure to run these during a time of low database traffic and follow up with an ANALYZE command to update database statistics. The updated statistics ensure the query planner operates efficiently.
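
For illustration (the table name is hypothetical), a typical low-traffic cleanup might look like the following; note that VACUUM cannot run inside a transaction block, and VACUUM FULL locks the table for the duration:

VACUUM (VERBOSE, ANALYZE) mytable;  -- non-blocking: removes dead tuples and refreshes statistics
VACUUM FULL mytable;                -- only if bloat is severe: rewrites the table under an exclusive lock
ANALYZE mytable;                    -- refresh statistics after the rewrite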


Conclusion

In short, any update or delete of a record will result in a dead tuple. Dead tuples take up space, known as bloat. To ensure the database isn’t negatively impacted by bloat, be sure to manage it properly via VACUUM commands or, ideally, by fine-tuning the database.


PostgreSQL Index Cleanup


Most likely the database has been running for many years. It simply works. To improve query performance, add “another index”. However, before simply adding another index, consider purging unneeded indices to improve database performance.

October 2, 2020


PostgreSQL Index Cleanup

October 2, 2020


Are you looking to understand current specs with respect to tables and indexes? Has it been a while since you’ve reviewed the relevancy of your indexes?
Below is a series of commands for basic db table and index analysis.
NOTE: if you’re planning to drop an index, monitor performance before/after. Preferably drop one index at a time and monitor performance as a result. If reprovisioning an index, check out CREATE INDEX CONCURRENTLY, so as not to block existing operations (a hedged example follows).
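
For illustration only (the index and table names are hypothetical), a non-blocking drop and recreate might look like this; note that the CONCURRENTLY variants cannot run inside a transaction block:

DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_id;
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);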


List Tables


SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';


List Indices


SELECT * FROM pg_indexes WHERE schemaname NOT IN ('pg_catalog', 'information_schema');


List DB Table Sizes


SELECT relname AS "table_name",
       pg_size_pretty(pg_table_size(pgc.oid)) AS "space_used"
FROM pg_class AS pgc
LEFT JOIN pg_namespace AS pgns ON (pgns.oid = pgc.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  AND nspname !~ '^pg_toast'
  AND relkind IN ('r')
ORDER BY pg_table_size(pgc.oid) DESC;


Bonus: Table Bloat and Dead-Tuples


See blog here
An example cleanup command that won’t block the table, as discussed in the blog above:


VACUUM VERBOSE ANALYZE tablename;


Unused Indices


SELECT t.schemaname,
    t.tablename,
    indexname,
    indexdef,
    c.reltuples AS num_rows,
    pg_size_pretty(pg_relation_size(quote_ident(t.schemaname)::text || '.' || quote_ident(t.tablename)::text)) AS table_size,
    pg_size_pretty(pg_relation_size(quote_ident(t.schemaname)::text || '.' || quote_ident(indexrelname)::text)) AS index_size,
    pg_relation_size(quote_ident(t.schemaname)::text || '.' || quote_ident(indexrelname)::text) AS index_size_bytes,
    CASE WHEN indisunique THEN 'Y'
        ELSE 'N'
    END AS UNIQUE,
    number_of_scans,
    tuples_read,
    tuples_fetched
FROM pg_tables t
LEFT OUTER JOIN pg_class c ON t.tablename = c.relname
LEFT OUTER JOIN (
    SELECT
        c.relname AS ctablename,
        ipg.relname AS indexname,
        x.indnatts AS number_of_columns,
        idx_scan AS number_of_scans,
        idx_tup_read AS tuples_read,
        idx_tup_fetch AS tuples_fetched,
        indexdef AS indexdef,
        indexrelname,
        indisunique,
        pgi.schemaname
    FROM pg_index x
    JOIN pg_class c ON c.oid = x.indrelid
    JOIN pg_class ipg ON ipg.oid = x.indexrelid
    JOIN pg_indexes pgi ON pgi.indexname = ipg.relname
    JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid
) AS foo ON t.tablename = foo.ctablename AND t.schemaname = foo.schemaname
WHERE t.schemaname NOT IN ('pg_catalog', 'information_schema') AND number_of_scans < 1
ORDER BY 8 DESC;


Index Size to Table Size Ratio


Why? This gives a high-level perspective that can indicate an issue with an index and/or the data model.


SELECT *,
       Pg_size_pretty(total_bytes) AS total,
       Pg_size_pretty(index_bytes) AS INDEX,
       Pg_size_pretty(toast_bytes) AS toast,
       Pg_size_pretty(table_bytes) AS TABLE,
       TRUNC(total_bytes / (total_bytes - index_bytes - Coalesce(toast_bytes, 0))::numeric,2)
FROM   (SELECT *,
               total_bytes - index_bytes - Coalesce(toast_bytes, 0) AS
               table_bytes
        FROM   (SELECT c.oid,
                       nspname                               AS table_schema,
                       relname                               AS TABLE_NAME,
                       c.reltuples                           AS row_estimate,
                       Pg_total_relation_size(c.oid)         AS total_bytes,
                       Pg_indexes_size(c.oid)                AS index_bytes,
                       Pg_total_relation_size(reltoastrelid) AS toast_bytes
                FROM   pg_class c
                       LEFT JOIN pg_namespace n
                              ON n.oid = c.relnamespace
                WHERE  relkind = 'r') a
        WHERE  table_schema != 'pg_catalog' AND table_schema != 'information_schema' AND (total_bytes - index_bytes - Coalesce(toast_bytes, 0)) > 0.01
        ORDER  BY total_bytes / (total_bytes - index_bytes - Coalesce(toast_bytes, 0))::FLOAT DESC) a;


PostgreSQL Troubleshooting


Database Issue...All hands on deck. No time to waste. Check out (or bookmark) this quick to-the-point reference for troubleshooting a PostgreSQL database.

November 1, 2020


PostgreSQL Troubleshooting

November 1, 2020


A quick reference for troubleshooting a PostgreSQL database.

  • Leverage a query analyzer, e.g. pgBadger, or an alternative tool that shows the poorest performing queries
    • Other tools include Datadog APM, New Relic, and AWS Performance Insights (not available on older db versions)
  • Basic metrics to review
    • CPU
    • Memory
    • FreeableMemory
    • Disk Size
    • DiskQueueDepth
    • BurstBalance (aws rds specific)
    • IOPs metrics
      • Increasing disk size (depending on the storage type) can increase IOPs.
        • This is helpful if you don’t want to pay for a larger instance type
  • Review any changes to the app (or to database code) that could have caused the issue
  • Review autovacuum and autoanalyze activity / dead vs live tuple stats
    • select * from pg_stat_user_tables;
  • A great debugging reference
    • View long running queries
    • Table missing index
    • Stop db process
    • Find out what is blocking a lock
  • For any queries identified as poor performing, leverage EXPLAIN (see the sketch below)
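
A hedged example with a hypothetical slow query; note that EXPLAIN ANALYZE actually executes the statement:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;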


Once the issue has been resolved, strongly consider fine-tuning the database, cleaning up bloat, and removing unused indices.


About us

Our mission is to empower others. We take full ownership of the software we build. From K8s, to DevOps, to Cloud - we live and breathe software.

Feel free to get in touch.