One of the biggest challenges in developing cloud native applications today is increasing the frequency of your deployments. Shorter, more frequent deployments offer the following benefits:
- Reduced time-to-market.
- Customers get new functionality faster.
- Customer feedback flows back into the product team faster, which means the team can iterate on features and fix problems more quickly.
- More features in production makes for a happier development team.
But with more frequent releases, the chances of negatively affecting application reliability or customer experience also increase. It’s important that operations and DevOps teams develop processes to automate deployment strategies that minimize risk to the product and customers.
What is a Service Mesh?
According to Stefan, a service mesh is a dedicated infrastructure layer for handling service-to-service communication. Although this definition sounds very much like a CNI implementation on Kubernetes, there are some differences. A service mesh typically sits on top of the CNI and builds on its capabilities, adding further ones such as service discovery and security.
The components of a service mesh include:
- Data plane - made up of lightweight proxies that are distributed as sidecars. Proxies include NGINX and Envoy; either of these technologies can be used to build your own service mesh in Kubernetes. In Kubernetes, a proxy runs as a sidecar container in every Pod, next to your application.
- Control plane - provides the configuration for the proxies, acts as the certificate authority that issues TLS certificates, and contains the policy managers. It can collect telemetry and other metrics, and some service mesh implementations also include the ability to perform tracing.
How is a service mesh useful?
The example shown below illustrates a Kubernetes cluster with an app composed of these services: a front-end, a backend and a database.
The blue arrows represent the traffic that comes into your cluster through an ingress gateway. This ingress gateway can be anything from NGINX to a cloud-based one like ELB.
In this case, we’re also using an egress gateway, which is recommended in your cluster for better security. The red arrows indicate east-west traffic, that is, the traffic that flows between your services.
Without a service mesh these are the problems
If you are not using a service mesh and you’re implementing a plain vanilla Kubernetes cluster instead, these are the problems you’ll run into:
No security between services
By default, there is no security between services. Even if you are using Weave Net, not everything is encrypted on the network layer: Weave Net encrypts traffic between nodes, but traffic between two services running on the same node is not encrypted.
To mitigate this risk, you could of course use TLS certificates for communications between all of your services. But doing so means more work for your SRE team, since they will need to rotate and manage TLS certificates. Your dev team will also need to integrate TLS into each service, among many other tasks. In the end, it’s no small feat to implement TLS across your cluster.
Most service meshes have a goal of end-to-end encryption, which can save time for your teams. A service mesh injects a sidecar with a TLS certificate into each pod. Control planes will also come with a certificate authority that rotates the certificates for you.
Tracing a service latency problem is difficult
Another problem with a plain vanilla cluster is that it can be difficult to troubleshoot the source of a problem. For example, a latency issue may be particularly difficult to trace by looking only at the data from a single service. The analytics data you are reading may have nothing to do with the service that communicates with the outside world. The problem may instead reside in a malformed database query, or it could be in the front-end; you don’t know for sure.
Without a service mesh, you can solve this kind of problem by instrumenting your code and then measuring your requests between each service in your application. But if you use a service mesh like Istio that has distributed tracing built right in, you won’t have to worry about the extra step of code instrumentation.
With a service mesh, all ingress and egress traffic is routed through a proxy sidecar. The proxy sidecar then adds tracing headers to each request. When a request comes through the ingress gateway to the front-end and then goes on to the backend, you will have a trace for all of those requests without having to instrument your code.
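To make the idea of tracing headers more concrete, here is a minimal sketch of what a sidecar might do at each hop. It assumes B3-style header names (`x-b3-traceid`, `x-b3-spanid`, `x-b3-parentspanid`) and lowercase header keys; the function name `add_tracing_headers` is hypothetical and is not the API of any particular mesh.

```python
import secrets

# B3-style header names, assumed for this sketch (keys are expected in lowercase).
TRACE_HEADER = "x-b3-traceid"
SPAN_HEADER = "x-b3-spanid"
PARENT_HEADER = "x-b3-parentspanid"


def add_tracing_headers(headers):
    """Return a copy of `headers` carrying trace context for the next hop."""
    out = dict(headers)
    # Reuse the trace ID if an earlier hop already set one; otherwise start a new trace.
    out.setdefault(TRACE_HEADER, secrets.token_hex(16))
    # The caller's span, if any, becomes the parent of the span created at this hop.
    if SPAN_HEADER in headers:
        out[PARENT_HEADER] = headers[SPAN_HEADER]
    out[SPAN_HEADER] = secrets.token_hex(8)
    return out


# A request passing ingress gateway -> front-end -> backend keeps one trace ID,
# while every hop gets its own span that points back at the previous one.
ingress = add_tracing_headers({"host": "example.test"})
frontend = add_tracing_headers(ingress)
backend = add_tracing_headers(frontend)
```

Because every hop shares the same trace ID and records its parent span, a tracing backend can stitch the hops together into a single end-to-end trace.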
Load balancing is limited
A plain Kubernetes Service spreads connections across Pods without taking request-level metrics into account. Because metrics are built into a service mesh, you can take advantage of more advanced load balancing strategies. For example, the front-end can be scaled up when it has more traffic, and you can also pinpoint other traffic bottlenecks more easily.
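As an illustration of a metrics-aware strategy, here is a minimal sketch of least-request load balancing in the "power of two choices" style: sample two endpoints and send the request to the one with fewer requests in flight. The class and method names are hypothetical, and real proxies implement this with more detail (weights, health checks, and so on).

```python
import random


class LeastRequestBalancer:
    """Tracks in-flight requests per endpoint and prefers the less busy one."""

    def __init__(self, endpoints, rng=None):
        self.active = {ep: 0 for ep in endpoints}
        self.rng = rng or random.Random()

    def pick(self):
        candidates = list(self.active)
        # With more than two endpoints, compare only two random candidates.
        if len(candidates) > 2:
            candidates = self.rng.sample(candidates, 2)
        chosen = min(candidates, key=lambda ep: self.active[ep])
        self.active[chosen] += 1
        return chosen

    def done(self, endpoint):
        # Call this when the request to `endpoint` has completed.
        self.active[endpoint] -= 1
```

Sampling two candidates instead of scanning every endpoint keeps each decision cheap while still steering traffic away from overloaded Pods.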
What does a service mesh provide?
Not all of the service meshes out there have all of these capabilities, but in general, these are the features you gain:
- Service Discovery (eventually consistent, distributed cache)
- Load Balancing (least request, consistent hashing, zone/latency aware)
- Communication Resiliency (retries, timeouts, circuit-breaking, rate limiting)
- Security (end-to-end encryption, authorization policies)
- Observability (Layer 7 metrics, tracing, alerting)
- Routing Control (traffic shifting and mirroring)
- API (programmable interface, Kubernetes Custom Resource Definitions (CRD))
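To show what communication resiliency looks like in practice, here is a minimal sketch that combines retries with a circuit breaker. In a mesh this logic lives in the sidecar proxy rather than in your code; the names `CircuitBreaker` and `call_with_retries`, and the default thresholds, are assumptions made for illustration only.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then allows a trial
    request again once `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


def call_with_retries(send, breaker, attempts=3):
    """Call `send()` up to `attempts` times, failing fast while the circuit is open."""
    for _ in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open")
        try:
            result = send()
        except ConnectionError:
            breaker.record(False)
            continue
        breaker.record(True)
        return result
    raise RuntimeError("retries exhausted")
```

Retries smooth over transient failures, while the breaker stops callers from piling more requests onto a service that is already struggling.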
Differences between service mesh implementations?
Istio has a Go control plane and uses Envoy as a proxy data plane. It is a complex system that does many things, like tracing, logging, TLS, authentication, etc. A drawback is the resource-hungry control plane, says Stefan. The more services you have, the more resources you need to run them on Istio.
Another implementation offers a managed control plane and also uses an Envoy proxy for its data plane, so you don’t have to run the control plane yourself on your cluster. It works very similarly to Istio. Since it’s fairly new, it still lacks many of the features that Istio has. For example, it doesn’t include mTLS or traffic policies.
Linkerd also has a Go control plane, paired with a Linkerd proxy data plane that is written in Rust. Linkerd has some distributed tracing capabilities and just recently implemented traffic shifting. The current 2.4 release implements the Service Mesh Interface (SMI) traffic split API, which makes it possible to automate canary deployments and other progressive delivery strategies with Linkerd and Flagger. The Linkerd roadmap also shows that many other new features will be implemented over the next year.
The Consul-based mesh uses a Consul control plane and requires the data plane to be managed inside an app. It does not implement Layer 7 traffic management, nor does it support Kubernetes CRDs.
How does progressive delivery work with a service mesh?
Progressive delivery is Continuous Delivery with fine-grained control over the blast radius. This means that you can deliver new features of your app to a certain percentage of your user base.
In order to control the progressive deployments, you need the following:
- User segmentation (provided by the service mesh)
- Traffic shifting management (provided by the service mesh)
- Observability and metrics (provided by the service mesh)
- Automation (service mesh add-on like Flagger)
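As a rough illustration of user segmentation, the sketch below assigns each user to a stable bucket from 0 to 99 by hashing the user ID, so that a fixed percentage of users sees the new version. The function name `in_rollout` and the choice of SHA-256 are assumptions for this example; a mesh would typically match on headers or cookies instead.

```python
import hashlib


def in_rollout(user_id, percent):
    """Return True if `user_id` falls into the first `percent` of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first four bytes of the hash to a stable bucket in the range 0..99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Because the bucket depends only on the user ID, a user who is in the rollout at 10% stays in it when you raise the percentage to 50%.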
A canary deployment is used when you want to test new functionality, typically on the backend of your application. Traditionally you may have had two almost identical servers: one that serves all users and another with the new features that is rolled out to a subset of users, with the two then compared. When no errors are reported, the new version can gradually roll out to the rest of the infrastructure.
While this strategy can be implemented using only Kubernetes resources by replacing old pods with new ones, it is much easier and more convenient with a service mesh like Istio and an add-on like Flagger, which can automate the shift in traffic.
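The automated traffic shift can be pictured as a simple control loop: raise the canary's traffic weight in steps, check a metric after each step, and roll back as soon as the metric crosses a threshold. The sketch below is a simplified model of that idea, not Flagger's actual implementation; `run_canary`, its parameters, and the `error_rate_at` callback (a stand-in for a metrics query) are all hypothetical.

```python
def run_canary(error_rate_at, step=10, max_weight=50, max_error_rate=0.01):
    """Shift traffic to the canary in steps.

    `error_rate_at(weight)` stands in for a metrics query made while the
    canary receives `weight` percent of traffic. Returns the outcome and
    the sequence of canary weights that were applied.
    """
    weight = 0
    history = []
    while weight < max_weight:
        weight = min(weight + step, max_weight)
        history.append(weight)
        if error_rate_at(weight) > max_error_rate:
            history.append(0)  # send all traffic back to the stable version
            return "rolled back", history
    history.append(100)  # every check passed, so promote the canary
    return "promoted", history


# A healthy canary is promoted; one that starts failing at 30% is rolled back.
healthy = run_canary(lambda weight: 0.0)
failing = run_canary(lambda weight: 0.05 if weight >= 30 else 0.0)
```

Capping the analysis at `max_weight` limits the blast radius: at most that share of users ever reaches a version that has not yet passed every check.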