Testers and the Technical Abstraction Stack

I had no idea what to call this post. My focus here is on the notion of owning quality. As in: who does so? I won’t tackle all the nuances of that wider topic here. But, due to recent discussions, I did start to think about what it looks like for testers to own even limited bits of quality in our industry that is currently focused on some form of DevOps.

Regarding DevOps in general, consider that we have an accelerating force in that development teams are generally incentivized to deliver change. But we have a friction force in that operations teams are generally incentivized to ensure stability and availability. This tends to discourage change.

I’ve talked before about how testing provides experiments around forces. And here we have some forces:

Development: Change, Speed
Operations: Stability, Certainty

Actually, however, there is a third force that draws these forces together. That third force is a desire for predictability. We want to know that what we build (development) or deploy (operations) meets a desired level of quality.

Does all of this mean any one group owns that quality? Is it the developers? The operations team? And where are the testers?

Yeah! Where Are the Testers?!

Speaking of the testers, the (relatively) recent introduction of containers and container orchestrators fundamentally changed the landscape of distributed system development. We now have an industry-wide interface for expressing core distributed system patterns and the mechanisms and tooling for building reusable containerized components.

This has, or should have, fundamentally shifted how we think about testing and how we demonstrate quality. Yet many testers are unable to even articulate this, much less work fluidly with these systems.

Why is that?

We have technologies that support resource isolation (like Docker) or resource abstraction and allocation (like Mesos and Aurora) or orchestration platforms (Kubernetes and Cloud Foundry). All of these can cause entire systems of functionality to fail in one shot.

Remember Amazon’s Prime Day glitch? This was mainly due to a server provisioning failure, broken configuration management, and failures and gaps in host-level monitoring.

Testers, I noticed, were very quick to jump on poor quality on Amazon’s part. But then they seemed to be indicating that the developers should have done a better job. But wait! Shouldn’t that have been what Amazon’s testers were doing? (And, yes, Amazon does have testers contrary to some reports.)

Should those testers have “owned the quality” for all of that?

First Problem

You can’t own what you can’t reliably engage with.

On a more personal note, I recently worked with a test team who didn’t even want to leverage containerized environments. Why? Because they felt it got rid of the long-standing idea of environment promotion. What they didn’t realize is that environment provisioning can cover what environment promotion was for: when the environment itself is a deployable, immutable release artifact, the same artifact moves through every stage unchanged!

Being able to see this is critical to embracing continuous delivery and/or cloud-based deployments. Something this test team — and others I have seen — were in absolutely no position to do.

So it’s hard to own the quality of something if you can’t engage with it. And when testers choose not to engage, that’s purely on them.

Second Problem

You can’t own what you only have limited control over. And this one’s not solely on testers. So let’s consider testers owning quality in a microservice system. Consider this visual:

I’m borrowing this image from the book Microservices in .NET Core, a book many testers would likely not read.

In that image you can see that the API Gateway microservice is all that a customer (the client) sees. But, as you can also see, that gateway is a thin layer in front of a whole system of microservices. The arrows indicate calls between different parts of the system and the numbers on the arrows are meant to show the sequence of calls.

Now, stop for a second. If you are a tester who wants to own quality, what’s one of the key questions you should be asking about the above?

Please think about it.

Before I provide what I consider a very viable answer to that, consider this from Susan Fowler’s excellent book Production-Ready Microservices (another book I’m willing to bet most testers haven’t read):

“The most common catastrophes and failure scenarios are hardware failures, infrastructure (communication-layer and application-platform-layer) failures, dependency failures, and internal failures.”

Okay, but do we suggest testers should “own quality” for all of this? What does it even mean to own quality for all of this? Do you even have to control all of it? Or is the idea of “control” a bit of a strawman argument? Maybe we don’t need “control” in order to “own.”

Hmmm. Well, first let’s see what key question a tester would have, or should have, asked. Here’s a really, really good one:

“What happens if any given piece of the microservice architecture fails?”

Consider how microservices are often about the encapsulation of the single service and thus quality is suggested to be specific to that service. But a tester who wants to “own quality” would have to first and foremost ask the above question because they have to understand how a cascade of problems could happen if something in the chain fails.

That’s a lesson Amazon learned. (Hopefully.)

But it’s a crucial point. Yes, microservices are touted as these single-responsibility bounded contexts — mini-domains of behavior, as it were — but they are still part of a chain. The connective tissue is not just the APIs but the architecture and the infrastructure.
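
One way to make that question concrete is to treat the system as a call graph and ask which services are affected when a single node fails. What follows is only a minimal sketch: the service names and the call graph are hypothetical, not taken from the image or from either book, and it assumes the worst case where no caller has a fallback.

```python
from collections import deque

# Hypothetical call graph: each service maps to the services it calls.
CALLS = {
    "api_gateway": ["shopping_cart", "product_catalog"],
    "shopping_cart": ["product_catalog", "event_store"],
    "product_catalog": ["data_store"],
    "event_store": [],
    "data_store": [],
}


def impacted_by(failed, calls):
    """Return every service that directly or transitively calls `failed`."""
    # Invert the graph so we can walk from a dependency back to its callers.
    callers = {name: set() for name in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


if __name__ == "__main__":
    for service in CALLS:
        print(service, "->", sorted(impacted_by(service, CALLS)))
```

In this made-up graph, a failure in the data store reaches everything that sits in front of it, including the gateway the customer sees. A real system would soften that with timeouts, caches, and fallbacks, which is exactly what a tester asking the question above would want to find out.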

Do testers “own quality” for those aspects as well? Should they?

When a given piece of the infrastructure falls down or the architecture becomes compromised, these are failures. Failures are quality degraders, and they degrade both internal and external qualities.

Third Problem

In fact, that’s a third problem area: to “own quality,” you have to understand that there are internal and external distinctions to quality.

But who tests for these? Right now testers often feel very content letting developers (of the code-specialty variety) handle this, at least at this abstraction level.

But why? These are the “most common catastrophes,” according to Susan Fowler and others. Aren’t testers all about looking for the things that will, in the military sense, kill us first? These are the things testers should be focusing on, right?

But they aren’t.

Why not?

Problems here can compromise the ability of a service or the entire ecosystem of services to deliver value to customers! That should be very concerning to testers and thus be where a lot of their testing takes place.

But it isn’t.

Again, why not?

Well, truthfully, I’m making assumptions there with “they aren’t” and “it isn’t.” But how often do you see testers engaging at this level of the abstraction stack in such a way that they could truly “own quality” for it? And, again, that still raises the question: should they own quality, either for the whole or the parts?

Move Up the Abstraction

Let’s shift gears here for a second. Let’s consider an API. APIs are driving many businesses these days. And many testers do work at this level of abstraction from an execution standpoint. But to “own quality” you would have to also work at the design standpoint. So I present you with this schematic of an API:

/groups                         GET     Gets a list of all groups
/groups/{id}                    GET     Gets details for a single group
/groups/{id}/members            GET     Gets members of a group
/groups                         POST    Creates a new group
/groups/{id}/members            POST    Adds a member to a group
/groups/{id}                    PUT     Updates group properties
/groups/{id}/members/{memberId} PUT     Updates member properties
/groups/{id}/members/{memberId} DELETE  Removes a member from the group
/groups/{id}                    DELETE  Deletes an entire group

If you are a tester, and you want to “own quality”, all I will ask here is a simple thing. Apply testing as a design activity. Is that API okay? Are there any potential problems with the endpoints?

When I say “design activity” here I mean as opposed to an “execution activity” since, at this point, there is no API to execute. We are simply documenting the API that we are proposing.

Oh, and regarding the API, one thing that probably would stand out immediately to a developer is this:

There’s no way to get to the information on a member without first knowing the ID of a group to which that member belongs.

That’s a potential quality degrader. And if a developer is going to spot that, then a tester, who is “owning quality”, would have to do so as well.
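
A design-time check like that can be applied before any API exists, because all it needs is the list of proposed routes. Here is a minimal sketch, assuming the endpoints from the schematic above are written down as plain (path, method) pairs. The function name is hypothetical and the rule it applies is deliberately naive: a resource that only ever appears beneath a parent is flagged.

```python
# The proposed API, as (path, method) pairs.
ENDPOINTS = [
    ("/groups", "GET"),
    ("/groups/{id}", "GET"),
    ("/groups/{id}/members", "GET"),
    ("/groups", "POST"),
    ("/groups/{id}/members", "POST"),
    ("/groups/{id}", "PUT"),
    ("/groups/{id}/members/{memberId}", "PUT"),
    ("/groups/{id}/members/{memberId}", "DELETE"),
    ("/groups/{id}", "DELETE"),
]


def is_placeholder(segment):
    """True for path parameters such as {id}."""
    return segment.startswith("{") and segment.endswith("}")


def resources_only_reachable_via_parent(endpoints):
    """List resource names that never appear as the first path segment."""
    top_level, nested = set(), set()
    for path, _method in endpoints:
        # Keep only the literal segments, dropping parameters like {id}.
        names = [s for s in path.strip("/").split("/") if not is_placeholder(s)]
        if names:
            top_level.add(names[0])
            nested.update(names[1:])
    return sorted(nested - top_level)


if __name__ == "__main__":
    print(resources_only_reachable_via_parent(ENDPOINTS))
```

For the proposed API this flags `members`. Whether that is actually a problem is still a design conversation, but the check surfaces the question while the design is cheap to change.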

Fourth Problem

So I would argue this is another, fourth, problem for “owning quality.” You have to be able to see testing as a design activity and as an execution activity.

A further corollary to this, I would argue, is that you have to see testing as an activity that, in part, puts pressure on design. That happens at different abstraction levels, from requirements on down. You also have to see that testing, in part, is about keeping design cheap at all times.

Understand Shifting Points of Failure

To “own quality” you truly have to understand single points of failure, connected points of failure, and when something isn’t (immediately) a point of failure.

Let’s take an example. Let’s say we have a microservice that uses Redis as a message broker and Celery as a task queue. This is a relatively common setup. Celery basically has a bunch of “workers” that process tasks as they come in from Redis, where those tasks are queued up. Let’s say Celery bombs out on us for some reason.

Is that a single point of failure?

No, it’s not. Redis, in the above case, will basically keep queuing the tasks, and the Celery workers will pick them up once they are back in action. So, yes, Celery took a nap for a bit but Redis was still available and none of the tasks that would have gone to Celery have been lost. Thus, by definition, Celery is not a single point of failure.
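
That behavior can be shown with a toy model. The sketch below does not use the real Redis or Celery APIs. `ToyBroker` and `drain` are hypothetical stand-ins, and the broker is assumed to have unlimited room, which is the assumption that makes this scenario safe.

```python
from collections import deque


class ToyBroker:
    """Stand-in for the broker: an unbounded in-memory FIFO queue."""

    def __init__(self):
        self.queue = deque()

    def publish(self, task):
        self.queue.append(task)


def drain(broker, workers_up):
    """Workers take everything off the queue, but only while they are running."""
    processed = []
    while workers_up and broker.queue:
        processed.append(broker.queue.popleft())
    return processed


broker = ToyBroker()
done = []
for task_id in range(10):
    broker.publish(task_id)
    # Simulated outage: the workers are down while tasks 3 through 6 arrive.
    workers_up = not (3 <= task_id < 7)
    done.extend(drain(broker, workers_up))

print(done)               # every task, in the original order
print(len(broker.queue))  # nothing left behind
```

Tasks published during the outage simply wait in the queue and are processed, in order, once the workers return. Nothing is lost, but only because the queue never ran out of space.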

Okay, so tester! You “own quality.” Someone tells you the above. What’s your immediate concern?

Where might quality get degraded?

Well, probably in the number of tasks being built up, right? But that leads to a fundamental question that a tester needs to ask.

“How much traffic does this microservice host?”

Notice that more traffic is, as a general rule, a sign of quality: people are using us more and thus getting value from us. But traffic can also degrade quality.

If the answer to that tester question is that the service can receive, say, thousands of requests per minute or even per second, then clearly those queues are going to fill up. That means, eventually, Redis is going to run out of memory. That means tasks will, in fact, be lost.

Redis is thus a single point of failure in this case.
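
How quickly the queue stops being a safety net can be estimated with rough arithmetic. This is a back-of-the-envelope sketch: the free memory, the size of a queued task, and the request rates are all hypothetical figures chosen for illustration, and it assumes no worker consumes anything during the outage.

```python
def seconds_until_full(free_bytes, task_bytes, tasks_per_second):
    """Time until queued tasks exhaust broker memory with no consumer running."""
    return free_bytes / (task_bytes * tasks_per_second)


# Hypothetical figures: 1 GiB of free broker memory, 2 KiB per queued task.
FREE_BYTES = 1 * 1024**3
TASK_BYTES = 2 * 1024

if __name__ == "__main__":
    for rate in (10, 1_000, 50_000):
        minutes = seconds_until_full(FREE_BYTES, TASK_BYTES, rate) / 60
        print(f"{rate:>6} tasks/s -> full after about {minutes:,.1f} minutes")
```

Under these assumptions, a quiet service at 10 tasks per second can ride out an outage of many hours, while one at 1,000 tasks per second has under nine minutes. What happens once memory is exhausted depends on how the broker is configured, which is one more thing a tester at this level would want to know.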

And if you are “owning quality”, what else immediately jumps to your mind about this scenario? Consider what our microservices are running on.

The concern is that multiple microservices can rely on Redis, given the nature of what Redis is. And now those services are also losing their tasks. But how much and to what extent may also depend on the traffic of those services and what they do to handle adverse situations.

So you have a connected chain of points of failure.
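
Finding those connected points starts with knowing which pieces of infrastructure are shared. Here is a small sketch, assuming a hand-maintained map of services to what they rely on. The service names and dependencies are hypothetical.

```python
from collections import defaultdict

# Hypothetical map of services to the infrastructure they rely on.
DEPENDS_ON = {
    "orders": ["redis", "postgres"],
    "notifications": ["redis"],
    "search": ["elasticsearch"],
    "billing": ["redis", "postgres"],
}


def shared_dependencies(depends_on):
    """Return each dependency used by more than one service, with its users."""
    users = defaultdict(list)
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            users[dependency].append(service)
    return {dep: sorted(svcs) for dep, svcs in users.items() if len(svcs) > 1}


if __name__ == "__main__":
    for dependency, services in shared_dependencies(DEPENDS_ON).items():
        print(f"{dependency}: shared by {', '.join(services)}")
```

Anything this reports is a candidate for the kind of chained failure described above, and a natural place to ask what each dependent service does when that shared piece degrades.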

So … Who Owns Quality Here?

Developers already have to think this way when constructing such systems, and operations teams have to think this way when deploying them. So if testers are going to “own quality”, they have to think like this as well.

But if we’re all thinking like this, is it true to say that only one of that group (the testers) “owns quality”? Don’t we all own it?

Wouldn’t that be an ownership that is distributed among the delivery team? Where a “delivery team” is made up of business/product, developers, testers, and operations.

I would answer yes, it is very much a distributed form of ownership.

But note that this doesn’t let testers off the hook for understanding this abstraction stack. Which is why I do think, and have said before, that testers should become a type of developer, one that has a test specialty.

Also key to this is that anyone who claims to be “owning quality” has to be focusing on the cost-of-mistake feedback loop. From the time we introduce mistakes, at whatever abstraction level, what’s the earliest time we catch them? The longer that loop, the more quality degraders, both internal and external, can set in.

This is a core driver of technical debt across abstraction layers. Do testers “own quality” for that as well? If so, then that supports my argument that the distinction between testers and developers would have to become much less rigid.

In this post, I focused on the technical abstraction level. In another post I’ll talk about this same idea of “owning quality” but at the business abstraction level.