Schemas are important for development: why and how to use them

When developing backend services, we're usually developing for someone: another service, a third party or a website. We use different types of transport methods like REST, SOAP, gRPC or message queues.

Note: I decided to do a small braindump in this article, so it will briefly touch on a lot of different topics. Hopefully it is insightful!

Basic Validation Techniques

And I bet most developers are familiar with the following code:

// JavaScript code example
function httpHandler(request, reply) {
  // Guard clause for data validation
  if (typeof request.body.email !== 'string') {
    return reply.code(400).send('Bad request, email should be a string');
  }
  // Parse and validate the rest of the request...
}

This is probably the first iteration of learning for most developers, before they figure out it's error prone and annoying to maintain. So they find a way to define the incoming objects, usually with a library or some built-in functionality that defines a type of schema for our data.

// Go code example using github.com/go-playground/validator
package main

import (
    "encoding/json"
    "net/http"

    "github.com/go-playground/validator/v10"
)
type User struct {
    FirstName string `json:"first_name" validate:"required"`
    LastName  string `json:"last_name" validate:"required"`
    Email     string `json:"email" validate:"required,email"`
}

func (u *User) validate() error {
    validate := validator.New()
    return validate.Struct(u)
}

func userHandler(w http.ResponseWriter, r *http.Request) {
    var u User
    err := json.NewDecoder(r.Body).Decode(&u)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    err = u.validate()
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    // Process the request...
}

Which is nice, for a while. These validation techniques are very specific to the language you're writing in, which brings a long list of disadvantages: often you need to support more languages (such as in a microservice architecture), you need to share a REST API, or new developers come along without experience in your validation libraries or even your programming language. They're not language agnostic implementations and there's no global standard.

So that's where schema validation formats come in. Let's continue with a few code examples. This one is written in Node.js with Fastify, the framework that got me thinking there's a better way. Fastify has support for using JSON Schemas to validate data before it's even handled by our handler functions.

// This code only focuses on the handler and I've removed parts like the Fastify listener
const userSchema = {
  body: {
    type: 'object',
    required: ['first_name', 'last_name', 'email'],
    properties: {
      first_name: { type: 'string' },
      last_name: { type: 'string' },
      email: {
        type: 'string',
        format: 'email'
      }
    }
  }
};

fastify.post('/user', { schema: userSchema }, async (request, reply) => {
  const { first_name, last_name, email } = request.body;

  // Continue processing the request...

  return { status: 'ok' };
});

But what's happening here? How is this different from our Go example? First off, inside userSchema.body we're defining our JSON Schema which is, you guessed it, valid JSON. This means we could technically send it anywhere, to any language. In Fastify, the body key inside the schema tells Fastify to validate the request body, but we could also swap out body (or add more keys) to validate incoming headers, path parameters and the query string. In addition to validating, Fastify compiles and optimizes the schemas during the initialisation of the program, which gives it a lot of performance benefits.
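As a rough sketch of what that looks like, here's a route that validates the query string and headers instead of the body (the querystring and headers schema keys come from Fastify's validation docs; the route, fields and header name are just my own illustration):

const searchSchema = {
  querystring: {
    type: 'object',
    required: ['q'],
    properties: {
      q: { type: 'string' },
      limit: { type: 'number' }
    }
  },
  headers: {
    type: 'object',
    required: ['x-api-key'],
    properties: {
      'x-api-key': { type: 'string' }
    }
  }
};

fastify.get('/search', { schema: searchSchema }, async (request, reply) => {
  // By the time we get here, request.query and request.headers have
  // already been validated against the schemas above
  return { results: [] };
});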

If you're curious about how Fastify does this, or if my explanation was insufficient, feel free to read about it in their documentation: https://www.fastify.io/docs/latest/Reference/Validation-and-Serialization/

That means that before the handler function is even called, you know that the body of the request is valid and in the correct format. Since JSON Schema also supports "formats", we can validate things like emails, which saves us a lot of boilerplate code. It keeps us from repeating ourselves, which would otherwise cause errors down the line. And that's not where the advantages stop. If you dropped by Fastify's documentation, you'd also notice that you can define the responses from a handler.

Responses

Fastify also supports validating the data that is returned by your application. That means that if someone manages to, e.g., abuse one of your database queries somehow, Fastify can scrub the extra information from the response. You have full control over the data returned and it keeps our responses consistent. Let's expand our Fastify example from above:

const fastify = require('fastify')();

const userSchema = {
  body: {
    type: 'object',
    required: ['first_name', 'last_name', 'email'],
    properties: {
      first_name: { type: 'string' },
      last_name: { type: 'string' },
      email: {
        type: 'string',
        format: 'email'
      }
    }
  },
  response: {
    200: {
      type: 'object',
      properties: {
        status: { type: 'string' },
        data: {
          type: 'object',
          properties: {
            id: { type: 'number' }
          },
          required: ['id']
        }
      },
      required: ['status', 'data']
    }
  }
};

fastify.post('/user', { schema: userSchema }, async (request, reply) => {
  const { first_name, last_name, email } = request.body;

  // Continue processing the request...
  // Suppose you create a user and receive an id

  const userId = 123; // for example

  reply.send({ status: 'ok', data: { id: userId } });
});

In this example, the response property of userSchema defines a JSON Schema per status code. In this case, it expects a 200 response with a body that has a status property of type string and a data object with a required id property of type number.

If the handler does not send a response that matches the schema, Fastify will log an error, which can help you catch inconsistencies in your response structure. Depending on your JSON Schema definition, Fastify will also remove any extra data that has not been explicitly defined. But once again, there are more benefits...
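For example, with the response schema above, a handler that accidentally includes an extra field would still produce a clean response (the password_hash field is just an illustration):

// Fields not declared in the response schema are stripped by Fastify's
// serializer before the response goes out on the wire
reply.send({
  status: 'ok',
  data: { id: 123 },
  password_hash: 'should-never-leak' // not in the schema, so it is removed
});
// The client receives: { "status": "ok", "data": { "id": 123 } }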

OpenAPI

One thing that you will notice fairly quickly, especially when developing REST APIs, is how hard it is to keep your API documentation up to date. Most of the time it's out of date and inaccurate, which is annoying both for you as a developer and for the consumer of the API. OpenAPI is a tool to fix that.

To quote one famous LLM's description of OpenAPI:

OpenAPI, formerly known as Swagger, is a specification for building APIs. It provides a standardized framework that allows both humans and machines to understand the capabilities of a service without having to access source code, documentation, or through network traffic inspection. The OpenAPI specification is language-agnostic and is both human-readable and machine-readable. It describes endpoints, request/response types, authentication methods, contact information, license, terms of use, and other information.

OpenAPI definitions are written in JSON, the same as our JSON Schemas, and normally you would have to write the OpenAPI definition yourself and maintain it alongside your code. But that's where our previous request and response definitions come in: we can use them both for data validation and to generate an OpenAPI document. Suddenly our API and API documentation are in sync, by default! No more maintaining documentation separately when updating your code.
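As a sketch of how that can look with Fastify (assuming the @fastify/swagger and @fastify/swagger-ui plugins; the API title and route prefix are just examples):

const fastify = require('fastify')();

// Register the swagger plugin before the routes so it can pick up their schemas
fastify.register(require('@fastify/swagger'), {
  openapi: {
    info: { title: 'User API', version: '1.0.0' }
  }
});

// Serves an interactive documentation page, here under /docs
fastify.register(require('@fastify/swagger-ui'), { routePrefix: '/docs' });

// Any route registered with a schema (like userSchema above) now shows up
// in the generated OpenAPI document automatically

Every route schema we already wrote for validation doubles as documentation, which is exactly the "in sync by default" property described above.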

Avro, Protobuf and JSON Schema

The most common language-agnostic ways to validate data today are Avro, Protobuf and JSON Schema. We've already gotten to know JSON Schema a bit, but let's look at the other two formats and how they're implemented.

A quick note on Protobuf - the standard for gRPC

Protocol Buffers (Protobuf) is a language-agnostic binary format developed by Google. It's used to serialize and deserialize data, making it a useful tool for communication between services, as well as for storing structured data. Protobuf is known for its efficiency and performance, offering small payload size and fast processing times compared to text-based formats like JSON or XML. Protobuf messages are strongly-typed and require a schema (defined in a .proto file) that specifies the structure of the messages. This allows for backwards and forwards compatibility, versioning, and the ability to evolve schemas over time.

Protobuf doesn't support formats for e.g. emails like JSON Schema does. But what I like about Protobuf is that it forces developers to define their requests and responses. This is something I believe tools like Fastify should implement too, so there are no missing definitions and all APIs are consistent. We have RFC standards for pretty much everything today, and HTTP error formats are so well defined that a more opinionated implementation shouldn't be an issue.
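To give a rough idea of what that enforcement looks like, here's a small .proto sketch (proto3 syntax; the service and field names are purely illustrative):

syntax = "proto3";

package user.v1;

// Every RPC has to declare both its request and its response message types
service UserService {
  rpc CreateUser (CreateUserRequest) returns (CreateUserResponse);
}

message CreateUserRequest {
  string first_name = 1;
  string last_name = 2;
  string email = 3; // no built-in "email" format like JSON Schema has
}

message CreateUserResponse {
  int64 id = 1;
}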

Why opinionated tools are important

Every developer wants flexibility. But often this flexibility brings errors: a library tries to do too many things and ends up doing a lot of them badly, or maybe there are 20 different ways to implement the same thing (looking at you, React) and five developers work on the same project with five different coding styles. Tools like linters have popped up to try to mitigate this, but the frameworks themselves are the big culprits. They want to be approachable by anyone and everyone instead of focusing on doing one thing well.

This issue keeps on happening and needs to be addressed: if a request or response definition is missing, or the definition doesn't follow the relevant RFC correctly but no fatal errors are thrown, then other tools that rely on it being there will start to fail. I believe more opinionated but language-agnostic tools are needed to force us to write less error-prone code.

Centralising Schemas

Before finishing off, I want to throw out a note on centralising schemas. At some point when building a complex application, your schemas will be spread around. You will also have to implement schemas for every transport method you're using; a REST handler will have a different way of implementing them compared to, e.g., a Kafka message queue. It's actually fairly common in the Kafka world to use what is called a schema registry, which tells us exactly what the schemas look like and exactly where to find them. Schema registries give us schema versioning and make sure changes stay backwards compatible.

By offering a way to share, store, and manage schemas in a centralized and versioned manner, schema registries play a crucial role in ensuring that an organization's data is well-managed, reliable, and readily usable.
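As a rough sketch, assuming a Confluent-style Schema Registry running at a hypothetical http://localhost:8081, registering a JSON Schema for a subject could look something like this:

// Register a new schema version for the "user-created-value" subject
// (subject name and URL are illustrative)
const response = await fetch('http://localhost:8081/subjects/user-created-value/versions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/vnd.schemaregistry.v1+json' },
  body: JSON.stringify({
    schemaType: 'JSON',
    schema: JSON.stringify({
      type: 'object',
      required: ['id', 'email'],
      properties: {
        id: { type: 'number' },
        email: { type: 'string' }
      }
    })
  })
});

const { id } = await response.json(); // the registry returns a global schema id

Consumers can then fetch the schema by id or subject and validate their payloads against the exact same definition the producer used.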

So use a schema registry. Thanks for reading.