Code-to-Docs Pipelines

Since the truest source of truth for any software product is in the product source code itself, the only way to insure against product/docs divergence is to programmatically derive canonical product data from the product source code and insert it into the docs every time a new version of the product is released.

As much of a mouthful as that first sentence was, the technique itself is much easier said than done. Nevertheless, the effort required pays dividends in the form of “always updated” docs. (Well, at least eventually it does, in theory… until it doesn’t.)

Let’s talk about building a bridge between product code and docs, and honestly evaluate what it takes to maintain and increase those connections.

Isn’t This Just Docs-as-Code?

If you are familiar with the concept of “docs as code”, you probably think of it in one or both of two ways:

Documentation authored along with and embedded in product source code, such as for APIs
Docs written in Markdown or the like, stored alongside the code in the same or a separate Git repository

Those are valid, core techniques of docs-as-code, but today I want to talk about a third approach, which is less about where we store documentation than how we use existing product source code to inform documentation.

This blog entry is not exclusive to “docs-as-code” techniques, though there will be some obvious overlap.

Let me provide a straightforward example that does not fit with either of those two common understandings of docs-as-code.

Imagine a product that has several standard attributes, with one such attribute being the default installation path. This path is crucial to note in user docs, but it is hard-coded into the product source code as a variable called default_install_path.

Now imagine a commercial technical-writing platform like MadCap Flare, which can ingest CSV files and turn their data into variables. If the product developers automate the generation of an attributes.csv file containing core product data like the default_install_path and its value, technical writers can express the current information accurately anywhere in the docs using a variable placeholder.

You might be thinking, “But Flare isn’t even docs-as-code”, and you certainly would be right for conventional uses of Flare, but we’ve just expressed a case where we’re legitimately single-sourcing documentation content as product code. The default installation path is reported anywhere throughout the docs without ever duplicating the literal path, which is derived automatically from the product source.
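For a sense of how small the developer side of that handoff can be, here is a minimal sketch of a script that dumps product attributes to CSV. The `PRODUCT_ATTRIBUTES` dict, its values, and the two-column `name,value` layout are all assumptions made for illustration; the layout your authoring tool expects for variable imports may well differ.

```python
import csv
import io

# Hypothetical stand-in for values that really live in the product source code.
PRODUCT_ATTRIBUTES = {
    "default_install_path": "/opt/example-product",
    "default_port": 8080,
}


def write_attributes_csv(attributes, stream):
    """Write a header row, then one name,value row per attribute."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["name", "value"])
    for name, value in sorted(attributes.items()):
        writer.writerow([name, value])


def render_attributes_csv(attributes):
    """Return the CSV as a string, which is handy for tests and for printing."""
    buffer = io.StringIO()
    write_attributes_csv(attributes, buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # A release script could redirect this output into attributes.csv.
    print(render_attributes_csv(PRODUCT_ATTRIBUTES), end="")
```

Run as part of every release build, a step like this keeps the file in lockstep with the code, and the writers never have to touch the literal values.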

The Abstraction Distraction

I want to make clear that the real value of technical writers is in conveying info that the product manager (PM) may know and the product engineers may very well not know.

Consider a form field or API endpoint that gets labeled Project start date. In the underlying field name and the database backend, this is probably something like start_date or started_at, corresponding with a table or model called Projects.

There might even be a comment or documentation string that gets conveyed into the API docs or right there in the form’s inline help. It will almost always be something unhelpful like: “The date the project started.”

To the programmers, this datum means almost nothing. They know it must be a Date (or similar) data type, and they assume it to chronologically precede the corresponding end_date. (Maybe their code even enforces that chronology.)

To the user entering information, or the user viewing it as a report, the “project start date” corresponds to something official and far too real-world to belong in the source code. For instance, “The date of the first meeting to convene the task group.”

This is an extremely simple example of why technical writers should matter to product owners. There is a decent chance the product developers do not know or care about the analog/real-world correspondent of any piece of data coming into or out of the product.

I cannot tell you how many front-end engineers I have met who don’t really know how to talk to users about the interfaces they develop. Technical writers are often surprised by this, but it makes sense: programmers are often uninvolved in interface design. They are given specifications to meet, and how a real-world user decides what constitutes the correct date to enter is not actually all that important to the programmer.

Everything in this section is meant to illustrate the kind of information that products typically should not store in their source code.

And while I do recommend sourcing user-facing info in lightweight markup files in the product code repo, the point is it needs to be owned by or at least accessible to technical writers and the Product team. The SME in this case is likely a product manager, not an engineer.

Product Attributes

Most products are defined by lots and lots of specific data.

Consider a configuration file. This is a flat file in some format like YAML, JSON, XML, or INI that enables parameterized settings of key product attributes for a given installation. Users will install the product, then configure it in product-name.ini or something like that.

Ideally, these files can be treated very much like an API — all their associated info can be sourced in native code or something like YAML, so developers have a central place to edit stuff like data type and default value.

Consider this Python code intended to establish default values and enforce a certain type for a given configuration setting.

A minimally structured way to define product attributes.

from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class AppSettingsDC:
    db_url: str = "sqlite:///./data.db"
    pool_size: int = 5
    timeout_seconds: float = 30.0
    # dates have no default...
    project_start_date: Optional[date] = None
    project_end_date: Optional[date] = None

Above is how you will typically find such settings. This is of minimal value for informing the docs, except as described above, where a CSV file could be output so the docs could read data generated from the source code.

But even in that case, the value is somewhat limited. A tech writer is going to have to manually author docs that refer to these settings.

A Markdown table reporting product attributes

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `db_url` | str | `{{ site.data.attrs.db_url }}` | SQLAlchemy‑compatible connection URL. |
| `pool_size` | int | `{{ site.data.attrs.pool_size }}` | Number of connections kept alive in the DB connection pool. |
| `timeout_seconds` | float | `{{ site.data.attrs.timeout_seconds }}` | How long (in seconds) the driver waits for a response before timing out. |

Once created, the default values can stay up to date with product changes. But it’s easy to see why it would be a pain to author and maintain an entire reference of dozens or hundreds of such settings. It should also be easy to see why changes to the setting name, type, or description can easily fall out of sync and become just plain wrong.
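The placeholders in that table resemble Jekyll-style data lookups, which suggests a data file named something like attrs.json. That file name and the export script below are my assumptions, not something the product would already ship. As a sketch, the defaults could be pulled straight off the dataclass like this:

```python
import json
from dataclasses import MISSING, dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class AppSettingsDC:
    db_url: str = "sqlite:///./data.db"
    pool_size: int = 5
    timeout_seconds: float = 30.0
    project_start_date: Optional[date] = None
    project_end_date: Optional[date] = None


def export_defaults(settings_cls):
    """Collect each field's default, skipping fields that have none."""
    return {
        field.name: field.default
        for field in fields(settings_cls)
        if field.default is not MISSING and field.default is not None
    }


def defaults_as_json(settings_cls):
    """Serialize the defaults for a docs data file (hypothetical attrs.json)."""
    return json.dumps(export_defaults(settings_cls), indent=2, sort_keys=True)


if __name__ == "__main__":
    print(defaults_as_json(AppSettingsDC))
```

Notice what this does not capture: the type names and descriptions in the table are still typed by hand, which is exactly the weakness described above.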

More helpful would be a robustly structured system. Take a look at this next example.

A more fully structured system for product attributes

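from typing import Literal, NotRequired, TypedDict  # NotRequired needs Python 3.11+

class SettingMeta(TypedDict):
    """Structure that holds the core three elements we need to convey."""
    type:      Literal["str", "int", "float", "date"]
    default:   NotRequired[str] # literal default as a string; omitted when there is none
    desc:      str # free‑form description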

SETTINGS: dict[str, SettingMeta] = {
    "db_url": {
        "type":    "str",
        "default": "sqlite:///./data.db",
        "desc":    "SQLAlchemy‑compatible connection URL.",
    },
    "pool_size": {
        "type":    "int",
        "default": "5",
        "desc":    "Number of connections kept alive in the DB connection pool.",
    },
    "timeout_seconds": {
        "type":    "float",
        "default": "30.0",
        "desc":    "How long (in seconds) the driver waits for a response before timing out.",
    },
    "project_start_date": {
        "type":    "date",
        "desc":    "The start date of the project.",
    },
    "project_end_date": {
        "type":    "date",
        "desc":    "The end date of the project.",
    },
}

Now you can probably imagine how a configuration reference document could be auto-generated from a collection of numerous settings like these. A simple template can convert that data to markup, so the output is regenerated whenever the product and/or docs are built.
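To make that concrete, here is a minimal sketch of such a "template", written as a plain Python function rather than a real templating engine. The trimmed `SETTINGS` dict and the `(none)` marker for settings without a default are my own choices for illustration.

```python
# Trimmed copy of the structured settings, kept short for the example.
SETTINGS = {
    "db_url": {
        "type": "str",
        "default": "sqlite:///./data.db",
        "desc": "SQLAlchemy-compatible connection URL.",
    },
    "pool_size": {
        "type": "int",
        "default": "5",
        "desc": "Number of connections kept alive in the DB connection pool.",
    },
    "project_start_date": {
        "type": "date",
        "desc": "The start date of the project.",
    },
}


def render_reference_table(settings):
    """Build a Markdown reference table with one row per setting."""
    lines = [
        "| Setting | Type | Default | Description |",
        "|---------|------|---------|-------------|",
    ]
    for name, meta in settings.items():
        # Settings without a default get a visible marker instead of an empty cell.
        default = f"`{meta['default']}`" if "default" in meta else "(none)"
        lines.append(f"| `{name}` | {meta['type']} | {default} | {meta['desc']} |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_reference_table(SETTINGS))
```

A real setup would more likely use a template language and would need to escape characters such as pipes in descriptions, but the principle is the same: nobody hand-edits the table.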

The Catch

You’re probably already noticing a problem, though. While the highly technical data about the database backend may be explainable in one sentence and fully maintainable by programmers, there’s that pesky “project start date” we used as a previous example.

We already know this isn’t a good enough description for users, who themselves may not know what constitutes the actual start date they should enter. And I added a project_end_date for good measure — you can probably already imagine that being confusing: is it a future date or do we wait to report the actual end date?

Suddenly, we’re either asking the developers to maintain the user-facing documentation inside their source code, or we need to find some way to complement the source code with those enhanced docs.

None of this is to mention that even when product attribute descriptions are straightforward, they often need to be proofread and edited by a professional, who is probably not a developer working in the product codebase.

Doing all of this much better than so-far discussed is quite possible, but I needed to get to this point so anyone reading along can appreciate the problem. I often hear developers shrug off the idea that this is “even a thing”. They think the sentence fragment they jot into a desc field should be good enough.

Meanwhile, technical writers typically consider this an insurmountable obstacle. The developers are going to define this stuff how they see fit, TWs assume, and they don’t really care if we (tech writers) can make use of it as a shared source of truth for the docs.

Next we will look at a few ways to address this conundrum.

Truth Strategies

There are options for handling the problem of where the truth gets sourced. Here I lay them out from least to most optimal.

The conventional method: Technical writers share a document with developers that is a third “source of truth”, which the product and (especially) docs are supposed to adhere to. Optimally, both parties copy and paste from this document into their files as needed, though in reality, devs tend to write the code first and then report into this doc. TWs then add any user-facing enhancements to the description, or they overwrite the description in their third source of truth.

This process is both dreadful and typical.

The read-only code method: Technical writers have access to the source code, and they use it as a reference, all the while maintaining a second source of truth.

This practice should be the bare minimum for any team hoping to maintain accurate docs, but the accuracy will depend on the vigilance of the TWs, following up on a reliable system for reporting product changes.

The overlay method: Tech writers maintain a secondary file in a format like YAML, whose contents can establish, overwrite, or augment the descriptions of particular settings, but references can still be built automatically. The source is still maintained by devs, but TWs can affect the user-facing docs output without messing up canonical information like data types, default values, and so forth.

This is a legitimate practice that is mostly used with the newly introduced OpenAPI Overlays, though I have seen it used in custom setups a couple of times. It counts as legitimate single sourcing because each part of each attribute is singly sourced (types and defaults in the code, descriptions in the overlay).

The collaborative method (native-code approach): Tech writers maintain the desc fields right in the source code. They use Git and code-editors to modify the documentation source strings at agreed-upon points in the process, mindful not to edit anything but the documentation aspects.

This is more common than many people realize, and it is probably the way to establish true single sourcing that is lowest in terms of up-front friction. The docs still need to draw from the product code through an automated process, but there is no need for a secondary system.

The collaborative method (shared-definition approach): Mature software products should be defined in a neutral, user-friendly format (like YAML) so that the product source code, the documentation, and even testing code can be derived from a common, canonical source.

This is much easier and safer for TWs and even PMs to participate in directly, but it requires more up-front planning and infrastructure preparation.

In the following sections, I will detail and show examples of the last three strategies above. (The first two strategies are self-explanatory and not recommended.)

The Overlay Method

Once there is a fixed data structure in place that is being used to generate output for reference docs, that system can be “overlaid” with supplementary or overriding data.

project_start_date:
  desc: The date of the first meeting to convene the task group.
project_end_date:
  desc: The expected or actual date of the final meeting to adjourn the task group.

There really is no reason to carry over the original desc parameter values from the source code; we can just overwrite them in cases like this. But if you wanted to augment them, a parameter like docs is advised.

db_url:
  docs: |
    SQLite is used by default for a local file-based DB.
    For production, you should use a more robust database like PostgreSQL or MySQL.

This method works just fine, but it is always a tail chasing the dog. Technical writers under such a regimen are only able to describe the product after the fact.
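Mechanically, applying an overlay is a small merge step at docs-build time. Below is a minimal sketch that assumes both the code-derived settings and the overlay have already been parsed into dicts. The `BASE` and `OVERLAY` data are trimmed from the examples above, while `WRITER_OWNED_FIELDS` and the rule that an overlay may only touch those fields are my own guardrails, not part of any standard.

```python
import copy

# Code-derived settings (trimmed), as a build script might extract them.
BASE = {
    "db_url": {
        "type": "str",
        "default": "sqlite:///./data.db",
        "desc": "SQLAlchemy-compatible connection URL.",
    },
    "project_start_date": {
        "type": "date",
        "desc": "The start date of the project.",
    },
}

# Writer-maintained overlay, already parsed from its YAML file.
OVERLAY = {
    "db_url": {"docs": "SQLite is used by default for a local file-based DB."},
    "project_start_date": {
        "desc": "The date of the first meeting to convene the task group.",
    },
}

# Assumed policy: writers own the prose, while the code owns types and defaults.
WRITER_OWNED_FIELDS = {"desc", "docs"}


def apply_overlay(base, overlay):
    """Return a merged copy of base; the inputs are left untouched."""
    merged = copy.deepcopy(base)
    for name, extra in overlay.items():
        if name not in merged:
            # A stale overlay entry usually means a setting was renamed or removed.
            raise KeyError(f"overlay refers to unknown setting: {name}")
        for field, value in extra.items():
            if field not in WRITER_OWNED_FIELDS:
                raise ValueError(f"overlay may not change {name}.{field}")
            merged[name][field] = value
    return merged


if __name__ == "__main__":
    print(apply_overlay(BASE, OVERLAY)["project_start_date"]["desc"])
```

Failing loudly on unknown settings is deliberate here: it is one of the few ways the docs build can tell writers that the product changed underneath them.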

The Collaborative Methods

When conditions align, the preferable way of handling documentation is to have tech writers and programmers working in the same source, wherever the product data is actually defined.

This can happen in the programming language files where this kind of data/content is typically sourced. I have seen teams where TWs modify the same source code as engineers, sometimes even committing to the same code branches prior to them being merged into a trunk.

The more accessible and powerful way is to actually define this matter in shared data documents. OpenAPI is surely the best-known form of this, but it only applies to REST APIs, which all share a relatively common architecture.

OpenAPI documents (OADs) can be used to indicate how the product should work, which means they are also a perfect source for reference documentation.

Configuration files and the like do not have a common standard specification for definition and documentation, but teams that want to explore these collaborative approaches can develop flat-file sourcing structures and templates for processing them.

Example schema defining a configuration file structure

settings:
  db_url:
    type: str
    default: sqlite:///./data.db
    desc: SQLAlchemy‑compatible connection URL.
    docs: |
      SQLite is used by default for a local file-based DB.
      For production, you should use a more robust database like PostgreSQL or MySQL.
  pool_size:
    type: int
    default: 5
    desc: Number of connections kept alive in the DB connection pool.
  timeout_seconds:
    type: float
    default: 30.0
    desc: How long (in seconds) the driver waits for a response before timing out.
    range: [1,120]
  project_start_date:
    type: date
    desc: The date of the first meeting to convene the task group.
  project_end_date:
    type: date
    desc: The expected or actual date of the final meeting to adjourn the task group.

Now imagine a data structure like this, with tech writers given full access, generating not only docs, underlying product code, and tests, but also a form-fill interface for users, with inline help based on the desc and docs fields.

This approach has the additional advantage of keeping the product source code remarkably clean; no more description parameters cluttering the file in our example source code.

Once these core product attributes are defined in a neutral format, whether they represent unchangeable data like default_install_path or settings configurable via flat-file or form UIs, the potential is limitless. They can be used to establish the product settings, validate user input, perform unit tests, and of course, generate reference docs.
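As one illustration of the "validate user input" part, here is a minimal sketch that resolves a user's configuration against a shared definition. The `SCHEMA` dict is a trimmed, hand-written stand-in for a parsed definition file, and `resolve_settings`, `PYTHON_TYPES`, and the error wording are hypothetical names and choices of mine.

```python
from datetime import date

# Trimmed stand-in for a shared definition after parsing (a YAML parser is assumed).
SCHEMA = {
    "pool_size": {"type": "int", "default": 5},
    "timeout_seconds": {"type": "float", "default": 30.0, "range": [1, 120]},
    "project_start_date": {"type": "date"},
}

# Map the schema's type names onto Python types; whole numbers pass as floats.
PYTHON_TYPES = {"str": str, "int": int, "float": (int, float), "date": date}


def resolve_settings(schema, user_values):
    """Apply defaults and validate; return (resolved, errors)."""
    resolved, errors = {}, []
    for name in user_values:
        if name not in schema:
            errors.append(f"{name}: unknown setting")
    for name, meta in schema.items():
        value = user_values.get(name, meta.get("default"))
        if value is None:
            continue  # neither a user value nor a default exists
        expected = PYTHON_TYPES[meta["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name}: expected {meta['type']}")
            continue
        if "range" in meta:
            low, high = meta["range"]
            if not low <= value <= high:
                errors.append(f"{name}: must be between {low} and {high}")
                continue
        resolved[name] = value
    return resolved, errors


if __name__ == "__main__":
    print(resolve_settings(SCHEMA, {"timeout_seconds": 500}))
```

The same `type`, `default`, and `range` entries that drive this check are the ones the reference docs would print, which is the whole point of defining them once.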

Getting Started with Single Sourcing

The above examples are idealized, but they do illustrate the potential of using product source code as the single source of truth for at least some aspects of a product’s documentation.

In reality, there are often obstacles, starting with dependence on product developers to put the infrastructure in place.

Getting that CSV or JSON file of canonical product data generated for your authoring system to ingest might be too much to ask, but it is a reasonable place to start.

A full-fledged single-sourcing definition system will likely require a dedicated contractor or an internal task force.

The latest AI technology makes experimenting with this kind of stuff safe and accessible. If you are a technical writer with read access to the product source code, try asking an AI agent or chatbot to help script a bridge between the source code and your documentation platform.

See my post Vibe Coding vs AI-assisted Programming for resources and advice on AI-assisted coding.

It probably won’t do the job perfectly, but it might get close enough that developers are excited to polish it off or recreate it if they can see it in action and appreciate the benefits it offers.
