A Tutorial On Redaction

This image has been redacted for your convenience.

Writing software is fun. Redacting software is torturous. It gets worse when the software was never intended for redaction, but you need to redact it anyway.

I’ve worked on multiple projects in the past decade and all of them have involved redaction at some level. The source code, the documentation, the bug fix requests; I’ve redacted every type of digital thing you can think of. Sadly, even after all this time, I’ve yet to find a way to remove the vampiric qualities of redaction which consume the souls of those who perform it. However, I have learned a few ways to make redaction more effective with less time-consuming rework, and that’s what this post is all about.

Redaction, if you are uninitiated into the cult, is a form of evil magic where you remove sections of important information from documents or source code all while somehow retaining most of your sanity. Sometimes, the information is removed because you aren’t licensed to give away someone else’s work, but you need to deliver something that contains the work. Other times, you want to protect your own inventions but still be able to sell portions of your work. You might be allowed to make redaction obvious, printing black boxes where text should be. Or maybe you’ll need to completely hide the fact you messed with the content. This latter approach is the one I’ll assume, because I don’t think there’s a whole lot of value in the former.

What’s the Point?

Stop redacting for a second. It’s easy to jump into redaction work and go through some easy, repeatable steps to get your job done and end up missing the whole point of redaction. Remember, the reason you are redacting is because whoever is receiving the information you have can’t have specific stuff it contains.

Are you redacting to remove terms? Maybe the names of intellectual properties? If that’s all, you might think you can search and replace the contents of the files you need to redact. Replacing terms is easy enough. You can probably finish your redaction work in a few minutes. But what value is there in removing terms? If the people who are being provided the redacted material have any idea what the material was used for, they’ll know which terms were redacted. You haven’t done anything. And if they have no idea what it was used for, why do they care about it?

I’ve found that redaction is time-consuming and tedious, but also an inconsistent process. You can’t write a program to perform redaction for you, because a program can’t interpret every conceivable spelling error, phrasing (especially poor English), or acronym. Searching for terms with software is really helpful, but it only catches the most obvious stuff.

Consider this paragraph:

“The software uses a proprietary component called This Sure Is Awesome Technology. This technology is used to generate output in a comma-separated list, but in columns instead of rows. This is protected by patent Des. 555,555. Our tool uses this tool to turn pictures of ducks into pictures of chickens. Chickens are better than ducks.”

Suppose you can’t transfer This Sure Is Awesome Technology, because it is licensed. And suppose you can’t transfer the patent information because of international law. A search for the relevant terms would get you what you want, but try just removing them:

“The software uses a proprietary component called. This technology is used to generate output in a comma-separated list, but in columns instead of rows. This is protected by patent. Our tool uses this tool to turn pictures of ducks into pictures of chickens. Chickens are better than ducks.”

You’ve left in the nature of the technology in question and the content of the patent. It doesn’t take much effort for someone to replace your redacted text. So what was the point? You need to do this by hand:

“The software uses a proprietary component which translates files from one type to another. Our tool uses this tool to turn pictures of ducks into pictures of chickens. Chickens are better than ducks.”

Here, the meaning is preserved, but not the method, which is the focus of the redaction. Redaction is almost always an effort to protect methods and concepts, so why apply a method that can’t protect anything?

Search, however, is limited. Consider a different writing of the same text:

“The software uses a proprietary component called TSiAT. This technology is used makes row-based CSVs. This is protected by PD 555,556. Our tool uses this tool to turn pictures of ducks into pictures of chickens. Chickens are better than ducks.”

Your search won’t catch every possible spelling error. It won’t catch different forms of the same phrases, especially if those forms have poor grammar. It won’t catch acronyms you don’t expect. If you understand the point of redaction, you won’t consider your job complete just because a search doesn’t return terms from a list of “bad” words.

The Redaction Balance

It’s easy not to go far enough in your redaction and leave too much content behind. On the other hand, redacting massive sections of documents removes any value from them. At that point, why even give the documents away?

You’ve probably seen documents redacted by the government. You know, those poorly scanned pages that have a handful of words floating in a sea of black ink. You might find pieces of information here and there but you might not. Why did they release the document at all if there’s nothing in it?

A better approach, as described above, is to search for terms and concepts  yourself. You, a human being and not a computer, can understand practically everything that might show up in the information you are redacting. It’s tedious, it’s terrible, and it might be evil, but redaction is something you can’t describe in logical terms any more than you can describe writing a book in logical terms. You can’t delegate this terrible work to a computer, no matter how much you want to. And if you are doing redaction the right way, you’ll really, really want to.

Your goal is to remove terms and ideas in a careful way that doesn’t make it obvious that the terms and ideas ever existed. For instance, if you need to remove the section in brackets, do it like this:

“The tool is capable of [feature A], which does X, Y, and [Z] in order, as well as feature B, which does X and Y only.”

Becomes:

“The tool is capable of feature B, which does X and Y in order.”

A hard redaction of this, replacing [feature A] with [redacted feature] and [Z] with [redacted function] would read like this:

“The tool is capable of redacted feature, which does X, Y, and redacted function in order, as well as feature B, which does X and Y only.”

This gives away the fact a feature exists as well as 66% of what it does. If you want someone to know about “Feature B” and not “Feature A”, this is a terrible way to do it.

Acronyms Are Your Enemy

If a term you are replacing is an acronym, things get much worse.

Imagine you have an acronym like RED. That term might appear thousands of times in unrelated words: hundred, redaction, credibility, bred, not to mention the word “red” itself. If this term is just replaced forcibly with something like “Supplier Technology”, you end up with ridiculous sentences like:

There are two-hundSupplier Technology tests, each of which appear in black if they passed and Supplier Technology if they failed, establishing the cSupplier Technologyibility of the claim that the software was tested.”

You’d need to go through these results by hand, which is no faster than searching and reading in the first place. If you left this in place, anyone reading the redacted document would realize that “Supplier Technology” is clearly what all instances of “RED” become. Again, search-and-replace has accomplished nothing.

And this isn’t the end of the pain you will suffer at the hands of linguistic shortcuts. Laziness compels people to turn all sorts of things into acronyms where you may not expect them. And even if you try and expect them, they probably won’t use the same letters you would. This doesn’t even take spelling errors into account. A misspelled acronym is like a land mine of important information you can’t sweep for. It’s just waiting for the recipient of the redacted material to trip on, blowing up in your face. Acronyms are just another reason you should be performing redaction by hand.

Some Precautions

You may have no idea if the documents or software you are writing will be subject to redaction in the future. But if you somehow do know, there are a few things you should keep in mind.

If you want to make redaction trivial, don’t mix different proprietary information. If you are working with three companies, try to keep the IP of each of them separate, restricting interaction to as few documents as possible. You’ll find that this won’t be possible, at least completely. The closer you get to it, however, the easier the redaction.

If you want to make redaction impossibly difficult, use extremely short proprietary terms. For instance, two-character terms like “A1” will show up in binary data in guids, maybe millions of times. Even if an engineer can look through things at a rate of 10/second (which is practically begging for human error), that’s still four full terrible days to look for a term which might legitimately appear twice in its proprietary context. Inconsistent acronyms, lack of spelling and grammatical checks, and images all help multiply the length of time you will need to perform redaction. Avoid these things as much as possible in any material that might be redacted in the future.

Leave a Reply

Your email address will not be published. Required fields are marked *