Semi-compiled languages such as Java and the Microsoft Intermediate Language (MSIL) are particularly easy to disassemble or reverse engineer. Unlike native code, the intermediate byte codes contain complete variable names, such that disassembly generates almost the exact source code of the original program. The only notable absence is the comments from the original source code. Everything else is there.
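How much naming information survives compilation can be seen with a few lines of Java reflection. The following is a minimal sketch, not a disassembler: PayrollService and NameLeakDemo are hypothetical names invented for illustration, and the listing shows only field and method names (local variable names generally survive only when debug information is compiled in).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class standing in for compiled application code.
class PayrollService {
    private double bonusRate = 0.05;

    double calcPayroll(double baseSalary) {
        return baseSalary * (1 + bonusRate);
    }
}

class NameLeakDemo {
    // Collects the member names that the compiled class still carries.
    static List<String> memberNames(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Field f : type.getDeclaredFields()) {
            if (!f.isSynthetic()) {
                names.add("field " + f.getName());
            }
        }
        for (Method m : type.getDeclaredMethods()) {
            if (!m.isSynthetic()) {
                names.add("method " + m.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Prints the original identifiers straight from the compiled class.
        for (String name : memberNames(PayrollService.class)) {
            System.out.println(name);
        }
    }
}
```

A disassembler reads the same metadata, which is why its output can look so close to the original source.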
For ISVs and other commercial developers who want to protect their intellectual property, this ease of disassembly poses a significant and well-known problem: algorithms can be reconstructed and studied, and program code can be reconstituted and customized. (Even in-house, noncommercial applications are vulnerable to the source-code access that disassembly makes possible. For example, database passwords, whether stored on their own or embedded in SQL statements, become easily accessible to users. Likewise, sites that use outside Web hosts are at risk if they upload their ASP.NET code, because staff at the hosting facility can reconstruct all the programs should they wish to.)
Moreover, the tools that hackers or even curious users might need to reverse engineer code are widely available. Microsoft offers its own MSIL disassembler, called ILDASM, at no cost. The Anakrino tool is an open-source disassembler for .NET (go to http://www.saurik.com/net/exemplar/), and various other companies offer equivalent tools on a commercial basis.
Protecting Your Code
The most effective way to protect your code from these forms of reverse engineering and snooping is to obfuscate it. ("Obfuscate" means "...to make opaque (so) as to be difficult to perceive or understand," according to the American Heritage Dictionary, 3rd Ed.) Tools today perform this trick by various means that primarily focus on making the variable names meaningless, encrypting strings and literals, and inserting misleading directives that render disassembled code uncompilable.
The upcoming release of Visual Studio (called VS.NET 2003 and code-named Everett) sports an integrated obfuscating tool that Microsoft suggests running as a final pass on .NET intermediate code. The obfuscator is the so-called "lite" version of a more robust obfuscating utility, Dotfuscator, sold by Preemptive Solutions, a Cleveland-based company that got its start obfuscating Java code. Dotfuscator uses a remarkable variety of techniques to make disassembly futile or, at least, very difficult.
Overload induction is Preemptive Solutions' name for its patented technique of changing variable names in the intermediate code. (Obfuscators never touch source code, nor even need to reference it.) It takes advantage of the fact that the same identifier name can be used for classes and for methods with different signatures. And within different namespaces, variables can use the same name without colliding. Dotfuscator exploits these lexical features to rename as many items as possible to the letter 'A.' The company claims that on some code 33% of references can be renamed to A and another 10% to B. This transformation makes disassembled code extremely hard to understand. Consider the following example:
Disassembled code without obfuscation:
private void CalcPayroll(SpecialList employeeGroup) {
    while (employeeGroup.HasMore()) {
        employee = employeeGroup.GetNext(true);
        employee.updateSalary();
        DistributeCheck(employee);
    }
}
Same code with obfuscation:
private void a(a b) {
    while (b.a()) {
        a = b.a(true);
        a.a();
        a(a);
    }
}
It is clear that both snippets perform the same logic. However, it is extraordinarily difficult to determine what the second snippet is doing, much less which fields and methods exactly are being accessed.
This renaming feature can be configured so that if you're building a DLL, let us say, the APIs are untouched. Interestingly, this feature alone visibly shrinks code simply by the reduction of so many variable names to just one character.
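The renaming idea can be sketched in a few lines of Java. This is an illustrative simplification and not Preemptive Solutions' patented algorithm: it treats every method as belonging to one class, identifies each by a "name(paramTypes)" string, and reuses the shortest free name whenever the parameter lists differ. The class name, the keep set, and the GetVersion method are assumptions made up for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class OverloadInductionSketch {
    // Generates a, b, ..., z, aa, ab, ... for index 0, 1, 2, ...
    static String shortName(int index) {
        StringBuilder sb = new StringBuilder();
        for (int i = index; i >= 0; i = i / 26 - 1) {
            sb.insert(0, (char) ('a' + i % 26));
        }
        return sb.toString();
    }

    // Maps each signature to its new name. Two methods may share a new name
    // as long as their parameter lists differ, because the overload remains
    // resolvable. Names listed in keep (e.g. a DLL's public API) are untouched.
    static Map<String, String> rename(List<String> signatures, Set<String> keep) {
        Map<String, Set<String>> usedByParams = new HashMap<>();
        // First pass: reserve the names that must survive unchanged.
        for (String sig : signatures) {
            int paren = sig.indexOf('(');
            String name = sig.substring(0, paren);
            if (keep.contains(name)) {
                usedByParams.computeIfAbsent(sig.substring(paren), k -> new HashSet<>()).add(name);
            }
        }
        // Second pass: give every other method the shortest free name.
        Map<String, String> result = new LinkedHashMap<>();
        for (String sig : signatures) {
            int paren = sig.indexOf('(');
            String name = sig.substring(0, paren);
            String newName = name;
            if (!keep.contains(name)) {
                Set<String> used = usedByParams.computeIfAbsent(sig.substring(paren), k -> new HashSet<>());
                int i = 0;
                while (used.contains(shortName(i))) {
                    i++;
                }
                newName = shortName(i);
                used.add(newName);
            }
            result.put(sig, newName);
        }
        return result;
    }

    // Hypothetical input loosely based on the payroll example.
    static Map<String, String> demo() {
        List<String> signatures = Arrays.asList(
                "CalcPayroll(SpecialList)", "HasMore()", "GetNext(boolean)",
                "updateSalary()", "DistributeCheck(Employee)", "GetVersion()");
        return rename(signatures, new HashSet<>(Arrays.asList("GetVersion")));
    }

    public static void main(String[] args) {
        demo().forEach((from, to) -> System.out.println(from + " -> " + to));
    }
}
```

In this sketch only updateSalary() needs the name b, because HasMore() already took a for the empty parameter list; the excluded GetVersion() keeps its name.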
String encryption gets around a security problem that exists even in native code: String literals are easy to extract from binaries. For example, running the UNIX strings utility on any binary will generate a list of all ASCII literals in the file. In its most benign form, this list reveals only copyright information and whose libraries are included in the executable. However, if the program accesses databases, strings will reveal all the SQL commands. And if passwords are buried in the module, they are revealed as well.
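To show how little effort this extraction takes, here is a minimal Java sketch of what a strings-style scan does: it collects every run of printable ASCII bytes above a minimum length. It is not the UNIX utility itself, and the sample "binary" with its SQL text and password is invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class StringsSketch {
    // Returns every run of at least minLength printable ASCII bytes.
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            if (b >= 0x20 && b <= 0x7e) {
                run.append((char) b);
            } else {
                if (run.length() >= minLength) {
                    found.add(run.toString());
                }
                run.setLength(0);
            }
        }
        if (run.length() >= minLength) {
            found.add(run.toString());
        }
        return found;
    }

    // Builds a made-up "binary": non-printable bytes around two literals.
    static byte[] sampleBinary() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] sql = "SELECT * FROM payroll".getBytes(StandardCharsets.US_ASCII);
        byte[] pwd = "pwd=s3cret".getBytes(StandardCharsets.US_ASCII);
        out.write(0x00);
        out.write(0x01);
        out.write(sql, 0, sql.length);
        out.write(0x00);
        out.write(0xfe);
        out.write(pwd, 0, pwd.length);
        out.write(0x00);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (String s : extract(sampleBinary(), 4)) {
            System.out.println(s);
        }
    }
}
```

Both the SQL statement and the password come out of the scan untouched, which is exactly the exposure described above.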
With intermediate code, there are additional dangers. By examining the references to a given string, a cracker can figure out where password-protected code begins, and then can patch the file to jump there. To solve the problem of literals as human-readable text, most obfuscators encrypt strings. A small runtime penalty is incurred when the string is accessed, due to the decryption overhead. Interestingly, native code is at a disadvantage here because to achieve the same effect, developers must encrypt and decrypt each string manually, whereas an obfuscator performs this operation automatically.
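The mechanics can be sketched in Java as follows. This is an assumption-laden illustration, not the scheme any particular obfuscator uses: a fixed XOR key stands in for whatever cipher a real tool applies, and XOR with an embedded key only hides text from casual inspection. The class name and key are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class StringEncryptionSketch {
    // Hypothetical key; a real tool would choose and conceal its own.
    private static final byte[] KEY = {0x5a, 0x13, (byte) 0xc7, 0x2e};

    // Build-time step: the bytes stored in the binary in place of the literal.
    static byte[] encrypt(String literal) {
        return xor(literal.getBytes(StandardCharsets.UTF_8));
    }

    // Run-time step: the call an obfuscator would inject wherever the
    // literal is used. This is the source of the small runtime penalty.
    static String decrypt(byte[] stored) {
        return new String(xor(stored), StandardCharsets.UTF_8);
    }

    // XOR is its own inverse, so one routine serves both directions.
    private static byte[] xor(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (byte) (in[i] ^ KEY[i % KEY.length]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] stored = encrypt("SELECT * FROM payroll");
        // The stored bytes are no longer readable text; decrypting restores them.
        System.out.println(decrypt(stored));
    }
}
```

Because the tool rewrites every literal and injects the decrypt call itself, the developer's source code stays unchanged.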
Control-flow obfuscation is a technique designed to mislead disassemblers. It inserts goto statements into the code so that the original sequence of instructions is still performed, but in a roundabout way that makes the logic flow hard to follow. Here is an example.
Disassembled intermediate code before control-flow obfuscation:
// Code Snippet copyright 2000, Microsoft Corp, from WordCount.cs
// sample app
public int CompareTo(Object o) {
    int n = occurrences - ((WordOccurrence)o).occurrences;
    if (n == 0) {
        n = String.Compare(word, ((WordOccurrence)o).word);
    }
    return (n);
}
Same code after control-flow obfuscation:
public virtual int a(object A_0) {
    int local0;
    int local1;
    local0 = this.a - (c) A_0.a;
    if (local0 != 0)
        goto i0;
    goto i1;
    while (true) {
        return local1;
        i0: local1 = local0;
    }
    i1: local0 = System.String.Compare(this.b, (c) A_0.b);
    goto i0;
}
As can be seen, the original test is inverted and followed by two goto statements. At the second goto's destination (i1), the original statement is executed in obfuscated form, then another goto returns control to the original branch in the logic flow. Notice the misleading while() loop, which serves no looping purpose and only wraps the return statement. In this small snippet, after close comparison with the original, it's possible to figure out what's real and what's not. However, on a large routine without the benefit of the source code, these misdirecting interpositions make analysis hugely time-consuming. The idea is to make recovering the original coding intent so demanding that hackers will move on to other, perhaps simpler, challenges.
This particular technique adds small amounts of code to the binaries and so creates some overhead for the obfuscated portions. If this is a problem, apply it only to the routines that need this extra level of protection.
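The effect can be reproduced at source level, with one substitution stated plainly: Java has no goto, so this sketch swaps in a switch-driven dispatch loop (a related technique often called control-flow flattening), where numbered states play the role of the goto labels. The method names and the simplified parameters are assumptions made for the example.

```java
class ControlFlowSketch {
    // Straightforward version, mirroring the CompareTo example.
    static int compareClear(int occA, String wordA, int occB, String wordB) {
        int n = occA - occB;
        if (n == 0) {
            n = wordA.compareTo(wordB);
        }
        return n;
    }

    // Same logic routed through a dispatch loop. Each state corresponds to
    // one label in the obfuscated listing.
    static int compareObscured(int occA, String wordA, int occB, String wordB) {
        int local0 = occA - occB;
        int local1 = 0;
        int state = (local0 != 0) ? 0 : 1;   // inverted test picks the first hop
        while (true) {
            switch (state) {
                case 0:                       // "i0": copy the result
                    local1 = local0;
                    state = 2;
                    break;
                case 1:                       // "i1": the original if-body
                    local0 = wordA.compareTo(wordB);
                    state = 0;
                    break;
                default:                      // exit through the loop
                    return local1;
            }
        }
    }

    public static void main(String[] args) {
        // Both calls print the same value; only the route to it differs.
        System.out.println(compareClear(2, "apple", 2, "pear"));
        System.out.println(compareObscured(2, "apple", 2, "pear"));
    }
}
```

Both methods return identical results for every input, but the second forces a reader to trace the state variable by hand to recover the simple if statement.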
Getting Your Own Obfuscator for .NET
As indicated previously, the upcoming VS.NET 2003 environment contains an obfuscator. It applies only the overload induction transform. For developers who are not using VS.NET, but still want access to this tool, it can be downloaded from Preemptive Solutions. To get the full complement of techniques described here, the complete professional version is available as a paid commercial product for $1495, with discounted pricing for two or more copies. Several other obfuscators for .NET MSIL are listed here.
Additional Resource
An interesting survey of all kinds of code-obfuscation techniques.