Thursday, April 18, 2013

Building your own System.Reflection API from scratch, Part II: Round Tripping, Starting with Byte Zero

Reading and writing assemblies are a vital part of any .NET metaprogramming library.

Duality at the heart of Tao

From the very beginning, I wanted to make sure that Tao was capable of both reading and writing assemblies. After being overwhelmed after reading parts of the ECMA-335 spec, I realized that the only way to effectively create Tao is to allow it to be able to both read and write in small portions that I could incrementally understand. PEVerify was ultimately useless in this case since it only deals with validating entire assemblies rather than just validating the small binary subsets that are contained in each assembly.

The Goal

In an ideal scenario, Tao should be able to read an assembly (or part of an assembly), load it into memory, and then write it back to the disk and the bytes should look exactly the same as the bytes that it used to load it into memory. Given that even the slightest byte read or written in the wrong position could mean the difference between a valid and an invalid assembly, I needed to use an approach that will prevent me from both regressive bugs, and act as a guide when I read or write the incorrect bytes from an assembly. The problem is that since no such tools exist, what’s the best approach for incrementally round tripping an assembly?

Testing the seemingly untestable
Writing your own System.Reflection API from scratch can be very difficult, and there’s absolutely no room for mistakes. That’s why I chose to write it using a specification-driven approach since I didn’t have the time to test everything manually once the code was written. I needed an automated approach to testing since there was so many unwritten features that I still had to test. I also needed Tao to easily evolve as my understanding of the .NET assembly metadata increased over time. The idea was that I would use a hex editor to sample small byte arrays from an assembly and use those sample arrays as input to the code under test. Any particular round tripping feature in Tao would be considered “done” if what was read into memory could be serialized back to disk and match the sample array, byte-for-byte. At the time, there was so much that I didn’t know about the PE Format, so I wanted to start small, and use the tests as a guide to build my knowledge of the format over time. However, if you have a bunch of bytes in memory that I need to verify against a set of bytes on disk, how do you ensure that both sets of bytes match each other?

That’s where Tao’s hash function comes in handy:

These two extension methods form the heart of Tao’s self-verification system. It allowed me to incrementally read, write, and ultimately verify every single byte from a .NET assembly as my understanding of .NET assembly formats grew over time. The only problem I had at that point was that the test feedback was still too coarse. The hash functions allow you to verify a block of data, but how do you make it fail fast and track down the exact byte location if it writes the incorrect byte?

Divide and conquer

In order to effectively test Tao, we’ll need to use a hash function that tracks down an invalid byte at the very point where the byte is written. Each test should be able to immediately fail fast and fail hard if it encounters an invalid byte being read or written to disk. Tao cannot afford to wait until an entire chunk of bytes is written before it can be verified, simply because there’s currently no way to pinpoint the exact location of a byte mismatch if the chunk gets too large. For example, if you want to do a random write of a byte array with two million elements and have it fail at the instant that it writes the first invalid byte, how do you find the exact point where the mismatch occurred with just a hashing function performed on the entire byte array?

As it turns out, if you have two side-by-side arrays of two million bytes each that Tao needs to diff, the simplest way to track down the differences between these two arrays using a hashing function is to eliminate the parts that they have in common. The bytes that will remain are the bytes that differ, and in this case, we only need to find the first byte position where there is a mismatch. One of the most effective ways to eliminate the common parts of these two byte arrays is to recursively halve and compare the two arrays into smaller chunks and compare each successively smaller chunk until the first mismatching byte is found. This approach allows Tao to compare even very large sets of data often using fewer than fifty comparisons, and aside from the recursive function call, the implementation speaks for itself:

Each call to getMismatchPosition recursively halves and compares the streams until the stream comparison size is only a single byte in length. Once the mismatch byte has been found, the comparison ends, and the local function returns the position where the mismatch occurred. Now that we have a way to pinpoint the exact byte mismatch position in any given stream, the next issue is finding a way to immediately fail a test at the very instant that any piece of Tao's code attempts to write an invalid chunk of bytes to a stream. In this case, this is where Tao's TracerStream class is immensely useful:
There's nothing particularly interesting about the StreamDecorator base class other than the fact that it wraps (or decorates) an existing stream so that we don't have to reimplement an entire stream in the TracerStream class if it just intercepts the write calls. What is interesting, however, is that the TracerStream class compares every chunk that is being written (just as it is being written) and immediately fails the write operation if the bytes being written don't match the bytes in the original stream. It's a very simple and effective way to incrementally verify parts of a binary, and it's very useful for verifying .NET assemblies. Now that we have written the basic tools for verifying byte streams in an assembly, the next task is to actually find samples that Tao can parse into memory, write to disk, and verify whether or not it matches the original bytes written into memory. Given this task, exactly what tools do we need to actually grab these byte samples, and start roundtripping these samples in Tao's tests?

Finding binary samples using the right toolset

One of the best tools for analyzing and sampling raw .NET assemblies is a tool called CFF Explorer. When I first started writing Tao, I used CFF Explorer’s hex editor to create binary dumps in C# that I could use as the expected byte arrays in Tao’s unit tests. For example, when I needed to test roundtriping MS-DOS headers with Tao, I used the sample bytes from CFF Explorer to verify that the DosHeaderWriter class was writing the correct bytes to its given output stream:

Closing the loop
In a nutshell, the above test case demonstrates how Tao began, and shows how I designed Tao for roundtripping from the very beginning. The idea was that in order to roundtrip an assembly, we need to have the tools to sample small parts of the assembly so that we can, in turn, write small tests that have expectations and assertions that determine what an assembly should look like in memory, despite the fact that we might yet not understand everything about the .NET assembly format. In the next series of posts, we'll dive into the .NET format with the knowledge that each one of these successive read, write, and roundtripping tests in Tao will ensure that it will always be reading and writing the correct assemblies, without fear of having any bug regressions. Fewer regressions means that it will be easier to work with the .NET assembly format and we can just move on to working on the next part of reading/writing/roundtripping an assembly. It's a simple idea with some big design reprecussions, and hopefully, these series of posts will show you the details of how I created Tao, and help you understand how I did it, as well as understand why I did it.

Coming up in the next post
In the next post in this series, we'll talk about how to build the simplest possible assembly from scratch with ILASM and parse it with Tao. We'll also dive deeper into the CLR metadata format and explore the CLR metadata tables, and I'll briefly talk about how those tables bind an assembly together, and show you how Tao roundtrips those tables. Lastly, I'll also talk about some experimental uses for being able to directly manipulate those tables, and how they might just change the way you look at .NET assembly manipulation. One half of the post will talk about the tables, and the other half will talk about some of the crazy ideas which inspired me to write Tao in the first place, so stay tuned!

Wednesday, April 10, 2013

Building your own System.Reflection API from scratch, Part I: Choosing Nemerle

Sometimes you have to reinvent a better light bulb to understand how it works.


About three years ago, I decided to take a break from working on LinFu. Although I was happy with some of the work that I was doing with Cecil and IL rewriting, I wanted to understand the underlying abstractions that represent your every day .NET assembly. Even though there was some good IL rewriting work being done by other bytecoders like Simon Cropp, they only focus mostly on making small and surgical changes to assemblies, such as implementing the INotifyPropertyChanged interface, or making all public methods on a POCO class virtual.

For me, being able to make small changes to the IL wasn't enough. I wanted to understand how to manipulate .NET assemblies so that I could make some "big" changes, such as:

  • Type cloning
  • Dead type elimination
  • Code migrations
  • Modifying signed .NET BCL assemblies at runtime

Unfortunately, at the time of this post, there are currently no assembly manipulation tools that are capable of doing those things, and even the author of Cecil says that it doesn't support type cloning:
There's no easy way to move one type from a module toanother, as it involves taking a lot of decisions about what to dowith references. Also Mono.Merge is completely dead
Given that there are no tools that are capable of doing what I wanted to do with assemblies, and given that I wanted to master assembly manipulation, I decided to take the next logical step: I was going to build my own reflection API from scratch.

The Tao of Metaprogramming

In this series of small posts, I'll talk about some of the design decisions as well as share some of my design notes that I have as I continue to build Tao, which is my own reflection metaprogramming API.

Choosing Nemerle

When I first starting this project over two years ago, I needed to use a language with built-in support for Design by Contract features since I was essentially going to create a library that builds .NET assemblies from scratch, and since I was starting with nothing, I needed a language that was robust enough to be fault-intolerant enough to tell me where I was failing, and why I was failing. Those days were some challenging days for me because all I had was the CLR Metadata Specification as a reference, and there were no programs at all (including PEVerify) that would tell me what or where my mistakes were being made.

Essentially, I was flying blind, and I relied heavily on Nemerle to be able to explicitly state my assumptions as runtime assertions. For example, here's how you can use Design by Contract macros and Non-nullable type macros in Nemerle to write more reliable code. The [NotNull] macro ensures that NullReferenceExceptions will be all but impossible, and the Design by Contract syntax extensions ensure that the code is always in a valid state, and those extensions are invaluable when you need to build an API that has no room for mistakes. In reading or writing .NET executables, even a single byte in the wrong position can give you an invalid assembly. The world of compiling and decompiling can a very cold and unforgiving world, and I needed the best tools I could find to make sure that my API was doing exactly what I intended it to do.

Needless to say, building your own reflection API from scratch can be a very daunting task. Even today, as I look back on the work that I have already done with Tao, it's hard to imagine being able to get this far without the DbC language features that Nemerle has to offer, and in hindsight, I'm glad that I made that choice.

Coming up in the next post

In the next post, I'll talk about some of the challenges of reading a raw .NET portable executable and turning it into something meaningful that a program can understand. For example, what does the format of a .NET assembly look like? How is it different from say, an unmanaged DLL/EXE file? More importantly, how do you actually write tests that ensure that the bytes that you're reading into memory are exactly the same as the ones loaded from the disk? Those questions were just some of the issues that I had to solve, and in the next post, I'll tell you exactly how I solved them, as well as talk about some of the tools I had to (re)invent in order to solve those problems. Meanwhile, stay tuned!

Ratings by outbrain