Software Design and Go September 14, 2015
Designing a folder comparer the Go way. (I am adding these blurbs in February 2021, I still love Go and will probably use it until I'm no longer writing software)
Go has really flipped the concept of software design upside down for me. It's made it more interesting and fun. I will take you through some iterations of a recursive "merge verifier" that I wrote which was a necessity due to subversion's bad merge outcomes. Where after merging 3-4 branches into a master, the result would be broken. So alternatively, it can just be used to find different files among two identical folder structures. It's not going to tell you missing files, but that's just a shortcoming of it right now, it will be an added feature later.
So to design this application, there's the linear way. Read in a file from the "left directory", and find the same file in the "right directory". Get an MD5 hash of their contents. I know MD5 is "broked" but in what case in my use would they ever collide on two files that are diffent? Anyway, compare the MD5s. If they're not equal report them, otherwise, repeat, recurse, and recurse, and repeat. Repeatedly. Until the folder structures have been fully recursed.
Then there's the Go way. Although I'm still trying to determine what that is, evidenced by the application has gone through 2 redesigns at this point. One when learning Go initially, and now, after learning Go and putting it down for a good few months. However, I have a much better idea of how they're supposed to be designed, and the first time I didn't try to do it with the Go concurrency model.
So this new way... Get a file path and the fact that it exists, and send this along a channel. On a select statement, listen for these newly discovered files. In a go subroutine, get an MD5 of their contents ignoring white space. This is the blocking call, so we want to do this as separately as possible. Once the file contents are hashed, hang onto it until its twin (the same file in the other directory structure) gets its contents hashed. Once you have both, create a FilePair object and set its left content to the left file, and right content to the right file, and send it along another channel. Receive these on another select statement, and just do reporting. Compare the MD5, if they're different, spit out a message with the relative path to the file.
Going with the new way, I've cut the program down from 15 seconds to 7 seconds, on a 1.48 GB directory structure. So that's two directories so nearly 3GB of data.
The program does include a way to exclude directories (like .svn or .git) and specify only files that we care about (.cs, .css, .js, .ascx, .cshtml, etc), so it's not reading the entire 3 GB of data. But to cut it down in HALF is pretty good, just by reordering the steps and grouping IO together separate from other things, allowing it to read the files and still do work on the side.
Here's the meat of the code: