Hack This Site - Programming - Fix a corrupted file

Tue, Jul 13, 21

Welcome back to my walkthrough of hackthissite.org’s CTF missions. I will be going through my thought process of how I solved these missions, and therefore also giving away the solutions. If you came across this to give you hints, watch out for spoilers! Good luck, have fun.

This mission tells us that someone downloaded a file using the windows ftp client, but the file that was downloaded turned up corrupted. We are meant to uncorrupt this file, in which contains an image with a password that will complete the mission.

I will be honest, in the beginning, I had absolutely no idea what was wrong with this file. I am writing this paragraph after I figured out what was wrong (but prior to writing any code). I will explain my thought process into finding out what was wrong. My first step was to print a dump of the corrupted.png.bz2 file using python’s open(, 'rb') function. I did the dump by creating an array of readlines(), and printing them by itterating through the array. The first thing that I noticed right away is that each line was terminated by a CRLF - or \r\n - line ending. This comes up as very important later on, but at the moment I had no clue. From here, I had decided that I needed to learn more about the compressed file format of a bz2 file. I quickly realized that, whilst the BZIP2 code was open-source, there is absolutely 0 documentation on the final product that comes out after the compression. Luckly, I had found a document written by Joe Tsai, which EXTENSIVELY explained the entire process and final output of the BZIP2 algorithm. The document in question can be found here. This is an amazing resource for anyone that wants to learn more about this algorithm and how it works. Unfortunately, even with this holy grail at my hands, it was difficult to figure out what was wrong with this file. Looking at the examples of hexdumps that were in that file, and comparing them against hexdumps of this corrupted file, it was evident that there were many things that don’t exactly ‘line-up’. It was at this point that I decided that the way to complete this mission was to create your own decompressor and find some CRC code that didn’t match with it’s StreamBlock (terms defined in the file linked above). Although, this seemed way to far out of scope of the types of missions that I have tried so far on hackthissite. It was at this point that I had decided to visit the forums and try to see if anyone had posted anything at one point about this mission. I had quickly found 3 posts that were talking about some ‘pairs’ and working with these ‘pairs’. It was at this moment that it clicked to me - they are talking about the CRLF line endings. At this point I had decided to google ‘file corrupted using ftp’, and had come across numerous articles on how the ftp client can corrupt a file. Long story short, Windows uses CRLF (\r\n) to denote a line ending, whereas UNIX systems uses LF (\n). There are two transfer modes within ftp: binary, and ascii. When using the ascii transfer mode, the ftp client will change any line endings that it sees to match the line endings that are used on the machine. Whereas in binary, this does not occur. It was evident that the person that initially downloaded this file using the windows ftp client, had inadvertently added \r in front of some \n, even though those \n were not actually meant to denote a line ending. Although, it is certainly possible that such a pair - \r\n - could have come across naturally in the compression algorithm. Thus, the way forward is to try and bruteforce the locations of where such \r were inserted by the ftp client and reverse them. To confirm my theory, I will compress a random png file that I have on hand and will check against the number of \r\n and \n that there are within this file using grep. What resulted was a single occurence of \r\n and many \r. Thus this is the correct way forward for this mission.

The following code is what I had written in python in order to complete this mission. This was my second iteration of this algorithm, where in the first one, I would iterate back to front in a binary fashion, the algorithm below iterates in a most significant to least significant incrementally with the number of natural CRLFs that could be found in the uncorrupted file. This turned the code to be SIGNIFICANTLY faster in EXPECTED run time, but the same in overall complexity assuming the worst case for both algorithms. I paired my algorithm with the bzip2recover tool to eliminate as many CRLFs within the file as possible since I had noticed that there was always only 1 valid StreamBlock within the compressed file.

The command line to run the script was the following : bzip2recover corrupted.png.bz2 && python3 uncorrupt.py.

import bz2

def extractCorruptedLinesWithCRLF():
    """
    Reads a bz2 file in binary mode and splits it per each '\n'.
    """
    with open("rec00001corrupted.png.bz2", 'rb') as f:
        lines = f.readlines()
    return lines

def buildCheckPNG(lines, numOfNaturalCRLF, builtString = b''):
    """
    For a given number of natural CRLF, permutes, from left to right,
    that number of CRLF's within the file and attempts to uncompress.
    Returns 0 on success, and 1 on failure.
    """
    if(numOfNaturalCRLF == 0):
        # we have placed our needed natural CRLFs so we replace
        # the rest of the string and test it
        tempBuiltString = builtString
        for line in lines:
            tempBuiltString += line.replace(b'\r\n', b'\n')
        try:
            uncompressedRawFile = bz2.decompress(tempBuiltString)
        except (OSError, ValueError):
            return 1
        # we were successful in uncompressing
        with open('uncorrupted.png', 'wb') as f:
            f.write(uncompressedRawFile)
        return 0
    
    elif(len(lines) == 0 and not numOfNaturalCRLF == 0):
        # this is not a valid configuration for the given num
        # of naturalCRLF so fail right away
        return 1

    # leave the CRLF here
    if(lines[0].endswith(b'\r\n')):
        tempNumOfNaturalCRLF = numOfNaturalCRLF - 1
        tempBuiltString = builtString + lines[0]
        res = buildCheckPNG(lines[1:], tempNumOfNaturalCRLF, tempBuiltString)
        if (res == 0):
            return res
    # remove the CRLF / there was no CRLF here (should only happen in 
    # the last two lines... which we shouldn't itterate through
    # anyways)
    tempBuiltString = builtString + lines[0].replace(b'\r\n', b'\n')
    res = buildCheckPNG(lines[1:], numOfNaturalCRLF, tempBuiltString)
    if (res == 0):
        return res
    
    return 1

def permuteCRLFEndings(lines):
    """
    Initiates buildCheckPNG for all possible amounts of natural CRLFs 
    withing the uncorrupted PNG file.
    """
    # at most, there could be len(lines) natural CRLFs -2 (as last two lines
    # are always \n \n)
    for numOfNaturalCRLF in range(0, len(lines) - 2):
        res = buildCheckPNG(lines, numOfNaturalCRLF)
        if (res == 0):
            return res
    return 1


if __name__ == "__main__":
    
    lines = extractCorruptedLinesWithCRLF()
    res = permuteCRLFEndings(lines)
    if (res == 0):
        print("Success")
    else:
        print("Failure")