Recovering Deleted Files

13 min readDec 22, 2021

An introduction to Digital Forensics

Digital Forensics covers the process of investigating, analyzing, recovering and preserving data across various platforms and digital devices. One major subfield of Digital Forensics is data recovery. Given a storage medium which contained data that has been deleted we’ll look into going about recovering information to retrieve the original document / file. In this tutorial we’ll cover the steps to create a basis file recovery script that can restore a deleted image from a usb drive.

Preface

Before diving into the tutorial you’ll need the following resources.

A USB drive (or external hard disk) Note: Choose a spare one that you’re comfortable formatting (erasing all data).
A linux laptop / pc (Any flavor but the instructions below were tested on Ubuntu 20.04)
Your preferred editor / IDE
GCC / Clang / build-essential tools (Make sure you can compile C code)

By the end of this tutorial we will have covered the process of how to erase a usb drive by overwriting all bits with zeroes, create a new Ext3 file system, add a single image file to the drive (bitmap), delete the file and then create a script to recover it.

Linux Background

To successfully understand how to recover a file, its important to first understand how a file is stored on disk. Files are stored differently on linux, MacOS and windows since they utilize different file systems. For this tutorial we will cover how data is stored and can be recovered on a default Ext3 Linux file system. For more information on the latest storage techniques refer here.

At a high level, a linux filesystem is broken up into discrete uniformly sized sections known as blocks. These blocks can directly store data, metadata about the files, and metadata about the filesystem. Blocks are usually 4096 bytes but can be configured to be anywhere from 1024 to 8192 bytes in an Ext3 partition. Metadata about the filesystem such as its basis configuration, the size of blocks and more can be found in a specially designated block known as the superblock. The superblock is always found at the start of a disk partition and is usually replicated throughout the partition for redundancy.

Each file in a file system is represesnted by a data structure known as an inode. A single block can contain multiple inodes. An inode doesn’t contain the data for a file itself but rather contains pointers to the blocks that have the actual file data among other metadata. To store these pointers an inode has allocated within its data structure an array of 15 block addresses. As you might notice, 15 blocks might not contain sufficient memory (15 * 4096 = ~61kb) to store most files. Therefore, only the first 12 addresses point directly to data blocks that contain file data. The remaining 3 addresses point to blocks known as indirection blocks instead of data blocks. The first indirection block (13th address pointer in the file inode) contains addresses to the next 1024 data blocks (4096 / 4 byte addresses = 1024 pointers). However, even with one indirection block not enough memory might be addressable to store larger files. Thus, we have the second level indirection block (14th address pointer in the file inode). Each address in this block points to a first level indirection block which then contains addresses to the data blocks. Similarly, the last indirection block in a file inode (15th address pointer) is a third level indirection block where each address points to a second level indirection block. Thus, through this system of multi level indirection, a single inode can address very large amounts of memory. The first 12 addresses can point to ~61kb of data blocks, a first level indirection block can address 4mb of memory, and a second level indirection block can further address 4gb of memory in this manner.

Example of where data block addresses 1 through 10,000 would be stored for a hypothetical file. Note: None of the blocks shown in the diagram are data blocks. The leftmost data structure is the inode and the remaining 4 blocks are indirection blocks. If the file was larger then redirect3 in the diagram would have additional pointers to more first level indirection blocks and eventually as the file size grows a third level indirection block would be introduced as well.

This structure of storing a file creates the largest challenge in recovering a file. Since all the data blocks are not stored in a contiguous fashion, they need to be recovered in chunks and stitched together to recover the original file. When a file is present on a file system this process is trivial since the inode stores the addresses to all the directly addressed data blocks and the indirection blocks. However, when a file is deleted, the addresses in the inode are zeroed out and only information in the data blocks is available. Without knowing which blocks contain the data it becomes a game of guess and check along with applying some smart heuristics to shorten the list of potential data blocks that might belong to the original file.

The tutorial below reduces much of this ambiguity by controlling several factors like having only 1 file on the partition, knowing the format of the file, controlling thee size of the file to only have first level indirection and more to simplify the base file recovery process.

Step 1: Setup

The first step is to format a disk and overwrite it with zeroes. This will setup the foundation to create a new Ext3 partition. Note: All of these commands require running them as the root user or with sudo privelages.

Grab your spare USB & attach it to a linux device (all steps tested on Ubuntu 20.04)
Open the Disk Utility application and format the disk. Make sure to select the option to overwrite with zeroes.
Depending on the size of your drive this can take anywhere from a few minutes (512mb drive) to over an hour (16gb drive)

Now we can create a new partition on the freshly formatted USB drive.

Open the terminal and type fdisk -l
This will display the list of all attached drives. Find your USB here (can verify by checking the disk size) and copy the name (usually /dev/<something-else>)
Run fdisk /dev/<something-else> and then select n to create a new partition. Leave everything in its default settings and finish setting up the partition.
Enter w to save your changes and then exit the fdisk tool.

The last step is to setup this partition as an Ext3 filesystem

Run fdisk -l again and find your device. You’ll find a new sub-entry listed as /dev/<something-elseX> where X is a number. This represents the partition.
Run mkfs -t ext3 /dev/<something-elseX>. Make sure to include the number for X.
Leave everything in its default settings and finish setting up the Ext3 filesystem. The process of writing inodes and setting everything up can take a few minutes.

Now your USB drive has been successfully setup with a partition that uses Ext3. We can now proceed to adding a file, deleting it and recovering it. Additional information if facing challenges can be found here.

Step 2: Choosing the right file

For this tutorial we will be choosing a file beforehand and analyze its properties to improve the algorithm which makes it easier to recover the file. The steps below recover the following bitmap image, feel free to download the same one or use a different bitmap image.

Use the file linked here instead of downloading the image displayed on medium. Bitmap images are not supported on medium.

For this exercise we’re choosing a bitmap file because its the simplest image format to recover. Here’s a few pointers about bitmap images that are important:

All bitmap images start with the exact same header → 42 4D. These two bytes can be seen at the start of any bitmap image when viewing its raw hex. You can easily test this using popular tools like hexedit, xxd or hexfiend. We’ll use this information to find out where the start of the file is on disk.
Our file is more than a few kilobytes and less than 4mb. This is important because the simple script we will design will only search for the first 12 data blocks and the first level indirection block (which if you recall can address a maximum of 4mb of data blocks). Searching for second and third level indirection blocks are beyond the scope of this tutorial.
Bitmap files store the size of the file in the first few bytes as metadata. This tells an image viewer how many pixels to render to the screen. This is helpful because in the event our script recovers more or less bytes than the original image (either because only a partial recovery was possible or it recovered data blocks padded with zeroes at the end), then we will still be able to view the partially recovered bitmap. This also means that there’s no special information / metadata at the tail end of a bitmap which simplifies our search algorithm.

Copy this file over to the USB Drive and then unmount the drive. Your drive should now contain a single file and nothing else. Unmounting the drive is required to ensure that the OS does not perform any random reads / writes. We want to ensure that the only data blocks being used on the drive are due to the image and not for any other miscellaneous files.

Step 3: Basic file IO & Superblock verification

The code for this tutorial is written in C but can be done in any language. To properly understand how it works a review of basic file IO using system functions like fread, fwrite, fseek and rewind would be useful. To start off copy the following snippet into a superblock.c file. Note: Replace the /dev/<something-elseX with the name of your drive along with the partition number.

superblock.c

In this code snippet, we are reading the first block of the partition (which is the superblock). The most important part can be summarized in this one line:

int count = fread(&buffer, sizeof(char), 4096, fp);

Here, we’re reading 4096 pieces of data (each one the size of a character → one byte) into a char array. Its important to note here that we’re not attempting to read back 4096 characters but rather the hex bytes that are stored in the amount of memory consumed by 4096 characters (which is 4096 bytes). In the for loop below that we’re printing each byte with a space in between and at 32 byte intervals we’re printing a row number to keep track . Compiling this code with gcc superblock.c and running it with sudo privileges as sudo ./a.out should generate approximately the following output.

partial console output

Your console output may not look identical to this but to ensure that this is the superblock we can look for what’s known as the Magic Signature. Every superblock will have the hex value EF 53 stored at offset 1080. The superblock output printed must have this, if not return to the earlier steps and attempt to resolve any problems.

Note: You may see the output printed in little endian format where the bytes printed are in reverse order 53 EF as shown in the example console output above.

Step 4: Finding the starting block

Now that we’ve verified that we are able to read the drive and its well formatted we can begin the search for the data blocks belonging to out bitmap image. The first step in this process is very straightforward. We will scan the entire disk from start to end, one block at a time, till we hit a block which has 42 4D headers in the first two bytes that indicate the start of a bitmap image.

main.c — findStartingBlock()

Here we’ve wrapped the same logic for reading a single block within a loop and are now checking whether the start of each block contains the header signature we expect from our bitmap image. If found then we return the index of the matching block else -1. Here we’re searching only the first 20,000 blocks on our drive. Given that we’ve only added a single file it’s likely that it’ll be written somewhere at the start of the disk. However, its possible that it can be later in the drive and if you don’t find the start of the file in the first 20,000 blocks then increase the number to scan a larger section of the partition. It’s possible to automate the process of reading exactly from the start to end of the partition by checking the superblock for the total number of blocks in the partition. Once you find the index for the starting block you can use a hex editor to jump to that block and verify that it does have the first 4096 bytes of the file.

Step 5: Finding the indirection block

Unlike the bitmap starting block that had pre-defined attributes, the indirection block doesn’t have a fixed value that we can look for. Instead we need to rely on a known heuristic of how the OS saves files. Typically, the OS will attempt to store chunks of file bytes in consecutive data blocks. This means that the indirection block will likely contain addresses that are incrementally increasing by one. For example the first few addresses stored in the indirection block might be 0000C000, 0000C001, 0000C002, etc. However, we’re not guaranteed that all 1024 addresses in an indirection block will all be exactly in ascending order by one because there might be breaks around data blocks used for other purposes. Therefore, we can limit our search to just check that the first 3 addresses are in ascending order. While not a perfect method, it’ll suffice for this tutorial since we only expect to find a single block with this property.

main.c — findIndirectBlocks()

The structure of this function is pretty similar to finding the starting block. We still scan the disk one block at a time and check whether it has the properties of an indirect block. If it does then we save that index number to a global array for use later. The global array here is initialized to have 256 entries as an upper bound since we cannot know in advance how many indirection blocks we might find on the drive. We should ideally expect to only find a single one since we only have one file saved and its size is under 4mb.

Step 6: Writing blocks to file

To write blocks to disk we can create a simple function that will seek to an offset given a block index and copy the data from there to a different file.

main.c — copyBlockToFile()

First we’ll need to copy over 12 consecutive blocks from the starting block. Unlike with the indirect blocks there’s no guarantee that the first 12 data blocks will be present in consecutive order so its a shot in the dark. We can copy them over by calling the above function with the incrementally increasing block numbers 12 times.

main.c

For copying over the data blocks addresses by the indirection block we will need to read the indirection block andfseek to each address within it. For each of those addresses we can call the copyBlockToFile() function. Since the file may not be large enough to use all the addressable memory within an indirection block, the last few addresses will likely be zeroes. For a bitmap image appending these zeroes will not cause any issues since the file size metadata stored in the header will prevent an image viewer from rendering those extra padded zeroes. However, this can cause issues for other file formats and it’s required for those zeroes to be trimmed out (doing that here is beyond the scope of this tutorial).

main.c — copyIndirectBlocksToFile()

In this manner the data blocks from various parts can all be put together and stitched into a recovery file. During this process observe that we are not calling rewind() on the recovery file descriptor. This is because we want to append the data blocks instead of overwriting. We do call rewind() with the drive file descriptor since we want to fseek from the beginning of the drive each time we want to copy over a data block.

Step 7: Tying it all in together

Here’s the complete script with all the functions put together.

This script will recover a bitmap file and store it in a file called recovery.bmp.

Limitations

This tutorial was designed to provide the bare minimum required to get a script up and going for file recovery. Besides the scope limitations mentioned throughout the tutorial there are a few other things to watch out for along with the fixes if you do run into them.

In the event that there’s a discontinuity in the first 12 blocks then only a partial recovery will be possible. The pixels rendered from the data in blocks 2–12 (the 11 consecutive data blocks after the starting block) will be corrupted and show random colors. This is impossible to fix and will require you to add a second image to the drive and try to recover that one instead.
In the event that there’s a discontinuity in the first 3 blocks of the indirection block then identifying it will fail. This can be fixed by changing the number of addresses that are checked to be in incrementally ascending order (finding two consecutive addresses which are just 8 byte numbers might happen by chance but finding multiple consecutive addresses is unlikely since there might be a break).
This script will likely pad the end of the file with zeroes since there’s no check for the tail end of the file. This is a no-op with a bitmap image but will cause issues for other file formats. Fixing this just requires trimming away excess zeroes at the end of the file or looking for a tail marker when working with other file formats.
If your file size is greater than 4mb then there will be multiple indirection blocks. Currently, this script will just append the data blocks pointed by them in the order that they were found. If the second indirection block appears before the first then the recovery will fail. This can be fixed by adding logic to compare the first and last addresses in the indirection blocks and then sorting them in ascending order.
This script only searches through the first 20,000 blocks. In the event that the file was saved later in the partition the recovery will fail. Fixing this requires increasing the number of blocks scanned in the partition with the only drawback being an increased runtime. Check the findStartingBlock() and findIndirectBlocks() functions to change this parameter.

Final Notes

I hope that in going through this tutorial you’ve been able to gain a deeper insight into the field of digital forensics. Data recovery can be very complex when dealing with a standard drive full of thousands of files, bytes scattered across the partition and corrupted sections. This tutorial was created to give you a glimpse into how the process behind digital data recovery and analysis is carried out.

This file recovery completed in this tutorial is an adaptation of a semester long course project in CS 4398 — Digital Forensics at the University of Texas at Dallas.