The texture loading code for G-Engine was written early in development, when I was eager to see some on-screen graphics to demonstrate tangible progress on the project. And then that code stayed relatively untouched for years. However, a recent GitHub issue highlighted a bottleneck - installing a mod containing very large textures caused scene load times to skyrocket!
In this post, I’ll explain how this problem was investigated and fixed. It may be a helpful read if you want to learn more about loading image data to be used by a graphics API such as OpenGL, or if you want to see how naive texture loading code can be improved.
What’s the Problem?
G-Engine is built to load and parse the data files from Gabriel Knight 3, so its texture loading code is geared towards the specific data set for that game. All image files shipped with the game are BMP format. The vast majority use a proprietary compression technique, with a few uncompressed BMP files mixed in as well. Of the uncompressed BMP files, there’s a mix of 24-bit and 8-bit (palettized) images.
The BMP image format isn’t too complex, and the C++ code to load BMP image data is pretty straightforward. When tested against Gabriel Knight 3’s image assets, loading was quick and the rendered output was correct. All seemed well, so I didn’t spend more time optimizing it.
However, a fan created a hi-res texture pack mod which swaps the original BMP files for ones that are much higher resolution. These replacement BMP files do not use the proprietary compression technique mentioned earlier - they’re just normal 24-bit BMP images. When the game loads these files, it results in an atypical use case for the code:
- Rather than mostly loading compressed BMP files, the balance shifts to mostly loading normal 24-bit BMP files.
- Rather than loading relatively small BMP files, the game is loading very large BMP files. The unmodded game rarely uses textures larger than 256x256. The mod contains files that are 512x512, 1024x1024, or even 2048x2048 and beyond!
When using the hi-res texture pack, each scene transition in G-Engine took significantly longer - sometimes up to 10 seconds! The original game’s engine was able to load these large textures very quickly in comparison - this signaled that the problem wasn’t the large textures, it was my texture loading code!
Identifying the Cause
Clearly something about reading in textures and sending them to the GPU wasn’t scaling. This is essentially a three-step process:
- The image data initially exists on the disk. First, we load it into a byte buffer with a single call to `std::ifstream::read`.
- Once in memory, the byte buffer is interpreted into a `Texture` class instance. This class acts as the single runtime, in-memory representation of a texture in the engine. One of the main things done here is reading the pixel data from the byte buffer into a dedicated `mPixels` array.
- The pixel data is sent to the graphics system so it can be used for rendering. In OpenGL, this is done with a single call to `glTexImage2D`.
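For reference, step 1 can be sketched as a single bulk read. This is a minimal, hypothetical example - the function name and file handling are my own, not G-Engine’s actual loader:

```cpp
#include <cstdint>
#include <fstream>
#include <vector>

// Load an entire file from disk into a byte buffer with one read call.
// Hypothetical sketch - G-Engine's actual asset loader differs.
std::vector<uint8_t> LoadFileBytes(const char* path)
{
    // Open at the end to learn the file size, then seek back to the start.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    std::streamsize size = in.tellg();
    in.seekg(0, std::ios::beg);

    // One contiguous read into the buffer.
    std::vector<uint8_t> buffer(static_cast<size_t>(size));
    in.read(reinterpret_cast<char*>(buffer.data()), size);
    return buffer;
}
```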
Steps 1 and 3 already seemed about as efficient as they could be - just one API call to copy a contiguous buffer of data. So I focused on bottlenecks that might exist in step 2.
The textures in this mod were all standard 24-bit BMP images, so I focused on the code that parses BMP data into a `Texture` instance. The code that reads the data in pixel-by-pixel seemed suspicious:
```cpp
// Read in BMP file data pixel-by-pixel.
int rowSize = CalculateBmpRowSize(bitsPerPixel, mWidth);
for(int y = mHeight - 1; y >= 0; --y)
{
    int bytesRead = 0;
    for(uint32_t x = 0; x < mWidth; ++x)
    {
        // Calculate index into pixels array.
        int index = (y * mWidth + x) * mBytesPerPixel;

        // Pixel data in the BMP file is BGR.
        mPixels[index + 2] = reader.ReadByte(); // Blue
        mPixels[index + 1] = reader.ReadByte(); // Green
        mPixels[index] = reader.ReadByte();     // Red
        bytesRead += 3;

        // BI_RGB format doesn't save any alpha, even if 32 bits per pixel.
        // We'll use a placeholder of 255 (fully opaque).
        mPixels[index + 3] = 255; // Alpha
    }

    // Skip padding that may be present to ensure 4-byte alignment.
    if(bytesRead < rowSize)
    {
        reader.Skip(rowSize - bytesRead);
    }
}
```
This seems like a bit of code that doesn’t scale particularly well. It reads one pixel at a time, and one byte of each pixel at a time. Assuming a square image with width/height of `n`, this code performs `O(n^2)` individual read calls, in big-O notation.
One of my go-to techniques to isolate problem code is via deduction: if I comment out a block of code, and the problem disappears, the problem is likely caused by that code block. In this case, commenting out the code obviously causes all textures in the game to look incorrect, but it DOES fix the long loading time issue. So, this is part of the problem, if not the whole problem!
How to Fix This?
You might look at the above code and think: can’t we just read the pixel data as a single block of memory, rather than reading one byte at a time? We can! This code reads in the whole image in one fell swoop, and it’s a lot faster:
```cpp
// Just read the pixel data in one giant block!
mPixels = new uint8_t[mWidth * mHeight * mBytesPerPixel];
reader.Read(mPixels, mWidth * mHeight * mBytesPerPixel);
```
But before we celebrate, let’s realize we’re overlooking some important details:

- The inefficient code has logic to skip padding bytes that may be present. The efficient approach includes those padding bytes in the pixel data.
- The inefficient code changes the pixel component order as each pixel is read in. The BMP format stores pixel data in blue/green/red (BGR) order, but the code in the `Texture` class assumes red/green/blue/alpha (RGBA) order. The efficient code does not change the pixel component order.
- The inefficient code flips the pixel array as it reads in the data. BMP files store pixel data from the bottom-left corner, but this `Texture` class stores pixel data from the top-left corner. The efficient code does not perform this flip.
To use the efficient code, we need to find a way to deal with each of these problems - either by solving them directly or by avoiding them entirely. Let’s consider how we can overcome these challenges and achieve blazing fast loading of large BMP files!
Dealing with Padding Bytes
BMP files store pixel data in rows. If an image has a height of 512, it means there are 512 rows of pixel data in the image. Each row of pixel data must be 4-byte aligned. If the byte size of a row is not divisible by 4, padding bytes are added.
A simple (and perhaps preferred) way to solve this problem is to leave the padding bytes in place. Graphics APIs can account for these padding bytes, and OpenGL actually uses 4-byte alignment by default. DirectX seems to have a strong preference for 4-byte alignment, and it’s even required in newer versions.
For the current task though, I decided to remove padding bytes. Why? Well, it’s a combination of things:
- When I first wrote the `Texture` class, I didn’t realize you could just leave the padding bytes in place and pass them to the GPU - they seemed like an archaic aspect of BMP files, as opposed to a common convention supported by the graphics APIs. So, chalk this up to not knowing what I was doing.
- Over time, many of the algorithms in the `Texture` class for manipulating pixel data were written without accounting for padding bytes. This could be refactored (and perhaps I will do so in the future), but to solve the problem at hand, I’d like to avoid making that change right now.
If we don’t want padding bytes, their presence throws a wrench in our efficient block-copy approach. Fortunately though, we can calculate whether padding bytes will be present. If the number of bytes needed to store a row is divisible by 4, we won’t have padding!
For example, consider a 32-bit image, which has 4 bytes per pixel. Such an image is ALWAYS 4-byte aligned, so no padding will ever be needed. On the other hand, a 24-bit (3 bytes per pixel) BMP image may need padding. If the width is 1024, the row byte count (3 * 1024 = 3072) is divisible by 4. If the width were 1025, the row byte count (3075) would no longer be divisible by 4, requiring padding bytes.
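This row-size arithmetic is easy to capture in a couple of small helpers. Here’s a hypothetical sketch mirroring the `CalculateBmpRowSize` call from the earlier snippet (the padding helper is my own addition):

```cpp
// Hypothetical helper mirroring CalculateBmpRowSize from the earlier snippet.
// BMP rows are padded so that each row's byte size is a multiple of 4.
int CalculateBmpRowSize(int bitsPerPixel, int width)
{
    // Round the row's bit count up to the nearest multiple of 32 bits (4 bytes).
    return ((bitsPerPixel * width + 31) / 32) * 4;
}

// How many padding bytes appear at the end of each row.
int CalculateBmpRowPadding(int bitsPerPixel, int width)
{
    int usedBytes = (bitsPerPixel / 8) * width;
    return CalculateBmpRowSize(bitsPerPixel, width) - usedBytes;
}
```

For a 24-bit image, a width of 1024 yields a 3072-byte row with zero padding, while a width of 1025 yields a 3076-byte row with one padding byte.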
All the BMPs in GK3 are 24-bit (3 bytes per pixel), so 4-byte alignment is not guaranteed. Fortunately, 24-bit BMP images are always aligned if the width is a power-of-two greater than or equal to four. Widths of 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048 (and so on) give us no padding bytes to deal with.
And wouldn’t you know it: almost all of GK3’s BMP files use power-of-two widths, so they can take the optimized “block copy” approach. Fortunately, the hi-res mod follows suit, using power-of-two widths as well.
What if we encounter an image with padding bytes? Our parsing code needs to be able to handle both scenarios:
```cpp
if(((mBytesPerPixel * mWidth) % 4) == 0)
{
    // No padding bytes - read the whole image in one block.
    reader.Read(mPixels, mWidth * mHeight * mBytesPerPixel);
}
else
{
    // Use a less efficient approach that reads line-by-line.
}
```
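The less efficient fallback can still be much better than byte-by-byte reads: read each row as one block, then skip the padding. Here’s a hypothetical sketch using a raw pointer cursor in place of the reader object:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical row-by-row fallback for padded images: copy each row as one
// block, then skip the padding bytes at the end of the row.
// src points at padded BMP pixel data; dst receives tightly packed rows.
void ReadPaddedRows(const uint8_t* src, uint8_t* dst,
                    int width, int height, int bytesPerPixel)
{
    size_t rowBytes = static_cast<size_t>(width) * bytesPerPixel;
    size_t padding = (4 - (rowBytes % 4)) % 4; // bytes to reach 4-byte alignment
    for(int y = 0; y < height; ++y)
    {
        std::memcpy(dst + y * rowBytes, src, rowBytes); // one block per row
        src += rowBytes + padding; // skip padding at the end of the row
    }
}
```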
There’s one final thing to do, which is to ensure the graphics API is configured to allow pixel data that is not 4-byte aligned. As mentioned earlier, OpenGL assumes 4-byte alignment by default, but you can call `glPixelStorei(GL_UNPACK_ALIGNMENT, 1)`, which then allows 1-byte alignment. I’m not sure of the equivalent in DirectX - it may actually be necessary to keep the padding bytes in that case.
Dealing with Pixel Formats
For simplicity, the `Texture` class (as originally written) always stored pixel data in RGBA order. Since BMP files store pixel data in BGR order, one function of the inefficient code was to reorder the pixel components. How can we do this when we use an efficient block copy?
In this case, I had to rethink my assumption of always storing pixel data internally in RGBA format. Doing so simplifies some pixel manipulation code, but it now seems to be negatively impacting performance.
So yes, the solution I employed was to drop the requirement that the internal pixels array always be in RGBA order. We store the pixel data as a byte buffer, but we also store the format of that byte buffer separately.
```cpp
enum class Format : uint8_t
{
    // 24 bpp
    BGR,
    RGB,

    // 32 bpp
    BGRA,
    RGBA
};

// The number of bytes per pixel for this image.
uint8_t mBytesPerPixel = 4;

// The format of the pixel data. This matches exactly the data stored in the pixels array.
Format mFormat = Format::RGBA;

// Pixel data, from the top-left corner of the image.
uint8_t* mPixels = nullptr;
```
The effect of this change is that the data in the `mPixels` array can’t always be treated as RGBA data. When a `Texture` is created from a BMP image, the format is going to be `Format::BGR`. The PNG decoder generates RGBA pixels, so a `Texture` created from a PNG image will have the format `Format::RGBA`.
Functions that deal with texture manipulation become more complex, since the pixel data may be either 3 or 4 bytes per pixel, and the pixel components may be in different orders. The `Texture::SetColor` and `Texture::GetColor` functions are more complex because the exact code required depends on the format. Functions like `Texture::FlipVertically` now need to deal with various bytes-per-pixel counts. A function like `Texture::BlendPixels` (which blends one texture’s pixels into another texture’s pixels) is now limited to working only when both texture formats are the same. If a function like `Texture::ApplyAlphaMask` is called on a non-RGBA image, the image must first be converted to RGBA format. The functions to convert between formats can themselves become fairly complex.
Finally, when creating textures in OpenGL (or other graphics systems), you can no longer assume RGBA format. When creating a texture in the graphics system, you must specify the correct format. In OpenGL, when creating the texture with `glTexImage2D`, this means the format argument is sometimes GL_BGR, sometimes GL_RGBA, sometimes GL_BGRA, and so on.
Dealing with Pixel Order
As mentioned earlier, BMP files store pixel data starting at the bottom-left corner and going up. But the `Texture` class assumes pixel data is stored starting at the top-left corner and going down. Clearly this conflict ruins the efficient block-copy plan. Or does it?
One possible way to solve this problem is to remove the requirement that `mPixels` must store data from the top-left corner - could we store the “texture origin” as a separate variable? Unfortunately, I think this approach would be too complex and error-prone: pixel data sent to the graphics system would be inconsistently ordered, but all UVs would stay the same. The result would be chaos. A system where UVs must be flipped depending on which texture is being used, while possible, seems complicated and convoluted.
Another idea is to simply change the `Texture` class’s assumption of starting at the top-left corner and instead align with the BMP format. This could work, but it would conflict with other image formats that store from the top-left. For example, PNG files store pixel data from the top-left - since we support both BMP and PNG imports, we can’t satisfy everyone!
The solution I actually used seems non-ideal, but works OK in practice: I read in the pixel data using the efficient block copy, and then vertically flip the image:
```cpp
reader.Read(mPixels, mWidth * mHeight * mBytesPerPixel);
FlipVertically();
```
The initial reaction might be that this is terribly inefficient. But in practice, it works well, even with larger image files. A vertical flip only needs to iterate over half the image’s height, and each row swap can be done quite efficiently. I see some opportunities to optimize the `FlipVertically` function further - but even without those optimizations, it performs quite well.
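For reference, a row-swapping vertical flip might look like this hypothetical sketch - only height/2 iterations, moving a whole row per `memcpy` (the signature is my own; the actual member function takes no arguments):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of a vertical flip: swap rows top-to-bottom using a
// temporary row buffer. Only height/2 iterations are needed, and each swap
// moves an entire row at a time.
void FlipVertically(uint8_t* pixels, int width, int height, int bytesPerPixel)
{
    size_t rowSize = static_cast<size_t>(width) * bytesPerPixel;
    std::vector<uint8_t> tempRow(rowSize);
    for(int y = 0; y < height / 2; ++y)
    {
        uint8_t* topRow = pixels + y * rowSize;
        uint8_t* bottomRow = pixels + (height - 1 - y) * rowSize;
        std::memcpy(tempRow.data(), topRow, rowSize);
        std::memcpy(topRow, bottomRow, rowSize);
        std::memcpy(bottomRow, tempRow.data(), rowSize);
    }
}
```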
Results
Qualitatively, the optimized code runs a lot faster. Whereas the hi-res mod previously had a noticeable delay when loading into a new scene, it now appears to load as quickly as the unmodded game. Even if it is slower (since there is certainly more work to do when more pixels are present), it isn’t noticeably slower.
Using a stopwatch class, I also measured the actual time difference between the normal and modded game with the inefficient and optimized versions of the code. Here are some readings from my desktop (AMD Ryzen 7 9700X) in Release mode. These readings measure the time required to load the first scene of the game when hitting the “Play” button from the title screen:
| Test | Time |
|---|---|
| Unmodded, Unoptimized | 0.187 seconds |
| Unmodded, Optimized | 0.187 seconds |
| Modded, Unoptimized | 2.986 seconds |
| Modded, Optimized | 0.253 seconds |
Conclusion
At first glance, the lesson from this investigation is something most programmers already know: reading/copying data in a large block is more efficient than reading/copying data byte by byte. The compiler is able to greatly optimize straightforward data reads/copies. Prefer such reads/copies when possible.
I think there’s also a lesson here about premature optimization. Sure, the unoptimized code presented in this article needed to be improved, but it was “good enough” for several years. It was appropriate for the data set it was authored for. But when the data set changed, the code needed to be optimized.
And sometimes you think a solution won’t work because it’ll be “too slow”, but then you try it, and it actually works fine for the use case. The idea of “vertically flipping” the image data after loading is an example of this. A brute force approach can be effective, so don’t discount it. At least try it, use it as a starting point, and you can optimize from there!
Finally, I think investigating and solving this problem gave me some insight into why game engines handle textures the way they do, and why it’s pretty reasonable to support storing pixel data in multiple formats instead of “forcing” the internal format to always be RGBA. Though these changes make the `Texture` class more complex, they also clearly improve performance, and I think they move the class closer to being a fleshed-out, general-purpose implementation.