I am by no means new to open source software: my personal laptop and PCs have all been running GNU/Linux since around 2006, and I have used plenty of open source software for various tasks. However, until recently I had never gone further than using it.

Prologue

Some time around late 2018 I decided to build a personal computer tailored to my needs: great processing power, memory speed, and a decent SSD. I picked the rest of the parts around these 3 components.

As for a graphics card, even a very low-end model was sufficient for my needs. Therefore I chose one that only cost $36.00.

This system worked great for me until I became fascinated with computer graphics and working with 3D engines. My very low-end graphics card was no longer up to the task. After some research, I decided on a modern graphics card with good performance and a good price: the AMD Radeon RX 5600 XT.

Another important criterion for me was to choose hardware that is sufficiently supported by the Linux kernel, either out of the box or with minimal configuration. Based on what I read online, the Radeon RX 5600 XT seemed to be well supported by amdgpu, AMD's open source graphics driver for the Linux kernel.

Long story short, I bought this graphics card, installed it, and noticed that my operating system was sluggish when rendering graphics.

Exploring The Issue

The very first thing I did was to look at the output of dmesg, which prints information from the kernel ring buffer.
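Since the ring buffer holds messages from every subsystem, it helps to narrow the output down. The following is a minimal sketch of one way to do that, not necessarily the exact command used at the time: the helper name amdgpu_firmware_lines is hypothetical, and it simply keeps lines that mention both amdgpu and firmware.

```shell
# Filter kernel log text on stdin down to amdgpu lines that mention firmware.
# The function name is made up for this example.
amdgpu_firmware_lines() {
    grep -i 'amdgpu' | grep -i 'firmware'
}

# Typical use (reading the ring buffer may require root on some systems):
#   dmesg | amdgpu_firmware_lines
```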

Based on the output I saw in the ring buffer, it was evident that the amdgpu kernel module had failed to find a firmware binary file named navi10_gpu_info.bin. Having researched and shopped for my graphics card, I knew that navi10 was its code name.

This was the root cause of my problem. Now the question was: how could I fix it?

First Attempt: Locate The Missing Firmware In Debian Repositories

The easiest possible solution in a situation like this is to find a package in my GNU/Linux distribution that ships the missing firmware binaries, and install it.

I was able to find a package named firmware-linux-nonfree, which contains the binary firmware for various drivers in the Linux kernel. This package is in the non-free category because it contains proprietary firmware binaries that are not open source.

At least at the time I tackled this issue, the version of this package available in the official Debian repositories (even in the unstable repository) did not contain the navi10 firmware files that I was desperately looking for.

Second Attempt: Where Are The navi10 Firmware Files?

After a little bit of searching, I was able to find the repository of firmware blobs over at https://git.kernel.org. It looked very promising: I could see a whole bunch of navi10 binary files in there.

Then I did the following:

  • Copy the address of the first navi10 file I could see in the repository by right-clicking on the file name, e.g. https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu/navi10_ce.bin
  • Write a quick Bash script to download all of the navi10 files using wget, based off of the copied link and by swapping the file name.
  • The Bash script also moves the downloaded navi10_*.bin files to the /lib/firmware/amdgpu/ directory.
  • Run update-initramfs -u to re-generate the initramfs file. This is an archive that the kernel loads into RAM when it is booting up.
  • Reboot.

Some of you may already know what mistake I made as soon as you read the first bullet point. However, I was completely blind to the problem at this point.

Soon after the reboot, I realized that the graphics performance was still sluggish, which indicated that the problem was still unresolved.

After looking at the output of dmesg again, I realized that even though the amdgpu kernel module was still printing an error, the message was different from the one I had seen before: the kernel was now able to find the firmware binary it was looking for, but the binary failed validation.

[ 1.101433] amdgpu 0000:0c:00.0: firmware: direct-loading firmware amdgpu/navi10_gpu_info.bin
[ 1.101435] amdgpu 0000:0c:00.0: Failed to validate gpu_info firmware "amdgpu/navi10_gpu_info.bin"
[ 1.101438] amdgpu 0000:0c:00.0: Fatal error during GPU init
[ 1.101440] [drm] amdgpu: finishing device.
... (a bunch of lines have been omitted from here)
[ 1.101819] amdgpu: probe of 0000:0c:00.0 failed with error -22

A validation check was failing for this specific binary file. However, I had no idea what this validation actually was.

Third Attempt: Build The Latest Stable Linux Kernel

After some research, I found out that there had been a good number of recent improvements in the Linux kernel for supporting newer AMD GPUs, including navi10. Without knowing what sort of validation was stopping me in my previous attempt, I decided to give this a shot.

After downloading and extracting the source code of kernel version 5.6.3 (newest stable version available at the time), I copied over my existing kernel configuration file to be reused. Also, I decided to modify the configuration to build the amdgpu module into the kernel, in hopes of somehow eliminating the point of failure by allowing amdgpu to be loaded early in the process of booting the kernel.

Less than an hour later, I had a freshly built kernel 5.6.3 with a built-in amdgpu module (and hopefully firmware binaries). I rebooted the system and loaded the new kernel, only to find the same failure with the built-in module:

[    1.020232] amdgpu 0000:0c:00.0: vgaarb: deactivate vga console
[    1.020802] Console: switching to colour dummy device 80x25
[    1.020957] [drm] initializing kernel modesetting (NAVI10 0x1002:0x731F 0x1682:0x5710 0xCA).
... (a bunch of lines omitted from here)
[    1.041905] amdgpu 0000:0c:00.0: Failed to validate gpu_info firmware "amdgpu/navi10_gpu_info.bin"
[    1.041907] amdgpu 0000:0c:00.0: Fatal error during GPU init
[    1.041909] [drm] amdgpu: finishing device.
[    1.042003] amdgpu: probe of 0000:0c:00.0 failed with error -22

It was in this moment of frustration and disappointment that I had a realization: the Linux kernel is open source, and I already had the source code on my computer. Why shouldn't I look and see what this failing "validation" actually is?

Given that I was not familiar with the code base, my best bet was to search for files that have an occurrence of the string "Failed to validate gpu_info firmware". This pointed me to the following excerpt of amdgpu_device.c:

snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_gpu_info.bin", chip_name);
err = request_firmware(&adev->firmware.gpu_info_fw, fw_name, adev->dev);
if (err) {
    dev_err(adev->dev,
        "Failed to load gpu_info firmware \"%s\"\n",
        fw_name);
    goto out;
}
err = amdgpu_ucode_validate(adev->firmware.gpu_info_fw);
if (err) {
    dev_err(adev->dev,
        "Failed to validate gpu_info firmware \"%s\"\n",
        fw_name);
    goto out;
}
File: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

If the call to amdgpu_ucode_validate returns a non-zero value, the message "Failed to validate gpu_info firmware" gets printed.
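For reference, a string search like the one that led to this file can be done with grep. This is a minimal sketch, assuming a POSIX shell and a grep that supports recursive search; the helper name files_containing is hypothetical.

```shell
# List files under a directory that contain a literal string.
#   -r  recurse into subdirectories
#   -l  print only the names of matching files
#   -F  treat the pattern as a fixed string, not a regular expression
# The function name is made up for this example.
files_containing() {
    grep -rlF -- "$2" "$1"
}

# Possible use from the top of a kernel source tree:
#   files_containing drivers/gpu/drm/amd 'Failed to validate gpu_info firmware'
```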

Having learned that, my next stop was to look at the definition of the function amdgpu_ucode_validate. This function is defined in drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c in the kernel source tree:

int amdgpu_ucode_validate(const struct firmware *fw)
{
    const struct common_firmware_header *hdr =
        (const struct common_firmware_header *)fw->data;

    if (fw->size == le32_to_cpu(hdr->size_bytes))
        return 0;

    return -EINVAL;
}
File: drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c

The definition of amdgpu_ucode_validate sheds light on what the kernel thinks is going wrong: the equality check for size is not evaluating to true. But where are the two size values coming from?

I figured my best bet was to look at the definitions of struct firmware and struct common_firmware_header:

struct common_firmware_header {
    uint32_t size_bytes; /* size of the entire header+image(s) in bytes */
    uint32_t header_size_bytes; /* size of just the header in bytes */
    uint16_t header_version_major; /* header version */
    uint16_t header_version_minor; /* header version */
    uint16_t ip_version_major; /* IP version */
    uint16_t ip_version_minor; /* IP version */
    uint32_t ucode_version;
    uint32_t ucode_size_bytes; /* size of ucode in bytes */
    uint32_t ucode_array_offset_bytes; /* payload offset from the start of the header */
    uint32_t crc32;  /* crc32 checksum of the payload */
};
File: drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.h

struct firmware {
    size_t size;
    const u8 *data;
    struct page **pages;

    /* firmware loader private fields */
    void *priv;
};
File: include/linux/firmware.h

At this point, I understood what amdgpu_ucode_validate is doing: it reads the firmware header and compares the total firmware size specified in the header with the actual size of the firmware binary to see if they match. In my case, these two numbers did not match, so the kernel printed an error message.
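The same comparison can be approximated from user space, without touching the kernel. The following is a rough sketch under a few assumptions: size_bytes is the first field of the header and is stored little-endian (as the struct definition and the le32_to_cpu call suggest), and the helper names header_size_bytes, actual_size_bytes, and firmware_size_matches are made up for this example. It only mirrors the size check and says nothing else about whether a file is valid firmware.

```shell
# Read the first four bytes of a file as a little-endian unsigned 32-bit
# integer. Reading byte by byte keeps this independent of the host byte order.
header_size_bytes() {
    # od prints the four bytes as decimal numbers, e.g. "4 3 0 0"
    set -- $(od -An -tu1 -N4 "$1")
    [ "$#" -eq 4 ] || return 1    # file shorter than four bytes
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# Size of the file on disk, in bytes.
actual_size_bytes() {
    echo $(( $(wc -c < "$1") ))
}

# Succeeds only if the size stored in the header equals the real file size,
# which is the same condition amdgpu_ucode_validate tests.
firmware_size_matches() {
    _hdr=$(header_size_bytes "$1") || return 1
    [ "$_hdr" -eq "$(actual_size_bytes "$1")" ]
}

# Possible use:
#   firmware_size_matches /lib/firmware/amdgpu/navi10_gpu_info.bin && echo ok
```

A file that is not really a firmware binary will almost certainly fail this check, because its first four bytes are unlikely to encode its own length.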

To be able to see more information, I decided to add a debug message to amdgpu_ucode_validate that prints both sizes:

int amdgpu_ucode_validate(const struct firmware *fw)
{
    const struct common_firmware_header *hdr =
        (const struct common_firmware_header *)fw->data;

    /* %zu is the format specifier for size_t; hdr->size_bytes is a 32-bit value */
    DRM_DEBUG("Validating firmware size: fw->size=%zu, hdr->size_bytes=%u\n",
              fw->size, le32_to_cpu(hdr->size_bytes));

    if (fw->size == le32_to_cpu(hdr->size_bytes))
        return 0;

    return -EINVAL;
}

This time, for whatever reason, I also decided to clone the firmware blob repository (instead of using my download script) and copy the navi10 files from the clone.

After re-compiling the kernel using make bindeb-pkg, and installing it using dpkg -i <kernel package name>.deb, I rebooted the system and enabled the kernel's DRM debug logs.

As with my previous attempts, I logged in and browsed the output of dmesg. As I had expected, there was a LOT of debug output. To be able to analyze the content better, I opened /var/log/syslog in Vim and searched for the specific log entry that I had manually added to amdgpu_ucode_validate. The following lines were my hits:

Apr  9 10:17:56 optimus kernel: [    1.052472] [drm] Validating firmware size: fw->size=772, hdr->size_bytes=772
Apr  9 10:17:56 optimus kernel: [    1.053758] [drm] Validating firmware size: fw->size=171888, hdr->size_bytes=171888
Apr  9 10:17:56 optimus kernel: [    1.054643] [drm] Validating firmware size: fw->size=127488, hdr->size_bytes=127488
Apr  9 10:17:56 optimus kernel: [    1.054644] [drm] Validating firmware size: fw->size=29440, hdr->size_bytes=29440
Apr  9 10:17:56 optimus kernel: [    1.054742] [drm] Validating firmware size: fw->size=267970, hdr->size_bytes=267970
Apr  9 10:17:56 optimus kernel: [    1.055383] [drm] Validating firmware size: fw->size=263424, hdr->size_bytes=263424
Apr  9 10:17:56 optimus kernel: [    1.055383] [drm] Validating firmware size: fw->size=263424, hdr->size_bytes=263424
Apr  9 10:17:56 optimus kernel: [    1.055384] [drm] Validating firmware size: fw->size=263296, hdr->size_bytes=263296
Apr  9 10:17:56 optimus kernel: [    1.055385] [drm] Validating firmware size: fw->size=43968, hdr->size_bytes=43968
Apr  9 10:17:56 optimus kernel: [    1.055386] [drm] Validating firmware size: fw->size=268592, hdr->size_bytes=268592
Apr  9 10:17:56 optimus kernel: [    1.055386] [drm] Validating firmware size: fw->size=268592, hdr->size_bytes=268592
Apr  9 10:17:56 optimus kernel: [    1.056508] [drm] Validating firmware size: fw->size=33792, hdr->size_bytes=33792
Apr  9 10:17:56 optimus kernel: [    1.056557] [drm] Validating firmware size: fw->size=33792, hdr->size_bytes=33792
Apr  9 10:17:56 optimus kernel: [    1.056750] [drm] Validating firmware size: fw->size=459360, hdr->size_bytes=459360
Partial output from /var/log/syslog

To my absolute surprise, it appeared that all of the pairs matched. I could also no longer find the validation error message Failed to validate gpu_info firmware "amdgpu/navi10_gpu_info.bin".
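Checking the pairs by eye works for a dozen lines, but it can also be automated. Here is a small sketch that reads log text in the format shown above and prints only the entries where the two numbers differ; the helper name mismatched_sizes is hypothetical, and the pattern assumes the exact message text of the debug line added earlier.

```shell
# Read log lines on stdin and report every "Validating firmware size" entry
# whose file size and header size differ. Prints nothing if all pairs match.
mismatched_sizes() {
    # Pull the two numbers out of each matching line, then compare them.
    sed -n 's/.*fw->size=\([0-9][0-9]*\), hdr->size_bytes=\([0-9][0-9]*\).*/\1 \2/p' |
        awk '$1 != $2 { print "mismatch: file=" $1 " header=" $2 }'
}

# Possible use:
#   mismatched_sizes < /var/log/syslog
```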

Then I looked at the output of the sensors utility and noticed that it now included GPU-related information, which it had not before:

amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:      775.00 mV 
fan1:           0 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +31.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +31.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +34.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:       12.00 W  (cap = 150.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Vcore:       819.00 mV 
Vsoc:        819.00 mV 
Tdie:         +28.8°C  
Tctl:         +38.8°C  
Icore:         6.00 A  
Isoc:          4.50 A
Output from sensors program

While I was happy that whatever I did ended up fixing the issue, I had to find out why the amdgpu driver was now working.

What Did I Fix?

The kernel source modification I made to add debug messages was definitely not fixing any issues. There is no question about that. However, I did only one other thing differently this time: cloning the firmware blob repository as opposed to using the Bash script I had written earlier.

To put this idea to the test, I executed my Bash script to download the navi10 firmware files, and compared the size of those files with the ones from the cloned repository.

They did not match!
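A comparison like this takes only a few lines of shell. The sketch below assumes two directories that hold files with the same names, one filled by the download script and one copied from the cloned repository; the helper name report_size_differences is made up for this example.

```shell
# For every navi10_*.bin file in directory $1 that also exists in directory $2,
# print the file name and both sizes if the sizes differ.
report_size_differences() {
    for f in "$1"/navi10_*.bin; do
        [ -e "$f" ] || continue            # glob matched nothing
        name=${f##*/}                      # strip the directory part
        [ -e "$2/$name" ] || continue      # no counterpart to compare with
        a=$(( $(wc -c < "$f") ))
        b=$(( $(wc -c < "$2/$name") ))
        [ "$a" -eq "$b" ] || echo "$name: $a vs $b bytes"
    done
}

# Possible use (both paths are placeholders):
#   report_size_differences ./downloaded ./linux-firmware/amdgpu
```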

Completely surprised, I opened https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu and clicked on one of the navi10 links to see if my browser downloads the file.

There was the problem: the link that I thought would download the file actually opened an HTML page with a hex dump view of the binary file.
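Two small checks would have caught this early. The sketch below rests on assumptions: that the web interface on git.kernel.org (cgit) serves the raw file when "tree" in the URL is replaced with "plain", and that an HTML page can be recognized by a doctype or html tag near its start. The helper names cgit_plain_url and looks_like_html are hypothetical.

```shell
# Turn a "tree" page URL into the corresponding "plain" (raw file) URL
# by replacing the first /tree/ path segment.
cgit_plain_url() {
    printf '%s\n' "$1" | sed 's#/tree/#/plain/#'
}

# Succeeds if the beginning of the file looks like an HTML document,
# which a firmware binary should never do.
looks_like_html() {
    head -c 1024 "$1" | grep -qiE '<!doctype html|<html'
}

# Possible use after a download:
#   looks_like_html navi10_ce.bin && echo "this is a web page, not firmware"
```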

Yes, this was a very amateur mistake. However...

I Finally Realized What Open Source Means

While I learned that mistakes can happen where you least expect them, I finally understood what the "open" in "open-source" software means: freedom.

There was nothing stopping me from looking into what the kernel is doing except my own will and desire. This is an invaluable power to possess. One that you will never understand until you exercise it.

Unless required for a job or specific personal interest, most of us will not browse the Linux kernel's source code often. However, the message here is about what has been made possible by open-source.