# SYNAPTICON GPU Installation & Boot Incident Report - June 5, 2026

## 1. Executive Summary
On June 5, 2026, SYNAPTICON was gracefully shut down for a hardware upgrade: installing a third NVIDIA GPU connected through a PCIe ribbon cable extension. After the installation, the server repeatedly reset early in power-on and looped in POST. The new GPU was then removed to restore server operations.

This report reconstructs the boot troubleshooting timeline from iDRAC Lifecycle Controller logs and system metrics, verifies the current operating state (the stable 2-GPU configuration), and offers a probable electrical/protocol diagnosis of the ribbon cable failure.

---

## 2. Reconstructed Troubleshooting Timeline
Using iDRAC Lifecycle logs (`racadm lclog view`) and OS boot histories, we reconstructed the sequence of events during today's troubleshooting session:

| Time (Pacific) | Log Source | Event / Message | Diagnostic Interpretation |
| :--- | :--- | :--- | :--- |
| **12:45:13** | iDRAC | `System CPU Resetting / Turning Off` | **Graceful Shutdown Complete.** Server powered off cleanly as requested. |
| **17:58:25** | iDRAC | `The chassis is open while the power is off` | **Hardware Installation:** You opened the chassis, plugged in the ribbon cable/GPU, and closed the chassis (17:58:29). |
| **18:03:03** | iDRAC | `iDRAC IP Address changed to 128.223.139.35` | iDRAC management module boots up and gets network access on standby power. |
| **18:06 - 18:32** | iDRAC | Multiple `System is turning on` / `System CPU Resetting` loops | **POST Boot Loops:** The system repeatedly started POST and reset before completing it. This pattern is consistent with a PCIe link-training or power-delivery fault on the new card. |
| **18:13:49** | iDRAC | **`New PCI card(s) have been detected in the system.` (FQDD: System.Embedded.1)** | **Physical Presence Sensed:** The system registered a new card in a PCIe slot. *This is consistent with the presence-detect pins being closed, even though the card never became usable.* |
| **18:36:16** | iDRAC | **`Device not detected: Video(Slot 7-1) (FQDD: Video.Slot.7-1)`** | **Link Training Failure:** The system expected a video device in Slot 7 but could not communicate with it, most likely because the PCIe link failed to train. The device was marked as not detected. |
| **18:52:23** | OS Boot | `Debian Kernel 6.12.90+deb13.1-amd64` loaded successfully | **Recovery:** The third GPU was removed, allowing the system to complete POST and successfully load the Debian OS. |
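
The gaps between the logged events can be checked with a short calculation. The following is a minimal sketch that uses only the timestamps from the table above; the event labels and the `elapsed` helper are illustrative names, not iDRAC fields.

```python
from datetime import datetime

# Timestamps copied from the timeline table (all on 2026-06-05, local time).
# The dictionary keys are illustrative labels, not iDRAC field names.
EVENTS = {
    "shutdown": "12:45:13",
    "chassis_opened": "17:58:25",
    "card_detected": "18:13:49",
    "device_not_detected": "18:36:16",
    "os_booted": "18:52:23",
}


def elapsed(start: str, end: str) -> str:
    """Return the H:MM:SS gap between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return str(delta)


if __name__ == "__main__":
    print("Total downtime:     ", elapsed(EVENTS["shutdown"], EVENTS["os_booted"]))
    print("Install to recovery:", elapsed(EVENTS["chassis_opened"], EVENTS["os_booted"]))
    print("Detected to dropped:", elapsed(EVENTS["card_detected"], EVENTS["device_not_detected"]))
```

With the table values, total downtime is 6:07:10, of which 0:53:58 falls between opening the chassis and the successful OS boot.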

---

## 3. Physical Diagnosis: Ribbon Cable vs. Power Failure
You noted that the GPU currently has no power and is on a ribbon cable extension that might be bad. 

### Did the logs see the GPU?
**Yes, but only at the presence-detect level. The card never came up as a working PCIe device.**

The most likely explanation for the server's behavior is as follows:

1. **How the server knew it was there (18:13:49):** 
   PCIe slots use physical side-band pins called **`PRSNT1#`** and **`PRSNT2#`** (Presence Detect). These pins are bridged together on the card's circuit board, so inserting a card, whether directly or through an extension cable that passes the pins through, closes the circuit. This works even if the GPU is completely unpowered or the data lanes are unusable, which would explain why the system could still log: *"New PCI card(s) have been detected."*
2. **Why it threw "Device not detected" (18:36:16):**
   Once the server moves past basic presence detection, the chipset starts **PCIe Link Training**, in which both ends of the link negotiate lane width, data rate, and equalization settings.
   * **If the GPU had no power:** The GPU's PCIe interface stays unpowered and cannot respond to the motherboard's training sequences.
   * **If the ribbon cable is bad:** PCIe Gen 3 signals (8 GT/s per lane) are very sensitive to impedance mismatches, electromagnetic interference, and crosstalk. Flat ribbon cables without proper shielding can degrade the reference clock and data lines enough to cause high bit-error rates or prevent training altogether.
   
**Conclusion:** The presence-detect pins worked (recording the card's existence), but the lack of power and/or signal degradation in the ribbon cable prevented link training. The failed training most likely caused the reset loops seen between 18:06 and 18:32, after which the system gave up on Slot 7 and logged `Device not detected`.
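
Once a card does enumerate, the result of link training can be read back from Linux. The sketch below assumes the kernel exposes the usual `current_link_width` / `max_link_width` / `current_link_speed` / `max_link_speed` attributes under `/sys/bus/pci/devices`; the function names are illustrative. A card that fails training entirely will not appear in sysfs at all, so this check is only useful for spotting a link that trained in a degraded state, such as fewer lanes than expected.

```python
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci/devices")  # standard Linux sysfs location


def read_link_state(address: str, root: Path = SYSFS_PCI) -> dict:
    """Read negotiated and maximum PCIe link parameters for one device."""
    dev = root / address
    names = (
        "current_link_speed",
        "max_link_speed",
        "current_link_width",
        "max_link_width",
    )
    return {name: (dev / name).read_text().strip() for name in names}


def link_is_degraded(state: dict) -> bool:
    """True if the link trained with fewer lanes than the device supports.

    Only the width is compared: many GPUs lower the link *speed* at idle
    to save power, so a reduced speed alone is not a reliable fault sign.
    """
    return int(state["current_link_width"]) < int(state["max_link_width"])


if __name__ == "__main__":
    # "0000:02:00.0" assumes PCI domain 0000 for the GPU listed at 02:00.0.
    state = read_link_state("0000:02:00.0")
    print(state, "degraded" if link_is_degraded(state) else "ok")
```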

---

## 4. Current System State Comparison (Pre vs. Post)
We have compared the current post-boot state (`post_state`) against the pre-shutdown state (`pre_state`) to verify SYNAPTICON's health:

### A. GPU Verification
* **NVIDIA Bus Status:** Exactly 2 GPUs are detected on the PCIe bus and are fully operational under the NVIDIA driver:
  * **GPU 1:** `02:00.0` Tesla P40 (rev a1)
  * **GPU 2:** `43:00.0` Tesla P40 (rev a1)
* **Driver Health:** The NVIDIA kernel module `580.105.08` was built and signed via DKMS, loaded, and bound to both GPUs (`nvidia0` and `nvidia1`).
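
The GPU count can also be cross-checked without the NVIDIA driver by walking sysfs. This is a minimal sketch, assuming the standard Linux `vendor` and `class` attributes under `/sys/bus/pci/devices`; the function name is illustrative.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def nvidia_display_devices(root: Path = Path("/sys/bus/pci/devices")) -> list[str]:
    """Return sorted PCI addresses of NVIDIA display-class devices."""
    found = []
    for dev in root.iterdir():
        vendor = (dev / "vendor").read_text().strip()
        pci_class = (dev / "class").read_text().strip()
        # Class codes starting with 0x03 are display controllers
        # (for example 0x0300 VGA and 0x0302 3D controller).
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            found.append(dev.name)
    return sorted(found)


if __name__ == "__main__":
    gpus = nvidia_display_devices()
    print(f"{len(gpus)} NVIDIA GPU(s) on the bus: {gpus}")
```

On the current configuration this should list two addresses ending in `02:00.0` and `43:00.0`. A third entry after the next installation attempt would show that the new card enumerated.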

### B. Operating System & Kernel
* **Kernel Upgrade:** The reboot brought the system up on a newer kernel version:
  * **Pre-Shutdown:** `Kernel 6.12.73+deb13-amd64`
  * **Post-Boot:** `Kernel 6.12.90+deb13.1-amd64`
  The DKMS ZFS and NVIDIA modules were rebuilt and signed automatically for the new kernel.

### C. Storage Pool Integrity
Both ZFS pools are online, error-free, and mounted:
* **`bpool`:** ONLINE, 0 errors.
* **`rpool`:** ONLINE, 0 errors (with active SSD caches `wwn-0x5002538...` fully operational).
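
The pool check can be scripted for future reboots. The sketch below assumes `zpool list -H -o name,health` is available on the host and prints one tab-separated `name`/`health` pair per pool; the helper names are illustrative.

```python
import subprocess


def parse_pool_health(listing: str) -> dict[str, str]:
    """Parse tab-separated `zpool list -H -o name,health` output."""
    pools = {}
    for line in listing.splitlines():
        if line.strip():
            name, health = line.split("\t")
            pools[name] = health
    return pools


def unhealthy_pools(pools: dict[str, str]) -> list[str]:
    """Return the names of pools whose health is anything but ONLINE."""
    return sorted(name for name, health in pools.items() if health != "ONLINE")


def current_pool_health() -> dict[str, str]:
    """Query the live system (requires the zpool binary)."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        capture_output=True, text=True, check=True,
    )
    return parse_pool_health(result.stdout)


if __name__ == "__main__":
    bad = unhealthy_pools(current_pool_health())
    print("All pools ONLINE" if not bad else f"Attention needed: {bad}")
```

This only covers the pool state. Per-device read, write, and checksum error counters still need `zpool status`.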

### D. System Services
* **`certbot.service`:** This service was in a failed state before the shutdown. After the reboot it started cleanly and is **no longer failing**.
* **`dnsmasq.service`:** Remains failed, exactly as it was before the shutdown. This is a pre-existing condition and is unrelated to the hardware change.
* **`synapticon-zenith.service` (Zenith agetty):** Cleanly loaded and running.
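
The pre/post service comparison reduces to a set difference. This is a minimal sketch, assuming the failed-unit lists were captured with `systemctl list-units --state=failed --plain --no-legend`, where the unit name is the first column; the function names are illustrative, and the sets in the example are taken from the findings above.

```python
def failed_units(listing: str) -> set[str]:
    """Extract unit names (first column) from a failed-units listing."""
    return {line.split()[0] for line in listing.splitlines() if line.strip()}


def compare_failed(pre: set[str], post: set[str]) -> dict[str, set[str]]:
    """Classify units by how their failed status changed across the reboot."""
    return {
        "recovered": pre - post,      # failed before, fine now
        "still_failed": pre & post,   # failed before and after
        "newly_failed": post - pre,   # regressions introduced by the reboot
    }


if __name__ == "__main__":
    # Values reported in this section for pre_state and post_state.
    pre_state = {"certbot.service", "dnsmasq.service"}
    post_state = {"dnsmasq.service"}
    for label, units in compare_failed(pre_state, post_state).items():
        print(f"{label}: {sorted(units)}")
```

An empty `newly_failed` set is the key result: the reboot introduced no service regressions.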

---

## 5. Recommended Next Steps for Tomorrow
When you swap the ribbon cable extension tomorrow:
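1. **Ensure Auxiliary Power is Connected:** Make sure the GPU's auxiliary power cable is fully seated *before* powering on the server. If the new card is another Tesla P40, note that it uses an 8-pin CPU-style (EPS12V) connector rather than a standard 8-pin PCIe plug, so use the matching GPU power cable or adapter. An unpowered card cannot complete link training and may stall POST.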
2. **Use a Shielded PCIe Riser/Extension:** If possible, avoid unshielded flat ribbon cables. High-speed GPUs require shielded, high-frequency PCIe riser cables (rated for Gen 3 or Gen 4) to maintain signal integrity over distance.
3. **Standby Diagnostics:** If you encounter boot loops again, you do not need to wait for the OS. The iDRAC runs on standby power, so you can log directly into the iDRAC Web GUI (or run `racadm lclog view`) to check whether FQDD `Video.Slot.7-1` completes link negotiation or triggers another `PCI3018` / `PR8` alert loop.
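
To pick the relevant alerts out of a long Lifecycle log, the text output of `racadm lclog view` can be filtered by message ID. The sketch below rests on an assumption about the output layout: that each record is a block of `Key = Value` lines beginning with `SeqNumber` and including a `Message ID` field. Field names can differ between iDRAC firmware versions, so adjust them to what your system prints. The function names are illustrative.

```python
import sys

WATCHED_IDS = {"PCI3018", "PR8"}  # message IDs named in step 3 above


def parse_lclog(text: str) -> list[dict[str, str]]:
    """Split lclog text into records (assumed layout: 'Key = Value' lines)."""
    records, current = [], {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # blank line or separator between records
        key, value = key.strip(), value.strip()
        # Assumption: every record starts with its SeqNumber field.
        if key == "SeqNumber" and current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def watched_events(records, ids=WATCHED_IDS):
    """Keep only records whose Message ID is in the watch list."""
    return [r for r in records if r.get("Message ID") in ids]


if __name__ == "__main__":
    # Usage (hypothetical file name): racadm lclog view > lclog.txt
    with open(sys.argv[1]) as handle:
        for event in watched_events(parse_lclog(handle.read())):
            print(event.get("Timestamp"), event.get("Message ID"), event.get("Message"))
```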
