You are not logged in.
This is a continuation of the debugging process from this thread.
horace wrote:
Here is the status bits from a recent crash. Not one we've seen before: 20100. Three repetitions are listed to highlight the *occasional* difference in wlan_mac_get_status() of 0x00028100 vs. 0x00020100. The 28 version appears every ~100 iterations. Clearly one bit changes between the 'Stuck in...' printf and the 'MAC HW ...' printf. I guess it's toggling and the printf sometimes catches it?
Stuck in wlan_mac_hw_rx_finish! 0x00020100
MAC HW Status: 0x00020100
Rx Hdr Params: 0x0A050085
Rx PHY Status: 0x00000002
Stuck in wlan_mac_hw_rx_finish! 0x00020100
MAC HW Status: 0x00028100
Rx Hdr Params: 0x0A050085
Rx PHY Status: 0x00000002
I've been digging through the Rx PHY looking for any way the control logic could leave the MAC in that state. So far I can't find anything.
Just to confirm, the observations+prints above came from a STA+NOMAC node tuned to channel 6 with DSSS disabled?
Does the node fall into this bad state quickly after booting (seconds, minutes, ...)? And have you seen multiple WARP v3 nodes behave the same way?
One especially baffling thing is in the Rx Header Params print. Bit 0x10000000=0 means PHY_SEL=DSSS. However Bit 0x08000000=1 means PHY_HDR_READY, a bit only the OFDM pipeline can assert. I think this combination means PHY_HDR_READY asserted a long time after RX_END (PHY_SEL reverts to DSSS after RX_END). But I haven't been able to induce that behavior with any waveform (corrupt or otherwise) in simulation. I'll keep looking.
One additional debug step that would be very helpful would be to print the first 16 bytes of the Rx packet buffer. This is where the Rx PHY writes the bytes for the SIGNAL and HT-SIG fields.
// Add to "Stuck in wlan_mac_hw_rx_finish" if() in wlan_mac_hw_rx_finish() // Print first 16 bytes (PHY_RX_PKT_BUF_PHY_HDR_SIZE) of Rx packet xil_printf("[ 3: 0] 0x%08x\n", Xil_In32(RX_PKT_BUF_TO_ADDR(rx_pkt_buf) + PHY_RX_PKT_BUF_PHY_HDR_OFFSET + 0)); xil_printf("[ 7: 4] 0x%08x\n", Xil_In32(RX_PKT_BUF_TO_ADDR(rx_pkt_buf) + PHY_RX_PKT_BUF_PHY_HDR_OFFSET + 4)); xil_printf("[11: 8] 0x%08x\n", Xil_In32(RX_PKT_BUF_TO_ADDR(rx_pkt_buf) + PHY_RX_PKT_BUF_PHY_HDR_OFFSET + 8)); xil_printf("[15:12] 0x%08x\n", Xil_In32(RX_PKT_BUF_TO_ADDR(rx_pkt_buf) + PHY_RX_PKT_BUF_PHY_HDR_OFFSET + 12));
Offline
Configuration: We are logging on three nodes (ch 1, 6, 11) simultaneously with STA+NOMAC. We have made no modifications to the code other than those described below. DSSS is disabled during Python init_experiment() using node.enable_dsss(False). It typically takes many hours for this bug to manifest and it usually occurs during busy periods (around lunch time in our office). The error seems independent of the channel and node (more below).
Since the hang can take several hours to occur and on any node we have modified your suggestion slightly to log these debug words to a new entry type in the event log. Essentially, within the if(i++>1000000) we create an IPC message to CPU High, which copies the debug words to the event log via a new HANG_INFO entry type. We then reset the RX PHY so that the experiment can continue. In this way we can still perform measurements and post process the HANG_INFO event log from multiple nodes. The entire wlan_mac_hw_rx_finish() is pasted below so you can see exactly what we've done (sorry it's a long code paste):
inline u32 wlan_mac_hw_rx_finish() { u32 mac_hw_status; ipc_hang_t hang; wlan_ipc_msg_t ipc_msg_to_high; int i = 0; // Wait for the packet to finish - Rx PHY is "active" if the demod/decoding pipeline is // still writing bytes to the packet buffer. The FCS result is not known until the last // payload byte has been written. The RX_PHY_ACTIVE bit might de-assert before this happens // for some lengths/rates/bandwidths. This allows RX_END to start the SIFS timer at the right // time independent of payload-specific PHY latencies. do{ mac_hw_status = wlan_mac_get_status(); if(i++>1000000) { xil_printf("Stuck in wlan_mac_hw_rx_finish! 0x%08x\n", mac_hw_status); // Grab all debug hang.timestamp = wlan_mac_low_get_rx_start_timestamp(); //u64 hang.mac_hw_status_0 = mac_hw_status; //u32 hang.mac_hw_status_1 = wlan_mac_get_status(); //u32 hang.rx_phy_hdr_params = wlan_mac_get_rx_phy_hdr_params(); //u32 hang.rx_status = Xil_In32(WLAN_RX_STATUS); //u32 hang.pkt_buf_0 = Xil_In32(RX_PKT_BUF_TO_ADDR(rx_pkt_buf) + PHY_RX_PKT_BUF_PHY_HDR_OFFSET + 0); //u32 hang.pkt_buf_1 = Xil_In32(RX_PKT_BUF_TO_ADDR(rx_pkt_buf) + PHY_RX_PKT_BUF_PHY_HDR_OFFSET + 4); //u32 hang.pkt_buf_2 = Xil_In32(RX_PKT_BUF_TO_ADDR(rx_pkt_buf) + PHY_RX_PKT_BUF_PHY_HDR_OFFSET + 8); //u32 hang.pkt_buf_3 = Xil_In32(RX_PKT_BUF_TO_ADDR(rx_pkt_buf) + PHY_RX_PKT_BUF_PHY_HDR_OFFSET + 12); //u32 hang.magic = 0xABCDAAAA; // Just to check the IPC copied correctly //u32 // Send IPC Message ipc_msg_to_high.msg_id = IPC_MBOX_MSG_ID(IPC_MBOX_HANG); //19 ipc_msg_to_high.num_payload_words = 11; //(sizeof(ipc_hang_t) >> 2); ipc_msg_to_high.payload_ptr = (u32*) &hang; write_mailbox_msg(&ipc_msg_to_high); // Print debug to UART? if(1) { xil_printf("MAC HW Status: 0x%08x\n", hang.mac_hw_status_1); xil_printf("Rx Hdr Params: 0x%08x\n", hang.rx_phy_hdr_params); xil_printf("Rx PHY Status: 0x%08x\n", hang.rx_status); // Print first 16 bytes (PHY_RX_PKT_BUF_PHY_HDR_SIZE) of Rx packet xil_printf("[ 3: 0] 0x%08x\n", hang.pkt_buf_0); xil_printf("[ 7: 4] 0x%08x\n", hang.pkt_buf_1); xil_printf("[11: 8] 0x%08x\n", hang.pkt_buf_2); xil_printf("[15:12] 0x%08x\n\n", hang.pkt_buf_3); } // Reset the RX PHY REG_SET_BITS(WLAN_RX_REG_CTRL, WLAN_RX_REG_CTRL_RESET); REG_CLEAR_BITS(WLAN_RX_REG_CTRL, WLAN_RX_REG_CTRL_RESET); } } while(mac_hw_status & (WLAN_MAC_STATUS_MASK_RX_PHY_ACTIVE | WLAN_MAC_STATUS_MASK_RX_PHY_WRITING_PAYLOAD)); // Check RX_END_ERROR and FCS if( (mac_hw_status & WLAN_MAC_STATUS_MASK_RX_FCS_GOOD) && ((mac_hw_status & WLAN_MAC_STATUS_MASK_RX_END_ERROR) == 0)) { return 1; } else { return 0; } }
There is obviously other changes in CPU High to support the new event type and IPC handling etc.
This experiment ran for 17 hours before one node still hung. Even with the RX PHY reset lines, the node still crashes. This node was on channel 6.
For the node on channel 6 there are 180k HANG_INFO entries all with the same values. Clearly the number of entries is irrelevant since the node could not reset the PHY it continually made IPC messages/log entries. Results are:
timestamp, mac_hw_status_0, mac_hw_status_1, rx_phy_hdr_params, rx_status, pkt_buf_0, pkt_buf_1, pkt_buf_2, pkt_buf_3, magic DBEAD1D91, 00020102 , 00020102 , 0A050085 , 00000002 , A602036D , 8950EB94 , 004A0000 , 00000000 , ABCDAAAA DBEAD1D91, 00020102 , 00020102 , 0A050085 , 00000002 , A602036D , 8950EB94 , 004A0000 , 00000000 , ABCDAAAA etc.
For the node on channel 1 there were two HANG_INFO entries. This means it encountered a hang event but managed to reset the RX PHY and continue.
timestamp, mac_hw_status_0, mac_hw_status_1, rx_phy_hdr_params, rx_status, pkt_buf_0, pkt_buf_1, pkt_buf_2, pkt_buf_3, magic D6928319B, 00008180 , 00008180 , 190001FE , 00000002 , 80003FDB , C88F00A9 , 001D8003 , 00000000 , ABCDAAAA D6928319B, 00000500 , 00000500 , 190001FE , 00000002 , 80003FDB , C88F00A9 , 001D8003 , 00000000 , ABCDAAAA
I've checked the event log entries against a UART print and the values match (for the channel 6 node).
Hope this helps!
Offline
Some more occurrences:
Channel 1, different node to previous. This caused the node to hang.
timestamp, mac_hw_status_0, mac_hw_status_1, rx_phy_hdr_params, rx_status, pkt_buf_0, pkt_buf_1, pkt_buf_2, pkt_buf_3, magic 3A50DBEAE, 00020100 , 00028100 , 0A070085 , 00000002 , 9802067C , 2203DBF9 , 00000000 , 00000000 , ABCDAAAA 3A50DBEAE, 00020100 , 00028100 , 0A070085 , 00000002 , 9802067C , 2203DBF9 , 00000000 , 00000000 , ABCDAAAA etc.
Channel 6, again a different physical node. Caused another hang.
timestamp, mac_hw_status_0, mac_hw_status_1, rx_phy_hdr_params, rx_status, pkt_buf_0, pkt_buf_1, pkt_buf_2, pkt_buf_3, magic 2E179688 , 00020100 , 00020100 , 0A050085 , 00000002 , 570002CD , 5BFE39FE , 00000003 , 00000000 , ABCDAAAA 2E179688 , 00020100 , 00020100 , 0A050085 , 00000002 , 570002CD , 5BFE39FE , 00000003 , 00000000 , ABCDAAAA etc.
Offline
Additional observations:
1) Our experiment does not expect to receive any individually addressed frames, it simply captures frames on the channel - We therefore have DIP switch 3 set to 1 to prevent probe requests/scanning. If we allow scanning (DIP3 to zero) then the node will crash within a matter of minutes.
2) The issue is significantly more prevalent when NOMAC is used verses DCF mac in CPU Low. For NOMAC with scanning the node will crash very quickly giving errors "Stuck in wlan_mac_hw_rx_finish! 0x00020502". When using DCF mac it will occasionally crash (far less often nomac) with error "Stuck waiting for MAC_HW_LASTBYTE_ADDR1". This was observed using totally unmodified v.1.5.3 on channel 1.
Offline
This debug info has proven very helpful. There's definitely a path through the Rx PHY+MAC that I hadn't considered, where a bogus-but-not-invalid SIGNAL field causes the Rx PHY to assert RX_END before the HT-SIG is decoded. One way for this could happen is very low SNR receptions of 11n waveforms. I can reproduce some of the "stuck" states you've observed in simulation. I'm working on an updated design to test - I'll post it here soon (hopefully this afternoon).
Offline
We just posted a v1.5.4 project: Mango_802.11_RefDes_v1.5.4.zip. The release notes are short. The primary change is better control logic in the Rx PHY for handling SIGNAL fields with "valid" values.
The only way I found to reproduce the stuck Rx PHY you observed was to inject an HTMF waveform where the SIGNAL field had a high rate and short length. In the v1.5.3 PHY this results in an RX_END event that occurs before the PHY_HDR_READY. The early RX_END caused the Rx PHY to revert its MAC status outputs to the DSSS outputs. Because of this the RX_END_ERROR event was not observed by the C code, leaving the code stuck forever. It's possible to exercise this bug with an invalid 802.11 waveform, or when receiving a valid waveform at low SNR (the SIGNAL field is not robust to detecting multiple bit errors).
Offline
Thank you very much for the v1.5.4 release. I added the debug printf into the wlan_mac_hw_rx_finish() function just in case and have observed another hang via UART of:
MAC Hw Status: 0x00008180, Rx Hdr Params: 0x1902024E, Rx PHY Status: 0x00000002 on STA+NOMAC.
Sorry about this. I will continue testing.
Offline
I've run this for a week and it seems robust. Thanks
Offline
Thanks for the update. I've looked for any additional bugs that could manifest like post #7 above - I haven't found anything yet. It's good to know the fixes in v1.5.4 did improve robustness in your environment. Please post any other observations you have - these have been very helpful so far.
Offline