WARP Project Forums - Wireless Open-Access Research Platform

You are not logged in.

#1 2016-Feb-10 04:45:13

horace
Member
Registered: 2014-Jul-16
Posts: 61

wlan_mac_dcf hung state problem

Hello
We are seeing issues with the wlan_mac_dcf CPU LOW code in a busy mixed a/b/g/n/(+ac) environment. In STA mode the node is running fine but will 'freeze' at some point. It will hang indefinitely and not receive new frames. Sometimes this causes log capture via the experiments framework to also fail.

We have programmed the device with a fresh version of the 1.4 Reference Design. It usually lasts about 10-20 minutes but this must be environment dependent.

We identified the problem to this section of code in wlan_mac_dcf and applied your fix. But the error still occurs.

Is there anything else in the PHY which could cause the mac to hang? Receiving bad .11n frames, for example?

We added some hardware debugging to log the value of the 'Status' register in wlan_mac_hw > Registers > Status Register via chipscope (manual trigger) when the device enters a hung state. We have observed two hung states with the following register values.

First
'Status': 0x0100000
'Control': 0x00000052
'PHY_RX_PARAMS': 0x14070085
rx_bytenum: 0x4
rx_byte: 0x1F

Second
'Status': 0x01C0000
'Control': 0x00000052
'PHY_RX_PARAMS': 0x3202001C
rx_bytenum: 0x0
rx_byte: 0x0

Could we call wlan_mac_dcf_hw_unblock_rx_phy() at some point? But how to know when it hangs?
This limits the maximum duration of our experiment since it will intermittently freeze.

Offline

 

#2 2016-Feb-10 11:10:18

murphpo
Administrator
From: Mango Communications
Registered: 2006-Jul-03
Posts: 4622

Re: wlan_mac_dcf hung state problem

First, thank you for the detailed debug info. This gives us a fighting chance of finding the underlying problem.

The v1.4 Rx PHY definitely has a bug where reception of certain HT-SIG fields terminates the Rx pipeline without properly setting the RX_ERROR output to the MAC. The software check in changeset 5063 detects this case, allowing the C code to exit frame_receive() cleanly. Without this that while() loop can get stuck. The underlying issue is fixed in the current Rx PHY model (to be used in v1.5 soon).

Your observations indicate I may have missed another subtlety in handling unexpected RX_END events in the PHY. I'll keep looking.

The v1.4 Rx PHY also has a problem receiving 11n MCS7 waveforms with specific lengths (where length is a multiple of the number of bytes per OFDM symbol). The de-interleaver latency for MCS7 is too long, causing the FCS decision to occur after RX_END. I only discovered this a few days ago; a fixed Rx PHY with new de-interleaver is already in svn for v1.5. I think this problem would manifest as a packet error (incorrect FCS=bad decision). I'll double check how this might affect MAC state.


A few debug steps that would be helpful in isolating the problem:
-Does disabling DSSS Rx help? Change wlan_phy_DSSS_rx_enable() to wlan_phy_DSSS_rx_disable() in wlan_phy_init(), or call node.disable_dsss() from Python.

-Is the C code getting stuck in the "while(wlan_mac_get_last_byte_index() < MAC_HW_LASTBYTE_ADDR1)" loop? An easy way to confirm this is with the hex displays.

Code:

    // Wait until the PHY has written enough bytes so that the first address field can be processed
    userio_write_hexdisp_right(USERIO_BASEADDR, 0x2);
    while(wlan_mac_get_last_byte_index() < MAC_HW_LASTBYTE_ADDR1){userio_write_hexdisp_right(USERIO_BASEADDR, 0x3);};
    userio_write_hexdisp_right(USERIO_BASEADDR, 0x2);

Alternatively you can try the code below. This will detect that loop running too long, then dump the MAC/PHY state to the CPU Low UART.

Code:

u64 loop_start = get_system_time_usc();

userio_write_hexdisp_right(USERIO_BASEADDR, 0x2);

while(wlan_mac_get_last_byte_index() < MAC_HW_LASTBYTE_ADDR1) {
	userio_write_hexdisp_right(USERIO_BASEADDR, 0x3);

	if( (get_system_time_usec() - loop_start) > 1000000 ) { //Timeout after 1 sec
		//We've crashed - print debug info
		xil_printf("\n\n---Stuck in frame_receive()---\n");
		xil_printf("wlan_mac_get_last_byte_index() =       %d\n", wlan_mac_get_last_byte_index());
		xil_printf("phy_details->length =                  %d\n", phy_details->length);
		xil_printf("phy_details->mcs =                     0x%x\n", phy_details->mcs);
		xil_printf("phy_details->phy_mode =                0x%x\n", phy_details->phy_mode);
		xil_printf("phy_details->N_DBPS =                  0x%x\n", phy_details->N_DBPS);
		xil_printf("Xil_In32(WLAN_MAC_REG_PHY_RX_PARAMS) = 0x%08x\n", Xil_In32(WLAN_MAC_REG_PHY_RX_PARAMS));
		xil_printf("Xil_In32(WLAN_RX_STATUS) =             0x%08x\n", Xil_In32(WLAN_RX_STATUS));
		xil_printf("\nPacket buffer contents:\n");

		idx_start = 0;
		idx_end = idx_start + PHY_RX_PKT_BUF_PHY_HDR_OFFSET - 1;
		xil_printf("------[%d] - [%d] : rx_frame_info\n", idx_start, idx_end);
		for(temp_idx = idx_start; temp_idx <= idx_end; temp_idx++){
			xil_printf("%02x ", temp_pkt_buf[temp_idx]);
		}
		xil_printf("\n");
		idx_start = idx_end + 1;
		idx_end = idx_start + PHY_RX_PKT_BUF_PHY_HDR_SIZE - 1;
		xil_printf("------[%d] - [%d] : PHY_RX_PKT_BUF_PHY_HDR\n", idx_start, idx_end);
		for(temp_idx = idx_start; temp_idx <= idx_end; temp_idx++){
			xil_printf("%02x ", temp_pkt_buf[temp_idx]);
		}
		xil_printf("\n");
		idx_start = idx_end + 1;
		idx_end = idx_start + 100 - 1; //Just print 100 bytes of MAC payload
		xil_printf("------[%d] - [%d] : MAC Payload\n", idx_start, idx_end);
		for(temp_idx = idx_start; temp_idx <= idx_end; temp_idx++){
			xil_printf("%02x ", temp_pkt_buf[temp_idx]);
		}
		xil_printf("\n");
		while(1){};
	}
};
userio_write_hexdisp_right(USERIO_BASEADDR, 0x2);

Offline

 

#3 2016-Feb-10 15:58:32

murphpo
Administrator
From: Mango Communications
Registered: 2006-Jul-03
Posts: 4622

Re: wlan_mac_dcf hung state problem

Update- we have a new theory. The code fix above that detects an idle Rx PHY due to an invalid 11n Rx should also unblock the Rx PHY before returning. We suspect your node is stuck in a continuous loop of polling the Rx PHY, finding it blocked, detecting the invalid Rx state (waiting for bytes but PHY idle), then returning.

The code below should fix this. This replaces the contents of the same while() in wlan_mac_dcf.c.

Code:

while(wlan_mac_get_last_byte_index() < MAC_HW_LASTBYTE_ADDR1){
  mac_hw_status = wlan_mac_get_status();
  if(((mac_hw_status & WLAN_MAC_STATUS_MASK_RX_PHY_ACTIVE) == 0) && (wlan_mac_get_last_byte_index() == 0)) {
      //Rx PHY is idle, but we're still waiting for bytes - bad MAC/PHY status, ignore Rx, unblock Rx PHY, and return
      wlan_mac_dcf_hw_rx_finish();
      wlan_mac_dcf_hw_unblock_rx_phy();
      return return_value;
   }
}

Offline

 

#4 2016-Feb-11 12:08:09

horace
Member
Registered: 2014-Jul-16
Posts: 61

Re: wlan_mac_dcf hung state problem

Hi, thanks for your helpful debug information. In short, I think the new while{} block works... It will now run for hours without hanging (must be more than just luck!).

The condition within the while happens extremely often. My environment must be filled with .11n because it regularly has to break out of that loop. For more debugging I've shown my technique and observations below.

Code in that while:

Code:

wlan_mac_set_wt_debug_3(1);
while(wlan_mac_get_last_byte_index() < MAC_HW_LASTBYTE_ADDR1){
   mac_hw_status = wlan_mac_get_status();
   if(((mac_hw_status & WLAN_MAC_STATUS_MASK_RX_PHY_ACTIVE) == 0) && (wlan_mac_get_last_byte_index() == 0)) {
      //Rx PHY is idle, but we're still waiting for bytes - bad MAC/PHY status, ignore Rx, unblock Rx PHY, and return
      wlan_mac_set_wt_debug_1(1);
      wlan_mac_dcf_hw_rx_finish();
      wlan_mac_dcf_hw_unblock_rx_phy();
      wlan_mac_set_wt_debug_1(0);
      return return_value;
   }
}
wlan_mac_set_wt_debug_3(0);

The wlan_mac_set_wt_debug_* instructions link to macros which toggle signals from a register in the mac hardware core. I can then observe these via a chipscope in the mac hardware.
wlan_mac_set_wt_debug_0(1) occurs at the entry to frame_receive() and wlan_mac_set_wt_debug_0(0) just before the end of frame_receive().

wlan_mac_set_wt_debug_2 is inside the wlan_mac_dcf_hw_rx_finish() function. Specifically:

Code:

do{
   wlan_mac_set_wt_debug_2(1);
   mac_hw_status = wlan_mac_get_status();
   wlan_mac_set_wt_debug_2(0);
} while(mac_hw_status & WLAN_MAC_STATUS_MASK_RX_PHY_ACTIVE);

Using chipscope I trigger off the rising edge of wlan_mac_set_wt_debug_1. As soon as I start the trigger the event happens (much like triggering off rx_start_ind). Implying these events occur very frequently.
Here is a chipscope trace for triggering off wlan_mac_set_wt_debug_1 (i.e. when it needs to return due to invalid 11n rx):
https://cdn.pbrd.co/images/1tMI6gPK.jpg

And here is a chipscope trace triggering off wlan_mac_set_wt_debug_3 on a good frame reception.
https://cdn.pbrd.co/images/1tMzXPo0.jpg

Chipscope names relate to registers in the mac hardware
reg_control is: 'Control' from register
reg_rx_params is: 'PHY_RX_PARAMS' to register
reg_rx_status is: 'Status' to register

If they appear to occur so frequently, why, before the code change, would it take up to 10-20 mins for it to freeze...?

Sorry for the massive amount of information.

Offline

 

#5 2016-Feb-11 12:25:15

horace
Member
Registered: 2014-Jul-16
Posts: 61

Re: wlan_mac_dcf hung state problem

Also, I wonder if there might be some kind of race condition causing a one cycle assertion of the medium_idle signal in the second image.
This chipscope signal originates from CS Idle in wlan_mac_hw > CCA with reg_CCA_Ignore_TxBusy and reg_CCA_Ignore_NAV set to 1.

Offline

 

#6 2016-Mar-18 13:23:27

ofka3419
Member
Registered: 2015-Nov-20
Posts: 13

Re: wlan_mac_dcf hung state problem

I also have this problem and this fix it. But why MAC_HW_LASTBYTE_ADDR1 == 13?

frame control field + duration/ID + address 1 = 2 + 2 + 6 = 10 bytes.

Where do extra three bytes come from?

Offline

 

#7 2016-Mar-20 11:00:22

chunter
Administrator
From: Mango Communications
Registered: 2006-Aug-24
Posts: 1067

Re: wlan_mac_dcf hung state problem

The MAC_HW_LASTBYTE_ADDR1 definition is being compared against a byte index, so MAC_HW_LASTBYTE_ADDR1 = 13 refers to the 14th byte of the MAC payload. You're right -- an ACK/CTS payload is only 10 bytes, but that doesn't account for the 4 bytes of FCS used for determining whether or not the packet is correctly decoded. Counting the FCS, an ACK/CTS is 14 bytes long.

Offline

 

#8 2016-Apr-12 03:19:03

ofka3419
Member
Registered: 2015-Nov-20
Posts: 13

Re: wlan_mac_dcf hung state problem

Thank you, but why add 4 fcs bytes? These occur at the end of mac payload.
By waiting until ==13 you add unnecessary delay (waiting for 4 bytes).
We only want to wait for address field to be recieved/decoded, not check for whole frame valid. Otherwise MAC_HW_LASTBYTE_ADDR1 should change for each frame type.
It is only ==13 for ACK/CTS. Other cases is longer.

Offline

 

#9 2016-Apr-12 09:44:25

chunter
Administrator
From: Mango Communications
Registered: 2006-Aug-24
Posts: 1067

Re: wlan_mac_dcf hung state problem

Sorry about that -- you're totally right. It doesn't make any sense to wait for 14 bytes before looking at addr1. You can safely change that #define to 10 bytes (MAC_HW_LASTBYTE_ADDR1 = 9). That code is ancient by the standards of the reference design. I think at one point we were conflating the goals of (a) waiting long enough to look at the first address and (b) waiting for at least the minimum packet length of an ACK before doing anything. The second goal (b) is not required any more.

I just made the change to MAC_HW_LASTBYTE_ADDR1 in the code we're actively developing for the upcoming 1.5 release. Thanks for pointing this out!

Offline

 

#10 2016-Apr-13 04:29:52

ofka3419
Member
Registered: 2015-Nov-20
Posts: 13

Re: wlan_mac_dcf hung state problem

Great, thanks

Offline

 

#11 2016-Apr-13 14:22:50

murphpo
Administrator
From: Mango Communications
Registered: 2006-Jul-03
Posts: 4622

Re: wlan_mac_dcf hung state problem

We tested this change in our latest design and observed a problem. Thankfully it's easy to fix.

There's a subtle race condition in the current DCF frame_receive() function (I'm linking to v1.5 code, before the fix). The code waits for MAC_HW_LASTBYTE_ADDR1 bytes, then checks the value of address_1. If address_1 indicates unicast_to_me, the code then creates an ACK frame whose address_1 is copied from the received packet's address_2 field. And that's the race- the code isn't waiting for the Rx PHY to write all of address_2 before copying rx.address_2 into ack.adddress_1. If the code copies the address too early, stale bytes from the Rx packet buffer will be copied into the ACK header, corrupting the ACK before it's ever transmitted.

When MAC_HW_LASTBYTE_ADDR1=13 the window in which this race is a problem is narrow (the time it takes the Rx PHY to write 2 bytes). I suspect the v1.4 code never "loses" this race. Changing MAC_HW_LASTBYTE_ADDR1=9 widens the window enough that the v1.5 code definitely loses the race for some rates/lengths.

The correct behavior is for the code to wait until MAC_HW_LASTBYTE_ADDR2 (15) bytes are received before accessing rx.address_2. We have tested this fix successfully and will incorporate it into v1.5. The code snippet below will also work in v1.4, if you want to use the smaller MAC_HW_LASTBYTE_ADDR1 value in your current design. This code goes immediately before "wlan_create_ack_frame()" in frame_receive().

Code:

        if((phy_details->length) >= MAC_HW_LASTBYTE_ADDR2){
        	// Wait until the PHY has written enough bytes so that the second address field can be processed
        	// If this is a short reception that does not have a second address, it is still possible to get
        	// to this line of code if there is an FCS error and the WLAN_IS_CTRL_FRAME check above fails.
        	// As such, we sanity check the length of the reception before getting into a potentially infinite
        	// loop.
        	while(wlan_mac_get_last_byte_index() < MAC_HW_LASTBYTE_ADDR2) {
			};
        }

Offline

 

Board footer