WARP Project Forums - Wireless Open-Access Research Platform

Christian · 2015-Nov-03 08:12:54

Hi,

As a follow-up to http://warpproject.org/forums/viewtopic.php?id=2795 , I have some questions. We created a "stripped-down" version of the 802.11 design NoMAC code (v1.3.0). However, it has the following problem: after some time CPU Low gets reset (we suspect that the MDM core issues an MB_debug_rst), which then drives the radio controller into a locked state (status register reads 0x00003070). This happens whether we flash the boards from SD cards or over a JTAG cable. We could identify that if CPU high gets bootloop assigned, the problem of a reset on CPU low is gone.

>> How can we locate and fix/prevent the reset?

Attached you will find the stripped-down software and a compiled bitstream to check, if required. The hardware is almost unchanged, except for: an additional BRAM and AXI timer core for CPU Low, Tx buffers increased from 8 to 16, a connection between Ethernet and the Tx buffers, and default drivers in the BSPs.

Thanks in advance
Christian

Last edited by Christian (2015-Nov-03 08:14:59)

Christian · 2015-Nov-03 11:27:27

The MDM core is not the cause. We just synthesized a version that also shows this behavior.

welsh · 2015-Nov-03 11:53:05

From playing around with the design a bit and looking at the code, here are a couple of thoughts:

1) You are currently not copying the wlan_mac_addr from the EEPROM. You can see the line in the reference design here. This will affect receive processing since the node believes its address is all zeros (can be seen in the CPU high print).

2) Based on your comment: "We could identify that if CPU high gets bootloop assigned, the problem of a reset on CPU low is gone.", it looks like the problem might be in the IPC messaging between CPU High and CPU Low. Also, if you look at the timing of when CPU Low "crashes and reboots", it occurs right after CPU high prints:

Code:

  Waiting for Ethernet link ...   Initialization Successful

That print is part of the Ethernet initialization (in wlan_eth_init() in wlan_mac_eth_util.c). This is one of the last items executed in wlan_mac_high_init(). So, looking at what happens after that in the CPU High boot, we can see that:
- The scheduler is initialized
- There are a number of IPC messages down to CPU Low
- Interrupts are enabled

There is not enough information in the zip file to re-build the download.bit, so it is difficult for us to isolate the problem further. I would suggest that you add some print statements to CPU Low boot to debug what is going on.

3) Looking at the default printing in the v1.3.0 NOMAC reference design, you can see that CPU low prints:

Code:

----- Mango 802.11 Reference Design -----
----- v1.3 ------------------------------
----- wlan_mac_nomac --------------------
Compiled Aug 25 2015 15:38:45

Note: this UART is currently printing from CPU_LOW. To view prints from
and interact with CPU_HIGH, raise the right-most User I/O DIP switch bit.
This switch can be toggled live while the design is running.

AD Readback: 0x0000002B
Initialization Finished
Searching for empty packet buff 1
Searching for empty packet buff 2
Searching for empty packet buff 3
Searching for empty packet buff 4
Searching for empty packet buff 5
Searching for empty packet buff 6
Searching for empty packet buff 7
...

However, we don't see the "Searching for empty packet buff ..." print from CPU Low, even though it still exists within the code and should be called (at least based on some visual inspection of the code). Therefore, there might be an issue in your polling loop in CPU Low. Again, debug print statements will help narrow down the problem.

4) Based on the information above, my guess is that the reset itself is caused by some kind of exception that occurs in CPU Low during the boot sequence. By default, CPU Low enables exceptions and uses the default exception handler provided by the Standalone library. In the documentation of the Standalone Library and the Microblaze processor, it doesn't indicate that the default exception handler can cause a reset, but that might not be reality. You can always add your own exception handlers to get more information about exceptions that occur.
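The debug prints suggested in 2) and 3) could be added as numbered checkpoints, so that the last line printed before the reset shows how far the boot got, and a repeated first checkpoint shows that the CPU restarted. The following is only a minimal sketch: BOOT_CHECKPOINT, DBG_PRINT, and example_boot_sequence are hypothetical names that are not part of the reference design, and printf is used so that the sketch also builds on a host PC (on the board, DBG_PRINT would presumably be mapped to xil_printf).

```c
#include <stdio.h>

/* Assumption: on the board this would be mapped to xil_printf.
 * printf is used here so the sketch also builds on a host PC. */
#ifndef DBG_PRINT
#define DBG_PRINT printf
#endif

/* Hypothetical counter: number of checkpoints reached since boot. */
static unsigned int boot_checkpoint_count = 0;

/* Hypothetical helper: prints a running index plus file and line. */
#define BOOT_CHECKPOINT(msg)                                        \
    do {                                                            \
        boot_checkpoint_count++;                                    \
        DBG_PRINT("[boot %u] %s:%d %s\r\n", boot_checkpoint_count,  \
                  __FILE__, __LINE__, (msg));                       \
    } while (0)

/* Stand-in for a CPU Low boot sequence; the step names are placeholders. */
unsigned int example_boot_sequence(void)
{
    BOOT_CHECKPOINT("entered main");
    /* ... hardware initialization would go here ... */
    BOOT_CHECKPOINT("hardware init done");
    /* ... IPC setup would go here ... */
    BOOT_CHECKPOINT("entering polling loop");
    return boot_checkpoint_count;
}
```

Placing one checkpoint before and one after each suspicious call narrows the fault down to that call.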

welsh · 2015-Nov-03 12:01:48

One quick addition:

Item 4) above is not very likely, and you will get better results using print statements to debug the boot sequence.

I would also add that executing a NULL pointer can look like a reset since your reset vector is at address 0x00000000 (you can see this in the executable.map file in the Debug folder). So, there could be some pointer issue in which you execute a NULL pointer and it effectively resets the CPU.
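One way to make such a pointer issue visible is to route every callback invocation through a small wrapper that checks the pointer first. This is only a sketch under assumptions: rx_callback_t, call_rx_callback, and example_rx_handler are hypothetical names with an assumed signature, not code from the reference design, and printf stands in for xil_printf.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback type; the real signatures in the design may differ. */
typedef int (*rx_callback_t)(void *pkt, unsigned int len);

/* Calls cb if it is set. Returns -1 and prints a message if cb is NULL,
 * instead of jumping to address 0 (the reset vector). */
int call_rx_callback(rx_callback_t cb, void *pkt, unsigned int len)
{
    if (cb == NULL) {
        printf("ERROR: NULL callback, call skipped\r\n");
        return -1;
    }
    return cb(pkt, len);
}

/* Example handler used only to exercise the wrapper. */
int example_rx_handler(void *pkt, unsigned int len)
{
    (void)pkt;
    return (int)len;
}
```

With this wrapper, call_rx_callback(NULL, NULL, 0) returns -1 and prints an error rather than silently restarting the CPU.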

Edlmann · 2015-Nov-03 13:02:05

Thanks for your help on the issue welsh.

1.) We're aware of that, in the non-stripped down version we substitute our own MAC-addr.

2.) In our setup, the timing of the Ethernet initialization print and the reboot on CPU_LOW don't correlate. Sometimes CPU_LOW resets after a few seconds, sometimes it takes up to a minute. By then, both CPU_HIGH and CPU_LOW are already in their respective polling loops.

3.) Not sure at this point, but I think we disabled that printout for our design, or it gets stripped out when the DEBUG compiler switch is not set.

Edit: It isn't disabled. However, in the NOMAC it was called once a packet had been passed to CPU_HIGH, which only happens after a packet has been received. We're currently using a relatively clean channel, where no packets are transmitted by other systems. Our system does not transmit any packets in the stripped down form.

4.) We have now added exception handlers for all exception types (0-7) to both CPU_HIGH and CPU_LOW. Each handler simply prints a message when its exception is triggered. We used the following snippet for every exception:

Code:

void IsDataBusEx(void* data) {
    xil_printf("IsDataBusEx\n");
}

int main() {
    /* snip */
    xil_printf("Registering handlers...\n");
    microblaze_enable_exceptions();
    /* Exception ID 4 is handled by IsDataBusEx */
    microblaze_register_exception_handler(4, &IsDataBusEx, NULL);
    /* snip other 7 handlers */
}

None of the registered handlers get triggered. However, if we add a forced unaligned access using

Code:

xil_printf("Forcing unaligned access\n");
xil_printf("%x\n", *((u32*)(0x000041)));

The handler gets triggered as expected.

As to your addition: Is there any way to change the reset vector to debug this possibility further? Or are printouts the only thing that could help us here?

2nd Edit: Scratch everything before. It was indeed a null-pointer that got called under certain circumstances. Two custom Rx-callbacks were not initialized correctly. In the non stripped down version, they were set upon reception of a special ETH-packet, leading to the system starting up correctly sometimes. If however any wireless RX was executed before this ETH-packet was received, the uninitialized callbacks would be executed. Thank you very much for the tip with the Null pointer welsh, turned out to be the solution.

Last edited by Edlmann (2015-Nov-03 13:33:54)

Christian · 2015-Nov-03 13:40:44

Thank you also from my side!

PS: I hate pointers ;)

welsh · 2015-Nov-03 13:44:01

I'm not exactly sure how to change the reset vector, but I'm sure it is possible.

One thing you could try is using the SDK debugger. I would think that you could set a breakpoint at 0x00000000 and when you hit that, then look at the CPU registers and stack to see if you can figure out where you came from.

One other thing you could do is to go through all your callback functions (i.e. anything that is a function pointer) and make sure that it is initialized to a null callback (i.e. a function that completes successfully). By default, any declared variables in BRAM that are not initialized get a value of zero when the bitstream is downloaded. If you happen to call an uninitialized callback, then you would be executing a NULL pointer. I know this is probably an artifact of the cleaning of the code you posted but the goodPayload_callback and badPayload_callback function pointers are never initialized in wlan_mac_clean.c. So, something like that could be the problem.
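The null-callback idea could look roughly like the sketch below. The names are hypothetical and only loosely modeled on goodPayload_callback and badPayload_callback. The signature is an assumption, since the real one is not shown in this thread.

```c
#include <stddef.h>

/* Assumed callback signature, for illustration only. */
typedef int (*payload_callback_t)(void *payload, unsigned int length);

/* Null callback: does nothing and reports success. */
int null_payload_callback(void *payload, unsigned int length)
{
    (void)payload;
    (void)length;
    return 0;
}

typedef struct {
    payload_callback_t good_payload;
    payload_callback_t bad_payload;
} rx_callbacks_t;

/* Initialized at definition, so the pointers are never zero at boot. */
static rx_callbacks_t rx_callbacks = {
    null_payload_callback,
    null_payload_callback
};

/* The setter falls back to the null callback if NULL is passed in. */
void set_good_payload_callback(payload_callback_t cb)
{
    rx_callbacks.good_payload = (cb != NULL) ? cb : null_payload_callback;
}

int dispatch_good_payload(void *payload, unsigned int length)
{
    return rx_callbacks.good_payload(payload, length);
}

int dispatch_bad_payload(void *payload, unsigned int length)
{
    return rx_callbacks.bad_payload(payload, length);
}

/* Example handler used only to exercise the setter. */
int example_good_handler(void *payload, unsigned int length)
{
    (void)payload;
    return (int)length;
}
```

With this pattern, a wireless Rx that arrives before the real handlers are registered runs the null callback and returns 0 instead of jumping to address 0.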

Christian · 2015-Nov-03 13:46:55

This was exactly the problem (see Edlmann's edit). Your last comment above gave us an indication, and it exactly matches the observed behavior of not being 100% reproducible. Thank you so much.

Last edited by Christian (2015-Nov-03 13:47:59)

WARPv3 Microblaze Reboot
