WARP Project Forums - Wireless Open-Access Research Platform

#1 2016-Jul-04 16:26:46

horace
Member
Registered: 2014-Jul-16
Posts: 63

Attempted to remove a NULL dl_entry

Hi,
We are having difficulty tracking down an issue with our experiment. The aim is simply to calculate the retransmission rate for multiple rates & frame lengths.
1) Associate with AP
2) Transmit n data frames to AP. Achieved using custom code in MAC High rather than LTG.
3) Using WLAN Exp Framework, grab the log. Confirm there are n 'TX' entries and count the number of 'TX_LOW' entries. Calculate retransmission rate from these two values.
4) Rapidly iterate through all rates and multiple frame lengths.

At some random point in the experiment the Exp Framework will report zero TX entries and 0 TX_LOW entries. The UART gives an error of: "ERROR: Attempted to remove a NULL dl_entry" found in wlan_mac_dl_list.c.

At first we thought we were fetching the log too soon after the transmission attempts. For example, transmit 100 frames to an AP that doesn't exist, so every frame is retried six times. If you collect the log immediately, it contains 100 'TX' and 100 'TX_LOW' entries, which is clearly wrong. If you time.sleep(0.5) first, the log contains 100 'TX' and 700 'TX_LOW' entries (one initial attempt plus six retries each), which is correct.
So we added delay prior to collecting logs.
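As a sanity check on the counts above, the expected number of TX_LOW entries follows directly from the frame count and the retry limit. The snippet below is a minimal sketch; the helper names are hypothetical (not part of wlan_exp), and it assumes every frame exhausts all of its retries, as happens when the AP does not exist.

```python
def expected_tx_low(n_frames, max_retries):
    """TX_LOW entries expected when every frame fails:
    one initial attempt plus max_retries retries per frame."""
    return n_frames * (1 + max_retries)


def log_is_complete(n_tx, n_tx_low, n_frames, max_retries):
    """True if the log holds every TX entry and every expected TX_LOW entry.
    Only valid under the all-frames-fail assumption stated above."""
    return (n_tx == n_frames
            and n_tx_low == expected_tx_low(n_frames, max_retries))


print(expected_tx_low(100, 6))            # 700
print(log_is_complete(100, 100, 100, 6))  # False: log fetched too early
print(log_is_complete(100, 700, 100, 6))  # True
```

A check like this could replace the fixed sleep with a poll-until-complete loop, but only in the test case where no frame is ever acknowledged.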

But the error still occurs, and the logs report zero entries at unpredictable times in the experiment, although a longer iteration time may increase the mean time to failure...

Attempting to restart the python experiment results in a 'Max retransmissions without reply from node' timeout error. Resetting the node via the switch sometimes clears the issue but often it will hang on boot at "DRAM SODIMM Detected". Only reprogramming will truly reset the device.

We have tried printing the value of queue_free.length (in wlan_mac_queue.c) to the UART whenever queue_checkout() is called. This value is normally constant unless high numbers of retries occur at high iteration speed, in which case queue_free.length drops to zero, although this does not necessarily trigger the error.
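To make the free-pool observation concrete, here is a toy model in Python. It is not the reference design code: the class, its methods, and the assumption that a checkout from an empty pool yields no entry are all illustrative. It records a low-water mark instead of printing on every checkout, which is one way to watch for exhaustion without adding UART traffic to the timing-sensitive path.

```python
from collections import deque


class FreePool:
    """Toy stand-in for a free list of queue entries (hypothetical model)."""

    def __init__(self, size):
        self._free = deque(range(size))
        self.min_length = size  # lowest free count seen so far

    @property
    def length(self):
        return len(self._free)

    def checkout(self):
        """Return an entry id, or None when the pool is exhausted."""
        if not self._free:
            return None
        entry = self._free.popleft()
        self.min_length = min(self.min_length, len(self._free))
        return entry

    def checkin(self, entry):
        """Return an entry to the pool; a None entry is a caller bug."""
        if entry is None:
            raise ValueError("attempted to check in a None entry")
        self._free.append(entry)


pool = FreePool(2)
first, second, third = pool.checkout(), pool.checkout(), pool.checkout()
print(third, pool.min_length)  # None 0
```

In this model, any caller that passes the result of a failed checkout on to a later list operation without checking it would hit the equivalent of a NULL-entry error.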

Are you able to give some pointers for debugging this? We're not really sure where to start.

#2 2016-Jul-05 09:45:41

chunter
Administrator
From: Mango Communications
Registered: 2006-Aug-24
Posts: 1212

Re: Attempted to remove a NULL dl_entry

That error is printed by "dl_entry_remove()", which is called in many places throughout the design, so you'll have to track down exactly which call it is. Although it's a little arcane to use, the SDK's debugger can be useful for this: you can set a breakpoint on the line that prints that error and then check the stack trace.

A couple of follow-up questions:

1. Is the design you are using modified in any way from an official release?
2. Is the error definitely connected to using wlan_exp to retrieve logs? If you never grab the log, does the error ever occur?

It sounds like some pointer isn't being handled properly. This would be consistent with sometimes not being able to issue a software reset, since the code may have overwritten part of itself.

#3 2016-Jul-05 19:42:23

murphpo
Administrator
From: Mango Communications
Registered: 2006-Jul-03
Posts: 5159

Re: Attempted to remove a NULL dl_entry

Chris has good debug questions above. The stack trace would be especially helpful for isolating which dl_list is encountering the error.

A few other things that may help simplify your experiment:

3) Using WLAN Exp Framework, grab the log. Confirm there are n 'TX' entries and count the number of 'TX_LOW' entries. Calculate retransmission rate from these two values.

The TX_HIGH entry (new name for TX entries in v1.5) has a 'num_tx' field that the DCF populates with the total number of transmissions it made for that MPDU. If your experiment would benefit from a tighter Tx/log-fetch/Tx loop, you could disable creation of TX_LOW entries to reduce the log size and use the TX_HIGH.num_tx value instead of counting TX_LOW's. Both methods have some subtleties when an MPDU uses the RTS/CTS handshake. I think an RTS Tx will increment TX_HIGH.num_tx but will be logged as an RTS TX_LOW, not a DATA TX_LOW.
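A minimal sketch of the num_tx approach, using a plain list of integers as a stand-in for the TX_HIGH 'num_tx' column. The function name is hypothetical, and "retransmission rate" is assumed here to mean the fraction of all transmissions that were retransmissions; it does not account for the RTS/CTS subtlety mentioned above.

```python
def retransmission_stats(num_tx_values):
    """Summarize per-MPDU transmission counts.

    num_tx_values: iterable of 'num_tx' values, one per MPDU.
    Returns (frames, total_tx, retransmissions, rate).
    """
    values = [int(v) for v in num_tx_values]
    frames = len(values)
    total_tx = sum(values)
    # Every MPDU has one first attempt; the remainder are retransmissions.
    retransmissions = total_tx - frames
    rate = retransmissions / total_tx if total_tx else 0.0
    return frames, total_tx, retransmissions, rate


frames, total_tx, retx, rate = retransmission_stats([1, 1, 3, 2])
print(frames, total_tx, retx)  # 4 7 3
```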

2) Transmit n data frames to AP. Achieved using custom code in MAC High rather than LTG.
4) Rapidly iterate through all rates and multiple frame lengths.

If you added a lot of code to create traffic with varying rates/lengths, you might consider creating a custom LTG payload type instead. This could reduce the tedium of porting your code to new ref design releases, and would take advantage of the MAC framework's existing bookkeeping for LTGs, queues, etc. It would also automatically flag your packets as LTG-related, inserting the uniq_seq and flow_id values in the payload so the corresponding RX_OFDM_LTG log entries can be extracted.

The LTG framework is designed to accommodate custom traffic profiles. It sounds like you need a traffic source that increments rates and lengths. This could be encoded as a new LTG payload type, with the actual packet creation in the ltg_event() function in the top-level MAC application. If you want to control the new LTG type from wlan_exp, you would need to add a case to ltg_sched_deserialize(), then mimic one of the Payload sub-classes in ltg.py.

#4 2016-Jul-08 06:35:00

horace
Member
Registered: 2014-Jul-16
Posts: 63

Re: Attempted to remove a NULL dl_entry

Thanks both very much for the replies. We have been debugging further and trying to create a repeatable error, but it is very intermittent.

chunter.1 - Yes, it was a modified v1.4, with no hw changes and only MAC High sw changes. We added additional wlan exp commands to call other functions which generate traffic. As far as possible these functions used the framework for generating data frames - queue_checkout(), wlan_create_data_frame(), enqueue_after_tail() etc. In fact we copied the LTG code to do this (but with a custom payload). The reason we did not use the LTG was an initial requirement for MPDU frames < 20 bytes; the LTG imposes a 20-byte minimum.

murphpo - Yes, in fact 'num_tx' is how we generate retransmission statistics rather than via counting TX_LOW entries.


Thanks for the LTG suggestions.
To further debug this we created a new scenario using a clean v1.4 STA and the LTG via Python wlan exp. The issue is now not related to dl_entry_remove() but simply a sudden failure to transmit frames. The Python script is roughly:

1) frame_len++
2) node.reset(log=True, txrx_counts=True, ltg=True, queue_data=True)
3) ltg_id = node.ltg_configure(wlan_exp_ltg.FlowConfigCBR(dest_addr=BSSID, payload_length=frame_len,  interval=64e-6, duration=10*64e-6), auto_start=True)
4) time.sleep(enough for the LTG to finish & logs created)
5) node.ltg_remove_all()
6) node.queue_tx_data_purge_all()
7) Get logs 'TX_LTG' and filter with:
-- log_tx = log_np['TX_LTG']
-- tx_addrs_1 = log_tx[log_tx['addr1'] == BSSID]
-- tx_frames = len(tx_addrs_1)
8) Count transmissions (tx_frames) and retransmissions np.sum(tx_addrs_1['num_tx']), repeat via 1).

(Hope this makes sense)
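Steps 7 and 8 can be sketched without a node by using plain dictionaries as stand-ins for log entries. Everything below is illustrative: the BSSID value is a placeholder, the function name is hypothetical, and the real script filters the log_np arrays shown above rather than a list.

```python
# Placeholder address for the AP; the real script uses its own BSSID value.
BSSID = 0x40D855042001


def summarize_tx(entries, bssid):
    """Return (tx_frames, total_tx, retransmissions) for entries sent to bssid."""
    to_ap = [e for e in entries if e["addr1"] == bssid]
    tx_frames = len(to_ap)
    total_tx = sum(e["num_tx"] for e in to_ap)
    return tx_frames, total_tx, total_tx - tx_frames


# Three frames to the AP (sent once, twice, and three times) plus one broadcast.
sample = [
    {"addr1": BSSID, "num_tx": 1},
    {"addr1": BSSID, "num_tx": 2},
    {"addr1": 0xFFFFFFFFFFFF, "num_tx": 1},
    {"addr1": BSSID, "num_tx": 3},
]
print(summarize_tx(sample, BSSID))  # (3, 6, 3)
```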

The script will run as expected then at some point, tx_frames will begin to report zero frame transmissions (& retransmissions). The script will continue. We believe the log is being correctly gathered but the frames are simply not being transmitted.
Restarting the script immediately results in num_tx=0.
CPU Low appears to continue to run (based on the green LEDs incrementing on AP beacon frames).
Restarting the board results in a hang after 'DRAM SODIMM Detected', which seems to support your idea about a mishandled pointer.

We can debug CPU High but are not sure where to set breakpoints.
Could some queue/frame count be overflowing?

Will try another board...

#5 2016-Jul-08 11:54:53

chunter
Administrator
From: Mango Communications
Registered: 2006-Aug-24
Posts: 1212

Re: Attempted to remove a NULL dl_entry

I know intermittent bugs can be immensely frustrating -- sorry to hear that.

After the board stops transmitting, but before you press reset, are you able to interact with the board via UART (e.g. hit the "esc" key and see the main menu)? Or does the board seem to have crashed and become unresponsive? Does it print any warnings or errors before it starts misbehaving?

About how many iterations does the node get through before it stops behaving?
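One way to answer the iteration-count question is to have the script note the first iteration whose log shows no transmissions and stop there, leaving the board in its failed state for inspection over UART. A minimal sketch, assuming a hypothetical list of per-iteration frame counts collected by the experiment script:

```python
def first_failure(tx_frames_per_iteration):
    """Return the 0-based index of the first iteration that logged
    zero transmitted frames, or None if every iteration transmitted."""
    for index, tx_frames in enumerate(tx_frames_per_iteration):
        if tx_frames == 0:
            return index
    return None


# Example: the node transmits for three iterations, then goes silent.
print(first_failure([10, 10, 10, 0, 0]))  # 3
```

Recording this index across several runs would show whether the failure point is roughly constant, which might hint at a counter or pool running out, or fully random.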
