Balluff - BVS CA-GX0 / BVS CA-GX2 Technical Documentation
Data Is Lost / Images Are Returned Incompletely

Symptoms

While acquiring data from the device, image data is lost: for example, the counter for incomplete or lost images keeps increasing, or request objects are returned to the application with errors.

Cause

There are many potential causes for this, so neither a single cause nor a single resolution can be given:

  • Switches, routers or cables in between the device and the NIC of the host system might simply not be able to cope with the amount of data coming from the device.
  • The NIC might not be able to get rid of the incoming data fast enough.
  • The underlying driver might not be fast enough to send the captured data up to the application.
  • The application might not process the received data fast enough, eventually blocking all the memory that could be used for capturing data, so that the driver must drop incoming data.

Resolution

General

General information about potential resolutions for network related transmission problems that are not specific to the Balluff Multi-Core Acquisition Optimizer can be found here.

Multi-Core Acquisition Optimizer Specific

The Balluff Multi-Core Acquisition Optimizer makes heavy use of the RSS (Receive Side Scaling) features offered by NICs. The combination of the properties mvMultiCoreAcquisitionCoreCount and mvMultiCoreAcquisitionCoreSwitchInterval is especially important here. Internally, these two properties configure a connected device so that, after every mvMultiCoreAcquisitionCoreSwitchInterval network packets sent out by the device, the following packets are modified in a way that causes them to be processed by another CPU core when received. The algorithm uses only mvMultiCoreAcquisitionCoreCount different CPU cores.
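The effect of the two properties can be illustrated with a small model. The following is only a sketch: it assumes that the cores are visited in a simple round-robin order, which the text above does not state, and the function name core_for_packet is hypothetical and not part of any Balluff API.

```python
def core_for_packet(packet_index: int, core_count: int, switch_interval: int) -> int:
    """Return the 0-based index of the core that would handle a packet.

    Assumption (not confirmed by the documentation): the target core changes
    every `switch_interval` packets and the cores are used round-robin.
    """
    if core_count < 1 or switch_interval < 1:
        raise ValueError("core_count and switch_interval must be >= 1")
    # Integer division groups packets into blocks of `switch_interval`;
    # the modulo wraps the block number onto the available cores.
    return (packet_index // switch_interval) % core_count


if __name__ == "__main__":
    # 2 cores, switching every 4 packets: packets 0-3 -> core 0, 4-7 -> core 1, ...
    print([core_for_packet(i, core_count=2, switch_interval=4) for i in range(12)])
```

With mvMultiCoreAcquisitionCoreCount set to 1, every packet maps to the same core in this model, which corresponds to the single dedicated core case.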

Note
This applies to the NIC configuration described here: If possible, do not configure more RSS queues than there are physical processors available! While this should work, go with a smaller value if something doesn't behave as expected, at least to rule out this potential source of trouble. Keep in mind the potential impact of the RSS Base Processor as well!

Be aware of the impact of using multiple CPUs for receiving data! If possible, always favor using a single, dedicated CPU core for processing the data stream of a single camera while moving the load of the application away from this core. When dealing with multiple cameras, try to configure the system so that each camera has its own dedicated core that receives its network data, while the application uses the remaining cores. This results in the best performance, and in that case the parameter mvMultiCoreAcquisitionCoreSwitchInterval is not important, since CPU switching for an individual network stream is not necessary.

If this is not possible for whatever reason, try to find a good balance between the number of CPU cores used and the CPU core switch interval! As described here, a network stream is usually bound to a specific CPU anyway! This is done for performance reasons, as switching from one core to another consumes additional CPU cycles and is therefore avoided by the operating system. So, regarding this troubleshooting section, be aware that the smaller the value selected for the property mvMultiCoreAcquisitionCoreSwitchInterval, the higher the additional overhead becomes. Extensive testing showed that values of around 64 to 128 are a good compromise between improved reliability and introduced overhead, which then becomes almost undetectable.
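To get a feeling for how the switch interval relates to the number of core switches, the count per image can be estimated with simple arithmetic. This is a rough sketch only: the frame size and per-packet payload below are example values, the helper core_switches_per_frame is hypothetical, and the model counts switches without measuring what each one actually costs.

```python
def core_switches_per_frame(frame_bytes: int, payload_bytes_per_packet: int,
                            switch_interval: int) -> tuple[int, int]:
    """Return (packets per frame, core switches within one frame).

    Assumes a frame starts right after a core switch and that every packet
    except the last one carries a full payload.
    """
    # Ceiling division without floating point.
    packets = -(-frame_bytes // payload_bytes_per_packet)
    # A switch happens before packet k * switch_interval for k = 1, 2, ...
    switches = (packets - 1) // switch_interval
    return packets, switches


if __name__ == "__main__":
    frame = 1920 * 1080   # example: 8-bit monochrome image
    payload = 8000        # example payload per packet (jumbo frames assumed)
    for interval in (1, 64, 128):
        print(interval, core_switches_per_frame(frame, payload, interval))
```

In this example an interval of 1 causes a core switch before almost every packet, while intervals in the recommended range of 64 to 128 reduce this to a handful of switches per image.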

Switching the CPU core every now and then in combination with parallel processors, however, introduces another aspect to the incoming traffic that is usually only encountered with LAG (Link Aggregation) configurations: it is no longer guaranteed that the driver receives the network packets in order, so a driver needs to be aware of this!
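One common way for a receiver to tolerate out-of-order arrival is to place each packet's payload at the position derived from its packet ID instead of appending it in arrival order. The following minimal sketch illustrates that general technique; it is not the implementation used by the Balluff driver, and the class FrameAssembler as well as the fixed payload size per packet are assumptions made for illustration.

```python
class FrameAssembler:
    """Reassemble one frame from packets that may arrive in any order."""

    def __init__(self, packet_count: int, payload_size: int):
        self.payload_size = payload_size
        self.buffer = bytearray(packet_count * payload_size)
        self.missing = set(range(packet_count))  # packet IDs not yet received

    def add(self, packet_id: int, payload: bytes) -> None:
        if packet_id not in self.missing:
            return  # duplicate or unknown packet ID: ignore
        if len(payload) > self.payload_size:
            raise ValueError("payload larger than the configured payload size")
        # The packet ID, not the arrival order, decides where the data goes.
        start = packet_id * self.payload_size
        self.buffer[start:start + len(payload)] = payload
        self.missing.discard(packet_id)

    def complete(self) -> bool:
        return not self.missing


if __name__ == "__main__":
    assembler = FrameAssembler(packet_count=4, payload_size=2)
    # Packets arrive out of order, e.g. because two cores served their queues
    # in an undefined order.
    for packet_id, payload in [(2, b"cc"), (0, b"aa"), (3, b"dd"), (1, b"bb")]:
        assembler.add(packet_id, payload)
    print(assembler.complete(), bytes(assembler.buffer))
```

Packets that are still listed as missing once the frame is considered finished are what would surface as an incomplete image.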

Also, selecting a lot of CPU cores for processing together with a large switch interval (256 or greater) will reduce the overhead to a minimum, but might cause the packets to arrive far out of order, since several RSS queues might raise interrupts and the order in which these are served is not defined.

Using a very small value for mvMultiCoreAcquisitionCoreSwitchInterval in combination with an interrupt moderation scheme that results in a very low number of interrupts might also have a negative effect, as all processors will then wait until their queues are almost full before signaling an interrupt. With a small switch interval, this will likely happen when all queues are almost full at the same time, and certain queues might then overrun since every core wants to process data at the same time. Another aspect is that the value of mvMultiCoreAcquisitionCoreSwitchInterval should always be smaller than the number of receive descriptors available per RSS queue (the number of receive descriptors allocated by the NIC divided by the number of RSS queues), since otherwise the card might actually run out of receive descriptors before getting rid of its data.

Note
A receive descriptor typically carries 2 KB, while the actual network packets might be larger, so a single network packet might consume more than one receive descriptor! This is another argument for not selecting too large a value for mvMultiCoreAcquisitionCoreSwitchInterval.
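The descriptor budget can be checked with a short calculation. The sketch below rests on several assumptions that are not taken from this documentation: the receive descriptors are split evenly across the RSS queues, a descriptor holds 2048 bytes, and in the worst case a queue is not drained at all while it receives one full burst of mvMultiCoreAcquisitionCoreSwitchInterval packets. The function names are hypothetical.

```python
def descriptors_per_packet(packet_bytes: int, descriptor_bytes: int = 2048) -> int:
    """Number of receive descriptors one network packet occupies (ceiling division)."""
    return -(-packet_bytes // descriptor_bytes)


def burst_fits_queue(switch_interval: int, packet_bytes: int,
                     total_descriptors: int, rss_queues: int,
                     descriptor_bytes: int = 2048) -> tuple[int, int, bool]:
    """Return (descriptors needed per burst, descriptors per queue, fits?)."""
    per_queue = total_descriptors // rss_queues
    needed = switch_interval * descriptors_per_packet(packet_bytes, descriptor_bytes)
    return needed, per_queue, needed <= per_queue


if __name__ == "__main__":
    # Example values: 2048 descriptors in total, 4 RSS queues, 8228-byte packets.
    for interval in (64, 128):
        print(interval, burst_fits_queue(interval, 8228, 2048, 4))
```

In this example an 8228-byte packet occupies 5 descriptors, so a burst of 64 packets needs 320 of the 512 descriptors per queue, while a burst of 128 packets would need 640 and could exhaust the queue.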