SLink Studies in the TileCal ROD Environment

General Description
Motorola MVME2604. LynxOS-3.0
Pentium II PC. Linux-2.2.12
RIO2 8061. LynxOS-2.5.1
Slink Implementations

ATLAS TileCal Internal Note

Code Production Index

General Description

Two SLink cards have been tested at the Valencia ROD Lab. using different SLink hardware devices and hosts. On the source side, a PCI to SLink board is plugged into a PC Pentium II 266 MHz running a Linux 2.2.12-20 kernel. On the receiver side, a Motorola MVME2604 VME card under LynxOS 3.0 and a RIO2-8061 under LynxOS 2.5.1 can host the SLink to PMC card. The main goal is to perform a set of tests to characterize the bandwidth figures of the current commercial SLink hardware devices. In the SLink community, some public software packages can be found for different platforms. However, a custom "ROD" SLink set of libraries and drivers has been implemented on both sides because of the specific requirements of the project. SLink PCI devices are built on top of the AMCC S5933 PCI chipset. The physical SLink control/status lines are seen through the incoming/outgoing mailboxes. Further information related to the hardware can be found at http://www.cern.ch/HSI/s-link. Some common issues have driven the software design:

The latency of transfers MUST be as low as possible. System calls are only allowed when the SLink devices need a set-up. Polling is used to read/write data from/to the S5933 FIFOs. Previous studies show that an ISR implies too heavy a system overload in high rate DAQ systems. As a fallback, the released drivers also map the S5933 chip internally in case a privileged intervention is ever required.

The PCI S5933 chipset MUST be mapped in user space. Current SLink software packages show that a user space SLink library can provide the same functionality as a driver library version.

The data flow between the S5933 internal FIFOs and the system RAM MUST be performed using the S5933 DMA capabilities. The CPU is only used to move data from/to the FIFOs when the maximum bandwidth performance is NOT required.

The RAM buffers for SLink transactions MUST also be mapped in user space. Therefore, no kernel-to-user or user-to-kernel copies are required to access the DMA data buffers.

The ROD management of physical devices such as RAM memory and PCI chipsets MUST coexist with the internal kernel management of the devices not related to ROD.

From the point of view of SLink programmers and SLink users, a similar set of functions and data structures has been implemented/used in both systems (Motorola MVME2604 and RIO2 under LynxOS, PC under Linux).
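As an illustration, a minimal sketch of what this common interface could look like is given below. Only SLink_DMA_Read, the DMA direction flag PCItoPHY_M and the SUCCESS return code appear in the timing code later in this note; the remaining names and signatures are hypothetical.

  /* Illustrative sketch of the common SLink user-level interface shared by
     the LynxOS and Linux libraries.  Only SLink_DMA_Read(), PCItoPHY_M and
     SUCCESS are taken from this note; the other names are hypothetical.    */

  typedef struct SLinkDev SLinkDev;      /* opaque handle to one SLink card  */

  #define SUCCESS     0
  #define PCItoPHY_M  1                  /* DMA direction: S5933 FIFO -> RAM */

  SLinkDev *SLink_Open ( const char *device_name );     /* map S5933 and RAM */
  int       SLink_Reset( SLinkDev *dev );                /* reset link/FIFOs  */
  int       SLink_DMA_Read ( SLinkDev *dev, char *buffer,
                             int size, int direction );  /* blocking DMA read */
  int       SLink_Close( SLinkDev *dev );                /* unmap and release */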


Motorola MVME2604. LynxOS-3.0

The standard LynxOS-3.0 distribution does not provide a PCI driver or user library package on which a PCI application can be built. It is up to the user to build the necessary software for a PCI application. A generic PCI driver has been implemented providing minimum PCI services. At start-up, the PCI driver scans the current LynxOS kernel PCI device attachments over the physical PCI bus. It manages an internal database following this rule: if the device is already attached (meaning that another driver is already linked to the device), only READ actions over the PCI Configuration Space are allowed. Otherwise, the hardware can be completely managed by the driver. From the user point of view, this means that the device can be mapped into the PCI (I/O, memory) and kernel spaces.

The kernel-user I/O is performed using the typical driver calls "ioctl", "open" and "close". The "read" and "write" calls operate over the complete PCI configuration space of the device (256 bytes), which is physically inaccessible through memory mapping. The user procedure must first "attach" the particular device; it then performs some "read"/"write" ioctl calls to configure the physical chipset and "allocs" the device size in the PCI driver. The "alloc" ioctl call returns the physical address of the PCI chipset. This parameter is then used to map the chipset into the user context, using the typical LynxOS call smem_create.
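A minimal usage sketch of this procedure is shown below, assuming a /dev/pci node and illustrative ioctl command codes; only the open / ioctl("alloc") / smem_create sequence itself is taken from the description above.

  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <smem.h>                     /* LynxOS smem_create()              */

  /* Illustrative command codes; the real ones are defined by the PCI driver */
  #define PCI_ATTACH 1                  /* attach to a given PCI device      */
  #define PCI_ALLOC  2                  /* reserve it and get its phys. addr */

  int main( void )
  {
    int                     fd;
    unsigned long           phys_addr;  /* returned by the "alloc" ioctl     */
    volatile unsigned long *s5933;      /* user-space view of the S5933 regs */

    fd = open( "/dev/pci", O_RDWR );    /* generic PCI driver node (assumed) */
    if( fd < 0 ) return -1;

    ioctl( fd, PCI_ATTACH, 0 );         /* select the SLink S5933 device     */
    ioctl( fd, PCI_ALLOC, &phys_addr ); /* driver returns the chipset address*/

    /* Map the S5933 operation registers into the user context.             */
    s5933 = ( volatile unsigned long * )
            smem_create( "S5933", ( char * )phys_addr, 0x1000,
                         SM_READ | SM_WRITE );

    /* ... configure mailboxes and DMA registers by plain pointer accesses, */
    /*     then clean up (smem_remove, close) when done ...                 */

    close( fd );
    return 0;
  }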

The RAM device has been managed using the original RIO2 driver uiodrvr provided by CES; the same strategy has been followed with the uiolib user library provided by the DAQ-1 project.

The SLink libraries have been written on top of these two drivers. Together they address the issues pointed out at the beginning. A complete set of routines performs the typical operations over the user software object, such as open, close, reset or DMA reads. A software quality check has been carried out by building a user application, SLIDAStest. SLIDAStest tries to estimate the performance of the SLink to PMC board when moving data between the SLIDAS and the physical memory. The SLIDAS emulates a data source generator of 32-bit words at a maximum rate of 40 MHz, which means that up to 160 Mbytes/sec can be loaded on our SLink PMC.

The hardware design of the SLink to PMC forces a continuous monitoring of the kind of SLink word present at the AddOn FIFO (SLink FIFO) in order to get a continuous DMA flow. A mismatch between the expected link word programmed in OMB1 and the current link word reported in IMB4 suspends the DMA.

Within a DMA transfer the link words are of data type, and therefore the expected word must also be of data type to get a continuous data flow from the AMCC S5933 FIFOs to the RAM memory. When a control word arrives at the external FIFO (header or trailer of the current packet), the DMA stops and waits until some external mechanism (software, and therefore the CPU) changes the expected word. After the values are updated, the DMA transaction flows again for the new packet. Our tests try to measure this intrinsic effect of the current SLink hardware on the bandwidth performance.
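A hedged sketch of this expected-word handling on the receiving side is given below. The mailbox roles (OMB1 holds the expected word type, IMB4 reports the word type currently at the AddOn FIFO) are taken from the text; the register offsets, the word-type encodings and the amcc pointer are illustrative only.

  #define AMCC_OMB1  0x00               /* outgoing mailbox 1 (assumed)      */
  #define AMCC_IMB4  0x1C               /* incoming mailbox 4 (assumed)      */
  #define WORD_DATA  0x0                /* illustrative word-type encodings  */
  #define WORD_CTRL  0x1

  extern volatile unsigned long *amcc;  /* user-mapped S5933 registers       */

  static void handle_ctrl_word( void )
  {
    unsigned long seen = amcc[ AMCC_IMB4 / 4 ];   /* word type at the FIFO   */

    if( ( seen & 0x1 ) == WORD_CTRL ){
      /* Header or trailer reached: announce that a CTRL word is expected,  */
      /* consume it, then switch back to DATA so the DMA flows again.       */
      amcc[ AMCC_OMB1 / 4 ] = WORD_CTRL;
      /* ... read out the control word and check header/trailer here ...    */
      amcc[ AMCC_OMB1 / 4 ] = WORD_DATA;
    }
  }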
Following this observation, we have done two kinds of tests: raw transfers, where the SLIDAS supplies raw packets without CONTROL words, and control transfers, where the packet is a normal SLink packet (header, data, data, ..., data, trailer). In both cases the buffer size has been swept covering 13 Kwords. In the CTRL sweep the curve sampling and sweep size are constrained by the SLIDAS hardware. The raw sweep has been oversampled by software to cross-check the CTRL test measurement. The data plots are shown below.
 

Conclusions:

RAW DMA transfers (DMA buffering without CTRL words) can be as high as 89 Mbytes/sec with our current system, with an efficiency of 95% for a 2 Kword buffer size. The figure shows that the buffer size required to obtain a good efficiency is not very demanding (1/2 Kword already gives 85%).

CTRL DMA transfers (DMA buffering with CTRL words) can reach up to 85 Mbytes/sec for 8 Kwords (the maximum packet size provided by the SLIDAS). In terms of bandwidth, the penalty coming from the software CTRL word switch, which is clearly seen in the curves, is roughly 10% for 1 Kword.

Both experiments tend to the same plateau (P1 parameter) for a buffer size >= 8 Kwords, within a variance lower than 4%.

A good chi-square/ndf ratio has been obtained from the data collected, which indicates a good characterization.

Notes:

The calculations use software calls to measure the elapsed time after M iterations for each packet size. We do not take into account the overheads coming from intermediate function calls; the procedure followed looks like the code below:
 /* For each sweep point: repeat the DMA read "loops" times and measure the
    elapsed time with times(); show_performance() converts it to a rate.    */
 for( jj = 0 ; jj < STATISTICs ; jj++ ){

   run_params.start = times( &timer );
   for( ii = 0 ; ii < loops ; ii++ ){
     if( SLink_DMA_Read( dev, ( char * )( &dma ), size, PCItoPHY_M ) != SUCCESS )
       break;
   }
   run_params.end = times( &timer );

   time[ jj ] = show_performance( &run_params );
   printf( "\n%d\t%.0f", jj, time[ jj ] );
   fprintf( data_file, "\n%d\t%.0f", jj, time[ jj ] );

 }

The S5933 chipset is designed for data widths of 32 bits, working at a maximum speed of 33 MHz. This means that the MVME2604 PCI bus, which can move data widths of 64 bits at the full speed of 33 MHz, is underloaded.
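In round numbers, assuming full-speed bursts and no protocol overhead:

  \( BW_{S5933}     = 4\ \mathrm{bytes} \times 33\ \mathrm{MHz} = 132\ \mathrm{Mbytes/sec} \)
  \( BW_{PCI,64bit} = 8\ \mathrm{bytes} \times 33\ \mathrm{MHz} = 264\ \mathrm{Mbytes/sec} \)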

References:

Our tests are in agreement with the study performed by the ESSOS group, "Second Generation VMEbus PowerPC based SBCs running LynxOS". This paper shows that a bandwidth of 88.1 Mbytes/sec is obtained for the MVME2604 when writing PCI bursts from the S5933 to the DRAM main memory. That measurement was performed with a VMetro PCI analyzer.

PC Pentium II 266MHz. Linux 2.2.12-20

Our PC, running Linux 2.2.12, houses the two PCI cards that will be used in the ROD environment: a PCI to SLink board, which supplies data over the FiberChannel link to the ROD Module host, and a Bit3 interface which bridges PCI to VME, used for slow rate tasks such as monitoring, control or data sampling over the VMEbus.
The ROD driver is designed with the issues pointed out on the header page in mind. The Linux memory management is the key point that allows a user space library design. The Linux kernel does not provide the smem_create user function found on LynxOS; following the general device policy used on Linux, it would not make much sense.
The Linux kernel and its modules are designed to hide the management of physical devices, such as the PCI spaces or the physical memory, from the user. User applications work under the virtual memory kernel management, which properly handles the memory page faults on the devices. A nice description of memory management under Linux is found in chapters Seven and Thirteen of Linux Device Drivers.

Our Linux module performs the PCI and memory mapping services for the user, using "ioctl", "open" and "close". The POSIX "mmap" call is also implemented, returning a virtual pointer to the memory associated with the selected device. The RAM memory device is limited to 32 pages (4096 bytes per page), which is the maximum number of contiguous pages granted by _get_free_pages.
If more memory per object is needed, the kernel can be patched with bigphysarea, which enhances the _get_free_pages services; afterwards an update of the "Alloc/Free" ioctl calls in the ROD driver should be done.
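A minimal user-space sketch of this interface is given below, assuming a /dev/rod device node and illustrative "Alloc"/"Free" ioctl command codes; only the open/ioctl/mmap structure is taken from the text.

  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  #define ROD_ALLOC  1                   /* illustrative "Alloc" ioctl       */
  #define ROD_FREE   2                   /* illustrative "Free"  ioctl       */
  #define ROD_PAGES  32                  /* _get_free_pages limit (128 kB)   */
  #define ROD_PAGE   4096                /* page size                        */

  int main( void )
  {
    int            fd;
    int            pages = ROD_PAGES;
    unsigned long *buffer;

    fd = open( "/dev/rod", O_RDWR );     /* ROD module node (assumed name)   */
    if( fd < 0 ) return -1;

    ioctl( fd, ROD_ALLOC, &pages );      /* reserve contiguous kernel pages  */

    /* Map the DMA buffer into the user context; no kernel<->user copies    */
    /* are needed afterwards to inspect the data moved by the S5933.        */
    buffer = mmap( NULL, pages * ROD_PAGE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0 );
    if( buffer == MAP_FAILED ) return -1;

    /* ... program the S5933 DMA with the physical address of this buffer ..*/

    munmap( buffer, pages * ROD_PAGE );
    ioctl( fd, ROD_FREE, &pages );
    close( fd );
    return 0;
  }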

The mmap driver entry does the PCI chipset/RAM remapping job, using the remap_page_range kernel symbol. The trick to remove the kernel page swapping management on the system RAM comes from the X server strategy to access physical memory. A description of the method can be found in the Linux Lab Project documentation. Our modified version reserves the kernel mem_map array entries linked to the page range allocated by the _get_free_pages call.

/* Trick from the Linux Lab Project (same strategy as the XFree86 server):
   mark every allocated page as PG_reserved, so that the kernel VM leaves
   them alone and remap_page_range() accepts them in the mmap call.        */

       for( ii = MAP_NR( tmp->k_ptr );                            /* first page */
            ii <= MAP_NR( tmp->k_ptr + alloc.pages * PAGE_SIZE - 1 ); /* last   */
            ii++ ){
         mem_map_reserve( ii );
       }

After this loop the pages are treated like PCI memory pages, and are therefore suitable to be user mapped through mmap, remap_page_range and VMAs.

The driver entries "write" and "read" are implemented to access the PCI configuration space, which cannot be mapped. As in the LynxOS PCI driver, the Linux driver internally maps the PCI chipsets to be able to access the hardware whenever required, e.g. in the /proc/rod entry or if some interrupt service is designed.
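As an example, a configuration register could be read back from user space as sketched below; whether the register offset is selected through lseek, as assumed here, depends on the actual driver implementation.

  #include <sys/types.h>
  #include <unistd.h>

  /* Read one 32-bit register of the 256-byte PCI configuration space
     through the driver "read" entry (offset selection via lseek assumed).  */
  static unsigned int pci_config_read( int fd, off_t offset )
  {
    unsigned int value = 0;

    lseek( fd, offset, SEEK_SET );          /* select the config. offset    */
    read ( fd, &value, sizeof( value ) );   /* 32-bit configuration word    */
    return value;
  }

  /* Example: offset 0x00 holds the S5933 vendor/device identifiers:        */
  /*   unsigned int id = pci_config_read( fd, 0x00 );                       */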

As in the SLink LynxOS libraries case, a similar set of data types and procedures has been implemented on top of the ROD driver services. We have checked the package functionality by setting up a real SLink environment based on FiberChannel. The LDC and LSC SLink/FiberChannel cards provide a maximum bandwidth of 103 Mbytes/sec over the maximum theoretical rate of 1 Gbit/sec given by the FiberChannel specification; further information about the hardware is found at http://www.rmki.kfki.hu/detector/S-Link/.

The applications built try to measure the PC RAM to FIFO transfer bandwidth and the FiberChannel bandwidth. We are also interested in measuring the penalty coming from the CTRL word management. On the Linux side, the control words (SLink packet header or trailer) need to be written by software; therefore, no DMA can be used for them.
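The sketch below shows how one SLink packet could be sent from the Linux source under these constraints: header and trailer are programmed I/O writes, while only the data payload uses the S5933 DMA. The helper names (slink_write_ctrl, SLink_DMA_Write, PHYtoPCI_M) and the control-word values are hypothetical.

  typedef struct SLinkDev SLinkDev;            /* opaque handle, as before   */
  extern void slink_write_ctrl( SLinkDev *dev, unsigned long ctrl_word );
  extern int  SLink_DMA_Write ( SLinkDev *dev, char *buf, int size, int dir );

  #define SLINK_HEADER   0xB0F00000UL          /* illustrative CTRL words    */
  #define SLINK_TRAILER  0xE0F00000UL
  #define PHYtoPCI_M     2                     /* DMA direction: RAM -> FIFO */
  #define SUCCESS        0

  static int send_packet( SLinkDev *dev, unsigned long *data, int nwords )
  {
    slink_write_ctrl( dev, SLINK_HEADER );                /* PIO: header     */

    if( SLink_DMA_Write( dev, ( char * )data,             /* DMA: payload    */
                         nwords * sizeof( unsigned long ),
                         PHYtoPCI_M ) != SUCCESS )
      return -1;

    slink_write_ctrl( dev, SLINK_TRAILER );               /* PIO: trailer    */
    return 0;
  }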

A synchronization method between the sender application and the receiver software is necessary before the sweep test starts. We use the main data link and the SLink return lines to make the configuration setup between the two processes. Before the sweep test starts, a special packet is sent from the PC memory to the PowerPC memory. It contains the sweep parameters: the data buffer size to be transmitted, the number of SLink packets, the number of loops per sweep point and the sweep resolution. When the receiver is ready to run the sweep, it acknowledges its state by lowering the S-Link return line LRL3, which is monitored at the sender side after the synchro packet has been sent (a sketch of this handshake is given below).
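The sender-side sketch below builds on the hypothetical helpers of the previous sketch; only the parameter list of the synchro packet and the use of return line LRL3 are taken from the text.

  typedef struct {
    unsigned long buffer_size;     /* words per transmitted data buffer      */
    unsigned long n_packets;       /* number of SLink packets                */
    unsigned long n_loops;         /* loops per sweep point                  */
    unsigned long resolution;      /* sweep resolution in words              */
  } SweepSetup;

  extern int slink_return_line( SLinkDev *dev, int line );  /* LRLx status   */

  /* Sender side (PC/Linux): ship the setup and wait for the acknowledge.    */
  static int synchronize( SLinkDev *dev, const SweepSetup *setup )
  {
    if( SLink_DMA_Write( dev, ( char * )setup, sizeof( *setup ),
                         PHYtoPCI_M ) != SUCCESS )
      return -1;

    while( slink_return_line( dev, 3 ) != 0 )
      ;                            /* poll until the receiver lowers LRL3    */

    return 0;
  }

Once both sides are configured this way, the sweep is run; the experimental results obtained are shown below: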
 

Conclusions:

RAW DMA transfers (DMA buffering without CTRL words) can be as high as 42 Mbytes/sec, with an efficiency greater than 98% for an 8 Kword buffer size. The figure shows that the size-efficiency requirement is not very demanding (1/2 Kword already gives 90%).

CTRL DMA transfers (DMA buffering with CTRL words) can reach up to 42.5 Mbytes/sec for 8 Kwords. In terms of bandwidth, the penalty coming from the software CTRL word management is roughly 10% for 1/2 Kword.

Both experiments tend to the same plateau (P1 parameter) for a buffer size >= 1 Kword, within a 5% margin.

As in the SLIDAS case, a good chi-square is obtained from the data collected.

Notes:

The theoretical bandwidth of the PCI to SLink card is 65 Mbytes/sec. Therefore, it is not possible to check the FiberChannel bandwidth. Erik van der Bij has suggested that such a test should be done using a SliTest board or a MicroEnable card, which provides a RAM memory to/from PCI throughput of 110 Mbytes/sec.

The Linux load has been left at its default. The X server and other "user applications", such as networking services, are running, and our SLink application does not have any special privileges. No special kernel scheduler set-up has been done.

The RAW and CTRL DMA tests are performed with a PCI-SLink latency timer value of 96 (~23 usec per burst). For a fixed data buffer (8 Kwords), a latency sweep has been done covering the whole dynamic range of the latency timer. We see a flat response of 35 Mbytes/sec, with a step response up to the P1 parameter from a latency timer value of 26 to the end of the sweep; the origin of this step is not yet fully understood.

References:

The Nikhef studies about SLink show that our P1 parameter is close to the Nikhef plateau using either the Windows NT driver or the memory mapped LSC driver provided by Steffen Luitz. The Nikhef group shows, on a PCI trace gathered with a VMetro PCI analyzer, that there is an intrinsic cut effect produced by the PC motherboard PCI/RAM bridge which limits the burst cycle length on the PCI bus, and therefore the bandwidth measured.

RIO2 8061. LynxOS-2.5.1

The SSP libraries developed for the MVME2604 have been ported to the RIO2 board, which runs a CES LynxOS 2.5.1 OS. This is an important point, because CES modifies the standard LynxOS kernel distribution and an extra PCI software layer is added. The new layer provides the bus management in terms of PCI memory space allocation and device detection. The kernel also sets up the PCI bridge, mapping the local system memory into the PCI space. The reverse applies as well to the PCI chipsets available on the bus. From the user point of view, the PCI memory space is presented as a set of physical addresses suitable to be mapped using smem_create calls. If some PCI configuration addressing cycle is required from the user level, an interface is provided by the uiodrvr driver, which also manages the contiguous physical memory requests. Below are the figures for the SSP plugged into the RIO2 8061: the left one is for the FiberChannel/SLink LDC implementation, while for the right one a SLIDAS card supplies the data flow. The same measurement procedure as for the MVME2604 has been applied.
From the figures above it is clear that the bandwidth of the SSP/RIO2-8061 couple is roughly 75 Mbytes/sec for raw data packets greater than 10 Kwords. The efficiency is 95% for packets greater than 4 Kwords. The 42 Mbytes/sec obtained for the FiberChannel link matches the MVME2604 result, which means that the real bottleneck is the limited bandwidth of the PCI-SLink card plugged into the PC motherboard.

Slink Implementations

Extra measurements have also been performed using other SLink cards. This provides an extra cross-check of the bandwidth measured with our software packages. Below are the figures corresponding to the ODIN cards, which use the G-Link technology to implement the SLink layer. The pictures correspond to the MVME2604 and the RIO2-8061 destination hosts, which receive the SLink data through a PMC that interfaces the SLink protocol to the PCI bus. The source side is a PCI to SLink card which houses the ODIN LSC.

Below are the figures obtained for the integrated SLink/FiberChannel destination cards. The same SLink source has been used to supply the data to the PMCs. As above, the left one is for the MVME2604 motherboard and the right one for the RIO2-8061 card.
The agreement of the P1 parameter for every experiment confirms that all the measurements performed, except the ones using the SLIDAS, are fully constrained by the limited PCI bandwidth of the PCI-to-SLink card and the PC Pentium PCI bridge.

TileCal ROD
Maintained by Juanba

IFIC, University of Valencia