Acceleration of FIR filter algorithms using hardware in an embedded system on a programmable device

Thursday, 30 January 2014 23:15

Last updated on Thursday, 30 January 2014 23:36

Acceleration of Finite Impulse Response Filter Algorithms with Hardware, to Embedded System in a Programmable Device

I. Maltezos¹, V. Adrianoupolitou²

¹ Computer Systems & Networks Laboratory, 7th School Laboratory Center, Piraeus, Greece
Tel: 2104010180-218

² HELLENIC SCIENTIFIC ASSOCIATION FOR TECHNICAL - VOCATIONAL EDUCATION AND TRAINING, Athens, Greece
Tel: 2108053539

ABSTRACT

The aim of this study is to design a digital System on a Programmable Chip that minimizes the execution time of a digital finite impulse response (FIR) filter. This is achieved in two ways: by adding a hardware accelerator that cooperates with the Altera NIOS II processor in executing the filter algorithm, and by adding custom instructions to the processor's instruction set. To keep the software design methodology realistic, the real-time operating system FreeRTOS was used; it runs all the tasks that implement the digital signal processing algorithms. The case study achieved the following objectives: a significant acceleration of the digital FIR filter algorithm, about 40 times with the hardware accelerator and 100 times with a custom instruction embedded in the NIOS II instruction set; an investigation of how to design software for the NIOS II processor using a real-time operating system; and a methodology for designing other systems that require shorter execution times. To apply the methodology in practice, sound and numerical data were fed into the system in order to evaluate its output and validate the design.

1. INTRODUCTION

Application-specific customizable processor cores enable the design of complex, high-performance, low-power embedded systems under tight time-to-market constraints. Customizable processors allow their micro-architectural parameters to be configured. More importantly, they support extension of the core instruction set architecture with application-specific custom instructions. Custom instructions encapsulate the computation patterns that occur frequently in an application. They are implemented as custom functional units (CFUs) in the datapath of the existing processor core. CFUs improve the performance and reduce the energy consumption of applications. Lx, ARC core, Xtensa, and NIOS II are some examples of commercial customizable processors.

The aim of this work is to show how custom instructions for the Altera Nios II processor [1] can be used to improve Nios II embedded system implementations in terms of speed and area. The Nios II processor enables the designer to change the hardware/software partitioning during system implementation. This is often done iteratively, by replacing different parts of the software with hardware and then evaluating the result. Because the hardware design of the Nios II processor can be changed easily, the development time for the whole system drops dramatically. To apply the methodology in practice, sound and numerical data were fed into the system in order to evaluate its output and validate the design. This work therefore carries out one full design iteration, from analyzing the software to the final evaluation of the result.

2. RELATED WORK

Many techniques for generating custom instructions have been proposed in the literature, for example [2], [3], [4]. However, none of these approaches targets platforms that exploit dynamic reconfiguration of custom functional units. Recently, [3] developed a rotating instruction set processing platform that selects custom instructions at runtime. An efficient framework for runtime reconfiguration of custom instructions was presented earlier in [5]. However, the above studies do not consider real-time constraints. Co-synthesis of periodic task graphs with real-time constraints onto heterogeneous distributed embedded systems is addressed in [6], [7]. The approach in [8] partitions a task graph with timing constraints into a set of hardware units. Enforcing schedulability of real-time tasks with hardware implementation appears in [9], while [8] studies instruction-set customization for real-time tasks. None of these techniques takes into account the reconfiguration overhead or the possibility of both spatial and temporal partitioning. The works in [10], [11] co-synthesize real-time task graphs onto distributed systems containing dynamically reconfigurable FPGAs. They assume a single hardware implementation of a task in the FPGA and do not explore the hardware design space to evaluate tradeoffs between different implementations of the same task. Moreover, they impose no hardware area constraint and try to minimize either cost (area), power or a tardiness function while the real-time constraints are satisfied. Finally, the majority of the work on temporal partitioning comes from the reconfigurable computing community [12], [13], [14].

3. NIOS II CUSTOM INSTRUCTION OVERVIEW

Custom instructions for the Altera Nios II embedded processor can accelerate time-critical software algorithms. This is achieved by replacing a complex sequence of standard instructions with a single instruction implemented in hardware. Up to 256 custom instructions can be added to the Nios II processor [1]. A custom instruction is a custom logic block connected directly to the Nios II arithmetic logic unit (ALU) [1]. In this way the design of the FPGA-based Nios II processor can be changed easily, so the designer retains full freedom in hardware/software partitioning during implementation. Different types of custom instructions are available. The main types are "combinatorial" and "multi-cycle". The combinatorial custom instruction block reads a 32-bit input vector from "dataa" and/or "datab" and generates a 32-bit "result" vector within one clock cycle (figure 1).

Figure 1: Combinatorial custom instruction block
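
To make the data flow of a combinatorial block concrete, the following is a minimal host-side Python model, not Nios II code. The operation itself (`dual_mac16`, a packed dual 16-bit multiply-add) and the helper `pack16` are hypothetical and chosen only for illustration; the text does not say which operation its custom instruction implements. What the sketch takes from the description is only the interface: two 32-bit operands, "dataa" and "datab", and one 32-bit "result", with no internal state.

```python
MASK32 = 0xFFFFFFFF  # results are truncated to 32 bits, like the "result" port


def _signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pack16(high, low):
    """Hypothetical helper: pack two 16-bit values into one 32-bit operand."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def dual_mac16(dataa, datab):
    """Hypothetical combinatorial operation on the packed 16-bit halves.

    Returns high*high + low*low as a 32-bit word. It is a pure function of
    its two operands, so it models a block with no clock and no state.
    """
    a_high, a_low = _signed16(dataa >> 16), _signed16(dataa)
    b_high, b_low = _signed16(datab >> 16), _signed16(datab)
    return (a_high * b_high + a_low * b_low) & MASK32


samples = pack16(3, 4)        # two 16-bit samples in one word
coefficients = pack16(5, 6)   # two 16-bit coefficients in one word
print(dual_mac16(samples, coefficients))  # 3*5 + 4*6 = 39
```

An operation of this kind is the sort of candidate the text has in mind: in software it needs several shifts, multiplies and an add, while in hardware it is a single block between the operand and result ports.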

The multi-cycle custom instruction block has additional ports: "clk", "clk_en", "start", "reset" and "done", some of which are optional (figure 2).

Figure 2: Multi-cycle custom instruction block

The multi-cycle custom instruction block also reads its input from "dataa" and/or "datab" and produces a "result", but it does so within a finite number of clock cycles. The maximum number of clock cycles the block needs to perform the operation has to be known when the instruction is added to the Nios II processor. The multi-cycle custom instruction block can be extended to perform multiple operations. The operation is selected by an input vector "n". This vector is up to eight bits wide and allows up to 256 different operations, which are simply mapped to different opcodes. A custom instruction can incorporate further functionality, such as access to internal register files. Logic outside the Nios II processor's data path can be interfaced, too.
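
The selection of operations through "n" can be sketched with a small host-side Python model. Everything specific in it is an assumption made for illustration: the class name `MultiCycleBlock`, the three opcodes (clear, multiply-accumulate, read), the cycle counts and the bound `MAX_CYCLES` are hypothetical and are not taken from the system described here. The sketch only mirrors the properties stated in the text: an 8-bit selector, internal state, and a maximum cycle count that is known in advance.

```python
MASK32 = 0xFFFFFFFF


class MultiCycleBlock:
    """Host-side model of a multi-cycle block with an internal accumulator."""

    MAX_CYCLES = 3  # hypothetical bound, fixed when the instruction is added

    def __init__(self):
        self.accumulator = 0
        # Hypothetical opcode map: the value of "n" selects the operation.
        self._operations = {0: self._clear, 1: self._mac, 2: self._read}

    def _clear(self, dataa, datab):
        self.accumulator = 0
        return 0, 1  # (result, cycles used)

    def _mac(self, dataa, datab):
        self.accumulator = (self.accumulator + dataa * datab) & MASK32
        return self.accumulator, 3

    def _read(self, dataa, datab):
        return self.accumulator, 1

    def execute(self, n, dataa=0, datab=0):
        """Run the operation selected by the 8-bit vector n."""
        if not 0 <= n <= 0xFF:
            raise ValueError("n must fit in eight bits")
        if n not in self._operations:
            raise ValueError(f"operation {n} is not implemented in this sketch")
        result, cycles = self._operations[n](dataa, datab)
        assert cycles <= self.MAX_CYCLES  # the declared maximum must hold
        return result, cycles


block = MultiCycleBlock()
block.execute(0)          # clear the accumulator
block.execute(1, 3, 4)    # accumulator = 12
block.execute(1, 5, 6)    # accumulator = 12 + 30
print(block.execute(2))   # (42, 1)
```

Keeping the accumulator inside the block is one way to read the remark about internal register files: intermediate sums never have to travel back through the processor's general-purpose registers.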

4. THE SYSTEM IMPLEMENTATION

The implementation diagram of the system is shown in figure 3.

Figure 3: System Diagram

The NIOS II processor is the main core of the system. It communicates with the other cores over the Avalon-MM data bus as data master, and with the memory over the Avalon-MM instruction bus as instruction master. A few parallel input/output cores are used for the user interface. The main system memory is an SDRAM, used by FreeRTOS for the software implementation. An audio CODEC core is used for sound I/O, and a full VGA interface is implemented for the visual interface.

Figure 4: Visual interface

5. PERFORMANCE MEASUREMENT AND EVALUATION
To check the correct operation of the algorithms, custom circuits and instructions that were developed, and to evaluate the performance of the different implementations of the FIR algorithm, the system provides the following modes of operation:
Data SW Float mode - the input vector given to the algorithm is (10000,0,0,0,0,0,0,0,0,0) and the coefficients are {0.0324577, 0.0362059, 0.0391191, 0.0409677, 0.0416012, 0.0416012, 0.0409677, 0.0391191, 0.0362059, 0.0324577}; the expected output is the coefficients multiplied by 10000000. This mode checks the correctness of the algorithm and measures the execution time for 2 vectors of 10 samples each. The display shows the processing time for the input vectors.
As a result we get 207504 cycles and the output sequence {324577, 362059, 391191, 409677, 416012, 416012, 409677, 391191, 362059, 324577}, which confirms the correctness of the algorithm.
Data SW INT mode - the input vector given to the algorithm is (1,0,0,0,0,0,0,0,0,0) and the coefficients are {0,0,16,64,127,127,64,16,0,0}; the expected output is the filter coefficients. This mode checks the correctness of the algorithm and measures the execution time for 2 vectors of 10 samples each. The display shows the processing time for the input vectors.
As a result we get 21297 cycles and the output sequence {0,0,16,64,127,127,64,16,0,0}, which confirms the correctness of the algorithm.
Data HW ACC mode - the input vector given to the algorithm is {2000000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0} and the coefficients are those of the FIR HW Accelerator; the expected output is the coefficients. This mode checks the correctness of the algorithm and measures the execution time for 2 vectors of 37 samples each. The display shows the processing time for the input vectors.
As a result we get 11031 cycles, and for output values [7..16] we have the sequence {0, 1953125, 7812500, 7812500, 0, -15625000, -31250000, -31250000, 0, 66406250}, which confirms the correctness of the algorithm and corresponds to coefficients [7..16].
Data CI mode - the input vector given to the algorithm is (1000,0,0,0,0,0,0,0,0,0) and the coefficients are {0,0,16,64,127,127,64,16,0,0}; the expected output is the coefficients multiplied by 1000. This mode checks the correctness of the algorithm and measures the execution time for 2 vectors of 10 samples each. The display shows the processing time for the input vectors.
As a result we get 1558 cycles and the output sequence {0,0,0,3,16,31,31,16,3,0}, which confirms the correctness of the algorithm.
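
The impulse test used in the Data SW INT mode can be reproduced on a host with a short reference model. The following is a minimal sketch in Python, assuming a plain direct-form FIR with zero initial state; the function name `fir_direct` is hypothetical, and this is not the code that runs on the NIOS II. Only the input vector and the coefficients are taken from the description of that mode.

```python
def fir_direct(samples, coefficients):
    """Direct-form FIR: y[k] = sum over i of coefficients[i] * samples[k - i]."""
    output = []
    for k in range(len(samples)):
        accumulator = 0
        for i, coefficient in enumerate(coefficients):
            if k - i >= 0:  # samples before the start of the vector count as zero
                accumulator += coefficient * samples[k - i]
        output.append(accumulator)
    return output


# Values taken from the Data SW INT mode.
COEFFICIENTS = [0, 0, 16, 64, 127, 127, 64, 16, 0, 0]
IMPULSE = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

response = fir_direct(IMPULSE, COEFFICIENTS)
print(response == COEFFICIENTS)  # True: a unit impulse reproduces the coefficients
```

Feeding a single non-zero sample is a convenient correctness check because the output of an FIR filter is then its coefficient sequence, scaled by the amplitude of that sample, so any indexing or ordering error shows up directly in the result.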

Figure 5: Acceleration with the external hardware accelerator and with custom instructions, for 100, 200, 400, 800, 1600 and 3200 samples (blue for CI, orange for HACC)
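
The cycle counts reported for the four test modes can be put on a common footing by dividing by the number of samples processed. The sketch below does this arithmetic in Python under two assumptions that the text does not state explicitly: that each reported cycle count covers both vectors of a run, and that cycles per sample is a fair basis for comparison. The names `cycles_per_sample` and `speedup` are hypothetical. These ratios describe the short test vectors only and need not match the acceleration factors quoted for the larger sample counts of figure 5.

```python
# (cycles reported, samples per vector) for each mode, as given in section 5.
MEASUREMENTS = {
    "Data SW Float": (207504, 10),
    "Data SW INT": (21297, 10),
    "Data HW ACC": (11031, 37),
    "Data CI": (1558, 10),
}
VECTORS_PER_RUN = 2  # assumption: each cycle count covers both vectors


def cycles_per_sample(mode):
    """Reported cycles divided by the total number of samples in the run."""
    cycles, samples_per_vector = MEASUREMENTS[mode]
    return cycles / (VECTORS_PER_RUN * samples_per_vector)


def speedup(baseline_mode, accelerated_mode):
    """Ratio of cycles per sample between two modes."""
    return cycles_per_sample(baseline_mode) / cycles_per_sample(accelerated_mode)


for mode in MEASUREMENTS:
    print(f"{mode}: {cycles_per_sample(mode):.1f} cycles per sample")
print(f"CI vs SW INT: {speedup('Data SW INT', 'Data CI'):.1f}x")
print(f"HW ACC vs SW INT: {speedup('Data SW INT', 'Data HW ACC'):.1f}x")
```

Normalising per sample matters here because the hardware accelerator mode was measured on 37-sample vectors while the other modes used 10-sample vectors, so the raw cycle counts are not directly comparable.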

6. CONCLUSION

Improving the existing FIR filter implementation with custom instructions reduced the execution time significantly. Less area is needed as well, which makes the new implementation more cost-effective than the initial one. The next step in this design process would be to test the new implementation on an evaluation board. Another iteration cycle could then follow, in which other parts of the software are replaced by hardware or vice versa. If after some iterations the result still does not meet the required specifications, it might become necessary to try a completely different approach, for example a pure hardware implementation.

The case study achieved the following objectives: a significant acceleration of the digital FIR filter algorithm, about 40 times with the hardware accelerator and 100 times with a custom instruction embedded in the NIOS II instruction set; an investigation of how to design software for the NIOS II processor using a real-time operating system; and a methodology for designing other systems that require shorter execution times. To apply the methodology in practice, sound and numerical data were fed into the system in order to evaluate its output and validate the design.

7. REFERENCES

1. Jong-Eun Lee (Samsung Electronics Co., Korea), Kiyoung Choi (Seoul National University, Korea), and Nikil D. Dutt (University of California, Irvine), "Instruction Set Synthesis with Efficient Instruction Encoding for Configurable Processors."

2. Y. Aoudni, K. Loukil, G. Gogniat, J.L. Philippe, and M. Abid, "Mapping SoC architecture Solutions for an Application based on PACM Model," LESTER, Université de Bretagne Sud, CNRS FRE 2734, Lorient, France.

3. Huynh Phung Huynh and Tulika Mitra, "Runtime Reconfiguration of Custom Instructions for Real-Time Embedded Systems," School of Computing, National University of Singapore.

4. Hamblen J., (2008), Rapid Prototyping Of Digital Systems: SOPC Edition, Springer.

5. H. P. Huynh and T. Mitra, "Instruction-set customization for real-time embedded systems," In DATE '07.

6. K. Atasu, C. Ozturan, G. Dundar, O. Mencer, and W. Luk, "CHIPS: Custom hardware instruction processor synthesis," In IEEE TCAD '08.

7. J. Sato, M. Imai, T. Hakata, A. Alomary, and N. Hikichi, "An integrated design environment for application specific integrated processor," Proceedings of the International Conference on Computer Design, pp. 414-417, Cambridge, MA, USA, Oct. 1991.

8. E. Borin, F. Klein, N. Moreano, R. Azevedo, and G. Araujo, "Fast instruction set customization," Proceedings of the 2nd Workshop on Embedded systems for Real-Time Multimedia (ESTIMedia '04), pp. 53-58, Sep. 2004, Stockholm, Sweden.

9. H. Choi, J.-S. Kim, C.-W. Yoon, I.-C. Park, S.H. Hwang, and C.-M. Kyung, "Synthesis of application specific instructions for embedded DSP software," IEEE Transactions on Computers, Vol. 48, No. 6, pp. 603-614, Jun. 1999.

10. N. Clark, H. Zhong, W. Tang, and S. Mahlke, "Automatic design of application specific instruction set extensions through dataflow graph exploration," International Journal of Parallel Programming, Vol. 31, No. 6, pp. 429-449, Dec. 2003.

11. Nathan T. Clark, Hongtao Zhong, and Scott A. Mahlke, "Automated custom instruction generation for domain-specific processor acceleration," IEEE Transactions on Computers, Vol. 54, No. 10, pp. 1258-1270, Oct. 2005.

12. J. Cong, Y. Fan, G. Han, and Z. Zhang, "Application-specific instruction generation for configurable processor architectures," In Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, pp. 183-189, Feb. 2004, Monterey, California, USA.

13. N. Dutt, and K. Choi, "Configurable processors for embedded computing," IEEE Computer, Vol. 36, No. 1, pp. 120-123, Jan. 2003.

14. R. Gonzalez, "Xtensa: A configurable and extensible processor," IEEE Micro, Vol. 20, No. 2, pp. 60-70, Mar.-Apr. 2000.
