# Proceedings

# IEEE 17<sup>th</sup> International Conference on Application-specific Systems, Architectures and Processors

Steamboat Springs, Colorado, USA, September 11-13, 2006



Los Alamitos, California

Washington

Tokyo

#### All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.

IEEE Computer Society Order Number P2682
ISBN 0-7695-2682-9
ISBN 978-0-7695-2682-9
ISSN Number 1063-6862

Additional copies may be ordered from:

IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1 800 272 6657
Fax: +1 714 821 4641
http://computer.org/cspress
csbooks@computer.org

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1 732 981 0060
Fax: +1 732 981 9667
http://shop.ieee.org/store/
customer-service@ieee.org

IEEE Computer Society
Asia/Pacific Office
Watanabe Bldg., 1-4-2
Minami-Aoyama
Minato-ku, Tokyo 107-0062
JAPAN
Tel: +81 3 3408 3118
Fax: +81 3 3408 3553
tokyo.ofc@computer.org

Individual paper REPRINTS may be ordered at: <reprints@computer.org>

Editorial production by Bob Werner
Cover art production by Joe Daigle/Studio Productions
Printed in the United States of America by The Printing House





IEEE Computer Society

Conference Publishing Services

http://www.computer.org/proceedings/

# **Table of Contents: ASAP'06**

### 17<sup>th</sup> IEEE International Conference on Application-specific Systems, Architectures and Processors

| Message from the Conference Chairs                                                                                                               |    |
|--------------------------------------------------------------------------------------------------------------------------------------------------|----|
| Conference OrganizersProgram Committee                                                                                                           |    |
|                                                                                                                                                  |    |
| Keynote                                                                                                                                          |    |
| Programming Modern FPGA Platforms  Ivo Bolsens                                                                                                   |    |
| Session 1: Configurable Computing Machines (Invited)                                                                                             |    |
| Configurable Computing Platforms—Promises, Promises  Carl Ebeling                                                                                | 3  |
| The Mythical CCM: In Search of Usable (and Reusable) FPGA-Based General Computing Machines                                                       | 5  |
| Session 2: Processing, Storage and Network On-Chip                                                                                               |    |
| Cross Layer Design to Multi-thread a Data-Pipelining Application on a Multi-processor on Chip                                                    | 15 |
| The Molen FemtoJava Engine  Julio C.B. Mattos, Stephan Wong, and Luigi Carro                                                                     | 19 |
| A Generic Multi-Phase On-Chip Traffic Generation Environment  Antoine Scherrer, Antoine Fraboulet, and Tanguy Risset                             | 23 |
| Minimum Cost for Channels and Registers in Processor Arrays by Avoiding Redundancy  Sebastian Siegel and Renate Merker                           | 28 |
| NoC Hot Spot Minimization Using AntNet Dynamic Routing Algorithm  Massoud Daneshtalab, Ashkan Sobhani, A. Afzali-Kusha, O. Fatemi, and Z. Navabi | 33 |
| Session 3: Configurable Processors and Tools (Invited)                                                                                           |    |
| Recent Developments in Configurable and Extensible Processors                                                                                    | 39 |
| Software Configurable Processors  Jeffrey Arnold                                                                                                 | 45 |
| Reconfigurable Hardware and Software Architectural Constructs for the Enablement of Resilient Computing Systems                                  | 50 |

| Application Specific Processing: A Tools Approach                                                                                                                                  | 56  |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| Drew Taussig, Andreas Hoffmann, Achim Nohl, and Andrea Kroll                                                                                                                       |     |
| Session 4: Parallel Connection Architectures                                                                                                                                       |     |
| Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions  Yedidya Hilewitz and Ruby B. Lee                                                       | 65  |
| A Mesh-of-Trees Interconnection Network for Single-Chip Parallel Processing  Aydin Balkan, Gang Qu, and Uzi Vishkin                                                                | 73  |
| Reconfigurable Shuffle Network Design in LDPC Decoders  Jun Tang, Tejas Bhatt, Vishwas Sundaramurthy, and Keshab K. Parhi                                                          | 81  |
| 2D-VLIW: An Architecture Based on the Geometry of Computation                                                                                                                      | 87  |
| Session 5: Parallel Processing and Arithmetic                                                                                                                                      |     |
| An Efficient Implementation of High-Accuracy Finite Difference Computing Engine on FPGAs                                                                                           | 95  |
| Performance Evaluation of a Novel Direct Table Lookup Method and Architecture with Application to 16-bit Integer Functions  L. Li, Alex Fit-Florea, M.A. Thornton, and D.W. Matula | 99  |
| Design of Radix 4 SRT Dividers in 65 Nanometer CMOS Technology  Tung N. Pham and Earl E. Swartzlander, Jr.                                                                         | 105 |
| Describing Quantum Circuits with Systolic Arrays  Aasavari Bhave, Eurípides Montagne, and Edgar Granados                                                                           | 109 |
| FPGA Implementation of Beamforming Receivers Based on MRC and NC-LMS for DS-CDMA System  Elie Sarraf, Messaoud Ahmed-Ouameur, and Daniel Massicotte                                | 114 |
| Low Complexity Design of High Speed Parallel Decision-Feedback Equalizers  Daesun Oh and Keshab K. Parhi                                                                           | 118 |
| Session 6: Arithmetic: Analysis and Implementation                                                                                                                                 |     |
| Quantitative Analysis of Embedded FPGA-Architectures for Arithmetic                                                                                                                | 125 |
| A Cost Effective Pipelined Divider for Double Precision Floating Point Number                                                                                                      | 132 |
| A 64-bit Decimal Floating-Point Comparator  Ivan D. Castellanos and James E. Stine                                                                                                 | 138 |
| Pipelined Range Reduction for Floating Point Numbers  Francisco I. Jaime, Julio Villalba, Javier Hormigo, and Emilio I. Zapata                                                     | 145 |

## Session 7: 20th Anniversary Review—Array Processors (Invited)

| Systolic FFT Processors: Past, Present and Future  Earl E. Swartzlander, Jr.                                                                                                                                   | 153 |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| From Bit Level Systolic Arrays to HDTV Processor Chips  John V. McCanny, Roger F. Woods, and John G. McWhirter                                                                                                 | 159 |
| The UCSC Kestrel Application-Unspecific Processor                                                                                                                                                              | 163 |
| Multicore Processors as Array Processors: Research Opportunities  Peter Cappello                                                                                                                               | 169 |
| Session 8: Analysis and Optimizations                                                                                                                                                                          |     |
| Analysis of a Fully-Scalable Digital Fractional Clock Divider  Thomas Preußer and Rainer G. Spallek                                                                                                            | 173 |
| Voltage Assignment and Loop Scheduling for Energy Minimization while Satisfying Timing Constraint with Guaranteed Probability  Meikang Qiu, Chun Xue, Qingfeng Zhuge, Zili Shao, Meilin Liu, and Edwin H.-M. Sha | 178 |
| Parallel Processing Based Power Reduction in a 256 State Viterbi Decoder                                                                                                                                       | 182 |
| Affine Nested Loop Programs and their Binary Parameterized Dataflow Graph Counterparts                                                                                                                         | 186 |
| Polyhedral Modeling and Analysis of Memory Access Profiles  Philippe Clauss and Bénédicte Kenmei                                                                                                               | 191 |
| Session 9: 20<sup>th</sup> Anniversary Review—Optimizations and Applications (Invited)                                                                                                                         |     |
| Array Processing Using Alternate Arithmetic—A 20 Year Legacy                                                                                                                                                   | 199 |
| Loop Transformation Methodologies for Array-Oriented Memory Management  F. Balasa, P.G. Kjeldsberg, M. Palkovic, A. Vandecappelle, and F. Catthoor                                                             | 205 |
| An Overview of Systolic Array Concepts and Applications for Linear Algebra and Signal Processing  Kung Yao and Flavio Lorenzelli                                                                               | 213 |
| Three Computationally Demanding Problems in Search of ASAP Solutions                                                                                                                                           | 214 |
| Session 10: Energy and Performance Optimizations                                                                                                                                                               |     |
| Parameterized Looped Schedules for Compact Representation of Execution Sequences  Ming-Yung Ko, Claudiu Zissulescu, Sebastian Puthenpurayil,  Shuvra S. Bhattacharyya, Bart Kienhuis, and Ed Deprettere        | 223 |
| An Improved Systolic Architecture for LU Decomposition  DaeGon Kim and Sanjay Rajopadhye                                                                                                                       | 231 |

| Dual-Processor Design of Energy Efficient Fault-Tolerant System  Shaoxiong Hua, Pushkin R. Pari, and Gang Qu                                                                              | 239 |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| An Energy-Delay Efficient Subword Permutation Unit                                                                                                                                        | 245 |
| Session 11: Video, Coding and Cryptography                                                                                                                                                |     |
| Architecture Design of an H.264/AVC Decoder for Real-Time FPGA Implementation                                                                                                             | 253 |
| Dynamic Voltage Scaling for Power Efficient MPEG4-SP Implementation  Antonio Portero, Guillermo Talavera, Marius Montón,  Borja Martínez, Francky Catthoor, and Jordi Carrabina           | 257 |
| Dynamic-SIMD for Lens Distortion Compensation                                                                                                                                             | 261 |
| High Speed Channel Coding Architectures for the Uncoordinated OR Channel                                                                                                                  | 265 |
| Efficient Group Key Management with Tamper-resistant ISA Extensions  Youtao Zhang, Jun Yang, and Lan Gao                                                                                  | 269 |
| Speeding Up AES by Extending a 32 bit Processor Instruction Set  Guido Marco Bertoni, Luca Breveglieri, Roberto Farina, and Francesco Regazzoni                                           | 275 |
| Session 12: Memory and Processor Synthesis                                                                                                                                                |     |
| Buffer and Register Allocation for Memory Space Optimization                                                                                                                              | 283 |
| New Schemes in Clustered VLIW Processors Applied to Turbo Decoding  Pablo Ituero and Marisa López-Vallejo                                                                                 | 291 |
| Evaluating Hardware Support for Reference Counting Using Software Configurable Processors  Feng Xian, Witawas Srisa-an, and Hong Jiang                                                    | 297 |
| Architectural Support on Object-Oriented Programming in a JAVA Processor                                                                                                                  | 303 |
| Session 13: Matrix and Imaging Designs                                                                                                                                                    |     |
| Reconfigurable Fixed Point Dense and Sparse Matrix-Vector Multiply/Add Unit                                                                                                               | 311 |
| High Performance VLSI Architecture Design for H.264 CAVLC Decoder  Mythri Alle, Jayanta Biswas, and S.K. Nandy                                                                            | 317 |
| An FPGA-Based Application-Specific Processor for Efficient Reduction of  Multiple Variable-Length Floating-Point Data Sets  Gerald R. Morris, Richard D. Anderson, and Viktor K. Prasanna | 323 |

| A Design Methodology for Hardware Acceleration of Adaptive Filter Algorithms in Image Processing                                                                               | 331 |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| Session 14: Cryptographic and Coding Applications                                                                                                                              |     |
| An Adaptable and Scalable Asymmetric Cryptographic Processor                                                                                                                   | 341 |
| Low-Cost Elliptic Curve Digital Signature Coprocessor for Smart Cards  Guerric Meurice de Dormale, Renaud Ambroise,  David Bol, Jean-Jacques Quisquater, and Jean-Didier Legat | 347 |
| Throughput Optimized SHA-1 Architecture Using Unfolding Transformation                                                                                                         | 354 |
| Configurable, High Throughput, Irregular LDPC Decoder Architecture:  Trade-off Analysis and Implementation  Marjan Karkooti, Predrag Radosavljevic, and Joseph R. Cavallaro    | 360 |
| Author Index                                                                                                                                                                   | 368 |