Scalable Hardware-Algorithms for Binary Prefix Sums

Lin, R.; Olariu, S.; Nakano, K.; Pinotti, Maria Cristina; Schwing, J.; Zomaya, A.

doi:10.1109/71.877941

In this work, we address the problem of designing efficient and scalable hardware-algorithms for computing the sum and prefix sums of a $w^k\hbox{-}{\rm{bit}}$, $(k\geq 2)$, sequence using as basic building blocks linear arrays of at most $w^2$ shift switches, where $w$ is a small power of $2$. An immediate consequence of this feature is that in our designs broadcasts are limited to buses of length at most $w^2$. We adopt a VLSI delay model where the “length” of a bus is proportional with the number of devices on the bus. We begin by discussing a hardware-algorithm that computes the sum of a $w^k\hbox{-}{\rm{bit}}$ binary sequence in the time of $2k-2$ broadcasts, while the corresponding prefix sums can be computed in the time of $3k-4$ broadcasts. Quite remarkably, in spite of the fact that our hardware-algorithm uses only linear arrays of size at most $w^2$, the total number of broadcasts involved is less than three times the number required by an “ideal” design. We then go on to propose a second hardware-algorithm, operating in pipelined fashion, that computes the sum of a $kw^k\hbox{-}{\rm{bit}}$ binary sequence in the time of $3k+\lceil\log_w k\rceil -3$ broadcasts. Using this design, the corresponding prefix sums can be computed in the time of $4k+\lceil\log_w k\rceil -5$ broadcasts.