hal-02131982, version 1

Communication dans un congrès

Abstract : The posit number system is proposed as a replacement of IEEE floats. It encodes floating-point values with tapered precision: numbers whose exponent is close to 0 have more precision than IEEE floats, while numbers with high-magnitude exponents have lower precision, because their encoding takes bits from the significand. In addition, the posit standard mandates the presence of the "quire", a Kulisch-like large accumulator able to perform exact sums of products. Several works have demonstrated that posit arithmetic can provide improved accuracy at the application level. However, the variable-length fields of posit encoding impacts the hardware cost of posit arithmetic. Existing comparisons of posit hardware versus float hardware are unconvincing, and the overhead of the exact accumulator has not been studied in detail so far. This work aims at filling this gap. To this purpose, it introduces an open-source tool to compare the respective costs of floats and posits on an application basis. A C++ templatized library compatible with Vivado HLS implements operators for custom size posits and their associated quire. These architectures are evaluated on recent FPGA hardware and compared to their IEEE-754 counterpart. The standard 32 bits posit adder is found to be twice as large as the corresponding floating-point adder. Posit multiplication requires about 7 times more LUTs and a few more DSPs for a latency which is 2x worst than the IEEE-754 32 bit multiplier. Furthermore , the cost of the posit 32 quire is shown to be the same as a 32 bits floating-point Kulisch accumulator.