Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture

Cryptosystem plays an important role in cyber security and users privacy protection. To achieve certain level of security, Public Key Cryptography algorithm is performed on large integer that is more than 64-bit, typical bit size supported by conventional Central Processing Unit (CPU) (e.g. 512-bit...

Full description

Saved in:
Bibliographic Details
Main Author: Chang, Boon Chiao
Format: Final Year Project / Dissertation / Thesis
Published: 2018
Subjects:
Online Access:http://eprints.utar.edu.my/3629/1/ESA%2D2019%2D1601742%2D1.pdf
http://eprints.utar.edu.my/3629/
Tags: Add Tag
No Tags, Be the first to tag this record!
id my-utar-eprints.3629
record_format eprints
institution Universiti Tunku Abdul Rahman
building UTAR Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Tunku Abdul Rahman
content_source UTAR Institutional Repository
url_provider http://eprints.utar.edu.my
topic NA Architecture
spellingShingle NA Architecture
Chang, Boon Chiao
Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
description Cryptosystem plays an important role in cyber security and users privacy protection. To achieve certain level of security, Public Key Cryptography algorithm is performed on large integer that is more than 64-bit, typical bit size supported by conventional Central Processing Unit (CPU) (e.g. 512-bit for Elliptic Curve Cryptography (ECC), 2048-bit for Rivest-Shamir-Adleman (RSA) and million bits for Fully Homomorphic Encryption (FHE)). The computation of cryptographic algorithm required a lot of large integer arithmetic computation, especially large integer exponentiation and modular operation. Hence, large integer multiplication as the core operation of large integer modular exponentiation is the key factor in determining the performance of the cryptosystem in term of computation time.The minimum bit size of encryption/decryption keys to achieve certain security level have increased over years. CPU implementations are becoming less efficient in handling the computation of crypto algorithms. Hence, other contemporary processor architectures such as GPU and FPGA have become popular alternative to speed up the computation in recent years.This dissertation first discusses several existing large integer multiplication algorithms and reviews different methods used to implement thediscussed algorithms done by other researchers in both Graphic Processing Unit (GPU) and Field Programmable Gate Array (FPGA).Compared to GPU, FPGA offered more low level design and development to the implementation. While GPU allowed users to configure each of the cores (processing units) to perform independent tasks, FPGA provides users a platform to design the processing units themselves to perform dedicated arithmetic and logic operation.In our GPU implementation, we present two different large integer algorithms implementation on two generation of NVIDIA GPU architectures, Kepler and Pascal. The former focuses on utilizing the shuffle instructions to reduce memory latency and bulk multiplication that is able to perform up to 900 multiplications with operands size of 1024, 2048, 4096 and 8192-bit. This implementation also proved that our proposed method of utilizing the shuffle instructions feature to share data between registers memory is able to achieve up to 11.45% and 36.54% speedup when compared with conventional methods of storing data in GPU global memory and shared memory respectively; The later implementation on Pascal architecture aims to achieve single high speed multiplication of larger size by utilizing the available GPU resources. We are able to eliminate the bottleneck, the CRT algorithm from our previous implementation by replacing it with another level of CTFNT and hence increased the multiplication operand size to 4096, 192K, 384K and 768K-bit.In our FPGA implementation, we focused on designing the processing unit ourselves that is capable of computing 3072-bit multiplication. We started with a preliminary design of a multiplier run with typical NTT module that perform computation in parallel-in-serial-out manner before we moved into design with radix-R FNT modules that are able to compute in parallel-in-parallel-out manner for better performance. Both radix-2 and radix-4 FNT modules are implemented and a further design of radix-4 FNT module, named radix-4 CTFNT module that is optimized to fit the multiplication algorithm we used is proposed. The proposed radix-4 design sacrificed the scalability for NTT with other parameters setup but is able to compute 16-points NTT with 10% faster performance and costs about 27% less resources to be implemented when compared to a typical radix-4 design.
format Final Year Project / Dissertation / Thesis
author Chang, Boon Chiao
author_facet Chang, Boon Chiao
author_sort Chang, Boon Chiao
title Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
title_short Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
title_full Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
title_fullStr Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
title_full_unstemmed Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
title_sort implementation of high speed large integer multiplication algorithm on contemporary architecture
publishDate 2018
url http://eprints.utar.edu.my/3629/1/ESA%2D2019%2D1601742%2D1.pdf
http://eprints.utar.edu.my/3629/
_version_ 1657491981408927744
spelling my-utar-eprints.36292019-12-17T05:33:32Z Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture Chang, Boon Chiao NA Architecture Cryptosystem plays an important role in cyber security and users privacy protection. To achieve certain level of security, Public Key Cryptography algorithm is performed on large integer that is more than 64-bit, typical bit size supported by conventional Central Processing Unit (CPU) (e.g. 512-bit for Elliptic Curve Cryptography (ECC), 2048-bit for Rivest-Shamir-Adleman (RSA) and million bits for Fully Homomorphic Encryption (FHE)). The computation of cryptographic algorithm required a lot of large integer arithmetic computation, especially large integer exponentiation and modular operation. Hence, large integer multiplication as the core operation of large integer modular exponentiation is the key factor in determining the performance of the cryptosystem in term of computation time.The minimum bit size of encryption/decryption keys to achieve certain security level have increased over years. CPU implementations are becoming less efficient in handling the computation of crypto algorithms. Hence, other contemporary processor architectures such as GPU and FPGA have become popular alternative to speed up the computation in recent years.This dissertation first discusses several existing large integer multiplication algorithms and reviews different methods used to implement thediscussed algorithms done by other researchers in both Graphic Processing Unit (GPU) and Field Programmable Gate Array (FPGA).Compared to GPU, FPGA offered more low level design and development to the implementation. While GPU allowed users to configure each of the cores (processing units) to perform independent tasks, FPGA provides users a platform to design the processing units themselves to perform dedicated arithmetic and logic operation.In our GPU implementation, we present two different large integer algorithms implementation on two generation of NVIDIA GPU architectures, Kepler and Pascal. The former focuses on utilizing the shuffle instructions to reduce memory latency and bulk multiplication that is able to perform up to 900 multiplications with operands size of 1024, 2048, 4096 and 8192-bit. This implementation also proved that our proposed method of utilizing the shuffle instructions feature to share data between registers memory is able to achieve up to 11.45% and 36.54% speedup when compared with conventional methods of storing data in GPU global memory and shared memory respectively; The later implementation on Pascal architecture aims to achieve single high speed multiplication of larger size by utilizing the available GPU resources. We are able to eliminate the bottleneck, the CRT algorithm from our previous implementation by replacing it with another level of CTFNT and hence increased the multiplication operand size to 4096, 192K, 384K and 768K-bit.In our FPGA implementation, we focused on designing the processing unit ourselves that is capable of computing 3072-bit multiplication. We started with a preliminary design of a multiplier run with typical NTT module that perform computation in parallel-in-serial-out manner before we moved into design with radix-R FNT modules that are able to compute in parallel-in-parallel-out manner for better performance. Both radix-2 and radix-4 FNT modules are implemented and a further design of radix-4 FNT module, named radix-4 CTFNT module that is optimized to fit the multiplication algorithm we used is proposed. The proposed radix-4 design sacrificed the scalability for NTT with other parameters setup but is able to compute 16-points NTT with 10% faster performance and costs about 27% less resources to be implemented when compared to a typical radix-4 design. 2018 Final Year Project / Dissertation / Thesis NonPeerReviewed application/pdf http://eprints.utar.edu.my/3629/1/ESA%2D2019%2D1601742%2D1.pdf Chang, Boon Chiao (2018) Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture. Master dissertation/thesis, UTAR. http://eprints.utar.edu.my/3629/
score 13.160551