Staff View: Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture

Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture

Cryptosystem plays an important role in cyber security and users privacy protection. To achieve certain level of security, Public Key Cryptography algorithm is performed on large integer that is more than 64-bit, typical bit size supported by conventional Central Processing Unit (CPU) (e.g. 512-bit...

Full description

Saved in:

Bibliographic Details
Main Author:	Chang, Boon Chiao
Format:	Final Year Project / Dissertation / Thesis
Published:	2018
Subjects:	NA Architecture
Online Access:	http://eprints.utar.edu.my/3629/1/ESA%2D2019%2D1601742%2D1.pdf http://eprints.utar.edu.my/3629/
Tags:	Add Tag No Tags, Be the first to tag this record!

id	my-utar-eprints.3629
record_format	eprints
institution	Universiti Tunku Abdul Rahman
building	UTAR Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Tunku Abdul Rahman
content_source	UTAR Institutional Repository
url_provider	http://eprints.utar.edu.my
topic	NA Architecture
spellingShingle	NA Architecture Chang, Boon Chiao Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
description	Cryptosystem plays an important role in cyber security and users privacy protection. To achieve certain level of security, Public Key Cryptography algorithm is performed on large integer that is more than 64-bit, typical bit size supported by conventional Central Processing Unit (CPU) (e.g. 512-bit for Elliptic Curve Cryptography (ECC), 2048-bit for Rivest-Shamir-Adleman (RSA) and million bits for Fully Homomorphic Encryption (FHE)). The computation of cryptographic algorithm required a lot of large integer arithmetic computation, especially large integer exponentiation and modular operation. Hence, large integer multiplication as the core operation of large integer modular exponentiation is the key factor in determining the performance of the cryptosystem in term of computation time.The minimum bit size of encryption/decryption keys to achieve certain security level have increased over years. CPU implementations are becoming less efficient in handling the computation of crypto algorithms. Hence, other contemporary processor architectures such as GPU and FPGA have become popular alternative to speed up the computation in recent years.This dissertation first discusses several existing large integer multiplication algorithms and reviews different methods used to implement thediscussed algorithms done by other researchers in both Graphic Processing Unit (GPU) and Field Programmable Gate Array (FPGA).Compared to GPU, FPGA offered more low level design and development to the implementation. While GPU allowed users to configure each of the cores (processing units) to perform independent tasks, FPGA provides users a platform to design the processing units themselves to perform dedicated arithmetic and logic operation.In our GPU implementation, we present two different large integer algorithms implementation on two generation of NVIDIA GPU architectures, Kepler and Pascal. The former focuses on utilizing the shuffle instructions to reduce memory latency and bulk multiplication that is able to perform up to 900 multiplications with operands size of 1024, 2048, 4096 and 8192-bit. This implementation also proved that our proposed method of utilizing the shuffle instructions feature to share data between registers memory is able to achieve up to 11.45% and 36.54% speedup when compared with conventional methods of storing data in GPU global memory and shared memory respectively; The later implementation on Pascal architecture aims to achieve single high speed multiplication of larger size by utilizing the available GPU resources. We are able to eliminate the bottleneck, the CRT algorithm from our previous implementation by replacing it with another level of CTFNT and hence increased the multiplication operand size to 4096, 192K, 384K and 768K-bit.In our FPGA implementation, we focused on designing the processing unit ourselves that is capable of computing 3072-bit multiplication. We started with a preliminary design of a multiplier run with typical NTT module that perform computation in parallel-in-serial-out manner before we moved into design with radix-R FNT modules that are able to compute in parallel-in-parallel-out manner for better performance. Both radix-2 and radix-4 FNT modules are implemented and a further design of radix-4 FNT module, named radix-4 CTFNT module that is optimized to fit the multiplication algorithm we used is proposed. The proposed radix-4 design sacrificed the scalability for NTT with other parameters setup but is able to compute 16-points NTT with 10% faster performance and costs about 27% less resources to be implemented when compared to a typical radix-4 design.
format	Final Year Project / Dissertation / Thesis
author	Chang, Boon Chiao
author_facet	Chang, Boon Chiao
author_sort	Chang, Boon Chiao
title	Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
title_short	Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
title_full	Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
title_fullStr	Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
title_full_unstemmed	Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture
title_sort	implementation of high speed large integer multiplication algorithm on contemporary architecture
publishDate	2018
url	http://eprints.utar.edu.my/3629/1/ESA%2D2019%2D1601742%2D1.pdf http://eprints.utar.edu.my/3629/
_version_	1657491981408927744
spelling	my-utar-eprints.36292019-12-17T05:33:32Z Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture Chang, Boon Chiao NA Architecture Cryptosystem plays an important role in cyber security and users privacy protection. To achieve certain level of security, Public Key Cryptography algorithm is performed on large integer that is more than 64-bit, typical bit size supported by conventional Central Processing Unit (CPU) (e.g. 512-bit for Elliptic Curve Cryptography (ECC), 2048-bit for Rivest-Shamir-Adleman (RSA) and million bits for Fully Homomorphic Encryption (FHE)). The computation of cryptographic algorithm required a lot of large integer arithmetic computation, especially large integer exponentiation and modular operation. Hence, large integer multiplication as the core operation of large integer modular exponentiation is the key factor in determining the performance of the cryptosystem in term of computation time.The minimum bit size of encryption/decryption keys to achieve certain security level have increased over years. CPU implementations are becoming less efficient in handling the computation of crypto algorithms. Hence, other contemporary processor architectures such as GPU and FPGA have become popular alternative to speed up the computation in recent years.This dissertation first discusses several existing large integer multiplication algorithms and reviews different methods used to implement thediscussed algorithms done by other researchers in both Graphic Processing Unit (GPU) and Field Programmable Gate Array (FPGA).Compared to GPU, FPGA offered more low level design and development to the implementation. While GPU allowed users to configure each of the cores (processing units) to perform independent tasks, FPGA provides users a platform to design the processing units themselves to perform dedicated arithmetic and logic operation.In our GPU implementation, we present two different large integer algorithms implementation on two generation of NVIDIA GPU architectures, Kepler and Pascal. The former focuses on utilizing the shuffle instructions to reduce memory latency and bulk multiplication that is able to perform up to 900 multiplications with operands size of 1024, 2048, 4096 and 8192-bit. This implementation also proved that our proposed method of utilizing the shuffle instructions feature to share data between registers memory is able to achieve up to 11.45% and 36.54% speedup when compared with conventional methods of storing data in GPU global memory and shared memory respectively; The later implementation on Pascal architecture aims to achieve single high speed multiplication of larger size by utilizing the available GPU resources. We are able to eliminate the bottleneck, the CRT algorithm from our previous implementation by replacing it with another level of CTFNT and hence increased the multiplication operand size to 4096, 192K, 384K and 768K-bit.In our FPGA implementation, we focused on designing the processing unit ourselves that is capable of computing 3072-bit multiplication. We started with a preliminary design of a multiplier run with typical NTT module that perform computation in parallel-in-serial-out manner before we moved into design with radix-R FNT modules that are able to compute in parallel-in-parallel-out manner for better performance. Both radix-2 and radix-4 FNT modules are implemented and a further design of radix-4 FNT module, named radix-4 CTFNT module that is optimized to fit the multiplication algorithm we used is proposed. The proposed radix-4 design sacrificed the scalability for NTT with other parameters setup but is able to compute 16-points NTT with 10% faster performance and costs about 27% less resources to be implemented when compared to a typical radix-4 design. 2018 Final Year Project / Dissertation / Thesis NonPeerReviewed application/pdf http://eprints.utar.edu.my/3629/1/ESA%2D2019%2D1601742%2D1.pdf Chang, Boon Chiao (2018) Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture. Master dissertation/thesis, UTAR. http://eprints.utar.edu.my/3629/
score	13.160551

Implementation of High Speed Large Integer Multiplication Algorithm on Contemporary Architecture

Similar Items