
AI on Edge: A Mass Accessible Tool for Decision Support Systems

Written By

Sankalp Dayal

Submitted: 29 September 2023 Reviewed: 27 November 2023 Published: 05 January 2024

DOI: 10.5772/intechopen.1003945


From the Edited Volume

Decision Support Systems (DSS) and Tools [Working Title]

Edited by Dr. Tien M. Nguyen


Abstract

Artificial Intelligence (AI) advancements in the last decade have been explosive, and AI is now considered capable of surpassing human intelligence in many decision tasks, such as detecting objects, answering specific questions, or transcribing speech. This also means humans can now offload raw information processing to an AI-enabled machine and use its output as their Decision Support System (DSS). To make this AI-based DSS mass accessible, it has to be pervasive, cheap or free, and available locally where the decision is being made, such as a home or an industrial site. This requires the Machine Learning (ML) models powering the AI to be deployed on edge hardware such as phones, security cameras, and automobiles. Deployment on edge requires compressing these ML models and, in some cases, tailoring them to the hardware. This chapter explains how AI is becoming an important tool for DSS and then discusses the state-of-the-art (SOTA) model compression techniques used for deployment on edge while ensuring no loss in performance.

Keywords

  • deep learning
  • machine learning
  • artificial intelligence
  • edge computing
  • model compression
  • intelligent DSS
  • scientific computing

1. Introduction

Traditionally, DSS have been centered around humans, where the source of information was machines or the decisions of other humans. DSS for applications such as comparing sales figures between one week and the next, projecting revenue based on new product sales assumptions, or simply mapping out the consequences of different decisions have become much higher quality and much larger in volume due to easy, mass accessibility of information. In the last decade, we further saw the penetration of internet-connected cell phones, tablets, and smart speakers, which made this available information accessible and immediate.

However, one key frontier was missing: automating the decision making itself. Humans were responsible for sorting through information and making the decisions that would benefit them. This often resulted in humans being overloaded with information and ending up making rudimentary decisions. This has changed with the onset of AI. AI has now become mature enough that it can make decisions and explain them the same way humans can. Furthermore, AI can now be personalized to a specific human, making it capable of decisions that are truly personalized. Just as the internet made information accessible to the masses in large volumes and cell phones made it accessible immediately, AI also needs to reach the masses immediately to be highly impactful. It needs to be able to explain its decision making and become personalized to truly become a companion in decision making, as a human counterpart would. This has given birth to a new kind of DSS called the Intelligent Decision Support System (IDSS).

The inflection point of the technology that made the internet possible was the lower cost of transferring information [1]. However, the size and cost of computing were still too large for a consumer device like a cell phone. Another inflection point was reached in the last decade, when cost and size dropped to the level where cell phones could have the compute power that servers had 20 years earlier; this transition took roughly 20 years. A similar inflection point occurred in the early 2010s, when compute became cheap enough that the models powering AI became feasible. This gave birth to technology capable of making decisions the way humans do, and for the first time, offloading information processing to these systems became accessible and feasible. The next immediate step is to make this AI mass accessible and immediately available.

ML models are computing intensive by default. This computational heaviness comes from two sources: the number of parameters used by the model and the nature of the operations being performed [2]. These models, however, have a few properties that are unique to them. After an ML model is fully trained, it ends up over-parameterized, meaning the same output for a given input can be obtained even if certain parameters are removed [3]. The second property is that the operations do not require the full precision available in computer hardware. These properties allow the models to be compressed and run at reduced precision: once a model is trained, it no longer needs high computing power or high-precision computation. Unlike the cell phone revolution, which took 20 years, AI can reach the hands of users now, thanks to advanced computational science and technology that allow models to be compressed to small sizes and computed at reduced precision. Another property of these models is that the arithmetic operations across all of them are very similar. This allows the creation of specialized hardware called ML accelerators that accelerate these specific arithmetic operations. These ML accelerators are Application-Specific Integrated Circuits (ASICs) designed differently from the conventional designs used in general-purpose computer hardware [4]. Their key benefit is that they perform the same compute at lower power, and their manufacturing is cheaper than general-purpose computer hardware. In some cases, the models and the hardware are designed together to obtain the most accurate and best performing models.

Lastly, it has also been observed that a large ML model can be trained for general tasks, but if the end application is simpler, a smaller model can be trained with the bigger model acting as a teacher. This process is called knowledge distillation [5]. Distillation allows the creation of smaller models that can run efficiently on edge hardware such as cell phones, smart speakers, and chips that go into smart cars, as well as in industrial settings such as cameras checking for defects or systems estimating equipment life by measuring vibration signatures.

Decision Support Systems will see a revolution in how AI assists humans in day-to-day decision making. By offloading rudimentary and basic decisions to these systems, humans will experience a higher level of cognitive thinking that was never available before. The key barrier to giving humans easy, mass access to AI at their fingertips can be removed with scientific techniques that reduce the computation, precision, and size of these models. These techniques, combined with specifically designed ML accelerator hardware, will bridge this gap.


2. Decision support systems and AI

A Decision Support System (DSS) is an information system primarily used to make decisions that help a business improve. For any DSS, the process can be conceptualized in three steps: gather information, process information, and make a decision or judgment. These are illustrated in the foundational block diagram shown in Figure 1.

Figure 1.

Three-step decision-making process.

The block diagram can repeat, where the outcome of the judgment, or the judgment itself, becomes information for the next set of decisions, as shown in Figure 2.

Figure 2.

Repeating three-step decision-making process with outcomes in the middle.

A good decision that aids a business requires accessing a large amount of information and then sorting through and processing it. Machines have traditionally been good at processing information faster than humans; however, the logic of processing was supplied by humans, who also made the final decision. With the onset of AI, these two tasks can now also be offloaded to a machine. As demonstrated in Figure 3, AI is now capable of analyzing large amounts of data, considering insights from humans, and making decisions on their behalf.

Figure 3.

AI as proxy to humans in decision making.

There are different types of DSS: communication-driven, model-driven, knowledge-driven, document-driven, and data-driven. A high-quality DSS requires all of these. Communication-, knowledge-, and document-driven DSS primarily required human interaction, whereas model- and data-driven DSS could be offloaded to a machine. This implied that the next level of decision making required human and machine to work together, but ultimately the human had to make the decisions. Such decision making is called Human-In-the-Loop (HIL).

With the latest generative AI models like ChatGPT [6], Llama [7], and Whisper [8], all human interaction-based information can be encoded in a language that a machine can understand. Moreover, the latest AI models are now smart enough to use this information and make decisions even better than a human would. For example, a company's social media feed can be fed to a language processing model that predicts user sentiment. This can be fed to another AI model that determines what kind of campaigns to run and which marketing channels to use. Such complicated decisions were traditionally always made by humans and can now be made by AI with no HIL.


3. Machine learning models powering AI in DSS

In the simplest terms, AI is computer software that mimics the way humans think. Machine Learning (ML) is a subset of AI that uses various ML algorithms trained on data to produce ML models that can then mimic human thinking. The latest ML models, whether for processing speech, language, data, or vision, share a common building block: Neural Networks (NN). NNs are mathematical models that generally perform some kind of matrix multiplication and pass the output through a non-linear function, as shown in Figure 4.

Figure 4.

Neural networks as matrix multiplications from input to output.

The simplest example of an NN is the Dense NN, where a series of matrix multiplications, each followed by a non-linear function such as the Rectified Linear Unit (ReLU), gives the network output N(x). Consider an NN with multiple layers. An input is processed by these layers one by one to generate the final output. The layers are attached as a chain, where for intermediate layers the output of the previous layer becomes the input. For dense layers, this can be expressed mathematically as given in Eqs. (1)-(3). For a given layer i, the input x_i can be represented as a vector of real values of dimension I_i. This input is projected, that is, multiplied, by a matrix W_i with I_i rows and O_i columns. The projection is then shifted by adding a vector of values b_i called the bias. The linear operation of projection and shift can be represented as f_i(x_i), as given in Eq. (1).

f_i(x_i) = W_i x_i + b_i, \quad \text{where } x_i \in \mathbb{R}^{I_i},\ W_i \in \mathbb{R}^{I_i \times O_i},\ b_i \in \mathbb{R}^{O_i} \tag{1}

This linearly transformed input is then passed through a non-linear function, represented by \sigma_i. An example is ReLU, which zeroes all values below 0 and can be written as \sigma_i(x) = x^+ = \max(0, x). The output of this non-linear function becomes the output of the layer, represented as L_i(x_i), which in turn becomes the input to the next layer, denoted x_{i+1}, as represented in Eq. (2).

x_{i+1} = L_i(x_i) = \sigma_i(f_i(x_i)) \tag{2}

A Neural Network (NN) with n layers simply chains these together such that the output of layer i becomes the input of layer i+1, as given in Eq. (3).

N(x) = L_n(L_{n-1}(L_{n-2}(\cdots L_2(L_1(x)) \cdots))) \tag{3}

Readers may notice that the input dimension I_i, the output dimension O_i, the weights W_i, the bias vector b_i, and the choice of non-linear function \sigma_i all carry the subscript i to denote that they are specific to the layer. Each layer in the NN therefore has its own unique set of these, and a network with n layers has n unique sets, which can be written as (I_i, O_i, W_i, b_i, \sigma_i) where i \in \{1, \dots, n\}.
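As a minimal illustration, the chained dense layers of Eqs. (1)-(3) can be sketched in NumPy as below. The layer dimensions and random weights are hypothetical, chosen only to show the shapes involved; note that each weight matrix is stored as (O_i, I_i) so that W @ x maps a vector of length I_i to one of length O_i.

```python
import numpy as np

def relu(x):
    # Non-linearity of Eq. (2): zeroes all negative values.
    return np.maximum(0.0, x)

def dense_layer(x, W, b):
    # Linear projection and shift of Eq. (1): f_i(x_i) = W_i x_i + b_i.
    return W @ x + b

def dense_network(x, params):
    # Eq. (3): chain the layers so the output of layer i feeds layer i+1.
    for W, b in params:
        x = relu(dense_layer(x, W, b))
    return x

# Hypothetical 3-layer network with dimensions 8 -> 16 -> 4 -> 2.
rng = np.random.default_rng(0)
dims = [8, 16, 4, 2]
params = [(rng.normal(size=(o, i)), rng.normal(size=o))
          for i, o in zip(dims[:-1], dims[1:])]
y = dense_network(rng.normal(size=dims[0]), params)
print(y.shape)  # (2,)
```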

More complicated networks are called Convolutional Neural Networks (CNN), where the simple matrix multiplication is replaced with a convolution operation, as shown in Eq. (4). Assume the input is a 2D matrix of dimension (h, w). Similar to the dense layer, there is a projection matrix W, but instead of its dimension matching the input dimension, it is of a much smaller size, denoted (m, n), where m and n are much smaller than h and w. This matrix is generally called a kernel. The key difference is that instead of a matrix product, an element-wise multiply-and-accumulate, i.e., a convolution, is applied, as shown in Eq. (4). The output at index (i, j) is obtained by multiplying the input from index (i, j) to (i+m, j+n) with the 2D kernel w of size (m, n) and summing. As with Dense NNs, different CNN layers can be chained together.

y_{i,j} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} w_{a,b}\, x_{i+a,\, j+b} \tag{4}

where i, j are the indices of the input and output, and m \times n is the kernel size of the convolution.
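A naive NumPy sketch of Eq. (4) follows; production frameworks use heavily optimized kernels, but the arithmetic is the same. The input and the averaging kernel here are hypothetical.

```python
import numpy as np

def conv2d(x, kernel):
    # Direct implementation of Eq. (4): slide an (m, n) kernel over an
    # (h, w) input, multiplying and summing element-wise at each position.
    h, w = x.shape
    m, n = kernel.shape
    out = np.zeros((h - m + 1, w - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(kernel * x[i:i + m, j:j + n])
    return out

# Hypothetical 5x5 input and a 3x3 averaging kernel.
x = np.arange(25.0).reshape(5, 5)
k = np.ones((3, 3)) / 9.0
print(conv2d(x, k).shape)  # (3, 3)
```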

Recurrent Neural Networks (RNN) are another type of network, one that processes time-based information. The benefit of RNNs is that they are able to hold previously processed information in a variable called the state, denoted h in Eq. (6). The output y is generated using the current input x and the state h. The new state at time t, h_t, is obtained from the current input x_t and the previous state h_{t-1} by applying projections with the matrices U and W respectively, shifting by b, and passing the result through the tanh non-linearity, as given in Eqs. (5) and (6). To get the output at time t, denoted y_t, the new state h_t is projected and shifted using the matrix V and bias c, respectively, and passed through the softmax non-linearity, as given in Eqs. (7) and (8). This process then continues recursively, preserving information in the state similarly to a Markov process.

a_t = b + W h_{t-1} + U x_t \tag{5}

h_t = \tanh(a_t) \tag{6}

o_t = c + V h_t \tag{7}

y_t = \mathrm{softmax}(o_t) \tag{8}
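A single RNN time step following Eqs. (5)-(8) can be sketched as below; the dimensions and random matrices are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, h_prev, W, U, V, b, c):
    a_t = b + W @ h_prev + U @ x_t  # Eq. (5): project state and input
    h_t = np.tanh(a_t)              # Eq. (6): new state
    o_t = c + V @ h_t               # Eq. (7): project state to output space
    y_t = softmax(o_t)              # Eq. (8): output distribution
    return h_t, y_t

# Hypothetical dimensions: input 4, state 8, output 3.
rng = np.random.default_rng(0)
W, U, V = rng.normal(size=(8, 8)), rng.normal(size=(8, 4)), rng.normal(size=(3, 8))
b, c = np.zeros(8), np.zeros(3)
h = np.zeros(8)
for x_t in rng.normal(size=(5, 4)):  # a sequence of 5 time steps
    h, y = rnn_step(x_t, h, W, U, V, b, c)
```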

Dense networks were useful for smaller problems, CNNs have provided state-of-the-art results on computer vision tasks, and RNNs have shined in speech processing tasks such as speech recognition. Until recently, however, these domains were separate. Transformer-based models [9] have provided a mechanism to apply the same NN structure to all of these tasks. They use an attention-based mechanism: the input x is first projected into three different spaces Q, K, and V, called Query, Key, and Value respectively, as shown in Eq. (9). A similarity score is obtained between Query and Key by taking the dot product between them; the product is normalized by dividing by the vector length and applying the softmax non-linearity. This score is then scaled by the Value to determine how much attention needs to be given to the current input, which is done by taking the dot product of the score with V. This calculation of attention from Q, K, and V is shown in Eq. (10).

Q = W_Q x, \quad K = W_K x, \quad V = W_V x, \quad \text{where the } W \text{ are projection matrices} \tag{9}

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V, \quad \text{where } d_k \text{ is the length of the input vector } x \tag{10}

For transformer-based NNs as well, multiple transformer blocks can be stacked or chained together.
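Scaled dot-product attention per Eqs. (9) and (10) can be sketched as below; the sequence length, embedding size, and projection matrices are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, W_q, W_k, W_v):
    # Eq. (9): project the input into Query, Key, and Value spaces.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = Q.shape[-1]
    # Eq. (10): scaled dot-product similarity between Q and K,
    # normalized with softmax, then used to weight V.
    scores = softmax(Q @ K.T / np.sqrt(d_k))
    return scores @ V

# Hypothetical sequence of 6 tokens with embedding size 16.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(attention(x, W_q, W_k, W_v).shape)  # (6, 16)
```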

ML models carry parameters, generally called weights and biases. These are real numbers w_i \in \mathbb{R}. The total count of these unique parameters is called the model size S, so the weights can be thought of as one big vector in \mathbb{R}^S. During training, data is fed in and the output is constrained to match some decision criterion or objective function. The weights start as random values and, during training, converge to a unique vector that minimizes the error in the objective function. This objective function generally needs some kind of human feedback to ensure the model is trained to simulate human decision making. When the model processes input information, it goes through each layer. The intermediate outputs in each layer are called activations; these are also real numbers a_j \in \mathbb{R}. The activation of the final layer then becomes the output. During these computations the weights are kept constant; this is called the forward pass. The generated output is then matched against human feedback and an objective score is obtained. Based on the objective score, the weights are adjusted so that the score improves the next time the same input is seen. This adjustment of the weights is called the backward pass [8]. The forward and backward passes are shown in Figure 5, and a toy sketch of the training loop follows the figure.

Figure 5.

Forward and backward pass.
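A toy sketch of this loop for a one-layer linear model follows; the data, targets, and learning rate are hypothetical, and the squared-error objective stands in for whatever decision criterion a real DSS model would use.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # hypothetical inputs
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true                             # hypothetical target decisions

w = rng.normal(size=3)                     # weights start as random values
lr = 0.1
for step in range(200):
    y_hat = X @ w                          # forward pass: weights held constant
    loss = np.mean((y_hat - y) ** 2)       # objective score against targets
    grad = 2 * X.T @ (y_hat - y) / len(X)  # backward pass: gradient of the loss
    w -= lr * grad                         # adjust weights to improve the score
print(loss, w)                             # w converges toward w_true
```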

Once the model is fully trained, weight adjustment is no longer needed, and only the forward pass is required. A fully trained model is then ready to be integrated into a DSS and can make decisions as humans would. This integration is generally called model deployment.

The utility of a deployed ML model can be categorized into three performance metrics:

  • Accuracy: An indicator of how well the model decides compared to a human. This metric has to be reasonably high, preferably close to 100%. Generally, the bigger the model, the higher the accuracy.

  • Latency: The time the model takes to generate the output. This is determined by the size of the model, the nature of the computation, and the compute platform. Lower latency is generally preferred, and in some time-critical applications it needs to be close to zero.

  • (Compute) Cost: In an integrated DSS, the models are deployed on some kind of computer hardware, whether servers or edge devices such as cell phones and laptops. In some cases, the models can be so big that they require multiple servers. This compute cost ultimately determines the real cost, as it results in either buying hardware or subscribing to a cloud service. The bigger the model, or the lower the required latency, the higher the cost.

Note that these three performance metrics oppose each other: the bigger the model, the higher the accuracy, but also the higher the latency and cost. And for the same model, lowering the latency further increases the cost, as it requires higher-end computer hardware or a more expensive cloud service.


4. AI on edge: a mass accessibility tool for DSS

Edge computing is the deployment of computing and storage resources at the location where data is produced. This location is typically an electronic device: a laptop, tablet, cell phone, GPS navigation device, bar code scanner, and many more. Such devices are called edge devices. Figure 6 shows the difference between a network-based system and an edge computing-based system.

Figure 6.

Network-based versus edge-based compute.

Computing and storage resources at the edge reduce latency and cost considerably, because the data no longer needs to be transported back and forth between an edge device and a server. More importantly, if computing and storage are at the edge, the system is also fully private: the users generating the data own it, and it never leaves their device. Edge computing [10] thereby comes not only with the benefits of reduced latency and cost but also with higher privacy. From an ML model deployment standpoint, this means that deploying models on edge hardware is generally going to be beneficial to a business developing a DSS.


5. Scientific techniques enabling edge AI

Bringing AI to the edge has clear benefits. For example, with the help of AI, a crop management company using cameras to analyze crop health does not need a human monitoring them; and when the AI is deployed on the camera itself, costly broadband communication is not required to transport the camera feed, since the required AI processing and decision making happen at the edge. However, bringing AI to the edge is not straightforward. These models are very large, which requires a large amount of memory, and the operations in their forward pass need powerful computing platforms. Currently, neither is available on edge hardware. Table 1 provides reference specifications of typical cloud versus edge hardware.

| Specification | Cloud hardware* | Edge hardware** |
|---|---|---|
| Max power | 700 W | 15 W |
| GPU/CPU DDR memory | 80 GB | 8 GB |
| Precision support | FP8, FP16, FP32, INT | FP32, INT |
| Compute type | GPU | CPU |

Table 1.

Typical specifications of cloud versus edge hardware.

*Cloud hardware is considered as an NVIDIA H100.
**Edge hardware is considered as a Raspberry Pi 4.


This is where the properties of ML models can be exploited. ML models require a large number of parameters, or weights, during training. However, once the model is trained, it may not need all of those parameters; this property is called over-parameterization. These models are also probabilistic in nature, meaning they do not need the highest precision in computation and weight representation to generate the same probability distribution over potential outcomes. Observing Eq. (3) closely, one more property stands out: these models are built from the same building blocks, namely matrix multiplications. This property is called common kernel utilization, and cheaper dedicated hardware can be designed to exploit it. These properties lead to the different scientific techniques used to reduce deployment cost without affecting accuracy, as visualized in the illustration in Figure 7.

Figure 7.

Probability distribution of different decisions with and without compression techniques.

5.1 Precision reduction

On a computing machine, a real number is represented in floating point (FP) precision, which typically needs 32 bits, or 4 bytes, of memory (FP32). A model with, say, 7 billion parameters, like Llama, will therefore need 28 billion bytes (28 GB) of memory. During the forward pass, intermediate activations are generated (refer to Eq. (2) for details) that also need to be stored, which for a 7-billion-parameter model can in the worst case add roughly another 16 GB. This can easily cost more than $100 for the DRAM alone. Precision reduction can therefore greatly help in reducing this memory cost. Different precision reduction schemes and their impact on cost are tabulated in Table 2, followed by a back-of-the-envelope sketch of the calculation.

| Precision | Cost of DRAM |
|---|---|
| FP32 | $160 |
| FP16 | $80 |
| FP8/INT8 | $40 |
| FP4/INT4 | $20 |

Table 2.

Cost of the DRAM needed for a 7-billion-parameter model at different precisions [11].
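The arithmetic behind Table 2 can be reproduced in a few lines; the price per gigabyte is a hypothetical placeholder chosen to match the table, since DRAM prices fluctuate [11].

```python
# Rough weight-memory footprint of a 7-billion-parameter model at
# different precisions. PRICE_PER_GB is an assumed illustrative $/GB.
PARAMS = 7e9
PRICE_PER_GB = 5.70

for name, bits in [("FP32", 32), ("FP16", 16), ("FP8/INT8", 8), ("FP4/INT4", 4)]:
    gigabytes = PARAMS * bits / 8 / 1e9    # bits -> bytes -> GB
    print(f"{name:9s} {gigabytes:5.1f} GB  ~${gigabytes * PRICE_PER_GB:,.0f}")
```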

Precision reduction is performed through a process called model quantization, in which a floating-point value is mapped to an integer value and a corresponding scale factor [12]. These integer values are of 8-bit precision or less. A quantization operator Q first applies an affine transformation, dividing the value by a scale S and subtracting a fixed-point value Z called the zero point, and then quantizes the result to an integer by applying a rounding operation. This is defined in Eq. (11). To convert back to a floating-point value, the quantized value is cast to float, the zero point is added, and the result is multiplied by the scale factor, as in Eq. (12). However, due to the rounding operation during quantization, the converted-back value is only approximately equal to the original value. This difference is called quantization error, and the goal of quantization science is to ensure that this difference does not affect the final performance of the model.

Q(r) = \mathrm{round}(r/S - Z) \tag{11}

\tilde{r} = (Q(r) + Z) \cdot S, \quad \text{where } \tilde{r} \approx r \text{ is the de-quantized value} \tag{12}

The number of scale factors for the entire model is considerably smaller than the number of parameters. Even from a computation standpoint, matrix multiplications on 8-bit integers are roughly 16x cheaper in cost and power than FP32 matrix multiplications. A minimal sketch of the quantize/de-quantize round trip follows.
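The sketch below implements Eqs. (11) and (12) for a symmetric int8 scheme; the zero point of 0 and the per-tensor scale are simplifying assumptions, as real schemes choose S and Z per tensor or per channel [12].

```python
import numpy as np

def quantize(r, scale, zero_point):
    # Eq. (11): affine transform, then round to an 8-bit integer.
    q = np.round(r / scale - zero_point)
    return np.clip(q, -127, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    # Eq. (12): cast back to float, add the zero point, rescale.
    return (q.astype(np.float32) + zero_point) * scale

# Hypothetical weight tensor; symmetric scheme: zero point 0 and the
# scale chosen so the largest magnitude maps onto the int8 range.
w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
scale = np.abs(w).max() / 127.0
w_q = quantize(w, scale, 0.0)
error = np.abs(dequantize(w_q, scale, 0.0) - w).max()
print(f"max quantization error: {error:.5f}")  # bounded by ~scale/2
```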

Precision reduction science is also called model quantization. There are various techniques, such as matching the weight histograms of the quantized and original floating-point tensors, correcting weights for quantization, and absorbing scale factors before quantizing and making them powers of two [13]. Quantization has become the de facto method of pre-conditioning models for deployment on edge [14]. Table 3 shows the impact of quantization on model size for a 16 MB model [15] called MobileNet-V1 [16], a popular model used in vision-based processing and visual DSS methods.

| Precision | Quantize only first and last layers (MB) | Quantize whole model (MB) |
|---|---|---|
| 32-bit (baseline) | 16.1 | 16.1 |
| 4-bit | 7.1 | 4.2 |
| 2-bit | 5.6 | 2.2 |
| 1-bit | 4.8 | 1.2 |

Table 3.

Impact of quantization on the model size of MobileNet-V1.

5.2 Model compression

Model compression is based on the over-parameterization property of ML models. This means that to perform with the same accuracy, a model may not need all of its parameters, and some of them can be reduced to zero. Zeroing out some of a model's weights is called sparsification, or pruning, of the model. Different levels of sparsity can be achieved depending on how the model was trained and on the end objective or application of the model. Generally, 25-50% sparsity is achievable. The impact of compression at different sparsity levels on a model like InceptionV3 [17] can be seen in Table 4.

| Sparsity (%) | Number of parameters (M) | Accuracy on image classification (%) |
|---|---|---|
| 0 | 27.1 | 78.1 |
| 50 | 13.6 | 78.0 |
| 75 | 6.8 | 76.1 |
| 87.5 | 3.3 | 74.6 |

Table 4.

Impact of compression on the number of parameters and on accuracy. Image classification accuracy is measured on the ImageNet dataset.

Mathematically, model compression can be represented as a bitmap over the weights, where positions holding a value are set to 1 and the rest to 0. This is illustrated in Figure 8, and a minimal pruning sketch follows the figure.

Figure 8.

Sparse weight structure mapped to a bitmap.
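A minimal magnitude-pruning sketch of this idea follows; magnitude-based selection is one common pruning criterion [17], and the weight matrix here is hypothetical.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    # Zero out the given fraction of weights with the smallest
    # magnitudes and return the pruned tensor plus its bitmap.
    k = int(sparsity * w.size)
    threshold = np.sort(np.abs(w), axis=None)[k]
    bitmap = np.abs(w) >= threshold     # 1 where a value survives
    return w * bitmap, bitmap

# Hypothetical weight matrix pruned to 50% sparsity.
w = np.random.default_rng(0).normal(size=(64, 64))
w_pruned, bitmap = magnitude_prune(w, 0.50)
print(f"kept {bitmap.mean():.0%} of the weights")
```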

5.3 ML accelerator

Accelerated computing is the idea of using specialized hardware to execute computations more efficiently than CPUs, enhancing speed and performance. It is especially beneficial for tasks that can be parallelized. An ML accelerator is computer hardware whose dedicated job is to compute the math operations in ML models, such as matrix multiplications. ML accelerators differ from the general-purpose computing hardware found in most edge devices in that they are specifically designed to run arithmetic operations like matrix multiplication at a higher rate and lower power. This means the same operation, when run on an ML accelerator, will have lower latency and lower cost [18].

However, as powerful as these ML accelerators may sound, they can only perform certain mathematical operations effectively and are therefore very rigid in terms of programmability.

5.4 HW-model co-design using neural architecture search

ML accelerators, as shown in Figure 9, have a unique way of performing computations. This implies that models with large matrices will benefit more than models with smaller ones. This leads to the concept of defining the model itself with the hardware properties taken into consideration. This kind of HW-model co-design can be achieved using architecture search algorithms called Neural Architecture Search (NAS) [19]. In NAS, building blocks are provided and a search algorithm is employed to create different model architectures and check their performance, as shown in Figures 9 and 10.

Figure 9.

Parallelization of matrix multiplications.

Figure 10.

Neural architecture search based on performance criterion like latency, accuracy, memory.

The model architecture that performs best in terms of accuracy and latency is then used for deployment. This can achieve roughly a 30-40% reduction in memory and compute cost, as observed in MobileNetV3 [20]. A simplified sketch of the search loop follows.
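The sketch below shows a random-search version of the NAS loop; real NAS systems use reinforcement learning or gradient-based search [19], and the search space, scoring, and latency proxy here are all hypothetical.

```python
import random

# Hypothetical search space: number of layers and width of each layer.
SEARCH_SPACE = {"depth": [2, 3, 4], "width": [32, 64, 128]}

def evaluate(arch):
    # Placeholder scoring: a real NAS loop would train (or estimate)
    # the candidate and measure accuracy and on-device latency.
    accuracy = random.random()                      # stand-in for measured accuracy
    latency = arch["depth"] * arch["width"] * 1e-3  # stand-in latency proxy
    return accuracy - 0.5 * latency                 # combined performance criterion

best, best_score = None, float("-inf")
for _ in range(20):
    arch = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
    score = evaluate(arch)
    if score > best_score:
        best, best_score = arch, score
print("best architecture found:", best)
```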

5.5 Model distillation

Model distillation is a highly specialized field where not one but two models are trained [5]. First, a considerably large model that is highly accurate is trained. Then a model an order of magnitude smaller is trained, but instead of its objective being to predict outputs as a human would, its objective is to mimic the outputs of the larger model. The larger model is called the teacher and the smaller one the student in this distillation process. The key idea is that the knowledge of the teacher model is distilled into the student by having the student mimic the teacher's outputs rather than the original objective function. Generally, a 10x reduction in model size can be obtained with distillation. Together, these scientific techniques that exploit the properties of ML models can lead to a tremendous reduction in deployment cost by reducing memory and compute requirements. This remains an active field of research [21].
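A minimal sketch of a distillation objective in the spirit of [5] follows; the temperature, batch, and logits are hypothetical, and real setups usually mix this loss with the standard task loss.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy of the student against the teacher's softened
    # output distribution: the student mimics the teacher's outputs.
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    return -np.sum(t * np.log(s + 1e-12), axis=-1).mean()

# Hypothetical logits for a batch of 4 examples over 10 classes.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))
student_logits = rng.normal(size=(4, 10))
print(distillation_loss(student_logits, teacher_logits))
```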


6. Conclusions

Decision Support Systems are entering a new paradigm in which AI penetrates the tools that are currently available, with AI now reaching and surpassing human capability. An AI computing engine can process large amounts of data, even from very disparate sources, and make decisions that aid business decisions through a DSS. Just as the cell phone and internet revolutions put information at people's fingertips and made it accessible to the masses, AI also needs to be made accessible to users. This is only possible if AI is made available on edge devices like phones, tablets, home monitoring systems, etc. AI is powered by large models that are memory and compute hungry, which makes them unsuitable for edge devices, since these devices are always resource constrained. However, these models exhibit the properties of over-parameterization and probability distribution approximation. These properties give rise to techniques like precision reduction, model compression, and model distillation. Because these models use very similar arithmetic operations, this can further be exploited by designing custom hardware for the edge that accelerates these ML models. To improve performance further, the models themselves can be co-designed with the ML accelerator hardware. All these techniques, when applied, can lead to the deployment of ML models on the edge and allow the development of DSS tools that are not just AI powered but also mass accessible due to low latency and cost.


Acknowledgments

I would like to acknowledge Dr. Mahdi Heydari for reviewing the manuscript and providing useful feedback, and Josipa Karadzole for helping throughout the writing of this chapter. I would also like to convey my special thanks to the anonymous reviewer and Dr. Tien M. Nguyen for their constructive comments and review of this chapter.


Conflict of interest

The author declares no conflict of interest.

References

  1. Are We at an AI Inflection Point with ChatGPT? A 2023 Tipping Point to Rival the Internet in 1983. Opace Digital Agency; 2023. Available from: https://www.opace.co.uk/blog/ai-inflection-point-to-surpass-the-internet#the-advent-of-the-internet-in-1983
  2. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436-444. DOI: 10.1038/nature14539
  3. Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM. 2021;64(3):107-115
  4. Reuther A, Michaleas P, Jones M, Gadepally V, Samsi S, Kepner J. AI and ML accelerator survey and trends. In: 2022 IEEE High Performance Extreme Computing Conference (HPEC). Waltham, MA, USA: IEEE; 2022. pp. 1-10. DOI: 10.1109/HPEC55821.2022.9926331
  5. Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. 2015. Available from: https://arxiv.org/abs/1503.02531
  6. Introducing ChatGPT. OpenAI. Available from: https://openai.com/blog/chatgpt
  7. Touvron H et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023. Available from: https://arxiv.org/abs/2302.13971
  8. Introducing Whisper. OpenAI. Available from: https://openai.com/research/whisper
  9. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30. Available from: https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
  10. LeCun Y et al. A theoretical framework for back-propagation. In: Proceedings of the 1988 Connectionist Models Summer School. Vol. 1. Pittsburgh, PA: Morgan Kaufmann, CMU; 1988. pp. 21-28
  11. Cao K, Liu Y, Meng G, Sun Q. An overview on edge computing research. IEEE Access. 2020;8:85714-85728. DOI: 10.1109/ACCESS.2020.2991734; Trends in DRAM price per gigabyte. Available from: https://aiimpacts.org/trends-in-dram-price-per-gigabyte/
  12. Jacob B, Kligys S, Chen B, Zhu M, Tang M, Howard A, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah: IEEE; 2018. pp. 2704-2713
  13. Manohara M, Dayal S, Afzal T, Bakshi R, Fu K. MRQ: Support multiple quantization schemes through model re-quantization. arXiv preprint arXiv:2308.01867. 2023. Available from: https://arxiv.org/abs/2308.01867
  14. Gholami A, Kim S, Dong Z, Yao Z, Mahoney MW, Keutzer K. A survey of quantization methods for efficient neural network inference. In: Low-Power Computer Vision. Chapman and Hall/CRC; 2022. pp. 291-326
  15. Kundu A, Yoo C, Mishra S, Cho M, Adya S. R^2: Range regularization for model compression and quantization. arXiv preprint arXiv:2303.08253. 2023. Available from: https://arxiv.org/abs/2303.08253
  16. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. 2017. Available from: https://arxiv.org/abs/1704.04861
  17. Zhu M, Gupta S. To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878. 2017. Available from: https://arxiv.org/abs/1710.01878
  18. Amazon's new AZ2 chip powers Echo AI like voice recognition, Visual ID. CNET. Available from: https://www.cnet.com/home/smart-home/amazons-new-az2-chip-powers-echo-ai-like-voice-recognition-visual-id/
  19. Zoph B, Le QV. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. 2016. Available from: https://arxiv.org/abs/1611.01578
  20. Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, et al. Searching for MobileNetV3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South). 2019. pp. 1314-1324. DOI: 10.1109/ICCV.2019.00140
  21. Gou J, Yu B, Maybank SJ, Tao D. Knowledge distillation: A survey. International Journal of Computer Vision. 2021;129:1789-1819
