Fig. 11. Sparse weights after basis projection [50].
research on exploring the impact of bitwidth on accuracy [51]. In fact, commercial hardware for DNNs has recently been reported to support 8-bit integer operations [52]. As bitwidths can vary by layer, hardware optimizations have been explored to exploit the reduced bitwidth for 2.56× energy savings [53] or a 2.24× increase in throughput [54] compared to a 16-bit fixed-point implementation. With more significant changes to the network, it is possible to reduce the bitwidth down to 1 bit for either the weights [55] or both weights and activations [56, 57], at the cost of reduced accuracy. The impact of 1-bit weights on hardware is explored in [58].
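To make the reduced-precision idea concrete, the sketch below shows symmetric uniform quantization of weights to 8-bit integers, consistent with the 8-bit integer operations mentioned above. The function names and the per-tensor scale are illustrative choices, not the specific schemes of [51]-[54].

```python
import numpy as np

def quantize_uniform(w, bits=8):
    """Uniformly quantize an array to signed `bits`-bit integers
    using a single (per-tensor) scale -- symmetric quantization."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits
    scale = np.max(np.abs(w)) / qmax         # one scale for the tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.42, -1.27, 0.008, 0.95], dtype=np.float32)
q, s = quantize_uniform(w, bits=8)
w_hat = dequantize(q, s)
# the per-weight reconstruction error is bounded by scale / 2
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-7
```

At inference time the MACs then operate on the 8-bit integers `q`, and the scale is applied once to the accumulated result.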
B. Sparsity
For SVM classification, the weights can be projected onto a basis such that the resulting weights are sparse, giving a 2× reduction in the number of multiplications [50] (Fig. 11). For feature extraction, the input image can be made sparse by pre-processing for a 24% reduction in power consumption [48].
For DNNs, the number of MACs and weights can be reduced by removing weights through a process called pruning. This was first explored in [59], where weights with minimal impact on the output were removed. In [60], pruning is applied to modern DNNs by removing weights with small magnitudes. However, removing weights does not necessarily lead to lower energy. Accordingly, in [61] weights are removed based on an energy model to directly minimize energy consumption. The tool used for energy modeling can be found at [62].
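As an illustration of magnitude-based pruning in the spirit of [60] (the threshold selection below is a simple assumption, not the published procedure):

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights.
    MACs involving the zeroed weights can then be skipped."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # threshold at the k-th smallest magnitude (ties prune extra weights)
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask

w = np.array([0.9, -0.05, 0.4, 0.01, -0.7, 0.03], dtype=np.float32)
pruned = prune_by_magnitude(w, sparsity=0.5)
# half the weights (the smallest in magnitude) are now zero
```

In practice pruning is followed by retraining to recover accuracy, and, as noted above, the pruning criterion can be energy-driven rather than magnitude-driven [61].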
Specialized hardware has been proposed in [47, 50, 63, 64] to exploit sparse weights for increased speed or reduced energy consumption. In Eyeriss [47], the processing elements are designed to skip reads and MACs when the inputs are zero, resulting in a 45% energy reduction. In [50], by using specialized hardware to skip the zero-valued weights, the energy and storage cost are reduced by 43% and 34%, respectively.
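The zero-gating idea used in Eyeriss can be sketched as a functional model; this toy loop only counts the MACs actually issued and is in no way a model of the real datapath:

```python
def mac_with_zero_skip(inputs, weights):
    """Functional model of a zero-gated MAC datapath: the weight read
    and the multiply are skipped whenever the input is zero.
    Returns the dot product and the number of MACs issued."""
    acc = 0.0
    macs_issued = 0
    for x, w in zip(inputs, weights):
        if x == 0:          # zero-gating: no read, no multiply
            continue
        acc += x * w
        macs_issued += 1
    return acc, macs_issued

acc, issued = mac_with_zero_skip([0, 3, 0, 0, 2], [5, 1, 7, 2, 4])
# only 2 of the 5 MACs are issued; the result is unchanged
```

The result is bit-exact with the dense computation; the savings come entirely from the operations (and memory reads) that are never performed.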
C. Compression
Data movement and storage are important factors in both energy and cost. Feature extraction can result in sparse data (e.g., the gradient in HOG and ReLU in DNNs), and the weights used in classification can also be made sparse by pruning. As a result, compression can be applied to exploit these data statistics to reduce data movement and storage cost.
Various forms of lightweight compression have been explored to reduce data movement. Lossless compression can be used to reduce the transfer of data on and off chip [11, 53, 64].
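A minimal sketch of one such lossless scheme, run-length coding of sparse activations; the (value, run-length) pair format below is an illustrative assumption, not a published encoding:

```python
def rle_encode(xs):
    """Encode a sequence as (value, run_length) pairs; the long zero
    runs produced by ReLU make the encoding compact."""
    runs = []
    for x in xs:
        if runs and runs[-1][0] == x:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([x, 1])     # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out

acts = [0, 0, 0, 5, 0, 0, 7, 7, 0, 0, 0, 0]
runs = rle_encode(acts)
assert rle_decode(runs) == acts     # lossless round trip
```

The compression ratio depends directly on the activation sparsity, which is why such schemes pair well with ReLU outputs and pruned weights.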
Simple run-length coding of the activations in [65] provides up to 1.9× bandwidth reduction, which is within 5-10% of the theoretical entropy limit. Lossy compression such as vector quantization can also be used on feature vectors [50] and weights [8, 12, 66] such that they can be stored on-chip at low cost. Generally, the cost of the compression/decompression is on the order of a few thousand kgates with minimal energy overhead. In the lossy compression case, it is also important
to evaluate the impact on performance accuracy.

VII. OPPORTUNITIES IN MIXED-SIGNAL CIRCUITS
Most of the data movement is between the memory and the processing element (PE), and also between the sensor and the PE. In this section, we discuss how this is addressed using mixed-signal circuit design. However, circuit non-idealities should also be factored into the algorithm design; these circuits can benefit from the reduced-precision algorithms discussed in Section VI. In addition, since training often occurs in the digital domain, the ADC and DAC conversion overhead should also be accounted for when evaluating the system.
While spatial architectures bring the memory closer to the computation (i.e., into the PE), there have also been efforts to integrate the computation into the memory itself. For instance, in [67] the classification is embedded in the SRAM. Specifically, the word line (WL) is driven by a 5-bit feature vector using a DAC, while the bit-cells store the binary weights ±1. The bit-cell current is effectively a product of the value of the feature vector and the value of the weight stored in the bit-cell; the currents from the column are added together to discharge the bit line (BL or BLB). A comparator is then used to compare the resulting dot product to a threshold, specifically sign thresholding of the differential bit lines. Due to the variations in the bit-cells, this is considered a weak classifier, and boosting is needed to combine the weak classifiers to form a strong classifier [68]. This approach gives a 12× energy savings over reading the 1-bit weights from the SRAM.
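A behavioral model of this in-SRAM dot product may help: a 5-bit feature drives each word line, binary ±1 weights sit in the bit-cells, and the sign of the column sum is the weak classifier's decision. This sketch is ideal-valued only; in hardware, bit-cell variation perturbs the analog sum, which is precisely why boosting [68] is needed.

```python
import numpy as np

def sram_weak_classifier(features, weights):
    """Idealized model of in-SRAM classification [67]: 5-bit features
    (word-line DACs) times binary +/-1 weights (bit-cells), summed on
    the bit lines, then sign-thresholded by a comparator."""
    assert np.all((features >= 0) & (features < 32))  # 5-bit DAC range
    assert np.all(np.isin(weights, [-1, 1]))          # binary bit-cell weights
    bitline_sum = np.dot(features, weights)           # analog charge summation
    return 1 if bitline_sum >= 0 else -1              # comparator (sign)

x = np.array([3, 17, 0, 31, 8])
w = np.array([1, -1, 1, 1, -1])
label = sram_weak_classifier(x, w)
```

A strong classifier would combine many such sign decisions with boosting weights, which absorbs the per-column analog error.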
Recent work has also explored the use of mixed-signal circuits to reduce the computation cost of the MAC. It was shown in [69] that performing the MAC using switched capacitors can be more energy-efficient than digital circuits despite the ADC and DAC conversion overhead. Accordingly, the matrix multiplication can be integrated into the ADC, as demonstrated in [70], where the most significant bits of the multiplications for Adaboost classification are performed using switched capacitors in an 8-bit successive approximation format. This is extended in [71] to perform not only the multiplications but also the accumulation in the analog domain. It is assumed that 3 bits and 6 bits are sufficient to represent the weights and input vectors, respectively. This enables the computation to move closer to the sensor and reduces the number of ADC conversions by 21×.
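To give a feel for the precision assumption in [71] (3-bit weights, 6-bit inputs), the low-precision MAC can be emulated digitally; the charge-domain analog accumulation itself is not modeled here, and the scaling scheme is an illustrative assumption.

```python
import numpy as np

def fixed_point_mac(x, w, x_bits=6, w_bits=3):
    """Emulate a low-precision MAC: quantize inputs to x_bits and
    weights to w_bits (signed, symmetric), accumulate exactly in
    integers, then rescale back to real units."""
    x_scale = np.max(np.abs(x)) / (2 ** (x_bits - 1) - 1)
    w_scale = np.max(np.abs(w)) / (2 ** (w_bits - 1) - 1)
    xq = np.round(x / x_scale).astype(np.int32)
    wq = np.round(w / w_scale).astype(np.int32)
    return int(np.dot(xq, wq)) * x_scale * w_scale

x = np.array([0.10, -0.52, 0.33])
w = np.array([0.75, -0.25, 0.50])
approx = fixed_point_mac(x, w)
exact = float(np.dot(x, w))
# the 6-bit/3-bit result tracks the full-precision dot product
```

Whether such aggressive weight quantization preserves accuracy is application-dependent, which ties back to the bitwidth/accuracy trade-offs of Section VI.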