# EIE: Efficient Inference Engine on Compressed Deep Neural Network
## Deep Neural Network
- Convolutional layers
- Fully-connected layers
- In FC layers: trained weights; the focus here is on inference only
- Multiply-Accumulate (MAC) operations on each layer
- DNN dataflows
- Convolutional layers: 5% of memory, 95% of FLOPs
- FC layers: 5% of FLOPs, 90-95% of memory
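
To make the MAC count concrete, each output of a fully-connected layer is one dot product of a weight row with the input activations; the notation below is generic and not quoted from the paper:

```latex
b_i = f\Big(\sum_{j=0}^{n-1} W_{ij}\, a_j\Big), \qquad i = 0,\dots,m-1
% an m x n FC layer performs m*n multiply-accumulate (MAC) operations
% per input vector and stores m*n weights
```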
## Motivation
- Inference metrics: throughput, latency, model size, energy use
- Uncompressed DNNs do not fit into on-chip SRAM, so weights must be fetched from off-chip DRAM
- Von Neumann bottleneck
- Figure from Chen 2016
- Additional level of indirection due to indices (weight sharing)
## Compression
- In general: encode the weights so that fewer bits per weight are needed
Trivial:
- Use different kernels/filters on the input
- Apply pooling to the inputs (reduces runtime memory)
More complex:
- Pruning (remove unimportant weights, then retrain; two approaches)
- Encode weight positions with relative indexing (see the CSC sketch after this list)
- Weight quantization via clustering (see the clustering sketch after this list)
- Group similar weights into clusters
- Minimize the within-cluster sum of squares (WCSS)
- Different methods to initialize the cluster centroids, e.g. random, linear, CDF-based
- Indirection through the shared weight table lookup
- Huffman coding (weighted binary tree, applied globally)
- Fixed-point quantization of the activations (cf. CPU optimizations)
- Extremely narrow weight encoding (4 bit)
- Compressed sparse column (CSC) matrix representation
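
A minimal Python sketch of how one sparse column could be stored as 4-bit relative (run-length) indices plus 4-bit codebook references, in the spirit of the paper's CSC variant; the function name, the `zero_ref` filler, and the 4-bit limit are illustrative assumptions, not the paper's exact data structures:

```python
def encode_column(column, codebook_index, zero_ref=0, max_run=15):
    """Encode one sparse matrix column as (relative_index, weight_ref) pairs.

    column         -- dense list of weight values for this column (zeros included)
    codebook_index -- dict mapping a weight value to its 4-bit shared-table index
    zero_ref       -- codebook index of the zero weight, used for filler entries
    max_run        -- largest zero run a 4-bit field can hold (assumption: 15)
    """
    entries = []   # each entry models one 8-bit slot: 4-bit run length + 4-bit reference
    run = 0
    for w in column:
        if w == 0:
            run += 1
            if run > max_run:                        # run overflows 4 bits:
                entries.append((max_run, zero_ref))  # emit a filler "zero weight"
                run = 0
            continue
        entries.append((run, codebook_index[w]))
        run = 0
    return entries
```

Column pointers (the CSC start indices) are then just the running totals of entries per column, so a PE can jump directly to its slice of any column.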
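
And a small sketch of weight sharing via k-means clustering, which minimizes the WCSS and yields the shared weight table (codebook) plus per-weight references; linear centroid initialization is shown, and scikit-learn's KMeans is used purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_weights(weights, n_clusters=16):
    """Cluster a weight array into n_clusters shared values (4 bits -> 16 clusters).

    Returns (codebook, indices): codebook[k] is the shared weight of cluster k,
    indices has the same shape as `weights` and holds the codebook references.
    """
    flat = weights.reshape(-1, 1)
    # Linear initialization: centroids spread evenly between min and max weight
    init = np.linspace(flat.min(), flat.max(), n_clusters).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, init=init, n_init=1).fit(flat)
    codebook = km.cluster_centers_.ravel()        # shared weight table
    indices = km.labels_.reshape(weights.shape)   # per-weight 4-bit reference
    return codebook, indices
```

During retraining, gradients of all weights in the same cluster would be accumulated to update the shared value; that fine-tuning step is omitted here.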
## EIE implementation
- Per-activation formula (see the worked formula after this list)
- Accelerates sparse and weight sharing networks
- Uses CSC representation
- Each PE quickly finds the non-zero elements in its part of the column
- Explain general procedure
- Show image of the architecture
- Non-zero filtering (leading non-zero detection of the input activations)
- Activation queues for load balancing across PEs
- Two separate SRAM banks for the 16-bit pointers marking column boundaries, so start and end pointer can be read in one cycle
- Each sparse-matrix entry: 8 bits wide (4-bit weight reference and 4-bit relative row index)
- Table lookup (codebook) decodes the weight reference in the same cycle
- Arithmetic Unit: Performs Multiply-Accumulate
- Read/Write unit
- Source and destination activation register files
- They swap roles on each layer (one layer's output is the next layer's input)
- Feed-Forward networks
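
The per-activation formula referenced in the list: with pruning and weight sharing, the dense sum collapses to a sum over positions where both the weight and the input activation are non-zero. The notation loosely follows the paper (S = shared weight table, I_ij = 4-bit codebook index):

```latex
b_i = \mathrm{ReLU}\Big(\sum_{j \,\in\, X_i \cap Y} S[\,I_{ij}\,]\; a_j\Big)
% X_i : columns j with a non-zero weight in row i
% Y   : indices j with a non-zero input activation a_j
```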
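
A software sketch of what one PE does per non-zero input activation, using the column encoding sketched in the compression section; pointer banking, register files, and the fixed-point arithmetic of the real hardware are abstracted away, and all names are illustrative:

```python
def pe_process_activation(j, a_j, col_ptr, entries, codebook, acc):
    """Accumulate the contribution of non-zero activation a_j (column j).

    col_ptr  -- CSC column pointers: column j's entries live in
                entries[col_ptr[j] : col_ptr[j + 1]]
    entries  -- list of (relative_row_index, weight_ref) pairs (4 bit each in hardware)
    codebook -- shared weight table decoding a reference to a real weight value
    acc      -- accumulator array holding this PE's partial output activations
    """
    row = 0
    for rel, ref in entries[col_ptr[j]:col_ptr[j + 1]]:
        row += rel                       # skip the encoded run of zeros
        acc[row] += codebook[ref] * a_j  # multiply-accumulate
        row += 1                         # move past the non-zero position
    return acc
```

In hardware, the loop over non-zero activations is driven by the activation queue, and each iteration of the inner loop maps to one multiply-accumulate in the arithmetic unit.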
## EIE evaluation
- Speedup: 189x, 13x, and 307x faster than CPU, GPU, and mobile GPU, respectively
- EIE is latency-focused: batch size of 1
- Throughput: 102 GOP/s on the compressed network, corresponding to 3 TOP/s on the uncompressed network
- Energy efficiency: 24,000x, 3,400x, and 2,700x better than CPU, GPU, and mobile GPU, respectively
- Speed calculation: measure wall-clock times for different workloads
- Energy calculation: total computation time × average measured power (see the formula after this list)
- Sources of energy consumption and reasons for the reduction:
- SRAM access instead of DRAM
- Compression scheme and architecture reduce the number of memory reads
- Vector sparsity encoded in the CSC representation
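
The energy estimate above, written as a formula (a direct reading of the bullet, not a quote from the paper):

```latex
E \approx t_{\text{compute}} \cdot P_{\text{avg}}
```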
## Limitations / future optimizations
- EIE is only capable of (sparse) matrix-vector multiplication
- Other optimization methods
- In-Memory Acceleration