# EIE: Efficient Inference Engine on Compressed Deep Neural Network
## Deep Neural Network
- Convolutional layers
- Fully-connected layers
- In FC layers: trained weights; the focus here is on inference only
- Multiply-Accumulate (MAC) operations on each layer
- DNN dataflows
- Convolutional layers: 5% of memory, 95% of FLOPs
- FC layers: 5% of FLOPs, 90-95% of memory
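
To make the MAC count concrete, each output of a fully-connected layer is one dot product of a weight row with the input activations; the notation below is generic and not quoted from the paper:

```latex
b_i = f\Big(\sum_{j=0}^{n-1} W_{ij}\, a_j\Big), \qquad i = 0,\dots,m-1
% an m x n FC layer performs m*n multiply-accumulate (MAC) operations
% per input vector and stores m*n weights
```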
## Motivation
- Inference metrics: throughput, latency, model size, energy use
- Uncompressed DNNs do not fit into on-chip SRAM, so weights must be fetched from off-chip DRAM
- Von Neumann bottleneck
- Figure from Chen 2016
- Additional level of indirection due to indices (weight sharing)
## Compression
- In general: encode the weights so that fewer bits per weight are needed
Trivial:
- Use different kernels/filters on the input
- Apply pooling to the inputs (reduces runtime memory)
More complex:
- Pruning (remove unimportant weights, then retrain; two approaches)
- Encode weight positions with relative indexing (see the CSC sketch after this list)
- Weight quantization via clustering (see the clustering sketch after this list)
- Group similar weights into clusters
- Minimize the within-cluster sum of squares (WCSS)
- Different methods to initialize the cluster centroids, e.g. random, linear, CDF-based
- Indirection through the shared weight table lookup
- Huffman coding (weighted binary tree, applied globally)
- Fixed-point quantization of the activations (cf. CPU optimizations)
- Extremely narrow weight encoding (4 bit)
- Compressed sparse column (CSC) matrix representation
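
A minimal Python sketch of how one sparse column could be stored as 4-bit relative (run-length) indices plus 4-bit codebook references, in the spirit of the paper's CSC variant; the function name, the `zero_ref` filler, and the 4-bit limit are illustrative assumptions, not the paper's exact data structures:

```python
def encode_column(column, codebook_index, zero_ref=0, max_run=15):
    """Encode one sparse matrix column as (relative_index, weight_ref) pairs.

    column         -- dense list of weight values for this column (zeros included)
    codebook_index -- dict mapping a weight value to its 4-bit shared-table index
    zero_ref       -- codebook index of the zero weight, used for filler entries
    max_run        -- largest zero run a 4-bit field can hold (assumption: 15)
    """
    entries = []   # each entry models one 8-bit slot: 4-bit run length + 4-bit reference
    run = 0
    for w in column:
        if w == 0:
            run += 1
            if run > max_run:                        # run overflows 4 bits:
                entries.append((max_run, zero_ref))  # emit a filler "zero weight"
                run = 0
            continue
        entries.append((run, codebook_index[w]))
        run = 0
    return entries
```

Column pointers (the CSC start indices) are then just the running totals of entries per column, so a PE can jump directly to its slice of any column.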
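
And a small sketch of weight sharing via k-means clustering, which minimizes the WCSS and yields the shared weight table (codebook) plus per-weight references; linear centroid initialization is shown, and scikit-learn's KMeans is used purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_weights(weights, n_clusters=16):
    """Cluster a weight array into n_clusters shared values (4 bits -> 16 clusters).

    Returns (codebook, indices): codebook[k] is the shared weight of cluster k,
    indices has the same shape as `weights` and holds the codebook references.
    """
    flat = weights.reshape(-1, 1)
    # Linear initialization: centroids spread evenly between min and max weight
    init = np.linspace(flat.min(), flat.max(), n_clusters).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, init=init, n_init=1).fit(flat)
    codebook = km.cluster_centers_.ravel()        # shared weight table
    indices = km.labels_.reshape(weights.shape)   # per-weight 4-bit reference
    return codebook, indices
```

During retraining, gradients of all weights in the same cluster would be accumulated to update the shared value; that fine-tuning step is omitted here.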
## EIE implementation
- Per-activation formula (see the worked formula after this list)
- Accelerates sparse and weight sharing networks
- Uses CSC representation
- Each PE quickly finds the non-zero elements in its part of the column
- Explain general procedure
- Show image of the architecture
- Non-zero filtering (leading non-zero detection of the input activations)
- Activation queues for load balancing across PEs
- Two separate SRAM banks for the 16-bit pointers marking column boundaries, so start and end pointer can be read in one cycle
- Each sparse-matrix entry: 8 bits wide (4-bit weight reference and 4-bit relative row index)
- Table lookup (codebook) decodes the weight reference in the same cycle
- Arithmetic Unit: Performs Multiply-Accumulate
- Read/Write unit
- Source and destination activation register files
- They swap roles on each layer (one layer's output is the next layer's input)
- Feed-Forward networks
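
The per-activation formula referenced in the list: with pruning and weight sharing, the dense sum collapses to a sum over positions where both the weight and the input activation are non-zero. The notation loosely follows the paper (S = shared weight table, I_ij = 4-bit codebook index):

```latex
b_i = \mathrm{ReLU}\Big(\sum_{j \,\in\, X_i \cap Y} S[\,I_{ij}\,]\; a_j\Big)
% X_i : columns j with a non-zero weight in row i
% Y   : indices j with a non-zero input activation a_j
```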
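
A software sketch of what one PE does per non-zero input activation, using the column encoding sketched in the compression section; pointer banking, register files, and the fixed-point arithmetic of the real hardware are abstracted away, and all names are illustrative:

```python
def pe_process_activation(j, a_j, col_ptr, entries, codebook, acc):
    """Accumulate the contribution of non-zero activation a_j (column j).

    col_ptr  -- CSC column pointers: column j's entries live in
                entries[col_ptr[j] : col_ptr[j + 1]]
    entries  -- list of (relative_row_index, weight_ref) pairs (4 bit each in hardware)
    codebook -- shared weight table decoding a reference to a real weight value
    acc      -- accumulator array holding this PE's partial output activations
    """
    row = 0
    for rel, ref in entries[col_ptr[j]:col_ptr[j + 1]]:
        row += rel                       # skip the encoded run of zeros
        acc[row] += codebook[ref] * a_j  # multiply-accumulate
        row += 1                         # move past the non-zero position
    return acc
```

In hardware, the loop over non-zero activations is driven by the activation queue, and each iteration of the inner loop maps to one multiply-accumulate in the arithmetic unit.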
## EIE evaluation
- Speedup: 189x, 13x, and 307x faster than CPU, GPU, and mobile GPU, respectively
- EIE is latency-focused: batch size of 1
- Throughput: 102 GOP/s on the compressed network, corresponding to 3 TOP/s on the uncompressed network
- Energy efficiency: 24,000x, 3,400x, and 2,700x better than CPU, GPU, and mobile GPU, respectively
- Speed calculation: measure wall-clock times for different workloads
- Energy calculation: total computation time × average measured power (see the formula after this list)
- Sources of energy consumption and reasons for the reduction:
- SRAM access instead of DRAM
- Compression scheme and architecture reduce the number of memory reads
- Vector sparsity encoded in the CSC representation
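
The energy estimate above, written as a formula (a direct reading of the bullet, not a quote from the paper):

```latex
E \approx t_{\text{compute}} \cdot P_{\text{avg}}
```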
## Limitations / future optimizations
- EIE is only capable of (sparse) matrix-vector multiplication
- Other optimization methods
- In-Memory Acceleration