From 036b0c74c8f712e9fbf55ef41b8d2ae13feb2baf Mon Sep 17 00:00:00 2001
From: Leonard Kugis
Date: Sat, 7 Jan 2023 14:54:34 +0100
Subject: Finished presentation slides

---
 Presentation/structure.md | 83 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)
 create mode 100644 Presentation/structure.md

diff --git a/Presentation/structure.md b/Presentation/structure.md
new file mode 100644
index 0000000..8c772b6
--- /dev/null
+++ b/Presentation/structure.md
@@ -0,0 +1,83 @@
# EIE: Efficient Inference Engine on Compressed Deep Neural Network

## Deep Neural Network

- Convolutional layers
- Fully-connected layers
- The FC layers hold the trained weights; this work focuses on inference only
- Multiply-accumulate (MAC) operations in each layer
- DNN dataflows
- Convolutional layers: ~5% of memory, ~95% of FLOPs
- FC layers: ~5% of FLOPs, 90-95% of memory

## Motivation

- Inference metrics: throughput, latency, model size, energy use
- Uncompressed DNNs do not fit into on-chip SRAM, forcing memory accesses to/from DRAM
- Von Neumann bottleneck
- Figure from Chen 2016
- Additional levels of indirection because of indices (weight reuse)

## Compression

- In general: encode the weights in such a way that the number of bits per weight is reduced

Trivial:

- Apply different kernels/filters to the input
- Apply pooling to the inputs (reduces runtime memory)

More complex:

- Pruning (remove unimportant weights and retrain; 2 approaches)
  - Encode with relative indexing (sketch at the end of these notes)
- Weight quantization with clustering (sketch at the end of these notes)
  - Group similar weights into clusters
  - Minimize the within-cluster sum of squares (WCSS)
  - Different methods to initialize the cluster centroids, e.g. random, linear, CDF-based
  - Indirection because of the shared-weight table lookup
- Huffman coding (binary tree weighted by symbol frequency, applied globally; sketch at the end of these notes)
- Fixed-point quantization of activation functions (refer to CPU optimization)
- Extremely narrow weight engines (4 bit)
- Compressed sparse column (CSC) matrix representation (sketch at the end of these notes)

## EIE implementation

- Per-activation formula (written out at the end of these notes)
- Accelerates sparse and weight-sharing networks
- Uses the CSC representation
  - PEs quickly find the non-zero elements in a column
- Explain the general procedure
- Show image of the architecture
- Non-zero filtering
- Queues for load balancing
- Two separate SRAM banks for the 16-bit pointers to the column boundaries, so both boundaries of a column can be read in one cycle
- Each entry is 8 bits wide: a 4-bit weight reference and a 4-bit count of zeros before the entry
- Table lookup / weight decoding of the reference in the same cycle
- Arithmetic unit: performs the multiply-accumulate
- Read/write unit
  - Source and destination register files
  - Swap their roles on each layer
  - Feed-forward networks

## EIE evaluation

- Speedup: 189x, 13x and 307x faster than CPU, GPU and mGPU, respectively
  - EIE is latency-focused: batch size of 1
- Throughput: 102 GOP/s on the compressed network, equivalent to 3 TOP/s on the uncompressed network
- Energy efficiency: 24,000x, 3,400x and 2,700x more energy efficient than CPU, GPU and mGPU, respectively

- Speed calculation: measure wall-clock times for different workloads
- Energy calculation: total computation time × average measured power
- Sources of energy consumption and reasons for the lower energy consumption:
  - SRAM accesses instead of DRAM accesses
  - The compression scheme and the architecture reduce the number of memory reads
  - Sparsity encoding of the vectors in the CSC representation

## Limitations / future optimizations

- EIE is only capable of (sparse) matrix-vector multiplication
- Other optimization methods
  - In-memory acceleration
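
## Appendix: formula and code sketches

For the per-activation formula: the dense FC-layer computation and the sparse, weight-shared form it becomes after compression, as given in the EIE paper. Here f is the nonlinearity, S the shared weight table, I_ij the 4-bit reference into it, X_i the set of columns holding a non-zero weight in row i, and Y the set of non-zero input activations:

```latex
% Dense fully-connected layer: one MAC per (i, j) pair.
b_i = f\left( \sum_{j=0}^{n-1} W_{ij} \, a_j \right)

% Compressed/sparse form computed by EIE: skip zero weights and zero
% activations, decode weights through the shared table S.
b_i = \mathrm{ReLU}\left( \sum_{j \in X_i \cap Y} S[I_{ij}] \, a_j \right)
```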
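
For the relative-indexing bullet under pruning, a minimal sketch of the idea, assuming 4-bit zero-run lengths as in EIE's CSC entries; the function names are illustrative, not from the paper:

```python
# Sketch of relative (run-length) indexing for a pruned weight vector.
# Assumption: 4-bit run lengths; a run longer than 15 zeros forces a
# filler zero entry, as described in the Deep Compression paper.

def encode_relative(weights, bits=4):
    """Encode a sparse 1-D weight list as (zeros_before, value) pairs."""
    max_run = (1 << bits) - 1
    encoded, run = [], 0
    for w in weights:
        if w == 0 and run < max_run:
            run += 1                      # extend the current zero run
        else:
            encoded.append((run, w))      # w == 0 here means a filler entry
            run = 0
    return encoded

def decode_relative(encoded, length):
    """Rebuild the dense weight vector from the relative encoding."""
    weights, pos = [0.0] * length, 0
    for run, w in encoded:
        pos += run                        # skip the encoded zeros
        weights[pos] = w
        pos += 1
    return weights

example = [0, 0, 3.4, 0, 0, 0, 0, 1.2]
enc = encode_relative(example)            # [(2, 3.4), (4, 1.2)]
assert decode_relative(enc, len(example)) == example
```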
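
For weight quantization with clustering, a minimal k-means sketch that minimizes the WCSS, using the linear centroid initialization mentioned above (random and CDF-based are the alternatives); 16 clusters give 4-bit indices. Plain numpy, with illustrative names:

```python
# Sketch of weight quantization via k-means: group similar weights into
# clusters, then store each weight as a small index into a shared table.
import numpy as np

def quantize_weights(weights, n_clusters=16, iters=50):
    w = weights.ravel()
    # Linear initialization: centroids spread evenly over [min, max].
    centroids = np.linspace(w.min(), w.max(), n_clusters)
    for _ in range(iters):
        # Assignment step: nearest centroid per weight (minimizes WCSS).
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster.
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = w[idx == k].mean()
    return idx.reshape(weights.shape), centroids  # indices + shared table

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)).astype(np.float32)
idx, table = quantize_weights(W)
W_hat = table[idx]   # decoded weights via the shared-table lookup
```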
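
For the Huffman coding bullet, a textbook construction over the quantized weight indices (not the paper's exact tooling); frequent indices receive shorter codes:

```python
# Sketch of Huffman code construction over quantized weight indices.
import heapq
from collections import Counter

def huffman_codes(symbols):
    freq = Counter(symbols)
    # Heap of (weight, tiebreaker, {symbol: code}) entries.
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

codes = huffman_codes([3, 3, 3, 7, 7, 1])  # index 3 gets the shortest code
```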
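
For the CSC representation and the per-activation procedure of the PEs, a sketch of the storage layout and the sparse multiply-accumulate loop; it models the arithmetic, including the skipping of zero activations (non-zero filtering), but not the 4-bit packing or the hardware itself:

```python
# Sketch of CSC storage and the sparse matrix-vector product it enables.
import numpy as np

def to_csc(W):
    """Column-major storage: non-zero values, their row indices, and
    pointers marking each column's boundaries."""
    vals, rows, ptr = [], [], [0]
    for j in range(W.shape[1]):
        nz = np.nonzero(W[:, j])[0]
        vals.extend(W[nz, j])
        rows.extend(nz)
        ptr.append(len(vals))             # column boundary pointer
    return np.array(vals), np.array(rows), np.array(ptr)

def spmv(vals, rows, ptr, a, n_rows):
    """b = W @ a from the CSC arrays, skipping zero activations."""
    b = np.zeros(n_rows)
    for j in np.nonzero(a)[0]:            # non-zero filtering of activations
        for k in range(ptr[j], ptr[j + 1]):   # walk column j's non-zeros
            b[rows[k]] += vals[k] * a[j]      # multiply-accumulate
    return b

W = np.array([[0., 2., 0.],
              [1., 0., 0.],
              [0., 3., 4.]])
a = np.array([5., 0., 1.])
vals, rows, ptr = to_csc(W)
assert np.allclose(spmv(vals, rows, ptr, a, n_rows=3), W @ a)
```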