| Field | Value | Date |
| --- | --- | --- |
| author | Leonard Kugis <leonard@kug.is> | 2023-01-07 14:54:34 +0100 |
| committer | Leonard Kugis <leonard@kug.is> | 2023-01-07 14:54:34 +0100 |
| commit | 036b0c74c8f712e9fbf55ef41b8d2ae13feb2baf (patch) | |
| tree | c33a3de067e1ac8ef756f05521a6534bafdfa4fb /Presentation/structure.md | |
| parent | 5ef7ef8d615ab3098e0b90f18af939908d4f4dfa (diff) | |
Finished presentation slides
Diffstat (limited to 'Presentation/structure.md')
-rw-r--r-- | Presentation/structure.md | 83 |
1 file changed, 83 insertions(+), 0 deletions(-)
diff --git a/Presentation/structure.md b/Presentation/structure.md
new file mode 100644
index 0000000..8c772b6
--- /dev/null
+++ b/Presentation/structure.md
@@ -0,0 +1,83 @@

# EIE: Efficient Inference Engine on Compressed Deep Neural Network

## Deep Neural Network

- Convolutional layers
- Fully-connected layers
- FC layers hold the trained weights; this work focuses on inference only
- Multiply-Accumulate (MAC) on each layer
- DNN dataflows
- Convolutional layers: 5% of memory, 95% of FLOPs
- FC layers: 5% of FLOPs, 90-95% of memory

## Motivation

- Inference metrics: throughput, latency, model size, energy use
- Uncompressed DNN: does not fit into SRAM, so weights must be fetched from DRAM
- Von Neumann bottleneck
- Figure from Chen 2016
- Additional levels of indirection because of indices (weight reuse)

## Compression

- In general: encode the weights so that fewer bits are needed per weight

Trivial:

- Use different kernels/filters on the input
- Apply pooling to the inputs (runtime memory)

More complex:

- Pruning (remove unimportant weights and retrain, 2 approaches)
  - Encode with relative indexing
- Weight quantization with clustering (see the k-means sketch after the diff)
  - Group similar weights into clusters
  - Minimize WCSS
  - Different methods to initialize cluster centroids, e.g. random, linear, CDF-based
  - Indirection because of shared weight table lookup
- Huffman encoding (binary tree weighted by symbol frequency, applied globally; see the sketch after the diff)
- Fixed-point quantization of activation functions (refer to CPU optimization)
- Extremely narrow weight engines (4 bit)
- Compressed sparse column (CSC) matrix representation (see the encoding sketch after the diff)

## EIE implementation

- Per-activation formula (written out after the diff)
- Accelerates pruned (sparse) and weight-sharing networks
- Uses CSC representation
  - PE quickly finds the non-zero elements in its column
- Explain the general procedure
- Show image of the architecture
- Non-zero filtering
- Queues for load balancing
- Two SRAM banks for the pointers (16 bit) to the column boundaries
- Each entry: 8 bits wide (4-bit weight reference and 4-bit relative index)
- Table lookup / weight decoding of the reference in the same cycle
- Arithmetic unit: performs multiply-accumulate
- Read/write unit
  - Source and destination register files
  - Swap their roles on each layer
  - Feed-forward networks

## EIE evaluation

- Speedup: 189x, 13x and 307x faster than CPU, GPU and mGPU
  - EIE is latency-focused: batch size of 1
- Throughput: 102 GOP/s on the compressed network, equivalent to 3 TOP/s on the uncompressed network
- Energy efficiency: 24,000x, 3,400x and 2,700x more energy efficient than CPU, GPU and mGPU

- Speed calculation: measure wall-clock times for different workloads
- Energy calculation: total computation time x average measured power
- Sources of energy consumption and reasons for the lower energy consumption:
  - SRAM access instead of DRAM
  - Compression scheme and architecture reduce the number of memory reads
  - Vector sparsity encoding in the CSC representation

## Limitations / future optimizations

- EIE is only capable of matrix multiplication
- Other optimization methods
  - In-memory acceleration
\ No newline at end of file
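
The outline above names several compression techniques without spelling them out; the short sketches below make a few of them concrete. First, the "weight quantization with clustering" bullet: a minimal Python sketch (my own illustration, not code from this repository) of a Lloyd-style k-means that minimizes the within-cluster sum of squares (WCSS) over the weights, uses the linear centroid initialization mentioned in the outline, and returns a shared weight table plus small cluster indices.

```python
import numpy as np

def quantize_weights(W, n_clusters=16, n_iter=20):
    """Cluster the weights with Lloyd-style k-means (minimizing WCSS) and
    return a shared weight table (codebook) plus per-weight cluster indices."""
    w = W.ravel()
    # Linear initialization: centroids spread evenly between min and max weight.
    codebook = np.linspace(w.min(), w.max(), n_clusters)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every weight.
        idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned weights.
        for k in range(n_clusters):
            members = w[idx == k]
            if members.size:
                codebook[k] = members.mean()
    idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx.reshape(W.shape).astype(np.uint8)

def dequantize(codebook, idx):
    """Shared weight table lookup: replace every index by its centroid."""
    return codebook[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 64))
    codebook, idx = quantize_weights(W)
    print("codebook size:", codebook.size,
          "max abs error:", np.abs(W - dequantize(codebook, idx)).max())
```

With 16 clusters each weight is stored as a 4-bit index plus a share of the tiny codebook, which is where the compression (and the extra table-lookup indirection noted in the outline) comes from.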
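The "Huffman encoding" bullet can be illustrated the same way. The sketch below builds the usual frequency-weighted binary tree over quantized weight indices and derives a prefix code; the example symbol stream and helper names are hypothetical, chosen only for illustration.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman prefix code (binary tree weighted by symbol frequency)
    for an iterable of symbols, e.g. quantized weight indices."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate case: a single symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie_breaker, {symbol: partial_code}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)     # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

if __name__ == "__main__":
    indices = [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]     # e.g. 4-bit cluster indices
    code = huffman_code(indices)
    bits = sum(len(code[s]) for s in indices)
    print(code, f"{bits} bits vs {4 * len(indices)} bits fixed-width")
```

Frequent indices receive shorter codes, so the average bits per stored weight drops below the fixed 4-bit width.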
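For the "compressed sparse column (CSC) matrix representation" and "encode with relative indexing" bullets, the following sketch shows the storage idea as I understand it from the EIE paper: per column, only non-zero weights are kept, each paired with a 4-bit count of the zeros skipped since the previous entry (an explicit zero entry is emitted when more than 15 zeros occur in a row), plus a pointer array marking the column boundaries. Function names and the exact padding convention are my own choices for this example.

```python
import numpy as np

def csc_relative_encode(W, max_gap=15):
    """Encode a sparse matrix column by column as (relative_index, value) pairs.

    relative_index counts the zeros since the previous stored entry in the same
    column; if the gap would overflow the 4-bit field, a padding entry with
    value 0 is emitted instead (assumed padding convention for this sketch)."""
    entries = []        # list of (relative_index, value)
    col_ptr = [0]       # entries[col_ptr[j]:col_ptr[j+1]] belong to column j
    for j in range(W.shape[1]):
        gap = 0
        for i in range(W.shape[0]):
            v = W[i, j]
            if v == 0:
                gap += 1
                if gap > max_gap:                 # 4-bit index would overflow
                    entries.append((max_gap, 0.0))  # padding zero entry
                    gap = 0
                continue
            entries.append((gap, float(v)))
            gap = 0
        col_ptr.append(len(entries))
    return entries, col_ptr

def csc_decode_column(entries, col_ptr, j, n_rows):
    """Reconstruct dense column j from the relative-index encoding."""
    col = np.zeros(n_rows)
    row = 0
    for rel, v in entries[col_ptr[j]:col_ptr[j + 1]]:
        row += rel                 # skip 'rel' zeros
        if v != 0.0:
            col[row] = v
        row += 1                   # move past the stored position
    return col

if __name__ == "__main__":
    W = np.zeros((20, 2))
    W[0, 0], W[18, 0], W[3, 1] = 0.5, -1.25, 2.0
    enc, ptr = csc_relative_encode(W)
    assert np.allclose(csc_decode_column(enc, ptr, 0, 20), W[:, 0])
    assert np.allclose(csc_decode_column(enc, ptr, 1, 20), W[:, 1])
    print(enc, ptr)
```

The pointer array corresponds to what the two SRAM pointer banks in the outline hold: a PE walks `entries[col_ptr[j]:col_ptr[j+1]]` to find the non-zero elements of column j quickly.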
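Finally, the "per-activation formula" bullet from the EIE implementation section can be written out. The LaTeX below restates the dense computation and its pruned, weight-shared counterpart as presented in the EIE paper; the symbol names (S for the shared weight table, I for the 4-bit indices, X and Y for the non-zero sets) follow the paper's convention as far as I recall it.

```latex
% Dense fully-connected layer: output activation b_i from inputs a_j.
b_i = \mathrm{ReLU}\!\left( \sum_{j=0}^{n-1} W_{ij}\, a_j \right)

% After pruning and weight sharing: only columns j with a non-zero weight in
% row i (set X_i) and non-zero input activations (set Y) contribute, and the
% weight is looked up in the shared table S via the 4-bit index I_{ij}.
b_i = \mathrm{ReLU}\!\left( \sum_{j \in X_i \cap Y} S[I_{ij}]\, a_j \right)
```

The restriction of the sum to non-zero activations is what the non-zero filtering and activation queues in the architecture exploit.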