# EIE: Efficient Inference Engine on Compressed Deep Neural Network

## Deep Neural Network

- Convolutional layers
- Fully-connected layers
- FC layers hold the trained weights; this talk focuses on inference only
- Multiply-accumulate (MAC) operations on each layer (see the sketch below)
- DNN dataflows
- Convolutional layers: ~5% of memory, ~95% of FLOPs
- FC layers: ~5% of FLOPs, ~90-95% of memory

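A minimal sketch of the MAC structure of an FC layer during inference (numpy-based; the explicit loops and the choice of ReLU are illustrative, not taken from the paper):

```python
import numpy as np

def fc_layer(W, a, b):
    """Fully-connected layer written as explicit multiply-accumulates.

    Equivalent to ReLU(W @ a + b); the inner statement is the MAC
    operation that dominates FC-layer inference.
    """
    out = np.array(b, dtype=float)
    for i in range(W.shape[0]):          # one output activation per row
        for j in range(W.shape[1]):      # one MAC per weight
            out[i] += W[i, j] * a[j]
    return np.maximum(out, 0.0)          # ReLU
```
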
## Motivation

- Inference metrics: throughput, latency, model size, energy use
- An uncompressed DNN does not fit into SRAM, forcing memory accesses to/from DRAM
- Von Neumann bottleneck
- Figure from Chen 2016
- An additional level of indirection because of indices (weight sharing)

## Compression

- In general: encode the weights in such a way that the number of bits per weight is reduced

Trivial:

- Apply different kernels/filters to the input
- Apply pooling to the inputs (reduces runtime memory)

More complex:

- Pruning (remove unimportant weights and retrain; two approaches)
  - Encode the remaining weights with relative indexing (see the sketch after this list)
- Weight quantization with clustering (see the second sketch after this list)
  - Group similar weights into clusters
  - Minimize the within-cluster sum of squares (WCSS)
  - Different methods to initialize the cluster centroids, e.g. random, linear, CDF-based
  - One level of indirection because of the shared-weight table lookup
- Huffman encoding (binary tree built from weight frequencies, applied globally)
- Fixed-point quantization of activation functions (refer to CPU optimization)
- Extremely narrow weight engines (4 bit)
- Compressed sparse column (CSC) matrix representation

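A minimal sketch of relative indexing as used in EIE's CSC variant: each non-zero weight in a column is stored with a 4-bit count of the zeros preceding it, and a run longer than 15 is broken up with a padding zero (function and variable names are illustrative):

```python
def encode_column_csc(column, max_run=15):
    """Encode one sparse matrix column as (relative_index, value) pairs.

    relative_index counts the zeros since the previous stored entry.
    When a zero run overflows the 4-bit field (max_run = 15), a padding
    zero with value 0 is emitted, as in the EIE paper's CSC variant.
    """
    entries = []          # (4-bit relative index, weight value) pairs
    run = 0               # zeros seen since the last stored entry
    for v in column:
        if v == 0:
            run += 1
            if run > max_run:                 # 4-bit field would overflow:
                entries.append((max_run, 0))  # emit a padding zero
                run = 0
        else:
            entries.append((run, v))
            run = 0
    return entries

# Example: a column with a zero run longer than 15
col = [0] * 20 + [3.5] + [0] * 3 + [-1.25]
print(encode_column_csc(col))
# [(15, 0), (4, 3.5), (3, -1.25)]
```
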
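And a sketch of weight quantization via clustering: plain k-means minimizing the WCSS, with linear centroid initialization (one of the schemes listed above). 16 clusters give 4-bit indices into the shared-weight table; the paper additionally retrains the centroids, which this sketch omits:

```python
import numpy as np

def cluster_weights(w, n_clusters=16, iters=20):
    """Weight sharing: quantize weights to n_clusters shared values."""
    # linear initialization: centroids spread evenly over [min, max]
    centroids = np.linspace(w.min(), w.max(), n_clusters)
    for _ in range(iters):
        # assign each weight to its nearest centroid
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        # move each centroid to the mean of its assigned weights (WCSS step)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = w[idx == k].mean()
    return idx.astype(np.uint8), centroids

# 16 clusters -> each weight stored as a 4-bit index into `codebook`
w = np.random.randn(1024).astype(np.float32)
idx, codebook = cluster_weights(w)
w_quantized = codebook[idx]   # shared-weight table lookup (the indirection)
```
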
## EIE implementation

- Per-activation formula (see below)
- Accelerates sparse, weight-sharing networks
- Uses the CSC representation
  - A PE quickly finds the non-zero elements in a column
- Explain the general procedure (sketched after this list)
- Show image of the architecture
- Non-zero filtering
- Queues for load balancing
- Two separate SRAM banks for the 16-bit pointers to the column borders, so both pointers of a column can be read in one cycle
- Each entry: 8 bits wide (4-bit shared-weight reference and 4-bit relative index into the activation registers)
- Table lookup / weight decoding of the reference happens in the same cycle
- Arithmetic unit: performs the multiply-accumulate
- Read/write unit
  - Source and destination register files
  - Swap their roles on each layer
  - Feed-forward networks

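The per-activation formula, as given in the EIE paper: $X_i$ is the set of columns $j$ with $W_{ij} \neq 0$, $Y$ the set of indices of non-zero input activations, $I_{ij}$ the 4-bit shared-weight reference, and $S$ the shared-weight table:

```latex
b_i = \mathrm{ReLU}\Bigl(\sum_{j \in X_i \cap Y} S[I_{ij}]\, a_j\Bigr)
```
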
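A minimal Python sketch of the general procedure inside one PE (array names such as `ptr`, `rel_idx`, `weight_idx` are illustrative stand-ins for the pointer banks and 8-bit entries described above): when a non-zero activation arrives from the queue, the PE reads the two column pointers, walks that column's entries, decodes each 4-bit reference through the shared-weight table, and accumulates into the activation registers:

```python
def pe_process_activation(j, a_j, ptr, rel_idx, weight_idx, codebook, acc):
    """One PE step: multiply non-zero activation a_j (column j) into
    every accumulator that has a non-zero weight in that column.

    ptr        -- column border pointers (held in two SRAM banks)
    rel_idx    -- 4-bit relative row indices, one per entry
    weight_idx -- 4-bit shared-weight references, one per entry
    codebook   -- shared-weight table (weight decoding)
    acc        -- activation registers (output accumulators)
    """
    row = 0
    for k in range(ptr[j], ptr[j + 1]):   # only column j's entries
        row += rel_idx[k]                 # relative -> absolute row
        acc[row] += codebook[weight_idx[k]] * a_j   # table lookup + MAC
        row += 1
    return acc
```

Only non-zero activations are ever broadcast to the PEs (the non-zero filtering step), so whole columns multiplied by zero are skipped entirely.
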
## EIE evaluation

- Speedup: 189x, 13x and 307x faster than CPU, GPU and mGPU, respectively
  - EIE is latency-focused: batch size of 1
- Throughput: 102 GOP/s on the compressed network, corresponding to 3 TOP/s on the uncompressed network
- Energy efficiency: 24,000x, 3,400x and 2,700x more energy efficient than CPU, GPU and mGPU, respectively

- Speed measurement: wall-clock times measured for different workloads
- Energy measurement: total computation time x average measured power (see the formula below)
- Sources of energy consumption, and why EIE consumes less:
  - SRAM access instead of DRAM
  - The compression scheme and architecture reduce the number of memory reads
  - Sparsity encoding of the vectors in the CSC representation

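The energy estimate from the bullet above, written as a formula:

```latex
E = t_{\text{total}} \cdot \bar{P}_{\text{measured}}
```
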
## Limitations / future optimizations

- EIE is only capable of matrix multiplication
- Other optimization methods
  - In-Memory Acceleration