Spaces:
Runtime error
Runtime error
Commit
·
064a5f0
1
Parent(s):
32aafee
Update app.py
Browse files
app.py
CHANGED
|
@@ -160,7 +160,7 @@ c1.write("Cached keys and values (GB)")
|
|
| 160 |
acts = round(2*bs*l*(d/h)*2*n/1e9, 2)
|
| 161 |
c2.write(str(acts))
|
| 162 |
|
| 163 |
-
st.subheader("
|
| 164 |
st.markdown("We use the [following crude approximation](https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#understand-perf) to estimate the execution time for each matrix multiplication.")
|
| 165 |
|
| 166 |
st.latex("C = A \cdot B")
|
|
@@ -183,9 +183,13 @@ st.latex("T_{math}(A \cdot B) = 2*M*K*N / BW_{math}")
|
|
| 183 |
st.markdown("where BW_math is the number of floating point operations per second (e.g. 312 TFLOPS for an A100 GPU)")
|
| 184 |
|
| 185 |
st.markdown("If we assume we can *perfectly* overlap memory access with math operations, then the estimated execution time for the operation is:")
|
| 186 |
-
st.latex("max(
|
| 187 |
|
| 188 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 189 |
if breakdown:
|
| 190 |
st.header('Attention layer')
|
| 191 |
|
|
|
|
| 160 |
acts = round(2*bs*l*(d/h)*2*n/1e9, 2)
|
| 161 |
c2.write(str(acts))
|
| 162 |
|
| 163 |
+
st.subheader("Estimating execution time")
|
| 164 |
st.markdown("We use the [following crude approximation](https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#understand-perf) to estimate the execution time for each matrix multiplication.")
|
| 165 |
|
| 166 |
st.latex("C = A \cdot B")
|
|
|
|
| 183 |
st.markdown("where BW_math is the number of floating point operations per second (e.g. 312 TFLOPS for an A100 GPU)")
|
| 184 |
|
| 185 |
st.markdown("If we assume we can *perfectly* overlap memory access with math operations, then the estimated execution time for the operation is:")
|
| 186 |
+
st.latex("max(T_{math}, T_{mem})")
|
| 187 |
|
| 188 |
+
st.markdown("We also a minimum time for executing the operation due to [kernel launch overhead](https://forums.developer.nvidia.com/t/any-way-to-measure-the-latency-of-a-kernel-launch/221413/2)")
|
| 189 |
+
|
| 190 |
+
st.subheader("Operations in MHA and MQA")
|
| 191 |
+
|
| 192 |
+
breakdown = st.checkbox("Show inference time for each operation")
|
| 193 |
if breakdown:
|
| 194 |
st.header('Attention layer')
|
| 195 |
|