`torch.empty` can create issues; use `torch.zeros`

On MPS, passing a tensor created with `torch.empty()` as the `input` argument of `torch.baddbmm()` can cause the returned tensor to contain NaNs, even though `beta=0`. Because the tensor has shape [1,1,1], the performance difference between `torch.empty()` and `torch.zeros()` should be negligible, so it's better to use `torch.zeros()` here and avoid the problem entirely.
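As a rough illustration of the pattern after this change, the sketch below builds the placeholder `input` with `torch.zeros()` and checks the result against a plain `torch.bmm()`. The helper name `scaled_scores` and the tensor sizes are made up for the example and are not part of the repository; the snippet runs on CPU and does not attempt to reproduce the MPS behavior itself.

```python
import torch


def scaled_scores(query: torch.Tensor, key: torch.Tensor, scale: float) -> torch.Tensor:
    # Hypothetical helper, for illustration only.
    # The [1, 1, 1] placeholder broadcasts against the (batch, q_len, k_len) result.
    # With beta=0 its values should not matter, but zeros are safe even on a
    # backend that reads them anyway.
    placeholder = torch.zeros(1, 1, 1, device=query.device, dtype=query.dtype)
    return torch.baddbmm(placeholder, query, key.transpose(1, 2), alpha=scale, beta=0)


query = torch.randn(2, 4, 8)
key = torch.randn(2, 5, 8)
scale = 8 ** -0.5

scores = scaled_scores(query, key, scale)
# Reference computation without baddbmm: scale * (query @ key^T)
reference = torch.bmm(query, key.transpose(1, 2)) * scale
```

The PyTorch documentation states that when `beta` is 0 the `input` is ignored and any NaN or inf in it is not propagated, so `torch.empty()` would be expected to work; using `torch.zeros()` simply removes the dependence on every backend honoring that.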

Commit 24892520 authored Jul 25, 2023 by brkirch

Showing with 2 additions and 2 deletions

modules/sub_quadratic_attention.py +2 -2

--- a/modules/sub_quadratic_attention.py
+++ b/modules/sub_quadratic_attention.py
@@ -58,7 +58,7 @@ def _summarize_chunk(
    scale: float,
 ) -> AttnChunk:
    attn_weights = torch.baddbmm(
-        torch.empty(1, 1, 1, device=query.device, dtype=query.dtype),
+        torch.zeros(1, 1, 1, device=query.device, dtype=query.dtype),
        query,
        key.transpose(1,2),
        alpha=scale,
@@ -121,7 +121,7 @@ def _get_attention_scores_no_kv_chunking(
    scale: float,
 ) -> Tensor:
    attn_scores = torch.baddbmm(
-        torch.empty(1, 1, 1, device=query.device, dtype=query.dtype),
+        torch.zeros(1, 1, 1, device=query.device, dtype=query.dtype),
        query,
        key.transpose(1,2),
        alpha=scale,