Commit 8861ffba authored by rui.zheng

#73: Add shadowsocks UDP relay support

parent 321b0371
Creative Commons Legal Code
CC0 1.0 Universal
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
HEREUNDER.
Statement of Purpose
The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator
and subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").
Certain owners wish to permanently relinquish those rights to a Work for
the purpose of contributing to a commons of creative, cultural and
scientific works ("Commons") that the public can reliably and without fear
of later claims of infringement build upon, modify, incorporate in other
works, reuse and redistribute as freely as possible in any form whatsoever
and for any purposes, including without limitation commercial purposes.
These owners may contribute to the Commons to promote the ideal of a free
culture and the further production of creative, cultural and scientific
works, or to gain reputation or greater distribution for their Work in
part through the use and efforts of others.
For these and/or other purposes and motivations, and without any
expectation of additional consideration or compensation, the person
associating CC0 with a Work (the "Affirmer"), to the extent that he or she
is an owner of Copyright and Related Rights in the Work, voluntarily
elects to apply CC0 to the Work and publicly distribute the Work under its
terms, with knowledge of his or her Copyright and Related Rights in the
Work and the meaning and intended legal effect of CC0 on those rights.
1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not
limited to, the following:
i. the right to reproduce, adapt, distribute, perform, display,
communicate, and translate a Work;
ii. moral rights retained by the original author(s) and/or performer(s);
iii. publicity and privacy rights pertaining to a person's image or
likeness depicted in a Work;
iv. rights protecting against unfair competition in regards to a Work,
subject to the limitations in paragraph 4(a), below;
v. rights protecting the extraction, dissemination, use and reuse of data
in a Work;
vi. database rights (such as those arising under Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal
protection of databases, and under any national implementation
thereof, including any amended or successor version of such
directive); and
vii. other similar, equivalent or corresponding rights throughout the
world based on applicable law or treaty, and any national
implementations thereof.
2. Waiver. To the greatest extent permitted by, but not in contravention
of, applicable law, Affirmer hereby overtly, fully, permanently,
irrevocably and unconditionally waives, abandons, and surrenders all of
Affirmer's Copyright and Related Rights and associated claims and causes
of action, whether now known or unknown (including existing as well as
future claims and causes of action), in the Work (i) in all territories
worldwide, (ii) for the maximum duration provided by applicable law or
treaty (including future time extensions), (iii) in any current or future
medium and for any number of copies, and (iv) for any purpose whatsoever,
including without limitation commercial, advertising or promotional
purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
member of the public at large and to the detriment of Affirmer's heirs and
successors, fully intending that such Waiver shall not be subject to
revocation, rescission, cancellation, termination, or any other legal or
equitable action to disrupt the quiet enjoyment of the Work by the public
as contemplated by Affirmer's express Statement of Purpose.
3. Public License Fallback. Should any part of the Waiver for any reason
be judged legally invalid or ineffective under applicable law, then the
Waiver shall be preserved to the maximum extent permitted taking into
account Affirmer's express Statement of Purpose. In addition, to the
extent the Waiver is so judged Affirmer hereby grants to each affected
person a royalty-free, non transferable, non sublicensable, non exclusive,
irrevocable and unconditional license to exercise Affirmer's Copyright and
Related Rights in the Work (i) in all territories worldwide, (ii) for the
maximum duration provided by applicable law or treaty (including future
time extensions), (iii) in any current or future medium and for any number
of copies, and (iv) for any purpose whatsoever, including without
limitation commercial, advertising or promotional purposes (the
"License"). The License shall be deemed effective as of the date CC0 was
applied by Affirmer to the Work. Should any part of the License for any
reason be judged legally invalid or ineffective under applicable law, such
partial invalidity or ineffectiveness shall not invalidate the remainder
of the License, and in such case Affirmer hereby affirms that he or she
will not (i) exercise any of his or her remaining Copyright and Related
Rights in the Work or (ii) assert any associated claims and causes of
action with respect to the Work, in either case contrary to Affirmer's
express Statement of Purpose.
4. Limitations and Disclaimers.
a. No trademark or patent rights held by Affirmer are waived, abandoned,
surrendered, licensed or otherwise affected by this document.
b. Affirmer offers the Work as-is and makes no representations or
warranties of any kind concerning the Work, express, implied,
statutory or otherwise, including without limitation warranties of
title, merchantability, fitness for a particular purpose, non
infringement, or the absence of latent or other defects, accuracy, or
the present or absence of errors, whether or not discoverable, all to
the greatest extent permissible under applicable law.
c. Affirmer disclaims responsibility for clearing rights of other persons
that may apply to the Work or any use thereof, including without
limitation any person's Copyright and Related Rights in the Work.
Further, Affirmer disclaims responsibility for obtaining any necessary
consents, permissions or other rights required for any use of the
Work.
d. Affirmer understands and acknowledges that Creative Commons is not a
party to this document and has no duty or obligation with respect to
this CC0 or use of the Work.
### chacha20 - ChaCha20
#### Yawning Angel (yawning at schwanenlied dot me)
Yet another Go ChaCha20 implementation. Everything else I found was slow,
didn't support all the variants I need to use, or relied on cgo to go fast.
Features:
* 20 round, 256 bit key only. Everything else is pointless and stupid.
* IETF 96 bit nonce variant.
* XChaCha 24 byte nonce variant.
* SSE2 and AVX2 support on amd64 targets.
* Incremental encrypt/decrypt support, unlike golang.org/x/crypto/salsa20.
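For orientation, here is a minimal usage sketch of the exported API defined in chacha20.go below. It is not part of the vendored source, and it assumes the upstream import path github.com/Yawning/chacha20:

package main

import (
    "fmt"

    "github.com/Yawning/chacha20"
)

func main() {
    key := make([]byte, chacha20.KeySize)     // 32-byte key (all zero purely for illustration)
    nonce := make([]byte, chacha20.NonceSize) // 8-byte classic nonce; INonceSize/XNonceSize also work

    c, err := chacha20.NewCipher(key, nonce)
    if err != nil {
        panic(err)
    }
    defer c.Reset() // scrub key material when done

    msg := []byte("attack at dawn")
    ct := make([]byte, len(msg))
    c.XORKeyStream(ct, msg) // decryption is the same XOR operation
    fmt.Printf("%x\n", ct)
}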
// chacha20.go - A ChaCha stream cipher implementation.
//
// To the extent possible under law, Yawning Angel has waived all copyright
// and related or neighboring rights to chacha20, using the Creative
// Commons "CC0" public domain dedication. See LICENSE or
// <http://creativecommons.org/publicdomain/zero/1.0/> for full details.
package chacha20
import (
"crypto/cipher"
"encoding/binary"
"errors"
"math"
"runtime"
)
const (
// KeySize is the ChaCha20 key size in bytes.
KeySize = 32
// NonceSize is the ChaCha20 nonce size in bytes.
NonceSize = 8
// INonceSize is the IETF ChaCha20 nonce size in bytes.
INonceSize = 12
// XNonceSize is the XChaCha20 nonce size in bytes.
XNonceSize = 24
// HNonceSize is the HChaCha20 nonce size in bytes.
HNonceSize = 16
// BlockSize is the ChaCha20 block size in bytes.
BlockSize = 64
stateSize = 16
chachaRounds = 20
// The constant "expand 32-byte k" as little endian uint32s.
sigma0 = uint32(0x61707865)
sigma1 = uint32(0x3320646e)
sigma2 = uint32(0x79622d32)
sigma3 = uint32(0x6b206574)
)
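The four sigma words are just the ASCII string "expand 32-byte k" read as little-endian uint32s; a quick illustrative check:

var b [16]byte
binary.LittleEndian.PutUint32(b[0:4], sigma0)
binary.LittleEndian.PutUint32(b[4:8], sigma1)
binary.LittleEndian.PutUint32(b[8:12], sigma2)
binary.LittleEndian.PutUint32(b[12:16], sigma3)
fmt.Println(string(b[:])) // prints: expand 32-byte k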
var (
// ErrInvalidKey is the error returned when the key is invalid.
ErrInvalidKey = errors.New("key length must be KeySize bytes")
// ErrInvalidNonce is the error returned when the nonce is invalid.
ErrInvalidNonce = errors.New("nonce length must be NonceSize/INonceSize/XNonceSize bytes")
// ErrInvalidCounter is the error returned when the counter is invalid.
ErrInvalidCounter = errors.New("block counter is invalid (out of range)")
useUnsafe = false
usingVectors = false
blocksFn = blocksRef
)
// A Cipher is an instance of ChaCha20/XChaCha20 using a particular key and
// nonce.
type Cipher struct {
state [stateSize]uint32
buf [BlockSize]byte
off int
ietf bool
}
// Reset zeros the key data so that it will no longer appear in the process's
// memory.
func (c *Cipher) Reset() {
for i := range c.state {
c.state[i] = 0
}
for i := range c.buf {
c.buf[i] = 0
}
}
// XORKeyStream sets dst to the result of XORing src with the key stream. Dst
// and src may be the same slice but otherwise should not overlap.
func (c *Cipher) XORKeyStream(dst, src []byte) {
if len(dst) < len(src) {
src = src[:len(dst)]
}
for remaining := len(src); remaining > 0; {
// Process multiple blocks at once.
if c.off == BlockSize {
nrBlocks := remaining / BlockSize
directBytes := nrBlocks * BlockSize
if nrBlocks > 0 {
blocksFn(&c.state, src, dst, nrBlocks, c.ietf)
remaining -= directBytes
if remaining == 0 {
return
}
dst = dst[directBytes:]
src = src[directBytes:]
}
// If there's a partial block, generate 1 block of keystream into
// the internal buffer.
blocksFn(&c.state, nil, c.buf[:], 1, c.ietf)
c.off = 0
}
// Process partial blocks from the buffered keystream.
toXor := BlockSize - c.off
if remaining < toXor {
toXor = remaining
}
if toXor > 0 {
for i, v := range src[:toXor] {
dst[i] = v ^ c.buf[c.off+i]
}
dst = dst[toXor:]
src = src[toXor:]
remaining -= toXor
c.off += toXor
}
}
}
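Because any partial-block tail is buffered in c.buf and tracked by c.off, splitting a message across calls produces the same bytes as one call over the whole message. A sketch (key, nonce, and a 100-byte src are assumed from context):

// One call over all 100 bytes...
c1, _ := chacha20.NewCipher(key, nonce)
whole := make([]byte, 100)
c1.XORKeyStream(whole, src)

// ...matches two calls that straddle the 64-byte block boundary.
c2, _ := chacha20.NewCipher(key, nonce)
split := make([]byte, 100)
c2.XORKeyStream(split[:37], src[:37])
c2.XORKeyStream(split[37:], src[37:])
// bytes.Equal(whole, split) == true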
// KeyStream sets dst to the raw keystream.
func (c *Cipher) KeyStream(dst []byte) {
for remaining := len(dst); remaining > 0; {
// Process multiple blocks at once.
if c.off == BlockSize {
nrBlocks := remaining / BlockSize
directBytes := nrBlocks * BlockSize
if nrBlocks > 0 {
blocksFn(&c.state, nil, dst, nrBlocks, c.ietf)
remaining -= directBytes
if remaining == 0 {
return
}
dst = dst[directBytes:]
}
// If there's a partial block, generate 1 block of keystream into
// the internal buffer.
blocksFn(&c.state, nil, c.buf[:], 1, c.ietf)
c.off = 0
}
// Process partial blocks from the buffered keystream.
toCopy := BlockSize - c.off
if remaining < toCopy {
toCopy = remaining
}
if toCopy > 0 {
copy(dst[:toCopy], c.buf[c.off:c.off+toCopy])
dst = dst[toCopy:]
remaining -= toCopy
c.off += toCopy
}
}
}
// ReKey reinitializes the ChaCha20/XChaCha20 instance with the provided key
// and nonce.
func (c *Cipher) ReKey(key, nonce []byte) error {
if len(key) != KeySize {
return ErrInvalidKey
}
switch len(nonce) {
case NonceSize:
case INonceSize:
case XNonceSize:
var subkey [KeySize]byte
var subnonce [HNonceSize]byte
copy(subnonce[:], nonce[0:16])
HChaCha(key, &subnonce, &subkey)
key = subkey[:]
nonce = nonce[16:24]
defer func() {
for i := range subkey {
subkey[i] = 0
}
}()
default:
return ErrInvalidNonce
}
c.Reset()
c.state[0] = sigma0
c.state[1] = sigma1
c.state[2] = sigma2
c.state[3] = sigma3
c.state[4] = binary.LittleEndian.Uint32(key[0:4])
c.state[5] = binary.LittleEndian.Uint32(key[4:8])
c.state[6] = binary.LittleEndian.Uint32(key[8:12])
c.state[7] = binary.LittleEndian.Uint32(key[12:16])
c.state[8] = binary.LittleEndian.Uint32(key[16:20])
c.state[9] = binary.LittleEndian.Uint32(key[20:24])
c.state[10] = binary.LittleEndian.Uint32(key[24:28])
c.state[11] = binary.LittleEndian.Uint32(key[28:32])
c.state[12] = 0
if len(nonce) == INonceSize {
c.state[13] = binary.LittleEndian.Uint32(nonce[0:4])
c.state[14] = binary.LittleEndian.Uint32(nonce[4:8])
c.state[15] = binary.LittleEndian.Uint32(nonce[8:12])
c.ietf = true
} else {
c.state[13] = 0
c.state[14] = binary.LittleEndian.Uint32(nonce[0:4])
c.state[15] = binary.LittleEndian.Uint32(nonce[4:8])
c.ietf = false
}
c.off = BlockSize
return nil
}
// Seek sets the block counter to a given offset.
func (c *Cipher) Seek(blockCounter uint64) error {
if c.ietf {
if blockCounter > math.MaxUint32 {
return ErrInvalidCounter
}
c.state[12] = uint32(blockCounter)
} else {
c.state[12] = uint32(blockCounter)
c.state[13] = uint32(blockCounter >> 32)
}
c.off = BlockSize
return nil
}
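Seek positions the keystream at a 64-byte block boundary (and discards any buffered keystream by setting c.off = BlockSize), which gives cheap random access. A sketch:

c, _ := chacha20.NewCipher(key, nonce)
buf := make([]byte, chacha20.BlockSize)
if err := c.Seek(2); err != nil { // jump to the third block; the counter is in units of 64-byte blocks
    panic(err)
}
c.KeyStream(buf) // keystream bytes 128..191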
// NewCipher returns a new ChaCha20/XChaCha20 instance.
func NewCipher(key, nonce []byte) (*Cipher, error) {
c := new(Cipher)
if err := c.ReKey(key, nonce); err != nil {
return nil, err
}
return c, nil
}
// HChaCha is the HChaCha20 hash function used to make XChaCha.
func HChaCha(key []byte, nonce *[HNonceSize]byte, out *[32]byte) {
var x [stateSize]uint32 // Last 4 slots unused, sigma hardcoded.
x[0] = binary.LittleEndian.Uint32(key[0:4])
x[1] = binary.LittleEndian.Uint32(key[4:8])
x[2] = binary.LittleEndian.Uint32(key[8:12])
x[3] = binary.LittleEndian.Uint32(key[12:16])
x[4] = binary.LittleEndian.Uint32(key[16:20])
x[5] = binary.LittleEndian.Uint32(key[20:24])
x[6] = binary.LittleEndian.Uint32(key[24:28])
x[7] = binary.LittleEndian.Uint32(key[28:32])
x[8] = binary.LittleEndian.Uint32(nonce[0:4])
x[9] = binary.LittleEndian.Uint32(nonce[4:8])
x[10] = binary.LittleEndian.Uint32(nonce[8:12])
x[11] = binary.LittleEndian.Uint32(nonce[12:16])
hChaChaRef(&x, out)
}
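hChaChaRef lives in chacha20_ref.go, which is not part of this commit's diff. For reference, HChaCha20 loads sigma, the key, and the 16-byte nonce into the usual state layout, runs the 20 rounds, skips the final state addition, and emits words 0..3 and 12..15. A scalar sketch under those assumptions (quarterRound is a hypothetical helper; needs math/bits and encoding/binary):

// quarterRound is the standard ChaCha quarter round on state indices a, b, c, d.
func quarterRound(s *[stateSize]uint32, a, b, c, d int) {
    s[a] += s[b]
    s[d] = bits.RotateLeft32(s[d]^s[a], 16)
    s[c] += s[d]
    s[b] = bits.RotateLeft32(s[b]^s[c], 12)
    s[a] += s[b]
    s[d] = bits.RotateLeft32(s[d]^s[a], 8)
    s[c] += s[d]
    s[b] = bits.RotateLeft32(s[b]^s[c], 7)
}

func hChaChaSketch(in *[stateSize]uint32, out *[32]byte) {
    s := [stateSize]uint32{sigma0, sigma1, sigma2, sigma3}
    copy(s[4:], in[:12]) // key words in in[0..7], nonce words in in[8..11]
    for i := 0; i < chachaRounds; i += 2 {
        // Column round.
        quarterRound(&s, 0, 4, 8, 12)
        quarterRound(&s, 1, 5, 9, 13)
        quarterRound(&s, 2, 6, 10, 14)
        quarterRound(&s, 3, 7, 11, 15)
        // Diagonal round.
        quarterRound(&s, 0, 5, 10, 15)
        quarterRound(&s, 1, 6, 11, 12)
        quarterRound(&s, 2, 7, 8, 13)
        quarterRound(&s, 3, 4, 9, 14)
    }
    // No final state addition; the subkey is words 0..3 and 12..15.
    for i, w := range [8]uint32{s[0], s[1], s[2], s[3], s[12], s[13], s[14], s[15]} {
        binary.LittleEndian.PutUint32(out[i*4:], w)
    }
}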
func init() {
switch runtime.GOARCH {
case "386", "amd64":
// Abuse unsafe to skip calling binary.LittleEndian.PutUint32
// in the critical path. This is a big boost on systems that are
// little endian and not overly picky about alignment.
useUnsafe = true
}
}
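On the little-endian targets named above, the reference code can store keystream words straight into the output instead of going through binary.LittleEndian.PutUint32. The actual use is in chacha20_ref.go (not in this diff); a minimal sketch of the idea:

// Assumes len(out) >= BlockSize and a little-endian CPU that tolerates
// unaligned 32-bit stores -- exactly what the GOARCH switch above guards.
outWords := (*[BlockSize / 4]uint32)(unsafe.Pointer(&out[0]))
for i, v := range s { // s is the finished 16-word block
    outWords[i] = v
}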
var _ cipher.Stream = (*Cipher)(nil)
// chacha20_amd64.go - AMD64 optimized chacha20.
//
// To the extent possible under law, Yawning Angel has waived all copyright
// and related or neighboring rights to chacha20, using the Creative
// Commons "CC0" public domain dedication. See LICENSE or
// <http://creativecommons.org/publicdomain/zero/1.0/> for full details.
// +build amd64,!gccgo,!appengine

package chacha20
import (
"math"
)
var usingAVX2 = false
func blocksAmd64SSE2(x *uint32, inp, outp *byte, nrBlocks uint)
func blocksAmd64AVX2(x *uint32, inp, outp *byte, nrBlocks uint)
func cpuidAmd64(cpuidParams *uint32)
func xgetbv0Amd64(xcrVec *uint32)
func blocksAmd64(x *[stateSize]uint32, in []byte, out []byte, nrBlocks int, isIetf bool) {
// Probably unneeded, but stating this explicitly simplifies the assembly.
if nrBlocks == 0 {
return
}
if isIetf {
// The block counter lives at state[12] (see ReKey); enforce the IETF
// 32-bit per-nonce limit against it.
totalBlocks := uint64(x[12]) + uint64(nrBlocks)
if totalBlocks > math.MaxUint32 {
panic("chacha20: Exceeded keystream per nonce limit")
}
}
if in == nil {
for i := range out {
out[i] = 0
}
in = out
}
// Pointless to call the AVX2 code for just a single block, since half of
// the output gets discarded...
if usingAVX2 && nrBlocks > 1 {
blocksAmd64AVX2(&x[0], &in[0], &out[0], uint(nrBlocks))
} else {
blocksAmd64SSE2(&x[0], &in[0], &out[0], uint(nrBlocks))
}
}
func supportsAVX2() bool {
// https://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family
const (
osXsaveBit = 1 << 27
avx2Bit = 1 << 5
)
// Check to see if CPUID actually supports the leaf that indicates AVX2.
// CPUID.(EAX=0H, ECX=0H) >= 7
regs := [4]uint32{0x00}
cpuidAmd64(&regs[0])
if regs[0] < 7 {
return false
}
// Check to see if the OS knows how to save/restore XMM/YMM state.
// CPUID.(EAX=01H, ECX=0H):ECX.OSXSAVE[bit 27]==1
regs = [4]uint32{0x01}
cpuidAmd64(&regs[0])
if regs[2]&osXsaveBit == 0 {
return false
}
xcrRegs := [2]uint32{}
xgetbv0Amd64(&xcrRegs[0])
if xcrRegs[0]&6 != 6 {
return false
}
// Check for AVX2 support.
// CPUID.(EAX=07H, ECX=0H):EBX.AVX2[bit 5]==1
regs = [4]uint32{0x07}
cpuidAmd64(&regs[0])
return regs[1]&avx2Bit != 0
}
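The probe mirrors what golang.org/x/sys/cpu does today (OSXSAVE, XCR0 bits 1 and 2 via XGETBV, then the AVX2 bit in leaf 7). If depending on that package were acceptable for a vendored file, the whole function could arguably shrink to a sketch like:

import "golang.org/x/sys/cpu"

func supportsAVX2Alt() bool {
    return cpu.X86.HasAVX2 // already gated on OS support for YMM state
}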
func init() {
blocksFn = blocksAmd64
usingVectors = true
usingAVX2 = supportsAVX2()
}
#!/usr/bin/env python3
#
# To the extent possible under law, Yawning Angel has waived all copyright
# and related or neighboring rights to chacha20, using the Creative
# Commons "CC0" public domain dedication. See LICENSE or
# <http://creativecommons.org/publicdomain/zero/1.0/> for full details.
#
# cgo sucks. Plan 9 assembly sucks. Real languages have SIMD intrinsics.
# The least terrible option is to use a Python code generator, so that's
# what I did.
#
# Code based on Ted Krovetz's vec128 C implementation, with corrections
# to use a 64 bit counter instead of 32 bit, and to allow unaligned input and
# output pointers.
#
# Dependencies: https://github.com/Maratyszcza/PeachPy
#
# python3 -m peachpy.x86_64 -mabi=goasm -S -o chacha20_amd64.s chacha20_amd64.py
#
from peachpy import *
from peachpy.x86_64 import *
x = Argument(ptr(uint32_t))
inp = Argument(ptr(const_uint8_t))
outp = Argument(ptr(uint8_t))
nrBlocks = Argument(ptr(size_t))
#
# SSE2 helper functions. A temporary register is explicitly passed in because
# the main fast loop uses every single register (and even spills) so manual
# control is needed.
#
# This used to also have a DQROUNDS helper that did 2 rounds of ChaCha like
# in the C code, but the C code has the luxury of an optimizer reordering
# everything, while this does not.
#
def ROTW16_sse2(tmp, d):
MOVDQA(tmp, d)
PSLLD(tmp, 16)
PSRLD(d, 16)
PXOR(d, tmp)
def ROTW12_sse2(tmp, b):
MOVDQA(tmp, b)
PSLLD(tmp, 12)
PSRLD(b, 20)
PXOR(b, tmp)
def ROTW8_sse2(tmp, d):
MOVDQA(tmp, d)
PSLLD(tmp, 8)
PSRLD(d, 24)
PXOR(d, tmp)
def ROTW7_sse2(tmp, b):
MOVDQA(tmp, b)
PSLLD(tmp, 7)
PSRLD(b, 25)
PXOR(b, tmp)
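# SSE2 has no 32-bit rotate instruction, so each ROTWn helper synthesizes
# rotate-left-by-n per lane from a left shift, a right shift by 32-n, and an
# XOR. The scalar equivalent of all four helpers, for comparison
# (illustrative Go):
#
#     import "math/bits"
#
#     // rotl32 is what ROTW16/ROTW12/ROTW8/ROTW7 compute in each 32-bit lane.
#     func rotl32(v uint32, n int) uint32 {
#         return bits.RotateLeft32(v, n) // == v<<n | v>>(32-n)
#     }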
def WriteXor_sse2(tmp, inp, outp, d, v0, v1, v2, v3):
MOVDQU(tmp, [inp+d])
PXOR(tmp, v0)
MOVDQU([outp+d], tmp)
MOVDQU(tmp, [inp+d+16])
PXOR(tmp, v1)
MOVDQU([outp+d+16], tmp)
MOVDQU(tmp, [inp+d+32])
PXOR(tmp, v2)
MOVDQU([outp+d+32], tmp)
MOVDQU(tmp, [inp+d+48])
PXOR(tmp, v3)
MOVDQU([outp+d+48], tmp)
# SSE2 ChaCha20 (aka vec128). Does not handle partial blocks, and will
# process 4/2/1 blocks at a time. x (the ChaCha20 state) must be 16 byte
# aligned.
with Function("blocksAmd64SSE2", (x, inp, outp, nrBlocks)):
reg_x = GeneralPurposeRegister64()
reg_inp = GeneralPurposeRegister64()
reg_outp = GeneralPurposeRegister64()
reg_blocks = GeneralPurposeRegister64()
reg_sp_save = GeneralPurposeRegister64()
LOAD.ARGUMENT(reg_x, x)
LOAD.ARGUMENT(reg_inp, inp)
LOAD.ARGUMENT(reg_outp, outp)
LOAD.ARGUMENT(reg_blocks, nrBlocks)
# Align the stack to a 32 byte boundary.
reg_align = GeneralPurposeRegister64()
MOV(reg_sp_save, registers.rsp)
MOV(reg_align, 0x1f)
NOT(reg_align)
AND(registers.rsp, reg_align)
SUB(registers.rsp, 0x20)
# Build the counter increment vector on the stack, and allocate the scratch
# space
xmm_v0 = XMMRegister()
PXOR(xmm_v0, xmm_v0)
SUB(registers.rsp, 16+16)
MOVDQA([registers.rsp], xmm_v0)
reg_tmp = GeneralPurposeRegister32()
MOV(reg_tmp, 0x00000001)
MOV([registers.rsp], reg_tmp)
mem_one = [registers.rsp] # (Stack) Counter increment vector
mem_tmp0 = [registers.rsp+16] # (Stack) Scratch space.
mem_s0 = [reg_x] # (Memory) Cipher state [0..3]
mem_s1 = [reg_x+16] # (Memory) Cipher state [4..7]
mem_s2 = [reg_x+32] # (Memory) Cipher state [8..11]
mem_s3 = [reg_x+48] # (Memory) Cipher state [12..15]
# xmm_v0 allocated above...
xmm_v1 = XMMRegister()
xmm_v2 = XMMRegister()
xmm_v3 = XMMRegister()
xmm_v4 = XMMRegister()
xmm_v5 = XMMRegister()
xmm_v6 = XMMRegister()
xmm_v7 = XMMRegister()
xmm_v8 = XMMRegister()
xmm_v9 = XMMRegister()
xmm_v10 = XMMRegister()
xmm_v11 = XMMRegister()
xmm_v12 = XMMRegister()
xmm_v13 = XMMRegister()
xmm_v14 = XMMRegister()
xmm_v15 = XMMRegister()
xmm_tmp = xmm_v12
#
# 4 blocks at a time.
#
vector_loop4 = Loop()
SUB(reg_blocks, 4)
JB(vector_loop4.end)
with vector_loop4:
MOVDQA(xmm_v0, mem_s0)
MOVDQA(xmm_v1, mem_s1)
MOVDQA(xmm_v2, mem_s2)
MOVDQA(xmm_v3, mem_s3)
MOVDQA(xmm_v4, xmm_v0)
MOVDQA(xmm_v5, xmm_v1)
MOVDQA(xmm_v6, xmm_v2)
MOVDQA(xmm_v7, xmm_v3)
PADDQ(xmm_v7, mem_one)
MOVDQA(xmm_v8, xmm_v0)
MOVDQA(xmm_v9, xmm_v1)
MOVDQA(xmm_v10, xmm_v2)
MOVDQA(xmm_v11, xmm_v7)
PADDQ(xmm_v11, mem_one)
MOVDQA(xmm_v12, xmm_v0)
MOVDQA(xmm_v13, xmm_v1)
MOVDQA(xmm_v14, xmm_v2)
MOVDQA(xmm_v15, xmm_v11)
PADDQ(xmm_v15, mem_one)
reg_rounds = GeneralPurposeRegister64()
MOV(reg_rounds, 20)
rounds_loop4 = Loop()
with rounds_loop4:
# a += b; d ^= a; d = ROTW16(d);
PADDD(xmm_v0, xmm_v1)
PADDD(xmm_v4, xmm_v5)
PADDD(xmm_v8, xmm_v9)
PADDD(xmm_v12, xmm_v13)
PXOR(xmm_v3, xmm_v0)
PXOR(xmm_v7, xmm_v4)
PXOR(xmm_v11, xmm_v8)
PXOR(xmm_v15, xmm_v12)
MOVDQA(mem_tmp0, xmm_tmp) # Save
ROTW16_sse2(xmm_tmp, xmm_v3)
ROTW16_sse2(xmm_tmp, xmm_v7)
ROTW16_sse2(xmm_tmp, xmm_v11)
ROTW16_sse2(xmm_tmp, xmm_v15)
# c += d; b ^= c; b = ROTW12(b);
PADDD(xmm_v2, xmm_v3)
PADDD(xmm_v6, xmm_v7)
PADDD(xmm_v10, xmm_v11)
PADDD(xmm_v14, xmm_v15)
PXOR(xmm_v1, xmm_v2)
PXOR(xmm_v5, xmm_v6)
PXOR(xmm_v9, xmm_v10)
PXOR(xmm_v13, xmm_v14)
ROTW12_sse2(xmm_tmp, xmm_v1)
ROTW12_sse2(xmm_tmp, xmm_v5)
ROTW12_sse2(xmm_tmp, xmm_v9)
ROTW12_sse2(xmm_tmp, xmm_v13)
# a += b; d ^= a; d = ROTW8(d);
MOVDQA(xmm_tmp, mem_tmp0) # Restore
PADDD(xmm_v0, xmm_v1)
PADDD(xmm_v4, xmm_v5)
PADDD(xmm_v8, xmm_v9)
PADDD(xmm_v12, xmm_v13)
PXOR(xmm_v3, xmm_v0)
PXOR(xmm_v7, xmm_v4)
PXOR(xmm_v11, xmm_v8)
PXOR(xmm_v15, xmm_v12)
MOVDQA(mem_tmp0, xmm_tmp) # Save
ROTW8_sse2(xmm_tmp, xmm_v3)
ROTW8_sse2(xmm_tmp, xmm_v7)
ROTW8_sse2(xmm_tmp, xmm_v11)
ROTW8_sse2(xmm_tmp, xmm_v15)
# c += d; b ^= c; b = ROTW7(b)
PADDD(xmm_v2, xmm_v3)
PADDD(xmm_v6, xmm_v7)
PADDD(xmm_v10, xmm_v11)
PADDD(xmm_v14, xmm_v15)
PXOR(xmm_v1, xmm_v2)
PXOR(xmm_v5, xmm_v6)
PXOR(xmm_v9, xmm_v10)
PXOR(xmm_v13, xmm_v14)
ROTW7_sse2(xmm_tmp, xmm_v1)
ROTW7_sse2(xmm_tmp, xmm_v5)
ROTW7_sse2(xmm_tmp, xmm_v9)
ROTW7_sse2(xmm_tmp, xmm_v13)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
PSHUFD(xmm_v1, xmm_v1, 0x39)
PSHUFD(xmm_v5, xmm_v5, 0x39)
PSHUFD(xmm_v9, xmm_v9, 0x39)
PSHUFD(xmm_v13, xmm_v13, 0x39)
PSHUFD(xmm_v2, xmm_v2, 0x4e)
PSHUFD(xmm_v6, xmm_v6, 0x4e)
PSHUFD(xmm_v10, xmm_v10, 0x4e)
PSHUFD(xmm_v14, xmm_v14, 0x4e)
PSHUFD(xmm_v3, xmm_v3, 0x93)
PSHUFD(xmm_v7, xmm_v7, 0x93)
PSHUFD(xmm_v11, xmm_v11, 0x93)
PSHUFD(xmm_v15, xmm_v15, 0x93)
MOVDQA(xmm_tmp, mem_tmp0) # Restore
# a += b; d ^= a; d = ROTW16(d);
PADDD(xmm_v0, xmm_v1)
PADDD(xmm_v4, xmm_v5)
PADDD(xmm_v8, xmm_v9)
PADDD(xmm_v12, xmm_v13)
PXOR(xmm_v3, xmm_v0)
PXOR(xmm_v7, xmm_v4)
PXOR(xmm_v11, xmm_v8)
PXOR(xmm_v15, xmm_v12)
MOVDQA(mem_tmp0, xmm_tmp) # Save
ROTW16_sse2(xmm_tmp, xmm_v3)
ROTW16_sse2(xmm_tmp, xmm_v7)
ROTW16_sse2(xmm_tmp, xmm_v11)
ROTW16_sse2(xmm_tmp, xmm_v15)
# c += d; b ^= c; b = ROTW12(b);
PADDD(xmm_v2, xmm_v3)
PADDD(xmm_v6, xmm_v7)
PADDD(xmm_v10, xmm_v11)
PADDD(xmm_v14, xmm_v15)
PXOR(xmm_v1, xmm_v2)
PXOR(xmm_v5, xmm_v6)
PXOR(xmm_v9, xmm_v10)
PXOR(xmm_v13, xmm_v14)
ROTW12_sse2(xmm_tmp, xmm_v1)
ROTW12_sse2(xmm_tmp, xmm_v5)
ROTW12_sse2(xmm_tmp, xmm_v9)
ROTW12_sse2(xmm_tmp, xmm_v13)
# a += b; d ^= a; d = ROTW8(d);
MOVDQA(xmm_tmp, mem_tmp0) # Restore
PADDD(xmm_v0, xmm_v1)
PADDD(xmm_v4, xmm_v5)
PADDD(xmm_v8, xmm_v9)
PADDD(xmm_v12, xmm_v13)
PXOR(xmm_v3, xmm_v0)
PXOR(xmm_v7, xmm_v4)
PXOR(xmm_v11, xmm_v8)
PXOR(xmm_v15, xmm_v12)
MOVDQA(mem_tmp0, xmm_tmp) # Save
ROTW8_sse2(xmm_tmp, xmm_v3)
ROTW8_sse2(xmm_tmp, xmm_v7)
ROTW8_sse2(xmm_tmp, xmm_v11)
ROTW8_sse2(xmm_tmp, xmm_v15)
# c += d; b ^= c; b = ROTW7(b)
PADDD(xmm_v2, xmm_v3)
PADDD(xmm_v6, xmm_v7)
PADDD(xmm_v10, xmm_v11)
PADDD(xmm_v14, xmm_v15)
PXOR(xmm_v1, xmm_v2)
PXOR(xmm_v5, xmm_v6)
PXOR(xmm_v9, xmm_v10)
PXOR(xmm_v13, xmm_v14)
ROTW7_sse2(xmm_tmp, xmm_v1)
ROTW7_sse2(xmm_tmp, xmm_v5)
ROTW7_sse2(xmm_tmp, xmm_v9)
ROTW7_sse2(xmm_tmp, xmm_v13)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
PSHUFD(xmm_v1, xmm_v1, 0x93)
PSHUFD(xmm_v5, xmm_v5, 0x93)
PSHUFD(xmm_v9, xmm_v9, 0x93)
PSHUFD(xmm_v13, xmm_v13, 0x93)
PSHUFD(xmm_v2, xmm_v2, 0x4e)
PSHUFD(xmm_v6, xmm_v6, 0x4e)
PSHUFD(xmm_v10, xmm_v10, 0x4e)
PSHUFD(xmm_v14, xmm_v14, 0x4e)
PSHUFD(xmm_v3, xmm_v3, 0x39)
PSHUFD(xmm_v7, xmm_v7, 0x39)
PSHUFD(xmm_v11, xmm_v11, 0x39)
PSHUFD(xmm_v15, xmm_v15, 0x39)
MOVDQA(xmm_tmp, mem_tmp0) # Restore
SUB(reg_rounds, 2)
JNZ(rounds_loop4.begin)
MOVDQA(mem_tmp0, xmm_tmp)
PADDD(xmm_v0, mem_s0)
PADDD(xmm_v1, mem_s1)
PADDD(xmm_v2, mem_s2)
PADDD(xmm_v3, mem_s3)
WriteXor_sse2(xmm_tmp, reg_inp, reg_outp, 0, xmm_v0, xmm_v1, xmm_v2, xmm_v3)
MOVDQA(xmm_v3, mem_s3)
PADDQ(xmm_v3, mem_one)
PADDD(xmm_v4, mem_s0)
PADDD(xmm_v5, mem_s1)
PADDD(xmm_v6, mem_s2)
PADDD(xmm_v7, xmm_v3)
WriteXor_sse2(xmm_tmp, reg_inp, reg_outp, 64, xmm_v4, xmm_v5, xmm_v6, xmm_v7)
PADDQ(xmm_v3, mem_one)
PADDD(xmm_v8, mem_s0)
PADDD(xmm_v9, mem_s1)
PADDD(xmm_v10, mem_s2)
PADDD(xmm_v11, xmm_v3)
WriteXor_sse2(xmm_tmp, reg_inp, reg_outp, 128, xmm_v8, xmm_v9, xmm_v10, xmm_v11)
PADDQ(xmm_v3, mem_one)
MOVDQA(xmm_tmp, mem_tmp0)
PADDD(xmm_v12, mem_s0)
PADDD(xmm_v13, mem_s1)
PADDD(xmm_v14, mem_s2)
PADDD(xmm_v15, xmm_v3)
WriteXor_sse2(xmm_v0, reg_inp, reg_outp, 192, xmm_v12, xmm_v13, xmm_v14, xmm_v15)
PADDQ(xmm_v3, mem_one)
MOVDQA(mem_s3, xmm_v3)
ADD(reg_inp, 4 * 64)
ADD(reg_outp, 4 * 64)
SUB(reg_blocks, 4)
JAE(vector_loop4.begin)
ADD(reg_blocks, 4)
out = Label()
JZ(out)
# Past this point, we no longer need to use every single register to hold
# the in-progress state.
xmm_s0 = xmm_v8
xmm_s1 = xmm_v9
xmm_s2 = xmm_v10
xmm_s3 = xmm_v11
xmm_one = xmm_v13
MOVDQA(xmm_s0, mem_s0)
MOVDQA(xmm_s1, mem_s1)
MOVDQA(xmm_s2, mem_s2)
MOVDQA(xmm_s3, mem_s3)
MOVDQA(xmm_one, mem_one)
#
# 2 blocks at a time.
#
SUB(reg_blocks, 2)
vector_loop2 = Loop()
JB(vector_loop2.end)
with vector_loop2:
MOVDQA(xmm_v0, xmm_s0)
MOVDQA(xmm_v1, xmm_s1)
MOVDQA(xmm_v2, xmm_s2)
MOVDQA(xmm_v3, xmm_s3)
MOVDQA(xmm_v4, xmm_v0)
MOVDQA(xmm_v5, xmm_v1)
MOVDQA(xmm_v6, xmm_v2)
MOVDQA(xmm_v7, xmm_v3)
PADDQ(xmm_v7, xmm_one)
reg_rounds = GeneralPurposeRegister64()
MOV(reg_rounds, 20)
rounds_loop2 = Loop()
with rounds_loop2:
# a += b; d ^= a; d = ROTW16(d);
PADDD(xmm_v0, xmm_v1)
PADDD(xmm_v4, xmm_v5)
PXOR(xmm_v3, xmm_v0)
PXOR(xmm_v7, xmm_v4)
ROTW16_sse2(xmm_tmp, xmm_v3)
ROTW16_sse2(xmm_tmp, xmm_v7)
# c += d; b ^= c; b = ROTW12(b);
PADDD(xmm_v2, xmm_v3)
PADDD(xmm_v6, xmm_v7)
PXOR(xmm_v1, xmm_v2)
PXOR(xmm_v5, xmm_v6)
ROTW12_sse2(xmm_tmp, xmm_v1)
ROTW12_sse2(xmm_tmp, xmm_v5)
# a += b; d ^= a; d = ROTW8(d);
PADDD(xmm_v0, xmm_v1)
PADDD(xmm_v4, xmm_v5)
PXOR(xmm_v3, xmm_v0)
PXOR(xmm_v7, xmm_v4)
ROTW8_sse2(xmm_tmp, xmm_v3)
ROTW8_sse2(xmm_tmp, xmm_v7)
# c += d; b ^= c; b = ROTW7(b)
PADDD(xmm_v2, xmm_v3)
PADDD(xmm_v6, xmm_v7)
PXOR(xmm_v1, xmm_v2)
PXOR(xmm_v5, xmm_v6)
ROTW7_sse2(xmm_tmp, xmm_v1)
ROTW7_sse2(xmm_tmp, xmm_v5)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
PSHUFD(xmm_v1, xmm_v1, 0x39)
PSHUFD(xmm_v5, xmm_v5, 0x39)
PSHUFD(xmm_v2, xmm_v2, 0x4e)
PSHUFD(xmm_v6, xmm_v6, 0x4e)
PSHUFD(xmm_v3, xmm_v3, 0x93)
PSHUFD(xmm_v7, xmm_v7, 0x93)
# a += b; d ^= a; d = ROTW16(d);
PADDD(xmm_v0, xmm_v1)
PADDD(xmm_v4, xmm_v5)
PXOR(xmm_v3, xmm_v0)
PXOR(xmm_v7, xmm_v4)
ROTW16_sse2(xmm_tmp, xmm_v3)
ROTW16_sse2(xmm_tmp, xmm_v7)
# c += d; b ^= c; b = ROTW12(b);
PADDD(xmm_v2, xmm_v3)
PADDD(xmm_v6, xmm_v7)
PXOR(xmm_v1, xmm_v2)
PXOR(xmm_v5, xmm_v6)
ROTW12_sse2(xmm_tmp, xmm_v1)
ROTW12_sse2(xmm_tmp, xmm_v5)
# a += b; d ^= a; d = ROTW8(d);
PADDD(xmm_v0, xmm_v1)
PADDD(xmm_v4, xmm_v5)
PXOR(xmm_v3, xmm_v0)
PXOR(xmm_v7, xmm_v4)
ROTW8_sse2(xmm_tmp, xmm_v3)
ROTW8_sse2(xmm_tmp, xmm_v7)
# c += d; b ^= c; b = ROTW7(b)
PADDD(xmm_v2, xmm_v3)
PADDD(xmm_v6, xmm_v7)
PXOR(xmm_v1, xmm_v2)
PXOR(xmm_v5, xmm_v6)
ROTW7_sse2(xmm_tmp, xmm_v1)
ROTW7_sse2(xmm_tmp, xmm_v5)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
PSHUFD(xmm_v1, xmm_v1, 0x93)
PSHUFD(xmm_v5, xmm_v5, 0x93)
PSHUFD(xmm_v2, xmm_v2, 0x4e)
PSHUFD(xmm_v6, xmm_v6, 0x4e)
PSHUFD(xmm_v3, xmm_v3, 0x39)
PSHUFD(xmm_v7, xmm_v7, 0x39)
SUB(reg_rounds, 2)
JNZ(rounds_loop2.begin)
PADDD(xmm_v0, xmm_s0)
PADDD(xmm_v1, xmm_s1)
PADDD(xmm_v2, xmm_s2)
PADDD(xmm_v3, xmm_s3)
WriteXor_sse2(xmm_tmp, reg_inp, reg_outp, 0, xmm_v0, xmm_v1, xmm_v2, xmm_v3)
PADDQ(xmm_s3, xmm_one)
PADDD(xmm_v4, xmm_s0)
PADDD(xmm_v5, xmm_s1)
PADDD(xmm_v6, xmm_s2)
PADDD(xmm_v7, xmm_s3)
WriteXor_sse2(xmm_tmp, reg_inp, reg_outp, 64, xmm_v4, xmm_v5, xmm_v6, xmm_v7)
PADDQ(xmm_s3, xmm_one)
ADD(reg_inp, 2 * 64)
ADD(reg_outp, 2 * 64)
SUB(reg_blocks, 2)
JAE(vector_loop2.begin)
ADD(reg_blocks, 2)
out_serial = Label()
JZ(out_serial)
#
# 1 block at a time. Only executed once, because if there were more than
# one remaining, the parallel code would have processed it already.
#
MOVDQA(xmm_v0, xmm_s0)
MOVDQA(xmm_v1, xmm_s1)
MOVDQA(xmm_v2, xmm_s2)
MOVDQA(xmm_v3, xmm_s3)
reg_rounds = GeneralPurposeRegister64()
MOV(reg_rounds, 20)
rounds_loop1 = Loop()
with rounds_loop1:
# a += b; d ^= a; d = ROTW16(d);
PADDD(xmm_v0, xmm_v1)
PXOR(xmm_v3, xmm_v0)
ROTW16_sse2(xmm_tmp, xmm_v3)
# c += d; b ^= c; b = ROTW12(b);
PADDD(xmm_v2, xmm_v3)
PXOR(xmm_v1, xmm_v2)
ROTW12_sse2(xmm_tmp, xmm_v1)
# a += b; d ^= a; d = ROTW8(d);
PADDD(xmm_v0, xmm_v1)
PXOR(xmm_v3, xmm_v0)
ROTW8_sse2(xmm_tmp, xmm_v3)
# c += d; b ^= c; b = ROTW7(b)
PADDD(xmm_v2, xmm_v3)
PXOR(xmm_v1, xmm_v2)
ROTW7_sse2(xmm_tmp, xmm_v1)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
PSHUFD(xmm_v1, xmm_v1, 0x39)
PSHUFD(xmm_v2, xmm_v2, 0x4e)
PSHUFD(xmm_v3, xmm_v3, 0x93)
# a += b; d ^= a; d = ROTW16(d);
PADDD(xmm_v0, xmm_v1)
PXOR(xmm_v3, xmm_v0)
ROTW16_sse2(xmm_tmp, xmm_v3)
# c += d; b ^= c; b = ROTW12(b);
PADDD(xmm_v2, xmm_v3)
PXOR(xmm_v1, xmm_v2)
ROTW12_sse2(xmm_tmp, xmm_v1)
# a += b; d ^= a; d = ROTW8(d);
PADDD(xmm_v0, xmm_v1)
PXOR(xmm_v3, xmm_v0)
ROTW8_sse2(xmm_tmp, xmm_v3)
# c += d; b ^= c; b = ROTW7(b)
PADDD(xmm_v2, xmm_v3)
PXOR(xmm_v1, xmm_v2)
ROTW7_sse2(xmm_tmp, xmm_v1)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
PSHUFD(xmm_v1, xmm_v1, 0x93)
PSHUFD(xmm_v2, xmm_v2, 0x4e)
PSHUFD(xmm_v3, xmm_v3, 0x39)
SUB(reg_rounds, 2)
JNZ(rounds_loop1.begin)
PADDD(xmm_v0, xmm_s0)
PADDD(xmm_v1, xmm_s1)
PADDD(xmm_v2, xmm_s2)
PADDD(xmm_v3, xmm_s3)
WriteXor_sse2(xmm_tmp, reg_inp, reg_outp, 0, xmm_v0, xmm_v1, xmm_v2, xmm_v3)
PADDQ(xmm_s3, xmm_one)
LABEL(out_serial)
# Write back the updated counter. Stopping at 2^70 bytes is the user's
# problem, not mine. (Skipped if there's exactly a multiple of 4 blocks
# because the counter is incremented in memory while looping.)
MOVDQA(mem_s3, xmm_s3)
LABEL(out)
# Paranoia, cleanse the scratch space.
PXOR(xmm_v0, xmm_v0)
MOVDQA(mem_tmp0, xmm_v0)
# Remove our stack allocation.
MOV(registers.rsp, reg_sp_save)
RETURN()
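After the rounds loop, the PADDD of mem_s0..mem_s3 into each block and the PADDQ of mem_one into the counter row are the usual ChaCha finalization. In scalar terms (a sketch; s is the post-rounds working state, x the 16-word input state, block a 64-byte output buffer):

// Keystream block = working state + input state (the PADDD steps), then
// the 64-bit counter in words 12..13 advances by one (the PADDQ step).
for i, v := range s {
    binary.LittleEndian.PutUint32(block[i*4:], v+x[i])
}
if x[12]++; x[12] == 0 {
    x[13]++
}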
#
# AVX2 helpers. Like the SSE2 equivalents, the scratch register is explicit,
# and more helpers are used to increase readability for destructive operations.
#
# XXX/Performance: ROTW16_avx2/ROTW8_avx2 both can use VPSHUFB.
#
def ADD_avx2(dst, src):
VPADDD(dst, dst, src)
def XOR_avx2(dst, src):
VPXOR(dst, dst, src)
def ROTW16_avx2(tmp, d):
VPSLLD(tmp, d, 16)
VPSRLD(d, d, 16)
XOR_avx2(d, tmp)
def ROTW12_avx2(tmp, b):
VPSLLD(tmp, b, 12)
VPSRLD(b, b, 20)
XOR_avx2(b, tmp)
def ROTW8_avx2(tmp, d):
VPSLLD(tmp, d, 8)
VPSRLD(d, d, 24)
XOR_avx2(d, tmp)
def ROTW7_avx2(tmp, b):
VPSLLD(tmp, b, 7)
VPSRLD(b, b, 25)
XOR_avx2(b, tmp)
def WriteXor_avx2(tmp, inp, outp, d, v0, v1, v2, v3):
# XOR_WRITE(out+ 0, in+ 0, _mm256_permute2x128_si256(v0,v1,0x20));
VPERM2I128(tmp, v0, v1, 0x20)
VPXOR(tmp, tmp, [inp+d])
VMOVDQU([outp+d], tmp)
# XOR_WRITE(out+32, in+32, _mm256_permute2x128_si256(v2,v3,0x20));
VPERM2I128(tmp, v2, v3, 0x20)
VPXOR(tmp, tmp, [inp+d+32])
VMOVDQU([outp+d+32], tmp)
# XOR_WRITE(out+64, in+64, _mm256_permute2x128_si256(v0,v1,0x31));
VPERM2I128(tmp, v0, v1, 0x31)
VPXOR(tmp, tmp, [inp+d+64])
VMOVDQU([outp+d+64], tmp)
# XOR_WRITE(out+96, in+96, _mm256_permute2x128_si256(v2,v3,0x31));
VPERM2I128(tmp, v2, v3, 0x31)
VPXOR(tmp, tmp, [inp+d+96])
VMOVDQU([outp+d+96], tmp)
# AVX2 ChaCha20 (aka avx2). Does not handle partial blocks, and will
# process 8/4/2 blocks at a time. As with the SSE2 version, x (the
# ChaCha20 state) must be 16 byte aligned.
with Function("blocksAmd64AVX2", (x, inp, outp, nrBlocks), target=uarch.broadwell):
reg_x = GeneralPurposeRegister64()
reg_inp = GeneralPurposeRegister64()
reg_outp = GeneralPurposeRegister64()
reg_blocks = GeneralPurposeRegister64()
reg_sp_save = GeneralPurposeRegister64()
LOAD.ARGUMENT(reg_x, x)
LOAD.ARGUMENT(reg_inp, inp)
LOAD.ARGUMENT(reg_outp, outp)
LOAD.ARGUMENT(reg_blocks, nrBlocks)
# Align the stack to a 32 byte boundary.
reg_align = GeneralPurposeRegister64()
MOV(reg_sp_save, registers.rsp)
MOV(reg_align, 0x1f)
NOT(reg_align)
AND(registers.rsp, reg_align)
SUB(registers.rsp, 0x20)
x_s0 = [reg_x] # (Memory) Cipher state [0..3]
x_s1 = [reg_x+16] # (Memory) Cipher state [4..7]
x_s2 = [reg_x+32] # (Memory) Cipher state [8..11]
x_s3 = [reg_x+48] # (Memory) Cipher state [12..15]
ymm_v0 = YMMRegister()
ymm_v1 = YMMRegister()
ymm_v2 = YMMRegister()
ymm_v3 = YMMRegister()
ymm_v4 = YMMRegister()
ymm_v5 = YMMRegister()
ymm_v6 = YMMRegister()
ymm_v7 = YMMRegister()
ymm_v8 = YMMRegister()
ymm_v9 = YMMRegister()
ymm_v10 = YMMRegister()
ymm_v11 = YMMRegister()
ymm_v12 = YMMRegister()
ymm_v13 = YMMRegister()
ymm_v14 = YMMRegister()
ymm_v15 = YMMRegister()
ymm_tmp0 = ymm_v12
# Allocate the necessary stack space for the counter vector and two ymm
# registers that we will spill.
SUB(registers.rsp, 96)
mem_tmp0 = [registers.rsp+64] # (Stack) Scratch space.
mem_s3 = [registers.rsp+32] # (Stack) Working copy of s3. (8x)
mem_inc = [registers.rsp] # (Stack) Counter increment vector.
# Increment the counter for one side of the state vector.
VPXOR(ymm_tmp0, ymm_tmp0, ymm_tmp0)
VMOVDQU(mem_inc, ymm_tmp0)
reg_tmp = GeneralPurposeRegister32()
MOV(reg_tmp, 0x00000001)
MOV([registers.rsp+16], reg_tmp)
VBROADCASTI128(ymm_v3, x_s3)
VPADDQ(ymm_v3, ymm_v3, [registers.rsp])
VMOVDQA(mem_s3, ymm_v3)
# Since we process 2xN blocks at a time, the counter increment for both
# sides of the state vector is 2.
MOV(reg_tmp, 0x00000002)
MOV([registers.rsp], reg_tmp)
MOV([registers.rsp+16], reg_tmp)
out_write_even = Label()
out_write_odd = Label()
#
# 8 blocks at a time. Ted Krovetz's avx2 code does not do this, but it's
# a decent gain despite all the pain...
#
vector_loop8 = Loop()
SUB(reg_blocks, 8)
JB(vector_loop8.end)
with vector_loop8:
VBROADCASTI128(ymm_v0, x_s0)
VBROADCASTI128(ymm_v1, x_s1)
VBROADCASTI128(ymm_v2, x_s2)
VMOVDQA(ymm_v3, mem_s3)
VMOVDQA(ymm_v4, ymm_v0)
VMOVDQA(ymm_v5, ymm_v1)
VMOVDQA(ymm_v6, ymm_v2)
VPADDQ(ymm_v7, ymm_v3, mem_inc)
VMOVDQA(ymm_v8, ymm_v0)
VMOVDQA(ymm_v9, ymm_v1)
VMOVDQA(ymm_v10, ymm_v2)
VPADDQ(ymm_v11, ymm_v7, mem_inc)
VMOVDQA(ymm_v12, ymm_v0)
VMOVDQA(ymm_v13, ymm_v1)
VMOVDQA(ymm_v14, ymm_v2)
VPADDQ(ymm_v15, ymm_v11, mem_inc)
reg_rounds = GeneralPurposeRegister64()
MOV(reg_rounds, 20)
rounds_loop8 = Loop()
with rounds_loop8:
# a += b; d ^= a; d = ROTW16(d);
ADD_avx2(ymm_v0, ymm_v1)
ADD_avx2(ymm_v4, ymm_v5)
ADD_avx2(ymm_v8, ymm_v9)
ADD_avx2(ymm_v12, ymm_v13)
XOR_avx2(ymm_v3, ymm_v0)
XOR_avx2(ymm_v7, ymm_v4)
XOR_avx2(ymm_v11, ymm_v8)
XOR_avx2(ymm_v15, ymm_v12)
VMOVDQA(mem_tmp0, ymm_tmp0) # Save
ROTW16_avx2(ymm_tmp0, ymm_v3)
ROTW16_avx2(ymm_tmp0, ymm_v7)
ROTW16_avx2(ymm_tmp0, ymm_v11)
ROTW16_avx2(ymm_tmp0, ymm_v15)
# c += d; b ^= c; b = ROTW12(b);
ADD_avx2(ymm_v2, ymm_v3)
ADD_avx2(ymm_v6, ymm_v7)
ADD_avx2(ymm_v10, ymm_v11)
ADD_avx2(ymm_v14, ymm_v15)
XOR_avx2(ymm_v1, ymm_v2)
XOR_avx2(ymm_v5, ymm_v6)
XOR_avx2(ymm_v9, ymm_v10)
XOR_avx2(ymm_v13, ymm_v14)
ROTW12_avx2(ymm_tmp0, ymm_v1)
ROTW12_avx2(ymm_tmp0, ymm_v5)
ROTW12_avx2(ymm_tmp0, ymm_v9)
ROTW12_avx2(ymm_tmp0, ymm_v13)
# a += b; d ^= a; d = ROTW8(d);
VMOVDQA(ymm_tmp0, mem_tmp0) # Restore
ADD_avx2(ymm_v0, ymm_v1)
ADD_avx2(ymm_v4, ymm_v5)
ADD_avx2(ymm_v8, ymm_v9)
ADD_avx2(ymm_v12, ymm_v13)
XOR_avx2(ymm_v3, ymm_v0)
XOR_avx2(ymm_v7, ymm_v4)
XOR_avx2(ymm_v11, ymm_v8)
XOR_avx2(ymm_v15, ymm_v12)
VMOVDQA(mem_tmp0, ymm_tmp0) # Save
ROTW8_avx2(ymm_tmp0, ymm_v3)
ROTW8_avx2(ymm_tmp0, ymm_v7)
ROTW8_avx2(ymm_tmp0, ymm_v11)
ROTW8_avx2(ymm_tmp0, ymm_v15)
# c += d; b ^= c; b = ROTW7(b)
ADD_avx2(ymm_v2, ymm_v3)
ADD_avx2(ymm_v6, ymm_v7)
ADD_avx2(ymm_v10, ymm_v11)
ADD_avx2(ymm_v14, ymm_v15)
XOR_avx2(ymm_v1, ymm_v2)
XOR_avx2(ymm_v5, ymm_v6)
XOR_avx2(ymm_v9, ymm_v10)
XOR_avx2(ymm_v13, ymm_v14)
ROTW7_avx2(ymm_tmp0, ymm_v1)
ROTW7_avx2(ymm_tmp0, ymm_v5)
ROTW7_avx2(ymm_tmp0, ymm_v9)
ROTW7_avx2(ymm_tmp0, ymm_v13)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
VPSHUFD(ymm_v1, ymm_v1, 0x39)
VPSHUFD(ymm_v5, ymm_v5, 0x39)
VPSHUFD(ymm_v9, ymm_v9, 0x39)
VPSHUFD(ymm_v13, ymm_v13, 0x39)
VPSHUFD(ymm_v2, ymm_v2, 0x4e)
VPSHUFD(ymm_v6, ymm_v6, 0x4e)
VPSHUFD(ymm_v10, ymm_v10, 0x4e)
VPSHUFD(ymm_v14, ymm_v14, 0x4e)
VPSHUFD(ymm_v3, ymm_v3, 0x93)
VPSHUFD(ymm_v7, ymm_v7, 0x93)
VPSHUFD(ymm_v11, ymm_v11, 0x93)
VPSHUFD(ymm_v15, ymm_v15, 0x93)
# a += b; d ^= a; d = ROTW16(d);
VMOVDQA(ymm_tmp0, mem_tmp0) # Restore
ADD_avx2(ymm_v0, ymm_v1)
ADD_avx2(ymm_v4, ymm_v5)
ADD_avx2(ymm_v8, ymm_v9)
ADD_avx2(ymm_v12, ymm_v13)
XOR_avx2(ymm_v3, ymm_v0)
XOR_avx2(ymm_v7, ymm_v4)
XOR_avx2(ymm_v11, ymm_v8)
XOR_avx2(ymm_v15, ymm_v12)
VMOVDQA(mem_tmp0, ymm_tmp0) # Save
ROTW16_avx2(ymm_tmp0, ymm_v3)
ROTW16_avx2(ymm_tmp0, ymm_v7)
ROTW16_avx2(ymm_tmp0, ymm_v11)
ROTW16_avx2(ymm_tmp0, ymm_v15)
# c += d; b ^= c; b = ROTW12(b);
ADD_avx2(ymm_v2, ymm_v3)
ADD_avx2(ymm_v6, ymm_v7)
ADD_avx2(ymm_v10, ymm_v11)
ADD_avx2(ymm_v14, ymm_v15)
XOR_avx2(ymm_v1, ymm_v2)
XOR_avx2(ymm_v5, ymm_v6)
XOR_avx2(ymm_v9, ymm_v10)
XOR_avx2(ymm_v13, ymm_v14)
ROTW12_avx2(ymm_tmp0, ymm_v1)
ROTW12_avx2(ymm_tmp0, ymm_v5)
ROTW12_avx2(ymm_tmp0, ymm_v9)
ROTW12_avx2(ymm_tmp0, ymm_v13)
# a += b; d ^= a; d = ROTW8(d);
VMOVDQA(ymm_tmp0, mem_tmp0) # Restore
ADD_avx2(ymm_v0, ymm_v1)
ADD_avx2(ymm_v4, ymm_v5)
ADD_avx2(ymm_v8, ymm_v9)
ADD_avx2(ymm_v12, ymm_v13)
XOR_avx2(ymm_v3, ymm_v0)
XOR_avx2(ymm_v7, ymm_v4)
XOR_avx2(ymm_v11, ymm_v8)
XOR_avx2(ymm_v15, ymm_v12)
VMOVDQA(mem_tmp0, ymm_tmp0) # Save
ROTW8_avx2(ymm_tmp0, ymm_v3)
ROTW8_avx2(ymm_tmp0, ymm_v7)
ROTW8_avx2(ymm_tmp0, ymm_v11)
ROTW8_avx2(ymm_tmp0, ymm_v15)
# c += d; b ^= c; b = ROTW7(b)
ADD_avx2(ymm_v2, ymm_v3)
ADD_avx2(ymm_v6, ymm_v7)
ADD_avx2(ymm_v10, ymm_v11)
ADD_avx2(ymm_v14, ymm_v15)
XOR_avx2(ymm_v1, ymm_v2)
XOR_avx2(ymm_v5, ymm_v6)
XOR_avx2(ymm_v9, ymm_v10)
XOR_avx2(ymm_v13, ymm_v14)
ROTW7_avx2(ymm_tmp0, ymm_v1)
ROTW7_avx2(ymm_tmp0, ymm_v5)
ROTW7_avx2(ymm_tmp0, ymm_v9)
ROTW7_avx2(ymm_tmp0, ymm_v13)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
VPSHUFD(ymm_v1, ymm_v1, 0x93)
VPSHUFD(ymm_v5, ymm_v5, 0x93)
VPSHUFD(ymm_v9, ymm_v9, 0x93)
VPSHUFD(ymm_v13, ymm_v13, 0x93)
VPSHUFD(ymm_v2, ymm_v2, 0x4e)
VPSHUFD(ymm_v6, ymm_v6, 0x4e)
VPSHUFD(ymm_v10, ymm_v10, 0x4e)
VPSHUFD(ymm_v14, ymm_v14, 0x4e)
VPSHUFD(ymm_v3, ymm_v3, 0x39)
VPSHUFD(ymm_v7, ymm_v7, 0x39)
VPSHUFD(ymm_v11, ymm_v11, 0x39)
VPSHUFD(ymm_v15, ymm_v15, 0x39)
VMOVDQA(ymm_tmp0, mem_tmp0) # Restore
SUB(reg_rounds, 2)
JNZ(rounds_loop8.begin)
# ymm_v12 is in mem_tmp0 and is current....
# XXX: I assume VBROADCASTI128 is about as fast as VMOVDQA....
VBROADCASTI128(ymm_tmp0, x_s0)
ADD_avx2(ymm_v0, ymm_tmp0)
ADD_avx2(ymm_v4, ymm_tmp0)
ADD_avx2(ymm_v8, ymm_tmp0)
ADD_avx2(ymm_tmp0, mem_tmp0)
VMOVDQA(mem_tmp0, ymm_tmp0)
VBROADCASTI128(ymm_tmp0, x_s1)
ADD_avx2(ymm_v1, ymm_tmp0)
ADD_avx2(ymm_v5, ymm_tmp0)
ADD_avx2(ymm_v9, ymm_tmp0)
ADD_avx2(ymm_v13, ymm_tmp0)
VBROADCASTI128(ymm_tmp0, x_s2)
ADD_avx2(ymm_v2, ymm_tmp0)
ADD_avx2(ymm_v6, ymm_tmp0)
ADD_avx2(ymm_v10, ymm_tmp0)
ADD_avx2(ymm_v14, ymm_tmp0)
ADD_avx2(ymm_v3, mem_s3)
WriteXor_avx2(ymm_tmp0, reg_inp, reg_outp, 0, ymm_v0, ymm_v1, ymm_v2, ymm_v3)
VMOVDQA(ymm_v3, mem_s3)
ADD_avx2(ymm_v3, mem_inc)
ADD_avx2(ymm_v7, ymm_v3)
WriteXor_avx2(ymm_tmp0, reg_inp, reg_outp, 128, ymm_v4, ymm_v5, ymm_v6, ymm_v7)
ADD_avx2(ymm_v3, mem_inc)
ADD_avx2(ymm_v11, ymm_v3)
WriteXor_avx2(ymm_tmp0, reg_inp, reg_outp, 256, ymm_v8, ymm_v9, ymm_v10, ymm_v11)
ADD_avx2(ymm_v3, mem_inc)
VMOVDQA(ymm_v12, mem_tmp0)
ADD_avx2(ymm_v15, ymm_v3)
WriteXor_avx2(ymm_v0, reg_inp, reg_outp, 384, ymm_v12, ymm_v13, ymm_v14, ymm_v15)
ADD_avx2(ymm_v3, mem_inc)
VMOVDQA(mem_s3, ymm_v3)
ADD(reg_inp, 8 * 64)
ADD(reg_outp, 8 * 64)
SUB(reg_blocks, 8)
JAE(vector_loop8.begin)
# ymm_v3 contains a current copy of mem_s3 either from when it was built,
# or because the loop updates it. Copy this before we mess with the block
# counter in case we need to write it back and return.
ymm_s3 = ymm_v11
VMOVDQA(ymm_s3, ymm_v3)
ADD(reg_blocks, 8)
JZ(out_write_even)
# We now actually can do everything in registers.
ymm_s0 = ymm_v8
VBROADCASTI128(ymm_s0, x_s0)
ymm_s1 = ymm_v9
VBROADCASTI128(ymm_s1, x_s1)
ymm_s2 = ymm_v10
VBROADCASTI128(ymm_s2, x_s2)
ymm_inc = ymm_v14
VMOVDQA(ymm_inc, mem_inc)
#
# 4 blocks at a time.
#
SUB(reg_blocks, 4)
vector_loop4 = Loop()
JB(vector_loop4.end)
with vector_loop4:
VMOVDQA(ymm_v0, ymm_s0)
VMOVDQA(ymm_v1, ymm_s1)
VMOVDQA(ymm_v2, ymm_s2)
VMOVDQA(ymm_v3, ymm_s3)
VMOVDQA(ymm_v4, ymm_v0)
VMOVDQA(ymm_v5, ymm_v1)
VMOVDQA(ymm_v6, ymm_v2)
VPADDQ(ymm_v7, ymm_v3, ymm_inc)
reg_rounds = GeneralPurposeRegister64()
MOV(reg_rounds, 20)
rounds_loop4 = Loop()
with rounds_loop4:
# a += b; d ^= a; d = ROTW16(d);
ADD_avx2(ymm_v0, ymm_v1)
ADD_avx2(ymm_v4, ymm_v5)
XOR_avx2(ymm_v3, ymm_v0)
XOR_avx2(ymm_v7, ymm_v4)
ROTW16_avx2(ymm_tmp0, ymm_v3)
ROTW16_avx2(ymm_tmp0, ymm_v7)
# c += d; b ^= c; b = ROTW12(b);
ADD_avx2(ymm_v2, ymm_v3)
ADD_avx2(ymm_v6, ymm_v7)
XOR_avx2(ymm_v1, ymm_v2)
XOR_avx2(ymm_v5, ymm_v6)
ROTW12_avx2(ymm_tmp0, ymm_v1)
ROTW12_avx2(ymm_tmp0, ymm_v5)
# a += b; d ^= a; d = ROTW8(d);
ADD_avx2(ymm_v0, ymm_v1)
ADD_avx2(ymm_v4, ymm_v5)
XOR_avx2(ymm_v3, ymm_v0)
XOR_avx2(ymm_v7, ymm_v4)
ROTW8_avx2(ymm_tmp0, ymm_v3)
ROTW8_avx2(ymm_tmp0, ymm_v7)
# c += d; b ^= c; b = ROTW7(b)
ADD_avx2(ymm_v2, ymm_v3)
ADD_avx2(ymm_v6, ymm_v7)
XOR_avx2(ymm_v1, ymm_v2)
XOR_avx2(ymm_v5, ymm_v6)
ROTW7_avx2(ymm_tmp0, ymm_v1)
ROTW7_avx2(ymm_tmp0, ymm_v5)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
VPSHUFD(ymm_v1, ymm_v1, 0x39)
VPSHUFD(ymm_v5, ymm_v5, 0x39)
VPSHUFD(ymm_v2, ymm_v2, 0x4e)
VPSHUFD(ymm_v6, ymm_v6, 0x4e)
VPSHUFD(ymm_v3, ymm_v3, 0x93)
VPSHUFD(ymm_v7, ymm_v7, 0x93)
# a += b; d ^= a; d = ROTW16(d);
ADD_avx2(ymm_v0, ymm_v1)
ADD_avx2(ymm_v4, ymm_v5)
XOR_avx2(ymm_v3, ymm_v0)
XOR_avx2(ymm_v7, ymm_v4)
ROTW16_avx2(ymm_tmp0, ymm_v3)
ROTW16_avx2(ymm_tmp0, ymm_v7)
# c += d; b ^= c; b = ROTW12(b);
ADD_avx2(ymm_v2, ymm_v3)
ADD_avx2(ymm_v6, ymm_v7)
XOR_avx2(ymm_v1, ymm_v2)
XOR_avx2(ymm_v5, ymm_v6)
ROTW12_avx2(ymm_tmp0, ymm_v1)
ROTW12_avx2(ymm_tmp0, ymm_v5)
# a += b; d ^= a; d = ROTW8(d);
ADD_avx2(ymm_v0, ymm_v1)
ADD_avx2(ymm_v4, ymm_v5)
XOR_avx2(ymm_v3, ymm_v0)
XOR_avx2(ymm_v7, ymm_v4)
ROTW8_avx2(ymm_tmp0, ymm_v3)
ROTW8_avx2(ymm_tmp0, ymm_v7)
# c += d; b ^= c; b = ROTW7(b)
ADD_avx2(ymm_v2, ymm_v3)
ADD_avx2(ymm_v6, ymm_v7)
XOR_avx2(ymm_v1, ymm_v2)
XOR_avx2(ymm_v5, ymm_v6)
ROTW7_avx2(ymm_tmp0, ymm_v1)
ROTW7_avx2(ymm_tmp0, ymm_v5)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
VPSHUFD(ymm_v1, ymm_v1, 0x93)
VPSHUFD(ymm_v5, ymm_v5, 0x93)
VPSHUFD(ymm_v2, ymm_v2, 0x4e)
VPSHUFD(ymm_v6, ymm_v6, 0x4e)
VPSHUFD(ymm_v3, ymm_v3, 0x39)
VPSHUFD(ymm_v7, ymm_v7, 0x39)
SUB(reg_rounds, 2)
JNZ(rounds_loop4.begin)
ADD_avx2(ymm_v0, ymm_s0)
ADD_avx2(ymm_v1, ymm_s1)
ADD_avx2(ymm_v2, ymm_s2)
ADD_avx2(ymm_v3, ymm_s3)
WriteXor_avx2(ymm_tmp0, reg_inp, reg_outp, 0, ymm_v0, ymm_v1, ymm_v2, ymm_v3)
ADD_avx2(ymm_s3, ymm_inc)
ADD_avx2(ymm_v4, ymm_s0)
ADD_avx2(ymm_v5, ymm_s1)
ADD_avx2(ymm_v6, ymm_s2)
ADD_avx2(ymm_v7, ymm_s3)
WriteXor_avx2(ymm_tmp0, reg_inp, reg_outp, 128, ymm_v4, ymm_v5, ymm_v6, ymm_v7)
ADD_avx2(ymm_s3, ymm_inc)
ADD(reg_inp, 4 * 64)
ADD(reg_outp, 4 * 64)
SUB(reg_blocks, 4)
JAE(vector_loop4.begin)
ADD(reg_blocks, 4)
JZ(out_write_even)
#
# 2/1 blocks at a time. The two codepaths are unified because
# with AVX2 we do 2 blocks at a time anyway, and this only gets called
# if 3/2/1 blocks are remaining, so the extra branches don't hurt that
# much.
#
vector_loop2 = Loop()
with vector_loop2:
VMOVDQA(ymm_v0, ymm_s0)
VMOVDQA(ymm_v1, ymm_s1)
VMOVDQA(ymm_v2, ymm_s2)
VMOVDQA(ymm_v3, ymm_s3)
reg_rounds = GeneralPurposeRegister64()
MOV(reg_rounds, 20)
rounds_loop2 = Loop()
with rounds_loop2:
# a += b; d ^= a; d = ROTW16(d);
ADD_avx2(ymm_v0, ymm_v1)
XOR_avx2(ymm_v3, ymm_v0)
ROTW16_avx2(ymm_tmp0, ymm_v3)
# c += d; b ^= c; b = ROTW12(b);
ADD_avx2(ymm_v2, ymm_v3)
XOR_avx2(ymm_v1, ymm_v2)
ROTW12_avx2(ymm_tmp0, ymm_v1)
# a += b; d ^= a; d = ROTW8(d);
ADD_avx2(ymm_v0, ymm_v1)
XOR_avx2(ymm_v3, ymm_v0)
ROTW8_avx2(ymm_tmp0, ymm_v3)
# c += d; b ^= c; b = ROTW7(b)
ADD_avx2(ymm_v2, ymm_v3)
XOR_avx2(ymm_v1, ymm_v2)
ROTW7_avx2(ymm_tmp0, ymm_v1)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
VPSHUFD(ymm_v1, ymm_v1, 0x39)
VPSHUFD(ymm_v2, ymm_v2, 0x4e)
VPSHUFD(ymm_v3, ymm_v3, 0x93)
# a += b; d ^= a; d = ROTW16(d);
ADD_avx2(ymm_v0, ymm_v1)
XOR_avx2(ymm_v3, ymm_v0)
ROTW16_avx2(ymm_tmp0, ymm_v3)
# c += d; b ^= c; b = ROTW12(b);
ADD_avx2(ymm_v2, ymm_v3)
XOR_avx2(ymm_v1, ymm_v2)
ROTW12_avx2(ymm_tmp0, ymm_v1)
# a += b; d ^= a; d = ROTW8(d);
ADD_avx2(ymm_v0, ymm_v1)
XOR_avx2(ymm_v3, ymm_v0)
ROTW8_avx2(ymm_tmp0, ymm_v3)
# c += d; b ^= c; b = ROTW7(b)
ADD_avx2(ymm_v2, ymm_v3)
XOR_avx2(ymm_v1, ymm_v2)
ROTW7_avx2(ymm_tmp0, ymm_v1)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
VPSHUFD(ymm_v1, ymm_v1, 0x93)
VPSHUFD(ymm_v2, ymm_v2, 0x4e)
VPSHUFD(ymm_v3, ymm_v3, 0x39)
SUB(reg_rounds, 2)
JNZ(rounds_loop2.begin)
ADD_avx2(ymm_v0, ymm_s0)
ADD_avx2(ymm_v1, ymm_s1)
ADD_avx2(ymm_v2, ymm_s2)
ADD_avx2(ymm_v3, ymm_s3)
# XOR_WRITE(out+ 0, in+ 0, _mm256_permute2x128_si256(v0,v1,0x20));
VPERM2I128(ymm_tmp0, ymm_v0, ymm_v1, 0x20)
VPXOR(ymm_tmp0, ymm_tmp0, [reg_inp])
VMOVDQU([reg_outp], ymm_tmp0)
# XOR_WRITE(out+32, in+32, _mm256_permute2x128_si256(v2,v3,0x20));
VPERM2I128(ymm_tmp0, ymm_v2, ymm_v3, 0x20)
VPXOR(ymm_tmp0, ymm_tmp0, [reg_inp+32])
VMOVDQU([reg_outp+32], ymm_tmp0)
SUB(reg_blocks, 1)
JZ(out_write_odd)
ADD_avx2(ymm_s3, ymm_inc)
# XOR_WRITE(out+64, in+64, _mm256_permute2x128_si256(v0,v1,0x31));
VPERM2I128(ymm_tmp0, ymm_v0, ymm_v1, 0x31)
VPXOR(ymm_tmp0, ymm_tmp0, [reg_inp+64])
VMOVDQU([reg_outp+64], ymm_tmp0)
# XOR_WRITE(out+96, in+96, _mm256_permute2x128_si256(v2,v3,0x31));
VPERM2I128(ymm_tmp0, ymm_v2, ymm_v3, 0x31)
VPXOR(ymm_tmp0, ymm_tmp0, [reg_inp+96])
VMOVDQU([reg_outp+96], ymm_tmp0)
SUB(reg_blocks, 1)
JZ(out_write_even)
ADD(reg_inp, 2 * 64)
ADD(reg_outp, 2 * 64)
JMP(vector_loop2.begin)
LABEL(out_write_odd)
VPERM2I128(ymm_s3, ymm_s3, ymm_s3, 0x01) # Odd number of blocks.
LABEL(out_write_even)
VMOVDQA(x_s3, ymm_s3.as_xmm) # Write back ymm_s3 to x_s3
# Paranoia, cleanse the scratch space.
VPXOR(ymm_v0, ymm_v0, ymm_v0)
VMOVDQA(mem_tmp0, ymm_v0)
VMOVDQA(mem_s3, ymm_v0)
# Remove our stack allocation.
MOV(registers.rsp, reg_sp_save)
RETURN()
#
# CPUID
#
cpuidParams = Argument(ptr(uint32_t))
with Function("cpuidAmd64", (cpuidParams,)):
reg_params = registers.r15
LOAD.ARGUMENT(reg_params, cpuidParams)
MOV(registers.eax, [reg_params])
MOV(registers.ecx, [reg_params+4])
CPUID()
MOV([reg_params], registers.eax)
MOV([reg_params+4], registers.ebx)
MOV([reg_params+8], registers.ecx)
MOV([reg_params+12], registers.edx)
RETURN()
#
# XGETBV (ECX = 0)
#
xcrVec = Argument(ptr(uint32_t))
with Function("xgetbv0Amd64", (xcrVec,)):
reg_vec = GeneralPurposeRegister64()
LOAD.ARGUMENT(reg_vec, xcrVec)
XOR(registers.ecx, registers.ecx)
XGETBV()
MOV([reg_vec], registers.eax)
MOV([reg_vec+4], registers.edx)
RETURN()
// Generated by PeachPy 0.2.0 from chacha20_amd64.py
// func blocksAmd64SSE2(x *uint32, inp *uint8, outp *uint8, nrBlocks *uint)
TEXT ·blocksAmd64SSE2(SB),4,$0-32
MOVQ x+0(FP), AX
MOVQ inp+8(FP), BX
MOVQ outp+16(FP), CX
MOVQ nrBlocks+24(FP), DX
MOVQ SP, DI
MOVQ $31, SI
NOTQ SI
ANDQ SI, SP
SUBQ $32, SP
PXOR X0, X0
SUBQ $32, SP
MOVO X0, 0(SP)
MOVL $1, SI
MOVL SI, 0(SP)
SUBQ $4, DX
JCS vector_loop4_end
vector_loop4_begin:
MOVO 0(AX), X0
MOVO 16(AX), X1
MOVO 32(AX), X2
MOVO 48(AX), X3
MOVO X0, X4
MOVO X1, X5
MOVO X2, X6
MOVO X3, X7
PADDQ 0(SP), X7
MOVO X0, X8
MOVO X1, X9
MOVO X2, X10
MOVO X7, X11
PADDQ 0(SP), X11
MOVO X0, X12
MOVO X1, X13
MOVO X2, X14
MOVO X11, X15
PADDQ 0(SP), X15
MOVQ $20, SI
rounds_loop4_begin:
PADDL X1, X0
PADDL X5, X4
PADDL X9, X8
PADDL X13, X12
PXOR X0, X3
PXOR X4, X7
PXOR X8, X11
PXOR X12, X15
MOVO X12, 16(SP)
MOVO X3, X12
PSLLL $16, X12
PSRLL $16, X3
PXOR X12, X3
MOVO X7, X12
PSLLL $16, X12
PSRLL $16, X7
PXOR X12, X7
MOVO X11, X12
PSLLL $16, X12
PSRLL $16, X11
PXOR X12, X11
MOVO X15, X12
PSLLL $16, X12
PSRLL $16, X15
PXOR X12, X15
PADDL X3, X2
PADDL X7, X6
PADDL X11, X10
PADDL X15, X14
PXOR X2, X1
PXOR X6, X5
PXOR X10, X9
PXOR X14, X13
MOVO X1, X12
PSLLL $12, X12
PSRLL $20, X1
PXOR X12, X1
MOVO X5, X12
PSLLL $12, X12
PSRLL $20, X5
PXOR X12, X5
MOVO X9, X12
PSLLL $12, X12
PSRLL $20, X9
PXOR X12, X9
MOVO X13, X12
PSLLL $12, X12
PSRLL $20, X13
PXOR X12, X13
MOVO 16(SP), X12
PADDL X1, X0
PADDL X5, X4
PADDL X9, X8
PADDL X13, X12
PXOR X0, X3
PXOR X4, X7
PXOR X8, X11
PXOR X12, X15
MOVO X12, 16(SP)
MOVO X3, X12
PSLLL $8, X12
PSRLL $24, X3
PXOR X12, X3
MOVO X7, X12
PSLLL $8, X12
PSRLL $24, X7
PXOR X12, X7
MOVO X11, X12
PSLLL $8, X12
PSRLL $24, X11
PXOR X12, X11
MOVO X15, X12
PSLLL $8, X12
PSRLL $24, X15
PXOR X12, X15
PADDL X3, X2
PADDL X7, X6
PADDL X11, X10
PADDL X15, X14
PXOR X2, X1
PXOR X6, X5
PXOR X10, X9
PXOR X14, X13
MOVO X1, X12
PSLLL $7, X12
PSRLL $25, X1
PXOR X12, X1
MOVO X5, X12
PSLLL $7, X12
PSRLL $25, X5
PXOR X12, X5
MOVO X9, X12
PSLLL $7, X12
PSRLL $25, X9
PXOR X12, X9
MOVO X13, X12
PSLLL $7, X12
PSRLL $25, X13
PXOR X12, X13
PSHUFL $57, X1, X1
PSHUFL $57, X5, X5
PSHUFL $57, X9, X9
PSHUFL $57, X13, X13
PSHUFL $78, X2, X2
PSHUFL $78, X6, X6
PSHUFL $78, X10, X10
PSHUFL $78, X14, X14
PSHUFL $147, X3, X3
PSHUFL $147, X7, X7
PSHUFL $147, X11, X11
PSHUFL $147, X15, X15
MOVO 16(SP), X12
PADDL X1, X0
PADDL X5, X4
PADDL X9, X8
PADDL X13, X12
PXOR X0, X3
PXOR X4, X7
PXOR X8, X11
PXOR X12, X15
MOVO X12, 16(SP)
MOVO X3, X12
PSLLL $16, X12
PSRLL $16, X3
PXOR X12, X3
MOVO X7, X12
PSLLL $16, X12
PSRLL $16, X7
PXOR X12, X7
MOVO X11, X12
PSLLL $16, X12
PSRLL $16, X11
PXOR X12, X11
MOVO X15, X12
PSLLL $16, X12
PSRLL $16, X15
PXOR X12, X15
PADDL X3, X2
PADDL X7, X6
PADDL X11, X10
PADDL X15, X14
PXOR X2, X1
PXOR X6, X5
PXOR X10, X9
PXOR X14, X13
MOVO X1, X12
PSLLL $12, X12
PSRLL $20, X1
PXOR X12, X1
MOVO X5, X12
PSLLL $12, X12
PSRLL $20, X5
PXOR X12, X5
MOVO X9, X12
PSLLL $12, X12
PSRLL $20, X9
PXOR X12, X9
MOVO X13, X12
PSLLL $12, X12
PSRLL $20, X13
PXOR X12, X13
MOVO 16(SP), X12
PADDL X1, X0
PADDL X5, X4
PADDL X9, X8
PADDL X13, X12
PXOR X0, X3
PXOR X4, X7
PXOR X8, X11
PXOR X12, X15
MOVO X12, 16(SP)
MOVO X3, X12
PSLLL $8, X12
PSRLL $24, X3
PXOR X12, X3
MOVO X7, X12
PSLLL $8, X12
PSRLL $24, X7
PXOR X12, X7
MOVO X11, X12
PSLLL $8, X12
PSRLL $24, X11
PXOR X12, X11
MOVO X15, X12
PSLLL $8, X12
PSRLL $24, X15
PXOR X12, X15
PADDL X3, X2
PADDL X7, X6
PADDL X11, X10
PADDL X15, X14
PXOR X2, X1
PXOR X6, X5
PXOR X10, X9
PXOR X14, X13
MOVO X1, X12
PSLLL $7, X12
PSRLL $25, X1
PXOR X12, X1
MOVO X5, X12
PSLLL $7, X12
PSRLL $25, X5
PXOR X12, X5
MOVO X9, X12
PSLLL $7, X12
PSRLL $25, X9
PXOR X12, X9
MOVO X13, X12
PSLLL $7, X12
PSRLL $25, X13
PXOR X12, X13
PSHUFL $147, X1, X1
PSHUFL $147, X5, X5
PSHUFL $147, X9, X9
PSHUFL $147, X13, X13
PSHUFL $78, X2, X2
PSHUFL $78, X6, X6
PSHUFL $78, X10, X10
PSHUFL $78, X14, X14
PSHUFL $57, X3, X3
PSHUFL $57, X7, X7
PSHUFL $57, X11, X11
PSHUFL $57, X15, X15
MOVO 16(SP), X12
SUBQ $2, SI
JNE rounds_loop4_begin
MOVO X12, 16(SP)
PADDL 0(AX), X0
PADDL 16(AX), X1
PADDL 32(AX), X2
PADDL 48(AX), X3
MOVOU 0(BX), X12
PXOR X0, X12
MOVOU X12, 0(CX)
MOVOU 16(BX), X12
PXOR X1, X12
MOVOU X12, 16(CX)
MOVOU 32(BX), X12
PXOR X2, X12
MOVOU X12, 32(CX)
MOVOU 48(BX), X12
PXOR X3, X12
MOVOU X12, 48(CX)
MOVO 48(AX), X3
PADDQ 0(SP), X3
PADDL 0(AX), X4
PADDL 16(AX), X5
PADDL 32(AX), X6
PADDL X3, X7
MOVOU 64(BX), X12
PXOR X4, X12
MOVOU X12, 64(CX)
MOVOU 80(BX), X12
PXOR X5, X12
MOVOU X12, 80(CX)
MOVOU 96(BX), X12
PXOR X6, X12
MOVOU X12, 96(CX)
MOVOU 112(BX), X12
PXOR X7, X12
MOVOU X12, 112(CX)
PADDQ 0(SP), X3
PADDL 0(AX), X8
PADDL 16(AX), X9
PADDL 32(AX), X10
PADDL X3, X11
MOVOU 128(BX), X12
PXOR X8, X12
MOVOU X12, 128(CX)
MOVOU 144(BX), X12
PXOR X9, X12
MOVOU X12, 144(CX)
MOVOU 160(BX), X12
PXOR X10, X12
MOVOU X12, 160(CX)
MOVOU 176(BX), X12
PXOR X11, X12
MOVOU X12, 176(CX)
PADDQ 0(SP), X3
MOVO 16(SP), X12
PADDL 0(AX), X12
PADDL 16(AX), X13
PADDL 32(AX), X14
PADDL X3, X15
MOVOU 192(BX), X0
PXOR X12, X0
MOVOU X0, 192(CX)
MOVOU 208(BX), X0
PXOR X13, X0
MOVOU X0, 208(CX)
MOVOU 224(BX), X0
PXOR X14, X0
MOVOU X0, 224(CX)
MOVOU 240(BX), X0
PXOR X15, X0
MOVOU X0, 240(CX)
PADDQ 0(SP), X3
MOVO X3, 48(AX)
ADDQ $256, BX
ADDQ $256, CX
SUBQ $4, DX
JCC vector_loop4_begin
vector_loop4_end:
ADDQ $4, DX
JEQ out
MOVO 0(AX), X8
MOVO 16(AX), X9
MOVO 32(AX), X10
MOVO 48(AX), X11
MOVO 0(SP), X13
SUBQ $2, DX
JCS vector_loop2_end
vector_loop2_begin:
MOVO X8, X0
MOVO X9, X1
MOVO X10, X2
MOVO X11, X3
MOVO X0, X4
MOVO X1, X5
MOVO X2, X6
MOVO X3, X7
PADDQ X13, X7
MOVQ $20, SI
rounds_loop2_begin:
PADDL X1, X0
PADDL X5, X4
PXOR X0, X3
PXOR X4, X7
MOVO X3, X12
PSLLL $16, X12
PSRLL $16, X3
PXOR X12, X3
MOVO X7, X12
PSLLL $16, X12
PSRLL $16, X7
PXOR X12, X7
PADDL X3, X2
PADDL X7, X6
PXOR X2, X1
PXOR X6, X5
MOVO X1, X12
PSLLL $12, X12
PSRLL $20, X1
PXOR X12, X1
MOVO X5, X12
PSLLL $12, X12
PSRLL $20, X5
PXOR X12, X5
PADDL X1, X0
PADDL X5, X4
PXOR X0, X3
PXOR X4, X7
MOVO X3, X12
PSLLL $8, X12
PSRLL $24, X3
PXOR X12, X3
MOVO X7, X12
PSLLL $8, X12
PSRLL $24, X7
PXOR X12, X7
PADDL X3, X2
PADDL X7, X6
PXOR X2, X1
PXOR X6, X5
MOVO X1, X12
PSLLL $7, X12
PSRLL $25, X1
PXOR X12, X1
MOVO X5, X12
PSLLL $7, X12
PSRLL $25, X5
PXOR X12, X5
PSHUFL $57, X1, X1
PSHUFL $57, X5, X5
PSHUFL $78, X2, X2
PSHUFL $78, X6, X6
PSHUFL $147, X3, X3
PSHUFL $147, X7, X7
PADDL X1, X0
PADDL X5, X4
PXOR X0, X3
PXOR X4, X7
MOVO X3, X12
PSLLL $16, X12
PSRLL $16, X3
PXOR X12, X3
MOVO X7, X12
PSLLL $16, X12
PSRLL $16, X7
PXOR X12, X7
PADDL X3, X2
PADDL X7, X6
PXOR X2, X1
PXOR X6, X5
MOVO X1, X12
PSLLL $12, X12
PSRLL $20, X1
PXOR X12, X1
MOVO X5, X12
PSLLL $12, X12
PSRLL $20, X5
PXOR X12, X5
PADDL X1, X0
PADDL X5, X4
PXOR X0, X3
PXOR X4, X7
MOVO X3, X12
PSLLL $8, X12
PSRLL $24, X3
PXOR X12, X3
MOVO X7, X12
PSLLL $8, X12
PSRLL $24, X7
PXOR X12, X7
PADDL X3, X2
PADDL X7, X6
PXOR X2, X1
PXOR X6, X5
MOVO X1, X12
PSLLL $7, X12
PSRLL $25, X1
PXOR X12, X1
MOVO X5, X12
PSLLL $7, X12
PSRLL $25, X5
PXOR X12, X5
PSHUFL $147, X1, X1
PSHUFL $147, X5, X5
PSHUFL $78, X2, X2
PSHUFL $78, X6, X6
PSHUFL $57, X3, X3
PSHUFL $57, X7, X7
SUBQ $2, SI
JNE rounds_loop2_begin
PADDL X8, X0
PADDL X9, X1
PADDL X10, X2
PADDL X11, X3
MOVOU 0(BX), X12
PXOR X0, X12
MOVOU X12, 0(CX)
MOVOU 16(BX), X12
PXOR X1, X12
MOVOU X12, 16(CX)
MOVOU 32(BX), X12
PXOR X2, X12
MOVOU X12, 32(CX)
MOVOU 48(BX), X12
PXOR X3, X12
MOVOU X12, 48(CX)
PADDQ X13, X11
PADDL X8, X4
PADDL X9, X5
PADDL X10, X6
PADDL X11, X7
MOVOU 64(BX), X12
PXOR X4, X12
MOVOU X12, 64(CX)
MOVOU 80(BX), X12
PXOR X5, X12
MOVOU X12, 80(CX)
MOVOU 96(BX), X12
PXOR X6, X12
MOVOU X12, 96(CX)
MOVOU 112(BX), X12
PXOR X7, X12
MOVOU X12, 112(CX)
PADDQ X13, X11
ADDQ $128, BX
ADDQ $128, CX
SUBQ $2, DX
JCC vector_loop2_begin
vector_loop2_end:
ADDQ $2, DX
JEQ out_serial
MOVO X8, X0
MOVO X9, X1
MOVO X10, X2
MOVO X11, X3
MOVQ $20, DX
rounds_loop1_begin:
PADDL X1, X0
PXOR X0, X3
MOVO X3, X12
PSLLL $16, X12
PSRLL $16, X3
PXOR X12, X3
PADDL X3, X2
PXOR X2, X1
MOVO X1, X12
PSLLL $12, X12
PSRLL $20, X1
PXOR X12, X1
PADDL X1, X0
PXOR X0, X3
MOVO X3, X12
PSLLL $8, X12
PSRLL $24, X3
PXOR X12, X3
PADDL X3, X2
PXOR X2, X1
MOVO X1, X12
PSLLL $7, X12
PSRLL $25, X1
PXOR X12, X1
PSHUFL $57, X1, X1
PSHUFL $78, X2, X2
PSHUFL $147, X3, X3
PADDL X1, X0
PXOR X0, X3
MOVO X3, X12
PSLLL $16, X12
PSRLL $16, X3
PXOR X12, X3
PADDL X3, X2
PXOR X2, X1
MOVO X1, X12
PSLLL $12, X12
PSRLL $20, X1
PXOR X12, X1
PADDL X1, X0
PXOR X0, X3
MOVO X3, X12
PSLLL $8, X12
PSRLL $24, X3
PXOR X12, X3
PADDL X3, X2
PXOR X2, X1
MOVO X1, X12
PSLLL $7, X12
PSRLL $25, X1
PXOR X12, X1
PSHUFL $147, X1, X1
PSHUFL $78, X2, X2
PSHUFL $57, X3, X3
SUBQ $2, DX
JNE rounds_loop1_begin
PADDL X8, X0
PADDL X9, X1
PADDL X10, X2
PADDL X11, X3
MOVOU 0(BX), X12
PXOR X0, X12
MOVOU X12, 0(CX)
MOVOU 16(BX), X12
PXOR X1, X12
MOVOU X12, 16(CX)
MOVOU 32(BX), X12
PXOR X2, X12
MOVOU X12, 32(CX)
MOVOU 48(BX), X12
PXOR X3, X12
MOVOU X12, 48(CX)
PADDQ X13, X11
out_serial:
MOVO X11, 48(AX)
out:
PXOR X0, X0
MOVO X0, 16(SP)
MOVQ DI, SP
RET
// func blocksAmd64AVX2(x *uint32, inp *uint8, outp *uint8, nrBlocks *uint)
TEXT ·blocksAmd64AVX2(SB),4,$0-32
MOVQ x+0(FP), AX
MOVQ inp+8(FP), BX
MOVQ outp+16(FP), CX
MOVQ nrBlocks+24(FP), DX
MOVQ SP, DI
MOVQ $31, SI
NOTQ SI
ANDQ SI, SP
SUBQ $32, SP
SUBQ $96, SP
BYTE $0xC4; BYTE $0x41; BYTE $0x1D; BYTE $0xEF; BYTE $0xE4 // VPXOR ymm12, ymm12, ymm12
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x24; BYTE $0x24 // VMOVDQU [rsp], ymm12
MOVL $1, SI
MOVL SI, 16(SP)
BYTE $0xC4; BYTE $0xE2; BYTE $0x7D; BYTE $0x5A; BYTE $0x58; BYTE $0x30 // VBROADCASTI128 ymm3, [rax + 48]
BYTE $0xC5; BYTE $0xE5; BYTE $0xD4; BYTE $0x1C; BYTE $0x24 // VPADDQ ymm3, ymm3, [rsp]
BYTE $0xC5; BYTE $0xFD; BYTE $0x7F; BYTE $0x5C; BYTE $0x24; BYTE $0x20 // VMOVDQA [rsp + 32], ymm3
MOVL $2, SI
MOVL SI, 0(SP)
MOVL SI, 16(SP)
SUBQ $8, DX
JCS vector_loop8_end
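// 8 blocks per iteration: four block pairs in flight, each ymm holding one
// state row for two consecutive blocks; 512 bytes in/out per pass.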
vector_loop8_begin:
BYTE $0xC4; BYTE $0xE2; BYTE $0x7D; BYTE $0x5A; BYTE $0x00 // VBROADCASTI128 ymm0, [rax]
BYTE $0xC4; BYTE $0xE2; BYTE $0x7D; BYTE $0x5A; BYTE $0x48; BYTE $0x10 // VBROADCASTI128 ymm1, [rax + 16]
BYTE $0xC4; BYTE $0xE2; BYTE $0x7D; BYTE $0x5A; BYTE $0x50; BYTE $0x20 // VBROADCASTI128 ymm2, [rax + 32]
BYTE $0xC5; BYTE $0xFD; BYTE $0x6F; BYTE $0x5C; BYTE $0x24; BYTE $0x20 // VMOVDQA ymm3, [rsp + 32]
BYTE $0xC5; BYTE $0xFD; BYTE $0x6F; BYTE $0xE0 // VMOVDQA ymm4, ymm0
BYTE $0xC5; BYTE $0xFD; BYTE $0x6F; BYTE $0xE9 // VMOVDQA ymm5, ymm1
BYTE $0xC5; BYTE $0xFD; BYTE $0x6F; BYTE $0xF2 // VMOVDQA ymm6, ymm2
BYTE $0xC5; BYTE $0xE5; BYTE $0xD4; BYTE $0x3C; BYTE $0x24 // VPADDQ ymm7, ymm3, [rsp]
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0xC0 // VMOVDQA ymm8, ymm0
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0xC9 // VMOVDQA ymm9, ymm1
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0xD2 // VMOVDQA ymm10, ymm2
BYTE $0xC5; BYTE $0x45; BYTE $0xD4; BYTE $0x1C; BYTE $0x24 // VPADDQ ymm11, ymm7, [rsp]
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0xE0 // VMOVDQA ymm12, ymm0
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0xE9 // VMOVDQA ymm13, ymm1
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0xF2 // VMOVDQA ymm14, ymm2
BYTE $0xC5; BYTE $0x25; BYTE $0xD4; BYTE $0x3C; BYTE $0x24 // VPADDQ ymm15, ymm11, [rsp]
MOVQ $20, SI
rounds_loop8_begin:
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE $0xC4; BYTE $0x41; BYTE $0x3D; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm8, ymm8, ymm9
BYTE $0xC4; BYTE $0x41; BYTE $0x1D; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm12, ymm12, ymm13
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm11, ymm11, ymm8
BYTE $0xC4; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm7, 16
BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm7, ymm7, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm11, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x25; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm11, ymm11, 16
BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xDC // VPXOR ymm11, ymm11, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm15, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x05; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm15, ymm15, 16
BYTE $0xC4; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE $0xC4; BYTE $0x41; BYTE $0x2D; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm10, ymm10, ymm11
BYTE $0xC4; BYTE $0x41; BYTE $0x0D; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm14, ymm14, ymm15
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE $0xC4; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCA // VPXOR ymm9, ymm9, ymm10
BYTE $0xC4; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEE // VPXOR ymm13, ymm13, ymm14
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm5, 12
BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm5, ymm5, 20
BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm9, 12
BYTE $0xC4; BYTE $0xC1; BYTE $0x35; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm9, ymm9, 20
BYTE $0xC4; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCC // VPXOR ymm9, ymm9, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm13, 12
BYTE $0xC4; BYTE $0xC1; BYTE $0x15; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm13, ymm13, 20
BYTE $0xC4; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEC // VPXOR ymm13, ymm13, ymm12
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA ymm12, [rsp + 64]
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE $0xC4; BYTE $0x41; BYTE $0x3D; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm8, ymm8, ymm9
BYTE $0xC4; BYTE $0x41; BYTE $0x1D; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm12, ymm12, ymm13
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm11, ymm11, ymm8
BYTE $0xC4; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm7, 8
BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm7, ymm7, 24
BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm11, 8
BYTE $0xC4; BYTE $0xC1; BYTE $0x25; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm11, ymm11, 24
BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xDC // VPXOR ymm11, ymm11, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm15, 8
BYTE $0xC4; BYTE $0xC1; BYTE $0x05; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm15, ymm15, 24
BYTE $0xC4; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE $0xC4; BYTE $0x41; BYTE $0x2D; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm10, ymm10, ymm11
BYTE $0xC4; BYTE $0x41; BYTE $0x0D; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm14, ymm14, ymm15
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE $0xC4; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCA // VPXOR ymm9, ymm9, ymm10
BYTE $0xC4; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEE // VPXOR ymm13, ymm13, ymm14
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm5, 7
BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm5, ymm5, 25
BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm9, 7
BYTE $0xC4; BYTE $0xC1; BYTE $0x35; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm9, ymm9, 25
BYTE $0xC4; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCC // VPXOR ymm9, ymm9, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm13, 7
BYTE $0xC4; BYTE $0xC1; BYTE $0x15; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm13, ymm13, 25
BYTE $0xC4; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEC // VPXOR ymm13, ymm13, ymm12
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x39 // VPSHUFD ymm1, ymm1, 57
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xED; BYTE $0x39 // VPSHUFD ymm5, ymm5, 57
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xC9; BYTE $0x39 // VPSHUFD ymm9, ymm9, 57
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xED; BYTE $0x39 // VPSHUFD ymm13, ymm13, 57
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm6, ymm6, 78
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm10, ymm10, 78
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm14, ymm14, 78
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x93 // VPSHUFD ymm3, ymm3, 147
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xFF; BYTE $0x93 // VPSHUFD ymm7, ymm7, 147
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xDB; BYTE $0x93 // VPSHUFD ymm11, ymm11, 147
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xFF; BYTE $0x93 // VPSHUFD ymm15, ymm15, 147
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA ymm12, [rsp + 64]
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE $0xC4; BYTE $0x41; BYTE $0x3D; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm8, ymm8, ymm9
BYTE $0xC4; BYTE $0x41; BYTE $0x1D; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm12, ymm12, ymm13
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm11, ymm11, ymm8
BYTE $0xC4; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm7, 16
BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm7, ymm7, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm11, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x25; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm11, ymm11, 16
BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xDC // VPXOR ymm11, ymm11, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm15, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x05; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm15, ymm15, 16
BYTE $0xC4; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE $0xC4; BYTE $0x41; BYTE $0x2D; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm10, ymm10, ymm11
BYTE $0xC4; BYTE $0x41; BYTE $0x0D; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm14, ymm14, ymm15
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE $0xC4; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCA // VPXOR ymm9, ymm9, ymm10
BYTE $0xC4; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEE // VPXOR ymm13, ymm13, ymm14
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm5, 12
BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm5, ymm5, 20
BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm9, 12
BYTE $0xC4; BYTE $0xC1; BYTE $0x35; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm9, ymm9, 20
BYTE $0xC4; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCC // VPXOR ymm9, ymm9, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm13, 12
BYTE $0xC4; BYTE $0xC1; BYTE $0x15; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm13, ymm13, 20
BYTE $0xC4; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEC // VPXOR ymm13, ymm13, ymm12
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA ymm12, [rsp + 64]
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE $0xC4; BYTE $0x41; BYTE $0x3D; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm8, ymm8, ymm9
BYTE $0xC4; BYTE $0x41; BYTE $0x1D; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm12, ymm12, ymm13
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm11, ymm11, ymm8
BYTE $0xC4; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm7, 8
BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm7, ymm7, 24
BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm11, 8
BYTE $0xC4; BYTE $0xC1; BYTE $0x25; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm11, ymm11, 24
BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xDC // VPXOR ymm11, ymm11, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm15, 8
BYTE $0xC4; BYTE $0xC1; BYTE $0x05; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm15, ymm15, 24
BYTE $0xC4; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE $0xC4; BYTE $0x41; BYTE $0x2D; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm10, ymm10, ymm11
BYTE $0xC4; BYTE $0x41; BYTE $0x0D; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm14, ymm14, ymm15
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE $0xC4; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCA // VPXOR ymm9, ymm9, ymm10
BYTE $0xC4; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEE // VPXOR ymm13, ymm13, ymm14
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm5, 7
BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm5, ymm5, 25
BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm9, 7
BYTE $0xC4; BYTE $0xC1; BYTE $0x35; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm9, ymm9, 25
BYTE $0xC4; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCC // VPXOR ymm9, ymm9, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm13, 7
BYTE $0xC4; BYTE $0xC1; BYTE $0x15; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm13, ymm13, 25
BYTE $0xC4; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEC // VPXOR ymm13, ymm13, ymm12
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x93 // VPSHUFD ymm1, ymm1, 147
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xED; BYTE $0x93 // VPSHUFD ymm5, ymm5, 147
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xC9; BYTE $0x93 // VPSHUFD ymm9, ymm9, 147
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xED; BYTE $0x93 // VPSHUFD ymm13, ymm13, 147
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm6, ymm6, 78
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm10, ymm10, 78
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm14, ymm14, 78
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x39 // VPSHUFD ymm3, ymm3, 57
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xFF; BYTE $0x39 // VPSHUFD ymm7, ymm7, 57
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xDB; BYTE $0x39 // VPSHUFD ymm11, ymm11, 57
BYTE $0xC4; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xFF; BYTE $0x39 // VPSHUFD ymm15, ymm15, 57
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA ymm12, [rsp + 64]
SUBQ $2, SI
JNE rounds_loop8_begin
BYTE $0xC4; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x20 // VBROADCASTI128 ymm12, [rax]
BYTE $0xC4; BYTE $0xC1; BYTE $0x7D; BYTE $0xFE; BYTE $0xC4 // VPADDD ymm0, ymm0, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x5D; BYTE $0xFE; BYTE $0xE4 // VPADDD ymm4, ymm4, ymm12
BYTE $0xC4; BYTE $0x41; BYTE $0x3D; BYTE $0xFE; BYTE $0xC4 // VPADDD ymm8, ymm8, ymm12
BYTE $0xC5; BYTE $0x1D; BYTE $0xFE; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VPADDD ymm12, ymm12, [rsp + 64]
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm12
BYTE $0xC4; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x60; BYTE $0x10 // VBROADCASTI128 ymm12, [rax + 16]
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xFE; BYTE $0xCC // VPADDD ymm1, ymm1, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xFE; BYTE $0xEC // VPADDD ymm5, ymm5, ymm12
BYTE $0xC4; BYTE $0x41; BYTE $0x35; BYTE $0xFE; BYTE $0xCC // VPADDD ymm9, ymm9, ymm12
BYTE $0xC4; BYTE $0x41; BYTE $0x15; BYTE $0xFE; BYTE $0xEC // VPADDD ymm13, ymm13, ymm12
BYTE $0xC4; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x60; BYTE $0x20 // VBROADCASTI128 ymm12, [rax + 32]
BYTE $0xC4; BYTE $0xC1; BYTE $0x6D; BYTE $0xFE; BYTE $0xD4 // VPADDD ymm2, ymm2, ymm12
BYTE $0xC4; BYTE $0xC1; BYTE $0x4D; BYTE $0xFE; BYTE $0xF4 // VPADDD ymm6, ymm6, ymm12
BYTE $0xC4; BYTE $0x41; BYTE $0x2D; BYTE $0xFE; BYTE $0xD4 // VPADDD ymm10, ymm10, ymm12
BYTE $0xC4; BYTE $0x41; BYTE $0x0D; BYTE $0xFE; BYTE $0xF4 // VPADDD ymm14, ymm14, ymm12
BYTE $0xC5; BYTE $0xE5; BYTE $0xFE; BYTE $0x5C; BYTE $0x24; BYTE $0x20 // VPADDD ymm3, ymm3, [rsp + 32]
BYTE $0xC4; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x20 // VPERM2I128 ymm12, ymm0, ymm1, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x23 // VPXOR ymm12, ymm12, [rbx]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x21 // VMOVDQU [rcx], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x20 // VPERM2I128 ymm12, ymm2, ymm3, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x20 // VPXOR ymm12, ymm12, [rbx + 32]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x20 // VMOVDQU [rcx + 32], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x31 // VPERM2I128 ymm12, ymm0, ymm1, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x40 // VPXOR ymm12, ymm12, [rbx + 64]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x40 // VMOVDQU [rcx + 64], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x31 // VPERM2I128 ymm12, ymm2, ymm3, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x60 // VPXOR ymm12, ymm12, [rbx + 96]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x60 // VMOVDQU [rcx + 96], ymm12
BYTE $0xC5; BYTE $0xFD; BYTE $0x6F; BYTE $0x5C; BYTE $0x24; BYTE $0x20 // VMOVDQA ymm3, [rsp + 32]
BYTE $0xC5; BYTE $0xE5; BYTE $0xFE; BYTE $0x1C; BYTE $0x24 // VPADDD ymm3, ymm3, [rsp]
BYTE $0xC5; BYTE $0xC5; BYTE $0xFE; BYTE $0xFB // VPADDD ymm7, ymm7, ymm3
BYTE $0xC4; BYTE $0x63; BYTE $0x5D; BYTE $0x46; BYTE $0xE5; BYTE $0x20 // VPERM2I128 ymm12, ymm4, ymm5, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x80; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 128]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x80; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 128], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x4D; BYTE $0x46; BYTE $0xE7; BYTE $0x20 // VPERM2I128 ymm12, ymm6, ymm7, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xA0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 160]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xA0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 160], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x5D; BYTE $0x46; BYTE $0xE5; BYTE $0x31 // VPERM2I128 ymm12, ymm4, ymm5, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xC0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 192]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xC0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 192], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x4D; BYTE $0x46; BYTE $0xE7; BYTE $0x31 // VPERM2I128 ymm12, ymm6, ymm7, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xE0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 224]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xE0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 224], ymm12
BYTE $0xC5; BYTE $0xE5; BYTE $0xFE; BYTE $0x1C; BYTE $0x24 // VPADDD ymm3, ymm3, [rsp]
BYTE $0xC5; BYTE $0x25; BYTE $0xFE; BYTE $0xDB // VPADDD ymm11, ymm11, ymm3
BYTE $0xC4; BYTE $0x43; BYTE $0x3D; BYTE $0x46; BYTE $0xE1; BYTE $0x20 // VPERM2I128 ymm12, ymm8, ymm9, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x00; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 256]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x00; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 256], ymm12
BYTE $0xC4; BYTE $0x43; BYTE $0x2D; BYTE $0x46; BYTE $0xE3; BYTE $0x20 // VPERM2I128 ymm12, ymm10, ymm11, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x20; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 288]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x20; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 288], ymm12
BYTE $0xC4; BYTE $0x43; BYTE $0x3D; BYTE $0x46; BYTE $0xE1; BYTE $0x31 // VPERM2I128 ymm12, ymm8, ymm9, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x40; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 320]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x40; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 320], ymm12
BYTE $0xC4; BYTE $0x43; BYTE $0x2D; BYTE $0x46; BYTE $0xE3; BYTE $0x31 // VPERM2I128 ymm12, ymm10, ymm11, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x60; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 352]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x60; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 352], ymm12
BYTE $0xC5; BYTE $0xE5; BYTE $0xFE; BYTE $0x1C; BYTE $0x24 // VPADDD ymm3, ymm3, [rsp]
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA ymm12, [rsp + 64]
BYTE $0xC5; BYTE $0x05; BYTE $0xFE; BYTE $0xFB // VPADDD ymm15, ymm15, ymm3
BYTE $0xC4; BYTE $0xC3; BYTE $0x1D; BYTE $0x46; BYTE $0xC5; BYTE $0x20 // VPERM2I128 ymm0, ymm12, ymm13, 32
BYTE $0xC5; BYTE $0xFD; BYTE $0xEF; BYTE $0x83; BYTE $0x80; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm0, ymm0, [rbx + 384]
BYTE $0xC5; BYTE $0xFE; BYTE $0x7F; BYTE $0x81; BYTE $0x80; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 384], ymm0
BYTE $0xC4; BYTE $0xC3; BYTE $0x0D; BYTE $0x46; BYTE $0xC7; BYTE $0x20 // VPERM2I128 ymm0, ymm14, ymm15, 32
BYTE $0xC5; BYTE $0xFD; BYTE $0xEF; BYTE $0x83; BYTE $0xA0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm0, ymm0, [rbx + 416]
BYTE $0xC5; BYTE $0xFE; BYTE $0x7F; BYTE $0x81; BYTE $0xA0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 416], ymm0
BYTE $0xC4; BYTE $0xC3; BYTE $0x1D; BYTE $0x46; BYTE $0xC5; BYTE $0x31 // VPERM2I128 ymm0, ymm12, ymm13, 49
BYTE $0xC5; BYTE $0xFD; BYTE $0xEF; BYTE $0x83; BYTE $0xC0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm0, ymm0, [rbx + 448]
BYTE $0xC5; BYTE $0xFE; BYTE $0x7F; BYTE $0x81; BYTE $0xC0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 448], ymm0
BYTE $0xC4; BYTE $0xC3; BYTE $0x0D; BYTE $0x46; BYTE $0xC7; BYTE $0x31 // VPERM2I128 ymm0, ymm14, ymm15, 49
BYTE $0xC5; BYTE $0xFD; BYTE $0xEF; BYTE $0x83; BYTE $0xE0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm0, ymm0, [rbx + 480]
BYTE $0xC5; BYTE $0xFE; BYTE $0x7F; BYTE $0x81; BYTE $0xE0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 480], ymm0
BYTE $0xC5; BYTE $0xE5; BYTE $0xFE; BYTE $0x1C; BYTE $0x24 // VPADDD ymm3, ymm3, [rsp]
BYTE $0xC5; BYTE $0xFD; BYTE $0x7F; BYTE $0x5C; BYTE $0x24; BYTE $0x20 // VMOVDQA [rsp + 32], ymm3
ADDQ $512, BX
ADDQ $512, CX
SUBQ $8, DX
JCC vector_loop8_begin
vector_loop8_end:
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0xDB // VMOVDQA ymm11, ymm3
ADDQ $8, DX
JEQ out_write_even
BYTE $0xC4; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x00 // VBROADCASTI128 ymm8, [rax]
BYTE $0xC4; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x48; BYTE $0x10 // VBROADCASTI128 ymm9, [rax + 16]
BYTE $0xC4; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x50; BYTE $0x20 // VBROADCASTI128 ymm10, [rax + 32]
BYTE $0xC5; BYTE $0x7D; BYTE $0x6F; BYTE $0x34; BYTE $0x24 // VMOVDQA ymm14, [rsp]
SUBQ $4, DX
JCS vector_loop4_end
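// 4 blocks per iteration (two block pairs); 256 bytes in/out per pass.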
vector_loop4_begin:
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xC0 // VMOVDQA ymm0, ymm8
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xC9 // VMOVDQA ymm1, ymm9
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xD2 // VMOVDQA ymm2, ymm10
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xDB // VMOVDQA ymm3, ymm11
BYTE $0xC5; BYTE $0xFD; BYTE $0x6F; BYTE $0xE0 // VMOVDQA ymm4, ymm0
BYTE $0xC5; BYTE $0xFD; BYTE $0x6F; BYTE $0xE9 // VMOVDQA ymm5, ymm1
BYTE $0xC5; BYTE $0xFD; BYTE $0x6F; BYTE $0xF2 // VMOVDQA ymm6, ymm2
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xD4; BYTE $0xFE // VPADDQ ymm7, ymm3, ymm14
MOVQ $20, SI
rounds_loop4_begin:
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm7, 16
BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm7, ymm7, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm5, 12
BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm5, ymm5, 20
BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm7, 8
BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm7, ymm7, 24
BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm5, 7
BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm5, ymm5, 25
BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x39 // VPSHUFD ymm1, ymm1, 57
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xED; BYTE $0x39 // VPSHUFD ymm5, ymm5, 57
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm6, ymm6, 78
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x93 // VPSHUFD ymm3, ymm3, 147
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xFF; BYTE $0x93 // VPSHUFD ymm7, ymm7, 147
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm7, 16
BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm7, ymm7, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm5, 12
BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm5, ymm5, 20
BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm7, 8
BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm7, ymm7, 24
BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm5, 7
BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm5, ymm5, 25
BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x93 // VPSHUFD ymm1, ymm1, 147
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xED; BYTE $0x93 // VPSHUFD ymm5, ymm5, 147
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm6, ymm6, 78
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x39 // VPSHUFD ymm3, ymm3, 57
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xFF; BYTE $0x39 // VPSHUFD ymm7, ymm7, 57
SUBQ $2, SI
JNE rounds_loop4_begin
BYTE $0xC4; BYTE $0xC1; BYTE $0x7D; BYTE $0xFE; BYTE $0xC0 // VPADDD ymm0, ymm0, ymm8
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xFE; BYTE $0xC9 // VPADDD ymm1, ymm1, ymm9
BYTE $0xC4; BYTE $0xC1; BYTE $0x6D; BYTE $0xFE; BYTE $0xD2 // VPADDD ymm2, ymm2, ymm10
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xFE; BYTE $0xDB // VPADDD ymm3, ymm3, ymm11
BYTE $0xC4; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x20 // VPERM2I128 ymm12, ymm0, ymm1, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x23 // VPXOR ymm12, ymm12, [rbx]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x21 // VMOVDQU [rcx], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x20 // VPERM2I128 ymm12, ymm2, ymm3, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x20 // VPXOR ymm12, ymm12, [rbx + 32]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x20 // VMOVDQU [rcx + 32], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x31 // VPERM2I128 ymm12, ymm0, ymm1, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x40 // VPXOR ymm12, ymm12, [rbx + 64]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x40 // VMOVDQU [rcx + 64], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x31 // VPERM2I128 ymm12, ymm2, ymm3, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x60 // VPXOR ymm12, ymm12, [rbx + 96]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x60 // VMOVDQU [rcx + 96], ymm12
BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xFE; BYTE $0xDE // VPADDD ymm11, ymm11, ymm14
BYTE $0xC4; BYTE $0xC1; BYTE $0x5D; BYTE $0xFE; BYTE $0xE0 // VPADDD ymm4, ymm4, ymm8
BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xFE; BYTE $0xE9 // VPADDD ymm5, ymm5, ymm9
BYTE $0xC4; BYTE $0xC1; BYTE $0x4D; BYTE $0xFE; BYTE $0xF2 // VPADDD ymm6, ymm6, ymm10
BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xFE; BYTE $0xFB // VPADDD ymm7, ymm7, ymm11
BYTE $0xC4; BYTE $0x63; BYTE $0x5D; BYTE $0x46; BYTE $0xE5; BYTE $0x20 // VPERM2I128 ymm12, ymm4, ymm5, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x80; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 128]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x80; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 128], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x4D; BYTE $0x46; BYTE $0xE7; BYTE $0x20 // VPERM2I128 ymm12, ymm6, ymm7, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xA0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 160]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xA0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 160], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x5D; BYTE $0x46; BYTE $0xE5; BYTE $0x31 // VPERM2I128 ymm12, ymm4, ymm5, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xC0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 192]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xC0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 192], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x4D; BYTE $0x46; BYTE $0xE7; BYTE $0x31 // VPERM2I128 ymm12, ymm6, ymm7, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xE0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 224]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xE0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 224], ymm12
BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xFE; BYTE $0xDE // VPADDD ymm11, ymm11, ymm14
ADDQ $256, BX
ADDQ $256, CX
SUBQ $4, DX
JCC vector_loop4_begin
vector_loop4_end:
ADDQ $4, DX
JEQ out_write_even
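// 2 blocks per iteration (one block pair), with an early out via
// out_write_odd when only a single block remains.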
vector_loop2_begin:
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xC0 // VMOVDQA ymm0, ymm8
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xC9 // VMOVDQA ymm1, ymm9
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xD2 // VMOVDQA ymm2, ymm10
BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xDB // VMOVDQA ymm3, ymm11
MOVQ $20, SI
rounds_loop2_begin:
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x39 // VPSHUFD ymm1, ymm1, 57
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x93 // VPSHUFD ymm3, ymm3, 147
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x93 // VPSHUFD ymm1, ymm1, 147
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x39 // VPSHUFD ymm3, ymm3, 57
SUBQ $2, SI
JNE rounds_loop2_begin
BYTE $0xC4; BYTE $0xC1; BYTE $0x7D; BYTE $0xFE; BYTE $0xC0 // VPADDD ymm0, ymm0, ymm8
BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xFE; BYTE $0xC9 // VPADDD ymm1, ymm1, ymm9
BYTE $0xC4; BYTE $0xC1; BYTE $0x6D; BYTE $0xFE; BYTE $0xD2 // VPADDD ymm2, ymm2, ymm10
BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xFE; BYTE $0xDB // VPADDD ymm3, ymm3, ymm11
BYTE $0xC4; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x20 // VPERM2I128 ymm12, ymm0, ymm1, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x23 // VPXOR ymm12, ymm12, [rbx]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x21 // VMOVDQU [rcx], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x20 // VPERM2I128 ymm12, ymm2, ymm3, 32
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x20 // VPXOR ymm12, ymm12, [rbx + 32]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x20 // VMOVDQU [rcx + 32], ymm12
SUBQ $1, DX
JEQ out_write_odd
BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xFE; BYTE $0xDE // VPADDD ymm11, ymm11, ymm14
BYTE $0xC4; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x31 // VPERM2I128 ymm12, ymm0, ymm1, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x40 // VPXOR ymm12, ymm12, [rbx + 64]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x40 // VMOVDQU [rcx + 64], ymm12
BYTE $0xC4; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x31 // VPERM2I128 ymm12, ymm2, ymm3, 49
BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x60 // VPXOR ymm12, ymm12, [rbx + 96]
BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x60 // VMOVDQU [rcx + 96], ymm12
SUBQ $1, DX
JEQ out_write_even
ADDQ $128, BX
ADDQ $128, CX
JMP vector_loop2_begin
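// The final block consumed only the low 128-bit lane, so swap the lanes of
// ymm11 to bring the next counter value into the low lane before it is
// written back to the state.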
out_write_odd:
BYTE $0xC4; BYTE $0x43; BYTE $0x25; BYTE $0x46; BYTE $0xDB; BYTE $0x01 // VPERM2I128 ymm11, ymm11, ymm11, 1
out_write_even:
BYTE $0xC5; BYTE $0x79; BYTE $0x7F; BYTE $0x58; BYTE $0x30 // VMOVDQA [rax + 48], xmm11
BYTE $0xC5; BYTE $0xFD; BYTE $0xEF; BYTE $0xC0 // VPXOR ymm0, ymm0, ymm0
BYTE $0xC5; BYTE $0xFD; BYTE $0x7F; BYTE $0x44; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm0
BYTE $0xC5; BYTE $0xFD; BYTE $0x7F; BYTE $0x44; BYTE $0x24; BYTE $0x20 // VMOVDQA [rsp + 32], ymm0
MOVQ DI, SP
BYTE $0xC5; BYTE $0xF8; BYTE $0x77 // VZEROUPPER
RET
// func cpuidAmd64(cpuidParams *uint32)
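// cpuidAmd64 loads EAX/ECX (leaf and sub-leaf) from cpuidParams[0..1],
// executes CPUID, and stores EAX, EBX, ECX, EDX back into cpuidParams[0..3].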
TEXT ·cpuidAmd64(SB),4,$0-8
MOVQ cpuidParams+0(FP), R15
MOVL 0(R15), AX
MOVL 4(R15), CX
CPUID
MOVL AX, 0(R15)
MOVL BX, 4(R15)
MOVL CX, 8(R15)
MOVL DX, 12(R15)
RET
// func xgetbv0Amd64(xcrVec *uint32)
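// xgetbv0Amd64 reads the XCR0 extended control register (XGETBV with
// ECX = 0, hand-encoded as 0F 01 D0) and stores EAX/EDX into xcrVec[0..1].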
TEXT ·xgetbv0Amd64(SB),4,$0-8
MOVQ xcrVec+0(FP), BX
XORL CX, CX
BYTE $0x0F; BYTE $0x01; BYTE $0xD0 // XGETBV
MOVL AX, 0(BX)
MOVL DX, 4(BX)
RET
// chacha20_ref.go - Reference ChaCha20.
//
// To the extent possible under law, Yawning Angel has waived all copyright
// and related or neighboring rights to chacha20, using the Creative
// Commons "CC0" public domain dedication. See LICENSE or
// <http://creativecommons.org/publicdomain/zero/1.0/> for full details.
package chacha20
import (
"encoding/binary"
"math"
"unsafe"
)
func blocksRef(x *[stateSize]uint32, in []byte, out []byte, nrBlocks int, isIetf bool) {
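// The IETF variant uses a 32-bit block counter, so one (key, nonce) pair
// yields at most 2^32 blocks of keystream; overflowing the counter would
// reuse keystream, hence the panic below.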
if isIetf {
totalBlocks := uint64(x[8]) + uint64(nrBlocks)
if totalBlocks > math.MaxUint32 {
panic("chacha20: Exceeded keystream per nonce limit")
}
}
// This routine ignores x[0]...x[3] in favor of the const values since it's
// ever so slightly faster.
for n := 0; n < nrBlocks; n++ {
x0, x1, x2, x3 := sigma0, sigma1, sigma2, sigma3
x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15 := x[4], x[5], x[6], x[7], x[8], x[9], x[10], x[11], x[12], x[13], x[14], x[15]
for i := chachaRounds; i > 0; i -= 2 {
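// Each loop iteration performs a column round followed by a diagonal
// round, i.e. two of the chachaRounds rounds.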
// quarterround(x, 0, 4, 8, 12)
x0 += x4
x12 ^= x0
x12 = (x12 << 16) | (x12 >> 16)
x8 += x12
x4 ^= x8
x4 = (x4 << 12) | (x4 >> 20)
x0 += x4
x12 ^= x0
x12 = (x12 << 8) | (x12 >> 24)
x8 += x12
x4 ^= x8
x4 = (x4 << 7) | (x4 >> 25)
// quarterround(x, 1, 5, 9, 13)
x1 += x5
x13 ^= x1
x13 = (x13 << 16) | (x13 >> 16)
x9 += x13
x5 ^= x9
x5 = (x5 << 12) | (x5 >> 20)
x1 += x5
x13 ^= x1
x13 = (x13 << 8) | (x13 >> 24)
x9 += x13
x5 ^= x9
x5 = (x5 << 7) | (x5 >> 25)
// quarterround(x, 2, 6, 10, 14)
x2 += x6
x14 ^= x2
x14 = (x14 << 16) | (x14 >> 16)
x10 += x14
x6 ^= x10
x6 = (x6 << 12) | (x6 >> 20)
x2 += x6
x14 ^= x2
x14 = (x14 << 8) | (x14 >> 24)
x10 += x14
x6 ^= x10
x6 = (x6 << 7) | (x6 >> 25)
// quarterround(x, 3, 7, 11, 15)
x3 += x7
x15 ^= x3
x15 = (x15 << 16) | (x15 >> 16)
x11 += x15
x7 ^= x11
x7 = (x7 << 12) | (x7 >> 20)
x3 += x7
x15 ^= x3
x15 = (x15 << 8) | (x15 >> 24)
x11 += x15
x7 ^= x11
x7 = (x7 << 7) | (x7 >> 25)
// quarterround(x, 0, 5, 10, 15)
x0 += x5
x15 ^= x0
x15 = (x15 << 16) | (x15 >> 16)
x10 += x15
x5 ^= x10
x5 = (x5 << 12) | (x5 >> 20)
x0 += x5
x15 ^= x0
x15 = (x15 << 8) | (x15 >> 24)
x10 += x15
x5 ^= x10
x5 = (x5 << 7) | (x5 >> 25)
// quarterround(x, 1, 6, 11, 12)
x1 += x6
x12 ^= x1
x12 = (x12 << 16) | (x12 >> 16)
x11 += x12
x6 ^= x11
x6 = (x6 << 12) | (x6 >> 20)
x1 += x6
x12 ^= x1
x12 = (x12 << 8) | (x12 >> 24)
x11 += x12
x6 ^= x11
x6 = (x6 << 7) | (x6 >> 25)
// quarterround(x, 2, 7, 8, 13)
x2 += x7
x13 ^= x2
x13 = (x13 << 16) | (x13 >> 16)
x8 += x13
x7 ^= x8
x7 = (x7 << 12) | (x7 >> 20)
x2 += x7
x13 ^= x2
x13 = (x13 << 8) | (x13 >> 24)
x8 += x13
x7 ^= x8
x7 = (x7 << 7) | (x7 >> 25)
// quarterround(x, 3, 4, 9, 14)
x3 += x4
x14 ^= x3
x14 = (x14 << 16) | (x14 >> 16)
x9 += x14
x4 ^= x9
x4 = (x4 << 12) | (x4 >> 20)
x3 += x4
x14 ^= x3
x14 = (x14 << 8) | (x14 >> 24)
x9 += x14
x4 ^= x9
x4 = (x4 << 7) | (x4 >> 25)
}
// On amd64 at least, this is a rather big boost.
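// useUnsafe implies a little-endian target that tolerates unaligned
// 32-bit loads, so the 64-byte block may be viewed as a [16]uint32.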
if useUnsafe {
if in != nil {
inArr := (*[16]uint32)(unsafe.Pointer(&in[n*BlockSize]))
outArr := (*[16]uint32)(unsafe.Pointer(&out[n*BlockSize]))
outArr[0] = inArr[0] ^ (x0 + sigma0)
outArr[1] = inArr[1] ^ (x1 + sigma1)
outArr[2] = inArr[2] ^ (x2 + sigma2)
outArr[3] = inArr[3] ^ (x3 + sigma3)
outArr[4] = inArr[4] ^ (x4 + x[4])
outArr[5] = inArr[5] ^ (x5 + x[5])
outArr[6] = inArr[6] ^ (x6 + x[6])
outArr[7] = inArr[7] ^ (x7 + x[7])
outArr[8] = inArr[8] ^ (x8 + x[8])
outArr[9] = inArr[9] ^ (x9 + x[9])
outArr[10] = inArr[10] ^ (x10 + x[10])
outArr[11] = inArr[11] ^ (x11 + x[11])
outArr[12] = inArr[12] ^ (x12 + x[12])
outArr[13] = inArr[13] ^ (x13 + x[13])
outArr[14] = inArr[14] ^ (x14 + x[14])
outArr[15] = inArr[15] ^ (x15 + x[15])
} else {
outArr := (*[16]uint32)(unsafe.Pointer(&out[n*BlockSize]))
outArr[0] = x0 + sigma0
outArr[1] = x1 + sigma1
outArr[2] = x2 + sigma2
outArr[3] = x3 + sigma3
outArr[4] = x4 + x[4]
outArr[5] = x5 + x[5]
outArr[6] = x6 + x[6]
outArr[7] = x7 + x[7]
outArr[8] = x8 + x[8]
outArr[9] = x9 + x[9]
outArr[10] = x10 + x[10]
outArr[11] = x11 + x[11]
outArr[12] = x12 + x[12]
outArr[13] = x13 + x[13]
outArr[14] = x14 + x[14]
outArr[15] = x15 + x[15]
}
} else {
// Slow path: either the architecture cares about alignment, or it is not little-endian.
x0 += sigma0
x1 += sigma1
x2 += sigma2
x3 += sigma3
x4 += x[4]
x5 += x[5]
x6 += x[6]
x7 += x[7]
x8 += x[8]
x9 += x[9]
x10 += x[10]
x11 += x[11]
x12 += x[12]
x13 += x[13]
x14 += x[14]
x15 += x[15]
if in != nil {
binary.LittleEndian.PutUint32(out[0:4], binary.LittleEndian.Uint32(in[0:4])^x0)
binary.LittleEndian.PutUint32(out[4:8], binary.LittleEndian.Uint32(in[4:8])^x1)
binary.LittleEndian.PutUint32(out[8:12], binary.LittleEndian.Uint32(in[8:12])^x2)
binary.LittleEndian.PutUint32(out[12:16], binary.LittleEndian.Uint32(in[12:16])^x3)
binary.LittleEndian.PutUint32(out[16:20], binary.LittleEndian.Uint32(in[16:20])^x4)
binary.LittleEndian.PutUint32(out[20:24], binary.LittleEndian.Uint32(in[20:24])^x5)
binary.LittleEndian.PutUint32(out[24:28], binary.LittleEndian.Uint32(in[24:28])^x6)
binary.LittleEndian.PutUint32(out[28:32], binary.LittleEndian.Uint32(in[28:32])^x7)
binary.LittleEndian.PutUint32(out[32:36], binary.LittleEndian.Uint32(in[32:36])^x8)
binary.LittleEndian.PutUint32(out[36:40], binary.LittleEndian.Uint32(in[36:40])^x9)
binary.LittleEndian.PutUint32(out[40:44], binary.LittleEndian.Uint32(in[40:44])^x10)
binary.LittleEndian.PutUint32(out[44:48], binary.LittleEndian.Uint32(in[44:48])^x11)
binary.LittleEndian.PutUint32(out[48:52], binary.LittleEndian.Uint32(in[48:52])^x12)
binary.LittleEndian.PutUint32(out[52:56], binary.LittleEndian.Uint32(in[52:56])^x13)
binary.LittleEndian.PutUint32(out[56:60], binary.LittleEndian.Uint32(in[56:60])^x14)
binary.LittleEndian.PutUint32(out[60:64], binary.LittleEndian.Uint32(in[60:64])^x15)
in = in[BlockSize:]
} else {
binary.LittleEndian.PutUint32(out[0:4], x0)
binary.LittleEndian.PutUint32(out[4:8], x1)
binary.LittleEndian.PutUint32(out[8:12], x2)
binary.LittleEndian.PutUint32(out[12:16], x3)
binary.LittleEndian.PutUint32(out[16:20], x4)
binary.LittleEndian.PutUint32(out[20:24], x5)
binary.LittleEndian.PutUint32(out[24:28], x6)
binary.LittleEndian.PutUint32(out[28:32], x7)
binary.LittleEndian.PutUint32(out[32:36], x8)
binary.LittleEndian.PutUint32(out[36:40], x9)
binary.LittleEndian.PutUint32(out[40:44], x10)
binary.LittleEndian.PutUint32(out[44:48], x11)
binary.LittleEndian.PutUint32(out[48:52], x12)
binary.LittleEndian.PutUint32(out[52:56], x13)
binary.LittleEndian.PutUint32(out[56:60], x14)
binary.LittleEndian.PutUint32(out[60:64], x15)
}
out = out[BlockSize:]
}
// Stopping at 2^70 bytes per nonce is the user's responsibility.
ctr := uint64(x[13])<<32 | uint64(x[12])
ctr++
x[12] = uint32(ctr)
x[13] = uint32(ctr >> 32)
}
}
func hChaChaRef(x *[stateSize]uint32, out *[32]byte) {
x0, x1, x2, x3 := sigma0, sigma1, sigma2, sigma3
x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15 := x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7], x[8], x[9], x[10], x[11]
for i := chachaRounds; i > 0; i -= 2 {
// quarterround(x, 0, 4, 8, 12)
x0 += x4
x12 ^= x0
x12 = (x12 << 16) | (x12 >> 16)
x8 += x12
x4 ^= x8
x4 = (x4 << 12) | (x4 >> 20)
x0 += x4
x12 ^= x0
x12 = (x12 << 8) | (x12 >> 24)
x8 += x12
x4 ^= x8
x4 = (x4 << 7) | (x4 >> 25)
// quarterround(x, 1, 5, 9, 13)
x1 += x5
x13 ^= x1
x13 = (x13 << 16) | (x13 >> 16)
x9 += x13
x5 ^= x9
x5 = (x5 << 12) | (x5 >> 20)
x1 += x5
x13 ^= x1
x13 = (x13 << 8) | (x13 >> 24)
x9 += x13
x5 ^= x9
x5 = (x5 << 7) | (x5 >> 25)
// quarterround(x, 2, 6, 10, 14)
x2 += x6
x14 ^= x2
x14 = (x14 << 16) | (x14 >> 16)
x10 += x14
x6 ^= x10
x6 = (x6 << 12) | (x6 >> 20)
x2 += x6
x14 ^= x2
x14 = (x14 << 8) | (x14 >> 24)
x10 += x14
x6 ^= x10
x6 = (x6 << 7) | (x6 >> 25)
// quarterround(x, 3, 7, 11, 15)
x3 += x7
x15 ^= x3
x15 = (x15 << 16) | (x15 >> 16)
x11 += x15
x7 ^= x11
x7 = (x7 << 12) | (x7 >> 20)
x3 += x7
x15 ^= x3
x15 = (x15 << 8) | (x15 >> 24)
x11 += x15
x7 ^= x11
x7 = (x7 << 7) | (x7 >> 25)
// quarterround(x, 0, 5, 10, 15)
x0 += x5
x15 ^= x0
x15 = (x15 << 16) | (x15 >> 16)
x10 += x15
x5 ^= x10
x5 = (x5 << 12) | (x5 >> 20)
x0 += x5
x15 ^= x0
x15 = (x15 << 8) | (x15 >> 24)
x10 += x15
x5 ^= x10
x5 = (x5 << 7) | (x5 >> 25)
// quarterround(x, 1, 6, 11, 12)
x1 += x6
x12 ^= x1
x12 = (x12 << 16) | (x12 >> 16)
x11 += x12
x6 ^= x11
x6 = (x6 << 12) | (x6 >> 20)
x1 += x6
x12 ^= x1
x12 = (x12 << 8) | (x12 >> 24)
x11 += x12
x6 ^= x11
x6 = (x6 << 7) | (x6 >> 25)
// quarterround(x, 2, 7, 8, 13)
x2 += x7
x13 ^= x2
x13 = (x13 << 16) | (x13 >> 16)
x8 += x13
x7 ^= x8
x7 = (x7 << 12) | (x7 >> 20)
x2 += x7
x13 ^= x2
x13 = (x13 << 8) | (x13 >> 24)
x8 += x13
x7 ^= x8
x7 = (x7 << 7) | (x7 >> 25)
// quarterround(x, 3, 4, 9, 14)
x3 += x4
x14 ^= x3
x14 = (x14 << 16) | (x14 >> 16)
x9 += x14
x4 ^= x9
x4 = (x4 << 12) | (x4 >> 20)
x3 += x4
x14 ^= x3
x14 = (x14 << 8) | (x14 >> 24)
x9 += x14
x4 ^= x9
x4 = (x4 << 7) | (x4 >> 25)
}
// HChaCha returns x0...x3 | x12...x15, which correspond to the
// indexes of the ChaCha constants and the indexes of the IV.
if useUnsafe {
outArr := (*[16]uint32)(unsafe.Pointer(&out[0]))
outArr[0] = x0
outArr[1] = x1
outArr[2] = x2
outArr[3] = x3
outArr[4] = x12
outArr[5] = x13
outArr[6] = x14
outArr[7] = x15
} else {
binary.LittleEndian.PutUint32(out[0:4], x0)
binary.LittleEndian.PutUint32(out[4:8], x1)
binary.LittleEndian.PutUint32(out[8:12], x2)
binary.LittleEndian.PutUint32(out[12:16], x3)
binary.LittleEndian.PutUint32(out[16:20], x12)
binary.LittleEndian.PutUint32(out[20:24], x13)
binary.LittleEndian.PutUint32(out[24:28], x14)
binary.LittleEndian.PutUint32(out[28:32], x15)
}
return
}
@@ -63,15 +63,15 @@ func (s *TcpForwardServer) handleTcpForward(conn net.Conn, raddr net.Addr) {
}
type packet struct {
- srcAddr *net.UDPAddr // src address
- dstAddr *net.UDPAddr // dest address
+ srcAddr string // src address
+ dstAddr string // dest address
data []byte
}
type cnode struct {
chain *ProxyChain
conn net.Conn
- srcAddr, dstAddr *net.UDPAddr
+ srcAddr, dstAddr string
rChan, wChan chan *packet
err error
ttl time.Duration
@@ -146,13 +146,9 @@ func (node *cnode) run() {
timer.Reset(node.ttl)
glog.V(LDEBUG).Infof("[udp] %s <<< %s : length %d", node.srcAddr, addr, n)
- if node.dstAddr.String() != addr.String() {
- glog.V(LWARNING).Infof("[udp] %s <- %s : dst-addr mismatch (%s)", node.srcAddr, node.dstAddr, addr)
- break
- }
select {
// swap srcAddr with dstAddr
- case node.rChan <- &packet{srcAddr: node.dstAddr, dstAddr: node.srcAddr, data: b[:n]}:
+ case node.rChan <- &packet{srcAddr: addr.String(), dstAddr: node.srcAddr, data: b[:n]}:
case <-time.After(time.Second * 3):
glog.V(LWARNING).Infof("[udp] %s <- %s : %s", node.srcAddr, node.dstAddr, "recv queue is full, discard")
}
@@ -169,13 +165,9 @@ func (node *cnode) run() {
timer.Reset(node.ttl)
glog.V(LDEBUG).Infof("[udp-tun] %s <<< %s : length %d", node.srcAddr, dgram.Header.Addr.String(), len(dgram.Data))
- if dgram.Header.Addr.String() != node.dstAddr.String() {
- glog.V(LWARNING).Infof("[udp-tun] %s <- %s : dst-addr mismatch (%s)", node.srcAddr, node.dstAddr, dgram.Header.Addr)
- break
- }
select {
// swap srcAddr with dstAddr
- case node.rChan <- &packet{srcAddr: node.dstAddr, dstAddr: node.srcAddr, data: dgram.Data}:
+ case node.rChan <- &packet{srcAddr: dgram.Header.Addr.String(), dstAddr: node.srcAddr, data: dgram.Data}:
case <-time.After(time.Second * 3):
glog.V(LWARNING).Infof("[udp-tun] %s <- %s : %s", node.srcAddr, node.dstAddr, "recv queue is full, discard")
}
@@ -187,9 +179,15 @@ func (node *cnode) run() {
for pkt := range node.wChan {
timer.Reset(node.ttl)
+ dstAddr, err := net.ResolveUDPAddr("udp", pkt.dstAddr)
+ if err != nil {
+ glog.V(LWARNING).Infof("[udp] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, err)
+ continue
+ }
switch c := node.conn.(type) {
case *net.UDPConn:
- if _, err := c.WriteToUDP(pkt.data, pkt.dstAddr); err != nil {
+ if _, err := c.WriteToUDP(pkt.data, dstAddr); err != nil {
glog.V(LWARNING).Infof("[udp] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, err)
node.err = err
errChan <- err
......@@ -198,7 +196,7 @@ func (node *cnode) run() {
glog.V(LDEBUG).Infof("[udp] %s >>> %s : length %d", pkt.srcAddr, pkt.dstAddr, len(pkt.data))
default:
dgram := gosocks5.NewUDPDatagram(gosocks5.NewUDPHeader(uint16(len(pkt.data)), 0, ToSocksAddr(pkt.dstAddr)), pkt.data)
dgram := gosocks5.NewUDPDatagram(gosocks5.NewUDPHeader(uint16(len(pkt.data)), 0, ToSocksAddr(dstAddr)), pkt.data)
if err := dgram.Write(c); err != nil {
glog.V(LWARNING).Infof("[udp-tun] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, err)
node.err = err
......@@ -255,7 +253,7 @@ func (s *UdpForwardServer) ListenAndServe() error {
}
select {
case ch <- &packet{srcAddr: addr, dstAddr: raddr, data: b[:n]}:
case ch <- &packet{srcAddr: addr.String(), dstAddr: raddr.String(), data: b[:n]}:
case <-time.After(time.Second * 3):
glog.V(LWARNING).Infof("[udp] %s -> %s : %s", addr, raddr, "send queue is full, discard")
}
......@@ -264,7 +262,12 @@ func (s *UdpForwardServer) ListenAndServe() error {
// start recv queue
go func(ch <-chan *packet) {
for pkt := range ch {
if _, err := conn.WriteToUDP(pkt.data, pkt.dstAddr); err != nil {
dstAddr, err := net.ResolveUDPAddr("udp", pkt.dstAddr)
if err != nil {
glog.V(LWARNING).Infof("[udp] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
continue
}
if _, err := conn.WriteToUDP(pkt.data, dstAddr); err != nil {
glog.V(LWARNING).Infof("[udp] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
return
}
......@@ -285,7 +288,7 @@ func (s *UdpForwardServer) ListenAndServe() error {
}
}
node, ok := m[pkt.srcAddr.String()]
node, ok := m[pkt.srcAddr]
if !ok {
node = &cnode{
chain: s.Base.Chain,
......@@ -295,7 +298,7 @@ func (s *UdpForwardServer) ListenAndServe() error {
wChan: make(chan *packet, 32),
ttl: time.Duration(s.TTL) * time.Second,
}
m[pkt.srcAddr.String()] = node
m[pkt.srcAddr] = node
go node.run()
glog.V(LINFO).Infof("[udp] %s -> %s : new client (%d)", pkt.srcAddr, pkt.dstAddr, len(m))
}
......
......@@ -11,7 +11,7 @@ import (
)
const (
Version = "2.3"
Version = "2.4-dev"
)
// Log level for glog
......
......@@ -71,7 +71,7 @@ func ParseProxyNode(s string) (node ProxyNode, err error) {
}
switch node.Transport {
case "ws", "wss", "tls", "http2", "ssu", "quic", "kcp", "redirect":
case "ws", "wss", "tls", "http2", "quic", "kcp", "redirect", "ssu":
case "https":
node.Protocol = "http"
node.Transport = "tls"
......
......@@ -32,7 +32,7 @@ func NewProxyServer(node ProxyNode, chain *ProxyChain, config *tls.Config) *Prox
var cipher *ss.Cipher
var ota bool
if node.Protocol == "ss" {
if node.Protocol == "ss" || node.Transport == "ssu" {
var err error
var method, password string
......@@ -98,8 +98,6 @@ func (s *ProxyServer) Serve() error {
return NewRTcpForwardServer(s).Serve()
case "rudp": // Remote UDP port forwarding
return NewRUdpForwardServer(s).Serve()
case "ssu": // TODO: shadowsocks udp relay
return NewShadowUdpServer(s).ListenAndServe()
case "quic":
return NewQuicServer(s).ListenAndServeTLS(s.TLSConfig)
case "kcp":
......@@ -118,6 +116,12 @@ func (s *ProxyServer) Serve() error {
return NewKCPServer(s, config).ListenAndServe()
case "redirect":
return NewRedsocksTCPServer(s).ListenAndServe()
case "ssu": // shadowsocks udp relay
ttl, _ := strconv.Atoi(s.Node.Get("ttl"))
if ttl <= 0 {
ttl = DefaultTTL
}
return NewShadowUdpServer(s, ttl).ListenAndServe()
default:
ln, err = net.Listen("tcp", node.Addr)
}
......
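The new ssu case reads a per-node ttl option and falls back to DefaultTTL when it is absent or invalid. Assuming gost's usual scheme and query-parameter syntax (the exact URL form is not shown in this diff), a shadowsocks UDP relay would be started along the lines of:

gost -L=ssu://chacha20:password@:8338?ttl=60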
......@@ -5,6 +5,7 @@ import (
"encoding/binary"
"errors"
"fmt"
"github.com/ginuerzh/gosocks5"
"github.com/golang/glog"
ss "github.com/shadowsocks/shadowsocks-go/shadowsocks"
"io"
......@@ -65,47 +66,6 @@ func (s *ShadowServer) Serve() {
glog.V(LINFO).Infof("[ss] %s >-< %s", s.conn.RemoteAddr(), addr)
}
type ShadowUdpServer struct {
Base *ProxyServer
Handler func(conn *net.UDPConn, addr *net.UDPAddr, data []byte)
}
func NewShadowUdpServer(base *ProxyServer) *ShadowUdpServer {
return &ShadowUdpServer{Base: base}
}
func (s *ShadowUdpServer) ListenAndServe() error {
laddr, err := net.ResolveUDPAddr("udp", s.Base.Node.Addr)
if err != nil {
return err
}
lconn, err := net.ListenUDP("udp", laddr)
if err != nil {
return err
}
defer lconn.Close()
if s.Handler == nil {
s.Handler = s.HandleConn
}
for {
b := make([]byte, LargeBufferSize)
n, addr, err := lconn.ReadFromUDP(b)
if err != nil {
glog.V(LWARNING).Infoln(err)
continue
}
go s.Handler(lconn, addr, b[:n])
}
}
// TODO: shadowsocks udp relay handler
func (s *ShadowUdpServer) HandleConn(conn *net.UDPConn, addr *net.UDPAddr, data []byte) {
}
// This function is copied from shadowsocks library with some modification.
func (s *ShadowServer) getRequest() (host string, ota bool, err error) {
// buf size should at least have the same size with the largest possible
......@@ -276,3 +236,109 @@ func (c *shadowConn) SetReadDeadline(t time.Time) error {
func (c *shadowConn) SetWriteDeadline(t time.Time) error {
return c.conn.SetWriteDeadline(t)
}
type ShadowUdpServer struct {
Base *ProxyServer
TTL int
}
func NewShadowUdpServer(base *ProxyServer, ttl int) *ShadowUdpServer {
return &ShadowUdpServer{Base: base, TTL: ttl}
}
func (s *ShadowUdpServer) ListenAndServe() error {
laddr, err := net.ResolveUDPAddr("udp", s.Base.Node.Addr)
if err != nil {
return err
}
lconn, err := net.ListenUDP("udp", laddr)
if err != nil {
return err
}
defer lconn.Close()
conn := ss.NewSecurePacketConn(lconn, s.Base.cipher.Copy(), true) // force OTA on
rChan, wChan := make(chan *packet, 128), make(chan *packet, 128)
// start send queue
go func(ch chan<- *packet) {
for {
b := make([]byte, MediumBufferSize)
n, addr, err := conn.ReadFrom(b[3:]) // add rsv and frag fields to make it the standard SOCKS5 UDP datagram
if err != nil {
glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", addr, laddr, err)
continue
}
if b[3]&ss.OneTimeAuthMask > 0 {
glog.V(LWARNING).Infof("[ssu] %s -> %s : client does not support OTA", addr, laddr)
continue
}
b[3] &= ss.AddrMask
dgram, err := gosocks5.ReadUDPDatagram(bytes.NewReader(b[:n+3]))
if err != nil {
glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", addr, laddr, err)
continue
}
select {
case ch <- &packet{srcAddr: addr.String(), dstAddr: dgram.Header.Addr.String(), data: b[:n+3]}:
case <-time.After(time.Second * 3):
glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", addr, dgram.Header.Addr.String(), "send queue is full, discard")
}
}
}(wChan)
// start recv queue
go func(ch <-chan *packet) {
for pkt := range ch {
dstAddr, err := net.ResolveUDPAddr("udp", pkt.dstAddr)
if err != nil {
glog.V(LWARNING).Infof("[ssu] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
continue
}
if _, err := conn.WriteTo(pkt.data, dstAddr); err != nil {
glog.V(LWARNING).Infof("[ssu] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
return
}
}
}(rChan)
// mapping client to node
m := make(map[string]*cnode)
// start dispatcher
for pkt := range wChan {
// clear obsolete nodes
for k, node := range m {
if node != nil && node.err != nil {
close(node.wChan)
delete(m, k)
glog.V(LINFO).Infof("[ssu] clear node %s", k)
}
}
node, ok := m[pkt.srcAddr]
if !ok {
node = &cnode{
chain: s.Base.Chain,
srcAddr: pkt.srcAddr,
dstAddr: pkt.dstAddr,
rChan: rChan,
wChan: make(chan *packet, 32),
ttl: time.Duration(s.TTL) * time.Second,
}
m[pkt.srcAddr] = node
go node.run()
glog.V(LINFO).Infof("[ssu] %s -> %s : new client (%d)", pkt.srcAddr, pkt.dstAddr, len(m))
}
select {
case node.wChan <- pkt:
case <-time.After(time.Second * 3):
glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, "node send queue is full, discard")
}
}
return nil
}
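To make the framing above concrete, here is a standalone illustration (an assumption-labeled sketch, not part of the commit) of why reading into b[3:] works: a decrypted shadowsocks UDP payload is ATYP|ADDR|PORT|DATA, and leaving three zero bytes in front of it yields exactly the SOCKS5 UDP datagram layout RSV(2)|FRAG(1)|ATYP|ADDR|PORT|DATA that gosocks5.ReadUDPDatagram expects:

package main

import "fmt"

func main() {
	b := make([]byte, 64)
	// Fake decrypted shadowsocks payload: ATYP=IPv4, 127.0.0.1:8080, data "hi".
	payload := []byte{0x01, 127, 0, 0, 1, 0x1f, 0x90, 'h', 'i'}
	n := copy(b[3:], payload) // stands in for conn.ReadFrom(b[3:])
	// b[0:2] (RSV) and b[2] (FRAG) are still zero, so b[:n+3] is a
	// well-formed SOCKS5 UDP datagram.
	fmt.Printf("% x\n", b[:n+3])
}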
......@@ -12,7 +12,7 @@ import (
"io"
"strings"
"github.com/codahale/chacha20"
"github.com/Yawning/chacha20"
"golang.org/x/crypto/blowfish"
"golang.org/x/crypto/cast5"
"golang.org/x/crypto/salsa20/salsa"
......@@ -65,11 +65,19 @@ func newStream(block cipher.Block, err error, key, iv []byte,
}
}
func newAESStream(key, iv []byte, doe DecOrEnc) (cipher.Stream, error) {
func newAESCFBStream(key, iv []byte, doe DecOrEnc) (cipher.Stream, error) {
block, err := aes.NewCipher(key)
return newStream(block, err, key, iv, doe)
}
func newAESCTRStream(key, iv []byte, doe DecOrEnc) (cipher.Stream, error) {
block, err := aes.NewCipher(key)
if err != nil {
return nil, err
}
return cipher.NewCTR(block, iv), nil
}
func newDESStream(key, iv []byte, doe DecOrEnc) (cipher.Stream, error) {
block, err := des.NewCipher(key)
return newStream(block, err, key, iv, doe)
......@@ -95,7 +103,11 @@ func newRC4MD5Stream(key, iv []byte, _ DecOrEnc) (cipher.Stream, error) {
}
func newChaCha20Stream(key, iv []byte, _ DecOrEnc) (cipher.Stream, error) {
return chacha20.New(key, iv)
return chacha20.NewCipher(key, iv)
}
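// Note: Yawning/chacha20's NewCipher infers the variant from the nonce
// length, so the 12-byte IV registered for chacha20-ietf below selects the
// IETF construction, while the 8-byte IV above keeps the original variant.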
func newChaCha20IETFStream(key, iv []byte, _ DecOrEnc) (cipher.Stream, error) {
return chacha20.NewCipher(key, iv)
}
type salsaStreamCipher struct {
......@@ -145,15 +157,19 @@ type cipherInfo struct {
}
var cipherMethod = map[string]*cipherInfo{
"aes-128-cfb": {16, 16, newAESStream},
"aes-192-cfb": {24, 16, newAESStream},
"aes-256-cfb": {32, 16, newAESStream},
"des-cfb": {8, 8, newDESStream},
"bf-cfb": {16, 8, newBlowFishStream},
"cast5-cfb": {16, 8, newCast5Stream},
"rc4-md5": {16, 16, newRC4MD5Stream},
"chacha20": {32, 8, newChaCha20Stream},
"salsa20": {32, 8, newSalsa20Stream},
"aes-128-cfb": {16, 16, newAESCFBStream},
"aes-192-cfb": {24, 16, newAESCFBStream},
"aes-256-cfb": {32, 16, newAESCFBStream},
"aes-128-ctr": {16, 16, newAESCTRStream},
"aes-192-ctr": {24, 16, newAESCTRStream},
"aes-256-ctr": {32, 16, newAESCTRStream},
"des-cfb": {8, 8, newDESStream},
"bf-cfb": {16, 8, newBlowFishStream},
"cast5-cfb": {16, 8, newCast5Stream},
"rc4-md5": {16, 16, newRC4MD5Stream},
"chacha20": {32, 8, newChaCha20Stream},
"chacha20-ietf": {32, 12, newChaCha20IETFStream},
"salsa20": {32, 8, newSalsa20Stream},
}
func CheckCipherMethod(method string) error {
......
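Each cipherInfo entry above is {keyLen, ivLen, factory}. As a self-contained sketch (not from this diff) of what the new aes-*-ctr entries wire up, mirroring newAESCTRStream with the 32/16 sizes registered for aes-256-ctr:

package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

func main() {
	key := make([]byte, 32) // aes-256-ctr keyLen from the table
	iv := make([]byte, 16)  // aes-256-ctr ivLen from the table
	rand.Read(key)
	rand.Read(iv)
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	stream := cipher.NewCTR(block, iv)
	ct := make([]byte, 5)
	stream.XORKeyStream(ct, []byte("hello"))
	fmt.Printf("%x\n", ct)
}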
package shadowsocks
import (
"bytes"
"fmt"
"net"
"time"
)
const (
maxPacketSize = 4096 // increase it if errors occur
)
var (
errPacketTooSmall = fmt.Errorf("[udp]read error: cannot decrypt, received packet is smaller than ivLen")
errPacketTooLarge = fmt.Errorf("[udp]read error: received packet is larger than maxPacketSize(%d)", maxPacketSize)
errBufferTooSmall = fmt.Errorf("[udp]read error: given buffer is too small to hold data")
errPacketOtaFailed = fmt.Errorf("[udp]read error: received packet has invalid ota")
)
type SecurePacketConn struct {
net.PacketConn
*Cipher
ota bool
}
func NewSecurePacketConn(c net.PacketConn, cipher *Cipher, ota bool) *SecurePacketConn {
return &SecurePacketConn{
PacketConn: c,
Cipher: cipher,
ota: ota,
}
}
func (c *SecurePacketConn) Close() error {
return c.PacketConn.Close()
}
func (c *SecurePacketConn) ReadFrom(b []byte) (n int, src net.Addr, err error) {
ota := false
cipher := c.Copy()
buf := make([]byte, 4096)
n, src, err = c.PacketConn.ReadFrom(buf)
if err != nil {
return
}
if n < c.info.ivLen {
return 0, nil, errPacketTooSmall
}
if len(b) < n-c.info.ivLen {
err = errBufferTooSmall // just a warning
}
iv := make([]byte, c.info.ivLen)
copy(iv, buf[:c.info.ivLen])
if err = cipher.initDecrypt(iv); err != nil {
return
}
cipher.decrypt(b[0:], buf[c.info.ivLen:n])
n -= c.info.ivLen
if b[idType]&OneTimeAuthMask > 0 {
ota = true
}
if c.ota && !ota {
return 0, src, errPacketOtaFailed
}
if ota {
key := cipher.key
actualHmacSha1Buf := HmacSha1(append(iv, key...), b[:n-lenHmacSha1])
if !bytes.Equal(b[n-lenHmacSha1:n], actualHmacSha1Buf) {
Debug.Printf("verify one time auth failed, iv=%v key=%v data=%v", iv, key, b)
return 0, src, errPacketOtaFailed
}
n -= lenHmacSha1
}
return
}
func (c *SecurePacketConn) WriteTo(b []byte, dst net.Addr) (n int, err error) {
cipher := c.Copy()
iv, err := cipher.initEncrypt()
if err != nil {
return
}
packetLen := len(b) + len(iv)
if c.ota {
b[idType] |= OneTimeAuthMask
packetLen += lenHmacSha1
key := cipher.key
actualHmacSha1Buf := HmacSha1(append(iv, key...), b)
b = append(b, actualHmacSha1Buf...)
}
cipherData := make([]byte, packetLen)
copy(cipherData, iv)
cipher.encrypt(cipherData[len(iv):], b)
n, err = c.PacketConn.WriteTo(cipherData, dst)
if c.ota {
n -= lenHmacSha1
}
return
}
func (c *SecurePacketConn) LocalAddr() net.Addr {
return c.PacketConn.LocalAddr()
}
func (c *SecurePacketConn) SetDeadline(t time.Time) error {
return c.PacketConn.SetDeadline(t)
}
func (c *SecurePacketConn) SetReadDeadline(t time.Time) error {
return c.PacketConn.SetReadDeadline(t)
}
func (c *SecurePacketConn) SetWriteDeadline(t time.Time) error {
return c.PacketConn.SetWriteDeadline(t)
}
func (c *SecurePacketConn) IsOta() bool {
return c.ota
}
func (c *SecurePacketConn) ForceOTA() net.PacketConn {
return NewSecurePacketConn(c.PacketConn, c.Cipher.Copy(), true)
}
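Putting ReadFrom and WriteTo together, the on-wire layout is IV || ciphertext(ATYP|ADDR|PORT|DATA [|| OTA tag]). A minimal sketch (assuming shadowsocks-go's HmacSha1 is HMAC-SHA1 truncated to lenHmacSha1 bytes, which matches its use above) of the 10-byte OTA tag both sides compute over the plaintext, keyed with iv||key:

package main

import (
	"crypto/hmac"
	"crypto/sha1"
	"fmt"
)

// otaTag mirrors HmacSha1(append(iv, key...), msg)[:lenHmacSha1] as used above.
func otaTag(iv, key, msg []byte) []byte {
	mac := hmac.New(sha1.New, append(append([]byte{}, iv...), key...))
	mac.Write(msg)
	return mac.Sum(nil)[:10] // lenHmacSha1
}

func main() {
	fmt.Printf("%x\n", otaTag([]byte("0123456789abcdef"), []byte("secret"), []byte("payload")))
}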
package shadowsocks
import (
"encoding/binary"
"fmt"
"net"
"strconv"
"strings"
"sync"
"syscall"
"time"
)
const (
idType = 0 // address type index
idIP0 = 1 // ip address start index
idDmLen = 1 // domain address length index
idDm0 = 2 // domain address start index
typeIPv4 = 1 // type is ipv4 address
typeDm = 3 // type is domain address
typeIPv6 = 4 // type is ipv6 address
lenIPv4 = 1 + net.IPv4len + 2 // 1addrType + ipv4 + 2port
lenIPv6 = 1 + net.IPv6len + 2 // 1addrType + ipv6 + 2port
lenDmBase = 1 + 1 + 2 // 1addrType + 1addrLen + 2port, plus addrLen
lenHmacSha1 = 10
)
var (
reqList = newReqList()
natlist = newNatTable()
udpTimeout = 30 * time.Second
reqListRefreshTime = 5 * time.Minute
)
type natTable struct {
sync.Mutex
conns map[string]net.PacketConn
}
func newNatTable() *natTable {
return &natTable{conns: map[string]net.PacketConn{}}
}
func (table *natTable) Delete(index string) net.PacketConn {
table.Lock()
defer table.Unlock()
c, ok := table.conns[index]
if ok {
delete(table.conns, index)
return c
}
return nil
}
func (table *natTable) Get(index string) (c net.PacketConn, ok bool, err error) {
table.Lock()
defer table.Unlock()
c, ok = table.conns[index]
if !ok {
c, err = net.ListenPacket("udp", "")
if err != nil {
return nil, false, err
}
table.conns[index] = c
}
return
}
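Design note: Get is deliberately lazy, allocating one unbound local UDP socket (net.ListenPacket with an empty address) per client on first use and caching it under the client's address string; Pipeloop later drains replies from that socket, and Delete removes the mapping when the pipe ends or a write fails.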
type requestHeaderList struct {
sync.Mutex
List map[string]([]byte)
}
func newReqList() *requestHeaderList {
ret := &requestHeaderList{List: map[string]([]byte){}}
go func() {
for {
time.Sleep(reqListRefreshTime)
ret.Refresh()
}
}()
return ret
}
func (r *requestHeaderList) Refresh() {
r.Lock()
defer r.Unlock()
for k := range r.List {
delete(r.List, k)
}
}
func (r *requestHeaderList) Get(dstaddr string) (req []byte, ok bool) {
r.Lock()
defer r.Unlock()
req, ok = r.List[dstaddr]
return
}
func (r *requestHeaderList) Put(dstaddr string, req []byte) {
r.Lock()
defer r.Unlock()
r.List[dstaddr] = req
return
}
func parseHeaderFromAddr(addr net.Addr) ([]byte, int) {
// if the request address type is a domain, it cannot be reverse-looked up
ip, port, err := net.SplitHostPort(addr.String())
if err != nil {
return nil, 0
}
buf := make([]byte, 20)
IP := net.ParseIP(ip)
b1 := IP.To4()
iplen := 0
if b1 == nil { //ipv6
b1 = IP.To16()
buf[0] = typeIPv6
iplen = net.IPv6len
} else { //ipv4
buf[0] = typeIPv4
iplen = net.IPv4len
}
copy(buf[1:], b1)
port_i, _ := strconv.Atoi(port)
binary.BigEndian.PutUint16(buf[1+iplen:], uint16(port_i))
return buf[:1+iplen+2], 1 + iplen + 2
}
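As a worked example, a reply from 1.2.3.4:80 produces the 7-byte header 01 01 02 03 04 00 50: typeIPv4, the four IP octets, then the port in big-endian order.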
func Pipeloop(write net.PacketConn, writeAddr net.Addr, readClose net.PacketConn) {
buf := leakyBuf.Get()
defer leakyBuf.Put(buf)
defer readClose.Close()
for {
readClose.SetDeadline(time.Now().Add(udpTimeout))
n, raddr, err := readClose.ReadFrom(buf)
if err != nil {
if ne, ok := err.(*net.OpError); ok {
if ne.Err == syscall.EMFILE || ne.Err == syscall.ENFILE {
// log "too many open files" errors:
// EMFILE means the process hit its open-file limit, ENFILE the system-wide limit
Debug.Println("[udp]read error:", err)
}
}
Debug.Printf("[udp]closed pipe %s<-%s\n", writeAddr, readClose.LocalAddr())
return
}
// need improvement here
if req, ok := reqList.Get(raddr.String()); ok {
write.WriteTo(append(req, buf[:n]...), writeAddr)
} else {
header, hlen := parseHeaderFromAddr(raddr)
write.WriteTo(append(header[:hlen], buf[:n]...), writeAddr)
}
}
}
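Note the asymmetry here: if the reply address has a cached request header in reqList, that header is prefixed verbatim, which preserves a domain-form address the client originally sent; otherwise parseHeaderFromAddr synthesizes an IP-form header from the reply address, and the domain form is lost.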
func handleUDPConnection(handle *SecurePacketConn, n int, src net.Addr, receive []byte) {
var dstIP net.IP
var reqLen int
var ota bool
addrType := receive[idType]
defer leakyBuf.Put(receive)
if addrType&OneTimeAuthMask > 0 {
ota = true
}
receive[idType] &= ^OneTimeAuthMask
compatiblemode := !handle.IsOta() && ota
switch addrType & AddrMask {
case typeIPv4:
reqLen = lenIPv4
if len(receive) < reqLen {
Debug.Println("[udp]invalid received message.")
}
dstIP = net.IP(receive[idIP0 : idIP0+net.IPv4len])
case typeIPv6:
reqLen = lenIPv6
if len(receive) < reqLen {
Debug.Println("[udp]invalid received message.")
}
dstIP = net.IP(receive[idIP0 : idIP0+net.IPv6len])
case typeDm:
reqLen = int(receive[idDmLen]) + lenDmBase
if len(receive) < reqLen {
Debug.Println("[udp]invalid received message.")
}
name := string(receive[idDm0 : idDm0+int(receive[idDmLen])])
// avoid panic: syscall: string with NUL passed to StringToUTF16 on windows.
if strings.ContainsRune(name, 0x00) {
fmt.Println("[udp]invalid domain name.")
return
}
dIP, err := net.ResolveIPAddr("ip", name) // careful with the const type
if err != nil {
Debug.Printf("[udp]failed to resolve domain name: %s\n", string(receive[idDm0:idDm0+receive[idDmLen]]))
return
}
dstIP = dIP.IP
default:
Debug.Printf("[udp]addrType %d not supported", addrType)
return
}
dst := &net.UDPAddr{
IP: dstIP,
Port: int(binary.BigEndian.Uint16(receive[reqLen-2 : reqLen])),
}
if _, ok := reqList.Get(dst.String()); !ok {
req := make([]byte, reqLen)
copy(req, receive)
reqList.Put(dst.String(), req)
}
remote, exist, err := natlist.Get(src.String())
if err != nil {
return
}
if !exist {
Debug.Printf("[udp]new client %s->%s via %s ota=%v\n", src, dst, remote.LocalAddr(), ota)
go func() {
if compatiblemode {
Pipeloop(handle.ForceOTA(), src, remote)
} else {
Pipeloop(handle, src, remote)
}
natlist.Delete(src.String())
}()
} else {
Debug.Printf("[udp]using cached client %s->%s via %s ota=%v\n", src, dst, remote.LocalAddr(), ota)
}
if remote == nil {
fmt.Println("WTF")
}
remote.SetDeadline(time.Now().Add(udpTimeout))
_, err = remote.WriteTo(receive[reqLen:n], dst)
if err != nil {
if ne, ok := err.(*net.OpError); ok && (ne.Err == syscall.EMFILE || ne.Err == syscall.ENFILE) {
// log "too many open files" errors:
// EMFILE means the process hit its open-file limit, ENFILE the system-wide limit
Debug.Println("[udp]write error:", err)
} else {
Debug.Println("[udp]error connecting to:", dst, err)
}
if conn := natlist.Delete(src.String()); conn != nil {
conn.Close()
}
}
// Pipeloop
return
}
func ReadAndHandleUDPReq(c *SecurePacketConn) error {
buf := leakyBuf.Get()
n, src, err := c.ReadFrom(buf[0:])
if err != nil {
return err
}
go handleUDPConnection(c, n, src, buf)
return nil
}
package shadowsocks
import (
"errors"
"fmt"
"os"
"crypto/hmac"
"crypto/sha1"
"encoding/binary"
"errors"
"fmt"
"os"
)
func PrintVersion() {
const version = "1.1.5"
const version = "1.2.0"
fmt.Println("shadowsocks-go version", version)
}
......@@ -57,4 +57,4 @@ func (flag *ClosedFlag) SetClosed() {
func (flag *ClosedFlag) IsClosed() bool {
return flag.flag
}
\ No newline at end of file
}
......@@ -2,6 +2,12 @@
"comment": "",
"ignore": "test",
"package": [
{
"checksumSHA1": "IFJyJgPCjumDG37lEb0lyRBBGZE=",
"path": "github.com/Yawning/chacha20",
"revision": "c91e78db502ff629614837aacb7aa4efa61c651a",
"revisionTime": "2016-04-30T09:49:23Z"
},
{
"checksumSHA1": "QPs3L3mjPoi+a9GJCjW8HhyJczM=",
"path": "github.com/codahale/chacha20",
......@@ -15,10 +21,10 @@
"revisionTime": "2017-01-19T05:34:58Z"
},
{
"checksumSHA1": "b0uHAM/lCGCJ9GeKfClvrMMWXQM=",
"checksumSHA1": "idpL1fpHpfntk74IVfWtkP1PMZs=",
"path": "github.com/ginuerzh/gost",
"revision": "358f57add6087d77b1d978e92e2f7c8073c2f544",
"revisionTime": "2017-01-21T03:14:59Z"
"revision": "321b03712af504981d35a47c50c2cfe4dd788a9d",
"revisionTime": "2017-01-21T03:16:33Z"
},
{
"checksumSHA1": "URsJa4y/sUUw/STmbeYx9EKqaYE=",
......@@ -153,10 +159,10 @@
"revisionTime": "2016-10-02T05:25:12Z"
},
{
"checksumSHA1": "o0WHRL8mNIhfsoWlzhdJ8du6+C8=",
"checksumSHA1": "MRsfMrdZwnnCTfIzT3czcj0lb0s=",
"path": "github.com/shadowsocks/shadowsocks-go/shadowsocks",
"revision": "5c9897ecdf623f385ccb8c2c78e32c5256961b41",
"revisionTime": "2016-06-15T15:25:08Z"
"revision": "97a5c71f80ba5f5b3e549f14a619fe557ff4f3c9",
"revisionTime": "2017-01-21T20:35:16Z"
},
{
"checksumSHA1": "JsJdKXhz87gWenMwBeejTOeNE7k=",
......