nanahira / gost · Commits · 8861ffba

Commit 8861ffba authored Jan 24, 2017 by rui.zheng
Parent: 321b0371

    #73: Add shadowsocks UDP relay support

Showing 22 changed files with 4107 additions and 153 deletions.
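A shadowsocks UDP relay forwards each datagram after stripping a SOCKS5-style target address header (address type, address, big-endian port) from the decrypted payload. As a minimal sketch of that framing, the parser below is illustrative only (the function name and signature are not gost's actual API):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"net"
)

// parseSSUDPHeader parses the SOCKS5-style address header that prefixes a
// (decrypted) shadowsocks UDP payload: ATYP byte, address, big-endian port.
// It returns the target host, port, and the number of header bytes consumed.
func parseSSUDPHeader(b []byte) (host string, port uint16, n int, err error) {
	if len(b) < 1 {
		return "", 0, 0, errors.New("short packet")
	}
	switch b[0] {
	case 0x01: // IPv4: 4 address bytes + 2 port bytes
		if len(b) < 1+4+2 {
			return "", 0, 0, errors.New("short IPv4 header")
		}
		host = net.IP(b[1:5]).String()
		n = 1 + 4
	case 0x03: // domain: 1 length byte + name + 2 port bytes
		if len(b) < 2 || len(b) < 2+int(b[1])+2 {
			return "", 0, 0, errors.New("short domain header")
		}
		host = string(b[2 : 2+int(b[1])])
		n = 2 + int(b[1])
	case 0x04: // IPv6: 16 address bytes + 2 port bytes
		if len(b) < 1+16+2 {
			return "", 0, 0, errors.New("short IPv6 header")
		}
		host = net.IP(b[1:17]).String()
		n = 1 + 16
	default:
		return "", 0, 0, errors.New("unknown address type")
	}
	port = binary.BigEndian.Uint16(b[n : n+2])
	return host, port, n + 2, nil
}

func main() {
	// ATYP=1 (IPv4), addr 8.8.8.8, port 53, then the payload bytes.
	pkt := []byte{0x01, 8, 8, 8, 8, 0x00, 53, 'p', 'a', 'y'}
	host, port, n, err := parseSSUDPHeader(pkt)
	fmt.Println(host, port, n, err) // 8.8.8.8 53 7 <nil>
}
```

The relay in the vendored udprelay.go keeps a per-client NAT map so replies from the target can be re-framed and sent back to the originating client address.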
cmd/gost/vendor/github.com/Yawning/chacha20/LICENSE                            +122   -0
cmd/gost/vendor/github.com/Yawning/chacha20/README.md                          +14    -0
cmd/gost/vendor/github.com/Yawning/chacha20/chacha20.go                        +273   -0
cmd/gost/vendor/github.com/Yawning/chacha20/chacha20_amd64.go                  +95    -0
cmd/gost/vendor/github.com/Yawning/chacha20/chacha20_amd64.py                  +1303  -0
cmd/gost/vendor/github.com/Yawning/chacha20/chacha20_amd64.s                   +1187  -0
cmd/gost/vendor/github.com/Yawning/chacha20/chacha20_ref.go                    +392   -0
cmd/gost/vendor/github.com/ginuerzh/gost/forward.go                            +22    -19
cmd/gost/vendor/github.com/ginuerzh/gost/gost.go                               +1     -1
cmd/gost/vendor/github.com/ginuerzh/gost/node.go                               +1     -1
cmd/gost/vendor/github.com/ginuerzh/gost/server.go                             +7     -3
cmd/gost/vendor/github.com/ginuerzh/gost/ss.go                                 +107   -41
cmd/gost/vendor/github.com/shadowsocks/shadowsocks-go/shadowsocks/encrypt.go   +28    -12
cmd/gost/vendor/github.com/shadowsocks/shadowsocks-go/shadowsocks/udp.go       +135   -0
cmd/gost/vendor/github.com/shadowsocks/shadowsocks-go/shadowsocks/udprelay.go  +265   -0
cmd/gost/vendor/github.com/shadowsocks/shadowsocks-go/shadowsocks/util.go      +5     -5
cmd/gost/vendor/vendor.json                                                    +12    -6
forward.go                                                                     +22    -19
gost.go                                                                        +1     -1
node.go                                                                        +1     -1
server.go                                                                      +7     -3
ss.go                                                                          +107   -41
cmd/gost/vendor/github.com/Yawning/chacha20/LICENSE  0 → 100644
Creative Commons Legal Code
CC0 1.0 Universal
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
HEREUNDER.
Statement of Purpose
The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator
and subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").
Certain owners wish to permanently relinquish those rights to a Work for
the purpose of contributing to a commons of creative, cultural and
scientific works ("Commons") that the public can reliably and without fear
of later claims of infringement build upon, modify, incorporate in other
works, reuse and redistribute as freely as possible in any form whatsoever
and for any purposes, including without limitation commercial purposes.
These owners may contribute to the Commons to promote the ideal of a free
culture and the further production of creative, cultural and scientific
works, or to gain reputation or greater distribution for their Work in
part through the use and efforts of others.
For these and/or other purposes and motivations, and without any
expectation of additional consideration or compensation, the person
associating CC0 with a Work (the "Affirmer"), to the extent that he or she
is an owner of Copyright and Related Rights in the Work, voluntarily
elects to apply CC0 to the Work and publicly distribute the Work under its
terms, with knowledge of his or her Copyright and Related Rights in the
Work and the meaning and intended legal effect of CC0 on those rights.
1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not
limited to, the following:
i. the right to reproduce, adapt, distribute, perform, display,
communicate, and translate a Work;
ii. moral rights retained by the original author(s) and/or performer(s);
iii. publicity and privacy rights pertaining to a person's image or
likeness depicted in a Work;
iv. rights protecting against unfair competition in regards to a Work,
subject to the limitations in paragraph 4(a), below;
v. rights protecting the extraction, dissemination, use and reuse of data
in a Work;
vi. database rights (such as those arising under Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal
protection of databases, and under any national implementation
thereof, including any amended or successor version of such
directive); and
vii. other similar, equivalent or corresponding rights throughout the
world based on applicable law or treaty, and any national
implementations thereof.
2. Waiver. To the greatest extent permitted by, but not in contravention
of, applicable law, Affirmer hereby overtly, fully, permanently,
irrevocably and unconditionally waives, abandons, and surrenders all of
Affirmer's Copyright and Related Rights and associated claims and causes
of action, whether now known or unknown (including existing as well as
future claims and causes of action), in the Work (i) in all territories
worldwide, (ii) for the maximum duration provided by applicable law or
treaty (including future time extensions), (iii) in any current or future
medium and for any number of copies, and (iv) for any purpose whatsoever,
including without limitation commercial, advertising or promotional
purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
member of the public at large and to the detriment of Affirmer's heirs and
successors, fully intending that such Waiver shall not be subject to
revocation, rescission, cancellation, termination, or any other legal or
equitable action to disrupt the quiet enjoyment of the Work by the public
as contemplated by Affirmer's express Statement of Purpose.
3. Public License Fallback. Should any part of the Waiver for any reason
be judged legally invalid or ineffective under applicable law, then the
Waiver shall be preserved to the maximum extent permitted taking into
account Affirmer's express Statement of Purpose. In addition, to the
extent the Waiver is so judged Affirmer hereby grants to each affected
person a royalty-free, non transferable, non sublicensable, non exclusive,
irrevocable and unconditional license to exercise Affirmer's Copyright and
Related Rights in the Work (i) in all territories worldwide, (ii) for the
maximum duration provided by applicable law or treaty (including future
time extensions), (iii) in any current or future medium and for any number
of copies, and (iv) for any purpose whatsoever, including without
limitation commercial, advertising or promotional purposes (the
"License"). The License shall be deemed effective as of the date CC0 was
applied by Affirmer to the Work. Should any part of the License for any
reason be judged legally invalid or ineffective under applicable law, such
partial invalidity or ineffectiveness shall not invalidate the remainder
of the License, and in such case Affirmer hereby affirms that he or she
will not (i) exercise any of his or her remaining Copyright and Related
Rights in the Work or (ii) assert any associated claims and causes of
action with respect to the Work, in either case contrary to Affirmer's
express Statement of Purpose.
4. Limitations and Disclaimers.
a. No trademark or patent rights held by Affirmer are waived, abandoned,
surrendered, licensed or otherwise affected by this document.
b. Affirmer offers the Work as-is and makes no representations or
warranties of any kind concerning the Work, express, implied,
statutory or otherwise, including without limitation warranties of
title, merchantability, fitness for a particular purpose, non
infringement, or the absence of latent or other defects, accuracy, or
the present or absence of errors, whether or not discoverable, all to
the greatest extent permissible under applicable law.
c. Affirmer disclaims responsibility for clearing rights of other persons
that may apply to the Work or any use thereof, including without
limitation any person's Copyright and Related Rights in the Work.
Further, Affirmer disclaims responsibility for obtaining any necessary
consents, permissions or other rights required for any use of the
Work.
d. Affirmer understands and acknowledges that Creative Commons is not a
party to this document and has no duty or obligation with respect to
this CC0 or use of the Work.
cmd/gost/vendor/github.com/Yawning/chacha20/README.md  0 → 100644
### chacha20 - ChaCha20
#### Yawning Angel (yawning at schwanenlied dot me)

Yet another Go ChaCha20 implementation. Everything else I found was slow,
didn't support all the variants I need to use, or relied on cgo to go fast.

Features:

 * 20 round, 256 bit key only. Everything else is pointless and stupid.
 * IETF 96 bit nonce variant.
 * XChaCha 24 byte nonce variant.
 * SSE2 and AVX2 support on amd64 targets.
 * Incremental encrypt/decrypt support, unlike golang.org/x/crypto/salsa20.
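All of the variants listed above share the same core primitive, the ChaCha quarter round. A standalone pure-Go sketch (independent of this package, which has its own reference and assembly back ends), checked against the test vector in RFC 7539 §2.1.1:

```go
package main

import (
	"fmt"
	"math/bits"
)

// quarterRound is the ChaCha quarter round applied to four state words.
func quarterRound(a, b, c, d uint32) (uint32, uint32, uint32, uint32) {
	a += b
	d = bits.RotateLeft32(d^a, 16)
	c += d
	b = bits.RotateLeft32(b^c, 12)
	a += b
	d = bits.RotateLeft32(d^a, 8)
	c += d
	b = bits.RotateLeft32(b^c, 7)
	return a, b, c, d
}

func main() {
	// RFC 7539 §2.1.1 test vector; the expected result is
	// ea2a92f4 cb1cf8ce 4581472e 5881c4bb.
	a, b, c, d := quarterRound(0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567)
	fmt.Printf("%08x %08x %08x %08x\n", a, b, c, d)
}
```

Note that `math/bits` postdates this 2017 vendored snapshot; the vendored code instead open-codes its rotates.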
cmd/gost/vendor/github.com/Yawning/chacha20/chacha20.go  0 → 100644
// chacha20.go - A ChaCha stream cipher implementation.
//
// To the extent possible under law, Yawning Angel has waived all copyright
// and related or neighboring rights to chacha20, using the Creative
// Commons "CC0" public domain dedication. See LICENSE or
// <http://creativecommons.org/publicdomain/zero/1.0/> for full details.

package chacha20

import (
	"crypto/cipher"
	"encoding/binary"
	"errors"
	"math"
	"runtime"
)

const (
	// KeySize is the ChaCha20 key size in bytes.
	KeySize = 32

	// NonceSize is the ChaCha20 nonce size in bytes.
	NonceSize = 8

	// INonceSize is the IETF ChaCha20 nonce size in bytes.
	INonceSize = 12

	// XNonceSize is the XChaCha20 nonce size in bytes.
	XNonceSize = 24

	// HNonceSize is the HChaCha20 nonce size in bytes.
	HNonceSize = 16

	// BlockSize is the ChaCha20 block size in bytes.
	BlockSize = 64

	stateSize    = 16
	chachaRounds = 20

	// The constant "expand 32-byte k" as little endian uint32s.
	sigma0 = uint32(0x61707865)
	sigma1 = uint32(0x3320646e)
	sigma2 = uint32(0x79622d32)
	sigma3 = uint32(0x6b206574)
)

var (
	// ErrInvalidKey is the error returned when the key is invalid.
	ErrInvalidKey = errors.New("key length must be KeySize bytes")

	// ErrInvalidNonce is the error returned when the nonce is invalid.
	ErrInvalidNonce = errors.New("nonce length must be NonceSize/INonceSize/XNonceSize bytes")

	// ErrInvalidCounter is the error returned when the counter is invalid.
	ErrInvalidCounter = errors.New("block counter is invalid (out of range)")

	useUnsafe    = false
	usingVectors = false
	blocksFn     = blocksRef
)

// A Cipher is an instance of ChaCha20/XChaCha20 using a particular key and
// nonce.
type Cipher struct {
	state [stateSize]uint32

	buf  [BlockSize]byte
	off  int
	ietf bool
}

// Reset zeros the key data so that it will no longer appear in the process's
// memory.
func (c *Cipher) Reset() {
	for i := range c.state {
		c.state[i] = 0
	}
	for i := range c.buf {
		c.buf[i] = 0
	}
}

// XORKeyStream sets dst to the result of XORing src with the key stream. Dst
// and src may be the same slice but otherwise should not overlap.
func (c *Cipher) XORKeyStream(dst, src []byte) {
	if len(dst) < len(src) {
		src = src[:len(dst)]
	}

	for remaining := len(src); remaining > 0; {
		// Process multiple blocks at once.
		if c.off == BlockSize {
			nrBlocks := remaining / BlockSize
			directBytes := nrBlocks * BlockSize
			if nrBlocks > 0 {
				blocksFn(&c.state, src, dst, nrBlocks, c.ietf)
				remaining -= directBytes
				if remaining == 0 {
					return
				}
				dst = dst[directBytes:]
				src = src[directBytes:]
			}

			// If there's a partial block, generate 1 block of keystream into
			// the internal buffer.
			blocksFn(&c.state, nil, c.buf[:], 1, c.ietf)
			c.off = 0
		}

		// Process partial blocks from the buffered keystream.
		toXor := BlockSize - c.off
		if remaining < toXor {
			toXor = remaining
		}
		if toXor > 0 {
			for i, v := range src[:toXor] {
				dst[i] = v ^ c.buf[c.off+i]
			}
			dst = dst[toXor:]
			src = src[toXor:]

			remaining -= toXor
			c.off += toXor
		}
	}
}

// KeyStream sets dst to the raw keystream.
func (c *Cipher) KeyStream(dst []byte) {
	for remaining := len(dst); remaining > 0; {
		// Process multiple blocks at once.
		if c.off == BlockSize {
			nrBlocks := remaining / BlockSize
			directBytes := nrBlocks * BlockSize
			if nrBlocks > 0 {
				blocksFn(&c.state, nil, dst, nrBlocks, c.ietf)
				remaining -= directBytes
				if remaining == 0 {
					return
				}
				dst = dst[directBytes:]
			}

			// If there's a partial block, generate 1 block of keystream into
			// the internal buffer.
			blocksFn(&c.state, nil, c.buf[:], 1, c.ietf)
			c.off = 0
		}

		// Process partial blocks from the buffered keystream.
		toCopy := BlockSize - c.off
		if remaining < toCopy {
			toCopy = remaining
		}
		if toCopy > 0 {
			copy(dst[:toCopy], c.buf[c.off:c.off+toCopy])
			dst = dst[toCopy:]
			remaining -= toCopy
			c.off += toCopy
		}
	}
}

// ReKey reinitializes the ChaCha20/XChaCha20 instance with the provided key
// and nonce.
func (c *Cipher) ReKey(key, nonce []byte) error {
	if len(key) != KeySize {
		return ErrInvalidKey
	}

	switch len(nonce) {
	case NonceSize:
	case INonceSize:
	case XNonceSize:
		var subkey [KeySize]byte
		var subnonce [HNonceSize]byte
		copy(subnonce[:], nonce[0:16])
		HChaCha(key, &subnonce, &subkey)
		key = subkey[:]
		nonce = nonce[16:24]
		defer func() {
			for i := range subkey {
				subkey[i] = 0
			}
		}()
	default:
		return ErrInvalidNonce
	}

	c.Reset()
	c.state[0] = sigma0
	c.state[1] = sigma1
	c.state[2] = sigma2
	c.state[3] = sigma3
	c.state[4] = binary.LittleEndian.Uint32(key[0:4])
	c.state[5] = binary.LittleEndian.Uint32(key[4:8])
	c.state[6] = binary.LittleEndian.Uint32(key[8:12])
	c.state[7] = binary.LittleEndian.Uint32(key[12:16])
	c.state[8] = binary.LittleEndian.Uint32(key[16:20])
	c.state[9] = binary.LittleEndian.Uint32(key[20:24])
	c.state[10] = binary.LittleEndian.Uint32(key[24:28])
	c.state[11] = binary.LittleEndian.Uint32(key[28:32])
	c.state[12] = 0
	if len(nonce) == INonceSize {
		c.state[13] = binary.LittleEndian.Uint32(nonce[0:4])
		c.state[14] = binary.LittleEndian.Uint32(nonce[4:8])
		c.state[15] = binary.LittleEndian.Uint32(nonce[8:12])
		c.ietf = true
	} else {
		c.state[13] = 0
		c.state[14] = binary.LittleEndian.Uint32(nonce[0:4])
		c.state[15] = binary.LittleEndian.Uint32(nonce[4:8])
		c.ietf = false
	}
	c.off = BlockSize
	return nil
}

// Seek sets the block counter to a given offset.
func (c *Cipher) Seek(blockCounter uint64) error {
	if c.ietf {
		if blockCounter > math.MaxUint32 {
			return ErrInvalidCounter
		}
		c.state[12] = uint32(blockCounter)
	} else {
		c.state[12] = uint32(blockCounter)
		c.state[13] = uint32(blockCounter >> 32)
	}
	c.off = BlockSize
	return nil
}

// NewCipher returns a new ChaCha20/XChaCha20 instance.
func NewCipher(key, nonce []byte) (*Cipher, error) {
	c := new(Cipher)
	if err := c.ReKey(key, nonce); err != nil {
		return nil, err
	}
	return c, nil
}

// HChaCha is the HChaCha20 hash function used to make XChaCha.
func HChaCha(key []byte, nonce *[HNonceSize]byte, out *[32]byte) {
	var x [stateSize]uint32 // Last 4 slots unused, sigma hardcoded.
	x[0] = binary.LittleEndian.Uint32(key[0:4])
	x[1] = binary.LittleEndian.Uint32(key[4:8])
	x[2] = binary.LittleEndian.Uint32(key[8:12])
	x[3] = binary.LittleEndian.Uint32(key[12:16])
	x[4] = binary.LittleEndian.Uint32(key[16:20])
	x[5] = binary.LittleEndian.Uint32(key[20:24])
	x[6] = binary.LittleEndian.Uint32(key[24:28])
	x[7] = binary.LittleEndian.Uint32(key[28:32])
	x[8] = binary.LittleEndian.Uint32(nonce[0:4])
	x[9] = binary.LittleEndian.Uint32(nonce[4:8])
	x[10] = binary.LittleEndian.Uint32(nonce[8:12])
	x[11] = binary.LittleEndian.Uint32(nonce[12:16])
	hChaChaRef(&x, out)
}

func init() {
	switch runtime.GOARCH {
	case "386", "amd64":
		// Abuse unsafe to skip calling binary.LittleEndian.PutUint32
		// in the critical path. This is a big boost on systems that are
		// little endian and not overly picky about alignment.
		useUnsafe = true
	}
}

var _ cipher.Stream = (*Cipher)(nil)
cmd/gost/vendor/github.com/Yawning/chacha20/chacha20_amd64.go  0 → 100644
// chacha20_amd64.go - AMD64 optimized chacha20.
//
// To the extent possible under law, Yawning Angel has waived all copyright
// and related or neighboring rights to chacha20, using the Creative
// Commons "CC0" public domain dedication. See LICENSE or
// <http://creativecommons.org/publicdomain/zero/1.0/> for full details.

// +build amd64,!gccgo,!appengine

package chacha20

import (
	"math"
)

var usingAVX2 = false

func blocksAmd64SSE2(x *uint32, inp, outp *byte, nrBlocks uint)

func blocksAmd64AVX2(x *uint32, inp, outp *byte, nrBlocks uint)

func cpuidAmd64(cpuidParams *uint32)

func xgetbv0Amd64(xcrVec *uint32)

func blocksAmd64(x *[stateSize]uint32, in []byte, out []byte, nrBlocks int, isIetf bool) {
	// Probably unneeded, but stating this explicitly simplifies the assembly.
	if nrBlocks == 0 {
		return
	}

	if isIetf {
		var totalBlocks uint64
		totalBlocks = uint64(x[8]) + uint64(nrBlocks)
		if totalBlocks > math.MaxUint32 {
			panic("chacha20: Exceeded keystream per nonce limit")
		}
	}

	if in == nil {
		for i := range out {
			out[i] = 0
		}
		in = out
	}

	// Pointless to call the AVX2 code for just a single block, since half of
	// the output gets discarded...
	if usingAVX2 && nrBlocks > 1 {
		blocksAmd64AVX2(&x[0], &in[0], &out[0], uint(nrBlocks))
	} else {
		blocksAmd64SSE2(&x[0], &in[0], &out[0], uint(nrBlocks))
	}
}

func supportsAVX2() bool {
	// https://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family
	const (
		osXsaveBit = 1 << 27
		avx2Bit    = 1 << 5
	)

	// Check to see if CPUID actually supports the leaf that indicates AVX2.
	// CPUID.(EAX=0H, ECX=0H) >= 7
	regs := [4]uint32{0x00}
	cpuidAmd64(&regs[0])
	if regs[0] < 7 {
		return false
	}

	// Check to see if the OS knows how to save/restore XMM/YMM state.
	// CPUID.(EAX=01H, ECX=0H):ECX.OSXSAVE[bit 27]==1
	regs = [4]uint32{0x01}
	cpuidAmd64(&regs[0])
	if regs[2]&osXsaveBit == 0 {
		return false
	}
	xcrRegs := [2]uint32{}
	xgetbv0Amd64(&xcrRegs[0])
	if xcrRegs[0]&6 != 6 {
		return false
	}

	// Check for AVX2 support.
	// CPUID.(EAX=07H, ECX=0H):EBX.AVX2[bit 5]==1
	regs = [4]uint32{0x07}
	cpuidAmd64(&regs[0])
	return regs[1]&avx2Bit != 0
}

func init() {
	blocksFn = blocksAmd64
	usingVectors = true
	usingAVX2 = supportsAVX2()
}
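supportsAVX2 keys off three facts: CPUID must implement leaf 7, the OS must advertise OSXSAVE with XCR0 enabling both XMM and YMM state (low bits 0x6), and leaf 7 EBX bit 5 must report AVX2. The pure decision logic can be isolated with the CPUID/XGETBV reads stubbed out as inputs (function and parameter names below are illustrative, not part of the vendored code):

```go
package main

import "fmt"

const (
	osXsaveBit = 1 << 27 // CPUID.(EAX=01H):ECX.OSXSAVE
	avx2Bit    = 1 << 5  // CPUID.(EAX=07H):EBX.AVX2
)

// avx2Usable mirrors the checks in supportsAVX2, given raw register values
// instead of executing CPUID/XGETBV.
func avx2Usable(maxLeaf, leaf1ECX, xcr0, leaf7EBX uint32) bool {
	if maxLeaf < 7 {
		return false // leaf 7 (structured extended features) not implemented
	}
	if leaf1ECX&osXsaveBit == 0 {
		return false // OS does not use XSAVE/XRSTOR
	}
	if xcr0&6 != 6 {
		return false // XCR0 must enable both SSE (bit 1) and AVX (bit 2) state
	}
	return leaf7EBX&avx2Bit != 0
}

func main() {
	fmt.Println(avx2Usable(13, 1<<27, 7, 1<<5)) // all checks pass: true
	fmt.Println(avx2Usable(13, 1<<27, 2, 1<<5)) // YMM state not enabled: false
}
```

The XCR0 check matters because a CPU can support AVX2 while the kernel never enables YMM save/restore, in which case executing AVX2 instructions would fault.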
cmd/gost/vendor/github.com/Yawning/chacha20/chacha20_amd64.py  0 → 100644
#!/usr/bin/env python3
#
# To the extent possible under law, Yawning Angel has waived all copyright
# and related or neighboring rights to chacha20, using the Creative
# Commons "CC0" public domain dedication. See LICENSE or
# <http://creativecommons.org/publicdomain/zero/1.0/> for full details.
#
# cgo sucks. Plan 9 assembly sucks. Real languages have SIMD intrinsics.
# The least terrible/retarded option is to use a Python code generator, so
# that's what I did.
#
# Code based on Ted Krovetz's vec128 C implementation, with corrections
# to use a 64 bit counter instead of 32 bit, and to allow unaligned input and
# output pointers.
#
# Dependencies: https://github.com/Maratyszcza/PeachPy
#
# python3 -m peachpy.x86_64 -mabi=goasm -S -o chacha20_amd64.s chacha20_amd64.py
#
from
peachpy
import
*
from
peachpy.x86_64
import
*
x
=
Argument
(
ptr
(
uint32_t
))
inp
=
Argument
(
ptr
(
const_uint8_t
))
outp
=
Argument
(
ptr
(
uint8_t
))
nrBlocks
=
Argument
(
ptr
(
size_t
))
#
# SSE2 helper functions. A temporary register is explicitly passed in because
# the main fast loop uses every single register (and even spills) so manual
# control is needed.
#
# This used to also have a DQROUNDS helper that did 2 rounds of ChaCha like
# in the C code, but the C code has the luxury of an optimizer reordering
# everything, while this does not.
#
def
ROTW16_sse2
(
tmp
,
d
):
MOVDQA
(
tmp
,
d
)
PSLLD
(
tmp
,
16
)
PSRLD
(
d
,
16
)
PXOR
(
d
,
tmp
)
def
ROTW12_sse2
(
tmp
,
b
):
MOVDQA
(
tmp
,
b
)
PSLLD
(
tmp
,
12
)
PSRLD
(
b
,
20
)
PXOR
(
b
,
tmp
)
def
ROTW8_sse2
(
tmp
,
d
):
MOVDQA
(
tmp
,
d
)
PSLLD
(
tmp
,
8
)
PSRLD
(
d
,
24
)
PXOR
(
d
,
tmp
)
def
ROTW7_sse2
(
tmp
,
b
):
MOVDQA
(
tmp
,
b
)
PSLLD
(
tmp
,
7
)
PSRLD
(
b
,
25
)
PXOR
(
b
,
tmp
)
def
WriteXor_sse2
(
tmp
,
inp
,
outp
,
d
,
v0
,
v1
,
v2
,
v3
):
MOVDQU
(
tmp
,
[
inp
+
d
])
PXOR
(
tmp
,
v0
)
MOVDQU
([
outp
+
d
],
tmp
)
MOVDQU
(
tmp
,
[
inp
+
d
+
16
])
PXOR
(
tmp
,
v1
)
MOVDQU
([
outp
+
d
+
16
],
tmp
)
MOVDQU
(
tmp
,
[
inp
+
d
+
32
])
PXOR
(
tmp
,
v2
)
MOVDQU
([
outp
+
d
+
32
],
tmp
)
MOVDQU
(
tmp
,
[
inp
+
d
+
48
])
PXOR
(
tmp
,
v3
)
MOVDQU
([
outp
+
d
+
48
],
tmp
)
# SSE2 ChaCha20 (aka vec128). Does not handle partial blocks, and will
# process 4/2/1 blocks at a time. x (the ChaCha20 state) must be 16 byte
# aligned.
with
Function
(
"blocksAmd64SSE2"
,
(
x
,
inp
,
outp
,
nrBlocks
)):
reg_x
=
GeneralPurposeRegister64
()
reg_inp
=
GeneralPurposeRegister64
()
reg_outp
=
GeneralPurposeRegister64
()
reg_blocks
=
GeneralPurposeRegister64
()
reg_sp_save
=
GeneralPurposeRegister64
()
LOAD
.
ARGUMENT
(
reg_x
,
x
)
LOAD
.
ARGUMENT
(
reg_inp
,
inp
)
LOAD
.
ARGUMENT
(
reg_outp
,
outp
)
LOAD
.
ARGUMENT
(
reg_blocks
,
nrBlocks
)
# Align the stack to a 32 byte boundary.
reg_align
=
GeneralPurposeRegister64
()
MOV
(
reg_sp_save
,
registers
.
rsp
)
MOV
(
reg_align
,
0x1f
)
NOT
(
reg_align
)
AND
(
registers
.
rsp
,
reg_align
)
SUB
(
registers
.
rsp
,
0x20
)
# Build the counter increment vector on the stack, and allocate the scratch
# space
xmm_v0
=
XMMRegister
()
PXOR
(
xmm_v0
,
xmm_v0
)
SUB
(
registers
.
rsp
,
16
+
16
)
MOVDQA
([
registers
.
rsp
],
xmm_v0
)
reg_tmp
=
GeneralPurposeRegister32
()
MOV
(
reg_tmp
,
0x00000001
)
MOV
([
registers
.
rsp
],
reg_tmp
)
mem_one
=
[
registers
.
rsp
]
# (Stack) Counter increment vector
mem_tmp0
=
[
registers
.
rsp
+
16
]
# (Stack) Scratch space.
mem_s0
=
[
reg_x
]
# (Memory) Cipher state [0..3]
mem_s1
=
[
reg_x
+
16
]
# (Memory) Cipher state [4..7]
mem_s2
=
[
reg_x
+
32
]
# (Memory) Cipher state [8..11]
mem_s3
=
[
reg_x
+
48
]
# (Memory) Cipher state [12..15]
# xmm_v0 allocated above...
xmm_v1
=
XMMRegister
()
xmm_v2
=
XMMRegister
()
xmm_v3
=
XMMRegister
()
xmm_v4
=
XMMRegister
()
xmm_v5
=
XMMRegister
()
xmm_v6
=
XMMRegister
()
xmm_v7
=
XMMRegister
()
xmm_v8
=
XMMRegister
()
xmm_v9
=
XMMRegister
()
xmm_v10
=
XMMRegister
()
xmm_v11
=
XMMRegister
()
xmm_v12
=
XMMRegister
()
xmm_v13
=
XMMRegister
()
xmm_v14
=
XMMRegister
()
xmm_v15
=
XMMRegister
()
xmm_tmp
=
xmm_v12
#
# 4 blocks at a time.
#
vector_loop4
=
Loop
()
SUB
(
reg_blocks
,
4
)
JB
(
vector_loop4
.
end
)
with
vector_loop4
:
MOVDQA
(
xmm_v0
,
mem_s0
)
MOVDQA
(
xmm_v1
,
mem_s1
)
MOVDQA
(
xmm_v2
,
mem_s2
)
MOVDQA
(
xmm_v3
,
mem_s3
)
MOVDQA
(
xmm_v4
,
xmm_v0
)
MOVDQA
(
xmm_v5
,
xmm_v1
)
MOVDQA
(
xmm_v6
,
xmm_v2
)
MOVDQA
(
xmm_v7
,
xmm_v3
)
PADDQ
(
xmm_v7
,
mem_one
)
MOVDQA
(
xmm_v8
,
xmm_v0
)
MOVDQA
(
xmm_v9
,
xmm_v1
)
MOVDQA
(
xmm_v10
,
xmm_v2
)
MOVDQA
(
xmm_v11
,
xmm_v7
)
PADDQ
(
xmm_v11
,
mem_one
)
MOVDQA
(
xmm_v12
,
xmm_v0
)
MOVDQA
(
xmm_v13
,
xmm_v1
)
MOVDQA
(
xmm_v14
,
xmm_v2
)
MOVDQA
(
xmm_v15
,
xmm_v11
)
PADDQ
(
xmm_v15
,
mem_one
)
reg_rounds
=
GeneralPurposeRegister64
()
MOV
(
reg_rounds
,
20
)
rounds_loop4
=
Loop
()
with
rounds_loop4
:
# a += b; d ^= a; d = ROTW16(d);
PADDD
(
xmm_v0
,
xmm_v1
)
PADDD
(
xmm_v4
,
xmm_v5
)
PADDD
(
xmm_v8
,
xmm_v9
)
PADDD
(
xmm_v12
,
xmm_v13
)
PXOR
(
xmm_v3
,
xmm_v0
)
PXOR
(
xmm_v7
,
xmm_v4
)
PXOR
(
xmm_v11
,
xmm_v8
)
PXOR
(
xmm_v15
,
xmm_v12
)
MOVDQA
(
mem_tmp0
,
xmm_tmp
)
# Save
ROTW16_sse2
(
xmm_tmp
,
xmm_v3
)
ROTW16_sse2
(
xmm_tmp
,
xmm_v7
)
ROTW16_sse2
(
xmm_tmp
,
xmm_v11
)
ROTW16_sse2
(
xmm_tmp
,
xmm_v15
)
# c += d; b ^= c; b = ROTW12(b);
PADDD
(
xmm_v2
,
xmm_v3
)
PADDD
(
xmm_v6
,
xmm_v7
)
PADDD
(
xmm_v10
,
xmm_v11
)
PADDD
(
xmm_v14
,
xmm_v15
)
PXOR
(
xmm_v1
,
xmm_v2
)
PXOR
(
xmm_v5
,
xmm_v6
)
PXOR
(
xmm_v9
,
xmm_v10
)
PXOR
(
xmm_v13
,
xmm_v14
)
ROTW12_sse2
(
xmm_tmp
,
xmm_v1
)
ROTW12_sse2
(
xmm_tmp
,
xmm_v5
)
ROTW12_sse2
(
xmm_tmp
,
xmm_v9
)
ROTW12_sse2
(
xmm_tmp
,
xmm_v13
)
# a += b; d ^= a; d = ROTW8(d);
MOVDQA
(
xmm_tmp
,
mem_tmp0
)
# Restore
PADDD
(
xmm_v0
,
xmm_v1
)
PADDD
(
xmm_v4
,
xmm_v5
)
PADDD
(
xmm_v8
,
xmm_v9
)
PADDD
(
xmm_v12
,
xmm_v13
)
PXOR
(
xmm_v3
,
xmm_v0
)
PXOR
(
xmm_v7
,
xmm_v4
)
PXOR
(
xmm_v11
,
xmm_v8
)
PXOR
(
xmm_v15
,
xmm_v12
)
MOVDQA
(
mem_tmp0
,
xmm_tmp
)
# Save
ROTW8_sse2
(
xmm_tmp
,
xmm_v3
)
ROTW8_sse2
(
xmm_tmp
,
xmm_v7
)
ROTW8_sse2
(
xmm_tmp
,
xmm_v11
)
ROTW8_sse2
(
xmm_tmp
,
xmm_v15
)
# c += d; b ^= c; b = ROTW7(b)
PADDD
(
xmm_v2
,
xmm_v3
)
PADDD
(
xmm_v6
,
xmm_v7
)
PADDD
(
xmm_v10
,
xmm_v11
)
PADDD
(
xmm_v14
,
xmm_v15
)
PXOR
(
xmm_v1
,
xmm_v2
)
PXOR
(
xmm_v5
,
xmm_v6
)
PXOR
(
xmm_v9
,
xmm_v10
)
PXOR
(
xmm_v13
,
xmm_v14
)
ROTW7_sse2
(
xmm_tmp
,
xmm_v1
)
ROTW7_sse2
(
xmm_tmp
,
xmm_v5
)
ROTW7_sse2
(
xmm_tmp
,
xmm_v9
)
ROTW7_sse2
(
xmm_tmp
,
xmm_v13
)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
PSHUFD
(
xmm_v1
,
xmm_v1
,
0x39
)
PSHUFD
(
xmm_v5
,
xmm_v5
,
0x39
)
PSHUFD
(
xmm_v9
,
xmm_v9
,
0x39
)
PSHUFD
(
xmm_v13
,
xmm_v13
,
0x39
)
PSHUFD
(
xmm_v2
,
xmm_v2
,
0x4e
)
PSHUFD
(
xmm_v6
,
xmm_v6
,
0x4e
)
PSHUFD
(
xmm_v10
,
xmm_v10
,
0x4e
)
PSHUFD
(
xmm_v14
,
xmm_v14
,
0x4e
)
PSHUFD
(
xmm_v3
,
xmm_v3
,
0x93
)
PSHUFD
(
xmm_v7
,
xmm_v7
,
0x93
)
PSHUFD
(
xmm_v11
,
xmm_v11
,
0x93
)
PSHUFD
(
xmm_v15
,
xmm_v15
,
0x93
)
MOVDQA
(
xmm_tmp
,
mem_tmp0
)
# Restore
# a += b; d ^= a; d = ROTW16(d);
PADDD
(
xmm_v0
,
xmm_v1
)
PADDD
(
xmm_v4
,
xmm_v5
)
PADDD
(
xmm_v8
,
xmm_v9
)
PADDD
(
xmm_v12
,
xmm_v13
)
PXOR
(
xmm_v3
,
xmm_v0
)
PXOR
(
xmm_v7
,
xmm_v4
)
PXOR
(
xmm_v11
,
xmm_v8
)
PXOR
(
xmm_v15
,
xmm_v12
)
MOVDQA
(
mem_tmp0
,
xmm_tmp
)
# Save
ROTW16_sse2
(
xmm_tmp
,
xmm_v3
)
ROTW16_sse2
(
xmm_tmp
,
xmm_v7
)
ROTW16_sse2
(
xmm_tmp
,
xmm_v11
)
ROTW16_sse2
(
xmm_tmp
,
xmm_v15
)
# c += d; b ^= c; b = ROTW12(b);
PADDD
(
xmm_v2
,
xmm_v3
)
PADDD
(
xmm_v6
,
xmm_v7
)
PADDD
(
xmm_v10
,
xmm_v11
)
PADDD
(
xmm_v14
,
xmm_v15
)
PXOR
(
xmm_v1
,
xmm_v2
)
PXOR
(
xmm_v5
,
xmm_v6
)
PXOR
(
xmm_v9
,
xmm_v10
)
PXOR
(
xmm_v13
,
xmm_v14
)
ROTW12_sse2
(
xmm_tmp
,
xmm_v1
)
ROTW12_sse2
(
xmm_tmp
,
xmm_v5
)
ROTW12_sse2
(
xmm_tmp
,
xmm_v9
)
ROTW12_sse2
(
xmm_tmp
,
xmm_v13
)
# a += b; d ^= a; d = ROTW8(d);
MOVDQA
(
xmm_tmp
,
mem_tmp0
)
# Restore
PADDD
(
xmm_v0
,
xmm_v1
)
PADDD
(
xmm_v4
,
xmm_v5
)
PADDD
(
xmm_v8
,
xmm_v9
)
PADDD
(
xmm_v12
,
xmm_v13
)
PXOR
(
xmm_v3
,
xmm_v0
)
PXOR
(
xmm_v7
,
xmm_v4
)
PXOR
(
xmm_v11
,
xmm_v8
)
PXOR
(
xmm_v15
,
xmm_v12
)
MOVDQA
(
mem_tmp0
,
xmm_tmp
)
# Save
ROTW8_sse2
(
xmm_tmp
,
xmm_v3
)
ROTW8_sse2
(
xmm_tmp
,
xmm_v7
)
ROTW8_sse2
(
xmm_tmp
,
xmm_v11
)
ROTW8_sse2
(
xmm_tmp
,
xmm_v15
)
# c += d; b ^= c; b = ROTW7(b)
PADDD
(
xmm_v2
,
xmm_v3
)
PADDD
(
xmm_v6
,
xmm_v7
)
PADDD
(
xmm_v10
,
xmm_v11
)
PADDD
(
xmm_v14
,
xmm_v15
)
PXOR
(
xmm_v1
,
xmm_v2
)
PXOR
(
xmm_v5
,
xmm_v6
)
PXOR
(
xmm_v9
,
xmm_v10
)
PXOR
(
xmm_v13
,
xmm_v14
)
ROTW7_sse2
(
xmm_tmp
,
xmm_v1
)
ROTW7_sse2
(
xmm_tmp
,
xmm_v5
)
ROTW7_sse2
(
xmm_tmp
,
xmm_v9
)
ROTW7_sse2
(
xmm_tmp
,
xmm_v13
)
# b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
PSHUFD
(
xmm_v1
,
xmm_v1
,
0x93
)
PSHUFD
(
xmm_v5
,
xmm_v5
,
0x93
)
PSHUFD
(
xmm_v9
,
xmm_v9
,
0x93
)
PSHUFD
(
xmm_v13
,
xmm_v13
,
0x93
)
PSHUFD
(
xmm_v2
,
xmm_v2
,
0x4e
)
PSHUFD
(
xmm_v6
,
xmm_v6
,
0x4e
)
PSHUFD
(
xmm_v10
,
xmm_v10
,
0x4e
)
PSHUFD
(
xmm_v14
,
xmm_v14
,
0x4e
)
PSHUFD
(
xmm_v3
,
xmm_v3
,
0x39
)
PSHUFD
(
xmm_v7
,
xmm_v7
,
0x39
)
PSHUFD
(
xmm_v11
,
xmm_v11
,
0x39
)
PSHUFD
(
xmm_v15
,
xmm_v15
,
0x39
)
MOVDQA
(
xmm_tmp
,
mem_tmp0
)
# Restore
SUB
(
reg_rounds
,
2
)
JNZ
(
rounds_loop4
.
begin
)
MOVDQA
(
mem_tmp0
,
xmm_tmp
)
PADDD
(
xmm_v0
,
mem_s0
)
PADDD
(
xmm_v1
,
mem_s1
)
PADDD
(
xmm_v2
,
mem_s2
)
PADDD
(
xmm_v3
,
mem_s3
)
WriteXor_sse2
(
xmm_tmp
,
reg_inp
,
reg_outp
,
0
,
xmm_v0
,
xmm_v1
,
xmm_v2
,
xmm_v3
)
MOVDQA
(
xmm_v3
,
mem_s3
)
PADDQ
(
xmm_v3
,
mem_one
)
PADDD
(
xmm_v4
,
mem_s0
)
PADDD
(
xmm_v5
,
mem_s1
)
PADDD
(
xmm_v6
,
mem_s2
)
PADDD
(
xmm_v7
,
xmm_v3
)
WriteXor_sse2
(
xmm_tmp
,
reg_inp
,
reg_outp
,
64
,
xmm_v4
,
xmm_v5
,
xmm_v6
,
xmm_v7
)
PADDQ
(
xmm_v3
,
mem_one
)
PADDD
(
xmm_v8
,
mem_s0
)
PADDD
(
xmm_v9
,
mem_s1
)
PADDD
(
xmm_v10
,
mem_s2
)
PADDD
(
xmm_v11
,
xmm_v3
)
WriteXor_sse2
(
xmm_tmp
,
reg_inp
,
reg_outp
,
128
,
xmm_v8
,
xmm_v9
,
xmm_v10
,
xmm_v11
)
PADDQ
(
xmm_v3
,
mem_one
)
MOVDQA
(
xmm_tmp
,
mem_tmp0
)
PADDD
(
xmm_v12
,
mem_s0
)
PADDD
(
xmm_v13
,
mem_s1
)
PADDD
(
xmm_v14
,
mem_s2
)
PADDD
(
xmm_v15
,
xmm_v3
)
WriteXor_sse2
(
xmm_v0
,
reg_inp
,
reg_outp
,
192
,
xmm_v12
,
xmm_v13
,
xmm_v14
,
xmm_v15
)
PADDQ
(
xmm_v3
,
mem_one
)
MOVDQA
(
mem_s3
,
xmm_v3
)
ADD
(
reg_inp
,
4
*
64
)
ADD
(
reg_outp
,
4
*
64
)
SUB
(
reg_blocks
,
4
)
JAE
(
vector_loop4
.
begin
)
ADD
(
reg_blocks
,
4
)
out
=
Label
()
JZ
(
out
)
# Past this point, we no longer need to use every single register to hold
# the in progress state.
xmm_s0
=
xmm_v8
xmm_s1
=
xmm_v9
xmm_s2
=
xmm_v10
xmm_s3
=
xmm_v11
xmm_one
=
xmm_v13
MOVDQA
(
xmm_s0
,
mem_s0
)
MOVDQA
(
xmm_s1
,
mem_s1
)
MOVDQA
(
xmm_s2
,
mem_s2
)
MOVDQA
(
xmm_s3
,
mem_s3
)
MOVDQA
(
xmm_one
,
mem_one
)
#
# 2 blocks at a time.
#
SUB
(
reg_blocks
,
2
)
vector_loop2
=
Loop
()
JB
(
vector_loop2
.
end
)
    with vector_loop2:
        MOVDQA(xmm_v0, xmm_s0)
        MOVDQA(xmm_v1, xmm_s1)
        MOVDQA(xmm_v2, xmm_s2)
        MOVDQA(xmm_v3, xmm_s3)

        MOVDQA(xmm_v4, xmm_v0)
        MOVDQA(xmm_v5, xmm_v1)
        MOVDQA(xmm_v6, xmm_v2)
        MOVDQA(xmm_v7, xmm_v3)
        PADDQ(xmm_v7, xmm_one)

        reg_rounds = GeneralPurposeRegister64()
        MOV(reg_rounds, 20)
        rounds_loop2 = Loop()
        with rounds_loop2:
            # a += b; d ^= a; d = ROTW16(d);
            PADDD(xmm_v0, xmm_v1)
            PADDD(xmm_v4, xmm_v5)
            PXOR(xmm_v3, xmm_v0)
            PXOR(xmm_v7, xmm_v4)
            ROTW16_sse2(xmm_tmp, xmm_v3)
            ROTW16_sse2(xmm_tmp, xmm_v7)

            # c += d; b ^= c; b = ROTW12(b);
            PADDD(xmm_v2, xmm_v3)
            PADDD(xmm_v6, xmm_v7)
            PXOR(xmm_v1, xmm_v2)
            PXOR(xmm_v5, xmm_v6)
            ROTW12_sse2(xmm_tmp, xmm_v1)
            ROTW12_sse2(xmm_tmp, xmm_v5)

            # a += b; d ^= a; d = ROTW8(d);
            PADDD(xmm_v0, xmm_v1)
            PADDD(xmm_v4, xmm_v5)
            PXOR(xmm_v3, xmm_v0)
            PXOR(xmm_v7, xmm_v4)
            ROTW8_sse2(xmm_tmp, xmm_v3)
            ROTW8_sse2(xmm_tmp, xmm_v7)

            # c += d; b ^= c; b = ROTW7(b)
            PADDD(xmm_v2, xmm_v3)
            PADDD(xmm_v6, xmm_v7)
            PXOR(xmm_v1, xmm_v2)
            PXOR(xmm_v5, xmm_v6)
            ROTW7_sse2(xmm_tmp, xmm_v1)
            ROTW7_sse2(xmm_tmp, xmm_v5)

            # b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
            PSHUFD(xmm_v1, xmm_v1, 0x39)
            PSHUFD(xmm_v5, xmm_v5, 0x39)
            PSHUFD(xmm_v2, xmm_v2, 0x4e)
            PSHUFD(xmm_v6, xmm_v6, 0x4e)
            PSHUFD(xmm_v3, xmm_v3, 0x93)
            PSHUFD(xmm_v7, xmm_v7, 0x93)

            # a += b; d ^= a; d = ROTW16(d);
            PADDD(xmm_v0, xmm_v1)
            PADDD(xmm_v4, xmm_v5)
            PXOR(xmm_v3, xmm_v0)
            PXOR(xmm_v7, xmm_v4)
            ROTW16_sse2(xmm_tmp, xmm_v3)
            ROTW16_sse2(xmm_tmp, xmm_v7)

            # c += d; b ^= c; b = ROTW12(b);
            PADDD(xmm_v2, xmm_v3)
            PADDD(xmm_v6, xmm_v7)
            PXOR(xmm_v1, xmm_v2)
            PXOR(xmm_v5, xmm_v6)
            ROTW12_sse2(xmm_tmp, xmm_v1)
            ROTW12_sse2(xmm_tmp, xmm_v5)

            # a += b; d ^= a; d = ROTW8(d);
            PADDD(xmm_v0, xmm_v1)
            PADDD(xmm_v4, xmm_v5)
            PXOR(xmm_v3, xmm_v0)
            PXOR(xmm_v7, xmm_v4)
            ROTW8_sse2(xmm_tmp, xmm_v3)
            ROTW8_sse2(xmm_tmp, xmm_v7)

            # c += d; b ^= c; b = ROTW7(b)
            PADDD(xmm_v2, xmm_v3)
            PADDD(xmm_v6, xmm_v7)
            PXOR(xmm_v1, xmm_v2)
            PXOR(xmm_v5, xmm_v6)
            ROTW7_sse2(xmm_tmp, xmm_v1)
            ROTW7_sse2(xmm_tmp, xmm_v5)

            # b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
            PSHUFD(xmm_v1, xmm_v1, 0x93)
            PSHUFD(xmm_v5, xmm_v5, 0x93)
            PSHUFD(xmm_v2, xmm_v2, 0x4e)
            PSHUFD(xmm_v6, xmm_v6, 0x4e)
            PSHUFD(xmm_v3, xmm_v3, 0x39)
            PSHUFD(xmm_v7, xmm_v7, 0x39)

            SUB(reg_rounds, 2)
            JNZ(rounds_loop2.begin)
        PADDD(xmm_v0, xmm_s0)
        PADDD(xmm_v1, xmm_s1)
        PADDD(xmm_v2, xmm_s2)
        PADDD(xmm_v3, xmm_s3)
        WriteXor_sse2(xmm_tmp, reg_inp, reg_outp, 0, xmm_v0, xmm_v1, xmm_v2, xmm_v3)
        PADDQ(xmm_s3, xmm_one)

        PADDD(xmm_v4, xmm_s0)
        PADDD(xmm_v5, xmm_s1)
        PADDD(xmm_v6, xmm_s2)
        PADDD(xmm_v7, xmm_s3)
        WriteXor_sse2(xmm_tmp, reg_inp, reg_outp, 64, xmm_v4, xmm_v5, xmm_v6, xmm_v7)
        PADDQ(xmm_s3, xmm_one)

        ADD(reg_inp, 2*64)
        ADD(reg_outp, 2*64)

        SUB(reg_blocks, 2)
        JAE(vector_loop2.begin)

    ADD(reg_blocks, 2)
    out_serial = Label()
    JZ(out_serial)
    #
    # 1 block at a time. Only executed once, because if there was > 1,
    # the parallel code would have processed it already.
    #

    MOVDQA(xmm_v0, xmm_s0)
    MOVDQA(xmm_v1, xmm_s1)
    MOVDQA(xmm_v2, xmm_s2)
    MOVDQA(xmm_v3, xmm_s3)

    reg_rounds = GeneralPurposeRegister64()
    MOV(reg_rounds, 20)
    rounds_loop1 = Loop()
    with rounds_loop1:
        # a += b; d ^= a; d = ROTW16(d);
        PADDD(xmm_v0, xmm_v1)
        PXOR(xmm_v3, xmm_v0)
        ROTW16_sse2(xmm_tmp, xmm_v3)

        # c += d; b ^= c; b = ROTW12(b);
        PADDD(xmm_v2, xmm_v3)
        PXOR(xmm_v1, xmm_v2)
        ROTW12_sse2(xmm_tmp, xmm_v1)

        # a += b; d ^= a; d = ROTW8(d);
        PADDD(xmm_v0, xmm_v1)
        PXOR(xmm_v3, xmm_v0)
        ROTW8_sse2(xmm_tmp, xmm_v3)

        # c += d; b ^= c; b = ROTW7(b)
        PADDD(xmm_v2, xmm_v3)
        PXOR(xmm_v1, xmm_v2)
        ROTW7_sse2(xmm_tmp, xmm_v1)

        # b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
        PSHUFD(xmm_v1, xmm_v1, 0x39)
        PSHUFD(xmm_v2, xmm_v2, 0x4e)
        PSHUFD(xmm_v3, xmm_v3, 0x93)

        # a += b; d ^= a; d = ROTW16(d);
        PADDD(xmm_v0, xmm_v1)
        PXOR(xmm_v3, xmm_v0)
        ROTW16_sse2(xmm_tmp, xmm_v3)

        # c += d; b ^= c; b = ROTW12(b);
        PADDD(xmm_v2, xmm_v3)
        PXOR(xmm_v1, xmm_v2)
        ROTW12_sse2(xmm_tmp, xmm_v1)

        # a += b; d ^= a; d = ROTW8(d);
        PADDD(xmm_v0, xmm_v1)
        PXOR(xmm_v3, xmm_v0)
        ROTW8_sse2(xmm_tmp, xmm_v3)

        # c += d; b ^= c; b = ROTW7(b)
        PADDD(xmm_v2, xmm_v3)
        PXOR(xmm_v1, xmm_v2)
        ROTW7_sse2(xmm_tmp, xmm_v1)

        # b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
        PSHUFD(xmm_v1, xmm_v1, 0x93)
        PSHUFD(xmm_v2, xmm_v2, 0x4e)
        PSHUFD(xmm_v3, xmm_v3, 0x39)

        SUB(reg_rounds, 2)
        JNZ(rounds_loop1.begin)

    PADDD(xmm_v0, xmm_s0)
    PADDD(xmm_v1, xmm_s1)
    PADDD(xmm_v2, xmm_s2)
    PADDD(xmm_v3, xmm_s3)
    WriteXor_sse2(xmm_tmp, reg_inp, reg_outp, 0, xmm_v0, xmm_v1, xmm_v2, xmm_v3)
    PADDQ(xmm_s3, xmm_one)

    LABEL(out_serial)
    # Write back the updated counter. Stopping at 2^70 bytes is the user's
    # problem, not mine. (Skipped if there's exactly a multiple of 4 blocks
    # because the counter is incremented in memory while looping.)
    MOVDQA(mem_s3, xmm_s3)

    LABEL(out)

    # Paranoia, cleanse the scratch space.
    PXOR(xmm_v0, xmm_v0)
    MOVDQA(mem_tmp0, xmm_v0)

    # Remove our stack allocation.
    MOV(registers.rsp, reg_sp_save)

    RETURN()
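The `# a += b; d ^= a; ...` comments threaded through the rounds loops above are the ChaCha20 quarter-round. A plain-Go sketch of what one quarter-round computes (the function name is illustrative and not part of this package):

```go
package main

import "fmt"

// quarterRound mirrors the commented steps in the rounds loops:
// a += b; d ^= a; d <<<= 16;  c += d; b ^= c; b <<<= 12;
// a += b; d ^= a; d <<<= 8;   c += d; b ^= c; b <<<= 7.
func quarterRound(a, b, c, d uint32) (uint32, uint32, uint32, uint32) {
	rotl := func(x uint32, n uint) uint32 { return x<<n | x>>(32-n) }
	a += b
	d = rotl(d^a, 16)
	c += d
	b = rotl(b^c, 12)
	a += b
	d = rotl(d^a, 8)
	c += d
	b = rotl(b^c, 7)
	return a, b, c, d
}

func main() {
	// Test vector from RFC 7539, section 2.1.1.
	a, b, c, d := quarterRound(0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567)
	fmt.Printf("%08x %08x %08x %08x\n", a, b, c, d)
}
```

The SIMD loops run four such quarter-rounds in parallel, one per 32-bit lane, which is why the state rows get re-shuffled with PSHUFD between the column and diagonal rounds.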
#
# AVX2 helpers. Like the SSE2 equivalents, the scratch register is explicit,
# and more helpers are used to increase readability for destructive operations.
#
# XXX/Performance: ROTW16_avx2/ROTW8_avx2 both can use VPSHUFB.
#

def ADD_avx2(dst, src):
    VPADDD(dst, dst, src)

def XOR_avx2(dst, src):
    VPXOR(dst, dst, src)

def ROTW16_avx2(tmp, d):
    VPSLLD(tmp, d, 16)
    VPSRLD(d, d, 16)
    XOR_avx2(d, tmp)

def ROTW12_avx2(tmp, b):
    VPSLLD(tmp, b, 12)
    VPSRLD(b, b, 20)
    XOR_avx2(b, tmp)

def ROTW8_avx2(tmp, d):
    VPSLLD(tmp, d, 8)
    VPSRLD(d, d, 24)
    XOR_avx2(d, tmp)

def ROTW7_avx2(tmp, b):
    VPSLLD(tmp, b, 7)
    VPSRLD(b, b, 25)
    XOR_avx2(b, tmp)
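The ROTW helpers (both the SSE2 and AVX2 variants) emulate a rotate with a shift pair because SSE2/AVX2 have no 32-bit vector-rotate instruction. The scalar identity they rely on, sketched in Go (illustrative helper, not part of this package):

```go
package main

import "fmt"

// rotl32 shows the identity behind the ROTW* helpers: a left rotate by n
// equals (x << n) XOR (x >> (32-n)). XOR is as good as OR here because the
// two shifted values never have a set bit in the same position.
func rotl32(x uint32, n uint) uint32 {
	return (x << n) ^ (x >> (32 - n))
}

func main() {
	fmt.Printf("%08x\n", rotl32(0x12345678, 16)) // 56781234
}
```

In the vector code, `tmp` holds the left-shifted copy while the destination is shifted right in place, and the final XOR merges the two halves.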
def WriteXor_avx2(tmp, inp, outp, d, v0, v1, v2, v3):
    # XOR_WRITE(out+ 0, in+ 0, _mm256_permute2x128_si256(v0,v1,0x20));
    VPERM2I128(tmp, v0, v1, 0x20)
    VPXOR(tmp, tmp, [inp+d])
    VMOVDQU([outp+d], tmp)

    # XOR_WRITE(out+32, in+32, _mm256_permute2x128_si256(v2,v3,0x20));
    VPERM2I128(tmp, v2, v3, 0x20)
    VPXOR(tmp, tmp, [inp+d+32])
    VMOVDQU([outp+d+32], tmp)

    # XOR_WRITE(out+64, in+64, _mm256_permute2x128_si256(v0,v1,0x31));
    VPERM2I128(tmp, v0, v1, 0x31)
    VPXOR(tmp, tmp, [inp+d+64])
    VMOVDQU([outp+d+64], tmp)

    # XOR_WRITE(out+96, in+96, _mm256_permute2x128_si256(v2,v3,0x31));
    VPERM2I128(tmp, v2, v3, 0x31)
    VPXOR(tmp, tmp, [inp+d+96])
    VMOVDQU([outp+d+96], tmp)
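Each YMM register in the AVX2 path holds the same 128-bit state row for two consecutive blocks (low lane = block N, high lane = block N+1). WriteXor_avx2 uses VPERM2I128 with immediates 0x20 and 0x31 to regroup rows by block before XORing into the stream. A small Go model of that lane selection (the function and lane labels are illustrative, not part of this package):

```go
package main

import "fmt"

// perm2x128 models VPERM2I128's lane selection: the destination's low
// lane is picked by the immediate's low nibble, the high lane by the
// high nibble (0/1 = lanes of a, 2/3 = lanes of b).
func perm2x128(a, b [2]string, imm byte) [2]string {
	sel := func(n byte) string {
		if n&2 == 0 {
			return a[n&1]
		}
		return b[n&1]
	}
	return [2]string{sel(imm & 0xf), sel(imm >> 4)}
}

func main() {
	v0 := [2]string{"blk0.row0", "blk1.row0"}
	v1 := [2]string{"blk0.row1", "blk1.row1"}
	fmt.Println(perm2x128(v0, v1, 0x20)) // both rows of block 0
	fmt.Println(perm2x128(v0, v1, 0x31)) // both rows of block 1
}
```

So imm 0x20 gathers both low lanes (block N's output) and 0x31 gathers both high lanes (block N+1's output), which is exactly the regrouping the C-style comments describe.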
# AVX2 ChaCha20 (aka avx2). Does not handle partial blocks, and will process
# 8/4/2 blocks at a time.
with Function("blocksAmd64AVX2", (x, inp, outp, nrBlocks), target=uarch.broadwell):
    reg_x = GeneralPurposeRegister64()
    reg_inp = GeneralPurposeRegister64()
    reg_outp = GeneralPurposeRegister64()
    reg_blocks = GeneralPurposeRegister64()
    reg_sp_save = GeneralPurposeRegister64()

    LOAD.ARGUMENT(reg_x, x)
    LOAD.ARGUMENT(reg_inp, inp)
    LOAD.ARGUMENT(reg_outp, outp)
    LOAD.ARGUMENT(reg_blocks, nrBlocks)

    # Align the stack to a 32 byte boundary.
    reg_align = GeneralPurposeRegister64()
    MOV(reg_sp_save, registers.rsp)
    MOV(reg_align, 0x1f)
    NOT(reg_align)
    AND(registers.rsp, reg_align)
    SUB(registers.rsp, 0x20)

    x_s0 = [reg_x]          # (Memory) Cipher state [0..3]
    x_s1 = [reg_x+16]       # (Memory) Cipher state [4..7]
    x_s2 = [reg_x+32]       # (Memory) Cipher state [8..11]
    x_s3 = [reg_x+48]       # (Memory) Cipher state [12..15]

    ymm_v0 = YMMRegister()
    ymm_v1 = YMMRegister()
    ymm_v2 = YMMRegister()
    ymm_v3 = YMMRegister()

    ymm_v4 = YMMRegister()
    ymm_v5 = YMMRegister()
    ymm_v6 = YMMRegister()
    ymm_v7 = YMMRegister()

    ymm_v8 = YMMRegister()
    ymm_v9 = YMMRegister()
    ymm_v10 = YMMRegister()
    ymm_v11 = YMMRegister()

    ymm_v12 = YMMRegister()
    ymm_v13 = YMMRegister()
    ymm_v14 = YMMRegister()
    ymm_v15 = YMMRegister()

    ymm_tmp0 = ymm_v12

    # Allocate the necessary stack space for the counter vector and two ymm
    # registers that we will spill.
    SUB(registers.rsp, 96)
    mem_tmp0 = [registers.rsp+64]   # (Stack) Scratch space.
    mem_s3 = [registers.rsp+32]     # (Stack) Working copy of s3. (8x)
    mem_inc = [registers.rsp]       # (Stack) Counter increment vector.

    # Increment the counter for one side of the state vector.
    VPXOR(ymm_tmp0, ymm_tmp0, ymm_tmp0)
    VMOVDQU(mem_inc, ymm_tmp0)
    reg_tmp = GeneralPurposeRegister32()
    MOV(reg_tmp, 0x00000001)
    MOV([registers.rsp+16], reg_tmp)
    VBROADCASTI128(ymm_v3, x_s3)
    VPADDQ(ymm_v3, ymm_v3, [registers.rsp])
    VMOVDQA(mem_s3, ymm_v3)

    # As we process 2xN blocks at a time, the counter increment for both
    # sides of the state vector is 2.
    MOV(reg_tmp, 0x00000002)
    MOV([registers.rsp], reg_tmp)
    MOV([registers.rsp+16], reg_tmp)

    out_write_even = Label()
    out_write_odd = Label()
    #
    # 8 blocks at a time. Ted Krovetz's avx2 code does not do this, but it's
    # a decent gain despite all the pain...
    #

    vector_loop8 = Loop()
    SUB(reg_blocks, 8)
    JB(vector_loop8.end)
    with vector_loop8:
        VBROADCASTI128(ymm_v0, x_s0)
        VBROADCASTI128(ymm_v1, x_s1)
        VBROADCASTI128(ymm_v2, x_s2)
        VMOVDQA(ymm_v3, mem_s3)

        VMOVDQA(ymm_v4, ymm_v0)
        VMOVDQA(ymm_v5, ymm_v1)
        VMOVDQA(ymm_v6, ymm_v2)
        VPADDQ(ymm_v7, ymm_v3, mem_inc)

        VMOVDQA(ymm_v8, ymm_v0)
        VMOVDQA(ymm_v9, ymm_v1)
        VMOVDQA(ymm_v10, ymm_v2)
        VPADDQ(ymm_v11, ymm_v7, mem_inc)

        VMOVDQA(ymm_v12, ymm_v0)
        VMOVDQA(ymm_v13, ymm_v1)
        VMOVDQA(ymm_v14, ymm_v2)
        VPADDQ(ymm_v15, ymm_v11, mem_inc)

        reg_rounds = GeneralPurposeRegister64()
        MOV(reg_rounds, 20)
        rounds_loop8 = Loop()
        with rounds_loop8:
            # a += b; d ^= a; d = ROTW16(d);
            ADD_avx2(ymm_v0, ymm_v1)
            ADD_avx2(ymm_v4, ymm_v5)
            ADD_avx2(ymm_v8, ymm_v9)
            ADD_avx2(ymm_v12, ymm_v13)
            XOR_avx2(ymm_v3, ymm_v0)
            XOR_avx2(ymm_v7, ymm_v4)
            XOR_avx2(ymm_v11, ymm_v8)
            XOR_avx2(ymm_v15, ymm_v12)

            VMOVDQA(mem_tmp0, ymm_tmp0)  # Save

            ROTW16_avx2(ymm_tmp0, ymm_v3)
            ROTW16_avx2(ymm_tmp0, ymm_v7)
            ROTW16_avx2(ymm_tmp0, ymm_v11)
            ROTW16_avx2(ymm_tmp0, ymm_v15)

            # c += d; b ^= c; b = ROTW12(b);
            ADD_avx2(ymm_v2, ymm_v3)
            ADD_avx2(ymm_v6, ymm_v7)
            ADD_avx2(ymm_v10, ymm_v11)
            ADD_avx2(ymm_v14, ymm_v15)
            XOR_avx2(ymm_v1, ymm_v2)
            XOR_avx2(ymm_v5, ymm_v6)
            XOR_avx2(ymm_v9, ymm_v10)
            XOR_avx2(ymm_v13, ymm_v14)
            ROTW12_avx2(ymm_tmp0, ymm_v1)
            ROTW12_avx2(ymm_tmp0, ymm_v5)
            ROTW12_avx2(ymm_tmp0, ymm_v9)
            ROTW12_avx2(ymm_tmp0, ymm_v13)

            # a += b; d ^= a; d = ROTW8(d);
            VMOVDQA(ymm_tmp0, mem_tmp0)  # Restore

            ADD_avx2(ymm_v0, ymm_v1)
            ADD_avx2(ymm_v4, ymm_v5)
            ADD_avx2(ymm_v8, ymm_v9)
            ADD_avx2(ymm_v12, ymm_v13)
            XOR_avx2(ymm_v3, ymm_v0)
            XOR_avx2(ymm_v7, ymm_v4)
            XOR_avx2(ymm_v11, ymm_v8)
            XOR_avx2(ymm_v15, ymm_v12)

            VMOVDQA(mem_tmp0, ymm_tmp0)  # Save

            ROTW8_avx2(ymm_tmp0, ymm_v3)
            ROTW8_avx2(ymm_tmp0, ymm_v7)
            ROTW8_avx2(ymm_tmp0, ymm_v11)
            ROTW8_avx2(ymm_tmp0, ymm_v15)

            # c += d; b ^= c; b = ROTW7(b)
            ADD_avx2(ymm_v2, ymm_v3)
            ADD_avx2(ymm_v6, ymm_v7)
            ADD_avx2(ymm_v10, ymm_v11)
            ADD_avx2(ymm_v14, ymm_v15)
            XOR_avx2(ymm_v1, ymm_v2)
            XOR_avx2(ymm_v5, ymm_v6)
            XOR_avx2(ymm_v9, ymm_v10)
            XOR_avx2(ymm_v13, ymm_v14)
            ROTW7_avx2(ymm_tmp0, ymm_v1)
            ROTW7_avx2(ymm_tmp0, ymm_v5)
            ROTW7_avx2(ymm_tmp0, ymm_v9)
            ROTW7_avx2(ymm_tmp0, ymm_v13)

            # b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
            VPSHUFD(ymm_v1, ymm_v1, 0x39)
            VPSHUFD(ymm_v5, ymm_v5, 0x39)
            VPSHUFD(ymm_v9, ymm_v9, 0x39)
            VPSHUFD(ymm_v13, ymm_v13, 0x39)
            VPSHUFD(ymm_v2, ymm_v2, 0x4e)
            VPSHUFD(ymm_v6, ymm_v6, 0x4e)
            VPSHUFD(ymm_v10, ymm_v10, 0x4e)
            VPSHUFD(ymm_v14, ymm_v14, 0x4e)
            VPSHUFD(ymm_v3, ymm_v3, 0x93)
            VPSHUFD(ymm_v7, ymm_v7, 0x93)
            VPSHUFD(ymm_v11, ymm_v11, 0x93)
            VPSHUFD(ymm_v15, ymm_v15, 0x93)

            # a += b; d ^= a; d = ROTW16(d);
            VMOVDQA(ymm_tmp0, mem_tmp0)  # Restore

            ADD_avx2(ymm_v0, ymm_v1)
            ADD_avx2(ymm_v4, ymm_v5)
            ADD_avx2(ymm_v8, ymm_v9)
            ADD_avx2(ymm_v12, ymm_v13)
            XOR_avx2(ymm_v3, ymm_v0)
            XOR_avx2(ymm_v7, ymm_v4)
            XOR_avx2(ymm_v11, ymm_v8)
            XOR_avx2(ymm_v15, ymm_v12)

            VMOVDQA(mem_tmp0, ymm_tmp0)  # Save

            ROTW16_avx2(ymm_tmp0, ymm_v3)
            ROTW16_avx2(ymm_tmp0, ymm_v7)
            ROTW16_avx2(ymm_tmp0, ymm_v11)
            ROTW16_avx2(ymm_tmp0, ymm_v15)

            # c += d; b ^= c; b = ROTW12(b);
            ADD_avx2(ymm_v2, ymm_v3)
            ADD_avx2(ymm_v6, ymm_v7)
            ADD_avx2(ymm_v10, ymm_v11)
            ADD_avx2(ymm_v14, ymm_v15)
            XOR_avx2(ymm_v1, ymm_v2)
            XOR_avx2(ymm_v5, ymm_v6)
            XOR_avx2(ymm_v9, ymm_v10)
            XOR_avx2(ymm_v13, ymm_v14)
            ROTW12_avx2(ymm_tmp0, ymm_v1)
            ROTW12_avx2(ymm_tmp0, ymm_v5)
            ROTW12_avx2(ymm_tmp0, ymm_v9)
            ROTW12_avx2(ymm_tmp0, ymm_v13)

            # a += b; d ^= a; d = ROTW8(d);
            VMOVDQA(ymm_tmp0, mem_tmp0)  # Restore

            ADD_avx2(ymm_v0, ymm_v1)
            ADD_avx2(ymm_v4, ymm_v5)
            ADD_avx2(ymm_v8, ymm_v9)
            ADD_avx2(ymm_v12, ymm_v13)
            XOR_avx2(ymm_v3, ymm_v0)
            XOR_avx2(ymm_v7, ymm_v4)
            XOR_avx2(ymm_v11, ymm_v8)
            XOR_avx2(ymm_v15, ymm_v12)

            VMOVDQA(mem_tmp0, ymm_tmp0)  # Save

            ROTW8_avx2(ymm_tmp0, ymm_v3)
            ROTW8_avx2(ymm_tmp0, ymm_v7)
            ROTW8_avx2(ymm_tmp0, ymm_v11)
            ROTW8_avx2(ymm_tmp0, ymm_v15)

            # c += d; b ^= c; b = ROTW7(b)
            ADD_avx2(ymm_v2, ymm_v3)
            ADD_avx2(ymm_v6, ymm_v7)
            ADD_avx2(ymm_v10, ymm_v11)
            ADD_avx2(ymm_v14, ymm_v15)
            XOR_avx2(ymm_v1, ymm_v2)
            XOR_avx2(ymm_v5, ymm_v6)
            XOR_avx2(ymm_v9, ymm_v10)
            XOR_avx2(ymm_v13, ymm_v14)
            ROTW7_avx2(ymm_tmp0, ymm_v1)
            ROTW7_avx2(ymm_tmp0, ymm_v5)
            ROTW7_avx2(ymm_tmp0, ymm_v9)
            ROTW7_avx2(ymm_tmp0, ymm_v13)

            # b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
            VPSHUFD(ymm_v1, ymm_v1, 0x93)
            VPSHUFD(ymm_v5, ymm_v5, 0x93)
            VPSHUFD(ymm_v9, ymm_v9, 0x93)
            VPSHUFD(ymm_v13, ymm_v13, 0x93)
            VPSHUFD(ymm_v2, ymm_v2, 0x4e)
            VPSHUFD(ymm_v6, ymm_v6, 0x4e)
            VPSHUFD(ymm_v10, ymm_v10, 0x4e)
            VPSHUFD(ymm_v14, ymm_v14, 0x4e)
            VPSHUFD(ymm_v3, ymm_v3, 0x39)
            VPSHUFD(ymm_v7, ymm_v7, 0x39)
            VPSHUFD(ymm_v11, ymm_v11, 0x39)
            VPSHUFD(ymm_v15, ymm_v15, 0x39)

            VMOVDQA(ymm_tmp0, mem_tmp0)  # Restore

            SUB(reg_rounds, 2)
            JNZ(rounds_loop8.begin)
        # ymm_v12 is in mem_tmp0 and is current....
        # XXX: I assume VBROADCASTI128 is about as fast as VMOVDQA....
        VBROADCASTI128(ymm_tmp0, x_s0)
        ADD_avx2(ymm_v0, ymm_tmp0)
        ADD_avx2(ymm_v4, ymm_tmp0)
        ADD_avx2(ymm_v8, ymm_tmp0)
        ADD_avx2(ymm_tmp0, mem_tmp0)
        VMOVDQA(mem_tmp0, ymm_tmp0)

        VBROADCASTI128(ymm_tmp0, x_s1)
        ADD_avx2(ymm_v1, ymm_tmp0)
        ADD_avx2(ymm_v5, ymm_tmp0)
        ADD_avx2(ymm_v9, ymm_tmp0)
        ADD_avx2(ymm_v13, ymm_tmp0)

        VBROADCASTI128(ymm_tmp0, x_s2)
        ADD_avx2(ymm_v2, ymm_tmp0)
        ADD_avx2(ymm_v6, ymm_tmp0)
        ADD_avx2(ymm_v10, ymm_tmp0)
        ADD_avx2(ymm_v14, ymm_tmp0)

        ADD_avx2(ymm_v3, mem_s3)
        WriteXor_avx2(ymm_tmp0, reg_inp, reg_outp, 0, ymm_v0, ymm_v1, ymm_v2, ymm_v3)
        VMOVDQA(ymm_v3, mem_s3)
        ADD_avx2(ymm_v3, mem_inc)

        ADD_avx2(ymm_v7, ymm_v3)
        WriteXor_avx2(ymm_tmp0, reg_inp, reg_outp, 128, ymm_v4, ymm_v5, ymm_v6, ymm_v7)
        ADD_avx2(ymm_v3, mem_inc)

        ADD_avx2(ymm_v11, ymm_v3)
        WriteXor_avx2(ymm_tmp0, reg_inp, reg_outp, 256, ymm_v8, ymm_v9, ymm_v10, ymm_v11)
        ADD_avx2(ymm_v3, mem_inc)

        VMOVDQA(ymm_v12, mem_tmp0)
        ADD_avx2(ymm_v15, ymm_v3)
        WriteXor_avx2(ymm_v0, reg_inp, reg_outp, 384, ymm_v12, ymm_v13, ymm_v14, ymm_v15)
        ADD_avx2(ymm_v3, mem_inc)

        VMOVDQA(mem_s3, ymm_v3)

        ADD(reg_inp, 8*64)
        ADD(reg_outp, 8*64)

        SUB(reg_blocks, 8)
        JAE(vector_loop8.begin)

    # ymm_v3 contains a current copy of mem_s3 either from when it was built,
    # or because the loop updates it. Copy this before we mess with the block
    # counter in case we need to write it back and return.
    ymm_s3 = ymm_v11
    VMOVDQA(ymm_s3, ymm_v3)

    ADD(reg_blocks, 8)
    JZ(out_write_even)

    # We now actually can do everything in registers.
    ymm_s0 = ymm_v8
    VBROADCASTI128(ymm_s0, x_s0)
    ymm_s1 = ymm_v9
    VBROADCASTI128(ymm_s1, x_s1)
    ymm_s2 = ymm_v10
    VBROADCASTI128(ymm_s2, x_s2)
    ymm_inc = ymm_v14
    VMOVDQA(ymm_inc, mem_inc)
    #
    # 4 blocks at a time.
    #

    SUB(reg_blocks, 4)
    vector_loop4 = Loop()
    JB(vector_loop4.end)
    with vector_loop4:
        VMOVDQA(ymm_v0, ymm_s0)
        VMOVDQA(ymm_v1, ymm_s1)
        VMOVDQA(ymm_v2, ymm_s2)
        VMOVDQA(ymm_v3, ymm_s3)

        VMOVDQA(ymm_v4, ymm_v0)
        VMOVDQA(ymm_v5, ymm_v1)
        VMOVDQA(ymm_v6, ymm_v2)
        VPADDQ(ymm_v7, ymm_v3, ymm_inc)

        reg_rounds = GeneralPurposeRegister64()
        MOV(reg_rounds, 20)
        rounds_loop4 = Loop()
        with rounds_loop4:
            # a += b; d ^= a; d = ROTW16(d);
            ADD_avx2(ymm_v0, ymm_v1)
            ADD_avx2(ymm_v4, ymm_v5)
            XOR_avx2(ymm_v3, ymm_v0)
            XOR_avx2(ymm_v7, ymm_v4)
            ROTW16_avx2(ymm_tmp0, ymm_v3)
            ROTW16_avx2(ymm_tmp0, ymm_v7)

            # c += d; b ^= c; b = ROTW12(b);
            ADD_avx2(ymm_v2, ymm_v3)
            ADD_avx2(ymm_v6, ymm_v7)
            XOR_avx2(ymm_v1, ymm_v2)
            XOR_avx2(ymm_v5, ymm_v6)
            ROTW12_avx2(ymm_tmp0, ymm_v1)
            ROTW12_avx2(ymm_tmp0, ymm_v5)

            # a += b; d ^= a; d = ROTW8(d);
            ADD_avx2(ymm_v0, ymm_v1)
            ADD_avx2(ymm_v4, ymm_v5)
            XOR_avx2(ymm_v3, ymm_v0)
            XOR_avx2(ymm_v7, ymm_v4)
            ROTW8_avx2(ymm_tmp0, ymm_v3)
            ROTW8_avx2(ymm_tmp0, ymm_v7)

            # c += d; b ^= c; b = ROTW7(b)
            ADD_avx2(ymm_v2, ymm_v3)
            ADD_avx2(ymm_v6, ymm_v7)
            XOR_avx2(ymm_v1, ymm_v2)
            XOR_avx2(ymm_v5, ymm_v6)
            ROTW7_avx2(ymm_tmp0, ymm_v1)
            ROTW7_avx2(ymm_tmp0, ymm_v5)

            # b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
            VPSHUFD(ymm_v1, ymm_v1, 0x39)
            VPSHUFD(ymm_v5, ymm_v5, 0x39)
            VPSHUFD(ymm_v2, ymm_v2, 0x4e)
            VPSHUFD(ymm_v6, ymm_v6, 0x4e)
            VPSHUFD(ymm_v3, ymm_v3, 0x93)
            VPSHUFD(ymm_v7, ymm_v7, 0x93)

            # a += b; d ^= a; d = ROTW16(d);
            ADD_avx2(ymm_v0, ymm_v1)
            ADD_avx2(ymm_v4, ymm_v5)
            XOR_avx2(ymm_v3, ymm_v0)
            XOR_avx2(ymm_v7, ymm_v4)
            ROTW16_avx2(ymm_tmp0, ymm_v3)
            ROTW16_avx2(ymm_tmp0, ymm_v7)

            # c += d; b ^= c; b = ROTW12(b);
            ADD_avx2(ymm_v2, ymm_v3)
            ADD_avx2(ymm_v6, ymm_v7)
            XOR_avx2(ymm_v1, ymm_v2)
            XOR_avx2(ymm_v5, ymm_v6)
            ROTW12_avx2(ymm_tmp0, ymm_v1)
            ROTW12_avx2(ymm_tmp0, ymm_v5)

            # a += b; d ^= a; d = ROTW8(d);
            ADD_avx2(ymm_v0, ymm_v1)
            ADD_avx2(ymm_v4, ymm_v5)
            XOR_avx2(ymm_v3, ymm_v0)
            XOR_avx2(ymm_v7, ymm_v4)
            ROTW8_avx2(ymm_tmp0, ymm_v3)
            ROTW8_avx2(ymm_tmp0, ymm_v7)

            # c += d; b ^= c; b = ROTW7(b)
            ADD_avx2(ymm_v2, ymm_v3)
            ADD_avx2(ymm_v6, ymm_v7)
            XOR_avx2(ymm_v1, ymm_v2)
            XOR_avx2(ymm_v5, ymm_v6)
            ROTW7_avx2(ymm_tmp0, ymm_v1)
            ROTW7_avx2(ymm_tmp0, ymm_v5)

            # b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
            VPSHUFD(ymm_v1, ymm_v1, 0x93)
            VPSHUFD(ymm_v5, ymm_v5, 0x93)
            VPSHUFD(ymm_v2, ymm_v2, 0x4e)
            VPSHUFD(ymm_v6, ymm_v6, 0x4e)
            VPSHUFD(ymm_v3, ymm_v3, 0x39)
            VPSHUFD(ymm_v7, ymm_v7, 0x39)

            SUB(reg_rounds, 2)
            JNZ(rounds_loop4.begin)

        ADD_avx2(ymm_v0, ymm_s0)
        ADD_avx2(ymm_v1, ymm_s1)
        ADD_avx2(ymm_v2, ymm_s2)
        ADD_avx2(ymm_v3, ymm_s3)
        WriteXor_avx2(ymm_tmp0, reg_inp, reg_outp, 0, ymm_v0, ymm_v1, ymm_v2, ymm_v3)
        ADD_avx2(ymm_s3, ymm_inc)

        ADD_avx2(ymm_v4, ymm_s0)
        ADD_avx2(ymm_v5, ymm_s1)
        ADD_avx2(ymm_v6, ymm_s2)
        ADD_avx2(ymm_v7, ymm_s3)
        WriteXor_avx2(ymm_tmp0, reg_inp, reg_outp, 128, ymm_v4, ymm_v5, ymm_v6, ymm_v7)
        ADD_avx2(ymm_s3, ymm_inc)

        ADD(reg_inp, 4*64)
        ADD(reg_outp, 4*64)

        SUB(reg_blocks, 4)
        JAE(vector_loop4.begin)

    ADD(reg_blocks, 4)
    JZ(out_write_even)
    #
    # 2/1 blocks at a time. The two codepaths are unified because
    # with AVX2 we do 2 blocks at a time anyway, and this only gets called
    # if 3/2/1 blocks are remaining, so the extra branches don't hurt that
    # much.
    #

    vector_loop2 = Loop()
    with vector_loop2:
        VMOVDQA(ymm_v0, ymm_s0)
        VMOVDQA(ymm_v1, ymm_s1)
        VMOVDQA(ymm_v2, ymm_s2)
        VMOVDQA(ymm_v3, ymm_s3)

        reg_rounds = GeneralPurposeRegister64()
        MOV(reg_rounds, 20)
        rounds_loop2 = Loop()
        with rounds_loop2:
            # a += b; d ^= a; d = ROTW16(d);
            ADD_avx2(ymm_v0, ymm_v1)
            XOR_avx2(ymm_v3, ymm_v0)
            ROTW16_avx2(ymm_tmp0, ymm_v3)

            # c += d; b ^= c; b = ROTW12(b);
            ADD_avx2(ymm_v2, ymm_v3)
            XOR_avx2(ymm_v1, ymm_v2)
            ROTW12_avx2(ymm_tmp0, ymm_v1)

            # a += b; d ^= a; d = ROTW8(d);
            ADD_avx2(ymm_v0, ymm_v1)
            XOR_avx2(ymm_v3, ymm_v0)
            ROTW8_avx2(ymm_tmp0, ymm_v3)

            # c += d; b ^= c; b = ROTW7(b)
            ADD_avx2(ymm_v2, ymm_v3)
            XOR_avx2(ymm_v1, ymm_v2)
            ROTW7_avx2(ymm_tmp0, ymm_v1)

            # b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
            VPSHUFD(ymm_v1, ymm_v1, 0x39)
            VPSHUFD(ymm_v2, ymm_v2, 0x4e)
            VPSHUFD(ymm_v3, ymm_v3, 0x93)

            # a += b; d ^= a; d = ROTW16(d);
            ADD_avx2(ymm_v0, ymm_v1)
            XOR_avx2(ymm_v3, ymm_v0)
            ROTW16_avx2(ymm_tmp0, ymm_v3)

            # c += d; b ^= c; b = ROTW12(b);
            ADD_avx2(ymm_v2, ymm_v3)
            XOR_avx2(ymm_v1, ymm_v2)
            ROTW12_avx2(ymm_tmp0, ymm_v1)

            # a += b; d ^= a; d = ROTW8(d);
            ADD_avx2(ymm_v0, ymm_v1)
            XOR_avx2(ymm_v3, ymm_v0)
            ROTW8_avx2(ymm_tmp0, ymm_v3)

            # c += d; b ^= c; b = ROTW7(b)
            ADD_avx2(ymm_v2, ymm_v3)
            XOR_avx2(ymm_v1, ymm_v2)
            ROTW7_avx2(ymm_tmp0, ymm_v1)

            # b = ROTV1(b); c = ROTV2(c); d = ROTV3(d);
            VPSHUFD(ymm_v1, ymm_v1, 0x93)
            VPSHUFD(ymm_v2, ymm_v2, 0x4e)
            VPSHUFD(ymm_v3, ymm_v3, 0x39)

            SUB(reg_rounds, 2)
            JNZ(rounds_loop2.begin)

        ADD_avx2(ymm_v0, ymm_s0)
        ADD_avx2(ymm_v1, ymm_s1)
        ADD_avx2(ymm_v2, ymm_s2)
        ADD_avx2(ymm_v3, ymm_s3)

        # XOR_WRITE(out+ 0, in+ 0, _mm256_permute2x128_si256(v0,v1,0x20));
        VPERM2I128(ymm_tmp0, ymm_v0, ymm_v1, 0x20)
        VPXOR(ymm_tmp0, ymm_tmp0, [reg_inp])
        VMOVDQU([reg_outp], ymm_tmp0)

        # XOR_WRITE(out+32, in+32, _mm256_permute2x128_si256(v2,v3,0x20));
        VPERM2I128(ymm_tmp0, ymm_v2, ymm_v3, 0x20)
        VPXOR(ymm_tmp0, ymm_tmp0, [reg_inp+32])
        VMOVDQU([reg_outp+32], ymm_tmp0)

        SUB(reg_blocks, 1)
        JZ(out_write_odd)

        ADD_avx2(ymm_s3, ymm_inc)

        # XOR_WRITE(out+64, in+64, _mm256_permute2x128_si256(v0,v1,0x31));
        VPERM2I128(ymm_tmp0, ymm_v0, ymm_v1, 0x31)
        VPXOR(ymm_tmp0, ymm_tmp0, [reg_inp+64])
        VMOVDQU([reg_outp+64], ymm_tmp0)

        # XOR_WRITE(out+96, in+96, _mm256_permute2x128_si256(v2,v3,0x31));
        VPERM2I128(ymm_tmp0, ymm_v2, ymm_v3, 0x31)
        VPXOR(ymm_tmp0, ymm_tmp0, [reg_inp+96])
        VMOVDQU([reg_outp+96], ymm_tmp0)

        SUB(reg_blocks, 1)
        JZ(out_write_even)

        ADD(reg_inp, 2*64)
        ADD(reg_outp, 2*64)
        JMP(vector_loop2.begin)
    LABEL(out_write_odd)

    VPERM2I128(ymm_s3, ymm_s3, ymm_s3, 0x01)  # Odd number of blocks.

    LABEL(out_write_even)

    VMOVDQA(x_s3, ymm_s3.as_xmm)  # Write back ymm_s3 to x_v3.

    # Paranoia, cleanse the scratch space.
    VPXOR(ymm_v0, ymm_v0, ymm_v0)
    VMOVDQA(mem_tmp0, ymm_v0)
    VMOVDQA(mem_s3, ymm_v0)

    # Remove our stack allocation.
    MOV(registers.rsp, reg_sp_save)

    RETURN()
#
# CPUID
#

cpuidParams = Argument(ptr(uint32_t))

with Function("cpuidAmd64", (cpuidParams,)):
    reg_params = registers.r15
    LOAD.ARGUMENT(reg_params, cpuidParams)

    MOV(registers.eax, [reg_params])
    MOV(registers.ecx, [reg_params+4])

    CPUID()

    MOV([reg_params], registers.eax)
    MOV([reg_params+4], registers.ebx)
    MOV([reg_params+8], registers.ecx)
    MOV([reg_params+12], registers.edx)

    RETURN()

#
# XGETBV (ECX = 0)
#

xcrVec = Argument(ptr(uint32_t))

with Function("xgetbv0Amd64", (xcrVec,)):
    reg_vec = GeneralPurposeRegister64()
    LOAD.ARGUMENT(reg_vec, xcrVec)

    XOR(registers.ecx, registers.ecx)

    XGETBV()

    MOV([reg_vec], registers.eax)
    MOV([reg_vec+4], registers.edx)

    RETURN()
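The two functions above only expose raw CPUID/XGETBV results; the Go side has to combine them to decide whether the AVX2 codepath is safe to use. A rough sketch of that check (hypothetical helper, not part of this package; bit positions are from the Intel SDM: AVX2 is CPUID.(EAX=7,ECX=0):EBX bit 5, and the OS must have enabled XMM and YMM state, XCR0 bits 1 and 2):

```go
package main

import "fmt"

// avx2Usable models how the cpuidAmd64/xgetbv0Amd64 outputs would be
// combined: the CPU must advertise AVX2, and the OS must save/restore
// both XMM and YMM register state on context switches.
func avx2Usable(leaf7ebx uint32, xcr0 uint64) bool {
	const avx2Bit = 1 << 5   // CPUID.(EAX=7,ECX=0):EBX bit 5
	const xmmYmmState = 0x6  // XCR0 bits 1 (SSE) and 2 (AVX)
	return leaf7ebx&avx2Bit != 0 && xcr0&xmmYmmState == xmmYmmState
}

func main() {
	fmt.Println(avx2Usable(1<<5, 0x7)) // true: CPU + OS support
	fmt.Println(avx2Usable(1<<5, 0x1)) // false: OS never enabled YMM state
}
```

Checking XCR0 matters because a CPU can support AVX2 while the operating system has not enabled extended state saving, in which case executing the VEX-encoded path would fault.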
cmd/gost/vendor/github.com/Yawning/chacha20/chacha20_amd64.s 0 → 100644
// Generated by PeachPy 0.2.0 from chacha20_amd64.py

// func blocksAmd64SSE2(x *uint32, inp *uint8, outp *uint8, nrBlocks *uint)
TEXT ·blocksAmd64SSE2(SB),4,$0-32
	MOVQ x+0(FP), AX
	MOVQ inp+8(FP), BX
	MOVQ outp+16(FP), CX
	MOVQ nrBlocks+24(FP), DX
	MOVQ SP, DI
	MOVQ $31, SI
	NOTQ SI
	ANDQ SI, SP
	SUBQ $32, SP
	PXOR X0, X0
	SUBQ $32, SP
	MOVO X0, 0(SP)
	MOVL $1, SI
	MOVL SI, 0(SP)
	SUBQ $4, DX
	JCS vector_loop4_end
vector_loop4_begin:
	MOVO 0(AX), X0
	MOVO 16(AX), X1
	MOVO 32(AX), X2
	MOVO 48(AX), X3
	MOVO X0, X4
	MOVO X1, X5
	MOVO X2, X6
	MOVO X3, X7
	PADDQ 0(SP), X7
	MOVO X0, X8
	MOVO X1, X9
	MOVO X2, X10
	MOVO X7, X11
	PADDQ 0(SP), X11
	MOVO X0, X12
	MOVO X1, X13
	MOVO X2, X14
	MOVO X11, X15
	PADDQ 0(SP), X15
	MOVQ $20, SI
rounds_loop4_begin:
	PADDL X1, X0
	PADDL X5, X4
	PADDL X9, X8
	PADDL X13, X12
	PXOR X0, X3
	PXOR X4, X7
	PXOR X8, X11
	PXOR X12, X15
	MOVO X12, 16(SP)
	MOVO X3, X12
	PSLLL $16, X12
	PSRLL $16, X3
	PXOR X12, X3
	MOVO X7, X12
	PSLLL $16, X12
	PSRLL $16, X7
	PXOR X12, X7
	MOVO X11, X12
	PSLLL $16, X12
	PSRLL $16, X11
	PXOR X12, X11
	MOVO X15, X12
	PSLLL $16, X12
	PSRLL $16, X15
	PXOR X12, X15
	PADDL X3, X2
	PADDL X7, X6
	PADDL X11, X10
	PADDL X15, X14
	PXOR X2, X1
	PXOR X6, X5
	PXOR X10, X9
	PXOR X14, X13
	MOVO X1, X12
	PSLLL $12, X12
	PSRLL $20, X1
	PXOR X12, X1
	MOVO X5, X12
	PSLLL $12, X12
	PSRLL $20, X5
	PXOR X12, X5
	MOVO X9, X12
	PSLLL $12, X12
	PSRLL $20, X9
	PXOR X12, X9
	MOVO X13, X12
	PSLLL $12, X12
	PSRLL $20, X13
	PXOR X12, X13
	MOVO 16(SP), X12
	PADDL X1, X0
	PADDL X5, X4
	PADDL X9, X8
	PADDL X13, X12
	PXOR X0, X3
	PXOR X4, X7
	PXOR X8, X11
	PXOR X12, X15
	MOVO X12, 16(SP)
	MOVO X3, X12
	PSLLL $8, X12
	PSRLL $24, X3
	PXOR X12, X3
	MOVO X7, X12
	PSLLL $8, X12
	PSRLL $24, X7
	PXOR X12, X7
	MOVO X11, X12
	PSLLL $8, X12
	PSRLL $24, X11
	PXOR X12, X11
	MOVO X15, X12
	PSLLL $8, X12
	PSRLL $24, X15
	PXOR X12, X15
	PADDL X3, X2
	PADDL X7, X6
	PADDL X11, X10
	PADDL X15, X14
	PXOR X2, X1
	PXOR X6, X5
	PXOR X10, X9
	PXOR X14, X13
	MOVO X1, X12
	PSLLL $7, X12
	PSRLL $25, X1
	PXOR X12, X1
	MOVO X5, X12
	PSLLL $7, X12
	PSRLL $25, X5
	PXOR X12, X5
	MOVO X9, X12
	PSLLL $7, X12
	PSRLL $25, X9
	PXOR X12, X9
	MOVO X13, X12
	PSLLL $7, X12
	PSRLL $25, X13
	PXOR X12, X13
	PSHUFL $57, X1, X1
	PSHUFL $57, X5, X5
	PSHUFL $57, X9, X9
	PSHUFL $57, X13, X13
	PSHUFL $78, X2, X2
	PSHUFL $78, X6, X6
	PSHUFL $78, X10, X10
	PSHUFL $78, X14, X14
	PSHUFL $147, X3, X3
	PSHUFL $147, X7, X7
	PSHUFL $147, X11, X11
	PSHUFL $147, X15, X15
	MOVO 16(SP), X12
	PADDL X1, X0
	PADDL X5, X4
	PADDL X9, X8
	PADDL X13, X12
	PXOR X0, X3
	PXOR X4, X7
	PXOR X8, X11
	PXOR X12, X15
	MOVO X12, 16(SP)
	MOVO X3, X12
	PSLLL $16, X12
	PSRLL $16, X3
	PXOR X12, X3
	MOVO X7, X12
	PSLLL $16, X12
	PSRLL $16, X7
	PXOR X12, X7
	MOVO X11, X12
	PSLLL $16, X12
	PSRLL $16, X11
	PXOR X12, X11
	MOVO X15, X12
	PSLLL $16, X12
	PSRLL $16, X15
	PXOR X12, X15
	PADDL X3, X2
	PADDL X7, X6
	PADDL X11, X10
	PADDL X15, X14
	PXOR X2, X1
	PXOR X6, X5
	PXOR X10, X9
	PXOR X14, X13
	MOVO X1, X12
	PSLLL $12, X12
	PSRLL $20, X1
	PXOR X12, X1
	MOVO X5, X12
	PSLLL $12, X12
	PSRLL $20, X5
	PXOR X12, X5
	MOVO X9, X12
	PSLLL $12, X12
	PSRLL $20, X9
	PXOR X12, X9
	MOVO X13, X12
	PSLLL $12, X12
	PSRLL $20, X13
	PXOR X12, X13
	MOVO 16(SP), X12
	PADDL X1, X0
	PADDL X5, X4
	PADDL X9, X8
	PADDL X13, X12
	PXOR X0, X3
	PXOR X4, X7
	PXOR X8, X11
	PXOR X12, X15
	MOVO X12, 16(SP)
	MOVO X3, X12
	PSLLL $8, X12
	PSRLL $24, X3
	PXOR X12, X3
	MOVO X7, X12
	PSLLL $8, X12
	PSRLL $24, X7
	PXOR X12, X7
	MOVO X11, X12
	PSLLL $8, X12
	PSRLL $24, X11
	PXOR X12, X11
	MOVO X15, X12
	PSLLL $8, X12
	PSRLL $24, X15
	PXOR X12, X15
	PADDL X3, X2
	PADDL X7, X6
	PADDL X11, X10
	PADDL X15, X14
	PXOR X2, X1
	PXOR X6, X5
	PXOR X10, X9
	PXOR X14, X13
	MOVO X1, X12
	PSLLL $7, X12
	PSRLL $25, X1
	PXOR X12, X1
	MOVO X5, X12
	PSLLL $7, X12
	PSRLL $25, X5
	PXOR X12, X5
	MOVO X9, X12
	PSLLL $7, X12
	PSRLL $25, X9
	PXOR X12, X9
	MOVO X13, X12
	PSLLL $7, X12
	PSRLL $25, X13
	PXOR X12, X13
	PSHUFL $147, X1, X1
	PSHUFL $147, X5, X5
	PSHUFL $147, X9, X9
	PSHUFL $147, X13, X13
	PSHUFL $78, X2, X2
	PSHUFL $78, X6, X6
	PSHUFL $78, X10, X10
	PSHUFL $78, X14, X14
	PSHUFL $57, X3, X3
	PSHUFL $57, X7, X7
	PSHUFL $57, X11, X11
	PSHUFL $57, X15, X15
	MOVO 16(SP), X12
	SUBQ $2, SI
	JNE rounds_loop4_begin
	MOVO X12, 16(SP)
	PADDL 0(AX), X0
	PADDL 16(AX), X1
	PADDL 32(AX), X2
	PADDL 48(AX), X3
	MOVOU 0(BX), X12
	PXOR X0, X12
	MOVOU X12, 0(CX)
	MOVOU 16(BX), X12
	PXOR X1, X12
	MOVOU X12, 16(CX)
	MOVOU 32(BX), X12
	PXOR X2, X12
	MOVOU X12, 32(CX)
	MOVOU 48(BX), X12
	PXOR X3, X12
	MOVOU X12, 48(CX)
	MOVO 48(AX), X3
	PADDQ 0(SP), X3
	PADDL 0(AX), X4
	PADDL 16(AX), X5
	PADDL 32(AX), X6
	PADDL X3, X7
	MOVOU 64(BX), X12
	PXOR X4, X12
	MOVOU X12, 64(CX)
	MOVOU 80(BX), X12
	PXOR X5, X12
	MOVOU X12, 80(CX)
	MOVOU 96(BX), X12
	PXOR X6, X12
	MOVOU X12, 96(CX)
	MOVOU 112(BX), X12
	PXOR X7, X12
	MOVOU X12, 112(CX)
	PADDQ 0(SP), X3
	PADDL 0(AX), X8
	PADDL 16(AX), X9
	PADDL 32(AX), X10
	PADDL X3, X11
	MOVOU 128(BX), X12
	PXOR X8, X12
	MOVOU X12, 128(CX)
	MOVOU 144(BX), X12
	PXOR X9, X12
	MOVOU X12, 144(CX)
	MOVOU 160(BX), X12
	PXOR X10, X12
	MOVOU X12, 160(CX)
	MOVOU 176(BX), X12
	PXOR X11, X12
	MOVOU X12, 176(CX)
	PADDQ 0(SP), X3
	MOVO 16(SP), X12
	PADDL 0(AX), X12
	PADDL 16(AX), X13
	PADDL 32(AX), X14
	PADDL X3, X15
	MOVOU 192(BX), X0
	PXOR X12, X0
	MOVOU X0, 192(CX)
	MOVOU 208(BX), X0
	PXOR X13, X0
	MOVOU X0, 208(CX)
	MOVOU 224(BX), X0
	PXOR X14, X0
	MOVOU X0, 224(CX)
	MOVOU 240(BX), X0
	PXOR X15, X0
	MOVOU X0, 240(CX)
	PADDQ 0(SP), X3
	MOVO X3, 48(AX)
	ADDQ $256, BX
	ADDQ $256, CX
	SUBQ $4, DX
	JCC vector_loop4_begin
vector_loop4_end:
	ADDQ $4, DX
	JEQ out
	MOVO 0(AX), X8
	MOVO 16(AX), X9
	MOVO 32(AX), X10
	MOVO 48(AX), X11
	MOVO 0(SP), X13
	SUBQ $2, DX
	JCS vector_loop2_end
vector_loop2_begin:
	MOVO X8, X0
	MOVO X9, X1
	MOVO X10, X2
	MOVO X11, X3
	MOVO X0, X4
	MOVO X1, X5
	MOVO X2, X6
	MOVO X3, X7
	PADDQ X13, X7
	MOVQ $20, SI
rounds_loop2_begin:
	PADDL X1, X0
	PADDL X5, X4
	PXOR X0, X3
	PXOR X4, X7
	MOVO X3, X12
	PSLLL $16, X12
	PSRLL $16, X3
	PXOR X12, X3
	MOVO X7, X12
	PSLLL $16, X12
	PSRLL $16, X7
	PXOR X12, X7
	PADDL X3, X2
	PADDL X7, X6
	PXOR X2, X1
	PXOR X6, X5
	MOVO X1, X12
	PSLLL $12, X12
	PSRLL $20, X1
	PXOR X12, X1
	MOVO X5, X12
	PSLLL $12, X12
	PSRLL $20, X5
	PXOR X12, X5
	PADDL X1, X0
	PADDL X5, X4
	PXOR X0, X3
	PXOR X4, X7
	MOVO X3, X12
	PSLLL $8, X12
	PSRLL $24, X3
	PXOR X12, X3
	MOVO X7, X12
	PSLLL $8, X12
	PSRLL $24, X7
	PXOR X12, X7
	PADDL X3, X2
	PADDL X7, X6
	PXOR X2, X1
	PXOR X6, X5
	MOVO X1, X12
	PSLLL $7, X12
	PSRLL $25, X1
	PXOR X12, X1
	MOVO X5, X12
	PSLLL $7, X12
	PSRLL $25, X5
	PXOR X12, X5
	PSHUFL $57, X1, X1
	PSHUFL $57, X5, X5
	PSHUFL $78, X2, X2
	PSHUFL $78, X6, X6
	PSHUFL $147, X3, X3
	PSHUFL $147, X7, X7
	PADDL X1, X0
	PADDL X5, X4
	PXOR X0, X3
	PXOR X4, X7
	MOVO X3, X12
	PSLLL $16, X12
	PSRLL $16, X3
	PXOR X12, X3
	MOVO X7, X12
	PSLLL $16, X12
	PSRLL $16, X7
	PXOR X12, X7
	PADDL X3, X2
	PADDL X7, X6
	PXOR X2, X1
	PXOR X6, X5
	MOVO X1, X12
	PSLLL $12, X12
	PSRLL $20, X1
	PXOR X12, X1
	MOVO X5, X12
	PSLLL $12, X12
	PSRLL $20, X5
	PXOR X12, X5
	PADDL X1, X0
	PADDL X5, X4
	PXOR X0, X3
	PXOR X4, X7
	MOVO X3, X12
	PSLLL $8, X12
	PSRLL $24, X3
	PXOR X12, X3
	MOVO X7, X12
	PSLLL $8, X12
	PSRLL $24, X7
	PXOR X12, X7
	PADDL X3, X2
	PADDL X7, X6
	PXOR X2, X1
	PXOR X6, X5
	MOVO X1, X12
	PSLLL $7, X12
	PSRLL $25, X1
	PXOR X12, X1
	MOVO X5, X12
	PSLLL $7, X12
	PSRLL $25, X5
	PXOR X12, X5
	PSHUFL $147, X1, X1
	PSHUFL $147, X5, X5
	PSHUFL $78, X2, X2
	PSHUFL $78, X6, X6
	PSHUFL $57, X3, X3
	PSHUFL $57, X7, X7
	SUBQ $2, SI
	JNE rounds_loop2_begin
	PADDL X8, X0
	PADDL X9, X1
	PADDL X10, X2
	PADDL X11, X3
	MOVOU 0(BX), X12
	PXOR X0, X12
	MOVOU X12, 0(CX)
	MOVOU 16(BX), X12
	PXOR X1, X12
	MOVOU X12, 16(CX)
	MOVOU 32(BX), X12
	PXOR X2, X12
	MOVOU X12, 32(CX)
	MOVOU 48(BX), X12
	PXOR X3, X12
	MOVOU X12, 48(CX)
	PADDQ X13, X11
	PADDL X8, X4
	PADDL X9, X5
	PADDL X10, X6
	PADDL X11, X7
	MOVOU 64(BX), X12
	PXOR X4, X12
	MOVOU X12, 64(CX)
	MOVOU 80(BX), X12
	PXOR X5, X12
	MOVOU X12, 80(CX)
	MOVOU 96(BX), X12
	PXOR X6, X12
	MOVOU X12, 96(CX)
	MOVOU 112(BX), X12
	PXOR X7, X12
	MOVOU X12, 112(CX)
	PADDQ X13, X11
	ADDQ $128, BX
	ADDQ $128, CX
	SUBQ $2, DX
	JCC vector_loop2_begin
vector_loop2_end:
	ADDQ $2, DX
	JEQ out_serial
	MOVO X8, X0
	MOVO X9, X1
	MOVO X10, X2
	MOVO X11, X3
	MOVQ $20, DX
rounds_loop1_begin:
	PADDL X1, X0
	PXOR X0, X3
	MOVO X3, X12
	PSLLL $16, X12
	PSRLL $16, X3
	PXOR X12, X3
	PADDL X3, X2
	PXOR X2, X1
	MOVO X1, X12
	PSLLL $12, X12
	PSRLL $20, X1
	PXOR X12, X1
	PADDL X1, X0
	PXOR X0, X3
	MOVO X3, X12
	PSLLL $8, X12
	PSRLL $24, X3
	PXOR X12, X3
	PADDL X3, X2
	PXOR X2, X1
	MOVO X1, X12
	PSLLL $7, X12
	PSRLL $25, X1
	PXOR X12, X1
	PSHUFL $57, X1, X1
	PSHUFL $78, X2, X2
	PSHUFL $147, X3, X3
	PADDL X1, X0
	PXOR X0, X3
	MOVO X3, X12
	PSLLL $16, X12
	PSRLL $16, X3
	PXOR X12, X3
	PADDL X3, X2
	PXOR X2, X1
	MOVO X1, X12
	PSLLL $12, X12
	PSRLL $20, X1
	PXOR X12, X1
	PADDL X1, X0
	PXOR X0, X3
	MOVO X3, X12
	PSLLL $8, X12
	PSRLL $24, X3
	PXOR X12, X3
	PADDL X3, X2
	PXOR X2, X1
	MOVO X1, X12
	PSLLL $7, X12
	PSRLL $25, X1
	PXOR X12, X1
	PSHUFL $147, X1, X1
	PSHUFL $78, X2, X2
	PSHUFL $57, X3, X3
	SUBQ $2, DX
	JNE rounds_loop1_begin
	PADDL X8, X0
	PADDL X9, X1
	PADDL X10, X2
	PADDL X11, X3
	MOVOU 0(BX), X12
	PXOR X0, X12
	MOVOU X12, 0(CX)
	MOVOU 16(BX), X12
	PXOR X1, X12
	MOVOU X12, 16(CX)
	MOVOU 32(BX), X12
	PXOR X2, X12
	MOVOU X12, 32(CX)
	MOVOU 48(BX), X12
	PXOR X3, X12
	MOVOU X12, 48(CX)
	PADDQ X13, X11
out_serial:
	MOVO X11, 48(AX)
out:
	PXOR X0, X0
	MOVO X0, 16(SP)
	MOVQ DI, SP
	RET
//
func
blocksAmd64AVX2
(
x
*
uint32
,
inp
*
uint8
,
outp
*
uint8
,
nrBlocks
*
uint
)
TEXT
·
blocksAmd64AVX2
(
SB
),4,$0-32
MOVQ
x
+
0
(
FP
),
AX
MOVQ
inp
+
8
(
FP
),
BX
MOVQ
outp
+
16
(
FP
),
CX
MOVQ
nrBlocks
+
24
(
FP
),
DX
MOVQ
SP
,
DI
MOVQ
$
31
,
SI
NOTQ
SI
ANDQ
SI
,
SP
SUBQ
$
32
,
SP
SUBQ
$
96
,
SP
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x1D; BYTE $0xEF; BYTE $0xE4 // VPXOR ymm12, ymm12, ymm12
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0x24; BYTE $0x24 // VMOVDQU [rsp], ymm12
MOVL
$
1
,
SI
MOVL
SI
,
16
(
SP
)
BYTE
$
0xC4
; BYTE $0xE2; BYTE $0x7D; BYTE $0x5A; BYTE $0x58; BYTE $0x30 // VBROADCASTI128 ymm3, [rax + 48]
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0xD4; BYTE $0x1C; BYTE $0x24 // VPADDQ ymm3, ymm3, [rsp]
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x7F; BYTE $0x5C; BYTE $0x24; BYTE $0x20 // VMOVDQA [rsp + 32], ymm3
MOVL
$
2
,
SI
MOVL
SI
,
0
(
SP
)
MOVL
SI
,
16
(
SP
)
SUBQ
$
8
,
DX
JCS
vector_loop8_end
vector_loop8_begin
:
BYTE
$
0xC4
; BYTE $0xE2; BYTE $0x7D; BYTE $0x5A; BYTE $0x00 // VBROADCASTI128 ymm0, [rax]
BYTE
$
0xC4
; BYTE $0xE2; BYTE $0x7D; BYTE $0x5A; BYTE $0x48; BYTE $0x10 // VBROADCASTI128 ymm1, [rax + 16]
BYTE
$
0xC4
; BYTE $0xE2; BYTE $0x7D; BYTE $0x5A; BYTE $0x50; BYTE $0x20 // VBROADCASTI128 ymm2, [rax + 32]
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x6F; BYTE $0x5C; BYTE $0x24; BYTE $0x20 // VMOVDQA ymm3, [rsp + 32]
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x6F; BYTE $0xE0 // VMOVDQA ymm4, ymm0
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x6F; BYTE $0xE9 // VMOVDQA ymm5, ymm1
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x6F; BYTE $0xF2 // VMOVDQA ymm6, ymm2
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0xD4; BYTE $0x3C; BYTE $0x24 // VPADDQ ymm7, ymm3, [rsp]
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0xC0 // VMOVDQA ymm8, ymm0
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0xC9 // VMOVDQA ymm9, ymm1
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0xD2 // VMOVDQA ymm10, ymm2
BYTE
$
0xC5
; BYTE $0x45; BYTE $0xD4; BYTE $0x1C; BYTE $0x24 // VPADDQ ymm11, ymm7, [rsp]
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0xE0 // VMOVDQA ymm12, ymm0
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0xE9 // VMOVDQA ymm13, ymm1
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0xF2 // VMOVDQA ymm14, ymm2
BYTE
$
0xC5
; BYTE $0x25; BYTE $0xD4; BYTE $0x3C; BYTE $0x24 // VPADDQ ymm15, ymm11, [rsp]
MOVQ
$
20
,
SI
rounds_loop8_begin
:
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE
$
0xC5
; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x3D; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm8, ymm8, ymm9
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x1D; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm12, ymm12, ymm13
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE
$
0xC5
; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm11, ymm11, ymm8
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x7F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm7, 16
BYTE
$
0xC5
; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm7, ymm7, 16
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm11, 16
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x25; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm11, ymm11, 16
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xDC // VPXOR ymm11, ymm11, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm15, 16
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x05; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm15, ymm15, 16
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE
$
0xC5
; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE
$
0xC5
; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x2D; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm10, ymm10, ymm11
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x0D; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm14, ymm14, ymm15
BYTE
$
0xC5
; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE
$
0xC5
; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCA // VPXOR ymm9, ymm9, ymm10
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEE // VPXOR ymm13, ymm13, ymm14
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
BYTE
$
0xC5
; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm5, 12
BYTE
$
0xC5
; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm5, ymm5, 20
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm9, 12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x35; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm9, ymm9, 20
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCC // VPXOR ymm9, ymm9, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm13, 12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x15; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm13, ymm13, 20
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEC // VPXOR ymm13, ymm13, ymm12
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA ymm12, [rsp + 64]
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE
$
0xC5
; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x3D; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm8, ymm8, ymm9
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x1D; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm12, ymm12, ymm13
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE
$
0xC5
; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm11, ymm11, ymm8
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x7F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm7, 8
BYTE
$
0xC5
; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm7, ymm7, 24
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm11, 8
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x25; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm11, ymm11, 24
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xDC // VPXOR ymm11, ymm11, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm15, 8
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x05; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm15, ymm15, 24
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE
$
0xC5
; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE
$
0xC5
; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x2D; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm10, ymm10, ymm11
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x0D; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm14, ymm14, ymm15
BYTE
$
0xC5
; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE
$
0xC5
; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCA // VPXOR ymm9, ymm9, ymm10
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEE // VPXOR ymm13, ymm13, ymm14
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
BYTE
$
0xC5
; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm5, 7
BYTE
$
0xC5
; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm5, ymm5, 25
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm9, 7
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x35; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm9, ymm9, 25
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCC // VPXOR ymm9, ymm9, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm13, 7
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x15; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm13, ymm13, 25
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEC // VPXOR ymm13, ymm13, ymm12
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x39 // VPSHUFD ymm1, ymm1, 57
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xED; BYTE $0x39 // VPSHUFD ymm5, ymm5, 57
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xC9; BYTE $0x39 // VPSHUFD ymm9, ymm9, 57
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xED; BYTE $0x39 // VPSHUFD ymm13, ymm13, 57
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm6, ymm6, 78
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm10, ymm10, 78
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm14, ymm14, 78
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x93 // VPSHUFD ymm3, ymm3, 147
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xFF; BYTE $0x93 // VPSHUFD ymm7, ymm7, 147
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xDB; BYTE $0x93 // VPSHUFD ymm11, ymm11, 147
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xFF; BYTE $0x93 // VPSHUFD ymm15, ymm15, 147
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA ymm12, [rsp + 64]
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE
$
0xC5
; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x3D; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm8, ymm8, ymm9
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x1D; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm12, ymm12, ymm13
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE
$
0xC5
; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm11, ymm11, ymm8
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x7F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm7, 16
BYTE
$
0xC5
; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm7, ymm7, 16
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm11, 16
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x25; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm11, ymm11, 16
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xDC // VPXOR ymm11, ymm11, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm15, 16
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x05; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm15, ymm15, 16
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE
$
0xC5
; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE
$
0xC5
; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x2D; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm10, ymm10, ymm11
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x0D; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm14, ymm14, ymm15
BYTE
$
0xC5
; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE
$
0xC5
; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCA // VPXOR ymm9, ymm9, ymm10
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEE // VPXOR ymm13, ymm13, ymm14
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
BYTE
$
0xC5
; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm5, 12
BYTE
$
0xC5
; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm5, ymm5, 20
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm9, 12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x35; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm9, ymm9, 20
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCC // VPXOR ymm9, ymm9, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm13, 12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x15; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm13, ymm13, 20
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEC // VPXOR ymm13, ymm13, ymm12
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA ymm12, [rsp + 64]
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
BYTE
$
0xC5
; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x3D; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm8, ymm8, ymm9
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x1D; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm12, ymm12, ymm13
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
BYTE
$
0xC5
; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm11, ymm11, ymm8
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x7F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm7, 8
BYTE
$
0xC5
; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm7, ymm7, 24
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm11, 8
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x25; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm11, ymm11, 24
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x25; BYTE $0xEF; BYTE $0xDC // VPXOR ymm11, ymm11, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm15, 8
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x05; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm15, ymm15, 24
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x05; BYTE $0xEF; BYTE $0xFC // VPXOR ymm15, ymm15, ymm12
BYTE
$
0xC5
; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
BYTE
$
0xC5
; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x2D; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm10, ymm10, ymm11
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x0D; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm14, ymm14, ymm15
BYTE
$
0xC5
; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
BYTE
$
0xC5
; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCA // VPXOR ymm9, ymm9, ymm10
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEE // VPXOR ymm13, ymm13, ymm14
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
BYTE
$
0xC5
; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
BYTE
$
0xC5
; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm5, 7
BYTE
$
0xC5
; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm5, ymm5, 25
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm9, 7
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x35; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm9, ymm9, 25
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x35; BYTE $0xEF; BYTE $0xCC // VPXOR ymm9, ymm9, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x1D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm13, 7
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x15; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm13, ymm13, 25
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x15; BYTE $0xEF; BYTE $0xEC // VPXOR ymm13, ymm13, ymm12
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x93 // VPSHUFD ymm1, ymm1, 147
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xED; BYTE $0x93 // VPSHUFD ymm5, ymm5, 147
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xC9; BYTE $0x93 // VPSHUFD ymm9, ymm9, 147
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xED; BYTE $0x93 // VPSHUFD ymm13, ymm13, 147
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm6, ymm6, 78
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm10, ymm10, 78
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm14, ymm14, 78
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x39 // VPSHUFD ymm3, ymm3, 57
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x70; BYTE $0xFF; BYTE $0x39 // VPSHUFD ymm7, ymm7, 57
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xDB; BYTE $0x39 // VPSHUFD ymm11, ymm11, 57
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x7D; BYTE $0x70; BYTE $0xFF; BYTE $0x39 // VPSHUFD ymm15, ymm15, 57
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA ymm12, [rsp + 64]
SUBQ
$
2
,
SI
JNE
rounds_loop8_begin
BYTE
$
0xC4
; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x20 // VBROADCASTI128 ymm12, [rax]
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x7D; BYTE $0xFE; BYTE $0xC4 // VPADDD ymm0, ymm0, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x5D; BYTE $0xFE; BYTE $0xE4 // VPADDD ymm4, ymm4, ymm12
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x3D; BYTE $0xFE; BYTE $0xC4 // VPADDD ymm8, ymm8, ymm12
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xFE; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VPADDD ymm12, ymm12, [rsp + 64]
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x7F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm12
BYTE
$
0xC4
; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x60; BYTE $0x10 // VBROADCASTI128 ymm12, [rax + 16]
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x75; BYTE $0xFE; BYTE $0xCC // VPADDD ymm1, ymm1, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x55; BYTE $0xFE; BYTE $0xEC // VPADDD ymm5, ymm5, ymm12
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x35; BYTE $0xFE; BYTE $0xCC // VPADDD ymm9, ymm9, ymm12
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x15; BYTE $0xFE; BYTE $0xEC // VPADDD ymm13, ymm13, ymm12
BYTE
$
0xC4
; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x60; BYTE $0x20 // VBROADCASTI128 ymm12, [rax + 32]
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x6D; BYTE $0xFE; BYTE $0xD4 // VPADDD ymm2, ymm2, ymm12
BYTE
$
0xC4
; BYTE $0xC1; BYTE $0x4D; BYTE $0xFE; BYTE $0xF4 // VPADDD ymm6, ymm6, ymm12
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x2D; BYTE $0xFE; BYTE $0xD4 // VPADDD ymm10, ymm10, ymm12
BYTE
$
0xC4
; BYTE $0x41; BYTE $0x0D; BYTE $0xFE; BYTE $0xF4 // VPADDD ymm14, ymm14, ymm12
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0xFE; BYTE $0x5C; BYTE $0x24; BYTE $0x20 // VPADDD ymm3, ymm3, [rsp + 32]
BYTE
$
0xC4
; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x20 // VPERM2I128 ymm12, ymm0, ymm1, 32
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0x23 // VPXOR ymm12, ymm12, [rbx]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0x21 // VMOVDQU [rcx], ymm12
BYTE
$
0xC4
; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x20 // VPERM2I128 ymm12, ymm2, ymm3, 32
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x20 // VPXOR ymm12, ymm12, [rbx + 32]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x20 // VMOVDQU [rcx + 32], ymm12
BYTE
$
0xC4
; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x31 // VPERM2I128 ymm12, ymm0, ymm1, 49
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x40 // VPXOR ymm12, ymm12, [rbx + 64]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x40 // VMOVDQU [rcx + 64], ymm12
BYTE
$
0xC4
; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x31 // VPERM2I128 ymm12, ymm2, ymm3, 49
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x60 // VPXOR ymm12, ymm12, [rbx + 96]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x60 // VMOVDQU [rcx + 96], ymm12
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x6F; BYTE $0x5C; BYTE $0x24; BYTE $0x20 // VMOVDQA ymm3, [rsp + 32]
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0xFE; BYTE $0x1C; BYTE $0x24 // VPADDD ymm3, ymm3, [rsp]
BYTE
$
0xC5
; BYTE $0xC5; BYTE $0xFE; BYTE $0xFB // VPADDD ymm7, ymm7, ymm3
BYTE
$
0xC4
; BYTE $0x63; BYTE $0x5D; BYTE $0x46; BYTE $0xE5; BYTE $0x20 // VPERM2I128 ymm12, ymm4, ymm5, 32
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x80; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 128]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x80; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 128], ymm12
BYTE
$
0xC4
; BYTE $0x63; BYTE $0x4D; BYTE $0x46; BYTE $0xE7; BYTE $0x20 // VPERM2I128 ymm12, ymm6, ymm7, 32
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xA0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 160]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xA0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 160], ymm12
BYTE
$
0xC4
; BYTE $0x63; BYTE $0x5D; BYTE $0x46; BYTE $0xE5; BYTE $0x31 // VPERM2I128 ymm12, ymm4, ymm5, 49
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xC0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 192]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xC0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 192], ymm12
BYTE
$
0xC4
; BYTE $0x63; BYTE $0x4D; BYTE $0x46; BYTE $0xE7; BYTE $0x31 // VPERM2I128 ymm12, ymm6, ymm7, 49
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xE0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 224]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xE0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 224], ymm12
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0xFE; BYTE $0x1C; BYTE $0x24 // VPADDD ymm3, ymm3, [rsp]
BYTE
$
0xC5
; BYTE $0x25; BYTE $0xFE; BYTE $0xDB // VPADDD ymm11, ymm11, ymm3
BYTE
$
0xC4
; BYTE $0x43; BYTE $0x3D; BYTE $0x46; BYTE $0xE1; BYTE $0x20 // VPERM2I128 ymm12, ymm8, ymm9, 32
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x00; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 256]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x00; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 256], ymm12
BYTE
$
0xC4
; BYTE $0x43; BYTE $0x2D; BYTE $0x46; BYTE $0xE3; BYTE $0x20 // VPERM2I128 ymm12, ymm10, ymm11, 32
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x20; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 288]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x20; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 288], ymm12
BYTE
$
0xC4
; BYTE $0x43; BYTE $0x3D; BYTE $0x46; BYTE $0xE1; BYTE $0x31 // VPERM2I128 ymm12, ymm8, ymm9, 49
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x40; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 320]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x40; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 320], ymm12
BYTE
$
0xC4
; BYTE $0x43; BYTE $0x2D; BYTE $0x46; BYTE $0xE3; BYTE $0x31 // VPERM2I128 ymm12, ymm10, ymm11, 49
BYTE
$
0xC5
; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x60; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 352]
BYTE
$
0xC5
; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x60; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 352], ymm12
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0xFE; BYTE $0x1C; BYTE $0x24 // VPADDD ymm3, ymm3, [rsp]
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0x64; BYTE $0x24; BYTE $0x40 // VMOVDQA ymm12, [rsp + 64]
BYTE
$
0xC5
; BYTE $0x05; BYTE $0xFE; BYTE $0xFB // VPADDD ymm15, ymm15, ymm3
BYTE
$
0xC4
; BYTE $0xC3; BYTE $0x1D; BYTE $0x46; BYTE $0xC5; BYTE $0x20 // VPERM2I128 ymm0, ymm12, ymm13, 32
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0xEF; BYTE $0x83; BYTE $0x80; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm0, ymm0, [rbx + 384]
BYTE
$
0xC5
; BYTE $0xFE; BYTE $0x7F; BYTE $0x81; BYTE $0x80; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 384], ymm0
BYTE
$
0xC4
; BYTE $0xC3; BYTE $0x0D; BYTE $0x46; BYTE $0xC7; BYTE $0x20 // VPERM2I128 ymm0, ymm14, ymm15, 32
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0xEF; BYTE $0x83; BYTE $0xA0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm0, ymm0, [rbx + 416]
BYTE
$
0xC5
; BYTE $0xFE; BYTE $0x7F; BYTE $0x81; BYTE $0xA0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 416], ymm0
BYTE
$
0xC4
; BYTE $0xC3; BYTE $0x1D; BYTE $0x46; BYTE $0xC5; BYTE $0x31 // VPERM2I128 ymm0, ymm12, ymm13, 49
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0xEF; BYTE $0x83; BYTE $0xC0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm0, ymm0, [rbx + 448]
BYTE
$
0xC5
; BYTE $0xFE; BYTE $0x7F; BYTE $0x81; BYTE $0xC0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 448], ymm0
BYTE
$
0xC4
; BYTE $0xC3; BYTE $0x0D; BYTE $0x46; BYTE $0xC7; BYTE $0x31 // VPERM2I128 ymm0, ymm14, ymm15, 49
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0xEF; BYTE $0x83; BYTE $0xE0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VPXOR ymm0, ymm0, [rbx + 480]
BYTE
$
0xC5
; BYTE $0xFE; BYTE $0x7F; BYTE $0x81; BYTE $0xE0; BYTE $0x01; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 480], ymm0
BYTE
$
0xC5
; BYTE $0xE5; BYTE $0xFE; BYTE $0x1C; BYTE $0x24 // VPADDD ymm3, ymm3, [rsp]
BYTE
$
0xC5
; BYTE $0xFD; BYTE $0x7F; BYTE $0x5C; BYTE $0x24; BYTE $0x20 // VMOVDQA [rsp + 32], ymm3
ADDQ
$
512
,
BX
ADDQ
$
512
,
CX
SUBQ
$
8
,
DX
JCC
vector_loop8_begin
vector_loop8_end
:
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0xDB // VMOVDQA ymm11, ymm3
ADDQ
$
8
,
DX
JEQ
out_write_even
BYTE
$
0xC4
; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x00 // VBROADCASTI128 ymm8, [rax]
BYTE
$
0xC4
; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x48; BYTE $0x10 // VBROADCASTI128 ymm9, [rax + 16]
BYTE
$
0xC4
; BYTE $0x62; BYTE $0x7D; BYTE $0x5A; BYTE $0x50; BYTE $0x20 // VBROADCASTI128 ymm10, [rax + 32]
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x6F; BYTE $0x34; BYTE $0x24 // VMOVDQA ymm14, [rsp]
SUBQ
$
4
,
DX
JCS
vector_loop4_end
vector_loop4_begin
:
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x7F; BYTE $0xC0 // VMOVDQA ymm0, ymm8
BYTE
$
0xC5
; BYTE $0x7D; BYTE $0x7F; BYTE $0xC9 // VMOVDQA ymm1, ymm9
BYTE
$
0xC5
	BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xD2 // VMOVDQA ymm2, ymm10
	BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xDB // VMOVDQA ymm3, ymm11
	BYTE $0xC5; BYTE $0xFD; BYTE $0x6F; BYTE $0xE0 // VMOVDQA ymm4, ymm0
	BYTE $0xC5; BYTE $0xFD; BYTE $0x6F; BYTE $0xE9 // VMOVDQA ymm5, ymm1
	BYTE $0xC5; BYTE $0xFD; BYTE $0x6F; BYTE $0xF2 // VMOVDQA ymm6, ymm2
	BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xD4; BYTE $0xFE // VPADDQ ymm7, ymm3, ymm14
	MOVQ $20, SI

rounds_loop4_begin:
	BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
	BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
	BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
	BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
	BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
	BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm7, 16
	BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm7, ymm7, 16
	BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
	BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
	BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
	BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
	BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
	BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
	BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm5, 12
	BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm5, ymm5, 20
	BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
	BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
	BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
	BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
	BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
	BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
	BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm7, 8
	BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm7, ymm7, 24
	BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
	BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
	BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
	BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
	BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
	BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
	BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm5, 7
	BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm5, ymm5, 25
	BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x39 // VPSHUFD ymm1, ymm1, 57
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xED; BYTE $0x39 // VPSHUFD ymm5, ymm5, 57
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm6, ymm6, 78
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x93 // VPSHUFD ymm3, ymm3, 147
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xFF; BYTE $0x93 // VPSHUFD ymm7, ymm7, 147
	BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
	BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
	BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
	BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
	BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
	BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x10 // VPSLLD ymm12, ymm7, 16
	BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x10 // VPSRLD ymm7, ymm7, 16
	BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
	BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
	BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
	BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
	BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
	BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
	BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x0C // VPSLLD ymm12, ymm5, 12
	BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x14 // VPSRLD ymm5, ymm5, 20
	BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
	BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
	BYTE $0xC5; BYTE $0xDD; BYTE $0xFE; BYTE $0xE5 // VPADDD ymm4, ymm4, ymm5
	BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
	BYTE $0xC5; BYTE $0xC5; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm4
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
	BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
	BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF7; BYTE $0x08 // VPSLLD ymm12, ymm7, 8
	BYTE $0xC5; BYTE $0xC5; BYTE $0x72; BYTE $0xD7; BYTE $0x18 // VPSRLD ymm7, ymm7, 24
	BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xEF; BYTE $0xFC // VPXOR ymm7, ymm7, ymm12
	BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
	BYTE $0xC5; BYTE $0xCD; BYTE $0xFE; BYTE $0xF7 // VPADDD ymm6, ymm6, ymm7
	BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
	BYTE $0xC5; BYTE $0xD5; BYTE $0xEF; BYTE $0xEE // VPXOR ymm5, ymm5, ymm6
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
	BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
	BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF5; BYTE $0x07 // VPSLLD ymm12, ymm5, 7
	BYTE $0xC5; BYTE $0xD5; BYTE $0x72; BYTE $0xD5; BYTE $0x19 // VPSRLD ymm5, ymm5, 25
	BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xEF; BYTE $0xEC // VPXOR ymm5, ymm5, ymm12
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x93 // VPSHUFD ymm1, ymm1, 147
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xED; BYTE $0x93 // VPSHUFD ymm5, ymm5, 147
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xF6; BYTE $0x4E // VPSHUFD ymm6, ymm6, 78
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x39 // VPSHUFD ymm3, ymm3, 57
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xFF; BYTE $0x39 // VPSHUFD ymm7, ymm7, 57
	SUBQ $2, SI
	JNE rounds_loop4_begin
	BYTE $0xC4; BYTE $0xC1; BYTE $0x7D; BYTE $0xFE; BYTE $0xC0 // VPADDD ymm0, ymm0, ymm8
	BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xFE; BYTE $0xC9 // VPADDD ymm1, ymm1, ymm9
	BYTE $0xC4; BYTE $0xC1; BYTE $0x6D; BYTE $0xFE; BYTE $0xD2 // VPADDD ymm2, ymm2, ymm10
	BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xFE; BYTE $0xDB // VPADDD ymm3, ymm3, ymm11
	BYTE $0xC4; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x20 // VPERM2I128 ymm12, ymm0, ymm1, 32
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x23 // VPXOR ymm12, ymm12, [rbx]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x21 // VMOVDQU [rcx], ymm12
	BYTE $0xC4; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x20 // VPERM2I128 ymm12, ymm2, ymm3, 32
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x20 // VPXOR ymm12, ymm12, [rbx + 32]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x20 // VMOVDQU [rcx + 32], ymm12
	BYTE $0xC4; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x31 // VPERM2I128 ymm12, ymm0, ymm1, 49
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x40 // VPXOR ymm12, ymm12, [rbx + 64]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x40 // VMOVDQU [rcx + 64], ymm12
	BYTE $0xC4; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x31 // VPERM2I128 ymm12, ymm2, ymm3, 49
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x60 // VPXOR ymm12, ymm12, [rbx + 96]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x60 // VMOVDQU [rcx + 96], ymm12
	BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xFE; BYTE $0xDE // VPADDD ymm11, ymm11, ymm14
	BYTE $0xC4; BYTE $0xC1; BYTE $0x5D; BYTE $0xFE; BYTE $0xE0 // VPADDD ymm4, ymm4, ymm8
	BYTE $0xC4; BYTE $0xC1; BYTE $0x55; BYTE $0xFE; BYTE $0xE9 // VPADDD ymm5, ymm5, ymm9
	BYTE $0xC4; BYTE $0xC1; BYTE $0x4D; BYTE $0xFE; BYTE $0xF2 // VPADDD ymm6, ymm6, ymm10
	BYTE $0xC4; BYTE $0xC1; BYTE $0x45; BYTE $0xFE; BYTE $0xFB // VPADDD ymm7, ymm7, ymm11
	BYTE $0xC4; BYTE $0x63; BYTE $0x5D; BYTE $0x46; BYTE $0xE5; BYTE $0x20 // VPERM2I128 ymm12, ymm4, ymm5, 32
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0x80; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 128]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0x80; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 128], ymm12
	BYTE $0xC4; BYTE $0x63; BYTE $0x4D; BYTE $0x46; BYTE $0xE7; BYTE $0x20 // VPERM2I128 ymm12, ymm6, ymm7, 32
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xA0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 160]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xA0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 160], ymm12
	BYTE $0xC4; BYTE $0x63; BYTE $0x5D; BYTE $0x46; BYTE $0xE5; BYTE $0x31 // VPERM2I128 ymm12, ymm4, ymm5, 49
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xC0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 192]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xC0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 192], ymm12
	BYTE $0xC4; BYTE $0x63; BYTE $0x4D; BYTE $0x46; BYTE $0xE7; BYTE $0x31 // VPERM2I128 ymm12, ymm6, ymm7, 49
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0xA3; BYTE $0xE0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VPXOR ymm12, ymm12, [rbx + 224]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0xA1; BYTE $0xE0; BYTE $0x00; BYTE $0x00; BYTE $0x00 // VMOVDQU [rcx + 224], ymm12
	BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xFE; BYTE $0xDE // VPADDD ymm11, ymm11, ymm14
	ADDQ $256, BX
	ADDQ $256, CX
	SUBQ $4, DX
	JCC vector_loop4_begin

vector_loop4_end:
	ADDQ $4, DX
	JEQ out_write_even

vector_loop2_begin:
	BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xC0 // VMOVDQA ymm0, ymm8
	BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xC9 // VMOVDQA ymm1, ymm9
	BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xD2 // VMOVDQA ymm2, ymm10
	BYTE $0xC5; BYTE $0x7D; BYTE $0x7F; BYTE $0xDB // VMOVDQA ymm3, ymm11
	MOVQ $20, SI

rounds_loop2_begin:
	BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
	BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
	BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
	BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
	BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
	BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
	BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
	BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
	BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
	BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
	BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
	BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
	BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
	BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
	BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
	BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x39 // VPSHUFD ymm1, ymm1, 57
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x93 // VPSHUFD ymm3, ymm3, 147
	BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
	BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x10 // VPSLLD ymm12, ymm3, 16
	BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x10 // VPSRLD ymm3, ymm3, 16
	BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
	BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
	BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x0C // VPSLLD ymm12, ymm1, 12
	BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x14 // VPSRLD ymm1, ymm1, 20
	BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
	BYTE $0xC5; BYTE $0xFD; BYTE $0xFE; BYTE $0xC1 // VPADDD ymm0, ymm0, ymm1
	BYTE $0xC5; BYTE $0xE5; BYTE $0xEF; BYTE $0xD8 // VPXOR ymm3, ymm3, ymm0
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF3; BYTE $0x08 // VPSLLD ymm12, ymm3, 8
	BYTE $0xC5; BYTE $0xE5; BYTE $0x72; BYTE $0xD3; BYTE $0x18 // VPSRLD ymm3, ymm3, 24
	BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xEF; BYTE $0xDC // VPXOR ymm3, ymm3, ymm12
	BYTE $0xC5; BYTE $0xED; BYTE $0xFE; BYTE $0xD3 // VPADDD ymm2, ymm2, ymm3
	BYTE $0xC5; BYTE $0xF5; BYTE $0xEF; BYTE $0xCA // VPXOR ymm1, ymm1, ymm2
	BYTE $0xC5; BYTE $0x9D; BYTE $0x72; BYTE $0xF1; BYTE $0x07 // VPSLLD ymm12, ymm1, 7
	BYTE $0xC5; BYTE $0xF5; BYTE $0x72; BYTE $0xD1; BYTE $0x19 // VPSRLD ymm1, ymm1, 25
	BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xEF; BYTE $0xCC // VPXOR ymm1, ymm1, ymm12
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xC9; BYTE $0x93 // VPSHUFD ymm1, ymm1, 147
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xD2; BYTE $0x4E // VPSHUFD ymm2, ymm2, 78
	BYTE $0xC5; BYTE $0xFD; BYTE $0x70; BYTE $0xDB; BYTE $0x39 // VPSHUFD ymm3, ymm3, 57
	SUBQ $2, SI
	JNE rounds_loop2_begin
	BYTE $0xC4; BYTE $0xC1; BYTE $0x7D; BYTE $0xFE; BYTE $0xC0 // VPADDD ymm0, ymm0, ymm8
	BYTE $0xC4; BYTE $0xC1; BYTE $0x75; BYTE $0xFE; BYTE $0xC9 // VPADDD ymm1, ymm1, ymm9
	BYTE $0xC4; BYTE $0xC1; BYTE $0x6D; BYTE $0xFE; BYTE $0xD2 // VPADDD ymm2, ymm2, ymm10
	BYTE $0xC4; BYTE $0xC1; BYTE $0x65; BYTE $0xFE; BYTE $0xDB // VPADDD ymm3, ymm3, ymm11
	BYTE $0xC4; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x20 // VPERM2I128 ymm12, ymm0, ymm1, 32
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x23 // VPXOR ymm12, ymm12, [rbx]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x21 // VMOVDQU [rcx], ymm12
	BYTE $0xC4; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x20 // VPERM2I128 ymm12, ymm2, ymm3, 32
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x20 // VPXOR ymm12, ymm12, [rbx + 32]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x20 // VMOVDQU [rcx + 32], ymm12
	SUBQ $1, DX
	JEQ out_write_odd
	BYTE $0xC4; BYTE $0x41; BYTE $0x25; BYTE $0xFE; BYTE $0xDE // VPADDD ymm11, ymm11, ymm14
	BYTE $0xC4; BYTE $0x63; BYTE $0x7D; BYTE $0x46; BYTE $0xE1; BYTE $0x31 // VPERM2I128 ymm12, ymm0, ymm1, 49
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x40 // VPXOR ymm12, ymm12, [rbx + 64]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x40 // VMOVDQU [rcx + 64], ymm12
	BYTE $0xC4; BYTE $0x63; BYTE $0x6D; BYTE $0x46; BYTE $0xE3; BYTE $0x31 // VPERM2I128 ymm12, ymm2, ymm3, 49
	BYTE $0xC5; BYTE $0x1D; BYTE $0xEF; BYTE $0x63; BYTE $0x60 // VPXOR ymm12, ymm12, [rbx + 96]
	BYTE $0xC5; BYTE $0x7E; BYTE $0x7F; BYTE $0x61; BYTE $0x60 // VMOVDQU [rcx + 96], ymm12
	SUBQ $1, DX
	JEQ out_write_even
	ADDQ $128, BX
	ADDQ $128, CX
	JMP vector_loop2_begin

out_write_odd:
	BYTE $0xC4; BYTE $0x43; BYTE $0x25; BYTE $0x46; BYTE $0xDB; BYTE $0x01 // VPERM2I128 ymm11, ymm11, ymm11, 1

out_write_even:
	BYTE $0xC5; BYTE $0x79; BYTE $0x7F; BYTE $0x58; BYTE $0x30 // VMOVDQA [rax + 48], xmm11
	BYTE $0xC5; BYTE $0xFD; BYTE $0xEF; BYTE $0xC0 // VPXOR ymm0, ymm0, ymm0
	BYTE $0xC5; BYTE $0xFD; BYTE $0x7F; BYTE $0x44; BYTE $0x24; BYTE $0x40 // VMOVDQA [rsp + 64], ymm0
	BYTE $0xC5; BYTE $0xFD; BYTE $0x7F; BYTE $0x44; BYTE $0x24; BYTE $0x20 // VMOVDQA [rsp + 32], ymm0
	MOVQ DI, SP
	BYTE $0xC5; BYTE $0xF8; BYTE $0x77 // VZEROUPPER
	RET

// func cpuidAmd64(cpuidParams *uint32)
TEXT ·cpuidAmd64(SB),4,$0-8
	MOVQ cpuidParams+0(FP), R15
	MOVL 0(R15), AX
	MOVL 4(R15), CX
	CPUID
	MOVL AX, 0(R15)
	MOVL BX, 4(R15)
	MOVL CX, 8(R15)
	MOVL DX, 12(R15)
	RET

// func xgetbv0Amd64(xcrVec *uint32)
TEXT ·xgetbv0Amd64(SB),4,$0-8
	MOVQ xcrVec+0(FP), BX
	XORL CX, CX
	BYTE $0x0F; BYTE $0x01; BYTE $0xD0 // XGETBV
	MOVL AX, 0(BX)
	MOVL DX, 4(BX)
	RET
cmd/gost/vendor/github.com/Yawning/chacha20/chacha20_ref.go (new file, 0 → 100644)
// chacha20_ref.go - Reference ChaCha20.
//
// To the extent possible under law, Yawning Angel has waived all copyright
// and related or neighboring rights to chacha20, using the Creative
// Commons "CC0" public domain dedication. See LICENSE or
// <http://creativecommons.org/publicdomain/zero/1.0/> for full details.

package chacha20

import (
	"encoding/binary"
	"math"
	"unsafe"
)

func blocksRef(x *[stateSize]uint32, in []byte, out []byte, nrBlocks int, isIetf bool) {
	if isIetf {
		var totalBlocks uint64
		totalBlocks = uint64(x[8]) + uint64(nrBlocks)
		if totalBlocks > math.MaxUint32 {
			panic("chacha20: Exceeded keystream per nonce limit")
		}
	}
	// This routine ignores x[0]...x[3] in favor of the const values since it's
	// ever so slightly faster.
	for n := 0; n < nrBlocks; n++ {
		x0, x1, x2, x3 := sigma0, sigma1, sigma2, sigma3
		x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15 := x[4], x[5], x[6], x[7], x[8], x[9], x[10], x[11], x[12], x[13], x[14], x[15]

		for i := chachaRounds; i > 0; i -= 2 {
			// quarterround(x, 0, 4, 8, 12)
			x0 += x4
			x12 ^= x0
			x12 = (x12 << 16) | (x12 >> 16)
			x8 += x12
			x4 ^= x8
			x4 = (x4 << 12) | (x4 >> 20)
			x0 += x4
			x12 ^= x0
			x12 = (x12 << 8) | (x12 >> 24)
			x8 += x12
			x4 ^= x8
			x4 = (x4 << 7) | (x4 >> 25)

			// quarterround(x, 1, 5, 9, 13)
			x1 += x5
			x13 ^= x1
			x13 = (x13 << 16) | (x13 >> 16)
			x9 += x13
			x5 ^= x9
			x5 = (x5 << 12) | (x5 >> 20)
			x1 += x5
			x13 ^= x1
			x13 = (x13 << 8) | (x13 >> 24)
			x9 += x13
			x5 ^= x9
			x5 = (x5 << 7) | (x5 >> 25)

			// quarterround(x, 2, 6, 10, 14)
			x2 += x6
			x14 ^= x2
			x14 = (x14 << 16) | (x14 >> 16)
			x10 += x14
			x6 ^= x10
			x6 = (x6 << 12) | (x6 >> 20)
			x2 += x6
			x14 ^= x2
			x14 = (x14 << 8) | (x14 >> 24)
			x10 += x14
			x6 ^= x10
			x6 = (x6 << 7) | (x6 >> 25)

			// quarterround(x, 3, 7, 11, 15)
			x3 += x7
			x15 ^= x3
			x15 = (x15 << 16) | (x15 >> 16)
			x11 += x15
			x7 ^= x11
			x7 = (x7 << 12) | (x7 >> 20)
			x3 += x7
			x15 ^= x3
			x15 = (x15 << 8) | (x15 >> 24)
			x11 += x15
			x7 ^= x11
			x7 = (x7 << 7) | (x7 >> 25)

			// quarterround(x, 0, 5, 10, 15)
			x0 += x5
			x15 ^= x0
			x15 = (x15 << 16) | (x15 >> 16)
			x10 += x15
			x5 ^= x10
			x5 = (x5 << 12) | (x5 >> 20)
			x0 += x5
			x15 ^= x0
			x15 = (x15 << 8) | (x15 >> 24)
			x10 += x15
			x5 ^= x10
			x5 = (x5 << 7) | (x5 >> 25)

			// quarterround(x, 1, 6, 11, 12)
			x1 += x6
			x12 ^= x1
			x12 = (x12 << 16) | (x12 >> 16)
			x11 += x12
			x6 ^= x11
			x6 = (x6 << 12) | (x6 >> 20)
			x1 += x6
			x12 ^= x1
			x12 = (x12 << 8) | (x12 >> 24)
			x11 += x12
			x6 ^= x11
			x6 = (x6 << 7) | (x6 >> 25)

			// quarterround(x, 2, 7, 8, 13)
			x2 += x7
			x13 ^= x2
			x13 = (x13 << 16) | (x13 >> 16)
			x8 += x13
			x7 ^= x8
			x7 = (x7 << 12) | (x7 >> 20)
			x2 += x7
			x13 ^= x2
			x13 = (x13 << 8) | (x13 >> 24)
			x8 += x13
			x7 ^= x8
			x7 = (x7 << 7) | (x7 >> 25)

			// quarterround(x, 3, 4, 9, 14)
			x3 += x4
			x14 ^= x3
			x14 = (x14 << 16) | (x14 >> 16)
			x9 += x14
			x4 ^= x9
			x4 = (x4 << 12) | (x4 >> 20)
			x3 += x4
			x14 ^= x3
			x14 = (x14 << 8) | (x14 >> 24)
			x9 += x14
			x4 ^= x9
			x4 = (x4 << 7) | (x4 >> 25)
		}
		// On amd64 at least, this is a rather big boost.
		if useUnsafe {
			if in != nil {
				inArr := (*[16]uint32)(unsafe.Pointer(&in[n*BlockSize]))
				outArr := (*[16]uint32)(unsafe.Pointer(&out[n*BlockSize]))
				outArr[0] = inArr[0] ^ (x0 + sigma0)
				outArr[1] = inArr[1] ^ (x1 + sigma1)
				outArr[2] = inArr[2] ^ (x2 + sigma2)
				outArr[3] = inArr[3] ^ (x3 + sigma3)
				outArr[4] = inArr[4] ^ (x4 + x[4])
				outArr[5] = inArr[5] ^ (x5 + x[5])
				outArr[6] = inArr[6] ^ (x6 + x[6])
				outArr[7] = inArr[7] ^ (x7 + x[7])
				outArr[8] = inArr[8] ^ (x8 + x[8])
				outArr[9] = inArr[9] ^ (x9 + x[9])
				outArr[10] = inArr[10] ^ (x10 + x[10])
				outArr[11] = inArr[11] ^ (x11 + x[11])
				outArr[12] = inArr[12] ^ (x12 + x[12])
				outArr[13] = inArr[13] ^ (x13 + x[13])
				outArr[14] = inArr[14] ^ (x14 + x[14])
				outArr[15] = inArr[15] ^ (x15 + x[15])
			} else {
				outArr := (*[16]uint32)(unsafe.Pointer(&out[n*BlockSize]))
				outArr[0] = x0 + sigma0
				outArr[1] = x1 + sigma1
				outArr[2] = x2 + sigma2
				outArr[3] = x3 + sigma3
				outArr[4] = x4 + x[4]
				outArr[5] = x5 + x[5]
				outArr[6] = x6 + x[6]
				outArr[7] = x7 + x[7]
				outArr[8] = x8 + x[8]
				outArr[9] = x9 + x[9]
				outArr[10] = x10 + x[10]
				outArr[11] = x11 + x[11]
				outArr[12] = x12 + x[12]
				outArr[13] = x13 + x[13]
				outArr[14] = x14 + x[14]
				outArr[15] = x15 + x[15]
			}
		} else {
			// Slow path, either the architecture cares about alignment, or is not little endian.
			x0 += sigma0
			x1 += sigma1
			x2 += sigma2
			x3 += sigma3
			x4 += x[4]
			x5 += x[5]
			x6 += x[6]
			x7 += x[7]
			x8 += x[8]
			x9 += x[9]
			x10 += x[10]
			x11 += x[11]
			x12 += x[12]
			x13 += x[13]
			x14 += x[14]
			x15 += x[15]
			if in != nil {
				binary.LittleEndian.PutUint32(out[0:4], binary.LittleEndian.Uint32(in[0:4])^x0)
				binary.LittleEndian.PutUint32(out[4:8], binary.LittleEndian.Uint32(in[4:8])^x1)
				binary.LittleEndian.PutUint32(out[8:12], binary.LittleEndian.Uint32(in[8:12])^x2)
				binary.LittleEndian.PutUint32(out[12:16], binary.LittleEndian.Uint32(in[12:16])^x3)
				binary.LittleEndian.PutUint32(out[16:20], binary.LittleEndian.Uint32(in[16:20])^x4)
				binary.LittleEndian.PutUint32(out[20:24], binary.LittleEndian.Uint32(in[20:24])^x5)
				binary.LittleEndian.PutUint32(out[24:28], binary.LittleEndian.Uint32(in[24:28])^x6)
				binary.LittleEndian.PutUint32(out[28:32], binary.LittleEndian.Uint32(in[28:32])^x7)
				binary.LittleEndian.PutUint32(out[32:36], binary.LittleEndian.Uint32(in[32:36])^x8)
				binary.LittleEndian.PutUint32(out[36:40], binary.LittleEndian.Uint32(in[36:40])^x9)
				binary.LittleEndian.PutUint32(out[40:44], binary.LittleEndian.Uint32(in[40:44])^x10)
				binary.LittleEndian.PutUint32(out[44:48], binary.LittleEndian.Uint32(in[44:48])^x11)
				binary.LittleEndian.PutUint32(out[48:52], binary.LittleEndian.Uint32(in[48:52])^x12)
				binary.LittleEndian.PutUint32(out[52:56], binary.LittleEndian.Uint32(in[52:56])^x13)
				binary.LittleEndian.PutUint32(out[56:60], binary.LittleEndian.Uint32(in[56:60])^x14)
				binary.LittleEndian.PutUint32(out[60:64], binary.LittleEndian.Uint32(in[60:64])^x15)
				in = in[BlockSize:]
			} else {
				binary.LittleEndian.PutUint32(out[0:4], x0)
				binary.LittleEndian.PutUint32(out[4:8], x1)
				binary.LittleEndian.PutUint32(out[8:12], x2)
				binary.LittleEndian.PutUint32(out[12:16], x3)
				binary.LittleEndian.PutUint32(out[16:20], x4)
				binary.LittleEndian.PutUint32(out[20:24], x5)
				binary.LittleEndian.PutUint32(out[24:28], x6)
				binary.LittleEndian.PutUint32(out[28:32], x7)
				binary.LittleEndian.PutUint32(out[32:36], x8)
				binary.LittleEndian.PutUint32(out[36:40], x9)
				binary.LittleEndian.PutUint32(out[40:44], x10)
				binary.LittleEndian.PutUint32(out[44:48], x11)
				binary.LittleEndian.PutUint32(out[48:52], x12)
				binary.LittleEndian.PutUint32(out[52:56], x13)
				binary.LittleEndian.PutUint32(out[56:60], x14)
				binary.LittleEndian.PutUint32(out[60:64], x15)
			}
			out = out[BlockSize:]
		}
		// Stopping at 2^70 bytes per nonce is the user's responsibility.
		ctr := uint64(x[13])<<32 | uint64(x[12])
		ctr++
		x[12] = uint32(ctr)
		x[13] = uint32(ctr >> 32)
	}
}
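The unrolled arithmetic inside blocksRef is eight copies of the ChaCha quarter round (add, xor, rotate by 16, 12, 8, 7). A compact sketch of a single quarter round, where `qr` is our own helper name rather than anything exported by the vendored package, checked against the test vector from RFC 8439 §2.1.1:

```go
package main

import "fmt"

// qr applies one ChaCha quarter round to four state words, mirroring the
// inlined add/xor/rotate sequence used throughout blocksRef above.
func qr(a, b, c, d uint32) (uint32, uint32, uint32, uint32) {
	a += b
	d ^= a
	d = d<<16 | d>>16
	c += d
	b ^= c
	b = b<<12 | b>>20
	a += b
	d ^= a
	d = d<<8 | d>>24
	c += d
	b ^= c
	b = b<<7 | b>>25
	return a, b, c, d
}

func main() {
	// RFC 8439 §2.1.1 test vector.
	a, b, c, d := qr(0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567)
	fmt.Printf("%08x %08x %08x %08x\n", a, b, c, d)
	// expected: ea2a92f4 cb1cf8ce 4581472e 5881c4bb
}
```

The reference code unrolls this by hand (and the AVX2 assembly vectorizes it) purely for speed; the dataflow is identical.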
func hChaChaRef(x *[stateSize]uint32, out *[32]byte) {
	x0, x1, x2, x3 := sigma0, sigma1, sigma2, sigma3
	x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15 := x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7], x[8], x[9], x[10], x[11]

	for i := chachaRounds; i > 0; i -= 2 {
		// quarterround(x, 0, 4, 8, 12)
		x0 += x4
		x12 ^= x0
		x12 = (x12 << 16) | (x12 >> 16)
		x8 += x12
		x4 ^= x8
		x4 = (x4 << 12) | (x4 >> 20)
		x0 += x4
		x12 ^= x0
		x12 = (x12 << 8) | (x12 >> 24)
		x8 += x12
		x4 ^= x8
		x4 = (x4 << 7) | (x4 >> 25)

		// quarterround(x, 1, 5, 9, 13)
		x1 += x5
		x13 ^= x1
		x13 = (x13 << 16) | (x13 >> 16)
		x9 += x13
		x5 ^= x9
		x5 = (x5 << 12) | (x5 >> 20)
		x1 += x5
		x13 ^= x1
		x13 = (x13 << 8) | (x13 >> 24)
		x9 += x13
		x5 ^= x9
		x5 = (x5 << 7) | (x5 >> 25)

		// quarterround(x, 2, 6, 10, 14)
		x2 += x6
		x14 ^= x2
		x14 = (x14 << 16) | (x14 >> 16)
		x10 += x14
		x6 ^= x10
		x6 = (x6 << 12) | (x6 >> 20)
		x2 += x6
		x14 ^= x2
		x14 = (x14 << 8) | (x14 >> 24)
		x10 += x14
		x6 ^= x10
		x6 = (x6 << 7) | (x6 >> 25)

		// quarterround(x, 3, 7, 11, 15)
		x3 += x7
		x15 ^= x3
		x15 = (x15 << 16) | (x15 >> 16)
		x11 += x15
		x7 ^= x11
		x7 = (x7 << 12) | (x7 >> 20)
		x3 += x7
		x15 ^= x3
		x15 = (x15 << 8) | (x15 >> 24)
		x11 += x15
		x7 ^= x11
		x7 = (x7 << 7) | (x7 >> 25)

		// quarterround(x, 0, 5, 10, 15)
		x0 += x5
		x15 ^= x0
		x15 = (x15 << 16) | (x15 >> 16)
		x10 += x15
		x5 ^= x10
		x5 = (x5 << 12) | (x5 >> 20)
		x0 += x5
		x15 ^= x0
		x15 = (x15 << 8) | (x15 >> 24)
		x10 += x15
		x5 ^= x10
		x5 = (x5 << 7) | (x5 >> 25)

		// quarterround(x, 1, 6, 11, 12)
		x1 += x6
		x12 ^= x1
		x12 = (x12 << 16) | (x12 >> 16)
		x11 += x12
		x6 ^= x11
		x6 = (x6 << 12) | (x6 >> 20)
		x1 += x6
		x12 ^= x1
		x12 = (x12 << 8) | (x12 >> 24)
		x11 += x12
		x6 ^= x11
		x6 = (x6 << 7) | (x6 >> 25)

		// quarterround(x, 2, 7, 8, 13)
		x2 += x7
		x13 ^= x2
		x13 = (x13 << 16) | (x13 >> 16)
		x8 += x13
		x7 ^= x8
		x7 = (x7 << 12) | (x7 >> 20)
		x2 += x7
		x13 ^= x2
		x13 = (x13 << 8) | (x13 >> 24)
		x8 += x13
		x7 ^= x8
		x7 = (x7 << 7) | (x7 >> 25)

		// quarterround(x, 3, 4, 9, 14)
		x3 += x4
		x14 ^= x3
		x14 = (x14 << 16) | (x14 >> 16)
		x9 += x14
		x4 ^= x9
		x4 = (x4 << 12) | (x4 >> 20)
		x3 += x4
		x14 ^= x3
		x14 = (x14 << 8) | (x14 >> 24)
		x9 += x14
		x4 ^= x9
		x4 = (x4 << 7) | (x4 >> 25)
	}

	// HChaCha returns x0...x3 | x12...x15, which corresponds to the
	// indexes of the ChaCha constant and the indexes of the IV.
	if useUnsafe {
		outArr := (*[16]uint32)(unsafe.Pointer(&out[0]))
		outArr[0] = x0
		outArr[1] = x1
		outArr[2] = x2
		outArr[3] = x3
		outArr[4] = x12
		outArr[5] = x13
		outArr[6] = x14
		outArr[7] = x15
	} else {
		binary.LittleEndian.PutUint32(out[0:4], x0)
		binary.LittleEndian.PutUint32(out[4:8], x1)
		binary.LittleEndian.PutUint32(out[8:12], x2)
		binary.LittleEndian.PutUint32(out[12:16], x3)
		binary.LittleEndian.PutUint32(out[16:20], x12)
		binary.LittleEndian.PutUint32(out[20:24], x13)
		binary.LittleEndian.PutUint32(out[24:28], x14)
		binary.LittleEndian.PutUint32(out[28:32], x15)
	}

	return
}
cmd/gost/vendor/github.com/ginuerzh/gost/forward.go
@@ -63,15 +63,15 @@ func (s *TcpForwardServer) handleTcpForward(conn net.Conn, raddr net.Addr) {
 }
 
 type packet struct {
-	srcAddr *net.UDPAddr // src address
-	dstAddr *net.UDPAddr // dest address
+	srcAddr string // src address
+	dstAddr string // dest address
 	data    []byte
 }
 
 type cnode struct {
 	chain            *ProxyChain
 	conn             net.Conn
-	srcAddr, dstAddr *net.UDPAddr
+	srcAddr, dstAddr string
 	rChan, wChan     chan *packet
 	err              error
 	ttl              time.Duration
@@ -146,13 +146,9 @@ func (node *cnode) run() {
 			timer.Reset(node.ttl)
 			glog.V(LDEBUG).Infof("[udp] %s <<< %s : length %d", node.srcAddr, addr, n)
-			if node.dstAddr.String() != addr.String() {
-				glog.V(LWARNING).Infof("[udp] %s <- %s : dst-addr mismatch (%s)", node.srcAddr, node.dstAddr, addr)
-				break
-			}
 			select {
-			// swap srcAddr with dstAddr
-			case node.rChan <- &packet{srcAddr: node.dstAddr, dstAddr: node.srcAddr, data: b[:n]}:
+			case node.rChan <- &packet{srcAddr: addr.String(), dstAddr: node.srcAddr, data: b[:n]}:
 			case <-time.After(time.Second * 3):
 				glog.V(LWARNING).Infof("[udp] %s <- %s : %s", node.srcAddr, node.dstAddr, "recv queue is full, discard")
 			}
@@ -169,13 +165,9 @@ func (node *cnode) run() {
 			timer.Reset(node.ttl)
 			glog.V(LDEBUG).Infof("[udp-tun] %s <<< %s : length %d", node.srcAddr, dgram.Header.Addr.String(), len(dgram.Data))
-			if dgram.Header.Addr.String() != node.dstAddr.String() {
-				glog.V(LWARNING).Infof("[udp-tun] %s <- %s : dst-addr mismatch (%s)", node.srcAddr, node.dstAddr, dgram.Header.Addr)
-				break
-			}
 			select {
-			// swap srcAddr with dstAddr
-			case node.rChan <- &packet{srcAddr: node.dstAddr, dstAddr: node.srcAddr, data: dgram.Data}:
+			case node.rChan <- &packet{srcAddr: dgram.Header.Addr.String(), dstAddr: node.srcAddr, data: dgram.Data}:
 			case <-time.After(time.Second * 3):
 				glog.V(LWARNING).Infof("[udp-tun] %s <- %s : %s", node.srcAddr, node.dstAddr, "recv queue is full, discard")
 			}
@@ -187,9 +179,15 @@ func (node *cnode) run() {
 		for pkt := range node.wChan {
 			timer.Reset(node.ttl)
+			dstAddr, err := net.ResolveUDPAddr("udp", pkt.dstAddr)
+			if err != nil {
+				glog.V(LWARNING).Infof("[udp] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, err)
+				continue
+			}
 			switch c := node.conn.(type) {
 			case *net.UDPConn:
-				if _, err := c.WriteToUDP(pkt.data, pkt.dstAddr); err != nil {
+				if _, err := c.WriteToUDP(pkt.data, dstAddr); err != nil {
 					glog.V(LWARNING).Infof("[udp] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, err)
 					node.err = err
 					errChan <- err
@@ -198,7 +196,7 @@ func (node *cnode) run() {
 				glog.V(LDEBUG).Infof("[udp] %s >>> %s : length %d", pkt.srcAddr, pkt.dstAddr, len(pkt.data))
 			default:
-				dgram := gosocks5.NewUDPDatagram(gosocks5.NewUDPHeader(uint16(len(pkt.data)), 0, ToSocksAddr(pkt.dstAddr)), pkt.data)
+				dgram := gosocks5.NewUDPDatagram(gosocks5.NewUDPHeader(uint16(len(pkt.data)), 0, ToSocksAddr(dstAddr)), pkt.data)
 				if err := dgram.Write(c); err != nil {
 					glog.V(LWARNING).Infof("[udp-tun] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, err)
 					node.err = err
@@ -255,7 +253,7 @@ func (s *UdpForwardServer) ListenAndServe() error {
 			}
 			select {
-			case ch <- &packet{srcAddr: addr, dstAddr: raddr, data: b[:n]}:
+			case ch <- &packet{srcAddr: addr.String(), dstAddr: raddr.String(), data: b[:n]}:
 			case <-time.After(time.Second * 3):
 				glog.V(LWARNING).Infof("[udp] %s -> %s : %s", addr, raddr, "send queue is full, discard")
 			}
@@ -264,7 +262,12 @@ func (s *UdpForwardServer) ListenAndServe() error {
 	// start recv queue
 	go func(ch <-chan *packet) {
 		for pkt := range ch {
-			if _, err := conn.WriteToUDP(pkt.data, pkt.dstAddr); err != nil {
+			dstAddr, err := net.ResolveUDPAddr("udp", pkt.dstAddr)
+			if err != nil {
+				glog.V(LWARNING).Infof("[udp] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
+				continue
+			}
+			if _, err := conn.WriteToUDP(pkt.data, dstAddr); err != nil {
 				glog.V(LWARNING).Infof("[udp] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
 				return
 			}
...
...
@@ -285,7 +288,7 @@ func (s *UdpForwardServer) ListenAndServe() error {
}
}
node
,
ok
:=
m
[
pkt
.
srcAddr
.
String
()
]
node
,
ok
:=
m
[
pkt
.
srcAddr
]
if
!
ok
{
node
=
&
cnode
{
chain
:
s
.
Base
.
Chain
,
...
...
@@ -295,7 +298,7 @@ func (s *UdpForwardServer) ListenAndServe() error {
wChan
:
make
(
chan
*
packet
,
32
),
ttl
:
time
.
Duration
(
s
.
TTL
)
*
time
.
Second
,
}
m
[
pkt
.
srcAddr
.
String
()
]
=
node
m
[
pkt
.
srcAddr
]
=
node
go
node
.
run
()
glog
.
V
(
LINFO
)
.
Infof
(
"[udp] %s -> %s : new client (%d)"
,
pkt
.
srcAddr
,
pkt
.
dstAddr
,
len
(
m
))
}
...
...
cmd/gost/vendor/github.com/ginuerzh/gost/gost.go

@@ -11,7 +11,7 @@ import (
 )

 const (
-	Version = "2.3"
+	Version = "2.4-dev"
 )

 // Log level for glog
...
cmd/gost/vendor/github.com/ginuerzh/gost/node.go

@@ -71,7 +71,7 @@ func ParseProxyNode(s string) (node ProxyNode, err error) {
 	}
 	switch node.Transport {
-	case "ws", "wss", "tls", "http2", "ssu", "quic", "kcp", "redirect":
+	case "ws", "wss", "tls", "http2", "quic", "kcp", "redirect", "ssu":
 	case "https":
 		node.Protocol = "http"
 		node.Transport = "tls"
...
cmd/gost/vendor/github.com/ginuerzh/gost/server.go

@@ -32,7 +32,7 @@ func NewProxyServer(node ProxyNode, chain *ProxyChain, config *tls.Config) *Prox
 	var cipher *ss.Cipher
 	var ota bool
-	if node.Protocol == "ss" {
+	if node.Protocol == "ss" || node.Transport == "ssu" {
 		var err error
 		var method, password string
...
@@ -98,8 +98,6 @@ func (s *ProxyServer) Serve() error {
 		return NewRTcpForwardServer(s).Serve()
 	case "rudp": // Remote UDP port forwarding
 		return NewRUdpForwardServer(s).Serve()
-	case "ssu": // TODO: shadowsocks udp relay
-		return NewShadowUdpServer(s).ListenAndServe()
 	case "quic":
 		return NewQuicServer(s).ListenAndServeTLS(s.TLSConfig)
 	case "kcp":
...
@@ -118,6 +116,12 @@ func (s *ProxyServer) Serve() error {
 		return NewKCPServer(s, config).ListenAndServe()
 	case "redirect":
 		return NewRedsocksTCPServer(s).ListenAndServe()
+	case "ssu": // shadowsocks udp relay
+		ttl, _ := strconv.Atoi(s.Node.Get("ttl"))
+		if ttl <= 0 {
+			ttl = DefaultTTL
+		}
+		return NewShadowUdpServer(s, ttl).ListenAndServe()
 	default:
 		ln, err = net.Listen("tcp", node.Addr)
 	}
...
cmd/gost/vendor/github.com/ginuerzh/gost/ss.go

@@ -5,6 +5,7 @@ import (
 	"encoding/binary"
 	"errors"
 	"fmt"
+	"github.com/ginuerzh/gosocks5"
 	"github.com/golang/glog"
 	ss "github.com/shadowsocks/shadowsocks-go/shadowsocks"
 	"io"
...
@@ -65,47 +66,6 @@ func (s *ShadowServer) Serve() {
 	glog.V(LINFO).Infof("[ss] %s >-< %s", s.conn.RemoteAddr(), addr)
 }

-type ShadowUdpServer struct {
-	Base    *ProxyServer
-	Handler func(conn *net.UDPConn, addr *net.UDPAddr, data []byte)
-}
-
-func NewShadowUdpServer(base *ProxyServer) *ShadowUdpServer {
-	return &ShadowUdpServer{Base: base}
-}
-
-func (s *ShadowUdpServer) ListenAndServe() error {
-	laddr, err := net.ResolveUDPAddr("udp", s.Base.Node.Addr)
-	if err != nil {
-		return err
-	}
-	lconn, err := net.ListenUDP("udp", laddr)
-	if err != nil {
-		return err
-	}
-	defer lconn.Close()
-
-	if s.Handler == nil {
-		s.Handler = s.HandleConn
-	}
-
-	for {
-		b := make([]byte, LargeBufferSize)
-		n, addr, err := lconn.ReadFromUDP(b)
-		if err != nil {
-			glog.V(LWARNING).Infoln(err)
-			continue
-		}
-		go s.Handler(lconn, addr, b[:n])
-	}
-}
-
-// TODO: shadowsocks udp relay handler
-func (s *ShadowUdpServer) HandleConn(conn *net.UDPConn, addr *net.UDPAddr, data []byte) {
-}
-
 // This function is copied from shadowsocks library with some modification.
 func (s *ShadowServer) getRequest() (host string, ota bool, err error) {
 	// buf size should at least have the same size with the largest possible
...
@@ -276,3 +236,109 @@ func (c *shadowConn) SetReadDeadline(t time.Time) error {
 func (c *shadowConn) SetWriteDeadline(t time.Time) error {
 	return c.conn.SetWriteDeadline(t)
 }
+
+type ShadowUdpServer struct {
+	Base *ProxyServer
+	TTL  int
+}
+
+func NewShadowUdpServer(base *ProxyServer, ttl int) *ShadowUdpServer {
+	return &ShadowUdpServer{Base: base, TTL: ttl}
+}
+
+func (s *ShadowUdpServer) ListenAndServe() error {
+	laddr, err := net.ResolveUDPAddr("udp", s.Base.Node.Addr)
+	if err != nil {
+		return err
+	}
+	lconn, err := net.ListenUDP("udp", laddr)
+	if err != nil {
+		return err
+	}
+	defer lconn.Close()
+
+	conn := ss.NewSecurePacketConn(lconn, s.Base.cipher.Copy(), true) // force OTA on
+
+	rChan, wChan := make(chan *packet, 128), make(chan *packet, 128)
+
+	// start send queue
+	go func(ch chan<- *packet) {
+		for {
+			b := make([]byte, MediumBufferSize)
+			n, addr, err := conn.ReadFrom(b[3:]) // add rsv and frag fields to make it the standard SOCKS5 UDP datagram
+			if err != nil {
+				glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", addr, laddr, err)
+				continue
+			}
+
+			if b[3]&ss.OneTimeAuthMask > 0 {
+				glog.V(LWARNING).Infof("[ssu] %s -> %s : client does not support OTA", addr, laddr)
+				continue
+			}
+			b[3] &= ss.AddrMask
+
+			dgram, err := gosocks5.ReadUDPDatagram(bytes.NewReader(b[:n+3]))
+			if err != nil {
+				glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", addr, laddr, err)
+				continue
+			}
+
+			select {
+			case ch <- &packet{srcAddr: addr.String(), dstAddr: dgram.Header.Addr.String(), data: b[:n+3]}:
+			case <-time.After(time.Second * 3):
+				glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", addr, dgram.Header.Addr.String(), "send queue is full, discard")
+			}
+		}
+	}(wChan)
+
+	// start recv queue
+	go func(ch <-chan *packet) {
+		for pkt := range ch {
+			dstAddr, err := net.ResolveUDPAddr("udp", pkt.dstAddr)
+			if err != nil {
+				glog.V(LWARNING).Infof("[ssu] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
+				continue
+			}
+			if _, err := conn.WriteTo(pkt.data, dstAddr); err != nil {
+				glog.V(LWARNING).Infof("[ssu] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
+				return
+			}
+		}
+	}(rChan)
+
+	// mapping client to node
+	m := make(map[string]*cnode)
+
+	// start dispatcher
+	for pkt := range wChan {
+		// clear obsolete nodes
+		for k, node := range m {
+			if node != nil && node.err != nil {
+				close(node.wChan)
+				delete(m, k)
+				glog.V(LINFO).Infof("[ssu] clear node %s", k)
+			}
+		}
+
+		node, ok := m[pkt.srcAddr]
+		if !ok {
+			node = &cnode{
+				chain:   s.Base.Chain,
+				srcAddr: pkt.srcAddr,
+				dstAddr: pkt.dstAddr,
+				rChan:   rChan,
+				wChan:   make(chan *packet, 32),
+				ttl:     time.Duration(s.TTL) * time.Second,
+			}
+			m[pkt.srcAddr] = node
+			go node.run()
+			glog.V(LINFO).Infof("[ssu] %s -> %s : new client (%d)", pkt.srcAddr, pkt.dstAddr, len(m))
+		}
+
+		select {
+		case node.wChan <- pkt:
+		case <-time.After(time.Second * 3):
+			glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, "node send queue is full, discard")
+		}
+	}

+	return nil
+}
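The ssu send queue above reads each decrypted shadowsocks payload into `b[3:]`, leaving three zero bytes in front. A shadowsocks UDP payload is exactly a SOCKS5 UDP request minus the 2-byte RSV and 1-byte FRAG prefix, so the zeroed prefix makes `b[:n+3]` parseable as a standard SOCKS5 UDP datagram. A minimal sketch of that trick (the payload bytes here are a hypothetical IPv4 request for 1.2.3.4:80):

```go
package main

import "fmt"

func main() {
	// hypothetical decrypted shadowsocks payload:
	// ATYP=1 (IPv4), addr 1.2.3.4, port 80 (0x0050), data "hi"
	payload := []byte{1, 1, 2, 3, 4, 0, 80, 'h', 'i'}

	b := make([]byte, 3+len(payload)) // b[0:3] stay zero: RSV(2 bytes) + FRAG(1 byte)
	n := copy(b[3:], payload)

	// b[:n+3] is now a well-formed SOCKS5 UDP datagram
	fmt.Println(b[0], b[1], b[2], len(b[:n+3]))
}
```

The same offset arithmetic explains why the server checks and masks `b[3]`: after the 3-byte prefix, `b[3]` is the shadowsocks address-type byte, which may carry the OTA flag in its high bit.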
cmd/gost/vendor/github.com/shadowsocks/shadowsocks-go/shadowsocks/encrypt.go

@@ -12,7 +12,7 @@ import (
 	"io"
 	"strings"

-	"github.com/codahale/chacha20"
+	"github.com/Yawning/chacha20"
 	"golang.org/x/crypto/blowfish"
 	"golang.org/x/crypto/cast5"
 	"golang.org/x/crypto/salsa20/salsa"
...
@@ -65,11 +65,19 @@ func newStream(block cipher.Block, err error, key, iv []byte,
 	}
 }

-func newAESStream(key, iv []byte, doe DecOrEnc) (cipher.Stream, error) {
+func newAESCFBStream(key, iv []byte, doe DecOrEnc) (cipher.Stream, error) {
 	block, err := aes.NewCipher(key)
 	return newStream(block, err, key, iv, doe)
 }

+func newAESCTRStream(key, iv []byte, doe DecOrEnc) (cipher.Stream, error) {
+	block, err := aes.NewCipher(key)
+	if err != nil {
+		return nil, err
+	}
+	return cipher.NewCTR(block, iv), nil
+}
+
 func newDESStream(key, iv []byte, doe DecOrEnc) (cipher.Stream, error) {
 	block, err := des.NewCipher(key)
 	return newStream(block, err, key, iv, doe)
...
@@ -95,7 +103,11 @@ func newRC4MD5Stream(key, iv []byte, _ DecOrEnc) (cipher.Stream, error) {
 }

 func newChaCha20Stream(key, iv []byte, _ DecOrEnc) (cipher.Stream, error) {
-	return chacha20.New(key, iv)
+	return chacha20.NewCipher(key, iv)
 }

+func newChaCha20IETFStream(key, iv []byte, _ DecOrEnc) (cipher.Stream, error) {
+	return chacha20.NewCipher(key, iv)
+}
+
 type salsaStreamCipher struct {
...
@@ -145,15 +157,19 @@ type cipherInfo struct {
 }

 var cipherMethod = map[string]*cipherInfo{
-	"aes-128-cfb": {16, 16, newAESStream},
-	"aes-192-cfb": {24, 16, newAESStream},
-	"aes-256-cfb": {32, 16, newAESStream},
-	"des-cfb":     {8, 8, newDESStream},
-	"bf-cfb":      {16, 8, newBlowFishStream},
-	"cast5-cfb":   {16, 8, newCast5Stream},
-	"rc4-md5":     {16, 16, newRC4MD5Stream},
-	"chacha20":    {32, 8, newChaCha20Stream},
-	"salsa20":     {32, 8, newSalsa20Stream},
+	"aes-128-cfb":   {16, 16, newAESCFBStream},
+	"aes-192-cfb":   {24, 16, newAESCFBStream},
+	"aes-256-cfb":   {32, 16, newAESCFBStream},
+	"aes-128-ctr":   {16, 16, newAESCTRStream},
+	"aes-192-ctr":   {24, 16, newAESCTRStream},
+	"aes-256-ctr":   {32, 16, newAESCTRStream},
+	"des-cfb":       {8, 8, newDESStream},
+	"bf-cfb":        {16, 8, newBlowFishStream},
+	"cast5-cfb":     {16, 8, newCast5Stream},
+	"rc4-md5":       {16, 16, newRC4MD5Stream},
+	"chacha20":      {32, 8, newChaCha20Stream},
+	"chacha20-ietf": {32, 12, newChaCha20IETFStream},
+	"salsa20":       {32, 8, newSalsa20Stream},
 }

 func CheckCipherMethod(method string) error {
...
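The new `aes-*-ctr` entries rely on `cipher.NewCTR` from the Go standard library: CTR mode turns a block cipher into a stream cipher, and because encryption and decryption are both XOR with the same keystream, a fresh stream built from the same key and IV decrypts what another one encrypted. A self-contained round-trip sketch (all-zero key and IV, for illustration only):

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// newAESCTRStream mirrors the constructor added to encrypt.go:
// build an AES block cipher, then wrap it in a CTR keystream.
func newAESCTRStream(key, iv []byte) (cipher.Stream, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewCTR(block, iv), nil
}

func main() {
	key := make([]byte, 32) // aes-256-ctr: 32-byte key per the cipherMethod table
	iv := make([]byte, 16)  // 16-byte IV per the cipherMethod table
	plain := []byte("hello shadowsocks")

	enc, _ := newAESCTRStream(key, iv)
	ct := make([]byte, len(plain))
	enc.XORKeyStream(ct, plain)

	dec, _ := newAESCTRStream(key, iv) // a fresh stream with the same key/IV decrypts
	pt := make([]byte, len(ct))
	dec.XORKeyStream(pt, ct)

	fmt.Println(bytes.Equal(pt, plain)) // true
}
```

This symmetry is why `newAESCTRStream` can ignore the `DecOrEnc` argument, while the CFB constructor must pass it through to pick `NewCFBEncrypter` or `NewCFBDecrypter`.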
cmd/gost/vendor/github.com/shadowsocks/shadowsocks-go/shadowsocks/udp.go (new file, mode 100644)

package shadowsocks

import (
	"bytes"
	"fmt"
	"net"
	"time"
)

const (
	maxPacketSize = 4096 // increase it if error occurs
)

var (
	errPacketTooSmall  = fmt.Errorf("[udp]read error: cannot decrypt, received packet is smaller than ivLen")
	errPacketTooLarge  = fmt.Errorf("[udp]read error: received packet is larger than maxPacketSize(%d)", maxPacketSize)
	errBufferTooSmall  = fmt.Errorf("[udp]read error: given buffer is too small to hold data")
	errPacketOtaFailed = fmt.Errorf("[udp]read error: received packet has invalid ota")
)

type SecurePacketConn struct {
	net.PacketConn
	*Cipher
	ota bool
}

func NewSecurePacketConn(c net.PacketConn, cipher *Cipher, ota bool) *SecurePacketConn {
	return &SecurePacketConn{
		PacketConn: c,
		Cipher:     cipher,
		ota:        ota,
	}
}

func (c *SecurePacketConn) Close() error {
	return c.PacketConn.Close()
}

func (c *SecurePacketConn) ReadFrom(b []byte) (n int, src net.Addr, err error) {
	ota := false
	cipher := c.Copy()
	buf := make([]byte, 4096)
	n, src, err = c.PacketConn.ReadFrom(buf)
	if err != nil {
		return
	}

	if n < c.info.ivLen {
		return 0, nil, errPacketTooSmall
	}

	if len(b) < n-c.info.ivLen {
		err = errBufferTooSmall // just a warning
	}

	iv := make([]byte, c.info.ivLen)
	copy(iv, buf[:c.info.ivLen])

	if err = cipher.initDecrypt(iv); err != nil {
		return
	}

	cipher.decrypt(b[0:], buf[c.info.ivLen:n])
	n -= c.info.ivLen
	if b[idType]&OneTimeAuthMask > 0 {
		ota = true
	}

	if c.ota && !ota {
		return 0, src, errPacketOtaFailed
	}

	if ota {
		key := cipher.key
		actualHmacSha1Buf := HmacSha1(append(iv, key...), b[:n-lenHmacSha1])
		if !bytes.Equal(b[n-lenHmacSha1:n], actualHmacSha1Buf) {
			Debug.Printf("verify one time auth failed, iv=%v key=%v data=%v", iv, key, b)
			return 0, src, errPacketOtaFailed
		}
		n -= lenHmacSha1
	}

	return
}

func (c *SecurePacketConn) WriteTo(b []byte, dst net.Addr) (n int, err error) {
	cipher := c.Copy()
	iv, err := cipher.initEncrypt()
	if err != nil {
		return
	}
	packetLen := len(b) + len(iv)

	if c.ota {
		b[idType] |= OneTimeAuthMask
		packetLen += lenHmacSha1
		key := cipher.key
		actualHmacSha1Buf := HmacSha1(append(iv, key...), b)
		b = append(b, actualHmacSha1Buf...)
	}

	cipherData := make([]byte, packetLen)
	copy(cipherData, iv)

	cipher.encrypt(cipherData[len(iv):], b)
	n, err = c.PacketConn.WriteTo(cipherData, dst)
	if c.ota {
		n -= lenHmacSha1
	}
	return
}

func (c *SecurePacketConn) LocalAddr() net.Addr {
	return c.PacketConn.LocalAddr()
}

func (c *SecurePacketConn) SetDeadline(t time.Time) error {
	return c.PacketConn.SetDeadline(t)
}

func (c *SecurePacketConn) SetReadDeadline(t time.Time) error {
	return c.PacketConn.SetReadDeadline(t)
}

func (c *SecurePacketConn) SetWriteDeadline(t time.Time) error {
	return c.PacketConn.SetWriteDeadline(t)
}

func (c *SecurePacketConn) IsOta() bool {
	return c.ota
}

func (c *SecurePacketConn) ForceOTA() net.PacketConn {
	return NewSecurePacketConn(c.PacketConn, c.Cipher.Copy(), true)
}
cmd/gost/vendor/github.com/shadowsocks/shadowsocks-go/shadowsocks/udprelay.go (new file, mode 100644)

package shadowsocks

import (
	"encoding/binary"
	"fmt"
	"net"
	"strconv"
	"strings"
	"sync"
	"syscall"
	"time"
)

const (
	idType  = 0 // address type index
	idIP0   = 1 // ip address start index
	idDmLen = 1 // domain address length index
	idDm0   = 2 // domain address start index

	typeIPv4 = 1 // type is ipv4 address
	typeDm   = 3 // type is domain address
	typeIPv6 = 4 // type is ipv6 address

	lenIPv4   = 1 + net.IPv4len + 2 // 1addrType + ipv4 + 2port
	lenIPv6   = 1 + net.IPv6len + 2 // 1addrType + ipv6 + 2port
	lenDmBase = 1 + 1 + 2           // 1addrType + 1addrLen + 2port, plus addrLen

	lenHmacSha1 = 10
)

var (
	reqList            = newReqList()
	natlist            = newNatTable()
	udpTimeout         = 30 * time.Second
	reqListRefreshTime = 5 * time.Minute
)

type natTable struct {
	sync.Mutex
	conns map[string]net.PacketConn
}

func newNatTable() *natTable {
	return &natTable{conns: map[string]net.PacketConn{}}
}

func (table *natTable) Delete(index string) net.PacketConn {
	table.Lock()
	defer table.Unlock()
	c, ok := table.conns[index]
	if ok {
		delete(table.conns, index)
		return c
	}
	return nil
}

func (table *natTable) Get(index string) (c net.PacketConn, ok bool, err error) {
	table.Lock()
	defer table.Unlock()
	c, ok = table.conns[index]
	if !ok {
		c, err = net.ListenPacket("udp", "")
		if err != nil {
			return nil, false, err
		}
		table.conns[index] = c
	}
	return
}

type requestHeaderList struct {
	sync.Mutex
	List map[string]([]byte)
}

func newReqList() *requestHeaderList {
	ret := &requestHeaderList{List: map[string]([]byte){}}
	go func() {
		for {
			time.Sleep(reqListRefreshTime)
			ret.Refresh()
		}
	}()
	return ret
}

func (r *requestHeaderList) Refresh() {
	r.Lock()
	defer r.Unlock()
	for k := range r.List {
		delete(r.List, k)
	}
}

func (r *requestHeaderList) Get(dstaddr string) (req []byte, ok bool) {
	r.Lock()
	defer r.Unlock()
	req, ok = r.List[dstaddr]
	return
}

func (r *requestHeaderList) Put(dstaddr string, req []byte) {
	r.Lock()
	defer r.Unlock()
	r.List[dstaddr] = req
	return
}

func parseHeaderFromAddr(addr net.Addr) ([]byte, int) {
	// if the request address type is domain, it cannot be reverse-resolved
	ip, port, err := net.SplitHostPort(addr.String())
	if err != nil {
		return nil, 0
	}
	buf := make([]byte, 20)
	IP := net.ParseIP(ip)
	b1 := IP.To4()
	iplen := 0
	if b1 == nil { // ipv6
		b1 = IP.To16()
		buf[0] = typeIPv6
		iplen = net.IPv6len
	} else { // ipv4
		buf[0] = typeIPv4
		iplen = net.IPv4len
	}
	copy(buf[1:], b1)
	port_i, _ := strconv.Atoi(port)
	binary.BigEndian.PutUint16(buf[1+iplen:], uint16(port_i))
	return buf[:1+iplen+2], 1 + iplen + 2
}

func Pipeloop(write net.PacketConn, writeAddr net.Addr, readClose net.PacketConn) {
	buf := leakyBuf.Get()
	defer leakyBuf.Put(buf)
	defer readClose.Close()
	for {
		readClose.SetDeadline(time.Now().Add(udpTimeout))
		n, raddr, err := readClose.ReadFrom(buf)
		if err != nil {
			if ne, ok := err.(*net.OpError); ok {
				if ne.Err == syscall.EMFILE || ne.Err == syscall.ENFILE {
					// log too many open file error
					// EMFILE is process reaches open file limits, ENFILE is system limit
					Debug.Println("[udp]read error:", err)
				}
			}
			Debug.Printf("[udp]closed pipe %s<-%s\n", writeAddr, readClose.LocalAddr())
			return
		}
		// need improvement here
		if req, ok := reqList.Get(raddr.String()); ok {
			write.WriteTo(append(req, buf[:n]...), writeAddr)
		} else {
			header, hlen := parseHeaderFromAddr(raddr)
			write.WriteTo(append(header[:hlen], buf[:n]...), writeAddr)
		}
	}
}

func handleUDPConnection(handle *SecurePacketConn, n int, src net.Addr, receive []byte) {
	var dstIP net.IP
	var reqLen int
	var ota bool
	addrType := receive[idType]
	defer leakyBuf.Put(receive)

	if addrType&OneTimeAuthMask > 0 {
		ota = true
	}
	receive[idType] &= ^OneTimeAuthMask
	compatiblemode := !handle.IsOta() && ota

	switch addrType & AddrMask {
	case typeIPv4:
		reqLen = lenIPv4
		if len(receive) < reqLen {
			Debug.Println("[udp]invalid received message.")
		}
		dstIP = net.IP(receive[idIP0 : idIP0+net.IPv4len])
	case typeIPv6:
		reqLen = lenIPv6
		if len(receive) < reqLen {
			Debug.Println("[udp]invalid received message.")
		}
		dstIP = net.IP(receive[idIP0 : idIP0+net.IPv6len])
	case typeDm:
		reqLen = int(receive[idDmLen]) + lenDmBase
		if len(receive) < reqLen {
			Debug.Println("[udp]invalid received message.")
		}
		name := string(receive[idDm0 : idDm0+int(receive[idDmLen])])
		// avoid panic: syscall: string with NUL passed to StringToUTF16 on windows.
		if strings.ContainsRune(name, 0x00) {
			fmt.Println("[udp]invalid domain name.")
			return
		}
		dIP, err := net.ResolveIPAddr("ip", name) // careful with const type
		if err != nil {
			Debug.Printf("[udp]failed to resolve domain name: %s\n", string(receive[idDm0:idDm0+receive[idDmLen]]))
			return
		}
		dstIP = dIP.IP
	default:
		Debug.Printf("[udp]addrType %d not supported", addrType)
		return
	}
	dst := &net.UDPAddr{
		IP:   dstIP,
		Port: int(binary.BigEndian.Uint16(receive[reqLen-2 : reqLen])),
	}
	if _, ok := reqList.Get(dst.String()); !ok {
		req := make([]byte, reqLen)
		copy(req, receive)
		reqList.Put(dst.String(), req)
	}

	remote, exist, err := natlist.Get(src.String())
	if err != nil {
		return
	}
	if !exist {
		Debug.Printf("[udp]new client %s->%s via %s ota=%v\n", src, dst, remote.LocalAddr(), ota)
		go func() {
			if compatiblemode {
				Pipeloop(handle.ForceOTA(), src, remote)
			} else {
				Pipeloop(handle, src, remote)
			}
			natlist.Delete(src.String())
		}()
	} else {
		Debug.Printf("[udp]using cached client %s->%s via %s ota=%v\n", src, dst, remote.LocalAddr(), ota)
	}
	if remote == nil {
		fmt.Println("WTF")
	}
	remote.SetDeadline(time.Now().Add(udpTimeout))
	_, err = remote.WriteTo(receive[reqLen:n], dst)
	if err != nil {
		if ne, ok := err.(*net.OpError); ok && (ne.Err == syscall.EMFILE || ne.Err == syscall.ENFILE) {
			// log too many open file error
			// EMFILE is process reaches open file limits, ENFILE is system limit
			Debug.Println("[udp]write error:", err)
		} else {
			Debug.Println("[udp]error connecting to:", dst, err)
		}
		if conn := natlist.Delete(src.String()); conn != nil {
			conn.Close()
		}
	}
	// Pipeloop
	return
}

func ReadAndHandleUDPReq(c *SecurePacketConn) error {
	buf := leakyBuf.Get()
	n, src, err := c.ReadFrom(buf[0:])
	if err != nil {
		return err
	}
	go handleUDPConnection(c, n, src, buf)
	return nil
}
cmd/gost/vendor/github.com/shadowsocks/shadowsocks-go/shadowsocks/util.go

 package shadowsocks

 import (
+	"crypto/hmac"
+	"crypto/sha1"
+	"encoding/binary"
 	"errors"
 	"fmt"
 	"os"
 )

 func PrintVersion() {
-	const version = "1.1.5"
+	const version = "1.2.0"
 	fmt.Println("shadowsocks-go version", version)
 }
...
@@ -57,4 +57,4 @@ func (flag *ClosedFlag) SetClosed() {
 func (flag *ClosedFlag) IsClosed() bool {
 	return flag.flag
-}
\ No newline at end of file
+}
cmd/gost/vendor/vendor.json

@@ -2,6 +2,12 @@
 	"comment": "",
 	"ignore": "test",
 	"package": [
+		{
+			"checksumSHA1": "IFJyJgPCjumDG37lEb0lyRBBGZE=",
+			"path": "github.com/Yawning/chacha20",
+			"revision": "c91e78db502ff629614837aacb7aa4efa61c651a",
+			"revisionTime": "2016-04-30T09:49:23Z"
+		},
 		{
 			"checksumSHA1": "QPs3L3mjPoi+a9GJCjW8HhyJczM=",
 			"path": "github.com/codahale/chacha20",
...
@@ -15,10 +21,10 @@
 			"revisionTime": "2017-01-19T05:34:58Z"
 		},
 		{
-			"checksumSHA1": "b0uHAM/lCGCJ9GeKfClvrMMWXQM=",
+			"checksumSHA1": "idpL1fpHpfntk74IVfWtkP1PMZs=",
 			"path": "github.com/ginuerzh/gost",
-			"revision": "358f57add6087d77b1d978e92e2f7c8073c2f544",
-			"revisionTime": "2017-01-21T03:14:59Z"
+			"revision": "321b03712af504981d35a47c50c2cfe4dd788a9d",
+			"revisionTime": "2017-01-21T03:16:33Z"
 		},
 		{
 			"checksumSHA1": "URsJa4y/sUUw/STmbeYx9EKqaYE=",
...
@@ -153,10 +159,10 @@
 			"revisionTime": "2016-10-02T05:25:12Z"
 		},
 		{
-			"checksumSHA1": "o0WHRL8mNIhfsoWlzhdJ8du6+C8=",
+			"checksumSHA1": "MRsfMrdZwnnCTfIzT3czcj0lb0s=",
 			"path": "github.com/shadowsocks/shadowsocks-go/shadowsocks",
-			"revision": "5c9897ecdf623f385ccb8c2c78e32c5256961b41",
-			"revisionTime": "2016-06-15T15:25:08Z"
+			"revision": "97a5c71f80ba5f5b3e549f14a619fe557ff4f3c9",
+			"revisionTime": "2017-01-21T20:35:16Z"
 		},
 		{
 			"checksumSHA1": "JsJdKXhz87gWenMwBeejTOeNE7k=",
...
forward.go

@@ -63,15 +63,15 @@ func (s *TcpForwardServer) handleTcpForward(conn net.Conn, raddr net.Addr) {
 }

 type packet struct {
-	srcAddr *net.UDPAddr // src address
-	dstAddr *net.UDPAddr // dest address
+	srcAddr string // src address
+	dstAddr string // dest address
 	data    []byte
 }

 type cnode struct {
 	chain            *ProxyChain
 	conn             net.Conn
-	srcAddr, dstAddr *net.UDPAddr
+	srcAddr, dstAddr string
 	rChan, wChan     chan *packet
 	err              error
 	ttl              time.Duration
...
@@ -146,13 +146,9 @@ func (node *cnode) run() {
 			timer.Reset(node.ttl)
 			glog.V(LDEBUG).Infof("[udp] %s <<< %s : length %d", node.srcAddr, addr, n)
-			if node.dstAddr.String() != addr.String() {
-				glog.V(LWARNING).Infof("[udp] %s <- %s : dst-addr mismatch (%s)", node.srcAddr, node.dstAddr, addr)
-				break
-			}
 			select {
 			// swap srcAddr with dstAddr
-			case node.rChan <- &packet{srcAddr: node.dstAddr, dstAddr: node.srcAddr, data: b[:n]}:
+			case node.rChan <- &packet{srcAddr: addr.String(), dstAddr: node.srcAddr, data: b[:n]}:
 			case <-time.After(time.Second * 3):
 				glog.V(LWARNING).Infof("[udp] %s <- %s : %s", node.srcAddr, node.dstAddr, "recv queue is full, discard")
 			}
...
@@ -169,13 +165,9 @@ func (node *cnode) run() {
 			timer.Reset(node.ttl)
 			glog.V(LDEBUG).Infof("[udp-tun] %s <<< %s : length %d", node.srcAddr, dgram.Header.Addr.String(), len(dgram.Data))
-			if dgram.Header.Addr.String() != node.dstAddr.String() {
-				glog.V(LWARNING).Infof("[udp-tun] %s <- %s : dst-addr mismatch (%s)", node.srcAddr, node.dstAddr, dgram.Header.Addr)
-				break
-			}
 			select {
 			// swap srcAddr with dstAddr
-			case node.rChan <- &packet{srcAddr: node.dstAddr, dstAddr: node.srcAddr, data: dgram.Data}:
+			case node.rChan <- &packet{srcAddr: dgram.Header.Addr.String(), dstAddr: node.srcAddr, data: dgram.Data}:
 			case <-time.After(time.Second * 3):
 				glog.V(LWARNING).Infof("[udp-tun] %s <- %s : %s", node.srcAddr, node.dstAddr, "recv queue is full, discard")
 			}
...
@@ -187,9 +179,15 @@ func (node *cnode) run() {
 		for pkt := range node.wChan {
 			timer.Reset(node.ttl)
+			dstAddr, err := net.ResolveUDPAddr("udp", pkt.dstAddr)
+			if err != nil {
+				glog.V(LWARNING).Infof("[udp] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, err)
+				continue
+			}
 			switch c := node.conn.(type) {
 			case *net.UDPConn:
-				if _, err := c.WriteToUDP(pkt.data, pkt.dstAddr); err != nil {
+				if _, err := c.WriteToUDP(pkt.data, dstAddr); err != nil {
 					glog.V(LWARNING).Infof("[udp] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, err)
 					node.err = err
 					errChan <- err
...
@@ -198,7 +196,7 @@ func (node *cnode) run() {
 				glog.V(LDEBUG).Infof("[udp] %s >>> %s : length %d", pkt.srcAddr, pkt.dstAddr, len(pkt.data))
 			default:
-				dgram := gosocks5.NewUDPDatagram(gosocks5.NewUDPHeader(uint16(len(pkt.data)), 0, ToSocksAddr(pkt.dstAddr)), pkt.data)
+				dgram := gosocks5.NewUDPDatagram(gosocks5.NewUDPHeader(uint16(len(pkt.data)), 0, ToSocksAddr(dstAddr)), pkt.data)
 				if err := dgram.Write(c); err != nil {
 					glog.V(LWARNING).Infof("[udp-tun] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, err)
 					node.err = err
...
@@ -255,7 +253,7 @@ func (s *UdpForwardServer) ListenAndServe() error {
 		}
 		select {
-		case ch <- &packet{srcAddr: addr, dstAddr: raddr, data: b[:n]}:
+		case ch <- &packet{srcAddr: addr.String(), dstAddr: raddr.String(), data: b[:n]}:
 		case <-time.After(time.Second * 3):
 			glog.V(LWARNING).Infof("[udp] %s -> %s : %s", addr, raddr, "send queue is full, discard")
 		}
...
@@ -264,7 +262,12 @@ func (s *UdpForwardServer) ListenAndServe() error {
 	// start recv queue
 	go func(ch <-chan *packet) {
 		for pkt := range ch {
-			if _, err := conn.WriteToUDP(pkt.data, pkt.dstAddr); err != nil {
+			dstAddr, err := net.ResolveUDPAddr("udp", pkt.dstAddr)
+			if err != nil {
+				glog.V(LWARNING).Infof("[udp] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
+				continue
+			}
+			if _, err := conn.WriteToUDP(pkt.data, dstAddr); err != nil {
 				glog.V(LWARNING).Infof("[udp] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
 				return
 			}
...
@@ -285,7 +288,7 @@ func (s *UdpForwardServer) ListenAndServe() error {
 			}
 		}
-		node, ok := m[pkt.srcAddr.String()]
+		node, ok := m[pkt.srcAddr]
 		if !ok {
 			node = &cnode{
 				chain: s.Base.Chain,
...
@@ -295,7 +298,7 @@ func (s *UdpForwardServer) ListenAndServe() error {
 				wChan: make(chan *packet, 32),
 				ttl:   time.Duration(s.TTL) * time.Second,
 			}
-			m[pkt.srcAddr.String()] = node
+			m[pkt.srcAddr] = node
 			go node.run()
 			glog.V(LINFO).Infof("[udp] %s -> %s : new client (%d)", pkt.srcAddr, pkt.dstAddr, len(m))
 		}
...
gost.go

@@ -11,7 +11,7 @@ import (
 )

 const (
-	Version = "2.3"
+	Version = "2.4-dev"
 )

 // Log level for glog
...
node.go

@@ -71,7 +71,7 @@ func ParseProxyNode(s string) (node ProxyNode, err error) {
 	}
 	switch node.Transport {
-	case "ws", "wss", "tls", "http2", "ssu", "quic", "kcp", "redirect":
+	case "ws", "wss", "tls", "http2", "quic", "kcp", "redirect", "ssu":
 	case "https":
 		node.Protocol = "http"
 		node.Transport = "tls"
...
server.go

@@ -32,7 +32,7 @@ func NewProxyServer(node ProxyNode, chain *ProxyChain, config *tls.Config) *Prox
 	var cipher *ss.Cipher
 	var ota bool
-	if node.Protocol == "ss" {
+	if node.Protocol == "ss" || node.Transport == "ssu" {
 		var err error
 		var method, password string
...
@@ -98,8 +98,6 @@ func (s *ProxyServer) Serve() error {
 		return NewRTcpForwardServer(s).Serve()
 	case "rudp": // Remote UDP port forwarding
 		return NewRUdpForwardServer(s).Serve()
-	case "ssu": // TODO: shadowsocks udp relay
-		return NewShadowUdpServer(s).ListenAndServe()
 	case "quic":
 		return NewQuicServer(s).ListenAndServeTLS(s.TLSConfig)
 	case "kcp":
...
@@ -118,6 +116,12 @@ func (s *ProxyServer) Serve() error {
 		return NewKCPServer(s, config).ListenAndServe()
 	case "redirect":
 		return NewRedsocksTCPServer(s).ListenAndServe()
+	case "ssu": // shadowsocks udp relay
+		ttl, _ := strconv.Atoi(s.Node.Get("ttl"))
+		if ttl <= 0 {
+			ttl = DefaultTTL
+		}
+		return NewShadowUdpServer(s, ttl).ListenAndServe()
 	default:
 		ln, err = net.Listen("tcp", node.Addr)
 	}
...
ss.go

@@ -5,6 +5,7 @@ import (
 	"encoding/binary"
 	"errors"
 	"fmt"
+	"github.com/ginuerzh/gosocks5"
 	"github.com/golang/glog"
 	ss "github.com/shadowsocks/shadowsocks-go/shadowsocks"
 	"io"
...
@@ -65,47 +66,6 @@ func (s *ShadowServer) Serve() {
 	glog.V(LINFO).Infof("[ss] %s >-< %s", s.conn.RemoteAddr(), addr)
 }

-type ShadowUdpServer struct {
-	Base    *ProxyServer
-	Handler func(conn *net.UDPConn, addr *net.UDPAddr, data []byte)
-}
-
-func NewShadowUdpServer(base *ProxyServer) *ShadowUdpServer {
-	return &ShadowUdpServer{Base: base}
-}
-
-func (s *ShadowUdpServer) ListenAndServe() error {
-	laddr, err := net.ResolveUDPAddr("udp", s.Base.Node.Addr)
-	if err != nil {
-		return err
-	}
-	lconn, err := net.ListenUDP("udp", laddr)
-	if err != nil {
-		return err
-	}
-	defer lconn.Close()
-
-	if s.Handler == nil {
-		s.Handler = s.HandleConn
-	}
-
-	for {
-		b := make([]byte, LargeBufferSize)
-		n, addr, err := lconn.ReadFromUDP(b)
-		if err != nil {
-			glog.V(LWARNING).Infoln(err)
-			continue
-		}
-		go s.Handler(lconn, addr, b[:n])
-	}
-}
-
-// TODO: shadowsocks udp relay handler
-func (s *ShadowUdpServer) HandleConn(conn *net.UDPConn, addr *net.UDPAddr, data []byte) {
-}
-
 // This function is copied from shadowsocks library with some modification.
 func (s *ShadowServer) getRequest() (host string, ota bool, err error) {
 	// buf size should at least have the same size with the largest possible
...
@@ -276,3 +236,109 @@ func (c *shadowConn) SetReadDeadline(t time.Time) error {
 func (c *shadowConn) SetWriteDeadline(t time.Time) error {
 	return c.conn.SetWriteDeadline(t)
 }
+
+type ShadowUdpServer struct {
+	Base *ProxyServer
+	TTL  int
+}
+
+func NewShadowUdpServer(base *ProxyServer, ttl int) *ShadowUdpServer {
+	return &ShadowUdpServer{Base: base, TTL: ttl}
+}
+
+func (s *ShadowUdpServer) ListenAndServe() error {
+	laddr, err := net.ResolveUDPAddr("udp", s.Base.Node.Addr)
+	if err != nil {
+		return err
+	}
+	lconn, err := net.ListenUDP("udp", laddr)
+	if err != nil {
+		return err
+	}
+	defer lconn.Close()
+
+	conn := ss.NewSecurePacketConn(lconn, s.Base.cipher.Copy(), true) // force OTA on
+
+	rChan, wChan := make(chan *packet, 128), make(chan *packet, 128)
+
+	// start send queue
+	go func(ch chan<- *packet) {
+		for {
+			b := make([]byte, MediumBufferSize)
+			n, addr, err := conn.ReadFrom(b[3:]) // add rsv and frag fields to make it the standard SOCKS5 UDP datagram
+			if err != nil {
+				glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", addr, laddr, err)
+				continue
+			}
+
+			if b[3]&ss.OneTimeAuthMask > 0 {
+				glog.V(LWARNING).Infof("[ssu] %s -> %s : client does not support OTA", addr, laddr)
+				continue
+			}
+			b[3] &= ss.AddrMask
+
+			dgram, err := gosocks5.ReadUDPDatagram(bytes.NewReader(b[:n+3]))
+			if err != nil {
+				glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", addr, laddr, err)
+				continue
+			}
+
+			select {
+			case ch <- &packet{srcAddr: addr.String(), dstAddr: dgram.Header.Addr.String(), data: b[:n+3]}:
+			case <-time.After(time.Second * 3):
+				glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", addr, dgram.Header.Addr.String(), "send queue is full, discard")
+			}
+		}
+	}(wChan)
+
+	// start recv queue
+	go func(ch <-chan *packet) {
+		for pkt := range ch {
+			dstAddr, err := net.ResolveUDPAddr("udp", pkt.dstAddr)
+			if err != nil {
+				glog.V(LWARNING).Infof("[ssu] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
+				continue
+			}
+			if _, err := conn.WriteTo(pkt.data, dstAddr); err != nil {
+				glog.V(LWARNING).Infof("[ssu] %s <- %s : %s", pkt.dstAddr, pkt.srcAddr, err)
+				return
+			}
+		}
+	}(rChan)
+
+	// mapping client to node
+	m := make(map[string]*cnode)
+
+	// start dispatcher
+	for pkt := range wChan {
+		// clear obsolete nodes
+		for k, node := range m {
+			if node != nil && node.err != nil {
+				close(node.wChan)
+				delete(m, k)
+				glog.V(LINFO).Infof("[ssu] clear node %s", k)
+			}
+		}
+
+		node, ok := m[pkt.srcAddr]
+		if !ok {
+			node = &cnode{
+				chain:   s.Base.Chain,
+				srcAddr: pkt.srcAddr,
+				dstAddr: pkt.dstAddr,
+				rChan:   rChan,
+				wChan:   make(chan *packet, 32),
+				ttl:     time.Duration(s.TTL) * time.Second,
+			}
+			m[pkt.srcAddr] = node
+			go node.run()
+			glog.V(LINFO).Infof("[ssu] %s -> %s : new client (%d)", pkt.srcAddr, pkt.dstAddr, len(m))
+		}
+
+		select {
+		case node.wChan <- pkt:
+		case <-time.After(time.Second * 3):
+			glog.V(LWARNING).Infof("[ssu] %s -> %s : %s", pkt.srcAddr, pkt.dstAddr, "node send queue is full, discard")
+		}
+	}

+	return nil
+}