Biluo Shen / ygo-agent · Commits

Commit 157f440c, authored Mar 05, 2024 by biluo.shen
parent 92898eae

add torchrun_setup

Showing 1 changed file with 15 additions and 6 deletions:

ygoai/rl/dist.py (+15, -6)
@@ -24,6 +24,20 @@ def reduce_gradidents(params, world_size):
         offset += param.numel()
 
 
+def test_nccl(local_rank):
+    # manual init nccl
+    x = torch.rand(4, device=f'cuda:{local_rank}')
+    dist.all_reduce(x, op=dist.ReduceOp.SUM)
+    x.mean().item()
+    dist.barrier()
+
+
+def torchrun_setup(backend, local_rank):
+    dist.init_process_group(
+        backend, timeout=datetime.timedelta(seconds=60 * 30))
+    test_nccl(local_rank)
+
+
 def setup(backend, rank, world_size, port):
     os.environ['MASTER_ADDR'] = '127.0.0.1'
     os.environ['MASTER_PORT'] = str(port)
@@ -31,12 +45,7 @@ def setup(backend, rank, world_size, port):
         backend, rank=rank, world_size=world_size,
         timeout=datetime.timedelta(seconds=60 * 30))
-    # manual init nccl
-    x = torch.rand(4, device=f'cuda:{rank}')
-    dist.all_reduce(x, op=dist.ReduceOp.SUM)
-    x.mean().item()
-    dist.barrier()
-
+    test_nccl(rank)
     # print(f"Rank {rank} initialized")
 
 
 def mp_start(run):
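Context on the change (not part of the commit itself): `torchrun` launches one process per rank and exports `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` into each worker's environment. With the default `env://` rendezvous, `dist.init_process_group(backend, ...)` reads those variables itself, which is why the new `torchrun_setup` needs no explicit rank, world size, or address, while the older `setup` must set `MASTER_ADDR`/`MASTER_PORT` manually before initializing. A minimal sketch of reading that environment (standard library only; `torchrun_env` is an illustrative helper, not part of the repository):

```python
import os

def torchrun_env(environ=os.environ):
    """Read the rendezvous variables that torchrun exports for each worker.

    With env:// initialization (the default), init_process_group() consults
    these same variables instead of taking explicit rank/world_size arguments.
    """
    return {
        "rank": int(environ["RANK"]),               # global rank of this process
        "local_rank": int(environ["LOCAL_RANK"]),   # rank on this machine (GPU index)
        "world_size": int(environ["WORLD_SIZE"]),   # total number of processes
        "master_addr": environ["MASTER_ADDR"],      # rendezvous host
        "master_port": int(environ["MASTER_PORT"]), # rendezvous port
    }

# Simulate what `torchrun --nproc_per_node=2` would export for the second worker
fake = {"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2",
        "MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500"}
env = torchrun_env(fake)
```

The `test_nccl` helper factored out by this commit serves the same purpose in both paths: it performs a small all-reduce immediately after initialization so that NCCL communicators are created eagerly (and any misconfiguration fails fast) rather than on the first real collective.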