A few days ago I was asked this question in a PDD interview. I recited the standard GMP-scheduling interview boilerplate, but the interviewer was clearly unhappy and pointed out that he didn't want memorized answers; he wanted to hear how the runtime actually performs scheduling. (Though maybe I simply hadn't memorized enough boilerplate.) So I went through the source code to understand it properly, and took the chance to summarize the GMP model along the way.
Scheduling Model
First, a quick summary of GMP scheduling, which no round of Go interview questions can avoid.
The GM Model
Before Go 1.1, the runtime used the GM scheduling model. G stands for goroutine. M stands for Machine, the structure that handles thread operations and talks directly to the operating system; you can treat it as an OS thread. Newly created Gs are placed on a global run queue, where they wait to be picked up by an M.
Let's look at the scheduler code from the early 1.0.1 release:
// One round of scheduler: find a goroutine and run it.
// The argument is the goroutine that was running before
// schedule was called, or nil if this is the first call.
// Never returns.
static void
schedule(G *gp)
{
int32 hz;
uint32 v;
schedlock();
if(gp != nil) {
// Just finished running gp.
// Unbind the current G from the M and decrement the scheduler's running-G count
gp->m = nil;
runtime·sched.grunning--;
// atomic { mcpu-- }
v = runtime·xadd(&runtime·sched.atomic, -1<<mcpuShift);
if(atomic_mcpu(v) > maxgomaxprocs)
runtime·throw("negative mcpu in scheduler");
switch(gp->status){
case Grunnable:
// Why would a G still in Grunnable state end up here? (it falls through to the throw below)
case Gdead:
// Shouldn't have been running!
// huh, how are you already dead?
runtime·throw("bad gp->status in sched");
case Grunning:
gp->status = Grunnable;
// put it back on the run queue
gput(gp);
break;
case Gmoribund:
// The goroutine is moribund (about to die): tear down its resources
gp->status = Gdead; // bury it
if(gp->lockedm) {
gp->lockedm = nil;
m->lockedg = nil;
}
gp->idlem = nil;
unwindstack(gp, nil); // free its stack
gfput(gp); // recycle the G onto the free list
if(--runtime·sched.gcount == 0)
runtime·exit(0);
break;
}
// If gp has readyonstop set, mark it runnable again now that it has stopped
if(gp->readyonstop){
gp->readyonstop = 0;
readylocked(gp);
}
} else if(m->helpgc) { // this M was helping with GC
// Bootstrap m or new m started by starttheworld.
// atomic { mcpu-- }
v = runtime·xadd(&runtime·sched.atomic, -1<<mcpuShift);
if(atomic_mcpu(v) > maxgomaxprocs)
runtime·throw("negative mcpu in scheduler");
// Compensate for increment in starttheworld().
runtime·sched.grunning--;
m->helpgc = 0;
} else if(m->nextg != nil) {
// New m started by matchmg.
// the M has already been handed a G to run
} else {
runtime·throw("invalid m state in scheduler");
}
// Find (or wait for) g to run. Unlocks runtime·sched.
// pick the next g from the queue to run
gp = nextgandunlock();
gp->readyonstop = 0;
gp->status = Grunning;
m->curg = gp;
gp->m = m;
// Check whether the profiler needs to be turned on or off.
hz = runtime·sched.profilehz;
if(m->profilehz != hz)
runtime·resetcpuprofiler(hz);
if(gp->sched.pc == (byte*)runtime·goexit) { // kickoff
runtime·gogocall(&gp->sched, (void(*)(void))gp->entry);
}
runtime·gogo(&gp->sched, 0); // actually switch to gp's execution context
}
schedlock() first locks the scheduler to protect the global queue. The code then inspects gp's status: if the goroutine has finished, it is destroyed completely; otherwise the current goroutine is recycled as usual. Finally, the next g is taken from the queue to run.
static G*
nextgandunlock(void)
{
G *gp;
uint32 v;
top:
if(atomic_mcpu(runtime·sched.atomic) >= maxgomaxprocs)
runtime·throw("negative mcpu");
// If there is a g waiting as m->nextg, the mcpu++
// happened before it was passed to mnextg.
if(m->nextg != nil) {
gp = m->nextg;
m->nextg = nil;
schedunlock();
return gp;
}
......
Problems with the GM model
As the code above shows, every scheduling round has to take the global lock, which causes heavy lock contention. Ms frequently hand Gs to one another via m->nextg, adding extra overhead. On top of that, each M keeps its own independent memory cache (M.mcache), which wastes resources and hurts data locality. And during system calls, Ms are constantly being blocked and unblocked, which costs even more.
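To make the contention point concrete, here is a deliberately naive sketch of my own (not runtime code) of the GM idea: one global run queue guarded by a single mutex, with every worker thread ("M") fighting over the same lock.

package main

import (
	"fmt"
	"sync"
)

// Toy "GM" scheduler: a single global run queue protected by one mutex.
// Every worker ("M") has to take the same lock to fetch work, which is
// exactly the contention problem described above.
type G func()

type globalSched struct {
	mu   sync.Mutex
	runq []G
}

func (s *globalSched) put(g G) {
	s.mu.Lock()
	s.runq = append(s.runq, g)
	s.mu.Unlock()
}

func (s *globalSched) get() (G, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.runq) == 0 {
		return nil, false
	}
	g := s.runq[0]
	s.runq = s.runq[1:]
	return g, true
}

func main() {
	sched := &globalSched{}
	for i := 0; i < 10; i++ {
		id := i
		sched.put(func() { fmt.Println("ran G", id) })
	}

	var wg sync.WaitGroup
	for m := 0; m < 4; m++ { // four "M"s all contending on one lock
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				g, ok := sched.get()
				if !ok {
					return
				}
				g()
			}
		}()
	}
	wg.Wait()
}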
The GMP Scheduling Model
These problems with the GM model are exactly why P was introduced. P stands for Processor and represents the execution context of an M; each running M is bound to a P.
Every P maintains its own local run queue, which reduces reliance on the global queue and therefore reduces lock contention. The work-stealing mechanism that comes with it also cuts down on idle Ms and improves resource utilization.
(Figure: G, M, P)
The GMP Scheduling Flow
After go func() creates a goroutine, the new G is added to a P's local run queue; if the local queue is full, it goes to the global queue.
Each P is bound to an M, and the M takes Gs from its P's local queue to execute. If the local queue of the M's P is empty, it steals Gs from other Ps' local queues.
If a G blocks in a system call, the P detaches from the current M and looks for an idle M, creating a new one if none is available. If a G blocks on a channel or network I/O, the M just looks for another G in the Grunnable state. Once the blocked G becomes ready again, it re-enters a P's local queue to wait for execution.
When a G finishes, its resources are released.
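A small, runnable illustration of the pieces above (plain user code, nothing runtime-internal): the number of Ps equals runtime.GOMAXPROCS, and every go statement below goes through the enqueue path described in the first step.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// The number of Ps; passing 0 only queries the current value.
	fmt.Println("GOMAXPROCS (number of Ps):", runtime.GOMAXPROCS(0))

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(id int) { // each go statement becomes a call to runtime.newproc
			defer wg.Done()
			fmt.Println("goroutine", id, "picked up and run by some M")
		}(i)
	}
	wg.Wait()
}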
schedule()
func schedule() {
	mp := getg().m // the M the current G is running on
	......
top:
	pp := mp.p.ptr() // the P currently owned by this M
	pp.preempt = false
	......
	gp, inheritTime, tryWakeP := findRunnable() // blocks until work is available
	......
	execute(gp, inheritTime)
}
After many rounds of iteration, the schedule() implementation in Go 1.23 has become a lot more complex. The code that actually picks a G is wrapped in findRunnable(), which is far too long to quote in full, so here is just an outline of its logic.
What findRunnable() does is: find a runnable goroutine (gp), and also report whether it should inherit the current time slice (inheritTime) and whether a new P should be woken up (tryWakeP).
It first tries to schedule the traceReader or a GC worker; these are runtime system goroutines and have the highest priority. Then it looks at the local run queue. If the local queue is empty, it tries to get a G from the global queue. To prevent one P's saturated local queue from starving the others, it periodically (every 61st scheduler tick) checks the global queue before the local one.
If none of that turns up work, and the number of spinning Ms has not exceeded the limit, it tries to steal work from other Ps' queues. If the steal succeeds, the stolen G runs; if it fails but new timers or GC work have appeared, it retries.
Finally, if there is still nothing to do, it prepares to release the P and park the M.
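To make that priority order easier to see at a glance, here is a heavily simplified sketch of my own (stub types, no atomics; it omits the trace reader, GC workers, netpoll, timers, and the spinning-M bookkeeping that the real findRunnable in runtime/proc.go handles):

package sketch

// gSketch and pSketch are stand-ins, not the real runtime g/p.
type gSketch struct{ id int }

type pSketch struct {
	runq      []*gSketch // local run queue
	schedtick uint32     // incremented on every scheduler call
}

var globalRunq []*gSketch // stand-in for sched.runq (normally guarded by sched.lock)

func findRunnableSketch(pp *pSketch, allp []*pSketch) *gSketch {
	// 1. Every 61st tick, check the global queue first so it cannot be
	//    starved by a busy local queue.
	if pp.schedtick%61 == 0 && len(globalRunq) > 0 {
		return popGlobal()
	}
	// 2. The P's local run queue.
	if len(pp.runq) > 0 {
		gp := pp.runq[0]
		pp.runq = pp.runq[1:]
		return gp
	}
	// 3. The global run queue.
	if gp := popGlobal(); gp != nil {
		return gp
	}
	// 4. Work stealing: take half of another P's local queue.
	for _, victim := range allp {
		if victim == pp || len(victim.runq) == 0 {
			continue
		}
		half := (len(victim.runq) + 1) / 2
		stolen := victim.runq[:half]
		victim.runq = victim.runq[half:]
		pp.runq = append(pp.runq, stolen[1:]...)
		return stolen[0]
	}
	// 5. Still nothing: the real runtime releases the P and parks the M here.
	return nil
}

func popGlobal() *gSketch {
	if len(globalRunq) == 0 {
		return nil
	}
	gp := globalRunq[0]
	globalRunq = globalRunq[1:]
	return gp
}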
So what does the runtime actually do?
The compiler translates go func() { … } into a call to runtime.newproc():
// Create a new g running fn.
// Put it on the queue of g's waiting to run.
// The compiler turns a go statement into a call to this.
func newproc(fn *funcval) {
gp := getg()
// Get the caller's program counter (return address), used to attribute the go statement when tracing and analyzing stack traces
pc := sys.GetCallerPC()
// Switch to the current M's system stack (g0)
systemstack(func() {
newg := newproc1(fn, gp, pc, false, waitReasonZero)
// Get the current M's P and put the newly created newg on its local run queue
pp := getg().m.p.ptr()
runqput(pp, newg, true)
if mainStarted {
wakep()
}
})
}
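Note that runqput is called with next set to true: the new G is placed in the P's runnext slot rather than at the tail of the local queue (see the runnext field of the p struct in the appendix), so it inherits the remaining time slice. Below is a rough sketch of my own of that behaviour, again with stand-in types and ignoring the ring buffer, atomics, and randomization in the real runqput:

package sketch

type gSketch struct{ id int }

type pSketch struct {
	runnext *gSketch   // G to run next, inherits the remaining time slice
	runq    []*gSketch // local run queue (a fixed 256-entry ring in the real runtime)
}

var globalRunq []*gSketch // stand-in for the global run queue

// runqputSketch roughly mirrors runqput(pp, gp, next): with next == true the
// new G displaces the previous runnext, which falls back to the local queue;
// if the local queue is full, half of it plus gp is pushed to the global queue.
func runqputSketch(pp *pSketch, gp *gSketch, next bool) {
	if next {
		old := pp.runnext
		pp.runnext = gp
		if old == nil {
			return
		}
		gp = old // the displaced G goes through the normal path below
	}
	const runqSize = 256
	if len(pp.runq) < runqSize {
		pp.runq = append(pp.runq, gp)
		return
	}
	// Local queue full: move half of it, plus gp, to the global queue.
	half := len(pp.runq) / 2
	globalRunq = append(globalRunq, pp.runq[:half]...)
	globalRunq = append(globalRunq, gp)
	pp.runq = pp.runq[half:]
}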
Now let's look at the internals of newproc1:
// Create a new g in state _Grunnable (or _Gwaiting if parked is true), starting at fn.
// callerpc is the address of the go statement that created this. The caller is responsible
// for adding the new g to the scheduler. If parked is true, waitreason must be non-zero.
func newproc1(fn *funcval, callergp *g, callerpc uintptr, parked bool, waitreason waitReason) *g {
if fn == nil {
fatal("go of nil func value")
}
mp := acquirem() // disable preemption because we hold M and P in local vars.
pp := mp.p.ptr()
newg := gfget(pp)
if newg == nil {
newg = malg(stackMin)
casgstatus(newg, _Gidle, _Gdead)
allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
}
......
totalSize := uintptr(4*goarch.PtrSize + sys.MinFrameSize) // extra space in case of reads slightly beyond frame
totalSize = alignUp(totalSize, sys.StackAlign)
sp := newg.stack.hi - totalSize
......
memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched))
newg.sched.sp = sp
newg.stktopsp = sp
newg.sched.pc = abi.FuncPCABI0(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
newg.sched.g = guintptr(unsafe.Pointer(newg))
gostartcallfn(&newg.sched, fn)
newg.parentGoid = callergp.goid
newg.gopc = callerpc
newg.ancestors = saveAncestors(callergp)
newg.startpc = fn.fn
...... // check whether this is a system G and mark it accordingly
// Track initial transition?
newg.trackingSeq = uint8(cheaprand())
if newg.trackingSeq%gTrackingPeriod == 0 {
newg.tracking = true
}
gcController.addScannableStack(pp, int64(newg.stack.hi-newg.stack.lo))
...... // trace bookkeeping
// Set up race context.
if raceenabled {
newg.racectx = racegostart(callerpc)
newg.raceignore = 0
if newg.labels != nil {
// See note in proflabel.go on labelSync's role in synchronizing
// with the reads in the signal handler.
racereleasemergeg(newg, unsafe.Pointer(&labelSync))
}
}
releasem(mp) // release the M acquired above (re-enables preemption)
return newg
}
After acquiring the current M, newproc1 first tries to reuse a G from the current P's free-G pool; if no free G is available, it allocates a new one with a fresh stack. It then initializes the new G's scheduling context and metadata, and finally returns the G.
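One subtle detail in the code above: newg.sched.pc is first set to the address of goexit, and gostartcallfn then pushes that address onto the new stack as a fake return address before pointing pc at fn. So when the goroutine's function returns, it "returns" straight into goexit, which tears the goroutine down. A toy sketch of the idea (gobufSketch is my own stand-in; the real logic lives in runtime.gostartcall):

package sketch

import "unsafe"

// gobufSketch mimics just the sp/pc part of the runtime's gobuf.
type gobufSketch struct {
	sp uintptr
	pc uintptr
}

// gostartcallSketch mirrors the idea of runtime.gostartcall: reserve one word
// on the new stack, store the old pc (the address of goexit) there as a fake
// return address, then point pc at fn. When fn returns, it falls into goexit.
// This is only a sketch; don't call it against arbitrary addresses.
func gostartcallSketch(buf *gobufSketch, fn uintptr) {
	sp := buf.sp
	sp -= unsafe.Sizeof(uintptr(0))          // make room for the fake return address
	*(*uintptr)(unsafe.Pointer(sp)) = buf.pc // buf.pc currently holds goexit's address
	buf.sp = sp
	buf.pc = fn // the new goroutine starts executing at fn
}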
Summary
To wrap up, here is the answer to the question in the title:
When a new goroutine is created with the go keyword, the runtime calls newproc to create a new G.
Inside newproc, systemstack() switches to the (g0) system stack and calls newproc1.
In newproc1, the runtime first tries to reuse an idle G cached on the P; if there is none, it creates a new G instance and allocates its stack.
It initializes the scheduling context (sched).
It assigns a unique goid and records metadata such as the parent G, the creating call site, and labels.
It sets the G's initial status to _Grunnable, i.e. ready to run.
It puts the G on the current P's local run queue (or the global queue if the local one is full), so that schedule() will pick it up later.
In addition, the runtime does some extra work such as tracing, GC marking, registering the new stack, and race-detector setup.
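If you want to see the first step for yourself, compile a tiny program with go tool compile -S; the assembly for main.main should contain a CALL to runtime.newproc(SB) generated for the go statement (the file name below is arbitrary):

package main

// Build the assembly listing with: go tool compile -S main.go
// In main.main you should find a CALL to runtime.newproc(SB) for the
// go statement below.
func main() {
	done := make(chan struct{})
	go func() { // this line is what becomes the runtime.newproc call
		println("hello from a new G")
		close(done)
	}()
	<-done
}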
Appendix
type p struct {
id int32
status uint32 // one of pidle/prunning/...
link puintptr
schedtick uint32 // incremented on every scheduler call
syscalltick uint32 // incremented on every system call
sysmontick sysmontick // last tick observed by sysmon
m muintptr // back-link to associated m (nil if idle)
mcache *mcache
pcache pageCache
raceprocctx uintptr
deferpool []*_defer // pool of available defer structs (see panic.go)
deferpoolbuf [32]*_defer
// Cache of goroutine ids, amortizes accesses to runtime·sched.goidgen.
goidcache uint64
goidcacheend uint64
// Queue of runnable goroutines. Accessed without lock.
runqhead uint32
runqtail uint32
runq [256]guintptr
// runnext, if non-nil, is a runnable G that was ready'd by
// the current G and should be run next instead of what's in
// runq if there's time remaining in the running G's time
// slice. It will inherit the time left in the current time
// slice. If a set of goroutines is locked in a
// communicate-and-wait pattern, this schedules that set as a
// unit and eliminates the (potentially large) scheduling
// latency that otherwise arises from adding the ready'd
// goroutines to the end of the run queue.
//
// Note that while other P's may atomically CAS this to zero,
// only the owner P can CAS it to a valid G.
runnext guintptr
// Available G's (status == Gdead)
gFree struct {
gList
n int32
}
sudogcache []*sudog
sudogbuf [128]*sudog
// Cache of mspan objects from the heap.
mspancache struct {
// We need an explicit length here because this field is used
// in allocation codepaths where write barriers are not allowed,
// and eliminating the write barrier/keeping it eliminated from
// slice updates is tricky, more so than just managing the length
// ourselves.
len int
buf [128]*mspan
}
// Cache of a single pinner object to reduce allocations from repeated
// pinner creation.
pinnerCache *pinner
trace pTraceState
palloc persistentAlloc // per-P to avoid mutex
// Per-P GC state
gcAssistTime int64 // Nanoseconds in assistAlloc
gcFractionalMarkTime int64 // Nanoseconds in fractional mark worker (atomic)
// limiterEvent tracks events for the GC CPU limiter.
limiterEvent limiterEvent
// gcMarkWorkerMode is the mode for the next mark worker to run in.
// That is, this is used to communicate with the worker goroutine
// selected for immediate execution by
// gcController.findRunnableGCWorker. When scheduling other goroutines,
// this field must be set to gcMarkWorkerNotWorker.
gcMarkWorkerMode gcMarkWorkerMode
// gcMarkWorkerStartTime is the nanotime() at which the most recent
// mark worker started.
gcMarkWorkerStartTime int64
// gcw is this P's GC work buffer cache. The work buffer is
// filled by write barriers, drained by mutator assists, and
// disposed on certain GC state transitions.
gcw gcWork
// wbBuf is this P's GC write barrier buffer.
//
// TODO: Consider caching this in the running G.
wbBuf wbBuf
runSafePointFn uint32 // if 1, run sched.safePointFn at next safe point
// statsSeq is a counter indicating whether this P is currently
// writing any stats. Its value is even when not, odd when it is.
statsSeq atomic.Uint32
// Timer heap.
timers timers
// maxStackScanDelta accumulates the amount of stack space held by
// live goroutines (i.e. those eligible for stack scanning).
// Flushed to gcController.maxStackScan once maxStackScanSlack
// or -maxStackScanSlack is reached.
maxStackScanDelta int64
// gc-time statistics about current goroutines
// Note that this differs from maxStackScan in that this
// accumulates the actual stack observed to be used at GC time (hi - sp),
// not an instantaneous measure of the total stack size that might need
// to be scanned (hi - lo).
scannedStackSize uint64 // stack size of goroutines scanned by this P
scannedStacks uint64 // number of goroutines scanned by this P
// preempt is set to indicate that this P should be enter the
// scheduler ASAP (regardless of what G is running on it).
preempt bool
// gcStopTime is the nanotime timestamp that this P last entered _Pgcstop.
gcStopTime int64
// Padding is no longer needed. False sharing is now not a worry because p is large enough
// that its size class is an integer multiple of the cache line size (for any of our architectures).
}
type g struct {
// Stack parameters.
// stack describes the actual stack memory: [stack.lo, stack.hi).
// stackguard0 is the stack pointer compared in the Go stack growth prologue.
// It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
// stackguard1 is the stack pointer compared in the //go:systemstack stack growth prologue.
// It is stack.lo+StackGuard on g0 and gsignal stacks.
// It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
stack stack // offset known to runtime/cgo
stackguard0 uintptr // offset known to liblink
stackguard1 uintptr // offset known to liblink
_panic *_panic // innermost panic - offset known to liblink
_defer *_defer // innermost defer
m *m // current m; offset known to arm liblink
sched gobuf
syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
syscallbp uintptr // if status==Gsyscall, syscallbp = sched.bp to use in fpTraceback
stktopsp uintptr // expected sp at top of stack, to check in traceback
// param is a generic pointer parameter field used to pass
// values in particular contexts where other storage for the
// parameter would be difficult to find. It is currently used
// in four ways:
// 1. When a channel operation wakes up a blocked goroutine, it sets param to
// point to the sudog of the completed blocking operation.
// 2. By gcAssistAlloc1 to signal back to its caller that the goroutine completed
// the GC cycle. It is unsafe to do so in any other way, because the goroutine's
// stack may have moved in the meantime.
// 3. By debugCallWrap to pass parameters to a new goroutine because allocating a
// closure in the runtime is forbidden.
// 4. When a panic is recovered and control returns to the respective frame,
// param may point to a savedOpenDeferState.
param unsafe.Pointer
atomicstatus atomic.Uint32
stackLock uint32 // sigprof/scang lock; TODO: fold in to atomicstatus
goid uint64
schedlink guintptr
waitsince int64 // approx time when the g become blocked
waitreason waitReason // if status==Gwaiting
preempt bool // preemption signal, duplicates stackguard0 = stackpreempt
preemptStop bool // transition to _Gpreempted on preemption; otherwise, just deschedule
preemptShrink bool // shrink stack at synchronous safe point
// asyncSafePoint is set if g is stopped at an asynchronous
// safe point. This means there are frames on the stack
// without precise pointer information.
asyncSafePoint bool
paniconfault bool // panic (instead of crash) on unexpected fault address
gcscandone bool // g has scanned stack; protected by _Gscan bit in status
throwsplit bool // must not split stack
// activeStackChans indicates that there are unlocked channels
// pointing into this goroutine's stack. If true, stack
// copying needs to acquire channel locks to protect these
// areas of the stack.
activeStackChans bool
// parkingOnChan indicates that the goroutine is about to
// park on a chansend or chanrecv. Used to signal an unsafe point
// for stack shrinking.
parkingOnChan atomic.Bool
// inMarkAssist indicates whether the goroutine is in mark assist.
// Used by the execution tracer.
inMarkAssist bool
coroexit bool // argument to coroswitch_m
raceignore int8 // ignore race detection events
nocgocallback bool // whether disable callback from C
tracking bool // whether we're tracking this G for sched latency statistics
trackingSeq uint8 // used to decide whether to track this G
trackingStamp int64 // timestamp of when the G last started being tracked
runnableTime int64 // the amount of time spent runnable, cleared when running, only used when tracking
lockedm muintptr
fipsIndicator uint8
sig uint32
writebuf []byte
sigcode0 uintptr
sigcode1 uintptr
sigpc uintptr
parentGoid uint64 // goid of goroutine that created this goroutine
gopc uintptr // pc of go statement that created this goroutine
ancestors *[]ancestorInfo // ancestor information goroutine(s) that created this goroutine (only used if debug.tracebackancestors)
startpc uintptr // pc of goroutine function
racectx uintptr
waiting *sudog // sudog structures this g is waiting on (that have a valid elem ptr); in lock order
cgoCtxt []uintptr // cgo traceback context
labels unsafe.Pointer // profiler labels
timer *timer // cached timer for time.Sleep
sleepWhen int64 // when to sleep until
selectDone atomic.Uint32 // are we participating in a select and did someone win the race?
// goroutineProfiled indicates the status of this goroutine's stack for the
// current in-progress goroutine profile
goroutineProfiled goroutineProfileStateHolder
coroarg *coro // argument during coroutine transfers
syncGroup *synctestGroup
// Per-G tracer state.
trace gTraceState
// Per-G GC state
// gcAssistBytes is this G's GC assist credit in terms of
// bytes allocated. If this is positive, then the G has credit
// to allocate gcAssistBytes bytes without assisting. If this
// is negative, then the G must correct this by performing
// scan work. We track this in bytes to make it fast to update
// and check for debt in the malloc hot path. The assist ratio
// determines how this corresponds to scan work debt.
gcAssistBytes int64
}