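A few days ago a friend came to me saying his program kept crashing while it ran and asked me to take a look. There is no real mystery here: just capture a crash dump. How to capture one is itself a frequently asked question, so here is the link again: [.NET程序崩溃了怎么抓 Dump ? 我总结了三种方案] /huangxincheng/p/ . The second approach, AEDebug, is all you need.
If the dump carries an exception, windbg shows the hint "This dump file has an exception of interest stored in it" when you open it. The output looks like this: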
************* Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       SRV*C:\mysymbols*/download/symbols
Symbol search path is: SRV*C:\mysymbols*/download/symbols
Executable search path is:
Windows 7 Version 7601 (Service Pack 1) MP (4 procs) Free x64
Product: Server, suite: Enterprise TerminalServer SingleUserTS
Debug session time: Wed Jun 14 13:34: 2023 (UTC + 8:00)
System Uptime: 0 days 3:28:
Uptime: 0 days 0:00:
................................................................
................................................................
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(): Stack overflow - code c00000fd (first/second chance not available)
For analysis of this file, run !analyze -v
clr!SlowAllocateString+0x11:
000007fe`f9236451 48c785b0fffffffeffffff mov qword ptr [rbp-50h],0FFFFFFFFFFFFFFFEh ss:00000000`123d5fd0=0000000000000000
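The output shows a "Stack overflow - code c00000fd" exception. Honestly, I had not seen a stack overflow in a long time and rather missed it. Since the dump says the stack overflowed, let's look at the state right before the exception with .ecxr.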
0:028> .ecxr;k
rax=00000000123d6048 rbx=00000000123d5d70 rcx=0000000000000001
rdx=0000000000000001 rsi=0000000000000000 rdi=00000000123d5880
rip=000007fef9236451 rsp=00000000123d5fb0 rbp=00000000123d6020
 r8=00000000ffffffff  r9=0000000000000000 r10=00000000123d618e
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000001
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010200
clr!SlowAllocateString+0x11:
000007fe`f9236451 48c785b0fffffffeffffff mov qword ptr [rbp-50h],0FFFFFFFFFFFFFFFEh ss:00000000`123d5fd0=0000000000000000
  *** Stack trace for last set context - .thread/.cxr resets it
 # Child-SP          RetAddr           Call Site
00 00000000`123d5fb0 000007fe`f920a5bd clr!SlowAllocateString+0x11
01 00000000`123d6050 000007fe`f920a9c7 clr!StringObject::NewString+0x25
02 00000000`123d6080 000007fe`f920a80d clr!Int32ToDecStr+0xdf
03 00000000`123d6320 000007fe`9ab3bb72 clr!COMNumber::FormatInt32+0x10d
04 00000000`123d65f0 000007fe`9ab33e04 0x000007fe`9ab3bb72
05 00000000`123d6630 000007fe`9ab3be52 0x000007fe`9ab33e04
06 00000000`123d6720 000007fe`9ab3bd2a 0x000007fe`9ab3be52
07 00000000`123d6790 000007fe`9ab33e35 0x000007fe`9ab3bd2a
08 00000000`123d67f0 000007fe`9ab3be52 0x000007fe`9ab33e35
09 00000000`123d68e0 000007fe`9ab3bd2a 0x000007fe`9ab3be52
...
ff 00000000`123df860 000007fe`9ab3bd2a 0x000007fe`9ab3be52
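The output shows that every frame the debugger displays by default is filled (00 through ff, 256 frames in total), and the same return addresses keep repeating, so this looks like infinite recursion. To see the managed side, we switch to the !clrstack command.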
0:028> !clrstack
OS Thread Id: 0xbc4 (28)
        Child SP               IP Call Site
00000000123d63b8 000007fef9236451 [HelperMethodFrame_PROTECTOBJ: 00000000123d63b8] (Int32, , )
00000000123d65f0 000007fe9ab3bb72 xxx___symbol00(Byte[])
00000000123d6630 000007fe9ab33e04 xxx___symbol00(Byte[], Int64, Int64, Boolean)
00000000123d6720 000007fe9ab3be52 xxx___symbol00(Int32, Int32)
00000000123d6790 000007fe9ab3bd2a xxx___symbol00(Byte[], Boolean)
00000000123d67f0 000007fe9ab33e35 xxx___symbol00(Byte[], Int64, Int64, Boolean)
00000000123d68e0 000007fe9ab3be52 xxx___symbol00(Int32, Int32)
00000000123d6950 000007fe9ab3bd2a xxx___symbol00(Byte[], Boolean)
00000000123d69b0 000007fe9ab33e35 xxx___symbol00(Byte[], Int64, Int64, Boolean)
00000000123d6aa0 000007fe9ab3be52 xxx___symbol00(Int32, Int32)
00000000123d6b10 000007fe9ab3bd2a xxx___symbol00(Byte[], Boolean)
00000000123d6b70 000007fe9ab33e35 xxx___symbol00(Byte[], Int64, Int64, Boolean)
00000000123d6c60 000007fe9ab3be52 xxx___symbol00(Int32, Int32)
00000000123d6cd0 000007fe9ab3bd2a xxx___symbol00(Byte[], Boolean)
00000000123d6d30 000007fe9ab33e35 xxx___symbol00(Byte[], Int64, Int64, Boolean)
00000000123d6e20 000007fe9ab3be52 xxx___symbol00(Int32, Int32)
00000000123d6e90 000007fe9ab3bd2a xxx___symbol00(Byte[], Boolean)
....
000000001244db60 000007fe9ab31f0e _symbol00(, , Byte[])
000000001244dbc0 000007fe9ab318e5 (, Int32, Int32, , Int32)
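From this output, the application code calls into a third-party library through Convertxxxx, and the runaway recursion happens inside that library: the same three methods call each other over and over.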
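The repetition can also be confirmed mechanically instead of by eye. The following is a minimal Python sketch, not part of the original analysis: it takes the first twelve (Child SP, IP) pairs copied from the !clrstack output above, finds the repeating IP pattern, and estimates how many times the cycle could fit between the first repeating frame and the outermost frame shown. The helper names `find_cycle` and `estimate_cycles` are hypothetical, and the estimate assumes the frames elided by "...." all belong to the same cycle.

```python
# (Child SP, IP) pairs copied from the !clrstack output, innermost frame first.
FRAMES = [
    (0x123d63b8, 0x7fef9236451),
    (0x123d65f0, 0x7fe9ab3bb72),
    (0x123d6630, 0x7fe9ab33e04),
    (0x123d6720, 0x7fe9ab3be52),
    (0x123d6790, 0x7fe9ab3bd2a),
    (0x123d67f0, 0x7fe9ab33e35),
    (0x123d68e0, 0x7fe9ab3be52),
    (0x123d6950, 0x7fe9ab3bd2a),
    (0x123d69b0, 0x7fe9ab33e35),
    (0x123d6aa0, 0x7fe9ab3be52),
    (0x123d6b10, 0x7fe9ab3bd2a),
    (0x123d6b70, 0x7fe9ab33e35),
]
OUTER_SP = 0x1244db60  # Child SP of the first frame after the elided part


def find_cycle(ips, max_period=16):
    """Return (start, period) of the earliest, shortest pattern that repeats
    back to back until the end of ips, or None if there is none."""
    n = len(ips)
    for start in range(n):
        for period in range(1, max_period + 1):
            if start + 2 * period > n:
                break  # need at least two full repetitions
            if all(ips[i] == ips[i + period] for i in range(start, n - period)):
                return start, period
    return None


def estimate_cycles(frames, outer_sp):
    """Estimate how many recursion cycles fit below outer_sp (an upper bound)."""
    start, period = find_cycle([ip for _, ip in frames])
    bytes_per_cycle = frames[start + period][0] - frames[start][0]
    return bytes_per_cycle, (outer_sp - frames[start][0]) // bytes_per_cycle


if __name__ == "__main__":
    print(find_cycle([ip for _, ip in FRAMES]))
    size, cycles = estimate_cycles(FRAMES, OUTER_SP)
    print(hex(size), cycles)
```

Under those assumptions the cycle has a period of three frames and costs 0x1c0 bytes of stack per round, which puts the recursion depth on the order of a thousand rounds before the stack ran out.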
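In principle, whatever arguments the caller passes in, a library should never respond with runaway recursion, so this kind of problem can be filed as an SDK bug. The next step is to find out what this SDK actually is.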
[assembly: AssemblyCopyright("© 2008 O2 Solutions")]
[assembly: AssemblyProduct("PDFxxx4NET")]
[assembly: AssemblyCompany("O2 Solutions (/)")]
[assembly: AssemblyTrademark("PDFxxx4NET is a trademark of O2 Solutions")]
[assembly: AllowPartiallyTrustedCallers]
[assembly: AssemblyTitle("Print and convert PDF files to images.")]
[assembly: RuntimeCompatibility(WrapNonExceptionThrows = true)]
[assembly: AssemblyDescription("Component for rendering pdf files on .NET platform")]
[assembly: AssemblyConfiguration("")]
[assembly: AssemblyInformationalVersion("")]
[assembly: AssemblyKeyName("")]
[assembly: AssemblyDelaySign(false)]
[assembly: CompilationRelaxations(8)]
[assembly: AssemblyVersion("")]
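The attributes show this is still a build written in 2008, while the official site has long since shipped a 2023 version. In other words, the copy in use has gone 15 years without an update, which is impressive in its own way (see the screenshot).
At this point there is an answer for my friend: check whether PDFRender4NET can be upgraded to the latest version. In principle that should make the problem go away.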
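Careful readers may have a question: if the stack really overflowed, shouldn't the exception code be c0000005 (access violation)? Why is it c00000fd?
That is a very good question. To understand why the code is c00000fd rather than c0000005, you need a reasonably clear picture of how a thread stack is laid out. To make the explanation easier, I drew a diagram based on the current w3wp process.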
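Some readers will certainly raise objections to this diagram:
1) Isn't a thread stack 1M? Why is it 512k here?
The point is that 1M is no axiom. The size can be set freely in the PE header, as the screenshot shows.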
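To check the header value without a GUI tool, the field can be read straight from the file. The sketch below is an illustration rather than the method used in this analysis: it follows the documented PE layout (e_lfanew at offset 0x3C, a 20-byte COFF header, then the optional header with SizeOfStackReserve at offset 72) and returns the reserve and commit sizes. The function name `read_stack_sizes` is hypothetical. Keep in mind that the header only supplies the default; a thread created with an explicit stack size does not use it.

```python
import struct

PE32_MAGIC = 0x10B       # 32-bit optional header
PE32_PLUS_MAGIC = 0x20B  # 64-bit optional header


def read_stack_sizes(image):
    """Return (SizeOfStackReserve, SizeOfStackCommit) from the bytes of a PE file."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew: file offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # Skip the 4-byte signature and the 20-byte COFF file header.
    optional_header = pe_offset + 4 + 20
    (magic,) = struct.unpack_from("<H", image, optional_header)
    if magic == PE32_PLUS_MAGIC:
        # Two consecutive 8-byte fields in PE32+.
        return struct.unpack_from("<QQ", image, optional_header + 72)
    if magic == PE32_MAGIC:
        # Two consecutive 4-byte fields in PE32.
        return struct.unpack_from("<II", image, optional_header + 72)
    raise ValueError("unknown optional header magic: %#x" % magic)
```

Typical use would be `read_stack_sizes(open(path, "rb").read())`, where `path` points at the executable of interest. A reserve of 0x80000 corresponds to the 512k discussed here.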
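2) Isn't PAGE_GUARD a single memory page?
Many textbooks describe it as one page, but that is not fixed either. It can span several pages, for example 2 or 5. This is easy to verify: run !address -f:Stack and look at the result.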
0:121> !address -f:Stack

  BaseAddress   EndAddress+1  RegionSize    Type         State        Protect                      Usage
  --------------------------------------------------------------------------------------------------------------
  0`001f0000    0`00266000    0`00076000    MEM_PRIVATE  MEM_RESERVE                               Stack [~0; ]
  0`00266000    0`00268000    0`00002000    MEM_PRIVATE  MEM_COMMIT   PAGE_READWRITE | PAGE_GUARD  Stack [~0; ]
  0`00268000    0`00270000    0`00008000    MEM_PRIVATE  MEM_COMMIT   PAGE_READWRITE               Stack [~0; ]
  ...
  0`15710000    0`15788000    0`00078000    MEM_PRIVATE  MEM_RESERVE                               Stack [~139; ]
  0`15788000    0`1578d000    0`00005000    MEM_PRIVATE  MEM_COMMIT   PAGE_READWRITE | PAGE_GUARD  Stack [~139; ]
  0`1578d000    0`15790000    0`00003000    MEM_PRIVATE  MEM_COMMIT   PAGE_READWRITE               Stack [~139; ]
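With many threads it is tedious to read the guard sizes by hand. The following Python sketch is my own illustration: it scans pasted !address -f:Stack text and counts PAGE_GUARD pages per thread. The regular expression is written against the row layout shown above and is an assumption about that layout, not a general parser for windbg output; the 4 KiB page size is the x64 value.

```python
import re

PAGE_SIZE = 0x1000  # 4 KiB pages on x64 Windows

# base, end, size, then everything up to "Stack [~<thread>;"
ROW = re.compile(
    r"^\s*([0-9a-f`]+)\s+([0-9a-f`]+)\s+([0-9a-f`]+)\s+(.*?)\s+Stack\s+\[~(\d+);"
)

SAMPLE = """
0`001f0000 0`00266000 0`00076000 MEM_PRIVATE MEM_RESERVE Stack [~0; ]
0`00266000 0`00268000 0`00002000 MEM_PRIVATE MEM_COMMIT PAGE_READWRITE | PAGE_GUARD Stack [~0; ]
0`00268000 0`00270000 0`00008000 MEM_PRIVATE MEM_COMMIT PAGE_READWRITE Stack [~0; ]
0`15710000 0`15788000 0`00078000 MEM_PRIVATE MEM_RESERVE Stack [~139; ]
0`15788000 0`1578d000 0`00005000 MEM_PRIVATE MEM_COMMIT PAGE_READWRITE | PAGE_GUARD Stack [~139; ]
0`1578d000 0`15790000 0`00003000 MEM_PRIVATE MEM_COMMIT PAGE_READWRITE Stack [~139; ]
"""


def guard_pages(address_output):
    """Map thread number -> count of PAGE_GUARD pages in the pasted output."""
    result = {}
    for line in address_output.splitlines():
        match = ROW.match(line)
        if match and "PAGE_GUARD" in match.group(4):
            size = int(match.group(3).replace("`", ""), 16)
            thread = int(match.group(5))
            result[thread] = result.get(thread, 0) + size // PAGE_SIZE
    return result


if __name__ == "__main__":
    print(guard_pages(SAMPLE))
```

For the two threads shown, this gives 2 guard pages for thread 0 and 5 for thread 139, matching the RegionSize values 0x2000 and 0x5000.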
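Next, what is PAGE_GUARD? As the name suggests, it is a guard page. Put plainly, it is the mechanism Windows uses to grow a thread's stack on demand. When RSP touches this region, the system raises a page fault and commits more memory pages. The guard moves down to take over the newly committed pages, and the pages RSP now occupies are handed over to the program. This is a benign mechanism. Once the guard can no longer take over any new pages, the stack has reached its limit, and the guard region drops its PAGE_GUARD flag to signal that its job is done. If RSP touches this last guard region at that point, a c00000fd exception is raised. The exception only says that RSP has entered the guard region; it does not mean the stack space is truly used up. That is the real reason the code is not c0000005, as the simple sketch shows.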
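The mechanism described above can be played through in a toy model. The Python sketch below is a deliberate simplification, not the kernel's actual algorithm: the class `ToyStack`, its rule of keeping one reserved page at the bottom, and the choice of 5 guard pages for thread 28 are all assumptions made for illustration (5 is borrowed from thread 139; the dump does not show thread 28's guard size before the crash). The addresses are this thread's reserved range from the dump, and 0x1c0 is the Child SP distance between two repetitions of the recursion in the !clrstack output.

```python
PAGE = 0x1000  # 4 KiB page


class StackOverflow(Exception):
    """Stands in for exception code c00000fd."""


class AccessViolation(Exception):
    """Stands in for exception code c0000005."""


class ToyStack:
    def __init__(self, reserve_start, base, guard_pages):
        self.reserve_start = reserve_start  # lowest reserved address
        self.base = base                    # StackBase (exclusive upper bound)
        self.guard_size = guard_pages * PAGE
        self.limit = base - PAGE            # StackLimit: lowest usable address
        self.guard_armed = True             # guard region sits right below limit

    def touch(self, addr):
        """Simulate the thread touching addr on its own stack."""
        if addr >= self.base or addr < self.reserve_start:
            raise AccessViolation(hex(addr))
        if addr >= self.limit:
            return  # ordinary committed memory
        guard_start = self.limit - self.guard_size
        if not self.guard_armed or addr < guard_start:
            raise AccessViolation(hex(addr))  # nothing usable down there
        new_limit = addr & ~(PAGE - 1)
        if new_limit - self.guard_size >= self.reserve_start + PAGE:
            self.limit = new_limit  # guard slides down, stack grows
            return
        # No room for a fresh guard: drop the flag and report the overflow.
        self.guard_armed = False
        self.limit = guard_start
        raise StackOverflow(hex(addr))


def recurse_until_fault(stack, frame_size):
    """Push fixed-size frames until touch() raises StackOverflow; return that address."""
    addr = stack.base - frame_size
    while True:
        try:
            stack.touch(addr)
        except StackOverflow:
            return addr
        addr -= frame_size


if __name__ == "__main__":
    stack = ToyStack(0x123d0000, 0x12450000, guard_pages=5)
    fault = recurse_until_fault(stack, 0x1c0)
    print(hex(fault), hex(stack.limit), fault - stack.limit)
```

Under these assumptions the model faults in the page starting at 0x123d5000, the same page as the faulting address 0x123d5fd0 in the dump, and it ends with the limit at 0x123d1000 and several committed pages still unused below the fault. Only a later touch of the last reserved page would be an access violation.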
With all that said, how do we verify it? It is very simple: extract StackLimit, StackBase and RSP.
0:028> r rsp
rsp=00000000123d5fb0
0:028> !teb
TEB at 000007fffff70000
    ExceptionList:        0000000000000000
    StackBase:            0000000012450000
    StackLimit:           00000000123d1000
0:028> !address -f:Stack

  BaseAddress   EndAddress+1  RegionSize    Type         State        Protect         Usage
  -------------------------------------------------------------------------------------------------
  0`123d0000    0`123d1000    0`00001000    MEM_PRIVATE  MEM_RESERVE                  Stack [~28; ]
  0`123d1000    0`12450000    0`0007f000    MEM_PRIVATE  MEM_COMMIT   PAGE_READWRITE  Stack [~28; ]
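The three values are easier to compare once they are turned into sizes. The short Python sketch below is only the arithmetic on the numbers printed above; the function name `stack_summary` and its dictionary keys are hypothetical.

```python
RSP = 0x123d5fb0            # from "r rsp"
STACK_BASE = 0x12450000     # from !teb
STACK_LIMIT = 0x123d1000    # from !teb
RESERVE_START = 0x123d0000  # lowest address in !address -f:Stack


def stack_summary(rsp, base, limit, reserve_start):
    """Return the sizes, in bytes, that matter for reading a stack overflow."""
    return {
        "total": base - reserve_start,        # whole stack reservation
        "committed": base - limit,            # committed, usable part
        "used": base - rsp,                   # consumed by frames so far
        "headroom": rsp - limit,              # committed bytes still below RSP
        "uncommitted": limit - reserve_start, # reserved but never committed
    }


if __name__ == "__main__":
    for name, value in stack_summary(RSP, STACK_BASE, STACK_LIMIT, RESERVE_START).items():
        print(f"{name:12} {value:#9x} ({value} bytes)")
```

The reservation comes out at 0x80000 bytes, which is the 512k mentioned earlier. RSP still sits 0x4fb0 bytes above StackLimit, the committed region no longer carries PAGE_GUARD, and only a single reserved page remains at the bottom. In other words, the thread stopped inside what used to be the guard region, not at the true end of the stack.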