
NSD Analysis

OK, so here goes a lengthy post for the admins amongst us on NSD analysis, an area I feel I know quite well. However, as you'll remember from my last post, this is based on publicly available information. NSD (Notes System Diagnostic) is the name given to software bundled with Domino to give a snapshot of what the Domino system is doing. The tool produces text files with enormous amounts of information and can be run manually, or will run automatically during a crash. Interested? (It may seem a bit dull, but the information here could save you a lot of time!)

Memory

Without memory nothing works. In the operating system, memory is divided into kernel and user address space. The kernel looks after the OS, hardware drivers and communications with the hardware. The user address space is where our applications run, and this includes Domino. So when Domino crashes it happens in the user address space, which means that Domino won't directly cause a blue screen of death! However, Domino may, for example, be attempting to read from or write to an area of disk in a way that causes a kernel memory error, remembering that the kernel must deal with the disk.

As we know, Domino is made up of a number of individual processes (nserver, nreplica, nrouter etc). Each of these processes does its own little bit to make up the server. Each process is doing a number of tasks at any one time; these are called threads. And within each thread there is a specific sequence of individual actions; these are called function calls.

Crashes (in a paragraph!)

Yeah, yeah, blah, blah, so what does this mean for me? Well, Domino is a fairly complex beast. Now and again a thread will try to use some memory which is reserved or in use by another process. This is a memory exception, and at this point everything goes a bit messy. A panic will be recorded in the thread, Domino will freeze everything that it is doing and the nsd task will run.
This will capture a snapshot of the environment immediately before the crash, storing the important results in \data\ibm_technical_support on whichever client or server has crashed.

Hangs

Hangs are a different beast and I'll not go into them much here. The easiest way to recognise a hang is to watch, in real time, the memory allocated to each Domino or Notes process. Remember from earlier that each process is made up of a number of threads. New threads are constantly starting and old threads are constantly stopping, so for each process you should see the memory allocated to that process changing with time. If you don't see changes in the memory allocated to a process then you possibly have a hang. A server can recover from a hang. A hung process may or may not prevent user sessions on a server. To troubleshoot a hang you need to run nsd 3 times at 5 minute intervals and then engage IBM Support to help resolve the issue.

Running NSD Manually

The important thing to remember when running NSD is that by default it will kill all the processes. So if you want to run it without killing the PIDs, check out the options by running nsd -?. The normal advice is to run nsd -detach, as that leaves the processes alone after running.

The file

The file produced always follows a common naming convention:

type_platform_systemname_date@time.log

Each platform has its own format, and to avoid making this post a record length I'm going to stick to Wintel.

Sections in Wintel NSDs

The first section is the header, with system information, a list of each Domino instance and a list of the processes running therein. You'll see some strange entries for "Found X processes, matched Y". If Y is one less than X then, providing you are running Domino as a Windows service, don't worry! nsd examines all processes from nserver down, and nservice is the parent of nserver. nsd sees that nservice is running but also sees that it isn't running under nserver, so it says, for example, found 22 matched 21.

Next we have the process table. From here you can see all processes on the server. Processes nsd recognises as Domino are indicated with ->. The position of [ denotes parent and child status, with indents denoting children. You'll see nsd as a child of whichever process crashed. So this section helps build a picture of what was running on the server.

Below this section there is a dump of each process, what files the process was using and then, importantly, a dump of each thread. On the thread which resulted in the crash the name will change from thread to fatal thread.
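
Because the crashing thread is relabelled as a fatal thread, it can be found mechanically even in a very large NSD file. The following is only a minimal sketch, assuming the marker text reads FATAL THREAD as in the excerpt shown in this post; the function names and the idea of scanning the log with Python are my own illustration, not part of nsd.

```python
def find_fatal_threads(lines):
    """Return (line_number, text) for every line that mentions a fatal thread."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Compare case-insensitively in case the marker's casing varies.
        if "FATAL THREAD" in line.upper():
            hits.append((number, line.strip()))
    return hits


def scan_nsd_file(path):
    """Scan an NSD log on disk; `path` is whatever file nsd produced for you."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_fatal_threads(handle)
```

Calling scan_nsd_file on a log gives the line numbers to jump to in your editor, which is handy when the file runs to tens of thousands of lines.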
The best option once you have looked through the process table is to search for FATAL.

Fatal Thread

So once you've searched for fatal you may see something like this:

### FATAL THREAD 39/83 [ nSERVER:07c0: 2764] ###
FP=0743f548, PC=60197cf3, SP=0743ebd0, stksize=2424
Exception code: c0000005 (ACCESS_VIOLATION)
############################################################
@[ 1] 060197cf3 nnotes._Panic@4+483 (7430016,496dae76,0,496dace8)
@[ 2] 0600018a4 nnotes._OSBBlockAddr@8+148 (1153f38,2000000,743f608,1)
@[ 3] 06000bd92 nnotes._CollectionNavigate@24+610 (0,743fc74,f,0)
@[ 4] 0600626cc nnotes._ReadEntries@68+2860 (4c5440e8,4cfb8dba,800f,1)
@[ 5] 0600b9f6f nnotes._NIFReadEntriesExt@72+351 (0,4cfb8dba,800f,1)
@[ 6] 010032d40 nserverl._ServerReadEntries@8+1424 (0,8d0c0035,4b64b5bc,4ae46dd6)
@[ 7] 0100191fc nserverl._DbServer@8+2284 (41b0383,cb740064,0,23696f8)
@[ 8] 01002b8c8 nserverl._WorkThreadTask@8+1576 (4711d68,0,3,563fb10)
@[ 9] 0100016cb nserverl._Scheduler@4+763 (0,563fb10,0,10ec334)
@[10] 06011e5e4 nnotes._ThreadWrapper@4+212 (0,10ec334,563fb10,0)
 [11] 077e887dd KERNEL32.GetModuleFileNameA+465

So what does all this mean? Well, the header block is fairly obvious. Lines 1 through 11 are the function calls that the thread performed. These are in sequence. For Wintel, 1 is the event closest to the crash and 11 the event furthest from the crash. So the server performed 11, 10, 9 and so on down to 2, then crashed, and 1 shows the panic.

So what does each line mean? The @ sign means nsd has annotated the line and recognised it as a Domino function. The hex value after the line number I assume to be the address (but someone may correct me). The bit before the full stop is the module the function lives in (nnotes, nserverl etc). The bit after the full stop and before the @ sign is the function call. So here the function calls are _ThreadWrapper, _Scheduler, _WorkThreadTask etc.

Call Stack

Listing all these functions we get the call stack:

_Panic
_OSBBlockAddr
_CollectionNavigate
_ReadEntries
_NIFReadEntriesExt
_ServerReadEntries
_DbServer
_WorkThreadTask
_Scheduler
_ThreadWrapper

Finding the fault
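
If you do this often, the call stack and the search terms described in this section can be pulled out of the fatal thread block with a few lines of Python. This is a rough sketch rather than anything official: the regular expression is inferred purely from the sample frames in this post, the function names are hypothetical, and wrapping each function name in * is my reading of the search tip.

```python
import re

# One frame line, e.g.
#   @[ 2] 0600018a4 nnotes._OSBBlockAddr@8+148 (1153f38,2000000,743f608,1)
FRAME_RE = re.compile(
    r"^\s*(?P<annotated>@?)\[\s*(?P<index>\d+)\]\s+"  # optional @, frame number
    r"(?P<address>[0-9a-fA-F]+)\s+"                   # hex value after the number
    r"(?P<module>\w+)\.(?P<function>\w+)"             # module.function
    r"(?:@\d+)?\+\d+"                                 # optional @N, then +offset
)


def parse_frames(text):
    """Return one dict per frame line, in file order (1 = nearest the crash)."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append({
                "index": int(match["index"]),
                "annotated": match["annotated"] == "@",
                "module": match["module"],
                "function": match["function"],
            })
    return frames


def call_stack(frames):
    """Function names of the frames nsd annotated with @, nearest the crash first."""
    return [frame["function"] for frame in frames if frame["annotated"]]


def search_pairs(stack):
    """Adjacent pairs in execution order, underscore stripped, wrapped in *."""
    terms = ["*" + name.lstrip("_") + "*" for name in reversed(stack)]
    return list(zip(terms, terms[1:]))


# The frames from the fatal thread shown in this post.
SAMPLE = """\
@[ 1] 060197cf3 nnotes._Panic@4+483 (7430016,496dae76,0,496dace8)
@[ 2] 0600018a4 nnotes._OSBBlockAddr@8+148 (1153f38,2000000,743f608,1)
@[ 3] 06000bd92 nnotes._CollectionNavigate@24+610 (0,743fc74,f,0)
@[ 4] 0600626cc nnotes._ReadEntries@68+2860 (4c5440e8,4cfb8dba,800f,1)
@[ 5] 0600b9f6f nnotes._NIFReadEntriesExt@72+351 (0,4cfb8dba,800f,1)
@[ 6] 010032d40 nserverl._ServerReadEntries@8+1424 (0,8d0c0035,4b64b5bc,4ae46dd6)
@[ 7] 0100191fc nserverl._DbServer@8+2284 (41b0383,cb740064,0,23696f8)
@[ 8] 01002b8c8 nserverl._WorkThreadTask@8+1576 (4711d68,0,3,563fb10)
@[ 9] 0100016cb nserverl._Scheduler@4+763 (0,563fb10,0,10ec334)
@[10] 06011e5e4 nnotes._ThreadWrapper@4+212 (0,10ec334,563fb10,0)
 [11] 077e887dd KERNEL32.GetModuleFileNameA+465
"""

if __name__ == "__main__":
    for first, second in search_pairs(call_stack(parse_frames(SAMPLE))):
        print(first, second)
```

The sketch keeps only the @ annotated frames, which matches the call stack listing in this post, so the KERNEL32 frame is left out of the search pairs.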

Well, now is the point where you have some data which can be searched in the IBM Knowledge Base. My only tip here: to ensure a good search, strip off the leading underscore and also add * to the beginning and end of each call stack entry. Take 2 items from the call stack list and search for them in turn, i.e. search for 11 and 10, then 10 and 9, and so on. You need to compare your call stack with any call stacks listed in the knowledge base.

Reference material

UNIX NSD Analysis: http://www-1.ibm.com/support/docview.wss?rs=0&uid=swg27003396
Nash!com presentation: http://www.nashcom.de/nshweb/pages/lotusphere.htm
Redbooks technote: http://www.redbooks.ibm.com/abstracts/tips0053.html?Open
LDD Article: http://www-128.ibm.com/developerworks/lotus/library/domino-server-crashes/

REMEMBER IBM ARE THE EXPERTS

As a footnote, please remember that locked in a deep vault somewhere in IBM is a team of people who spend all day every day looking at NSDs (and even having fun). They are experts. If you need to examine an NSD, I'd recommend that before you start you log the call with IBM. While you are waiting for them to get back to you, have a go at resolving the NSD yourself.
