
Network Monitoring:

The following is a guide to monitoring network interfaces and the network from the Solaris operating system. Note: this document is not written by Sun. Brendan Gregg, 09-Oct-2005, version 0.70.

Is your network busy? The network often gets blamed when things are performing poorly, and perhaps this is correct - your network interfaces may be running at 100% utilisation. What command will tell you how busy the network interface is? Many sysadmins suggest using netstat -i to find out,

$ netstat -i 1
    input   hme0      output            input  (Total)     output
packets errs  packets  errs  colls   packets   errs  packets   errs  colls
70820498  6   73415337  0     0      113173825  6    115768664  0     0
242       0   149       0     0      246        0    153        0     0
1068      0   552       0     0      1075       0    559        0     0
[...]

This shows packet counts per second; the first line is the summary since boot. How many packets would mean the interface is busy? 100/sec? 1000/sec? What we do know is the speed of the network interface - for this one it is 100 Mb/sec. However, we have no way of telling the size of the packets - they may be 56 byte or 1500 byte packets. This means that the packet count is not useful, except perhaps as a yardstick of activity. What we really need is Kb/sec...

Contents:

By System - monitoring network interface usage.
    netstat - netstat, the Solaris kitchen sink network tool.
    kstat - the Kernel Statistics framework.
    nx.se - SE Toolkit's nx.se.
    nicstat - nicstat for network interface utilisation.
    SNMP - SNMP based tools include MRTG.

Across Network - analysing external network performance.
    ping - the classic network probe tool.
    traceroute - timing hops to destination.
    TTCP - creates test load between two hosts.
    pathchar - traceroute with throughputs, an amazing tool.
    ntop - comprehensive statistics for snooped traffic.

By Process - determining the process responsible for traffic.
    tcptop - TCP PID summary.
    tcpsnoop - watch TCP traffic live with PID.

By System

How to monitor network usage for the entire system, usually by network interface.

Netstat:
The Solaris netstat command is where a number of different network status programs have been dropped - it's the kitchen sink of network tools. netstat -i, as mentioned earlier, only prints packet counts. We don't know if they are big packets or small packets, and we can't use them to accurately determine how utilised the network interface is. There are other performance monitoring tools that plot this as a "be all and end all" value - this is wrong.

netstat -s dumps various network related counters from Kstat, the Kernel Statistics framework. This shows that Kstat does track at least some details in terms of bytes,

$ netstat -s | grep Bytes
        tcpOutDataSegs      =37367847     tcpOutDataBytes     =166744792
        tcpRetransSegs      =  153437     tcpRetransBytes     =72298114
        tcpInAckSegs        =25548715     tcpInAckBytes       =148658291
        tcpInInorderSegs    =35290928     tcpInInorderBytes   =3637819567
        tcpInUnorderSegs    =  324309     tcpInUnorderBytes   =406912945
        tcpInDupSegs        =  152795     tcpInDupBytes       =73998299
        tcpInPartDupSegs    =    7896     tcpInPartDupBytes   =5821485
        tcpInPastWinSegs    =      38     tcpInPastWinBytes   =971347352

However the byte values above are for TCP in total, including loopback traffic that never travelled via the network interfaces. netstat -k on Solaris 9 and earlier dumped all Kstat counters,

$ netstat -k | awk '/^hme0/,/^$/'
hme0:
ipackets 70847004 ierrors 6 opackets 73438793 oerrors 0 collisions 0
defer 0 framing 0 crc 0 sqe 0 code_violations 0 len_errors 0
ifspeed 100000000 buff 0 oflo 0 uflo 0 missed 6

tx_late_collisions 0 retry_error 0 first_collisions 0 nocarrier 0
nocanput 0 allocbfail 0 runt 0 jabber 0 babble 0 tmd_error 0
tx_late_error 0 rx_late_error 0 slv_parity_error 0 tx_parity_error 0
rx_parity_error 0 slv_error_ack 0 tx_error_ack 0 rx_error_ack 0
tx_tag_error 0 rx_tag_error 0 eop_error 0 no_tmds 0 no_tbufs 0
no_rbufs 0 rx_late_collisions 0 rbytes 289601566 obytes 358304357
multircv 558 multixmt 73411 brdcstrcv 3813836 brdcstxmt 1173700
norcvbuf 0 noxmtbuf 0 newfree 0 ipackets64 70847004
opackets64 73438793 rbytes64 47534241822 obytes64 51897911909
align_errors 0 fcs_errors 0 sqe_errors 0 defer_xmts 0 ex_collisions 0
macxmt_errors 0 carrier_errors 0 toolong_errors 0 macrcv_errors 0
link_duplex 0 inits 31 rxinits 0 txinits 0 dmarh_inits 0 dmaxh_inits 0
link_down_cnt 0 phy_failures 0 xcvr_vendor 524311 asic_rev 193
link_up 1

Great - so bytes by network interface are indeed tracked. However netstat -k was an undocumented switch that has now been dropped in Solaris 10. That's ok, as there are better ways to get to Kstat, including the C library that tools such as vmstat use - libkstat.

Kstat:
The Solaris Kernel Statistics framework does track network usage, and as of Solaris 8 there has been a /usr/bin/kstat command to fetch Kstat details,

$ kstat -p 'hme:0:hme0:*bytes64' 1
hme:0:hme0:obytes64     51899673435
hme:0:hme0:rbytes64     47536009231
hme:0:hme0:obytes64     51899673847
hme:0:hme0:rbytes64     47536009709
[...]

Now we just need a tool to present this in a more meaningful way.
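As an illustration of what such a tool does, the following is a minimal Python sketch (illustrative only - nicstat, covered below, does this properly in C via libkstat): it parses two samples of the `kstat -p` output above and turns the byte-counter deltas into Kb/sec.

```python
# Sketch: turn two samples of `kstat -p 'hme:0:hme0:*bytes64'` output
# into Kb/sec figures. The sample strings reuse the counters shown above.

def parse_kstat(text):
    """Parse `kstat -p` lines ("module:instance:name:statistic value") into a dict."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key.split(":")[-1]] = int(value)   # keep just the statistic name
    return stats

def kb_per_sec(sample1, sample2, interval):
    """Kb read and written per second between two samples `interval` seconds apart."""
    return {stat: (sample2[stat] - sample1[stat]) / 1024.0 / interval
            for stat in ("rbytes64", "obytes64")}

sample1 = parse_kstat("hme:0:hme0:obytes64 51899673435\nhme:0:hme0:rbytes64 47536009231")
sample2 = parse_kstat("hme:0:hme0:obytes64 51899673847\nhme:0:hme0:rbytes64 47536009709")
print(kb_per_sec(sample1, sample2, 1))
```

With the two samples one second apart, this reports roughly 0.47 Kb/sec read and 0.40 Kb/sec written - a near-idle interface.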

nx.se:
The SE Toolkit provides a language, SymbEL, that lets us write our own performance monitoring tools. It also contains a collection of example tools, including nx.se, which lets us identify network utilisation,

$ se nx.se 1
Current tcp RtoMin is 400, interval 1, start Sun Oct  9 10:36:42 2005

10:36:43  Iseg/s  Oseg/s  InKB/s  OuKB/s  Rst/s  Atf/s   Ret%  Icn/s  Ocn/s
tcp        841.6     4.0   74.98    0.27   0.00   0.00    0.0   0.00   0.00
Name      Ipkt/s  Opkt/s  InKB/s  OuKB/s IErr/s OErr/s  Coll% NoCP/s Defr/s
hme0       845.5   420.8  119.91   22.56  0.000  0.000    0.0   0.00   0.00
10:36:44  Iseg/s  Oseg/s  InKB/s  OuKB/s  Rst/s  Atf/s   Ret%  Icn/s  Ocn/s
tcp        584.2     5.0   77.97    0.60   0.00   0.00    0.0   0.00   0.00
Name      Ipkt/s  Opkt/s  InKB/s  OuKB/s IErr/s OErr/s  Coll% NoCP/s Defr/s
hme0       579.2   297.1  107.95   16.16  0.000  0.000    0.0   0.00   0.00
[...]

Having KB/s values lets us determine exactly how busy our network interfaces are. There is other useful information printed above, including Coll% - collisions, NoCP/s - no can puts, and Defr/s - defers, which may be evidence of network saturation.

Nicstat:
nicstat is a freeware tool written in C to print out network utilisation and saturation by interface,

$ nicstat 1
    Time   Int   rKb/s   wKb/s   rPk/s   wPk/s    rAvs    wAvs  %Util    Sat
10:48:30  hme0    4.02    4.39    6.14    6.36  670.73  706.50   0.07   0.00
10:48:31  hme0    0.29    0.50    3.00    4.00   98.00  127.00   0.01   0.00
10:48:32  hme0    1.35    4.23   14.00   15.00   98.79  289.00   0.05   0.00
10:48:33  hme0   67.73   19.08  426.00  207.00  162.81   94.39   0.71   0.00
10:48:34  hme0  315.22  128.91 1249.00  723.00  258.44  182.58   3.64   0.00
10:48:35  hme0  529.96   67.53 2045.00 1046.00  265.37   66.11   4.89   0.00
10:48:36  hme0  454.14   62.16 2294.00 1163.00  202.72   54.73   4.23   0.00
10:48:37  hme0   93.55   15.78  583.00  295.00  164.31   54.77   0.90   0.00
10:48:38  hme0   74.84   32.41  516.00  298.00  148.52  111.38   0.88   0.00
10:48:39  hme0    0.76    4.17    7.00    9.00  111.43  474.00   0.04   0.00
[...]

Fantastic. There is also an older Perl version of nicstat available. The following are the switches available from version 0.90 of the C version,

$ nicstat -h
USAGE: nicstat [-hsz] [-i int[,int...]] | [interval [count]]
        -h                 # help
        -i interface       # track interface only
        -s                 # summary output
        -z                 # skip zero value lines
   eg,
        nicstat            # print summary since boot only
        nicstat 1          # print every 1 second
        nicstat 1 5        # print 5 times every 1 second
        nicstat -z 1       # print every 1 second, skip zero lines
        nicstat -i hme0 1  # print hme0 only every 1 second

The utilisation measurement is based on the current throughput divided by the maximum speed of the interface (if available via Kstat). The saturation measurement is a value that reflects errors due to saturation (no can puts, etc).
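The utilisation arithmetic can be sketched as below. This is an assumption about how nicstat computes %Util (throughput over interface speed), not its actual C source, but it does reproduce the sample output above: the 10:48:35 line's 529.96 rKb/s + 67.53 wKb/s on a 100 Mb/s interface works out to 4.89%.

```python
# Sketch of nicstat-style utilisation: bytes moved per second as a
# percentage of the interface's maximum byte rate. Kstat reports
# ifspeed in bits/sec, e.g. 100000000 for hme0.

def util_pct(rbytes_per_sec, wbytes_per_sec, ifspeed_bits):
    """Percent utilisation; read and write are summed, as on a shared medium."""
    max_bytes_per_sec = ifspeed_bits / 8.0
    return 100.0 * (rbytes_per_sec + wbytes_per_sec) / max_bytes_per_sec

# The 10:48:35 sample above: 529.96 rKb/s + 67.53 wKb/s at 100 Mb/s
print(round(util_pct(529.96 * 1024, 67.53 * 1024, 100_000_000), 2))   # -> 4.89
```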

SNMP:
It's worth mentioning that there is also useful data available via SNMP, which is used by software such as MRTG. Here we use Net-SNMP's snmpget to fetch some interface values,

$ snmpget -v1 -c public localhost ifOutOctets.2 ifInOctets.2
IF-MIB::ifOutOctets.2 = Counter32: 10016768
IF-MIB::ifInOctets.2 = Counter32: 11932165

These values are the outbound and inbound byte counts for our main interface. In Solaris 10 a full description of the IF-MIB values can be found in /etc/sma/snmp/mibs/IF-MIB.txt.

Across Network

Analysing the performance of the external network.

Ping:
ping is the classic network probe tool,

$ ping -s mars
PING mars: 56 data bytes
64 bytes from mars (192.168.1.1): icmp_seq=0. time=0.623 ms
64 bytes from mars (192.168.1.1): icmp_seq=1. time=0.415 ms
64 bytes from mars (192.168.1.1): icmp_seq=2. time=0.464 ms
^C
----mars PING Statistics----
3 packets transmitted, 3 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 0.415/0.501/0.623/0.11

So I discover that mars is up, and it responds within 1 ms. Solaris 10 enhanced ping to print 3 decimal places for the times. ping is handy to see if a host is up, but that's about all. Some people use it to test whether their application server is ok. Hmm. ICMP is handled in the kernel without needing to call a user-land process, so it's possible that a server will ping ok while the application either responds slowly or not at all.

Traceroute:
traceroute sends a series of UDP packets with an increasing TTL, and by watching the ICMP time exceeded replies it can discover the hops to a host (assuming the hops actually decrement the TTL),

$ traceroute www.sun.com
traceroute: Warning: Multiple interfaces found; using 260.241.10.2 @ hme0:1
traceroute to www.sun.com (209.249.116.195), 30 hops max, 40 byte packets
 1  tpggate (260.241.10.1)  21.224 ms  25.933 ms  25.281 ms
 2  172.31.217.14 (172.31.217.14)  49.565 ms  27.736 ms  25.297 ms
 3  syd-nxg-ero-zeu-2-gi-3-0.tpgi.com.au (220.244.229.9)  25.454 ms  22.066 ms  26.237 ms
 4  syd-nxg-ibo-l3-ge-0-2.tpgi.com.au (220.244.229.132)  42.216 ms  *  37.675 ms
 5  220-245-178-199.tpgi.com.au (220.245.178.199)  40.727 ms  38.291 ms  41.468 ms
 6  syd-nxg-ibo-ero-ge-1-0.tpgi.com.au (220.245.178.193)  37.437 ms  38.223 ms  38.373 ms
 7  Gi11-2.gw2.syd1.asianetcom.net (202.147.41.193)  24.953 ms  25.191 ms  26.242 ms
 8  po2-1.gw1.nrt4.asianetcom.net (202.147.55.110)  155.811 ms  169.330 ms  153.217 ms
 9  Abovenet.POS2-2.gw1.nrt4.asianetcom.net (203.192.129.42)  150.477 ms  157.173 ms  *
10  so-6-0-0.mpr3.sjc2.us.above.net (64.125.27.54)  240.077 ms  239.733 ms  244.015 ms
11  so-0-0-0.mpr4.sjc2.us.above.net (64.125.30.2)  224.560 ms  228.681 ms  221.149 ms
12  64.125.27.102 (64.125.27.102)  241.229 ms  235.481 ms  238.868 ms
13  * *^C

The times may give me some idea where a network bottleneck is. We must also remember that networks are dynamic, and this may not be the permanent path to that host.

TTCP:
Test TCP is a freeware tool to test the throughput between two hosts. It needs to be run on both the source and destination, and there is a Java version of TTCP which will run on many different operating systems. Beware: it will flood the network with traffic to perform its test.

The following is run on one host as the receiver. The options used make the test run for a reasonable duration - around 60 seconds,

$ java ttcp -r -n 65536
Receive: buflen= 8192 nbuf= 65536 port= 5001

Then the following was run on the second host as the transmitter,

$ java ttcp -t jupiter -n 65536
Transmit: buflen= 8192 nbuf= 65536 port= 5001
Transmit connection: Socket[addr=jupiter/192.168.1.5,port=5001,localport=46684].
Transmit: 536870912 bytes in 46010 milli-seconds = 11668.57 KB/sec (93348.56 Kbps).

This shows the speed between these hosts for this test is around 11.6 Megabytes per second.
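The figure ttcp reports follows directly from the transfer size and duration; note that it uses decimal units here (1 KB = 1000 bytes, 1 Kbit = 1000 bits), so bytes per millisecond equals its "KB/sec". A quick sketch of the arithmetic:

```python
# Sketch of the throughput arithmetic behind the ttcp report above.
# ttcp's "KB/sec" is decimal (1 KB = 1000 bytes), so bytes/ms == KB/sec.

def throughput(bytes_moved, millis):
    kb_per_sec = bytes_moved / millis    # bytes per ms == decimal KB per sec
    kbps = kb_per_sec * 8                # kilobits per second
    return round(kb_per_sec, 2), round(kbps, 2)

# The transmit run above: 536870912 bytes in 46010 milliseconds
print(throughput(536870912, 46010))      # -> (11668.57, 93348.56)
```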

Pathchar:
After writing traceroute, Van Jacobson went on to write pathchar - an amazing tool that identifies network bottlenecks. It operates like traceroute, but rather than printing the response time to each hop, it prints the bandwidth between each pair of hops.

# pathchar 192.168.1.10
pathchar to 192.168.1.1 (192.168.1.1)
 doing 32 probes at each of 64 to 1500 by 32
 0 localhost
 |   30 Mb/s,   79 us (562 us)
 1 neptune.drinks.com (192.168.2.1)
 |   44 Mb/s,   195 us (1.23 ms)
 2 mars.drinks.com (192.168.1.1)
2 hops, rtt 547 us (1.23 ms), bottleneck  30 Mb/s, pipe 7555 bytes

This tool works by sending "shaped" traffic over a long interval and carefully measuring the response times. It doesn't flood the network like TTCP does.

Ntop:
ntop is a tool that sniffs network traffic and provides comprehensive reports via a web interface. It is also available on sunfreeware.com.

# ntop
ntop v.1.3.1 MT [sparc-sun-solaris2.8] listening on [hme0,hme0:0,hme0:1].
Copyright 1998-2000 by Luca Deri <deri@ntop.org>
Get the freshest ntop from http://www.ntop.org/
Initialising...
Loading plugins (if any)...
WARNING: Unable to find the plugins/ directory.
Waiting for HTTP connections on port 3000...
Sniffying...

Now you connect via a web browser to localhost:3000.

By Process

How to monitor network usage by process. The recent addition of DTrace to Solaris 10 has allowed the creation of the first network-by-process tools.

Tcptop:
This is a DTrace based tool from the freeware DTraceToolkit which gives a summary of TCP traffic by system and by process,

# tcptop 10
Sampling... Please wait.
2005 Jul  5 04:55:25,  load: 1.11,  TCPin:      2 Kb,  TCPout:    110 Kb

  UID    PID LADDR           LPORT FADDR           FPORT      SIZE NAME
  100  20876 192.168.1.5     36396 192.168.1.1        79      1160 finger
  100  20875 192.168.1.5     36395 192.168.1.1        79      1160 finger
  100  20878 192.168.1.5     36397 192.168.1.1        23      1303 telnet
  100  20877 192.168.1.5       859 192.168.1.1       514    115712 rcp

This version of tcptop examines sessions that were newly connected while tcptop was running. In the above we can see PID and SIZE columns; this is tracking TCP traffic that has travelled on external interfaces. The TCPin and TCPout summaries also track localhost TCP traffic.
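tcptop itself is a DTrace script, but its aggregation step is simple: sum traffic sizes per (uid, pid, connection) tuple over the sampling interval. The following is a toy Python sketch of that step only (the event tuples are made-up sample data, not DTrace output):

```python
# Toy sketch (not DTrace) of tcptop-style aggregation: sum traffic
# sizes per (uid, pid, laddr, lport, faddr, fport) key, then report
# sorted by total size. The events below are invented examples.

from collections import defaultdict

def summarise(events):
    """events: iterable of (uid, pid, laddr, lport, faddr, fport, size, name)."""
    totals = defaultdict(int)
    names = {}
    for uid, pid, laddr, lport, faddr, fport, size, name in events:
        key = (uid, pid, laddr, lport, faddr, fport)
        totals[key] += size
        names[key] = name
    # smallest totals first, like tcptop's ascending SIZE column
    return sorted((size, key, names[key]) for key, size in totals.items())

events = [
    (100, 20876, "192.168.1.5", 36396, "192.168.1.1", 79, 580, "finger"),
    (100, 20876, "192.168.1.5", 36396, "192.168.1.1", 79, 580, "finger"),
    (100, 20877, "192.168.1.5", 859, "192.168.1.1", 514, 115712, "rcp"),
]
for size, key, name in summarise(events):
    print(size, name)
```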

Tcpsnoop:
This is a DTrace based tool from the DTraceToolkit which prints TCP packets live by process,

# tcpsnoop
  UID    PID LADDR           LPORT DR RADDR           RPORT  SIZE CMD
  100  20892 192.168.1.5     36398 -> 192.168.1.1        79    54 finger
  100  20892 192.168.1.5     36398 <- 192.168.1.1        79    66 finger
  100  20892 192.168.1.5     36398 -> 192.168.1.1        79    54 finger
  100  20892 192.168.1.5     36398 -> 192.168.1.1        79    56 finger
  100  20892 192.168.1.5     36398 <- 192.168.1.1        79    54 finger
  100  20892 192.168.1.5     36398 <- 192.168.1.1        79   606 finger
  100  20892 192.168.1.5     36398 -> 192.168.1.1        79    54 finger
  100  20892 192.168.1.5     36398 <- 192.168.1.1        79    54 finger
  100  20892 192.168.1.5     36398 -> 192.168.1.1        79    54 finger
  100  20892 192.168.1.5     36398 -> 192.168.1.1        79    54 finger
  100  20892 192.168.1.5     36398 <- 192.168.1.1        79    54 finger
    0    242 192.168.1.5        23 <- 192.168.1.1     54224    54 inetd
    0    242 192.168.1.5        23 -> 192.168.1.1     54224    54 inetd
[...]

This version of tcpsnoop examines sessions that were newly connected while tcpsnoop was running. In the above we can see a PID column and packet details; this is tracking TCP traffic that has travelled on external interfaces.

Solaris Performance Monitoring & Tuning - iostat, vmstat, netstat

Introduction to iostat, vmstat and netstat

This document is primarily written with reference to Solaris performance monitoring and tuning, but these tools are available in other Unix variants also, with slight syntax differences. iostat, vmstat and netstat are the three most commonly used tools for performance monitoring. They come built in with the operating system and are easy to use. iostat stands for input/output statistics and reports statistics for I/O devices such as disk drives. vmstat gives the statistics for virtual memory, and netstat gives the network statistics. The following paragraphs describe these tools and their usage for performance monitoring.

Table of contents:
1. iostat - syntax, example, results and solutions
2. vmstat - syntax, example, results and solutions
3. netstat - syntax, example, results and solutions

Input/Output Statistics (iostat)

iostat reports terminal and disk I/O activity and CPU utilization. The first line of output is for the time period since boot, and each subsequent line is for the prior interval. The kernel maintains a number of counters to keep track of the values. iostat's activity class options default to tdc (terminal, disk, and CPU). If any other option is specified, this default is completely overridden, i.e. iostat -d will report only statistics about the disks.

Syntax:
Basic syntax is: iostat [option] interval count
option lets you specify the device class for which information is needed, like disk, cpu or terminal (-d, -c, -t or -tdc). The -x option gives extended statistics.
interval is the time period in seconds between two samples; iostat 4 will give data at 4 second intervals.
count is the number of times the data is needed; iostat 4 5 will give data at 4 second intervals 5 times.

Example:
$ iostat -xtc 5 2
                          extended disk statistics           tty          cpu
disk      r/s  w/s Kr/s Kw/s wait actv svc_t  %w  %b   tin tout   us sy wt id
sd0       2.6  3.0 20.7 22.7  0.1  0.2  59.2   6  19     0   84    3 85 11  0
sd1       4.2  1.0 33.5  8.0  0.0  0.2  47.2   2  23
sd2       0.0  0.0  0.0  0.0  0.0  0.0   0.0   0   0
sd3      10.2  1.6 51.4 12.8  0.1  0.3  31.2   3  31
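As an aside, the busy-disk rule of thumb that this document applies to such output (a disk is suspect when %b exceeds 5 percent and svc_t exceeds 30 ms, as discussed under Results and Solutions below) can be sketched in a few lines. The disk dicts are sample values, not live iostat output:

```python
# Sketch: flag disks that break both rule-of-thumb thresholds from
# this document - more than 5% busy (%b) and an average service time
# (svc_t) over 30 ms. The sample data is hard-coded for illustration.

def busy_disks(disks, busy_pct=5, svc_ms=30):
    """Return names of disks exceeding both the %b and svc_t thresholds."""
    return [d["disk"] for d in disks
            if d["b_pct"] > busy_pct and d["svc_t"] > svc_ms]

sample = [
    {"disk": "sd0", "b_pct": 19, "svc_t": 59.2},
    {"disk": "sd1", "b_pct": 23, "svc_t": 47.2},
    {"disk": "sd2", "b_pct": 0,  "svc_t": 0.0},
    {"disk": "sd3", "b_pct": 31, "svc_t": 31.2},
]
print(busy_disks(sample))   # sd0, sd1 and sd3 break both thresholds
```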

The fields have the following meanings:
disk    name of the disk
r/s     reads per second
w/s     writes per second
Kr/s    kilobytes read per second
Kw/s    kilobytes written per second
wait    average number of transactions waiting for service (queue length)
actv    average number of transactions actively being serviced (removed from the queue but not yet completed)
svc_t   average service time, in milliseconds
%w      percent of time there are transactions waiting for service (queue non-empty)
%b      percent of time the disk is busy (transactions in progress)

Results and Solutions:
The values to look at from the iostat output are:
* reads/writes per second (r/s, w/s)
* percentage busy (%b)
* service time (svc_t)
If a disk shows consistently high reads/writes, the percentage busy (%b) of the disk is greater than 5 percent, and the average service time (svc_t) is greater than 30 milliseconds, then one of the following actions needs to be taken:
1.) Tune the application to use disk I/O more efficiently, by modifying the disk queries and using the available cache facilities of the application servers.
2.) Spread the file system of the disk onto two or more disks, using the disk striping feature of a volume manager / DiskSuite etc.
3.) Increase the system parameter value for the inode cache, ufs_ninode, which is the number of inodes to be held in memory. Inodes are cached globally (for UFS), not on a per-file-system basis.
4.) Move the file system to another, faster disk/controller, or replace the existing disk/controller with a faster one.

Virtual Memory Statistics (vmstat)

vmstat reports virtual memory statistics of process, virtual memory, disk, trap, and CPU activity. On multi-CPU systems, vmstat averages the number of CPUs into the output. Without options, vmstat displays a one-line summary of the virtual memory activity since the system was booted.

Syntax:
Basic syntax is: vmstat [option] interval count
option lets you specify the type of information needed, such as paging (-p), cache (-c) or interrupts (-i). If no option is specified, information about processes, memory, paging, disk, interrupts and cpu is displayed.
interval is the time period in seconds between two samples; vmstat 4 will give data at 4 second intervals.
count is the number of times the data is needed; vmstat 4 5 will give data at 4 second intervals 5 times.

Example:
The following command displays a summary of what the system is doing every five seconds.
example% vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy  cs us sy id
 0 0 0  11456  4120   1  41 19  1  3  0  2  0  4  0  0   48  112 130  4 14 82
 0 0 1  10132  4280   0   4 44  0  0  0  0  0 23  0  0  211  230 144  3 35 62
 0 0 1  10132  4616   0   0 20  0  0  0  0  0 19  0  0  150  172 146  3 33 64
 0 0 1  10132  5292   0   0  9  0  0  0  0  0 21  0  0  165  105 130  1 21 78

The fields of vmstat's display are:

procs
  r     in run queue
  b     blocked for resources (I/O, paging etc.)
  w     swapped

memory (in Kbytes)
  swap  amount of swap space currently available
  free  size of the free list

page (in units per second)
  re    page reclaims - see the -S option for how this field is modified
  mf    minor faults - see the -S option for how this field is modified
  pi    kilobytes paged in
  po    kilobytes paged out
  fr    kilobytes freed
  de    anticipated short-term memory shortfall (Kbytes)
  sr    pages scanned by clock algorithm

disk (operations per second)
  There are slots for up to four disks, labeled with a single letter and number. The letter indicates the type of disk (s = SCSI, i = IPI, etc); the number is the logical unit number.

faults
  in    (non clock) device interrupts
  sy    system calls
  cs    CPU context switches

cpu (breakdown of percentage usage of CPU time; on multiprocessors this is an average across all processors)
  us    user time
  sy    system time
  id    idle time

Results and Solutions from vmstat

A. CPU issues
The following columns have to be watched to determine if there is any CPU issue:
1. Processes in the run queue (procs r)
2. User time (cpu us)
3. System time (cpu sy)
4. Idle time (cpu id)

 procs     cpu
 r b w   us sy id
 0 0 0    4 14 82
 0 0 1    3 35 62
 0 0 1    3 33 64
 0 0 1    1 21 78

Problem symptoms:
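The rules of thumb that these columns feed (detailed under the problem symptoms and memory issues that follow) can be sketched as a couple of small checks. The function names and thresholds simply encode this document's guidance:

```python
# Sketch of the vmstat rules of thumb from this document: CPU pressure
# when the run queue exceeds the CPU count (severe at 4x), or when idle
# sits at 0 with system time at double user time; memory pressure when
# the scan rate stays above 200 pages/sec.

def cpu_pressure(runq, us, sy, idle, ncpus):
    if runq > 4 * ncpus:
        return "severe CPU shortage"
    if runq > ncpus or (idle == 0 and sy >= 2 * us):
        return "CPU shortage"
    return "ok"

def memory_pressure(sr):
    return "memory shortage" if sr > 200 else "ok"

print(cpu_pressure(runq=9, us=20, sy=40, idle=0, ncpus=2))
print(memory_pressure(sr=250))
```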

A.) Number of processes in run queue
1.) If the number of processes in the run queue (procs r) is consistently greater than the number of CPUs on the system, the system will slow down, as there are more processes than available CPUs.
2.) If this number is more than four times the number of available CPUs, the system is facing a shortage of CPU power and the processes on the system will be greatly slowed down.
3.) If the idle time (cpu id) is consistently 0 and the system time (cpu sy) is double the user time (cpu us), the system is facing a shortage of CPU resources.

Resolution:
Resolution of these kinds of issues involves tuning the application procedures to make efficient use of the CPUs, and, as a last resort, increasing the CPU power or adding more CPUs to the system.

B. Memory issues
Memory bottlenecks are determined by the scan rate (sr). The scan rate is the number of pages scanned by the clock algorithm per second. If the scan rate (sr) is continuously over 200 pages per second then there is a memory shortage.

Resolution:
1. Tune the applications & servers to make efficient use of memory and cache.
2. Increase system memory.
3. Implement priority paging in pre Solaris 8 versions by adding the line "set priority_paging=1" to /etc/system. Remove this line if upgrading from Solaris 7 to 8 and retaining the old /etc/system file.

Network Statistics (netstat)

netstat displays the contents of various network-related data structures, depending on the options selected.

Syntax:
netstat [options]; multiple options can be given at one time.

Options:
-a    displays the state of all sockets.
-r    shows the system routing tables.
-i    gives statistics on a per-interface basis.
-m    displays information from the network memory buffers. On Solaris, this shows statistics for STREAMS.
-p [proto]    retrieves statistics for the specified protocol.
-s    shows per-protocol statistics. (Some implementations allow -ss to remove fields with a value of 0 (zero) from the display.)
-D    displays the status of DHCP configured interfaces.
-n    do not look up hostnames; display only IP addresses.
-d    (with -i) displays dropped packets per interface.
-I [interface]    retrieves information about only the specified interface.
-v    be verbose.
interval    number of seconds between each continuous display of statistics.

Example:
$ netstat -rn

Routing Table: IPv4
  Destination          Gateway            Flags  Ref    Use    Interface
-------------------- --------------------  ----- ----- ------  ---------
192.168.1.0          192.168.1.11          U         1   1444  le0
224.0.0.0            192.168.1.11          U         1      0  le0
default              192.168.1.1           UG        1  68276
127.0.0.1            127.0.0.1             UH        1  10497  lo0

This shows the output on a Solaris machine whose IP address is 192.168.1.11, with a default router at 192.168.1.1.

Results and Solutions:

A.) Network availability
The command above is mostly useful in troubleshooting network accessibility issues. When the outside network is not accessible from a machine, check the following:
1. whether the default router IP address is correct;
2. whether you can ping it from your machine.
3. If the router address is incorrect it can be changed with the route add command. See man route for more information.
route command examples:
$ route add default
$ route add 192.0.2.32
If the router address is correct but you still can't ping it, there may be a network cable/hub/switch problem, and you have to try and eliminate the faulty component.

B.) Network response
$ netstat -i
Name  Mtu   Net/Dest   Address     Ipkts  Ierrs  Opkts  Oerrs  Collis  Queue
lo0   8232  loopback   localhost   77814  0      77814  0      0       0
hme0  1500  server1    server1     10658  3      48325  0      279257  0

This option is used to diagnose network problems when the connectivity is there but the response is slow. Values to look at:
* Collisions (Collis)
* Output packets (Opkts)
* Input errors (Ierrs)
* Input packets (Ipkts)
The above values give the information to work out:
i. Network collision rate, as follows:
Network collision rate = Output collision counts / Output packets
A network-wide collision rate greater than 10 percent will indicate:
* an overloaded network,
* a poorly configured network, or
* hardware problems.

ii. Input packet error rate, as follows:
Input packet error rate = Ierrs / Ipkts
If the input error rate is high (over 0.25 percent), the host is dropping packets. Hub/switch cables etc. need to be checked for potential problems.

C. Network socket & TCP connection state
netstat gives important information about network sockets and TCP state. This is very useful in finding out the open, closed and waiting network TCP connections. The network states returned by netstat are the following:

CLOSED       ---- Closed. The socket is not being used.
LISTEN       ---- Listening for incoming connections.
SYN_SENT     ---- Actively trying to establish connection.
SYN_RECEIVED ---- Initial synchronization of the connection under way.
ESTABLISHED  ---- Connection has been established.
CLOSE_WAIT   ---- Remote shut down; waiting for the socket to close.
FIN_WAIT_1   ---- Socket closed; shutting down connection.
CLOSING      ---- Closed, then remote shutdown; awaiting acknowledgement.
LAST_ACK     ---- Remote shut down, then closed; awaiting acknowledgement.
FIN_WAIT_2   ---- Socket closed; waiting for shutdown from remote.
TIME_WAIT    ---- Wait after close for remote shutdown retransmission.

Example:
# netstat -a

   Local Address        Remote Address       Swind Send-Q  Rwind Recv-Q   State
-------------------- -------------------- ------- ------ ------ ------ -------
      *.*                  *.*                   0      0  24576      0 IDLE
      *.22                 *.*                   0      0  24576      0 LISTEN
      *.22                 *.*                   0      0  24576      0 LISTEN
      *.*                  *.*                   0      0  24576      0 IDLE
      *.32771              *.*                   0      0  24576      0 LISTEN
      *.4045               *.*                   0      0  24576      0 LISTEN
      *.25                 *.*                   0      0  24576      0 LISTEN
      *.5987               *.*                   0      0  24576      0 LISTEN
      *.898                *.*                   0      0  24576      0 LISTEN
      *.32772              *.*                   0      0  24576      0 LISTEN
      *.32775              *.*                   0      0  24576      0 LISTEN
      *.32776              *.*                   0      0  24576      0 LISTEN
      *.*                  *.*                   0      0  24576      0 IDLE
192.168.1.184.22     192.168.1.186.50457    41992      0  24616      0 ESTABLISHED
192.168.1.184.22     192.168.1.186.56806    38912      0  24616      0 ESTABLISHED
192.168.1.184.22     192.168.1.183.58672    18048      0  24616      0 ESTABLISHED
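Returning to the two netstat -i ratios described above, the arithmetic can be sketched as below. The collision-rate call uses made-up counter values for illustration; the error-rate call uses the Ierrs/Ipkts figures from the netstat -i example (3 errors over 10658 input packets, well under the 0.25 percent threshold):

```python
# Sketch of the netstat -i health ratios described above. Values passed
# to collision_rate() are invented; input_error_rate() reuses the hme0
# Ierrs/Ipkts counters from the example.

def collision_rate(collis, opkts):
    return 100.0 * collis / opkts       # percent; >10% suggests trouble

def input_error_rate(ierrs, ipkts):
    return 100.0 * ierrs / ipkts        # percent; >0.25% means dropped packets

print(round(collision_rate(5, 1000), 1))        # -> 0.5
print(round(input_error_rate(3, 10658), 3))     # -> 0.028
```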

If you see lots of connections in the FIN_WAIT states, the TCP/IP parameters have to be tuned, because the connections are not being closed and they keep accumulating. After some time the system may run out of resources. The TCP parameters can be tuned to define a timeout, so that connections can be released and used by new connections.

Solaris Open Boot Commands

The primary function of the OpenBoot firmware is to start up the system. Starting up is the process of loading and executing a standalone program (for example, the operating system or the diagnostic monitor). In the case of Solaris, the standalone program that is being started is the two-part operating system kernel. After the kernel is loaded, the kernel starts the Solaris OS, mounts the necessary file systems, and runs /sbin/init to bring the system to the initdefault state that is specified in /etc/inittab. The stages of this process are known as "system run states".

The OpenBoot architecture consists of the following components:

Plug-in device drivers: A device driver can be loaded from a plug-in device such as an SBus card. The plug-in device driver can be used to boot the operating system from that device or to display text on the device before the operating system has activated its own software device drivers. This feature lets the input and output devices evolve without changing the system PROM.

The FCode interpreter: Plug-in drivers are written in a machine-independent interpreted language called FCode. Each OpenBoot system PROM contains an FCode interpreter. This enables the same device and driver to be used on machines with different CPU instruction sets.

The OpenBoot user interface is based on the programming language Forth, which provides an interactive programming environment. Forth is used for direct communication between humans and machines. It can be quickly expanded and adapted to special needs and different hardware systems. Forth is used not only by Sun but also by other hardware vendors such as Hewlett-Packard.

The device tree: Devices called nodes are attached to a host computer through a hierarchy of interconnected buses on the device tree.
A node representing the host computer's main physical address bus forms the tree's root node. Both the user and the operating system can determine the system's hardware configuration by viewing the device tree. Nodes with children usually represent buses and their associated controllers, if any. Each such node defines a physical address space that distinguishes the devices connected to the node from one another. Each child of that node is assigned a physical address in the parent's address space. The physical address generally represents a physical characteristic that is unique to the device (such as the bus address or the slot number where the device is installed). The use of physical addresses to identify devices prevents device addresses from changing when other devices are installed or removed.

The normal Solaris boot process has five main phases:

1. Basic hardware detection (memory, disk, keyboard, mouse, and the like) and execution of the firmware system initialization program. In Solaris this is called the Boot PROM phase. After you turn on power to the system, the PROM displays system identification information and runs self-test diagnostics to verify the system's hardware and memory. The PROM chip contains the Forth OpenBoot firmware, and it is executed immediately after you turn on the system. The primary task of the OpenBoot firmware is to boot the operating system, either from a mass storage device or from the network. OpenBoot contains a program called the monitor that controls the operation of the system before the kernel is available. When a system is turned on, the monitor runs a power-on self-test (POST) that checks such things as the hardware and memory on the system. If no errors are found, the automatic boot process begins. OpenBoot contains a set of instructions that locate and start up the system's boot program and eventually start up the Unix operating system.

2. Locating and running the initial boot program (IPL or bootloader) from a predetermined location on the disk (the MBR on a PC). In Solaris the primary boot program, called bootblk, is loaded from its location on the boot device (usually disk) into memory.

3. Locating and starting the Unix kernel. The kernel image file to execute may be determined
automatically or via input to the bootloader. In Solaris the bootblk program finds and executes the secondary boot program (called ufsboot) from the Unix file system (UFS) and loads it into memory. After the ufsboot program is loaded, the ufsboot program loads the two-part kernel.

4. The kernel initializes itself and then performs final, high-level hardware checks, loading
device drivers and/or kernel modules as required. In Solaris the kernel initializes itself and begins loading modules, using ufsboot to read the files. When the kernel has loaded enough modules to mount the root file system, it unmaps the ufsboot program and continues, using its own resources.

5. The kernel starts the init process, which in turn starts system processes (daemons) and initializes all active subsystems. When everything is ready, the system begins accepting user logins. In Solaris the kernel mounts the necessary file systems and runs /sbin/init to bring the system to the initdefault state specified in /etc/inittab. The kernel creates a user process and starts the /sbin/init process, which starts other processes by reading the /etc/inittab file. The /sbin/init process starts the run control (rc) scripts, which execute a series of other scripts. These scripts (/sbin/rc*) check and mount file systems, start various processes, and perform system maintenance tasks.

PCI System Commands

The following user query and control commands (Forth words) are available on PCI-based systems.

Use the show-pci-devs command to show all devices on a specific PCI bus:

ok show-pci-devs /pci@1f,2000      (show pcia devices)
ok show-pci-devs /pci@1f,4000      (show pcib devices)

Use the show-pci-devs-all command to show all PCI devices:

ok show-pci-devs-all

Use the show-pci-config command to show configuration space registers for a given PCI device.

ok show-pci-config /pci@1f,4000/network@1,1

Use the show-pci-configs command to show configuration space registers for all PCI devices on a PCI bus:

ok show-pci-configs /pci@1f,4000

Use the show-pci-configs-all command to show configuration space registers for all PCI devices on all PCI busses:

ok show-pci-configs-all

Use the probe-pci command to probe all devices on a specific PCI bus:

ok probe-pci /pci@1f,4000
probing /pci@1f,4000 at Device 3
scsi disk tape
probing /pci@1f,4000 at Device 3
nothing there

Use the probe-pci-slot command to probe a specific PCI slot on a specific PCI bus:

ok 3 probe-pci-slot /pci@1f,4000
probing /pci@1f,4000 at Device 3
scsi disk tape

Updating OpenBoot PROM for Sun Workstations and Workgroup Servers

Having the latest version of OpenBoot PROM (OBP) on a SPARC processor-based workstation or workgroup server can be critical when adding new applications or hardware, or when upgrading the machine's Solaris Operating System (OS). Updating may also save some time and difficulty by resolving any latent bugs that have been detected and fixed since the previous releases. The paragraphs that follow guide you through the steps required to do the update.

Note: This Tech Tip does not cover larger servers; for those systems, see SunSolve document #41723 How to Check the Revision Level of the OBP in a System

The firmware in Sun's boot PROM is called OpenBoot. The main features of OpenBoot are initial program loading and debugging features to assist kernel debugging. OpenBoot supports plug-in device drivers, which are written in the Forth language. This plug-in feature allows Sun or any third-party vendor to develop new boot devices without making any changes to the boot PROM.

Sun's "OpenBoot 3.x Command Reference Manual", Chapter 1 (currently at http://docs.sun.com/db/doc/805-4436/6j4719c8a?a=view).

Solaris 9 System Startup and Shutdown

Objectives

The following objectives for the Solaris System Administrator Exam are covered in this chapter:

Explain how to execute boot PROM commands to:
  Identify the system's boot PROM version
  Boot the system; access detailed information
  List, change, and restore default NVRAM parameters
  Display devices connected to the bus
  Identify the system's boot device
  Create and remove custom device aliases
  View and change NVRAM parameters from the shell
  Interrupt a hung system

Given a scenario involving a hung system, troubleshoot problems and deduce resolutions.

Explain how to perform a system boot, control boot processes, and complete a system shutdown, using associated directories, scripts, and commands.

You need to understand the primary functions of the OpenBoot environment, which includes the programmable read-only memory (PROM). You need to have a complete understanding of how to use many of the OpenBoot commands and how to set and modify the configuration parameters that control system bootup and hardware behavior.

You must understand the entire boot process, from the proper power-on sequence to the steps you perform to bring the system into multiuser mode. You must be able to identify the devices connected to a system and recognize the various special files for each device. Occasionally, conventional shutdown methods might not work on an unresponsive system or on a system that has crashed. This chapter introduces when and how to use these alternative shutdown methods to bring the system down safely. You must understand how the system run levels define which processes and services are started at various stages of the boot process. You need to understand all the run levels that are available in Solaris. You need to understand how to add and modify run control scripts to customize the startup of processes and services on Solaris systems. You need to have a detailed understanding of the programs and configuration files involved at the various run levels.

Outline

Introduction
Booting a System
  Powering On the System
  The Boot PROM and Program Phases
  Entry-Level to High-End Systems
Accessing the OpenBoot Environment
  OpenBoot Firmware Tasks
The OpenBoot Environment
  The OpenBoot Architecture
  The OpenBoot Interface
    The Restricted Monitor
    The Forth Monitor
  Getting Help in OpenBoot
  PROM Device Tree (Full Device Pathnames)
  OpenBoot Device Aliases
  The nvedit Line Editor
  OpenBoot NVRAM
  OpenBoot Security
  OpenBoot Diagnostics
  Input and Output Control
  OpenBoot PROM Versions
Booting a System
  The boot Command
  The Kernel
  System Run States
  swapper
  The init Phase
  rc Scripts
    Using the Run Control Scripts to Stop or Start Services
    Adding Scripts to the Run Control Directories
System Shutdown
  Commands to Shut Down the System
    The /usr/sbin/shutdown Command
    The /sbin/init Command
    The /usr/sbin/halt Command
    The /usr/sbin/reboot Command
    The /usr/sbin/poweroff Command
  Stopping the System for Recovery Purposes
  Turning Off the Power to the Hardware
Summary
Apply Your Knowledge

Study Strategies

The following study strategies will help you prepare for the exam:

When studying this chapter, you should practice on a Sun system each step-by-step process that is outlined. In addition to practicing the processes, you should practice the various options described for booting the system.

You should display the hardware configuration of your Sun system by using the various OpenBoot commands presented in this chapter. You need to familiarize yourself with all the devices associated with your system, and you should be able to identify each hardware component by its device pathname.

You should practice creating both temporary and permanent device aliases. In addition, you should practice setting the various OpenBoot system parameters that are described in this chapter.

You should practice booting the system by using the various methods described. You need to understand how to boot into single-user and multiuser modes and how to specify an alternate kernel or system file during the boot process.

During the boot process, you should watch the system messages and familiarize yourself with every stage of the boot process. You need to understand each message displayed from system power-on to bringing the system into multiuser mode.

You need to thoroughly understand all the system run states, including when and where to use each of them. In addition, you must understand run control scripts and how they affect the system services. You should practice adding your own run control scripts.

You should practice shutting down the system, and you should make sure you understand the advantages and disadvantages of each method presented.

Introduction

System startup requires an understanding of the hardware and the operating system functions that are required to bring the system to a running state. This chapter discusses the operations that the system must perform from the time you power on the system until you receive a system logon prompt. In addition, it covers the steps required to properly shut down a system. After reading this chapter, you'll understand how to boot the system from the OpenBoot programmable read-only memory (PROM) and what operations must take place to start up the kernel and Unix system processes.

Plagiarism? By David, Feb 3, 2004 05:11 PM

Much of this seems to be copied, with minor modification, from Sun's "OpenBoot 3.x Command Reference Manual", Chapter 1 (currently at http://docs.sun.com/db/doc/805-4436/6j4719c8a?a=view). For example, the Sun documentation says:

"OpenBoot deals directly with hardware devices in the system. Each device has a unique name representing the type of device and where that device is located in the system addressing structure."

In this article, it's been changed slightly to read:

"OpenBoot deals directly with the hardware devices in the system. Each device has a unique name that represents both the type of device and the location of that device in the device tree."

Sometimes, the minor changes alter the meaning. For example, the Sun doc says:

"A full device path name is a series of node names separated by slashes (/). The root of the tree is the machine node, which is not named explicitly but is indicated by a leading slash (/). Each node name has the form: driver-name@unit-address:device-arguments"

In the copied version posted here, though, it's:

"A device tree is a series of node names separated by slashes (/). The top of the device tree is the root device node. Following the root device node, and separated by a leading slash /, is a bus nexus node. Connected to a bus nexus node is a leaf node, which is typically a controller for the attached device. Each device pathname has this form: driver-name@unit-address:device-arguments"

The first sentence is incorrect, and the last sentence is confusing. It's each node name that has that form, not the pathname, which is a series of slash-delimited node names.
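The node-name form quoted above (driver-name@unit-address:device-arguments) is easy to see in a concrete path. Here is a small sketch using generic shell string handling (not an OpenBoot tool) on a representative device pathname:

```shell
#!/bin/sh
# Sketch: decompose a full OpenBoot device pathname. The path is a
# series of /-separated node names; each node name has the form
#   driver-name@unit-address:device-arguments
path=/pci@1f,4000/scsi@3/disk@0,0:a

# List the node names, one per line
echo "$path" | tr '/' '\n' | sed '/^$/d'

# Split the last node into its three fields
node=${path##*/}        # disk@0,0:a
driver=${node%%@*}      # disk
rest=${node#*@}         # 0,0:a
unit=${rest%%:*}        # 0,0
args=${rest#*:}         # a
echo "driver=$driver unit=$unit args=$args"
```

Note that the leading slash stands for the unnamed machine (root) node, which is why it produces no node name of its own.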

How to edit crontab in Sun Solaris

To edit the crontab file, the crontab program must be used. The actual crontab files should not be edited directly, because their contents are cached and changes will not take effect until the cron daemon is restarted. Using the crontab program to edit the crontabs updates the cache when the file is changed. To edit the current user's crontab file, use:

crontab -e

The -e option tells the program to edit a copy of the user's crontab file. The EDITOR environment variable is referenced to determine which editor to use (the default is ed). To set this environment variable, see the recipes for ksh and sh. The superuser can edit a specific user's crontab by adding the username at the end of this command. The processes run from a user's crontab will run as that user. Be careful with commands in root's crontab, because these will run as root and could cause problems. If shell scripts are run from root's crontab, make sure their file permissions do not allow modification by anyone but root.

The syntax of crontab is simple. Each line represents a single scheduled task. The first five fields represent timing information, and everything following is interpreted as the command to schedule. The timing fields, in order, are:

minutes         0-59
hours           0-23
days of month   1-31
months of year  1-12
days of week    0-6 (Sunday-Saturday)

A variety of options work for each field. An asterisk (*) indicates all possible occurrences for that field. A number sets that single occurrence. Two numbers separated by a dash (-) indicate a range of values, and numbers separated by a comma indicate a list of occurrences. Several examples:

15 * * * * logcheck

Runs a command called logcheck at 15 minutes past every hour of every day.

0,15,30,45 8-17 * * 1-5 dobackup

Runs dobackup every 15 minutes (i.e., 8:00, 8:15, 8:30, and 8:45) during business hours (from 8:00 to 17:00) on business days (Monday-Friday).
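To illustrate the field syntax, here is a small POSIX sh sketch (the expand_field helper is hypothetical, not part of Solaris) that expands a single crontab time field into the explicit values it matches:

```shell
#!/bin/sh
# Hypothetical helper: expand one crontab time field ("*", "N",
# "A-B", or a comma-separated list) into the values it matches.
# Usage: expand_field FIELD MIN MAX
expand_field() {
    field=$1; lo=$2; hi=$3
    [ "$field" = "*" ] && field="$lo-$hi"
    out=""
    old_ifs=$IFS; IFS=,
    for part in $field; do          # split the comma list
        case $part in
            *-*)                     # a range like 8-17
                i=${part%-*}
                while [ "$i" -le "${part#*-}" ]; do
                    out="$out $i"; i=$((i + 1))
                done ;;
            *)  out="$out $part" ;;  # a single value
        esac
    done
    IFS=$old_ifs
    echo $out
}

expand_field "0,15,30,45" 0 59   # minutes field from the example above
expand_field "8-17" 0 23         # business-hours field
```

For example, the dobackup entry above fires at each minute in the expansion of 0,15,30,45 during each hour in the expansion of 8-17.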
The first time I edited a crontab file on Solaris was when I wanted my system to stay synchronized to an NTP server, which is scheduled from the crontab file. On Linux and BSD, you just run crontab -e to edit the crontab; those systems default to the vi editor. On Solaris you need to specify the editor program to use. To edit the crontab file with vi, run the following command:

bash# export EDITOR=vi

You can now edit your crontab as on a Linux or BSD system:

bash# crontab -e
#ident "@(#)root 1.19 98/07/06 SMI" /* SVr4.0 1.1.3.1 */

#
# The root crontab should be used to perform accounting data collection.
#
# The rtc command is run to adjust the real time clock if and when
# daylight savings time changes.
#
10 3 * * * /usr/sbin/logadm
15 3 * * 0 /usr/lib/fs/nfs/nfsfind
1 2 * * * [ -x /usr/sbin/rtc ] && /usr/sbin/rtc -c > /dev/null 2>&1
30 3 * * * [ -x /usr/lib/gss/gsscred_clean ] && /usr/lib/gss/gsscred_clean
#10 3 * * * /usr/lib/krb5/kprop_script ___slave_kdcs___
#Run ntpdate at 4:40 everyday
40 4 * * * /usr/sbin/ntpdate -b asia.pool.ntp.org 1> /dev/null 2>&1

Step-by-Step Zone Configuration in the Solaris 10 OS

Diego E. Aguirre, August 2007

Here is a short guide to creating zones with Solaris Containers technology, with examples using Solaris Volume Manager and an Oracle database. It's easy to modify these steps and add more file systems into the script.

Notes: In this example, I make only one instance or zone, called zone1. I used Solaris Volume Manager in Steps 2 and 3, and I tested this on Oracle 10.1 and 10.2.

1. Format the hard disk into slice 0.

2. Make the metadevices. For example, I have three SAN disks, and I want to make a metadevice with the three disks concatenated. (Note: Please type the command all on one line.)

# metainit d60 3 1 c2t50060E800456EE02d0s0 1 c2t50060E800456EE02d1s0 1 c2t50060E800456EE02d2s0
d60: Concat/Stripe is setup

3. Make the soft partitions:

# metainit d61 -p d60 6g
d61: Soft Partition is setup
# metainit d62 -p d60 10g
d62: Soft Partition is setup
# metainit d63 -p d60 30g
d63: Soft Partition is setup
#

4. Create the file systems:

# newfs /dev/md/rdsk/d61
newfs: construct a new file system /dev/md/rdsk/d61: (y/n)? y
# newfs /dev/md/rdsk/d62
newfs: construct a new file system /dev/md/rdsk/d62: (y/n)? y
# newfs /dev/md/rdsk/d63
newfs: construct a new file system /dev/md/rdsk/d63: (y/n)? y
#
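Step 5 below mounts /export/zone1 with no device argument, which works only if /etc/vfstab maps the mount point to a device. A sketch of the needed entry, assuming d61 (the 6-GB metadevice) holds the zone root, which the source does not state explicitly:

```
#device to mount   device to fsck     mount point    FS type  fsck pass  mount at boot  options
/dev/md/dsk/d61    /dev/md/rdsk/d61   /export/zone1  ufs      2          yes            -
```

The d62 and d63 metadevices do not need vfstab entries here, since the zone configuration script mounts them inside the zone.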

5. Create the mount point for the zone's root file system (/export/zone1) and /u00 and /u01 for the Oracle database:

mkdir -p /export/zone1
mkdir /u00
mkdir /u01
mount /export/zone1

6. Execute the following script, which is shown in its entirety after Step 11:

# zonecfg -z zone1 -f /usr/scripts/make.zone1.ksh
# zoneadm list -cv
  ID NAME     STATUS      PATH
   0 global   running     /
   - zone1    configured  /export/zone1
# chmod 700 /export/zone1

7. Install zone1:

# zoneadm -z zone1 install
Preparing to install zone <zone1>.
Checking <ufs> file system on device </dev/md/rdsk/d62> to be mounted at </export/zone1/root>
Checking <ufs> file system on device </dev/md/rdsk/d63> to be mounted at </export/zone1/root>
Creating list of files to copy from the global zone.
Copying <124550> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1021> packages on the zone.
Initializing package <49> of <1021>: percent complete: 4%

8. Run the following command to get the zone state:

# zoneadm list -cv
  ID NAME     STATUS     PATH
   0 global   running    /
   - zone1    installed  /export/zone1

9. Transition the zone to the ready state by running the following command:

# zoneadm -z zone1 ready

10. Use the following command to get the zone state:

# zoneadm list -cv
  ID NAME     STATUS   PATH
   0 global   running  /
   1 zone1    ready    /export/zone1

11. Boot the zone:

# zoneadm -z zone1 boot

The script to be executed is /usr/scripts/make.zone1.ksh, and here are the details:

create -b
set zonepath=/export/zone1
set autoboot=true
add fs
set dir=/u00
set special=/dev/md/dsk/d62
set raw=/dev/md/rdsk/d62
set type=ufs
end
add fs
set dir=/u01
set special=/dev/md/dsk/d63
set raw=/dev/md/rdsk/d63
set type=ufs
end
add net
set address=10.11.33.144
set physical=ce2
end

Step 1: Use zoneadm list on the global zone server to show the status of zones.

# /usr/sbin/zoneadm list -vi    shows the status of all installed zones (i stands for installed)
# /usr/sbin/zoneadm list -vc    shows the status of all configured zones (c stands for configured, which includes installed zones)

On the global zone, use zoneadm list -vi to show the current status of all installed zones:

global# /usr/sbin/zoneadm list -vi
  ID NAME     STATUS   PATH             BRAND    IP
   0 global   running  /                native   shared
   2 rlogic   running  /zones/rlogic    native   shared
   3 utility  running  /zones/utility   native   shared
   4 myzone   running  /zones/myzone    native   shared
global#

On the global zone, use zoneadm list -vc to show the current status of all configured zones:

global# /usr/sbin/zoneadm list -vc
  ID NAME      STATUS      PATH              BRAND    IP
   0 global    running     /                 native   shared
   2 rlogic    running     /zones/rlogic     native   shared
   3 utility   running     /zones/utility    native   shared
   4 myzone    running     /zones/myzone     native   shared
   - junkzone  configured  /zones/junkzone   native   shared
global#

Note: Using /usr/sbin/zoneadm in a local zone will only display the status of that zone.

Step 2: Use zonecfg -z <zonename> with the info option to list a specific zone configuration.

The zonecfg command can be used to list the configuration of a current zone. The zone name must be specified.

Basic zone info:

global# zonecfg -z myzone info
zonename: myzone
zonepath: /zonepool/myzone
brand: native
autoboot: true
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: shared
inherit-pkg-dir:
    dir: /lib
inherit-pkg-dir:
    dir: /platform
inherit-pkg-dir:
    dir: /sbin
inherit-pkg-dir:
    dir: /usr
inherit-pkg-dir:
    dir: /opt/sfw
net:
    address: 192.168.29.143/24
    physical: e1000g0
    defrouter: 192.168.29.2
global#

The example above is a basic zone installation, which contains inherited packages. A self-contained zone will display the following configuration, with no inherit-pkg-dir entries:

global# zonecfg -z mywhole info

zonename: mywhole
zonepath: /zonepool/mywhole
brand: native
autoboot: true
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: shared
net:
    address: 192.168.29.142/24
    physical: e1000g0
    defrouter: 192.168.29.2
capped-memory:
    physical: 512M
    [swap: 512M]
    [locked: 512M]
rctl:
    name: zone.max-swap
    value: (priv=privileged,limit=536870912,action=deny)
rctl:
    name: zone.max-locked-memory
    value: (priv=privileged,limit=536870912,action=deny)
rctl:
    name: zone.cpu-shares
    value: (priv=privileged,limit=10,action=none)
global#

Step 3: Examine the /etc/zones directory to display a zone configuration file.

# cat <zonename>.xml

The directory /etc/zones contains the configuration files for each zone, named after the zone. Examining these files will show the configuration of the zones.

global# cd /etc/zones
bash-3.00# ls -l
total 18
-rw-r--r--   1 root  root  1198 Mar 28 04:18 index
-rw-r--r--   1 root  root   547 Mar 26 13:56 myzone.xml
-rw-r--r--   1 root  root   547 Mar 24 03:06 rlogic.xml
-r--r--r--   1 root  bin   1196 Sep  7  2005 SUNWblank.xml
-r--r--r--   1 root  bin   1366 Sep  7  2005 SUNWdefault.xml
-rw-r--r--   1 root  root   549 Mar 24 03:50 utility.xml
global# cat myzone.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE zone PUBLIC "-//Sun Microsystems Inc//DTD Zones//EN" "file:///usr/share/lib/xml/dtd/zonecfg.dtd.1">

<!-- DO NOT EDIT THIS FILE.  Use zonecfg(1M) instead. -->

<zone name="myzone" zonepath="/zones/myzone" autoboot="true">
  <inherited-pkg-dir directory="/lib"/>
  <inherited-pkg-dir directory="/platform"/>
  <inherited-pkg-dir directory="/sbin"/>
  <inherited-pkg-dir directory="/usr"/>
  <inherited-pkg-dir directory="/opt/sfw"/>
  <network address="192.168.3.38/24" physical="rtls0"/>
</zone>
global#

The index file in this directory also contains the status of the zones:

global# cat index
# ident "@(#)zones-index 1.3 05/06/08 SMI"
# Copyright 2005 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
# CDDL HEADER START
#
# The contents of this file are subject to the terms of the
# Common Development and Distribution License, Version 1.0 only
# (the "License"). You may not use this file except in compliance
# with the License.

#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
# See the License for the specific language governing permissions
# and limitations under the License.
#
# When distributing Covered Code, include this CDDL HEADER in each
# file and include the License file at usr/src/OPENSOLARIS.LICENSE.
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets "[]" replaced with your own identifying
# information: Portions Copyright [yyyy] [name of copyright owner]
#
# CDDL HEADER END
#
# DO NOT EDIT: this file is automatically generated by zoneadm(1M)
# and zonecfg(1M). Any manual changes will be lost.
#
global:installed:/
rlogic:installed:/zones/rlogic
utility:installed:/zones/utility
myzone:installed:/zones/myzone
global#
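Each non-comment line of the index file has the colon-separated form zonename:state:zonepath, so it is easy to query from a script. A sketch using generic awk (not a Sun-provided tool), with sample data standing in for /etc/zones/index:

```shell
#!/bin/sh
# Sketch: extract installed zones and their paths from index-format
# data, whose non-comment lines read zonename:state:zonepath.
list_installed() {
    awk -F: '/^[^#]/ && $2 == "installed" { print $1, $3 }'
}

# Sample data in place of /etc/zones/index;
# prints "global /" and "rlogic /zones/rlogic"
list_installed <<'EOF'
# DO NOT EDIT: generated by zoneadm(1M) and zonecfg(1M).
global:installed:/
rlogic:installed:/zones/rlogic
junkzone:configured:/zones/junkzone
EOF
```

On a real system you would feed it the actual file, e.g. list_installed < /etc/zones/index; zoneadm list remains the supported interface.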

Solstice DiskSuite / Solaris Volume Manager Soft Partitioning


The intent of this document is to describe Soft Partitioning within Solstice DiskSuite (soon-to-be-renamed Solaris Volume Manager), and to offer a short primer/tutorial on how to create, use, and delete soft partitions. Until now, Solaris, without any volume management software, has only ever allowed a fixed number of partitions on a physical disk (seven (7) usable slices on SPARC platforms). With the increase in capacity of disks, this limitation has become a severe restriction. SDS/SVM uses these slices for its metadevices (sub-mirrors, trans, stripes, and RAID5) and hence is faced with the same limitation, whereas Veritas Volume Manager (VxVM) allows for the logical partitioning of disks into a virtually unlimited number of subdisks. Soft Partitioning allows a disk to be subdivided into many partitions which are controlled and maintained by software, thereby removing the limitation on the number of partitions per disk. A soft partition is made up of one or more "extents". An extent describes the part of the physical disk that makes up the soft partition. While the maximum number of extents per soft partition is 2147483647, the majority of soft partitions will use only one (1) extent.

What is new?
Soft Partitioning was not in the original Solstice DiskSuite 4.2.1 release, which coincided with the release of Solaris 8. However, the soft partitioning functionality was released in patch 108693-06 for SDS 4.2.1. When Solaris 9 is released, the "Solstice DiskSuite" name will change to "Solaris Volume Manager" ("SVM") and it will be bundled with Solaris 9. Soft Partitioning will, of course, be part of the base functionality of that release. Soft Partitions are implemented by a new kernel driver, md_sp:

# modinfo | grep md_sp
228 78328000 4743 - 1 md_sp (Meta disk soft partition module)

There are new options to the metainit command:

metainit softpart -p [-e] component size
metainit softpart -p component -o offset -b size

The metattach command has been modified to allow for growing of soft partitions:

metattach softpart size

There is a new command, metarecover:

metarecover [-n] [-v] component -p [-d|-m]

NOTE: the -p option means that the command refers to soft partitions.

Creating Soft Partitions


There are three methods to create a soft partition using the metainit command: 1. Specifying an unused disk and size (with the -e option). For example:
# metainit d0 -p -e c1t0d0 200m

The -e option requires that the name of the disk supplied be in the form c#t#d#. The last parameter (200m) specifies the initial size of the soft partition. Sizes can be specified in blocks, kilobytes, megabytes, gigabytes, and terabytes. The -e option causes the disk to be repartitioned such that slice 7 has enough space to hold a replica (although no replica is actually created on this disk) and slice 0 contains the rest of the space. Slice 2 is removed from the disk. The soft partition that is being created is put into slice 0. Further soft partitions can be created on slice 0 by the next method of creating a soft partition. After this command is run, the layout of the disk would look similar to this example:
Part       Tag   Flag   Cylinders      Size      Blocks
  0  unassigned   wm    5 - 2035    999.63MB     (2031/0/0)  2047248
  1  unassigned   wm    0                  0     (0/0/0)           0
  2  unassigned   wm    0                  0     (0/0/0)           0
  3  unassigned   wm    0                  0     (0/0/0)           0
  4  unassigned   wm    0                  0     (0/0/0)           0
  5  unassigned   wm    0                  0     (0/0/0)           0
  6  unassigned   wm    0                  0     (0/0/0)           0
  7  unassigned   wu    0 -    4      2.46MB     (5/0/0)        5040

This command (with the -e) can only be run on an empty disk (one that is not used in any other metadevice). If another metadevice or replica already exists on this disk, one of the following messages will be printed, and no soft partition will be created.
metainit: hostname: c#t#d#s0: has appeared more than once in the specification of d#

or

metainit: hostname: c#t#d#s#: has a metadevice database replica

2. Specifying an existing slice name and size (without the -e option). This will be the most common method of creation. For example:
# metainit d1 -p c1t0d0s0 1g

This will create a soft partition on the specified slice. No repartitioning of the disk is done. Provided there is space on the slice, additional soft partitions could be created as required. The device name must include the slice number (c#t#d#s#). If another soft partition already exists in this slice, this one will be created immediately after the existing one. Therefore, no overlap of soft partitions can occur by accident.
3. Specifying an existing slice and absolute offset and size values. For example:
# metainit d2 -p c1t0d0s0 -o 2048 -b 1024

The -o parameter signifies the offset into the slice, and the -b parameter is the size for the soft partition. All numbers are in blocks (a block is 512 bytes). The metainit command ensures that extents and soft partitions do not overlap. For example, the following is an attempt to create overlapping soft partitions.
# metainit d1 -p c1t0d0s0 -o 1 -b 2024 d1: Soft Partition is setup # metainit d2 -p c1t0d0s0 -o 2000 -b 2024 metainit: hostname: d2: overlapping extents specified

An offset of 0 is not valid, as the first block on a slice containing a soft partition contains the initial extent header. Each extent header consumes 1 block of disk and each soft partition will have an extent header placed at the end of each extent. Extent headers are explained in more detail in the next section. NOTE: This method is not documented in the man page for metainit and is not recommended for manual use. It is here because a subsequent metastat -p command will output information in this format.
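The offset arithmetic can be sketched in a few lines of shell. This is a back-of-the-envelope helper, not an SVM tool, and it assumes each extent is preceded by a 1-block extent header, the layout implied by the "offset of 0 is not valid" rule above:

```shell
#!/bin/sh
# Sketch (assumed layout): where would the next soft partition start
# on a slice already holding `count` partitions of `blocks` 512-byte
# blocks each, if every extent is preceded by a 1-block header?
next_offset() {
    count=$1
    blocks=$2
    echo $((1 + count * (blocks + 1)))
}

next_offset 0 2024   # first partition: offset 1, as in the d1 example
next_offset 1 2024   # first free offset after one 2024-block partition
```

This also shows why the overlapping example above fails: offset 2000 falls inside d1's extent, which occupies blocks 1 through 2024.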

Extent Headers
Whenever a soft partition is created in a disk slice, an "extent header" is written to disk. Internally at Sun, these are sometimes referred to as "watermarks". An extent header is a consistency record and contains such information as the metadevice (soft partition) name, its status, its size, and a checksum. Each extent header is 1 block (512 bytes) in size. The following diagram shows an example 100MB slice (c1t0d0s0) and the extent headers (watermarks) that have been created on it. The command to make the soft partition shown was:
# metainit d1 -p c1t0d0s0 20m

Troubleshooting Hardware Problems


The term troubleshooting refers to the act of applying diagnostic tools, often heuristically and accompanied by common sense, to determine the causes of system problems. Each system problem must be treated on its own merits. It is not possible to provide a cookbook of actions that resolves every problem. However, this chapter provides some approaches and procedures which, used in combination with experience and common sense, can resolve many problems that might arise. Tasks covered in this chapter include:
Troubleshooting a System With the Operating System Responding
Troubleshooting a System After an Unexpected Reboot
Troubleshooting Fatal Reset Errors and RED State Exceptions
Troubleshooting a System That Does Not Boot
Troubleshooting a System That Is Hanging

Other information in this chapter includes:

Information to Gather During Troubleshooting
System Error States
Unexpected Reboots

Information to Gather During Troubleshooting


Familiarity with a wide variety of equipment and experience with a particular machine's common failure modes can be invaluable when troubleshooting system problems. Establishing a systematic approach to investigating and solving a particular system's problems helps ensure that you can quickly identify and remedy most issues as they arise. The Netra 440 server indicates and logs events and errors in a variety of ways. Depending on the system's configuration and software, certain types of errors are captured only temporarily. Therefore, you must observe and record all available information immediately, before you attempt any corrective action. POST, for instance, accumulates a list of failed components across resets. However, failed component information is cleared after a system reset. Similarly, the state of LEDs in a hung system is lost when the system reboots or resets. If you encounter any system problems that are not familiar to you, gather as much information as you can before you attempt any remedial actions. The following task listing outlines a basic approach to information gathering.
Gather as much error information (error indications and messages) as you can from the system. See Error Information From the ALOM System Controller and Error Information From the System for more information about sources of error indications and messages.

Gather as much information as you can about the system by reviewing and verifying the system's operating system, firmware, and hardware configuration. To accurately analyze error indications and messages, you or a Sun support services engineer must know the system's operating system and patch revision levels as well as the specific hardware configuration. See Recording Information About the System.

Compare the specifics of your situation to the latest published information about your system. Often, unfamiliar problems you encounter have been seen, diagnosed, and fixed by others. This information might help you avoid the unnecessary expense of replacing parts that are not actually failing. See Updated Troubleshooting Information for information sources.

Error Information From the ALOM System Controller


In most troubleshooting situations, you can use the ALOM system controller as the primary source of information about the system. On the Netra 440 server, the ALOM system controller provides you with access to a variety of system logs and other information about the system, even when the system is powered off. For more information about ALOM, see:
Monitoring the System Using Sun Advanced Lights Out Manager

Advanced Lights Out Manager Software User's Guide for the Netra 440 Server

Error Information From the System


Depending on the state of the system, you should check as many of the following sources as possible for error indications and record the information found.
Output from the prtdiag -v command - If Solaris software is running, issue the prtdiag -v command to capture information stored by OpenBoot Diagnostics and POST tests. Any information from these tests about the current state of the system is lost when the system is reset. See Troubleshooting a System With the Operating System Responding.

Output from show-post-results and show-obdiag-results commands - From the ok prompt, issue the show-post-results command or show-obdiag-results command to view summaries of the results from the most recent POST and OpenBoot Diagnostics tests, respectively. The test results are saved across power cycles and provide an indication of which components passed and which components failed POST or OpenBoot Diagnostics tests. See Viewing Diagnostic Test Results After the Fact.

State of system LEDs - The system LEDs can be viewed in various locations on the system or by using the ALOM system controller. Be sure to check any network port LEDs for activity as you examine the system. Any information about the state of the system from the LEDs is lost when the system is reset. For more information about using LEDs to troubleshoot system problems, see Isolating Faults Using LEDs.

Solaris logs - If Solaris software is running, check the /var/adm/messages file. For more information, refer to "How to Customize System Message Logging" in the Solaris System Administration Guide: Advanced Administration Guide, which is part of the Solaris System Administrator Collection.

System console - You can access system console messages from OpenBoot Diagnostics and POST using the ALOM system controller, provided the system console has not been redirected. The system controller also provides you access to boot log information from the latest system reset. For more information about the system console, refer to the Netra 440 Server System Administration Guide.
Core files generated from panics - These files are located in the /var/crash directory. See The Core Dump Process for more information.
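A quick way to see whether any dumps have been saved is to list the default location. The sketch below assumes the standard Solaris layout, where savecore writes unix.N and vmcore.N pairs under /var/crash/<hostname>; adjust CRASHDIR if dump storage has been reconfigured with dumpadm.

```shell
#!/bin/sh
# List saved crash dump pairs under the default Solaris location.
# The /var/crash/<hostname> layout is the dumpadm default and is an
# assumption here; adjust CRASHDIR if dump storage was reconfigured.
CRASHDIR="/var/crash/$(uname -n)"
if ls "$CRASHDIR"/vmcore.* >/dev/null 2>&1; then
    ls -l "$CRASHDIR"/unix.* "$CRASHDIR"/vmcore.*
else
    echo "no saved crash dumps in $CRASHDIR"
fi
```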

Recording Information About the System


As part of your standard operating procedures, it is important to have the following information about your system readily available:
Current patch levels for the system firmware and operating system
Solaris OS version
Specific hardware configuration information
Optional equipment and driver information
Recent service records

Having all of this information available and verified makes it easier for you to recognize any problems already identified by others. This information is also required if you contact Sun support or your authorized support provider. It is vital to know the version and patch revision levels of the system's operating system, patch revision levels of the firmware, and your specific hardware configuration before you attempt to fix any problems. Problems often occur after changes have been made to the system. Some errors are caused by hardware and software incompatibilities and interactions. If you have all system information available, you might be able to quickly fix a problem by simply updating the system's firmware. Knowing about recent upgrades or component replacements might help you avoid replacing components that are not faulty.
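Collecting these details can be scripted so a snapshot exists before trouble starts. The sketch below is one possible approach, not a Sun-supplied tool: uname, showrev, prtconf, and psrinfo are standard Solaris commands, and each optional command is skipped if not present.

```shell
#!/bin/sh
# Sketch: gather the details listed above into one dated file so they
# are on hand before a problem occurs. Each optional command is skipped
# if it is not available on this system.
OUT="${TMPDIR:-/var/tmp}/sysinfo.$(date +%Y%m%d)"
{
    echo "=== uname -a ==="
    uname -a
    if command -v showrev >/dev/null 2>&1; then
        echo "=== showrev -p (installed patches) ==="
        showrev -p
    fi
    if command -v prtconf >/dev/null 2>&1; then
        echo "=== prtconf -v (hardware configuration) ==="
        prtconf -v
    fi
    if command -v psrinfo >/dev/null 2>&1; then
        echo "=== psrinfo -v (processor information) ==="
        psrinfo -v
    fi
} > "$OUT" 2>&1
echo "system snapshot saved to $OUT"
```

Keeping the dated output files alongside service records gives you the verified baseline this section recommends.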

System Error States


When troubleshooting, it is important to understand what kind of error has occurred, to distinguish between real and apparent system hangs, and to respond appropriately to error conditions so as to preserve valuable information.

Responding to System Error States


Depending on the severity of a system error, a Netra 440 server might or might not respond to commands you issue to the system. Once you have gathered all available information, you can begin taking action. Your actions depend on the information you have already gathered and the state of the system. Remember these guidelines:
Avoid power cycling the system until you have gathered all the information you can. Error information might be lost when power cycling the system.
If your system appears to be hung, attempt multiple approaches to get the system to respond. See Responding to System Hang States.

Responding to System Hang States


Troubleshooting a hanging system can be a difficult process because the root cause of the hang might be masked by false error indications from another part of the system. Therefore, it is important that you carefully examine all the information sources available to you before you attempt any remedy. Also, it is helpful to understand the type of hang the system is experiencing. This hang state information is especially important to Sun support services engineers, should you contact them. A system soft hang can be characterized by any of the following symptoms:
Usability or performance of the system gradually decreases.
New attempts to access the system fail.
Some parts of the system appear to stop responding.
You can drop the system into the OpenBoot ok prompt level.

Some soft hangs might dissipate on their own, while others will require that the system be interrupted to gather information at the OpenBoot prompt level. A soft hang should respond to a break signal that is sent through the system console. A system hard hang leaves the system unresponsive to a system break sequence. You will know that a system is in a hard hang state when you have attempted all the soft hang remedies with no success. See Troubleshooting a System That Is Hanging.

Responding to Fatal Reset Errors and RED State Exceptions


Fatal Reset errors and RED State Exceptions are most often caused by hardware problems. Hardware Fatal Reset errors are the result of an "illegal" hardware state that is detected by the system. A hardware Fatal Reset error can either be a transient error or a hard error. A transient error causes intermittent failures. A hard error causes persistent failures that occur in the same way each time. CODE EXAMPLE 7-1 shows a sample Fatal Reset error alert from the system console.

CODE EXAMPLE 7-1   Fatal Reset Error Alert

Sun-SFV440-a console login:
Fatal Error Reset
CPU 0000.0000.0000.0002  AFSR 0210.9000.0200.0000  JETO PRIV OM TO
AFAR 0000.0280.0ec0.c180
SC Alert: Host System has Reset

SC Alert: Host System has read and cleared bootmode.

A RED State Exception condition is most commonly a hardware fault that is detected by the system. There is no recoverable information that you can use to troubleshoot a RED State Exception. The Exception causes a loss of system integrity, which would jeopardize the system if Solaris software continued to operate. Because of this, Solaris software terminates ungracefully without logging any details of the RED State Exception error in the /var/adm/messages file. CODE EXAMPLE 7-2 shows a sample RED State Exception alert from the system console.

CODE EXAMPLE 7-2   RED State Exception Alert

Sun-SFV440-a console login:
RED State Exception

Error enable reg: 0000.0001.00f0.001f
ECCR: 0000.0000.02f0.4c00

CPU: 0000.0000.0000.0002
TL=0000.0000.0000.0005 TT=0000.0000.0000.0010
    TPC=0000.0000.0100.4200 TnPC=0000.0000.0100.4204 TSTATE=0000.0044.8200.1507
TL=0000.0000.0000.0004 TT=0000.0000.0000.0010
    TPC=0000.0000.0100.4200 TnPC=0000.0000.0100.4204 TSTATE=0000.0044.8200.1507
TL=0000.0000.0000.0003 TT=0000.0000.0000.0010
    TPC=0000.0000.0100.4680 TnPC=0000.0000.0100.4684 TSTATE=0000.0044.8200.1507
TL=0000.0000.0000.0002 TT=0000.0000.0000.0034
    TPC=0000.0000.0100.7164 TnPC=0000.0000.0100.7168 TSTATE=0000.0044.8200.1507
TL=0000.0000.0000.0001 TT=0000.0000.0000.004e
    TPC=0000.0001.0001.fd24 TnPC=0000.0001.0001.fd28 TSTATE=0000.0000.8200.1207

SC Alert: Host System has Reset

SC Alert: Host System has read and cleared bootmode.

In some isolated cases, software can cause a Fatal Reset error or RED State Exception. Typically, these are device driver problems that can be identified easily. You can obtain this information through SunSolve Online (see Web Sites), or by contacting Sun or the third-party driver vendor. The most important pieces of information to gather when diagnosing a Fatal Reset error or RED State Exception are:
System console output at the time of the error
Recent service history of systems that encounter Fatal Reset errors or RED State Exceptions

Capturing system console indications and messages at the time of the error can help you isolate the true cause of the error. In some cases, the true cause of the original error might be masked by false error indications from another part of the system. For example, POST results (shown by the output from the prtdiag command) might indicate failed components, when, in fact, the "failed" components are not the actual cause of the Fatal Reset error. In most cases, a good component will actually report the Fatal Reset error. By analyzing the system console output at the time of the error, you can avoid replacing components based on these false error indications. In addition, knowing the service history of a system experiencing transient errors can help you avoid repeatedly replacing "failed" components that do not fix the problem.

Unexpected Reboots
Sometimes, a system might reboot unexpectedly. In that case, ensure that the reboot was not caused by a panic. For example, L2-cache errors that occur in user space (not kernel space) might cause Solaris software to log the L2-cache failure data and reboot the system. The information logged might be sufficient to troubleshoot and correct the problem. If the reboot was not caused by a panic, it might be caused by a Fatal Reset error or a RED State Exception. See Troubleshooting Fatal Reset Errors and RED State Exceptions. Also, system ASR and POST settings can determine the system response to certain error conditions. If POST is not invoked during the reboot process, or if the system diagnostics level is not set to max, you might need to run system diagnostics at a higher level of coverage to determine the source of the reboot, if the system message and system console files do not clearly indicate that source.

Troubleshooting a System With the Operating System Responding


This procedure assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.

To Troubleshoot a System With the Operating System Running

1. Log in to the system controller and access the sc> prompt. For information, refer to the Netra 440 Server System Administration Guide.

2. Examine the ALOM event log. Type:
sc> showlogs

The ALOM event log shows system events such as reset events and LED indicator state changes that have occurred since the last system boot. CODE EXAMPLE 7-3 shows a sample event log, which indicates that the front panel Service Required LED is ON.

CODE EXAMPLE 7-3   showlogs Command Output

MAY 09 16:54:27 Sun-SFV440-a: 00060003: "SC System booted."
MAY 09 16:54:27 Sun-SFV440-a: 00040029: "Host system has shut down."
MAY 09 16:56:35 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:56:54 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:58:11 Sun-SFV440-a: 00040001: "SC Request to Power On Host."
MAY 09 16:58:11 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 16:58:13 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS0.POK is now ON"
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS1.POK is now ON"
MAY 09 16:59:19 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:00:46 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:01:51 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:03:22 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:03:22 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now OFF"
MAY 09 17:03:24 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 17:04:30 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:05:59 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:06:40 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:07:44 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.ACT is now ON"
sc>

Note - Time stamps for ALOM logs reflect UTC (Coordinated Universal Time), while time stamps for the Solaris OS reflect local (server) time. Therefore, a single event might generate messages that appear to be logged at different times in different logs.
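When correlating entries across the two logs, it can help to print the same instant on both time scales. A minimal POSIX shell sketch:

```shell
#!/bin/sh
# Print the current instant as ALOM stamps it (UTC) and as the Solaris OS
# stamps it (local server time), to help line up entries across the logs.
utc_now=$(TZ=UTC0 date)
local_now=$(date)
echo "ALOM log time (UTC):      $utc_now"
echo "Solaris log time (local): $local_now"
```

The difference between the two printed times is the offset to apply when matching an ALOM event against /var/adm/messages.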

3. Examine system environment status. Type:


sc> showenvironment

The showenvironment command reports much useful data such as temperature readings; state of system and component LEDs; motherboard voltages; and status of system disks, fans, motherboard circuit breakers, and CPU module DC-to-DC converters. CODE EXAMPLE 7-4, an excerpt of output from the showenvironment command, indicates that the front panel Service Required LED is ON. When reviewing the complete output from the showenvironment command, check the state of all Service Required LEDs and verify that all components show a status of OK. See CODE EXAMPLE 4-1 for a sample of complete output from the showenvironment command.

CODE EXAMPLE 7-4   showenvironment Command Output

System Indicator Status:
-----------------------------------------------------
SYS_FRONT.LOCATE   SYS_FRONT.SERVICE   SYS_FRONT.ACT
-----------------------------------------------------
OFF                ON                  ON
.
.
.
sc>

4. Examine the output of the prtdiag -v command. Type:

sc> console
Enter #. to return to ALOM.
# /usr/platform/`uname -i`/sbin/prtdiag -v

The prtdiag -v command provides access to information stored by POST and OpenBoot Diagnostics tests. Any information from this command about the current state of the system is lost if the system is reset. When examining the output to identify problems, verify that all installed CPU modules, PCI cards, and memory modules are listed; check for any Service Required LEDs that are ON; and verify that the system PROM firmware is the latest version. CODE EXAMPLE 7-5 shows an excerpt of output from the prtdiag -v command. See CODE EXAMPLE 2-8 through CODE EXAMPLE 2-13 for the complete prtdiag -v output from a "healthy" Netra 440 server.

CODE EXAMPLE 7-5   prtdiag -v Command Output

System Configuration: Sun Microsystems  sun4u Netra 440
System clock frequency: 177 MHZ
Memory size: 4GB

==================================== CPUs ====================================
               E$          CPU      CPU       Temperature       Fan
CPU   Freq     Size        Impl.    Mask    Die     Ambient    Speed   Unit
---  --------  ----------  -------  ----  --------  --------   -----
 0   1062 MHz  1MB         US-IIIi  2.3      -         -
 1   1062 MHz  1MB         US-IIIi  2.3      -         -

================================= IO Devices =================================
     Bus   Freq
Brd  Type  MHz   Slot  Name                     Model
---  ----  ----  ----  -----------------------  -----------
 0   pci    66   MB    pci108e,abba (network)   SUNW,pci-ce
 0   pci    33   MB    isa/su (serial)
 0   pci    33   MB    isa/su (serial)
.
.
.
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID   Labels
--------------------------------------------------
0              0         C0/P0/B0/D0,C0/P0/B0/D1
0              1         C0/P0/B1/D0,C0/P0/B1/D1

Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID   Labels
--------------------------------------------------
1              0         C1/P0/B0/D0,C1/P0/B0/D1
1              1         C1/P0/B1/D0,C1/P0/B1/D1
.
.
.
System PROM revisions:
----------------------
OBP 4.10.3 2003/05/02 20:25 Netra 440
OBDIAG 4.10.3 2003/05/02 20:26

#

5. Check the system LEDs.

6. Check the /var/adm/messages file. The following are clear indications of a failing part:

Warning messages from Solaris software about any hardware or software components
ALOM environmental messages about a failing part, including a fan or power supply

If there is no clear indication of a failing part, investigate the installed applications, the network, or the disk configuration. If you have clear indications that a part has failed or is failing, replace that part as soon as possible. If the problem is a confirmed environmental failure, replace the fan or power supply as soon as possible. A system with a redundant configuration might still operate in a degraded state, but the stability and performance of the system will be affected. Since the system is still operational, attempt to isolate the fault using several methods and tools to ensure that the part you suspect as faulty really is causing the problems you are experiencing. See Isolating Faults in the System. For information about installing and replacing field-replaceable parts, refer to the Netra 440 Server Service Manual (817-3883-xx).
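The /var/adm/messages check in step 6 can be partly scripted. The sketch below greps a messages file for warning-level entries; the pattern list is illustrative rather than a complete Solaris message vocabulary, and the two sample lines are fabricated for demonstration only.

```shell
#!/bin/sh
# Sketch: scan a syslog-style messages file for likely failing-part
# entries. The patterns are illustrative, not an exhaustive list of the
# messages Solaris software can log.
scan_messages() {
    grep -iE 'warning|fault|failed|error' "$1"
}

# Demonstration against a fabricated sample (not real system output):
SAMPLE="${TMPDIR:-/tmp}/messages.sample"
cat > "$SAMPLE" <<'EOF'
May  9 14:51:40 host genunix: [ID 936769 kern.notice] service started
May  9 14:52:10 host scsi: [ID 107833 kern.warning] WARNING: disk offline
EOF
scan_messages "$SAMPLE"
```

In practice you would point scan_messages at /var/adm/messages and review each hit in context rather than acting on the match alone.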

Troubleshooting a System After an Unexpected Reboot


This procedure assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.

To Troubleshoot a System After an Unexpected Reboot

1. Log in to the system controller and access the sc> prompt. For information, refer to the Netra 440 Server System Administration Guide.

2. Examine the ALOM event log. Type:

sc> showlogs

The ALOM event log shows system events such as reset events and LED indicator state changes that have occurred since the last system boot. CODE EXAMPLE 7-6 shows a sample event log, which indicates that the front panel Service Required LED is ON.

CODE EXAMPLE 7-6   showlogs Command Output

MAY 09 16:54:27 Sun-SFV440-a: 00060003: "SC System booted."
MAY 09 16:54:27 Sun-SFV440-a: 00040029: "Host system has shut down."
MAY 09 16:56:35 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:56:54 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:58:11 Sun-SFV440-a: 00040001: "SC Request to Power On Host."
MAY 09 16:58:11 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 16:58:13 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS0.POK is now ON"
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS1.POK is now ON"
MAY 09 16:59:19 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:00:46 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:01:51 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:03:22 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:03:22 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now OFF"
MAY 09 17:03:24 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 17:04:30 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:05:59 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:06:40 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:07:44 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.ACT is now ON"
sc>

Note - Time stamps for ALOM logs reflect UTC (Coordinated Universal Time), while time stamps for the Solaris OS reflect local (server) time. Therefore, a single event might generate messages that appear to be logged at different times in different logs.

3. Examine the ALOM run log. Type:


sc> consolehistory run -v

This command shows the log containing the most recent system console output of boot messages from the Solaris OS. When troubleshooting, examine the output for hardware or software errors logged by the operating environment on the system console. CODE EXAMPLE 7-7 shows sample output from the consolehistory run -v command.

CODE EXAMPLE 7-7   consolehistory run -v Command Output

May  9 14:48:22 Sun-SFV440-a rmclomv: SC Login: User admin Logged on.

#
# init 0
#
INIT: New run level: 0
The system is coming down. Please wait.
System services are now being stopped.
Print services stopped.
May  9 14:49:18 Sun-SFV440-a last message repeated 1 time
May  9 14:49:38 Sun-SFV440-a syslogd: going down on signal 15
The system is down.
syncing file systems... done
Program terminated

{1} ok boot disk

Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.

Initializing    1MB of memory at addr    123fecc000 -
Initializing    1MB of memory at addr    123fe02000 -
Initializing   14MB of memory at addr    123f002000 -
Initializing   16MB of memory at addr    123e002000 -
Initializing  992MB of memory at addr    1200000000 -
Initializing 1024MB of memory at addr    1000000000 -
Initializing 1024MB of memory at addr     200000000 -
Initializing 1024MB of memory at addr             0 -

Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args:
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.
Hardware watchdog enabled
Indicator SYS_FRONT.ACT is now ON
configuring IPv4 interfaces: ce0.
Hostname: Sun-SFV440-a
The system is coming up. Please wait.
NIS domainname is Ecd.East.Sun.COM
Starting IPv4 router discovery.
starting rpc services: rpcbind keyserv ypbind done.
Setting netmask of lo0 to 255.0.0.0
Setting netmask of ce0 to 255.255.255.0
Setting default IPv4 interface for multicast: add net 224.0/4: gateway Sun-SFV440-a
syslog service starting.
Print services started.
volume management starting.
The system is ready.

Sun-SFV440-a console login: May  9 14:52:57 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = UNKNOWN
May  9 14:52:57 Sun-SFV440-a rmclomv: Keyswitch Position has changed to Unknown state.
May  9 14:52:58 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = LOCKED
May  9 14:52:58 Sun-SFV440-a rmclomv: KeySwitch Position has changed to Locked State.
May  9 14:53:00 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = NORMAL
May  9 14:53:01 Sun-SFV440-a rmclomv: KeySwitch Position has changed to On State.

sc>

4. Examine the ALOM boot log. Type:


sc> consolehistory boot -v

The ALOM boot log contains boot messages from POST, OpenBoot firmware, and Solaris software from the server's most recent reset. When examining the output to identify a problem, check for error messages from POST and OpenBoot Diagnostics tests. CODE EXAMPLE 7-8 shows the boot messages from POST. Note that POST returned no error messages. See What POST Error Messages Tell You for a sample POST error message and more information about POST error messages.

CODE EXAMPLE 7-8   consolehistory boot -v Command Output (Boot Messages From POST)

Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs
Power-On Reset
Executing Power On SelfTest

0>@(#) Netra[TM] 440 POST 4.10.3 2003/05/04 22:08
/export/work/staff/firmware_re/post/post-build-4.10.3/Fiesta/system/integrated (firmware_re)
0>Hard Powerup RST thru SW
0>CPUs present in system: 0 1
0>OBP->POST Call with %o0=00000000.01012000.
0>Diag level set to MIN.
0>MFG scrpt mode set NORM
0>I/O port set to TTYA.
0>Start selftest...
1>Print Mem Config
1>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
1>Memory interleave set to 0
1>  Bank 0 1024MB : 00000010.00000000 -> 00000010.40000000.
1>  Bank 2 1024MB : 00000012.00000000 -> 00000012.40000000.
0>Print Mem Config
0>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
0>Memory interleave set to 0
0>  Bank 0 1024MB : 00000000.00000000 -> 00000000.40000000.
0>  Bank 2 1024MB : 00000002.00000000 -> 00000002.40000000.
0>INFO:
0>  POST Passed all devices.
0>POST: Return to OBP.

CODE EXAMPLE 7-9 shows the initialization of the OpenBoot PROM.

CODE EXAMPLE 7-9   consolehistory boot -v Command Output (OpenBoot PROM Initialization)

Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs
POST Results: Cpu 0000.0000.0000.0000
  %o0 = 0000.0000.0000.0000  %o1 = ffff.ffff.f00a.2b73  %o2 = ffff.ffff.ffff.ffff
POST Results: Cpu 0000.0000.0000.0001
  %o0 = 0000.0000.0000.0000  %o1 = ffff.ffff.f00a.2b73  %o2 = ffff.ffff.ffff.ffff
Membase: 0000.0000.0000.0000
MemSize: 0000.0000.0004.0000
Init CPU arrays Done
Probing /pci@1d,700000 Device 1  Nothing there
Probing /pci@1d,700000 Device 2  Nothing there

The following sample output shows the system banner.

CODE EXAMPLE 7-10   consolehistory boot -v Command Output (System Banner Display)

Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.

The following sample output shows OpenBoot Diagnostics testing. See What OpenBoot Diagnostics Error Messages Tell You for a sample OpenBoot Diagnostics error message and more information about OpenBoot Diagnostics error messages.

CODE EXAMPLE 7-11   consolehistory boot -v Command Output (OpenBoot Diagnostics Testing)

Running diagnostic script obdiag/normal

Testing /pci@1f,700000/network@1
Testing /pci@1e,600000/ide@d
Testing /pci@1e,600000/isa@7/flashprom@2,0
Testing /pci@1e,600000/isa@7/serial@0,2e8
Testing /pci@1e,600000/isa@7/serial@0,3f8
Testing /pci@1e,600000/isa@7/rtc@0,70
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={gpio@0.42,gpio@0.44,gpio@0.46,gpio@0.48}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={hardware-monitor@0.5c}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={temperature-sensor@0.9c}
Testing /pci@1c,600000/network@2
Testing /pci@1f,700000/scsi@2,1
Testing /pci@1f,700000/scsi@2

The following sample output shows memory initialization by the OpenBoot PROM.

CODE EXAMPLE 7-12   consolehistory boot -v Command Output (Memory Initialization)

Initializing    1MB of memory at addr    123fe02000 -
Initializing   12MB of memory at addr    123f000000 -
Initializing 1008MB of memory at addr    1200000000 -
Initializing 1024MB of memory at addr    1000000000 -
Initializing 1024MB of memory at addr     200000000 -
Initializing 1024MB of memory at addr             0 -

{1} ok boot disk

The following sample output shows the system booting and loading Solaris software.

CODE EXAMPLE 7-13   consolehistory boot -v Command Output (System Booting and Loading Solaris Software)

Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args:
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.11 97/07/10 16:19:15.
Loading: /platform/SUNW,Netra-440/ufsboot
Loading: /platform/sun4u/ufsboot
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.
Hardware watchdog enabled
sc>

5. Check the /var/adm/messages file for indications of an error. Look for the following information about the system's state:

Any large gaps in the time stamps of Solaris software or application messages
Warning messages about any hardware or software components
Information from last root logins to determine whether any system administrators might be able to provide any information about the system state at the time of the hang

6. If possible, check whether the system saved a core dump file. Core dump files provide invaluable information to your support provider to aid in diagnosing any system problems. For further information about core dump files, see The Core Dump Process and "Managing System Crash Information" in the Solaris System Administration Guide.

7. Check the system LEDs. You can use the ALOM system controller to check the state of the system LEDs. Refer to the Netra 440 Server System Administration Guide (817-3884-xx) for information about system LEDs.

8. Examine the output of the prtdiag -v command. Type:

sc> console
Enter #. to return to ALOM.
# /usr/platform/`uname -i`/sbin/prtdiag -v

The prtdiag -v command provides access to information stored by POST and OpenBoot Diagnostics tests. Any information from this command about the current state of the system is lost if the system is reset. When examining the output to identify problems, verify that all installed CPU modules, PCI cards, and memory modules are listed; check for any Service Required LEDs that are ON; and verify that the system PROM firmware is the latest version. CODE EXAMPLE 7-14 shows an excerpt of output from the prtdiag -v command. See CODE EXAMPLE 2-8 through CODE EXAMPLE 2-13 for the complete prtdiag -v output from a "healthy" Netra 440 server.

CODE EXAMPLE 7-14   prtdiag -v Command Output

System Configuration: Sun Microsystems  sun4u Netra 440
System clock frequency: 177 MHZ
Memory size: 4GB

==================================== CPUs ====================================
               E$          CPU      CPU       Temperature       Fan
CPU   Freq     Size        Impl.    Mask    Die     Ambient    Speed   Unit
---  --------  ----------  -------  ----  --------  --------   -----
 0   1062 MHz  1MB         US-IIIi  2.3      -         -
 1   1062 MHz  1MB         US-IIIi  2.3      -         -

================================= IO Devices =================================
     Bus   Freq
Brd  Type  MHz   Slot  Name                     Model
---  ----  ----  ----  -----------------------  -----------
 0   pci    66   MB    pci108e,abba (network)   SUNW,pci-ce
 0   pci    33   MB    isa/su (serial)
 0   pci    33   MB    isa/su (serial)
.
.
.
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID   Labels
--------------------------------------------------
0              0         C0/P0/B0/D0,C0/P0/B0/D1
0              1         C0/P0/B1/D0,C0/P0/B1/D1
.
.
.
System PROM revisions:
----------------------
OBP 4.10.3 2003/05/02 20:25 Netra 440
OBDIAG 4.10.3 2003/05/02 20:26

#

9. Verify that all user and system processes are functional. Type:
# ps -ef

Output from the ps -ef command shows each process, the start time, the run time, and the full process command-line options. To identify a system problem, examine the output for missing entries in the CMD column. CODE EXAMPLE 7-15 shows the ps -ef command output of a "healthy" Netra 440 server.

CODE EXAMPLE 7-15   ps -ef Command Output

UID      PID  PPID  C    STIME TTY      TIME CMD
root       0     0  0 14:51:32 ?        0:17 sched
root       1     0  0 14:51:32 ?        0:00 /etc/init
root       2     0  0 14:51:32 ?        0:00 pageout
root       3     0  0 14:51:32 ?        0:02 fsflush
root     291     1  0 14:51:47 ?        0:00 /usr/lib/saf/sac -t 300
root     205     1  0 14:51:44 ?        0:00 /usr/lib/lpsched
root     312   148  0 14:54:33 ?        0:00 in.telnetd
root     169     1  0 14:51:42 ?        0:00 /usr/lib/autofs/automountd
user1    314   312  0 14:54:33 pts/1    0:00 -csh
root      53     1  0 14:51:36 ?        0:00 /usr/lib/sysevent/syseventd
root      59     1  0 14:51:37 ?        0:02 /usr/lib/picl/picld
root     100     1  0 14:51:40 ?        0:00 /usr/sbin/in.rdisc -s
root     131     1  0 14:51:40 ?        0:00 /usr/lib/netsvc/yp/ypbind -broadcast
root     118     1  0 14:51:40 ?        0:00 /usr/sbin/rpcbind
root     121     1  0 14:51:40 ?        0:00 /usr/sbin/keyserv
root     148     1  0 14:51:42 ?        0:00 /usr/sbin/inetd -s
root     218     1  0 14:51:44 ?        0:00 /usr/lib/power/powerd
root     199     1  0 14:51:43 ?        0:00 /usr/sbin/nscd
root     162     1  0 14:51:42 ?        0:00 /usr/lib/nfs/lockd
daemon   166     1  0 14:51:42 ?        0:00 /usr/lib/nfs/statd
root     181     1  0 14:51:43 ?        0:00 /usr/sbin/syslogd
root     283     1  0 14:51:47 ?        0:00 /usr/lib/dmi/snmpXdmid -s Sun-SFV440-a
root     184     1  0 14:51:43 ?        0:00 /usr/sbin/cron
root     235   233  0 14:51:44 ?        0:00 /usr/sadm/lib/smc/bin/smcboot
root     233     1  0 14:51:44 ?        0:00 /usr/sadm/lib/smc/bin/smcboot
root     245     1  0 14:51:45 ?        0:00 /usr/sbin/vold
root     247     1  0 14:51:45 ?        0:00 /usr/lib/sendmail -bd -q15m
root     256     1  0 14:51:45 ?        0:00 /usr/lib/efcode/sparcv9/efdaemon
root     294   291  0 14:51:47 ?        0:00 /usr/lib/saf/ttymon
root     304   274  0 14:51:51 ?        0:00 mibiisa -r -p 32826
root     274     1  0 14:51:46 ?        0:00 /usr/lib/snmp/snmpdx -y -c /etc/snmp/conf
root     334   292  0 15:00:59 console  0:00 ps -ef
#
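The missing-entry check in step 9 can be scripted against a saved listing. In the sketch below, both the two-line listing and the expected-daemon list are illustrative; in practice, feed it `ps -ef` output captured from the server and the daemons your configuration should be running.

```shell
#!/bin/sh
# Sketch: check a saved `ps -ef` listing for daemons you expect to find.
# The sample listing and the expected-daemon list are both illustrative.
PSFILE="${TMPDIR:-/tmp}/ps.listing"
cat > "$PSFILE" <<'EOF'
root    53   1  0 14:51:36 ?  0:00 /usr/lib/sysevent/syseventd
root   181   1  0 14:51:43 ?  0:00 /usr/sbin/syslogd
EOF

missing=""
for d in syseventd picld syslogd; do
    grep -q "$d" "$PSFILE" || missing="$missing $d"
done
if [ -n "$missing" ]; then
    echo "not in listing:$missing"
else
    echo "all expected daemons present"
fi
```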

10. Verify that all I/O devices and activities are still present and functioning. Type:
# iostat -xtc

This command shows all I/O devices and reports activity for each device. To identify a problem, examine the output for installed devices that are not listed. CODE EXAMPLE 7-16 shows the iostat -xtc command output from a "healthy" Netra 440 server.

CODE EXAMPLE 7-16   iostat -xtc Command Output

                    extended device statistics              tty        cpu
device   r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b   tin tout  us sy wt id
sd0      0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0     0  183   0  2  2 96
sd1      6.5  1.2  49.5   7.9  0.0  0.2   24.6   0   3
sd2      0.2  0.0   0.0   0.0  0.0  0.0    0.0   0   0
sd3      0.2  0.0   0.0   0.0  0.0  0.0    0.0   0   0
sd4      0.2  0.0   0.0   0.0  0.0  0.0    0.0   0   0
nfs1     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0
nfs2     0.0  0.0   0.1   0.0  0.0  0.0    9.6   0   0
nfs3     0.1  0.0   0.6   0.0  0.0  0.0    1.4   0   0
nfs4     0.0  0.0   0.1   0.0  0.0  0.0    5.1   0   0
#
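The same kind of presence check applies to step 10: compare a list of expected devices against saved `iostat -xtc` output. The device names and the two-line sample listing below are illustrative.

```shell
#!/bin/sh
# Sketch: report expected devices that do not appear in saved iostat
# output. The sample listing and expected-device list are illustrative.
IOFILE="${TMPDIR:-/tmp}/iostat.listing"
cat > "$IOFILE" <<'EOF'
sd0   0.0  0.0   0.0   0.0  0.0  0.0   0.0   0   0
sd1   6.5  1.2  49.5   7.9  0.0  0.2  24.6   0   3
EOF

for dev in sd0 sd1 sd2; do
    if grep -q "^$dev " "$IOFILE"; then
        echo "$dev: listed"
    else
        echo "$dev: NOT LISTED - investigate this device"
    fi
done
```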

11. Examine errors pertaining to I/O devices. Type:


# iostat -E

This command reports on errors for each I/O device. To identify a problem, examine the output for any type of error that is more than 0. For example, in CODE EXAMPLE 7-17, iostat -E reports Hard Errors: 2 for I/O device sd0.

CODE EXAMPLE 7-17   iostat -E Command Output

sd0 Soft Errors: 0 Hard Errors: 2 Transport Errors: 0
Vendor: TOSHIBA  Product: DVD-ROM SD-C2612  Revision: 1011  Serial No: 04/17/02
Size: 18446744073.71GB <-1 bytes>
Media Error: 0 Device Not Ready: 2 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207  Serial No: 3JA0BW6Y00002317
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd2 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207  Serial No: 3JA0BRQJ00007316
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd3 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207  Serial No: 3JA0BWL000002318
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd4 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207  Serial No: 3JA0AGQS00002317
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
#
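The nonzero-counter check in step 11 can be automated with awk. The field positions below assume the one-line summary format `sdN Soft Errors: n Hard Errors: n Transport Errors: n`, and the two sample input lines are fabricated for demonstration.

```shell
#!/bin/sh
# Sketch: flag devices in saved `iostat -E` output with nonzero error
# counters. Field positions assume the summary line format
#   sdN Soft Errors: n Hard Errors: n Transport Errors: n
EFILE="${TMPDIR:-/tmp}/iostat-E.listing"
cat > "$EFILE" <<'EOF'
sd0 Soft Errors: 0 Hard Errors: 2 Transport Errors: 0
sd1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
EOF

# Print only the devices whose soft, hard, and transport counters
# do not all equal zero.
awk '/Soft Errors:/ && ($4 + $7 + $10) > 0 {
    printf "%s: soft=%s hard=%s transport=%s\n", $1, $4, $7, $10
}' "$EFILE"
```

Run against the sample above, only sd0 is reported, matching the Hard Errors: 2 condition called out in the text.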

12. Verify that any mirrored RAID devices are functioning. Type:
# raidctl

This command shows the status of RAID devices. To identify a problem, examine the output for Disk Status that is not OK. For more information about configuring mirrored RAID devices, refer to "About Hardware Disk Mirroring" in the Netra 440 Server System Administration Guide (817-3884-xx).

CODE EXAMPLE 7-18 raidctl Command Output

# raidctl
RAID        RAID        RAID        Disk
Volume      Status      Disk        Status
------------------------------------------------------
c1t0d0      RESYNCING   c1t0d0      OK
                        c1t1d0      OK
#
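The "Disk Status not OK" inspection can also be automated. The sketch below assumes the column layout shown above (a two-line header, a dashed rule, then rows whose last field is the disk status); the function name and sample data are illustrative only.

```shell
# Flag raidctl output rows whose volume or disk status is not OK.
# Skips the two header lines and the dashed rule; assumes 4-field rows
# carry "volume status disk status" and 2-field rows only "disk status".
check_raid_status() {
    awk 'NR > 3 && NF > 0 {
        if ($NF != "OK")           print "disk fault: " $0
        if (NF == 4 && $2 != "OK") print $1 " volume status: " $2
    }'
}

raid_sample='RAID        RAID        RAID        Disk
Volume      Status      Disk        Status
------------------------------------------------------
c1t0d0      RESYNCING   c1t0d0      OK
                        c1t1d0      OK'

printf '%s\n' "$raid_sample" | check_raid_status
```

Note that RESYNCING is a transitional volume state rather than a fault, so a report like this is a prompt for a closer look, not proof of failure.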

13. Run an exercising tool such as Sun VTS software or Hardware Diagnostic Suite. See Chapter 5 for information about exercising tools.

14. If this is the first occurrence of an unexpected reboot and the system did not run POST as part of the reboot process, run POST. If ASR is not enabled, now is a good time to enable it. ASR runs POST and OpenBoot Diagnostics tests automatically at reboot, so their results are already available after an unexpected reboot, which saves diagnosis time. Refer to the Netra 440 Server System Administration Guide (817-3884-xx) for more information about ASR and complete instructions for enabling it.

15. Once troubleshooting is complete, schedule maintenance as necessary for any service actions.

Troubleshooting Fatal Reset Errors and RED State Exceptions


This procedure assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide. For more information about Fatal Reset errors and RED State Exceptions, see Responding to Fatal Reset Errors and RED State Exceptions. For a sample Fatal Reset error message, see CODE EXAMPLE 7-1. For a sample RED State Exception message, see CODE EXAMPLE 7-2.

1. Log in to the system controller and access the sc> prompt. For information, refer to the Netra 440 Server System Administration Guide.

2. Examine the ALOM event log. Type:


sc> showlogs

The ALOM event log shows system events such as reset events and LED indicator state changes that have occurred since the last system boot. CODE EXAMPLE 7-19 shows a sample event log, which indicates that the front panel Service Required LED is ON.

CODE EXAMPLE 7-19 showlogs Command Output

MAY 09 16:54:27 Sun-SFV440-a: 00060003: "SC System booted."
MAY 09 16:54:27 Sun-SFV440-a: 00040029: "Host system has shut down."
MAY 09 16:56:35 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:56:54 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:58:11 Sun-SFV440-a: 00040001: "SC Request to Power On Host."
MAY 09 16:58:11 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 16:58:13 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS0.POK is now ON"
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS1.POK is now ON"
MAY 09 16:59:19 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:00:46 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:01:51 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:03:22 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:03:22 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now OFF"
MAY 09 17:03:24 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 17:04:30 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:05:59 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:06:40 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:07:44 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.ACT is now ON"
sc>
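If you save showlogs output to a file, a couple of grep counts give a quick summary of how often the host reset and whether the Service Required LED came on. The sketch below runs against canned sample lines mimicking the log format above; the message strings are assumptions based on that format.

```shell
# Summarise an ALOM event log: count host resets and count the events
# that turned the front panel Service Required LED on.
alom_sample='MAY 09 16:58:11 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:01:51 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:07:44 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.ACT is now ON"'

resets=$(printf '%s\n' "$alom_sample" | grep -c 'Host System has Reset')
service_on=$(printf '%s\n' "$alom_sample" | grep -c 'SYS_FRONT\.SERVICE is now ON')
echo "host resets: $resets, service LED ON events: $service_on"
```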

Note - Time stamps for ALOM logs reflect UTC (Universal Time Coordinated) time, while time stamps for the Solaris OS reflect local (server) time. Therefore, a single event might generate messages that appear to be logged at different times in different logs.

3. Examine the ALOM run log. Type:


sc> consolehistory run -v

This command shows the log containing the most recent system console output of boot messages from the Solaris software. When troubleshooting, examine the output for hardware or software errors logged by the operating system on the system console. CODE EXAMPLE 7-20 shows sample output from the consolehistory run -v command.

CODE EXAMPLE 7-20 consolehistory run -v Command Output

May 9 14:48:22 Sun-SFV440-a rmclomv: SC Login: User admin Logged on.

# init 0
#
INIT: New run level: 0
The system is coming down. Please wait.
System services are now being stopped.
Print services stopped.
May 9 14:49:18 Sun-SFV440-a last message repeated 1 time
May 9 14:49:38 Sun-SFV440-a syslogd: going down on signal 15

The system is down.
syncing file systems... done
Program terminated
{1} ok boot disk

Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.

Initializing 1MB of memory at addr 123fecc000 -
Initializing 1MB of memory at addr 123fe02000 -
Initializing 14MB of memory at addr 123f002000 -
Initializing 16MB of memory at addr 123e002000 -
Initializing 992MB of memory at addr 1200000000 -
Initializing 1024MB of memory at addr 1000000000 -
Initializing 1024MB of memory at addr 200000000 -
Initializing 1024MB of memory at addr 0 -

Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args:

\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.
Hardware watchdog enabled
Indicator SYS_FRONT.ACT is now ON
configuring IPv4 interfaces: ce0.
Hostname: Sun-SFV440-a
The system is coming up. Please wait.
NIS domainname is Ecd.East.Sun.COM
Starting IPv4 router discovery.
starting rpc services: rpcbind keyserv ypbind done.
Setting netmask of lo0 to 255.0.0.0
Setting netmask of ce0 to 255.255.255.0
Setting default IPv4 interface for multicast: add net 224.0/4: gateway Sun-SFV440-a
syslog service starting.
Print services started.
volume management starting.
The system is ready.

Sun-SFV440-a console login: May 9 14:52:57 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = UNKNOWN
May 9 14:52:57 Sun-SFV440-a rmclomv: Keyswitch Position has changed to Unknown state.
May 9 14:52:58 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = LOCKED
May 9 14:52:58 Sun-SFV440-a rmclomv: KeySwitch Position has changed to Locked State.
May 9 14:53:00 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = NORMAL
May 9 14:53:01 Sun-SFV440-a rmclomv: KeySwitch Position has changed to On State.
sc>

4. Examine the ALOM boot log. Type:


sc> consolehistory boot -v

The ALOM boot log contains boot messages from POST, OpenBoot firmware, and Solaris software from the server's most recent reset. When examining the output to identify a problem, check for error messages from POST and OpenBoot Diagnostics tests. CODE EXAMPLE 7-21 shows the boot messages from POST. Note that POST returned no error messages. See What POST Error Messages Tell You for a sample POST error message and more information about POST error messages.

CODE EXAMPLE 7-21 consolehistory boot -v Command Output (Boot Messages From POST)

Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs
Power-On Reset
Executing Power On SelfTest

0>@(#) Netra[TM] 440 POST 4.10.3 2003/05/04 22:08
/export/work/staff/firmware_re/post/post-build4.10.3/Fiesta/system/integrated (firmware_re)
0>Hard Powerup RST thru SW
0>CPUs present in system: 0 1
0>OBP->POST Call with %o0=00000000.01012000.
0>Diag level set to MIN.
0>MFG scrpt mode set NORM
0>I/O port set to TTYA.
0>
0>Start selftest...
1>Print Mem Config
1>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
1>Memory interleave set to 0
1> Bank 0 1024MB : 00000010.00000000 -> 00000010.40000000.
1> Bank 2 1024MB : 00000012.00000000 -> 00000012.40000000.
0>Print Mem Config
0>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
0>Memory interleave set to 0
0> Bank 0 1024MB : 00000000.00000000 -> 00000000.40000000.
0> Bank 2 1024MB : 00000002.00000000 -> 00000002.40000000.
0>INFO:
0> POST Passed all devices.
0>
0>POST: Return to OBP.

The following output shows the initialization of the OpenBoot PROM.

CODE EXAMPLE 7-22 consolehistory boot -v Command Output (OpenBoot PROM Initialization)

Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs
POST Results: Cpu 0000.0000.0000.0000
  %o0 = 0000.0000.0000.0000  %o1 = ffff.ffff.f00a.2b73  %o2 = ffff.ffff.ffff.ffff
POST Results: Cpu 0000.0000.0000.0001
  %o0 = 0000.0000.0000.0000  %o1 = ffff.ffff.f00a.2b73  %o2 = ffff.ffff.ffff.ffff
Membase: 0000.0000.0000.0000
MemSize: 0000.0000.0004.0000
Init CPU arrays Done
Probing /pci@1d,700000 Device 1    Nothing there
Probing /pci@1d,700000 Device 2    Nothing there

The following sample output shows the system banner.

CODE EXAMPLE 7-23 consolehistory boot -v Command Output (System Banner Display)

Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.

The following sample output shows OpenBoot Diagnostics testing. See What OpenBoot Diagnostics Error Messages Tell You for a sample OpenBoot Diagnostics error message and more information about OpenBoot Diagnostics error messages.

CODE EXAMPLE 7-24 consolehistory boot -v Command Output (OpenBoot Diagnostics Testing)

Running diagnostic script obdiag/normal
Testing /pci@1f,700000/network@1
Testing /pci@1e,600000/ide@d
Testing /pci@1e,600000/isa@7/flashprom@2,0
Testing /pci@1e,600000/isa@7/serial@0,2e8
Testing /pci@1e,600000/isa@7/serial@0,3f8
Testing /pci@1e,600000/isa@7/rtc@0,70
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={gpio@0.42,gpio@0.44,gpio@0.46,gpio@0.48}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={hardware-monitor@0.5c}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={temperature-sensor@0.9c}
Testing /pci@1c,600000/network@2
Testing /pci@1f,700000/scsi@2,1
Testing /pci@1f,700000/scsi@2

The following sample output shows memory initialization by the OpenBoot PROM.

CODE EXAMPLE 7-25 consolehistory boot -v Command Output (Memory Initialization)

Initializing 1MB of memory at addr 123fe02000 -
Initializing 12MB of memory at addr 123f000000 -
Initializing 1008MB of memory at addr 1200000000 -
Initializing 1024MB of memory at addr 1000000000 -
Initializing 1024MB of memory at addr 200000000 -
Initializing 1024MB of memory at addr 0 -

{1} ok boot disk

The following sample output shows the system booting and loading the Solaris software.

CODE EXAMPLE 7-26 consolehistory boot -v Command Output (System Booting and Loading Solaris Software)
Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args:
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.11 97/07/10 16:19:15.
Loading: /platform/SUNW,Netra-440/ufsboot
Loading: /platform/sun4u/ufsboot
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.
Hardware watchdog enabled
sc>

5. Check the /var/adm/messages file for indications of an error. Look for the following information about the system's state:
Any large gaps in the time stamps of Solaris software or application messages
Warning messages about any hardware or software components
Information from last root logins to determine whether any system administrators might be able to provide any information about the system state at the time of the hang
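A first pass over the messages file can be scripted. The sketch below simply counts WARNING lines in a canned /var/adm/messages-style sample; on a live system you would point the grep at /var/adm/messages itself, and the sample log lines here are illustrative.

```shell
# Count WARNING lines in a /var/adm/messages-style log.
msg_sample='May  9 14:52:57 host rmclomv: NOTICE: keyswitch change event
May  9 14:53:10 host genunix: WARNING: /pci@1f,700000/scsi@2 (glm0): SCSI bus reset
May  9 14:53:12 host su: su root succeeded for user1 on /dev/pts/1'

warnings=$(printf '%s\n' "$msg_sample" | grep -ic 'warning')
echo "warning lines: $warnings"
```

Counting is only a starting point; the flagged lines themselves, and any large timestamp gaps around them, are what need reading.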

6. If possible, check whether the system saved a core dump file. Core dump files provide invaluable information to your support provider to aid in diagnosing any system problems. For further information about core dump files, see The Core Dump Process and "Managing System Crash Information" in the Solaris System Administration Guide.

7. Check the system LEDs. You can use the ALOM system controller to check the state of the system LEDs. Refer to the Netra 440 Server System Administration Guide (817-3884-xx) for information about system LEDs.

8. Examine the output of the prtdiag -v command. Type:


sc> console
Enter #. to return to ALOM.
# /usr/platform/`uname -i`/sbin/prtdiag -v

The prtdiag -v command provides access to information stored by POST and OpenBoot Diagnostics tests. Any information from this command about the current state of the system is lost if the system is reset. When examining the output to identify problems, verify that all installed CPU modules, PCI cards, and memory modules are listed; check for any Service Required LEDs that are ON; and verify that the system PROM firmware is the latest version. CODE EXAMPLE 7-27 shows an excerpt of output from the prtdiag -v command. See CODE EXAMPLE 2-8 through CODE EXAMPLE 2-13 for the complete prtdiag -v output from a "healthy" Netra 440 server.

CODE EXAMPLE 7-27 prtdiag -v Command Output

System Configuration: Sun Microsystems sun4u Netra 440
System clock frequency: 177 MHZ
Memory size: 4GB

==================================== CPUs ====================================
          E$          CPU      CPU   Temperature   Fan
CPU Freq  Size        Impl.    Mask  Die  Ambient  Speed  Unit
--- ----  ----------  -------  ----  ---  -------  -----  ----
 0  1062 MHz  1MB     US-IIIi  2.3
 1  1062 MHz  1MB     US-IIIi  2.3

================================= IO Devices =================================
     Bus   Freq
Brd  Type  MHz   Slot  Name                     Model
---  ----  ----  ----  -----------------------  -----------
 0   pci   66    MB    pci108e,abba (network)   SUNW,pci-ce
 0   pci   33    MB    isa/su (serial)
 0   pci   33    MB    isa/su (serial)
.
.
.
Memory Module Groups:
--------------------------------------------------
ControllerID  GroupID  Labels
--------------------------------------------------
0             0        C0/P0/B0/D0,C0/P0/B0/D1
0             1        C0/P0/B1/D0,C0/P0/B1/D1
.
.
.
System PROM revisions:
----------------------
OBP 4.10.3 2003/05/02 20:25 Netra 440
OBDIAG 4.10.3 2003/05/02 20:26
#
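One of the checks above, that all installed CPU modules are listed, can be scripted against saved prtdiag output. The sketch below counts CPU rows in a canned sample mimicking the CPU section; the expected count of 2 is an assumption for this two-way configuration, and the sample rows are illustrative.

```shell
# Verify that the expected number of CPU modules appears in prtdiag output.
prtdiag_sample='0   1062 MHz  1MB  US-IIIi  2.3
1   1062 MHz  1MB  US-IIIi  2.3'

cpus=$(printf '%s\n' "$prtdiag_sample" | grep -c 'US-IIIi')
expected=2
if [ "$cpus" -eq "$expected" ]; then
    echo "all $expected CPU modules listed"
else
    echo "only $cpus of $expected CPU modules listed"
fi
```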

9. Verify that all user and system processes are functional. Type:
# ps -ef

Output from the ps -ef command shows each process, the start time, the run time, and the full process command-line options. To identify a system problem, examine the output for missing entries in the CMD column. CODE EXAMPLE 7-28 shows the ps -ef command output of a "healthy" Netra 440 server.

CODE EXAMPLE 7-28 ps -ef Command Output

UID     PID  PPID  C  STIME    TTY      TIME  CMD
root      0     0  0  14:51:32 ?        0:17  sched
root      1     0  0  14:51:32 ?        0:00  /etc/init
root      2     0  0  14:51:32 ?        0:00  pageout
root      3     0  0  14:51:32 ?        0:02  fsflush
root    291     1  0  14:51:47 ?        0:00  /usr/lib/saf/sac -t 300
root    205     1  0  14:51:44 ?        0:00  /usr/lib/lpsched
root    312   148  0  14:54:33 ?        0:00  in.telnetd
root    169     1  0  14:51:42 ?        0:00  /usr/lib/autofs/automountd
user1   314   312  0  14:54:33 pts/1    0:00  -csh
root     53     1  0  14:51:36 ?        0:00  /usr/lib/sysevent/syseventd
root     59     1  0  14:51:37 ?        0:02  /usr/lib/picl/picld
root    100     1  0  14:51:40 ?        0:00  /usr/sbin/in.rdisc -s
root    131     1  0  14:51:40 ?        0:00  /usr/lib/netsvc/yp/ypbind -broadcast
root    118     1  0  14:51:40 ?        0:00  /usr/sbin/rpcbind
root    121     1  0  14:51:40 ?        0:00  /usr/sbin/keyserv
root    148     1  0  14:51:42 ?        0:00  /usr/sbin/inetd -s
root    226     1  0  14:51:44 ?        0:00  /usr/lib/utmpd
root    218     1  0  14:51:44 ?        0:00  /usr/lib/power/powerd
root    199     1  0  14:51:43 ?        0:00  /usr/sbin/nscd
root    162     1  0  14:51:42 ?        0:00  /usr/lib/nfs/lockd
daemon  166     1  0  14:51:42 ?        0:00  /usr/lib/nfs/statd
root    181     1  0  14:51:43 ?        0:00  /usr/sbin/syslogd
root    283     1  0  14:51:47 ?        0:00  /usr/lib/dmi/snmpXdmid -s Sun-SFV440-a
root    184     1  0  14:51:43 ?        0:00  /usr/sbin/cron
root    235   233  0  14:51:44 ?        0:00  /usr/sadm/lib/smc/bin/smcboot
root    233     1  0  14:51:44 ?        0:00  /usr/sadm/lib/smc/bin/smcboot
root    245     1  0  14:51:45 ?        0:00  /usr/sbin/vold
root    247     1  0  14:51:45 ?        0:00  /usr/lib/sendmail -bd -q15m
root    256     1  0  14:51:45 ?        0:00  /usr/lib/efcode/sparcv9/efdaemon
root    294   291  0  14:51:47 ?        0:00  /usr/lib/saf/ttymon
root    304   274  0  14:51:51 ?        0:00  mibiisa -r -p 32826
root    274     1  0  14:51:46 ?        0:00  /usr/lib/snmp/snmpdx -y -c /etc/snmp/conf
root    334   292  0  15:00:59 console  0:00  ps -ef
root    281     1  0  14:51:47 ?        0:00  /usr/lib/dmi/dmispd
root    282     1  0  14:51:47 ?        0:00  /usr/dt/bin/dtlogin -daemon
root    292     1  0  14:51:47 console  0:00  -sh
root    324   314  0  14:54:51 pts/1    0:00  -sh
#
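Checking for "missing entries in the CMD column" can be partially automated by grepping captured ps -ef output for daemons that should always be present. The sketch below uses a canned two-line sample and an illustrative daemon list; adjust the list for your own site's expected process set.

```shell
# Check captured ps -ef output for daemons that should always be running.
ps_sample='root 199 1 0 14:51:43 ? 0:00 /usr/sbin/nscd
root 181 1 0 14:51:43 ? 0:00 /usr/sbin/syslogd'

missing=""
for d in /usr/sbin/syslogd /usr/sbin/nscd /usr/sbin/inetd; do
    printf '%s\n' "$ps_sample" | grep -q -- "$d" || missing="$missing $d"
done
echo "missing daemons:$missing"
```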

10. Verify that all I/O devices and activities are still present and functioning. Type:
# iostat -xtc

This command shows all I/O devices and reports activity for each device. To identify a problem, examine the output for installed devices that are not listed. CODE EXAMPLE 7-29 shows the iostat -xtc command output from a "healthy" Netra 440 server.

CODE EXAMPLE 7-29 iostat -xtc Command Output

                      extended device statistics               tty        cpu
device   r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b  tin tout  us sy wt id
sd0      0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0    0  183   0  2  2 96
sd1      6.5  1.2  49.5   7.9  0.0  0.2   24.6   0   3
sd2      0.2  0.0   0.0   0.0  0.0  0.0    0.0   0   0
sd3      0.2  0.0   0.0   0.0  0.0  0.0    0.0   0   0
sd4      0.2  0.0   0.0   0.0  0.0  0.0    0.0   0   0
nfs1     0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0
nfs2     0.0  0.0   0.1   0.0  0.0  0.0    9.6   0   0
nfs3     0.1  0.0   0.6   0.0  0.0  0.0    1.4   0   0
nfs4     0.0  0.0   0.1   0.0  0.0  0.0    5.1   0   0
#
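Beyond checking that devices are present, the same output can be scanned for disks that look overloaded. The sketch below flags rows whose average service time exceeds a threshold; the 20 ms cutoff is an illustrative assumption, not a Sun recommendation, and the column positions assume the layout shown above.

```shell
# Flag disks whose average service time (svc_t, column 8) exceeds a threshold.
busy_disks() {
    awk 'NR > 1 && $8 + 0 > 20 { printf "%s: svc_t=%s\n", $1, $8 }'
}

xtc_sample='device   r/s  w/s  kr/s  kw/s wait actv  svc_t  %w  %b
sd0      0.0  0.0   0.0   0.0  0.0  0.0    0.0   0   0
sd1      6.5  1.2  49.5   7.9  0.0  0.2   24.6   0   3'

printf '%s\n' "$xtc_sample" | busy_disks
```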

11. Examine errors pertaining to I/O devices. Type:


# iostat -E

This command reports on errors for each I/O device. To identify a problem, examine the output for any type of error that is more than 0. For example, in CODE EXAMPLE 7-30, iostat -E reports Hard Errors: 2 for I/O device sd0.

CODE EXAMPLE 7-30 iostat -E Command Output

sd0      Soft Errors: 0 Hard Errors: 2 Transport Errors: 0
Vendor: TOSHIBA  Product: DVD-ROM SD-C2612  Revision: 1011 Serial No: 04/17/02
Size: 18446744073.71GB <-1 bytes>
Media Error: 0 Device Not Ready: 2 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd1      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0BW6Y00002317
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd2      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0BRQJ00007316
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd3      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0BWL000002318
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd4      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0AGQS00002317
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
#

12. Check your system Product Notes and the SunSolve Online Web site for the latest information, driver updates, and Free Info Docs for the system.

13. Check the system's recent service history. A system that has had several recent Fatal Reset errors and subsequent FRU replacements should be monitored closely to determine whether the recently replaced parts were, in fact, not faulty, and whether the actual faulty hardware has gone undetected.

Troubleshooting a System That Does Not Boot


A system might be unable to boot due to hardware or software problems. If you suspect that the system is unable to boot for software reasons, refer to "Troubleshooting Miscellaneous Software Problems" in the Solaris System Administration Guide: Advanced Administration. If you suspect the system is unable to boot due to a hardware problem, use the following procedure to determine the possible causes.

This procedure assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.

1. Log in to the system controller and access the sc> prompt. For information, refer to the Netra 440 Server System Administration Guide.

2. Examine the ALOM event log. Type:
sc> showlogs

The ALOM event log shows system events such as reset events and LED indicator state changes that have occurred since the last system boot. To identify problems, examine the output for Service Required LEDs that are ON. CODE EXAMPLE 7-31 shows a sample event log, which indicates that the front panel Service Required LED is ON.

CODE EXAMPLE 7-31 showlogs Command Output

MAY 09 16:54:27 Sun-SFV440-a: 00060003: "SC System booted."
MAY 09 16:54:27 Sun-SFV440-a: 00040029: "Host system has shut down."
MAY 09 16:56:35 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:56:54 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:58:11 Sun-SFV440-a: 00040001: "SC Request to Power On Host."
MAY 09 16:58:11 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 16:58:13 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS0.POK is now ON"
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS1.POK is now ON"
MAY 09 16:59:19 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:00:46 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:01:51 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:03:22 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:03:22 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now OFF"
MAY 09 17:03:24 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 17:04:30 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:05:59 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:06:40 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:07:44 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.ACT is now ON"
sc>

3. Examine the ALOM run log. Type:


sc> consolehistory run -v

This command shows the log containing the most recent system console output of boot messages from the Solaris OS. When troubleshooting, examine the output for hardware or software errors logged by the operating system on the system console. CODE EXAMPLE 7-32 shows sample output from the consolehistory run -v command.

CODE EXAMPLE 7-32 consolehistory run -v Command Output

May 9 14:48:22 Sun-SFV440-a rmclomv: SC Login: User admin Logged on.

# init 0
#
INIT: New run level: 0
The system is coming down. Please wait.
System services are now being stopped.
Print services stopped.
May 9 14:49:18 Sun-SFV440-a last message repeated 1 time
May 9 14:49:38 Sun-SFV440-a syslogd: going down on signal 15

The system is down.
syncing file systems... done
Program terminated
{1} ok boot disk

Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.

Initializing 1MB of memory at addr 123fecc000 -
Initializing 1MB of memory at addr 123fe02000 -
Initializing 14MB of memory at addr 123f002000 -
Initializing 16MB of memory at addr 123e002000 -
Initializing 992MB of memory at addr 1200000000 -
Initializing 1024MB of memory at addr 1000000000 -
Initializing 1024MB of memory at addr 200000000 -
Initializing 1024MB of memory at addr 0 -

Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args:
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.

Hardware watchdog enabled
Indicator SYS_FRONT.ACT is now ON
configuring IPv4 interfaces: ce0.
Hostname: Sun-SFV440-a
The system is coming up. Please wait.
NIS domainname is Ecd.East.Sun.COM
Starting IPv4 router discovery.
starting rpc services: rpcbind keyserv ypbind done.
Setting netmask of lo0 to 255.0.0.0
Setting netmask of ce0 to 255.255.255.0
Setting default IPv4 interface for multicast: add net 224.0/4: gateway Sun-SFV440-a
syslog service starting.
Print services started.
volume management starting.
The system is ready.

Sun-SFV440-a console login: May 9 14:52:57 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = UNKNOWN
May 9 14:52:57 Sun-SFV440-a rmclomv: Keyswitch Position has changed to Unknown state.
May 9 14:52:58 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = LOCKED
May 9 14:52:58 Sun-SFV440-a rmclomv: KeySwitch Position has changed to Locked State.
May 9 14:53:00 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = NORMAL
May 9 14:53:01 Sun-SFV440-a rmclomv: KeySwitch Position has changed to On State.
sc>

Note - Time stamps for ALOM logs reflect UTC (Universal Time Coordinated) time, while time stamps for the Solaris OS reflect local (server) time. Therefore, a single event might generate messages that appear to be logged at different times in different logs.

Note - The ALOM system controller runs independently from the system and uses standby power from the server. Therefore, ALOM firmware and software continue to function when power to the machine is turned off.

4. Examine the ALOM boot log. Type:


sc> consolehistory boot -v

The ALOM boot log contains boot messages from POST, OpenBoot firmware, and the Solaris software from the server's most recent reset. When examining the output to identify a problem, check for error messages from POST and OpenBoot Diagnostics tests. CODE EXAMPLE 7-33 shows the boot messages from POST. Note that POST returned no error messages. See What POST Error Messages Tell You for a sample POST error message and more information about POST error messages.

CODE EXAMPLE 7-33 consolehistory boot -v Command Output (Boot Messages From POST)

Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs
Power-On Reset
Executing Power On SelfTest

0>@(#) Netra[TM] 440 POST 4.10.3 2003/05/04 22:08
/export/work/staff/firmware_re/post/post-build4.10.3/Fiesta/system/integrated (firmware_re)
0>Hard Powerup RST thru SW
0>CPUs present in system: 0 1
0>OBP->POST Call with %o0=00000000.01012000.
0>Diag level set to MIN.
0>MFG scrpt mode set NORM
0>I/O port set to TTYA.
0>
0>Start selftest...
1>Print Mem Config
1>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
1>Memory interleave set to 0
1> Bank 0 1024MB : 00000010.00000000 -> 00000010.40000000.
1> Bank 2 1024MB : 00000012.00000000 -> 00000012.40000000.
0>Print Mem Config
0>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
0>Memory interleave set to 0
0> Bank 0 1024MB : 00000000.00000000 -> 00000000.40000000.
0> Bank 2 1024MB : 00000002.00000000 -> 00000002.40000000.
0>INFO:
0> POST Passed all devices.
0>
0>POST: Return to OBP.

5. Turn the system control rotary switch to the Diagnostics position.

6. Power on the system. If the system does not boot, the system might have a basic hardware problem. If you have not made any recent hardware changes to the system, contact your authorized service provider.

7. If the system gets to the ok prompt but does not load the operating system, you might need to change the boot-device setting in the system firmware. See Using OpenBoot Information Commands for information about using the probe commands. You can use the probe commands to display information about active SCSI and IDE devices. For information on changing the default boot device, refer to the Solaris System Administration Guide: Basic Administration.

a. Try to load the operating system for a single user from a CD. Place a valid Solaris OS CD into the system DVD-ROM or CD-ROM drive and enter boot cdrom -s from the ok prompt.

b. If the system boots from the CD and loads the operating system, check the following:
If the system normally boots from a system hard disk, check the system disk for problems and a valid boot image.
If the system normally boots from the network, check the system network configuration, the system Ethernet cables, and the system network card.

c. If the system gets to the ok prompt but does not load the operating system from the CD, check the following:

The OpenBoot variable settings (boot-device, diag-device, and auto-boot?)
The OpenBoot PROM device tree (see show-devs Command for more information)
That the banner was displayed before the ok prompt
Any diagnostic test failure or other hardware failure message before the ok prompt was displayed

Troubleshooting a System That Is Hanging

This procedure assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.

To Troubleshoot a System That Is Hanging

1. Verify that the system is hanging.

a. Type the ping command to determine whether there is any network activity.

b. Type the ps -ef command to determine whether any other user sessions are active or responding. If another user session is active, use it to review the contents of the /var/adm/messages file for any indications of the system problem.

c. Try to access the system console through the ALOM system controller. If you can establish a working system console connection, the problem might not be a true hang but might instead be a network-related problem. For suspected network problems, use the ping, rlogin, or telnet commands to reach another system that is on the same sub-network, hub, or router. If NFS services are served by the affected system, determine whether NFS activity is present on other systems.

d. Change the system control rotary switch position while observing the system console. For example, turn the rotary switch from the Normal position to the Diagnostics position, or from the Locked position to the Normal position. If the system console logs the change of rotary switch position, the system is not fully hung.

2. If there are no responding user sessions, record the state of the system LEDs. The system LEDs might indicate a hardware failure in the system. You can use the ALOM system controller to check the state of the system LEDs. Refer to the Netra 440 Server System Administration Guide (817-3884-xx) for more information about system LEDs.

3. Attempt to bring the system to the ok prompt. For instructions, refer to the Netra 440 Server System Administration Guide. If the system can get to the ok prompt, the hang can be classified as a soft hang; otherwise, it can be classified as a hard hang. See Responding to System Hang States for more information.

4. If the preceding step failed to bring the system to the ok prompt, execute an externally initiated reset (XIR).

Executing an XIR resets the system while preserving its state from before the reset, so that indications and messages about transient errors might be saved. An XIR is the equivalent of issuing a direct hardware reset. For further information about XIR, refer to the Netra 440 Server System Administration Guide.

5. If an XIR brings the system to the ok prompt, do the following.

a. Issue the printenv command. This command displays the settings of the OpenBoot configuration variables.

b. Set the auto-boot? variable to true, the diag-switch? variable to true, the diag-level variable to max, and the post-trigger and obdiag-trigger variables to all-resets.

c. Issue the sync command to obtain a core dump file. Core dump files provide invaluable information to your support provider to aid in diagnosing any system problems. For further information about core dump files, see The Core Dump Process and "Managing System Crash Information" in the Solaris System Administration Guide, which is part of the Solaris System Administrator Collection. The system reboots automatically provided that the OpenBoot configuration auto-boot? variable is set to true (the default value).

Note - Steps 3, 4, and 5 occur automatically when the hardware watchdog mechanism is enabled.

6. If an XIR failed to bring the system to the ok prompt, follow these steps:

a. Turn the system control rotary switch to the Diagnostics position. This forces the system to run POST and OpenBoot Diagnostics tests during system startup.

b. Press the system Power button for five seconds. This causes an immediate hardware shutdown.

c. Wait at least 30 seconds; then power on the system by pressing the Power button.

Note - You can also use the ALOM system controller to set the POST and OpenBoot Diagnostics levels, and to power off and reboot the system. Refer to the Advanced Lights Out Manager Software User's Guide for the Netra 440 Server (817-5481-xx).

7. Use the POST and OpenBoot Diagnostics tests to diagnose system problems. When the system initiates the startup sequence, it will run POST and OpenBoot Diagnostics tests. See Isolating Faults Using POST Diagnostics and Isolating Faults Using Interactive OpenBoot Diagnostics Tests.
8. Review the contents of the /var/adm/messages file. Look for the following information about the system's state:
   - Any large gaps in the time stamps of Solaris software or application messages
   - Warning messages about any hardware or software components
   - Information from the last root logins, to determine whether any system administrators might be able to provide information about the system state at the time of the hang
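The first item above, large gaps in message time stamps, is easy to check mechanically. The following is a rough sketch (my own, not from the Netra documentation) that assumes the default "Mon DD HH:MM:SS ..." syslog time-stamp format; it treats every month as 31 days and ignores year boundaries, which is close enough for spotting multi-hour gaps:

```shell
# check_gaps [seconds] -- read a syslog-format stream on stdin and flag
# adjacent entries that are more than the given number of seconds apart
# (default 3600).  Crude: months are treated as 31 days, years ignored.
check_gaps() {
    awk -v gap="${1:-3600}" '
    BEGIN {
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
        for (i = 1; i <= 12; i++) mon[m[i]] = i
    }
    {
        # fields: $1 month name, $2 day of month, $3 HH:MM:SS
        split($3, t, ":")
        now = ((mon[$1] * 31 + $2) * 24 + t[1]) * 3600 + t[2] * 60 + t[3]
        if (NR > 1 && now - prev > gap)
            printf "gap of %d seconds before: %s\n", now - prev, $0
        prev = now
    }'
}
```

For example, check_gaps 3600 < /var/adm/messages would list every point where logging stopped for more than an hour.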

9. If possible, check whether the system saved a core dump file. Core dump files provide invaluable information to your support provider to aid in diagnosing any system problems. For further information about core dump files, see The Core Dump Process and "Managing System Crash Information" in the Solaris System Administration Guide, which is part of the Solaris System Administrator Collection.

Solaris Patching
Applying patches has evolved into a complex, time-consuming process. Knowing which software updates and patches to use, and when to apply them, has become an important cost-saving measure. To minimize the cost while maintaining a reasonable level of risk, Sun provides Recommended Patch Clusters, the most common patching solution for the enterprise (so-called "blind patching"). While this strategy does address patching and compliance issues, it also introduces more change to the system than is necessary. Therefore, for critical servers you might choose to apply patches only to address specific issues or needs, and not apply patches merely to keep current. Without understanding what those patches provide and what production or security problems they fix, the jury is out about the benefits of "blind patching" for critical servers. But the fashion now is to patch everything, and this trend accelerated with SOX. Solaris Recommended Patch Clusters do not upgrade Solaris to the next minor revision (for example, from 04/04 to 04/08); you stay on the same revision you were on.

The most common commands for managing patches are:

   install_cluster -- installs a patch cluster. This is the command used for installing the Recommended Cluster, the most common patching method in Solaris. You can also manage patches through the Solaris Management Console, but the command line is simpler.
   showrev -p -- lists installed patches.
   patchadd -- installs uncompressed patches.
   patchrm -- removes installed patches.

Important directories:

   /var/sadm/pkg -- used to save the base packages.
   /var/sadm/patch -- contains the list of installed patches. The creation date of the directory for a particular patch is actually the date of patch installation.

SunSolve is the main location to get patches. The Recommended patch cluster can be downloaded via FTP or HTTP from a browser; HTTP might be helpful if you are behind a firewall. You should use install_cluster for the installation of the Recommended patch cluster. Installing individual patches is a more tedious and time-consuming task: each patch must be downloaded, placed on each server, uncompressed, untarred, and installed, and the leftover archives removed. The Recommended patch cluster is a zip file with the OS version as a prefix; for example, for Solaris 9 it will be 9_Recommended.zip. Before installing any patches, you need to verify that the /var filesystem has sufficient space; a Recommended patch cluster may need up to 0.5 GB. The availability of free space can be checked by executing df -k /var. It might make sense to remove extra logs from this filesystem to free more space. Generally you need to download the Recommended Patch Cluster into a partition that has a lot of space, for example /home or /opt. It is a bad idea to download it to the /tmp partition, as it is mapped to memory, or to the root partition, as it is often (and should be) pretty small (less than 1 GB). Some administrators allocate several gigabytes to the root partition, but generally that is a bad idea. Moreover, I would recommend converting the root home directory to /root as is done in Red Hat, but that's another story (see hardening for details).

To determine the current set of patches installed, you can use the command:
showrev -p | more.

The resulting long listing shows each patch and its revision level. To find out whether a particular patch is installed, use grep, for example:
showrev -p | grep 116268

Each patch is also identified with a revision number, separated by a dash from the patch number. It is only necessary to install the most current revision: 116268-05 indicates patch revision 05, and all revisions lower than 05 would be considered obsolete. The patch number typically does not change, but the revision number changes with every new release, which makes it easy to identify new releases. The kernel patch level can be verified by typing uname -a. This command shows the current OS release and the current kernel patch installed; it can be executed by any user and does not require root privileges. Determining the completeness of patching and the kernel revision status on dozens of servers is a time-consuming process. You can split the audit into two parts: data collection and report generation. Data collection can be made into a monthly cron job. Using the SunSolve Web site, one can try to describe a problem and locate patches that might solve it, or at least list current bug reports about the problem. Once you have a list of possible patches, use the showrev -p command to see which patches are currently installed on the system and remedy any differences. As always, be sure you have a working backup of your system before making major changes (such as installing patches). Note that, rarely, patches that are installed do not appear in the showrev -p list. These are patches which include only firmware and make no system code changes; because there are no software changes, they are not recorded as installed patches. The SunSolve Web site can send a notification when any document of interest changes. Another means of gaining patch information is to join Sun's mailing lists.
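The data-collection half of such an audit can be scripted. Below is a minimal sketch (the function name and the awk logic are mine, not Sun's) that reads showrev -p output on stdin and reports the highest installed revision of a given patch base number:

```shell
# patch_rev BASE -- read "showrev -p" output on stdin and print the highest
# installed revision of the patch with the given base number.
patch_rev() {
    awk -v id="$1" '
    $1 == "Patch:" {
        split($2, p, "-")            # $2 looks like 116268-05
        if (p[1] == id && p[2] + 0 > best + 0) best = p[2]
    }
    END { if (best != "") print id "-" best }'
}
```

Typical usage would be showrev -p | patch_rev 116268, run from a cron job on each server with the output collected centrally.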

Configuring Telnet / FTP to login as root (Solaris)


by Jeff Hunter, Sr. Database Administrator

Before getting into the details of how to configure Solaris for root logins, keep in mind that this is VERY BAD security practice. Make sure that you NEVER configure your production servers for this type of login.

Configure Telnet for root logins

Simply edit the file /etc/default/login and comment out the CONSOLE line as follows:
# If CONSOLE is set, root can only login on that device.
# Comment this line out to allow remote login by root.
#
# CONSOLE=/dev/console

Configure FTP for root logins

First remove the 'root' line from /etc/ftpusers. Also, don't forget to edit the file /etc/ftpaccess and comment out the 'deny-uid' and 'deny-gid' lines. If that file doesn't exist, there is no need to create it.
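The first step can be sketched as a tiny shell function (mine, not part of Solaris); the file name is a parameter so you can try it on a copy before touching the real /etc/ftpusers:

```shell
# allow_root_ftp FILE -- delete the root entry from an ftpusers-style file.
# Writes to a temporary copy first, so a failed grep does not clobber FILE.
allow_root_ftp() {
    f=$1
    grep -v '^root$' "$f" > "$f.new" && mv "$f.new" "$f"
}
```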

Configuring the Solaris FTP server


Sun Solaris FTP. Sun Solaris comes with an FTP daemon based on WU-FTPd, the Washington University FTP daemon project. I am not very enthusiastic about the vulnerabilities discovered in it over the years, and it has been rather abandoned by its developers, but it still ships by default, and as long as Sun is OK with that, it is OK with me too. Below I will briefly introduce configuring it for local user access as well as anonymous access. By default the FTP daemon (in.ftpd) is disabled. Here is the initial state:
root@Solaris# svcs ftp
STATE          STIME    FMRI
disabled        7:21:44 svc:/network/ftp:default

As ftpd is an inetd-managed daemon, more information can be queried with inetadm:
root@Solaris# inetadm -l svc:/network/ftp:default
SCOPE    NAME=VALUE
         name="ftp"
         endpoint_type="stream"
         proto="tcp6"
         isrpc=FALSE
         wait=FALSE
         exec="/usr/sbin/in.ftpd -a"
         user="root"
default  bind_addr=""
default  bind_fail_max=-1
default  bind_fail_interval=-1
default  max_con_rate=-1
default  max_copies=-1
default  con_rate_offline=-1
default  failrate_cnt=40
default  failrate_interval=60
default  inherit_env=TRUE
default  tcp_trace=FALSE
default  tcp_wrappers=FALSE
default  connection_backlog=10

Insecure, you say? Well, you are right; let's tighten it a bit. Enable more detailed logging:
root@Solaris# inetadm -m svc:/network/ftp:default tcp_trace=TRUE
root@Solaris# inetadm -l svc:/network/ftp
SCOPE    NAME=VALUE
         name="ftp"
         endpoint_type="stream"
         proto="tcp6"
         isrpc=FALSE
         wait=FALSE
         exec="/usr/sbin/in.ftpd -a"
         user="root"
default  bind_addr=""
default  bind_fail_max=-1
default  bind_fail_interval=-1
default  max_con_rate=-1
default  max_copies=-1
default  con_rate_offline=-1
default  failrate_cnt=40
default  failrate_interval=60
default  inherit_env=TRUE
         tcp_trace=TRUE
default  tcp_wrappers=FALSE
default  connection_backlog=10

When the -a execution option is given (and it is by default), ftpd consults the /etc/ftpd/ftpaccess file for additional restrictions and tweaks. Here are a few that are worth enabling. Uncomment the following lines to get more verbose logging:
log transfers real,guest,anonymous inbound,outbound
xferlog format %T %Xt %R %Xn %XP %Xy %Xf %Xd %Xm %U ftp %Xa %u %Xc %Xs %Xr

Make sure these changes are applied


root@Solaris# svcadm refresh svc:/network/ftp:default
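Once transfer logging is on, the resulting xferlog can be summarized with a one-liner. A sketch (assuming the classic wu-ftpd xferlog layout, where field 8 is the transfer size in bytes and field 14 is the user name; check your own log before relying on the field positions):

```shell
# xfer_bytes -- read a wu-ftpd style xferlog on stdin and print the total
# bytes transferred per user name (field 14), using the size in field 8.
xfer_bytes() {
    awk '{ total[$14] += $8 }
         END { for (u in total) print u, total[u] }'
}
```

For example, xfer_bytes < /var/log/xferlog gives a quick per-user traffic report.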

Configure anonymous access. All the configuration so far allows only valid local users to connect by FTP; they are automatically placed in their respective home directories. To allow anonymous FTP access with a dedicated chrooted folder, there is a special tool. Actually, it is just one script that does all the hard work behind the scenes: it creates the ftp user, creates the directory tree, sets up the needed permissions, and sets up the chrooted environment for the anonymous ftp user.
root@Solaris# ftpconfig /export/home/ftp_pub

Updating user ftp
Creating directory /export/home/ftp_pub
Updating directory /export/home/ftp_pub

That is all; now you can log in anonymously and download anything from the /export/home/ftp_pub/pub directory. To also allow uploads there, change the upload option in /etc/ftpd/ftpaccess and set the permissions on the pub directory accordingly at the Solaris level (777):
upload class=anonusers * /pub yes
#upload class=anonusers * * no nodirs

And finally enable it


root@Solaris# svcadm enable ftp

Solaris 10 Role Based Access Control (RBAC)


Lecture Notes: The problem with the traditional model is not just that root (the superuser) is so powerful, but that regular user accounts are not powerful enough to fix their own problems. There were some limited attempts to address this problem in Unix in the past (the wheel group and immutable file attributes in BSD, sudo, extended attributes (ACLs), etc.), but Role Based Access Control (RBAC) as implemented in Solaris 10 is probably the most constructive way to address this complex problem in its entirety. RBAC is not a completely new thing. Previous versions of RBAC with more limited capabilities existed for more than ten years in earlier versions of Solaris: it was introduced in Trusted Solaris and was later incorporated into Solaris 8. It was improved, and several additional predefined roles were introduced, in Solaris 9. Still, those implementations generally fell short of expectations, and only the Solaris 10 implementation has the qualities necessary for enterprise adoption of this feature. Among the predefined roles there are several that are immediately useful and usable:
1. All -- provides a role access to commands without security attributes: all commands that do not need root permission to run in a regular Solaris system (Solaris without an RBAC implementation).
2. Primary Administrator -- administrator role that is equivalent to the root user.
3. System Administrator -- secondary administrators who can administer users (add and remove user accounts, etc.). Has the solaris.admin.usermgr.read and solaris.admin.usermgr.write authorizations, which provide read/write access to user configuration files. Cannot change passwords.
4. Operator -- has few security-related capabilities but is still capable of mounting volumes. Also has the solaris.admin.usermgr.read authorization, which provides read access to user configuration files.
5. Basic Solaris User -- enables users to perform tasks that are not related to security.
6. Printer Management -- dedicated to printer administration.

But the original implementation had severe limitations in defining new roles, which blocked wide adoption of this feature: in practice, most system commands that were needed for roles had to be run as root. Still, even the old implementation that existed up to Solaris 9 had a sudo-style capability: one-time execution of a command with specific (additional) privileges, accomplished via the pfexec command. If the user is assigned the "Primary Administrator" profile, then the pfexec command becomes an almost exact replica of typical sudo usage. Also, if a role has no password, then the switch of context does not require additional authentication (only authorized users can assume roles); that can be convenient for some application roles. There were several problems with early RBAC implementations:
   - Limited flexibility in constructing new roles.
   - Hidden dangers of running selected commands with root privileges (the danger that is typified by sudo).
   - Fuzzy interaction of the RBAC facility with the extended attributes facility (ACLs).
   - The four flat files that the Solaris 8 implementation introduced suggested somewhat questionable quality of engineering.

All in all, Solaris RBAC until version 10 had limited appeal to most organizations and, unless there was a strong push from the top, was considered by most administrators too complex to be implemented properly. It also has some deficiencies even in comparison with sudo. The only "cheap and sure" application of the old RBAC implementation was conversion of the root account to a role and conversion of operator accounts to operator roles. Conversion of application-related accounts like oracle into roles was also possible, but more problematic. That changed with Solaris 10, when the RBAC model was extended with the Solaris privileges model, which extended the ability to create new custom roles by assigning very granular privileges to them. Previously such tricks needed heavy usage of ACLs, and like any ACL-based solution they were both expensive and high-maintenance. There are three distinct properties of roles:
1. A role is not accessible for normal logins (root is a classic example of an account that should not be accessible by normal login; most application accounts fall into the same category).
2. Users can gain access to a role only by explicitly changing their identity via the su command, an activity that is logged and can be checked for compliance.
3. A role account uses a special shell (pfksh or pfsh). Please note that bash is not on the list :-)

Each user can assume as many roles as is required for him to perform his responsibilities (one at a time) and switch to a particular role for performing the subset of operations that are provided for that role. Theoretically, an administrator can map user responsibilities into a set of roles and then grant users the ability to "assume" only those roles that match their job responsibilities. And no user should beg for root access any longer :-) But the devil is in the details: even with Solaris 10's power, this is easier said than done. Role engineering is a pretty tough subject in itself, even when the technical capabilities are there, and it requires time and money to implement properly. Still, it looks like Solaris 10 was the first Unix that managed to break the old Unix dichotomy of "root and everybody else". In this sense Solaris 10 is the first XXI century Unix. The privilege model that was incorporated in RBAC made it more flexible and useful, surpassing sudo in most respects. One-time execution of a command, for example vi, with additional privileges still remains a problem, as the command can have a backdoor to a shell. Like its predecessor sudo, Solaris RBAC provides the ability to selectively package superuser capabilities for assignment to user accounts by giving them packages of the appropriate privileges. For example, the need for the root account can be diminished by dividing its capabilities into several packages and assigning them separately to individuals sharing administrative responsibilities (still, root remains a very powerful account, as it owns the most important files). It might be useful to distinguish between the following notions:
   - Authorization -- a right that is used to grant access to a restricted function.
   - Profile -- a mechanism used for grouping authorizations and commands for subsequent assignment to a role or to a user. You can assign one or several profiles to a role.
   - Role -- a special type of user account that you cannot log in to directly; you can only su to it. It is intended for application accounts and is sometimes useful as a container for performing a set of administrative tasks.
   - Role shell -- a special shell (for example pfksh instead of ksh) that gives the shell the capability to consult the RBAC database before execution of a command. Please note that bash can't be used as a role shell.

RBAC relies on a database that consists of four flat files (the naming suggests that Microsoft agents penetrated Sun on a large scale ;-), as the proper way to group related configuration files in Unix is to use a common prefix, like rbac_user, rbac_prof, rbac_exec, rbac_auth; but Unix is flexible, and you can create such links and forget about this problem):

   /etc/user_attr (main RBAC file)
   /etc/security/prof_attr (rights profile attributes/authorizations)
   /etc/security/exec_attr (profile execution attributes)
   /etc/security/auth_attr (authorization attributes)

As usual, the syntax is pretty wild and is a testimony that at Sun the left hand does not know what the right is doing. Essentially this is another mini-language in a family of approximately a hundred mini-languages that Sun boldly introduced for configuration files, while naively expecting that administrators stay with Solaris no matter what perverse syntax is used in "yet another configuration file" (TM) :-). Here are some details on those configuration files:
1. /etc/user_attr (main RBAC file, essentially the extension of /etc/passwd)

This file lists the accounts that are roles and associates regular users with roles. An entry consists of the type of the account (type=), an authorizations list (auths=), and profiles (profiles=, which is an indirect way to assign authorizations). If the type is normal, then the account is a regular traditional Unix account, for example:
root::::type=normal;auths=solaris.*,solaris.grant

If the type is "role", then this is the new type of account -- a role account. For example:

datesetter::::type=role;profiles=Date Management

By default, all Solaris users are granted the Basic Solaris User profile. The default profile, stored in /etc/security/policy.conf, is applicable to all accounts that do not have an explicit assignment. The effective profile for normal users can also be changed; for example, for the user Joe Doers the profile can be changed to Date Management:

doerj::::type=normal;profiles=Date Management
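Given this syntax, a quick way to audit role assignments is to parse /etc/user_attr with awk. A sketch (the function name is mine; it reads the file on stdin so it can be tried on sample data first):

```shell
# list_roles -- read /etc/user_attr style input on stdin; report which
# accounts are roles and which users may assume them.
list_roles() {
    awk -F: '
    /^#/ || NF < 5 { next }                 # skip comments and junk lines
    $NF ~ /type=role/ { print "role: " $1 }
    {
        # the last field is a ;-separated key=value attribute list
        n = split($NF, kv, ";")
        for (i = 1; i <= n; i++)
            if (kv[i] ~ /^roles=/)
                print "user " $1 " may assume: " substr(kv[i], 7)
    }'
}
```

For example, list_roles < /etc/user_attr gives a one-screen summary of the role setup.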

2. /etc/security/prof_attr (rights profile attributes/authorizations). Associates names of rights profiles (or simply profiles, although the term is confusing) with sets of authorizations. Only authorizations listed in /etc/security/auth_attr are allowed.

This is a slightly problematic implementation, as a rights profile is essentially a parameterless macro that is substituted with a predefined set of authorizations. But surprisingly, the only form of this macro definition is a plain-vanilla list. There are no wildcard or regular-expression capabilities for specifying them. Also, there is no way to deny a certain lower-level authorization while granting a higher-level authorization. For example, I cannot specify expressions like (solaris.admin.usermgr.* minus solaris.admin.usermgr.write). There is also no possibility to grant global access to a specific operation, something reading like solaris.*.*.read. In general, I see no attempt to incorporate the access-control logistics typical for TCP wrappers, firewalls, and similar programs. That makes creation of a profile less flexible than it should be, but hopefully this is not that frequent an operation anyway, and you can write Perl scripts that generate any combination of authorizations you want quite easily, so the damage is minor. As I mentioned in my lecture before, there are several predefined rights profiles (all of them can be modified by the sysadmin):
   - All -- rights profile that provides a role access to commands without security attributes. In a non-RBAC system, these commands would be all commands that do not need root permission to run.
   - Primary Administrator -- rights profile designed specifically for the Primary Administrator role. In a non-RBAC system, this role would be equivalent to the root user.
   - System Administrator -- rights profile designed specifically for a junior-level System Administrator role. The System Administrator rights profile uses discrete supplementary profiles to create a powerful role.
   - Operator -- rights profile designed specifically for the Operator role. The Operator rights profile uses a few discrete supplementary profiles to create a basic role.
   - Basic Solaris User -- rights profile that enables users to perform tasks that are not related to security.
   - Printer Management -- rights profile dedicated to the single area of printer administration.
Each profile consists of one or more authorizations. For example, here is the Basic Solaris User entry:

Basic Solaris User:::Automatically assigned rights:auths=solaris.profmgr.read,solaris.jobs.users,solaris.mail.mailq,solaris.admin.usermgr.read,solaris.admin.logsvc.read,solaris.admin.fsmgr.read,solaris.admin.serialmgr.read,solaris.admin.diskmgr.read,solaris.admin.procmgr.user,solaris.compsys.read,solaris.admin.printer.read,solaris.admin.prodreg.read,solaris.admin.dcmgr.read,solaris.snmp.read,solaris.project.read,solaris.admin.patchmgr.read,solaris.network.hosts.read,solaris.admin.volmgr.read;profiles=All;help=RtDefault.html

Note: a profile is referred to by its mnemonic name, not a dotted representation. There are many Sun-supplied profiles (30 in Solaris 9), for example:
   - Primary Administrator (profile that permits performing all administrative tasks)
   - Basic Solaris User (default profile assigned to new accounts)
   - Operator (can perform simple administrative tasks)

3. /etc/security/exec_attr (profile execution attributes) -- this is a sudo-style file that defines the commands assigned to a profile and the EUID and EGID under which they run. The fields in the /etc/security/exec_attr database are separated by colons:

name:policy:type:res1:res2:id:attr
   name -- the name of the profile. Profile names are case sensitive.
   policy -- the security policy associated with this entry. In Solaris 9, suser (the superuser policy model) is the only valid policy entry.
   type -- the type of entity whose attributes are specified. The only valid type is cmd (command).
   res1, res2 -- reserved for future use.
   id -- a string identifying the entity. You can use the asterisk (*) wildcard. Commands should have the full path or a path with a wildcard. To specify arguments, write a script with the arguments and point the id to the script.
   attr -- an optional list of key-value pairs that describes the security attributes to apply to the entity when executed. You can specify zero or more keys. The list of valid keywords depends on the policy being enforced. There are four valid keys: euid, uid, egid, and gid.
      euid and uid -- contain a single user name or a numeric user ID. Commands designated with euid run with the effective UID indicated, which is similar to setting the setuid bit on an executable file. Commands designated with uid run with both the real and effective UIDs set to the UID you specify.
      egid and gid -- contain a single group name or numeric group ID. Commands designated with egid run with the effective GID indicated, which is similar to setting the setgid bit on an executable file. Commands designated with gid run with both the real and effective GIDs set to the GID you specify.
For example:

Date Management:suser:cmd:::/usr/bin/date:euid=0

adds to the profile "Date Management" the ability to execute the command /usr/bin/date with effective UID 0.

4. /etc/security/auth_attr (authorization attributes) -- this is a system-generated static file that predefines a hierarchical set of authorizations available on a particular system (92 in Solaris 9, 126 in Solaris 10). Authorizations are structured like DNS names, with dots separating each constituent. Authorizations for the Solaris OE use solaris as a prefix. The suffix indicates what is being authorized, typically the functional area and operation, for example grant, delete, or modify. When there is no suffix (that is, the authname consists of a prefix, a functional area, and ends with a period), the authname serves as a heading for use by applications in their GUI rather than as an authorization. The authname solaris.printmgr. is an example of a heading. When an authname ends with the word grant, the authname serves as a grant authorization and lets the user delegate related authorizations (that is, authorizations with the same prefix and functional area) to other users. The authname solaris.printmgr.grant is an example of a grant authorization: it gives the user the right to delegate such authorizations as solaris.printmgr.admin and solaris.printmgr.nobanner to other users.

Only system programmers can add entries to this database. It also identifies the help file that explains a particular privilege set.

For example:
solaris.admin.usermgr.:::User Accounts::help=AuthUsermgrHeader.html
solaris.admin.usermgr.write:::Manage Users::help=AuthUsermgrWrite.html
solaris.admin.usermgr.read:::View Users and Roles::help=AuthUsermgrRead.html
solaris.admin.usermgr.pswd:::Change Password::help=AuthUserMgrPswd.html
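The naming convention just described (trailing period = GUI heading, trailing .grant = delegation right, anything else = ordinary authorization) can be expressed as a small shell helper; a sketch of mine, not a Solaris tool:

```shell
# auth_kind AUTHNAME -- classify an authorization name by its suffix,
# following the auth_attr naming convention described above.
auth_kind() {
    case "$1" in
        *.)      echo heading ;;        # e.g. solaris.printmgr.
        *.grant) echo grant ;;          # e.g. solaris.printmgr.grant
        *)       echo authorization ;;  # e.g. solaris.admin.usermgr.read
    esac
}
```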

5. In addition to those four files, /etc/security/policy.conf lets you grant specific rights profiles and authorizations to all users; it essentially provides system default authorizations for all users. Entries consist of key-value pairs, for example:

AUTHS_GRANTED=solaris.device.cdrw
PROFS_GRANTED=Basic Solaris User

The solaris.device.cdrw authorization provides access to the cdrw command:

# grep solaris.device.cdrw /etc/security/auth_attr
solaris.device.cdrw:::CD-R/RW Recording Authorizations::help=DevCDRW.html

The Basic Solaris User profile grants users access to all listed authorizations. Paradoxically, RBAC can be more useful for application accounts than for "human" accounts. That means that for a large organization the optimal plan for conversion to RBAC is to first convert system and application accounts to roles.
Among the system accounts are root and operator. Both can (and probably should) be converted to roles, even in previous versions of Solaris. By application accounts we mean the accounts used for structuring permissions and launching processes for particular enterprise software (like Oracle, WebSphere, Apache, Sendmail, BIND, etc.), because the privilege requirements for those accounts are static.

The main command line tools include:


   roleadd -- adds a role account to the system.
   rolemod -- modifies a role's login information.
   useradd -- adds a user account to the system.

Additional commands that you can use with RBAC operations.


   auths -- displays authorizations for a user.
   pam_roles -- the role account management module for PAM (Pluggable Authentication Modules). Checks for authorization to assume a role.
   pfexec -- executes commands with the attributes specified in the exec_attr database.
   roles -- displays roles granted to a user.
   roleadd -- adds a role account to the system.
   roledel -- deletes a role account from the system.
   rolemod -- modifies a role's account information in the system.

Commands that have role-related options:


   useradd -- adds a user account to the system. Use the -R option to assign a role to a user's account.
   userdel -- deletes a user's login from the system.
   usermod -- modifies a user's account information in the system.

Related commands:

   makedbm -- makes a dbm file.
   nscd -- the name service cache daemon, which is useful for caching the user_attr, prof_attr, and exec_attr databases.

Old News ;-)


[Dec 10, 2006] Sys Admin: System Security in Solaris(TM) 10 -- Privileges and Zones in Perspective, Part 1, by Peter van der Weerd

What Is RBAC Again?

A user has two UIDs: a Real UID and an Effective UID. This is not entirely true because Posix and System V Unixes introduced a third UID, the Saved UID. We don't need the Saved UID for this discussion, though.

At login time, both UIDs are identical, based on the entry in the passwd-file. When a user starts a second process, the Effective UID may differ from the Real UID. To accomplish this, a user would typically run the su command. When a regular user with, say UID 2001, runs su, he will be asked for the superuser password, and a new process will be created. This process has Real UID 2001 and Effective UID 0. This means that for the lifetime of this new process, the user runs all his instructions with Effective UID 0. The user will have to know the password of the UID he wants to switch to. Root would be the only one that can switch to a regular user without knowing that user's password. Having to know the other user's password is a major flaw in the use of su. RBAC combines users, roles, profiles, authorizations, and execute attributes. A user can su to a role, thus enabling himself to perform certain instructions based on the role's profile. The profile is connected to an "execute" right. This execute right is a specific command that will be executed with a specific Effective UID. It is good practice to connect multiple profiles to one role so as to allow a user to do multiple administration tasks once he has su-ed to that role. As with su, the user must know the password of the role he wants to su to, but his execute rights will be much more limited on the basis of the exec attributes he is entitled to:
User -> Role -> Profile -> Exec Attribute (command with EUID)

Here's an example:

First, a regular user called "baseuser" is created on an x86 machine called "solx":
solx# useradd -m -d /export/home/baseuser baseuser
64 blocks
solx# passwd baseuser
New Password: <password here>
Re-enter new Password: <password again>
passwd: password successfully changed for baseuser
solx# grep baseuser /etc/passwd
baseuser:x:5007:1::/export/home/baseuser:/bin/sh

Second, a role is added to the system:


solx# roleadd -m -d /export/home/reboot reboot
64 blocks
solx# passwd reboot
New Password: <password here>
Re-enter new Password: <password again>
passwd: password successfully changed for reboot
solx# grep reboot /etc/passwd
reboot:x:5008:1::/export/home/reboot:/bin/pfsh

Notice the shell (/bin/pfsh). This shell enables a user to execute commands in a profile. It is not a shell you can log in to.

Connect the user "baseuser" to the role "reboot":


solx# usermod -R reboot baseuser
solx# grep baseuser /etc/user_attr
baseuser::::type=normal;roles=reboot

Create a profile:

solx# echo "REBOOT:::profile to reboot:help=reboot.html" >> \
/etc/security/prof_attr

Connect the profile "REBOOT" to the role "reboot":


solx# rolemod -P REBOOT reboot
solx# grep reboot /etc/user_attr
reboot::::type=role;profiles=REBOOT
baseuser::::type=normal;roles=reboot

So, there is a user called "baseuser" connected to the role "reboot". The role "reboot" has a profile called "REBOOT". All that is left to do is to make sure that the profile (REBOOT) will allow the role (reboot) to execute /usr/sbin/reboot with the correct EUID of 0:
solx# echo "REBOOT:suser:cmd:::/usr/sbin/reboot:euid=0" >> \
/etc/security/exec_attr

Now, the baseuser can log in, su to the role "reboot" and run /usr/sbin/reboot to reboot the machine. Whether in fact you would want to allow anybody to run reboot to reset a machine instead of running shutdown is beyond the scope of this article.
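Assuming the role and profile were created exactly as above, the whole chain can be verified from a baseuser login session; the profiles(1) listing shown here is illustrative and may vary with the policy.conf defaults on your system:

$ su reboot
Password: <role password here>
$ profiles
REBOOT
Basic Solaris User
All
$ /usr/sbin/reboot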

All of this is still based on being somebody as a Unix user. A user that is not connected to a role will not be able to assume that role; a role not connected to a profile will not be able to run that profile; and a profile not connected to the proper execute attribute will not be able to run the command that it has in mind. A lot of files with a lot of colons.

Privileges

Privileges work on a different level -- the process level, which is maintained by the kernel. This is a big difference: RBAC works in userland, related to UIDs and permissions; privileges work on a kernel level, where UIDs and file permissions are bypassed. This means that a regular user can, if his process is granted the privilege, read a file that is only readable by root on account of the UID in the file's inode. Or, on the other hand, a user that has read access to a directory may not be able to read anything anymore, because the privilege to fork has been revoked from the process's privilege list. To list the total number of privileges, you can run ppriv:
solx# ppriv -l | wc -l
48

Any of these 48 privileges can be connected to any single process on your system. To make things easier, Sun Microsystems grouped these privileges into four sets: the Effective set, the Permitted set, the Inheritable set, and the Limit set.

The Effective set is the set of privileges that are currently in effect: it holds the privileges that a process has at runtime. If you must, you can compare it to the Effective UID. The Permitted set is the set of privileges a process can maximally obtain; this resembles the Real UID. A privilege can be added to the Effective set only if it is part of the Permitted set. The Inheritable set is the set of privileges that will be inherited by, or passed on to, sub-processes. A privilege not in the Inheritable set will be in neither the Effective nor the Permitted set of a sub-process.

The Limit set is the set of privileges that a process and its children may obtain. These privileges can be "promoted" to the Permitted set of a process, and from there on upwards, to both the Effective and Inherited sets.
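As a sketch of how the Limit set constrains descendants (the PID and output here are hypothetical), removing a privilege from L does not take it away from the current process, but no child started afterwards can ever obtain it:

solx# ppriv -s L-file_dac_write 1774
solx# ppriv 1774
1774: -sh
flags = none
        E: basic
        I: basic
        P: basic
        L: all,!file_dac_write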

To query the sets and privileges at your disposal, run ppriv again:
solx$ ppriv $$
2141: -sh
flags = none
        E: basic
        I: basic
        P: basic
        L: all
solx$ ppriv -v $$
2141: -sh
flags = none
        E: file_link_any,proc_exec,proc_fork,proc_info,proc_session
        I: file_link_any,proc_exec,proc_fork,proc_info,proc_session
        P: file_link_any,proc_exec,proc_fork,proc_info,proc_session
        L: contract_event,contract_observer,cpc_cpu,dtrace_kernel, (output skipped)

Changing the privileges of a running process doesn't seem very useful, because when the user exits and logs in again, all changes will be gone. Nevertheless, here are some examples to give an idea of how privilege sets are changed.

Here is an example of revoking PRIV_PROC_FORK from the Effective set of PID 1774:
solx# ppriv -s E-proc_fork 1774
solx# ppriv 1774
1774: -sh
flags = none
        E: basic,!proc_fork
        I: basic
        P: basic
        L: all

In this example, process 1774 will get a permission denied message every time it tries to fork. Try typing ls, for example.

You can allow PID 1882 to read any file on the system by adding PRIV_FILE_DAC_READ to the sets:
solx# ppriv -s EIP+file_dac_read 1882
solx# ppriv 1882
1882: -sh
flags = none
        E: basic,file_dac_read
        I: basic,file_dac_read
        P: basic,file_dac_read
        L: all

The process with PID 1882 is allowed to read any file on the system, irrespective of the EUID and file permissions.
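Assuming PID 1882 is an interactive shell owned by a regular user, the effect can be seen from inside that shell; the file contents and id output below are illustrative:

$ id
uid=5007(baseuser) gid=1(other)
$ cat /etc/shadow
root:<hash>:6445::::::
(output skipped)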

What we really want is to make these privileges more permanent for particular users or applications. We want these privileges to be set at login time, to make sure that every process created by that user has the desired set of privileges. To achieve that goal, we can add them to /etc/user_attr, next to the roles entry:
solx# usermod -K defaultpriv=basic,-file_link_any baseuser
solx# grep baseuser /etc/user_attr
baseuser::::type=normal;defaultpriv=basic,-file_link_any;roles=reboot

In this example, user "baseuser" will not be able to create any hardlinks to files that he does not own. The reduced privilege set will be in effect at login time of the user:
solx# su - baseuser
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
$ ln /etc/hosts myhosts
ln: cannot create link myhosts: Not owner

The user can get another confirmation of the lack of a privilege by using the debug option of ppriv:
$ ppriv -eD ln /etc/hosts myhost
ln[9842]: missing privilege "file_link_any" (output skipped)

So far so good. Apparently RBAC and privileges are two different concepts that work on different levels with "System Security" as a binding factor.

But what about zones? We should discuss the place of zones in all this. Next month, in Part 2 of this article, I will show how to combine privileges and zones to create a secure environment for the Apache Web server.
[Nov 10, 2006] London OpenSolaris User's Group RBAC Security Overview by Darren Moffat

October 2006 (London) This talk provides an overview of the security features found in the Solaris 10 OS and provides a more in-depth treatment of the Solaris Role-based Access Control (RBAC) facility.
[Dec 22, 2005] Custom Roles Using RBAC in the Solaris OS by Kristopher M. March, July 2005

In the next example of using RBAC, I am setting up a role to allow the Oracle user to run a script that rolls over some web server log files, which are generally owned by root. The script, called rollover_logs.sh, is typically used by DBAs during weekend maintenance, when the web server software is stopped and started, to clean up space on the disk drives. In preparation, I make sure the script has been tested, remove the oracle:dba ownership, and replace it with root ownership. Changing ownership prevents the script from being modified to perform other actions. Since this task is often done at obscure times, the DBAs have access to run it at their leisure. I provide the details of this script below.

This example is very similar to the previous role setup, except that we use completely different role names and user accounts, and I create a unique profile for each script located in different directories. I use orassist for the role, and then assign that role to the Oracle user account. Oracle_Assistant is the profile name and defines what script will be run by the Oracle user. If there were another script, possibly located in another directory, I would create another profile using Oracle_Assistant as the profile name, and add the location of the new script to the /etc/security/exec_attr file. Specifying multiple tasks or scripts under the same profile name is a standard practice. Just remember to keep them separated by placing each new task on its own line in the exec_attr file.
Rolename: orassist
Username: oracle
Profile Name: Oracle_Assistant

1. Using a role name of your choice, use roleadd to create the role:
# roleadd -u 2100 -g 10 -d /export/home/orassist -m orassist

Assign a password to the role account.


# passwd orassist

2. Create a profile to allow the oracle user to run scripts in a specified location. Edit the /etc/security/prof_attr file and enter the following line:
Oracle_Assistant:::Permit Oracle user to run Oracle scripts:

Save and exit the file.

3. The next step is to create the execution attribute entry in /etc/security/exec_attr. This defines the task or script to be run and the uid the role will assume when running this task. Note that we specify the profile at the start of the line. Note: There are seven fields in this file, each separated by a colon; only fields one, two, three, six, and seven should be populated at this time. Insert the execution attributes entry in /etc/security/exec_attr as follows:
Oracle_Assistant:suser:cmd:::<location of script>:uid=0

4. Add the Oracle_Assistant and All [1] profiles to the role as follows:
# rolemod -P Oracle_Assistant,All orassist

5. The rolemod command updates the /etc/user_attr file. Verify that the changes have been made.
# more /etc/user_attr
oracle::::type=normal;roles=orassist
orassist::::type=role;profiles=Oracle_Assistant,All

Here you can see that the orassist role has been assigned to the oracle user account. The Oracle_Assistant profile is now associated with the orassist role.

6. Use the usermod command to assign the role account to the oracle user:
# usermod -R orassist <user_name>    (insert oracle here)

With the script owned by root and tested several times, I have peace of mind in turning it over to another user. I'll continue to use RBAC to assist those who have tasks that require superuser privileges. I have included instructions for the user on how to gain access to the role and execute the script.

Instructions for User: You must be logged in as the required account (in this case, it is oracle). Switch user to the Oracle Assistant role as shown here:
$ su - orassist
Password: <enter password>

Once you are authenticated, the command to run the script is as follows:
$ /<location of script>/rollover_logs.sh

To exit the role and return to original user, type exit.


Disabling Roles

Roles are similar to user accounts. They require a password to be authenticated. The role feature can be disabled easily by manually disabling or locking the role account. One way this is done is by placing an "LK" in the encrypted password field in the /etc/shadow file. Example:
orassist:LK:12937:7:56:7:::
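Editing /etc/shadow by hand works, but the same lock can be applied with passwd(1); assuming the orassist role from the example above (the status output shown is illustrative):

# passwd -l orassist
# passwd -s orassist
orassist  LK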

Script Example (As Described in Scenario Two)


#!/bin/ksh
# set today's date extension
cleandate=`date '+%c%m%d' | awk '{ print $5 }'`
echo $cleandate
#
# Roll over logs: save copy with current date,
# create new (empty) file
#
# renaming Apache logs must be done by root
#
cd <APACHE_HOME>/Apache/Apache/logs
cp access_log access_log.$cleandate
echo > access_log
cp error_log error_log.$cleandate
echo > error_log
cp ssl_request_log ssl_request_log.$cleandate
echo > ssl_request_log

Profiles in the Solaris OS

The next and final section reviews several RBAC profiles that ship with the Solaris OS. You can easily create new roles and add these existing profiles to them to create unique combinations for your users, junior administrators, and even backup administrators. As you were following along with the examples above, you probably noticed the long list of profiles while editing the /etc/security/prof_attr file. These profiles have been created to aid in system administration functions. By comparing the prof_attr and exec_attr files, you can easily distinguish which profiles are allowed to do what. Knowing this information will allow you to implement them.

Most IT data centers employ computer room operators who have the responsibility to monitor activities that take place on numerous servers and mainframes. These folks are often involved in the data backup process, which makes them good candidates for new roles that allow them to run backup and tape related commands. For example, a system administrator could create a new role called bkadmin and add the Media Backup profile (shown below), which would give bkadmin access to tar files and to mount and unmount tapes.
$ more /etc/security/exec_attr | grep Backup
Media Backup:suser:cmd:::/usr/bin/mt:euid=0
Media Backup:suser:cmd:::/usr/sbin/tar:euid=0
Media Backup:suser:cmd:::/usr/lib/ufs/ufsdump:euid=0;gid=sys
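The prof_attr side of the same profile can be checked the same way, which is how you compare what a profile claims to be with what it can execute; the description and help-file name shown here are illustrative:

$ grep "^Media Backup" /etc/security/prof_attr
Media Backup:::Backup files and file systems:help=RtMediaBkup.html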

Another helpful set of profiles is the Printer Management profiles. These allow a user access to a large group of printer-related commands used to manage printers and print devices. Creating a role and assigning this group of profiles might help ease the workload of a system administrator.
$ more /etc/security/exec_attr | grep Printer
Printer Management:suser:cmd:::/etc/init.d/lp:euid=0;uid=0
Printer Management:suser:cmd:::/usr/bin/cancel:euid=lp;uid=lp
Printer Management:suser:cmd:::/usr/bin/lpset:egid=14
Printer Management:suser:cmd:::/usr/sbin/accept:euid=lp;uid=lp
Printer Management:suser:cmd:::/usr/lib/lp/local/accept:uid=lp
Printer Management:suser:cmd:::/usr/sbin/lpadmin:egid=14;uid=lp;gid=8
Printer Management:suser:cmd:::/usr/lib/lp/local/lpadmin:uid=lp;gid=8
Printer Management:suser:cmd:::/usr/sbin/lpfilter:euid=lp;uid=lp
Printer Management:suser:cmd:::/usr/sbin/lpforms:euid=lp
Printer Management:suser:cmd:::/usr/sbin/lpmove:euid=lp
Printer Management:suser:cmd:::/usr/sbin/lpshut:euid=lp
Printer Management:suser:cmd:::/usr/sbin/lpusers:euid=lp
Printer Management:suser:cmd:::/usr/bin/lpstat:euid=0
Printer Management:suser:cmd:::/usr/lib/lp/lpsched:uid=0
Printer Management:suser:cmd:::/usr/ucb/lpq:euid=0
Printer Management:suser:cmd:::/usr/ucb/lprm:euid=0

As with most tasks and scripts in UNIX, there is almost always more than one way to accomplish something. This also holds true for creating roles and profiles. If you're familiar with and use the Solaris Management Console (SMC), then you have access to the GUI tool that allows you to review, create, and modify users and roles. There are also several other role management commands that can be used to create and manage your roles. The Sun documentation available online has all the additional information needed to use those methods.

[1] A special note about the All (All Rights) profile. The All profile gives a role access to all commands not currently assigned to other profiles. At the time of role creation, I could have chosen to omit the All profile; this would prevent the role from doing anything but the assigned tasks. To give the user some flexibility, I always choose to add it in. Just something to keep in mind as you design your own roles and profiles.
References

Solaris 9 System Administrator Collection, System Administration Guide: Security Services, Chapter 17, Role-Based Access Control (Overview)
Solaris 9 System Administrator Collection, System Administration Guide: Security Services, Chapter 18, Role-Based Access Control (Tasks)
Solaris 5.8, 5.9 man pages: useradd, usermod, rolemod

Planet Sun
[Oct 28, 2005] [PDF] The Role of Identity Management in Sarbanes-Oxley Compliance
[Oct 7, 2005] Sun BluePrints OnLine - Archives By Subject
Enforcing the Two-Person Rule Via Role-Based Access Control in the Solaris 10 Operating System (August 2005) by Glenn Brunette Whether discussing physical or logical access controls, organizations have for years applied the practice of the two-person rule to help secure IT assets. Using the two-person rule is an optional approach for organizations wanting to protect access to key data sets, or to restrict who may perform sensitive or high impact operations on a system. In many circumstances, however, more traditional IT security controls are likely appropriate. Using the two-person rule is most often reserved for restricting the most sensitive IT security operations performed within an organization. Whether and where a given organization could apply the two-person rule depends on its policies, architecture, processes, and requirements. This Sun BluePrints cookbook describes how to use Solaris Role-Based Access Control (RBAC) in the Solaris 10 Operating System (Solaris OS) to enforce the two-person rule in IT security.

[Apr 5, 2005] Roles and Sarbanes-Oxley -- speculative, but still interesting white paper.
[Mar 30, 2005] [PDF] RBAC in Solaris 10 -- Sun presentation
[Feb 11, 2005] The Least Privilege Model in the Solaris ... Privileges and RBAC
The RBAC facility, present in the Solaris OS since version 8, is used to assign specific privileges to roles or users. Solaris RBAC configuration is controlled through four main files: /etc/security/exec_attr, /etc/security/prof_attr, /etc/security/auth_attr, and /etc/user_attr. exec_attr(4) specifies the execution attributes associated with profiles; this generally includes the user and group IDs, commands, and default/limit privileges. prof_attr(4) contains a collection of execution profile names, descriptions, and other attributes. auth_attr(4) contains authorization definitions and descriptions. user_attr(4) contains user and role definitions along with their assigned authorizations, profiles, and projects. For a better understanding of how RBAC operates, read the above-mentioned man pages along with the rbac(5), policy.conf(4), and chkauthattr(3SECDB) man pages, and the Roles, Rights Profiles, and Privileges section of the Solaris 10 System Administrator Collection.

To allow a group of users to use DTrace, the system administrator would either create a role that had access to the DTrace privileges or assign the privileges directly to a user. The following would create a "debug" role and grant it the appropriate privileges:

roleadd -u 201 -d /export/home/debug -P "Process Management" debug
rolemod -K defaultpriv=basic,dtrace_kernel,dtrace_proc,dtrace_user debug

Now add the necessary users to the debug role with usermod:

usermod -R debug username

The users with the role debug can now use su to access debug, providing the appropriate password, and run the necessary DTrace commands. Instead of adding roles and making the users access the role via su, the system administrator can also directly assign privileges to a user. The user must be logged out in order for the following command to succeed:

usermod -K defaultpriv=basic,dtrace_kernel,dtrace_proc,dtrace_user username

If additional privileges are required, pinpoint them by running the dtrace command under ppriv again.

RBAC can also be used in conjunction with the least privilege model to more securely run daemons, like httpd, that need to bind to privileged ports. Many such programs do not actually need root access for anything other than listening on a port below 1024, so granting the role/user that runs the process net_privaddr would remove the need for ever running the process with EUID 0.
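Following the same least-privilege idea, a hypothetical web server account (the name webservd and the output shown are assumptions for illustration) could be granted just the port-binding privilege instead of ever running with EUID 0:

# usermod -K defaultpriv=basic,net_privaddr webservd
# grep webservd /etc/user_attr
webservd::::type=normal;defaultpriv=basic,net_privaddr

A daemon started as webservd can then bind to port 80 while file permissions and all other root powers remain out of reach.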

[Feb 11, 2005] How To Create Custom Roles Using Role-Based Access Control (RBAC) -- How to set up RBAC to allow a non-root user to manage in.named on a DNS server
The following are assumed:
The role is: dnsrole
The user is: dnsadmin
The profile is: DNS Admin (case sensitive)
Home directory for dnsrole is: /export/home/dnsrole
Home directory for dnsadmin is: /export/home/dnsadmin

Configuration Steps

1. Create the role and assign it a password:

# roleadd -u 1001 -g 10 -d /export/home/dnsrole -m dnsrole
# passwd dnsrole

NOTE: Check in /etc/passwd that the shell is /bin/pfsh. This ensures that nobody can log in as the role. Example line in /etc/passwd:

dnsrole:x:1001:10::/export/home/dnsrole:/bin/pfsh

2. Create the profile called "DNS Admin". Edit /etc/security/prof_attr and insert the following line:

DNS Admin:::BIND Domain Name System administrator:

3. Add the profile to the role using rolemod(1M) or by editing /etc/user_attr:

# rolemod -P "DNS Admin" dnsrole

Verify that the changes have been made in /etc/user_attr with profiles(1) or grep(1):

# profiles dnsrole
DNS Admin
Basic Solaris User
All
# grep dnsrole /etc/user_attr
dnsrole::::type=role;profiles=DNS Admin

4. Assign the role 'dnsrole' to the user 'dnsadmin':


a. If 'dnsadmin' already exists, use usermod(1M) to add the role (the user must not be logged in):

# usermod -R dnsrole dnsadmin

b. Otherwise, create a new user with useradd(1M) and passwd(1):

# useradd -u 1002 -g 10 -d /export/home/dnsadmin -m \
-s /bin/ksh -R dnsrole dnsadmin
# passwd dnsadmin

c. Confirm the user has been added to the role with roles(1) or grep(1):

# roles dnsadmin
dnsrole
# grep ^dnsadmin: /etc/user_attr
dnsadmin::::type=normal;roles=dnsrole

5. Assign commands to the profile 'DNS Admin'. Add the following entries to /etc/security/exec_attr:

DNS Admin:suser:cmd:BIND 8:BIND 8 DNS:/usr/sbin/in.named:uid=0
DNS Admin:suser:cmd:ndc:BIND 8 control program:/usr/sbin/ndc:uid=0

If using Solaris 10 you may need to add a rule for BIND 9:

DNS Admin:suser:cmd:BIND 9:BIND 9 DNS:/usr/sfw/sbin/named:uid=0

BIND 9 does not use ndc; instead rndc(1M) is used, which does not require RBAC.

6. Create or modify the named configuration files. To further limit the use of root in configuring and maintaining BIND, make dnsadmin the owner of /etc/named.conf and the directory it specifies:

# chown dnsadmin /etc/named.conf
# grep -i directory /etc/named.conf
directory "/var/named/";
# chown dnsadmin /var/named

7. Test the configuration. Log in as the user "dnsadmin" and start named:

$ su dnsrole -c '/usr/sbin/in.named -u dnsadmin'

To stop named, use ndc (for BIND 9 use rndc):

$ su dnsrole -c '/usr/sbin/ndc stop'

Summary: In this example the user 'dnsadmin' has been set up to manage the DNS configuration files and assume the role 'dnsrole' to start the named process. The role 'dnsrole' is only used to start named and to control it with ndc (for BIND 8). With this RBAC configuration, the DNS process, when started by the role 'dnsrole', acquires root privileges and thus has access to its configuration files. The named option '-u dnsadmin' may be used to specify the user that the server should run as after it initializes. Furthermore, 'dnsadmin' is then permitted to send signals to named, as described in the named manual page.

References: ndc(1M), named(1M), rbac(5), profiles(1), rolemod(1M), roles(1), rndc(1M), usermod(1M), useradd(1M)

Sys Admin Magazine RBAC as a replacement of sudo


The UNIX administration model of a single, all-powerful superuser is a troublesome limitation in many network computing environments. Sys admins often need to delegate selected administration tasks without granting unrestricted superuser powers. The sudo utility (http://www.courtesan.com/sudo/) is a longtime favorite for fulfilling this function. However, some organizations prohibit the use of freeware tools, especially for such a critical security function. For Solaris sys admins who find themselves in this situation, there is now an alternative. Role-Based Access Control (RBAC) was introduced in Solaris 8. Adopted from Sun's Trusted Solaris offering, RBAC has its roots in military and government computing systems, where operations are more tightly controlled than in a typical commercial UNIX environment. Like sudo, RBAC allows sys admins the flexibility to grant users superuser privileges on a per-command basis.

To show how RBAC can be used as a substitute for sudo, I will begin with an example sudoers file, then replicate the same configuration using RBAC.

White Papers RBAC in the Solaris Operating Environment.

Also available in PDF (311K) | PS (1.06M)


Role Based Access Control

One of the most challenging problems in managing large networked systems is the complexity of security administration. Today, security administration is costly and prone to error because administrators usually specify access control lists for each user on the system individually. Role based access control (RBAC) is a technology that is attracting increasing attention, particularly for commercial applications, because of its potential for reducing the complexity and cost of security administration in large networked applications. With RBAC, security is managed at a level that corresponds closely to the organization's structure. Each user is assigned one or more roles, and each role is assigned one or more privileges that are permitted to users in that role. Security administration with RBAC consists of determining the operations that must be executed by persons in particular jobs, and assigning employees to the proper roles. Complexities introduced by mutually exclusive roles or role hierarchies are handled by the RBAC software, making security administration easier.

From: Casper H.S. Dik (Casper.Dik@Sun.COM)
Subject: Re: su - access
Newsgroups: comp.security.unix

Date: 2002-08-28 08:00:22 PST


"Sree" <sreep@qnetstaff.com> writes:
>Dear Friends,
>how to provide su access to oracle user and disable direct logging throught
>ssh or telnet. only for user oracle. On Sun systems.

On Solaris systems with RBAC (s8+, I think) you can make the oracle user into a "role"; that disables direct login access to the account and also allows you to specify which users can su to it. See user_attr(4) and rbac(5).

Casper

Recommended Links
In case of broken links please try to use Google search. If you find the page please notify us about new location

Solaris[TM] Security Product Index
Role Based Access Control (RBAC) -- Sun-maintained list of RBAC-related links.
System Administration Guide: Security Services
The Least Privilege Model in the Solaris ... (Feb 2005): Privileges and RBAC

Integrating the Secure Shell Software (May 2003) by Jason Reid


This article discusses integrating Secure Shell software into an environment. It covers replacing rsh(1) with ssh(1) in scripts, using proxies to bridge disparate networks, limiting privileges with role-based access control (RBAC), and protecting legacy TCP-based applications. This article is the entire fifth chapter of the upcoming Sun BluePrints book "Secure Shell in the Enterprise" by Jason Reid, which will be available in June 2003.

RBAC in the Solaris Operating Environment Whitepaper
Security in the Solaris 9 Operating System Data Sheet
Role Based Access Control and Secure Shell -- A Closer Look At Two Solaris Operating Environment Security Features
On-Line Blueprints
Sol 9 8/03 System Administration Guide Part II: Managing System Security, Chapter 5, Role-Based Access Control (Overview)
Sol 9 8/03 System Administration Guide Part II: Managing System Security, Chapter 6, Role-Based Access Control (Tasks)
Sol 9 8/03 System Administration Guide Part II: Managing System Security, Chapter 7, Role-Based Access Control (Reference)
How To Create Custom Roles Using Role-Based Access Control (RBAC)
How to enable/disable root logins via ssh
How do I restrict access to Apache Web Server in the Solaris[tm] Operating Environment?

The default RBAC Printer Management profile does not work in Solaris[TM] 8
How to start/stop web server (port 80) by a non-root user
Cron jobs fail for Solaris 8 RBAC (Role Based Access Control) roles
Restricting logins to "su" only for a given account

Third-Party

Using Solaris RBAC
Developer Resources (http://developers.sun.com): Authorization Infrastructure in Solaris

Publications

Conference Papers
Framework for Role-Based Delegation Models (PDF)
Decentralized Group Hierarchies in UNIX: An Experiment and Lessons Learned (PDF, Postscript) ==> see Abstract
Group Hierarchies with Decentralized User Assignment in Windows NT (PDF, Postscript) ==> see Abstract
NetWare 4 as an Example of RBAC (PDF, Postscript) ==> see Abstract

Articles
Role-Based Access Control (PDF, Postscript)

Dissertations

Sys Admin Magazine
RBAC in the Solaris Operating Environment: http://www.sun.com/software/whitepapers/wp-rbac/
The Solaris Companion: Role-Based Access Control by Peter Baer Galvin. Sys Admin, August 2001. http://www.samag.com/documents/s=1147/sam0108p
[NIST] Role Based Access Control

Blueprints
Role Based Access Control and Secure Shell--A Closer Look At Two Solaris Operating Environment Security Features (June 2003) by Thomas M. Chalfant
To aid the customer in adopting better security practices, this article introduces and explains two security features in the Solaris operating environment: the first is Role Based Access Control, and the second is Secure Shell. The goal is to provide you with enough information to make an effective decision to use or not use these features at your site, as well as to address configuration and implementation topics. This article is targeted at the intermediate level of expertise.

Integrating the Secure Shell Software (May 2003) by Jason Reid. Pages 9-12 are devoted to an example of a very restricted role that is able to perform just one external command (scp).

Solaris Operating Environment Network Settings for Security: Updated for Solaris 9 Operating Environment (June 2003) by Alex Noordergraaf
This article describes network settings available within the Solaris Operating Environment (Solaris OE) and recommends how to adjust network settings to strengthen the security posture of Solaris OE systems. This article updates the original article to include changes for Solaris 9 OE. These additions and modifications are incorporated into an updated "nddconfig" script available from http://www.sun.com/blueprints/tools/. This article is ideal for all levels of expertise.

Software White Paper RBAC in the Solaris[tm] Operating Environment

How-to documents
#25968 How To Create Custom Roles Using Role-Based Access Control (RBAC)
Restricting logins to "su" only for a given account
PAM: Making the root user a role
Solaris[TM]: Making the root user a role

Reference
docs.sun.com System Administration Guide, Volume 2

Role-based access control (RBAC) is an alternative to the all-or-nothing security model of traditional superuser-based systems. The problem with the traditional model is not just that superuser is so powerful but that other users are not powerful enough to fix their own problems. Like its predecessor sudo, RBAC provides the ability to package superuser privileges for assignment to user accounts. With RBAC, you can give users the ability to solve their own problems by assigning them packages of the appropriate privileges. Superuser's capabilities can be diminished by dividing those capabilities into several packages and assigning them separately to individuals sharing administrative responsibilities. RBAC thus enables separation of powers, controlled delegation of privileged operations to other users, and a variable degree of access control. It includes the following features:
Authorization - A right that is used to grant access to a restricted function
Execution profile (or simply profile) - A bundling mechanism for grouping authorizations and commands with special attributes; for example, user and group IDs
Role - A special type of user account intended for performing a set of administrative tasks

RBAC relies on four databases to provide users access to privileged operations:

user_attr (extended user attributes database) - Associates users and roles with authorizations and execution profiles
auth_attr (authorization attributes database) - Defines authorizations and their attributes and identifies the associated help file
prof_attr (execution profile attributes database) - Defines profiles, lists the profile's assigned authorizations, and identifies the associated help file
exec_attr (profile execution attributes database) - Defines the privileged operations assigned to a profile

The /etc/user_attr database supplements the passwd and shadow databases. It contains extended user attributes such as authorizations and execution profiles. It also allows roles to be assigned to a user. A role is a special type of user account that is intended for performing a set of administrative tasks. It is like a normal user account in most respects except that it has a special shell pfksh and users can gain access to it only through the su command; it is not accessible for normal logins. From a role account, a user can access commands with special attributes, typically root user ID, that are not available to users in normal accounts. The fields in the user_attr database are separated by colons:
user:qualifier:res1:res2:attr

attr is a list of semicolon-separated (;) key-value pairs that describe the security attributes to be applied when the user runs commands. There are four valid keys: auths, profiles, roles, and type.

auths specifies a comma-separated list of authorization names chosen from names defined in the auth_attr(4) database. Authorization names may include the asterisk (*) character as a wildcard. For example, solaris.device.* means all of the Solaris device authorizations.

profiles contains an ordered, comma-separated list of profile names chosen from prof_attr(4). A profile determines which commands a user can execute and with which command attributes. At minimum each user in user_attr should have the All profile, which makes all commands available but without any attributes. The order of profiles is important; it works similarly to UNIX search paths. The first profile in the list that contains the command to be executed defines which (if any) attributes are to be applied to the command.

roles assigns roles to the user using a comma-separated list of role names. Note that roles are defined in the same user_attr database; they are indicated by setting the type value to role. Roles cannot be assigned to other roles.

type can be set to normal, if this account is for a normal user, or to role, if this account is for a role. A role is assumed by a normal user after the user has logged in.
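To make the format concrete, the fragment below sketches what /etc/user_attr entries might look like. The role name secadmin, the user jdoe, and the profile assignments are invented for the example:

```
# /etc/user_attr fragment (user:qualifier:res1:res2:attr)
# NOTE: secadmin, jdoe, and the profile names are hypothetical.
root::::auths=solaris.*,solaris.grant;type=normal
secadmin::::type=role;profiles=Device Management,All
jdoe::::type=normal;roles=secadmin
```

Here jdoe is a normal user who is allowed to assume the secadmin role with su; the role itself carries the profiles that grant its privileged commands.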

Authorizations

An authorization is a user right that grants access to a restricted function. It is a unique string that identifies what is being authorized as well as who created the authorization. Authorizations are checked by certain privileged programs to determine whether users can execute restricted functionality. For example, the solaris.jobs.admin authorization is required for one user to edit another user's crontab file. All authorizations are stored in the auth_attr database. Authorizations may be assigned directly to users (or roles), in which case they are entered in the user_attr database. Authorizations can also be assigned to execution profiles, which in turn are assigned to users. The fields in the auth_attr database are separated by colons:

name:res1:res2:short_desc:long_desc:attr

Profiles

Profiles are defined in the prof_attr database. The fields in the prof_attr database are separated by colons:

profname:res1:res2:desc:attr

attr is an optional list of key-value pairs separated by semicolons (;) that describe the security attributes to apply to the object upon execution. Zero or more keys may be specified. There are two valid keys, help and auths. The keyword help identifies a help file in HTML. Help files can be accessed from the index.html file in the /usr/lib/help/auths/locale/C directory.

auths specifies a comma-separated list of authorization names chosen from those names defined in the auth_attr(4) database. Authorization names may be specified using the asterisk (*) character as a wildcard.

Execution Attributes

An execution attribute associated with a profile is a command (with any special security attributes) that can be run by those users or roles to whom the profile is assigned. Special security attributes refer to attributes such as UID, EUID, GID, and EGID that can be added to a process when the command is run. The definitions of the execution attributes are stored in the exec_attr database. The fields in the exec_attr database are separated by colons:

name:policy:type:res1:res2:id:attr

type is the type of entity whose attributes are specified. Currently, the only valid type is cmd (command).

attr is an optional list of semicolon-separated (;) key-value pairs that describe the security attributes to apply to the entity upon execution. Zero or more keys may be specified. The list of valid key words depends on the policy being enforced. There are four valid keys: euid, uid, egid, and gid.

euid and uid contain a single user name or a numeric user ID. Commands designated with euid run with the effective UID indicated, which is similar to setting the setuid bit on an executable file. Commands designated with uid run with both the real and effective UIDs set to the indicated value.

egid and gid contain a single group name or numeric group ID. Commands designated with egid run with the effective GID indicated, which is similar to setting the setgid bit on an executable file. Commands designated with gid run with both the real and effective GIDs set to the indicated value.
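As a hedged illustration of how prof_attr and exec_attr fit together, the entries below define a hypothetical profile that can run a single command with UID 0. The profile name and help file are invented for the example:

```
# prof_attr fragment (profname:res1:res2:desc:attr) -- name is hypothetical
Shutdown Mgmt:::Permit controlled system shutdown:help=ShutdownMgmt.html

# exec_attr fragment (name:policy:type:res1:res2:id:attr)
Shutdown Mgmt:suser:cmd:::/usr/sbin/shutdown:uid=0
```

A user or role assigned the "Shutdown Mgmt" profile in user_attr could then run /usr/sbin/shutdown from a profile shell with real and effective UID 0, without having any other root privileges.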
How to assume a role:

To assume a role, use the su command; you cannot log in to a role directly. For example:

% su my-role
Password: my-role-password
#
RBAC-related commands and files:

auths(1) - Display authorizations for a user.
makedbm(1M) - Make a dbm file.
nscd(1M) - Name service cache daemon, useful for caching the user_attr, prof_attr, and exec_attr databases.
pam_roles(5) - Role account management module for PAM. Checks for the authorization to assume a role.
pfexec(1) - Used by the profile shells to execute commands with the attributes specified in the exec_attr database.
policy.conf(4) - Configuration file for security policy. Lists granted authorizations.
profiles(1) - Display profiles for a specified user.
roles(1) - Display roles granted to a user.
roleadd(1M) - Add a role account on the system.
roledel(1M) - Delete a role's account from the system.
rolemod(1M) - Modify a role's account information on the system.
useradd(1M) - Add a user account on the system. The -R option assigns a role to a user's account.
userdel(1M) - Delete a user's login from the system.
usermod(1M) - Modify a user's account information on the system.
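Putting several of these commands together, the following sketch creates a role, attaches a rights profile to it, and allows a user to assume it. The role secadmin, the user jdoe, and the profile name are hypothetical, and exact options may vary between Solaris releases:

```shell
# Create the role account with the profile shell pfksh and a
# hypothetical rights profile (run as root).
roleadd -m -d /export/home/secadmin -s /bin/pfksh \
    -P "Device Management" secadmin

# Set the role password; users will need it to assume the role.
passwd secadmin

# Allow an existing user (hypothetical jdoe) to assume the role.
usermod -R secadmin jdoe

# jdoe can later assume the role with:  su secadmin
```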

Etc
RBAC Principles Disclaimer & Definitions
This document is merely a re-sorting of the Principles output from the NAC 2001 Fall conference, and will hopefully act as a starting point for more comprehensive documentation. The conference material did not contain all of the principles that may apply to RBAC, and it also expressed some good ideas that were not principles. This document re-sorts the conference document into RBAC Principles items and Role Engineering Best Practices items for further discussion.

Definitions:

Principle: A principle, as used in this context, is a constant (some might call a rule) that defines a particular behavior or characteristic that any RBAC system must include, exhibit or comply with. Best Practice: A best practice can be an insight based on experience, a recommendation based on research, or even an opinion based on sound reasoning. In this case, certain best practices for role engineering or role discovery were contained within the NAC conference Principle document.

Principles:
1. The RBAC system must delineate users, roles, and permissions.
2. A user can be assigned to multiple roles.
3. The system must support the notion of Separation of Duty constraints.
4. The system must leverage existing standards to the extent possible.
5. An inheritance model must be supported so that a role can inherit rights from other roles.
6. The permission with the least privilege applies in cases where a user is assigned to multiple roles and two or more roles define a permission on the same object.
7. The system must log changes to role assignments.
8. The system must provide an aggregated view of all permissions assigned to a particular user.
9. The system must provide a view of all users assigned to a particular permission.
10. There must be an administrative interface to add/change/delete roles, permissions, and users, as well as the assignments of permissions to roles and roles to users.
11. A role should be able to map to one or more groups on one or more target systems.
12. Users can be people, applications, devices, or functions.

Best Practices for Role Engineering:


1. Roles are abstractions of system entitlements that consider two design criteria: consolidation of entities into meaningful groups that facilitate ease of administration, and collections of permissions that are meaningful to specific user communities.
2. There are three basic approaches to role engineering:
   Bottom Up - derives roles from the groups and permissions defined on the existing systems and platforms in the enterprise.
   Top Down - derives roles from an analysis of business function and organizational criteria, typically from a zero-base starting point.
   Hybrid - as the name implies, derives roles using a combination of bottom-up and top-down approaches.
3. The top-down approach is typically much more difficult to implement, because it will probably result in significant changes to legacy group and permission models.
4. The bottom-up approach is typically easier to implement, because the RBAC system will overlay the legacy group and permission models.
5. A successful role definition approach will likely be a hybrid approach.
6. The objective of role modeling is to maximize flexibility with minimal administrative effort.
7. Role engineering should consider how role and user administration is to be delegated; more decentralized models benefit from more top-down analysis.
8. Top-down role engineering will aggregate business processes into organizational parameters.
9. A top-down approach can only be successful with participation and buy-in from business units.
10. A role typically maps to a group on a legacy system.
11. Successful implementation of RBAC benefits from a cultural commitment to define common roles that supersede individual privileges.
12. Roles do not have meaning unless they are used to define access privileges.
13. Plan for role life cycle management.
14. Consider using use case modeling to validate role definitions.

Troubleshooting Solaris Network Problems


Adapted from Sun's Solaris 8 training notes.

When a user reports a problem, first of all verify that you and the user are on the same page and that the user's choice of words is correct. This eliminates cases in which the user reports a problem but uses technical terms incorrectly; for example, the user's "my system crashes" may be your "a specific application terminates unexpectedly." Attempt to locate the lowest level of the problem; for example, applications that appear to be failing may be impacted by underlying network problems. The ping, traceroute, ngrep, and other network tools are indispensable for troubleshooting networking problems.

Network troubleshooting means recognizing and diagnosing networking problems with the goal of keeping your network running optimally. As a network administrator, your primary concern is maintaining connectivity of all devices (a process often called fault management). You should also continually evaluate and improve your network's performance. Because serious networking problems can sometimes begin as performance problems, paying attention to performance can help you address issues before they become serious. As in any investigation, you need to avoid jumping to conclusions and calmly collect all relevant facts. You can use the famous "How to Solve It" approach. Among the more network-specific issues:
Cut though communication barrier to the level of clear understanding of real symptoms.

If this is a remote problem, try to define the problem in your own words and check with the user reporting the problem whether your usage of words is correct. This eliminates cases in which the user reports a problem but uses technical terms incorrectly or uses fuzzy terms like "my system crashes." Your version of the same problem could be "a specific application terminates unexpectedly."
Work bottom-up using the TCP/IP networking layers and attempt to trace the problem to the lowest level of the TCP/IP stack. For example, if networked applications appear to be failing to connect using the symbolic name of the host, it is prudent to check whether they work with raw IP addresses. If yes, this suggests that they may be impacted by underlying DNS problems.

Do not take anything for granted. For example, the link LED (light emitting diode) on a hub may light when a cable is connected, leading you to believe that the link has been established and that the network cabling is functional. But the transmit wire or connector could be broken, causing loss of communications; the link LED lights because the link signal is being received on the receive line.

In general, there is no one correct way to determine the root cause of a networking problem. Like any troubleshooting of complex systems, this is more art than science, and success depends both on your insight and on your level of experience with the environment. However, there are heuristics that you can follow:

Concentrate on the problem at hand. Networks are a lot like cars: you can start out investigating one problem and find 10 other things that may need attention. Make a note of any unrelated problems, but focus on investigating the primary problem.

Focus on one area at a time. If several users are reporting problems from different areas of the network at the same time, there is a good chance that they are reporting elements of the same problem. It can be overwhelming to have 1,000 or more users down at once, but if the same problem is simply recurring in multiple parts of the network, you only have to figure it out once.

If possible, use the lab. Whenever possible, try to duplicate the problem in a lab and troubleshoot it there. Sometimes changes performed during troubleshooting have a greater negative impact on the end-user population than the original problem.

Use intelligent testing strategies. If any test requires reconfiguring a device, ensure that you can roll back the change after the test, or you may find that you have backed yourself into a corner and cannot proceed. Use as few tests as possible to isolate and define the problem, working top-down or bottom-up through the TCP/IP stack. Ensure that the results of the tests are unambiguous before moving to the next level. Validate the test results by repeating each one at least twice. Note that running a command to verify a configuration parameter is not considered a test in this sense and therefore doesn't need to be performed twice.

Document changes as you proceed, not as an afterthought. Document the tests performed and the results in case a bug is found. Document any changes made to the network during the troubleshooting procedure so that the network can be properly restored to its original condition. Document any workarounds that were left in place so that other support personnel will be able to understand how and why the network changed.

Troubleshooting Commandments

1. Create a backup of the faulty system before fixing anything. The backup can cover only the configuration files or the complete system. A complete backup is important, as troubleshooting is a high-stress activity and it is easy to accidentally destroy some files. Ghost is a great tool for performing quick complete backups, and Ghost 2003 works with Linux ext filesystems. With the current sizes of USB flash drives, most system partitions can be backed up to a flash drive. Such a backup can also be indispensable if the fault disappears on its own: faults that fix themselves often come back on their own too.

2. Before changing any file, always create a baseline copy. That protects you from the most typical mistake in troubleshooting: losing the initial configuration.

3. Simplify your environment, if possible. Where possible, try to remove routers and firewalls from the affected networking path. Problems are often introduced by network devices. This is typical, for example, for home environments with cheap routers such as Linksys. In an enterprise environment the left hand often does not know what the right is doing, and similar effects can occur because someone upgraded a router's operating system or a firewall's rule set. Patches are just a special kind of upgrade and can introduce problems too.

4. Have a testing plan. Make sure that you can replicate the reported fault at will. This is important because you should always attempt to re-create the reported fault after making any changes. You need to be sure that you are not changing or adding to the problem.

5. Document all steps and results. This is important because you could forget exactly what you did to fix or change the problem. This is especially true when someone interrupts you as you are about to test a configuration change. You can always revert the system to the faulty state if you backed it up as suggested earlier.

6. Where possible, make permanent changes to the configuration settings. Temporary changes may be faster to implement but cause confusion when the system reboots after a power failure months or even years later and the fault occurs again. Nobody will remember what was done by whom.

Using ping as a Troubleshooting Tool

The ping utility sends ICMP echo request packets to the target host or hosts. Once ICMP echo responses are received, the message "target is alive" (where target is the hostname of the device receiving the ICMP echo requests) is displayed.

# ping problem.host.com
problem.host.com is alive

The -s option is useful when attempting to connect to a remote host that is down or not available: no output is produced until an ICMP echo response is received from the target host, and statistics are displayed when the ping -s command is terminated. The -R option can be useful if the traceroute utility is not available.

# ping -s problem.host.com

Another useful troubleshooting technique using ping is to send ICMP echo requests to the entire network by using the broadcast address as the target host. Using the -s option with the broadcast address provides good information about which systems are available on the network:

# ping -s 172.20.4.255
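Because the ping -s statistics only appear when the command is terminated, it can be handy to pull the loss figure out of a saved run for trending. The helper below is a small portable sketch; it assumes the classic BSD-style summary line that ping -s prints:

```shell
#!/bin/sh
# Extract the packet-loss percentage from a "ping -s" summary line, e.g.
#   5 packets transmitted, 4 packets received, 20% packet loss
loss_from_summary() {
    awk '/packet loss/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /%$/) { sub(/%/, "", $i); print $i }
    }'
}

# Canned example; a live run would pipe "ping -s host" output instead.
echo "5 packets transmitted, 4 packets received, 20% packet loss" \
    | loss_from_summary
```

Logged over time, such figures provide the benchmark that later sections recommend keeping for comparison.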

Using ifconfig as a Troubleshooting Tool

The ifconfig utility is useful when troubleshooting networking problems. You can use it to display an interface's current status including the settings for the following:
MTU Address family IP address Netmask Broadcast address Ethernet address (MAC address)

Be aware that there are two ifconfig commands. The two versions differ in how they use name services. /sbin/ifconfig is called by the /etc/rc2.d/S30sysid.net startup script; this version is not affected by the configuration of the /etc/nsswitch.conf file. /usr/sbin/ifconfig is called by the /etc/rc2.d/S69inet and /etc/rc2.d/S72inetsvc startup scripts; this version of the ifconfig command is affected by the name service settings in the /etc/nsswitch.conf file.

Power user tip - Use the plumb switch when troubleshooting interfaces that have been manually added and configured. Often an interface will report that it is up and running, yet a snoop session from another host shows that no traffic is flowing out of the suspect interface. Using the plumb switch resolves the misconfiguration problem.
Using arp as a Troubleshooting Tool

The arp utility can be useful when attempting to locate network problems relating to duplicate IP address usage. Determine the Ethernet address of the target host; you can do this by using the banner utility at the ok prompt, or the ifconfig utility at a shell prompt on a Sun system. Armed with the Ethernet address (also known as the MAC address), use the ping utility to determine if the target host can be reached. Use the arp utility immediately after using the ping utility and verify that the arp table reflects the expected (correct) Ethernet address. The following example demonstrates this technique. Working from the system three, use the ping and arp utilities to determine if the system one is really responding to system three.

1. First, determine the Ethernet address of the host called one:

problem.host.com# ifconfig -a
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 6
        inet 128.50.2.1 netmask ffffff00 broadcast 128.50.2.255
        ether 8:0:20:76:6:b
problem.host.com#

The ifconfig utility shows that the Ethernet address of the hme0 interface is 8:0:20:76:6:b. The first half of the address, 08:00:20, shows that the system is a Sun computer. The last half of the address, 76:06:0b, is the unique part of the system's Ethernet address. Search the Internet to determine the manufacturer of devices with unknown Ethernet addresses.

2. Use the ping utility to send ICMP echo requests from system three to system problem.host.com:

# ping problem.host.com
problem.host.com is alive

3. View the arp table to determine if the device that sent the ICMP echo response is the correct system, 76:06:0b:

# arp -a
Net to Media Table: IPv4
Device   IP Address           Mask              Flags   Phys Addr
------   ------------------   ---------------   -----   -----------------
[...]                                                   08:00:20:76:06:0b
[...]                                                   08:00:20:8e:ee:18
[...]                                                   08:00:20:7a:0b:b8
[...]                                                   08:00:20:78:54:90
[...]                                                   00:60:97:7f:4f:dd
[...]                                                   01:00:5e:00:00:00

(The Device, IP Address, and Mask columns are elided here.) Output from the arp utility will appear to hang if name resolution fails, because the arp utility attempts to resolve names; use the netstat -pn utility to obtain similar output without name resolution. The table displayed in step 3 proves that the correct device responded. If the wrong system had responded, it could have been quickly tracked down by using the Ethernet address; once located, it can be configured with the correct IP address. Many hubs and switches will report the Ethernet address of the attached device, making it easier to track down incorrectly configured devices. The first half of the Ethernet address can also be used to refine the search. The previous example showed a device, presumably a personal computer, that reported an Ethernet address of 00:60:97:7f:4f:dd. A quick search on the Internet reveals that the 00:60:97 vendor code is assigned to the 3Com corporation.
Using snoop as a Troubleshooting Tool

The snoop utility can be particularly useful when troubleshooting virtually any networking problem. The traces produced by the snoop utility can be most helpful when attempting remote troubleshooting, because an end-user (with access to the root password) can capture a snoop trace and email it, or send it using ftp, to a network troubleshooter for remote diagnosis. You can use the snoop utility to display packets on the fly or to write them to a file. Writing to a file using the -o switch is preferable, because each packet can be interrogated later.

problem.host.com# snoop -o tracefile
Using device /dev/le (promiscuous mode)

You can view the snoop file by using the -i switch and the filename in any of the standard modes, namely:

Terse mode - No option switch is required. Summary verbose mode - Use the -V switch. Verbose mode - Use the -v switch.

Verbose is most useful when you are troubleshooting routing, network booting, Trivial File Transport Protocol (TFTP), and any network-related problems that require diagnosis at the packet level. Each layer of the packet is clearly defined by the specific headers. View the snoop output file in terse mode and locate a packet or range of packets of interest. Use the -p switch to view these packets. For example, if packet two is of interest, type: problem.host.com# snoop -p2,2 -v -i tracefile
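snoop also accepts filter expressions on the command line, which keeps trace files small and focused on the traffic under investigation. A hedged sketch (the hostname and port are placeholders):

```shell
# Capture only traffic involving one suspect host into a file
# (hostname is a placeholder; run as root).
snoop -o tracefile host problem.host.com

# Narrow further to a single service, e.g. telnet traffic:
snoop -o tracefile host problem.host.com and port 23

# Browse the capture in terse mode, then decode one packet fully.
snoop -i tracefile
snoop -p2,2 -v -i tracefile
```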
Using ndd as a Troubleshooting Tool

Use extreme caution when using the Solaris ndd utility, because the system could be rendered inoperable if you set parameters incorrectly. Use an escaped question mark (\?) to determine which parameters a driver supports. For example, to determine which parameters the 100-Mbit Ethernet (hme) device supports, type:

# ndd /dev/hme \?
?                         (read only)
transceiver_inuse         (read only)
link_status               (read only)
link_speed                (read only)
...
lance_mode                (read and write)
ipg0                      (read and write)
#
Routing/IP Forwarding

Many systems configured as multi-homed hosts or firewalls may have IP forwarding disabled. A fast way to determine the state of IP forwarding is to use the ndd utility:

problem.host.com# ndd /dev/ip ip_forwarding
0

This example shows that the system is not forwarding IP packets between its interfaces. The value of ip_forwarding is 1 when the system is routing or forwarding IP packets.
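ndd can also change a parameter at run time with the -set option. Such a change does not survive a reboot; to make it persistent you would typically apply it from a startup script (such as the nddconfig script mentioned earlier):

```shell
# Turn IP forwarding on, then verify (run as root; use with care).
ndd -set /dev/ip ip_forwarding 1
ndd /dev/ip ip_forwarding
```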
Interface Speed

The hme (100-Mbit Ethernet) card can operate at two speeds, 10 or 100 Mbit per second. You can use the ndd utility to quickly display the speed at which the interface is running:

# ndd /dev/hme link_speed
1

A one (1) indicates that the interface is running at 100 Mbit per second. A zero (0) indicates that the interface is running at 10 Mbit per second.

Interface Mode

The hme interface can run in either full-duplex or half-duplex mode. Again, the ndd utility provides a fast way to determine the mode of the interface:

# ndd /dev/hme link_mode

A one (1) indicates that the interface is running in full-duplex mode. A zero (0) indicates that the interface is running in half-duplex mode.
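The two queries can be combined into a one-line health report. This sketch assumes the hme encoding described above (1 = 100 Mbit/full-duplex, 0 = 10 Mbit/half-duplex) and uses Bourne-shell backquotes for portability to older Solaris shells:

```shell
#!/bin/sh
# Summarize hme0 speed and duplex from ndd (run as root on Solaris).
speed=`ndd /dev/hme link_speed`
mode=`ndd /dev/hme link_mode`
[ "$speed" -eq 1 ] && s="100 Mbit/s" || s="10 Mbit/s"
[ "$mode" -eq 1 ] && d="full-duplex" || d="half-duplex"
echo "hme0: $s, $d"
```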
Using netstat as a Troubleshooting Tool

You can use the netstat utility to display the status of the system's network interfaces. Of particular interest when troubleshooting networks are the routing tables of all the systems in question. You can use the -r switch to display a system's routing tables.
# netstat -r

Although interesting, the displayed routing table is not of much use unless you are familiar with the name resolution services, be they the /etc/hosts, NIS, or NIS+ services. The problem is that it is difficult to concentrate on routing issues when any doubt can be cast on the name services. For example, someone could have modified the name service database, and the system msbravo may no longer be at the IP address that you expected. Using the -n switch eliminates this uncertainty:

# netstat -rn

This routing table is much easier to translate and troubleshoot, especially when combined with the information from ifconfig -a:

# ifconfig -a
lo0: flags=849<UP,LOOPBACK,RUNNING,MULTICAST> mtu 8232
        inet 127.0.0.1 netmask ff000000
hme0: flags=863<UP,BROADCAST,NOTRAILERS,RUNNING,MULTICAST> mtu 1500
        inet 129.147.11.59 netmask ffffff00 broadcast 129.147.11.255
hme0: flags=863<UP,BROADCAST,NOTRAILERS,RUNNING,MULTICAST> mtu 1500
        inet 172.20.4.110 netmask ffffff00 broadcast 172.20.4.255
#

The verbose mode switch, -v, displays additional information, including the MTU size configured for the interface:

# netstat -rnv
Using traceroute as a Troubleshooting Tool

The traceroute utility is useful when you perform network troubleshooting. You can quickly determine if the expected route is being taken when communicating, or attempting to communicate, with a target network device. As with most network troubleshooting, it is useful to have a benchmark against which current traceroute output can be compared. The traceroute output can be used to report network problems to other network troubleshooters. For example, you could say, "Our normal route to a host is from our router called router1-ISP to your routers called rtr-a1 to rtr-c4. Today, however, users are complaining that performance is very slow. Screen refreshes are taking more than 40 seconds when they normally take less than a second. The output from traceroute shows that the route to the host is from our router router1-ISP to your routers called rtr-a1, rtr-d4, rtr-x5, and then to rtr-c4. What is going on?"

The traceroute utility uses the IP TTL to force ICMP TIME_EXCEEDED responses from all gateways and routers along the path to the target host. The traceroute utility also tries to force a PORT_UNREACHABLE message from the target host. It can instead attempt to force an ICMP ECHO_REPLY message from the target host when the -I (ICMP ECHO) option is used. The traceroute utility will, by default, resolve IP addresses, as shown in the following example:

# traceroute 172.20.4.110
traceroute to 172.20.4.110 (172.20.4.110), 30 hops max, 40 byte packets
 1  129.147.11.253 (129.147.11.253)  1.037 ms  0.785 ms  0.702 ms
 2  129.147.3.249 (129.147.3.249)  1.452 ms  1.569 ms  0.766 ms
 3  * dungeon (129.147.11.59)  1.320 ms *

You can display IP addresses instead of hostnames by using the -n switch, as shown in the following example. Here the hostname dungeon for IP address 129.147.11.59 on line 3 is no longer resolved:

# traceroute -n 172.20.4.110
traceroute to 172.20.4.110 (172.20.4.110), 30 hops max, 40 byte packets
 1  129.147.11.253  0.954 ms  0.657 ms  0.695 ms
 2  129.147.3.249  0.844 ms  0.745 ms  0.771 ms
 3  129.147.11.59  0.534 ms *  0.640 ms
Common Network Problems

Following is a list of some common problems that occur:


Faulty RJ-45 connector - The network connection fails intermittently.
Faulty wiring on patch cable - No network communications.
MDI to MDI (no MDI-X) - Media-dependent interfaces, such as hubs, cannot be connected directly to another MDI device. Many hubs have a port that can be switched to become an MDI-X crossover port.
Badly configured encryption - Once encryption is configured, things are not as they appear. Standard tools such as ifconfig and netstat will not locate the problem. Use the snoop utility to view the contents of packets to determine if all is normal.
Hub or switch configured to block the MAC - Modern hubs and switches can be configured to block specific MAC addresses, or any address if the connection is tampered with. Access to the console of the hub or switch is necessary to unblock a port.
Bad routing tables or rogue router - Routing tables can be corrupted. Sometimes a rogue router can appear on the network due to the installation of a multihomed host.
Rogue DHCP server present in a DHCP environment - Often happens when somebody installs a Windows server on the network without understanding what they are doing.
Protocol not being routed - For example, if JumpStart or bootp is being used across routers.
Interface not plumbed - Additional interfaces, when configured, are not plumbed. The interface will appear to be functioning, but it will not pass traffic.
Connection to the wrong interface on a multihomed host.
Bad information in the /etc/hosts or NIS database - The IP address of a system is incorrect or missing.

The user statement "My application does not work" is just the tip of the iceberg: the user often does not understand exactly what is not working, and jumps to conclusions that can mislead you in troubleshooting. Never take the user's story at face value. You need to ask very specific questions to uncover the real story. Among the questions to consider:
Is the server up and functioning normally?
Can other users access the server?
Is the client system up and functioning normally?
Has anything changed on the server?
Has anything changed on the client?

Layers-based troubleshooting
When troubleshooting networks, some people prefer to think in layers, similar to the TCP/IP model, while others prefer to think in terms of functionality.

Using the TCP/IP Model layered approach, you could start at either the Physical or Application layer. Start at either end of the model and test, draw conclusions, move to the next layer and so on.

The Application Layer

A user complains that an application is not functioning. Assuming the application has everything that it needs, such as disk space, name servers, and the like, determine if the Application layer is functional by using another system. Application layer programs often have diagnostic capabilities and may report that a remote system is not available. Use the snoop command to determine if the application program is receiving and sending the expected data.
The Transport Layer and the Internet Layer

These two layers can be bundled together for the purposes of troubleshooting. Determine if the systems can communicate with each other. Look for ICMP messages that can provide clues as to where the problem lies. Could this be a router or switching problem? Are the protocols (TFTP, BOOTP) being routed? Are you attempting to use protocols that cannot be routed? Are the hostnames being translated to the correct IP addresses? Are the correct netmask and broadcast addresses being used? Tests between the client and server can include using ping, traceroute, arp, and snoop.
The Network Interface Layer

Use snoop to determine if the network interface is actually functioning. Use the arp command to determine if the arp cache has the expected Ethernet (MAC) address. Fourth-generation hubs and some switches can be configured to block certain MAC addresses. When troubleshooting connectivity problems, here are some useful questions:
Have any changes been made to the network devices?
Can the client contact the server using ping?
Can the client contact any system using ping?
Can the server contact the client using ping?
Can the client system use ping to contact any other hosts on the local network segment?
Can the client use ping to contact the far interface of the router?
Can the client use ping to contact any hosts on the server's subnet?
Is the server in the client's arp cache?
Can snoop be used to determine what happens to the service or arp request?
Is the client's interface correctly configured? (Has it been plumbed?)
Has any encryption software installation been attempted?

The Physical Layer

Check that the link-status LED is lit, and test with a known working cable; note that the link LED can be lit even if the transmit line is damaged. Verify that an MDI-X connection or crossover cable is being used if connecting hub to hub.

Selected Troubleshooting Scenarios

Multi-Homed System Acts as Rogue Router

For example, system A can use telnet to contact system B, but system B cannot use telnet to contact system A. Further questioning of the user reveals that the problem appeared shortly after a power failure. For troubleshooting, use the traceroute utility to show the route that network traffic takes from system B to system A. If the traceroute output reveals a route that goes through an additional system (call it system C), you have a rogue router problem. Often this happens because system C was modified by an end user: for example, an additional interface was added, but the user did not create the /etc/notrouter file. In that case, after a reboot the system comes up as a router and starts advertising routes, which confuses the core routers and disrupts network traffic patterns.
Faulty Cable

For example, users on network A cannot reach hosts on network B even though routers R1 and R2 appear to be functioning normally. First, verify that routers R1 and R2 are configured correctly and that their interfaces are up. Then verify that systems A and B are up and configured correctly. Next, use the traceroute utility to discover the actual route from system A to system B. Suppose the traceroute output shows that the attempted route from system A on network net-1 goes through router R1 as expected, but the traffic never reaches router R2. Investigate router R2's log files. If they show that the interface to network net-2 is flapping (going up and down at a very high rate) and that routing tables are corrupt, suspect the cable. To solve this problem, replace the network net-2 cable to router R2. If that fixes the problem, the cable was faulty and was causing intermittent connections.
Duplicate IP Address

Reported problem: Systems on network net-1 could not use ping past router R1 to a recently configured network, net-2. You must be root to perform some of the troubleshooting steps below. Suggested steps:
1. Verify that the T1 link between routers R1 and R2 is functioning properly.
2. Verify that router R1 can use ping to contact router R2.
3. Verify that system A can use ping to reach the near interface of router R2. (System A cannot use ping to reach the far interface of router R2, though.)
4. Confirm that systems on network net-1 can use ping to reach router R1.
5. Check that systems on network net-2 can use ping to reach router R2.
6. Determine that the routers are configured correctly.
7. Verify that the systems on network net-1 and network net-2 are configured correctly.
8. Make sure the systems on network net-1 can communicate with each other.
9. Verify that systems on network net-2 can communicate with each other.
10. Log onto router R1 and use traceroute to display how data is routed from router R1 to router R2.
11. In this scenario, traceroute reported that the traffic from router R1 to router R2 was going out the network net-1 side interface of the router instead of the network net-2 side as expected. This indicates that the IP address for router R2 may also exist on network net-1.
12. Check the Ethernet address of router R2 and compare the actual address with the contents of router R1's arp cache. Here the arp cache revealed that the device was from a different manufacturer than expected.

To solve the problem, track down the device on network net-1 (system C) that has an illegal IP address - one that is the same as the network net-1 side interface of router R2. The duplicate caused a routing loop, because the routers had multiple best-case paths to the same location (which were actually two different devices). Correct the duplicate IP address problem on system C and verify that communications work as expected.
Duplicate MAC Address (Mostly Sun environment problem)

Reported Problem: After adding an additional Ethernet interface to your host, the system performance is very poor. Troubleshooting (as user root):
Use arp -a to view the address table on the host. If the same MAC address appears against more than one host on the same physical network, this may be the problem.
Use ifconfig -a to check the IP address and MAC address of each interface. If hosts with the same MAC address are on the same subnet, you have found the problem.

Notice from the previous ifconfig output that all the interfaces have the same MAC address. (Host C is on a different subnet, so it is not a problem.) Identical MAC addresses on one subnet cause problems because packets that leave either qfe0 or qfe1 are not guaranteed to receive a response, since both interfaces advertise themselves as the source for those packets.
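The two checks above can be automated. The sketch below is illustrative only: the here-document is made-up sample data laid out in the Solaris arp -a column order (device, host, mask, flags, physical address), and the host names are hypothetical; in real use you would feed live arp -a output through the same pipeline.

```shell
#!/bin/sh
# Sketch: flag MAC addresses that appear against more than one host in
# "arp -a" style output. Sample data below is hypothetical.

cat > /tmp/arp.$$ <<'EOF'
hme0 hosta 255.255.255.255 SP 8:0:20:aa:bb:cc
hme0 hostb 255.255.255.255 S  8:0:20:aa:bb:cc
hme0 hostc 255.255.255.255 S  8:0:20:11:22:33
EOF

# Print "MAC host", sort so equal MACs become adjacent, then compare
# neighbouring lines.
out=$(awk '{ print $NF, $2 }' /tmp/arp.$$ | sort | awk '
    $1 == prev { print "duplicate MAC " $1 ": " prev_host " and " $2; dup = 1 }
    { prev = $1; prev_host = $2 }
    END { if (!dup) print "no duplicate MACs" }')
echo "$out"
rm -f /tmp/arp.$$
```

Sorting on the MAC column first means the duplicate test only ever has to compare each line with its neighbour, so the script stays a one-pass awk job.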


Submitted Article: Network Troubleshooting Tips for the Solaris 9 OS, by Ross Moffatt, December 2006 | BigAdmin

Contents:
Overview
Two Tips for Network Performance Checking
Network Connectivity Troubleshooting
Checking Network Settings
Checking Routing Settings
Changing the IP Address
About the Author

Overview

This How-To covers some basic networking setup and troubleshooting on the Solaris 9 OS.
Two Tips for Network Performance Checking

a. Use FTP to copy a large file between hosts. Make sure you copy the file in both directions, as network performance problems can be directional. A possible cause of performance issues is autonegotiation being enabled at either the host or the router/switch. (See the Checking Network Settings section of this article for more details.)

b. Use ping with small (1 Kbyte) and large (10 Kbyte) packet sizes. Sometimes routers in the network can have issues depending on the size of the packet, as some use different queues within the router depending on packet size.
Network Connectivity Troubleshooting

Here is a checklist to help you locate and resolve network connectivity problems.

1. Use ifconfig -a to check that interfaces are plumbed; that is, that they exist in the output. Also check the network address and netmask of the interface. To plumb an interface, run the command ifconfig <interface><instance> plumb, for example:

# ifconfig ce1 plumb

Use ifconfig to see if the interface now exists:

# ifconfig -a
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
ce0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 444.555.666.7 netmask ffffff00 broadcast 444.555.666.255
        ether 5:3:de:de:de:de
ce1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 6
        inet 0.0.0.0 netmask 0
        ether 3:4:aa:bb:cc:dd

Give the interface its IP address and netmask:

# ifconfig ce1 555.66.77.88 netmask 255.255.255.0 up

2. Ping the interface address; it should work!
3. Ping your router/switch. If this fails, check your network settings. (See the Checking Network Settings section of this article.)
4. Ping a host on another network. If that doesn't work, check the routing table. (See the Checking Routing Settings section of this document.)
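The ping steps above can be strung together into a small script that stops at the first failing layer. This is only a sketch: SELF, ROUTER, and REMOTE are placeholder addresses, and PING defaults to true (always succeeds) so the logic can be read and exercised without a live network; set PING=ping to probe for real.

```shell
#!/bin/sh
# Sketch: the connectivity checklist as a script. Addresses are
# placeholders; PING=true by default so this runs off-line.
SELF=${SELF:-127.0.0.1}
ROUTER=${ROUTER:-192.0.2.1}
REMOTE=${REMOTE:-198.51.100.1}
PING=${PING:-true}

check() {   # check <address> <label> <hint-on-failure>
    if $PING "$1" > /dev/null 2>&1; then
        echo "ok:   $2 ($1)"
    else
        echo "FAIL: $2 ($1) - $3"
        exit 1
    fi
}

check "$SELF"   "own interface"  "check ifconfig -a: is the interface plumbed and up?"
check "$ROUTER" "router/switch"  "check network settings (speed, duplex, netmask)"
check "$REMOTE" "remote network" "check the routing table with netstat -rn"
result="all checks passed"
echo "$result"
```

Stopping at the first failure matters: if the host cannot ping its own interface, testing the router or a remote network tells you nothing new.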
Checking Network Settings

You can check or set the status and configuration of a network interface with the ndd command. To use the ndd command, you first need to use the ifconfig command to find out what the device file is for the network interface in which you are interested.
# ifconfig -a lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1 inet 127.0.0.1 netmask ff000000 ce0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2 inet 444.555.666.7 netmask ffffff00 broadcast 444.555.666.255 ether 5:3:de:de:de:de

This example shows only one interface plumbed, ce0, so the interface type is ce and the instance is 0. Sometimes the ndd command uses a generic device file, e.g. /dev/ce, and you need to set the instance number first to know which instance is being interrogated. Otherwise, an instance-specific device file, e.g. /dev/bge0, is used. You can use ls -ld /dev/<interface>* to see if instance-specific device files exist, for example, ls -ld /dev/ce*.
# ls -l /dev/ce*
lrwxrwxrwx 1 root root 28 Mar 3 2006 /dev/ce -> .../devices/pseudo/clone@0:ce

or
# ls -ld /dev/bge*
lrwxrwxrwx 1 root root 29 Oct 14 2005 /dev/bge -> .../devices/pseudo/clone@0:bge
lrwxrwxrwx 1 root root 39 Oct 14 2005 /dev/bge0 -> .../devices/pci@1f,700000/network@2:bge0
lrwxrwxrwx 1 root root 41 Oct 14 2005 /dev/bge1 -> .../devices/pci@1f,700000/network@2,1:bge1
lrwxrwxrwx 1 root root 39 Oct 14 2005 /dev/bge2 -> .../devices/pci@1d,700000/network@2:bge2
lrwxrwxrwx 1 root root 41 Oct 14 2005 /dev/bge3 -> .../devices/pci@1d,700000/network@2,1:bge3

The following scripts print out all variables available via ndd. One is for a generic device file, and the other is for an instance-specific device file. The script is run with the instance required as a command-line option, that is, <script> 0. You would need to change the script to have the correct interface device type. For example, you may need to replace ce with eri if your interface device file is /dev/eri.
#!/bin/sh
# ndd generic device script
ndd -set /dev/ce instance $1
for p in `ndd /dev/ce \? | awk '{print $1}' | grep -v \?`
do
        echo "$p: `ndd /dev/ce $p`"
done

Output:

# ./script 0
"instance: 0"
"adv_autoneg_cap: 0"
"adv_1000fdx_cap: 0"
"adv_1000hdx_cap: 0"
"adv_100T4_cap: 0"
"adv_100fdx_cap: 1"
"adv_100hdx_cap: 0"
"adv_10fdx_cap: 0"
"adv_10hdx_cap: 0"
"adv_asmpause_cap: 0"
"adv_pause_cap: 0"
"master_cfg_enable: 0"
"master_cfg_value: 0"
"use_int_xcvr: 0"
"enable_ipg0: 1"
"ipg0: 8"
"ipg1: 8"
"ipg2: 4"
"rx_intr_pkts: 8"
"rx_intr_time: 3"
"red_dv4to6k: 0"
"red_dv6to8k: 0"
"red_dv8to10k: 0"
"red_dv10to12k: 0"
"tx_dma_weight: 0"
"rx_dma_weight: 0"
"infinite_burst: 1"
"disable_64bit: 0"
"accept_jumbo: 0"
"laggr_multistream: 0"

#!/bin/sh
# ndd specific device script
for p in `ndd /dev/bge$1 \? | awk '{print $1}' | grep -v \?`
do
        echo "$p: `ndd /dev/bge$1 $p`"
done

Output:

# ./script 0
"autoneg_cap: 0"
"pause_cap: 1"
"asym_pause_cap: 1"
"1000fdx_cap: 0"
"1000hdx_cap: 0"
"100T4_cap: 0"
"100fdx_cap: 1"
"100hdx_cap: 0"
"10fdx_cap: 0"
"10hdx_cap: 0"
"adv_autoneg_cap: 0"
"adv_pause_cap: 1"
"adv_asym_pause_cap: 1"
"adv_1000fdx_cap: 0"
"adv_1000hdx_cap: 0"
"adv_100T4_cap: 0"
"adv_100fdx_cap: 1"
"adv_100hdx_cap: 0"
"adv_10fdx_cap: 0"
"adv_10hdx_cap: 0"
"lp_autoneg_cap: 0"
"lp_pause_cap: 0"
"lp_asym_pause_cap: 0"
"lp_1000fdx_cap: 0"
"lp_1000hdx_cap: 0"
"lp_100T4_cap: 0"
"lp_100fdx_cap: 0"
"lp_100hdx_cap: 0"
"lp_10fdx_cap: 0"
"lp_10hdx_cap: 0"
"link_status: 1"
"link_speed: 100"
"link_duplex: 1"
"link_autoneg: 0"
"link_rx_pause: 1"
"link_tx_pause: 0"
"loop_mode: 0"

I have found it best to disable autonegotiation on both the host and the router to which the host is connected. This is done by setting the following parameters: autoneg_cap, 1000fdx_cap, 1000hdx_cap, 100T4_cap, 100fdx_cap, 100hdx_cap, 10fdx_cap, 10hdx_cap, and adv_autoneg_cap. I use this script in the /etc/rc2.d directory to set the network parameters.
smart1 # more /etc/rc2.d/S95net_tune
#!/sbin/sh
PATH=$PATH:/usr/sbin; export PATH
case "$1" in
'start')
        /usr/sbin/ndd -set /dev/bge0 adv_1000fdx_cap 0
        /usr/sbin/ndd -set /dev/bge0 adv_1000hdx_cap 0
        /usr/sbin/ndd -set /dev/bge0 adv_100fdx_cap 1
        /usr/sbin/ndd -set /dev/bge0 adv_100hdx_cap 0
        /usr/sbin/ndd -set /dev/bge0 adv_10fdx_cap 0
        /usr/sbin/ndd -set /dev/bge0 adv_10hdx_cap 0
        /usr/sbin/ndd -set /dev/bge0 adv_autoneg_cap 0
        ;;
'stop')
        ;;
*)
        echo "Usage: $0 { start | stop }"
        exit 1
        ;;
esac
exit 0

The bge interface shows you the currently running status with the following parameters: link_status, link_speed, and link_duplex. For information on interface device drivers, look in the man pages, in Section 7: Devices and Network Interfaces.
Checking Routing Settings

To see your current routing configuration, use netstat -r; add the -n option if you want to see IP addresses instead of DNS names. For example:
smart1 # netstat -rn

Routing Table: IPv4
  Destination           Gateway            Flags  Ref      Use    Interface
-------------------- -------------------- ----- ----- ---------- ---------
222.333.444.0        222.333.444.21       U         1     104943 bge0
default              222.333.444.1        UG        1   63805900
127.0.0.1            127.0.0.1            UH        5   38851300 lo0

smart1 # netstat -r

Routing Table: IPv4
  Destination           Gateway            Flags  Ref      Use    Interface
-------------------- -------------------- ----- ----- ---------- ---------
222.333.444.0        myhost               U         1     104943 bge0
default              router               UG        1   63805927
localhost            localhost            UH        5   38851327 lo0

If the default route is missing, use the route command to add it on the fly:

# route add default 222.333.444.1
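A quick way to test for a missing default route is to scan the netstat -rn output itself. The here-document below is sample output modeled on the display above (the addresses and counters are illustrative); in real use you would pipe netstat -rn straight into the same awk program.

```shell
#!/bin/sh
# Sketch: scan "netstat -rn" style output for a default route.
out=$(awk '$1 == "default" { found = 1; gw = $2 }
           END { print (found ? "default route via " gw : "no default route") }' <<'EOF'
Routing Table: IPv4
  Destination           Gateway          Flags  Ref   Use      Interface
--------------------  ---------------  -----  ----  -------  ---------
222.333.444.0         222.333.444.21   U         1   104943  bge0
default               222.333.444.1    UG        1   638059
127.0.0.1             127.0.0.1        UH        5   388513   lo0
EOF
)
echo "$out"
```

Matching on the first field only is deliberate: with -n in effect the default entry is always literally "default", so the test is independent of name resolution.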

Changing the IP Address

The following startup files need to be modified to change a host's IP address:

/etc/inet/hosts - Change the IP address. File format: IP<tab>hostname.

/etc/inet/netmasks - Add a new netmask. File format: network<tab>netmask.

/etc/defaultrouter - Specify the new gateway for this subnet. File format: ipaddress.

About the Author

Ross Moffatt has been a UNIX System Administrator for more than 10 years. He can be contacted at ross.stuff@telstra.com.
Solaris- Network Troubleshooting

One of the first signs of trouble on the network is a loss of communications by one or more hosts. If a host refuses to come up at all the first time it is added to the network, the problem might lie in one of the configuration files, or in the network interface. If a single host suddenly develops a problem, the network interface might be the cause. If the hosts on a network can communicate with each other but not with other networks, the problem could lie with the router, or it could lie in another network.

You can use the ifconfig program to obtain information on network interfaces and netstat to display routing tables and protocol statistics. Third-party network diagnostic programs provide a number of troubleshooting utilities; refer to third-party documentation for information. Less obvious are the causes of problems that degrade performance on the network. For example, you can use tools like ping to quantify problems like the loss of packets by a host.

Running Software Checks

If the network has trouble, some actions that you can take to diagnose and fix software-related problems include:
1. Using the netstat command to display network information.
2. Checking the hosts database (and ipnodes if you are using IPv6) to make sure that the entries are correct and up to date.
3. If you are running RARP, checking the Ethernet addresses in the ethers database to make sure that the entries are correct and up to date.
4. Trying to connect by telnet to the local host.
5. Ensuring that the network daemon inetd is running. To do this, log in as superuser and type:

# ps -ef | grep inetd

Here is an example of output displayed if the inetd daemon is running:


root    57     1  0   Apr 04 ?        3:19 /usr/sbin/inetd -s
root  4218  4198  0 17:57:23 pts/3    0:00 grep inetd
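Note that the sample output includes the grep command itself, which makes "did grep print anything?" an unreliable test. A common trick is to put one character of the pattern in a character class, so the grep process's own command line no longer matches; a sketch:

```shell
#!/bin/sh
# Sketch: "ps | grep" normally matches its own grep process. The
# bracket trick below ('[i]netd' does not match the literal string
# "[i]netd" on grep's command line) keeps grep from matching itself,
# so an empty result really means the daemon is not running.
if ps -ef | grep '[i]netd' > /dev/null; then
    msg="inetd is running"
else
    msg="inetd is NOT running"
fi
echo "$msg"
```

The same pipeline with the plain pattern 'inetd' would always find at least one line (the grep itself) and so could never report the daemon as down.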

ping Command

Use the ping command to find out whether there is IP connectivity to a particular host. The basic syntax is:

/usr/sbin/ping host [timeout]

where host is the host name of the machine in question. The optional timeout argument indicates the time in seconds for ping to keep trying to reach the machine (20 seconds by default). The ping(1M) man page describes additional syntaxes and options.

When you run ping, the ICMP protocol sends a datagram to the host you specify, asking for a response. (ICMP is the protocol responsible for error handling on a TCP/IP network. See ICMP Protocol for details.)

How to Determine if a Host Is Losing Packets

If you suspect that a machine might be losing packets even though it is running, you can use the -s option of ping to try to detect the problem. On the command line, type the following command.
% ping -s hostname

ping continually sends packets to hostname until you send an interrupt character or a timeout occurs. The responses on your screen will resemble:
PING elvis: 56 data bytes
64 bytes from 129.144.50.21: icmp_seq=0. time=80. ms
64 bytes from 129.144.50.21: icmp_seq=1. time=0. ms
64 bytes from 129.144.50.21: icmp_seq=2. time=0. ms
64 bytes from 129.144.50.21: icmp_seq=3. time=0. ms
.
.
.
----elvis PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip (ms) min/avg/max = 0/20/80
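The statistics line lends itself to scripted checks. This sketch recomputes the loss percentage from saved ping -s output; the here-document reuses the elvis sample, but in real use you would redirect ping's output to a file and run awk over that.

```shell
#!/bin/sh
# Sketch: pull the transmitted/received counts out of "ping -s"
# output and recompute the packet-loss percentage.
loss=$(awk -F, '/packets transmitted/ {
    # $1 = "4 packets transmitted", $2 = " 4 packets received"
    split($1, tx, " "); split($2, rx, " ")
    printf "%d", 100 * (tx[1] - rx[1]) / tx[1]
}' <<'EOF'
PING elvis: 56 data bytes
64 bytes from 129.144.50.21: icmp_seq=0. time=80. ms
----elvis PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip (ms) min/avg/max = 0/20/80
EOF
)
echo "packet loss: ${loss}%"
```

Recomputing the figure from the raw counts, rather than scraping the printed "0%", keeps the script working even if the percentage formatting differs between ping versions.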

The packet-loss statistic indicates whether the host has dropped packets. If ping fails, check the status of the network reported by ifconfig and netstat, as described in the following sections.

ifconfig Command

The ifconfig command displays information about the configuration of an interface that you specify. (Refer to the ifconfig(1M) man page for details.) The syntax of ifconfig is:
ifconfig interface-name [protocol_family]

How to Get Information About a Specific Interface


1. Become superuser. 2. On the command line, type the following command. # ifconfig interface

For an le0 interface, your output resembles the following:


le0: flags=863 mtu 1500
        inet 129.144.44.140 netmask ffffff00 broadcast 129.144.44.255
        ether 8:0:20:8:e1:fd

The flags section just given shows that the interface is configured "up," capable of broadcasting, and not using "trailer" link-level encapsulation. The mtu field tells you that this interface has a maximum transfer size of 1500 octets. Information on the second line includes the IP address of the host you are using, the netmask currently in use, and the IP broadcast address of the interface. The third line gives the machine address (Ethernet, in this case) of the host.

How to Get Information About All Interfaces on a Network

A useful ifconfig option is -a, which provides information on all interfaces on your system.

1. On the command line, type the following command:

# ifconfig -a

This produces, for example:


lo0: flags=49 mtu 8232
        inet 127.0.0.1 netmask ff000000
le0: flags=863 mtu 1500
        inet 129.144.44.140 netmask ffffff00 broadcast 129.144.44.255
        ether 8:0:20:8:e1:fd

Output that indicates an interface is not running might mean a problem with that interface. In this case, see the ifconfig(1M) man page.

netstat Command

The netstat command generates displays that show network status and protocol statistics. You can display the status of TCP and UDP endpoints in table format, routing table information, and interface information. netstat displays various types of network data depending on the command-line option selected. These displays are the most useful for system administration. The syntax for this form is:
netstat [-m] [-n] [-s] [-i | -r] [-f address_family]

The most frequently used options for determining network status are -s, -r, and -i. See the netstat(1M) man page for a description of the options.

How to Display Statistics by Protocol

The netstat -s option displays per-protocol statistics for the UDP, TCP, ICMP, and IP protocols. On the command line, type the following command.
% netstat -s

The result resembles the display shown in the example below. (Parts of the output have been truncated.) The information can indicate areas where a protocol is having problems. For example, statistical information from ICMP can indicate where this protocol has found errors.
UDP
        udpInDatagrams      =  39228    udpInErrors         =      0
        udpOutDatagrams     =   2455
TCP
        tcpRtoAlgorithm     =      4    tcpRtoMax           =  60000
        tcpMaxConn          =     -1    tcpActiveOpens      =      4
        tcpPassiveOpens     =      2    tcpAttemptFails     =      3
        tcpEstabResets      =      1    tcpOutSegs          =    315
        .
        .
IP
        ipForwarding        =      2    ipDefaultTTL        =    255
        ipInReceives        =   4518    ipInHdrErrors       =      0
        .
        .
ICMP
        icmpInMsgs          =           icmpInErrors        =
        icmpInCksumErrs     =      0    icmpInUnknowns      =
        .
        .
IGMP:
        0 messages received
        0 messages received with too few bytes
        0 messages received with bad checksum
        0 membership queries received
        0 membership queries received with invalid field(s)
        0 membership reports received
        0 membership reports received with invalid field(s)
        0 membership reports received for groups to which we belong
        0 membership reports sent

How to Display Network Interface Status

The -i option of netstat shows the state of the network interfaces that are configured on the machine where you ran the command. On the command line, type the following command:
% netstat -i

Here is a sample display produced by netstat -i:


Name Mtu  Net/Dest      Address    Ipkts     Ierrs  Opkts     Oerrs  Collis   Queue
le0  1500 b5-spd-2f-cm  tatra      14093893  8492   10174659  1119   2314178  0
lo0  8232 loopback      localhost  92997622  5442   12451748  0      775125   0
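Raw counters are easier to judge as rates. This sketch converts Ierrs and Oerrs into percentages of the packet counts; the here-document reuses the sample display, but in real use you would pipe live netstat -i output into the same awk program.

```shell
#!/bin/sh
# Sketch: per-interface input/output error rates from "netstat -i"
# style output. Fields: $1=Name $5=Ipkts $6=Ierrs $7=Opkts $8=Oerrs.
out=$(awk 'NR > 1 && $5 > 0 && $7 > 0 {
    printf "%s in-err %.3f%% out-err %.3f%%\n", $1, 100*$6/$5, 100*$8/$7
}' <<'EOF'
Name Mtu  Net/Dest     Address   Ipkts    Ierrs Opkts    Oerrs Collis  Queue
le0  1500 b5-spd-2f-cm tatra     14093893 8492  10174659 1119  2314178 0
lo0  8232 loopback     localhost 92997622 5442  12451748 0     775125  0
EOF
)
echo "$out"
```

Expressing errors as a fraction of traffic matters because a busy interface will always accumulate a larger absolute error count than an idle one.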

Using this display, you can find out how many packets a machine thinks it has transmitted and received on each network. For example, the input packet count (Ipkts) displayed for a server can increase each time a client tries to boot, while the output packet count (Opkts) remains steady. This suggests that the server is seeing the boot request packets from the client, but does not realize it is supposed to respond to them. This might be caused by an incorrect address in the hosts, ipnodes, or ethers database. On the other hand, if the input packet count is steady over time, it means that the machine does not see the packets at all. This suggests a different type of failure, possibly a hardware problem.

How to Display Routing Table Status

The -r option of netstat displays the IP routing table. On the command line, type the following command.
% netstat -r

Here is a sample display produced by netstat -r run on machine tenere:


Routing tables
Destination      Gateway   Flags  Refcnt  Use     Interface
temp8milptp      elvis     UGH    0       0
irmcpeb1-ptp0    elvis     UGH    0       0
route93-ptp0     speed     UGH    0       0
mtvb9-ptp0       speed     UGH    0       0
.
mtnside          speed     UG     1       567
ray-net          speed     UG     0       0
mtnside-eng      speed     UG     0       36
mtnside-eng      speed     UG     0       558
mtnside-eng      tenere    U      33      190248  le0

The first column shows the destination network, the second the router through which packets are forwarded. The U flag indicates that the route is up; the G flag indicates that the route is to a gateway. The H flag indicates that the destination is a fully qualified host address, rather than a network. The Refcnt column shows the number of active uses per route, and the Use column shows the number of packets sent per route. Finally, the Interface column shows the network interface that the route uses.

How to Log Network Problems
1. Become superuser.
2. Create a log file of routing daemon actions by typing the following command at a command-line prompt:

# /usr/sbin/in.routed /var/logfilename

Caution: On a busy network, this can generate almost continuous output.

Displaying Packet Contents

You can use snoop to capture network packets and display their contents. Packets can be displayed as soon as they are received, or saved to a file. When snoop writes to an intermediate file, packet loss under busy trace conditions is unlikely. snoop itself is then used to interpret the file. For information about using the snoop command, refer to the snoop(1M) man page.

The snoop command must be run by root (#) to capture packets to and from the default interface in promiscuous mode. In summary form, only the data pertaining to the highest-level protocol is displayed. For example, an NFS packet only displays NFS information; the underlying RPC, UDP, IP, and Ethernet frame information is suppressed but can be displayed if either of the verbose options is chosen. The snoop capture file format is described in RFC 1761 (http://ds.internic.net/rfc/rfc1761.txt).

snoop server client rpc rstatd collects all RPC traffic between a client and server, and filters it for rstatd.

How to Check All Packets from Your System
1. Type the following command at the command-line prompt to find the interfaces attached to the system:

# netstat -i

Snoop normally uses the first non-loopback device (le0).


2. Type snoop. Use Ctl-C to halt the process.

# snoop
Using device /dev/le (promiscuous mode)
        maupiti -> atlantic-82   NFS C GETATTR FH=0343
    atlantic-82 -> maupiti       NFS R GETATTR OK
        maupiti -> atlantic-82   NFS C GETATTR FH=D360
    atlantic-82 -> maupiti       NFS R GETATTR OK
        maupiti -> atlantic-82   NFS C GETATTR FH=1A18
    atlantic-82 -> maupiti       NFS R GETATTR OK
        maupiti -> (broadcast)   ARP C Who is 120.146.82.36, npmpk17a-82 ?

3. Interpret the results.

In the example, client maupiti transmits to server atlantic-82 using NFS file handle 0343. atlantic-82 acknowledges with OK. The conversation continues until maupiti broadcasts an ARP request asking who is 120.146.82.36. This example demonstrates the format of snoop. The next step is to filter snoop output and capture packets to a file. Interpret the capture file using the details described in RFC 1761 (http://ds.internic.net/rfc/rfc1761.txt).

How to Capture snoop Results to a File
1. # snoop -o filename

For example:
# snoop -o /tmp/cap
Using device /dev/le (promiscuous mode)
30 snoop: 30 packets captured

This has captured 30 packets in the file /tmp/cap. The file can be anywhere there is enough disk space. The number of packets captured is displayed on the command line, enabling you to press Ctl-C to abort at any time. snoop creates a noticeable network load on the host machine, which can distort the results. To see reality at work, run snoop from a third system (see the next section). On the command line, type the following command to inspect the file:
# snoop -i filename

For example:
# snoop -i /tmp/cap
 1   0.00000 frmpk17b-082 -> 224.0.0.2     IP  D=224.0.0.2 S=129.146.82.1 LEN=32, ID=0
 2   0.56104 scout -> (broadcast)          ARP C Who is 129.146.82.63, grail ?
 3   0.16742 atlantic-82 -> (broadcast)    ARP C Who is 129.146.82.76, honeybea ?
 4   0.77247 scout -> (broadcast)          ARP C Who is 129.146.82.63, grail ?
 5   0.80532 frmpk17b-082 -> (broadcast)   ARP C Who is 129.146.82.92, holmes ?
 6   0.13462 scout -> (broadcast)          ARP C Who is 129.146.82.63, grail ?
 7   0.94003 scout -> (broadcast)          ARP C Who is 129.146.82.63, grail ?
 8   0.93992 scout -> (broadcast)          ARP C Who is 129.146.82.63, grail ?
 9   0.60887 towel -> (broadcast)          ARP C Who is 129.146.82.35, udmpk17b-82 ?
10   0.86691 nimpk17a-82 -> 129.146.82.255 RIP R (1 destinations)
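Capture listings like this can be summarized with a short awk program. This sketch counts ARP "Who is" requests per sender (the here-document reuses part of the sample capture); a host such as scout, which keeps asking for the same address and never gets an answer, stands out immediately.

```shell
#!/bin/sh
# Sketch: count ARP requests per sender in "snoop -i" style output.
# Field $3 is the sending host in each summary line.
out=$(awk '/ARP C Who is/ { count[$3]++ }
           END { for (h in count) print count[h], h }' <<'EOF'
1 0.00000 frmpk17b-082 -> 224.0.0.2    IP  D=224.0.0.2 S=129.146.82.1 LEN=32, ID=0
2 0.56104 scout -> (broadcast)         ARP C Who is 129.146.82.63, grail ?
3 0.16742 atlantic-82 -> (broadcast)   ARP C Who is 129.146.82.76, honeybea ?
4 0.77247 scout -> (broadcast)         ARP C Who is 129.146.82.63, grail ?
5 0.80532 frmpk17b-082 -> (broadcast)  ARP C Who is 129.146.82.92, holmes ?
6 0.13462 scout -> (broadcast)         ARP C Who is 129.146.82.63, grail ?
EOF
)
echo "$out"
```

Repeated unanswered ARP requests for the same target are a classic sign that the target host is down, unplumbed, or blocked at a hub or switch port.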

Refer to specific protocol documentation for detailed analysis and recommended parameters for ARP, IP, RIP, and so forth. Searching the Web is a good way to find the relevant RFCs.

How to Check Packets Between Server and Client
Establish a snoop system off a hub connected to either the client or server.

The third system (the snoop system) sees all the intervening traffic, so the snoop trace reflects reality on the wire.
1. Become superuser.
2. On the command line, type snoop with options and save the output to a file.
3. Inspect and interpret the results.

Look at RFC 1761 (http://ds.internic.net/rfc/rfc1761.txt) for details of the snoop capture file format. Use snoop frequently and consistently to get a feel for normal system behavior. For assistance in analyzing packets, look for recent white papers and RFCs, and seek the advice of an expert in a particular area, such as NFS or YP. For details on using snoop and its options, refer to the snoop(1M) man page.

Displaying Routing Information

Use the traceroute utility to trace the route an IP packet follows to some internet host. The traceroute utility uses the IP time-to-live (TTL) field and attempts to elicit an ICMP TIME_EXCEEDED response from each gateway along the path, and a PORT_UNREACHABLE (or ECHO_REPLY) response from the destination host. It starts sending probes with a TTL of one and increases the TTL by one until it reaches the intended host or has passed through a maximum number of intermediate hosts. The traceroute utility is especially useful for determining routing misconfigurations and routing path failures. If a particular host is unreachable, you can use traceroute to see what path the packet follows toward the intended host and where possible failures might occur.

The traceroute utility also displays the round trip time for each gateway along the path to the target host. This information can be useful for analyzing where traffic is slow between the two hosts.
How to Run the Traceroute Utility

On the command line, type the following command.


% traceroute destination-hostname

Example - traceroute Utility

The following sample of the traceroute command shows the 7-hop path a packet follows from the host istanbul to the host sanfrancisco along with the times for a packet to traverse each hop.
istanbul% traceroute sanfrancisco
traceroute: Warning: Multiple interfaces found; using 172.31.86.247 @ le0
traceroute to sanfrancisco (172.29.64.39), 30 hops max, 40 byte packets
 1  frbldg7c-86 (172.31.86.1)  1.516 ms  1.283 ms  1.362 ms
 2  bldg1a-001 (172.31.1.211)  2.277 ms  1.773 ms  2.186 ms
 3  bldg4-bldg1 (172.30.4.42)  1.978 ms  1.986 ms  13.996 ms
 4  bldg6-bldg4 (172.30.4.49)  2.655 ms  3.042 ms  2.344 ms
 5  ferbldg11a-001 (172.29.1.236)  2.636 ms  3.432 ms  3.830 ms
 6  frbldg12b-153 (172.29.153.72)  3.452 ms  3.146 ms  2.962 ms
 7  sanfrancisco (172.29.64.39)  3.430 ms  3.312 ms  3.451 ms
