Infrastructure Overview
March 2021
George Lambidakis, Ofer Licht
Necessary Terminology
... Cisco has more acronyms than the US Government, and they change all the time
• CAG – Common ASIC Group (Eyal)
• CHG – Common Hardware Group (Ravi K – Eyal’s boss)
• CEC – Cisco Employee Connection (Cisco credentials)
• IT – Cisco IT proper (security, policy, networking, phones)
• EngIT – Engineering IT (DC infra: VMs, compute, NAS)
• Tools Group – (part of CHG) LSF SW, tool wrappers, licenses
• CapNet – Cisco’s global internal network
© 2019 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
Necessary Terminology, Continued
• LSF – Load Sharing Facility (IBM), like SunGrid
• DC – Data Center (NTN, SJC/MTV, BGL, RTP, GPK)
• Labs – Cisco’s lab infrastructure, separate from DC
• Clusters – UCS compute nodes, composed of 40 blades
• ServiceNow – Cisco’s IT break/fix service system
• E-store – Cisco’s IT request system (SW, services, etc.)
• Duo MFA – Single sign on (SSO)
• MobilePass – Single use password generator required for sudo
Help - What Single Link Do I Need?
Debugging and reporting problems
CAG Infrastructure Topology and Connectivity
[Figure: CAG WAN topology. Nodes: SJC Campus (VNC) with sjc12 and sjc05, MTV (CA), RTP (RTP01, RTP05), NTN (NTN01), AMS (Amsterdam), AER (Almere), CAE02 (VNC), BLR, BGL11, GPK01 (UK), ISPA, ISPB, and a backup VPN (removed). Link labels give latency and bandwidth, ranging from ~1 ms to 210 ms and from 150Mb to 2x10G, with OC48 on several long-haul links. LSF host counts shown: 50, 100, 250.]
The bottom row of the figure lists the sites MTV, RTP, BGL, NTN, and GPK with their LSF, NAS, and NAS mirror resources.
SOS Servers: asic-sos-rtp01, asic-sos-ntn01, asic-sos-gpk01
• All LSF hosts support an interactive use model ; running in LSF is indistinguishable from
running locally or on a VM
Resource Strategy
Desktop and Compute Strategy
• Virtual desktops (VNC) not intended for execution of compute or
memory intensive applications
• Licensed EDA tools run in LSF on the highest performance
hardware available, using optimized slot counts
• Benefits
• Persistent – connect/disconnect/share to desktop anywhere
• Project based fairshare and license allocation
• Optimized job slot counts per host
• Supported – centrally managed and supported DC resources
VMs, Desktops & RealVNC
VMs – Region Specific VNC Capacity
** EngIT does not allow use of TigerVNC, TightVNC, etc. due to CEC credential requirement
VMs, Cont’d
• VMs dispatch LSF jobs to their local farm
• Option available to submit jobs to an arbitrary farm
• Care needed due to project file system locations
• Users can run VNC servers in other locations
• Users working in multiple regions, see Alternate Home Directory
• Common Usage:
• Engineers in BGL/CAE run VNC servers in SJC
• They have additional home directories in SJC
• Servers display back to laptops (VNC client) in BGL
Alternate Home Directory Illustration
[Figure: Omer’s laptop runs a VNC client in CAE and can connect to a VNC server VM in either MTV or NTN. LSF is in SJC/MTV.]
• VNC client in CAE to VNC server in NTN: low latency connection (2 ms)
• VNC client in CAE to VNC server in MTV: 210 ms, connecting to the server across the WAN
• VNC server to NAS in the same location: low latency R/W (0 ms, server in same room)
• VNC server R/W to NAS across the WAN: 210 ms
• mtv5-netapp-ns /users/omsali: Omer’s alternate home directory in MTV* (SJC)
• ntn01-netapp-ns /users/omsali: Omer’s default home directory in NTN
* MTV is Mountain View, CA and is ~5 miles from the main Cisco campus
Enterprise RealVNC (Cisco Licensed - Required)
• VM based VNCs (basis for LSF clients)
• Eng IT supported*
• Uses CEC credentials
• Dynamic resizing of display resolution (xrandr)
• Encrypted connections
• Secure collaboration w/o sharing passwords
• Our documentation: VNC Overview and RealVNC
* While technically true, Unix and VNC support is mostly up to the users
LSF and Resource Allocation
*LSF - Load Sharing Facility
Access
• LSF access to our farms is restricted
• Employees/contractors reporting to Eyal have LSF access
based on reporting chain (w/o requesting)
• Engineers outside of CAG request admission via dedicated
mailing lists, subject to CAG management approval
• Otherwise, no LSF access
Fairness
• Operating Principle: Licenses and compute resources are allocated to
projects based on business priorities
• CAG uses LSF to enforce business priorities
• Priorities modeled as project and user ‘share’ values
• Jobs launch with a project ID and a computed priority
• Jobs PEND outside of exec host based on resource availability and priority
• Non-LSF jobs are problematic
• They violate fairness since they poll license server directly
• Resource allocation not based on business priorities
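As a rough illustration of how share values become a computed priority, the sketch below follows the general shape of the dynamic priority formula described in IBM's LSF documentation: shares divided by a weighted sum of recent usage. The factor values, project names, and usage numbers are assumptions made for illustration; they are not CAG's configuration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Weighting factors: illustrative values only. A real cluster sets these
# in its LSF configuration.
CPU_TIME_FACTOR = 0.7
RUN_TIME_FACTOR = 0.7
RUN_JOB_FACTOR = 3.0


@dataclass
class Usage:
    shares: int        # share value assigned to the project or user
    cpu_time: float    # recent CPU time consumed, in seconds
    run_time: float    # run time of currently running jobs, in seconds
    job_slots: int     # slots currently in use


def dynamic_priority(u: Usage) -> float:
    """More shares raise priority; more recent usage lowers it."""
    denominator = (
        u.cpu_time * CPU_TIME_FACTOR
        + u.run_time * RUN_TIME_FACTOR
        + (1 + u.job_slots) * RUN_JOB_FACTOR
    )
    return u.shares / denominator


def dispatch_order(projects: dict[str, Usage]) -> list[str]:
    """Project names sorted from highest to lowest dynamic priority."""
    return sorted(
        projects,
        key=lambda name: dynamic_priority(projects[name]),
        reverse=True,
    )


# Hypothetical projects: equal shares, very different recent usage.
projects = {
    "proj_a": Usage(shares=100, cpu_time=5000.0, run_time=2000.0, job_slots=40),
    "proj_b": Usage(shares=100, cpu_time=100.0, run_time=50.0, job_slots=2),
}
order = dispatch_order(projects)
```

With equal shares, the project that has consumed less recently is dispatched first, which is why a job that bypasses LSF escapes this accounting entirely.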
Project and User Based Fairshare
• Job scheduling system based on shares and dynamic priority
rush                 Highest priority first come first serve queue for individual (human) low count jobs [PBFS]
normal               Medium priority, project based allocation, intended for most simulation regressions [PBFS]
long         3-4     Lowest priority, project based allocation, intended for long running simulation workloads [PBFS]
build                Used for unlicensed workloads as well as massively parallel tools (Voltus, Seascape)
interactive  20      (misleading name) Waveform viewers, large file editing, mostly idle jobs
imp          2-5     Implementation and physical design tools ; slots based on host memory
imphcc       1-2     High Core Count implementation/PD queue for full chip (access restricted)
analog       8-10    Analog design tools (Virtuoso) with larger memory hosts ; jobs periodically use CPU time
analogsim    36      Dedicated capacity for analog simulations ; configured as 1 slot = 1 physical CPU
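A job reaches one of these queues through bsub with a queue and a project ID. The following is a minimal sketch that only builds the command line; `-q`, `-P`, and `-W` are standard bsub options, while the project name, script name, and run limit are hypothetical, and the queue names are the ones listed above in lowercase.

```python
import shlex

# Queue names from the queue list in this section, lowercased.
KNOWN_QUEUES = {
    "rush", "normal", "long", "build", "interactive",
    "imp", "imphcc", "analog", "analogsim",
}


def build_bsub(queue, project, command, run_limit_minutes=None):
    """Return a bsub argument list. Nothing is executed here."""
    if queue not in KNOWN_QUEUES:
        raise ValueError(f"unknown queue: {queue}")
    args = ["bsub", "-q", queue, "-P", project]
    if run_limit_minutes is not None:
        # -W sets the job's run limit, in minutes
        args += ["-W", str(run_limit_minutes)]
    return args + list(command)


# Hypothetical project and script names.
cmd = build_bsub(
    "normal",
    "my_project",
    ["./run_regression.sh", "--seed", "1"],
    run_limit_minutes=120,
)
print(shlex.join(cmd))
```

Returning a list rather than a shell string keeps quoting out of the picture when the result is passed to `subprocess.run`.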
Queue Characteristics
[Figure: queue configuration listing with callouts for User Limit, Queue Limit, Hard Run Limit, Soft Run Limit, and Soft Memory Limit]
* Soft limits apply sane defaults and avoid jobs that otherwise run forever
Compute Farm – Status and Soft/Hard Limits
[Figure: farm status display with callouts: 1 unique job per slot, Non-CAG queue, Suspended jobs]
Compute Farm – Simulation Job Status
[Figure: simulation job status display with callouts: Project, Project Shares, Dynamic Priority]
Licenses
License Servers (users generally do not specify)
ls-sjc-01 – primary CAG license server (SNPS, CDNS, etc.)
ls-sjc-03 – primary CHG license server (some VIP)
ls-csi-01 – NTN Low latency license server (ARC MetaWare)
CAD Tools Support (not EngIT, not IT)
• Part of the CHG (non-CAG) organization
• Manages LSF, tools, and licenses (not HW)
• Handles installation and support of most EDA tools
• /auto/edatools mirrored across sites as requested
• Installation initiated via a ToolBox case
• Does not preclude private/project specific tool installation
• Provides tool wrappers
• Available for most tools (VCS, DC, ICC2, hspice, etc.)
• Region aware wrappers set license paths
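To make the idea of a region aware wrapper concrete, here is a minimal sketch of how one might pick a license server before launching a tool. It is not the Tools Group's implementation: the port number, the tool key, and the lookup table are assumptions. The server names come from the License Servers slide, and `LM_LICENSE_FILE` is the generic FlexNet variable that takes `port@host`.

```python
import os

DEFAULT_SERVER = "ls-sjc-01"   # primary CAG license server
LICENSE_PORT = 27000           # hypothetical port, for illustration only

# Hypothetical overrides keyed by (region, tool).
REGION_OVERRIDES = {
    ("NTN", "metaware"): "ls-csi-01",  # NTN low latency server
}


def license_env(region, tool):
    """Return the environment variable a wrapper would set for this tool."""
    host = REGION_OVERRIDES.get((region.upper(), tool.lower()), DEFAULT_SERVER)
    return {"LM_LICENSE_FILE": f"{LICENSE_PORT}@{host}"}


def wrapped_environment(region, tool):
    """Copy the current environment and overlay the license setting."""
    env = dict(os.environ)
    env.update(license_env(region, tool))
    return env


ntn_env = license_env("ntn", "MetaWare")
sjc_env = license_env("SJC", "vcs")
```

Because the wrapper makes this choice, users generally do not specify a license server themselves.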
Vendor Libraries
• Libraries are generally maintained by us (CAG)
• Standard project independent library installation areas
• Optional mirroring to RTP, NTN, GPK, and BGL
• Mirrors have single source location (RW)
• One or more mirrors (RO) in other locations
• Update rates of 1x-6x/day based on size and type
• Daily data migration limits, per Eng IT policies
Storage
• NetApp 100% SSD Storage, w/ tiered storage options
• 6 HA node pairs in MTV ; 4 in NTN ; 6 in BGL
• w/ and w/o backups, snapshots
• NFS and CIFS/SMB volumes
• Replaced on a 2 year cycle by Cisco Engineering IT
• Storage as a Service
• We (CAG) do not own storage
• We pay for it based on consumption
• Off site backups, when enabled, are standard and automated
Volume Identification and Traditional Limits
• local volumes (local, I know, it is a silly name)
• Snapshots, Backups
• 10.18.229.84:/local/cagbb-gb-pd 3.0T 453G 2.5T 16% /auto/cagbb-gb-pd
• Maximum recommended size = 20TB*
• Limited by Cisco backup system policy
Due to the use of multi-IP filers, the host name of the filer is not visible in the output of df; the Unix ‘hosts’ command is used to look it up
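The df line shown above can be picked apart mechanically. The sketch below splits it into fields and, as a stand-in for the ‘hosts’ lookup, uses an ordinary reverse DNS query from the Python standard library; whether that returns the same name as the internal command is an assumption.

```python
import socket


def parse_df_line(line):
    """Split one df -h style line into named fields."""
    source, size, used, avail, use_pct, mount = line.split()
    server, _, export = source.partition(":")
    return {
        "server": server,
        "export": export,
        "size": size,
        "used": used,
        "avail": avail,
        "use_pct": int(use_pct.rstrip("%")),
        "mount": mount,
    }


def filer_name(ip):
    """Reverse-resolve the filer IP; returns None when the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


info = parse_df_line(
    "10.18.229.84:/local/cagbb-gb-pd 3.0T 453G 2.5T 16% /auto/cagbb-gb-pd"
)
```

Calling `filer_name(info["server"])` on a host inside the network would then attempt the name lookup for that volume.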
CIFS/SMB* Laptop Access
• Windows and OSX use CIFS (SMB) to mount file systems
• OSX
smb://mtv5-netapp-eg/local/argon
smb://mtv5-netapp-ns/workspace/wslocal002/rwaldoem
• WIN
\\mtv5-netapp-eg\local\argon
\\mtv5-netapp-ns\workspace\wslocal002\rwaldoem
• Full list of CAG /ws areas in the nightly CAG /ws Report
• Don’t have a /ws? Want one? See Workspace Storage Request
Examples: /ws/kevenes-sjc (MTV), /ws/okarniel-ntn (NTN)
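The OSX and Windows forms above differ only in prefix and separator. A small sketch, using the filer and path from the examples, shows the conversion; the function names are hypothetical.

```python
def smb_url(filer, export_path):
    """OSX form: smb://<filer>/<path>."""
    return "smb://" + filer + "/" + export_path.strip("/")


def unc_path(filer, export_path):
    """Windows form: \\\\<filer>\\<path> with backslash separators."""
    return "\\\\" + filer + "\\" + export_path.strip("/").replace("/", "\\")


osx = smb_url("mtv5-netapp-eg", "/local/argon")
win = unc_path("mtv5-netapp-eg", "/local/argon")
print(osx)
print(win)
```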
Mirrors and Site Selectors
• Mirror – One RW master and one or more RO locations
• Site Selector – Independent RW file systems, multiple locs
• Mirrors/Site Selectors employ region based mounting
• /auto/<name> is effectively dynamic
• Mount location controlled by Vintela (understands regions)
• Example: /auto/asic-tools
• In SJC, mounts mtv5-netapp-eg:/local/asic_tools
• In NTN, mounts ntn01-netapp-eg:/dfs/asic_tools
• This happens to be a mirror, so the NTN location is RO (and dfs)
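Region based mounting can be pictured as a lookup from (path, region) to a filer export. The sketch below models only the /auto/asic-tools example; the table structure and function name are hypothetical, and marking the SJC side as read-write is an inference from the NTN side being the read-only mirror.

```python
# Hypothetical lookup table modelling the /auto/asic-tools example.
MOUNT_MAP = {
    "/auto/asic-tools": {
        # Assumed rw: the slide only states that the NTN copy is RO.
        "SJC": ("mtv5-netapp-eg:/local/asic_tools", "rw"),
        "NTN": ("ntn01-netapp-eg:/dfs/asic_tools", "ro"),
    },
}


def resolve_mount(path, region):
    """Return (export, access mode) for a path as seen from a region."""
    try:
        return MOUNT_MAP[path][region.upper()]
    except KeyError:
        raise KeyError(f"no mount known for {path} in region {region}") from None


sjc = resolve_mount("/auto/asic-tools", "sjc")
ntn = resolve_mount("/auto/asic-tools", "NTN")
```

The same /auto/<name> path therefore resolves to different filers, which is why writes to a mirrored area succeed only at the master site.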
Mirrors
• NetApp provided feature using Snapmirroring
• Data from a single RW (Read Write) master site is replicated to
one or more RO (Read Only) sites
• Useful for tools and libraries
• Schedule varies from 1x/day to 6x/day based on the size and
type of data being replicated
• Automated operation once setup
• Ex: /auto/asic-libs MTV(rw), NTN(ro), RTP(ro), CSI(ro), BGL(ro)
Site Selectors
• Manually maintained (by CAG engineers)
• Data from a RW (Read Write) site is often replicated to one or
more RW sites using tools such as rsync or scp
• Useful for project data and certain types of tools
• Care should be taken when synchronizing into and out of
CAE/NTN due to WAN bandwidth limitations
• Ex: /auto/palladium in NTN(rw), MTV(rw), and BGL(rw)
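When synchronizing a site selector across the WAN, it helps to cap the transfer rate and to estimate the duration first. The sketch below builds an rsync command using its standard `-a`, `--bwlimit`, and `--dry-run` options and gives a back-of-envelope transfer time; the destination host, sub-path, rate cap, and usable link fraction are all assumptions.

```python
def rsync_command(src, dest_host, dest, bwlimit_kib_s, dry_run=True):
    """Build (not run) an rsync command with a bandwidth cap."""
    # rsync's --bwlimit value is in KiB per second by default
    args = ["rsync", "-a", f"--bwlimit={bwlimit_kib_s}"]
    if dry_run:
        args.append("--dry-run")  # show what would transfer, change nothing
    return args + [src.rstrip("/") + "/", f"{dest_host}:{dest}"]


def transfer_hours(size_gb, link_mbps, usable_fraction=0.5):
    """Estimated hours to move size_gb over a link shared with other traffic."""
    bits = size_gb * 8e9
    return bits / (link_mbps * 1e6 * usable_fraction) / 3600


# Hypothetical destination host and sub-directory.
cmd = rsync_command(
    "/auto/palladium/proj", "ntn-host", "/auto/palladium/proj/", bwlimit_kib_s=20000
)
hours = transfer_hours(size_gb=100, link_mbps=500)
```

Defaulting to a dry run is a deliberate choice here: both ends are RW, so a mistaken direction can overwrite good data.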
NetApp FlexGroups
• Newish NetApp proprietary feature *
• Allows single volumes > 100T
• Developed for PD type flows
• Data striped across all filer nodes instead of 1
A NetApp is composed of node pairs (HA). With FG, a 2 filer system like that
in NTN has 4 nodes – and data is striped across all 4
• NetApp nodes also have internal intra-node 40G connectivity
[Figure: NetApp HA pair (Node 1, Node 2) shown with Cisco Nexus 92300YC and Cisco Nexus 5596UP switches]
Data Security
Access to CAG Protected Content
• CAG Full Time Employees (FTE) +VP approved exceptions
• All in ; Access to all official project CAG protected content *
• CAG Contractors
• In for their project ; Access to all data required for their project
• Access to common scripts, libraries, etc.
* Except “skunkworks” (hidden) projects
File Protection System
• Most CAG data resides in the Unix file system (NetApps)
• Protected using Unix GID enabled groups
• World access removed from all file systems ; group access only
• We require filers with NFS Extended Groups (EG) support
• This feature allows > 16 Unix groups
• Authorization is performed by the filer rather than the Unix host
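Two of the points above lend themselves to a quick check: whether a file mode still grants world access, and whether a user's group count exceeds what a classic NFS AUTH_SYS credential can carry (16 groups). The sketch below uses only the standard `stat` module; the function names are hypothetical and this is not a CAG tool.

```python
import stat

# A classic NFS AUTH_SYS credential carries at most 16 supplementary groups.
AUTH_SYS_MAX_GROUPS = 16


def has_world_access(mode):
    """True if any 'other' permission bit (r, w, or x) is set."""
    return bool(mode & stat.S_IRWXO)


def strip_world_access(mode):
    """Return the mode with all 'other' permission bits cleared."""
    return mode & ~stat.S_IRWXO


def needs_extended_groups(gids):
    """True if the user belongs to more groups than AUTH_SYS can carry."""
    return len(set(gids)) > AUTH_SYS_MAX_GROUPS


group_only = strip_world_access(0o755)
```

On a live system, `os.stat(path).st_mode` and `os.getgroups()` would supply the inputs.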
Repository Protection Scheme
• Perforce security
• Access
• Users require CEC accounts for authentication.
• Users must be members of an ASIC-specific AD group (composed of dynamic HR-list
under Eyal plus manual additions)
• AD groups used to restrict access to specific project data (same groups as NFS)
• Encryption
• All Perforce traffic uses SSL (including authentication, file data, and meta-data)
• Auditing
• Medium level of server-side logging, short retention of logs
• Source and Perforce workspaces are 99% NFS
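On the client side, the SSL requirement shows up as an `ssl` protocol prefix in the P4PORT setting. The sketch below checks a P4PORT string for such a prefix; the server name and port are hypothetical, and the function is an illustration rather than part of any CAG tooling.

```python
# Protocol prefixes Perforce accepts for SSL connections.
SSL_PREFIXES = {"ssl", "ssl4", "ssl6", "ssl46", "ssl64"}


def uses_ssl(p4port):
    """True if the P4PORT value starts with an SSL protocol prefix."""
    prefix = p4port.split(":", 1)[0]
    return prefix in SSL_PREFIXES


# Hypothetical server names and port.
encrypted = uses_ssl("ssl:p4.example.com:1666")
plain = uses_ssl("p4.example.com:1666")
```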
Complications
• Humans – a significant paradigm shift
• Users require education and training
• Managers need to actively manage data access (group membership)
• Users need to become familiar with tool reported access messages
• Non-CAG collaborators (e.g. SW, DFT) can be a support burden to all
• Infrastructure
• 24-48 hours latency when adding new members to a program
• Increased reliance on a robust Active Directory (AD) infrastructure
• In rare circumstances, users can lose access to data for up to 24 hours
Training Materials and Useful Links
Training Materials From Doc Central
• The following documents are available in Cisco's Doc Central repository.
To access them, you must already be a member of the standard group "cag-base"
Cisco Internal Home https://wwwin.cisco.com
Employee Resources : https://wwwin.cisco.com
Case Management – http://ays.cisco.com
• Laptop, phone, network, badge, ... problems go here
Requesting Things – http://estore.cisco.com
• Desktop SW, storage requests, One Time Passes (OTPs), etc.
estore - Continued
• Once you place an order, use the "My
Orders" link to see them
• To see current SW subscriptions, use
the "My Things" link
MobilePass and VPN
Duo MFA (Multi-Factor Authentication)
• Link
Buying Things – http://smartbuy.cisco.com
• Headsets
• Keyboards
• Mice
• Desktop systems
• etc.
Active Directory (ADAM) https://adam.cisco.com
Change Unix information, home directories, group management
Password Reset https://pwreset.cisco.com
Backup
Optimization: pid-track.pl – Process Tracking
• Wall time vs CPU time
• Stalled job analysis, multi-thread effectiveness
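The comparison of wall time and CPU time reduces to one ratio. The sketch below is not pid-track.pl; it is a minimal illustration of the idea, and the thresholds for "stalled" and "effective" are assumptions chosen for the example.

```python
def cpu_utilization(cpu_seconds, wall_seconds, threads=1):
    """Fraction of the available thread-time that was spent on CPU."""
    if wall_seconds <= 0 or threads <= 0:
        raise ValueError("wall_seconds and threads must be positive")
    return cpu_seconds / (wall_seconds * threads)


def classify(cpu_seconds, wall_seconds, threads=1,
             stalled_below=0.05, effective_above=0.7):
    """Label a job by how well it used the threads it asked for."""
    u = cpu_utilization(cpu_seconds, wall_seconds, threads)
    if u < stalled_below:
        return "stalled"          # almost no CPU time for the wall time spent
    if u >= effective_above:
        return "effective"        # threads were kept busy
    return "underutilized"        # e.g. 4 threads requested, about 1 used


busy = classify(cpu_seconds=3600, wall_seconds=1000, threads=4)
idle = classify(cpu_seconds=10, wall_seconds=1000)
serial = classify(cpu_seconds=1000, wall_seconds=1000, threads=4)
```

A 4-thread job that accumulates only as much CPU time as wall time is effectively serial, which is the kind of case multi-thread effectiveness analysis is meant to surface.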
Laptop Security
• Laptops are permitted to access and store restricted data
• They are considered secure devices
• IT managed
• Users authenticate using CEC credentials
OS and LSF Migrations
Red Hat 7.4 Migration
• Cisco will end support for RH6 in CY20
• Requires us to migrate the infrastructure to RH7
• For tools that will not run on RH7, Eng IT provides FBE
• Using a wrapper, able to run tools natively using the old OS
• Test VMs and LSF hosts created with new OS and packages
• RH7 testing progress documented in RH7 Testing Matrix
• For multi-user VMs, the Desktop Environment (DE) is problematic
• Gnome requires HW assist (unavail in VMs)
• KDE uses excessive resources, poor choice for multi-user environments
• Xfce will be the likely choice for a default DE
LSF 10 Migration
• We currently run LSF 9.1
• LSF 10 provides incremental benefits
• Improved reporting of PEND reasons
• Improved transfer of statistics to RTM DB
• Better support for RH7 as well as better support from IBM