Cray Urika-GX Instruction Manual

Urika®-GX System Administration Guide
(2.2.UP00)
S-3016

Contents
1 About the Urika®-GX System Administration Guide............................................................................................... 7
2 The Urika-GX System.............................................................................................................................................9
2.1 Administrative Components of Urika-GX...................................................................................................9
2.2 Network Components.............................................................................................................................. 10
2.3 File Systems............................................................................................................................................ 11
2.4 System Nodes......................................................................................................................................... 12
2.5 Restrictions on Use................................................................................................................................. 12
3 System Management............................................................................................................................................15
3.1 Check the Current Service Mode............................................................................................................ 15
3.2 Urika-GX Component Naming Conventions............................................................................................15
3.3 System Management Workstation (SMW)...............................................................................................17
3.3.1 Power On the System Management Workstation (SMW).......................................................... 17
3.3.2 About the Integrated Dell Remote Access Controller (iDRAC).................................................. 17
3.3.3 Control System Management Workstation (SMW) Power with the iDRAC8 Web Console........17
3.3.4 Synchronize the System Management Workstation (SMW) to the Site NTP Server..................20
3.3.5 Synchronize Time of Day on System Nodes.............................................................................. 21
3.3.6 Reboot a Stopped System Management Workstation (SMW)................................................... 22
3.4 Hardware Supervisory System (HSS)..................................................................................................... 22
3.4.1 Hardware Supervisory System (HSS) Architecture Overview....................................................24
3.4.2 The xtdiscover Command.......................................................................................................... 25
3.4.3 Hardware Supervisory System (HSS) Component Location Discovery..................................... 25
3.4.4 Hardware Supervisory System (HSS) Daemons........................................................................26
3.4.5 Hardware Supervisory System (HSS) Administration and Diagnostic Commands Supported on Urika-GX.............................................................................................................27
3.4.6 Hardware Supervisory System (HSS) Environments................................................................. 30
3.4.7 High Speed Network (HSN) Management................................................................................. 32
3.4.8 Create Direct Connection between the System Management Workstation (SMW) and a Compute Node Console............................................................................................................32
3.4.9 Disable Hardware Components................................................................................................. 33
3.4.10 Enable Hardware Components................................................................................................ 33
3.4.11 Set Hardware Components to EMPTY...................................................................................... 34
3.4.12 Stop Components Using the Hardware Supervisory System (HSS)........................................ 34
3.4.13 Unlock Hardware Components................................................................................................ 35
3.4.14 Capture and Analyze System-level and Node-level Dumps.....................................................35
3.4.15 Collect Debug Information From Hung Nodes Using the xtnmi Command.............................. 36
3.4.16 Find Node Information..............................................................................................................36
3.4.17 Request and Display System Routing......................................................................................37
3.4.18 Initiate a Network Discovery Process.......................................................................................38
3.4.19 Power Up a Rack or Dual Aries Network Card (dANC)............................................................38
3.4.20 Check the Status of System Components................................................................................38
3.4.21 Check Compute Node High Speed Network (HSN) Connection..............................................39
3.4.22 Monitor the Health of PCIe Channels.......................................................................................39
3.4.23 Poll a Response from an HSS Daemon, Manager, or the Event Router..................................39
3.4.24 View Component Alert, Warning, and Location History............................................................40
3.4.25 Display Alerts and Warnings.................................................................................................... 40
3.4.26 Display Error Codes................................................................................................................. 40
3.4.27 Display Component State Information......................................................................................41
3.4.28 Clear Component Flags........................................................................................................... 41
3.4.29 Flash Management on Urika-GX..............................................................................................41
3.4.30 Create and Modify the authorized_keys File Using the xtcc-ssh-keys Command....... 42
3.4.31 Change the Passwords of RC, dANCCs and iSCB using the xtccpasswd Command.......... 42
3.4.32 Gather Troubleshooting Information Using the xtdumpsys Command.....................................43
3.5 Dual Aries Network Card (dANC) Management...................................................................................... 43
3.6 Analyze Node Memory Dump Using the kdump and crash Utilities on a Node.....................................44
3.7 Cray Lightweight Log Management (LLM) System................................................................................. 45
3.8 Urika-GX Node Power Management....................................................................................................... 45
3.9 Power Up the Urika-GX System.............................................................................................................. 46
3.10 Power Down the Urika-GX System....................................................................................................... 49
3.11 Urika-GX CLI Commands for Managing Services................................................................................. 51
3.12 Remote HDFS Access and Multihoming on Urika-GX.............................................................54
3.13 Update the InfluxDB Data Retention Policy...........................................................................................54
3.14 Service to Node Mapping...................................................................................................................... 55
3.15 Image Management with Docker and Kubernetes................................................................................ 59
3.15.1 Execute Spark Jobs on Kubernetes......................................................................................... 60
3.15.2 Multi-tenant Spark Thrift Server on Urika-GX...........................................................................62
4 System Monitoring................................................................................................................................................ 65
4.1 System Monitoring Tools......................................................................................................................... 65
4.2 Monitor Resource Utilization and Node Status Using Nagios................................................................. 66
4.2.1 Configure SSL/TLS for Nagios Core.......................................................................................... 67
4.2.2 Configure the Nagios Server to Send Email Notifications.......................................................... 70
4.2.3 Change the Default Log File Path and Rotation Interval............................................................ 73
4.2.4 Configure Email Alerts................................................................................................................74
4.2.5 Modify Nagios Plug-in Threshold............................................................................................... 75
4.3 Get Started with Using Grafana...............................................................................................................77
4.4 Default Grafana Dashboards...................................................................................................................79
4.5 Update InfluxDB Security Settings.......................................................................................................... 90
4.6 Update the InfluxDB Data Retention Policy.............................................................................................91
4.7 Configuration Settings of Grafana........................................................................................................... 93
4.8 Change the Default Timezone Displayed on Grafana............................................................................. 93
4.9 Create a New Grafana Dashboard.......................................................................................................... 95
4.10 Add a New Graph to the Grafana Dashboard....................................................................................... 97
4.11 Start InfluxDB Before Hadoop Services.............................................................................................. 100
4.12 Monitor Subrack Attributes.................................................................................................................. 101
4.13 Analyze Node Memory Dump Using the kdump and crash Utilities on a Node.................................102
4.14 Retrieve System Status Information Using the urika-check-platform Command.................................103
4.15 iSCB Description................................................................................................................................. 104
4.15.1 Log on to the iSCB................................................................................................................. 104
4.15.2 iSCB Command Reference.................................................................................................... 105
5 Resource Management...................................................................................................................................... 124
5.1 Manage Resources on Urika-GX...........................................................................................................124
5.2 Use Apache Mesos on Urika-GX ..........................................................................................................126
5.2.1 Access the Apache Mesos Web UI.......................................................................................... 128
5.3 Use mrun to Retrieve Information About Marathon and Mesos Frameworks........................................129
5.4 Launch an HPC Job Using mrun........................................................................................................... 133
5.5 Manage Long Running Services Using Marathon................................................................................. 133
5.6 Manage the Spark Thrift Server as a Non-Admin User......................................................................... 136
5.7 Manage Jobs Using the Cray Application Management UI................................................................... 137
5.7.1 Overview of the Cray Application Management UI...................................................................138
6 Cray DVS............................................................................................................................................................140
6.1 Introduction to DVS............................................................................................................................... 140
6.1.1 Use Cray DVS on Urika-GX..................................................................................................... 141
6.1.2 DVS ioctl Interfaces..................................................................................................................141
6.1.3 DVS Client Mount Point Options.............................................................................................. 143
6.1.4 DVS Environment Variables..................................................................................................... 149
6.1.5 Modes.......................................................................................................................................150
6.1.6 Resiliency and Diagnostics...................................................................................................... 154
6.1.7 Caveats.................................................................................................................................... 157
6.1.8 Administrative Tasks.................................................................................................................158
7 Security...............................................................................................................................................................175
7.1 Authentication and Authorization...........................................................................................................175
7.2 Urika-GX Service Modes....................................................................................................................... 177
7.2.1 Modify the Service Mode..........................................................................................................180
7.2.2 User Interface Access in the Secure Service Mode................................................................. 181
7.3 Security Architecture Overview............................................................................................................. 181
7.4 Set up Passwordless SSH.....................................................................................................................182
7.5 Tenancy................................................................................................................................................. 183
7.5.1 Configure a Bridge Port............................................................................................................185
7.5.2 Tenant Management.................................................................................................................189
7.5.3 Tenant Virtual Machine States..................................................................................................195
7.5.4 Tenant Management CLI Commands.......................................................................................196
7.5.5 Execution of Lustre Sub-Commands Inside Tenant VMs......................................................... 198
7.5.6 Get Started with Tenant Management......................................................................................199
7.5.7 Multi-Tenancy........................................................................................................................... 202
7.5.8 Multi-tenant HDFS....................................................................................................................203
7.6 Authorized User Management...............................................................................................................205
7.7 Guidance on LDAP Forwarding.............................................................................................................208
7.8 Authentication Mechanisms...................................................................................................................216
7.9 Change Default Passwords................................................................................................................... 217
7.9.1 Default Urika-GX System Accounts......................................................................................... 219
7.9.2 Change the Default Nagios Password..................................................................................... 220
7.9.3 Change the Default iDRAC8 Password....................................................................................221
7.9.4 Change the Default System Management Workstation (SMW) Passwords.............................223
7.9.5 Change LDAP Password on Urika-GX.....................................................................................224
7.9.6 Reset a Forgotten Password for the Cray Application Management UI...................................224
7.9.7 Reset an Administrator LDAP Password on Systems Using Urika-GX 1.2UP01 and Earlier Releases................................................................................................................................. 225
7.9.8 Reset an Administrator LDAP Password when the OLC Schema Password is Unknown....... 226
7.9.9 Reset an Administrator LDAP Password when the OLC Schema Password is Known........... 228
7.10 Tableau Authorization and Authentication Mechanisms...................................................................... 229
7.11 Enable SSL..........................................................................................................................................229
7.12 Enable SSL for Spark Thrift Server of a Tenant.................................................................................. 234
7.13 Install a Trusted SSL Certificate on Urika-GX..................................................................................... 235
7.14 Enable LDAP Authentication on Urika-GX ......................................................................................... 236
7.14.1 Enable LDAP for Connecting Tableau to HiveServer2........................................................... 238
7.15 Enable SQL Standard based Authorization for HiveServer2...............................................................239
7.16 File System Permissions..................................................................................................................... 240
7.17 Urika-GX Security Quick Reference Information................................................................................. 240
7.18 Port Assignments................................................................................................................................ 241
8 Troubleshooting.................................................................................................................................................. 245
8.1 System Management Log File Locations.............................................................................................. 245
8.2 Default Log Settings.............................................................................................................................. 246
8.3 Analytic Applications Log File Locations............................................................................................... 248
8.4 Security Related Troubleshooting Information.......................................................................................250
8.4.1 Save and Restore Tenant Information......................................................................................254
8.4.2 LDAP Server Start-up Issues................................................................................................... 256
8.5 Modify the Secret of a Mesos Framework............................................................................................. 256
8.6 Clean Up Log Data................................................................................................................................ 257
8.7 Diagnose and Troubleshoot Orphaned Mesos Tasks............................................................................258
8.8 Troubleshoot Common Analytic and System Management Issues ...................................................... 259
8.9 Troubleshoot mrun Issues.....................................................................................................................268
8.10 Troubleshoot: Application Hangs as a Result of NFS File Locking..................................................... 270
8.11 Troubleshoot: DVS does not Start after Data Store Move...................................................................270
8.12 Troubleshoot: DVS Ignores User Environment Variables....................................................................271
8.13 Clear Leftover hugetlbfs Files...............................................................................................................271
8.14 Remove Temporary Spark Files from SSDs........................................................................................271
1 About the Urika®-GX System Administration Guide
This publication contains administrative information about using the Cray® Urika®-GX system.
Typographic Conventions
● Monospace - Indicates program code, reserved words, library functions, command-line prompts,
screen output, file/path names, key strokes (e.g., Enter and Alt-Ctrl-F), and other software constructs.
● Monospaced Bold - Indicates commands that must be entered on a command line or in response to an
interactive prompt.
● Oblique or Italics - Indicates user-supplied values in commands or syntax definitions.
● Proportional Bold - Indicates a graphical user interface window or element.
● \ (backslash) - At the end of a command line, indicates the Linux® shell line continuation character
(lines joined by a backslash are parsed as a single line). Do not type anything after
the backslash or the continuation feature will not work correctly.
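The backslash convention can be illustrated with a minimal shell sketch. The command and its arguments below are hypothetical placeholders chosen only for illustration; they are not taken from this guide.

```shell
# Hypothetical example: the words passed to echo are placeholders.
# Each backslash must be the very last character on its line; a trailing
# space after it would end the command early instead of continuing it.
joined=$(echo one \
  two \
  three)

# The shell parsed the three physical lines as the single command
# "echo one two three".
echo "$joined"
```

If a space or any other character follows the backslash, the shell escapes that character instead of the newline, and the following line is then run as a separate command.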
Scope and Audience
The audience of this publication is system administrators of the Urika®-GX system. This publication is not
intended to provide detailed information about open source products used in the system. References to online
documentation are included where applicable.
Record of Revision
Date               Addressed Release
September, 2018    2.2UP00
May, 2018          2.1UP00
December, 2017     2.0UP00
April, 2017        1.2UP00
December, 2016     1.1UP00
August, 2016       1.0UP00
March, 2016        0.5UP00
Record of Revision
This revision includes updates to Tableau-related topics.
Trademarks
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and
design, SONEXION, Urika-GX, Urika-XA, Urika-GD, and YARCDATA. The following are trademarks of Cray Inc.:
APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYDOC, CRAYPAT, CRAYPORT, DATAWARP, ECOPHLEX,
LIBSCI, NODEKARE. The following system family marks, and associated model number marks, are trademarks
of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a
sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other
trademarks used in this document are the property of their respective owners.
2 The Urika-GX System
The Urika-GX system is a big data analytics platform optimized for analytic workflows. It combines a highly
advanced hardware platform with a comprehensive analytic software stack to help derive optimal business value
from data. The Urika-GX platform provides the tools required for capturing and organizing a wide variety of data
types from different sources and enables analyzing big data and discovering hidden relationships.
The Urika-GX system also features a number of workload management tools as well as an optimized system
administration tool for performing monitoring and management tasks.
For a list of features of the Urika-GX system, see S-3017, "Urika®-GX System Overview".
2.1 Administrative Components of Urika-GX
Urika-GX platforms have been developed by tightly integrating commodity hardware components, open-source
software, and Cray proprietary hardware to provide users with a high-performance, scalable, and open compute
platform.
Major administrative components of Urika-GX include:
● System Management Workstation (SMW) - The SMW is a server that acts as a single-point interface to a
system administrator's environment. It provides an interface for performing administrative and monitoring
tasks.
○ Hardware Supervisory System (HSS) - HSS is an integrated system of hardware and software
components that are used for managing and monitoring the system.
○ Cobbler - Cobbler is used on Urika-GX for provisioning and deployment.
● Rack Controller (RC) - The RC monitors the environmental sensors within the rack and manages
communication between the SMW and other physical system components, including the rack, sub-rack and
dANC (Dual Aries Network Card).
● Intelligent Subrack Control Board (iSCB) - The iSCB status command can be used to monitor the physical
attributes of the sub-rack, such as the power supply, amperage, fan status, and temperature.
● Aries Network Card Controller (ANCC) - Each sub-rack chassis of the Urika-GX system contains two
dANCs (dual Aries Network Cards). Each dANC contains two Aries chips, an Advanced RISC Machines (ARM)
processor, and a number of environmental sensors to help monitor the system.
● Integrated Dell Remote Access Controller (iDRAC) - The iDRAC is a hardware component that provides
advanced agentless system management functionality for the SMW. It operates independently of the SMW's
CPU and operating system. The version of iDRAC used on the Urika-GX system is iDRAC8.
● System Monitoring and Performance Analysis Tools - Urika-GX ships with Grafana and Nagios. These
tools enable monitoring system resources and viewing performance statistics of various system components.
For more information, see S-3015, "Urika®-GX Analytic Applications Guide".
● Data Analytic Components - Urika-GX features a number of data analytic tools that help perform analytic
tasks, including managing and monitoring clusters, executing Hadoop and Spark jobs, and performing graph
analytics. For more information, see S-3015, "Urika®-GX Analytic Applications Guide" and S-3010,
"Cray™ Graph Engine User Guide".
● Security and Tenant Management Tools - Secret files used on the system are managed by the Urika-GX
Secret Manager. Tenancy is implemented through the use of a tenant VM that runs on physical nodes and
provides controlled access to services on the physical nodes through a command proxy mechanism. For
more information, refer to Urika-GX Service Modes on page 177 and Tenancy on page 183.
NOTE: Only Spark and HDFS commands can be executed within a tenant VM in this release. Commands for
flexing the cluster, mrun, and Cray Graph Engine (CGE) CLI commands cannot be executed within a
tenant VM.
In addition, Urika-GX features a number of CLI scripts that facilitate system management and monitoring.
2.2 Network Components
Three networks are deployed on the Urika®-GX platform:
● Aries High Speed Network (HSN) - The Aries HSN provides high speed application and data network
connectivity between nodes. This network provides node interconnect via high bandwidth, low latency DMA
access. The hardware to support this network consists of an Aries Interface Board (AIB) connected to an
available PCIe slot on each Urika-GX node and integrated into the node chassis assembly. The AIB is
connected to the dANC integrated in the Urika-GX sub-rack. Copper cables provide an all-to-all connection of
all dANCs in the system.
● Operational Ethernet network - The operational Ethernet network is used for ingesting user data. This
network consists of a single unit 48-port GigE switch that provides dual 1GigE and/or dual 10GigE
interfaces to the site network. Urika-GX's login nodes do not route through this switch and need to be directly
connected to the site network. The operational network allows node connectivity externally from Urika-GX to
the site network. The Urika-GX compute and I/O nodes are connected to a single managed Brocade ICX
6450-48, 48-port switch with a single power supply. Connectivity of this network to the site network is made
possible by two available Gigabit Ethernet ports and/or two 10 Gigabit Ethernet ports on the ICX 6450-48
switch.
The operational network can also be used to access data streaming applications and services directly from
compute nodes.
● Management Ethernet network - The management Ethernet network is primarily used for system
management, and not for user data. The management Ethernet network consists of two stacked 1U 48-port
switches, which are located at the top of the Urika-GX rack, and can optionally contain redundant switch
power supplies. These switches provide GigE management Ethernet connectivity to every node, System
Management Workstation (SMW), Rack Controller (RC), Intelligent Subrack Control Board (iSCB), Power
Distribution Units (PDUs), Dual Aries Network Cards (dANCs) and to the operational network that connects to
the nodes.
The Urika-GX system also contains the following subnets:
○ SMW subnet, which provides connectivity to the SMW and the RC.
○ Rack subnet, which provides connectivity to the dANCs and iSCB module.
This network is supported by two managed Brocade ICX 6450-48, 48-port switches stacked together with two
10GigE optical interconnects. Each switch contains a single power supply, and can optionally contain