Monitoring clusters and diagnosing problems
Teleport provides health check mechanisms to verify that it is healthy and ready to serve traffic. Metrics, tracing, and profiling provide in-depth data for tracking cluster performance and responsiveness.
Enable health monitoring#
How to monitor the health of a Teleport instance.
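Teleport's diagnostic HTTP endpoints are disabled by default. You can enable them by starting Teleport with the --diag-addr flag (for example, teleport start --diag-addr=127.0.0.1:3000) or by setting diag_addr: 127.0.0.1:3000 in the teleport section of the configuration file.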
Ensure you can connect to the diagnostic endpoint
Verify that Teleport is now serving the diagnostics endpoint:
```code
$ curl http://127.0.0.1:3000/healthz
```
You can now collect monitoring information from several endpoints. Tools such as Kubernetes probes can use these endpoints to monitor the health of a Teleport process.
/healthz#
The http://127.0.0.1:3000/healthz endpoint responds with a body of {"status":"ok"} and an HTTP 200 OK status code if the process is running.
This is a simple check, suitable for determining whether the Teleport process is still running.
/readyz#
The http://127.0.0.1:3000/readyz endpoint is similar to /healthz, but its response includes information about the state of the process.
The response body is a JSON object of the following form:
{ "status": "a status message here"}
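/readyz and heartbeats#
If a Teleport component fails to execute its heartbeat procedure, it enters a degraded state. Teleport begins recovering from this state when a heartbeat completes successfully.
The first successful heartbeat transitions Teleport into a recovering state. A second consecutive successful heartbeat transitions Teleport to the OK state.
Teleport heartbeats run approximately every 60 seconds when healthy, and failed heartbeats are retried approximately every 5 seconds. This means that, depending on the timing of heartbeats, it can take 60-70 seconds after connectivity is restored for /readyz to report a healthy state again.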
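The transitions described above can be sketched as a small state machine. This is a simplified model for illustration only, not Teleport's implementation: the function names next_state and replay are hypothetical, and the model assumes that any failed heartbeat returns the process to the degraded state.

```shell
# next_state CURRENT RESULT
#   CURRENT: ok | recovering | degraded
#   RESULT:  success | failure (outcome of the latest heartbeat)
# Prints the state after that heartbeat (simplified model, see assumptions above).
next_state() {
  current=$1
  result=$2
  if [ "$result" = "failure" ]; then
    echo degraded
    return 0
  fi
  case "$current" in
    degraded)   echo recovering ;;  # first successful heartbeat after a failure
    recovering) echo ok ;;          # second consecutive successful heartbeat
    *)          echo ok ;;          # already healthy, stays healthy
  esac
}

# replay START RESULT...
# Applies a sequence of heartbeat results to START and prints the final state.
replay() {
  state=$1
  shift
  for result in "$@"; do
    state=$(next_state "$state" "$result")
  done
  echo "$state"
}
```

For example, replay ok failure success prints recovering, and one more success brings the model back to ok.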
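Status codes#
The status code of the response can be one of the following:
- HTTP 200 OK: Teleport is operating normally.
- HTTP 503 Service Unavailable: Teleport has encountered a connection error and is running in a degraded state. This happens when a Teleport heartbeat fails.
- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or has begun recovering from a degraded state.
The same state information is also available through the process_state metric under the /metrics endpoint.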
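A script can turn the /readyz status code into one of the states listed above. The sketch below is one way to do that with curl; the function names readyz_state and probe_readyz and the state labels they print are illustrative assumptions, not part of Teleport.

```shell
# readyz_state HTTP_CODE
# Maps a /readyz status code to the state it represents in this guide.
readyz_state() {
  case "$1" in
    200) echo "ok" ;;
    503) echo "degraded" ;;
    400) echo "starting-or-recovering" ;;
    *)   echo "unknown" ;;   # includes 000, which curl prints when it cannot connect
  esac
}

# probe_readyz [URL]
# Queries the endpoint (default: the address used in this guide) and prints the state.
probe_readyz() {
  url=${1:-http://127.0.0.1:3000/readyz}
  code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")
  readyz_state "$code"
}
```

Calling probe_readyz with no arguments checks the local diagnostic address; pass a different URL if the diagnostic endpoint listens elsewhere.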
Metrics#
Teleport exposes metrics for all of its components, giving you insight into the state of your cluster. This guide describes the metrics you can collect from a Teleport cluster.
Enable metrics#
The /metrics endpoint is served by the same diagnostic HTTP endpoint as /healthz. That endpoint is disabled by default; enable it with the --diag-addr flag or the diag_addr configuration field, as for health monitoring.
Once the diagnostic endpoint is enabled, http://127.0.0.1:3000/metrics serves the metrics that Teleport tracks. It is compatible with Prometheus collectors.
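For a quick check without a Prometheus server, a single value can be pulled out of the text that /metrics returns. The sketch below assumes the metric is exposed as an unlabeled sample (a line of the form name value), which is how process_state is expected to appear; metric_value and process_state_name are hypothetical helper names, and the state labels follow the process_state description in the "All Teleport instances" table.

```shell
# metric_value NAME
# Reads Prometheus text-format metrics on stdin and prints the value of the
# first unlabeled sample named NAME. Prints nothing if the metric is absent.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# process_state_name VALUE
# Translates the process_state gauge value into a readable label.
process_state_name() {
  case "$1" in
    0) echo ok ;;
    1) echo recovering ;;
    2) echo degraded ;;
    3) echo starting ;;
    *) echo unknown ;;
  esac
}

# Usage against a live instance (not executed here):
#   curl -s http://127.0.0.1:3000/metrics | metric_value process_state
```

Comment lines such as # HELP and # TYPE are skipped automatically because their first field is #, not the metric name.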
The following metrics are available:
Auth Service and backends#
| Name | Type | Component | Description |
|---|---|---|---|
| audit_failed_disk_monitoring | counter | Teleport Audit Log | Number of times disk monitoring failed. |
| audit_failed_emit_events | counter | Teleport Audit Log | Number of times emitting audit events failed. |
| audit_percentage_disk_space_used | gauge | Teleport Audit Log | Percentage of disk space used. |
| audit_server_open_files | gauge | Teleport Audit Log | Number of open audit files. |
| auth_generate_requests_throttled_total | counter | Teleport Auth | Number of throttled requests to generate new server keys. |
| auth_generate_requests_total | counter | Teleport Auth | Number of requests to generate new server keys. |
| auth_generate_requests | gauge | Teleport Auth | Number of current generate requests. |
| auth_generate_seconds | histogram | Teleport Auth | Latency for generate requests. |
| backend_batch_read_requests_total | counter | cache | Number of batch read requests to the backend. |
| backend_batch_read_seconds | histogram | cache | Latency for batch read operations. |
| backend_batch_write_requests_total | counter | cache | Number of batch write requests to the backend. |
| backend_batch_write_seconds | histogram | cache | Latency for backend batch write operations. |
| backend_read_requests_total | counter | cache | Number of read requests to the backend. |
| backend_read_seconds | histogram | cache | Latency for read operations. |
| backend_requests | counter | cache | Number of requests to the backend (reads, writes, and keepalives). |
| backend_write_requests_total | counter | cache | Number of write requests to the backend. |
| backend_write_seconds | histogram | cache | Latency for backend write operations. |
| cluster_name_not_found_total | counter | Teleport Auth | Number of times a cluster was not found. |
| dynamo_requests_total | counter | DynamoDB | Total number of requests to the DynamoDB API. |
| dynamo_requests | counter | DynamoDB | Total number of requests to the DynamoDB API grouped by result. |
| dynamo_requests_seconds | histogram | DynamoDB | Latency of DynamoDB API requests. |
| etcd_backend_batch_read_requests | counter | etcd | Number of batch read requests to the etcd database. |
| etcd_backend_batch_read_seconds | histogram | etcd | Latency for etcd batch read operations. |
| etcd_backend_read_requests | counter | etcd | Number of read requests to the etcd database. |
| etcd_backend_read_seconds | histogram | etcd | Latency for etcd read operations. |
| etcd_backend_tx_requests | counter | etcd | Number of transaction requests to the database. |
| etcd_backend_tx_seconds | histogram | etcd | Latency for etcd transaction operations. |
| etcd_backend_write_requests | counter | etcd | Number of write requests to the database. |
| etcd_backend_write_seconds | histogram | etcd | Latency for etcd write operations. |
| teleport_etcd_events | counter | etcd | Total number of etcd events processed. |
| teleport_etcd_event_backpressure | counter | etcd | Total number of times event processing encountered backpressure. |
| firestore_events_backend_batch_read_requests | counter | GCP Cloud Firestore | Number of batch read requests to Cloud Firestore events. |
| firestore_events_backend_batch_read_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch read operations. |
| firestore_events_backend_batch_write_requests | counter | GCP Cloud Firestore | Number of batch write requests to Cloud Firestore events. |
| firestore_events_backend_batch_write_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events batch write operations. |
| firestore_events_backend_write_requests | counter | GCP Cloud Firestore | Number of write requests to Cloud Firestore events. |
| firestore_events_backend_write_seconds | histogram | GCP Cloud Firestore | Latency for Cloud Firestore events write operations. |
| gcs_event_storage_downloads_seconds | histogram | GCP GCS | Latency for GCS download operations. |
| gcs_event_storage_downloads | counter | GCP GCS | Number of downloads from the GCS backend. |
| gcs_event_storage_uploads_seconds | histogram | GCP GCS | Latency for GCS upload operations. |
| gcs_event_storage_uploads | counter | GCP GCS | Number of uploads to the GCS backend. |
| grpc_server_started_total | counter | Teleport Auth | Total number of RPCs started on the server. |
| grpc_server_handled_total | counter | Teleport Auth | Total number of RPCs completed on the server, regardless of success or failure. |
| grpc_server_msg_received_total | counter | Teleport Auth | Total number of RPC stream messages received on the server. |
| grpc_server_msg_sent_total | counter | Teleport Auth | Total number of gRPC stream messages sent by the server. |
| heartbeat_connections_received_total | counter | Teleport Auth | Number of times the Auth Service received a heartbeat connection, representing the total number of Agents sending heartbeats. |
| s3_requests_total | counter | Amazon S3 | Total number of requests to the S3 API. |
| s3_requests | counter | Amazon S3 | Total number of requests to the S3 API grouped by result. |
| s3_requests_seconds | histogram | Amazon S3 | Request latency for the S3 API. |
| teleport_audit_emit_events | counter | Teleport Audit Log | Number of audit events emitted. |
| teleport_audit_parquetlog_batch_processing_seconds | histogram | Teleport Audit Log | Duration of processing a single batch of events in the Parquet-format audit log. |
| teleport_audit_parquetlog_s3_flush_seconds | histogram | Teleport Audit Log | Duration of flushing Parquet files to S3 in the Parquet-format audit log. |
| teleport_audit_parquetlog_delete_events_seconds | histogram | Teleport Audit Log | Duration of deleting events from SQS in the Parquet-format audit log. |
| teleport_audit_parquetlog_batch_size | histogram | Teleport Audit Log | Overall size of events in a single batch in the Parquet-format audit log. |
| teleport_audit_parquetlog_batch_count | counter | Teleport Audit Log | Total number of events in a single batch in the Parquet-format audit log. |
| teleport_audit_parquetlog_last_processed_timestamp | gauge | Teleport Audit Log | Time of the last processing in the Parquet-format audit log. |
| teleport_audit_parquetlog_age_oldest_processed_message | gauge | Teleport Audit Log | Age of the oldest event in the Parquet-format audit log. |
| teleport_audit_parquetlog_errors_from_collect_count | counter | Teleport Audit Log | Number of collect failures in the Parquet-format audit log. |
| teleport_connected_resources | gauge | Teleport Auth | Number and type of resources connected via keepalives. |
| teleport_postgres_events_backend_write_requests | counter | Postgres (Events) | Number of write requests to postgres events, labeled with the request status (success or failure). |
| teleport_postgres_events_backend_batch_read_requests | counter | Postgres (Events) | Number of batch read requests to postgres events, labeled with the request status (success or failure). |
| teleport_postgres_events_backend_batch_delete_requests | counter | Postgres (Events) | Number of batch delete requests to postgres events, labeled with the request status (success or failure). |
| teleport_postgres_events_backend_write_seconds | histogram | Postgres (Events) | Latency for postgres events write operations, in seconds. |
| teleport_postgres_events_backend_batch_read_seconds | histogram | Postgres (Events) | Latency for postgres events batch read operations, in seconds. |
| teleport_postgres_events_backend_batch_delete_seconds | histogram | Postgres (Events) | Latency for postgres events batch delete operations, in seconds. |
| teleport_registered_servers | gauge | Teleport Auth | The number of Teleport services that are connected to an Auth Service instance grouped by version. |
| teleport_registered_servers_by_install_methods | gauge | Teleport Auth | The number of Teleport services that are connected to an Auth Service instance grouped by install methods. |
| teleport_roles_total | gauge | Teleport Auth | The number of roles that exist in the cluster. |
| teleport_migrations | gauge | Teleport Auth | Tracks for each migration if it is active (1) or not (0). |
| teleport_bot_instances | gauge | Teleport Auth | The number of bot instances across the entire cluster grouped by version. |
| user_login_total | counter | Teleport Auth | Number of user logins. |
| watcher_event_sizes | histogram | cache | Overall size of events emitted. |
| watcher_events | histogram | cache | Per resource size of events emitted. |
Session recording summarizer#
These metrics are exported by the Auth Service. They are all labeled with an
inference_model_name label, which is the metadata.name field of the
corresponding inference_model resource.
General metrics#
These metrics apply to all inference providers.
| Name | Type | Component | Description |
|---|---|---|---|
| teleport_summarizer_summarizations_total | counter | Teleport Auth | Total number of summarization jobs started |
| teleport_summarizer_summarization_errors | counter | Teleport Auth | Number of failed summarization jobs |
| teleport_summarizer_summarization_jobs_pending | gauge | Teleport Auth | Number of summarization jobs currently awaiting execution |
| teleport_summarizer_summarization_jobs_running | gauge | Teleport Auth | Number of summarization jobs currently being executed |
OpenAI-specific metrics#
These metrics apply to jobs executed using the OpenAI inference provider, including OpenAI-compatible proxies.
| Name | Type | Component | Description |
|---|---|---|---|
| teleport_summarizer_openai_api_requests | counter | Teleport Auth | Total number of OpenAI API requests |
| teleport_summarizer_openai_api_errors | counter | Teleport Auth | Number of errors returned by the OpenAI API. Additionally labeled with api_error_code which denotes the OpenAI API error code. |
| teleport_summarizer_openai_api_requests_in_flight | gauge | Teleport Auth | Number of OpenAI requests currently awaiting response |
Enhanced Session Recording / BPF#
| Name | Type | Component | Description |
|---|---|---|---|
| bpf_lost_command_events | counter | BPF | Number of lost command events. |
| bpf_lost_disk_events | counter | BPF | Number of lost disk events. |
| bpf_lost_network_events | counter | BPF | Number of lost network events. |
Proxy Service#
| Name | Type | Component | Description |
|---|---|---|---|
| failed_connect_to_node_attempts_total | counter | Teleport Proxy | Number of failed SSH connection attempts to the SSH Service. Use with teleport_connect_to_node_attempts_total to get the failure rate. |
| failed_login_attempts_total | counter | Teleport Proxy | Number of failed tsh login or tsh ssh login attempts. |
| grpc_client_started_total | counter | Teleport Proxy | Total number of RPCs started on the client. |
| grpc_client_handled_total | counter | Teleport Proxy | Total number of RPCs completed on the client, regardless of success or failure. |
| grpc_client_msg_received_total | counter | Teleport Proxy | Total number of RPC stream messages received on the client. |
| grpc_client_msg_sent_total | counter | Teleport Proxy | Total number of gRPC stream messages sent by the client. |
| proxy_connection_limit_exceeded_total | counter | Teleport Proxy | Number of connections that exceeded the Proxy Service connection limit. |
| proxy_peer_client_dial_error_total | counter | Teleport Proxy | Total number of errors encountered dialing peer Proxy Service instances. |
| proxy_peer_client_connections | gauge | Teleport Proxy | Number of currently open connections to peer Proxy Service instances. |
| proxy_peer_client_rpc | gauge | Teleport Proxy | Number of current client RPC requests. |
| proxy_peer_client_rpc_total | counter | Teleport Proxy | Total number of client RPC requests. |
| proxy_peer_client_rpc_duration_seconds | histogram | Teleport Proxy | Duration in seconds of RPCs sent by the client. |
| proxy_peer_client_message_sent_size | histogram | Teleport Proxy | Size of messages sent by the client. |
| proxy_peer_client_message_received_size | histogram | Teleport Proxy | Size of messages received by the client. |
| proxy_peer_server_connections | gauge | Teleport Proxy | Number of currently open connections to peer Proxy Service clients. |
| proxy_peer_server_rpc | gauge | Teleport Proxy | Number of current server RPC requests. |
| proxy_peer_server_rpc_total | counter | Teleport Proxy | Total number of server RPC requests. |
| proxy_peer_server_rpc_duration_seconds | histogram | Teleport Proxy | Duration in seconds of RPCs sent by the server. |
| proxy_peer_server_message_sent_size | histogram | Teleport Proxy | Size of messages sent by the server. |
| proxy_peer_server_message_received_size | histogram | Teleport Proxy | Size of messages received by the server. |
| proxy_ssh_sessions_total | gauge | Teleport Proxy | Number of active sessions through this Proxy Service instance. |
| proxy_missing_ssh_tunnels | gauge | Teleport Proxy | Number of missing SSH tunnels. Used to debug if Teleport instances have discovered all Proxy Service instances. |
| remote_clusters | gauge | Teleport Proxy | Number of inbound connections from leaf clusters. |
| teleport_connect_to_node_attempts_total | counter | Teleport Proxy | Number of SSH connection attempts to the SSH Service. Use with failed_connect_to_node_attempts_total to get the failure rate. |
| teleport_reverse_tunnels_connected | gauge | Teleport Proxy | Number of reverse SSH tunnels connected to the Teleport Proxy Service by Teleport instances. |
| teleport_proxy_db_connection_setup_time_seconds | histogram | Teleport Proxy | Time to establish connection to DB service from Proxy service. |
| teleport_proxy_db_connection_dial_attempts_total | counter | Teleport Proxy | Number of dial attempts from Proxy to DB service made. |
| teleport_proxy_db_connection_dial_failures_total | counter | Teleport Proxy | Number of failed dial attempts from Proxy to DB service made. |
| teleport_proxy_db_attempted_servers_total | histogram | Teleport Proxy | Number of servers processed during connection attempt to the DB service from Proxy service. |
| teleport_proxy_db_connection_tls_config_time_seconds | histogram | Teleport Proxy | Time to fetch TLS configuration for the connection to DB service from Proxy service. |
| teleport_proxy_db_active_connections_total | gauge | Teleport Proxy | Number of currently active connections to DB service from Proxy service. |
| trusted_clusters | gauge | Teleport Proxy | Number of outbound connections to leaf clusters. |
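The descriptions of failed_connect_to_node_attempts_total and teleport_connect_to_node_attempts_total suggest dividing one by the other to get a failure rate. A minimal sketch of that arithmetic follows; failure_rate is a hypothetical helper, and because both metrics are counters that accumulate from process start, the inputs should normally be the increase over the same time window rather than the raw totals.

```shell
# failure_rate FAILED ATTEMPTS
# Prints FAILED / ATTEMPTS as a percentage with two decimals,
# or "n/a" when there were no attempts in the window.
failure_rate() {
  awk -v failed="$1" -v attempts="$2" 'BEGIN {
    if (attempts <= 0) { print "n/a"; exit }
    printf "%.2f\n", 100 * failed / attempts
  }'
}
```

For example, failure_rate 5 200 prints 2.50. In Prometheus, the equivalent ratio can be written as rate(failed_connect_to_node_attempts_total[5m]) / rate(teleport_connect_to_node_attempts_total[5m]), with the window chosen to suit your scrape interval.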
Database Service#
| Name | Type | Component | Description |
|---|---|---|---|
| teleport_db_messages_from_client_total | counter | Teleport Database Service | Number of messages (packets) received from the DB client. |
| teleport_db_messages_from_server_total | counter | Teleport Database Service | Number of messages (packets) received from the DB server. |
| teleport_db_method_call_count_total | counter | Teleport Database Service | Number of times a DB method was called. |
| teleport_db_method_call_latency_seconds | histogram | Teleport Database Service | Latency of DB method calls. |
| teleport_db_initialized_connections_total | counter | Teleport Database Service | Number of initialized DB connections. |
| teleport_db_active_connections_total | gauge | Teleport Database Service | Number of active DB connections. |
| teleport_db_connection_durations_seconds | histogram | Teleport Database Service | Duration of DB connection. |
| teleport_db_connection_setup_time_seconds | histogram | Teleport Database Service | Initial time to set up a DB connection, before any requests are handled. |
| teleport_db_errors_total | counter | Teleport Database Service | Number of synthetic DB errors sent to the client. |
Kubernetes access#
The following tables identify all metrics available in the Teleport Proxy Service if at least one Kubernetes cluster is enrolled in your Teleport cluster.
Client#
The following table identifies all metrics available when the service connects
to upstream servers. In the case of proxy, the upstream server can be a
kubernetes_service or Kubernetes Cluster if it's running in legacy mode.
| Name | Type | Component | Description |
|---|---|---|---|
| teleport_kubernetes_client_in_flight_requests | gauge | Teleport Kubernetes Proxy | In-flight requests waiting for the upstream response. |
| teleport_kubernetes_client_requests_total | counter | Teleport Kubernetes Proxy | Total number of requests sent to the upstream Teleport proxy, kube_service or Kubernetes Cluster servers. |
| teleport_kubernetes_client_tls_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of TLS handshakes. |
| teleport_kubernetes_client_got_conn_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of time to dial to the upstream server - using reverse tunnel or direct dialer. |
| teleport_kubernetes_client_first_byte_response_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of time to receive the first response byte from the upstream server. |
| teleport_kubernetes_client_request_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of the upstream request time. |
Server#
The following table identifies all metrics available for incoming connections.
| Name | Type | Component | Description |
|---|---|---|---|
| teleport_kubernetes_server_in_flight_requests | gauge | Teleport Kubernetes Proxy | In-flight requests currently handled by the server. |
| teleport_kubernetes_server_api_requests_total | counter | Teleport Kubernetes Proxy | Total number of requests handled by the server. |
| teleport_kubernetes_server_request_duration_seconds | histogram | Teleport Kubernetes Proxy | Latency distribution of the total request time. |
| teleport_kubernetes_server_response_size_bytes | histogram | Teleport Kubernetes Proxy | Distribution of the response size. |
| teleport_kubernetes_server_exec_in_flight_sessions | gauge | Teleport Kubernetes Proxy | Number of active kubectl exec sessions. |
| teleport_kubernetes_server_exec_sessions_total | counter | Teleport Kubernetes Proxy | Total number of kubectl exec sessions. |
| teleport_kubernetes_server_portforward_in_flight_sessions | gauge | Teleport Kubernetes Proxy | Number of active kubectl portforward sessions. |
| teleport_kubernetes_server_portforward_sessions_total | counter | Teleport Kubernetes Proxy | Total number of kubectl portforward sessions. |
| teleport_kubernetes_server_join_in_flight_sessions | gauge | Teleport Kubernetes Proxy | Number of active joining sessions. |
| teleport_kubernetes_server_join_sessions_total | counter | Teleport Kubernetes Proxy | Total number of joining sessions. |
Teleport SSH Service#
| Name | Type | Component | Description |
|---|---|---|---|
| user_max_concurrent_sessions_hit_total | counter | Teleport SSH | Number of times a user exceeded their concurrent session limit. |
Teleport Kubernetes Service#
The following table identifies all metrics available when the service connects
to upstream servers. In the case of kubernetes_service, the upstream server
is always a Kubernetes cluster.
| Name | Type | Component | Description |
|---|---|---|---|
| teleport_kubernetes_client_in_flight_requests | gauge | Teleport Kubernetes Service | In-flight requests waiting for the upstream response. |
| teleport_kubernetes_client_requests_total | counter | Teleport Kubernetes Service | Total number of requests sent to the upstream Teleport proxy, kube_service or Kubernetes Cluster servers. |
| teleport_kubernetes_client_tls_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of TLS handshakes. |
| teleport_kubernetes_client_got_conn_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of time to dial to the upstream server - using reverse tunnel or direct dialer. |
| teleport_kubernetes_client_first_byte_response_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of time to receive the first response byte from the upstream server. |
| teleport_kubernetes_client_request_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of the upstream request time. |
The following table identifies all metrics available for incoming connections.
| Name | Type | Component | Description |
|---|---|---|---|
| teleport_kubernetes_server_in_flight_requests | gauge | Teleport Kubernetes Service | In-flight requests currently handled by the server. |
| teleport_kubernetes_server_api_requests_total | counter | Teleport Kubernetes Service | Total number of requests handled by the server. |
| teleport_kubernetes_server_request_duration_seconds | histogram | Teleport Kubernetes Service | Latency distribution of the total request time. |
| teleport_kubernetes_server_response_size_bytes | histogram | Teleport Kubernetes Service | Distribution of the response size. |
| teleport_kubernetes_server_exec_in_flight_sessions | gauge | Teleport Kubernetes Service | Number of active kubectl exec sessions. |
| teleport_kubernetes_server_exec_sessions_total | counter | Teleport Kubernetes Service | Total number of kubectl exec sessions. |
| teleport_kubernetes_server_portforward_in_flight_sessions | gauge | Teleport Kubernetes Service | Number of active kubectl portforward sessions. |
| teleport_kubernetes_server_portforward_sessions_total | counter | Teleport Kubernetes Service | Total number of kubectl portforward sessions. |
| teleport_kubernetes_server_join_in_flight_sessions | gauge | Teleport Kubernetes Service | Number of active joining sessions. |
| teleport_kubernetes_server_join_sessions_total | counter | Teleport Kubernetes Service | Total number of joining sessions. |
All Teleport instances#
| Name | Type | Component | Description |
|---|---|---|---|
| process_state | gauge | Teleport | State of the teleport process: 0 - ok, 1 - recovering, 2 - degraded, 3 - starting. |
| certificate_mismatch_total | counter | Teleport | Number of SSH server login failures due to a certificate mismatch. |
| rx | counter | Teleport | Number of bytes received during an SSH connection. |
| server_interactive_sessions_total | gauge | Teleport | Number of active sessions. |
| teleport_build_info | gauge | Teleport | Provides build information of Teleport including gitref (git describe --long --tags), Go version, and Teleport version. The value of this gauge will always be 1. |
| teleport_breaker_connector_executions_total | counter | Teleport | Number of requests to the Teleport Auth Service API that go through a circuit breaker done by Teleport services, labeled by role of the connector (almost always Instance), state of the associated circuit breaker and success as interpreted by the breaker. |
| teleport_cache_events | counter | Teleport | Number of events received by a Teleport service cache. Teleport's Auth Service, Proxy Service, and other services cache incoming events related to their service. |
| teleport_cache_stale_events | counter | Teleport | Number of stale events received by a Teleport service cache. A high percentage of stale events can indicate a degraded backend. |
| tx | counter | Teleport | Number of bytes transmitted during an SSH connection. |
Teleport Health Checks#
| Name | Type | Component | Description |
|---|---|---|---|
| teleport_resources_health_status_healthy | gauge | Teleport Health Check | Number of healthy resources. |
| teleport_resources_health_status_unhealthy | gauge | Teleport Health Check | Number of unhealthy resources. |
| teleport_resources_health_status_unknown | gauge | Teleport Health Check | Number of resources in an unknown health state. |
Go runtime metrics#
These metrics are surfaced by the Go runtime and are not specific to Teleport.
| Name | Type | Component | Description |
|---|---|---|---|
| go_gc_duration_seconds | summary | Internal Go | A summary of GC invocation durations. |
| go_goroutines | gauge | Internal Go | Number of goroutines that currently exist. |
| go_info | gauge | Internal Go | Information about the Go environment. |
| go_memstats_alloc_bytes_total | counter | Internal Go | Total number of bytes allocated, even if freed. |
| go_memstats_alloc_bytes | gauge | Internal Go | Number of bytes allocated and still in use. |
| go_memstats_buck_hash_sys_bytes | gauge | Internal Go | Number of bytes used by the profiling bucket hash table. |
| go_memstats_frees_total | counter | Internal Go | Total number of frees. |
| go_memstats_gc_cpu_fraction | gauge | Internal Go | The fraction of this program's available CPU time used by the GC since the program started. |
| go_memstats_gc_sys_bytes | gauge | Internal Go | Number of bytes used for garbage collection system metadata. |
| go_memstats_heap_alloc_bytes | gauge | Internal Go | Number of heap bytes allocated and still in use. |
| go_memstats_heap_idle_bytes | gauge | Internal Go | Number of heap bytes waiting to be used. |
| go_memstats_heap_inuse_bytes | gauge | Internal Go | Number of heap bytes that are in use. |
| go_memstats_heap_objects | gauge | Internal Go | Number of allocated objects. |
| go_memstats_heap_released_bytes | gauge | Internal Go | Number of heap bytes released to the OS. |
| go_memstats_heap_sys_bytes | gauge | Internal Go | Number of heap bytes obtained from the system. |
| go_memstats_last_gc_time_seconds | gauge | Internal Go | Number of seconds since the Unix epoch of the last garbage collection. |
| go_memstats_lookups_total | counter | Internal Go | Total number of pointer lookups. |
| go_memstats_mallocs_total | counter | Internal Go | Total number of mallocs. |
| go_memstats_mcache_inuse_bytes | gauge | Internal Go | Number of bytes in use by mcache structures. |
| go_memstats_mcache_sys_bytes | gauge | Internal Go | Number of bytes used for mcache structures obtained from system. |
| go_memstats_mspan_inuse_bytes | gauge | Internal Go | Number of bytes in use by mspan structures. |
| go_memstats_mspan_sys_bytes | gauge | Internal Go | Number of bytes used for mspan structures obtained from system. |
| go_memstats_next_gc_bytes | gauge | Internal Go | Number of heap bytes when the next garbage collection will take place. |
| go_memstats_other_sys_bytes | gauge | Internal Go | Number of bytes used for other system allocations. |
| go_memstats_stack_inuse_bytes | gauge | Internal Go | Number of bytes in use by the stack allocator. |
| go_memstats_stack_sys_bytes | gauge | Internal Go | Number of bytes obtained from the system for stack allocator. |
| go_memstats_sys_bytes | gauge | Internal Go | Number of bytes obtained from the system. |
| go_threads | gauge | Internal Go | Number of OS threads created. |
| process_cpu_seconds_total | counter | Internal Go | Total user and system CPU time spent in seconds. |
| process_max_fds | gauge | Internal Go | Maximum number of open file descriptors. |
| process_open_fds | gauge | Internal Go | Number of open file descriptors. |
| process_resident_memory_bytes | gauge | Internal Go | Resident memory size in bytes. |
| process_start_time_seconds | gauge | Internal Go | Start time of the process since the Unix epoch in seconds. |
| process_virtual_memory_bytes | gauge | Internal Go | Virtual memory size in bytes. |
| process_virtual_memory_max_bytes | gauge | Internal Go | Maximum amount of virtual memory available in bytes. |
Prometheus#
| Name | Type | Component | Description |
|---|---|---|---|
| promhttp_metric_handler_requests_in_flight | gauge | prometheus | Current number of scrapes being served. |
| promhttp_metric_handler_requests_total | counter | prometheus | Total number of scrapes by HTTP status code. |
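Distributed tracing#
How to enable distributed tracing for a Teleport instance.
Teleport uses OpenTelemetry to generate traces and exports them to any OpenTelemetry Protocol (OTLP) capable exporter. If your telemetry backend does not support receiving OTLP traces, you may be able to use the OpenTelemetry Collector to proxy traces from OTLP to a format that your telemetry backend accepts.
Configure Teleport#
To enable tracing for a teleport instance, add the following section to that instance's configuration file (/etc/teleport.yaml). For a detailed description of these configuration fields, see the configuration reference page.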
tracing_service:
  enabled: true
  exporter_url: grpc://collector.example.com:4317
  sampling_rate_per_million: 1000000
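Sampling rate#
It is important to choose the sampling rate carefully. Sampling at a rate of 100% can have a negative impact on the performance of your cluster. Teleport honors the sampling rate included in incoming requests. This means that even when tracing_service is enabled and the sampling rate is 0, if Teleport receives a request that contains a sampled span, Teleport samples and exports all spans generated in response to that request.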
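Because sampling_rate_per_million is expressed per million requests, converting it to a percentage is a division by 10,000. The helper below is a small illustration of that arithmetic; sampling_percent is a hypothetical name, not a Teleport command.

```shell
# sampling_percent RATE_PER_MILLION
# Prints the share of requests sampled, as a percentage.
sampling_percent() {
  awk -v rate="$1" 'BEGIN { printf "%g\n", rate / 10000 }'
}
```

For example, sampling_percent 1000000 prints 100 (every request is sampled, as in the configuration shown earlier), while sampling_percent 10000 prints 1.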
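Exporter URL#
The exporter_url setting indicates where Teleport sends spans. Supported schemes are grpc://, http://, https://, and file:// (if no scheme is provided, grpc:// is used).
When using file://, the URL must be a path to a directory that Teleport has write permissions for. Spans are saved to files within the provided directory, and each file contains one proto-encoded span per line. Files are rotated when they exceed 100MB. To override the default limit, append ?limit=<desired_file_size_in_bytes> to the exporter_url (for example, file:///var/lib/teleport/traces?limit=100).
By default, the connection to the exporter is insecure. To enable TLS, add the following to the tracing_service configuration: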
  # Optional: path to the CA certificates used to validate the exporter
  ca_certs:
    - /var/lib/teleport/exporter_ca.pem
  # Optional: path to the TLS certificates used to enable mTLS for the exporter
  https_keypairs:
    - key_file: /var/lib/teleport/exporter_key.pem
      cert_file: /var/lib/teleport/exporter_cert.pem
After updating teleport.yaml, start (or restart) your teleport instance to apply the new configuration.
tsh#
To capture traces from tsh, add the --trace flag to your command. All traces generated by tsh --trace are proxied to the exporter_url defined for the Auth Service of the cluster the command runs against.
$ tsh --trace ssh root@myserver
$ tsh --trace ls
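You can also export traces from tsh to a different exporter than the one defined in the Auth Service configuration by using the --trace-exporter flag. Provide a URL that follows the same format as the exporter_url of the tracing_service.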
$ tsh --trace --trace-exporter=grpc://collector.example.com:4317 ssh root@myserver
$ tsh --trace --trace-exporter=file:///var/lib/teleport/traces ls
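Collecting profiles#
How to collect runtime profiling data from a Teleport instance.
Teleport uses Go's diagnostic capabilities to collect and export profiling data. Profiles can help identify the cause of CPU spikes, the source of memory leaks, or the reason for a deadlock.
Using the Debug Service#
The Teleport Debug Service lets administrators collect diagnostic profiles without enabling pprof endpoints at startup. The service is enabled by default, allows local-only access, and must be used from inside the same instance.
teleport debug profile collects a list of pprof profiles. It writes a compressed tarball (.tar.gz) to STDOUT. Decompress it with tar or redirect the output to a file.
By default, it collects the goroutine, heap, and profile profiles.
Each collected profile has a corresponding file inside the tarball. For example, collecting goroutine,trace,heap produces the files goroutine.pprof, trace.pprof, and heap.pprof.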
# Collect the default profiles and save them to a file.
$ teleport debug profile > pprof.tar.gz
$ tar xvf pprof.tar.gz
# Collect the default profiles and decompress them.
$ teleport debug profile | tar xzv -C ./
# Collect the "trace" and "mutex" profiles and save them to a file.
$ teleport debug profile trace,mutex > pprof.tar.gz
# Collect profiles, setting the profiling time in seconds.
$ teleport debug profile -s 20 trace > pprof.tar.gz
If Teleport is running on a Kubernetes cluster, you can collect profiles directly into a local directory without an interactive session:
$ kubectl -n teleport exec my-pod -- teleport debug profile > pprof.tar.gz
After extracting the contents, you can use go tool commands to explore and visualize them:
# Open the interactive terminal explorer
$ go tool pprof heap.pprof
# Open the web visualizer
$ go tool pprof -http : heap.pprof
# Visualize trace profiles
$ go tool trace trace.pprof
Using diagnostic endpoints#
The profiling endpoints are only enabled if the --debug flag is supplied. They are served by the diagnostic HTTP endpoint, which is disabled by default and is enabled with the --diag-addr flag or the diag_addr configuration field. The examples below assume 127.0.0.1:3000 as the diagnostic address.
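Collecting profiles#
Go's standard profiling endpoints are served at http://127.0.0.1:3000/debug/pprof/.
To retrieve a profile, send a request to the endpoint that corresponds to the desired profile type. When debugging an issue, it is helpful to collect a series of profiles over a period of time.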
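One way to gather such a series is a small loop around curl. The sketch below is illustrative only: profile_url and collect_series are hypothetical helper names, the default base address is the one used in this guide, and the timestamped file naming is an assumption chosen so that successive downloads sort in order and do not overwrite each other.

```shell
# profile_url KIND [BASE]
# Builds the pprof URL for a profile type such as heap or goroutine.
profile_url() {
  echo "${2:-http://127.0.0.1:3000}/debug/pprof/$1"
}

# collect_series KIND COUNT INTERVAL_SECONDS [OUT_DIR]
# Downloads COUNT profiles of type KIND, INTERVAL_SECONDS apart, into OUT_DIR.
collect_series() {
  kind=$1 count=$2 interval=$3 out_dir=${4:-.}
  i=1
  while [ "$i" -le "$count" ]; do
    # Timestamped names keep the samples ordered and distinct.
    out_file="$out_dir/$kind-$(date +%Y%m%dT%H%M%S).profile"
    curl -s -o "$out_file" "$(profile_url "$kind")" || return 1
    i=$((i + 1))
    # Wait between samples, but not after the last one.
    if [ "$i" -le "$count" ]; then
      sleep "$interval"
    fi
  done
  return 0
}
```

For example, collect_series heap 6 300 ./profiles would take six heap profiles five minutes apart, assuming the ./profiles directory already exists. Each file can then be opened with go tool pprof.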
CPU#
CPU profiles show execution statistics gathered over a user-specified period:
# Download the profile into a file:
$ curl -o cpu.profile http://127.0.0.1:3000/debug/pprof/profile?seconds=30
# Visualize the profile
$ go tool pprof -http : cpu.profile
Goroutine#
Goroutine profiles show the stack traces of all goroutines running in the system:
# Download the profile into a file:
$ curl -o goroutine.profile http://127.0.0.1:3000/debug/pprof/goroutine
# Visualize the profile
$ go tool pprof -http : goroutine.profile
Heap#
Heap profiles show the objects allocated in the system:
# Download the profile into a file:
$ curl -o heap.profile http://127.0.0.1:3000/debug/pprof/heap
# Visualize the profile
$ go tool pprof -http : heap.profile
Trace#
Trace profiles capture scheduling, system calls, garbage collection, heap size, and other events collected by the Go runtime over a user-specified period:
# Download the profile into a file:
$ curl -o trace.out http://127.0.0.1:3000/debug/pprof/trace?seconds=5
# Visualize the profile
$ go tool trace trace.out
Further reading#
- More information about diagnostics in the Go ecosystem: https://go.dev/doc/diagnostics
- Go's profiling endpoints: https://golang.org/pkg/net/http/pprof/
- An in-depth guide to profiling Go programs: https://go.dev/blog/pprof
