Server Monitoring & Performance — Complete Guide | ITVedas

Server Monitoring & Performance

Monitoring is your early warning system. Before users call about slowness, you've already identified the bottleneck. Comprehensive monitoring tracks performance metrics, records events, and alerts you to issues before they become outages. Proactive monitoring separates well-managed infrastructure from reactive firefighting.

Server Monitoring Fundamentals

The Four Pillars of Monitoring:

  • Metrics: Quantifiable data (CPU %, RAM usage, disk I/O)
  • Logs: Event records (application logs, security events)
  • Traces: Request/transaction flow (application performance)
  • Alerts: Proactive notifications when issues occur

Modern monitoring combines all four to provide comprehensive visibility into system health.

Performance Monitor

Performance Monitor is Windows' built-in tool for real-time and historical performance tracking. It collects metrics from all system components.

Key Performance Indicators (KPIs):

  • CPU Usage: Should average below 80%; brief spikes to 100% are acceptable
  • Memory Usage: Should stay below 80%; sustained high usage indicates a problem
  • Disk Usage: Free space should stay above 20% of total capacity
  • Disk I/O: Monitor queue length (should average below 2)
  • Network Bandwidth: Monitor utilization (should stay below 70% of link capacity)
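
The limits above can also be checked from a script. The following is a minimal sketch, not a definitive implementation: the function names Test-KpiSample and Get-KpiReport are hypothetical, the counter paths assume an English-language Windows installation, and only three of the KPIs are covered because memory and network utilization need more careful counter selection.

```powershell
# Compare one measured value against a limit taken from the KPI list above.
# Direction 'Max' means the value should stay below the limit; 'Min' means above it.
function Test-KpiSample {
    param(
        [Parameter(Mandatory)] [string] $Name,
        [Parameter(Mandatory)] [double] $Value,
        [Parameter(Mandatory)] [double] $Limit,
        [ValidateSet('Max', 'Min')] [string] $Direction = 'Max'
    )
    $healthy = if ($Direction -eq 'Max') { $Value -lt $Limit } else { $Value -gt $Limit }
    [pscustomobject]@{
        Name    = $Name
        Value   = [math]::Round($Value, 2)
        Limit   = $Limit
        Healthy = $healthy
    }
}

# Windows-only: sample each counter three times, five seconds apart, and test the average.
function Get-KpiReport {
    $checks = @(
        @{ Name = 'CPU %'; Counter = '\Processor(_Total)\% Processor Time'; Limit = 80; Direction = 'Max' }
        @{ Name = 'Disk queue length'; Counter = '\PhysicalDisk(_Total)\Avg. Disk Queue Length'; Limit = 2; Direction = 'Max' }
        @{ Name = 'Disk free %'; Counter = '\LogicalDisk(_Total)\% Free Space'; Limit = 20; Direction = 'Min' }
    )
    foreach ($check in $checks) {
        $samples = (Get-Counter -Counter $check.Counter -SampleInterval 5 -MaxSamples 3).CounterSamples
        $average = ($samples | Measure-Object -Property CookedValue -Average).Average
        Test-KpiSample -Name $check.Name -Value $average -Limit $check.Limit -Direction $check.Direction
    }
}
```

On a Windows server, running Get-KpiReport | Format-Table lists each KPI with a Healthy flag. Averaging several samples is a deliberate choice so that a single brief spike does not mark a server as unhealthy.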

Creating a Performance Monitor Counter Log

  1. Open Performance Monitor from Administrative Tools
  2. Expand Data Collector Sets
  3. Right-click User-defined and select New → Data Collector Set
  4. Name it (e.g., "Daily-Server-Performance")
  5. Select "Create from template"
  6. Choose "System Diagnostics" or "System Performance"
  7. Complete the wizard
  8. Right-click the created set and select Properties
  9. Configure schedule (e.g., daily at 11:00 PM)
  10. Set data retention (keep 7-30 days)
  11. Start the collector set

# PowerShell: Query key performance metrics

# CPU usage (Get-CimInstance replaces the deprecated Get-WmiObject)
Get-CimInstance -ClassName Win32_Processor | Select-Object LoadPercentage

# Memory usage (both values are reported in KB)
$OS = Get-CimInstance -ClassName Win32_OperatingSystem
$MemTotal = $OS.TotalVisibleMemorySize
$MemFree = $OS.FreePhysicalMemory
$PercentUsed = [Math]::Round((($MemTotal - $MemFree) / $MemTotal) * 100, 2)
Write-Host "Memory Usage: $PercentUsed%"

# Disk usage (skip drives that report no capacity, such as empty optical drives)
Get-PSDrive -PSProvider FileSystem |
    Where-Object { ($_.Used + $_.Free) -gt 0 } |
    Select-Object Name, Used, Free, @{Name="PercentUsed"; Expression={[math]::Round(($_.Used / ($_.Used + $_.Free)) * 100, 2)}}

Event Viewer and Event Logs

Event Viewer records system, application, and security events. Regular review identifies issues before they become problems.

Main Event Logs:

  • System Log: Windows and hardware events (services starting/stopping, driver issues)
  • Application Log: Application-specific events and errors
  • Security Log: Authentication, access control, policy application events
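  • PowerShell Operational Log: PowerShell script execution and errors (found under Applications and Services Logs → Microsoft → Windows → PowerShell, not under Windows Logs)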

Event Severity Levels:

  • Critical: System failure imminent, immediate action required
  • Error: Functionality lost, should be investigated
  • Warning: Issue detected, should be addressed but not urgent
  • Information: Normal operations, generally not concerning
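
To see how a log breaks down across these severity levels, events can be grouped by their numeric Level value (1 = Critical, 2 = Error, 3 = Warning, 4 = Information). This is a small sketch; the function name Get-SeveritySummary is hypothetical, and it accepts any objects that expose a Level property, such as those returned by Get-WinEvent.

```powershell
# Count events per severity level and label each level with its name.
function Get-SeveritySummary {
    param([Parameter(Mandatory)] [object[]] $Events)

    $names = @{ 1 = 'Critical'; 2 = 'Error'; 3 = 'Warning'; 4 = 'Information' }

    $Events | Group-Object -Property Level | ForEach-Object {
        $level = [int]$_.Name
        # Levels outside 1-4 (for example 0 or 5) are reported by number only
        $severity = if ($names.ContainsKey($level)) { $names[$level] } else { "Level $level" }
        [pscustomobject]@{ Level = $level; Severity = $severity; Count = $_.Count }
    } | Sort-Object -Property Level
}

# Example on Windows (last 24 hours of the System log):
# $recent = Get-WinEvent -FilterHashtable @{ LogName = 'System'; StartTime = (Get-Date).AddDays(-1) }
# Get-SeveritySummary -Events $recent | Format-Table
```

A sudden rise in the Error or Critical counts compared with the previous day is usually a better trigger for investigation than reading every entry.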

Configuring Event Log Retention

  1. Open Event Viewer
  2. Expand "Windows Logs" and right-click the log to configure (e.g., System)
  3. Select Properties from the right-click menu
  4. Maximum log size: Set to at least 500MB
  5. When maximum event log size is reached: Select "Archive the log when full, do not overwrite events"
  6. Retention: Keep archived logs for 30-90 days
  7. Click OK
  8. Repeat for all critical logs

# PowerShell: Query event logs

# Get last 10 system errors (Level 2 = Error)
Get-WinEvent -FilterHashtable @{LogName='System'; Level=2} -MaxEvents 10 |
    Format-Table TimeCreated, ProviderName, Id, Message -Wrap

# Get failed logon attempts in the last 24 hours (reading the Security log requires an elevated session)
$Since = (Get-Date).AddDays(-1)
Get-WinEvent -FilterHashtable @{LogName='Security'; Id=4625; StartTime=$Since} |
    Format-Table TimeCreated, @{N="Account";E={$_.Properties[5].Value}}, Message -Wrap

# Get PowerShell script execution errors (written to the PowerShell Operational log, not the Application log)
Get-WinEvent -FilterHashtable @{LogName='Microsoft-Windows-PowerShell/Operational'; Level=2} -MaxEvents 20
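
The size and archive settings from the retention steps can also be applied with the built-in wevtutil tool, which avoids clicking through each log. The sketch below only builds the argument list; the helper name Get-LogRetentionArgument is hypothetical, and the 500MB default mirrors the minimum suggested in the steps.

```powershell
# Build the wevtutil arguments that set a maximum size and "archive when full".
function Get-LogRetentionArgument {
    param(
        [Parameter(Mandatory)] [string] $LogName,
        [long] $MaxSizeBytes = 500MB
    )
    # sl = set-log; /ms = maximum size in bytes;
    # /rt:true with /ab:true = keep events and auto-archive the log when it fills
    @('sl', $LogName, "/ms:$MaxSizeBytes", '/rt:true', '/ab:true')
}

# Example on Windows, from an elevated session:
# foreach ($log in 'System', 'Application', 'Security') {
#     $arguments = Get-LogRetentionArgument -LogName $log
#     & wevtutil.exe @arguments
# }
```

Archived log files accumulate on disk, so the 30-90 day retention still needs a separate cleanup of old archives.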

Resource Capacity Planning

Monitoring current state informs future capacity decisions. Track trends to predict when upgrades become necessary.

Capacity Planning Process:

  1. Establish baseline usage (monitor for 2-4 weeks)
  2. Identify peak usage patterns (peak hours, peak days)
  3. Calculate growth rate (% increase per month/year)
  4. Project when resources will be exhausted
  5. Plan upgrades 3-6 months before reaching capacity

📊 Example Capacity Planning

Scenario: File server disk usage

- Current usage: 3TB of 10TB (30%)

- Usage growing at: 0.5TB per month (5% of total capacity)

- Comfortable maximum: 80% (8TB)

- Headroom before the maximum: 5TB (50% of total capacity)

- At 0.5TB per month: Reaches 80% in 10 months (5TB ÷ 0.5TB)

- Action: Plan storage upgrade within 4-6 months
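
The projection in this scenario is simple arithmetic, but the answer depends on how the growth rate is read. The 10-month figure corresponds to linear growth of 0.5TB per month (5% of the 10TB capacity). If 5% were instead compounded on current usage each month, the 8TB ceiling would be roughly 20 months away. The sketch below shows both calculations; the function names are hypothetical.

```powershell
# Months until the ceiling when usage grows by a fixed amount each month.
function Get-MonthsLinear {
    param([double] $UsedTB, [double] $CeilingTB, [double] $GrowthTBPerMonth)
    [math]::Round(($CeilingTB - $UsedTB) / $GrowthTBPerMonth, 1)
}

# Months until the ceiling when usage grows by a percentage of itself each month.
# Solves Used * (1 + rate)^n = Ceiling for n.
function Get-MonthsCompound {
    param([double] $UsedTB, [double] $CeilingTB, [double] $MonthlyRate)
    [math]::Round([math]::Log($CeilingTB / $UsedTB) / [math]::Log(1 + $MonthlyRate), 1)
}

Get-MonthsLinear   -UsedTB 3 -CeilingTB 8 -GrowthTBPerMonth 0.5   # 10
Get-MonthsCompound -UsedTB 3 -CeilingTB 8 -MonthlyRate 0.05       # 20.1
```

Which model fits is something the 2-4 week baseline and longer trend data should show; when in doubt, plan against the shorter estimate.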

Common Performance Issues and Diagnosis

Problem: High CPU Usage

Diagnosis:

  1. Open Task Manager (Ctrl+Shift+Esc)
  2. Click Processes tab
  3. Sort by CPU column
  4. Identify process consuming CPU
  5. Check if expected (backup, indexing, reporting job)
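
The same check can be scripted when Task Manager is not convenient, for example over a remote session. This is a minimal sketch with a hypothetical function name. Note that the CPU property of Get-Process is cumulative processor time in seconds since the process started, not the instantaneous percentage shown in Task Manager.

```powershell
# List the processes that have consumed the most CPU time since they started.
function Get-TopCpuProcess {
    param([int] $First = 5)
    Get-Process |
        Where-Object { $null -ne $_.CPU } |   # skip processes whose CPU time is not readable
        Sort-Object -Property CPU -Descending |
        Select-Object -First $First -Property Name, Id,
            @{ Name = 'CpuSeconds'; Expression = { [math]::Round($_.CPU, 1) } }
}

Get-TopCpuProcess -First 5 | Format-Table
```

Running it twice a minute apart and comparing CpuSeconds shows which process is actively burning CPU right now rather than which one has simply been running longest.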

Solutions:

  • Stop non-essential services or processes
  • Schedule heavy processes for off-peak hours
  • Add CPU capacity (more cores, faster processor)
  • Check for runaway processes or infinite loops
  • Update drivers and firmware
  • Scan for malware using Windows Defender

Problem: High Memory Usage

Diagnosis:

  1. Open Performance Monitor
  2. Monitor the Memory → Available MBytes counter
  3. If it stays below 512MB, the system is memory-starved
  4. Use Task Manager to identify memory hogs
  5. Check for memory leaks in applications
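
These diagnosis steps can be approximated in a script. The sketch below uses two hypothetical helpers: one reads the Available MBytes counter (Windows-only, and the counter path assumes an English-language installation), the other lists the processes with the largest working sets.

```powershell
# Windows-only: read the same counter that Performance Monitor shows.
function Get-AvailableMemoryMB {
    (Get-Counter -Counter '\Memory\Available MBytes').CounterSamples[0].CookedValue
}

# List the processes holding the most physical memory (working set).
function Get-TopMemoryProcess {
    param([int] $First = 5)
    Get-Process |
        Sort-Object -Property WorkingSet64 -Descending |
        Select-Object -First $First -Property Name, Id,
            @{ Name = 'WorkingSetMB'; Expression = { [math]::Round($_.WorkingSet64 / 1MB, 1) } }
}

# Example on Windows:
# if ((Get-AvailableMemoryMB) -lt 512) { Get-TopMemoryProcess -First 5 | Format-Table }
```

A process whose WorkingSetMB keeps climbing across repeated runs, without a matching rise in workload, is a reasonable first suspect for a memory leak.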

Solutions:

  • Restart services with memory leaks
  • Add physical RAM to server
  • Increase virtual memory (paging file)
  • Remove unnecessary services and applications
  • Configure application memory limits
  • Update application to fix memory leak

Problem: Disk Space Running Out

Diagnosis:

  1. Use File Explorer to check drive properties
  2. Right-click drive → Properties
  3. Identify free space vs. used space
  4. Use a disk usage analyzer (e.g., WinDirStat or TreeSize) to find large folders
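
Where installing an extra tool is not an option, a rough folder size report can be produced with built-in cmdlets. This is a sketch with a hypothetical function name; it walks every file under each top-level folder, so it can be slow on large volumes, and folders it cannot read are silently skipped.

```powershell
# Report the largest immediate subfolders of a path, by total file size.
function Get-FolderSize {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [int] $First = 10
    )
    Get-ChildItem -LiteralPath $Path -Directory -Force -ErrorAction SilentlyContinue | ForEach-Object {
        $bytes = (Get-ChildItem -LiteralPath $_.FullName -Recurse -File -Force -ErrorAction SilentlyContinue |
            Measure-Object -Property Length -Sum).Sum
        # [double] turns the empty result for an empty folder into 0
        [pscustomobject]@{ Folder = $_.FullName; SizeMB = [math]::Round([double]$bytes / 1MB, 2) }
    } | Sort-Object -Property SizeMB -Descending | Select-Object -First $First
}

# Example: Get-FolderSize -Path 'D:\Shares' -First 10 | Format-Table -AutoSize
```

The path D:\Shares in the example is a placeholder; start at the root of the drive that is running out of space and drill into the largest folder it reports.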

Solutions:

  • Delete unnecessary files (temp files, old logs)
  • Archive old data to different location
  • Enable NTFS compression for less critical files
  • Implement file retention policies
  • Add additional disk storage
  • Schedule old file cleanup via scheduled task

Proactive Monitoring Strategies

Threshold-Based Alerting: Set alerts when metrics exceed defined thresholds.

| Metric | Warning Threshold | Critical Threshold | Action |
|---|---|---|---|
| CPU Usage | 60% | 85% | Investigate process, schedule load shift, add capacity |
| Memory Usage | 70% | 90% | Identify leaks, restart services, add RAM |
| Disk Usage | 70% | 90% | Clean up old files, archive data, add storage |
| Disk Queue Length | 2 | 5+ | Reduce IOPS, upgrade disk subsystem, add RAM for caching |
| Network Utilization | 60% | 80% | Implement compression, upgrade link, optimize traffic |
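
The thresholds above translate directly into a lookup that an alerting script can use. The following is a minimal sketch; the names $Thresholds and Get-AlertLevel are hypothetical, and treating a value exactly at a threshold as having reached it is an assumption, since the table does not say whether the limits are inclusive.

```powershell
# Warning and critical thresholds, copied from the table above.
$Thresholds = @{
    'CPU Usage'           = @{ Warning = 60; Critical = 85 }
    'Memory Usage'        = @{ Warning = 70; Critical = 90 }
    'Disk Usage'          = @{ Warning = 70; Critical = 90 }
    'Disk Queue Length'   = @{ Warning = 2;  Critical = 5 }
    'Network Utilization' = @{ Warning = 60; Critical = 80 }
}

# Classify a measured value as OK, Warning, or Critical for the given metric.
function Get-AlertLevel {
    param(
        [Parameter(Mandatory)] [string] $Metric,
        [Parameter(Mandatory)] [double] $Value
    )
    $limits = $Thresholds[$Metric]
    if ($null -eq $limits) { throw "No thresholds defined for metric '$Metric'." }

    if ($Value -ge $limits.Critical) { 'Critical' }
    elseif ($Value -ge $limits.Warning) { 'Warning' }
    else { 'OK' }
}

Get-AlertLevel -Metric 'CPU Usage' -Value 72          # Warning
Get-AlertLevel -Metric 'Disk Queue Length' -Value 6   # Critical
```

Keeping the thresholds in one table-like structure makes them easy to tune per server once baselines are known, without touching the classification logic.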

Monitoring Tools Comparison

Built-in Tools (No Cost):

  • Performance Monitor: Real-time and historical metrics
  • Event Viewer: Centralized logging
  • Task Manager: Quick process overview
  • Resource Monitor: Detailed resource utilization

Enterprise Monitoring Solutions (Licensing or Hosting Cost):

  • System Center Operations Manager (SCOM): Microsoft's enterprise monitoring
  • Prometheus + Grafana: Open-source monitoring and visualization (free to license, but you host and maintain it)
  • Datadog, New Relic, Splunk: Commercial monitoring and log analytics platforms

Start with built-in tools and graduate to enterprise solutions as infrastructure grows.

Monitoring Best Practices

  • Monitor continuously: Collect data all the time, not just during problems, so you have baselines to compare against
  • Set realistic thresholds: Too low = alert fatigue, too high = missed issues
  • Document baselines: Record what normal looks like for each server (e.g., typical CPU of 20%) so deviations stand out
  • Review logs regularly: Weekly review of critical logs
  • Centralize logging: Don't check 50 servers individually
  • Alert intelligently: Page for critical, email for warning, ignore info
  • Test alerts: Ensure they actually fire and notify correct people
  • Retain data: Keep historical data for trend analysis
  • Correlate events: Don't look at CPU in isolation; correlate it with disk I/O, network, and application logs
  • Automate responses: Automatically restart services, clear queues, scale resources when possible

Creating a Monitoring Dashboard

A good dashboard shows server health at a glance. Include:

  • CPU utilization % (green <60%, yellow 60-80%, red >80%)
  • Memory utilization % (same color coding)
  • Disk space available (warning if <20% free)
  • Critical services status (running/stopped)
  • Recent errors from event log
  • Network bandwidth utilization
  • Last backup status
  • Ping/connectivity status

💡 Pro Tip: Display dashboards on NOC (Network Operations Center) monitors. Spend 10 seconds visually scanning health before diving into details.

Key Takeaways

  • Monitoring provides early warning of issues
  • Key metrics are CPU, memory, disk, and network
  • Event logs record all important system activities
  • Trend analysis enables capacity planning
  • Threshold-based alerting prevents surprises
  • Centralized monitoring scales with infrastructure
  • Regular review maintains system health