Look, we’ve all been there. It’s 2am on a Saturday, you’re halfway through a boerie roll at your mate’s braai, and your phone starts buzzing like a swarm of angry bees. Your Windows service has gone down. Again. The production system is offline, customers are getting stroppy, and you’re about to have a very awkward conversation with your boss on Monday morning.
What if I told you there’s a better way? A way to build Windows services in .NET 8 that are tougher than a Free State farmer’s boots and more reliable than your ouma’s Sunday roast recipe?
Buckle up, because we’re about to dive deep into the art of building Windows services that refuse to die.
Why Your Service Keeps Falling Over (And Why You Should Care)
Here’s the thing – most Windows services fail because we write them like they’re running in a perfect world. But production environments are more like the M1 during peak hour: chaotic, unpredictable, and one small incident away from complete gridlock.
Your service needs to handle:
- Network hiccups (because Telkom, am I right?)
- Memory pressure (someone’s running Chrome with 47 tabs open on the server)
- Database timeouts (the DBA is running maintenance at 3pm on a Wednesday)
- Dodgy TCP connections that drop mid-message
- That one message format that nobody documented properly
If your service falls over every time one of these things happens, you’re in for a rough time.
The Foundation: Building Your Service the Right Way
Let’s start with the basics. In .NET 8, Microsoft’s made it dead easy to create Windows services using the BackgroundService class. Here’s how you kick things off:
public class TcpMessageProcessorService : BackgroundService
{
private readonly ILogger<TcpMessageProcessorService> _logger;
private readonly IConfiguration _configuration;
private TcpListener _tcpListener;
public TcpMessageProcessorService(
ILogger<TcpMessageProcessorService> logger,
IConfiguration configuration)
{
_logger = logger;
_configuration = configuration;
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
_logger.LogInformation("Service starting up - let's do this!");
// Your brilliant code goes here. Note: the host cancels stoppingToken
// on shutdown, so you don't need your own CancellationTokenSource
}
}
Now wire it up in your Program.cs:
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddWindowsService(options =>
{
options.ServiceName = "RobustTcpProcessorService";
});
builder.Services.AddHostedService<TcpMessageProcessorService>();
builder.Services.AddLogging();
var host = builder.Build();
await host.RunAsync();
Right, that’s the skeleton. But a skeleton doesn’t help much when things go pear-shaped. Let’s add some muscle.
Mechanism #1: Health Checks (Your Service’s Check-up)
Think of health checks like taking your bakkie in for a service. You don’t wait until the engine seizes up – you check the oil, the tyres, the battery. Same principle here.
Here’s a proper health check implementation:
public class TcpServiceHealthCheck : IHealthCheck
{
private readonly TcpMessageProcessorService _tcpService;
private readonly ILogger<TcpServiceHealthCheck> _logger;
public TcpServiceHealthCheck(
TcpMessageProcessorService tcpService,
ILogger<TcpServiceHealthCheck> logger)
{
_tcpService = tcpService;
_logger = logger;
}
public async Task<HealthCheckResult> CheckHealthAsync(
HealthCheckContext context,
CancellationToken cancellationToken = default)
{
try
{
// Is the TCP listener even awake?
if (!_tcpService.IsListening)
{
_logger.LogError("TCP listener has gone to sleep on the job!");
return HealthCheckResult.Unhealthy("TCP listener is not active");
}
// Can we actually process messages?
if (!_tcpService.CanProcessMessages)
{
_logger.LogWarning("Service is up but can't process messages");
return HealthCheckResult.Degraded("Service cannot process messages");
}
// Are we eating up memory like a hungry hippo?
var memoryUsage = GC.GetTotalMemory(false);
if (memoryUsage > 500_000_000) // 500MB threshold
{
_logger.LogWarning("Memory usage is getting spicy: {MemoryMB}MB",
memoryUsage / 1_000_000);
return HealthCheckResult.Degraded(
$"High memory usage: {memoryUsage} bytes");
}
return HealthCheckResult.Healthy("All systems go!");
}
catch (Exception ex)
{
_logger.LogError(ex, "Health check itself is broken - that's awkward");
return HealthCheckResult.Unhealthy(
"Health check threw an exception", ex);
}
}
}
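One thing the snippet doesn't show is the wiring: the check has to be registered with the health check system, and the watchdog later in this article resolves HealthCheckService, which only exists once you've called AddHealthChecks. Here's a minimal sketch for Program.cs, assuming the TCP service is registered as a singleton so the health check can inject it:

```csharp
// Register the TCP service once, then expose it both as the hosted
// service and as a dependency the health check can inject
builder.Services.AddSingleton<TcpMessageProcessorService>();
builder.Services.AddHostedService(sp =>
    sp.GetRequiredService<TcpMessageProcessorService>());

// Register the health check itself under a name
builder.Services.AddHealthChecks()
    .AddCheck<TcpServiceHealthCheck>("tcp-service");
```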
Real-world example: I once worked on a service that processed status messages for a large retailer. The service would gradually slow down over the day until it was processing one status every few seconds (should’ve been 25 per second). We added health checks that monitored processing speed and discovered a memory leak in a third-party library. The health check flagged the degraded performance before customers even noticed. Saved our bacon, that did.
Mechanism #2: The Heartbeat (Proof of Life)
A heartbeat is your service saying “Yebo, I’m still here!” every minute or so. It’s like those check-in messages your mom insists on when you’re driving long distances.
public class HeartbeatService : BackgroundService
{
private readonly ILogger<HeartbeatService> _logger;
private readonly string _heartbeatFilePath;
public HeartbeatService(ILogger<HeartbeatService> logger)
{
_logger = logger;
_heartbeatFilePath = Path.Combine(
Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData),
"MyService",
"heartbeat.json"
);
// Create the directory up front, or the very first write will throw
Directory.CreateDirectory(Path.GetDirectoryName(_heartbeatFilePath)!);
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
while (!stoppingToken.IsCancellationRequested)
{
try
{
var heartbeat = new
{
Timestamp = DateTime.UtcNow,
ProcessId = Environment.ProcessId,
Status = "Healthy",
MachineName = Environment.MachineName
};
await File.WriteAllTextAsync(_heartbeatFilePath,
JsonSerializer.Serialize(heartbeat, new JsonSerializerOptions
{
WriteIndented = true
}),
stoppingToken);
_logger.LogDebug("Heartbeat updated at {Timestamp}", DateTime.UtcNow);
}
catch (Exception ex)
{
_logger.LogError(ex, "Failed to update heartbeat - that's not good");
}
await Task.Delay(TimeSpan.FromMinutes(1), stoppingToken);
}
}
}
Now you can have an external monitoring tool check this file. If the timestamp hasn’t updated in 5 minutes, you know something’s up.
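The monitoring side can be as simple as comparing the file's last-write time against the clock. A sketch (the path is assumed to match the heartbeat service above):

```csharp
// Hypothetical external check: alert if the heartbeat file hasn't
// been rewritten in the last 5 minutes
var heartbeatPath = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData),
    "MyService", "heartbeat.json");

var age = DateTime.UtcNow - File.GetLastWriteTimeUtc(heartbeatPath);
if (age > TimeSpan.FromMinutes(5))
{
    Console.WriteLine(
        $"ALERT: no heartbeat for {age.TotalMinutes:F0} minutes - service may be wedged");
}
```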
Real-world example: At a logistics company, we had a service that managed warehouse movements. The service appeared to be running (Windows said it was), but it had actually deadlocked internally. The heartbeat file hadn’t updated in 3 hours. Our monitoring picked it up and automatically restarted the service before anyone started climbing the walls. Crisis averted.
Mechanism #3: Performance Monitoring (Know Your Limits)
Your service needs to know when it’s working too hard. Like knowing when to stop at three boerewors rolls and not go for a fourth.
public class PerformanceMonitoringService : BackgroundService
{
private readonly ILogger<PerformanceMonitoringService> _logger;
private readonly Process _currentProcess;
public PerformanceMonitoringService(ILogger<PerformanceMonitoringService> logger)
{
_logger = logger;
_currentProcess = Process.GetCurrentProcess();
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
while (!stoppingToken.IsCancellationRequested)
{
try
{
_currentProcess.Refresh();
var cpuTime = _currentProcess.TotalProcessorTime.TotalMilliseconds;
var memoryMB = _currentProcess.WorkingSet64 / 1_000_000;
var threadCount = _currentProcess.Threads.Count;
_logger.LogInformation(
"Performance Stats - Memory: {MemoryMB}MB, Threads: {ThreadCount}",
memoryMB, threadCount);
if (memoryMB > 1000) // More than 1GB
{
_logger.LogWarning(
"Memory usage is getting hectic: {MemoryMB}MB - time to investigate",
memoryMB);
}
if (threadCount > 100)
{
_logger.LogWarning(
"Thread count is suspiciously high: {ThreadCount} - possible thread leak",
threadCount);
}
await StoreMetricsToDatabase(memoryMB, threadCount);
}
catch (Exception ex)
{
_logger.LogError(ex, "Performance monitoring went sideways");
}
await Task.Delay(TimeSpan.FromSeconds(30), stoppingToken);
}
}
private async Task StoreMetricsToDatabase(long memoryMB, int threadCount)
{
// Store these metrics for trending - helps you spot patterns
// Like noticing your memory usage spikes every Tuesday at 2pm
await Task.CompletedTask; // placeholder until you wire up real storage
}
}
Handling TCP Connections Like a Boss
Right, let’s get to the meat and potatoes – handling TCP connections without falling over.
public class RobustTcpListener
{
private readonly ILogger<RobustTcpListener> _logger;
private TcpListener _listener;
private readonly ConcurrentBag<TcpClient> _activeConnections;
private readonly SemaphoreSlim _connectionSemaphore;
private int _maxConnections = 100;
public RobustTcpListener(ILogger<RobustTcpListener> logger)
{
_logger = logger;
_activeConnections = new ConcurrentBag<TcpClient>();
_connectionSemaphore = new SemaphoreSlim(_maxConnections, _maxConnections);
}
public async Task StartListening(
IPEndPoint endpoint,
CancellationToken cancellationToken)
{
_listener = new TcpListener(endpoint);
_listener.Start();
_logger.LogInformation(
"TCP listener is live on {Endpoint} - bring on the connections!",
endpoint);
while (!cancellationToken.IsCancellationRequested)
{
try
{
// Don't accept more connections than we can handle
await _connectionSemaphore.WaitAsync(cancellationToken);
var tcpClient = await _listener.AcceptTcpClientAsync(cancellationToken);
_activeConnections.Add(tcpClient);
_logger.LogInformation(
"New client connected. Active connections: {Count}",
_activeConnections.Count);
// Handle each connection separately - don't block accepting new ones
_ = Task.Run(() => HandleClientConnection(tcpClient, cancellationToken),
cancellationToken);
}
catch (OperationCanceledException)
{
// Shutdown requested while waiting - nothing to release, just leave
break;
}
catch (ObjectDisposedException)
{
// Service is shutting down - this is fine
_logger.LogInformation("Listener stopped - service shutting down");
break;
}
catch (Exception ex)
{
_logger.LogError(ex, "Error accepting TCP client - will retry shortly");
_connectionSemaphore.Release(); // Release the slot we took before Accept failed
await Task.Delay(1000, cancellationToken); // Don't hammer the system
}
}
}
private async Task HandleClientConnection(
TcpClient client,
CancellationToken cancellationToken)
{
var clientEndpoint = client.Client.RemoteEndPoint?.ToString() ?? "Unknown";
try
{
using (client)
using (var stream = client.GetStream())
{
_logger.LogInformation("Chatting with client {ClientEndpoint}", clientEndpoint);
var buffer = new byte[4096];
while (!cancellationToken.IsCancellationRequested && client.Connected)
{
try
{
var bytesRead = await stream.ReadAsync(
buffer, 0, buffer.Length, cancellationToken);
if (bytesRead == 0)
{
_logger.LogInformation(
"Client {ClientEndpoint} hung up - cheers mate",
clientEndpoint);
break;
}
await ProcessMessage(buffer, bytesRead, stream, cancellationToken);
}
catch (IOException ioEx)
{
_logger.LogWarning(
"Network drama with client {ClientEndpoint}: {Error}",
clientEndpoint, ioEx.Message);
break;
}
catch (Exception ex)
{
_logger.LogError(ex,
"Failed processing message from {ClientEndpoint} - but we soldier on",
clientEndpoint);
// Don't let one bad message kill the whole connection
}
}
}
}
catch (Exception ex)
{
_logger.LogError(ex,
"Connection with {ClientEndpoint} went completely pear-shaped",
clientEndpoint);
}
finally
{
// Note: ConcurrentBag.TryTake removes an arbitrary entry, not necessarily
// this client - that's fine here because we only use the bag as a counter
_activeConnections.TryTake(out _);
_connectionSemaphore.Release();
_logger.LogInformation(
"Connection with {ClientEndpoint} closed. Active connections: {Count}",
clientEndpoint, _activeConnections.Count);
}
}
}
Real-world example: A payment processing service that a friend built handled card transactions from point-of-sale terminals. One day, a terminal in Bloemfontein started sending malformed messages (corrupted by a dodgy network switch). Without proper error handling, each bad message would’ve crashed the entire service, affecting all 20+ stores. Instead, the service logged the error, rejected that specific message, and kept processing the other 5,000 daily transactions without missing a beat.
The Secret Weapon: Polly and Resilience Patterns
Now we’re getting fancy. Polly is a .NET library that implements resilience patterns. Think of it as insurance for your code.
First, add the NuGet package (the snippets below use Polly's v8 ResiliencePipeline API):
dotnet add package Polly
Then implement retry logic and circuit breakers:
public class MessageProcessor
{
private readonly ILogger<MessageProcessor> _logger;
private readonly ResiliencePipeline _resiliencePipeline;
public MessageProcessor(ILogger<MessageProcessor> logger)
{
_logger = logger;
// Configure resilience pipeline with retry and circuit breaker
_resiliencePipeline = new ResiliencePipelineBuilder()
.AddRetry(new RetryStrategyOptions
{
MaxRetryAttempts = 3,
BackoffType = DelayBackoffType.Exponential,
Delay = TimeSpan.FromSeconds(1),
OnRetry = args =>
{
// AttemptNumber is zero-based, hence the +1
_logger.LogWarning(
"Retry attempt {Attempt} of 3 - third time's the charm?",
args.AttemptNumber + 1);
return ValueTask.CompletedTask;
}
})
.AddCircuitBreaker(new CircuitBreakerStrategyOptions
{
BreakDuration = TimeSpan.FromMinutes(2),
FailureRatio = 0.5,
MinimumThroughput = 10,
OnOpened = args =>
{
_logger.LogError(
"Circuit breaker tripped! Too many failures - taking a breather");
return ValueTask.CompletedTask;
},
OnClosed = args =>
{
_logger.LogInformation(
"Circuit breaker reset - back in business!");
return ValueTask.CompletedTask;
}
})
.Build();
}
public async Task<bool> ProcessMessageAsync(
byte[] messageData,
CancellationToken cancellationToken)
{
return await _resiliencePipeline.ExecuteAsync(async (ct) =>
{
try
{
var message = Encoding.UTF8.GetString(messageData);
_logger.LogInformation("Processing: {MessagePreview}...",
message.Substring(0, Math.Min(50, message.Length)));
await ProcessBusinessLogic(message, ct);
return true;
}
catch (Exception ex)
{
_logger.LogError(ex, "Message processing failed");
throw; // Re-throw to trigger retry/circuit breaker
}
}, cancellationToken);
}
private async Task ProcessBusinessLogic(string message, CancellationToken ct)
{
// Your actual business logic here
// Could be saving to database, calling an API, whatever
await Task.Delay(100, ct); // Simulating work
}
}
What’s happening here?
The retry policy says: “If processing fails, try again up to 3 times with increasing delays (1 second, 2 seconds, 4 seconds).”
The circuit breaker says: “If 50% of requests are failing and we’ve had at least 10 requests, stop trying for 2 minutes. The downstream system is probably on fire, let it recover.”
Real-world example: Another mate worked on a service that called a third-party credit check API. Sometimes the API would timeout due to load. Without Polly, each timeout would fail the transaction. With retry logic, 95% of timeouts were resolved on the second attempt. The circuit breaker saved him during a major API outage – instead of hammering their dying servers with thousands of requests, it gracefully backed off and let them recover.
Logging: Your Service’s Black Box
When things go wrong (and they will), proper logging is the difference between finding the problem in 5 minutes or 5 hours.
Install Serilog (including the settings and enricher packages the setup below relies on):
dotnet add package Serilog.Extensions.Hosting
dotnet add package Serilog.Settings.Configuration
dotnet add package Serilog.Sinks.Console
dotnet add package Serilog.Sinks.File
dotnet add package Serilog.Sinks.EventLog
dotnet add package Serilog.Enrichers.Environment
Set it up in Program.cs:
Log.Logger = new LoggerConfiguration()
.ReadFrom.Configuration(builder.Configuration)
.Enrich.WithProperty("ServiceName", "TcpProcessorService")
.Enrich.WithProperty("Version", Assembly.GetExecutingAssembly().GetName().Version)
.Enrich.WithMachineName()
.WriteTo.Console(outputTemplate:
"[{Timestamp:HH:mm:ss} {Level:u3}] {Message:lj} {Properties:j}{NewLine}{Exception}")
.WriteTo.File("logs/service-.log",
rollingInterval: RollingInterval.Day,
retainedFileCountLimit: 30)
.WriteTo.EventLog("TcpProcessorService", manageEventSource: true)
.CreateLogger();
builder.Logging.ClearProviders();
builder.Logging.AddSerilog();
Pro tip: Log to multiple places:
- Console: For when you’re debugging locally
- Files: For historical tracking and analysis
- Windows Event Log: So your sysadmin can see issues in Event Viewer
- Application Insights/Seq/ELK (optional): For fancy dashboards and alerting
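Because the setup above calls ReadFrom.Configuration, Serilog will also pick up settings from appsettings.json. A minimal section might look like this (a sketch - adjust levels to taste):

```json
{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "System": "Warning"
      }
    }
  }
}
```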
The Watchdog: Self-Healing Services
This is where things get properly clever. A watchdog monitors your service and restarts it if things go really wrong.
public class ServiceWatchdog : BackgroundService
{
private readonly ILogger<ServiceWatchdog> _logger;
private readonly IServiceProvider _serviceProvider;
private DateTime _lastHealthyCheck = DateTime.UtcNow;
private int _consecutiveFailures = 0;
public ServiceWatchdog(
ILogger<ServiceWatchdog> logger,
IServiceProvider serviceProvider)
{
_logger = logger;
_serviceProvider = serviceProvider;
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
// Wait a bit for service to start up properly
await Task.Delay(TimeSpan.FromMinutes(1), stoppingToken);
while (!stoppingToken.IsCancellationRequested)
{
try
{
var isHealthy = await PerformHealthCheck();
if (isHealthy)
{
_lastHealthyCheck = DateTime.UtcNow;
_consecutiveFailures = 0;
}
else
{
_consecutiveFailures++;
_logger.LogWarning(
"Health check failed {Count} consecutive times",
_consecutiveFailures);
}
var timeSinceLastHealthy = DateTime.UtcNow - _lastHealthyCheck;
if (timeSinceLastHealthy > TimeSpan.FromMinutes(10))
{
_logger.LogCritical(
"Service has been unhealthy for {Duration}. Houston, we have a problem.",
timeSinceLastHealthy);
await InitiateGracefulRestart();
}
}
catch (Exception ex)
{
_logger.LogError(ex, "Watchdog check went wrong - ironic, isn't it?");
}
await Task.Delay(TimeSpan.FromMinutes(2), stoppingToken);
}
}
private async Task<bool> PerformHealthCheck()
{
try
{
var healthCheckService = _serviceProvider
.GetRequiredService<HealthCheckService>();
var result = await healthCheckService.CheckHealthAsync();
if (result.Status == HealthStatus.Healthy)
{
return true;
}
_logger.LogWarning("Health check status: {Status}", result.Status);
foreach (var entry in result.Entries)
{
_logger.LogWarning(" {Key}: {Status} - {Description}",
entry.Key,
entry.Value.Status,
entry.Value.Description);
}
return false;
}
catch (Exception ex)
{
_logger.LogError(ex, "Health check exploded");
return false;
}
}
private async Task InitiateGracefulRestart()
{
_logger.LogCritical("Initiating graceful restart - see you on the other side");
// Give running operations a chance to finish
await Task.Delay(TimeSpan.FromSeconds(30));
// A service can't stop-and-start itself with ServiceController - calling
// Stop() would kill this very process before Start() ever ran. Instead,
// exit with a failure code and let the Windows recovery settings
// (next section) bring us back up
Environment.Exit(1);
}
}
Windows Service Recovery Settings
You can also configure Windows itself to restart your service automatically. Add this after installing your service:
# PowerShell script to configure service recovery
sc.exe failure RobustTcpProcessorService reset= 86400 actions= restart/60000/restart/120000/restart/300000
# Translation:
# reset= 86400 : Reset failure count after 1 day
# restart/60000 : First failure: restart after 1 minute
# restart/120000 : Second failure: restart after 2 minutes
# restart/300000 : Third+ failures: restart after 5 minutes
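You can confirm the settings took by querying them back:

```
# Show the configured failure actions for the service
sc.exe qfailure RobustTcpProcessorService
```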
Or do it programmatically:
public static class ServiceRecoveryManager
{
public static void ConfigureServiceRecovery(string serviceName)
{
try
{
var process = new Process
{
StartInfo = new ProcessStartInfo
{
FileName = "sc.exe",
Arguments = $"failure \"{serviceName}\" reset= 86400 " +
$"actions= restart/60000/restart/120000/restart/300000",
UseShellExecute = false,
CreateNoWindow = true,
RedirectStandardOutput = true,
RedirectStandardError = true
}
};
process.Start();
process.WaitForExit();
if (process.ExitCode == 0)
{
Console.WriteLine("Service recovery configured successfully");
}
else
{
Console.WriteLine($"Failed to configure recovery. Exit code: {process.ExitCode}");
}
}
catch (Exception ex)
{
Console.WriteLine($"Error configuring service recovery: {ex.Message}");
}
}
}
The Global Safety Net
Even with all these mechanisms, unexpected exceptions can slip through. Catch them with a global handler:
public static class GlobalExceptionHandler
{
public static void Configure(ILogger logger)
{
AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
{
var exception = e.ExceptionObject as Exception;
logger.LogCritical(exception,
"UNHANDLED EXCEPTION! The ship is going down! IsTerminating: {IsTerminating}",
e.IsTerminating);
if (e.IsTerminating)
{
// Last chance to clean up
PerformEmergencyCleanup();
}
};
TaskScheduler.UnobservedTaskException += (sender, e) =>
{
logger.LogError(e.Exception,
"Unobserved task exception - someone forgot to await something");
e.SetObserved(); // Prevent process termination
};
}
private static void PerformEmergencyCleanup()
{
try
{
// Close database connections
// Flush log buffers
// Send alert to monitoring system
// Whatever you need to do before the lights go out
}
catch
{
// Even cleanup can fail, but we can't throw here
}
}
}
Call this early in your Program.cs:
var logger = LoggerFactory.Create(builder => builder.AddConsole()).CreateLogger("Startup");
GlobalExceptionHandler.Configure(logger);
Configuration That Doesn’t Break
Your service needs configuration, but loading config shouldn’t crash the service:
public class RobustConfigurationService
{
private readonly IConfiguration _configuration;
private readonly ILogger<RobustConfigurationService> _logger;
public RobustConfigurationService(
IConfiguration configuration,
ILogger<RobustConfigurationService> logger)
{
_configuration = configuration;
_logger = logger;
}
public T GetValue<T>(string key, T defaultValue = default)
{
try
{
var value = _configuration.GetValue<T>(key);
// Careful: for value types this treats a configured 0 or false the same
// as "missing" - use nullable types (e.g. int?) if those are legal values
if (value == null || value.Equals(default(T)))
{
_logger.LogWarning(
"Configuration key {Key} not found or empty, using default: {Default}",
key, defaultValue);
return defaultValue;
}
return value;
}
catch (Exception ex)
{
_logger.LogError(ex,
"Error reading config key {Key}, falling back to default: {Default}",
key, defaultValue);
return defaultValue;
}
}
public bool ValidateConfiguration()
{
var requiredKeys = new Dictionary<string, Type>
{
{ "TcpPort", typeof(int) },
{ "MaxConnections", typeof(int) },
{ "ConnectionTimeout", typeof(int) },
{ "DatabaseConnectionString", typeof(string) }
};
var isValid = true;
foreach (var (key, type) in requiredKeys)
{
var value = _configuration[key];
if (string.IsNullOrWhiteSpace(value))
{
_logger.LogError("Required configuration key {Key} is missing", key);
isValid = false;
continue;
}
// Try converting to expected type
try
{
Convert.ChangeType(value, type);
}
catch
{
_logger.LogError(
"Configuration key {Key} has invalid format. Expected {Type}",
key, type.Name);
isValid = false;
}
}
return isValid;
}
}
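To make that validation actually bite, call it before the host starts, so a broken deployment fails loudly at startup instead of limping along. A sketch for Program.cs (assumes RobustConfigurationService is registered in DI):

```csharp
// Validate configuration up front - a broken deployment should fail
// loudly at startup, not at 2am on a Saturday
var configService = host.Services.GetRequiredService<RobustConfigurationService>();
if (!configService.ValidateConfiguration())
{
    Log.Fatal("Configuration validation failed - refusing to start");
    return;
}

await host.RunAsync();
```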
Real-world example: A service I maintained read database connection strings from config. One deployment, someone typo’d the connection string. Without validation, the service would crash on startup. With proper config validation, it logged a clear error message and used a fallback read-only connection string, allowing the service to start in degraded mode while we fixed the config.
Performance Optimisation (Because Memory Leaks Are Not Lekker)
Use object pooling to avoid unnecessary allocations:
public class OptimizedTcpProcessor
{
private readonly ObjectPool<StringBuilder> _stringBuilderPool;
private readonly ArrayPool<byte> _byteArrayPool;
private readonly ILogger<OptimizedTcpProcessor> _logger;
public OptimizedTcpProcessor(ILogger<OptimizedTcpProcessor> logger)
{
_logger = logger;
_stringBuilderPool = new DefaultObjectPool<StringBuilder>(
new StringBuilderPooledObjectPolicy());
_byteArrayPool = ArrayPool<byte>.Shared;
}
public async Task ProcessMessageOptimized(
Stream stream,
CancellationToken cancellationToken)
{
// Rent a buffer from the pool instead of allocating new
var buffer = _byteArrayPool.Rent(4096);
try
{
var bytesRead = await stream.ReadAsync(
buffer, 0, buffer.Length, cancellationToken);
// Use Span<T> for efficient processing. Note: async methods can't take
// Span<T> parameters (it's a ref struct), so the processor is synchronous
ProcessBuffer(buffer.AsSpan(0, bytesRead));
}
finally
{
// Always return buffers to the pool
_byteArrayPool.Return(buffer, clearArray: true);
}
}
private void ProcessBuffer(ReadOnlySpan<byte> data)
{
// Rent a StringBuilder from the pool
var sb = _stringBuilderPool.Get();
try
{
// Do your string building
sb.Append("Processing: ");
sb.Append(Encoding.UTF8.GetString(data));
_logger.LogDebug(sb.ToString());
}
finally
{
// Return it - StringBuilderPooledObjectPolicy clears it on return
_stringBuilderPool.Return(sb);
}
}
}
Why bother with pooling? Every time you allocate memory, the garbage collector has to clean it up later. In a high-throughput service processing thousands of messages per second, those allocations add up fast. Object pooling reuses objects, reducing GC pressure and keeping your service running smoothly.
Real-world example: A service processing real-time stock market data was handling 50,000 messages per second. Initially, it allocated a new byte array for each message. Memory usage would climb to 2GB over an hour, then GC pauses would cause message delays. After implementing array pooling, memory usage stabilised at 200MB and GC pauses dropped by 90%. The service ran for months without needing a restart.
Testing Your Resilient Service
You can’t know if your service is resilient until you’ve tried to break it. Here’s how to test failure scenarios:
[TestClass]
public class TcpServiceResilienceTests
{
private TcpMessageProcessorService _service;
[TestInitialize]
public void Setup()
{
// Set up your test dependencies
var services = new ServiceCollection();
services.AddLogging();
services.AddSingleton<TcpMessageProcessorService>();
var provider = services.BuildServiceProvider();
_service = provider.GetRequiredService<TcpMessageProcessorService>();
}
[TestMethod]
public async Task Service_ShouldRecoverFromNetworkInterruption()
{
// Arrange - start the service
var cts = new CancellationTokenSource();
var serviceTask = _service.StartAsync(cts.Token);
await Task.Delay(1000); // Let it start up
// Act - simulate network failure by killing connections
SimulateNetworkFailure();
// Wait for recovery mechanisms to kick in
await Task.Delay(5000);
// Assert - service should still be healthy
// (CheckHealthAsync here is a helper you'd expose on the service itself -
// it's not part of BackgroundService)
var healthCheck = await _service.CheckHealthAsync();
Assert.AreEqual(HealthStatus.Healthy, healthCheck.Status);
// Cleanup
cts.Cancel();
}
[TestMethod]
public async Task Service_ShouldHandleHighVolumeConnections()
{
// Arrange - start service
await _service.StartAsync(CancellationToken.None);
// Act - hammer it with 100 concurrent connections
var tasks = Enumerable.Range(0, 100)
.Select(i => ConnectAndSendMessage($"Test message {i}"))
.ToArray();
var results = await Task.WhenAll(tasks);
// Assert - all messages should be processed
Assert.IsTrue(results.All(r => r.Success),
$"Failed: {results.Count(r => !r.Success)} out of 100");
}
[TestMethod]
public async Task Service_ShouldHandleMalformedMessages()
{
// Arrange
await _service.StartAsync(CancellationToken.None);
// Act - send garbage data
var results = new List<bool>();
results.Add(await SendMessage(new byte[] { 0xFF, 0xFE, 0xFD })); // Binary junk
results.Add(await SendMessage(Encoding.UTF8.GetBytes("{{{{invalid json")));
results.Add(await SendMessage(new byte[0])); // Empty message
results.Add(await SendMessage(new byte[10000])); // Huge message
// Service should still be running after all that
await Task.Delay(1000);
// Assert - service is still healthy despite the abuse
var healthCheck = await _service.CheckHealthAsync();
Assert.AreEqual(HealthStatus.Healthy, healthCheck.Status);
}
[TestMethod]
public async Task Service_ShouldHandleMemoryPressure()
{
// Arrange - start service
await _service.StartAsync(CancellationToken.None);
// Act - send lots of large messages to create memory pressure
var largeTasks = Enumerable.Range(0, 1000)
.Select(_ => SendMessage(new byte[100000])) // 100KB each
.ToArray();
await Task.WhenAll(largeTasks);
// Force a GC and check memory
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
var memoryUsed = GC.GetTotalMemory(false);
// Assert - memory should be under control (adjust threshold as needed)
Assert.IsTrue(memoryUsed < 500_000_000,
$"Memory usage too high: {memoryUsed / 1_000_000}MB");
}
private void SimulateNetworkFailure()
{
// Kill all active TCP connections
// In production this might be a network cable unplug,
// firewall rule, or router failure
}
private async Task<(bool Success, string Error)> ConnectAndSendMessage(string message)
{
try
{
using var client = new TcpClient();
await client.ConnectAsync("localhost", 8080);
using var stream = client.GetStream();
var data = Encoding.UTF8.GetBytes(message);
await stream.WriteAsync(data, 0, data.Length);
// Wait for response
var buffer = new byte[1024];
var bytesRead = await stream.ReadAsync(buffer, 0, buffer.Length);
return (true, null);
}
catch (Exception ex)
{
return (false, ex.Message);
}
}
}
Deployment Checklist
Right, you’ve built this magnificent, resilient service. Here’s your deployment checklist:
1. Service Account Configuration
Don’t run your service as Local System or Administrator – that’s asking for trouble.
# Create a dedicated service account (give it a password - you'll need it
# when configuring the service further down)
$password = Read-Host -AsSecureString "Service account password"
New-LocalUser -Name "SvcTcpProcessor" -Description "Service account for TCP Processor" -Password $password
# Grant only the permissions it needs
# - Read/Write to its data directory
# - Read configuration files
# - Write to log directory
# - Network access
2. Install the Service
# Build and publish
dotnet publish -c Release -r win-x64 --self-contained
# Install as Windows service
sc.exe create RobustTcpProcessorService binPath= "C:\Services\TcpProcessor\TcpProcessor.exe" start= auto
# Configure recovery
sc.exe failure RobustTcpProcessorService reset= 86400 actions= restart/60000/restart/120000/restart/300000
# Set the service account
sc.exe config RobustTcpProcessorService obj= ".\SvcTcpProcessor" password= "YourSecurePassword"
# Start it up
sc.exe start RobustTcpProcessorService
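Then confirm it's actually running before you walk away:

```
# Should report STATE : 4 RUNNING
sc.exe query RobustTcpProcessorService
```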
3. Configure Firewall
# Allow inbound TCP connections on your port
New-NetFirewallRule -DisplayName "TCP Processor Service" `
-Direction Inbound `
-Protocol TCP `
-LocalPort 8080 `
-Action Allow `
-Profile Domain
4. Set Up Monitoring
Configure monitoring alerts for:
- Service stops unexpectedly
- Memory usage > 1GB
- CPU usage > 80% for more than 5 minutes
- Error rate > 5% of messages
- No heartbeat for 5 minutes
- Circuit breaker opens
5. Log Management
# Create scheduled task to archive old logs
$action = New-ScheduledTaskAction -Execute "PowerShell.exe" -Argument "-File C:\Scripts\ArchiveLogs.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At 2am
Register-ScheduledTask -TaskName "ArchiveTcpProcessorLogs" -Action $action -Trigger $trigger
Real-World War Stories
Let me share a few more battle-tested scenarios I've encountered:
The Case of the Disappearing Messages
A service processing insurance claims would occasionally "lose" messages. No errors, no crashes - messages just vanished. Turned out the TCP connection was dropping mid-message, and we weren't checking if we'd received the full message before processing.
The fix: Implement message length headers and validation:
private async Task<byte[]> ReadCompleteMessage(NetworkStream stream, CancellationToken ct)
{
// First 4 bytes = message length
var lengthBuffer = new byte[4];
var bytesRead = 0;
while (bytesRead < 4)
{
var read = await stream.ReadAsync(
lengthBuffer, bytesRead, 4 - bytesRead, ct);
if (read == 0)
throw new IOException("Connection closed while reading message length");
bytesRead += read;
}
var messageLength = BitConverter.ToInt32(lengthBuffer, 0);
// Sanity check
if (messageLength <= 0 || messageLength > 10_000_000) // 10MB max
{
throw new InvalidDataException($"Invalid message length: {messageLength}");
}
// Now read the actual message
var messageBuffer = new byte[messageLength];
bytesRead = 0;
while (bytesRead < messageLength)
{
var read = await stream.ReadAsync(
messageBuffer, bytesRead, messageLength - bytesRead, ct);
if (read == 0)
throw new IOException("Connection closed while reading message body");
bytesRead += read;
}
return messageBuffer;
}
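For this to work end to end, the sender has to frame messages the same way. A minimal sender-side counterpart (the helper name is mine, not from the original incident) just prefixes the payload with the 4-byte length that the reader expects:

```csharp
using System;

// Hypothetical sender-side helper: prefixes the payload with a 4-byte
// length header so the receiver knows exactly how much to read.
static byte[] FrameMessage(byte[] payload)
{
    const int maxLength = 10_000_000; // mirror the reader's 10MB sanity check
    if (payload is null || payload.Length == 0 || payload.Length > maxLength)
        throw new ArgumentException("Payload must be between 1 byte and 10MB");

    var framed = new byte[4 + payload.Length];
    // BitConverter.ToInt32 on the reading side assumes platform byte order
    // (little-endian on Windows), so GetBytes matches it here
    BitConverter.GetBytes(payload.Length).CopyTo(framed, 0);
    payload.CopyTo(framed, 4);
    return framed;
}

// A 3-byte payload becomes a 7-byte frame: 4-byte header plus the body
var framed = FrameMessage(new byte[] { 0x01, 0x02, 0x03 });
```

If the two ends of the connection run on different architectures, pin the byte order explicitly (e.g. with `BinaryPrimitives`) rather than relying on `BitConverter` defaults.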
The Midnight Memory Leak
A service ran perfectly during the day but crashed every night around midnight. Memory usage would spike from 200MB to 4GB in minutes.
The culprit: A scheduled report generation task that loaded the entire day's data into memory. The fix was to process data in chunks:
public async Task GenerateDailyReport(CancellationToken ct)
{
const int batchSize = 1000;
var offset = 0;
while (!ct.IsCancellationRequested)
{
// Process in batches instead of loading everything
var batch = await _repository.GetMessagesAsync(offset, batchSize, ct);
if (!batch.Any())
break;
await ProcessBatch(batch, ct);
offset += batchSize;
// Give GC a chance to clean up between batches
if (offset % 10000 == 0)
{
GC.Collect();
await Task.Delay(100, ct);
}
}
}
The DDoS That Wasn't
A service started rejecting connections during business hours. The investigation showed it was hitting the max connection limit. Turned out one customer's system was opening connections but never closing them (connection leak on their side).
The fix: Connection timeout and limit per client:
private readonly ConcurrentDictionary<string, int> _connectionsByClient = new();
private readonly int _maxConnectionsPerClient = 10;
private async Task HandleClientConnection(TcpClient client, CancellationToken ct)
{
var clientIp = ((IPEndPoint)client.Client.RemoteEndPoint).Address.ToString();
// Check connection limit per IP
var currentConnections = _connectionsByClient.AddOrUpdate(
clientIp, 1, (key, count) => count + 1);
if (currentConnections > _maxConnectionsPerClient)
{
_logger.LogWarning(
"Client {ClientIp} exceeded connection limit ({Count}/{Max})",
clientIp, currentConnections, _maxConnectionsPerClient);
_connectionsByClient.AddOrUpdate(clientIp, 0, (key, count) => Math.Max(0, count - 1));
client.Close();
return;
}
try
{
// Set read timeout to prevent stuck connections (note: ReceiveTimeout
// only applies to synchronous reads - for ReadAsync, pair it with a
// timed CancellationTokenSource as well)
client.ReceiveTimeout = 300000; // 5 minutes
await ProcessClient(client, ct);
}
finally
{
_connectionsByClient.AddOrUpdate(clientIp, 0, (key, count) => Math.Max(0, count - 1));
}
}
Wrapping Up: Your Service Survival Kit
Building a resilient Windows service isn't rocket science, but it does require thinking about all the ways things can go wrong. Here's your survival kit:
The Five Pillars of Resilience:
- Health Checks - Know when something's wrong
- Heartbeats - Prove you're still alive
- Performance Monitoring - Watch for warning signs
- Resilience Patterns - Handle failures gracefully (retry, circuit breaker)
- Comprehensive Logging - Understand what happened when things go wrong
The Three Safety Nets:
- Global exception handlers - Catch the unexpected
- Watchdog services - Monitor and self-heal
- Windows service recovery - Automatic restarts
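That third safety net - Windows service recovery - can be wired up from the command line with `sc.exe failure`. As a sketch (the service name matches the install step earlier; adjust the delays to taste):

```
# Restart automatically: 30s after the first and second failure,
# 2 minutes after subsequent ones; reset the failure count daily
sc.exe failure RobustTcpProcessorService reset= 86400 actions= restart/30000/restart/30000/restart/120000
```

The space after `reset=` and `actions=` is required by sc.exe's argument parsing - a classic gotcha.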
The Golden Rules:
- Log everything important, but don't log so much that you can't find anything
- Always assume the network is unreliable
- Always assume messages can be malformed
- Always assume resources are limited
- Test your failure scenarios - if you haven't tested it, it doesn't work
- Monitor in production - you can't fix what you can't see
Build your service with these principles, and you'll spend less time at 2am debugging production issues and more time at braais enjoying your boerewors. And isn't that what we're all really aiming for?
Now go forth and build services that are tougher than kudu biltong and more reliable than your bakkie. You've got this!
Additional Resources
Want to dive deeper? Check these out:
- Microsoft's guide on Windows Services in .NET
- Polly documentation for resilience patterns
- Serilog documentation for structured logging
- Martin Fowler's articles on circuit breakers and resilience patterns
And remember - the best Windows service is one that runs so smoothly, you forget it exists. Until bonus time, when your boss remembers it hasn't crashed in 6 months. Lekker!
Have your own Windows service war story? Found this helpful? Drop a comment below - I'd love to hear how you're keeping your services running smoothly!




