Tuesday, June 04, 2019

Getting the ETW messages for a live running instance of a Service Fabric application

As our company has recently discovered, it is very handy to have a live stream of diagnostics from a running Service Fabric application when one is trying to figure out why a service won't start.

This docs.microsoft.com article describes a handy way to enable the log streaming feature in Visual Studio for a live running service, giving you the same stream you would get when debugging the Service Fabric application in your local cluster.

Friday, August 31, 2018

Solving NETSDK1061 build errors

Microsoft has changed a lot in the last 10 years, especially since Satya Nadella took over, and not for the better in many ways. One of the ways they've changed is that they've gotten a lot faster, which on the surface sounds good. When you start digging into the consequences of that change however, it's not so great: one of the ways in which that faster development velocity is enabled is by skipping over things I find to be essential, like documentation.

Today, the documentation that's missing from Microsoft's websites and negatively impacting me is how to keep the version of your .NET Core 2.x+ SDKs "sticky" during your Visual Studio desktop builds and automated VSTS builds when Microsoft upgrades their SDKs. The very annoying thing is that Microsoft is going **so fast** that they're including beta and prerelease builds WITH VISUAL STUDIO!! In my book, that's a big no-no as far as impacting one's customers goes. The consequence is that you can upgrade Visual Studio, and what built with the previous version will no longer build with the current version. This results in build errors like the following:

NETSDK1061: The project was restored using Microsoft.NETCore.App version 2.1.3, but with current settings, version 2.1.3-servicing-26724-03 would be used instead. To resolve this issue, make sure the same settings are used for restore and for subsequent operations such as build or publish. Typically this issue can occur if the RuntimeIdentifier property is set during build or publish but not during restore. For more information, see https://aka.ms/dotnet-runtime-patch-selection.

Following the link in the error message, you're taken to documentation that tells you nothing useful about how to actually solve this error. Instead, you get to Google around, and if you happen to find the right set of keywords, stumble across this documentation, and read it **** VERY CAREFULLY ****, you'll find that you need the "RuntimeFrameworkVersion" property. It has to be specified in the .csproj project file of your .NET Core project and set to the version you actually want to use for building that project. In addition, you need a <PackageReference> element that looks similar to the following:

<PackageReference Update="Microsoft.NETCore.App" Version="2.1.2" />

The version in the above <PackageReference> element should match what you have in your <RuntimeFrameworkVersion> element.
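Put together, the relevant parts of a project file might look roughly like the sketch below. The Sdk attribute and the netcoreapp2.1 target framework are assumptions for illustration only; your project will have its own values, and the rest of the file is omitted.

```xml
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <!-- Assumed target framework; use whatever your project already targets. -->
    <TargetFramework>netcoreapp2.1</TargetFramework>
    <!-- The runtime version you actually want restore, build, and publish to use. -->
    <RuntimeFrameworkVersion>2.1.2</RuntimeFrameworkVersion>
  </PropertyGroup>

  <ItemGroup>
    <!-- Keep this version identical to RuntimeFrameworkVersion above. -->
    <PackageReference Update="Microsoft.NETCore.App" Version="2.1.2" />
  </ItemGroup>

</Project>
```

The point of having both entries is that they agree with each other, so restore and build resolve the same Microsoft.NETCore.App version.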

Thursday, April 26, 2018

Project count and name length limitations of the Service Fabric tooling in Visual Studio

It would seem that I've just stumbled across the practical limit on how many services a Service Fabric application can have and still be debugged with Visual Studio: about 37. My problem would seem to be confirmed by this GitHub issue.

Tuesday, April 03, 2018

Accessing remoting exceptions and original causes of Exceptions in Service Fabric stateless/stateful services in v3.0+/v6.1+

The section "Remoting Exception Handling" on this docs.microsoft.com page appears to be the sum total of the Service Fabric team's documentation on proper exception handling for Service Fabric remoting. It's two paragraphs. And it's completely insufficient bullshit.

What they fail to explain explicitly is that Service Fabric takes WCF's exception handling capabilities to the extreme and handles all exceptions automatically: it serializes them using DataContractSerializer, remotes them back to the caller, deserializes them on the caller's side, converts them back into .NET exceptions, and makes them accessible via the AggregateException.InnerException property in a try-catch block around a call to a service method on the result of ServiceProxy / ActorProxy.Create<>(). That's all well and good, except that according to multiple github.com issues for Service Fabric, the Service Fabric team broke the shit out of this nice facility in V2 remoting. So now we get to revert somewhat to the exception handling model of WCF's earlier days and throw a FaultException<MyFault>, where MyFault is your own custom DataContract-serializable object. Fantastic job boys and girls. When are you going to grow into your big person pants and fucking test things properly before releasing them? #GettingSickAndFuckingTiredOfLazyAssUndiligentMillenialDevelopers

Thursday, March 29, 2018

Enabling changes to Default Services within applications during deployment on a Service Fabric cluster

As it turns out, as of Service Fabric runtime 6.1, allowing applications to change their Default Services during an upgrade is not enabled by default. This is controlled by the 'EnableDefaultServicesUpgrade' setting in the cluster-level settings group called 'ClusterManager'. The setting can be set in the Cluster Manifest if you're managing your own cluster on-premises, or in an ARM template if you're deploying to an Azure-based cluster, like so:

{
  "apiVersion": "2016-09-01",
  "type": "Microsoft.ServiceFabric/clusters",
  "name": "[parameters('clusterName')]",
  "location": "[parameters('location')]",
  "dependsOn": [
    "[variables('supportLogStorageAccountName')]"
  ],
  "properties": {
    "certificate": {
      "thumbprint": "[parameters('certificateThumbprint')]",
      "x509StoreName": "[parameters('certificateStoreValue')]"
    },
    "clientCertificateCommonNames": [],
    "clientCertificateThumbprints": [],
    "clusterState": "Default",
    "diagnosticsStorageAccountConfig": {
      "blobEndpoint": "[reference(concat('Microsoft.Storage/storageAccounts/', variables('supportLogStorageAccountName')), '2017-06-01').primaryEndpoints.blob]",
      "protectedAccountKeyName": "StorageAccountKey1",
      "queueEndpoint": "[reference(concat('Microsoft.Storage/storageAccounts/', variables('supportLogStorageAccountName')), '2017-06-01').primaryEndpoints.queue]",
      "storageAccountName": "[variables('supportLogStorageAccountName')]",
      "tableEndpoint": "[reference(concat('Microsoft.Storage/storageAccounts/', variables('supportLogStorageAccountName')), '2017-06-01').primaryEndpoints.table]"
    },
    "fabricSettings": [
      {
        "parameters": [
          {
            "name": "ClusterProtectionLevel",
            "value": "[parameters('clusterProtectionLevel')]"
          }
        ],
        "name": "Security"
      },
      {
        "parameters": [
          {
            "name": "EnableDefaultServicesUpgrade",
            "value": "[parameters('enableDefaultServicesUpgrade')]"
          }
        ],
        "name": "ClusterManager"
      }
    ],
    "managementEndpoint": "[concat('https://',reference(variables('lbIPName')).dnsSettings.fqdn,':',variables('nt0fabricHttpGatewayPort'))]",
    "nodeTypes": [
      {
        "name": "[variables('vmNodeType0Name')]",
        "applicationPorts": {
          "endPort": "[variables('nt0applicationEndPort')]",
          "startPort": "[variables('nt0applicationStartPort')]"
        },
        "clientConnectionEndpointPort": "[variables('nt0fabricTcpGatewayPort')]",
        "durabilityLevel": "Bronze",
        "ephemeralPorts": {
          "endPort": "[variables('nt0ephemeralEndPort')]",
          "startPort": "[variables('nt0ephemeralStartPort')]"
        },
        "httpGatewayEndpointPort": "[variables('nt0fabricHttpGatewayPort')]",
        "isPrimary": true,
        "vmInstanceCount": "[parameters('nt0InstanceCount')]"
      }
    ],
    "provisioningState": "Default",
    "reliabilityLevel": "Silver",
    "upgradeMode": "Automatic",
    "vmImage": "Windows"
  },
  "tags": {
    "resourceType": "Service Fabric",
    "displayName": "IoT Service Fabric Cluster",
    "clusterName": "[parameters('clusterName')]"
  }
}
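
For the self-managed case, the same setting in a Cluster Manifest would look roughly like the fragment below. This is only a sketch: it assumes the usual Section / Parameter layout of the manifest's FabricSettings element, the value "true" is an assumption about what you want, and everything else in the manifest is omitted.

```xml
<FabricSettings>
  <!-- Same settings group name as the "ClusterManager" entry in the ARM template. -->
  <Section Name="ClusterManager">
    <!-- Assumed value; "true" allows Default Services to change during an upgrade. -->
    <Parameter Name="EnableDefaultServicesUpgrade" Value="true" />
  </Section>
</FabricSettings>
```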

Thursday, March 22, 2018

Troubleshooting connections to a service running in Service Fabric in Azure

Recently I decided to try deploying a Service Fabric cluster to Azure and investigate what it takes to create applications with Service Fabric. I used the default ARM template to deploy the cluster, with the parameters in the parameter file set appropriately. I've been able to successfully deploy the cluster itself, along with a private sample application that's showing as running and healthy within the cluster. (All of this was done with VSTS, but that's for another post.) I'm now running into the problem of actually connecting to the service.

When I use Postman to connect to the service, I'm connecting to a URL like:

https://mycluster.westus.cloudapp.azure.com:8870/api/things

However, when I send my request, Postman instantly fails to connect.



Here are the steps I've taken to verify the settings so far:

  • I got the address for my cluster from the "IP Address" resource that came with the ARM template that's in the Azure portal, using the "Copy" functionality. 
  • I've verified that the Load Balancer that was set up by the ARM template is correctly configured to use my application ports.
  • I've consulted the documentation for Load Balancer health probes to ensure that my machines are using the correct type of probe: https://docs.microsoft.com/en-ca/azure/load-balancer/load-balancer-custom-probe-overview#learn-about-the-types-of-probes . In my case, I don't want to use an HTTP probe even though I've got an HTTP service, because the probe would need a 200 response. Instead, I use the more basic TCP probe, which determines health status based on the TCP handshake and should be just fine.
  • I've updated the Diagnostics settings on the Load Balancer to pipe logs out to a storage account. Using the generated output, I've found that my health probe contradicts my expectations and is in fact failing. What's worse, there's a timing issue: the health probe fails too many times before the Service Fabric Host can start up on the VMs, and the Load Balancer then permanently marks the hosts as failed, making all of my VMs unreachable. That would seem to explain both my inability to connect to my services AND the speed with which the failure is returned (because the traffic doesn't even get past the Load Balancer).
  • I've used Remote Desktop to gain access to the Virtual Machine Scale Set VMs, thanks to the default settings that came with the Service Fabric ARM template. Loading up PowerShell and executing the command "iwr -Method Get -Uri https://localhost:8870/api/Things" yields the error "iwr : The underlying connection was closed: An unexpected error occurred on a send". It would seem that I can't even get a connection to my service working on the local machine. This would explain why the health probes are failing: the failures are perfectly legitimate.
  • Running the command "netstat -an | ? { $_ -like '*8870*' }" on the VMSS VM indicates that the Service Fabric Host has in fact launched my process on my expected port of 8870 and the process is listening on that port. Curiously, I'm also seeing an established connection on that port as well. This is at least somewhat consistent with the fact that the Service Fabric Management Portal is showing my service as healthy on all nodes, but inconsistent with the status of the health probe.
  • Figuring that this was a problem I had already solved in the past, I tried setting the permissions on the certificate stores for the certificates my API application uses. After some time waiting for the Load Balancer health probes to update, the probes were able to connect and the services were running properly.