Thoughts about how Azure is architected have been jumping around my brain for a long time now. Subconsciously I was trying to tie all of these thoughts together, but as usual it took some idle time when I wasn't thinking about anything else for them to come to the foreground.

I first began seeing problems with the way Azure is architected over a year ago at the San Diego Day of Azure (Saturday, Oct 03, 2009). There I ran into a brilliant guy I hadn't seen in a while, Jason Diamond, a fellow DevelopMentor instructor and former co-worker. He was playing with NServiceBus and asked whether you could deploy both web and worker roles to a single machine. You can't (at least you couldn't at the time), which points out one limit in Azure's ability to scale down. While we were talking I pointed out another limit: for the service level agreements (SLAs) to take effect you have to run two instances of each role. Put together, these two problems mean that if you have both a web role and a worker role, you basically need four dedicated instances just to get the SLAs. Ouch!

Then in preparation for writing my cloud course I started reading more about Google App Engine. I was marveling at how they could offer so much for free, until I realized that they weren't dedicating *any* hardware to a particular App Engine "customer." As a customer you might be running on a box with hundreds or even thousands of other customers. Heck, for all you know you might not be running on a box at all. The interesting thing is, until you hit the limits, you don't really care. When you do hit the limits you can start paying money, and Google might upgrade you to your own machine (actually, I am not really sure what they do – it is difficult to tell from the skimpy documentation on how it actually works under the covers).

Then last Friday Ike Ellis and I were writing an article about SQL Azure vs. Amazon RDS. Probably the most interesting parts of the article were the graphs (price and performance).

I think that if SQL Azure can flatten the storage cost line a little bit, it becomes a much more compelling option. It is also more "cloudy". By that I mean that SQL Azure is SQL Server re-architected for the cloud, not just an instance of SQL Server running in the cloud. SQL Azure is multi-tenant, it maintains three replicas automatically, and if a box is "getting hot" SQL Azure can move a database to another box with less running on it in order to better support that customer's needs. I think it is a great abstraction and ultimately will win in the long run.

Regardless of what the marketing materials say, Azure was architected as Infrastructure as a Service. I know it is positioned as Platform as a Service, but underneath the covers it is definitely – without a doubt – an infrastructure-based system. That is both good and bad. It is great once you get big enough to need your own dedicated hardware, but until you get to that point you really don't need all of the expense that goes along with paying for multiple CPUs owned solely by you. Google has proved that if you put a lot of customers together on the same hardware, it is much cheaper than giving each customer their own hardware. That is how they can afford to give away so much for free.

If Azure really is a platform then it should start acting like one. To me a platform is something that you can stand on without having to know how it was constructed underneath. In Azure, due to the law of leaky abstractions, some of the infrastructure details come leaking through. This is most notable in the fact that you have to manually or programmatically adjust the number of instances that your application is running on. "Instances?! I am running on instances? I thought I was running on telepathic robots! I am going over to Google, where telepathic robots do my work for me; instances are so 2000-and-late."

If Azure had the same free entry model as Google, where applications run in a multi-tenant environment, then you would simply deploy your application to the platform and the platform would make sure it never fell down. Microsoft knows how to set up a system like this, as they have demonstrated with SQL Azure. This is the ideal entry-level system and an ideal on-ramp for customers. As applications outgrow the free tier, they can move to dedicated hardware. That is something Google currently doesn't offer, and it would give companies the best of both worlds. In fact, Microsoft could apply the same philosophy to SQL Azure and compete against Amazon RDS's high-end database-in-the-cloud scenarios.

When I finally sat down and started writing the cloud course, one of the first slides I wrote laid out what I consider the cloud philosophies to be.

However, two announcements by Amazon in the past couple of months have made this chart slightly less accurate: Elastic Beanstalk and, just yesterday, CloudFormation. Elastic Beanstalk is Amazon's first foray into Platform as a Service (PaaS); the platform they refer to is Java. CloudFormation allows you to create essentially what Azure calls a service model. It is almost as if Amazon realized that they were the most complicated of the cloud platforms and started thinking of ways to simplify it :)

I am teaching Azure to Microsoft in England this week, and one of my students said he was having trouble getting an actual crash dump file to be produced, both locally and in the cloud. I think the problem was that the obvious way to write a program that crashes is to throw an exception soon after startup. The trouble is that it then crashes every single time, which never gives Azure a chance to transfer the crash dump file. What I wanted was a program that crashed every *other* time: crash, then send the dump, then crash again, then send the dump, and so on.

I was able to get that working by creating a NumberRepository and using it to keep track of how many times the role has run. Here are some excerpts from the code:

First the OnStart of the WorkerRole:

// Enable crash dump collection and schedule directory transfers (which is how
// the dumps end up in the wad-crash-dumps blob container) every minute.
var config = DiagnosticMonitor.GetDefaultInitialConfiguration();
config.Directories.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
CrashDumps.EnableCollection(true);   // true = full dumps, false = mini dumps
DiagnosticMonitor.Start("DiagnosticsConnectionString", config);

Then in the Run:

// Read how many times we have run, bump the count, and crash on odd runs so
// that every other run survives long enough to transfer the previous dump.
int number = NumberRepository.GetNumber();
NumberRepository.UpdateNumber(++number);
if (number % 2 == 1)
{
	throw new Exception("Bye");
}
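
The NumberRepository itself isn't in the excerpts. A minimal sketch of one (my own, assuming the count is simply persisted to a file in a local storage resource I'm calling "NumberStore" – that resource name, the file name, and the class layout are all hypothetical) could look like this:

using System.IO;
using Microsoft.WindowsAzure.ServiceRuntime;

public static class NumberRepository
{
    // The "NumberStore" local storage resource must be declared in
    // ServiceDefinition.csdef. Local storage survives the role process
    // crashing and restarting, which is all this test needs.
    private static string PathToFile()
    {
        LocalResource store = RoleEnvironment.GetLocalResource("NumberStore");
        return Path.Combine(store.RootPath, "number.txt");
    }

    public static int GetNumber()
    {
        string path = PathToFile();
        return File.Exists(path) ? int.Parse(File.ReadAllText(path)) : 0;
    }

    public static void UpdateNumber(int number)
    {
        File.WriteAllText(PathToFile(), number.ToString());
    }
}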

I ran it once (outside the debugger) and it crashed. I ran it again, waited a minute and a half, and voila – the dump appeared in the wad-crash-dumps storage container.

I already knew that Amazon had two ways of calling their services. The first was by consuming the WSDL metadata and calling through SOAP, and the second was through REST. Of course the REST API would be too cumbersome by itself, but not to fear – there is an SDK which makes it easier from common languages like Java and .NET. But the SOAP interface should be brain-dead simple to consume, right? Wrong. After searching the forums for a while I figured out that somebody had managed to get it working through WSE 2.0, but nobody had managed to get it working from WCF. I thought to myself, "Self, I can't allow this to happen." Myself agreed.

OK, the first thing was to get it to work any way I could. While I was searching the forums I came across this post which describes how to call AWS using SOAPSonar. So I downloaded the trial edition and gave it a whirl. Using SOAPSonar Enterprise I was able to add a certificate that I had saved earlier called 'brainhz-cert.cer' and call the service. Excellent. So now all I needed to do was the same thing in WCF.

The first step was figuring out how they were securing their service. After looking through their docs I found a few helpful snippets:

  1. AWS does not implement a full public key infrastructure. The certificate information is used only to authenticate requests to AWS. AWS uses X.509 certificates only as carriers for public keys and does not trust or use in any way any identity binding that might be included in an X.509 certificate. Pasted from here.
  2. Amazon does not store your private key.  Creating a new certificate/private key pair invalidates your old one.  This only affects your X.509 key used to authenticate AWS requests.  It does not affect the ssh keypairs you use to log into instances (linux) or retrieve their password (windows). Pasted from here.
  3. The WS-Security 1.0 specification requires you to sign the SOAP message with the private key associated with the X.509 certificate and include the X.509 certificate in the SOAP message header. Specifically, you must represent the X.509 certificate as a BinarySecurityToken as described in the WS-Security X.509 token profile (also available if you go to the OASIS-Open web site). Pasted from here.

From this I was able to deduce that they were using the WSS SOAP Message Security X.509 Certificate Token Profile 1.0.

I guessed that I needed to use Message-based security with the Certificate credential type, but I double-checked myself on the MSDN website.

WSS SOAP Message Security X.509 Certificate Token Profile 1.0

<basicHttpBinding>
  <security mode="Message">
    <message clientCredentialType="Certificate"/>
  </security>
</basicHttpBinding>

Pasted from here.
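
For reference, the same security setup can be made in code instead of config – a sketch (the binding variable name is mine):

// basicHttpBinding with Message security, authenticating the client with a certificate.
var binding = new BasicHttpBinding();
binding.Security.Mode = BasicHttpSecurityMode.Message;
binding.Security.Message.ClientCredentialType = BasicHttpMessageCredentialType.Certificate;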

I needed to specify which certificate I was going to use. It looked like I already had one in my Personal store (sometimes called the My store).

<endpointBehaviors>
	<behavior name="cert">
		<clientCredentials>
			<clientCertificate storeLocation="CurrentUser" storeName="My"
				 x509FindType="FindByThumbprint"
				findValue="6b 6a e8 ad b6 61 9c 1d a2 75 21 e4 4a d7 15 53 11 e6 72 27"/>
		</clientCredentials>
	</behavior>
</endpointBehaviors>
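
The programmatic equivalent, if you are configuring the proxy in code rather than in config (a sketch; client stands for the generated service proxy):

// Pick the client certificate out of the CurrentUser\My store by thumbprint.
client.ClientCredentials.ClientCertificate.SetCertificate(
    StoreLocation.CurrentUser, StoreName.My,
    X509FindType.FindByThumbprint,
    "6b6ae8adb6619c1da27521e44ad7155311e67227");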

After adding that the next error that I ran into was this:
“The service certificate is not provided for target ‘http://ec2.amazonaws.com/’. Specify a service certificate in ClientCredentials.”

OK, so I needed the serviceCertificate. I used Firefox to hit https://ec2.amazonaws.com/ and saved the certificate, then imported it into my Trusted People store.
Then I added the following to my endpoint behavior:

<serviceCertificate>
	<defaultCertificate storeLocation="CurrentUser" storeName="TrustedPeople"
		x509FindType="FindByThumbprint" findValue="29 ca cd 8f 43 2e ff 31 f2 7f e5 70 e9 2e 1a f3 9e 1b f8 e8"/>
	<authentication certificateValidationMode="PeerOrChainTrust" revocationMode="NoCheck"/>
</serviceCertificate>
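
Again, the code equivalent (a sketch; client is the generated proxy):

// Tell WCF which certificate to expect from the service and how to validate it.
client.ClientCredentials.ServiceCertificate.SetDefaultCertificate(
    StoreLocation.CurrentUser, StoreName.TrustedPeople,
    X509FindType.FindByThumbprint,
    "29cacd8f432eff31f27fe570e92e1af39e1bf8e8");
client.ClientCredentials.ServiceCertificate.Authentication.CertificateValidationMode =
    X509CertificateValidationMode.PeerOrChainTrust;
client.ClientCredentials.ServiceCertificate.Authentication.RevocationMode =
    X509RevocationMode.NoCheck;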

I had high hopes before running this time, but no. The next error was:
"Private key is not present in the X.509 certificate". When I looked at the certificate in the store, sure enough, I did not see "You have a private key that corresponds to this certificate" at the bottom.
Weird that it worked for SOAPSonar, but whatever. I went to Amazon, created and downloaded a new certificate, combined the certificate with its private key, and put the result in my Personal store. I then had to switch the thumbprint in the config to the one starting with 72 46.
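
(A quick way to sanity-check that the certificate WCF is about to pick up really does carry a private key – a sketch; the thumbprint below is a placeholder:)

using System;
using System.Security.Cryptography.X509Certificates;

var store = new X509Store(StoreName.My, StoreLocation.CurrentUser);
store.Open(OpenFlags.ReadOnly);
foreach (X509Certificate2 cert in store.Certificates.Find(
    X509FindType.FindByThumbprint, "7246PLACEHOLDER", false))
{
    Console.WriteLine("{0} HasPrivateKey={1}", cert.Subject, cert.HasPrivateKey);
}
store.Close();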

After doing all of that I received a very strange error:
"Value (xmlenc#) for parameter Version is invalid. Version not well formed. Must be in YYYY-MM-DD format."
WTF? I had never seen that one before, and I didn't really know what sort of black magic was going on beneath me. So I turned on message-level tracing, did some searching, and ended up trying two things:

  1. Switching the algorithmSuite from the default (Basic256) to Basic128 (see the sketch after this list).
  2. Switching the ProtectionLevel on the OperationContract to Sign only.
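
For the first change, the config attribute is algorithmSuite on the message element; on the basicHttpBinding sketch from earlier, the programmatic equivalent would be:

// Drop the algorithm suite from the default (Basic256) down to Basic128.
binding.Security.Message.AlgorithmSuite = SecurityAlgorithmSuite.Basic128;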

WARNING: the second change is a HACK; do not do this.
I went into the generated code in Reference.cs and changed the attribute on DescribeImages:

        [OperationContract(Action="DescribeImages", ReplyAction="*", ProtectionLevel=ProtectionLevel.Sign)]

Now fervently praying, I ran again. Bad news and good news.
The bad news was it didn't work; the good news was that it was a message-size issue, which I have fixed many times in the past. Because we were using Message security I couldn't turn on streaming, so I just had to up the maximums:
maxBufferSize="9999999" maxReceivedMessageSize="9999999"

After cranking up the number high enough I got
System.ServiceModel.Security.MessageSecurityException occurred
Message=Security processor was unable to find a security header in the message. This might be because the message was an unsecured fault or because there was a binding mismatch between the communicating parties. This can occur if the service is configured for security and the client is not using security.

This was starting to make me mad. I was saying things to my computer that are unfit for children's ears. After tracing, I discovered that this was related to the fact that Amazon's messages are only secured one way.
The responses, or this response anyway, seemed to be unsecured. After some searching I found a hotfix for this issue in WCF:

http://support.microsoft.com/kb/971493

However, it required a customBinding. ARRGH!

Now the next step was to figure out which properties needed to be set so that the custom binding matched what I was doing before.
I created a program that created the two bindings and compared their binding elements using reflection (a sketch of the idea follows the binding below). The outcome of that program was the following binding declaration:

<binding name="customWithUnsecuredResponse">
	<security authenticationMode="MutualCertificate"
		 allowSerializedSigningTokenOnReply="true"
		 defaultAlgorithmSuite="Basic128"
		 messageSecurityVersion="WSSecurity10..."
		 enableUnsecuredResponse="true"
		 securityHeaderLayout="Lax"
		 />
	<textMessageEncoding />
	<httpTransport maxBufferSize="9999999" maxReceivedMessageSize="9999999" />
</binding>
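
The comparison program itself was throwaway code. A minimal sketch of the idea (mine, not the original) is to walk each binding's elements and dump every public property, so the two outputs can be diffed by hand:

using System;
using System.ServiceModel.Channels;

static void DumpBinding(Binding binding)
{
    foreach (BindingElement element in binding.CreateBindingElements())
    {
        Console.WriteLine(element.GetType().Name);
        foreach (var prop in element.GetType().GetProperties())
        {
            if (prop.GetIndexParameters().Length > 0) continue; // skip indexers
            Console.WriteLine("  {0} = {1}", prop.Name, prop.GetValue(element, null));
        }
    }
}

// Usage: dump the working basicHttpBinding and the candidate customBinding,
// then compare the two outputs.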

I particularly liked the message security version which has to be high on the list of longest names in all of .NET. It was so long that I had to use ellipses because it was screwing up the CSS layout. The full name is (one piece at a time):
WSSecurity10
WSTrustFebruary2005
WSSecureConversationFebruary2005
WSSecurityPolicy11
BasicSecurityProfile10

Wow – what a mouthful…
Also notice enableUnsecuredResponse="true", the setting that finally takes care of the unsecured-response problem.
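
If you would rather build the custom binding in code, the equivalent is roughly this (a sketch, assuming .NET 4.0, or 3.5 SP1 with the hotfix, where EnableUnsecuredResponse is available on SecurityBindingElement):

using System.ServiceModel;
using System.ServiceModel.Channels;
using System.ServiceModel.Security;

var security = SecurityBindingElement.CreateMutualCertificateBindingElement(
    MessageSecurityVersion.WSSecurity10WSTrustFebruary2005WSSecureConversationFebruary2005WSSecurityPolicy11BasicSecurityProfile10,
    true); // allowSerializedSigningTokenOnReply
security.DefaultAlgorithmSuite = SecurityAlgorithmSuite.Basic128;
security.EnableUnsecuredResponse = true;
security.SecurityHeaderLayout = SecurityHeaderLayout.Lax;

var transport = new HttpTransportBindingElement
{
    MaxBufferSize = 9999999,
    MaxReceivedMessageSize = 9999999
};

var customBinding = new CustomBinding(
    security, new TextMessageEncodingBindingElement(), transport);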

After running one more time…

I bet the suspense is killing you…


IT WORKED!!!

I spent the next several minutes whooping it up. Having done it, I can honestly say that I may be the only person in the world stupid enough to try to get this working :)