Posted by Mike Racine
Recon your Azure resources with Kusto Query Language (KQL)
ITOps is always dealing with lots of data. From monitoring data and logs to resource metadata, its not uncommon to have to sift through thousands if not millions of records at a time. There is one hidden gem of a tool in Azure that can handle a lot of this, and that’s KQL… the Kusto Query Language.
In this episode of #KnowOps, Dana introduces us to the power of the KQL and shows how to use it with things like Log Analytics, Azure Sentinel and Azure Resource Graph. He even demonstrates a simple way he uses KQL as part of his regular Azure pentest efforts to find potentially vulnerable hosts in seconds when working with clients.
As an Azure security guy, I get tasked with sifting through tons of data all the time. Sometimes, I'm looking for risky, misconfigurations through the Azure Resource Graph. Other times, I'm digging through suspicious behaviors with Log Analytics. Or, threat hunting in Azure Sentinel. Whatever I'm doing, there's always one tool that I can count on to get the job done no matter if it's hundreds, thousands, or millions of records that I have to get through. And, that's KQL, the Kusto Query Language. In the last episode, I introduced you to Azure Monitor. Today, I'm gonna show you how to use KQL to turn your monitoring data into operational insights in minutes. And, stick around until the end. I'll even show you a trick with KQL that I use when pentesting Azure environments.
Dana Epp here, welcome to the channel that helps aspiring Azure administrators like you and me to know ops and, well, master the Microsoft Cloud. I'm glad to have you here. If you haven't yet, please smash the subscribe button so that you can be notified when I release new videos each week. I promised last week that I'd teach you KQL this week, so let's jump right into Azure and I'll walk through the fundamentals that you should know that will help you slice and dice your data very, very quickly. Because I don't wanna cause an information disclosure incident here at work, I'm gonna use some demo data that Microsoft publishes for anyone to use with Log Analytics. I encourage you to head over to aka.ms/LADemo and follow along in your own tenant.
Go ahead, I'll wait.
Okay, so we're gonna do a very quick whirlwind tour of how to use Log Analytics and, more importantly, about KQL. KQL can be used in so many different places, like the Resource Graph or in Data Explorer. Log Analytics is a great place to use it though because A, we can have a lot of data like we've loaded in this demo database. And B, the IntelliSense that it provides makes life really easy when you wanna learn how to query it. So, in our case here, we're gonna use the Security Events table. If you notice, as I'm typing, it's auto completing for me, giving me the options, so if I hit enter, boom, I'm right into the Security Events table. In my case, you'll notice it went right into a pipe operator. And, you can build complex queries using KQL by piping in the output from one command into the other. So, you always have to start KQL with grabbing some source data from some sort of table, in this case, the SecurityEvent table. And, we're gonna pipe it into count so we can get an idea of how many records are in the system. And, we're going to filter that by the last seven days just so we get an idea for the last week, how many records are in here. And, we can see we have about 1.9 million records and that's quite a bit, so we wouldn't want to just grab the table directly.
So, if I wanted to, I could just say security event, get rid of the pipe. And, if I hit enter, this is gonna go query and it could take some time to come back. Obviously, I don't wanna do that, so what I'm gonna do is say maybe I just wanna grab 50 records. And so, I can use a command called take, or limit is an alias to that. And, what that'll do is it'll only bring back 50 records. And then, I can click the Run button. And, it will bring back, over here, you can see it brought back 50 records in 1.8 seconds. Now, one thing to note here is going forward, I'm not gonna hit this Run button. I'm gonna use the Shift Enter key, which is a shortcut to that so that I don't have to keep moving my mouse and I can get that out of the way for you.
So, let's go and start using some of this information to our benefit. As an example, you can see here, when bringing back in the Security Events table, I have a whole bunch of columns, like TimeGenerated and Account and the AccountType and Computer and it just goes on and on and there's a ton of columns of data here. In my case here, let's go and see what we can do about getting through that 1.9 million records very, very quickly. So, maybe go and find things like, oh, I don't know, failed logins or something. So, why don't we start by saying, first thing we wanna do is we're gonna filter these records and we're gonna say where TimeGenerated. So, if you're not aware, TimeGenerated is a column that's on all data sources that relate to the time it was inserted into the log store.
So, in our case here, we're gonna say just bring us back everything that was in the last day. So, we go say go, which is a time operator. And then, we can do 1d. If I wanted to say give me one minute, I would go 1m, but, this case here, we're just gonna say give me everything from one day. And if we run that, you can see, in 4.9 seconds, it's brought back, well, it says 10,000 records, but you can see down here that it's paging, so it means that there is potential for even more records that are in the system that's there. So, maybe we'll do something just to make it a little simpler. For me, I'm usually working at things that are happening just in time, so I might say from now and I'll say in the last 15 minutes. Bring me back all the security event logs where the TimeGenerated was in the last 15 minutes. Okay, that's a little better, 2,772 records. That's not too bad.
Maybe though, I want to be able to maybe filter that information. I don't need all these columns. That's a lot of information, so maybe what I'm gonna do is I'll use another operator here called project and what project does is it gives me the opportunity to decide what columns, or what pieces of information, I wanna bring back. So, in my case here, I'm gonna say TimeGenerated. Maybe the Account. Computer would be useful. I happen to know that there's something here called EventID that's pretty important. Activity and maybe IpAddress. Now, obviously, you're gonna pull back what you feel is needed to show your information. So, if I run this command, we can now see we're only bringing back the columns of information. And, you can see here under Activity, as an example, I can see here, Activity with an EventID of 4625, an account failed to log in, which I just happen to know that is the EventID in Windows for this, so let's use that. So, maybe what I wanna do is not just filter by TimeGenerated, but we can use more complex queries where I could say where EventID equals 4625.
So, you can see we can chain these where commands so that where I take the output from the previous one as the input to the new one and we can keep going down the line. Now, we've brought it down to 2,023 records. Now, another way to do things like this where you have multiple wheres is you can use the and command. So, I could go and say where TimeGenerated is in the last 15 minutes and EventID equals 6225. It means the same thing. Something to be aware of. Of course, you can use or operators and you can do lots of comparators here, so if you know how to do this from a scripting perspective, a lot of the constructs that you might be used to in something, let's say, SQL queries or in PowerShell, you can very well use here. You just need to know what they represent. So, the thing is though that maybe in my case here, what we really wanna do at the end of the day is summarize it. We wanna know about these failed logins.
So, one of the other things we might be able to do here is do something like say we wanna summarize this information. And, I'm gonna create a new computed column called failed logins and I'm going to compute it by using the count operator by the Computer. So, in this case, what it's going to do is say I wanna go find, group everything by that Computer. Give me a count of that for each failed login. So, now if I go and try to run this, we now can see we have, in this case, one, two, three, four, five different machines and they have different failed counts. Now, you'll notice these aren't in any kind of order. They just happened to group them by what they found in the data. But I could order them to give me a little more cleanliness to the data as it's coming back. So, I could do this by doing order by failed logins. And, now we can see that this Contoso job FW JB is at 813 failed logins in the last 15 minutes.
Now, not only can we display this in table formats, we can actually pipe this into renderer into interesting things like barcharts or piecharts. And, what we just did is we went from having over 1.9 million records down into a pie chart, which we now could go pin to a dashboard and have available to us if we wanted to. And, you can slice and dice this data in different ways. Like, maybe I wanna be able to do it by Computer, but if I was maybe filtering by different error codes or EventIDs, I could filter it by that if I wanted to be able to group these things together.
Now, here are a few performance optimization tips. Use a where clause as soon as you can and always filter by timestamps first. This will considerably speed up your results set. Also, project only the columns that you need. We sometimes work with datasets that have hundreds of columns and we only need a few of them. Getting rid of the extra columns can not only help performance, but it makes debugging a whole lot easier.
Now, let's put this all together with a fun example.
I use this when pentesting environments and it's surprising how effective it can be. No matter if your resources are on-premises or in the cloud, one security fundamental is that systems that have been running the longest have a pretty good chance of being more vulnerable. Chances are their system kernels and drivers and OS patches aren't being applied or they would've been rebooted. So, let's see if we can weaponize that information thanks to perf counters picked up by Azure Monitor. Check this out. Okay, let's go create a new query. It ends up that through Azure monitor and running of a typical monitoring agent, it'll send back a ton of performance counters to the Perf table. And, if I was to just do a quick run here, we can see that in the last 24 hours, there's been over 1.6 million records related to some of the servers here in this demo database.
I wanna filter that out and I'm gonna go look where the object name happens to represent the System performance counters. And, I'm also gonna go and look for some special counter names, which I happen to know is System Up Time, which represents the uptime for Windows servers. And, I'm also gonna look for where the counter name represents Uptime, which represents Linux servers. And, what I wanna do from this is that I happen to know that the counter value, which represents the data point related to the specific counter name is in milliseconds, so I wanna convert this. So, I'm gonna compute a new column called UpTime. And, I'm gonna just take the CounterValue and I'm gonna multiply it by 1s, or one second, to convert it into seconds, minutes, hours, days, etc. And then, I wanna just project some information that I want, which would be the TimeGenerated, the Computer, this actual UpTime computed value, and I also want the InstanceName. And, the InstanceName will become a little more evident in a minute here. And then, I wanna summarize it.
Now, it ends up, because there's gonna be millions of records here for the uptime counters, I really only want the latest one. So, what I'm going to do is use something called arg_max, which says just get me, in our case, for this argument of TimeGenerated for all these records, I want to get the latest one for the Computer. And then, finally, I am going to order this by that new computed column called UpTime in descending order. And, by doing this, what we're gonna end up doing is we're gonna say of the 106 million records, I wanna go find all the System records that have a counter of System Up Time or Uptime so that we can bring back both Windows and Linux servers and I wanna compute a new column called UpTime, which is really just gonna show us in seconds, minutes, hours, and days what the uptime is and will then pull back some of this data and give us the last record.
So, in other words, once we order this, what we're gonna end up getting is a list of the servers that have been running the longest. And, here, we can see right now that this ContosoVm1 has been running for a very long time, 447 days. And, we've got a hardening-demo Linux Red Hat box for 426. And, InfraScaleVMS for 347. And, another Windows machine here, retailEUS7, for 227. There's some servers here that have been running for over a year. There's a very good chance that they haven't been patched and it might be something for us to look into.
These are the kinda cool queries that we can get out of KQL very, very quickly if you know what to do. So, KQL is becoming a defacto query language for Azure services and data. Check out the KQL reference for more information, including a variety of examples that you can use to do even more interesting and complex queries. I'll leave a link below in the description for you.
I hope I've been able to give you a bit of an idea into just how powerful KQL can be. And, hopefully, inspire you to rethink how you might be able to query your data and turn it into operational insights for your business. Let me know by leaving me a comment and hitting the like button. And, if you haven't yet, smash the subscribe button so that you can be notified as I publish more videos. Until then, thanks for watching. We'll see you in the next episode.
Rethink how you might be able to query your #Azure data and turn it into operational insights with KQL #knowops @auditwolf