Spatial data analysis can allow organisations across industries to gain a deeper understanding of their data and make more informed decisions.
In this instalment of the four-part Q&A series with keynote speakers from Ozri 2020, Esri’s Global Data Science Lead Lauren Bennett and Product Engineer Ankita Bakshi answer questions from the GIS user community about the latest spatial analysis workflows in ArcGIS Pro.
Lauren and Ankita’s presentation and live Q&A are available to watch on-demand.
Q. What’s the best way to learn about the methods you demonstrated and start to use them in my own analysis?
A [Ankita]. The first thing that comes to mind – and the first thing that I usually do when I want to learn more about a new tool – is the tool’s documentation itself. We put our heart and soul into making the tool documentation as informative and transparent as possible so when our users want to use our tools, it doesn’t feel like a black box. We put information regarding the algorithms that run under the hood, and provide some ‘gotchas’ and tool best practices so you can use them for your analysis.
We also have a series of four blog posts written by very talented product engineers on our team with some real-world examples on the time series forecasting tools that I showed in the presentation. You can look at how these tools are used in the real world and how they can be applied in your own analysis.
Learn lessons are another great way to get started – you’ll find a lot of useful lessons and resources. If you want to learn more about spatial stats then we have a Spatial Stats website with a lot of examples, tool reference guides, blog posts and technical workshops to get started with your analysis. I also highly encourage everyone to enrol in the spatial data science MOOC and get started with it.
Q. The emerging hotspot analysis that presented COVID hotspots over time in 3D was something that really caught my eye. What’s your advice on the best way to share data like this with executives or decision makers?
A [Lauren]. We have a whole workshop dedicated to thinking through some of this, how we think about visualisation from an analysis perspective, and how to focus on the difference between what I need as an analyst, and what a decision-maker needs. They’re not necessarily the same thing.
With emerging hotspot analysis, it’s already set up to help you interpret the results of the analysis. We take this beautiful 3D map – which by itself is not necessarily all that useful – and boil it down to those categories to help interpret that 3D map. Instead of looking at hundreds and thousands of bins in 3D, I can say I want to look specifically at this location and what’s been going on over time. That’s just a lot more interpretable.
When it comes to a decision maker, they may not even need the 3D at all – though we can certainly publish it as a web scene and share it online through things like story maps. We’ve got lots of examples where we do just that as the evidence behind a recommendation we might make.
We’ve found with decision makers, the most important thing is to think through the categories you have results for – do you actually need all of them for a decision maker? Does a decision maker care about the hot spots or the cold spots? Maybe they just care about the hot spots, depending on the problem you’re trying to solve. You can get rid of a bunch of categories, simplify your map and even group together a couple of categories that aren’t as meaningful to the decision maker.
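As a rough illustration of collapsing result categories for a decision maker, the sketch below regroups the hot spot pattern names reported by Emerging Hot Spot Analysis into three classes. The grouping itself and the class labels are hypothetical; which categories matter depends on the problem you are trying to solve.

```python
from collections import Counter

# Hot spot pattern names reported by Emerging Hot Spot Analysis.
# The split into these two groups is an illustrative assumption.
GROWING = {"New Hot Spot", "Consecutive Hot Spot", "Intensifying Hot Spot"}
OTHER_HOT = {
    "Persistent Hot Spot",
    "Diminishing Hot Spot",
    "Sporadic Hot Spot",
    "Oscillating Hot Spot",
    "Historical Hot Spot",
}


def simplify(category):
    """Map a detailed pattern name to one of three simplified classes."""
    if category in GROWING:
        return "Growing hot spot"
    if category in OTHER_HOT:
        return "Established or fading hot spot"
    # Cold spots and "No Pattern Detected" are grouped together here.
    return "Not a hot spot"


# Hypothetical results for five locations.
results = [
    "New Hot Spot",
    "Persistent Cold Spot",
    "Intensifying Hot Spot",
    "No Pattern Detected",
    "Historical Hot Spot",
]
summary = Counter(simplify(r) for r in results)
print(summary)
```

The simplified classes can then drive a map legend with three entries instead of seventeen.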
I think it’s really important as an analyst to take off your hat of ‘I want all of the data and all of the information’ and say ‘what does a decision-maker actually need to make their decision?’. Try to simplify the information as much as possible and tell a really effective story.
I’ve seen some really interesting conversations about ‘What is a data scientist?’ and ‘What is a spatial data scientist?’. One of the things that makes the ‘best spatial data scientist’ is being a great storyteller – because it doesn’t matter how great your analysis is, if no one can use it to make a decision, it’s not very useful.
Q. Could you provide some tips and resources on how to be a good storyteller?
A [Lauren]. Knowing your audience is the first step in creating powerful stories. You can also take inspiration from thousands of stories shared by storytellers on the ArcGIS StoryMaps website.
Q. There’s a lot of different clustering techniques – how do I decide which one to choose?
A [Lauren]. I’m sure Ankita and I would give different answers and I’m curious about Ankita’s perspective. With statistics, machine learning and other different approaches it’s not about what method you’re going to use, it’s about what question you’re trying to answer. It’s too easy to get swept away with all these shiny new algorithms and approaches but that doesn’t necessarily make them the best method to use for the problem you’re trying to solve. I’m a big proponent of starting from a place of your question.
A [Ankita]. I was going to say the same thing. People think of machine learning and artificial intelligence, but if your question is to find traffic clusters and traffic patterns, and that can easily be solved with density-based clustering, why would you reach for a more complex algorithm? It’s exactly what Lauren mentioned – it depends on the question you’re asking. There are different types of clustering algorithms mentioned in the presentation, like supervised and unsupervised clustering – it really boils down to what your question is and what you’re trying to answer using these clustering tools.
Q. What are your tips and tricks for processing large volumes of space time data?
A [Ankita]. It’s very important to be able to optimise your analysis to do a better job with large volumes of data. Within ArcGIS Pro, there is an option in the Space Time Tools environment settings called Parallel Processing Factor – you can choose to specify how many processors on your machine you want to fire up to do this analysis. Let’s say you have a 16-core machine and you want to dedicate fifty per cent of your processors to the analysis to speed things up: setting the factor to 50% asks the tool to use eight of those cores.
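The arithmetic behind a percentage-based processing factor can be sketched in a few lines. This is not ArcGIS code; the function name and the choice to round up are assumptions made only for illustration.

```python
import math


def cores_for_factor(total_cores, percent):
    """Return how many cores a percentage-style factor would request.

    Rounding up and the minimum of one core are assumptions of this sketch.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    return max(1, math.ceil(total_cores * percent / 100))


# The example from the answer: half of a 16-core machine.
print(cores_for_factor(16, 50))  # 8
```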
There is also GeoAnalytics Server in ArcGIS Enterprise, which processes and analyses big data using a distributed computing framework. Many of the GeoAnalytics Server tools are also available as desktop tools in ArcGIS Pro and provide a similar processing framework on your desktop machine using Apache Spark. Depending on where you’re performing your analysis – on your local machine or on the server – these are ways to optimise the processing of large volumes of data.
Q. I know there’s been some advancements with NetCDF in ArcGIS Pro. Are you able to talk about how you work with some of those?
A [Lauren]. What I would mention is the voxel layer – which just came out in the last release – and is a really powerful way of visualising NetCDF data and multi-dimensional data in general. If you haven’t seen the voxel layer in action, I would definitely check it out. It’s stunningly beautiful to watch how you can manipulate and explore multi-dimensional data inside of ArcGIS Pro and it also supports space time cubes.
We were kind of a stakeholder on that project because we wanted to make sure we could take advantage of this new technology for space time cubes – and we certainly can. There’s a recent lesson on using voxel layers and space time cubes to analyse social distancing data which is worth checking out. But those voxel layers in general are amazing.
Q. Time is quite complicated. There can be patterns by time of day, season or long-term trends. How does this relate to the time steps in the Space Time Cube?
A [Lauren]. There are quite a few ways to think about incorporating things like seasons or cycles into your analysis. By default, for analyses like Emerging Hot Spot Analysis and Local Outlier Analysis, you can’t automatically incorporate seasons, but you can create cubes that represent cycles in your data. For instance, you can create a cube of only weekends, and then analyse what kinds of trends you have just on weekends. Or you can create cubes with just summers, and analyse trends across summers.
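Before building a weekends-only cube, the input records need to be filtered down to weekend dates. A minimal sketch of that filtering step in plain Python is shown below; the record layout (a date paired with a count) is hypothetical, and in practice the same selection could be made with a query on the date field.

```python
from datetime import date, timedelta


def weekend_only(records):
    """Keep records dated Saturday (weekday 5) or Sunday (weekday 6)."""
    return [r for r in records if r[0].weekday() >= 5]


# Two weeks of made-up daily counts starting on Monday 1 June 2020.
start = date(2020, 6, 1)
records = [(start + timedelta(days=i), 10 + i) for i in range(14)]

weekends = weekend_only(records)
for day, count in weekends:
    print(day.isoformat(), count)
```

The filtered records would then be the input used to create the cube.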
It is also worth mentioning that Time Series Forecasting automatically takes seasonality into account, and both Forest-based Forecast and Exponential Smoothing Forecast are great at dealing with data that includes cycles.
Q. In Australia, two of the most common cloud infrastructures are AWS and Microsoft Azure – is that something you leverage to speed up this type of processing?
A [Lauren]. When creating space time cubes specifically, it’s actually really fast. We create these as NetCDF files and they’re optimised for dealing with spatio-temporal data. We’re more likely to move to GeoAnalytics for pretty much any other kind of big data processing – there are cool movement tools coming out of the GeoAnalytics team and having them in ArcGIS Pro is really powerful. A lot of spatial statistics tools such as Density-based Clustering are hugely optimised when they’re distributed for massive data.
They also incorporate time in a really powerful way in GeoAnalytics. Being able to do that kind of cloud-distributed computing for massive data is especially important when you’re doing pattern analysis and spatial statistics and we definitely take advantage of GeoAnalytics for that pretty regularly. Space Time Cube is an amazing format that works pretty efficiently on large data – even on your desktop.
Q. One difference in the demo of the Kernel Density and Hot Spot Analysis is that the latter used pre-defined aggregation. Are there other tests of statistical significance where you use the discrete location of the data?
A [Lauren]. You can use the discrete locations even while using Hot Spot Analysis if an attribute is associated with the point locations. If you have defined locations (like polygon boundaries), there’s no need for aggregation. The associated attributes can be directly used in the analysis. There are also other methods to analyse features exclusively based on the location of the data such as Density-based Clustering.
Q. Are there any general rules on the size of geographical areas used for analysis – for example can I use Hot Spot Analysis for metropolitan and rural areas?
A [Lauren]. It really depends on the question you’re asking. You can run the analysis separately for rural or urban areas if the geography varies greatly. There are certain defaults that these statistical tools use, however, there isn’t any magic number or formula to find the neighbourhood distance that would fit all the scenarios.
Q. The Hot Spot Analysis tool is sensitive to how we choose the parameters in Hot Spots Interface in ArcMap. Was this the case in your analysis?
A [Lauren]. It’s true that the Hot Spot results vary based on the analysis distance chosen. As mentioned earlier, statistical tools use particular defaults, however there isn’t a ‘one-size fits-all’ number or formula for finding neighbourhood distance that would work for all situations.
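To see why the results depend on the analysis distance, here is a minimal sketch of the Getis-Ord Gi* statistic that Hot Spot Analysis is based on, written in plain Python. It assumes binary weights within a fixed distance band, which is only one of several ways to define a neighbourhood, and the grid data is made up.

```python
import math


def gi_star(points, values, distance):
    """Return a Gi* z-score per point using binary fixed-distance weights."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for xi, yi in points:
        # Weight is 1 for every point within the band, including the point itself.
        w = [1.0 if math.hypot(xi - xj, yi - yj) <= distance else 0.0
             for xj, yj in points]
        sum_w = sum(w)
        sum_w2 = sum(wi * wi for wi in w)
        numerator = sum(wi * v for wi, v in zip(w, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


# A 5 x 5 grid with a block of high values in one corner.
points = [(x, y) for x in range(5) for y in range(5)]
values = [10 if x <= 1 and y <= 1 else 1 for x, y in points]

near = gi_star(points, values, distance=1)
far = gi_star(points, values, distance=3)
print(round(near[0], 3), round(far[0], 3))
```

The same corner cell receives a different z-score at each distance, because a wider band pulls more low-valued neighbours into the comparison.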
Q. Do you have to specify a neighbourhood boundary or does the tool recognise them on the fly?
A [Lauren]. There are certain defaults that statistical tools use. The Optimised Hot Spot Analysis tool runs Incremental Spatial Autocorrelation to find a distance where the clustering is most pronounced. Some tools use the average distance to the thirty nearest neighbours. You can find more about the defaults for each tool in its documentation.
There isn’t a magic number or formula to find the neighbourhood distance that would fit all scenarios – it’s the art and science of analysis. When choosing the right distance for your analysis, we recommend talking to subject matter experts. Ultimately, the distance that you use is really determining the question that you’re asking – so it is a really important decision.
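As one illustration of a distance default like those mentioned above, the sketch below computes the average distance from each point to its k nearest neighbours and then averages that across all points. This is one plausible reading of "average distance to the thirty nearest neighbours", not a reproduction of how any particular tool calculates it.

```python
import math


def average_knn_distance(points, k=30):
    """Average, over all points, of the mean distance to the k nearest neighbours."""
    if len(points) <= k:
        raise ValueError("need more than k points")
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        # Distances from point i to every other point, smallest first.
        dists = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        total += sum(dists[:k]) / k
    return total / len(points)


# A 7 x 7 grid of made-up locations, one unit apart.
grid = [(x, y) for x in range(7) for y in range(7)]
print(average_knn_distance(grid, k=30))
```

A value like this is only a starting point; the distance still needs to make sense for the question being asked.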
Q. What is the percentage of accuracy while using forecast analysis?
A [Lauren]. You can specify the number of time steps to exclude for validation in the forecasting tools. The tool then takes the excluded time steps as test data and compares the forecasted values with the raw values to find the validation RMSE, which is reported in the tool’s messages. You can also use the Evaluate Forecasts By Location tool to test which model did a better job of forecasting based on validation RMSE.
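The hold-out idea behind a validation RMSE can be sketched in plain Python. The two forecasters below (repeat the last value, and a least-squares straight line) are simple stand-ins chosen for illustration; they are not the models used by the ArcGIS forecasting tools.

```python
import math


def naive_forecast(train, steps):
    """Repeat the last observed value for every future step."""
    return [train[-1]] * steps


def linear_forecast(train, steps):
    """Extend a least-squares straight line fitted to the training values."""
    n = len(train)
    mean_t = (n - 1) / 2
    mean_y = sum(train) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in enumerate(train))
             / sum((t - mean_t) ** 2 for t in range(n)))
    intercept = mean_y - slope * mean_t
    return [intercept + slope * (n + h) for h in range(steps)]


def validation_rmse(series, holdout, forecast_fn):
    """Withhold the last `holdout` steps, forecast them, and score the error."""
    train, test = series[:-holdout], series[-holdout:]
    predicted = forecast_fn(train, holdout)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, test)) / holdout)


# A made-up series with a steady upward trend.
series = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24]
for name, fn in [("naive", naive_forecast), ("linear", linear_forecast)]:
    print(name, validation_rmse(series, holdout=3, forecast_fn=fn))
```

Comparing the two scores at one location is the same kind of comparison that picks the better model per location: the lower validation RMSE wins.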
Q. The confidence interval shown in the temporal forest had a high spread of results. What techniques can be used to reduce the variability of forest results?
A [Lauren]. There are many parameters available in the forest-based forecast tools which can be adjusted for better results. For instance, the time window parameter represents the number of previous time steps to use when training the forest.
If your data displays seasonality (repeating cycles), provide the number of time steps corresponding to one season for this parameter. Like any other random forest method, you also have the option to specify the number of trees, minimum leaf size, depth and more. You can find out more about these parameters in the Forest-based Forecast (Space Time Pattern Mining) tool reference.
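To make the time window idea concrete, the sketch below turns a time series into training rows, where each row holds the previous `time_window` values and the target is the value that follows. This is a common way to prepare lagged data for a forest model and is shown as an assumption about the general approach; the tool's internal processing is not reproduced here.

```python
def sliding_windows(series, time_window):
    """Build (previous values, next value) training pairs from a series."""
    features, targets = [], []
    for end in range(time_window, len(series)):
        features.append(series[end - time_window:end])
        targets.append(series[end])
    return features, targets


# Eight made-up time steps with a window of three.
series = [1, 2, 3, 4, 5, 6, 7, 8]
features, targets = sliding_windows(series, time_window=3)
for row, target in zip(features, targets):
    print(row, "->", target)
```

With a window equal to one season, every training row spans a full cycle, which is why that setting helps with seasonal data.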
Q. A lot of the tools seem to need ten time steps. How do you manage with less than ten time steps?
A [Lauren]. It is recommended to have at least ten time steps in the Space Time Cube for reliable statistical analysis. If the data is collected at a fine time scale – for example, monthly – you can aggregate it into longer intervals such as six months, as long as that still leaves at least ten time steps.
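A quick way to check whether an aggregation interval leaves enough time steps is to count them before building the cube. The sketch below sums monthly values into fixed-length steps; dropping an incomplete trailing group is an assumption of this example, and the monthly counts are made up.

```python
def aggregate(monthly, months_per_step):
    """Sum consecutive months into steps, dropping an incomplete final group."""
    return [
        sum(monthly[start:start + months_per_step])
        for start in range(0, len(monthly) - months_per_step + 1, months_per_step)
    ]


MIN_STEPS = 10  # the recommended minimum number of time steps

five_years = [5] * 60   # 60 monthly counts
four_years = [5] * 48   # 48 monthly counts

for label, data in [("five years", five_years), ("four years", four_years)]:
    steps = aggregate(data, months_per_step=6)
    print(label, len(steps), "steps,", "ok" if len(steps) >= MIN_STEPS else "too few")
```

Five years of monthly data yields ten six-month steps, while four years yields only eight, so a shorter interval would be needed in the second case.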
Q. What were the tools used in the space-time demo that Ankita ran through?
A [Lauren]. The tools used in the demo (in order) were Create Space Time Cube From Defined Locations, Visualise Space Time Cube in 3D, Visualise Space Time Cube in 2D and Emerging Hot Spot Analysis.
The Time Series Forecasting tools shown were: Curve Fit Forecast, Exponential Smoothing Forecast, Forest-based Forecast and Evaluate Forecasts By Location.
The tools are available in ArcGIS Pro. Several of these tools are also available in ArcMap, but not all of them.
Q. Ankita mentioned Apache Spark in the presentation – is this integrated with ArcGIS Enterprise or ArcGIS Pro?
A [Lauren]. You can find out more about Apache Spark in this blog on Distributed GeoAnalytics or watch it in action via this video.
Q. Can we get access to the DHHA data to use for teaching purposes?
A [Lauren]. Yes, the DHHA data is available on the Esri Australia COVID Hub.
Q. You demonstrated alternative new geoprocessing tools to analyse clusters – are these also available on the ArcGIS Desktop environment?
A [Lauren]. Yes, all these new tools are available in the ArcGIS Pro Desktop application.
Q. We are currently transitioning some of our data into ArcGIS as part of our digital transformation. We can currently request OGC Web Map Services (and some Web Feature Services) from the NetCDF files. Will we be able to do that once the data is in ArcGIS, and are Web Processing Services available in ArcGIS?
A [Lauren]. ArcGIS supports WPS, WMS and WFS services and you can find out more about this via the ArcGIS Server page. There are also a lot of exciting advancements when it comes to visualising and analysing NetCDF data, including the new voxel layer which is worth checking out via this video and help page.
Q. Can you give us a sneak peek of what’s to come in this space?
A [Ankita]. Our team is working on a very exciting data engineering project. Whenever we start doing an analysis, the biggest pain point is getting the data in the right format – cleaning it up and filling in missing values. If you are doing regression analysis, you may also want to transform or standardise your fields.
Eighty per cent of an analyst’s time is spent getting their data ready for analysis. That’s what we’re working on – to reduce those pain points and provide that capability in ArcGIS Pro so the workflow can be streamlined and we can provide more options to investigate your data and make it analysis-ready. We are hopeful that it will be very helpful to our users.
A [Lauren]. Ankita is one of the key product engineers working on the data engineering project and it is the first thing I would have mentioned. When we think about what spatial analysis is, data engineering is a key piece of it. We all do our data engineering in ArcGIS already but this is about streamlining it and we’re really excited about that.
There’s a bunch of other projects going on. There’s a lot more work going on in the area of thinking about time series analysis – particularly time series outliers and spatial outliers. We’re really excited about some work we’re doing looking at comparing two categorical maps – that will definitely fill a gap and a lot of folks are going to be eager to use these tools. This upcoming release is a huge one for us. The longer-term plan is that spatial data science will be one of the core capabilities of ArcGIS and we’re busy making sure that it’s everything you need it to be.
This is one of a four-part Q&A series with keynote presenters from Ozri 2020. Read the Q&As with New York State Police’s Kellen Crouse, Esri’s Global Director of Artificial Intelligence Omar Maher and Norman Disney and Young’s John Benstead.