“Data for good is a movement in which people and organisations transcend boundaries to use data science and AI to tackle today’s pressing challenges.”
Table of Contents
Introductory Story
Input Data Preparation
Exploratory Data Analysis
Feature Engineering
Model Selection
Policy Recommendations
References
Introductory Story
Picture this scenario. Out of nowhere, you are appointed as Senior Advisor to the Ministry of Humanitarian and Disaster Management in Somalia, the government agency responsible for handling nationwide emergencies. Struck by sheer joy, you’d love to update your LinkedIn profile, your Twitter timeline, and all your other platforms to celebrate, even though you have no clue how you were shortlisted or selected.
Attached to the offer letter, you notice, are dozens of daunting documents: reports from NGOs, evaluation assessments, previous advisory papers, and other worrying details described in PDF briefs. Within the blink of an eye, it dawns on you that you’re actually sitting on a career landmine. It couldn’t have gotten more challenging.
Deemed a failed state, Somalia has been trying to get back on its feet for decades. There’s an ongoing insurgency by a terrorist outfit, Alshabaab, that was born out of the peak years of civil war and dysfunction. To this day it wreaks havoc at almost weekly frequency. Moreover, there are momentous institutional and state-building tasks on Somalia’s plate. The economy, the political climate and overall trust among stakeholders all rest upon these being adequately addressed. Lamentably, those are just a few points in a long list of calamities, one that may look different depending on who you ask and what they consider priorities.
Coming back to our hypothetical example, remember that the ministry which offered you the job wants you to solve one of them: namely, Somalia’s drought problem. In this article, I will help you find those solutions, so as to enable Somalia to handle drought situations better in the future. Together we shall build a drought alarm system using data science. That is, gather the relevant data, formulate a modelling approach and in the end provide a data-led drought management strategy.
Input Data Preparation
So, where do we start? As so often, Google comes to our rescue. Chasing after rainfall data on Somalia, though not a joy, fortunately isn’t that hard. Following some hiccups, I found that TAMSAT (http://www.tamsat.org.uk/) seems to be the only consistent, easily accessible, and freely available data source on rainfall for Somalia. It provides satellite-based rainfall estimates for the whole of Africa, configurable down to daily granularity and a 4km resolution. After specifying the geolocation, the desired granularity and the time period of analysis, you’ll receive a download link to either a CSV or a NetCDF file. Let’s read that in using the Pandas library.
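As a minimal sketch of that reading step, the snippet below parses a tiny inline sample rather than the real download. The column names ‘time’ and ‘rfe’ are assumptions on my part, so check the header of your own export before reusing it.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV. The column names 'time' and 'rfe'
# are assumptions; adjust them to match the header of the real file.
sample_csv = io.StringIO(
    "time,rfe\n"
    "1983-01-01,0.0\n"
    "1983-01-02,1.2\n"
    "1983-01-03,0.4\n"
)

# Parse the date column and use it as the index, so that time-based
# operations (resampling, slicing by year) work later on.
rain = pd.read_csv(sample_csv, parse_dates=["time"], index_col="time")
rain = rain.rename(columns={"rfe": "Rainfall"})
print(rain.head())
```

With a real download, the `io.StringIO` object would simply be replaced by the path to the file.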
As can be seen, the data is at daily granularity, and rainfall is measured in millimetres (abbreviated as mm). One millimetre of rainfall is equivalent to one litre of water per square metre.
The next step in our end-to-end project involves writing up a data processing function. Our objective here is to down-sample the daily data to weekly frequency (that is, aggregate it one granularity level up), and also to generate a couple of potentially useful time series features. “Why would we want to do that?” you may wonder. Extracting these features from our raw input data will come in handy later on, during the data modelling phase.
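Such a processing function could look roughly like the sketch below. It is not the original function: the function name and the ‘Year’ column are illustrative assumptions, and the daily input is synthetic.

```python
import numpy as np
import pandas as pd


def preprocess(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily rainfall (mm) to weekly totals and add calendar features."""
    # Summing is the natural aggregation for rainfall amounts.
    weekly = daily[["Rainfall"]].resample("W").sum()
    weekly["Year"] = weekly.index.year
    weekly["Quarter"] = weekly.index.quarter
    weekly["Month"] = weekly.index.month
    weekly["Weeknum"] = weekly.index.isocalendar().week.astype(int)
    return weekly


# Synthetic daily data (1 mm every day), for illustration only.
idx = pd.date_range("2020-01-01", "2020-03-31", freq="D")
daily = pd.DataFrame({"Rainfall": np.ones(len(idx))}, index=idx)

weekly = preprocess(daily)
print(weekly.head())
```

Every complete week in this toy input sums to 7mm, which makes it easy to eyeball whether the aggregation behaves as intended.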
Exploratory Data Analysis
What must be remembered is that any serious data project begins with exploratory analysis. It is, truth to tell, the stepping stone that comes before one jumps to any form of modelling. We can draw insightful wisdom in this regard from John Tukey, a much-venerated pioneer of statistics who once coined a fitting phrase. As part of his seminal work from 1977, entitled ‘Exploratory Data Analysis’, he wrote: “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” This is why we’ll start off by merely exploring the rainfall data for Somalia.
When rain seasons fail to materialise, we speak of the occurrence of drought. It is, beyond question, a situation characterised by overwhelming scarcity. And as can be seen from the time series plot above, Somalia has long been, rather involuntarily, one of its most frequent destinations. Indeed, it is evident from the chart, which begins in the year 1983 and goes up to 2021, that Somalia has suffered greatly from recurring spells of drought.
Furthermore, to provide a bit more context on the visualisation, the blue line, which oscillates between zero and the 200mm maximum, represents Somalia’s monthly rainfall, whereas the dashed red line represents an aggregated measure generated for analytical purposes. It is the average monthly rainfall for the entire period of study and purely a visual helper to identify gaps in the blue line. See the below code:
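The original snippet is not reproduced here, so what follows is only a sketch of the idea on four made-up monthly values. It shows two ways of placing an average next to the raw observations: a single overall mean, and a group-wise mean via Pandas `transform`. Apart from ‘Rainfall’ and ‘Avg_Rainfall_Diff’, the column names are assumptions.

```python
import pandas as pd

# Four made-up monthly totals (mm), for illustration only.
monthly = pd.DataFrame(
    {"Rainfall": [10.0, 0.0, 50.0, 20.0]},
    index=pd.to_datetime(["2020-03-31", "2020-04-30", "2021-03-31", "2021-04-30"]),
)

# Overall average for the whole period: a scalar repeated on every row.
# This corresponds to a flat reference line in a plot.
monthly["Avg_Rainfall"] = monthly["Rainfall"].mean()

# Group-wise average per calendar month. transform() returns a result
# aligned with the original rows, so the aggregate sits next to the
# disaggregated values it was computed from.
monthly["Avg_Rainfall_Month"] = (
    monthly.groupby(monthly.index.month)["Rainfall"].transform("mean")
)

# Positive values mean more rain than the overall average.
monthly["Avg_Rainfall_Diff"] = monthly["Rainfall"] - monthly["Avg_Rainfall"]
print(monthly)
```

The benefit of `transform` over a plain `groupby(...).mean()` is that no merge is needed afterwards, since the result already has one value per original row.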
Looking at this code snippet one sees two noteworthy points: 1) that I’ve used Pandas transform to combine the aggregation with the disaggregated data, and 2) that I’ve created a new variable called ‘Drought’. Let me explain my reasoning.
The first question anyone analysing the topic of drought in Somalia will see themselves confronted with is: ‘What actually constitutes a drought in the climatic and geographic context of Somalia?’ According to the Köppen climate classification, Somalia falls under two categories: the hot desert climate (BWh) and the hot semi-arid climate (BSh). Altogether there are four seasons, of which two normally come with rain.
Had it not been for climate change, which has brought more frequent extreme weather conditions and droughts, Somalia would regularly enjoy the Gu’ (spring) rain season that runs from April to June and the Deyr (autumn) rain season that runs from October to December.
As the above graphic shows, these patterns are clearly visible. April/May and Oct/Nov rainfall is, on average, a lot more pronounced than is the case for the rest of the year. But that’s not all one could infer. Particularly astonishing is the year 1997 in the third and last bar chart, the peak of the entire 38-year period. It is distinguished by a rainfall amount above 600mm.
Somalia should normally receive around 400–600mm of rain in good years. This, however, seems to have become a rather rare occurrence: only 7 out of the 38 years boast such amounts! Another striking observation is the dip registered in the year 2017, which went down in history as one of the worst droughts Somalis have ever witnessed, culminating in a humanitarian crisis and famine. Fortunately, it garnered enough international attention and relief funds to prevent acute malnourishment from escalating to widespread starvation and death.
Bearing in mind that prolonged periods of inadequate rains cause so much hardship, one wonders what countries can do about it. This is why I created the boolean variable ‘Drought’ shown previously, so as to build a model which predicts the occurrence of drought. The logic I’ve used is pretty simple, as is evident from the last code snippet.
I first generated an intermediate column called ‘Rain_Season’, which is 1 whenever the date column reads April, May, June, or October, November, December, according to the two rain seasons that should normally transpire. Then, every time the rainfall is below average and ‘Rain_Season’ is 1, the ‘Drought’ column is set to 1, the boolean code for the occurrence of drought. Although this may sound sensible, I admit it would be much better if these decisions were informed by weather specialists who know exactly at which rainfall amount (in mm) the threshold should be set, ideally localised at district level.
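The labelling rule just described can be sketched as follows. The five observations are invented, and using the overall mean of the series as the ‘below average’ threshold is an assumption rather than a confirmed detail of the original code.

```python
import pandas as pd

# Five made-up weekly observations (mm), for illustration only.
weekly = pd.DataFrame(
    {"Rainfall": [0.0, 2.0, 30.0, 0.0, 0.0]},
    index=pd.to_datetime(
        ["2020-01-12", "2020-04-12", "2020-05-10", "2020-08-09", "2020-11-08"]
    ),
)

# Gu' runs from April to June, Deyr from October to December.
RAIN_MONTHS = [4, 5, 6, 10, 11, 12]
weekly["Rain_Season"] = weekly.index.month.isin(RAIN_MONTHS).astype(int)

# Assumption: 'below average' means below the mean of the whole series.
below_average = weekly["Rainfall"] < weekly["Rainfall"].mean()

# A week is flagged as drought only if it is dry AND falls in a rain season.
weekly["Drought"] = (below_average & (weekly["Rain_Season"] == 1)).astype(int)
print(weekly)
```

Note that the dry weeks in January and August are not flagged, because no rain is expected then in the first place.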
But at any rate, let’s now turn to another important angle of exploratory analysis. Answering the question I’ve raised on what countries could do, one is well-advised to employ concepts of statistics. One such concept is that of autocorrelation: how past patterns explain present realities. For this we import two handy functions from the statistical Python library Statsmodels.
Visually inspecting the autocorrelation function of weekly rainfall in Somalia is useful in that it allows us to understand how the past and present are intertwined. In this case, its shape resembles an exponentially decaying curve, which means that rainfall is moderately correlated with its own recent past, while the strength of correlation decreases with the number of lags we consider.
Examining the partial autocorrelation function, which measures the correlation at each lag after the influence of the shorter lags has been removed, we see a strong correlation of rainfall only with the first and second lag. For those not familiar with the concepts of time series econometrics, this just means that the most recent weeks explain the presently observed reality to a great extent. It also implies that future values of rainfall depend on present values.
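To make the notion of autocorrelation concrete, here is a small sketch that computes the sample autocorrelation by hand with NumPy instead of relying on Statsmodels. The series is synthetic (a first-order autoregressive process with a coefficient of 0.7, chosen arbitrarily), not the Somalia rainfall data.

```python
import numpy as np


def sample_acf(x, nlags):
    """Sample autocorrelation of x for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    # Correlate the series with a copy of itself shifted by k steps.
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)]
    )


# Synthetic AR(1) series: each value is 0.7 times the previous one plus noise.
rng = np.random.default_rng(0)
noise = rng.normal(size=2000)
series = np.zeros(2000)
for t in range(1, 2000):
    series[t] = 0.7 * series[t - 1] + noise[t]

acf_values = sample_acf(series, nlags=5)
print(np.round(acf_values, 2))
```

The printed values should shrink steadily from lag to lag, which is the decaying shape described above. For real work, `acf` and `pacf` in `statsmodels.tsa.stattools` return the same kind of numbers along with the partial autocorrelations.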
Feature Engineering
Proceeding now to the next phase of our end-to-end project, we turn our energy to feature engineering. Though sometimes trivialised as a high-effort, low-yield activity, this critical step is commonly defined as: ‘to prepare, transform and extract useful information from raw input data.’ Indeed, a sage of the field once made the remarks below, strongly emphasising the importance of feature engineering.
The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering.
— Luca Massaron
So, let’s act upon this brilliant nugget of wisdom. Building on the knowledge that the past helps explain the future, which we’ve gained from autocorrelation analysis, let us now extract some additional time series features. The lagged values of rainfall that I decided to extract represent rainfall amounts in prior weeks, stored in separate columns for every point in time t. This should improve our drought prediction accuracy in the later modelling stage.
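A common way to build such lagged columns is Pandas `shift`, sketched below on six made-up weekly values. The helper name and the column naming scheme (‘Rainfall_t-1’ and so on) are hypothetical.

```python
import pandas as pd


def add_lags(df: pd.DataFrame, column: str = "Rainfall", n_lags: int = 4) -> pd.DataFrame:
    """Add one column per lag holding the value observed k weeks earlier."""
    out = df.copy()
    for k in range(1, n_lags + 1):
        out[f"{column}_t-{k}"] = out[column].shift(k)
    # The first n_lags rows have no complete history, so they are dropped.
    return out.dropna()


# Six made-up weekly totals (mm), for illustration only.
idx = pd.date_range("2020-01-05", periods=6, freq="W")
weekly = pd.DataFrame({"Rainfall": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

lagged = add_lags(weekly, n_lags=2)
print(lagged)
```

Each remaining row now carries its own recent history, which is what allows an ordinary supervised learning algorithm to work with a time series.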
The observant eye undoubtedly notices that our data table looks somewhat different from the raw data which we started with. Yet please rest assured, I haven’t resorted to some illegitimate black magic that’ll now give me the results I want (an accusation many data sceptics will throw at you).
I’ve just transformed the raw time series data into a form suitable for supervised machine learning. Put differently, given that a time series is nothing more than a sequence of data points with a time dimension, all one needs to ensure for algorithmic prediction is that we have a time variable which captures the cadence of this sequence.
In our case it was weekly cadence, and thus a look at the variable ‘Time_t’, against the backdrop of the date-time index, suffices for us to know that we’re still looking at the same weekly time series data. At this stage, however, we have a lot more detail captured in a combination of numeric and boolean variables. ‘How does this help in making drought emergency prediction?’ you’re probably wondering. Taking a look at the feature importances plot below will make things clearer.
To point out what’s inferable from the plot, we can state a few rather unsurprising things. Firstly, the rain season indicator correlates strongly with the occurrence of drought. This is partly by construction, since ‘Drought’ can only be 1 when ‘Rain_Season’ is 1, and partly because Somalia’s recent weather history has featured many failed rain seasons. So when the rain season variable equals 1, chances are high that rainfall falls short. Secondly, when the recorded rainfall is above the average (the red dashed line from before), i.e. ‘Avg_Rainfall_Diff’ is positive, we have a high likelihood of the opposite: rainfall! That’s why the correlation coefficient for that variable is negative, at close to -0.5. Pretty intuitive.
Resulting from this is the conclusion that these features correlate to a satisfactory extent with the target we want to predict. There is something else that’s significant here. Have a look at our lagged autocorrelation features. According to them, the most indicative sign of a looming drought is when 4 weeks have passed after a rain season has begun and no rain was recorded. And although it’s interesting to note that insight, that is, the lag of 4 weeks being the most strongly correlating one with drought, we must remember that correlation does not imply causation.
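The kind of feature-to-target correlation discussed here can be computed in a line or two of Pandas. The eight rows below are invented purely to show the mechanics, so the resulting numbers say nothing about Somalia.

```python
import pandas as pd

# Invented rows, for illustration only.
features = pd.DataFrame(
    {
        "Rain_Season": [0, 1, 1, 0, 1, 1, 0, 1],
        "Avg_Rainfall_Diff": [-3.0, -5.0, 12.0, -4.0, -6.0, 20.0, -2.0, -5.5],
        "Drought": [0, 1, 0, 0, 1, 0, 0, 1],
    }
)

# Pearson correlation of every feature with the target, sorted so the
# most negative and most positive relationships sit at the two ends.
target_corr = features.corr()["Drought"].drop("Drought").sort_values()
print(target_corr)
```

Plotting `target_corr` as a bar chart would give a picture of the same type as the one described above.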
Beyond that, it is good practice to examine not only the strength of correlation between features and target, but also between the features themselves. There are situations in which a strong negative or positive association between input variables hampers the ability of models to predict. In the chart below, let’s try to identify whether we’re affected by this phenomenon, known as multicollinearity in the realm of statistics. As a rule of thumb, an absolute Pearson correlation above 0.8 between two features is considered problematic.
Unlike the previous vertical bar chart, this one is a bit harder to read. But thankfully, we have the colour coding which enables us to easily spot strongly correlated features. Adjacent to the first column, the quadrant in the top left looks quite suspicious. There we can identify pairs of variables which exhibit a strong mutual correlation. They are ‘Weeknum’, ‘Quarter’ and ‘Month’.
Deleting two of them from our input data will not harm our ability to predict droughts, as their additional information gain is insignificant: all three contain the same seasonality information. It is as though you had two people in a meeting at the same level of hierarchy with the same amount of information. One of them is not really needed, as the other suffices to report what they both know.
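One way to automate this check is sketched below, using the 0.8 rule of thumb. The calendar columns are derived from a synthetic weekly index and the rainfall values are random, so only the mechanics carry over.

```python
import numpy as np
import pandas as pd

# Synthetic weekly index and random rainfall, for illustration only.
idx = pd.date_range("2020-01-06", periods=104, freq="W-MON")
X = pd.DataFrame(
    {
        "Weeknum": idx.isocalendar().week.astype(int).to_numpy(),
        "Month": idx.month.to_numpy(),
        "Quarter": idx.quarter.to_numpy(),
        "Rainfall": np.random.default_rng(1).gamma(2.0, 5.0, size=len(idx)),
    },
    index=idx,
)

corr = X.corr().abs()

# Keep only the upper triangle so that each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop every column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
X_reduced = X.drop(columns=to_drop)
print(to_drop)
```

Because of the upper-triangle trick, the first column of each correlated group survives and the later ones are removed, so the column order decides which representative is kept.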
Model Selection
Before finally embarking on our original goal of predicting drought, and selecting data science models, we have to split the data into a train set and test set. Remember that we also have to drop the redundant features we’ve identified in the previous step. As can be seen from the screenshot, I have selected the turbulent period 2015–21 as the test set, so as to ascertain whether or not the model would accurately predict in critical time periods. There are slightly fewer occurrences of drought in the test set, purely because the timeframe is much shorter than the train time period.
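For time series, the split has to respect chronology rather than shuffle rows at random. A minimal sketch with a placeholder frame, where only the 2015 cut-off is taken from the text:

```python
import numpy as np
import pandas as pd

# Placeholder weekly frame, for illustration only.
idx = pd.date_range("2010-01-03", "2021-12-26", freq="W")
data = pd.DataFrame(
    {"Rainfall": np.arange(len(idx), dtype=float), "Drought": 0}, index=idx
)

# Slicing a DatetimeIndex by year strings keeps the time order intact:
# everything up to the end of 2014 is used for training, 2015-2021 for testing.
train = data.loc[:"2014"]
test = data.loc["2015":"2021"]

X_train, y_train = train.drop(columns="Drought"), train["Drought"]
X_test, y_test = test.drop(columns="Drought"), test["Drought"]
print(len(train), len(test))
```

Splitting this way ensures the model never sees weeks that lie in the future relative to the ones it is tested on.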
Logistic Regression
In view of the fact that our exploratory analysis uncovered strong linearity in the feature-target relationship, it’s sensible to adopt a logistic regression approach to predict drought for Somalia. The below code snippet and results illustrate that a simple logistic model suffices for us to achieve good results.
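For readers who want to try this themselves, here is a self-contained sketch with scikit-learn. The data is randomly generated with a labelling rule similar to the one used earlier, so the accuracy it prints is not the figure reported in this article, and the hyperparameters are defaults rather than the original settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Randomly generated stand-in data, for illustration only.
rng = np.random.default_rng(42)
n = 400
rain_season = rng.integers(0, 2, size=n)
rainfall_diff = rng.normal(0.0, 10.0, size=n)

# Drought: below-average rainfall during a rain season.
drought = ((rain_season == 1) & (rainfall_diff < 0)).astype(int)
X = pd.DataFrame({"Rain_Season": rain_season, "Avg_Rainfall_Diff": rainfall_diff})

# Stand-in for a chronological split: first 300 rows train, last 100 test.
X_train, X_test = X.iloc[:300], X.iloc[300:]
y_train, y_test = drought[:300], drought[300:]

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

The fitted coefficients in `model.coef_` can be inspected afterwards to check that their signs agree with the correlations found during feature engineering.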
All things considered, the results look promising. We have achieved a 92% prediction accuracy on the test set. This suggests that historical rainfall data can be sufficient for a country to design a drought management system based on data science. Furthermore, imagine what we could do with more detailed and reliable data carrying geographic information such as longitude and latitude, district of rainfall, etc. I know I’ve mentioned that TAMSAT has such data for Africa, yet I found that the Somalia data was not very reliable in terms of geographic information. Some of the coordinates were actually locations in the neighbouring countries Kenya and Ethiopia.
But hold on, should accuracy be the only metric of performance to report? No. In fact, you should always be suspicious of data scientists who only communicate accuracy, especially when the target variable is imbalanced (one category outnumbers the other by far). For this particular use-case, the logistic regression model did in fact make some serious mistakes: false negatives. Looking at the top right quadrant you can see there were 8 instances, about 3 percent of data points in the test set, where the model predicted no drought even though those weeks were droughts. Generally speaking, false negatives can be quite dangerous in many applications. In Covid-19 prediction, for example, a false negative is the equivalent of declaring a patient virus-free when they actually carry the virus. In any case, I think we did the best we could with the data we had at our disposal.
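To see why accuracy alone can mislead, it helps to count the four outcomes by hand. The ten labels below are made up and unrelated to the results above.

```python
import numpy as np

# Made-up labels (1 = drought), for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # droughts caught
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # droughts missed
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # correct all-clears

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)      # share of real droughts that were detected
precision = tp / (tp + fp)   # share of drought warnings that were right

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

In this toy case the accuracy is 0.70, yet the recall is only 0.50: half of the droughts would have gone unannounced. For an alarm system, recall is arguably the number to watch.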
Policy Recommendations
Current trends in the world of technology have time and again proven one thing: data has the power to transform. It is, therefore, of utmost necessity for governments in both the developed and the developing world to utilise data for good. Yet a word of caution: so as to not make this article longer than it already is, I’m going to summarise the policy recommendations succinctly.
a) Somalia needs to establish a meteorological institute with the latest state-of-the-art equipment to observe weather phenomena such as drought cycles. This would enrich satellite-based data from global organisations and make the task of a data-driven drought prediction a lot easier. Ultimately, such efforts should culminate in a drought management system that gives amber or red warnings when certain localities are on the verge of humanitarian catastrophe.
b) Somalia needs to leverage its strong oral tradition and the collective memory of its population to document the history of droughts, their locations, their impacts in terms of loss of life and livestock, and their durations. It is equally important to upgrade data quality by categorising droughts into different levels based on frequency, severity and duration.
c) Somalia must, by any means possible, tackle the causes of extreme weather conditions. Illegal charcoal production, overgrazing, irresponsible use of water by business cartels, all these developments need to be inhibited. Truthfully, unless the present progress of deforestation is arrested, things won’t get any better anytime soon — even with data science capabilities.
d) Somalia should team up with international water technology specialists who’re currently world leaders in the development of rainwater harvesting and conservation systems. The little rain that falls tends to be lost in flash floods and unwatched rural landscapes. More could be done to leverage that in the fight against famine.
Many thanks for reading.
References
How to Convert a Time Series to a Supervised Learning Problem in Python. https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
Data for Good: A New Social Movement https://www.apexofinnovation.com/data-for-good-a-new-social-movement/
The Geology of Somalia by Robert Lee Hadden 2007.
10 ways Israel’s water expertise is helping the world https://www.israel21c.org/10-ways-israels-water-expertise-is-helping-the-world/
Fundamental Techniques of Feature Engineering for Machine Learning https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
Rapid Real-Time Review Somalia Drought Response 2018, Marc DuBois, Paul Harvey and Glyn Taylor.
Near Real-Time Disturbance Detection in Terrestrial Ecosystems Using Satellite Image Time Series: Drought Detection in Somalia 2011, Jan Verbesselt, Achim Zeileis, Martin Herold.
SPEI-based spatial and temporal evaluation of drought in Somalia 2020, Sylus Kipnegeno, Justine Muhoro Nyaga, Abdi Zeila Dubow.
John Tukey Exploratory Data Analysis, 1977.