8 October 2023
By Katja Hemmerich
This week, the Executive Boards of UNESCO, UNHCR and WFP will all engage with evaluations undertaken by their respective organizations. UNHCR presents an annual report summarizing the key findings of evaluations undertaken in 2022, as well as a larger analysis of evaluation findings related to accountability to affected people from 2018-2022. UNESCO presents a report outlining its analysis of cross-cutting issues arising from its 2022 evaluations. And WFP has organized a two-day Roundtable to facilitate better engagement by Board members with the 12 recently completed country evaluations. Evaluations provide an overwhelming amount of useful data, but that also makes them hard to analyze and use for broader organizational-level learning.
That is why one short paragraph in the UNHCR evaluation report is potentially quite revolutionary:
“In order to synthesize findings and more efficiently distill learning from a large number of independent evaluations, the Evaluation Office has been testing artificial intelligence tools. It is exploring the use of language learning models for synthesis work as well as for comparative and thematic analyses of existing evaluation reports.” - UNHCR, Report on evaluation (A/AC.96/74/9)
Earlier this year, a group of researchers from universities in Austria, Germany and Sweden developed a new methodology to use AI to analyze 1,082 evaluation reports from nine UN agencies to understand agency performance. Their study illustrates the power of AI for organizational learning from evaluations.
Our spotlight this week explores how AI provides analytical power that is greater than that of any human team. We also provide practical tips on how UN staff can harness this power for their organizations - and this is where the unique expertise of humans is key.
The UN Evaluation Group (UNEG) sets the standards for evaluation across the UN system to help facilitate comparative analysis, as well as collaboration. But evaluation reports are often hundreds of pages long. They also have different levels of focus: some cover a single country or project, while others cover themes or policies at the global level. All of these factors make comparative analysis across multiple evaluations quite complicated and extremely labour-intensive. UNESCO's evaluation report tries to identify and highlight cross-cutting issues, but is limited to examining 36 of its evaluations from 2022 and early 2023, likely because of the workload. Organizational learning from evaluations is therefore usually limited to a particular evaluation, topic or year - and, as so often in the UN, siloed. Organizational learning across the UN system is virtually impossible unless organizations are intentionally collaborating on evaluations and searching out shared information and insights.
That is why the ability of researchers to analyze 1,082 evaluations undertaken by nine different UN agencies from 2012 to 2021 is so powerful. Those documents comprised almost 1 million sentences, which would have taken an army of humans to analyze in the same period of time. If UNHCR's evaluation unit is able to harness the power of AI, it will increase its analytical capacity immensely - far more than it ever could by trying to fundraise for additional posts.
Understanding how these European researchers developed their analysis provides helpful insights for those initiating and managing projects to apply AI for text analysis. The first issue is to be clear about what you want to analyze. In this case, the researchers wanted to see if the evaluation reports could be used as a tool to understand the performance of UN entities, i.e. do the evaluation reports indicate particular projects, programmes or thematic areas where UN entities perform particularly well or where they struggle. This means they wanted to differentiate between positive and negative evaluations.
This is where the terms ‘artificial intelligence’ and ‘machine learning’ can be somewhat misleading, as they seem to imply you can just let the machine loose on 1,082 documents and tell it to sort them into positive versus negative reports. But the machine actually needs to learn how to do that and what we as humans deem as negative or positive. This is where the UN - with its diplomatic language - often creates an extra complication. Standard machine learning dictionaries are constantly being built to help AI tools analyze text, but very few of these can work with the UN’s highly nuanced language. But if you’re aware of this problem, it can be addressed, as these researchers have demonstrated.
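To illustrate the gap, the sketch below applies NLTK's generic VADER sentiment lexicon - not a tool used in the UNHCR or university work - to an invented, typically diplomatic sentence. Both the sentence and the expectation about how it is scored are assumptions for illustration only.

```python
# Minimal sketch: a generic, off-the-shelf sentiment lexicon applied to
# diplomatic UN-style phrasing. Requires: pip install nltk
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-off lexicon download
analyzer = SentimentIntensityAnalyzer()

# Illustrative sentence: a UN reader would recognize this as criticism
sentence = "The programme would benefit from a more systematic approach to monitoring."
print(analyzer.polarity_scores(sentence))
# A generic lexicon may read words like "benefit" and "systematic" as
# positive and miss the implied criticism - the gap a purpose-built
# codebook is meant to close.
```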
The human researchers developed a specific codebook for the AI tool using language from UN evaluation reports to teach the machine what to look for. The starting premise of the research group was that an evaluation report that demonstrates good performance will have a majority of positive sentences and a report indicating poor performance will have a majority of negative sentences. Three different people then analyzed 180 executive summaries of UN evaluation reports and coded each sentence as positive, negative or neutral. The three humans met weekly to compare their results and explain their reasoning when their results differed so that they had a shared understanding, which could be captured in the AI codebook.
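That inter-coder comparison can also be tracked quantitatively. The sketch below is a minimal illustration, assuming three hypothetical coders and scikit-learn's Cohen's kappa (the study does not publish its reconciliation code), of how disagreements could be flagged for the weekly discussions.

```python
# Minimal sketch: pairwise agreement between three hypothetical coders
# labelling the same sentences as positive / neutral / negative.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

coder_labels = {  # illustrative labels, not data from the study
    "coder_a": ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "pos", "neg", "neu"],
    "coder_b": ["pos", "neg", "neu", "pos", "neu", "neu", "pos", "neg", "neg", "neu"],
    "coder_c": ["pos", "neg", "pos", "pos", "neg", "neu", "pos", "pos", "neg", "neg"],
}

# Low kappa between a pair signals sentences to revisit in the weekly
# reconciliation meeting before the codebook is finalized.
for (name_a, a), (name_b, b) in combinations(coder_labels.items(), 2):
    print(f"{name_a} vs {name_b}: kappa = {cohen_kappa_score(a, b):.2f}")
```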
This coding and codebook-development stage is where the engagement of a UN practitioner with relevant programmatic understanding and experience with UN language can be an important factor in ensuring the quality and speed of the process. Consequently, AI projects need to be outsourced carefully. External AI expertise is highly valuable, but you should check whether your vendor has experience with UN or similar diplomatic language. Even with that experience, you should plan to provide appropriate practitioner support to help 'teach' the AI tool how to read the UN documents you want analyzed. In short, these kinds of projects cannot simply be delegated to junior data scientists and ignored. They need collaborative engagement and support from subject matter experts or practitioners with experience in drafting similar types of documents.
The final codebook developed by the university researchers then allowed their natural language processing tool BERT (Bidirectional Encoder Representations from Transformers, originally developed by Google in 2018) to be sufficiently fine-tuned to classify individual sentences in all of the 1,082 evaluation reports as negative, positive or neutral. Once the neutral sentences were removed, the reports could be categorized as indicating positive or negative performance based on the proportion of related sentences that remained. While this seems fairly simple, the graph below demonstrates some of the study's analytical value. The consistent spread of positive and negative evaluations each year adds credibility to the objectivity of the evaluation function in all the agencies. It also provides interesting insights into which organizations invest more in evaluations each year, and how quickly that investment grew in UNICEF and UNDP in particular.
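For readers curious what the classification and report-scoring step might look like in practice, the sketch below is an illustration under stated assumptions, not the researchers' actual pipeline: it applies an already fine-tuned three-class model (the checkpoint path and label names are placeholders) to a few invented sentences and computes the share of positive sentences once neutral ones are dropped.

```python
# Minimal sketch: score one report by the share of positive sentences.
# Requires: pip install transformers torch
from transformers import pipeline

# Placeholder path - in practice this would be a BERT checkpoint
# fine-tuned on the coded sentences described above.
classifier = pipeline("text-classification", model="path/to/finetuned-bert")

sentences = [  # invented examples, not text from an actual evaluation
    "The project delivered all planned outputs on time.",
    "Coordination with national partners remained weak.",
    "The evaluation covered the period 2018 to 2021.",
]

labels = [result["label"] for result in classifier(sentences)]

# Drop neutral sentences, then compute the positive share of what remains
# (label names depend on how the fine-tuned model was configured).
non_neutral = [label for label in labels if label != "neutral"]
positive_share = (
    sum(label == "positive" for label in non_neutral) / len(non_neutral)
    if non_neutral
    else None
)
print(f"Positive share for this report: {positive_share}")
```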
The next step undertaken by the researchers highlights another best practice for those engaging with machine learning analyses. Because AI tools and methodologies are relatively new, and they have been shown to easily adopt biases inherent in the data they use or the humans who ‘teach’ them, it is important to validate your methodology and findings. Accordingly, any AI analysis project plan should include this step and allocate sufficient time.
In this case, the researchers validated their methods and findings in three different ways. First, they did their own spot check of the outliers, i.e. they read the full reports which BERT indicated were particularly positive or negative, and the human readers agreed with BERT's conclusions. Second, they applied their methodology to 661 World Bank evaluation reports, each of which includes a standard rating - from highly satisfactory to highly unsatisfactory - given by the evaluators when they complete the report. The comparison demonstrated a consistently positive relationship between BERT's assessment and that of the human evaluators. Third, they checked their findings against theoretical predictions. Management theory holds that projects are easier to manage successfully than programmes, because projects have clearer deliverables and outputs than programmes, which aim for results and more complex societal or behavioral change. BERT's analysis was consistent with the theory, demonstrating that:
“Program level activities contain on average 53% positive assessments, whereas projects contain on average 59% positive assessments.” - Eckhard et al., "The performance of international organizations: a new measure and dataset based on computational text analysis of evaluation reports", 2023
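Of the three checks, the comparison with World Bank ratings lends itself to a simple statistical test. The sketch below is illustrative only, with invented numbers: it correlates a hypothetical human rating (1 = highly unsatisfactory to 6 = highly satisfactory) with the model's positive-sentence share for the same reports.

```python
# Minimal sketch: validate the text-based measure against human ratings.
import pandas as pd
from scipy.stats import spearmanr

reports = pd.DataFrame({  # invented values for illustration
    "human_rating": [2, 3, 3, 4, 5, 5, 6, 1, 4, 6],
    "positive_share": [0.31, 0.42, 0.40, 0.55, 0.63, 0.66, 0.74, 0.25, 0.52, 0.71],
})

rho, p_value = spearmanr(reports["human_rating"], reports["positive_share"])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A consistently positive correlation is what gives confidence that the
# sentence-level measure tracks the evaluators' own judgement.
```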
The analysis of UN evaluations undertaken by Prof. Eckhard and his team assessed a relatively simple parameter, negative or positive performance. But in doing so it has illustrated how powerful learning from evaluations can be.
Figure 1 above provides useful information to share with senior leaders and Executive Board members to demonstrate an agency's investment in evaluation vis-à-vis other UN agencies, funds and programmes. The visual depiction showing that evaluation results are consistently a mix of both positive and negative is another way to demonstrate the objectivity and credibility of independent evaluation units in the UN - another issue that senior management and Board members care about. The fact that programme-level evaluations consistently show weaker performance than project evaluations across the UN system provides quantitative data to guide learning investments. The UN should invest more in management training and support focused on programming skills than in project management training - but it should not abandon project management training, which is clearly still needed if 41% of evaluated projects are struggling.
Tips for AI text analysis projects in the UN
So much more can be gleaned from further analyses, and we look forward to seeing how the UNHCR Evaluation Office approaches this. For others in the UN considering using AI for text analyses, we offer the following tips:
Know what you are trying to measure or analyze and be able to articulate it clearly.
Seek external help from those who have experience in dealing with the nuanced language in the UN or similar diplomatic organizations.
Ensure that you have someone with programmatic and/or drafting experience in the UN to help the data scientists develop the AI codebook or dictionary. Build that into the project plan.
Ensure the project plan includes at least one method, if not more, to validate the findings and methodology of the AI tool, and build that step and sufficient time into the plan.