Automatic classification for efficient records management at the International Committee of the Red Cross

ELCA has successfully tested a machine learning approach to document classification to make the ICRC's records management more efficient.

25.02.2019

The information management unit of the International Committee of the Red Cross manually categorizes thousands of emails per month for records management and archiving purposes. Using machine learning, ELCA has successfully designed, tested and challenged several algorithms to automatically classify these records. The winning algorithm was based on fastText. This project is the first of several proofs of concept to be conducted by the ICRC in 2020 in this area.

by Amina Chebira
AI & Big Data expert at ELCA

The information management unit of the ICRC manually categorizes thousands of emails per month for records management and archiving purposes. Categorization is performed according to a taxonomy spanning over 2,800 possible values that describe business functions, tasks, activities and countries.

This is a very time-consuming and sometimes error-prone task. The ICRC was therefore interested in exploring the potential of machine learning, and in particular how automatic classification algorithms would fare compared with human judgment. The ICRC provided ELCA with about 6,000 previously categorized emails, against which combinations of algorithms and textual representations could be compared.
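To make the comparison with human judgment concrete, here is a minimal sketch of how predictions could be scored against the human-assigned categories. The category names and the helper functions `accuracy` and `most_common_errors` are hypothetical and chosen for illustration; the article does not describe the evaluation code that was actually used.

```python
from collections import Counter


def accuracy(human_labels, predicted_labels):
    """Fraction of records where the prediction matches the human category."""
    if len(human_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("no records to evaluate")
    matches = sum(h == p for h, p in zip(human_labels, predicted_labels))
    return matches / len(human_labels)


def most_common_errors(human_labels, predicted_labels, top=3):
    """Most frequent (human, predicted) disagreements, to spot confusable categories."""
    errors = Counter(
        (h, p) for h, p in zip(human_labels, predicted_labels) if h != p
    )
    return errors.most_common(top)


# Hypothetical category names, for illustration only.
human = ["finance", "logistics", "hr", "finance", "logistics"]
predicted = ["finance", "logistics", "finance", "finance", "hr"]

print(accuracy(human, predicted))            # 3 of 5 predictions match
print(most_common_errors(human, predicted))  # the two disagreements, each seen once
```

Because the human labels are themselves sometimes wrong, a disagreement list like this is also a cheap way to surface records worth a second look.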

Over two months, ELCA identified several approaches that exceed 90% accuracy and, in some cases, reach 98% accuracy. These approaches feature state-of-the-art techniques for representing textual semantics, such as word embeddings (compact representations of a word's meaning given its context), and some of the latest supervised learning algorithms, such as XGBoost and fastText.
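As an illustration of what a supervised text classifier of this kind consumes, the sketch below converts labeled records into the one-line-per-record `__label__` format that fastText's supervised mode reads. The records, category names and the helper `to_fasttext_line` are assumptions made for the example; the article does not describe the preprocessing ELCA applied.

```python
import re


def to_fasttext_line(label, text):
    """Format one record as '__label__<category> <normalised text>'."""
    # Labels must not contain whitespace, so replace it with underscores.
    safe_label = re.sub(r"\s+", "_", label.strip())
    # Lowercase and put spaces around punctuation so it becomes separate tokens.
    cleaned = re.sub(r"([.,!?()/:;])", r" \1 ", text.lower())
    # Collapse all whitespace (including newlines) so each record stays on one line.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return f"__label__{safe_label} {cleaned}"


# Hypothetical records; the real taxonomy values are not given in the article.
records = [
    ("Logistics Transport", "Convoy schedule:\nupdated for March."),
    ("Finance", "Invoice attached, please approve."),
]
lines = [to_fasttext_line(label, text) for label, text in records]
print("\n".join(lines))

# With the fasttext Python package installed, a file containing these lines
# could be passed to fasttext.train_supervised(input="train.txt").
# That step is left out here so the sketch runs with the standard library only.
```

fastText learns the word embeddings and the classifier together from such a file, which is one reason it is fast to train on commodity hardware.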

 

The ICRC's strict data confidentiality policies are met, since all approaches use open-source software that runs fully on premises and on commodity hardware. From an architectural viewpoint, the solution, implemented in Python, is extremely lightweight and could easily be integrated into pre-existing data pipelines, for instance as a web service.
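A web service wrapper of this kind can be sketched with nothing but the Python standard library. Everything below is an assumption for illustration: the `classify` placeholder stands in for a trained model, and the JSON request shape, the `handle` and `serve` helpers and the port are hypothetical rather than taken from the ICRC project.

```python
import json
from wsgiref.simple_server import make_server


def classify(text):
    """Placeholder for the trained model: returns (category, confidence)."""
    # Hypothetical rule standing in for a real model's prediction.
    if "invoice" in text.lower():
        return "finance", 0.9
    return "unclassified", 0.0


def handle(raw_body):
    """Turn a raw JSON request body into (HTTP status, response dict)."""
    try:
        text = json.loads(raw_body)["text"]
        if not isinstance(text, str):
            raise TypeError("text must be a string")
    except (ValueError, KeyError, TypeError):
        return "400 Bad Request", {"error": "expected JSON object with a string 'text' field"}
    category, confidence = classify(text)
    return "200 OK", {"category": category, "confidence": confidence}


def app(environ, start_response):
    """WSGI entry point: accepts POST requests carrying {"text": "..."}."""
    if environ.get("REQUEST_METHOD") != "POST":
        status, result = "405 Method Not Allowed", {"error": "use POST"}
    else:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        status, result = handle(environ["wsgi.input"].read(length))
    body = json.dumps(result).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(host="127.0.0.1", port=8000):
    """Run the service; binding to localhost keeps traffic on the machine."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Calling `serve()` starts the service. Keeping the request logic in `handle` separate from the WSGI plumbing means the same function could be reused inside a batch pipeline without any HTTP layer.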

 

Having reached near-human classification accuracy on its production data, the ICRC can now confidently consider machine learning as a valuable part of its records management strategy.   
