Automatic classification for efficient records management at the International Committee of the Red Cross
ELCA has successfully tested a machine learning approach to document classification to make the ICRC's records management more efficient.
The information management unit of the International Committee of the Red Cross (ICRC) manually categorizes thousands of emails per month for records management and archiving purposes. Using machine learning, ELCA has successfully designed, tested and challenged several algorithms to classify these records automatically. The best-performing algorithm was based on fastText. This project is the first of several proofs of concept to be conducted by the ICRC in 2020 in this area.
Categorization is performed according to a taxonomy of more than 2,800 possible values that describe business functions, tasks, activities and countries.
This task is time-consuming and sometimes error-prone. The ICRC therefore wanted to explore the potential of machine learning, and in particular how automatic classification algorithms would fare compared with human judgment. ELCA received about 6,000 previously categorized emails, against which it evaluated combinations of algorithms and textual representations.
Over two months, ELCA identified several approaches that exceed 90% accuracy and, in some cases, reach 98%. These approaches use state-of-the-art techniques for representing textual semantics, such as word embeddings (compact representations of a word’s meaning given its context), and some of the latest supervised learning algorithms, such as XGBoost and fastText.
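As a rough illustration of how a fastText-based classifier could be set up, the sketch below converts categorized emails into the `__label__` line format that the open-source fastText library expects for supervised training, then trains a model and queries it. The function names, the file path argument and the hyperparameter values are assumptions chosen for illustration; they are not the configuration used in the project.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and collapse all whitespace, including newlines."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def to_fasttext_line(category: str, text: str) -> str:
    """Format one labelled email as a fastText supervised-training line."""
    # Labels must not contain whitespace, so replace it with underscores.
    label = re.sub(r"\s+", "_", category.strip())
    return f"__label__{label} {normalize(text)}"


def write_training_file(emails, path):
    """Write an iterable of (category, text) pairs, one example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for category, text in emails:
            handle.write(to_fasttext_line(category, text) + "\n")


def train_and_predict(train_path, text):
    """Train a classifier and return (label, probability) for one text.

    Needs the third-party `fasttext` package. The hyperparameters below
    are illustrative guesses, not tuned values.
    """
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.train_supervised(
        input=train_path, epoch=25, lr=0.5, wordNgrams=2
    )
    labels, probabilities = model.predict(normalize(text))
    return labels[0], float(probabilities[0])
```

In practice, part of the categorized emails would be held out from training so that the predicted categories can be compared with the human-assigned ones.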
The ICRC’s strict data confidentiality policies are met, since all approaches use open-source software that runs fully on premises and on commodity hardware. From an architectural viewpoint, the solution, implemented in Python, is extremely lightweight and could easily be integrated into existing data pipelines, for instance in the form of a web service.
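To give an idea of what such a web service might look like, here is a minimal sketch that uses only the Python standard library. The JSON field names, the default host and port, and the `classify` callable (any function that returns a category and a confidence score for a text) are hypothetical and do not describe the actual integration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify_request(raw_body: bytes, classify) -> bytes:
    """Decode a JSON request, run the classifier, and encode a JSON reply."""
    payload = json.loads(raw_body.decode("utf-8"))
    category, confidence = classify(payload["text"])
    reply = {"category": category, "confidence": confidence}
    return json.dumps(reply).encode("utf-8")


def make_handler(classify):
    """Build a request handler class bound to the given classifier."""

    class ClassifyHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                reply = classify_request(self.rfile.read(length), classify)
                status = 200
            except (KeyError, TypeError, ValueError):
                # Malformed JSON or a missing "text" field.
                error = {"error": "expected a JSON object with a 'text' field"}
                reply = json.dumps(error).encode("utf-8")
                status = 400
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(reply)))
            self.end_headers()
            self.wfile.write(reply)

    return ClassifyHandler


def serve(classify, host="127.0.0.1", port=8080):
    """Serve classification requests on the local machine until interrupted."""
    HTTPServer((host, port), make_handler(classify)).serve_forever()
```

Binding to the local interface by default keeps the service on the machine that runs it, in line with the on-premises constraint described above.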
Having reached near-human classification accuracy on its production data, the ICRC can now confidently consider machine learning as a valuable part of its records management strategy.