fast.ai in production. Real-world text classification with ULMFiT.


I’ve overcome my skepticism about fast.ai for production and trained a text classification system with ULMFiT — for a non-English language, with a small dataset and lots of classes.

About the project

My friend and classmate, one of the founders of RocketBank (a leading online-only bank in Russia), asked me to develop a classifier to help the first line of customer support.

The initial set of constraints was pretty restrictive:

  • no labeled historical data
  • obfuscated personally identifiable information
  • “we want it yesterday”
  • mostly Russian language (hard to find a pre-trained model)
  • no access to cloud solutions due to privacy regulations
  • ability to retrain the model if new classes arise without my involvement

The scope of work was pretty straightforward: develop a model and a serving solution to classify incoming messages into 25 classes.

Initially, after thinking about the restrictions, I was pretty sure that no neural networks should be used for this task. Why?

  • We would never label enough data for a neural model in a reasonable time.
  • Building an environment for reliably serving a neural model is a pain.
  • I was skeptical about reaching the requested performance (requests per second) with reasonable resources.

Based on that, I made a sad face and created a new conda environment. Classic me.

Dataset collection and initial research

RocketBank set up a task force consisting of a project manager and a DevOps engineer on their side, plus a group of people handling dataset labeling. This was extremely smart and helpful; in my opinion, it is the perfect team for running a data science project in the industrial world.

We started with analyzing historical data and came to a number of conclusions:

  • To train the system, we take into account only messages received by the bank before any response from customer support.
  • There are two distinct meta-classes of incoming messages: those from existing customers and those from new leads. Feeding this information into the classifier should provide an additional signal and boost classification scores.
  • The bank decided on 25 distinct classes of messages, ranging from ‘Credit request’ up to ‘Harassment’.

I requested around 25,000 unlabeled historical messages, and in just a few days the task force was able to classify around 1,500 of them into the 25 classes. Initially, I assumed that this number (1.5k) was too low to even try a neural model (I was wrong).


I will fast-forward through the uninteresting part. I decided to test various flavors of TF-IDF and embeddings, and to optimize the machine learning model using TPOT and DEAP.

TPOT and DEAP, for those unaware, are two secret weapons in a data scientist’s arsenal that make model search CPU-intensive and hands-free.

TPOT runs through the whole stack of machine learning methods in sklearn, plus a few extras (XGBoost), and finds the optimal pipeline. I played around with various embeddings, fed them into TPOT, and after 24 hours it told me that Random Forest performs best on my data (ha-ha, what a surprise!).
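I can’t share the real pipeline, but the flavor of the winning baseline (TF-IDF features feeding a Random Forest) can be sketched with plain sklearn. The toy messages and labels below are made up for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real (private) messages and class labels.
texts = [
    "I want a credit card",
    "my card was blocked",
    "how do I open a deposit",
    "credit limit increase please",
    "card blocked after travel",
    "deposit interest rate question",
]
labels = ["credit", "card", "deposit", "credit", "card", "deposit"]

# TF-IDF over word n-grams feeding a Random Forest -- the shape of
# pipeline TPOT converged on for the 1.5k-message dataset.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(texts, labels)
print(clf.predict(["my card is blocked"])[0])
```

In practice, TPOT searched over many such pipelines automatically; this is just the one it settled on.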

Then I needed to find an optimal set of hyperparameters. I always do this with directed evolution strategies using the DEAP library; that actually deserves a separate post.
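DEAP itself deserves that separate post, but the idea can be shown with a tiny hand-rolled (1+λ) evolution strategy over a single hyperparameter. The loss surface here is made up; the real objective was the cross-validated score of the full pipeline:

```python
import random

random.seed(42)

def loss(max_depth):
    # Made-up smooth loss standing in for cross-validated model error.
    return (max_depth - 12) ** 2

# (1+lambda) evolution strategy: mutate the parent, keep the best child
# if it improves on the parent.
parent = 3
for generation in range(30):
    children = [max(1, parent + random.randint(-3, 3)) for _ in range(8)]
    best = min(children, key=loss)
    if loss(best) < loss(parent):
        parent = best

print(parent)  # should land near the optimum at max_depth = 12
```

With DEAP, the same loop runs over the whole Random Forest parameter set at once, with proper crossover and selection operators.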

Anyway, at the end of the day I got an optimal set of settings, and my precision was around 63%. I think this was close to the maximum I could get from classical methods and a 1.5k dataset. While 63% on 25 classes sounds good from a machine learning perspective, it’s quite bad for real-world usage. So I decided to take a look at neural nets as a last chance. Enter fast.ai.

So, I needed a fast way to check the performance of a neural model on the same task. While implementing a model from scratch in TensorFlow was a viable option, I decided to run a quick test with fast.ai and their recent discovery, ULMFiT. The problem: I needed a pretrained language model for Russian text, which isn’t available in fast.ai out of the box. Looking through the forums, I discovered an ongoing effort to create a set of language models for most languages. There was a thread for the Russian language with a pre-trained model from the Russian Kaggler Pavel Pleskov, which he had used to take second place in a Yandex competition.

From there it was mostly 20 lines of code and a few hours of GPU training time to get to 70% precision. After a few more days of tuning hyperparameters, I got to 80% precision. Some tips:

  • Use FocalLoss as the training objective.
  • Start from a pretrained language model, but fine-tune it on whatever unlabeled data is available.
  • Convert text to lowercase with a special token marking uppercase words, and use a special token for pieces of obfuscated data.
  • Put the meta-class token not only at the beginning but also at the end of the message.
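The preprocessing tips above can be sketched roughly like this; the token names (xxup, xxpii, and the meta-class markers) are my illustrative choices, not necessarily the exact ones used in production:

```python
import re

def preprocess(text, is_customer):
    # Mark uppercase words with a special token, then lowercase everything.
    text = re.sub(r"\b([A-ZА-Я]{2,})\b",
                  lambda m: "xxup " + m.group(1).lower(), text)
    text = text.lower()
    # Collapse obfuscated PII placeholders (e.g. ####) into one token.
    text = re.sub(r"#{2,}", "xxpii", text)
    # Put the meta-class token at both the beginning and the end.
    meta = "xxcust" if is_customer else "xxlead"
    return f"{meta} {text} {meta}"

print(preprocess("My card #### was BLOCKED", is_customer=True))
# -> xxcust my card xxpii was xxup blocked xxcust
```

The language model then learns these tokens like any other word, so no information about casing, PII, or meta-class is lost by normalization.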


Ok, great. Should I convert the model to TensorFlow? Nope. I was lazy and decided to test model performance using native fast.ai + PyTorch + Docker.

After running stress tests in a single-core Docker container, I was surprised to see less than 300 milliseconds of response time for an average request, and no crashes. What else did I need? Nothing. fast.ai showed itself to be a perfect solution for fast and precise development of production ML systems.
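A rough version of that latency check, with a stubbed predict function standing in for the real ULMFiT model behind the Docker endpoint:

```python
import statistics
import time

def predict(message):
    # Stub standing in for the real ULMFiT forward pass.
    time.sleep(0.001)
    return "credit_request"

# Time repeated calls and report average and 95th-percentile latency.
latencies = []
for _ in range(100):
    start = time.perf_counter()
    predict("hello, I would like a credit card")
    latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

print(f"avg: {statistics.mean(latencies):.1f} ms, "
      f"p95: {sorted(latencies)[94]:.1f} ms")
```

The real stress test hit the HTTP endpoint from outside the container, but the measurement idea is the same: average and tail latency under sustained load.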


The beauty of fast.ai + transfer learning is that retraining is quite predictable in both quality and speed. I shipped a script inside the Docker container that copies my final training notebook and produces a new model as an asset. I ran a few cycles of retraining and cross-validation and obtained highly repeatable results, so this is a simple way to deliver not only a model but a training script as well.
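The repeatability check can be approximated like this: run several train/evaluate cycles with different seeds and look at the spread of scores. sklearn and synthetic data stand in here for the real fast.ai training script and dataset:

```python
import statistics

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the labeled message dataset.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)

# "Retrain" several times with different seeds and check that the
# cross-validated score is stable, i.e. retraining is repeatable.
scores = [
    cross_val_score(RandomForestClassifier(random_state=seed),
                    X, y, cv=3).mean()
    for seed in range(5)
]
print(f"mean: {statistics.mean(scores):.3f}, "
      f"stdev: {statistics.stdev(scores):.3f}")
```

A small standard deviation across retraining runs is what lets you hand the training script to the client with a clear conscience.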

I can’t share the actual code and system configs, but I am ready to answer any questions.

Clap if you liked it, and clap if you want more details on DEAP and TPOT.
