Book categorization AI: Interview with the developer

We released our category recommendation AI last week, and it has been a huge success. To recap in case you missed it: Savant is able to “read” your content file (epub or docx) upon upload and guess the most relevant BISAC categories. Finding the most relevant BISAC categories is one of the most daunting tasks beginner authors face. There are several hundred available options, and it feels like you need to know them all to make a relevant choice. (Read more about BISAC categories here.) This is where Savant comes into the picture. But how does our book categorization AI work? We asked data scientist Mór Kapronczay to explain the background processes and tell us more about how he built this feature, which is truly unique in the ebook market.

Congratulations on your achievement! Where did the idea of a book categorization AI come from?

Thank you very much! Actually, the idea is not mine: I was hired for this particular job almost exactly a year ago. It was Róbert Csizmár, our CTO, who thought that an AI could be built for this purpose. I agreed and got to work.

What is your background? Are you a developer?

No, I am in fact an economist, with a specialization in empirical finance. I have a Finance & Accounting BA and a Finance MA with a specialization in Investment Analysis from Corvinus University of Budapest. I learned a lot of statistics and econometrics there. In addition, I had the opportunity to be part of Rajk College for Advanced Studies, where I could learn programming and algorithm design as well as Machine Learning. I instantly fell in love with both.

Mór Kapronczay at World Summit AI in Amsterdam

What exactly did you do? How did you train the bot?

The most important question we had to figure out was how to turn a book into a vector. The reason is that Machine Learning algorithms only understand numbers, but we wanted ours to understand a book.

Turn it into a vector? I’m not sure I understand…

Turning a book into a vector means turning it into a high-dimensional object. Practically, this means turning it into a set of numbers. Every number in this set describes one dimension of the book. As for what these dimensions are, I’d rather keep that a secret.
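To make the idea of "a book as a set of numbers" concrete, here is a minimal sketch. It is not Savant's actual representation, which the developer keeps secret. It assumes a simple bag-of-words approach in which each dimension is the relative frequency of one word; the `VOCABULARY` list and the `book_to_vector` function are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical vocabulary: each word is one dimension of the vector.
# Savant's real dimensions are not public; this list is an assumption.
VOCABULARY = ["dragon", "murder", "love", "recipe", "market"]


def book_to_vector(text):
    """Return one number per vocabulary word: its relative frequency in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for empty text
    return [counts[word] / total for word in VOCABULARY]


sample = "The dragon guarded the market. The dragon slept."
vector = book_to_vector(sample)
print(vector)  # one number per dimension, in VOCABULARY order
```

In this toy version a book with many dragons gets a large first number and a cookbook gets a large fourth number, which is all a learning algorithm needs in order to tell them apart.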

Thank you! Please, continue…

After finding the best vector representation, an algorithm was needed for training. Here I got extensive help from Gábor Gulyás, a very experienced Machine Learning engineer. Training meant running the algorithm on all of our books. This was really challenging on the technical side; I learned a lot during this process.

Great! Can you explain how it works so that even I can understand it?

Sure! The model tries to learn the similarities in the vector representations of books in the same category. Technically, this means that identically categorized books have similar vectors. So it checks a book’s vector representation and compares it with the vector representations of other books until it finds a match. Once it has been properly trained, if Savant is given a new book, it guesses the category according to what it saw during training. If the book it is given has a vector representation that is different from those of the books it has previously seen, it won’t be able to make a guess.
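The behavior described here can be sketched with a nearest-centroid classifier. This is an assumption made for illustration, not the algorithm Savant uses: each category is summarized by the average of its training vectors, a new book is matched to the most similar average, and no guess is made when nothing is similar enough. The function names, the example vectors, and the `min_similarity` threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labeled_vectors):
    """Average the vectors of each category into one centroid per category."""
    sums, counts = {}, {}
    for vector, category in labeled_vectors:
        if category not in sums:
            sums[category] = [0.0] * len(vector)
            counts[category] = 0
        sums[category] = [s + v for s, v in zip(sums[category], vector)]
        counts[category] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def guess_category(centroids, vector, min_similarity=0.8):
    """Return the best-matching category, or None if nothing is similar enough."""
    best_category, best_score = None, 0.0
    for category, centroid in centroids.items():
        score = cosine_similarity(vector, centroid)
        if score > best_score:
            best_category, best_score = category, score
    return best_category if best_score >= min_similarity else None


# Made-up three-dimensional book vectors with illustrative category labels.
training_books = [
    ([0.9, 0.1, 0.0], "FICTION / Fantasy / General"),
    ([0.8, 0.2, 0.0], "FICTION / Fantasy / General"),
    ([0.0, 0.1, 0.9], "COOKING / General"),
    ([0.1, 0.0, 0.8], "COOKING / General"),
]
centroids = train(training_books)

print(guess_category(centroids, [0.7, 0.3, 0.0]))  # resembles the fantasy books
print(guess_category(centroids, [0.0, 1.0, 0.0]))  # unlike anything seen: no guess
```

The threshold is what produces the "won't be able to make a guess" case: a book whose vector points in a direction no training book shared scores low against every category, so the function returns `None` instead of forcing a poor match.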

Will it become smarter by itself, or does it need your constant attention?

This is supervised learning, so it does not get smarter by itself. It needs to be trained to become smarter. For this, we need new data and different algorithms. Fortunately, our authors and publishers upload new books every day. As we get new books, and also feedback on the current guesses when authors accept or reject the suggestions, Savant has something new to learn from every single day.
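One way to picture this feedback loop is as a growing list of labeled examples. The sketch below is an assumption about how accepted and rejected suggestions could be turned into training data, not a description of PublishDrive's pipeline; `apply_feedback` and its parameters are hypothetical names.

```python
def apply_feedback(training_set, vector, suggested, accepted, corrected=None):
    """Return a new training set extended with one author's feedback.

    An accepted suggestion confirms the label. A rejected suggestion only
    yields a new example if the author supplied the category they chose instead.
    """
    updated = list(training_set)  # copy, so the original list stays untouched
    if accepted:
        updated.append((vector, suggested))
    elif corrected is not None:
        updated.append((vector, corrected))
    return updated


training_set = [([0.9, 0.1], "FICTION / Fantasy / General")]

# The author accepted the suggestion: it becomes a confirmed example.
training_set = apply_feedback(
    training_set, [0.8, 0.2], "FICTION / Fantasy / General", accepted=True
)

# The author rejected the suggestion and picked another category.
training_set = apply_feedback(
    training_set, [0.1, 0.9], "FICTION / Fantasy / General",
    accepted=False, corrected="COOKING / General",
)

print(len(training_set))  # the model must still be retrained on this list
```

Collecting feedback does not change the model by itself, which is the point of the answer above: the model only improves once someone retrains it on the extended data.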

What was the most challenging part of the work? Did you face any unforeseen difficulties?

In the academic sense, creating the structure of the training algorithm was the most interesting and biggest challenge. On the technical side, I found almost everything challenging. If I look at the Python code I wrote recently, I know that I wouldn’t have been able to understand it a year ago. And this is just Python; I also had to learn how to make different servers communicate with each other, and how to use Linux from the command line. As an economics student, I was not used to this. I truly learned a lot.

What was the most rewarding part of developing this AI?

Seeing an AI work in practice that is almost entirely your work is the most rewarding thing I can think of. Of course, it can only fulfill its role if our authors and publishers find it useful. The first feedback is mainly positive, but we have already noticed areas where Savant should be improved.

What’s next? Are you allowed to tell us your next AI project?

This project is not finished yet: Savant has to get smarter and smarter. Our next big project should be just as big as Savant, so I don’t want to spoil it. Let it be a surprise!

Savant is the brainchild of Róbert Csizmár and the work of Mór Kapronczay, Gábor Gulyás, and Zsófia Dedinszky. Interested in trying it? Simply sign up or log in to your PublishDrive account, and upload a book.