Natural Language Processing At Casai

The Data Science team at Casai is always looking for innovative ways to improve our relationship with our guests. In the past year alone, we exchanged over 90,000 invaluable messages with our guests, and we believe there are trends within those conversations that will help us improve our product. Unfortunately, reading through that many interactions would take any human too much time and effort, so we’ve developed Natural Language Processing (NLP) models. NLP is a set of techniques that lets computers process and analyze human language.


In this post, we discuss how we use NLP models to find trends within our conversations with guests, and how these models have led us to devise better products and operational processes, which have improved the overall Casai experience.

For this purpose, we’ve developed a machine learning model based on Latent Dirichlet Allocation (LDA), an unsupervised learning algorithm for Topic Modeling.

Topic Modeling refers to the task of identifying the topics present in a collection of documents. The model counts the words in each document and then groups them into topics by identifying patterns in which words appear together. For example, instead of reading through hundreds of app reviews, you can use Topic Modeling to cluster all reviews into two topics: positive and negative. Positive and negative reviews each tend to contain their own distinct set of words, which drives the grouping process.
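To make the word-counting idea concrete, here is a minimal sketch in plain Python. The reviews are invented for illustration, and the helper name bag_of_words is an assumption rather than anything from our codebase. It only shows the counting step a topic model starts from, not the topic model itself.

```python
from collections import Counter

# Hypothetical app reviews, invented for illustration only.
reviews = [
    "great app love the clean design",
    "love it great support",
    "terrible app crashes all the time",
    "crashes constantly terrible experience",
]

def bag_of_words(text):
    """Count how often each word occurs in one document."""
    return Counter(text.lower().split())

# One word-count table per review; a topic model works from counts like these.
counts = [bag_of_words(review) for review in reviews]

# Number of reviews each word appears in (each Counter yields a word once).
document_frequency = Counter(word for c in counts for word in c)

# Words shared by more than one review hint at common themes.
shared_words = sorted(w for w, n in document_frequency.items() if n > 1)
print(shared_words)
```

Words such as “the” show up among the shared words even though they carry little meaning, which is one reason stopwords are usually removed before modeling.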


This type of technology is common in the real world. Online libraries use LDA to recommend books based on a person’s past reading trends, while news providers use it to group articles based on their similarity.


Like other Machine Learning techniques, NLP models can be either supervised or unsupervised. Supervised models are trained on ‘labeled’ data, and unsupervised models on ‘unlabeled’ data. For labeled data, a human must manually ‘label’ each message in a conversation with a topic. An unsupervised model instead analyzes the unlabeled messages and lets the algorithm suggest topics on its own.

Francisco Antonio Rodríguez García, Data Scientist at Casai

The objective of the model we’ve implemented is to extract the main topics found in a series of messages collected from real conversations between guests and our Customer Experience team. For the training process alone, the model analyzed over 90,000 messages.


Before feeding the training messages to the algorithm, we needed to “pre-process” the conversations. This means:

  • Tokenization: Splitting the text into sentences, and then the sentences into words. Then, changing all words to lowercase letters and removing punctuation.
  • Removing all words shorter than three characters in length.
  • Removing all stopwords, which are the most common words in the English language, and thus offer the least information. Some examples include personal pronouns (“I”, “You”, etc.), or words like “with”, “for”, “at”, etc.
  • Lemmatization: Words in third person are changed to first person, and verbs in past and future tense are changed to present. For example, words like “Changes,” “Changing,” “Changer,” or “Changed,” will be replaced by “Change.”
  • Stemming: Words are reduced to their root form. For instance, “Change” will be replaced with “Chang.”

Here is an example of that preprocessing:

[Image: Natural language processing models at Casai]
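The preprocessing steps listed above can also be sketched in a few lines of Python. This is a simplified illustration, not our production pipeline. The stopword set is a tiny hand-picked assumption, and toy_stem is a hypothetical suffix-stripper standing in for a real lemmatizer and stemmer from an NLP library. It also tokenizes straight into words and skips the sentence-splitting step.

```python
import re

# Tiny illustrative stopword list (an assumption); real pipelines use a
# much longer list supplied by an NLP library.
STOPWORDS = {"the", "and", "you", "with", "for", "are", "was", "has", "have", "this", "that"}

# Suffixes removed by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "er", "s", "e")

def toy_stem(word):
    """Crude stand-in for lemmatizing and stemming: strip one common suffix."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(message):
    """Tokenize, lowercase, drop short words and stopwords, then stem."""
    # Lowercase and keep only runs of letters, which also drops punctuation.
    tokens = re.findall(r"[a-z]+", message.lower())
    # Remove words shorter than three characters.
    tokens = [t for t in tokens if len(t) >= 3]
    # Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [toy_stem(t) for t in tokens]

tokens = preprocess("Hi! The door code has changed, and I am changing the keys.")
print(tokens)
```

With this toy stemmer, “changed” and “changing” both collapse to “chang”, matching the stemming behavior described in the list. A real stemmer handles far more cases than these six suffixes.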

The model then produces groups of the words that most commonly appear under each topic. After several tests and some fine-tuning of our parameters, we saw the best results when generating twelve topics, which produced better clusters, with ten words each. For example, a topic about “Apartment Access” appeared, containing words like “Door”, “Key”, “Code”, “Guard”, “Open”, “Need”, “Access”, “Button”, “Follow”, “Enter”.
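For readers curious about what happens under the hood, below is a minimal from-scratch sketch of LDA fitted with collapsed Gibbs sampling, one standard way to train this kind of model. It is not our implementation. The function names, the hyperparameters alpha and beta, the iteration count, and the four toy documents are all assumptions made for illustration. In practice, a library implementation would be used, with twelve topics and ten words per topic instead of the two topics and three words shown here.

```python
import random
from collections import Counter

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic in each document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic overall
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Weight of each topic is proportional to
                # (doc-topic count + alpha) * (topic-word count + beta)
                # / (topic total + vocab_size * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

def top_words(topic_word, n):
    """Return up to n most frequent words for each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]

# Hypothetical preprocessed messages, invented for illustration.
docs = [
    ["door", "key", "code", "access"],
    ["key", "code", "door", "open"],
    ["towel", "kitchen", "extra", "request"],
    ["extra", "towel", "request", "baby"],
]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
topics = top_words(topic_word, 3)
for k, words in enumerate(topics):
    print(k, words)
```

The number of topics is a parameter we choose, not something the algorithm discovers, which is why finding the right value took several rounds of testing.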

[Image: NLP processing]

This model allows us to better detect the primary needs of our guests and the things they love most about their stay. We can then use this information to build on what guests already praise and to improve in the areas where we have fallen behind.

For instance, we found that around 15% of the messages we received involved questions about the check-in process, and another 7% involved special requests, such as additional towels, kitchenware, or special baby furniture.
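Percentages like these can be derived by assigning each message to its strongest topic and counting. The sketch below shows one plausible way to do that. The topic names, the helper functions, and the per-message weights are all invented for illustration and are not our real data.

```python
from collections import Counter

# Hypothetical topic names; real topics come out of the trained model.
TOPIC_NAMES = ["check-in", "special requests", "other"]

def dominant_topic(weights):
    """Index of the topic with the largest weight for one message."""
    return max(range(len(weights)), key=lambda k: weights[k])

def topic_shares(per_message_weights):
    """Percentage of messages whose dominant topic is each topic."""
    counts = Counter(TOPIC_NAMES[dominant_topic(w)] for w in per_message_weights)
    total = len(per_message_weights)
    return {name: 100 * n / total for name, n in counts.items()}

# Invented topic weights for 20 messages:
# 3 mostly about check-in, 2 about special requests, 15 about other topics.
weights = [[0.8, 0.1, 0.1]] * 3 + [[0.2, 0.7, 0.1]] * 2 + [[0.1, 0.2, 0.7]] * 15
shares = topic_shares(weights)
print(shares)
```

Assigning a single dominant topic is a simplification, since one message can touch several topics at once.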

Thanks to the results we’ve found by using this model, we will begin planning new strategies that will focus on the areas of opportunity we’ve identified, such as creating new ways to easily communicate the check-in process to our guests. 

One shortcoming of this model is that we cannot determine in advance which topics it will produce. Because of that, we will also begin experimenting with a supervised NLP algorithm that will capture the topics we’re interested in exploring further.

To learn more about how Casai is using tech to reimagine hospitality, follow us on LinkedIn, where we post regular updates and articles.