Building a Feature Store to reduce the time to production of ML models

Tomás Moreyra
Mercado Libre Tech
Nov 24, 2020 · 7 min read


A new problem in the industry

In software development, and especially in Machine Learning, the time devoted to analysis, development and deployment is key to an organization's success. Mercado Libre is no exception to this problem; as a leading company in Latin America, even less so. The staggering growth of Machine Learning development has undoubtedly brought new problems to large companies. Today it is standard practice to use a microservice infrastructure so that many teams can work on the same backend, but what happens when many of these autonomous teams also develop Machine Learning models?

On Mercado Libre's backend there are several entities that are consulted by many teams. For instance, we call "item" the publication that a seller creates in our Marketplace. These items or publications contain at least a title, a description, photos, interactions with thousands of users, and a collection of structured information. Since our business needs usually revolve around precisely these kinds of entities, many teams have had to build some type of model on top of items: a binary classifier to decide whether an item should be moderated for breaking a rule, a classifier by product category, or a search for similar items to feed recommendations, to name a few.

Whether for classification, regression or nearest-neighbor search, all these initiatives have had to go through a similar feature engineering process to generate representations of the item that allow models to be trained. So the problem is plain to see. On the one hand, we have repeated effort across teams (and not a minor one): the feature engineering process takes time and is key to model development. On the other hand, not every team has a Machine Learning expert or developers with significant experience in a given tool or business area, and even when they do, there may be no simple mechanism for other teams to take advantage of their know-how.

Suppose a team wants to train a new binary classifier to separate items into those to be shipped in boxes and those to be shipped in bags. If it could start from a well-built representation of the item, this team could bring the lead time of its first model down from days to hours. It is at this point that we propose a Feature Store.

What is a Feature Store?

It is a centralized place where teams can consult the features of an entity. There are several approaches to implementing this solution.

If we think of features in the traditional sense, as human-understandable characteristics of the entity, the feature store could be an API on top of a database engine that allows extracting specific information, for example "the number of publications visited by some user in the last 30 days". At Mercado Libre we have this type of solution, which serves many use cases and needs, but delving into it is beyond the scope of this article (an interesting topic for another post, perhaps).

Another alternative — and the one we will actually be exploring — is to think of features as an intermediate level of abstraction which is not human-readable but which may have certain properties that make it useful for a model. Just as there are models that can generate arbitrary text embeddings, we can create models that generate embedding-like representations (n-dimensional vectors) from any entity we’re interested in (for example, items or shipments). The meaning of these vectors may not be clear to a human but they could have interesting properties. For instance, the vectors representing items (publications) corresponding to the same type of product can be close in the n-dimensional space; or the vectors that represent payments may be largely grouped so that we can set aside and review those whose representation is far from that group. This would enable us to find fraudulent transactions.
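
To make this concrete, here is a minimal sketch (the vectors and item names are invented for illustration) of how closeness between such representations can be measured with cosine similarity:

```python
import numpy as np

# Toy embeddings: in practice these would come from a trained model and
# have many more dimensions; the values below are made up.
item_embeddings = {
    "iphone_11_case": np.array([0.90, 0.10, 0.30]),
    "iphone_12_case": np.array([0.88, 0.15, 0.28]),
    "lawn_mower":     np.array([0.10, 0.90, 0.70]),
}

def cosine_similarity(a, b):
    """Similarity close to 1.0 means the two vectors point in similar directions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two phone cases end up close in the vector space...
print(cosine_similarity(item_embeddings["iphone_11_case"],
                        item_embeddings["iphone_12_case"]))
# ...while an unrelated product is much farther away.
print(cosine_similarity(item_embeddings["iphone_11_case"],
                        item_embeddings["lawn_mower"]))
```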

The properties of the representation are tied to the underlying optimization process that generated it, so if we use browsing sessions to infer the representations, two items will be similar under that representation when they are visited in similar contexts.
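
As a sketch of one way such session-based representations could be learned (not necessarily the approach we use in production), each browsing session can be treated as a "sentence" of item IDs and fed to a word2vec-style model; items visited in similar contexts then end up with nearby vectors:

```python
# Assumes gensim >= 4.0; the sessions and item IDs are invented for illustration.
from gensim.models import Word2Vec

sessions = [
    ["ITEM_1", "ITEM_2", "ITEM_3"],   # one user's browsing session
    ["ITEM_2", "ITEM_3", "ITEM_7"],
    ["ITEM_9", "ITEM_8", "ITEM_7"],
]

# Train item embeddings from co-occurrence within sessions.
model = Word2Vec(sessions, vector_size=64, window=3, min_count=1, epochs=10)

# Items co-visited in similar contexts now have similar vectors.
print(model.wv.most_similar("ITEM_2", topn=2))
```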

A proposed solution


Service vs. Lib

One way of allowing any developer to transform an entity into a vector is through an API, though we've found some drawbacks in this strategy. The team developing the feature store has to build, and then maintain, a specific microservice for this purpose. This microservice has requirements (e.g. latency, scalability, availability or costs) that fall under the scope of the team developing the user application, not of the one accountable for the feature store. Moreover, several machine learning models could depend on the same microservice, which is likely to generate problems; for instance, the model that generates the vectors may be adjusted to improve the performance of one of the models that consumes it, hurting the others. Even though this last point could be mitigated with proper versioning of the API, that would entail a greater investment in the development and maintenance of the feature store.

Therefore, we've opted for the TensorFlow Hub strategy, which comprises a repository or "Hub" of models and an API to download them. After downloading, the developer becomes the owner of that object and is totally independent from our Hub. Thus, the application that uses it depends on the feature store only at training time and afterwards can work in a completely self-contained way.
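
As a point of reference, this is what the pattern looks like with the public TensorFlow Hub (a public text-embedding model used purely as an illustration, not our internal Hub):

```python
import tensorflow_hub as hub

# The model is downloaded once from the hub and cached locally; from then on
# the application no longer needs the hub to be available.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

vectors = embed(["used iphone 11 case", "red lawn mower"])
print(vectors.shape)  # (2, 512): one 512-dimensional vector per input text
```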

Only familiar tools

In addition, we decided that these models that transform entities into embeddings would be shared in python libraries with a scikit-learn interface. In other words, they can be installed with “pip” and are objects with fit and transform methods. This facilitates their adoption by our machine learning devs (who mostly already use python) not only to integrate them into sklearn pipelines but also to use in combination with other tools (for example, we have PyTorch models using these representations).
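
For illustration, here is a minimal sketch of what that interface enables; the `ItemEmbedder` class is a hypothetical stand-in for one of the shared models, not our actual library:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class ItemEmbedder(BaseEstimator, TransformerMixin):
    """Stand-in for a pretrained entity-to-embedding model."""

    def fit(self, X, y=None):
        return self  # pretrained: nothing to learn here

    def transform(self, X):
        # The real model would return learned n-dimensional vectors;
        # here we fake them with random 8-dimensional vectors.
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(X), 8))

# Drop the embedder into a standard sklearn pipeline with any downstream model.
pipeline = Pipeline([
    ("embed", ItemEmbedder()),
    ("clf", LogisticRegression()),
])

items = ["iphone 11 case", "lawn mower", "running shoes", "phone charger"]
labels = [1, 0, 0, 1]  # e.g. ship in a bag (1) vs. in a box (0)
pipeline.fit(items, labels)
print(pipeline.predict(["tablet cover"]))
```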

Versions

Yet another issue worth pointing out is that the same model could be used to represent different datasets (like publications in Spanish or Portuguese), and there could also be models trained with different hyperparameters (for example, to generate vectors of different dimensions). This is why these models have "versions." To better illustrate this, here is an example of using our lib:
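
(The package, class and version names in this sketch are illustrative placeholders rather than the actual internal library; the point is that a specific version, tied to a dataset and a vector dimension, is requested explicitly.)

```python
from meli_feature_store import ItemEncoder  # hypothetical package name

# Request a specific version: e.g. trained on Spanish-language items,
# producing 128-dimensional vectors.
encoder = ItemEncoder(version="items-es-128d")

titles = ["Funda iPhone 11 usada", "Cortadora de césped"]
encoder.fit(titles)                  # pretrained model: fit is a no-op
vectors = encoder.transform(titles)  # -> array of shape (2, 128)
```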

What we achieve with this

The solution we propose has the following advantages:

Fast MVP

Since a good starting point for feature engineering is already available, you can have an MVP or proof of concept in a few days or even hours.

Less duplicate effort

Entity-to-feature transformations are shared across teams, so the same feature engineering work is not repeated.

Widespread expert knowledge

The knowledge and expertise of Machine Learning experts can be available to all teams.

Improvement of existing models

Embeddings can be combined with the features of models that are already in production, which can yield improved metrics with very little effort.
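
For instance, a minimal sketch (with invented data and feature names) of concatenating feature-store embeddings with the hand-crafted features an existing model already uses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hand-crafted features an existing model already uses (e.g. price, has_photos).
existing_features = np.array([[12.5, 1], [3.0, 0], [8.9, 1], [1.2, 0]])
# Vectors obtained from the feature store (random placeholders here).
item_embeddings = np.random.default_rng(0).normal(size=(4, 8))
labels = np.array([1, 0, 1, 0])

# Concatenate both representations and retrain the existing model on the result.
X = np.hstack([existing_features, item_embeddings])
model = LogisticRegression().fit(X, labels)
print(model.predict(X[:2]))
```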

Resource Optimization

Reducing redundant work not only saves developer hours, but also model training hours and infrastructure costs.

Next steps

Setting up a Hub of models that generate embeddings is one of many strategies to have a collaborative feature engineering system. Although we are still developing this valuable time-saving tool, we have been able to validate its effectiveness as proper embeddings allow us to train models with good initial performance in a short time.

Our challenge now is to ensure that the Hub continues adding new models and that its use keeps on growing in the company, maximizing synergy across teams. This implies understanding what information each machine learning developer (our user) needs in order to find the model that best solves their problem.

We are convinced, like other major players in our industry (Uber, Netflix, Twitter and Google, among others), that when working at a large scale it is vital to take a more global view and devise a platform that fosters cooperation among teams and speeds up delivery times to production.

Acknowledgments


Tomás Moreyra is a Machine Learning Technical Leader at Mercado Libre and a computer scientist.