Solution Background and Business Value

Entity resolution enables companies to identify and merge records that refer to the same real-world entity (such as customers, products, or businesses) across different data sources. Consolidating these records produces a more accurate and holistic view of each entity, which in turn improves data quality and decision-making. Entity resolution is also crucial for predictive tasks, as duplicate entries introduce unwanted noise and generally reduce model accuracy.

Despite its importance, entity resolution is notoriously difficult to perform accurately with traditional rule-based approaches. These methods rely on manually crafted heuristics that become difficult to maintain at scale and struggle to capture the often complex relationships between data fields. Kumo AI’s feature-based learning and graph neural network approach creates context-aware embeddings that can identify subtle, non-obvious links between records, making it well suited to entity resolution tasks.

Entity resolution covers many distinct problems; here we show how to use Kumo AI to build a link prediction model that identifies the top N candidate matches for each record, specifically linking customers who have created accounts on two different platforms.

Data Requirements and Schema

To develop an effective entity resolution model, we need a structured set of tables that captures the relevant user data for both platforms and represents the different signals entity resolution will draw on. While a minimal set of tables is enough to generate entity resolution predictions, adding relevant information and complexity to the graph will only increase model accuracy.

For this example, the highest-confidence signal is email: we assume that two users sharing the same email is ground truth for the existence of a link. Device ID is a medium-confidence signal: if two users access the platforms through the same device, a link is likely but not certain. Other signals, such as IP addresses or content links, can be added by following the same approach used for device signals: a shared table connected to users from both platforms.
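The two signal tiers above can be turned into a pre-generated labels table with a confidence column. The following is a minimal pandas sketch; all rows are illustrative, and the column names anticipate the schema described below.

```python
import pandas as pd

# Illustrative user tables; emails are used only offline to build labels
# and are omitted from the tables the model trains on.
platform_a = pd.DataFrame({
    "platform_a_user_id": [1, 2, 3],
    "email": ["ann@example.com", "bob@example.com", "cat@example.com"],
})
platform_b = pd.DataFrame({
    "platform_b_user_id": [10, 11, 12],
    "email": ["ann@example.com", "dan@example.com", "cat@example.com"],
})

# High confidence: an exact email match is treated as ground truth.
high = (
    platform_a.merge(platform_b, on="email")
    [["platform_a_user_id", "platform_b_user_id"]]
    .assign(confidence="High")
)

# Medium confidence: the two users have sessions from the same device.
sessions_a = pd.DataFrame({"platform_a_user_id": [2, 3], "device_id": ["d1", "d2"]})
sessions_b = pd.DataFrame({"platform_b_user_id": [11, 12], "device_id": ["d1", "d3"]})
medium = (
    sessions_a.merge(sessions_b, on="device_id")
    [["platform_a_user_id", "platform_b_user_id"]]
    .drop_duplicates()
    .assign(confidence="Medium")
)

labels = pd.concat([high, medium], ignore_index=True)
labels.insert(0, "link_id", range(len(labels)))
```

The same pattern extends to any additional shared-attribute signal: join the two user (or session) tables on the shared column and assign the appropriate confidence tier.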

Core Tables

  1. Platform A User Data:
    • Stores data about each user from platform A, using email as an identifier
    • Note: Emails are omitted from the table to prevent data leakage during training
    • Key attributes:
      • platform_a_user_id : unique user identifier for platform A
      • first_seen: user creation date
      • last_seen: last time a user was seen
      • Optional: Other user attributes (age, gender, location, etc.)
  2. Platform B User Data:
    • Stores data about each user from platform B, using email as an identifier
    • Contains similar information to the platform A user data table
    • Key attributes:
      • platform_b_user_id : unique user identifier for platform B
      • first_seen: user creation date
      • last_seen: last time a user was seen
      • Optional: Other user attributes (age, gender, location, etc.)
  3. Platform A User Sessions:
    • Stores data about each user session from platform A
    • Key attributes:
      • platform_a_session_id : unique session identifier for platform A
      • platform_a_user_id : the user from platform A this session belonged to
      • create_date : create date of the session
      • device_id : device used for this session
      • Optional: ip address, duration, location, etc.
  4. Platform B User Sessions:
    • Stores data about each user session from platform B
    • Contains similar information about user sessions as those from platform A
    • Key attributes:
      • platform_b_session_id : unique session identifier for platform B
      • platform_b_user_id : the user from platform B this session belonged to
      • create_date : create date of the session
      • device_id : device used for this session
      • Optional: ip address, duration, location, etc.
  5. Device Data:
    • Stores data about each device used by users from both platforms A and B
    • Key attributes:
      • device_id : unique device identifier
      • device_type : device type
      • Optional: device brand, device model, etc.
  6. Labels Table:
    • Stores pre-generated candidate links between users from platforms A and B
    • Key attributes:
      • link_id : unique identifier for each link
      • platform_a_user_id : identifier for a user from platform A
      • platform_b_user_id : identifier for a user from platform B
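To make the key relationships concrete, the six core tables can be sketched as in-memory frames; all IDs and values below are illustrative, and the final assertions check the foreign-key relationships the schema implies.

```python
import pandas as pd

# Minimal sketch of the core tables with made-up rows.
platform_a_users = pd.DataFrame({
    "platform_a_user_id": [1, 2],
    "first_seen": pd.to_datetime(["2023-01-05", "2023-02-10"]),
    "last_seen": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
platform_b_users = pd.DataFrame({
    "platform_b_user_id": [10, 11],
    "first_seen": pd.to_datetime(["2023-03-01", "2023-04-01"]),
    "last_seen": pd.to_datetime(["2024-01-03", "2024-01-04"]),
})
sessions_a = pd.DataFrame({
    "platform_a_session_id": [100, 101],
    "platform_a_user_id": [1, 2],
    "create_date": pd.to_datetime(["2023-05-01", "2023-05-02"]),
    "device_id": ["d1", "d2"],
})
sessions_b = pd.DataFrame({
    "platform_b_session_id": [200],
    "platform_b_user_id": [10],
    "create_date": pd.to_datetime(["2023-06-01"]),
    "device_id": ["d1"],
})
devices = pd.DataFrame({
    "device_id": ["d1", "d2"],
    "device_type": ["mobile", "desktop"],
})
labels = pd.DataFrame({
    "link_id": [0],
    "platform_a_user_id": [1],
    "platform_b_user_id": [10],
})

# Sanity-check the foreign keys: every session points to a known user,
# every device_id points to a known device.
assert sessions_a["platform_a_user_id"].isin(platform_a_users["platform_a_user_id"]).all()
assert sessions_b["platform_b_user_id"].isin(platform_b_users["platform_b_user_id"]).all()
assert sessions_a["device_id"].isin(devices["device_id"]).all()
assert sessions_b["device_id"].isin(devices["device_id"]).all()
```

Note that the shared device table is what connects users across the two platforms in the graph; the labels table links the two user tables directly.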

Entity Relationship Diagram (ERD)

Predictive Query:

This predictive query relies on a labels table, which must be pre-generated. Each entry in the labels table corresponds to a potential link between users. At prediction time, the query generates, for each user from platform A, the top N users from platform B most likely to represent the same entity. Training only on established high-confidence links improves label quality, so adding a pre-generated confidence column and filtering on it can improve model accuracy.

PREDICT LIST_DISTINCT(labels.platform_b_user_id 
WHERE labels.confidence='High') 
RANK TOP N
FOR EACH platform_a_users.platform_a_user_id
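Once batch predictions are generated, the ranked candidates can feed a manual-review queue. The sketch below assumes a flat output with one row per (platform A user, candidate platform B user) pair plus a model score; these column names are an assumption for illustration, not a documented output format.

```python
import pandas as pd

# Hypothetical shape of the batch-prediction output.
preds = pd.DataFrame({
    "platform_a_user_id": [1, 1, 2, 2],
    "platform_b_user_id": [10, 11, 10, 12],
    "score": [0.91, 0.40, 0.15, 0.72],
})

# Keep the single best-scoring candidate per platform A user for review.
review_queue = (
    preds.sort_values("score", ascending=False)
    .groupby("platform_a_user_id", as_index=False)
    .head(1)
)
```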

Next Steps:

While this model generates a ranked list of candidate pairs for entity resolution, there is no guarantee that any duplicate users actually exist across platforms A and B. Because the model ranks potential pairs regardless of whether a true duplicate exists, the link prediction model only narrows down the number of pairs that need to be manually reviewed.

Although this pipeline can lead to large efficiency gains, it can be pushed further by training a separate binary classification model that outputs the probability of a candidate pair being a true match. The table structure would not need to change; only the labels table and the predictive query would differ. A threshold on the probability score can then be set to flag candidate pairs as duplicate users.
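The thresholding step can be sketched as follows, assuming the follow-up classifier returns a match probability per candidate pair (the column names and cutoff values here are hypothetical and would be tuned on a held-out labeled set):

```python
import pandas as pd

# Hypothetical match probabilities from the binary classification model.
scored = pd.DataFrame({
    "platform_a_user_id": [1, 2, 3],
    "platform_b_user_id": [10, 12, 11],
    "match_probability": [0.95, 0.55, 0.10],
})

AUTO_MERGE_THRESHOLD = 0.9   # above this: treat as a confirmed duplicate
REVIEW_THRESHOLD = 0.5       # between the two: send to manual review

auto_merge = scored[scored["match_probability"] >= AUTO_MERGE_THRESHOLD]
needs_review = scored[
    (scored["match_probability"] < AUTO_MERGE_THRESHOLD)
    & (scored["match_probability"] >= REVIEW_THRESHOLD)
]
```

The two-threshold design lets high-confidence pairs be merged automatically while mid-range pairs still get human review, shrinking the manual workload without silently merging uncertain matches.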