Data Labelling with OpenAI: GPT 3.5 vs GPT 4.0

The quest for optimising search relevance often grapples with the challenge of missing labels. These labels underpin the foundation of machine learning models, instructing them on how best to evaluate and prioritise search results. But what happens when direct human interaction with the data isn't feasible due to privacy concerns, or when human labelling is too costly and inconsistent in quality? Enter the potential of external AI models.

External AI models, pre-trained on vast amounts of data, can serve as a proxy for human judgment. In scenarios where original data is sensitive or confidential, these models can produce predictions for label generation without direct human interaction. This approach not only safeguards user privacy but also ensures that data integrity remains uncompromised.

Moreover, leveraging external AI models can offer a more consistent and scalable solution compared to human labelling. Human judgment, while invaluable, can be prone to inconsistencies, biases, and variations, especially when scaled across a large dataset. In contrast, AI models can provide a more standardized and reproducible approach, reducing variations and boosting the reliability of the generated labels.

In this blog post, we explore using LLMs, in particular OpenAI GPT 3.5 and GPT 4.0, to label data from the Home Depot Product Search Relevance Kaggle competition, and we evaluate their performance. We start by presenting the proposed algorithm and the prompt given to the models. Afterwards, the results from GPT 3.5 and GPT 4.0 are compared and discussed. We not only assess the power of LLMs in data labelling, but also quantify how much more accurate GPT 4.0 is than GPT 3.5.

Presenting Home Depot Product Search Relevance Competition

In January 2016, Kaggle presented a challenge with Home Depot. Kagglers were asked to build a model that can predict the relevance of search results on their website in order to create a fast and accurate shopping experience for the customers.

The data set is composed of four CSV files to be used for the model design:

  • training data - contains products, searches, and relevance scores

  • test data - contains products and searches

  • product descriptions - contains a text description of each product

  • attributes - extended technical information about a subset of the products

These files contain the following data fields:

  • id - a unique Id field which represents a (search_term, product_uid) pair

  • product_uid - an id for the products

  • product_title - the product title

  • product_description - the text description of the product (may contain HTML content)

  • search_term - the search query

  • relevance - the average of the relevance ratings for a given id

  • name - an attribute name

  • value - the attribute's value

Since we aim to evaluate the accuracy of the proposed algorithm, we run the predictions on the training data, so that the results can be compared with the actual relevance scores for each pair of product ID and search term. As the example for the prompt, we use the following one, given by the Kaggle challenge:

“The relevance is a number between 1 (not relevant) to 3 (highly relevant). For example, a search for "AA battery" would be considered highly relevant to a pack of size AA batteries (relevance = 3), mildly relevant to a cordless drill battery (relevance = 2), and not relevant to a snow shovel (relevance = 1).”

To get to know more about this competition, follow this link.

A possible approach using OpenAI

In this blog post, we propose a possible solution using LLMs, specifically OpenAI GPT, to label a small amount of data from the Home Depot Product Search Relevance Competition. The general idea is to prepare the available data and feed it into a prompt that instructs the LLM to predict the respective relevance scores. Since we aim to assess the performance of these models in the context of data labelling, we predict on the training data and then compare the obtained results with the actual relevance scores. We used Google Colaboratory (Colab), a hosted Jupyter Notebook service.

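So, first things first: you need to install the openai package and its dependencies with pip. Note that the snippets in this post use the openai.ChatCompletion interface, which is only available in versions of the library prior to 1.0: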

pip install openai 

Make sure that you have an OpenAI API key available:
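One way to make the key available without hard-coding it in the notebook is to read it from an environment variable. The following is a minimal sketch, assuming the key is stored in a variable named OPENAI_API_KEY; the helper name load_api_key is hypothetical and not part of the openai library:

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    `env_var` is an assumed name; use whatever variable holds your key.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set.")
    return key

# Usage in the notebook (requires the variable to be set beforehand):
# import openai
# openai.api_key = load_api_key()
```

Failing early with a clear message is usually easier to debug than an authentication error raised in the middle of a labelling run.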


Now you can define two ‘get_response’ functions (one for GPT 3.5 and another for GPT 4.0) that set the OpenAI model parameters and retrieve the model’s response to a given prompt:

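# Define a get-response function for each model
import openai

def get_response_3(prompt):
  text = None
  try:
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': prompt}],
        # other request parameters (e.g. temperature) can be set here
    )
    choices = response.choices[0]
    text = choices.message['content']
  except Exception as e:
    # On failure, report the error and return None
    print('ERROR:', e)

  return text

def get_response_4(prompt):
  text = None
  try:
    response = openai.ChatCompletion.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
        # other request parameters (e.g. temperature) can be set here
    )
    choices = response.choices[0]
    text = choices.message['content']
  except Exception as e:
    # On failure, report the error and return None
    print('ERROR:', e)

  return text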

This is an example of a parameter combination that we used in our case, but you can play around with them and test different parameters.

Now you have your OpenAI models ready to be applied. Next, you will need to prepare your Home Depot Search Relevance data from Kaggle, so that you can use them as variables in the prompts. Let’s see how you should proceed.

Preparing the data

From the Home Depot Product Search Relevance Competition website, you can download the .csv files needed to apply the GPT models. Follow this link, go to ‘Data Explorer’, and click on ‘Download All’. In this first approach, we only use the train.csv file.

In the context of Google Colab, it makes sense to upload the files to your Google Drive, and then mount your drive as follows:

# Import Drive and mount it in the Colab file system

from google.colab import drive
drive.mount('/content/drive')

Next, you load the respective files from your drive into pandas DataFrames; here, train.csv is loaded into the df_train variable.


The df_train variable is composed of 74050 instances. Of course, due to the token and credit limitations of the OpenAI API, we are not able to predict all the available examples. Hence, we select n random instances from the data set:

# Get number of instances to select

while True:
    try:
        n = int(input('How many instances would you like to predict? '))
        break  # valid integer entered, leave the loop
    except ValueError:
        print("Please enter a valid integer value.")

# Randomly select n rows from the DataFrame

random_rows = df_train.sample(n=n)

And finally, the last step in data preparation is to retrieve the column variables from the DataFrame. Remember, the data fields for train.csv are id, product_uid, product_title, search_term and relevance.

# Retrieve column variables from DataFrame

id_train_list = random_rows['id'].tolist()
productid_train_list = random_rows['product_uid'].tolist()
product_title_train_list = random_rows['product_title'].tolist()
search_term_train_list = random_rows['search_term'].tolist()
# Actual relevance scores, used later to evaluate the predictions
relevance_train_list = random_rows['relevance'].tolist()

Getting responses

Now we define the prompt given to both GPT 3.5 and GPT 4.0. First, we indicate that the LLM will behave as a predictor of the relevance of search results in Home Depot, relevance being a float number between 1 and 3, as the instructions of the competition specify.

# Define prompt

def get_prompt(id_train_list, productid_train_list, product_title_train_list, search_term_train_list):

    # f-string, so that the lists are interpolated into the prompt text
    prompt = f"""You are now a predictor of the relevance of search results in Home Depot.
    The relevance is a float number between 1 (not relevant) to 3 (highly relevant).
    For example, a search for 'AA battery' would be considered highly relevant to a pack of size AA batteries (relevance = 3), mildly relevant to a cordless drill battery (relevance = 2), and not relevant to a snow shovel (relevance = 1).
    You need to predict the relevance score on each query ID {id_train_list}, with the corresponding product ID {productid_train_list}, product title {product_title_train_list}, search term {search_term_train_list}.
    Return a json object with the following structure:
    query_id: relevance_score
    """
    return prompt

# Get response for GPT 3.5

response_3=get_response_3(get_prompt(id_train_list, productid_train_list, product_title_train_list, search_term_train_list))

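# Get response for GPT 4

response_4=get_response_4(get_prompt(id_train_list, productid_train_list, product_title_train_list, search_term_train_list))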
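The models return plain text, so the reply has to be converted into a list of numbers before it can be compared with the actual relevance scores. The following is a minimal sketch of that step, assuming the reply contains a single JSON object that maps each query ID to its score; the helper name parse_relevance is hypothetical:

```python
import json
import re

def parse_relevance(response_text, ids):
    """Turn a reply such as '{"101": 2.5, "7": 1}' into a list of floats
    ordered like `ids`. Returns None if no valid JSON object is found and
    raises KeyError if one of the ids is missing from the reply."""
    if response_text is None:
        return None
    # Take everything from the first '{' to the last '}', ignoring extra text.
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        scores = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # JSON keys are always strings, so look up each id by its string form.
    return [float(scores[str(i)]) for i in ids]

# Example with a made-up reply
reply = 'Sure:\n{"101": 2.5, "7": 1, "42": 3.0}'
print(parse_relevance(reply, [7, 42, 101]))  # [1.0, 3.0, 2.5]
```

Ordering the scores by the original list of IDs keeps the predictions aligned with the actual relevance values, even if the model returns the entries in a different order.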


Analysing the results

As previously mentioned, we aim to evaluate the performance of both GPT 3.5 and GPT 4 on this data labelling task. To quantify that, you must select proper metrics that describe how well the models score the relevance of Home Depot products for the respective search terms. For example, you can opt for simple metrics such as mean absolute error and mean squared error:

# Compute MAE, MSE and RMSE for GPT 3.5 and GPT 4
# (response_3 and response_4 must be lists of floats, in the same order as relevance_train_list)

from sklearn.metrics import mean_absolute_error, mean_squared_error

mae_3 = mean_absolute_error(relevance_train_list, response_3)
mae_4 = mean_absolute_error(relevance_train_list, response_4)
mse_3 = mean_squared_error(relevance_train_list, response_3)
mse_4 = mean_squared_error(relevance_train_list, response_4)
# RMSE is the square root of the MSE
rmse_3 = mse_3 ** 0.5
rmse_4 = mse_4 ** 0.5
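To make clear what these metrics measure, here is a small sketch that computes them by hand with the standard library only. The values below are made up for illustration and are not results from our experiments:

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average of the squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: MSE brought back to the original scale."""
    return math.sqrt(mse(actual, predicted))

actual = [3.0, 2.0, 1.0, 2.5]      # made-up relevance scores
predicted = [2.0, 2.0, 2.0, 3.0]   # made-up predictions

print(mae(actual, predicted))   # 0.625
print(mse(actual, predicted))   # 0.5625
print(rmse(actual, predicted))  # 0.75
```

Because the errors are squared, MSE and RMSE penalise a few large misses more heavily than MAE does, which is why it is useful to report both.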

Additionally, to evaluate the correlation between the predicted values and the actual values, you can calculate the Pearson’s correlation coefficient and the respective p-value:

# Compute Pearson's correlation coefficient and p-value for GPT 3.5 and GPT 4

from scipy.stats import pearsonr

correlation_3, p_value_3 = pearsonr(relevance_train_list, response_3)
correlation_4, p_value_4 = pearsonr(relevance_train_list, response_4)
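As a reminder of what the coefficient captures, the sketch below computes Pearson's r directly from its definition (the covariance of the two lists divided by the product of their standard deviations). The function name pearson_r is hypothetical and the input values are made up:

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0
print(pearson_r([1, 2, 3], [3, 2, 1]))  # -1.0
```

A value close to 1 means that the predictions rise and fall together with the actual scores, even if they are shifted or scaled, so the coefficient complements the error metrics rather than replacing them.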

These are the results that we obtained with n = 10 instances:

n = 20 random instances:

n = 30 random instances:

As you can see from the previous examples, GPT 4 yields significantly smaller error values (MAE, MSE and RMSE). Hence, it is fair to say that GPT 4 predicts more accurately than GPT 3.5, as expected. In fact, if you look at the predicted vs actual values scatter plots for any of the presented examples, you observe that for GPT 4 the dots are closer to the y = x line, meaning that the predicted values are closer to the actual values.

However, we notice a curious phenomenon as we increase the number of instances: the Pearson’s correlation coefficient and the corresponding p-value tend to get worse, which could mean that the predictions become “more random” as the number of predicted values increases. In fact, you can refer to this blog post, which also demonstrates how larger contexts result in worse predictions in the case of document processing.

Using more data

Using LLMs, in particular OpenAI GPT 4, seems a suitable approach to label missing data, as it is clearly more powerful than GPT 3.5. However, we only tested this method on dozens of instances. How would GPT perform on hundreds of instances? How can we overcome the token limitation to test more data?

A possible solution is to perform multiple separate predictions and join the results into one single list for evaluation. For example, we created 10 groups of random instances, and then we predicted each group separately with both GPT 3.5 and GPT 4:

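# smaller_dfs (the 10 groups of instances), get_prompt2 (prompt builder for the columns below)
# and get_response(model, prompt) are assumed to be defined beforehand
responses = []
models = ['gpt-3.5-turbo', 'gpt-4']

for idx, df in enumerate(smaller_dfs):
    random_rows = df.sample(n=int(n/10), random_state=42)
    cols = ['id', 'product_uid', 'product_title', 'search_term', 'product_description']
    # Use dictionary comprehension to convert data to lists
    data_lists = {col: random_rows[col].tolist() for col in cols}
    for model in models:
        prompt = get_prompt2(*data_lists.values())
        response = get_response(model, prompt)
        # Keep track of which group and model each response belongs to
        responses.append((idx, model, response))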
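The batching idea itself is independent of the model being called. The sketch below shows it in isolation with plain Python lists: the data is cut into chunks, each chunk is labelled separately, and the labels are joined into one list. The names chunked, label_in_batches and fake_label_batch are hypothetical, and the stand-in labeller simply returns a constant score instead of calling a model:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def label_in_batches(rows, batch_size, label_batch):
    """Call `label_batch` on each chunk and join the results into one list."""
    labels = []
    for batch in chunked(rows, batch_size):
        labels.extend(label_batch(batch))
    return labels

def fake_label_batch(batch):
    """Stand-in for a real model call: scores every row as 2.0."""
    return [2.0 for _ in batch]

rows = list(range(25))  # placeholder for 25 instances
labels = label_in_batches(rows, batch_size=10, label_batch=fake_label_batch)
print(len(labels))  # 25
```

Passing the labelling function as an argument makes it easy to swap the stand-in for a function that builds the prompt, calls the model and parses its reply.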

Since you are predicting on more instances, you may need to make some alterations to your prompt. For example, you need to ensure that the models return only json objects, so that the results can be handled programmatically:

prompt = """
         ...
         Do not return anything else besides the json object, since I want it to convert to a Python list.
         """

For 200 instances we obtained the following:

Here you will notice three major changes. First, the Pearson’s correlation coefficient values are significantly smaller than in the previous examples with fewer instances. Conversely, the p-values for both GPT 3.5 and GPT 4 are much smaller when we use more instances, which is expected because the p-value depends on the sample size as well as on the strength of the correlation. Finally, in the Predicted vs Actual Values scatter plots there are no longer significant visual differences between the two OpenAI models, unlike in the previous examples. Error values for GPT 4 remain smaller than for GPT 3.5. Therefore, GPT 4 still performs better than GPT 3.5 when predicting more instances, but its performance slightly decreases.
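A smaller correlation coefficient together with a smaller p-value is less surprising once you recall how the significance of a Pearson correlation is usually tested. As a sketch, assuming the standard test for Pearson's r, the coefficient r computed from n samples is converted into a t statistic with n - 2 degrees of freedom:

```latex
t = r \sqrt{\frac{n - 2}{1 - r^{2}}}
```

For an illustrative r = 0.3 (a made-up value, not one of our results), n = 10 gives t ≈ 0.89, while n = 200 gives t ≈ 4.43. The same correlation therefore becomes far more significant as n grows, which is why a weaker correlation on 200 instances can still come with a much smaller p-value.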

Are LLMs promising in data labelling?

Well, as demonstrated by the Home Depot Product Search Relevance Competition example, there seems to be evidence that LLMs have the potential to be applied to data labelling, especially in the context of missing labels or data. You can get relatively accurate results, in particular when using GPT 4.

Nevertheless, there is a hurdle that you cannot ignore: the token limitation of LLMs. You need to find a balance between the number of tokens used for the prompt definition and the size of the variables that you want to use for labelling: the instructions to your language model need to be direct, clear and descriptive, while your variables must be sufficient to describe your data. For example, we could not get a response from the models (using the same prompt) when trying to add the product description variable, “simply” because we would surpass the token limit of the models (4097 tokens for gpt-3.5-turbo). Moreover, using LLMs, as you can imagine, can become significantly costly, especially when handling a considerable amount of data.
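Before sending a prompt, it can help to check roughly whether it fits into the context window. The sketch below uses a rule of thumb of about 4 characters per token for English text, which is only an approximation; exact counts require a tokenizer library such as tiktoken. The function names and the number of tokens reserved for the reply are assumptions made for illustration:

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers will give different counts."""
    return -(-len(text) // chars_per_token)  # ceiling division

def fits_context(prompt, max_context_tokens, reserved_for_reply):
    """Check whether the prompt plus the expected reply fits the token limit."""
    return estimate_tokens(prompt) + reserved_for_reply <= max_context_tokens

short_prompt = "a" * 4000    # about 1000 tokens under this heuristic
long_prompt = "a" * 16000    # about 4000 tokens under this heuristic

print(fits_context(short_prompt, max_context_tokens=4097, reserved_for_reply=500))  # True
print(fits_context(long_prompt, max_context_tokens=4097, reserved_for_reply=500))   # False
```

If a prompt does not fit, you can reduce the number of instances per request or drop long variables such as the product description.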

Notwithstanding, LLMs seem to be pretty useful when you need to label data and you encounter the aforementioned challenges, such as data confidentiality or human errors. For instance, you may start by labelling a small amount of data with LLMs and then train AI models on top of that.


So from this blog post, there are 3 take-home messages that you should keep in mind, regarding the use of OpenAI for data labelling:

  1. GPT 4 presents a smaller error in predictions, so it performs better than GPT 3.5, as expected;

  2. GPT 4 seems to perform better when handling a smaller number of instances;

  3. Both GPT 3.5 and GPT 4 present limitations when handling large variables, due to the token limit.

We showed you a simple example of an alternative approach to surpass missing data and labels, but if you need someone to help you design a more complex pipeline to handle problems like this, do not hesitate to contact TensorOps! We are here to help you with innovative solutions!

