Mahesh's Blog

Tech Blog from Mahesh

1 August 2024

Filtering and updating a set of rows in pandas

While working with pandas, I had a use case where I needed to filter a set of rows and update the values in all of those rows at once. Using apply was slow because it works one row at a time, while my operation had to run on the whole set of rows as a batch (think vectorization).

I couldn't find a good article on how to do that, so here is my approach.

  1. Filter the rows that you need to process. In my case, the column I needed to filter on was named embedding
    is_row_new = df["embedding"].isnull()
    new_rows = df[is_row_new]
    
  2. I had to get the embeddings for these rows as a batch. I stored this list of embeddings as new_embeddings
    new_embeddings: list = create_new_embeddings(new_rows)
    
  3. Create a pandas Series that carries the index of the filtered rows and apply it back to the data frame
    new_embedding_updates = pd.Series(new_embeddings, index=df.index[is_row_new])
    df["embedding"].update(new_embedding_updates)
    

That’s it. The trick when dealing with a list is to keep the index of all the rows that you have to update. Once you have the new values, create a new Series with that index and use it to update the data frame.
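To see the index alignment in action, here is a minimal sketch of the steps above on a toy data frame. The data, the non-contiguous index, and the body of create_new_embeddings are all assumptions made up for illustration (the stand-in just turns the text length into a two-element list). The last step updates a copy of the column and assigns it back. That is a deliberate variation: when pandas' Copy-on-Write mode is enabled, calling update directly on df["embedding"] changes a temporary Series and is not reflected in the data frame.

```python
import pandas as pd


def create_new_embeddings(rows: pd.DataFrame) -> list:
    # Hypothetical stand-in for a real batch embedding call:
    # one "embedding" per row, computed for the whole batch at once.
    return [[float(len(text)), 0.0] for text in rows["text"]]


# Toy data: rows 20 and 40 have no embedding yet.
df = pd.DataFrame(
    {
        "text": ["a", "bb", "ccc", "dddd"],
        "embedding": [[1.0, 0.0], None, [3.0, 0.0], None],
    },
    index=[10, 20, 30, 40],
)

# 1. Filter the rows that need processing.
is_row_new = df["embedding"].isnull()
new_rows = df[is_row_new]

# 2. Compute the new values as a batch.
new_embeddings = create_new_embeddings(new_rows)

# 3. Wrap the list in a Series that carries the labels of the filtered rows,
#    so each value is matched to its row by label, not by position.
new_embedding_updates = pd.Series(new_embeddings, index=df.index[is_row_new])

# Update a standalone copy of the column, then assign it back to the frame.
embedding = df["embedding"].copy()
embedding.update(new_embedding_updates)
df["embedding"] = embedding

print(df)
```

Rows 10 and 30 keep their existing embeddings, and only the rows labelled 20 and 40 receive new values.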

Full Snippet

import pandas as pd

is_row_new = df["embedding"].isnull()
new_rows = df[is_row_new]
new_embeddings: list = create_new_embeddings(new_rows)
new_embedding_updates = pd.Series(new_embeddings, index=df.index[is_row_new])
df["embedding"].update(new_embedding_updates)
tags: pandas