Filtering and updating a set of rows in pandas
While working with pandas, I had a use case where I had to filter a set of rows and update the values on all of those rows at once. Using apply was slow because the operation needed to run on the whole set of rows as a batch (think vectorization).
I could not find a good article on how to do that, so here is my approach.
- Filter the rows that you need to process. In my case, the column I needed to filter on was named embedding:
is_row_new = df["embedding"].isnull()
new_rows = df[is_row_new]
- I had to get the embeddings for these rows as a batch. I stored this list of embeddings as new_embeddings:
new_embeddings: list = create_new_embeddings(new_rows)
- Create a pandas Series and apply it back to the data frame:
new_embedding_updates = pd.Series(new_embeddings, index=df.index[is_row_new])
df["embedding"].update(new_embedding_updates)
That’s it. The trick to dealing with a list of new values is to keep an index of all the rows that you have to update. Once you have the new values, create a new Series with that index and use it to update the dataframe.
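To see why keeping the index matters, here is a minimal sketch on a standalone Series. The values and labels are made up for illustration; the point is that update matches rows by index label and only overwrites where the incoming Series has a value.

```python
import pandas as pd

# Toy column with two missing values (labels and numbers are illustrative).
s = pd.Series([1.0, None, 3.0, None], index=["a", "b", "c", "d"])

# Boolean mask of the rows that need a value.
mask = s.isnull()

# New values carry the index of exactly the rows they belong to.
updates = pd.Series([20.0, 40.0], index=s.index[mask])

# update() aligns on the index, so "b" and "d" are filled and the
# existing values at "a" and "c" are left alone.
s.update(updates)
print(s.tolist())
```

Because the alignment is by label, the order of the new values only has to match the order of the filtered rows, not their position in the full dataframe.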
Full Snippet
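import pandas as pd

# Boolean mask of the rows that still need an embedding
is_row_new = df["embedding"].isnull()
new_rows = df[is_row_new]
# One batched call for all the new rows
new_embeddings: list = create_new_embeddings(new_rows)
# Index the new values by the rows they belong to, then write them back
new_embedding_updates = pd.Series(new_embeddings, index=df.index[is_row_new])
df["embedding"].update(new_embedding_updates)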
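For a version you can run end to end, here is a sketch with a toy dataframe. The text column, the sample data, and the body of create_new_embeddings are assumptions made up for the example; a real implementation would call an embedding model in batch. One caveat: on recent pandas versions with Copy-on-Write enabled (the default from pandas 3.0), df["embedding"] returns a new object, so calling update directly on it may leave df unchanged. The sketch sidesteps that by updating a copy of the column and assigning it back.

```python
import pandas as pd


def create_new_embeddings(rows: pd.DataFrame) -> list:
    # Hypothetical stand-in for a real batched embedding call:
    # returns one small list per row, derived from the text length.
    return [[float(len(text)), 0.0] for text in rows["text"]]


# Toy data: the last two rows have no embedding yet.
df = pd.DataFrame(
    {
        "text": ["a", "bb", "ccc"],
        "embedding": [[1.0, 0.0], None, None],
    }
)

is_row_new = df["embedding"].isnull()
new_rows = df[is_row_new]
new_embeddings = create_new_embeddings(new_rows)
new_embedding_updates = pd.Series(new_embeddings, index=df.index[is_row_new])

# Update a standalone copy of the column, then assign it back, so the
# write does not depend on df["embedding"] being a view of df.
embedding_col = df["embedding"].copy()
embedding_col.update(new_embedding_updates)
df["embedding"] = embedding_col

print(df)
```

The only change from the snippet above is the last three statements; the filter, the batch call, and the indexed Series are the same.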