Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

topk_queue::finalize sorts by both score and ID (#508) #515

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

elshize
Copy link
Member

@elshize elshize commented Jan 25, 2023

The final sorting order is now by score (descending) and docid (ascending). Furthermore, std::push_heap is replaced with our own implementation to maintain consistency across standard libraries.

Fixes #508

@codecov
Copy link

codecov bot commented Jan 25, 2023

Codecov Report

Base: 93.06% // Head: 93.06% // Decreases project coverage by -0.01% ⚠️

Coverage data is based on head (a371736) compared to base (734ed9f).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head a371736 differs from pull request most recent head 6f9d4bb. Consider uploading reports for the commit 6f9d4bb to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #515      +/-   ##
==========================================
- Coverage   93.06%   93.06%   -0.01%     
==========================================
  Files          90       90              
  Lines        4497     4511      +14     
==========================================
+ Hits         4185     4198      +13     
- Misses        312      313       +1     
Impacted Files Coverage Δ
include/pisa/topk_queue.hpp 100.00% <100.00%> (ø)
include/pisa/wand_data_range.hpp 83.33% <0.00%> (-1.29%) ⬇️
include/pisa/type_safe.hpp 100.00% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link

@seanmacavaney seanmacavaney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! These changes look reasonable to me.

@elshize elshize requested a review from amallia January 26, 2023 21:52
// elements to the heap.
std::sort(m_q.begin(), m_q.end(), sort_order);

auto search_order = [](auto entry, auto score) { return entry.first > score; };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could add a comment here to explain what this is doing? I know this is unmodified behaviour, but since you are on it would be better to document the code.

};
// We have to do a full sort because it is not the exact same ordering as when pushing
// elements to the heap.
std::sort(m_q.begin(), m_q.end(), sort_order);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are switching to sort then why don't we just use stable_sort on the scores such that we maintain as second sorting condition the insertion time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure stable_sort will preserve that. I actually haven't analyzed this in detail, but are you saying that given our implementation, we can be sure that the more recently inserted entry is always above in the heap or at the same level but on the left side?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that heap_sort creates some issues with reproducibility since two documents with equal score have no guarantees on the order they will appear in the list of results.
Stable sort will keep the relative order of elements with equivalent values, fixing the reproducibility issue.

I prefer this solution because:
(1) it is easier and less error prone
(2) it resolves the issue with docID ordering not maintained while inserting and only done at the end, which could cause the problem described in the code comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what you're suggesting is to change the semantic: instead of ensuring that within the top-k results ties are broken by the docID order, you're proposing to not give any such guarantee, but rather say that the order of elements of with the same score is undefined, but deterministic, i.e., reproducible across multiple runs.

Do I understand correctly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it resolves the issue with docID ordering not maintained while inserting and only done at the end, which could cause the problem described in the code comment

You'll have to be more explicit here, not sure which issue we're resolving. From my understanding, you're solution simply changes the semantic of what we do, but maybe I'm missing something, so if you could be a bit more clear about what you mean and how it solves it, that would help.

@@ -172,6 +181,26 @@ struct topk_queue {
}
}

// We use our own function (as opposed to `std::heap_push`), to ensure that
// heap implementation is consistent across standard libraries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does to ensure that heap implementation is consistent across standard libraries mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Means that if I compile on libstdc++ 12 and you compile on libc++ 10, then both of us have the same logic, and thus if you run the same exact algorithm on the same exact collection, etc., you will get the same results.

With std::push, the implementation may change from version to version and implementation to implementation (LLVM/GNU/MS/Apple) so it's not necessarily reproducible across environments.

Using pair ordering (score, doc) woulds of course solve all of that, but that's not efficient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understand! Makes sense. Right now we are not aware of differences in the implementation of libc++ 10 and libstdc++ 12, right?

My major objection to this approach is that if we stick to this paradigm, we would need to reimplement the entire STL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My major objection to this approach is that if we stick to this paradigm, we would need to reimplement the entire STL

Normally, I would agree, but this is somewhat special circumstances. Technically speaking, we probably would want to strictly define the order (score + docid) but we're not because of efficiency. But also we are already implementing the more complex pop operation, so we might as well finish the job. It's very little effort for possibly a significant improvement for someone down the line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put it this way: you're doing research, and have some results, and then you recompile after system upgrade or something, and it doesn't match; it's not life changing, but that's very annoying imo, and I personally see no reason not to fix that with one little function. It could save someone a headache down the line.

The final sorting order is now by score (descending) and docid
(ascending). Furthermore, `std::push_heap` is replaced with our own
implementation to maintain consistency across standard libraries.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stability of document order when tied
3 participants