Metrics App & Protocol: Reviewer Scoring

This is a call to Reviewers to set the standard of output and to define how Reviews can and should be done for challenges on the Metrics app.

In the new Metrics app, the final score applied by a review has been simplified.

In the past, MetricsDAO has used a multi-heuristic rubric to assign a final score.

The new score is a Likert scale with five options:

Spam = 0
Bad = 25
Average = 50
Good = 75
Great = 100

Simplifying the scale does not mean that Reviewers should stop taking specific measures to determine what is Bad, Average, Good, and Great. The scale was simplified to give Reviewers more latitude to define their own structures that map onto it.

What should we brainstorm?

  • How should the general Reviewer network determine what is Spam, Bad, Average, Good, Great?

Please also contribute to this post about the various Reviewer networks that can be created and tapped for Challenges: Metrics App & Protocol: Reviewer Network

One possible solution would be to keep the existing 12 point scale and then map those outcomes to this scale:

0-2/12 = Spam
3-5/12 = Bad
6-8/12 = Average
9-10/12 = Good
11-12/12 = Great

This is a simple way that Reviewers could continue to use the existing method and bring it into the new scale.
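A minimal sketch of that mapping (the function name and the (label, value) return shape are my own illustration; the boundaries are the ones proposed above):

```python
# Hypothetical sketch of the proposed 12-point -> Likert mapping.
# The function name and return shape are illustrative, not part of the app.

def map_12pt_to_likert(score: int) -> tuple[str, int]:
    """Map a legacy 0-12 review score to a (label, value) Likert pair."""
    if not 0 <= score <= 12:
        raise ValueError("legacy score must be between 0 and 12")
    if score <= 2:
        return ("Spam", 0)
    if score <= 5:
        return ("Bad", 25)
    if score <= 8:
        return ("Average", 50)
    if score <= 10:
        return ("Good", 75)
    return ("Great", 100)
```

Under this mapping, a legacy 9/12 would land in the Good bucket and pay out as 75.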

Perhaps others think the value mapping from scale to scale should be done differently, or that a new set of heuristics should be established?


To keep it as frictionless as possible, in the immediate term I think mapping the 12 point scale to the Likert scale is the way to go.

Once the app is fully live, I would like to see a community call that addresses this topic; a broad discussion of possible scoring; and an opportunity for temp checks/proposals to revise the scoring scale. This will help establish broader buy-in.

But the feedback I’ve been seeing is that the specific scale matters a lot less than continuity: ensuring that everyone is treated according to the same standard, and that this standard involves genuine, skilled reviews. In other words: the specific standards matter less than how they are applied.

I’ll provide some evidence of this to continue the discussion!


Agree that our review system needs continuity (and consistency in the method of evaluation).

Mapping to the 12 point system is the path of least resistance. However, if the final scores are converted to a five-point system (great, good, average, …), then we will have a lot more ties.

  • Will having more ties require more tie breakers?
  • How will a large number of tied submissions affect the payment curve? (Say the Challenge payout gives increased rewards to the top 5%, and we then find that 15% of submissions scored ‘Great’.)
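To make the second concern concrete, here is a toy illustration with assumed numbers (100 submissions and a payout curve that boosts the top 5%; the score distribution is invented):

```python
# Illustration with assumed numbers: 100 submissions scored on the
# five-option Likert scale, and a payout curve that gives increased
# rewards to the top 5% (5 submissions).
scores = [100] * 15 + [75] * 30 + [50] * 40 + [25] * 10 + [0] * 5

bonus_slots = int(len(scores) * 0.05)   # 5 slots in the top tier
tied_at_top = scores.count(100)         # 15 submissions share the top score

# 15 'Great' submissions compete for 5 bonus slots, so some tie-breaking
# rule (or a pro-rated payout) is unavoidable.
needs_tiebreak = tied_at_top > bonus_slots
```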

I’m sorry if I should already know this, but I’d love to know why we are going with a 5-point scale. Is it a non-negotiable constraint imposed by the app design, or is it a conscious effort to simplify Peer Review?
I think a simplified Peer Review system requires giving Reviewers more discretion.

If we are aiming for simplicity, then I would like to know the following: In your considerable experience with bounties, was there ever a time when Reviewers had total discretion in determining the final score and received no guidelines? If so, I wonder if this radically simple system worked.

As opposed to a Review system distilled down to a 5-point scale, we might consider the opposite approach: a detailed evaluation using a 100 point scale. A 100 point scale might eliminate the need for tie breakers. It might also be a better fit for the payment curve.

I’m not sure what will produce the best outcomes. Maybe it is appropriate to start with a simple design and increase the complexity as and when required.

In any case, the points scale should be based on our requirements, not the other way around.


@Zook the way the protocol is designed, it can take any on-chain enforcement (scoring + payout) module. This means that anyone could write a new on-chain module and then it would need to be supported at the application level.

Some of the first modules that were created were Likert, First Come First Serve, and Top 5.

Likert enables payment to everyone who engages, based on their score relative to others.

Regarding ties, I believe this becomes less of a concern as we utilize more significant figures and get more reviews per question.
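A sketch of that point: if the final score is the average of several reviewers’ Likert values, kept to two decimal places (the averaging rule here is my assumption, not the protocol’s actual enforcement logic), submissions that would tie on a single review separate quickly:

```python
# Assumed aggregation rule for illustration: average the 0/25/50/75/100
# Likert values from all reviewers and keep two decimal places instead of
# rounding back into one of the five buckets.

def final_score(likert_values: list[int]) -> float:
    return round(sum(likert_values) / len(likert_values), 2)

# Two submissions that tie on any single review no longer tie overall:
a = final_score([75, 75, 100])   # 83.33
b = final_score([75, 100, 100])  # 91.67
```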

I would love to hear how others think the 12 point scale should map to the Likert Scale of Spam, Bad, Average, Good, Great.

Consistency is important, and setting expectations around how the scale should be used means that management of the Reviewers in that network falls back to a policy.


I do understand the desire for a final score from 0-5, but I think keeping a rubric with 3-5 categories and a score of 0 to 3 (or 1 to 3, which I think is better: Stellar, Satisfactory, Fail) per category is the right way to go and should not be eliminated. As for the specific mapping, one might want to tweak it a bit after looking at how the scores distribute, but your mapping seems fine.

I do think the rubric should be upgraded ASAP. Too much haziness across categories and descriptions of scores IMO, but I also think that we could come up with somewhat different categories.

Personally I think the final score should actually be out of 100… not 5 or 12. Scoring out of 100 creates a more normalized experience.

As I mentioned, the on-chain protocol will accept different modules and the first one we are using is a Likert scale from 1-5.

That does not mean we need to use that scale forever, but we do need to find an interim solve as we continue to find new enforcement criteria.

Sponsors may seek different kinds of outputs that require different types of evaluation.

Regarding the Rubric upgrade, that feels a bit different from this mapping. Would you like to start another thread on the forum with a suggested change to the rubric?

Spam = 0 - 24
Bad = 25 - 49
Average = 50 - 74
Good = 75 - 89
Great = 90 - 100

This wouldn’t be a bad scale, though.
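Bucketing a 0-100 final score back into the five labels could look like this (a sketch; I have treated each boundary value as the start of the next bucket so the ranges do not overlap):

```python
# Hypothetical sketch of the 0-100 -> label bucketing proposed above.
# Boundary values are treated as the start of the next bucket.

def bucket_100pt(score: int) -> str:
    """Map a 0-100 final score to one of the five Likert labels."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 25:
        return "Spam"
    if score < 50:
        return "Bad"
    if score < 75:
        return "Average"
    if score < 90:
        return "Good"
    return "Great"
```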

I would like to suggest an alternative grading scheme based on a different way of thinking.

There are a few points I feel strongly about.

  • Repeated graphs (monthly / weekly / daily)

  • Graphs that do not show anything

  • Really long dashboards that do not mean anything significant

  • Grading takes very long, and graders are paid per hour instead of per dashboard

I would like to solve these points by following the idea of the blockchain:

  • hard to calculate (or hard to generate)

  • easy to prove

Thus grading should be done negatively. Dashboards all start with 100 points.

  • 5 points removed per SQL query / graph that has errors

  • 5 points removed per insufficient explanation of a graph

  • 5 points removed for an incorrect explanation

  • 5 points removed per duplicated graph

  • 5 points removed if a SQL query is not credited correctly

  • 100 points removed for plagiarism

  • 25 points removed if the grader is unable to grade the dashboard within 15 minutes

We can also do positive points for

  • 5 points added if the dashboard is interesting to read

  • 5 points added for a tweet that accompanies the dashboard

  • a solid 8 for no mistakes

This way, the graders are just looking for mistakes in the board and it would be easier for them to tick off a checkbox to deduct points.
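Ant’s checklist could be expressed as a simple deduction table. The point values below come from the post; the names and the dict-based shape are my own illustration:

```python
# Sketch of the deduction-based grading scheme proposed above.
# Deduction/bonus amounts come from the post; identifiers are illustrative.
DEDUCTIONS = {
    "query_or_graph_error": 5,      # per SQL query / graph with errors
    "insufficient_explanation": 5,  # per graph
    "incorrect_explanation": 5,
    "duplicated_graph": 5,
    "uncredited_query": 5,
    "plagiarism": 100,
    "ungradeable_in_15_min": 25,
}
BONUSES = {
    "interesting_read": 5,
    "accompanying_tweet": 5,
}

def grade(deductions: dict[str, int], bonuses: list[str]) -> int:
    """Start at 100, subtract each flagged deduction (with counts),
    add any bonuses, and clamp the result to the 0-100 range."""
    score = 100
    for name, count in deductions.items():
        score -= DEDUCTIONS[name] * count
    for name in bonuses:
        score += BONUSES[name]
    return max(0, min(100, score))
```

For example, a dashboard with two duplicated graphs would score 90, while plagiarism zeroes out everything except any clamped bonus headroom.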

I really like the idea of taking a negative grading approach where you remove points.

I think this can even be less subjective than the heuristic based 12 point scale that has been used up to this point.

@ltirrell @kakamora @Zook what do you think about Anton’s approach?

[I also like the idea of a 100 point scale & think this is a good rating system/module to test out in upcoming builds]

@Ant I agree with the problems you identify. I might frame them a little differently:

  • How easy is it for an untrained reader to immediately grasp important & interesting points?

I also like the idea of providing Reviewers with a simple checklist that they can use. This is a good one.

To Zook’s point: in my early days as bounty coordinator for Flipside’s program, we had full discretion on how to review submissions. This “radically simple” approach worked fine at a small scale of submissions and with a very small cadre of trusted Reviewers. I do not know how well it would scale.

The broader question I’d like to pose: if we have different networks of Reviewers, should we try to enforce that all of them use the same rubric? Or should we allow different Reviewer networks to set their own scale, as long as they can articulate and defend it?

I’m in favor of this approach - with the understanding that we can emphasize core principles, i.e. “great dashboards have to be correct, load quickly, and be interesting”, and leave specific details to Reviewer networks. In this way, different Reviewers can advertise different approaches.

For example: I don’t think I’d choose Ant’s schedule above for a mega-dashboard, where presentation and extra effort is paramount and thus the “positive” scale should carry more weight than the “negative” scale. I do think it works well & that I’d be likely to choose it for “standard” analytics bounties, where simplicity and correctness matter a lot. It would be great for me to be able to select a different reviewer scale or mode for each of them, rather than apply a one-size-fits-all rubric.

This has the added benefit of encouraging people to suggest new ideas (like Ant’s scale above) that we may not have thought of or tried before. By testing out these different approaches, we can get real data and feedback about how well they perform in practice.

I believe that each network could use its own methodology and that publication of that methodology and results that follow it would give challenge launchers more confidence when choosing a Reviewer network.

  • How easy is it for an untrained reader to immediately grasp important & interesting points?

This question is extremely subjective and high level. It would be like asking a baker to bake a cake without any specific instructions. You’ll have to break it down into smaller details, for example:

  • 5 points reduction for not explaining the overview of the Harmony protocol.
  • 5 points reduction for not explaining how voting in MetricsDAO happens.
  • 25 points reduction for not answering the original question at all.

For example: I don’t think I’d choose Ant’s schedule above for a mega-dashboard, where presentation and extra effort is paramount and thus the “positive” scale should carry more weight than the “negative” scale. I do think it works well & that I’d be likely to choose it for “standard” analytics bounties, where simplicity and correctness matter a lot. It would be great for me to be able to select a different reviewer scale or mode for each of them, rather than apply a one-size-fits-all rubric.

The negative scale should always carry more weight in these scenarios. It’s easier to take points off for doing something bad.

For real-world examples, consider the penalties the government uses against us:

  • not stopping at a red light / stop sign
  • breaking the speed limit
  • being caught for vandalism

These are all acts that are easy to catch.

For positive points, it can be up to the grader to give them (as long as it’s on the rubric). These are like the extra credit your teacher gives you because you are the teacher’s pet. They take forever to be given out and you really have to be exceptional.
Real-world examples are:

  • Purple Hearts
  • Medals of Honor

They do not really help you in a general sense, only in a very niche way.