Metrics App & Protocol: Reviewer Scoring

This is a call to Reviewers to set the standard of output and to define how Reviews can and should be done for challenges on the Metrics app.

In the new Metrics app, the final score applied by a review has been simplified.

In the past, MetricsDAO has used a multi-heuristic rubric to assign a final score.

The new score is a Likert scale with 5 options:

Spam = 0
Bad = 25
Average = 50
Good = 75
Great = 100

The adjustment of the scale does not mean that Reviewers should stop taking specific measures to determine what is Bad, Average, Good, and Great. The scale was simplified to increase Reviewers' ability to define their own structures that map onto it.

What should we brainstorm?

  • How should the general Reviewer network determine what is Spam, Bad, Average, Good, Great?

Please also contribute to this post about the various Reviewer networks that can be created and tapped for Challenges: Metrics App & Protocol: Reviewer Network

One possible solution would be to keep the existing 12-point scale and then map those outcomes to this scale:

0-2/12 = Spam
3-5/12 = Bad
6-8/12 = Average
9-10/12 = Good
11-12/12 = Great

This is a simple way that Reviewers could continue to use the existing method and bring it into the new scale.
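
To make the mapping concrete, here is a minimal sketch in code, using the bucket boundaries proposed above (the function name and constants are illustrative only):

```python
# Illustrative only: maps an existing 12-point rubric total to the proposed Likert buckets.
LIKERT_VALUES = {"Spam": 0, "Bad": 25, "Average": 50, "Good": 75, "Great": 100}

def twelve_point_to_likert(score: int) -> str:
    """Convert a 0-12 rubric total to a Likert label using the mapping above."""
    if not 0 <= score <= 12:
        raise ValueError("expected a score between 0 and 12")
    if score <= 2:
        return "Spam"
    if score <= 5:
        return "Bad"
    if score <= 8:
        return "Average"
    if score <= 10:
        return "Good"
    return "Great"

# Example: a submission scoring 9/12 lands in "Good", i.e. a final score of 75.
assert twelve_point_to_likert(9) == "Good"
assert LIKERT_VALUES[twelve_point_to_likert(9)] == 75
```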

Perhaps others think the value mapping from one scale to the other should be done differently, or that a new set of heuristics should be established?


To keep it as frictionless as possible, in the immediate term I think mapping the 12 point scale to the Likert scale is the way to go.

Once the app is fully live, I would like to see a community call that addresses this topic; a broad discussion of possible scoring; and an opportunity for temp checks/proposals to revise the scoring scale. This will help establish broader buy-in.

But - the feedback I've been seeing is that the specific scale matters a lot less than continuity: ensuring that everyone is treated according to the same standard, and that this standard involves genuine, skilled reviews. In other words: the specific standards matter less than how they are being applied.

I'll provide some evidence of this to continue the discussion!


Agree that our review system needs continuity (and consistency in the method of evaluation).

Mapping to the 12-point system is the path of least resistance. However, if the final scores are converted to a five-point system (Great, Good, Average, ...), then we will have a lot more ties.

  • Will having more ties require more tie breakers?
  • How will a large number of tied submissions affect the payment curve? (Say the Challenge Payout gives increased rewards to the top 5%, but then we find that 15% of submissions scored 'Great'; a rough sketch of this follows below.)
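
To make the concern concrete, here is a rough sketch with made-up scores (the numbers and helper below are purely illustrative): collapsing a 12-point spread into five buckets can leave several submissions tied at 'Great' even though a top-5% tier only has room for one.

```python
from collections import Counter

# Hypothetical 12-point scores for 20 submissions (made-up data, for illustration only).
raw_scores = [12, 11, 11, 10, 10, 9, 9, 9, 8, 8, 8, 7, 7, 6, 6, 5, 5, 4, 3, 2]

def to_likert(score: int) -> str:
    """Bucket a 0-12 score using the mapping proposed earlier in the thread."""
    if score <= 2:
        return "Spam"
    if score <= 5:
        return "Bad"
    if score <= 8:
        return "Average"
    if score <= 10:
        return "Good"
    return "Great"

buckets = Counter(to_likert(s) for s in raw_scores)
top_slots = max(1, round(0.05 * len(raw_scores)))  # a "top 5%" payout tier
print(f"{buckets['Great']} submissions tied at 'Great' (15%) vs {top_slots} top-5% slot(s)")
```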

I'm sorry if I should already know this, but I'd love to know why we are going with a 5-point scale. Is it a non-negotiable constraint imposed by the app design, or is it a conscious effort to simplify Peer Review?
I think a simplified Peer Review system requires giving Reviewers more discretion.

If we are aiming for simplicity, then I would like to know the following: In your considerable experience with bounties, was there ever a time when Reviewers had total discretion in determining the final score and received no guidelines? If so, I wonder if this radically simple system worked.

As opposed to a Review system distilled down to a 5-point scale, we might consider the opposite approach: a detailed evaluation using a 100 point scale. A 100 point scale might eliminate the need for tie breakers. It might also be a better fit for the payment curve.

I'm not sure what will produce the best outcomes. Maybe it is appropriate to start with a simple design and increase the complexity as and when required.

In any case the points scale should be based on our requirements, not the other way around.


@Zook the protocol is designed to accept any on-chain enforcement (scoring + payout) module. This means that anyone could write a new on-chain module; it would then need to be supported at the application level.

Some of the first modules that were created were Likert, First Come First Serve, and Top 5.

Likert enables payment to everyone who engages, based on their score relative to others.

Regarding ties, I believe this becomes less of a concern as we utilize more significant figures and get more reviews per question.
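
As a rough illustration (not the module's exact on-chain math; the function name and the pro-rata rule below are assumptions), "payment based on relative score" could look something like a proportional split of the reward pool:

```python
def pro_rata_payout(scores: dict[str, float], pool: float) -> dict[str, float]:
    """Split a reward pool in proportion to each submitter's score.

    Assumed, simplified rule for illustration only; the actual on-chain
    Likert enforcement module may distribute rewards differently.
    """
    total = sum(scores.values())
    if total == 0:
        return {who: 0.0 for who in scores}  # everything scored Spam: nothing to pay out
    return {who: pool * s / total for who, s in scores.items()}

# Example: three submissions scored Great (100), Good (75), and Average (50).
print(pro_rata_payout({"a": 100, "b": 75, "c": 50}, pool=900.0))
# -> {'a': 400.0, 'b': 300.0, 'c': 200.0}
```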

I would love to hear how others think the 12 point scale should map to the Likert Scale of Spam, Bad, Average, Good, Great.

Consistency is important, and setting expectations around how the scale should be used means that management of the Reviewers in that network falls back to a policy.


I do understand the desire for a final score from 0-5, but I think keeping a rubric with 3-5 categories and a score of 0 to 3 per category (or 1 to 3, which I think is better: stellar, satisfactory, fail) is the right way to go and should not be eliminated. As to the specific mapping, one might want to tweak it a bit after looking at how the scores distribute, but your mapping seems fine.

I do think the rubric should be upgraded ASAP. There is too much haziness across categories and score descriptions IMO, and I also think we could come up with somewhat different categories.

Personally I think the final score should actually be out of 100, not 5 or 12. Scoring out of 100 creates a more normalized experience.

As I mentioned, the on-chain protocol will accept different modules and the first one we are using is a Likert scale from 1-5.

That does not mean we need to use that scale forever, but we do need an interim solution while we continue to find new enforcement criteria.

Sponsors may seek different kinds of outputs that require different types of evaluation.

Regarding the Rubric upgrade, that feels a bit different from this mapping. Would you like to start another thread on the forum with a suggested change to the rubric?

Spam = 0-24
Bad = 25-49
Average = 50-74
Good = 75-89
Great = 90-100

This won't be a bad scale, though.

I would like to suggest an alternative grading scheme, based on a different way of thinking.

There are a few points I feel strongly about.

  • Repeated graphs (monthly/weekly/daily)

  • Graphs that do not show anything

  • Really long dashboards that do not mean anything significant

  • Graders take very long and are paid per hour instead of per dashboard

I would like to solve these points by trying to follow the idea behind the blockchain:

  • hard to calculate (or hard to generate)

  • easy to prove

Thus grading should be done negatively. Dashboards all start with 100 points.

  • 5 points removed per SQL query / graph that has errors

  • 5 points removed per insufficient explanation of a graph

  • 5 points removed for an incorrect explanation

  • 5 points removed per duplicated graph

  • 5 points removed if a SQL query is not credited correctly

  • 100 points removed for plagiarism

  • 25 points removed if the grader is unable to grade the dashboard within 15 minutes

We can also award positive points for:

  • 5 points added if the dashboard is interesting to read

  • 5 points added for a tweet that accompanies the dashboard

  • a solid 8 for no mistakes

This way, graders are just looking for mistakes in the dashboard, and it is easier for them to tick off a checkbox to deduct points.
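
To sketch how this could look as a checklist a grader ticks through (the key names and the helper below are illustrative, not a spec):

```python
# Flat deductions per ticked checkbox, mirroring the list above (illustrative names).
DEDUCTIONS = {
    "query_or_graph_error": 5,
    "insufficient_explanation": 5,
    "incorrect_explanation": 5,
    "duplicated_graph": 5,
    "query_not_credited": 5,
    "plagiarism": 100,
    "ungradable_in_15_minutes": 25,
}
BONUSES = {
    "interesting_to_read": 5,
    "accompanying_tweet": 5,
}

def grade(ticked: dict[str, int], bonuses: frozenset[str] = frozenset()) -> int:
    """Start at 100, subtract a flat amount per ticked item, then add any bonuses."""
    score = 100
    for item, count in ticked.items():
        score -= DEDUCTIONS[item] * count
    for bonus in bonuses:
        score += BONUSES[bonus]
    return max(0, min(100, score))

# Example: two duplicated graphs and one insufficient explanation, plus an accompanying tweet.
print(grade({"duplicated_graph": 2, "insufficient_explanation": 1},
            bonuses=frozenset({"accompanying_tweet"})))  # -> 90
```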

I really like the idea of taking a negative grading approach where you remove points.

I think this can even be less subjective than the heuristic-based 12-point scale that has been used up to this point.

@ltirrell @kakamora @Zook what do you think about Anton's approach?

[I also like the idea of a 100 point scale & think this is a good rating system/module to test out in upcoming builds]

@Ant I agree with the problems you identify. I might frame them a little differently:

  • How easy is it for an untrained reader to immediately grasp important & interesting points?

I also like the idea of providing Reviewers with a simple checklist that they can use. This is a good one.

To Zook's point: in my early days as bounty coordinator for Flipside's program, we had full discretion on how to review submissions. This "radically simple" approach worked fine for a small scale of submissions and with a very small cadre of trusted Reviewers. I do not know how well it would scale.

The broader question I'd like to pose: if we have different networks of Reviewers, should we try to enforce that all of them use the same rubric? Or should we allow different Reviewers to define their own scale, as long as they can defend and articulate it?

I'm in favor of this approach - with the understanding that we can emphasize core principles, e.g. "great dashboards have to be correct, load quickly, and be interesting", and leave specific details to Reviewer networks. In this way, different Reviewers can advertise different approaches.

For example: I don't think I'd choose Ant's schedule above for a mega-dashboard, where presentation and extra effort are paramount and thus the "positive" scale should carry more weight than the "negative" scale. I do think it works well & that I'd be likely to choose it for "standard" analytics bounties, where simplicity and correctness matter a lot. It would be great for me to be able to select a different reviewer scale or mode for each of them, rather than apply a one-size-fits-all rubric.

This has the added benefit of encouraging people to suggest new ideas (like Ant's scale above) that we may not have thought of or tried before. By testing out these different approaches, we can get real data and feedback about how well they perform in practice.

I believe that each network could use its own methodology, and that publishing that methodology along with the results that follow from it would give challenge launchers more confidence when choosing a Reviewer network.

  • How easy is it for an untrained reader to immediately grasp important & interesting points?

This question is extremely subjective and high-level. In programming terms, this would be like asking a baker to bake a cake without any specific instructions. You'll have to break it down into smaller details, for example:

  • 5-point reduction for not explaining the overview of the Harmony protocol.
  • 5-point reduction for not explaining how voting in MetricsDAO happens.
  • 25-point reduction for not answering the original question at all.

For example: I don't think I'd choose Ant's schedule above for a mega-dashboard, where presentation and extra effort are paramount and thus the "positive" scale should carry more weight than the "negative" scale. I do think it works well & that I'd be likely to choose it for "standard" analytics bounties, where simplicity and correctness matter a lot. It would be great for me to be able to select a different reviewer scale or mode for each of them, rather than apply a one-size-fits-all rubric.

The negative scale should always carry more weight in these scenarios. It's easier to take points off for doing something bad.

For real-world examples, look at what the government uses against us:

  • not stopping at a red light / stop sign
  • breaking the speed limit
  • being caught for vandalism

These are all acts that are easy to catch.

Positive points can be up to the grader to give (as long as they are on the rubric). These are like the extra credits that your teacher gives you because you are the teacher's pet. They take forever to be given out and you really have to be exceptional.
Real-world examples are things like:

  • Purple Hearts
  • Medals of Honor

They do not really help you in a general sense, but only in a very niche way.

Personally I really like the idea that @antonyip brought up of subtracting instead of adding, for the way it reduces subjectivity and can make reviews more consistent across reviewers.

In practice: when a peer reviewer is faced with a simple 5-category Likert scale, and especially if they happen to be paid per dashboard and not hourly, what's to stop them from assigning the category holistically, based on their overall impression of the dashboard, after maybe only a few seconds' worth of evaluating?

And if so, a) would MetricsDAO be content with incentivizing this type of review?

And b) should this become the norm in practice, would it matter how the 12-point/existing rubric converts to the 5 categories on paper?

I think this is a very interesting idea. From what I am gathering, this would be sort of a way to punish people for throwing down too many charts and graphs. I think in practice this wouldn't work too well; a grading rubric that is broken down into more subcategories, with a clearer distinction of points per category, is ultimately a better way to get consistent reviews and scores. I also think a major effort is needed to communicate the need for BREVITY. Bounty hunters are, in general, under the impression that "above and beyond" means throwing down dozens of charts, and that, in order to fulfill the need to have enough words, they are pasting huge amounts of text from some source or another.
I suggest that we impose a strict rule of an 8-chart minimum. I also suggest that the rubric contain wording that doesn't just stress thoroughness, but also conciseness. We also need to better define what we do and don't want from analysts in terms of "above and beyond". I think it should build on the insights gleaned from the required analysis, and that when people read a dashboard, they should be able to do so in five minutes. We need to stress quality over quantity, in a big, big way.

I personally hate the idea of reviewers having their own scale. This is completely unfair to the analyst, who then doesn't know what is expected of them. This is part of why I am extremely skeptical of the reviewer network idea. I think it will lead to reduced quality and more confusion, even if dozens of competing networks of rugged, individualistic reviewers form. I also think that in practice, the subtraction method would be too difficult to implement. Ant listed many examples of things to deduct for, and the list is far from complete. Do we only have the choice to subtract 5 or not subtract 5? It can certainly become a gray area. For example, there ARE times when using a daily and a weekly chart for the same data might make sense.

Totally disagree with your statements.

  • I would rather suggest a max of 8 charts instead of a min, as we want analysts to be able to get to the point quickly and concisely. (The current generation has a short attention span, and if you don't get your point across quickly, no one's gonna come back to read.)

Do we only have the choice to subtract 5 or not subtract 5?

Wording can be changed to "subtract 3-5 points for more than 8 charts", but this violates the concept I stated above, where grading should be easy to execute. Problems like deciding when to deduct 3 points and when to deduct 5 will occur. I don't want the grader making these kinds of calls, so a flat point deduction would be way better in this scenario.

Furthermore, when 3 graders are grading the same article, the scores should be exactly the same (in reality this never happens because of judgement calls). Exactly matching scores would actually reinforce the fact that scoring is fair, even though 3 people marked the same submission independently (the concept of reaching the same consensus in a blockchain).

As a final benefit, graders are also able to stop marking once 50 points have been deducted, as no payment will be made anyway for sub-par work (this saves graders time).

Using a simplified and strict scoring system would hopefully improve the turnaround time needed to grade all the submissions, and it would be fairer towards the DAO and graders if they are paid per article rather than on an hourly basis.
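
As a rough sketch of those last two points together, assuming the same flat 5-point deduction per mistake and a 50-point cutoff (all names and numbers here are illustrative):

```python
FLAT_DEDUCTION = 5
CUTOFF = 50  # stop grading once this many points have been deducted

def grade_with_early_stop(mistakes_found: int) -> int:
    """Flat deductions are deterministic: any grader who finds the same mistakes
    reaches the same score, and grading can stop once the cutoff is hit."""
    deducted = 0
    for _ in range(mistakes_found):
        deducted += FLAT_DEDUCTION
        if deducted >= CUTOFF:
            break  # sub-par work: no payment will be made, no need to keep marking
    return 100 - deducted

# Three independent graders ticking the same 4 mistakes all land on exactly 80.
assert {grade_with_early_stop(4) for _ in range(3)} == {80}
```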