natural language processing blog
NAACL 2015 has just passed; NAACL 2013 is long in the past. One perk of being a program chair is that you get to play with data. In this post I'd like to look at two pieces of data: one related to author response, and one related to review quality assessment.

tl;dr: Overall, I think author response is useless, except insofar as it can be cathartic to authors and thereby provide some small psychological benefit. And in general, people don't seem that unhappy with their papers' reviews, largely independent of the outcome of the paper (modulo the selection bias of those who responded).

On Author Response

NAACL, like other conferences, has for some time permitted
authors to respond to their reviewers, ostensibly to correct errors, but more often simply to argue their position. Many people I've talked to who favor author response claim that it made the difference for some paper between being rejected (pre-response) and accepted (post-response). Of course this is unknowable, because decisions aren't made pre-response; what these authors are reporting is their guess at accept/reject before the response period versus the actual accept afterward. My own internal guesses of accept/reject are frequently off the mark, so you can imagine I don't find this argument especially compelling.

To give a sense of how hard this prediction is, here is a plot showing whether a paper got accepted or not as a function of its mean overall score. The x-axis is the cumulative distribution of papers (there were more rejects than accepts, so the lengths are normalized to percentages) and the y-axis is the mean overall score for the paper. Focusing just on the solid lines (long papers), you can see that there were one or two papers with average scores of 3.6 that got rejected, and a handful of papers with average scores of 3.0 that got accepted. If you instead look at "probability of accept" given mean overall score, what you see is essentially this: papers with a score of 2.8 or lower are almost certainly rejected, papers at 3.9 or higher are likely accepted, and around 3.2 it's a complete toss-up.
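To make the "probability of accept given mean score" idea concrete, here is a minimal sketch of how one could estimate it by bucketing scores. The data here is synthetic (the real review data isn't public), and all variable names are my own invention:

    import numpy as np

    # Synthetic stand-in for the real (non-public) data: one mean overall
    # score and one accept/reject decision per paper.
    rng = np.random.default_rng(0)
    mean_scores = np.clip(rng.normal(3.0, 0.6, size=500), 1.0, 5.0)
    accepted = rng.random(500) < 1.0 / (1.0 + np.exp(-4.0 * (mean_scores - 3.2)))

    # Estimate P(accept | mean score) with 0.2-wide score buckets.
    edges = np.arange(1.0, 5.2, 0.2)
    bucket = np.digitize(mean_scores, edges)
    for b in np.unique(bucket):
        mask = bucket == b
        if mask.sum() >= 10:  # skip nearly empty buckets
            print(f"score ~ {edges[b - 1]:.1f}: "
                  f"P(accept) = {accepted[mask].mean():.2f}  (n = {mask.sum()})")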
Now, you might ask: who chose to respond? Perhaps people with clear outcomes elected not to respond at all. Here are the numbers for that. (This is just for long papers; it looks the same for short ones.) Each dot is a paper: red dots are rejects, blue are accepts, and open circles mean no response. Sure, people with scores below 3 responded less frequently than those with scores above 3, but even the top-scoring papers all had some response. There were a few "hopeful" papers with scores below 2.0 that responded, though it certainly didn't help.

But wait, you might say: maybe those papers with an average score of 1.33 submitted a response and their final scores rose to 4.5? Okay, first of all, the scores plotted above are final scores, not pre-response scores. But we can certainly look at how scores changed between the initial reviews and the final reviews. (Note that this conflates two things: author response and reviewer discussion. We'll come back to that later.) Here we again have one dot per paper (randomly jittered with small variance so they're all visible). Along the x-axis is the original score of the paper; along the y-axis is how much that score changed between the initial review and the final review.

Overall, the average absolute change across all papers was 0.1, and the vast majority (87%) didn't change at all. And note: scores were almost as likely to go down as up. (Several scores dropped with no response at all, which I find troubling, except in cases where reviewers specifically said "I'm giving the benefit of the doubt, but need a clarification here.")

You might also say: well, maybe the scores didn't change much, but the review text did. Again, it's quite rare. Of 430 reviews (long papers only), 3 reviews shrank in length (by ~50 words), 46 grew by at most 100 words, and 49 grew by over 100 words (wow, impressive!). But for 80% of them, the review didn't change at all.

Now, returning to the question of whether changes were due to author response (AR) or to reviewer discussion: this is of course hard to de-conflate, but we can look at a similar plot as a function of discussion instead of AR. Here, a few things stand out. First, there are slightly more score changes as a function of discussion than of AR. Many of those worrying dots below y=0 that had no response appear to have dropped because of discussion. Also, a few scores that went up (on the low-scoring, almost-certainly-rejected papers) did so with no discussion, suggesting that AR did change those scores. But around the x=3.2 region (the true borderlines), almost all of the papers with score changes had discussion, though many had discussion and no score change; compare this to the x=3.2 region in the AR plot, where everything has a response and still most scores don't change.

An obvious objection at this point is that we still have no idea whether the decision actually changed as a function of the response. This is a causality question we cannot run the counterfactual for, so we're not going to get a strong answer. However, we can look at the following: on the x-axis we have the final score of the paper, and on the y-axis we have (a smoothed version of) the probability that the paper is accepted. The two nearly identical curves (red and blue) are blue = all long papers and red = all long papers whose authors submitted any response; in this view, there really is almost no difference. The slightly different curve (the black one) corresponds only to papers where the reviewers engaged in a discussion. Around the critical point (scores in the 3-3.5 range), discussion uniformly lowered the odds of acceptance, which is of course not surprising to anyone who has ever participated in these discussions. You should take all this with a grain of salt, since all three curves are within one standard deviation of the blue curve (the dotted lines).
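For the curious, here is roughly how one might produce such a smoothed curve: Nadaraya-Watson (Gaussian-kernel) smoothing of the accept indicator against final score, with a bootstrap estimate of the standard-deviation band. This is just a sketch with made-up data and a made-up bandwidth, not the actual analysis code:

    import numpy as np

    def smoothed_accept_prob(scores, accepted, grid, bandwidth=0.25):
        # Nadaraya-Watson estimate of P(accept | score) at each grid point.
        w = np.exp(-0.5 * ((grid[:, None] - scores[None, :]) / bandwidth) ** 2)
        return (w * accepted).sum(axis=1) / w.sum(axis=1)

    rng = np.random.default_rng(1)
    scores = np.clip(rng.normal(3.0, 0.6, size=400), 1.0, 5.0)  # synthetic
    accepted = (rng.random(400) < 1.0 / (1.0 + np.exp(-4.0 * (scores - 3.2)))).astype(float)

    grid = np.linspace(2.0, 4.5, 50)
    curve = smoothed_accept_prob(scores, accepted, grid)

    # Bootstrap standard deviation of the curve (the "dotted lines").
    boot = []
    for _ in range(200):
        idx = rng.integers(0, len(scores), size=len(scores))
        boot.append(smoothed_accept_prob(scores[idx], accepted[idx], grid))
    band = np.stack(boot).std(axis=0)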
Finally, there is of course the possibility that response and discussion are strongly correlated: that is, a response should presumably spark a discussion. This turns out to be mostly false (judging by the plot above), but just to drive the point home, here is the data. The x-axis is the amount of discussion and the y-axis is the length of the response. See all those dots along x=0? Those are the papers whose authors responded (in one case with a 2,000-word essay!) and for which there was absolutely no discussion. And most of them are rejected papers.
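If you wanted to check this sort of correlation claim yourself, a sketch (with hypothetical word counts, since the real ones aren't published) might look like this:

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(2)
    # Hypothetical per-paper word counts of author response and reviewer discussion.
    response_words = rng.integers(0, 800, size=300)
    discussion_words = rng.integers(0, 400, size=300) * (rng.random(300) < 0.5)

    rho, p = spearmanr(response_words, discussion_words)
    print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

    # The dots along x = 0: authors who responded but got no discussion at all.
    ignored = (response_words > 0) & (discussion_words == 0)
    print(f"{ignored.sum()} of {ignored.size} responses prompted no discussion")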
On Review Usefulness

Some of you may remember that we ran a post-conference survey on how much you liked the reviews you received. We actually did this at a later point in time because we didn't get our act together earlier, but today I'll claim we did it later so that authors were no longer emotionally distraught by their reviews. We presented each corresponding author with their old reviews, but hid the scores (I doubt many went back to look). They were then asked, for each review, how informative it was and how helpful it was. They could choose 0 = not, 1 = sort of, or 2 = very.

The first hypothesis one might have (certainly I had it!) is that people "liked" reviews that were favorable to them and "disliked" reviews that were not. (I've heard this called the craigslist effect: if the deal goes through, everyone's happy; otherwise, everyone's unhappy.) This appears not actually to be true. Here are the results:

                informativeness   helpfulness
    rejects     1.24 +- 0.57      1.25 +- 0.72
    accepts     1.38 +- 0.48      1.27 +- 0.70

(+- is one standard deviation.) Basically, authors of accepted papers didn't find their reviews much more informative (+0.14) or helpful (+0.02) than authors of rejected papers. [Note: the sample size is 110 rejects and 86 accepts, so there was a higher response rate on accepts, given that the acceptance rate is about 25%.]
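For concreteness, the table above is just a grouped mean and standard deviation. Assuming one survey row per (paper, review) pair, it could be computed as below; the column names are my guess at a layout, not the actual data format:

    import pandas as pd

    # Toy rows standing in for the real survey responses.
    survey = pd.DataFrame({
        "outcome":         ["reject", "reject", "accept", "accept"],
        "informativeness": [1, 2, 1, 2],   # 0 = not, 1 = sort of, 2 = very
        "helpfulness":     [0, 2, 1, 2],
    })

    table = (survey
             .groupby("outcome")[["informativeness", "helpfulness"]]
             .agg(["mean", "std"]))
    print(table)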
You might suspect, though, that how much an author likes a review correlates with that reviewer's score rather than with the overall decision. We can look at that too (recall that authors didn't see the scores when filling out this survey):

    rating   avg(informativeness)   avg(helpfulness)
    0        3.04 +- 0.89           3.23 +- 0.87
    1        3.22 +- 0.91           3.15 +- 0.93
    2        3.21 +- 0.97           3.22 +- 0.96
Here, the first column (rating) is how the author rated the review, and the other two columns give the average overall score awarded by reviews that received that rating. For example, among reviews rated informative=2, the average overall score those reviews gave was 3.21. Again, there's basically no effect.
Just to feel good about ourselves: authors do, by and large, seem not-too-unhappy with their reviews. Here's a histogram of informativeness for rejected papers:

    0: ############## (15%)
    1: ############################################ (44%)
    2: ######################################## (41%)
Here's the same histogram for accepted papers:

    0: ######## (8%)
    1: ############################################## (46%)
    2: ############################################## (46%)

The only real difference is that rejected papers had twice the frequency of informative=0, but the rest is pretty close.
(And in case you're wondering, it's basically the same for helpfulness. As a friend pointed out, the inability to separate different aspects of the same thing is typical when people are asked for intuitive judgments that don't involve much deliberation: 70% of responses gave the same rating for helpfulness and informativeness, another 29% gave ratings within one point of each other, and only 0.8% said inf=2, help=0 while 0.6% said inf=0, help=2.)
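These agreement numbers are straightforward to recompute from the paired ratings; a minimal sketch with made-up ratings:

    import numpy as np

    rng = np.random.default_rng(3)
    # Made-up paired ratings on the 0/1/2 scale.
    inf = rng.integers(0, 3, size=250)
    hlp = np.clip(inf + rng.integers(-1, 2, size=250), 0, 2)

    same = (inf == hlp).mean()
    off_by_one = (np.abs(inf - hlp) == 1).mean()
    opposite = (((inf == 2) & (hlp == 0)) | ((inf == 0) & (hlp == 2))).mean()
    print(f"same: {same:.0%}, off by one: {off_by_one:.0%}, opposite: {opposite:.1%}")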
Overall, I think of complaining about review(s|ers) as a staple of hallway conversation, kind of an attractor state when you run out of other things to talk about. And we all remember the worst reviews we've received, while often forgetting the good ones that actually
did make our work better. As Lillian Lee said at NAACL this year, it's usually our own fault when reviewers don't understand our papers.
Also remember that all of these numbers carry a lot of randomness, for instance in light of the NIPS experiment. Despite that, it really does seem that we aren't too unhappy with our reviews. Only 11% of reviews were considered uninformative, and while I think we should strive to push that number down, I don't think it's terrible, especially considering that 19% of papers had an average score of 2.0 or less, which
basically means they were submitted without standing a chance (for a variety of reasons).

Overall, I don't think author response has an effect on outcomes large enough to be worth the time and energy it takes. It may make authors feel better, but that's short-lived when their paper most likely meets the same fate after the response as before. I do think discussion is useful, and I'd rather see us make more time for discussion and cut out the response period, perhaps by reducing the number of papers any given area chair has to handle. We should recognize, though, that discussion often serves to lower the probability of acceptance, probably because it's easier to argue against a paper than for it, and reviewers don't really have much incentive to defend a paper they like, leading to a veto effect.