


MARCH 29, 1999

B-SCHOOL NEWS

How the Computer Grades Your Essays
In a Q&A, the father of E-Rater, GMAC's Fred McHale, explains what goes on inside that box



The Graduate Management Admission Council (GMAC), the nonprofit organization that owns and administers the GMAT exam required for students applying to business school, is increasingly marrying technology to testing. In 1997, the outfit introduced the GMAT CAT (computer-adaptive test), an electronic version of its flagship Graduate Management Admission Test. And in February of this year, it unveiled E-Rater, software that grades the essay section of the GMAT, known as the Analytical Writing Assessment (AWA) [See BW Online, 1/21/99, "This Is E-Rater. It'll Be Scoring Your Essay Today"].

Not surprisingly, there's now plenty of concern among the 230,000 annual GMAT registrants who wonder just how well a computer can actually evaluate written material, especially when it comes to discerning nuances and understanding the context of an argument. Anxieties about the validity of E-Rater's scoring are only being heightened by today's increasingly competitive B-school admissions standards. Average GMAT scores at Business Week's Top 25 B-schools rose from 653 in 1997 to 667 in 1998, while acceptance rates fell from 22% to 21% over that same period.

By now, nearly 25,000 prospective B-school students have had the E-Rater software partially score their two AWA essays. Yet many are confused about how E-Rater actually works -- especially given some of the questionable advice dispensed by test-preparation companies about the software. To get a better handle on GMAC's latest technologies, Business Week Online's Nadav Enbar recently spoke with the brain behind both E-Rater and the CAT, GMAC Vice-President for Assessment & Research Fred McHale. Here is an edited transcript of that conversation:


Q: Are the GMAT's 230,000 annual test-takers getting a little uneasy with the technology they're being forced to grapple with -- especially given today's rising B-school admissions standards?
A: Uneasy? Well, I mean, I can't say from personal experience because I'm not talking day to day with applicants. But I can tell you from the schools that we've talked to who are talking with the applicants, that they expect their students to be able to handle technology in the classroom. In fact, if an MBA program today does not have some technology built into the curriculum, a student probably isn't going to pick that school. [Students] expect to have access to technology at B-school, and they should expect it in the testing, as well. On the whole, it's not something that they're apprehensive about as a group. I'm sure, though, that there are people, individually, who are.

Q: Has E-Rater generated the same kind of public resistance that the CAT stirred upon its unveiling?
A: We haven't had the kind of reaction to E-Rater from our applicants that we did with computer adaptive testing. The negative reactions have been minimal. Yes, some people have written "I will not be scored by a computer" into their essay response. But that's been very minimal compared to [the transition to] computer adaptive testing. The change in how you behave in taking an adaptive test was so different that people were more apprehensive about that.

Q: How many people have actually had their analytical writing assessments (AWA) graded by the E-Rater?
A: At this point, about 25,000.

Q: Has there been a deviation in scores since that went into effect?
A: None whatsoever.

Q: Previously, each of the two essays in the AWA was scored by two "expert" readers on a 1 to 6 scale (the AWA score is separate from the GMAT, which has a scale of 200 to 800). E-Rater effectively takes one human reader out of the loop. If there's a marked discrepancy in human and E-Rater scores, a second human reader reviews the essays and acts as a tie-breaker. Have you had to bring in a second person frequently to determine a score?
A: Typically, between E-Rater and a human reader, about 8% to 10% of those scored essays have to go to a second [human] reader. That's the same rate as it was when two [human] readers graded the AWA. E-Rater's scores match up 97% of the time with those of either the first or second reader. So 2% to 3% of the scores are still discrepant and have to go on to further resolution.
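
[Editor's illustration: The resolution procedure described above can be sketched in a few lines of Python. This is a minimal sketch, not GMAC's actual procedure. The article does not say how large a "marked discrepancy" is or how the final score is computed, so the threshold of one point, the averaging of agreeing scores, and the function names are all assumptions made for illustration.]

```python
from typing import Optional

# Assumption: the article does not define a "marked discrepancy".
# Here, scores more than 1 point apart are treated as discrepant.
DISCREPANCY_THRESHOLD = 1


def needs_second_reader(human_score: int, erater_score: int,
                        threshold: int = DISCREPANCY_THRESHOLD) -> bool:
    """Return True when the human and E-Rater scores differ markedly."""
    return abs(human_score - erater_score) > threshold


def resolve_essay_score(human_score: int, erater_score: int,
                        second_reader_score: Optional[int] = None,
                        threshold: int = DISCREPANCY_THRESHOLD) -> float:
    """Combine one human score and one E-Rater score for a single essay."""
    for score in (human_score, erater_score):
        if not 1 <= score <= 6:
            raise ValueError("AWA essay scores run from 1 to 6")
    if not needs_second_reader(human_score, erater_score, threshold):
        # Assumption: agreeing scores are simply averaged.
        return (human_score + erater_score) / 2
    if second_reader_score is None:
        raise ValueError("marked discrepancy: a second human reader is required")
    # Assumption: the second human reader's score acts as the tie-breaker.
    return float(second_reader_score)
```

Under these assumptions, an essay scored 5 by the human reader and 4 by E-Rater needs no second reader, while one scored 6 and 3 would be among the 8% to 10% sent on for review.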

Q: There are 175 possible AWA questions. If I understand it correctly, E-Rater has been fed thousands of essay responses to a particular question and is then able to discern between a good and not-so-good essay?
A: Yes. We have not thousands but hundreds of essay responses on any one topic. There are expert readers that read those hundreds of essays and group them into: six, five, four, three, two, and one [score groupings]. They actually discard essays that are borderline -- they only want the ones that definitely fit into a specific category. Then they build scoring rubrics and say, people who scored a six tended to organize their ideas in this way, they used these kinds of phrases and words, and so forth. That rubric was always used to train new readers. Now we use that rubric, and we program it into E-Rater. So when E-Rater is looking at a free-text essay, it then takes it and breaks it up and parses it into the pieces that it needs, and then applies these scoring rubrics.
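
[Editor's illustration: The grouping step McHale describes, in which borderline essays are thrown out and the rest are sorted by score, might look roughly like the sketch below. The article does not explain how readers decide that an essay is borderline. This sketch assumes, purely for illustration, that each essay carries scores from several expert readers and that only unanimous essays are kept. The function name and data layout are hypothetical.]

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_training_groups(
    rated_essays: Iterable[Tuple[str, List[int]]]
) -> Dict[int, List[str]]:
    """Group essays by score, keeping only clear-cut examples.

    rated_essays holds (essay_text, reader_scores) pairs.
    Assumption: an essay is "borderline" when its readers disagree.
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for text, scores in rated_essays:
        # Keep the essay only if every reader gave it the same score.
        if scores and len(set(scores)) == 1:
            groups[scores[0]].append(text)
    return dict(groups)
```

Each surviving group would then serve as the reference set for one score level when the rubric is drawn up.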

Q: Does that mean that it's looking for specific words? A succession of words or phrases?
A: Sometimes it's looking for specific words based on how people responded to the question before. You can use a certain word or phrase and, if people who scored a six tended to use that, it gets a positive weight. If people who scored a one tended to use that word or phrase, it could be a negative. So that's part of it, but it's not the only thing E-Rater looks at. It also has to look at, for example, where a word or phrase appeared. Was it at the beginning? At the end? How were your ideas organized in terms of that paragraph? It's looking at context as well. And it's all based on the rubric of what the [human] readers have established.
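
[Editor's illustration: The idea of positive and negative word weights can be made concrete with a small sketch. This is not E-Rater's actual formula, which the article does not give. It assumes a simple smoothed log-ratio of how many high-scoring versus low-scoring essays contain each word, and it deliberately ignores the position and organization features McHale mentions. All names are hypothetical.]

```python
import math
from collections import Counter
from typing import Dict, List


def word_weights(high_essays: List[str], low_essays: List[str]) -> Dict[str, float]:
    """Weight each word by how strongly it signals a high-scoring essay."""
    # Count the number of essays in each group that contain the word.
    high = Counter(w for essay in high_essays for w in set(essay.lower().split()))
    low = Counter(w for essay in low_essays for w in set(essay.lower().split()))
    vocabulary = set(high) | set(low)
    # Add-one smoothing keeps the ratio finite for words seen in one group only.
    return {w: math.log((high[w] + 1) / (low[w] + 1)) for w in vocabulary}


def essay_word_score(essay: str, weights: Dict[str, float]) -> float:
    """Sum the weights of the distinct words in an essay (unknown words count 0)."""
    return sum(weights.get(w, 0.0) for w in set(essay.lower().split()))
```

In this sketch, a word that shows up mostly in top-scored essays gets a positive weight and one that shows up mostly in bottom-scored essays gets a negative weight. Because each distinct word is counted once per essay, repeating a word does not raise the total.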

Q: How does E-Rater distinguish between, for example, "I jumped rope down the road," vs. "I road down the jumped rope"?
A: Because our test is used around the world, we don't look for things like that because we don't necessarily care that a person whose English is their second language might be off a little bit. What we're looking for is: Can they express their thinking coherently and concisely in terms of the essay's topic? Can they organize their ideas? So if their grammar is off, we don't rate that in terms of the analytical writing assessment. You might rate that in the English proficiency test, but we're not doing that. We're not looking for those types of things.

What we're looking for is an applicant's ability to think critically. In response to a specific topic, would you say that a particular argument's points are false? What are your reasons behind believing that the argument is false?

Q: So if you have the main idea correct in your essay, but your wording is somewhat jumbled, you'll still score well on the writing assessment?
A: That's true, if the wording of the essay is jumbled in a specific phrase or word. But if it's jumbled throughout the paragraph and there's no coherency to the paragraph, no, you can't score well. E-Rater is actually looking at the whole thing at once. It is a holistic scoring.

Q: So far, what is the reaction of admissions officers to E-Rater?
A: They've been positive. They think it's perfectly fine -- in the way we've structured it. We've structured it in a system of one faculty reader and E-Rater. In the way that we're using it, there's been positive reaction. The one group that I really expected a negative reaction from was the readers. But, we have not found that. In fact, the readers have been positive that it's a helpful tool.

Q: What has been the biggest challenge surrounding the E-Rater since you've implemented it? Have you encountered a lot of skepticism? Are folks scratching their heads wondering how this electronic assessment software actually works, wondering if the results have any validity?
A: There has been a lot of skepticism, and it was expected. People tend to think that E-Rater is just your average grammar-checker on your word processor. But that's just not the case. All we can do is show the results. From the results point of view, the research has held up. I think time will show that E-Rater really can do the more limited type of writing assessment that we're currently doing. I think people get the misconception that we're electronically evaluating good, overall essay writing. But, we're not grading the merits of any old essay. We're looking at specific essay questions and grading them in a structured way.

Q: How have the two main test-prep organizations -- Kaplan and Princeton Review -- approached E-Rater? Are they offering advice on how best to approach the AWA now that it's being partially graded by a computer?
A: Well, the only techniques that I've seen are what I think Kaplan puts out on their Web site, like telling people that using a lot of synonyms for the same word will get you a higher score. Sorry, but that's just not true. When we say we're looking for specific words, we're looking for words that people who score well on this specific topic use. And so, yes, the words are important, but you can't take them out of the context or the overall organization of the paragraph and say just using a lot of them is going to get you a high score. The actual words you use are only one piece of the score.

Q: As the creator behind E-Rater, how would you suggest a test-taker approach his or her AWA?
A: First, learn how to write! That's how you need to approach it. All of the AWA topics are out there published on the Internet. [Editor's Note: To locate the 175 AWA topics, go to: www.gmat.org/mbastore.html, then scroll down to the bottom of the page.] Practice on several of the topics -- not on all of them. Focus on really writing to them and get, maybe, a faculty member to evaluate your work. Hopefully, in the future we'll be able to put out some way of practicing over the Internet, so that people can get an idea of their writing skills. We could have E-Rater score the practice essays.

Q: Have you published on the GMAT Web site, or in any publications, some five- or six-point essays for people to model theirs after?
A: We do publish some sample essays in our practice print materials. We haven't put any on the Internet yet, but we probably will be shortly. We also publish the E-Rater scoring rubric.

Q: In the long run, E-Rater is probably going to save GMAC a fair amount of money, given the fact that less maintenance is involved. The GMAT costs $150 in the U.S. Are you planning on lowering that at all given the savings from E-Rater and the CAT?
A: Yes, we will be saving money by using E-Rater because we're using fewer [human] readers per person. What we intend is that that will keep the cost of testing from rising above where it currently is. The one big disadvantage of using this advanced technology is that it costs a lot of money to set up testing centers around the world and build the infrastructure. It has greatly increased the cost of the test, which is a concern, and we don't want the test cost to be a barrier. [Editor's Note: In 1998, the GMAT fee equaled $125 for U.S. test-takers and $160 for students abroad. In 1999, U.S. applicants will pay $150, and foreign students $195.] What we hope is that this will keep the cost of the test from rising to pay for this technology. We don't see any big decreases immediately.

Q: Building off the momentum generated by the CAT and E-Rater, what are GMAC's future technology plans? I'm sure you've got a couple that sound like they're straight out of Kubrick's 2001.
A: Well, in looking at business schools' needs, there are several approaches we'll be taking. One is to build some assessments that are similar to the GMAT Verbal but in other languages. For example, a Spanish verbal reasoning test. It wouldn't be a translation of the GMAT, but an actual different test whose verbal reasoning is in Spanish. The reason we're taking that line, in terms of new assessment, is that the GMAT is used internationally for admissions. There's a growing population in both Spain and Latin America that is interested in going to business programs, and an English verbal reasoning test wouldn't necessarily be appropriate because English is not their first language. Their verbal score would be affected by their English skills. But you still want to assess their verbal reasoning skills if they're going into bilingual programs. So we've seen a need for that expressed by the schools in Spain and South America, and we're trying to meet that need.

Q: What's the time frame for implementing that?
A: For the Spanish version, we're probably looking at a two- to three-year time line.

Q: What else are you planning?
A: The other approach that we're taking is how we can improve and advance assessments beyond multiple choice, which is very efficient for paper-based testing but measures only indirectly the things you really want to measure. So we want to create more authentic tasks that people might be dealing with. For example, you could be given a business case, and with the aid of technology, you could present this case and allow people to go get books off the shelf, allow people to interact with different types of media -- whether it's video or something else. We'd be making these tasks that people do in terms of assessment more authentic. We're looking at a four- to five-year time line before we even get to implementing that.

Q: So you envision having future test-takers use the Internet to research a specific case or question and report on it in a certain amount of time? Or having them be assessed on their spoken language skills?
A: Those are the types of things I'm referring to. But when we talk about use of the Internet, we're talking about having that only be harnessed for self assessments, where people would take the tests themselves in a less secure environment. That's opposed to using a business case where you want the secure environment of a testing, technology center. So, we have two things we're doing: One is very open where you don't have to worry about the security of the test, and the other is more closed, in terms of technology.

Q: Do you plan to put more resources into personal-assessment testing or into the secure, diagnostic exams -- GMAC's bread-and-butter?
A: Most of our resources will go into the secure testing. But where before we basically spent no resources on self-assessment and put more toward assessments geared at the student as an applicant, we will now be taking significant amounts of our money and gearing it to self-assessment.


For more information about the GMAT CAT and E-Rater, you can visit the GMAC Web site at: www.gmat.org


