Q&A WITH MIT'S VICTOR ZUE

Victor Zue is associate director of the Laboratory for Computer Science at MIT and a pioneer in the area of speech-recognition systems that use natural-language processing. For years, Zue astonished his fellow researchers with his ability to read the words people were saying from spectrograms, the visual representations of speech sometimes called voiceprints. Zue's lab at MIT remains a hotbed of speech-recognition work and has spawned a number of companies, notably Applied Language Technologies Inc. in Boston. An authority in natural-language processing, Zue has developed a system that can tell phone callers the weather in 500 cities around the world, a feat of responsiveness to spoken questions. Business Week's Boston Correspondent Paul C. Judge interviewed Zue in his MIT lab.


Q: Speech recognition is getting a lot of attention lately. What's the state of the technology at this point?
A: I've tried using the new dictation products from IBM and Dragon. I started using them to answer E-mail because I'm a two-fingered typist. The first couple of weeks I was in awe. If you train it to your voice, it works pretty well. But I find I use it less and less. For one thing, at home I share a computer with my son and wife. They get annoyed when I start dictating. It's a social clash. Dictating to a computer is a private thing. Also, I got frustrated when the programs didn't work, and that still happens. I haven't given them up, but I find I use the programs less and less. Microsoft has dictation systems as good as IBM's ViaVoice and NaturallySpeaking from Dragon Systems.

Q: Why hasn't Microsoft begun to build a business around them?
A: [Microsoft Chairman Bill] Gates and [Xuedong] Huang [research manager of Microsoft's speech technology group] say it's not ready. But the Microsoft voice synthesis is amazing. Their systems are poised to make some big advances in both input and output. They are the only company I know of that's looking at the whole loop.

Q: Do any of these systems incorporate elements of natural-language understanding, even at some primitive stage?
A: "Dialogue" and "natural language" have been used so much that they are terribly loaded words. When people say they are using natural language, it's often terribly misleading. One man's ceiling is another man's floor. It's the same thing with dialogue. Some people say it's possible now to have a dialogue with a voice-recognition system. But the machine takes all the initiative. What we're trying to develop at MIT are mixed-initiative systems, where you ask a question some of the time, and the machine asks a question some of the time. That's more within reach.

Q: Now that dictation products are making their way into the mainstream, will they provide the foundation for the next wave of speech-recognition systems?
A: I don't know if it will be built on top of dictation. The goal of natural language is information access and problem solving. Are we going to see the kind of thing people use in Star Trek or 2001? Not in my lifetime. But there are still a hell of a lot of things we can do with what we have now. The trick is limited domains.

Q: What are the limitations of dictation technology?
A: Most people consider the human-computer interaction using speech a problem of transcription. People hit upon dictation as a solution to that problem, and lots of people are chasing that. But my belief is that getting information means more to most people than generating a document using dictation systems. It's a human-centric view: The burden is on the machine to learn to communicate with you.

Q: So what types of applications do you expect to see in the next five years?
A: I have a personal litmus test for whether an application is worth pursuing with speech systems. First, it must have information that millions of people care about, like sports, stocks, weather. Second, it must be information that keeps changing, so people have a reason to keep coming back. You need that kind of demand to build a system. Third, the domain must be inherently closed. There are only 2,000 cities in the world for air travel, for instance. Or 1,000 makes and models of cars since the automobile was invented. And fourth, it needs to be something you can make money off of, to launch a business.
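The four-part litmus test above can be read as a simple checklist. The following is only an illustrative sketch in Python, not anything Zue or MIT built: the `Application` class, its field names, and the example values are all hypothetical, chosen to mirror the four criteria as stated in the answer.

```python
from dataclasses import dataclass

# Field names are hypothetical labels for the four criteria in the answer above.
CRITERIA = ("mass_interest", "keeps_changing", "closed_domain", "can_make_money")


@dataclass
class Application:
    """A candidate speech application, described by the four criteria."""
    name: str
    mass_interest: bool    # millions of people care about the information
    keeps_changing: bool   # the information changes, so people keep coming back
    closed_domain: bool    # the set of things to talk about is inherently bounded
    can_make_money: bool   # a business could be launched on it


def failed_criteria(app: Application) -> list[str]:
    """Return the names of the criteria this application does not meet."""
    return [name for name in CRITERIA if not getattr(app, name)]


def passes_litmus_test(app: Application) -> bool:
    """An application passes only if it meets all four criteria."""
    return not failed_criteria(app)


# Hypothetical example values, for illustration only.
sports = Application("sports scores", True, True, True, True)
dictation = Application("open-ended dictation", True, False, False, True)

for app in (sports, dictation):
    verdict = "passes" if passes_litmus_test(app) else "fails"
    print(app.name, verdict, failed_criteria(app))
```

Treating the test as an all-or-nothing conjunction is an assumption of this sketch; the answer lists the four conditions but does not say how they would be weighed against one another.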

Q: What kind of impact has the explosive growth of the World Wide Web had on speech technology?
A: Voice really provides the most convenient way to access all of that information that's out there on the Web. There are about 90 million households in the U.S. with phones now, but only about 25 million households with access to the Internet. The telephone is the key. The future business of information access must be done on the phone, I believe. Some dialogue systems today are just a replacement for earlier systems that asked you to press 1 for sales, 2 for service. It just removes the action of pressing a button on the phone. What kind of improvement is that? Natural language allows you an opportunity to bypass steps and get what you want. When it's hooked up to the information that's available on the Internet, like MIT's Jupiter system, which provides weather from 500 cities around the world to callers, a natural-language interface lets you leapfrog a whole series of mouse clicks.

Some MIT students want to set up a system similar to the Jupiter weather-information system for sports scores. That passes my litmus test: Lots of people are interested, the content changes every day, there's a limited number of teams and sports. And using a voice-driven interface could give you a much simpler way to drill down directly to the score you want, as opposed to clicking through a series of pages or typing in a search. We have a handle now on how to deal with these kinds of problems.

Q: Why do you think speech recognition has attracted so much interest over the years?
A: Humans have this incredible need to make computers more like us. The computer, after all, is a communications partner. Linguistic competence is one of the first attributes we would like to impart to a machine. But our understanding of intelligence is so limited, and our ability to model that intelligence is also so constrained now. That's why we're taking ginger steps.

Q: What role has DARPA played so far, and how much influence will it have going forward as speech recognition begins to move into commercial products?
A: DARPA has been the biggest investor in this technology over the last 20 years. Allen Sears, the program manager for speech and language at DARPA, is starting the next phase, which is spoken dialogue systems. Finally, this big issue is front and center. The government is saying, 'Even if we can do perfect speech recognition, it won't give me the weather for Bosnia, or tell me if a project is running on schedule.' AT&T and Microsoft are both working on dialogue systems, but DARPA's work and support will be key.

Q: Why is development lagging in Japan?
A: That's a great question, and I don't know the answer. Even the raw technology is behind. It's strange, because the Japanese culture is very forgiving. If Japanese people were told that from now on they would buy train tickets over the phone using a speech system, they would all fall into line. Here in the U.S., if you tried to force that, it wouldn't work. But there's no widespread use of speech technology in Japan.

The Europeans are doing very good work. Part of the reason that speech systems are advancing in Europe is by necessity, because of the pan-linguistic culture.

Q: What are the key drivers now in speech?
A: The component technology has to be better. Also, the Web is changing things in a fundamental way. People know that the information they want is out there on the Web if they can just get to it. That's the power of speech systems. The Galaxy architecture we used to build Jupiter and other speech systems at MIT was first conceived and implemented without the benefit of the Web. Now, Galaxy lives inside of a browser. The Web opens up a vast array of information that we can pluck from databases using a voice interface and deliver over a phone. That means machines are beginning to serve humans. We're not talking about HTTP and TCP/IP. We're talking about human language.





Updated Feb. 12, 1998 by bwwebmaster
Copyright 1998, by The McGraw-Hill Companies Inc. All rights reserved.