Sunday, September 6, 2009

Get the text of Japanese podcasts with Podcastle.jp

The Japanese government seems to be doing a few things that are pretty useful for language learners. I noted a while back that they've made an official Japanese-English dictionary of legal terms. That unfortunately is probably only of use to lawyers and the like, but last week I discovered another gem, again courtesy of Japanese tax dollars*, that is of more general use for Japanese learners: Podcastle.

I've been trying to find some Japanese podcasts for which the text is also available, without much success. LingQ's list of resources surprisingly has nothing suitable, and googling was turning up little. For whatever reason, there seem to be few Japanese podcasts that also provide transcripts.

But then I stumbled upon Podcastle.

Here's how Podcastle.jp describes itself:
Podcastle is a service that lets you search the audio of podcasts in Japanese. Voice recognition technology converts the audio into text. Users can then freely edit any recognition errors.
So, basically, they use a less-than-perfect voice recognition technology (because, after all, I don't think one exists yet that gets close to 100% accuracy) and then users edit the computer-generated transcripts to fix errors.

As I've noted before, one of the Japanese-language podcasts I've been listening to regularly is Yoichi Ito's Business Trends. And, sure enough, it's on Podcastle.jp. I went through a few of the transcripts, and overall their accuracy is pretty good. Indeed, many of them have hundreds of corrections. That tells me that the crowdsourcing is working well, but also that the voice recognition technology must leave quite a bit to be desired.

However, even with the crowdsourcing, the transcripts are not completely accurate. I was listening to one podcast and noticed that a term had been skipped outright. The term in question was スローダウン surōdaun, which means "slowdown" in the sense of the economy slowing down and is taken directly from the English term. It was said quickly and somewhat quietly, so I can see why the voice recognition technology might have missed it. But I had no trouble understanding it, so I'm sure native Japanese speakers can hear it just as easily, and yet it was still missing from the text. (I became a crowdsourcee by making the addition myself.)

Another cool feature of Podcastle.jp is the ability to follow along with the podcast. You can play the podcast and Podcastle.jp will indicate which text you are currently listening to. It's not completely accurate, but it's usually within a few words of where the audio is.

Until something more accurate pops up, this is a pretty good way for Japanese learners to get audio combined with text. The biggest problem is, of course, that the text doesn't always match up with the audio, so it's helpful if you know enough Japanese to figure out when the text might be screwed up. But by and large, it's accurate and good enough to help you get many of those terms you didn't quite catch in the audio.

Despite Podcastle's limitations, I'd love to know about any similar services available in other languages, so if you've got the info, please drop a line in the comments below.

P.S. If you're wondering how I figured out that this is a government-run project, take a look at their "Credits" page (in Japanese):
Podcastle is released as the research results of the Podcastle Project of the National Institute of Advanced Industrial Science and Technology, an independent administrative agency.
So if this was made with taxpayer dollars, I wonder if there's a way to get the research results—and the code—for free. If there is, someone with skills in working the Japanese bureaucracy please do so, and make this available for all languages ASAP.
