Sunday, September 6, 2009

Get the text of Japanese podcasts with Podcastle.jp

The Japanese government seems to be doing a few things that are pretty useful for language learners. I noted a while back that they've made the official Japanese-English dictionary of legal terms. That unfortunately is probably only of use to lawyers and the like, but last week I discovered another gem, again courtesy of Japanese tax dollars*, that is of more general use for Japanese learners: Podcastle.

I've been trying to find some Japanese podcasts for which the text is also available, without much success. LingQ's list of resources surprisingly has nothing suitable, and googling was turning up little. For whatever reason, there seem to be few Japanese podcasts that also provide transcripts.

But then I stumbled upon Podcastle.

Here's how Podcastle.jp describes itself:
Podcastle is a service that lets you search the audio of podcasts in Japanese. Voice recognition technology converts the audio into text. Users can then freely edit any recognition errors.
So, basically, they use a less-than-perfect voice recognition technology (because, after all, I don't think one exists yet that gets close to 100% accuracy) and then users edit the computer-generated transcripts to fix errors.

As I've noted before, one of the Japanese-language podcasts I've been listening to regularly is Yoichi Ito's Business Trends. And, sure enough, it's on Podcastle.jp. I went through a few of the transcripts, and overall their accuracy is pretty good. Indeed, many of them have hundreds of corrections. That tells me that the crowdsourcing is working well, but also that the voice recognition technology must leave quite a bit to be desired.

However, even with the crowdsourcing, the transcripts are not completely accurate. I was listening to one podcast and I noticed that a term appears to have been outright skipped. The term in question was スローダウン suro-daun, which means "slowdown" in the sense of the economy slowing down and is taken directly from the English term. The term was said quickly and somewhat quietly, and I could see why the voice recognition technology might have missed it. Still, I had no trouble understanding it, so I'm sure native Japanese speakers are able to hear it just as easily, and yet it remained completely omitted from the text. (I became a crowdsourcee by making the addition myself.)

Another cool feature of Podcastle.jp is the ability to follow along with the podcast. You can play the podcast and Podcastle.jp will indicate what text you are currently listening to. It's not completely accurate, but it's usually within a few words of where the audio is.
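For the curious, here's a rough sketch of how a follow-along feature like this could work. I have no idea how Podcastle actually does it. The sketch assumes each word in the transcript carries a start time, and the `transcript` data and the `current_word_index` function are made up for illustration.

```python
from bisect import bisect_right

# Hypothetical transcript: (start time in seconds, word).
# This is an assumed format, not Podcastle's real data.
transcript = [
    (0.0, "景気"),
    (0.6, "は"),
    (0.8, "スローダウン"),
    (1.7, "して"),
    (2.0, "います"),
]


def current_word_index(transcript, playback_seconds):
    """Return the index of the word being spoken at playback_seconds."""
    starts = [start for start, _ in transcript]
    # bisect_right counts the words that have already started;
    # the last of those is the one currently playing.
    index = bisect_right(starts, playback_seconds) - 1
    # Clamp so a position before the first word highlights the first word.
    return max(index, 0)


index = current_word_index(transcript, 1.0)
print(transcript[index][1])
```

If the recognizer's timestamps are a little off, the highlighted word drifts from the audio, which would be consistent with the text being a few words away from what you're hearing.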

Until something more accurate pops up, this is a pretty good way for Japanese learners to get audio combined with text. The biggest problem is, of course, that the text doesn't always match up with the audio, so it's helpful if you know enough Japanese to figure out when the text might be screwed up. But by and large, it's accurate and good enough to help you get many of those terms you didn't quite catch in the audio.

Despite its limitations, I'd love to know about any similar things that are available in other languages, so if you've got the info, please drop a line in the comments below.

P.S. If you're wondering how I figured out that this is a government-run project, take a look at their "Credits" page (in Japanese):
Podcastle is released as the research results of the Podcastle Project of the National Institute of Advanced Industrial Science and Technology, an independent administrative agency.
So if this was made with taxpayer dollars, I wonder if there's a way to get the research results—and the code—for free. If there is, someone with skills in working the Japanese bureaucracy please do so, and make this available for all languages ASAP.

4 comments:

  1. The accuracy of the transcripts was a little disappointing, but still, nice find!

  2. Vince, can we import and share this content at LingQ? Maybe you or one of your readers would like to do so and earn the points.

  3. I'd love to see these up on LingQ myself. I didn't read the fine print on Podcastle.jp, so I'm not sure what the sharing policy is. I did see that there was a way to get your podcasts removed from the site, so it might be a case of "shoot first, ask questions later".

    However, I was actually wondering if this would even be suitable for LingQ. The newer the text, the less accurate it is, and even the older ones can still be inaccurate in some ways. At the very least, it's material that's useful for advanced learners of Japanese, and some of the screwed up text could possibly make a mess of the LingQs from the text.

  4. Cool blog. I dig your site outline and I plan on returning again! I just love finding blogs like this when I have the time. Nowadays I am going through this site and hope it will help you too. You can go in by clicking on my name.
