Monday, January 28, 2013

How to get the text of the lyrics to your target language music (including copy-protected Japanese lyrics)

For the most part, getting the lyrics for your foreign-language music is a cakewalk: just search for the translation of the English word "lyrics" in your target language, the name of the artist (in quotes if more than one word), and the song title (again, in quotes if more than one word).

To take an example, if you want to find the lyrics to the song En el muelle de San Blás by the Spanish-language band Maná, Googling:

letra maná "en el muelle de san blás"
will get you the lyrics in the first search result, which you can then copy and paste wherever you like.
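
If you look up lyrics often, the search pattern above is easy to script. Here's a minimal sketch in Python; the function names are my own inventions for illustration, and the only outside detail it relies on is Google's standard `search?q=` URL.

```python
from urllib.parse import quote_plus


def quote_if_needed(term: str) -> str:
    """Wrap a term in double quotes when it contains more than one word."""
    return f'"{term}"' if len(term.split()) > 1 else term


def build_lyrics_query(lyrics_word: str, artist: str, title: str) -> str:
    """Join the word for "lyrics", the artist, and the song title into one query."""
    return " ".join([lyrics_word, quote_if_needed(artist), quote_if_needed(title)])


def google_search_url(query: str) -> str:
    """Percent-encode the query and append it to Google's search URL."""
    return "https://www.google.com/search?q=" + quote_plus(query)


query = build_lyrics_query("letra", "maná", "en el muelle de san blás")
print(query)
print(google_search_url(query))
```

The first `print` gives the same search string as the example above; the second gives a URL you can paste into your browser.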

Japanese lyrics, however, are a bit more complicated, and perhaps surprisingly it's for reasons that have nothing to do with the language itself.

You can of course copy and paste Japanese text just like any other text, but many websites that host the lyrics for Japanese songs take steps to prevent you from doing that, presumably because of copyright issues. The result is that getting the text of Japanese lyrics for your personal educational use often won't be as simple as copying and pasting. That said, it's generally not that hard and will usually take only a little longer than just copying and pasting.

We'll start with the easiest sites to get the lyrics from and move on to the more difficult ones. The song we'll try to get the lyrics for is the Ulfuls' Gattsu Da Ze, the video of which has both ninjas and explosions.

Mojim and JetLyrics

First, try to find the lyrics on Mojim or JetLyrics. Neither of these has any technical impediments to copying from their websites, so these will be the easiest way to get your Japanese lyrics.

To check these sites, simply add "mojim" or "jetlyrics" to your Google search terms, such as:

歌詞 mojim ウルフルズ ガッツだぜ

歌詞 jetlyrics ウルフルズ ガッツだぜ

If the lyrics you're looking for are on one of these sites, that page will probably appear as the first search result.

The one annoying thing about these sites is that they stick an additional line of text in the middle of the lyrics. On Mojim, it's "轉載來自 ※ Mojim.com 魔鏡歌詞網" and, on JetLyrics, it's the name of the artist, the song title, and the word "Lyrics", e.g.: "ウルフルズ ガッツだぜ!! Lyrics". So deleting that superfluous text is the one extra step that you'll need on these sites beyond simply copying and pasting.
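
If you'd rather not hunt for that extra line by hand, a few lines of Python can drop it for you. This is just a sketch: the function name is made up, the sample text is a placeholder rather than real lyrics, and it assumes the inserted line sits on a line of its own and contains the marker text quoted above.

```python
from typing import Sequence


def strip_site_lines(lyrics: str, markers: Sequence[str]) -> str:
    """Drop every line that contains one of the given marker strings."""
    kept = [
        line
        for line in lyrics.splitlines()
        if not any(marker in line for marker in markers)
    ]
    return "\n".join(kept)


# Placeholder lyrics with Mojim's inserted line in the middle.
sample = "一行目\n轉載來自 ※ Mojim.com 魔鏡歌詞網\n二行目"
print(strip_site_lines(sample, ["Mojim.com"]))
```

For JetLyrics, pass the full inserted line (artist, title, and "Lyrics") as the marker so that a lyric line that merely contains the word "Lyrics" isn't removed by accident.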

J-Lyric

If Mojim and JetLyrics don't get you your Japanese lyrics, then the next lyrics site you'll want to try is J-Lyric. To see if the lyrics you want are available, use the same pattern for the search string as above but add "j-lyric" instead of "mojim" or "jetlyrics", e.g.:

歌詞 j-lyric ウルフルズ ガッツだぜ
If you find your lyrics on J-Lyric, you unfortunately won't be able to copy and paste them directly from the screen. However, it's easy enough to extract them from the source code. I'll use Chrome to show you how to do it, but any browser will have the same capability.

  1. From the "View" menu's "Developer" submenu, select "View Source".

  2. In the source code, search for the first line of the lyrics, or just scroll down until you find it.

  3. Copy from the first line of the lyrics (starting on the line after "<p id='lyricBody'>") down to but not including the "<br />" at the end of the last line of the lyrics.

  4. In a text editor, make a plain-text document and paste in the copied text. To make a plain-text document in, for example, TextEdit, make a new document and then, from the "Format" menu, select "Make Plain Text".

  5. Find and replace "<br />" with nothing, i.e., leave the replace field blank but then replace all. This will delete all occurrences of "<br />" from the lyrics.

  6. Copy and paste the cleaned-up lyrics wherever you want to use them.
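
Steps 2 through 5 can also be done in one go with a short script once you've copied or saved the page source. The sketch below assumes the lyrics sit between "<p id='lyricBody'>" and a closing "</p>" (the closing tag is my assumption; the steps above only mention the opening tag and the "<br />" markers). The function name is made up, and the sample is placeholder text, not real lyrics.

```python
import html
import re


def extract_lyric_body(page_source: str) -> str:
    """Pull the lyrics out of the lyricBody paragraph and drop the <br /> tags."""
    match = re.search(
        r"<p id=['\"]lyricBody['\"]>(.*?)</p>", page_source, re.DOTALL
    )
    if match is None:
        raise ValueError("no lyricBody paragraph found in the page source")
    # Same as the find-and-replace step: delete every <br /> (or <br>, <br/>).
    body = re.sub(r"<br\s*/?>", "", match.group(1))
    # Decode entities such as &amp; and trim stray whitespace on each line.
    lines = [html.unescape(line).strip() for line in body.strip().splitlines()]
    return "\n".join(lines)


sample = "<div>\n<p id='lyricBody'>\n一行目<br />\n二行目<br />\n</p>\n</div>"
print(extract_lyric_body(sample))
```

Like the manual method, this relies on the source already having one lyric line per line of text, so removing the tags leaves the line breaks intact.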

Goo, UtaMap, Uta-Net, Kasi-Time, etc.

If you still haven't found your lyrics, remove "j-lyric" from your Google search and run it again, e.g.:

歌詞 ウルフルズ ガッツだぜ
You'll probably see sites like Goo, UtaMap, Uta-Net, Kasi-Time, and others among your top hits. One of these will likely have the lyrics you're looking for, but you won't be able to copy them from the website or pluck them out of the source code. Instead, you'll need to take a screen shot and then use optical character recognition, or OCR, to convert them to text. This is by far the most time-consuming way to get the text of Japanese-song lyrics, but it will still take far less time than typing them up on your own. (And, luckily, you'll probably be able to find most lyrics without resorting to this method.)

Before you actually start converting images to text, there's a one-time set-up you'll need to do:

  1. Download the free "Community Version" of PDF OCR X, which is available for both Mac and Windows. The free version is limited to single-page images, but that should be suitable for screen grabs.

  2. Open PDF OCR X and drag and drop any image file onto the "PDF OCR X" window.

  3. In the "PDF OCR X: Please choose your conversion settings" window, click on "Add More Languages". This will take you to a webpage for PDF OCR X's language packs.

  4. Click on "jpn.traineddata.zip".

  5. Once the zip file has downloaded, unzip the file and drag and drop "jpn.traineddata" onto the "PDF OCR X" window. This will add Japanese to the languages PDF OCR X can recognize.

And then here's what you'll need to do each time:

  1. In your browser, increase your zoom so that the lyrics are a pretty decent size on your screen. In most browsers, this can be accomplished by pressing command and = on a Mac or CTRL and = on Windows. (You should get better OCR results with a larger font size, but you'll also likely need to take more screen shots, which may result in you wasting more time than just cleaning up the somewhat messier OCR that you'll get at a smaller font size. Although I didn't do much testing, 2 or 3 screen shots seemed about right.)

  2. Take a screen shot of just the lyrics you want the text of and save it to your desktop. On a Mac, in Grab.app, from the "Capture" menu, select "Selection" to grab only the part of the screen that has the lyrics. In Windows' Snipping Tool, the same can be done by selecting "Rectangular Snip" from the "New" menu. As noted above, you might need to take a few screen shots if the lyrics don't fit on your screen. Both Mac OS X's Grab.app and Windows 7's Snipping Tool will save the screen grab as a TIFF file.

  3. Open PDF OCR X and drag and drop the screen grab's file onto the "PDF OCR X" window.

  4. In the "PDF OCR X: Please choose your conversion settings" window, select "Japanese" from the "Language" pull-down menu and click the "Convert" button. This will create a window called "Converted Text" that has the converted characters.

  5. In the "Converted Text" window, select all (press command and a on a Mac or CTRL and a on Windows) and then copy and paste the text into a text document.

  6. If you made more than one screen shot, repeat steps 2 through 5 until all converted text is in the text document.

  7. Compare the converted text in your text document against the original website or the screen grabs, fixing any discrepancies. There will probably be a fair number of these, most likely making this step the most time-consuming part of the process. Likely mistakes will include English words that are included in the Japanese text, kana with normal and smaller forms (e.g., "よ" and "ょ"), spacing, and certain kanji.

  8. Copy and paste the cleaned-up lyrics wherever you want to use them.

PDF OCR X is one of a number of applications that take advantage of the tesseract-ocr engine, which is currently under the auspices of Google. You can also install tesseract-ocr directly on your computer, although the only way to access it will be through the command line.
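
If you do go the command-line route, the basic invocation is `tesseract <image> <output base> -l jpn`, which writes the recognized text to `<output base>.txt`. Here's a rough Python wrapper around that call. It assumes tesseract and its Japanese language data are already installed, that your tesseract build can read the image format you give it, and that the output is UTF-8; the function names and the file name in the usage note are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, output_base: Path, language: str = "jpn") -> list[str]:
    """Build the command line: tesseract <image> <output base> -l <language>."""
    return ["tesseract", str(image), str(output_base), "-l", language]


def ocr_image(image: Path, language: str = "jpn") -> str:
    """Run tesseract on one screen grab and return the recognized text."""
    # Use the image path minus its extension as the output base.
    output_base = image.with_suffix("")
    subprocess.run(tesseract_command(image, output_base, language), check=True)
    # tesseract appends ".txt" to the output base it was given.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_image(Path("lyrics-1.tiff"))` would then return the text of one screen grab, which you'd still need to proofread as described in step 7 above.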

Another completely free OCR option is this webpage, which makes use of the NHocr engine. While this one requires no initial set-up, there were two issues that led me to prefer tesseract-ocr.

First, I found that NHocr's accuracy wasn't quite as good as tesseract-ocr's. Check out this PDF file to see all the mark-ups I needed to make both the NHocr and the tesseract-ocr output match up with the original lyrics. Both require a fair amount of edits, but NHocr's output needed a bit more. In addition to the same recognition problems faced by tesseract mentioned above, NHocr had additional mistakes, such as recognizing "く" as "<", "き" as "書", "り" as "0", and so on.

Second, NHocr can't use TIFF files directly, so you'll also need to convert them into JPEG files. This isn't very hard to do, but it does add another step. On a Mac, this can be done by opening the TIFF file with Preview and selecting "Export" from the "File" menu and then exporting after changing the format to "JPEG".

There is also commercial OCR software out there that may provide better results, e.g., OCRkit ($38.99), ABBYY FineReader ($99.99), ReadIris ($129), OmniPage 18 ($149.99), and Adobe Acrobat Pro ($199, free trial, student/teacher discounts). We regularly use Adobe's OCR at work and the results are pretty good, so I imagine it would work very well for screen shots of machine-made text such as lyrics (here's one additional anecdote regarding Adobe's good accuracy for Japanese texts and here's a tutorial on using Acrobat's OCR). For a table comparing some of the above-mentioned software, see here.

Nevertheless, I've always been able to get any lyrics I wanted without resorting to OCR, so I haven't tested any commercial software with lyrics.


With the above methods, you should be able to get the text of pretty much any lyrics you can get on your screen. Once you've got the text of the lyrics, they're ripe for being added to Learning with Texts and to iTunes, but I'll get into more detail on that next week.

3 comments:

  1. I was able to copy lyrics straight from the source code of kasi-time.  With Goo, you can grab the lyrics without OCR:

    1) From Chrome, open the dev tools (f12 on windows, opt+cmd+i on mac).
    2) Click on 'network'.
    3) Navigate to the page with your lyrics (or refresh the page if you are already there)
    4) In the 'Network' section of chrome dev tools, look for 'print_json.php' in the left column and select it.
    5) In the right panel of 'Network' click on 'Response'.  You will see a line with a  bunch of escaped unicode.  Escaped unicode looks like this: \u661f\u306b .  These are your lyrics.
    6) Copy that string and stick it in a unicode converter tool like http://rishida.net/tools/conversion/   <--- paste the unicode as 'Mixed Input'

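
For step 6 in the comment above, a few lines of Python can stand in for the online converter. This is only a sketch: the function name is made up, it handles plain four-digit \u escapes like the ones in the comment's example, and it doesn't try to deal with surrogate pairs or other escapes (such as \n) that the response might contain.

```python
import re


def decode_escaped_unicode(escaped: str) -> str:
    """Replace each \\uXXXX escape with the character it stands for."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda match: chr(int(match.group(1), 16)),
        escaped,
    )


# The two escapes from the comment's example.
print(decode_escaped_unicode(r"\u661f\u306b"))
```

Any text in the string that isn't a \u escape is passed through unchanged.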
  2. I wasn't able to get Gattsu Da Ze's lyrics from the source code of Kasi-Time's page for that song. Is it possible that it varies by song? Or am I just missing the boat here?

    I haven't tested your second tip yet, but that'll definitely be quicker than OCR.  I'll give it a go and update the post later.

  3. Oops.  Just double-checked and for kasi-time it's not in the static source, but it's in the DOM.  Still easy to fetch:

    1) Open chrome dev tools (windows f12, mac opt+cmd+i)
    2) Go to 'Elements' tab in dev tools
    3) Click on the little magnifying glass in the bottom left corner
    4) Click on the song lyrics.  This should select a div tag with a class called 'mainkashi' in the DOM tree in the 'Elements' section.  This div should have all your lyrics, sprinkled with some html.
    5) Right-click on the div tag (in the 'Elements' section of the dev tools) and select 'Copy as HTML' 
    6) Paste into an HTML to text converter (http://beaker.mailchimp.com/html-to-text) or clean it up using a text editor.

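
For the last step of the Kasi-Time tip above, Python's built-in HTML parser can do the clean-up instead of an online converter. Treat this as a sketch under a couple of assumptions: that the copied markup uses <br> tags for line breaks, and that any other tags can simply be dropped. The class and function names are my own, and the sample is placeholder text.

```python
from html.parser import HTMLParser


class LyricsTextExtractor(HTMLParser):
    """Collect the text of an HTML fragment, turning <br> tags into newlines."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <br />.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment: str) -> str:
    """Return the plain text of the copied fragment."""
    parser = LyricsTextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()


sample = '<div class="mainkashi">一行目<br>二行目<br><br>三行目</div>'
print(html_to_text(sample))
```

If the copied HTML also has a real line break after each <br>, you'll end up with doubled blank lines, which are easy enough to tidy up in a text editor.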