Introduction

Glossika is a great language learning course.

To explain it simply, the course consists of 3000 sentences.

You hear the sentence in your source language, then you hear it in your target language.

These sentences are packaged for you in MP3 files, 50 sentences in each.

I wanted to split these files into 2 x 3000 individual sentence audio files (3000 for the source language, 3000 for the target language).

However, between late 2017 and early 2018 Glossika phased out its old, fully offline materials, and we can't even redownload old purchases.

I now feel comfortable making this tool public. I held it back because I did not want to affect Glossika's sales at all, but Glossika's course is now entirely online, so I'm not worried anymore.

Guide

Setup

Get GloTool from my GitHub: https://github.com/llakssz/GloTool/releases/latest

You need Python 3 installed, along with the Python module 'pydub'.

On my system, Python 3 is installed as 'python', not 'python3'.

If you have Python 3 already installed, you can install pydub by going to your Terminal/Command Prompt and entering:

pip install pydub

If pip couldn't be found, try:

python -m pip install pydub
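To confirm that the interpreter you are about to use can actually see pydub, a quick check like the one below may help. It is a minimal sketch using only the standard library; the helper name missing_modules is made up for this example and is not part of GloTool.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the names from `names` that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Shows which Python is really behind the 'python' command.
    print("Python version:", sys.version.split()[0])
    missing = missing_modules(["pydub"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("pydub is importable")
```

If the version printed starts with 2, the 'python' command on your system is not Python 3 and you probably need 'python3' instead.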

You also need to have ffmpeg or avconv on your system.

Install either one and make sure it's in your PATH.
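One way to check the PATH part is to ask Python where it finds the converter. This is only a sketch built on the standard library's shutil.which; find_converter is a hypothetical helper name, and GloTool itself may locate the program differently.

```python
import shutil


def find_converter(candidates=("ffmpeg", "avconv"), which=shutil.which):
    """Return (name, path) for the first candidate found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return name, path
    return None


if __name__ == "__main__":
    found = find_converter()
    if found:
        print("Found %s at %s" % found)
    else:
        print("Neither ffmpeg nor avconv was found on PATH")
```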

If you are on Windows and have trouble, you can be lazy and place the ffmpeg.exe into the same folder as GloTool.

(Download the archive from https://ffmpeg.zeranoe.com/builds/, go to the /bin/ folder, and use the ffmpeg.exe from there.)

This exact release worked perfectly for me.

https://ffmpeg.zeranoe.com/builds/win32/static/ffmpeg-3.4.2-win32-static.zip

In this example I will use my English-Japanese Glossika files, and split them up into individual sentences.

The walkthrough documents the problems I ran into and how to fix them.

Usage

Let's start with Fluency 1 (the first 1000 sentences). The zip file is called ENJA-F1-GMS.zip.

Extract it, and you will get GMS-A, GMS-B, and GMS-C.

We want the B files. The B files are special because they play every sentence once in English, then once in Japanese.

Place the GMS-B files in the same directory as GloTool.

Open Terminal/Command Prompt, navigate to the folder where GloTool is and enter:

python GloTool.py GMS-B -Bfiles -s

'GMS-B' is our input folder.

'-Bfiles' says they are B files, not C files (A files are unsupported).

'-s' says we want to split these files.

Wait a little, and we see some output:

** ENJA-F1-GMS-B-0001.mp3 **

Detecting silence...

Writing files...

Wrote 102 files - BAD! X-X-X-X-X-X-X-X-X

We detected too many sentences.

Try using -skipfirst X and -skipend X to fix.

Took 87 seconds

Exiting

This is bad: too many sentences were detected in the MP3 file.

There should be 50 sentences for a B file, so 100 separate English and Japanese audio clips.

But we got 102.
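The arithmetic behind the "BAD" verdict can be written down as a tiny check. This is a sketch of the reasoning only, not GloTool's actual code; surplus_clips and the constant names are made up for this example.

```python
SENTENCES_PER_B_FILE = 50  # from the guide: 50 sentences in each MP3
CLIPS_PER_SENTENCE = 2     # one source-language clip, one target-language clip


def surplus_clips(files_written, sentences=SENTENCES_PER_B_FILE):
    """Return how many more clips were written than expected (negative if too few)."""
    expected = sentences * CLIPS_PER_SENTENCE
    return files_written - expected


if __name__ == "__main__":
    print(surplus_clips(102))  # 102 written - 100 expected, prints 2
```

A surplus of 2 means two unwanted pieces still have to be skipped or repaired.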

Take a look at the output files that were generated.

Look in the 'source' folder. (Our source language is English.)

0001.wav should be the very first sentence in English.

"The weather's nice today."

No, instead we hear "日本語/Nihongo" ('Japanese language' in Japanese).

Why:

At the beginning of these B files, we hear:

"The GMS Method Fluency 1 English Nihongo"

We don't want these audio pieces saved; we just want the sentences.

The Glossika files vary quite a bit among the different languages available. I couldn't make the tool detect every file perfectly every time, so we have to play around a little.

Solution:

We need to skip an extra piece at the beginning.

By default, the '-skipfirst' argument is 2.

This time, enter:

python GloTool.py GMS-B -Bfiles -s -skipfirst 3

Bad output again:

** ENJA-F1-GMS-B-0001.mp3 **

Detecting silence...

Writing files...

Wrote 101 files - BAD! X-X-X-X-X-X-X-X-X

We detected too many sentences.

Try using -skipfirst X and -skipend X to fix.

Took 88 seconds

Exiting

101 pieces were saved. This makes sense: we got 102 last time, and this time we skipped one more piece.

Verify that 0001.wav in the 'source' folder and 0001.wav in the 'target' folder are what they should be. (Check them against your PDF.)

They are correct, good.

So the beginning isn't the problem anymore.

There are usually 2 possibilities for the problem:

1 - (Easy) Similar to earlier, we are simply picking up an extra unwanted piece, this time at the end.

2 - (Tricky) One of the sentence pieces has a pause that's too long, and so it looks like a new sentence.

If 2 is the case, you will need to find which file has this overly long silence and cut it shorter (to less than 2 seconds) in a program like Audacity.
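Listening to every clip to find the culprit is slow. One possible shortcut, based on my assumption that a sentence cut in two leaves unusually short clips behind, is to list the shortest files in an output folder and listen to those first. The sketch below uses only the standard library's wave module, so it assumes the output clips are plain PCM WAV files; clip_seconds and shortest_clips are hypothetical helper names, not part of GloTool.

```python
import wave
from pathlib import Path


def clip_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def shortest_clips(folder, count=5):
    """Return the `count` shortest .wav files in `folder` as (seconds, name) pairs."""
    durations = [(clip_seconds(p), p.name) for p in Path(folder).glob("*.wav")]
    return sorted(durations)[:count]


if __name__ == "__main__":
    # 'source' is the output folder used in this guide; 'target' works the same way.
    for seconds, name in shortest_clips("source"):
        print("%s  %.2f s" % (name, seconds))
```

Short sentences will show up here too, so treat the list as a set of candidates to check against the PDF, not as a verdict.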

In my case, listening to 0051.wav (the file that shouldn't exist) in the 'source' folder we hear:

"Licensed by HLXKAG..."

So this is possibility 1 from above: we want to ignore a piece from the end.

We fix this with the argument -skipend. The default is 1, so let's give it 2.
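To see why these two numbers matter, here is a toy model of the skipping. It is not GloTool's real code: I am assuming the detected pieces are simply trimmed at both ends and then handed out alternately to the 'source' and 'target' folders, and that this particular file has 3 intro pieces and 2 closing pieces. All names and labels in it are made up for illustration.

```python
def keep_pieces(pieces, skipfirst=2, skipend=1):
    """Drop `skipfirst` pieces from the start and `skipend` pieces from the end."""
    return pieces[skipfirst:len(pieces) - skipend]


def split_languages(kept):
    """Hand out kept pieces alternately: even positions to source, odd to target."""
    return kept[0::2], kept[1::2]


# Assumed layout of one B file: 3 intro pieces, 50 sentence pairs, 2 closing pieces.
intro = ["intro-1", "intro-2", "Nihongo"]
sentences = [f"{lang}-{n:04d}" for n in range(1, 51) for lang in ("EN", "JA")]
outro = ["outro-1", "outro-2"]
pieces = intro + sentences + outro

if __name__ == "__main__":
    for skipfirst, skipend in ((2, 1), (3, 1), (3, 2)):
        kept = keep_pieces(pieces, skipfirst, skipend)
        source, target = split_languages(kept)
        # Columns: skipfirst, skipend, clips written, first and last source clip.
        print(skipfirst, skipend, len(kept), source[0], source[-1])
```

Under these assumptions the model gives 102, 101 and 100 clips for the three settings. With the defaults the first source clip is the leftover "Nihongo" piece, and with -skipfirst 3 alone the 51st source clip is a leftover closing piece.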

This time, enter:

python GloTool.py GMS-B -Bfiles -s -skipfirst 3 -skipend 2

And the output:

Let's split...

** ENJA-F1-GMS-B-0001.mp3 **

Detecting silence...

Writing files...

Wrote 100 files - GOOD! *****************

Took 87 seconds

Great! Exactly what we want. Let it run and hopefully there won't be any problems with the later files.

I'll try to write a guide for joining later, although it's not impossible to figure it out yourself by looking at GloTool. It's Python after all :)