Pingar API will soon be able to analyze Japanese documents. Adding Japanese posed new challenges for us because its writing system is a hybrid that uses several alphabets for different purposes. Furthermore, unlike most languages written in the Latin alphabet, Japanese does not use spaces to delimit words, so we had to implement segmentation algorithms to process Japanese text.
What kind of alphabets are there in Japanese?
Below is an example of how the three alphabets are used together. Observe how they look distinctly different! This sentence reads ‘I can’t speak French at all.’

Let’s now explain the role each alphabet has in Japanese and its origins.
Kanji (漢字) – Written above in black, this alphabet consists of Chinese characters standardized for use in the Japanese language. It is mainly used for nouns, verbs and adjectives. Each character can have a range of Chinese readings (Onyomi) and Japanese readings (Kunyomi), and some characters have up to a dozen readings. There are approximately 2000 to 3000 characters in common use in Japan. However, thousands more are also part of the Japanese language, including several variants of each one.
Hiragana (平仮名・ひらがな) – Written above in red, this alphabet is phonetic and is used to write native Japanese words in their most elementary form (anything written in Kanji can be broken down into its Hiragana equivalent, which is necessary for children and learners of the language). Hiragana is also used often in grammatical constructs. Hiragana comprises 48 characters and was developed from a calligraphic form of Manyogana, an earlier writing system that used Chinese characters.
Katakana (片仮名・カタカナ) – Written above in blue, this alphabet mirrors Hiragana with 48 equivalent characters, except that it is not used for grammatical purposes and is mainly reserved for foreign words imported into the Japanese language. The influx of German, English and French words into the Japanese language several centuries ago propelled the use of Katakana to write these new foreign words. Katakana is also used for writing onomatopoeia and names. Like Hiragana, it was developed from Manyogana, in this case by Buddhist monks who used it as shorthand.
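Because each alphabet occupies its own range of Unicode code points, software can tell them apart mechanically. The following is a minimal Python sketch of that idea, not part of the Pingar API. The function names are hypothetical, only the basic Unicode blocks are covered (rarer Kanji in extension blocks would fall under "other"), and the sample sentence is an illustrative one chosen for this example.

```python
def script_of(ch: str) -> str:
    """Classify one character by its basic Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (basic block only)
        return "kanji"
    return "other"


def script_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters that belong to the same alphabet."""
    runs: list[tuple[str, str]] = []
    for ch in text:
        kind = script_of(ch)
        if runs and runs[-1][1] == kind:
            # Same alphabet as the previous character: extend the current run.
            runs[-1] = (runs[-1][0] + ch, kind)
        else:
            runs.append((ch, kind))
    return runs


# Illustrative sentence mixing all three alphabets.
for chunk, kind in script_runs("私はフランス語が話せません"):
    print(chunk, kind)
# 私 kanji
# は hiragana
# フランス katakana
# 語 kanji
# が hiragana
# 話 kanji
# せません hiragana
```

Note that a change of alphabet is only a hint, not a word boundary: in this example the Katakana run フランス and the Kanji 語 together form one word (フランス語, the French language).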
Why are there no spaces in Japanese?
At Pingar, we previously confronted the need to implement text segmentation when adding the Chinese language components (traditional and simplified) to the Pingar API. Spaces are redundant in Chinese because there are thousands of distinct characters, which makes word boundaries easier to recognize without them. The opposite holds for Latin-based languages such as English, which depend largely on spaces to reduce ambiguity and make reading easier. For example, canyoureadthiseasily? Is it ‘can you’ or ‘can your’? It takes time to process, right? Since the Japanese language draws on several alphabets that collectively have several thousand characters (similar to the Chinese language), it can likewise do without spaces.
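The ‘can you’ versus ‘can your’ ambiguity can be made concrete with a small dictionary lookup. The Python sketch below is only an illustration under stated assumptions: the six-word vocabulary and both function names are made up for this example, and the recursive search is fine for a short string but would be too slow for long text without caching.

```python
def candidates_at(text: str, start: int, vocabulary: set[str]) -> list[str]:
    """List every vocabulary word that begins at position `start`."""
    return [
        text[start:end]
        for end in range(start + 1, len(text) + 1)
        if text[start:end] in vocabulary
    ]


def segmentations(text: str, vocabulary: set[str]) -> list[list[str]]:
    """Return every way to split `text` entirely into vocabulary words."""
    if not text:
        return [[]]  # one way to split the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            # Keep this word only if the rest of the text can also be split.
            for tail in segmentations(text[end:], vocabulary):
                results.append([head] + tail)
    return results


# Hypothetical toy vocabulary.
VOCAB = {"can", "you", "your", "read", "this", "easily"}
TEXT = "canyoureadthiseasily"

print(candidates_at(TEXT, 3, VOCAB))
# ['you', 'your']
print(segmentations(TEXT, VOCAB))
# [['can', 'you', 'read', 'this', 'easily']]
```

Both ‘you’ and ‘your’ are valid candidates after ‘can’, and only looking further ahead rules out ‘your’, because nothing in the vocabulary starts the remaining ‘eadthiseasily’. A segmenter has to resolve this kind of local ambiguity at every position.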
How do we segment Japanese text?
To add the Japanese language to the Pingar API, we need to be able to segment it, which means locating sentence and word boundaries. Segmentation is a difficult task, so we experimented with several techniques and found that a machine learning model called Conditional Random Fields (CRF) works best, most probably because of the agglutinative nature of the Japanese language. CRF is a statistical modeling technique that labels and segments sequences of text, which allows it to identify where words start and end. This approach has been tested successfully on this problem in academic research; see ‘Applying Conditional Random Fields to Japanese Morphological Analysis’ and ‘Training Conditional Random Fields Using Incomplete Annotations’.
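One common way to frame word segmentation as a sequence labeling problem is to tag every character as B (begins a word) or I (inside a word) and let the CRF predict those tags from features of the surrounding characters. The Python sketch below shows only that data representation, not a trained model. It is an assumption for illustration: this post does not say whether Pingar's segmenter tags characters this way or scores whole-word candidates instead, and the function names, feature set and example segmentation are all hypothetical.

```python
def words_to_tags(words: list[str]) -> list[tuple[str, str]]:
    """Turn a segmented sentence into (character, tag) training pairs."""
    tagged = []
    for word in words:
        for i, ch in enumerate(word):
            tagged.append((ch, "B" if i == 0 else "I"))
    return tagged


def tags_to_words(chars: str, tags: list[str]) -> list[str]:
    """Rebuild words from a predicted B/I tag for each character."""
    words: list[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # continue the current word
    return words


def char_features(chars: str, i: int) -> dict[str, str]:
    """Example features a CRF could use for the character at position i."""
    return {
        "char": chars[i],
        "prev": chars[i - 1] if i > 0 else "<s>",
        "next": chars[i + 1] if i + 1 < len(chars) else "</s>",
    }


# Illustrative segmentation of an example sentence.
words = ["私", "は", "フランス語", "が", "話せません"]
tags = [tag for _, tag in words_to_tags(words)]
print("".join(tags))
# BBBIIIIBBIIII
print(tags_to_words("".join(words), tags))
# ['私', 'は', 'フランス語', 'が', '話せません']
```

In this framing, training data is produced with something like words_to_tags, the model learns which feature patterns predict a B, and its predicted tags are turned back into words with something like tags_to_words.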
What do the results look like?
The feedback from the first demonstrations to native speakers interested in this market was very positive. Earlier this year we signaled our intent to include Japanese in the Pingar API, and building the segmenter is a significant milestone toward that promise. It reflects our long-term goal of further internationalizing the Pingar API. The following is output from the Taxonomy Extraction function of Pingar’s Entity Extraction module. The extraction is only one level deep, and the input text is about Kendo (a Japanese sword-fighting martial art). See the results for yourself:

Entity Extraction: Extract Taxonomy Terms from Text
レジャーと文化 (Leisure & Culture)
国際情勢と防衛 (International Affairs & Defense)
教育とスキル (Education & Skills)
情報通信技術 (Information & Communications Technology)
The taxonomy terms detected by the Pingar API are accurate and relevant. When used as metadata, these terms can auto-organize similar documents and group search results effectively. In the coming months we plan to extend other Pingar API methods to the Japanese language and add live demos to our website. Look out for our next update soon.