icgrep

icgrep is designed for UTF-8 input, but what happens if you try to use it with other character encodings? EUC-JP (Extended Unix Code) is an encoding used to represent Japanese text. Most notably, it is not a Unicode encoding, let alone UTF-8. Like UTF-8, however, EUC-JP is a variable-width encoding, so perhaps we can have some success with it.
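
To make the "variable-width, but not UTF-8" point concrete, here is a small Python sketch (not part of icgrep, and relying only on Python's built-in `euc_jp` codec) that compares the byte layout of an ASCII word and a kana word in both encodings. The `widths` helper is a name made up for this illustration.

```python
# Compare how the same text is laid out in EUC-JP and in UTF-8.
# `widths` is a hypothetical helper written for this illustration.

def widths(text, encoding):
    """Return the number of bytes each character occupies in `encoding`."""
    return [len(ch.encode(encoding)) for ch in text]

for text in ("apple", "りんご"):
    for encoding in ("euc_jp", "utf-8"):
        encoded = text.encode(encoding)
        print(text, encoding, encoded.hex(), widths(text, encoding))
```

ASCII characters take one byte each and come out byte-identical in both encodings, while each kana takes two bytes in EUC-JP and three in UTF-8, with completely different byte values.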

edict is a Japanese dictionary file found here

It's encoded in EUC-JP, so let's try to use it as input for icgrep.

$ ./icgrep "apple" edict -c
159


$ ./icgrep "apple" edict
CAM植物 [カムしょくぶつ] /(n) CAM plant (any plant, such as the pineapple, that uses crassulacean acid metabolism)/
あし毛 [あしげ] /(n,adj-no) dapple-grey (gray) (horse coat colour)/
この親にしてこの子あり [このおやにしてこのこあり] /(exp) (proverb) the apple doesn't fall far from the tree/like father, like son/
つかみ合う [つかみあう] /(v5u,vi) to grapple/


So icgrep can read non-UTF-8 text and find matches for an ASCII regular expression. But what if we try to look for a Japanese string?

$ ./icgrep "りんご" edict -c
icgrep ERROR: Invalid UTF-8 encoding!


So icgrep doesn't like that, which makes sense because EUC-JP is not a supported encoding.
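
The error message does not say which bytes icgrep rejected, and this sketch makes no claim about how icgrep validates its input. It only uses Python's own UTF-8 decoder as a stand-in to show that the EUC-JP bytes for the search string are not well-formed UTF-8. The `is_valid_utf8` helper is a name made up for this illustration.

```python
# Check whether a byte string is well-formed UTF-8, using Python's decoder
# as a stand-in for a UTF-8 validator (an assumption, not icgrep's code).

def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

pattern = "りんご"
utf8_bytes = pattern.encode("utf-8")
euc_bytes = pattern.encode("euc_jp")

print(is_valid_utf8(utf8_bytes))  # True
print(is_valid_utf8(euc_bytes))   # False
```

The EUC-JP form starts with the byte 0xA4, which UTF-8 only allows as a continuation byte and never as the first byte of a character, so the very first byte is already rejected.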

So how does standard grep handle it?

$ grep "りんご" edict -c
23


grep found 23 matching lines. However, look what happens if we try to print them:

$ grep "りんご" edict
Binary file edict matches


So grep thinks the file is binary data, which is interesting.
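
One way to sidestep the encoding mismatch is to decode the data as EUC-JP before searching it. This is not something the experiment above does. It is only a minimal sketch, assuming Python's built-in `euc_jp` codec, and the `count_matching_lines` helper and the sample entries are made up for this illustration.

```python
import io

def count_matching_lines(stream, pattern, encoding="euc_jp"):
    """Count the lines of a binary stream that contain `pattern` once decoded."""
    # errors="replace" keeps going if a byte sequence is not valid in `encoding`.
    reader = io.TextIOWrapper(stream, encoding=encoding, errors="replace")
    return sum(1 for line in reader if pattern in line)

# Tiny in-memory stand-in for the dictionary file (hypothetical entries).
sample = (
    "りんご /(n) apple/\n"
    "みかん /(n) mandarin orange/\n"
    "りんご酒 /(n) cider/\n"
).encode("euc_jp")

print(count_matching_lines(io.BytesIO(sample), "りんご"))  # 2
```

With the real file, the same function could be called on `open("edict", "rb")`. Whether it would report the same 23 lines as grep has not been checked here.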