icgrep

icgrep is designed for UTF-8 input, but what happens if you try to use it with other character encodings? EUC-JP (Extended Unix Code for Japanese) is an encoding used to represent Japanese text. Most notably, it is not a Unicode encoding, let alone UTF-8. Like UTF-8, however, EUC-JP is a variable-width encoding, so it seems plausible that we might have some success with it.

edict is a Japanese dictionary file found here

It's encoded in EUC-JP, so let's try to use it as input for icgrep.


	$ ./icgrep "apple" edict -c
	
	159


	$ ./icgrep "apple" edict
	
	ＣＡＭ植物 [カムしょくぶつ] /(n) CAM plant (any plant, such as the pineapple, that uses crassulacean acid metabolism)/
	あし毛 [あしげ] /(n,adj-no) dapple-grey (gray) (horse coat colour)/
	この親にしてこの子あり [このおやにしてこのこあり] /(exp) (proverb) the apple doesn't fall far from the tree/like father, like son/
	つかみ合う [つかみあう] /(v5u,vi) to grapple/

So icgrep can read non-UTF-8 text and find matches for an ASCII regular expression. But what if we search for a Japanese string?


	$ ./icgrep "りんご" edict -c
	
	icgrep ERROR: Invalid UTF-8 encoding!

So icgrep rejects that search. That makes sense, since icgrep expects UTF-8 and EUC-JP is not a supported encoding.
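To see why the ASCII search worked while the Japanese one did not, it helps to compare the raw bytes. The following is a minimal Python sketch, not anything taken from icgrep itself; the helper `is_valid_utf8` is a name introduced here for illustration. It shows that "apple" has identical bytes in EUC-JP and UTF-8, while "りんご" encoded as EUC-JP is not a valid UTF-8 byte sequence. It does not show where icgrep raises its error, only why such bytes fail UTF-8 validation.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_word = "apple"
japanese_word = "りんご"

# Encode each word both ways using Python's built-in codecs.
ascii_euc = ascii_word.encode("euc_jp")
ascii_utf8 = ascii_word.encode("utf-8")
jp_euc = japanese_word.encode("euc_jp")
jp_utf8 = japanese_word.encode("utf-8")

print(ascii_euc == ascii_utf8)  # ASCII is encoded the same way in both
print(jp_euc.hex(" "))          # two bytes per hiragana character in EUC-JP
print(jp_utf8.hex(" "))         # three bytes per hiragana character in UTF-8
print(is_valid_utf8(jp_euc))    # the EUC-JP bytes are not valid UTF-8
```

Because the ASCII bytes are shared, an ASCII pattern can match inside an EUC-JP file without the tool ever decoding the Japanese portions correctly.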

So how does standard grep handle it?


	$ grep "りんご" edict -c
	
	23

grep found 23 lines matching the string. However, if we try to print these lines, we get:


	$ grep "りんご" edict
	
	Binary file edict matches

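So grep treats the file as binary data, which is interesting. A likely explanation is that the EUC-JP bytes are not validly encoded text in the current locale, which GNU grep takes as a sign of binary input.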
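The behaviour above is consistent with matching on raw bytes: the count succeeds because the pattern bytes occur in the file, but the lines cannot be shown as text without knowing the encoding. Here is a rough Python sketch of that idea, assuming the pattern is supplied in the same encoding as the data. The function `grep_bytes` and the sample entries are made up for illustration and are not taken from the real edict file or from grep's implementation.

```python
def grep_bytes(data: bytes, pattern: str, encoding: str = "euc_jp") -> list:
    """Return the lines of `data` that contain `pattern`, matched byte-wise.

    Matching lines are decoded with `encoding` only for display.
    """
    needle = pattern.encode(encoding)
    matches = []
    for raw_line in data.split(b"\n"):
        if needle in raw_line:
            matches.append(raw_line.decode(encoding))
    return matches


# Made-up sample lines in the edict style (not from the real file).
sample = "\n".join([
    "りんご [りんご] /(n) apple/",
    "みかん [みかん] /(n) mandarin orange/",
    "りんご酒 [りんごしゅ] /(n) cider/",
]).encode("euc_jp")

hits = grep_bytes(sample, "りんご")
print(len(hits))
for line in hits:
    print(line)
```

One caveat of plain byte matching on a multi-byte encoding like EUC-JP is that a pattern could, in principle, match across character boundaries, so a count obtained this way is best treated as approximate.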