Files with special characters or encoding issues in Spark

Encoding is used to translate numeric values into readable characters; it provides the information your computer needs to display text on the screen.

Spark's read.csv() method accepts an encoding parameter which tells it how to decode the file while reading; if the parameter is not given, it defaults to UTF-8.

csv(path, schema=None, sep=None, encoding=None, quote=None, escape=None, comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None, maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None)
So what basically happens is: while reading the file, Spark decodes it as per the given encoding, then applies the schema on top of it and loads the data into a data frame.
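For instance, a minimal sketch of such a read, assuming a SparkSession named spark and a hypothetical file path and schema, might look like this:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# hypothetical two-column schema, just for illustration
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])

# Spark decodes the file with the given encoding, then applies the schema
df = spark.read.csv("/data/feed/customers.csv",
                    schema=schema,
                    sep=",",
                    header=True,
                    encoding="ISO-8859-1")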

If the data contained in the file complies with the schema, you will get all the records loaded into the data frame. But that can only happen if Spark is able to read the file based on the encoding you provided. What if Spark cannot translate the values into a proper readable form? How will Spark apply the schema on top of records that are not readable to it at all? In that case the values won't conform to the schema and will get moved to the bad record column.

Every now and then we receive a feed which has special characters, and we also encounter scenarios where we are not aware of the actual encoding of the file. We spend time trying out different encoding formats like utf-8, utf-16, iso-8859-1, or latin-1, but none of them works for reading the file through Spark. Many database applications like Netezza and Teradata can deal with files having different kinds of encoding, but when these files are loaded through Spark you will get bad records (click here to know how to capture bad records while loading a file into Hive through Spark).
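A rough sketch of that trial and error, assuming the same spark session, a placeholder path, and the illustrative schema from above extended with a _corrupt_record column to catch rows Spark cannot parse:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# same illustrative schema as above, plus a column to hold unparseable rows
schema_with_corrupt = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True)
])

candidate_encodings = ["utf-8", "utf-16", "iso-8859-1", "latin-1"]

for enc in candidate_encodings:
    df = (spark.read
          .schema(schema_with_corrupt)
          .option("header", "true")
          .option("mode", "PERMISSIVE")
          .option("columnNameOfCorruptRecord", "_corrupt_record")
          .option("encoding", enc)
          .csv("/data/feed/customers.csv")
          .cache())  # caching avoids Spark's restriction on querying only the corrupt record column
    bad = df.filter(col("_corrupt_record").isNotNull()).count()
    print(enc, "bad records:", bad)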


I still need to do more research on Spark encoding issues, but I am writing down some of the issues I have encountered. In this post I will share an encoding-related issue which I faced some time back while loading files through Spark.

The source system for the feed file I had was Netezza. The file was loaded successfully into Netezza with zero bad records, but when I tried loading the same file through spark.read.csv there were some bad records. I tried different kinds of encoding, but every time I got bad records.


Finally, I decided to read the files as bytes and then apply the encoding on top of that.
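A minimal sketch of that approach, assuming the same spark session and schema as above, a placeholder path, and a best-guess encoding of iso-8859-1 (errors="replace" is my own addition, so undecodable bytes become the replacement character instead of failing the read):

# read each file as a single (path, bytes) pair instead of letting Spark decode it
raw = spark.sparkContext.binaryFiles("/data/feed/customers.csv")

def decode_file(content):
    # decode the raw bytes ourselves, then split into lines for the CSV parser
    return content.decode("iso-8859-1", errors="replace").splitlines()

lines = raw.values().flatMap(decode_file)

# spark.read.csv also accepts an RDD of strings (Spark 2.2+), so the schema
# is applied on the already-decoded rows
df = spark.read.csv(lines, schema=schema, header=True)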

This fixed my problem and all records got loaded.

Note: You may see some non-readable characters in the output if there are special characters in the file.
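For example, in plain Python a Latin-1 byte that is not valid UTF-8 decodes to the Unicode replacement character:

>>> b"Caf\xe9".decode("utf-8", errors="replace")
'Caf�'    # U+FFFD, the Unicode replacement character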


To learn more on Spark click here. Let us know your views or feedback on Facebook or Twitter @BigdataDiscuss.
