I want to use scala to parse a .mht file, but I found my code is exactly like Java.
Following is a mht file sample:
From: <Save by Tencent MsgMgr>
Subject: Tencent IM Message
MIME-Version: 1.0
Content-Type:multipart/related;
charset="utf-8"
type="text/html";
boundary="----=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19"
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type: text/html
Content-Transfer-Encoding:7bit
<html xmlns="http://www.w3.org/1999/xhtml"><head></head>...</html>
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
There is a special line called boundary, which is a separator line:
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
The first part is some information about this file, which can be ignored. Following are 4 blocks, the first one is a html file, others are jpg images with base64 encoded text.
If I use Java, the code is like:
BufferedReader reader = new BufferedReader(new FileInputStream(new File("test.mht")))
String line = null;
String boundary = null;
// for a block
String contentType = null;
String encoding = null;
String location = null;
List<String> data = null;
while((line=reader.readLine())!=null) {
// first, get the boundary
if(boundary==null) {
if(line.trim().startsWith("boundary=\"") {
boundary = substringBetween(line, "\"", "\"");
}
continue;
}
if(line.equals("--"+boundary) { // new block
if(contentType!=null) {
// save data to a file
}
encoding=null;
contentType=null;
location = null;
data = new ArrayList<String>();
} else {
if(id==null || contentType==null || location ==null) {
if(line.trim().startsWith("Content-Type:") { /* get content type */ }
// else check encoding
// else check location
} else {
data.add(line);
}
}
}
I tried to use scala to rewrite the code, but I found the structure of my code is nearly the same, except I used the scala syntax instead of Java.
Is there a scala way to do the same work?
PS: I don’t want to load the full file into memory, since the file is huge. Instead I want to read and parse it line by line.
Thanks for helping!
This could be a very simple use case of state machine.