Suppose I disassemble an old .com executable that was assembled from code like this:
.model tiny ; com program
.code ; code segment
org 100h ; code starts at offset 100h
main proc near
mov ah,09h ; function to display a string
mov dx,offset message ; offset of message string, terminated with $
int 21h ; DOS interrupt
mov ah,4ch ; function to terminate
mov al,00
int 21h ; DOS interrupt
main endp
message db "Hello World $" ; Message to be displayed terminating with a $
end main
In hex it looks like this:
B4 09 BA 0D 01 CD 21 B4 4C B0 00 CD 21 48 65 6C
6C 6F 20 57 6F 72 6C 64 20 24
How does the disassembler know where the code ends and the string “Hello World” starts?
A disassembler does not know where the code ends and where the data starts in a .com file, because in .com files there is no such distinction. In a .com file everything is loaded into the same segment, and since DOS runs in real mode and has no memory protection at all, you can, for example, write obfuscated code that looks like regular text and jump into it from your code. For example (possibly crashes DOS, haven’t tested):

So
db "Hello World $"
is perfectly valid 16-bit code (checked with the udcli disassembler that comes with the udis86 disassembler library for x86 and x86-64, on Linux):

However,
db 0x64, 0x20, 0x24
is not valid 32-bit or 64-bit code. This is the 32-bit disassembly of
db "Hello World! $":

What a disassembler can do is use heuristics and code tracing to decide whether to print some parts of the disassembly as code and other parts as data. But a disassembler can never know where code ends and where data begins, because in .com files such a distinction exists only in the programmer’s head, and possibly in the source code and in the assembler’s limitations, but not in the binary .com file format itself.
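To make the "code tracing" idea concrete, here is a minimal sketch in Python, applied to the 26 bytes from the question. It is not how any particular disassembler works. The function name `trace` and the tiny opcode table are made up for illustration and cover only the three instruction forms this program uses (mov r8,imm8; mov r16,imm16; int imm8). The sketch follows execution from the entry point at 100h and stops when it sees a DOS terminate call (int 20h, or int 21h with AH=4Ch), treating whatever lies beyond that point as probably data.

```python
# Code-tracing sketch for the .com image from the question.
# Assumption: only the opcodes used by this program are handled; a real
# disassembler needs the full instruction set and must follow branches.

IMAGE = bytes.fromhex(
    "B4 09 BA 0D 01 CD 21 B4 4C B0 00 CD 21 "  # instructions
    "48 65 6C 6C 6F 20 57 6F 72 6C 64 20 24"   # "Hello World $"
)
ORG = 0x100  # .com programs are loaded at offset 100h

REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]


def trace(image, org=ORG):
    """Follow execution from the entry point.

    Returns (listing, end) where listing is a list of (address, text)
    pairs and end is the file offset at which tracing stopped.
    """
    pc = 0
    ah = None  # last known value of AH, needed to recognise int 21h/4Ch
    listing = []
    while pc < len(image):
        op = image[pc]
        addr = org + pc
        if 0xB0 <= op <= 0xB7:            # mov r8, imm8
            reg, imm = REG8[op - 0xB0], image[pc + 1]
            if reg == "ah":
                ah = imm
            listing.append((addr, f"mov {reg},{imm:02X}h"))
            pc += 2
        elif 0xB8 <= op <= 0xBF:          # mov r16, imm16
            reg = REG16[op - 0xB8]
            imm = int.from_bytes(image[pc + 1:pc + 3], "little")
            if reg == "ax":
                ah = imm >> 8
            listing.append((addr, f"mov {reg},{imm:04X}h"))
            pc += 3
        elif op == 0xCD:                  # int imm8
            num = image[pc + 1]
            listing.append((addr, f"int {num:02X}h"))
            pc += 2
            if num == 0x20 or (num == 0x21 and ah == 0x4C):
                break                     # program terminates here
        else:
            break                         # unknown opcode: give up
    return listing, pc


listing, end = trace(IMAGE)
for addr, text in listing:
    print(f"{addr:04X}  {text}")
print(f"{ORG + end:04X}  db {IMAGE[end:]!r}")
```

The tracer decides that the bytes from 010Dh onward are data only because execution never reaches them, not because anything in the file marks them as data. A computed jump or self-modifying code would defeat this heuristic, which is exactly the limitation described above.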