I am working on a C program that uses sockets to retrieve a file with an HTTP GET request. I use the recv function to read into a buffer, then append the contents of the buffer to a new file. The program works fine except for one problem: the top of every file includes the HTTP response headers.
For example, I can download a PDF file from the web using my program and it will open with no issues. However, if I open the PDF in Notepad++, I see the following at the top:
HTTP/1.1 200 OK
Date: Wed, 07 Nov 2012 19:57:54 GMT
Server: Apache/2.2.21 (Unix) mod_python/3.3.1 Python/2.6.6 PHP/5.3.8
Last-Modified: Wed, 01 Aug 2012 21:31:31 GMT
ETag: "f2ae8c-4134aa-4c63b04c07df2"
Accept-Ranges: bytes
Content-Length: 4273322
Content-Type: application/pdf
%PDF-1.4
%äðíø
10 0 obj
<</Filter/FlateDecode/Length 2722>>
...
If I download the PDF file using my browser, the files match except for the HTTP response headers at the top of the file retrieved by my program. I have verified this by removing the offending lines and comparing the file hashes.
I feel that there are much more elegant and proper ways of approaching this. I know that there are always two newline characters after the HTTP response before the file begins, so here is my (sloppy, non-working) attempt at extracting the response:
FILE* ptr_file = fopen("PDF_TEST.pdf", "w+");
char* buffer[BUFFER_SIZE];
int file_pos = 0;
int bytes_rcvd = 0;
int first_iter = 1;
while((bytes_rcvd = recv(socket_server, buffer, BUFFER_SIZE, 0)) > 0)
{
    if(first_iter)
    {// Need to remove the HTTP response from the buffer
        char* str_buffer;
        char* html_resp = strstr(buffer, "\n\n");
        int html_resp_length = strlen(html_resp) + 2;
        printf("HTML RESPONSE:\n%s\n\n", html_resp);
        char* first_buffer[BUFFER_SIZE - html_resp_length];
        memcpy(first_buffer, buffer+html_resp_length-1, sizeof(first_buffer));
        printf("\n\nREST OF BUFFER:%s\n", first_buffer);
        bytes_rcvd -= html_resp_length;
        fwrite(first_buffer, 1, bytes_rcvd, ptr_file);
        first_iter = 0;
        continue;
    }
    fwrite(buffer, 1, bytes_rcvd, ptr_file);
    file_pos += bytes_rcvd;
}
I get segmentation faults with this code, but I believe that's because my buffer is declared as an array of char* and I'm using it as if it were a char array.
My questions:
1.) What is the best way of separating the HTTP response from the file?
2.) Is it better to use the Content-Length specified by the HTTP response for writing to the file, or should I use my current method of writing the number of bytes received?
Any input is appreciated.
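One way is to use two loops. The first reads the response header and runs until it sees the empty line that terminates the header. HTTP lines end in CRLF, so that empty line shows up as the byte sequence "\r\n\r\n", not "\n\n". The second loop receives the data. Keep in mind that the recv call that delivers the end of the header may also deliver the first bytes of the body, so whatever follows the empty line in that buffer has to be written to the file before the second loop starts.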
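Here is a minimal sketch of that two-loop idea, assuming a POSIX system and a connected blocking socket. The function names (find_body_offset, parse_content_length, save_http_body) and the 8 KiB header limit are made up for illustration, not taken from the question. It searches with memcmp instead of strstr because data from recv is not NUL-terminated and a PDF body can contain zero bytes. It does not handle chunked transfer encoding.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>     /* strncasecmp (POSIX) */
#include <sys/types.h>   /* ssize_t */
#include <sys/socket.h>  /* recv */

/* Offset of the first body byte, or -1 if the blank line that ends the
 * header block ("\r\n\r\n") is not in buf yet. */
long find_body_offset(const char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i + 4 <= len; i++)
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return (long)(i + 4);
    return -1;
}

/* Value of the Content-Length header, or -1 if absent. Call only on a
 * complete header block, so that the value is followed by "\r\n". */
long parse_content_length(const char *hdr, size_t hdr_len)
{
    static const char name[] = "Content-Length:";
    const size_t name_len = sizeof name - 1;
    size_t line = 0;                       /* start of the current line */

    while (line + name_len <= hdr_len) {
        if (strncasecmp(hdr + line, name, name_len) == 0)
            return strtol(hdr + line + name_len, NULL, 10);
        const char *nl = memchr(hdr + line, '\n', hdr_len - line);
        if (nl == NULL)
            break;
        line = (size_t)(nl - hdr) + 1;
    }
    return -1;
}

/* Hypothetical helper: copy only the response body from sock to out.
 * Returns the number of body bytes written, or -1 on error. */
long save_http_body(int sock, FILE *out)
{
    char buf[8192];                        /* assumes headers fit in 8 KiB */
    size_t used = 0;
    long body_off = -1;
    ssize_t n;

    /* Loop 1: accumulate data until the end of the header is seen. */
    while (body_off < 0) {
        if (used == sizeof buf)
            return -1;                     /* header too large for buf */
        n = recv(sock, buf + used, sizeof buf - used, 0);
        if (n <= 0)
            return -1;                     /* error, or closed mid-header */
        used += (size_t)n;
        body_off = find_body_offset(buf, used);
    }

    long expected = parse_content_length(buf, (size_t)body_off);
    long written = (long)used - body_off;  /* body bytes already in buf */
    if (fwrite(buf + body_off, 1, (size_t)written, out) != (size_t)written)
        return -1;

    /* Loop 2: stream the rest; stop at Content-Length if one was given,
     * otherwise when the server closes the connection. */
    while (expected < 0 || written < expected) {
        n = recv(sock, buf, sizeof buf, 0);
        if (n < 0)
            return -1;
        if (n == 0)
            break;
        if (fwrite(buf, 1, (size_t)n, out) != (size_t)n)
            return -1;
        written += n;
    }
    return written;
}
```

With the code from the question, the call would look something like save_http_body(socket_server, ptr_file), with the file opened in "wb" mode so the bytes are written unmodified. On your second question: each write should still use the byte count recv returned, and Content-Length is what tells you when to stop. On a keep-alive connection the server will not close the socket after the file, so a loop that waits for recv to return 0 would hang. If the function returns fewer bytes than Content-Length announced, the transfer was cut short.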