This is one of the problems for my homework in my Database class.
I don't understand why we need to transform the CSV file into a binary file. I think that will make the data harder to search. Can anyone tell me why we need to do that? Is my teacher fooling me, or is it really better to convert a CSV file to a binary file in order to search it with the binary search method? An example of one row of the CSV file is:
1|37|O|131251.81|1996-01-02|5-LOW|Clerk#000000951|0|nstructions sleep furiously among
This is the assignment that my teacher gave me, and I am really stuck at Task C.
Overview
The objective of this assignment is to help you understand the issues involved in querying data sets that are too large to fit in memory in their entirety. To investigate those issues, you will write a Java program to read a table of data in the form of a CSV file and run queries on the table as efficiently as possible. A template of the program is provided and your code should be added to the Assignment1.java file. A driver program, Driver.java, is provided so that you can test your program. The driver program takes as input a file which contains a list of commands to be interpreted and executed by the program. You will be implementing several versions of the program in a guided fashion. In all versions, you must assume that the data may not fit in memory, i.e., you will not be able to read all the data into an in-memory Java data structure.
In all versions, the basic sequence of commands begins by loading the data, followed by a series of queries which are either equality queries or range queries. You may assume that the input is correct and well-behaved, i.e., the goal of this assignment is not error-handling.
Task A (15 pts)
In the first version, you will implement the simplest and most naive solution. The list of commands supported by your java program must include the following:
naiveLoad filename : tells the program that the following queries will be for the csv file with filename
naiveSearchEq columnNum value: prints the rows of the table where the value in column number columnNum is equal to the given value. Column numbers start from one.
naiveSearchGtr columnNum value: prints the rows of the table where the value in column number columnNum is greater than the given value.
The search commands should be implemented by reading the CSV file character by character using the java class FileReader. You should read the java documentation for FileReader, InputStreamReader etc. You MUST use the FileReader class.
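As a rough illustration of Task A, the character-by-character scan could look something like the sketch below. The class name `NaiveScan`, the method signature, and the `Consumer` callback are assumptions made for illustration, not part of the provided template; the `|` separator is taken from the sample row. Only the current row is held in memory at any time.

```java
import java.io.FileReader;
import java.io.IOException;
import java.util.function.Consumer;

public class NaiveScan {
    // Hypothetical helper: passes every row whose column columnNum (1-based)
    // equals value to the callback. Assumes '|' separates columns.
    static void searchEq(String filename, int columnNum, String value,
                         Consumer<String> out) throws IOException {
        try (FileReader reader = new FileReader(filename)) {
            StringBuilder row = new StringBuilder();
            int c;
            // read() returns one character per call, or -1 at end of file
            while ((c = reader.read()) != -1) {
                if (c == '\n') {
                    emitIfMatch(row.toString(), columnNum, value, out);
                    row.setLength(0);
                } else if (c != '\r') {
                    row.append((char) c);
                }
            }
            // the last row may not end with a newline
            emitIfMatch(row.toString(), columnNum, value, out);
        }
    }

    private static void emitIfMatch(String row, int columnNum, String value,
                                    Consumer<String> out) {
        if (row.isEmpty()) {
            return; // skip blank lines
        }
        String[] cols = row.split("\\|", -1);
        if (columnNum <= cols.length && cols[columnNum - 1].equals(value)) {
            out.accept(row);
        }
    }
}
```

Calling `NaiveScan.searchEq("orders.csv", 2, "37", System.out::println)` would print the matching rows; the file name here is only a placeholder.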
Task B (15 pts)
In the second version, you will improve upon the first version by using buffered IO. Write a second version of the search commands using the BufferedReader class. Name the commands and corresponding methods as follows:
naiveBufSearchEq columnNum value: prints the rows of the table where the value in column number columnNum is equal to the given value. Column numbers start from one.
naiveBufSearchGtr columnNum value: prints the rows of the table where the value in column number columnNum is greater than the given value.
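A minimal sketch of the buffered greater-than search might look like this. The class name `BufferedScan` and the comparison rule are assumptions: the assignment does not say how "greater than" is defined for non-numeric columns, so this sketch compares numerically when both values parse as numbers and falls back to string order otherwise.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.function.Consumer;

public class BufferedScan {
    // Hypothetical helper: passes every row whose column columnNum (1-based)
    // is greater than value to the callback. Assumes '|' separates columns.
    static void searchGtr(String filename, int columnNum, String value,
                          Consumer<String> out) throws IOException {
        // BufferedReader fetches large chunks from disk and hands out whole lines
        try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
            String row;
            while ((row = reader.readLine()) != null) {
                String[] cols = row.split("\\|", -1);
                if (columnNum <= cols.length && isGreater(cols[columnNum - 1], value)) {
                    out.accept(row);
                }
            }
        }
    }

    // Assumed rule: numeric comparison if both sides parse, else string order.
    static boolean isGreater(String field, String value) {
        try {
            return Double.parseDouble(field) > Double.parseDouble(value);
        } catch (NumberFormatException e) {
            return field.compareTo(value) > 0;
        }
    }
}
```

The string fallback happens to order dates such as 1996-01-02 correctly, because the year-month-day format sorts the same way lexicographically as chronologically.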
Task C (50 pts)
In the third version, you will take a different approach to the problem. You will first load the CSV data file and transform it into a BINARY file. You MUST name your binary file “data.bin”. Subsequent queries will then operate on the binary file. You are free to design the format of the binary file. Name the commands and corresponding methods as follows:
binaryLoad filename : transform the csv file with filename into a binary file. The filename of the binary file should be stored in your program.
binarySearchEq columnNum value: prints the rows of the table where the value in column number columnNum is equal to the given value. Column numbers start from one.
binarySearchGtr columnNum value: prints the rows of the table where the value in column number columnNum is greater than the given value.
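One possible binary format, sketched under heavy assumptions, is a file of fixed-width records. To keep the example short it stores only columns 1, 2 and 4 of the sample row (as int, int and double); that reduced layout and the names `BinaryTable`, `binaryLoad` and `readRecord` are hypothetical, and a full solution would also need fixed-width fields for the text columns. The point it illustrates is that numbers no longer need to be parsed on every query, and record number i always starts at byte i * RECORD_SIZE.

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.RandomAccessFile;

public class BinaryTable {
    // Hypothetical reduced layout: column 1 (int), column 2 (int), column 4 (double)
    static final int RECORD_SIZE = 4 + 4 + 8;

    // Converts the CSV into fixed-width binary records; returns the row count.
    static long binaryLoad(String csvFile, String binFile) throws IOException {
        long rows = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(csvFile));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(binFile)))) {
            String row;
            while ((row = in.readLine()) != null) {
                if (row.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] cols = row.split("\\|", -1);
                out.writeInt(Integer.parseInt(cols[0]));
                out.writeInt(Integer.parseInt(cols[1]));
                out.writeDouble(Double.parseDouble(cols[3]));
                rows++;
            }
        }
        return rows;
    }

    // Reads record number index (0-based) by seeking straight to its offset.
    static String readRecord(RandomAccessFile bin, long index) throws IOException {
        bin.seek(index * RECORD_SIZE);
        int col1 = bin.readInt();
        int col2 = bin.readInt();
        double col4 = bin.readDouble();
        return col1 + "|" + col2 + "|" + col4;
    }
}
```

Because every record has the same size, a query can jump to any record without reading the ones before it, which is not possible with variable-length CSV lines.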
Task D (20 pts)
Take timings of versions 1, 2, and 3 of your program and compare the running times. You should average the timings over at least 10 runs. In the inline submission on Laulima, answer the following questions:
Tabulate the average running time of the three versions of your program. Compare the running times of the three versions.
How are the timings of the different versions different?
Why are the timings of the different versions different?
What did you learn in this assignment? What was most difficult/challenging (if any)?
Given the updated objectives, I would make a pass through the file and build a sorted index on the key. The index would contain the key values and, for each key, the offset of each record with that key. I would then write a new file consisting of the index followed by the original data. If you are allowed to use two files, just write the index to disk as a separate file.
The index will be MUCH smaller than the original file. When you need to search, read only the index portion (or file), look up the key using a binary search, retrieve the offset from the index entry, and use that offset to seek into the data and read only that record.
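Here is a minimal sketch of that idea, assuming a numeric key column and `|` separators. The names `OffsetIndex`, `build` and `lookup` are made up for illustration, and the index is kept in memory rather than written to disk to keep the example short. `RandomAccessFile.readLine()` is unbuffered and slow, but it makes the byte offsets easy to track in a sketch.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OffsetIndex {
    // Builds a sorted index of {key, byte offset} pairs in one pass over the data.
    static long[][] build(String dataFile, int columnNum) throws IOException {
        List<long[]> entries = new ArrayList<>();
        try (RandomAccessFile data = new RandomAccessFile(dataFile, "r")) {
            long offset = data.getFilePointer(); // where the next row starts
            String row;
            while ((row = data.readLine()) != null) {
                if (!row.isEmpty()) {
                    String[] cols = row.split("\\|", -1);
                    entries.add(new long[] {Long.parseLong(cols[columnNum - 1]), offset});
                }
                offset = data.getFilePointer();
            }
        }
        long[][] index = entries.toArray(new long[0][]);
        // Stable sort: rows with equal keys stay in file order
        Arrays.sort(index, Comparator.comparingLong((long[] e) -> e[0]));
        return index;
    }

    // Binary-searches the index for key, then seeks to each matching row.
    static List<String> lookup(long[][] index, String dataFile, long key)
            throws IOException {
        int lo = 0;
        int hi = index.length;
        while (lo < hi) { // find the first entry with entry key >= key
            int mid = (lo + hi) >>> 1;
            if (index[mid][0] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        List<String> rows = new ArrayList<>();
        try (RandomAccessFile data = new RandomAccessFile(dataFile, "r")) {
            for (int i = lo; i < index.length && index[i][0] == key; i++) {
                data.seek(index[i][1]); // jump directly to the record
                rows.add(data.readLine());
            }
        }
        return rows;
    }
}
```

An equality query then costs about log2(n) comparisons in the index plus one seek per matching record, instead of a scan of the whole file. A greater-than query can use the same lower-bound search and then walk the index entries to the end.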
If even the index is too large to fit into RAM, then you have to build it in two steps.