Simple Sinew
Overview¶
We are going to do a project base on a dissertation named "Sinew" for a deeper understanding of mordern database system.
link of the dissertation: sinew
Team Members¶
Lingli Wu
Hang Xu
Li Xu
Jaiqi Ye
Tingting Zhang
Description¶
Destination¶
Simplely complete the two function modules Catalog and Serializer.
Requirement¶
Programming language: C/C++ only and when using c++, STL is not allowed.
System: Only can use Linux with gcc/g++/gdb and use makefile for batch compilation.
Problem¶
- complete command "insert file" and Json date format is used for the original data file.
- complete command "checkout catalog" which will print all the information in the catalog, here we simplify.
- complete command "find A = B" which will print the found result in the format of json. If there is no result, print None.
Experimental Data¶
We use nobench to generate the test data we need.
Implementation requirements¶
- for each json data record, translate it into the format describing in the dissertation "sinew" as the following:
notice that you should store all of the data in binary format.
-
you should use "fwrite/fread" in C to read or write data. Every time you are going to have a file operation, you need to check the buffer space and you can only have the operation when it reach its full size, 8KB. No padding is expected in the data file.
-
all int type data is required to be 32 bit.
Optimization¶
build a file index to reduce the time cost for querying.you can also use binary chop algorithm to speed up
Solution¶
Tools¶
- Programming language: C (pure C)
- RAM: 4GB
- OS: Ubuntu Kylin 14.10 amd64
- Compile Tool: gcc (Ubuntu 4.9.2-10ubuntu13) 4.9.2 & Makefile
- Main Test Script:Shell
- Test data:Nobench data generator
- Project Address:git
- Project Train of Thoughttrain of Thought
The whole experiment is divided into four parts, in the main function of the use of three main functions:
Catalog¶
According to the paper's tips to achieve the requirements of the Catalog. Corresponding to achieve Catalog update, find, output
Insert¶
To achieve the serialization of data stored in the file, and call the CATALOG function update. Analysis (no use of third party tools, all of their own) to implement Json data, and the serialization of the data, and in a binary manner.
Find¶
To achieve the search of data, the establishment of file index, the use of two points to find (aid in sequential storage), as well as the anti serialization (the data will be re - re Json data, and then output)
Design Process¶
1. Catalog¶
The data structure of the Catalog unit is as follow:
We use a method “list and array” to store it. The list is extendable and we can insert unit easily. Once we finish inserting, we will build an array to store all addresses of the array to form a fast visit index for the list.
After this data structure is established, we can use the catalog index to search the catalog data, we provide two functions:
FindByKeyName: the corresponding data element of catalog is searched by the key name and the type of key.
FindById: Use ids to find the corresponding data record in the catalog index.
In file (disk), catalog is stored in a format text:
(attrid, keyname, count, keytype)
2.Json Analysis¶
Json stores data format as a string array. We define the structure, including the start pointer, end pointer, JSON data type (used to determine the read of the string as the key or value) and data type (used to judge the value of the type, including: int, bool, string, array, nested and other types). Using pointers to read JSON key or value, start pointer and the end pointer representing its starting position in the string array and end position, first read for the key and the second read the value, and so on (as shown in the figure). Judgment value of the first character in a text string, if the character is digital, the value of data type is int; if the character "t" or "F", the string bool type; if the quotation marks "," ", the string is the string; if the symbol" [", the string is an array type; if the symbol" {", the string as the nested; otherwise for unknown type.
3.Serialize¶
We define the following functions:
To JSON data serialization, we must first implement the main function in the test_JsonSerializer function. The format of a tuple is as follows:
When we have a look at the JSON data, we find that the data are int32, bool, array, and nested objects. Take the following JSON data as an example to illustrate the idea of serialization:
{"dyn2": 50378, “dyn2”:"ABC", "sparse_787": "GBRDCMBQGE======"}
At first, each tuple table is used to record the position of the whole table in the entire table, as shown in the above table.
And on behalf of the table in the #attributes_num is attribute in each tuple is the number and implement to example is each JSON data in key value pairs of a number; aid0~aid of (n-1) corresponding to the is each a key; and offset that is, from the tuple start to read the key value pairs according to the offset. The JSON data as an example, serialization, the tuple first position stored tuple ID, the second position stored key on the number. The key on the number 3, aid0~aid2 accounted for to the three to five positions; and because each area corresponding to an offset read the corresponding data, so offs0~offs2 also accounted for three position; and the ninth position is used to store all the data for the total length of len; then began to read data, so offs0 equal to 9. In this example, the first value is int, which takes 4 bytes, so offs1=9+4=13; similarly, the bool type data is treated as a int type; while the second value is a string, each character is recorded in the corresponding ASCII code, corresponding to a byte; so offs2=13+3=16.
And so on, and when we encounter an array or object type, compared to simply put them as a string, we have a different approach. When we encounter these two kinds of data types, we will build a new record, and then use the recursive method to resolve it. For nested object type, the tuple_id data from the JSON data in the whole table is returned as the data stored in the main table of the table, and we will put the number of attributes to -1, to distinguish the other value types, to facilitate the identification of the reverse sequence.
We have the following data structure:
4. Find¶
First, we establish the index of the entire file, the structure of the file index is shown in the following diagram:
For each query, we first match the attribute, now with the string form of attribute in the catolog to find the corresponding attrid, with the attr_id traversal file records in a tuple. In traversing each of the time, use two points to find, find the file index in the array of attr_ids, find the record to the result, can not find it. Finally get a result list, which records the results of the search Attrid.
Then match the value and match the specific values in the corresponding file record. Before the comparison, the record is first recorded, and then the two string matching operation. If you do not match then the tag match is false, and the match is found.
Then, for each record in the match, if result is true, we will be materialized, and output.
The process of physical and chemical process is to find the corresponding data from the file index, and then use Key_type to determine the data type:
For the int and bool types, read directly from the file record in the int
For string, offset, a data record is read and then, two offset subtraction, to the length of the string. Read these characters.
For nested array type, read ID array, and then find the record in the entire file record array, call the array function of the, to return a result string.
For nested JSON types, read the ID JSON, and then recursively call this JSON's physical function, to return to the JSON string of the sub object
Finally get the original JSON record of the string, the output of.
5.Buffer¶
Test¶
Test Plan¶
Test is divided into two parts, the main project of the test and the function of each module function of small unit logic test.
The main project tests include:
- the main project import test import (insert)
- CHECK CATALOG test (catalog checkout) of the main project
- the main item search and find time test (find)
Project Test¶
The Import of Main Project using 100K data
Catalog Checkout
Catalog size : 1013
Find Test
Script:
Result:
Conclusion¶
We have learn a lot form this project, especially at C programming, organizing such a project. Tanks for the hardwork of our team. And you can go to github to download the source code for detail. If you, reader have any question, you can email me or just leave your message at message wall in this website.