Add support for dynamic columns: - A column that can hold information from many columns - One can instantly add or remove column data This is a useful feature for any store type of application, where you want to store different type of information for different kind of items. For example, for shoes you want to store: material, size, colour, maker For a computer you want to store ram, hard disk size etc... In a normal 'relational' system you would need to a table for each type. With dynamic columns you have all common items as fixed fields (like product_code, manufacturer, price) and the rest stored in a dynamic column. The proposed idea is to store the dynamic information in a blob in Google Protocol Buffers (further GPB) format and use SQL constructs to extract parts of GPB data for use in select list, for filtering, and so forth. Any support for indexing GPB data is outside of scope of this WL entry. Future ideas: - Allow indexing with name instead of numbers. When this is done we can drop the type as part of column_get() - Allow indexing on dynamic fields. Syntax examples: Creating table (no change from before, 'extra' is used for dynamic column data): CREATE TABLE item (ID int auto_increment primary_key, Type_id int, Price decimal(7,2), Country_id int, Manufacturer_id int, extra column_blob); ------------- Creating a column with many fields: Syntax: COLUMN_CREATE(column_nr, value, [column_nr,value]...) INSERT into item (NULL, 1 /* T-shirt */, 10, 1 / * Germany /, 1 /* Nike /, COLUMN_CREATE(1 /* color /, “Blue”, 2 /* Size */, “M”)); ------------- Updating a dynamic column: Syntax: COLUMN_ADD(blob,column_nr, value, column_nr,value]...) UPDATE item SET extra=COLUMN_ADD(extra, 6 /* Memory */, 2048) WHERE id=2; If the column already exists, it will be overwritten. ------------- Deleting a dynamic column (if it exists): Syntax: COLUMN_DELETE(blob, column_nr, column_nr...); UPDATE item SET extra=COLUMN_DELETE(extra, 6) WHERE id=2; ----------------- Querying a dynamic column: Syntax: COLUMN_EXISTS(blob, column_nr); SELECT * from item where COLUMN_EXISTS(extra,4); ----------------- Querying which columns exists: Syntax: COLUMN_LIST(blob, column_nr); SELECT COLUMN_LIST(extra) FROM item WHERE id=1; → “1,2” ------------------ Retrieving a dynamic column: Syntax: COLUMN_GET(string, column_nr AS type); SELECT id, COLUMN_GET(extra, 1 /* color*/ AS char) from item; This will return NULL if the column don't exists
1. GPB Encoding overview 2. GPB in an SQL database 3. Encoding to use for dynamic columns 4. How to store and access data in a protocol buffer from SQL 5. Extensions for the future 1. GPB Encoding overview ======================== GBB is a compact encoding for structured and typed data. A unit of GPB data (it is called message) is only partially self-describing: it's possible to iterate over its parts, but, quoting the spec http://code.google.com/apis/protocolbuffers/docs/encoding.html: " the name and declared type for each field can only be determined on the decoding end by referencing the message type's definition (i.e. the .proto file). " 2. GPB in an SQL database ========================= It is possible to store GPB data in MariaDB today - one can declare a binary blob column and use it to store GPB messages. Storing and retrieving entire messages will be the only available operations, though, as the server has no idea about the GPB format. It is apparent that ability to peek inside GPB data from SQL layer would be of great advantage: one would be able to - select only certain fields or parts of GPB messages - filter records based on the values of GPB fields - etc performing such operations at SQL layer will allow to reduce client<->server traffic right away, and will open path to getting the best possible performance. 3. Encoding to use for dynamic columns ====================================== The data should be coded into the proto buffer in the following format: <field_number><value_type><value>[<field_number><value_type><value>...] Where field_number is a number between 0-65536 that identifes the field <value_type> is a enum of type 'Item_result' <value> is the value coded in proto format. In other words, we should have no nested or complex structure. 4. How to store and access data in a protocol buffer from SQL ============================================================ User-friendly/meaningful access to GPB fields requires knowledge of GPB field names and types, which are not available from GPB message itself (see "GPB encoding overview" section). To make things easy for the user, we will at first stage provide SQL functions to manipulate a string that is actually in proto format. The functions we should provde are: proto_get(gpb, field_number, type) This return the field tagged with 'field_number' from the 'gpb' buffer. Example: proto_get(blob, 1, varchar) -> Returns field number 1 as varchar proto_put(gpb, field_number, value) This returns a new gbp buffer with the new value appended. Example: proto_put(proto_put(blob, 1, 1), 2, "hello") 5. Extension for future ======================= In the future we may want to access data based on name and get MariaDB to automaticly know the correct type. To do this we need to be able to store a definition for the content of the proto buffer somewhere. DecisionToMake: How to pass the server the GPB definition? First idea: add a CREATE TABLE parameter which will specify the definition itself.
* We should have both server-side support and client-side support (client side means functions in libmysqlclient so that user can select the full BLOB and extract fields in the application). The current proposal for storage is: Header: <flag><number_of_columns> Sorted index: <column_nr><offset><column_nr><offset> The flag can be used to distinguish between versions, the length of offset (2 or 3 bytes), storage format and other things. Each column_nr will take 2 bytes and offset 2 or 3 bytes (depending on length of total string). This is to allow to do a fast binary search to check if a column exists. Each column is stored as: <character_set if string><data> We don't have to store lengths as this can be calculated from the offsets. Numbers will be stored without pre-zero. API for library used for dynamic columns: int dynamic_column_create(DYNAMIC_COLUMN *str, uint column_nr, DYNAMIC_COLUMN_VALUE *value); int dynamic_column_create_many(DYNAMIC_COLUMN *str, uint column_count, uint *column_numbers, DYNAMIC_COLUMN_VALUE *values); int dynamic_column_update(DYNAMIC_COLUMN *org, uint column_nr, DYNAMIC_COLUMN_VALUE *value); int dynamic_column_update_many(DYNAMIC_COLUMN *str, uint add_column_count, uint *column_numbers, DYNAMIC_COLUMN_VALUE *values); int dynamic_column_delete(DYNAMIC_COLUMN *org, uint column_nr); /* test existance of non-NULL column */ int dynamic_column_exists(DYNAMIC_COLUMN *org, uint column_nr); /* List of not NULL columns */ int dynamic_column_list(DYNAMIC_COLUMN *org, DYNAMIC_ARRAY *array_of_uint); /* if the column do not exists it is NULL */ int dynamic_column_get(DYNAMIC_COLUMN *org, uint column_nr, DYNAMIC_COLUMN_VALUE *store_it_here); #define dynamic_column_column_free(V) dynstr_free(V) #define dynamic_column_value_init(V) (V)->type= DYN_COL_NULL Structure definitions: #define ER_DYNCOL_YES 1 /* NO and OK is the same used just to show semantics */ #define ER_DYNCOL_NO 0 #define ER_DYNCOL_OK 0 /* Wrong format of the string with encoded columns */ #define ER_DYNCOL_FORMAT -1 /* Some limit reached */ #define ER_DYNCOL_LIMIT -2 /* Out of resourses */ #define ER_DYNCOL_RESOURCE -3 /* Incorrect input data */ #define ER_DYNCOL_DATA -3 typedef DYNAMIC_STRING DYNAMIC_COLUMN; enum enum_dynamic_column_type { DYN_COL_NULL= 0, DYN_COL_INT, DYN_COL_UINT, DYN_COL_DOUBLE, DYN_COL_STRING, DYN_COL_DECIMAL, DYN_COL_DATETIME, DYN_COL_DATE, DYN_COL_TIME }; typedef enum enum_dynamic_column_type DYNAMIC_COLUMN_TYPE; struct st_dynamic_column_value { DYNAMIC_COLUMN_TYPE type; union { long long long_value; unsigned long long ulong_value; double double_value; struct { LEX_STRING string_value; CHARSET_INFO *charset; }; struct { decimal_digit_t decimal_buffer[DECIMAL_BUFF_LENGTH]; decimal_t decimal_value; }; MYSQL_TIME time_value; }; }; Ideas: - Use linear search instead of string search with small set of columns - In this case we can store lengths instead of offsets, to speed up modifications of the array. - Store type as part of offset (3 bit). As this will limit us to 8K byte offset, this means we must at the same time support 2 and 3 byte offsets. DATA ENCODING NULL - No field mentioned Length of the field will be get from offsets of two fields. Decimal and string values will has numeric unsigned prefix coded as variable length unsigned int. For decimal it will be point position for string number of collation.
Status updated. --- /tmp/wklog.34.old.16237 2011-06-30 10:58:17.000000000 +0000 +++ /tmp/wklog.34.new.16237 2011-06-30 10:58:17.000000000 +0000 @@ -1,2 +1,2 @@ -In-Progress +Complete
High Level Description modified. --- /tmp/wklog.34.old.5715 2011-04-18 11:44:35.000000000 +0000 +++ /tmp/wklog.34.new.5715 2011-04-18 11:44:35.000000000 +0000 @@ -75,9 +75,9 @@ ------------------ Retrieving a dynamic column: -Syntax: COLUMN_GET(column_nr FROM blob AS type); +Syntax: COLUMN_GET(string, column_nr AS type); -SELECT id, COLUMN_GET(1 /* color*/ FROM extra AS char(255)) from item; +SELECT id, COLUMN_GET(extra, 1 /* color*/ AS char) from item; This will return NULL if the column don't exists
Low Level Design modified. --- /tmp/wklog.34.old.25881 2011-04-13 19:08:53.000000000 +0000 +++ /tmp/wklog.34.new.25881 2011-04-13 19:08:53.000000000 +0000 @@ -22,35 +22,60 @@ API for library used for dynamic columns: -typedef DYNAMIC_STRING DYNAMIC_COLUMN; +int dynamic_column_create(DYNAMIC_COLUMN *str, + uint column_nr, + DYNAMIC_COLUMN_VALUE *value); + +int dynamic_column_create_many(DYNAMIC_COLUMN *str, + uint column_count, + uint *column_numbers, + DYNAMIC_COLUMN_VALUE *values); + +int dynamic_column_update(DYNAMIC_COLUMN *org, uint column_nr, + DYNAMIC_COLUMN_VALUE *value); +int dynamic_column_update_many(DYNAMIC_COLUMN *str, + uint add_column_count, + uint *column_numbers, + DYNAMIC_COLUMN_VALUE *values); -my_bool dynamic_column_create(DYNAMIC_COLUMN *value, int column_nr, -DYNAMIC_COLUMN_VALUE *value); +int dynamic_column_delete(DYNAMIC_COLUMN *org, uint column_nr); -my_bool dynamic_column_create_many(DYNAMIC_COLUMN *value, uint column_count, int -*column_numbers, DYNAMIC_COLUMN_VALUE *values); +/* test existance of non-NULL column */ +int dynamic_column_exists(DYNAMIC_COLUMN *org, uint column_nr); -my_bool dynamic_column_update(DYNAMIC_COLUMN *org, int column_nr, -DYNAMIC_COLUMN_VALUE *value); +/* List of not NULL columns */ +int dynamic_column_list(DYNAMIC_COLUMN *org, DYNAMIC_ARRAY *array_of_uint); -my_bool dynamic_column_delete(DYNAMIC_COLUMN *org, int column_nr); +/* + if the column do not exists it is NULL +*/ +int dynamic_column_get(DYNAMIC_COLUMN *org, uint column_nr, + DYNAMIC_COLUMN_VALUE *store_it_here); -my_bool dynamic_column_exists(DYNAMIC_COLUMN *org, int column_nr); -- Returns 1 if value exists +#define dynamic_column_column_free(V) dynstr_free(V) -/* List of not NULL columns */ -my_bool dynamic_column_list(DYNAMIC_COLUMN *org, DYNAMIC_ARRAY *array_of_int); +#define dynamic_column_value_init(V) (V)->type= DYN_COL_NULL -my_bool dynamic_column_get(DYNAMIC_COLUMN *org, int column_nr, -DYNAMIC_COLUMN_VALUE *store_it_here); -Returns 1 if column number didn't exists. +Structure definitions: -void dynamic_column_free(DYNAMIC_COLUMN *value); +#define ER_DYNCOL_YES 1 +/* NO and OK is the same used just to show semantics */ +#define ER_DYNCOL_NO 0 +#define ER_DYNCOL_OK 0 +/* Wrong format of the string with encoded columns */ +#define ER_DYNCOL_FORMAT -1 +/* Some limit reached */ +#define ER_DYNCOL_LIMIT -2 +/* Out of resourses */ +#define ER_DYNCOL_RESOURCE -3 +/* Incorrect input data */ +#define ER_DYNCOL_DATA -3 -Structure definitions: +typedef DYNAMIC_STRING DYNAMIC_COLUMN; -enum DYNAMIC_COLUMN_TYPE +enum enum_dynamic_column_type { + DYN_COL_NULL= 0, DYN_COL_INT, DYN_COL_UINT, DYN_COL_DOUBLE, @@ -58,24 +83,29 @@ DYN_COL_DECIMAL, DYN_COL_DATETIME, DYN_COL_DATE, - DYN_COL_TIME, - DYN_COL_NULL + DYN_COL_TIME }; -struct DYNAMIC_COLUMN_VALUE +typedef enum enum_dynamic_column_type DYNAMIC_COLUMN_TYPE; + +struct st_dynamic_column_value { - enum DYNAMIC_COLUMN_TYPE type; + DYNAMIC_COLUMN_TYPE type; union { - longlong long_value; + long long long_value; + unsigned long long ulong_value; double double_value; - { + struct { LEX_STRING string_value; CHARSET_INFO *charset; - } + }; + struct { + decimal_digit_t decimal_buffer[DECIMAL_BUFF_LENGTH]; decimal_t decimal_value; + }; MYSQL_TIME time_value; - } + }; }; Ideas:
Category updated. --- /tmp/wklog.34.old.26161 2011-03-12 10:45:13.000000000 +0000 +++ /tmp/wklog.34.new.26161 2011-03-12 10:45:13.000000000 +0000 @@ -1,2 +1,2 @@ -Server-BackLog +Server-Sprint
Implementor updated: -> Sanja
Low Level Design modified. --- /tmp/wklog.34.old.4265 2011-03-07 08:57:57.000000000 +0000 +++ /tmp/wklog.34.new.4265 2011-03-07 08:57:57.000000000 +0000 @@ -38,6 +38,7 @@ my_bool dynamic_column_exists(DYNAMIC_COLUMN *org, int column_nr); - Returns 1 if value exists +/* List of not NULL columns */ my_bool dynamic_column_list(DYNAMIC_COLUMN *org, DYNAMIC_ARRAY *array_of_int); my_bool dynamic_column_get(DYNAMIC_COLUMN *org, int column_nr, @@ -57,7 +58,8 @@ DYN_COL_DECIMAL, DYN_COL_DATETIME, DYN_COL_DATE, - DYN_COL_TIME + DYN_COL_TIME, + DYN_COL_NULL }; struct DYNAMIC_COLUMN_VALUE @@ -83,27 +85,14 @@ - Store type as part of offset (3 bit). As this will limit us to 8K byte offset, this means we must at the same time support 2 and 3 byte offsets. -DATA ENCODING: +DATA ENCODING - NULL: - a) No field mentioned - b) Field mentioned but offset is 0 (or other impossible value) - - For encoding unsigned int we could use method from protobuff - variable length -integers. where first bit of 8 set to 1 if next byte belong to this number, -other 7 bit used for number itself. Could be used for storing (decreasing -probability): - 1) string length - 2) unsigned int - 3) parts of decimal numbers - For encoding signed numbers it could be coded like: - 0 -> 0 - 1 -> 1 - -1 -> 2 - 2 -> 3 - -2 -> 4 - ... - and then as unsigned described above (it is also taken from protobuff). Could -be used for storing sugned integers, signed decimals. +NULL - No field mentioned + +Length of the field will be get from offsets of two fields. + +Decimal and string values will has numeric unsigned prefix coded as variable +length unsigned int. For decimal it will be point position for string number of +collation.
Title modified. --- /tmp/wklog.34.old.3370 2011-03-07 08:38:51.000000000 +0000 +++ /tmp/wklog.34.new.3370 2011-03-07 08:38:51.000000000 +0000 @@ -1,2 +1,2 @@ -Add support for dynamic columns (via google protocol buffers) +Add support for dynamic columns
Low Level Design modified. No change.
Low Level Design modified. --- /tmp/wklog.34.old.22411 2011-02-25 09:18:52.000000000 +0000 +++ /tmp/wklog.34.new.22411 2011-02-25 09:18:52.000000000 +0000 @@ -1,4 +1,3 @@ - * We should have both server-side support and client-side support (client side means functions in libmysqlclient so that user can select the full BLOB and extract fields in the application). @@ -84,4 +83,27 @@ - Store type as part of offset (3 bit). As this will limit us to 8K byte offset, this means we must at the same time support 2 and 3 byte offsets. +DATA ENCODING: + + NULL: + a) No field mentioned + b) Field mentioned but offset is 0 (or other impossible value) + + For encoding unsigned int we could use method from protobuff - variable length +integers. where first bit of 8 set to 1 if next byte belong to this number, +other 7 bit used for number itself. Could be used for storing (decreasing +probability): + 1) string length + 2) unsigned int + 3) parts of decimal numbers + For encoding signed numbers it could be coded like: + 0 -> 0 + 1 -> 1 + -1 -> 2 + 2 -> 3 + -2 -> 4 + ... + and then as unsigned described above (it is also taken from protobuff). Could +be used for storing sugned integers, signed decimals. +
Observers changed: Sergei