Parsing CREATE TABLE statement with Regex in Node.js

Published on: 18th Jan 2021

Updated on: 16th Mar 2023

Overview

Usually, we start developing a new system with designing database and then writing codes. If you are not a fan of ORM, you will have to manually writing the code for the data objects and it is very time consuming. To avoid writing the data objects manually, we develop code generator that will generate the basic codes and then we customize it.

To develop code generator that generate data object code, the generator must first understand the table structure. There are two ways to do that.

For MSSQL, we query the INFORMATION_SCHEMA view for gathering the field name, data type and it's attributes.
Parse the CREATE TABLE statement.

The first option is quite easy and straight forward. The catch is that it is database dependant and might not supported in certain database engine. The second option should be more viable because of the CREATE TABLE statement is very much following the ANSI-SQL standard with some minor database dependant feature.

In this article, we will focus in parsing the CREATE TABLE statement with RegEx using JavaScript.

Ways of parsing text

Let's see how many ways of parsing text:

Munching the CREATE TABLE statement character by character

The naive way of parsing text is to write a program to check one character after another and then look for the "terminator" (i.e., a symbol that mark the end of the current section or statement). Parsing text in this way is error prone and it will take a lot efforts in guessing and testing.
Using RegEx

With RegEx, it requires to find the pattern in the text and then come out with the expression. Luckily, SQL syntax is well defined and writing the expression will be easier.

The good thing about this method is that you are focusing on finding a better expression instead of deciding how many characters to be read and split.

A little bit about RegEx

RegEx stands for Regular Expression which perform text searching for the given pattern. It optimizes the given pattern before perform any searching. If we are developing the optimized code for searching the text manually, it will take more time to complete the project. Just take note that, RegEx is handy but it is not going to be fast. This means, the more RegEx to be perform, the slower it will be.

Here it goes

We are going to develop the parser using JavaScript which runs in Node.js and it parses SQLite CREATE TABLE statement. Let's create a .js file, copy the codes and run it in Node.js.


let sql = `create table if not exists tb_sales (
    id integer primary key
    , dt date
    , prod_id integer not null
    , qty integer default (0) not null
    , unit_price real
    , amt real 
    , doc_no text null
    , cust_code text
    , modified_on datetime default current_timestamp
    , modified_by text
);
`;

let result;

// The regex for each component:
let rg_tb = /(create\s+table\s+if\s+not\s+exists|create\s+table)\s(?<tb>.*)\s\(/gim;
let rg_fld = /(\w+\s+text.*|\w+\s+varchar.*|\w+\s+char.*|\w+\s+real.*|\w+\s+float.*|\w+\s+integer.*|\w+\s+int.*|\w+\s+datetime.*|\w+\s+date.*)/gim;
let rg_fld2 = /(?<fname>\w+)\s+(?<ftype>\w+)(?<fattr>.*)/gi;
let rg_not_null = /not\s+null/i
let rg_pk = /primary\s+key/i
let rg_fld_def = /default\s(.+)/gi

// look for table name
result = rg_tb.exec(sql);
console.log('TABLE NAME:', result.groups.tb);

let fld_list = [];

while ((result = rg_fld.exec(sql)) != null) {
    let f = result[0];

    //reset
    rg_fld2.lastIndex = 0;
    let fld_def = rg_fld2.exec(f);

    // remove the field definition terminator.
    let attr = fld_def.groups.fattr.replace(',', '').trim();

    // look for NOT NULL.
    let nullable = !rg_not_null.test(attr);

    // remove NOT NULL.
    let attr2 = attr.replace(rg_not_null, '');

    // look for PRIMARY KEY
    let is_pk = rg_pk.test(attr2);

    // look for DEFAULT
    let def = rg_fld_def.exec(attr2);
    if (def && def.length > 0) {
        def = def[1].trim();
    }
    else {
        def = null;
    }

    // append to the arr
    fld_list.push({
        name: fld_def.groups.fname.trim(),
        type: fld_def.groups.ftype.trim(),
        nullable: nullable,
        pk: is_pk,
        def: def,
        //attr0: attr
    });
}

console.table(fld_list);

Code explains:

The CREATE TABLE statement in the above code has a few different formatting, some extra spaces, the comma (field definition terminator) is not always at the beginning of the line and extra blank line. Our target is to make it harder to parse so that we can find out how robust is our expressions.
Below the CREATE TABLE statement, we are declaring all the necessary RegEx for each 'section' in the statement. The expression includes,
- rg_tb - this expression searches for 'CREATE TABLE' and extract the table name.
- rg_fld - this expression searches for the field definition. The caveat of this expression is that it is able to extract the field definition on per line basis. If any one line contains more than one field, then, it won't work properly.
- rg_fld2 - this expression is for extracting the field name, data type and other attributes. The most important is that we must set 'rg_fld2.lastIndex=0' before we call rg_fld2.exec(f). This is because after exec(), the lastIndex will be set. The next call to exec() will continue from where it was earlier.
- rg_not_null - this expression is for detecting NOT NULL keyword. By definition, all fields are nullable (NULL keyword is optional). But for non-nullable field, NOT NULL keyword must exist. This is the reason why we are detecting NOT NULL.
- rg_pk - this expression is for look for PRIMARY KEY keyword that appear in the field attribute section. In case you have a composite key, this expression will not work properly.
- rg_fld_def - this expression is to extract the DEFAULT attribute.
The result of the above code looks like this:

TABLE NAME: tb_sales
┌─────────┬───────────────┬────────────┬──────────┬───────┬─────────────────────┐
│ (index) │     name      │    type    │ nullable │  pk   │         def         │
├─────────┼───────────────┼────────────┼──────────┼───────┼─────────────────────┤
│    0    │     'id'      │ 'integer'  │   true   │ true  │        null         │
│    1    │     'dt'      │   'date'   │   true   │ false │        null         │
│    2    │   'prod_id'   │ 'integer'  │  false   │ false │        null         │
│    3    │     'qty'     │ 'integer'  │  false   │ false │        '(0)'        │
│    4    │ 'unit_price'  │   'real'   │   true   │ false │        null         │
│    5    │     'amt'     │   'real'   │   true   │ false │        null         │
│    6    │   'doc_no'    │   'text'   │   true   │ false │        null         │
│    7    │  'cust_code'  │   'text'   │   true   │ false │        null         │
│    8    │ 'modified_on' │ 'datetime' │   true   │ false │ 'current_timestamp' │
│    9    │ 'modified_by' │   'text'   │   true   │ false │        null         │
└─────────┴───────────────┴────────────┴──────────┴───────┴─────────────────────┘

So, everything looks good except that the rg_fld is not able to extract multiple fields that appears on a single line. Let's try to solve it by adjusting the expression rg_fld for more than one field definition in the same line. Consider that the paid_amt real appear next to the qty real. The code will look like this:

let sql2 = `unit_price   real
    , amt real 
, qty real, paid_amt real	 
`;

let r2;
while ((r2 = rg_fld.exec(sql2)) != null) {
    console.log('=>', r2[0])
}

And the result shown below is not what we are expecting. The problem is that qty real, paid_amt real was treated as one field instead of two fields.

=> unit_price   real
=> amt real 
=> qty real, paid_amt real

Let's tweak the expression by declaring a new one so that it look for the field definition with terminator (comma symbol) or without the terminator (i.e., the last field definition does not have terminator).

let sql2 = `unit_price   real
    , amt real 
, qty real, paid_amt real	 
`;

let r2;

let t2 = /\w+\s+real\s*,|\w+\s+real.*/gim
while ((r2 = t2.exec(sql2)) != null) {
    console.log('=>', r2[0])
}

And the result is close to what we are looking for and the comma was included in the result which we don't want.

=> unit_price   real
    ,
=> amt real 
,
=> qty real,
=> paid_amt real

Now, let's remove the comma and next line symbol manually.

let sql2 = `unit_price   real
    , amt real 
, qty real, paid_amt real	 
`;

let r2;

let t2 = /\w+\s+real\s*,|\w+\s+real.*/gim
while ((r2 = t2.exec(sql2)) != null) {
    let f = r2[0].replace('\r', '').replace('\n', '').replace(',', '');    
    console.log('=>', f)
}

The following result is clean and ready to apply rg_fld2 expression.

=> unit_price   real    
=> amt real 
=> qty real
=> paid_amt real

By the way, this parser has not been completed. You may have to figure out how to incorporate t2 expression into rg_fld.

Useful links

You will find the CREATE TABLE statement in detail in the following link,

https://sqlite.org/lang_createtable.html

There are many sites in the Internet which explains RegEx in details. Here's two of it which I find very helpful.

https://flaviocopes.com/javascript-regular-expressions/

https://javascript.info/regular-expressions

To test the regular expression that you have written, you may rely on this tool,

https://regexr.com/

Short note for RegEx

\w - ASCII alphanumeric.
. - matches any character except newline.
\s - white space
+ - one or more occurrence.
* - zero or more occurrence.
(..) - the bracket is for creating a 'group'. Please refers to rg_fld2 variable in the above code.
| - pipe symbol matches one of the occurrence. For example, "(text|varchar)\w+" matches the text that starts with 'text' or 'varchar'.

For other usage of RegEx, please refers to one of the post: Use case of RegEx in JavaScript

Jump to #NODEJS blog

Jump to #JAVASCRIPT blog

Author

Lau Hon Wan, software developer.