Analyze data with MSSQL SELECT statement
Published on: 31th May 2021
Updated on 31th May 2021
Overview
SELECT statement is very powerful which allows you to retrieve all records with proper filters and even analyze data.
To explain the SELECT statement, let's start with developing a simple system that is keeping track of attendance for members.
Table structure
The following table holds the attendance information for the member. To simplify the query, we will not include the member table. Let's focus on the attendance.
create table tb_attendance (
attendance_id uniqueidentifier not null primary key
, profile_id uniqueidentifier not null -- links to the member table.
, check_in_on datetime not null -- the check in time.
, check_out_on datetime -- the check out time. Fill in upon check out.
)
Sample data
insert into tb_attendance (attendance_id, profile_id, check_in_on, check_out_on) values
('7B7511A7-C4BC-40C2-BB06-961AAF5C263E', '3829BE4D-CCF1-4389-A024-336C06B53509', '2007-07-13 20:14:43.000', '2007-07-13 22:14:03.000')
, ('AC7F544D-A4FF-4AE8-9FF3-B9C0295C8F8B', '2D0F3DDB-9131-4313-891E-A11DD8FC0298', '2007-07-13 14:47:42.000', '2007-07-13 19:04:54.000')
, ('5419770E-D7CD-482C-B3D6-FD89FCCA9789', 'FC5E926D-2417-46C0-BD06-86A2E3D5F257', '2007-07-13 14:50:37.000', '2007-07-13 21:15:05.000')
, ('873683FD-93AD-45C6-BC6D-DA547F71CCE9', '65A1251F-A18D-402D-AA51-84466452A93B', '2007-07-13 18:58:23.000', '2007-07-13 19:04:49.000')
, ('08780963-BAB6-402C-92C1-0CBC3EFC0B70', '65A1251F-A18D-402D-AA51-84466452A93B', '2007-07-13 19:32:49.000', '2007-07-13 20:25:46.000')
, ('B3532C0F-DF5E-4BEF-8247-8CE054839FC3', '65A1251F-A18D-402D-AA51-84466452A93B', '2007-07-14 18:05:33.000', '2007-07-14 19:14:01.000')
, ('B755D38D-8C23-4A68-8CDA-D495DD43E0ED', 'FC5E926D-2417-46C0-BD06-86A2E3D5F257', '2007-07-14 17:23:08.000', '2007-07-14 19:11:15.000')
Notes: in the above test data, the member of 65A1251F-A18D-402D-AA51-84466452A93B is having two attendances on 13th July. This is because the member has checked out for dinner and re-check in.
Queries
Show all records in the table
In any system, showing the full records to the user is a mandatory feature. This can be done easily with SELECT statement,
select *
from tb_attendance
where
check_in_on between '2007-07-13' and '2007-07-13 23:59'
order by check_in_on
The query shows all the attendance for 13th July 2007 and the records are sorted by check in time. The *
(asterisk symbol) means to retrieve all fields from the given table.
User requirement #1
Sounds simple when we learned the SELECT statement until we started facing the real user requirement which is challenging. Here's the requirement
Rule: the member is allowed to check in multiple times on any date but the total visit will be treated as one.
Requirement: shows the visit count for each date.
It does not sound complicated. So, let's try out this first query,
select
date = convert(nvarchar, check_in_on, 102)
, visit_count = count(*)
from tb_attendance
group by
convert(nvarchar, check_in_on, 102)
The above query extracts the date value from the check_in_on
field (in fact this field stores the date + time) and then counts the record.
date | visit_count |
---|---|
2007.07.13 | 5 |
2007.07.14 | 2 |
But, the visit_count value on 13th July is wrong. The correct value should be 4 instead of 5. This is because the above query does not treat the multiple entries that happened on the same date as the same member as one visit.
Try the following new query looks which count by the distinct profile_id (i.e., member),
select
date = convert(nvarchar, check_in_on, 102)
, visit_count = count(distinct profile_id)
from tb_attendance
group by
convert(nvarchar, check_in_on, 102)
And yes. This query works perfectly as it shows 4 visits on 13th July.
date | visit_count |
---|---|
2007.07.13 | 4 |
2007.07.14 | 2 |
User requirement #2
Rule: the member is allowed to check in multiple times on any date but the total visit will be treated as one.
Requirement: shows the first attendance record for 13th July (ignore the re-entry).
To meet the above requirement, we need the help with ROW_NUMBER() OVER (..)
keyword which is able to assign a running number to each group. Let's try from the simplest form of ROW_NUMBER()
.
First, the query will look like this:
select
attendance_id
, profile_id
, row_idx = row_number()
over (
order by profile_id, convert(nvarchar, check_in_on, 102)
)
from tb_attendance
where
check_in_on between '2007-07-13' and '2007-07-13 23:59'
Basically, the ROW_NUMBER()
keyword assigns a running number for each record and we keep the running number in the row_idx
field. The result will look like this,
attendance_id | profile_id | row_idx |
---|---|---|
7B7511A7-C4BC-40C2-BB06-961AAF5C263E | 3829BE4D-CCF1-4389-A024-336C06B53509 | 1 |
873683FD-93AD-45C6-BC6D-DA547F71CCE9 | 65A1251F-A18D-402D-AA51-84466452A93B | 2 |
08780963-BAB6-402C-92C1-0CBC3EFC0B70 | 65A1251F-A18D-402D-AA51-84466452A93B | 3 |
5419770E-D7CD-482C-B3D6-FD89FCCA9789 | FC5E926D-2417-46C0-BD06-86A2E3D5F257 | 4 |
AC7F544D-A4FF-4AE8-9FF3-B9C0295C8F8B | 2D0F3DDB-9131-4313-891E-A11DD8FC0298 | 5 |
But it is far from meeting the user requirement.
Let's try to add the PARTITION BY
keyword into the ROW_NUMBER()
section. The record will be partitioned based on profile_id and check in date. This keyword indicates that the running number will be reset to 1 whenever profile_id and/or check in date changed.
select
attendance_id
, profile_id
, row_idx = row_number()
over (
partition by profile_id, convert(nvarchar, check_in_on, 102)
order by profile_id, convert(nvarchar, check_in_on, 102)
)
from tb_attendance
where
check_in_on between '2007-07-13' and '2007-07-13 23:59'
For the member of '65A1251F-A18D-402D-AA51-84466452A93B' who has two attendance records. The row_idx
value for the first record is 1 and the second record is 2. The result is very close to the user requirement except that we need to find a way to get rid of the row_idx=2
.
attendance_id | profile_id | row_idx |
---|---|---|
7B7511A7-C4BC-40C2-BB06-961AAF5C263E | 3829BE4D-CCF1-4389-A024-336C06B53509 | 1 |
873683FD-93AD-45C6-BC6D-DA547F71CCE9 | 65A1251F-A18D-402D-AA51-84466452A93B | 1 |
08780963-BAB6-402C-92C1-0CBC3EFC0B70 | 65A1251F-A18D-402D-AA51-84466452A93B | 2 |
5419770E-D7CD-482C-B3D6-FD89FCCA9789 | FC5E926D-2417-46C0-BD06-86A2E3D5F257 | 1 |
AC7F544D-A4FF-4AE8-9FF3-B9C0295C8F8B | 2D0F3DDB-9131-4313-891E-A11DD8FC0298 | 1 |
That's easy. Basically, the user requirements means that we should return the record if row_idx=1
. We made the previous query as a sub-query and added a filter in the main query with row_idx=1
.
select *
from (
select
attendance_id
, profile_id
, row_idx = row_number()
over (
partition by profile_id, convert(nvarchar, check_in_on, 102)
order by profile_id, convert(nvarchar, check_in_on, 102)
)
from tb_attendance
where
check_in_on between '2007-07-13' and '2007-07-13 23:59'
) as a
where
a.row_idx = 1
Now, the result has 4 records and the member of '65A1251F-A18D-402D-AA51-84466452A93B' is having 1 record only.
attendance_id | profile_id | row_idx |
---|---|---|
7B7511A7-C4BC-40C2-BB06-961AAF5C263E | 3829BE4D-CCF1-4389-A024-336C06B53509 | 1 |
873683FD-93AD-45C6-BC6D-DA547F71CCE9 | 65A1251F-A18D-402D-AA51-84466452A93B | 1 |
5419770E-D7CD-482C-B3D6-FD89FCCA9789 | FC5E926D-2417-46C0-BD06-86A2E3D5F257 | 1 |
AC7F544D-A4FF-4AE8-9FF3-B9C0295C8F8B | 2D0F3DDB-9131-4313-891E-A11DD8FC0298 | 1 |
So, we solved the second user requirement.
User requirement #3
Rule: the member is allowed to check in multiple times on any date but the total visit will be treated as one.
Requirement: shows the number of members during the business operation hours for the given date.
This requirement sounds easy but there is a catch! It was asking for the number of members in the facility in operation hour but not the check in hour.
If you wrote the query like the one below, then, you have answered the question wrongly. The following query is showing the number of members who checked in at a particular hour.
select
hour = datepart(hour, check_in_on)
, check_in_count = count(*)
from tb_attendance
where
check_in_on between '2007-07-13' and '2007-07-13 23:59'
group by
datepart(hour, check_in_on)
The result will look like this. It is not what the user wants!
hour | check_in_count |
---|---|
14 | 2 |
18 | 1 |
19 | 1 |
20 | 1 |
The correct query that shows the number of members in the facility during the operation hours. Let's say, the operation is between 10am to 9pm.
-- declare a 'timezone' CTE
; with tz (hour)
as (
select hour = 10
union all
select hour + 1
from tz
where
hour + 1 < 21
),
-- declare the attendance hours
att (in_hour, out_hour)
as (
select
in_hour = datepart(hour, check_in_on)
, out_hour = datepart(hour, check_out_on)
from tb_attendance
where
check_in_on between '2007-07-13' and '2007-07-13 23:59'
)
select
tz.hour
-- count the number of member who is in the facility
, member_at_gym = (
select count(*)
from att
where
tz.hour between att.in_hour and att.out_hour
)
from tz
Where
-
'tz' is the CTE (Common Table Expression) that returns the business operation hours.
-
'att' is the CTE that returns the attendance for 13th July. The check in time and check out time has been translated into
in_hour
andout_hour
respectively. -
Finally, we combined the above CTE-s and generated the result.
hour member_at_gym 10 0 11 0 12 0 13 0 14 2 15 2 16 2 17 2 18 3 19 4 20 3 From the result, it tells the user 4 members are using the facility at 7pm (i.e., hour=19).
User requirement #4
Rule: the member is allowed to check in multiple times on any date but the total visit will be treated as one.
Requirement: show the number of visits for each member.
If the rule does not exist, we may use the following query to retrieve the result.
select
profile_id
, visit_count = count(*)
from tb_attendance
group by
profile_id
and the result is,
profile_id | visit_count |
---|---|
3829BE4D-CCF1-4389-A024-336C06B53509 | 1 |
65A1251F-A18D-402D-AA51-84466452A93B | 3 |
FC5E926D-2417-46C0-BD06-86A2E3D5F257 | 2 |
2D0F3DDB-9131-4313-891E-A11DD8FC0298 | 1 |
Since, there is a rule said that multiple entries on the same day will be count as one, then, the correct query should look like the following:
select
profile_id
, visit_count = count(*)
from (
select distinct
profile_id
, check_in_date = convert(nvarchar, check_in_on, 102)
from tb_attendance
) as a
group by
profile_id
First, we retrieve the distinct record that contains the profile_id
field and extract the date value (without time value) from the check_in_on
field in the sub-query. After that, we return the summary based on the result of the sub-query. Then, the result will look like this,
profile_id | visit_count |
---|---|
3829BE4D-CCF1-4389-A024-336C06B53509 | 1 |
65A1251F-A18D-402D-AA51-84466452A93B | 2 |
FC5E926D-2417-46C0-BD06-86A2E3D5F257 | 2 |
2D0F3DDB-9131-4313-891E-A11DD8FC0298 | 1 |
Related posts
Jump to #MSSQL blog
Author
Lau Hon Wan, software developer.