Chapter 7 MySQL Databases
Written by Nathan Muncy.
A MySQL server is installed on labarserv2 which allows for project data to be stored within a relational database management system (RDBMS). Utilizing a RDBMS allows the researcher to centralize, better manage, flexibly access, and share their data. Additionally, the dedicated structures will help maintain data consistency across users and projects.
This chapter will serve as a brief tutorial and reference for setting up and then using project-specific databases.
7.1 Create an Account
Admins of labarserv2 create accounts and grant access to the MySQL server and databases. Start by logging into the server as admin via $sudo mysql.
Current user and host information are available in mysql.user:
use mysql;
select user, host from user;
To create a user (‘test’) identified by a password (‘foobar’), specify the new user name, host or IP, and password:
create user 'test'@'localhost' identified by 'foobar';
create user 'test'@'127.0.0.1' identified by 'foobar';
Next, permissions to a specific database (‘db_test’), and also specific tables if desired, can be granted (note that comments are specified in SQL via --):
grant insert, update, select, references,
create temporary tables
on db_test.*
to 'test'@'localhost'; -- give specific permissions
grant all on db_test.* to 'test'@'localhost'; -- give all permissions
Finally, check that appropriate permissions have been added to the user account and then apply the changes to the server:
show grants for 'test'@'localhost';
flush privileges;
The user should then be able to log in to the server via $mysql -u test -p, enter their password (foobar), and access the database (use db_test; show tables;).
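Should access later need to be removed, permissions can be revoked or the account dropped entirely (standard MySQL syntax):
revoke all privileges on db_test.* from 'test'@'localhost';
drop user 'test'@'localhost';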
7.2 Login
Users that have an account can log in to the MySQL server via
$mysql -u <user_name> -p
replacing <user_name> with their account name, and then supply their password when prompted. Databases to which the user has access can be listed via show databases;. After selecting a database for use (use db_some_name;), available tables can be listed via show tables;.
A database can also be specified at login:
$mysql -u <user_name> -p db_some_name
Exit the server via exit; or quit;.
7.3 Create a Database
Admin (i.e. all) privileges are needed to create a database. Accordingly, it is best for labarserv2 admins to create databases and then grant specific (or all) permissions on each database to the user.
Database creation is accomplished simply via:
create database db_some_name;
Conventions - The following database conventions are implemented in the LaBar Lab:
- Database names start with db, e.g. db_project1 and db_project2. This allows project databases to be identified separately while also being distinct from admin and schema databases
- Separate databases are made for separate projects; that is, one project should not span multiple databases
- Database names are succinct, being both descriptive and as short as is reasonable
- Databases are named in snake case (db_emotion_regulation, not db-emotion-regulation)
7.4 Create a Table
Tables are 2-dimensional matrices containing rows and columns of data. Each row must contain unique information and each column contains only one data type. Whether to use long- or wide-formatted tables depends on the user and the nature of the data, but long-formatted (or at least tidy) tables are preferable and generally more flexible.
The create table syntax is used within a database to both create the table and specify the number and types of columns. Three types of tables (reference, data, and definition) are used in the LaBar Lab, prepended with the ref_, tbl_, and def_ identifiers, respectively.
7.4.1 Reference Tables
When a certain value, such as a participant ID or emotion name, shows up across multiple data tables, the value can be stored in a reference table rather than repeated across the tables. These reference tables, and the fact that values from multiple tables can be queried at once, are the point of RDBMSs. As an added perk, the size of the database is also decreased as redundant values are minimized.
Say a project intends to gather information across multiple subjects, each of whom will participate in three sessions, and some measures will occur during all sessions. We can then set up reference tables for the subject IDs, session IDs, and emotion IDs as these values would be common across all measures.
To start a reference table for subject IDs:
use db_test;
create table ref_subj (
subj_id int not null,
subj_name char(4) not null,
primary key(subj_id)
);
Here we created the table ref_subj in the database db_test. This table contains two columns: the first, subj_id, is integer type and does not allow null values; the second, subj_name, holds exactly 4 characters and likewise does not allow null values. Finally, the primary key is set to the subj_id column. This key uniquely identifies each row in the table and will serve as the reference value across database tables; note that not null is required for primary key columns.
A description of the table is available via the describe table_name command:
mysql> describe ref_subj;
+-----------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------+------+-----+---------+-------+
| subj_id | int | NO | PRI | NULL | |
| subj_name | char(4) | NO | | NULL | |
+-----------+---------+------+-----+---------+-------+
2 rows in set (0.01 sec)
Conventions - The following reference table conventions are implemented in the LaBar Lab:
- The table name starts with ‘ref_’
- The name of a column involved in the primary key ends in ‘_id’
- The name of the column holding the value that corresponds to the primary key ends in ‘_name’
- The table name is reflected in column names when possible (e.g. ‘subj’)
Following these conventions will increase the ease of joining tables and allow other researchers to more readily understand the data structure.
Data can then be added to the table via insert by specifying the desired columns and input values:
insert into ref_subj
(subj_id, subj_name)
values
(1, "SUB1"),
(2, "SUB2"),
(3, "SUB3"),
(4, "SUB4");
Check that the reference table is properly specified via select:
mysql> select * from ref_subj;
+---------+-----------+
| subj_id | subj_name |
+---------+-----------+
| 1 | SUB1 |
| 2 | SUB2 |
| 3 | SUB3 |
| 4 | SUB4 |
+---------+-----------+
4 rows in set (0.00 sec)
The other reference tables, for session and emotion, can likewise be built (note the variable emotion name length):
-- Session reference
create table ref_sess (
sess_id int not null,
sess_name char(4) not null,
primary key(sess_id)
);
insert into ref_sess
(sess_id, sess_name)
values
(1, "day1"),
(2, "day2"),
(3, "day3");
-- Emotion reference
create table ref_emo (
emo_id int not null,
emo_name varchar(10) not null,
primary key(emo_id)
);
insert into ref_emo
(emo_id, emo_name)
values
(1, "amusement"),
(2, "anger"),
(3, "anxiety"),
(4, "awe"),
(5, "calmness"),
(6, "craving"),
(7, "disgust"),
(8, "excitement"),
(9, "fear"),
(10, "horror"),
(11, "joy"),
(12, "neutral"),
(13, "romance"),
(14, "sadness"),
(15, "surprise");
Once completed, we have three reference tables that can be used to interpret repeated values in data tables via join (see below). Such a modular structure, using one table per data type and/or source while accounting for repetitive data, allows for better maintenance and greater efficiency of the database.
7.4.2 Data Tables
Continuing with the example started above, data from each measure collected is stored in data tables, with repeated values linked to reference tables. Building data tables uses the same methodology described above, with the addition of specifying foreign keys and thus beginning to make our database relational.
Let us suppose that for each session our participants supply PANAS and ERQ ratings as well as a hypothetical in-house measure (EmoFreq), during which participants report how frequently, intensely, and saliently they felt a given emotion (ref_emo) over the previous week.
Using the create table syntax again, we first build a data table for the PANAS questionnaire:
create table tbl_panas (
subj_id int not null,
sess_id int not null,
item_panas int not null,
resp_panas int,
primary key(subj_id, sess_id, item_panas),
foreign key(subj_id) references ref_subj(subj_id) on delete cascade,
foreign key(sess_id) references ref_sess(sess_id)
);
Here we created a long-formatted table containing four columns – one each for the subject identifier (subj_id), session identifier (sess_id), PANAS item/prompt (item_panas), and participant response to the PANAS item/prompt (resp_panas). Participant responses are recorded as integer type, and only the PANAS item number (integer) is used – if it were desired to keep the question language/prompt rather than the question number, then a reference table could be set up (e.g. ref_panas, see conventions above) and the column here would be named panas_id. Also note that item_panas is specified as not null since participants will see every PANAS item (and this column is used in the primary key). Finally, given that participants may not respond to each item, resp_panas allows null values.
The primary key is specified here as the composition of subject, session, and PANAS item (sometimes termed a ‘composite key’ or ‘natural key’) and serves as the unique identifier for each participant response. It is recommended to always explicitly specify your primary key; explicitly controlling keys also helps in setting up proper relationship mappings. If a natural key ought not to be used as the primary key, or if data naturally increment, then a column can instead be set to auto increment and serve as the primary key, as sketched below.
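For illustration, a minimal sketch of that alternative follows; the table and columns (tbl_log, log_id, log_msg) are hypothetical and not part of our example project:
-- Hypothetical table using a surrogate, auto-incrementing primary key
create table tbl_log (
    log_id int not null auto_increment,
    log_msg text,
    primary key(log_id)
);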
Foreign keys are used to reflect that a column is linked to another table, and can also be used to constrain column values (a quick demonstration follows below). Here, the columns subj_id and sess_id will not be populated with the actual subject and session names/values, as that would involve a lot of repetitive information, but instead with their respective reference table IDs. Specifically, we set the subj_id column (tbl_panas.subj_id) to reference the subj_id column of ref_subj (ref_subj.subj_id); similarly, tbl_panas.sess_id references ref_sess.sess_id. Finally, on delete cascade allows records in this table to be removed if the referenced value in ref_subj is deleted (perhaps due to participant withdrawal).
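As a quick demonstration of that constraint behavior, an insert referencing a subject ID that has no row in ref_subj will be rejected by the server (MySQL raises error 1452 for foreign key violations):
-- Fails: subj_id 99 has no matching row in ref_subj
insert into tbl_panas
    (subj_id, sess_id, item_panas, resp_panas)
values
    (99, 1, 1, 3);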
Conventions - The following data table conventions are implemented in the LaBar Lab:
- The table name starts with ‘tbl_’
- The measure name is reflected in the table name and relevant columns when possible (e.g. ‘panas’)
- The name of a column that becomes a foreign key ends in ‘_id’ when possible
- Column names for measure identifiers start with ‘item_’
- Column names for participant responses start with ‘resp_’
The data tables for the ERQ and in-house EmoFreq surveys can also be made in similar fashion:
-- ERQ table
create table tbl_erq (
subj_id int not null,
sess_id int not null,
item_erq int not null,
resp_erq int,
primary key(subj_id, sess_id, item_erq),
foreign key(subj_id) references ref_subj(subj_id) on delete cascade,
foreign key(sess_id) references ref_sess(sess_id)
);
-- EmoFreq table
create table tbl_emofreq (
subj_id int not null,
sess_id int not null,
emo_id int not null,
resp_emofreq_frq int,
resp_emofreq_int int,
resp_emofreq_sal int,
primary key(subj_id, sess_id, emo_id),
foreign key(subj_id) references ref_subj(subj_id) on delete cascade,
foreign key(sess_id) references ref_sess(sess_id),
foreign key(emo_id) references ref_emo(emo_id)
);
For the in-house EmoFreq measure, where participants are prompted with an emotion and respond how frequently, intensely, and saliently they experienced it over the previous week, the item_ column was omitted and replaced with the foreign key emo_id, as the prompt is simply the emotion name (and to incorporate our ref_emo table). Additionally, one column per response prompt (frequency, intensity, and salience) is specified. The definition tables section (below) illustrates a method of making tables more intelligible by supplying column definitions and other information needed to interpret the data.
Finally, foreign keys can be traced back to their source via the information_schema database:
SELECT
`TABLE_SCHEMA`,
`TABLE_NAME`,
`COLUMN_NAME`,
`REFERENCED_TABLE_SCHEMA`,
`REFERENCED_TABLE_NAME`,
`REFERENCED_COLUMN_NAME`
FROM
`INFORMATION_SCHEMA`.`KEY_COLUMN_USAGE`
WHERE
`TABLE_SCHEMA` = SCHEMA()
AND `REFERENCED_TABLE_NAME` IS NOT NULL;
which yields:
+--------------+-------------+-------------+-------------------------+-----------------------+------------------------+
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | REFERENCED_TABLE_SCHEMA | REFERENCED_TABLE_NAME | REFERENCED_COLUMN_NAME |
+--------------+-------------+-------------+-------------------------+-----------------------+------------------------+
| db_test | tbl_emofreq | subj_id | db_test | ref_subj | subj_id |
| db_test | tbl_emofreq | sess_id | db_test | ref_sess | sess_id |
| db_test | tbl_emofreq | emo_id | db_test | ref_emo | emo_id |
| db_test | tbl_erq | subj_id | db_test | ref_subj | subj_id |
| db_test | tbl_erq | sess_id | db_test | ref_sess | sess_id |
| db_test | tbl_panas | subj_id | db_test | ref_subj | subj_id |
| db_test | tbl_panas | sess_id | db_test | ref_sess | sess_id |
+--------------+-------------+-------------+-------------------------+-----------------------+------------------------+
7 rows in set (0.00 sec)
This is useful when tracking relationship mappings (such as when working with a new database or when table/column names are less-than-helpful), and requires the references grant for the user.
7.4.3 Definition Tables
Few things are more frustrating when working within RDBMSs than not understanding column names or not knowing what data they contain. This is compounded by less-than-intuitive column names that are sometimes necessary due to maximum lengths, as is commonly the case when working on larger projects. Definition tables help assuage these frustrations by supplying definitions or explanations of column names and purpose.
As the data table tbl_emofreq (above) has some column names that may be unintuitive, we will specify a definition table to aid in interpreting it:
create table def_emofreq (
col_name enum(
'resp_emofreq_frq',
'resp_emofreq_int',
'resp_emofreq_sal'
),
col_desc text
);
This definition table contains two columns, col_name and col_desc. Column col_name is specified via enum to accept only values that correspond to the potentially confusing column names in tbl_emofreq. Column col_desc is set to the text type, which allows for more verbose strings. A primary key has not been set, as this table is not meant to be integrated with other tables but instead to be read by the researcher.
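Assuming the server runs in strict SQL mode (the default in modern MySQL), the enum also guards against typos – attempting to insert an unlisted column name fails with a data-truncation error:
-- Fails under strict mode: 'resp_emofreq_foo' is not in the enum
insert into def_emofreq
    (col_name, col_desc)
values
    ('resp_emofreq_foo', 'Not a real column');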
Conventions - The following definition table conventions are implemented in the LaBar Lab:
- A definition table should exist for every data table, excepting those that hold standardized items (e.g. PANAS, ERQ) that could be looked up in a resource (though a definition table is still recommended even then)
- Definition tables should not be used to hold item interpretations, e.g. storing what is meant when tbl_panas.item_panas = 1 (the text of the PANAS’s first item); these belong in a reference table
- The table uses the same name as the data table, but replacing ‘tbl_’ with ‘def_’
- The table contains the column col_name, which itself only contains the names of columns from the data table
- All columns from the data table not involved in identification are specified in col_name
- The table contains the column col_desc, which holds a human-readable description of the column
- Additional columns can be added using the col_ format, when required
The definition table def_emofreq can now be populated with text:
insert into def_emofreq
(col_name, col_desc)
values
('resp_emofreq_frq', 'Participant response to frequency prompt'),
('resp_emofreq_int', 'Participant response to intensity prompt'),
('resp_emofreq_sal', 'Participant response to salience prompt');
7.5 Insert Data
Building on the example above, now that a database and tables are properly specified we can begin adding data to the tables. Adding data is easily accomplished via the insert into command, and we have seen a few examples already when building the reference and definition tables.
To add a single row (or several) of data into a table, we specify the table, columns, and values:
use db_test;
insert ignore into tbl_erq
(subj_id, sess_id, item_erq, resp_erq)
values
(1, 1, 1, 6),
(1, 2, 1, 4),
(1, 3, 1, 2);
Working within our example database db_test, we have added three new rows of data to tbl_erq. The insert ignore into form skips the insertion if the same values already exist, avoiding overwriting existing data in tbl_erq. The relevant columns were referenced, and then participant 1’s (see ref_subj.subj_name) responses to the first ERQ item were added for each session. We can check that the insertion was successful by selecting our data:
mysql> select * from tbl_erq;
+---------+---------+----------+----------+
| subj_id | sess_id | item_erq | resp_erq |
+---------+---------+----------+----------+
| 1 | 1 | 1 | 6 |
| 1 | 2 | 1 | 4 |
| 1 | 3 | 1 | 2 |
+---------+---------+----------+----------+
3 rows in set (0.01 sec)
Obviously, manually entering every row is untenable and error prone; inserting single rows was shown for practice and because it can be leveraged algorithmically in a workflow. If CSV files (or similar formats) already exist in the same format as your table, then entire tables can be inserted as well. The following shell syntax can be used to generate some random data for tbl_erq (and can be modified to build input data for ref_subj, tbl_panas, and tbl_emofreq):
#!/bin/bash
echo "subj_id,sess_id,item_erq,resp_erq" >${HOME}/erq_data.csv
for subj in {1..9}; do
for sess in {1..3}; do
for item in {1..10}; do
resp=$(((RANDOM % 10) + 1))
echo "${subj},${sess},${item},${resp}" >>${HOME}/erq_data.csv
done
done
done
Having built properly formatted CSV files for tbl_panas, tbl_erq, and tbl_emofreq, and assuming that local infiles are allowed, log in to the server using the local-infile option. This option allows the server to read a file on the local machine:
$mysql --local-infile=1 -u <user> -p db_test
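If the server refuses local infiles, an admin can check and enable the corresponding server variable (a server-side setting requiring admin privileges):
show global variables like 'local_infile';
set global local_infile = 1;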
Entire CSVs can now be loaded via the load data command (note that rows will be skipped for subj_id values not found in ref_subj):
delete from tbl_erq; -- clear example inputs
load data local infile '/path/erq_data.csv'
into table tbl_erq
fields terminated by ','
enclosed by '"'
lines terminated by '\n'
ignore 1 rows;
As specified, local infile supplies the path to the CSV file. Extra options are then specified to help parse the file: the CSV is comma-delimited, strings are enclosed by double quotes, line endings use a linefeed, and the first row of the CSV is a header. Note: the infile string will require editing to point to erq_data.csv on your local machine.
This method is faster than, and preferred over, manually inserting each row. We can again verify that the table was updated, but limit the return to 10 rows:
mysql> select * from tbl_erq limit 10;
+---------+---------+----------+----------+
| subj_id | sess_id | item_erq | resp_erq |
+---------+---------+----------+----------+
| 1 | 1 | 1 | 6 |
| 1 | 1 | 2 | 3 |
| 1 | 1 | 3 | 6 |
| 1 | 1 | 4 | 1 |
| 1 | 1 | 5 | 3 |
| 1 | 1 | 6 | 1 |
| 1 | 1 | 7 | 5 |
| 1 | 1 | 8 | 5 |
| 1 | 1 | 9 | 3 |
| 1 | 1 | 10 | 5 |
+---------+---------+----------+----------+
10 rows in set (0.00 sec)
7.6 Select Data
Data from tables is accessible with the select command. Simply, the anatomy of this command is to specify which columns from which tables are desired (and using an ’*’ when all columns are desired). This section will provide examples of how to extract data from tables using select
and join
.
7.6.1 Basic Usage
After logging into the server and selecting the appropriate database ($mysql -u <user> -p db_test), a list of available tables is accessible via show, and describe can be used to see the number, name, and type of columns:
mysql> show tables;
+-------------------+
| Tables_in_db_test |
+-------------------+
| def_emofreq |
| ref_emo |
| ref_sess |
| ref_subj |
| tbl_emofreq |
| tbl_erq |
| tbl_panas |
+-------------------+
7 rows in set (0.00 sec)
mysql> describe tbl_erq;
+----------+------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+------+------+-----+---------+-------+
| subj_id | int | NO | PRI | NULL | |
| sess_id | int | NO | PRI | NULL | |
| item_erq | int | NO | PRI | NULL | |
| resp_erq | int | YES | | NULL | |
+----------+------+------+-----+---------+-------+
4 rows in set (0.00 sec)
Accordingly, we can select all columns of tbl_erq (limiting the selection to the first 5 rows) with select:
mysql> select * from tbl_erq limit 5;
+---------+---------+----------+----------+
| subj_id | sess_id | item_erq | resp_erq |
+---------+---------+----------+----------+
| 1 | 1 | 1 | 7 |
| 1 | 1 | 2 | 7 |
| 1 | 1 | 3 | 10 |
| 1 | 1 | 4 | 1 |
| 1 | 1 | 5 | 6 |
+---------+---------+----------+----------+
5 rows in set (0.00 sec)
We can also count the number of rows (select count(*) ...) as well as how many entries a specific subject has (select count(*) ... where subj_id = 1):
mysql> select count(*) from tbl_erq;
+----------+
| count(*) |
+----------+
| 270 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from tbl_erq where subj_id = 1;
+----------+
| count(*) |
+----------+
| 30 |
+----------+
1 row in set (0.00 sec)
Finally, we can select only desired columns (and rows of a specific session ID):
mysql> select subj_id, item_erq, resp_erq
-> from tbl_erq
-> where sess_id = 1
-> limit 5;
+---------+----------+----------+
| subj_id | item_erq | resp_erq |
+---------+----------+----------+
| 1 | 1 | 7 |
| 1 | 2 | 7 |
| 1 | 3 | 10 |
| 1 | 4 | 1 |
| 1 | 5 | 6 |
+---------+----------+----------+
5 rows in set (0.00 sec)
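Though not used elsewhere in this chapter, standard SQL aggregation also works here; as a sketch, each subject’s mean ERQ response could be computed via group by:
select subj_id, avg(resp_erq) as mean_erq
from tbl_erq
group by subj_id;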
7.6.2 Join Tables
7.6.2.1 Accessing Reference Tables
Our example database has a number of tables containing different information. Specifically, the reference tables contain values that are expected to appear across multiple tables, while the data tables hold actual participant responses and information, which themselves are linked to the reference tables via foreign keys. Previously, we selected data from tbl_erq but did not interpret the subj_id or sess_id columns, whose values exist in ref_subj and ref_sess, respectively.
Combining information from multiple tables is accomplished with the join command, which is appended to the select command to modify what is returned from the query:
select
ref_subj.subj_name, ref_sess.sess_name, tbl_erq.item_erq,
tbl_erq.resp_erq
from tbl_erq
join ref_subj on ref_subj.subj_id = tbl_erq.subj_id
join ref_sess on ref_sess.sess_id = tbl_erq.sess_id
where tbl_erq.sess_id = 2
limit 5;
Here, the values (names) of tbl_erq.subj_id and tbl_erq.sess_id are interpreted and included in the output by joining the ref_subj and ref_sess tables. First, columns from three different tables are selected: the name columns from the relevant reference tables as well as the item and resp columns from tbl_erq. Next, the table containing the relevant data is specified (from tbl_erq), and then tbl_erq is joined with the reference tables by specifying which columns should match between the data and reference tables. Specifically, the query is instructed to join ref_subj to tbl_erq, matching the column values of ref_subj.subj_id and tbl_erq.subj_id to guide the join (and the same for ref_sess), in order to return the specified columns. This join functionality is the bread-and-butter of RDBMSs and the reason the conventions for both reference and data tables (above) were specified.
Table (and database) names can get long and unwieldy, making for overly verbose select commands. Below is an equivalent command that assigns the tables short aliases (a, b, and c):
select
a.subj_name, b.sess_name, c.item_erq, c.resp_erq
from tbl_erq c
join ref_subj a on a.subj_id = c.subj_id
join ref_sess b on b.sess_id = c.sess_id
where c.sess_id = 2
limit 5;
Given this selection, we expect the query to return a matrix containing the subject name, session name, and ERQ item and participant responses limited to the second session (and only the first 5 rows).
+-----------+-----------+----------+----------+
| subj_name | sess_name | item_erq | resp_erq |
+-----------+-----------+----------+----------+
| SUB1 | day2 | 1 | 1 |
| SUB1 | day2 | 2 | 6 |
| SUB1 | day2 | 3 | 2 |
| SUB1 | day2 | 4 | 6 |
| SUB1 | day2 | 5 | 7 |
+-----------+-----------+----------+----------+
5 rows in set (0.00 sec)
Admin Note: this example assumes sql_mode=only_full_group_by.
7.6.2.2 Joining Data Tables
Conducting analyses usually involves combining data across multiple measures, perhaps to investigate the relationship of X and Y or to see if controlling for Z has any impact on an X-Y relationship. In our example, participant responses to different measures (e.g. ERQ, PANAS) are kept in separate data tables (tbl_erq and tbl_panas, respectively), and these data can be aggregated into a single matrix with join.
Using simulated data (for syntax, see Insert Data), tbl_erq has 10 items and tbl_panas 20, administered to 9 participants for each of 3 sessions:
mysql> select distinct(item_erq) from tbl_erq;
+----------+
| item_erq |
+----------+
| 1 |
| 2 |
..
| 9 |
| 10 |
+----------+
10 rows in set (0.00 sec)
mysql> select count(*) from tbl_erq;
+----------+
| count(*) |
+----------+
| 270 |
+----------+
1 row in set (0.00 sec)
mysql> select distinct(item_panas) from tbl_panas;
+------------+
| item_panas |
+------------+
| 1 |
| 2 |
..
| 19 |
| 20 |
+------------+
20 rows in set (0.00 sec)
mysql> select count(*) from tbl_panas;
+----------+
| count(*) |
+----------+
| 540 |
+----------+
1 row in set (0.01 sec)
We can join our reference, ERQ, and PANAS tables, then, using an extension of the join example above:
-- Long format
select
a.subj_name, b.sess_name, c.item_erq, c.resp_erq,
d.item_panas, d.resp_panas
from tbl_erq c
join ref_subj a on a.subj_id = c.subj_id
join ref_sess b on b.sess_id = c.sess_id
join tbl_panas d on d.subj_id = c.subj_id
and d.sess_id = c.sess_id
where c.subj_id = 1 and c.sess_id = 1
limit 15
;
As before, values from the reference tables are incorporated into the output matrix, and this time we add a join of tbl_erq with tbl_panas using both subj_id and sess_id. Additionally, the output is limited to rows for the first subject and session ID.
This output is long formatted, and accordingly we can see that each ERQ response is reported for every PANAS item:
+-----------+-----------+----------+----------+------------+------------+
| subj_name | sess_name | item_erq | resp_erq | item_panas | resp_panas |
+-----------+-----------+----------+----------+------------+------------+
| SUB1 | day1 | 10 | 5 | 1 | 6 |
| SUB1 | day1 | 9 | 5 | 1 | 6 |
| SUB1 | day1 | 8 | 2 | 1 | 6 |
| SUB1 | day1 | 7 | 5 | 1 | 6 |
| SUB1 | day1 | 6 | 9 | 1 | 6 |
| SUB1 | day1 | 5 | 6 | 1 | 6 |
| SUB1 | day1 | 4 | 1 | 1 | 6 |
| SUB1 | day1 | 3 | 10 | 1 | 6 |
| SUB1 | day1 | 2 | 7 | 1 | 6 |
| SUB1 | day1 | 1 | 7 | 1 | 6 |
| SUB1 | day1 | 10 | 5 | 2 | 5 |
| SUB1 | day1 | 9 | 5 | 2 | 5 |
| SUB1 | day1 | 8 | 2 | 2 | 5 |
| SUB1 | day1 | 7 | 5 | 2 | 5 |
| SUB1 | day1 | 6 | 9 | 2 | 5 |
+-----------+-----------+----------+----------+------------+------------+
15 rows in set (0.00 sec)
While perfectly acceptable, the output matrix has a lot of redundant information. One solution is to craft an output matrix in tidy format. That is, rather than pulling both tbl_erq.item_erq and tbl_panas.item_panas (as they are both integer), we could instead create a column titled item_num that spans the full range of items. This would allow us to join the response columns, tbl_erq.resp_erq and tbl_panas.resp_panas, by their item values.
Caution is warranted, however, given the default join behavior (inner), since the ERQ and PANAS have different numbers of items. Briefly, an ‘inner’ join returns the requested values only where there is a match in both tables, while a ‘left’ join returns all records from the first table even if no matching record is found in the second. Accordingly, we will use tbl_panas.item_panas to create the new item_num column and perform a left join:
-- Tidy format
select
a.subj_name, b.sess_name, d.item_panas as item_num,
c.resp_erq, d.resp_panas
from tbl_panas d
join ref_subj a on a.subj_id = d.subj_id
join ref_sess b on b.sess_id = d.sess_id
left join tbl_erq c on c.subj_id = d.subj_id
and c.sess_id = d.sess_id and c.item_erq = d.item_panas
where d.sess_id = 1
limit 22
;
The resulting tidy matrix, then, does not have the repeating response values found in the long format; instead, NULL values are filled in for resp_erq when a corresponding item value does not exist.
+-----------+-----------+----------+----------+------------+
| subj_name | sess_name | item_num | resp_erq | resp_panas |
+-----------+-----------+----------+----------+------------+
| SUB1 | day1 | 1 | 7 | 6 |
| SUB1 | day1 | 2 | 7 | 5 |
| SUB1 | day1 | 3 | 10 | 7 |
| SUB1 | day1 | 4 | 1 | 5 |
| SUB1 | day1 | 5 | 6 | 2 |
| SUB1 | day1 | 6 | 9 | 7 |
| SUB1 | day1 | 7 | 5 | 5 |
| SUB1 | day1 | 8 | 2 | 10 |
| SUB1 | day1 | 9 | 5 | 10 |
| SUB1 | day1 | 10 | 5 | 3 |
| SUB1 | day1 | 11 | NULL | 1 |
| SUB1 | day1 | 12 | NULL | 3 |
| SUB1 | day1 | 13 | NULL | 2 |
| SUB1 | day1 | 14 | NULL | 2 |
| SUB1 | day1 | 15 | NULL | 7 |
| SUB1 | day1 | 16 | NULL | 9 |
| SUB1 | day1 | 17 | NULL | 1 |
| SUB1 | day1 | 18 | NULL | 1 |
| SUB1 | day1 | 19 | NULL | 7 |
| SUB1 | day1 | 20 | NULL | 10 |
| SUB2 | day1 | 1 | 4 | 9 |
| SUB2 | day1 | 2 | 10 | 8 |
+-----------+-----------+----------+----------+------------+
22 rows in set (0.00 sec)
7.7 Python and MySQL
Python can be used to connect to the MySQL server both locally (i.e. on labarserv2) and remotely (e.g. from the Duke Compute Cluster). A number of packages are available to support this, but here we will focus on mysql-connector-python and pymysql as they seem to work best with our compute environments at the time of this writing. While everything Python does can be done manually when working on the MySQL server directly, we can leverage Python to automate updating, managing, and downloading from the databases.
7.7.1 Connecting Locally
To interact with a MySQL database locally, start by installing mysql-connector-python into your Python environment:
pip install mysql-connector-python
Next, we can start a Python file, import mysql.connector, and establish a connection between Python and the MySQL server. This is done by creating a connection object via mysql.connector.connect, specifying the user, host, password, and database name. For the user nate with password foobar, we would then create the connection object for the db_test database via:
import mysql.connector
db_con = mysql.connector.connect(
host="localhost", user="nate", password="foobar", database="db_test"
)
It is bad practice to use actual passwords in your code, particularly if you intend to track changes with Git and/or make your repository public. A better option is to store your password somewhere safe and then access it through the operating system. Here, I have my password accessible from the environment variable $SQL_PASS:
import os
import mysql.connector
db_con = mysql.connector.connect(
host="localhost",
user="nate",
password=os.environ["SQL_PASS"],
database="db_test",
)
With a connection established, we next create a cursor object. This cursor object is used to interact with the MySQL server, allowing for the execution of statements or commands, which is to say that the cursor object has a number of methods available. The cursor is created via the connection object’s cursor method:
db_cur = db_con.cursor()
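As a quick sanity check (assuming the connection above succeeded), the cursor can immediately execute a statement and fetch the result:
db_cur.execute("show tables")
print(db_cur.fetchall())  # list of tuples, e.g. [('ref_subj',), ('tbl_erq',), ...]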
7.7.2 Connecting Remotely
Given that the MySQL server exists on labarserv2, it may not always be feasible to log in to labarserv2 to execute SQL queries due to time or resource constraints. For instance, a complex analysis could be conducted on the Duke Compute Cluster (DCC) that requires interacting with databases on labarserv2, and manually moving CSVs between servers is ill-advised and inefficient.
Practically speaking, it is more common to interact with a SQL database over the network (remotely), as is the case when a researcher is working on their personal/work machine with data that exists on a remote server. Here we will describe setting up an SSH tunnel that will allow us to connect to the remote server.
We will use a number of packages: pymysql (which works better than mysql-connector-python for the DCC-labarserv2 connection for whatever reason), paramiko, and sshtunnel. To start, install the required Python packages via pip or conda:
pip install pymysql
pip install paramiko
pip install sshtunnel
Next, we need to make sure that we have an RSA key set up on our local machine for the remote (or on the DCC for labarserv2). Additionally, we should store our MySQL password (for the account which exists on the remote server) in our operating system. I have my RSA key to labarserv2 accessible via $RSA_LS2, and my MySQL password via $SQL_PASS.
Starting a new python file, we can import the relevant packages and then access the RSA key with paramiko:
import os
import pymysql
import paramiko
from sshtunnel import SSHTunnelForwarder
rsa_ls2 = paramiko.RSAKey.from_private_key_file(os.environ["RSA_LS2"])
Next, we can set up the SSH tunnel. The remote address of labarserv2 is ccn-labarserv2.vm.duke.edu, my user name for labarserv2 is nmm51, and we will bind the default MySQL port 3306 at 127.0.0.1:
ssh_tunnel = SSHTunnelForwarder(
("ccn-labarserv2.vm.duke.edu", 22),
ssh_username="nmm51",
ssh_pkey=rsa_ls2,
remote_bind_address=("127.0.0.1", 3306),
)
Finally, the tunnel is started:
ssh_tunnel.start()
With an SSH tunnel established, we can then create a connection to the MySQL database as we did above, referencing the host and the newly established tunnel/port:
db_con = pymysql.connect(
host="127.0.0.1",
user="nmm51",
passwd=os.environ["SQL_PASS"],
db="db_test",
port=ssh_tunnel.local_bind_port,
)
Note (a) that the options for specifying the password and database differ from above (which used mysql.connector.connect), and (b) that such a 'user'@'host' account will have to exist on the MySQL server, in this case 'nmm51'@'127.0.0.1'.
A cursor can then be established in the same way as above:
db_cur = db_con.cursor()
7.7.3 Inserting with Python
With a connection and cursor established, we are ready to set up some syntax to execute on the server. In this example, we have a CSV of ERQ responses formatted for tbl_erq available at ${HOME}/erq_data.csv (see Insert Data). This data can be loaded as a dataframe:
import os
import pandas as pd
df_erq = pd.read_csv(os.path.join(os.environ["HOME"], "erq_data.csv"))
Inserting data into tables can be accomplished via the cursor’s execute method, which requires two arguments – the operation/command and a tuple of data. Accordingly, we prepare for data insertion by building the first argument, a string that contains SQL syntax and uses the %s placeholder in the values field rather than actual data:
in_cmd = (
"insert into tbl_erq "
+ "(subj_id, sess_id, item_erq, resp_erq) "
+ "values (%s, %s, %s, %s)"
)
Next we can extract the first row of df_erq as a tuple to create the second execute argument. MySQL does not support int64 types (the default integer type in df_erq), so we will convert the values to int when creating the tuple:
in_val = tuple([int(x) for x in df_erq.values[0]])
With both execute arguments specified, we can insert the data into tbl_erq by triggering the cursor’s execute method, followed by the connection’s commit method. Note: if following this tutorial, it might be useful to log in to the server and delete the tbl_erq data which was previously inserted, in order to verify that these commands are working (mysql> delete from tbl_erq;).
db_cur.execute(in_cmd, in_val)
db_con.commit()
With this basic insertion method working, we could then loop through pandas dataframes to build or update tables. Here we will add all contents of df_erq to tbl_erq, row-by-row (and also change the df_erq column types to simplify the tuple generation):
df_erq = df_erq.astype("object")
for idx in df_erq.index:
    in_val = tuple(df_erq.values[idx])
    db_cur.execute(in_cmd, in_val)
    db_con.commit()
While working with each row is fast enough for small datasets, this quickly becomes (extremely) inefficient when datasets are larger, and accordingly a row-wise approach is not recommended. A much better way to upload an entire dataset is with the cursor’s executemany method. Similar to execute, executemany takes two arguments – the operation and a list of tuples. In preparation, then, we can create a list of tuples from our dataframe via pandas’ itertuples:
in_val = list(
df_erq[["subj_id", "sess_id", "item_erq", "resp_erq"]].itertuples(
index=False, name=None
)
)
Lastly, we upload the entire dataset via executemany:
db_cur.executemany(in_cmd, in_val)
db_con.commit()
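If desired, the cursor’s rowcount attribute offers a quick check of how many rows the last statement affected:
print(db_cur.rowcount)  # rows inserted by the executemany call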
7.7.4 Selecting with Python
To retrieve data from our database tables, we can begin specifying SQL commands for execution once a connection and cursor have been established. We’ll start by selecting the first 10 rows of tbl_erq. As with inserting data, we will use the cursor’s execute method, but this time only a single argument containing SQL syntax is required:
query = "select * from tbl_erq limit 10"
db_cur.execute(query)
With the execution complete, we are able to fetch results from the query. The fetchall method returns a list of tuples, which can be used to build a pandas dataframe:
import pandas as pd
result = db_cur.fetchall()
df_erq = pd.DataFrame(
result, columns=["subj_id", "sess_id", "item_erq", "resp_erq"]
)
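Rather than hard-coding the column names, they can also be pulled from the DB-API cursor’s description attribute (the first element of each entry is the column name):
cols = [d[0] for d in db_cur.description]
df_erq = pd.DataFrame(result, columns=cols)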
A more useful select command will likely involve joining across tables. For instance, perhaps we are interested in all day2 data from tbl_erq (limited to the first 5 rows):
query = """select
a.subj_name, b.sess_name, c.item_erq, c.resp_erq
from tbl_erq c
join ref_subj a on a.subj_id = c.subj_id
join ref_sess b on b.sess_id = c.sess_id
where c.sess_id = 2
limit 5;
"""
db_cur.execute(query)
df_join = pd.DataFrame(
db_cur.fetchall(),
columns=["subj_name", "sess_name", "item_erq", "resp_erq"],
)
It is also useful to note that Python variables can be included in the query string. Here we select all data from a single subject by their ID:
subj_id = 1
query = f"select * from tbl_erq where subj_id = {subj_id}"
db_cur.execute(query)
df_subj = pd.DataFrame(
db_cur.fetchall(),
columns=["subj_id", "sess_id", "item_erq", "resp_erq"],
)
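For values that originate from user input, it is generally safer to let the connector perform the substitution via a parameterized query (supported by both pymysql and mysql-connector-python) rather than Python string formatting:
query = "select * from tbl_erq where subj_id = %s"
db_cur.execute(query, (subj_id,))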
7.7.5 Closing Connections
To avoid crashing the MySQL server, it is important to close the connection with the server once all requests are finished. This can be done simply with the close method:
db_con.close()
Another option is to utilize Python’s with syntax to aid in connection closing:
with mysql.connector.connect(
    host="localhost",
    user="nate",
    password=os.environ["SQL_PASS"],
    database="db_test",
) as db_con:
    db_cur = db_con.cursor()
    db_cur.execute("select * from tbl_erq limit 10")
    df_erq = pd.DataFrame(
        db_cur.fetchall(),
        columns=["subj_id", "sess_id", "item_erq", "resp_erq"],
    )
Likewise, SSH tunnels need to be closed at the end of the process when working remotely:
ssh_tunnel.stop()
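The sshtunnel package also supports Python’s with syntax, stopping the tunnel automatically when the block exits; a minimal sketch under the same assumptions as above:
with SSHTunnelForwarder(
    ("ccn-labarserv2.vm.duke.edu", 22),
    ssh_username="nmm51",
    ssh_pkey=rsa_ls2,
    remote_bind_address=("127.0.0.1", 3306),
) as ssh_tunnel:
    db_con = pymysql.connect(
        host="127.0.0.1",
        user="nmm51",
        passwd=os.environ["SQL_PASS"],
        db="db_test",
        port=ssh_tunnel.local_bind_port,
    )
    # ... execute queries here ...
    db_con.close()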
7.8 R and MySQL
R can be used to connect to the MySQL server both locally (i.e. on labarserv2) and remotely (e.g. from a workstation). While a number of packages exist that can manage the connection between R and the MySQL server, here we will focus on DBI and RMySQL. DBI allows for communication between R and an RDBMS and supplies many of the functions we will use to interact with the database and tables. RMySQL facilitates an interface directly with MySQL.
Start by installing RMySQL:
install.packages("RMySQL")
RMySQL should also install DBI as a dependency, but if it does not then we can make sure we have DBI:
install.packages("DBI")
7.8.1 Connecting Locally
To interact with a MySQL database locally, we start by establishing a connection between R and MySQL. Start a new R file, load the RMySQL library, and set up a connection via dbConnect that specifies the MySQL driver:
library("RMySQL")
db_con <- dbConnect(
RMySQL::MySQL(),
dbname = "db_test",
host = "localhost",
port = 3306,
user = "nate",
password = Sys.getenv("SQL_PASS")
)
Here, I specify the desired database via the dbname option and use the default host and port arguments. Additionally, as it is bad practice to embed passwords in code (particularly when tracking changes with Git or when taking the code public!), I have my MySQL password saved in the .Rprofile and available via SQL_PASS.
Finally, we can check our connection to the database by listing tables via dbListTables:
dbListTables(db_con)
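Similarly, the columns of a specific table can be listed via DBI’s dbListFields:
dbListFields(db_con, "tbl_erq")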
7.8.2 Connecting Remotely
Given that the MySQL server exists on labarserv2, it may not always be feasible to log in to labarserv2 to execute SQL queries due to time or resource constraints. Commonly, analysis scripts exist locally on a workstation and the researcher aims to download data for local analyses.
One approach is to set up an SSH tunnel between our local machine and labarserv2:
$ssh -i $RSA_LS2 \
nmm51@ccn-labarserv2.vm.duke.edu \
-L 3308:127.0.0.1:3306 \
-N
Here, I have specified an SSH connection and supplied a number of options. Specifically, I supply my RSA key for labarserv2 (available in the OS via $RSA_LS2) with the -i option, and I bind my local port 3308 to the default MySQL IP of 127.0.0.1 and MySQL port of 3306 via the -L option. Finally, the -N option is useful for port forwarding as no remote command is executed.
This ssh command will cause the terminal to hang, and I can check that the port is open via $ps aux:
$ps aux | grep ssh | grep 3308
Nathan 195108 0.0 0.0 14708 6404 pts/4 S+ 12:49 0:00 ssh -i /home/Nathan/.ssh/id_rsa_labarser2 nmm51@ccn-labarserv2.vm.duke.edu -L 3308:127.0.0.1:3306 -N
With an SSH tunnel established, we can then create a connection between R and MySQL using a nearly identical command as above, save that we specify our local port and the MySQL server IP:
db_con <- dbConnect(
RMySQL::MySQL(),
dbname = "db_test",
host = "127.0.0.1",
port = 3308,
user = "nate",
password = Sys.getenv("SQL_PASS")
)
Again, we can check our connection to the database by listing tables via dbListTables:
dbListTables(db_con)
7.8.3 Inserting with R
With a connection between R and MySQL established, we can begin to interact with the database. In this example, we have a CSV of ERQ responses formatted for tbl_erq available at ${HOME}/erq_data.csv (see Insert Data). This data can be loaded as a dataframe:
df_erq <- read.csv(paste(Sys.getenv("HOME"), "erq_data.csv", sep = "/"))
One method of inserting data into a table is to build an insert statement with sqlInterpolate and then pass this statement to dbSendQuery. The function sqlInterpolate takes three arguments: the connection object, a SQL command, and data. As we have the connection object saved as db_con (above), we can next specify a SQL insert command that uses the ?val syntax to identify the variable val:
in_cmd <- "insert into tbl_erq
(subj_id, sess_id, item_erq, resp_erq)
values (?val1, ?val2, ?val3, ?val4)"
Here, four variables are set (val1-4) to serve as placeholders for data.
To specify data (the third argument of sqlInterpolate), we create a named vector using the SQL variables from the insert command (val1-4). Doing this for a single row of data (df_erq[1, ]), we can assign each df_erq column to its respective variable:
in_val <- c(
val1 = df_erq[1, 1],
val2 = df_erq[1, 2],
val3 = df_erq[1, 3],
val4 = df_erq[1, 4]
)
Finally, we build our insert object with sqlInterpolate and pass that object to dbSendQuery. The function of dbSendQuery is to submit and then execute whatever statement is supplied as the second argument, here our insert object:
insert <- sqlInterpolate(db_con, in_cmd, .dots = in_val)
dbSendQuery(db_con, insert)
With this simple insertion method working, we can loop through the dataframe to iteratively insert data into tbl_erq. Here we submit data row-by-row:
for (row in 1:nrow(df_erq)) {
    in_val <- c(
        val1 = df_erq[row, 1],
        val2 = df_erq[row, 2],
        val3 = df_erq[row, 3],
        val4 = df_erq[row, 4]
    )
    insert <- sqlInterpolate(db_con, in_cmd, .dots = in_val)
    dbSendQuery(db_con, insert)
}
A few words of caution – first, this insertion method will always append to the table, which can result in rows that contain identical data (and potentially corrupt a database). Second, while fast enough for illustrative purposes, inserting each row quickly becomes extremely inefficient as the size of the dataframe increases.
We can overcome these issues by utilizing dbWriteTable instead of sqlInterpolate and dbSendQuery, which has the added benefit of a simplified specification. For instance, we can add the first row of df_erq to the table tbl_erq via:
dbWriteTable(db_con, "tbl_erq", df_erq[1, ], row.names = FALSE)
Because the table already exists, this command will return an error rather than silently appending a (potentially duplicate) row of data to the table. To resolve or avoid these errors, dbWriteTable has append and overwrite options that can be selected.
Finally, and perhaps more usefully, entire dataframes can be written to the tbl_erq table:
dbWriteTable(db_con, "tbl_erq", df_erq, overwrite = TRUE, row.names = FALSE)
7.8.4 Selecting with R
To retrieve data from the database, we can utilize dbSendQuery, which requires two arguments: the connection object (which we have as db_con) and a SQL statement. Note that while dbSendQuery submits and executes commands on the SQL server (as we saw above), it does not itself extract data, and accordingly we save the result object:
query <- dbSendQuery(db_con, "select * from tbl_erq")
The output matrix of the select command can then be loaded into R via dbFetch:
df_erq <- dbFetch(query, n = -1)
dbClearResult(query)
All records selected in the query are retrieved into df_erq by specifying n = -1, and we are required to close the query via dbClearResult whenever a result object is obtained from dbSendQuery (i.e. we do not need dbClearResult when using dbWriteTable).
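For one-off selections, DBI’s dbGetQuery wraps the send, fetch, and clear steps into a single call that returns a dataframe directly:
df_erq <- dbGetQuery(db_con, "select * from tbl_erq")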
When working in larger workflows and with complicated selection commands, it can be useful to save the SQL syntax as its own file and then read that file when building the query object. For instance, the following statement joins tbl_erq and tbl_panas, while also converting from long to tidy format:
select
a.subj_name, b.sess_name, d.item_panas as item_num,
c.resp_erq, d.resp_panas
from tbl_panas d
join ref_subj a on a.subj_id = d.subj_id
join ref_sess b on b.sess_id = d.sess_id
left join tbl_erq c on c.subj_id = d.subj_id
and c.sess_id = d.sess_id and c.item_erq = d.item_panas
where d.sess_id = 1
limit 22
;
This SQL file (saved at ${HOME}/dl_erq.sql) can then be read by the readr package and incorporated into the dbSendQuery command:
sel_erq <- paste(Sys.getenv("HOME"), "dl_erq.sql", sep = "/")
query <- dbSendQuery(db_con, statement = readr::read_file(sel_erq))
df_join <- dbFetch(query, n = -1)
dbClearResult(query)
7.8.5 Closing Connections
Once interacting with the MySQL database is no longer necessary, it is important to close the connection in order to avoid crashing the server. This is easily accomplished with dbDisconnect:
dbDisconnect(db_con)
Likewise, an SSH tunnel can be closed with Ctrl-C or by terminating the process via pkill.
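For example, matching the port mapping of the tunnel started above (a sketch; adjust the pattern to your own tunnel command):
$pkill -f "3308:127.0.0.1:3306"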