跟着 mimic-code 探索 MIMIC 数据之 tutorials (二)

mimic-code 的 tutorials 还提供了 sql-crosstab，很短，我大概看了感觉不是很有用，先放着了。using_r_with_jupyter.ipynb 就是教你怎么用 Jupyter + R，没什么。explore-items.Rmd 是 MySQL + R，但是没太搞懂这是在干嘛，而且我也没 MySQL，代码转 Postgres 应该不难，我太懒了。直接看最后一个，cohort-selection.ipynb，打开看了 Postgres + Python，讲怎么选择病例队列的一些小技巧，感觉写得挺好的。就这个了，开始。

原文档用的 Python，我不喜欢。当然还是 R 好啦，所以我直接用里面的 sql 语句就行了。

Cohort selection

The aim of this tutorial is to describe how patients are tracked in the MIMIC-III database. By the end of this notebook you should:

Understand what subject_id, hadm_id, and icustay_id represent
Know how to set up a cohort table for subselecting a patient population
Understand the difference between service and physical location

Requirements:

MIMIC-III in a PostgreSQL database
Python packages installable with: pip install numpy pandas matplotlib psycopg2 jupyter

文档的目的是展示 MIMIC 中病例信息的跟踪追溯。主要讲解 subject_id, hadm_id, 和 icustay_id 代表着什么，怎么提取研究病例队列，以及理解病人接受 service 和患者物理位置之间的差别（老实说我都不知道这个到底是什么，大概指病人人在 ICU 外但是接受 ICU 治疗？）。

我自己用的是 RStudio + PostgreSQL，所以代码相对原文档会有一些改动。

首先是设置和数据库连接和基本选项：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


library(RPostgreSQL)
library(tidyverse)

# connect to PostgresSQL
drv <- dbDriver("PostgreSQL")
con <- dbConnect(
  drv = drv,
  dbname = "mimic",
  user = "postgres",
  .rs.askForPassword("Enter password for user postgres:")
)

# set the search path to the mimiciii schema
dbSendQuery(con, "SET search_path TO mimiciii, public;")

# 为了偷懒我写了一个方便查询数据库的函数
query <- function(query = query) {
  con %>%
    dbGetQuery(sql(query)) %>%
    as_tibble()
}

队列选择一般都是从这三个表开始 : patients, admissions 以及 icustays:

patients: information about a patient that does not change - e.g. date of birth, genotypical sex
admissions: information recorded on hospital admission - admission type (elective, emergency), time of admission
icustays: information recorded on intensive care unit admission - primarily admission and discharge time

MIMIC-III 主要是关注 ICU 的数据库，所以我们一般都是想看患者在 ICU 的进科出科情况。也因此，一般在选取患者队列时都不会从病例作为切入 (即通过 subject_id)，而是通过 ICU 出入情况，即通过 icustays 表格中的 icustay_id 切入。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


query("SELECT subject_id
       , hadm_id
       , icustay_id
	   FROM icustays LIMIT 10;")
# 在仅仅是尝试性或者探索性的看看数据的时候一般都用 LIMIT 10


# A tibble: 10 x 3
   subject_id hadm_id icustay_id
 *      <int>   <int>      <int>
 1        268  110404     280836
 2        269  106296     206613
 3        270  188028     220345
 4        271  173727     249196
 5        272  164716     210407
 6        273  158689     241507
 7        274  130546     254851
 8        275  129886     219649
 9        276  135156     206327
10        277  171601     272866

计算 ICU 的住院时间：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


query("SELECT subject_id
       , hadm_id
       , icustay_id
       , outtime - intime AS icu_length_of_stay_interval
       , EXTRACT(EPOCH FROM outtime - intime) AS icu_length_of_stay
       FROM icustays LIMIT 10;")

# A tibble: 10 x 5
   subject_id hadm_id icustay_id icu_length_of_stay_interval icu_length_of_stay
 *      <int>   <int>      <int> <chr>                                    <dbl>
 1        268  110404     280836 3 days 05:58:33                         280713
 2        269  106296     206613 3 days 06:41:28                         283288
 3        270  188028     220345 2 days 21:27:09                         250029
 4        271  173727     249196 2 days 01:26:22                         177982
 5        272  164716     210407 1 day 14:53:09                          139989
 6        273  158689     241507 1 day 11:40:06                          128406
 7        274  130546     254851 8 days 19:32:32                         761552
 8        275  129886     219649 7 days 03:09:14                         616154
 9        276  135156     206327 1 day 08:06:29                          115589
10        277  171601     272866 17:33:02                                 63182

EXTRACT(EPOCH FROM ... ) 从 TIMESTAMP 中提出以秒为单位的 INTERVAL，所以真正要计算时间，还要除以 (60 * 60 * 24)：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


query("SELECT subject_id
       , hadm_id
       , icustay_id
       , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 AS icu_length_of_stay 
       FROM icustays LIMIT 10;")

#---

# A tibble: 10 x 4
   subject_id hadm_id icustay_id icu_length_of_stay
 *      <int>   <int>      <int>              <dbl>
 1        268  110404     280836              3.25 
 2        269  106296     206613              3.28 
 3        270  188028     220345              2.89 
 4        271  173727     249196              2.06 
 5        272  164716     210407              1.62 
 6        273  158689     241507              1.49 
 7        274  130546     254851              8.81 
 8        275  129886     219649              7.13 
 9        276  135156     206327              1.34 
10        277  171601     272866              0.731

如果还想对 ICU 住院时间进行筛选，比如只想看住院超过 24h 的，就得先建个临时表格。比如：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


query("WITH co AS
      (SELECT subject_id
      , hadm_id
      , icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 AS icu_length_of_stay 
      FROM icustays LIMIT 10
      ) 
      SELECT co.subject_id
      , co.hadm_id
      , co.icustay_id
      , co.icu_length_of_stay 
      FROM co
      WHERE icu_length_of_stay >= 2;")

# A tibble: 6 x 4
  subject_id hadm_id icustay_id icu_length_of_stay
*      <int>   <int>      <int>              <dbl>
1        268  110404     280836               3.25
2        269  106296     206613               3.28
3        270  188028     220345               2.89
4        271  173727     249196               2.06
5        274  130546     254851               8.81
6        275  129886     219649               7.13

这样就只筛选到住院时间 > 2 天的病例。

很多使用 MIMIC 数据库的研究都会聚焦于特定的人群。比如，MIMIC 中的数据包含了 ICU 中成人和新生儿的住院记录，但是一般研究是不会在这两个人群里同时开展的。所以很多研究的第一步就是从 icustays 表格中选择病例人群，即从这张表格中筛选合适的 icustay_id。上面的例子就是选取 ICU 住院时间超过 2 天的。

选取病例人群的时候，好的做法是构建一个队列表格。这个表格应该包含数据库中所有的 icustay_id，然后通过一个添加一个 binary flag 来指明每个病例是否要从研究人群中剔除。比如还是上面的筛选 ICU 住院时间 > 2 天的病例的例子：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


query("WITH co AS 
      (
        SELECT subject_id
        , hadm_id
        , icustay_id
        , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 AS icu_length_of_stay 
        FROM icustays LIMIT 10 
      )
      SELECT co.subject_id
      , co.hadm_id
      , co.icustay_id
      , co.icu_length_of_stay,
      CASE
        WHEN co.icu_length_of_stay < 2 THEN 1 
      ELSE 0 END AS exclusion_los
      FROM co;")

# A tibble: 10 x 5
   subject_id hadm_id icustay_id icu_length_of_stay exclusion_los
 *      <int>   <int>      <int>              <dbl>         <int>
 1        268  110404     280836              3.25              0
 2        269  106296     206613              3.28              0
 3        270  188028     220345              2.89              0
 4        271  173727     249196              2.06              0
 5        272  164716     210407              1.62              1
 6        273  158689     241507              1.49              1
 7        274  130546     254851              8.81              0
 8        275  129886     219649              7.13              0
 9        276  135156     206327              1.34              1
10        277  171601     272866              0.731             1

之前的例子里，最后结果只返回了 6 行，因为有 4 行被我们筛选出去了。而在这里，所有的 10 行数据都在，但是最后一列显示有 4 行数据是不应该包含在我们的研究人群中的。这种做法的好处在于在研究的最后，我们很容易总结整个研究人群的排除情况，也很容易根据需要作出修改。

再回想一下之前提到的剔除标准：标记非成人病例为剔除对象。所以，首先必须得知道病人在进入 ICU 时的年龄，这个需要用患者出生日期和 ICU 入院时间来计算。icustays 里的 intime 记录病人入 ICU 的时间，所以我们还需要从 patients 得到病人的出生日期用来计算入 ICU 时的年龄。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35


query("WITH co AS
      (
        SELECT icu.subject_id
        , icu.hadm_id
        , icu.icustay_id
        , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 AS icu_length_of_stay
        ,icu.intime - pat.dob AS age
        FROM icustays icu
        INNER JOIN patients pat ON
          icu.subject_id = pat.subject_id
        LIMIT 10
      )
      SELECT co.subject_id
      , co.hadm_id
      , co.icustay_id
      , co.icu_length_of_stay
      , co.age,
      CASE
        WHEN co.icu_length_of_stay < 2 THEN 1
        ELSE 0 END AS exclusion_los
      FROM co;")

# A tibble: 10 x 6
   subject_id hadm_id icustay_id icu_length_of_stay age                 exclusion_los
 *      <int>   <int>      <int>              <dbl> <chr>                       <int>
 1          2  163353     243653             0.0918 21:20:07                        1
 2          3  145834     211552             6.06   27950 days 19:10:11             0
 3          4  185777     294638             1.68   17475 days 00:29:31             1
 4          5  178980     214757             0.0844 06:04:24                        1
 5          6  107064     228232             3.67   24084 days 21:30:54             0
 6          7  118037     278444             0.268  15:35:29                        1
 7          7  118037     236754             0.739  2 days 03:26:01                 1
 8          8  159514     262299             1.08   12:36:10                        1
 9          9  150750     220597             5.32   15263 days 13:07:02             0
10         10  184167     288409             8.09   11:39:05                        0

结果发现，再一次的，计算的年龄成了 INTERVAL。所以还得转换。转换有 3 种办法：

用 EXTRACT() 提取 INTERVAL，此时 INTERVAL 是 天 + 小时 : 分钟 : 秒 这样的形式，然后作除法得到年（前面用到的做法）；
先用 PostgreSQL 的 AGE() 返回为年龄精确值，然后用 DATE_PART() 提取年数得到以年为单位的年龄；
一样，AGE() 得到年龄精确值，DATE_PART() 分别提取年月日计算精确年龄。

我们把三种方法都试试看：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40


query("WITH co AS 
      (
      SELECT icu.subject_id
      , icu.hadm_id
      , icu.icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 AS icu_length_of_stay
      , icu.intime - pat.dob AS age FROM icustays icu
      INNER JOIN patients pat ON
        icu.subject_id = pat.subject_id
      LIMIT 10
      )
      SELECT co.subject_id
      , co.hadm_id
      , co.icustay_id
      , co.icu_length_of_stay
      , co.age
      , EXTRACT('year' FROM co.age) AS age_extract_year
      , EXTRACT('year' FROM co.age) 
        + EXTRACT('months' FROM co.age) / 12.0
        + EXTRACT('days' FROM co.age) / 365.242
        + EXTRACT('hours' FROM co.age) / 24.0 / 364.242 AS age_extract_precise
      , EXTRACT('epoch' FROM co.age) / 60.0 / 60.0 / 24.0 / 365.242 AS age_extract_epoch,
      CASE WHEN
        co.icu_length_of_stay < 2 THEN 1
      ELSE 0 END AS exclusion_los
      FROM co;")

# A tibble: 10 x 7
   subject_id icu_length_of_stay age                 age_extract_year age_extract_precise age_extract_epoch exclusion_los
 *      <int>              <dbl> <chr>                          <dbl>               <dbl>             <dbl>         <int>
 1          2             0.0918 21:20:07                           0            0.00240           0.00243              1
 2          3             6.06   27950 days 19:10:11                0           76.5              76.5                  0
 3          4             1.68   17475 days 00:29:31                0           47.8              47.8                  1
 4          5             0.0844 06:04:24                           0            0.000686          0.000693             1
 5          6             3.67   24084 days 21:30:54                0           65.9              65.9                  0
 6          7             0.268  15:35:29                           0            0.00172           0.00178              1
 7          7             0.739  2 days 03:26:01                    0            0.00582           0.00587              1
 8          8             1.08   12:36:10                           0            0.00137           0.00144              1
 9          9             5.32   15263 days 13:07:02                0           41.8              41.8                  0
10         10             8.09   11:39:05                           0            0.00126           0.00133              0

可以看到后面两种方法计算的年龄其实基本上没什么差别。而第一种办法，由于提取出来的实际上都是以天为单位的 INTERVAL，所以提取年得不到年龄的，只得到 0 了。所以结论就是，其实用不同的办法算得年龄没什么大的区别，按个人喜好自己定一个就 OK。后面我们都会用最简单的 EXTRACT(EPOCH FROM ... ) 这种方法。

然后我们就可以通过设置年龄必须 >= 16 来把新生儿剔除掉了（虽然也把青少年剔除了，但是其实 MIMIC 只有新生儿和成人）：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37


query("WITH co AS
      (
      SELECT icu.subject_id
      , icu.hadm_id
      , icu.icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 AS icu_length_of_stay
      , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 AS age
      FROM icustays icu
      INNER JOIN patients pat ON
        icu.subject_id = pat.subject_id
      LIMIT 10
      )
      SELECT co.subject_id
      , co.hadm_id
      , co.icustay_id
      , co.icu_length_of_stay
      , co.age,
      CASE WHEN
        co.icu_length_of_stay < 2 THEN 1
      ELSE 0 END AS exclusion_los
      ,CASE WHEN co.age < 16 THEN 1
      ELSE 0 END AS exclusion_age
      FROM co;")

# A tibble: 10 x 7
   subject_id hadm_id icustay_id icu_length_of_stay       age exclusion_los exclusion_age
 *      <int>   <int>      <int>              <dbl>     <dbl>         <int>         <int>
 1          2  163353     243653             0.0918  0.00243              1             1
 2          3  145834     211552             6.06   76.5                  0             0
 3          4  185777     294638             1.68   47.8                  1             0
 4          5  178980     214757             0.0844  0.000693             1             1
 5          6  107064     228232             3.67   65.9                  0             0
 6          7  118037     278444             0.268   0.00178              1             1
 7          7  118037     236754             0.739   0.00587              1             1
 8          8  159514     262299             1.08    0.00144              1             1
 9          9  150750     220597             5.32   41.8                  0             0
10         10  184167     288409             8.09    0.00133              0             1

可以看到有 6 行因为年龄不足 16 岁而标记为待剔除，而且这 6 例里大部分也和之前的住院日 > 2 天有很多重合。

下面再尝试另一个常见的剔除标准：二次入 ICU 病例，不管是院内还是院外的。这么做的理由是筛选后可以达到很多统计分析所需要的各样本之间独立的要求。如果保留同一患者多次 ICU 住院信息，那么就必须考虑到这多次入院之间的高度相关性（同一患者因同样的情况多次入院），这对统计分析添加了不必要的麻烦。所以，我们通过 RANK() 对多次入院情况做排序编号：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


query("
  WITH co AS (
  SELECT icu.subject_id
  , icu.hadm_id
  , icu.icustay_id
  , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 AS icu_length_of_stay
  , EXTRACT('epoch' FROM icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 AS age
  , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
  FROM icustays icu
  INNER JOIN patients pat ON
    icu.subject_id = pat.subject_id LIMIT 10
  )
  SELECT co.subject_id
  , co.hadm_id
  , co.icustay_id
  , co.icu_length_of_stay
  , co.age
  , co.icustay_id_order,
  CASE WHEN 
    co.icu_length_of_stay < 2 THEN 1
  ELSE 0 END AS exclusion_los,
  CASE WHEN
    co.age < 16 THEN 1
  ELSE 0 END AS exclusion_age
  FROM co;")

# A tibble: 10 x 8
   subject_id hadm_id icustay_id icu_length_of_stay       age icustay_id_order exclusion_los exclusion_age
 *      <int>   <int>      <int>              <dbl>     <dbl>            <dbl>         <int>         <int>
 1          2  163353     243653             0.0918  0.00243                 1             1             1
 2          3  145834     211552             6.06   76.5                     1             0             0
 3          4  185777     294638             1.68   47.8                     1             1             0
 4          5  178980     214757             0.0844  0.000693                1             1             1
 5          6  107064     228232             3.67   65.9                     1             0             0
 6          7  118037     278444             0.268   0.00178                 1             1             1
 7          7  118037     236754             0.739   0.00587                 2             1             1
 8          8  159514     262299             1.08    0.00144                 1             1             1
 9          9  150750     220597             5.32   41.8                     1             0             0
10         10  184167     288409             8.09    0.00133                 1             0             1

可以对看到 subject_id 为 7 的患者就有两次入院信息。所以我们要做的就是再加入一个 CASE WHEN 把这样的病例去掉（虽然其实这个病例也会因为其他标准不符合而被剔除）：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43


query("
  WITH co AS (
  SELECT icu.subject_id
  , icu.hadm_id
  , icu.icustay_id
  , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 AS icu_length_of_stay
  , EXTRACT('epoch' FROM icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 AS age
  , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
  FROM icustays icu 
  INNER JOIN patients pat ON
    icu.subject_id = pat.subject_id
  LIMIT 10
  )
  SELECT co.subject_id
  , co.hadm_id
  , co.icustay_id
  , co.icu_length_of_stay
  , co.age
  , co.icustay_id_order,
  CASE WHEN
    co.icu_length_of_stay < 2 THEN 1
  ELSE 0 END AS exclusion_los,
  CASE WHEN
    co.age < 16 THEN 1
  ELSE 0 END AS exclusion_age,
  CASE WHEN
    co.icustay_id_order != 1 THEN 1
  ELSE 0 END AS exclusion_first_stay
  FROM co;")

# A tibble: 10 x 9
   subject_id hadm_id icustay_id icu_length_of_stay       age icustay_id_order exclusion_los exclusion_age exclusion_first_stay
 *      <int>   <int>      <int>              <dbl>     <dbl>            <dbl>         <int>         <int>                <int>
 1          2  163353     243653             0.0918  0.00243                 1             1             1                    0
 2          3  145834     211552             6.06   76.5                     1             0             0                    0
 3          4  185777     294638             1.68   47.8                     1             1             0                    0
 4          5  178980     214757             0.0844  0.000693                1             1             1                    0
 5          6  107064     228232             3.67   65.9                     1             0             0                    0
 6          7  118037     278444             0.268   0.00178                 1             1             1                    0
 7          7  118037     236754             0.739   0.00587                 2             1             1                    1
 8          8  159514     262299             1.08    0.00144                 1             1             1                    0
 9          9  150750     220597             5.32   41.8                     1             0             0                    0
10         10  184167     288409             8.09    0.00133                 1             0             1                    0

可以看到 subject_id 为 7 的患者第 2 次的入院信息确实已经被标记为待剔除。

最后，我们可能还想根据入院接受治疗特定情况剔除掉部分人。因为不同科室接收的病人基本情况差别也很大，而通过剔除特定人群之后可以使研究的人群一致性更好。services 表格就提供了患者入院接受何种治疗的情况：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


query("SELECT subject_id
       , hadm_id
       , transfertime
       , prev_service
       , curr_service
       FROM services
       LIMIT 10;")

# A tibble: 10 x 5
   subject_id hadm_id transfertime        prev_service curr_service
 *      <int>   <int> <dttm>              <chr>        <chr>       
 1        471  135879 2122-07-22 14:07:27 TSURG        MED         
 2        471  135879 2122-07-26 18:31:49 MED          TSURG       
 3        472  173064 2172-09-28 19:22:15 NA           CMED        
 4        473  129194 2201-01-09 20:16:45 NA           NB          
 5        474  194246 2181-03-23 08:24:41 NA           NB          
 6        474  146746 2181-04-04 17:38:46 NA           NBB         
 7        475  139351 2131-09-16 18:44:04 NA           NB          
 8        476  161042 2100-07-05 10:26:45 NA           NB          
 9        477  191025 2156-07-20 11:53:03 NA           MED         
10        478  137370 2194-07-15 13:55:21 NA           NB 

从上面可以看到，curr_service 是 current service 的缩写，prev_service 在患者有转科的情况下记录转科前的科室，否则为 null。比如 subject_id 为 471 的患者发生过至少两次 service 的变更：一次从 TSURG 到 MED，另一次从 MED 到 TSURG （注：可能还有更多记录因为我们用了 LIMIT 10 而没有显示，可以通过 SELECT * FROM services WHERE subject_id = 471 进一步查看）。

表格里所有的 service 可以从 MIMIC 网站查看：http://mimic.physionet.org/mimictables/services/。简单来说有这些：

Service	Description
CMED	Cardiac Medical - for non-surgical cardiac related admissions
CSURG	Cardiac Surgery - for surgical cardiac admissions
DENT	Dental - for dental/jaw related admissions
ENT	Ear, nose, and throat - conditions primarily affecting these areas
GU	Genitourinary - reproductive organs/urinary system
GYN	Gynecological - female reproductive systems and breasts
MED	Medical - general service for internal medicine
NB	Newborn - infants born at the hospital
NBB	Newborn baby - infants born at the hospital
NMED	Neurologic Medical - non-surgical, relating to the brain
NSURG	Neurologic Surgical - surgical, relating to the brain
OBS	Obstetrics - conerned with childbirth and the care of women giving birth
ORTHO	Orthopaedic - surgical, relating to the musculoskeletal system
OMED	Orthopaedic medicine - non-surgical, relating to musculoskeletal system
PSURG	Plastic - restortation/reconstruction of the human body (including cosmetic or aesthetic)
PSYCH	Psychiatric - mental disorders relating to mood, behaviour, cognition, or perceptions
SURG	Surgical - general surgical service not classified elsewhere
TRAUM	Trauma - injury or damage caused by physical harm from an external source
TSURG	Thoracic Surgical - surgery on the thorax, located between the neck and the abdomen
VSURG	Vascular Surgical - surgery relating to the circulatory system

如果我们想剔除掉接受手术治疗的病人的，那就需要排除这些 service：

CSURG
NSURG
ORTHO
PSURG
SURG
TSURG
VSURG

可以通过 %SURG or ORTHO 通配符匹配搞定：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


query("SELECT hadm_id
       , curr_service,
       CASE
        WHEN curr_service LIKE '%SURG' THEN 1
        WHEN curr_service = 'ORTHO' THEN 1
       ELSE 0 END AS surgical
       FROM services se
       LIMIT 10;")

# A tibble: 10 x 3
   hadm_id curr_service surgical
 *   <int> <chr>           <int>
 1  135879 MED                 0
 2  135879 TSURG               1
 3  173064 CMED                0
 4  129194 NB                  0
 5  194246 NB                  0
 6  146746 NBB                 0
 7  139351 NB                  0
 8  161042 NB                  0
 9  191025 MED                 0
10  137370 NB                  0

OK，该剔除的都标记好了。但是我们发现我们只有 hadm_id，而我们选取队列是以 icustay_id 为中心的。所以现在还要通过 hadm_id 与 icustays 表格来一次 JOIN 得到 icustay_id：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


query("SELECT icu.hadm_id
       , icu.icustay_id
       , curr_service,
       CASE
        WHEN curr_service like '%SURG' THEN 1
        WHEN curr_service = 'ORTHO' THEN 1
      ELSE 0 END AS surgical
      FROM icustays icu
      LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id
      LIMIT 10;")

# A tibble: 10 x 4
   hadm_id icustay_id curr_service surgical
 *   <int>      <int> <chr>           <int>
 1  100001     275225 MED                 0
 2  100003     209281 MED                 0
 3  100006     291788 MED                 0
 4  100006     291788 OMED                0
 5  100007     217937 SURG                1
 6  100009     253656 CSURG               1
 7  100010     271147 GU                  0
 8  100011     214619 TRAUM               0
 9  100012     239289 SURG                1
10  100016     217590 MED                 0

然后现在新的问题又来了：一个 icustay_id 对应多个 service 怎么选择？其实这个是关于如何选择研究队列的问题，而不是代码写法的问题。 比如我们决定把来 ICU 之前是做手术的病人剔除掉，那么上面的 JOIN 就要改了：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26


query("SELECT icu.hadm_id
       , icu.icustay_id
       , se.curr_service,
       CASE
        WHEN curr_service LIKE '%SURG' THEN 1
        WHEN curr_service = 'ORTHO' THEN 1
       ELSE 0 END AS surgical
       FROM icustays icu
       LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id
       AND se.transfertime < icu.intime + interval '12' hour
       LIMIT 10;")

# A tibble: 10 x 4
   hadm_id icustay_id curr_service surgical
 *   <int>      <int> <chr>           <int>
 1  100001     275225 MED                 0
 2  100003     209281 MED                 0
 3  100006     291788 MED                 0
 4  100007     217937 SURG                1
 5  100009     253656 CSURG               1
 6  100010     271147 GU                  0
 7  100011     214619 TRAUM               0
 8  100012     239289 SURG                1
 9  100016     217590 MED                 0
10  100017     258320 MED                 0

与前面的结果比较，发现 hadm_id = 100006 的患者 service = OMED 的行去掉了：因为这个患者的 OMED 是在 ICU 之后的，我们不纳入研究（虽然其实 OMED 是非手术）。注意上面代码的 JOIN 中我们用到了 + interval '12' hour ，这给我们的剔除标准增加了一点点宽容度。原因在于数据中记录的这些时间信息都是院内不同地方不同的人不同时刻进行录入的，所以必然有一些不一致。比如，一个 ICU 病人可能因为需要手术而发生 transfer，但是记录的转科时间上却在进入 ICU 的时间一小时后。这就属于行政上的“噪音”，而我们加入的 12 个小时有助于解决这个问题。** 再次说明，这个这是关于队列如何选择的问题 **——可能你觉得 12 h 太长，2-4 h 比较合适——但是其实对于我们的例子来说区别不大，因为 80% 的病人没有转科的情况。

最后，我们合并结果为每次 ICU 只有一个 service 记录。和前面一样，用到 RANK() ：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36


query("WITH serv AS
      (
      SELECT icu.hadm_id
      , icu.icustay_id
      , se.curr_service,
      CASE WHEN
        curr_service like '%SURG' THEN 1
      WHEN curr_service = 'ORTHO' THEN 1
      ELSE 0 END AS surgical,
      RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) AS rank
      FROM icustays icu
      LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id
      AND se.transfertime < icu.intime + interval '12' hour
      LIMIT 10
      )
      SELECT hadm_id
      , icustay_id
      , curr_service
      , surgical
      FROM serv
      WHERE rank = 1;")

# A tibble: 10 x 4
   hadm_id icustay_id curr_service surgical
 *   <int>      <int> <chr>           <int>
 1  100001     275225 MED                 0
 2  100003     209281 MED                 0
 3  100006     291788 MED                 0
 4  100007     217937 SURG                1
 5  100009     253656 CSURG               1
 6  100010     271147 GU                  0
 7  100011     214619 TRAUM               0
 8  100012     239289 SURG                1
 9  100016     217590 MED                 0
10  100017     258320 MED                 0

然后最后的最后在和我们之前的筛选队列再 JOIN 一下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59


query("WITH co AS (
      SELECT icu.subject_id
      , icu.hadm_id
      , icu.icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 AS icu_length_of_stay
      , EXTRACT('epoch' FROM icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 AS age
      , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
      FROM icustays icu
      INNER JOIN patients pat ON
        icu.subject_id = pat.subject_id
      LIMIT 10),
      serv AS (
      SELECT icu.hadm_id
      , icu.icustay_id
      , se.curr_service
      , CASE 
        WHEN curr_service LIKE '%SURG' THEN 1
        WHEN curr_service = 'ORTHO' THEN 1
      ELSE 0 END AS surgical
      , RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) AS rank
      FROM icustays icu
      LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id
      AND se.transfertime < icu.intime + interval '12' hour
      )
      SELECT co.subject_id
      , co.hadm_id
      , co.icustay_id
      , co.icu_length_of_stay
      , co.age, co.icustay_id_order
      , CASE
        WHEN co.icu_length_of_stay < 2 THEN 1
      ELSE 0 END AS exclusion_los
      , CASE WHEN 
        co.age < 16 THEN 1
      ELSE 0 END AS exclusion_age
      , CASE WHEN
          co.icustay_id_order != 1 THEN 1
      ELSE 0 END AS exclusion_first_stay
      , CASE WHEN serv.surgical = 1 THEN 1
      ELSE 0 END AS exclusion_surgical
      FROM co
      LEFT JOIN serv ON
        co.icustay_id = serv.icustay_id
      AND serv.rank = 1;")

# A tibble: 10 x 10
   subject_id hadm_id icustay_id icu_length_of_stay       age icustay_id_order exclusion_los exclusion_age exclusion_first_st… exclusion_surgic…
 *      <int>   <int>      <int>              <dbl>     <dbl>            <dbl>         <int>         <int>               <int>             <int>
 1          6  107064     228232             3.67   65.9                     1             0             0                   0                 1
 2          7  118037     278444             0.268   0.00178                 1             1             1                   0                 0
 3          7  118037     236754             0.739   0.00587                 2             1             1                   1                 0
 4          3  145834     211552             6.06   76.5                     1             0             0                   0                 1
 5          9  150750     220597             5.32   41.8                     1             0             0                   0                 0
 6          8  159514     262299             1.08    0.00144                 1             1             1                   0                 0
 7          2  163353     243653             0.0918  0.00243                 1             1             1                   0                 0
 8          5  178980     214757             0.0844  0.000693                1             1             1                   0                 0
 9         10  184167     288409             8.09    0.00133                 1             0             1                   0                 0
10          4  185777     294638             1.68   47.8                     1             1             0                   0                 0

然后我们就顺利得到了需要的病人队列，可以开始提取数据了。

最后来总结一下我们的筛选流程（最后这一步也可以在 R 里写，嫌麻烦算了，直接复制粘贴到 Python 里了…）

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86


import pandas as pd
import numpy as np
import psycopg2
from IPython.display import display, HTML
sqluser='postgres'
dbname='mimic'
schema_name='mimiciii'

con = psycopg2.connect(dbname=dbname,user=sqluser, password='not_shown_here')

query_schema = 'set search_path to ' + schema_name + ';'

query = query_schema + """
WITH co AS (
SELECT icu.subject_id
, icu.hadm_id
, icu.icustay_id
, first_careunit
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 AS icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 AS age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10)
, serv AS
(
SELECT icu.hadm_id
, icu.icustay_id
, se.curr_service
, CASE
    WHEN curr_service LIKE '%SURG' THEN 1
    WHEN curr_service = 'ORTHO' THEN 1
    ELSE 0 END AS surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) AS rank
FROM icustays icu
LEFT JOIN services se
 ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
)
SELECT co.subject_id
, co.hadm_id
, co.icustay_id
, co.icu_length_of_stay
, co.age
, co.icustay_id_order
, serv.curr_service
, co.first_careunit
, CASE WHEN 
    co.icu_length_of_stay < 2 THEN 1
ELSE 0 END AS exclusion_los
, CASE WHEN
    co.age < 16 THEN 1
ELSE 0 END AS exclusion_age
, CASE WHEN
    co.icustay_id_order != 1 THEN 1
ELSE 0 END  AS exclusion_first_stay
, CASE WHEN
    serv.surgical = 1 THEN 1
ELSE 0 END AS exclusion_surgical
FROM co
LEFT JOIN serv ON
    co.icustay_id = serv.icustay_id
AND serv.rank = 1
"""

df = pd.read_sql_query(query, con)

print('{:20 s} {:5d}'.format('Observations', df.shape[0]))
idxExcl = np.zeros(df.shape[0],dtype=bool)
for col in df.columns:
    if "exclusion_" in col:
        print('{:20 s} {:5d} ({:2.2f}%)'.format(col, df[col].sum(), df[col].sum()*100.0/df.shape[0]))
        idxExcl = (idxExcl) | (df[col]==1)

print('')
print('{:20 s} {:5d} ({:2.2f}%)'.format('Total excluded', np.sum(idxExcl), np.sum(idxExcl)*100.0/df.shape[0]))


Observations            10
exclusion_los            6 (60.00%)
exclusion_age            6 (60.00%)
exclusion_first_stay     1 (10.00%)
exclusion_surgical       2 (20.00%)

Total excluded           9 (90.00%)

可以发现，由于我们前面建立了筛选的队列表格，所以最后想看整个筛选过程就变得很简单。

这篇文档真的觉得很有用，其一是很展示了每一步应该怎么写查询语句并有详细的解释；其二也是最重要的，给出了选择研究队列的一般理念和一个简单的例子。

THE END

文章目录

Cohort selection