某天發生 OP 叫修,說是 Postgres 磁碟空間不足 90%,
結果上系統檢查竟然只使用 53%,感覺很奇怪,會不會是誤判,
調出每5分鐘檢查的監控記錄,發現的確是有使用率上升 (53% -> 57% > 63% ... > 93% > 53%)
但是op 叫修後,立即恢復,這什麼妖術 QAQ
只好來檢查檢查 PostgreSQL Log目錄,來看看有什麼蛛絲馬跡,
果不其然有個 ERROR: could not write to hash-join temporary file: No space left on device 的錯誤,
不過依照 process id 31484 來檢查發現,只是 copy 寫資料進 table,怎麼會用到 temp,
2019-03-13 10:17:05 CST [[unknown]] [[unknown]] [[unknown]] [31484]: [1-1] [2019-03-13 10:17:05 CST] [00000]LOG: connection received: host=10.10.10.10 port=38932
2019-03-13 10:17:05 CST [pgdatabase] [[unknown]] [pguser] [31484]: [2-1] [2019-03-13 10:17:05 CST] [00000]LOG: connection authorized: user=pguser database=pgdatabase
2019-03-13 10:17:09 CST [pgdatabase] [psql] [pguser] [31484]: [3-1] [2019-03-13 10:17:05 CST] [00000]LOG: statement: CREATE TABLE blank (a1 varchar(10));
2019-03-13 10:17:47 CST [pgdatabase] [psql] [pguser] [31484]: [4-1] [2019-03-13 10:17:05 CST] [00000]LOG: statement: COPY blank_check_0313 FROM STDIN with csv;
2019-03-13 10:45:18 CST [pgdatabase] [psql] [pguser] [31484]: [5-1] [2019-03-13 10:17:05 CST] [53100]ERROR: could not write to hash-join temporary file: No space left on device
2019-03-13 10:45:18 CST [pgdatabase] [psql] [pguser] [31484]: [6-1] [2019-03-13 10:17:05 CST] [53100]STATEMENT: select c1, c2, mount, status from main where c1 in (select a.a1 from blank a);
2019-03-13 11:08:16 CST [pgdatabase] [psql] [pguser] [31484]: [7-1] [2019-03-13 10:17:05 CST] [00000]LOG: disconnection: session time: 0:51:11.528 user=pguser database=pgdatabase host=10.10.10.10 port=38932
(以上 Log 內容有把敏感字眼修改掉,但內容是不變的)
看完 Log 內容後,倒底是因 copy 而把 temp tablespace 吃滿,
還是中間有執行那段 sql 而 where 條件不好,導致 temp 爆炸,
先後的原因就不得而知,只能建議開發者若要用到 temp tablespace 的話,
建議指向另外新建的 temp tablespace 空間,
用預設 temp tablespace 的話,會把 PGDATA 下面的空間吃掉,
可能爆了都不知道…
0 留言