2017-03-10

MongoDB: unique sparse index of array field

In MongoDB you can index an "array" field of a document (a record, in SQL terms). The documentation describes it this way:

Multikey Indexes

To index a field that holds an array value, MongoDB creates an index key for each element in the array.

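In other words, every element of the array is expanded and indexed individually. This looks reasonable at first glance, but once unique and sparse are taken into account, together with the behavior of array operators such as $push and $pop, all sorts of odd situations arise. From 1.x to the present, this has been a thoroughly muddled affair.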
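To make the expansion concrete, here is a minimal sketch in plain JavaScript. It is a simplified model, not MongoDB's actual implementation; the helper name `multikeyKeys` is hypothetical, and the collapsing of repeated elements into one key is an assumption of the model that matches the [3, 3] behavior shown later in this post.

```javascript
// Simplified model of multikey key generation (hypothetical helper, not MongoDB's code).
// An array value yields one key per distinct element; a scalar yields a single key.
function multikeyKeys(value) {
  if (!Array.isArray(value)) {
    return [value];
  }
  // Assumption of this model: repeated elements in one document collapse to one key.
  return [...new Set(value)];
}

console.log(multikeyKeys([1, 2])); // two keys: 1 and 2
console.log(multikeyKeys([3, 3])); // a single key: 3
console.log(multikeyKeys('x'));    // a single key: 'x'
```

Each key points back to the document it came from, so one document can own several index entries.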

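Let's start with unique. By definition, each value in a unique index can map to only one document. With a multikey index, that leads to the following behavior, which may be counterintuitive: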

> db.c.ensureIndex({arrayfield: 1}, {unique: true, name: 'unique'});
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}
> db.c.insert({arrayfield: [1, 2]});
WriteResult({ "nInserted" : 1 })
> db.c.insert({arrayfield: [1, 3]});
WriteResult({
	"nInserted" : 0,
	"writeError" : {
		"code" : 11000,
		"errmsg" : "E11000 duplicate key error collection: test.c index: arrayfield_1 dup key: { : 1.0 }"
	}
})

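The reason is that inserting {arrayfield: [1, 2]} creates one index entry each for 1 and 2, while {arrayfield: [1, 3]} needs one entry each for 1 and 3, so 1 is duplicated. Here unique really means "a value in the array must not also appear in the array of another document". Notably, the following example is allowed: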

> db.c.insert({arrayfield: [3, 3]});
WriteResult({ "nInserted" : 1 })

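The index key 3 produced here maps to only one document, so unique is not violated. To avoid this situation, you must first make sure the array contains no duplicate values when it is created, and then use $addToSet instead of $push for subsequent operations.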
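The three inserts above can be replayed against a toy model of a unique multikey index. This is only a sketch in plain JavaScript: the class `UniqueMultikeyIndex` and the helpers `pushValue` and `addToSetValue` are hypothetical names invented for illustration, and they imitate the described behavior rather than call any MongoDB API.

```javascript
// Toy model of a unique multikey index: each key may belong to only one document.
class UniqueMultikeyIndex {
  constructor() {
    this.owner = new Map(); // index key -> id of the document that owns it
  }

  // Returns true if the document was indexed, false on a duplicate-key conflict.
  insert(id, array) {
    const keys = [...new Set(array)]; // duplicates inside one document collapse
    // A key already owned by a different document violates uniqueness.
    if (keys.some((k) => this.owner.has(k) && this.owner.get(k) !== id)) {
      return false;
    }
    keys.forEach((k) => this.owner.set(k, id));
    return true;
  }
}

// Plain-array imitations of the two update operators mentioned above.
const pushValue = (arr, v) => [...arr, v];                               // like $push
const addToSetValue = (arr, v) => (arr.includes(v) ? arr : [...arr, v]); // like $addToSet

const uniqueIdx = new UniqueMultikeyIndex();
console.log(uniqueIdx.insert('doc1', [1, 2])); // true
console.log(uniqueIdx.insert('doc2', [1, 3])); // false: key 1 already belongs to doc1
console.log(uniqueIdx.insert('doc3', [3, 3])); // true: key 3 maps to a single document
console.log(pushValue([3], 3).length);         // 2: the duplicate is stored
console.log(addToSetValue([3], 3).length);     // 1: the duplicate is skipped
```

The index never sees the duplicate inside [3, 3], which is why keeping the array itself free of duplicates is left to the application.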

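A common problem with unique is that "absent" also counts as a value. If two documents both lack the indexed field, the "missing field" is itself a duplicate, which is not allowed: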

> db.c.insert({name: 'a'});
WriteResult({ "nInserted" : 1 })
> db.c.insert({name: 'b'});
WriteResult({
	"nInserted" : 0,
	"writeError" : {
		"code" : 11000,
		"errmsg" : "E11000 duplicate key error collection: test.c index: unique dup key: { : null }"
	}
})

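This is where sparse usually comes in. By definition, a sparse index does not index every document in the collection, only those that have the indexed field. Here is an example: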

> // drop previous unique index
> db.c.dropIndex('unique');
{ "nIndexesWas" : 2, "ok" : 1 }
> // create a unique & sparse index
> db.c.ensureIndex({arrayfield: 1}, {unique: true, sparse: true});
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}
> db.c.insert({name: 'b'});  // can insert this now
WriteResult({ "nInserted" : 1 })
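The difference between the two index types for documents that lack the field can be sketched as follows. This is a simplified model in plain JavaScript, assuming a non-sparse index stores a null key for a missing field (as the `dup key: { : null }` error above suggests) and a sparse index stores nothing; `indexKeys` and `hasDuplicateKey` are hypothetical helper names.

```javascript
// Keys an index would hold for one document, in a simplified model.
// Non-sparse: a missing field is indexed as null. Sparse: a missing field yields no keys.
function indexKeys(doc, field, sparse) {
  if (!(field in doc)) {
    return sparse ? [] : [null];
  }
  const value = doc[field];
  return Array.isArray(value) ? [...new Set(value)] : [value];
}

// True if any index key would be shared by two documents, i.e. unique is violated.
function hasDuplicateKey(docs, field, sparse) {
  const seen = new Set();
  for (const doc of docs) {
    for (const key of indexKeys(doc, field, sparse)) {
      if (seen.has(key)) return true;
      seen.add(key);
    }
  }
  return false;
}

const noFieldDocs = [{ name: 'a' }, { name: 'b' }];
console.log(hasDuplicateKey(noFieldDocs, 'arrayfield', false)); // true: two null keys collide
console.log(hasDuplicateKey(noFieldDocs, 'arrayfield', true));  // false: neither document is indexed
```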

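That seems to solve the problem, but it does not. Array operations frequently add or remove elements. Suppose we first empty [3, 3] into an empty array and later empty [1, 2] as well. By the sparse principle, both documents now have "no value", so this should be allowed. In MongoDB 1.x, it was. But that caused another problem: a query for {arrayfield: []} that used the index returned nothing, because the empty array was not indexed and was therefore treated as nonexistent. Without the index, comparing documents one by one, the document did show up. The behavior was inconsistent, as in this example from SERVER-2258: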

> db.c.save({a:[]});
> db.c.ensureIndex({a:1});
> db.c.find({a:[]});  // no result
> db.c.find({a:[]}).hint( {$natural:1} );
{ "_id" : ObjectId("4d0fba6fc6237b412f53adeb"), "a" : [ ] }
> db.c.find({a:[]}).hint({a:1});  // again, no result

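As mentioned above, a multikey index indexes each "value" in the array. To fix this problem, from version 2.0 on an empty array is treated as containing a single value, undefined. This resolved SERVER-2258 but created a new problem, because duplicate empty arrays now violate unique: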

> db.c.update({arrayfield: 1}, {$pull: {arrayfield: 1}});
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.c.update({arrayfield: 2}, {$pull: {arrayfield: 2}});
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.c.find().pretty();
{ "_id" : ObjectId("58c256eb84c90ec348fb8950"), "arrayfield" : [ ] }
{
	"_id" : ObjectId("58c2582384c90ec348fb8952"),
	"arrayfield" : [
		3,
		3
	]
}
{ "_id" : ObjectId("58c25a2c84c90ec348fb8955"), "name" : "a" }
{ "_id" : ObjectId("58c25be084c90ec348fb8957"), "name" : "b" }
> db.c.update({arrayfield: 3}, {$pull: {arrayfield: 3}});
WriteResult({
	"nMatched" : 0,
	"nUpserted" : 0,
	"nModified" : 0,
	"writeError" : {
		"code" : 11000,
		"errmsg" : "E11000 duplicate key error collection: test.c index: arrayfield_1 dup key: { : undefined }"
	}
})

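The error message arrayfield_1 dup key: { : undefined } shows that the value undefined is duplicated. Moreover, this breaking change was not mentioned in the release notes. The problem was reported as SERVER-3934 and is still unresolved at the time of writing. One workaround is to use {v: 0}, that is, the old index version, but it appears to be unsupported from 2.6 onward.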
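The change between 1.x and 2.0+ can be summarized with a small model of a sparse unique index. It is a sketch in plain JavaScript under the assumptions stated in this post (1.x produced no key for an empty array, 2.0+ produces a single undefined key); `sparseKeys`, `violatesUnique`, and the flag `emptyArrayAsUndefined` are hypothetical names.

```javascript
// Keys a sparse multikey index would hold for one document, in a simplified model.
// emptyArrayAsUndefined = false models the 1.x behavior described above,
// emptyArrayAsUndefined = true models the 2.0+ behavior.
function sparseKeys(doc, field, emptyArrayAsUndefined) {
  if (!(field in doc)) return []; // sparse: a missing field is skipped
  const value = doc[field];
  if (!Array.isArray(value)) return [value];
  if (value.length === 0) return emptyArrayAsUndefined ? [undefined] : [];
  return [...new Set(value)];
}

// True if any key would be shared by two documents, i.e. unique is violated.
function violatesUnique(docs, field, emptyArrayAsUndefined) {
  const seen = new Set();
  for (const doc of docs) {
    for (const key of sparseKeys(doc, field, emptyArrayAsUndefined)) {
      if (seen.has(key)) return true;
      seen.add(key);
    }
  }
  return false;
}

const emptied = [{ arrayfield: [] }, { arrayfield: [] }, { name: 'a' }, { name: 'b' }];
console.log(violatesUnique(emptied, 'arrayfield', false)); // false: empty arrays are not indexed
console.log(violatesUnique(emptied, 'arrayfield', true));  // true: two undefined keys collide
```

In this model sparse only helps documents that lack the field entirely; a field that holds an empty array still produces a key.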

To sum up: for the use case "values in an array field must not repeat across the whole collection, but the array field may be absent", the current MongoDB (3.4) mechanisms cannot achieve this directly.