Paginating Web Data with Twitter's Cursor Approach

Tim [Backend Technology] 2010-01-25 14:56:02


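Overview

Starting from the common scenario of loading list data in Web applications, the author compares traditional offset pagination with the cursor pagination used by Twitter, focusing on the core differences in how they work and how they perform. The article points out that with the traditional LIMIT/OFFSET approach, the database must skip over a large number of already-read rows once the page depth grows, so performance drops sharply. Cursor pagination instead records a unique identifier of the last row on the current page (such as an ID or a timestamp) and turns the next query into an efficient range query, which avoids the deep-pagination performance trap.

The practical value of the article lies in how clearly it marks the boundary between the two approaches. Cursor pagination is especially suited to feeds whose data changes frequently and which need infinite scrolling (such as a social media timeline), where it keeps the user experience smooth. Traditional pagination can jump to an arbitrary page, so it still has its place in interfaces such as admin back ends that need exact page-number navigation. Finally, the author touches on some details to consider when implementing cursor pagination, such as the index requirement on the sort column and how to handle boundary cases caused by data changes, which gives practitioners a concrete reference.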

This article discusses how different technical implementations of data pagination in Web applications differ in performance.

Taking MySQL as an example, the feature shown in the figure above is implemented as

select * from msgs where thread_id = ? limit ?, ?  -- bind offset (page * count), then count

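When reading the Twitter API, however, we find that quite a few endpoints use a cursor instead of the more intuitive page and count parameters. One example is the followers ids endpoint: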

URL:

http://twitter.com/followers/ids.format

Returns an array of numeric IDs for every user following the specified user.

Parameters:

* cursor. Required. Breaks the results into pages. Provide a value of -1 to begin paging. Provide values as returned to in the response body’s next_cursor and previous_cursor attributes to page back and forth in the list.

o Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1

o Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1300794057949944903

http://twitter.com/followers/ids.format
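A client consumes such an endpoint by starting at cursor=-1 and following next_cursor until the list is exhausted. The sketch below illustrates that loop only. It makes no network calls: fake_followers_ids, iter_follower_ids, and the canned cursor values are hypothetical, and it assumes that a next_cursor of 0 marks the last page.

```python
from typing import Any, Callable, Dict, Iterator

Page = Dict[str, Any]


def fake_followers_ids(cursor: int) -> Page:
    """Hypothetical stand-in for one followers/ids call.

    A real client would request ...?cursor=<cursor> and parse the response
    body. The cursor values below are made up and should be treated as opaque.
    """
    pages = {
        -1: {"ids": [11, 12, 13], "next_cursor": 1001, "previous_cursor": 0},
        1001: {"ids": [14, 15, 16], "next_cursor": 1002, "previous_cursor": -1001},
        1002: {"ids": [17], "next_cursor": 0, "previous_cursor": -1002},
    }
    return pages[cursor]


def iter_follower_ids(fetch: Callable[[int], Page]) -> Iterator[int]:
    """Yield every id, starting at cursor=-1 and following next_cursor."""
    cursor = -1
    while cursor != 0:  # assumption: next_cursor == 0 means no more pages
        page = fetch(cursor)
        yield from page["ids"]
        cursor = page["next_cursor"]


all_ids = list(iter_follower_ids(fake_followers_ids))
```

The client never computes a cursor itself. It only passes back the value the previous response handed out, which leaves the server free to change what the cursor encodes.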

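As the description above shows, the http://twitter.com/followers/ids.xml call takes a cursor parameter for paging rather than the traditional url?page=n&count=n form. What is the advantage of doing this? Does each cursor hold a snapshot of the data set as it was at that moment, so that query results do not contain duplicates when the result set changes in real time?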

In the Google Groups discussion "Cursor Expiration", Twitter architect John Kalucki wrote:

A cursor is an opaque deletion-tolerant index into a Btree keyed by source userid and modification time. It brings you to a point in time in the reverse chron sorted list. So, since you can’t change the past, other than erasing it, it’s effectively stable. (Modifications bubble to the top.) But you have to deal with additions at the list head and also block shrinkage due to deletions, so your blocks begin to overlap quite a bit as the data ages. (If you cache cursors and read much later, you’ll see the first few rows of cursor[n+1]’s block as duplicates of the last rows of cursor[n]’s block. The intersection cardinality is equal to the number of deletions in cursor[n]’s block). Still, there may be value in caching these cursors and then heuristically rebalancing them when the overlap proportion crosses some threshold.
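The overlap described in the quote can be reproduced with a toy model. This is a sketch under stated assumptions, not Twitter's implementation: rows are plain integers in reverse chronological order, a cursor is simply the key value at which a block starts, and read_block, BLOCK_SIZE, and the sample keys are hypothetical.

```python
from typing import List

BLOCK_SIZE = 5  # tiny block for illustration; the API discussion mentions 5000


def read_block(keys: List[int], cursor: int) -> List[int]:
    """Return the next BLOCK_SIZE keys strictly older than `cursor`.

    `keys` is in reverse chronological order (largest key first). The linear
    scan stands in for the Btree seek a real store would perform.
    """
    return [k for k in keys if k < cursor][:BLOCK_SIZE]


keys = list(range(20, 0, -1))          # rows 20 (newest) down to 1 (oldest)
cursors = [21]                         # 21 sorts ahead of every row: list head
first = read_block(keys, cursors[0])   # rows 20..16
cursors.append(first[-1])              # cache cursor[1] = 16

# Much later, two rows inside cursor[0]'s block have been deleted.
deleted = {19, 17}
keys = [k for k in keys if k not in deleted]

block0 = read_block(keys, cursors[0])  # refills from older rows: 20, 18, 16, 15, 14
block1 = read_block(keys, cursors[1])  # still starts below 16: 15, 14, 13, 12, 11
overlap = [k for k in block0 if k in block1]
```

In this run the overlap is rows 15 and 14: the last rows of cursor[0]'s block reappear as the first rows of cursor[1]'s block, and their count equals the number of deletions, as the quote states.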

In another thread, "new cursor-based pagination not multithread-friendly", John added:

The page based approach does not scale with large sets. We can no longer support this kind of API without throwing a painful number of 503s.

Working with row-counts forces the data store to recount rows in an O(n^2) manner. Cursors avoid this issue by allowing practically constant time access to the next block. The cost becomes O(n/block_size) which, yes, is O(n), but a graceful one given n < 10^7 and a block_size of 5000. The cursor approach provides a more complete and consistent result set.

Proportionally, very few users require multiple page fetches with a page size of 5,000.

Also, scraping the social graph repeatedly at high speed is could often be considered a low-value, borderline abusive use of the social graph API.
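The cost argument in the quote can be made concrete with a little arithmetic. The sketch below is one reading of it, under a simple assumed cost model: fetching a page by offset walks every skipped row plus the rows returned, while a cursor fetch seeks directly to its block (the seek cost itself is ignored). The function names are hypothetical.

```python
def offset_rows_touched(n: int, block_size: int) -> int:
    """Rows walked when all n rows are fetched page by page using offsets.

    Assumed cost model: the page at offset k costs k skipped rows plus the
    rows actually returned.
    """
    total = 0
    for offset in range(0, n, block_size):
        total += offset + min(block_size, n - offset)
    return total


def cursor_rows_touched(n: int) -> int:
    """Rows read with cursors: each row is read once, none are skipped."""
    return n


def cursor_fetches(n: int, block_size: int) -> int:
    """Number of block fetches, i.e. ceil(n / block_size)."""
    return -(-n // block_size)


n, block_size = 10**7, 5000            # the figures given in the quote
by_offset = offset_rows_touched(n, block_size)   # 10,005,000,000 rows
by_cursor = cursor_rows_touched(n)               # 10,000,000 rows
fetches = cursor_fetches(n, block_size)          # 2,000 fetches
```

Under this model, walking the full list by offset touches about a thousand times more rows than walking it by cursor, and the gap grows with n because the offset total is quadratic in n.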

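These two passages make it clear that, for large result sets, the main purpose of the cursor approach is to greatly improve performance. Taking MySQL as an example again: when paging to around row 100,000 without a cursor, the corresponding SQL is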

select * from msgs limit 100000, 100

On a table with one million records, the first execution of this SQL takes more than 5 seconds.

Assuming we use the value of the table's primary key as cursor_id, the SQL for cursor-based paging can be optimized to

select * from msgs where id > cursor_id limit 100;

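On the same table this usually takes less than 100 ms, which is tens of times faster. For more on how MySQL LIMIT performance varies, see also a rough article I wrote three years ago, "MySQL LIMIT 的性能问题" (The Performance Problem of MySQL LIMIT).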
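The two query shapes can be placed side by side in a small runnable sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, an assumption made only so the example runs without a server, and it does not reproduce the timings quoted above. The msgs table and the id cursor come from the article; make_db, page_by_offset, page_by_cursor, and PAGE_SIZE are hypothetical names. An order by id is added so that "the rows after the cursor" is well defined.

```python
import sqlite3
from typing import List, Optional, Tuple

PAGE_SIZE = 100
Row = Tuple[int, str]


def make_db(rows: int) -> sqlite3.Connection:
    """Create an in-memory msgs table with ids 1..rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table msgs (id integer primary key, body text)")
    conn.executemany(
        "insert into msgs (id, body) values (?, ?)",
        ((i, f"msg {i}") for i in range(1, rows + 1)),
    )
    return conn


def page_by_offset(conn: sqlite3.Connection, page: int) -> List[Row]:
    """Offset paging: the store must step over page * PAGE_SIZE rows first."""
    return conn.execute(
        "select id, body from msgs order by id limit ? offset ?",
        (PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()


def page_by_cursor(
    conn: sqlite3.Connection, cursor_id: int
) -> Tuple[List[Row], Optional[int]]:
    """Cursor paging: seek to id > cursor_id via the primary key, then read."""
    rows = conn.execute(
        "select id, body from msgs where id > ? order by id limit ?",
        (cursor_id, PAGE_SIZE),
    ).fetchall()
    next_cursor = rows[-1][0] if rows else None  # None: nothing left to read
    return rows, next_cursor


conn = make_db(1000)
first_page, cursor = page_by_cursor(conn, 0)      # ids 1..100
second_page, cursor = page_by_cursor(conn, cursor)  # ids 101..200
```

Both functions return the same rows for the same page. The difference is that the cursor query starts from an index lookup on id, so its cost does not grow with how deep into the table the page lies.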

Conclusion

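For paging through large data sets in Web applications, this cursor approach is recommended. Its drawback is that pages must be traversed consecutively, so the user cannot jump to an arbitrary page.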
