Chinese translation by Liang Yu; proofread by Xia Lin (Fortune China).
Artificial intelligence is better at analyzing medical images for illnesses like pneumonia and skin cancer than doctors are, according to a number of academic papers. But that conclusion is being called into question by recent research.
A paper published in March in the medical journal The BMJ found that many of those studies exaggerated their conclusions, making A.I. technologies seem more effective than they were in reality. The finding is significant because it undermines a huge ongoing shift in the health care industry, which is looking to use technology to diagnose ailments more quickly.
It also calls into question a tech industry that is scrambling to develop and sell A.I. technology for analyzing medical imagery. The paper’s authors are worried that overzealous companies and their investors may push to sell the technology before it has been thoroughly vetted.
“With no disrespect to venture capitalists—obviously they’re an important part of the funding process for a lot of this innovation—but obviously their enthusiasm is always to try and get things to market as quickly as possible,” says Myura Nagendran, a coauthor of the BMJ paper. “While we share that enthusiasm, we’re also acutely aware of how important it is to make sure these things are safe and work effectively if we institute them en masse.”
The finding also touches on the current coronavirus pandemic, which has claimed over 30,000 lives in the U.S. Some researchers maintain that they’ve developed A.I. systems that are faster than humans at examining chest CT scans for COVID-19 infections.
The recent BMJ review looked at nearly 100 studies of a type of artificial intelligence called deep learning that had been used on medical scans of various disorders including macular degeneration, tuberculosis, and several types of cancers.
The review found that 77 studies that lacked randomized testing included specific comments in their abstracts, or summaries, comparing their A.I. system’s performance to that of human doctors. Of those, 23 said that their A.I. was “superior” to clinical physicians at diagnosing certain illnesses.
One of the main problems with these papers is the “artificial, contrived nature of a lot of these studies,” in which researchers basically claim that their technology “outperformed a doctor,” says Eric Topol, one of the BMJ paper’s authors and the founder and director of the nonprofit Scripps Research Translational Institute. It’s absurd to compare an A.I.’s performance to that of human doctors, he explains, because in the real world, the choice between an A.I. system and a human doctor is not an either-or situation. Doctors will always review the findings.
“There’s this kind of nutty inclination to pit machines versus doctors, and that’s really a consistent flaw because it’s not only going to be machines that do readings of medical images,” Topol says. “You’re still going to have oversight if there’s anything reported that’s life-threatening or serious.”
Topol adds, “The point I’m just getting at, is if you look at all these papers, the vast majority—90%—do the man-versus-machine comparison, and it really isn’t necessary to do that.”
Nagendran, an academic clinical fellow for the U.K.’s National Institute for Health Research, says that studies describing A.I.’s superiority to human doctors can mislead people.
“There’s been a lot of hype out there, and that can very quickly translate through the media into stories that patients hear, saying things like, ‘It’s just around the corner, the A.I. will be seeing you rather than your doctor,’” says Nagendran.
Besides the core fallacy of pitting A.I. against humans, one of the big problems is that these papers typically fail to follow the more robust reporting standards that health care professionals have been trying to establish over the past decade, Nagendran says. One sore point, for instance, is that the papers generally fail to measure the accuracy of their deep-learning models on multiple data sets, which could cover different populations of people, rather than on just a limited number.
Luke Oakden-Rayner, a director of medical imaging research at the Royal Adelaide Hospital in Australia, noticed a similar problem when he examined a handful of recently published papers on using deep learning to diagnose COVID-19 via chest CT scans. Like the faulty medical imaging studies that the BMJ paper described, the coronavirus-related papers based their conclusions on a limited amount of data that was not representative of the entire population, a problem that’s known as selection bias.
In one paper, Oakden-Rayner noted, researchers developed a deep-learning system to recognize the coronavirus using data from 1,014 patients at Tongji University in Shanghai. These patients had been diagnosed with COVID-19 via the conventional swab tests used to detect the illness; they also had chest CT scans to see whether the infection had reached their lungs.
But that deep-learning system was likely trained on skewed data. Doctors probably suspected those patients were having lung problems related to COVID-19, which is why they ordered CT scans of the patients’ chests. The same technology would be unlikely to work well for people who have COVID-19 but no symptoms in their lungs.
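To make the selection-bias concern concrete, here is a minimal Python sketch. It is not the method of any paper discussed here: the "model" is a hypothetical stand-in that flags a patient only when the scan shows lung findings, and the patient counts are invented purely for illustration. It compares sensitivity (the share of infected patients the model catches) on a cohort selected for suspected lung problems with sensitivity on a broader group of infected people.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    infected: bool
    lung_findings: bool  # whether the CT scan shows visible lung involvement


def ct_model_predicts_infected(p: Patient) -> bool:
    # Hypothetical stand-in for a CT-based classifier:
    # it can only react to what is visible on the scan.
    return p.lung_findings


def sensitivity(patients: list[Patient]) -> float:
    # Share of truly infected patients that the model flags.
    infected = [p for p in patients if p.infected]
    detected = [p for p in infected if ct_model_predicts_infected(p)]
    return len(detected) / len(infected)


# Invented numbers: confirmed cases who were scanned because doctors
# suspected lung problems, so nearly all show lung findings.
hospital_cohort = [Patient(True, True)] * 95 + [Patient(True, False)] * 5

# Invented numbers: a broader group of infected people, many of whom
# have no visible lung involvement.
broader_infected = [Patient(True, True)] * 40 + [Patient(True, False)] * 60

print(sensitivity(hospital_cohort))   # 0.95
print(sensitivity(broader_infected))  # 0.4
```

With these made-up proportions, the same model looks far better on the selected hospital cohort than on the broader group. That is the gap a study cannot see if it is evaluated only on patients who were scanned because lung disease was already suspected.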
“As a general rule more accurate and complete data sets are more useful,” Oakden-Rayner says in an email to Fortune.
Oakden-Rayner questioned the need for A.I. researchers to even publish papers about using deep learning to diagnose the coronavirus, explaining that current testing is already effective and that there are more important jobs that A.I. can help with.
“Simply detecting COVID-19 on CT scans is unlikely to be very helpful,” Oakden-Rayner says in the email. “If there is a bottleneck that A.I. can solve in the medical workflow, then data for that task specifically will need to be collected.”
Topol agrees with Oakden-Rayner, saying, “It can be useful to have an algorithm review of a CT scan of lungs as to whether they are potentially related to COVID, but you don’t really need a CT scan.”
More conventional testing tools are increasingly being distributed worldwide, making them more available than CT scans, which are more expensive, Topol explains.
The takeaway from all these recent A.I. medical imaging studies is that people should use some skepticism in considering their findings, says Topol. These are essentially preliminary research papers that highlight potential uses of A.I. in the current health care system, but researchers still need deeper clinical trials to verify the technology’s effectiveness.
“You can’t just go right ahead to a prospective study,” Topol says regarding a more formal type of academic study that typically follows preliminary research. “You just don’t want to overstate the conclusions.”