Locating Web Page Elements with XPath

shazi7804

6 years ago

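XPath (XML Path Language) is commonly used as the syntax for locating elements in automated scraping. This post covers a few ways to obtain an XPath and the trade-offs of each.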

Quickly grabbing an XPath with Chrome DevTools

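Chrome DevTools is arguably the most powerful front-end development and debugging tool available today, and it is also the easiest way to get an XPath: select a node in the Elements panel and copy its XPath directly.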

Using the shazi.info site as an example, here is how to get the XPath of the author's avatar:

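Go to shazi.info and open Chrome DevTools (press F12, or right-click and choose "Inspect").
Switch to the Elements tab and use the element picker to select the part of the page you want.
The Elements panel highlights the corresponding HTML element; right-click it and choose "Copy -> Copy XPath".
You get the XPath value //*[@id="grofile-2"]/img.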

XPath syntax and the pros and cons of DevTools

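Although Chrome offers a powerful way to grab an XPath, the XPath that DevTools produces is not always the most precise one, so it is still worth understanding how an XPath is constructed.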

For the same example, the overall element structure looks like this:

<html>
  <head></head>
  <body>
    <div id="container">
      <div id="header">...</div>
      <div id="wrapper">
        <div id="contentwrapper" class="animated fadeIn">
          <div id="content">...</div>
          <div id="rightbar">
            <div id="grofile-2" class="widget widget-grofile grofile">
              <img src="https://secure.gravatar.com/avatar/62a517c88ea8bfc9c5f9ff6720e8b00a?s=320" class="grofile-thumbnail no-grav" alt="shazi7804">
            </div>
            <div id="media_gallery-2">...</div>
            ... more div
          </div>
        </div>
      </div>
      <div id="footer">...</div>
    </div>
  </body>
</html>

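I have omitted many HTML elements that are unrelated to this XPath. An XPath can be written strictly or loosely, depending on how the page's elements are designed. For the detailed syntax, see "W3schools – XPath Syntax". Here are a few ways to write it: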

XPath keyed on div id

/html/body/div[@id='container']/div[@id='wrapper']/div[@id='contentwrapper']/div[@id='rightbar']/div[@id='grofile-2']/img

The advantage of keying the XPath on div ids is that the path keeps working as long as those ids, and the way they are nested, stay the same.
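
As a rough check, the id-keyed path can be evaluated with Python's standard library. This is a minimal sketch under assumptions: `PAGE` is a trimmed, XML-well-formed stand-in for the page (the `img` is self-closed and `avatar.png` is a placeholder `src`), and because `xml.etree.ElementTree` supports only a subset of XPath and rejects absolute paths, the leading `/html` is written as `.` relative to the root element.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the page structure (an assumption for this sketch).
PAGE = """
<html>
  <head></head>
  <body>
    <div id="container">
      <div id="header"></div>
      <div id="wrapper">
        <div id="contentwrapper" class="animated fadeIn">
          <div id="content"></div>
          <div id="rightbar">
            <div id="grofile-2" class="widget widget-grofile grofile">
              <img src="avatar.png" class="grofile-thumbnail no-grav" alt="shazi7804"/>
            </div>
            <div id="media_gallery-2"></div>
          </div>
        </div>
      </div>
      <div id="footer"></div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# ElementTree rejects absolute paths, so "/html" becomes "." (the root itself).
ID_PATH = (
    "./body/div[@id='container']/div[@id='wrapper']"
    "/div[@id='contentwrapper']/div[@id='rightbar']"
    "/div[@id='grofile-2']/img"
)

matches = root.findall(ID_PATH)
print(len(matches), matches[0].get("alt"))
```

Every step names a direct child, so inserting or renaming any div along the way would make `findall` return an empty list.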

XPath using // for a broad search, keyed on class

/html/body//*[@class='animated fadeIn']//*[@class='widget widget-grofile grofile']/img

If the div ids cannot be trusted, other attributes can serve as the key. In this example the divs also carry class attributes, and because not every div has a class, the path to the target can be written more concisely.
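
The class-keyed path can be tried the same way. Again this is only a sketch: `PAGE` is a hypothetical, trimmed, well-formed copy of the structure, `avatar.png` is a placeholder, and `/html` is written as `.` because `xml.etree.ElementTree` does not accept absolute paths. It also illustrates one thing to watch for: `@class='...'` compares the whole attribute string, so a partial class name does not match.

```python
import xml.etree.ElementTree as ET

# Hypothetical trimmed copy of the page, made well-formed for the XML parser.
PAGE = """<html><body>
  <div id="container">
    <div id="wrapper">
      <div id="contentwrapper" class="animated fadeIn">
        <div id="rightbar">
          <div id="grofile-2" class="widget widget-grofile grofile">
            <img src="avatar.png" alt="shazi7804"/>
          </div>
        </div>
      </div>
    </div>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)

# "//" skips any number of intermediate levels; the class values act as the keys.
CLASS_PATH = (
    "./body//*[@class='animated fadeIn']"
    "//*[@class='widget widget-grofile grofile']/img"
)
exact = root.findall(CLASS_PATH)

# The comparison is on the full attribute string, so one class name alone finds nothing.
partial = root.findall("./body//*[@class='fadeIn']")

print(len(exact), len(partial))
```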

XPath that relies on id being unique

//*[@id="grofile-2"]/img

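This is also the XPath that Chrome DevTools produced, and it is very short. It can be this short because only one element in the whole document has id="grofile-2": in effect it searches the entire document for id="grofile-2" and then takes the img tag directly beneath it.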
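
To see that the short form and the fully spelled-out form land on the same node, both can be evaluated against the same tree. This is a sketch under assumptions: `PAGE` is a hypothetical well-formed stand-in for the page, and the DevTools path `//*[...]` is written as `.//*[...]` because `xml.etree.ElementTree` paths are relative to the element they are called on.

```python
import xml.etree.ElementTree as ET

# Hypothetical trimmed copy of the page, made well-formed for the XML parser.
PAGE = """<html><body>
  <div id="container">
    <div id="wrapper">
      <div id="contentwrapper" class="animated fadeIn">
        <div id="rightbar">
          <div id="grofile-2" class="widget widget-grofile grofile">
            <img src="avatar.png" alt="shazi7804"/>
          </div>
          <div id="media_gallery-2"></div>
        </div>
      </div>
    </div>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)

# Short form: search every descendant for the id, then take its <img> child.
short_hit = root.find(".//*[@id='grofile-2']/img")

# Long form: walk every level explicitly by id.
long_hit = root.find(
    "./body/div[@id='container']/div[@id='wrapper']"
    "/div[@id='contentwrapper']/div[@id='rightbar']"
    "/div[@id='grofile-2']/img"
)

# Both lookups return the same Element object from the tree.
same_element = short_hit is not None and short_hit is long_hit
print(same_element)
```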

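There is nothing wrong with this approach, but for automated tests that must be maintained over the long term there is more to consider. For example: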

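Drawback: if id="grofile-2" appears in two different parts of the structure, //*[@id="grofile-2"]/img matches both and can return the wrong element.
Advantage: if id="grofile-2" needs to be moved to a different part of the structure, //*[@id="grofile-2"]/img keeps working without modification.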
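
Both points can be illustrated with two made-up layouts. This is a sketch, not the real site: `MOVED` and `DUPLICATED` are hypothetical documents, `avatar_alts` is a helper invented for this example, and the path is written as `.//*[...]` because `xml.etree.ElementTree` only accepts relative paths.

```python
import xml.etree.ElementTree as ET

SHORT_PATH = ".//*[@id='grofile-2']/img"


def avatar_alts(page: str) -> list[str]:
    """Return the alt text of every <img> matched by SHORT_PATH."""
    root = ET.fromstring(page)
    return [img.get("alt") for img in root.findall(SHORT_PATH)]


# Hypothetical layout 1: the widget was moved from the sidebar into the footer.
MOVED = """<html><body>
  <div id="footer">
    <div id="grofile-2"><img src="avatar.png" alt="shazi7804"/></div>
  </div>
</body></html>"""

# Hypothetical layout 2: the same id is reused in two places.
DUPLICATED = """<html><body>
  <div id="rightbar">
    <div id="grofile-2"><img src="avatar.png" alt="shazi7804"/></div>
  </div>
  <div id="footer">
    <div id="grofile-2"><img src="other.png" alt="someone-else"/></div>
  </div>
</body></html>"""

moved_alts = avatar_alts(MOVED)            # still exactly one match
duplicated_alts = avatar_alts(DUPLICATED)  # two matches, in document order

print(moved_alts)
print(duplicated_alts)
```

A test that takes only the first match would silently pick whichever duplicate comes first in the document, so checking the number of matches is a cheap safeguard.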

Summary

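XPath has a lot in common with regex: it can be written very precisely, or very loosely with *. How precise an XPath should be in practice is something to work out with the front-end engineers. A well-targeted, precise XPath can also make element lookups noticeably faster.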
