Techioz Blog

Ruby Web Scraping (Nokogiri) - Cleanup

Overview

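I am experimenting with ways to retrieve data from a website.

This is what I put together after a few days of research, but the output from Nokogiri is not as "clean" as I expected. When I print the array, the output contains a large number of newline characters ("\n").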

require 'httparty'
require 'nokogiri'
require 'open-uri'
require 'pry'
require 'csv'

# Assigning the page to scrape
page = HTTParty.get('http://www.realtor.com/realestateandhomes-search/Atlanta_GA/type-single-family-home/price-na-500000')

# Parse the HTTP response body into a Nokogiri document
parse_page = Nokogiri::HTML(page.body)

# Create an empty array for property details
details_array = []
parse_page.css('div.srp-item-body').each do |d|
    property_details = d.text
    details_array.push(property_details)
end

Pry.start(binding)

In Pry, displaying details_array or address_array produces output like this:

[2] pry(main)> details_array
=> ["\n      \n        \n          \n                2265 Tanglewood Cir NE,\n            Atlanta,\n            GA\n            30345\n \n        \n\n        \n          Dresden East\n        \n        \n\n            $289,900\n          \n          \n            \n        3 bd\n                2 ba\n                1,566 sq ft\n             
0.3 acres lot\n            \n          \n        \n          \n            Single Family Home\n          \n        \n          \n            \n  
Brokered by Re/Max Town And Country\n            \n          \n       
\n        \n          \n            Brokered by \n            Re/Max
Town And Country\n          \n        \n      \n    ",  "\n      \n   
\n          \n                2141 Dunwoody Gln,\n           
Atlanta,\n            GA\n            30338\n          \n        \n\n 
\n          \n            $469,900\n          \n          \n          
\n                4 bd\n                3 ba\n                2,850 sq
ft\n                0.3 acres lot\n                2 car\n           
\n          \n        \n          \n            Single Family Home\n  
\n        \n          \n            \n              Brokered by
Buckhead Home Realty Llc\n            \n          \n        \n       
\n          \n            Brokered by \n            Buckhead Home
Realty Llc\n          \n        \n      \n    ",  "\n      \n       
\n          \n                1048 Martin St SE,\n           
Atlanta,\n            GA\n            30315\n          \n        \n\n 
\n          Intown South\n          Peoplestown\n        \n        \n 
\n            $164,900\n          \n          \n            \n        
5 bd\n                3 ba\n                2,376 sq ft\n             
7,405 sq ft lot\n            \n          \n        \n          \n     
Single Family Home\n          \n        \n          \n            \n  
Brokered by Greenlet Llc\n            \n          \n        \n       
\n          \n            Brokered by \n            Greenlet Llc\n    
\n        \n      \n    ",  "\n      \n        \n          \n         
1048 Martin St SE,\n            Atlanta,\n            GA\n           
30315\n          \n        \n\n        \n          Intown South\n     
Peoplestown\n        \n        \n          \n            $164,900\n   
\n          \n            \n                5 bd\n                3
ba\n                2,055 sq ft\n                7,584 sq ft lot\n    
\n          \n        \n          \n            Single Family Home\n  
\n        \n          \n            \n              Brokered by
Greenlet, Llc\n            \n          \n        \n        \n         
\n            Brokered by \n            Greenlet, Llc\n          \n   
\n      \n    ",  "\n      \n        \n          \n               
1991 Woodbine Ter NE,\n            Atlanta,\n            GA\n         
30329\n          \n        \n\n        \n          Sagamore Hills\n   
\n        \n          \n            $299,900\n          \n          \n
\n                3 bd\n                1+ ba\n                1,449
sq ft\n                0.8 acres lot\n            \n          \n      
\n          \n            Single Family Home\n          \n        \n  
\n           :

Solution

It looks like your selector is not drilling far enough down into the document. Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div>
      <p>foo</p>
      <p>bar</p>
    </div>
  </body>
</html>
EOT

doc.search('div').map(&:text) # => ["\n      foo\n      bar\n    "]

Asking the parent tag for its text returns both the text nodes used to format the HTML and the text of the <p> nodes you actually want.

If you drill down to the actual nodes you want and take their text, the formatting whitespace between the tags goes away.

doc.search('div p').map(&:text) # => ["foo", "bar"]

See also "How to avoid joining all text from a node when scraping".