Parsing HTML with Nokogiri

I need to write a rake task to migrate data from the existing system, a directory-based website, to the new website backed by a MySQL database. Basically, we generate YAML from the existing website's data and validate the YAML content manually. Some parts don't follow a specific pattern, so they can't be handled by a program. The next step is to update the new system's database from that YAML file.
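As a rough sketch of that round trip, the snippet below dumps some extracted records to YAML and loads them back. It uses only Ruby's standard `yaml` library. The record fields (`title`, `image`) and the `items.yml` file name are made up for illustration, and the actual database update is left out.

```ruby
require 'yaml'

# Hypothetical records extracted from the old site; the field names are assumptions.
items = [
  { 'title' => 'First item',  'image' => 'images/first.png' },
  { 'title' => 'Second item', 'image' => 'images/second.png' }
]

# Step 1: serialize the extracted data so it can be reviewed and fixed by hand.
yaml_text = items.to_yaml
# File.write('items.yml', yaml_text)  # hypothetical file name

# Step 2: after manual validation, load the YAML back before updating the database.
loaded = YAML.safe_load(yaml_text)
loaded.each { |item| puts "would import #{item['title']}" }
```

Keeping the YAML file as an intermediate step means the manual corrections happen in a plain text file rather than in the new database.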

One of the problems is that I have to parse the content of HTML files. I need to find specific elements and their values inside. Fortunately, there is the Nokogiri library.

Using Nokogiri costs far fewer brain cells than using string.scan(regex).

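require 'nokogiri'

# html_content holds the HTML source as a String, e.g. read from one of the old site's files.
page = Nokogiri::HTML(html_content)

# Visit each element with class "item" and inspect its direct children.
page.css('.item').each do |div|
  div.children.each do |child|
    puts "image source #{child['src']}" if child.name == 'img'
    puts "Title #{child.text}" if child.name == 'h2'
    puts "h3 content #{child.text}" if child.name == 'h3'
  end
end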