<h1>Goutte, a simple PHP Web Scraper</h1>
<p>Goutte is a screen scraping and web crawling library for PHP.</p>
<p>Goutte provides a nice API to crawl websites and extract data from the HTML/XML
responses.</p>
<h2>Requirements</h2>
<p>Goutte depends on PHP 5.5+ and Guzzle 6+.</p>
<p>Tip: if you need support for PHP 5.4 or Guzzle 4-5, use Goutte 2.x (latest <a href="https://github.com/FriendsOfPHP/Goutte/releases/download/v2.0.4/goutte-v2.0.4.phar" rel="nofollow">phar</a>). If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.x (latest <a href="https://github.com/FriendsOfPHP/Goutte/releases/download/v1.0.7/goutte-v1.0.7.phar" rel="nofollow">phar</a>).</p>
<h2>Installation</h2>
<p>Add <code>fabpot/goutte</code> as a dependency in your <code>composer.json</code> file by running:</p>
<pre><code>composer require fabpot/goutte
</code></pre>
<h2>Usage</h2>
<p>Create a Goutte Client instance (which extends
<code>Symfony\Component\BrowserKit\Client</code>):</p>
<pre><code>use Goutte\Client;
$client = new Client();
</code></pre>
<p>Make requests with the <code>request()</code> method:</p>
<pre><code>// Go to the symfony.com website
$crawler = $client->request('GET', 'https://www.symfony.com/blog/');
</code></pre>
<p>The method returns a <code>Crawler</code> object
(<code>Symfony\Component\DomCrawler\Crawler</code>).</p>
<p>To use your own Guzzle settings, create a Guzzle 6 client and pass it to
Goutte. For example, to set a 60-second request timeout:</p>
<pre><code>use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
$goutteClient = new Client();
$guzzleClient = new GuzzleClient(array(
    'timeout' => 60,
));
$goutteClient->setClient($guzzleClient);
</code></pre>
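<p>Other Guzzle request options can be passed in the same array. The following is a minimal sketch, assuming Guzzle 6 and a Composer install; <code>connect_timeout</code> is a standard Guzzle request option, and the 5 second value is only an illustrative assumption:</p>
<pre><code>require 'vendor/autoload.php'; // assumes dependencies were installed with Composer

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

// Standard Guzzle 6 request options; the values are examples only
$guzzleClient = new GuzzleClient(array(
    'timeout' => 60,        // total time allowed for a request, in seconds
    'connect_timeout' => 5, // time allowed to establish the connection
));

// Hand the configured Guzzle client to Goutte
$goutteClient = new Client();
$goutteClient->setClient($guzzleClient);
</code></pre>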
<p>Click on links:</p>
<pre><code>// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);
</code></pre>
<p>Extract data:</p>
<pre><code>// Get the latest posts in this category and display their titles
$crawler->filter('h2 > a')->each(function ($node) {
    print $node->text()."\n";
});
</code></pre>
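<p>Attribute values can be read in much the same way as text. The sketch below assumes the same blog page as above and that its post titles are still matched by the <code>h2 > a</code> selector (the page markup may have changed); <code>attr()</code> and <code>each()</code> are methods of the Symfony <code>Crawler</code> class:</p>
<pre><code>require 'vendor/autoload.php'; // assumes dependencies were installed with Composer

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://www.symfony.com/blog/');

// each() returns an array holding the closure's return value for every matched node
$links = $crawler->filter('h2 > a')->each(function ($node) {
    return $node->attr('href');
});

print count($links)." links found\n";
foreach ($links as $href) {
    print $href."\n";
}
</code></pre>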
<p>Submit forms:</p>
<pre><code>$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});
</code></pre>
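<p>Field values can also be set on the form object before it is submitted, because the DomCrawler <code>Form</code> class supports array access. A minimal sketch, assuming the same GitHub sign-in page and field names as above (the credentials are placeholders, and the page markup may have changed):</p>
<pre><code>require 'vendor/autoload.php'; // assumes dependencies were installed with Composer

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());

// Fill in the fields one by one instead of passing an array to submit()
$form = $crawler->selectButton('Sign in')->form();
$form['login'] = 'fabpot';
$form['password'] = 'xxxxxx';

$crawler = $client->submit($form);
</code></pre>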
<h2>More Information</h2>
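<p>Read the documentation of the <a href="https://symfony.com/components/BrowserKit" rel="nofollow">BrowserKit</a> and <a href="https://symfony.com/doc/current/components/dom_crawler.html" rel="nofollow">DomCrawler</a> Symfony
Components for more information about what you can do with Goutte.</p>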
<h2>Pronunciation</h2>
<p>Goutte is pronounced <code>goot</code>, i.e. it rhymes with <code>boot</code>, not <code>out</code>.</p>
<h2>Technical Information</h2>
<p>Goutte is a thin wrapper around the following fine PHP libraries:</p>
<ul>
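<li><p>Symfony Components: <a href="https://symfony.com/components/BrowserKit" rel="nofollow">BrowserKit</a>, <a href="https://symfony.com/doc/current/components/css_selector.html" rel="nofollow">CssSelector</a> and <a href="https://symfony.com/doc/current/components/dom_crawler.html" rel="nofollow">DomCrawler</a>;</p></li>
<li><p><a href="http://docs.guzzlephp.org" rel="nofollow">Guzzle</a> HTTP Component.</p></li>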
</ul>
<h2>License</h2>
<p>Goutte is licensed under the MIT license.</p>
<ul>
<li><p><code>Composer</code>: <a href="https://getcomposer.org" rel="nofollow">https://getcomposer.org</a></p></li>
<li><p><code>Guzzle</code>: <a href="http://docs.guzzlephp.org" rel="nofollow">http://docs.guzzlephp.org</a></p></li>
<li><p><code>BrowserKit</code>: <a href="https://symfony.com/components/BrowserKit" rel="nofollow">https://symfony.com/components/BrowserKit</a></p></li>
<li><p><code>DomCrawler</code>: <a href="https://symfony.com/doc/current/components/dom_crawler.html" rel="nofollow">https://symfony.com/doc/current/components/dom_crawler.html</a></p></li>
<li><p><code>CssSelector</code>: <a href="https://symfony.com/doc/current/components/css_selector.html" rel="nofollow">https://symfony.com/doc/current/components/css_selector.html</a></p></li>
</ul>