Drupal investigation

README.rst 3.2KB

Goutte, a simple PHP Web Scraper
================================

Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML
responses.

Requirements
------------

Goutte depends on PHP 5.5+ and Guzzle 6+.

.. tip::

    If you need support for PHP 5.4 or Guzzle 4-5, use Goutte 2.x (latest `phar
    <https://github.com/FriendsOfPHP/Goutte/releases/download/v2.0.4/goutte-v2.0.4.phar>`_).

    If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.x (latest `phar
    <https://github.com/FriendsOfPHP/Goutte/releases/download/v1.0.7/goutte-v1.0.7.phar>`_).

Installation
------------

Add ``fabpot/goutte`` as a require dependency in your ``composer.json`` file:

.. code-block:: bash

    composer require fabpot/goutte

Usage
-----

Create a Goutte Client instance (which extends
``Symfony\Component\BrowserKit\Client``):

.. code-block:: php

    use Goutte\Client;

    $client = new Client();

Make requests with the ``request()`` method:

.. code-block:: php

    // Go to the symfony.com website
    $crawler = $client->request('GET', 'https://www.symfony.com/blog/');

The method returns a ``Crawler`` object
(``Symfony\Component\DomCrawler\Crawler``).

To use your own Guzzle settings, you may create and pass a new Guzzle 6
instance to Goutte. For example, to add a 60 second request timeout:

.. code-block:: php

    use Goutte\Client;
    use GuzzleHttp\Client as GuzzleClient;

    $goutteClient = new Client();
    $guzzleClient = new GuzzleClient(array(
        'timeout' => 60,
    ));
    $goutteClient->setClient($guzzleClient);

Click on links:

.. code-block:: php

    // Click on the "Security Advisories" link
    $link = $crawler->selectLink('Security Advisories')->link();
    $crawler = $client->click($link);

Extract data:

.. code-block:: php

    // Get the latest post in this category and display the titles
    $crawler->filter('h2 > a')->each(function ($node) {
        print $node->text()."\n";
    });

Submit forms:

.. code-block:: php

    $crawler = $client->request('GET', 'https://github.com/');
    $crawler = $client->click($crawler->selectLink('Sign in')->link());
    $form = $crawler->selectButton('Sign in')->form();
    $crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
    $crawler->filter('.flash-error')->each(function ($node) {
        print $node->text()."\n";
    });

More Information
----------------

Read the documentation of the `BrowserKit`_ and `DomCrawler`_ Symfony
Components for more information about what you can do with Goutte.

Pronunciation
-------------

Goutte is pronounced ``goot`` i.e. it rhymes with ``boot`` and not ``out``.

Technical Information
---------------------

Goutte is a thin wrapper around the following fine PHP libraries:

* Symfony Components: `BrowserKit`_, `CssSelector`_ and `DomCrawler`_;

* `Guzzle`_ HTTP Component.

License
-------

Goutte is licensed under the MIT license.

.. _`Composer`: https://getcomposer.org
.. _`Guzzle`: http://docs.guzzlephp.org
.. _`BrowserKit`: https://symfony.com/components/BrowserKit
.. _`DomCrawler`: https://symfony.com/doc/current/components/dom_crawler.html
.. _`CssSelector`: https://symfony.com/doc/current/components/css_selector.html
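
As a supplement to the "Extract data" step under Usage, here is a minimal
offline sketch of what ``filter('h2 > a')->each(...)`` does, run against an
inline HTML string instead of a live request. The HTML fragment and its URLs
are invented for illustration, and the ``vendor/autoload.php`` path assumes a
default Composer install in the current directory. It uses the same
``Symfony\Component\DomCrawler\Crawler`` class that ``request()`` returns.

.. code-block:: php

    <?php
    // Assumption: Composer's autoloader lives next to this script.
    require __DIR__.'/vendor/autoload.php';

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical markup standing in for a fetched blog page.
    $html = <<<'HTML'
    <html><body>
      <h2><a href="/blog/first">First post</a></h2>
      <h2><a href="/blog/second">Second post</a></h2>
      <p><a href="/about">About</a></p>
    </body></html>
    HTML;

    $crawler = new Crawler($html);

    // each() collects the closure's return values into an array,
    // one entry per matched node.
    $posts = $crawler->filter('h2 > a')->each(function (Crawler $node) {
        return array(
            'title' => $node->text(),
            'href'  => $node->attr('href'),
        );
    });

    foreach ($posts as $post) {
        print $post['title'].' -> '.$post['href']."\n";
    }

Only the two links that are direct children of an ``h2`` match the selector;
the ``/about`` link inside the paragraph is skipped. Returning values from the
closure, rather than printing inside it, keeps the extracted data available
for later processing.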