Getting the list of linked URIs with PHP

Page information

Created

2003-12-20

Last updated

2003-12-20

Reference URI

http://www.arielworks.net/articles/2003/1220b

Category

PHP

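Easy 404 checking, part 2

Now that we can send HTTP HEAD requests, the next step is to build the list of URIs to check.

Source

To keep it general-purpose, we write a function that opens the specified URI and returns the list of URIs linked from within that file.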

//-------------------------------------------------------------------------
// array get_uri_list( string URI [, bool option] )
// If the URI points to an HTML file, returns the list of URIs linked from
// within that file. When option is TRUE, the fragment identifier (the "#"
// and everything after it) is removed from each returned URI. When it is
// FALSE, the URIs are returned unchanged.
// For get_http_header(), see http://www.arielworks.net/articles/2003/1220a
//-------------------------------------------------------------------------
function get_uri_list( $target , $option = TRUE ) {

    $contents = '';

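    // Use get_http_header() to obtain the Content-Type.
    // Anything other than text/html is not parsed. isset() guards against
    // a response that carries no Content-Type header at all.
    $info_tgt = get_http_header( $target );
    if( ! isset( $info_tgt['Content-Type'] )
        || ! preg_match( '/^text\/html/i', $info_tgt['Content-Type'] ) ) {
        return array();
    }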

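    // Read the specified URI and store its contents.
    if( $handle = @fopen( $target, 'r' ) ) {
        // Compare against FALSE explicitly so that a final line consisting
        // of "0" is not mistaken for the end of input.
        while( ( $line = fgets( $handle, 4096 ) ) !== FALSE ) {
            $contents .= $line;
        }
        fclose( $handle );
    }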

    $temp_list = array();
    // Extract the URI parts with a regular expression. Note that only URIs
    // linked inside tags are extracted. Attributes such as code are rarely
    // used, so they are ignored here.
    preg_match_all(
        '/<.+? (src|href|cite)=("|\')?(.*?)(("|\'| ).*?>|>)/is',
        $contents, $temp_list, PREG_PATTERN_ORDER );

    $list = array();
    foreach( $temp_list[3] as $temp_uri ) {

        $temp_uri_info = parse_url( $temp_uri );

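        // A URI with a scheme is absolute, so use it as is.
        // (!empty() avoids a notice when parse_url() finds no scheme.)
        if( ! empty( $temp_uri_info['scheme'] ) ) {

            // Strip the fragment identifier.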
            if( $option && strrpos( $temp_uri, '#' ) !== FALSE ) {
                $temp_uri =
                    substr( $temp_uri, 0, strrpos( $temp_uri, '#' ) );
            }

            $list[] = $temp_uri;

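        // A leading slash means a path relative to the host root.
        } elseif( substr( $temp_uri, 0, 1 ) == '/' ) {

            // Only the scheme and host are needed, so split the URI being
            // parsed into its components
            $temp_target_info = parse_url( $target );

            // and join them back together with the path.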
            $temp_uri = $temp_target_info['scheme'] . '://'
                . $temp_target_info['host'] . $temp_uri;

            if( $option && strrpos( $temp_uri, '#' ) !== FALSE ) {
                $temp_uri =
                    substr( $temp_uri, 0, strrpos( $temp_uri, '#' ) );
            }

            $list[] = $temp_uri;

        // Anything else is a relative path.
        } else {

            $temp_u = array();

            $over = 0;

            // Split on slashes, then rebuild while handling ".." and ".".
            foreach( explode( '/', $temp_uri ) as $temp_tu ) {

                if( $temp_tu == '..' ) {
                    if( count( $temp_u ) == 0 ) {
                        $over++; // count levels that climb above this path
                    } else {
                        array_pop( $temp_u );
                    }
                } elseif( $temp_tu == '.' ) {
                    // "." is the current directory, so there is nothing to do.
                } else {
                    $temp_u[] = $temp_tu;
                }

            }
            $temp_u = implode( '/', $temp_u );

            $temp_t = explode( '/', $target );
            array_pop( $temp_t ); // drop the file name
            for( $i = 0; $i < $over; $i++ ) {
                array_pop( $temp_t );
            }
            $temp_t = implode( '/', $temp_t );

            $temp_uri = $temp_t . '/' .  $temp_u;

            if( $option && strrpos( $temp_uri, '#' ) !== FALSE ) {
                $temp_uri =
                    substr( $temp_uri, 0, strrpos( $temp_uri, '#' ) );
            }

            $list[] = $temp_uri;

        }

    }

    return $list;

}
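
To see what the extraction step produces, the regular expression from the function above can be run on its own against a literal string. This is only a minimal sketch: the HTML snippet and the example.com URIs are made-up placeholders, and $attrs and $uris are names chosen here for illustration.

```php
<?php
// Hypothetical sample markup; the URIs are placeholders, not real pages.
$html = '<p><a href="http://example.com/a.html#top">A</a> '
      . '<img src=\'/img/b.png\' alt="b"></p>'
      . '<blockquote cite=../q.html>q</blockquote>';

// Same pattern as in get_uri_list(): group 1 is the attribute name,
// group 3 is the attribute value (quoted or unquoted).
$matches = array();
preg_match_all(
    '/<.+? (src|href|cite)=("|\')?(.*?)(("|\'| ).*?>|>)/is',
    $html, $matches, PREG_PATTERN_ORDER );

$attrs = $matches[1]; // attribute names, in document order
$uris  = $matches[3]; // raw URI values, still unresolved

print_r( $attrs );
print_r( $uris );
```

The values in $uris are exactly as written in the markup (absolute, host-relative, or relative), which is why the function goes on to normalize each one against the target URI.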

With this in place, calling get_uri_list() returns an array containing the URIs linked from the file.
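
The relative-path branch is the least obvious part of the function, so it may help to try it in isolation. The sketch below restates that logic as a standalone helper; resolve_relative() is a hypothetical name introduced here, not part of the original script, and the example.com URIs are placeholders.

```php
<?php
// resolve_relative() is a hypothetical helper that mirrors the
// relative-path branch of get_uri_list(); it is not the article's code.
function resolve_relative( $base, $relative ) {
    $kept = array(); // path segments that survive "." and ".." handling
    $over = 0;       // number of ".." that climb above the relative path

    foreach( explode( '/', $relative ) as $segment ) {
        if( $segment === '..' ) {
            if( count( $kept ) == 0 ) {
                $over++;
            } else {
                array_pop( $kept );
            }
        } elseif( $segment !== '.' ) {
            $kept[] = $segment;
        }
    }

    $base_parts = explode( '/', $base );
    array_pop( $base_parts ); // drop the file name of the base URI
    for( $i = 0; $i < $over; $i++ ) {
        array_pop( $base_parts ); // climb one directory per leftover ".."
    }

    return implode( '/', $base_parts ) . '/' . implode( '/', $kept );
}

$a = resolve_relative( 'http://example.com/dir/sub/page.html', '../img/./a.png' );
$b = resolve_relative( 'http://example.com/dir/page.html', 'other.html' );
$c = resolve_relative( 'http://example.com/dir/page.html', 'a/../b.html' );

echo $a, "\n", $b, "\n", $c, "\n";
```

Like the branch it mirrors, this sketch does not guard against a relative path with more ".." segments than the base URI has directories, so such input would eat into the host part.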

For the get_http_header() function used in the script, see the earlier article on making HTTP requests with PHP (http://www.arielworks.net/articles/2003/1220a).

For contact details, linking, and reproduction or copying, please see the "Site Guide".