Loading graph data

The library supports loading graphs from multiple file formats. Support for more formats is planned for upcoming releases.

Graph loading API

The main graph loading object is LoadGraph.

It takes an implementation of GraphLoader and lets you easily configure the loading process. Configuration parameters (Parameter) are set with the using(parameter: Parameter) method. Parameters are specific to each GraphLoader.
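The from / using / load chain described above is a builder pattern. The sketch below models that pattern in plain Scala, assuming nothing about the library's internals: every type here (the local Parameter, Delimiter, Quotation, GraphLoader, Loading, EchoLoader) is a hypothetical stand-in defined for illustration, not the library's actual implementation.

```scala
object LoaderSketch {
  // Hypothetical stand-ins; the names mirror the text only for readability.
  trait Parameter
  final case class Delimiter(value: String) extends Parameter
  final case class Quotation(value: String) extends Parameter

  // A loader receives the accumulated parameters when load() is finally called.
  trait GraphLoader[T] {
    def load(parameters: List[Parameter]): T
  }

  // Immutable builder: each using(...) call returns a new builder with one more parameter.
  final case class Loading[T](loader: GraphLoader[T], parameters: List[Parameter] = Nil) {
    def using(parameter: Parameter): Loading[T] = copy(parameters = parameters :+ parameter)
    def load(): T = loader.load(parameters)
  }

  def from[T](loader: GraphLoader[T]): Loading[T] = Loading(loader)

  // A toy loader that simply reports which parameters it was given.
  object EchoLoader extends GraphLoader[List[Parameter]] {
    def load(parameters: List[Parameter]): List[Parameter] = parameters
  }

  val received: List[Parameter] =
    from(EchoLoader).using(Delimiter(";")).using(Quotation("'")).load()
}
```

Because each using call returns a new builder, a partially configured loading object can be reused as a starting point for several differently configured loads.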

Loading from CSV

To load a graph from a CSV file, use the CSV implementation of the GraphLoader trait:

import ml.sparkling.graph.api.loaders.GraphLoading.LoadGraph
import ml.sparkling.graph.loaders.csv.GraphFromCsv.CSV
import org.apache.spark.SparkContext

implicit val ctx: SparkContext = ???
// Initialize your SparkContext as an implicit value so that it is passed automatically to the graph loading API

val filePath="your_graph_path.csv"

val graph=LoadGraph.from(CSV(filePath)).load()

This is the simplest way to load a standard CSV file:

"vertex1","vertex2"
"<numerical_id_of_vertex_1>","<numerical_id_of_vertex_2>"

To change the expected file format, use parameters such as:

import ml.sparkling.graph.loaders.csv.GraphFromCsv.LoaderParameters.{Delimiter,Quotation}
import ml.sparkling.graph.api.loaders.GraphLoading.LoadGraph
import ml.sparkling.graph.loaders.csv.GraphFromCsv.CSV
import org.apache.spark.SparkContext

implicit val ctx: SparkContext = ???
// Initialize your SparkContext as an implicit value so that it is passed automatically to the graph loading API

val filePath="your_graph_path.csv"
val graph=LoadGraph.from(CSV(filePath)).using(Delimiter(";")).using(Quotation("'")).load()

This snippet loads a graph from a file with the following format:

'vertex1';'vertex2'
'<numerical_id_of_vertex_1>';'<numerical_id_of_vertex_2>'
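To make the roles of the delimiter and quotation parameters concrete, here is a minimal sketch of how one line in this format could be turned into an edge. It is plain Scala and does not use the library; the helper names unquote and parseEdge are hypothetical, and the naive split does not handle a delimiter that appears inside a quoted field.

```scala
object CsvLineSketch {
  import java.util.regex.Pattern
  import scala.util.Try

  // Strip one pair of surrounding quotation marks, if both are present.
  def unquote(field: String, quotation: String): String = {
    val trimmed = field.trim
    val quoted = trimmed.length >= 2 * quotation.length &&
      trimmed.startsWith(quotation) && trimmed.endsWith(quotation)
    if (quoted) trimmed.substring(quotation.length, trimmed.length - quotation.length)
    else trimmed
  }

  // Split on the delimiter (treated literally, not as a regex) and read
  // the first two fields as numerical vertex ids. Lines whose fields are
  // not numerical, such as a header line, yield None.
  def parseEdge(line: String, delimiter: String, quotation: String): Option[(Long, Long)] =
    line.split(Pattern.quote(delimiter), -1).map(unquote(_, quotation)) match {
      case Array(src, dst, _*) =>
        for {
          s <- Try(src.toLong).toOption
          d <- Try(dst.toLong).toOption
        } yield (s, d)
      case _ => None
    }
}
```

For example, CsvLineSketch.parseEdge("'1';'2'", ";", "'") produces an edge between vertices 1 and 2, while the header line of the file above produces no edge.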

Loading graphs with vertex identifiers that are not numerical

In some cases vertex identifiers are not numerical (for example, usernames stored as strings). You can load this kind of graph by specifying that Indexing is required:

import ml.sparkling.graph.api.loaders.GraphLoading.LoadGraph
import ml.sparkling.graph.loaders.csv.GraphFromCsv.CSV
import ml.sparkling.graph.loaders.csv.GraphFromCsv.LoaderParameters.Indexing
import org.apache.spark.SparkContext

implicit val ctx: SparkContext = ???
// Initialize your SparkContext as an implicit value so that it is passed automatically to the graph loading API

val filePath="your_graph_path.csv"

val graph=LoadGraph.from(CSV(filePath)).using(Indexing).load()

This approach lets you load graphs from CSV files with any structure and with vertex identifiers of any type. For example:

"vertex1","vertex2"
"centralized","computation"
"is","lame"
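The idea behind indexing can be sketched in plain Scala: every distinct identifier is assigned a numerical id, and edges are rewritten in terms of those ids. This only illustrates the concept on the example file above; how the library assigns ids internally is not described here, and the object and function names below are hypothetical.

```scala
object IndexingSketch {
  // Assign a numerical id to every distinct vertex name, in order of first
  // appearance, and rewrite the edges using those ids.
  def index(edges: Seq[(String, String)]): (Map[String, Long], Seq[(Long, Long)]) = {
    val names = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
    val ids = names.zipWithIndex.map { case (name, i) => name -> i.toLong }.toMap
    (ids, edges.map { case (a, b) => (ids(a), ids(b)) })
  }

  // The two data rows of the example file.
  val (ids, indexedEdges) = index(Seq("centralized" -> "computation", "is" -> "lame"))
}
```

Keeping the name-to-id map around makes it possible to translate results computed on numerical ids back to the original identifiers.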

The CSV loading parameters are defined in GraphFromCsv.LoaderParameters.

Loading from GraphML

To load a graph from a GraphML XML file, use the GraphML implementation of the GraphLoader trait:

import ml.sparkling.graph.api.loaders.GraphLoading.LoadGraph
import ml.sparkling.graph.loaders.graphml.GraphFromGraphML.GraphML
import org.apache.spark.SparkContext

implicit val ctx: SparkContext = ???
// Initialize your SparkContext as an implicit value so that it is passed automatically to the graph loading API

val filePath="your_graph_path.xml"

val graph=LoadGraph.from(GraphML(filePath)).load()

This is the simplest way to load a standard GraphML XML file (vertices are indexed automatically and receive a VertexId identifier):

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
    <key id="v_name" for="node" attr.name="name" attr.type="string"/>
    <key id="v_type" for="node" attr.name="type" attr.type="string"/>
    <graph id="G" edgedefault="undirected">
        <node id="n0">
            <data key="v_name">name0</data>
            <data key="v_type">type0</data>
        </node>
        <node id="n1">
            <data key="v_name">name1</data>
        </node>
        <node id="n2">
            <data key="v_name">name2</data>
        </node>
        <node id="n3">
            <data key="v_name">name3</data>
        </node>
        <edge id="e1" source="n0" target="n1"/>
        <edge id="e2" source="n1" target="n2"/>
    </graph>
</graphml>

All attributes associated with vertices are put into the GraphProperties type, which expands to Map[String,Any]. By default, each edge and vertex has an id attribute.

import ml.sparkling.graph.api.loaders.GraphLoading.LoadGraph
import ml.sparkling.graph.loaders.graphml.GraphFromGraphML.{GraphProperties, GraphML}
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph

implicit val ctx: SparkContext = ???
// Initialize your SparkContext as an implicit value so that it is passed automatically to the graph loading API

val filePath="your_graph_path.xml"

val graph: Graph[GraphProperties, GraphProperties] = LoadGraph.from(GraphML(filePath)).load()
val verticesIdsFromFile: Array[String] = graph.vertices.map(_._2("id").asInstanceOf[String]).collect()
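Since the values in a Map[String,Any] are untyped, and not every vertex in a file carries every attribute, it can be safer to read attributes through Option rather than with a direct lookup and cast. The sketch below is plain Scala and does not use the library: GraphProperties is a local alias mirroring the expansion stated above, the helper stringAttribute is hypothetical, and the "name" and "type" keys are assumptions about how attributes might be keyed.

```scala
object PropertiesSketch {
  // Local alias mirroring the expansion given in the text.
  type GraphProperties = Map[String, Any]

  // Read a string attribute; returns None if the key is missing
  // or the stored value is not a String.
  def stringAttribute(properties: GraphProperties, key: String): Option[String] =
    properties.get(key).collect { case s: String => s }

  // Hypothetical attribute maps for two vertices; only "id" is confirmed
  // by the text, the other keys are assumptions for illustration.
  val n0: GraphProperties = Map("id" -> "n0", "name" -> "name0", "type" -> "type0")
  val n1: GraphProperties = Map("id" -> "n1", "name" -> "name1")

  val n0Type: Option[String] = stringAttribute(n0, "type")
  val n1Type: Option[String] = stringAttribute(n1, "type")
}
```

Here n1Type is None because the second vertex has no such attribute, whereas a direct lookup followed by asInstanceOf would throw for the missing key.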